
11.2 Freely Available awks

There are three versions of awk whose source code is freely available. They are the Bell Labs awk, GNU awk, and mawk, by Michael Brennan. This section discusses the extensions that are common to two or more of them, and then looks at each version in detail and describes how to obtain it.

11.2.1 Common Extensions

This section discusses extensions to the awk language that are available in two or more of the freely available awks.[2]

[2] As the maintainer of gawk and the author of many of the extensions described here and in the section below on gawk, my opinion about the usefulness of these extensions may be biased. :-) You should make your own evaluation. [A.R.]

11.2.1.1 Deleting all elements of an array

All three free awks extend the delete statement, making it possible to delete all the elements of an array at one time. The syntax is:

delete array

Normally, to delete every element from an array, you have to use a loop, like this:

for (i in data)
	delete data[i]

With the extended version of the delete statement, you can simply use:

delete data

This is particularly useful for arrays with lots of subscripts; this version is considerably faster than the one using a loop.

Even though it no longer has any elements, you cannot use the array name as a simple variable. Once an array, always an array.

This extension appeared first in gawk, then in mawk and the Bell Labs awk.
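
As a quick check of the behavior described above, the following sketch fills an array with split(), clears it with a single delete statement, and counts what is left. The array name data and the sample string are only illustrative, and the command assumes that awk on your system is one of the three free awks.

```shell
# Fill an array, delete it in one statement, then count the survivors.
result=$(awk 'BEGIN {
    before = split("alpha beta gamma", data, " ")   # creates data[1] .. data[3]
    delete data                                     # extension: removes every element
    after = 0
    for (i in data)                                 # the loop body never runs
        after++
    print before, after
}')
echo "$result"    # prints "3 0"
```

The for loop is there only to prove that the array is empty; data is still an array afterward and can be refilled.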

11.2.1.2 Obtaining individual characters

All three awks extend field splitting and array splitting as follows. If the value of FS is the empty string, then each character of the input record becomes a separate field. This greatly simplifies cases where it's necessary to work with individual characters.

Similarly, if the third argument to the split() function is the empty string, each character in the original string will become a separate element of the target array.

Without these extensions, you have to use repeated calls to the substr() function to obtain individual characters.

This extension appeared first in mawk, then in gawk and the Bell Labs awk.
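
Both forms of character splitting can be tried from the command line. This is a minimal sketch, assuming an awk that implements the extension; the sample strings "awk" and "sed" and the variable names are arbitrary.

```shell
# Empty third argument to split(): one array element per character.
letters=$(awk 'BEGIN {
    n = split("awk", ch, "")
    out = n
    for (i = 1; i <= n; i++)
        out = out " " ch[i]
    print out
}')
echo "$letters"    # prints "3 a w k"

# Empty FS: each character of the record becomes a field.
fields=$(echo "sed" | awk 'BEGIN { FS = "" } { print NF, $1, $NF }')
echo "$fields"     # prints "3 s d"
```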

11.2.1.3 Flushing buffered output

The 1993 version of the Bell Labs awk introduced a new function that is not in the POSIX standard, fflush(). Like close(), the argument to fflush() is the name of an open file or pipe. Unlike close(), the fflush() function only works on output files and pipes.

Most programs buffer their output, storing data to be written to a file or pipe in an internal chunk of memory until there's enough to send on to the destination. Occasionally, it's useful for the programmer to be able to explicitly flush the buffer, that is, force all buffered data to actually be delivered. This is the purpose of the fflush() function.

This function appeared first in the Bell Labs awk, then in gawk and mawk.
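
One place where flushing matters is when another program needs to read a file that awk still has open for writing. The sketch below is only an illustration: the temporary file comes from mktemp, and the awk variable name out is hypothetical.

```shell
# Write to a file, flush it, and let an external command read it back.
tmpfile=$(mktemp)
seen=$(awk -v out="$tmpfile" 'BEGIN {
    print "written by awk" > out
    fflush(out)            # push the buffered line out to the file now
    system("cat " out)     # cat can read the line although out is still open
    close(out)
}')
rm -f "$tmpfile"
echo "$seen"    # prints "written by awk"
```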

11.2.1.4 Special filenames

With any version of awk, you can write directly to the special UNIX file, /dev/tty, that is a name for the user's terminal. This can be used to direct prompts or messages to the user's attention when the output of the program is directed to a file:

printf "Enter your name:" >"/dev/tty"

This prints "Enter your name:" directly on the terminal, no matter where the standard output and the standard error are directed.

The three free awks support several special filenames, as listed in Table 11.4.

Table 11.4: Special Filenames
Filename	Description
/dev/stdin	Standard input (not mawk)[3]
/dev/stdout	Standard output
/dev/stderr	Standard error

[3] The mawk manpage recommends using "-" for the standard input, which is most portable.

Note that a special filename, like any filename, must be quoted when specified as a string constant.

The /dev/stdin, /dev/stdout, and /dev/stderr special files originated in V8 UNIX. Gawk was the first to build in special recognition of these files, followed by mawk and the Bell Labs awk.

Error messages inform users about problems often related to missing or incorrect input. You can simply inform the user with a print statement. However, if the output of the program is redirected to a file, the user won't see it. Therefore, it is good practice to specify explicitly that the error message be sent to the terminal.

The following printerr() function helps to create consistent user error messages. It prints the word "ERROR" followed by a supplied message, the record number, and the current record. The following example directs output to /dev/tty:

function printerr (message) {
	# print message, record number and record
	printf("ERROR:%s (%d) %s\n", message, NR, $0) > "/dev/tty"
}

If the output of the program is sent to the terminal screen, then error messages will be mixed in with the output. Outputting "ERROR" will help the user recognize error messages.

In UNIX, the standard destination for error messages is standard error. The rationale for writing to standard error is the same as above. To write to standard error explicitly, you must use the convoluted syntax "cat 1>&2" as in the following example:

print "ERROR" | "cat 1>&2"

This directs the output of the print statement to a pipe which executes the cat command. You can also use the system() function to execute a UNIX command such as cat or echo and direct its output to standard error.

When the special file /dev/stderr is available, this gets much simpler:

print "ERROR" > "/dev/stderr"  # recent awks only
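
Putting the two ideas together, printerr() can write to /dev/stderr instead of /dev/tty. The following sketch assumes an awk that recognizes /dev/stderr; the two-field input check and the sample data are invented for the illustration. Standard output is thrown away to show that the error message still gets through.

```shell
# Send diagnostics to standard error so they survive output redirection.
errors=$(printf 'widget 4\nbroken\n' | awk '
function printerr(message) {
    # same layout as the /dev/tty version, but written to standard error
    printf("ERROR:%s (%d) %s\n", message, NR, $0) > "/dev/stderr"
}
NF != 2 { printerr("expected two fields"); next }
{ print $2 }
' 2>&1 >/dev/null)
echo "$errors"    # prints "ERROR:expected two fields (2) broken"
```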

11.2.1.5 The nextfile statement

The nextfile statement is similar to next, but it operates at a higher level. When nextfile is executed, the current data file is abandoned, and processing starts over at the top of the script, using the first record of the following file. This is useful when you know that you only need to process part of a file; there's no need to then set up a loop to skip records using next.

The nextfile statement originated in gawk, and then was added to the Bell Labs awk. It will be available in mawk, starting with version 1.4.
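
A typical use is to look only at the first record of each file. This sketch builds two small files (hypothetical names a.txt and b.txt in a temporary directory) and assumes an awk that supports nextfile; an older mawk may not.

```shell
# Print the first line of each input file, then skip the rest of that file.
workdir=$(mktemp -d)
printf 'title A\nbody A1\nbody A2\n' > "$workdir/a.txt"
printf 'title B\nbody B1\n'          > "$workdir/b.txt"
firsts=$(awk '{ print; nextfile }' "$workdir/a.txt" "$workdir/b.txt")
rm -rf "$workdir"
echo "$firsts"    # prints "title A" and "title B" on separate lines
```

Without nextfile, the same effect needs a test such as FNR == 1 on every record, so awk still reads each file to the end.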

11.2.1.6 Regular expression record separators (gawk and mawk)

Gawk and mawk allow RS to be a full regular expression, not just a single character. In that case, the records are separated by the longest text in the input that matches the regular expression. Gawk also sets RT (the record terminator) to the actual input text that matched RS. An example of this is given below.

The ability to have RS be a regular expression first appeared in mawk, and was later added to gawk.
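
For a first impression of a regular expression record separator, consider input in which records are separated by runs of digits. This is a sketch, assuming that awk is gawk or mawk (or another awk that accepts a regular expression for RS); the sample text is made up.

```shell
# Any run of digits ends a record.
records=$(printf 'alpha12beta7gamma' |
    awk 'BEGIN { RS = "[0-9]+" } { print NR ": " $0 }')
echo "$records"    # prints "1: alpha", "2: beta", "3: gamma" on separate lines
```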

11.2.2 Bell Labs awk

The Bell Labs awk is, of course, the direct descendant of the original V7 awk, and of the "new" awk that first became available with System V Release 3.1. Source code is freely available via anonymous FTP to the host netlib.bell-labs.com. It is in the file /netlib/research/awk.bundle.Z. This is a compressed shell archive file. Be sure to use "binary" or "image" mode to transfer the file. This version of awk requires an ANSI C compiler.

There have been several distinct versions; we will identify them here according to the year they became available.

The first version of new awk became available in late 1987. It had almost everything we've described in the previous four chapters (although there are footnotes that indicate those things that are not available). This version is still in use on SunOS 4.1.x systems and some System V Release 3 UNIX systems.

In 1989, for System V Release 4, several new things were added. The only difference between this version and POSIX awk is that POSIX uses CONVFMT for number-to-string conversions, while the 1989 version still used OFMT. The new features were:

Escape characters in command-line assignments were now interpreted.
The tolower() and toupper() functions were added.
printf was improved: dynamic width and precision were added, and the behavior for "%c" was rationalized.
The return value from the srand() function was defined to be the previous seed. (The awk book didn't state what srand() returned.)
It became possible to use regular expressions as simple expressions. For example:
```
if (/cute/ || /sweet/)
	print "potential here!"
```
The -v option was added to allow setting variables on the command line before execution of the BEGIN procedure.
Multiple -f options could now be used to have multiple source files. (This originated in MKS awk, was adopted by gawk, and then added to the Bell Labs awk.)
The ENVIRON array was added. (This was developed independently for both MKS awk and gawk, and then added to the Bell Labs awk.)

In 1993, Brian Kernighan of Bell Labs was able to release the source code to his awk. At this point, CONVFMT became available, and the fflush() function, described above, was added. A bug-fix release was made in August of 1994.

In June of 1996, Brian Kernighan made another release. It can be retrieved either from the FTP site given above, or via a World Wide Web browser from Dr. Kernighan's Web page (http://cm.bell-labs.com/who/bwk), which refers to this version as "the one true awk." :-) This version adds several features that originated in gawk and mawk, described earlier in this chapter in the "Common Extensions" section.

11.2.3 GNU awk (gawk)

The Free Software Foundation GNU project's version of awk, gawk, implements all the features of the POSIX awk, and many more. It is perhaps the most popular of the freely available implementations; gawk is used on Linux systems, as well as various other freely available UNIX-like systems, such as NetBSD and FreeBSD.

Source code for gawk is available via anonymous FTP[4] to the host ftp.gnu.ai.mit.edu. It is in the file /pub/gnu/gawk-3.0.3.tar.gz (there may be a later version there by the time you read this). This is a tar file compressed with the gzip program, whose source code is available in the same directory. There are many sites worldwide that "mirror" the files from the main GNU distribution site; if you know of one close to you, you should get the files from there. Be sure to use "binary" or "image" mode to transfer the file(s).

[4] If you don't have Internet access and wish to get a copy of gawk, contact the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 U.S.A. The telephone number is 617-542-5942, and the fax number is 617-542-2652.

Besides the common extensions listed earlier, gawk has a number of additional features. We examine them in this section.

11.2.3.1 Command line options

Gawk has several very useful command-line options. Like most GNU programs, these options are spelled out and begin with two dashes, "--".

--lint and --lint-old cause gawk to check your program, both at parse-time and at run-time, for constructs that are dubious or nonportable to other versions of awk. The --lint-old option warns about function calls that are not portable to the original version of awk. It is separate from --lint, since most systems now have some version of new awk.
--traditional disables GNU-specific extensions, such as the time functions and gensub() (see below). With this option, gawk is intended to behave the same as the Bell Labs awk.
--re-interval enables full POSIX regular expression matching, by allowing gawk to recognize interval expressions (such as "/stuff{1,3}/").
--posix disables all extensions that are not specified in the POSIX standard. This option also turns on recognition of interval expressions.

There are a number of other options that are less important for everyday programming and script portability; see the gawk documentation for details.

Although POSIX awk allows you to have multiple instances of the -f option, there is no easy way to use library functions from a command-line program. The --source option in gawk makes this possible.

gawk --source 'script' -f mylibs.awk file1 file2

This example runs the program in script, which can use awk functions from the file mylibs.awk. The input data comes from file1 and file2.
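
Here is a concrete version of that command line, offered as a sketch. The library file is created on the spot in a temporary directory, and the function double() is hypothetical.

```shell
# Combine a one-line command-line program with a function library.
libdir=$(mktemp -d)
cat > "$libdir/mylibs.awk" <<'EOF'
# hypothetical library function
function double(x) { return 2 * x }
EOF
answer=$(gawk --source 'BEGIN { print double(21) }' -f "$libdir/mylibs.awk")
rm -rf "$libdir"
echo "$answer"    # prints "42"
```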

11.2.3.2 An awk program search path

Gawk allows you to specify an environment variable named AWKPATH that defines a search path for awk program files. By default, it is defined to be .:/usr/local/share/awk. Thus, when a filename is specified with the -f option, the two default directories will be searched, beginning with the current directory. Note that if the filename contains a "/", then no search is performed.

For example, if mylibs.awk were a file of awk functions in /usr/local/share/awk, and myprog.awk were a program in the current directory, we would run gawk like this:

gawk -f myprog.awk -f mylibs.awk datafile1

Gawk would find each file in the appropriate place. This makes it much easier to have and use awk library functions.
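
You can also set AWKPATH yourself. In the sketch below, a temporary directory stands in for a personal library directory, and greet() is a hypothetical function. Notice that -f is given a bare filename, so gawk has to search for it.

```shell
# Let gawk locate a library through AWKPATH instead of a full pathname.
libdir=$(mktemp -d)
printf 'function greet(name) { return "hello, " name }\n' > "$libdir/mylibs.awk"
greeting=$(AWKPATH=".:$libdir" gawk -f mylibs.awk \
    --source 'BEGIN { print greet("world") }')
rm -rf "$libdir"
echo "$greeting"    # prints "hello, world"
```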

11.2.3.3 Line continuation

Gawk allows you to break lines after either a "?" or ":". You can also continue strings across newlines using a backslash.

$ gawk 'BEGIN { print "hello, \
> world" }'
hello, world
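
Breaking a line after "?" or ":" is mostly useful for long conditional expressions. A small sketch (the variable names are arbitrary, and the line breaks are accepted only when gawk is not in POSIX mode):

```shell
# A conditional expression spread over three lines.
parity=$(gawk 'BEGIN {
    n = 7
    kind = (n % 2 == 0) ?
        "even" :
        "odd"
    print n, "is", kind
}')
echo "$parity"    # prints "7 is odd"
```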

11.2.3.4 Extended regular expressions

Gawk provides several additional regular expression operators. These are common to most GNU programs that work with regular expressions. The extended operators are listed in Table 11.5.

Table 11.5: Gawk Extended Regular Expressions
Special Operators	Usage
\w	Matches any word-constituent character (a letter, digit, or underscore).
\W	Matches any character that is not word-constituent.
\<	Matches the empty string at the beginning of a word.
\>	Matches the empty string at the end of a word.
\y	Matches the empty string at either the beginning or end of a word (the word boundary). Other GNU software uses "\b", but that was already taken.
\B	Matches the empty string within a word.
\`	Matches the empty string at the beginning of a buffer. In awk, a buffer is the same as a string, so this operator is equivalent to ^. It is provided for compatibility with GNU Emacs and other GNU software.
\'	Matches the empty string at the end of a buffer. In awk, a buffer is the same as a string, so this operator is equivalent to $. It is provided for compatibility with GNU Emacs and other GNU software.

You can think of "\w" as a shorthand for the (POSIX) notation [[:alnum:]_] and "\W" as a shorthand for [^[:alnum:]_]. The following table gives examples of what the middle four operators match, borrowed from Effective AWK Programming.

Table 11.6: Examples of gawk Extended Regular Expression Operators
Expression	Matches	Does Not Match
\<away	away	stowaway
stow\>	stow	stowaway
\yballs?\y	ball or balls	ballroom or baseball
\Brat\B	crate	dirty rat
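
To see the word operators at work, the following sketch marks up one line of text with gsub(). The sample sentence and the bracket characters are chosen only for the illustration.

```shell
# Mark "rat" as a whole word with [ ] and "rat" inside a longer word with < >.
marked=$(echo "a rat ate the crate" | gawk '{
    gsub(/\<rat\>/, "[&]")      # matches only the standalone word
    gsub(/\Brat\B/, "<&>")      # matches only inside a longer word
    print
}')
echo "$marked"    # prints "a [rat] ate the c<rat>e"
```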

11.2.3.5 Regular expression record terminators

Besides allowing RS to be a regular expression, gawk sets the variable RT (record terminator) to the actual input text that matched the value of RS.

Here is a simple example, due to Michael Brennan, that shows the power of gawk's RS and RT variables. As we have seen, one of the most common uses of sed is its substitute command (s/old/new/g). By setting RS to the pattern to match, and ORS to the replacement text, a simple print statement can print the unchanged text followed by the replacement text.

$ cat simplesed.awk
# simplesed.awk --- do s/old/new/g using just print
#    Thanks to Michael Brennan for the idea
#
# NOTE! RS and ORS must be set on the command line
{
    if (RT == "")
        printf "%s", $0
    else
        print
}

There is one wrinkle; at end of file, RT will be empty, so we use a printf statement to print the record.[5] We could run the program like this.

[5] See Effective AWK Programming [Robbins], Section 16.2.8, for an elaborate version of this program.

$ cat simplesed.data
"This OLD house" is a great show.
I like shopping for old things at garage sales.
$ gawk -f simplesed.awk RS="old|OLD" ORS="brand new" simplesed.data
"This brand new house" is a great show.
I like shopping for brand new things at garage sales.

11.2.3.6 Separating fields

Besides the regular way that awk lets you split the input into records and the record into fields, gawk gives you some additional capabilities.

First, as mentioned above, if the value of FS is the empty string, then each character of the input record becomes a separate field.

Second, the special variable FIELDWIDTHS can be used to split out data that occurs in fixed-width columns. Such data may or may not have whitespace separating the values of the fields.

FIELDWIDTHS = "5 6 8 3"

Here, the record has four fields: $1 is five characters wide, $2 is six characters wide, and so on. Assigning a value to FIELDWIDTHS causes gawk to start using it for field splitting. Assigning a value to FS causes gawk to return to the regular field splitting mechanism. Use FS = FS to make this happen without having to save the value of FS in an extra variable.

This facility would be of most use when working with fixed-width field data, where there may not be any whitespace separating fields, or when intermediate fields may be all blank.
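
The sketch below shows both situations. The data is invented: the shell's printf pads three columns to widths of 5, 6, and 4 characters, so the first two columns run together and the middle column of the second record is entirely blank.

```shell
# Split fixed-width columns that have no reliable separator.
columns=$({
    printf '%-5s%-6s%4s\n' Smith Anna 4207
    printf '%-5s%-6s%4s\n' Lee   ""   0013
} | gawk 'BEGIN { FIELDWIDTHS = "5 6 4" }
          { printf "[%s][%s][%s]\n", $1, $2, $3 }')
echo "$columns"
```

Each field keeps its padding, so the second record prints a $2 that consists of six blanks. With ordinary whitespace splitting, 0013 would have become the second field of that record.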

11.2.3.7 Additional special files

Gawk has a number of additional special filenames that it interprets internally. All of the special filenames are listed in Table 11.7.

Table 11.7: Gawk's Special Filenames
Filename	Description
/dev/stdin	Standard input.
/dev/stdout	Standard output.
/dev/stderr	Standard error.
/dev/fd/n	The file referenced as file descriptor n.
Obsolete Filename	Description
/dev/pid	Returns a record containing the process ID number.
/dev/ppid	Returns a record containing the parent process ID number.
/dev/pgrpid	Returns a record containing the process group ID number.
/dev/user	Returns a record with the real and effective user IDs, the real and effective group IDs, and if available, any secondary group IDs.

The first three were described earlier. The fourth filename provides access to any open file descriptor that may have been inherited from gawk's parent process (usually the shell). You can use file descriptor 0 for standard input, 1 for standard output, and 2 for standard error.

The second group of special files, labeled "obsolete," have been in gawk for a while, but are being phased out. They will be replaced by a PROCINFO array, whose subscripts name the desired items and whose elements hold the associated values.

For example, you would use PROCINFO["pid"] to get the current process ID, instead of using getline pid < "/dev/pid". Check the gawk documentation to see if PROCINFO is available and if these filenames are still supported.
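
If your gawk does have PROCINFO, a one-line test is enough to see it. This sketch assumes such a version; the shell variable name pid is arbitrary.

```shell
# Ask gawk for its own process ID through the PROCINFO array.
pid=$(gawk 'BEGIN { print PROCINFO["pid"] }')
echo "gawk ran as process $pid"
```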

11.2.3.8 Additional variables

Gawk has several more system variables. They are listed in Table 11.8.

Table 11.8: Additional gawk System Variables
Variable	Description
ARGIND	The index in ARGV of the current input file.
ERRNO	A message describing the error if `getline` or `close()` fails.
FIELDWIDTHS	A space-separated list of numbers describing the widths of the input fields.
IGNORECASE	If non-zero, pattern matches and string comparisons are case-independent.
RT	The value of the input text that matched RS.

We have already seen the record terminator variable, RT, so we'll proceed to the other variables that we haven't covered yet.

All pattern matching and string comparison in awk is case sensitive. Gawk introduced the IGNORECASE variable so that you can specify that regular expressions be interpreted without regard for upper- or lowercase characters. Beginning with version 3.0 of gawk, string comparisons can also be done without case sensitivity.

The default value of IGNORECASE is zero, which means that pattern matching and string comparison are performed the same as in traditional awk. If IGNORECASE is set to a non-zero value, then case distinctions are ignored. This applies to all places where regular expressions are used, including the field separator FS, the record separator RS, and all string comparisons. It does not apply to array subscripting.
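
The following sketch counts regular expression matches and string comparisons separately with IGNORECASE turned on. The three input lines and the counter names are made up for the illustration.

```shell
# With IGNORECASE set, both kinds of test ignore case.
hits=$(printf 'Hello world\nhello again\nHELP\n' | gawk '
BEGIN { IGNORECASE = 1 }
/^hello/       { n_regex++ }     # regular expression match
$1 == "HELLO"  { n_string++ }    # string comparison
END { print n_regex, n_string }')
echo "$hits"    # prints "2 2"
```

The first two lines pass both tests; the third line passes neither, because HELP differs from hello in more than case.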

Two more gawk variables are of interest. ARGIND is set automatically by gawk to be the index in ARGV of the current input file name. This variable gives you a way to track how far along you are in the list of filenames.

Finally, if an error occurs doing a redirection for getline or during a close(), gawk sets ERRNO to a string describing the error. This makes it possible to provide descriptive error messages when something goes wrong.
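
For instance, a failed getline can report why it failed. This is a sketch; the pathname is hypothetical and is chosen so that it will not exist, and the exact wording of the message in ERRNO depends on your system.

```shell
# Report the reason when an input redirection fails.
message=$(gawk 'BEGIN {
    file = "/no/such/dir/input.txt"      # hypothetical path that does not exist
    if ((getline line < file) < 0)
        print "cannot read " file ": " ERRNO
}')
echo "$message"
```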

11.2.3.9 Additional functions

Gawk has one additional string function, and two functions for dealing with the current date and time. They are listed in Table 11.9.

Table 11.9: Additional gawk Functions
Gawk Function	Description
gensub(r, s, h, t)	If h is a string starting with g or G, globally substitutes s for r in t. Otherwise, h is a number: substitutes for the h'th occurrence. Returns the new value; t is unchanged. If t is not supplied, defaults to $0.
systime()	Returns the current time of day in seconds since the Epoch (midnight, January 1, 1970 UTC).
strftime(format, timestamp)	Formats timestamp (of the same form returned by `systime()`) according to format. If no timestamp, use current time. If no format either, use a default format whose output is similar to the `date` command.

11.2.3.10 A general substitution function

The 3.0 version of gawk introduced a new general substitution function, named gensub(). The sub() and gsub() functions have some problems:

You can change either the first occurrence of a pattern or all the occurrences of a pattern. There is no way to change, say, only the third occurrence of a pattern but not the ones before it or after it.
Both sub() and gsub() change the actual target string, which may be undesirable.
It is impossible to get sub() and gsub() to emit a literal backslash followed by the matched text, because an ampersand preceded by a backslash is never replaced.[6]
There is no way to get at parts of the matched text, analogous to the \(...\) construct in sed.

[6] A full discussion is given in Effective AWK Programming [Robbins], Section 12.3. The details are not for the faint of heart.

For all these reasons, gawk introduced the gensub() function. The function takes at least three arguments. The first is a regular expression to search for. The second is the replacement string. The third is a flag that controls how many substitutions should be performed. The fourth argument, if present, is the original string to change. If it is not provided, the current input record ($0) is used.

The pattern can have subpatterns delimited by parentheses. For example, it can have "/(part) (one|two|three)/". Within the replacement string, a backslash followed by a digit represents the text that matched the nth subpattern.

$ echo part two | gawk '{ print gensub(/(part) (one|two|three)/, "\\2", "g") }'
two

The flag is either a string beginning with g or G, in which case the substitution happens globally, or it is a number indicating that the nth occurrence should be replaced.

$ echo a b c a b c a b c | gawk '{ print gensub(/a/, "AA", 2) }'
a b c AA b c a b c

The fourth argument is the string in which to make the change. Unlike sub() and gsub(), the target string is not changed. Instead, the new string is the return value from gensub().

$ gawk '
BEGIN { old = "hello, world"
        new = gensub(/hello/, "goodbye", 1, old)
        printf("<%s>, <%s>\n", old, new)
}'
<hello, world>, <goodbye, world>

11.2.3.11 Time management for programmers

Awk programs are very often used for processing the log files produced by various programs. Often, each record in a log file contains a timestamp, indicating when the record was produced. For both conciseness and precision, the timestamp is written as the result of the UNIX time(2) system call, which is the number of seconds since midnight, January 1, 1970 UTC. (This date is often referred to as "the Epoch.") To make it easier to generate and process log file records with these kinds of timestamps in them, gawk has two functions, systime() and strftime().

The systime() function is primarily intended for generating timestamps to go into log records. Suppose, for example, that we use an awk script to respond to CGI queries to our WWW server. We might log each query to a log file.

{
...
printf("%s:%s:%d\n", User, Host, systime()) >> "/var/log/cgi/querylog"
...
}

Such a record might look like

arnold:some.domain.com:831322007

The strftime() function[7] makes it easy to turn timestamps into human-readable dates. The format string is similar to the one used by sprintf(); it consists of literal text mixed with format specifications for different components of date and time.

[7] This function is patterned after the function of the same name in ANSI C.

$ gawk 'BEGIN { print strftime("Today is %A, %B %d, %Y") }'
Today is Sunday, May 05, 1996
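
When a timestamp is supplied as the second argument, strftime() formats that moment instead of the current time. The sketch below formats the timestamp from the sample log record shown earlier. Setting TZ to UTC0 is an assumption made here so that the result does not depend on the local time zone; without it, the same timestamp prints as local time.

```shell
# Format a stored timestamp rather than the current time.
stamp=$(TZ=UTC0 gawk 'BEGIN { print strftime("%Y-%m-%d %H:%M:%S", 831322007) }')
echo "$stamp"    # prints "1996-05-05 18:46:47"
```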

The list of available formats is quite long. See your local strftime(3) manpage, and the gawk documentation for the full list. Our hypothetical CGI log file might be processed by this program:

# cgiformat --- process CGI logs
# data format is user:host:timestamp
#1
BEGIN {	FS = ":"; SUBSEP = "@" }

#2
{
# make data more obvious
	user = $1; host = $2; time = $3
# store first contact by this user
	if (! ((user, host) in first))
		first[user, host] = time
# count contacts
	count[user, host]++
# save last contact
	last[user, host] = time
}

#3
END {
# print the results
	for (contact in count) {
		i = strftime("%y-%m-%d %H:%M", first[contact])
		j = strftime("%y-%m-%d %H:%M", last[contact])
		printf "%s -> %d times between %s and %s\n",
			contact, count[contact], i, j
	}
}

The first step is to set FS to ":" to split the field correctly. We also use a neat trick and set the subscript separator to "@", so that the arrays become indexed by "user@host" strings.

In the second step, we look to see if this is the first time we've seen this user. If so (they're not in the first array), we add them. Then we increment the count of how many times they've connected. Finally we store this record's timestamp in the last array. This element keeps getting overwritten each time we see a new connection by the user. That's OK; what we will end up with is the last (most recent) connection stored in the array.

The END procedure formats the data for us. It loops through the count array, formatting the timestamps in the first and last arrays for printing. Consider a log file with the following records in it.

$ cat /var/log/cgi/querylog
arnold:some.domain.com:831322007
mary:another.domain.org:831312546
arnold:some.domain.com:831327215
mary:another.domain.org:831346231
arnold:some.domain.com:831324598

Here's what running the program produces:

$ gawk -f cgiformat.awk /var/log/cgi/querylog
mary@another.domain.org -> 2 times between 96-05-05 12:09 and 96-05-05 21:30
arnold@some.domain.com -> 3 times between 96-05-05 14:46 and 96-05-05 15:29

11.2.4 Michael's awk (mawk)

The third freely available awk is mawk, written by Michael Brennan. This program is upwardly compatible with POSIX awk, and has a few extensions as well. It is solid and performs very well. Source code for mawk is freely available via anonymous FTP from ftp.whidbey.net. It is in /pub/brennan/mawk1.3.3.tar.gz. (There may be a later version there by the time you read this.) This is also a tar file compressed with the gzip program. Be sure to use "binary" or "image" mode to transfer the file.

Mawk's primary advantages are its speed and robustness. Although it has fewer features than gawk, it almost always outperforms it.[8] Besides UNIX systems, mawk also runs under MS-DOS.

[8] Gawk's advantages are that it has a larger feature set, it has been ported to more non-UNIX kinds of systems, and it comes with much more extensive documentation.

The common extensions described above are also available in mawk.

