7.5 Records and Fields

Awk makes the assumption that its input is structured and not just an endless string of characters. In the simplest case, it takes each input line as a record and each word, separated by spaces or tabs, as a field. (The characters separating the fields are often referred to as delimiters.) The following record in the file names has three fields, separated by either a space or a tab.

John Robinson	666-555-1111

Two or more consecutive spaces and/or tabs count as a single delimiter.
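
To check this rule for yourself, the following sketch counts the fields in a record padded with extra whitespace. The sample line is made up for illustration, and NF, awk's built-in count of the fields in the current record, is used here ahead of its introduction.

```shell
# Hypothetical record: two spaces and a tab after the first name,
# two tabs before the phone number.
fields=$(printf 'John  \tRobinson\t\t666-555-1111\n' | awk '{ print NF }')
echo "$fields"    # prints 3
```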

7.5.1 Referencing and Separating Fields

Awk allows you to refer to fields in actions using the field operator $. This operator is followed by a number or a variable that identifies the position of a field by number. "$1" refers to the first field, "$2" to the second field, and so on. "$0" refers to the entire input record. The following example displays the last name first and the first name second, followed by the phone number.

$ awk '{ print $2, $1, $3 }' names
Robinson John 666-555-1111

$1 refers to the first name, $2 to the last name, and $3 to the phone number. The commas that separate each argument in the print statement cause a space to be output between the values. (Later on, we'll discuss the output field separator, OFS; the comma outputs its value, which is a space by default.) In this example, a single input line forms one record containing three fields: there is a space between the first and last names and a tab between the last name and the phone number. If you wanted to grab the first and last name as a single field, you could set the field separator explicitly so that only tabs are recognized. Then, awk would recognize only two fields in this record.
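
As a quick illustration of what the comma does, this sketch prints the same two fields with and without it. The one-line input is supplied by echo rather than the names file, purely to keep the example self-contained. Without the comma, the two values are simply concatenated.

```shell
with_comma=$(echo 'John Robinson' | awk '{ print $2, $1 }')
without_comma=$(echo 'John Robinson' | awk '{ print $2 $1 }')
echo "$with_comma"       # Robinson John
echo "$without_comma"    # RobinsonJohn
```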

You can use any expression that evaluates to an integer to refer to a field, not just numbers and variables.

$ echo a b c d | awk 'BEGIN { one = 1; two = 2 }
> { print $(one + two) }'
c

You can change the field separator with the -F option on the command line. It is followed by the delimiter character (either immediately, or separated by whitespace). In the following example, the field separator is changed to a tab.

$ awk -F"\t" '{ print $2 }' names
666-555-1111
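
If you want to try this without a names file, a sketch like the following feeds awk the same record through printf (an assumption made only to keep the example self-contained). With the tab as the only separator, the first and last names stay together in $1.

```shell
# printf expands \t and \n in its format string.
name=$(printf 'John Robinson\t666-555-1111\n' | awk -F'\t' '{ print $1 }')
phone=$(printf 'John Robinson\t666-555-1111\n' | awk -F'\t' '{ print $2 }')
echo "$name"     # John Robinson
echo "$phone"    # 666-555-1111
```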

"\t" is an escape sequence (discussed below) that represents an actual tab character. It should be surrounded by single or double quotes.

Commas delimit fields in the following two address records.

John Robinson,Koren Inc.,978 4th Ave.,Boston,MA 01760,696-0987 
Phyllis Chapman,GVE Corp.,34 Sea Drive,Amesbury,MA 01881,879-0900

An awk program can print the name and address in block format.

# blocklist.awk -- print name and address in block form.
# input file -- name, company, street, city, state and zip, phone
{ 	print ""	# output blank line
	print $1	# name
	print $2	# company
	print $3	# street 
	print $4, $5	# city, state zip 
}

The first print statement specifies an empty string ("") (remember, print by itself outputs the current line). This arranges for the records in the report to be separated by blank lines. We can invoke this script and specify that the field separator is a comma using the following command:

awk -F, -f blocklist.awk names

The following report is produced:

John Robinson
Koren Inc.
978 4th Ave.
Boston MA 01760

Phyllis Chapman
GVE Corp.
34 Sea Drive
Amesbury MA 01881

It is usually a better practice, and more convenient, to specify the field separator in the script itself. The system variable FS can be defined to change the field separator. Because this must be done before the first input line is read, we must assign this variable in an action controlled by the BEGIN rule.

BEGIN { FS = "," }
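
To see why the assignment belongs in BEGIN, compare it with an assignment made in the main action. This is a sketch using made-up two-line input; it assumes a POSIX-conforming awk such as gawk, where a new value of FS applies only to records read after the assignment (some older awks are known to behave differently).

```shell
data='a,b
c,d'

# FS is set while the first record is being processed: too late for it.
late=$(printf '%s\n' "$data" | awk '{ FS = ","; print $1 }')

# FS is set before any input is read.
early=$(printf '%s\n' "$data" | awk 'BEGIN { FS = "," } { print $1 }')

printf 'late:\n%s\nearly:\n%s\n' "$late" "$early"
```

In the first command, the first record is split on whitespace, so $1 is the whole line a,b; only the second record is split on commas.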

Now let's use it in a script to print out the names and phone numbers.

# phonelist.awk -- print name and phone number. 
# input file -- name, company, street, city, state and zip, phone

BEGIN { FS = "," }  # comma-delimited fields

{ print $1 ", " $6 }

Notice that we use blank lines in the script itself to improve readability. The print statement puts a comma followed by a space between the two output fields. This script can be invoked from the command line:

$ awk -f phonelist.awk names
John Robinson, 696-0987
Phyllis Chapman, 879-0900

This gives you a basic idea of how awk can be used to work with data that has a recognizable structure. This script is designed to print all lines of input, but we could modify the single action by writing a pattern-matching rule that selected only certain names or addresses. So, if we had a large listing of names, we could select only the names of people residing in a particular state. We could write:

/MA/ { print $1 ", " $6 }

where MA would match the postal state abbreviation for Massachusetts. However, we could possibly match a company name or some other field in which the letters "MA" appeared. We can test a specific field for a match. The tilde (~) operator allows you to test a regular expression against a field.

$5 ~ /MA/   { print $1 ", " $6 }

You can reverse the meaning of the rule by using bang-tilde (!~).

$5 !~ /MA/   { print $1 ", " $6 }
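
The difference between the three rules shows up as soon as "MA" appears outside the fifth field. In this sketch, the second record (Ann Lee at MAX Corp.) is a made-up addition to the sample data, chosen so that its company name contains the letters "MA" while its state does not.

```shell
records='John Robinson,Koren Inc.,978 4th Ave.,Boston,MA 01760,696-0987
Ann Lee,MAX Corp.,12 Oak St.,Albany,NY 12207,555-0100'

# Match anywhere in the record: both lines are selected.
whole=$(printf '%s\n' "$records" | awk -F, '/MA/ { print $1 ", " $6 }')

# Match only in the fifth field: just the Massachusetts record.
field=$(printf '%s\n' "$records" | awk -F, '$5 ~ /MA/ { print $1 ", " $6 }')

# Reverse the test: just the record whose fifth field lacks "MA".
other=$(printf '%s\n' "$records" | awk -F, '$5 !~ /MA/ { print $1 ", " $6 }')

printf '%s\n---\n%s\n---\n%s\n' "$whole" "$field" "$other"
```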

This rule would match all those records whose fifth field did not have "MA" in it. A more challenging pattern-matching rule would be one that matches only long-distance phone numbers. The following regular expression looks for an area code.

$6 ~ /1?(-| )?\(?[0-9]+\)?( |-)?[0-9]+-[0-9]+/

This rule matches any of the following forms:

707-724-0000
(707) 724-0000
(707)724-0000
1-707-724-0000   
1 707-724-0000   
1(707)724-0000

The regular expression can be deciphered by breaking down its parts. "1?" means zero or one occurrences of "1". "(-| )?" looks for either a hyphen or a space in the next position, or nothing at all. "\(?" looks for zero or one left parenthesis; the backslash prevents the interpretation of "(" as the grouping metacharacter. "[0-9]+" looks for one or more digits; note that we took the lazy way out and specified one or more digits rather than exactly three. In the next position, we are looking for an optional right parenthesis, and again, either a space or a hyphen, or nothing at all. Then we look for one or more digits "[0-9]+" followed by a hyphen followed by one or more digits "[0-9]+". One consequence of the lazy approach is worth knowing: because the digit counts are open-ended and the expression is not anchored, it also matches a seven-digit local number such as 696-0987 (as "69", then "6", a hyphen, and "0987"). Requiring exactly three digits for the area code would rule that out.
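
The forms listed above can be run through the expression directly. The sketch below assumes each alternation contains a space, as the description says ("either a hyphen or a space"), and adds a seven-digit local number to the list as an extra test case. It also tries a stricter expression of our own, not part of the original rule, that anchors the match and spells out three digits for the area code.

```shell
numbers='707-724-0000
(707) 724-0000
(707)724-0000
1-707-724-0000
1 707-724-0000
1(707)724-0000
696-0987'

# The expression from the text: every line matches, including the local number.
loose=$(printf '%s\n' "$numbers" |
    awk '/1?(-| )?\(?[0-9]+\)?( |-)?[0-9]+-[0-9]+/ { n++ } END { print n + 0 }')

# Hypothetical stricter variant: anchored, exactly three digits in the area code.
strict=$(printf '%s\n' "$numbers" |
    awk '/^(1[- ]?)?(\([0-9][0-9][0-9]\) ?|[0-9][0-9][0-9]-)[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$/ { n++ }
         END { print n + 0 }')

echo "$loose $strict"    # 7 6
```

The digits are written out as repeated bracket expressions rather than with an interval such as {3}, since not every awk supports intervals.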

7.5.2 Field Splitting: The Full Story

There are three distinct ways you can have awk separate fields. The first method is to have fields separated by whitespace. To do this, set FS equal to a single space. In this case, leading and trailing whitespace (spaces and/or tabs) are stripped from the record, and fields are separated by runs of spaces and/or tabs. Since the default value of FS is a single space, this is the way awk normally splits each record into fields.
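
A small sketch of this first method, using a made-up record with leading, trailing, and repeated spaces. Setting FS to a single space in BEGIN is redundant here, since that is the default, but it makes the rule explicit.

```shell
out=$(printf '  John   Robinson  \n' |
    awk 'BEGIN { FS = " " } { print NF ": [" $1 "] [" $2 "]" }')
echo "$out"    # 2: [John] [Robinson]
```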

The second method is to have some other single character separate fields. For example, awk programs for processing the UNIX /etc/passwd file usually use a ":" as the field separator. When FS is any single character, each occurrence of that character separates another field. If there are two successive occurrences, the field between them simply has the empty string as its value.
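
The empty field is easy to see with a passwd-style line. The entry below is hypothetical, with nothing between the first two colons; NF is awk's built-in count of fields.

```shell
entry='root::0:0:Super-User:/:/bin/sh'    # made-up sample line

count=$(echo "$entry" | awk -F: '{ print NF }')
second=$(echo "$entry" | awk -F: '{ print "[" $2 "]" }')
echo "$count"     # 7
echo "$second"    # []
```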

Finally, if you specify more than a single character as the field separator, it will be interpreted as a regular expression. That is, the field separator will be the "leftmost longest non-null and nonoverlapping" substring[2] that matches the regular expression. (The phrase "null string" is technical jargon for what we've been calling the "empty string.")

[2] The AWK Programming Language [Aho], p. 60.

You can see the difference between specifying:

FS = "\t"

which causes each tab to be interpreted as a field separator, and:

FS = "\t+"

which specifies that one or more consecutive tabs separate a field. Using the first specification, the following line would have three fields:

abc\t\tdef

whereas the second specification would only recognize two fields. Using a regular expression allows you to specify several characters to be used as delimiters:

FS = "[':\t]"

Any of the three characters in brackets will be interpreted as the field separator.
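
The three separators discussed in this section can be compared by counting fields. This is a sketch: the input lines are generated with printf, and the last one is a made-up record that uses each of the three bracketed characters once.

```shell
# Each tab is a separator: abc, an empty field, def.
single=$(printf 'abc\t\tdef\n' | awk 'BEGIN { FS = "\t" } { print NF }')

# A run of tabs is one separator: abc, def.
run=$(printf 'abc\t\tdef\n' | awk 'BEGIN { FS = "\t+" } { print NF }')

# Quote, colon, or tab: a, b, c, d. The separator is passed with -F in
# double quotes so that the single quote needs no special handling.
any=$(printf "a'b:c\td\n" | awk -F"[':\t]" '{ print NF }')

echo "$single $run $any"    # 3 2 4
```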

