15

Awk, Awk

By Ann Marshall

Overview

The UNIX utility awk is a pattern matching and processing language with considerably more power than you may realize. It searches one or more specified files, checking for records that match a specified pattern. If awk finds a match, the corresponding action is performed. A simple concept, but it results in a powerful tool. Often an awk program is only a few lines long, and because of this, an awk program is often written, used, and discarded. A traditional programming language, such as Pascal or C, would take more thought, more lines of code, and hence, more time. Short awk programs arise from two of its built-in features: the amount of predefined flexibility and the number of details that are handled by the language automatically. Together, these features allow the manipulation of large data files in short (often single-line) programs, and make awk stand apart from other programming languages. Certainly any time you spend learning awk will pay dividends in improved productivity and efficiency.

Uses

The uses for awk vary from the simple to the complex. Originally awk was intended for various kinds of data manipulation. Intentionally omitting parts of a file, counting occurrences in a file, and writing reports are naturals for awk.

Awk borrows much of its syntax from the C programming language, so if you know C, you have an idea of awk syntax. If you are new to programming or don't know C, learning awk will familiarize you with many of the C constructs.

Examples of where awk can be helpful abound. Computer-aided manufacturing, for example, is plagued with nonstandardization, so the output of a computer that's running a particular tool is quite likely to be incompatible with the input required for a different tool. Rather than write any complex C program, this type of simple data transformation is a perfect awk task.

One real problem of computer-aided manufacturing today is that no standard format yet exists for the program running the machine. Therefore, the output from Computer A running Machine A probably is not the input needed for Computer B running Machine B. Although Machine A is finished with the material, Machine B is not ready to accept it. Production halts while someone edits the file so it meets Computer B's needed format. This is a perfect and simple awk task.

Due to the amount of built-in automation within awk, it is also useful for rapid prototyping or trying out an idea that could later be implemented in another language.

Features

Reflecting the UNIX environment, awk features resemble the structures of both C and shell scripts. Highlights include its flexibility, its predefined variables, its automation, its standard program constructs, its conventional variable types, its powerful output formatting borrowed from C, and its ease of use.

The flexibility means that most tasks may be done more than one way in awk. With the application in mind, the programmer chooses which method to use. The built-in variables already provide many of the tools to do what is needed. Awk is highly automated. For instance, awk automatically retrieves each record, separates it into fields, and does type conversion when needed without programmer request. Furthermore, there are no variable declarations. Awk includes the "usual" programming constructs for the control of program flow: an if statement for two-way decisions and do, for, and while statements for looping. Awk also includes its own notational shorthand to ease typing. (This is UNIX after all!) Awk borrows the printf() statement from C to allow "pretty" and versatile formats for output. These features combine to make awk user friendly.

Brief History

Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan created awk in 1977. (The name is from the creators' last initials.) In 1985, more features were added, creating nawk (new awk). For quite a while, nawk remained exclusively the property of AT&T Bell Labs. Although it became part of System V for Release 3.1, some versions of UNIX, like SunOS, keep both awk and nawk due to a syntax incompatibility. Others, like System V, run nawk under the name awk (although System V has nawk too). The Free Software Foundation's GNU project introduced its own version of awk, gawk, based on the IEEE POSIX awk standard (Institute of Electrical and Electronics Engineers, Inc., IEEE Standard for Information Technology, Portable Operating System Interface, Part 2: Shell and Utilities Volume 2, ANSI approved 4/5/93), which is different from awk or nawk. Linux, a freely distributed UNIX-like system for PCs, uses gawk rather than awk or nawk. Throughout this chapter I have used the word awk when any of the three versions supports the concept being discussed. The versions are mostly upwardly compatible. Awk is the oldest, then nawk, then POSIX awk, then gawk as shown below. I have used the notation version++ to denote a concept that began in that version and continues through any later versions.


NOTE: Due to different syntax, awk code can never be upgraded to nawk. However, except as noted, all the concepts of awk are implemented in nawk (and gawk). Where it matters, I have specified the version.


Figure 15.1. The evolution of awk.

Refer to the end of the chapter for more information and further resources on awk and its derivatives.

Fundamentals

This section introduces the basics of the awk programming language. Although my discussion first skims the surface of each topic to familiarize you with how awk functions, later sections of the chapter go into greater detail. One feature of awk that almost continually holds true is this: you can do most tasks more than one way. The command line exemplifies this. First, I explain the variety of ways awk may be called from the command line—using files for input, the program file, and possibly an output file. Next, I introduce the main construct of awk, which is the pattern action statement. Then, I explain the fundamental ways awk can read and transform input. I conclude the section with a look at the format of an awk program.

Entering Awk from the Command Line

In its simplest form, awk takes the material you want to process from standard input and displays the results to standard output (the monitor). You write the awk program on the command line. The following table shows the various ways you can enter awk and input material for processing.

You can either specify explicit awk statements on the command line, or, with the -f flag, specify an awk program file that contains a series of awk commands. In addition to the standard UNIX design allowing for standard input and output, you can, of course, use file redirection in your shell, too, so awk < inputfile is functionally identical to awk inputfile. To save the output in a file, again use file redirection: awk > outputfile does the trick. Helpfully, awk can work with multiple input files at once if they are specified on the command line.

The most common way to see people use awk is as part of a command pipe, where it's filtering the output of a command. An example is ls -l | awk '{print $3}', which prints just the third column of each line of the ls output. Awk scripts can become quite complex, so if you have a standard set of filter rules that you'd like to apply to a file, with the output sent directly to the printer, you could use something like awk -f myawkscript inputfile | lp.


TIP: If you opt to specify your awk script on the command line, you'll find it best to use single quotes to let you use spaces and to ensure that the command shell doesn't falsely interpret any portion of the command.

Files for Input

These input and output places can be changed if desired. You can specify an input file by typing the name of the file after the program with a blank space between the two. If you name no input file, the input enters the awk environment from your workstation keyboard (standard input). To signal the end of the input, type Ctrl+d. The program on the command line executes on the input you just entered and the results are displayed on the monitor (the standard output).

Here's a simple little awk command that echoes all lines I type, prefacing each with the number of words (or fields, in awk parlance, hence the NF variable for number of fields) in the line. (Note that Ctrl+d means that while holding down the Control key you should press the d key).

$ awk '{print NF ": " $0}'

I am testing my typing.

A quick brown fox jumps when vexed by lazy ducks.

Ctrl+d

5: I am testing my typing.

10: A quick brown fox jumps when vexed by lazy ducks.

$ _

You can also name more than one input file on the command line, causing the combined files to act as one input. This is one way of having multiple runs through one input file.


TIP: Keep in mind that the correct ordering on the command line is crucial for your program to work correctly: files are read from left to right, so if you want to have file1 and file2 read in that order, you'll need to specify them as such on the command line.

The Program File

With awk's automatic type conversion, a file of names and a file of numbers entered in the reverse order at the command line generate strange-looking output rather than an error message. Typing a long program on the command line invites similar mistakes, so for longer programs it is simpler to put the program in a file and specify the name of the file on the command line. The -f option does this. Notice that the name of the program file must immediately follow -f, and that any input files come last on the command line.


NOTE: Versions of awk that meet the POSIX awk specifications are allowed to have multiple -f options. You can use this for running multiple programs using the same input.

Specifying Output on the Command Line

Output from awk may be redirected to a file or piped to another program (see Chapter 4). The command awk '/^5/ {print $0}' | grep 3, for example, will result in just those lines that start with the digit five (that's what the awk part does) and also contain the digit three (the grep command). If you wanted to save that output to a file, by contrast, you could use awk '/^5/ {print $0}' > results and the file results would contain all lines that begin with the digit 5. If you opt for neither of these courses, the output of awk will be displayed on your screen directly, which can be quite useful in many instances, particularly when you're developing or fine-tuning your awk script.

Patterns and Actions

Awk programs are divided into three main blocks: the BEGIN block, the per-record processing block, and the END block. Unless explicitly stated, all statements to awk appear in the per-record block (you'll see later where the other blocks can come in particularly handy for programming, though).

Statements within awk are divided into two parts: a pattern, telling awk what to match, and a corresponding action, telling awk what to do when a line matching the pattern is found. The action part of a pattern action statement is enclosed in curly braces ({}) and may be multiple statements. Either part of a pattern action statement may be omitted. An action with no specified pattern matches every record of the input file you want to search (that's how the earlier example of {print $0} worked). A pattern without an action indicates that you want input records to be copied to the output file as they are (i.e., printed).

The example of /^5/ {print $0} is an example of a two-part statement: the pattern here is all lines that begin with the digit five (the ^ indicates that it should appear at the beginning of the line: without it the pattern would say any line that includes the digit five) and the action is print the entire line verbatim. ($0 is shorthand for the entire line.)
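The three forms described above (pattern with action, pattern alone, action alone) can be tried side by side. The following is a minimal sketch; the three-line sample input and the function names are made up for illustration and are not part of the chapter's examples.

```shell
# Hypothetical three-line sample input, used only for this illustration.
sample() { printf '5 apples\n12 pears\n53 plums\n'; }

# Pattern and action: print the lines that start with the digit 5.
both() { sample | awk '/^5/ {print $0}'; }

# Pattern only: the implied action is to print the matching record.
pattern_only() { sample | awk '/^5/'; }

# Action only: the action runs for every input record.
action_only() { sample | awk '{print $2}'; }

both
pattern_only
action_only
```

The first two commands print the same two lines, which shows that a missing action behaves like {print $0}.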

Input

Awk automatically scans, in order, each record of the input file, testing it against each pattern action statement in the awk program. Unless otherwise set, awk assumes each record is a single line. (See the sections "Advanced Concepts" and "Multi-line Records" for how to change this.) If the input file has blank lines in it, each blank line counts as a record too. Awk automatically retrieves each record for analysis; there is no read statement in awk.

A programmer may also disrupt the automatic input order in one of two ways: the next and exit statements. The next statement tells awk to retrieve the next record from the input file and continue without running the current input record through the remaining portion of pattern action statements in the program. For example, if you are doing a crossword puzzle and all the letters of a word are formed by previous words, most likely you wouldn't even bother to read that clue but simply skip to the clue below; this is how the next statement would work, if your list of clues were the input. The other method of disrupting the usual flow of input is through the exit statement. The exit statement transfers control to the END block, if one is specified, or quits the program, as if all the input has been read; suppose the arrival of a friend ends your interest in the crossword puzzle, but you still put the paper away. Within the END block, an exit statement causes the program to quit.
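To see next and exit at work, here is a small sketch. The input (the numbers 1 through 5, one per line) and the function names are assumptions chosen for illustration.

```shell
# Hypothetical input: one number per line.
numbers() { printf '1\n2\n3\n4\n5\n'; }

skip_and_stop() {
    numbers | awk '
        $1 == 2 { next }                 # skip this record; later statements never see it
        $1 == 4 { exit }                 # stop reading input and go to the END block
                { print "saw " $1 }
        END     { print "records read: " NR }
    '
}

skip_and_stop
```

Record 2 is skipped by next, record 4 triggers exit, and record 5 is never read, so the program reports only records 1 and 3 before the END block runs.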

An input record refers to the entire line of a file including any characters, spaces, or Tabs. The spaces and tabs are called whitespace.


TIP: If you think that your input file may include both spaces and tabs, you can save yourself a lot of confusion by ensuring that all tabs become spaces with the expand program. It works like this: expand filename | awk '{ stuff }'.

The whitespace in the input file and the whitespace in the output file are not related and any whitespace you want in the output file, you must explicitly put there.

Fields

A group of characters in the input record or output file is called a field. Fields are predefined in awk: $1 is the first field, $2 is the second, $3 is the third, and so on. $0 indicates the entire line. Fields are separated by a field separator (any single character including Tab), held in the variable FS. Unless you change it, FS has a space as its value, which awk treats specially: any run of spaces and tabs separates two fields. FS may be changed either by starting the programfile with the following statement:

BEGIN {FS = "char" }

or by setting the -Fchar command line option where char is the selected field separator character you want to use.

One file that you might have viewed which demonstrates where changing the field separator could be helpful is the /etc/passwd file that defines all user accounts. Rather than having the different fields separated by spaces or tabs, the password file is structured with lines:

news:?:6:11:USENET News:/usr/spool/news:/bin/ksh

Each field is separated by a colon! You could change each colon to a space (with sed, for example), but that wouldn't work too well: notice that the fifth field, USENET News, contains a space already. Better to change the field separator. If you wanted to just have a list of the fifth fields in each line, therefore, you could use the simple awk command awk -F: '{print $5}' /etc/passwd.

Likewise, the built-in variable OFS holds the value of the output field separator. OFS also has a default value of a space. It, too, may be changed by placing the following line at the start of a program.

BEGIN {OFS = "char" }

If you want to automatically translate the passwd file so that it lists only the first and fifth fields, separated by a tab, you can therefore use the awk script:

BEGIN { FS=":" ; OFS="\t" }

{ print $1, $5 }

Notice here that the script contains two blocks: the BEGIN block and the main per-input line block. Also notice that most of the work is done automatically.
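The same translation can be tried without touching the real password file. The sketch below feeds awk two passwd-style lines adapted from this chapter's examples (the password field is replaced by a placeholder); the function names are hypothetical.

```shell
# Two passwd-style lines adapted from the examples in this chapter.
passwd_sample() {
    printf '%s\n' \
        'news:?:6:11:USENET News:/usr/spool/news:/bin/ksh' \
        'amarshal:x:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh'
}

# FS and OFS set in a BEGIN block: colon in, tab out.
name_table() {
    passwd_sample | awk 'BEGIN { FS = ":"; OFS = "\t" } { print $1, $5 }'
}

# The -F option sets FS from the command line instead.
full_names() {
    passwd_sample | awk -F: '{ print $5 }'
}

name_table
full_names
```

Writing the tab as "\t" keeps the separator visible in the script, which is easier to maintain than a literal tab character between the quotes.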

Program Format

With a few noted exceptions, awk programs are free format. The interpreter ignores any blank lines in a programfile. Add them to improve the readability of your program whenever you wish. The same is true for Tabs and spaces between operators and the parts of a program. Therefore, these two lines are treated identically by the awk interpreter.

$4 == 2               {print "Two"}

$4     ==     2     {     print     "Two"     }

If more than one statement appears on a line, you'll need to separate them with a semicolon, as shown above in the BEGIN block for the passwd file translator. If you stick with one command per line then you won't need to worry too much about the semicolons. There are a couple of spots, however, where the semicolon must always be used: before an else statement or when included in the syntax of a statement. (See the "Loops" or "The Conditional Statement" sections.) However, you may always put a semicolon at the end of a statement.

The other format restriction for awk programs is that at least the opening curly bracket of the action half of a pattern action statement must be on the same line as the accompanying pattern, if both pattern and action exist. Thus, the following examples all do the same thing.

The first shows all statements on one line:

$2==0     {print ""; print ""; print "";}

The second with the first statement on the same line as the pattern to match:

$2==0     {     print ""

          print ""

          print ""}

and finally as spread out as possible:

$2==0     {

          print ""

          print ""

          print ""

     }

When the second field of the input file is equal to 0, awk prints three blank lines to the output file.


NOTE: Notice that print "" prints a blank line to the output file, whereas the statement print alone prints the current input line.

When you look at an awk program file, you may also find commentary within. Anything typed from a # to the end of the line is considered a comment and is ignored by awk. They are notes to anyone reading the program to explain what is going on in words, not computerese.

A Note on awk Error Messages

Awk error messages (when they appear) tend to be cryptic. Often, due to the brevity of the program, a typo is easily found. Not all errors are as obvious; I have scattered some examples of errors throughout this chapter.

Print Selected Fields

Awk includes three ways to specify printing. The first is implied. A pattern without an action assumes that the action is to print. The two ways of actively commanding awk to print are print and printf(). For now, I am going to stick to using only implied printing and the print statement. printf is discussed in a later section ("Input/Output") and is used mainly for precise output. This section demonstrates the first two types of printing through some step-by-step examples.

Program Components

If I want to be sure the System Administrator spelled my name correctly in the /etc/passwd file, I enter an awk command to find a match but omit an action. The following command line puts a list on-screen.

$ awk '/Ann/' /etc/passwd

amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh

andhs26:0TFnZSVwcua3Y:2488:23:DeAnn O'Neal:/usr/lstudent/andhs26:/bin/csh

alewis:VYfz4EatT4OoA:2623:22:Annie Lewis:/usr/lteach/alewis:/bin/csh

cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh

jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh

lschultz:mic35ZiFj9zWk:3060:22:Lee Ann Schultz, :/usr/lteach/lschultz:/bin/csh

akestle:job57Lb5/ofoE:3063:22:Ann Kestle.:/usr/lteach/akestle:/bin/csh

bakehs59:yRYV6BtcW7wFg:3075:23:DeAnna Adlington, Baker :/usr/bakehs59:/bin/csh

ahernan:AZZPQNCkw6ffs:3144:23:Ann Hernandez:/usr/lstudent/ahernan:/bin/csh

$ _

I look on the monitor and see the correct spelling.


ERROR NOTE: For the sake of making a point, suppose I had chosen the pattern /Anne/. A quick glance above shows that there would be no matches. Entering awk '/Anne/' /etc/passwd will therefore produce nothing but another system prompt to the monitor. This can be confusing if you expect output. The same goes the other way; above, I wanted the name Ann, but the names DeAnn, Annie, JoAnn, and DeAnna matched, too. Sometimes choosing a pattern too long or too short can cause an unneeded headache.


TIP: If a pattern match is not found, look for a typo in the pattern you are trying to match.

Printing specified fields of an ASCII (plain text) file is a straightforward awk task. Because this program example is so short, only the input is in a file. The first input file, "sales", is a file of car sales by month. The file consists of each salesperson's name, followed by the sales figures for each of three months. The last field is that person's total sales for the quarter.

The Input File and Program

$ cat sales

John Anderson,12,23,7,42

Joe Turner,10,25,15,50

Susan Greco,15,13,18,46

Bob Burmeister,8,21,17,46

The following command line prints the salesperson's name and the total sales for the first quarter.

$ awk -F, '{print $1,$5}' sales

John Anderson 42

Joe Turner 50

Susan Greco 46

Bob Burmeister 46

A comma (,) between field variables indicates that I want OFS applied between output fields as shown in a previous example. Remember without the comma, no field separator will be used, and the displayed output fields (or output file) will all run together.
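The effect of the comma is easy to check on a single record. This sketch reuses the first line of the sales file; the function names and the ": " separator are chosen only for illustration.

```shell
# One record from the sales file shown above.
sales_line() { printf 'John Anderson,12,23,7,42\n'; }

# With a comma, OFS (a space by default) is placed between the fields.
with_comma() { sales_line | awk -F, '{ print $1, $5 }'; }

# Without a comma, the two fields are concatenated and run together.
without_comma() { sales_line | awk -F, '{ print $1 $5 }'; }

# Changing OFS changes what the comma inserts.
with_ofs() { sales_line | awk -F, 'BEGIN { OFS = ": " } { print $1, $5 }'; }

with_comma
without_comma
with_ofs
```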


TIP: Putting two commas in a row inside a print statement creates a syntax error; however, using the same field twice in a single print statement is valid syntax. For example:

awk '{print($1,$1)}'

Patterns

A pattern is the first half of an awk program statement. In awk there are six accepted pattern types. This section discusses each of the six in detail. You have already seen a couple of them, including BEGIN, and a specified, slash-delimited pattern, in use. Awk has many string matching capabilities arising from patterns, and the use of regular expressions in patterns. A range pattern locates a sequence. All patterns except range patterns may be combined in a compound pattern.

I began the chapter by saying awk was a pattern-match and process language. This section explores exactly what is meant by a pattern match. As you'll see, what kind of pattern you can match depends on exactly how you're using the awk pattern specification notation.

BEGIN and END

The two special patterns BEGIN and END may be used to indicate a match, either before the first input record is read, or after the last input record is read, respectively. Some versions of awk require that, if used, BEGIN must be the first pattern of the program and, if used, END must be the last pattern of the program. While not necessarily a requirement, it is nonetheless an excellent habit to get into, so I encourage you to do so, as I do throughout this chapter. Using the BEGIN pattern for initializing variables is common (although variables can be passed from the command line to the program too; see "Command Line Arguments"). The END pattern is used for things which are input-dependent, such as totals.

If I want to know how many lines are in a given program, I type the following line:

$ awk 'END {print "Total lines: " NR}' myprogram

I see Total lines: 256 on the monitor and therefore know that the file myprogram has 256 lines. At any point while awk is processing the file, the variable NR counts the number of records read so far. NR at the end of a file has a value equal to the number of lines in the file.

How might you see a BEGIN block in use? Your first thought might be to initialize variables, but if it's a numeric value, it's automatically initialized to zero before its first use. Instead, perhaps you're building a table of data and want to have some columnar headings. With this in mind, here's a simple awk script that shows you all the accounts that people named Dave have on your computer:

BEGIN {

     FS=":"     # remember that the passwd file uses colons

     OFS="\t"     # we're setting the output to a TAB

     print "Account","Username"

     }

/Dav/     {print $1, $5}

Here's what it looks like in action (we've called this file "daves.awk", though the program matches Dave and David, of course):

$ awk -f daves.awk /etc/passwd

Account     Username

andrews     Dave Andrews

d3          David Douglas Dunlap

daves       Dave Smith

taylor      Dave Taylor

Note that you could also easily have a summary of the total number of matched accounts by adding a variable that's incremented for each match, then in the END block output in some manner. Here's one way to do it:

BEGIN {  FS=":" ; OFS="\t" # input colon separated, output tab separated

     print "Account","Username"

     }

/Dav/     {print $1, $5 ; matches++ }

END     { print "A total of " matches " matches." }

Here you can see how awk allows you to shorten the length of programs by having multiple items on a single line, particularly useful for initialization. Also notice the C increment notation: "matches++" is functionally identical to "matches = matches + 1". Finally, also notice that we didn't have to initialize the variable "matches" to zero since it was done for us automatically by the awk system.

Expressions

Any expression may be used with any operator in awk. An expression consists of an operator and its corresponding operands, and it can serve as the pattern half of a pattern action statement. Type conversion (variables being interpreted as numbers at one point, but strings at another) is automatic; you never have to request it. The type of operand needed is decided by the operator type. If a numeric operator is given a string operand, the operand is converted to a number, and vice versa.


TIP: To force a conversion, if the desired change is string to number, add (+) 0. If you wish to explicitly convert a number to a string, concatenate "" (the null string) to the variable. Two quick examples: num=3; num=num "" creates a new numeric variable and sets it to the number three, then by appending a null string to it, translates it to a string (that is, the string consisting of the character 3). Adding zero to that string (num=num + 0) forces it back to a numeric value.
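The conversions in the tip can be observed directly. The following is a sketch, assuming a POSIX-style awk (nawk, gawk, or similar); the variable names and sample values are illustrative only.

```shell
convert() {
    awk 'BEGIN {
        num = 3
        str = num ""           # concatenating the null string gives the string "3"
        print length(str)      # the string "3" is one character long
        text = "12abc"
        print text + 0         # adding 0 forces a number; the leading digits are used
        print ("10" < "9")     # two string constants are compared as strings
        print (10 < 9)         # two numbers are compared numerically
    }'
}

convert
```

The last two lines print 1 and 0 respectively: as strings, "10" sorts before "9", while as numbers 10 is not less than 9.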

Any expression can be a pattern. If the pattern, in this case the expression, evaluates to a nonzero or nonnull value, then the pattern matches that input record. Patterns often involve comparison. The following are the valid awk comparison operators:

Operator     Meaning

==           is equal to
<            less than
>            greater than
<=           less than or equal to
>=           greater than or equal to
!=           not equal to
~            matched by
!~           not matched by

In awk, as in C, the logical equality operator is == rather than =. The single = is the assignment operator, whereas == compares values. When the pattern is a comparison, the pattern matches if the comparison is true (non-null or non-zero). Here's an example: what if you wanted to only print lines where the first field had a numeric value of less than twenty? No problem in awk:

$1 < 20 {print $0}

If the expression is arithmetic, it is matched when it evaluates to a nonzero number. For example, here's a small program that will print the first ten lines that have exactly seven words:

BEGIN  {i=0}

NF==7 { print $0 ; i++ }

i==10 {exit}

There's another way that you could use these comparisons too, since awk understands collation orders (that is, whether words are greater or lesser than other words in a standard dictionary ordering). Consider the situation where you have a phone directory (a sorted list of names) in a file and want to print all the names that would appear in the corporate phonebook before a certain person, say D. Hughes. You could do this quite succinctly by exiting at the first name that sorts at or after Hughes,D and printing every record read before that point:

$1 >= "Hughes,D" { exit }

{ print }

When the pattern is a string, a match occurs if the expression is non-null. In the earlier example with the pattern /Ann/, it was assumed to be a string since it was enclosed in slashes. In a comparison expression, if both operands have a numeric value, the comparison is based on the numeric value. Otherwise, the comparison is made using string ordering, which is why this simple example works.


TIP: You can write more than two comparisons to a line in awk.

The pattern $2 <= $1 could involve either a numeric comparison or a string comparison. Whichever it is, it will vary from file to file or even from record to record within the same file.


TIP: Know your input file well when using such patterns, particularly since awk will often silently assume a type for the variable and work with it, without error messages or other warnings.
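How the comparison type can change from record to record is shown in the sketch below. It assumes a POSIX-style awk; the four input records and the function name are invented for illustration.

```shell
# Hypothetical input: the same pattern meets numeric and non-numeric fields.
compare() {
    printf '10 9\n9 10\n9x 10\napple banana\n' |
        awk '$2 <= $1 { print "match: " $0 }'
}

compare
```

The first two records hold numbers in both fields, so they are compared numerically: 9 <= 10 matches and 10 <= 9 does not. In the third record the first field, 9x, is not a number, so both fields are compared as strings and "10" sorts before "9x", which gives a match. The fourth record is a plain string comparison in which banana does not sort before apple.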

String Matching

There are three forms of string matching. The simplest is to surround a string by slashes (/). No quotation marks are used. Hence /"Ann"/ is actually the string ' "Ann" ' not the string Ann, and /"Ann"/ returns no input. The entire input record is returned if the expression within the slashes is anywhere in the record. The other two matching operators have a more specific scope. The operator ~ means "is matched by," and the pattern matches when the input field being tested for a match contains the substring on the right hand side.

$2 ~ /mm/

This example matches every input record containing mm somewhere in the second field. It could also be written as $2 ~ "mm".

The other operator !~ means "is not matched by."

$2 !~ /mm/

This example matches every input record not containing mm anywhere in the second field.

Armed with that explanation, you can now see that /Ann/ is really just shorthand for the more complex statement $0 ~ /Ann/.

Regular expressions are common to UNIX, and they come in two main flavors. You have probably used them unconsciously on the command line as wildcards, where * matches zero or more characters and ? matches any single character. For instance, entering the line below results in the command interpreter matching all files with the suffix abc and the rm command deleting them.

rm *abc

Awk works with regular expressions that are similar to those used with grep, sed, and other editors but subtly different than the wildcards used with the command shell. In particular, . matches a character and * matches zero or more of the previous character in the pattern (so a pattern of x*y will match anything that has any number of the letter x followed by a y. To force a single x to appear too, you'd need to use the regular expression xx*y instead). By default, patterns can appear anywhere on the line, so to have them tied to an edge, you need to use ^ to indicate the beginning of the word or line, and $ for the end. If you wanted to match all lines where the first word ends in abc, for example, you could use $1 ~ /abc$/. The following line matches all records where the fourth field begins with the letter a:

$4 ~ /^a.*/
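The difference between x*y and xx*y, and the effect of the $ anchor on a single field, can be checked with a few sample lines. This is a sketch only; the word list and function names are made up for illustration, and the first two patterns are anchored at both ends so that the whole line must match.

```shell
# Hypothetical sample lines.
words() { printf 'xy\ny\nxxy\nabc\nzabc tail\n'; }

# Zero or more x followed by y: matches xy, y, and xxy.
star() { words | awk '/^x*y$/'; }

# At least one x followed by y: matches xy and xxy, but not y.
plus() { words | awk '/^xx*y$/'; }

# The first field ends in abc: matches "abc" and "zabc tail".
ends() { words | awk '$1 ~ /abc$/'; }

star
plus
ends
```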

Range Patterns

The pattern portion of a pattern/action pair may also consist of two patterns separated by a comma (,); the action is performed for all lines between the first occurrence of the first pattern and the next occurrence of the second.

At most companies, employees receive different benefits according to their respective hire dates. It so happens that I have a file listing all employees in my company, including hire date. If I wanted to write an awk program that just lists the employees hired between 1980 and 1987, I could use the following script, if the first field is the employee's name and the third field is the year hired. Here's how that data file might look (notice that I use : to separate fields so that we don't have to worry about the spaces in the employee names):

$ cat emp.data

John Anderson:sales:1980

Joe Turner:marketing:1982

Susan Greco:sales:1985

Ike Turner:pr:1988

Bob Burmeister:accounting:1991

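A range pattern needs start and stop markers that actually occur in the data, so the program names the first and last hire years in this file that fall in the desired range (1980 and 1985). It could then be invoked:

$ awk -F: '$3 == 1980, $3 == 1985 {print $1, $3}' emp.data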

With the output:

John Anderson 1980

Joe Turner 1982

Susan Greco 1985

TIP: The above example works because the input is already in order according to hire year. Range patterns often work best with pre-sorted input. If this data file were not already in order, you could sort it by the third colon-separated field with the command sort -t: -k3,3n emp.data > new.emp.data. (See Chapter 6 for more details on using the powerful sort command.)

Notice that range patterns are inclusive: they include both the record that matches the first pattern and the record that matches the second. The range pattern matches all records from an occurrence of the first pattern to the next occurrence of the second. This is a subtle point, but it has a major effect on how range patterns work. First, if the second pattern is never found, all remaining records match. So given the input file below:

$ cat sample.data

1

3

5

7

9

11

The following command prints every record from 3 through the end of the file, even though 9 and 11 are past the intended end of the range, because no record ever equals 8.

$ awk '$1==3, $1==8' sample.data

3

5

7

9

11

The end pattern of a range is not equivalent to a <= operator; it is a stop marker, and the range stays open until some record actually matches it.

Second, the range is driven by its start and end markers, not by the values in between. Once a range has closed, awk prints nothing more until a record matches the first pattern again, at which point a new range opens. That's why you have to make sure that the data is sorted as you expect.


CAUTION: Range patterns cannot be parts of a larger pattern.

A more useful example of the range pattern comes from awk's ability to handle multiple input files. I have a function finder program that finds code segments I know exist and tells me where they are. The code segments for a particular function X, for example, are bracketed by the phrase "function X" at the beginning and } /* end of X at the end. It can be expressed as the awk pattern range:

'/function functionname/,/} \/\* end of functionname/'

Compound Patterns

Patterns can be combined using the following logical operators and parentheses as needed.

Operator     Meaning

!            not

||           or (you can also use | in regular expressions)

&&           and

The pattern may be simple or quite complicated: (NF<4) || (NF>4). This matches all input records not having exactly four fields. As is usual in awk, there is a wide variety of ways to do the same thing (specify a pattern). Regular expressions are allowed in string matching, but their use is not forced. To form a pattern that matches strings beginning with a or b or c or d, there are several pattern options:

/^[a-d].*/ 

/^a.*/ || /^b.*/ || /^c.*/ || /^d.*/

NOTE: When using range patterns, $1==2, $1==4 and $1>=2 && $1<=4 are not the same thing at all. First, the range pattern depends on the occurrence of the second pattern as a stop marker, not on the value indicated in the range. Second, a range opens only at a record that matches the first pattern, so records whose values lie between the two markers are skipped whenever no range is open.

For instance, consider the following simple input file:

$ cat mydata

1     0

3     1

4     1

5     1

7     0

4     2

5     2

1     0

4     3

The first range I try, '$1==3,$1==5', produces:

$ awk '$1==3,$1==5' mydata

3     1

4     1

5     1

Compare this to the following pattern and output.

$ awk '$1>=3 && $1<=5' mydata

3     1

4     1

5     1

4     2

5     2

4     3

Range patterns cannot be parts of a combined pattern.
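To make the contrast concrete, here is a minimal sketch that feeds the same made-up column of numbers to a range pattern and to a compound pattern. The data and variable names are assumptions for illustration only. The input contains a second 3 after the first range has closed, which shows that in POSIX-style awks a range can open again, and that a range with no closing 5 stays open to the end of the input.

```shell
# Hypothetical input: the values rise, fall, and rise again.
data=$(printf '%s\n' 1 3 4 5 7 4 3 4)

# Range pattern: opens at a record equal to 3, closes at the next record equal to 5.
range=$(printf '%s\n' "$data" | awk '$1==3, $1==5')

# Compound pattern: tests every record on its own value.
compound=$(printf '%s\n' "$data" | awk '$1>=3 && $1<=5')

# The variables are left unquoted on purpose so the lines are joined with spaces.
echo "range:   " $range       # 3 4 5 3 4
echo "compound:" $compound    # 3 4 5 4 3 4
```

The 4 that follows the 7 is inside the numeric interval, so the compound pattern prints it, but no range is open at that point, so the range pattern does not.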

Actions

The remainder of this chapter explores the action part of a pattern action statement. As the name suggests, the action part tells awk what to do when a pattern is found. Patterns are optional. An awk program built solely of actions looks like a program in other procedural programming languages. But looks are deceptive: even without a pattern, awk still reads the input one record at a time and runs each record through every pattern action statement, in order, before reading the next record.

Actions must be enclosed in curly braces ({}) whether accompanied by a pattern or alone. An action part may consist of multiple statements, separated by newlines or semicolons. When single statements (no compound loops or conditions) have no pattern, you may either group them inside one pair of braces or give each statement its own pair of braces. Consider the following two action pieces:

{name = $1

print name}

and

{name = $1}

{print name}

These two produce identical output.

Variables

Variables are an integral part of any programming language: the virtual boxes within which you can store values, count things, and more. In this section, I talk about variables in awk. Awk has three types of variables: user-defined variables, field variables, and predefined variables that are provided by the language automatically. The next section is devoted to a discussion of built-in variables. Awk doesn't have variable declarations. A variable comes to life the first time it is mentioned; in a twist on René Descartes's philosophical conundrum, you use it, therefore it is. The section concludes with an example of turning an awk program into a shell script.


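CAUTION: Since there are no declarations, be doubly careful to initialize all the variables you use. A variable that has never been assigned starts out as zero when used as a number and as the empty string when used as a string, so a misspelled name silently becomes a new, empty variable.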

Naming

The rule for naming user-defined variables is that they can be any combination of letters, digits, and underscores, as long as the name does not start with a digit. It is helpful to give a variable a name indicative of its purpose in the program. Variables already defined by awk are written in all uppercase. Since awk is case-sensitive, ofs is not the same variable as OFS, and capitalization (or lack thereof) is a common source of errors. You have already seen field variables: variables beginning with $, followed by a number, and indicating a specific input field.

A variable is a number or a string or both. There is no type declaration, and type conversion is automatic if needed. Recall the employee file used earlier. For illustration suppose I enter the program awk -F: '{ print $1 * 10 }' emp.data, and awk obligingly provides the rest:

0

0

0

0

0

Of course, this makes no sense! The point is that awk did exactly what it was asked without complaint: it multiplied the name of the employee times ten, and when it tried to translate the name into a number for the mathematical operation it failed, resulting in a zero. Ten times zero, needless to say, is zero...
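The conversion rule can be seen on a single record. The sketch below uses a made-up line with three colon-separated fields (a name, a string that merely starts with digits, and a plain number); the record itself is an assumption for illustration. Awk uses whatever numeric prefix a string has and treats a string with no numeric prefix as zero.

```shell
# Hypothetical record: a name, a string with a numeric prefix, and a plain number.
out=$(printf 'John:12abc:7\n' | awk -F: '{ print $1 * 10, $2 * 10, $3 * 10 }')
echo "$out"    # 0 120 70
```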

Awk in a Shell Script

Before examining the next example, review what you know about shell programming (Chapters 10-14). Remember, every file containing shell commands needs to be changed to an executable file before you can run it as a shell script. To do this you should enter chmod +x filename from the command line.

Sometimes awk's automatic type conversion benefits you. Imagine that I'm still trying to build an office system with awk scripts and this time I want to be able to maintain a running monthly sales total based on a data file that contains individual monthly sales. It looks like this:

$ cat monthly.sales

John Anderson,12,23,7

Joe Turner,10,25,15

Susan Greco,15,13,18

Bob Burmeister,8,21,17

These need to be added together to calculate the running totals for each person's sales. Let a program do it!

$ cat total.awk

BEGIN      {FS=","; OFS=","}     # read and write comma-separated fields to keep the file format the same.

{print $1, " monthly sales summary: " $2+$3+$4 }

That's the awk script, so let's see how it works:

$ awk -f total.awk monthly.sales

John Anderson, monthly sales summary: 42

Joe Turner, monthly sales summary: 50

Susan Greco, monthly sales summary: 46

Bob Burmeister, monthly sales summary: 46

CAUTION: Always run your program once to be sure it works before you make it part of a complicated shell script!

Your task has been reduced to entering the monthly sales figures in the monthly.sales file and editing the program file total.awk to include the correct number of fields. (If you use a for loop such as for(i=2;i<=NF;i++), the fields are added up correctly no matter how many there are, but printing is a hassle and needs an if statement with 12 else if clauses.)

In this case, not having to wonder if a digit is part of a string or a number is helpful. Just keep an eye on the input data, since awk performs whatever actions you specify, regardless of the actual data type with which you're working.
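The for loop mentioned above might look like the following sketch, which adds up however many monthly figures follow the name. The two sample records (one of them with an extra month) and the output format are assumptions made for the example, not part of the original total.awk.

```shell
# Hypothetical sample in the monthly.sales format; the second person has a fourth month.
sales=$(printf '%s\n' 'John Anderson,12,23,7' 'Joe Turner,10,25,15,5')

totals=$(printf '%s\n' "$sales" | awk -F, '{
    total = 0                      # reset for every record
    for (i = 2; i <= NF; i++)      # fields 2..NF hold the monthly figures
        total += $i
    print $1 ": " total
}')
echo "$totals"
# John Anderson: 42
# Joe Turner: 55
```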

Built-in Variables

This section discusses the built-in variables found in awk. Because there are many versions of awk, I have included notes for those variables found in nawk, POSIX awk, and gawk, since they all differ. As before, unless otherwise noted, the variables of earlier releases may be found in the later implementations. Awk was released first and contains the core set of built-in variables used by all updates. Nawk expands the set. The POSIX awk specification encompasses all variables defined in nawk plus one additional variable. Gawk applies the POSIX awk standards and then adds some built-in variables which are found in gawk alone; the built-in variables noted when discussing gawk are unique to gawk. This list is a guideline, not a hard and fast rule. For instance, the built-in variable ENVIRON is formally introduced in the POSIX awk specifications; it exists in gawk; it is also in the System V implementation of nawk, but SunOS nawk doesn't have the variable ENVIRON. (See the section "Oh man! I need help." in Chapter 5 for more information on how to use man pages.)

As I stated earlier, awk is case sensitive. In all implementations of awk, built-in variables are written entirely in upper case.

Built-in Variables for Awk

When awk first became a part of UNIX, the built-in variables were the bare essentials. As the name indicates, the variable FILENAME holds the name of the current input file. Recall the function finder code; type the new line below:

/function functionname/,/} \/\* end of functionname/ {print $0}

END     {print ""; print "Found in the file " FILENAME}

This adds the finishing touch.

The value of the variable FS determines the input field separator. FS has a space as its default value. The built-in variable NF contains the number of fields in the current record (remember, fields are akin to words, and records are input lines). This value may change for each input record.

What happens if within an awk script I have the following statement?

$3 = "Third field"

It replaces the third field and rebuilds $0 from the fields, joined by the output field separator. If the record had fewer than three fields, the missing ones are created as empty strings and NF is reassigned to the new count. The total number of records read so far may be found in the variable NR. The variable OFS holds the value for the output field separator. The default value of OFS is a space. The output format for numbers resides in the variable OFMT, which has a default value of %.6g. This is the format the print statement uses for numbers; its syntax comes from the C printf format string. ORS is the output record separator. Unless changed, the value of ORS is a newline (\n).
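A quick way to watch a field assignment change NF and $0 is to assign to a field beyond the end of a short record. In this sketch the two-field input and the choice of - as OFS are assumptions picked to make the rebuilt record easy to read.

```shell
# Hypothetical two-field record; assigning $4 creates an empty $3 and raises NF to 4.
out=$(printf 'a b\n' | awk 'BEGIN { OFS = "-" } { $4 = "d"; print NF; print }')
echo "$out"
# 4
# a-b--d
```

The doubled separator in the second output line marks the empty third field that awk created to fill the gap.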

Built-in Variables for Nawk

NOTE: When awk was expanded in 1985, part of the expansion included adding more built-in variables.


CAUTION: Some implementations of UNIX simply put the new code in the spot for the old code and didn't bother keeping both awk and nawk. System V and SunOS have both available, although on System V the awk command includes the nawk expansions. Linux has neither awk nor nawk but uses gawk. The book The AWK Programming Language, by the awk authors, speaks of awk throughout, but the programming language it describes is called nawk on most systems.

The built-in variable ARGC holds the value for the number of command line arguments. The variable ARGV is an array containing the command line arguments. Subscripts for ARGV begin with 0 and continue through ARGC-1. ARGV[0] holds the name of the command itself, usually awk. Command-line options are not stored in ARGV. The variable FNR represents the number of the current record within the current input file. Like NR, this value changes with each new record. FNR is always <= NR. The built-in variable RLENGTH holds the length of the string matched by the match function. The variable RS holds the value of the input record separator. The default value of RS is a newline. The start of the string matched by the match function resides in RSTART. Between RSTART and RLENGTH, it is possible to determine what was matched. The variable SUBSEP contains the value of the subscript separator. It has a default value of "\034".
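The following sketch exercises several of these variables at once. The scratch files, their contents, the operands one and two, and the sample string are all assumptions made for the example; ARGV[0] is deliberately not printed because its exact value depends on which awk is installed.

```shell
dir=$(mktemp -d)                       # scratch directory for two hypothetical input files
printf 'a\nb\n' > "$dir/first"
printf 'c\nd\ne\n' > "$dir/second"

# FNR restarts at 1 for each file; NR keeps counting across files.
counts=$(awk '{ print FNR, NR }' "$dir/first" "$dir/second")

# ARGC counts the command name plus the operands; options are not included.
args=$(awk 'BEGIN { print ARGC - 1; for (i = 1; i < ARGC; i++) print ARGV[i] }' one two)

# match() sets RSTART and RLENGTH, which substr() can use to recover the matched text.
found=$(awk 'BEGIN { s = "any of the guys"
                     if (match(s, /gu+y/)) print RSTART, RLENGTH, substr(s, RSTART, RLENGTH) }')

printf '%s\n' "$counts" "$args" "$found"
rm -r "$dir"
```

The first command prints the pairs 1 1, 2 2, 1 3, 2 4, and 3 5; the second prints 2, one, and two; the third prints 12 3 guy.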

Built-in Variables for POSIX Awk

The POSIX awk specification introduces one new built-in variable beyond those in nawk. The built-in variable ENVIRON is an array that holds the values of the current environment variables. (Environment variables are discussed more thoroughly later in this chapter.) The subscript values for ENVIRON are the names of the environment variables themselves, and each ENVIRON element is the value of that variable. For instance, ENVIRON["HOME"] on my PC under Linux is "/home". Notice that using ENVIRON can save much system dependence within awk source code in some cases but not others. ENVIRON["HOME"] at work is "/usr/anne" while my SunOS account doesn't have an ENVIRON variable because it's not POSIX compliant.

Here's an example of how you could work with the environment variables:

ENVIRON["EDITOR"] == "vi"  {print NR,$0}

This program prints my program listings with line numbers if I am using vi as my default editor. More on this example later in the chapter.
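Here is one way to try the ENVIRON test without depending on your own login environment: the sketch sets EDITOR for a single awk invocation only. The two input lines are made up for the example. Note that the subscript is the quoted string "EDITOR"; an unquoted EDITOR would be an ordinary, empty awk variable.

```shell
# EDITOR is set only for this one command.
out=$(printf 'first line\nsecond line\n' |
      EDITOR=vi awk 'ENVIRON["EDITOR"] == "vi" { print NR, $0 }')
echo "$out"
# 1 first line
# 2 second line
```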

Built-in Variables in Gawk

The GNU group further enhanced awk by adding four new variables to gawk, its public re-implementation of awk. Gawk does not differ between UNIX versions as much as awk and nawk do, fortunately. These built-in variables are in addition to those mentioned in the POSIX specification as described above. The variable CONVFMT contains the conversion format for numbers. The default value of CONVFMT is "%.6g" and is for internal use only. The variable FIELDWIDTHS allows a programmer the option of having fixed field widths rather than a single character field separator. The values of FIELDWIDTHS are numbers separated by a space or Tab (\t), so fields need not all be the same width. When the FIELDWIDTHS variable is set, each field is expected to have a fixed width. Gawk separates the input record using the FIELDWIDTHS values for field widths. If FIELDWIDTHS is set, the value of FS is disregarded. Assigning a new value to FS overrides the use of FIELDWIDTHS; it restores the default behavior.

To see where this could be useful, let's imagine that you've just received a datafile from accounting that indicates the different employees in your group and their ages. It might look like:

$ cat gawk.datasample

1Swensen, Tim  24

1Trinkle, Dan  22

0Mitchel, Carl 27

The very first character, you find out, indicates if they're hourly or salaried: a value of 1 means that they're salaried, and a value of 0 is hourly. How to split that character out from the rest of the data field? With the FIELDWIDTHS statement. Here's a simple gawk script that could attractively list the data:

BEGIN {FIELDWIDTHS = "1 8 1 4 1 2"}

{ if ($1 == 1) print "Salaried employee "$2,$4" is "$6" years old.";

  else         print "Hourly   employee "$2,$4" is "$6" years old."

}

The output would look like:

Salaried employee Swensen, Tim  is 24 years old.

Salaried employee Trinkle, Dan  is 22 years old.

Hourly   employee Mitchel, Carl is 27 years old.

TIP: When calculating the different FIELDWIDTHS values, don't forget any field separators: the spaces between words do count in this case.

The variable IGNORECASE controls the case sensitivity of gawk regular expressions. If IGNORECASE has a nonzero value, pattern matching ignores case for regular expression operations. The default value of IGNORECASE is zero; all regular expression operations are normally case sensitive.

Conditions (No IFs, &&s or buts)

Awk program statements are, by their very nature, conditional; if a pattern matches, then a specified action or actions occur. Actions, too, have a conditional form. This section discusses conditional flow. It focuses on the syntax of the if statement, but, as usual in awk, there are multiple ways to do something.

A conditional statement does a test before it performs the action. One test, the pattern match, has already happened; this test is an action. The last two sections introduced variables; now you can begin putting them to practical uses.

The if Statement

An if statement takes the form of a typical iterative programming language control structure where E1 is an expression, as mentioned in the "Patterns" section earlier in this chapter:

if (E1) S2; else S3

While E1 is always a single expression, S2 and S3 may be either single- or multiple-action statements (that means conditions in conditions are legal syntax, but I am getting ahead of myself). Returns and indentation are, as usual in awk, entirely up to you. However, if S2 and the else statement are on the same line, and S2 is a single statement, a semicolon must separate S2 from the else statement. When awk encounters an if statement, evaluation occurs as follows: first E1 is evaluated, and if E1 is nonzero or nonnull (true), S2 is executed; if E1 is zero or null (false) and there's an else clause, S3 is executed. For instance, if you want to print a blank line when the third field has the value 25 and the entire line in all other cases, you could use a program snippet like this:

{ if ($3 == 25)

     print ""

else

     print $0 }

The else S3 portion of the if statement is completely optional, since sometimes your only choice is whether or not to have awk execute S2:

{ if ($3 == 25)

     print "" }

Although the if statement is an action, E1 can test for a pattern match using the pattern-match operator ~. This gives you another way to look for my name in the password file. The first way is shorter, but the two commands do the same thing.

$ awk '/Ann/' /etc/passwd

$ awk '{if ($0 ~ /Ann/) print $0}' /etc/passwd

One use of the if statement combined with a pattern match is to further filter the screen input. For example, here I'm going to print only the lines in the password file that contain both Ann and a capital M character:

$ awk '/Ann/ { if ($0 ~ /M/) print}' /etc/passwd

amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh

cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh

jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh

Either S2 or S3 or both may consist of multiple-action statements. If either of them does, the group of statements is enclosed in curly braces. Curly braces may be put wherever you wish as long as they enclose the action. The rule of thumb: if it's one statement, the braces are optional. More than one and they're required.

You can also use multiple else clauses. The car sales example gets one field longer each month. The first two fields are always the salesperson's name and the last field is the accumulated annual total, so it is possible to calculate the month by the value of NF:

if(NF==4) month="Jan"

else if(NF==5) month="Feb"

else if(NF==6) month="March"

else if(NF==7) month="April"

else if(NF==8) month="May" # and so on

NOTE: Whatever the value of NF, at most one branch of the chain executes. As soon as one condition is true, its statement runs and the remaining else clauses are skipped.
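A runnable version of the month chain might look like the sketch below. The two sample records and the catch-all value later are assumptions for illustration. The comparison must use ==, because a single = inside the parentheses would assign a new value to NF rather than test it.

```shell
# Hypothetical records: two name fields, monthly figures, then a running total.
out=$(printf '%s\n' 'John Anderson 12 12' 'Joe Turner 10 25 35' | awk '{
    if (NF == 4)      month = "Jan"     # == compares; a single = would assign to NF
    else if (NF == 5) month = "Feb"
    else              month = "later"
    print $1, $2, month
}')
echo "$out"
# John Anderson Jan
# Joe Turner Feb
```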

The Conditional Statement

Nawk++ also has a conditional statement, really just shorthand for an if statement. It takes the format shown and uses the same conditional operator found in C:

E1 ? S2 : S3

Here, E1 is an expression, and S2 and S3 are single expressions rather than full statements (the construct yields a value, so it can be used wherever an expression can, such as inside a print statement). When it encounters a conditional statement, awk evaluates it in the same order as an if statement: first E1 is evaluated; if E1 is nonzero or nonnull (true), S2 is evaluated; if E1 is zero or null (false), S3 is evaluated. Only one of S2 and S3 is chosen, never both.

The conditional statement is a good place for the programmer to provide error messages. Return to the gawk employee example. When we wanted to differentiate between hourly and salaried employees, we had a big if-else statement:

{ if ($1 == 1) print "Salaried employee "$2,$4" is "$6" years old.";

  else         print "Hourly   employee "$2,$4" is "$6" years old."

}

In fact, there's an easier way to do this with conditional statements:

{ print ($1==1 ? "Salaried" : "Hourly  ") " employee "$2,$4" is "$6" years old." }

CAUTION: Remember the conditional statement is not part of original awk!
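The sketch below shows the conditional operator on its own, with its result stored in a variable before printing. The two input records (a 1 or 0 flag followed by a name) and the variable name kind are assumptions made for the example; it needs nawk, gawk, or another POSIX-style awk.

```shell
# Hypothetical records: a flag (1 = salaried, 0 = hourly) followed by a name.
out=$(printf '%s\n' '1 Swensen' '0 Mitchel' | awk '{
    kind = ($1 == 1) ? "Salaried" : "Hourly"   # the value of ?: is assigned to kind
    print kind, "employee", $2
}')
echo "$out"
# Salaried employee Swensen
# Hourly employee Mitchel
```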

At first glance, and for short statements, the if statement appears identical to the conditional statement. On closer inspection, the statement you should use in a specific case differs. Either is fine for choosing between two single statements, but the if statement is required for more complicated situations, such as when S2 and S3 are multiple statements. Use if for multiple else statements (the first example), or for a condition inside a condition like the second example below:

{ if (NR == 100)

     { print ""

     print "This is the 100th record"

     print $0

       print

     }

}

{ if($1==0)

     if(name~/Fred/)

          print "Fred is broke" }

Patterns as Conditions

As if that does not provide ample choice, notice that the program relying on pattern-matching (had I chosen that method) produces the same output. Look at the program and its output.

$ cat lowsales.awk

BEGIN      {OFS="\t"}

$(NF-1) <= 7    {print $1, $(NF-1), "Check Sales"     }

$(NF-1) > 7     {print $1, $(NF-1)     }     # Next to last field

$ awk -f lowsales.awk emp.data

John Anderson     7     Check Sales

Joe Turner        15

Susan Greco       18

Bob Burmeister    17

Since the two patterns above are nonoverlapping and one immediately follows the other, the two programs accomplish the same thing. Which to use is a matter of programming style. I find the conditional statement or the if statement more readable than two patterns in a row. When you are choosing whether to use the nawk conditional statement or the if statement because you're concerned about printing two long messages, using the if statement is cleaner. Above all, if you choose to use the conditional statement, keep in mind you can't use awk; you must use nawk or gawk.

Loops

People often write programs to perform a repetitive task or several repeated tasks. These repetitions are called loops. Loops are the subject of this section. The loop structures of awk very much resemble those found in C. First, let's look at a shortcut in counting with ++ notation. Then I'll show you the ways to program loops in awk. The looping constructs of awk are the do (nawk), for, and while statements. As with multiple-action groups in an if statement, curly braces ({}) surround a group of action statements associated in a loop. Without curly braces, only the statement immediately following the keyword is considered part of the loop.


TIP: Forgetting curly braces is a common looping error.

The section concludes with a discussion of how (and some examples of why) to interrupt a loop.

Increment and Decrement

As stated earlier, assignment statements take the form x = y, where the value y is being assigned to x. Awk has some shorthand methods of writing this. For example, to add a monthly sales total to the car sales file, you'll need to add a variable to keep a running total of the sales figures. Call it total. You need to start total at zero and add each $(NF-1) as read. In standard programming practice, that would be written total = total + $(NF-1). This is okay in awk, too. However, a shortened format of total += $(NF-1) is also acceptable.

Adding or subtracting one is common enough to have its own shorthand. The statements line += 1 and line -= 1 (line = line + 1 and line = line - 1 in longhand) are called increment and decrement, respectively, and can be further shortened to the simpler line++ and line--. At any reference to a variable, you can not only use this notation but even vary whether the action is performed immediately before or after the value is used in that statement. This is called prefix and postfix notation, and is represented by ++line and line++.

For clarity's sake, focus on increment for a moment. Decrement functions the same way using subtraction. Using the ++line notation tells awk to do the addition before doing the operation indicated in the line. Using the postfix form says to do the operation in the line, then do the addition. Sometimes the choice does not matter; keeping a counter of the number of sales people (to later calculate a sales average at the end of the month) requires a counter of names. The statements totalpeople++ and ++totalpeople do the same thing and are interchangeable when they occupy a line by themselves. But suppose I decide to print the person's number along with his or her name and sales. Adding either of the second two lines below to the previous example produces different results based on starting both at totalpeople=1.

$ cat awkscript.v1

BEGIN { totalpeople = 1 }

{print ++totalpeople, $1, $(NF-1)     }

$ cat awkscript.v2

BEGIN { totalpeople = 1 }

{print totalpeople++, $1, $(NF-1)     }

The first example will actually have the first employee listed as #2, since the totalpeople variable is incremented before it's used in the print statement. By contrast, the second version will do what we want because it'll use the variable value, then afterwards increment it to the next value.
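The difference between the two forms can be checked in isolation with a BEGIN-only program, which needs no input file. The variable names a and b are placeholders chosen for this sketch.

```shell
out=$(awk 'BEGIN {
    a = 1; print ++a    # prefix: a becomes 2, then 2 is printed
    b = 1; print b++    # postfix: 1 is printed, then b becomes 2
    print b             # 2
}')
echo "$out"
```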


TIP: Be consistent. Either is fine, but stick with one numbering system or the other, and there is less likelihood that you will accidentally enter a loop an unexpected number of times.

The While Statement

Awk provides the while statement for general looping. It has the following form:

while(E1)

     S1

Here, E1 is an expression (a condition), and S1 is either one action statement or a group of action statements enclosed in curly braces. When awk meets a while statement, E1 is evaluated. If E1 is true, S1 executes from start to finish, then E1 is again evaluated. If E1 is true, S1 again executes. The process continues until E1 is evaluated to false. When it does, execution continues with the next action statement after the loop. Consider the program below:

{ while ($0~/M/)

     print

}

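Be careful with a loop like this one: nothing in its body changes $0, so for any record containing M the condition stays true and the loop never ends. Typically the condition (E1) tests a variable, and the variable is changed in the while loop.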

{ i=1

  while (i<20)

     {  print i

      i++

     }

}

This second code snippet will print the numbers from 1 to 19, then once the while loop tests with i=20, the condition of i<20 will become false and the loop will be done.

The Do Statement

Nawk++ provides the do statement for looping in addition to the while statement. The do statement takes the following form:

do

     S

while (E)

Here, S is either a single statement or a group of action statements enclosed in curly braces, and E is the test condition. When awk comes to a do statement, S is executed once, and then condition E is tested. If E evaluates to nonzero or nonnull, S executes again, and so on until the condition E becomes false. The difference between the do and the while statement rests in their order of evaluation. The while statement checks the condition first and executes the body of the loop if the condition is true. Use the while statement to check conditions that may be initially false. For instance, while (not end-of-file(input)) is a common example. The do statement executes the loop first and then checks the condition. Use the do statement when testing a condition which depends on the first execution to meet the condition.

The do statement can be imitated using the while statement. Put the code that is in the loop before the condition as well as in the body of the loop.
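The order of evaluation is easiest to see when the condition is false from the start. In this sketch, which assumes an awk that supports do (nawk, gawk, or another POSIX-style awk), both loops test the same condition on a counter that already equals 5; the counter names and messages are made up for the example.

```shell
out=$(awk 'BEGIN {
    i = 5
    do { print "do body, i =", i; i++ } while (i < 5)     # body runs once before the test

    j = 5
    while (j < 5) { print "while body, j =", j; j++ }     # test fails first, body never runs
    print "done"
}')
echo "$out"
# do body, i = 5
# done
```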

The For Statement

The for statement is a compacted while loop designed for counting. Use it when you know ahead of time that S is a repetitive task and the number of times it executes can be expressed as a single variable. The for loop has the following form:

for(pre-loop-statements;TEST;post-loop-statements)

Here, pre-loop-statements usually initialize the counting variable; TEST is the test condition; and post-loop-statements indicate any loop variable increments.

For example,

{ for(i=1; i<=30; i++) print i }

This is a succinct way of saying initialize i to 1, then continue looping while i<=30, and incrementing i by one each time through. The statement executed each time simply prints the value of i. The result of this statement is a list of the numbers 1 through 30.


TIP: The condition test should either be < 21 or <= 20 to execute the loop 20 times. A test for equality or inequality (== or !=) is not a good test condition. Changing the loop to the line below illustrates why.

{ for (i=1; i!=20; i+=2) print i }

Each iteration of the loop adds 2 to the value of i. i goes from 1 to 3 to 5 to 7 and so on to 19 to 21, never having a value of 20. Consequently, the test i!=20 is always true and you have an infinite loop; it never stops.

The for loop can also be used involving loops of unknown size:

for (i=1; i<=NF; i++)

     print $i

This prints each field on a unique line. True, you don't know what the number of fields will be, but you do know NF will contain that number.

The for loop does not have to be incremented; it could be decremented instead:

$ awk -F: '{ for (i = NF; i > 0; --i) print $i }' sales.data

This prints the fields in reverse order, one per line.
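A variation on the same decrementing loop, sketched below, puts the reversed fields back on a single line by using printf and choosing the separator with the conditional operator. The sample record is an assumption for illustration, and the conditional operator requires nawk or a later awk.

```shell
# Hypothetical colon-separated record.
out=$(printf 'a:b:c\n' |
      awk -F: '{ for (i = NF; i > 0; --i) printf "%s%s", $i, (i > 1 ? " " : "\n") }')
echo "$out"    # c b a
```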

Loop Control

The loop control value is normally an integer. Because of the desire to create easily readable code, most programmers try to avoid branching out of loops midway. If you need to, however, awk offers two ways to do it: break and continue. Sometimes unexpected or invalid input leaves little choice but to exit the loop or have the program crash, something a programmer strives to avoid. Input errors are one accepted time to use the break statement. For instance, when reading the car sales data into the array name, I wrote the program expecting five fields on every line. If something happens and a line has the wrong number of fields, the program is in trouble. A way to protect your program from this is to have code like:

{ for(i=1; i<=NF; i++) {

     if (NF != 5) {

          print "Error on line " NR ": invalid input...leaving loop."

          break  }

     # continue with program code...

  }

}

The break statement terminates only the loop. It is not equivalent to the exit statement which transfers control to the END statement of the program. I handle the problem as shown on the CD-ROM in file LIST15_1.
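To see that break leaves only the loop and not the action, consider this sketch, which adds up the fields of a record until it meets one that is not a plain number. The input line, the validity test, and the messages are assumptions made for the example.

```shell
out=$(printf '1 2 x 4\n' | awk '{
    sum = 0
    for (i = 1; i <= NF; i++) {
        if ($i !~ /^[0-9]+$/) {            # not a plain non-negative integer
            print "Error on line " NR ": field " i " is invalid...leaving loop."
            break                           # leaves the for loop only
        }
        sum += $i
    }
    print "sum so far: " sum               # still runs after the break
}')
echo "$out"
# Error on line 1: field 3 is invalid...leaving loop.
# sum so far: 3
```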


TIP: The ideal error message depends, of course, on your application, the knowledge of the end users, and the likelihood they will be able to correct the error.

As another use for the break statement consider do S while (1). It is an infinite loop depending on another way out. Suppose your program begins by displaying a menu on screen. (See the LIST 15_2 file on the CD-ROM.)

The above example shows an infinite loop controlled with the break statement giving the end user a way out.


NOTE: The built-in nawk function getline does what it seems. For the point of the example take it on faith that it returns a character.

The continue statement causes execution to skip the remainder of the current iteration in the do, while, and for statements. In the do and while statements, control transfers to the evaluation of the test condition. In the for loop, control goes to the post-loop-statements. When is this of use? Consider computing a true sales ratio by calculating the amount sold and dividing that number by hours worked.

Since this is all kept in separate files, the simplest way to handle the task is to read the first list into an array, calculate the figure for the report, and do whatever else is needed.

FILENAME=="total"          read each $(NF-1) into monthlytotal[i]

FILENAME=="per"            with each i

                              monthlytotal[i]/$2

whatever else

But what if $2 is 0? The program will crash because dividing by 0 is an illegal operation. While it is unlikely that an employee will miss an entire month of work, it is possible. So, it is a good idea to allow for the possibility. This is one use for the continue statement. The above program segment expands to Listing 15.1.

BEGIN         { star = 0

          # other stuff...

}

FILENAME=="total"         { monthlyttl[FNR]=$(NF-1) }     # one total per employee

FILENAME=="per"           { hours[FNR]=$2; n=FNR }        # hours worked, same order

END   { for(i=1;i<=n;i++) {

          if(hours[i] == 0)   {

               print "*"

               star++

               continue }     # skip the division for this employee

          print monthlyttl[i]/hours[i]

          # whatever else

        }

        if(star>=1)

          print "* indicates employee did not work all month."

}

The above program makes some assumptions about the data in addition to assuming valid input data. What are these assumptions and more importantly, how do you fix them? The data in both files is assumed to be the same length, and the names are assumed to be in the same order.

Recall that in awk, array subscripts are stored as strings. Since each list contains a name and its associated figure, you can match names. Before running this program, run the UNIX sort utility to ensure the files have the names in alphabetical order (see "Sorting Text Files" in Chapter 6). After making changes, use file LIST15_4 on the CD-ROM.
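One way to drop both assumptions is to use the employee name, rather than a position, as the array subscript. The following is only a sketch and is not the LIST15_4 program from the CD-ROM: the file names total and per come from the outline above, but the sample data and its layout (name in $1, monthly total in $(NF-1), hours in $2) are assumptions made for illustration.

```shell
tmp=$(mktemp -d)
# Hypothetical sample data: name, monthly total, running total
printf 'Ann 900 4000\nBob 300 1200\nCy 500 2500\n' > "$tmp/total"
# Hypothetical sample data: name, hours worked (different order, one extra name)
printf 'Cy 100\nAnn 0\nBob 150\nDee 80\n' > "$tmp/per"

ratios=$(cd "$tmp" && awk '
FILENAME == "total" { monthlyttl[$1] = $(NF-1); next }    # subscript is the name
FILENAME == "per" {
    if (!($1 in monthlyttl)) { print $1, "no sales record"; next }
    if ($2 == 0)             { print $1, "*"; next }       # avoid dividing by 0
    print $1, monthlyttl[$1] / $2
}' total per)
printf '%s\n' "$ratios"
rm -rf "$tmp"
```

Because each figure is looked up by name, the two files no longer need the same length or the same order.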

Strings

There are two primary types of data that awk can work with: numeric values, and sequences of characters and digits that comprise words, phrases, or sentences. The latter are called strings within awk and most other programming languages. For instance, "now is the time for all good men" is a string. A string constant is always enclosed in double quotes (""). It can be almost any length (the exact limit varies from UNIX version to version).

One of the important string operations is called concatenation. The word means putting together. When you concatenate two strings you are creating a third string that is the combination of string1, followed immediately by string2. To perform concatenation in awk simply leave a space between two strings.

print "My name is" "Ann."

This prints the line:

My name isAnn.

(To ensure that a space is included you can either use a comma in the print statement or simply add a space to one of the strings: print "My name is " "Ann").

Built-In String Functions

As a rule, awk returns the leftmost, longest string in all its functions. This means that it first finds the match that starts earliest (farthest to the left); then, from that starting point, it collects the longest string possible. For instance, if the pattern you are looking for is "y+" in the string "the guyys knew it", the match is "yy" rather than "y". In the string "any of the guyys knew it", however, the match is the single "y" in "any", because leftmost takes priority over longest.
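A quick way to see which substring awk actually picks is to ask the nawk function match(), which reports the start and length of the match in RSTART and RLENGTH. This is a small sketch; the pattern y+ and both test strings are chosen only for illustration.

```shell
# Leftmost wins: the lone y in "any" starts before the "yy" in "guyys"
first=$(awk 'BEGIN { s = "any of the guyys knew it"
                     match(s, /y+/)
                     print RSTART, RLENGTH, substr(s, RSTART, RLENGTH) }')
# Longest wins from that starting point: "yy", not "y"
second=$(awk 'BEGIN { s = "the guyys knew it"
                      match(s, /y+/)
                      print RSTART, RLENGTH, substr(s, RSTART, RLENGTH) }')
printf '%s\n%s\n' "$first" "$second"
```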

Let's consider the different string functions available, organized by awk version.

Awk

The original awk contained few built-in functions for handling strings. The length function returns the length of the string. It has an optional argument. If you use the argument, it must follow the keyword and be enclosed in parentheses: length(string). If there is no argument, the length of $0 is the value. For example, it is difficult to determine from some screen editors if a line of text stops at 80 characters or wraps around. The following invocation of awk aids by listing just those lines that are longer than 80 characters in the specified file.

$ awk '{ if (length > 80) { print NR ": " $0 } }' file-with-long-lines

The other string function available in the original awk is substring, which takes the form substr(string,position,len) and returns the len length substring of the string starting at position.


NOTE: A disagreement exists over which functions originated in awk and which originated in nawk. Consult your system for the final word on awk string functions. The functions in nawk are fairly standard.

Nawk

When awk was expanded to nawk, many built-in functions were added for string manipulation while keeping the two from awk. The function gsub(r, s, t) substitutes string s into target string t every time the regular expression r occurs and returns the number of substitutions. The target t must be a variable, field, or array element, because it is changed in place; if t is not given, gsub() uses $0. For instance, if the variable name holds "Randall", gsub(/l/, "y", name) turns name into Randayy and returns 2. The g in gsub means global because all occurrences in the target string change.

The function sub(r, s, t) works like gsub(), except the substitution occurs only once. Thus sub(/l/, "y", name) changes "Randall" to "Randayl" and returns 1. The place the substring t occurs in string s is returned with the function index(s, t): index("Chris", "i") returns 4. As you'd expect, the return value is zero if substring t is not found. The function match(s, r) returns the position in s where the regular expression r occurs. It returns the index where the matching substring begins or 0 if there is no match. It also sets the values of RSTART (that same index) and RLENGTH (the length of the match).

The split function separates a string into parts. For example, if your program reads in a date as 5-10-94, and later you want it written May 10, 1994, the first step is to divide the date appropriately. The built-in function split does this: split("5-10-94", store, "-") divides the date, and sets store["1"] = "5", store["2"] = "10" and store["3"] = "94". Notice that here the subscripts start with "1" not "0".
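The nawk string functions described above can be tried out in a single BEGIN action. This is a sketch for experimenting; the variable names (name, n, pos, count) are mine, and it assumes a nawk-or-later awk.

```shell
strings=$(awk 'BEGIN {
    name = "Randall"
    n = gsub(/l/, "y", name)             # changes name in place, returns a count
    print n, name
    name = "Randall"
    n = sub(/l/, "y", name)              # only the first l changes
    print n, name
    print index("Chris", "i")            # position of "i" within "Chris"
    pos = match("Chris", /r.s/)          # also sets RSTART and RLENGTH
    print pos, RSTART, RLENGTH
    count = split("5-10-94", store, "-") # returns the number of pieces
    print count, store[1], store[2], store[3]
    print length("Chris"), substr("Chris", 2, 3)
}')
printf '%s\n' "$strings"
```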

POSIX Awk

The POSIX awk specification added two built-in functions for use with strings. They are tolower(str) and toupper(str). Both functions return a copy of the string str with the alphabetic characters converted to the appropriate case. Non-alphabetic characters are left alone.
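As a small illustration (the sample strings are made up), note that the digits and the space pass through unchanged:

```shell
cased=$(awk 'BEGIN { print toupper("awk 15"), tolower("NAWK") }')
printf '%s\n' "$cased"
```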

Gawk

Gawk provides two functions returning time-related information. The systime() function returns the current time of day as the number of seconds since midnight UTC (Coordinated Universal Time, the successor to Greenwich Mean Time), January 1, 1970, on POSIX systems. The function strftime(f, t), where f is a format and t is a timestamp of the same form as returned by systime(), returns a formatted timestamp similar to the ANSI C function strftime().

String Constants

String constants are the way awk identifies a non-keyboard, but essential, character. Since they are strings, when you use one, you must enclose it in double quotes (""). These constants may appear in printing or in patterns involving regular expressions. For instance, the following command prints all lines less than 80 characters long that don't begin with a tab. See Table 15.3.

awk 'length < 80 && !/^\t/' another-file-with-long-lines

Expression    Meaning

\\            The way of indicating to print a backslash.
\a            The "alert" character; usually the ASCII BEL.
\b            A backspace character.
\f            A formfeed character.
\n            A newline character.
\r            Carriage return character.
\t            Horizontal tab character.
\v            Vertical tab character.
\x            Indicates the following value is a hexadecimal number.
\0            Indicates the following value is an octal number.

Arrays

An array is a method of storing pieces of similar data in the computer for later use. Suppose your boss asks for a program that reads in the name, social security number, and a bunch of personnel data to print check stubs and the detachable check. For three or four employees keeping name1, name2, etc. might be feasible, but at 20, it is tedious and at 200, impossible. This is a use for arrays! See file LIST15_5 on the CD-ROM.


NOTE: Since the first input record is the checkdate, the total lines (NR) is not the number of checks to issue. I could have used NR-1, but I chose clarity over brevity.

Much easier, cleaner, and quicker! It also works for any number of employees without code changes. Awk only supports single-dimension arrays. (See the section "Advanced Concepts" for how to simulate multiple-dimensional arrays.) That and a few other things set awk arrays apart from the arrays of other programming languages. This section focuses on arrays; I will explain their use, then discuss their special property. I conclude by listing three features of awk (a built-in function, a built-in variable, and an operator) designed to help you work with arrays.

Arrays in awk, like variables, don't need to be declared. Further, no indication of size must be given ahead of time; in programming terms, you'd say arrays in awk are dynamic. To create an array, give it a name and put its subscript after the name in square brackets ([]): name[2] from above, for instance. Array subscripts are also called the indices of the array; in name[2], 2 is the index to the array name, and it accesses the one name stored at location 2.


NOTE: One peculiarity in awk is that elements are not stored in the order they are entered, so do not count on a for (subscript in array) loop visiting them in any particular order.

Awk arrays are different from those of other programming languages because in awk, array subscripts are stored as strings, not numbers. Technically, the term is associative arrays, and it's unusual in programming languages. Be aware that the use of strings as subscripts can confuse you if you think purely in numeric terms. Compared as strings, "3" > "15" (because the character 3 sorts after the character 1), so in string order the subscript "15" comes before "3", even though numerically 3 < 15.

Since subscripts are strings, a subscript can be a field value. grade[$1]=$2 is a valid statement, as is salary["John"].

Array Specialties

Nawk++ has additions specifically intended for use with arrays. The first is a test for membership. Suppose Mark Turner enrolled late in a class I teach, and I don't remember if I added his name to the list I keep on my computer. The following program checks the list for me.

{ name[$1 " " $2] = 1 }     # in tests subscripts, so the full name is the subscript

END { if ("Mark Turner" in name)

      print "He's enrolled in the course!"

    }

The delete statement is built in and removes array elements from computer memory. To remove an element, for example, you could use the command delete name[1].


CAUTION: Once you remove an element from memory, it's gone, and it ain't coming back! When in doubt, keep it.
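The membership test and delete can be seen working together in a short sketch. It assumes a nawk-or-later awk and a made-up class list with the first and last name in the first two fields; the array name enrolled is mine.

```shell
roster=$(printf 'Mark Turner\nAnn Marshall\n' | awk '
{ enrolled[$1 " " $2] = 1 }                   # the subscript is the full name
END {
    if ("Mark Turner" in enrolled) print "enrolled"
    delete enrolled["Mark Turner"]             # remove just that element
    if (!("Mark Turner" in enrolled)) print "gone after delete"
    if ("Ann Marshall" in enrolled) print "others untouched"
}')
printf '%s\n' "$roster"
```

Note that in looks at subscripts, not at the values stored in the array, which is why the name itself is used as the subscript here.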

Although technology is advancing and memory is not the precious commodity it once was, it is still a good idea to clean up after yourself when you write a program. Think of the check printing program above. Two hundred names won't fill the memory. But suppose your program controls all personnel activity: it writes checks and check stubs, adds and deletes employees, and charts sales. Then it's better to update each file to disk and remove the arrays not in use. Doing so consumes less memory and minimizes the chance of using an array of old data for a new task. The clean-up is most easily done in the END action:

END  {i= totalemps

     while(i>0) {

          delete name[i]

          delete data[i--] }

     }

Nawk++ creates another built-in variable for use when simulating multidimensional arrays. More on its use appears later, in the section "Advanced Concepts." It is called SUBSEP and has a default value of "\034". To add this variable to awk, just create it in your program:

BEGIN { SUBSEP = "\034" }

Recall that in awk, array subscripts are stored as strings. Since each list contains a name and its associated figure, you can match names and hence match files. Here are the answers to the question about using two files and assuring they have the same order (from the car sales example earlier). Before running this program, run the UNIX sort utility to ensure the files have the names in alphabetical order. (See "Sorting Text Files" in Chapter 6.) After making changes, use the program in file LIST15_6 on the CD-ROM.

Arithmetic

Although awk is primarily a language for pattern matching, and hence, text and strings pop into mind more readily than math and numbers, awk also has a good set of math tools. In this section, first I show the basics, then we look at the math functions built into awk.

Operators

Awk supports the usual math operations. The expression x^y is x superscript y, that is, x to the y power. The % operator calculates remainders in awk: x%y is the remainder of x divided by y, and the result is machine-dependent. All math uses floating point, and numbers are equivalent no matter which format they are expressed in, so 100 = 1.00e+02.

The math operators in awk consist of the four basic operations: + (addition), - (subtraction), / (division), and * (multiplication), plus ^ and % for exponentiation and remainder.
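To see the operators at work, here is a one-line sketch; the values 7 and 2 are arbitrary.

```shell
ops=$(awk 'BEGIN { x = 7; y = 2
                   print x + y, x - y, x * y, x / y, x ^ y, x % y
                   print (100 == 1.00e+02) }')     # 1 means the two are equal
printf '%s\n' "$ops"
```

The division yields 3.5 rather than 3 because all awk arithmetic is floating point.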

As you saw earlier in the most recent sales example, fields can be used in arithmetic too. If, in the middle of the month, my boss asks for a list of the names and latest monthly sales totals, I don't need to panic over the discarded figures; I can just print a new list. My first shot seems simple enough (Listing 15.2).

BEGIN      {OFS="\t"}

{          print $1, $2, $6 }          # field #6 = May

Then a thought hits. What if my boss asks for the same thing next month? Sure, changing a field number each month is not a big deal, but is it really necessary?

I look at the data. No matter what month it is, the current month's totals are always the next to last field. I start over with the program in Listing 15.3.

BEGIN      {OFS="\t"}

{          print $1,$2, $(NF-1) }  

TIP: Again, watch yourself because awk lets you get away with murder. If I forgot the parentheses on the last statement above, rather than get a monthly total, I would print a list of the running total minus 1, because awk reads $NF-1 as ($NF)-1! Also, rather than generate an error, if I mistype $(NF-1) and get $(NF+1) (not hard to do using the number pad), awk treats the nonexistent field (here, field number NF + 1) as the null string. In this case, it prints a blank where the total should be.
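The three spellings are easy to compare on a single made-up record whose last field is a running total:

```shell
line='Ann Smith 100 200 300 600'
with=$(echo "$line" | awk '{ print $(NF-1) }')         # next-to-last field
without=$(echo "$line" | awk '{ print $NF-1 }')        # last field, minus 1
beyond=$(echo "$line" | awk '{ print "[" $(NF+1) "]" }')  # nonexistent field
printf '%s\n%s\n%s\n' "$with" "$without" "$beyond"
```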

Another use for arithmetic concerns assignment. Field variables may be changed by assignment. Given the following file, the statement $3 = 7 is a valid statement and produces the results below:

$ cat inputfile

1 2

3 4

5 6

7 8

9 10

$ awk '{$3 = 7; print}' inputfile

1 2 7

3 4 7

5 6 7

7 8 7

9 10 7

NOTE: The above statement forces $0 and NF values to change. Awk recalculates them as it runs.

If I run the following program, five lines appear on the monitor, showing the new values. (In the END action, POSIX awk leaves $0 and NF as they were for the last record read.)

     {   if(NR==1)

          print $0, NF  }

     { if (NR >= 2 && NR <= 4) { $3=7; print $0, NF } }

END {print $0, NF }

Now when we run the data file through awk here's what we see:

$ awk -f newsample.awk inputfile

1 2 2

3 4 7 3

5 6 7 3

7 8 7 3

9 10 2

Numeric Functions

Awk has a well-rounded selection of built-in numeric functions. As before in the sections on "Built-in Variables" and "Strings," the functions build on each other beginning with those found in awk.

Awk

To start, awk has built-in functions exp(exp), log(exp), sqrt(exp), and int(exp) where int() truncates its argument to an integer.

Nawk

Nawk added further arithmetic functions to awk. It added atan2(y,x) which returns the arctangent of y/x. It also added two random number generator functions: rand() and srand(x). There is also some disagreement over which functions originated in awk and which in nawk. Most versions have all the trigonometric functions in nawk, regardless of where they first appeared.
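A brief sketch of the numeric functions follows. The arguments are arbitrary, and since rand() values differ from one implementation to the next, the example checks only that the result falls in the expected range.

```shell
math=$(awk 'BEGIN {
    print int(3.9), int(-3.9), sqrt(16), exp(0), log(1)
    print atan2(0, -1)                 # pi, shown in the default output format
    srand(1); r = rand()               # seed, then draw one random number
    print (r >= 0 && r < 1)            # 1 if r is in the range [0, 1)
}')
printf '%s\n' "$math"
```

Note that int() truncates toward zero, so int(-3.9) is -3, not -4.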

Input and Output

This section takes a closer look at the way input and output function in awk. I examine input first and look briefly at the getline function of nawk++. Next, I show how awk output works, and the two different print statements in awk: print and printf.

Input

Awk handles the majority of input automatically; there is no explicit read statement, unlike most programming languages. Each pattern action statement of the program is applied to each input record in the order the records appear in the input file. If the input file has 20 records then the first pattern action statement in the program looks for a match 20 times. The next statement causes awk to read the next input record immediately and start over with the first pattern action statement, without trying the current record against the remaining pattern action statements. The exit statement acts as if all input has been processed. When awk encounters an exit statement, control goes to the END pattern action statement, if there is one.
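The effect of next and exit on the flow of input can be watched with a few made-up records. This is a sketch; the words skip and stop are just markers in the sample input.

```shell
flow=$(printf 'a\nskip\nb\nstop\nc\n' | awk '
$1 == "skip" { next }          # drop this record and read the next one
$1 == "stop" { exit }          # behave as if the input had run out
{ print "saw " $1 }
END { print "END ran after " NR " records" }')
printf '%s\n' "$flow"
```

The record c is never read, yet the END action still runs.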

The Getline Statement

One addition, when awk was expanded to nawk, was the built-in function getline. It is also supported by the POSIX awk specification. The function may take several forms. At its simplest, it's written getline. When written alone, getline retrieves the next input record and splits it into fields as usual, setting FNR, NF and NR. The function returns 1 if the operation is successful, 0 if it is at the end of the file (EOF), and -1 if the function encounters an error. Thus,

while (getline == 1)

simulates awk's automatic input.

Writing getline variable reads the next record into variable (getline char from the earlier menu example, for instance). Field splitting does not take place, so $0 and NF are unchanged, but FNR and NR are incremented. Either of the above two forms may take its input from a file other than the one containing the input records by appending < "filename" to the end of the command. Furthermore, getline char < "-" takes the input from standard input, normally the keyboard. As you'd expect, neither FNR nor NR is affected when the input is read from another file. You can also write either of the two forms taking the input from a command: "command" | getline or "command" | getline variable.
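The forms of getline can be compared side by side. In this sketch the file name extra, the sample records, and the variable names line and word are all assumptions made for the example.

```shell
tmp=$(mktemp -d)
printf 'first line\nsecond line\n' > "$tmp/extra"
got=$(cd "$tmp" && printf 'one two\nthree four five\n' | awk '
NR == 1 {
    getline                          # next input record replaces $0; NF, NR change
    print "plain:", $0, NF, NR
    getline line < "extra"           # one record from another file into a variable
    print "file:", line, NF, NR      # NF and NR are unchanged
    "echo hello" | getline word      # one line from the output of a command
    print "cmd:", word
}')
printf '%s\n' "$got"
rm -rf "$tmp"
```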

Output

There are two forms of printing in awk: the print statement and the printf statement. Until now, I have used the print statement. It is the fallback. There are two forms of the print statement. One has parentheses; one doesn't. So, print $0 is the same as print($0). In awk shorthand, the statement print by itself is equivalent to print $0. As shown in an earlier example, a blank line is printed with the statement print "". Use the format you prefer.


NOTE: print() is not accepted shorthand; it generates a syntax error.

Nawk requires parentheses if the print statement involves a relational operator.

For a simple example consider file1:

$ cat file1

1     10

3     8

5     6

7     4

9     2

10    0

The command line

$ nawk 'BEGIN {FS="\t"}; {print($1>$2)}' file1

shows

0

0

0

1

1

1

on the monitor.

Knowing that 0 indicates false and 1 indicates true, the above is what you'd expect, but most programming languages won't print the result of a relation directly. Nawk will.


NOTE: This requires nawk or later. Trying the above in awk results in a syntax error.

Nawk prints the results of relations with both print and printf. Both print and printf require the use of parentheses when a relation is involved, however, to distinguish between > meaning greater than and > meaning the redirection operator.

The printf Statement

printf is used when the use of formatted output is required. It closely resembles C's printf. Like the print statement, it comes in two forms: with and without parentheses. Either may be used, except the parentheses are required when using a relational operator. (See below.)

printf format-specifier, variable1,variable2, variable3,..variablen

printf(format-specifier, variable1,variable2, variable3,..variablen)

The format specifier is always required with printf. It contains any literal text plus a display format for each value you want to print. Each display format begins with a %. Any combination of three modifiers may follow the %: a - indicates the value should be left justified within its field; a number gives the minimum total width of the field, and if that number begins with a 0 the field is padded with 0s rather than blanks (%05d prints an integer 5 wide, padded with 0s as needed); the last modifier is .number, whose meaning depends on the type of value: it is either the maximum number of characters printed from a string or the number of digits to the right of the decimal point. After zero or more modifiers, the display format ends with a single character indicating the type of value to display.


TIP: And yes, numbers can be displayed as characters and nondigit strings can be displayed as a number. With printf anything goes!

Remember the format specifier has a string value and since it does, it must always be enclosed in double quotes("), whether it is a literal string such as

printf("This is an example of a string in the display format.")

or a combination,

printf("This is the %d example", occurrence)

or just a variable

printf("%d", occurrence).

NOTE: The POSIX awk specification (and hence gawk) supports the dynamic field width and precision modifiers like ANSI C printf() routines do. To use this feature, place an * in place of either of the actual display modifiers and the value will be substituted from the argument list following the format string. Neither awk nor nawk has this feature.

Before I go into detail about display format modifiers, I will show the characters used for display types. The following list shows the format specifier types without any modifiers.

Format    Meaning

%c        An ASCII character
%d        A decimal number (an integer, no decimal point involved)
%i        Just like %d (remember i for integer)
%e        A floating point number in scientific notation (1.000000e+01)
%f        A floating point number (10001010.434)
%g        awk chooses between the %e and %f display formats, selecting the one that produces the shorter string. Nonsignificant zeros are not printed.
%o        An unsigned octal (base 8) number
%s        A string
%x        An unsigned hexadecimal (base 16) number
%X        Same as %x, but letters are uppercase rather than lowercase.


NOTE: If the argument used for %c is numeric, it is treated as a character code and the corresponding character is printed. Otherwise, the argument is assumed to be a string and only the first character of that string is printed.

Look at some examples without display modifiers. When the file file1 looks like this:

$ cat file1

34

99

-17

2.5

-.3

the command line

awk '{printf("%c %d %e %f\n", $1, $1, $1, $1)}' file1

produces the following output:

" 34 3.400000e+01 34.000000

c 99 9.900000e+01 99.000000

_ -17 -1.700000e+01 -17.000000

_ 2 2.500000e+00 2.500000

 0 -3.000000e-01 -0.300000

By contrast, a slightly different format string produces dramatically different results with the same input:

$ awk '{printf("%g %o %x\n", $1, $1, $1)}' file1

34 42 22

99 143 63

-17 37777777757 ffffffef

2.5 2 2

-0.3 0 0

Now let's change file1 to contain just a single word:

$ cat file1

Example

The string above has seven characters. For clarity, I have used * instead of a blank space so the total field width is visible on paper.

printf("%s\n", $1)

     Example

printf("%9s\n", $1)

     **Example

printf("%-9s\n", $1)

     Example**

printf("%.4s\n", $1)

     Exam

printf("%9.4s\n", $1)

     *****Exam

printf("%-9.4s\n", $1)

     Exam*****
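The same modifiers can be combined in one format string. In this sketch tr turns the blanks into * so the padding is visible, as in the examples above; the values 42, 3.14159, and 65 are arbitrary.

```shell
fmt=$(awk 'BEGIN {
    printf("%9.4s|%-9.4s|%05d|%.2f|%c\n", "Example", "Example", 42, 3.14159, 65)
}' | tr ' ' '*')
printf '%s\n' "$fmt"
```

The last field shows %c given a number: 65 is the ASCII code for A.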

One topic pertaining to printf remains. The function printf was written so that it writes exactly what you tell it to write, and how you want it written, no more and no less. That is acceptable until you realize that you can't enter every character you may want to use from the keyboard. Awk uses the same escape sequences found in C for nonprinting characters. The two most important to remember are \n for a newline and \t for a tab character.


TIP: There are two ways to print a double quote, neither of which is that obvious. One way around this problem is to give printf the quote by its ASCII value, using the %c format:

doublequote = 34
printf("%c", doublequote)

The other strategy is to use a backslash to escape the default interpretation of the double quote as the end of the string:

printf("Joe said \"undoubtedly\" and hurried along.\n")

This second approach doesn't always work, unfortunately.

Closing Files and Pipes

Unlike most programming languages there is no way to open a file in awk; opening files is implicit. However, you must close a file if you intend to read from it after writing to it. Suppose your awk program sends output through a pipe to the command "cat file1 > file2". Before you can read file2 you must close the pipe. To do this, use the statement close("cat file1 > file2"), giving close exactly the same string that opened the pipe. You may also do the same for a file: close("file2").
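A minimal sketch of the write, close, and read sequence follows. The file name scratch is hypothetical, and the example assumes a nawk-or-later awk, since it relies on getline and close.

```shell
tmp=$(mktemp -d)
reread=$(cd "$tmp" && awk 'BEGIN {
    print "written by awk" > "scratch"   # the file is opened implicitly
    close("scratch")                      # close it before reading it back
    getline line < "scratch"              # reopened implicitly, this time for reading
    print line
}')
printf '%s\n' "$reread"
rm -rf "$tmp"
```

Without the close, the output may still be sitting in a buffer when getline tries to read the file.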

Command Line Arguments

As you have probably noticed, awk presents a programmer with a variety of ways to accomplish the same thing. This section focuses on the command line. You will see how to pass command line arguments to your program from the command line and how to set the value of built-in variables on the command line. A summary of command line options concludes the section.

Passing Command Line Arguments

Command line arguments are available in awk through a built-in array called, as in C, ARGV. Again echoing C semantics, the built-in ARGC holds the number of elements in ARGV, so the last argument is ARGV[ARGC-1]. Given the command line awk -f programfile infile1, ARGC has a value of 2. ARGV[0] = awk and ARGV[1] = infile1.


NOTE: The subscripts for ARGV start with 0 not 1.

programfile is not considered an argument; no option or option argument is. Had -F, been in the command line, ARGV would not contain the comma either. Note that this behavior is very different from how argv and argc are interpreted in C programs.
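You can inspect ARGC and ARGV from a BEGIN action. In this sketch the names infile1 and infile2 are placeholders; because the program has only a BEGIN action, awk never tries to open them.

```shell
args=$(awk 'BEGIN { print ARGC
                    for (i = 1; i < ARGC; i++) print i, ARGV[i] }' infile1 infile2)
# The -F option and its comma do not show up in ARGV
noopt=$(awk -F, 'BEGIN { print ARGC, ARGV[1] }' infile1)
printf '%s\n%s\n' "$args" "$noopt"
```

ARGV[0] is left out of the loop because its exact value (awk, nawk, gawk, and so on) depends on the system.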

Setting Variables on the Command Line

It is possible to pass variable values from the command line to your awk program just by stating the variable and its value, as in the command line awk -f programfile infile x=1 FS=,. Normally, command line arguments are filenames, but the equal sign indicates an assignment. Each assignment takes effect when awk reaches it in the argument list, which lets variables change value before and after a file is read. For instance, when the input is from multiple files, the order they are listed on the command line becomes very important since the first named input file is the first input read. Consider the command line awk -f program file2 file1 and this program segment.

NR == 1 { if ( FILENAME != "file1") {     # FILENAME is not yet set in BEGIN

               print "Unexpected input...Abandon ship!"

               exit

      }

      }

The programmer has written this program to accept one file as first input and anything else causes the program to do nothing except print the error message.

awk -f program x=1 file1 x=2 file2

The change in variable values above can also be used to check the order of files. Since you (the programmer) know their correct order, you can check for the appropriate value of x.
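Here is a sketch of that command line in action, using two one-line files made up for the purpose:

```shell
tmp=$(mktemp -d)
printf 'a\n' > "$tmp/file1"
printf 'b\n' > "$tmp/file2"
# Each assignment takes effect just before the file that follows it is read
order=$(cd "$tmp" && awk '{ print FILENAME, x, $0 }' x=1 file1 x=2 file2)
printf '%s\n' "$order"
rm -rf "$tmp"
```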


TIP: Awk only allows two command line options. The -f option indicates the file containing the awk program. When no -f option is used, the program is expected to be a part of the command line. The POSIX awk specification adds the option of using more than one -f option. This is useful when running more than one awk program on the same input. The other option is the -Fchar option where char is the single character chosen as the input field separator. Without a specified -F option, the input field separator is a space, until the variable FS is otherwise set.

Functions

This section discusses user-defined functions, also known in some programming languages as subroutines. For a discussion of functions built into awk see either "Strings" or "Arithmetic" as appropriate.

The ability to add, define, and use functions was not originally part of awk. It was added in 1985 when awk was expanded. Technically, this means you must use either nawk or gawk, if you intend to write awk functions; but again, since some systems use the nawk implementation and call it awk, check your man pages before writing any code.

Function Definition

An awk function definition statement appears like the following:

function functionname(list of parameters) {

     the function body

}

A function can exist anywhere a pattern action statement can be. As most of awk is, functions are free format but must be separated with either a semicolon or a newline. Like the action part of a pattern action statement, newlines are optional anywhere after the opening curly brace. The list of parameters is a list of variables separated by commas that are used within the function. The function body consists of one or more statements of the kind found in the action part of a pattern action statement.

A function is invoked with a function call from inside the action part of a regular pattern action statement. The left parenthesis of the function call must immediately follow the function name, without any space between them to avoid a syntactic ambiguity with the concatenation operator. This restriction does not apply to the built-in functions.

Parameters

Scalar arguments in awk are given to the function call by value (arrays, by contrast, are passed by reference). Actual parameters listed in the function call of the program are copied and passed to the formal parameters declared in the function. For instance, let's define a new function called isdigit, as shown:

function isdigit(x) {

     x=8

}

{  x=5

   print x

   isdigit(x)

   print x

}

Now let's use this simple program:

$ awk -f isdigit.awk

5

5

The call isdigit(x) copies the value of x into the local variable x within the function itself. The initial value of x here is five, as is shown in the first print statement, and is not reset to a higher value after the isdigit function is finished. Note that if there was a print statement at the end of the isdigit function itself, however, the value would be eight, as expected. Call by value ensures you don't accidentally clobber an important value.

Variables

Local variables in a function are possible. However, as functions were not a part of awk until awk was expanded, handling local variables in functions was not a concern. It shows: local variables must be listed in the parameter list and can't just be created as used within a routine. By convention, extra spaces separate local variables from program parameters, though commas are still required between all the names. For example, function isdigit(x,   a, b) indicates that x is a program parameter, while a and b are local variables; they have life and meaning only as long as isdigit is active.

Global variables are any variables used throughout the program, including inside functions. Any changes to global variables at any point in the program affect the variable for the entire program. In awk, to make a variable global, just exclude it from the parameter list entirely.

Let's see how this works with an example script:

function isdigit(x) {

     x=8

     a=3

 }

  { x=5 ; a = 2

  print "x = " x " and a = " a

  isdigit(x)

  print "now x = " x " and a = " a

 }

The output is:

x = 5 and a = 2

x = 5 and a = 3
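The difference between a parameter, a local variable, and a global can be put into one small sketch. The function bump and the variable names tmp and total are made up for the illustration.

```shell
scope=$(awk '
function bump(x,    tmp) {     # x is a parameter; tmp is local by convention
    tmp = x + 1
    total = tmp                # total is not in the list, so it is global
    x = 99                     # changes only the copy inside the function
    return tmp
}
BEGIN {
    x = 5; tmp = "untouched"
    r = bump(x)
    print x, tmp, total, r
}')
printf '%s\n' "$scope"
```

The caller's x and tmp survive the call, while total, which appears nowhere in the parameter list, is visible afterward.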

Function Calls

Functions may call each other. A function may also be recursive (that is, a function may call itself multiple times). The best example of recursion is factorial numbers: factorial(n) is computed as n * factorial(n-1) down to n=1, which has a value of one. The value factorial(5) is 5 * 4 * 3 * 2 * 1 = 120 and could be written as an awk program:

function factorial(n) {

  if (n <= 1) return 1;

  else return ( n * factorial(n-1) )

}

For a more in-depth look at the fascinating world of recursion I recommend you see either a programming or data structures book.
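To try the recursive function, it can be called from a BEGIN action. This sketch uses n <= 1 as the stopping test, an assumption on my part so that factorial(0) also ends the recursion.

```shell
fact=$(awk '
function factorial(n) {
    if (n <= 1) return 1           # stopping case ends the recursion
    return n * factorial(n - 1)    # otherwise the function calls itself
}
BEGIN { print factorial(5), factorial(10) }')
printf '%s\n' "$fact"
```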

Gawk follows the POSIX awk specification in almost every aspect. There is a difference, though, in function declarations. In gawk, the word func may be used instead of the word function. The POSIX2 spec mentions that the original awk authors asked that this shorthand be omitted, and it is.

The Return Statement

A function body may (but doesn't have to) end with a return statement. A return statement has two forms. The statement may consist of the direction alone: return. The other form is return E, where E is some expression. In either case, the return statement gives control back to the calling function. The return E statement gives control back, and also gives a value to the function.


TIP: Be careful: if the function is supposed to return a value and doesn't explicitly use the return statement, the results returned to the calling program are undefined.

Let's revisit the isdigit() function to see how to make it finally ascertain whether the given character is a digit or not:

function isdigit(x) {

     if (x >= "0" && x <= "9")

          return 1;

     else

          return 0 

}

As with C programming, I use a value of zero to indicate false, and a value of 1 indicates true. A return statement often is used when a function cannot continue due to some error. Note also that with inline conditionals—as explained earlier—this routine can be shrunk down to a single line: function isdigit(x) { return (x >= "0" && x <= "9") }

Writing Reports

This section discusses writing reports. Before continuing with this section, it would be a good idea to be sure you are familiar with both the UNIX sort command (see section "Sorting Text Files" in Chapter 6) and the use of pipes in UNIX (see section "Pipes" in Chapter 4). Generating a report in awk is a sequence of steps, with each step producing the input for the next step. Report writing is usually a three step process: pick the data, sort the data, make the output pretty.

BEGIN and END Revisited

The section on "Patterns" discussed the BEGIN and END patterns as pre- and post-input processing sections of a program. Along with initializing variables, the BEGIN pattern serves another purpose: BEGIN is awk's provided place to print headers for reports. Indeed, it is the only chance. Remember the way awk input works automatically. The lines:

{ print "                     Total Sales"

  print "  Salesperson       for the Month"

  print "  -------------------------------" }

would print a header for each input record rather than a single header at the top of the report! The same is true for the END pattern, only it follows the last input record. So,

{ print "-------------------------------"

  print "                Total sales", ttl }

should only be in the END pattern.

Much better would be:

BEGIN { print "                     Total Sales"

        print "  Salesperson       for the Month"

        print "  -------------------------------" }

{ per person processing statements }

END { print "-------------------------------"

      print "               Total sales", ttl }
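Here is a runnable sketch of a complete report with a header, per-person lines, and a footer. The three input records, the printf formats, and the variable name sales are assumptions of mine, standing in for whatever per-person processing your own report needs.

```shell
# BEGIN prints the header once, the middle action runs for every record,
# and END prints the footer once after the last record.
sales=$(printf '%s\n' 'ann 150' 'bob 75' 'lee 350' | awk '
BEGIN { print "                     Total Sales"
        print "  Salesperson       for the Month"
        print "  -------------------------------" }
      { printf "  %-18s%13.2f\n", $1, $2     # hypothetical per-person processing
        ttl += $2 }
END   { print "  -------------------------------"
        printf "  %-18s%13.2f\n", "Total sales", ttl }')
printf '%s\n' "$sales"
```

The header and the total each appear exactly once, no matter how many records are read.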

The Built-in System Function

While awk allows you to accomplish quite a few tasks with a few lines of code, it's still helpful sometimes to be able to tie in the many other features of UNIX. Fortunately almost all versions of nawk++ have the built-in function system(value) where value is a string that you would enter from the UNIX command line.


NOTE: The original awk does NOT have the system function.

The text is enclosed in double quotes and the variables are written using a space for concatenating. For example, if I am making a packet of files to e-mail to someone, and I create a list of the files I wish to send, I put a file list in a file called sendrick:

$ cat sendrick

/usr/anne/ch1.doc

/usr/informix/program.4gl

/usr/anne/pics.txt

then awk can build the concatenated file with:

$ nawk '{system("cat " $1)}' sendrick > forrick

This creates a file called forrick containing a full copy of each file. Note the space after cat inside the quotes; without it, the command name and the file name would run together into a single word. Yes, a shell script could be written to do the same thing, but shell scripts don't do the pattern matching that awk does, and they are not great at writing reports either.
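If you want to try this without touching real files, the following sketch builds a throwaway directory first. The directory, the two small files, and the variable name packet are all made up for the test; only the awk line mirrors the example above, with awk standing in for nawk. The command string passed to system() must include a space after cat so that the file name becomes a separate argument.

```shell
# Build a scratch directory with two small files and a list naming them.
dir=$(mktemp -d)
printf 'chapter one\n' > "$dir/ch1.doc"
printf 'picture notes\n' > "$dir/pics.txt"
printf '%s\n' "$dir/ch1.doc" "$dir/pics.txt" > "$dir/sendrick"

# For each line of the list, run "cat <file>"; the output lands in forrick.
awk '{ system("cat " $1) }' "$dir/sendrick" > "$dir/forrick"

packet=$(cat "$dir/forrick")
printf '%s\n' "$packet"
rm -rf "$dir"
```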

UNIX users are split roughly in half over which text editor they use—vi or emacs. I began using UNIX and the vi editor, so I prefer vi. The vi editor has no way to set off a block of text and do some operation, such as move or delete, to the block, and so falls back on the common measure, the line; a specified number of lines are deleted or copied.

When dealing with long programs, I don't like to guess about the line numbers in a block, or take the time to count them either! So I have a short script which adds line numbers to my printouts for me. It is centered around the following awk program. See file LST15_10 on the CD-ROM.
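The listing itself lives on the CD-ROM and is not reproduced here. As a stand-in, here is a minimal sketch of the general idea, not the contents of LST15_10: the built-in variable NR already holds the current line number, so printing it in front of $0 is all that is needed. The sample input and the variable name numbered are my own.

```shell
# Prefix every input line with its record number, right-justified in 4 columns.
# This is an illustrative sketch, not the program from the CD-ROM.
numbered=$(printf '%s\n' 'BEGIN { n = 0 }' '{ n++ }' 'END { print n }' |
  awk '{ printf "%4d  %s\n", NR, $0 }')
printf '%s\n' "$numbered"
```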

Advanced Concepts

As you spend more time with awk, you might yearn to explore some of the more complex facets of the programming language. I highlight some of the key ones below.

Multi-Line Records

By default, the input record separator RS recognizes a newline as the marker between records. As is the norm in awk, this can be changed to allow for multi-line records. When RS is set to the null string, records are separated by one or more blank lines, and the newline character always acts as a field separator, in addition to whatever value FS may have.
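A small sketch makes this concrete. The two address records below are invented for the example, as is the variable name names. With RS set to the null string, the blank line separates the records, and setting FS to a newline makes each line of a record one field.

```shell
# Two three-line records separated by a blank line.
# RS = "" selects blank-line-separated records; FS = "\n" makes each line a field.
names=$(printf '%s\n' 'Ann Lee' '12 Elm St' 'Boston' '' 'Bob Ray' '9 Oak Ave' 'Denver' |
  awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 " lives in " $3 }')
printf '%s\n' "$names"
```

Here NR counts whole records, so it reaches only 2 even though the input holds seven lines.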

Multidimensional Arrays

While awk does not directly support multidimensional arrays, it can simulate them using the single dimension array type awk does support. Why do this? An array may be compared to a bunch of books. Different people access them different ways. Someone who doesn't have many may keep them on a shelf in the room; consider this a single dimension array with each book at location[i]. Time passes and you buy a bookcase. Now each book is in location[shelf,i]. The comparison goes as far as you wish: consider the intercounty library with each book at location[branchnum, floor, room, bookcasenum, shelf, i]. The appropriate dimensions for the array depend very much on the type of problem you are solving. If the intercounty library keeps track of all its books by catalog number rather than location, a single dimension of book[catalog_num] = title makes more sense than location[branchnum, floor, room, bookcasenum, shelf, i] = title. Awk allows either choice.

Awk stores array subscripts as strings rather than as numbers, so adding another dimension is actually only a matter of concatenating another subscript value to the existing subscript. Suppose you design a program to inventory jeans at Levi's. You could set up the inventory so that item[inventorynum]=itemnum or item[style, size, color] = itemnum. The built-in variable SUBSEP is put between subscripts when a comma appears between subscripts. SUBSEP defaults to the value \034, a value with little chance of being in a subscript. Since SUBSEP marks the end of each subscript, subscript names do not have to be the same length. For example,

item["501","12w","stone washed blue"]

item["dockers","32m","black"]

item["relaxed fit", "9j", "indigo"]

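are all valid examples of the inventory. Determining the existence of an element is done just as it is for a single dimension array, with the addition of parentheses around the subscript list. Every subscript must be supplied; a subscript cannot be left empty. Your program should reorder when a certain size gets low, and before it does, it can check that the size is carried at all:

if (("501", "12w", "stone washed blue") in item) print "tag"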

NOTE: The in keyword is nawk++ syntax.
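The following sketch shows the existence test on a tiny inventory. The item numbers and the variable name found are hypothetical. It also counts the elements afterward to show that testing with in does not create the element being tested.

```shell
# Test for elements with a complete, parenthesized subscript list.
found=$(awk 'BEGIN {
  item["501", "12w", "stone washed blue"] = 1001   # hypothetical item numbers
  item["dockers", "32m", "black"] = 1002
  if (("501", "12w", "stone washed blue") in item) print "501 in stock"
  if (!(("501", "9j", "indigo") in item)) print "no indigo 501"
  n = 0; for (k in item) n++
  print n " elements"
}')
printf '%s\n' "$found"
```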

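The price increases on 501s, and your program is responsible for printing new price tags for the items which need a new tag. The in keyword works only on complete subscripts, so loop over every element and check the first subscript of each:

for (key in item) {

     split(key, part, SUBSEP)

     if (part[1] == "501")

          print "new tag for", part[1], part[2], part[3]

}

Recall the string function split; split(key, part, SUBSEP) breaks a combined subscript back into its individual subscripts, so the test on part[1] picks out every element in the array with "501" as its first subscript.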
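Here is a runnable sketch of selecting every element whose first subscript is "501". It walks all of the combined subscripts and uses split with SUBSEP to take each one apart. The item numbers and the variable name tags are hypothetical, and the output is piped through sort only because awk does not promise any particular order for a for (key in item) loop.

```shell
# Visit every combined subscript, split it on SUBSEP, and keep the 501s.
tags=$(awk 'BEGIN {
  item["501", "12w", "stone washed blue"] = 1001
  item["dockers", "32m", "black"] = 1002
  item["501", "9j", "indigo"] = 1003
  for (key in item) {
    split(key, part, SUBSEP)
    if (part[1] == "501") print "new tag: " part[1], part[2], part[3]
  }
}' | sort)
printf '%s\n' "$tags"
```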

Summary

In this chapter I have covered the fundamentals of awk as a programming language and as a tool. In the beginning of the chapter I gave an introduction to the key concepts, an overview of what you would need to know to get started writing and using awk. I spoke about patterns, a feature that sets awk apart from other programming languages. Two sections were devoted to variables, one on user defined variables and one on built-in variables.

The later part of the chapter talks about awk as a programming language. I discussed conditional statements, looping, arrays, input output, and user defined functions. I close with a brief section on writing reports.

The next chapter is about Perl, a language closely related to awk.

V is the first implementation using the variable: A = awk, G = gawk, P = POSIX awk, N = nawk.

V  Variable     Meaning                                        Default (if any)
N  ARGC         The number of command line arguments
N  ARGV         An array of command line arguments
A  FS           The input field separator                      space
A  NF           The number of fields in the current record
G  CONVFMT      The conversion format for numbers              %.6g
G  FIELDWIDTHS  A white-space separated list of field widths
G  IGNORECASE   Controls the case sensitivity                  zero (case sensitive)
P  FNR          The current record number in the current file
A  FILENAME     The name of the current input file
A  NR           The number of records already read
A  OFS          The output field separator                     space
A  ORS          The output record separator                    newline
A  OFMT         The output format for numbers                  %.6g
N  RLENGTH      Length of string matched by match function
A  RS           Input record separator                         newline
N  RSTART       Start of string matched by match function
N  SUBSEP       Subscript separator                            "\034"

Further Reading

For further reading:



Aho, Alfred V., Brian W. Kernighan and Peter J. Weinberger, The awk Programming Language. Reading, Mass.: Addison-Wesley, 1988 (copyright AT&T Bell Lab.)
IEEE Standard for Information Technology, Portable Operating System Interface (POSIX), Part 2: Shell and Utilities, Volume 2. Std. 1003.2-1992. New York: IEEE, 1993.

See also the man pages for awk, nawk, or gawk on your system.

Obtaining Source Code

Awk comes in many varieties. I recommend either gawk or nawk. Nawk is the more standard, whereas gawk has some non-POSIX extensions not found in nawk. Either version is a good choice.

To obtain nawk from AT&T: nawk is in the UNIX Toolkit. The dialup number in the United States is 908-522-6900, login as guest.

To obtain gawk: contact the Free Software Foundation, Inc. The phone number is 617-876-3296.
