Department of Engineering

IT Services


Unix has many text manipulation utilities. The most flexible is awk. Its conciseness is paid for in

  • speed of execution.
  • potentially hieroglyphic expressions.

but if you need to manipulate text files that have a fairly fixed line format, awk is ideal. It operates on the fields of a line (the default field separator, FS, being <space>). When awk reads in a line, the first field can be referred to as `$1', the second as `$2', etc. The whole line is `$0'. A short awk program can be written on the command line, e.g.

    cat file | awk '{print NF, $0}'

which prepends each line with the Number of Fields (ie, words) on the line. The quotes are necessary because otherwise the shell would interpret special characters like `$' before awk had a chance to read them. Longer programs are best put into files.
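The field separator need not be <space>; it can be changed with the -F option (or by assigning to FS). A minimal sketch, using colon-separated input invented for illustration:

```shell
# -F sets the field separator, so $1 is everything before the first colon
printf 'root:x:0:0\ndaemon:x:1:1\n' | awk -F: '{print $1}'
```

This prints root and daemon, one per line.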

Two examples in /export/Examples/Korn_shell (wordcount and awker) should give CUED users a start (the awk manual page has more examples). Once you have copied over wordcount and text, doing

  wordcount text

will give you a list of the words in text and their frequencies. Here is wordcount:

awk '   {for (i = 1; i <= NF; i++)
             num[$i]++
        }
    END {for (word in num)
             print word, num[word]
        }
    ' $*

The syntax is similar to that of C. awk lines take the form

     <pattern>          { <action> }

Each input line is matched against each awk line in turn. If, as here in wordcount, there is no target pattern on the awk line then all input lines will be matched. If there is a match but no action, the default action is to print the whole line.
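A pattern with no action relies on this default. As a small sketch, the following uses NF alone as the pattern to strip blank lines from made-up input:

```shell
# NF is zero on a blank line, so the pattern fails and nothing is printed;
# for the other lines the default action prints $0
printf 'one\n\ntwo\n' | awk 'NF > 0'
```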

Thus, the for loop is executed for every line in the input. Each word on the line (NF is the number of words on the line) is used as an index into an array, and the corresponding element is incremented each time that word appears. The ability to have strings as array `subscripts' is very useful. END is a special pattern, matched when the end of the input is reached. Here its action is to run a different sort of for loop that prints out the words and their frequencies: the variable word successively takes the values of the string `subscripts' of the array num.

Example 2 introduces some more concepts. Copy
/export/Examples/Korn_shell/data (shown below)

Tom     1.35     paid
Dick    3.87     Unpaid
Harry   56.00    Unpaid
Tom     36.03    unpaid
Harry   22.60    unpaid
Tom     8.15     paid
Tom     11.44    unpaid

and /export/Examples/Korn_shell/awker if you haven't done so already. Here is the text of awker:

awk '
$3 ~ /^[uU]npaid$/ {total[$1] += $2; owing=1}

END {
	  if (owing)
	    for (i in total)
	       print i, "owes", total[i] > "invoice"
	  else
	    print "No one owes anything" > "invoice"
}
' $*


  awker data

will add up how much each person still owes and put the answer in a file called invoice. In awker the 3rd field is matched against a regular expression (to find out more about these, type man 5 regexp). Note that both 'Unpaid' and 'unpaid' will match, but nothing else. If there is a match then the action is performed. Note that awk copes intelligently with strings that represent numbers; explicit conversion is rarely necessary. The `total' array has indices which are the people's names. If any unpaid line is seen, a variable `owing' is set to 1. At the end of the input, the amount each person owes is printed out.
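The same sort of matching can be tried directly on the command line. A small sketch, using invented input in the format of data:

```shell
# Only lines whose 3rd field matches the regular expression are acted on
printf 'Tom 1.35 paid\nDick 3.87 Unpaid\nTom 8.15 paid\n' |
    awk '$3 ~ /^[uU]npaid$/ {print $1, "owes", $2}'
```

which prints only the line for Dick.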

Other awk facilities are:-

  • fields can be split:-
       n = split(field,new_array,separator)
  • there are some string manipulation routines, e.g.:-
       substr(string,first_pos,max_chars), index(string,substring)
  • awk has built-in math functions (exp, log, sqrt, int), relational operators (==, !=, >, >=, <, <=) and the matching operators ~ (meaning ``contains'') and !~.
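A sketch of these facilities, on invented input:

```shell
# split() puts the pieces of a field into an array and returns their number;
# substr() and index() operate on strings
echo 'a:b:c 12345' | awk '{
    n = split($1, parts, ":")
    print n, parts[2]              # number of pieces, then the second piece
    print substr($2, 2, 3)         # 3 characters of $2, starting at position 2
    print index($2, "34")          # position at which "34" first appears in $2
}'
```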

As you see, awk is almost a language in itself, and people used to C syntax can soon create useful scripts with it.