Why awk?

I decided to write this because I feel the language is quite unknown, or at least underrated, considering its usefulness. In my opinion it is one of the best things to come out of UNIX: essentially omnipresent and yet underutilised. It is in essence a very advanced text filter (like grep or sed) and therefore integrates perfectly with the usual UNIX workflow. I personally believe the lack of knowledge of awk is why people say that bash scripting is not good for data processing. While for more advanced tasks Python, R or others are certainly a better choice, for simple and especially ad-hoc data processing awk is invaluable.

Did I get your attention? Ready to learn you some awk? Good, let’s get started.

Things might get AWKward

The only drawback I see with awk is the amount of different implementations. You will most likely encounter the original implementation, co-written and still maintained by the brilliant Brian Kernighan. If you are on GNU/Linux it is quite likely that you will have GNU awk, or gawk for short (which is most likely symlinked to awk). Other implementations such as mawk exist as well. In this article I will be using the GNU version; the implementations are mostly compatible, but the GNU version adds some quality-of-life improvements.

Do I have awk installed?

Most likely yes, if you are on GNU/Linux or macOS. You can check with the following:

$ awk --version
GNU Awk 5.2.1, API 3.2, PMA Avon 8-g1, (GNU MPFR 4.2.1, GNU MP 6.3.0)
Copyright (C) 1989, 1991-2022 Free Software Foundation.

What makes awk great (for file processing)

I would say that for me the most defining feature of the language is its design for line processing. The entire language was designed as a domain-specific language for this very purpose. There is no need to use looping constructs to iterate over lines; that logic is baked into the syntax, which makes sequential line processing absolutely trivial. Let's see it in the following example.

Our first awk command

While you can also write standalone awk scripts, you will quite often invoke awk directly from your shell or use it as a powerful tool inside a shell script.

$ awk '{print}' /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
tadeas:x:1000:1000:,,,:/home/tadeas:/bin/bash

So how does this work?

You invoke awk with two arguments.

  1. The programme to execute
  2. The input file (in our case /etc/passwd, the standard UNIX file with user details)

I generally recommend using single quotes for the programme, since string literals must always use double quotes in awk, and single quotes also keep the shell from expanding anything inside.
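For example (using the same three-line /etc/passwd excerpt as above), here the string literal "line: " takes double quotes inside the single-quoted programme, so the shell passes the whole thing through untouched:

$ awk '{ print "line: " $0 }' /etc/passwd
line: root:x:0:0:root:/root:/bin/bash
line: daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
line: tadeas:x:1000:1000:,,,:/home/tadeas:/bin/bash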

Now let's look at the programme in more detail. The entire body of the programme is a single action statement. This action statement will be evaluated on every line and print its entire contents.

An awk programme is essentially a sequence of pattern-action statements. These take the form pattern { action }. For each line that matches the pattern, the action will be executed. Our first programme is just a single pattern-action statement with an empty pattern (so it matches all lines). If we want to, we can filter out just the daemon user:

$ awk '/daemon/{print}' /etc/passwd
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

This searches each line for an occurrence of the string daemon.
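The pattern is in fact a regular expression, so you can be stricter about where the match occurs; for instance, anchoring it to the start of the line (a variation added here purely for illustration):

$ awk '/^daemon/{print}' /etc/passwd
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin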

Awk essentially splits the entire file into records (lines by default) and each record into fields. You can imagine it like the following:

Field1 Field2 Field3 # Record1
Field1 Field2 Field3 # Record2
Field1 Field2 Field3 # Record3

That makes it very good for working with table-like content such as TSV (tab-separated), CSV (comma-separated), or docker and kubernetes output (more on that later). By default it separates fields by whitespace (spaces and tabs).
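As a quick sketch of that default, you can pipe any whitespace-separated text into awk and pick fields out by number (note that runs of spaces count as a single separator):

$ echo "alpha  beta   gamma" | awk '{ print $2 }'
beta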

But you might notice that the /etc/passwd file is not separated by whitespace but by colons instead. We can redefine the field separator using the -F argument to awk. So what if we want to print the user's default shell? That is the 7th field. You can specify just that by doing the following:

$ awk -F ":" '/daemon/{ print $7 }' /etc/passwd
/usr/sbin/nologin

Quite neat, isn't it? Let me show you one more thing before we continue. There are two special patterns in awk: BEGIN and END. These have special meaning and are executed only once, before and after processing all of the records respectively. This can be useful to print headers and footers, set variables and output summaries.
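A minimal sketch (again against the three-line /etc/passwd excerpt from above): BEGIN prints a header before any input is read, the middle action counts every record, and END reports the total after the last one:

$ awk 'BEGIN{print "processing users..."} {count++} END{print count " records processed"}' /etc/passwd
processing users...
3 records processed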

Now what if we wanted to output multiple fields? Let's say the home directory and the shell:

$ awk -F ":" '/daemon/{ print $6, $7 }' /etc/passwd
/usr/sbin /usr/sbin/nologin

Now that works quite nicely, with one small exception: it is space-separated while the original was colon-separated. Why is that? Awk actually uses the OFS (Output Field Separator) variable to determine what to print between fields. We can of course modify it, and this is where the special patterns come in handy:

$ awk -F ":" 'BEGIN{OFS=":"} /daemon/{ print $6, $7 }' /etc/passwd
/usr/sbin:/usr/sbin/nologin

Now we have two pattern-action statements. The first, BEGIN, which as we know has a special meaning, is executed once before processing all the lines. It sets the value of OFS to a colon to match the input. You could also do BEGIN{OFS=FS} instead. Please note the difference between the shell's IFS variable and awk's FS variable. Instead of using the -F parameter you might also set FS in the BEGIN block:

$ awk 'BEGIN{FS=":"; OFS=FS} /daemon/{ print $6, $7 }' /etc/passwd
/usr/sbin:/usr/sbin/nologin

Please notice that the two statements inside BEGIN are separated by a semicolon. First you set the Field Separator to ":" and then you set the Output Field Separator to the value of the Field Separator.

Conditions instead of patterns

You can also use conditions and variables to filter lines in awk. Let's say you have the following file of student names and test scores:

John 89
Elizabeth 95
Thomas 45
Judy 40

And let's say you would like to pick out the ones who scored below 50 (perhaps to recommend them some extra lessons). You could do the following:

$ awk '$2<50{print $1}' grades.txt
Thomas
Judy

Let’s break it down a little bit:

  • $1 is the column with the names, $2 is the column with the scores
  • for each record the condition $2<50 is evaluated
  • if the score is below 50, the name is printed
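A small variation on the same data, combining the comparison with a richer action, might flag the top scorers instead:

$ awk '$2>=90{print $1, "scored", $2}' grades.txt
Elizabeth scored 95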

Built-in Variables

There are quite a few built-in variables; here are some interesting examples:

  • NR - Number of Records - the number of the record currently being processed
  • NF - Number of Fields - the number of fields in the current record

NR especially is quite useful for skipping headers, as you will see in the next example.
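To see both in action on the excerpt from earlier, you can print the record number and the field count for every line (each /etc/passwd record has seven colon-separated fields):

$ awk -F ":" '{ print NR, NF }' /etc/passwd
1 7
2 7
3 7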

Working with CSV

Let's say you have the following CSV file:

Name,Count,Price
Foo,2,12
Bar,5,20
Baz,1,300

And you would like to get the total amount. You might be tempted to reach for a spreadsheet, but you can do it quite easily with awk:

$ awk 'BEGIN{FS=","}NR>1{sum+=$2*$3}END{print "Total:", sum}' example.csv
Total: 424

Let’s break it down a little bit:

  1. at the beginning of the programme we set the field separator to a comma for the CSV
  2. NR>1 means we only process records whose number is larger than 1, which effectively skips the header line
  3. we add the product of fields 2 and 3 (count and price) into the variable sum (notice you do not have to declare it beforehand)
  4. at the end of the programme we print the value stored in the sum variable
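The same skeleton adapts easily. As a sketch, counting the data rows in a second variable lets you report an average alongside the total:

$ awk 'BEGIN{FS=","} NR>1{sum+=$2*$3; n++} END{print "Total:", sum, "Average:", sum/n}' example.csv
Total: 424 Average: 141.333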

Closing Thoughts

I hope I got you interested in learning and using awk. There are many great things you can do with it, but this should give you a rough idea. I might follow up with a post covering some more advanced examples.


Sources:

The GNU Awk User's Guide.