How to Awk


Not only a command but a full-blown scripting language, awk is a powerful tool for text processing.

It’s a great way to quickly search through text files, extract and format data, and even perform basic calculations.

Dive much deeper into awk here and here.

Keep in mind

Not all awk implementations are created equal. This post references the GNU implementation.

The basics

Awk operates on records and fields. By default, a record is a line (the record separator is a newline, \n) and a field is a “word” (the field separator is a single space, which awk treats as any run of spaces or tabs).

It performs an action based on a pattern, as in “if it matches this, do that”.

Your basic awk command looks something like this:

awk '/himom/ {print $0}' file
     |       |
     pattern action

Regex patterns are delimited by /, while actions are enclosed in {}. Note also the single quotes, which keep the shell from interpreting $ and other special characters in the awk program.

This reads: “On each record (line) that matches the pattern himom, run the action print $0 (which prints the whole line).”

You can omit the pattern to perform the action on every line, or omit the action to print each matching line.

So awk '{print $0}' file would print the whole file, while awk '/himom/' file would do the same as awk '/himom/ {print $0}' file.
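
To see the three forms side by side, here is a small sketch. The sample text in `input` is made up for illustration, and it is piped in with `printf` instead of being read from a file.

```shell
# Hypothetical sample input: three records, two of which contain "himom".
input='himom says hi
nothing to see
well himom again'

# Pattern and action: print the records that match.
both=$(printf '%s\n' "$input" | awk '/himom/ {print $0}')

# Pattern only: the default action is to print the record.
pattern_only=$(printf '%s\n' "$input" | awk '/himom/')

# Action only: the action runs on every record.
action_only=$(printf '%s\n' "$input" | awk '{print $0}')

echo "$pattern_only"
```

The first two commands should print the same two lines, and the third should echo the input unchanged.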

Positions

As you might imagine, changing $0 for $n will print the nth field (word) instead of the whole record (line).
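
A short sketch of field access, using a made-up one-line input. It also uses `$NF`, which relies on the NF variable covered further down to reach the last field.

```shell
# Hypothetical input with three fields.
line='alpha beta gamma'

first=$(echo "$line" | awk '{print $1}')    # first field
second=$(echo "$line" | awk '{print $2}')   # second field
last=$(echo "$line" | awk '{print $NF}')    # NF is the field count, so $NF is the last field

echo "$first $second $last"
```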

Regex

The pattern '/himom/' is a shorthand for '$0 ~ /himom/'.
This means that patterns can be applied on a per-column basis by swapping $0 for a field such as $1.

You might expect the previous pattern to match only lines consisting of exactly the word himom.

This is not the case: the default behavior is to match any record that contains the pattern anywhere.

Also, ^ and $ anchor to the beginning and end of the string being matched. Against $0 that is the whole record (line), while against a single field such as $1 it is the beginning and end of that field.

This means that for awk '$1 ~ /01$/', the line 01 02 03 would match, because its first field ends in 01.
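
One way to check what the anchors apply to is to run the same regex against a field and against the whole record. This is only a sketch using the sample line from the text.

```shell
line='01 02 03'

# Matched against the first field only: "01" ends in "01", so this prints.
field_match=$(echo "$line" | awk '$1 ~ /01$/ {print "match"}')

# Matched against the whole record: the record ends in "03", so nothing prints.
record_match=$(echo "$line" | awk '/01$/ {print "match"}')

echo "field: [$field_match] record: [$record_match]"
```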

Unlike sed or grep, you don’t need a flag such as -E to use extended regular expressions, since they are the default.

Variables

Ignorecase

Awk is case-sensitive by default. In gawk, this can be switched off by setting the IGNORECASE variable to a nonzero value:

awk -v IGNORECASE=1 '/fooBar/ {print $1}' file

We use the -v flag to set the IGNORECASE variable to 1 (true) before the program runs.
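
IGNORECASE is a gawk feature. Where another implementation might be in use, one possible alternative is to lowercase the record with the standard `tolower` function before matching. A minimal sketch with made-up input:

```shell
# Hypothetical input with mixed-case matches.
input='FooBar one
nothing two
FOOBAR three'

# Lowercase the record, then match it against a lowercase regex.
hits=$(printf '%s\n' "$input" | awk 'tolower($0) ~ /foobar/ {print $2}')

echo "$hits"
```

Note that the regex itself has to be written in lowercase for this approach to work.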

Filename

When processing multiple files in a script, it might be useful to also print the current file name:

awk '{print FILENAME}' file.txt
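
With more than one input file, FILENAME changes as awk moves from file to file. The sketch below creates two throwaway files (their names and contents are arbitrary) and also uses FNR, the per-file record number, which is not covered elsewhere in this post.

```shell
# Create two small files in a temporary directory.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\n' > "$dir/two.txt"

# FILENAME is the file currently being read; FNR restarts at 1 for each file.
out=$(cd "$dir" && awk '{print FILENAME ":" FNR ": " $0}' one.txt two.txt)

echo "$out"
rm -r "$dir"
```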

(Input) Record and Field separator (RS & FS)

As mentioned before, the default RS is \n, while the default FS is a single space, which splits on any run of spaces or tabs.
This can be configured to fit different file structures.

A CSV for example might not behave as expected:

Tonia,Ellerey,Tonia.Ellerey@yopmail.com,firefighter
Joleen,Viddah,Joleen.Viddah@yopmail.com,police officer
Cherilyn,Kat,Cherilyn.Kat@yopmail.com,firefighter
Janenna,Natica,Janenna.Natica@yopmail.com,worker

Something like awk '{print $3}' file prints only empty lines here, because no line contains three whitespace-separated fields, but awk -v FS="," '{print $3}' file prints the email column:

Tonia.Ellerey@yopmail.com
Joleen.Viddah@yopmail.com
Cherilyn.Kat@yopmail.com
Janenna.Natica@yopmail.com
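
The -F option is a common shorthand for setting FS. The sketch below runs both spellings on the first two rows of the CSV above. Keep in mind that this is plain comma splitting, so it would not cope with quoted fields that contain commas.

```shell
csv='Tonia,Ellerey,Tonia.Ellerey@yopmail.com,firefighter
Joleen,Viddah,Joleen.Viddah@yopmail.com,police officer'

# Setting FS through -v, as in the text.
with_v=$(printf '%s\n' "$csv" | awk -v FS="," '{print $3}')

# The -F shorthand, here printing the fourth column instead.
with_F=$(printf '%s\n' "$csv" | awk -F, '{print $4}')

echo "$with_v"
echo "$with_F"
```

The space inside "police officer" survives because only the comma separates fields now.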

Similarly, we could change the RS variable as well, although that is a less common use case.
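
One case where changing RS can help is input made of blank-line-separated blocks. Setting RS to an empty string makes each block a record, and setting FS to a newline makes each line of the block a field. The input below is invented for illustration.

```shell
# Hypothetical input: two blocks separated by a blank line.
input='name: Tonia
job: firefighter

name: Joleen
job: police officer'

# Each block is one record; $1 is its first line.
names=$(printf '%s\n' "$input" | awk -v RS="" -v FS="\n" '{print $1}')

echo "$names"
```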

(Output) Record and Field separator (ORS & OFS)

These are used to format the output of your awk command.

While for simple commands, something like awk '{print $3" - "$4}' file should do the trick, this can get tedious and unreadable fast with more complex ones.

For such cases, use the OFS variable:

awk -v OFS=" - " '{print $3, $4}' file

Notice the " - " separator in both examples. OFS is only inserted between print arguments that are separated by a comma.

There is also printf support in awk, so you can get as fancy as you want.
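
As a small sketch of what printf allows, the command below pads two columns to fixed widths. The widths (8 and 12) are arbitrary choices for this example, and the row is taken from the CSV above.

```shell
csv='Tonia,Ellerey,Tonia.Ellerey@yopmail.com,firefighter'

# %-8s left-aligns in 8 characters, %12s right-aligns in 12 characters.
# Unlike print, printf does not add a newline, so it is written out explicitly.
row=$(echo "$csv" | awk -F, '{printf "%-8s|%12s|\n", $1, $4}')

echo "$row"
```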

Record and Field number (NR & NF)

NR holds the number of the current record (line), counted from the start of the input, and NF holds the number of fields (words) in the current record. You can print them with something like:

awk '{print "Line num:", NR, "Num of fields:", NF, "Content:", $0}' file

Or use them to conditionally apply the action:

awk 'NF<10 && NR>2 {print $2}' file

“Print the 2nd field of all records whose NR is greater than 2 (3rd line onwards) and whose NF is less than 10 (9 or fewer fields)”.
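
The same idea on a concrete, made-up input, so the effect of each condition is easy to follow:

```shell
# Hypothetical input: four records with 3, 2, 4 and 2 fields.
input='a b c
d e
f g h i
j k'

# Keep records after the first one that have exactly two fields.
out=$(printf '%s\n' "$input" | awk 'NR>1 && NF==2 {print NR": "$0}')

echo "$out"
```

Only the second and fourth records satisfy both conditions.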

The not so basics

Logical operators

As hinted above, we can use && and || as in most other programming languages. Patterns can be mixed and matched using these logical operators.

awk '/bilbo/ && /frodo/ {print "My Precious"}' file
awk '/bilbo/ || /frodo/ {print "Is it you mister Frodo?"}' file

Or you can negate the match, as in “only perform the action on lines that DON’T match the pattern”.

awk '!/frodo/ { print "Pohtatoes" }' file
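
Negation can be spelled in two equivalent ways: `!` in front of the regex, or the `!~` operator against `$0`. A sketch on invented input:

```shell
# Hypothetical input: two lines mention frodo, two do not.
input='frodo
sam
frodo and sam
gollum'

# Negate the regex match directly.
negated=$(printf '%s\n' "$input" | awk '!/frodo/')

# The same test written with the explicit "does not match" operator.
explicit=$(printf '%s\n' "$input" | awk '$0 !~ /frodo/')

echo "$negated"
```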

Ternary operations

Since we can use logical operators, you might imagine that we can also take advantage of ternary operators.

awk '/frodo/ ? /ring/ : /orcs/ { print $0" --> Either frodo with the ring, or the orcs" }' file

Which we can write in pseudocode as:

if matches(frodo)
    if matches(ring)
        print "Either frodo with the ring, or the orcs"
    else
        don't print
else if matches(orcs)
    print "Either frodo with the ring, or the orcs"
else
    don't print

So for a file:

frodo
ring
orcs
frodo ring
frodo orcs
ring orcs
frodo ring orcs

The command above would output:

orcs --> Either frodo with the ring, or the orcs
frodo ring --> Either frodo with the ring, or the orcs
ring orcs --> Either frodo with the ring, or the orcs
frodo ring orcs --> Either frodo with the ring, or the orcs
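
The same selection can also be written with an explicit if/else inside the action, which some readers may find easier to follow. This is a sketch and not the only way to express it; `keep` is a variable name chosen here for illustration.

```shell
input='frodo
ring
orcs
frodo ring
frodo orcs
ring orcs
frodo ring orcs'

out=$(printf '%s\n' "$input" | awk '{
    if ($0 ~ /frodo/) { keep = ($0 ~ /ring/) }   # frodo present: require ring
    else              { keep = ($0 ~ /orcs/) }   # frodo absent: require orcs
    if (keep) print $0
}')

echo "$out"
```

It selects the same four lines as the ternary pattern, without the appended message.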

Range

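If the file you are working with has some kind of internal ordering or markers, you might want to operate based on that instead of NR.

You can use two patterns separated by a comma to create a range on which to perform the action: it starts at a record matching the first pattern and ends at the next record matching the second, both included.
So on a file like: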

first line
second line
third line
fourth line
fifth line

The command awk '/second/ , /fourth/ {print $0}' file outputs:

second line
third line
fourth line
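
Two further behaviors are worth knowing, sketched here on the same five lines: range endpoints can be any pattern, including NR comparisons, and a range whose end pattern never matches runs to the end of the input.

```shell
input='first line
second line
third line
fourth line
fifth line'

# Range by record number: records 2 through 3.
by_number=$(printf '%s\n' "$input" | awk 'NR==2, NR==3')

# The end pattern never matches, so the range runs to the last record.
open_ended=$(printf '%s\n' "$input" | awk '/fourth/ , /no such text/ {print NR": "$0}')

echo "$by_number"
echo "$open_ended"
```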

Scripting

So far we have only covered how to use awk as a one-liner from the command line, but awk is actually a fully featured scripting language.

The earlier section on ternary operations skips over the fact that the action itself can include conditional logic.

This, for example, is a valid awk script:

#!/usr/bin/awk -f
# For each record containing "hi", compare the first two fields.
/hi/ {
    if ($1 > $2) {
        print "mom!"
    }
    else print "there!"
}

In fact, if your awk commands are getting a bit out of hand, turning them into a script might make things a lot easier.
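
One way to try this out is to save the program to a file and pass it with `-f`. The sketch below does that in a temporary directory. The file name `greet.awk` and the input lines are made up for the example.

```shell
# Write the awk program to a temporary file.
dir=$(mktemp -d)
cat > "$dir/greet.awk" <<'EOF'
/hi/ {
    if ($1 > $2) {
        print "mom!"
    }
    else print "there!"
}
EOF

# Fields that look like numbers are compared numerically, so 2 > 10 is false.
out=$(printf '%s\n' '9 3 hi' '2 10 hi' '1 2 bye' | awk -f "$dir/greet.awk")

echo "$out"
rm -r "$dir"
```

The third input line does not contain "hi", so it produces no output.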

