Not only a command but a full-blown scripting language, awk is a powerful tool for text processing.
It’s a great way to quickly search through text files, extract and format data, and even perform basic calculations.
Dive much deeper into awk here and here.
Keep in mind
Not all awk implementations are created equal. This post references the GNU implementation.
The basics
Awk operates on records and fields. By default, a record is a line (records are separated by \n) and a field is a “word” (fields are separated by runs of spaces or tabs).
It performs an action based on a pattern, as in “if it matches this, do that”.
Your basic awk command looks something like this:
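A minimal sketch of that shape, using the himom pattern and the print $0 action discussed in this section. The file name file and its contents are placeholders made up for the example.

```shell
# Scratch directory and sample data (contents are invented for this sketch).
cd "$(mktemp -d)"
printf 'himom says hello\nnothing to see here\nwave at himom\n' > file

# Pattern /himom/ selects the records; action {print $0} prints each one whole.
awk '/himom/ {print $0}' file
# himom says hello
# wave at himom
```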
Regex patterns are delimited by /, while actions go inside {}. Note also the single quotes around the whole program, which keep the shell from interpreting $0 and the braces.
This reads: “On each record (line) that matches the pattern himom, run the action print $0 (which prints the whole line).”
You can omit the pattern to perform the action on every line, or omit the action to print each matching line.
So awk '{print $0}' file would print the whole file, while awk '/himom/' file would do the same as awk '/himom/ {print $0}' file.
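To see both shortcuts side by side, here is a small sketch; the sample file is invented for the example.

```shell
cd "$(mktemp -d)"
printf 'himom says hello\nnothing to see here\nwave at himom\n' > file

# No pattern: the action runs on every record, so this prints the whole file.
awk '{print $0}' file

# No action: matching records are printed, exactly as with {print $0}.
awk '/himom/' file
```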
Positions
As you might imagine, changing $0 for $n will print the nth field (word) instead of the whole record (line).
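For instance, with n = 2 (the sample data is made up for this sketch):

```shell
cd "$(mktemp -d)"
printf 'alpha beta gamma\none two three\n' > file

# $2 is the second whitespace-separated field of each record.
awk '{print $2}' file
# beta
# two
```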
Regex
The pattern '/himom/' is a shorthand for '$0 ~ /himom/'.
This means that patterns can be applied on a per-column basis, by matching against a field such as $1 instead of $0.
If you know your regex, you might expect the previous pattern to match lines containing only the word himom.
This is not the case: by default, a pattern matches any record that contains it anywhere.
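Also, ^ and $ anchor to the beginning and end of the string being tested rather than always the line: the whole record for a bare /.../ pattern, or a single field when you match against one.
This means that for awk '$1 ~ /01$/', the line 01 02 03 would match, because its first field ends in 01 even though the line itself does not.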
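A quick way to check the anchoring behavior is to run the same regex against a field and against the whole record. This is only a sketch, and the sample lines are invented.

```shell
cd "$(mktemp -d)"
printf '01 02 03\n02 03 01\nhimom\noh himom there\n' > file

# Matched against $1, so $ anchors to the end of the first field.
awk '$1 ~ /01$/' file        # prints: 01 02 03

# Matched against $0, so $ anchors to the end of the whole record.
awk '/01$/' file             # prints: 02 03 01

# Anchoring both ends matches records that are exactly "himom".
awk '/^himom$/' file         # prints: himom
```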
Unlike sed or grep, you don’t need the -E flag to use extended regular expressions, since this is the default behavior.
Variables
Ignorecase
Awk is case-sensitive by default, but this can be switched off with the IGNORECASE variable, which is specific to GNU awk.
We use the -v flag to set it to a non-zero value such as 1.
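A sketch of how that looks in practice. The gawk call is guarded because IGNORECASE only exists in GNU awk; the tolower() variant is not from the original post and is included as a portable fallback for other implementations. The sample file is invented.

```shell
cd "$(mktemp -d)"
printf 'HiMom is here\nno match\nhimom too\n' > file

# GNU awk only: any non-zero IGNORECASE makes regex matching case-insensitive.
if command -v gawk >/dev/null 2>&1; then
  gawk -v IGNORECASE=1 '/himom/' file
fi

# Portable fallback for other awks: lowercase the record before matching.
# The original (not lowercased) line is what gets printed.
awk 'tolower($0) ~ /himom/' file
```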
Filename
When processing multiple files in a script, it might be useful to also print the current file name:
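Awk exposes the name of the file being read as the FILENAME variable. A minimal sketch, with two hypothetical files a.txt and b.txt created just for the example:

```shell
cd "$(mktemp -d)"
printf 'himom one\nskip me\n' > a.txt
printf 'skip me too\nhimom two\n' > b.txt

# FILENAME holds the name of the file the current record came from.
awk '/himom/ {print FILENAME ": " $0}' a.txt b.txt
# a.txt: himom one
# b.txt: himom two
```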
(Input) Record and Field separator (RS & FS)
As mentioned before, the default RS is \n, while the default FS is a single space, which awk treats specially by splitting fields on runs of whitespace.
This can be configured to fit different file structures.
A CSV, for example, might not behave as expected.
Something like awk '{print $3}' file will not really work, because the fields are separated by commas rather than whitespace, but awk -v FS="," '{print $3}' file will.
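Here is a small sketch of the difference, using a made-up three-column CSV:

```shell
cd "$(mktemp -d)"
printf 'id,name,city\n1,Ada,London\n2,Linus,Helsinki\n' > file

# Default FS: these lines contain no whitespace, so each one is a single
# field and $3 is empty (three blank lines are printed).
awk '{print $3}' file

# FS="," splits on commas, so $3 is the third column.
awk -v FS="," '{print $3}' file
# city
# London
# Helsinki
```

Keep in mind that this simple split does not understand quoted CSV fields that themselves contain commas.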
Similarly, we could change the RS variable as well, although that is a less common use case.
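As a quick illustration of a custom RS, here is a sketch that treats ; as the record separator. The input string is invented for the example.

```shell
# Records are separated by ";" instead of newlines.
printf 'one;two;three' | awk -v RS=";" '{print NR ": " $0}'
# 1: one
# 2: two
# 3: three
```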
(Output) Record and Field separator (ORS & OFS)
These are used to format the output of your awk command.
For simple commands, something like awk '{print $3" - "$4}' file should do the trick, but this gets tedious and unreadable fast with more complex ones.
For such cases, set the OFS variable and separate the fields with commas in print, as in awk -v OFS=" - " '{print $3, $4}' file.
Notice the " - " separator in both examples.
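Both approaches can be compared directly. A minimal sketch on made-up data:

```shell
cd "$(mktemp -d)"
printf 'a b c d\ne f g h\n' > file

# Separator spelled out by hand between the fields.
awk '{print $3" - "$4}' file

# Same output: the commas in print are replaced by OFS.
awk -v OFS=" - " '{print $3, $4}' file
# c - d
# g - h
```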
There is also printf support in awk, so you can get as fancy as you want.
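A small printf sketch, assuming a made-up two-column input of names and numbers. The format string pads the name to 8 characters and prints the number with two decimals.

```shell
printf 'apples 3.5\nkiwis 12\n' |
  LC_ALL=C awk '{printf "%-8s|%6.2f\n", $1, $2}'
# apples  |  3.50
# kiwis   | 12.00
```

Unlike print, printf does not add a newline on its own, hence the explicit \n.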
Record and Field number (NR & NF)
NR holds the number of the current record (line), and NF holds the number of fields (words) in the current record. You can print them with something like awk '{print NR, NF}' file.
Or use them to conditionally apply the action, as in awk 'NR > 2 && NF < 10 {print $2}' file:
“Print the 2nd field of all records whose NR is greater than 2 (3rd line onwards) and whose NF is less than 10 (9 or fewer fields)”.
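A sketch of both uses on a made-up file whose lines have different numbers of fields:

```shell
cd "$(mktemp -d)"
printf 'a b\nc d e\nf g h i\nj k\n' > file

# Record number and field count for every line.
awk '{print NR, NF}' file
# 1 2
# 2 3
# 3 4
# 4 2

# 2nd field of records after line 2 that have fewer than 10 fields.
awk 'NR > 2 && NF < 10 {print $2}' file
# g
# k
```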
The not so basics
Logical operators
As hinted above, we can use && and || as in most other programming languages. Patterns can be mixed and matched using these logical operators.
Or you can negate the match with !, as in “only perform the action on lines that DON’T match the pattern”.
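The three operators can be sketched like this; the patterns and the sample file are invented for the example.

```shell
cd "$(mktemp -d)"
printf 'himom hello\nhello world\nhimom bye\nbye now\n' > file

# AND: both patterns must match the record.
awk '/himom/ && /hello/' file   # prints: himom hello

# OR: either pattern is enough (here, every line qualifies).
awk '/hello/ || /bye/' file

# NOT: only records that do not match.
awk '!/himom/' file             # prints: hello world, then bye now
```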
Ternary operations
Since we can use logical operators, you might imagine that we can also take advantage of ternary operators.
Which we can write in pseudocode as:
So for a file:
The command above would output:
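As one hypothetical instance of a ternary in awk: the command, the file contents, and the threshold of 50 below are all invented for illustration, so treat this as a sketch of the syntax and not as the post's own example.

```shell
cd "$(mktemp -d)"
printf 'ana 72\nbob 41\ncy 50\n' > file

# Pseudocode: if $2 >= 50 then use "pass", otherwise use "fail".
awk '{print $1, ($2 >= 50 ? "pass" : "fail")}' file
# ana pass
# bob fail
# cy pass
```

The parentheses around the ternary are there for readability and to keep it from being confused with the surrounding print arguments.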
Range
If the file you are working with has some kind of internal sorting, you might want to operate based on that instead of the NR.
You can use multiple matches to create a range on which to perform the action.
So on a file like:
The command awk '/second/ , /fourth/ {print $0}' file outputs:
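A sketch of the range pattern in action. The five-line file is an assumption, chosen so that the words second and fourth each appear on one line.

```shell
cd "$(mktemp -d)"
printf 'first line\nsecond line\nthird line\nfourth line\nfifth line\n' > file

# Start at the first record matching /second/ and stop after the next
# record matching /fourth/; both ends are included.
awk '/second/ , /fourth/ {print $0}' file
# second line
# third line
# fourth line
```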
Scripting
Here we only covered how to use awk as a one-liner from the command line, but awk is actually a fully featured scripting language.
The previous point regarding ternary operations skips over the fact that the action per se can include conditional logic.
This, for example, is a valid awk script:
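A hypothetical script along those lines, saved to a file and run with awk -f. The file name grades.awk, the data, and the pass mark are all made up for this sketch; the BEGIN and END blocks are standard awk but go slightly beyond what the post has covered.

```shell
cd "$(mktemp -d)"
printf 'ana 72\nbob 41\ncy 50\n' > file

# Quoting 'EOF' keeps the shell from touching $1 and $2 inside the script.
cat > grades.awk <<'EOF'
# Runs once before any input is read.
BEGIN { passed = 0 }

# Runs for every record; a regular if/else inside the action.
{
    if ($2 >= 50) {
        passed++
        print $1, "pass"
    } else {
        print $1, "fail"
    }
}

# Runs once after the last record.
END { print passed " of " NR " passed" }
EOF

awk -f grades.awk file
# ana pass
# bob fail
# cy pass
# 2 of 3 passed
```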
In fact, if your awk commands are getting a bit out of hand, turning them into a script might make things a lot easier.