Not only a command but a full-blown scripting language, Awk is a powerful tool for text processing. It’s a great way to quickly search through text files, extract and format data, and even perform basic calculations.
This post leans heavily on this and this, two amazing pieces of work.
Keep in mind
Not all Awk implementations are created equal. This post references the GNU version, although hopefully most of the information is generic enough to be of use in most awk implementations.
Basics
Awk operates on records and fields. By default, a record is a line (uses \n as a separator) and a field is a “word” (uses \s as a separator). It performs an action based on a pattern, as in “if it matches this, do that”.
Your basic Awk command looks something like /himom/ {print $0}. In this example, /himom/ is a pattern and {print $0} is an action. Regex patterns are delimited by / while actions always go within {}.
This reads like “On each record (line) that contains a match for the pattern himom, run the action print $0 (which prints the whole line).”
To call it on a file, run: awk '/himom/ {print $0}' file (always use single quotes!).
Omit the pattern to perform the action on all lines. Omit the action to print each matching line (awk '/himom/ {print $0}' file and awk '/himom/' file do the same thing).
Positions
As you might expect, changing $0 for $n will print the nth field (word) instead of the whole record (line).
Regex
The pattern '/himom/' is a shorthand for '$0 ~ /himom/'. This means that patterns can be applied on a per-column basis.
If you know your regex, you might expect the previous pattern to match only lines that contain nothing but the word himom. This is not the case. You can leverage the full power of extended regular expressions out of the box (unlike Sed or Grep, you don’t need -E here), but the default behavior is to match anything containing the given pattern.
Also, while ^ and $ usually designate the beginning and end of a line, here they mark the beginning and end of whatever string is being matched against (the whole record or, as here, a single field). This means that for awk '$1 ~ /-01$/', the line 2016-03-01 94.580002 93.610002 would match.
As mentioned before, you can skip the pattern altogether. The command awk '{print $1}' file will just print the first field (word) on each record (line) in the file.
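As a quick, made-up illustration, suppose file contains:

    himom this is line one
    hidad this is line two

Then awk '{print $1}' file prints himom and hidad (one per line), while awk '/himom/ {print $2}' file prints just "this", the second word of the matching line.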
Not so Basics
Logical operators
Patterns can be mixed and matched with your typical logical operators.
Or you can negate the match, as in “only perform the action on lines that DON’T match the pattern”.
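A few hypothetical one-liners just to show the syntax (himom and hidad are placeholder patterns):

    awk '/himom/ && /hidad/' file
    awk '/himom/ || /hidad/' file
    awk '!/himom/' file

The first prints lines matching both patterns, the second prints lines matching either, and the third prints lines that do not match /himom/ at all.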
Variables
BEGIN and END
Awk allows you to run specific actions before and after it does the processing.
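A minimal sketch, assuming a placeholder file:

    awk 'BEGIN {print "starting"} {print $1} END {print "processed", NR, "records"}' file

BEGIN runs before the first record is read and END after the last one; inside END, NR holds the total number of records.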
This can be used to create headers and footers for your output, although more often than not you’ll use it as a safe space to set other variables such as…
IGNORECASE
The match is case-sensitive by default. Change this behavior by setting this variable to 1:
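    awk 'BEGIN {IGNORECASE=1} /himom/' file

This now matches himom, HiMom, HIMOM and so on (IGNORECASE is a GNU Awk feature, in line with the rest of this post).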
Of course, you can still print your pretty header!
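Something along these lines would do (the header text is just a placeholder):

    awk 'BEGIN {IGNORECASE=1; print "My pretty header"} /himom/' file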
Notice how we separate the two statements within the BEGIN action with a ;.
(Input) Record and Field separator (RS & FS)
As mentioned before, the default RS is \n while the default FS is \s. This might work for you, or it might not, but we can change these values!
Suppose you are working with a proper comma separated CSV.
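The original sample isn’t shown here, so picture a small hypothetical file like this:

    id,name,city,score
    1,Ana,Lisbon,81
    2,Bruno,Porto,67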
Something like awk '{print $4}' file will not really work, but awk -v FS=, '{print $4}' file will:
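    score
    81
    67

(That is the fourth column of the hypothetical CSV above. The first command only prints empty lines, because with the default FS each record is one big field and $4 is empty.)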
We simply use the -v flag to set the FS variable to ,.
There are multiple ways to set FS and RS. In fact, some versions of Awk might not have a -v flag available. IMO, however, this is the most reliable, simple and easy-to-read option when using Awk as a one-liner.
Fancy things you can do
Record and Field number (NR & NF)
Just like we can change RS and FS, we can play around with NR and NF too.
Say you are working with a file like the following:
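    Date       Open       High       Low        Close      Volume
    ---------- ---------- ---------- ---------- ---------- ----------
    2016-03-01 94.580002  96.910004  94.580002  96.430000  34936900
    2016-03-02 96.459999  96.760002  95.800003  96.820000  26492000 extra junk col col col col col
    2016-03-03 96.910004  97.839996  96.050003  96.870000  26803800

(The original file isn’t reproduced here; this stock-price-style sample is a hypothetical stand-in.)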
You know that there is a useless line just under the header and that all lines with more than 10 fields are incorrect.
We can use NF to limit the number of fields and NR to limit the record number a line must have in order to be evaluated by awk:
“Print the 4th field of all records whose NR is greater than 2 (3rd line onwards) and whose NF is less than 10 (9 or fewer fields)”:
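    awk 'NR>2 && NF<10 {print $4}' file

On the hypothetical file above, this prints:

    94.580002
    96.050003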
Output Record and Field separator (ORS & OFS)
What if you want to format the output of your awk command?
Well, for simple commands, something like awk 'NF<10 && NR>2 {print $3" <-> "$4}' file should do the trick (notice the <->):
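(again on the hypothetical file from before)

    96.910004 <-> 94.580002
    97.839996 <-> 96.050003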
There is another option that might be nicer for more complex commands:
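    awk -v OFS=' <-> ' 'NF<10 && NR>2 {print $3,$4}' file

(A sketch using the OFS variable this section is named after: the comma in print $3,$4 tells Awk to join the fields with whatever OFS is set to.)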
The output is the same, but you can probably imagine that the second option scales better when the commands start getting fancy. There is also printf support in Awk, so you can get as fancy as you like!
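For instance, a hypothetical printf variant of the same command:

    awk 'NF<10 && NR>2 {printf "%s <-> %.2f\n", $3, $4}' file

printf gives you C-style format strings, so here the fourth field is printed with two decimal places.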
Range
If the file you are working with has some kind of sorting, you might want to operate based on that instead of the NR.
You can use multiple matches to create a range on which to perform the action. So on a file like:
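    first line
    second line
    third line
    fourth line
    fifth line

(a made-up stand-in, since the original file isn’t shown)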
The command awk '/second/ , /fourth/ {print $0}' file outputs:
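    second line
    third line
    fourth line

The range starts at the first record matching /second/ and runs through the next record matching /fourth/, both ends included.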
If statements
Yes, you can even fit if statements in your Awk command. Say you want to print the 9th field only if the 5th one is greater than 50:
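    awk '{if ($5 > 50) print $9}' file

The if lives inside the action block, just like it would in a standalone script.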
But wait! There’s more.
You can use ternary operations for more complex behavior! (Please consider whether it makes sense; you might want to write a script at this point…)
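The original example isn’t shown here, so the pseudocode, sample file, command and output that follow are a hypothetical reconstruction: print the 9th field when the 5th one is greater than 50, and a dash otherwise.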
In pseudocode this reads:
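    if the 5th field is greater than 50:
        print the 9th field
    else:
        print a dash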
So for a file:
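    a b c d 60 f g h ok-1
    a b c d 40 f g h skip
    a b c d 75 f g h ok-2

(nine placeholder fields per line, with the 5th one numeric)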
The command:
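    awk '{print ($5 > 50 ? $9 : "-")}' file

Note the extra parentheses around the whole ternary; without them, the > inside a print statement would be read as output redirection.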
Would output:
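    ok-1
    -
    ok-2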
Some notes on Awk
Multi-file
Awk can read multiple files, but its behavior when doing so is not the most intuitive, and some variables behave slightly differently.
This post only covers how to use Awk given a single file. If possible, I would advise using Awk in this way to keep things simple.
Scripting
This post only covers how to use it as a one-liner from the command prompt, but Awk is much more than a command.
The {} around actions and the ; separating commands within an action are there for a reason: Awk is not just a command, it’s a fully featured scripting language. This, for example, is a valid Awk script:
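(the original script isn’t reproduced here; this hypothetical stand-in shows the idea)

    #!/usr/bin/awk -f
    # Print the first and third columns of CSV rows whose third column
    # is greater than 100, then report how many such rows there were.
    BEGIN { FS = ","; hits = 0 }
    $3 > 100 { hits++; print $1, $3 }
    END { print hits, "rows above 100" }

Save it as, say, over100.awk and run it with awk -f over100.awk file (or make it executable and rely on the shebang, assuming awk lives at /usr/bin/awk).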
As you might imagine, this post barely scratches the surface of what can be done with Awk. There’s support for user-defined variables, arrays and flow control (with things like next and exit).
Have fun exploring!