There are plenty of super useful CLI utilities, many of which you probably already have on your system. To get the most out of them, you need a basic understanding of common regex patterns.
Keep in mind that not all regex engines are created equal: their implementations and the patterns they accept vary a bit. However, the general concepts are more or less the same.
The basics
Ranges
Use []
to match whatever falls within the given range.
[abc]
➡️ ‘a’ or ‘b’ or ‘c’.
[a-z]
➡️ Any char between ‘a’ and ‘z’. Depending on the engine and locale, it may or may not include letters with diacritics.
[a-zA-Z0-9]
➡️ Any alphanumeric char either lower or upper case.
You can negate them by putting ^
right after the opening bracket:
[^a-z]
➡️ Any char not between ‘a’ and ‘z’.
More on how to negate matches below.
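To see the ranges above in action, here is a minimal sketch. It assumes Python's built-in re module (the text itself does not pick an engine), and the sample string and variable names are made up for illustration.

```python
import re

text = "Regex 101: a-z!"

# [abc] matches a single 'a', 'b' or 'c'.
abc_hits = re.findall(r"[abc]", text)

# [a-zA-Z0-9] matches one ASCII alphanumeric char, lower or upper case.
alnum_hits = re.findall(r"[a-zA-Z0-9]", text)

# [^a-z] matches one char that is NOT a lowercase ASCII letter.
non_lower_hits = re.findall(r"[^a-z]", text)

print(abc_hits)             # -> ['a']
print("".join(alnum_hits))  # -> Regex101az
print(non_lower_hits)
```

Note that spaces and punctuation count as "not between a and z", so the negated range picks them up too.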
The Dot
Use it to match any single char, usually except newlines.
.
➡️ Any one char.
..
➡️ Any two chars (not necessarily the same ones).
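A quick sketch of the dot, again assuming Python's re module. In that engine the dot skips newlines unless you pass the re.DOTALL flag; other engines may use a different switch.

```python
import re

# '.' matches any single char except a newline (Python's default).
one_char = re.findall(r".", "ab\ncd")

# '..' matches any two consecutive chars, which need not be equal.
two_chars = re.findall(r"..", "abcde")

# With re.DOTALL the dot matches newlines as well.
with_newline = re.findall(r".", "a\nb", flags=re.DOTALL)

print(one_char)      # -> ['a', 'b', 'c', 'd']
print(two_chars)     # -> ['ab', 'cd']
print(with_newline)  # -> ['a', '\n', 'b']
```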
Multipliers
Use them to match any number of the previous item.
a+
➡️ 1 or more instances of a
ab+
➡️ a
followed by 1 or more instances of b
(so ab
, abb
, and so on)
.+
➡️ Any char 1 or more times
.*
➡️ Any char 0 or more times
.?
➡️ Any char 0 or 1 times
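The multipliers above can be checked with a small sketch. It assumes Python's re module, and fully_matches is a hypothetical helper written just for this example.

```python
import re

def fully_matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(fully_matches(r"a+", "aaa"))    # -> True  (1 or more 'a')
print(fully_matches(r"a+", ""))       # -> False ('+' needs at least one)
print(fully_matches(r"ab+", "abbb"))  # -> True  ('+' applies to 'b' only)
print(fully_matches(r"ab+", "a"))     # -> False (no 'b' at all)
print(fully_matches(r".*", ""))       # -> True  (0 or more of anything)
print(fully_matches(r".?", "x"))      # -> True  (0 or 1 char)
print(fully_matches(r".?", "xy"))     # -> False (2 chars is too many)
```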
Greedy vs Lazy matches
What would you expect to happen if you pass a string like <body>Banana</body>
through a regex like <.*>
?
You might be surprised to find that it would likely match neither <body>
nor </body>
. In fact, it would most likely match the whole <body>Banana</body>
instead.
In most regex engines, the +
and *
multipliers are greedy by default, which means that they will try to match as much as possible.
A lazy match is probably what you want in most cases, and you usually get that by adding ?
to the multiplier: so using <.*?>
instead will match <body>
and/or </body>
.
If you want to get real fancy you could also use <[^>]+>
to achieve this, which should be understandable by this point. It’s usually more efficient, but be careful: regular expressions get unreadable real fast.
So remember, if you are having trouble with .*
(or .+
), try using .*?
(or .+?
) instead.
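Here is the <body>Banana</body> example as a runnable sketch, assuming Python's re module, which follows the greedy-by-default behaviour described above. The variable names are only for illustration.

```python
import re

html = "<body>Banana</body>"

# Greedy: '.*' grabs as much as it can, so the match runs to the LAST '>'.
greedy = re.findall(r"<.*>", html)

# Lazy: '.*?' stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r"<.*?>", html)

# Negated range: one or more chars that are not '>', then '>'.
negated = re.findall(r"<[^>]+>", html)

print(greedy)   # -> ['<body>Banana</body>']
print(lazy)     # -> ['<body>', '</body>']
print(negated)  # -> ['<body>', '</body>']
```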
Numbered Multipliers
Instead of matching any number, these match a given number or range of numbers of the previous item.
a{5}
➡️ ‘aaaaa’.
a{1,5}
➡️ Between 1 and 5 consecutive ‘a’.
What’s cool about them is that, with the upper bound left out, they can behave like a more interesting +
multiplier:
a{3,}
➡️ 3 or more ‘a’.
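A small sketch of the numbered multipliers, assuming Python's re module. Note the comma inside the braces: it separates the minimum from the maximum. The count helper is hypothetical.

```python
import re

def count(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

exactly_five = (count(r"a{5}", "aaaaa"), count(r"a{5}", "aaaa"))
one_to_five = (count(r"a{1,5}", "aaa"), count(r"a{1,5}", "aaaaaa"))
three_or_more = (count(r"a{3,}", "aa"), count(r"a{3,}", "aaaaaaa"))

print(exactly_five)   # -> (True, False)
print(one_to_five)    # -> (True, False)
print(three_or_more)  # -> (False, True)
```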
The not-so-basics
Short-hands
Regex can get hard to write and read, and there are certain structures we often want to match against.
To make our life easier, we can use short-hands (if your regex engine supports them):
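The short-hands themselves are not listed here, so take this as an illustration only: a sketch, assuming Python's re module, of three that most engines share, namely \d (a digit), \w (a word char: letter, digit or underscore) and \s (whitespace). The sample string and variable names are made up.

```python
import re

line = "order 42 shipped on 2024-01-15"

# \d+ : runs of digits.
digits = re.findall(r"\d+", line)

# \w+ : runs of word chars (letters, digits, underscore).
words = re.findall(r"\w+", line)

# \s+ : runs of whitespace, handy as a split separator.
fields = re.split(r"\s+", line)

print(digits)  # -> ['42', '2024', '01', '15']
print(words)   # -> ['order', '42', 'shipped', 'on', '2024', '01', '15']
print(fields)  # -> ['order', '42', 'shipped', 'on', '2024-01-15']
```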
Anchors
You might need a regex to only match at the beginning or the end of a line. For this, we use anchors like ^
and $
:
^
➡️ Start of the line.
$
➡️ End of the line.
\b
➡️ Word boundary (beginning or end of word).
So, for a regex like \bFOO$
:
FOO
in What a nice line of text BAR FOO
would match.
FOO
in What a nice line of text BARFOO
would not.
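The two cases above can be checked with a short sketch, assuming Python's re module. One engine-specific detail: in Python, ^ and $ anchor to the whole string unless you pass re.MULTILINE, whereas line-oriented CLI tools typically work line by line already.

```python
import re

pattern = re.compile(r"\bFOO$")

spaced = pattern.search("What a nice line of text BAR FOO")
glued = pattern.search("What a nice line of text BARFOO")

print(spaced is not None)  # -> True  (FOO starts a word and ends the line)
print(glued is not None)   # -> False (no word boundary between R and F)

# With re.MULTILINE, '^' matches at the start of every line.
first_words = re.findall(r"^\w+", "first line\nsecond line", flags=re.MULTILINE)
print(first_words)         # -> ['first', 'second']
```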
Multiple matches
Just like in an if
statement, you can match more than one expression:
foo|bar
➡️ Would match either foo
or bar
.
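As a tiny sketch of this, assuming Python's re module and a made-up sample string:

```python
import re

# 'foo|bar' matches wherever either alternative is found.
hits = re.findall(r"foo|bar", "foo, bar and foobar")

print(hits)  # -> ['foo', 'bar', 'foo', 'bar']
```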
Escaping special chars
What if we want our regex to match some of the special chars we’ve seen (like $
, [
or +
) literally?
We would need to escape them by putting a \
in front of them.
If we take our previous example and escape the $
, we get \bFOO\$
:
FOO
in What a nice line of text BAR FOO$ something else
would match.
FOO
in What a nice line of text BAR FOO
would not.
If you come across a scary-looking, unreadable regex, escaping is probably the main culprit. Don’t let the \
scare you!
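Here is the escaped example as a sketch, assuming Python's re module. It also shows re.escape, a helper in that module that adds the backslashes for you when the text to match comes from somewhere else. The sample strings are made up.

```python
import re

pattern = re.compile(r"\bFOO\$")

with_dollar = pattern.search("What a nice line of text BAR FOO$ something else")
without_dollar = pattern.search("What a nice line of text BAR FOO")

print(with_dollar.group())  # -> FOO$
print(without_dollar)       # -> None

# Unescaped, '1+1' means "one or more '1', then '1'", so it matches '11'.
unescaped_hit = re.search(r"1+1", "11") is not None

# re.escape turns the text into a pattern that matches it literally.
literal = re.compile(re.escape("1+1"))
literal_hit = literal.search("1+1=2") is not None
literal_miss = literal.search("11") is not None

print(unescaped_hit, literal_hit, literal_miss)  # -> True True False
```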
Grouping and References
One neat trick that most regex engines will allow you to do is grouping parts of the match and referencing them later in the regex.
One regex can have multiple groups and these get referenced by their number (starting with 1).
You surround the group in ()
and reference it with \
followed by the group’s number:
(foo)-(bar) \2\1
➡️ Will match foo-bar barfoo
(notice the space).
If you know how sed works, you can probably imagine this can save a lot of headaches.
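A sketch of both uses, assuming Python's re module: a back-reference inside the pattern itself, and a sed-style substitution where the replacement refers to the groups. The input strings are made up.

```python
import re

# \2\1 must repeat what groups 2 and 1 captured, in that order.
match = re.fullmatch(r"(foo)-(bar) \2\1", "foo-bar barfoo")
wrong_order = re.fullmatch(r"(foo)-(bar) \2\1", "foo-bar foobar")

print(match.groups())  # -> ('foo', 'bar')
print(wrong_order)     # -> None

# In a replacement string, \1 and \2 insert the captured text,
# much like sed's s/\(a\)-\(b\)/\2-\1/.
swapped = re.sub(r"(\w+)-(\w+)", r"\2-\1", "foo-bar")

print(swapped)         # -> bar-foo
```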
Negations
You can negate parts of your regex using lookarounds.
Say you want to match all instances of foo
followed by anything but bar
, followed by baz
. So for example, we want foowhateverbaz
to match but not foobarbaz
.
A lookahead like foo(?!bar).+?baz
would do just that: it negates the part of the regex that sits between the parentheses and is preceded by ?!
.
It simply means ‘not followed by (?!this)
’.
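The lookahead example can be tried out with this sketch, which assumes Python's re module (not every engine or CLI tool supports lookarounds).

```python
import re

pattern = re.compile(r"foo(?!bar).+?baz")

allowed = pattern.search("foowhateverbaz")
blocked = pattern.search("foobarbaz")

print(allowed.group())  # -> foowhateverbaz
print(blocked)          # -> None (foo is directly followed by bar)
```

The lookahead only checks what comes next. It does not consume any chars, which is why .+? still gets to match the text after foo.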
Similarly, you might want to go about this the other way around.
If you want to match all instances of foo
except when it is preceded by bar
, you could use a lookbehind like (?<!bar)foo
.
So This whateverfoo is weird
would match.
But This barfoo is weird
would not.
It simply means ‘not preceded by (?<!this)
’.
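And the lookbehind version, again as a sketch assuming Python's re module. One caveat specific to that engine: the pattern inside a lookbehind must have a fixed width, so (?<!bar) is fine but something like (?<!ba+r) is rejected.

```python
import re

pattern = re.compile(r"(?<!bar)foo")

allowed = pattern.search("This whateverfoo is weird")
blocked = pattern.search("This barfoo is weird")

print(allowed.group())  # -> foo
print(blocked)          # -> None (foo is preceded by bar)
```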
Both lookaheads and lookbehinds can be used to match a pattern while negating another one. Which one to use just depends on whether you want to negate something before or after something else.