How to Regex

There are plenty of super useful CLI utilities, many of which you probably already have on your system. To get the most out of them, you need a basic understanding of common regex patterns.

Keep in mind that not all regex engines are created equal: their implementations and the patterns they accept vary a bit. However, the general concepts should be more or less the same.

The basics

Ranges

Use [] to match any single char that falls within the given set or range.

[abc] ➡️ ‘a’ or ‘b’ or ‘c’.

[a-z] ➡️ Any char between ‘a’ and ‘z’. It may or may not include diacritics.

[a-zA-Z0-9] ➡️ Any alphanumeric char either lower or upper case.

You can negate them with ^:

[^a-z] ➡️ Any char not between ‘a’ and ‘z’.
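To see these ranges in action, here is a minimal sketch using Python's built-in re module. The sample string is made up for illustration, and other engines may behave slightly differently.

```python
import re

sample = "Cab 42!"

# [abc] matches a single 'a', 'b' or 'c' (case sensitive, so 'C' is skipped)
abc_matches = re.findall(r"[abc]", sample)

# [a-z] matches any single lowercase ASCII letter
lower_matches = re.findall(r"[a-z]", sample)

# [^a-z] matches any single char that is NOT a lowercase ASCII letter
negated_matches = re.findall(r"[^a-z]", sample)

print(abc_matches)      # ['a', 'b']
print(lower_matches)    # ['a', 'b']
print(negated_matches)  # ['C', ' ', '4', '2', '!']
```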

The Dot

Use it to match any char, usually except new lines.

. ➡️ Any one char.

.. ➡️ Any two chars (not necessarily the same ones).
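A quick sketch of the dot in Python's re module, assuming its default behavior where the dot skips new lines unless you pass the DOTALL flag:

```python
import re

# '.' matches any single char, but not the new line by default
single = re.findall(r".", "a\nb")

# '..' consumes two chars at a time; the trailing 'e' has no partner
pairs = re.findall(r"..", "abcde")

# With re.DOTALL, '.' matches new lines too
with_newline = re.findall(r".", "a\nb", flags=re.DOTALL)

print(single)        # ['a', 'b']
print(pairs)         # ['ab', 'cd']
print(with_newline)  # ['a', '\n', 'b']
```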

Multipliers

Use them to match any number of the previous item.

a+ ➡️ 1 or more instances of a

ab+ ➡️ a followed by 1 or more instances of b (so ab, abb, and so on)

.+ ➡️ Any char 1 or more times

.* ➡️ Any char 0 or more times

.? ➡️ Any char 0 or 1 times
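Here is a small sketch of the multipliers using Python's re module. The sample strings (and the colou?r pattern) are just illustrations, not part of any particular tool.

```python
import re

# ab+ : 'a' followed by 1 or more 'b' (a lone 'a' does not match)
one_or_more = re.findall(r"ab+", "a ab abb abbb")

# ab* : 'a' followed by 0 or more 'b' (a lone 'a' matches too)
zero_or_more = re.findall(r"ab*", "a ab abb")

# colou?r : the 'u' may appear 0 or 1 times
optional = re.findall(r"colou?r", "color colour")

print(one_or_more)   # ['ab', 'abb', 'abbb']
print(zero_or_more)  # ['a', 'ab', 'abb']
print(optional)      # ['color', 'colour']
```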

Greedy vs Lazy matches

What would you expect to happen if you pass a string like <body>Banana</body> through a regex like <.*>?

You might be surprised to find that it would likely match neither <body> nor </body> on its own. In fact, it would most likely match the whole <body>Banana</body> instead.

By default, the + and * multipliers are greedy in most regex engines, which means that they will try to match as much as possible.

A lazy match is probably what you want in most cases, and you usually get that by adding ? to the multiplier: using <.*?> instead will match <body> and </body> separately.

If you want to get real fancy you could also use <[^>]+> to achieve this, which should be understandable by this point. It’s usually more efficient, but be careful: regular expressions get unreadable real fast.

So remember, if you are having trouble with .* (or .+), try using .*? (or .+?) instead.
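The three variants can be compared side by side. This is a minimal sketch with Python's re module, using the same <body>Banana</body> string as above:

```python
import re

html = "<body>Banana</body>"

# Greedy: .* grabs as much as it can, so we get one big match
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first '>' it can, so each tag matches on its own
lazy = re.findall(r"<.*?>", html)

# Negated range: "anything but '>'" gives the same result without laziness
negated = re.findall(r"<[^>]+>", html)

print(greedy)   # ['<body>Banana</body>']
print(lazy)     # ['<body>', '</body>']
print(negated)  # ['<body>', '</body>']
```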

Numbered Multipliers

Instead of matching any number, these match a given number or range of numbers of the previous item.

a{5} ➡️ ‘aaaaa’.

a{1,5} ➡️ Between 1 and 5 consecutive ‘a’.

What’s cool about them is that, if you leave out the upper bound, they behave like a more flexible + multiplier:

a{3,} ➡️ 3 or more ‘a’.
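A short sketch of numbered multipliers in Python's re module. It uses re.fullmatch, which only succeeds when the whole string fits the pattern, and the sample strings are made up.

```python
import re

# a{5}: exactly five 'a'
is_five = re.fullmatch(r"a{5}", "aaaaa") is not None
is_four = re.fullmatch(r"a{5}", "aaaa") is not None

# a{1,5}: between 1 and 5 'a' (note the comma between the bounds)
in_range = re.fullmatch(r"a{1,5}", "aaa") is not None
out_of_range = re.fullmatch(r"a{1,5}", "aaaaaa") is not None

# a{3,}: 3 or more 'a', with no upper bound
three_plus = re.findall(r"a{3,}", "a aa aaa aaaaa aaaaaaa")

print(is_five, is_four)        # True False
print(in_range, out_of_range)  # True False
print(three_plus)              # ['aaa', 'aaaaa', 'aaaaaaa']
```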

The not so basics

Short-hands

Regex can get hard to write and read, and there are certain structures we often want to match against.

To make our life easier, we can use short-hands (if your regex engine supports them):

\s ➡️ a whitespace char (space, tab, new line and so on).
\S ➡️ anything but a whitespace char (opposite of \s).
\d ➡️ a digit (0-9).
\D ➡️ anything but a digit (opposite of \d).
\w ➡️ a 'word' char (usually shorthand for [a-zA-Z0-9_], though Unicode-aware engines may also include other letters and digits).
\W ➡️ anything but a 'word' char (opposite of \w).
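A minimal sketch of the short-hands in Python's re module, with a made-up sample string. One assumption worth flagging: in Python 3, \w on text strings is Unicode-aware unless you pass re.ASCII, which is why the last two lines differ.

```python
import re

sample = "Order #42 shipped_ok"

digits = re.findall(r"\d+", sample)  # runs of digits
words = re.findall(r"\w+", sample)   # runs of 'word' chars
chunks = re.findall(r"\S+", sample)  # runs of non-whitespace
parts = re.split(r"\s+", sample)     # split on runs of whitespace

print(digits)  # ['42']
print(words)   # ['Order', '42', 'shipped_ok']
print(chunks)  # ['Order', '#42', 'shipped_ok']
print(parts)   # ['Order', '#42', 'shipped_ok']

# Unicode-aware by default, plain [a-zA-Z0-9_] with re.ASCII
unicode_words = re.findall(r"\w+", "café")
ascii_words = re.findall(r"\w+", "café", flags=re.ASCII)

print(unicode_words)  # ['café']
print(ascii_words)    # ['caf']
```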

Anchors

You might need a regex to only match at the beginning or the end of a line. For this, we use anchors like ^ and $:

^ ➡️ Start of the line.

$ ➡️ End of the line.

\b ➡️ Word boundary (beginning or end of word).

So, for a regex like \bFOO$:

FOO in What a nice line of text BAR FOO would match.

FOO in What a nice line of text BARFOO would not.
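The same two lines can be checked with Python's re module. This is only a sketch: in Python, ^ and $ anchor to the whole string by default and only work line by line when you pass re.MULTILINE, so other tools (grep, for instance, works per line) may not need that flag.

```python
import re

pattern = re.compile(r"\bFOO$")

# FOO is its own word at the end of the line: match
hit = pattern.search("What a nice line of text BAR FOO")

# FOO is glued to BAR, so there is no word boundary before it: no match
miss = pattern.search("What a nice line of text BARFOO")

print(hit.group())  # FOO
print(miss)         # None

# ^ with and without MULTILINE on a two-line string
text = "one two\nthree four"
first_only = re.findall(r"^\w+", text)
every_line = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(first_only)  # ['one']
print(every_line)  # ['one', 'three']
```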

Multiple matches

Just like an or in an if statement, | lets you match more than one expression:

foo|bar ➡️ Would match either foo or bar.
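A small sketch of alternation in Python's re module. The second half illustrates something that trips people up: | splits the whole pattern, so ^foo|bar$ reads as "^foo" or "bar$". The sample strings are made up.

```python
import re

either = re.findall(r"foo|bar", "foo bar baz")
print(either)  # ['foo', 'bar']

# Reads as: (starts with foo) OR (ends with bar)
anchored = re.compile(r"^foo|bar$")

starts_with_foo = anchored.search("foo is first") is not None
ends_with_bar = anchored.search("ends with bar") is not None
neither = anchored.search("a foo and a bar inside") is not None

print(starts_with_foo, ends_with_bar, neither)  # True True False
```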

Escaping special chars

What if we want our regex to match some of the special chars we’ve seen (like $, [ or +) literally?

We would need to escape them by putting a \ in front of them.

If we take our previous example and escape the $, we get \bFOO\$:

FOO in What a nice line of text BAR FOO$ something else would match.

FOO in What a nice line of text BAR FOO would not.

If you come across a scary looking, unreadable regex this is probably the main culprit. Don’t let the \ scare you!
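Here is the escaped example as a sketch in Python's re module. It also shows re.escape, a standard-library helper that adds the backslashes for you when you want to match a string literally. The "1+1=2" string is just an illustration.

```python
import re

pattern = re.compile(r"\bFOO\$")

# The literal '$' right after FOO is required now
escaped_hit = pattern.search("What a nice line of text BAR FOO$ something else")
escaped_miss = pattern.search("What a nice line of text BAR FOO")

print(escaped_hit.group())  # FOO$
print(escaped_miss)         # None

# Unescaped, the '+' in "1+1=2" means "one or more 1", so the
# pattern does not match its own text
unescaped = re.fullmatch(r"1+1=2", "1+1=2") is not None

# re.escape backslashes the special chars so the text matches literally
literal_ok = re.fullmatch(re.escape("1+1=2"), "1+1=2") is not None

print(unescaped, literal_ok)  # False True
```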

Grouping and References

One neat trick that most regex engines will allow you to do is grouping parts of the match and referencing them later in the regex.

One regex can have multiple groups and these get referenced by their number (starting with 1).

You surround the group in () and reference it with \ followed by the group’s number:

(foo)-(bar) \2\1 ➡️ Will match foo-bar barfoo (notice the space): \2 repeats whatever the second group matched and \1 whatever the first one did.

If you know how sed works, you can probably imagine this can save a lot of headaches.
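A minimal sketch of groups and references in Python's re module. The re.sub call plays a role similar to a sed substitution: the groups captured by the pattern are reused in the replacement. The strings are the ones from the example above.

```python
import re

# Back-references inside the pattern itself
backref = re.search(r"(foo)-(bar) \2\1", "foo-bar barfoo")
print(backref.group())   # foo-bar barfoo
print(backref.group(1))  # foo
print(backref.group(2))  # bar

# Back-references in a replacement: swap the two groups
swapped = re.sub(r"(foo)-(bar)", r"\2-\1", "foo-bar")
print(swapped)  # bar-foo
```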

Negations

You can negate parts of your regex using lookarounds.

Say you want to match all instances of foo followed by anything but bar, followed by baz.
So for example, we want foowhateverbaz to match but not foobarbaz.

A lookahead like foo(?!bar).+?baz would do just that: the match fails whenever the part between the parentheses, after the ?!, appears right at that position.

It simply means ‘not followed by (?!this)’.

Similarly, you might want to go about this the other way around.

If you want to match all instances of foo except when it is preceded by bar, you could use a lookbehind like (?<!bar)foo.

So This whateverfoo is weird would match.

But This barfoo is weird would not.

It simply means ‘not preceded by (?<!this)’.

Both lookaheads and lookbehinds can be used to match a pattern while negating another one.
Which one to use just depends on whether you want to negate something before or after something else.
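Both lookarounds can be tried out with Python's re module. This is a sketch under one assumption to keep in mind: Python only accepts fixed-width lookbehinds (bar is fine, something like .+ inside one is not), and other engines have their own limits.

```python
import re

# Negative lookahead: foo NOT immediately followed by bar
lookahead = re.compile(r"foo(?!bar).+?baz")

ahead_hit = lookahead.search("foowhateverbaz")
ahead_miss = lookahead.search("foobarbaz")

print(ahead_hit.group())  # foowhateverbaz
print(ahead_miss)         # None

# Negative lookbehind: foo NOT immediately preceded by bar
lookbehind = re.compile(r"(?<!bar)foo")

behind_hit = lookbehind.search("This whateverfoo is weird")
behind_miss = lookbehind.search("This barfoo is weird")

print(behind_hit.group())  # foo
print(behind_miss)         # None
```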
