Text: regular expressions

Regular expressions (regex) are insanely useful. When you come across a problem that fits regex well, it will save you hours of work (and provide the warm satisfaction that comes with solving a gnarly problem in a few lines of code). The challenge is that without some mastery you don’t always spot the opportunities.

A modicum of competence with regex will pay dividends throughout your career. The deeper your understanding, the more leverage you will gain from text processing tools (Vim, Sed, Awk, Grep, etc.), and regex support is included in many other tools as well. In addition, most modern languages include built-in support for regex.

There are varying implementations of regexes. In practice, this works out ok as the core regex features are mostly the same across implementations. It’s the more esoteric features that can be frustratingly different, but you quickly know when this is the case and can dive into the relevant documentation.

What exactly are regular expressions?

Explaining what regular expressions are is like writing the instructions for tying shoelaces for someone who has never done it. It’s basically an impossible task and much easier done by example. In any case here goes…

Regular expressions are a language (ok, maybe not technically a language, but let’s go with it for now) for describing a pattern to match arbitrary input text.

Regular expressions are used differently across various tools, but the actual regex patterns themselves are similar.

grep example

grep searches a group of files for lines that match your regex.

Usage: grep '^order:' *.md

This will search every line of every file that matches the *.md glob, looking for lines that match the '^order:' regular expression. This regular expression matches lines that have the characters order: starting in the first column. The ^ character is known as a meta-character and has special meaning in regular expressions. Meta-characters are explained below. Grep will display the matching filenames and line contents on standard output.
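To make the matching step concrete, here is a minimal sketch of the same search in Python. Python's re module is my choice for illustration, not something the grep example depends on, and the sample lines are hypothetical stand-ins for the contents of a *.md file.

```python
import re

# Hypothetical sample lines standing in for the contents of a *.md file.
lines = [
    "title: Regular expressions",
    "order: 3",
    "  order: indented, so it does not start in the first column",
    "reorder: not a match either",
]

# Same pattern as the grep example: "order:" anchored to the start of the line.
pattern = re.compile(r"^order:")

# Keep only the lines the pattern matches, as grep would.
matches = [line for line in lines if pattern.search(line)]
print(matches)
```

Only the second line survives, because ^ rules out both the indented line and the line where order: appears mid-word.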

Grep is its own world which you can read about here

Vim example

In Vim you search using regular expressions by entering / while in normal mode, followed by your regular expression.

Usage: /^order: entered in Vim normal mode will search the current buffer for the characters order: starting in the first column.

You can read more about searching in vim here

The Basics

Meta-characters

Regular expressions contain meta-characters and literals. Meta-characters carry special meaning, while literal characters match themselves exactly as written.

The meta-characters are:

^ start of line
$ end of line
. matches any single character (in most tools, anything except a newline)
| is alternation
( and ) are used to group multiple characters and to limit the scope of alternatives
? indicates optionality, it attaches to the preceding construct
+ indicates one or more, it attaches to the preceding construct
* indicates zero or more, it attaches to the preceding construct
\< match start of word, this is a meta-character sequence (not always supported)
\> match end of word, this is a meta-character sequence (not always supported)
{4,7} is the interval quantifier, this will match what precedes it between 4 and 7 times (not always supported)
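A few of the meta-characters above can be tried out in a short sketch. It assumes Python's re module, and the helper name found is hypothetical. The word-boundary sequences \< and \> are left out because Python's re does not support them.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# ^ and $ anchor the match to the start and end of the line.
print(found(r"^cat", "catalog"), found(r"^cat", "bobcat"))
print(found(r"cat$", "bobcat"))

# . matches any single character.
print(found(r"c.t", "cut"))

# | is alternation; the parentheses limit its scope to "a" or "e".
print(found(r"^gr(a|e)y$", "grey"), found(r"^gr(a|e)y$", "graey"))

# ?, + and * attach to the construct that precedes them.
print(found(r"^colou?r$", "color"))   # the u is optional
print(found(r"^ab+c$", "ac"))         # + needs at least one b
print(found(r"^ab*c$", "ac"))         # * accepts zero b's

# {4,7} requires between 4 and 7 repetitions of the preceding construct.
print(found(r"^[0-9]{4,7}$", "12345"), found(r"^[0-9]{4,7}$", "123"))
```

Anchoring each pattern with ^ and $ in the later checks keeps a partial match from hiding the effect of the quantifier.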

Character Classes

A character class is the set of characters allowed at that particular point. It matches exactly one character from the set.

[ and ] delineate the character class
[ae] will match the letter e or the letter a
[a-z] a dash is used to indicate a range, in this case all lowercase letters
[a-zA-Z] multiple ranges are ok, this is all lowercase and uppercase letters
[0-9$] ranges can be combined with literals, in this case any digit or the dollar sign
[^aeiou] the ^ character as the first character in a character class is used to indicate negation, in this case matching any character that is not a vowel
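The character classes listed above can be checked with a small sketch, again assuming Python's re module. The sample strings and variable names are made up for illustration.

```python
import re

sample = "aB3$"

# [ae] allows exactly one character at that position: either a or e.
either = re.findall(r"gr[ae]y", "gray grey groy")

# A dash inside the brackets defines a range.
lower = re.findall(r"[a-z]", sample)
letters = re.findall(r"[a-zA-Z]", sample)

# Ranges and literals can be mixed in one class.
digit_or_dollar = re.findall(r"[0-9$]", sample)

# A leading ^ negates the class: anything that is not a lowercase vowel.
non_vowels = re.findall(r"[^aeiou]", "regex")

print(either, lower, letters, digit_or_dollar, non_vowels)
```

Each findall call returns one list entry per matched character (or per matched word for the gr[ae]y pattern), which makes it easy to see that a class consumes a single character at a time.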

The meta-characters inside a character class are completely different from the meta-characters in the surrounding regular expression. It is easier to learn them as two separate sets of rules than as a set of character class exceptions.

In general, the meta-characters from above are not treated as meta-characters inside a character class.
- is only a meta-character inside a character class, but if it is the first character after [ or [^ in a character class it is not treated as a meta-character (because it could not represent a range)
^ is only a meta-character in a character class if it is the first character
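These rules are easier to trust once you have seen them behave. The sketch below assumes Python's re module, whose character class rules agree with the ones described here for these cases. Other tools may differ in the corner cases.

```python
import re

# Outside a class ., + and * are meta-characters; inside one they are literals.
literal_metas = re.findall(r"[.+*]", "a.b+c*d")

# A dash right after [ cannot form a range, so it is a literal dash.
leading_dash = re.findall(r"[-a]", "a-b")

# A dash between two characters is a range: a, b or c, but not the dash itself.
range_dash = re.findall(r"[a-c]", "a-b")

# ^ negates only when it comes first; anywhere else it is a literal caret.
trailing_caret = re.findall(r"[a^]", "a^b")
leading_caret = re.findall(r"[^a]", "a^b")

print(literal_metas, leading_dash, range_dash, trailing_caret, leading_caret)
```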

Back Referencing

Another use for parentheses is to match text that is the same as previously matched text.

\1 refers to the result of the first parentheses match, \2 the second, etc.

For example, \<([a-z]+) +\1\> will find lowercase words that are duplicated.

Or, more practically:

$ egrep -i '\<([a-z]+) +\1\>' *.md
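The same duplicate-word search can be sketched in Python, with one substitution that is mine rather than the original's: Python's re module does not support \< and \>, so the sketch uses \b (a word boundary) in both positions. The sample sentence is invented, and re.IGNORECASE plays the role of the -i flag.

```python
import re

# \b stands in for \< and \>, which Python's re module does not support.
duplicate_word = re.compile(r"\b([a-z]+) +\1\b", re.IGNORECASE)

text = "This is is a test of the the regex; this one is fine."

# group(0) is the whole match; group(1) is the word captured by the parentheses.
doubled = [m.group(0) for m in duplicate_word.finditer(text)]
words = [m.group(1) for m in duplicate_word.finditer(text)]
print(doubled, words)
```

Without the boundaries the pattern would also fire inside longer words, for example on the "is is" hidden in "This is", so the anchors do real work here.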

Escaping Meta-characters

The backslash, \, is used to escape meta-characters and treat them as literal characters.

\. in a regular expression will match a literal period character (.).
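A short sketch of the difference, assuming Python's re module. The re.escape helper shown at the end is a Python convenience for escaping a whole string and is not part of regex syntax itself.

```python
import re

text = "abc a.c"

# Unescaped, the dot matches any character, so both words match.
unescaped = re.findall(r"a.c", text)

# Escaped, the dot matches only a literal period.
escaped = re.findall(r"a\.c", text)

# re.escape adds the backslashes for you when the text to match is not known in advance.
raw = "1+1=2"
matches_raw = re.fullmatch(raw, raw) is not None          # + is read as a quantifier
matches_escaped = re.fullmatch(re.escape(raw), raw) is not None

print(unescaped, escaped, matches_raw, matches_escaped)
```

The unescaped pattern 1+1=2 does not even match its own text, because the + is read as "one or more of the preceding 1".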

Resources

Mastering Regular Expressions, 3rd Edition, by Jeffrey Friedl
https://learning.oreilly.com/library/view/mastering-regular-expressions/0596528124/.
Since the late nineties (when I was writing a ton of Perl) this book has been my bible for regular expressions.
