Text: regular expressions

Regular expressions (regex) are insanely useful. When you come across a problem that fits regex well, it will save you hours of work (and provide the warm satisfaction that comes with solving a gnarly problem in a few lines of code). The challenge is that without some mastery you don’t always spot the opportunities.

A modicum of competence with regex will pay dividends throughout your career. The deeper your understanding, the more leverage you will gain from text processing tools (Vim, Sed, Awk, Grep, etc.), and regex support is included in many other tools as well. In addition, most modern languages include built-in support for regex.

There are varying implementations of regex. In practice this works out OK, as the core regex features are mostly the same across implementations. It’s the more esoteric features that can be frustratingly different, but you quickly learn when this is the case and can dive into the relevant documentation.

What exactly are regular expressions?

Explaining what regular expressions are is like writing the instructions for tying shoelaces for someone who has never done it. It’s basically an impossible task and much easier done by example. In any case here goes…

Regular expressions are a language (OK, maybe not technically a language, but let’s go with it for now) for describing a pattern to match against arbitrary input text.

Regular expressions are used differently across various tools, but the actual regex patterns themselves are similar.

grep example

grep searches a group of files for lines that match your regex.

Usage: grep '^order:' *.md

This searches every line of every file matching the *.md glob for the '^order:' regular expression. The expression matches lines that begin with the characters order: (that is, starting in the first column). The ^ character is known as a meta-character and has special meaning in regular expressions. Meta-characters are explained below. Grep prints the matching lines to standard output, prefixed with the filename when more than one file is searched.
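To see the same pattern outside of grep, here is a minimal sketch in Python (the language choice is an assumption of this example, not something the grep workflow requires). The helper names grep_lines and grep_files are hypothetical, and the output format only roughly imitates grep.

```python
import re
from pathlib import Path

# Same pattern as the grep example: "order:" starting in the first column.
ORDER_LINE = re.compile(r"^order:")


def grep_lines(pattern, text):
    """Return the lines of `text` in which `pattern` matches somewhere."""
    return [line for line in text.splitlines() if pattern.search(line)]


def grep_files(pattern, directory=".", glob="*.md"):
    """Yield (filename, line) pairs for every matching line in matching files."""
    for path in sorted(Path(directory).glob(glob)):
        for line in grep_lines(pattern, path.read_text(encoding="utf-8")):
            yield path.name, line


# Only the second line starts with "order:"; the indented and
# "reorder:" lines are rejected by the ^ anchor.
sample = "title: Regex\norder: 3\n  order: indented\nreorder: no\n"
print(grep_lines(ORDER_LINE, sample))
```

Calling grep_files(ORDER_LINE) in a directory of Markdown files would walk the *.md files much as the shell glob does.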

Grep is its own world which you can read about here

Vim example

In Vim you search using regular expressions by entering / while in normal mode, followed by your regular expression.

Usage: /^order: entered in Vim normal mode will search the current buffer for the characters order: starting in the first column.

You can read more about searching in vim here

The Basics

Meta-characters

Regular expressions contain meta-characters and literals. Meta-characters carry special meaning, while literal characters simply match themselves in the input text.

The meta-characters are:
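To make the most common ones concrete, here is a minimal sketch using Python's re module. Python is an assumption of this example, and other tools differ in places (GNU grep's basic syntax, for instance, wants a backslash before + and ?). It covers . ^ $ * + ? | and ( ), which is a selection rather than a complete list. The helper name matches is hypothetical.

```python
import re


def matches(pattern, text):
    """True if `pattern` matches anywhere in `text`."""
    return re.search(pattern, text) is not None


# (pattern, text, expected) triples, one per meta-character.
examples = [
    (r"c.t", "cat", True),         # .  any single character (except newline)
    (r"^cat", "concat", False),    # ^  anchors the match at the start
    (r"cat$", "concat", True),     # $  anchors the match at the end
    (r"ab*c", "ac", True),         # *  zero or more of the preceding item
    (r"ab+c", "ac", False),        # +  one or more of the preceding item
    (r"colou?r", "color", True),   # ?  the preceding item is optional
    (r"cat|dog", "hotdog", True),  # |  either alternative
    (r"(ab)+", "ababab", True),    # () groups items so + applies to "ab"
]

for pattern, text, expected in examples:
    assert matches(pattern, text) == expected
    print(f"{pattern!r:12} on {text!r:10} -> {expected}")
```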

Character Classes

A character class, written inside square brackets, is the set of characters allowed at that particular point in the pattern. It matches exactly one character from the set.

The meta-characters inside a character class are completely different from the meta-characters in the enclosing regular expression. It is easier to learn them as two separate sets of rules rather than as a set of character class exceptions.
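A short sketch of both points, again assuming Python's re module: a class consumes exactly one character, and characters such as ^ . * and - follow different rules inside the brackets than outside them.

```python
import re

# [ae] allows "a" or "e" at that position, and consumes exactly one character.
assert re.fullmatch(r"gr[ae]y", "grey")
assert re.fullmatch(r"gr[ae]y", "gray")
assert not re.fullmatch(r"gr[ae]y", "graey")

# Inside a class, - between two characters defines a range.
digits = re.findall(r"[0-9]", "a1b22")

# Inside a class, a leading ^ negates the set instead of anchoring.
non_digits = re.findall(r"[^0-9]", "a1b22")

# . and * lose their special meaning inside a class.
punctuation = re.findall(r"[.*]", "a.b*c")

# A - at the start (or end) of a class is a literal hyphen.
signs = re.findall(r"[-+]", "1-2+3")

print(digits, non_digits, punctuation, signs)
```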

Back Referencing

Another use for parentheses is to match text that is the same as previously matched text.

\1 refers to the result of the first parentheses match, \2 the second, etc.

For example, \<([a-z]+) +\1\> will find lowercase words that are duplicated.

Or, more practically:

$ egrep -i '\<([a-z]+) +\1\>' *.md
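The same idea can be sketched in Python, with one substitution: Python's re module does not support the \< and \> word anchors, so this version uses \b (a word boundary at either end) instead. The function name find_doubled_words is hypothetical, and re.IGNORECASE stands in for the -i flag.

```python
import re

# \1 must repeat whatever the first parenthesised group matched.
# \b replaces the \< and \> anchors, which Python's re does not provide.
DOUBLED_WORD = re.compile(r"\b([a-z]+) +\1\b", re.IGNORECASE)


def find_doubled_words(text):
    """Return each word that is immediately followed by itself."""
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]


print(find_doubled_words("This is is a test of the the regex."))
# The trailing \b stops "the theory" from counting as a duplicate.
print(find_doubled_words("Consider the theory."))
```

Without the word boundaries, "This is" would be reported as a doubled "is", because the pattern could start matching in the middle of "This".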

Escaping Meta-characters

The backslash, \, is used to escape meta-characters so that they are treated as literal characters.

\. in a regular expression will match the period, ., character explicitly.
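A minimal sketch of the difference, assuming Python's re module. It also shows re.escape, a standard-library function that adds the backslashes for you when the text to match comes from somewhere else (the variable name user_text is just for illustration).

```python
import re

# Unescaped, . matches any character, so "3.14" also matches "3514".
assert re.search(r"3.14", "3514")

# Escaped, \. matches only a literal period.
assert re.search(r"3\.14", "3.14")
assert not re.search(r"3\.14", "3514")

# re.escape backslash-escapes every meta-character in a string.
user_text = "1+1=2?"
escaped = re.escape(user_text)

# Used raw, + and ? act as meta-characters and the text fails to match itself.
assert re.search(user_text, "1+1=2?") is None
# Escaped, the same text is matched literally.
assert re.search(escaped, "is 1+1=2? yes")

print(escaped)
```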

Resources

Mastering Regular Expressions, 3rd Edition, by Jeffrey Friedl
https://learning.oreilly.com/library/view/mastering-regular-expressions/0596528124/
Since the late nineties (when I was writing a ton of Perl) this book has been my bible for regular expressions.
