From time to time, I find an application that seems to be a good fit for regular expressions (regex). In those situations, I usually reread a regex tutorial. Just enough to remember its syntax and fulfill the task at hand. However, I was never really able to understand how regex really works. It was just some kind of efficient and intricate pattern matching on strings that I didn’t really understand.
Lately, I’ve been reading Eloquent JavaScript. It has a chapter dedicated to regular expressions. Since I’m already (kind of) familiar with regex, I almost skipped it. Fortunately, I didn’t. That chapter included a very simple and good explanation of the way a regular expression is tested against strings, which I’m going to describe in this post.
How regex matches patterns
The key idea is that regex operates like a flow diagram. To each pattern (or element) in the regular expression, corresponds one node in the diagram.
Let’s see one such diagram for the 19[0-9]{2}
regular expression:
When trying to match the regular expressioon against a string, the goal will be to traverse the whole diagram, from left to right. The first node corresponds to the string 19
. After that, we need, two times, any of the characters in the range 0-9
. This means the regex above will match any number between 1900
and 1999
.
The matching starts from the first character in the tested string. Then, each node is matched with a part of the string. If this is successful, it moves to the next node, and so on until a complete match is found. If it fails, the current starting character is skipped and the process is repeated with the next character in the string.
Let’s analyze this process with the following string.
The year was 1984
When matching against the regular expression 19[0-9]{2}]
, it will start with character T
. Since the diagram requires the first character to be 1
, it will fail. This process will be repeated all the way to the 13th character, 1
. From there, it will verify that the next character is 9
, completing the first node 19
in the diagram, and then continue to the next node that requires two times a character in the range 0-9
. Finally, it will match 1984
.
Let’s take a look at a couple of extra features of regular expressions that can be understood with this diagram.
Branching between patterns
We know that we can match any of cat
, pat
, and sat
with the regex (c|p|s)at
. Using the choice operator |
, we match any of c
, p
, or s
with the first pattern group. The diagram looks as follows:
The mechanism at play is very similar to the one described in the previous section. The particularity is that there are several different paths to traverse the diagram from left to right to find a match.
Looping a pattern
We can use special quantifiers to match the same pattern a certain number of times:
+
: matches the preceding pattern one or more times.*
: matches the preceding pattern zero or more times.?
: matches the preceding pattern zero or one time.
This is akin to creating a loop in the diagram.
Let’s see an example for the cats*
regular expression.