Skip to content

A mental model to understand regex

From time to time, I find an application that seems to be a good fit for regular expressions (regex). In those situations, I usually reread a regex tutorial. Just enough to remember its syntax and fulfill the task at hand. However, I was never really able to understand how regex really works. It was just some kind of efficient and intricate pattern matching on strings that I didn’t really understand.

Lately, I’ve been reading Eloquent JavaScript. It has a chapter dedicated to regular expressions. Since I’m already (kind of) familiar with regex, I almost skipped it. Fortunately, I didn’t. That chapter included a very simple and good explanation of the way a regular expression is tested against strings, which I’m going to describe in this post.

How regex matches patterns

The key idea is that regex operates like a flow diagram. To each pattern (or element) in the regular expression, corresponds one node in the diagram.

Let’s see one such diagram for the 19[0-9]{2} regular expression:

1909One of2 times
Diagram generated at debuggex

When trying to match the regular expressioon against a string, the goal will be to traverse the whole diagram, from left to right. The first node corresponds to the string 19. After that, we need, two times, any of the characters in the range 0-9. This means the regex above will match any number between 1900 and 1999.

The matching starts from the first character in the tested string. Then, each node is matched with a part of the string. If this is successful, it moves to the next node, and so on until a complete match is found. If it fails, the current starting character is skipped and the process is repeated with the next character in the string.

Let’s analyze this process with the following string.

The year was 1984

When matching against the regular expression 19[0-9]{2}], it will start with character T. Since the diagram requires the first character to be 1, it will fail. This process will be repeated all the way to the 13th character, 1. From there, it will verify that the next character is 9, completing the first node 19 in the diagram, and then continue to the next node that requires two times a character in the range 0-9. Finally, it will match 1984.

Let’s take a look at a couple of extra features of regular expressions that can be understood with this diagram.

Branching between patterns

We know that we can match any of cat , pat, and sat with the regex (c|p|s)at. Using the choice operator |, we match any of c, p, or s with the first pattern group. The diagram looks as follows:

cpsGroup 1at
Diagram generated at debuggex

The mechanism at play is very similar to the one described in the previous section. The particularity is that there are several different paths to traverse the diagram from left to right to find a match.

Looping a pattern

We can use special quantifiers to match the same pattern a certain number of times:

  • +: matches the preceding pattern one or more times.
  • *: matches the preceding pattern zero or more times.
  • ?: matches the preceding pattern zero or one time.

This is akin to creating a loop in the diagram.

Let’s see an example for the cats* regular expression.

cats
Diagram generated at debuggex

Here, we have one path where s is not necessary at all to have a match and a loop that will take any number of repeated s characters. Both, cat and catsss are possible matches.

Greedy-matching

The example of the last section raises one final aspect of regular expressions that we will consider. If we were to test the string catsss against the cats+ regular expression, would the corresponding match be cats or catsss? (Remember that the + quantifier requires at least one match)

By default, the quantifiers will match the longest possible string. This is known as greedy-matching. We can imagine that the regular expression will try to match all the way down to the end of the string and backtrack from there until a match is fulfilled.

To get the shortest possible match instead of the longest one, we have to modify the quantifier appending a ?. That’s how we can match for cats using the modified regex cats+?.

Published inArticles
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments