From time to time, I find an application that seems to be a good fit for regular expressions (regex). In those situations, I usually reread a regex tutorial. Just enough to remember its syntax and fulfill the task at hand. However, I was never really able to understand how regex really works. It was just some kind of efficient and intricate pattern matching on strings that I didn’t really understand.

Lately, I’ve been reading Eloquent JavaScript. It has a chapter dedicated to regular expressions. Since I’m already (kind of) familiar with regex, I almost skipped it. Fortunately, I didn’t. That chapter included a very simple and good explanation of the way a regular expression is tested against strings, which I’m going to describe in this post.

## How regex matches patterns

The key idea is that regex operates like a flow diagram. To each pattern (or element) in the regular expression, corresponds one node in the diagram.

Let’s see one such diagram for the `19[0-9]{2}` regular expression:

When trying to match the regular expressioon against a string, the goal will be to traverse the whole diagram, from left to right. The first node corresponds to the string `19`. After that, we need, two times, any of the characters in the range `0-9`. This means the regex above will match any number between `1900` and `1999`.

The matching starts from the first character in the tested string. Then, each node is matched with a part of the string. If this is successful, it moves to the next node, and so on until a complete match is found. If it fails, the current starting character is skipped and the process is repeated with the next character in the string.

Let’s analyze this process with the following string.

`The year was 1984`

When matching against the regular expression `19[0-9]{2}]`, it will start with character `T`. Since the diagram requires the first character to be `1`, it will fail. This process will be repeated all the way to the 13th character, `1`. From there, it will verify that the next character is `9`, completing the first node `19` in the diagram, and then continue to the next node that requires two times a character in the range `0-9`. Finally, it will match `1984`.

Let’s take a look at a couple of extra features of regular expressions that can be understood with this diagram.

## Branching between patterns

We know that we can match any of `cat` , `pat`, and `sat` with the regex `(c|p|s)at`. Using the choice operator `|`, we match any of `c`, `p`, or `s` with the first pattern group. The diagram looks as follows:

The mechanism at play is very similar to the one described in the previous section. The particularity is that there are several different paths to traverse the diagram from left to right to find a match.

## Looping a pattern

We can use special quantifiers to match the same pattern a certain number of times:

• `+`: matches the preceding pattern one or more times.
• `*`: matches the preceding pattern zero or more times.
• `?`: matches the preceding pattern zero or one time.

This is akin to creating a loop in the diagram.

Let’s see an example for the `cats*` regular expression.

Here, we have one path where `s` is not necessary at all to have a match and a loop that will take any number of repeated `s` characters. Both, `cat` and `catsss` are possible matches.

## Greedy-matching

The example of the last section raises one final aspect of regular expressions that we will consider. If we were to test the string `catsss` against the `cats+` regular expression, would the corresponding match be `cats` or `catsss`? (Remember that the `+` quantifier requires at least one match)

By default, the quantifiers will match the longest possible string. This is known as greedy-matching. We can imagine that the regular expression will try to match all the way down to the end of the string and backtrack from there until a match is fulfilled.

To get the shortest possible match instead of the longest one, we have to modify the quantifier appending a `?`. That’s how we can match for `cats` using the modified regex `cats+?`.

Published inArticles
Subscribe
Notify of