Overzealous regexes

2017-11-25

Today I was trying to solve a parsing problem with regular expressions. Yes, I know. Moving on, the regular expression should be able to skip JS strings and then check for a particular character afterwards. I had that working fine but when I was trying my test cases it wasn't passing. It took me three different regex debug websites to realize what the problem was. The websites did not help. The problem is overzealous backtracking.

Note: this blog is about JS regexes but I'm pretty sure the subject at hand is generic

Okay the test case was as follows:

Code:
stuff require("foo") more stuff
stuff require("foo") more / stuff

In this case the required (hur hur) regex is fairly easy:

Code:
/.*require\("[^"]"\).*\//

It will search for require and require the parens, content enclosed by strings, any content, and a forward slash. You can tweak the eagerness a bit and perhaps add whitespace skipping bits, but this is pretty straightforward stuff.

However. I want to be pixel perfect. In context that means catching all edge cases. In JS, strings can have escapes. While that may not seem immediately relevant, you should note that technically we must guard against escaping the double quote. So the string parsing bit becomes "(?:\\.|[^"])*". Horf. This tells the regex engine to first try to parse a backslash followed by any one character, or otherwise to parse one character that is not a double quote.

This is fine. And it works on something like "fo\"o". Good! Or is it...?

The answer is "no". Can you figure out what the problem is before reading on? (I'd be very impressed if you did)

We have a regex and two test cases; the first should match and the second should not match:

Code:
/.*require\("(?:\\.|[^"])*"\).*\//

stuff require("fo\"o") more stuff
stuff require("fo\"o") more / stuff

And yes, this works. Actually, my requirements are slightly simpler than this and my regex ended up like this:

Code:
/(require\("(?:\\.|[^"])*".*\/.*$)/gm

stuff require("fo\"o") more stuff
stuff require("fo\"o") more / stuff

This still works. The main change is that I dropped the check for a closing parenthesis because that was not relevant for my case. This, combined with the earlier "mistake", proved to be a problem. Look what happens when we mangle the string a bit...

Code:
/require\("(?:\\.|[^"])*".*\/.*$/gm

stuff require("some/path") more stuff
stuff require("some/path") more / stuff

Still fine. The forward slash is relevant though since it was my target character and, obviously, quite common in file paths. It's actually when the double quote is escaped and the target character is also present that the bug is exposed:

Code:
/require\("(?:\\.|[^"])*".*\/.*$/gm

stuff require("so\"me/path") more stuff
stuff require("so\"me/path") more / stuff

And boom. Both cases match. Wat?

It's kind of weird. Ignoring the fact that you're very unlikely to be escaping quotes into a path name, of course. The regex shouldn't match. It can either match a backslash with any character and otherwise any single character that is not a double quote. What's going on here?

The answer is backtracking. There's nothing in the regex that forbids it to parse a backslash as a single character rather than a pair. The way we coded the regex it will first try to match a pair. Failing that it will happily consume one backslash to see whether that works.

The key is the "(?:\\.|[^"])*" bit. You're inclined to read this as "Require a double quote, then if you see a backslash accept it AND any next character. Otherwise, if the character is not a double quote, accept it. Repeat those steps as many times as possible. Afterwards require another double quote ". And nobody would blame you. However, this "conditional" is only going to work for the first pass.

The kicker is in the suffix of my whole regex. While the second test case, the one with a forward slash at the end, parses by design, the first test passes because of backtracking. The regex engine will first try to parse the whole thing and not fail until it reaches the end of the line. Only then it "realizes" that it did not see a forward slash. The engine then "backtracks" to the first point where it can branch its search. That would be the backslash in the string. While we "asked" the regular expression to accept any character after a backslash, we did not forbid it to parse the backslash as part of the [^"] pattern. And so it sees an opportunity to try that and end up with a match. Huzah! Owait...

The fix here is simple, especially after I explained it quite explicitly; we forbid the engine to consume the backslash as a single character: so [^"] becomes [^"\\]. This works.

Code:
/require\("(?:\\.|[^"\\])*".*\/.*$/gm

stuff require("so\"me/path") more stuff
stuff require("so\"me/path") more / stuff

The main takeaway? Regexes do exactly what you say and also anything you don't tell them not to do.

Asimov was right.