Regex Parsing

2017-05-12

And now you have two problems. Nah, don't worry. I would never use regular expressions for a parser. This post is about parsing regular expression literals. And not for deep analysis or anything but rather as part of my new parser. In fact, just as part of the lexer (tokenizer) for that new parser.

I just finished writing the lexer. Took me about a week or so. Actually longer than expected because I underestimated the regular expression syntax. And I'd love to tell you how it's due to all the new syntax but honestly, that's only half the truth.

This post will only cover parsing the es6+ regex syntax in a lexer. Nothing fancy. But if you're not into parsing or JS, turn back now.

If at all I will link to the ES7 spec as published on the ecma site. I know it's more like a living spec these days but I want something fixed to point to. ES7 it is at the time of writing.

On to it.

The problem


From a cfg point of view parsing regular expressions is actually easy. And in fact there are two goal symbols to parse regular expressions; RegularExpressionLiteral and Pattern. To clarify, the "goal symbol" is a way to describe a particular tree of parsing rules you want to follow to parse a certain thing.

The difference between the two productions is that RegularExpressionLiteral is a fast version with limited checking and error reporting while the underwhelmingly named Pattern actually describes how to properly parse a regular expression literal, including errors and escapes.

Besides those goals there's also an extension of the syntax for web in Annex B. It's pretty horrible and I've ignored it for now. Obviously I'll have no choice but to support it eventually, real world code and all that, but I'll worry about that later.

While the RegularExpressionLiteral goal certainly makes for faster parsing, the lack of error reporting feels underwhelming. And since I intend to use my parser for tooling later on it only makes sense to deal with the hard verification part now. Furthermore I've found some fairly solid ways of parsing the regular expression without backtracking or double evaluation of the input. Applying other verification certainly amounts to some overhead but I think I have that decently contained. Perhaps I'll add a fast regex parsing option later. Though honestly, regular expressions will be a fairly infrequent path so the hot code emphasis won't be there anyways.

The actual problem


From here on out I'll describe how I parse the Pattern goal. Ultimately there are two main problems with this pattern as far as parsing and verification is concerned; ranges and back references.

1: code points


ES6 introduced the concept of unicode code points (versus code units as we were used to). In regular expressions this would cause some trouble so instead of shoehorning that into the existing rules they opted for a new flag, u, that would enable sort of like a strict regex mode that matches code points instead of code units.

2: flags at the finish


Interestingly enough, from a parsing perspective this unicode mode is almost no problem except for ranges. But there it becomes a super big problem due to how unicode surrogate pairs work, how you don't know what mode you're actually parsing in until you've parsed the flags, and how a dash can mean both a range and the actual dash character. And the flags always occur at the end of a regular expression. So either;

- you seek ahead using the RegularExpressionLiteral goal, or
- you just pick a most likely mode and start parsing and restart when you discover it was the wrong mode, or
- you verify both modes simultaneously and remember the result so you can apply it after you've parsed the flags

Since I don't want to backtrack I've opted for that last option. While I parse a regular expression I parse in two different ways; interpret the regex as code units and as code points. This is only relevant for ranges but unfortunately you can't just "ignore until" because of these "surrogate pairs".

3: surrogate pairs


I know, we're tumbling down a rabbit hole but here's where it ends. Surrogate pairs are an alternative way of coding unicode characters with code points beyond 0xffff, the largest code point you can encode with a classic unicode escape (\uffff). While ES6 also introduces a different way of encoding characters beyond the 0xffff it still supports these surrogate pairs. A surrogate pair is a back-to-back low-high code unit, both within certain given ranges. When found back-to-back they form one bigger code point. Kinda like power ranges super kung fu robot merge morphs!

For example, the "Pile of poo" emoji (💩) has code point 0x1F4A9. In ES6 you can encode this as either \u{1F4A9} or as surrogate pairs \uD83D\uDCA9.

The, for us, unfortunate part about this encoding scheme is that it's fully optional; a "leading" surrogate pair char may appear without a "trailing" surrogate char. And trailing chars may appear without leading chars.

4: char class ranges


Crawling back up the rabbit hole, this surrogate pair is a problem for ranges in character classes. So character classes are the /[abc]/ things where you can tell the engine to match the next code unit/point if it's part of a given collection of characters. This collection can have ranges but the way the syntax of these ranges are defined feels a little too implicit. The character that causes a range is a regular dash. However, the syntax and semantics also explicitly a dash to contribute a character to the collection in certain cases. ("A - character can be treated literally or it can denote a range. It is treated literally if it is the first or last character of ClassRanges, the beginning or end limit of a range specification, or immediately follows a range specification." ktnx).

They should have just made a dash special in u-mode and require an escape for a dash literal. Oh well, ships have sailed.

5: range verification


The pivotal kicker for us is that the semantics require ranges to be verified lo-hi pairs. In this context it doesn't matter whether one side of the range was an escaped or a literal code unit/point. With the u flag this concerns code points (so surrogate pairs are relevant) and without the u flag this only concerns code units (ignoring surrogate pairs).

Note 1: "... An error occurs if either ClassAtom does not represent a single character (for example, if one is \w) or if the first ClassAtom's character value is greater than the second ClassAtom's character value."

This rule causes another subtle annoyance, more on that later.

Recap


To reiterate;

- Regexes can have character classes.
- These classes can have ranges.
- Ranges require the use of the dash sign (-) which can also be interpreted as a regular character in certain edge cases.
- Ranges must be valid lo-hi, where the value is based on the unicode code unit or point.
- This latter point depends on a regex flag that may or may not be present and you won't know until you've at least parsed the whole regex in some way.
- If the u flag is used the range depends on code points which means unicode surrogate pairs are must be checked.
- Surrogate pairs are two consecutive characters with each a code unit within a certain range that combined form a large code point.
- With the u flag the range must still be valid in code points.
- And because the flag isn't known until the end you must either track both, pick one and possibly backtrack, or do a quick scan to determine the flag before a real parse.

Okay.

Bytecode desync


The combination of these rules make it such that a range that would be valid with a u flag may be illegal without a u flag.

For example, /[💩-💫]/ is a textbook case of a range that is valid with u flag and invalid without u flag. Thanks Mathias for dropping me down the rabbit hole like that.

It's easier to see why this is a problem when we escape the regex with unicode quads; /[\uD83D\uDCA9-\uD83D\uDCAB]/. As you can see, those emoji are now depicted as their surrogate pairs. So when the u flag is present, this range could match three characters (💩 💪 💫), but without the flag it's really two individual characters on the outside and the range \uDCA9-\uD83D, which is invalid because clearly 0xDCA9 is bigger than 0xD83D.

This will be the case for pretty much any surrogate pair because each side of the surrogate pair only appears in certain ranges (and they're not the same range). And since in theory, they may be used without the u flag, you must assume the worst. However, in practice I doubt you'll ever see this case so you're really coding defensively here. Just a kick in the teeth.

If you want to learn more about this special unicode hell I suggest you read Mathias' blog post about it. He actually knows what he's talking about ;)

Ok, back to me.

Solution


My goal was to parse the regular expression in a single loop, without backtracking, and vigilantly preventing "unrecycled" peeking as much as possible. I got that working fairly well now. I could paste the whole thing here but it's over a 100 loc without any of the sub parsers. Here's some pseudo code:

Code:
while (!eof()) {
let c = read();
if (isEnd(c)) return state;
if (isEscape(c)) c = descape();

// surrogate pair maintenance
if (isPair(prev, c)) c = surrogate(prev, c);
else if (!wasLongEscape()) checkHead(c);

// do range checks for both as if with and without flag
if (urangeOpen) {
if (isCodePointBoundary(c, prev)) {
urangeOpen = false;
confirmRange(uleft, uright);
}
} else if (c === DASH) {
urangeOpen = true;
} else {
uleft = c;
}

// again for without flag. this is easier.
if (nrangeOpen) {
nrangeOpen = false;
confirmRange(nleft, nright);
} else if (c === DASH) {
nrangeOpen = true;
} else {
nleft = c;
}
prev = c;
}

So you process the input in a loop and confirm when unicode code point boundaries are crossed. The cases are:

- the previous char was a head
- either the current is a tail and it's a pair, or
- the current is another head so a boundary was crossed, or
- the current is neither and a boundary was crossed
- the current is no head (in that case we don't have to verify the tail at all)

We need to track surrogate pairs otherwise you run the risk of requiring a backwards peek for something like HT-HT-.

I will probably do some perf testing to see whether it's not just more efficient to do some extra peeking. For know I'm just happy to solve this conundrum with relative elegance :)

Part deux


Oh yes, there were actually two problems when parsing es6 regular expressions. The second one concerns capturing groups and back references. It's a relatively minor one which mainly needs some "manual book keeping" to get right.

When you use something like \123 in a regular expression it doesn't actually contribute a character by that code point to the regex. Instead it refers to the current value of a capturing group by that index. This capturing group may or may not have a value at the time of evaluating the back reference but at parse time all you need to verify is that the group will ultimately exist. Herein lies the problem; you can't tell whether a back reference is a syntax error until you've completely parsed the body.

Note that the only exception is \0. That isn't a back reference but another way to encode a NUL into the regex. That's the character with code point zero. (And when used, the escape may not be followed by other digits or that'd be a syntax error.)

The good news is that we don't need to track and verify the index of each individual back reference. Instead all we track is the highest index of any back reference parsed. At the same time we count the number of capturing groups as we parse them. Once we complete parsing the regex body we do one last check to confirm that the highest back reference index seen is in fact lower or equal to the number of capturing groups. That's all that counts. But you still have to track it.

Note that in character classes the only valid digit escape is \0. You can't use back references there at all. I would say "luckily" but at this point it wouldn't have mattered.

Bonus part


Oh yeah. I'd nearly forget. There is another catch. This one is more of a problem for architecture and efficiency. There are various escapes but only six of them are so called CharacterClassEscape. These are \d (any digit), \s (any whitespace), and \w (any letter or underscore). And their upper case inverted counter parts.

The problem with these chars is that they are outright invalid in a range. It is illegal to have something like /[a-\s]/ because the escapes don't denote a single character.

It's an architectural problem because you generally want to parse the escapes in an abstracted function. Most escapes amount to a certain code point/unit and you need that value to verify the range. But these class escapes are invalid and you won't know whether it'll be a class until you've found the dash. And don't even get me started on the double tracking stuff required thanks to the u flag.

My solution here was to have the backslash parser return the code point value of escaped sequences. When there's an error it will return a certain number. I picked 0x110000 since that's the first illegal code point (in JS?). For the escaped sequences I simply return another, higher, value since I don't care about the actual escape. Parser doesn't care (until I do, but even then I can use a fixed symbol for it).

Extra bonus part


Another catch is that the long unicode escapes are only valid with the unicode flag enabled. Unfortunately that doesn't help us much in my single loop parse. I could jump to a simpler loop that only validates against unicode mode but my setup already has the mechanics in place to track that unicode or non-unicode mode is invalid so I just use that.

The biggest problem there is similar to before; in character classes we need to return the value of the escaped char AND whether it was a long unicode escape. For this I use binary ops and pick a bit that is larger than 0x110000. Since we don't care about the actual escaped value beyond the first one that is illegal, we can simply max the value to 0x110000 and use all the values between 0x110000 and 32bit for whatever we want.

The bitwise hack is ugly but I feel the alternatives are worse.

Underestimated


I definitely underestimated the difficulty of properly (deep) parsing an ES6+ regular expression literal. Thanks Mathias for putting me back on my feet at jsconf :)

Most of the syntax isn't all that difficult but the ranges combined with unicode flag definitely throw a monkey wrench into things.

Of course you can always go the slow way. Either by parsing the regex twice, checking the flag ahead of time, or just backtracking in general. That's not the way I want to play but sometimes you just don't have a choice.