Parser hinting

2014-05-27

The way I currently parse is actually very straightforward. Just "be" the computer as you encounter source code, making decisions only when you have to. It's not too far from being a state machine in that regard. A hand tuned super optimized state machine at that ;) I'm always looking for ways to try and improve the speed of parsing. Sometimes this involves many drastic changes which makes it very difficult to determine whether it's actually worth doing. The problem is that you can't check until you've fully implemented the change. Sometimes with disappointing counter-intuitive results.

One such case was implementing parser hints. This requires a bit of explaining of the current architecture of ZeParser though so let's do that first.

The parser works on tokens it gets from a tokenizer, also called a lexer. The lexer has to look at the current input pointer and determine what kind of token to parse next and how long the token will be. I've written a few posts on this exact problem before so I won't go deeper into it here.

Whenever the parser "consumes" a the current token the lexer will immediately parse the next token. There are many cases where the parser can actually tell what kind of token to expect next. But most of these cases are not known at consume time of the token before it.

When parsing the input source if (foo) bar; the tokenizer will first parse if. The parser will detect it as being a statement start and consume the token. So right now the parser will consume the if token and it _knows_ that the only valid token next would be a ( since anything else is a syntax error. Well not everything because there's still this pesky whitespace, which can appear pretty much anywhere and still adds about eleven choices into the mix (seven spaces and four newlines, not counting unicode spaces), of which four are actually common (space, tab, cr, lf). However, 12 token start choices is still better than 100...

The problem with this approach is when reusing the same parsing function for various reasons, making it difficult to predict the next input. The header of the example above contains an "expressions", meaning one or more expressions divided by comma's. This "expressions" part of the code is parsed by a generic high level parser function. It is used pretty much anywhere where multiple expressions are allowed (expression statement, statement header, function call, dynamic property value, case arg, etc). In each such case the hint is a comma or any of: closing parenthesis, colon, semi-colon, statement start. Which depends on the first parser function that needs to read the next token.

So anyways. I had rewritten the architecture to lazily parse. I figured this would be faster anyways. It could optimize some cases where for example only binary operators were never validly possible, and much more. But when I implemented it I ended up with a considerable perf regression! I had to discard a whole evening of work and start over. The worst thing is that I'm not sure that the approach is bad, or maybe just one minor part of it. And maybe I'm JIT penalizes me for some unexpected tricks. I may never know :)

For now, ZeParser2 will not do hinting and is still capable of getting very high perf. I'm confident that hinting can be implemented in at least some cases though to go even faster.

I did have one success story with significant perf improvement though: it's very likely heuristic to expect spaces or tabs after parsing a newline. Obviously indentation is key here. So currently the only place where hinting is applied is when parsing newlines. In that case the newline parsing function will also scan for spaces and tabs and consume them if it encounters them. While numbers don't mean anything, the perf increase was a significant percentage :D