WhiteSpace

2011-01-27

Parsing ECMAScript (JavaScript, js) is not very easy. Well, that's my opinion anyway. The problem is that the grammar given by the specification is not entirely context-free, due to the regular expression literal. The spec simply says that the parser should tell the tokenizer whether it's expecting a regex or a div next, and that there are no cases where both may be parsed (so no ambiguity).
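
A tiny illustration (contrived, but it shows the point): the exact same slash character has to become a different token depending on where the parser is in the grammar.

    var a = 8, b = 2, c = 1;
    var x = a / b / c;       // both slashes are DivPunctuators: 8 / 2 / 1
    var y = /b/.test('abc'); // here the slash opens a RegularExpressionLiteral
    console.log(x, y);       // 4 true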

Whilst I still have to write a big-ass blog post about actually parsing js as a whole, I'd like to touch on a smaller subset of that problem: when to allow spaces in a token and when not to. This distinction is not explicitly stated in the spec. Now, the spec assumes a little more parsing knowledge than I had when I started, so maybe this is just another piece of knowledge I was missing, or maybe it's something I solved for myself while better ways exist. Who knows, maybe you'll be able to tell me afterwards ;)

So the way my parser takes on js source is through a tokenizer, which simply parses tokens (that would be WhiteSpace, LineTerminator, SingleLineComment, MultiLineComment, Token, RegularExpressionLiteral and DivPunctuator). Two of these token types can be confused with each other: RegularExpressionLiteral and DivPunctuator. The only way to properly parse something like a/b/c and / / / / / (yes, that's valid) using the grammar is to tell the tokenizer whether you might be expecting a regex or a division.
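
Here's a minimal sketch of that decision point; the names are made up, the regex scan is heavily simplified (no character classes, no flags, no error handling), but it shows why the tokenizer needs the hint:

    // Decides between DivPunctuator and RegularExpressionLiteral at a '/',
    // based on a hint that only the parser can provide.
    function nextSlashToken(input, pos, expectRegex) {
        if (input[pos] !== '/') return null;
        if (!expectRegex) return { type: 'DivPunctuator', value: '/' };
        var end = pos + 1;
        while (end < input.length && input[end] !== '/') {
            if (input[end] === '\\') end++; // skip an escaped character
            end++;
        }
        return { type: 'RegularExpressionLiteral', value: input.slice(pos, end + 1) };
    }

    // "/ / / / /" is regex, div, regex: a regex matching one space,
    // divided by another regex matching one space.
    console.log(nextSlashToken('/ / / / /', 0, true));  // / / (regex)
    console.log(nextSlashToken('/ / / / /', 4, false)); // /   (division)
    console.log(nextSlashToken('/ / / / /', 6, true));  // / / (regex)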

Now, at least in my parser, I get that pretty much for free. The parser is an LL(1) top-down parser, so it starts at Program and runs all the way down, backtracking on failure. It simply checks what the next production of the current rule is, and if that is in fact a regular expression, it tells the tokenizer so. Otherwise the tokenizer assumes a division. This works, no problem.
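
In sketch form it looks something like this (hypothetical names, and it runs over an already-tokenized stream for brevity; in the real setup the regex/div hint goes to the tokenizer on demand):

    // Each rule remembers the input position and restores it on failure,
    // which is all the backtracking amounts to.
    function attempt(state, rule) {
        var saved = state.pos;
        if (rule(state)) return true; // production matched, keep position
        state.pos = saved;            // backtrack
        return false;
    }

    function consume(state, type) {
        var token = state.tokens[state.pos];
        if (token && token.type === type) { state.pos++; return true; }
        return false;
    }

    // PrimaryExpression, reduced to two alternatives here. Because a regex
    // may start this production, the tokenizer would be told to expect one.
    function parsePrimary(state) {
        return attempt(state, function (s) { return consume(s, 'RegularExpressionLiteral'); })
            || attempt(state, function (s) { return consume(s, 'Identifier'); });
    }

    var state = { pos: 0, tokens: [{ type: 'Identifier', value: 'a' }] };
    console.log(parsePrimary(state)); // true: regex alternative failed, backtracked, identifier matched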

So you end up with tokens. The spec tells you to discard all the insignificant tokens except for LineTerminators (they are ignored by the syntax, but might cause ASI) and to replace a MultiLineComment with a single LineTerminator if and only if the comment contains at least one newline, because that would cause ASI just as well. Either way, we end up with a stream of Token, RegularExpressionLiteral, DivPunctuator and LineTerminator tokens. In my parser, these aren't actually discarded, but become part of the parse tree. The LineTerminators are remembered through a bit of state; when testing for ASI this state is checked, and whenever anything other than a newline is parsed, the state is cleared. In the end, the parser only really evaluates Token, RegularExpressionLiteral and DivPunctuator. They are all the same at that point, so I don't have to distinguish between them in the parser.
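
The newline bookkeeping boils down to something like this (a simplified sketch with made-up names, not the actual code):

    var newlineBeforeNext = false;

    function onToken(token) {
        if (token.type === 'LineTerminator') {
            newlineBeforeNext = true;
        } else if (token.type === 'MultiLineComment') {
            // A multi-line comment spanning a newline counts as a LineTerminator.
            if (/[\r\n\u2028\u2029]/.test(token.value)) newlineBeforeNext = true;
        } else if (token.type !== 'WhiteSpace' && token.type !== 'SingleLineComment') {
            // A significant token: ASI may consult the flag, then it is cleared.
            token.newlineBefore = newlineBeforeNext;
            newlineBeforeNext = false;
        }
    }

    var ret = { type: 'Token', value: 'return' };
    var x = { type: 'Token', value: 'x' };
    onToken(ret);
    onToken({ type: 'LineTerminator', value: '\n' });
    onToken(x);
    console.log(x.newlineBefore); // true: ASI terminates the `return` statement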

So now the tricky part. When parsing (not tokenizing), you're kind of applying a second grammar onto the "anonymous" tokens, while also ignoring "whitespace" (including comments and the like). So while parsing the rules for a production, tokens are supposed to be delimited by "whitespace" or punctuators. This is handled automatically by the tokenizer, since the current token ends whenever whitespace is encountered.
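
A trivial illustration: the whitespace is exactly what keeps two tokens from fusing into one.

    // `var x` is two tokens; `varx` is a single identifier, not a keyword.
    function words(src) {
        return src.split(/\s+/).filter(Boolean); // crude: splits on whitespace only
    }
    console.log(words('var x')); // [ 'var', 'x' ]
    console.log(words('varx'));  // [ 'varx' ]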

Let's say the parser is trying to parse the production PropertyName. This has three rules: IdentifierName, StringLiteral and NumericLiteral. The tokenizer returns me a Token. I don't know what kind of token it is, so I have to apply the grammar to the string covered by the returned token. If that's an IdentifierName, the parser will continue with that production. It always remembers the last parsed token. Other than that it's agnostic: it doesn't know that it shouldn't parse more tokens to fulfill the "longest parsed string wins" rule. For IdentifierName that's not a big problem. The smallest units in that production are the letters of the name, and they don't contain spaces, so when whitespace is encountered, the (recursive) match process ends.
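
Roughly, the classification step amounts to something like this (simplified regexes standing in for the real grammar rules, which allow a lot more, like Unicode in identifiers):

    // Check the token's string against each PropertyName alternative.
    function matchPropertyName(value) {
        if (/^[A-Za-z$_][A-Za-z0-9$_]*$/.test(value)) return 'IdentifierName';
        if (/^(['"])[\s\S]*\1$/.test(value)) return 'StringLiteral';
        if (/^(\d+(\.\d*)?|\.\d+)$/.test(value)) return 'NumericLiteral';
        return null; // no alternative matched, so the parser backtracks
    }
    console.log(matchPropertyName('foo'));   // IdentifierName
    console.log(matchPropertyName('"a b"')); // StringLiteral
    console.log(matchPropertyName('3.14'));  // NumericLiteral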

Enter strings. They are also built up out of characters, but unlike identifiers they can contain whitespace, so you can't just blindly rely on the system like that. This puzzled me for a while. In the end I solved it by adding a "local search" state whenever I was parsing strings, numbers, regexes and identifiers. The reason identifiers are also in this list is that otherwise a partial match could trigger a reject, because it would match a reserved word (var variable;). Or something, I don't exactly recall.
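
The gist of that local search is something like this (a simplified sketch, not the actual code): the match is confined to the characters of the one token at hand, and only a match covering the whole token counts.

    function matchWithin(tokenValue, pattern) {
        var m = pattern.exec(tokenValue);
        // A partial match (like `var` at the start of `variable`) is rejected.
        return m !== null && m[0].length === tokenValue.length;
    }
    console.log(matchWithin('variable', /^var/));  // false: not the reserved word
    console.log(matchWithin('var', /^var/));       // true
    console.log(matchWithin('"a b"', /^"[^"]*"/)); // true: inner spaces can't end the match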

Writing this, and having rewritten the cfg parser to be quite a bit more efficient, I'm going to change this mechanism. A Token should be identifiable by checking its first child (and otherwise you'll know from the token directly). That way you can let the token determine what will be parsed, and also optimize a little, because the tokenizer will already have built the remaining parse tree.
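
Something like this (a rough sketch of the idea; in reality the dispatch would be on the token's first child node rather than its first character):

    function classify(token) {
        if (token.type !== 'Token') return token.type; // regex/div are already known
        var first = token.value.charAt(0);
        if (first === '"' || first === "'") return 'StringLiteral';
        if ((first >= '0' && first <= '9') || first === '.') return 'NumericLiteral';
        return 'IdentifierName'; // includes keywords; those are checked afterwards
    }
    console.log(classify({ type: 'Token', value: '"hi"' })); // StringLiteral
    console.log(classify({ type: 'Token', value: '42' }));   // NumericLiteral
    console.log(classify({ type: 'Token', value: 'foo' }));  // IdentifierName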

I'm in the phase of hooking the new cfg parser back up to the old parser. I'm going to rewrite the parser too, but for now I'm really curious what kind of gains I'll get from introducing a new architecture to the cfg. Hindsight always gives you great insights into what's likely to work better. I've been sick the past week, so progress has been very slow: I either couldn't work on it or couldn't get myself to work on it, due to not feeling well (or being plain sick). But I'm getting better, so that'll be sorted fast.