ES5 spec choices

2010-02-23

I'm working on an ECMAScript 5 parser in JavaScript. Yes yes, there are plenty. Whatever. I just want to do this myself.

Having said that, I do keep bumping into small problems. They are probably mostly due to the way the language is described. While the spec provides a complete and explicit CFG (Context Free Grammar), some rules are so convoluted that there seems to be no other way than to rewrite them slightly, hoping the rewrite won't introduce bugs (underparsing or overparsing).

For example, the MultiLineComment production has the following definition:

Code: (CFG)
MultiLineComment ::
/* MultiLineCommentChars(opt) */

MultiLineCommentChars ::
MultiLineNotAsteriskChar MultiLineCommentChars(opt)
* PostAsteriskCommentChars(opt)

PostAsteriskCommentChars ::
MultiLineNotForwardSlashOrAsteriskChar MultiLineCommentChars(opt)
* PostAsteriskCommentChars(opt)

MultiLineNotAsteriskChar ::
SourceCharacter (but not asterisk)

MultiLineNotForwardSlashOrAsteriskChar ::
SourceCharacter (but not forward-slash or asterisk)

(I won't explain how this syntax works; check the Wikipedia page on context-free grammars for more details on how CFGs work.)

These rules are long and get a bit complex. They try to let you parse any character within a multiline comment without allowing the */ sequence, because that's part of MultiLineComment itself.

The problem here is that a parser applying these rules greedily will overparse: the "* PostAsteriskCommentChars(opt)" alternative happily consumes the asterisk that belongs to the closing */. Then there's nothing left to match the mandatory */ required by MultiLineComment, and you have to backtrack to make it work. Nice in theory, bad in practice.
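To make the overparsing concrete, here's a rough sketch of what happens when you translate the spec rules one-to-one into a greedy recursive descent without backtracking. The function names are my own and purely illustrative; this is not how any real parser is obliged to do it.

```javascript
// Hypothetical sketch: each function mirrors one spec production and returns
// the index just past whatever it consumed. No backtracking is attempted.

// MultiLineCommentChars(opt)
function multiLineCommentChars(src, pos) {
  if (pos >= src.length) return pos; // nothing left, (opt) matches empty
  if (src[pos] === '*') {
    // * PostAsteriskCommentChars(opt)
    return postAsteriskCommentChars(src, pos + 1);
  }
  // MultiLineNotAsteriskChar MultiLineCommentChars(opt)
  return multiLineCommentChars(src, pos + 1);
}

// PostAsteriskCommentChars(opt)
function postAsteriskCommentChars(src, pos) {
  if (pos >= src.length) return pos;
  // * PostAsteriskCommentChars(opt)
  if (src[pos] === '*') return postAsteriskCommentChars(src, pos + 1);
  // A forward-slash fits neither alternative, so (opt) matches empty here.
  if (src[pos] === '/') return pos;
  // MultiLineNotForwardSlashOrAsteriskChar MultiLineCommentChars(opt)
  return multiLineCommentChars(src, pos + 1);
}

// MultiLineComment: returns the index just past the comment, or -1 on failure.
function greedyMultiLineComment(src) {
  if (!src.startsWith('/*')) return -1;
  const pos = multiLineCommentChars(src, 2);
  return src.startsWith('*/', pos) ? pos + 2 : -1;
}

console.log(multiLineCommentChars('/* a */', 2)); // 6
console.log(greedyMultiLineComment('/* a */'));   // -1
```

For the input "/* a */" the comment characters stop at index 6, which is the final forward-slash. The asterisk at index 5 is already gone, so the closing */ can't be matched unless the parser gives that asterisk back.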

Besides being awkward to implement, the rules don't make it obvious at first glance how the restriction is applied. So why not write the rule like this:

Code: (CFG)
MultiLineComment ::
/* MultiLineCommentChars(opt) */

MultiLineCommentChars ::
MultiLineCommentChar MultiLineCommentChars(opt)

MultiLineCommentChar ::
SourceCharacter (but not asterisk)
SourceCharacter (lookahead not forward-slash)

Lookahead restrictions are already used elsewhere in the spec, so it's not like they'd have to introduce a new feature. And I think it's easier to understand what's going on, but maybe that's just my current mindset...

MultiLineCommentChar allows any non-asterisk character, or an asterisk as long as it's not followed by a forward-slash. Either way, it always parses exactly one character at a time.

MultiLineCommentChars in turn can be applied recursively until no more characters are matched. And finally MultiLineComment can match the closing sequence */ to complete the whole production.
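Here's a minimal sketch of how the rewritten rules could map onto code, assuming a plain string as input. The names isMultiLineCommentChar and scanMultiLineComment are made up for this post, and the recursion of MultiLineCommentChars is written as a loop.

```javascript
// Hypothetical sketch of the rewritten rules, using one character of lookahead.

// MultiLineCommentChar: any non-asterisk character, or an asterisk that is
// not followed by a forward-slash.
function isMultiLineCommentChar(src, pos) {
  if (pos >= src.length) return false;
  return src[pos] !== '*' || src[pos + 1] !== '/';
}

// MultiLineComment: returns the index just past the closing */,
// or -1 if there is no complete comment starting at `start`.
function scanMultiLineComment(src, start = 0) {
  if (!src.startsWith('/*', start)) return -1;
  let pos = start + 2;
  // MultiLineCommentChars(opt): consume one character at a time.
  while (isMultiLineCommentChar(src, pos)) pos++;
  // The closing */ is still untouched at this point, so no backtracking.
  return src.startsWith('*/', pos) ? pos + 2 : -1;
}

console.log(scanMultiLineComment('/* a **/ b')); // 8
console.log(scanMultiLineComment('/**/'));       // 4
console.log(scanMultiLineComment('/* a *'));     // -1 (unterminated)
```

The loop never consumes the asterisk of the closing */, so the final check either finds it right where the loop stopped or the comment is unterminated.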

I wish I could speak to somebody who was part of creating these rules, so they could explain some of the reasoning behind certain choices.

Oh well, who knows, I might get to know some of those people some day :)