7 Lexical Conventions

2010-03-09

So let's get down to the parsing.

The input source should first be converted to a sequence of input elements, which are tokens, line terminators, comments, or white space. This is done by repeatedly matching productions of the lexical grammar against the source text. Whatever substring matches at the front becomes one input element. Every time this happens, the lexical parser should take the longest possible sequence of characters from the front. After an input element has been formed, the "front" moves that many characters forward.
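To make the "longest match at the front" idea concrete, here is a minimal sketch. The rule table is a hypothetical, heavily simplified stand-in for the real lexical productions, and the names `rules` and `tokenize` are my own, not the specification's.

```javascript
// Hypothetical, heavily simplified rule set. The real lexical grammar
// has many more productions; these are only here to show the mechanism.
const rules = [
  { type: "WhiteSpace", pattern: /^[ \t]+/ },
  { type: "IdentifierName", pattern: /^[A-Za-z_$][A-Za-z0-9_$]*/ },
  { type: "NumericLiteral", pattern: /^[0-9]+/ },
  // Longer punctuators are listed first so the alternation prefers them.
  { type: "Punctuator", pattern: /^(?:===|==|=|\+\+|\+=|\+)/ },
];

function tokenize(source) {
  const elements = [];
  let pos = 0; // the "front" of the remaining input
  while (pos < source.length) {
    const rest = source.slice(pos);
    let best = null;
    // Try every rule and keep the longest match at the front.
    for (const rule of rules) {
      const match = rule.pattern.exec(rest);
      if (match && (best === null || match[0].length > best.value.length)) {
        best = { type: rule.type, value: match[0] };
      }
    }
    if (best === null) {
      throw new SyntaxError("Unexpected character at position " + pos);
    }
    elements.push(best);
    pos += best.value.length; // move the front past the matched characters
  }
  return elements;
}

console.log(tokenize("a+++b").map((e) => e.value)); // [ 'a', '++', '+', 'b' ]
```

Note how "a+++b" comes out as "a", "++", "+", "b": the lexical parser grabs "++" first because it is longer than "+", without considering what would make sense syntactically.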

For the lexical grammar, there are two "goal symbols": InputElementDiv and InputElementRegExp.

Here we see our first challenge. A forward slash might be a division operator, but it might also be the start of a regular expression literal. The syntax of both is explained in more detail later. Whenever the input source contains two forward slashes with characters in between that could be parsed as a RegularExpressionBody, they will be parsed that way, and should be, because that is the longest possible sequence. Much longer than a single forward slash.

The major problem is that the lexical parser cannot make this distinction on its own! It simply cannot tell one from the other without some kind of higher-level parser telling it which one to expect. This means that, while the lexical grammar is sufficient to describe the input elements of the language, a lexical parser alone cannot properly handle the "a/b/c" case, because it doesn't know that no syntactic production could match an Identifier directly followed by a RegularExpressionLiteral...

However, the specification clearly states that there are no syntactic contexts where a forward slash might be ambiguous. It's either division or the start of a regular expression, but never possibly both.

So. The lexical grammar can parse the entire language to some degree, but it needs a syntactic parser to tell it which of the two goal symbols it should parse next.
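A rough sketch of that cooperation, under loud assumptions: the two patterns are simplified stand-ins for the real productions, and `chooseGoal` is a hypothetical, very crude substitute for the syntactic parser (it only looks at whether the previous element was an identifier, which is wrong for keywords such as "return"). It is only meant to show how the goal symbol changes the result for "a/b/c".

```javascript
// Simplified stand-ins for real productions, not the full lexical grammar.
const IDENTIFIER = /^[A-Za-z_$][A-Za-z0-9_$]*/;
const REGEXP_LITERAL = /^\/(?:[^\/\\\n]|\\.)+\/[A-Za-z]*/;

// Read one input element at `pos`, using the goal symbol the caller asks for.
function nextInputElement(source, pos, goal) {
  const rest = source.slice(pos);
  let match = IDENTIFIER.exec(rest);
  if (match) return { type: "IdentifierName", value: match[0] };
  if (rest[0] === "/") {
    // Only under InputElementRegExp may a slash start a regular expression.
    if (goal === "InputElementRegExp" && (match = REGEXP_LITERAL.exec(rest))) {
      return { type: "RegularExpressionLiteral", value: match[0] };
    }
    return { type: "Punctuator", value: "/" };
  }
  throw new SyntaxError("Unexpected character at position " + pos);
}

// The caller supplies chooseGoal, which plays the role of the syntactic parser.
function tokenizeWithGoal(source, chooseGoal) {
  const elements = [];
  let pos = 0;
  while (pos < source.length) {
    const element = nextInputElement(source, pos, chooseGoal(elements));
    elements.push(element);
    pos += element.value.length;
  }
  return elements;
}

// Hypothetical heuristic: after an identifier a slash can only be division.
function chooseGoal(elements) {
  const previous = elements[elements.length - 1];
  return previous && previous.type === "IdentifierName"
    ? "InputElementDiv"
    : "InputElementRegExp";
}

console.log(tokenizeWithGoal("a/b/c", chooseGoal).map((e) => e.value));
// [ 'a', '/', 'b', '/', 'c' ]
console.log(tokenizeWithGoal("a/b/c", () => "InputElementRegExp").map((e) => e.value));
// [ 'a', '/b/c' ]
```

With the goal always set to InputElementRegExp, the longest-match rule swallows "/b/c" as a regular expression literal with the flag "c". With the goal chosen from context, the same input comes out as two divisions.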