I recently tweeted about a particular case that's troublesome to parsing. I don't think my tweet properly conveyed the problem with that snippet so I wanted to try and explain it in a blog post.
As you may or may not know I've been working on the next version of my parser (
ZeParser 3). I don't have a lot of time for it. I started on this version before I joined Facebook and feel that it's nowhere near finished. But hey, I do enjoy working on it.
I'm writing my parser in a particular way that attempts to avoid backtracking and double processing as much as possible. Yesterday I hit a new test case that may be the first to cause trouble with this goal.
The
tweet outlined the particular case at hand:
async
([ a, b, c ]) => a;
Versus
async
([ a, b, c ]);
The trouble here a combination of ASI, the
async
not being a keyword, and a particular rule about newlines. Together they make these two snippets equivalent to:
async;
([ a, b, c ]) => a;
// and
async([ a, b, c ]);
Both in sloppy and strict mode. The spec does not specify that
async
is any form of reserved word and only disallows using it as a variable name in certain syntactical cases (like how you can't start an expression with
async function
unless you make it a function).
The rule that makes the above work is that the
async
functions are "restricted productions" and are not allowed to have a newline between the
async
identifier and the function, both regular and arrow functions.
So when parsing you may find an identifier with the name
async
and while it's very likely you'll be parsing an async function, you may just as easily be parsing a function call. And you won't know until after completely parsing the parenthesis.
Now this is not
that much different from the way arrow functions are specified. They have a similar problem in that you first have to parse the complete parenthized group before being able to determine whether or not it's an arrow or a group. However I managed to work around this by parsing a group by default and also tracking whether or not it can, may, or must be an arrow. Then after the group the parser can properly choose what to do when it finds or doesn't find an arrow token (
=>
).
Mind you, this is only a prolem for the AST. If AST parsing is disabled then the parser has no backtracking whatsoever. But when an AST is also built up then it has to do some cleanup. In particular it has to start at the group (which usually leads to a
SequenceExpression
) and walk the entire sub-tree, renaming certain nodes. For example
( [ x ] );
leads to an
ArrayExpression
while
( [ x ] ) => x;
leads to an
ArrayPattern
.
There are only a few nodes to rename and only in the case of an
AssignmentExpression
a property has to be deleted (so far literally the only node to have this trait). It was an acceptible compromise and I was a bit surprised to discover how generically it worked.
Back to
async
. Here the problem is more difficult because in the case where it turns out to be an arrow the AST will have two nodes rather than one. And while building up an AST I don't collect upward and merge downward. I merge upward and replace downward. If that makes sense.
Quickly explained; When parsing in general my parser will have an AST, a current node, and a current property. Whenever something new is parsed a new will be added to the current property of the current node and a new current property/node will be set.
Initially there's a root node,
Program
(and a parent property that's not relevant yet) This roughly looks like this:
{
type: 'Program',
sourceType: 'module',
directives: [],
body: [statements and declarations],
}
The
directives
and
sourceType
will be set on this
Program
node and
body
will be set to a list (array). The first statement will be parsed and be put in
body
. Let's say it's an
if
statement. Then the current property will be
body
and an
IfStatement
node is created there and set as active. First the
test
property is parsed on it and then the
consequent
. The AST now roughly looks like this:
{
type: 'Program',
sourceType: 'module',
directives: [],
body: [
{
type: 'IfStatement',
test: node,
consequent: node,
},
],
}
After completing the
if
statement the current property remains
body
and the next statement or declaration is parsed. Etc.
When parsing
async
the parser will skip the node and start parsing the group. After the group the AST looks roughly like this:
body: [{
type: 'ExpressionStatement',
expressions: [
{
type: 'SequenceExpression',
expressions: [],
},
(parser is now here)
],
}]
So the
SequenceExpression
node is closed after parsing the
)
and the next token is checked. The parser will still know about having skipped the
async
identifier token and it did not forget this token was followed by a newline.
If the next token is not an arrow then we can take the last node in the
expressions
list and convert it to a
CallExpression
node. That's quite easy.
On the other hand when next token is
=>
then the
SequenceExpression
is changed to an
ArrowFunctionExpression
with
async: false
. The problem is then that we have to inject a new
ExpressionStatement
with
async
as the only expression.
So far this is the only case where this is true. And the parser is pretty far at this point. I think it supports the desugaring and arrow syntax generically and pretty decently. And of course most of the other stuff. So it's a little surprising that these kinds of edge cases are causing such pains.
The decision tree for the parser when it finds the
async
identifier looks like this:
// parser decision tree based on tokens:
// - `in`: this is `async in y` and is "okay" regardless of newlines
// - `instanceof`: this is `async instanceof y` and is "okay" regardless of newlines
// - newline
// - `function`: ASI after `async` (the function, if valid, will not be considered to be async)
// - ident: ASI after `async`
// - paren open, parsed group
// - newline: this is `async(..)` and the newline is ignored
// - eof: this is `async(..)` and the newline is ignored
// - `=>`: the `async` ident is considered a regular var, ASI must happen, arrow is non-async
// - else: this is `async(..)` and the newline is ignored
// - else: `ASI after `async`
// - `function`: ASI after `async`, must be non-async function
// - ident
// - newline: error (because ASI must occur and `async foo;` is illegal)
// - `=>`: arrow function without parenthesis
// - else: error (because `async foo;` is illegal)
// - paren open, parsed group
// - newline: this is `async(..)`, if the arrow shows up after this group then that's an error
// - arrow: async arrow
// - else: `async(..)`
// - else: `async` as a var name
Okay enough rambling about stuff I can't properly explain. Hope you enjoyed it, anyways.