pvdz.ee - Parsing `async` is hard

I recently tweeted about a particular case that's troublesome to parsing. I don't think my tweet properly conveyed the problem with that snippet so I wanted to try and explain it in a blog post.

As you may or may not know I've been working on the next version of my parser (ZeParser 3). I don't have a lot of time for it. I started on this version before I joined Facebook and feel that it's nowhere near finished. But hey, I do enjoy working on it.

I'm writing my parser in a particular way that attempts to avoid backtracking and double processing as much as possible. Yesterday I hit a new test case that may be the first to cause trouble with this goal.

The tweet outlined the particular case at hand:

Code:

async
([ a, b, c ]) => a;

Versus

Code:

async
([ a, b, c ]);

The trouble here a combination of ASI, the async not being a keyword, and a particular rule about newlines. Together they make these two snippets equivalent to:

Code:

async;
([ a, b, c ]) => a;

// and
async([ a, b, c ]);

Both in sloppy and strict mode. The spec does not specify that async is any form of reserved word and only disallows using it as a variable name in certain syntactical cases (like how you can't start an expression with async function unless you make it a function).

The rule that makes the above work is that the async functions are "restricted productions" and are not allowed to have a newline between the async identifier and the function, both regular and arrow functions.

So when parsing you may find an identifier with the name async and while it's very likely you'll be parsing an async function, you may just as easily be parsing a function call. And you won't know until after completely parsing the parenthesis.

Now this is not that much different from the way arrow functions are specified. They have a similar problem in that you first have to parse the complete parenthized group before being able to determine whether or not it's an arrow or a group. However I managed to work around this by parsing a group by default and also tracking whether or not it can, may, or must be an arrow. Then after the group the parser can properly choose what to do when it finds or doesn't find an arrow token (=>).

Mind you, this is only a prolem for the AST. If AST parsing is disabled then the parser has no backtracking whatsoever. But when an AST is also built up then it has to do some cleanup. In particular it has to start at the group (which usually leads to a SequenceExpression) and walk the entire sub-tree, renaming certain nodes. For example ( [ x ] ); leads to an ArrayExpression while ( [ x ] ) => x; leads to an ArrayPattern.

There are only a few nodes to rename and only in the case of an AssignmentExpression a property has to be deleted (so far literally the only node to have this trait). It was an acceptible compromise and I was a bit surprised to discover how generically it worked.

Back to async. Here the problem is more difficult because in the case where it turns out to be an arrow the AST will have two nodes rather than one. And while building up an AST I don't collect upward and merge downward. I merge upward and replace downward. If that makes sense.

Quickly explained; When parsing in general my parser will have an AST, a current node, and a current property. Whenever something new is parsed a new will be added to the current property of the current node and a new current property/node will be set.

Initially there's a root node, Program (and a parent property that's not relevant yet) This roughly looks like this:

Code:

{
  type: 'Program',
  sourceType: 'module',
  directives: [],
  body: [statements and declarations],
}

The directives and sourceType will be set on this Program node and body will be set to a list (array). The first statement will be parsed and be put in body. Let's say it's an if statement. Then the current property will be body and an IfStatement node is created there and set as active. First the test property is parsed on it and then the consequent. The AST now roughly looks like this:

Code:

{
  type: 'Program',
  sourceType: 'module',
  directives: [],
  body: [
    {
      type: 'IfStatement',
      test: node,
      consequent: node,
    },
  ],
}

After completing the if statement the current property remains body and the next statement or declaration is parsed. Etc.

When parsing async the parser will skip the node and start parsing the group. After the group the AST looks roughly like this:

Code:

body: [{
  type: 'ExpressionStatement',
  expressions: [
    {
      type: 'SequenceExpression',
      expressions: [],
    },
    (parser is now here)
  ],
}]

So the SequenceExpression node is closed after parsing the ) and the next token is checked. The parser will still know about having skipped the async identifier token and it did not forget this token was followed by a newline.

If the next token is not an arrow then we can take the last node in the expressions list and convert it to a CallExpression node. That's quite easy.

On the other hand when next token is => then the SequenceExpression is changed to an ArrowFunctionExpression with async: false. The problem is then that we have to inject a new ExpressionStatement with async as the only expression.

So far this is the only case where this is true. And the parser is pretty far at this point. I think it supports the desugaring and arrow syntax generically and pretty decently. And of course most of the other stuff. So it's a little surprising that these kinds of edge cases are causing such pains.

The decision tree for the parser when it finds the async identifier looks like this:

Code:

// parser decision tree based on tokens:
// - `in`: this is `async in y` and is "okay" regardless of newlines
// - `instanceof`: this is `async instanceof y` and is "okay" regardless of newlines
// - newline
//   - `function`: ASI after `async` (the function, if valid, will not be considered to be async)
//   - ident: ASI after `async`
//   - paren open, parsed group
//     - newline: this is `async(..)` and the newline is ignored
//     - eof: this is `async(..)` and the newline is ignored
//     - `=>`: the `async` ident is considered a regular var, ASI must happen, arrow is non-async
//     - else: this is `async(..)` and the newline is ignored
//   - else: `ASI after `async`
// - `function`: ASI after `async`, must be non-async function
// - ident
//   - newline: error (because ASI must occur and `async foo;` is illegal)
//   - `=>`: arrow function without parenthesis
//   - else: error (because `async foo;` is illegal)
// - paren open, parsed group
//   - newline: this is `async(..)`, if the arrow shows up after this group then that's an error
//   - arrow: async arrow
//   - else: `async(..)`
// - else: `async` as a var name

Okay enough rambling about stuff I can't properly explain. Hope you enjoyed it, anyways.