About three years ago I started writing on what ended up becoming Tenko. A JavaScript parser written in JavaScript. I had written a JS parser before; ZeParser. That was only for the old version of JS; "ES5". Since I wanted to write tooling based on "modern JS" I needed to improve that parser or start from scratch. I started from scratch.
I kind of expected the parser to be somewhat similar and to be pretty straightforward. Once you figure out the div disambiguation and ASI rules, parsing ES5 is actually not that hard. So I set out to write a parser for ES6+. And that ended up being Tenko.
Goals
I had a few goals in mind when I started out with Tenko.
- Simple code
- No backtracking (to the extreme)
- Lots of tests
- 100% Spec compliant
- Version specific support
- Script and module goal
- Strict modern
- Annex B support
- AST support, token only mode
- Performant
- As few objects as possible (for perf)
Note that I had no business goals. I started writing this thing for fun and enjoyed working on it. There's no ultimate goal other than it being a basis for other tools to write.
Performance
I wanted to write the fastest ES6+ parser.
ZeParser2, my ES5 parser, was really fast. But it was kind of cheating; it didn't produce an AST. While I wrote the parser for tools that didn't _really_ need the AST, it always kind of stung me that I hadn't bothered to support ASTs. So that's something I wanted to fix in this iteration.
When I started on Tenko I had to consider the architecture. With performance in mind there were a few key choices I had to pick from. Would it be flat functions or a class? Would I rely on prototype or pass on state through args? Do I want closures or pass on state through an object (or prototype).
All these choices were around keeping state; Do I keep state in a class instance, a plain object, juggle them in function args, global closures? I reached out but got no real definitive answer. Ultimately I went with module globals and a lot of juggling. In hindsight ... I'm not so sure that was the best choice. But that's what I picked.
I went with functions and argument juggling because I thought they would be the easiest to optimise by the engines. Simple code is easy to interpret so if I wrote straightforward structures, they would be inlined super well. At least that was my reasoning. I think it should still hold, but there are caveats.
Backtracking
Ultimately a parser is about looking at parts of input and considering what to do next. In some cases this doesn't give you enough information and you have to look ahead to make the right choice. While for most cases this look ahead is a fixed "distance", there are some cases where this could potentially be megabytes of input before you know. And if you made the wrong guess you would have to go back and parse it with a different interpretation. That is called "backtracking".
While backtracking wasn't a huge deal for ES5, there are now three cases in ES6 where this plays an important role; arrows, patterns, and the unicode flag in regular expressions. In all of these cases you potentially scan forward an arbitrary number of bytes until you know how you should have been parsing those bytes.
In the spec this is called a "cover grammar", which is then "refined". For arrows you have to scan forward until finding (or not finding) the
=>
. For patterns this is an assignment. For regular expressions you have to scan forward the entire regular expression and see the flags before you know how to exactly parse/validate that regular expression.
Granted, the regular expression case is probably the least dangerous in this regard. With some notable exceptions, regular expressions tend to be relatively small and the cover grammar is simple. This matter seems a bit worse for patterns and arrows although in the grand scheme of things, most real world arrows and patterns will be small as well.
Still, I set it to parse all of the structures in one go, keeping various state to deal with ambiguity cases without backtracking. While I realised at some point this was a mistake and caused me to violate the "simple code" principle, I was too far in and stuck to it. In hindsight, it's probably also defying its own point of being performant. I think now, looking back, it's probably faster to apply the cover grammar and do a reparse. But hey, when I started Tenko I had never even heard of a "cover grammar". The road was long.
Hard nuts
There were a few things in ES6 that I've heavily underestimated initially. Things that look simple on the surface or even as a consumer/writer of JS but are incredibly hard to consume as a parser. My principle of "be the parser", which worked quite well in ES5, was definitely not as efficient for ES6+ code.
A shortlist;
- Arrows (vs groups)
- Async arrows vs regular call (legacy code)
- Destructuring assignment
- Unicode flag in regular expressions
- Regular expression verification
- Regular expressions and AnnexB rules
- Syntax errors for lexical binding collision edge cases
- Syntax errors for `await` and `yield` in function params
- Syntax errors related to the "use strict" directive
- Test262 edge case coverage is nice but far from complete (like 50%?)
- Certain parts of the spec are super cryptic, very easy to make interpretation mistakes
- The strict mode rule on
delete
and parenthesis
- Named capturing groups
Arrows
One of the first difficult / new things I tackled was arrows. I would revise this approach a few times before its final state. Especially in light of the
async
variant being even more complex.
The problem here is, basically, the difference between
(a);
and
(a)=>a
. Mind you, that group could be an assignment with a very long and complex expression that would still be viable to be the start of an arrow.
The start of an arrow can, virtually always, be valid as JS even if it wasn't followed by an arrow wasn't parsed. The edge cases are around certain pattern and spread/rest syntax that is only valid as a pattern or literal. For example, did you know that
{a=b}
is not valid as an object literal? You can have the shorthand
x = {a}
but it can't have a "default". If that kind of object appears anywhere it must automatically be a pattern, or syntax error.
Having to track an AST made this considerably harder for me. You couldn't just skip and discard the tokens and track whether it was valid with and/or without arrow. You had to build up an AST as you went along. And with my no-backtracking principle, that meant doing it at once.
Ultimately, this meant building up an AST and assuming it wasn't a group. Then, if it turns out to be an arrow go back in the generated AST and turn any object/array literal that is not an initialiser into a pattern. Meh :(
Async arrows
At some point, well after "solving" arrow parsing, I started implementing async arrows. The problem with async arrows is that there is a legacy (annex B) interpretation that is valid when the arrow is missing. This means
async(x) => x
and
async(x)
are both valid but vastly different.
I think this was the first time when I realised I couldn't just build up the AST scanning forward. It was the only case in its class but a problem nonetheless. If it wasn't an arrow but was valid then it had to be a
CallExpression
. And that is a completely different AST structure.
Another complication is that spread/rest now added an extra dimension of problems. For regular groups, spread is not valid. But for a function call it _is_ valid. But at least spread was already kind of taken care of since it is valid to spread in an array.
Works or B0rks?
(...a) => a;
(...a);
async(...a) => a;
async(...a);
(([...a])) => a;
(([...a]));
async(([...a])) => a;
async(([...a]));
Destructuring assignments
Destructuring in general was a difficult problem for me to solve. But destructuring assignments were the worst because they meant that at any arbitrary point, array and object literals could become vastly different structures, with different rules, well after starting to parse them. And you would only know it was a destructuring assignment after seeing the
=
after it.
Since these objects and arrays could contain almost any kind of expression, backtracking on them would be potentially expensive.
It turned out that the AST for patterns vs literals wasn't actually that far off. It meant walking the tree and giving each literal node the pattern name (
ArrayLiteral
to
ArrayPattern
) and to update assignments. A downside was that the pattern nodes were missing the
operator
property and, for performance reasons, meant having to be replaced.
Unicode flag in regex
Definitely flying under my radar was the complexity of regular expressions. Especially when it came down to the unicode flag.
It wasn't entirely under my radar. Mathias had already been posting about relevant edge cases caused by the unicode flag and surrogate pairs. I just didn't realize how deep that rabbit hole went. And how much time I would waste on this.
The unicode flag is basically the strict mode setting for regular expressions. It disallows many annex B rules and it enables parsing regular expressions as code points, rather than code units.
While the code unit vs code point distinction isn't very important for a parser in most cases, there is one in particular where this matters at a lot: "character class ranges":
/[a-z]/
. The reason is that these ranges must be properly ascending, so
[z-a]/
is a syntax error. The character to signify ranges, a regular dash (
-
), can also appear unescaped as a regular character in the character class, like so:
/[a-]/
or
/[-z]/
, or even so:
/[--]/
.
Most importantly in this context, the range can end up differently with the unicode flag versus without the unicode flag because of "surrogate pairs". Added bonus for using unicode escapes in these ranges. Extra extra bonus for encoding a surrogate pair with double unicode escapes.
Also consider unicode "ruby escapes" (
/[\u0001-\u{02}]/u
). Or the fact that the ruby escapes are only valid with the unicode flag, and otherwise can only be valid with annex B rules. But in that case, ends up with real unintuitive interpretations:
/[\u{02}-\u0003]/u
matches the range
}-u
, so the regex would match either of the four characters
u023{
or anything in that range. Hmmmmm ... but wait, that range is invalid because
ascii({) > ascii(u)
, so it's actually a syntax error. Oops.
Regular expression verification
Apart from unicode flag oddities, verifying the rules turned out to be a big time sink.
On the one hand I had stuck to my no-backtracking principle. This was before realising the surrogate pair trouble above. So once I realised this had to be solved I would change the lexer to parse regular expressions for both modes at the same time. That means, while consuming input and creating a regular expression token, the lexer will track whether the parser is valid with and without unicode flag. This includes surrogate pairs, escapes, annex B, etc or even which character contributes to a range, or even whether there is a range at all.
It might be interesting to learn how the interpretation of the regex can desync between rules for unicode flag and without unicode flag. This regards surrogate pairs. Keep in mind that a surrogate pair is a special encoding of a code point where the character is split up in two separate characters. Without unicode flag these characters are read and interpreted as two individual non-special characters. With the unicode flag the pair is considered one single character. That matters in ranges. The emoji for "fox" is
\u1F98A
, which is the surrogate pair
\uD83E\uDD8A
. When we look at a desynced range we can go for
/[\u000-\uD83E\uDD8A-\u0000]/
. With unicode flag this char class has the range
0-0x1F98A
and the two individual characters
-
and
0
. Without unicode flag, this is actually two ranges:
0-0xD83E
and
0xDD8A-0
. The second range is illegal (because the range must be low to high) and so is a syntax error. Similarly, you can create a syntax error range with unicode flag that would not be a range or error without unicode flag.
There are some real head aches here, worsened by three ways of referring to surrogate pairs (as literal characters, as unicode escape quad, and as ruby escape) and by annex B rules where
\u{...}
may end a range with the code unit
u
and start a next range with code unit
}
. Very unintuitively.
Lexical binding collisions
When it comes to
let
and
const
it's fairly easy to see the error when creating a duplicate binding. But our understanding of the rules get a little muddy when it comes to bindings that do not explicitly have
let
or
const
.
For example, do you know how function declarations are bound? The answer is not so simple. It's different between global and function scope. It's different for function declarations and "function statements". It's different for script goal versus module goal.
Is it illegal to have two functions with the same name in the same scope? Same rules as above apply.
Can you create a function with the same name as a catch scope variable? Or as a function parameter name? What about arrow parameter name? And what about various
for
-loop bindings?
There's a very select group of people that'll ever bother to read this at all, but even amongst them I bet there will be very few people that can properly / comfortably answer a quiz based on the above. The hidden complexity is amazing and certainly took me by surprise.
Keep in mind, a parser should properly throw syntax errors for duplicate lexical bindings so these rules are relevant to apply correctly.
await
and yield
in params
For runtime reasons, it's illegal to have the
await
and
yield
_keywords_ appear in function params. But ... they're not actual keywords. Only in certain contexts. The
await
identifier is only a keyword in async functions or the module parsing goal. The
yield
identifier is only a keyword inside a generator or inside strict code.
While for a programmer it's not so hard to spot these keywords in function params, for the parser it's much harder because it might not even know whether it is inside a function at all. The async arrow case is even more difficult here since
async (x = await);
is not an error as long as you're parsing against the script goal. And keep in mind that there could potentially be megabytes of JS before knowing whether or not this is an arrow. Yikes!
Use strict
There are some gems hidden underneath the relatively simple concept of the directive.
Many people will know that putting the string
"use strict"
at the top of a function or file will put the code in so called "strict mode". This changes a bunch of rules for the better, like not being able to assign to an undeclared variable. It's an opt-in mode that has no opt-out and module goal code as well as code in a class[*] will automatically enable strict mode.
Few will know that using this directive can retroactively affect code.
Here follow a few cases affected by this.
Octal escapes
Strings are not allowed to have so called "octal escapes" in strict mode. These are escape sequences with multiple numbers, starting with a zero:
"\012"
. If you use the strict mode directive, this will apply to all strings in the current scope, including directives that preceded it:
"\012";
"use strict"
So this is an error. It first parses the string with the octal escape and accepts it. Then it sees the strict mode directive, enabling strict mode, and re-evaluates the first string. Since it contains an octal escape, it's now a syntax error.
Simple parameters
If a function contains a strict mode directive (regardless of whether it already is strict), it is required to have "simple" parameters. It will be a syntax error if that's not the case. So something benign like this will fail:
function (x = 5) {
"use strict";
}
In this context, "simple" means as much as "would be valid in ES5". So all parameters must be simple parameters. No defaults, no patterns.
For Tenko this means having to track this state and forward it to the function body parser.
Strict mode keywords as param or function name
Since strict mode disallows certain variable names that would otherwise be allowed, the directive also applies this retroactively for the function name and parameter names. This mainly applies to
yield
,
eval
, and
arguments
.
let x = function eval(){ "use strict"; }
function f(arguments){ "use strict"; }
function f(yield){ "use strict"; }
Weird tail
Since the strict mode directive is just a string, it may also have a "tail", meaning it can be called or have a property accessed. These cases are syntactically valid, even though clearly the calls will fail. They do not enable strict mode, since that requires the whole statement to be only the string, not anything else.
Here are some examples where the octal should not throw an error unless the directive is interpreted as such (but in each case it shouldn't).
Here's a misleading one with property:
"use strict"
.foo
"\01"
Here is one with a group after it ... except it's not a group. It's a call!
"use strict"
("oops")
"\01"
What about an array? Nope, it's a dynamic property access :)
"use strict"
["oops"].forEach(log)
"\01"
And this is a new one: template? Nope. It's a tagged template!
"use strict"
`oops`
"\01"
Test262
Many parsers and envs pride themselves on passing the test262 test suite. It contains many ugly edge cases for the language and, in all fairness, can be quite daunting to pass. However, and unfortunately, it's far from a complete coverage on all test cases.
You might be thinking, "Well, why don't you contribute". Because time. But you can always consider paying me. Or go through my 33k test cases yourself and convert them into whatever the format is that test262 requires.
Cryptic spec
The language is defined in a
specification, or "spec". And the specification should contain all the rules and cover every edge case surrounding the language. In a way, it does.
Unfortunately I think the spec is for the most part a giant ball of spaghetti. Many rules rely implicitly on other rules that may be several pages away or on systems declared at the start of the specification. For example: the grammar
allows a
const
without initializer. But
a static semantic rule is disallowing that particular reading,
referring to another part to define the distinction (this particular case feels very artificial).
Or why don't you try to figure out when function statements are allowed?
Or where is operator precedence defined, anyways? Spoiler: it's an implicit rule based on the order of the grammar. The specification, as of right now, doesn't even contain a single occurrence of the word "precedence".
I've been working as a consumer of the spec for over 10 years and still I got my ass handed to me when reporting a few bugs to Acorn last week. Turns out my interpretation for some edge cases were wrong.
My point is: for as clear and unambiguous as the spec is attempting to be, it makes it very difficult to consume it. It's too big and rules are too far apart, too implicit. Especially when it comes to the nitty gritty of the edge cases and the annex B rules you need a galaxy brain to get it all in your head. Certain syntactical structures introduced in ES6 certainly didn't help here, like arrows, patterns, and the regex unicode flag.
Strict mode delete
and parens
One particular odd case is deleting an identifier, like
delete foo;
. This used to be possible, and would try to delete a binding from global scope.
In strict mode the use of
delete
on just an identifier is forbidden. This includes the case where the identifier is (only) wrapped in any number of parenthesis. So
delete foo; delete (foo); delete (((((foo)))));
are all illegal. And while it seems fairly easy to verify that, try to consider the generic case; Any level may be something other than an identifier, or an arrow, or have a tail like call or member expression.
delete ((x)=>x);
delete (x());
delete (x).foo;
delete (((x).foo));
This was a pretty nasty case to solve for me.
Regex named capturing groups
One of the last new features was named capturing groups. Something like
/(?<foo>group)/
when matching would create a
.group
property when it matches anything. Great addition. But the spec becomes a little weird.
Especially because these named groups can be referenced through the
\k
escape, like this:
/(?<name>peter).*name: \k<name>/
, which would match something like "This is peter. So this was the name: peter" and the match would have a
name
property with value
"peter"
.
The spec uses a special
N
notation to tell whether or not this syntax is enabled which is necessary because under annex B rules,
\k
is a valid escape and is a regular
k
. With the new rule, `\k` is no longer a valid escape on its own and this
N
flag modifies this behavior.
It turns out that there's a large
static semantic that basically says
\k
is valid without a group name if and only if the entire regex does not contain another defined group name (whether by another
\k
or a named capturing group) and does not have the unicode flag.
Test cases
My lifeline while building this parser are my tests. I've got
a suite of 33.000 test cases which try to cover almost anything I could think of.
Initially the tests were in a giant array with simplistic output expectations and token checks. At some point not too long ago I converted all these cases to individual files and started to dump the entire AST or error output into each file.
The test files are markdown files, which means they automatically
render quite nicely on github. Additionally, if you're ever interested in how a certain test should parse you can always look at the test case online, without having to run it through Tenko first.
These test cases are auto-generated each time the test suite runs. That's quite nice as it means I can use git to verify whether my changes changed anything at all. They're just snapshots in that regard.
Writing tests for most software is quite cumbersome and you always end up in an argument of writing integration / e2e / unit tests. In the case of a parser that's much simpler: e2e tests only. If you can't trigger a certain piece of code with input then it's probably dead code.
Tenko currently sports a
test coverage of over 96% of the lexer and parser. The bits that are not covered are dev code, dead code assertions, or babel/acorn compat code. The dev stuff is stripped from a build and the acorn/babel stuff doesn't run in my test suite. I'll happily claim to have e2e input coverage of over 99% of my production code.
Spec compliant
What can I say, I like trying to figure out and to follow the rules of the spec.
As far as I can tell the parser is doing that. And if somebody knows of a problem or bad interpretation of the rules, I hope they tell me and I'll do my best to fix it asap.
One annoyance is that the spec is now more of a living document and subtle changes are made without a lot of attention. I tried to bring this up once but it was discarded as being irrelevant.
One reason I wanted to support specific version targeting is that there was a point where I really needed an to date parser that would target up to a very specific version of the spec, simply because the target environment wasn't being updated and ran a three year old JS env. Having a parser that targets a specific version means validating whether or not the build would even run at all, without having to test it on that target environment. So the use cases are a little niche, they're super important if you happen to be one of them.
At this point I can't guarantee that super fine grained rules are properly represented. In particular there was a recent change about
for ( ... of)
where the left hand side was made more strict. This meant that, technically, older engines would allow different cases than a more modern spec. Provided they implement that particular change. I'm not sure how I can prevent this unless the spec actually publishes such changes as well. I'm also not sure it's worth that much to keep up with it that closely.
AST support
So this one is new for me. Not ASTs but my parser outputting them. I went with the Estree AST spec because why not.
I remember a long time ago when I was boasting to have the fastest ES5 parser that Ariya, who built Esprima, blasted me for it because ZeParser didn't output an AST. He was right, of course.
Implementing ASTs, conceptually, turned out to be easier than I expected it to be. Lot's of respect for those involved in forming the Estree spec. I really like how it's been specced together, even if it could use a little more explanation here and there.
The locations are a bit annoying because it means having to juggle those values around and around. I don't really like how I ended up having to solving them. But they work. Parity with Babel and Acorn for almost everything.
There's no doubt for me that I'm losing a lot of perf on building the AST's. I can fix it but it requires a drastic architectural change. This would be very time consuming and I'm just not willing to put in more time right now. Say I get it faster, so what?
One thing I considered was making a binary scratch pad where I encode the AST while running, and only after parsing actually construct the AST from that buffer. I might still do that one day. But it's not trivial to do this with my architecture because it would mean having to juggle the addresses as offsets and there's already too much juggling going on.
Perf
Ah, what can I say. Performance. Even though sadly I think Tenko is performing slightly under Acorn, I'm okay with the state of it. It can do 20mb of es5 ("webkit.js") in about 5 seconds and gets decent scores on other benchmarks as well. I'm aware there's somebody working on a parser that's significantly faster and don't expect to be able to turn Tenko into that level. So for me the performance game is over for now. And I'm fine with that. Choices were made and, in hindsight, these bets didn't pay out. Next time I'll have to be more considerate of the price of objects when building an AST.
I was very conscious about objects otherwise. There's only a few cases where objects and arrays were used besides the AST.
One such case is labels. You must track labels but only if they're defined. So if they're not defined, the object doesn't have to exist. That's a nice hack preventing me from creating a new object on every statement level.
Another case is scope tracking. To properly throw syntax errors for duplicate lexical bindings you have to put them in a scope object. Can't really be circumvented.
The third case is import/exports. To validate a module you have to check whether all exports are in fact also global bindings. To this end you have to track all of these explicitly. Additionally you have to verify that each symbol is only exported once. More objects.
So now Tenko is pretty fast, but I can't say I've hit my own perf goals. I blame the AST building more than anything else. Maybe some day I'll get around re-implementing that.
End
So that's it. Actually that's more than I initially thought I'd write.
Tenko is a full featured parser, supporting AST and location tracking, scope tracking, all the early errors that are expected, in depth regex tracking, and can target specific versions of the specification. It does not support any plugins and its architecture doesn't really allow this anyways. There's a toolkit through which you can control all the tests, run a fuzzer, create a build, do code coverage, do perf inspections, and ... well many more things (this is
./t
in the root of the project). There are a lot of things there and I'm happy where it stands now.
For now I'm done working on Tenko. I'll do maintenance if anyone has bugs and I'll see about implementing new syntax features as they are added to the language.
I spent quite some time working on this parser. Pretty much all my free time that wasn't spent on work or family went into this for the past three years. Probably over a thousand hours, if not more. And I promised myself I could play
Factorio once I "finished" the parser. So that's what I'm doing now :)
Hope you enjoyed this post. If you ever use Tenko for anything I'd love to hear about it :) Let me know on
Twitter.