The Streamer

2014-08-02

Introducing the first, I think, streaming JavaScript parser in Javascript: The Streamer. Well, streamer is actually a term I've gotten used to using when referring to this build. The name is only used internally since it's really just a transformed regular ZeParser2 build. As far as I can see, it's the first JS parser that can parse from a stream, without caching the whole thing of course. It's capable of parsing JS exactly like ZeParser2 would, except it can yield for more input at any point in the parsing process.

There's so much to tell about this I'm not sure where to start. Okay, maybe start with what it actually can do.

Capabilities


The capabilities are actually quite simple: it can take a stream of JS source code and parse it on the fly. You can give it an event handler that'll be called on every token. Don't be fooled into thinking that makes it just a tokenizer/lexer. It's so much more, but it takes some effort if you need to get customized information. You can't really just tokenize JS without a parser due to the ambiguous forward slash. IIRC it is possible in certain parser approaches but I'm not familiar with those yet and I recall they are quite cumbersome. So even though all you get is/are tokens, behind the scenes the whole source code is validated.

Tokens may not suit your need though. In that case you may need to go into the source code yourself and add whatever meta data you need to the tokens. For example, I often link the opening bracket of a function to the closing bracket directly. And both brackets to the function keyword token. This allows you to quickly skip over functions, or recursively process the body of a function. This change is fairly trivial to make but I'll concede that it's a bit involved. On top of that, as the whole goal is to stream, you should really keep in mind that certain meta data may only be available after parsing a lot of further code. Functions, for example, may start at the beginning of the code and end all the way at the end of the source code. This means it would have to parse the entire source before you could find the end of the function, defeating the point of parsing in the first place. This is not a limitation of my parser but a limitation of LALR ("Look-Ahead Left-to-Right") parsers in general. And probably a problem in general since the opposite way would get you into the same problem :)

I'm only aware of two issues with the streamer. One is ASI's not getting the proper token index. This is because we won't know about an ASI until we encounter a token that would otherwise be invalid. I feel this can be corrected easily in the onToken handler so I'm not going to make ze parser more complex by delaying output by one token for this case. The other issue is simply perf related, I'll cover that in more detail below.

So what's the use then? Well, I can only think of a few myself.

Use case


The reason I needed it right now is to create benchmarks for ze parser. The idea is to take a huge script and transform it to apply certain coding style guides and see how the parser performs on them. I have a 35mb emscripten file of a program called "Kate". This file is perfect for this task because it's huge and has a very specific, minified, coding style. I want to take this source and beautify it with spaces, tabs. I want to use the three forms of newlines (\n, \r\n, and \r). Semi-colons or semi-free. A file with variables containing higher unicode characters. Etc. I just want to see how my parser performs against very specific coding/build styles and make sure I'm not training my parser to tightly to my current benchmark.

The problem with creating these bench derivatives is that existing parsers are not streaming. The tools to beautify are built on these parsers so that means they can't stream either. A regular build of ZeParser2 nor other parsers like Esprima help you here; they all need to cache in the entire source before they can parse. This leads to memory constraints and the beautifier I tried first ran out of memory quickly. The streamer can keep on parsing infinite source though. I plan to write a script that reads the 35mb file as a stream, feed it to the streamer, and have the script stream the different coding styles to several files at the same time (because, why not).

Another reason could be to do statistical analysis on large chunks of source code. Count occurrences of certain structures or api's. This is the kind of stuff browser vendors and spec writers can use. They'll have a streaming JS parser in a different language for this though, but that's besides the point.

Test a fuzzer. I know, this is getting a bit contrived ;) And inceptive since a fuzzer should actually test a parser. But you could hook it up to the streamer and have it continuously parse fuzzer output. That way you can let it run indefinitely. Either the fuzzer is gonna crap out, or the parser, or ideally: neither :) In a classic setup the parser would just repetitively parse the fuzzer output though, so it's not a huge change.

Perf related. But since ze parser can parse lots of megabytes at once much faster than the streamer could do it streamed, I'm not so sure about the validity of this use case. Especially since by far most of the time, the amount of JS source won't really exceed one or two MB. Such scripts are fine to cache and process in one go without a streamer. But still, it may be interesting if memory usage is a concern over arbitrary scripts in an automated service.

Yeah I dunno. But I'm sure there are people who'll be happy with this.

Speed


There's a downside though; speed. While ZeParser2 is currently hands-down the fastest JS parser available in JS, the streamer is too but only because it's the only one. The streamer build is several magnitudes slower compared to ze parser. This is an unfortunate though unavoidable side effect from the transformation. A quick cli comparison:

Code:
~/zeparser2$ cat ../gonzales/data/sources/kate.js.jo.35mb.js | bin/teststream.js
-- finished (40971930 bytes, 21974110 tokens, 42146ms, biggest chunk: 65536 bytes)

~/zeparser2$ cat ../gonzales/data/sources/16mb-benchmark.js | bin/teststream.js
-- finished (15723571 bytes, 5960817 tokens, 7908ms, biggest chunk: 65536 bytes)

~/zeparser2/zeparser2$ cat ../gonzales/data/sources/16mb-benchmark.js | bin/cli.js --build
Using build parser (build/zp.js)
Using dev parser (src/par.js)
(Use --stream for streaming parser, --build for the build, nothing for dev parser)
----------
----------
-- finished (5960817 tokens, 15723571 bytes, 386ms)

~/zeparser2$ cat ../gonzales/data/sources/kate.js.jo.35mb.js | bin/cli.js --build
Using build parser (build/zp.js)
Using dev parser (src/par.js)
(Use --stream for streaming parser, --build for the build, nothing for dev parser)
----------
----------
-- finished (21974110 tokens, 40971930 bytes, 1286ms)

That's 1 sec vs 42 sec for 35mb bench and 386 ms vs 8 sec for the 16mb bench. I'm not entirely sure why double the code takes 6x as long for the streamer, but I'm sure the parse speed is unrelated to the amount of data parsed. I've let it parse continuously up to 23GB while maintaining stable parse speed and memory levels.

My point is; the streamer is much slower than ze parser. But that's no surprise since the streamer uses a switch in a while loop for nearly every function, creating new function instances while parsing. A lot of new functions. We'll take a look at why that is. But first I want to explain what happens conceptually. This will make explaining the how seem that much more sense ;)

Co-routines


ES6 will introduce the yield keyword. This keyword allows you to halt the current function at any point and hand over execution back to the caller. When making a streaming parser, this is kind of what you want. Kind of, because we actually want multi-level yielding. This is kind of what co-routines do. The call stack of ze parser could be anywhere between 5 and 30 levels deep (just a guess). Manually coding a yield for every function at any point where we may need to get more input is going to be very very cumbersome. Let alone doing it in ES5 semantics. It's possible but ES5 has nothing even close to what yield does. And co-routines are not a part of the language and won't be for a long time, if ever.

I really wanted a streaming parser but did not want it to affect ZeParser2 too much. Neither did I want to start a new project for this. In part because I don't have the time for it and in part because you still have the yielding problem. You could actually simulate the call stack "manually". This way you could suspend and yield at any point between function calls. While such an approach seems viable to me, and could actually be quite performant, it obviously drastically changes the architecture and I do not want to spend the time writing a parser from scratch. I'll leave that as an exercise for the reader ;)

Not to long ago I published UnYield, which does yield in ES5. It transforms occurrences of yield to this draconic boilerplate to simulate yielding. While it could not be used for this project, the approach turned out to be very similar by treating any function with a certain prefix as if it were a yield keyword. This kind of makes it a DSL and that's fine.

Transformation of functions


The regular build of ZeParser2 consists of a namespace with single level functions (no inner functions) that close over some variables in the namespace. The functions, originally prototypal methods in Par or Tok, are all prefixed with this_par_ or this_tok_. I can use this prefix as my marker when transforming the build! If I let such a prefix mean "this function is a generator, transform it and let all calls to it be considered a possible yield", I pretty much got a co-routine. And so it did.

The transformation script goes over the source and searches for function declarations. When it finds one with name that has such a prefix, it will consume and transform it to a generator. This process consists of a few steps:

- Make the function return a new function
- Make sure all local variables are declared in the outer scope (so the returned function has closes over them)
- Make sure no local variables are declared in the inner scope
- Add a infinite loop to the inner function
- Add a switch inside the inner loop in the inner function
- Simulate all execution logic in terms of switching in a loop (for example: if becomes a jump to a different case in the switch)
- Transform calls to prefixed functions, explained below
- Add a global variable frozen, which means the current execution is suspended, causing a cascading yield to a predefined "top" of the call stack
- Treat one specific function as the one that actually suspends the parser (in this case waitForInput, which is a noop in regular mode, but the call is rewritten to return (frozen = true), (undefined); in the streamer)
- Cleanup some transformation artifacts afterwards, like dead code

The original function is changed to a function that returns a new function. This function can be called repetitively while maintaining state. The first time you call it, all vars are initialized as they would be in a regular function. The second time however, they will retain the value they had after being called the first time. This is nothing magic, just a straightforward closure.

This is an example with empty body:

Code:
function this_tok_waitForInput(foo){
// noop
return false;
}

// =>

function this_tok_waitForInput(foo){
var step = 1;
return function inside_this_tok_waitForInput(thawValue){
while (true) {
switch (step) {
case 1:
// noop
return false;
return;
default:
throw "uncompiled step? [" + step + "]";
}
}
};
}

The original parameters are kept and part of the now outer function. Their values also become closures like the local variables. The local function, named such that debugging is easier, has a double return. One is that of the original function and the second is the implicit return. The last throw is a unit test for development :) Won't be in a production build because it should be dead code.

Jump


The loop switching is a basic simulation of arbitrary jump targets, which is otherwise not possible in JS. Labels can only be used together with break or continue and are very statement limited. But if we have a switch in a loop, we can use case values as jump targets. Each switch will get its own jump variable (named step in my build). The outer infinite loop will ensure that the code re-enters the switch when it breaks. The only way to stop is to explicitly return (or throw), which is fine because the code inside the switch actually resembles the original function body anyways.

Transformation of function calls


When the inner function returns it may be for two reasons; either the parser is frozen (suspended) and the function needs to be resumed, or the function finished running and the instance should be GC'ed. Not only that, but an original function call needs to first pass on the original parameters as it would already do, but then immediately call the function to "start" running the actual code. The first call is only initializing the generator. Let's take a look at transforming a prefixed function call:

- Find any function calls to prefixed variables
- If they are part of a more complex expression slice them out and put them on their own statement before this expression
- Do this in the proper order in case of multiple such occurrences
- Take expression logic into account, this is important and I had to delegate this to a transformation that simply eliminates all occurrences of && and ||
- Assign original function call to a fresh variable, declared in the outer scope because it needs to be called again when the parser resumes
- Call the immediately returned function immediately
- If after the function call the frozen variable is true the parser is suspending so bubble this up (return immediately, next time this function is called it will jump to this call again)
- Otherwise this function is really done and we can GC it
- Substitute the original occurrence of the function call with the return value of the last call (intermediate calls are only propagated to the parser call site)

In code, a regular call as its own statement looks like this:

Code:
this_par_parseBlock(true, inFunction, inLoop, inSwitch, labelSet);

// =>

case 14:
step = 15; case 15:
v2 = (f2 = f2 || this_par_parseBlock(true, inFunction, inLoop, inSwitch, labelSet))(thawValue);
if (frozen) return v2;
else f2 = null;

And a sub-expression call:

Code:
while (this_par_parseStatement(inFunction, inLoop, inSwitch, labelSet, true, ''));

// =>

step = 4;
case 4:
v0 = (f0 = f0 || this_par_parseStatement(inFunction, inLoop, inSwitch, labelSet, true, ''))(thawValue);
if (frozen) return v0;
else f0 = null;
step = 5;
case 5:
if (!(!(v0))) { step = 7; continue; }

(Note that the while is also transformed to an if, using the loop switch to retain original execution semantics)

Streamer


The result is a parser that can arbitrarily suspend execution, waiting for more input. There's one function call that's transformed into the actual yield that triggers the suspend (waitForInput in my case). When executed, the transformed code will set frozen to true, marking the function as suspended. Every function call in the call stack will immediately return when it sees that var set, up to the call site of the parser. In this case, Par.parse() will return a function that's considered the entry call site for the parser.

When the parser is resumed (by calling obj.thaw(value), the frozen var is set to false. The switches in each function will jump to right before the function call that it just suspended in and call it again. Each call passes on the thawing value, which in our case will either be the new additional input to parse, or false to mark the EOF. At the top of the call stack, inside getMoreInput, this value is evaluated and regular execution resumes.

When the function returns and frozen is not set, the function is deemed finished, the return value deemed the actual return value to use next, and the function instance is destroyed. Thusly, arbitrary suspendsion is achieved :D

Code


With the current api of ZeParser2, calling Par.parse on the streamer will have it return an object with some properties:

Code:
var obj = Par.parse('var foo =');
console.log(obj);
// =>
{
frozen: bool
thaw: function(thawValue)
}

You can use this objec to check whether or not the parser is frozen. It will be updated by reference. As long as the parser is frozen, the return value of calling .thaw() will be undefined (or whatever I decide to return there, but I don't think any value makes sense here).

To start the parser and have it start on the argument code (Par.parse("this.code")), you must also call .thaw(). The argument is ignored for the first call. For testing purposes there is a flag that causes the streamer to immediately parse given input: Par.parse("parse.me", {runWithoutFurtherInput:true}). There's no point in using this flag other than testing since it will immediately keep calling .thaw(false) until it finishes.

To parse something streaming you generally do this:

Code:
var input = ['v', 'ar ', 'f', 'oo', ';'];
var obj = Par.parse('', onToken: function(tokenType, value, start, stop, tokenIndex){ });
var p = obj.thaw(); // initial call

while (input.length) p = obj.thaw(input.shift());

// parsing result will be in p now
console.log(p);

The calls to .thaw do not have to be synchronous, that's only in this example.

The repo has a bunch of tests. It also includes a file that does one test case streaming, if you want to see an actual example.

You can also take a look at the cli files;

Code:
# stream a big file through cli
cat bigfile.js | bin/cli.js --stream

# infinitely parse either cli input or hardcoded input, outputs memory stats
bin/memleak.js

# another cli streaming test file
cat bigfile.js | bin/teststream.js

Note that the repo contains two build files. There's zp.js, which is the normal ZeParser2 build. And there's zps.js, which is the streamer build. You only need to include zps.js in a project. Both build files should be up to date with the latest stable release.

ZeParser2 is not on npm.

.end()


Okay, I have no idea how many people actually make it to the end of this blog post. Don't know how many people find this kind of stuff interesting and/or can follow what I tried to explain above. For myself, I find this kind of hackery pretty awesome. Turning the regular parsing into a streaming parser with about 650 lines of additional code? I'd like to see you try. No really, I mean go for it. It's a great exercise! It also prevents having to maintain two code bases for kind of the same thing. It does mean the code gets a few additions it would otherwise not need, but I feel those are justified and most of them can be eliminated in the build step. The perf loss is containable and having a streaming parser is just a huge win for me.

Overall I'm pretty proud of this achievement :)

Thanks Eddy for giving me the final push to finally work this out :)