After running the fuzzing experiment for a long time, trying to get the best path out of it, I finally ended up with something that averages 3.6 checks. Kind of disappointing because I was expecting to at least break the 3.5 barrier, maybe even go under 3. But whatever, I have to move on.
Note of warning: this post describes super micro optimizations. Do not take this for advice on general JS perf. Heed this warning carefully please. We're talking 6 million calls of a function to parse a 16MB source code file. It is very unlikely you'll ever find yourself in a situation where this kind of perf matters. Then again, if you're reading this blog post you might... :)
So let's first take a look at what this path looks like. Note that this is a combination of fuzzing and manual tuning afterwards. Just to reiterate: this is the point where the tokenizer/scanner, given a character, has to determine what type of token to parse next (read more).
if (c < ORD_DOT) {
  if (c === ORD_SPACE) return this.__plusOne(WHITE);
  if (c < ORD_OPEN_PAREN) {
    if (c === ORD_CR) return this.__parseCR();
    if (c === ORD_TAB) return this.__plusOne(WHITE);
    if (c === ORD_DQUOTE) return this.__parseDoubleString();
    if (c === ORD_EXCL) return this.__parseEqualSigns();
    if (c === ORD_AND) return this.__parseSameOrCompound(c);
    if (c === ORD_SQUOTE) return this.__parseSingleString();
    if (c === ORD_$) return this.__parseIdentifier();
    if (c === ORD_PERCENT) return this.__parseCompound();
    if (c === ORD_LF) return this.__newline();
    if (c === ORD_VTAB) return this.__plusOne(WHITE);
    if (c === ORD_FF) return this.__plusOne(WHITE);
  } else if (c <= ORD_CLOSE_PAREN) {
    return this.__plusOne(PUNCTUATOR);
  } else {
    if (c === ORD_COMMA) return this.__plusOne(PUNCTUATOR);
    if (c === ORD_STAR) return this.__parseCompound();
    // + -
    return this.__parseSameOrCompound(c);
  }
} else {
  if (c < ORD_L_A) {
    if (c <= ORD_L_9) {
      if (c === ORD_DOT) return this.__parseDot();
      if (c >= ORD_L_1) return this.__parseNumber();
      if (c === ORD_L_0) return this.__parseZero();
      if (c === ORD_FWDSLASH) return this.__parseFwdSlash(expressionStart);
    } else if (c <= ORD_SEMI) {
      // : or ;
      return this.__plusOne(PUNCTUATOR);
    } else {
      if (c < ORD_L_A_UC) {
        if (c === ORD_IS) return this.__parseEqualSigns();
        if (c === ORD_QMARK) return this.__plusOne(PUNCTUATOR);
        if (c <= ORD_GT) return this.__parseLtgtPunctuator(c);
      } else if (c <= ORD_L_Z_UC) {
        return this.__parseIdentifier();
      } else {
        if (c === ORD_CLOSE_SQUARE) return this.__plusOne(PUNCTUATOR);
        if (c === ORD_OPEN_SQUARE) return this.__plusOne(PUNCTUATOR);
        if (c === ORD_LODASH) return this.__parseIdentifier();
        if (c === ORD_XOR) return this.__parseCompound();
        if (c === ORD_BACKSLASH) return this.__parseBackslash();
      }
    }
  } else if (c <= ORD_L_Z) {
    return this.__parseIdentifier();
  } else {
    if (c === ORD_OPEN_CURLY) return this.__plusOne(PUNCTUATOR);
    if (c === ORD_CLOSE_CURLY) return this.__plusOne(PUNCTUATOR);
    if (c === ORD_OR) return this.__parseSameOrCompound(c);
    if (c === ORD_LS) return this.__newline();
    if (c === ORD_PS) return this.__newline();
    if (c === ORD_NBSP) return this.__plusOne(WHITE);
    if (c === ORD_BOM) return this.__plusOne(WHITE);
    if (c === ORD_TILDE) return this.__parseCompound();
  }
}
It's basically a few steps of binary search at first (splitting in the ASCII range), with each step having a long tail of minor percentages. There are some interesting simple rules in there, but I won't bother you with them here. Using this to parse a 16MB file of concatenated libraries and real-world scripts I get an average of 3.6 checks before returning somewhere. Consider that the function is called 5.9 million times for this file and you can imagine the importance of being optimal here.
So is this optimal? No. I'm currently experimenting with various ways of doing the above and seeing some interesting results.
In my testing I've modified the Gonzales project and created a custom HTML page that loads a custom parsers.js file. I also modified ZeParser2's build script. In case you don't know, this post is about ZeParser2 :) I can give the build a name and it will store a build file, including original sources, in some build dir. It will then modify this custom parsers file for Gonzales and add every dir in the build dir to it. I've also modified Gonzales to optionally run endlessly (while cycling between parsers and sources, of course), plus some other minor tweaks.
This means I can pit various ZeParser2 builds against each other with ease. Gonzales will keep on testing them to try to eradicate the random jitter from the browser. Since all parsers run under the same conditions, if they run like that a thousand times I think we can say that the average and min/max results are relevant. Sure, only for one browser. But I'll take them for granted.
Doing so I've tried a few strategies to determine the next token in my parser. Let's take a look at some of them:
- Optimized if/else. Basically what you've seen above: an optimized but straightforward conditional structure that sorts out what the character might signify. It can do range checks, obviously, which are supposedly faster than checking every character of a range individually, like with letters and numbers. The optimization is based on frequency numbers from a huge benchmark file.
- Optimized single switch. All it does is check each character individually; the order depends on the frequency stats (a rough sketch follows after this list). Switches can't do range checks, though, so I'm not expecting this to win, especially since JavaScript switches are not jump tables like they are in some other languages. The cases of a JavaScript switch can be dynamic and have to be treated as such: you could put a function call in a case value and it will work. This limits the optimizability a bit.
- Finding the type of token by doing an indexOf in various arrays (also sketched after this list). This suffers from being hard to optimize without losing the possible gains of doing it group by group. On top of that, indexOf is simply slower than character comparisons.
- Regular Expressions. I'm already discarding them for being too slow. Previous tests have shown that straightforward logic structures beat regular expression approaches every. single. time. ZeParser2 only uses regex for unicode checks now, where the sheer number of characters to check makes it worth it, even if it were just in terms of code size.
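For a concrete picture of those two alternatives, here's a rough sketch of what the switch and indexOf variants look like, reusing the ORD_* constants and handler methods from the if-else block at the top. The ordering and the group arrays shown here are made up for illustration; the real versions follow the frequency stats.

// single switch: one case per character, ordered by frequency, each returning directly
switch (c) {
  case ORD_SPACE: return this.__plusOne(WHITE);
  case ORD_DOT: return this.__parseDot();
  case ORD_COMMA: return this.__plusOne(PUNCTUATOR);
  case ORD_SEMI: return this.__plusOne(PUNCTUATOR);
  // ... every letter and digit needs its own case; there is no "case ORD_L_A through ORD_L_Z"
}

// indexOf variant: membership tests against arrays of character codes, group by group
var WHITESPACE = [ORD_SPACE, ORD_TAB, ORD_VTAB, ORD_FF, ORD_NBSP, ORD_BOM];
var PUNCTUATORS = [ORD_COMMA, ORD_SEMI, ORD_QMARK, ORD_OPEN_SQUARE, ORD_CLOSE_SQUARE];
if (WHITESPACE.indexOf(c) >= 0) return this.__plusOne(WHITE);
if (PUNCTUATORS.indexOf(c) >= 0) return this.__plusOne(PUNCTUATOR);
// ... each indexOf is itself a linear scan over the array, which is why this loses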
I've also got a bunch of derivations of the above.
- if-else split up over multiple functions. The idea is that the JIT does not like too many conditions inside the same function. Splitting them up over multiple functions might add function call overhead, but makes it easier for the JIT to optimize. There are various degrees of splitting and of course there seems to be a limit. You might be tempted to, say, "normalize" every function to a max of one condition, but that doesn't work either and makes the result slower. Finding middle ground here is time consuming and difficult due to jittery results.
- if-else with some switches for blocks only containing ifs.
- if-else with some computational tricks. There are some micro optimizations possible that involve a little math. For example, there's a check for the "LS" and "PS" characters. These are 0x2028 and 0x2029 and behave like newlines. You won't ever find them, but that's beside the point here. Checking for them requires two conditions, but you can do some binary math to combine the check into (c ^ ORD_PS) <= 1 (see the sketch after this list). Unfortunately this specific trick is only really useful for these two characters, and since they're rare it doesn't really help anyway. There are more tricks like that, but it's questionable whether using them is worth it: a condition is super fast and the math might not be, even when it appears to be, and these tricks usually eliminate very few conditions, most of the time just one.
- A switch that uses "fall-through" grouping, rather than returning on each case.
- A switch that is broken up into various functions, much like explained above.
- A switch combined with range checks for identifiers and numbers. Since a switch can't do fast range checks, the switch will be broken up. This might hurt optimization.
- A combination of the last switch with splitting up, to try and get perf back by putting logically related things in one function.
- A switch where every case that leads to the same function call is put right after one another. This kind of ignores the frequency stats, though they are still used for local ordering as well as for ordering by frequency of results. The theory to test here is whether it's faster to group calls to the same function in a switch.
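To make the LS/PS trick from the "computational tricks" item concrete: LS is 0x2028 and PS is 0x2029, so they only differ in their lowest bit, and one XOR folds the two equality checks into a single comparison.

// two plain conditions:
if (c === ORD_LS) return this.__newline();
if (c === ORD_PS) return this.__newline();

// the trick: 0x2029 ^ 0x2029 === 0 and 0x2028 ^ 0x2029 === 1,
// while any other code point XORs to something bigger than 1
if ((c ^ ORD_PS) <= 1) return this.__newline();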
I think that's about it. At least for my attempts here.
I've run these tests in Chrome (34) and Firefox (28) on a ~16MB benchmark file containing various libraries and real-world scripts. I've also let it parse a dev version of jQuery for comparison. Screenshots:
Chrome:
Firefox:
Some explanations:
This is done by Gonzales. It will first fetch all sources and parsers and load the parser code, so that overhead is not part of the results. For each combo Gonzales will first instantiate a fresh parser instance and then feed the specific source to the parser as a single string. It does this a number of times consecutively, without yielding to the browser. In this case I've set this number to 5, so per source/parser combo, Gonzales ran five tests before moving on to the next combo.
The first full run (all combos) is ignored. This allows the browser to warm up if it supports that. It might feel like cheating, but it happens anyway if you run the same source multiple times, and otherwise this first run would skew the results. We see this happen in Firefox because its JIT cache seems to cool down much quicker than Chrome's, which results in very jumpy numbers and higher averages than Chrome by default. In Chrome I can ignore the first run and rely on the JIT remaining hot; for Firefox I would have to ignore all the first runs per source/parser batch. Anyway, the average is only moderately interesting. I'm more concerned about the min parse times.
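Roughly, the run scheme described above boils down to something like the following. This is a made-up sketch for illustration, not Gonzales's actual code; runAll and results.record are hypothetical names.

var RUNS_PER_COMBO = 5; // parse each source/parser combo five times in a row
function runAll(parsers, sources, results) {
  parsers.forEach(function (Parser) {
    sources.forEach(function (source) {
      var parser = new Parser(); // fresh parser instance per combo
      for (var i = 0; i < RUNS_PER_COMBO; ++i) {
        var start = Date.now();
        parser.parse(source); // the source is fed as one big string
        results.record(Parser, source, Date.now() - start);
      }
    });
  });
}
// the very first full pass over all combos is thrown away so the JIT can warm up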
So per table cell you can see the average parse time for that combo. Below that you see a min ~ max. These are, obviously, over the entire run and not just one batch. They are the number of milliseconds used to parse. Yeah, no microtime here. That's okay.
Lastly you can see the number of times the benchmark ran (1x = the parser parsed the source once). The yellow block just indicates where the runner was when I made the screenshot.
Legend for the parser versions displayed:
- current published: this is what you can currently find in the repo. It's blatantly faster than anything I can produce right now, but that's in part because my current version is "more correct" and had to suffer a "big" perf hit because of it. I've been trying to beat it ever since but have not managed to do so yet.
- dev version: this is "ifelse" before processing it. The build script processes the raw source and transforms the prototypal approach into a closure approach. It also eliminates all the constants, replacing them with their literal values (see the small example after this list). It's crazy how much difference that makes in Firefox, and how little it makes in Chrome.
- switch_group_split: a switch, grouped by return, split over multiple functions
- ifelse: the exact structure you see at the top of this post
- ifelse_split: same as ifelse, except deeper conditional levels are put in their own functions a few times
- indexof: using arrays of characters and an indexOf for each such group. Terribly slow.
- ifelse_split_more: same as ifelse_split, except all new functions are a maximum of one conditional level deep
- switch_group: huge switch, but each case that ends up with the same return value falls through. Cases are not moved because of this, though; they are in switch_grouped.
- switch_group_split2: cases are grouped by the return function to call; in addition, any ranges are done with regular ifs. Each step is put in its own function, so these functions have either a switch or an if.
- switch_grouped: huge switch, but each case is grouped together by the return function to be called. Cases are moved to be grouped together.
- switch_splits: break up the huge switch, with each case having its own return statement, into multiple functions.
- switch_reduced: I think this is the same as switch_group.
- ifelse_switch: same as normal ifelse, except consecutive ifs on the same level are replaced with a switch.
- ifelse_split_super: same as ifelse_split_more, but it even breaks down functions with only single-level ifs. At first this appeared to be slower, but it's actually better in the end.
- ifelse_tricks: this is ifelse but with some of the tricks (explained above) implemented. Yeah, that didn't work.
- switch: a huge switch where each character case is ordered by frequency. That's it.
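To give a feel for what the build step from the "dev version" entry does, here's a made-up before/after fragment. Tokenizer, nextToken, and the inlined numbers are placeholders for illustration, not the actual ZeParser2 source or build output.

// dev version: prototypal, with named constants
Tokenizer.prototype.nextToken = function (c) {
  if (c === ORD_SPACE) return this.__plusOne(WHITE);
  // ...
};

// after the build step: closure style, constants folded to their literal values
function Tokenizer() {
  function nextToken(c) {
    if (c === 32) return plusOne(8); // ORD_SPACE and WHITE inlined (values made up)
    // ...
  }
  // ...
}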
Conclusion...
I'm basing this on the min parse times. My rationale is that the parser versions are of equal architecture and suffer the same problems in the test runner, whatever those may be. So if, under those conditions, one version was able to parse a certain source in a given time over the course of various runs, and this min time is better than that of the other parser versions, then it is theoretically faster.
Keep in mind that these numbers signify milliseconds of parse time on 16 megabytes of source code. Those are HUGE numbers and the differences are MINIMAL. That said, faster is faster, no question about it. I often see certain approaches being consistently faster, even if it's just 5 milliseconds.
It should also be noted that this is heavily influenced by the benchmark script. That is also a dangerous point for me: if I rely too heavily on this file, my parser will get a bias and perform worse on certain types of scripts. Some examples: minified vs non-minified, tabs vs spaces, "built" or processed scripts tending to use the same type of quotes and structures, jQuery and underscore/lodash containing many more identifiers starting with $ and _, and so forth. The benchmark file contains an emscripten project. Great for adding a few MB of script, but it skews the whitespace, quote, and identifier frequency balance.
Interestingly enough, and I had almost given up on it, it seems a combination of a switch, range checks, and splitting up over multiple functions is actually one of the fastest ways to go in Chrome after all. This is "switch_group_split2" in the screenshots. The other is the if-else structure at the top, heavily broken down into multiple functions to reduce the number of logical branches per function.
Quite surprising is that initially Firefox also has "switch_group_split2" as the winner, along with "ifelse_tricks". Though in hindsight, Firefox invests heavily in asm.js and these are very asm.js-istic. In the long run it seems, as you can see, that the split versions all do equally well. So whether you use switch or if-else, for Firefox splitting up over multiple functions is the way to go.
In Chrome, the "ifelse_tricks" approach is significantly slower. Likewise, "ifelse_split_super" is a bit slower too, though not as much. In general, even ignoring "hot or not", Firefox does a bit better on the if-else versions but much worse on the switch cases. I guess asm.js promotes if-else over switch? :) Perf is a weird beast.
After a tweet I went and reconfirmed that using an object literal or array with the character code as key was not faster either. Comparable to the above, the mins: Chrome: 533 for an object literal, 525 for an array literal, 425 for the current version. In Firefox: 492 for the object literal, 503 for the array literal, 437 for the current version. Note that the array was not sparse, as browsers have notorious perf problems with that. So it's not terribly slow or anything, but there are certainly faster ways. My guess is that the JIT just bails on trying to optimize a 120-entry function jump table.
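For reference, the lookup-table variant looks roughly like this (a hypothetical sketch; the real table had around 120 entries, and the object-literal variant is the same idea with a plain object keyed by character code):

// array variant: index = character code, value = handler; kept dense to avoid holes
var handlers = [];
for (var i = 0; i < 128; ++i) handlers[i] = null;
handlers[ORD_SPACE] = function (p) { return p.__plusOne(WHITE); };
handlers[ORD_DOT] = function (p) { return p.__parseDot(); };
handlers[ORD_COMMA] = function (p) { return p.__plusOne(PUNCTUATOR); };
// ... one entry per character the tokenizer cares about

var handler = handlers[c];
if (handler) return handler(this);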
While Firefox/Chrome were run on a desktop Linux machine, I've also done some small trials in IE (10/11) and Opera (20) on Windows, on a laptop (so benchmark times cannot be compared to the above!). Mainly because I was curious :) So after installing 75 updates for an hour and a half (quite literally), these are the results for each:
Opera:
IE10:
IE11:
Opera, basically running a Chrome shell, scores similar to what we've seen so far. We can see that the current version is still fastest. We can also see that switch_group_split2 and ifelse_split_more are also leading.
In IE we can see that times are drastically slower (when just comparing to Opera, of course) in both 10 and 11. That's kind of remarkable to me because I thought IE was doing pretty well in the perf department. There's also very little improvement between 10 and 11.
An interesting point in IE10 (which I only ran briefly) is that the current and dev versions of the code take twice as long as the processed builds. My explanation is that IE does not handle prototype-based modules well but optimizes well for closured systems. Seems they need to work on their JIT. Apart from that observation, the version that stands out is the optimized if-else.
However, IE11 clearly has one winner: switch_group. It's kind of surprising that it stands out so much from the rest. Second and third are the single switch and the switch group split up into multiple functions. I'm a bit surprised by this result; I would not have guessed that IE invested so much in switch. Unfortunately, IE11 is also the only browser that does quite badly (relative to the others) on switch_group_split2. Likewise, ifelse_split_super and ifelse_split_more are not even close to the switch versions.
So. I'm going to continue development with the switch_group_split2 approach. I think there's a bit of room for improvement and further investigation in this approach, so that's good. And of course, there are many other points in the parser that need attention. For example, one thing I want to start experimenting with next is token hinting. Unfortunately it's not so easy to implement this approach because right now I immediately fetch the next token when consuming the current token. So I'll need to change this to a lazy approach which might incur some overhead, but I'm pretty sure that it'll save way more in guess work. So who knows.
Hope you enjoyed another one of my random rambling posts :) I'd post running demos but really, who gives a damn :)
References:
- ZeParser2
- Gonzales
- HeatFiler