Parser perf research

2012-07-16

Parser perf research

Writing a parser in js is a tricky business. Tweaking it for speed seems to be a black art by itself. There are a lot of engines, each with their own tricks. It's very difficult to create fast performing code that's still platform agnostic.

TL;DR: parsing's tricky business. Check out this jsperf case to see an attempt at getting a grip on which approach might be best. The blog post describes each individual test you find there.

Parser


As you probably know if you read this blog, I've written a parser for js code in js a while ago. There are a few heavy duty pain points I identified tentatively. But since optimizers aren't very deterministic, it's very difficult to tell which approaches are actually better. Especially when you want to look at multiple browsers. Especially when you can't even "clear" the environment noise these engines are very susceptible to. But you still have to get numbers.

One of the heavy duty points in my code is determining which token the to parse next. A token is the smallest "atom" of source code. For js this means stuff like strings, numbers and comments. The parser treats the sequence of characters that make up these tokens as one unit. So the tokenizer basically makes the source code easier to work with for the parser. This does mean that the tokenizer will go through every part of the source code so it must be very performing.

Next token


So the case in point is identifying which token to parse next. There are a few ways to go about this. My first attempt was flimsily checking the first character or two and making decisions based on that. I then did some tests with regular expressions and it turned out to make things a lot faster. It cut my parse time at the time back in half! Just for reference, this is the regex that it uses:

Code:
// 1 ws 2 lt 3 scmt 4 mcmt 5/6 str 7 nr 8 rx 9 dom 10 punc
Tokenizer.regexBig = /^([ \t\u000B\u000C\u00A0\uFFFF])?([\u000A\u000D\u2028\u2029])?(\/\/)?(\/\*)?(')?(")?(\.?[0-9])?(?:(\/)[^=])?(?:(<)[^<=])?(>>>=|===|!==|>>>|<<=|>>=|<=|>=|==|!=|\+\+|--|<<|>>|\&\&|\|\||\+=|-=|\*=|%=|\&=|\|=|\^=|\/=|\{|\}|\(|\)|\[|\]|\.|;|,|<|>|\+|-|\*|%|\||\&|\||\^|!|~|\?|:|=|\/)?/;
Tokenizer.regexBigAlt = /([ \t\u000B\u000C\u00A0\uFFFF])?([\u000A\u000D\u2028\u2029])?(\/\/)?(\/\*)?(')?(")?(\.?[0-9])?(?:(\/)[^=])?(>>>=|===|!==|>>>|<<=|>>=|<=|>=|==|!=|\+\+|--|<<|>>|\&\&|\|\||\+=|-=|\*=|%=|\&=|\|=|\^=|\/=|\{|\}|\(|\)|\[|\]|\.|;|,|<|>|\+|-|\*|%|\||\&|\||\^|!|~|\?|:|=|\/)?/g;

As described in the comment, they apply to the next four characters (to identify the right punctuators immediately) and return a match with several positions (or null in case of an identifier or an error). I had two approaches; one would apply the regex on a substring, the other would use the lastIndex property of global regular expressions. There didn't seem to be much of a difference between the two, but the different code paths can still be found in the code.

Again


So a while ago Ariya Hidayat released his js parser, called Esprima. It was especially painful to see that it was about three to five times faster. Or whatever number that should be, it's significant in any test. And I remember being very impressed by the perf I got out of ZeParser when I finished it. So needless to say I was very impressed to see that Esprima got even faster. And disappointed, of course :p

Comparison


I do have to note that comparing parsers is a tricky business. My parser might perform several times worse than Esprima, but I also know it's doing a lot of stuff to help Zeon do it's thing. That means lot's of unnecessary overhead. Every parser has its own purpose with its own overhead. I don't mean to give an excuse or anything, don't get me wrong. Esprima is still much faster than ZeParser, even when I turn on the minimal mode of ZeParser that just produces the token stream. All I'm saying is that comparing parsers is close to apples and oranges. Or jQuery and dojo.

For the record, the new version of my parser beats Esprima consistently. Though the difference is mostly noticeable on bigger test suits. I think we're hitting boundaries though. It took me quite some effort to gain an edge on Esprima :p

Tests


Right, let's get back to the point in case. I wanted to see what the fastest way was of determining the next token. Besides trying to do this with a single regular expression (which I've tried, but am not convinced is the best way to go), this process can be divided in two parts. Getting the first character of the next token and determining which type of token it is the start of.

Getting a character
Code:
str[index]
str.charAt(index)
str.charCodeAt(index)
str.substring(index,index+1)
str.substr(index, 1)
str.slice(index, index+1)
new RegExp('.{'+index+'}(.)').exec(str)[1]
str.match(new RegExp('.{'+index+'}(.)'))[1]
str.split('')[index]
str.split('').forEach(function(c){ ... })
Array.prototype.forEach.call(str, function(c){ ... })
str.replace(/./g, function(c){ ... })

While this is a long list, there are only three serious contestants; getting a character as a string through array-like indexing or charAt, or as a number with charCodeAt. Contrary to classic js performance laws, the function call overhead is actually being optimized away these days. In fact, as the test shows the charAt test actually performs better than the index test. We really have to put ourselves over these classic laws. They simply don't apply anymore.

Determining the next token
if-else for each character you target
switch with a case for each character you target
object[chr]=type
array[ord]=type
chr in object
regex (as displayed above)
array.indexOf(chr / ord)
string.indexOf(chr)
applying a regex for each token (like a regex of whitespaces) to determine type

Some of these sound a little ridiculous, but you'd be surprised the performance certain engines can get with certain approaches. Never assume, always test.

Details


So the tests cover a combination of the three ways of getting a character with all the ways to determine the next token. Let's discuss them in detail. I will try to add the right preparation code to the corresponding test, to make it a bit more clear.

if-else, number


Code:

var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.charCodeAt(i);
if (c === 0x0009 || c === 0x000B || c === 0x000C || c === 0x0020 || c === 0x00A0 || c === 0xFFFF) ++whites;
else if (c === 0x000A || c === 0x000D || c === 0x2028 || c === 0x2029) ++newlines;
else if (c === 0x0027) ++squotes;
else if (c === 0x0022) ++dquotes;
else if ((c >= 0x0030 || c <= 0x0039) || c === 0x002e) ++numbers;
else if (c === 0x002f) ++slashes;
else throw "omg this should never happen!";
}

This test is the simplest of them all. Believe it or not though, this seems to be the best way to go. Results may vary, especially with bigger if-else structures the results may differ. But the test at hand clearly shows this one as the winner across the board. Getting the next character as a number and then comparing the numbers. The hex notation makes this especially pleasant to read with the "weird" characters. It's also a very fast way of checking serial ranges of characters.

It doesn't surprise me that this is actually faster than indexed character access, since that requires a string comparison. The computer only deals in numbers, so comparing numbers takes away the translation step. However, you never know what might happen internally, so let's continue with the next test.

if-else, string, method


Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.charAt(i); // formal way of accessing individual characters by index
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
}

This is much like the previous test, except it does string comparisons. As you can see this drastically cuts in performance.

if-else, string, index


Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input[i ]; // de facto standard of accessing individual characters by index, as an array index
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
}

Same as previous test except it accesses the character by an index as if it were an array. Though not specific (yet), this works for all engines. Performance varies though. However, number comparison still beats it.

if-else (implicit), string, method


Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.charAt(i); // formal way of accessing individual characters by index
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u002f') ++slashes;
else if (c >= '\u0030' || c <= '\u0039' || c === '\u002e') ++numbers;
else throw "omg this should never happen!";
}

This uses relative string comparison to do the range checks. Turns out this has terrible performance. It didn't improve when explicitly trying to use parseInt with this either. So this is a big nono.

if-else (implicit), string, index


Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input[i ]; // de facto standard of accessing individual characters by index, as an array index
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u002f') ++slashes;
else if (c >= '\u0030' || c <= '\u0039' || c === '\u002e') ++numbers;
else throw "omg this should never happen!";
}

Same as above, except indexed access.

if-else, string, substring


Code:
// (for sake of being complete)
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.substring(i, i+1); // unlikely to be faster, but lets try anyways
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
}

This is more of a theoretical thing. You could get the next character by substringing one character. It's not very fast though.

if-else, string, substr


// (for sake of being complete)
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.substr(i, 1); // de facto standard, would expect same perf as substring
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
}

Alternative substringing...

if-else, string, slice


Code:
// (for sake of being complete)
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.slice(i, i+1); // oh come on, this is just substring... or is it.
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
}

And another alternative to substringing.

if-else, string, dynamic regex, exec


Code:
// (for sake of being complete)
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = new RegExp('.{'+i+'}(.)').exec(input)[1]; // this cant win, but it's a nice trick
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
}

This is a pretty exotic way of getting the next character. It builds a dynamic regular expression which skips index characters and then captures the next one. Simplistic concept but of course the construction and execution of regular expressions kill it. Note that you could memoize the regular expressions, but that's hard to do clean without overhead and I don't think it's worth the effort.

if-else, string, dynamic regex, match


Code:
// (for sake of being complete)
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.match(new RegExp('.{'+i+'}(.)'))[1]; // pretty much same thing as before
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
}

This is the match variation of the dynamic regex method above.

if-else, string, split, while


Code:
// (for sake of being complete)
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
var splinput = input.split('');
while (i--) {
var c = splinput[i ]; // shouldnt be that much different from indexed string access, & it adds to memory footprint
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
}

Loop over an array rather than a string. Not per se bad performance but always has the overhead of actually splitting and of course the extra memory footprint to hold the extra array. This is compensated a bit by the fact that array access is pretty fast. Indexed string access seems to be slightly slower.

if-else, string, split, forEach


Code:
// (for sake of being complete)
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
// shouldnt be that much different from indexed string access, & it adds to memory footprint
// the forEach might be optimized, so that could be interesting
input.split('').forEach(function(c){
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
});

This one, like the next one, is more for the sake of being complete. I did not expect any performance from them and did not get it either. This simply splits the string to an array and uses forEach to get each character individually. Even if it would perf, I wouldn't be able to use it because I'm not actually looping through the input with a normal while.

if-else, string, apply forEach


Code:
// (for sake of being complete)
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
// this might reduce memory footprint, but still the function calls probably kill it
// not all browsers will dig this test. that's ok.
Array.prototype.forEach.call(input, function(c){
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
});

Like above, but tries to reduce the extra memory footprint. Not quite sure whether it helped anything memory wise, but it certainly didn't help perf :p

if-else, string, replace


Code:
// (for sake of being complete)
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
var splinput = input.split('');
var dot = /./;
// replace will call callback for every match, so effectively for every character
// (though not very helpfull while parsing, i didnt want to leave it out here)
input.replace(dot,function(c){
if (c === '\u0009' || c === '\u000B' || c === '\u000C' || c === '\u0020' || c === '\u00A0' || c === '\uFFFF') ++whites;
else if (c === '\u000A' || c === '\u000D' || c === '\u2028' || c === '\u2029') ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (c === '\u0030' || c === '\u0031' || c === '\u0032' || c === '\u0033' || c === '\u0034' || c === '\u0035' || c === '\u0036' || c === '\u0037' || c === '\u0038' || c === '\u0039' || c === '\u002e') ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
});

This gimmick was interesting. replace is often grasped to for being so powerful and you never know what engines might do. And even though at first it really did seem to get 5x the performance of the if-else test, it turned out to be because I forgot to make the /./ global :p Which replace of course expects. With the global flag in place, this test loses all sense of performance.

switch, number/method/index


That was the last of the exotic character getting tests. The remainder will just focus on the three ways and I'll discuss them grouped like that.

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
switch (input.charCodeAt(i)) { // only way to chr2ord
case 0x0009: case 0x000B: case 0x000C: case 0x0020: case 0x00A0: case 0xFFFF:
++whites;
break;
case 0x000A: case 0x000D: case 0x2028: case 0x2029:
++newlines;
break;
case 0x0027:
++squotes;
break;
case 0x0022:
++dquotes;
break;
case 0x0030: case 0x0031: case 0x0032: case 0x0033: case 0x0034: case 0x0035: case 0x0036: case 0x0037: case 0x0038: case 0x0039: case 0x002e:
++numbers;
break;
case 0x002f:
++slashes;
break;
default:
throw "omg this should never happen!";
}
}

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
switch (input.charAt(i)) { // formal way of accessing individual characters by index
case '\u0009': case '\u000B': case '\u000C': case '\u0020': case '\u00A0': case '\uFFFF':
++whites;
break;
case '\u000A': case '\u000D': case '\u2028': case '\u2029':
++newlines;
break;
case '\u0027':
++squotes;
break;
case '\u0022':
++dquotes;
break;
case '\u0030': case '\u0031': case '\u0032': case '\u0033': case '\u0034': case '\u0035': case '\u0036': case '\u0037': case '\u0038': case '\u0039': case '\u002e':
++numbers;
break;
case '\u002f':
++slashes;
break;
default:
throw "omg this should never happen!";
}
}

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
switch (input[i ]) { // de facto standard of accessing individual characters by index, as an array index
case '\u0009': case '\u000B': case '\u000C': case '\u0020': case '\u00A0': case '\uFFFF':
++whites;
break;
case '\u000A': case '\u000D': case '\u2028': case '\u2029':
++newlines;
break;
case '\u0027':
++squotes;
break;
case '\u0022':
++dquotes;
break;
case '\u0030': case '\u0031': case '\u0032': case '\u0033': case '\u0034': case '\u0035': case '\u0036': case '\u0037': case '\u0038': case '\u0039': case '\u002e':
++numbers;
break;
case '\u002f':
++slashes;
break;
default:
throw "omg this should never happen!";
}
}

These should be equivalent to the if-else tests. In fact, they should be faster because engines get clues about the structure. It's very surprising to find out that most engines fail to optimize this case. In fact, the switch seems to perform much worse than the if-else in most environments. This kind of surprised me. That's what we call low hanging fruit...

From these three, the number version obviously performed best. In the end a switch translates into a series of if-elses which do a strict comparison to the switch argument value. (I would have expected that value to be optimized and the optimizer to create a map for the cases...)

regex, method/index


These tests use this regex:

Code:
// one capturing group per type... we'll even use magic numbers for checking to keep the overhead minimal
var regex = /([\u0009\u000B\u000C\u0020\u00A0\uFFFF])|([\u000A\u000D\u2028\u2029])|(')|(")|([\d.])|(\/)/


Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var match = regex.exec(input.charAt(i));
if (match) {
// using magic numbers for indices, to prevent overhead as much as possible
if (match[1]) ++whites;
else if (match[2]) ++newlines;
else if (match[3]) ++squotes;
else if (match[4]) ++dquotes;
else if (match[5]) ++numbers;
else if (match[6]) ++slashes;
else throw "omg this should never happen!";
} else {
throw "omg this should never happen!";
}
}

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var match = regex.exec(input[i ]);
if (match) {
// using magic numbers for indices, to prevent overhead as much as possible
if (match[1]) ++whites;
else if (match[2]) ++newlines;
else if (match[3]) ++squotes;
else if (match[4]) ++dquotes;
else if (match[5]) ++numbers;
else if (match[6]) ++slashes;
else throw "omg this should never happen!";
} else {
throw "omg this should never happen!";
}
}

If the regex matches, one of the capturing groups will contain at least one character. This is what we check, and this is how the code knows what token to parse. This is actually how the first version of ZeParser does it, albeit with a slightly more complex regex.


if-else, string, grouped regex


Code:
// regex groups
var regexWhitespace = /[\u0009\u000B\u000C\u0020\u00A0\uFFFF]/;
var regexNewlines = /[\u000A\u000D\u2028\u2029]/;
var regexNumbers = /[\d.]/;

// (for sake of being complete)
var input = ' "15';
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.slice(i, i+1); // oh come on, this is just substring... or is it.
if (regexWhitespace.test(c)) ++whites;
else if (regexNewlines) ++newlines;
else if (c === '\u0027') ++squotes;
else if (c === '\u0022') ++dquotes;
else if (regexNumbers.test(c)) ++numbers;
else if (c === '\u002f') ++slashes;
else throw "omg this should never happen!";
}

Uses a regex to check whether a character belongs to a certain group of chars. The overhead of the regex engine kills performance for one character though.

hash, object, number


Code:
// object as lookup table
var lookupObj = {
0x0009: 1, 0x000B: 1, 0x000C: 1, 0x0020: 1, 0x00A0: 1, 0xFFFF: 1,
0x000A: 2, 0x000D: 2, 0x2028: 2, 0x2029: 2,
0x0027: 3,
0x0022: 4,
0x0030: 5, 0x0031: 5, 0x0032: 5, 0x0033: 5, 0x0034: 5, 0x0035: 5, 0x0036: 5, 0x0037: 5, 0x0038: 5, 0x0039: 5,
0x002e: 6,
0x002f: 7
};

var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
// using magic numbers for indices, to prevent overhead as much as possible
switch (lookupObj[input.charCodeAt(i)]) {
case 1: ++whites; break;
case 2: ++newlines; break;
case 3: ++squotes; break;
case 4: ++dquotes; break;
case 5: ++numbers; break;
case 6: ++slashes; break;
default:throw "omg this should never happen!";
}
}

The idea is that browsers heavily optimize object access. So we try to abuse this by determining the token type by accessing the (single) object. Note that for property access with numbers, the numbers are converted to strings and then looked up (although this might internally be optimized differently). So I did not expect this to be fast. Some engines perform decently on this though.

hash, array, number


Code:
// array as lookup table
var lookupArr = [];
lookupArr[0x0009] = lookupArr[0x000B] = lookupArr[0x000C] = lookupArr[0x0020] = lookupArr[0x00A0] = lookupArr[0xFFFF] = 1;
lookupArr[0x000A] = lookupArr[0x000D] = lookupArr[0x2028] = lookupArr[0x2029] = 2;
lookupArr[0x0027] = 3;
lookupArr[0x0022] = 4;
lookupArr[0x0030] = lookupArr[0x0031] = lookupArr[0x0032] = lookupArr[0x0033] = lookupArr[0x0034] = lookupArr[0x0035] = lookupArr[0x0036] = lookupArr[0x0037] = lookupArr[0x0038] = lookupArr[0x0039] = 5;
lookupArr[0x002e] = 6;
lookupArr[0x002f] = 7;

var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
// using magic numbers for indices, to prevent overhead as much as possible
switch (lookupArr[input.charCodeAt(i)]) {
case 1: ++whites; break;
case 2: ++newlines; break;
case 3: ++squotes; break;
case 4: ++dquotes; break;
case 5: ++numbers; break;
case 6: ++slashes; break;
default:throw "omg this should never happen!";
}
}

Same as above, except we access an actual array. This changes the rules of performance a bit because arrays are actually ("supposed to be") heavily optimized for numbered property access. Unfortunately it does mean that we create a so called "sparse" array with a length of 0xFFFF. Various engines don't handle big sparse arrays very well. Safari on the iPad seem to be the exception to this rule. They almost prefer it.

hash, string, method/index


Code:
var lookupHash = {
'\u0009': 1, '\u000B': 1, '\u000C': 1, '\u0020': 1, '\u00A0': 1, '\uFFFF': 1,
'\u000A': 2, '\u000D': 2, '\u2028': 2, '\u2029': 2,
'\u0027': 3,
'\u0022': 4,
'\u0030': 5, '\u0031': 5, '\u0032': 5, '\u0033': 5, '\u0034': 5, '\u0035': 5, '\u0036': 5, '\u0037': 5, '\u0038': 5, '\u0039': 5,
'\u002e': 6,
'\u002f': 7
};

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
// using magic numbers for indices, to prevent overhead as much as possible
switch (lookupHash[input.charAt(i)]) {
case 1: ++whites; break;
case 2: ++newlines; break;
case 3: ++squotes; break;
case 4: ++dquotes; break;
case 5: ++numbers; break;
case 6: ++slashes; break;
default:throw "omg this should never happen!";
}
}

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
// using magic numbers for indices, to prevent overhead as much as possible
switch (lookupHash[input[i ]]) {
case 1: ++whites; break;
case 2: ++newlines; break;
case 3: ++squotes; break;
case 4: ++dquotes; break;
case 5: ++numbers; break;
case 6: ++slashes; break;
default:throw "omg this should never happen!";
}
}

This really tests the object as a hash. It performs pretty decently on some environments.


in, method/index


Code:
// lookup by object
var whitesObj = {'\u0009': 1, '\u000B': 1, '\u000C': 1, '\u0020': 1, '\u00A0': 1, '\uFFFF': 1};
var newlinesObj = {'\u000A': 2, '\u000D': 2, '\u2028': 2, '\u2029': 2};
var squoteObj = {'\u0027': 3};
var dquoteObj = {'\u0022': 4};
var numberObj = {'\u0030': 5, '\u0031': 5, '\u0032': 5, '\u0033': 5, '\u0034': 5, '\u0035': 5, '\u0036': 5, '\u0037': 5, '\u0038': 5, '\u0039': 5, '\u002e': 6};
var slashesObj = {'\u002f': 7};

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.charAt(i);
// using magic numbers for indices, to prevent overhead as much as possible
if (c in whitesObj) ++whites;
else if (c in newlinesObj) ++newlines;
else if (c in squoteObj) ++squotes;
else if (c in dquoteObj) ++dquotes;
else if (c in numberObj) ++numbers;
else if (c in slashesObj) ++slashes;
else throw "omg this should never happen!";
}

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input[i ];
// using magic numbers for indices, to prevent overhead as much as possible
if (c in whitesObj) ++whites;
else if (c in newlinesObj) ++newlines;
else if (c in squoteObj) ++squotes;
else if (c in dquoteObj) ++dquotes;
else if (c in numberObj) ++numbers;
else if (c in slashesObj) ++slashes;
else throw "omg this should never happen!";
}

The in operator is always deemed slow so I did not expect this to be fast. I was not disappointed.

indexOf, number/string


Code:
// lookup by indexof
var arrWhitespace = [0x0009, 0x000B, 0x000C, 0x0020, 0x00A0, 0xFFFF];
var arrNewlines = [0x000A, 0x000D, 0x2028, 0x2029];
var arrSquote = [0x0027];
var arrDquote = [0x0022];
var arrNumber = [0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x002e];
var arrSlash = [0x002f];

var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.charCodeAt(i);
// using magic numbers for indices, to prevent overhead as much as possible
if (arrWhitespace.indexOf(c)) ++whites;
else if (arrNewlines.indexOf(c)) ++newlines;
else if (arrSquote.indexOf(c)) ++squotes;
else if (arrDquote.indexOf(c)) ++dquotes;
else if (arrNumber.indexOf(c)) ++numbers;
else if (arrSlash.indexOf(c)) ++slashes;
else throw "omg this should never happen!";
}

Code:
var objWhitespace = ['\u0009', '\u000B', '\u000C', '\u0020', '\u00A0', '\uFFFF'];
var objNewlines = ['\u000A', '\u000D', '\u2028', '\u2029'];
var objSquote = ['\u0027'];
var objDquote = ['\u0022'];
var objNumbers = ['\u0030', '\u0031', '\u0032', '\u0033', '\u0034', '\u0035', '\u0036', '\u0037', '\u0038', '\u0039', '\u002e'];
var objSlash = ['\u002f'];

var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.charAt(i);
// using magic numbers for indices, to prevent overhead as much as possible
if (objWhitespace.indexOf(c)) ++whites;
else if (objNewlines.indexOf(c)) ++newlines;
else if (objSquote.indexOf(c)) ++squotes;
else if (objDquote.indexOf(c)) ++dquotes;
else if (objNumbers.indexOf(c)) ++numbers;
else if (objSlash.indexOf(c)) ++slashes;
else throw "omg this should never happen!";
}

Code:
var objWhitespace = ['\u0009', '\u000B', '\u000C', '\u0020', '\u00A0', '\uFFFF'];
var objNewlines = ['\u000A', '\u000D', '\u2028', '\u2029'];
var objSquote = ['\u0027'];
var objDquote = ['\u0022'];
var objNumbers = ['\u0030', '\u0031', '\u0032', '\u0033', '\u0034', '\u0035', '\u0036', '\u0037', '\u0038', '\u0039', '\u002e'];
var objSlash = ['\u002f'];

var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input[i ];
// using magic numbers for indices, to prevent overhead as much as possible
if (objWhitespace.indexOf(c)) ++whites;
else if (objNewlines.indexOf(c)) ++newlines;
else if (objSquote.indexOf(c)) ++squotes;
else if (objDquote.indexOf(c)) ++dquotes;
else if (objNumbers.indexOf(c)) ++numbers;
else if (objSlash.indexOf(c)) ++slashes;
else throw "omg this should never happen!";
}

Much to my surprise, using indexOf on arrays actually performed okay.

strindexOf, method/index


Code:
// lookup by indexof
var strWhitespace = '\u0009\u000B\u000C\u0020\u00A0\uFFFF';
var strNewlines = '\u000A\u000D\u2028\u2029';
var strSquote = '\u0027';
var strDquote = '\u0022';
var strNumbers = '\u0030\u0031\u0032\u0033\u0034\u0035\u0036\u0037\u0038\u0039\u002e';
var strSlash = '\u002f';

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input.charAt(i);
// using magic numbers for indices, to prevent overhead as much as possible
if (strWhitespace.indexOf(c)) ++whites;
else if (strNewlines.indexOf(c)) ++newlines;
else if (strSquote === c) ++squotes;
else if (strDquote === c) ++dquotes;
else if (strNumbers.indexOf(c)) ++numbers;
else if (strSlash === c) ++slashes;
else throw "omg this should never happen!";
}

Code:
var whites = 0, newlines = 0, squotes = 0, dquotes = 0, numbers = 0, slashes = 0;
var i = 1000;
while (i--) {
var c = input[i ];
// using magic numbers for indices, to prevent overhead as much as possible
if (strWhitespace.indexOf(c)) ++whites;
else if (strNewlines.indexOf(c)) ++newlines;
else if (strSquote === c) ++squotes;
else if (strDquote === c) ++dquotes;
else if (strNumbers.indexOf(c)) ++numbers;
else if (strSlash === c) ++slashes;
else throw "omg this should never happen!";
}

This last test simply tries to indexOf on an array. That performs even (slightly) better than the array version. In fact, Ariya revealed that he uses this in Esprima core. In my findings I don't agree with this though, I think if-else is going to be faster (for smaller sets). But this was an interesting thing to learn anyways.


Conclusion



Having run all these tests I can only conclude that if-else is the way to go for most single character parsing cases. I guess it's not so bad, it allows for fairly slim and clear code. I wouldn't have minded using a switch though. But I guess this will do.

Now if you don't mind, I have to go and rewrite the current iteration of the new ZeParser. There are some switches to eliminate ;)

See the jsperf case here.

Hope you liked this!