Tokenizing JavaScript

2011-05-21

Tokenizing is the process of breaking up source code in the smallest groups of characters that belong together. So when the tokenizer is done, you're no longer talking about a, b, 1 or +. But you'll talk about Identifier, String, Number or Punctuation tokens. That way the actual parser doesn't need to bother itself with all the character details, or whitespace "noise".

This blog post will explain how I am creating tokens when parsing JavaScript or ECMAScript (I'll refer to that as js) using a hand written parser.

By the way, when I tried writing a parser for js for the first time I had no idea that a tokenizer would make my life much easier. And let me tell you, if you don't know about it, that shit is hard. So I had to learn that the hard way. I hope this post will prevent you from doing the same ;)

Tokens in ECMAScript are actually quite easy to discover. Once you're inside a certain token, you only have to scan for the end of a token. So once you're inside, there is no ambiguity on what to expect next. This is important for ECMAScript because there exist an contextual ambiguation in the language where you need to know whether a division or a regular expression is to be parsed next. But the tokenizer doesn't care about this. You simply set a flag and don't worry about how it's set. The biggest challenge is the allowed characters for an Identifier.

We start by creating a main loop. At each iteration of the loop, the pointer of the input will be at the start of the next token. Since speed is important here, we will focus on doing this as fast as possible. So yes, that means we'll make a single loop and make use of regular expressions and do a few "premature optimizations".

So let's start with a shim.

Code:
var input = "5 + 5";
var tokens = [];
var pos = 0;
var hadNewline = false;
var parseRegex = false;
while (pos < input.length) {
var posStart = pos;
var chr = input[pos++];

// ... code

}

Now, there are eight different types of tokens we need to parse. WhiteSpace, LineTerminator, Comment, String, Number, RegularExpression, Identifier and Punctuator. That's actually not that bad! So let's get to it...

We start with the whitespace. JS doesn't really care about whitespace, except for separating individual tokens for the tokenizer. For instance, take the keyword instanceof and a number. In the bogus example 5instanceof Number, no whitespace is required between the number and the operator because js doesn't allow numbers to prefix a variable name. So it will start parsing a number and stop once it encounters a character that can't be part of a number. The opposite is true for Number instanceof 5. Since numbers may be part of an identifier, instanceof5 would be a valid variable name and you need to add whitespace to disambiguate the tokens for the tokenizer :) Note that x instanceof/* */5 would be fine. It's all about the tokenizer being able to distinguish between tokens, and it can in this case.

Sorry for side stepping there ;) So we want whitespace. There are three types of whitespace tokens; WhiteSpace, LineTerminators and Comments. Yes, comments are considered whitespace (see example above) and are almost completely ignored by the js parser. I say almost because there is one important thing to take into account. Line terminators influence automatic semi colon insertion (ASI). So even though we completely discard whitespace for the parser, we will need to take note of whether a line terminator was parsed. That includes line terminators in multi line comments! Oh yes, it's that evil.

WhiteSpace tokens in js are single characters in the set of space, tab, vertical tag, feed forward, non-breaking spae, byte order mark and any characer in the set of unicode spaces. We will ignore the last part for now and can construct a simple regular expression to scan with: /[\u0020\t\u000B\u000C\u00A0\uFFFF]/. So this results in this snippet:

Code:
if (chr.test(/[\u0020\t\u000B\u000C\u00A0\uFFFF]/)) {
tokens.push({start:posStart, stop:pos, type:'WhiteSpace'});
}

We do the same for line terminators. Which is any of line feed, carriage return, line seperator and paragraph separator. This gives us this regex: /[\u000A\u000D\u2028\u2029]/. And the loop gets extended pretty much the same as whitespace, except for setting a flag.

Code:
else if (chr.test(/[\u000A\u000D\u2028\u2029]/)) {
tokens.push({start:posStart, stop:pos, type:'LineTerminator'});
hadNewline = true;
}

Comments either start with a // and run up to the first newline or they start with /* and ends with */ and actually allows newlines. In fact, if it has a newline we need to remember that too. So the snippet for this becomes:

Code:
else if (chr == '/' && input[chr+1] == '/') {
pos = inp.indexOf('\n',pos);
// you should also check for the other newlines and do a Math.min on their results... newlines are most common though
tokens.push({start:posStart, stop:pos, type:'Comment'});
}
else if (chr == '/' && input[chr+1] == '*') {
var newpos = input.indexOf('*/', pos) + 2;
// do this for all valid newlines but remember you only need one
hadNewline = hadNewline || input.indexOf('\n', pos) < newpos;

tokens.push({start:posStart, stop:newpos, type:'Comment'});
}

Cool, so we have whitespace now. And we've taken care of detecting newlines so ASI can use them later. Note that the specificatin specifically notes that you should remove the whitespace tokens. But we'll want it because you don't want to lose information and there might actually be extra information in the comments (jsdoc).

Let's move to the next token; Strings. Strings are relatively easy. Strings start with either " or ' and end with an unescaped " or ', where strings start and end with the same type of quote and ignore the other type. So you need to make sure to ignore any backslash double, but that's it.

Code:
else if (chr == '"') {
// keep moving forward till the right quote is found. if current char is a backslash, move one extra forward
// note that you should also check for unescaped newlines, since they're not allowed ;)
while (input[pos++] != '"') if (chr == '\\') ++pos;
tokens.push({start:posStart, stop:++pos, type:'String', newlineBefore: hadNewline});
}
else if (chr == "'") {
// keep moving forward till the right quote is found. if current char is a backslash, move one extra forward
// note that you should also check for unescaped newlines, since they're not allowed ;)
while (input[pos++] != "'") if (chr == '\\') ++pos;
tokens.push({start:posStart, stop:++pos, type:'String', newlineBefore: hadNewline});
}

Next we check for numbers. Numbers start with 0 ~ 9. If the next char is a zero, it should only be a hex escape. If not, we're talking about a single zero or an octal escape. And octal escapes are illegal in ES5 (so deal with that as you see fit). Numbers are actually slightly more complex than just numbers with dots. We distinct between hex and decimal. Hex are easy, just 0~9 A~F. Decimals can have an optional dot and an optional exponent...

A decimal number may start with a dot (fraction) or may end with a dot (which is silly) and they may have an exponent, signified by the sigil e or E. The number after the exponent sigil may also have a + or - prefix and least one number. Sigh.

Code:
else if ((chr >= '0' && chr <= '9') || (chr == '.' && input[pos+1] >= '0' && input[pos+1] <= '9')) {
if (chr == '0' && (inp[pos] == 'x' || inp[pos] == 'X')) {
// hex can not contain dots or exponents
while (++pos < inp.length && this.regexHex.test(inp[pos]));
} else {
if (chr != '.') { // first do integer part
while (pos < inp.length && inp[pos] >= '0' && inp[pos] <= '9') ++pos;
if (inp[pos] == '.') ++pos; // skip dot. its not important
}
// decimal part
while (pos < inp.length && inp[pos] >= '0' && inp[pos] <= '9') ++pos;
// exponent part
if (inp[pos] == 'e' || inp[pos] == 'E') {
// optional prefix, may also be (a useless) plus sign
if (inp[++pos] == '+' || inp[pos] == '-') ++pos;
// note that this part is required. so check and create an error if that's not the case
while (pos < inp.length && inp[pos] >= '0' && inp[pos] <= '9') ++pos;
}
}
tokens.push({start:startPos,stop:pos,type:'Number', newlineBefore: hadNewline});
}

And the last literal token type is the regular expression. Note that for js, you need to tell the tokenizer whether you want a regex or a div. The regex itself is really simple to parse as a token. We don't have to worry about the // case (which is not a legal regular expression by spec). And we also don't need to worry about the multi line comments. Newlines are not allowed in regular expression literals, so you'll need to guard for that yourself. You could also check to make sure that character classes [ are properly closed with ] and of course the backslash just ignores the next characters.

The "fun" part about regular expression literals is that they allow for an optional set of flags to follow the literal. At parse time, these flags are not tested for support. The specification simply states that you may put any identifier there. But since js only supports a limited set of identifiers, I only check for ascii identifier (and even that is kind of overkill).

Code:
else if (parseRegex && chr == '/') {
var foundEnd = false;
while (++pos < inp.length) {
chr = inp[pos];
// parse RegularExpressionChar
if (chr == '\\') {
if (inp[pos+1] != '\n') ++pos; // is ok anywhere in the regex (skip next char)
else throw 'error'; // newline not allowed. not even escaped
} else if (chr == '[') {
while (++pos < inp.length && inp[pos] != '\n' && inp[pos] != ']'); // only newline is not allowed in class range
} else if (chr == '\n') {
throw "error"; // newlines not allowed
} else if (chr == '/') {
foundEnd = true;
break;
}
// otherwis it's a source char and we continue parsing
}
if (!foundEnd) throw 'error';
++pos;

// this is the identifier scanner (regex flags) which should actually allow any valid identifier (this just checks for ascii names)
while (pos < inp.length && /[a-zA-Z0-9\$_]/.test(inp[pos])) ++pos;

// fun fact, if a regex starts on a new line, the newline before it cannot cause ASI
tokens.push({start:startPos,stop:pos,type:'Regex', newlineBefore: false};
}

Alright, we're done with value literals. There are just two token types left. We need to check for punctuators and identifiers. We start with punctuators. There are just a handfull of valid punctuators. And it actually differs literally 50% in parse speed to switch from a loop to a regular expression on a substring of the input. This is why it's the second last thing to check for because we'd like to prevent the substring if at all possible. Maybe some day we can just set a start and limit to applying regular expressions on a target string...

Code:
else {
var result = /^(>>>=|===|!==|>>>|<<=|>>=|<=|>=|==|!=|\+\+|--|<<|>>|\&\&|\|\||\+=|-=|\*=|%=|\&=|\|=|\^=|\/=|\{|\}|\(|\)|\[|\]|\.|;|,|<|>|\+|-|\*|%|\||\&|\||\^|!|~|\?|:|=|\/)/.exec(input.substring(pos,pos+4));
if (result) {
// save the actual punctuator. you'll need it anyways and this prevents more substringing.
tokens.push({start:pos, stop:pos+=result[1].length, type:'Punctuator', value:result[1], newlineBefore: hadNewline};
} else {
// ... identifier

We're almost there! The only thing left to check for is an identifier. Identifiers are actually a little tricky because the specification notes a couple of unicode character classes as valid for identifiers. And you cannot check for character classes in js through easy escapes. This means that you cannot do a proper specification following js parser in js. There are characters you cannot parse in js (unicode characters above 16bit). I don't really see that as a problem though :) But the ones below 16bit are a pain because you have to be explicit. These ranges are not consequtive most of the time and you end up with very very long unicode ranges. On top of that you don't really want to execute those regexes all the time if you don't have to.

So the solution I came up with is to check for simple ascii ranges first, then see if i can easily determine that the next char (that's not ascii) is a char I know not to be in the other ranges. Punctuations and whitespace will certainly end the identifier token.

I used YQL to create a set of regular expressions for the character classes. I described that in this blog post a while ago. That's basically what we'll use here as well, if we ever get there.

Note that unicode escapes are valid parts of an identifier. JS uses the "canonical" result as a variable, so var \u0061; would be the same variable as var a;. Remember that if you ever want to obfuscate your code ;)

Code:
var found = false;
// we need to do this character by character... we can potentially optimize this but wont for now.
while (pos < input.length) {
var c = input[pos];
if (/[a-zA-Z0-9\$_]/.test(c)) ++pos;
else if (/\\u[0-9A-Fa-f]{4}/.test(input.substring(pos,pos+6))) pos += 6; // this is like a \uxxxx
// c is not ascii and not a unicode escape. either check for unicode character classes or stop
// in my parser i made it optional to use unicode.
else if (this.Unicode) {
// these chars may not be part of identifier. they are whitespace and punctuators.
// this hopes to prevent running the unicode class range regexes...
if (/[\>\=\!\|\<\+\-\&\*\%\^\/\{\}\(\)\[\]\.\;\,\~\?\:\ \t\n\\\'\"]/.test(c)) break;
// for most scripts, the code wont reach here. which is good, because this is going to be relatively slow :)
var Unicode = this.Unicode; // cache
if (!(
// these may all occur in an identifier... (purely a specification compliance thing :)
// see http://qfox.nl/notes/90 for details
Unicode.Lu.test(c) || Unicode.Ll.test(c) || Unicode.Lt.test(c) || Unicode.Lm.test(c) ||
Unicode.Lo.test(c) || Unicode.Nl.test(c) || Unicode.Mn.test(c) || Unicode.Mc.test(c) ||
Unicode.Nd.test(c) || Unicode.Pc.test(c) || Unicode.sp.test(c)
)) break; // end of match.
} else break; // end of match.
// if a character is found to be part of an identifier, say so now.
found = true;
}

if (found) tokens.push({start:startPos,stop:pos,type:'Identifier', newlineBefore: hadNewline};
else throw 'error';
}

And that's pretty much it. After the loop the tokens array will contain all tokens from the input stream unless an error is thrown. There is a lot of error checking I left out. I'll leave that as an excercise to the reader ;)

So that's how you would scan for tokens in js with a hand written tokenizer. This basically works for most languages, although you have to take care with the specific quirks of every individual language (like unicode escapes being part of an identifier in js). Now do note that you've not yet parsed js, we only have a token stream. This helps but is far from where you want to be in language terms :)

The parser blog post should follow soon. That's where we'll use the tokenizer as a source for input while actually parsing the input source code.

Hope you found it interesting!