Packing token types into 32 bits

2019-09-29

I've been writing a JS parser for over two years now. In this particular post I want to highlight a bit-packing trick I apply to one of the parser's core data structures: the token type.

In general, a parser doesn't consume source code bytes directly. Instead it works together with a tokenizer, also called a lexer. The lexer is the part that actually consumes the source code and chunks it into so-called "tokens". Each token carries some metadata so that the parser doesn't need to concern itself with whitespace or the validity of individual tokens.
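
To illustrate that hand-off, here is a hypothetical token shape (made up for this post; the parser's actual structure may well differ):

```javascript
// Hypothetical token shape, just to illustrate what the lexer hands
// to the parser. Field names here are illustrative, not the real ones.
function createToken(type, start, stop, str) {
  return {
    type,   // packed token type, the subject of this post
    start,  // offset of the first character of the token
    stop,   // offset right after the last character of the token
    str,    // the raw slice of source code this token covers
  };
}

// The lexer produces these one at a time as the parser asks for them:
const token = createToken(/* some packed type */ 0b100001, 3, 4, ' ');
console.log(token.str === ' '); // true
```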

In JS, or at least in my parser, I have a couple of high-level token types:

- whitespace: spaces, tabs, newlines, but also comments
- numbers: decimals, octals, hexadecimals, binary, legacy octals, bigints
- strings: single or double quoted
- templates: a whole template without expression holes, or the start, middle, or end part of one
- regular expressions: with or without the u-flag (other flags matter less here)
- punctuators: pretty much all the syntax characters that are not letters or digits, which can still have whitespace before/after
- identifiers: any name, including keywords

These high-level (and lower-level) token types are vital for making a parser run efficiently. At the same time, the lexer is the single most frequently called part of the whole parsing pipeline. Making it efficient should probably always be the first point of attention, and making its data structures efficient is part of that.

Until last week I had defined the token types as flags in a bit field: the token type would be a number, and I would check whether individual bits were set to determine the type. That works fine, by the way.

In JS, bitwise operations are limited to 32 bits, for reasons. So using bitwise flags means you can have, at most, 32 flags per number value (despite numbers technically being 64-bit floats). You can't really cheat here either, since any bitwise operation returns a 32-bit truncated number, so anything stored in the higher part of the 64 bits would simply get lost. An absolute no-go. That's 32 flags per number, and if you need more, you need to change the structure or use a second number.
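
To make that truncation concrete, here is a quick sketch (mine, not from the parser) of how JS clamps bitwise results to 32 bits:

```javascript
// Bitwise operators coerce their operands to signed 32-bit integers,
// and shift counts are taken modulo 32, so a 33rd bit is unreachable:
console.log(1 << 31);        // -2147483648 (bit 31 is the sign bit)
console.log(1 << 32);        // 1 (shift count 32 wraps around to 0)
console.log(2 ** 32 | 0);    // 0 (bit 32 is truncated away)
console.log((2 ** 53) & -1); // 0 (the high part of the float is lost)
```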

I opted to change the structure.

Before, this was my definition of the token types:

Code:
let $flag = 0;
const $WHITE = 1 << ++$flag;
const $SPACE = (1 << ++$flag) | $WHITE;
const $TAB = (1 << ++$flag) | $WHITE;
const $NL = (1 << ++$flag) | $WHITE;
const $CRLF = (1 << ++$flag) | $NL;
const $COMMENT = (1 << ++$flag) | $WHITE;
const $COMMENT_SINGLE = (1 << ++$flag) | $COMMENT;
const $COMMENT_MULTI = (1 << ++$flag) | $COMMENT;
const $COMMENT_HTML = (1 << ++$flag) | $COMMENT;
const $NUMBER = 1 << ++$flag;
const $NUMBER_HEX = (1 << ++$flag) | $NUMBER;
const $NUMBER_DEC = (1 << ++$flag) | $NUMBER;
const $NUMBER_BIN = (1 << ++$flag) | $NUMBER;
const $NUMBER_OCT = (1 << ++$flag) | $NUMBER;
const $NUMBER_OLD = (1 << ++$flag) | $NUMBER;
const $STRING = 1 << ++$flag;
const $STRING_SINGLE = (1 << ++$flag) | $STRING;
const $STRING_DOUBLE = (1 << ++$flag) | $STRING;
const $IDENT = 1 << ++$flag;
const $PUNCTUATOR = 1 << ++$flag;
const $REGEX = 1 << ++$flag;
const $REGEXU = (1 << ++$flag) | $REGEX; // with /u flag
const $TICK = 1 << ++$flag;
const $TICK_HEAD = (1 << ++$flag) | $TICK;
const $TICK_BODY = (1 << ++$flag) | $TICK;
const $TICK_TAIL = (1 << ++$flag) | $TICK;
const $TICK_PURE = (1 << ++$flag) | $TICK;
const $TICK_BAD_ESCAPE = 1 << ++$flag; // these are only valid in tagged templates from es9 onward...
const $ASI = 1 << ++$flag;
const $EOF = 1 << ++$flag;
const $ERROR = 1 << ++$flag;
ASSERT($flag < 32, 'cannot use more than 32 flags but have ' + $flag);

As you can see, it reserves one bit per type, and every sub type also gets its own bit, OR-ed together with its super type (so $STRING_SINGLE also has the $STRING bit set).

This makes more sense in binary notation:

Code:
$STRING = 0b0000100000000
$STRING_SINGLE = 0b0001100000000
$STRING_DOUBLE = 0b0010100000000
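
As a small runnable sketch (using simplified stand-in values, not the parser's real constants), the supertype test under this scheme is a single mask check:

```javascript
// Simplified stand-ins for the real constants; the real values come
// from the 1 << ++$flag sequence, the exact bits don't matter here:
const $STRING = 1 << 8;
const $STRING_SINGLE = (1 << 9) | $STRING;
const $STRING_DOUBLE = (1 << 10) | $STRING;

// "Is this token any kind of string?" is one bitwise AND:
function isString(type) {
  return (type & $STRING) === $STRING;
}

console.log(isString($STRING_SINGLE)); // true
console.log(isString($STRING_DOUBLE)); // true
console.log(isString(1 << 3));         // false (some unrelated type bit)
```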

This approach worked fine, until I wanted to add support for BigInt and got an assertion error. I had run out of precious bits!

Try as I might, I did not really want to drop any of the above token types. Arguably some of them were redundant and certainly not used by my parser (crlf, the single/double string distinction, html comments, etc). But from an ideological (overengineering) standpoint they might be useful or relevant in the future. So I definitely didn't want to drop them.

I could also have put those redundant distinctions in a second bit field (optionally even, for perf).

Instead, I opted for a different encoding that is a little more future-proof and has the potential to improve perf (to be fair, I don't think it did): a bit field in tandem with a few bits of contiguous space reserved for types that can never coexist.

Code:
let $_leaf = 0;
let $_group = 4;

const $G_WHITE = (1 << ++$_group);
const $G_NEWLINE = (1 << ++$_group);
const $G_COMMENT = (1 << ++$_group);
const $G_IDENT = (1 << ++$_group);
const $G_NUMBER = (1 << ++$_group);
const $G_PUNCTUATOR = (1 << ++$_group);
const $G_STRING = (1 << ++$_group);
const $G_REGEX = (1 << ++$_group);
const $G_TICK = (1 << ++$_group);
const $G_TICK_BAD_ESCAPE = (1 << ++$_group);
ASSERT($_group < 32, 'cannot use more than 32 flags but have ' + $_group);

const $L_SPACE = ++$_leaf;
const $L_TAB = ++$_leaf;
const $L_NL_SINGLE = ++$_leaf;
const $L_NL_CRLF = ++$_leaf;
const $L_COMMENT_SINGLE = ++$_leaf;
const $L_COMMENT_MULTI = ++$_leaf;
const $L_COMMENT_HTML = ++$_leaf;
const $L_IDENT = ++$_leaf;
const $L_NUMBER_HEX = ++$_leaf;
const $L_NUMBER_DEC = ++$_leaf;
const $L_NUMBER_BIN = ++$_leaf;
const $L_NUMBER_OCT = ++$_leaf;
const $L_NUMBER_OLD = ++$_leaf;
const $L_PUNCTUATOR = ++$_leaf;
const $L_REGEXN = ++$_leaf;
const $L_REGEXU = ++$_leaf;
const $L_STRING_SINGLE = ++$_leaf;
const $L_STRING_DOUBLE = ++$_leaf;
const $L_TICK_HEAD = ++$_leaf;
const $L_TICK_BODY = ++$_leaf;
const $L_TICK_TAIL = ++$_leaf;
const $L_TICK_PURE = ++$_leaf;
const $L_EOF = ++$_leaf;
const $L_ASI = ++$_leaf;
const $L_ERROR = ++$_leaf;
ASSERT($_leaf < 32, 'cannot use more than 32 leafs but have ' + $_leaf);

const $SPACE = $L_SPACE | $G_WHITE;
const $TAB = $L_TAB | $G_WHITE;
const $NL_SOLO = $L_NL_SINGLE | $G_WHITE | $G_NEWLINE; // Any specced line terminator that is not the combination of crlf
const $NL_CRLF = $L_NL_CRLF | $G_WHITE | $G_NEWLINE;
const $COMMENT_SINGLE = $L_COMMENT_SINGLE | $G_COMMENT;
const $COMMENT_MULTI = $L_COMMENT_MULTI | $G_COMMENT;
const $COMMENT_HTML = $L_COMMENT_HTML | $G_COMMENT;
const $IDENT = $L_IDENT | $G_IDENT;
const $NUMBER_HEX = $L_NUMBER_HEX | $G_NUMBER;
const $NUMBER_DEC = $L_NUMBER_DEC | $G_NUMBER;
const $NUMBER_BIN = $L_NUMBER_BIN | $G_NUMBER;
const $NUMBER_OCT = $L_NUMBER_OCT | $G_NUMBER;
const $NUMBER_OLD = $L_NUMBER_OLD | $G_NUMBER;
const $PUNCTUATOR = $L_PUNCTUATOR | $G_PUNCTUATOR;
const $REGEXN = $L_REGEXN | $G_REGEX; // No u-flag
const $REGEXU = $L_REGEXU | $G_REGEX; // With u-flag ("strict mode" for regular expressions)
const $STRING_SINGLE = $L_STRING_SINGLE | $G_STRING;
const $STRING_DOUBLE = $L_STRING_DOUBLE | $G_STRING;
const $TICK_HEAD = $L_TICK_HEAD | $G_TICK;
const $TICK_BODY = $L_TICK_BODY | $G_TICK;
const $TICK_TAIL = $L_TICK_TAIL | $G_TICK;
const $TICK_PURE = $L_TICK_PURE | $G_TICK;
const $TICK_BAD_HEAD = $L_TICK_HEAD | $G_TICK | $G_TICK_BAD_ESCAPE;
const $TICK_BAD_BODY = $L_TICK_BODY | $G_TICK | $G_TICK_BAD_ESCAPE;
const $TICK_BAD_TAIL = $L_TICK_TAIL | $G_TICK | $G_TICK_BAD_ESCAPE;
const $TICK_BAD_PURE = $L_TICK_PURE | $G_TICK | $G_TICK_BAD_ESCAPE;
const $EOF = $L_EOF;
const $ASI = $L_ASI;
const $ERROR = $L_ERROR;

This way I can define up to 27 super types, each taking up one bit of space, and by reserving 5 bits of contiguous space I can have 32 sub types. The last block shows how the actual token types are composed; these can be used for identity comparisons rather than bitwise checks. The set is fixed and any variation can be coded explicitly. At the same time you can still quickly check whether a token belongs to a super type ("is this a string? Is this a bigint? Is this a template?") with a bitwise check against the $G_ constants. The biggest danger to watch out for is accidentally composing a token type from more than one $L_ constant.
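
Here is a minimal runnable sketch of those two kinds of checks, reusing the values that the listing above produces for the string tokens:

```javascript
// These values match the listing above: $G_STRING is the 7th group
// (bits start at 1 << 5), and the string leafs are the 17th and 18th:
const $G_STRING = 1 << 11;
const $L_STRING_SINGLE = 17;
const $L_STRING_DOUBLE = 18;
const $STRING_SINGLE = $L_STRING_SINGLE | $G_STRING;
const $STRING_DOUBLE = $L_STRING_DOUBLE | $G_STRING;
const $L_MASK = 0b11111; // the 5 reserved low bits

// Identity: the composed values are unique, so plain === suffices:
console.log($STRING_SINGLE === $STRING_DOUBLE); // false

// Group membership: one bitwise AND against a $G_ constant:
console.log(($STRING_SINGLE & $G_STRING) !== 0); // true

// Recovering the leaf, e.g. for a switch or jump table:
console.log(($STRING_DOUBLE & $L_MASK) === $L_STRING_DOUBLE); // true
```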

Visually, in binary notation, that looks like this:

Code:
$MASK = 0b00000000000000000000000000011111 // Reserved continuous space mask
$G_WHITE = 0b00000000000000000000000000100000
$G_NEWLINE = 0b00000000000000000000000001000000

$L_SPACE = 0b00000000000000000000000000000001
$L_TAB = 0b00000000000000000000000000000010
$L_NL_SINGLE = 0b00000000000000000000000000000011
$L_NL_CRLF = 0b00000000000000000000000000000100

const $SPACE = 0b00000000000000000000000000000001 | 0b00000000000000000000000000100000 = 0b00000000000000000000000000100001
const $TAB = 0b00000000000000000000000000000010 | 0b00000000000000000000000000100000 = 0b00000000000000000000000000100010
const $NL_SOLO = 0b00000000000000000000000000000011 | 0b00000000000000000000000000100000 | 0b00000000000000000000000001000000 = 0b00000000000000000000000001100011
const $NL_CRLF = 0b00000000000000000000000000000100 | 0b00000000000000000000000000100000 | 0b00000000000000000000000001000000 = 0b00000000000000000000000001100100

$NL_CRLF = 0b00000000000000000000000001100100
                                     ^---------- is line terminator
                                      ^--------- is whitespace
                                       ^^^^^---- specific sub-type, not a bitwise flag

As you can see, by splitting the types up I have only used 10 bits for the super types (the $G_ constants). That leaves about 17 bits of space for other purposes. For example, I could encode certain tokens with constant values. Currently the parser has to do extra checks to figure out whether a token is =, versus another kind of token that starts with =, like == === =>. Instead of those string comparisons and extra logic branches, it could get this information straight from the lexer, which already knows it anyway!
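
As a hypothetical sketch of that idea (the ids, names, and shift below are made up for illustration; this is not the parser's actual code):

```javascript
// Hypothetical: tag specific punctuators in the spare high bits.
// $G_PUNCTUATOR and $L_PUNCTUATOR match the listings above; everything
// else here is invented for this sketch.
const $G_PUNCTUATOR = 1 << 10;
const $L_PUNCTUATOR = 14;
const $PUNC_ID_SHIFT = 15; // put the punctuator id in bits 15 and up

// Made-up punctuator ids:
const PUNC_EQ = 1;     // =
const PUNC_EQEQ = 2;   // ==
const PUNC_EQEQEQ = 3; // ===
const PUNC_ARROW = 4;  // =>

function makePunctuator(id) {
  return $L_PUNCTUATOR | $G_PUNCTUATOR | (id << $PUNC_ID_SHIFT);
}

const type = makePunctuator(PUNC_ARROW);
// The parser can now test for `=>` without touching the token's string:
console.log((type >> $PUNC_ID_SHIFT) === PUNC_ARROW); // true
// ...while the group check keeps working unchanged:
console.log((type & $G_PUNCTUATOR) !== 0); // true
```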

Going down that route may make things more complicated, though, since the identity checks could no longer be used without some bitwise masking. But we could work around that; or maybe it's actually faster; or we could put that meta information in another number after all.

Luckily, not using the higher 16 bits may lead to a small perf improvement as well, since the engine could internally represent the field as a 16-bit value rather than a 32-bit one. But that might just be wishful thinking ;)