JS unicode regex revisited

2013-07-31

I've posted before about creating huge Unicode regexes that contain every Unicode character that may be part of a valid JavaScript identifier. This time I'm adding Unicode support to ZeParser2 and want to use that script again, but try to minify it a bit. Mathias et al. pointed out to me a while ago that this is easily done and can drastically reduce the 57k script.

The file described in that post contained one regex per category. On top of that, each regex is an alternation of individual characters, separated by a pipe (|). Let's do some brute-force transformation to reduce this file to a few kilobytes.
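To make the goal concrete, here is a small sketch comparing the two forms on a made-up four-character sample (the sample and the variable names are mine, not taken from the original file). It checks that the alternation and the character class accept exactly the same single characters, and how many source characters the class saves.

```javascript
// Hypothetical mini example: the old style (alternation) versus the
// target style (one character class with a range).
var alternation = /\u0041|\u0042|\u0043|\u0044/;
var charClass = /[\u0041-\u0044]/;

// Compare both regexes on every BMP code unit.
var same = true;
for (var cp = 0; cp <= 0xFFFF; ++cp) {
  var ch = String.fromCharCode(cp);
  if (alternation.test(ch) !== charClass.test(ch)) same = false;
}

// Difference in source length between the two spellings.
var saved = alternation.source.length - charClass.source.length;
console.log(same, saved);
```

The saving grows with the length of each consecutive run, which is why collapsing runs into ranges pays off so well on a list of thousands of characters.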

First we'll take all the characters that are currently in that file and put them in an array. This array will only contain strings, one four-digit hex string per character. Starting from the file, this is easily done with simple string replacements. We'll end up with something like ['0041','0042','0043','0044','0045','0046','0047','0048','0049', ... etc. This gives me an array of 7754 elements.
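I did this step with plain string replacements, but the same array could also be collected with a regex. The following is only a sketch: it assumes the old file spells every character as a \uXXXX escape, and both the `source` sample and the `extractCodes` helper are hypothetical names made up for illustration.

```javascript
// Hypothetical stand-in for the old file contents; the real file is much larger.
var source = '/\\u0041|\\u0042|\\u0043|\\u005F/ /\\u0061|\\u0062/';

// Collect the four hex digits of every \uXXXX escape found in the text.
function extractCodes(text) {
  var codes = [];
  var re = /\\u([0-9a-fA-F]{4})/g;
  var m;
  while ((m = re.exec(text)) !== null) {
    // Normalize the case so a later lexicographic sort equals numeric order.
    codes.push(m[1].toUpperCase());
  }
  return codes;
}

var list = extractCodes(source);
console.log(list);
```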

Next we use this simple (though ugly ;)) script to create a new regex:

Code:
// sort lexicographically (that's why the entries are still strings);
// this equals numeric order as long as every entry is a four-digit
// hex string in the same case
list = list.sort();
// build up the regex
var s = '/[';
for (var i=0; i<list.length; ++i) {
  // add next character to the regex, zero-padded to four hex digits
  var n = parseInt(list[i],16);
  var r = n.toString(16);
  while (r.length < 4) r = '0'+r;
  s += '\\u'+r;
  var has = 0;
  // count how many succeeding characters immediately follow n
  while (parseInt(list[i+1],16) == n+1) {
    n = parseInt(list[++i],16);
    ++has;
  }
  // if we found a range of three or more characters, emit it as
  // start-end to reduce file size
  if (has > 1) {
    s += '-';
    r = n.toString(16);
    while (r.length < 4) r = '0'+r;
    s += '\\u'+r;
  } else if (has) {
    // only two in a row: a range would be one character longer than
    // listing both, so step back and emit the second one on its own
    --i;
  }
}
s += ']/';

Note that this is a one-time operation and I'm actually just doing it in the Chrome developer console.
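Since the result is pasted around by hand, a quick sanity check in the same console seems worthwhile. This is a sketch I'm adding, not part of the original one-off script: the small `list` and matching `s` below are hypothetical stand-ins for the real data, and `verify` is a made-up helper name. It turns the generated string back into a regex and confirms that it matches exactly the characters in the list and nothing else in the BMP.

```javascript
// Hypothetical sample input and the string the script above builds for it.
var list = ['0041', '0042', '0043', '005F', '0061', '0062'];
var s = '/[\\u0041-\\u0043\\u005f\\u0061\\u0062]/';

// Returns every code unit where the regex and the list disagree.
function verify(s, list) {
  // strip the leading and trailing slash to get the pattern body
  var re = new RegExp(s.slice(1, -1));
  var expected = {};
  for (var i = 0; i < list.length; ++i) {
    expected[parseInt(list[i], 16)] = true;
  }
  var mismatches = [];
  for (var cp = 0; cp <= 0xFFFF; ++cp) {
    var shouldMatch = expected[cp] === true;
    if (re.test(String.fromCharCode(cp)) !== shouldMatch) mismatches.push(cp);
  }
  return mismatches;
}

var mismatches = verify(s, list);
console.log(mismatches.length === 0 ? 'ok' : mismatches);
```

An empty result means the range compression did not drop or add any character.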

I ended up with a 4.5k string. See this gist.