JS Regex Unicode Categories

2010-02-21

For a project I needed to parse characters from specific Unicode categories. While this page suggests using the \p syntax for that, it is not supported by the JavaScript regular expression engine (not according to the spec, anyway). So the next best alternative is to create our own regexes for them...

Note that there's now an updated version, see this post

Whilst it seems simple to create them, some work is involved. This post will tell you how you can do it quickly...

Firstly, we need a website displaying all the characters per category. Luckily fileformat.info has very extensive lists concerning this need. However, we need that list as a regex. And no, it won't just do that for you.

So while you can start doing this yourself, I should perhaps warn you that there are a few thousand characters in the Unicode standard. Shall we just use YQL instead? Finally an interesting project to use it for :)

We use YQL to filter out the codes and get them back as a list. The query goes something like this:

Code: (YQL)
select content
from html
where
url="..."
and xpath='//a'
and href like '/info/unicode/char/%'
and href like '%/index.htm'

This fetches the document and selects all the <a> tags. It then checks whether each link's target matches a specific URL pattern. YQL only supports LIKE with a leading or trailing % sign, so we have to check twice: once for the start of the href and once for the end. Finally, the query only returns the text from each tag, because that's what we're interested in.
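If you'd rather see that double check in JavaScript, here is a minimal sketch of the same filter applied to an in-memory list. The `links` array and its entries are made-up sample data (not real output from the site), and `labels` is just a name picked for this example.

```javascript
// Made-up sample data standing in for the <a> tags of the fetched page.
const links = [
  { href: "/info/unicode/char/005f/index.htm", text: "U+005F" },
  { href: "/info/unicode/category/Pc/index.htm", text: "Pc" },
  { href: "/info/unicode/char/203f/other.htm", text: "other" },
];

// Same idea as the two LIKE conditions: one check on the start of the
// href, one check on the end. Only the link text is kept.
const labels = links
  .filter(
    (link) =>
      link.href.startsWith("/info/unicode/char/") &&
      link.href.endsWith("/index.htm")
  )
  .map((link) => link.text);

console.log(labels); // [ 'U+005F' ]
```

Only the first entry passes both checks, which is exactly why the query needs both conditions.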

The response is a list of Unicode code points (in U+ notation).

Now, to format this list into a regex, go to your favorite text editor, paste the list and replace all the ","U+ instances with |\u. Clean up the result and make sure it starts and ends with the slashes that delimit a regex literal in JavaScript.
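The search-and-replace step can also be scripted. This is only a sketch, assuming the labels are already available as an array of strings like "U+005F" with exactly four hex digits; `labelsToRegex` is a hypothetical helper name, not something from the script linked in this post.

```javascript
// Hypothetical helper: turn labels such as "U+005F" into an alternation
// of \uXXXX escapes and compile it with the RegExp constructor.
// Assumes every label has exactly four hex digits.
function labelsToRegex(labels) {
  const source = labels
    .map((label) => label.replace(/^U\+/, "\\u"))
    .join("|");
  return new RegExp(source);
}

// A few labels taken from the Connector Punctuation example.
const sample = ["U+005F", "U+203F", "U+2040"];
const pc = labelsToRegex(sample);

console.log(pc.source);    // \u005F|\u203F|\u2040
console.log(pc.test("_")); // true
console.log(pc.test("a")); // false
```

Using the RegExp constructor instead of a literal means you don't have to add the surrounding slashes by hand.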

With the example you should end up with this snippet:

Code: (JS Regex)
/\u005F|\u203F|\u2040|\u2054|\uFE33|\uFE34|\uFE4D|\uFE4E|\uFE4F|\uFF3F/

And that's the regex :)
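A quick sanity check never hurts. The snippet below is a small sketch that runs the regex against a few characters; the variable name `connectorPunctuation` is only chosen for this example.

```javascript
// The Connector Punctuation (Pc) regex from this post.
const connectorPunctuation =
  /\u005F|\u203F|\u2040|\u2054|\uFE33|\uFE34|\uFE4D|\uFE4E|\uFE4F|\uFF3F/;

console.log(connectorPunctuation.test("_"));      // true  (U+005F)
console.log(connectorPunctuation.test("\u203F")); // true  (U+203F)
console.log(connectorPunctuation.test("-"));      // false (hyphen is not in Pc)

// The regex is not anchored, so it also finds a match inside longer text.
console.log("foo_bar".search(connectorPunctuation)); // 3
```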

You can find some examples here (56k script..). That script contains the regexes for the Unicode categories "Uppercase letter" (Lu), "Lowercase letter" (Ll), "Titlecase letter" (Lt), "Modifier letter" (Lm), "Other letter" (Lo), "Letter number" (Nl), "Non-spacing mark" (Mn), "Combining spacing mark" (Mc), "Decimal number" (Nd) and "Connector punctuation" (Pc). Bonus question: Do you see what parser I'm writing? ;)

Now, these regexes could be improved by combining ranges together. That's not needed for my project so I didn't.
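For anyone who does want the shorter form, here is one way it could be done. This is a sketch, not what the linked script does: it assumes the code points are available as numbers no larger than 0xFFFF, and `toEscape` and `toCharacterClass` are hypothetical helper names.

```javascript
// Hypothetical helper: format a code point (<= 0xFFFF) as a \uXXXX escape.
function toEscape(codePoint) {
  return "\\u" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical helper: collapse runs of consecutive code points into
// ranges and wrap everything in a single character class.
function toCharacterClass(codePoints) {
  const sorted = [...new Set(codePoints)].sort((a, b) => a - b);
  const parts = [];
  let i = 0;
  while (i < sorted.length) {
    let j = i;
    // Extend the run while the next code point is exactly one higher.
    while (j + 1 < sorted.length && sorted[j + 1] === sorted[j] + 1) {
      j++;
    }
    parts.push(
      j === i
        ? toEscape(sorted[i])
        : toEscape(sorted[i]) + "-" + toEscape(sorted[j])
    );
    i = j + 1;
  }
  return new RegExp("[" + parts.join("") + "]");
}

// The Connector Punctuation code points from the example above.
const pcCodePoints = [
  0x005f, 0x203f, 0x2040, 0x2054, 0xfe33,
  0xfe34, 0xfe4d, 0xfe4e, 0xfe4f, 0xff3f,
];
const pcClass = toCharacterClass(pcCodePoints);

console.log(pcClass.source);
// [\u005F\u203F-\u2040\u2054\uFE33-\uFE34\uFE4D-\uFE4F\uFF3F]
```

For a small category like Pc the gain is modest, but for the letter categories, which contain long consecutive runs, the difference should be much bigger.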

Please note that JavaScript regexes don't support Unicode escapes beyond 0xFFFF, so these categories are still missing a few characters... ;(
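One possible workaround, which I haven't applied to the script above: JavaScript strings are sequences of UTF-16 code units, so a character beyond 0xFFFF is stored as a surrogate pair, and that pair can be matched as two \u escapes in a row. The sketch below uses U+1D7CE (MATHEMATICAL BOLD DIGIT ZERO, a decimal digit) as the example; `toSurrogateEscapes` is a hypothetical helper name.

```javascript
// Hypothetical helper: turn a code point above 0xFFFF into the two
// \uXXXX escapes of its UTF-16 surrogate pair.
function toSurrogateEscapes(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // high (lead) surrogate
  const low = 0xdc00 + (offset & 0x3ff); // low (trail) surrogate
  const hex = (unit) => "\\u" + unit.toString(16).toUpperCase();
  return hex(high) + hex(low);
}

const escapes = toSurrogateEscapes(0x1d7ce);
console.log(escapes); // \uD835\uDFCE

// The pair matches as a two-unit sequence.
const boldZero = new RegExp(escapes);
console.log(boldZero.test(String.fromCodePoint(0x1d7ce))); // true
console.log(boldZero.test("0"));                           // false
```

Keep in mind that this only works as a sequence (so as one branch of an alternation). Inside a character class the two escapes would be treated as two separate characters.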

Hope it helps you!