CFG parser in JS

2010-09-17

Writing parsers is certainly a special olympics kind of thing. A while back I decided I wanted to write a JavaScript or ECMAScript parser. So now I have. At the basis of that parser is a CFG parser. And that's what _this_ post is all about.

So what is a CFG? CFG stands for Context Free Grammar. It is a set of rules that is able to formally de-/construct a language without having to know anything about the semantics (context) of the language you're parsing.

CFG's are often (if not always?) written in a notation called Backus–Naur Form, or BNF. BNF is itself just a set of rules to express rules. Recursion much? Yes!

A CFG is a set of "productions". You can see a production as a single unit, variable, or token. Every production has zero or more productions. But because that's a little confusing I call them rules. Every rule consists of productions I call element or part.

That's basically it... Formally, productions are written like this.

Code: (CFG)
name --> a|b|c

Where name is either a, b or c. In this blog (and the parser ;)) I'm going to write them like this:

Code: (CFG)
: silly
"a"
silly "b"

Let's quickly look at that example. There is a production named silly. Silly has two rules. Either it stands for the string "a" or it stands for a production silly and the string "b".

This production would match any string that consists of a single a. It would also match any string that begins with anything that would match the production silly followed by a b. This is a so called recursive rule.

Recursive rules are important in CFG's. They allow for complex constructions while keeping the defined rule set small. The example above would match a string with a single a or any string that starts with an a followed by any number of b's.

Now I've written a parser to parse CFG's in BNF and generate a parse tree from them. I needed this because I've written an Ecmascript parser that uses a CFG as basis. Furthermore it outputs a nice html where you use links to jump to production definitions. Yay!

The parser is very straightforward. It's only supporting what I needed to (anything required by the Ecmascript specification ;)). It will parse the cfg from left to right. Hardly anything fancy goes on.

The parser supports the following features:

- Named productions that have separate rules with a set of elements. Productions start with a colon followed by their name. All whitespace is ignored.
- Named productions that have a special meaning
-- Any: Succeeds if any (one) of the elements contained within the production matches from the current position
-- Not: Succeeds if the body of the not does not match from the current position
-- Exception: Succeeds if the last match does not match the body of the production
-- Optional: Succeeds regardless of a match
- Comments, prefixed by a hash (#), ignores the remainder of the line
- Named sections, prefixed by dollar sign ($), the first word after $ is the name, the rest is treated like a comment
- String literals, double quoted strings
- Unicode literals, single quoted strings containing four hexadecimal characters
- Escaping, anything prefixed by backslash is escaped

Some examples of the above:

Code: (CFG)
$ A1 Lexical Grammar

: SourceCharacter # any unicode code unit

: MultiLineNotForwardSlashOrAsteriskChar
SourceCharacter e{a{"/" "*"}}

: SingleLineComment
"//" o{SingleLineCommentChars}

: SingleLineCommentChars
SingleLineCommentChar o{SingleLineCommentChars}

: SingleLineCommentChar
SourceCharacter e{LineTerminator}

After the parser parses a CFG it can output a nice html version of it, where you can click on productions to jump to their definitions. Can be handy. The parse tree from the parser can also be used as input for a lexer or any other kind of parser.

The parser will put all parsed productions in a couple of containers. Most important are the orderedParts, prods, used, sections and cfgByName properties. Where cfgByName contains all productions with the name of the production as key. And sections is a double array with the productions ordered by section.

I've created a live demo. I've also created a static output file from that demo.

You can find the script here (27k), or the minified version here (10k).

Hope it helps ya :)