Tag literal syntax for js

2012-03-26

I've been playing with this for over a year now. This seems to me like an extension to the language that would be welcomed by a lot of people. The basic idea is that you get an alternative syntax for data structures. Of course the most important context would be in a browser. The current way of creating and setting up a new DOM element is very verbose. It involves various tediously long methods and an appendChild that doesn't even accept multiple arguments. Ugh! Compare that to raw html, which is pretty concise and to the point.

Bored already? Skip right to the interactive demo at taglit.qfox.nl!

Feasibility



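When you look at the syntax of JS up close, and I have, you'll notice that it's very much possible to incorporate tags in the current JS syntax. It's actually very easy, because the leg work has already been done. Take a look at the regular expression literal. It starts and ends with a forward slash (/). Of course the forward slash is also used for division, so a parser already has to decide from context whether a slash starts a regex or divides two values. The opening angle bracket (<) can get the same treatment: where an operator is expected it's a less-than, and where a value is expected it starts a tag. (In fact, the forward slash is also used for single line comments and multi-line comments. Talk about an overloaded character...)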
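To make that concrete, here's a minimal sketch of such a context check. The helper names (expectsOperand, classify) are made up for this illustration and are not taken from the demo's parser. It only looks at the previous token, which is a simplification: a real JS tokenizer needs more care, for instance after a closing parenthesis or curly bracket.

```javascript
// Hypothetical sketch: decide from context what an ambiguous character means.
// `prev` is the text of the previous significant token, or null at program start.
const OPERAND_KEYWORDS = new Set(["return", "typeof", "case", "in", "delete", "void", "throw"]);

function expectsOperand(prev) {
  if (prev === null) return true;              // start of the program
  if (OPERAND_KEYWORDS.has(prev)) return true; // e.g. `return <div/>`
  // Identifiers, numbers, strings and closing brackets are values,
  // so what follows them is an operator, not a new value.
  return !/^[\w$"')\]]/.test(prev);
}

function classify(ch, prev) {
  const operand = expectsOperand(prev);
  if (ch === "/") return operand ? "regex" : "division";
  if (ch === "<") return operand ? "tag" : "less-than";
  throw new Error("not an ambiguous character: " + ch);
}

console.log(classify("/", "=")); // regex
console.log(classify("/", "x")); // division
console.log(classify("<", "=")); // tag
console.log(classify("<", "x")); // less-than
```

The point is only that the decision for < can piggyback on the decision that already has to be made for /.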

A tag



So what are tags? The syntax of a tag in this context is very, very simple. A tag starts with an opening angle bracket (<). It is followed by a tag name. In this demo, at least, the name simply has to match the regex [a-zA-Z0-9-]+, although this should probably be extended to be in line with a JS identifier (but in conjunction with the dash..?).

The tag name is followed by an optional series of attributes, each with an optional value. The attribute name has the same syntax restrictions as the tag name, and should almost definitely include the dash (-), since this is very common in html. The attribute name and value are separated by an equal sign (=). If the value is absent then the equal sign should be dropped as well (this implies the value true).

Then there are unary and binary tags. The basic difference is that unary tags have no contents (but can still have attributes) and no closing tag. This is signified by a forward slash (/) at the end of the tag. This forward slash helps identify tag depth, both for humans and machines. A very common unary tag in html is the image tag. The /> syntax stems from x(ht)ml and is optional in html. It's not optional in my demo. Binary tags have a body and are closed with another tag that starts with a forward slash. The closing tag is optional in html for some elements, but not in my demo: the parser needs to know when to continue with regular js.

The opening tag is closed with the closing angle bracket (>). So for unary tags, this bracket is directly preceded by optional whitespace and a forward slash.

The closing tag, which only exists for binary tags, starts with an opening angle bracket, followed by a forward slash, followed by the name of the tag to close, and ends with the closing angle bracket. So, </div>.

The body of a (binary) tag can be anything, including the nesting of other tags. A body consists of an ordered list of strings and tags.

In conclusion, a unary tag looks like <image />, a binary tag looks like <div>hello world</div> and attributes look like <div class="foo" editable/>.
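The rules above are small enough to fit in a short recursive-descent parser. What follows is a sketch under assumptions, not the demo's actual parser: the function name parseTag is made up, attribute values are assumed to be double-quoted strings without escapes, and no character encoding is handled. It returns the parsed object plus the position where regular js would continue.

```javascript
// Minimal sketch of the tag grammar; parseTag and its helpers are hypothetical.
function parseTag(src, pos = 0) {
  const ws = () => { while (/\s/.test(src[pos] || "")) pos++; };
  const name = () => {
    const m = /^[a-zA-Z0-9-]+/.exec(src.slice(pos));
    if (!m) throw new SyntaxError("name expected at " + pos);
    pos += m[0].length;
    return m[0];
  };
  const expect = (ch) => {
    if (src[pos] !== ch) throw new SyntaxError("'" + ch + "' expected at " + pos);
    pos++;
  };

  function tag() {
    expect("<");
    ws();
    const node = { name: name(), attributes: {} };
    ws();
    while (src[pos] !== "/" && src[pos] !== ">") {
      const attr = name();
      ws();
      if (src[pos] === "=") {
        pos++;
        ws();
        expect('"');
        const end = src.indexOf('"', pos);
        if (end < 0) throw new SyntaxError("unterminated attribute value");
        node.attributes[attr] = src.slice(pos, end);
        pos = end + 1;
      } else {
        node.attributes[attr] = true; // no value implies true
      }
      ws();
    }
    if (src[pos] === "/") { // unary tag: <name ... />
      pos++;
      ws();
      expect(">");
      node.unary = true;
      return node;
    }
    expect(">");
    node.children = body(node.name);
    return node;
  }

  // A body is an ordered list of strings and tags, ended by the closing tag.
  function body(openName) {
    const children = [];
    let text = "";
    for (;;) {
      if (pos >= src.length) throw new SyntaxError("missing </" + openName + ">");
      if (src[pos] !== "<") { text += src[pos++]; continue; }
      if (text) { children.push(text); text = ""; }
      let p = pos + 1; // look ahead: closing tag or nested tag?
      while (/\s/.test(src[p] || "")) p++;
      if (src[p] !== "/") { children.push(tag()); continue; }
      pos = p + 1;
      ws();
      const closeName = name();
      if (closeName !== openName) {
        throw new SyntaxError("</" + openName + "> expected, found </" + closeName + ">");
      }
      ws();
      expect(">");
      return children;
    }
  }

  const node = tag();
  return { node, end: pos };
}

console.log(JSON.stringify(parseTag("<div>Hello world</div>").node));
// {"name":"div","attributes":{},"children":["Hello world"]}
```

Returning the end position is a design choice for this sketch: it's what lets a surrounding js parser pick up again right after the tag.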

Whitespace



Inside a tag, whitespace is allowed between any part of the opening and closing tag. In my demo, whitespace is anything that the regex /[^\S]/ would accept. Note that this includes newlines; in JS this class accepts exactly what \s accepts. Basically, assume these are tabs, spaces and newlines.

Whitespace in the body of a tag is significant (will be put as is in the strings). In html, whitespace is generally reduced to a single space.
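A small sketch to show the difference between keeping body whitespace as is and what html tends to do with it. The helper collapseWhitespace is hypothetical and only approximates html's behaviour for a single string.

```javascript
// The whitespace class used in the demo: accepts tabs, spaces and newlines.
const WS = /[^\S]/;
const WS_RUN = /[^\S]+/g;

// Hypothetical helper: replace every run of whitespace with a single space,
// roughly the way html would render it.
function collapseWhitespace(text) {
  return text.replace(WS_RUN, " ");
}

const bodyText = "hello\n\t  world ";
console.log(JSON.stringify(bodyText));                     // "hello\n\t  world "
console.log(JSON.stringify(collapseWhitespace(bodyText))); // "hello world "
```

The first line is what the tag literal would store; the second is only there for comparison.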

Demo syntax



It's kind of already explained above, but here's a more to the point syntax description for my demo. You can basically use a tag literal whenever you could otherwise use a number. My intention is that this literal is internally converted to a regular object structure, much like "JSON" would be (but of course it's kind of silly to use JSON as an example). The internal data representation can then be revived to a DOM structure in the browser, or perhaps string output on the server. Or maybe it just serves as another very simple data model, much like JSON.

Let's talk examples.

Code:
var e = <div>Hello world</div>;
log(e); // {name:"div",attributes:{},children:["Hello world"]}

alert(<foo class="5" hidden/>);
// {name:"foo", attributes: {"class":"5", "hidden":true}, unary:true}

<A>oh<B>hello< / B>there<C/>! </A>
// {name:"A", attributes:{}, children:["oh",{name:"B", attributes:{}, children:["hello"]},"there",{name:"C", attributes:{}, unary:true},"! "]}

My demo does this. The parser puts a token in the token object stream that contains a property root (see parser.tokenizer.wtree in console). This root contains the object described above. In the demo, I take this object and convert it to the raw DOM API code required to achieve this structure in the browser.
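As an illustration of that last step, here's a minimal sketch of a generator that turns the object structure into DOM API code. It's not the demo's actual converter: the names generate and emit, the e0, e1, ... variable naming, and mapping a valueless attribute to an empty string are all assumptions made for this example.

```javascript
// Hypothetical generator: object structure in, DOM API source code out.
function emit(node, lines, counter) {
  const v = "e" + counter.n++;
  lines.push("var " + v + " = document.createElement(" + JSON.stringify(node.name) + ");");
  for (const [key, value] of Object.entries(node.attributes || {})) {
    // Assumption: a valueless attribute (true) is set as an empty string.
    const text = value === true ? "" : String(value);
    lines.push(v + ".setAttribute(" + JSON.stringify(key) + ", " + JSON.stringify(text) + ");");
  }
  for (const child of node.children || []) {
    if (typeof child === "string") {
      lines.push(v + ".appendChild(document.createTextNode(" + JSON.stringify(child) + "));");
    } else {
      const c = emit(child, lines, counter); // nested tag: build it first, then append
      lines.push(v + ".appendChild(" + c + ");");
    }
  }
  return v; // name of the variable that holds this element
}

function generate(root) {
  const lines = [];
  emit(root, lines, { n: 0 });
  return lines.join("\n");
}

console.log(generate({
  name: "div",
  attributes: { "class": "foo" },
  children: ["Hello ", { name: "b", attributes: {}, children: ["world"] }]
}));
// var e0 = document.createElement("div");
// e0.setAttribute("class", "foo");
// e0.appendChild(document.createTextNode("Hello "));
// var e1 = document.createElement("b");
// e1.appendChild(document.createTextNode("world"));
// e0.appendChild(e1);
```

Six lines of generated code for one short tag literal is pretty much the verbosity this whole idea is meant to hide.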

Standardisation and E4X



Yeah, so. First up: I've never really used E4X, nor did I dive into the material or specification. I'm under the impression that it consists of much, much more than the simple data model of tags I'm talking about right now. I know at least that it adds a lot of manipulation sugar to the language. But I fear that sugar is too complicated to be attractive.

Another difference is that I'm not sure whether it allows for a reviver of the resulting data model. At least the console of Firefox throws up when I try something simple like var x = <div></div>;. Disappointing :/

Encoding



Haven't really covered that to be honest. It's currently difficult, if not impossible, to show a literal angle bracket in the output (in the demo). Maybe one could go for the html entities to encode these chars. Or since we're in JS, maybe the generic backslash escapes should work. Don't know, haven't given it much thought either. Maybe go with whatever E4X has adopted. It looks like they use character references (the &amp; encoding). Works for me.
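If character references are the way to go, decoding them in body strings could look something like the sketch below. This is purely an illustration: decodeReferences is a made-up name, only four named references are supported, and nothing like this is in the demo.

```javascript
// Hypothetical decoder for a handful of character references in body text.
const NAMED = { lt: "<", gt: ">", amp: "&", quot: '"' };

function decodeReferences(text) {
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (whole, ref) => {
    if (ref[0] === "#") {
      const code = ref[1] === "x" ? parseInt(ref.slice(2), 16) : parseInt(ref.slice(1), 10);
      // Leave out-of-range numbers alone instead of throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : whole;
    }
    // Unknown names are left untouched rather than guessed.
    return Object.prototype.hasOwnProperty.call(NAMED, ref) ? NAMED[ref] : whole;
  });
}

console.log(decodeReferences("1 &lt; 2 &amp;&amp; 3 &gt; 2")); // 1 < 2 && 3 > 2
```

Because the replacement is done in a single pass, an encoded ampersand followed by "lt;" decodes to the text &lt; and not to an angle bracket.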

Demo



Again, you can find the interactive demo on taglit.qfox.nl.

Hope you like it :)