Overzealous parsing

2013-08-07

Something hit me today: why do I write my parsers to be so uber super duper strict? We encountered an issue today at work where some random script wasn't parsing properly. At first that seemed to be because of random unicode garble, but that turned out to be a different bug; the parse failure itself was caused by the script being rejected by Esprima (the JS parser we use). The other bug was due to falling back to gzipped content, rather than unzipped content. Oops.

Anyways, like I said, the script was rejected by Esprima. Upon closer inspection it was caused by a simple expression: f() = x. Now before you go "but that's not right, right?", no, that's actually kind of right. While you've probably never used it, it has been valid for a long time. In fact, it hasn't even been formally deprecated yet; there's only an acknowledged bug in ES5.1 (note, not ES5). So that means it'll be around for at least ten years, probably much more.

So I won't bore you with details on what/how/when. But you can only assign to expressions that return a so-called "Reference", which is an internal JS type that refers to a memory location (psst, that means variable). While formally this is only returned by a variable and a member expression that ends with a property access (foo.bar or foo[bar]), there are two other cases. One is any number of nested matching parentheses (but inside each such pair only one expression is allowed). This is to make it easier for people to use the typeof and delete (and I suppose void) operators, since they operate on References. The other is a call expression, which is IE legacy. Some ActiveX stuff had methods that returned a Reference. Blargh. But there you go, it was possible, and for a long time valid in JS. Browsers still grok it, of course. As well they should.

Herein lies the problem. See, we need to support generic websites. Websites that might contain this. In fact, websites that do. In our case, there's a jQuery plugin (blargh) called Placeholder (can't find homepage, we found it here) that assigned to a method call under some condition. But since that branch was never executed in non-IE browsers, it was no problem. So if it works in a browser, we need to support it. So our parsers need to support it. Esprima does not. It recently dropped support for call expression assignments. And while that's arguably reasonable, it's very inconvenient. And that made me think about overzealous parsing.

I do it too. In fact, I'm working on it right now. Trying to improve my own parser (ZeParser2) to reject certain for-in statements and assignments that I can detect at compile time to be invalid. So-called early errors. Question is, is that really desirable? I guess there are multiple answers to this question. On the one hand you want to be "pixel perfect". In the case of an IDE or validation, you need to be proper. Feels like parser writers tend to write in this vein: follow the spec perfectly. In the case of an IDE, that makes sense. You want to be notified of issues.

However, another answer could be that you want to be pragmatic and/or generic and/or interpreting. Though generic can still be "pixel perfect", so never mind that. As for pragmatic and/or interpreting, that doesn't have to be perfect. You just have to know what the input means. Take a browser for example. The big reason there are so many quirks is that browsers are so very very lenient on your input. There are so many cases an HTML parser needs to take into account, it's crazy. I've never written a serious one so I don't even know most of them (I'm sure).

Now take two reasons why I've needed a parser the past few years. In one I needed to parse swf files. But these were only swf files that actually already worked in the Flash environment. That means my parser only had to be as perfect as the common Flash implementation. In the case of Flash, that's only one (yeah yeah, not really, but we can safely work as if there is only one). No need to be pixel perfect here. Pretty much undesirable, actually. What if you fail something that actually succeeds in production? Yes, you were correct. But your service is still broken. Job well done? No, just thoroughly.

We're facing a similar issue right now. And while the fix was rather easy (simply make Esprima allow call expressions to be the LHS of an assignment), it made me realize that I should at least put an option in my parser to be lenient in cases that might still pass in some browsers.
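
Such an option could look roughly like the sketch below. This is not Esprima's or ZeParser2's actual code: the function name isAssignmentTarget and the option name allowCallAssignment are made up for illustration, and the nodes are assumed to be ESTree-shaped objects (including a ParenthesizedExpression node, which not every parser emits).

```javascript
// Hypothetical helper, not taken from Esprima or ZeParser2.
// Assumes ESTree-style nodes such as { type: "Identifier", name: "x" }.
function isAssignmentTarget(node, options) {
  // Hypothetical option: accept `f() = x` the way legacy browsers do.
  var lenient = Boolean(options && options.allowCallAssignment);

  switch (node.type) {
    case "Identifier":        // x = 1
    case "MemberExpression":  // foo.bar = 1, foo[bar] = 1
      return true;
    case "ParenthesizedExpression": // ((x)) = 1, any depth
      return isAssignmentTarget(node.expression, options);
    case "CallExpression":    // f() = 1, IE legacy
      return lenient;
    default:
      return false;
  }
}

var callTarget = {
  type: "CallExpression",
  callee: { type: "Identifier", name: "f" },
  arguments: []
};

console.log(isAssignmentTarget(callTarget));                                // false
console.log(isAssignmentTarget(callTarget, { allowCallAssignment: true })); // true
```

The strict behaviour stays the default here, so an IDE or validator gets the spec-tight answer, and something that has to swallow generic websites can opt in to the lenient one.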

For example, I'm currently working on tying down various edge cases regarding the lhs of for-in in ZeParser2. But already I can see that many of the cases that throw an early error in Firefox, are actually runtime errors in Chrome. In fact, the for-statement could be executed and it would still not throw over it...

Code:
for ((a=b) in {});
for ((a=b) in {a:1});

Both cases will throw a syntax error in Firefox and compile fine in Chrome. When invoked (assuming the proper vars), Chrome will only complain about the second case. That's (probably) because it's not evaluating the lhs unless the rhs actually has any properties to assign.
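
A quick way to see which camp a given engine is in is to separate compiling from running. The classify helper below is a made-up name for a small sketch: it compiles the snippet with new Function (declaring a and b up front so a missing variable doesn't muddy the result), then runs it, and reports at which stage an error showed up. The outcome for the two for-in cases depends on the engine and its version, so no expected output is given for them.

```javascript
// Hypothetical probe: tells an early (compile time) error apart from a
// runtime error for a snippet of source code.
function classify(source) {
  var fn;
  try {
    // new Function parses the body immediately, so early errors throw here.
    fn = new Function("var a, b;\n" + source);
  } catch (e) {
    return "early error";
  }
  try {
    fn();
  } catch (e) {
    return "runtime error";
  }
  return "no error";
}

// Engine dependent, see the text above.
console.log(classify("for ((a=b) in {});"));
console.log(classify("for ((a=b) in {a:1});"));

// Sanity checks that should hold in any engine.
console.log(classify("var x = 1;")); // "no error"
console.log(classify("var = ;"));    // "early error"
console.log(classify("null.x;"));    // "runtime error"
```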

Neato, sure. But it causes a discrepancy. I'm guessing Chrome is wrong here, spec wise. But it sure is being pragmatic because really, who cares. Your IDE should care and warn you. Your browser, not so much. Maybe in "strict mode". But let's not open that can of worms... :)

So anyways, the point is, a parser should not be super tightly bound to the spec. Especially if the spec has a variety of implementations and certain parts of it don't have consensus or are a bit ambiguous. It should be the greatest common denominator between those implementations and probably offer options to accept or reject the specific differences. One of those options could be super tight spec mode.

A lesson learned.