Defying some logic

2014-07-30

The elimination of && and || is a deceitfully difficult one. Yet I needed to do this as part of a bigger transformation. In the end, this elimination of expression logic turned out far more difficult than the other transformation. Or at least more work. I nailed it though. This post explains how, what, and in part why.

Let's start with the why. To create the streaming parser I needed to slice out certain function calls from pseudo-arbitrary expressions. It was necessary because I needed to return conditionally and you can't return mid-way an expression in JS. You could force an exception but it's really not the same :) I say pseudo-arbitrary because I only needed to take my source code into account, my coding style, and my usage of the language. That meant some liberty in skipping some awkward stuff like with, for-in and eval. Though I'm certain you can do it for them as well, I see no reason why not. The reason you need to remove logic is because the alternative is being way more semantic about partial expression rewrites.

For example:

Code:
foo(f() && g());

If I wanted to slice out the g(), I'd also need to slice out the f() because I need to make sure g() happens conditionally on the return value of f().

Ok so the goal is to take fairly arbitrary JS source and eliminate all usages of && and ||. Let's take a look at where they might occur. Obviously they can only occur inside an expression. But unfortunately those can occur in many places:

Code:
foo || bar;
f(a || b);
if (a || b) c;
while (a || b) c;
for (a || b; a || b; a || b) c;
for (c in a || b) d;
switch (a || b) {
case a || b:
a || b;
default:
a || b;
}
do {} while(a || b);
return a || b;
var c = a || b;
for (var c = a || b;;);
with (a || b) c;
eval('a || b');
a || b ? a || b : a || b;

As said, since I'm targeting my own source code I can make certain assumptions. The source code of zeparser has no with, no eval, doesn't do important stuff in case expressions, nor in ternary operators, has no for-in, and can ignore variable declarations inside regular fors. Additionally, I don't have to worry about function inception cases; I only process the body of functions with a certain fingerprint and these functions do not contain other functions. This means we don't have to deal with a couple of parsing problems which make a generic solution much more work, though still not impossible.

In my approach I'm using my parser, which only gives me a stream of tokens. I'll concede immediately that an actual parse tree may have helped me here in some cases. But I'll also note that it would not have made the job a breezer.

It's funny. When I first looked at this problem it seemed so trivial to me. Just slice out stuff between logic ops, turn them into individual statements. Rinse and repeat. Let's take a look at why it's not that trivial.

In my approach I will rewrite all occurrences of && and || to respective if blocks. This is where an AST may have made my life a bit easier because now I have to infer the expression ranges between the ops manually AND deal with operator precedence as well. Though precedence is actually quite simple in this case. Some contrived single op examples:

Code:
a && b;
// =>
var A = a;
if (A) A = b;
A;

a || b;
// =>
var A = a;
if (!a) A = b;
A;

These are the base cases. They do exactly the same as the ops do. Now take a look at the cases with mixed ops, one each:

Code:
a && b || c;
// =>
A = a;
if (A) A = b;
if (!A) A = c;
A;

a || b && c;
// =>
A = a;
if (!A) {
A = b;
if (!A) A = c;
}
A;

(a || b) && c;
// =>
A = a;
if (!A) A = b;
if (A) A = c;
A;

a || (b && c);
// =>
A = a;
if (!A) {
A = b;
if (!A) A = c;
}
A;

a && b || c;
// =>
A = a;
if (A) A = b;
if (!A) A = c;
A;

(a && b) || c;
// =>
A = a;
if (A) A = b;
if (!A) A = c;
A;

a && (b || c);
// =>
A = a;
if (A) {
A = b;
if (!A) A = c;
}
A;

Here's where we see operator precedence at work. The rules are actually quite simple. We only care about two ops and for those, one is stronger than the other. I consider the && weaker than the ||. I can process all && until a ||. After that I must be careful to only continue evaluation if the preceding didn't return a truthy value.

Parenthesis obviously do affect this a bit, but as long as we treat everything inside parens as one expression, we only have to worry about the end result and let recursion do its thing. In the above, there are two cases where this changed the translation; wrapping the ||. Let's look at how we should translate the group with many logic ops:

Code:
a && b && c || d && e && f;
// semantically, there's no difference if we group the &&'s or not

a || b || c && d || e || f;
// in this case, the rules do change:
A = a;
if (!A) A = b;
if (!A) {
A = c;
if (A) A = d;
}
if (!A) A = e;
if (!A) A = f;
A;

(a || b || c) && (d || e || f);
// =>
A = a;
if (!A) A = b;
if (!A) A = c;
if (A) {
if (A) A = d;
if (!A) A = e;
if (!A) A = f;
}
A;

Note how the second group was put in a block. This way it's only evaluated if the first group is truthy, as it should. Note that the first expression would get the same group treatment, but I've left it out since a group has no effect unless it's a sub-statement.

The same example for && is not so interesting since in theory, the output remains the same. One might say that in our case, we consider consecutive &&s implicitly grouped. Of course we still have to walk through groups the same way regardless.

It's this "block grouping" that makes it interesting but also a tad difficult. If we encounter a group we have to start a new block. We must also close this block once we encounter the end of the group. In conjunction with the precedence of || we get an interesting play of curly brackets that's sure to make you lose track of what I'm explaining :) So, let's just look at more examples. Keep in mind that you're a dumb computer and can't be smart about this. Even worse; you're traversing the code left to right, mostly, and do this translation in one pass.

Code:
((a && b) || c) && d;
// => (manual translation)
{ // outer group
{ // inner group
A = a;
if (A) A = b;
} // inner group
if (!A) A = c;
}
if (A) A = d;
A;

My approach goes through each token in serial though. If an expression contains multiple logic ops, it won't know about it. Yet it has to properly translate each op such that the end result gives us a correct translation like above. Including the groups and operator precedence. This is where it becomes tricky, and I think an AST would only get you so far.

Whenever we encounter a group, we add one layer of curly brackets. This ensures that the group is indeed executed as a group if it is the right-hand-side ("RHS") of some logic op. Which op isn't even important. If it isn't the RHS of either op, it'll add a bogus block layer. We could eliminate that, but there's no need since it should have no effect whatsoever.

Whenever we encounter a &&, we treat the LHS as one sub-expression in the same "group", regardless of it actually being a group. For us, grouping some expressions with && is the same as not grouping them at all. The || is the only divider here. Basically, we translate && to } if (A) {. This also means we need to start any expression with a {, or this translation is bound to fail.

When we encounter a || we need to close the current series of &&s as if it were a group. We need to check whether the condition has NOT yet been fulfilled and only proceed if this is the case. We translate || into }} if (!A) {. The first bracket closes a bracket that may have been left open by a &&, the second closes a bracket that may have been left open by ||. This means each translated expression has to start and end with at least two opening brackets.

There's an interesting situation now where we start with two opening curly brackets. They maybe closed and reopened by either logic op. However, what happens if we start with a group and encounter ||? We'll just close the group unintentionally. Oops. To prevent this we'll start with an extra layer of brackets and add an extra layer of brackets to each logic op. This way we know that if we close two blocks for ||, we won't accidentally close an open group. It's kind of funny to think of groups this way when transforming JS. Many other languages would have made this a lot harder when a block has more impact like scoping.

Now that we know how to transform most expressions (to cover the scope of my target), let's look at some statements. The problem with expressions in statement headers is similar to that of just expressions; you can't very well return from a statement header without throwing. This is why some statements have to be rewritten. The scope of my project limits it to if, while, do, and return. I don't use fors at all in target functions of this project, switch headers and case expressions don't contain code I care about. Pretty much the same for other statements actually.

Code:
if (a) b;
// =>
{ A = a; if (A) b; }
// looks kind of silly but keep in mind that `a`
// may be any complex logic expression

if (a) b;
else c;
// =>
{ A = a; if (A) b; else c; }
// this translation works with any statement `b`
// and `c`, applied recursively

while (a) b;
// =>
while (true) {
A = a;
if (!a) break;
b;
}
// note the extra layer of brackets for the sub-statement

do a; while (b);
// =>
do {
a;
A = b;
if (!A) break;
} while(true);
// or =>
while (true) {
a;
A = b;
if (!A) break;
}
// since the loop condition is always true, there's no
// longer a difference between `do` or `while`

return a;
// =>
{ A = a; return A; }
// again, note that `a` may translate into multiple statements

var a = b;
// =>
var a;
A = b;
a = A;
// Looks silly, but consider multiple vars and multiple complex
// initializers. Doing it like this makes that all go away :)

for
// none of my fors have logic ops in their header so we ignore them

The translations are fairly straightforward if you've done anything in this space. They simply allow elimination of the logic ops while maintaining execution semantics. The rewrites for other statements or structures is similar.

You may only run into some complexity for case expressions. These could be circumvented by a bound function, though. It would have access to the outer scope without worrying about new variables since it's still just a case expression.

Since my code doesn't contain functions targeted code I can do simple forward scans to see if I need to translate an expression at all. While I indiscriminately translate loops (this makes my next transformation easier since I can ignore the infinite loops), I can skip it for the rest if I see no logic ops in the range. This would not have been as trivial if I had to worry about local functions. But now I can do a plain text search.

The A var that I mention in my examples actually stands for an unused fresh variable. For most cases, this can just be A since it'll only be used for one (original) statement. However, in some cases one statement may have multiple sub-expressions to transform. Each expression will need its own variable, for example with a function call. In that case, at least in my code, I simply walk through the alphabet. In my code it's safe to use single upper case characters as fresh variables, though it's trivial to prefix them with some random string if need be.

Expressions actually also have some pitfalls. One is function calls. For them I need to make sure I stop processing the current expression at the comma, which at this point is not part of an "expressions" but an argument separator instead. This is different from a comma in, say, a statement header. So while parsing is similar, this is a very important difference. Each argument in a function call will need its own free variable. This is actually one place where this becomes an issue if not taken care of. Each argument of the function call is evaluated in order unconditionally, though. So we simply put them in front of the statement containing the function call, wrapped in a new block and the call itself ends up being something like foo(A,B,C,D).

The other expression pitfall is assignments. When processing an expression left to right, we should only replace the actual value of an assignment. This makes my next processing step a lot easier as well. This means that a = b or even a = b = c should only have their right-hand-most expression replaced. The operator precedence comes to our aid here since assignments are nearly the strongest operator out there. On top of that, left to right once an operator has been seen that's not an assignment no more assignment op may validly be encountered. This helps because we can now maintain an index that states where the current expression starts. If we encounter an assignment, update the start to after this token. When we slice out the expression we start slicing at this token since the rest can not be important to us. Note that logic ops are not expressions :)

That's the most important part of the transformation. What's left are some cases of defensive transforming. For example, in translations involving a to if (!a) we actually do if (!(a)) in case a is a comma expression or assignment or something like that. I do cleanup some of these cases if they are trivial in a post-processing step. There's also some more excessive bracketing for similar reasons.

Using the above techniques I can properly transform the zeparser build to a build without && and || (at least in places I where I don't want them). You can find the entire script on github. If you want to run it against some arbitrary code, you can just take that file, require it, and pass some JS to the function. It will return the translated code, or throw an error if the code is invalid or breaks certain assumptions.

Note that it's not very generic, as outlined above. It's tailored to the source code of ZeParser2 and is likely to fail on generic projects simply for using the for statement, if nothing else. I have no need for a generic approach and it therefor feels like a waste of time; I already know it's doable :)

Ohwell. Hope you liked this geekout episode. Till next time, when I talk about changing a regular ZeParser2 build into a streaming build that supports multiple layer yielding :D Hint: it needs this delogic-fying ;)