Normalizing while loops

2021-04-05

I'm trying to normalize code, making it as simple and uniform as possible while maintaining original semantics as much as possible. In this post I wanted to share some thoughts around normalizing a while loop. I know you were waiting for this. Strap in!

Context


Skip this if you're aware of what Preval is doing.

My current pet project is Preval. It's basically my own spin on what Prepack was supposed to become; a pseudo JS compiler with static analysis with partial evaluation.

When I started the project I wasn't sure where to even begin. But quickly I realized that I'd have to normalize the code first before even considering how to evaluate it.

Normalization in this context means simplifying the input code, trying to make it as basic and uniform as possible, compiling the input down to as few building blocks as feasible, and trying to consolidate the many different ways of doing the same thing into a single way, even if that sometimes means having more code or losing high level constructs in the process.

For example, we convert a for loop to a while loop trivially, but we also convert a switch to a set of if-else range checks which end up so mangled that it's hard to figure out that it was a switch originally.

Another example is enforce the code to always have at most observable side effect per line, meaning all expressions must be heavily simplified as it's very common for JS to have multiple actions per line.

Due to the dynamic nature of JS this can be a tedious task for static analysis, though I'm happy to report that I'm getting a lot further than I initially thought.

While loops


For this case I only have to concern myself with regular while loops. The do-while and for-loops are already converted to regular while loops with breaks.

There are other transforms that attempt to eliminate labels and labelled breaks by abstracting the code into functions and generating return statements rather than a labelled break so that's not too much of a concern to me.

Here's a simple example of the do-while transform:

Code:
do { f(); } while (g());

// ->

loop: while (true) {
f();
if (!g()) break loop;
}

and regular for loops:

Code:
for (let i=0; i<10; ++i) { f(i); }

// ->

loop: while (true) {
if (i < 10) {
f(i);
i = i + 1;
} else {
break loop;
}
}

I currently have many many of these kinds of rules, some have a small impact, some have a big impact.

Branching


Preval currently has a rule where it will attempt to enforce one branch per function. Truth be told, the reason for doing this is that I vaguely remember that this state has an interesting property for static analysis but I've not given it much thought beyond that. It feels like a good state to aim for :)

This single branch rule works fine for functions without loops but for loops it becomes much harder because it's the only case where the code "jumps back", re-executing the same code in the same context instance. The only other construct in JS that does this are switch defaults, and we eliminate those (and there it's more about the jump, anyways).

Either way, since we can't safely put nested branching into separate functions we need to find a way to "unnest" them... This works fine for nested if-else statements but not for loops. And that's kind of what we're trying to solve in this post:

"What's the best and safest way to normalize loops into a function?"

The idea here is that if a loop body is in a function, the function itself can be treated as any other and should not need to worry about it being a loop, which simplifies a bunch of things considerably.

Abrupt completion


An "abrupt completion" in JS is anything that stops the current control flow and returns control to its caller (or whatever). The simple examples here are return, continue, break, and throw. Other examples are yield and await, but those are special and we're not focusing on them just yet.

For a loop this is relevant because if we want to abstract the body of a loop in a fresh function, then abrupt completions need to be propagated in some form and we need to distinguish them from one-another since they have different semantic results.

Due to the current state of Preval, we will only consider to normalize while loops that are in the root of a function. So not global, not nested in anything else. (Due to other normalization in place, the only thing loops can still be nested inside of are other loops, which we peel off one by one.)

I'll cover these four statements; return, break, continue, throw.

throw


I can ignore the throw completion since that will fall through any abstraction I create for it (provided I don't wrap that in my own try/catch, of course). So we can ignore these in loop bodies.

return


If a loop contained an "early return" inside of its body then that would stop working when you put the body in a new function;

Code:
while (x) { return y; }

// ->

function tmp() { return y; }
while (x) tmp();

And, unlike break, this abrubt completion needs to stop the loop, make sure not to invoke code that follows the loop, AND return the value that was supposed to be returned. This last requirement is obvious but makes it less trivial since the body will need to return both the kind of completion as well as, potentially, the actual value to return immediately.

As far as I can see there are two ways to accomplish this and neither are pretty;

- 1: Box the return value into an array/object which signals both the completion type as well as the return value, if any
- 2: Use a closure and put the completion type and return value in there, picking them up after the loop

Both ways are very similar in their approach and I'm actually not sure which of the two would perform better. On the one hand you'd expect the boxing to create extra objects which adds to gc pressure. On the other hand, this pattern is very common in modern JS and modern engines are quite efficient with very short lived objects like that.

Still, I went with a closure. Not in the least because I already normalize patterns away but also because it feels like this leads to code that will be easier to optimize. Property access will always be a second class citizen in JS static analysis, where closures are first class. Or first-and-a-half.

break


A break needs to set the completion state and change to a return. The loop should detect that the completion state was a break and stop the loop, but not return the stored value like a return would.

continue


A continue is the simplest of the three. It changes to a return because all it needs to do is stop the current execution of the loop body, which is effectively the abstracted function.

Transform


Let's take a look at the full transformation of a simple counting loop:

Code:
// Note that the for is changed into a while loop
function f() {
for (let i=0; i<10; ++i) $(i);
return 100;
}
const r = f();
$(r);

// ->

// Actual current result when running the above through Preval:
let f = function () {
debugger;
let i = 0;
let tmpLoopRetCode = true;
let tmpLoopRetValue = undefined;
let tmpLoopBody = function () {
debugger;
const tmpIfTest = i < 10;
if (tmpIfTest) {
$(i);
i = i + 1;
} else {
tmpLoopRetCode = false;
return undefined;
}
};
let tmpLoopTail = function ($$0, $$1) {
let tmpLoopRetCode$1 = $$0;
let tmpLoopRetValue$1 = $$1;
debugger;
const tmpIfTest$1 = tmpLoopRetCode$1 === undefined;
if (tmpIfTest$1) {
return tmpLoopRetValue$1;
} else {
return 100;
}
};
while (tmpLoopRetCode) {
tmpLoopBody();
}
const tmpReturnArg = tmpLoopTail(tmpLoopRetCode, tmpLoopRetValue);
return tmpReturnArg;
};
const r = f();
$(r);

The transform takes a function with a while loop and;
- Inject a completion state variable (tmpLoopRetCode)
- Inject a return value variable (tmpLoopRetValue)
- Create a fresh function containing the body of the loop (tmpLoopBody)
- Create a fresh function containing the code that follows the loop (tmpLoopTail)
- Find all return, continue, and break statements in the loop-body-function and convert them
   - Every return updates the completion state and return value state and change to a return undefined
   - Every break updates the completion state and change to a return undefined. It does not use the return value
   - Every continue just changes into a return undefined because all that needs to do is exit the function to "continue" the loop
- Replace all the code after the loop (now inside a fresh function) with a return of calling that fresh function

The fresh function for the code after the loop (tmpLoopTail) will first check the completion state. If it was for a return, then immediately return the value stored in the return value state. Otherwise, we ignore these states and move on to execute the code that originally folllowed the loop. That should suffice to maintain the original semantics.

In the above code, the $$0 stuff are normalized parameters and the debugger statement signifies the end of my function header, since JS has no such thing. Feel free to ignore them, they make analysis easier.

Signals


Ideally the abstracted body would signal an abrubt completion with a simple boolean. However, it needs to tell the loop whether it completed early or not and, if so, whether it was a break or a return.

However, considering our normalization efforts, it would be highly helpful if we could keep the loop condition "simple", that is, a simple expression without noticable side effects. This is effectively an identifier, a literal, or a negative value on something we can statically determine to be a primitive value (negative numbers are highly annoying in this context, but let's discuss that in another blog post).

Effectively this means that I'd rather not have the while condition be anything other than true or an identifier. Not while (x === true) nor while (state === CONTINUE), if you get my drift. This includes not negating the value. So it's very tempting to use 0, 1, 2 for state, considering 0 to mean "continue", and doing a while (!state), but then the while condition is no longer "simple" as it's now a unary expression that needs to be abstracted.

So to keep that condition simple, we have to make sure the state can inform us of the three continue, break, and return states with a truthy value for "continue" and falsy values for the others. For now I went with true, false, and undefined. While I _am_ worried about that leading to slightly sub-optimal code, we can always test and improve on this situation later.

Normalized


This means I can now define a normalized loop as:

"A loop with a simple node condition and without branching in its body"

And in most cases it will look like this:

Code:
while (retCode) loopBody();
return tail();

The "no branching" part is to support inlining functions as, right now, I'm not sure it matters to me whether "flat" code is inside a loop versus inside a function. Although maybe I'll enforce that through closures, anyways. We'll see.

Consistent state


One important part is that the normalization property holds in multiple passes. Otherwise you'd run the risk of an endless loop.

If the condition was not simple then a rule would replace it with true and put the old condition as an if-else inside the while body, leading to the same transformation over and over again.

We could restrict the normalized state to require the while to have exactly one statement expression that is a call (etc), but for now there's no real need and this way the function call may still end up being eliminated if other rules reduce its complexity. Work in progress.

Conclusion


This one was annoying to me. It was necessary to normalize the loop because it prevented nested branching from being normalized safely. But normalizing the loop means adding some more boiler plate and just hoping that other rules eliminate most of that away again. Some do, some won't. It's hard to predict whether it's a net positive, although I'm hoping so.

In the end I hope that any abstraction that's introduced by Preval and left after crunching it down can be compiled again into higher level logic. But I'm well aware this is far from trivial in almost any case. Moar puzzles!