I'm trying to normalize code, making it as simple and uniform as possible while maintaining original semantics as much as possible. In this post I wanted to share some thoughts around normalizing a while loop. I know you were waiting for this. Strap in!
Context
Skip this if you're aware of what Preval is doing.
My current pet project is
Preval. It's basically my own spin on what
Prepack was supposed to become; a pseudo JS compiler with static analysis with partial evaluation.
When I started the project I wasn't sure where to even begin. But quickly I realized that I'd have to normalize the code first before even considering how to evaluate it.
Normalization in this context means simplifying the input code, trying to make it as basic and uniform as possible, compiling the input down to as few building blocks as feasible, and trying to consolidate the many different ways of doing the same thing into a single way, even if that sometimes means having more code or losing high level constructs in the process.
For example, we convert a
for
loop to a
while
loop trivially, but we also convert a
switch
to a set of
if-else
range checks which end up so mangled that it's hard to figure out that it was a
switch
originally.
Another example is enforce the code to always have at most observable side effect per line, meaning all expressions must be heavily simplified as it's very common for JS to have multiple actions per line.
Due to the dynamic nature of JS this can be a tedious task for static analysis, though I'm happy to report that I'm getting a lot further than I initially thought.
While loops
For this case I only have to concern myself with regular
while
loops. The
do-while
and
for-loops
are already converted to regular
while
loops with breaks.
There are other transforms that attempt to eliminate labels and labelled breaks by abstracting the code into functions and generating
return
statements rather than a labelled
break
so that's not too much of a concern to me.
Here's a simple example of the
do-while
transform:
do { f(); } while (g());
// ->
loop: while (true) {
f();
if (!g()) break loop;
}
and regular
for
loops:
for (let i=0; i<10; ++i) { f(i); }
// ->
loop: while (true) {
if (i < 10) {
f(i);
i = i + 1;
} else {
break loop;
}
}
I currently have many many of these kinds of rules, some have a small impact, some have a big impact.
Branching
Preval currently has a rule where it will attempt to enforce one branch per function. Truth be told, the reason for doing this is that I vaguely remember that this state has an interesting property for static analysis but I've not given it much thought beyond that. It feels like a good state to aim for :)
This single branch rule works fine for functions without loops but for loops it becomes much harder because it's the only case where the code "jumps back", re-executing the same code in the same context instance. The only other construct in JS that does this are switch defaults, and we eliminate those (
and there it's more about the jump, anyways).
Either way, since we can't safely put nested branching into separate functions we need to find a way to "unnest" them... This works fine for nested
if-else
statements but not for loops. And that's kind of what we're trying to solve in this post:
"What's the best and safest way to normalize loops into a function?"The idea here is that if a loop body is in a function, the function itself can be treated as any other and should not need to worry about it being a loop, which simplifies a bunch of things considerably.
Abrupt completion
An "abrupt completion" in JS is anything that stops the current control flow and returns control to its caller (or whatever). The simple examples here are
return
,
continue
,
break
, and
throw
. Other examples are
yield
and
await
, but those are special and we're not focusing on them just yet.
For a loop this is relevant because if we want to abstract the body of a loop in a fresh function, then abrupt completions need to be propagated in some form and we need to distinguish them from one-another since they have different semantic results.
Due to the current state of Preval, we will only consider to normalize
while
loops that are in the root of a function. So not global, not nested in anything else.
(Due to other normalization in place, the only thing loops can still be nested inside of are other loops, which we peel off one by one.)I'll cover these four statements;
return, break, continue, throw
.
throw
I can ignore the
throw
completion since that will fall through any abstraction I create for it (
provided I don't wrap that in my own try/catch
, of course). So we can ignore these in loop bodies.
return
If a loop contained an "early return" inside of its body then that would stop working when you put the body in a new function;
while (x) { return y; }
// ->
function tmp() { return y; }
while (x) tmp();
And, unlike
break
, this abrubt completion needs to stop the loop, make sure not to invoke code that follows the loop, AND return the value that was supposed to be returned. This last requirement is obvious but makes it less trivial since the body will need to return both the kind of completion as well as, potentially, the actual value to return immediately.
As far as I can see there are two ways to accomplish this and neither are pretty;
- 1: Box the return value into an array/object which signals both the completion type as well as the return value, if any
- 2: Use a closure and put the completion type and return value in there, picking them up after the loop
Both ways are very similar in their approach and I'm actually not sure which of the two would perform better. On the one hand you'd expect the boxing to create extra objects which adds to gc pressure. On the other hand, this pattern is very common in modern JS and modern engines are quite efficient with very short lived objects like that.
Still, I went with a closure. Not in the least because I already normalize patterns away but also because it feels like this leads to code that will be easier to optimize. Property access will always be a second class citizen in JS static analysis, where closures are first class. Or first-and-a-half.
break
A
break
needs to set the completion state and change to a return. The loop should detect that the completion state was a break and stop the loop, but not return the stored value like a
return
would.
continue
A
continue
is the simplest of the three. It changes to a
return
because all it needs to do is stop the current execution of the loop body, which is effectively the abstracted function.
Transform
Let's take a look at the full transformation of a simple counting loop:
// Note that the for is changed into a while loop
function f() {
for (let i=0; i<10; ++i) $(i);
return 100;
}
const r = f();
$(r);
// ->
// Actual current result when running the above through Preval:
let f = function () {
debugger;
let i = 0;
let tmpLoopRetCode = true;
let tmpLoopRetValue = undefined;
let tmpLoopBody = function () {
debugger;
const tmpIfTest = i < 10;
if (tmpIfTest) {
$(i);
i = i + 1;
} else {
tmpLoopRetCode = false;
return undefined;
}
};
let tmpLoopTail = function ($$0, $$1) {
let tmpLoopRetCode$1 = $$0;
let tmpLoopRetValue$1 = $$1;
debugger;
const tmpIfTest$1 = tmpLoopRetCode$1 === undefined;
if (tmpIfTest$1) {
return tmpLoopRetValue$1;
} else {
return 100;
}
};
while (tmpLoopRetCode) {
tmpLoopBody();
}
const tmpReturnArg = tmpLoopTail(tmpLoopRetCode, tmpLoopRetValue);
return tmpReturnArg;
};
const r = f();
$(r);
The transform takes a function with a
while
loop and;
- Inject a completion state variable (
tmpLoopRetCode
)
- Inject a return value variable (
tmpLoopRetValue
)
- Create a fresh function containing the body of the loop (
tmpLoopBody
)
- Create a fresh function containing the code that follows the loop (
tmpLoopTail
)
- Find all
return
,
continue
, and
break
statements in the loop-body-function and convert them
- Every
return
updates the completion state and return value state and change to a
return undefined
- Every
break
updates the completion state and change to a
return undefined
. It does not use the return value
- Every
continue
just changes into a
return undefined
because all that needs to do is exit the function to "continue" the loop
- Replace all the code after the loop (now inside a fresh function) with a return of calling that fresh function
The fresh function for the code after the loop (
tmpLoopTail
) will first check the completion state. If it was for a
return
, then immediately return the value stored in the return value state. Otherwise, we ignore these states and move on to execute the code that originally folllowed the loop. That should suffice to maintain the original semantics.
In the above code, the $$0
stuff are normalized parameters and the debugger
statement signifies the end of my function header, since JS has no such thing. Feel free to ignore them, they make analysis easier.Signals
Ideally the abstracted body would signal an abrubt completion with a simple boolean. However, it needs to tell the loop whether it completed early or not and, if so, whether it was a
break
or a
return
.
However, considering our normalization efforts, it would be highly helpful if we could keep the loop condition "simple", that is, a simple expression without noticable side effects. This is effectively an identifier, a literal, or a negative value on something we can statically determine to be a primitive value (
negative numbers are highly annoying in this context, but let's discuss that in another blog post).
Effectively this means that I'd rather not have the
while
condition be anything other than
true
or an identifier. Not
while (x === true)
nor
while (state === CONTINUE)
, if you get my drift. This includes not negating the value. So it's very tempting to use
0, 1, 2
for state, considering
0
to mean "continue", and doing a
while (!state)
, but then the while condition is no longer "simple" as it's now a unary expression that needs to be abstracted.
So to keep that condition simple, we have to make sure the state can inform us of the three
continue
,
break
, and
return
states with a truthy value for "continue" and falsy values for the others. For now I went with
true
,
false
, and
undefined
. While I _am_ worried about that leading to slightly sub-optimal code, we can always test and improve on this situation later.
Normalized
This means I can now define a normalized loop as:
"A loop with a simple node condition and without branching in its body"And in most cases it will look like this:
while (retCode) loopBody();
return tail();
The "no branching" part is to support inlining functions as, right now, I'm not sure it matters to me whether "flat" code is inside a loop versus inside a function. Although maybe I'll enforce that through closures, anyways. We'll see.
Consistent state
One important part is that the normalization property holds in multiple passes. Otherwise you'd run the risk of an endless loop.
If the condition was not simple then a rule would replace it with
true
and put the old condition as an
if-else
inside the
while
body, leading to the same transformation over and over again.
We could restrict the normalized state to require the
while
to have exactly one statement expression that is a call (etc), but for now there's no real need and this way the function call may still end up being eliminated if other rules reduce its complexity. Work in progress.
Conclusion
This one was annoying to me. It was necessary to normalize the loop because it prevented nested branching from being normalized safely. But normalizing the loop means adding some more boiler plate and just hoping that other rules eliminate most of that away again. Some do, some won't. It's hard to predict whether it's a net positive, although I'm hoping so.
In the end I hope that any abstraction that's introduced by Preval and left after crunching it down can be compiled again into higher level logic. But I'm well aware this is far from trivial in almost any case. Moar puzzles!