Let's start with looking at how we normalize all the statements. In this post I'll cover all the valid statements in normalized code except for the expressions. We'll see what their normalized form is considered to be and how to get there in a few cases.
Series
This post is part of a mini series describing the normalization steps taken in Preval.
This post covers how the statements get normalized. So let's jump right in!
Block Statement
In JS, a BlockStatement, or a "block", is the name for the {}. These can group an arbitrary number of other statements and declarations, with the caveat of causing function statements (but that's not a risk for us after scrubbing the input).
A block is normalized when it's not nested in another block or global scope and when it contains no hoisting declarations.
Nested blocks should be flattened
Nested blocks are not observable and can't affect anything inside Preval (since all binding names are globally unique). So they are useless and may block detecting certain things for no reason. That's why we flatten them.
Blocks in global code are not allowed
But this is basically the same reason as nested blocks.
Break Statement
The break statement in JS can have two forms: labeled and unlabeled.
A break statement is normalized when it either does not have a label or when the label is not redundant. Additionally, the break should not appear in a tail position where its occurrence does not actually change the semantics of the code.
The unlabeled break is the most common and will stop the nearest loop or switch and continue execution immediately after that statement.
The labeled break is not very common in real code but completely legit. This form will break to the target label, which is parser enforced guaranteed to exist, and continue code execution after the label. This form can not break through function boundaries, but that too is enforced by the parser.
If the label of a break is redundant, it is dropped. In this context, a label is redundant when that label is the direct parent of a loop that would be broken to anyways if the break would have not had a label.
foo: while (true) if (x) break foo;
// Is really the same as
foo: while (true) if (x) break;
(And you can then drop the label because it's no longer used).
Breaks are also eliminated when the semantics of code would be the same without them, like when they are on the tail of a branch.
foo: {
    f();
    break foo;
}
In general you don't see this kind of code very often but it might appear as part an artifact of the IR.
Arguably all breaks and continues should be labeled. Maybe that would make code flow easier to analyze as well and unlabeled breaks are in that sense a subset of labeled breaks. Unfortunately this means introducing more labeled statements with their inconsistency of having syntactic restrictions against loops.
Continue Statement
The continue is similar to a break except it is much more restricted. You can't use it outside of a loop and a labeled continue can only target labels that are direct parents of a loop. These rules are syntactic so the parser enforces them, we don't have to worry about it.
A continue statement is normalized when it either does not have a label or when the label is not redundant. Additionally, the continue should not appear in a tail position where its occurrence does not actually change the semantics of the code.
We eliminate redundant labels and redundant occurrences of continue, just like break.
Export
All exports are already consolidated to the "Named Declaration Export" form in the preamble normalization.
export {foo as bar}
export {foo as bar} from "ding"
The only thing left to do here is that the source string (if present) is converted to a template string. This is not in line with EStree and would be invalid in JS, but we correct for this when serializing the AST so it's just for the IR.
I haven't done enough for import / export because in the end all imports and exports should be eliminated in favor of one giant blob of code that is tree shaken. But that currently doesn't happen.
For-in / For-of Statement
I have not spent a lot of time looking into these two loops. But there are a few things to consider when normalizing them.
Currently, these loops are in normalized form when their header has an identifier to the left and a Simple Expression to the right and a block for body.
Variable declaration
When the header has a variable declaration, it is moved out to the parent block scope. The declaration in the header is then replaced with a regular identifier.
for (const x in y) {}
// ->
let x = undefined;
for (x in y) {}
So this is something I think I should normalize to special case the variable declaration and assignment inside the body. The problem here might be that the special case can end up in an arbitrary place when statements get shuffled, leading to more edge cases rather than fewer. One mitigation for that could be to introduce block headers (similar to func headers). But that seems a bit excessive for this case.
Rhs
The right hand side in the header for each kind of loop must be a Simple Expression. Somewhat surprising, the in does not spy. But the of definitely does as it calls the iterator. But even for for-in it doesn't make sense to allow anything other than a Simple Expression.
Body
The body must be a Block. If it's not then wrap whatever the statement was in a block.
Property as lhs
As I write this, I think I'm leaving the property case alone right now; for (obj.foo in x);. But from the top of my tired head I don't see how we can simplify this much. This would be fixed with the special header case but without it we'd have to introduce a fresh binding and convert it something like this:
for (foo.bar in obj) {}
// ->
let tmp;
for (tmp in obj) { foo.bar = tmp; }
This way the setters are not triggered as part of the header.
Function Expression
The creation of a function is not really observable by itself. Its contents should be normalized though.
A function expression is the only kind of function in normalized form. It is in normalized form when its parameters are identifiers that are named incrementally starting with $$0, have no patterns or defaults, and are converted to Param nodes. Rest parameters are okay since there's no reasonable alternative. The body of a function should start with a "header", which assigns the Param nodes to regular local bindings (whose names reflect the original param names). If the function refers to this or arguments then these are aliased in the header and only these aliases are referenced in the actual body. The header ends with a Debugger statement which must always be present, even if there is no header. All branches of a function body must return explicitly, even if that's undefined.
If a function expression appears as a statement then we drop it. You won't see this in real code but it could be an intermediate step in the IR, like after dropping an assignment which would leave its rhs behind as a statement.
The difference in code between a function declaration and a statement that is a function expression is that the expression is wrapped in parenthesis:
function f(){}
(function f(){});
But as I said, it's very unusual to see this in real code since it's absolutely unobservable. I think.
Function declarations and arrows are converted to function expressions in the preamble phase.
If Statement
This one has a bunch of rules.
Normalized form
Let's start with the normalization form: An if is considered to be normalized when the test expression is a Simple Expression and both the "consequent" and the "alternate" (the "then" and the "else" branch) are a block statement. This state is trivial to achieve on its own.
Swapping branches
Some of the other transform rules in place are around resolving static tests, when the test value is a guaranteed truthy or falsy value, and removing unnecessary boolean inverts. We can safely eliminate boolean coercion (Boolean(foo)) or inverting (!foo) on the test expression since neither spies.
Additionally the if statement semantically does not make a distinction between either branch, aside from the else being optional (which we enforce anyways). So the next two examples are semantically and observably equivalent, making the ! an unnecessary operation.
if (!x) a(); else b();
// ->
if (x) b(); else a();
So when we see this we prefer the second form. In the later phase we'll also detect this when the code is already in normalized state, since the above trick only works safely on input code.
const y = !x;
if (y) a(); else b();
const y = !x;
if (x) b(); else a();
But that requires deeper semantic knowledge to make generic so it doesn't happen in this phase. That trick does kind of make this check redundant.
Static tests
There are also some quick wins, like ~"" or +null, which are guaranteed booleans. But to be fair, these were written in the early stage of development and by now there are more generic superseding rules that will ultimately have the same results. The static expressions get resolved in this phase, anyways, and there is a rule in the next phase that will resolve an if when it detects the test to be a guaranteed truthy or falsy binding. Or when inlining constants, take your pick.
Identical returns
When each branch of the statement only contains one return statement and it returns the same value, the statement can be reduced to an expression statement (for the test) followed by that return statement.
if (x) return 500;
else return 500;
// ->
x;
return 500;
This is a relatively specific trick but could be expanded to be more generic.
Empty body
An if statement where both branches are empty can be replaced by just its test expression.
if (x) {} else {}
// ->
x;
The above looks silly but the test expression should not just be discarded. We aim to retain TDZ and implicit reference errors.
Nested ifs with same test
There is a rule that detects nested if statements that test on the same Simple Expression.
if (x) { if (x) f(); g(); }
// ->
if (x) { f(); g(); }
This transform is semantically the same (with one fewer operation) since the act of testing can't be observed .
Back to back on same test
Similarly, when two back to back if statements test on the same identifier and one branch of the first statement is empty then we safely fold certain cases together.
if (x) {}
else { f(); }
if (x) { g(); }
else { h(); }
// ->
if (x) {
  g();
} else {
  f();
  if (x) {
    g();
  } else {
    h();
  }
}
The idea is that if x is truthy then the second statement invariably invokes the then branch too. We only do this when the then branch of the second statement contains one statement. Otherwise we risk code bloat by deduplication. But in normalized code, we only have to worry about replacing one atomic operation with another.
There is a similar rule for the else branch.
Branches to functions
I wrote a rule that converted each branch of a trailing if into separate functions. But I've disabled it.
function f(){
  if (x) {
    if (y) {
      g();
    } else {
      h();
    }
  }
}
// ->
function f(){
  const A() {
    return g();
  };
  const B() {
    return h();
  }
  if (x) {
    return A();
  } else {
    return B();
  }
}
The idea here was to create smaller functions which I was hoping to reveal other patterns but the trick didn't really help much, risked infinite transformation loops, and was too risky in general. So I disabled it.
Tail hoisting on abrupt completions
When either branch of an if has an "abrupt completion" (a return, throw, break, or continue) then the code that immediately follows the if statement can be moved to the other branch.
if (x) {
  throw "error";
} else {
  f();
}
g();
h();
// ->
if (x) {
  throw "error";
} else {
  f();
  g();
  h();
}
Note that this may not be safe for labeled breaks for input code, but in normalized code the labeled statement must have a block or loop for body, so this rule should be safe to apply for them as well.
Import statements
The various ways to import things are also normalized in the preamble phase.
The statement is in normalized form when it is a named import that uses the alias form and a source string that is a template. (I haven't considered imports without source yet, most likely I would drop all exports and merge toplevel executed code together.)
There are three things we do for imports in this phase:
- Change the source strings to be templates
- Make sure every import statement imports exactly one thing
- Make sure all nodes do an explicit alias, even if it's to the same name
import a, {b} from "c"
// ->
import a, {b} from `c`
// ->
import a from `c`
import {b} from `c`
// ->
import {default as a} from `c`
import {b as b} from `c`
And now we can look at imports and only worry about on specific form, assume they import one symbol.
Labeled statement
A labeled statement is in normalized form when the name is globally unique and when the body is a block where possible.
Sadly, labeled statements are the only exception to the "body must be a block" rule. That's because they have a special rule in the JS syntax: a continue may only target labels explicitly when they are the direct parent of a loop. So this would be an _invalid_ transform:
repeat: while (true) {
  f();
  if (x) continue repeat;
}
// -> (bad!)
repeat: {
  while (true) {
    f();
    if (x) continue repeat;
  }
}
Because now the loop is technically not the direct parent of the loop, so the continue repeat is now invalid.
For this reason we can only enforce the body of a label to be a block when the statement is not a loop. Of course we can still do it if the label is not used by continues.
repeat: while (true) {
  f();
  if (x) break repeat;
}
// ->
repeat: {
  while (true) {
    f();
    if (x) break repeat;
  }
}
Ignoring the redundancy of it, of course.
If the label is not used at all it will be trivially dropped.
Return statement
The normalized form of a return is that it has an argument that is a Simple Expression.
Mandatory argument
Syntactically, the return statement is allowed to omit the argument. That means return statements would need to have two different checks every time we process them. However, JS code cannot detect the difference between a return that returns undefined explicitly or a return that has no argument. So, assuming undefined is immutable (which we do), we can safely transform this statement to return undefined if it didn't have an argument before.
If-return hoisting
When an if is followed by a return, the return is pulled into each branch of the if.
function f(a) {
  if (a) f();
  return a;
}
// ->
function f(a) {
  if (a) {
    f();
    return a;
  } else {
    return a;
  }
}
This should not risk code bloat since the return statement is more of a meta operation and as such and I consider them to be "blanks". The transform happens on the way "back" of the traversal, meaning the argument should have been normalized and so already contain Simple Expressions.
Binding trampoline
When a return statement returns a binding that was created prior to that statement, it's effectively a trampoline. The simple case in this phase can merge them if the init of the declaration was a Simple Expression.
const x = y;
return x;
// ->
return y;
However, the init could also not be a Simple Expression in which case the trick would not apply. Additionally this trick would not be able to detect such a binding if there was another statement in between. So this is something that the next phase should be able to determine.
Throw statement
The throw is syntactically similar to the return. Syntactically the argument is required so that's one fewer corner case to check. The argument must be a Simple Expression.
Other differences are really more relevant for the next phase, where function completion is tracked and throws propagate.
Try statement
The catching mechanism is one that I haven't spent a lot of time on just yet. In this phase there is little we can do.
The normalized form for a try statement can only enforce the finally to be present, since doing so for the catch block changes semantics. And since there's no real advantage here, there's currently no such rule in place at all.
If the try block is empty then the catch block cannot be observed. So if the try is empty and there is no finalizer (or it's empty) then the whole try can be dropped.
try {} catch (e) { importantStuff(); } finally {}
// ->
;
If in such case there is a finally then the catch block can be safely omitted. After which the finally block replaces the whole statement.
try {} catch (e) { stuff(); } finally { a(); b(); }
// ->
try {} finally { a(); b(); }
// ->
a();
b();
So there are also rules that move statements out of a try if they are determined to not be possible to throw. For example, a var declaration can not throw if the init is a literal.
try { const x = 1; } finally { f(x); }
// ->
const x = 1;
try { } finally { f(x); }
// ->
const x = 1;
f(x);
// ->
f(1);
Of course that example is contrived, but it's not unreasonable to get to after applying a bunch of other rules.
Variable Declaration
The normalized form for a var decl is that it declares one binding, either as let or const, and has an init, even if that's undefined.
This way we don't have to do awkward checks in the AST since a VariableDeclaration has one or more VariableDeclarators which each have an optional init (and could even be destructuring, though that's taken care of elsewhere). After normalization, we can safely access node.declarations[0].init and safely assume that this always exists.
Patterns
All patterns are eliminated and this kind is no exception.
let [x, y] = f();
// ->
let tmp = f();
let tmp1 = [...tmp]
let x = tmp1[0];
let y = tmp1[1];
Objects are slightly easier:
let {x, y} = f();
// ->
let tmp = f();
let x = obj.x;
let y = obj.y;
This is actually a one-time transformation since none of our other rules will ever introduce a pattern. But still, it needs to be done as patterns add a layer of complexity that I'm super uncomfortable with keeping.
Init
The init must be an Assignable Expression. This means it can't be an actual assignment, so those get split into a separate statement. Since the identifier read can't trigger spies, that should be fine.
const x = a = b;
// ->
a = b;
const x = b;
If the assignment assigns to a property, you have to cache the rhs because it might be changed by a setter:
let b = 5;
const a = {};
const spy = {set prop(x) { b = 10; }};
const x = spy.prop = b;
// x=5
// ->
let b = 5;
const spy = {set prop(x) { b = 10; }};
const tmp = b;
spy.prop = b;
const x = tmp; // b is now 10, not 5 (!)
Whereas if the rhs is a property lookup you have to cache that first:
let y;
let foo = 0;
const spy = {get prop() { return ++foo; }};
const x = y = spy.prop; // x = y = 1
// ->
let y;
let foo = 0;
const spy = {get prop() { return ++foo; }};
const tmp = spy.prop; // Make sure not to "call" this getter twice
y = tmp;
const x = tmp;
So if you combine them, you cry in a corner:
let foo = 0;
const spy = {
  get prop() { console.log('get'); return ++foo; },
  set prop(x) { console.log('set'); foo *= 10; },
};
const x = spy.prop = spy.prop; // the getter triggers before the setter (!)
// ->
let foo = 0;
const spy = {get prop() { return ++foo; }, set prop(x) { foo *= 10; }};
const tmp = spy.prop; // getter first
spy.prop = tmp; // setter next
const x = tmp; // not spying
Observe that this one is the same as previous, where the rhs is cached and the lhs is not.
While Statement
In normalized form, while is the only regular loop available. Like the if, its test must be a Simple Expression and its body must be a block.
The rules applied to the test are similar to the if test, where it resolves static expressions and guaranteed truthy or falsy values.
Loops do add a level of complexity to the model but they can't be addressed in this phase.
Wrap up
And that's it for all the statements except for the expression statement. The expression statement is really just going to cover expressions so I'm going to cover them in the next blog post. It may be two blog posts but we'll see.
I hope you enjoyed this write up and didn't find it extremely boring or obvious :)