Simpler JS

2021-08-18

When I started Preval I wanted to create this pseudo compiler that would take arbitrary JS and spit out the same program, but faster. It would eliminate useless code artifacts that only served the developer and added overhead to the runtime. This process starts with normalization. Let's talk a bit about that.

Normalization


Normalization doesn't really tell you much without context so let me explain what I mean by it in the context of Preval:

Normalized (JS) code is code that does one thing per statement. This thing could be observable, like calling a function or triggering a "getter". But in the end, no more than one thing. This is harder than it seems because there are so many things that have (potentially) observable side effects, mostly due to coercion and getters/setters.

Like most languages, JS expressions tends to be quite packed. It's easy to forget just how many different things a single statement, or even expression, might trigger.

Code:
obj.foo(x + 1);

This little example can directly trigger not one but three (!) different function calls. And that's assuming reading global bindings won't trigger side effects (which they could in some cases, but I do ignore that). Consider a full program like this:

Code:
let obj = {get foo(){ spy(1); return spy(2); }};
const x = { valueOf(){ spy(3); } };
obj.foo(x + 1);

The evaluation order of the last statement breaks down like this:

Code:
const a = obj; // Cache it because any spy may change it
const b = obj.foo; // Trigger spy(1), y is spy(2)
const c = x + 1; // Triggers spy(3) due to coercion
call(b, a, c); // "Call function y with context x and arg z", which triggers spy(2)

The caching of obj may seem excessive, but consider that it is valid for either of the two spies to update the binding before we call the actual function with the object and we want to make sure we retain the original value.

Reasoning


The reason for wanting normalization is because I was hoping that down the line it would simplify transformations. There are so many edge cases in JS and so many ways to do the same thing that having to guard against all different ways of doing the same thing felt like something that would be very difficult to keep under control.

On the other hand, if every line of code was very simple, straightforward, and only did one thing, then the transformations would be relatively simple and you wouldn't need to worry about how many different things may "observe" (spy) your changes. And that's exactly what you have to prevent. If a transform can't be observed then the code effectively has the same semantics.

The downside is that you lose higher level semantics. For example, a switch delivers a high degree of semantics but is riddled with awkward rules:

Code:
switch (A) {
case B: {
C
break
}
case D:
case E:
let x;
case F:
x = 1;
default:
G
break;
case H:
x;
}

So to name a few, because this is not an exhaustive list: A will evaluate your expression. This step is not observable by itself (cannot trigger getters or coercion). However the expression can be arbitrary.

Step B will do a strict comparison. Strict comparisons are not observable. I can do A === B once or ten times and you would be none the wiser. But B can still be an arbitrary expression in itself.

Then D, E, F fall through. Note that x is created in one case and initialized in the fall-through. But if A === F you would end up with a TDZ error because x did not get initialized. Whether or not it fell through is even irrelevant here.

After that, look at the default. Midway the switch. It's like a one-time loop around that is entered if all previous _and_ future cases failed to match.

And at the end if A === H it still ends up trying to access x, although that will always lead to a TDZ error.

That's too much. Too many expression parts to take into account. Odd rules about bindings, strict or weak comparisons, implicit fall through paths. Ugh. And you may think, "well, that's just an artificial case". Yeah, of course. But the fall through case is quite common. You're inevitably going to run into the binding footgun. And the mid-way default? Well, I'm sure you'll find a colleague who did it just because they could.

Goal


The goal for this normalization was to come up with an IR that would make analyzing and transforming source code easier. To this end I wanted to get to as few atomics as possible by removing a lot of high level or duplicate constructs. And while the high level end goal was to reduce executed code, some of these transforms required injecting more code to simulate what the construct did as well. The idea here is that we could always transform it back. Obviously that's more of a pipedream.

Preval consists of two normalization steps. One that only runs once and takes out some of the gnarly things that it would never introduce itself. The second is normalization that it will keep running after each iteration of all tricks.

I resisted having this separate normalization step for a long time. In the end I added it anyways because I needed to re-parse the source code at least once. The overhead for parsing isn't so bad. Tenko is pretty fast :D

Pre-normalization


This first phase involves parsing the code in its original form (using Tenko), analyzing the AST, transforming out a few pieces, and returning a new AST.

The step removes the following concepts:

- Hoisting
- Function declarations -> function expressions
- Arrows -> function expressions
- Function parameters get changed into a special custom node, Param and are assigned in the function header.
- Switch -> labeled if-else in various degrees of complexity
- Consolidate various import and export statements to use one kind for imports and one for exports
- Class declarations -> class expressions
- Do-while -> while
- For-loop -> while
- Strings -> templates (to the extreme; the IR is technically invalid because templates appear in some cases where only strings are allowed)
- Convert all numbers to decimals, where possible
- Drop statements that are just an ident. Don't remember why that happens here, actually. Maybe early code.
- Replace all occurrences of this and arguments with an alias
- Make sure all labels are unique. We only need to do this once since we will only introduce fresh labels.
- Make sure all binding names are unique. We only need to do this once as well since we will only introduce fresh names.
- Tagged templates -> regular calls

Should probably discuss a few of these in more detail;

Functions


So all function types get consolidated to a function expression. This way I don't have to check for three kinds of functions, each with their own semantics and rules, but only one. The function expression is the one that can do everything that a function declaration can do and does not have any restrictions like function declarations (block scoped) or arrows (restricted expression) do. So it's the perfect middle form.

All functions in normalized state have a so called "function header". The header is separated from the body with a debugger statement. I picked that because it's a valid statement in JS and one that's not likely one that I need to keep around. I could replace it with a custom node or some symbolic string. That doesn't matter much. Actually, now that I think about it, a string with some dashes may have been a nice form as well. Oh well.

The function header contains assignments of params and aliases of this and arguments, if they are used at all. By consolidating them into the header I don't have to worry about special casing them anywhere else in the program. They are regular bindings when actually read.

The parameters become a special AST node called Param. I did this because they represent a special form of assignment that I wanted to eliminate. They implicitly get assigned their value when the function gets called. By creating a special abstraction that explicitly assigns the actual value in the header, the code is more uniform and easier to reason about.

Any parameter that is not a regular identifier is moved out of the function parameter list and aliased. The exception is rest, since that has no real (viable) alternative. But other things like patterns and defaults are all handled in the function body after this phase.

The function header of a function that uses all these features looks something like this:

Code:
function f(a, b) {
g(this, arguments);
}

// Becomes:

let f = function ($$0, $$1) {
const tmpPrevalAliasThis = this;
const tmpPrevalAliasArgumentsAny = arguments;
let a = $$0;
let b = $$1;
debugger;
g(tmpPrevalAliasThis, tmpPrevalAliasArgumentsAny);
};

The parameters would be fixed to this $$123 naming scheme. And they are not identifiers in the AST so they are never visited when walking the AST. In that regard, the Param is just a value node like ThisExpressin is.

Identifiers


Most of the JS code is about identifiers. In particular, binding names and label names. We can ignore other kinds, like property names. Or well, maybe just because I haven't gone that far in.

To make transformation and analyzing the easiest I made sure to make all binding and label names unique. Not having to worry about naming collisions or whether one foo actually referred to another foo just takes away a huge pain when doing these transforms and checks. In my goal I wasn't too concerned about maintaining the original names anyways so this was an easy decision.

Ultimately this transform is trivial for labels. The parser enforces that all labels can be reached and are valid and such. It's a different story for identifiers.

Identifiers may be implicit globals. Some are built in, like Array or parseInt or undefined. Others are implicit, like window or require. And some are keywords, like null or true, although those don't end up as an Identifier node, anyways.

When Preval decides on new names to pick for identifiers it has to follow three rules. One is that the name should not have appeared in the current code so far. Two is that the name may not collide with a known implicit global or built-in. Three is that implicit globals are never renamed. After this phase we go through all idents and make sure none collide with all the implicit globals we ended up finding.

Since I use Tenko I could make sure that Preval received all the scoping information that Tenko already knows about anyways. Scoping is deceptively complex to get correct and I'm happy not to have to worry about it after the normalization.

So after pre-normalization, I can assume that all binding names are unique and never worry about var or lexical scoping. Never worry about hoisting. Never worry about which binding an ident refers to. In fact, I can keep a global registry about identifier names and that will suffice for tracking each binding. And that's exactly what I did.

A very contrived example (destructuring patterns; your own little hell):

Code:
{ let a = 1; }
({x: {y: {z: a}}} = 1);
{ let a = 1; }

// Becomes:

{
let a$3 = 1;
}
({
x: {
y: { z: a },
},
} = 1);
{
let a$1 = 1;
}

So as you can see, all bindings that have the same name get a unique suffix.

Loops


There are five loop constructs in JS: while, do..while, regular for, for..in, and for..of.

The regular for and while loops are effectively the same. I transform the for away because the while contains one expression and that fits nicely with the normalization rules.

The do..while loop is changed to a regular while loop as well and basically has two forms. When the test is a simple node, like an identifier, the transform gets wrapped in an if testing on that condition. If the test is complex then we transform to a more complex loop that starts with true and updates to the actual testing value after that. Other rules will still try to turn that into a better form later, but it may not succeed.

The reason why not all loops end up in a simple form is that the expression may contain any arbitrary expression. Including assignments, destructuring, functions, and a range of complex expressions. When I started I was very reluctant to duplicate expressions in the AST. Near the end I'm a little more relaxed about that since I know most duplication can be eliminated and either way should not introduce more runtime execution. But still.

Code:
do {
$(1);
} while ($(2));

// Becomes

{
let tmpDoWhileFlag = true;
while (tmpDoWhileFlag) {
{
$(1);
}
tmpDoWhileFlag = $(2);
}
}

And

Code:
for ($(1); $(2); $(3)) {
$(4);
}

// Becomes

{
$(1);
while ($(2)) {
{
$(4);
}
$(3);
}
}

The other two for loops can not be eliminated. They are considered atomic. The for-in does something that I can't safely transform into something else. And even if I did, it wouldn't necessarily be better. So I kept it. I didn't look too deep into for-of because it uses iterators. I think you can transform it to a regular loop etc, but it feels like that would also introduce more overhead than its worth. And since I had to keep the other form, it was relatively low effort to keep them both.

One choice I kind of regret and haven't fixed yet is to forcefully move bindings out of for-headers. I often considered making it a special case to have a binding in for-headers, like function parameters have, and reassign them in the body. I still think that's what I should do but it's too much effort. I haven't invested a lof of time in these for-loops yet, anyways. For now, the for-headers are considered a special type of assignment to the lhs. And more often than not, bindings that use them are ignored when looking for transformations to apply.

Imports and exports


There are a couple of syntactical ways to import and export code with JS. But ultimately you only need one for each. Just like everything else, having one means fewer edge cases to consider. And while ESM seems trivial in this context on the surface, there's still plenty of footguns to take care of.

Take a look at the export variation:

Code:
export let a = 1;
export const b = 2;
export var c = 3;
export function f() {}
export class X {}

let g = 1;
export {g};

let h = 2;
export {h as i};

export default function(){};

When I run this through Preval, the pre-normalization step ends up like this:

Code:
let c = undefined;
let f = function () {
debugger;
};
let a = 1;
export { a };
const b = 2;
export { b };
c = 3;
let X = class {};
export { X };
let g = 1;
export { g };
let h = 2;
export { h as i };
const tmpAnonDefaultExport = function () {
debugger;
};
export { tmpAnonDefaultExport as default };
export { c };
export { f };

It converted everything to named exports, even the default. I think I wrote a blog post about that one, actually.

The hoisting one was particularly nasty initially. Kinda caught me off guard. And note that exported bindings are live, except for functions and classes. There's also special naming behavior for default exports. It's wild.

Since we support ESM, I also made an early decision that I would assume module goal, and therefore strict mode. This also allowed me to cut down a few edge cases like the with statement, bless you.

Strings to templates


One change I considered a while before actually doing was to convert all strings to templates in the IR.

There are three different strings types in JS: single quoted, double quoted, and template strings. And in a way, tagged templates are actually a special kind on top of that (_they allow invalid escapes which none of the other forms allow_).

By consolidating all strings to templates I figured I needed to maintain fewer transformation rules. And for Preval and the IR it didn't matter that templates were not valid in object expression keys ({"foo": x} is valid but {`foo`: x} is not).

I modified the printer to take care of these edge cases and always print a regular string for object keys, class member keys, and import path names. So it's really an internal representation.

Ready for the main normalization


After the initial normalization Preval is now ready for the main normalization phase. That's a lot more fun.

But honestly this post is already pretty big so I'm gonna stop here and (maybe ...) continue with the normalization in a next post.