Normy JS p1

2021-08-21

When normalizing JS there are a few high level phases to consider. The first one would be a purely syntactical normalization. Things like eliminating one statement in favor of another or making sure branching statements always have a block for a body. These are rules that you can apply without needing to have deep semantic knowledge about the code. It's all about making a statement or expression follow the rules and moving on.

My last post described how I created a preamble normalization phase to tackle some of the one-time normalization rules and eliminate certain cases that I'd rather not have to worry about for the rest of the runtime of Preval. Things like switch, hoisting, and scope tracking (by way of making all binding names unique and not needing to scope track at all).

In the next few posts I intend to outline the other static rules that I apply in the main normalization phase. This phase is followed by a phase of semantic rules. Those assume normalized code and apply certain rules based on semantic and/or typing knowledge of the code, like "a function that is never called with args and doesn't contain arguments does not need to have args". These two phases are repeated until the code is stable.

I'm not quite sure how to model these blog posts because there are a lot of rules. I've written the code such that any such transformation looks like this:

Code:
rule('Empty nested blocks should be eliminated');
example('{ f(); { } g(); }', '{ f(); g(); }');
before(node, parent);

const newNode = AST.emptyStatement();
body.splice(i, 1, newNode);

after(newNode, parent);

And when I count occurrences of rule( in that file it yields 324. Soooooo. They're not all going to be that interesting individually but there's still a lot of ground to cover. And this is just for the "simple static" rules. And then there's about 78 files with dedicated "deeper" rules, some of which contain a bunch of more rules, like the one about type tracking contains 44 occurrences.

I'm just going to group them. Most of the rules are about expressions so maybe we can at least cover the statements quickly. I'm just gonna start with some rules and see how it goes, cut the post when I feel like the post is long enough. Good tactic to bump my ad clicks.

Background


I should add this warning that I have no formal training in building a compiler. I just like solving these puzzles.

It's fair to say I do have some formal training in logic and linguistics (which helps with formal grammars), but my degree is Cognitive Artificial Intelligence, which is really broad but leans more into philosophy. So most of the things I'm going to be outlining here either comes from random bits of that past and from exploring this space myself over the past few years and then the past six months in particular.

I apologize in advance for botching proper terminology when I do. Feel free to let me know on twitter :)

High level


I think it's good to start this series with a high level overview of my normalized model.

Goals


The top goal for Preval was to reduce arbitrary input JS code in such a way that it executed fewer instructions. The second goal is to reduce code size and it is definitely secondary to the execution count.

The goal of the normalized state is to "dumb down" the code. Every statement and expression should have at most one observable side effect. I call this "spying" since "spy" is just shorter than "observable" or "observable side effects". Another target is to eliminate trivial conditionals, like if (ifNode.alternate && ifNode.alternate.type === 'BlockStatement') etc.

By the end we're left with code that is much easier to reason about than arbitrary unbounded JS.

Spies


So what kinds of things spy? Ah. Almost everything with some notable exceptions.

A "spy" can be thought of as any operation that causes a mutation that may be observed by code that is out of the current context. The obvious examples are function calls, explicitly but also implicitly through getters, setters, and coercion. Other examples are property mutations, since those may be observed by a later function call.

Any transformation has this limit where it cannot change the execution order in an observable way. The end program must be identical with only consciously accepted exceptions in place.

Code:
// Consider that Preval does not know the result of this identity function
const a = identity({toString(){ spy('a'); }});
const b = identity({toString(){ spy('b'); }});

isNaN(String(a), String(b));

// ->
// If return value of "isNaN" is returned it can be omitted
// So we create the args as statements as well. But oops, in reverse order.

String(b);
String(a);
// oopsie

Once I dug deeper into this I realized that assignments actually don't need to spy. The act of assigning to an identifier, or creating a new variable, can not trigger other code. That feels obvious. However, the test of if and while, binary expressions with === and !==, inverting with !, Boolean coercion with Boolean(), function expressions, literal values, and regular expressions are all atomic and an expression like that in itself actually can not trigger other code.

Note that arbitrary class expressions, arrays, and objects can have nested expressions that may spy. This is why we recursively normalize those, after which they won't spy and in normalized state, we indeed assume that they do not don't spy when in normalized form.

I did choose to ignore Proxy in my model, since that kind of introspection makes this work almost impossible as almost nothing works as you expect it to. Getters and setters can be worked around, albeit clunky some times.

Additionally, I'm assuming that reading an explicit binding cannot spy, even though it's possible in certain cases by applying .defineProperty on window to create a getter/setter.

Another thing we're assuming is that builtins are not changed. Expandos can be okay, I was working on that near the end. But [1,2,3].toString() should be allowed to assume to call the builtin toString method. In many cases I put this under an option anyways since it's not directly relevant for normalization purposes.

Binary coercion


One thing I intended to but didn't push for yet is to make sure binary expressions don't accidentally trigger two observable side effects by way of coercing both sides. The tricky part here is that the coercion of the plus operator can trigger different methods depending on the value on the other side. And we can't predict whether the .valueOf or .toString of an object might be called since this phase of normalization can't tell whether an operand is actually a string or not unless it's an explicit literal. Spoilers: more than not it is not a literal.

The problem here that the plus coercion has a particular checklist for coercion; When one side is a string, the other side is coerced to a primitive with the "string" hint. This means that it will first try to call .toString() and if that doesn't return a primitive, then it will call .valueOf. (It will be an error if neither return a primitive). The resulting primitive is then converted to a string if necessary, but that part is irrelevant here.

If neither side is a string, then both sides are inevitably converted to a primitive with the "number" hint. This means that the order in which the methods are called is swapped. So you get six cases:

1: a and b are primitives
2: a is not a primitive, b is a string
3: a is not a primitive, b is a primitive that is not a string
4: a is a string, b is not a primitive
5: a is a primitive that is not a string, b is not a primitive
6: a is not a primitive, b is not a primitive

The first one is not observable. Worst case, all the other cases do spy so are unsafe. Note that even the order in which values are coerced is observable in the last case.

In the end I have to concede to allow at most two observable side effects, which worst case (implicitly yet directly) calls four external functions (both .toString and .valueOf for both operands of a binary expression). But even in this case, I consider to have an atomic with two observable side effects since I care more about the atomic, less about how much code might be invoked per atomic. I can't break atomic any further, after all.

Normalized form


With the above in mind the normalized form looks something like this:

This breaks down in four sub categories; statements, assignments, assignables, and non-assignments. These are the building blocks.

Simple Expression


A "simple expression" is either a primitive literal value or an identifier.

This concept is important as almost all other rules refer to it and it makes code real simple to reason about. An atomic expression is either a primitive value or an arbitrary binding. Not a function, array, etc.

Non-assignment Expression


A "normalized non-assignment expression" is rather strict. One of:
- A Simple Expression
- A unary expression with Simple Expression argument
- A non-assignment binary expression with a Simple Expressions to each side
- A call expression of a Simple Expression and an arbitrary set of arguments that are Simple Expressions
- A call expression of a normalized member expression and an arbitrary set of arguments that are Simple Expressions
- A new expression of a Simple Expression and an arbitrary set of arguments that are Simple Expressions

There is no ternary / conditional expression (a ? b : c) in normalized state. Those will be if-else statements.

Likewise, there is no sequence expression since those will be individual statements. The update expressions (++a) are "dumbed down".

Observe that new does not have a member expression form since it is not possible for this to affect the context so there's no reason to keep it. Unlike regular calls where an object method call does affect the context. Potentially.

I didn't implement support for async/await, iterators (yield), and super. That's why they're not mentioned in the above. But I'm sure they have to be considered atomics as well.

Assignable expression


An "normalized assignable expression" is:
- a Simple Expression
- a Non-assignment Expression
- function expression
- normalized class expression (the super class and all computed keys must be Simple Expressions)
- normalized object expression (all computed keys and values must be Simple Expressions)
- normalized array expression (all elements must be Simple Expressions)
- normalized member expression (single property lookup, can not be chained, and computed props must be a Simple Expression)
- this
- super
- arguments
- regular expression literals

This separate definition basically means functions, classes, arrays, object, and regex literals can only appear in plain statements (where they are eliminated for not being observable), assignments, and var inits.

The regular expression is special because in itself it is not observable, but it is an object so we cannot consider it "simple".

Arrow expressions are eliminated in favor of function expressions. To this end, it's the function expressions that get extra love since we isolate any use of this and arguments to aliases in our custom function header. This is why those two expressions are only allowed in assignments and inits.

Member expressions (properties) can not chain, which might blow up the code a little bit, but that's because each lookup may trigger a getter or setter. By forcing each member expression to not be chained, we can simplify reasoning about the code.

Assignment Expression


A "normalized assignment expression" is one of:
- Simple Expression to a property
- Simple Expression to an identifier
- Non-assignment Expression to an identifier
- Assignment Expression to an identifier

Normalized statement


Expression statements can be any Assignment, Assignable, or Non-assignable Expression.
Variable declarations can have any Assignable or Non-assignable Expression for init.
Any statement can only have Simple Expressions when expressions are mandatory.

Other expressions


Any kind of expression not mentioned above gets eliminated and broken down into a set of atomics mentioned above.

Chain expressions (a?.b and all its variations) become regular const tmp = a; if (tmp) tmp.b kind of code.

Logical expressions (like a && b) become regular let tmp = a; if (tmp) tmp = b; kinds of code. These are difficult to eliminate further and will stay around for a long time.

Etc.

IR AST


Ultimately, another name for this normalized state is an "IR", or "intermediate representation". That's a fancy term of saying it's a specific internal form of the code to work with.

Currently the IR for Preval prints as a sub-set of regular JS, since many kinds of statements and expressions are transformed to other kinds.

An IR usually comes with a custom AST.

For now, the AST inside Preval is generally similar to Estree with two exceptions. Function paramters are actually Param nodes and that all string literals are replaced with template literals.

Potential changes


I have considered a few more modifications, though.

The `arguments` reference should be an actual node, to more easily distinct it from an implicit global. That's an unusual edge case but still one to check for.

The for-in and for-of loops should have a special header where the var declaration is similar to Param. That way, the transform would look like from for (a in expr) {} into for (const tmp in expr) { a = tmp;. This would have the advantage of eliminating another exception to binding assignments and a different (better) way of dealing with the odd-ball variable declaration by hiding it. Right now the var decl is pulled out before of the statement.

The Non-Assignment Expression, Assignment Expression, and Variable Declaration should be merged into one node. Arguably just the assign and var but I think both are fine. The Assignment node would look something like { type: 'Assign', kind: 'var' | 'assign' | 'stmt', lhs: null | Simple Expression | Normalized Member Expression, rhs: Assignable Expression }.

Additionally, there's a line of thought to normalize expression statements to be dud assignments to an implicit global _. This would allow me to merge the assignment of a call expression with the statement that is a call expression. Same for all others Non-assignment Expressions.

I would also formalize the function header into a separate block which is limited to assignments of Param, this, and arguments, or to empty statements.

The literals would be simplified into a Primitive node. Maybe the string would still be a separate node, not sure yet. The regular expression would become its own Regex node.

There would be a Coerce node, which is a special call that takes two args; a Simple Expression that is coerced and a string that is the kind of coercion (string, number, plus, etc).

The $dotCall would become an actual node that represents the signal that this was originally a method call. This is important because it means that it should be safe to reconstruct it to one if that turns out to be. That is unlike an actual case of foo.call(a, b) where the a context actually has no methods.

There are probably a few more internal symbols possible that would get a similar treatment.

One idea I haven't crystallized yet is to mark "observability" into the AST somehow. There are many rules in the next phase that want to know whether a statement might spy. Having this worked into the AST feels like a very helpful thing.

Assignment analysis is another important concept to solidify into the AST. Which reads and writes can reach what other writes. This is particularly difficult but doable for things like loops and try-catch. But it's relatively expensive to discover all these lines.

Next up


In the next post I'll go deeper into each statement. And then onwards into the expressions.