Normy JS p3

2021-08-27

The main thing left to discuss in the first phase is expressions. These are also the most versatile since there are many different ways of achieving the same goal. This post will describe the expressions in some detail and how to get them to a normalized form.

Series


This post is part of a mini series describing the normalization steps taken in Preval.

In this post I discuss normalizing expressions in the first phase.

Three base cases


In this model we consider four types of expressions:

- Simple Expressions (ident or primitive node)
- Trivial Expression (Simple Expression, unary, binary, new, or call expression)
- Assignable Expression (Trivial Expression, pretty much any atomic value node)
- Assignment Expression (simple property with Simple Expression, or ident with Assignable Expression)

In a way, all expressions can be modeled as an assignment. In this way we have three cases to consider:

- A regular Assignment Expression
- A variable declaration; the init must be an Assignable Expression
- An expression statement; the expression must be an Assignable Expression

This last one could be considered as an assignment to void, or whatever. Then in all three cases we assign a value.
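
For example, here is what each of the three cases looks like once normalized (a sketch; f() stands in for any Assignable Expression):

Code:
x = f();       // a regular Assignment Expression
const y = f(); // a variable declaration; the init is an Assignable Expression
f();           // an expression statement; the "assignment to void" case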

I didn't realize this was the case when I started out so there's quite a bit of duplication of logic in Preval right now. If I were to start over I'd form the IR such that I can do away with this duplication.

Normalizing to expression or statement


When I started out I wasn't sure what the normalized code would end up looking like. I wasn't sure whether you could transform all expressions to statements, or all statements to expressions, without keeping at least some duplication. But it turns out that, yes, you can normalize everything to atomics that don't duplicate each other.

Eliminated expressions


This means that we can drop a few types of expressions. On the one hand we lose out on some high level concepts and on the other we have fewer edge cases to worry about. And the idea is that, in theory, we could reconstruct these high level constructs again when printing the final output. Not that I ever bothered to try ;)

Ok, so let's begin with expressions that we eliminate.

Sequence expressions


A sequence is a list of comma-separated expressions. Sometimes also called the comma operator.

A sequence is a single expression and its value is the last one in the sequence. The others are evaluated and then ignored. This means that they are statements. And that's what we'll transform them into.

It's important to note that this is only possible once you can guarantee that all expressions can be normalized to statements in the first place. But we do.

Code:
f(), g(), h();

// ->

f();
g();
h();

Creating new statements is not a challenge in itself, but there are limitations on creating new nodes while traversing, so it can be tricky. And as I mentioned before, in some statements it is harder to normalize an expression than in others.

Let's take a look at the simple case of an if test:

Code:
if (a(), b()) {}

// ->

a();
const tmp = b();
if (tmp) {}

This approach works for any statement that we don't eliminate entirely. It's challenging for switch cases, for example. But we eliminate those generically. The for-in/of rhs can also be somewhat tricky. But those can all be fixed.

Logical expressions


This is the term for the && and || operators. We eliminate them in favor of regular if-else statements. It's actually a bit of a problem later on, as we might see in a later post covering the second phase.

Code:
if (a && b || c) f();

// ->

const tmp = a && b || c;
if (tmp) f();

// ->

let tmp2 = a && b;
let tmp = tmp2 || c;
if (tmp) f();

// ->

let tmp2 = a;
if (tmp2) tmp2 = b;
let tmp = tmp2 || c;
if (tmp) f();

// ->

let tmp2 = a;
if (tmp2) tmp2 = b;
let tmp = tmp2;
if (!tmp) tmp = c;
if (tmp) f();

Observe that the number of operations is equal and the statements are semantically identical to the logical expressions. But we have two fewer edge cases to worry about.

Conditional expressions


Also known as "ternary expression", the a ? b : c operator.

Just like the logical expressions we'll replace all of these with if/else logic.

Code:
f(a ? b : c);

// ->

const tmp = a ? b : c;
f(tmp);

// ->

let tmp = undefined;
if (a) tmp = b;
else tmp = c;
f(tmp);

We could initialize the fresh variable to b or c immediately but then we would be evaluating that identifier before the condition. Additionally, we'll need an else branch regardless, so it doesn't actually save us a branch.

That said, it does help with typing to not have to assign a different type first, so it might actually be better to compile in a dud "identifier statement" to trigger the TDZ/unknown global errors (if they were to trigger at all), just so we can initialize the fresh variable to one value or the other.

It's an interesting choice, actually, since some patterns are easier to detect if the assignment is in either branch while other patterns are actually easier to detect when one branch is empty.

Update expressions


This would be the ++x code. Both prefix and postfix cases are replaced with regular assignments.

Code:
f(++a);
f(--b);
f(c++);
f(d--);

// ->

a = a + 1;
f(a);
b = b - 1;
f(b);
const tmp = c;
c = c + 1; // May trigger a spy through coercion, but the spy can't change the outcome of this assignment
f(tmp);
const tmp2 = d;
d = d - 1; // May trigger a spy through coercion, but the spy can't change the outcome of this assignment
f(tmp2);

It's a little trickier when the argument is a property, since it may trigger a getter and/or a setter if the property has them.

Code:
f(++obj.foo);
f(obj.bar++);

// ->

const tmp = obj.foo + 1;
obj.foo = tmp;
f(tmp);
const tmp2 = obj.bar;
obj.bar = tmp2 + 1;
f(tmp2);

For concrete (contrived) examples:

Code:
const spy = {
  get foo(){ console.log('get'); return 10; },
  set foo(x){ console.log('set', x); },
};
++spy.foo; // "get". "set 11". -> 11

const spy2 = {
  get foo(){ console.log('get'); return 10; },
  set foo(x){ console.log('set', x); },
};
spy2.foo++; // "get". "set 11". -> 10

The assignments do mutate the property but the only thing that influences the assigned and returned value is the getter.

Arrow expressions


We transform arrows to functions. Since arrows are the ones that can't have a this or arguments binding of their own, and since we normalize all functions to consolidate actual access to these symbols in their "header" (taking arrow contexts into account), we should not need to worry about their occurrence. As such, at this point we should be able to safely convert any arrow to a regular function expression.
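
A rough sketch of the idea (ignoring the rest of the function normalization for a moment; tmpThis is just an illustrative alias name):

Code:
const f = function () {
  const tmpThis = this; // `this` was already aliased in the enclosing function's header
  const g = (x) => tmpThis.h(x); // the arrow body only uses the alias
  return g;
};

// ->

const f = function () {
  const tmpThis = this;
  const g = function (x) { return tmpThis.h(x); };
  return g;
};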

Chain expressions


This is the term for the "conditional chaining operator" (?.), also known as optional chaining. It is used to short-circuit an expression and prevent an error when the value before it evaluates to null or undefined.

To be honest, while the operator is easy to use for the straightforward cases, once you get into the nitty gritty I do have to look up the actual semantics for complex chains.

Either way, we get rid of these. We compile them down to regular if-else logic and never look back.

There are three cases of optional chaining:

- Regular property (obj?.foo)
- Computed property (obj?.[foo])
- Call (a?.(foo))

Either way, we have to walk the chain. Here are some examples:

Code:
a?.b;
// ->
const tmp = a;
if (tmp) a.b;

a?.b.c;
// ->
const tmp = a;
if (tmp) a.b.c;

a?.();
// ->
const tmp = a;
if (tmp) a();

a?.b();
// ->
const tmp = a;
if (tmp) tmp.b();

a?.b?.();
// ->
const tmp = a;
if (tmp) {
  const tmp2 = tmp.b;
  if (tmp2) tmp2.call(tmp);
}

a?.b?.c?.();
// ->
const tmp = a;
if (tmp) {
  const tmp2 = tmp.b;
  if (tmp2) {
    const tmp3 = tmp2.c;
    if (tmp3) tmp3.call(tmp2);
  }
}

a?.b.c.d?.();
// ->
const tmp = a;
if (tmp) {
  const tmp2 = tmp.b.c;
  const tmp3 = tmp2.d;
  if (tmp3) tmp3.call(tmp2);
}

a?.b.c.d();
// ->
const tmp = a;
if (tmp) {
  tmp.b.c.d();
}

a.b.c.d?.();
// ->
const tmp = a.b.c;
const tmp2 = tmp.d;
if (tmp2) tmp2.call(tmp);

As you can see, retaining the context for the optional method call is the most annoying.

The computed cases are almost identical to the regular property. You just have to also normalize the property.
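
For example, the computed variant just needs the key to be aliased as well, and only once the chain is known to continue:

Code:
a?.[f()];

// ->

const tmp = a;
if (tmp) {
  const tmp2 = f();
  tmp[tmp2];
}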

Await and yield


The special keywords await and yield are not covered in my model. I didn't want to bother with them when I started out and so I've chosen to ignore them for the time being. When they are used, Preval will just throw an error.

Reflecting on it now, I think they can both be considered an external call during which anything can happen. Just like with a regular (unknown) function call. This means properties that can be mutated may be mutated. Any local binding that is accessible to another function should also be considered potentially mutated. Etc.

So perhaps it's not going to be that difficult to implement them if that time ever comes.

Unknowns


There are a few expressions that I haven't really even considered at all. Mostly because I haven't gotten there yet or because they're not commonly used in the first place.

One example is super, because I haven't done much yet with objects, let alone constructors. Another is MetaProperty (that would be import.meta or new.target). Just haven't bothered yet so not sure what edge cases they have.

Patterns


There are three kinds of patterns:

- parameters
- binding
- assignment

We want to eliminate them all because they introduce too many edge cases to consider, versus their normalized atomic form.

On a slightly higher level, we have two kinds of patterns that can nest in themselves or each other and which are transformed recursively:

- Array pattern
- Object pattern

In the case of arrays we have to spread the value being destructured because that may trigger custom iterators. Only if it's a known value, like an array or a set, can we drop the spread when the result is not accessed.
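
As an illustration of the general shape (a sketch; the actual temporary names differ, and object patterns use property reads rather than a spread):

Code:
const [a, b] = f();

// ->

const tmp = f();
const tmp2 = [...tmp];
const a = tmp2[0];
const b = tmp2[1];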

Some of the patterns are already dropped in the preamble phase. Some can't be because they depend on other transforms that have a chicken-egg dependency and must therefore happen in this phase.

Tagged templates


This one is tricky but we do eliminate them. The edge case here is that tagged templates are allowed to contain invalid escape sequences while regular templates are not. This means that there is literally no proper way to decompose tagged templates if that rule is important.

I chose to ignore it and decompile a tagged template into a regular call, anyways. This transform happens in the preamble phase.
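
Roughly like this (a sketch; in reality the strings array also carries a .raw property, which is exactly the detail that can't always be reproduced faithfully):

Code:
tag`a${x}b`;

// ->

const tmp = ["a", "b"];
tag(tmp, x);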

Expressions to keep


That was not an exhaustive list but those are the relevant ones to discuss.

What follows is the list of expressions that we keep and how we normalize them.

Identifier


An identifier that is a statement can be eliminated. Sometimes.

In particular it can be eliminated when it is an assumed builtin (undefined, Infinity, NaN).

The problem is "TDZ" or "Temporal Dead Zone". In short, this is when a lexical binding is read before it is defined.

Code:
function f(x) { if (x) y; }
f(false); // ok
f(true); // TDZ crash
let y;
f(true); // would be fine now

Most of the time it should be fine to remove the identifier statement for explicitly declared bindings. Unfortunately we can only prove in a small subset of cases that TDZ is impossible. When the binding is used in multiple scopes, it can be impossible to automatically determine whether or not a TDZ error will happen, or to guarantee that it won't. For this reason it's unsafe to eliminate these statements most of the time, even for explicit bindings.

Param


A param that is a statement is safe to delete since we introduce them. They are remnants of assigning the parameter to a local binding, when that binding was eliminated. There is no risk of TDZ so these are always fine to drop.

Literal


Any literal that is a statement is safe to drop. So that includes regular expressions.

I also normalize the form of literals when possible, by eliminating escapes from strings and using decimal notation for numbers (when possible).
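
For example:

Code:
const a = "\x41\u0042";
const b = 0x10;

// ->

const a = "AB";
const b = 16;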

Function expression


These were already covered in the previous post. So let me quickly recap (with a rough sketch after the list):

- parameters must be Simple Expressions
- parameter idents are replaced with Param nodes that assign to local bindings in the header
- all funcs have a header
- if the func uses this, it is aliased in the header
- if the func uses arguments, it is aliased in the header
- the header must always end with a debugger statement, even if it's empty
- implicit returns are not allowed, every completion must be explicit (return or throw)
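
Putting that together, a normalized function looks roughly like this (an illustrative sketch only; the real IR uses dedicated Param nodes and generated names rather than the $$0, tmpThis, and tmpArgs shown here):

Code:
const f = function ($$0) {
  let x = $$0;               // Param assigned to a local binding
  const tmpThis = this;      // only present if the function uses `this`
  const tmpArgs = arguments; // only present if the function uses `arguments`
  debugger;                  // marks the end of the header
  const tmp = tmpThis.g(x, tmpArgs.length);
  return tmp;                // every completion is explicit
};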

Call expression


There are basically two forms: method call and ident call.

All parts of the call expression must be Simple Expressions. The exception is that the callee of the call may be a simple member expression (property that gets called). This may be dynamic and in that case the property must be a Simple Expression, too.

All args must be simple, so any arg that isn't simple yet gets aliased. In that case, and only in that case, all previous args and the callee must also be aliased. The idea is that any of the complex expressions might alter the previous simple expressions. So to prevent this from messing with the original semantics, we need to cache their values. That action is not observable so we should be fine. Some of these caches will be redundant and we can eliminate them later.

Code:
foo(a, a = b);

// ->

const tmp = a;
a = b;
foo(tmp, a);

Additionally, when any of the remaining args is complex, also cache everything up to that point immediately. If you don't, you'd be caching your caches. Not a big deal, but also not necessary.


Ident call


This is the regular foo(a, b, c) call expression case.

Some built-ins that are static operations are resolved here. Like resolving isNaN("Foo").

One point of annoyance is that if builtins are called with too many parameters then we must still preserve these args. Even if the builtin ignores them.

Code:
String(a, b());

// ->

a;
b();
String(a);

And here you see the annoying part; we have to copy both args as statements to retain TDZ semantics, since a is evaluated before b.

When the call can be resolved the call is replaced with its actual value.

Code:
f(Boolean([]));

// ->

f(true);

If the call contains a spread then we must keep it. The arg is still aliased and normalized. But the spread will stay until we can safely resolve it. Which is almost never.

If the call contains spreads in argument positions that would be ignored by the builtin function then we compile them as array spreads as statements.

Code:
isNaN(a, ...b);

// ->

a;
[...b];
isNaN(a);

This way iterators are still properly invoked, even if their results are ignored. Once we can determine that the iterator is builtin, or doesn't spy, we can still eliminate that. But it's a different rule now.

Some builtins explicitly coerce their arg, like parseInt and Boolean. The result of these functions is irrelevant since they're guaranteed to be a primitive. However, they may trigger coercion spies. So in that case the call is changed to an explicit coercion which is then subject to special rules.
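
For instance, a call like String(x) might end up as an explicit coercion (a sketch; the 'string' hint name is my assumption here, the post only shows the 'default' hint further down):

Code:
const a = String(x);

// ->

const a = $coerce(x, 'string');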

Function constructor


There is some tentative support for calling Function. Mainly for jsfuck cases and other obfuscation. The use of Function is more than likely to break since all variable names are mangled. But it works for the obfuscation cases so why not support it under a flag.

Code:
const f = Function("alert('hi');");
f();

// ->

const f = function() { alert('hi'); };
f();

The code will try to compile an actual function expression if it can resolve all the args to a primitive.

RegExp


Much like Function, the RegExp constructor will attempt to compile a regex literal if the one or two args are primitives.

Code:
const x = RegExp('foo', 'g');

// ->

const x = /foo/g;

I think the literal form is slightly easier to work with, although in the end it shouldn't matter for semantics. And since regexes are another form of objects, you can't safely merge identical regexes.

Method calls


Calls to a property are called method calls. The added complexity is that we can't naively split them up because in this form the call passes the object on as the context. Additionally, the property lookup may trigger a getter before actually calling the function.

That said, to preserve values we must cache the property lookup in case any of the args are complex and need to be cached. In that case I compile a call to a custom function, $dotCall. This special symbol simulates a foo.call(context, a, b, c).

Code:
obj.foo(a + 1);

// ->

const tmp = obj;
const tmp2 = tmp.foo;
const tmp3 = a + 1;
$dotCall(tmp2, tmp, tmp3);

Later rules can sometimes clean some of this stuff up. And otherwise it was apparently necessary.

Known builtins are also resolved, like with the identifier.

If we can resolve a function but not the args we can compile in a special symbol to represent it.

Code:
const x = [];
x.push(a, b);

// ->

const x = [];
$dotCall($ArrayPush, x, a, b);

This way we don't have unnecessary usages of the object in the code, which may simplify certain cases.

Beyond that, the method call treatment is very similar to the ident call treatment.

Static expressions like Math.pow() are resolved if possible.
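
For example:

Code:
const x = Math.pow(2, 10);

// ->

const x = 1024;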

New expressions


The difference between a call and a new is that the context is never determined by the object of a member expression. I'm wording that carefully since the context can be set in more than one way. It's just guaranteed to never be the object of a member expression.

For this blog post that means that we can cache the property lookup and don't have to bother with the $dotCall approach. For new we only consider the ident-as-callee case.
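
A sketch of the property-callee case for new:

Code:
const x = new obj.Foo(a + 1);

// ->

const tmp = obj.Foo;
const tmp2 = a + 1;
const x = new tmp(tmp2);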

Member expressions


This is a fancy term for "property".

A "simple" member expression has a Simple Expression for its "object" and its "property". The latter is only relevant when the property is computed.

Optional chaining elements are eliminated, same as we've seen earlier. And certain builtin values are immediately resolved if it is safe to do so. That means Number.NaN can be replaced with a regular NaN while Math.PI must remain as is since inlining it would risk precision loss or rounding errors.

Static expressions, like the .length of a string, are resolved as well. Similarly, calls to things like Array.from() are dropped when they are statements. And otherwise replaced with the result if we can determine it.
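
For example:

Code:
const a = "hello".length;
const b = Number.NaN;

// ->

const a = 5;
const b = NaN;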

Computed keys that are idents


When a property lookup uses a computed string key that would be a legitimate identifier, then we convert it to a regular non-computed property.

Code:
foo["bar"]

// ->

foo.bar

This is because we will always have some trouble reasoning about computed properties and fewer concerns with non-computed properties.

Properties on null or undefined


It's possible that the reductions and normalizations result in a property lookup on a value that is null or undefined. In that case a throw is compiled afterwards and the rest will be dropped by DCE (in another rule).

Code:
null.foo;

// ->

null.foo;
throw "Unreachable";

These cases are usually bugs. But are they bugs in the input or in Preval? ;p

Assignments


The assignment is an atomic with two high level forms:

- A simple member expression left and a Simple Expression to the right, or
- An identifier to the left and an Assignable Expression to the right

We can't avoid the property form since it's the only way to mutate a property. Now that I write this, perhaps a property mutation should be a special AST node in the IR. As a reminder, the property is a special case because it may trigger a getter.

We'll focus on the ident case for the remainder of assignments.

Compound assignments


All compound assignment expressions are converted to a regular assignment with a binary expression.

Code:
foo += x;
// ->
foo = foo + x;

foo *= x;
// ->
foo = foo * x;

For all such compound assignments.

Nested assignments


Since the rhs of an assignment must be an Assignable Expression and an assignment is not part of that set, we have to eliminate it.

The assignment can be a little tricky to eliminate due to the property case. Let's assume we've already made sure that all properties are simple. That leaves us with 2*3=6 cases to consider when splicing out the assignment. Ideally we wouldn't want to create more variables than we have to but at the same time, we can clean them up later anyways. The important part is that we retain the proper getter/setter order for the worst case, where there are three properties and they all have a getter and a setter.

Code:
a = b = c;
// These are not observable so any order will do
// ->
b = c;
a = b;

a.x = b.x = c.x;
// In order:
// c.get
// b.set "c.get"
// a.set "c.get" (so this does not do a b.get!)
// ->
const tmp = c.x;
b.x = tmp;
a.x = tmp;

So to be safe, we should take the latter approach and assume that if it can be made simpler that other rules will patch that back up.

If the lhs is a pattern then we eliminate it as well.

Code:
a = ({x} = y())

// ->

const tmp = y();
x = tmp.x;
a = tmp;

So regardless of what's in the middle, the lhs gets the right-most value of a nested assignment and none of the other parts are read.

Self assignments


If something ended up doing a = a then we should be able to eliminate that.

Of course there is still a risk of TDZ since a may not yet have been declared at the time of executing that assignment. So for that reason we must leave the identifier behind.
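
In other words:

Code:
a = a;

// ->

a; // kept so a potential TDZ error still triggers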

Binary Expressions


The binary expression is when you have an operator with an expression on either side, like a + b or x == y.

This is probably the trickiest of all expressions for us since it may trigger two spies in the same line, instead of one. That would be the case with coercion of two spying objects: {toString(){ spy }} + {toString(){ spy }}.

A few binary operators do not spy: strict comparison operators (=== and !==), in, and instanceof do not trigger spies although in and instanceof may crash hard with invalid operands.

In general, if a binary expression has two primitives as operands then we resolve that immediately. Provided there's no risk of precision loss.

When either side is not a primitive, we make sure both are Simple Expressions. This may still lead to spies on either side so we apply the coercion rules explicitly, assigning the coerced values to temporary variables and then applying the binary expression to those. This way we can preserve the property of at most one spy per line while also preserving the semantics.

As far as coercion is concerned, almost all of the ops have a straightforward coercion pattern. Most ops convert to numbers. Note that the comparison ops do this too (the >= kind).

Plus


The trickiest part is + since that may coerce to string or number first, depending on the other side. This case may not be known in this phase of Preval so we would have to bail on that and defer to the next phase where more semantic knowledge is available to a rule.

I defer to the spec: Convert both operands to primitives calling ToPrimitive with no hint. This essentially means it's "string" for dates and "number" for anything else although you can change this with the Symbol.toPrimitive builtin. Whether we want to support this or not, best we can do is compile it down to a coercion that does this specifically:

Code:
f(a + b);

// ->

const tmp = $coerce(a, 'default');
const tmp2 = $coerce(b, 'default');
f(tmp + tmp2);

After the initial step the value is guaranteed to be a primitive (since otherwise a type error is thrown) so we don't have to worry about the plus after that point.

This does put an extra burden on normalization because how do you know that a plus is normalized? The simple answer is that you don't. But you can compile in yet another special symbol to signal that the plus was normalized and that, at some point, the value was guaranteed to be a primitive.

Code:
f(a + b);

// ->

const tmp = $coerce(a, 'default');
const tmp2 = $coerce(b, 'default');
const tmp3 = $plus(tmp, tmp2); // Really just `tmp + tmp2`
f(tmp3);

Considering the fresh bindings are constants and assumed to be primitives, we should be able to assert that this is an invariant, so the special symbol should be okay.

Weak comparison


The other gnarly ones are == and !=, which coerce according to the other side. I actually wrote a tool for that a long long time ago.

It has the same kinds of problems as plus, meaning that in the worst case we can't just figure out from the AST how either side will coerce.

Unary Expressions


There are 7 operators that we care about: delete, void, +, -, !, ~, and typeof.

Static expressions are resolved immediately.

After any transformation, the arg must be a simple node. So in this case, that has to be an ident, since any other simple node is a primitive we can resolve.

delete


Delete is one of those spies that don't trigger a callback necessarily (no getter or setter gets invoked) but the action is still observable later since it mutates an object by reference.

While the spec forbids the argument to be just an identifier, it doesn't forbid it to be many other things that are not a property. For example: `delete parseInt();` returns true. We normalize this case, even though I'm sure we'll never see it in the real world.
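
A sketch of what that normalization could look like (the output shape is my illustration, not necessarily Preval's exact output):

Code:
const x = delete parseInt();

// ->

parseInt();
const x = true;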

void


The void operator is essentially the value undefined preceded by evaluating the actual expression and ignoring its value.

We eliminate void, replace it with undefined and put the argument as a statement before it.
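
For example (ignoring the further normalization of the call itself):

Code:
f(void g());

// ->

g();
f(undefined);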

+ and -


Some may not realize this but negative numbers are actually positive numbers with the minus operator applied to make them negative. This causes some grief in the AST because a negative number is still a UnaryExpression rather than a Literal (maybe we should normalize that in the IR...).

On the other hand, the + has no real meaning if the arg is already a number and should just be dropped.

Both operators coerce the argument into a number in the same way.

We can look ahead a little bit and, for example, also resolve some nested operators, like +typeof x will always return NaN. Oh look another case we'll never see in the real world.

If the argument of + is another + or - then the outer + can be dropped. Similarly, if the argument of a - is a + then the inner + can be dropped. A double negative -(-(x)) is the same as +x.
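
In code:

Code:
f(+(+x));
f(-(+x));
f(-(-x));

// ->

f(+x);
f(-x);
f(+x);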

!


I think I covered this somewhere else but the logical inversion does not trigger a spy. So whether you do !x or not is, by itself, not something the user code can detect.

This phase doesn't do much when visiting this node and operator. Although it will replace !typeof x with false, since the typeof will never return an empty string.

~


The tilde will coerce the arg to a 32bit int. It certainly triggers coercion spies.

In the next phase some more stuff can be done with this operator but in this phase it's just resolving static expressions.

typeof


Getting the type of a value is not observable, much like ! is not.

If the argument can be resolved because it's primitive or literal or another kind of node for which we can safely predict the outcome, then we can inline it immediately. Real world code shouldn't really get this but it might happen in the IR.

We can immediately resolve certain nested args as well: typeof typeof x is always "string" and typeof (a * b) is always "number".

Array expressions


The array itself is not going to trigger spies. But the elements should be Simple Expressions.

The exception is a spread. We have to keep that and make the argument a Simple Expression instead.

Once the array is in its simple form, it cannot spy, aside from any spread it still contains (since that iterator may still be observable).
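
For example:

Code:
const arr = [a + 1, ...f()];

// ->

const tmp = a + 1;
const tmp2 = f();
const arr = [tmp, ...tmp2];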

Object expressions


Similar to arrays. We first have to make sure all the nested expressions are Simple Expressions. And like with arrays, we have to keep the spreads and make their argument simple instead. Once that's done, the object does not trigger spies either.

One thing to note is that objects can have computed properties. Those should be turned into Simple Expressions as well.
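
For example:

Code:
const obj = {[a + 1]: f(), ...g()};

// ->

const tmp = a + 1;
const tmp2 = f();
const tmp3 = g();
const obj = {[tmp]: tmp2, ...tmp3};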

ThisExpression


Already covered but to repeat; any function should get a function header and if the function references this, which we can safely determine statically, then an alias will be created in the header and any occurrence will be replaced by that alias instead.

End of phase 1


That covers all the expressions. Or the relevant ones, anyways.

I think the next phase will be way more interesting since those are more concrete tricks which require deeper analysis. Or maybe it's just cool in my head. Who knows! :)