Normalizing patterns

2020-12-30

One of the more treacherous pieces of syntax in modern JS is patterns. On the surface they look benign, easy to work with, and they definitely have their uses. But dig deeper and some ugly truths come out. Especially when you mix and match arrays with objects, spread, and defaults.

I'm working on Preval and the thing I'm currently trying to get done is normalizing the input. I already fixed a bunch of cases, but I realized I would need to tackle patterns sooner or later. And since real-world code is riddled with subtle uses of patterns, I wanted to tackle them sooner rather than later. That way I won't have to worry about them anymore.

Complexity


The problem with patterns is that they have a lot of faces to cover:

- object and array patterns
- defaults
- rest elements / properties
- parameter / binding / assignment
- member expressions in assignment patterns

Arrays and objects can nest arbitrarily and nesting has some subtle parsing differences.

And of course you need to have tests for just about every cross product of these features, which means a massive suite of test cases just to cover your bases properly. Bleh. I racked up about 1000 test cases for parameter patterns alone.

Normalization


When I say "normalize", in this context, I mean to simplify a certain kind of code into something more generic and homogeneous. The result is IR ("intermediate representation") code that can be processed further. Some examples:

Code:
a = b.c.d;
// ->
const tmp = b.c;
a = tmp.d;

function f(a = b) {}
// ->
function f(a) {
  if (a === undefined) a = b;
}

etc.

By normalizing the input code we remove some rough edges and reduce the edge case surface. In the particular case that is the topic of this post, we replace all patterns with simple property lookups that semantically do the same. This means pattern syntax becomes a set of variable declarations and member expressions (and spreads, in case of arrays). These language features will need to be supported regardless so by eliminating the requirement to process patterns, we simplify the processing code.

The downside, of course, is a potential loss in source code representation. At the same time, and in particular for patterns, it ought to be possible to reconstruct a pattern (if we'd desire that to be output). Unfortunately it may not be feasible to do so within reasonable time.

Transforms


The list of complexities above has a few things that we'll have to transform to their normalized version. Most of them are actually not that difficult.

The machine interprets a pattern the same way you, as a human, would. So you can literally codify that interpretation in plain "old JS" terms and remove the patterns. What's left is a few under-the-hood "specification" artifacts to take into account, like how array patterns need to spread their input into an array first.

I've finished the parameter normalizations so let's start with those. Here's the simplest array pattern example:

Code:
function f( [ x ] ) {
  return x;
}

Apologies for the redundant explanation here, but: the function above takes a value and returns its first element.

The normalized code looks something like this:

Code:
function f(tmp) {
  const arr = [...tmp];
  var x = arr[0];
  return x;
}

The above is not the shortest form, obviously, but rather an example of the IR that I'll end up with. My IR is normalized, yet still valid, JS code.

The spread is required because the argument may be any iterable, like f(new Set([1, 2, 3])). By explicitly spreading the value at each array pattern step, we can safely do an indexed lookup. Obviously this may not be performant in some cases. That's not a big concern to me at the moment.
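
To make that concrete, here's a small illustration (my own example, not from Preval's test suite): a Set has no indexed properties, so reading tmp[0] directly would give undefined, while the spread step makes the indexed lookup work.

Code:
function f(tmp) {
  const arr = [...tmp];
  const x = arr[0];
  return x;
}

f(new Set([1, 2, 3])); // 1 (a direct tmp[0] would have been undefined)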

The object pattern is very similar:
Code:
function f({ x }) { return x; }
// ->
function f(tmp) {
  const x = tmp.x;
  return x;
}

What do we get when they nest?

Code:
function f([{ x }]) { return x; }
// ->
function f(tmp) {
  const arr = [...tmp];
  const obj = arr[0];
  const x = obj.x;
  return x;
}

function f({x: [ y ]}) { return y; }
// ->
function f(tmp) {
  const tmp2 = tmp.x;
  const arr = [...tmp2];
  const y = arr[0];
  return y;
}

function f([[ x ]]) { return x; }
// ->
function f(tmp) {
  const arr = [...tmp];
  const tmp2 = arr[0];
  const arr2 = [...tmp2];
  const x = arr2[0];
  return x;
}

function f({x: {y: { z }}}) { return z; }
// ->
function f(tmp) {
  const tmp2 = tmp.x;
  const tmp3 = tmp2.y;
  const z = tmp3.z;
  return z;
}

Look at that, simple nesting, already four examples. And we haven't even shown what happens when the pattern is the second parameter, or when an array pattern receives an elided array, or when objects have getters and setters etc.
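For instance, here's a rough sketch (mine, not Preval's actual output) of what the second-parameter case might look like under the same scheme:

Code:
function f(a, [ x ]) { return a + x; }
// ->
function f(a, tmp) {
  const arr = [...tmp];
  const x = arr[0];
  return a + x;
}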

What about defaults? The spec calls these "initializers" for patterns (ref, ref2) but honestly, they don't initialize anything if a value was already provided. While the spec wording might be technically true ("they are initializers and only used if the value is not initialized already"), I think "defaults" is more apt.

Anyways. Defaults.

Code:
function f([x = 1]) { return x; }
// ->
function f(tmp) {
  const arr = [...tmp];
  const tmp2 = arr[0];
  const x = tmp2 === undefined ? 1 : tmp2;
  return x;
}

Fairly simple to transform, I think. It's the same idea for regular parameters. As long as you do it at the top of the function and retain the order of defaults, the semantics remain the same.

There is a TDZ edge case that you may break, but who cares. (Spoiler: I do, but you can work around that too.)
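
For the curious, here is the kind of case I mean (my own minimal example, not from the post's test suite): the naive top-of-body rewrite silently loses the TDZ error.

Code:
function f(a = b, b = 1) { return a; }
f(); // throws ReferenceError: cannot access 'b' before initialization (TDZ)

// A naive rewrite loses that error:
function g(a, b) {
  if (a === undefined) a = b; // b is just `undefined` here, no TDZ
  if (b === undefined) b = 1;
  return a;
}
g(); // undefined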

There are some binding rules to take into account (a variable is not a constant and not a lexical binding, per se) but that's edge casing we can resolve later.

The next feature is "rest" (...x). This one has two parts: a simple one and a difficult one.

The simple one is arrays. That's just a slice of the array starting at the position where the rest element appears.

The hard one is objects. There isn't really a proper "old JS" equivalent for it, although you can definitely cheat a little. It's not performant, but you can statically pass on the input object together with the properties to exclude, and then shallow clone the remaining properties. Or you can let the environment do it anyway, under the hood or explicitly. I chose to abstract it into a function that I'll treat as special later.

By transforming {...x} into something like objSpread(x, y) I won't need to add special rules for the traversal later.
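
For illustration, a minimal sketch of what such a helper could look like (the name objSpread is from the text above, but this body is my guess; Preval's actual helper may well differ):

Code:
function objSpread(obj, excluded) {
  // Shallow clone every own enumerable string key the pattern did not already consume.
  // (The real thing would also have to consider symbol keys; this is just the gist.)
  const clone = {};
  for (const key of Object.keys(obj)) {
    if (!excluded.includes(key)) clone[key] = obj[key];
  }
  return clone;
}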

Ok, so this looks like:

Code:
function f([a, b, ...x]) { return x; }
// ->
function f(tmp) {
  const arr = [...tmp];
  const a = arr[0];
  const b = arr[1];
  const x = arr.slice(2);
  return x;
}

function f({a, b, ...x}) { return x; }
// ->
function f(tmp) {
  const a = tmp.a;
  const b = tmp.b;
  const x = objSpread(tmp, ['a', 'b']);
  return x;
}

In the end, the transformation for parameter patterns is not that involved. Yes it's annoying because of the different and recursive pieces that can form a pattern. But on paper, the transformation of a pattern is straightforward.

Binding


After parameter patterns I started doing variable declaration patterns.

Code:
const {x} = y;
const [a] = b;

This kind of code has become very common in React apps with the introduction of "hooks" (don't get me started). But even outside of that paradigm, it's more and more common to destructure a config/options object this way. Or simply to avoid repeating the identifier (const foo = bar.foo vs const {foo} = bar).

In theory the patterns are almost identical.

- A parameter pattern can have a default at its top level where a binding pattern cannot (function f({x} = y) {}, called as f(obj), vs const {x} = obj).
- Parameter defaults have some restrictions involving await and yield and some other edge cases. Bindings don't have this (or well, maybe they do but it's more intuitive?).
- Variable declarations with var can actually introduce duplicate bindings, which parameters with a pattern would throw an early error on; see the snippet below. (Though that's a parser problem, not really relevant to Preval.)
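
To illustrate that last point (my own minimal example):

Code:
// Legal (if sloppy): `var` happily re-declares the same binding.
var { x } = a;
var { x } = b;

// Early SyntaxError: duplicate parameter names are not allowed
// once the parameter list contains a pattern.
function f({ x }, { x }) {}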

I'm probably missing one or two other edge cases. But for the most part, they are the same.

One difference is the point in time during transformation. The parameters are processed before the body is. This means we can safely add statements/declarations to the function body and don't have to worry much about disturbing a walk in progress. That's different for variable declarations, which makes injecting new code much harder.

Let's take a look at a simple example:

Code:
const [ x ] = obj;

The transform itself is straightforward:

Code:
const tmp = [...obj];
const x = tmp[0];

But how do you inject this? It would turn a single declaration into two declarations. That means the number of elements in the function's .body grows while we're walking it, which means the visitor index may screw up (and it will in my case), so that's a no-go.

For many other cases I introduce a fresh variable, which gets hoisted, and then use expression logic so I don't have to replace the variable declaration node:
Code:
const [ x ] = obj;
// ->
var tmp;
const x = (tmp = [...obj], tmp[0]);

This works because variables are hoisted and I add them to a special "fresh vars to create" property which will generate a bunch of var foo statements at the top of the function, after parsing it (when it's safe to mutate its .body).
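
In case that's abstract, here's a hypothetical sketch of that bookkeeping (the names and AST shapes are made up for illustration; Preval's internals will differ):

Code:
// Collected during the walk; the walk itself only rewrites expressions.
const freshVars = [];

function requestFreshVar(base) {
  const name = base + '$' + freshVars.length; // unique enough for this sketch
  freshVars.push(name);
  return name;
}

// Once the walk is done it is safe to mutate the body, so prepend `var` statements.
function injectHoistedVars(funcNode) {
  const decls = freshVars.map((name) => ({
    type: 'VariableDeclaration',
    kind: 'var',
    declarations: [{ type: 'VariableDeclarator', id: { type: 'Identifier', name }, init: null }],
  }));
  funcNode.body.body.unshift(...decls);
}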

However, it's a bit ugly and has an inference drawback: the variable now holds both an arbitrary value and undefined, and even though I technically know the undefined is only there due to hoisting, confirming that is a little involved. I have to fix that problem regardless, but I'd rather not introduce more of those cases unnecessarily.

We could replace the variable declaration with a block:

Code:
const [ x ] = obj;
// ->
{
  const tmp = [...obj];
  const x = tmp[0];
}

Replacing a single node in the body with another single node while traversing it is fine since it has no chance to mess up the visitor index.

In theory this block approach could work because we don't care for the (temporary) lexical scoping of the new variable and the nested scope will be flattened before it becomes relevant that the existing variable is now block scoped.

The problem is that it's a huge footgun if any future step of the normalization suddenly does rely on the temporary lexical scoping situation. So I'd rather not go this way, even if it's by far the simplest approach.

A possible fix for both of the above is to wrap it in an immediately invoked arrow.

Code:
const x = (() => {
  const tmp = [...obj];
  return tmp[0];
})();

This one will have some pretty bad inference consequences down the road, so I'm not a fan, even if this kind of trivial arrow ought to get inlined anyway, which would moot that concern. I'd rather go for a simpler approach if I can.

Then I realized that variable declarations can declare any number of variables. We can leverage that principle to simulate introducing multiple statements without actually replacing the declaration node itself. Most of the time, those would be temporary variable assignments anyways.

This approach looks something like this:

Code:
const [ x ] = obj;
// ->
const
  tmp = [...obj],
  x = tmp[0];

This way I don't have to worry about lexical scoping, body visitors, or hoisted variables. (Creating fresh unique global variable names is trivial at this point.)

It seems to be viable generically so it solves a big problem for me.

Assignment patterns


Once I do the same for assignment patterns I'll probably do the hoisted-variable dance anyway, since an assignment is an expression and I don't see a better way of rewriting that. So maybe I should rewrite everything to those expressions. Shrug.
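
To give an idea, here's a speculative sketch of how that might look (I haven't written this transform yet, so the exact shape may change):

Code:
[ x ] = obj;
// ->
var tmp;
tmp = [...obj], x = tmp[0];
// (note: as an expression this yields tmp[0] rather than obj,
// so a real transform would have to account for the result value)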

Luckily assignment patterns are less common. At least it feels that way to me.

In the end it shouldn't matter. Whether code is normalized to atomic statements or expressions, both should be consumable with the same efficiency. It's just that some situations mandate an expression approach where others mandate a statement approach.

Or maybe I'll change that too and everything becomes an atomic statement or expression. Because why not.