Preval prologue


I'm going to start a new project. It's probably best described as "something like prepack". The name I picked is kind of a homage to it; "Preval". A combination of "Prepack" and "eval" (surprise). The project is meant to be some kind of code reduction tool for JS where anything goes, including the use of "eval".

tldr; this is an announcement post of a new project called Preval. Meant to be an "anything goes" compiler-esque tool for JS.

Different approach

In the past I would finish a project, or at least get it to a release state, and then publish it before doing a writeup. Kind of hoping for the surprise effect, I guess. However, for Mope I had written a bunch of thoughts on paper without publishing them. But they're kind of incoherent, at least from a reader's perspective, so I'm not likely to publish them now. That's kind of unfortunate because Mope had a lot of rather interesting challenges and puzzles.

For the Preval project I'm flipping the approach. In part because I don't know where it will lead, if anything. In part because I don't want to make the same mistake as with Mope, insofar that there are a lot of interesting problems to write about and I don't want to deprive you of long boring blog posts ;) And in part because this way I can tweet/communicate about it more freely without having this constant "veil" over the project. So here we go.


When I first heard of Prepack it was my understanding that it was a JS compiler. In a way it is. But I was hugely surprised that, despite all the things it could do, it wouldn't even do a simple thing like inline primitive constants.

Later, at facebook, I actually talked to a few people involved with the project who set my understanding of the project straight. Sadly the project was not based in the EU and so the bar was too high for me to be part of it. In hindsight, I guess that was for the best. The project appears to be fairly dead, or at least on ice, purely based on the state of the repo with the last (public?) commit dating back 2+ years. I still love the idea of Prepack and would love to see something like it being fleshed out.

For the record, I have had zero involvement in the Prepack project. Neither before nor during my time at facebook. I've had some very interesting discussions with most people of that team but haven't actually read or written any code for that project. And as I'll tell you next, I don't even know how it actually works.


So am I making another Prepack? Honestly, I don't know. I had high hopes for Mope and having to abandon it was not without a fair amount of disappointment. As such I'm not holding my breath for Preval to reach any amount of usefulness. But maybe that's a good thing. If Preval ends up doing the same or taking the same approach as Prepack I'll consider that a compliment.

What I'd like to see Preval do is to take arbitrary code and to reduce the abstraction overhead by being able to assume certain inputs and generating outputs based on those presets.

For example, say you have a contrived function like;

function f(a, b) {
  return a + b;
}
Then various kinds of use cases could be folded up;

const r = f(1, x); // --> const r = 1 + x
const r = f(1, 2); // --> const r = 3
const r = [a, b, c, d].map(f); // --> const r = [a, b + 1, c + 2, d + 3]
const r = arr.map(f); // --> ... hm not really actionable unless we have concrete information about `arr`

In theory, anyways. Some are trivial (well, ought to be), like inlining constants of primitives.


Even the simplest cases can be quite complex though.
f(x); // could be `f(1)`
// if `x` was defined as
const x = 1;

At first glance this sort of replacement looks trivial but to apply this generically in code we need to do a lot of checks and scans.

- We need to make sure this binding is not updated anywhere (constants can't be changed, but let and var can).
- We need to make sure the value is in fact some kind of immutable primitive. Like strings, numbers, null, undefined, or boolean.
- We need to do the replacement in such a way that you can continue working on the same AST. Otherwise you have to do one replacement and start from scratch after a serialization and parsing step. Possible but not scalable.
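To make the checklist a bit more concrete, here's a rough sketch of the first two checks against ESTree-style nodes. The function names and exact rules here are my own illustration, not Preval's actual implementation:

```javascript
// Only immutable primitives are safe to inline: strings, numbers,
// booleans, null, and undefined.
function isInlinablePrimitive(node) {
  if (node.type === 'Literal') {
    const t = typeof node.value;
    return t === 'string' || t === 'number' || t === 'boolean' || node.value === null;
  }
  // `undefined` is not a literal but an identifier (a read-only global in module code)
  return node.type === 'Identifier' && node.name === 'undefined';
}

// A declaration qualifies when it cannot be reassigned (`const`) and
// every declared binding is initialized to such a primitive.
function isInlinableDecl(decl) {
  return (
    decl.type === 'VariableDeclaration' &&
    decl.kind === 'const' &&
    decl.declarations.every((d) => d.init && isInlinablePrimitive(d.init))
  );
}
```

A `let` that is provably assigned only once would qualify too, but proving that is exactly where the scope tracking and scanning work comes in.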

All this needs to be done by a machine, to which none of this is "obvious". Suffice it to say, there's some work involved to get it all working.


Right now I'm not yet sure about the actual approach. I have the baseline working, where the code takes some JS code, parses it into an AST, sets up scope tracking and some other static things, and then ... well that's where the magic is supposed to happen.

I'm not settled on the actual magic part of the approach. Even constant inlining is somewhat troublesome. For this approach you have to track bindings, make sure they are constants (whether by const or because they are only set once), make sure they are primitives, and then do the replacement.

And just because something can't be simplified in one pass, it might be able to be simplified in a second pass. Like this one:
const a = 1;
const b = a;
const c = b;

In the first sweep the binding `a` is clearly a primitive constant so we can inline it for `b`. But then we hit `c` and either we have to update certain references with the fact that `b` is (now) a primitive, or we take a lazy approach and do a whole new pass whenever anything changed at all. But that would require three passes for the above case. And that's just a trivial case; would that scale? (No.)
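The lazy approach can be sketched as a fixed-point loop: rerun the whole pass after any pass that changed something, until a pass changes nothing. The toy below skips the AST entirely and models bindings as a plain name-to-value map (my simplification, not Preval's actual representation), but it shows why the chain above costs three passes:

```javascript
// One simplification pass. Each pass only sees the previous pass's
// result (hence the snapshot), mirroring the "whole new pass" approach.
function simplifyOnce(bindings) {
  const before = { ...bindings };
  let changed = false;
  for (const [name, value] of Object.entries(before)) {
    // A string value stands in for "aliases another binding".
    if (typeof value === 'string' && typeof before[value] === 'number') {
      bindings[name] = before[value];
      changed = true;
    }
  }
  return changed;
}

// Keep running passes until one reports no changes.
function simplifyToFixpoint(bindings) {
  let passes = 0;
  let changed;
  do {
    changed = simplifyOnce(bindings);
    passes += 1;
  } while (changed);
  return passes; // includes the final pass that confirmed nothing changed
}

// const a = 1; const b = a; const c = b;
const bindings = { a: 1, b: 'a', c: 'b' };
const passes = simplifyToFixpoint(bindings); // 3 passes
```

Pass one resolves `b`, pass two resolves `c`, and pass three just confirms nothing changed. A chain of n aliases costs n+1 full passes this way, which is exactly the scaling problem.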

For Mope I accidentally created a stack based pseudo language that would execute the code on a meta level. But Mope also does a lot of type tracking. I kind of want to avoid the special language approach here. I might have to, anyways.


Another approach might be to transform the whole code and wrap read/writes around all access for anything. This would lead to some sort of encapsulation of all the things, through which you could easily figure out what a certain value would end up given certain inputs to the program.
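To illustrate the idea (not Preval's actual implementation), here's what such a wrapped program could look like if every read and write went through tracking functions; `$read` and `$write` are hypothetical names, and the "transformed" code is written by hand rather than generated:

```javascript
const log = [];
const env = new Map();

// Record every write and keep the value around for later reads.
function $write(name, value) {
  log.push(['write', name, value]);
  env.set(name, value);
  return value;
}

// Record every read and return the tracked value.
function $read(name) {
  const value = env.get(name);
  log.push(['read', name, value]);
  return value;
}

// Original:           Transformed:
// const a = 1;        $write('a', 1);
// const b = a + 2;    $write('b', $read('a') + 2);
$write('a', 1);
$write('b', $read('a') + 2);
```

The interesting part is that after running this, `log` contains the full data flow of the run, which is the kind of information the analysis would need to figure out what a value ends up being for given inputs.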

The transform approach sounds feasible to me. I dread the work that's involved in getting that to a working point but once there, you could invoke the actual code and for any point in the code determine the result. That might give highly specialized builds for particular program inputs. Or maybe it's a wash; I find that kind of hard to predict right now.

The setup for this is annoying though. The transform itself is not a big deal, but I never bothered to do transforms properly for Tenko. So either I'm going to hack something together, formalize it (a whole project of its own), or decide to use a different parser for it. But I want to stick with Tenko so if I do go the transform route, I'll probably end up also creating some kind of transform library where you can create new nodes and inject/replace them in the original AST.

Right now I'm leaning towards a combination of both (although the transform approach kind of supersedes the AST analysis approach). Doing transforms involves running the code which puts you into dangerous territory, even if the transforms would allow you to control every step of the way. The code may never finish or require super expensive computations before yielding any sort of result.


I'm going to write this thing in JS. Why JS over something like wasm or Rust or anything else? Because it's by far the language I'm most proficient in and an easy language to prototype things in. If it turns out that there's something viable here, I can always choose to apply lessons learned to write something concrete from scratch.

The biggest challenge facing JS for performance is GC control. I'm not too concerned about other problems like async since I'm not planning to use async. Multi-threading is possible with some boilerplate work. Object inspection is doable, even if it doesn't perform very well.

In the end it's GC that's gonna drag it down. Even if you allot 64GB to node, it will still want to do full GC sweeps. And since this sort of program is going to retain a lot of complex objects long term, by design, the JS GC will not appreciate that and will punish us.

Testing env

In the past four years I've worked on two bigger personal projects that involve code manipulation for inputs and outputs and I've spent too much time on tests. These projects, Tenko and Mope, tend to be easy to test per case. No unit tests, just e2e tests. You have an input string and you check the output AST or output string. Yet I spent a lot of time in both projects writing input and output tests. For Tenko I did end up writing proper snapshot tests. For Mope, not so much, and I have regretted that multiple times when I had to manually update hundreds of tests after changing my mind or whatever.

Tenko and Mope both started out with a bunch of tests in the same file. An array or object with lots of test cases or a describe() and it() approach. Either way, it contained inline snapshots, or expected outputs. But these just don't scale. For Tenko I did (automatically) rewrite every test to individual files (20k or so?). For Mope I didn't get that far. At this point Tenko has 33k test cases and I think 95% code coverage (pretty good for a parser). Mope has about 1400 input code test cases, which are often more complex in nature than Tenko's test cases. But they are still tucked in grouped files, making them awkward to update. Especially in terms of updating the lint warnings.

This time I decided to spend some time, ahead of writing the first concrete line of code, on setting up snapshot tests. It's a bit copy/paste from Tenko and Mope but in the end it means I now have a good basis for snapshot testing of the project. That framework will be expanded as time goes on, just like Tenko's testing environment evolved over time.

I picked markdown as the output again. It allows for a nice legible test case with input, output, and descriptions written in the same document. It also renders nicely on github. And we can use Prettier to make the output consistent, regardless of what happens to spaces when printing. This means the test case contains input and output in the same place (I don't like it when snapshots are in a different file) and we can copy/paste test cases to adjust them slightly.

The code is wrapped in code fences of quintuple backticks, which makes "parsing" easy. I use the headers as semantically relevant blocks and then parse the inputs and outputs from them. The code fence can contain a label indicating the file name (for ESM purposes) and my experience with Tenko tells me that this works quite well across the board. Plus, Preval only has to care about correct JS programs, making it even easier.
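As a rough illustration of that setup, here's how parsing such a markdown test case could look. The header handling and fence-label convention here are my assumptions, not necessarily Preval's actual format:

```javascript
// Parse a markdown test case into sections keyed by header name.
// Each section may hold one quintuple-backtick code fence whose info
// string (if any) is treated as a file name label.
function parseTestCase(md) {
  const sections = {};
  let current = null;
  const lines = md.split('\n');
  for (let i = 0; i < lines.length; ++i) {
    const line = lines[i];
    const header = /^#+\s+(.*)$/.exec(line);
    if (header) {
      current = header[1].toLowerCase();
      continue;
    }
    const fence = /^`````(\S*)$/.exec(line);
    if (fence && current) {
      // Collect everything up to the closing quintuple-backtick fence
      const body = [];
      for (++i; i < lines.length && lines[i] !== '`````'; ++i) body.push(lines[i]);
      sections[current] = { label: fence[1] || null, code: body.join('\n') };
    }
  }
  return sections;
}
```

Because the fences are five backticks, test cases can themselves contain regular triple-backtick fences (or code with backticks) without any escaping.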

I rely on git for detecting changes of the snapshot in test files whenever anything in the source changes. This way you can easily see what happens when you change something and you can easily undo this through git. While it does give you a very hard dependency on the quality and coverage of your tests, I like it this way for these kinds of projects since the input is well bound and I think I'm pretty good at creating all the test cases.

One thing I still have to create is a better way to diff. With Tenko the runner will print a proper comparison between the old and new test case when it detects changes. When using plain diff you often don't see enough context because the input and output are so far away from each other. This makes reviewing changes much harder. Instead I prefer a side-by-side diff (colordiff -y) while cutting away certain boilerplate parts but leaving others, regardless of changes.

Another thing is generated tests. Tenko has support for this but I'm not quite fond of how I made it work there. There needs to be a good balance between automatically generated combinatorial test cases and the ability to describe what's going on. While at the same time still being snapshots and easy to work with.

Current state

In its current state Preval only allows you to do one or two things. More a proof of concept than anything else.

I copy/pasted Mope's approach to the parsing and walking step. It can detect an addition of two literals and then uses eval to replace the addition with a single literal. It can also detect simple constants with primitive values and inline them. Once. See tests ;)
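A minimal sketch of that literal-addition fold might look like this. The walking and node handling are heavily simplified stand-ins for the real thing; only the eval-the-two-literals trick matches what's described above:

```javascript
// Bottom-up fold of `<literal> + <literal>` on an ESTree-like AST.
function foldLiteralAdds(node) {
  if (!node || typeof node !== 'object') return node;
  // Recurse into child nodes first so nested additions fold inward-out
  for (const key of Object.keys(node)) {
    const child = node[key];
    if (Array.isArray(child)) node[key] = child.map(foldLiteralAdds);
    else if (child && typeof child === 'object') node[key] = foldLiteralAdds(child);
  }
  if (
    node.type === 'BinaryExpression' &&
    node.operator === '+' &&
    node.left.type === 'Literal' &&
    node.right.type === 'Literal'
  ) {
    // Let eval compute the result, so `1 + 2` becomes the literal `3`
    // and `"a" + "b"` becomes the literal `"ab"`.
    const value = eval(
      JSON.stringify(node.left.value) + '+' + JSON.stringify(node.right.value)
    );
    return { type: 'Literal', value };
  }
  return node;
}
```

Using eval here means the JS engine itself defines the semantics of `+` (number addition vs string concatenation), so there's no risk of the fold disagreeing with what the code would actually do.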

I intend for it to work on module code, which implies strict mode. That makes a few things easier and I don't really want to bother supporting sloppy mode at this point (although I might if the project gets far enough to be worth it, just don't count on it).

No REPL yet, although I will probably copy it from Mope, too. I did like the REPL I created for it and I think a REPL is almost mandatory to demonstrate a project like this.

I did publish it to npm. I was a bit surprised to see that the npm package name for preval was still available. No complaints :p


My main goal with this project is to satisfy my curiosity about how I would tackle something like this and what it would look like. I enjoy these sorts of tools because they're super interesting, albeit difficult, puzzles. They automate a rather complex problem and I think there's still value in doing so for JS. Unfortunately the world is catching up to me, to some degree. But I don't care.

At the same time, as long as websites ship megabytes of JS I'm pretty sure there's space for a tool that might shrink that considerably.

I hope Preval ends up as a tool that can inspect an arbitrary piece of JS and reduce it. It'd be awesome if it had some sense of what's going on and could make "smart" suggestions to reduce or optimize the code.

Most of all, I'm doing this for the heck of it. I hope it ends up as a sort of project that can actually crunch down the size of a program significantly but I don't have time to work on it full time and progress may be slow as I still have to flesh a lot of things out.

In other words; I'm just going to be doing this for fun in my limited spare time and I'm curious to see how far I can get it :D