Generators redux: UnYield

2014-03-10

While the new and upcoming ES6 specification will bring the yield keyword to the JavaScript landscape, this new feature doesn't actually enable something that can't already be done. After speculating about how I would do make the translation, well, I did. And it took a slightly different approach, of course.

I've dubbed it UnYield. Here are some links:

The demo: http://unyield.qfox.nl/
The integration tests: http://unyield.qfox.nl
Github: http://github.com/qfox/unyield

So let me explain how it actually does its job...

First the boilerplate. The function* g(){} becomes a regular function like this:

Code:

function g() {
var _γ_that = this;
var _γ_ielded = true;
var _γ_nextId = 0;
var _γ_ieldValue_N; // one var for each occurrence of yield

// any local variables are hoisted here too

var _γ_funcs = [
function _γ_f0(){ ... }
];

// one function for each occurrence yield
_γ_funcs[1] = function _γ_f1(_γ_ieldValue){ ... };

return {
next: function(v){
if (arguments.length > 1) throw "next() only accepts zero or one arguments...";
if (!_γ_ielded) throw 'unable to next, iterator finished';
return {value:_γ_funcs[_γ_nextId].call(_γ_that, v), _γ_ielded: _γ_ielded,_γ_nextId: _γ_nextId,done:!_γ_ielded};
},
};
}

That's a lot of boilerplate. Let's go over it.

First of all, note that _γ_ is just a configurable prefix. It's an attempt to prevent variable collisions, obviously.

Code:
var _γ_that = this;

The context of the iterator is and stays whatever it was when you called the generator function. So if you call x.g();, whenever you call .next() the context will be x. We store the context in _γ_that to preserve it.

Code:
var _γ_ielded = true;
var _γ_nextId = 0;

We need to track whether the function yielded, and if so, which occurrence of yield caused it. We need to know because we need to return a value for done. We need to know which yield becase we need to continue from it.

Code:
var _γ_ieldValue_1, _γ_ieldValue_1, _γ_ieldValue_1, _γ_ieldValue_...;

These contain the return value for a yield, obviously. Initally I used one variable because that seemed to suffice. But then there's the case of for example (yield 1) + (yield 2), where you'll need to do two returns and then have to remember both return values. In hindsight I didn't support multiple yields in the same expression, so perhaps I could drop this part for UnYield now.

Code:
var _γ_funcs = [
function _γ_f0(){ ... }
];

I use _γ_funcs like a jump table. When the code code encounters a yield it actually returns instead. It remembers that it returned from that specific yield and uses that id to continue from the right function when you call .next(). Hence the jump table.

The root function is f0. It's dumbly the original function, except any statement that contained yield is now completely replaced by a return. Other than that, no funny stuff. It's possible that the original code never hits yield and the function will need to return regularly if that's the case. However, in most cases, any occurrence of yield will result in dead code :) Which is fine. All in all, f0 will be the closest resemblence to the original function.

Code:
_γ_funcs[1] = function _γ_f1(_γ_ieldValue){ ... };

These are the continuation functions, one for each yield. In a nutshell they just contain the code starting at the statement that contained yield. This is prefixed by some idle boilerplate to prevent certain syntax errors. In case of loops, the code is followed by each loop, recursively so to the top. I'll get back to that in a sec :)

Code:
return {
next: function(v){
if (arguments.length > 1) throw "next() only accepts zero or one arguments...";
if (!_γ_ielded) throw 'unable to next, iterator finished';
return {value:_γ_funcs[_γ_nextId].call(_γ_that, v), _γ_ielded: _γ_ielded,_γ_nextId: _γ_nextId,done:!_γ_ielded};
},
};

This is the iterator API. Well, the important part of it. There's also a .throw() method I didn't do. But it's kind of trivial to add if you need it.

Note that my yield doesn't require arguments, Kyle told me the spec does require it. But that just means my approach is overaccepting. And that is fine :) But the tests will mostly not have arguments. Again, that should be fine too as they're in those cases not testing the arguments. Ok whatever, let's move on.

The first thing that UnYield will do is normalize all loops to regular while loops. This is a trivial operation for all the loops in JavaScript except the for-in. Though theoretically impossible to transform into a while, it can be done tentively to work mostly as intended. Yeah okay that sounds silly. Let's just look at how it's transformed:

Code:
for (var key in foo) code;

// becomes =>

var key;
var tmp, keys = [], expr = foo;
for (tmp in expr) keys.push(tmp);
while (keys.length) {
if (keys[0] in expr) { // dont iterate deleted props
key = keys.shift(); // postpone assignment to prevent clobbering the previous value
code;
}
}

// actual =>

var key ;
var _γ_key, _γ_keys_6 = [], _γ_expro_6 = ( foo);
for (_γ_key in _γ_expro_6) _γ_keys_6.push(_γ_key);
while (_γ_keys_6.length)
if (_γ_keys_6[0] in _γ_expro_6 && !void( key = _γ_keys_6.shift()))
code;

As you can see it gathers all properties first, in the same order as for-in returns them. Then it becomes a regular loop which we can safely yield. Caveats in this approach are: If a key is deleted and re-added it is still visited but shouldn't be, if the object contains ES5 getters, this transformation is "observable" since the getter will be called before actually looping, it's essentially a memory leak as long as the function does not complete. The memory leak is a potential problem, the rest are just edge cases imho.

The variant without the var keyword is similar, except the variable isn't explicitly declared.

As you can see, in the actual result I kind of hack the assignment of the key to the original variable. This prevents the requirement of having to wrap the sub-statement in curly braces. Each variable gets its own index to prevent nested for-ins (or closures) to clash. The wrapping of the rhs (foo above) is perhaps premature, but it is possible to have a comma there and that would trigger a syntax error.

Let's build the continuation functions, these are actually kind of interesting. They do what the name implies, it continues the original function by executing only the code that would come after the yield. There's no real jump/goto in JavaScript so I simply construct a bunch of functions. In hindsight I probably could have made due with a huge switch, which would have prevented the code duplication. I noticed Regenerator does it that way and it reminds me of how we did goto's in the AS3 bytecode VM for Uxebu (there the switch actually was my idea, so yeah :)).

Since I cut the original function up in several functions I might also cut away the head of a statement. This means I'll have to mend those cuts.

For most statements this is not a problem. Things like return, throw, etc. which don't have sub-statements. They are either replaced in full by the return stuff or the yield expression turns into a variable holding the return value.

The if is fairly easy, though a simple { won't mend it completely because the else will need the proper if. So if the yield is in an if, it will be preceeded by if (true). This way else will never throw a syntax error.

The switch statement is only more complex by having more boilerplate. To make sure cases, default, and breaks in the original switch don't start throwing syntax errors or break to the wrong statement we do two things: We create a new variable to which we assign a fresh object. We only care about the unique reference of this object. Second, we'll prefix the code with switch (switchVar) { case switchVar:. This way the remaining code will not cause syntax errors because it's still in a (dud) switch.

When yielding inside a try we simply prefix the continuation function with try {. However, inside a catch we only prefix a {. This is not generically correct because it will break finally, but I had already decided not to bother with finally anyways. It's not too difficult but adds a bit of boilerplate to detect a yield vs a return or throw to prevent triggering the finally. Not impossible, just didn't do it. I prefix catch with only a { because it prevents me from having to throw, or something even more elaborate. Note that my approach also breaks accessing the catch variable in the continuation function, because it disappears.

If you wanted to make sure a yield inside the catch of a try/catch/finally doesn't throw a syntax error, you could prefix the whole thing with try { throw throwValue_1; } catch (catchVar_1) { var e = catchVar_1;, where throwValue_1 was assigned the catch variable. This would also fix accessing the catch variable ;) This isn't exactly the same thing due to catch-scoping works in JavaScript, but close enough.

A block, those are just curly braces, are in fact prefixed with a {. Probably the easiest to fix up :)

That leaves with and while. Remember that I've already normalized for and do-while to regular whiles at this point.

Obviously with needs to be patched up using the original object. The original occurrence of with is transformed to assign it's parameters to this fresh variable. with (x+y+z) ... becomes with (withVar = (x+y+z)) .... Then in the continuation functions we just prefix a with (withVar) and are done.

The most annoying in this approach is a loop. If you start mid-way a loop you'll have to continue running all the statements in that loop. Then you have to include the entire loop again, in case you ... loop. This can cause a lot of dumb repetition. Ironically, we'll prefix all loop continuations with a for. I make use of the three parts of the for so that I won't have to wrap the body in an extra layer of curlies.

Code:
for (var whileOnce = true, broke = false; whileOnce; whileOnce = false)

Yeah, as you can see, it makes sure the loop can be entered but doesn't ever repeat. At the end of the body, I'll slice out the entire loop from f0 and copy it here. If the loop is nested, the same will be repeated at the end of the outer loop.

Code:
while (a) while (b) yield x;

// =>

for (var whileOnce = true, broke = false; whileOnce; whileOnce = false)
yieldValue;
while (b)
return x;
while (a)
while (b)
return x;

I noticed later that Regenerator uses a switch to work around this problem. It prevents this code duplication by wrapping the original code in an infinite loop and switch to the right points in the code at the end of a loop or when continuing after a yield. I like it. I didn't go for it because I thought it'd be easier without a switch. Note sure if that's true :)

Hmmm I think that's about it for statements.

Initially, UnYield finds all occurrences of yield in the input. It will slice out the keyword and it's argument and remember two values; a string that represents the return statement for this yield when it actually causes a yield, and a string representing the given yield value when continueing from yielding. When yielding, the entire statement is replaced with the return statement, otherwise only the yield expression is replaced with the yield variable.

The replacement return statement does three things:

Code:
return (yielded=true, nextId = 1, <value>);

It will set yielded to true. This identifies this returning as a yield, and not the end of the function. In short, this determines the done value.

The next id is set to the id of the yield that's causing it. When .next() is called this continuation function is invoked.

Lastly, the yield value is returned so that we can return it as the value property.

There's two special cases left to discuss: break and continue. These are context bound keywords. break can only appear within a switch, loop, or with a label within that label. continue can actually only appear in a loop, regardless of label. In the continuation functions I need to know whether the loop was shorted by either keyword. And I'll need to know whether to invoke the loop again of course. Doing break was kind of easy now since I'm prefixing everything with a dud loop. However, continue posed a small problem now because I did not want to actually continue the current loop, but the loop that would need to follow. So instead I'll break anyways and simply not set the broke var to true. That way the current loop is broken, but the original loop which follows will still be entered.

At the beginning of each continuation function the yielded variable is set to false. So if there was no yielding, this var stays false and I'll know that it actually ended.

Mmmm yeah that's about it.

It was fun to create this demo. Learned a great deal about generators and yield, that can never be a bad thing. Maybe I'll translate some of the other ES6 features into ES5 semantics as well, as a good exercise.

The demo: http://unyield.qfox.nl/
The integration tests: http://unyield.qfox.nl
Github: http://github.com/qfox/unyield