Finding memory leaks with JS

2013-12-11

When you code in JS you generally don't have to worry about memory so much. Certainly not as much as the lower level languages have to. However, that doesn't dismiss you from catching memory leaks. Yes, there's a garbage collection system that's smarter than most of you or me will ever be able to concoct, but it's not smart enough to prevent you from doing stupid things. Or rather, it can't distinct between the smart or stupid things people do.

In this blog post I'm going to put together an approach that might allow a script to do track retained scopes and variables. There is a limit to this approach, though. It's not as generic as I would like it to be. On top of that, there's an inherit performance issue that might make this approach thoroughly useless in serious and/or bigger applications. But luckily, speed is not really an important requirement for making this approach useful. Anyways, let's get to it.

The simple initial step is actual scope tracking. You can assign each scope a unique number. Each scope occurs at a certain point in source code. That can give you a scope id. Furthermore, you can have variable from which you draw unique numbers (ok, that's no brain teaser). This combination allows you to track how many times a certain scope is opened and where it's coming from.

First let's define some simple library methods.

Code:
// register a new scope instance, return unique instance id
function startScope(scopeId, parentScopes){}
// reached the end of the scope code (EOF or end of function)
function stopScope(uniqueId){}
// process the value and check it for scopes, up the reference count when found
function link(value){ ... return value; }
// process the value and check it for scopes, decrease the reference count when found
function unlink(value){ ... return value; }


Now with that, the first code transformation would be like this:

Code:
function foo(){
}
// ->
function foo(){
var __scopeId = startScope(1, []);
stopScope(__scopeId);
}


When you call foo we record the creation of a new scope. We track which scopes it has access to because as long as this scope exists, so will those scopes. We store the unique scope instance id in a local variable so that we can access it again later. We'll need to. Multiple scopes on the same level are equally trivial.

Code:
function foo(){
var __scopeId = startScope(1, []);
stopScope(__scopeId);
}
function bar(){
var __scopeId = startScope(2, []);
stopScope(__scopeId);
}


Let's step it up a notch. How about a scope inside a scope:

Code:
function foo(){
function bar(){
}
}
// ->
function foo(){
var __scopeId = startScope(2, []);
function bar(){
var __scopeId = startScope(2, [x]);
stopScope(__scopeId);
}
stopScope(__scopeId);
}


The interesting part about this example is the use of the local scope variable. On the one hand using the same variable for tracking this id is great. You don't have to generate an infinite amount of unique variables. Shadowing is fine in JS (and since we want one var per scope level, the kinds of scopings in JS are actually irrelevant) so declaring them in every scope will work perfectly. However, the shadowing also prohibits you from accessing the instance id from the upper scope making it quite difficult to track which instance id has access to which upper instance ids. Note that while you can determine statically to which scopes a scope might have access to, you can't write in the instance ids as they are runtime generated (and our code needs to know to which <scope, instance> pairs an instance has access to).

To overcome this problem we could wrap every function in a special call:

Code:
function foo(){}
// ->
var __upperScopes = [];
var foo = (function(__upperScopes){
return function(){
var __scopeId = startScope(1, __upperScopes);
stopScope(__scopeId);
}
)(__upperScopes.slice(0).concat([__scopeId]));


Solved! Ah, but not quite. We wrapped the inner function in another function that creates a local variable that contains a list of all scope instances the inner scope has access to. However, we've introduced four problems to our approach:

1: we now have an extra variable on each scope which contains a list of scope instance ids that the upper scopes has access to.
2: every function will have an extra scope layer. Well okay maybe not a problem, but still something to take into account.
3: we'll have to rewrite function declarations because the above transformation is not equivalent. Not at all, actually. The fix is not very difficult, but requires moving the function declarations to the top and transforming them into var assignments (which was done above). And even then, the semantics are different because the function won't have a name. You can fix that too and yet it won't be the same. Anyways, function declaration rewriting is a different matter. It can be done, but requires a bit of preprocessing.
4: by introducing the requirement to wrap functions, we've shot ourselves in the proverbial foot because we now have to take eval and more importantly Function into account. And that will be a much harder problem to solve...

This transformation also adds some burden on the complexity of the transformation, although I'm afraid we're far from done in that department. It's a good thing that we don't actually have to care about how the result looks or that no living person will be able to properly read it :)

Let's take a look at the double scope version of that:

Code:
function foo(){
function bar(){}
}
// ->
var __upperScopes = [];
var foo = (function(__upperScopes){
return function(){
var __scopeId = startScope(1, __upperScopes);
var bar = (function(__upperScopes){
return function(){
var __scopeId = startScope(2, __upperScopes);
stopScope(__scopeId);
}
)(__upperScopes.slice(0).concat([__scopeId]));
stopScope(__scopeId);
}
)(__upperScopes.slice(0).concat([__scopeId]));


There's actually an unsolved problem in the above example. We might link each individual scope when it's invoked, but if we would invoke the above we would first link 1 and then unlink it immediately. We don't actually call the inner function, so that part never even executes. In this specific instance it's actually not a problem because the inner function does not persist and so the outer scope will be released as soon as it ends. But what if we return the inner function? What if we create a closure...

Code:
function foo(){
function bar(){}
return bar;
}
// ->
var __upperScopes = [];
var foo = (function(__upperScopes){
return function(){
var __scopeId = startScope(1, __upperScopes);
var bar = (function(__upperScopes){
return function(){
var __scopeId = startScope(2, __upperScopes);
stopScope(__scopeId);
}
)(__upperScopes.slice(0).concat([__scopeId]));
return bar;
stopScope(__scopeId);
}
)(__upperScopes.slice(0).concat([__scopeId]));


Oh crap. Do you see that? Dead code! How the hell are we going to properly unlink a scope, when anything might still happen in the return expression? Our initial approach might have been a little short-sighted. The return and throw statements are a bit of a problem for us because they can unexpectedly prevent us from unlinking a scope. And if you think we can fix this statically because we can recognize those keywords; you're right! However, that doesn't save us from another function that throws.

So saving from unexpected throws can't be done by code transformation. There's also no such thing as a deconstruction callback, or we would have used it instead of this approach in the first place ;)

There is a way to get around our problem though. It's actually pretty nifty and completely unobtrusive! Finally a use case for finally! Observe:

Code:
function foo(){
return 42;
}
// ->
var bar = (function(__upperScopes){
return function(){
var __scopeId = startScope(2, __upperScopes);
try {
return 42;
} finally {
stopScope(__scopeId);
}
}
)(__upperScopes.slice(0).concat([__scopeId]));


Now regardless of what happens the stopScope function will always be called. This can not be circumvented by user code and is super generic. No draw backs so no new problems (apart from more complex transformations, of course). Good!

Let's take a closer look at our closure again:

Code:
function foo(){
function bar(){}
return bar;
}
// ->
var foo = (function(__upperScopes){
return function(){
var __scopeId = startScope(1, __upperScopes);
try {
var bar = (function(__upperScopes){
return function(){
var __scopeId = startScope(2, __upperScopes);
try {
} finally {
stopScope(__scopeId);
}
}
)(__upperScopes.slice(0).concat([__scopeId]));
return bar;
} finally {
stopScope(__scopeId);
}
}
)(__upperScopes.slice(0).concat([__scopeId]));


Now when we execute the code we do have a problem. Assuming we save the result of calling foo, of course. And even that decision is a problem we'll have to circle back to later. For now let's focus on the problems before that point.

We call foo, which will link a scope, create a function, return that function, and unlink the scope. For our system, foo's scope now has zero links to it. So we would deem it to be freed. However, it obviously isn't and we need to track this somehow. There are a few approaches we can take, all with their own pains.

One way involves exhaustive expression rewriting. The premise is that a fresh function expression can be juggled in an expression for the duration of that expression. After that it either has to be assigned to a variable or property, be returned from a function, or be freed. It can be freed because the value is unused (like x in (x, y) or true?1:x) or because it's transformed into something else, for example doing something silly like x = function(){} + 5; (this works, and frees the scope, but is obviously silly).

We would need to analyze each expression, wrap every value and operator hands and total expression to make sure that a function is or is not freed. On top of that, we'll need to examine every (own) property of any object that is freed this way as it might harbor a function. And don't forget about delete, because that might also free a scope.

We have to inspect every assignment (whether they be to a variable, object property, or object literal) and return value and increase the number of links when we see a scope that way. Freed scopes are unlinked, but returned scopes add a link (they will be removed when the return value is lost, which means we have to wrap every function call too to check this fact).

Even when we do all of this, we still can't face the problem of what native functions do this with scopes. Most functions won't give a crap, so you can unlink a scope when you pass it on. For example, doing (again, silly) something like [1,2,3].join(function(){}) will make you add a link for the argument expression, but since we don't control the native join function, we can't unlink it, causing our system to think there's still a reference to this scope when really there isn't.

The only way around this I can think of is a white list of functions that cause a scope to be retained when passed on (like an event registrator). We can keep track of this by (source code) name and wrap them when we encounter them. For the wrapping we could do a simple typeof and property check to see whether we have a function value, and if so, whether it's our wrapper or not (with a .__wrapper property for example). We could also pre-empt most of the environment by wrapping native prototypes this way. The problem becomes a bit wider when you want to take browser api's (like DOM) into account. But it's pretty much the same for nodejs, but at least you could hook into require with a hack and transform all included libraries that way. That "only" leaves nodejs natives, but pretty much the same pain as web api's otherwise.

This is where I kind of hit a wall. I can do the wrapping and the maintenance of which scope is retained by which other scope. But I can't really get around the native black boxes since I simply don't control them. I could wrap every function call in another call that checks whether the function being called is wrapped, and if not, wrap it on the spot. I can even make that transparent and everything. But that still leaves me with the problem of what to do with them. Do they retain a function or release it?

Some cases are easy. For example:

Code:
"foo".split(function(){});
el.addEventListener('click', function(){});


But it gets crazy quickly, and no whitelist to help you there...

Code:
var e = document.createElement('div');
document.body.appendChild(e);
e.onclick = function(){}; // <-- scope!
document.body.innerHTML = ''; // ... how many scopes did I just free? none!
e = null; // oh nevermind, scope freed


That's right. There are multiple ways how a scope might be freed when you factor in the DOM. I'm not sure I can fix this problem without engine introspection.

But. That doesn't mean the approach is completely useless. For example, my parser runs completely without the DOM. Sure, it uses native methods (heck, what code doesn't), but none of them are dangerous in the above sense. So for those kinds of applications, or those that use the DOM in such a way that the above doesn't happen, we could use this approach!

That said, how would this expression processing work in detail? Well:

Code:
foo;
// ->
foo;


A variable that doesn't do anything will still retain it's original value. Doh.

Code:
function foo(){ return function(){}; }
foo();
f = foo();
// ->
function foo(){ return link(function(){}); }
free(foo());
f = (free(f), foo());


(I've left out other transformations for brevity.) When you return a function you add a reference count to it. If the function call is part of an assignment you don't have to do anything to it. There's already a link added to it. If the function call is not assigned, though, you can free the scope since that's what happens.

A basic rule we can apply is that when you assign a function to a variable or property, you add a link to it. Because this var/prop will retain the function as long as the var-scope or owner-object persists.

When you return a value, you don't know at the point of returning whether or not the returned value will be stored (see above code example). So we make it a rule that we link any function that we return. We'll decide in any function call whether or not the return value will be stored or not and either unlink or do nothing. That's what we can see illustrated above.

How about expressions and the difference between using vars with functions and actual function (keyword)?

Code:
var foo = function(){ return function(){}; };
// ->
var foo = function(){ return link(function(){}); };

x + function(){};
x + foo;
x + foo();
// ->
x + unlink(function(){})
x + foo;
x + unlink(foo());

!function(){};
!foo;
!foo();
// ->
!unlink(function(){});
!foo;
!unlink(foo());

x && function(){};
x && foo;
x && foo();
// ->
(function(a,b){
if (a) b;
unlink(b);
return a;
})(x, function(){});
x && foo;
(function(a,b){
if (a) b;
unlink(b);
return a;
})(x, foo());
// ->
unlink((function(a,b){
if (a) b;
unlink(b);
return a;
})(x, function(){}));
x && foo;
unlink((function(a,b){
if (a) b;
unlink(b);
return a;
})(x, foo()));


y = x && function(){};
y = x && foo;
y = x && foo();
// ->
y = link((function(a,b){
if (a) b;
unlink(b);
return a;
})(x, function(){}))
y = link(x && foo);
y = link((function(a,b){
if (a) b;
unlink(b);
return a;
})(x, foo()))


As you can see, there's a notable difference between a freshly created function inside an expression or a variable that contains a function. Returning a function from a function call is kind of the same, though. Luckily we can check all of these when transforming the code, but it does make stuff a bit more complicated. Basically, when an expression contains a function expression or call, it poisons the entire expression until some point where you can explicitly unlink or assign the value.

Likewise, you have to call link on all other expressions that are assigned because if it contains a function, that function will get an extra reference count. The variable that is assigned to is also wrapped in an unlink, because if that contained a function it's now referenced once fewer times. Obviously.

We can approach this pretty generically for any expression as far as I can see. Not super easy, but I think I've showed you that it can be done.

The next problem is unlinking objects. Whenever we call unlink with an object we have to traverse the entire object manually. We only walk over own properties and we have to guard against circular references since, well, that could put us in an endless loop. We have to do this because some function might be stored in an object and we need to know that this link has been freed when the object was freed. Since we only track assignments, we also only have to check own properties. Of course, we do have to traverse the own properties of own properties that are objects, too.

What about inherited properties? They will entail a bit more work. Not only is the prototype created as a side effect rather than something we can hook in explicitly, it also secretly linked to instances of a constructor when you use new. That meanst that you can delete a constructor, but the prototype will be retained until you free all the children (or remove/replace the links to their parent prototype). For example:

Code:
function A(){}
A.prototype.foo = function(){};
var a = new A();
A = null;
a = null;


Question: when was the second function freed? Let's step through this... First we create a function (link A once). Then we create a method foo (link once). We create a new instance of A (technically, we link A.prototype once). We unset A, this unlinks its prototype which in turn unlinks foo, both once. But our system should have noticed that the prototype was referenced by a, so it won't mark that as freed just yet. Then a is cleared, freeing the object in a, which in turn frees the link to the prototype, which frees up foo (releasing it completely).

Okay that makes new important. But we haven't done any tracking code for that. At all. But any function could become a constructor and be used with new. We might be able to cheat using our existing wrapper and an instanceof check, like:

Code:
function F(){}
new F();
// ->
(function __local(__upperScopes){
var w = function __wrapper(){
if (this instanceof __local) {
linkObject(__local.prototype);
// and now we have to check whether or not the return is overridden... immediately unlinkObject the proto if so
}
(function __origFunc(){
var __scopeId = startScope(1, __upperScopes);
try {

} finally {
stopScope(__scopeId);
}
}).apply(this, arguments); // yes yes, arguments is not an array
};
linkObject(w.prototype);
return w;
})(__upperScopes.slice(0).concat([__scopeId]));


I haven't even mentioned linkObject or linking variables in general, at this point. But yes, that's the main gist of it. You register local variables in a scope as they are defined and you register an object as referenced as long as you know it is such. Of course, native functions are even a bigger pain for general objects.

I'm going to stop now. While I'm ahead. Sort of.

I hope I showed you an interesting approach to doing reference checking and that there are quite a few caveats into doing so. You can build a fairly good system but it'll be easy to poke holes into them using simple built in mechanisms like eval, new, and the native functions that are black boxes.

Maybe some day I'll get myself into implementing a small proof of concept for this mechanism (beyond what I already have before I hit the wall). Or maybe this article persuaded somebody to do so (please let me know! :)). I hope it enjoyed and inspired you.