Generating source maps

2013-01-28

I've spent a couple of days trying to implement source maps for Hutimility and learned a great deal in the process. It actually took me a few attempts to get certain things right and feel the current spec (v3) could use some clarification, as well as a visual verification tool. The latter helped me greatly in discovering some lingering bugs, as well as getting it right. While browsers are starting to support source maps, the whole thing is still a big black box. Much like other "bleeding edge" technologies (like CORS), the feedback from browsers is utter crap. I hope this article helps you on your way while implementing a source map for your own tools.

tl;dr: This is a tutorial for creating source maps (v3, the current version at the time of writing), with some side notes on the future and a visualization tool.

Source maps


When translating input source code to JS you usually lose the ability to properly debug the code you've actually written. Let's say we're translating C++ to JS, then a printf("foo"); might become a console.log("foo");. While that's fairly straightforward, now imagine a 200k source code, usually translating into something even bigger and almost definitely more obscure, and certainly not annotated (/commented). How to deal with that?

Enter source maps. They are a fairly rudimentary tool that should allow you to load a translated source file but debug it as if it were the original. As the map part implies, it's actually a map. For certain parts of the translated source code, it tells you where in the original source code that particular part of the code came from. Unfortunately source maps do not (yet?) have the length property, so it'll still be a small guess unless you encode every character. But it's already much better than the nothing we used to have.

Syntax


The map basically consists of a series of relative positions. Each such location is called a "segment". Each segment is a set of numeric fields. Each segment is separated by a comma, each line is separated by a semi-colon (even if the line is empty).

A segment (in v3) might have one, four, or five fields. Each field is a single number, always a delta relative to the previous field (in this, line breaks and even file switches are ignored, everything is delta). Implicitly, the first field is all zeroes. The spec makes this apparent by saying that the first occurrence of such field is absolute, but I find this a bit odd. Anyways, assume the first implied segment to be all zeroes and any segment you encounter to be relative to the previous segment and you won't have to worry about when something might not be relative ;)

Base64 VLQ


The fields are VLQ encoded, with each field Base64 encoded. Again, this is skipped over a bit in the spec and I feel more clarification is in order here. I spent quite some time getting this right, so I hope I can save you some trouble here.

Base64


When you see a segment, you might notice that it's Base64 encoded. That means that each 8bit character might actually only represent 64 different numbers (0 to 63). It represents this with the following ordered range: A-Za-z0-9+/. The decoding and encoding of Base64 in JS can be fairly straightforward:

Code:
var encoding = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
function encode(num){ return encoding[num]; };
function decode(chr){ return encoding.indexOf(chr); };

So that should be fairly easy to grasp. The number 5 is encoded as E in Base64. The character + is decoded as 62 from Base64. Note that I am merely talking about encoding arbitrary numbers in Base64 here, not the actual segment fields! Read on.

VLQ


Now it get's a bit trickier. VLQ stands for "Variable Length Quantity", Wikipedia has all kinds of interesting facts about this. Let me try to explain how it works for us.

In VLQ, each number is encoded in a series of small numbers, with meta information added about whether another such number should be added and what the sign of the total number should be. Concretely it means this: Each number is cut into parts of a set number of bits. It adds one bit to each such part to indicate whether it is the last part. And only the first part gets an extra bit which tells you the sign of the entire number.

Bits and bytes


I feel I should not have to explain this if you're contemplating implementing source maps, because you'll probably already have covered much more related advanced topics and should really have this down already. So a quick recap; A bit is the smallest piece of information in our technology. It represents either a one or a zero. True or false. In our computers we use sets of eight bits to compose characters, called a byte. So a byte is eight bits. Each bit adds a number to the power of the `nth bit, if set.

When displaying numbers in binary (so, in bits), we usually put the "least significant bit" (LSB) on the right side. Don't get me started on this whole topic... it's called endianness and whole books have been written about it. Unfortunately, we'll see quite a bit of it back in VLQ, and I'll simply skip over it by simply stating how it is.

So, for example the number 5 is shown as 101 in binary. The right most bit (the LSB) is valued one, the second bit from the right is valued two, and the third bit from the right is valued four. As you can see, that's 1, 2, 4. Each value is it's previous step times two. Then we can see that the right most bit is "set" (so, a 1), the second is "unset", and the third is set. So we only add the right most and left most values, 1+4=5. This principle can be applied to a number of any number of bits, of course. Some examples:

Code:
15 => 00001111
200 => 11001000
83 => 01010011

I'll prefix binary numbers with a small b, to clarify. There's an easy way to display any number in binary in js:

Code:
var n = 200;
console.log(n.toString(2));

So the toString method of numbers accepts the "base" argument, for binary that is 2.

Ok, so far the basics of binary. Still with me? I hope you skipped this because it won't get much easier :p

VLQ, continued


This is where it gets a bit tricky. It certainly confused the hell out of me, but the spec and other tutorials so far seem to skimp over this as if it's very normal. Le sigh.

In the end, each character of a segment is Base64 encoded, so can only represent the values 0 to 63. This means each character is actually a number of six bits (binary b111111 is decimal 63). Each of these numbers also loses some "space" (bits) to the meta information. Conclusively this means that the "most significant bit" (MSB), or the left-most bit when printing it, is used to indicate whether another character is used to encode the current number. And only the first character (so the "least significant byte") also uses its LSB (bit..) to encode the sign of the original number. The bytes of a VLQ are then ordered least significant to most significant, which helps tremenduously when encoding... Ugh, ok, example time!

Let's say we want to encode a field with the value 886973. We will have to transform this into a string of characters, each character encoding five bits of that number, except for the first character which can only use four bits. So the number in binary is b11011000100010111101, in other words this number will be cut up as follows: 1 10110 00100 01011 1101. Adding the meta information of the VLQ (thick), each byte will look as follows: 0000001 110110 100100 101011 111010. We ignore the most significant two bits (left most when printing) of an actual byte because they cannot be represented in Base64 encoding. This was a deliberate choice (ok, that's my assumption ;)) and the "wasted space" will (hopefully) be regained with gzip compression during transport.

So we have the bytes 00000001 00110110 00100100 00101011 00111010, or 1 54 36 43 58. Encoded in Base64 this is B 2 k r 6. And as mentioned before, the bytes are ordered least significant first, so the VLQ will be: 6rk2B.

Recap, steps to encode a number:

Code:
886973 // decimal
b11011000100010111101 // binary
1 10110 00100 01011 1101 // cut up representation, as it will be encoded per byte
1 110110 100100 101011 111010 // vlq meta data added
1 54 36 43 58 // decimal
B 2 k r 6 // base 64 encoded
6rk2B // reverse bytes, final result

Obviously, this is done slightly differently in code, using bitwise operators. But that's what you need to do.

decode VLQ


To decode a field, you simply take bytes left-to-right. For each byte, take the first five bits. Multiply that value with two to-the-power-of as many bits as you've already collected, then add that result to a total. Keep doing this as long as the current byte has its sixth bit set. When that's no longer the case, you've reached the most significant byte of this field. As the last step, check the least significant bit (the right-most bit when displayed). When it's 1, the number is negative. Otherwise the number is positive. The total is then "bit-shifted" to the right by one, to remove this sign meta data bit.

Code:
6rB // field
// character 6 is 58 in base64
// b00111010 is number 58 in binary
// AND 5 bits of that number, this is the
// actual value, add to total, multiply with 1 (2^0)
total += (b00111010 & b11111) * 2^0
if (b00111010 & b100000) {
// next char is r, 43 in base64
// b00101011 is 43 in binary
// multiply with 2^5, because we've parsed 5 bits
total += (b00101011 & b11111) * 2^5
if (b00101011 & b100000) {
// character B is 1 in base64
// b00000001 is 1 in binary
// multiply with 2^10, because we've already parsed 5 bits
total += (b00000001 & b11111) * 2^10
if (b00000001 & b10000) {
// nope, no more parsing
}
}
}
// total is now 00001 01011 11010
sign = b000010101111010 & 1
real = b000010101111010 >> 1
// b000001010111101
if (sign) real = -real;
// voila, result is 701, or b000001010111101

Okay. That should cover the VLQ. Onward!

Segment fields


As mentioned, the mapping is a set of segments. Each segment consists of a couple of fields, some which may be omitted.

Target column


The first field is the column of this segment in the "target" (translated, minified, compiled, whatever) source. The line number of the current segment is always implied and can be aquired by counting the number of semi-colons encountered. As the spec mentions, this field is always relative to the previous field. This even applies when moving to a new line. In my opinion this fact is not clarified in the spec, but if the current segment signifies the semi-colon at the end of a 5000 character line while the next segment signifies the first column of the next line, the value of the first field of that next segment will be -5000. This fact also holds when crossing (source) files, by the way. So always compute the "delta" (difference) between the target absolute column and the absolute column of the previous segment.

Source file index


The JSON that contains this map also has a list of source files. These should be the files that were used to create the target source file. Or at least the files referenced in the source map. The second field of each segment indicates a source file, to which that specific character maps. This too is a relative field. So most of the time, this field will be zero.

Source column/row


The third and fourth field of a segment signify the row and column in the original file of the second field from which the column (and implied line) of the target source originated. In other words, the segment tells you that the column from field one, came from the column/row of some file. These fields are relative, even when switching file. So if you were just at the last segment of a very long file, and jump to the start of a new file (like when concatenating source files), the row and column will have a big negative delta, equal to the absolute column and line of the previous segment, negative. Note that the third field is the row, and the fourth the column (which kind of means y comes first). This is counter-intuitive to the usual x/y ordering, although it's in line with the usual row/col ordering. So, whatever, that's just the way it is.

Name index


Source maps also give you the ability to name your segments. Personally, I think you'd be better of with segment lengths, but whatever. This field is a zero-based index onto the names array of the source map json. This field is relative to its previous segment as well.

Omitting fields


In version 3 of the spec, you're allowed to omit certain fields of a segment. Specifically, you can choose to either specify one, four, or five fields. So the name field is completely optional. The source file, col, and row fields are also optional (in which case a name index is also omitted).

I'm personally not sure what use a single field segment would be, but whatever. I'm also trying to persuade the spec writers to allow the omitting of the file index field, since it'll most often be zero. And when it's not, worst case it'll be what it is right now. Best case, you'll omit many A's :)

Map structure


The mapping key of a source map consists of segments. Each segment is separated by at least one comma. The end of each line in the target source causes exactly one semi-colon to be added to the mapping. This is the implied target line number which is not encoded in a segment directly. It's allowed to have lines without any segment, so multiple consecutive semi-colons, in case some range spans multiple lines. I'm not sure about multiple comma's, but I would say they should be okay (though silly cruft). A comment in the spec agrees that the segment AAAAA (which is five zero deltas) can be considered a "noop".

JSON


The source map json contains a few keys. Aside from the obvious mappings key, there is also the version key. This is a numeric value, 3 in this tutorial. I expect it to increase before the spec stabilizes. The sourceRoot key tells the interpreter what to consider the root for the other files that are mentioned, if relative it will be relative to the map file, not the index that loaded it! The file key points to the target file (so the compiled/generated js file). The sources contains an array with the filenames that are referenced in this source map. The names array contains strings which are referenced from the source map as well. Unused keys can be omitted.

Extensions


The spec mentions possible custom extensions for the key map. Such extensions are supposed to start with x_ .... It also says the recommended way would be x_<domain>_..., for example x_com_google_gwt_linecount. Incidentally, x_com_google_gwt_linecount is used by google to indicate the number of lines represented by this source map (probably an optimization of some sort, though the spec does not explain the extension at all).

Post processing map


The spec mentions a post processing step. I did not touch this so can tell you very little about it. Looks complicated.

Referencing a map


There are currently two ways of referencing a source map. One is a js comment: //@ sourceMappingURL=<url>, although some languages might not support the double forward slash as a comment. For instance, in CSS it could look like /*@ sourceMappingURL= */. Oh yes, CSS, you read that correctly. The file agnostic way of pointing to a source map could be by a line in the HTTP header: SourceMap: <url>. If both are supplied... well, the ice was already pretty thin don't you think?

Dynamic magic


An underspecified part of source maps is dynamic generation and usage of source maps. Currently, that seems to be impossible. Hopefully that'll be addressed when v4 arrives.

Conclusion


Source maps are a great tool that are vital to any tool that translates some source code to the final JS to execute. Whether it be JS extensions like TypeScript, flavours of JS like coffeescript, or completely different languages like python or C++, debugging them properly will eventually require source maps. Or a lot of more time.

That said, browser support is currently meh at best. During my investigation only Chrome had some support for source maps (though you have to explicitly enable that through the developer options). Firefox is still working on support, I don't know the status of other browsers. Chrome's support seems to be to show the original sources in the inspector, and being able to set break points from the original source (though the actual break point tabs only show up in the translated source, but that's surely a bug). I thought it would also use the source map when giving the origin of the error, but that doesn't appear to be the case yet. Stepping through the code also appears to use the translated code, rather than the source map. So while that's kind of disappointing, I'm positive that will change in the not-so-far future. This was probably due to a bug in my source map generator, as it does show errors in the proper source position now. Enfin, I rest my black box case. Anyways, with the source map properly mapping now, it's indeed a much better tool :)

Dynamic usage of source maps seems to be impossible right now. Even if you'd be able to hack it in, how are you going to prevent running the original source when including that. It's all a mess, I hope it'll be addressed in the next spec version. I think there's a good solution ;).

Verification tool


Since working with the browser is still very much a black box, I give you a small tool I've been using to verify the source map generator for my Hutimility project. This tool can load a source map, ajax the files in, parse the source map, and display single character tabs for each segment on which you can click. Clicking these tabs will show you the source code that belongs to that segment (they are color coded) and put a small red marker at the actual position referred to by the segment. It will of course scroll to that marker. Hopefully this gives you some better insights in whether your source map generator works.

I've added three examples. Two small examples from Hutimility directly and one that uses jQuery and a simple demo (compiled and generated with Closure). Note that the jQuery one can take a while to load, this is mainly due to the DOM having to load a few thousand spans. Sorry 'bout that :)

You can find the tool here (github here).

You can copy the entire html to your project folder and then load the source map. That should be able to load the files properly and allow you to test the maps.

I hope you like it and that it's useful for ya :)