One of the reasons I wrote a JS parser was so I could actually start building my own tools. I'm not so great with using other people's libraries and this was no different. I knew what I wanted and I knew what I needed. The two main uses for parsers in tooling are analysis and rewriting, and it's the rewriting I'd like to talk about now.
tldr: This post describes how I go about rewriting source code and why you don't need a whole parse tree to do it.
A parser usually comes with a parse tree. This is a schematic, structured, and usually hierarchical view of the source code. It reduces something complex like an arbitrary expression to a single node. The first version of ZeParser had a complete parse tree, but I found navigating the parse tree quite cumbersome. The problem is that it's sometimes handy to have, for example, an expression as a single node, but other times you want the entire statement, or just a sub-expression, or whatever. I've found that in most cases I can make do with a simple token stream. That's why I wrote the second version of ZeParser without a predefined parse tree.
A token is the smallest indivisible unit that, put together with the others, makes up the source code. The smallest unit is not a character, but a group of characters. For instance, there are identifiers (foo, function, while), there are numbers (55, 3.5e+10, .4), and there are strings ("foo", 'bar'). The parser usually uses a lexer or tokenizer to do the legwork, so it only has to worry about tokens, not the actual characters that make up each token. Most of the time the parser cares more about the type of a token than the characters it contains. In my parser, a token is a regular object with some properties. It should have a type, a range, and a value (the substring of the source code that it represents). Some tokens might also have some metadata (see below).
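To make that concrete, here's roughly what such a token object could look like. The property names here are illustrative, not ZeParser's exact API:

// token for the string "foo" starting at offset 4 of the source
var token = {
  type: 'string',  // identifier, number, string, punctuator, ...
  start: 4,        // range: offset of the first character
  stop: 9,         // range: offset right after the last character
  value: '"foo"'   // the exact substring of the source code
  // optional metadata goes here (see below)
};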
A token stream, in (my..?) js terms, is simply a regular array with all the tokens of the source code, properly ordered. I make a distinction between two types of token streams: black and white. The white token stream contains all the tokens, so it also includes the whitespace, line terminator, and comment tokens (these are not considered part of the token stream per the ES spec). The black token stream contains only the non-whitespace tokens; of course, "black token stream" refers to the absence of the whitespace tokens. Note that ASIs are converted to a special ASI token in my stream, which is part of the black stream.
Each token will also have a property that indicates its position in the white token stream. Black tokens additionally get their own index into the black stream, and white tokens get a reference to the index of the black token that precedes them. This makes navigation possible without having to continuously do an indexOf on the token streams.
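In code, that navigation could look like this. A sketch; the array names and index properties are assumptions, not a fixed API:

// every token knows its whiteIndex; black tokens also carry a
// blackIndex, while white tokens instead carry lastBlackIndex,
// pointing at the black token that precedes them
function nextBlack(token, black){
  if (token.blackIndex !== undefined) return black[token.blackIndex + 1];
  return black[token.lastBlackIndex + 1]; // -1 + 1 = 0 if nothing preceded it
}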
Together, the black and white streams make a very powerful tool for analysis and rewriting. In most cases you'll be using the black stream, though. The black stream allows you to jump through the actual source code without having to worry about how many spaces there might be between the if and the (. And because the stream is just an array, we can easily check for target structures by searching for a specific kind of token and then, when we find one, looking left and right to confirm the pattern we're looking for.
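For example, finding every if that's followed by an opening parenthesis is a simple scan with a neighbor check (a sketch, treating the black stream as the plain array it is):

black.forEach(function(token, i){
  var next = black[i + 1];
  if (token.value === 'if' && next && next.value === '(') {
    // found an if header, no matter how much whitespace or
    // how many comments sit between the two tokens
  }
});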
For full and arbitrary navigation through the source code, we do need a little extra meta information in the tokens. Namely, we need to be able to jump over arbitrary expressions, functions, brackets, and parentheses. I've found that all you need to do here is to match each opening parenthesis or bracket with its closing counterpart, and vice versa. This way, whenever you encounter an opening square bracket ([) you can simply jump to the matching closing square bracket (]) by following that property. This allows you to properly jump over a dynamic property access (foo[bar+5]) when you don't care about the property expression itself. A function would get a reference to the parentheses that make up its parameter definition list as well as to the curly brackets that make up its body.
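In code, such a jump might look like this; I'm assuming here that the parser stored the match on a (hypothetical) twin property:

// token is the [ of foo[bar+5]; skip the whole property access
function skipBrackets(token, black){
  var close = token.twin;             // the matching ] token
  return black[close.blackIndex + 1]; // first black token after it
}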
For most rewriting tasks it's also very valuable to have a property that signifies whether a token is the first or last token of a statement or expression. The parser probably already knows this anyway, so the overhead for this piece of information is negligible.
Let's look at an example. For a while now I've had the idea to write a simple rewriting tool that adds runtime validation checks based on the jsdoc headers of functions and methods. For example:
/**
 * Prints hello world and then something else
 * @param {string} something else
 */
function foo(something){
  console.log("hello "+something);
}
This is a very simple snippet of code. For the task at hand we basically want to rewrite it to this:
/**
 * Prints hello world and then something else
 * @param {string} something else
 */
function foo(something){
  assertType(something, 'string');
  console.log("hello "+something);
}
So in this code, the type of something comes from the jsdoc. To make this possible we'll need to check both the white and the black streams.
First we need to search the white stream for multi-line comment tokens that pass as jsdoc headers. I'll leave the exact check as an exercise for the reader. Having done that for the example above, we end up with a jsdoc header that declares one parameter and expects it to be a string. The comment token gives easy access to the index, in the black stream, of the black token that precedes it (if any). Find it, then use the black stream to look for the next token after that black token, or for the first black token if none preceded the comment. See? Already branching..
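That first step might look like this sketch; checkCandidate is a hypothetical callback doing the pattern matching described next:

white.forEach(function(token){
  // multi-line comments starting with /** pass as jsdoc headers here
  if (token.type === 'comment' && /^\/\*\*/.test(token.value)) {
    // next black token after the comment; lastBlackIndex is -1 when
    // no black token precedes it, so this also covers black[0]
    var next = black[token.lastBlackIndex + 1];
    if (next) checkCandidate(token, next);
  }
});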
We should now have the first black token that follows a multi-line comment. If there is none, the source ended and we bail. Otherwise we have to do some pattern matching. There are three types of structures to check for when it comes to matching a jsdoc token to a function.
- function declaration
- function expression assigned to a variable/property
- function expression declared in an object literal (method)
I've never seen the fourth structure, a function expression that's immediately passed on as an argument in a function call, so I won't bother with that one (but you would go about it in the same way).
Concretely, these are the five structures we have to check for (there are three for the assignment case):
/** jsdoc **/
function foo(a,b){ ... }
/** jsdoc **/
var foo = function(a,b){};
/** jsdoc **/
foo = function(a,b){};
/** jsdoc **/
foo.abc = function(a,b){};
{
  /** jsdoc **/
  foo: function(a,b){},
}
A named function expression is treated the same as an unnamed function expression in these cases.
The five simple checks then become as follows. Note that we only use the black token stream for these checks:
- If the next token has the value "function", assume a function declaration.
- Otherwise, skip the next token, check whether the token after that is a colon (":") and the token after the colon is "function". If so, assume a method.
- Otherwise we have to check the most complex case: assignment.
-- If the next token has the value "var", skip another token (this is the identifier) and verify that it's followed by a "=" token which is followed by a "function" token. If that holds, assume a var declaration that's initialized by a function expression. Note that this ignores cases like var a,b=function...;, but I'm okay with that. It's not common to use this, for good reasons.
-- If the next token is an identifier, keep skipping tokens as long as you either encounter a dot (in which case you also skip the token that follows it) or an opening bracket (in which case you jump to the matching closing bracket). Once you can't skip any more tokens, check whether the next token is a "=" token followed by a "function" token. If that holds, assume the function expression assignment case.
In all other cases, bail (or apply better rules to find the right function that the jsdoc belongs to ;)) and move on to the next jsdoc.
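Here's a rough sketch of those checks, building on the blackIndex and twin properties assumed earlier; it returns the function keyword token, or null to signal a bail:

function findFunctionFor(next, black){
  var i = next.blackIndex;
  // function declaration
  if (next.value === 'function') return next;
  // method: foo: function
  if (black[i+1] && black[i+1].value === ':' &&
      black[i+2] && black[i+2].value === 'function') return black[i+2];
  // var declaration: var foo = function
  if (next.value === 'var') {
    if (black[i+2] && black[i+2].value === '=' &&
        black[i+3] && black[i+3].value === 'function') return black[i+3];
    return null;
  }
  // assignment: foo = function, foo.abc = function, foo[x].y = function
  if (next.type === 'identifier') {
    ++i;
    while (black[i]) {
      if (black[i].value === '.') { i += 2; continue; } // skip .prop
      if (black[i].value === '[') { i = black[i].twin.blackIndex + 1; continue; } // jump over [...]
      break;
    }
    if (black[i] && black[i].value === '=' &&
        black[i+1] && black[i+1].value === 'function') return black[i+1];
  }
  return null;
}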
Assuming we have a jsdoc and its matching function keyword token, we move on to the validation. First of all, collect the parameter names and make sure they coincide with the (ordered) list of @param directives in the jsdoc. If they don't match, try to do your best to match as much as possible anyway. Again, I'll leave that as a task for the reader.
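Extracting those pairs can be as blunt as a regex over the comment token's value (a sketch; it ignores multiple types and other directives):

function getParamPairs(jsdocToken){
  var pairs = [], re = /@param\s+\{(\w+)\}\s+(\w+)/g, match;
  // matches lines like `@param {string} something`
  while ((match = re.exec(jsdocToken.value))) {
    pairs.push({type: match[1], name: match[2]});
  }
  return pairs;
}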
Each element in our list now contains a jsdoc, a function keyword token, and a list of pairs of parameter names and their expected types. We use the function keyword token to find the opening curly bracket of its body; the parser will have added this information to the keyword token, so we can easily jump to the start of the function body. We can assume the function has a body, even if it's empty, and that the body starts right after the curly bracket. We can therefore also assume we're at the start of a new statement, which makes our rewriting life much easier.
For rewriting, we simply reuse the existing tokens and rewrite their values. For each element in the list we take the opening curly bracket of the function body and append assertType(<paramName>, '<paramType>'); to its value for each pair in the list of matched jsdoc parameter names. We can put these in sequence (maybe add a newline for legibility if you want) and be happy knowing that the rewrite is, on its own, 100% safe because we were at the start of a statement.
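The rewrite itself then boils down to appending to a single token value; I'm assuming a hypothetical bodyStart property holds the reference to the opening curly bracket:

var curly = funcToken.bodyStart; // the { token of the function body
pairs.forEach(function(pair){
  // append one assert per documented parameter, right after the {
  curly.value += ' assertType(' + pair.name + ", '" + pair.type + "');";
});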
Note that a bit more effort is needed if the jsdoc defines multiple types, custom classes, or an array of a certain type. Again, I'll leave this for the reader to figure out. Once you know what to rewrite it to, the way to go about it is exactly the same as described above.
To reconstruct the source code you loop over the white stream, concatenating the value of each token. This takes the rewritten values into account and thus constructs the rewritten source code, without loss of whitespace or comments. You prepend the result with a function declaration that defines the assertType function, which should then do whatever you want it to do (report, throw, nothing).
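Both steps fit in a few lines; the assertType below throws, but reporting or doing nothing works just as well:

// rebuild the source; rewritten token values come along for free
var output = white.map(function(token){ return token.value; }).join('');

// prepend the assertType declaration
output =
  'function assertType(value, type){\n' +
  '  if (typeof value !== type) throw new TypeError(value+" is not a "+type);\n' +
  '}\n' +
  output;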
So there you go. I realize it was a bit of a dry read. But it's 2am and I'm too tired to make fancy pics for ya ;)
Hope it helps you anyways. Good night!