Deterministic benchmarking


The past few weeks I've been working intensively to get the performance of my local ZeParser2 branch back up to, and beyond, that of the published one. The branch contains numerous improvements, fixes some bugs, and supports optional stricter modes that detect certain runtime errors at parse time (for-in and illegal assignments, to name a few). However, these features came at a price and caused a significant decrease in perf.

This post does not offer a solution, it only speculates in theory :(

My method of checking perf and comparing various attempts has been to simply parse a huge script (about 16MB) and check the millisecond difference in Chrome. I use a special build step that generates a Gonzales file which allows me to run various builds, each with a small diff, and compare their results to each other.

While this works fine for very frequently run functions or statements, the differences are obviously negligible for functions that are only called ten times. Since the most frequently executed statement runs over 9M times in my benchmark, any change to it has a visible effect on the parse times. I tend to be interested in statements that run 50k times or more in my benching. Anything else is negligible in most parsing.

Gonzales will show you the min, max, and average of running parser+source combos. To be honest, I only look at the min times. The idea is that the playing field is level and all parsers (that is, builds) have the same "problems", if any. So running them a bunch of times under the same circumstances should give at least comparable numbers, even when the numbers themselves don't mean anything.
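The idea behind this kind of harness is simple enough to sketch. The snippet below is a hypothetical, stripped-down version of it (not Gonzales itself); `parse` stands in for one build's parse function:

```javascript
// Hypothetical sketch of a Gonzales-style harness: run one parser build
// on the same source several times and keep min/max/average wall time.
function bench(parse, source, runs) {
  var times = [];
  for (var i = 0; i < runs; ++i) {
    var start = Date.now();
    parse(source);                    // the build under test (assumption)
    times.push(Date.now() - start);   // elapsed ms for this run
  }
  var min = Math.min.apply(null, times);
  var max = Math.max.apply(null, times);
  var avg = times.reduce(function (a, b) { return a + b; }, 0) / times.length;
  return { min: min, max: max, avg: avg };
}
```

Comparing two builds then just means calling `bench` with each build's parse function on the same source and comparing the `min` fields.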

This is true to some extent, but it seriously helps to have an "index" parser. For me that's been the currently published version, since I've been benchmarking against it for a few weeks now. It lets me easily see whether the browser hit a "sweet spot" or not. For the 16MB benchmark file, the least amount of time the published version took was 382ms, though generally it's around 420ms. These numbers are for my computer, so just take them as an index.

The minimum numbers I get for my index parser still vary between 382ms and 480ms. That's a huge margin of error. And those are the mins over a bunch of runs within the same page refresh in Chrome, not just two individual runs. So there can definitely be a huge difference between page refreshes. I could guess that this is related to which core the page uses, how busy the browser is, or various other inconsistent idiosyncrasies of the JIT. But there are other people with way more knowledge in this area than I have, so I won't.

However, it does expose the clear need for a consistent, deterministic way of benchmarking scripts. A method that would not be constrained by time, since that always depends on system load. A method that will return the same value when run twice, preferably even when run on two different systems. And I'm fine if this is hidden behind a flag. This isn't a tool for everyday use.

It would be great if a major engine offered this. It doesn't really matter which engine. My work has shown that engine bias isn't a major problem for this field, though obviously it can be in specific cases. Usually, if you optimize your code to perform fast in one engine, chances are high that it'll also perform better in the others.

Just one more thing to add to the wishful thinking list...

It's not all impossible though. One way of getting deterministic stats is by simply counting the number of times something executes. While this doesn't help when you're calling time-intensive functions or properties, it's been super helpful for ZeParser2, which mostly does number compares and charCodeAt() calls. You can bet I'm more interested in the function that got called 9M times than in the function that got called 4k times.
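The counting approach can be illustrated in a few lines. This is only a minimal sketch of the idea; the counter names and the instrumented function are made up for illustration:

```javascript
// Deterministic frequency counting: bump a counter per instrumented
// statement. Counts depend only on the input, never on system load.
var hits = {};
function count(id) {
  hits[id] = (hits[id] | 0) + 1;
}

// A hypothetical instrumented hot path:
function isDigit(code) {
  count('isDigit');
  return code >= 0x30 && code <= 0x39; // '0'..'9'
}

for (var i = 0; i < 1000; ++i) isDigit(0x30 + (i % 20));
// hits.isDigit is now exactly 1000 on every run, on every machine.
```

Running this twice, or on two different systems, yields identical counts, which is exactly the property wall-clock timing lacks.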

As you probably know, I also built a tool for this a while ago. I've been dogfooding HeatFiler on ZeParser2 for a while now. It's allowed me to go micro on perf by getting frequency stats and tweaking the order of number comparisons. It's a lot of work and a huge puzzle, but I find it a lot of fun and very rewarding. I got the parse time down from 500ms to 330ms with many different micro tweaks.

It becomes a problem when your script calls a function that takes as much time in one call as another function takes in 9M calls (for example, canvas drawImage versus comparing a number). This is something HeatFiler can't detect, not even in theory, because any rewriting will immediately throw off the JIT.

Though perhaps it could be interesting to force a deopt (there are easy ways) and compare speeds. If anything, you'd get more consistent results because the JIT would no longer be a factor. I wonder how relevant such results would be to the real world, though. But if nothing else, they could give a sense of scale. Maybe I'll play around with that some day.

More perf tools please!