No more lies, just benchmarks

October 20th, 2011

I was asked by the people from Zeus WPI to give a talk on benchmarking, since students need to both test and measure the code they write for several courses and apparently there still is no course which teaches people how to conduct experiments. Being the yes-saying fool that I am, I agreed and figured out what to talk about. It turned out there are a lot of things I would like students to become more familiar with. In the end I ended up dismissing a number of more technical things, preferring to talk about as many subjects as I could.

I have made the slides (in Dutch) available in PDF format. Ignore the jokes if you do not grok them.

Thanks Zeus, for the Grimbergen bottles. I will make them count.

hCole-server: a Snap application

October 6th, 2011

During the 9th GhentFPG meeting, I gave a lightning talk about the hCole-server Snap application I had hacked together during the Haskell hackathon last summer in Cambridge. hCole-server acts as a front-end to a bunch of scripts that fire off a computation on the Ghent University supercomputer backend for determining the effect of a sequence of optimisation passes in the LLVM compiler when applied to the SPEC CPU 200[06] benchmarks.

Basically, I used Snap to deal with requests, routing them to a handler that uses two Snap extensions: the HDBC extension and a filesystem cache extension I wrote. When a sequence is submitted to hCole-server in the form of a string consisting of ints separated by dashes, there are four possible outcomes:

  1. This a new sequence that has never been requested before.
  2. The sequence has been requested in the past, but the experiment is still running.
  3. The results for the sequence are available in the filesystem cache.
  4. Some error occurred.

In the first case, a new job is prepared and started, the database is updated to reflect this (inserting a busy entry for this sequence) and the server waits for the next request. In the second case, we return a JSON string indicating we’re still waiting for results to roll in. In the third case, we extract the measurements from the tarball and wrap it in a JSON data structure and return it as the server response. In the last case, we oops. This is also illustrated in the following figure.

Because the experiment scripts automagically fetch completed jobs from the supercomputer and store them as tarballs in the file system cache, we need a way to update the database when a new tarball rolls in. For this, we spawn a second thread that periodically checks the file system for new files that correspond to a busy entry in the database. When a corresponding tarball is found, the thread updates the databse, changing the state for the sequence from busy to done.

You can find the slides of the talk (mostly code) here.

Note that all code was written for Snap 0.5.x and may fail for newer versions of the framework.

Fixoring the Belgian railways in some small aspect.

September 5th, 2011

The Belgian railways are quite well known for the frequent delays incurred by trains and travelers. The number of trains that do not make it on time to their destination (and all stations in between) has been rising steadily over the past few years. The NMBS — the part of the railway system that is responsible for the trains — present figures on these delays a few times per year. They improve their statistics somewhat by not counting certain delays, most notably those under five minutes. At the end points. So the train may still incur delays along the route, if it arrives on time at the final destination, all is well according to the stats. I beg to differ.

The real issue is that the statistics are based on trains, not people. No matter the number of people who will incur a real delay due to their train not being on time in some station — either an endpoint of a transit station — the NMBS will only count a single instance. To them, a train that has a delay at, say 2 p.m., is weighted equally with a delayed rush hour train. Obviously, the latter situation impacts a lot more travelers: people not making it to a connection train, tram or bus, or simply getting home later. The problem with connecting trains is that this may increase the real delay up to one hour or more. And it happens at minimum once every week.

I agree that the NMBS cannot fix all problems, and in some cases, they have no say in the matter, since the (old) NMBS has been split into three parts: the (new) NMBS, the infrastructure controller (Infrabel) and the holding linking the two. When a train breaks down, or an accident occurs, it is Infrabel that decides where and when the other trains will be (re)routed. Yet they have no passenger information (neither does the NMBS for that matter).

The real goal when solving a problem could take some lessons from computer science. Let us consider the trains to be processes that somehow have some dependencies between them. When one process breaks down or stops, because it has to wait for some data to become available, the other processes may be impacted since they may need access to some shared resource that is now being occupied by the stalled process (the borked train in our analogy). The operating system can then decide to force the offending process to release the resource (for a while) such that other processes can continue executing. Now, the analogy is not completely right, but you catch my drift, I’m sure. How would the OS decide which process can then take a hold of this shared resource? One way is to assign priorities to each process and have the process with the highest priority continue first. In the real world of trains there is a small (actually, a large) catch. Trains cannot overtake each other, except at certain well defined places (e.g., a station).

But let us first start with the priorities. How could we assign those to a train. Actually, there seems to be, IMHO, a quite simple solution for this. Every passenger needs to have a valid ticket before getting on the train. How about we change our infrastructure (that would yield jobs!) and force people to validate their ticket before getting on the platform. Moreover, you could deny access if their train is not coming within the next 15 minutes or so, or if there’s a previous train they cannot take (wrong ticket) still waiting. Since each ticket is used between two endpoint — and only very occasionally between the starting point and a station before the real end point — this yield information on how many people are traveling at each moment between any two train stations. Furthermore, since people tend to take the (theoretically) fastest route, it is quite predictable which station will be used for connecting trains (if people need to descend from the platform and revalidate their ticket, even this poses no issue). Thus, the NMBS (and by extension, Infrabel) have data on the number of passengers on each train and on how many people will miss their connecting train if they incur a delay. Hence, when a problem occurs (not an uncommon event on the Belgian railways), Infrabel can decide which trains to let pass first and which ones to (temporarily) put on a side track, such that the least number of people will be impacted and their overall/maximum delay will be as small as possible.

A boon of this system is that the NMBS will have objective numbers to decide how to schedule trains. Right now, they have some vague idea (commuters are known beforehand, but not the time at which they commute) and other travelers are mostly known on the moment they purchase a ticket. But once people are (anonymously) scanned, they gain knowledge. Patterns may arise (summer holiday travelers going to the coast, etc.) and trains can be scheduled much more effectively.

The above might not impact the number of trains incurring a delay, but it should certainly impact the number of affected travelers. While the former do not complain, the latter do. Quite loudly at times, and righteously so.

Ranking Commercial Machines through Data Transposition

August 31st, 2011

The following paper has been accepted for publication at IISWC 2011 (IEEE International Symposium on workload Characterization).

Ranking Commercial Machines through Data Transposition, Beau Piccart, Andy Georges, Hendrik Blockeel, Lieven Eeckhout.

The abstract of the paper reads as follows.

The performance numbers reported by benchmarking consortia and corporations provide little or no insight into the performance of applications of interest that are not part of the benchmark suite. This paper describes data transposition, a novel methodology for addressing this ubiquitous benchmarking problem. Data transposition predicts the performance for an application of interest on a target machine based on its performance similarities with the industry-standard benchmarks on a limited number of predictive machines. The key idea of data transposition is to exploit machine similarity rather than workload similarity as done in prior work, i.e., data transposition identifies a predictive machine that is most similar to the target machine of interest for predicting performance for the application of interest.

We demonstrate the accuracy and effectiveness of data transposition using the SPEC CPU2006 benchmarks and a set of 117 commercial machines. We report that the machine ranking obtained through data transposition correlates well with the machine ranking obtained using measured performance numbers (average correlation coefficient of 0.93). Not only does data transposition improve average correlation, we also demonstrate that data transposition is more robust towards outlier benchmarks, i.e., the worst-case correlation coefficient improves from 0.59 by prior art to 0.71. More concretely, using data transposition to predict the top-1 machine for an application of interest leads to the best performing machine for most workloads (average deficiency of 1.2\% and max deficiency of 24.8\% for one benchmark), whereas prior work leads to deficiencies over 100\% for some workloads.

You can get a preprint of the paper (subject to change for the final submitted version).

Parental leave

August 19th, 2011

During the summer of 2011, I took three months of parental leave, to which all Belgian employees are entitled. June was spent doing a lot of chores for the renovation of my home, such as putting Tasso to the walls, painting, plastering, putting up fake ceilings, etc. We also had a new kitchen installed, courtesy of Studio RL. Though unplanned, I also dedicated some days for work, finishing two papers. July and August were dedicated to the kids. The weather was not as cooperative as I had hoped, but we managed to spent quite some time on the Ostend beach.

September 1st is once more a dreaded date. Not that I do not like to go back to work, but because I really enjoyed heaps of quality time with my family. Still, traditions must be honoured, so I’ll start work with two days of leave. Thus, I still have two weeks of vacation. Except that I already seem to be doing some work in the evening :-)

CamHac: Haskell Hackathon in Cambridge, UK

August 19th, 2011

From August 12-14, 2011, Simon Marlow organised a Haskell Hackathon at Homerton College in Cambridge. Needless to say, several GhentFPG members were interested to attend. In the end, a party of four (Jeroen Janssen, Jasper Van Der Jeugt, Bart Coppens and myself) made the crossing to the UK. It was a smooth ride, even though the train in front of us stopped in the middle of the Channel Tunnel and we had to wait for about 50 minutes before our train was allowed to proceed.

We stayed at the Cambridge YHA, in a room we shared with Simon Meijer and some unnamed Belgian guy from Moeskroen, whose Dutch resembled West-Flemish quite closely. The latter was there to study during the day and party during the night. And he was not happy with the fact that we got up around 8:00 AM :-D I only discovered a shower with warm water on Sunday, having taken cold shower — I mean really cold — the previous two days. But the breakfast more than made up for that.

When signing up to CamHac, I had the plan to work on HaBench, or rather, see how Fibon could be made into what we envisioned during BelHac as a Haskell benchmark suite. However, research requirements dictated otherwise. Given that we are collaborating with researchers from the KULeuven on a project using COLE and active learning to build models for the effect of optimisation sequences, and that the COLE framework requires access to the supercomputer backend at Ghent University, I decided to write a Haskell web application using Snap that would allow the submission of optimisation sequences and get the results for the objective functions (speedup, compilation time, code size, …) back for the used benchmark suite. Having no previous experience with Snap, I found the framework to be easy to use. The core of the application was finished by Sunday afternoon; on the way home I added a watchdog thread to update the database with finished experiments.

I had a ton of fun those three days, learned a lot, and — fingers crossed — started to grok monads a bit better. The event was full, as 72 people registered. I am not sure everybody turned up, but the room was crowded at all times. And a lot of work was done, see the post-hackathon report.

Predictive Learning in Two-way Datasets

July 17th, 2011

We have a paper that is accepted in Inductive Logic Programming (ILP) 2011:

Predictive Learning in Two-way Datasets, by Beau Piccart, Hendrik Blockheel, Andy Georges and Lieven Eechout.

The abstract of the paper reads as follows:

We introduce a new learning setting, called two-way predictive learning, as a special case of relational learning. We demonstrate that this learning setting has some properties that make an alternative learning approach, which we refer to as transposed learning, possible. We show experimentally that transposed learning can yield better results in multi-target learning settings.

We have used this approach to improve on the results we obtained in out PACT 2006 paper, titled “Performance Prediction based on Inherent Program Similarity”. The full results of that research will be part of a future paper.

Invited Talks at Prague University

December 14th, 2010

At Evaluate 2010, I met Petr Tuma, an overall cool frood and a professor at Charles University in Prague, Czech Republic. He kindly invited me to give a talk at his institution on performance analysis. More to the point, I talked about the work we had been doing in the past — mostly my PhD related research — on the issues we found with prevalent experimental setups (see the interaction paper (OOPSLA 2003)), prevalent measurement and data analysis approaches (see the Stats paper (OOPSLA 2007) and the replay paper (OOPSLA 2008)). I also briefly talked about the position statement Koen De Bosschere and I had submitted to Evaluate 2010, focusing mostly on benchmark issues. The slides of the talk are available.

Next to that, I also gave a (short) talk on the metrics we devised for quantifying (throughput) performance in a consolidated system. That talk was pretty much the same I gave at HPCVirt earlier this year, so no new slides for that, excepting a few changes. In the meantime, SPEC has published SPECvirt, and they at least seem to be being this a bit better than VMmark, but we still have to wait for more published results to become available before unleashing our wrath on it, should they do things in the wrong way :-)

During this trip, I met a bunch of cool, hospitable people who are busy doing some pretty interesting research. I must confess that I had even worse issues than normal remembering names. Czech clearly is not my forte :-) After the exchange of ideas and thoughts, the trip definitely seemed to have been worthwhile.

The flight to Prague was pretty uneventful, but the return flight was delayed somewhat thanks to the small snow storm that decided to pick its timing in the worst possible manner to pass by and say hello. Our airplane first taxied to one runway, was de-iced there and then had to taxi all the way back to the start of the other runway, as the one we had previously been assigned to for takeoff had been snowed under too much, I think. BTW, Czech Airlines. Free food and a free drink plus coffee or tea. Eat that, Brussels Airlines. After the landing, we were not allowed to taxi up to the gate immediately for undisclosed reasons, but it made me have to run to catch a train. I skipped the purchase of the obligatory Diabolo ticket and filled in the Rail Pass with unauthorised writing gear (a ballpoint pen is obligatory, but I had only a regular pen with me). Luckily — unexpectedly too — the train arrived on time at Brussel-Noord and I could still purchase the Diabolo tax ticket. Thanks to the new schedule — which will prove to be bad in the coming days, I’m sure — I made it to the IC train to Ostend. The NMBS even managed to deliver me on time at my final destination. A rare feat, I agree.

The first Belgian Haskell Hackathon

November 7th, 2010

During the weekend of November 5-7, 2010, The Ghent Functional Programming Group and the ZeusWPI organised a Haskell hackathon, the first of these events to take place on Belgian soil.

The organisers

Thanks to Ghent University, we had obtained a nice venue: a refurbished coal power plant, of which we could use the upper floor, where the huge silos for storing the coals were located.

The Therminal

The event entailed three days. On Friday we had the registration, the hey-cool-to-see-you-are-here-too and general get-to-know each other, followed by a very nice talk by Miran Lipovača, the author of the ‘Learn you a Haskell for great good’ book (soon to be available from your favorite bookshop(s)). From 5 p.m. onwards, we had the Functional Programming in Industry symposium, where Duncan Coutts, Romain and Donald Steward entertained the masses with great stories about using FP in their daily work. Since we be engineers, and the tradition of our department states that talks in the Jozef Plateau-room are followed by a reception (at least, when people defend their PhD. this is the case, but let’s ignore that little trivia for the moment), we also fed the masses with a bunch of excellent (220) sandwiches. Sadly, some were left over, even though the local folks ate more than (twice) their share once it became clear that there would be leftovers. No worries, we’re well trained, given the number of PhD’s defended every year.

The weekend itself was pretty much devoted to the actual hacking. Kenneth (boegel) and I found some support for our HaBench effort in the person of Yuriy Kashnikov. After a round of project introductions, the work began in earnest only to be briefly interrupted by the announcement that food was available, kindly provided by Het Stokbroodje (we paid, of course, but still, the food was yummy and they delivered on both Saturday and Sunday). The evening saw a march with a few hackers to Julien, Ghent’s moat famous french fries place (I’m told).

On Sunday, hacking continued and in the afternoon, a few 5-minute lightning talks were held by people who considered having made significant progress. I briefly talked about HaBench (slides (PDF)). The general reception was pretty good, I believe. Clearly, a new benchmark suite is highly desirable. Finally, we wrapped up and cleaned shop before the housekeeper came round to chase us out again.

The Geeks

Position statement

October 20th, 2010

Just to open it up and release it into the wild, here’s the statement I wrote with Koen De Bosschere on several (lower level) aspects of evaluation in computer science, we would like to see addressed.

Recent work has shown a continued identification of pitfalls in conducting experimental computer science. In this position statement, we address three issues we consider important to enhance the state-of-the-art in this field: (i) the experimental design and setup, (ii) performance measurements and the subsequent analysis of results, and (iii) benchmark selection.

First, we consider the experimental setup. When designing an experiment, there are many dimensions (input set (size), heap size, garbage collector, compilers, optimisations, different VMs, etc.) to consider. In the not-so-distant past, researchers often used a single design point (e.g., a fixed heap size, one garbage collector, etc.). Recently, however, several researchers made a case in favour of considering multiple design points, and to let, for example, the heap size vary (in fixed steps) from a benchmark dependent minimum to a factor thereof. Essentially, a good experimental design should acknowledge these dimensions and their importance and make correct decisions with respect to the design points chosen for any given experiment. Some dimensions are continuous, others are discrete.

Hence, evaluating all points is not simply infeasible, sometimes it is outright impossible.

Moreover, the output value (e.g., performance as quantified by execution time) is not necessarily a continuous function of continuous input (e.g., heap size). At other times, we are only concerned with a subspace, i.e., fewer dimensions. This brings us to our first concern. Some researchers we have discussed these matters with, have stated that trying to cover the complete design space is pointless, and that they prefer to pick a single point for evaluating their new shiny idea. We beg to differ, but it is clear that the community needs to agree on which design points to used to obtain representative measurements.

Other scientific communities have since long solved the problem of experiment design. For example, in medical and social sciences, taking a representative (random) sample of the population follows a well known and widely practiced methodology. We do not believe that simply adopting these techniques is the way to go, but we can definitely learn from their methodology. We need to carefully examine how the experiment’s design space is shaped and which statistical techniques have to be used for choosing the correct (in some sense) points in it. For this, techniques borrowed from the machine learning community might be employed, where researchers choose data points that have most impact on a model they wish to build. Clearly, we need to pick more points from regions in the design space where the output function (e.g., performance) changes rapidly or where it is discontinuous.

The second issue involves measurement and analysis of data. Computer systems tend to exhibit mildly chaotic behaviour, e.g. programs may behave non-deterministically. We have argued in the past to add more statistical rigour to measurements and data analysis. It seems that the community is slowly embracing this idea, but there are a few obstacles holding researchers and practitioners back. In our opinion, there are at least two issues that must be addressed. The first and foremost problem is the lack of frameworks that automate the hard work — nobody likes boilerplate. Second, doing elaborate experiments takes time and resources. Computing resources are often scarce. Even when an experiment is embarrassingly parallel, researchers do not always have the resources required to exploit (most of) this parallelism. Automation is the key to solving these problems, yet any approach must be fully aware of the pitfalls we uncovered and deal with them adequately. A good example to follow is the recently developed Criterion framework, used by the Haskell community. (http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-l…). It improves on our own work in several respects, e.g., identifying the impact of outliers on the variance, determining the required number of iterations and the number of experiments to be conducted. It is usable for both micro and macro-benchmarks. The community should adopt these practices, and take them to the next level: non-normal distributed measurements, check autocorrelation, etc.

Third, concerning benchmark suites, there are two issues we will briefly discuss. Current benchmarks suites are sets (in the mathematical sense) of programs and their inputs. We propose to add an (objective) ordering to these sets, for example how much each benchmark adds to the coverage of the space of computer programs, spanned by some (again, objective) metrics. Ideally, these metrics should be machine and language independent, though this seems hard to achieve. Reducing the demands, we can start with a set of micro-architecture independent metrics, and see how well programs from different language map in the space. When for some reason a subset is used, the experiment should always include the top ranked benchmark, without skipping benchmarks. In such a scenario, it seems necessary to define the minimum number of benchmarks that should be evaluated, e.g., k >= 5. Also, to avoid focusing only on those benchmarks, we require the code and setup to be made public such that reviewers and potential users can evaluate further.

The second important issue — closely linked to the benchmark ordering — involves deciding when an experiment shows positive evidence for the evaluated technique or innovation. Currently, reviewers check the mean (often the wrong one) and see if the approach works for (at least some of) the benchmarks. Now, if the benchmarks are both characterised and ordered, one can see for which regions the innovation works, thus showing that it is in fact useful for a (limited) class of programs — compare this to a drug that works for 70% of a population.

It is time to place experimental computer science on solid grounds with respect to both the design of an experiment as well as its measurement and evaluation. This requires the following steps. First, gauge the width and depth of the problem by uncovering all pitfalls. Then, provide researchers with tools and sound methodologies to conduct experiments.