Code Generation and Optimisation

April 28th, 2010

I have had the pleasure to attend the conference on Code Generation and Optimisation this week in Toronto, where Kenneth Hoste presented our paper. The conference was pretty amazing. I must admit that compilers and such are not my forte (at least, the nitty gritty details are not), but there were a number of talks that really stood out IMO.

To begin with, Ben Zorn’s keynote “Performance is dead, long live performance”. I’ll never be best friends with Microsoft, but the talk was very interesting and overall, Ben’s cool. In the talk Ben showed that people have been losing focus on performance, shifting to correctness over the past 5-10 years. This was in stark contrast to the 1990’s where papers really focused on performance. This seems like an important trend, but, as the title reveals, performance is still important, even if not the most prominent item. The reason is that performance sells. People use tools and gadgets that have good performance (the comparison between the iPhone and a Windows Mobile phone was made, guess who won?). The important point was here that there is a tradeoff. People are willing to sacrifice performance for security and vice versa.

In the second keynote talk, CJ Newburn showed that there are two sides to heterogeneity: we have the system, and the tool infrastrcuture. The former, we want to make heterogeneous, but we do not want to have it visible to the user, i.e., the infrastructure should paint a homogeneous picture. The examples given include the various extensions to the ISA that have made their way into the CPU core, starting from floating point operations to SSE. The most important takeaway was that the masses want it simple, since the broad base of programmers focus oh what to do, not how to do it. So there should be a homogeneous interface, but also hooks to each of the underlying heterogeneous components such that skilled programmers can get down and dirty.

The first talk was also the winner of the best paper award: PinPlay, on a replay system on top of PIN. I do not know at this point if I agree with this setup, but the work seems pretty impressive. It was presented by Christiano Pereira. The tool aims to capture a number of events, including thread scheduling and system call results. Supposedly it is OS agnostic, and HW agnostic, so application can be recorded on one platform and replayed on another platform. This does need some extra support, for example, to address different thread scheduling mechanisms across OSses. There were three major use cases: (1) checkpointing and sampling for simulation, (2) sharing workloads between people, by providing the logs and let the other party replay the application, and (3) debugging purposes. The authors also claimed it is possible to re-log, i.e., to record an execution during replay. I wondered if it would be possible to use the tool in such a way that bugs due to parallelism could be detected, since the tool itself has quite a large time-overhead. This can cause Heisenbugs to be invisible during the recording phase. However, if it would be possible to reduce overhead by recording only the thread interactions, such that the bug can manifest itself, and record the rest of the execution during the replay, I think.

The second talk I was really impressed with was on An Efficient Software Transactional Memory Using Commit-Time Invalidation, presented by Justin Gottschlich. He’s a very good presenter, and went through the story with heaps of enthusiasm. The bottom line is that one should have committing transactions invalidate, rather than validate. This means that instead of checking all reads and occasionally checking the writes, the transaction should check it’s writes and see if others are reading these elements. Then, there is no need to check the reads. Apparently, this is not always a win over validation, but when it does win (in highly contending workloads), it wins big time.

After the last session, the winner — or should I say winners, as there was a tie — of the best presentation award were chosen by vote of the remaining attendees. The two winners were Justin Gottschlich and Jason Mars. The former really is the better salesman, but the latter was holding up pretty well.

Outside of the technical program, I have learned at least one cool concept. When going to a restaurant, ask for a surprise. I think (if there’s nothing you really loathe on the menu) a great idea. Moreover I was introduced to it by Tipp Moseley, a very cool frood. Certainly one who knows where his towel is :-)

On the day of the return to the country still known as Belgium, we visited the CN Tower. I can be quite short here: totally mindblowingly awesome. It took me a few seconds before I dared to step onto the glass floor, but once on there, looking down was an amazing experience. After that, there was a slight rush to the airport. Both Kenneth and I did not manage to sleep much, if at all, and it was killing to stay awake at Schiphol while waiting for our connecting flight to Brussels.

All in all, it was very interesting, and I’m really looking forward to next year’s CGO. Chamonix, here we come!

On the use of the mean to determine the correct average value for speedup

April 27th, 2010

I am attending the CGO 2010 conference in Toronto at this moment. I have seen at least 10 papers pass by that are reporting speedup values over a number of benchmarks. Nothing wrong there, except for the tiny — nay, extremely — annoying fact that the authors are using the geometric mean to report average speedup across the benchmark suite.

Let me show with a simple example that this is fubar before giving the actual reason why it is wrong — even though the values are not necessarily off by much, but that is besides the point.

Consider three applications A, B and C, with respective original execution times 3, 2 and 4 (in a unit of time of your choice). After a fancy optimisation or whatever, the execution times become 1,1 and 2, respectively. So, the individual speedup values are (original / new execution time) and this yields 3, 2 and 2, respectively. Now, if we were to execute the programs A, B and C one after the other, the total execution time for the original programs is equal to 3 + 2 + 4, i.e., 9 and the total execution time for the fancy optimised versions is 4. Hence, the speedup for all the applications is 9/4, which is thus the average speedup.

So which of the three means would yield this value? Let’s take a look, shall we? The arithmetic mean of the individual speedup values yields 7/3. The geometric mean yields 12^(1/3), which equals (at four significant digits) 2.289. The harmonic mean, finally, yields 3 / (1/3 + 1/2 + 1/2) or 9/4. Well well. What an amazing coincidence! So, clearly, the geometric mean is not yielding the correct value of 2.25.

The reason for this is simple. First of all, the aggregate speedup is not the product of the individual speedups, so the geometric mean is not applicable. The arithmetic mean also is not applicable, since the reference value is found in the denominator.

So, dear CGO authors, the geometric mean is not the right mean to use. Think on it.

Performance Metrics for Consolidated Servers

April 13th, 2010

The following paper got accepted for the HPCVirt 2010 workshop, taking place in Paris, France (duh).

Performance Metrics for Consolidated Servers, Andy Georges, and Lieven Eeckhout.

The abstract of the paper reads as follows:

In spite of the widespread adoption of virtualization and consol- idation, there exists no consensus with respect to how to bench- mark consolidated servers that run multiple guest VMs on the same physical hardware. For example, VMware proposes VMmark which basically computes the geometric mean of normalized throughput values across the VMs; Intel uses vConsolidate which reports a weighted arithmetic average of normalized throughput values.

These benchmarking methodologies focus on total system through- put (i.e., across all VMs in the system), and do not take into account per-VM performance. We argue that a benchmarking methodology for consolidated servers should quantify both total system through- put and per-VM performance in order to provide a meaningful and precise performance characterization. We therefore present two performance metrics, Total Normalized Throughput (TNT) to characterize total system performance, and Average Normalized Reduced Throughput (ANRT) to characterize per-VM performance.

We compare TNT and ANRT against VMmark using published performance numbers, and report several cases for which the VM- mark score is misleading. This is, VMmark says one platform yields better performance than another, however, TNT and ANRT show that both platforms represent different trade-offs in total system throughput versus per-VM performance. Or, even worse, in a cou- ple cases we observe that VMmark yields opposite conclusions than TNT and ANRT, i.e., VMmark says one system performs better than another one which is contradicted by the TNT/ANRT performance characterization.

You can find a preprint to the full paper. The presentation slides are up here.

HPCVirt 2010 workshop

April 13th, 2010

The paper that got rejected at VEE — due to not good reasons, if you ask me — made it in for HPCVirt, which is the EuroSys workshop for High Performance Computing in virtualised environments. So I am currently residing in Paris for both the workshop and the main EuroSys conference. The former comprises only 4 talks, two of which have been fairly disappointing so far. I am not sure how the other attendants feel, but it is my impression that our performance metrics paper generated a lot more interest and discussion than the other papers :-/

Except for the last paper. ‘Experiences booting 10M virtual machines’ was pretty much kick-ass. The authors tried to simulate a botnet, reportedly due to the dangerous nature of buying the (GPL, ahem) botnet source code and then publishing about it. So they designed their own lightweight VM, which basically uses 20M, and which has a manager that copies the files needed to execute an app (shared libraries and all that) to a ramdisk upo execution, and throws it away afterwards.

From what went on between, I do not recall that much (I did take notes :-) . The ending panel discussion, was pretty interesting until it started to die. Points raised were, amongst others, why do we need anything besides container based virtualisation, and what about QoS.

Moving hMollom to a monadic base

April 12th, 2010

In a move to finally grok monads and monad stack a bit better, I decided to move the hMollom code to a monad based implementation. The exported functions now execute in the MollomMonad, which at this point is defined as follows:


type MollomState = ReaderT MollomConfiguration (StateT MollomServerList IO)
type MollomMonad = ErrorT MollomError (StateT (Maybe SessionID) MollomState)

The main concept of the code remains the same, each exported function calls the service function. The latter returns a MollomState a, to capture the fact that during the Mollom request, we potentially need to update the list of servers we can query with API calls. This result is then lifted into the MollomMonad stack by using the following function.


returnStateT a = StateT $ \s -> liftM (flip (,) s) a

This is illustrated in the following control flow figure (for a full sized image, follow the link to flickr):

Control flow in hMollom

The error handling is done by usage of the ErrorT monad transformer, which add the possibility to return a Left MollomError as function value, indicating a fault during the handling of the request. At this point, I have not added a decent runMollomMonad function, as I am still pondering which functions may be most useful for the users of this library. However, you can easily employ the following function, where defaultMollomConfiguration :: MollomConfiguration and we start without any servers that are available.


(\m -> (runStateT $ (runReaderT $ (runStateT . runErrorT $ m) (Nothing) ) defaultMollomConfig) UninitialisedServerList)

As usual, the code is available from both the GitHub repository, as well as from Hackage. All feedback is welcome. Thanks to #haskell and #ghentfpg for bearing with my questions.

Virtual Execution Environments 2010

March 18th, 2010

The past few days, I’ve attended VEE 2010. I registered prior to receiving the definite decision on the paper we submitted — it was rejected (with weak reviews IMHO, but still rejected is rejected and nothing you can do about it). So I did not really have an idea on what the program would be about. Turns out, it’s petty OK, even though most of it is besides my general area of interest/work.

Nonetheless, the opening keynote talk was pretty interesting, if only for the take Peter Chen has on education and teaching systems. The bottom line is: challenge students with something difficult and give them the tools and help to get there. His first year students are allowed to pick anything to implement as a system, provided it is fun and not completely impossible. However, it seems to me that the requirements in teaching assistants is pretty high. Peter said that undergrad students are assisting, and that is one path we cannot take in Belgium. Students must study, not assist. I do like the bottom up approach, but I wonder how feasible it would be, given that our students have trouble grokking pushl 4(%eax). But still, given a small Altera board, a simple instruction set ABI, and simple rules, they managed to do nifty stuff. On example was a Guitar Hero clone, but using a real instrument, where the system recognised the note that was played. Another example included a game, driven by voice (tone, pitch, length) to fly a chopper through a 2D cave. The conclusion: an enlightening talk, showing that maybe we underestimate our students too often.

The first session was on debugging, and the most interesting talk was on a nifty VMM based replay system. The major improvement over what I recall from the work Frank Cornelis did at ELIS, was that these guys reduce the amount of data to retain, while still following all of the execution. Furthermore, they show that a replay system is but a state machine, and such a state machine can be built at user level, taking the (transformed) VMM recorded data and shoving it to the replayed process. Basically they allow replaying on any machine, and they are completely OS/app agnostic. Hence sending bug reports should be transparent enough to allow users to have the recording on at all times.

The keynote of the second day revolved around Singularity. Yes, the stuff from Microsoft. Now, everybody who knows me, knows that I’m no fan of that particular company, but I do dig their research department. This was followed by a talk on XIR trying to decouple the JIT compiler from the runtime system, or at least to define the interface in a better way. The second paper presented this morning was VMKit, which was pretty cool. Using standard software components (LLVM, MMtk, and libpthread), the authors define an interface between these components and to the higher level language. They showed you can easily implement a Java and a .NET VM on top to VMKit, with reasonable performance. J3 is around 2x slower than Jikes RVM, whereas N3 is comparable to Mono. I think this can be a very handy toolkit to move our COLE research to.

Google also had an invited talk on their new gadget to run native code inline in the browser. There are some restrictions, such as not all assembly instructions are allowed, which makes sense, but it also limits what applications you can run natively. One of the examples was Quake, which made me think the tool was pretty cool. Yet, we already have a native Quake Arena type of plugin. Second, at this point it is unclear what other application types you would want to run, but I can see that with ChromeOS being released onto the (unsuspecting?) world, having something that allows classical apps to be transported to the browser OS, is a nifty thing indeed.

In the afternoon, there were some more talk on VMMs, the first of which presented NEON , which uses the VMM and QEMU to see where tainted data is going. It was pretty cool that they can follow copies of the data, even as they are copied using copy ’n paste like actions by the user, to prohibit sensitive material to be send outside of the trusted zone. One remark was that typically, people see that everything gets a taint after a while, but Qing Zhang claimed this was not the case in their experiments, which, admittedly has not been running for months on end. I thought it was a pretty cool thing to do, and it shows that perhaps the notion I have of measuring QoS VMM side is not such a bad idea. Or at least not completely unfeasible.

The last talk on Thursday which kicked major ass was on AASH, which is an Asymmetric Aware Scheduler for Hypervisors. The premise was that the hypervisor (Xen, in particular) is not aware of the possibility that the machine potentially has a heterogeneous array of CPUs or cores. For example, a new system might feature a Corei7 and several Atoms. The former being used for CPU bound workloads, the latter being used for memory bound workloads. The authors extended the Xen credit scheduler with a second type of credits, i.e., fast credits, to be split among so-called fast VCPUs, whereas the original credits (the slow credits) are then shared among the slow VCPUs. When a fast VCPU runs out of fast credits for the current accounting time, they are moved to a slow CPU. I think it’s a pretty brilliant and simple idea. The results are indicative of the fact that this works well in practice, at least for SPEC CPU 2000 benchmarks (crafty being CPU bound, mcf and equake classified as memory bound). The only thing I found to be subpar was the use of execution time for measuring how well these benchmarks were doing under the dual credit scheduler. In a multiprogram system, per-program execution time is not the right metric. Instead, one should use STP and ANTT (Eyerman and Eeckhout, MICRO 2008) for execution time based workloads or TNT and ANRT (Georges and Eeckhout, HPCVirt 2010) for throughput based workloads.

In the Friday Java session, the allocation site tracking was the most impressive paper, in my opinion. Again, a fairly simple idea, but it does seem to help against memory leaks. Up to now, Mike Bond et al. have proposed a number of approached to deal with semantic memory leaks, but this paper seems to provide the means of locating them, without too much overhead. The key idea is that one replaces some field, or part thereof, in the object header to identify the allocation site where the object was allocated. Upon GC, as each live object is walked, statistics can be assembled and the allocation sites with the most surviving objects may indicate leaks, provided the heap usage keeps growing after each global GC. The authors define two options, using the hash code header field, or it this is not present, the class pointer in the object header. They adequately deal with all potential problems, such as the hash code field colliding (they introduce a random value and shrink the allocation site ID size for hot sites).

The last few talks on binary translation we, put bluntly, hard to follow. So that about wrapped it up for VEE. I had a number of interesting chats with people from VMware, Oracle (formerly BEA, and it seems they still prefer that), and the ubiquitous Steve Blackburn. If all goes well, I’ll be back next year.

MacJournal or Ecto?

March 10th, 2010

This year’s MacHeistcontains MacJournal, a journaling app that apparently also lets you post to (at least) a Wordpress blog. Up to now, I had been trying Ecto, which at first sights seems to be a bit more powerful, but perhaps too powerful for my simple needs. Still, MacHeist is priced at $19.95, whereas Ecto costs $19.95 in itself. I’m not sure if I will ever use the other Heist apps, so it’s a tough call at this point. On the other hand, MacJournal can be used for other things besides blogging. Such as keeping track of stuff you did at work, for which I now use dead-tree type recording. Aargh! The choices!

After trying to post it does seem that adding keywords is not possible, so that’s definitely a minus for MacJournal.

Update: Still need to figure out how to enter raw HTML :-/

hMollom

February 26th, 2010

You may or may not be aware of the excellent anti-spam service provided by Mollom, founded by two friends and colleagues, Dries Buytaert and Benjamin Schrauwen. Mollom provides an API to program against, allowing support libraries to be written in a variety of languages and for numerous platforms. Top-notch examples are the Drupal plugin (obviously :-) and the WordPress plugin. Next to that there are a number of libraries for Java, PHP5, Ruby, Python, etc. Sadly, a Haskell library was lacking.

Given that I enjoy programming in Haskell and need to sharpen my Haskell-fu, I set out to write a library for interfacing with the Mollom API. The library is far from finished, but it is functional at this point, so if you have Mollom keys, you can call the services and fight the spam war on your Haskell driven website.

I maintain two repositories where you can get the library:

At this point the library has been given version 0.1. So it is still lacking quite a lot of features. Stuff that is on the immediate TODO list:

  • Add fault handling
  • Add server list refreshing
  • Encapsulate the state in a monad

Any feedback is appreciated.

PyMollom moved to GitHub

February 8th, 2010

Just a quick update. I have moved the repository for PyMollom over to GitHub. If you are unfamiliar with git, you can check out the library using git clone git://github.com/itkovian/PyMollom.git. At this point, the library is still the same, but I do plan on integrating the new language detection ASAP.

New door opens, old door closes

February 1st, 2010

For the past three years, I guess, Veerle has been trying to get me to agree to close up the living room door and make a new door in the dining room. Last fall, I finally caved. I saw the light, she would put it :-) We discussed our plans with the guy who will be taking charge in redoing our attic, and signed the agreement last November. Half December he gave us a call, and we agreed upon having the job done by the end of January. On the 29th, the demolition man came by and tore a new hole in our wall. Using a water to limit the amount of dust the wall was split apart at the right spot. Three hours later, everything was cleared and the big cleanup could begin. Depsite the water, dust had spread throughout the living room, partially spoiling my plans for getting some work done. With spurious amounts of water and soap, the living room was cleaned, and along with it, all the toys, books, and decoration.

I tore down the old door, with Nathan’s help — he’s scared of loud noise, so I figured that if I could get him to help me out, he would at least stop crying. And he did. with combined forces, we yanked the door from the wall, and replaced it with a brand new wooden framework, against which I put 18mm of multiplex, covered with 9.5 mm of plaster plate (gyproc). A bucket of Knauf Goldband later and the doorway was effectively sealed shut.

I plastered the new door hole, using metal corners to work against — my plastering skillz are pretty good, but only for small areas, i.e., no wider than my tool is. So, without further ado, behold the new door and the sealed old door.

Changing doors
Changing doors

Needless to say, we had to move some furniture around :-)

Changing doors

Mind, a new glass sliding door will be installed shortly. when the money for it arrives on my bank account.