Posts Tagged ‘cgo’

Code Generation and Optimisation

Wednesday, April 28th, 2010

I have had the pleasure to attend the conference on Code Generation and Optimisation this week in Toronto, where Kenneth Hoste presented our paper. The conference was pretty amazing. I must admit that compilers and such are not my forte (at least, the nitty gritty details are not), but there were a number of talks that really stood out IMO.

To begin with, Ben Zorn’s keynote “Performance is dead, long live performance”. I’ll never be best friends with Microsoft, but the talk was very interesting and overall, Ben’s cool. In the talk Ben showed that people have been losing focus on performance, shifting to correctness over the past 5-10 years. This was in stark contrast to the 1990’s where papers really focused on performance. This seems like an important trend, but, as the title reveals, performance is still important, even if not the most prominent item. The reason is that performance sells. People use tools and gadgets that have good performance (the comparison between the iPhone and a Windows Mobile phone was made, guess who won?). The important point was here that there is a tradeoff. People are willing to sacrifice performance for security and vice versa.

In the second keynote talk, CJ Newburn showed that there are two sides to heterogeneity: we have the system, and the tool infrastrcuture. The former, we want to make heterogeneous, but we do not want to have it visible to the user, i.e., the infrastructure should paint a homogeneous picture. The examples given include the various extensions to the ISA that have made their way into the CPU core, starting from floating point operations to SSE. The most important takeaway was that the masses want it simple, since the broad base of programmers focus oh what to do, not how to do it. So there should be a homogeneous interface, but also hooks to each of the underlying heterogeneous components such that skilled programmers can get down and dirty.

The first talk was also the winner of the best paper award: PinPlay, on a replay system on top of PIN. I do not know at this point if I agree with this setup, but the work seems pretty impressive. It was presented by Christiano Pereira. The tool aims to capture a number of events, including thread scheduling and system call results. Supposedly it is OS agnostic, and HW agnostic, so application can be recorded on one platform and replayed on another platform. This does need some extra support, for example, to address different thread scheduling mechanisms across OSses. There were three major use cases: (1) checkpointing and sampling for simulation, (2) sharing workloads between people, by providing the logs and let the other party replay the application, and (3) debugging purposes. The authors also claimed it is possible to re-log, i.e., to record an execution during replay. I wondered if it would be possible to use the tool in such a way that bugs due to parallelism could be detected, since the tool itself has quite a large time-overhead. This can cause Heisenbugs to be invisible during the recording phase. However, if it would be possible to reduce overhead by recording only the thread interactions, such that the bug can manifest itself, and record the rest of the execution during the replay, I think.

The second talk I was really impressed with was on An Efficient Software Transactional Memory Using Commit-Time Invalidation, presented by Justin Gottschlich. He’s a very good presenter, and went through the story with heaps of enthusiasm. The bottom line is that one should have committing transactions invalidate, rather than validate. This means that instead of checking all reads and occasionally checking the writes, the transaction should check it’s writes and see if others are reading these elements. Then, there is no need to check the reads. Apparently, this is not always a win over validation, but when it does win (in highly contending workloads), it wins big time.

After the last session, the winner — or should I say winners, as there was a tie — of the best presentation award were chosen by vote of the remaining attendees. The two winners were Justin Gottschlich and Jason Mars. The former really is the better salesman, but the latter was holding up pretty well.

Outside of the technical program, I have learned at least one cool concept. When going to a restaurant, ask for a surprise. I think (if there’s nothing you really loathe on the menu) a great idea. Moreover I was introduced to it by Tipp Moseley, a very cool frood. Certainly one who knows where his towel is :-)

On the day of the return to the country still known as Belgium, we visited the CN Tower. I can be quite short here: totally mindblowingly awesome. It took me a few seconds before I dared to step onto the glass floor, but once on there, looking down was an amazing experience. After that, there was a slight rush to the airport. Both Kenneth and I did not manage to sleep much, if at all, and it was killing to stay awake at Schiphol while waiting for our connecting flight to Brussels.

All in all, it was very interesting, and I’m really looking forward to next year’s CGO. Chamonix, here we come!

On the use of the mean to determine the correct average value for speedup

Tuesday, April 27th, 2010

I am attending the CGO 2010 conference in Toronto at this moment. I have seen at least 10 papers pass by that are reporting speedup values over a number of benchmarks. Nothing wrong there, except for the tiny — nay, extremely — annoying fact that the authors are using the geometric mean to report average speedup across the benchmark suite.

Let me show with a simple example that this is fubar before giving the actual reason why it is wrong — even though the values are not necessarily off by much, but that is besides the point.

Consider three applications A, B and C, with respective original execution times 3, 2 and 4 (in a unit of time of your choice). After a fancy optimisation or whatever, the execution times become 1,1 and 2, respectively. So, the individual speedup values are (original / new execution time) and this yields 3, 2 and 2, respectively. Now, if we were to execute the programs A, B and C one after the other, the total execution time for the original programs is equal to 3 + 2 + 4, i.e., 9 and the total execution time for the fancy optimised versions is 4. Hence, the speedup for all the applications is 9/4, which is thus the average speedup.

So which of the three means would yield this value? Let’s take a look, shall we? The arithmetic mean of the individual speedup values yields 7/3. The geometric mean yields 12^(1/3), which equals (at four significant digits) 2.289. The harmonic mean, finally, yields 3 / (1/3 + 1/2 + 1/2) or 9/4. Well well. What an amazing coincidence! So, clearly, the geometric mean is not yielding the correct value of 2.25.

The reason for this is simple. First of all, the aggregate speedup is not the product of the individual speedups, so the geometric mean is not applicable. The arithmetic mean also is not applicable, since the reference value is found in the denominator.

So, dear CGO authors, the geometric mean is not the right mean to use. Think on it.

Automated Just-in-time Compiler Tuning

Monday, January 25th, 2010

I’m pretty excited about our paper to get accepted at CGO 2010, which takes place in Toronto, Canada.

Automated Just-in-time Compiler tuning, Kenneth Hoste, Andy Georges, and Lieven Eeckhout.

The abstract of the paper reads as follows.

Managed runtime systems, such as a Java virtual machine (JVM), are complex pieces of software with many interacting components. The Just-In-Time (JIT) compiler is at the core of the virtual machine, however, tuning the compiler for optimum performance is a challenging task. There are (i) many compiler optimizations and options, (ii) there may be multiple optimization levels (e.g., -O0, -O1, -O2), each with a specific optimization plan consisting of a collection of optimizations, (iii) the Adaptive Optimization System (AOS) that decides which method to optimize to which optimization level requires fine-tuning, and (iv) the effectiveness of the optimizations depends on the application as well as on the hardware platform. Current practice is to manually tune the JIT compiler which is both tedious and very time-consuming, and in addition may lead to suboptimal performance.

This paper proposes automated tuning of the JIT compiler through multi-objective evolutionary search. The proposed framework (i) identifies optimization plans that are Pareto-optimal in terms of compilation time and code quality, (ii) assigns these plans to optimization levels, and (iii) fine-tunes the AOS accordingly. The key benefit of our framework is that it automates the entire exploration process, which enables tuning the JIT compiler for a given hardware platform and/or application at very low cost.

By automatically tuning Jikes RVM using our framework for average performance across the DaCapo and SPECjvm98 benchmark suites, we achieve similar performance to the hand-tuned default Jikes RVM. When optimizing the JIT compiler for individual benchmarks, we achieve statistically significant speedups for most benchmarks, up to 40% for start-up and up to 19% for steady-state performance. We also show that tuning the JIT compiler for a new hardware platform can yield significantly better performance compared to using a JIT compiler that was tuned for another platform.

You can get a preprint of the paper. We also plan to make our tool available, so it can be used to automagically tune other VMs *cough* J9 *cough*.