Archive for the ‘research’ Category

Position statement

Wednesday, October 20th, 2010

Just to open it up and release it into the wild, here’s the statement I wrote with Koen De Bosschere on several (lower level) aspects of evaluation in computer science, we would like to see addressed.

Recent work has shown a continued identification of pitfalls in conducting experimental computer science. In this position statement, we address three issues we consider important to enhance the state-of-the-art in this field: (i) the experimental design and setup, (ii) performance measurements and the subsequent analysis of results, and (iii) benchmark selection.

First, we consider the experimental setup. When designing an experiment, there are many dimensions (input set (size), heap size, garbage collector, compilers, optimisations, different VMs, etc.) to consider. In the not-so-distant past, researchers often used a single design point (e.g., a fixed heap size, one garbage collector, etc.). Recently, however, several researchers made a case in favour of considering multiple design points, and to let, for example, the heap size vary (in fixed steps) from a benchmark dependent minimum to a factor thereof. Essentially, a good experimental design should acknowledge these dimensions and their importance and make correct decisions with respect to the design points chosen for any given experiment. Some dimensions are continuous, others are discrete.

Hence, evaluating all points is not simply infeasible, sometimes it is outright impossible.

Moreover, the output value (e.g., performance as quantified by execution time) is not necessarily a continuous function of continuous input (e.g., heap size). At other times, we are only concerned with a subspace, i.e., fewer dimensions. This brings us to our first concern. Some researchers we have discussed these matters with, have stated that trying to cover the complete design space is pointless, and that they prefer to pick a single point for evaluating their new shiny idea. We beg to differ, but it is clear that the community needs to agree on which design points to used to obtain representative measurements.

Other scientific communities have since long solved the problem of experiment design. For example, in medical and social sciences, taking a representative (random) sample of the population follows a well known and widely practiced methodology. We do not believe that simply adopting these techniques is the way to go, but we can definitely learn from their methodology. We need to carefully examine how the experiment’s design space is shaped and which statistical techniques have to be used for choosing the correct (in some sense) points in it. For this, techniques borrowed from the machine learning community might be employed, where researchers choose data points that have most impact on a model they wish to build. Clearly, we need to pick more points from regions in the design space where the output function (e.g., performance) changes rapidly or where it is discontinuous.

The second issue involves measurement and analysis of data. Computer systems tend to exhibit mildly chaotic behaviour, e.g. programs may behave non-deterministically. We have argued in the past to add more statistical rigour to measurements and data analysis. It seems that the community is slowly embracing this idea, but there are a few obstacles holding researchers and practitioners back. In our opinion, there are at least two issues that must be addressed. The first and foremost problem is the lack of frameworks that automate the hard work — nobody likes boilerplate. Second, doing elaborate experiments takes time and resources. Computing resources are often scarce. Even when an experiment is embarrassingly parallel, researchers do not always have the resources required to exploit (most of) this parallelism. Automation is the key to solving these problems, yet any approach must be fully aware of the pitfalls we uncovered and deal with them adequately. A good example to follow is the recently developed Criterion framework, used by the Haskell community. (http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-l…). It improves on our own work in several respects, e.g., identifying the impact of outliers on the variance, determining the required number of iterations and the number of experiments to be conducted. It is usable for both micro and macro-benchmarks. The community should adopt these practices, and take them to the next level: non-normal distributed measurements, check autocorrelation, etc.

Third, concerning benchmark suites, there are two issues we will briefly discuss. Current benchmarks suites are sets (in the mathematical sense) of programs and their inputs. We propose to add an (objective) ordering to these sets, for example how much each benchmark adds to the coverage of the space of computer programs, spanned by some (again, objective) metrics. Ideally, these metrics should be machine and language independent, though this seems hard to achieve. Reducing the demands, we can start with a set of micro-architecture independent metrics, and see how well programs from different language map in the space. When for some reason a subset is used, the experiment should always include the top ranked benchmark, without skipping benchmarks. In such a scenario, it seems necessary to define the minimum number of benchmarks that should be evaluated, e.g., k >= 5. Also, to avoid focusing only on those benchmarks, we require the code and setup to be made public such that reviewers and potential users can evaluate further.

The second important issue — closely linked to the benchmark ordering — involves deciding when an experiment shows positive evidence for the evaluated technique or innovation. Currently, reviewers check the mean (often the wrong one) and see if the approach works for (at least some of) the benchmarks. Now, if the benchmarks are both characterised and ordered, one can see for which regions the innovation works, thus showing that it is in fact useful for a (limited) class of programs — compare this to a drug that works for 70% of a population.

It is time to place experimental computer science on solid grounds with respect to both the design of an experiment as well as its measurement and evaluation. This requires the following steps. First, gauge the width and depth of the problem by uncovering all pitfalls. Then, provide researchers with tools and sound methodologies to conduct experiments.

Evaluate 2010

Tuesday, October 19th, 2010

During SPLASHcon, Matthias Hausswirth, Peter Sweeney, Steve Blackburn and Amer Diwan organised the Evaluate 2010 workshop to discuss prevalent issues with evaluation in the papers the computer science community outputs. Even though the attendees mostly are embedded in the performance and language communities, the workshop aimed to be fairly open to all aspects of computer science.

The workshop opened with a very strong keynote talk by Cliff Click. Being his usual self, he did not fear controversy, he showed us why he often did not believe the numbers and evaluations that are published in papers over the last decade. Part of the reason is that the ideas have been tried in e.g., Hotspot, and were shown to have either less or more impact. Another part is that the evaluation is just irrelevant, since the idea has been implemented, but not published. The criticism I have here is that the community can hardly be expected to avoid ideas of which they do not know they have been implemented or tried. But that does not detract from the essence of his remarks. Additionally, often evaluation numbers are below the interesting threshold. For example, for consumers of academic work there is often little incentive to implement the described technique in their own tools or application if the potential gain is not over 20%. Worse, for some cases the gain has got to be at least a factor 2. Furthermore, unless an idea has been implemented in multiple systems, there is no proof or persuasive indication that it can be ported over to other systems than the one used in the paper. So, what does this mean for all our rigorous evaluations, showing that we can squeeze out that extra 5-10% from the system under scrutiny?

The keynote talk was followed by a fairly long discussion, where the following points were raised (there were others, I am summarising). First of all was the question how to gain confidence in the results. Except for doing each experiment in a statistically rigorous manner, one should also account for several other things. First of all, there is the impact of factors that are more or less outside of the control on the researcher when his technique will be deployed in the wild. Examples there are the environment, OS, hardware, etc. Key to getting solid results is randomisation in the first case, and using multiple platforms in the latter cases. However, that only accounts for the results obtained from the full technique. To ascertain that there is nothing going on under the hood that interacts with what the researcher is trying to implement, it is important that the full technique is broken down into smaller steps (if possible), that each can stand on their own and for which you expect either a bonus (e.g., speedup) or a cost (e.g., slowdown). The latter might be the case when you have implemented the extra profiling code, but did not have the JIT take advantage of this extra information. There are several issues that need to be dealt with here. Obviously the breakdown, but also the potential for measuring what has been changed and its impact. More often than not, aggregate performance numbers will not tell the whole story, nor will they provide the necessary insight. Such insight is desirable if one wants to argue why the approach is portable to other systems. Typically, as researchers, we choose a single vehicle to build a proof-of-concept. But if we want our ideas to be used, we likely should provide more evidence as to why it is any good. Having a breakdown can help in that respect, as it tells others what steps were required and if they meet the prerequisites for those steps, or if some (or all) of these steps are already been taken care of on their target platform.

As I mentioned above, there are aspects of any experimental setup that are out of our control. We can try to account for them, but we often cannot change them. However, we like to think of computer science as an engineering science, and thus a constructive science. But that is pretty much in contradiction with the fact that some things lie outside the boundary of our control. Hence, claims were made that we are more of an observational science. The consequence is that we should behave as such a discipline, and accept that some things are out of our control, yet we should take them into account. Fil Pizlo mentioned astrophysics on several occasions, but did not really detail what he meant by that.

Now, if we want to grow up as a scientific discipline, we need to accept that there are shortcoming in our current way of working that need to be fixed. The keynote clearly motivated the need for several of these: (i) the possibility of (relatively) easily replicate existing research results — for example, to allow reviewers to confirm conclusions — as independent experiments are a cornerstone of modern science, (ii) reproducibility of techniques and results — same approach, different setup style, e.g., another JVM, another OS, VMM, etc. — since this confirms that the idea is also generalisable, and (iii) the possibility of refuting earlier work through negative results. None of these are going to happen as much as they should without some incentive system being in place. The reality is, of course, that any PhD student, professor, or anybody else in academia, except for a select few, have to keep increasing their publication output, and the above studies simply do not have a venue where they can be published. Blogs etc. do not count here, for obvious reasons. Now, the idea is not to have small 1-2 page papers confirming or refuting, but to have full-blown papers, that have the same status as any other papers published at the same venue. At this point the ACM digital library offers subscribers the option of reviewing existing papers, but that is not completely the same. Moreover, it does not allow a researcher to add something of significance to his resume. So, such papers should be able to hold their own and be judged on their own merit.

Full disclosure is another point that was often brought up. Even when rigorous statistics are used, the summarising numbers do not tell everything about the experiment. A first step towards the expected disclosure is making the full dataset available. This at least allows other people to redo the analysis even if they do not redo the actual experiments. A step further is providing the code. This might not be feasible for industrial contributors, where code might have to be protected or when researchers have signed NDAs with some industrial partner. However, this should only impact a minority of the cases, I think. Thus, disclosure of the code. On top of that, it is preferable that the scripts, fully detailed setup, etc. are made available. There is one big caveat. Sometimes setups are too large to be easily replicated, which may impact the ability to replicate the work by an independent researcher.

I think there were several remarks that carried quite some weight. The first, obviously, was made during the keynote, namely that sometimes the improvements are not large enough to warrant implementation in a production environment. Complementary to that remark it was suggested that perhaps our efforts are simply too feeble to warrant even replication or reproduction. Third, simply reproducing for the sake of numbers is not the desired attitude. Numbers do not mean anything in themselves, but through evaluation we should gain understanding, and this understanding should be deepened by reproduction and replication of past work. Essentially, reproductions should be a qualitative issue, not a quantitative. Put differently, negative results also count, but not all of them! There was some controversy over when a negative result should be published and when it can be dismissed as uninteresting. I guess that is up to the reviewers.

In the first afternoon session, we discussed in depth several issues that we deemed important and that should be fixed in the (preferably near) future. The subjects were mingling at several occasions, but I think that’s perfectly normal. After all, we were doing this for the first time, so we needed to still wrap our heads around all the possible angles to tackle the problem.

What is the purpose of evaluation? Is it simply to measure a system and report those results? Essentially, as I touched upon above, the ultimate goal is to gain insight. Why is something better, why do we see an improvement over existing approaches? And are these improvements solely due to our technique, or are there other effects at work?

An interesting idea was raised by Fil Pizlo, namely — he elaborated on that over dinner — that science should start from some falsifiable hypothesis, since one cannot prove something right with experimentation, one can only prove it wrong. Hence, everything we know from science is ultimately false. But let us not take it that far. Still, the hypothesis idea held. Mike Hind suggested that the evaluation should provide some first indication that an idea works, once more bringing up the notion of replication and reproduction. Additionally, since ideas all have drawbacks, the motion was put forward that papers should be honest up front: state what are the good and the bad points. I agree, but then there needs to be some way to avoid PC members solely using the negative points pointed out in the paper to shoot it down. Finally, we concluded that any paper should provide insights into why the idea is applicable in general. Now, some ideas are very niche-centered, but still, they might apply in some form to things outside the niche.

After some slides presented by Fil and Eric on perturbation and chaos, respectively, we turned our heads back to reproduction. All of the above applied. One interesting notion that was raised was the idea of cross-validation. I am not convinced that this is applicable in all cases, but whenever it is possible it certainly is a good idea, though this requires to have either multiple benchmark suites, or use a leave-one-out approach. Obviously, different domains have different requirements and possibilities, so we cannot simply judge all evaluations by the same set of rules, but they should at least obey to the gist of the best practices we will put forward.

One of the key issues is to create sufficient incentives to get researchers to do exactly what is commonly done in other sciences, and we talked about that at length during the second afternoon session. Part of the debate was wether we feel we should publish in journals versus conferences. While I feel that’s a worthy debate, I recon it is pretty orthogonal to what we are trying to get reviewers to be acceptive of.

The next steps are pretty straightforward. We have agreed upon a number of things, even though I had liked to see us discuss some lower level things as well, such as benchmarks, metrics and statistics. I felt there was not really much room for debate along these lines at this point, so I guess most of my own position statement still is open to debate.

  1. Contact top-level conference PC chairs and provide them with an open letter where we state our position.
  2. Make people aware of the issues involving evaluation.
  3. Find like-minded researchers that are willing to support this cause.
  4. Write a more in-depth article to give even more visibility.
  5. Have a follow-up workshop in the near future.

I really hope this will help somewhat improve the sorry state of evaluation in our field, as it is high time we wake up.

Code Generation and Optimisation

Wednesday, April 28th, 2010

I have had the pleasure to attend the conference on Code Generation and Optimisation this week in Toronto, where Kenneth Hoste presented our paper. The conference was pretty amazing. I must admit that compilers and such are not my forte (at least, the nitty gritty details are not), but there were a number of talks that really stood out IMO.

To begin with, Ben Zorn’s keynote “Performance is dead, long live performance”. I’ll never be best friends with Microsoft, but the talk was very interesting and overall, Ben’s cool. In the talk Ben showed that people have been losing focus on performance, shifting to correctness over the past 5-10 years. This was in stark contrast to the 1990’s where papers really focused on performance. This seems like an important trend, but, as the title reveals, performance is still important, even if not the most prominent item. The reason is that performance sells. People use tools and gadgets that have good performance (the comparison between the iPhone and a Windows Mobile phone was made, guess who won?). The important point was here that there is a tradeoff. People are willing to sacrifice performance for security and vice versa.

In the second keynote talk, CJ Newburn showed that there are two sides to heterogeneity: we have the system, and the tool infrastrcuture. The former, we want to make heterogeneous, but we do not want to have it visible to the user, i.e., the infrastructure should paint a homogeneous picture. The examples given include the various extensions to the ISA that have made their way into the CPU core, starting from floating point operations to SSE. The most important takeaway was that the masses want it simple, since the broad base of programmers focus oh what to do, not how to do it. So there should be a homogeneous interface, but also hooks to each of the underlying heterogeneous components such that skilled programmers can get down and dirty.

The first talk was also the winner of the best paper award: PinPlay, on a replay system on top of PIN. I do not know at this point if I agree with this setup, but the work seems pretty impressive. It was presented by Christiano Pereira. The tool aims to capture a number of events, including thread scheduling and system call results. Supposedly it is OS agnostic, and HW agnostic, so application can be recorded on one platform and replayed on another platform. This does need some extra support, for example, to address different thread scheduling mechanisms across OSses. There were three major use cases: (1) checkpointing and sampling for simulation, (2) sharing workloads between people, by providing the logs and let the other party replay the application, and (3) debugging purposes. The authors also claimed it is possible to re-log, i.e., to record an execution during replay. I wondered if it would be possible to use the tool in such a way that bugs due to parallelism could be detected, since the tool itself has quite a large time-overhead. This can cause Heisenbugs to be invisible during the recording phase. However, if it would be possible to reduce overhead by recording only the thread interactions, such that the bug can manifest itself, and record the rest of the execution during the replay, I think.

The second talk I was really impressed with was on An Efficient Software Transactional Memory Using Commit-Time Invalidation, presented by Justin Gottschlich. He’s a very good presenter, and went through the story with heaps of enthusiasm. The bottom line is that one should have committing transactions invalidate, rather than validate. This means that instead of checking all reads and occasionally checking the writes, the transaction should check it’s writes and see if others are reading these elements. Then, there is no need to check the reads. Apparently, this is not always a win over validation, but when it does win (in highly contending workloads), it wins big time.

After the last session, the winner — or should I say winners, as there was a tie — of the best presentation award were chosen by vote of the remaining attendees. The two winners were Justin Gottschlich and Jason Mars. The former really is the better salesman, but the latter was holding up pretty well.

Outside of the technical program, I have learned at least one cool concept. When going to a restaurant, ask for a surprise. I think (if there’s nothing you really loathe on the menu) a great idea. Moreover I was introduced to it by Tipp Moseley, a very cool frood. Certainly one who knows where his towel is :-)

On the day of the return to the country still known as Belgium, we visited the CN Tower. I can be quite short here: totally mindblowingly awesome. It took me a few seconds before I dared to step onto the glass floor, but once on there, looking down was an amazing experience. After that, there was a slight rush to the airport. Both Kenneth and I did not manage to sleep much, if at all, and it was killing to stay awake at Schiphol while waiting for our connecting flight to Brussels.

All in all, it was very interesting, and I’m really looking forward to next year’s CGO. Chamonix, here we come!

On the use of the mean to determine the correct average value for speedup

Tuesday, April 27th, 2010

I am attending the CGO 2010 conference in Toronto at this moment. I have seen at least 10 papers pass by that are reporting speedup values over a number of benchmarks. Nothing wrong there, except for the tiny — nay, extremely — annoying fact that the authors are using the geometric mean to report average speedup across the benchmark suite.

Let me show with a simple example that this is fubar before giving the actual reason why it is wrong — even though the values are not necessarily off by much, but that is besides the point.

Consider three applications A, B and C, with respective original execution times 3, 2 and 4 (in a unit of time of your choice). After a fancy optimisation or whatever, the execution times become 1,1 and 2, respectively. So, the individual speedup values are (original / new execution time) and this yields 3, 2 and 2, respectively. Now, if we were to execute the programs A, B and C one after the other, the total execution time for the original programs is equal to 3 + 2 + 4, i.e., 9 and the total execution time for the fancy optimised versions is 4. Hence, the speedup for all the applications is 9/4, which is thus the average speedup.

So which of the three means would yield this value? Let’s take a look, shall we? The arithmetic mean of the individual speedup values yields 7/3. The geometric mean yields 12^(1/3), which equals (at four significant digits) 2.289. The harmonic mean, finally, yields 3 / (1/3 + 1/2 + 1/2) or 9/4. Well well. What an amazing coincidence! So, clearly, the geometric mean is not yielding the correct value of 2.25.

The reason for this is simple. First of all, the aggregate speedup is not the product of the individual speedups, so the geometric mean is not applicable. The arithmetic mean also is not applicable, since the reference value is found in the denominator.

So, dear CGO authors, the geometric mean is not the right mean to use. Think on it.

Performance Metrics for Consolidated Servers

Tuesday, April 13th, 2010

The following paper got accepted for the HPCVirt 2010 workshop, taking place in Paris, France (duh).

Performance Metrics for Consolidated Servers, Andy Georges, and Lieven Eeckhout.

The abstract of the paper reads as follows:

In spite of the widespread adoption of virtualization and consol- idation, there exists no consensus with respect to how to bench- mark consolidated servers that run multiple guest VMs on the same physical hardware. For example, VMware proposes VMmark which basically computes the geometric mean of normalized throughput values across the VMs; Intel uses vConsolidate which reports a weighted arithmetic average of normalized throughput values.

These benchmarking methodologies focus on total system through- put (i.e., across all VMs in the system), and do not take into account per-VM performance. We argue that a benchmarking methodology for consolidated servers should quantify both total system through- put and per-VM performance in order to provide a meaningful and precise performance characterization. We therefore present two performance metrics, Total Normalized Throughput (TNT) to characterize total system performance, and Average Normalized Reduced Throughput (ANRT) to characterize per-VM performance.

We compare TNT and ANRT against VMmark using published performance numbers, and report several cases for which the VM- mark score is misleading. This is, VMmark says one platform yields better performance than another, however, TNT and ANRT show that both platforms represent different trade-offs in total system throughput versus per-VM performance. Or, even worse, in a cou- ple cases we observe that VMmark yields opposite conclusions than TNT and ANRT, i.e., VMmark says one system performs better than another one which is contradicted by the TNT/ANRT performance characterization.

You can find a preprint to the full paper. The presentation slides are up here.

HPCVirt 2010 workshop

Tuesday, April 13th, 2010

The paper that got rejected at VEE — due to not good reasons, if you ask me — made it in for HPCVirt, which is the EuroSys workshop for High Performance Computing in virtualised environments. So I am currently residing in Paris for both the workshop and the main EuroSys conference. The former comprises only 4 talks, two of which have been fairly disappointing so far. I am not sure how the other attendants feel, but it is my impression that our performance metrics paper generated a lot more interest and discussion than the other papers :-/

Except for the last paper. ‘Experiences booting 10M virtual machines’ was pretty much kick-ass. The authors tried to simulate a botnet, reportedly due to the dangerous nature of buying the (GPL, ahem) botnet source code and then publishing about it. So they designed their own lightweight VM, which basically uses 20M, and which has a manager that copies the files needed to execute an app (shared libraries and all that) to a ramdisk upo execution, and throws it away afterwards.

From what went on between, I do not recall that much (I did take notes :-) . The ending panel discussion, was pretty interesting until it started to die. Points raised were, amongst others, why do we need anything besides container based virtualisation, and what about QoS.

Automated Just-in-time Compiler Tuning

Monday, January 25th, 2010

I’m pretty excited about our paper to get accepted at CGO 2010, which takes place in Toronto, Canada.

Automated Just-in-time Compiler tuning, Kenneth Hoste, Andy Georges, and Lieven Eeckhout.

The abstract of the paper reads as follows.

Managed runtime systems, such as a Java virtual machine (JVM), are complex pieces of software with many interacting components. The Just-In-Time (JIT) compiler is at the core of the virtual machine, however, tuning the compiler for optimum performance is a challenging task. There are (i) many compiler optimizations and options, (ii) there may be multiple optimization levels (e.g., -O0, -O1, -O2), each with a specific optimization plan consisting of a collection of optimizations, (iii) the Adaptive Optimization System (AOS) that decides which method to optimize to which optimization level requires fine-tuning, and (iv) the effectiveness of the optimizations depends on the application as well as on the hardware platform. Current practice is to manually tune the JIT compiler which is both tedious and very time-consuming, and in addition may lead to suboptimal performance.

This paper proposes automated tuning of the JIT compiler through multi-objective evolutionary search. The proposed framework (i) identifies optimization plans that are Pareto-optimal in terms of compilation time and code quality, (ii) assigns these plans to optimization levels, and (iii) fine-tunes the AOS accordingly. The key benefit of our framework is that it automates the entire exploration process, which enables tuning the JIT compiler for a given hardware platform and/or application at very low cost.

By automatically tuning Jikes RVM using our framework for average performance across the DaCapo and SPECjvm98 benchmark suites, we achieve similar performance to the hand-tuned default Jikes RVM. When optimizing the JIT compiler for individual benchmarks, we achieve statistically significant speedups for most benchmarks, up to 40% for start-up and up to 19% for steady-state performance. We also show that tuning the JIT compiler for a new hardware platform can yield significantly better performance compared to using a JIT compiler that was tuned for another platform.

You can get a preprint of the paper. We also plan to make our tool available, so it can be used to automagically tune other VMs *cough* J9 *cough*.

PTLSim/KVM-QEMU on Ubuntu Karmic Koala

Monday, September 28th, 2009

Ubuntu Karmic Koala comes with GCC 4.4.1, and the set of include files clashes a bit with what the KVM/QEMU port of PTLSim expects for its string library. Mostly, the issues involve missing ‘const’ modifiers to pointers. Here’s a diff though.

diff –git a/Makefile b/Makefile
index a0c25f4..87d70d6 100644
— a/Makefile
+++ b/Makefile
@@ -56,7 +56,7 @@ BUILDHOST = $(shell hostname -f)
BUILDUSER = $(USER)

CC = g++
-CFLAGS = -O99 -g3 -fpic -march=core2 -falign-functions=16 -funit-at-a-time -minline-all-stringops -fno-trapping-math -fno-exceptions -fno-rtti -mpreferred-stack-boundary=4 -fno-strict-aliasing -Wreturn-type -mno-red-zone
+CFLAGS = -O99 -g3 -fpic -march=core2 -falign-functions=16 -funit-at-a-time -minline-all-stringops -fno-trapping-math -fno-exceptions -fno-rtti -mpreferred-stack-boundary=4 -fno-strict-aliasing -Wreturn-type -mno-red-zone -fno-stack-protector
KVMVER = 86
QEMUDIR = ../qemu-kvm-ptlsim
INCFLAGS = -Iinclude -I. -I$(QEMUDIR)/kvm/include -I$(QEMUDIR)/kvm/include/x86 -DPTLSIM_HYPERVISOR -DPTLSIM_KVM -D__INSIDE_PTLSIM__ \
diff –git a/lib/klibc.cpp b/lib/klibc.cpp
index ae0f75b..3b5c5c0 100644
— a/lib/klibc.cpp
+++ b/lib/klibc.cpp
@@ -842,7 +842,7 @@ int strncmp(const char *cs, const char *ct, size_t count)
* @s: The string to be searched
* @c: The character to search for
*/
-char *strchr(const char *s, int c)
+const char *strchr(const char *s, int c)
{
for (; *s != (char)c; ++s)
if (*s == ‘\0′)
@@ -858,7 +858,7 @@ char *strchr(const char *s, int c)
* @s: The string to be searched
* @c: The character to search for
*/
-char *strrchr(const char *s, int c)
+const char *strrchr(const char *s, int c)
{
const char *p = s + strlen(s);
do {
@@ -877,7 +877,7 @@ char *strrchr(const char *s, int c)
* @count: The number of characters to be searched
* @c: The character to search for
*/
-char *strnchr(const char *s, size_t count, int c)
+const char *strnchr(const char *s, size_t count, int c)
{
for (; count– && *s != ‘\0′; ++s)
if (*s == (char)c)
@@ -976,7 +976,7 @@ size_t strcspn(const char *s, const char *reject)
* @cs: The string to be searched
* @ct: The characters to search for
*/
-char *strpbrk(const char *cs, const char *ct)
+const char *strpbrk(const char *cs, const char *ct)
{
const char *sc1, *sc2;

@@ -1146,7 +1146,7 @@ void *memscan(void *addr, int c, size_t size)
* @s1: The string to be searched
* @s2: The string to search for
*/
-char *strstr(const char *s1, const char *s2)
+const char *strstr(const char *s1, const char *s2)
{
int l1, l2;

@@ -1175,7 +1175,7 @@ char *strstr(const char *s1, const char *s2)
* returns the address of the first occurrence of @c, or %NULL
* if @c is not found
*/
-void *memchr(const void *s, int c, size_t n)
+const void *memchr(const void *s, int c, size_t n)
{
const unsigned char *p = (unsigned char*)s;
while (n– != 0) {

As you can tell, the CFLAGS definition in the Makefile was also extended with “-fno-stack-protector”, as the build with otherwise fail with a few undefined reference to `__stack_chk_fail’ errors, see also this lkml entry for more information.

The Xen hypervisor with Ubuntu Karmic Koala

Saturday, September 26th, 2009

Setting up Xen should be pretty trivial these days, however if you want to get on the edge of development, there might be some trickery involved, which is not documented very well. Overall, the PvOps wiki post on xensource does a fine job, yet several things seemed to be missing. A lot of the flotsam on blogs and mailings lists that can be googled is too old to be remotely useful.

Here is a more or less detailed report on how to set up Xen op Ubuntu Karmic Koala (9.10) on a corei7 system (amd64). Obviously the first thing to do is upgrade the Ubuntu installation to the latest version, or deploy a fresh Karmic installation.

The Linux kernel source code to use resides in a git repository, so git should be installed, using apt-get install git. You should not use the plain kernel sources, as they do not contain the Xen code for dom0, at least. Instead, use the following steps to obtain the source code

$ sudo su -
$ cd /usr/src
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git linux-2.6-xen
$ cd linux-2.6-xen
$ git checkout origin/xen/master -b xen/master
$ git pull

After you checked out the kernel, you might want to copy an existing configuration file so you can re-use the result of your hard work you did on previous installations. You should also check the file at include/linux/swiotlb.h and if necessary apply the following patch (it may already be in the git tip, so check first before applying :-)

diff –git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index cb1a663..f4ebffb 100644
— a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -2,6 +2,7 @@
#define __LINUX_SWIOTLB_H

#include <linux/types.h>
+#include <linux/dma-mapping.h>

struct device;
struct dma_attrs;

Configure the kernel, e.g., make menuconfig. Do make sure that you switch ACPI on, because the Xen dom0 config option has a dependency on ACPI and will not show up in the configuration menu’s otherwise. In the end you should have the following configuration options set:

CONFIG_XEN=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=32
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DOM0=y
CONFIG_XEN_PRIVILEGED_GUEST=y
CONFIG_XEN_DOM0_PCI=y
CONFIG_XEN_BLKDEV_FRONTEND=y
CONFIG_XEN_NETDEV_FRONTEND=y
CONFIG_XEN_KBDDEV_FRONTEND=m
CONFIG_XEN_FBDEV_FRONTEND=m
CONFIG_XEN_BALLOON=y
CONFIG_XEN_SCRUB_PAGES=y
CONFIG_XEN_DEV_EVTCHN=y
CONFIG_XEN_BACKEND=y
CONFIG_XEN_BLKDEV_BACKEND=y
CONFIG_XEN_NETDEV_BACKEND=y
CONFIG_XENFS=y
CONFIG_XEN_COMPAT_XENFS=y
CONFIG_XEN_SYS_HYPERVISOR=y
CONFIG_XEN_XENBUS_FRONTEND=y
CONFIG_XEN_S3=y
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y

Build the kernel as usual using make bzImage && make modules && make modules_install && make install. If you have the options above set right, you should be able to use the same kernel for both dom0 and domU VMs, dom0 being the privileged guest referred to in the above configuration options. If you are using grub, the entry for booting in the Xen VMM and creating the dom0 is the following, if the kernel name extension is set to xen-3.4.1:

title Xen 3.4.2-rc2 / Ubuntu karmic (development branch), kernel 2.6.31
uuid ac5415fa-73d8-47dc-bf6e-eb6004d0484c
kernel /boot/xen-3.4.2-rc1-pre.gz dom0_mem=1048576
module /boot/vmlinuz-2.6.31-xen-3.4.1 root=UUID=ac5415fa-73d8-47dc-bf6e-eb6004d0484c ro console=tty0 earlyprintk=vga
module /boot/initrd.img-2.6.31-xen-3.4.1
quiet

It is advisable to also make an initramfs image, as follows

$ mkinitramfs -o /boot/initrd.img-2.6.31-xen-3.4.1 2.6.31-xen-3.4.1

The next step is then to get the Xen sources. I have used the Xen 3.4-testing mercurial repository. If you do not have mercurial installed on your machine, you might want to apt-get install mercurial. The Xen repository resides at http://xenbits.xensource.com/xen-3.4-testing.hg:

$ hg clone http://xenbits.xensource.com/xen-3.4-testing.hg

Before you proceed, make sure that your Python installation is up to par, if necessary apt-get install python. Since several OS releases, neither Debian, nor Ubuntu have been using the site-packages directory to drop Python packages. They instead use dist-packages, residing in the /usr/lib/python2.X directory. Karmic Koala comes with Python 2.6, so you should symlink /usr/lib/python2.6/dist-packages to /usr/lib.python2.6/site-packages. If you do not, installing xen will create the site-packages directory, which will obviously be missing from your sys.path when running Python. It might also be wise, if you have an older Xen version installed to purge the python directories. I noticed that a reinstall with the latest Xen sources failed to overwrite a bunch of files, rather symlinking to file under /usr/shared. I have no idea why this was the case, but other than that, those files contained code that pointed to obsolete file locations, resulting in the xend daemon not working correctly.

Now, you should be all set up. Build the Xen hypervisor, the Xen tools and if necessary the Xen stub domains:

$ make install-xen
$ make install-tools
$ make install-stubdom

If everything went well, you can now boot into your dom0 domain. After booting the hypervisor, the Linux kernel is started from the location you stated in /boot/grub/menu.lst on the line specifying the module for the kernel.

The final task is setting up a domU. I have chosen to host my guest domain(s) on LVM, so the first task is setting up the LVM itself. This is pretty straighforward. It involves creating at least one physical volume, a volume group, and a logical volume. The first thing to do is assign the correct parition type. In this example I will be using 500GB of a 1TB disk, using the first primary partition, i.e., /dev/sdd1. With fdisk, create a new partition and assign it the type 8e (Linux LVM). The following commands set up a 16GB logical volume. First we create a physical volume, then we set up a volume group named vg0 with physical extends of 8M, and finally, we set up the LV called ubuntu-vm1-disk.

$ pvcreate /dev/sdd1
$ vgcreate -s 8M vg0 /dev/sdd1
$ lvcreate -L 16G -n ubuntu-vm1-disk

The volume needs a file system, which is created using the normal mke2fs tools.

$ mke2fs -t ext3 -m 2 -v /dev/vg0/ubuntu-vm1-disk

For the domU, we also use a Ubunut Karnmic Koala, set up by debootstrap. Note that older Ubuntu distributions should also work, though some of the changes that need to be made might differ. Depending on your location, you might want to use a different mirror.

$ mkdir /mnt/ubuntu
$ mount /dev/vg0/ubuntu-vm1-disk /mnt/ubuntu
$ debootstrap –arch=amd64 karmic /mnt/ubuntu http://ubuntu.mirror.cambrium.nl/ubuntu/

Several files need to be altered or made. etc/fstab should read

proc /proc proc defaults 0 0
/dev/xvda1 / ext3 defaults,errors=remount-ro 0 1

In etc/hosts you should put something like

127.0.0.1 localhost localhost.localdomain
127.0.0.1 problemchild

Finally, you need to fix the Ubuntu upstart process to allow connection to hvc0, otherwise you will end up with domU stating it cannot find a file descriptor for the console. The way to fix is is by copying /etc/init/tty1.conf to /etc/init/hvc0.conf and change the line to read

start on stopped rc RUNLEVEL=[2345]
stop on runlevel [!2345]

respawn
exec /sbin/getty -8 38400 hvc0

The newly created ubuntu installation can now be unmounted, and the domU guest can be set up. This is an example of the domU config for the ProblemChild domU domain, saved to /etc/xen/problem_child.

kernel = “/boot/vmlinuz-2.6.31-xen-3.4.1″
ramdisk = “/boot/initrd.img-2.6.31-xen-3.4.1″
memory = 1024
maxmem = 1024
name = “ProblemChild”
disk = [ 'phy:vg0/ubuntu-vm1-disk,xvda1,w' ]
#dhcp=”dhcp”
ip = “192.168.0.101″
netmask = “255.255.255.0″
gateway = “192.168.0.1″
hostname = “problemchild”
root = “/dev/xvda1 ro”

You should now be able to start the domain by executing

$ xm create problem_child -c

The result should be something like this at the end of the domU boot process:

* Setting preliminary keymap… [ OK ]
* Setting up console font and keymap… [ OK ]
* Starting OpenBSD Secure Shell server sshd [ OK ]

Ubuntu karmic (development branch) assail hvc0

problemchild login:

Java Performance through Rigorous Replay Compilation

Tuesday, August 19th, 2008

At this year’s OOPSLA I am going to present a paper that was accepted in the Research Papers track.

Java Performance Evaluation through Rigorous Replay Compilation, Andy Georges, and Lieven Eeckhout, Dries Buytaert

The abstract of this paper reads as follows.

A managed runtime environment, such as the Java virtual machine, is non-trivial to benchmark. Java performance is affected in various complex ways by the application and its input, as well as by the virtual machine (JIT optimizer, garbage collector, thread scheduler, etc.). In addition, non-determinism due to timer-based sampling for JIT optimization, thread scheduling, and various system effects further complicate the Java performance benchmarking process.

Replay compilation is a recently introduced Java performance analysis methodology that aims at controlling non-determinism to improve experimental repeatability. The key idea of replay compilation is to control the compilation load during experimentation by inducing a pre-recorded compilation plan at replay time. Replay compilation also enables teasing apart performance effects of the application versus the virtual machine.

This paper argues that in contrast to current practice which uses a single compilation plan at replay time, multiple compilation plans add statistical rigor to the replay compilation methodology. By doing so, replay compilation better accounts for the variability observed in compilation load across compilation plans. In addition, we propose matched-pair comparison for statistical data analysis. Matched-pair comparison considers the performance measurements per compilation plan before and after an innovation of interest as a pair, which enables limiting the number of compilation plans needed for accurate performance analysis compared to statistical analysis assuming unpaired measurements.

The bulk of this paper made up Chapter 5 in my PhD dissertation, which was published on April 30, 2008. Here and there slight improvements were made before we submitted the final version. You can get a preprint version of the paper. The presentation I gave is available in both pdf format or as a zipped Keynote archive.