Posts Tagged ‘measurement’

Position statement

Wednesday, October 20th, 2010

Just to open it up and release it into the wild, here’s the statement I wrote with Koen De Bosschere on several (lower level) aspects of evaluation in computer science, we would like to see addressed.

Recent work has shown a continued identification of pitfalls in conducting experimental computer science. In this position statement, we address three issues we consider important to enhance the state-of-the-art in this field: (i) the experimental design and setup, (ii) performance measurements and the subsequent analysis of results, and (iii) benchmark selection.

First, we consider the experimental setup. When designing an experiment, there are many dimensions (input set (size), heap size, garbage collector, compilers, optimisations, different VMs, etc.) to consider. In the not-so-distant past, researchers often used a single design point (e.g., a fixed heap size, one garbage collector, etc.). Recently, however, several researchers made a case in favour of considering multiple design points, and to let, for example, the heap size vary (in fixed steps) from a benchmark dependent minimum to a factor thereof. Essentially, a good experimental design should acknowledge these dimensions and their importance and make correct decisions with respect to the design points chosen for any given experiment. Some dimensions are continuous, others are discrete.

Hence, evaluating all points is not simply infeasible, sometimes it is outright impossible.

Moreover, the output value (e.g., performance as quantified by execution time) is not necessarily a continuous function of continuous input (e.g., heap size). At other times, we are only concerned with a subspace, i.e., fewer dimensions. This brings us to our first concern. Some researchers we have discussed these matters with, have stated that trying to cover the complete design space is pointless, and that they prefer to pick a single point for evaluating their new shiny idea. We beg to differ, but it is clear that the community needs to agree on which design points to used to obtain representative measurements.

Other scientific communities have since long solved the problem of experiment design. For example, in medical and social sciences, taking a representative (random) sample of the population follows a well known and widely practiced methodology. We do not believe that simply adopting these techniques is the way to go, but we can definitely learn from their methodology. We need to carefully examine how the experiment’s design space is shaped and which statistical techniques have to be used for choosing the correct (in some sense) points in it. For this, techniques borrowed from the machine learning community might be employed, where researchers choose data points that have most impact on a model they wish to build. Clearly, we need to pick more points from regions in the design space where the output function (e.g., performance) changes rapidly or where it is discontinuous.

The second issue involves measurement and analysis of data. Computer systems tend to exhibit mildly chaotic behaviour, e.g. programs may behave non-deterministically. We have argued in the past to add more statistical rigour to measurements and data analysis. It seems that the community is slowly embracing this idea, but there are a few obstacles holding researchers and practitioners back. In our opinion, there are at least two issues that must be addressed. The first and foremost problem is the lack of frameworks that automate the hard work — nobody likes boilerplate. Second, doing elaborate experiments takes time and resources. Computing resources are often scarce. Even when an experiment is embarrassingly parallel, researchers do not always have the resources required to exploit (most of) this parallelism. Automation is the key to solving these problems, yet any approach must be fully aware of the pitfalls we uncovered and deal with them adequately. A good example to follow is the recently developed Criterion framework, used by the Haskell community. (http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-l…). It improves on our own work in several respects, e.g., identifying the impact of outliers on the variance, determining the required number of iterations and the number of experiments to be conducted. It is usable for both micro and macro-benchmarks. The community should adopt these practices, and take them to the next level: non-normal distributed measurements, check autocorrelation, etc.

Third, concerning benchmark suites, there are two issues we will briefly discuss. Current benchmarks suites are sets (in the mathematical sense) of programs and their inputs. We propose to add an (objective) ordering to these sets, for example how much each benchmark adds to the coverage of the space of computer programs, spanned by some (again, objective) metrics. Ideally, these metrics should be machine and language independent, though this seems hard to achieve. Reducing the demands, we can start with a set of micro-architecture independent metrics, and see how well programs from different language map in the space. When for some reason a subset is used, the experiment should always include the top ranked benchmark, without skipping benchmarks. In such a scenario, it seems necessary to define the minimum number of benchmarks that should be evaluated, e.g., k >= 5. Also, to avoid focusing only on those benchmarks, we require the code and setup to be made public such that reviewers and potential users can evaluate further.

The second important issue — closely linked to the benchmark ordering — involves deciding when an experiment shows positive evidence for the evaluated technique or innovation. Currently, reviewers check the mean (often the wrong one) and see if the approach works for (at least some of) the benchmarks. Now, if the benchmarks are both characterised and ordered, one can see for which regions the innovation works, thus showing that it is in fact useful for a (limited) class of programs — compare this to a drug that works for 70% of a population.

It is time to place experimental computer science on solid grounds with respect to both the design of an experiment as well as its measurement and evaluation. This requires the following steps. First, gauge the width and depth of the problem by uncovering all pitfalls. Then, provide researchers with tools and sound methodologies to conduct experiments.

Evaluate 2010

Tuesday, October 19th, 2010

During SPLASHcon, Matthias Hausswirth, Peter Sweeney, Steve Blackburn and Amer Diwan organised the Evaluate 2010 workshop to discuss prevalent issues with evaluation in the papers the computer science community outputs. Even though the attendees mostly are embedded in the performance and language communities, the workshop aimed to be fairly open to all aspects of computer science.

The workshop opened with a very strong keynote talk by Cliff Click. Being his usual self, he did not fear controversy, he showed us why he often did not believe the numbers and evaluations that are published in papers over the last decade. Part of the reason is that the ideas have been tried in e.g., Hotspot, and were shown to have either less or more impact. Another part is that the evaluation is just irrelevant, since the idea has been implemented, but not published. The criticism I have here is that the community can hardly be expected to avoid ideas of which they do not know they have been implemented or tried. But that does not detract from the essence of his remarks. Additionally, often evaluation numbers are below the interesting threshold. For example, for consumers of academic work there is often little incentive to implement the described technique in their own tools or application if the potential gain is not over 20%. Worse, for some cases the gain has got to be at least a factor 2. Furthermore, unless an idea has been implemented in multiple systems, there is no proof or persuasive indication that it can be ported over to other systems than the one used in the paper. So, what does this mean for all our rigorous evaluations, showing that we can squeeze out that extra 5-10% from the system under scrutiny?

The keynote talk was followed by a fairly long discussion, where the following points were raised (there were others, I am summarising). First of all was the question how to gain confidence in the results. Except for doing each experiment in a statistically rigorous manner, one should also account for several other things. First of all, there is the impact of factors that are more or less outside of the control on the researcher when his technique will be deployed in the wild. Examples there are the environment, OS, hardware, etc. Key to getting solid results is randomisation in the first case, and using multiple platforms in the latter cases. However, that only accounts for the results obtained from the full technique. To ascertain that there is nothing going on under the hood that interacts with what the researcher is trying to implement, it is important that the full technique is broken down into smaller steps (if possible), that each can stand on their own and for which you expect either a bonus (e.g., speedup) or a cost (e.g., slowdown). The latter might be the case when you have implemented the extra profiling code, but did not have the JIT take advantage of this extra information. There are several issues that need to be dealt with here. Obviously the breakdown, but also the potential for measuring what has been changed and its impact. More often than not, aggregate performance numbers will not tell the whole story, nor will they provide the necessary insight. Such insight is desirable if one wants to argue why the approach is portable to other systems. Typically, as researchers, we choose a single vehicle to build a proof-of-concept. But if we want our ideas to be used, we likely should provide more evidence as to why it is any good. Having a breakdown can help in that respect, as it tells others what steps were required and if they meet the prerequisites for those steps, or if some (or all) of these steps are already been taken care of on their target platform.

As I mentioned above, there are aspects of any experimental setup that are out of our control. We can try to account for them, but we often cannot change them. However, we like to think of computer science as an engineering science, and thus a constructive science. But that is pretty much in contradiction with the fact that some things lie outside the boundary of our control. Hence, claims were made that we are more of an observational science. The consequence is that we should behave as such a discipline, and accept that some things are out of our control, yet we should take them into account. Fil Pizlo mentioned astrophysics on several occasions, but did not really detail what he meant by that.

Now, if we want to grow up as a scientific discipline, we need to accept that there are shortcoming in our current way of working that need to be fixed. The keynote clearly motivated the need for several of these: (i) the possibility of (relatively) easily replicate existing research results — for example, to allow reviewers to confirm conclusions — as independent experiments are a cornerstone of modern science, (ii) reproducibility of techniques and results — same approach, different setup style, e.g., another JVM, another OS, VMM, etc. — since this confirms that the idea is also generalisable, and (iii) the possibility of refuting earlier work through negative results. None of these are going to happen as much as they should without some incentive system being in place. The reality is, of course, that any PhD student, professor, or anybody else in academia, except for a select few, have to keep increasing their publication output, and the above studies simply do not have a venue where they can be published. Blogs etc. do not count here, for obvious reasons. Now, the idea is not to have small 1-2 page papers confirming or refuting, but to have full-blown papers, that have the same status as any other papers published at the same venue. At this point the ACM digital library offers subscribers the option of reviewing existing papers, but that is not completely the same. Moreover, it does not allow a researcher to add something of significance to his resume. So, such papers should be able to hold their own and be judged on their own merit.

Full disclosure is another point that was often brought up. Even when rigorous statistics are used, the summarising numbers do not tell everything about the experiment. A first step towards the expected disclosure is making the full dataset available. This at least allows other people to redo the analysis even if they do not redo the actual experiments. A step further is providing the code. This might not be feasible for industrial contributors, where code might have to be protected or when researchers have signed NDAs with some industrial partner. However, this should only impact a minority of the cases, I think. Thus, disclosure of the code. On top of that, it is preferable that the scripts, fully detailed setup, etc. are made available. There is one big caveat. Sometimes setups are too large to be easily replicated, which may impact the ability to replicate the work by an independent researcher.

I think there were several remarks that carried quite some weight. The first, obviously, was made during the keynote, namely that sometimes the improvements are not large enough to warrant implementation in a production environment. Complementary to that remark it was suggested that perhaps our efforts are simply too feeble to warrant even replication or reproduction. Third, simply reproducing for the sake of numbers is not the desired attitude. Numbers do not mean anything in themselves, but through evaluation we should gain understanding, and this understanding should be deepened by reproduction and replication of past work. Essentially, reproductions should be a qualitative issue, not a quantitative. Put differently, negative results also count, but not all of them! There was some controversy over when a negative result should be published and when it can be dismissed as uninteresting. I guess that is up to the reviewers.

In the first afternoon session, we discussed in depth several issues that we deemed important and that should be fixed in the (preferably near) future. The subjects were mingling at several occasions, but I think that’s perfectly normal. After all, we were doing this for the first time, so we needed to still wrap our heads around all the possible angles to tackle the problem.

What is the purpose of evaluation? Is it simply to measure a system and report those results? Essentially, as I touched upon above, the ultimate goal is to gain insight. Why is something better, why do we see an improvement over existing approaches? And are these improvements solely due to our technique, or are there other effects at work?

An interesting idea was raised by Fil Pizlo, namely — he elaborated on that over dinner — that science should start from some falsifiable hypothesis, since one cannot prove something right with experimentation, one can only prove it wrong. Hence, everything we know from science is ultimately false. But let us not take it that far. Still, the hypothesis idea held. Mike Hind suggested that the evaluation should provide some first indication that an idea works, once more bringing up the notion of replication and reproduction. Additionally, since ideas all have drawbacks, the motion was put forward that papers should be honest up front: state what are the good and the bad points. I agree, but then there needs to be some way to avoid PC members solely using the negative points pointed out in the paper to shoot it down. Finally, we concluded that any paper should provide insights into why the idea is applicable in general. Now, some ideas are very niche-centered, but still, they might apply in some form to things outside the niche.

After some slides presented by Fil and Eric on perturbation and chaos, respectively, we turned our heads back to reproduction. All of the above applied. One interesting notion that was raised was the idea of cross-validation. I am not convinced that this is applicable in all cases, but whenever it is possible it certainly is a good idea, though this requires to have either multiple benchmark suites, or use a leave-one-out approach. Obviously, different domains have different requirements and possibilities, so we cannot simply judge all evaluations by the same set of rules, but they should at least obey to the gist of the best practices we will put forward.

One of the key issues is to create sufficient incentives to get researchers to do exactly what is commonly done in other sciences, and we talked about that at length during the second afternoon session. Part of the debate was wether we feel we should publish in journals versus conferences. While I feel that’s a worthy debate, I recon it is pretty orthogonal to what we are trying to get reviewers to be acceptive of.

The next steps are pretty straightforward. We have agreed upon a number of things, even though I had liked to see us discuss some lower level things as well, such as benchmarks, metrics and statistics. I felt there was not really much room for debate along these lines at this point, so I guess most of my own position statement still is open to debate.

  1. Contact top-level conference PC chairs and provide them with an open letter where we state our position.
  2. Make people aware of the issues involving evaluation.
  3. Find like-minded researchers that are willing to support this cause.
  4. Write a more in-depth article to give even more visibility.
  5. Have a follow-up workshop in the near future.

I really hope this will help somewhat improve the sorry state of evaluation in our field, as it is high time we wake up.

Mini symposium on managed runtime systems

Friday, March 21st, 2008

On April 11th, 2008, after my internal PhD defense has taken place, two of my jury members will give a talk. If you are interested, can join us in the Jozef Plateauzaal in the Faculty of Engineering building, Plateaustraat 22, in Ghent. The first talk will be given by Matthias Hauswirth (University of Lugano, Switzerland) and is titled Observer Effect and Measurement Bias in Performance Measurement. The second talk is by Kathryn McKinley (University of Texas at Austin, USA), is titled Dynamic Bug Detection for Managed Languages. The event starts at 14:30, and each talk will take about one hour of with 15 minutes are reserved for questions and such.

Observer Effect and Measurement Bias in Performance Measurement

To evaluate an innovation in computer systems, performance analysts measure execution time or other metrics using one or more standard workloads (e.g., the SPEC benchmarks). The performance analyst may carefully minimize the amount of measurement instrumentation, control the context in which measurement takes place, and repeat each meas-urement multiple times. Finally, the performance analyst may use the appropriate statistical techniques to characterize the data. Unfortunately, even with such a responsible approach, the collected data may or may not be actually useful. In this talk we show how easy it is to produce poor (and thus misleading) data in computer systems research. We explore two common sources of poor-quality data. First, we get poor-quality data if our data collection perturbs the behavior of the system that we are measuring; this is often known as the “observer effect”. We show that even a seemingly insignificant measurement probe can dramatically alter system behavior; thus, perturbation is much more common than most performance analysts probably realize. Second, we get poor-quality data if we measure the system in a particular set of contexts and that set does not capture the range of reasonable contexts that a user of the system might encounter; this is known as “measurement bias”. We show that different contexts favor different configurations of the system. We conclude our talk by outlining techniques for producing high-quality data.

Dynamic Bug Detection for Managed Languages

Although managed languages preclude and help prevent some software errors, deployed programs still have errors and crash. In this talk, we discuss approaches for detecting bugs and making deployed software more reliable. Our work focuses on efficient on-line techniques. We overview approaches for detecting the source of null pointer exceptions and efficiently computing calling context. We present an approach for detecting data structures that are growing. We show how to piggyback on the garbage collector to summarize efficiently (in time and space) the object volumes and relationships based on their user defined class. Experimental results show this approach is effective at finding memory leaks, i.e., data structure errors of omission. We include a brief discussion of in progress work on tolerating leaks. These results indicate promise for inexpensive approaches that help developers find bugs and protect users from their consequences.

You can find a PDF of the announcement here. For your convenience, I have put together a map.

Here are the short biographies of the speakers.

Matthias Hauswirth

He is an assistant professor at the University of Lugano in Switzerland. He is interested in approaches for measuring, understan-ding, and improving the performance of modern, complex systems. He is particularly intrigued by the intricate interactions between the different system layers, from the hardware, over the operating system, virtual machine, application frameworks, all the way to the applications.

Kathryn McKinley

She is a professor at the University of Texas. Her research interests include compiler optimization, architecture, memory management, and software engineering. She is currently serving as the Editor-and-Chief of TOPLAS and has been the program chair of ASPLOS, PACT and PLDI. She has graduated eight PhD students and is currently supervising eight graduate students.