Archive for the ‘research’ Category

Automated Just-in-time Compiler Tuning

Monday, January 25th, 2010

I’m pretty excited about our paper to get accepted at CGO 2010, which takes place in Toronto, Canada.

Automated Just-in-time Compiler tuning, Kenneth Hoste, Andy Georges, and Lieven Eeckhout.

The abstract of the paper reads as follows.

Managed runtime systems, such as a Java virtual machine (JVM), are complex pieces of software with many interacting components. The Just-In-Time (JIT) compiler is at the core of the virtual machine, however, tuning the compiler for optimum performance is a challenging task. There are (i) many compiler optimizations and options, (ii) there may be multiple optimization levels (e.g., -O0, -O1, -O2), each with a specific optimization plan consisting of a collection of optimizations, (iii) the Adaptive Optimization System (AOS) that decides which method to optimize to which optimization level requires fine-tuning, and (iv) the effectiveness of the optimizations depends on the application as well as on the hardware platform. Current practice is to manually tune the JIT compiler which is both tedious and very time-consuming, and in addition may lead to suboptimal performance.

This paper proposes automated tuning of the JIT compiler through multi-objective evolutionary search. The proposed framework (i) identifies optimization plans that are Pareto-optimal in terms of compilation time and code quality, (ii) assigns these plans to optimization levels, and (iii) fine-tunes the AOS accordingly. The key benefit of our framework is that it automates the entire exploration process, which enables tuning the JIT compiler for a given hardware platform and/or application at very low cost.

By automatically tuning Jikes RVM using our framework for average performance across the DaCapo and SPECjvm98 benchmark suites, we achieve similar performance to the hand-tuned default Jikes RVM. When optimizing the JIT compiler for individual benchmarks, we achieve statistically significant speedups for most benchmarks, up to 40% for start-up and up to 19% for steady-state performance. We also show that tuning the JIT compiler for a new hardware platform can yield significantly better performance compared to using a JIT compiler that was tuned for another platform.

You can get a preprint of the paper. We also plan to make our tool available, so it can be used to automagically tune other VMs *cough* J9 *cough*.

PTLSim/KVM-QEMU on Ubuntu Karmic Koala

Monday, September 28th, 2009

Ubuntu Karmic Koala comes with GCC 4.4.1, and the set of include files clashes a bit with what the KVM/QEMU port of PTLSim expects for its string library. Mostly, the issues involve missing ‘const’ modifiers to pointers. Here’s a diff though.

diff –git a/Makefile b/Makefile
index a0c25f4..87d70d6 100644
— a/Makefile
+++ b/Makefile
@@ -56,7 +56,7 @@ BUILDHOST = $(shell hostname -f)
BUILDUSER = $(USER)

CC = g++
-CFLAGS = -O99 -g3 -fpic -march=core2 -falign-functions=16 -funit-at-a-time -minline-all-stringops -fno-trapping-math -fno-exceptions -fno-rtti -mpreferred-stack-boundary=4 -fno-strict-aliasing -Wreturn-type -mno-red-zone
+CFLAGS = -O99 -g3 -fpic -march=core2 -falign-functions=16 -funit-at-a-time -minline-all-stringops -fno-trapping-math -fno-exceptions -fno-rtti -mpreferred-stack-boundary=4 -fno-strict-aliasing -Wreturn-type -mno-red-zone -fno-stack-protector
KVMVER = 86
QEMUDIR = ../qemu-kvm-ptlsim
INCFLAGS = -Iinclude -I. -I$(QEMUDIR)/kvm/include -I$(QEMUDIR)/kvm/include/x86 -DPTLSIM_HYPERVISOR -DPTLSIM_KVM -D__INSIDE_PTLSIM__ \
diff –git a/lib/klibc.cpp b/lib/klibc.cpp
index ae0f75b..3b5c5c0 100644
— a/lib/klibc.cpp
+++ b/lib/klibc.cpp
@@ -842,7 +842,7 @@ int strncmp(const char *cs, const char *ct, size_t count)
* @s: The string to be searched
* @c: The character to search for
*/
-char *strchr(const char *s, int c)
+const char *strchr(const char *s, int c)
{
for (; *s != (char)c; ++s)
if (*s == ‘\0′)
@@ -858,7 +858,7 @@ char *strchr(const char *s, int c)
* @s: The string to be searched
* @c: The character to search for
*/
-char *strrchr(const char *s, int c)
+const char *strrchr(const char *s, int c)
{
const char *p = s + strlen(s);
do {
@@ -877,7 +877,7 @@ char *strrchr(const char *s, int c)
* @count: The number of characters to be searched
* @c: The character to search for
*/
-char *strnchr(const char *s, size_t count, int c)
+const char *strnchr(const char *s, size_t count, int c)
{
for (; count– && *s != ‘\0′; ++s)
if (*s == (char)c)
@@ -976,7 +976,7 @@ size_t strcspn(const char *s, const char *reject)
* @cs: The string to be searched
* @ct: The characters to search for
*/
-char *strpbrk(const char *cs, const char *ct)
+const char *strpbrk(const char *cs, const char *ct)
{
const char *sc1, *sc2;

@@ -1146,7 +1146,7 @@ void *memscan(void *addr, int c, size_t size)
* @s1: The string to be searched
* @s2: The string to search for
*/
-char *strstr(const char *s1, const char *s2)
+const char *strstr(const char *s1, const char *s2)
{
int l1, l2;

@@ -1175,7 +1175,7 @@ char *strstr(const char *s1, const char *s2)
* returns the address of the first occurrence of @c, or %NULL
* if @c is not found
*/
-void *memchr(const void *s, int c, size_t n)
+const void *memchr(const void *s, int c, size_t n)
{
const unsigned char *p = (unsigned char*)s;
while (n– != 0) {

As you can tell, the CFLAGS definition in the Makefile was also extended with “-fno-stack-protector”, as the build with otherwise fail with a few undefined reference to `__stack_chk_fail’ errors, see also this lkml entry for more information.

The Xen hypervisor with Ubuntu Karmic Koala

Saturday, September 26th, 2009

Setting up Xen should be pretty trivial these days, however if you want to get on the edge of development, there might be some trickery involved, which is not documented very well. Overall, the PvOps wiki post on xensource does a fine job, yet several things seemed to be missing. A lot of the flotsam on blogs and mailings lists that can be googled is too old to be remotely useful.

Here is a more or less detailed report on how to set up Xen op Ubuntu Karmic Koala (9.10) on a corei7 system (amd64). Obviously the first thing to do is upgrade the Ubuntu installation to the latest version, or deploy a fresh Karmic installation.

The Linux kernel source code to use resides in a git repository, so git should be installed, using apt-get install git. You should not use the plain kernel sources, as they do not contain the Xen code for dom0, at least. Instead, use the following steps to obtain the source code

$ sudo su -
$ cd /usr/src
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git linux-2.6-xen
$ cd linux-2.6-xen
$ git checkout origin/xen/master -b xen/master
$ git pull

After you checked out the kernel, you might want to copy an existing configuration file so you can re-use the result of your hard work you did on previous installations. You should also check the file at include/linux/swiotlb.h and if necessary apply the following patch (it may already be in the git tip, so check first before applying :-)

diff –git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index cb1a663..f4ebffb 100644
— a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -2,6 +2,7 @@
#define __LINUX_SWIOTLB_H

#include <linux/types.h>
+#include <linux/dma-mapping.h>

struct device;
struct dma_attrs;

Configure the kernel, e.g., make menuconfig. Do make sure that you switch ACPI on, because the Xen dom0 config option has a dependency on ACPI and will not show up in the configuration menu’s otherwise. In the end you should have the following configuration options set:

CONFIG_XEN=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=32
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DOM0=y
CONFIG_XEN_PRIVILEGED_GUEST=y
CONFIG_XEN_DOM0_PCI=y
CONFIG_XEN_BLKDEV_FRONTEND=y
CONFIG_XEN_NETDEV_FRONTEND=y
CONFIG_XEN_KBDDEV_FRONTEND=m
CONFIG_XEN_FBDEV_FRONTEND=m
CONFIG_XEN_BALLOON=y
CONFIG_XEN_SCRUB_PAGES=y
CONFIG_XEN_DEV_EVTCHN=y
CONFIG_XEN_BACKEND=y
CONFIG_XEN_BLKDEV_BACKEND=y
CONFIG_XEN_NETDEV_BACKEND=y
CONFIG_XENFS=y
CONFIG_XEN_COMPAT_XENFS=y
CONFIG_XEN_SYS_HYPERVISOR=y
CONFIG_XEN_XENBUS_FRONTEND=y
CONFIG_XEN_S3=y
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y

Build the kernel as usual using make bzImage && make modules && make modules_install && make install. If you have the options above set right, you should be able to use the same kernel for both dom0 and domU VMs, dom0 being the privileged guest referred to in the above configuration options. If you are using grub, the entry for booting in the Xen VMM and creating the dom0 is the following, if the kernel name extension is set to xen-3.4.1:

title Xen 3.4.2-rc2 / Ubuntu karmic (development branch), kernel 2.6.31
uuid ac5415fa-73d8-47dc-bf6e-eb6004d0484c
kernel /boot/xen-3.4.2-rc1-pre.gz dom0_mem=1048576
module /boot/vmlinuz-2.6.31-xen-3.4.1 root=UUID=ac5415fa-73d8-47dc-bf6e-eb6004d0484c ro console=tty0 earlyprintk=vga
module /boot/initrd.img-2.6.31-xen-3.4.1
quiet

It is advisable to also make an initramfs image, as follows

$ mkinitramfs -o /boot/initrd.img-2.6.31-xen-3.4.1 2.6.31-xen-3.4.1

The next step is then to get the Xen sources. I have used the Xen 3.4-testing mercurial repository. If you do not have mercurial installed on your machine, you might want to apt-get install mercurial. The Xen repository resides at http://xenbits.xensource.com/xen-3.4-testing.hg:

$ hg clone http://xenbits.xensource.com/xen-3.4-testing.hg

Before you proceed, make sure that your Python installation is up to par, if necessary apt-get install python. Since several OS releases, neither Debian, nor Ubuntu have been using the site-packages directory to drop Python packages. They instead use dist-packages, residing in the /usr/lib/python2.X directory. Karmic Koala comes with Python 2.6, so you should symlink /usr/lib/python2.6/dist-packages to /usr/lib.python2.6/site-packages. If you do not, installing xen will create the site-packages directory, which will obviously be missing from your sys.path when running Python. It might also be wise, if you have an older Xen version installed to purge the python directories. I noticed that a reinstall with the latest Xen sources failed to overwrite a bunch of files, rather symlinking to file under /usr/shared. I have no idea why this was the case, but other than that, those files contained code that pointed to obsolete file locations, resulting in the xend daemon not working correctly.

Now, you should be all set up. Build the Xen hypervisor, the Xen tools and if necessary the Xen stub domains:

$ make install-xen
$ make install-tools
$ make install-stubdom

If everything went well, you can now boot into your dom0 domain. After booting the hypervisor, the Linux kernel is started from the location you stated in /boot/grub/menu.lst on the line specifying the module for the kernel.

The final task is setting up a domU. I have chosen to host my guest domain(s) on LVM, so the first task is setting up the LVM itself. This is pretty straighforward. It involves creating at least one physical volume, a volume group, and a logical volume. The first thing to do is assign the correct parition type. In this example I will be using 500GB of a 1TB disk, using the first primary partition, i.e., /dev/sdd1. With fdisk, create a new partition and assign it the type 8e (Linux LVM). The following commands set up a 16GB logical volume. First we create a physical volume, then we set up a volume group named vg0 with physical extends of 8M, and finally, we set up the LV called ubuntu-vm1-disk.

$ pvcreate /dev/sdd1
$ vgcreate -s 8M vg0 /dev/sdd1
$ lvcreate -L 16G -n ubuntu-vm1-disk

The volume needs a file system, which is created using the normal mke2fs tools.

$ mke2fs -t ext3 -m 2 -v /dev/vg0/ubuntu-vm1-disk

For the domU, we also use a Ubunut Karnmic Koala, set up by debootstrap. Note that older Ubuntu distributions should also work, though some of the changes that need to be made might differ. Depending on your location, you might want to use a different mirror.

$ mkdir /mnt/ubuntu
$ mount /dev/vg0/ubuntu-vm1-disk /mnt/ubuntu
$ debootstrap –arch=amd64 karmic /mnt/ubuntu http://ubuntu.mirror.cambrium.nl/ubuntu/

Several files need to be altered or made. etc/fstab should read

proc /proc proc defaults 0 0
/dev/xvda1 / ext3 defaults,errors=remount-ro 0 1

In etc/hosts you should put something like

127.0.0.1 localhost localhost.localdomain
127.0.0.1 problemchild

Finally, you need to fix the Ubuntu upstart process to allow connection to hvc0, otherwise you will end up with domU stating it cannot find a file descriptor for the console. The way to fix is is by copying /etc/init/tty1.conf to /etc/init/hvc0.conf and change the line to read

start on stopped rc RUNLEVEL=[2345]
stop on runlevel [!2345]

respawn
exec /sbin/getty -8 38400 hvc0

The newly created ubuntu installation can now be unmounted, and the domU guest can be set up. This is an example of the domU config for the ProblemChild domU domain, saved to /etc/xen/problem_child.

kernel = “/boot/vmlinuz-2.6.31-xen-3.4.1″
ramdisk = “/boot/initrd.img-2.6.31-xen-3.4.1″
memory = 1024
maxmem = 1024
name = “ProblemChild”
disk = [ 'phy:vg0/ubuntu-vm1-disk,xvda1,w' ]
#dhcp=”dhcp”
ip = “192.168.0.101″
netmask = “255.255.255.0″
gateway = “192.168.0.1″
hostname = “problemchild”
root = “/dev/xvda1 ro”

You should now be able to start the domain by executing

$ xm create problem_child -c

The result should be something like this at the end of the domU boot process:

* Setting preliminary keymap… [ OK ]
* Setting up console font and keymap… [ OK ]
* Starting OpenBSD Secure Shell server sshd [ OK ]

Ubuntu karmic (development branch) assail hvc0

problemchild login:

Java Performance through Rigorous Replay Compilation

Tuesday, August 19th, 2008

At this year’s OOPSLA I am going to present a paper that was accepted in the Research Papers track.

Java Performance Evaluation through Rigorous Replay Compilation, Andy Georges, and Lieven Eeckhout, Dries Buytaert

The abstract of this paper reads as follows.

A managed runtime environment, such as the Java virtual machine, is non-trivial to benchmark. Java performance is affected in various complex ways by the application and its input, as well as by the virtual machine (JIT optimizer, garbage collector, thread scheduler, etc.). In addition, non-determinism due to timer-based sampling for JIT optimization, thread scheduling, and various system effects further complicate the Java performance benchmarking process.

Replay compilation is a recently introduced Java performance analysis methodology that aims at controlling non-determinism to improve experimental repeatability. The key idea of replay compilation is to control the compilation load during experimentation by inducing a pre-recorded compilation plan at replay time. Replay compilation also enables teasing apart performance effects of the application versus the virtual machine.

This paper argues that in contrast to current practice which uses a single compilation plan at replay time, multiple compilation plans add statistical rigor to the replay compilation methodology. By doing so, replay compilation better accounts for the variability observed in compilation load across compilation plans. In addition, we propose matched-pair comparison for statistical data analysis. Matched-pair comparison considers the performance measurements per compilation plan before and after an innovation of interest as a pair, which enables limiting the number of compilation plans needed for accurate performance analysis compared to statistical analysis assuming unpaired measurements.

The bulk of this paper made up Chapter 5 in my PhD dissertation, which was published on April 30, 2008. Here and there slight improvements were made before we submitted the final version. You can get a preprint version of the paper. The presentation I gave is available in both pdf format or as a zipped Keynote archive.

Statistically Rigorous Java Performance Evaluation: presentation

Tuesday, October 23rd, 2007

If you are interested in the presentation I gave at OOPSLA, you can get a Keynote exported pdf.

Adding Rigorous Statistics to the Java Benchmarker's Toolbox

Tuesday, October 16th, 2007

You can find the pdf of the poster I will be presenting together with Dries at OOPSLA this year. If all went well, there should be a two-page poster abstract printed in the OOPSLA Companion (preprint). So feel free to drop by on Monday evening and have a chat.

The abstract to this abstract reads as follows.

Java performance is far from trivial to benchmark because it is affected by various factors such as the Java application, its input, the virtual machine, the garbage collector, the heap size, etc. In addition, non-determinism due to Just-in-Time compilation/optimization, thread scheduling, etc., causes the execution time of a Java program to differ from run to run.

This poster advocates statistically rigorous data analysis when reporting Java performance. We advise to model non-determinism by computing confidence intervals. In addition, we show that prevalent data analysis approaches may lead to misleading or even incorrect conclusions. Although we focus on Java performance, the techniques can be readily applied to any managed runtime system.

Statistically Rigorous Java Performance Evaluation

Monday, June 18th, 2007

The following paper has been accepted for OOPSLA 2007.

Statistically Rigorous Java Performance Evaluation, Andy Georges, Dries Buytaert, and Lieven Eeckhout.

The abstract reads as follows.

Java performance is far from being trivial to benchmark because it is affected by various factors such as the Java application, its input, the virtual machine, the garbage collector, the heap size, etc. In addition, non-determinism at run-time causes the execution time of a Java program to differ from run to run. There are a number of sources of non-determinism such as Just-In-Time (JIT) compilation and optimization in the virtual machine (VM) driven by timer-based method sampling, thread scheduling, garbage collection, and various system effects.

There exist a wide variety of Java performance evaluation methodologies used by researchers and benchmarkers. These methodologies differ from each other in a number of ways. Some report average performance over a number of runs of the same experiment; others report the best or second best performance observed; yet others report the worst. Some iterate the benchmark multiple times within a single VM invocation; others consider multiple VM invocations and iterate a single benchmark execution;
yet others consider multiple VM invocations and iterate the benchmark multiple times.

This paper shows that prevalent methodologies can be misleading, and can even lead to incorrect conclusions. The reason is that the data analysis is not statistically rigorous. In this paper, we present a survey of existing Java performance evaluation methodologies and discuss the importance of statistically rigorous data analysis for dealing with non-determinism. We advocate approaches to quantify startup as well as steady-state performance, and, in addition, we provide the JavaStats software to automatically obtain performance numbers in a rigorous manner. Although this paper focuses on Java performance evaluation, many of the issues addressed in this paper also apply to other programming languages and systems that build on a managed runtime system.

This paper took quite some work, especially in the experimentation-wise. While the initial reviews were very positive, they required us to perform several extra experiments. But in the end, it was worth the effort. You can get a preprint version.

So, 2 out of X at OOPSLA for us! Yay!

Using HPM-Sampling to Drive Dynamic Compilation

Saturday, May 12th, 2007

The following paper has been acepted for publication at OOPSLA 2007.

Using HPM-Sampling to Drive Dynamic Compilation, Dries Buytaert, Andy Georges, Michael Hind, Matthew Arnold, Lieven Eeckhout, and Koen De Bosschere.

The paper abstract reads as follows.

All high-performance production JVMs employ an adaptive strategy for program execution. Methods are first executed unoptimized and then an online profiling mechanism is used to find a subset of methods that should be optimized during the same execution. This paper empirically evaluates the design space of
several profilers for initiating dynamic compilation and shows that existing online profiling schemes suffer from several limitations. They provide an insufficient number of samples, are untimely, and have limited accuracy at determining the frequently executed methods. We describe and comprehensively evaluate HPM-sampling, a simple but effective profiling scheme for finding optimization candidates using hardware performance monitors (HPMs) that addresses the aforementioned limitations. We show that HPM-sampling is more accurate; has low overhead; and improves performance by 5.7\% on average and up to 18.3\% when compared to the default system in Jikes RVM, without changing the compiler.

MontrĂ©al, here we come. October 21st – October 25th it is!

This paper has quite a long history behind it. Dries and I conceived the idea while attending the ACACES summerschool in July 2006. After a long talk with Mike, we decided to launch some preliminary measurements with the system Dries had already built into Jikes RVM using the HPM interface I had adapted from Steve Blackburn’s perfctr patch for Jikes RVM. We intially targetted PLDI 2007, when some matters were brought to our attention, that questioned our original idea on the current state of the art. Submission was postponed, extra experiments were conducted and we targetted VEE instead, where our paper was rejected. Based on the reviews we received there, it seems like it was a border case, but a rejection nonetheless. So, we figured, why not submit to OOPSLA. Worst case scenario: we get additional reviews to improve our paper. I turns out that the Best Case Scenario was visited upon us instead. You can get a preprint version.

Method-Level Phase Behavior in Java Workloads.

Tuesday, July 13th, 2004

The following paper has been accepted for publication at OOPSLA 2004

Method-Level Phase Behavior in Java Workloads, Andy Georges, Dries Buytaert, Lieven Eeckhout, and Koen De Bosschere

The paper abstract reads as follows.

Java workloads are becoming more and more prominent on various computing devices. Understanding the behavior of a Java workload which includes the interaction between the application and the virtual machine (VM), is thus of primary importance during performance analysis and optimization. Moreover, as contemporary software projects are increasing in complexity, automatic performance analysis techniques are indispensable. This paper proposes an off-line method-level phase analysis approach for Java workloads that consists of three steps. In the first step, the execution time is computed for each method invocation. Using an off-line tool, we subsequently analyze the dynamic call graph (that is annotated with the method invocations` execution times) to identify method-level phases. Finally, we measure performance characteristics for each of the selected phases. This is done using hardware performance monitors. As such, our approach allows for linking microprocessor-level information at the individual methods in the Java application`s source code. This is extremely interesting information during performance analysis and optimization as programmers can use this information to optimize their code. We evaluate our approach in the Jikes RVM on an IA-32 platform using the SPECjvm98 and SPECjbb2000 benchmarks. This is done according to a number of important criteria: the overhead during profiling, the variability within and between the phases, its applicability in Java workload characterization (measuring performance characteristics of the various VM components) and application bottleneck identification.

How Java Programs Interact with Virtual Machines at the Microarchitectural Level

Saturday, July 12th, 2003

The following paper has been accepted for publication at OOPSLA 2003

How Java Programs Interact with Virtual Machines at the Microarchitectural Level, Lieven Eeckhout, Andy Georges, and Koen De Bosschere.

The paper abstract reads as follows.

Java workloads are becoming increasingly prominent on various platforms ranging from embedded systems, over general-purpose computers to high-end servers. Understanding the implications of all the aspects involved when running Java workloads, is thus extremely important during the design of a system that will run such workloads. In other words, understanding the interaction between the Java application, its input and the virtual machine it runs on, is key to a succesful design. The goal of this paper is to study this complex interaction at the microarchitectural level, e.g., by analyzing the branch behavior, the cache behavior, etc. This is done by measuring a large number of performance characteristics using performance counters on an AMD K7 Duron microprocessor. These performance characteristics are measured for seven virtual machine configurations, and a collection of Java benchmarks with corresponding inputs coming from the SPECjvm98 benchmark suite, the SPECjbb2000 benchmark suite, the Java Grande Forum benchmark suite and an open-source raytracer, called Raja with 19 scene descriptions. This large amount of data is further analyzed using statistical data analysis techniques, namely principal components analysis and cluster analysis. These techniques provide useful insights in an understandable way.From our experiments, we conclude that (i) the behavior observed at the microarchitectural level is primarily determined by the virtual machine for small input sets, e.g., the SPECjvm98 s1 input set; (ii) the behavior can be quite different for various input sets, e.g., short-running versus long-running benchmarks; (iii) for long-running benchmarks with few hot spots, the behavior can be primarily determined by the Java program and not the virtual machine, i.e., all the virtual machines optimize the hot spots to similarly behaving native code; (iv) in general, the behavior of a Java application running on one virtual machine can be significantly different from running on another virtual machine. These conclusions warn researchers working on Java workloads to be careful when using a limited number of Java benchmarks or virtual machines since this might lead to biased conclusions.