<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Geek in progress &#187; kernel</title>
	<atom:link href="http://www.itkovian.net/base/tag/kernel/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.itkovian.net/base</link>
	<description>I am not yet done.</description>
	<lastBuildDate>Thu, 20 Oct 2011 20:56:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Performance, scalability, and real-time response from the Linux kernel</title>
		<link>http://www.itkovian.net/base/performance-scalability-and-real-time-response-from-the-linux-kernel/</link>
		<comments>http://www.itkovian.net/base/performance-scalability-and-real-time-response-from-the-linux-kernel/#comments</comments>
		<pubDate>Mon, 20 Jul 2009 16:19:37 +0000</pubDate>
		<dc:creator>Itkovian</dc:creator>
				<category><![CDATA[kernel]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[paul mckenney]]></category>
		<category><![CDATA[real time]]></category>

		<guid isPermaLink="false">http://www.itkovian.net/base/?p=239</guid>
		<description><![CDATA[The most interesting course, as well as the one I enjoyed most, was on the
performance, scalability and real-time response of the Linux kernel.

<div class="figure">
<a href="http://www.flickr.com/photos/itkovian/3717236440/" title="Paul McKenney by Itkovian, on Flickr"><img src="http://farm4.static.flickr.com/3425/3717236440_950cf38df4.jpg" width="319" height="500" alt="Paul McKenney" /></a>
</div>]]></description>
			<content:encoded><![CDATA[<p>The most interesting course, as well as the one I enjoyed most, was on the<br />
performance, scalability and real-time response of the Linux kernel.</p>
<div class="figure">
<a href="http://www.flickr.com/photos/itkovian/3717236440/" title="Paul McKenney by Itkovian, on Flickr"><img src="http://farm4.static.flickr.com/3425/3717236440_950cf38df4.jpg" width="319" height="500" alt="Paul McKenney" /></a>
</div>
<p><!--break--></p>
<p>Paul McKenney opened the course by dropping a question on the unsuspecting audience:<br />
what decades-old tech can keep a multi-core machine busy and yet be easy to program<br />
against? I thought Paul had the idea of a time-sharing machine in mind, but<br />
the solution was far easier than that: SQL. Given that the frequency increase<br />
of the CPUs is stabilising at naught, we need to find a good way to easily<br />
program against multi-core architectures. Something that rivals the ease of<br />
SQL, where under the hood a lot of stuff is going on, but to the user, it<br />
remains fairly simple. Unlike most of the other people talking about<br />
parallelism, Paul stressed multiple times that if one does not need it, it&#8217;s<br />
best to run single threaded. I wholeheartedly agree! On the other hand, if we<br />
parallelise, we should be considering high-level approaches prior to trying to<br />
get the nitty-gritty details right. So, first: get your algorithm in shape. I<br />
think that&#8217;s a very good point, given that research papers publishing<br />
tweaks rather than new algorithms seldom succeed in increasing performance<br />
by a factor, or even by a large percentage. Conversely, any performance lost at<br />
the base OS level cannot be made up by the higher levels, no matter the<br />
algorithm. Context-switches, locks, etc. take a (more-or-less) fixed amount of<br />
time, and that time will be spent anyhow.</p>
<p>The major issue with RT processes seems to be that they need to interact with non-RT<br />
processes and with I/O (disk, network, etc.). As such, the RT approach has to be<br />
applied across the entire execution stack if we want to get it right.<br />
However, we still need to keep a fair responsiveness for non-RT processes.<br />
Essentially, Paul argues for making tradeoffs, rather than going for the<br />
best-for-a-single-goal approach and ignoring the rest.</p>
<p>The question raised was why we need to enhance performance. The answer is that<br />
people time is much more costly than machine time these days. So it no<br />
longer pays off to have an engineer try to enhance the solution by hand. It should be<br />
done automagically as much as possible. Moreover, general solutions help to<br />
spread the cost over multiple users.</p>
<p>One of the major problems when parallelising programs is that people either do<br />
not grok the issues fully, or try to tackle the problems in the wrong order.<br />
Paul argued that we first need to understand how we can split up the problem<br />
into parts where there is little interaction between the data (so as to avoid<br />
excess locking). Only then can we partition the work that is done on that data.<br />
The final step then is to determine which parts can have actual access to the<br />
data, i.e., assign the locks. The mantra that was repeated here was that<br />
low-level details really do matter, and that it is important to get them right.<br />
Building on this, the argument was raised that we rely on people who implement<br />
things to have detailed knowledge of the underlying hardware. Unfortunately,<br />
this is not always the case.</p>
<p>The takeaway lesson from the first lecture was this: parallel programming is<br />
bloody hard, because it was designed that way.</p>
<p>Lesson 2 discussed Linux kernel programming environments dealing with: response<br />
times, preemption inside the kernel, non-maskable interrupts, etc. Point made:<br />
if an algorithm runs at a low level, you need interruptible locks. The kernel<br />
comes with a broad array of synchronisation primitives, so it is important to<br />
use the right primitives for the right job. For example, use locks that allow<br />
looping in the reader if there are potentially (multiple) writers. Once more,<br />
Paul stressed that synchronisation primitives are not the first thing to<br />
decide on. We should associate locks and other primitives with each data<br />
partition (that was agreed upon earlier in the design stage). Clearly, it is<br />
not good to have too many data partitions, as that means more locks, and a<br />
higher risk of lock contention. The example used throughout this lesson was<br />
that of a linked list. Should we lock the header? Lock each node? Keep the<br />
locks in the data structure or in some hash array of locks? Key point: provide<br />
protection for each way in which the data can be accessed! A per-cpu locking<br />
mechanism can be used; if done right it scales pretty well.</p>
<p>In lesson 3, Paul tackled the performance and scalability of Linux<br />
applications. Most frameworks (200+) that were once in use have now either faded,<br />
merged, or been discontinued. The advice given was not to use or rely on signal handlers.<br />
POSIX primitives were discussed, as were per-thread variables, spinlocks, etc.<br />
An important point was that the use of per-cpu state to lock onto does not<br />
translate well from kernel to user space. Some remarks were made about the RT<br />
aspects of user-space applications. Should this be enforced? The issue here is<br />
that opening RT behaviour to user-space clears the way for abuse. During the class, he used the (adapted) illustration of the blind philosophers and the elephant:</p>
<div class="figure">
<a href="http://www.flickr.com/photos/itkovian/3724554811/" title="The five blind penguins and the elephant by Itkovian, on Flickr"><img src="http://farm4.static.flickr.com/3495/3724554811_8da2811907.jpg" width="500" height="333" alt="The five blind penguins and the elephant" /></a>
</div>
<p>Lesson 4 was fully dedicated to real time systems, discussing some of the<br />
implementations in the Linux kernel for dealing with this. Main topics of the<br />
day were timers, high-resolution and others, interrupt handlers that can be<br />
threaded, etc. It was stressed that real time has a broad range of meanings,<br />
going from a few nanoseconds up to 10ms, the latter amounting basically to the<br />
context switch time. Apparently, as a first step, some parts of the kernel can<br />
be preempted, some cannot. The consequence is a reduction of scheduler latency,<br />
but nowhere near enough for an RT system at the high end of the scale. Timer<br />
wheels were added to improve locality and queueing, but still certain cascading<br />
operations on this data structure can take a long time. Long enough to warrant<br />
implementing high-resolution timers using RB trees, along with preemptible<br />
spinlocks. Still: greater power means greater responsibility, so care must be<br />
taken. Priority inversion was discussed, and adequately illustrated using the<br />
dancing processes.</p>
<div class="figure">
<a href="http://www.flickr.com/photos/itkovian/3727450809/" title="Real time processes by Itkovian, on Flickr"><img src="http://farm4.static.flickr.com/3485/3727450809_9fef6d4eb1.jpg" width="500" height="333" alt="Real time processes" /></a>
</div>
<p>RCU once more came to the rescue, and it was shown how this can be used in the RT scenario, with priority inversion.</p>
<p>In the final lesson, Paul discussed RT Linux applications.</p>
<div class="figure">
<a href="http://www.flickr.com/photos/itkovian/3728456171/" title="Real time vs. the Hammer by Itkovian, on Flickr"><img src="http://farm4.static.flickr.com/3219/3728456171_c498e263a4.jpg" width="500" height="333" alt="Real time vs. the Hammer" /></a>
</div>
<p>I guess the above illustration really says it all. The class discussed the<br />
meaning of a hard RT system, and most of us were proven wrong. In some cases<br />
knowing that failure is imminent is more important than a guarantee of making the<br />
deadline. (This seems to have some resemblance to writing research papers.) A<br />
combination of an accurate system that is allowed to fail and can indicate it<br />
with a less accurate system that guarantees deadline meeting seems to be the<br />
way to go. In any case, maths cannot describe RT systems in practice, and QoS<br />
is more important than the hard/soft RT distinction.</p>
<p>RT applications can be divided into three classes: search for life<br />
(medical/industrial control systems), search for death (military) and search<br />
for money (financial). In today&#8217;s interconnected machine web, the slowest<br />
machine determines the RT nature of the complete system. Multiple serialised<br />
machines have a large impact on this fact. Funny fact: in the Linux kernel,<br />
real time used to mean real life time, rather than deadline meeting. So beware<br />
of the code you rely on!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.itkovian.net/base/performance-scalability-and-real-time-response-from-the-linux-kernel/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Using performance counters with multi-threaded applications</title>
		<link>http://www.itkovian.net/base/using-performance-counters-with-multi-threaded-applications/</link>
		<comments>http://www.itkovian.net/base/using-performance-counters-with-multi-threaded-applications/#comments</comments>
		<pubDate>Fri, 23 May 2008 11:28:17 +0000</pubDate>
		<dc:creator>Itkovian</dc:creator>
				<category><![CDATA[kernel]]></category>
		<category><![CDATA[multi-threaded applications]]></category>
		<category><![CDATA[patch]]></category>
		<category><![CDATA[performance counters]]></category>

		<guid isPermaLink="false">http://www.itkovian.net/base/?p=214</guid>
		<description><![CDATA[For a few years now, there has been quite good support for using performance counters on Linux machines. Examples are <a href="http://oprofile.sourceforge.net/">OProfile</a> (which has been included in the kernel since 2.6, I think), <a href="http://user.it.uu.se/~mikpe/linux/perfctr/">Perfctr</a>, and <a href="http://perfmon2.sourceforge.net/">Perfmon</a> (not to be confused with <a href="http://perfmon.sourceforge.net/">the other Perfmon</a>, which is an SNMP-based performance monitoring tool). I think Perfmon is destined to make it to the kernel source tree as well, or so I've heard. Yet, I have been using Perfctr since I started my research, so this post is only about that tool.

There has been talk on the Perfctr mailing list (which gets hopelessly spammed these days) for including support for multi-threaded processes, but thus far I've seen nothing that does what I want. So, without further ado, here's how to patch your kernel to support multi-threaded applications.]]></description>
			<content:encoded><![CDATA[<p>For a few years now, there has been quite good support for using performance counters on Linux machines. Examples are <a href="http://oprofile.sourceforge.net/">OProfile</a> (which has been included in the kernel since 2.6, I think), <a href="http://user.it.uu.se/~mikpe/linux/perfctr/">Perfctr</a>, and <a href="http://perfmon2.sourceforge.net/">Perfmon</a> (not to be confused with <a href="http://perfmon.sourceforge.net/">the other Perfmon</a>, which is an SNMP-based performance monitoring tool). I think Perfmon is destined to make it to the kernel source tree as well, or so I&#8217;ve heard. Yet, I have been using Perfctr since I started my research, so this post is only about that tool.</p>
<p>There has been talk on the Perfctr mailing list (which gets hopelessly spammed these days) for including support for multi-threaded processes, but thus far I&#8217;ve seen nothing that does what I want. So, without further ado, here&#8217;s how to patch your kernel to support multi-threaded applications. <!--break--></p>
<p>I assume you know how to install the basic Perfctr driver, and compile your kernel to add support for it. If possible, compile it as a module, as this is easiest if you need to change things (unless you also change stuff in the kernel header files, in which case you probably want to recompile the complete kernel). Let&#8217;s further assume your kernel source lives in /usr/src/linux, further referred to as toplevel. I&#8217;ll also assume we&#8217;re building the Perfctr module here.</p>
<p>The first thing that needs to be done is to make sure that child processes set up their kernel data structures such that performance counter data can be stored at context switch (you want to use virtual counters, i.e., per-process counters). Therefore you need to add a function that does exactly this. We&#8217;ll call it __vperfctr_set_child_perfctr(struct task_struct*, struct task_struct*). You also want to be able to set up an empty vperfctr structure, by simply allocating space for it. So, declare both in your toplevel/include/linux/perfctr.h file (which should be there if you patched the kernel and copied the relevant files when installing Perfctr), which should then read:</p>
<div class="code">
<ol>
<li>#ifdef CONFIG_PERFCTR_MODULE
<li>extern struct vperfctr_stub {
<li class="indent">struct module *owner;
<li class="indent">void (*exit)(struct vperfctr*);
<li class="indent">void (*suspend)(struct vperfctr*);
<li class="indent">void (*resume)(struct vperfctr*);
<li class="indent">void (*sample)(struct vperfctr*);
<li class="newline indent">struct vperfctr* (*get_empty)(void);
<li class="newline indent">void (*set_child_perfctr) (struct task_struct*, struct task_struct*);
<li>#ifdef CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK
<li class="indent">void (*set_cpus_allowed)(struct task_struct*, struct vperfctr*, cpumask_t);
<li>#endif
<li>} vperfctr_stub;
<li>
<li>extern void _vperfctr_exit(struct vperfctr*);
<li>#define _vperfctr_suspend(x)  vperfctr_stub.suspend((x))
<li>#define _vperfctr_resume(x) vperfctr_stub.resume((x))
<li>#define _vperfctr_sample(x) vperfctr_stub.sample((x))
<li class="newline">#define _vperfctr_get_empty()   vperfctr_stub.get_empty()
<li class="newline">#define _vperfctr_set_child_perfctr(x,y)  vperfctr_stub.set_child_perfctr((x),(y))
<li>#define _vperfctr_set_cpus_allowed(x,y,z) (*vperfctr_stub.set_cpus_allowed)((x),(y),(z))
<li>#else /* !CONFIG_PERFCTR_MODULE */
<li>#define _vperfctr_exit(x) __vperfctr_exit((x))
<li>#define _vperfctr_suspend(x)  __vperfctr_suspend((x))
<li>#define _vperfctr_resume(x) __vperfctr_resume((x))
<li>#define _vperfctr_sample(x) __vperfctr_sample((x))
<li class="newline">#define _vperfctr_get_empty()   __vperfctr_get_empty()
<li class="newline">#define _vperfctr_set_child_perfctr(x,y) __vperfctr_set_child_perfctr((x),(y))
<li>#define _vperfctr_set_cpus_allowed(x,y,z) __vperfctr_set_cpus_allowed((x),(y),(z))
<li>#endif  /* CONFIG_PERFCTR_MODULE */
</ol>
</div>
<p>In the same file you should add some code to the perfctr_copy_task(struct task_struct *, struct pt_regs *) function. As shipped, it contains only a comment stating that nothing should be done until inheritance is implemented, and it sets the vperfctr structure to NULL. The code for that function should become the following:</p>
<div class="code">
<ol>
<li>static inline void perfctr_copy_task(struct task_struct *child, struct pt_regs *regs) {
<li>
<li class="indent newline">if(current-&gt;thread.perfctr != NULL) {
<li class="indent2 newline">child-&gt;thread.perfctr = _vperfctr_get_empty();
<li class="indent2 newline">if(!child-&gt;thread.perfctr) {
<li class="indent3 newline">printk(&quot;PERFCTR::error activating child perfctr\n&quot;);
<li class="indent2 newline">}
<li class="indent2 newline">else {
<li class="indent3 newline">_vperfctr_set_child_perfctr(current, child);
<li class="indent2 newline">}
<li class="indent newline">}
<li class="indent newline">else {
<li class="indent2">child-&gt;thread.perfctr = NULL;
<li class="indent newline">}
<li class="indent">
<li>}
</ol>
</div>
<p>The above will get a new structure set up (through __vperfctr_get_empty) and copy the existing settings (i.e., for the control registers etc.) to that structure (through __vperfctr_set_child_perfctr). Before we can use these functions, we need to define them, which is done in toplevel/drivers/perfctr/virtual.c.</p>
<div class="code">
<ol>
<li class="indent newline">struct vperfctr* __vperfctr_get_empty(void) {
<li class="indent2 newline">  return get_empty_vperfctr();
<li class="indent newline">}
<li class="indent newline">
<li class="indent newline">void __vperfctr_set_child_perfctr(struct task_struct* parent, struct task_struct* child) {
<li class="indent newline">
<li class="indent2 newline">  int err;
<li class="indent2 newline">  struct vperfctr* parent_perfctr = parent-&gt;thread.perfctr;
<li class="indent2 newline">  struct vperfctr* child_perfctr = child-&gt;thread.perfctr;
<li class="indent newline">
<li class="indent2 newline">  if(!child_perfctr) { /* check should have been done before! */
<li class="indent3 newline">    return;
<li class="indent2 newline">  }
<li class="indent newline">
<li class="indent2 newline">  child_perfctr-&gt;owner = child;
<li class="indent2 newline">  memcpy(&#038;(child_perfctr-&gt;cpu_state.control), &#038;(parent_perfctr-&gt;cpu_state.control), sizeof(parent_perfctr-&gt;cpu_state.control));
<li class="indent2 newline">  child_perfctr-&gt;si_signo = parent_perfctr-&gt;si_signo;
<li class="indent newline">
<li class="indent newline">#ifdef CONFIG_SMP
<li class="indent2 newline">  child_perfctr-&gt;sampling_timer = parent_perfctr-&gt;sampling_timer;
<li class="indent newline">#endif
<li class="indent newline">
<li class="indent2 newline">  err = perfctr_cpu_update_control(&#038;child_perfctr-&gt;cpu_state, 0);
<li class="indent2 newline">  if(err &lt; 0) {
<li class="indent3 newline">    printk(&quot;perfctr::error::__vperfctr_set_child cstatus &lt; 0\n&quot;);
<li class="indent2 newline">  }
<li class="indent newline">
<li class="indent newline">}
</ol>
</div>
<p>I think the Perfctr patch uses p-&gt;thread as the first argument, so change this accordingly.</p>
<p>Now, to get the counter values assembled in one central spot, I&#8217;m using a second module that accumulates these values. Upon exit, each thread will pass on their values to that module through a hook. The code has protection for usage on multicore processors, through a spin_lock.</p>
<p>Here&#8217;s the code you should add to toplevel/drivers/perfctr/virtual.c to be able to access the accumulating module through the hook.</p>
<div class="code">
<ol>
<li class="newline">void (*__read_counter_update)(int nrctrs, int* events, long long* counters_values) = NULL;
<li>
<li class="newline">int __vperfctr_set_read_counter_hook(void f_address(int, int*, long long*)) {
<li class="indent newline">  __read_counter_update = f_address;
<li class="indent newline">  return 0;
<li class="newline">}
<li class="newline">EXPORT_SYMBOL(__vperfctr_set_read_counter_hook);
<li class="newline">
<li class="newline">int __vperfctr_unset_read_counter_hook(void) {
<li class="indent newline">  __read_counter_update = NULL;
<li class="indent newline">  return 0;
<li class="newline">}
<li class="newline">EXPORT_SYMBOL(__vperfctr_unset_read_counter_hook);
<li class="indent">
<li class="indent">
<li>static void vperfctr_unlink(struct task_struct *owner, struct vperfctr *perfctr) {
<li class="indent">
<li class="indent">  /* this synchronises with vperfctr_ioctl() */
<li class="indent">  spin_lock(&amp;perfctr-&gt;owner_lock);
<li class="indent">  perfctr-&gt;owner = NULL;
<li class="indent">  spin_unlock(&amp;perfctr-&gt;owner_lock);
<li class="indent">
<li class="indent">  /* perfctr suspend+detach must be atomic wrt process suspend */
<li class="indent">  /* this also synchronises with perfctr_set_cpus_allowed() */
<li class="indent">  vperfctr_task_lock(owner);
<li class="indent">  if( IS_RUNNING(perfctr) &amp;&amp;owner == current )
<li class="indent">    vperfctr_suspend(perfctr);
<li class="indent">  owner-&gt;thread.perfctr = NULL;
<li class="indent">  vperfctr_task_unlock(owner);
<li class="indent">
<li class="indent newline">  if(__read_counter_update) {
<li class="indent2 newline">    int nractrs = perfctr_cstatus_nractrs(perfctr-&gt;cpu_state.cstatus);
<li class="indent2 newline">    long long counters [nractrs+1];
<li class="indent2 newline">    int events[nractrs+1];
<li class="indent2 newline">    int i = 0;
<li class="indent2 newline">    for(i = 0; i &lt; nractrs; i++) {
<li class="indent3 newline">      events[i] = perfctr-&gt;cpu_state.control.evntsel[i];
<li class="indent3 newline">      counters[i] = perfctr-&gt;cpu_state.pmc[i].sum;
<li class="indent2 newline">    }
<li class="indent2 newline">    events[nractrs] = -1;
<li class="indent2 newline">    counters[nractrs] = perfctr-&gt;cpu_state.tsc_sum;
<li class="indent2 newline">    __read_counter_update(nractrs, events, counters);
<li class="indent newline">  }
<li class="indent">
<li class="indent">  perfctr-&gt;cpu_state.cstatus = 0;
<li class="indent">  vperfctr_clear_iresume_cstatus(perfctr);
<li class="indent">  put_vperfctr(perfctr);
<li>}
</ol>
</div>
<p>The final piece then is the read_counter module where the data is accumulated.</p>
<div class="code">
<ol>
<li>#include &lt;linux/version.h&gt;
<li>#include &lt;linux/vermagic.h&gt;
<li>#include &lt;linux/init.h&gt;
<li>#include &lt;linux/module.h&gt;
<li>#include &lt;linux/kernel.h&gt;
<li>#include &lt;linux/fs.h&gt;
<li>#include &lt;linux/major.h&gt;
<li>#include &lt;linux/errno.h&gt;
<li>#include &lt;asm/uaccess.h&gt;
<li>#include &lt;asm/io.h&gt;
<li>#include &lt;linux/spinlock.h&gt;
<li>
<li>MODULE_INFO(vermagic, VERMAGIC_STRING);
<li>
<li>#define READ_COUNTERS_DEV 120 /* major number */
<li>
<li>#undef unix
<li>struct module __this_module
<li>__attribute__((section(&quot;.gnu.linkonce.this_module&quot;))) = {
<li class="indent">     .name = __stringify(read_counters),
<li class="indent">     .init = init_module,
<li>#ifdef CONFIG_MODULE_UNLOAD
<li class="indent">     .exit = cleanup_module,
<li>#endif
<li>};
<li>
<li>struct counter_info_s {
<li class="indent">     long long ctrs[9];
<li class="indent">     int nractrs;
<li class="indent">     int events[9];
<li class="indent">     spinlock_t lock;
<li>};
<li>
<li>static struct counter_info_s counters;
<li>
<li>
<li>void __read_counters_update(int nractrs, int* events, long long* cntrs) {
<li class="indent">
<li class="indent">     int i = 0;
<li class="indent">     spin_lock(&amp;counters.lock);
<li class="indent">     counters.nractrs = nractrs;
<li class="indent">     for(i = 0; i &lt; nractrs; ++i) {
<li class="indent">             counters.ctrs[i] += cntrs[i];
<li class="indent">     }
<li class="indent">     spin_unlock(&amp;counters.lock);
<li>}
<li>
<li>
<li>static int read_counters_open(struct inode* inode, struct file* file) {
<li class="indent">     printk(&quot;read_counters OPEN\n&quot;);
<li class="indent">     return 0;
<li>}
<li>
<li>
<li>static int read_counters_close(struct file* file) {
<li class="indent">     printk(&quot;read_counters CLOSE\n&quot;);
<li class="indent">     return 0;
<li>}
<li>
<li>
<li>static ssize_t read_counters_read(struct file* file, char* buf, size_t count, loff_t *ppos) {
<li class="indent">     int i = 0;
<li class="indent">     int res = 0;
<li>
<li class="indent">     res = copy_to_user((void*) buf, (void*) &amp;counters, count);
<li class="indent">     if(!res) {
<li class="indent2">            /* we reset the counters values at this point */
<li class="indent2">            for(i = 0; i &lt;= counters.nractrs; ++i) {
<li class="indent3">                    counters.ctrs[i] = 0;
<li class="indent2">            }
<li class="indent2">            return 0;
<li class="indent">     }
<li class="indent">
<li class="indent">     return -EFAULT;
<li>}
<li>
<li>static ssize_t read_counters_write(struct file* file, const char* buf, size_t count, loff_t* ppos) {
<li class="indent">     return 0;
<li>}
<li>
<li>
<li>static int read_counters_ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg) {
<li class="indent">
<li class="indent">     int i;
<li class="indent">     int res;
<li class="indent">
<li class="indent">     /* we have the following legal actions that can be asked
<li class="indent">      * read: reads the counters
<li class="indent">      * reset: sets the captured values to zero
<li class="indent">      *
<li class="indent">      * cmd is encoded to contain both the desired action (first byte)
<li class="indent">      * as well as the length of the buffer (last 3 bytes)
<li class="indent">      */
<li class="indent">
<li class="indent">     int action = (cmd &gt;&gt; 24) &amp; 0xff;   /* first (most significant) byte */
<li class="indent">     int count = cmd &amp; 0xffffff;        /* last 3 bytes */
<li class="indent">
<li class="indent">     switch(action) {
<li class="indent2">            case 0: /* reset */
<li class="indent3">                    spin_lock(&amp;counters.lock);
<li class="indent3">                    for(i = 0; i &lt;= counters.nractrs; i++) {
<li class="indent4">                            counters.ctrs[i] = 0;
<li class="indent4">                            counters.events[i] = 0;
<li class="indent3">                    }
<li class="indent3">                    spin_unlock(&amp;counters.lock);
<li class="indent3">                    break;
<li class="indent2">            case 1: /* read */
<li class="indent3">                    spin_lock(&amp;counters.lock);
<li class="indent3">                    {
<li class="indent4">                            char tmp [(counters.nractrs+1)*sizeof(long long)+sizeof(int)];
<li class="indent4">                            long long*tmp_counters = (long long*) (tmp+sizeof(int));
<li class="indent4">
<li class="indent4">                            *(int*)tmp = counters.nractrs;
<li class="indent4">                            for(i = 0; i &lt;= counters.nractrs; ++i) {
<li class="indent5">                                    *(tmp_counters+i) = counters.ctrs[i];
<li class="indent4">                            }
<li class="indent4">
<li class="indent4">                            if(count &lt; (counters.nractrs+1)*sizeof(long long)+sizeof(int)) {
<li class="indent5">                                    res =  -EFAULT;
<li class="indent4">                            } else {
<li class="indent5">                                    /*
<li class="indent5">                                    * we expect arg to contain the address of a
<li class="indent5">                                    * structure where the counter values can be dropped
<li class="indent5">                                    */
<li class="indent5">                                    res = (copy_to_user((void*) arg, tmp, (counters.nractrs+1)*sizeof(long long)+sizeof(int)) ? -EFAULT : 0);
<li class="indent4">                            }
<li class="indent3">                    }
<li class="indent3">                    spin_unlock(&amp;counters.lock);
<li class="indent3">                    return res;
<li class="indent3">                    break;
<li class="indent">     }
<li class="indent">     return -1;
<li>}
<li>
<li>static struct file_operations read_counters_fops = {
<li class="indent">     .read  = read_counters_read,      /* read  */
<li class="indent">     .write = read_counters_write,     /* write */
<li class="indent">     .ioctl = read_counters_ioctl,     /* ioctl */
<li class="indent">     .open  = read_counters_open,      /* open */
<li class="indent">     .flush = read_counters_close,     /* close */
<li>};
<li>
<li>extern int __vperfctr_set_read_counter_hook(void (*f)(int, int*, long long*));
<li>extern int __vperfctr_unset_read_counter_hook(void);
<li>
<li>int init_module(void) {
<li class="indent">
<li class="indent">     int i = 0;
<li class="indent">
<li class="indent">     /* find a symbol called __vperfctr_set_read_counter_hook  */
<li class="indent">     /* if it doesn't exist, let insmod handle it! :-)          */
<li class="indent">
<li class="indent">     __vperfctr_set_read_counter_hook(__read_counters_update);
<li class="indent">
<li class="indent">     for(i =0; i &lt; 9; ++i) {
<li class="indent2">            counters.ctrs[i] = 0LL;
<li class="indent2">            counters.events[i] = 0;
<li class="indent">     }
<li class="indent">
<li class="indent">     if(register_chrdev(READ_COUNTERS_DEV, &quot;read_counters&quot;, &amp;read_counters_fops) == -EBUSY) {
<li class="indent2">            printk(&quot;READ_COUNTERS device: unable to connect to major number %d\n&quot;, READ_COUNTERS_DEV);
<li class="indent2">            return -EIO;
<li class="indent">     }
<li class="indent">     else {
<li class="indent2">            printk(&quot;READ_COUNTERS device installed.\n&quot;);
<li class="indent">     }
<li class="indent">     return 0;
<li>}
<li>
<li>void cleanup_module(void) {
<li class="indent">     __vperfctr_unset_read_counter_hook();
<li class="indent">     if(unregister_chrdev(READ_COUNTERS_DEV, &quot;read_counters&quot;)) {
<li class="indent2">            printk(&quot;READ_COUNTERS device: unable to release device %d\n&quot;, READ_COUNTERS_DEV);
<li class="indent">     }
<li class="indent">     else {
<li class="indent2">            printk(&quot;READ_COUNTERS device driver removed\n&quot;);
<li class="indent">     }
<li>}
<li>
<li>MODULE_LICENSE(&quot;GPL&quot;);
</ol>
</div>
<p>If you wish to read the data, you must create the device /dev/read_counters with major number 120 and minor number 0. Code that does the reading for you might look like this:</p>
<div class="code">
<ol>
<li>#include &lt;stdio.h&gt;
<li>#include &lt;stdlib.h&gt;
<li>#include &lt;unistd.h&gt;
<li>#include &lt;sys/types.h&gt;
<li>#include &lt;sys/stat.h&gt;
<li>#include &lt;fcntl.h&gt;
<li>#include &lt;linux/spinlock.h&gt;
<li>
<li>struct counter_info_s {
<li class="indent">  long long ctrs[9];
<li class="indent">  int nractrs;
<li class="indent">  int events[9];
<li class="indent">  spinlock_t lock;
<li>};
<li>
<li>struct counter_info_s counters;
<li>
<li>int main() {
<li class="indent">
<li class="indent">     int file = open(&quot;/dev/read_counters&quot;, O_RDONLY);
<li class="indent">
<li class="indent">     if(file &lt; 0) {
<li class="indent2">            perror(&quot;open&quot;);
<li class="indent2">            exit(-1);
<li class="indent">     }
<li class="indent">
<li class="indent">     /* the module's read handler returns 0 on success */
<li class="indent">     if(read(file, (void*) &amp;counters, sizeof(struct counter_info_s))) {
<li class="indent2">            perror(&quot;oops. read error.&quot;);
<li class="indent2">            exit(-1);
<li class="indent">     }
<li class="indent">     else {
<li class="indent2">            int i = 0;
<li class="indent2">            printf(&quot;read successful\n&quot;);
<li class="indent2">            printf(&quot;counters.nractrs = %d\n&quot;, counters.nractrs);
<li class="indent">
<li class="indent2">            for(i = 0; i &lt; counters.nractrs; ++i) {
<li class="indent3">                    printf(&quot;counters.ctrs[%d] = %lld\n&quot;, i, counters.ctrs[i]);
<li class="indent2">            }
<li class="indent2">            exit(0);
<li class="indent">     }
<li>}
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.itkovian.net/base/using-performance-counters-with-multi-threaded-applications/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

