<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Rusty's Bleeding Edge Page</title>
    <link>http://ozlabs.org/~rusty/index.cgi</link>
    <description>Rusty's Bleeding Edge Page</description>
    <language>en</language>

<item>
    <title>The Joy of linux-next</title>
    <pubDate>Sun, 20 Jul 2008 19:03:00 GMT</pubDate>
    <link>http://ozlabs.org/~rusty/index.cgi/2008/07/20#2008-07-15</link>
    <description>
&lt;p&gt; Sure, &lt;a
href=&quot;http://linux.f-seidel.de/linux-next/pmwiki/&quot;&gt;linux-next&lt;/a&gt; is a
useful way of early-detecting patch conflicts with random developers.
But the second order effect has been more useful to me: forcing me to
get my shit together.  Now I regularly publish my &lt;a
href=&quot;http://ozlabs.org/~rusty/kernel&quot;&gt;patchqueue&lt;/a&gt; in a form which
applies and compiles, and has clear &quot;production&quot; vs &quot;alpha&quot;
demarcation.  &lt;/p&gt;

&lt;p&gt; Obviously, this is good for people trying to follow various
patches (and there are quite a few independent efforts at the moment,
including typesafe patches, virtio, lguest, module, tun/tap,
stop_machine, kmod-removal and down_trylock removal), but it also
makes the arrival of the merge window far less stressful.&lt;/p&gt;

&lt;p&gt; In theory, I could have been this organized before.  But just like
the concept of doing homework long before the deadline, it was never
going to happen.  So thanks Stephen!  &lt;/p&gt;
</description>
</item>
<item>
    <title>UNSW CS: Employment @ IBM OzLabs Talk: 1pm Tuesday September 2nd</title>
    <pubDate>Mon, 14 Jul 2008 21:05:00 GMT</pubDate>
    <link>http://ozlabs.org/~rusty/index.cgi/2008/07/14#2008-07-14</link>
    <description>
&lt;p&gt; UNSW School of Computer Science and Engineering are running an
&quot;Employer of the Week&quot; experiment: September 1st is IBM's week.  I'll
be spruiking for OzLabs, so if you know anyone at UNSW who is worth
talking to, drag them there (I don't know which room, but I'm guessing the
signs in CS will be pretty clear).  &lt;/p&gt;

&lt;p&gt; I'm going to try to talk about the stuff people in the office are
hacking on, to give an idea what it's like being in what AFAICT is
Australia's largest bunch of Free and Open Source Software hackers.
&lt;/p&gt;</description>
</item>
<item>
    <title>stop_machine latency: the rewrite</title>
    <pubDate>Mon, 30 Jun 2008 15:08:00 GMT</pubDate>
    <link>http://ozlabs.org/~rusty/index.cgi/2008/06/30#2008-06-30</link>
    <description>
&lt;p&gt;
Following on from my &lt;a href=&quot;http://ozlabs.org/~rusty/index.cgi/tech/2008-06-12.html&quot;&gt;previous graphs of stop_machine latency&lt;/a&gt;, I have new
results with my &lt;a href=&quot;http://ozlabs.org/~rusty/kernel/rr-2008-06-26-1/stop_machine:simplify.patch&quot;&gt;stop_machine simplification&lt;/a&gt; patch.
&lt;/p&gt;

&lt;p&gt; Again, it's the 18-way Power4 box; the simplified stop_machine
creates all the threads and moves them onto the correct CPUs before
starting them.  They then step through the state machine themselves,
rather than having a central controller.
&lt;/p&gt;

&lt;img src=&quot;http://chart.apis.google.com/chart?cht=lc&amp;chtt=Latency+in+microseconds+vs+CPU+Number&amp;chs=400x260&amp;chco=ff0000,00ff00&amp;chdl=stop_machine|idle&amp;chxt=y,x&amp;chxr=1,1,17|0,0,160&amp;chd=t:22.920,23.226,22.077,23.194,23.390,23.358,23.585,22.765,22.070,21.641,21.458,22.215,22.013,22.342,22.007,24.362,24.330|64.543,95.536,96.112,96.075,95.100,90.793,92.412,88.381,88.931,84.825,84.737,84.093,85.268,79.781,82.828,82.100,84.012&quot;&gt;

&lt;p&gt; It's actually marginally worse than the previous version: &lt;/p&gt;

&lt;img src=&quot;http://chart.apis.google.com/chart?cht=lc&amp;chtt=Latency+in+microseconds+vs+CPU+Number&amp;chs=400x260&amp;chco=ff0000,00ff00&amp;chdl=Old stop_machine|New stop_machine&amp;chxt=y,x&amp;chxr=1,1,17|0,0,160&amp;chd=t:39.375,84.273,57.100,56.681,59.131,60.362,62.500,61.337,63.718,64.462,67.125,68.656,71.231,70.731,73.418,74.793,77.318|64.543,95.536,96.112,96.075,95.100,90.793,92.412,88.381,88.931,84.825,84.737,84.093,85.268,79.781,82.828,82.100,84.012&quot;&gt;

&lt;p&gt; Since these are different kernel versions, I looked at the
baseline latency for both kernels: &lt;/p&gt;

&lt;img src=&quot;http://chart.apis.google.com/chart?cht=lc&amp;chtt=Latency+in+microseconds+vs+CPU+Number&amp;chs=400x260&amp;chco=ff0000,00ff00&amp;chdl=Old baseline|New baseline&amp;chxt=y,x&amp;chxr=1,1,17|0,0,160&amp;chd=t:21.575,52.077,21.481,21.275,21.900,21.625,21.756,21.143,21.043,20.912,21.487,21.350,21.518,21.668,21.918,21.981,22.687|22.920,23.226,22.077,23.194,23.390,23.358,23.585,22.765,22.070,21.641,21.458,22.215,22.013,22.342,22.007,24.362,24.330&quot;&gt;

&lt;p&gt; Now I need to go back and compare the exact same kernel version,
to make sure something else isn't interfering... &lt;/p&gt;</description>
</item>
<item>
    <title>Linux Foundation's Device Driver Statement</title>
    <pubDate>Fri, 27 Jun 2008 15:02:00 GMT</pubDate>
    <link>http://ozlabs.org/~rusty/index.cgi/2008/06/27#2008-06-27</link>
    <description>
&lt;p&gt;
Someone noted that I didn't sign the &lt;a href=&quot;http://www.linuxfoundation.org/en/Device_driver_statement&quot;&gt;LF &quot;proprietary modules are bad&quot; statement&lt;/a&gt;.  This
is entirely due to my slackness and not any lack of support.
&lt;/p&gt;

&lt;p&gt;
As kernel module maintainer I feel obliged to maintain the status quo
with proprietary modules, but I have noticed many colleagues becoming
more annoyed about them.
&lt;/p&gt;</description>
</item>
<item>
    <title>stop_machine latency</title>
    <pubDate>Thu, 12 Jun 2008 11:29:00 GMT</pubDate>
    <link>http://ozlabs.org/~rusty/index.cgi/2008/06/12#2008-06-12</link>
    <description>
&lt;p&gt;
Kathy Staples and I wrote a little program to measure the latency on
every CPU on a machine.  It sets CPU affinity and high priority
(SCHED_FIFO, prio 50) for each thread, then spins doing gettimeofday()
for a given duration.  The maximum gap in gettimeofday() is reported
for each CPU.
&lt;/p&gt;
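
&lt;p&gt; A rough sketch of this kind of measurement loop, in Python and
not the actual program: it spins reading the clock for a given
duration and reports the largest gap between consecutive readings.
The name max_gap_usec is made up for illustration, and
time.perf_counter_ns() stands in for gettimeofday().  The sketch does
not set CPU affinity or SCHED_FIFO priority, so its numbers are only
indicative.  &lt;/p&gt;

&lt;pre&gt;
```python
import time


def max_gap_usec(duration_sec):
    """Spin reading the clock; return the largest gap seen, in microseconds.

    Hypothetical helper for illustration: no CPU pinning, no SCHED_FIFO.
    """
    start = time.perf_counter_ns()
    end = start + int(duration_sec * 1e9)
    prev = start
    worst = 0
    # Busy-loop until the deadline, tracking the largest jump between reads.
    while end > prev:
        now = time.perf_counter_ns()
        worst = max(worst, now - prev)
        prev = now
    return worst / 1000.0


if __name__ == "__main__":
    print("max gap: %.3f usec" % max_gap_usec(0.1))
```
&lt;/pre&gt;

&lt;p&gt; A real run would start one such loop per CPU, pinned and at
real-time priority, and keep the maximum per CPU.  &lt;/p&gt;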

&lt;p&gt;
I tested this on an old 18-way Power4 box sitting around the lab: CPU
0 is used for the parent process, and the latency is measured on the
other CPUs.  This was run 100 times.  Then a variant which did an
insmod system call on CPU 0 was used (this calls stop_machine, which
is what we were trying to measure).
&lt;/p&gt;

&lt;img src=&quot;http://chart.apis.google.com/chart?cht=lc&amp;chtt=Latency+in+microseconds+vs+CPU+Number&amp;chs=400x260&amp;chco=ff0000,00ff00&amp;chdl=stop_machine|idle&amp;chxt=y,x&amp;chxr=1,1,17|0,0,140&amp;chd=t:45.000,96.312,65.257,64.778,67.578,68.985,71.428,70.100,72.821,73.671,76.714,78.464,81.407,80.835,83.907,85.478,88.364|4.657,59.516,24.550,24.314,25.028,24.714,24.864,24.164,24.050,23.900,24.557,24.400,24.592,24.764,25.050,25.121,25.928&quot;&gt;

&lt;p&gt;
The results are interesting and a little surprising.  Normal max
latency is around 35 usec, and stop_machine increases it into the 100
usec range.  There's obviously something running periodically on CPU 2: for
both runs I had to remove one horrific 150ms latency result (1000
times average!) but there's still a noticeable spike there.  I suspect
CPU1 is low because CPU0 is mainly idle (same core).
&lt;/p&gt;

&lt;p&gt;
But more concerning is that latency seems to go &lt;em&gt;up&lt;/em&gt; with
higher CPU numbers, whereas I expected it to be worst on lower CPUs.
We launch stop_machine threads in CPU order, so I expected the lower
CPUs to wait the longest.
&lt;/p&gt;

&lt;p&gt;
We're running modprobe on CPU 0, which means the stop_machine control
thread runs there, too.  It loops through creating 17 other threads:
as CPU 0 is busy, each new thread gets scheduled on a different, idle CPU.  The
first thing the thread does is try to move itself to its proper CPU.
&lt;/p&gt;

&lt;p&gt;
I suspect what is happening is that we're creating the 17 threads fast
enough that they all end up queued on the migration queue for CPU 0 at
once: this queueing uses &quot;list_add&quot; not &quot;list_add_tail&quot;, so they are
in fact deployed by the migration thread in reverse-CPU order.
&lt;/p&gt;
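
&lt;p&gt; A toy model of that ordering effect, assuming all the requests
really do land on one queue before any is serviced.  Inserting at the
head (as list_add does) means a walk from the head sees the newest
entry first, while inserting at the tail (as list_add_tail does)
preserves arrival order.  The function queue_order and the Python
deque are illustrative stand-ins, not kernel code.  &lt;/p&gt;

&lt;pre&gt;
```python
from collections import deque


def queue_order(cpus, at_head):
    """Return the order a head-to-tail walk sees after queueing each CPU.

    at_head=True models list_add (insert right after the list head);
    at_head=False models list_add_tail (insert at the end).
    """
    queue = deque()
    for cpu in cpus:
        if at_head:
            queue.appendleft(cpu)
        else:
            queue.append(cpu)
    return list(queue)


if __name__ == "__main__":
    cpus = range(1, 18)  # the 17 non-boot CPUs on the 18-way box
    print(queue_order(cpus, at_head=True)[:3])   # [17, 16, 15]
    print(queue_order(cpus, at_head=False)[:3])  # [1, 2, 3]
```
&lt;/pre&gt;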

&lt;p&gt;
My simplified version of stop_machine is more intelligent: it moves
all the threads to their correct CPUs before waking them all up.  This
should solve this problem as well as reducing overall latency.
&lt;/p&gt;</description>
</item>
<item>
    <title>Tuning VirtIO and virtio_net: part I</title>
    <pubDate>Fri, 16 May 2008 11:02:00 GMT</pubDate>
    <link>http://ozlabs.org/~rusty/index.cgi/2008/05/16#2008-05-16</link>
    <description>
&lt;p&gt;
One premise of virtio is that we should be as fast as reasonably
possible.  While there's nothing which &lt;em&gt;should&lt;/em&gt; make us slow,
that's not the same as actually being fast.  So this week, I've been
doing some simple benchmarks on my patch queue, which includes major
changes to accelerate the tap device and allow async packet sends.
&lt;/p&gt;

&lt;p&gt; I've been using lguest rather than kvm because it's far more
hackable, and my test has been a 1GB (1024x1024x1024 byte) TCP send
using netcat.  And host-&gt;guest results were awful: instead of the
current 12 seconds it was taking 70 seconds to receive 1GB.  So I
started breaking that down.  &lt;/p&gt;

&lt;p&gt; The first thing I found was that simply allocating large
receive buffers (of which only 1500 bytes are used) is expensive.  Just
this change alone takes the time from 12 seconds to 29, and there are
two reasons for this so far.
&lt;/p&gt;

&lt;p&gt; The first is that each 1500 byte packet takes two descriptors
(we have a header containing metadata), whereas a fully populated
paged skb takes 2 + 65536/PAGE_SIZE + 2 == 20 descriptors.  That means
we only fit 6 large packets in lguest's 128-descriptor ring, vs 64 for
the small packet case.  Increasing lguest's rings to 1024 drops the
time from 29 to 25: not as much as you'd expect.  Increasing it
further has marginal effect (logically, we should see equivalence at
1280 descriptors, but it has to be a power of 2).  &lt;/p&gt;
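
&lt;p&gt; The descriptor arithmetic above, worked through as a small
sketch.  A PAGE_SIZE of 4096 is an assumption (it is the value that
makes the sum come to 20); the constant and function names are made up
for illustration.  &lt;/p&gt;

&lt;pre&gt;
```python
PAGE_SIZE = 4096       # assumed: 4k pages
MAX_SKB_BYTES = 65536  # size of a fully populated paged skb, from the text
SMALL_DESCS = 2        # metadata header + one 1500-byte buffer


def large_packet_descs(page_size=PAGE_SIZE):
    """Descriptors per large buffer: 2 + 65536/PAGE_SIZE + 2."""
    return 2 + MAX_SKB_BYTES // page_size + 2


def packets_in_ring(ring_size, descs_per_packet):
    """How many whole packets fit in a ring of ring_size descriptors."""
    return ring_size // descs_per_packet


if __name__ == "__main__":
    large = large_packet_descs()
    print(large)                                      # 20
    print(packets_in_ring(128, large))                # 6
    print(packets_in_ring(128, SMALL_DESCS))          # 64
    # Ring size at which large buffers hold as many packets as small ones:
    print(packets_in_ring(128, SMALL_DESCS) * large)  # 1280
    print(packets_in_ring(1024, large))               # 51
```
&lt;/pre&gt;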

&lt;p&gt; The second reason is that alloc_page is quite slow.  A simple
cache of allocated pages drops the time from 25 to 19 seconds.
&lt;/p&gt;

&lt;p&gt; But we're still 50% slower than allocating 1500-byte receive
buffers, and today's task is to figure out why.  It seems unlikely
that the increased overhead of skb_to_sgvec, get_buf and add_buf would
account for it.  Cache effects also seem unlikely: 1024 descriptors
are still only 8k.  It's unfortunate that oprofile doesn't work inside
lguest guests, so this will be old school.  &lt;/p&gt;

&lt;p&gt; If the overhead really is inherent in large descriptors, we have
several options.  The obvious one is to add a separate &quot;large buffer&quot;
queue, or allow mixing buffer sizes and expect the other end to try to
forage for the minimal sized one.  Both require a change to the server
side.  We can add a feature bit for backwards-compat, but that's always
a last resort.  Another option is to try for multi-page allocations
for our skbs: as they're physically contiguous they'll use fewer
descriptors.  &lt;/p&gt;</description>
</item>
<item>
    <title>C inline functions not in headers</title>
    <pubDate>Mon, 07 Apr 2008 15:20:00 GMT</pubDate>
    <link>http://ozlabs.org/~rusty/index.cgi/2008/04/07#2008-04-07</link>
    <description>
&lt;p&gt;
I just appreciated an interesting side-effect of slapping &quot;inline&quot; on
static functions within .c files.  You don't get a warning when they
become unused.
&lt;/p&gt;

&lt;p&gt;
This breaks my normal method for code cleanup (in this case, the tun
driver).  So unless you have evidence otherwise, please trust the
compiler to inline static functions appropriately and don't label them
inline.  (And remember: &lt;tt&gt;inline&lt;/tt&gt; is the &lt;tt&gt;register&lt;/tt&gt; keyword
for the 21st century.)
&lt;/p&gt;
</description>
</item>
  </channel>
</rss>