<?xml version="1.0"?>
<rss version="0.91"><channel><title>jk</title><link>http://ozlabs.org/~jk/diary</link><description>jeremy's blog</description><language>en</language><item><title>I've moved!</title><pubDate>Tue, 06 Nov 2012 18:24:53 +1100</pubDate><link>http://ozlabs.org/~jk/diary/misc/moved.diary/</link><description><![CDATA[<p>All of the stuff from this site is now at <a href="http://jk.ozlabs.org"
>jk.ozlabs.org</a>.</p>
<p>If you're looking for a new RSS feed, you can find that at
<a href="http://jk.ozlabs.org/feeds/rss/">jk.ozlabs.org/feeds/rss/</a>.</p>
<p>See you on the other side!</p>
]]></description></item><item><title>global namespace</title><pubDate>Tue, 26 May 2009 16:51:52 +1000</pubDate><link>http://ozlabs.org/~jk/diary/tech/linux/global-namespace.diary/</link><description><![CDATA[<pre class="shaded">
<span class="prompt">[jk@pingu ~]$ </span>mkdir ~/sshfs
<span class="prompt">[jk@pingu ~]$ </span><a href="http://afuse.sf.net">afuse</a> -o mount_template="<a href="http://fuse.sourceforge.net/sshfs.html">sshfs</a> %r:/ %m" -o unmount_template="fusermount -u -z %m" ~/sshfs/
<span class="prompt">[jk@pingu ~]$ </span>cat ~/sshfs/ozlabs.org/etc/hostname
ozlabs
<span class="prompt">[jk@pingu ~]$ </span>
</pre>
]]></description></item><item><title>hiprofile - HTML interactive profiler, version 1.0</title><pubDate>Mon, 20 Apr 2009 18:02:16 +1000</pubDate><link>http://ozlabs.org/~jk/diary/tech/linux/hiprofile-v1.0.diary/</link><description><![CDATA[<p>I've recently been doing some performance analysis work on Linux,
which requires profiling various applications to see how they
perform under load. One of the handy tools for profiling this
behaviour is <a href="http://oprofile.sourceforge.net/">oprofile</a>,
but the output can be a little difficult to interpret.</p>
<p>So, I've developed <a href="http://ozlabs.org/~jk/projects/hiprofile/"
>hiprofile</a>, the HTML interactive profiler.</p>
<p>Hiprofile is a small python script to generate HTML reports for
oprofile data. Output is broken down into per-program statistics:</p>
<div class="screenshot">
<img src="http://ozlabs.org/~jk/projects/hiprofile/images/report.png"
 width="429" height="199" alt="Report summary">
</div>
<p>The page for each program lists the top 20 (by default) symbols
within that function:</p>
<div class="screenshot">
<img src="http://ozlabs.org/~jk/projects/hiprofile/images/binary.png"
 width="350" height="173"
 alt="Breakdown of functions in the postgres binary"/>
</div>
<p>Each of the most-hit functions are show with per-instruction
profiling data, and each instruction is coloured according to how
'hot' it is. If the source code for the function is available, the
output is annotated with the corresponding code:</p>
<div class="screenshot">
<img src="http://ozlabs.org/~jk/projects/hiprofile/images/instruction.png"
 width="561" height="304" alt="Source &amp; assembly listing"/>
</div>
<p>More info and downloads are on the <a
href="http://ozlabs.org/~jk/projects/hiprofile/">hiprofile project page</a>.</p>

]]></description></item><item><title>linux.conf.au hackfest: the solution, part four</title><pubDate>Thu, 08 Jan 2009 13:38:55 +1100</pubDate><link>http://ozlabs.org/~jk/diary/tech/cell/hackfest08-solution-4.diary/</link><description><![CDATA[<p>In the <a
href="http://ozlabs.org/~jk/diary/tech/cell/hackfest08-solution-3.diary">previous</a>
article in this series, we finished with a parellel SPE-based fractal renderer.
This article will cover some of the major optimisations we can do to make use
of the Cell Broadband Engine.</p>
<p>Our last benchmark gave us the following results:</p>
<table class="data">
 <caption>Performance of multi-threaded, interleaved SPE fractal
  renderer</caption>
 <tr>
  <th>SPE threads</th>
  <th>Running time (sec)</th>
 </tr>
 <tr><td>1</td><td class="num">40.72</td></tr>
 <tr><td>2</td><td class="num">20.75</td></tr>
 <tr><td>4</td><td class="num">10.78</td></tr>
 <tr><td>6</td><td class="num">7.44</td></tr>
 <tr><td>8</td><td class="num">5.81</td></tr>
</table>
<h3>SPE vector instructions</h3>
<p>The registers on the SPEs are actually 128-bits wide. Rather than using the
entire register for one value, most of the SPE instruction set consists of
<em>vector instructions</em>.</p>
<p>For example, say we wanted to add 7 to the value in register one, and place
the result in register two. To do this, we'd use the <code>ai</code> ("add
immediate") instruction:</p>
<pre class="shaded code">
        <span class="id">ai</span> <span class="id">r2</span>,<span class="id">r1</span>,<span class="cs">7</span>
</pre>
<p>This instruction operates on the four 32-bit 'slots' of the 128-bit register
(4 x 32 = 128) in parallel:</p>
<div>
<img src="http://ozlabs.org/~jk/projects/lca2008-hackfest/images/simd-vector.png"
width="433" height="196" alt="Diagram of vector addition instruction"
style="display: block; margin-left: auto; margin-right: auto;" />
</div>
<p>By using these vector instructions, we can do a lot of the fractal rendering
work in parallel.</p>
<p>Vector instructions aren't specific to the Cell Broadband Engine - Altivec
(on PowerPC processors) and SSE (on Pentium processors) offer similar
capabilities for parallel processing of data. In fact, the PPE on the Cell
Broadband Engine supports Altivec extensions.</p>
<h3>vectorising the fractal renderer</h3>
<p>The <a href="http://ozlabs.org/~jk/projects/lca2008-hackfest/">hackfest task description</a>
contains a little information on how to vectorise code:</p>
<div class="quote">
  <p>The SPEs of the Cell/B.E. have an instruction set almost entirely composed
   of SIMD instructions - so you can do most computations on four operands
   at a time.</p>
  <p>To take advantage of these SIMD capabilities, you can use "vector" types
   in your code.</p>
  <p>For example, consider a simple function to multiply a number by three:</p>
  <pre class="shaded code">
<span class="ty">float</span> multiply_by_three(<span class="ty">float</span> f)
{
        <span class="st">return</span> f * <span class="cs">3</span>;
}</pre>

  <p>By vectorising this function, we can do four operations in the same amount
   of time:</p>

  <pre class="shaded code">
<span class="ty">vector float</span> multiply_by_three(<span class="ty">vector float</span> f);
{
        <span class="st">return</span> f * (<span class="ty">vector float</span>){<span class="cs">3</span>, <span class="cs">3</span>, <span class="cs">3</span>, <span class="cs">3</span>};
}
</pre>
  <p>The <code>vector float</code> type is similar to an array of 4
  <code>float</code>s, except that we can operate on the vector as a whole.</p>
  <p>If we look at the instructions generated for both the scalar and vector
  versions of the multiply_by_three function, they're <em>exactly the same</em>.
  The following code shows the compiled assembly of
  <code>multiply_by_three()</code>, both vector and scalar versions. The
  function argument is in register 3, return value is placed in register 3.</p>
  <pre class="shaded code">
<span class="id">multiply_by_three</span>:
        <span class="id">ilhu</span>   $<span class="cs">2</span>,<span class="cs">16448</span>    ;<span class="cm"> load {3, 3, 3, 3} into register 2</span>
        <span class="id">fm</span>     $<span class="cs">3</span>,$<span class="cs">3</span>,$<span class="cs">2</span>    ;<span class="cm"> register 3 = (register 3) * (register 2)</span>
        <span class="id">bi</span>     $<span class="id">lr</span>         ;<span class="cm"> return to caller</span>
</pre>
  <p>So, if you're using scalar operations, we get one multiply in three
   instructions. Using the vector equivalent, we get four multiplies in three
   instructions.</p>
</div>
<p>So, we should look at the hot paths of the current fractal renderer, and
try to vectorise where possible.</p>
<p>Currently, this loop of our <code>render_fractal</code> function is
where the SPE is doing most of the work, iterating over each pixel to find
the point where it diverges to infinity:</p>
<pre class="shaded code">
        <span class="st">for</span> (x = <span class="cs">0</span>; x &lt; params-&gt;cols; x++) {
                cr = x_min + x * params-&gt;delta;

                zr = <span class="cs">0</span>;
                zi = <span class="cs">0</span>;

                <span class="st">for</span> (i = <span class="cs">0</span>; i &lt; params-&gt;i_max; i++)  {
                        <span class="cm">/*</span><span class="cm"> z = z^2 + c </span><span class="cm">*/</span>
                        tmp = zr*zr - zi*zi + cr;
                        zi =  <span class="cs">2.0</span> * zr * zi + ci;
                        zr = tmp;

                        <span class="cm">/*</span><span class="cm"> if abs(z) &gt; 2.0 </span><span class="cm">*/</span>
                        <span class="st">if</span> (unlikely(zr*zr + zi*zi &gt; <span class="cs">4.0</span>))
                                <span class="st">break</span>;
                }

                colour_map(&amp;params-&gt;imgbuf[r * params-&gt;cols + x],
                                i, params-&gt;i_max);
        }
</pre>
<p>The renderer will execute the inner loop up to <code>rows</code> *
<code>columns</code> * <code>i</code> times, where <code>i</code> is between
1 and <code>i_max</code>. Seeing as this is the most executed path in our
renderer code, it is a good place to start our vectorisation efforts.</p>

<p>By vectorising, we can perform the contents of the loop on four pixels at a
time (rather than just one), enabling us to reduce the number of iterations to
around one quarter.</p>

<p>The new vectorised fractal generator is available in <a
href="http://ozlabs.org/~jk/projects/lca2008-hackfest/solution/fractal.5.tar.gz">fractal.5.tar.gz</a>.
All we have changed here is the SPE-side code (<code>spe-fractal.c</code>).
The most significant change is to the <code>render_fractal()</code>
function, where we have replaced the above loop with:</p>
<pre class="shaded code">
        <span class="st">for</span> (x = <span class="cs">0</span>; x &lt; params-&gt;cols; x += <span class="cs">4</span>) {
                escaped_i = (vector <span class="ty">unsigned</span> <span class="ty">int</span>){<span class="cs">0</span>, <span class="cs">0</span>, <span class="cs">0</span>, <span class="cs">0</span>};
                cr = x_min + spu_splats(params-&gt;delta * x);

                zr = (vector <span class="ty">float</span>){<span class="cs">0</span>, <span class="cs">0</span>, <span class="cs">0</span>, <span class="cs">0</span>};
                zi = (vector <span class="ty">float</span>){<span class="cs">0</span>, <span class="cs">0</span>, <span class="cs">0</span>, <span class="cs">0</span>};

                <span class="st">for</span> (i = <span class="cs">0</span>; i &lt; params-&gt;i_max; i++)  {

                        <span class="cm">/*</span><span class="cm"> z = z^2 + c </span><span class="cm">*/</span>
                        tmp = zr*zr - zi*zi + cr;
                        zi =  (vector <span class="ty">float</span>){<span class="cs">2.0f</span>, <span class="cs">2.0f</span>, <span class="cs">2.0f</span>, <span class="cs">2.0f</span>}
                                * zr * zi + ci;
                        zr = tmp;

                        <span class="cm">/*</span><span class="cm"> escaped = abs(z) &gt; 2 </span><span class="cm">*/</span>
                        escaped = spu_cmpgt(zr * zr + zi * zi, limit);

                        <span class="cm">/*</span><span class="cm"> escaped_i = escaped ? escaped_i : i </span><span class="cm">*/</span>
                        escaped_i = spu_sel(spu_splats(i), escaped_i,
                                        escaped);

                        <span class="cm">/*</span><span class="cm"> if all four words in escaped are non-zero,</span>
<span class="cm">                         * we're done </span><span class="cm">*/</span>
                        <span class="st">if</span> (!spu_extract(spu_orx(~escaped), <span class="cs">0</span>))
                                <span class="st">break</span>;
                }

                colour_map(&amp;params-&gt;imgbuf[r * params-&gt;cols + x],
                                escaped_i, params-&gt;i_max);
        }
</pre>
<p>This new code processes four columns at each iteration, and increments
<code>x</code> by four each time. We set up a vector of pixels (one per column)
at each iteration, and do the divergence tests on a vector at a time.</p>
<p>Most of the actual aritmetic is easy to convert; the standard multiply,
add and subtract operators will work just fine with vector types.</p>
<p>However, the <code>if</code>-test to see if the number has diverged is more
complex - since we're dealing with a vector of four pixels, we can't just break
out of the inner loop when one pixel has diverged. Instead, we need to keep
processing until all vector elements to reach the divergence point, and remember
the value of <code>i</code> where this divergence occurs.</p>
<p>Basically, the above code keeps two extra variables: <code>escaped</code> and
<code>escaped_i</code>. <code>escaped</code> is a vector of four flags - if the
n<sup>th</sup> word of <code>escaped</code> is non-zero, we have detected that
the n<sup>th</sup> pixel in our vector has diverged to infinity. The
<code>escaped_i</code> is a similar vector of the value of <code>i</code> where
we first detected this divergence, which is used as the 'value' for each point
in the fractal.</p>
<p>We do this with a few SPE <em>intrinsics</em>, which allow us to use some
of the lower-level SPE vector functionality:</p>
<table class="data">
 <caption>SPE intrinsics used in vectorised fractal code</caption>
 <tr>
  <td>
   <code>spu_cmpgt(a,b)</code>
  </td>
  <td>
   <strong>compare greater than</strong><br />
   returns one bits in each word slot in <code>a</code> that is greater than the
   corresponding word slot in <code>b</code>.
  </td>
 </tr>
 <tr>
  <td>
   <code>spu_sel(a,b,c)</code>
  </td>
  <td>
   <strong>select bits</strong><br />
   For bit each bit in <code>c</code>: if the bit is zero, return the
   corresponding bit from <code>a</code>, otherwise return the corresponding
   bit from <code>b</code>.
  </td>
 </tr>
 <tr>
  <td>
   <code>spu_orx(a)</code>
  </td>
  <td>
   <strong>or across</strong><br />
   Logical-or each word in <code>a</code>, and return the result in the
   <em>preferred slot</em> (ie, the first word). The remaining three slots
   are set to zero.
  </td>
 </tr>
 <tr>
  <td>
   <code>spu_extract(a,n)</code>
  </td>
  <td>
   <strong>extract vector element</strong><br />
   Return a scalar value from the <code>n</code><sup>th</sup> element of
   vector <code>a</code>
  </td>
 </tr>
 <tr>
  <td>
   <code>spu_splats(a)</code>
  </td>
  <td>
   <strong>splat scalar</strong><br />
   Return a vector consisting of each word equal to <code>a</code>
  </td>
 </tr>
</table>
<p>The first three intrinsics will result in just one instruction in the
compiled code. Since slot 0 is the preferred slot (and therefore where the
compiler would normally store a scalar value in a register), the call to
<code>spu_extract(x, 0)</code> will not result in any instructions at all. The
<code>spu_splats</code> instrinsic may require a variable number of
instructions, depending on the value to splat.</p>
<p>More detail on the SPE instrinsics is available in the <a
href="http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E?Open&S_TACT=105AGX16&S_CMP=LP"
>PPU & SPU C/C++ Language Extension Specification</a> and <a
href="http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44?Open&S_TACT=105AGX16&S_CMP=LP"
>SPU Instruction Set Architecture</a> documents on the IBM developerWorks <a
href="http://www.ibm.com/developerworks/power/cell/documents.html"
>Cell Broadband Engine Resource Center</a>.</p>
<p>We also need to change the <code>colour_map</code> function to take
a vector of <code>i</code> values (rather than just a single value), and
calulate the colours of four pixels at a time. For now, we can just do this
one element at a time, using <code>spu_extract</code>, but this could be
improved later.</p>
<h3>benchmarking</h3>
<p>After vectorising the SPE fractal code, we can benchmark to see how
the vectorisation affects run time. The following table shows run times for
the new renederer, along with the proportion of time against the previous
scalar implementation.</p>
<table class="data">
 <caption>Performance of multi-threaded, interleaved, vectorised SPE fractal
  renderer</caption>
 <tr>
  <th>SPE threads</th>
  <th>Running time (sec)</th>
  <th>Vector / Scalar</th>
 </tr>
 <tr><td>1</td><td class="num">10.70</td><td class="num">0.262</td></tr>
 <tr><td>2</td><td class="num">5.72</td><td class="num">0.372</td></tr>
 <tr><td>4</td><td class="num">3.29</td><td class="num">0.305</td></tr>
 <tr><td>6</td><td class="num">2.46</td><td class="num">0.330</td></tr>
 <tr><td>8</td><td class="num">2.06</td><td class="num">0.354</td></tr>
</table>
<p>Not too bad! We're doing much better than our original PPE-based renderer,
 which took 55.7 seconds to render the same fractal.</p>
<p>Ideally, we'd see a constant vector / scalar figure of 0.25, meaning that
we're doing four times the amount of work in the same run time. However, there
are still a lot of operations that we haven't optimised yet.</p>
<p>More optimisations in the next article!</p>
]]></description></item><item><title>SPE gang scheduling policies</title><pubDate>Tue, 18 Nov 2008 18:56:40 +1100</pubDate><link>http://ozlabs.org/~jk/diary/tech/cell/spufs/gang-scheduling-policies.diary/</link><description><![CDATA[<p>In my <a href="http://ozlabs.org/~jk/diary/tech/cell/external-scheduling.diary">previous
post</a> here, I mentioned that:
<blockquote>
We also need to allow contexts to be loaded outside of spu_run to implement
gang scheduling correctly and efficiently.
</blockquote>
<p>I think that may require a little explanation, so here we go:</p>

<h3>gang scheduling policies</h3>
<p>The idea behind SPE gang scheduling is to allow a set of related SPE
contexts to be scheduled together, to allow interactions between contexts to be
performed in a timely manner. For example, consider two contexts (A and B)
that send mailbox messages to each other. If context A is running while
context B is scheduled out, then A will spend its time-slice waiting for a
message from B. If they are scheduled as a single entity, neither context will
have to spend its timeslice blocked, waiting for the other context to be
run.</p>

<p>So, we have to come up with a <em>policy</em> to define the behaviour of the
gang scheduler.  When does the gang become schedulable? Under which conditions
should the gang be descheduled?</p>

<p>I can see four possible approaches:<p>

<h4>policy 1: the gang is only schedulable when <em>all</em> contexts are
runnable</h4>

<p>In this case, the gang is only ever scheduled when <em>all</em> of the gang's
contexts are runnable (ie, they are being run by the <code>spu_run</code> system
call).</p>
<p>Although the simplest, this approach will never complete&emdash;consider the
following:</p>
<ol>
 <li>Context A becomes runnable</li>
 <li>Context B becomes runnable</li>
 <li>The gang is now schedulable, so both contexts are scheduled</li>
 <li>Because it has slightly less work to do, context A finishes before
  context B</li>
 <li>Because only one of the two contexts is runnable, the gang is no longer
  schedulable. Context B is never re-scheduled, so cannot complete the
 rest of its task</li>
</ol>

<p>So, this policy isn't much use; perhaps we can solve this with a new
approach:</p>

<h4>policy 2: the gang is scheduled when all contexts are runnable, and
descheduled when no contexts are runnable</h4>

<p>This will solve the previous non-termination problem, in that context B will
be able to terminate - the context isn't immediately descheduled when A
finishes.</p>

<p>However, now we have a new, slightly more complex non-termination case:</p>
<ol>
 <li>Context A becomes runnable</li>
 <li>Context B becomes runnable</li>
 <li>The gang is now schedulable, so both contexts are scheduled</li>
 <li>Because it has slightly less work to do, context A finishes before
  context B</li>
 <li>At the same time, context B does a PPE-assisted callback, which requires
  a stop-and-signal (and so leaves spu_run for just a moment)</li>
 <li>Because neither context is currently runnable, the gang is
  descheduled</li>
 <li>Context B finishes its callback, so re-enters spu_run to be
  re-scheduled. However, the policy does not allow context B to be re-scheduled,
  as only one of the two contexts is runnable.</li>
</ol>

<p>Although this may sound like a rare occurrence, it's not a restriction
we can pass on to the programmer. Imagine the following SPE code:</p>

<pre class="shaded code">
<span class="ty">int</span> main(<span class="ty">void</span>)
{
        do_work();
        printf(<span class="cs">&quot;work done!</span><span class="sp">\n</span><span class="cs">&quot;</span>);
        <span class="st">return</span> <span class="cs">0</span>;
}</pre>

<p>Here we're doing a PPE-assisted callback (the call to <code>printf</code> is
implemented as a callback) before finishing. If this callback were to occur
when the other context has already completed, we would hit the non-termination
condition above.</p>

<p>This means that the last-running context of a gang can never do a
PPE-assisted callback. In fact, to be completely safe against this
non-termination, a programmer would have to avoid callbacks after <em>any</em>
context has finished, for risk of callbacks on the rest of the gang being
synchronised.</p>

<p>So, it looks like we need to be a little more permissive when deciding if the
gang is schedulable.</p>

<h4>policy 3: the gang is scheduled when any context is runnable, and
descheduled when no contexts are runnable</h4>

<p>This is another fairly simple approach&emdash;the gang is scheduled whenever
there is <em>any</em> work to do. We no longer have any non-termination
conditions, as 'having work to do' will result in 'doing work'.</p>

<p>The tricky part is that it will require us to change one of the fundamental
assumptions about spufs: currently, we don't schedule any context unless it is
runnable. Because we schedule the entire gang when one if its context becomes
runnable, we have to now schedule a number of non-runnable contexts.</p>

<p>The good news is that I've already done a little experimental work to
overcome this general restriction in spufs.</p>

<p>The last approach is a little more complex, but works around this
restriction:</p>

<h4>policy 4: schedule the runnable contexts of a gang, and reserve SPEs for
the non-runnable contexts</h4>

<p>This is just like policy 3, but instead of actually scheduling the
non-runnable contexts, we reserve a SPE for them.</p>

<p>This way, a non-runnable context does not need to be loaded, but can be
quickly scheduled when it becomes runnable. The downside is that we're only
half-implementing gang scheduling; there still may be interactions to a
non-runnable SPE (eg, accesses to the problem state mapping from a running
context in the same gang) that will cause running contexts to become
blocked.</p>

<p>So, which policy is best for spufs?</p>

<p>Policies 1 and 2 have significant flaws in their approach. It's quite
possible that either will lead to non-termination conditions in fairly simple
user programs. I don't think we can 'work around' this with a restriction on
the programmer.</p>

<P>Policy 4 will require a mechanism for reserving SPEs for a particular
context; I'm not convinced the extra complexity is worth the effort, especially
as this doesn't allow us to implement gang scheduling properly.</p>

<p>Currently, Luke Browning and André Detsch have a work-in progress
patch series for gang scheduling, based on policy 3.</p>
]]></description></item><item><title>external context scheduling in spufs</title><pubDate>Fri, 14 Nov 2008 18:14:55 +1100</pubDate><link>http://ozlabs.org/~jk/diary/tech/cell/spufs/external-scheduling.diary/</link><description><![CDATA[<p>At present, the spufs code has the invariant that a context is only
ever loaded to an SPE when it is being run; ie, a thread is calling the
<code>spu_run</code> syscall on the context.</p>
<p>However, there are situations where we may want to load the context without
it being run. For example, to <a
href="http://www-128.ibm.com/developerworks/forums/thread.jspa?messageID=14153478&tstart=0"
>use the SPU's DMA engine from the PPE</a>, requires the PPE thread to
write to registers in the SPU's problem-state mapping (psmap). Faults on the
psmap area can only be serviced while the context is loaded, so will block until
someone runs the context. Ideally, we could allow such accesses to the psmap
without the spu_run call. We also need to allow contexts to be loaded outside of
spu_run to implement gang scheduling correctly and efficiently.</p>
<p>So, I've been working on some experimental changes to allow "external
scheduling" for SPE contexts. The "external" refers to a thread external to the
SPE's usual method of scheduling (ie, it's owning thread calling
<code>spu_run</code>). In the example above, the external schedule would be
caused by the fault handler for the problem-state mapping.</p>
<p>Although a context may be scheduled to an SPE, we still can't always guarantee
forward progress. For example, in the "use the psmap to access the DMA engine"
scenario, a DMA may cause a major page fault, which needs a controlling thread
to service. In this case, the only way to ensure forward progress is through
calling <code>spu_run</code>. However, I have some ideas on how we can
remove this restriction later.</p>

<h3>the interface</h3>
<p>First up, we need to tell the spufs scheduler that we want a context to be
loaded:</p>
<pre class="shaded code">
<span class="cm">/*</span>
<span class="cm"> * Request an 'external' schedule for this context.</span>
<span class="cm"> *</span>
<span class="cm"> * The context will be either loaded to an SPU, or added to the run queue,</span>
<span class="cm"> * depending on SPU availability.</span>
<span class="cm"> *</span>
<span class="cm"> * Should be called with the context's state mutex locked, and the context</span>
<span class="cm"> * in SPU_STATE_SAVED state.</span>
<span class="cm"> </span><span class="cm">*/</span>
<span class="ty">int</span> spu_request_external_schedule(<span class="ty">struct</span> spu_context *ctx);</pre>
<p>After loading the context with <code>spu_request_external_schedule</code>, we
need a way to tell the scheduler that the context can be de-scheduled:</p>
<pre class="shaded code">
<span class="cm">/*</span>
<span class="cm"> * The context should be unscheduled at the end of its timeslice</span>
<span class="cm"> </span><span class="cm">*/</span>
<span class="ty">void</span> spu_cancel_external_schedule(<span
  class="ty">struct</span> spu_context *ctx);</pre>
These functions are implemented by incrementing or decrementing a count of
"external schedulers" on the context. If multiple threads are requesting an
external schedule, then the first will activate the context. When the last
thread calls the cancel method, the context can be descheduled.</p>

<h3>usage</h3>
<p>We can use these two functions to allow the problem-state mapping fault
handler to proceed outside of spu_run:</p>
<pre class="shaded code">
<span class="ty">--- a/arch/powerpc/platforms/cell/spufs/file.c</span>
<span class="ty">+++ b/arch/powerpc/platforms/cell/spufs/file.c</span>
<span class="st">@@ -413,9 +413,11 @@</span><span class="pp"> static int spufs_ps_fault(struct vm_area_struct *vma,</span>

        if (ctx-&gt;state == SPU_STATE_SAVED) {
                up_read(&amp;current-&gt;mm-&gt;mmap_sem);
<span class="id">+               spu_request_external_schedule(ctx);</span>
                spu_context_nospu_trace(spufs_ps_fault__sleep, ctx);
                ret = spufs_wait(ctx-&gt;run_wq, ctx-&gt;state == SPU_STATE_LOADED);
                spu_context_trace(spufs_ps_fault__wake, ctx, ctx-&gt;spu);
<span class="id">+               spu_cancel_external_schedule(ctx);</span>
                down_read(&amp;current-&gt;mm-&gt;mmap_sem);
        } else {
                area = ctx-&gt;spu-&gt;problem_phys + ps_offs;
</pre>
<p>Note that the spu_cancel_external_schedule function doesn't unload the
context right away; if it did, the refault would fail too, and we'd end up in
an infinite loop of faults. Instead, it keeps the context scheduled for the rest
of its timeslice. This gives the faulting thread time to access the mapping
after the fault handler has been invoked.</p>
<p>We also need to do a bit of trickery with the priorities of contexts during
external schedule operations. If a high-priority thread access the problem-state
mapping of a low-priority context, we want the context to temporarily inherit
the higher priority. To do this, we raise the priority when
spu_request_external_schedule is called, and drop it back after the context has
finished its timeslice on the SPU.</p>
<h3>the code</h3>
<p>I've created a development branch in the spufs repository for these
changes, which is available:</p>
<ul>
 <li>via git:
  <code>git://git.kernel.org/pub/scm/linux/kernel/git/jk/spufs.git</code>, in
  the <code>ext-sched</code> branch;
  or</li>
 <li>on the <a
 href="http://git.kernel.org/?p=linux/kernel/git/jk/spufs.git;a=shortlog;h=ext-sched"
  >browsable gitweb interface</a>.</li>
</ul>
<p>Note that this is an <em>experimental</em> codebase, expect breakages!</p>
]]></description></item><item><title>new patchwork beta</title><pubDate>Tue, 28 Oct 2008 22:23:07 +1100</pubDate><link>http://ozlabs.org/~jk/diary/tech/software/new-patchwork-beta.diary/</link><description><![CDATA[<img class="diaryfloat" src="http://ozlabs.org/~jk/images/diary/patchwork-header.png"
 width="269" height="134" alt="patchwork screengrab"/>
<p>We've had a new version of <a href="http://ozlabs.org/~jk/projects/patchwork/">patchwork</a>
- the web-based patch-tracking system - online for a few weeks now at
<a href="http://patchwork.ozlabs.org/">patchwork.ozlabs.org</a>.</p>
<p>After <a href="http://lwn.net/Articles/298592/">Paul's presentation on
patchwork at the 2008 Kernel Summit</a>, there has been wider interest in
patchwork setups for other projects. Patchwork originally hosted the
Linux on Power and Linux on Cell lists, we've since added
<a href="http://patchwork.ozlabs.org/project/netdev/">netdev</a>,
<a href="http://patchwork.ozlabs.org/project/linux-mtd/">linux-mtd</a> and
<a href="http://patchwork.ozlabs.org/project/linux-ext4/">linux-ext4</a>. I've
also added the main Linux Kernel mailing list
(<a href="http://patchwork.ozlabs.org/project/linux-kernel/">lkml</a>), just
to see how the new patchwork handles the load; all has been okay so far.</p>
<p>If you're interested in installing the new patchwork at your own site, you
can grab the source from the <a href="http://ozlabs.org/~jk/projects/patchwork/">patchwork
project page</a>. Installation can be a little tricky, so feel free to mail me
if you need a hand.</p>
<div style="clear: both;"></div>
]]></description></item><item><title>2.6.26 on a Lenovo x61 thinkpad</title><pubDate>Tue, 02 Sep 2008 20:51:01 +1000</pubDate><link>http://ozlabs.org/~jk/diary/tech/linux/x61-2.6.26-wireless.diary/</link><description><![CDATA[<p>It looks like the iwl driver is slightly broken in the 2.6.26 release -
<a href="http://bugzilla.kernel.org/show_bug.cgi?id=11055">connections
will drop-out after 10 seconds or so</a>.</p>
<p>The workaround for this is to enable the config option
<code>CONFIG_IWL4965_HT</code>.</p>
]]></description></item><item><title>asynchronous spu contexts, initial designs</title><pubDate>Wed, 16 Jul 2008 15:21:51 +1000</pubDate><link>http://ozlabs.org/~jk/diary/tech/cell/spufs/asynchronous-spu-contexts-initial-designs.diary/</link><description><![CDATA[<p>I've recently been working on some changes to the <a
 href="http://git.kernel.org/?p=linux/kernel/git/jk/spufs.git">spufs</a>
code, and thought I'd write-up some of the details.</p>
<p>At present, the <code>spu_run</code> syscall (used to run a SPU context)
blocks until the SPU program has exited (or some other event has happened,
such as a non-serviceable fault). This means that to take advantage of the
SPUs, you really need to start a new thread for each SPU context that you
create, otherwise your application will be sitting around waiting for each SPU
context to complete.</p>

<p>In fact, we have an invariant in the spufs code at the moment that only
contexts that are currently being <code>spu_run</code> will ever be runnable
(and, at the moment, schedulable).</p>

<p>Ben H and I have been chatting about some ideas about asynchronous spu
contexts. This means that the userspace app can start the context, then later
retrieve the status of the SPU context (to see if it has stopped, faulted, or
whatever). We can then use standard POSIX semantics like <code>poll()</code> to
see if a context is still running or has generated any "events", then handle
these events when they become available.</p>

<p>In effect, this is similar to <code>spu_run</code>: currently, the
<code>spu_run</code> syscall runs the SPU, then blocks until an event happens,
which is then returned to userpsace as the return value of
<code>spu_run</code>. The main difference is that we don't block in the kernel
while the SPU is running.</p>

<p>So, I've been coding up an experimental change to spufs. Firstly, we have
to explicitly tell the kernel that we want a context to operate in asynchronous
mode, so I've added a new flag to the <code>spu_create</code> syscall:
<code>SPU_CREATE_ASYNC</code>.</p>

<p>I've opted for a file-based interface to these asynchronous contexts -
SPU events are retrieved by reading from a file. Contexts that are created with
the <code>SPU_CREATE_ASYNC</code> flag have an extra file present (called
something like "<code>event</code>") in their context directory in the
spufs mount. Reading from this file allows applications to retreive events
that the SPU program has raised.</p>

<p>We need to define a format for the data read from this events file, so
here's something to get started with:</p>
<pre class="shaded code">
<span class="ty">struct</span> spu_event {
	<span class="ty">uint32_t</span> event;
	<span class="ty">uint32_t</span> status;
	<span class="ty">uint32_t</span> npc;
};
</pre>
<p> - where the <code>event</code> member specifies which event happened - a
stop-and-signal for example.</p>

<p>The status and npc members return the status of the SPU and the next program
counter register, respectively. While not strictly necessary (this information
is available from other files in spufs), it's very likely that the application
will need these values in order to handle the event.</p>

<p>So, users of this interface may look something like this:</p>

<pre class="shaded code">
<span class="ty">uint32_t</span> npc = <span class="cs">0</span>;
<span class="ty">struct</span> context {
        <span class="ty">int</span> fd;
        <span class="ty">int</span> event_fd;
} context;

<span class="cm">/* create the context */</span>
context.fd = spu_create(<span class="cs">"/spu/ctx"</span>, <span class="cs">NULL</span>, SPU_CREATE_ASYNC);

<span class="cm">/* open the events file */</span>
context.event_fd = openat(context.fd, <span class="cs">"event"</span>, O_RDWR);

<span class="cm">/* start the context running. unlike the spu_run syscall,
 * this function does not block for the duration of the
 * spu program */</span>
run_context(&amp;context, npc);

<span class="st">for</span> (;;) {
        <span class="ty">struct</span> spu_event event;

        <span class="cm">/* get the next event caused by the SPU */</span>
        read(context.event_fd, &amp;event, <span class="st">sizeof</span>(event));

        <span class="st">if</span> (event.event == SPU_EVENT_STOP)
                <span class="st">break</span>;

        <span class="cm">/* handle other event ... */</span>
}
</pre>

<p>Note that the userspace examples here are not what we'd present to
Cell application developers. They're more low-level examples of how the
new asynchronous kernel interface works. In fact, the changes could be
completely transparent to applications which use the <a
href="http://sourceforge.net/projects/libspe/">libSPE</a> interface.</p>

<p>This isn't far from the API provided by the current <code>spu_run</code>
syscall, except that we're not waiting in the kernel while the SPU is
running.</p>

<p>Also, we're going to need to control the SPU somehow - for example, we need
to implement the <code>run_context</code> function in the pseudocode above.
Rather than overloading the <code>spu_run</code> syscall, I've opted to use the
same event file - writes to this file will allow userspace to control the SPU.
I'm still working out the exact format of these writes, but the way I've
implemented it at the moment is that the application can write structures of
this layout to the file:</p>

<pre class="shaded code">
<span class="ty">struct</span> spu_control {
	<span class="ty">uint32_t</span> op;
	<span class="ty">char</span> data[];
};
</pre>

<p>The contents of the <code>data</code> member depends on the operation
requested (specified by the <code>op</code> member). For example, a 'start spu'
operation would have four extra bytes - a <code>uint32_t</code> containing the
NPC to start the SPU execution from. A 'stop spu' operation doesn't require any
extra parameters, so the <code>data</code> member would be 0 bytes long.</p>

<p>This would allow us to implement the <code>run_context</code> function
as follows:</p>
<pre class="shaded code">
<span class="ty">void</span> run_context(<span class="ty">struct</span> context *context, <span class="ty">uint32_t</span> npc)
{
        <span class="ty">uint32_t</span> buf[2];

        buf[0] = SPU_CONTROL_START_SPU;
        buf[1] = npc;

        write(context.event_fd, buf, <span class="st">sizeof</span>(buf));
}
</pre>
<p>There are plenty of other issues to deal with (like signals, and debugging),
but I have a basic prototype working at the moment. More to come!</p>
]]></description></item><item><title>debian on a qs22 cell blade</title><pubDate>Mon, 30 Jun 2008 14:25:28 +1000</pubDate><link>http://ozlabs.org/~jk/diary/tech/cell/debian-on-qs22.diary/</link><description><![CDATA[<p>Seeing as the <a
href="http://ibm.com/systems/bladecenter/hardware/servers/qs22/index.html"
>QS22 blades</a> are out, here's a short guide to getting debian installed.
</p>
<pre class="shaded">
<span class="prompt">[jk@qs22 ~]$ </span>grep -m1 ^cpu /proc/cpuinfo
cpu             : Cell Broadband Engine, altivec supported
<span class="prompt">[jk@qs22 ~]$ </span>lsb_release -d
Description:    Debian GNU/Linux unstable (sid)
</pre>
<h3>kernel</h3>
<p>You'll need a kernel that has support for the IBM Cell blades. If you
configure your kernel with the 'cell_defconfig' target, you should have all
the necessary options:</p>
<pre class="shaded">
<span class="prompt">[jk@pingu linux-2.6.25]$ </span>make cell_defconfig
</pre>
<p>Specifically, you need:</p>
<ul>
 <li><code>CONFIG_PPC_IBM_CELL_BLADE</code>;</li>
 <li><code>CONFIG_SERIAL_OF_PLATFORM</code>;</li>
 <li><code>CONFIG_FUSION_SAS</code>;</li>
 <li><code>CONFIG_ROOT_NFS</code>;</li>
 <li><code>CONFIG_IP_PNP_DHCP</code> and</li>
 <li><code>CONFIG_SPU_FS</code>.</li>
</ul>

<h3>root filesystem</h3>
<p>The QS22s have no internal disk (they're compute nodes, right?), so you'll
have to either:
</p>
<ul>
 <li>use a remote root filesystem, like NFS; or</li>
 <li>add a LSI SAS adaptor to the blade, and use an external SAS disk
  for the root filesystem.</li>
</ul>
<p>The installation process will be different depending on which you choose,
so just skip to the appropriate section here.</p>

<h4>NFS root</h4>

<p>For the first option, there's a number of <a
href="http://www.google.com/search?q=NFS+root+howto">NFS-root howtos</a>
around. First up, we need to build the actual debian filesystem, using
<code>debootstrap</code>. For example:</p>

<pre class="shaded">
<span class="prompt">[jk@pingu ~]$ </span>sudo debootstrap --arch=powerpc --foreign sid /srv/nfs/qs22/
</pre>

<p>This will create an entire debian filesystem in <code>/srv/nfs/qs22</code>.
We need to make a few modifications though:</p>

<ol>
 <li>add the following line to <code>/etc/inittab</code>:
  <pre class="shaded">T0:23:respawn:/sbin/getty -L ttyS0 19200 vt100</pre>
 </li>
 <li>and couple of extra device nodes:
  <pre class="shaded">
<span class="prompt">[jk@pingu ~]$ </span>cd /srv/nfs/qs22/dev
<span class="prompt">[jk@pingu dev]$ </span>sudo mknod console c 5 1
<span class="prompt">[jk@pingu dev]$ </span>sudo mknod ttyS0 c 4 64</pre>
 </li>
</ol>

<p>Once this is done, we need to complete the bootstrap on the QS22. Set up
your NFS server, and export the appropriate directory. Boot the QS22 with the
nfs root kernel options, plus "<code>rw init=/bin/sh</code>" (eg
<code>root=/dev/nfs nfsroot=server_ip:/srv/nfs/qs22 ip=dhcp rw
init=/bin/sh</code>). Then, once the machine has booted:</p>

<pre class="shaded">
<span class="prompt">sh-3.2# </span>PATH=/:/bin:/usr/bin:/sbin:/usr/sbin /debootstrap/debootstrap --second-stage
</pre>

<p>This should finish the bootstrap. After it has completed (it should finish
 with "<code>I: Base system installed successfully</code>"), reboot the
machine with the same kernel command line, minus the <code>rw
init=/bin/sh</code> arguments. Once it boots, you should have the debian login
prompt. Login as root (there will be no password, but don't forget to set one)
and away you go.</p>

<h4>SAS disk</h4>

<p>If you're using SAS, the install is much more straightforward, as you
can just use the standard debian installer. However, you may need to use a
custom kernel which supports the QS22s. This is a matter of building your own
kernel, using the <a
href="http://ftp.us.debian.org/debian/dists/testing/main/installer-powerpc/current/images/powerpc64/netboot/initrd.gz">powerpc64
debian installer image</a> as the initramfs:</p>
<pre class="shaded">
<span class="prompt">[jk@pingu linux-2.6.25]$ </span>wget http://ftp.us.debian.org/debian/dists/testing/main/installer-powerpc/current/images/powerpc64/netboot/initrd.gz
<span class="prompt">[jk@pingu linux-2.6.25]$ </span>gunzip -c < initrd.gz > initrd
<span class="prompt">[jk@pingu linux-2.6.25]$ </span>make cell_defconfig
<span class="prompt">[jk@pingu linux-2.6.25]$ </span>sed -ie 's,^CONFIG_INITRAMFS_SOURCE=".*",CONFIG_INITRAMFS_SOURCE="'$PWD'/initrd",' .config
<span class="prompt">[jk@pingu linux-2.6.25]$ </span>make
</pre>

<p>Then, just boot the kernel in <code>arch/powerpc/boot/zImage.pseries</code>. The debian installer should start, and guide you through the rest of the
installation. Since you're netbooting, you can ignore any messages about not
having a bootstrap partition, or not being able to install a kernel or
yaboot</p>

<h3>software</h3>
<p>Entirely optional, but you'll probably get the most out of your QS22 with a
few extra packages:</p>
<pre class="shaded">
<span class="prompt">[jk@qs22 ~]$ </span>sudo apt-get install openssh-server libspe2-dev spu-gcc build-essential
</pre>
]]></description></item><item><title>linux.conf.au hackfest: the solution, part three</title><pubDate>Mon, 23 Jun 2008 09:20:13 +1000</pubDate><link>http://ozlabs.org/~jk/diary/tech/cell/hackfest08-solution-3.diary/</link><description><![CDATA[<p>In <a href="http://ozlabs.org/~jk/diary/tech/cell/hackfest08-solution-2.diary">part two</a>
of this series, we had just ported a fractal renderer to the SPEs on a Cell
Broadband Engine machine. Now we're going to start the optimisation...</p>

<p>Our baseline performance is 40.7 seconds to generate the sample fractal
(using the <a
href="http://ozlabs.org/~jk/projects/lca2008-hackfest/solution/fractal.data">sample
fractal parameters</a>).</p>

<p>The initial SPE-based fractal renderer used only one SPE. However, we
have more available:</p>
<table class="data">
 <caption>Number of SPEs in currently-available in CBEA machines</caption>
 <tr>
  <th>Machine</th>
  <th>SPEs available</th>
 </tr>
 <tr>
  <td>Sony Playstation 3</td>
  <td class="num">6</td>
 </tr>
 <tr>
  <td>IBM <a href="http://ibm.com/systems/bladecenter/qs21/index.html">QS21</a>
   / <a
   href="http://ibm.com/systems/bladecenter/hardware/servers/qs22/index.html">QS22</a>
   blades.</td>
  <td class="num">16 (8 per CPU)</td>
 </tr>
</table>

<p>So, we should be able to get better performance by distributing the render
work between the SPEs. We can do this by dividing the fractal into a set of
<em>n</em> strips, where <em>n</em> is the number of SPEs available. Then,
each SPE renders its own strip of the fractal.</p>

<p>The following image shows the standard fractal, as would be rendered by 8
SPEs, with shading to show the division of the work amongst the SPEs.</p>

<div>
<img src="http://ozlabs.org/~jk/projects/lca2008-hackfest/images/fractal-striped.png"
width="300" height="251" alt="SPE fractal divisions"
style="display: block; margin-left: auto; margin-right: auto;" />
</div>

<p>In order to split up the work, we first need to tell each SPE which part of
the fractal it is to render. We'll add two variables to the
<code>spe_args</code> structure (which is passed to the SPE during program
setup), to provide the total number of threads (<code>n_threads</code>) and
the index of this SPE thread (<code>thread_idx</code>).</p>

<pre class="shaded code">
<span class="ty">struct</span> spe_args {
        <span class="ty">struct</span> fractal_params fractal;
        <span class="ty">int</span> n_threads;
        <span class="ty">int</span> thread_idx;
};
</pre>

<h3>spe changes</h3>
<p>On the SPE side, we use these parameters to alter the invocations of
<code>render_fractal</code>. We set up another couple of convenience variables:
</p>

<pre class="shaded code">
        rows_per_spe = args.fractal.rows / args.n_threads;
	start_row = rows_per_spe * args.thread_idx;
</pre>

<p>And just alter our <code>for</code>-loop to only render
<code>rows_per_spe</code> rows, rather than the entire fractal:</p>

<pre class="shaded code">
        <span class="st">for</span> (row = <span class="cs">0</span>; row &lt; rows_per_spe; row += rows_per_dma) {

                render_fractal(&amp;args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                <span class="cs">0</span>, <span class="cs">0</span>, <span class="cs">0</span>);

                <span class="cm">/* Wait for the DMA to complete */</span>
                mfc_write_tag_mask(<span class="cs">1</span> &lt;&lt; <span class="cs">0</span>);
                mfc_read_tag_status_all();
        }
</pre>

<h3>ppe changes</h3>
<p>The changes to the PPE code are fairly simple - instead of just creating one
thread, create <em>n</em> threads.</p>
<p>First though, let's add a '<code>-n</code>' argument to the program to
specify the number of threads to start:</p>

<pre class="shaded code">
        <span class="st">while</span> ((opt = getopt(argc, argv, <span class="cs">&quot;p:o:n:&quot;</span>)) != -<span class="cs">1</span>) {
                <span class="st">switch</span> (opt) {

                <span class="cm">/* other options omitted */</span>

                <span class="st">case</span> <span class="cs">'n'</span>:
                        n_threads = atoi(optarg);
                        <span class="st">break</span>;
</pre>

<p>Rather than just creating one SPE thread, we create <code>n_threads</code>.
Also, we have to set the thread-specific arguments for each thread:</p>

<pre class="shaded code">
        <span class="st">for</span> (i = <span class="cs">0</span>; i &lt; n_threads; i++) {
                <span class="cm">/* copy the fractal data into this thread's args */</span>
                memcpy(&amp;threads[i].args.fractal, fractal, <span class="st">sizeof</span>(*fractal));

                <span class="cm">/* set thread-specific arguments */</span>
                threads[i].args.n_threads = n_threads;
                threads[i].args.thread_idx = i;

                threads[i].ctx = spe_context_create(<span class="cs">0</span>, <span class="cs">NULL</span>);

                spe_program_load(threads[i].ctx, &amp;spe_fractal);

                pthread_create(&amp;threads[i].pthread, <span class="cs">NULL</span>,
                                spethread_fn, &amp;threads[i]);
        }
</pre>

<p>Now, the SPEs should be running, and generating their own slice of the
fractal. We just have to wait for them all to finish:</p>

<pre class="shaded code">
        <span class="cm">/* wait for the threads to finish */</span>
        <span class="st">for</span> (i = <span class="cs">0</span>; i &lt; n_threads; i++)
                pthread_join(threads[i].pthread, <span class="cs">NULL</span>);
</pre>

<p>If you're after the source code for the multi-threaded SPE fractal renderer,
it's available in <a
href="http://ozlabs.org/~jk/projects/lca2008-hackfest/solution/fractal.3.tar.gz">fractal.3.tar.gz</a>.</p>

<p>That's it! Now we have a multi-threaded SPE application to do the fractal
rendering. In theory, an <em>n</em> threaded program will take 1/<em>n</em>
of the time of a single-threaded implementation, let's see how that fares
with reality...</p>

<h3>performance</h3>

<p>Let's compare invocations of our multi-threaded fractal renderer, with
different values for the <code>n_threads</code> parameter.</p>

<table class="data">
 <caption>Performance of multi-threaded SPE fractal renderer</caption>
 <tr>
  <th>SPE threads</th>
  <th>Running time (sec)</th>
 </tr>
 <tr><td>1</td><td class="num">40.72</td></tr>
 <tr><td>2</td><td class="num">30.14</td></tr>
 <tr><td>4</td><td class="num">18.84</td></tr>
 <tr><td>6</td><td class="num">12.72</td></tr>
 <tr><td>8</td><td class="num">10.89</td></tr>
</table>

<p>Not too bad, but we're definitely not seeing linear scalability here; we
could expect the 8 SPE case to take around 5 seconds, rather than 11.</p>

<h3>what's slowing us down?</h3>

<p>A little investigation into the fractal generator will show that some SPE
threads are finishing long before others. This is due to the variability in
calculation time between pixels. In order to see if a point (ie, pixel) in the
fractal does <em>not</em> converge towards infinity (and gets coloured blue),
we need to do the full <code>i_max</code> tests in <code>render_fractal</code>
(<code>i_max</code> is 10,000 in our sample fractal). Other pixels may
converge much more quickly (possibly in under 10 iterations), and so are orders
of mangitude faster to calculate.</p>

<p>We end up with slices that are very quick to calculate, and others
that take longer. Of course, we have to wait for the longest-running SPE
thread to complete, so our overall runtime will be that of the slowest thread.
</p>

<p>So, let's take another aproach to distributing the workload. Rather than
dividing the fractal into contiguous slices, we can <em>interleave</em> the
SPE work. For example, if we were to use 2 SPE threads, then thread 0 would
render all the even chunks, and thread 1 would render all the odd chunks (where
a "chunk" is a set of rows that fit into a single DMA). This should even-out
the work between SPE threads.</p>

<div>
<img src="http://ozlabs.org/~jk/projects/lca2008-hackfest/images/fractal-interleaved.png"
width="300" height="251" alt="Interleaved SPE fractal divisions"
style="display: block; margin-left: auto; margin-right: auto;" />
</div>

<p>This is just a matter of changing the SPE <code>for</code>-loop a little.
Rather than the current code, which divides the work into
<code>n_threads</code> contiguous chunks:</p>

<pre class="shaded code">
        <span class="st">for</span> (row = <span class="cs">0</span>; row &lt; rows_per_spe; row += rows_per_dma) {

                render_fractal(&amp;args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                <span class="cs">0</span>, <span class="cs">0</span>, <span class="cs">0</span>);

                <span class="cm">/*</span><span class="cm"> Wait for the DMA to complete </span><span class="cm">*/</span>
                mfc_write_tag_mask(<span class="cs">1</span> &lt;&lt; <span class="cs">0</span>);
                mfc_read_tag_status_all();
        }

</pre>

<p>We change this to render every <code>n_thread</code><sup>th</sup>
chunk, starting from <code>thread_idx</code>, which gives us the
the interleaved pattern:</p>

<pre class="shaded code">
        <span class="st">for</span> (row = rows_per_dma * args.thread_idx;
                        row &lt; args.fractal.rows;
                        row += rows_per_dma * args.n_threads) {

                render_fractal(&amp;args.fractal, row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + row * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                <span class="cs">0</span>, <span class="cs">0</span>, <span class="cs">0</span>);

                <span class="cm">/* Wait for the DMA to complete */</span>
                mfc_write_tag_mask(<span class="cs">1</span> &lt;&lt; <span class="cs">0</span>);
                mfc_read_tag_status_all();
        }
</pre>

<p>An updated renderer is available in <a
href="http://ozlabs.org/~jk/projects/lca2008-hackfest/solution/fractal.4.tar.gz">fractal.4.tar.gz</a>.</p>

<p>Making this small change gives some better performance figures:</p>

<table class="data">
 <caption>Performance of multi-threaded, interleaved SPE fractal
  renderer</caption>
 <tr>
  <th>SPE threads</th>
  <th>Running time (sec)</th>
 </tr>
 <tr><td>1</td><td class="num">40.72</td></tr>
 <tr><td>2</td><td class="num">20.75</td></tr>
 <tr><td>4</td><td class="num">10.78</td></tr>
 <tr><td>6</td><td class="num">7.44</td></tr>
 <tr><td>8</td><td class="num">5.81</td></tr>
</table>

<p>We're doing much better now, but we're still nowhere near the theoretical
maximum performance of the SPEs. More optimisations in the next article...</p>

]]></description></item></channel></rss>
