diary
global namespace
hiprofile - HTML interactive profiler, version 1.0
I've recently been doing some performance analysis work on Linux, which requires profiling various applications to see how they perform under load. One of the handy tools for profiling this behaviour is oprofile, but the output can be a little difficult to interpret.
So, I've developed hiprofile, the HTML interactive profiler.
Hiprofile is a small python script to generate HTML reports for oprofile data. Output is broken down into per-program statistics:
The page for each program lists the top 20 (by default) symbols within that function:
Each of the most-hit functions are show with per-instruction profiling data, and each instruction is coloured according to how 'hot' it is. If the source code for the function is available, the output is annotated with the corresponding code:
More info and downloads are on the hiprofile project page.
linux.conf.au hackfest: the solution, part four
In the previous article in this series, we finished with a parellel SPE-based fractal renderer. This article will cover some of the major optimisations we can do to make use of the Cell Broadband Engine.
Our last benchmark gave us the following results:
| SPE threads | Running time (sec) |
|---|---|
| 1 | 40.72 |
| 2 | 20.75 |
| 4 | 10.78 |
| 6 | 7.44 |
| 8 | 5.81 |
SPE vector instructions
The registers on the SPEs are actually 128-bits wide. Rather than using the entire register for one value, most of the SPE instruction set consists of vector instructions.
For example, say we wanted to add 7 to the value in register one, and place
the result in register two. To do this, we'd use the ai ("add
immediate") instruction:
ai r2,r1,7
This instruction operates on the four 32-bit 'slots' of the 128-bit register (4 x 32 = 128) in parallel:
By using these vector instructions, we can do a lot of the fractal rendering work in parallel.
Vector instructions aren't specific to the Cell Broadband Engine - Altivec (on PowerPC processors) and SSE (on Pentium processors) offer similar capabilities for parallel processing of data. In fact, the PPE on the Cell Broadband Engine supports Altivec extensions.
vectorising the fractal renderer
The hackfest task description contains a little information on how to vectorise code:
The SPEs of the Cell/B.E. have an instruction set almost entirely composed of SIMD instructions - so you can do most computations on four operands at a time.
To take advantage of these SIMD capabilities, you can use "vector" types in your code.
For example, consider a simple function to multiply a number by three:
float multiply_by_three(float f) { return f * 3; }
By vectorising this function, we can do four operations in the same amount of time:
vector float multiply_by_three(vector float f); { return f * (vector float){3, 3, 3, 3}; }
The vector float type is similar to an array of 4
floats, except that we can operate on the vector as a whole.
If we look at the instructions generated for both the scalar and vector
versions of the multiply_by_three function, they're exactly the same.
The following code shows the compiled assembly of
multiply_by_three(), both vector and scalar versions. The
function argument is in register 3, return value is placed in register 3.
multiply_by_three: ilhu $2,16448 ; load {3, 3, 3, 3} into register 2 fm $3,$3,$2 ; register 3 = (register 3) * (register 2) bi $lr ; return to caller
So, if you're using scalar operations, we get one multiply in three instructions. Using the vector equivalent, we get four multiplies in three instructions.
So, we should look at the hot paths of the current fractal renderer, and try to vectorise where possible.
Currently, this loop of our render_fractal function is
where the SPE is doing most of the work, iterating over each pixel to find
the point where it diverges to infinity:
for (x = 0; x < params->cols; x++) {
cr = x_min + x * params->delta;
zr = 0;
zi = 0;
for (i = 0; i < params->i_max; i++) {
/* z = z^2 + c */
tmp = zr*zr - zi*zi + cr;
zi = 2.0 * zr * zi + ci;
zr = tmp;
/* if abs(z) > 2.0 */
if (unlikely(zr*zr + zi*zi > 4.0))
break;
}
colour_map(¶ms->imgbuf[r * params->cols + x],
i, params->i_max);
}
The renderer will execute the inner loop up to rows *
columns * i times, where i is between
1 and i_max. Seeing as this is the most executed path in our
renderer code, it is a good place to start our vectorisation efforts.
By vectorising, we can perform the contents of the loop on four pixels at a time (rather than just one), enabling us to reduce the number of iterations to around one quarter.
The new vectorised fractal generator is available in fractal.5.tar.gz.
All we have changed here is the SPE-side code (spe-fractal.c).
The most significant change is to the render_fractal()
function, where we have replaced the above loop with:
for (x = 0; x < params->cols; x += 4) {
escaped_i = (vector unsigned int){0, 0, 0, 0};
cr = x_min + spu_splats(params->delta * x);
zr = (vector float){0, 0, 0, 0};
zi = (vector float){0, 0, 0, 0};
for (i = 0; i < params->i_max; i++) {
/* z = z^2 + c */
tmp = zr*zr - zi*zi + cr;
zi = (vector float){2.0f, 2.0f, 2.0f, 2.0f}
* zr * zi + ci;
zr = tmp;
/* escaped = abs(z) > 2 */
escaped = spu_cmpgt(zr * zr + zi * zi, limit);
/* escaped_i = escaped ? escaped_i : i */
escaped_i = spu_sel(spu_splats(i), escaped_i,
escaped);
/* if all four words in escaped are non-zero,
* we're done */
if (!spu_extract(spu_orx(~escaped), 0))
break;
}
colour_map(¶ms->imgbuf[r * params->cols + x],
escaped_i, params->i_max);
}
This new code processes four columns at each iteration, and increments
x by four each time. We set up a vector of pixels (one per column)
at each iteration, and do the divergence tests on a vector at a time.
Most of the actual aritmetic is easy to convert; the standard multiply, add and subtract operators will work just fine with vector types.
However, the if-test to see if the number has diverged is more
complex - since we're dealing with a vector of four pixels, we can't just break
out of the inner loop when one pixel has diverged. Instead, we need to keep
processing until all vector elements to reach the divergence point, and remember
the value of i where this divergence occurs.
Basically, the above code keeps two extra variables: escaped and
escaped_i. escaped is a vector of four flags - if the
nth word of escaped is non-zero, we have detected that
the nth pixel in our vector has diverged to infinity. The
escaped_i is a similar vector of the value of i where
we first detected this divergence, which is used as the 'value' for each point
in the fractal.
We do this with a few SPE intrinsics, which allow us to use some of the lower-level SPE vector functionality:
spu_cmpgt(a,b)
|
compare greater than returns one bits in each word slot in a that is greater than the
corresponding word slot in b.
|
spu_sel(a,b,c)
|
select bits For bit each bit in c: if the bit is zero, return the
corresponding bit from a, otherwise return the corresponding
bit from b.
|
spu_orx(a)
|
or across Logical-or each word in a, and return the result in the
preferred slot (ie, the first word). The remaining three slots
are set to zero.
|
spu_extract(a,n)
|
extract vector element Return a scalar value from the nth element of
vector a
|
spu_splats(a)
|
splat scalar Return a vector consisting of each word equal to a
|
The first three intrinsics will result in just one instruction in the
compiled code. Since slot 0 is the preferred slot (and therefore where the
compiler would normally store a scalar value in a register), the call to
spu_extract(x, 0) will not result in any instructions at all. The
spu_splats instrinsic may require a variable number of
instructions, depending on the value to splat.
More detail on the SPE instrinsics is available in the PPU & SPU C/C++ Language Extension Specification and SPU Instruction Set Architecture documents on the IBM developerWorks Cell Broadband Engine Resource Center.
We also need to change the colour_map function to take
a vector of i values (rather than just a single value), and
calulate the colours of four pixels at a time. For now, we can just do this
one element at a time, using spu_extract, but this could be
improved later.
benchmarking
After vectorising the SPE fractal code, we can benchmark to see how the vectorisation affects run time. The following table shows run times for the new renederer, along with the proportion of time against the previous scalar implementation.
| SPE threads | Running time (sec) | Vector / Scalar |
|---|---|---|
| 1 | 10.70 | 0.262 |
| 2 | 5.72 | 0.372 |
| 4 | 3.29 | 0.305 |
| 6 | 2.46 | 0.330 |
| 8 | 2.06 | 0.354 |
Not too bad! We're doing much better than our original PPE-based renderer, which took 55.7 seconds to render the same fractal.
Ideally, we'd see a constant vector / scalar figure of 0.25, meaning that we're doing four times the amount of work in the same run time. However, there are still a lot of operations that we haven't optimised yet.
More optimisations in the next article!
SPE gang scheduling policies
In my previous post here, I mentioned that:
We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.
I think that may require a little explanation, so here we go:
gang scheduling policies
The idea behind SPE gang scheduling is to allow a set of related SPE contexts to be scheduled together, to allow interactions between contexts to be performed in a timely manner. For example, consider two contexts (A and B) that send mailbox messages to each other. If context A is running while context B is scheduled out, then A will spend its time-slice waiting for a message from B. If they are scheduled as a single entity, neither context will have to spend its timeslice blocked, waiting for the other context to be run.
So, we have to come up with a policy to define the behaviour of the gang scheduler. When does the gang become schedulable? Under which conditions should the gang be descheduled?
I can see four possible approaches:
policy 1: the gang is only schedulable when all contexts are runnable
In this case, the gang is only ever scheduled when all of the gang's
contexts are runnable (ie, they are being run by the spu_run system
call).
Although the simplest, this approach will never complete&emdash;consider the following:
- Context A becomes runnable
- Context B becomes runnable
- The gang is now schedulable, so both contexts are scheduled
- Because it has slightly less work to do, context A finishes before context B
- Because only one of the two contexts is runnable, the gang is no longer schedulable. Context B is never re-scheduled, so cannot complete the rest of its task
So, this policy isn't much use; perhaps we can solve this with a new approach:
policy 2: the gang is scheduled when all contexts are runnable, and descheduled when no contexts are runnable
This will solve the previous non-termination problem, in that context B will be able to terminate - the context isn't immediately descheduled when A finishes.
However, now we have a new, slightly more complex non-termination case:
- Context A becomes runnable
- Context B becomes runnable
- The gang is now schedulable, so both contexts are scheduled
- Because it has slightly less work to do, context A finishes before context B
- At the same time, context B does a PPE-assisted callback, which requires a stop-and-signal (and so leaves spu_run for just a moment)
- Because neither context is currently runnable, the gang is descheduled
- Context B finishes its callback, so re-enters spu_run to be re-scheduled. However, the policy does not allow context B to be re-scheduled, as only one of the two contexts is runnable.
Although this may sound like a rare occurrence, it's not a restriction we can pass on to the programmer. Imagine the following SPE code:
int main(void) { do_work(); printf("work done!\n"); return 0; }
Here we're doing a PPE-assisted callback (the call to printf is
implemented as a callback) before finishing. If this callback were to occur
when the other context has already completed, we would hit the non-termination
condition above.
This means that the last-running context of a gang can never do a PPE-assisted callback. In fact, to be completely safe against this non-termination, a programmer would have to avoid callbacks after any context has finished, for risk of callbacks on the rest of the gang being synchronised.
So, it looks like we need to be a little more permissive when deciding if the gang is schedulable.
policy 3: the gang is scheduled when any context is runnable, and descheduled when no contexts are runnable
This is another fairly simple approach&emdash;the gang is scheduled whenever there is any work to do. We no longer have any non-termination conditions, as 'having work to do' will result in 'doing work'.
The tricky part is that it will require us to change one of the fundamental assumptions about spufs: currently, we don't schedule any context unless it is runnable. Because we schedule the entire gang when one if its context becomes runnable, we have to now schedule a number of non-runnable contexts.
The good news is that I've already done a little experimental work to overcome this general restriction in spufs.
The last approach is a little more complex, but works around this restriction:
policy 4: schedule the runnable contexts of a gang, and reserve SPEs for the non-runnable contexts
This is just like policy 3, but instead of actually scheduling the non-runnable contexts, we reserve a SPE for them.
This way, a non-runnable context does not need to be loaded, but can be quickly scheduled when it becomes runnable. The downside is that we're only half-implementing gang scheduling; there still may be interactions to a non-runnable SPE (eg, accesses to the problem state mapping from a running context in the same gang) that will cause running contexts to become blocked.
So, which policy is best for spufs?
Policies 1 and 2 have significant flaws in their approach. It's quite possible that either will lead to non-termination conditions in fairly simple user programs. I don't think we can 'work around' this with a restriction on the programmer.
Policy 4 will require a mechanism for reserving SPEs for a particular context; I'm not convinced the extra complexity is worth the effort, especially as this doesn't allow us to implement gang scheduling properly.
Currently, Luke Browning and André Detsch have a work-in progress patch series for gang scheduling, based on policy 3.
external context scheduling in spufs
At present, the spufs code has the invariant that a context is only
ever loaded to an SPE when it is being run; ie, a thread is calling the
spu_run syscall on the context.
However, there are situations where we may want to load the context without it being run. For example, to use the SPU's DMA engine from the PPE, requires the PPE thread to write to registers in the SPU's problem-state mapping (psmap). Faults on the psmap area can only be serviced while the context is loaded, so will block until someone runs the context. Ideally, we could allow such accesses to the psmap without the spu_run call. We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.
So, I've been working on some experimental changes to allow "external
scheduling" for SPE contexts. The "external" refers to a thread external to the
SPE's usual method of scheduling (ie, it's owning thread calling
spu_run). In the example above, the external schedule would be
caused by the fault handler for the problem-state mapping.
Although a context may be scheduled to an SPE, we still can't always guarantee
forward progress. For example, in the "use the psmap to access the DMA engine"
scenario, a DMA may cause a major page fault, which needs a controlling thread
to service. In this case, the only way to ensure forward progress is through
calling spu_run. However, I have some ideas on how we can
remove this restriction later.
the interface
First up, we need to tell the spufs scheduler that we want a context to be loaded:
/* * Request an 'external' schedule for this context. * * The context will be either loaded to an SPU, or added to the run queue, * depending on SPU availability. * * Should be called with the context's state mutex locked, and the context * in SPU_STATE_SAVED state. */ int spu_request_external_schedule(struct spu_context *ctx);
After loading the context with spu_request_external_schedule, we
need a way to tell the scheduler that the context can be de-scheduled:
/* * The context should be unscheduled at the end of its timeslice */ void spu_cancel_external_schedule(struct spu_context *ctx);These functions are implemented by incrementing or decrementing a count of "external schedulers" on the context. If multiple threads are requesting an external schedule, then the first will activate the context. When the last thread calls the cancel method, the context can be descheduled.
usage
We can use these two functions to allow the problem-state mapping fault handler to proceed outside of spu_run:
--- a/arch/powerpc/platforms/cell/spufs/file.c +++ b/arch/powerpc/platforms/cell/spufs/file.c @@ -413,9 +413,11 @@ static int spufs_ps_fault(struct vm_area_struct *vma, if (ctx->state == SPU_STATE_SAVED) { up_read(¤t->mm->mmap_sem); + spu_request_external_schedule(ctx); spu_context_nospu_trace(spufs_ps_fault__sleep, ctx); ret = spufs_wait(ctx->run_wq, ctx->state == SPU_STATE_LOADED); spu_context_trace(spufs_ps_fault__wake, ctx, ctx->spu); + spu_cancel_external_schedule(ctx); down_read(¤t->mm->mmap_sem); } else { area = ctx->spu->problem_phys + ps_offs;
Note that the spu_cancel_external_schedule function doesn't unload the context right away; if it did, the refault would fail too, and we'd end up in an infinite loop of faults. Instead, it keeps the context scheduled for the rest of its timeslice. This gives the faulting thread time to access the mapping after the fault handler has been invoked.
We also need to do a bit of trickery with the priorities of contexts during external schedule operations. If a high-priority thread access the problem-state mapping of a low-priority context, we want the context to temporarily inherit the higher priority. To do this, we raise the priority when spu_request_external_schedule is called, and drop it back after the context has finished its timeslice on the SPU.
the code
I've created a development branch in the spufs repository for these changes, which is available:
- via git:
git://git.kernel.org/pub/scm/linux/kernel/git/jk/spufs.git, in theext-schedbranch; or - on the browsable gitweb interface.
Note that this is an experimental codebase, expect breakages!
new patchwork beta
We've had a new version of patchwork - the web-based patch-tracking system - online for a few weeks now at patchwork.ozlabs.org.
After Paul's presentation on patchwork at the 2008 Kernel Summit, there has been wider interest in patchwork setups for other projects. Patchwork originally hosted the Linux on Power and Linux on Cell lists, we've since added netdev, linux-mtd and linux-ext4. I've also added the main Linux Kernel mailing list (lkml), just to see how the new patchwork handles the load; all has been okay so far.
If you're interested in installing the new patchwork at your own site, you can grab the source from the patchwork project page. Installation can be a little tricky, so feel free to mail me if you need a hand.
2.6.26 on a Lenovo x61 thinkpad
It looks like the iwl driver is slightly broken in the 2.6.26 release - connections will drop-out after 10 seconds or so.
The workaround for this is to enable the config option
CONFIG_IWL4965_HT.
asynchronous spu contexts, initial designs
I've recently been working on some changes to the spufs code, and thought I'd write-up some of the details.
At present, the spu_run syscall (used to run a SPU context)
blocks until the SPU program has exited (or some other event has happened,
such as a non-serviceable fault). This means that to take advantage of the
SPUs, you really need to start a new thread for each SPU context that you
create, otherwise your application will be sitting around waiting for each SPU
context to complete.
In fact, we have an invariant in the spufs code at the moment that only
contexts that are currently being spu_run will ever be runnable
(and, at the moment, schedulable).
Ben H and I have been chatting about some ideas about asynchronous spu
contexts. This means that the userspace app can start the context, then later
retrieve the status of the SPU context (to see if it has stopped, faulted, or
whatever). We can then use standard POSIX semantics like poll() to
see if a context is still running or has generated any "events", then handle
these events when they become available.
In effect, this is similar to spu_run: currently, the
spu_run syscall runs the SPU, then blocks until an event happens,
which is then returned to userpsace as the return value of
spu_run. The main difference is that we don't block in the kernel
while the SPU is running.
So, I've been coding up an experimental change to spufs. Firstly, we have
to explicitly tell the kernel that we want a context to operate in asynchronous
mode, so I've added a new flag to the spu_create syscall:
SPU_CREATE_ASYNC.
I've opted for a file-based interface to these asynchronous contexts -
SPU events are retrieved by reading from a file. Contexts that are created with
the SPU_CREATE_ASYNC flag have an extra file present (called
something like "event") in their context directory in the
spufs mount. Reading from this file allows applications to retreive events
that the SPU program has raised.
We need to define a format for the data read from this events file, so here's something to get started with:
struct spu_event { uint32_t event; uint32_t status; uint32_t npc; };
- where the event member specifies which event happened - a
stop-and-signal for example.
The status and npc members return the status of the SPU and the next program counter register, respectively. While not strictly necessary (this information is available from other files in spufs), it's very likely that the application will need these values in order to handle the event.
So, users of this interface may look something like this:
uint32_t npc = 0; struct context { int fd; int event_fd; } context; /* create the context */ context.fd = spu_create("/spu/ctx", NULL, SPU_CREATE_ASYNC); /* open the events file */ context.event_fd = openat(context.fd, "event", O_RDWR); /* start the context running. unlike the spu_run syscall, * this function does not block for the duration of the * spu program */ run_context(&context, npc); for (;;) { struct spu_event event; /* get the next event caused by the SPU */ read(context.event_fd, &event, sizeof(event)); if (event.event == SPU_EVENT_STOP) break; /* handle other event ... */ }
Note that the userspace examples here are not what we'd present to Cell application developers. They're more low-level examples of how the new asynchronous kernel interface works. In fact, the changes could be completely transparent to applications which use the libSPE interface.
This isn't far from the API provided by the current spu_run
syscall, except that we're not waiting in the kernel while the SPU is
running.
Also, we're going to need to control the SPU somehow - for example, we need
to implement the run_context function in the pseudocode above.
Rather than overloading the spu_run syscall, I've opted to use the
same event file - writes to this file will allow userspace to control the SPU.
I'm still working out the exact format of these writes, but the way I've
implemented it at the moment is that the application can write structures of
this layout to the file:
struct spu_control { uint32_t op; char data[]; };
The contents of the data member depends on the operation
requested (specified by the op member). For example, a 'start spu'
operation would have four extra bytes - a uint32_t containing the
NPC to start the SPU execution from. A 'stop spu' operation doesn't require any
extra parameters, so the data member would be 0 bytes long.
This would allow us to implement the run_context function
as follows:
void run_context(struct context *context, uint32_t npc) { uint32_t buf[2]; buf[0] = SPU_CONTROL_START_SPU; buf[1] = npc; write(context.event_fd, buf, sizeof(buf)); }
There are plenty of other issues to deal with (like signals, and debugging), but I have a basic prototype working at the moment. More to come!
debian on a qs22 cell blade
Seeing as the QS22 blades are out, here's a short guide to getting debian installed.
[jk@qs22 ~]$ grep -m1 ^cpu /proc/cpuinfo cpu : Cell Broadband Engine, altivec supported [jk@qs22 ~]$ lsb_release -d Description: Debian GNU/Linux unstable (sid)
kernel
You'll need a kernel that has support for the IBM Cell blades. If you configure your kernel with the 'cell_defconfig' target, you should have all the necessary options:
[jk@pingu linux-2.6.25]$ make cell_defconfig
Specifically, you need:
CONFIG_PPC_IBM_CELL_BLADE;CONFIG_SERIAL_OF_PLATFORM;CONFIG_FUSION_SAS;CONFIG_ROOT_NFS;CONFIG_IP_PNP_DHCPandCONFIG_SPU_FS.
root filesystem
The QS22s have no internal disk (they're compute nodes, right?), so you'll have to either:
- use a remote root filesystem, like NFS; or
- add a LSI SAS adaptor to the blade, and use an external SAS disk for the root filesystem.
The installation process will be different depending on which you choose, so just skip to the appropriate section here.
NFS root
For the first option, there's a number of NFS-root howtos
around. First up, we need to build the actual debian filesystem, using
debootstrap. For example:
[jk@pingu ~]$ sudo debootstrap --arch=powerpc --foreign sid /srv/nfs/qs22/
This will create an entire debian filesystem in /srv/nfs/qs22.
We need to make a few modifications though:
- add the following line to
/etc/inittab:T0:23:respawn:/sbin/getty -L ttyS0 19200 vt100
- and couple of extra device nodes:
[jk@pingu ~]$ cd /srv/nfs/qs22/dev [jk@pingu dev]$ sudo mknod console c 5 1 [jk@pingu dev]$ sudo mknod ttyS0 c 4 64
Once this is done, we need to complete the bootstrap on the QS22. Set up
your NFS server, and export the appropriate directory. Boot the QS22 with the
nfs root kernel options, plus "rw init=/bin/sh" (eg
root=/dev/nfs nfsroot=server_ip:/srv/nfs/qs22 ip=dhcp rw
init=/bin/sh). Then, once the machine has booted:
sh-3.2# PATH=/:/bin:/usr/bin:/sbin:/usr/sbin /debootstrap/debootstrap --second-stage
This should finish the bootstrap. After it has completed (it should finish
with "I: Base system installed successfully"), reboot the
machine with the same kernel command line, minus the rw
init=/bin/sh arguments. Once it boots, you should have the debian login
prompt. Login as root (there will be no password, but don't forget to set one)
and away you go.
SAS disk
If you're using SAS, the install is much more straightforward, as you can just use the standard debian installer. However, you may need to use a custom kernel which supports the QS22s. This is a matter of building your own kernel, using the powerpc64 debian installer image as the initramfs:
[jk@pingu linux-2.6.25]$ wget http://ftp.us.debian.org/debian/dists/testing/main/installer-powerpc/current/images/powerpc64/netboot/initrd.gz [jk@pingu linux-2.6.25]$ gunzip -c < initrd.gz > initrd [jk@pingu linux-2.6.25]$ make cell_defconfig [jk@pingu linux-2.6.25]$ sed -ie 's,^CONFIG_INITRAMFS_SOURCE=".*",CONFIG_INITRAMFS_SOURCE="'$PWD'/initrd",' .config [jk@pingu linux-2.6.25]$ make
Then, just boot the kernel in arch/powerpc/boot/zImage.pseries. The debian installer should start, and guide you through the rest of the
installation. Since you're netbooting, you can ignore any messages about not
having a bootstrap partition, or not being able to install a kernel or
yaboot
software
Entirely optional, but you'll probably get the most out of your QS22 with a few extra packages:
[jk@qs22 ~]$ sudo apt-get install openssh-server libspe2-dev spu-gcc build-essential
linux.conf.au hackfest: the solution, part three
In part two of this series, we had just ported a fractal renderer to the SPEs on a Cell Broadband Engine machine. Now we're going to start the optimisation...
Our baseline performance is 40.7 seconds to generate the sample fractal (using the sample fractal parameters).
The initial SPE-based fractal renderer used only one SPE. However, we have more available:
| Machine | SPEs available |
|---|---|
| Sony Playstation 3 | 6 |
| IBM QS21 / QS22 blades. | 16 (8 per CPU) |
So, we should be able to get better performance by distributing the render work between the SPEs. We can do this by dividing the fractal into a set of n strips, where n is the number of SPEs available. Then, each SPE renders its own strip of the fractal.
The following image shows the standard fractal, as would be rendered by 8 SPEs, with shading to show the division of the work amongst the SPEs.
In order to split up the work, we first need to tell each SPE which part of
the fractal it is to render. We'll add two variables to the
spe_args structure (which is passed to the SPE during program
setup), to provide the total number of threads (n_threads) and
the index of this SPE thread (thread_idx).
struct spe_args { struct fractal_params fractal; int n_threads; int thread_idx; };
spe changes
On the SPE side, we use these parameters to alter the invocations of
render_fractal. We set up another couple of convenience variables:
rows_per_spe = args.fractal.rows / args.n_threads;
start_row = rows_per_spe * args.thread_idx;
And just alter our for-loop to only render
rows_per_spe rows, rather than the entire fractal:
for (row = 0; row < rows_per_spe; row += rows_per_dma) {
render_fractal(&args.fractal, start_row + row,
rows_per_dma);
mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
bytes_per_row * rows_per_dma,
0, 0, 0);
/* Wait for the DMA to complete */
mfc_write_tag_mask(1 << 0);
mfc_read_tag_status_all();
}
ppe changes
The changes to the PPE code are fairly simple - instead of just creating one thread, create n threads.
First though, let's add a '-n' argument to the program to
specify the number of threads to start:
while ((opt = getopt(argc, argv, "p:o:n:")) != -1) {
switch (opt) {
/* other options omitted */
case 'n':
n_threads = atoi(optarg);
break;
Rather than just creating one SPE thread, we create n_threads.
Also, we have to set the thread-specific arguments for each thread:
for (i = 0; i < n_threads; i++) {
/* copy the fractal data into this thread's args */
memcpy(&threads[i].args.fractal, fractal, sizeof(*fractal));
/* set thread-specific arguments */
threads[i].args.n_threads = n_threads;
threads[i].args.thread_idx = i;
threads[i].ctx = spe_context_create(0, NULL);
spe_program_load(threads[i].ctx, &spe_fractal);
pthread_create(&threads[i].pthread, NULL,
spethread_fn, &threads[i]);
}
Now, the SPEs should be running, and generating their own slice of the fractal. We just have to wait for them all to finish:
/* wait for the threads to finish */
for (i = 0; i < n_threads; i++)
pthread_join(threads[i].pthread, NULL);
If you're after the source code for the multi-threaded SPE fractal renderer, it's available in fractal.3.tar.gz.
That's it! Now we have a multi-threaded SPE application to do the fractal rendering. In theory, an n threaded program will take 1/n of the time of a single-threaded implementation, let's see how that fares with reality...
performance
Let's compare invocations of our multi-threaded fractal renderer, with
different values for the n_threads parameter.
| SPE threads | Running time (sec) |
|---|---|
| 1 | 40.72 |
| 2 | 30.14 |
| 4 | 18.84 |
| 6 | 12.72 |
| 8 | 10.89 |
Not too bad, but we're definitely not seeing linear scalability here; we could expect the 8 SPE case to take around 5 seconds, rather than 11.
what's slowing us down?
A little investigation into the fractal generator will show that some SPE
threads are finishing long before others. This is due to the variability in
calculation time between pixels. In order to see if a point (ie, pixel) in the
fractal does not converge towards infinity (and gets coloured blue),
we need to do the full i_max tests in render_fractal
(i_max is 10,000 in our sample fractal). Other pixels may
converge much more quickly (possibly in under 10 iterations), and so are orders
of mangitude faster to calculate.
We end up with slices that are very quick to calculate, and others that take longer. Of course, we have to wait for the longest-running SPE thread to complete, so our overall runtime will be that of the slowest thread.
So, let's take another aproach to distributing the workload. Rather than dividing the fractal into contiguous slices, we can interleave the SPE work. For example, if we were to use 2 SPE threads, then thread 0 would render all the even chunks, and thread 1 would render all the odd chunks (where a "chunk" is a set of rows that fit into a single DMA). This should even-out the work between SPE threads.
This is just a matter of changing the SPE for-loop a little.
Rather than the current code, which divides the work into
n_threads contiguous chunks:
for (row = 0; row < rows_per_spe; row += rows_per_dma) {
render_fractal(&args.fractal, start_row + row,
rows_per_dma);
mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
bytes_per_row * rows_per_dma,
0, 0, 0);
/* Wait for the DMA to complete */
mfc_write_tag_mask(1 << 0);
mfc_read_tag_status_all();
}
We change this to render every n_threadth
chunk, starting from thread_idx, which gives us the
the interleaved pattern:
for (row = rows_per_dma * args.thread_idx;
row < args.fractal.rows;
row += rows_per_dma * args.n_threads) {
render_fractal(&args.fractal, row,
rows_per_dma);
mfc_put(buf, ppe_buf + row * bytes_per_row,
bytes_per_row * rows_per_dma,
0, 0, 0);
/* Wait for the DMA to complete */
mfc_write_tag_mask(1 << 0);
mfc_read_tag_status_all();
}
An updated renderer is available in fractal.4.tar.gz.
Making this small change gives some better performance figures:
| SPE threads | Running time (sec) |
|---|---|
| 1 | 40.72 |
| 2 | 20.75 |
| 4 | 10.78 |
| 6 | 7.44 |
| 8 | 5.81 |
We're doing much better now, but we're still nowhere near the theoretical maximum performance of the SPEs. More optimisations in the next article...
qs22 released
The next revision of IBM Cell Broadband Engine machines has just been released - the QS22 blade. The QS22 has five-times the double-precision floating point performance of the previous Cell blade (the QS21), but is instruction-set compatible. This is also the first Cell/B.E. machine to use DDR2 memory, and can hold up to 32GB.
In other powerpc news, Terra Soft Solutions have announced the PowerStation - a deskside development machine, based on 2 dual-core PowerPC 970MP CPUs. It comes with Yellow Dog Linux installed, including the Cell/B.E. SDK. Hugh (amongst many others) has been doing a lot of great work getting this machine out the door (he has a number of posts on his blog if you're keen to see the "making of" feature). Nice work Hugh!