
linux.conf.au hackfest: the solution, part four

08/01/2009, [ tech / cell / ] [ link ]

In the previous article in this series, we finished with a parallel SPE-based fractal renderer. This article covers some of the major optimisations we can apply to make better use of the Cell Broadband Engine.

Our last benchmark gave us the following results:

Performance of multi-threaded, interleaved SPE fractal renderer
SPE threads Running time (sec)

SPE vector instructions

The registers on the SPEs are actually 128-bits wide. Rather than using the entire register for one value, most of the SPE instruction set consists of vector instructions.

For example, say we wanted to add 7 to the value in register one, and place the result in register two. To do this, we'd use the ai ("add immediate") instruction:

        ai r2,r1,7

This instruction operates on the four 32-bit 'slots' of the 128-bit register (4 x 32 = 128) in parallel:

Diagram of vector addition instruction

By using these vector instructions, we can do a lot of the fractal rendering work in parallel.

Vector instructions aren't specific to the Cell Broadband Engine - Altivec (on PowerPC processors) and SSE (on Pentium processors) offer similar capabilities for parallel processing of data. In fact, the PPE on the Cell Broadband Engine supports Altivec extensions.

vectorising the fractal renderer

The hackfest task description contains a little information on how to vectorise code:

The SPEs of the Cell/B.E. have an instruction set almost entirely composed of SIMD instructions - so you can do most computations on four operands at a time.

To take advantage of these SIMD capabilities, you can use "vector" types in your code.

For example, consider a simple function to multiply a number by three:

float multiply_by_three(float f)
{
        return f * 3;
}

By vectorising this function, we can do four operations in the same amount of time:

vector float multiply_by_three(vector float f)
{
        return f * (vector float){3, 3, 3, 3};
}

The vector float type is similar to an array of 4 floats, except that we can operate on the vector as a whole.

If we look at the instructions generated for both the scalar and vector versions of the multiply_by_three function, they're exactly the same. The following code shows the compiled assembly of multiply_by_three(), both vector and scalar versions. The function argument is in register 3, return value is placed in register 3.

        ilhu   $2,16448    ; load {3, 3, 3, 3} into register 2
        fm     $3,$3,$2    ; register 3 = (register 3) * (register 2)
        bi     $lr         ; return to caller

So, using scalar operations, we get one multiply in three instructions. Using the vector equivalent, we get four multiplies in the same three instructions.

So, we should look at the hot paths of the current fractal renderer, and try to vectorise where possible.

Currently, this loop of our render_fractal function is where the SPE is doing most of the work, iterating over each pixel to find the point where it diverges to infinity:

        for (x = 0; x < params->cols; x++) {
                cr = x_min + x * params->delta;

                zr = 0;
                zi = 0;

                for (i = 0; i < params->i_max; i++)  {
                        /* z = z^2 + c */
                        tmp = zr*zr - zi*zi + cr;
                        zi =  2.0 * zr * zi + ci;
                        zr = tmp;

                        /* if abs(z) > 2.0 */
                        if (unlikely(zr*zr + zi*zi > 4.0))
                                break;
                }

                colour_map(&params->imgbuf[r * params->cols + x],
                                i, params->i_max);
        }

The renderer will execute the inner loop up to rows * columns * i times, where i is between 1 and i_max for each pixel. Seeing as this is the most-executed path in our renderer code, it is a good place to start our vectorisation efforts.

By vectorising, we can perform the contents of the loop on four pixels at a time (rather than just one), enabling us to reduce the number of iterations to around one quarter.

The new vectorised fractal generator is available in fractal.5.tar.gz. All we have changed here is the SPE-side code (spe-fractal.c). The most significant change is to the render_fractal() function, where we have replaced the above loop with:

        for (x = 0; x < params->cols; x += 4) {
                escaped_i = (vector unsigned int){0, 0, 0, 0};
                cr = x_min + spu_splats(params->delta * x);

                zr = (vector float){0, 0, 0, 0};
                zi = (vector float){0, 0, 0, 0};

                for (i = 0; i < params->i_max; i++)  {

                        /* z = z^2 + c */
                        tmp = zr*zr - zi*zi + cr;
                        zi =  (vector float){2.0f, 2.0f, 2.0f, 2.0f}
                                * zr * zi + ci;
                        zr = tmp;

                        /* escaped = abs(z) > 2 */
                        escaped = spu_cmpgt(zr * zr + zi * zi, limit);

                        /* escaped_i = escaped ? escaped_i : i */
                        escaped_i = spu_sel(spu_splats(i), escaped_i,
                                        escaped);

                        /* if all four words in escaped are non-zero,
                         * we're done */
                        if (!spu_extract(spu_orx(~escaped), 0))
                                break;
                }

                colour_map(&params->imgbuf[r * params->cols + x],
                                escaped_i, params->i_max);
        }

This new code processes four columns at each iteration, and increments x by four each time. We set up a vector of pixels (one per column) at each iteration, and do the divergence tests on a vector at a time.

Most of the actual arithmetic is easy to convert; the standard multiply, add and subtract operators work just fine with vector types.

However, the if-test to see whether the number has diverged is more complex - since we're dealing with a vector of four pixels, we can't just break out of the inner loop when one pixel has diverged. Instead, we need to keep processing until all vector elements have reached the divergence point, and remember the value of i at which each divergence occurs.

Basically, the above code keeps two extra variables: escaped and escaped_i. escaped is a vector of four flags - if the nth word of escaped is non-zero, we have detected that the nth pixel in our vector has diverged to infinity. escaped_i is a similar vector holding the value of i at which we first detected this divergence, which is used as the 'value' for each point in the fractal.

We do this with a few SPE intrinsics, which allow us to use some of the lower-level SPE vector functionality:

SPE intrinsics used in vectorised fractal code
spu_cmpgt(a,b) compare greater than
Returns all-one bits in each word slot of the result where the corresponding word in a is greater than the corresponding word in b, and zero bits otherwise.
spu_sel(a,b,c) select bits
For each bit in c: if the bit is zero, return the corresponding bit from a, otherwise return the corresponding bit from b.
spu_orx(a) or across
Logical-or each word in a, and return the result in the preferred slot (ie, the first word). The remaining three slots are set to zero.
spu_extract(a,n) extract vector element
Return a scalar value from the nth element of vector a.
spu_splats(a) splat scalar
Return a vector with each word equal to a.

The first three intrinsics each compile to a single instruction. Since slot 0 is the preferred slot (and therefore where the compiler would normally keep a scalar value in a register), the call to spu_extract(x, 0) will not result in any instructions at all. The spu_splats intrinsic may require a variable number of instructions, depending on the value to splat.

More detail on the SPE intrinsics is available in the PPU & SPU C/C++ Language Extension Specification and SPU Instruction Set Architecture documents on the IBM developerWorks Cell Broadband Engine Resource Center.

We also need to change the colour_map function to take a vector of i values (rather than just a single value), and calculate the colours of four pixels at a time. For now, we can just do this one element at a time, using spu_extract, but this could be improved later.


After vectorising the SPE fractal code, we can benchmark to see how the vectorisation affects run time. The following table shows run times for the new renderer, along with the proportion of run time relative to the previous scalar implementation.

Performance of multi-threaded, interleaved, vectorised SPE fractal renderer
SPE threads Running time (sec) Vector / Scalar

Not too bad! We're doing much better than our original PPE-based renderer, which took 55.7 seconds to render the same fractal.

Ideally, we'd see a constant vector / scalar figure of 0.25, meaning that we're doing four times the amount of work in the same run time. However, there are still a lot of operations that we haven't optimised yet.

More optimisations in the next article!

SPE gang scheduling policies

18/11/2008, [ tech / cell / spufs / ] [ link ]

In my previous post here, I mentioned that:

We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.

I think that may require a little explanation, so here we go:

gang scheduling policies

The idea behind SPE gang scheduling is to allow a set of related SPE contexts to be scheduled together, to allow interactions between contexts to be performed in a timely manner. For example, consider two contexts (A and B) that send mailbox messages to each other. If context A is running while context B is scheduled out, then A will spend its time-slice waiting for a message from B. If they are scheduled as a single entity, neither context will have to spend its timeslice blocked, waiting for the other context to be run.

So, we have to come up with a policy to define the behaviour of the gang scheduler. When does the gang become schedulable? Under which conditions should the gang be descheduled?

I can see four possible approaches:

policy 1: the gang is only schedulable when all contexts are runnable

In this case, the gang is only ever scheduled when all of the gang's contexts are runnable (ie, they are being run by the spu_run system call).

Although this is the simplest approach, it can prevent a gang from ever completing - consider the following:

  1. Context A becomes runnable
  2. Context B becomes runnable
  3. The gang is now schedulable, so both contexts are scheduled
  4. Because it has slightly less work to do, context A finishes before context B
  5. Because only one of the two contexts is runnable, the gang is no longer schedulable. Context B is never re-scheduled, so cannot complete the rest of its task

So, this policy isn't much use; perhaps we can solve this with a new approach:

policy 2: the gang is scheduled when all contexts are runnable, and descheduled when no contexts are runnable

This will solve the previous non-termination problem, in that context B will be able to terminate - the context isn't immediately descheduled when A finishes.

However, now we have a new, slightly more complex non-termination case:

  1. Context A becomes runnable
  2. Context B becomes runnable
  3. The gang is now schedulable, so both contexts are scheduled
  4. Because it has slightly less work to do, context A finishes before context B
  5. At the same time, context B does a PPE-assisted callback, which requires a stop-and-signal (and so leaves spu_run for just a moment)
  6. Because neither context is currently runnable, the gang is descheduled
  7. Context B finishes its callback, so re-enters spu_run to be re-scheduled. However, the policy does not allow context B to be re-scheduled, as only one of the two contexts is runnable.

Although this may sound like a rare occurrence, it's not a restriction we can pass on to the programmer. Imagine the following SPE code:

int main(void)
        printf("work done!\n");
        return 0;

Here we're doing a PPE-assisted callback (the call to printf is implemented as a callback) before finishing. If this callback were to occur when the other context has already completed, we would hit the non-termination condition above.

This means that the last-running context of a gang can never safely do a PPE-assisted callback. In fact, to be completely safe against this non-termination, a programmer would have to avoid callbacks after any context has finished, in case the callbacks on the rest of the gang happen to coincide.

So, it looks like we need to be a little more permissive when deciding if the gang is schedulable.

policy 3: the gang is scheduled when any context is runnable, and descheduled when no contexts are runnable

This is another fairly simple approach - the gang is scheduled whenever there is any work to do. We no longer have any non-termination conditions, as 'having work to do' will result in 'doing work'.

The tricky part is that it will require us to change one of the fundamental assumptions of spufs: currently, we don't schedule any context unless it is runnable. Because we schedule the entire gang when one of its contexts becomes runnable, we now have to schedule a number of non-runnable contexts.

The good news is that I've already done a little experimental work to overcome this general restriction in spufs.

The last approach is a little more complex, but works around this restriction:

policy 4: schedule the runnable contexts of a gang, and reserve SPEs for the non-runnable contexts

This is just like policy 3, but instead of actually scheduling the non-runnable contexts, we reserve an SPE for them.

This way, a non-runnable context does not need to be loaded, but can be quickly scheduled when it becomes runnable. The downside is that we're only half-implementing gang scheduling; there may still be interactions with a non-runnable context (eg, accesses to its problem state mapping from a running context in the same gang) that will cause running contexts to block.

So, which policy is best for spufs?

Policies 1 and 2 have significant flaws in their approach. It's quite possible that either will lead to non-termination conditions in fairly simple user programs. I don't think we can 'work around' this with a restriction on the programmer.

Policy 4 will require a mechanism for reserving SPEs for a particular context; I'm not convinced the extra complexity is worth the effort, especially as this doesn't allow us to implement gang scheduling properly.

Currently, Luke Browning and André Detsch have a work-in-progress patch series for gang scheduling, based on policy 3.

external context scheduling in spufs

14/11/2008, [ tech / cell / spufs / ] [ link ]

At present, the spufs code has the invariant that a context is only ever loaded to an SPE when it is being run; ie, a thread is calling the spu_run syscall on the context.

However, there are situations where we may want to load the context without it being run. For example, using the SPU's DMA engine from the PPE requires the PPE thread to write to registers in the SPU's problem-state mapping (psmap). Faults on the psmap area can only be serviced while the context is loaded, so they block until someone runs the context. Ideally, we could allow such accesses to the psmap without the spu_run call. We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.

So, I've been working on some experimental changes to allow "external scheduling" for SPE contexts. The "external" refers to a thread external to the SPE's usual method of scheduling (ie, its owning thread calling spu_run). In the example above, the external schedule would be caused by the fault handler for the problem-state mapping.

Although a context may be scheduled to an SPE, we still can't always guarantee forward progress. For example, in the "use the psmap to access the DMA engine" scenario, a DMA may cause a major page fault, which needs a controlling thread to service. In this case, the only way to ensure forward progress is through calling spu_run. However, I have some ideas on how we can remove this restriction later.

the interface

First up, we need to tell the spufs scheduler that we want a context to be loaded:

/*
 * Request an 'external' schedule for this context.
 * The context will be either loaded to an SPU, or added to the run queue,
 * depending on SPU availability.
 * Should be called with the context's state mutex locked, and the context
 * in SPU_STATE_SAVED state.
 */
int spu_request_external_schedule(struct spu_context *ctx);

After loading the context with spu_request_external_schedule, we need a way to tell the scheduler that the context can be de-scheduled:

/*
 * The context should be unscheduled at the end of its timeslice.
 */
void spu_cancel_external_schedule(struct spu_context *ctx);

These functions are implemented by incrementing or decrementing a count of "external schedulers" on the context. If multiple threads request an external schedule, the first activates the context; when the last thread calls the cancel function, the context can be descheduled.


We can use these two functions to allow the problem-state mapping fault handler to proceed outside of spu_run:

--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -413,9 +413,11 @@ static int spufs_ps_fault(struct vm_area_struct *vma,

        if (ctx->state == SPU_STATE_SAVED) {
+               spu_request_external_schedule(ctx);
                spu_context_nospu_trace(spufs_ps_fault__sleep, ctx);
                ret = spufs_wait(ctx->run_wq, ctx->state == SPU_STATE_LOADED);
                spu_context_trace(spufs_ps_fault__wake, ctx, ctx->spu);
+               spu_cancel_external_schedule(ctx);
        } else {
                area = ctx->spu->problem_phys + ps_offs;

Note that the spu_cancel_external_schedule function doesn't unload the context right away; if it did, the refault would fail too, and we'd end up in an infinite loop of faults. Instead, it keeps the context scheduled for the rest of its timeslice. This gives the faulting thread time to access the mapping after the fault handler has been invoked.

We also need to do a bit of trickery with the priorities of contexts during external schedule operations. If a high-priority thread accesses the problem-state mapping of a low-priority context, we want the context to temporarily inherit the higher priority. To do this, we raise the priority when spu_request_external_schedule is called, and drop it back after the context has finished its timeslice on the SPU.

the code

I've created a development branch in the spufs repository for these changes, which is available:

  • via git: git://git.kernel.org/pub/scm/linux/kernel/git/jk/spufs.git, in the ext-sched branch; or
  • on the browsable gitweb interface.

Note that this is an experimental codebase, expect breakages!

asynchronous spu contexts, initial designs

16/07/2008, [ tech / cell / spufs / ] [ link ]

I've recently been working on some changes to the spufs code, and thought I'd write-up some of the details.

At present, the spu_run syscall (used to run a SPU context) blocks until the SPU program has exited (or some other event has happened, such as a non-serviceable fault). This means that to take advantage of the SPUs, you really need to start a new thread for each SPU context that you create, otherwise your application will be sitting around waiting for each SPU context to complete.

In fact, we have an invariant in the spufs code at the moment that only contexts that are currently being spu_run will ever be runnable (and, at the moment, schedulable).

Ben H and I have been chatting about some ideas about asynchronous spu contexts. This means that the userspace app can start the context, then later retrieve the status of the SPU context (to see if it has stopped, faulted, or whatever). We can then use standard POSIX semantics like poll() to see if a context is still running or has generated any "events", then handle these events when they become available.

In effect, this is similar to spu_run: currently, the spu_run syscall runs the SPU, then blocks until an event happens, which is then returned to userspace as the return value of spu_run. The main difference is that we don't block in the kernel while the SPU is running.

So, I've been coding up an experimental change to spufs. Firstly, we have to explicitly tell the kernel that we want a context to operate in asynchronous mode, so I've added a new flag to the spu_create syscall: SPU_CREATE_ASYNC.

I've opted for a file-based interface to these asynchronous contexts - SPU events are retrieved by reading from a file. Contexts that are created with the SPU_CREATE_ASYNC flag have an extra file present (called something like "event") in their context directory in the spufs mount. Reading from this file allows applications to retrieve events that the SPU program has raised.

We need to define a format for the data read from this events file, so here's something to get started with:

struct spu_event {
	uint32_t event;
	uint32_t status;
	uint32_t npc;
};
Here, the event member specifies which event happened - a stop-and-signal, for example.

The status and npc members return the status of the SPU and the next program counter register, respectively. While not strictly necessary (this information is available from other files in spufs), it's very likely that the application will need these values in order to handle the event.

So, users of this interface may look something like this:

uint32_t npc = 0;
struct context {
        int fd;
        int event_fd;
} context;

/* create the context */
context.fd = spu_create("/spu/ctx", NULL, SPU_CREATE_ASYNC);

/* open the events file */
context.event_fd = openat(context.fd, "event", O_RDWR);

/* start the context running. unlike the spu_run syscall,
 * this function does not block for the duration of the
 * spu program */
run_context(&context, npc);

for (;;) {
        struct spu_event event;

        /* get the next event caused by the SPU */
        read(context.event_fd, &event, sizeof(event));

        if (event.event == SPU_EVENT_STOP)
                break;

        /* handle other events ... */
}
Note that the userspace examples here are not what we'd present to Cell application developers. They're more low-level examples of how the new asynchronous kernel interface works. In fact, the changes could be completely transparent to applications which use the libSPE interface.

This isn't far from the API provided by the current spu_run syscall, except that we're not waiting in the kernel while the SPU is running.

Also, we're going to need to control the SPU somehow - for example, we need to implement the run_context function in the pseudocode above. Rather than overloading the spu_run syscall, I've opted to use the same event file - writes to this file will allow userspace to control the SPU. I'm still working out the exact format of these writes, but the way I've implemented it at the moment is that the application can write structures of this layout to the file:

struct spu_control {
	uint32_t op;
	char data[];
};
The contents of the data member depends on the operation requested (specified by the op member). For example, a 'start spu' operation would have four extra bytes - a uint32_t containing the NPC to start the SPU execution from. A 'stop spu' operation doesn't require any extra parameters, so the data member would be 0 bytes long.

This would allow us to implement the run_context function as follows:

void run_context(struct context *context, uint32_t npc)
{
        uint32_t buf[2];

        buf[0] = SPU_CONTROL_START_SPU;
        buf[1] = npc;

        write(context->event_fd, buf, sizeof(buf));
}

There are plenty of other issues to deal with (like signals, and debugging), but I have a basic prototype working at the moment. More to come!

debian on a qs22 cell blade

30/06/2008, [ tech / cell / ] [ link ]

Seeing as the QS22 blades are out, here's a short guide to getting debian installed.

[jk@qs22 ~]$ grep -m1 ^cpu /proc/cpuinfo
cpu             : Cell Broadband Engine, altivec supported
[jk@qs22 ~]$ lsb_release -d
Description:    Debian GNU/Linux unstable (sid)


You'll need a kernel that has support for the IBM Cell blades. If you configure your kernel with the 'cell_defconfig' target, you should have all the necessary options:

[jk@pingu linux-2.6.25]$ make cell_defconfig

Specifically, you need:


root filesystem

The QS22s have no internal disk (they're compute nodes, right?), so you'll have to either:

  • use a remote root filesystem, like NFS; or
  • add a LSI SAS adaptor to the blade, and use an external SAS disk for the root filesystem.

The installation process will be different depending on which you choose, so just skip to the appropriate section here.

NFS root

For the first option, there's a number of NFS-root howtos around. First up, we need to build the actual debian filesystem, using debootstrap. For example:

[jk@pingu ~]$ sudo debootstrap --arch=powerpc --foreign sid /srv/nfs/qs22/

This will create an entire debian filesystem in /srv/nfs/qs22. We need to make a few modifications though:

  1. add the following line to /etc/inittab:
    T0:23:respawn:/sbin/getty -L ttyS0 19200 vt100
  2. add a couple of extra device nodes:
    [jk@pingu ~]$ cd /srv/nfs/qs22/dev
    [jk@pingu dev]$ sudo mknod console c 5 1
    [jk@pingu dev]$ sudo mknod ttyS0 c 4 64

Once this is done, we need to complete the bootstrap on the QS22. Set up your NFS server, and export the appropriate directory. Boot the QS22 with the nfs root kernel options, plus "rw init=/bin/sh" (eg root=/dev/nfs nfsroot=server_ip:/srv/nfs/qs22 ip=dhcp rw init=/bin/sh). Then, once the machine has booted:

sh-3.2# PATH=/:/bin:/usr/bin:/sbin:/usr/sbin /debootstrap/debootstrap --second-stage

This should finish the bootstrap. After it has completed (it should finish with "I: Base system installed successfully"), reboot the machine with the same kernel command line, minus the rw init=/bin/sh arguments. Once it boots, you should have the debian login prompt. Login as root (there will be no password, but don't forget to set one) and away you go.

SAS disk

If you're using SAS, the install is much more straightforward, as you can just use the standard debian installer. However, you may need to use a custom kernel which supports the QS22s. This is a matter of building your own kernel, using the powerpc64 debian installer image as the initramfs:

[jk@pingu linux-2.6.25]$ wget http://ftp.us.debian.org/debian/dists/testing/main/installer-powerpc/current/images/powerpc64/netboot/initrd.gz
[jk@pingu linux-2.6.25]$ gunzip -c < initrd.gz > initrd
[jk@pingu linux-2.6.25]$ make cell_defconfig
[jk@pingu linux-2.6.25]$ sed -ie 's,^CONFIG_INITRAMFS_SOURCE=".*",CONFIG_INITRAMFS_SOURCE="'$PWD'/initrd",' .config
[jk@pingu linux-2.6.25]$ make

Then, just boot the kernel in arch/powerpc/boot/zImage.pseries. The debian installer should start, and guide you through the rest of the installation. Since you're netbooting, you can ignore any messages about not having a bootstrap partition, or not being able to install a kernel or yaboot.


Entirely optional, but you'll probably get the most out of your QS22 with a few extra packages:

[jk@qs22 ~]$ sudo apt-get install openssh-server libspe2-dev spu-gcc build-essential

linux.conf.au hackfest: the solution, part three

23/06/2008, [ tech / cell / ] [ link ]

In part two of this series, we had just ported a fractal renderer to the SPEs on a Cell Broadband Engine machine. Now we're going to start the optimisation...

Our baseline performance is 40.7 seconds to generate the sample fractal (using the sample fractal parameters).

The initial SPE-based fractal renderer used only one SPE. However, we have more available:

Number of SPEs available in current CBEA machines
Machine SPEs available
Sony Playstation 3 6
IBM QS21 / QS22 blades 16 (8 per CPU)

So, we should be able to get better performance by distributing the render work between the SPEs. We can do this by dividing the fractal into a set of n strips, where n is the number of SPEs available. Then, each SPE renders its own strip of the fractal.

The following image shows the standard fractal, as would be rendered by 8 SPEs, with shading to show the division of the work amongst the SPEs.

SPE fractal divisions

In order to split up the work, we first need to tell each SPE which part of the fractal it is to render. We'll add two variables to the spe_args structure (which is passed to the SPE during program setup), to provide the total number of threads (n_threads) and the index of this SPE thread (thread_idx).

struct spe_args {
        struct fractal_params fractal;
        int n_threads;
        int thread_idx;
};

spe changes

On the SPE side, we use these parameters to alter the invocations of render_fractal. We set up another couple of convenience variables:

        rows_per_spe = args.fractal.rows / args.n_threads;
        start_row = rows_per_spe * args.thread_idx;

And just alter our for-loop to only render rows_per_spe rows, rather than the entire fractal:

        for (row = 0; row < rows_per_spe; row += rows_per_dma) {

                render_fractal(&args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

ppe changes

The changes to the PPE code are fairly simple - instead of just creating one thread, create n threads.

First though, let's add a '-n' argument to the program to specify the number of threads to start:

        while ((opt = getopt(argc, argv, "p:o:n:")) != -1) {
                switch (opt) {

                /* other options omitted */

                case 'n':
                        n_threads = atoi(optarg);
                        break;
                }
        }

Rather than just creating one SPE thread, we create n_threads. Also, we have to set the thread-specific arguments for each thread:

        for (i = 0; i < n_threads; i++) {
                /* copy the fractal data into this thread's args */
                memcpy(&threads[i].args.fractal, fractal, sizeof(*fractal));

                /* set thread-specific arguments */
                threads[i].args.n_threads = n_threads;
                threads[i].args.thread_idx = i;

                threads[i].ctx = spe_context_create(0, NULL);

                spe_program_load(threads[i].ctx, &spe_fractal);

                pthread_create(&threads[i].pthread, NULL,
                                spethread_fn, &threads[i]);
        }

Now, the SPEs should be running, and generating their own slice of the fractal. We just have to wait for them all to finish:

        /* wait for the threads to finish */
        for (i = 0; i < n_threads; i++)
                pthread_join(threads[i].pthread, NULL);

If you're after the source code for the multi-threaded SPE fractal renderer, it's available in fractal.3.tar.gz.

That's it! Now we have a multi-threaded SPE application to do the fractal rendering. In theory, an n-threaded program will take 1/n of the time of a single-threaded implementation; let's see how that theory fares against reality...


Let's compare invocations of our multi-threaded fractal renderer, with different values for the n_threads parameter.

Performance of multi-threaded SPE fractal renderer
SPE threads Running time (sec)

Not too bad, but we're definitely not seeing linear scalability here; we could expect the 8 SPE case to take around 5 seconds, rather than 11.

what's slowing us down?

A little investigation into the fractal generator will show that some SPE threads finish long before others. This is due to the variability in calculation time between pixels. To confirm that a point (ie, pixel) in the fractal does not diverge towards infinity (and so gets coloured blue), we need to do the full i_max tests in render_fractal (i_max is 10,000 in our sample fractal). Other pixels may diverge much more quickly (possibly in under 10 iterations), and so are orders of magnitude faster to calculate.

We end up with slices that are very quick to calculate, and others that take longer. Of course, we have to wait for the longest-running SPE thread to complete, so our overall runtime will be that of the slowest thread.

So, let's take another approach to distributing the workload. Rather than dividing the fractal into contiguous slices, we can interleave the SPE work. For example, if we were to use 2 SPE threads, then thread 0 would render all the even chunks, and thread 1 would render all the odd chunks (where a "chunk" is a set of rows that fits into a single DMA). This should even out the work between SPE threads.

Interleaved SPE fractal divisions

This is just a matter of changing the SPE for-loop a little. Rather than the current code, which divides the work into n_threads contiguous chunks:

        for (row = 0; row < rows_per_spe; row += rows_per_dma) {

                render_fractal(&args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

We change this to render every n_threads'th chunk, starting from thread_idx, which gives us the interleaved pattern:

        for (row = rows_per_dma * args.thread_idx;
                        row < args.fractal.rows;
                        row += rows_per_dma * args.n_threads) {

                render_fractal(&args.fractal, row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + row * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

An updated renderer is available in fractal.4.tar.gz.

Making this small change gives some better performance figures:

Performance of multi-threaded, interleaved SPE fractal renderer
SPE threads Running time (sec)

We're doing much better now, but we're still nowhere near the theoretical maximum performance of the SPEs. More optimisations in the next article...

qs22 released

18/06/2008, [ tech / cell / ] [ link ]

The next revision of IBM Cell Broadband Engine machines has just been released - the QS22 blade. The QS22 has five times the double-precision floating-point performance of the previous Cell blade (the QS21), but is instruction-set compatible. This is also the first Cell/B.E. machine to use DDR2 memory, and can hold up to 32GB.

In other powerpc news, Terra Soft Solutions have announced the PowerStation - a deskside development machine, based on 2 dual-core PowerPC 970MP CPUs. It comes with Yellow Dog Linux installed, including the Cell/B.E. SDK. Hugh (amongst many others) has been doing a lot of great work getting this machine out the door (he has a number of posts on his blog if you're keen to see the "making of" feature). Nice work Hugh!

linux.conf.au hackfest: the solution, part two

05/05/2008, [ tech / cell / ] [ link ]

In the last article we finished with a SPE-based fractal renderer, but with a limited maximum fractal size of 64 × 64 pixels:

first 64x64 fractal

We'd like to generate full-size fractals, but the DMAs (which we use to transfer the fractal image out of the SPE) have a maximum size of 16kB. The solution is to perform multiple DMAs, each containing a subset of the image's rows.

Each invocation of render_fractal() should render a DMA-able chunk of fractal data, then we perform the DMA. We do this until the SPE has processed the entire image:

threading structure for our next SPE-based fractal generator

We just need to modify the spe-fractal code (spe-fractal.c) a little. At present, we just render the whole fractal in one pass and DMA the data in the main() function:


        mfc_put(args.fractal.imgbuf, ppe_buf,
                args.fractal.rows * args.fractal.cols * sizeof(struct pixel),
                0, 0, 0);

        /* Wait for the DMA to complete */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();

First, we need to modify our render_fractal() function to take a starting row, and a number of rows to render. This is the new prototype of render_fractal():

static void render_fractal(struct fractal_params *params,
                int start_row, int n_rows)

In the SPE program's main(), we just need to set up some convenience variables:

        bytes_per_row = sizeof(*buf) * args.fractal.cols;
        rows_per_dma = sizeof(buf) / bytes_per_row;

And do the rendering and DMAs in a loop:

        for (row = 0; row < args.fractal.rows; row += rows_per_dma) {

                render_fractal(&args.fractal, row, rows_per_dma);

                mfc_put(buf, ppe_buf + row * bytes_per_row,
                                rows_per_dma * bytes_per_row,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

This loop will render as many image rows as will fit into a single DMA, then DMA the rendered data back to main memory.

Now, we're able to render fractals larger than 64 × 64 pixels:

512 x 384 fractal

The source for the updated fractal renderer is available in fractal.2.tar.gz.


Now that we can generate full-size fractals, we can compare the running times with the PPE-based fractal renderer. The following table shows running times with a standard fractal (using these fractal parameters).

Running times of fractal renderer
Implementation    Time (sec)
1 SPE             40.7

So, we get a 27% speedup by moving the fractal generation code to run on a SPE. We're still a way behind the optimal performance though, and benchmarking on other systems gives better times (for example, generating the same fractal on an Intel Core 2 Duo @ 2.4GHz takes 13.8 seconds).

We can improve the Cell performance by a large amount - stay tuned for the next article to see how.

linux.conf.au hackfest: the solution, part one

07/04/2008, [ tech / cell / ] [ link ]

During linux.conf.au 2008, a bunch of us ozlabbers ran the hackfest - a programming competition for conference attendees. This year's task was to optimise a fractal generation program to run on the Cell Broadband Engine - the hackfest task description is still available if you want to take a squiz.

The next few articles here will take you through a solution to the hackfest task. This is only one approach, and there may be many others. If you have any comments or questions, feel free to mail me.

(If you're viewing this through a feed reader or planet, you may want to check out the the original article, where you get much nicer code formatting.)


The task is a matter of optimising an existing program. We should take a leaf out of Knuth's book here:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

We'll start out with something simple, and work our way up from there.

starting out

As a starting point, it'd be a good idea to check out the simple-fractal example, to find out what sort of problem we're tackling here.

While we're at it, we can do a bit of profiling on the sample fractal generator to find out where the hot paths of the program are.

A quick way to do this is to run the simple-fractal program under callgrind:

[jk@pokey simple-fractal]$ valgrind --tool=callgrind --cache-sim=yes --dump-instr=yes \
             ./simple-fractal fractal.data

Looking at the callgraph output (using kcachegrind), we can get a list of the functions taking the largest amount of CPU time:

profiling data, showing 99.2% of cost is render_fractal function

The 'Self' column gives the estimated percentage of cycles spent in each function. We can see that 99.2% of the CPU time is spent in render_fractal(), 0.7% in various png-encoding functions, and 0.04% calculating the colour map for the fractal.

Now that we know what we need to optimise, we can work on offloading this to the SPEs on the Cell Processor. Because the majority of the running time is due to render_fractal(), we should offload that work to the SPEs.

cell version

We can get a fractal generator working on the Cell pretty quickly, by using the simple-fractal sample code for the fractal side of things, along with the data-transfer example for a framework for getting code running on the SPEs.

To me, the most logical approach is to move render_fractal() to the SPEs, then DMA the resulting fractal data back to the PPE, which does the PNG encoding. We should start with a simple single-SPE renderer:

threading structure for simple SPE-based fractal generator

This will require a few changes:

  • On the SPE side:
    1. We need some way of getting the fractal parameters (ie, a copy of struct fractal_params) to the SPE, so we should embed these into struct spe_args:
      struct spe_args {
              struct fractal_params fractal;
      } __attribute__((aligned(SPE_ALIGN)));
    2. For the moment, we'll deal with fractals that can fit into a single SPE DMA (ie, 16kB, which we've defined as CHUNK_SIZE). So, we'll need a local buffer on the SPE to work with:
      struct pixel buf[CHUNK_SIZE / sizeof(struct pixel)];
    3. As in the data-transfer example, the SPE program starts by DMA-ing a copy of struct spe_args into local store, from the address (in main memory) provided in the argv argument to main(). We'll need to do this here too:
      /* The argv argument will be populated with the address that the PPE
       * provided, from the 4th argument to spe_context_run() */
      int main(uint64_t speid, uint64_t argv, uint64_t envp)
              struct spe_args args __attribute__((aligned(SPE_ALIGN)));
              uint64_t ppe_buf;
              /* DMA the spe_args struct into the SPE. The mfc_get function
               * takes the following arguments, in order:
               * - The local buffer pointer to DMA into
               * - The remote address to DMA from
               * - A tag (0 to 15) to assign to this DMA transaction. The tag is
               *   later used to wait for this particular DMA to complete.
               * - The transfer class ID (don't worry about this one)
               * - The replacement class ID (don't worry about this one either)
               */
              mfc_get(&args, argv, sizeof(args), 0, 0, 0);
              /* Wait for the DMA to complete - we write the tag mask with
               * (1 << tag), where tag is 0 in this case */
              mfc_write_tag_mask(1 << 0);
              mfc_read_tag_status_all();
    4. Since the fractal_params->imgbuf pointer is a reference to main memory, we need to do a bit of shuffling to put a valid local store address in there, for render_fractal to use. We still need to keep this pointer though, as we'll need it when we DMA our fractal (now in local store) back to main memory. So, replace the imgbuf pointer with our local buf array, and keep the PPE pointer in ppe_buf for later use:
              /* initialise our local buffer */
              ppe_buf = (uint64_t)(unsigned long)args.fractal.imgbuf;
              args.fractal.imgbuf = buf;
    5. We can now call render_fractal() on the SPE:
              render_fractal(&args.fractal);
    6. And finally, DMA-put our rendered fractal to main memory, at the original ppe_buf pointer, and return:
              mfc_put(args.fractal.imgbuf, ppe_buf,
                      args.fractal.rows * args.fractal.cols * sizeof(struct pixel),
                      0, 0, 0);
              /* Wait for the DMA to complete */
              mfc_write_tag_mask(1 << 0);
              mfc_read_tag_status_all();
              return 0;
  • On the PPE side:
    1. Since we're creating a SPE thread, we need a function to do the spe_context_run(), to pass to pthread_create:
      void *spethread_fn(void *data)
              struct spe_thread *spethread = data;
              uint32_t entry = SPE_DEFAULT_ENTRY;
              /* run the context, passing the address of our args structure to
               * the 'argv' argument to main() */
              spe_context_run(spethread->ctx, &entry, 0,
                              &spethread->args, NULL, NULL);
              return NULL;
    2. We need to be careful with the alignment of the fractal buffer, as the SPE needs to DMA the fractal here. So, instead of using malloc, use memalign:
              /* allocate our image buffer */
              fractal->imgbuf = memalign(SPE_ALIGN, sizeof(*fractal->imgbuf)
                              * fractal->rows * fractal->cols);
    3. Copy the parsed fractal data into the spe_args structure:
              memcpy(&thread.args.fractal, fractal, sizeof(*fractal));
    4. Rather than calling render_fractal on the PPE, we just create the SPE context, upload the SPE program and start it in a new thread:
              thread.ctx = spe_context_create(0, NULL);
              spe_program_load(thread.ctx, &spe_fractal);
              pthread_create(&thread.pthread, NULL, spethread_fn, &thread);
    5. Now the SPE should be happily generating the fractal, and will DMA it back to our allocated buffer when it's complete. We just wait for the SPE thread to finish, and write the fractal out to a PNG file:
              pthread_join(thread.pthread, NULL);
              /* compress and write to outfile */
              if (write_png(outfile, fractal->rows, fractal->cols, fractal->imgbuf))
                      return EXIT_FAILURE;
              return EXIT_SUCCESS;

If immediate gratification is more your style, here's one I prepared earlier.

After these changes (plus some general plumbing), you should have a working SPE-based fractal renderer!

first 64x64 fractal

However, we still have a few limitations:

  • We can only generate fractals up to 16kB in total size - that's a maximum of 64 × 64 pixels;
  • We've only started one SPE thread; and
  • The generation is not significantly quicker on the SPE than on the PPE.

So, nothing too exciting yet. However, in the next part of this series, we'll be working on optimising our new program to use some of the neat features of the Cell architecture, and get around each of these limitations.

Stay tuned!

petitboot v0.2

11/01/2008, [ tech / cell / ] [ link ]
petitboot screenshot

The next version of petitboot - the graphical bootloader for the PlayStation 3 - is now out.

Some notable changes in the v0.2 build:

  • PS3 controller support
  • Improved bootloader config file parsing, should now recognise most setups "out of the box"
  • OtherOS images are now based on OpenWRT, so we have a more complete linux environment
  • UUID= and LABEL= device specifications are now supported
  • Better monitor detection with the 2.6.24 kernel

See the petitboot project page for more details and downloads. I've also built an OtherOS image with remote access support, so it's now possible to ssh to your bootloader.

linux on cell page

01/11/2007, [ tech / cell / ] [ link ]
cell diagram section

Up until now, I had a bunch of Linux on Cell information scattered about my website - I've now organised this into a central Linux on Cell page. Some current items:

If you're interested in Linux on Cell development, take a look.

There's also a heap of more general Cell development resources on the IBM developerWorks site.