diary
SPE gang scheduling policies
In my previous post here, I mentioned that:
We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.
I think that may require a little explanation, so here we go:
gang scheduling policies
The idea behind SPE gang scheduling is to allow a set of related SPE contexts to be scheduled together, so that interactions between contexts can be performed in a timely manner. For example, consider two contexts (A and B) that send mailbox messages to each other. If context A is running while context B is scheduled out, then A will spend its timeslice waiting for a message from B. If they are scheduled as a single entity, neither context will have to spend its timeslice blocked, waiting for the other context to be run.
So, we have to come up with a policy to define the behaviour of the gang scheduler. When does the gang become schedulable? Under which conditions should the gang be descheduled?
I can see four possible approaches:
policy 1: the gang is only schedulable when all contexts are runnable
In this case, the gang is only ever scheduled when all of the gang's
contexts are runnable (ie, they are being run by the spu_run system
call).
Although it is the simplest, this approach can fail to complete. Consider the following:
- Context A becomes runnable
- Context B becomes runnable
- The gang is now schedulable, so both contexts are scheduled
- Because it has slightly less work to do, context A finishes before context B
- Because only one of the two contexts is runnable, the gang is no longer schedulable. Context B is never re-scheduled, so cannot complete the rest of its task
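As a rough illustration, policy 1 boils down to a single predicate. This is a toy model, not spufs code: `struct gang`, its `runnable` flags and `policy1_schedulable` are names made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GANG_MAX 8

/* Hypothetical model, not spufs code: one flag per context in the gang. */
struct gang {
	size_t nr_contexts;
	bool runnable[GANG_MAX];	/* true while a thread is inside spu_run */
};

/* Policy 1: the gang may occupy SPEs only while every context is runnable. */
bool policy1_schedulable(const struct gang *g)
{
	assert(g->nr_contexts <= GANG_MAX);

	for (size_t i = 0; i < g->nr_contexts; i++)
		if (!g->runnable[i])
			return false;
	return true;
}
```

As soon as context A finishes and clears its flag, the predicate turns false for the whole gang, which is exactly the point at which context B gets stranded.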
So, this policy isn't much use; perhaps we can solve this with a new approach:
policy 2: the gang is scheduled when all contexts are runnable, and descheduled when no contexts are runnable
This will solve the previous non-termination problem, in that context B will be able to terminate - the gang isn't immediately descheduled when A finishes.
However, now we have a new, slightly more complex non-termination case:
- Context A becomes runnable
- Context B becomes runnable
- The gang is now schedulable, so both contexts are scheduled
- Because it has slightly less work to do, context A finishes before context B
- At the same time, context B does a PPE-assisted callback, which requires a stop-and-signal (and so leaves spu_run for just a moment)
- Because neither context is currently runnable, the gang is descheduled
- Context B finishes its callback, so re-enters spu_run to be re-scheduled. However, the policy does not allow context B to be re-scheduled, as only one of the two contexts is runnable.
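The sequence above can be replayed against a small two-context state machine. Again, this is only a sketch: `struct gang2` and `policy2_update` are hypothetical names, and the real scheduler has far more state than two flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-context model of policy 2; not spufs code. */
struct gang2 {
	bool runnable[2];	/* context is currently inside spu_run */
	bool scheduled;		/* gang currently occupies SPEs */
};

/* Re-evaluate policy 2 after any change to the runnable flags. */
void policy2_update(struct gang2 *g)
{
	assert(g != NULL);

	bool all = g->runnable[0] && g->runnable[1];
	bool none = !g->runnable[0] && !g->runnable[1];

	if (!g->scheduled && all)
		g->scheduled = true;	/* schedule: every context is runnable */
	else if (g->scheduled && none)
		g->scheduled = false;	/* deschedule: no context is runnable */
}
```

Once A has cleared its flag for good, a brief dip of B's flag (the callback) takes the gang off the SPEs, and the "all runnable" condition needed to get back on can never be met again.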
Although this may sound like a rare occurrence, it's not a restriction we can pass on to the programmer. Imagine the following SPE code:
    int main(void)
    {
        do_work();
        printf("work done!\n");
        return 0;
    }
Here we're doing a PPE-assisted callback (the call to printf is
implemented as a callback) before finishing. If this callback were to occur
when the other context has already completed, we would hit the non-termination
condition above.
This means that the last-running context of a gang can never do a PPE-assisted callback. In fact, to be completely safe against this non-termination, a programmer would have to avoid callbacks after any context has finished, for risk of callbacks on the rest of the gang being synchronised.
So, it looks like we need to be a little more permissive when deciding if the gang is schedulable.
policy 3: the gang is scheduled when any context is runnable, and descheduled when no contexts are runnable
This is another fairly simple approach: the gang is scheduled whenever there is any work to do. We no longer have any non-termination conditions, as 'having work to do' will result in 'doing work'.
The tricky part is that it will require us to change one of the fundamental assumptions about spufs: currently, we don't schedule any context unless it is runnable. Because we schedule the entire gang when one of its contexts becomes runnable, we now have to schedule a number of non-runnable contexts.
The good news is that I've already done a little experimental work to overcome this general restriction in spufs.
The last approach is a little more complex, but works around this restriction:
policy 4: schedule the runnable contexts of a gang, and reserve SPEs for the non-runnable contexts
This is just like policy 3, but instead of actually scheduling the non-runnable contexts, we reserve an SPE for each of them.
This way, a non-runnable context does not need to be loaded, but can be quickly scheduled when it becomes runnable. The downside is that we're only half-implementing gang scheduling; there may still be interactions with a non-runnable context (eg, accesses to the problem state mapping from a running context in the same gang) that will cause running contexts to become blocked.
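The difference between policies 3 and 4 is easiest to see as a placement decision per context. The following is a toy model under my own assumptions: the `spe_slot` enum and both `place` functions are invented for illustration and don't correspond to anything in spufs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-context placement; not spufs code. */
enum spe_slot { SLOT_NONE, SLOT_LOADED, SLOT_RESERVED };

bool any_runnable(const bool *runnable, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (runnable[i])
			return true;
	return false;
}

/* Policy 3: if any context is runnable, load every context of the gang. */
void policy3_place(const bool *runnable, enum spe_slot *slot, size_t n)
{
	assert(runnable != NULL && slot != NULL);

	bool active = any_runnable(runnable, n);

	for (size_t i = 0; i < n; i++)
		slot[i] = active ? SLOT_LOADED : SLOT_NONE;
}

/* Policy 4: load the runnable contexts, only reserve SPEs for the rest. */
void policy4_place(const bool *runnable, enum spe_slot *slot, size_t n)
{
	assert(runnable != NULL && slot != NULL);

	bool active = any_runnable(runnable, n);

	for (size_t i = 0; i < n; i++) {
		if (!active)
			slot[i] = SLOT_NONE;
		else
			slot[i] = runnable[i] ? SLOT_LOADED : SLOT_RESERVED;
	}
}
```

With one runnable and one non-runnable context, policy 3 yields two loaded contexts, while policy 4 yields one loaded context and one reserved (but empty) SPE.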
So, which policy is best for spufs?
Policies 1 and 2 have significant flaws in their approach. It's quite possible that either will lead to non-termination conditions in fairly simple user programs. I don't think we can 'work around' this with a restriction on the programmer.
Policy 4 will require a mechanism for reserving SPEs for a particular context; I'm not convinced the extra complexity is worth the effort, especially as this doesn't allow us to implement gang scheduling properly.
Currently, Luke Browning and André Detsch have a work-in-progress patch series for gang scheduling, based on policy 3.
external context scheduling in spufs
At present, the spufs code has the invariant that a context is only
ever loaded to an SPE when it is being run; ie, a thread is calling the
spu_run syscall on the context.
However, there are situations where we may want to load the context without it being run. For example, using the SPU's DMA engine from the PPE requires the PPE thread to write to registers in the SPU's problem-state mapping (psmap). Faults on the psmap area can only be serviced while the context is loaded, so they will block until someone runs the context. Ideally, we could allow such accesses to the psmap without the spu_run call. We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.
So, I've been working on some experimental changes to allow "external
scheduling" for SPE contexts. The "external" refers to a thread external to the
SPE's usual method of scheduling (ie, its owning thread calling
spu_run). In the example above, the external schedule would be
caused by the fault handler for the problem-state mapping.
Although a context may be scheduled to an SPE, we still can't always guarantee
forward progress. For example, in the "use the psmap to access the DMA engine"
scenario, a DMA may cause a major page fault, which needs a controlling thread
to service. In this case, the only way to ensure forward progress is through
calling spu_run. However, I have some ideas on how we can
remove this restriction later.
the interface
First up, we need to tell the spufs scheduler that we want a context to be loaded:
    /*
     * Request an 'external' schedule for this context.
     *
     * The context will be either loaded to an SPU, or added to the run queue,
     * depending on SPU availability.
     *
     * Should be called with the context's state mutex locked, and the context
     * in SPU_STATE_SAVED state.
     */
    int spu_request_external_schedule(struct spu_context *ctx);
After loading the context with spu_request_external_schedule, we
need a way to tell the scheduler that the context can be de-scheduled:
    /*
     * The context should be unscheduled at the end of its timeslice
     */
    void spu_cancel_external_schedule(struct spu_context *ctx);

These functions are implemented by incrementing or decrementing a count of "external schedulers" on the context. If multiple threads are requesting an external schedule, then the first will activate the context. When the last thread calls the cancel method, the context can be descheduled.
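To make the counting concrete, here's a rough userspace model of it. Everything in it is illustrative: `struct ctx_model`, its fields and the `model_*` functions are names I've invented here, and the locking and run-queue handling that the real code needs are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the "external schedulers" count; not spufs code. */
struct ctx_model {
	int ext_sched_count;	/* number of outstanding external requests */
	bool keep_loaded;	/* scheduler should keep the context loaded */
};

/*
 * Returns true if the caller is the first requester, ie the one that
 * has to activate the context.
 */
bool model_request_external_schedule(struct ctx_model *ctx)
{
	assert(ctx != NULL);

	bool first = (ctx->ext_sched_count++ == 0);

	if (first)
		ctx->keep_loaded = true;
	return first;
}

/*
 * Returns true if the caller was the last requester; from then on the
 * context may be descheduled at the end of its timeslice.
 */
bool model_cancel_external_schedule(struct ctx_model *ctx)
{
	assert(ctx != NULL && ctx->ext_sched_count > 0);

	bool last = (--ctx->ext_sched_count == 0);

	if (last)
		ctx->keep_loaded = false;
	return last;
}
```

With two threads requesting, only the first request activates the context and only the second cancel releases it.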
usage
We can use these two functions to allow the problem-state mapping fault handler to proceed outside of spu_run:
    --- a/arch/powerpc/platforms/cell/spufs/file.c
    +++ b/arch/powerpc/platforms/cell/spufs/file.c
    @@ -413,9 +413,11 @@ static int spufs_ps_fault(struct vm_area_struct *vma,
     	if (ctx->state == SPU_STATE_SAVED) {
     		up_read(&current->mm->mmap_sem);
    +		spu_request_external_schedule(ctx);
     		spu_context_nospu_trace(spufs_ps_fault__sleep, ctx);
     		ret = spufs_wait(ctx->run_wq, ctx->state == SPU_STATE_LOADED);
     		spu_context_trace(spufs_ps_fault__wake, ctx, ctx->spu);
    +		spu_cancel_external_schedule(ctx);
     		down_read(&current->mm->mmap_sem);
     	} else {
     		area = ctx->spu->problem_phys + ps_offs;
Note that the spu_cancel_external_schedule function doesn't unload the context right away; if it did, the refault would fail too, and we'd end up in an infinite loop of faults. Instead, it keeps the context scheduled for the rest of its timeslice. This gives the faulting thread time to access the mapping after the fault handler has been invoked.
We also need to do a bit of trickery with the priorities of contexts during external schedule operations. If a high-priority thread accesses the problem-state mapping of a low-priority context, we want the context to temporarily inherit the higher priority. To do this, we raise the priority when spu_request_external_schedule is called, and drop it back after the context has finished its timeslice on the SPU.
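The priority handling can be sketched in the same toy style. This assumes the usual Linux convention that a lower number means a higher priority; `struct prio_model` and both functions are hypothetical names, not what the branch actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; assumes a lower value means a higher priority. */
struct prio_model {
	int base_prio;	/* the context's own priority */
	int prio;	/* effective priority seen by the scheduler */
};

/* On an external schedule request: inherit the requester's priority if higher. */
void model_boost(struct prio_model *ctx, int requester_prio)
{
	assert(ctx != NULL);

	if (requester_prio < ctx->prio)
		ctx->prio = requester_prio;
}

/* At the end of the context's timeslice: drop back to its own priority. */
void model_timeslice_end(struct prio_model *ctx)
{
	assert(ctx != NULL);

	ctx->prio = ctx->base_prio;
}
```

Note that the boost only ever raises the effective priority; a lower-priority requester leaves it unchanged.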
the code
I've created a development branch in the spufs repository for these changes, which is available:
- via git: git://git.kernel.org/pub/scm/linux/kernel/git/jk/spufs.git, in the ext-sched branch; or
- on the browsable gitweb interface.
Note that this is an experimental codebase; expect breakages!
asynchronous spu contexts, initial designs
I've recently been working on some changes to the spufs code, and thought I'd write-up some of the details.
At present, the spu_run syscall (used to run an SPU context)
blocks until the SPU program has exited (or some other event has happened,
such as a non-serviceable fault). This means that to take advantage of the
SPUs, you really need to start a new thread for each SPU context that you
create, otherwise your application will be sitting around waiting for each SPU
context to complete.
In fact, we have an invariant in the spufs code at the moment that only
contexts that are currently being spu_run will ever be runnable
(and, at the moment, schedulable).
Ben H and I have been chatting about some ideas about asynchronous spu
contexts. This means that the userspace app can start the context, then later
retrieve the status of the SPU context (to see if it has stopped, faulted, or
whatever). We can then use standard POSIX semantics like poll() to
see if a context is still running or has generated any "events", then handle
these events when they become available.
In effect, this is similar to spu_run: currently, the
spu_run syscall runs the SPU, then blocks until an event happens,
which is then returned to userspace as the return value of
spu_run. The main difference is that we don't block in the kernel
while the SPU is running.
So, I've been coding up an experimental change to spufs. Firstly, we have
to explicitly tell the kernel that we want a context to operate in asynchronous
mode, so I've added a new flag to the spu_create syscall:
SPU_CREATE_ASYNC.
I've opted for a file-based interface to these asynchronous contexts -
SPU events are retrieved by reading from a file. Contexts that are created with
the SPU_CREATE_ASYNC flag have an extra file present (called
something like "event") in their context directory in the
spufs mount. Reading from this file allows applications to retrieve events
that the SPU program has raised.
We need to define a format for the data read from this events file, so here's something to get started with:
    struct spu_event {
        uint32_t event;
        uint32_t status;
        uint32_t npc;
    };
- where the event member specifies which event happened - a
stop-and-signal for example.
The status and npc members return the status of the SPU and the next program counter register, respectively. While not strictly necessary (this information is available from other files in spufs), it's very likely that the application will need these values in order to handle the event.
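Combined with poll(), retrieving one event could look roughly like the helper below. Treat it as a sketch: `read_spu_event` is a made-up name, and it assumes (the prototype may well differ) that the event file reports POLLIN when an event is pending and hands back whole `struct spu_event` records.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Event layout as proposed above. */
struct spu_event {
	uint32_t event;
	uint32_t status;
	uint32_t npc;
};

/*
 * Wait up to timeout_ms for the event file to become readable, then
 * read a single event into *ev.
 *
 * Returns 1 if an event was read, 0 on timeout, and -1 on error or on
 * a short read.
 */
int read_spu_event(int event_fd, struct spu_event *ev, int timeout_ms)
{
	struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
	ssize_t n;
	int rc;

	assert(ev != NULL);

	/* restart the wait if a signal interrupts it */
	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc <= 0)
		return rc;	/* 0: timed out, -1: poll failed */

	n = read(event_fd, ev, sizeof(*ev));
	return n == (ssize_t)sizeof(*ev) ? 1 : -1;
}
```

Because the helper only needs a file descriptor, it can be tried out against a pipe before any kernel support exists.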
So, users of this interface may look something like this:
    uint32_t npc = 0;
    struct context {
        int fd;
        int event_fd;
    } context;

    /* create the context */
    context.fd = spu_create("/spu/ctx", NULL, SPU_CREATE_ASYNC);

    /* open the events file */
    context.event_fd = openat(context.fd, "event", O_RDWR);

    /* start the context running. unlike the spu_run syscall,
     * this function does not block for the duration of the
     * spu program */
    run_context(&context, npc);

    for (;;) {
        struct spu_event event;

        /* get the next event caused by the SPU */
        read(context.event_fd, &event, sizeof(event));

        if (event.event == SPU_EVENT_STOP)
            break;

        /* handle other event ... */
    }
Note that the userspace examples here are not what we'd present to Cell application developers. They're more low-level examples of how the new asynchronous kernel interface works. In fact, the changes could be completely transparent to applications which use the libSPE interface.
This isn't far from the API provided by the current spu_run
syscall, except that we're not waiting in the kernel while the SPU is
running.
Also, we're going to need to control the SPU somehow - for example, we need
to implement the run_context function in the pseudocode above.
Rather than overloading the spu_run syscall, I've opted to use the
same event file - writes to this file will allow userspace to control the SPU.
I'm still working out the exact format of these writes, but the way I've
implemented it at the moment is that the application can write structures of
this layout to the file:
    struct spu_control {
        uint32_t op;
        char data[];
    };

The contents of the data member depend on the operation
requested (specified by the op member). For example, a 'start spu'
operation would have four extra bytes - a uint32_t containing the
NPC to start the SPU execution from. A 'stop spu' operation doesn't require any
extra parameters, so the data member would be 0 bytes long.
This would allow us to implement the run_context function
as follows:
    void run_context(struct context *context, uint32_t npc)
    {
        uint32_t buf[2];

        buf[0] = SPU_CONTROL_START_SPU;
        buf[1] = npc;

        write(context->event_fd, buf, sizeof(buf));
    }
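For operations with other payload sizes, a small encoder keeps the "op followed by op-specific data" layout in one place. This is only a sketch: `spu_control_encode` is a hypothetical helper, and the numeric op codes below are placeholders, since the format of these writes is still in flux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values; the real op codes are not defined yet. */
enum {
	SPU_CONTROL_START_SPU = 1,
	SPU_CONTROL_STOP_SPU = 2,
};

/*
 * Serialise one control message (a uint32_t op followed by data_len
 * bytes of op-specific data) into buf.
 *
 * Returns the number of bytes to write() to the event file, or 0 if
 * buf is too small to hold the message.
 */
size_t spu_control_encode(unsigned char *buf, size_t buflen, uint32_t op,
			  const void *data, size_t data_len)
{
	assert(buf != NULL);
	assert(data != NULL || data_len == 0);

	if (buflen < sizeof(op) || data_len > buflen - sizeof(op))
		return 0;

	memcpy(buf, &op, sizeof(op));
	if (data_len)
		memcpy(buf + sizeof(op), data, data_len);

	return sizeof(op) + data_len;
}
```

A 'start spu' message then encodes to 8 bytes (the op plus the NPC), while a 'stop spu' message is just the 4-byte op with `data` passed as NULL and a length of 0.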
There are plenty of other issues to deal with (like signals, and debugging), but I have a basic prototype working at the moment. More to come!