Thu, 22 Feb 2007

lguest: how fast is fast enough?

So, lguest on my Core 2 Duo using 512M is about 25-30% slower than the same setup running native and uni-processor. A straight context-switch syscall benchmark puts us 60-90% slower (see below). Given that we're doing two pagetable switches instead of one (into the hypervisor and back), this is pretty good.

The idea of lguest is that it's simple, and to close the rest of that gap probably means introducing complexity. Where's the tradeoff? There are two examples which are troubling me. The first is that copying the per-guest information into and back out of the per-cpu area drops context switch speed by about 15% (7800ns to 9000ns), and in fact slows down all hypercalls. The optimal approach is to only copy back out when someone else wants to run on that CPU, but that means we need locking. This one, I'll probably do, since I'll need locking for shrinking the shadow pagetables under memory pressure, too.
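
Here's a rough sketch of what I mean, in plain C with invented names (struct guest_state, struct cpu_slot and load_guest() aren't lguest's real structures): keep the per-cpu copy resident, and only write it back when a different guest wants that CPU, under a lock.

    /*
     * Sketch only, not lguest code: lazily copy guest state out of the
     * per-cpu area.  We only write it back when another guest wants
     * this CPU, and a lock serializes the handover.
     */
    #include <string.h>
    #include <pthread.h>

    struct guest_state {
        unsigned long regs[16];     /* stand-in for the real per-guest info */
    };

    struct cpu_slot {
        pthread_mutex_t lock;
        struct guest_state area;    /* the per-cpu copy the switcher uses */
        struct guest_state *owner;  /* whose state currently lives there */
    };

    /* Make sure "guest" owns the per-cpu area before running it. */
    static void load_guest(struct cpu_slot *cpu, struct guest_state *guest)
    {
        pthread_mutex_lock(&cpu->lock);
        if (cpu->owner != guest) {
            /* Evict the previous occupant's state first. */
            if (cpu->owner)
                memcpy(cpu->owner, &cpu->area, sizeof(cpu->area));
            memcpy(&cpu->area, guest, sizeof(cpu->area));
            cpu->owner = guest;
        }
        /* else: still resident from last time, no copying at all. */
        pthread_mutex_unlock(&cpu->lock);
    }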

The other example is a more difficult call: servicing some hypercalls directly in the switcher stub, staying in the guest address space. This has become more difficult since we started using read-only pages to protect the hypervisor, but it still might be possible. It would change our very simple assembler switch code into something else, though.

In particular, consider that there are three hypervisor-sensitive parts to a normal context switch: changing the kernel stack (particularly changing which stack system call traps will arrive on), altering the thread-local segment entries in the GDT, and actually switching the page tables. The last one is easiest: we already cache four toplevel page tables; we just have to move this cache into the read-only part of the guest address space where our assembler code can reach it. If the new toplevel is in that cache, the asm code can put the corresponding value straight into cr3.
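
To make that concrete, here's a rough sketch (invented names and layout, not the real switcher code) of the lookup the asm stub would effectively be doing against a toplevel-pagetable cache kept where the read-only switcher page can reach it:

    /*
     * Sketch only: a tiny cache of shadow toplevel pagetables.  If the
     * guest's new toplevel is here, the asm stub could load the matching
     * shadow into %cr3 without leaving the guest address space; on a
     * miss it still has to trap out to the host.
     */
    #define PGDIR_CACHE 4   /* we already cache four toplevel page tables */

    struct pgdir_cache_entry {
        unsigned long guest_pgdir;  /* the guest's idea of the toplevel */
        unsigned long shadow_cr3;   /* host-physical shadow to load */
        int in_use;
    };

    static struct pgdir_cache_entry pgdir_cache[PGDIR_CACHE];

    /* Returns the cr3 value to load, or 0 meaning "miss: go to the host". */
    static unsigned long find_shadow_cr3(unsigned long guest_pgdir)
    {
        int i;

        for (i = 0; i < PGDIR_CACHE; i++) {
            if (pgdir_cache[i].in_use
                && pgdir_cache[i].guest_pgdir == guest_pgdir)
                /* Hit: the real stub would "movl this, %cr3". */
                return pgdir_cache[i].shadow_cr3;
        }
        /* Miss: hypercall out so the host can build/load a shadow. */
        return 0;
    }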

The change of stack can be managed in at least two ways. Normally it involves updating an entry in the TSS, but that's read-only in the guest address space. We could instead use a fixed stack for system calls and copy to the "real" stack in the guest. This copy must be done with (virtual) interrupts off, but we already have a dead-zone mechanism where we tell the hypervisor not to give us interrupts within a range of instructions. The change of stack then becomes simply updating a local variable in the guest. This solution also means the real stack doesn't move: the hypervisor needs to ensure the stack is always mapped, otherwise the guest double-faults and we kill it.
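
Roughly, the fixed-stack idea would look something like this (a sketch with made-up names like move_to_real_stack(); the real thing would be asm in the guest's trap entry, with the dead-zone replacing the explicit irq disable/enable pair):

    /*
     * Sketch only: syscalls always land on one fixed stack; the guest's
     * entry code copies the trap frame onto the task's real kernel stack
     * with virtual interrupts off, then switches %esp to it.
     */
    #include <string.h>

    #define TRAP_FRAME_WORDS 5  /* ss, esp, eflags, cs, eip pushed by the trap */

    static unsigned long fixed_syscall_stack[1024];

    /* Per-task pointer: updating this IS the context-switch stack change. */
    static unsigned long *current_kernel_stack;

    static void guest_irq_disable(void) { /* would clear the virtual irq flag */ }
    static void guest_irq_enable(void)  { /* would set it and check pending */ }

    static unsigned long *move_to_real_stack(unsigned long *fixed_sp)
    {
        unsigned long *real_sp;

        guest_irq_disable();
        real_sp = current_kernel_stack - TRAP_FRAME_WORDS;
        /* Copy the trap frame off the fixed stack onto the real one. */
        memcpy(real_sp, fixed_sp, TRAP_FRAME_WORDS * sizeof(unsigned long));
        guest_irq_enable();
        return real_sp;     /* the entry asm would load this into %esp */
    }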

The other solution is to have multiple TSSs ready to go, just like we cache pagetable tops. Each one is 108 bytes, though, and while threads share page tables, they don't share kernel stacks, so this cache would cover fewer cases than the "One True Stack" solution.

The TLS entries are harder. The GDT is 32x8 = 256 bytes, so we could cache a handful of them (maybe 14 if we don't cache TSSes: we have one read-only page). Otherwise, perhaps the host could set the %gs register to 0, and store the old value somewhere along with the TLS entries. The hypervisor would then see a General Protection Fault when the userspace program tries to use the TLS entries, and could put them in at that point. Of course, this means threaded programs still get one trap per context switch, and just about every program on modern systems is threaded (thanks glibc!). No win.
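
For what it's worth, the lazy-TLS version I was picturing looks something like this (names invented, and as I said, it doesn't actually win anything for threaded programs):

    /*
     * Sketch only: on context switch the host zeroes %gs and stashes the
     * TLS descriptors; when userspace faults trying to use them, the host
     * copies them into the real GDT and retries the instruction.
     */
    #include <string.h>

    #define GDT_ENTRY_TLS_MIN   6   /* where Linux keeps the TLS slots */
    #define GDT_ENTRIES         32

    struct desc { unsigned int a, b; };     /* one 8-byte GDT descriptor */

    struct lazy_tls {
        struct desc entries[3];     /* the three TLS descriptors */
        unsigned int saved_gs;
        int loaded;
    };

    static struct desc gdt[GDT_ENTRIES];

    /* Called from the General Protection Fault path. */
    static int maybe_load_tls(struct lazy_tls *tls, unsigned int *gs_reg)
    {
        if (tls->loaded)
            return 0;   /* not our doing: reflect the GPF into the guest */
        memcpy(&gdt[GDT_ENTRY_TLS_MIN], tls->entries, sizeof(tls->entries));
        *gs_reg = tls->saved_gs;
        tls->loaded = 1;
        return 1;       /* retry the faulting instruction */
    }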

So, say we add 250 lines (i.e. 5%) of fairly hairy code to support this. Say our context switch speed is within 10% of native until we go outside the cache: probably most people would see less than a 5% performance improvement, and I'm not sure that's enough to justify the complexity for lguest. I think that from now on, I want macro benchmark performance gains to be about 2x the percentage code increase. That means code size can only increase by about 12% due to optimizations, say 610 lines 8)


[/tech] permanent link