Tue, 02 Jan 2007

lhype: speeding up system calls

Last episode, lhype was 35 times slower at system calls than native. The main reason for this is that every trap (including the syscall) gets redirected into the hypervisor.S stubs, which exit into the host, which then decides it's for the guest, copies the trap frame onto the guest stack and jumps into the guest's handler.

After handling the interrupt, the guest calls back into the hypervisor to do the "iret": re-enable interrupts and return.

Now, we want to point the interrupt handlers straight into the guest. This means that the guest stack and the guest handler code must be mapped, otherwise we "double fault" and it's hard (maybe impossible?) to recover. So we always enter the guest with the stack and interrupt-handler pages already mapped; then we can point the handlers for just about everything straight into the guest. We still need to intercept trap 14 (page fault), because we unmap pages behind the guest's back and need to fix them up again, and trap 13 (general protection fault), because we have to emulate the inb/outb instructions. But the rest can go straight into the guest...
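As a rough sketch of that split (not lhype's actual code: the helper name and constants below are made up for illustration, and only the two traps named above are treated as intercepted), the decision might look like this in C:

```c
#include <stdbool.h>

/* x86 exception vectors named in the text. */
#define TRAP_GENERAL_PROTECTION 13
#define TRAP_PAGE_FAULT         14

/*
 * Hypothetical helper: may the handler for this trap point straight
 * into the guest, or must the hypervisor see it first?
 */
static bool direct_trap_ok(unsigned int trapnum)
{
	switch (trapnum) {
	case TRAP_PAGE_FAULT:          /* we unmap pages behind the guest's back */
	case TRAP_GENERAL_PROTECTION:  /* we have to emulate inb/outb */
		return false;
	default:                       /* everything else goes direct in this sketch */
		return true;
	}
}
```

A real implementation may well intercept more than these two ("just about everything" goes direct, not everything); the point is only that the fast path is the default and the slow path is the exception.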

Returning from interrupts is a little trickier. The native "iret" instruction restores the interrupt state (ie. re-enables interrupts) and returns to the caller atomically. We would need to use two instructions: one to re-enable virtual interrupts, and then the "iret". This is no longer atomic: we could be interrupted between the two. So we explicitly tell the hypervisor the address of that "iret" instruction, and it must not interrupt us while we are sitting on it, even if interrupts are enabled. That closes the race.
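The hypervisor side of that check could be sketched as below. The struct, its field names and the function are all hypothetical, invented here to illustrate the idea; the only thing taken from the description above is that the guest registers the address of its "iret" and the hypervisor refuses to deliver an interrupt while the guest is on it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-guest state; field names are illustrative only. */
struct guest_state {
	bool     irqs_enabled; /* virtual interrupt flag, set by the guest */
	uint32_t noirq_iret;   /* address of the guest's "iret", told to us by the guest */
};

/* May a pending virtual interrupt be delivered at this guest instruction? */
static bool may_deliver_irq(const struct guest_state *g, uint32_t guest_eip)
{
	if (!g->irqs_enabled)
		return false;
	/*
	 * The guest has re-enabled interrupts but has not yet executed
	 * the iret: hold the interrupt until it has returned.
	 */
	if (guest_eip == g->noirq_iret)
		return false;
	return true;
}
```

With only two instructions in the return path, a single address comparison is enough: the window between "enable" and "iret" is exactly one instruction wide.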

But what about the TLS segments? There are two problems here: first, making sure userspace has access to the 4G TLS segments, and secondly making sure that kernelspace doesn't. But segments are strange beasts on x86: the segment table is only consulted when a segment register is loaded (in this case, %gs), so we can load it once then replace the segment table entries, which ensures any reloads don't get the full segment.
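Replacing a table entry after the load amounts to rewriting its descriptor. As a minimal sketch, assuming an ordinary expand-up data segment and using invented names (`seg_desc`, `set_limit_pages`, `seg_size_bytes`), here is how the limit of a page-granular x86 descriptor can be changed in C. The bit positions are the standard x86 descriptor layout; everything else is illustrative rather than lhype's code.

```c
#include <stdint.h>

/* An x86 segment descriptor as two 32-bit words: a is the low word, b the high. */
struct seg_desc {
	uint32_t a, b;
};

#define DESC_G (1u << 23) /* granularity bit: limit is counted in 4K pages */

/* Give the segment a page-granular limit; pages must be 1..0x100000. */
static void set_limit_pages(struct seg_desc *d, uint32_t pages)
{
	uint32_t limit = pages - 1; /* the field holds the last valid page index */

	d->a = (d->a & 0xffff0000u) | (limit & 0x0000ffffu);  /* limit bits 15:0 */
	d->b = (d->b & ~0x000f0000u) | (limit & 0x000f0000u); /* limit bits 19:16 */
	d->b |= DESC_G;
}

/* Size in bytes that the descriptor's limit allows. */
static uint64_t seg_size_bytes(const struct seg_desc *d)
{
	uint64_t limit = (d->a & 0x0000ffffu) | (d->b & 0x000f0000u);

	return (d->b & DESC_G) ? (limit + 1) << 12 : limit + 1;
}
```

A segment register that was loaded before the rewrite keeps using the old, cached limit; only the next load of the register picks up the new one.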

We use this when restoring %gs in the hypervisor on the way back into the guest: pop %gs off the stack, then truncate the 4G TLS segment down to one page. If the guest reloads %gs and tries to use it, it will fault. We then enter the hypervisor and can decide whether we should reload %gs for it (ie. it's in usermode) or not. To avoid looping on a real faulting instruction, we remember the last instruction we fixed %gs on: if the guest hasn't made a system call or other trap and it faults again in the same place, we pass the fault through to the guest. In theory, the code could be reloading %gs in a loop, but in practice that doesn't happen.
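The "don't loop on a real fault" rule can be sketched like this. All names here are hypothetical, and two details are assumptions rather than anything stated above: that a fault taken outside usermode is simply passed to the guest, and that the remembered address is cleared (to 0) whenever the guest takes a system call or other trap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for the %gs fixup. */
struct gs_state {
	uint32_t last_fixup_eip; /* 0 means "no fixup since the last trap" */
};

enum gs_action { GS_RELOAD, GS_PASS_TO_GUEST };

/* Called when the guest faults on a %gs access. */
static enum gs_action handle_gs_fault(struct gs_state *s, uint32_t eip,
				      bool usermode)
{
	/* Assumption: kernel-mode faults are never fixed up. */
	if (!usermode)
		return GS_PASS_TO_GUEST;
	/* Same place again with no trap in between: it is a real fault. */
	if (s->last_fixup_eip == eip)
		return GS_PASS_TO_GUEST;
	s->last_fixup_eip = eip;
	return GS_RELOAD;
}

/* Called on any system call or other trap: forget the last fixup. */
static void note_guest_trap(struct gs_state *s)
{
	s->last_fixup_eip = 0;
}
```

Code that really does reload %gs in a loop at one address would get its second fault passed through instead of fixed up, which is the theoretical hole mentioned above.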

Aside: as a bonus, this works under QEMU, which doesn't enforce segment limits. It never traps, so we never fix up the segment limit, but then, we don't need to. Of course, all hypervisors using segments like this are insecure under QEMU.

How does this help us prevent the kernel from accessing that 4G segment? Well, now all we need to do is ensure the kernel reloads %gs upon entry, which will ensure it gets the harmless segment from the table. To do this, we divert all interrupts via a special page of stubs, which look like this:

	# Reload the gs register
	push	%gs  
	pop	%gs
	# Make sure the hypervisor knows we've done a gs reload
	movl	$0, lhype_page+4
	# Now it's safe to call into the kernel.

This page is mapped read-only in the guest's address space, so the guest can't change the contents, and voila! The total cost of virtualization for the syscall is a few instructions (although the gs load is not particularly cheap) and a fault on the first %gs access after returning to userspace. As a bonus, we only need to ensure that the one page which contains all the stubs is mapped, not every interrupt handler.

Now it's implemented and debugged, benchmarks to follow...
