Tue, 20 Feb 2007

lguest: Life Without Segments?

So, I've spent the last few days trying to wean lguest off segments, and now the result works (with a performance regression I'm hunting), so it's time to braindump the whole thing.

Currently lguest uses segments: nasty x86 things which allow you to put an offset and limit on the virtual addresses which can be accessed. You tell the CPU about your segment table (aka the Global Descriptor Table) with the "lgdt" instruction: after that, anyone can try to load a number into a segment register and start using that segment. eg. "movl $0x68, %eax; movl %eax, %ds" (you can't load an immediate into a segment register directly) loads GDT entry 13 into the %ds segment register: each entry is 8 bytes, so shift away the bottom three bits of the selector (0x68 >> 3 == 13). (This example is real: in Linux, entry 13 is the KERNEL_DS segment used for the kernel data and stack).

There are six segment registers: %cs is used for code, %ds for data and %ss for stack operations. %es is used for some string operations. The other two, %fs and %gs, are only used when an instruction prefix explicitly asks for them instead of %ds. This is used for special effects, such as per-cpu data or per-thread data. For example, "movl %gs:1000, %eax" will read from a different virtual memory address depending on the GDT entry last loaded into the %gs segment register.

Each GDT entry contains a 2-bit privilege field, so you can disallow less privileged (ie. numerically higher) CPU states from loading GDT entries. This means you can put a limit on all the entries in the GDT available to, say, priv level 1, and thus guarantee that any process at priv level 1 is unable to access high virtual addresses. The GDT entry is only read when the segment register is loaded, though: you could load a priv-0-only GDT entry into %ds at priv level 0, then return to priv level 1, and it would continue using that segment.

Anyway, "trimmed" segments are what Xen and lguest (and AFAIK VMware) use to protect the hypervisor from guests (which run at priv level 1). Lguest uses two unlimited GDT entries (10 and 11) which are only available at priv level 0: traps and interrupts are set up to switch %cs to one of these and jump into the hypervisor, which only those segments can reach.

This approach has two problems: glibc needs full untrimmed segments, and some x86_64 chips don't enforce segment limits, so it doesn't work there. Lguest has a fairly complicated trick for the former, involving trapping and reloading untrimmed segments for userspace, and bouncing syscalls through a segment-neutralizing trampoline. As for x86_64, lguest is 32-bit only 8)

The new idea (from Andi Kleen and Zach Amsden) gets rid of segments altogether, and uses the pagetables to protect the hypervisor. This means the hypervisor text and real GDT are visible to the guest, but read-only so they can't be changed. Pages for other guests which might be running on other CPUs aren't mapped at all in this guest. The result looks like this for a guest running on CPU1:

+---------------+ 0xFFFFFFFF
|               |
|               |
...(Unmapped) ...
|               |
|               |
|CPU 1: rw page | <- stack for interrupts
|CPU 1: ro page | <- host state for restore, guest IDT, GDT & TSS.
|               |
...(Unmapped) ...
|               |
|Hypervisor Text| <- (Text is readonly of course!)
+===============+ 0xFFC00000 (4G - 4M)
|               |
|    mappings   |

The stack is fully writable by the guest, but we only use it when we trap, in which case the guest isn't running (lguest guests are uniprocessor, but even if they were SMP, only this CPU's trap page is mapped on this CPU).

The mapping in the host is the same, except the host can see all the pages for every cpu, and all are writable. Linux i386 doesn't have per-cpu kernel mappings, but mapping each CPU's pair of pages in adjacent addresses works just as well.

Switching into the guest looks like:

  1. Disable interrupts
  2. Link this CPU's hypervisor pagetable page into this guest's pagetable.
  3. Copy guest registers into "read-write" page for this CPU.
  4. Copy guest GDT, Interrupt Descriptor Table into "read-only" page for this CPU.
  5. Save segment registers and frame pointer (we've told gcc we'll clobber every register it lets us clobber).
  6. Save host stack pointer and switch to guest stack.
  7. Switch to guest's GDT and Interrupt Descriptor Table.
  8. Load guest's TSS.
  9. Switch to guest page tables (GDT, IDT, TSS etc. now read-only)
  10. Pop all the guest registers off the stack.
  11. iret to jump back into the guest.

The copying in (and, on return, copying out) of registers is a pain, but we need a stack mapped in the same place in the guest and host: a non-maskable interrupt (NMI) could come in at any time and so we must always have a valid stack. Moving the stack and switching pagetables atomically is almost impossible.

The copying in of (most of) the GDT and IDT (and a couple of guest-specific fields in the TSS) is also a pain, but they must likewise be mapped at the same place in guest and host. Loading the guest TSS (which tells the CPU where the stack for traps is) actually involves a write to the GDT by the CPU, so we cannot load the TSS after we've switched to the guest pagetables, where the GDT is read-only. Loading the TSS before the switch implies that it's at the same virtual address in host and guest.

This implementation works, but virtbench reveals that guest context switch time has doubled. Strangely, it seems that the copying in and out is not the culprit; I'm profiling now...

[/tech] permanent link