Mon, 01 Jan 2007

Lhype's TLS Segment Trick

x86 hypervisors under Linux have a problem: glibc wants segments which cover the entire 4GB range of virtual addresses, but allowing that would let the guest access hypervisor memory (usually sitting in the top 64MB or so of the virtual address space). This is because glibc uses segments to implement __thread (aka thread-local storage): variables sitting below the thread pointer are addressed with huge offsets which wrap around past 4GB, and that only works if the segment limit covers the whole range.
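To make the wraparound concrete, here is a minimal sketch of the arithmetic in C. It is only an illustration of how a 32-bit segment base, offset and limit interact, not code from glibc or lhype. The names `linear_address`, `within_limit`, `trimmed_limit` and the constant `HYPERVISOR_START` (4GB minus 64MB) are all made up for this example, and it only models an ordinary expand-up data segment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration: the hypervisor owns the top 64MB. */
#define HYPERVISOR_START 0xFC000000u

/* The CPU adds the offset to the segment base modulo 4GB. */
static uint32_t linear_address(uint32_t base, uint32_t offset)
{
    return base + offset; /* unsigned arithmetic wraps at 2^32 */
}

/* An access of `size` bytes passes the limit check only if its
 * last byte is at an offset no greater than the segment limit. */
static bool within_limit(uint32_t limit, uint32_t offset, uint32_t size)
{
    return size != 0 && offset <= limit && size - 1 <= limit - offset;
}

/* Largest limit that keeps base + offset below the hypervisor.
 * Assumes base < HYPERVISOR_START. */
static uint32_t trimmed_limit(uint32_t base)
{
    return HYPERVISOR_START - 1 - base;
}
```

With a thread pointer at, say, 0xB7F00000, a variable 8 bytes below it is reached with offset `(uint32_t)-8`, i.e. 0xFFFFFFF8. That passes the limit check on a full 4GB segment and lands at 0xB7EFFFF8, but it fails the check against the trimmed limit, which is exactly why unmodified glibc can't live with trimmed segments.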

Linux doesn't have a problem with allowing these huge segments, because the "U" bit in the page tables protects it: if this bit isn't set, userspace can't access the memory. However, while this protects ring0 from ring3, it doesn't protect ring0 from ring1 (the hypervisor case, where the guest kernel runs in ring1). For this reason, Xen uses a modified glibc (or traps on every __thread access and prints out a warning that you're going damn slowly).

lhype is supposed to be convenient, so a modified glibc (at least, until everyone has them in their distributions) or a huge performance hit were not good options. Hence I used a different trick: since all transitions between userspace and kernel (i.e. interrupts and iret) go via the hypervisor, we replace the TLS segments with trimmed segments when returning to the guest kernel, and with the full segments when returning to guest userspace, where the lack of the U bit on the pagetables protects the hypervisor anyway.
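The choice made on each return to the guest can be sketched roughly like this. This is a hedged illustration of the idea, not lhype's actual code: `struct tls_segment`, `trim`, `tls_for_return` and `HYPERVISOR_START` are hypothetical names, and the sketch assumes the privilege level we're returning to can be read from the low two bits of the code segment selector being returned to.

```c
#include <stdint.h>

/* Assumption for illustration: the hypervisor owns the top 64MB. */
#define HYPERVISOR_START 0xFC000000u
#define USER_RPL 3u

/* Hypothetical, simplified view of a TLS segment descriptor. */
struct tls_segment {
    uint32_t base;
    uint32_t limit; /* highest valid offset */
};

/* Clamp the limit so base + offset can never reach the hypervisor.
 * Assumes full.base < HYPERVISOR_START. */
static struct tls_segment trim(struct tls_segment full)
{
    uint32_t max_limit = HYPERVISOR_START - 1 - full.base;

    if (full.limit > max_limit)
        full.limit = max_limit;
    return full;
}

/* Pick the descriptor to install before returning to the guest. */
static struct tls_segment tls_for_return(struct tls_segment full,
                                         uint16_t return_cs)
{
    /* Ring 3 is already fenced off by the missing U bit, so it can
     * have the full segment; the guest kernel gets the trimmed one. */
    if ((return_cs & 3u) == USER_RPL)
        return full;
    return trim(full);
}
```

So a 4GB segment based at 0xB7F00000 stays at limit 0xFFFFFFFF when the return selector has its low bits set to 3, and is cut down to limit 0x440FFFFF when they are 1 (the guest kernel).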

This works well, but the two bounces through the hypervisor for every system call are the reason our system calls are 35 times slower than native. And if we don't go through the hypervisor, how do we ensure that the kernel never gets access to those huge hypervisor-mapping segments?

A: Another, slightly trickier trick....
