Free Software programmer
This blog existed before my current employment, and obviously reflects my own opinions and not theirs.
This work is licensed under a Creative Commons Attribution 2.1 Australia License.
Sun, 28 Jan 2007
So I've been looking at lguest performance, and it's an interesting area. There were some fairly obvious things to do with page table updates (we used to throw away the whole page table on every context switch, for example), and they proved a big win. Implementing binary patching, something I wanted so lguest could be a good demonstration of paravirt_ops, bought around a 5% improvement. But one of my ideas hasn't worked out.
The idea of amortizing hypercall cost with some batching mechanism is not novel; it's explicitly supported by the "set_lazy_mode" operation in paravirt_ops. In lguest I decided on a simple ringbuffer of calls: when you make a real hypercall, the queued calls in the ringbuffer are executed first. Yet we were already down to 2 hypercalls per context switch, so reducing it to 1 doesn't make a great difference.
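A minimal sketch of the batching idea, in userspace C rather than real kernel code. All the names here (`struct hcall_entry`, `async_ring`, `async_hcall`, `hcall`) are hypothetical, and `execute_one()` just counts calls where the real host would perform them; the actual lguest implementation differs.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 64

struct hcall_entry {
	unsigned long call;		/* hypercall number */
	unsigned long arg1, arg2;	/* arguments */
};

static struct hcall_entry async_ring[RING_SIZE];
static unsigned int ring_used;
static unsigned int calls_executed;	/* stand-in for real execution */

/* Host side: in lguest this would actually perform the call;
 * here we just count how many calls reached the host. */
static void execute_one(const struct hcall_entry *e)
{
	(void)e;
	calls_executed++;
}

/* Queue a call for later, falling back to immediate execution
 * if the ring is full. */
static void async_hcall(unsigned long call, unsigned long a1, unsigned long a2)
{
	if (ring_used == RING_SIZE) {
		struct hcall_entry e = { call, a1, a2 };
		execute_one(&e);
		return;
	}
	async_ring[ring_used].call = call;
	async_ring[ring_used].arg1 = a1;
	async_ring[ring_used].arg2 = a2;
	ring_used++;
}

/* A real (synchronous) hypercall flushes the ring first, so all the
 * deferred calls are seen, in order, before the new one. */
static void hcall(unsigned long call, unsigned long a1, unsigned long a2)
{
	unsigned int i;
	struct hcall_entry e = { call, a1, a2 };

	for (i = 0; i < ring_used; i++)
		execute_one(&async_ring[i]);
	ring_used = 0;
	execute_one(&e);
}
```

The point of the structure is that N deferred calls plus one synchronous call cost a single guest/host transition, not N+1.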
My grander plan was to use these "async" calls for network I/O, to get up to 64 packets in one hypercall. But for real (TCP) network flows between two guests, this doesn't help. It helps a little on a simple udpblast scenario, but it hurts horribly on a pingpong benchmark. I previously changed lguest to use "sync" wakeups for inter-guest interrupts, and to yield() when the receiver is out of buffers: both help on bandwidth benchmarks.
I've kept the network async call patch around though, because I suspect the terrible latencies are due to a bug rather than a flaw in the idea: AFAICT the sender should go idle fairly soon and call LHCALL_HALT, which will flush the async calls. I'll revisit it later; telling the networking core about the capabilities of lguest_net is the more obvious path to speed!
Avi pointed out that KVM (like lguest) blocks on disk I/O. Changing this is easy in theory, but I'd prefer to use a separate process rather than AIO or threads. And of course, there is also an infinite number of page table optimizations to be done...
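The separate-process approach can be sketched like this: fork a helper which does the blocking work and reports completion over a pipe, so the main loop can poll() the pipe instead of blocking. The names (`start_io_helper`, the 'D' completion byte) are hypothetical, and the actual disk read is elided; this only illustrates the shape of the design.

```c
#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child to do the (potentially slow) I/O and send a completion
 * byte back over a pipe.  On success returns the child's pid and sets
 * *result_fd to the read end of the pipe; returns -1 on failure. */
static pid_t start_io_helper(int *result_fd)
{
	int fds[2];
	pid_t pid;

	if (pipe(fds) != 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: do the blocking work, then report completion. */
		char done = 'D';
		close(fds[0]);
		/* ... pread() the virtual disk image here ... */
		if (write(fds[1], &done, 1) != 1)
			_exit(1);
		_exit(0);
	}
	/* Parent: keep only the read end; poll() it alongside the
	 * other event sources rather than blocking on the disk. */
	close(fds[1]);
	*result_fd = fds[0];
	return pid;
}
```

Compared with threads or AIO, a separate process keeps the launcher's address space simple and means a wedged I/O helper can't corrupt guest state, at the cost of an extra fork and a copy through the pipe.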
Meanwhile, compiling the kernel under lguest (512M) takes almost exactly twice as long as compiling under the host (3G). I'd hope to halve that gap, but after that I expect we'll face diminishing codesize/performance returns, and lguest is supposed to stay simple.
[/tech] permanent link