Mon, 21 May 2007

Ambition, Hubris and Virtual I/O

So, now that we have at least four x86 virtualization solutions for Linux (Xen, KVM, VMware and lguest), not to mention UML, Power and S/390, the obvious question has been raised by many: why not have a single mechanism for (virtual device) I/O?

Well, first it turns out that there are many different things which people mean when they talk about I/O. There's guest userspace to guest userspace, guest devices served by the host and guest devices served by another guest. There's device discovery, configuration, serving and guest suspend and resume.

And, of course, everyone has a Plan, and many people have an Implementation as well. This is good because there's experience with different approaches, but bad because no one wants to change. The answer is always to standardize what you can, and let the rest converge naturally. In this case, I think aiming for common guest Linux driver code is an achievable short-term aim (ie. a platform-dependent "virtio" layer and common drivers above it).

Device discovery I'm leaving alone (Xen bus vs PCI vs Open Firmware vs Some-All-New-Virtbus): I'm not sure there's even a great deal of point in unifying it, but more importantly it's a separate problem.

There are four reasonable implementations I have in mind (I assume some method of sending inter-guest interrupts exists):

A shared page:
This is the simplest: copy in, copy out.
A shared page containing descriptors:
The other end is privileged: it can read/write the memory referred to by the descriptors (eg. guest-to-host comms).
Shared pages containing descriptors + hypervisor helper:
The other end can use a hypercall to say "copy the memory referred to by that descriptor" to/from itself. This means the descriptor page has to be read-only to the other side so the hypervisor can trust it.
Full Xen-style grant table:
Mapping of arbitrary pages by the other side can be allowed (and revoked), and pages can be "given" to a willing recipient. This is controlled by a separate table, rather than being implied by the descriptors.
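To make the descriptor variants concrete, here's a minimal sketch of what a shared descriptor page might look like. All the names and the exact layout are my own invention for illustration, not any existing implementation:

```c
#include <stdint.h>

/* Hypothetical layout of one entry in a shared descriptor page
 * (the second and third models above).  The privileged or helper
 * side copies to/from the guest memory each descriptor refers to.
 * In the hypervisor-helper model, the page holding these must be
 * read-only to the other side so the hypervisor can trust it. */
struct desc {
	uint64_t addr;	/* guest-physical address of the buffer */
	uint32_t len;	/* length of the buffer in bytes */
	uint16_t flags;	/* e.g. DESC_F_WRITE: other side writes it */
};

#define DESC_F_WRITE	1	/* buffer is an inbuf (written to us) */

#define PAGE_SIZE	4096
#define DESCS_PER_PAGE	(PAGE_SIZE / sizeof(struct desc))
```

The grant-table model differs in that the right to touch a page lives in a separate table, so a descriptor alone grants nothing.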

The danger is to come up with an abstraction so far removed from what's actually happening that performance sucks, there's more glue code than actual driver code and there are seemingly arbitrary correctness requirements. But being efficient for both network and block devices is also quite a trick.

So far, my model consists of an array of input and output buffers on either side. You register inbufs and outbufs into this array, send from your inbufs to their outbufs and receive from their outbufs to your inbufs. Finally you unregister inbufs and outbufs so the other side can no longer write/read them.

This seems to map reasonably well to existing practice and existing paravirt drivers. It provides the right places for Xen to grant/ungrant, and it works whether you're pulling or pushing data: send might actually transfer data, or it might just wake the other side. Similarly, receive might do nothing, or might actually do the transfer.

The actual management of where in the array to put your in and outbufs, and where to send to/receive from, and how to coordinate that with the other side is currently left to the driver. For the network driver, it's a ring buffer. For the block driver, it'll be more randomly ordered. This might be pushed into the infrastructure as more commonality emerges.
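For the network case, the driver's ring management over those slots might look roughly like this. Again, a sketch with invented names, assuming a power-of-two ring and free-running indices:

```c
#include <stdint.h>

#define RING_SIZE 256	/* power of two, so wrapping is a mask */

/* Minimal producer/consumer ring over the buffer array: the driver
 * publishes registered buffer slots at 'head', the other side
 * consumes them at 'tail'.  The indices run freely and are masked
 * on use, so head == tail means empty and a difference of
 * RING_SIZE means full. */
struct ring {
	uint32_t head;		/* next entry the producer fills */
	uint32_t tail;		/* next entry the consumer takes */
	int slots[RING_SIZE];	/* buffer-array slot indices */
};

static int ring_add(struct ring *r, int slot)
{
	if (r->head - r->tail == RING_SIZE)
		return -1;	/* full */
	r->slots[r->head++ & (RING_SIZE - 1)] = slot;
	return 0;
}

static int ring_take(struct ring *r)
{
	if (r->head == r->tail)
		return -1;	/* empty */
	return r->slots[r->tail++ & (RING_SIZE - 1)];
}
```

A block driver would complete entries out of order, so it would track slots individually rather than with a single tail index.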

I'll know more about how well it's worked once I've got a couple of drivers and a couple of backend implementations....

[/tech] permanent link