Fri, 16 May 2008

Tuning VirtIO and virtio_net: part I

One premise of virtio is that we should be as fast as reasonably possible. While there's nothing which should make us slow, that's not the same as actually being fast. So this week, I've been doing some simple benchmarks on my patch queue, which includes major changes to accelerate the tap device and allow async packet sends.

I've been using lguest rather than kvm because it's far more hackable, and my test has been a 1GB (1024x1024x1024 byte) TCP send using netcat. And host->guest results were awful: instead of the current 12 seconds it was taking 70 seconds to receive 1GB. So I started breaking that down.

The first things that I found was that simply allocating large receive buffers (of which only 1500 bytes is used) is expensive. Just this change alone takes the time from 12 seconds to 29, and there are two reasons for this so far.

The first is because each 1500 byte packet takes two descriptors (we have a header containing metadata), whereas a fully populated paged skb takes 2 + 65536/PAGE_SIZE + 2 == 20 descriptors. That means we only fit 6 large packets in lguest's 128-descriptor ring, vs 64 for the small packet case. Increasing lguest's rings to 1024 drops the time from 29 to 25: not as much as you'd expect. Increasing it further has marginal effect (logically, we should see equivalence at 1280 descriptors, but it has to be a power of 2).

The second reason is that alloc_page is quite slow. A simple cache of allocated pages drops the time from 25 to 19 seconds.

But we're still 50% slower than allocating 1500-byte receive buffers, and today's task is to figure out why. It seems unlikely that the increased overhead of skb_to_sgvec, get_buf and add_buf would account for it. Cache effects also seem unlikely: 1024 descriptors are still only 8k. It's unfortunate that oprofile doesn't work inside lguest guests, so this will be old school.

If the overhead really is inherent in large descriptors, we have several options. The obvious one is to add a separate "large buffer" queue, or allow mixing buffer sizes and expect the other end to try to forage for the minimal sized one. Both require a change to the server side. We can add a feature bit for backwards-compat, but that's always a last resort. Another option is to try for multi-page allocations for our skbs: as they're physically contiguous they'll use fewer descriptors.

[/tech] permanent link