|
Free Software programmer
Subscribe This blog existed before my current employment, and obviously reflects my own opinions and not theirs. ![]() This work is licensed under a Creative Commons Attribution 2.1 Australia License.
Categories of this blog:
Older issues:
My wife: |
Sun, 20 Jul 2008The Joy of linux-nextSure, linux-next is a useful way of early-detecting patch conflicts with random developers. But the second order effect has been more useful to me: forcing me to get my shit together. Now I regularly publish my patchqueue in a form which applies and compiles, and has clear "production" vs "alpha" demarcation. Obviously, this is good for people trying to follow various patches (and there are quite a few independent efforts at the moment, including typesafe patches, virtio, lguest, module, tun/tap, stop_machine, kmod-removal and down_trylock removal), but it also makes the arrival of the merge window far less stressful. In theory, I could have been this organized before. But just like the concept of doing homework long before the deadline, it was never going to happen. So thanks Stephen! [/tech] permanent link Mon, 14 Jul 2008UNSW CS: Employment @ IBM OzLabs Talk: 1pm Tuesday September 2ndUNSW School of Computer Science and Engineering are having "Employer of the Week" experiment: September 1st is IBM's week. I'll be spruking for OzLabs, so if you know anyone at UNSW who worth talking to, drag them there (I don't know which room, I'm guessing the signs in CS will be pretty clear). I'm going to try to talk about the stuff people in the office are hacking on, to give an idea what it's like being in what AFAICT is Australia's largest bunch of Free and Open Source Software hackers. [/tech] permanent link Mon, 30 Jun 2008stop_machine latency: the rewriteFollowing on from my previous graphs of stop_machine latency, I have new results with my stop_machine simplification patch. Again, it's the 18-way Power4 box; the simplied stop_machine creates all the threads and moves them into the correct CPUs before starting them. They then step through the state machine themselves, rather than having a central controller. Since these are different kernel versions, I looked at the baseline latency for both kernels: Now I need to go back and compare the exact same kernel version, to make sure something else isn't interfering... [/tech] permanent link Fri, 27 Jun 2008Linux Foundation's Device Driver StatementSomeone noted that I didn't sign the LF "proprietary modules are bad" statement. This is entirely due to my slackness and not any lack of support. As kernel module maintainer I feel obliged to maintain the status quo with proprietary modules, but I have noticed many colleagues becoming more annoyed about them. [/tech] permanent link Mon, 16 Jun 2008Selling the Farm .
With Alli's high-risk pregnancy, we're selling the farm and moving back to an apartment in Adelaide (where both our families are). She's moved across already, I'm staying to take care of the farm until it sells. Unfortunately farms do not sell quickly, but it gives me time to have everyone visit (again). And now we're selling, there's less chance you'll be asked to do random tree planting or similar chores. When Richard Guy Briggs visited last year he took some great photos, and now seems a good chance to link to them. [/self] permanent link Thu, 12 Jun 2008stop_machine latencyKathy Staples and I wrote a little program to measure the latency on every CPU on a machine. It sets CPU affinity and high priority (SCHED_FIFO, prio 50) for each thread, then spins doing gettimeofday() for a given duration. The maximum gap in gettimeofday() is reported for each CPU. I tested this on an old 18-way Power4 box sitting around the lab: CPU 0 is used for the parent process, and the latency is measured on the other CPUS. This was run 100 times. Then a variant which did an insmod system call on CPU 0 was used (this calls stop_machine, which is what we were trying to measure). The results are interesting and a little surprising. Normal max latency is around 35 usec, the stop_machine increasing it to the 100 range. There's obviously something running periodically on CPU 2: for both runs I had to remove one horrific 150ms latency result (1000 times average!) but there's still a noticeable spike there. I suspect CPU1 is low because CPU0 is mainly idle (same core). But more concerning is that latency seems to go up with higher CPU numbers, whereas I expected it to be worst on lower CPUs. We launch stop_machine threads in cpu order, so I expected the lower CPUs to wait the longest. We're running modprobe on cpu 0, which means the stop_machine control thread runs there, too. It loops through creating 17 other threads: as CPU 0 is busy, it gets scheduled on a different idle CPU. The first thing the thread does is try to move itself to its proper CPU. I suspect what is happening is that we're creating the 17 threads fast enough that they all end up queued on the migration queue for CPU 0 at once: this queueing uses "list_add" not "list_add_tail", so they are in fact deployed by the migration thread in reverse-CPU order. My simplified version of stop_machine is more intelligent: it moves all the threads to their correct CPUs before waking them all up. This should solve this problem as well as reducing overall latency. [/tech] permanent link Fri, 16 May 2008Tuning VirtIO and virtio_net: part IOne premise of virtio is that we should be as fast as reasonably possible. While there's nothing which should make us slow, that's not the same as actually being fast. So this week, I've been doing some simple benchmarks on my patch queue, which includes major changes to accelerate the tap device and allow async packet sends. I've been using lguest rather than kvm because it's far more hackable, and my test has been a 1GB (1024x1024x1024 byte) TCP send using netcat. And host->guest results were awful: instead of the current 12 seconds it was taking 70 seconds to receive 1GB. So I started breaking that down. The first things that I found was that simply allocating large receive buffers (of which only 1500 bytes is used) is expensive. Just this change alone takes the time from 12 seconds to 29, and there are two reasons for this so far. The first is because each 1500 byte packet takes two descriptors (we have a header containing metadata), whereas a fully populated paged skb takes 2 + 65536/PAGE_SIZE + 2 == 20 descriptors. That means we only fit 6 large packets in lguest's 128-descriptor ring, vs 64 for the small packet case. Increasing lguest's rings to 1024 drops the time from 29 to 25: not as much as you'd expect. Increasing it further has marginal effect (logically, we should see equivalence at 1280 descriptors, but it has to be a power of 2). The second reason is that alloc_page is quite slow. A simple cache of allocated pages drops the time from 25 to 19 seconds. But we're still 50% slower than allocating 1500-byte receive buffers, and today's task is to figure out why. It seems unlikely that the increased overhead of skb_to_sgvec, get_buf and add_buf would account for it. Cache effects also seem unlikely: 1024 descriptors are still only 8k. It's unfortunate that oprofile doesn't work inside lguest guests, so this will be old school. If the overhead really is inherent in large descriptors, we have several options. The obvious one is to add a separate "large buffer" queue, or allow mixing buffer sizes and expect the other end to try to forage for the minimal sized one. Both require a change to the server side. We can add a feature bit for backwards-compat, but that's always a last resort. Another option is to try for multi-page allocations for our skbs: as they're physically contiguous they'll use fewer descriptors. [/tech] permanent link |