Free Software programmer
This blog existed before my current employment, and obviously reflects my own opinions and not theirs. This work is licensed under a Creative Commons Attribution 2.1 Australia License.
Sun, 09 Nov 2008
Ccan gets some much-needed love
OK, so CCAN (think CPAN for C) finally got some cycles: the web page no longer completely sucks, and there's a rudimentary upload facility. I thought it worth mentioning here; IMHO it's something which would really advance best practices in C, but it obviously needs a fair amount more polish and a LOT more code before that becomes a reality. (The handful of modules so far are mine, most inspired by Linux kernel practice, such as the reimplemented list.h.)
[/tech] permanent link

Fri, 29 Aug 2008
Linux Next Graphing
Some neat stats just graphing the size of the bz2 patch for Linux next for the last 108 days (12 May through 28 August). Since Stephen doesn't produce patches on weekends, you can see the gaps (dashed lines are Mondays, Australian time). The -rc1 dip is really clear (these patches are produced against the last labelled Linus kernel, hence the one-day drop), and you can see the -rc2, -rc3 and -rc4 dips diminishing like they're supposed to. Those sharp-eyed will note that during the merge window, kernel hackers work weekends :)
[/tech] permanent link

Tue, 22 Jul 2008
WTF? Wikipedia deletion gone mad...
OK, so Dave Miller's pending deletion I can understand; if you didn't know how key he was, the article itself lacks references and lacks detail (compare it with Andrew Tridgell's page). (At least he noticed; when I was deleted last time I didn't know.) But then I find out that the article on OLS was deleted back in February. Huh? This is the major Linux conference in the world. Some would argue that it's a bit faded at the edges these days, but none of the crop of contenders can genuinely claim that crown. I know conferences don't generally get pages as sexy as humans do, but still...
[/tech] permanent link

Sun, 20 Jul 2008
The Joy of linux-next
Sure, linux-next is a useful way of early-detecting patch conflicts with random developers.
But the second order effect has been more useful to me: forcing me to get my shit together. Now I regularly publish my patchqueue in a form which applies and compiles, and has clear "production" vs "alpha" demarcation. Obviously, this is good for people trying to follow various patches (and there are quite a few independent efforts at the moment, including typesafe patches, virtio, lguest, module, tun/tap, stop_machine, kmod-removal and down_trylock removal), but it also makes the arrival of the merge window far less stressful. In theory, I could have been this organized before. But just like the concept of doing homework long before the deadline, it was never going to happen. So thanks Stephen!
[/tech] permanent link

Mon, 14 Jul 2008
UNSW CS: Employment @ IBM OzLabs Talk: 1pm Tuesday September 2nd
UNSW School of Computer Science and Engineering are having an "Employer of the Week" experiment: September 1st is IBM's week. I'll be spruiking for OzLabs, so if you know anyone at UNSW who is worth talking to, drag them there (I don't know which room, but I'm guessing the signs in CS will be pretty clear). I'm going to try to talk about the stuff people in the office are hacking on, to give an idea what it's like being in what AFAICT is Australia's largest bunch of Free and Open Source Software hackers.
[/tech] permanent link

Mon, 30 Jun 2008
stop_machine latency: the rewrite
Following on from my previous graphs of stop_machine latency, I have new results with my stop_machine simplification patch. Again, it's the 18-way Power4 box; the simplified stop_machine creates all the threads and moves them onto the correct CPUs before starting them. They then step through the state machine themselves, rather than having a central controller. It's actually marginally worse than the previous:
Since these are different kernel versions, I looked at the baseline latency for both kernels:
Now I need to go back and compare the exact same kernel version, to make sure something else isn't interfering...
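The "threads step through the state machine themselves" idea can be illustrated in userspace. What follows is only a sketch under assumptions, not the kernel patch: it uses pthreads and C11 atomics instead of kernel threads, and every name in it (`sm_ctl`, `sm_ack`, `sm_run`, the `SM_*` states) is made up for illustration. All threads are created first and idle; once the initial state is published, the last thread to acknowledge each state advances everyone to the next, so nothing central drives the transitions.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical states, loosely modelled on the description above. */
enum sm_state { SM_NONE, SM_PREPARE, SM_DISABLE_IRQ, SM_RUN, SM_EXIT };

#define SM_MAX_THREADS 16

struct sm_ctl {
	int nthreads;       /* set once, before the first state is published */
	atomic_int state;   /* current enum sm_state */
	atomic_int pending; /* threads which have not yet acked this state */
	atomic_int ran;     /* how many times the payload ran */
};

struct sm_thread {
	struct sm_ctl *ctl;
	int id;
};

static void sm_set_state(struct sm_ctl *ctl, int newstate)
{
	/* Reset the ack count before anyone can see the new state. */
	atomic_store(&ctl->pending, ctl->nthreads);
	atomic_store(&ctl->state, newstate);
}

static void sm_ack(struct sm_ctl *ctl)
{
	/* The last thread to ack moves everybody to the next state. */
	if (atomic_fetch_sub(&ctl->pending, 1) == 1)
		sm_set_state(ctl, atomic_load(&ctl->state) + 1);
}

static void *sm_worker(void *arg)
{
	struct sm_thread *t = arg;
	struct sm_ctl *ctl = t->ctl;
	int seen = SM_NONE;

	do {
		int cur = atomic_load(&ctl->state);

		if (cur != seen) {
			seen = cur;
			/* Only thread 0 runs the payload; the rest just hold still. */
			if (cur == SM_RUN && t->id == 0)
				atomic_fetch_add(&ctl->ran, 1);
			sm_ack(ctl);
		} else {
			sched_yield();
		}
	} while (seen != SM_EXIT);
	return NULL;
}

/* Returns how many times the payload ran (expected: 1), or -1 on failure. */
static int sm_run(int nthreads)
{
	struct sm_ctl ctl;
	struct sm_thread args[SM_MAX_THREADS];
	pthread_t tids[SM_MAX_THREADS];
	int i, started = 0;

	assert(nthreads > 0 && nthreads <= SM_MAX_THREADS);
	atomic_init(&ctl.state, SM_NONE);
	atomic_init(&ctl.pending, 0);
	atomic_init(&ctl.ran, 0);

	/* Create every thread first; they idle until the state changes. */
	for (i = 0; i < nthreads; i++) {
		args[i].ctl = &ctl;
		args[i].id = i;
		if (pthread_create(&tids[i], NULL, sm_worker, &args[i]) != 0)
			break;
		started++;
	}
	if (started == 0)
		return -1;

	ctl.nthreads = started;
	sm_set_state(&ctl, SM_PREPARE); /* now wake them all at once */

	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	return atomic_load(&ctl.ran);
}
```

Calling `sm_run(4)` should return 1: four threads walk through every state in lock-step and the payload runs exactly once. Older toolchains may need `-pthread` when linking.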
[/tech] permanent link

Fri, 27 Jun 2008
Linux Foundation's Device Driver Statement
Someone noted that I didn't sign the LF "proprietary modules are bad" statement. This is entirely due to my slackness and not any lack of support. As kernel module maintainer I feel obliged to maintain the status quo with proprietary modules, but I have noticed many colleagues becoming more annoyed about them.
[/tech] permanent link

Thu, 12 Jun 2008
stop_machine latency
Kathy Staples and I wrote a little program to measure the latency on every CPU on a machine. It sets CPU affinity and high priority (SCHED_FIFO, prio 50) for each thread, then spins doing gettimeofday() for a given duration. The maximum gap in gettimeofday() is reported for each CPU. I tested this on an old 18-way Power4 box sitting around the lab: CPU 0 is used for the parent process, and the latency is measured on the other CPUs. This was run 100 times. Then a variant which did an insmod system call on CPU 0 was used (this calls stop_machine, which is what we were trying to measure).
The results are interesting and a little surprising. Normal max latency is around 35 usec, with stop_machine increasing it to the 100 usec range. There's obviously something running periodically on CPU 2: for both runs I had to remove one horrific 150ms latency result (1000 times average!) but there's still a noticeable spike there. I suspect CPU 1 is low because CPU 0 is mainly idle (same core). But more concerning is that latency seems to go up with higher CPU numbers, whereas I expected it to be worst on lower CPUs. We launch stop_machine threads in CPU order, so I expected the lower CPUs to wait the longest.
We're running modprobe on CPU 0, which means the stop_machine control thread runs there, too. It loops through creating 17 other threads: as CPU 0 is busy, each gets scheduled on a different idle CPU. The first thing the thread does is try to move itself to its proper CPU.
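For readers who want to try something similar, the measurement loop described at the top of this post might look roughly like the sketch below. It is not the original program: the function names are invented for illustration, it handles a single CPU from the calling thread rather than one thread per CPU, and it carries on if SCHED_FIFO is refused (which happens without privilege).

```c
#define _GNU_SOURCE /* for sched_setaffinity() and the CPU_* macros */
#include <assert.h>
#include <sched.h>
#include <sys/time.h>

/* Pin the calling thread to one CPU; 0 on success, -1 on failure. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/* Ask for SCHED_FIFO priority 50; usually needs root, so may fail. */
static int go_realtime(void)
{
	struct sched_param p = { .sched_priority = 50 };

	return sched_setscheduler(0, SCHED_FIFO, &p);
}

static long long now_usec(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Spin for duration_usec; return the largest gap between two readings. */
static long long max_gap_usec(long long duration_usec)
{
	long long start = now_usec(), prev = start, max = 0;

	assert(duration_usec >= 0);
	for (;;) {
		long long t = now_usec();

		if (t - prev > max)
			max = t - prev;
		prev = t;
		if (t - start >= duration_usec)
			break;
	}
	return max;
}

/* Measure one CPU: returns the max gap in usec, or -1 if pinning failed. */
static long long measure_cpu(int cpu, long long duration_usec)
{
	if (pin_to_cpu(cpu) != 0)
		return -1;
	(void)go_realtime(); /* best effort: results are noisier without it */
	return max_gap_usec(duration_usec);
}
```

Something like `measure_cpu(3, 1000000)` would then report the worst gap seen on CPU 3 over one second; running that while another CPU loads a module is the comparison made above.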
I suspect what is happening is that we're creating the 17 threads fast enough that they all end up queued on the migration queue for CPU 0 at once: this queueing uses "list_add" not "list_add_tail", so they are in fact deployed by the migration thread in reverse-CPU order. My simplified version of stop_machine is more intelligent: it moves all the threads to their correct CPUs before waking them all up. This should solve this problem as well as reducing overall latency.
[/tech] permanent link

Fri, 16 May 2008
Tuning VirtIO and virtio_net: part I
One premise of virtio is that we should be as fast as reasonably possible. While there's nothing which should make us slow, that's not the same as actually being fast. So this week, I've been doing some simple benchmarks on my patch queue, which includes major changes to accelerate the tap device and allow async packet sends. I've been using lguest rather than kvm because it's far more hackable, and my test has been a 1GB (1024x1024x1024 byte) TCP send using netcat. And host->guest results were awful: instead of the current 12 seconds it was taking 70 seconds to receive 1GB.
So I started breaking that down. The first thing I found was that simply allocating large receive buffers (of which only 1500 bytes is used) is expensive. Just this change alone takes the time from 12 seconds to 29, and there are two reasons for this so far. The first is that each 1500 byte packet takes two descriptors (we have a header containing metadata), whereas a fully populated paged skb takes 2 + 65536/PAGE_SIZE + 2 == 20 descriptors. That means we only fit 6 large packets in lguest's 128-descriptor ring, vs 64 for the small packet case. Increasing lguest's rings to 1024 drops the time from 29 to 25: not as much as you'd expect. Increasing it further has marginal effect (logically, we should see equivalence at 1280 descriptors, but it has to be a power of 2). The second reason is that alloc_page is quite slow.
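As an aside, the ring arithmetic above is easy to check mechanically. This is only a worked example, assuming PAGE_SIZE is 4096 (which is what makes the formula come out at 20); the helper names are invented and the formula is copied from the text without reading anything more into what each "2" stands for.

```c
#include <assert.h>

#define MAX_SKB_BYTES 65536u /* the 65536 in the formula above */

/* Descriptors for one fully populated paged skb: 2 + pages + 2. */
static unsigned int large_packet_descs(unsigned int page_size)
{
	assert(page_size != 0);
	return 2 + MAX_SKB_BYTES / page_size + 2;
}

/* How many whole packets fit in a ring of ring_size descriptors. */
static unsigned int packets_per_ring(unsigned int ring_size,
				     unsigned int descs_per_packet)
{
	assert(descs_per_packet != 0);
	return ring_size / descs_per_packet;
}

/* Ring size needed to hold a given number of packets. */
static unsigned int ring_size_for(unsigned int packets,
				  unsigned int descs_per_packet)
{
	return packets * descs_per_packet;
}
```

With a 4096-byte page, `large_packet_descs(4096)` is 20, `packets_per_ring(128, 20)` is 6 against `packets_per_ring(128, 2)` at 64, and `ring_size_for(64, 20)` gives the 1280 descriptors needed for equivalence. A 1024-entry ring holds only 51 large packets, which is still short of the 64 small ones.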
A simple cache of allocated pages drops the time from 25 to 19 seconds. But we're still 50% slower than allocating 1500-byte receive buffers, and today's task is to figure out why. It seems unlikely that the increased overhead of skb_to_sgvec, get_buf and add_buf would account for it. Cache effects also seem unlikely: 1024 descriptors are still only 8k. It's unfortunate that oprofile doesn't work inside lguest guests, so this will be old school.
If the overhead really is inherent in large descriptors, we have several options. The obvious one is to add a separate "large buffer" queue, or allow mixing buffer sizes and expect the other end to try to forage for the minimal sized one. Both require a change to the server side. We can add a feature bit for backwards-compat, but that's always a last resort. Another option is to try for multi-page allocations for our skbs: as they're physically contiguous they'll use fewer descriptors.
[/tech] permanent link

Mon, 07 Apr 2008
C inline functions not in headers
I just appreciated an interesting side-effect of slapping "inline" on static functions within .c files. You don't get a warning when they become unused. This breaks my normal method for code cleanup (in this case, the tun driver). So unless you have evidence otherwise, please trust the compiler to inline static functions appropriately and don't label them inline. (And remember: inline is the register keyword for the 21st century.)
[/tech] permanent link

Sat, 05 Apr 2008
Hard To Misuse Commentary
Since my blogfu doesn't extend to comments, I recommend the thoughtful
comments found on my recent 'Hard to Misuse' posts at LWN: firstly
'How Do I Make This Hard to Misuse?'
commentary and then 'What If I Don't Actually Like My Users?' commentary.
[/tech] permanent link

Tue, 01 Apr 2008
What If I Don't Actually Like My Users?
Here begins our descent into hell; if an interface manages to achieve negative scores on the Hard To Misuse List, your users may detect the dull red glow of malignancy rather than incompetence.
That's everything I know about interface design. Now, go and make your own mistakes so you can have wise things to say about it!
[/tech] permanent link

Sun, 30 Mar 2008
How Do I Make This Hard to Misuse?
It's useful to arm ourselves with a pithy phrase should we ever have to face an "it'll be easier to use!" argument. But once we've pointed to it, it's still not clear how to improve the difficulty of interface misuse. So I've created a "best" to "worst" list: my hope is that by putting "hard to misuse" on one axis in our mental graphs, we can at least make informed decisions about tradeoffs like "hard to misuse" vs "optimal".
The Hard To Misuse Positive Score List
[/tech] permanent link

Tue, 18 Mar 2008
APIs: "Easy to Use" vs "Hard to Misuse"
It's an elementary goal of API design to make something easy to use: easy for yourself, easy for yourself next year, easy for others. Let's take that as a given. Many goals will conflict with "easy to use", but the subtlest is the requirement that an API be hard to misuse. Ease of use attracts users, but difficulty of misuse keeps them alive.
To make this concept crisp, I have two real life examples. The first is the safety catch on a gun. Hard to misuse beats easy to use.
The second example is the Linux kernel's kmalloc dynamic memory allocation function. It takes two arguments: a size and a flag. The most commonly used flag arguments are GFP_KERNEL and GFP_ATOMIC: I'll ignore the others for this example. This flag indicates what the allocator should do when no memory is immediately available: should it wait (sleep) while memory is freed or swapped out (GFP_KERNEL), or should it return NULL immediately (GFP_ATOMIC). And this flag is entirely redundant: kmalloc() itself can figure out whether it is able to sleep or not. Implementing malloc() would be a no-brainer, and kernel coders generally like ease of use. So why don't we? [Correction: Jon Corbet points out that it's not entirely redundant in some configurations; we'd need to do a few lines extra work.]
Because atomic allocations should be avoided: they're drawing from a limited pool and more likely to fail or make other atomic allocations fail. By placing the burden of specifying this onto the author, we make atomic allocations easier to spot and thus harder to abuse.
And if we want to make our APIs harder to misuse we need to measure how an API scores, and that'll be the topic of the next post.
[/tech] permanent link

Wed, 12 Mar 2008
Bricklayer, not cathedral builder.
I'm always a little uncomfortable with "fuzzy" programming topics; much better to judge between two specific pieces of code.
The big issues are important but it's hard to say something new on that topic which will help people code better. Most useful stuff has been said already. Nonetheless, for my OLS keynote years ago I did have a point which I felt was underappreciated, and managed to rope it down to actual guidelines so the idea was of practical use. I'm going to revisit that topic in my next few blog posts, because unfortunately my OLS keynote was not recorded anywhere for me to simply point to, and there has been some maturing of these ideas since then.
[/tech] permanent link

Wed, 06 Feb 2008
lca2008 Projector Pong with Wiimote and Linux: Pong Hero!
Once the teething problems were out, and with much assistance from various people, we had fun at linux.conf.au's Open Day playing a pong variant using IR pens and a Wiimote. I've finally put all the information up on a typically-ugly web page, including a link to the source code.
[/tech] permanent link

Wed, 30 Jan 2008
lca2008: 70 OLPCs Randomly Seeded Among Attendees
For years it has been an LCA dream to put an OLPC in every attendee's registration bag, to give the project a development boost and inspire our attendees. We didn't quite get there, but we did get 100. Jim Gettys and I announced at the keynote that we had a handful available, and we'd chosen names at random. We gave out 10 there, and leaked out another 60 to random people over the morning. I fought hard for randomness, because we don't know who will make best use of them and I trust our attendees to pass them on if they can't do something wonderful. Some comments overheard since then have battered my faith, but I still hope that most people will make sure these XOs make a difference.
BTW, the following people were loved by the random number generator but still haven't been found (send them to Registration Desk):
[/tech] permanent link

lguest lca2008 Tutorial Preparation Fastpath
You need to have lguest working for the lguest tutorial. We had a preparation
BoF, and here's what we ended up with (thanks everyone!)
There's also a Qemu image with instructions but you need to build outside and install updates into the image.
[/tech] permanent link

Sat, 26 Jan 2008
linux.conf.au 2008 lguest tutorial: Preparation!
For the lguest tutorial, you will need lguest working. This is a hacking tutorial. This means a 2.6.23 kernel (lguest is different in 2.6.24, so 2.6.23 please!) with lguest support. Sorry, 32-bit x86 only. I'm serious: I'll be turning people away who don't have lguest booting already. Fortunately, we have a BOF from 12:30-2:30 on the Wednesday (that's lunchtime and the next session) to help people get set up.
[/tech] permanent link

Tue, 15 Jan 2008
sg_ring: Sorry, -ETIMEDOUT
Beyond a quickly-reached line, arguing with the maintainer is not a path to getting your patches accepted. Let me just say that I'm in the DaveM school of "then we'll simply rewrite all the drivers" rather than the James Bottomley "abstractions make us futureproof" school.
[/tech] permanent link

Wed, 09 Jan 2008
Partial checksumming of virtio net packets
Today I started hacking on adding extensions to the tun/tap driver; I was going to try adding async I/O but that seems to be a major reengineering and not likely to get in while syslets are waiting in the wings (so meanwhile just use a thread). Partial checksumming and GSO support are my aims: virtio_net supports both at the moment but both kvm and lguest don't turn on those feature bits because tap doesn't support them.
This afternoon: partial checksumming. Implemented, added some printks to make sure it was happening, and then started doing sendfile benchmarks (160MB guest to host). And the differences were marginal. David Miller pointed this out long ago: if you're copying the data with the CPU (as tap does), the checksumming calculation is in the noise.
So tomorrow is GSO support, and using get_user_pages() to avoid copying the skb (except some amount of header). Then it should be a real win...
The beautiful thing: I've made the GSO-describing header for the tap device suspiciously identical to the header for the virtio_net device, so the lguest launcher just passes the whole thing through.
[/tech] permanent link

Tue, 08 Jan 2008
Yak Shaving, eventfd and libaio
Anthony Liguori pointed out that one performance bottleneck for kvm (and lguest, if we cared) is the fact that the tap device doesn't support AIO. Of course I said, AIO is evil because it's incompatible with poll(), to which he replied "eventfd". This was introduced in 2.6.22 and AFAICT is best documented in the commit message. Two patches later Davide slipped in AIO support so AIO requests can hit the eventfd.
So now I want to use the thing, and I track down libaio: shipped by Ubuntu, SuSE and RedHat, and referred to by the io_submit(2) man page. Unfortunately, it's out-of-date: it looks like there's no eventfd support. In fact, I can't find any version beyond 0.3.92 (Ubuntu claims 0.3.106) from 2002: looks pretty unloved.
Ok, let's update the header, and then I decide to run the test suite to make sure I've not broken anything. The test suite doesn't compile; maybe it did with older gccs and glibcs, but not any more. Hack it for the moment and run the tests. Wade through the errors. Find two kernel bugs, create patches and send them off (corner cases, yes, but this is a bad sign). Find a couple of errors in the testsuite. Fix up the Makefile with a "make check" to do all the stuff the README says to do manually. Three or four hours later, send off patch.
Ben LaHaise hasn't responded directly, don't know if he's still interested in maintaining libaio (he indicated he's going to hand over the kernel side). So for posterity (and others searching for preadv/pwritev or eventfd support for libaio): here's my patch. [Update: Jeff Moyer is keeping a repo with updates.]
Now, what was it that I was supposed to be doing?
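For anyone wondering why "eventfd" answers the poll() objection: an eventfd is an ordinary file descriptor wrapping a 64-bit counter, so completions signalled into it show up in the same poll() loop as sockets and tap devices. The sketch below shows only that mechanism using the glibc eventfd() wrapper; it deliberately leaves libaio out, and the helper names are invented for illustration. A plain write() stands in for whatever would signal a completed request.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* 1 if fd is readable within timeout_ms, 0 on timeout, -1 on error. */
static int fd_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	return n > 0 && (pfd.revents & POLLIN) != 0;
}

/* Add n to the eventfd counter, standing in for n completed requests. */
static int signal_events(int efd, uint64_t n)
{
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Read and reset the counter: how many events arrived (0 on error). */
static uint64_t collect_events(int efd)
{
	uint64_t n;

	if (read(efd, &n, sizeof(n)) != (ssize_t)sizeof(n))
		return 0;
	return n;
}
```

A caller would create the descriptor with `eventfd(0, 0)`, add it to the set it already polls, and call `collect_events()` when it turns readable; the counter accumulates, so two signals of 2 and 3 are collected as a single 5.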
[/tech] permanent link

Mon, 07 Jan 2008
My first git whine for 2008
I don't like to whinge about software; that's what bug reporting is for. But it might be instructive to see how I spent the last 20 minutes.
Went to clone my copy of the kvm repo onto my Ubuntu test machine (debussy). Decided to clone my linux-2.6 tree first: might as well have it there. After installing git, then realizing my mistake, removing it and installing git-core, I was ready. First I rsync'ed the linux-2.6 tree from my laptop, but then:

    rusty@debussy:~$ git clone --reference=linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
    error: object directory /home/rusty/devel/cvs/kvm/kvm.git/kvm/.git/objects does not exist; check .git/objects/info/alternates.
    error: refs/reference-tmp/refs/remotes/origin/HEAD points nowhere!
    ...

Clearly, I'd made my laptop linux-2.6 tree with references to my laptop kvm tree (saving bandwidth and disk space). OK, my bad. I should use 'git clone' to do the transfer rather than rsync. First attempt was dumb: 'git clone linux-2.6 debussy:' took a while, and only when I looked on debussy did I realize I'd just cloned into a 'debussy:' dir on my laptop. OK, proper url:

    rusty@vivaldi:~/devel/kernel$ git clone linux-2.6 ssh://debussy/
    Initialized empty Git repository in /home/rusty/devel/kernel/ssh:/debussy/.git/
    remote: Generating pack...

Err, OK, clone doesn't understand destination URLs. Remove the 'ssh:' dir it just created, ssh into debussy and try to clone from there:

    rusty@debussy:~$ git clone ssh://192.168.5.3/devel/kernel/linux-2.6
    rusty@192.168.5.3's password:
    fatal: '/devel/kernel/linux-2.6': unable to chdir or not a git archive
    fatal: unexpected EOF
    fetch-pack from 'ssh://192.168.5.3/devel/kernel/linux-2.6' failed.

Err, that's not the dir I asked for.
OK, use full pathname:

    rusty@debussy:~$ git clone ssh://192.168.5.3/home/rusty/devel/kernel/linux-2.6
    rusty@192.168.5.3's password:
    Connection closed by 192.168.5.3
    fatal: unexpected EOF
    fetch-pack from 'ssh://192.168.5.3/home/rusty/devel/kernel/linux-2.6' failed.

Um, what happened there? No idea. So, I go back to my laptop to create a "clean" dir with no references, so I can just use rsync.

    rusty@vivaldi:~/devel/kernel$ rm -rf tmp; git clone linux-2.6 tmp
    ...
    rusty@vivaldi:~/devel/kernel$ rsync -avz tmp debussy:linux-2.6
    ...
    rusty@vivaldi:~/devel/kernel$ rm -rf tmp

Back to debussy:

    git clone --reference=linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
    tar: refs: Cannot stat: No such file or directory
    tar: Error exit delayed from previous errors
    error: object directory /home/rusty/linux-2.6/objects does not exist; check .git/objects/info/alternates.
    remote: Generating pack...
    remote: Counting objects: 6651
    ^C

Poke around: I forgot the / in rsync, so it's created a linux-2.6/tmp dir. Git spat some cryptic complaints (not "that's not a git repo"), then seemed ready to pull everything (precisely what I try to avoid on my 3G-per-month satellite connection). OK, move that dir up one...

    git clone --reference=linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
    error: refs/reference-tmp/refs/remotes/origin/HEAD points nowhere!
    ...

No idea what that error is, so I'm ignoring it. Git seems to. And after all that, what directory does git create? Not "kvm.git", but "kvm", which appeared nowhere on that commandline. Confusing, because I had an old kvm.git dir there, too...
You can see I'm no git poweruser, and inevitably git will get easier as I memorize the various arcana. But for Rusty today, git is the slowest of the modern version control systems. And that's not counting the time it takes to blog out my frustrations after using it...
:)
[/tech] permanent link

Fri, 04 Jan 2008
#ifdef and -Wundef
One of the problems with the C preprocessor is that, in #if expressions, it deals with undefined symbols by treating them as 0, which can hide bugs. A subtler problem is the widespread use of #ifdef: if you make a typo or use an obsolete name, you don't get any warning.
Fortunately, gcc has -Wundef, which warns when an undefined identifier is evaluated in an #if directive. But to use it to its full effect, you need to change the common C idiom of ifdefs. Instead of this:

    /* Define HAVE_FOO if you have foo support. */
    #ifdef HAVE_FOO
    ...
    #endif

You need to start doing this:

    /* Define HAVE_FOO to 1 if you have foo support, otherwise 0. */
    #if HAVE_FOO
    ...
    #endif

The fact that the Linux kernel uses #ifdefs instead of #if and -Wundef is one of those warts which would be nice to fix if we were starting over, but not worth the churn for such an established project. New projects however...
[/tech] permanent link

Wed, 02 Jan 2008
Chained scatterlists vs. sg_ring
Ever since Jens Axboe's scatterlist chaining patches intruded on my consciousness, they made me uncomfortable. The overloading of lower bits to allow chaining isn't what bothered me, it was how nasty they are to manage: chaining requires an extra padding element, and so you can't do much manipulation with a chained sg handed to you by someone else. This bit the virtio code when I tried to use them.
This, I decided, was one of those places where neat tricks should give way to explicitness: having an exposed two-level structure is easier to understand, debug and manipulate. It also means that new code (struct sg_ring *) is obviously different from unconverted code (struct scatterlist *).
However, when you actually try to do this, you're faced with modifying all the SCSI drivers. Not in a significant way, but changing loops to use different iterators.
And after a number of days over the break spent touching those drivers, I understand why Jens chose the approach which placed so little burden on them (even if annoying for everyone else). It's because these drivers are horrible. Really bad. Clear bugs, non-obvious assumptions and years of neglect. It's certain that converting them in one hit is not feasible, and perhaps any conversion indicates temerity. So at the least, a long-term conversion path is necessary.
[/tech] permanent link