Fri, 29 Dec 2006

Virtualization minconf the biggest miniconf at

According to Pia via ComputerWorld: Waugh was also surprised to note that LCA 2007 will also be the first LCA event in which the Debian miniconf has not been the most popular amongst registered participants; instead, participants have registered most interest in the topic of Virtualisation.

I'm always amazed by the interest in virtualization. I mean, it's a neat trick, but the real news is that it's becoming commodity, and commodity means boring. Management is the key issue at this stage, but I wouldn't have thought that would have been mainstream to attendees. But maybe it's just that the Debian miniconf is losing attention in this Ubuntu era...

[/self] permanent link

Wed, 27 Dec 2006

SVG Presentation Pain

As something of an experiment, four of us are presenting at LCA on how to write a hypervisor. It's a nuts and bolts talk, and while there aren't too many tricky details, there are a lot of them, mainly obscure things about the x86 architecture. This means one thing: the talk needs lots of good, consistent diagrams.

So I started drawing the diagrams in Inkscape, which finally replaced the venerable xfig as my figure creator of choice after seeing its use by Andy Fitzsimon. Only then to discover (re-discover, actually) that OpenOffice doesn't import SVGs. Exporting hundreds as PNGs and trying to match sizes didn't excite me at all.

So I decided to do the whole presentation in Inkscape; each slide would be a layer. It's a little awkward to view that way, but with appropriate placement of guides, and a heuristic that layers with names which are a substring of others should be combined, it makes for painless diagram creation! So now my presentation is in one SVG file with 49 layers, with names like "Virt" (section heading) and "Virt 1" (first slide in section, leave section heading layer showing), and I hacked up a little proggie to convert to PNGs (I wanted to do direct display but that was more complex. Anyway, shouldn't it be possible to write a modern web thingy to do it all in a browser? Anyone?). Here's the tarball weighing in at a glorious 16k! SVG rocks.

[/tech] permanent link

Fri, 22 Dec 2006

virtbench and lhype

So, with some help from Tony Breeds, virtbench now runs reasonably well.

The purpose of virtbench is to provide low-level benchmarks for virtualization developers to optimize Linux on their systems. It's designed to be hypervisor-agnostic and very simple to run (Tony is writing the scripts for Xen now). It runs a statically-linked standalone "virtclient" binary in each virtual machine, then the server binary coordinates testing. At the moment it consists of 10 tests, but it will grow significantly as I test more things and get feedback. It also has a local mode, where each virtclient is launched as a process in the local machine (inter-guest tests are then TCP between processes rather than between machines).

So I compared lhype against local performance on my 1533MHz AMD Athlon. Since the benchmarks are mostly chosen to measure things where we expect virtualization to be slow, we expect to be measurably slower than native. Indeed, the naivety of lhype shows up almost immediately: I refused to optimize before this benchmarking.

Time for one context switch via pipe 108 times slower
Time for one Copy-on-Write fault 3 times slower
Time to exec client once 26 times slower
Time for one fork/exit/wait 81 times slower
Time to walk random 64 MB 1 times slower
Time to walk linear 64 MB 1 times slower
Time to read from disk (16 kB) 2 times slower
Time for one disk read 0 times slower
Time for one syscall 35 times slower
Time for inter-guest pingpong 8 times slower

The "disk read" cases are wrong, because the disk read isn't synchronous in lhype. The memory walks are about the same (this would only expect to be different with overcommit), but the worst results are the context switch, the fork/exit/wait, and the syscall. The context switch slowness I expected, because lhype throws away all the pagetables when the toplevel changes (and then immediatly faults them all back in). A better solution would be to keep track of multiple toplevels, or at least pre-fault back in the stack pointer and program counter. The system call overhead is much higher than I expected: it goes into the hypervisor and is reflected back out, but 35 times slower? Fortunately, this is the easiest to fix, too: we should direct the system call trap directly into the guest. That means some trickier handling for the system call return (which could be done without a hypercall).

[/tech] permanent link

Wed, 20 Dec 2006

On GPLv3 and Modification Restrictions

There has been much discussion about the interaction of DRM (aka Digital Restrictions Management) and the GPLv3. I have explored this issue in depth both with those involved in the process and lawyers familiar with the issues.

This post attempts to shed some light on my concerns, as an active Free Software developer who releases code under the GPL.

The Intent of the GPL

I am not a lawyer; I started using the GPL based on other's explanations and the preamble. This explains the intent of the license to allow others to use, distribute and modify the code, and the restrictions it places on people to enforce this intent.

One restriction my use of GPLv2 places on distributors is "you may not impose further restrictions": distributors can't add their own restrictions on my code to subvert the intent of the license. This is important to me: I handed them a whole slew of rights, and if they hand this code on to you, they must also hand on those rights.

Now, I don't care how someone tries to add restrictions against my express intent: licensing, patents, hardware, software, TPM law or voodoo. Do not do it. When all else fails I expect the license to give me the legal ability to enforce my intent.

Those legal mechanisms in the license are beyond my expertise, but I do believe the lawyers when they tell me the GPL needs updating. Fifteen years of experience and fifteen years of legal changes can have perverse effects even on the best-drafted license.

The Case for Technological Prevention Measures

Several people have made the case for preventing modification of GPL software. But such arguments can similarly be used for any other licensing terms, and have found little sympathy. There might be genuine security or legal problems with releasing modified source code, for example. In which case, they simply cannot distribute. There are often cases where a party seeks a relaxation of licensing terms in order to combine with non-GPL code outside their control. I've been approached to relicense several projects, and sometimes saying no has meant that some cool software didn't exist, or became harder to create.

On prevention of modification, let me be clear: I do not wish someone to gain a monopoly on supply and support of my code. This is not a subtle issue caused by license wording, but a real tension between my intent and theirs: I intend to allow downstream modification, they want to prevent it.

Nor is this a GPLv3 issue, but the GPLv3 (draft 2) clarifies the legalese by explicitly requiring "encryption or authorization keys" as part of the Source Code in these cases, bringing this issue to the fore. And it seems to me that addressing DRM is a requirement if we are to fully implement the promise of the preamble in modern times.

Now, there is sometimes legitimate reason to loosen any term of a license, and the requirement to provide keys to execute modified versions is no exception. The question then becomes one of the size of the problem, and the cost of solving it[1]. Remember that the GPL does not have to meet every need, so deciding not to solve the problem is not a catastrophe.

Firstly, I note that by far the most uses of authentication involve (non-GPL) data, rather than programs: these are not problematic.

Secondly, projects such as signed Linux kernel modules are not problematic either. Equivalent but different signature can be generated by rebuilding the source code[2].

The classic example of program verification is that of GPL-derived voting machine software which uses a TPM to recognise modified software and prevent them from contributing votes. As the GPL sharing obligations only kick in with distribution[3], intra-company and government-controlled elections have no GPL issues. The only problem comes in distributing "unmodifiable" software, such as in home voting.[4] There are genuine reasons to want to modify this software (such as to run on Linux, or on PowerPC, or in Australia), and there is a genuine reason to want to prevent modifications (vote integrity). Note that while modification prevention is poor security[5], it is better than nothing[6].

A second example is a set-top box, which contains GPL-derived software and no longer displays the movies (for which you have paid), should it be modified in any way. Again, you cannot fix bugs, make it work in Australia, lower its power consumption or add a pause button, if you want the software to be useful, as was my intent as original author. I do not want my software to be distributed in such a restricted way, and I expect I can prevent it under GPLv2. You should not use my software if you need to do this.

Presumably, we would like to allow the former, without the latter. It is arguable whether this corner case justifies weakening the license, but let's explore it.

The Further Restrictions

There are at least two cases where further modification restrictions are already tacitly allowed by the GPLv2, as illuminated by the FSF.

The first is GPL executables on "Live CDs", the second is in ROM. In the Live CD case, the lack of modification is so mild a restriction it can probably be ignored. In the ROM case, the FSF has said that this is fine. I can only assume they feel that there has been no deliberate attempt to remove rights, merely a reasonable physical limitation.

We could extend this logic to cover other "reasonable limitations". Yet we do not want to create an exception which allows an end-run around the license requirements. An exception which allowed the license terms to be ignored if "required for functionality" would become a perverse incentive to design systems with such a requirement in order to subvert the license.

So If We Had an Exception, What Could It Look Like?

Given these constraints, any such exception would need to be minimal in scope and introduce some burden to avoid it becoming an attractive license loophole. So, if such cases are considered important, I suggest the following four step test to relax the "all the same functionality in the same range of circumstances" requirement for modified works:

  1. The restrictions are the minimal required for correct functioning of the system,
  2. There is no reasonable alternative implementation which could have been used to achieve similar results,
  3. The correct functioning ensured by the restrictions are in the interest of the users as a whole, and
  4. There must be a method by which modifications can reasonably be demonstrated not to subvert the correct functioning of the system, and modifications so demonstrated must be allowed.

The first two clauses ensures the restrictions are only open as a reasonable last resort. The third is a safeguard against systems designed solely to subvert the license. It is a weakened version of the intent that the copy recipient be given various freedoms.[7] The fourth term also places the burden on the distributer using this exception: the onus is on them to allow fixes.

I'm not convinced that opening this door is worthwhile, given the complications it would add to the license and limited nature of the problem. But if it is to be opened, I would recommend such safeguards.

Rusty Russell (,
Free/Open Source Software Hacker.

[1] There is a large case of compromise in the GPLv2: the "system component" exception which allows distribution of GPLv2 binaries which rely on GPL-incompatible libraries, as long as they are "normally distributed with the major components of the operating system". This is a case where the cost of the loophole was judged smaller than the size of the problem if GPL software could not use non-GPL system libraries (almost all of them in 1991).

[2] I went through this a few years back with a lawyer when reviewing a (RedHat) patch to the Linux module code, and seems to be addressed explicitly in GPLv3's "use of the work normally implies the user already has the key". The key is randomly generated at build time by the build scripts in this case.

[3] Or "conveying" in GPLv3-speak.

[4] A terrible idea for government elections, but they could improve representation where alternatives are even less reliable, such as widely-distributed organizations.

[5] All software has bugs, and authentication technologies do not prevent them from being exploited.

[6] Unlike proprietary code which can do obfuscated self-checks, GPL code cannot even do this due to source availability, so verification may be an answer.

[7] In this case, the individual user might want to subvert the election, so we cannot predicate the exception on their interest.

[/IP] permanent link

Thu, 14 Dec 2006

tcmalloc and the C++ issue

From Michael Still's blog (via Planet Linux Australia) I found a reference to "tcmalloc", a faster-than-glibc-malloc threaded malloc implementation out of Google. Reading the paper, it sounded good.

But the "Caveats" caught my eye, particularly: In particular, at startup TCMalloc allocates approximately 6 MB of memory. It would be easy to roll a specialized version that trades a little bit of speed for more space efficiency.

Why should it use that much memory? Since the arrays involved should be zero until used, surely it wouldn't take a 6MB hit until they were fully utilized?

So, I downloaded it and took a look. Erk. It's in C++! It astounds me that someone would write a low-level library like this in C++. Here's a simple "1-byte malloc then spin" program with glibc:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7078 rusty     21   0  1524  344  272 R 95.3  0.1   0:02.73 sleeptest          

And linked against ctmalloc:

 5946 rusty     24   0  4688 1456  908 R 80.2  0.3   0:02.77 sleeptest          

Now 1.1MB isn't quite the 6MB quoted, but it's not good. Here's what size says:

$ size /usr/local/lib/ 
   text    data     bss     dec     hex filename
 125231    1840  267624  394695   605c7 /usr/local/lib/

125k of text seems excessive to me, but the 267k of BSS caught my eye. "static TCMalloc_Central_FreeListPadded central_cache[kNumClasses];" is 250k (kNumClasses is 170). In addition, initializing the library causes 4 128k allocs and a single 1Mb alloc. I'm guessing at least some of these are being helpfully initialized by C++ constructors, using up memory even if it's initialized to zero.

But because it was C++, my tolerance limit was reached and I stopped poking. My initial casual thought of integrating talloc has faded. But then, Andrew Tridgell once said "you never need a reason to rewrite something"!

[/tech] permanent link

Wed, 13 Dec 2006

lhype progress and virtbench

So I've continued hacking on lhype, interrupted by my move out to the farm. James Morris (who insists on calling it "The Rustyvisor") wrote a quickstart: I'm offline now, but "jamesm rustyvisor" will probably get you there. Now that the paravirt_ops patch is in Linus' tree, it's basically a single (fairly big) patch.

I want to improve lhype, but I decided to do it in a methodical fashion: benchmarking. To that end, I've started on a set of hypervisor micro-benchmarks called virtbench designed to guide my optimizations. To make this work with lhype, however, I need external networking (lhype only has inter-domain networking at the moment), so that's what I'm coding now.

[/tech] permanent link

The Joy of Binary Compatibility: glibc and vdso

One of the early problems Xen hit, and we had to deal with cleanly with paravirt_ops, is that the hypervisor wants to be mapped at the top of virtual memory, but the vsyscall dynamic shared object (vdso) is usually mapped there, right up at 0xfffff000. The vdso is a page exposed to userspace which glibc will jump into to make syscalls. This allows the kernel to decide whether to use traditional "int $80" or the more recent "sysenter" instruction for syscalls, plus provides a fixed place to return in the "sysenter" case.

There were several patches floating around to make this location dynamic, which is fine except for one small problem: moving it breaks some early vdso-enabled glibc versions, which exit with an assertion. This problem surfaced with Andrew Morton's Fedora Core 1 box, but since that's so old, the consensus (with which I disagreed) was to add a CONFIG_VDSO_COMPAT which mapped the vdso at the old address, and disallowed CONFIG_PARAVIRT.

Unfortunately, Andi Kleen tested on SuSE 9.0, which has the same issue. He solved it by turning vdso off by default when CONFIG_PARAVIRT was enabled (it can be turned on by vdso=1 on the boot commandline). This is pretty unsatisfactory given that the bug would effect only a tiny and diminishing number of people, and disabling vdso slows everyone down by default.

This came to a head on the mailing list, so I decided to look at more hacky solutions. My first thought was that if init crashes without doing anything, we should disable vdso and exec it again. Unfortunately, assert() calls abort(), which calls kill() to send itself a SIGABRT, but the kernel has special code to make sure that init never receives a signal it can't handle and refuses to deliver it. The result is that init hangs instead of crashing.

So, first I looked at the signal handling, and produced this patch for the generic signal.c:

-               if (current == child_reaper(current))
+               if (current == child_reaper(current)) {
+                       /* Gross hack: Old glibc asserts, not
+                          liking moved vdso (SuSE 9, FC1) */
+                       if (signr == SIGABRT && list_empty(¤t->children)) {
+                               signr = -1;
+                               break;
+                       }
+               }
This function never returned a negative number before, so the caller knows what happend:
+       else if (signr == -1) {
+               void reexec_init(void);
+               static int reexec_done;
+               if (!reexec_done++) {
+                       printk("COMPAT_VDSO: old glibc?  Disabling vdso\n");
+                       vdso_enabled = 0;
+                       reexec_init();
+               }
+       }

I sent it out without testing, since I don't have either of the failing systems, but since I know the code is ugly and noone will be inclined to fix it for me, I wrote a tiny program called "assert" which simply did "assert(0)", and booted with "init=/assert" (under qemu, of course!).

And, equally of course, it didn't work. Firstly, because init has all those kernel threads as children already, so "list_empty(¤t->children)" is never true. Secondly, because the second exec failed: the kill() path was expecting the signal to kill the process, and so it set "current->signal->group_stop_count" to 1, which indicates the entire thread group should be stopped. Next time through the signal-checking path (and it gets called immediately on return to userspace), init hangs. Which indicates the code to "protect" init from bad signals hasn't been tested in a while.

Once I got this working and sent out, I didn't like the patch. The worst part is that it touches generic code for an ugly x86-specific issue. So I decided to tackle the other end: what if I intercepted the kill() call, and if it's init sending a SIGABRT to itself, re-exec instead? I did this by putting "sys_check_init_abort_kill()" in the place of the generic "sys_kill" in the system call table: the longer version simply checks for our special case and tries to re-exec. It's still ugly, but it's clear.

[/tech] permanent link

Farm Life and The Hacking Environment

So, due to my beautiful wife's horsey nature, we have moved out to a 100 acre farm in the parish of Ballalaba, 8km drive from Major's Creek, 25km drive from Braidwood or 100km from Canberra.

I took two weeks off for the move (was going to be three, but we moved later than expected). Plus time to set up the server (more RAM ordered: 256M is not enough), home network, etc. There's still a lot to be done, in particular we're going to knock out a wall between two rooms upstairs to form a large study. My aim is to have a classic study: bookshelves along the walls, large desk down one end and leather chairs down the other. If I can get a leather-bound edition of the C standard, it'd fit perfectly! Geeks understand immediately what I'm after: David Gibson pointed out that I'll need crossed swords above the fireplace for when my archenemy challenges me to a duel...

Since I'll be spending about 4 days a week in here, I want it to be the perfect hacking environment.

But like everything else in this place, it's a work in progress, and progress out here can be pretty slow. Here's a short list of things which are pending:

  • We're going to see a couple of Labrador pups next week. Alli want to name them Snugglepot and Cuddlepie ("Pot" & "Pie"), I want "C" and "Assembler" (or "inc" and "dec"?)
  • Curtains for the master bedroom, guest bedroom and study are on order to replace the duna covers strung up at the moment.
  • The arena construction is beginning next week: would have been this week but for the rumour of rain.
  • Waiting on the plans for the guesthouse to submit to council.
  • Waiting for rain to compact the soil in the first veggie patch.
  • Waiting for sattelite internet install, and meanwhile waiting for packets to trickle through the modem.
  • Waiting for neighbour to help with repairing the fence where his cows got through.
  • Waiting for builder to get time to knock the study wall down.
  • Waiting for planting season to plant a vast array of trees. This delay is actually useful: I can figure out what I like that grows here.

Still, with all the walking and repairing and no TV, I'm getting fitter. To prevent this from getting out of hand, Alli and I bought ourselves a Nintendo Wii as an early Christmas present.

[/self] permanent link

Sat, 11 Nov 2006

Please Send Them to Linux.Conf.Au

There was an insane surprise idea floating around during Dunedin for LCA07, but now it seems like it's not going to happen. It was probably just too crazy, but it would have made people go "wow".

So while I can't promise that people will be talking about LCA07 in a decade, it's exciting to see fervour and community spirit like!

[/tech] permanent link

Tue, 24 Oct 2006

Kernel coding irritations

The Linux kernel is one of the nicer pieces of complex code I've worked on, but it's not without ugliness. Consider this function from mm/memory.c (2.6.19-rc2):

int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, int len, int write, int force,
		struct page **pages, struct vm_area_struct **vmas);

This function get references to a number of user pages from address "start". Features are as follows:

  • The "mm" arg is always "tsk->mm" in the caller, in fact, it doesn't make sense for them to be different. (Actually, it's not clear that fs/aio.c does this correctly).
  • The "tsk" arg is always "current", the (global) macro identifying the current task. AFAICT it will work with other tasks, so it's not a fundamental constraint. [This is crap: DaveM points out that ptrace is the obvious case where tsk != current. See mm/memory.c's access_process_vm().]
  • It returns the number of pages it got, or -EFAULT, or -ENOMEM. This makes for some interesting error handling in the caller: if we run out of memory or hit a bad address after getting N pages, we return N. The caller presumably notes that not all the pages were retrieved, and unmaps the first N pages.
  • "len", unlike the name suggests, is actually the number of pages, not the length in bytes. This one bit me.
  • write and force are booleans, but the kernel doesn't use "bool". I've come to like bool for its documentation value.
  • Naturally, there's no documentation on this function.

[/tech] permanent link

Wed, 18 Oct 2006

A Small Trick: Avoiding missing configuration includes

The kernel used to have a problem that people would use the CONFIG_ pre-processor variables but not include the "linux/config.h" header, resulting in strange bugs as some files were built with the option, and some without.

 some expression

This was solved in the kernel by putting "-include linux/config.h" on the gcc command line to always ensure its inclusion. But an alternative is to change to using #if instead:

 some expression

This does the same thing (undefined symbols in CPP are equivalent to 0), but now you can define every switch to 0 or 1 and use -Wundef: GCC will warn if the option is undefined.

[/tech] permanent link paper, and requests for comments...

I'm excited about my LCA paper: "Writing an x86 hypervisor: all the cool kids are doing it!". It's going to be an ensemble piece, which is always a bit risky, but I managed to convince Chris Wright (RedHat), Jeremy Fitzhardinge (XenSource) and Zach Amsden (VMWare) to copresent with me. We are the four people doing the hard yards of getting the paravirtualization infrastructure into Linux (and adapting our various products to it), so it doesn't get more expert than this. We've just got to pack it all into 45 minutes.

I'm also co-presenting a "First-timer's Introduction to LCA" with jarich, hopefully on the Sunday evening. This is important, because I worry that LCA could become "regulars" vs "others" (and the next step after "cliquey" is usually "sucky"). I'm spending some idle cycles thinking about what should be in it, but it's hard for me to see things from a newcomer's POV. If anyone wants to mail me with "I wish I had known..." tips, please do!

I had submitted a tutorial proposal as well: "Hacking in groups". The idea is to get 20 or so really top coders (speakers are good candidates) and set them up with five attendees to code on some small project for three hours. The lead coder will have done it before: the idea is they present the problem, let people code for a while to some milestone, then show their solution, then let them run, with a wrap-up at the end. It won't teach coding, but you will get exposure to an experienced coder's tricks and techniques.

There are two problems, though: it needs 20 talented coders, and it only caters for 100 attendees max. The paper ctte rejected it as a tutorial and suggested running it on Saturday. Now I'm rethinking: it's a lot of work, for me and for the coders, and I'm not sure how well it will work. Do people want this?


[/tech] permanent link

Sat, 14 Oct 2006

Lhype: The Little Hypervisor That Couldn't

Late this week I finally got lhype to boot to the shell prompt. Now I have a bundle of more-fixmes-than code; here is the exchange:
From: 	Rusty Russell
To: 	Jeremy Fitzhardinge
Subject: Re: Lhype now boots!
Date: 	Fri, 13 Oct 2006 11:49:37 +1000

On Thu, 2006-10-12 at 11:13 -0700, Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> > Current lack of features include:
> > 1) Console doesn't actually allow input.
> > 2) All pte ops flush entire shadow page tables, every time.
> > 3) Guest pages are locked into host ram.
> > 4) Host clocksource and other useful things need to be implemented.
> > 5) Code not SMP safe: host must be UP.
> > 6) No devices supported.
> >   
> So what can you do?

Do?  DO?  What can you DO?

Watch the cpu usage of kernel thread representing the host domain as it
boots!  Gaze at the paravirtualized wonder of the sash prompt in all its
glory!  Hit ^C and watch the domain die! 

Get away from me with your "feature creep" attitude!
Help! Save Australia from the worst of the DMCA:

[/tech] permanent link

Mon, 09 Oct 2006

Day of LCA Paper Reviews

Went to Sydney to run through the paper and tutorial selections for There was a great deal of blood on the floor by the time we finished: there were only about 30-40 clearly-rejectable papers, and in the end there were at least 4 talks I wanted to see which got cut. Some of the committee saw their own talks rejected, and my tutorial proposal was bumped off to Saturday.

The process wasn't flawless, but the committee were dedicated and generally calm. Their efforts will be more deeply appreciated when people attend the best and deepest talks yet. Leave your laptop behind to avoid missing any talks.

After my previous rant after reviewing some papers by people who didn't seem to understand how high the paper standard for LCA has become, Mary Gardiner posted an excellent guide to what separated the accepted papers from the majority on her blog.

[/self] permanent link

Wed, 04 Oct 2006 submissions, rejected

I know, in the 250 abstracts I'm judging for, I'm throwing out some pearls among the sand. And I'm probably unfairly biassing against people, but dammit, the number of "I am a member of an open source project so let me tell you how to code/run a project/create a community" papers I've seen is astounding.

Firstly, being a member of an OSS project describes the majority of our attendees: it's about the minimum standard for a speaker. If you expect over a hundred of them to listen to you, you've got to be in that top 1%, and you've got to convince me of it! So if your project hasn't done something significant or astonishing, don't waste my time submitting a talk about it. And if you weren't the key person doing that astonishing thing, ditto: unless you're a known awesome speaker, I'm going to rate you low. And if I can't tell if it was you who did the work, I'll rate you low, too. If you didn't tell me, sorry, no time if google didn't take me straight there...

Judging technical abstracts can be hard: there are so many Open Source projects! (Are our attendees really interested in your implementation of Yet Another window manager? Maybe not, but how to write a new window manager, maybe!) But judging non-technical abstracts is almost entirely a judgement on the person presenting. If noone has heard of you, and especially if you sound like a crackpot, sorry.

So, while you're best known for your implementation of a video codec in 23 instructions, you really feel you have a lot to say on Eastern philosophy and its influence on your post-Raymondian Free/Unlocked/Cooperative/Knowledge-based software design. I'm sorry, I can't judge whether your presentation will suck, so given the number of papers, well, perhaps you should just put it up on your blog. I'll read it, I promise. (And why didn't you submit a paper about that damn video codec?)

Flowchart for "should I submit an abstract to LCA":
(1) Have you done something related to F/OSS you're excited about? If so, goto (3).
(2) Is there a great demand for information on some subject on which you are a leading, recognised expert? If yes, please submit an abstract[1]. If no, please do not.
(3) When you describe/show this to collegues, are they interested? If no, please do not submit.
(4) Is what you've done expected to be widely used? If yes, please submit[2].
(5) Is what you've done original and useful for other F/OSS projects? If yes, please submit[3].
(6) Is what you've done so insanely cool that it makes people say things worth quoting? If yes, please submit[4].
(7) If you reach here, don't invent something to speak about. It'll suck, because like me, you're not inherently interesting.

Finally, be aware that preparing a talk is a significant amount of work. You need to prepare and weed the material, cast it into a coherent presentation, and practice several times (on real people if you're not an experienced conference presenter). This takes a good couple of days' time. And never do a tutorial: it takes much longer to prepare and needs far more practice, since you need to time and test the user interaction as well.

Bad speakers happen, but in the past we have had some speakers who didn't prepare. This is unforgivable, and I always oppose accepting proposals from those speakers again: our attendees deserve (and demand) better. The miniconfs or BOFs are the place for ad-hoc presentations.

[1] This covers "expert" talks like licencing talks, etc.

[2] " development" vs "libmeanwhile development"

[3] "A new typechecking tool" vs "Haskall type experiments"

[4] "Build your own satellite" vs "My First Gnome Applet"

[/tech] permanent link

Tue, 26 Sep 2006


I hadn't been following the GPLv3 drafting process, since both Andrew Tridgell and Harald Welte are involved. Imagine my surprise to read that kernel developers are standing up against the GPLv3 threat!.

Turns out the GPLv3 is unnecessary! These crack coders have scrutinized GPLv2 and found no problems with it! NONE!

WORSE, the GPLv3 is going to add ONEROUS USE RESTRICTIONS, and by that, they mean when that USE is DISTRIBUTION. What a SHOCKING change!

Who knew that the The Additional Restrictions clause designed to make GPL compatible with other licenses and cure existing fragmentation is actually a FRAGMENTATION "NIGHTMARE"!


I weep for my friends, Tridge and Harald, who cannot see these problems: they have clearly been corrupted by exposure to the FSFs diabolically open GPLv3 drafting process! I went to the draft and I couldn't see it either: DAMN FSF MIND RAYS!!!! LUCKY those kernel developers stayed right away from it, so they can tell us the TRUTH, and ensure that the RADICAL FSF LANGUAGE in the preamble "if you distribute copies ... you must give the recipients all the rights that you have" is never ACTUALLY ENFORCED by the license!

[/tech] permanent link

Thu, 21 Sep 2006

Accessing stack-relative variables vs. using segment registers

So Jeremy Fitzhardinge recently produces a series of patches to use the "gs" segement register within the Linux kernel for per-cpu data. I generalized these patches to combine them with the existing per-cpu infrastructure, but I wondered about whether we should use this method for accessing the current task and current CPU. In the case of CPU number, the core code expects it attached to the thread info anyway, as it accesses it directly to see where a thread last ran, so we'd end up with it in both places.

To explain how things currently work in Linux: each process has a kernel stack. On x86, this is 8k, aligned on an 8k boundary (there's a 4k stack config option, but the idea is the same). At the top of this stack (ie lowest address in this 8k region) is the thread structure, and in this we keep a pointer to the current task, and the current CPU number. This means that we can access either one quite quickly by rounding down the stack pointer to the nearest 8k boundary and grabbing the entry from the structure. In assembler, this is 3 instructions, eg:

mov    %esp,%eax		# Get stack pointer
and    $0xfffff000,%eax		# Round down (4k in this example)
mov    (%eax),%eax		# Grab value (first element in structure)

This is 9 bytes long. If we use a per-cpu variable and the gs register, it's a single 6-byte instruction:

mov    %gs:0x0,%eax		# Get first value in per-cpu section

This is 6 bytes long, but segment-relative instructions are a little more complicated to decode than the simpler instructions above. So, I wrote two test programs with each of these sequences repeated a million times (to remove icache effects), and timed running that million-sequence 10,000 times, for 10 billion total repeats. We'd expect the gs version to be 33% faster than the stack version, because it's 33% shorter.

Machine 1.7GHz Pentium M (Laptop) 3GHz Pentium 4 3GHz Xeon (5160)
Stack-based value read (nsecs) 10.39 4.06 8.46
gs-based value read (nsec) 6.93 2.70 4.40
Improvement 33% 33% 48%

This kind of single huge instruction stream is great for the Pentium 4, which is why it does so well (OTOH, flushing the pipeline on P4 is painfully slow). Nonetheless, it looks like using gs is a gain. Of course, this effect measured here is likely to be overwhelmed by the benefits of reduced instruction cache footprint.

Stephen Rothwell points out that the former (stack-based) instructions require an arithmetic unit which the gs-based instruction does not, which may allow for better instruction scheduling in real code, too, amplifying the win again.

[/tech] permanent link

Sun, 03 Sep 2006

lhype progress...

In order to flesh out the paravirt interfaces for the Linux kernel, I've started a toy hypervisor called "lhype". This sits entirely within the Linux kernel source tree and will be remarkable mainly for its lack of features. But it's invaluable to have an common example everyone can point at and play with: there are common problems which all hypervisors hit which are best understood by the Linux coders if they can see it in front of them. These range from implementation issues (eg. dealing with "stolen" time), to performance issues (eg. batching of page table uptdates).

And importantly, it'll give me something to present at in 2007.

[/tech] permanent link

Wed, 16 Aug 2006

RedOctane Dance Mat, DDR, pydance and StepMania

When I was in Sydney talking about horrible legislative dangers to Free Software , there was a gaming "sluglets" session, and someone introduced me to PyDance.

Did some research, and ordered the RedOctane dance mat, which arrived while I was away at OLS, and Kelly came over to help Alli set it up. PyDance has some synchronization issues (a killer for this game: it seems to lag slightly, but enough to drop scoring significantly), so we've taken to using the BSD-licensed StepMania. (As an aside, the complete sets of songs and steps from the original arcade "Dance Dance Revolution" series are out there; this is probably a copyright violation, but, like MAME ROMs, a case where enforcement would just punish fans and exacerbate the failure-to-supply/orphan-works problems of modern copyright).

Anyway, the point of this post is to feed the following facts to Google: it turns out that on both Alli's and my laptops (and on Kellys without the power plugged in), the USB dance mat connects and disconnects every few seconds, making it useless. On our desktop machine (a server which doesn't have 3D support, making StepMania unusable) it's fine. With some experimentation, a USB hub (unpowered!) between the laptop and the dance mat, and it works flawlessly. I don't know if this is some device flakiness and/or a Linux/Ubuntu USB bug, but I was cursing Greg K-H for a while there...

[/tech] permanent link

Tue, 15 Aug 2006

pf vs iptables

There's been some coverage on LWN of my netdev musings on netfilter and the Grand Unified Flow Cache. There was a comment on the article along the lines that this would be a good thing if Linux moved closer to OpenBSD's pf, which made me bump "learn more about pf" up my TODO list where it's been lingering for years.

Now, from the glance I'd had at pf before, I already had a fondness for it. It has some minor warts, but it does many things really nicely. Googling for pre-canned comparisons revealed a some BSD types who really hate iptables, no iptables users who really hate pf, but generally few people well-versed in both.

I've always thought of iptables as the assembler language of packet filters (although CISC not RISC). Hence I always write my filters as shell scripts, which provides the variables ('macros' in pf-speak), expansion etc. pf takes a very different route, using a higher-level domain-specific language, and this is reflected in the features offered. "scrub" for example does various cleaning/dropping of packets, all in one command. "keep state" on a rule implies that all reply and related packets should bypass the filtering and be accepted, which sums up the pf approach quite well. They also implement a syn proxy, timestamp randomization, and other features missing in iptables. Their handling of FTP is via a proxy, and this is a rare occasion where their implementation is more awkward than the iptables "modprobe ip_nat_ftp" approach.

By feature count, iptables wins, but that's not really fair: most of these features are extensions which hardly anyone uses, and iptables misses some nice "all-in-one" pf features. Looks like they do NAT by port binding (ie. the way everyone else does), meaning you can't cram as many clients behind a NAT box, but again, most people won't care.

There's one place where pf would make Linux users green with envy, it's rate limiting and queueing. pf isn't exactly a barrel of simplicity here, but it's integrated with the rest of the syntax: I could probably remember how to do it with minimal prompting, rather than having to digest the Advanced Routing HOWTO as I always have to for Linux. (BTW, I'm still waiting for a good gui tool for this: it'd be awesome, because I think of ratelimiting in visual terms already.)

Finally, some advice for people who are translating from one into the other: it's not a trivial conversion. I (roughly) translated the example at the end of the pf FAQ, and then rewrote it how you would really do this in iptables. The NAT stuff is fairly directly mapped (although be aware that iptables filtering is always on "real" addresses, ie. after DNAT/rdr, before SNAT/nat). For filtering, iptables makes you consider the problem in three distinct parts: INPUT (destined for this box), FORWARD (passing through this box) and OUTPUT (generated by this box). Consider the following pf rules:

pass in on $ext_if inet proto tcp from any to ($ext_if) \
   port $tcp_services flags S/SA keep state
pass in on $ext_if inet proto tcp from any to $comp3 port 80 \
    flags S/SA synproxy state

In iptables, these two rules are fundamentally different, and indeed, would unlikely be listed consecutively. The first is an INPUT rule, the second a FORWARD rule:

iptables -A INPUT -i $ext_if -p tcp --dport $tcp_services --syn -j ACCEPT
iptables -A FORWARD -i $ext_if -p tcp -d $comp3 --dport 80 --syn -j ACCEPT

This was one change I made from ipchains, which I feel is a significant improvement over the "by destination IP address" approach, because it matches a fundamental distinction in filtering.

Finally, here is the rough iptables equivalent of the pf example. (Note: no scrubbing, but we do fragment reassembly because we use "-m state" and NAT, either of which requires connection tracking).

[/tech] permanent link

Fri, 11 Aug 2006

GCC versions on x86, and the cost of clobbers

For the paravirt_ops patches, I've been doing some mildly tricky GCC things to binary patch over the indirect calls at runtime.

On x86, to replace, say, 'cli' (which disables interrupts) with an call through a function pointer (at offset PARAVIRT_irq_disable )in the paravirt_ops struct, you need to do:

	call *paravirt_ops+PARAVIRT_irq_disable
However, calls on x86 can overwrite the eax, ecx and edx register, so to be safe for any function, you need to do:
	push %eax
	push %ecx
	push %edx
	call *paravirt_ops+PARAVIRT_irq_disable
	pop %edx
	pop %ecx
	pop %eax
Each of these pushes and pops is a 1 byte instruction on x86.

There is a way, however, to tell gcc that some assembler is going to clobber registers, so if it's clever, it can avoid having to push and pop. I wondered how much more efficient it would be to do this: at worst gcc will always have to push and pop, at best, it would never have to (unlikely as this is on register-starved x86).

In my simple test, I use these kind of calls for four common kernel (inline) functions, raw_local_irq_disable() and raw_local_irq_enable() which have to save three registers, and raw_local_irq_restore() and __raw_local_save_flags() which use %eax and so only have to save two registers. Counting up the calls in my configuration gives 132, 66, 97 and 113, giving 1014 saved registers. When saved with push/pops, we'd expect to see 2028 bytes of bloat (I added -fno-align-functions to the top level Makefile so function alignment wouldn't play a part, but jump and loop alignment still play a part).

Test cases GCC 3.3 GCC 3.4 GCC 4.0 GCC 4.1
Code size (no push/pop, no clobber) 2225653 2209176 2198744 2183453
Push/pop code size addition (bytes) 1553 1631 3129 1584

To discover how effective various gccs are at avoiding register spills (and indirectly get an indication of how good gcc's x86 code generation is), I produced three kernels: one which did no saves or restores of registers at all (baseline, ideal case), one which did all the pushes and pops manually (worst case), and one which used clobbers. The better gcc's code generation is, the closer we'd expect the clobber case to be to the ideal case. I used "size vmlinux" to measure the code size. We can use the actual code increase from push/pop (that theoretical 2028 bytes) to take into account other noise effects: this normalized result probably gives a better indication of the differential effect of clobbers vs push/pops.

Test cases GCC 3.3 GCC 3.4 GCC 4.0 GCC 4.1
Clobber code extra (bytes) 625 512 955 577
Cost (bytes) per clobber 0.61 0.50 0.94 0.56
Normalized cost (bytes) per clobber 0.4 0.31 0.31 0.36

[/tech] permanent link

Tue, 08 Aug 2006

Selling our house

Alli and I bought a hundred acres out near Braidwood (Alli and her horses), so we're selling our current place, just as we've made it comfortable. Oh well, lots of interest, so I hope we find someone who'll really enjoy it.
[/self] permanent link

Fri, 28 Jul 2006

paravirt_ops, Xen and VMWare

Xen is a hypervisor: it sits under a (slightly) modified operating system and allows it share the real hardware with other operating system. This is called "paravirtualization", because the OS knows it's not running directly on the hardware, and helps the hypervisor along a little. Naturally, Xen has patches for Linux.

Chris Wright has been doing excellent work (first for OSDL, then Red Hat) laborously preparing long the series of patches to merge the Xen code into the mainline kernel. Then, along came VMWare, with a proposal for an ABI which all Operating Systems and hypervisors could use, called VMI. They use this currently, and they have a version which supports Xen as well. In the wings are other contenders, such as Microsoft and the L4 work.

So, knowing that just about every place in the kernel where we support multiple implementations of the an API at runtime we use an "ops" struct, it makes sense to use it here, and in fact, ppc64 already does this to support the same kernel on native and under a hypervisor. No kernel programmer is likely to be surprised by this approach, aka. "paravirt_ops".

So, with this plan, I agreed to help with the merge. There are some performance issues shown up by lmbench with doing an indirect call instead of (on native) a single instruction. This is solved by extending the infrastructure we already have for binary patching in the kernel, and it turns out that the interrupt operations dominate other paravirtual-sensitive instructions by a couple of orders of magnitude: patching them is sufficient.

The kernel summit helped convince both the Xen and VMWare people that this approach was most likely to be merged, so we have a mercurial patch queue and everyone who wants it has commit access. It's been pretty good for allowing everyone to hack away (after we sorted out some tools issues...).

[/tech] permanent link

Sat, 22 Jul 2006

Xen Paravirtualization Patch fun

Sun, 16 Jul 2006

OzDMCA: Kim Weatherall's Superb Explanation of the Issues

"We don't have to have a draconian law that prevents competition and is anti-consumer. If we end up with such a law, it will, in my view, be a legislative failure on the government's part." Find the whole thing in her blog post .
[/IP] permanent link

Thu, 13 Jul 2006

Fanboying Michael Geist!

The Unlocking IP conference was, as these things are, two days of talking to lawyers. Even when it's interesting, it takes me away from what a should be doing: making new and better software.

But I got to meet Michael Geist, who has had such a pervasive and positive effect on copyright issues in Canada. More people like him would be good.

[/IP] permanent link

Fri, 30 Jun 2006

Help! I Own My Music, DVDs

The music and movie industry keep trying to take away our rights to enjoy the music and movies we bought. That's stealing, and doing it through legislative means makes it worse, not better.

Janet Hawkin's Beware Digital Rights Traps image, Spread the word...

[/IP] permanent link

Wed, 28 Jun 2006

Circumvention visits: SLUG, HUMBUG and LUV

OK, so on Friday 30-06-2006 I'm in Sydney for a talk at the Sydney Linux User Group meeting, then Saturday 1-07-2006 morning I keep heading north to Brisbane for HUMBUG. Tuesday next week 04-07-2006 I'm in Sydney again to talk to the Australian Consumer Association, and then off to Melbourne in the evening for LUV. Then back in Sydney for the Unlocking IP conference on Monday and Tuesday 10-07-2006/11-07-2006.

The worst thing about the battlefield moving away from technical issues to legal ones is the amount of time my colleagues and I now spend in that territory. The other side has studied their Sun Tzu, attacking our undefended points. It'd be nice if large monopolies fought fair.

After that I'm off to OLS (I tried really hard to skip it this year, but there's a paravirt merging BOF which Andrew Morton wants me at, and he's right). Might actually get some hacking done, between handing out petitions.

[/IP] permanent link

Sun, 18 Jun 2006

I'm away for a week, North Queensland...

No computer or phone for a week. But I leave you with that font of information which is
[/tech] permanent link

Wed, 14 Jun 2006

Q & A on Anti-circumvention issues in Australia and the Petition: 8:30 EST Friday night

More details as available... we're going to have an icecast server you can listen in on, and an IRC channel for questions and chatting. I'll create a short presentation to kick off, then we'll field questions. This is an excellent way to get involved to make sure Free/Open Source software and other competition and consumer issues aren't steamrolled...
[/IP] permanent link

Mon, 12 Jun 2006

Anti-circumvention Linux Australia Podcast: Part 1

I gave an interview to the Linux Australia Podcast on the dangers of the coming "DMCA" implementation in Australia. You can find it 20:39 in if you're impatient. Here is the ogg torrent, mp3 torrent, ogg file and mp3 file.

And so, it begins. My phone number is 0417 451212.

[/IP] permanent link

Tue, 06 Jun 2006

Binary Searches and Mergesorts Still Broken?

Via Miguel de Icaza's blog I read a Google researcher saying Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken .

The problem is an overflow in adding the high and low indices to calculate the midpoint, apparently widely implemented as in Programming Pearls . I'm not a Java person, but fortunately he has a C fix, too.

This is classic programmer behavior: find a problem, patch it, declare it fixed. But the root of this problem has been missed, with two consequences: (1) people reading this are mislead as to the cause of this problem, and (2) the problem immediately behind it still lurks, at least in C.[1]

The suggested C fix is: "mid = ((unsigned) (low + high)) >> 1;". The better fix is to realize that valid indices into a C arraythis array are always unsigned by definition, and declare "low" and "high" as "unsigned int" in the first place. This means the empty array case needs to be handled, and a new "magic not-found return value" needs to be chosen. But generally, life's far simpler when you say what you mean.

Now the next problem becomes apparent: we've just shifted the overflow from 31 bits to 32 bits, giving ourselves a bugfix which will last ~10 months. low=1,200,000,000 high=1,200,000,100 works now, but low=2,400,000,000 high=2,400,000,100 still doesn't. There are several fixes for this, mine is here.

[1] For all I know, Java can't handle arrays of length > 31 bit, making this overkill, but C certainly can.

[/tech] permanent link

Mon, 22 May 2006

Ubuntu "Dapper Drake"

I'm running Ubuntu Dapper Drake beta on my laptop. Evolution still sucks: ended up moving my old mailbox out the way and re-importing small parts of it. It was running out of file descriptors, crashing, freezing. Now I just get lots of "can't sync" messages, dups in mailbox, and had to grab & hack a dup-removal plugin.

Two things have impressed me about the latest Ubuntu release. Neither of these are earth-shattering to anyone else, I'm sure. The first was gnome-xchat has a cool "notify" feature which puts a speech balloon in the corner of the screen when someone talks to me and I'm on a different desktop. Secondly was that (for the first time) I clicked on a bittorrent link to download a movie (Elephants Dream, well-rendered but odd), and it Just Worked.

Overall, it's coming along very nicely. Great work!

[/tech] permanent link

Thu, 18 May 2006

Van Jacobson Stole Those Channel Ideas!

After Van Jacobson's talk, some IBM Research people complained to me about all the fuss: this was done years ago by the Exokernel crowd (see The Sydney LCA organizers said to me "this was described at LCA in Perth by Peter Chubb" (see This one was a bit of a miss: VJ didn't want to move drivers out into userspace. Now, we get a post to lkml on Lazy Receive Processing (see

Lockless datastructures aren't new. Moving stuff out to usermode isn't new (some of the LRP ideas are already in Linux). Zero-copy to userspace isn't new (I remember seeing a student project doing this on Linux on my first trip to Bangalore, called "zbufs" IIRC). So why are we suddenly now excited about these ideas? Is David S. Miller just slow?

There are two basic reasons. The weaker, more prosaic one, is that Van had an implementation for Linux, and more importantly, a step-by-step plan for how to introduce this into Linux showing improvements at each point. This isn't the kind of stuff academics write papers about, but it's critical if you're trying to change an OS in widespread use. This lack of revolution was a strong argument. It wasn't "if you start with our wierdo OS, you can make it fast by doing X", nor "if you rewrite all your applications to use our wierdo API, you can make it fast" (although that was the part many people concentrated on: the final stage where VJ exposed the API to userspace).

The greater, more profound reason was clearer if you were at the talk. This wasn't about a performance tweak, not about cool datastructures, and not about zero-copy (which it isn't). This was about fixing a longstanding design kludge in TCP/IP: the acknowledgements done by OS on behalf of the application, in violation of the end-to-end principle. This, in turn, requires window information in every packet "Ack, but not really". Without this insight, you might be doing the right things, but for the wrong reasons: instead of a Grand Design, you have one or more Performance Hacks.

Just to nail this, let me quote the LRP paper (which I recommend reading):

The main difference between UDP and TCP processing in the LRP architecture is that receiver processing cannot be performed only in the context of a receive system call, due to the semantics of TCP. Because TCP is flow controlled, transmission of data is paced by the receiver via acknowledgments. Achieving high network utilization and throughput requires timely processing of incoming acknowledgments. If receiver processing were performed only in the context of receive system calls, then at most one TCP congestion window of data could be transmitted between successive receive system calls, resulting in poor performance for many applications.

Which is exactly the opposite of what VJ is saying.

And the Exokernel paper, also very interesting reading:

The real power of application-level networking is that it allows networking
software to be specialized for and integrated with important applications.

Which might be true (and I think it will prove true for message-passing a-la RDMA) but isn't the point which has got everyone excited today.

[/tech] permanent link

Credit Card Trap

Alli was two days late paying our VISA bill two bills ago. Sure enough, our last bill arrived with a $77 charge: having overrun the interest-free period, we got charged the full amount, of course.

But Ben Elliston pointed out something I hadn't fully grasped: if you keep using the credit card, you'll never get your interest-free period back. First the billing cycle ends, then bill arrives and you pay it, but if a charge comes in between those two dates, you never actually reach 0 balance, and hence your interest free period is never reset. You're on the treadmill.

I assumed I'd have to push it into credit to stop the cycle. I called ANZ (ie. India, which was perfectly fine and fluent) to explain we'd been late with a payment and ask if we were still being charged interest. Conversation went roughtly "Yes, and I'll just revert that now." "So, how much do I have to pay to put it into credit and clear it?" "No, I'm just removing the interest now" "So I won't ever get another interest charge?" "That is correct."

Obviously, the call center works from a flowchart. Equally obviously, they're used to irate customers who've received their second interest bill after having paid off "in full". But if possible, that just makes this whole trap seem even more slimy.

[/self] permanent link

Thu, 04 May 2006

Benkler "The Wealth of Networks"

I wanted to read the book in printed form, but it's still listed as pre-order on Amazon. So, I'm wading through a printout of the PDF, smelling of freshly-slaughtered rainforest. Here's a taste:

Many of these initial statements or inquiries die because the community finds them uninteresting or fruitless. Some reach greater salience, and are distributed through the high-visibility sites throughout the community of interest. Issues that in this form reached political salience became topics of conversation and commentary across the divide. This is certainly consistent with both the BoycottSBG and Diebold stories, where we saw a significant early working out of strategies and observations before the criticism reached genuine political salience.

I'm around page 280 of 520. I hope Larry is wrong that "you are not serious about these issues --- on either side of these debates --- unless you have read this book". That's a high bar: this book an endurance test akin to reading a lecture series. He warms up on the examples, but the rest is dry sandpaper to my personal writing-style sensibilities.

This book provides useful academic underpinning for pro-network Internet policy. But the word "salience" is really starting to grate.

(Just for contrast, here's how I would have preferred to read this, although I'd probably cut deeper into the whole para):

Although many initial statements or inquiries die as uninteresting or fruitless, some resonate through core sites in the community of interest to then become topics of conversation across the political divide. In both BoycottSBG and Diebold, strategies and observations were refined early on before the criticism reached genuine political prominence.

[/IP] permanent link

Sun, 30 Apr 2006

ccontrol, interactivity, load and thrashing

Found a problem with ccontrol, while building git: it forks 120 asciidoc instances, which then drive my machine into the ground.

This lead to experimenting with the ccontrol locking, and reinstating an older version of the code which used to timeout locks, and then retry another lock. This is useful to avoid getting stuck behind very slow compiles, or even a suspended compile. Of course, the lock is there to avoid us running hundreds of compiles at once: we don't want to ignore it if the machine is actually at capacity. Previously I've discovered that getloadavg is completely useless for this kind of self-limiting.

So, I did some tests: by doing gettimeofday() before and after the sleep, we can quite accurately detect the interactive response of the machine: if it's not good, we're stressed and shouldn't assume that whoever is holding the lock is sleeping.

Which got me to thinking: why don't we do this all the time? If we always started with one lock, we could then decide to completely ignore the lock if we timeout and the machine isn't stressed: this would self-adjust to an optimal (high but not thrashing) load. This would remove a configuration option from ccontrol: you currently tell it how many CPUs you have. This method would let us automatically expand to fill any machine!

Well, it works for a while. At some stage a bunch of ccontrols decide to go lockless, and we jump to doing 5 compiles at once. At this point, we become I/O bound, so the CPU is fairly idle: we're not thrashing just yet, and our interactivity is good. Another 10 ccontrols think we're not stressed and go lockless, and suddenly we're thrashing to death...

After several painful experiments, tried only going lockless if the load average is low, which reflects the problem of I/O activity. This works better, but is a delicate balance: I think I'll stick to backing off onto a new lock, not abandoning locking altogether when ccontrol thinks we're not stressed...

[/tech] permanent link

Wed, 26 Apr 2006

Falling Deeper in Love With Mercurial: 0.8.1

Mercurial 0.8.1 has my two main gripes: you can now push and pull partial trees (effectively giving you a way of expunging revisions), and "hg revert" doesn't delete a file you previously "hg add"ed. I filled in the User Survey too.

Some gripes remain, as with any software:

  • Get rid of "forget": "revert" now does the job of removing an add correctly. While forget could be useful, it's a wart.
  • Get rid of "undo": only undoes some things. Not all, and it doesn't tell you what it undoes. I screwed my repo pretty badly with this once, now I pretend it doesn't exist.
  • Get rid of "unbundle" and make "pull" work from a bundle.
  • Streamline "old-http": there's no fundamental reason why this should suck, with partial http GETs (the format is append-only). This is what currently prevents hosting mercurial repos on SourceForge, for example.
  • Then get rid of "old-http" and simply have the "http:" method auto-detect.

After 1.0, I'd love to see the following

  • Proper support for merge across rename.
  • I'd love to see hg become a universal client, so I don't *have* to deal with git or svn. "git+http://" and "svn+http://" methods which hid the differences would be wonderful. And people who suggest tailor obviously haven't used it.

Of course, mercurial being in Python, raises a serious barrier to me hacking on it. It's far more efficient for me to leave it to someone who groks Python. Mind you, I don't have the antipathy for Python that Keith Packard has...

[/tech] permanent link

Mon, 17 Apr 2006

The Fun of Floating Point

Kernel people don't use floating point math much. I was once taken to task by Stephen Tweedie for asserting that FP is only used by people who don't care about the answers. Wesnoth avoids floating point math for a different reason: platform consistency. You really want units attacking power (ie. damage) to be equal on all platforms, and it's always expressed in terms of round numbers. So if your sword does 6 damage, but your opponent received 125% damage against blades and it's nighttime (-25%), the answer had better come out to be 6 on all platforms. Of course, calculations can be more complicated with multiple factors.

Even more importantly than not having an easier time on some strange architecture, if you're having a multiplayer game, you'd better all agree on what happens. So Wesnoth uses integer ops and open-multiplies them by 10000, etc. It's icky, but cleaning it up might introduce gratuitous incompatibilities. It also uses a new rounding mode on me: "round back to base value" in the case of rounding 0.5: so 6 + 0.5 => 6, and 6 - 0.5 => 6. Kinda ugly, but it's the same everywhere.

My attack_prediction code (which evaluates the likely damage distribution for an attack) uses doubles, because it's slightly faster in my benchmarks than 16-bit fixed point (as would be expected for a modern machine like my Pentium M laptop). Although it's possible that the AI could in future use this to make a different decision on some platforms, the current code uses random trials anyway, and a little unpredictability in the AI is a feature, anyway.

[/tech] permanent link

Thu, 13 Apr 2006

Kimberlee Weatherall: Market Failure?

For those who have just gained internet access, Kimberlee Weatherall is one of the leading online-aware IP lawyers in Australia, and the person most widely quoted outside legal circles on Australian IP.

She's a lecturer at Melbourne Uni, she's witty and approachable, and above all she's not smarmy (an achievement for any lawyer). And here's a not-particularly-great picture.

I finally figured out what disturbs me most about Kim: her continued singleness is undermining my faith in the free market. She's not been battered by the ugly stick, failed any major personal hygiene classes, suffering from Tourette's Syndrome, or secretly a man (I asked). Now, I'm no economist, but before we seek government intervention (and irritatingly, there's nowhere on the UN website to apply for UNO Special Days), perhaps we should ensure the market is receiving the signals it needs.

If only there were some way of spreading the word to the single men of Melbourne...

[/IP] permanent link

Tue, 11 Apr 2006

More on Wesnoth

Wesnoth's other two features which stand out to a developer are that it's cross-platform, and leaderless.

Cross-plaform: using SDL makes the Windows-vs-MacOS-vs-Linux issue far less important code-wise, but it turns out that some Windows developers are still using MSVC++ 5.0, which is too old to implement the C++ standard. Well, g++2.95 has issues there, too, but the answer is to upgrade. But the Windows folks point out, that's a $1500 cost for the new version. It's been so long since I had to buy a compiler, I hadn't realized. Ouch!

Leaderless: David White lead Wesnoth up until the release of 1.0, but has since been taking a back seat. The rate of minor changes has picked up significantly, but without a single guiding purpose of a release, or a new vision, there hasn't been a bold new direction. Now, it may well be that polishing and minor enhancements are what the projects need for a while: the new music and images are new scenarios are flowing in. But eventually, someone will come along with a really fresh idea for addition: without someone to lead, will it just die on the vine?

[/tech] permanent link

Sun, 09 Apr 2006

Almost 100 Things to Do in Adelaide

Friends moved to Adelaide: she's an old Adelaide person, but he's from Melbourne. Having grown up in Adelaide, Alli and I thought we'd ask around, to supply a list of things that add up to the Adelaidian experience. We sent it to them as a belated Christmas present, and as I promised various people, here it is:

Adelaide is a quiet town, so many places are known only to locals. We asked our friends what they like to do, and what they think newcomers should do to get a feel for what Adelaide is really like. Assuming you'd do these things in pairs, we put prices (sometimes guesses) for two, except where noted.


Balfour's Custard Tart (Alli's favourite) [one] $4
Eat Samboy BBQ Chips (Alli's other favourite) [one] $2
Try Farmer's Union Iced Coffee (Alli again) [one] $2
Try Coopers Vintage Ale [one bottle] $8
Eat a Balfours Froggie Cake [one] $3
Try Menz Fruichocs [one] $3
Drink Woodies lemonade $5

Walking Distance within City:

Feed the ducks on the Torrens behind Festival Theater
Carols by Candlelight
Adelaide Christmas Pageant
The Adelaide Fringe Festival (March)
Catch a train to Outer Harbour $7
Buy a Pie Floater from the Pie Cart [one] $5
Go on Paddle Boats on the Torrens $10
Catch Popeye to the Zoo (return) $12
Catch the O'Bahn $13.20
Look around the Central Market on Friday night
Eat at the smallest restaurant you can find in Chinatown $20
Have a coffee in the Hilton foyer $7
Catch Tram to Glenelg $13.20
Ride Glass Elevator in Rundle Mall
Have a Hot Roast Turkey, Cranberry Sauce, Lettuce and Brie Baguette from Charlie's Sandwitch downstairs in the Myer Center [one] $8
Visit the Adelaide Art Gallery
Visit the Adelaide Museum
Go to a lunchtime/weekend concert in Elder Hall $0-10
Eat Yiros on Rundle Street $12
Coffee on Rundle Street at night $6
Have a drink at the very chilled out Supermild (Hindley St) $10
Have a drink at the Historian Hotel $8
Wander into abandoned offices along North Terrace (KW to Pultney Streets)

Less than 5kms from City:

Adelaide Zoo $32
Take Botanic Gardens guided walk (10:30am any day)
Bicentennial conservatory in Botannic Gardens $8.60
Moonlight Cinema at Botanic Gardens $22
BBQ in Bonython Park
Play par 3 golf along the Torrens $22.80
Go to Skyshow (Australia Day)
West End Xmas Lights
North Adelaide Pubs
Have dessert/coffee at the Elephant Walk $12
Get a hamburger from the North Adelaide Burger bar $6
Coffee in North Adelaide $8
Swim and have a sauna at North Adelaide Swimming Center $11.40
St Peter's Cathedral
View Adelaide from Montefiore Hill (North Adelaide)
BBQ in Rymill Park
Have some arancini at Carnevale the Italian Festival (11/12 Feb Rymill Park)
Tour Japanese Himeji Gardens in South Park Lands (12-14:00 weekdays)
Look for a Rosella, Cockatoo and maybe Kookaburra in the South Park Lands
Buy a dagwood dog at Adelaide Show [one] $7
Look at the houses on Victoria Avenue
Haighs Factory Tour
Have dessert/coffee at Spats $15
Visit the all night Villi's Cafe (Mile End)
Try the Combination Sar Hor Fun at Penang Hawker's Corner $8
Catch a gig at The Gov (Governor Hindmarsh) $20-30
Go to the Glendi (Greek) festival and eat dolmades $20
Ride a bike along the River Torrens


Go to a Crows/Port match at AAMI Stadium $44
Visit the BrickWorks markets
Play Zone 3 $18
Coffee on Norwood Parade $8
Go to a Norwood/Port SANFL game at Norwood Oval $20

Outer suburbs:

Windy Point at night
Lighthouse at Semaphore
Lobethol Xmas Lights
Tourist Drive through Adelaide Hills
Eat icecream at Glenelg $8
Walk the length of Jetty Road, Glenelg
Go to Maslin's Beach
Get up early and go to the fish markets at Port Adelaide (7:00-12:00)
Walk along the beach & jetty during sunset at Henley Beach
Walk in Belair National Park
Visit Mount Lofty Gardens
Visit Waterfall Gully
The 4hr hike at Parra Wirra National Park $8
Cuddle a koala at Cleland National Park

Out of Adelaide:

Drive through the Barossa Valley (Nuriootpa etc.)
Winetasting in Barossa Valley
Sailing or camping & birdwatching on the Coorong $5/night
Whale Watching at Victor Harbour
Climb The Bluff at Victor Harbour
Find a penguin and a wallaby Granite Island
Visit Hahndorf
Go to Greenhill's Adventure Park $40
Walk part of the Heysen Trail
Visit the main street of Gawler, Australia's biggest country town
Spend a day driving through the Barossa (Nuriopta, Gawler etc)

[/self] permanent link

Fri, 07 Apr 2006

Wesnoth Fun and Games

Some months ago I gained commit access for Wesnoth, a GPL game. A change for me from several perspectives: it's in C++ (and it's been about 10 years since I did anything in C++), it's cross-platform, and it has recently "lost" its project leader.

C++: Never a language I liked, not because it doesn't have features, but because it rates so highly in my "how many traps does it have?" measure of languages. However, one thing has changed since I dealt with C++ last: the Standard Template Library has become ubiquitous. It's fairly well designed, but more important than any particular feature, it provides a known set of primitives: standards a new coder will understand. This lowers the barriers to new hackers.

[/tech] permanent link

Freeloading? Sorry?

I was a little disturbed in a recent meeting with the Attorney General's Dept. where it was suggested that implementing your own compatible software (rather than licensing from an existing product holder) was "freeloading". Those in the software industry reading this will be thinking "what??", but it does illustrate the understanding gap when it comes to IT, and OSS in particular.

This seems to be a subset of a larger conceptual problem. Most of us see a competitive marketplace as the norm, and copyright as a (useful) exception to that norm. Those who spend their days dealing in copyright sometimes view a pervasive monopoly as a natural entitlement: at the extreme all unauthorized acts (especially competition!) is immoral and probably should be illegal. Blockbuster should be asking (ie. paying) copyright holders for permission to hire out DVDs, Sanity should get permission for putting headphones for customers in stores, VCRs should not record, and MP3 players should be locked down to prevent CDs being copied to them. And, of course, it should be illegal to produce your own DVD playing software.

[/IP] permanent link

Fri, 10 Mar 2006 Wireless Network Anti-spoofing

One trick I learned at LCA this year was explained by the guys who set up the wireless network. They blocked all ARP packets, and served all ARP replies themselves, based on the DHCP leases file. This makes it much harder to pretend to be another machine on the network, as you cannot lie about your ARP (you can still set your MAC address to someone else's, and fight with them). You can still be an Access Point and "serve" people yourself, but it's a start.
[/tech] permanent link

Fri, 17 Feb 2006

Talloc: A Rounder Wheel?

Brian Aker made reference to talloc in his blog, which I've been thinking about a little. He notes (as I brushed over in my talk) that writing your own pool-based memory allocator is a well-worn sport: I knew Apache had one, and SAMBA had one (pre-talloc), but I wasn't aware of GTK's or MySQL's.

Now, there are three answers here to "why do people keep writing these things?". One reason is in one of my favourite Tridge quotes "You never need an excuse to rewrite something". (Note: IMHO Tridge is being generous. Tridge doesn't need an excuse. You and I, however, do).

Secondly, are several real improvements over straight pool allocators, especially avoiding the distinction between pools and allocations. Talloc heirarchies naturally reflect the lifetime relationships of the the data. As the code notes, Tridge stole this idea from halloc.

Finally, there is the reason I use and am excited about talloc, and the reason I thought it was the most important thing to talk about at LCA. This is why:

rusty@chopin:~/devel/cvs/talloc$ wc -l talloc.[ch]
 1208 talloc.c
  142 talloc.h
 1350 total
That's 1350 lines of code to drop into your project, not part of some toolkit you need to swallow whole. And that's way sweet.

[/tech] permanent link

Fri, 10 Feb 2006

Remixes "based on the sound of Hitachi hard drives failing"

From Copyfight's Alan Wexelblat:

The blog "Gizmodo" has announced the winner of its competition to create a remix track. The track must be "based on the sound of Hitachi hard drives failing". No, really. Hitachi has a page with .wav files playable so that people can figure out what that noise their hard disk is making might mean.

[/tech] permanent link

Mon, 06 Feb 2006

Wesnoth hacking

So I am now a Wesnoth developer. So far I've been fixing bug and making minor changes: it's fun, and *so* easy to work on after the kernel. You can reproduce bug reports, then run the whole thing under gdb!

One interesting project was to try to get an improved attack result prediction, initially for defensive weapon selection. This proves to be an interesting problem. So interesting I included a simple version of it as question 2 of the LCA hackfest.

The initial code, with comments, can be found here. One contestant caused it to run twice as fast, but being IBMers, they were eliminated. Noone else made significant improvements; it was quite depressing to see what people thought would optimize the code 8(

The algorithm itself can be improved significantly by realizing that we needn't keep the entire matrix, merely the probability distributions of each unit at every point the combat could end. Summing these independent (and exhaustive) outcomes will give the correct results. I would estimate a 10x speedup doing this, although I haven't completed a working variation (it was my initial version, but the matrix solution was easier to debug).

[/tech] permanent link

Mon, 30 Jan 2006

The Moustache

In previous years, I've flatly refused to consider shaving off my moustache at the LCA auction. However, there was a good cause this year, The John Lions Chair.

My main regret is that I didn't hold out for more, as Russell Coker later suggested. I'd heard that the UNSW cartel had $10,000, but I wanted to make sure we got all the money out of them. Still, most other (mainly non-facial-hair) people had no idea that this was a non-trivial thing for me: I'd had the moustache for over half my life.

[/self] permanent link

Sat, 28 Jan 2006

This year's event was very, very good. Turned up and registered Sunday night; the registration area and University looked great. On Monday, I got up at 6:30am (4:30am Canberra time) to make penguin-shaped waffles for the organisers. The Monday brought Damien Conway's excellent presentation on Presentation Aikido. Takeaway point: use a different font for code than for showing terminal interaction in presentations. So I ducked out of that early to catch some of the Digital Arts miniconf, and to make some Damien-inspired tweaks to my presentation.

Tuesday, I spent mainly hacking, but there was a lot of buzz around about various of the miniconfs, though: the Linux in Education miniconf might explode into a sister conference next year, the Debian miniconf was enlightening, and a number of excited people described the Digital Arts miniconf to me in minute detail.

Wednesday, I decided to go to Andrew Fitzsimon's tutorial on Open Source Graphic Design which taught me about the existence of Scribus and fontforge. Wednesday night was the hackfest; I was looking forward to getting lots of ccontrol GUIs.

Thursday, Dave Miller's keynote opened. I got to do something I've always wanted to do, and introduce him. Dave gave a great talk (three, actually), and indeed, revealed that he'll be working on the Linux port to Sun's new chip (Niagra). My talk went well.

Friday, Damien Conway's keynote opened. Lighter than Dave's, but as entertaining as we've come to expect from Damien. The day is a blur, until Van Jacobson's talk on speeding Linux networks. It recieved a deserved standing ovation, only the second at LCA after Eben Moglen. I watched Dave's face, but turns out Van Jacobson had spilled the beans to him two days before. We'll see VJ's ideas in Linux in 2006.

The dinner Friday night deserves its own paragraph. We auctioned off a copy of the Lions Book, signed by various UNIX notables and LCA speakers. The proceeds were to go to the John Lions Chair for Operating Systems at University of New South Wales (UNSW). We got AU$10,000 for the from a group of ex-UNSW students. That was matched by $10,000 from Linux Australia. Then bowls were passed around, and over NZ$3000 were handed in. This $AU23,000 will be given to maddog, who will donate it to the John Lions Chair fund. Maddog is on the board of USENIX, who will match donations made by any member, giving a total raised of over AU$46,000. Well done.

[/tech] permanent link

Fri, 27 Jan 2006

Talloc talk at

Talloc talk went fairly well; several people asked for the graphing code. Jeremy Kerr wrote the basic one, I hacked it to merge same-named nodes. You can grab the patch here, and also the shell script I used which merges the duplicate links and refreshes the browser.

Happy hacking!

[/tech] permanent link