Wed, 26 Dec 2007

Abysmal sqlite performance: sqlite3_busy_timeout() sleeps for 1 sec on lock

A thread on wesnoth-dev woke me from my development slumber, and I marshalled the stats.wesnoth.org code together for checking into the repo (hopefully someone with time will work on it).

One thing I did before committing was run "make check", and indeed, the parallel test failed: this runs 10 "insert into db" CGIs at once. I vaguely remember having a problem with sqlite3 commands returning SQLITE_BUSY at random times, and writing this test to diagnose it.

My fix at the time was to do a 'sqlite3_busy_timeout(handle, 500)': wait for 500ms before returning SQLITE_BUSY, rather than returning it immediately. I felt aggrieved to have to explicitly tell the database this: given that it's non-trivial to test such a parallel-access path, it's insane that the default behaviour of sqlite3 is to error out because someone else is using the database.

Anyway, as my 'make check' failure reminded me, that's not enough. This time, I poked around for a minute or so, and sure enough, strace shows this:

fcntl64(4, F_SETLK64, {type=F_WRLCK, whence=SEEK_SET, start=1073741826, len=510}, 0xbfb6bac4)
	= -1 EAGAIN (Resource temporarily unavailable)
nanosleep({1, 0}, NULL)          = 0

A non-blocking attempt to lock the database, then a full one-second sleep. The Right Way would be a blocking lock with a SIGALRM timeout (but that is dangerous for libraries to use); as it stands, every retry costs a whole second of latency. There is no way my 10 parallel programs can all get the database within the 500ms timeout I set, so 9 of them fail.

The correct answer seems to be to write your own busy handler which sleeps for less. This is a horrible hack.

#include <sqlite3.h>
#include <unistd.h>

/* sqlite3_busy_timeout sleeps for a *second*.  What a piece of shit. */
static int busy(void *unused __attribute__((unused)), int count)
{
        usleep(50000);

        /* If we've been stuck for 1000 iterations (at least 50
         * seconds), give up. */
        return (count < 1000);
}

...
        sqlite3_busy_handler(handle, busy, NULL);

Perhaps the more correct answer is to use something other than sqlite?


[/tech] permanent link

Tue, 18 Dec 2007

linux.conf.au cliquey?

Jeff Waugh notes that "linux.conf.au is also very cliquey", which is an exaggeration (esp. when compared with an invitation-only event), but does contain a grain of truth: although many people come to LCA for the first time every year, there is a core of Old Timers.

So while I don't think anyone is gratuitously excluded from LCA, I do worry about it. Hence we added a less-than-successful optional video submission to the call for papers, to try to reach out to great presenters who weren't well served by the submission process. I'm also running a newcomers' session to ensure everyone feels they have a good handle on the conference before it begins.

I'm certainly not the one to preach about being welcoming and inclusive, but I think it's a laudable goal.


[/self] permanent link

Thu, 13 Dec 2007

The foocamp clique?

OK, I've wrestled with this for a while. I've been invited to a Foocamp, but I've always been nervous about cliquiness in Free Software (such as the kernel summit). I have already compromised on topic-specific events like the Virtualization mini-summit, but I wonder.

We second-tier developers often have an instinctive reaction to hoard our knowledge ("it was hard to write, it should be hard to read!"). I've consciously resisted this urge, mainly by asking myself "What would Andrew Tridgell do?" (I also suspect that some day this hoarding attitude will become the litmus test for FOSS posers, so am seeking to preemptively fool it).

So am I paranoid? Are invite-only events a necessary evil? Or should I let my curiosity overcome my principles? Hey, maybe one of those web 2.0 types could help me enable comments on my blog...

Note: just noticed it clashes with LCA. No way!


[/self] permanent link

Mon, 10 Dec 2007

FOSS.IN: I had fun

For me, it's hard to properly enjoy a conference I'm speaking at. So I judge a conference by the discussions I have after my talk(s), and on that scale FOSS.in rated highly. Good technical questions, ideas and interest.

The chief organizer, Atul, made my enjoyment particularly difficult by asking me on Friday afternoon to do Saturday's pre-closing slot: he wanted me to fan the flames of FOSS contribution among the delegates.

I am touched by the faith of my colleagues, but here's the secret: if I do a good presentation it's because I spent a few solid days preparing it, and much longer actually planning it. Spontaneity fills the gaps, but it can't provide content. A restless night rolled into the morning with no great plan emerging.

So I asked for ideas from the sample of fellow speakers staying at my hotel, and they really came through for me. Lots of raw ideas that I tried to sculpt into a coherent whole. My message was simple: contributing is not hard! I talked about how I started, and I talked about how my first conference (USENIX/Uselinux in 1997) convinced me to get involved. I got James Morris up to talk about his similar experience at CALU, then Amit Shah, the only Qumranet employee in India.

Then I invited a volunteer on stage (actually one of the organisers; I wasn't quite brave enough to leave it to random chance!), and got her to create and submit a kernel patch using my laptop. Despite my perennially sticky escape key and the general impossibility of typing when hundreds of people are watching, the point was made: it's easy to join in (and actually the kernel is one of the more formal projects to submit to).

Finally, I asked for groups of people to come up to the stage: all the FOSS coders and contributors (well over half the audience), bug reporters, those who've helped other people with free software, then finally the trailing few who've only used free software. We heard from a few people on stage, on what they were doing (my mistake here: I should have tried to cover far more people, say 5 words each). I sat in the chairs and listened to their excitement over their contributions.

All up, I gave my performance a 6.5 out of 10. I'm not sure that I helped anyone overcome internal barriers to making useful contributions. But if someone in the audience goes home and starts creating something awesome, that'll be worth that one night of lost sleep.

I greatly enjoyed my time in Bangalore; for once I had time for sightseeing and shopping. But the conference was still the highlight, and I plan on being there again in 2008.


[/tech] permanent link

Tue, 20 Nov 2007

Quick note: linux.conf.au Newcomers' Session TBA

Going to linux.conf.au for the first time? There are some guides you can read, but it's nice to have some friendly faces ease you into your conference experience. To welcome you into the inner circle and teach you the linux.conf.au secret handshake, some assistants and I will be running a 'Newcomers Session' on the Sunday before the conference (probably around 5:30 somewhere near the registration desk).

It'll be fun and casual, about 30 minutes, then we'll head out to dinner nearby as a group.


[/tech] permanent link

Mon, 12 Nov 2007

Rusty vs. git vs. mercurial

I ran with git for long enough to distinguish my mercurialisms from real usability problems, but it just isn't what I want.

The problem is that I really have gotten used to hacking with a queue of patches à la quilt, and I get frustrated hacking without it. My lguest development tree was an experiment using mercurial queues, which is a quilt tree with revision control, and I'd hoped to do better with git (I tried using guilt). But it's just not a win.

Git is great for feeding patches to Linus (well, actually they're kind of a pain to prepare, but Linus takes them more reliably). But I'm not prepared to spend the time to write and maintain YA quilt-on-git system.

So I'll be maintaining a patch queue (in mercurial), except this time for all my kernel projects (including misc hacks). From that I'll create a for-linus git tree for him to pull. This time I've got scripts to export the mercurial tree, but also the tarball of patches and the unpacked series and patch files.

It currently all lives in my ozlabs.org home dir. I just need to figure out some triage system to mark some as "for linus", some as "for mm" and the rest as "only the brave".

I wonder if there's a proper way of extracting a tag from a revision (rather than "based on Linus' 6e800af233e0bdf108efb7bd23c11ea6fa34cdeb" it'd be nice to say "based on Linus' v2.6.24" or equiv). Using "grep .git/refs/tags/" seems wrong...
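It turns out "git describe" may be the proper way: it names a commit relative to the nearest tag reachable from it. Something of this form (reusing the SHA above; the offset shown is made up):

$ git describe 6e800af233e0bdf108efb7bd23c11ea6fa34cdeb
v2.6.24-rc1-54-g6e800af

ie. the nearest tag, the number of commits since it, and the abbreviated SHA.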


[/tech] permanent link

Tue, 06 Nov 2007

Kernel Coding: Good Enough?

Two incidents recently disturbed me about the state of kernel code and review. The first was Willy's stringbuf patch: I felt compelled to contribute a version which was more polished.

Ignoring (for this post at least!) the voodoo optimization that followed, what concerns me is that stringbuf was good, but not great. Yet I always think of the kernel as a bastion of really good C code and practice, carefully honed by thoughtful coders. Here even the unmeasured optimization attempts show a lack of effort on the part of experienced kernel coders.

Then came Jens Axboe's scatterlist changes. The transition pain is fine, but the resulting interface is subtle and not up to the standard I expect of kernel infrastructure: when I tried to use them in virtio the result was ugly and suboptimal. This is a side-effect of an attempt not to break things: to keep loops over chained scatterlists simple. Yet it turns out that introducing a header for multiple scatterlist arrays is both more explicit and can coexist cleanly with the old code, which would have vastly simplified the transition.

DaveM said to me that the scatterlist code "isn't that bad" and he's absolutely right. But it's not *great*, and that concerns me enough to look hard at replacing it, even though it works.

I briefly wondered if accelerated decentralized development and the weakening of review are causing a gradual lowering of standards. But I think it's like it always is: it's *hard* to write great code, and most of us frequently fall short.


[/tech] permanent link

Sun, 07 Oct 2007

New Laptop, Clean Start

The three-yearly laptop rotation offers a chance for spring cleaning while wrestling with video and sound (upgrade to Gutsy, easy) and wireless (so far still no packets). So I didn't transfer my package selections or dotfiles. I also changed from Gnome to KDE.

I switched to Evolution years ago to handle (IBM-internal) Lotus Notes meeting invitations. Yet it has ongoing problems indexing mail reliably. It occasionally complains it cannot sync, and has for years; no one seems to know how to fix it.

Kmail's "Evolution 2.0 Import" hangs: not a good sign, but importing individual mboxes works. And since Amarok has already proven more reliable than Rhythmbox playing AAP (daapd runs on my test box), so switching whole-hog to KDE seemed reasonable.


[/tech] permanent link

Fri, 31 Aug 2007

So, What is KVM-lite?

Someone sent me a very polite mail asking what KVM-lite is. My bad.

KVM currently requires either AMD-V (SVM) or Intel's VT (VMX) extensions on the chip: these are implemented in separate modules. So it should be possible to implement a "lite" backend which uses lguest-like techniques to boot a (paravirtualized) guest.

Why? To increase the coverage and reach of KVM: many low-end machines are still shipping without sufficient hardware support. Having a single "kvm" which can run across everything (at least, for Linux) makes a great deal of sense.

Critically, it's not that hard. KVM already has emulation for all the parts of the PC we want. Paravirtual drivers are coming, as is a hypercall interface. KVM lite only has 6 hypercalls, and three of those can probably be replaced by emulation. More hypercalls will come (page table updates and batching), but normal KVM will want those too.


[/tech] permanent link

Tue, 21 Aug 2007

kvm-lite: sash prompt arrives

Friday I hit the "VFS: Cannot mount root" and thought "I'm basically there", and spent the weekend finalizing LCA paper submissions.

But I just spent much of last night and all of today getting to the sash prompt. Nasty issues included paravirt patching being terminally broken recently (though it worked well enough for native), learning PAE (lguest doesn't do it, so I had been blissfully unaware) and dealing with QEMU internals. Diabolical issues included: inject_page_fault() can be called from inside the emulation code, which has temporary copies of your registers and will "restore" them over any modifications you make; "push %ebx" is not emulated correctly by KVM, so if it faults we get a strange crash later; and finally (this took over a day of debugging), qemu does not seem to emulate cmpxchg8b reliably, and can zero out the %edi register. I was running the whole thing under qemu for debugging, and it took me an awfully long time to prove to myself that the host wasn't somehow corrupting guest registers.

$ wc -l kvm-lite.patch
3451 kvm-lite.patch
$ grep -c FIXME kvm-lite.patch
35

That's over 1% FIXMEs by weight! So guess what's next...


[/tech] permanent link

Thu, 09 Aug 2007

Progress on kvm-lite, lguest

So after two weeks of going through the kvm kernel code, reading Intel docs and writing dozens of patches, this week was supposed to be the start of implementing "kvm-lite", which I'm supposed to be presenting at the KVM Forum at the end of the month (yeah, I love pressure).

Of course, other things (such as a couple of lguest bug reports) stole some time, but just tonight I got it to the stage where it flips into the guest and back (multiple times). Now, since I haven't even hacked a console together for the guest, it doesn't get far, but from here to booting should be less painful than those early steps.

What's interesting is that by mangling the lguest code into this different context I revisit the code with a little more x86 knowledge. Indeed, while copying the segment handling code into kvm-lite, I discovered (and wrote a test for) a nasty bug. The guest can tell us to change a GDT entry it's currently using, and we'll fault when we try to restore the guest segment registers. I handle the simple case of marking a currently used entry not-present, but not the more obscure cases which can cause a fault such as changing the stack segment descriptor to a code segment.

The problem is made worse by the user-modifiable registers of kvm-lite (or anything which wants to offer guest restore, such as future lguest). With lguest, we know that the segments were OK when we last ran the guest: we only have to be careful when executing the two hypercalls which modify the GDT. With kvm-lite we also have to be suspicious of userspace-supplied GDT entries, as they can crash the host.

The solution was rather simple, if in some ways less than elegant. We catch faults in the switcher and return to the host: because we didn't enter the guest, the trap number is not updated and so we can tell the switcher faulted. We kill the guest that caused it.

This also gives us some insulation against other such bugs: rather than causing a triple fault and host reboot (or even a re-install for poor Ron!), it just causes the problematic guest to die.


[/tech] permanent link

Mon, 23 Jul 2007

linux.conf.au presentations last chance!

The 2008 linux.conf.au presentation submissions are closed, but the link is still live for a few days (as our committee hasn't started judging yet).

So take a handful of minutes from your performance of great deeds and submit a summary before it really is too late.


[/tech] permanent link

Sat, 21 Jul 2007

Lguest gets merged, has a nasty bug

So lguest went into Linus' tree at 5am Friday morning or so. Since Friday is my day in the office I grabbed the latest git snapshot and compiled it up for one of the machines there, to check it had been merged properly and also test a few minor cleanups I wanted to send.

Boots fine, but after a while sitting idle, it stops responding. I back out my cleanups, and it still happens. Shit: bad start to lguest in mainline! I start debugging: the guest has interrupts disabled and is doing something, but nothing obvious. There was a similar-sounding bug report from someone last week which I hadn't tracked down yet. A few hours into debugging my wife arrives, time to go home: nothing obvious comes to me on the 90 minute drive, other than how I should get more details.

On arriving home and eating dinner, I prepare for a late night of debugging. But at home, it doesn't happen. I try with the same config as the work machine, still no lockup. I can't get into that work machine from here, so I head for bed hours earlier than I expected.

But I lay in bed thinking "what's different from my machine to the one at work?". Finally it occurs to me that it's possible that the one at work doesn't have synchronous TSCs: what if the guest were to see time go backwards because the host switched its CPUs? Perhaps it would end up in a huge loop. It would explain why it only happens after a period of idle, too. Damn, I'm a genius for figuring this out!

So I finally get remote access to the machine this evening. But on that machine the guest's not using the TSC, and so my beautiful theory is wrong...


[/tech] permanent link

Tue, 03 Jul 2007

Virtualization Minisummit and OLS

In one sense, not much happened at the virtualization minisummit. We got up to speed on what each of us is doing and thinking, and we all met face-to-face, including finally meeting Avi Kivity.

In another sense, a great deal happened. By discussing plans with each other, we pushed along various efforts to harmonize (or at least learned what the sticking points are). I hope that we'll look back and see this as the main achievement of the summit.

This trend continued at OLS itself. More people now understand the role of lguest (and coders are excited about having a sandbox to play in). More people are thinking about common virtual I/O, and aware of the virtio effort. KVM portability was plotted (especially by Carsten Otte pondering an S/390 port). Fruitful.


[/tech] permanent link

Sun, 17 Jun 2007

Talloc: References Considered Harmful?

So everyone knows I'm a big fan of talloc, the hierarchical allocator. The more you use it, the nicer it gets. So much so that I wouldn't code a new C project without it ("if it uses free(), it needs talloc").

A longstanding question in talloc circles has been the use of talloc_reference() vs. talloc_free(). (See posts like this one to samba-technical).

The normal problem runs like this: code A does "X = talloc(PARENT, ...); somefn(X); talloc_free(X)". This code should still be correct if somefn() uses talloc_reference() to hold a reference to X, and yet in general we don't know whether the "talloc_free()" meant to unlink X from PARENT or whatever somefn() attached it to. One solution is to insist that everyone use "talloc_unlink()" which explicitly says what parent to free the node from. But that's awkward, and can be difficult.
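To make the ambiguity concrete, here's a minimal sketch (somefn() and its stash are invented for illustration; the talloc calls are the real API):

#include <talloc.h>

static void *stash;

/* Takes a second reference to x, to use later. */
static void somefn(void *x)
{
        stash = talloc_reference(NULL, x);
}

void example(void)
{
        void *parent = talloc_new(NULL);
        char *x = talloc(parent, char);

        somefn(x);
        /* Ambiguous: do we mean to drop PARENT's link, or the
         * reference somefn() took?  talloc has to guess. */
        talloc_free(x);

        /* Unambiguous, but the caller must name the parent: */
        /* talloc_unlink(parent, x); */
}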

I came up with an algorithm which gets this "correct", in the sense that code using talloc_free() never frees X before code using talloc_unlink() would, and never after the worst-case talloc_unlink(). But one important feature of talloc's destructors is that they are reliable, unlike garbage collection: this algorithm means that X's destructor (which you might be relying on for cleanup) might not get called when you expect, depending on who else references it.

But this leads us back to examine the original case: why is "somefn()" taking a reference? If the object is going to be destroyed, as in the case of our code, presumably it should no longer be used by the somefn() logic anyway (some destructor should deregister it). And if that is the case, somefn() doesn't need a reference...

I spoke briefly with Tridge about it, and he said "well, removing references does seem to remove bugs". So perhaps if you think you need a reference you should look deeper...


[/tech] permanent link

Thu, 14 Jun 2007

Virtio I/O Draft III

So my work on creating a generic virtual I/O layer continues, with fascinating asides like the conversation with DaveM. My very first attempt (which never saw public release) was a low-level ring-buffer interface. My second attempt (draft I) was an interface to register input and output buffers and "used" pointers: in the "interrupt" you scanned the used pointers to see what had been used up.

I liked this, because the virtio-using drivers looked much like normal drivers. Unfortunately, it had the fatal flaw that delivery wasn't in-order. After some toying around, I moved to a callback model (draft II): each buffer has an associated callback which gets called and says what length was used. In order to get the locking to be sane, I moved the lock into the virtio subsystem: driver callbacks are called with the lock held. This is much less Linux-driver-like, but seemed to work.

Then I tried to NAPIfy the net driver, and the difference between virtio and "normal" hardware bit me on the ass. NAPI assumes that the interrupt and the information about what happened are separate: you can disable the interrupt and still poll for incoming packets. You can't do this if you're relying on callback "interrupts" to tell you about used buffers. So I switched to a "get_inbuf()" method: instead of the callback being passed information about the used buffers, it (or any other code) can ask for them one at a time.
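Here's a toy userspace model of the pull-style interface (every name is invented; the real code is a kernel driver). The point is that the driver asks for used buffers when it wants them, which is exactly the shape a NAPI poll loop needs:

#include <stdio.h>

struct used_elem { void *buf; unsigned int len; };

struct virtqueue {
        struct used_elem used[16];
        unsigned int head, tail;        /* head == tail means empty */
};

/* Pull-style: return the next used buffer, or NULL if none. */
static void *get_inbuf(struct virtqueue *vq, unsigned int *len)
{
        if (vq->head == vq->tail)
                return NULL;
        *len = vq->used[vq->head % 16].len;
        return vq->used[vq->head++ % 16].buf;
}

int main(void)
{
        struct virtqueue vq = { .used = { { "pkt", 3 } }, .tail = 1 };
        unsigned int len;
        void *buf;

        /* NAPI-style poll: drain with the "interrupt" disabled; no
         * callback needs to fire per buffer. */
        while ((buf = get_inbuf(&vq, &len)) != NULL)
                printf("used buffer \"%s\", %u bytes\n", (char *)buf, len);
        return 0;
}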

But now the draft-II centralized locking hit me: the net device's poll function stops the input callbacks, but the virtio output callback code will still try to grab the lock. So revert the lock centralization as well, and draft II has almost entirely vanished.

The moral here is that Linux driver infrastructure is optimized for real hardware: interrupts, status registers, DMA and such. If you're designing an I/O mechanism and you do something "alternative" your drivers won't fit the infrastructure: they'll be foreign-looking and complicated, and possibly buggy and sub-optimal as well.


[/tech] permanent link

Thu, 07 Jun 2007

Rusty Russell, Techno-diplomat.

ZDnet reports on virtio; the details are light as would be expected, but the description of paravirt_ops as a "techno-diplomatic feat" is certainly how it sometimes felt.

I spent a fair amount of time encouraging a common IRC channel, setting up a patch repository and similar confidence-building measures (like the all-important "who sends the patches?"), but the main breakthrough was at the 2006 Kernel Summit where Christoph Hellwig concentrated minds by bluntly telling the others to stop wasting everyone's time and do paravirt_ops (from my very rough memory). It made me appreciate both Christoph and the Summit a little more.

Of course, having lucked in once, I'm now looking at repeating the trick with virtual I/O. In this case, though, I've got a few virtual I/O implementations under my belt already, so I hope I have more of an idea what I'm getting into...


[/tech] permanent link

Sun, 03 Jun 2007

Linux.conf.au Submissions with Videos!

There's been some stir about the fact that you are allowed to submit a short video with your LCA talk/tutorial submission this year.

Since LCA doesn't do papers, the main point of being a speaker is to, y'know, speak. What attendees seem to want is to get access to quality information, so on the paper committee we try to judge the product of the interesting stuff the submitter knows and their willingness and ability to convey it. Judging the latter from a written submission is really hard, and we've failed horribly in the past.

As the competition for slots rises (75% reject rate last year), there's mounting pressure to play it safe and just accept the same speakers. But I know we're missing out on some great stuff: how do we give the other speakers a way to show us how great their talk will be? We thought this was an idea which might encourage more, different cool stuff.

Here's my example for an lguest talk, to show how low our expectations are. First take, self-filmed, 40 seconds long: 2MB ogg.

So if you're worried about getting accepted, I'd encourage you to spend 30 seconds telling us about your talk. Thanks!


[/self] permanent link

Thu, 31 May 2007

Virtualization Performance and the Very Low Bar

"I've clocked kbuild at within 25% of native."
    -- Avi Kivity on KVM 22 release.

I was surprised by this admission: I certainly hadn't been advertising (the similar) lguest numbers! But Avi probably knew that this was just the beginning (I don't have comparable kvm 26 numbers, but given the improvements in the virtbench results since then, I'd expect at least 5-10% more shaved off).

For virtualization I consider 25% slowdown to be poor, 15-25% to be reasonable, 10% to be good and 5% or less to be excellent. I'd hoped lguest would end up in "reasonable" for most tasks, and entertained fantasies that it might reach "good" without becoming a bloated mess.

But what's interesting here is how low the bar is. To pick my favourite area at the moment, we're still very much in the experimental stage with efficient I/O: even the "established" Xen doesn't have inter-guest I/O at all, and has been through several different schemes for networking. The schemes which ppc uses are baroque, and not obviously efficient either.

This, of course, makes it exciting. So jump in!


[/tech] permanent link

Wed, 30 May 2007

More Virtual I/O: Block devices

I now have two drivers on top of my "generic" Virtual I/O layer: a network driver and a block driver. To get the bugs out I enhanced lguest's launcher to use 8 clone()s to serve the block device asynchronously, which gives a great speedup even in a simple virtbench read.

The question of barriers/flushes and virtual block devices remains problematic: what guarantees should the host give the guest about data consistency? Consistency in the case of guest death is rather easy: the host can just service all outstanding requests. But if you want your data intact in the case of host crash, it means barriers have to be passed through to the physical underlying device. This affects performance.

Virtualization tends to do really well on naive disk benchmarks, because the server acts as an external cache (this is why lguest opens the block device it's serving with O_DIRECT, to make things a little fairer). Since virtualization loses to native on every other benchmark, I guess no one wants to give this up. But it's cheating; unless the underlying device is non-persistent (eg. copy-on-write and discarded when the guest dies), you should be honoring those barriers.


[/tech] permanent link

Mon, 21 May 2007

Ambition, Hubris and Virtual I/O

So, now that we have at least 4 x86 virtualization solutions for Linux (Xen, KVM, VMWare and lguest), not to mention UML, Power and S/390, the obvious point has been raised by many: why not have a single mechanism for (virtual device) I/O?

Well, first it turns out that there are many different things which people mean when they talk about I/O. There's guest userspace to guest userspace, guest devices served by the host and guest devices served by another guest. There's device discovery, configuration, serving and guest suspend and resume.

And, of course, everyone has a Plan, and many people have an Implementation as well. This is good because there's experience in different approaches, but bad because no one wants to change. The answer is always to standardize what you can, and let the rest converge naturally. In this case, I think aiming for common guest Linux driver code is an achievable short-term aim (ie. a platform-dependent "virtio" layer and common drivers above it).

Device discovery I'm leaving alone (Xen bus vs PCI vs Open Firmware vs Some-All-New-Virtbus): I'm not sure there's even a great deal of point in unifying it, but more importantly it's a separate problem.

There are four reasonable implementations which I have in mind. (I assume some method of sending inter-guest interrupts):

  • A shared page: this is the simplest: copy in, copy out.
  • A shared page containing descriptors: the other end is privileged: it can read/write the memory referred to by the descriptors (eg. guest - host comms).
  • Shared pages containing descriptors + hypervisor helper: the other end can use a hypercall to say "copy the memory referred to by that descriptor" to/from itself. This means the descriptor page has to be read-only to the other side so the hypervisor can trust it.
  • Full Xen-style grant table: mapping of arbitrary pages by the other side can be allowed (and revoked), and pages can be "given" to a willing recipient. This is controlled by a separate table, rather than being implied by the descriptors.

The danger is to come up with an abstraction so far removed from what's actually happening that performance sucks, there's more glue code than actual driver code and there are seemingly arbitrary correctness requirements. But being efficient for both network and block devices is also quite a trick.

So far, my model consists of an array of input and output buffers on either side. You register inbufs and outbufs into this array, send from your outbufs to their inbufs and receive from their outbufs into your inbufs. Finally you unregister inbufs and outbufs so the other side can no longer write/read them.

This seems to map reasonably well to existing practice and existing paravirt drivers. It provides the right places for Xen to grant/ungrant, and it works whether you're pulling or pushing data: send might actually transfer data, or it might just wake the other side. Similarly, receive might do nothing, or might actually do the transfer.
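As a concrete (if toy) userspace sketch of that model (all names hypothetical; a real send might copy, map, or merely wake the other side):

#include <stdio.h>
#include <string.h>

#define SLOTS 4

struct vbuf { void *addr; size_t len; };
struct side { struct vbuf in[SLOTS], out[SLOTS]; };

static void register_inbuf(struct side *s, int slot, void *p, size_t len)
{
        s->in[slot] = (struct vbuf){ p, len };
}

static void register_outbuf(struct side *s, int slot, void *p, size_t len)
{
        s->out[slot] = (struct vbuf){ p, len };
}

/* In this model send really copies; a Xen-style version might just
 * grant access and kick the other side to pull the data itself. */
static size_t send_buf(struct side *me, int out, struct side *peer, int in)
{
        size_t n = me->out[out].len < peer->in[in].len
                   ? me->out[out].len : peer->in[in].len;
        memcpy(peer->in[in].addr, me->out[out].addr, n);
        return n;
}

int main(void)
{
        struct side guest = {{{ 0 }}}, host = {{{ 0 }}};
        char msg[] = "ping", rx[8] = "";

        register_outbuf(&guest, 0, msg, sizeof(msg));
        register_inbuf(&host, 0, rx, sizeof(rx));
        send_buf(&guest, 0, &host, 0);
        printf("host received: %s\n", rx);
        return 0;
}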

The actual management of where in the array to put your in and outbufs, and where to send to/receive from, and how to coordinate that with the other side is currently left to the driver. For the network driver, it's a ring buffer. For the block driver, it'll be more randomly ordered. This might be pushed into the infrastructure as more commonality emerges.

I'll know more about how well it's worked once I've got a couple of drivers and a couple of backend implementations....


[/tech] permanent link

Mon, 14 May 2007

Powering down machines, wakeonlan

I used to do everything on my laptop, but now I work from home so much, I have a server machine as well. It is my test box for lguest and also the web proxy, but I don't really need it to run all the time. So I wrote a cron job, which runs this every minute:

#! /bin/sh

date >> /tmp/halter.log

# So, has everyone logged on been idle for over 10 minutes?
PEOPLE=`who -u | awk '{print $2}'`
if [ -n "$PEOPLE" ]; then
        if [ -n "$(cd /dev; find $PEOPLE -mmin -10)" ]; then
                echo People still active: $PEOPLE >> /tmp/halter.log
                exit 0
        fi
fi

# Has squid proxy been unused for over 30 minutes?
if [ -n "`find /var/log/squid/access.log -mmin -30`" ]; then
        echo Squid log `ls -l /var/log/squid/access.log` >> /tmp/halter.log
        exit 0
fi

# Finally, only shut down if the load average is below 0.10
# (awk exits non-zero, skipping this block, when it is).
if awk '{ if ($1 < 0.1) exit 1; }' < /proc/loadavg; then
        echo Loadavg too high: `cat /proc/loadavg` >> /tmp/halter.log
        exit 0
fi

/sbin/shutdown -P +1 "Automatic shutdown due to idle" >> /tmp/halter.log < /dev/null 2>&1

I ran "sudo ethtool -s eth0 wol g" once (it seems to be sticky), and then had to go into the BIOS and turn it on there. Now I can wake up the machine with "wakeonlan 00:16:76:E3:60:57" (I put that in a script for Alli).

The final ingredient is only half-done. There is a way to get Firefox to fall back to a direct connection if the proxy isn't answering, but it's not as simple as a checkbox. I put this line in /etc/proxy.pac:

function FindProxyForURL(url, host) { return "PROXY 192.168.5.133:3128; DIRECT"; }

Then I put "/etc/proxy.pac" in Preferences->Connection Settings->Automatic Proxy Configuration URL. This causes very low-delay failover when the proxy is off.

Ideally, I'd want the wake-on-lan packet sent out when I try to access the machine, and then go direct while it's booting. I was initially thinking a firefox extension, but actually a little libnetfilter_log program makes more sense: the machine would then wake up on any traffic.

My other issue is that I run experimental kernels which don't have suspend set up, so I'm booting every time. Ubuntu boots quite quickly, but I did have to tune2fs the filesystems from fsck-every-100-mounts to every 10 weeks.

Wake-on-lan seems under-developed and under-utilized, but it seems like a really easy way to be environmentally conscious.


[/tech] permanent link

Work life in the country

In three days it will have been six months since my wife and I moved out to the country.

Since then I've done lots of things that I would normally have called a professional in to do. That includes repairing fences, digging ditches, and chopping wood. The bigger things (which do take a professional) take longer out here, which is why my study is still just a large room with a desk in it. But there are grand plans...

Workwise, the move coincided with my desire to get back to some deeper hacking, and it has been very productive. I've also finally read various technical texts: "The Mythical Man Month" (excellent), "Inside the Machine" (good, if a little light for me), "The Psychology of Computer Programming" (thought-provoking) and "Hacker's Delight" (useful, but after the first half, more skimmed than read).

Downsides: 512kbit satellite is slow and not perfectly reliable, 3GB per month isn't enough and I'm not surrounded by my Ozlabian peers. I've finally started going into the office on a fixed day, so everyone knows when they can catch me, but not being able to casually chat about coding is a definite loss.


[/self] permanent link

Fri, 04 May 2007

NAK! A flame in three letters.

It's become trendy in the last few years for kernel people to start emails with "NAK" (ie. I oppose your patch going in). It's usually followed by reasons (although the first time I saw it was in the classic one-liner "NAK" from Al Viro), but it's still an absolutely horrible thing to say.

Tone-wise, it's the equivalent of "fuck you", with the added bonus of being a power trip for the person who says it (I can stop your code going in! I'm so leet!).

You're delivering bad news to someone; a couple of words can really make it easier for them to digest it. I suggest the following alternatives:

  • Sorry, there are still at least four problems...
  • Unfortunately, this isn't going to work.
  • I think you missed....
  • You are an idiot, your code sucks. Go away.

Personally, I've started putting "Hi XXX," as the first line of my posts when I'm going to argue. It's semantically null, but it helps keep a friendly tone. I don't know if it helps for other people, but it definitely helps when I get a mail starting "Hi Rusty,".


[/tech] permanent link

Tue, 13 Mar 2007

Andi has resisted the PDA->percpu conversion for i386, partially because x86_64 has a PDA. So the obvious answer is to convert the x86_64 pda to the percpu section as well.

This is my first x86_64 experience, and I hit a serious snag when I did "make ARCH=x86_64":

In file included from include/asm/system.h:4,
                 from include/asm/processor.h:18,
                 from include/asm/atomic.h:5,
                 from include/linux/crypto.h:20,
                 from arch/x86_64/kernel/asm-offsets.c:7:
include/linux/kernel.h:115: warning: conflicting types for built-in function 'snprintf'
include/linux/kernel.h:117: warning: conflicting types for built-in function 'vsnprintf'
...

And lots of similar errors. Turns out my include/asm symlink was still pointing to include/asm-i386: "make distclean" fixed that.

Once in my x86_64 kernel, kernel compiles seemed faster, so I thought I'd benchmark it: compiling similarly-configured 32 bit and 64 bit kernels running under 32 bit and 64 bit kernels. This is a 2.13GHz Core Duo 2 with 4G of RAM running 2.6.21-rc3-mm2 (HIGHMEM4G enabled on i386), compiling 2.6.21-rc3-git1 with "make -j4".

Running a 64-bit kernel:

  • Compiling a 64-bit kernel (median of three): 6m17s
  • Compiling a 32-bit kernel (median of three): 6m50s

Running a 32-bit kernel:

  • Compiling a 64-bit kernel (median of three): 6m19s
  • Compiling a 32-bit kernel (median of three): 6m54s

In a nutshell: no performance difference between the host kernels, it's just that compiling an x86-64 kernel is faster. Given that there's almost exactly the same number of .o files in each case, and the x86-64 vmlinux is 5% bigger, I'd suspect that gcc is having an easier time generating x86_64 code.


[/tech] permanent link

Thu, 22 Feb 2007

lguest: how fast is fast enough?

So, lguest on my Core Duo 2 using 512M is about 25-30% slower than the same setup native and uni-processor. A straight context-switch syscall benchmark puts us 60 - 90% slower (see below). Given that we're doing two pagetable switches instead of one (into the hypervisor and back), this is pretty good.

The idea of lguest is that it's simple, and to close the rest of that gap probably means introducing complexity. Where's the tradeoff? There are two examples which are troubling me. The first is that copying the per-guest information into and back out of the per-cpu area drops context switch speed by about 15% (7800ns to 9000ns), and in fact slows down all hypercalls. The optimal approach is to only copy back out when someone else wants to run on that CPU, but that means we need locking. This one, I'll probably do, since I'll need locking for shrinking the shadow pagetables under memory pressure, too.

The other example is a more difficult call: servicing some hypercalls directly in the switcher stub, staying in the guest address space. This is more difficult since we started using read-only pages to protect the hypervisor, but still might be possible. It changes our very simple assembler switch code into something else, though.

In particular consider that there are three hypervisor-sensitive parts to normal context switch: changing the kernel stack (particularly changing what stack system call traps will arrive on), altering the thread-local segment entries in the GDT, and actually switching the page tables. The last one is easiest: we already cache four toplevel page tables, we just have to move this cache into the read-only part of the guest address space where our assembler code can reach it: if it's in that cache, the asm code can put the value in cr3.

The change of stack can be managed in at least two ways. Normally it involves updating an entry in the TSS, but that's read-only in the guest address space. We could use a fixed stack for system calls and copy to the "real" stack in the guest. This copy must be done with (virtual) interrupts off, but we already have a dead-zone where we tell the hypervisor not to give us interrupts within a range of instructions. The change of stack is then simply updating a local variable in the guest. This solution also means the real stack doesn't move: the hypervisor needs to ensure the stack is always mapped, otherwise the guest double-faults and we have to kill it.

The other solution is to have multiple TSSs ready to go, just like we cache pagetable tops. Each one is 108 bytes, though, and while threads share page tables, they don't share kernel stacks, so this will potentially cache less than the "One True Stack" solution.

The TLS entries are harder. The GDT is 32x8 = 256 bytes, so we could cache a handful of them (maybe 14 if we don't cache TSSes: we have one read-only page). Otherwise, perhaps the host could set the %gs register to 0, and store the old value somewhere along with the TLS entries. The hypervisor would then see a General Protection Fault when the userspace program tries to use the TLS entries, and could put them in at that point. Of course, this means threaded programs still get one trap per context switch, and just about every program on modern systems is threaded (thanks glibc!). No win.

So, say we add 250 lines (ie. 5%) of fairly hairy code to support this. Say our context switch speed is within 10% of native until we go outside the cache: probably most people see less than 5% performance improvement, and I'm not sure that's enough to justify the complexity for lguest. I think that from now on, I want macro benchmark performance gains to be about 2x the percentage code increase. That means code size can only increase by about 12% due to optimizations, say 610 lines 8)


[/tech] permanent link

Tue, 20 Feb 2007

lguest: Life Without Segments?

So, I've spent the last few days trying to wean lguest off segments, and now the result works (with a performance regression I'm hunting), so it's time to braindump the whole thing.

Currently lguest uses segments: nasty x86 things which allow you to have an offset and limit on virtual addresses which can be accessed. You tell the CPU about your segment table (aka Global Descriptor Table) with the "lgdt" instruction: after that, anyone can try to load a number into a segment register and start using that segment. eg. "movw $0x68, %ax; movw %ax, %ds" would load GDT entry 13 into the DS segment register (segment registers can't be loaded straight from an immediate): each entry is 8 bytes, so shift away the bottom three bits. (This example is real: in Linux, entry 13 is the KERNEL_DS segment used for the kernel data and stack.)
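A quick sanity check of that selector arithmetic (bits 0-1 of a selector are the requested privilege level, and bit 2 picks the LDT instead of the GDT):

#include <stdio.h>

int main(void)
{
        unsigned int sel = 0x68;        /* the KERNEL_DS selector */

        printf("GDT index %u, LDT bit %u, RPL %u\n",
               sel >> 3, (sel >> 2) & 1, sel & 3);
        /* Prints: GDT index 13, LDT bit 0, RPL 0 */
        return 0;
}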

There are six segment registers: %cs is used for code, %ds for data and %ss for stack operations. %es is used for some string operations. The other two, %fs and %gs, are used explicitly in instruction prefixes to indicate that they are to be used, instead of %ds. This is used for special effects, such as per-cpu data or per-thread data. For example, "movl %gs:1000, %eax" will read in from a different virtual memory address depending on the GDT entry last loaded into the %gs segment register.

Each GDT entry contains a 2-bit privilege field, so you can disallow less privileged (ie. higher) CPU states from loading GDT entries. This means you can set a limit on all the entries in the GDT available to, say, priv level 1, and thus guarantee that any process at that priv level 1 would be unable to access high virtual addresses. The GDT entry is only read when the segment register is loaded though: you could load a priv-0-only GDT entry into %ds at priv level 0 and then return to priv level 1 and it'd continue using that segment.

Anyway, "trimmed" segments is what Xen and lguest (and AFAIK VMWare) use to protect the hypervisor from guests (which run at priv level 1). Lguest uses two unlimited GDT entries (10 and 11) which are only available at priv level 0: traps and interrupts are set up to switch the %cs segment to this and jump into the hypervisor which only those segments can reach.

This approach has two problems: glibc needs full untrimmed segments, and some x86_64 chips don't enforce segment limits, so it doesn't work there. Lguest has a fairly complicated trick for the former, involving trapping and reloading untrimmed segments for userspace, and bouncing syscalls through a segment-neutralizing trampoline. As for x86_64, lguest is 32-bit only 8)

The new idea (from Andi Kleen and Zach Amsden) gets rid of segments altogether, and uses the pagetables to protect the hypervisor. This means the hypervisor text and real GDT are visible to the guest, but read-only so they can't be changed. Pages for other guests which might be running on other CPUs aren't mapped at all in this guest. The result looks like this for a guest running on CPU1:

+---------------+ 0xFFFFFFFF
|               |
|               |
...(Unmapped) ...
|               |
|               |
+---------------+
|CPU 1: rw page | <- stack for interrupts
+---------------+
|CPU 1: ro page | <- host state for restore, guest IDT, GDT & TSS.
+---------------+
|               |
...(Unmapped) ...
|               |
+---------------+
|Hypervisor Text| <- (Text is readonly of course!)
+===============+ 0xFFC00000 (4G - 4M)
|               |
|host-controlled|
|    mappings   |
...

The stack is fully writable by the guest, but we only use it when we trap, in which case the guest isn't running (lguest guests are uniprocessor, but even if they were SMP, only this CPU's trap page is mapped on this CPU).

The mapping in the host is the same, except the host can see all the pages for every cpu, and all are writable. Linux i386 doesn't have per-cpu kernel mappings, but mapping each CPU's pair of pages in adjacent addresses works just as well.

Switching into the guest looks like:

  1. Disable interrupts
  2. Link this CPU's hypervisor pagetable page into this guest's pagetable.
  3. Copy guest registers into "read-write" page for this CPU.
  4. Copy guest GDT, Interrupt Descriptor Table into "read-only" page for this CPU.
  5. Save segment registers and framepointer (we've told gcc we're going to clobber all registers that it will let us).
  6. Save host stack pointer and switch to guest stack.
  7. Switch to guest's GDT and Interrupt Descriptor Table.
  8. Load guest's TSS.
  9. Switch to guest page tables (GDT, IDT, TSS etc. now read-only)
  10. Pop all the guest registers off the stack.
  11. iret to jump back into the guest.

The copying in (and, on return, copying out) of registers is a pain, but we need a stack mapped in the same place in the guest and host: a non-maskable interrupt (NMI) could come in at any time and so we must always have a valid stack. Moving the stack and switching pagetables atomically is almost impossible.

The copying in of (most of) the GDT and IDT (and a couple of guest-specific fields in the TSS) is also a pain, but they must also be mapped at the same place in guest and host. Loading the guest TSS (which tells the CPU where the stack for traps is) actually involves a write to the GDT by the CPU. So we cannot load the TSS after we've switched to the guest pagetables, where the GDT is read-only. Loading the TSS before the switch implies that it's in the same virtual address in host and guest.

This implementation works, but virtbench reveals that guest context switch time has doubled. Strangely, it seems that the copying in and out is not the culprit; I'm profiling now...


[/tech] permanent link

Mon, 19 Feb 2007

Wikipedia, Old Photos and Lguest

So, after my Wikipedia entry was deleted for lack of notability, it was recreated (albeit in a lesser form). I'm not sure if I'm notable or not, but I'm staying out of it.

The reason I blog about it is that the picture is awful:

So I went looking on google for a photo which didn't suck.


Let's start with the bad ones, "The Crucifiction" and "Chipmunk Impression":



I found one which proved that they didn't choose the very worst picture of me for Wikipedia, "The Impending Vomit":


For a trip on the wild side, here is how I look to our Japanese colleagues:


The "Dot Com Boom" and the "What Happened to the Dot Com Boom":



Finally, the two pictures I actually like, "The Bath" and "The Wannabe":


(Thanks to those flikr albums, pro-linux.de, the very odd Hacker Pictures, Linux Weekly News and OLS).


[/self] permanent link

Tue, 13 Feb 2007

lguest progresses, peer review, and a disturbing idea...

lguest is in the -mm tree, although Andi doesn't think it should get into 2.6.21. I disagree, since it's fairly self-contained: kvm went in then was rewritten, but OTOH, that didn't go through Andi IIRC.

The lkml peer review has been great: Jens found a block driver bug (fixing that more than doubled the block speed), and Herbert found a network driver bug. Andi, in particular, walked through the patch and made numerous comments, most of which I acted upon.

The disturbing idea came out of the x86-64 lguest port proposal on the virtualization mailing lists. Andi and Zach waved the idea of how to do lguest on x86-64 at me at LCA, but I was distracted. Having thought about it, I think it's doable: use read-only pages for the highmem area and not segment limits (which aren't enforced on many x86-64 systems, apparently). More disturbing is the idea that this is how I should have done the 32-bit version of lguest.

What does this mean? Well, I wouldn't have to futz with segments, and the magic restoration of 4G segments for threaded userspace code. This would get rid of the most complex part of lguest. Secondly, since I'd need a whole scratch page, I'd have plenty of stack room to handle NMIs (currently oprofile reboots my machine when an lguest is running because we do *not* handle nested interrupts). Finally, it would make the x86-64 port fairly straight-forward.

The downside is that it's back to hacking on the low-level switch code. This was the slowest part of development, and something I thought I'd put behind me: now it was surely about optimizations and features! Plus, all that learning about segments: wasted!

Intel kindly offer free x86 reference books if you ask, and they arrived today. Perhaps this is a sign?


[/tech] permanent link

Mon, 12 Feb 2007

lguest patch review and performance

So, posting the (cleaned up, divided up) lguest patches to lkml actually helped significantly. There were some useful comments from Andrew Morton and particularly Andi Kleen, and also about the block driver from Jens Axboe. It turns out that "end_request()" doesn't actually end the request, it ends a single bio_vec. Hence I was doing far too much I/O. Ouch!

That fix in my hot little hand, I benchmarked a kernel compile under lguest again. Microbenchmarks are nice for focussing optimizations, but they don't tell the whole story. Compile time: 17:55. Down by about a minute on the last time I tested. But this time I rebooted my host kernel with mem=512m (the same as the guest had), for a fairer comparison. 14:01. That means I'm within 30% of native speed.

Not a bad effort for a simple hypervisor! Now, if only adding more memory didn't slow it down (damn highmem!)...


[/tech] permanent link

Tue, 06 Feb 2007

Books: the final list

Mon, 05 Feb 2007

Books to buy?

I figure it's time for a big Amazon order, being out in the country and all. Mainly I'm after books which are worth keeping: any recommendations welcome!
  • The Elements of Programming Style - Brian W. Kernighan
  • Programming Ruby: The Pragmatic Programmers' Guide, Second Edition - Dave Thomas (paperback)
  • The Pragmatic Programmer: From Journeyman to Master - Andrew Hunt (paperback)
  • Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture - Jon Stokes (hardcover)
  • The Mythical Man-Month: Essays on Software Engineering, 20th Anniversary Edition - Frederick P. Brooks (paperback)
  • Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software - Scott Rosenberg (hardcover)

[/tech] permanent link

Fri, 02 Feb 2007

Lguest Progress

Lguest has seen much progress since my last post. First, James Morris implemented SMP host support, so now lguest is runnable by most people (it still conflicts with CONFIG_X86_PAE, aka HIGHMEM64G, and probably will forever, at least on the guest).

Avi Kivity suggested some simple optimizations (in the area of KVM), which I implemented for lguest; they helped noticeably. He suggested that lguest and kvm share mmu code, which is a good idea except I'm not sure I need all that; for the moment I'm pushing ahead with my own code and I may put some of those ideas into his. In particular, I want the shadow page tables to be susceptible to memory pressure: lguest uses a pool of 1024 for all guests, and kvm uses 256 per guest. Both suck in different ways: lguest's is too small, and kvm's is bad if you want non-root to run up machines. I'm working on that now.

I implemented bzImage loading for lguest: these kernels are easier to find than a vmlinux, so this is a usability win.

Last but not least, Linux Weekly News published an overview of lguest.


[/tech] permanent link

LWN's Best Kept Secret

Linux Weekly News has a Kernel article index. Who knew?
[/tech] permanent link

Sun, 28 Jan 2007

lguest performance

So I've been looking at lguest performance, and it's an interesting area. There were some fairly obvious things to do with page table updates (we used to throw away the whole page table on every context switch, for example), and they proved a big win. Implementing binary patching, something I wanted to do for lguest to be a good demonstration of paravirt_ops, bought around 5%. But one of my ideas hasn't worked out.

The idea of amortizing hypercall cost by having some batching mechanism is not novel; it's explicitly supported by the "set_lazy_mode" operation in paravirt_ops. In lguest I decided on a simple ringbuffer of calls: when you make a hypercall, the ringbuffer gets executed first. Yet we were already down to 2 hypercalls per context switch, so reducing it to 1 doesn't make a great difference.
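Here's a toy model of that ring (names invented; the real ring lives in guest-visible memory and the host drains it):

#include <stdio.h>

#define RING 64

struct hcall { int call; unsigned long arg; };
static struct hcall ring[RING];
static unsigned int ring_used;

static void do_hcall(int call, unsigned long arg)
{
        printf("hypercall %d(%#lx)\n", call, arg);  /* stand-in for a trap */
}

/* A real hypercall drains the ring first: order is preserved, and up
 * to 64 queued calls cost a single guest<->host transition. */
static void hcall(int call, unsigned long arg)
{
        unsigned int i;

        for (i = 0; i < ring_used; i++)
                do_hcall(ring[i].call, ring[i].arg);
        ring_used = 0;
        do_hcall(call, arg);
}

/* Queue a call instead of trapping into the host now. */
static void async_hcall(int call, unsigned long arg)
{
        if (ring_used == RING) {
                /* Full: flush everything with one "real" call. */
                hcall(call, arg);
                return;
        }
        ring[ring_used++] = (struct hcall){ call, arg };
}

int main(void)
{
        async_hcall(1, 0x1000);         /* eg. a page table update */
        async_hcall(1, 0x2000);
        hcall(2, 0);                    /* eg. a halt, flushing the queue */
        return 0;
}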

My grander plan was to use these "async" calls for network I/O, to get up to 64 packets in one hypercall. But for real (TCP) network flows between two guests, this doesn't help. It helps a little on a simple udpblast scenario, but it hurts horribly on a pingpong benchmark. I previously changed lguest to use "sync" wakeups for inter-guest interrupts, and to yield() when the receiver is out of buffers: both help on bandwidth benchmarks.

I've kept the network async call patch around though, because I suspect the terrible latencies are due to a bug rather than a flaw in the idea: AFAICT the sender should go idle fairly soon and call LHCALL_HALT, which will flush the async calls. I'll revisit it later; telling the networking core about the capabilities of lguest_net is the more obvious path to speed!

Avi pointed out that KVM (as lguest) blocks on disk I/O. Changing this is easy in theory, but I'd prefer to use a separate process rather than AIO or threads. And of course, there is also an infinite number of page table optimizations to be done...

Meanwhile, compiling the kernel under lguest (512M) takes almost exactly twice as long as compiling under the host (3G). I'd hope to halve that gap, but after that I expect we'll face diminishing codesize/performance returns, and lguest is supposed to stay simple.


[/tech] permanent link

Sun, 21 Jan 2007

linux.conf.au 2007

My lca2007. Home-made segway. User as hero. Good morning Awesome. Restrained use of IRC popups in our talk. Open day. Zach and Andi suggesting how to use the R/O pagetable bit to implement lguest/x86_64. lguest patches from James Morris on my birthday. More.

Best. LCA. Ever.


[/tech] permanent link

Wed, 10 Jan 2007

lhype/ll progress and the TODO list

The Great Renaming has not happened yet because James Morris offered to look at SMP host support, and it would completely screw his merge if I rename everything.

However, my TODO list has been reordered into "before the merge" and "after the merge". Now that the console is non-buffered, I've been spending more time actually working inside an lhype guest. This has refocused me away from optimizations (and there are still plenty there) and back towards functionality. In particular, suspend/resume has moved to "before the merge", since obviously the startup and resume paths should be the same.

I want a trivial "lguest" program to live inside the kernel sources; since the lhype/ll ABI can change kernel to kernel, at least at this stage, distributing the launch tool separately doesn't make much sense. But obviously lhype_add is already 900 lines and will only grow with things like support for new devices, device hotplug and suspend/resume. I need to resolve this in a reasonable way, possibly through some sane use of dynamic libraries as fairly open-ended plugins.

So here's my "pre-merge" TODO at the moment:

  • SMP host support
  • Suspend/resume
  • Unlimited DMA receive buffers
  • Neaten get_dma proc op.
  • Plugins support

And afterwards (ie. less urgent/more blue-sky):

  • Debug trap/int3 support (ie. direct interrupt gates).
  • Don't COW pages when they are read in get_pfn
  • More intelligent cr3 reload
  • pte optimizations
  • Lazy mode support
  • Tickless idle
  • Stolen time.
  • Allow normal users to access lhype
  • Framebuffer support

[/tech] permanent link

Mon, 08 Jan 2007

lhype: speculation on a new name, and networking work

So, Ingo Molnar hates the name "lhype"... mainly the "hype" part, believing that no one will take it seriously. I guess I'm not that serious a person, so this didn't trouble me at all. However, Ingo is brilliant so I'm considering changing the name. "LL" (Linux on Linux) is the current front-runner.

Meanwhile, virtbench is supposed to be guiding my hand at more optimizations, except first it's serving to stress-test lhype. The networking code is the latest victim: I didn't realize that a driver could call netif_stop_queue on itself and return NETDEV_TX_BUSY, and the packet gets nicely requeued and everything. I had open-coded a single-packet-cache because... well, I guess my expectations are low. The Linux networking code is so nice to work with!
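For reference, the pattern I'd missed looks roughly like this (a sketch only: the private structure and the actual transmit step are invented):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct lhype_priv { int tx_full; };     /* hypothetical driver state */

static int lhype_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct lhype_priv *priv = netdev_priv(dev);

        if (priv->tx_full) {
                /* The core requeues the skb and retries after we
                 * call netif_wake_queue() from tx-done handling. */
                netif_stop_queue(dev);
                return NETDEV_TX_BUSY;
        }

        /* ... hand the packet to the host here ... */
        dev_kfree_skb(skb);
        return NETDEV_TX_OK;
}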

Now that it works, virtbench shows that it's slow. My current 8-packet receive queue for a mega-high-speed device like a virtual NIC is criminal. Supporting huge numbers of registered buffers without consuming wads of host memory requires another rework, but now I can justify it!

The reason this weekend has been so productive is that I finally finished Zelda: Twilight Princess on the Wii. Gamespot reckoned it offered 30-40 hours of gameplay: it took this newbie over 85 hours! And that's despite losing my "googling for puzzle solutions is cheating" inhibitions around the 60-hour mark...


[/tech] permanent link

Thu, 04 Jan 2007

Wesnoth at linux.conf.au Open Day

So I might have to drive up to Sydney to bring along my new Intel Core 2 Duo machine and a monitor for it. I'm not quite sure what I should demo: maybe if I can get other machines we can play a multiplayer game or something. I'm just aware that watching someone play Wesnoth is probably pretty boring. Ideas welcome!

There's always the Wesnoth 1.1.2 trailer in a loop!


[/tech] permanent link

Tue, 02 Jan 2007

lhype: speeding up system calls

Last episode, lhype was 35 times slower at system calls than native. The main reason for this is that every trap (including the system call) gets redirected into the hypervisor.S stubs, which exit into the host; the host then decides it's for the guest, copies the trap frame onto the guest stack and jumps into the guest's handler.

After handling the interrupt, the guest calls back into the hypervisor to do the "iret": re-enable interrupts and return.

Now, we want to point the interrupt handlers straight into the guest. This means that the guest stack and the guest handler code must be mapped, otherwise we "double fault", which is hard (maybe impossible?) to recover from. So we always enter the guest with the stack and interrupt-handler pages already mapped; then we can point the handlers for just about everything straight into the guest. We still need to intercept 14 (page fault), because we unmap pages behind the guest's back and need to fix them up again, and 13 (general protection fault), because we have to emulate the inb/outb instructions. But the rest can go straight into the guest...
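
In C, the host's decision about each trap boils down to something like this (a sketch; the helper name is mine):

/* Can trap "num" be delivered directly to the guest's own handler? */
static int direct_trap(unsigned int num)
{
	/* 14 (page fault): the host must fix up shadow pagetables first.
	 * 13 (general protection): the host must emulate inb/outb. */
	return num != 14 && num != 13;
}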

Returning from interrupts is a little trickier. The native "iret" instruction restores the interrupt state (ie. re-enables interrupts) and returns to the caller atomically. We would need two instructions: one to re-enable virtual interrupts, then the "iret". That is no longer atomic: we could be interrupted between the two. So we explicitly tell the hypervisor the address of the "iret": it is not to interrupt us on that instruction, even if interrupts are enabled, and the race is closed.
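
In assembler, the return path might look like this (a minimal sketch; the flag name and its offset in the shared page are invented):

	# Tell the hypervisor that virtual interrupts are on again.
	movl	$1, lhype_page+IRQ_ENABLED_OFFSET
	# The hypervisor has been told the address of this "iret" and
	# will never inject an interrupt here, so the pair behaves
	# atomically.
	iret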

But what about the TLS segments? There are two problems here: first, making sure userspace has access to the 4G TLS segments, and second, making sure that kernelspace doesn't. But segments are strange beasts on x86: the segment table is only consulted when a segment register is loaded (in this case, %gs), so we can load it once, then replace the segment table entries, which ensures any reloads don't get the full segment.

We use this when restoring %gs in the hypervisor on the way back into the guest: pop %gs off the stack, then truncate the 4G TLS segment down to one page. If the guest reloads %gs and tries to use it, it will fault. We then enter the hypervisor and can decide whether we should reload %gs for it (ie. it's in usermode) or not. To avoid looping on a real faulting instruction, we remember the last instruction we fixed %gs on: if the guest hasn't made a system call or other trap and it faults again in the same place, we pass the fault through to the guest. In theory, the code could be reloading %gs in a loop, but in practice that doesn't happen.
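
Host-side, the heuristic is roughly this (a sketch with invented structure and helper names; last_gs_fixup would be cleared on any other trap or system call):

/* Sketch: on a general protection fault, restore the full TLS segment
 * once per faulting address; a repeat fault at the same address with
 * no intervening trap is the guest's own problem. */
static void handle_gpf(struct lhype_guest *lg)
{
	if (guest_eip(lg) != lg->last_gs_fixup) {
		restore_full_tls(lg);		/* give back the 4GB segment */
		lg->last_gs_fixup = guest_eip(lg);
		return;				/* retry the instruction */
	}
	reflect_trap_to_guest(lg, 13);
}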

Aside: as a bonus, this works under QEMU, which doesn't enforce segment limits. It never traps, so we never fix up the segment limit, but then, we don't need to. Of course, all hypervisors using segments like this are insecure under QEMU.

How does this help us prevent the kernel from accessing that 4GB segment? Well, now all we need to do is make sure the kernel reloads %gs on entry, so it picks up the harmless segment from the table. To do this, we divert all interrupts via a special page of stubs, which look like this:

	# Reload the gs register
	push	%gs
	pop	%gs
	# Make sure the hypervisor knows we've done a gs reload
	movl	$0, lhype_page+4
	# Now it's safe to call into the kernel.
	jmp	<real interrupt handler>

This page is mapped read-only in the guest's address space, so it can't change the contents, and voila! The total cost of virtualizing the syscall is a few instructions (although the gs load is not particularly cheap) and a fault on the first %gs access after returning from userspace. As a bonus, we only need to ensure that the one page containing all the stubs is mapped, not every interrupt handler.

Now it's implemented and debugged, benchmarks to follow...


[/tech] permanent link

"Five Things" meme from Pia Waugh.

Does someone (who is not on a modem) want to trace this meme to its source? I guess it wasn't hard to predict that you're onto a winner when you ask bloggers "here, I insist you write about yourself!".

  1. My first contribution to Free Software was a patch to g++.
  2. When I was 25 I took ballroom and latin dance lessons. I found it difficult but rewarding: it's how I met my wife.
  3. I once started a wrapper for g++ called g++-helper to parse and expand the error messages (complete with code examples), but lost the error database in a hard-drive crash and was too demoralized to return to it. (Anyone interested in restarting?)
  4. When I was in primary school, I did ballet. I doubt I was any good.
  5. I am not brilliant, but that is why I love Free Software. I might just enable others who are.

I'd tag Tridge (aka. He Who Has No Blog), Martin Pool, Tony Breeds, Jeremy Kerr and Alli, but they're too cool and vanity-lacking for such a venture...


[/self] permanent link

Mon, 01 Jan 2007

Lhype's TLS Segment Trick

x86 hypervisors under Linux have a problem: glibc wants segments which cover the entire 4GB range of virtual addresses, but allowing that would let the guest access hypervisor memory (usually sitting in the top 64 MB or so of memory). This is because glibc uses segments to implement __thread (aka thread-local storage), and uses huge offsets to wrap around to below the thread pointer.
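
An example makes the wrap-around concrete (illustrative only; the exact assembler varies by gcc version and TLS model):

__thread int counter;

int bump(void)
{
	return ++counter;
}

/* For the local-exec TLS model, gcc emits roughly:
 *	incl	%gs:counter@ntpoff
 * where the offset is a small negative number: the access wraps
 * around the 4GB segment to land just below the thread pointer.
 * With a trimmed segment limit, every such access would fault. */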

Linux doesn't have a problem with allowing these huge segments, because the "U" bit in the page tables protects it: if this bit isn't set, userspace can't access the memory. However, while this protects ring0 from ring3, it doesn't protect ring0 from ring1 (the hypervisor case). For this reason, Xen uses a modified glibc (or traps on every __thread access and prints out a warning that you're going damn slowly).

lhype is supposed to be convenient, so a modified glibc (at least until everyone's distribution ships one) or a huge performance hit were not good options. Hence I used a different trick: since all transitions from userspace to kernel (ie. interrupts and iret) go via the hypervisor, we replace the TLS segments with trimmed segments if returning to the guest kernel, and the full segments if returning to guest userspace, where the lack of the U bit on the pagetables protects the hypervisor anyway.
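
The trimming itself is just bit surgery on the GDT entry (a sketch; the helper name is mine, and the masks follow the standard x86 descriptor layout):

#include <stdint.h>

/* Sketch: cut a segment descriptor's limit down to a single page. */
static void trim_segment(uint64_t *desc)
{
	*desc &= ~0x000F00000000FFFFULL;	/* clear limit[19:0] */
	*desc &= ~(1ULL << 55);			/* clear G bit: byte granularity */
	*desc |= 0xFFF;				/* limit = 4095, ie. one page */
}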

This works well, but the two bounces through the hypervisor for every system call are the reason we're 35 times slower than native system calls. And if we don't go through the hypervisor, how do we ensure that the kernel never gets access to those huge hypervisor-mapping segments?

A: Another, slightly trickier trick....


[/tech] permanent link