Sat, 21 Jul 2007

Lguest gets merged, has a nasty bug

So lguest went into Linus' tree around 5am on Friday morning. Since Friday is my day in the office, I grabbed the latest git snapshot and compiled it for one of the machines there, to check that it had been merged properly and to test a few minor cleanups I wanted to send.

It boots fine, but after sitting idle for a while, it stops responding. I back out my cleanups, and it still happens. Shit: bad start to lguest in mainline! I start debugging: the guest has interrupts disabled and is doing something, but nothing obvious. Someone sent me a bug report last week which sounds similar, and which I hadn't tracked down yet. A few hours into debugging my wife arrives and it's time to go home. Nothing obvious comes to me on the 90-minute drive, other than ways to get more details.

After arriving home and eating dinner, I prepare for a late night of debugging. But at home, it doesn't happen. I try with the same config as the work machine: still no lockup. I can't get into that work machine from here, so I head for bed hours earlier than I expected.

But I lie in bed thinking "what's different between my machine and the one at work?". Finally it occurs to me that the one at work might not have synchronized TSCs: what if the guest were to see time go backwards because the host moved it to a different CPU? Perhaps it would end up in a huge loop. It would explain why it only happens after a period of idle, too. Damn, I'm a genius for figuring this out!
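To make the "huge loop" idea concrete, here is a minimal sketch of how a clock that steps backwards could turn into one. It is not lguest or kernel code: the function name `ticks_to_replay` and its parameters are hypothetical, and it simply assumes a guest that catches up on missed timer ticks by subtracting two unsigned counter reads.

```c
#include <stdint.h>

/*
 * Hypothetical helper (an assumption for illustration, not from lguest):
 * how many timer ticks of length `period` have passed between two raw
 * counter reads `last` and `now`.
 */
static uint64_t ticks_to_replay(uint64_t last, uint64_t now, uint64_t period)
{
    /*
     * Unsigned subtraction wraps modulo 2^64. If `now` was read on a CPU
     * whose counter is behind the one `last` came from, now - last is
     * not a small negative number but an enormous positive one.
     */
    return (now - last) / period;
}
```

With a period of 1000 counts, reads of 1000000 and then 1003000 give 3 ticks to replay. If the second read comes back as 999000 instead, the same call returns roughly 1.8e16, and a loop that replays one tick per iteration would effectively never finish.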

So I finally get remote access to the machine this evening. But on that machine the guest's not using the TSC, and so my beautiful theory is wrong...
