Wed, 13 Dec 2006

The Joy of Binary Compatibility: glibc and vdso

One of the early problems Xen hit, and we had to deal with cleanly with paravirt_ops, is that the hypervisor wants to be mapped at the top of virtual memory, but the vsyscall dynamic shared object (vdso) is usually mapped there, right up at 0xfffff000. The vdso is a page exposed to userspace which glibc will jump into to make syscalls. This allows the kernel to decide whether to use traditional "int $80" or the more recent "sysenter" instruction for syscalls, plus provides a fixed place to return in the "sysenter" case.

There were several patches floating around to make this location dynamic, which is fine except for one small problem: moving it breaks some early vdso-enabled glibc versions, which exit with an assertion. This problem surfaced with Andrew Morton's Fedora Core 1 box, but since that's so old, the consensus (with which I disagreed) was to add a CONFIG_VDSO_COMPAT which mapped the vdso at the old address, and disallowed CONFIG_PARAVIRT.

Unfortunately, Andi Kleen tested on SuSE 9.0, which has the same issue. He solved it by turning vdso off by default when CONFIG_PARAVIRT was enabled (it can be turned on by vdso=1 on the boot commandline). This is pretty unsatisfactory given that the bug would effect only a tiny and diminishing number of people, and disabling vdso slows everyone down by default.

This came to a head on the mailing list, so I decided to look at more hacky solutions. My first thought was that if init crashes without doing anything, we should disable vdso and exec it again. Unfortunately, assert() calls abort(), which calls kill() to send itself a SIGABRT, but the kernel has special code to make sure that init never receives a signal it can't handle and refuses to deliver it. The result is that init hangs instead of crashing.

So, first I looked at the signal handling, and produced this patch for the generic signal.c:

-               if (current == child_reaper(current))
+               if (current == child_reaper(current)) {
+#ifdef CONFIG_COMPAT_VDSO
+                       /* Gross hack: Old glibc asserts, not
+                          liking moved vdso (SuSE 9, FC1) */
+                       if (signr == SIGABRT && list_empty(¤t->children)) {
+                               signr = -1;
+                               break;
+                       }
+#endif                         
                        continue;
+               }
 
This function never returned a negative number before, so the caller knows what happend:
+#ifdef CONFIG_COMPAT_VDSO
+       else if (signr == -1) {
+               void reexec_init(void);
+               static int reexec_done;
+               if (!reexec_done++) {
+                       printk("COMPAT_VDSO: old glibc?  Disabling vdso\n");
+                       vdso_enabled = 0;
+                       reexec_init();
+               }
+       }
+#endif

I sent it out without testing, since I don't have either of the failing systems, but since I know the code is ugly and noone will be inclined to fix it for me, I wrote a tiny program called "assert" which simply did "assert(0)", and booted with "init=/assert" (under qemu, of course!).

And, equally of course, it didn't work. Firstly, because init has all those kernel threads as children already, so "list_empty(¤t->children)" is never true. Secondly, because the second exec failed: the kill() path was expecting the signal to kill the process, and so it set "current->signal->group_stop_count" to 1, which indicates the entire thread group should be stopped. Next time through the signal-checking path (and it gets called immediately on return to userspace), init hangs. Which indicates the code to "protect" init from bad signals hasn't been tested in a while.

Once I got this working and sent out, I didn't like the patch. The worst part is that it touches generic code for an ugly x86-specific issue. So I decided to tackle the other end: what if I intercepted the kill() call, and if it's init sending a SIGABRT to itself, re-exec instead? I did this by putting "sys_check_init_abort_kill()" in the place of the generic "sys_kill" in the system call table: the longer version simply checks for our special case and tries to re-exec. It's still ugly, but it's clear.


[/tech] permanent link