Free Software programmer
This blog existed before my current employment, and obviously reflects my own opinions and not theirs.
This work is licensed under a Creative Commons Attribution 2.1 Australia License.
Fri, 11 Aug 2006
For the paravirt_ops patches, I've been doing some mildly tricky GCC things to binary patch over the indirect calls at runtime.
On x86, to replace, say, 'cli' (which disables interrupts) with a call through a function pointer (at offset PARAVIRT_irq_disable in the paravirt_ops struct), you need to do:
    call *paravirt_ops+PARAVIRT_irq_disable

However, calls on x86 can overwrite the eax, ecx and edx registers, so to be safe for any function, you need to do:
    push %eax
    push %ecx
    push %edx
    call *paravirt_ops+PARAVIRT_irq_disable
    pop %edx
    pop %ecx
    pop %eax

Each of these pushes and pops is a 1-byte instruction on x86.
There is a way, however, to tell gcc that some inline assembler is going to clobber registers, so if it's clever, it can avoid having to push and pop. I wondered how much more efficient this would be: at worst, gcc would always have to push and pop; at best, it would never have to (unlikely on register-starved x86).
In my simple test, I use this kind of call for four common kernel (inline) functions: raw_local_irq_disable() and raw_local_irq_enable(), which have to save three registers, and raw_local_irq_restore() and __raw_local_save_flags(), which use %eax and so only have to save two. Counting up the calls in my configuration gives 132, 66, 97 and 113 respectively, for a total of 1014 saved registers (3 × 132 + 3 × 66 + 2 × 97 + 2 × 113). When saved with push/pop pairs, at 2 bytes per register, we'd expect to see 2028 bytes of bloat (I added -fno-align-functions to the top-level Makefile so function alignment wouldn't play a part, but jump and loop alignment still do).
To discover how effective various gcc versions are at avoiding register spills (and, indirectly, to get an indication of how good gcc's x86 code generation is), I produced three kernels: one which did no saves or restores of registers at all (the baseline, ideal case), one which did all the pushes and pops manually (the worst case), and one which used clobbers. The better gcc's code generation is, the closer we'd expect the clobber case to be to the ideal case. I used "size vmlinux" to measure the code size. Instead of the theoretical 2028 bytes, we can use the actual code increase of the push/pop kernel to take other noise effects into account: this normalized result probably gives a better indication of the differential effect of clobbers vs. push/pops.