Fri, 11 Aug 2006

GCC versions on x86, and the cost of clobbers

For the paravirt_ops patches, I've been doing some mildly tricky GCC things to binary patch over the indirect calls at runtime.

On x86, to replace, say, 'cli' (which disables interrupts) with a call through a function pointer (at offset PARAVIRT_irq_disable) in the paravirt_ops struct, you need to do:

	call *paravirt_ops+PARAVIRT_irq_disable
However, a call on x86 can overwrite the %eax, %ecx and %edx registers, so to be safe for an arbitrary function, you need to do:
	push %eax
	push %ecx
	push %edx
	call *paravirt_ops+PARAVIRT_irq_disable
	pop %edx
	pop %ecx
	pop %eax
Each of these pushes and pops is a 1 byte instruction on x86.

There is, however, a way to tell gcc that some assembler is going to clobber registers, so if it's clever, it can avoid the pushes and pops. I wondered how much more efficient this would be: at worst, gcc will always have to push and pop anyway; at best, it would never have to (unlikely as that is on register-starved x86).

In my simple test, I use this kind of call for four common kernel (inline) functions: raw_local_irq_disable() and raw_local_irq_enable(), which have to save three registers, and raw_local_irq_restore() and __raw_local_save_flags(), which use %eax and so only have to save two. Counting up the calls in my configuration gives 132, 66, 97 and 113 call sites respectively, for (132 + 66) × 3 + (97 + 113) × 2 = 1014 saved registers. Done with push/pops, at one byte each for the push and the pop, we'd expect to see 2028 bytes of bloat. (I added -fno-align-functions to the top-level Makefile so function alignment wouldn't play a part, but jump and loop alignment still do.)

	Test cases                            GCC 3.3   GCC 3.4   GCC 4.0   GCC 4.1
	Code size (no push/pop, no clobber)   2225653   2209176   2198744   2183453
	Push/pop code size addition (bytes)      1553      1631      3129      1584

To discover how effective the various gccs are at avoiding register spills (and, indirectly, get an indication of how good gcc's x86 code generation is), I produced three kernels: one which did no saves or restores of registers at all (the baseline, ideal case), one which did all the pushes and pops by hand (worst case), and one which used clobbers. The better gcc's code generation, the closer we'd expect the clobber case to be to the ideal case. I used "size vmlinux" to measure code size. We can also use the actual code increase from push/pop (versus that theoretical 2028 bytes) to take other noise effects into account: this normalized result probably gives a better indication of the differential effect of clobbers vs push/pops.

	Test cases                            GCC 3.3   GCC 3.4   GCC 4.0   GCC 4.1
	Clobber code extra (bytes)                625       512       955       577
	Cost (bytes) per clobber                 0.61      0.50      0.94      0.56
	Normalized cost (bytes) per clobber      0.40      0.31      0.31      0.36
