Thu, 21 Sep 2006

Accessing stack-relative variables vs. using segment registers

So Jeremy Fitzhardinge recently produces a series of patches to use the "gs" segement register within the Linux kernel for per-cpu data. I generalized these patches to combine them with the existing per-cpu infrastructure, but I wondered about whether we should use this method for accessing the current task and current CPU. In the case of CPU number, the core code expects it attached to the thread info anyway, as it accesses it directly to see where a thread last ran, so we'd end up with it in both places.

To explain how things currently work in Linux: each process has a kernel stack. On x86, this is 8k, aligned on an 8k boundary (there's a 4k stack config option, but the idea is the same). At the top of this stack (ie lowest address in this 8k region) is the thread structure, and in this we keep a pointer to the current task, and the current CPU number. This means that we can access either one quite quickly by rounding down the stack pointer to the nearest 8k boundary and grabbing the entry from the structure. In assembler, this is 3 instructions, eg:

mov    %esp,%eax		# Get stack pointer
and    $0xfffff000,%eax		# Round down (4k in this example)
mov    (%eax),%eax		# Grab value (first element in structure)

This is 9 bytes long. If we use a per-cpu variable and the gs register, it's a single 6-byte instruction:

mov    %gs:0x0,%eax		# Get first value in per-cpu section

This is 6 bytes long, but segment-relative instructions are a little more complicated to decode than the simpler instructions above. So, I wrote two test programs with each of these sequences repeated a million times (to remove icache effects), and timed running that million-sequence 10,000 times, for 10 billion total repeats. We'd expect the gs version to be 33% faster than the stack version, because it's 33% shorter.

Machine 1.7GHz Pentium M (Laptop) 3GHz Pentium 4 3GHz Xeon (5160)
Stack-based value read (nsecs) 10.39 4.06 8.46
gs-based value read (nsec) 6.93 2.70 4.40
Improvement 33% 33% 48%

This kind of single huge instruction stream is great for the Pentium 4, which is why it does so well (OTOH, flushing the pipeline on P4 is painfully slow). Nonetheless, it looks like using gs is a gain. Of course, this effect measured here is likely to be overwhelmed by the benefits of reduced instruction cache footprint.

Stephen Rothwell points out that the former (stack-based) instructions require an arithmetic unit which the gs-based instruction does not, which may allow for better instruction scheduling in real code, too, amplifying the win again.


[/tech] permanent link