Rusty's Bleeding Edge Page

I've been meaning for a while to write up what's happening with cpumasks in the kernel. Several people have asked, and it's not obvious so it's worth explaining in detail. Thanks to Oleg Nesterov for the latest reminder.

The Problems

The two obvious problems are

Putting cpumasks on the stack limits us to NR_CPUS around 128, and
The whack-a-mole attempts to fix the worst offenders is a losing game.

For better or worse, people want NR_CPUS 4096 in stock kernels today, and that number is only going to go up.

Unfortunately, our merge-surge development model makes whack-a-mole the obvious thing to do, but the results (creeping in largely unnoticed) have been between awkward and horrible. Here's some samples across that spectrum:

cpumask_t instead of struct cpumask. I gather that this is a relic from when cpus were represented by an unsigned long, even though now it's always a struct.
cpu_set/cpu_clear etc. are magic non-C creatures which modify their arguments through macro magic:
```
#define cpu_set(cpu, dst) __cpu_set((cpu), &(dst))
```

cpumask_of_cpu(cpu) looked like this:

#define cpumask_of_cpu(cpu)
(*({
        typeof(_unused_cpumask_arg_) m;
        if (sizeof(m) == sizeof(unsigned long)) {
                m.bits[0] = 1UL<<(cpu);
        } else {
                cpus_clear(m);
                cpu_set((cpu), m);
        }
        &m;
}))

Ignoring that this code has a silly optimization and could be better written, it's illegal since we hand the address of a going-out-of-scope local var. This is the code which got me looking at this mess to start with.

New "_nr" iterators and operators have been introducted to only go to up to nr_cpu_ids bits instead of all the way to NR_CPUS, and used where it seems necessary. (nr_cpu_ids is the actual cap of possible cpu numbers, calculated at boot).

Several macros contain implicit declarations in them, eg:

#define CPUMASK_ALLOC(m)        struct m _m, *m = &_m
...
#define node_to_cpumask_ptr(v, node)                                    \
                cpumask_t _##v = node_to_cpumask(node);                 \
                const cpumask_t *v = &_##v

#define node_to_cpumask_ptr_next(v, node)                               \
                          _##v = node_to_cpumask(node)

But eternal vigilance is required to ensure that someone doesn't add another cpumask to the stack, somewhere. This isn't going to happen.

The Goals

No measurable overhead for small CONFIG_NR_CPUS.
As little time and memory overhead for large CONFIG_NR_CPUS kernels booted on machined with small nr_cpu_ids.
APIs and conventions that reasonable hackers can follow who don't care about giant machines, without screwing up those machines.

The Solution

These days we avoid Big Bang changes where possible. So we need to introduce a parallel cpumask API and convert everything across, then get rid of the old one.

The first step is to introduce replacemenst for the cpus_* functions. The new ones start with cpumask_; making names longer is always a little painful, but it's now consistent. The few operators which started with cpumask_ already were converted in one swoop (they were rarely used). These new functions take (const) struct cpumask pointers, and may only operate on some number of bits (CONFIG_NR_CPUS if it's small, otherwise nr_cpu_ids). This replacement is fairly simple, but you have to watch for cases like this:
```
	for_each_cpu_mask(i, my_cpumask)
		...
	if (i == NR_CPUS)
```
That final test should be "(i >= nr_cpu_ids)" to be safe now:
```
	for_each_cpu(i, &my_cpumask)
		...
	if (i >= nr_cpu_ids)
```
The next step is to deprecate cpumask_t and NR_CPUS (CONFIG_NR_CPUS is now defined even for !SMP). These are minor annoyances but more importantly in a job this large and slow they mark where code needs to be inspected for conversion.
cpumask_var_t is introduced to replace cpumask_t definitions (except in function parameters and returns, that's always (const) struct cpumask *). This is just struct cpumask[1] for most people, and alloc_cpumask_var/free_cpumask_var do nothing. Otherwise, it's a pointer to a cpumask when CONFIG_CPUMASK_OFFSTACK=y.
alloc_cpumask_var currently allocates all CONFIG_NR_CPUS bits, and zeros any bits between nr_cpu_ids and NR_CPUS. This is because there are still some uses of the old cpu operators which could otherwise foul things up. It will be changed to do the minimal allocation as the transfer progresses.
New functions are added to avoid common needs for temporary cpumasks. The two most useful are for_each_cpu_and() (to iterate over the intersection of two cpumasks) and cpu_any_but() (to exclude one cpu from consideration).
Another new function added for similar reasons was work_on_cpu(). There was a growing idiom of temporarily setting the current thread's cpus_allowed to execute on a specific CPU. This not only requires a temporary cpumask, it is potentiall buggy since it races with userspace setting the cpumask on that thread.
to_cpumask() is provided to convert raw bitmaps to struct cpumask *. In some few cases, cpumask_var_t cannot serve because we can't allocate early enough or there's no good place (or I was too scared to try), and I wanted to get rid of all 'struct cpumask' declarations as we'll see in a moment.
Architectures which have no intention of setting CONFIG_CPUMASK_OFFSTACK need do very little. We should convert them eventually, but there's no real benefit other than cleanup and consistency.
I took the opportunity to centralize the cpu_online_map etc definitions, because we're obsoleting them anyway.
cpu_online_mask/cpu_possible_mask etc (the pointer versions of the cpu_online_map/cpu_possible_map they replace) are const. This means that the few cases where we really want to manipulate them, the new set_cpu_online()/set_cpu_possible() or init_cpu_online()/init_cpu_possible() should be used.
We will change 'struct cpumask' to be undefined for CONFIG_CPUMASK_OFFSTACK=y. This can only be done once all cpumask_t (ie. struct cpumask) declarations are removed, including globals and from structures. Most of these are a good idea anyway, but some are gratuitous. But this will instantly catch any attempt to use struct cpumask on the stack, or to copy it (the latter is dangerous since cpumask_var_t will not allocate the entire mask).

Conclusion

At this point, we will have code that doesn't suck, rules which can be enforced by the compiler, and the possibility of setting CONFIG_NR_CPUS to 16384 as the SGI guys really want.

Importantly, we are forced to audit all kinds of code. As always, some few were buggy, but more were unnecessarily ugly. With less review happening these days before code goes in, it's important that we strive to leave code we touch neater than when we found it.

[/tech] permanent link

Rusty's Bleeding Edge Page

Wed, 07 Jan 2009

Fun with cpumasks

The Problems

The Goals

The Solution

Conclusion