[PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

Tue Nov 21 15:26:54 AEDT 2023

----- Original Message -----
> From: "Timothy Pearson" <tpearson at raptorengineeringinc.com>
> To: "Michael Ellerman" <mpe at ellerman.id.au>
> Cc: "Jens Axboe" <axboe at kernel.dk>, "regressions" <regressions at lists.linux.dev>, "npiggin" <npiggin at gmail.com>,
> "christophe leroy" <christophe.leroy at csgroup.eu>, "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> Sent: Monday, November 20, 2023 10:10:32 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save

> ----- Original Message -----
>> From: "Michael Ellerman" <mpe at ellerman.id.au>
>> To: "Timothy Pearson" <tpearson at raptorengineering.com>
>> Cc: "Jens Axboe" <axboe at kernel.dk>, "regressions" <regressions at lists.linux.dev>,
>> "npiggin" <npiggin at gmail.com>,
>> "christophe leroy" <christophe.leroy at csgroup.eu>, "linuxppc-dev"
>> <linuxppc-dev at lists.ozlabs.org>
>> Sent: Monday, November 20, 2023 5:39:52 PM
>> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec
>> register  save
> 
>> Timothy Pearson <tpearson at raptorengineering.com> writes:
>>> ----- Original Message -----
>>>> From: "Michael Ellerman" <mpe at ellerman.id.au>
>> ...
>>>> 
>>>> But we now have a new path, because io-uring can call copy_process() via
>>>> create_io_thread() from the signal handling path. That's OK if the signal is
>>>> handled as we return from a syscall, but it's not OK if the signal is handled
>>>> due to some other interrupt.
>>>> 
>>>> Which is:
>>>> 
>>>> interrupt_return_srr_user()
>>>>  interrupt_exit_user_prepare()
>>>>    interrupt_exit_user_prepare_main()
>>>>      do_notify_resume()
>>>>        get_signal()
>>>>          task_work_run()
>>>>            create_worker_cb()
>>>>              create_io_worker()
>>>>                copy_process()
>>>>                  dup_task_struct()
>>>>                    arch_dup_task_struct()
>>>>                      flush_all_to_thread()
>>>>                        save_all()
>>>>                          if (tsk->thread.regs->msr & MSR_FP)
>>>>                            save_fpu()
>>>>                            # fr0 is clobbered and potentially live in userspace
>>>> 
>>>> 
>>>> So tldr I think the corruption is only an issue since io-uring started doing
>>>> the clone via signal, which I think matches the observed timeline of this bug
>>>> appearing.
>>>
>>> I agree the corruption really only started showing up in earnest on
>>> io_uring clone-via-signal, as this was confirmed several times in the
>>> course of debugging.
>> 
>> Thanks.
>> 
>>> Note as well that I may very well have a wrong call order in the
>>> commit message, since I was relying on a couple of WARN_ON() macros I
>>> inserted to check for a similar (but not identical) condition and
>>> didn't spend much time getting new traces after identifying the root
>>> cause.
>> 
>> Yep no worries. I'll reword it to incorporate the full path from my mail.
>> 
>>> I went back and grabbed some real world system-wide stack traces, since I now
>>> know what to trigger on.  A typical example is:
>>>
>>> interrupt_return_srr_user()
>>>  interrupt_exit_user_prepare()
>>>   interrupt_exit_user_prepare_main()
>>>    schedule()
>>>     __schedule()
>>>      __switch_to()
>>>       giveup_all()
>>>        # tsk->thread.regs->msr MSR_FP is still set here
>>>        __giveup_fpu()
>>>         save_fpu()
>>>         # fr0 is clobbered and potentially live in userspace
>> 
>> fr0 is not live there.
> <snip>
>> ie. it clears the FP etc. bits from the task's MSR. That means the FP
>> state will be reloaded from the thread struct before the task is run again.
> 
> So a little more detail on this, just to put it to rest properly vs. assuming
> hand analysis caught every possible pathway. :)
> 
> The debugging that generates this stack trace also verifies the following in
> __giveup_fpu():
> 
> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to calling
> save_fpu()
> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling
> save_fpu()
> 3.) MSR_FP is set both in the task struct and in the live MSR.
> 
> Only if all three conditions are met will it generate the trace.  This is a
> generalization of the hack I used to find the problem in the first place.
> 
> If the state will subsequently be reloaded from the thread struct, that means
> we're reloading the registers from the thread struct that we just verified was
> corrupted by the earlier save_fpu() call.  There are only two ways I can see
> for that to be true -- one is if the registers were already clobbered when
> giveup_all() was entered, and the other is if save_fpu() went ahead and
> clobbered them right here inside giveup_all().
> 
> To see which scenario we were dealing with, I added a bit more instrumentation
> to dump the current register state if MSR_FP bit was already set in registers
> (i.e. not dumping data from task struct, but using the live FPU registers
> instead), and sure enough the registers are corrupt on entry, so something else
> has already called save_fpu() before we even hit giveup_all() in this call
> chain.
> 
> Unless I'm missing something, doesn't this effectively mean that anything
> interrupting a task can hit this bug?  Or, put another way, I'm seeing several
> processes hit this exact call chain with the corrupt register going back out to
> userspace without io_uring even in the mix, so there seems to be another
> pathway in play.  These traces are from a qemu guest, in case it matters given
> the kvm path is possibly susceptible.
> 
> Just a few things to think about.  The FPU patch itself definitely resolves the
> problems; I used a sledgehammer approach *specifically* so that there is no
> place for a rare call sequence we didn't consider to hit it again down the
> line. :)

For reference, a couple of traces that are verified to hit the conditions above when I leave the debugging unrestricted system-wide:

>From perl:

[  100.735133] NIP [c00000000001b0d8] __giveup_fpu+0xc8/0x280
[  100.735162] LR [c00000000001b084] __giveup_fpu+0x74/0x280
[  100.735190] Call Trace:
[  100.735205] [c000000008ac7710] [c000000008ac7830] 0xc000000008ac7830 (unreliable)
[  100.735251] [c000000008ac7a10] [c00000000001c094] giveup_all+0x84/0x120
[  100.735289] [c000000008ac7a40] [c00000000001cb08] __switch_to+0x128/0x2e0
[  100.735327] [c000000008ac7aa0] [c00000000101e0d0] __schedule+0x1020/0x11c0
[  100.735362] [c000000008ac7b90] [c00000000101e3f8] schedule+0x188/0x1f0
[  100.735397] [c000000008ac7c10] [c00000000063a834] pipe_read+0x3c4/0x5c0
[  100.735437] [c000000008ac7cf0] [c00000000062a9cc] vfs_read+0x18c/0x360
[  100.735505] [c000000008ac7dc0] [c00000000062b9e4] ksys_read+0xf4/0x150
[  100.735540] [c000000008ac7e10] [c00000000002fca4] system_call_exception+0x294/0x2e0
[  100.735581] [c000000008ac7e50] [c00000000000d0dc] system_call_vectored_common+0x15c/0x2ec

>From mariadbd:

[  129.374710] NIP [c00000000001b0d8] __giveup_fpu+0xc8/0x280
[  129.374743] LR [c00000000001b084] __giveup_fpu+0x74/0x280
[  129.374774] Call Trace:
[  129.374791] [c000000018dbf680] [0000000000000001] 0x1 (unreliable)
[  129.374833] [c000000018dbf980] [c00000000001c094] giveup_all+0x84/0x120
[  129.374873] [c000000018dbf9b0] [c00000000001cb08] __switch_to+0x128/0x2e0
[  129.374915] [c000000018dbfa10] [c00000000101e0d0] __schedule+0x1020/0x11c0
[  129.374958] [c000000018dbfb00] [c00000000101e3f8] schedule+0x188/0x1f0
[  129.374996] [c000000018dbfb80] [c0000000002ba240] futex_wait_queue+0x80/0xf0
[  129.375051] [c000000018dbfbc0] [c0000000002baf70] __futex_wait+0xc0/0x180
[  129.375102] [c000000018dbfca0] [c0000000002bb0c4] futex_wait+0x94/0x150
[  129.375161] [c000000018dbfd60] [c0000000002b5d5c] do_futex+0x11c/0x320
[  129.375214] [c000000018dbfd90] [c0000000002b6130] sys_futex+0x1d0/0x240
[  129.375261] [c000000018dbfe10] [c00000000002fca4] system_call_exception+0x294/0x2e0
[  129.375307] [c000000018dbfe50] [c00000000000d0dc] system_call_vectored_common+0x15c/0x2ec

Another from mariadbd (this one takes out the server in this particular run, but that's just because it "lost" the race with the io_uring worker spawn):

[  136.342361] NIP [c00000000001b0d8] __giveup_fpu+0xc8/0x280
[  136.342416] LR [c00000000001b084] __giveup_fpu+0x74/0x280
[  136.342467] Call Trace:
[  136.342491] [c000000018dbf8d0] [000000000000002b] 0x2b (unreliable)
[  136.342557] [c000000018dbfbd0] [c00000000001c094] giveup_all+0x84/0x120
[  136.342620] [c000000018dbfc00] [c00000000001cb08] __switch_to+0x128/0x2e0
[  136.342685] [c000000018dbfc60] [c00000000101e0d0] __schedule+0x1020/0x11c0
[  136.342752] [c000000018dbfd50] [c00000000101e3f8] schedule+0x188/0x1f0
[  136.342818] [c000000018dbfdd0] [c00000000002e92c] interrupt_exit_user_prepare_main+0x7c/0x2e0
[  136.342907] [c000000018dbfe20] [c00000000002ee18] interrupt_exit_user_prepare+0x88/0xa0
[  136.342983] [c000000018dbfe50] [c00000000000d954] interrupt_return_srr_user+0x8/0x12c