[PATCH] powerpc/eeh: Permanently disable the removed device

Ganesh G R ganeshgr at linux.ibm.com
Mon Apr 15 16:49:32 AEST 2024


On 4/9/24 14:37, Michael Ellerman wrote:

> Hi Ganesh,
>
> Ganesh Goudar <ganeshgr at linux.ibm.com> writes:
>> When a device is hot removed on powernv, the hotplug
>> driver clears the device's state. However, on pseries,
>> if a device is removed by phyp after reaching the error
>> threshold, the kernel remains unaware, leading to the
>> device not being torn down. This prevents necessary
>> remediation actions like failover.
>>
>> Permanently disable the device if the presence check
>> fails.
> You can wrap your changelogs a bit wider, 70 or 80 columns is fine.

ok

>> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>> index ab316e155ea9..8d1606406d3f 100644
>> --- a/arch/powerpc/kernel/eeh.c
>> +++ b/arch/powerpc/kernel/eeh.c
>> @@ -508,7 +508,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>>   	 * state, PE is in good state.
>>   	 */
>>   	if ((ret < 0) ||
>> -	    (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
>> +	    (ret == EEH_STATE_NOT_SUPPORT &&
>> +	     dev->error_state == pci_channel_io_perm_failure) ||
>> +	    eeh_state_active(ret)) {
>>   		eeh_stats.false_positives++;
>>   		pe->false_positives++;
>>   		rc = 0;
> How does this hunk relate the changelog?
>
> This is adding an extra condition to the false positive check, so
> there's a risk this causes devices to go into failure when previously
> they didn't, right? So please explain why it's a good change. The
> comment above the if needs updating too.

We need this change to log the event and get the device removed, I will explain this
in commit message.

>> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
>> index 48773d2d9be3..10317badf471 100644
>> --- a/arch/powerpc/kernel/eeh_driver.c
>> +++ b/arch/powerpc/kernel/eeh_driver.c
>> @@ -867,7 +867,13 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>>   	if (!devices) {
>>   		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
>>   			pe->phb->global_number, pe->addr);
>> -		goto out; /* nothing to recover */
> The other cases that go to recover_failed usually print something at
> warn level, so this probably should too. So either make the above a
> pr_warn(), or change it to a warn with a more helpful message.

ok

>> +		/*
>> +		 * The device is removed, Tear down its state,
>> +		 * On powernv hotplug driver would take care of
>> +		 * it but not on pseries, Permanently disable the
>> +		 * card as it is hot removed.
>> +		 */
> Formatting and punctuation is weird. It can be wider, and capital letter
> is only required after a full stop, not a comma.

ok, i will take care of it.

> Also you say that the powernv hotplug driver "would" take care of it,
> that's past tense, is that what you mean? Does the powernv hotplug
> driver still take care of it after this change? And (how) does that
> driver cope with it happening here also?

Yes, hotplug driver can still remove the device and the removal of
device is covered by pci rescan lock.

>> +		goto recover_failed;
>>   	}
>>   	
> cheers


More information about the Linuxppc-dev mailing list