Bug 216295

Summary: Spurious wakeup from s2idle caused by AER
Product: Drivers
Component: PCI
Status: NEW
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: mainline, linux-next
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Kai-Heng Feng (kai.heng.feng)
Assignee: drivers_pci (drivers_pci)
CC: bjorn, mario.limonciello, wse
Attachments:
  dmesg
  dmesg with debug message
  lspci
  dmesg
  lspci
  dmesg with target power state reordered
  proposed patch

Description Kai-Heng Feng 2022-07-26 23:36:04 UTC
Created attachment 301497 [details]
dmesg

[  248.265121] PM: suspend-to-idle
[  248.265280] ACPI: EC: ACPI EC GPE status set
[  248.265303] ACPI: PM: Wakeup after ACPI Notify sync
[  248.265305] PM: resume from suspend-to-idle
[  248.269770] ACPI: EC: interrupt unblocked
Comment 1 Kai-Heng Feng 2022-07-26 23:43:31 UTC
Created attachment 301498 [details]
dmesg with debug message

Test setup: pm_async = 0, pm_debug_messages = 1, dynamic debug enabled for PCI, IRQ numbers printed in pm_system_irq_wakeup(), and pci_aer_clear_status() removed from pci_restore_state().

Spurious IRQ when root port 01.0 is set to D3cold:
[  105.756581] pcieport 0000:00:01.0: PME# enabled
[  106.233587] PM: DEBUG: pm_system_irq_wakeup 122 0
[  106.324125] pcieport 0000:00:01.0: power state changed by ACPI to D3cold
[  106.324135] pcieport 0000:00:01.0: PCI PM: Suspend power state: D3cold

An ACPI SCI event shouldn't wake the system up, but because IRQ 122 had already fired, a spurious wakeup occurred:
[  106.327529] PM: DEBUG: pm_system_irq_wakeup 122 9
[  106.329297] PM: suspend-to-idle
[  106.329456] ACPI: EC: ACPI EC GPE status set
[  106.329475] ACPI: PM: Wakeup after ACPI Notify sync
[  106.329476] PM: resume from suspend-to-idle
[  106.330920] ACPI: EC: interrupt unblocked

The error printed by the AER service's ISR:
[  106.808712] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[  106.808727] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  106.808731] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00000001/00002000
[  106.808737] pcieport 0000:00:01.0:    [ 0] RxErr
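
The `error status/mask=00000001/00002000` line above can be decoded by hand. The following is a minimal sketch, assuming the short bit names that the Linux AER driver prints for the Correctable Error Status register (RxErr, BadTLP, and so on). The helper names `decode_cor_status` and `parse_status_mask` are hypothetical and not part of any kernel tooling.

```python
# Bit positions of the PCIe AER Correctable Error Status register,
# labelled with the short names the Linux AER driver prints.
COR_ERR_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_cor_status(value):
    """Return the names of the known bits set in a status or mask value."""
    return [name for bit, name in sorted(COR_ERR_BITS.items()) if value & (1 << bit)]


def parse_status_mask(line):
    """Extract (status, mask) as integers from an 'error status/mask=' line."""
    field = line.split("error status/mask=")[1].split()[0]
    status, mask = field.split("/")
    return int(status, 16), int(mask, 16)
```

Applied to the log line above, the status decodes to RxErr and the mask to NonFatalErr, so the reported Receiver Error is not one of the masked errors.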
Comment 2 Kai-Heng Feng 2022-07-26 23:44:07 UTC
Created attachment 301499 [details]
lspci
Comment 3 Kai-Heng Feng 2023-08-09 05:23:33 UTC
Created attachment 304799 [details]
dmesg

In addition to D3cold, the issue can also be observed in the D3hot case.
Comment 4 Kai-Heng Feng 2023-08-09 05:24:26 UTC
Created attachment 304800 [details]
lspci
Comment 5 Kai-Heng Feng 2023-08-09 05:26:16 UTC
$ cat /sys/power/pm_wakeup_irq 
122

[    0.838030] pcieport 0000:00:1c.0: PME: Signaling with IRQ 122
[    0.838127] pcieport 0000:00:1c.0: AER: enabled with IRQ 122
...
[  697.102377] PM: Triggering wakeup from IRQ 122
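
The lines above show PME and AER on the same root port sharing IRQ 122, the number reported by /sys/power/pm_wakeup_irq. The following is a rough sketch for pulling that mapping out of a dmesg capture, assuming only the two message formats quoted above. The function name `services_by_irq` is hypothetical.

```python
import re
from collections import defaultdict

# Matches the two message formats quoted above, e.g.
#   pcieport 0000:00:1c.0: PME: Signaling with IRQ 122
#   pcieport 0000:00:1c.0: AER: enabled with IRQ 122
IRQ_LINE = re.compile(
    r"pcieport (?P<dev>\S+): (?P<svc>PME|AER): "
    r"(?:Signaling|enabled) with IRQ (?P<irq>\d+)"
)


def services_by_irq(dmesg_text):
    """Map IRQ number -> sorted list of (device, service) pairs using it."""
    table = defaultdict(set)
    for match in IRQ_LINE.finditer(dmesg_text):
        table[int(match["irq"])].add((match["dev"], match["svc"]))
    return {irq: sorted(users) for irq, users in table.items()}
```

Looking up the wakeup IRQ in the returned dictionary lists every port service that could have raised it, which is how the AER/PME sharing on 0000:00:1c.0 shows up here.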
Comment 6 Mario Limonciello (AMD) 2023-08-09 22:39:59 UTC
Can you double check the PEP constraints for 0000:00:1c.0?  You can turn on dynamic debugging for drivers/acpi/x86/s2idle.c on kernel command line and they'll be printed in your dmesg.

It's unlikely; but if any of them are /not/ aiming for D3hot/D3cold at suspend then my patch series for a similar issue of wrong states at suspend https://lore.kernel.org/linux-pci/20230809185453.40916-1-mario.limonciello@amd.com/T/#t may help.
Comment 7 Mario Limonciello (AMD) 2023-08-09 22:47:32 UTC
And yes I saw that some of that is enabled in your most recent dmesg, but the constraints enumeration happens at startup, so it needs to be on your kernel command line.

https://github.com/torvalds/linux/blob/v6.5-rc5/drivers/acpi/x86/s2idle.c#L270
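
A minimal sketch for checking whether a kernel command line carries a dynamic debug query that covers s2idle.c. It assumes the `dyndbg="file <path> +p"` boot parameter form described in the kernel's dynamic debug documentation. The helper name `has_s2idle_dyndbg` is hypothetical.

```python
import re
import shlex


def has_s2idle_dyndbg(cmdline):
    """True if a dyndbg= parameter names s2idle.c and enables printing (+p)."""
    for token in shlex.split(cmdline):
        key, _, query = token.partition("=")
        if key == "dyndbg" and "s2idle.c" in query and re.search(r"\+\w*p", query):
            return True
    return False
```

On a running system the input would typically be the contents of /proc/cmdline. A True result only shows that the parameter was passed, not that the kernel was built with dynamic debug support.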
Comment 8 Kai-Heng Feng 2023-08-11 03:00:33 UTC
(In reply to Mario Limonciello (AMD) from comment #6)
> Can you double check the PEP constraints for 0000:00:1c.0?  You can turn on
> dynamic debugging for drivers/acpi/x86/s2idle.c on kernel command line and
> they'll be printed in your dmesg.
> 
> It's unlikely; but if any of them are /not/ aiming for D3hot/D3cold at
> suspend then my patch series for a similar issue of wrong states at suspend
> https://lore.kernel.org/linux-pci/20230809185453.40916-1-mario.limonciello@amd.com/T/#t
> may help.

This series doesn't help.
Comment 9 Mario Limonciello (AMD) 2023-08-11 03:01:23 UTC
Thanks for confirming.  It was a long shot for your issue.
Comment 10 Kai-Heng Feng 2023-08-11 03:16:40 UTC
(In reply to Mario Limonciello (AMD) from comment #7)
> And yes I saw that some of that is enabled in your most recent dmesg, but
> the constraints enumeration happens at startup, so it needs to be on your
> kernel command line.
> 
> https://github.com/torvalds/linux/blob/v6.5-rc5/drivers/acpi/x86/s2idle.c#L270

[    0.760646] ACPI: \_SB_.PEPD: index:2 Name:\_SB.PR02
[    0.760647] ACPI: \_SB_.PEPD: uid:255 min_dstate:D0
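
With many constraints in the log, picking out the ones that do not target D3 by eye gets tedious. The following is a small sketch, assuming each `Name:` line is followed by its `min_dstate:` line as in the two lines above, and that D3 states are printed with a `D3` prefix. The function names are hypothetical.

```python
import re

NAME_RE = re.compile(r"index:(\d+) Name:(\S+)")
DSTATE_RE = re.compile(r"uid:(\d+) min_dstate:(\S+)")


def parse_pep_constraints(dmesg_text):
    """Return {device name: min_dstate}, pairing each Name line with the
    min_dstate line that follows it."""
    constraints = {}
    current = None
    for line in dmesg_text.splitlines():
        name = NAME_RE.search(line)
        if name:
            current = name.group(2)
            continue
        dstate = DSTATE_RE.search(line)
        if dstate and current is not None:
            constraints[current] = dstate.group(2)
            current = None
    return constraints


def not_targeting_d3(constraints):
    """List devices whose constraint is shallower than D3hot/D3cold."""
    return sorted(n for n, d in constraints.items() if not d.startswith("D3"))
```

For the two lines above this yields a single entry, \_SB.PR02 with min_dstate D0, which is also reported as not targeting D3.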
Comment 11 Mario Limonciello (AMD) 2023-08-11 03:19:23 UTC
Interesting... Is that device considered 'ACPI power manageable' by the kernel?  

If it is then re-ordering the last patch in the series to prefer constraints as first choice might actually change things as it would prevent it from going into D3 (the constraints don't say it needs to).
Comment 12 Kai-Heng Feng 2023-08-11 07:49:15 UTC
(In reply to Mario Limonciello (AMD) from comment #11)
> Interesting... Is that device considered 'ACPI power manageable' by the
> kernel?  

Yes, the root port has _PS0 and _PS3 methods so it's considered power manageable.

> 
> If it is then re-ordering the last patch in the series to prefer constraints
> as first choice might actually change things as it would prevent it from
> going into D3 (the constraints don't say it needs to).

Reordering keeps the root port at D0. However, the same issue can still be observed.
Comment 13 Kai-Heng Feng 2023-08-11 07:49:45 UTC
Created attachment 304815 [details]
dmesg with target power state reordered
Comment 14 Mario Limonciello (AMD) 2023-08-11 12:11:58 UTC
Comment on attachment 304815 [details]
dmesg with target power state reordered

Thanks, so this isn't the solution for your issue then.  I'm curious though; with it re-ordered and your AER patch in place, do you get to the deepest state?  It would keep several of your root ports at D0, and if that still works for you I might change the series as well.

> [    1.148467] pcieport 0000:00:1c.0: AER: Corrected error received:
> 0000:01:00.0

Looking at the log, I notice that you have AER happening even at bootup.  Is something wrong with the card reader or card reader driver perhaps?
Comment 15 Mario Limonciello (AMD) 2023-08-11 20:23:38 UTC
> It would keep several of your root ports at D0, and if that still works for
> you I might change the series as well.

But FWIW if it does work for you, it at least needs some more consideration for my systems.  I've found that moving it earlier leads to some devices that should be in D3cold over s2idle being put into D3hot which causes major problems.
Comment 16 Werner Sembach [TUXEDO] 2023-09-05 16:38:49 UTC
*** Bug 217082 has been marked as a duplicate of this bug. ***
Comment 17 Werner Sembach [TUXEDO] 2024-01-03 11:19:50 UTC
Coming from https://bugzilla.kernel.org/show_bug.cgi?id=217082

Let me know if I can help debug this.
Comment 18 Kai-Heng Feng 2024-01-04 04:39:10 UTC
(In reply to Werner Sembach [TUXEDO] from comment #17)
> Coming from https://bugzilla.kernel.org/show_bug.cgi?id=217082
> 
> Let me know if I can help debug this.

Is your case caused by NVIDIA GFX?
Comment 19 Werner Sembach [TUXEDO] 2024-01-04 09:32:57 UTC
It is caused by "PEG1", but I don't know what that device actually is.
Comment 20 Bjorn Helgaas 2024-06-18 21:35:06 UTC
Created attachment 306474 [details]
proposed patch

Proposed patch for this issue, based on v6.10-rc1.  Would love to hear any testing results.
Comment 21 Kai-Heng Feng 2024-06-19 06:06:52 UTC
Confirmed the patch solves my issue.