Bug 218090

Summary: Regression: Kernel 6.4 rc1 and higher causes Steam Deck to fail to wake from suspend (bisected)
Product: Drivers Reporter: gloriouseggroll (gloriouseggroll)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: NEW ---    
Severity: normal CC: bjorn.helgaas, bjorn, craig77uk, kai.heng.feng
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 6.4 Subsystem:
Regression: Yes Bisected commit-id: 1ad11eafc63ac16e667853bee4273879226d2d1b
Attachments: lspci log
dmesg log
debug patch
lspci before sleep
lspci after sleep
dmesg after sleep
debug patch 2
lspci.before
lspci.after
proposed patch

Description gloriouseggroll@gmail.com 2023-10-31 05:18:02 UTC
On Kernel 6.4 rc1 and higher if you put the Steam Deck into suspend then press the power button again it will not wake up. 

I don't have a clue as to -why- this commit breaks wake from suspend on steam deck, but it does. Bisected to:

```
1ad11eafc63ac16e667853bee4273879226d2d1b is the first bad commit
commit 1ad11eafc63ac16e667853bee4273879226d2d1b
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Tue Mar 7 14:32:43 2023 -0600

    nvme-pci: drop redundant pci_enable_pcie_error_reporting()
    
    pci_enable_pcie_error_reporting() enables the device to send ERR_*
    Messages.  Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is
    native"), the PCI core does this for all devices during enumeration, so the
    driver doesn't need to do it itself.
    
    Remove the redundant pci_enable_pcie_error_reporting() call from the
    driver.  Also remove the corresponding pci_disable_pcie_error_reporting()
    from the driver .remove() path.
    
    Note that this only controls ERR_* Messages from the device.  An ERR_*
    Message may cause the Root Port to generate an interrupt, depending on the
    AER Root Error Command register managed by the AER service driver.
    
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>

 drivers/nvme/host/pci.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
```
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.4.y&id=1ad11eafc63ac16e667853bee4273879226d2d1b

Reverting that commit by itself on top of 6.5.9 (stable) allows it to wake from suspend properly.
Comment 1 gloriouseggroll@gmail.com 2023-10-31 05:18:48 UTC
Should additionally note s2idle works fine, it's only the standard suspend mode that breaks.
Comment 2 Bjorn Helgaas 2023-11-01 11:37:41 UTC
Ouch, sorry about that.  Can you please attach the output of "sudo lspci -vvv" from 6.5.9, and again from 6.5.9 with 1ad11eafc63a reverted?  Also the complete 6.5.9 dmesg log (before suspend, since you can't get it after), and with the revert (after suspend and resume).
Comment 3 gloriouseggroll@gmail.com 2023-12-06 06:41:24 UTC
Sorry for the late reply, currently on holiday vacation -- however I believe this can be closed -- I am no longer experiencing the issue on 6.6.3
Comment 4 silverspring 2024-03-11 18:20:27 UTC
This still appears to be an issue with kernel 6.4 and above. Steam Deck crashes on sleep and requires a hard boot
Comment 5 Bjorn Helgaas 2024-03-11 18:43:57 UTC
(In reply to silverspring from comment #4)

Hmmm.  That's a problem.

  - Can you reproduce with v6.8?
  - Please attach output of "sudo lspci -vv"
  - Please attach complete dmesg log (before sleep)
  - Please try booting with "pci=noaer" to see whether it makes a difference
Comment 6 silverspring 2024-03-12 19:07:13 UTC
Created attachment 305982 [details]
lspci log
Comment 7 silverspring 2024-03-12 19:08:26 UTC
Created attachment 305983 [details]
dmesg log
Comment 8 silverspring 2024-03-12 19:09:51 UTC
Files attached. Same issue with v6.8 however pci=noaer on all from v6.4 allows sleep to function correctly.
Comment 9 Bjorn Helgaas 2024-03-29 23:24:07 UTC
Created attachment 306059 [details]
debug patch

My theory is that after 1ad11eafc63a, we leave PCIe error reporting enabled, when we previously disabled it.  Errors may occur during suspend, and maybe that confuses the BIOS.

Please try this patch if you can.  It's based on v6.9-rc1.  If you do, please attach the output of "sudo lspci -vv" before and after suspend/resume (if resume works).  If resume works, please also attach the dmesg log afterwards.
Comment 10 silverspring 2024-04-06 16:22:41 UTC
Created attachment 306096 [details]
lspci before sleep
Comment 11 silverspring 2024-04-06 16:23:19 UTC
Created attachment 306097 [details]
lspci after sleep
Comment 12 silverspring 2024-04-06 16:24:25 UTC
Created attachment 306098 [details]
dmesg after sleep
Comment 13 silverspring 2024-04-06 16:30:10 UTC
Logs attached. I can confirm that the patch allows sleep without crash with pci=noaer removed from the boot parameters
Comment 14 Bjorn Helgaas 2024-04-18 21:07:35 UTC
Created attachment 306180 [details]
debug patch 2

Thank you very much for testing the previous patch.  It worked, so we must be on the right track, but if we need to disable AER during suspend, that should be done by the PCI core, not by every individual driver.

This patch, again based on v6.9-rc1, is closer to the approach I think we should take.  This is Kai-Heng's work, which he has been patiently trying to upstream for years.

I'd be very grateful if anybody could test this and attach the output of "sudo lspci -vv" before and after suspend/resume (if resume works).
Comment 15 silverspring 2024-05-01 12:23:51 UTC
Created attachment 306253 [details]
lspci.before
Comment 16 silverspring 2024-05-01 12:24:52 UTC
Created attachment 306254 [details]
lspci.after
Comment 17 silverspring 2024-05-01 12:28:03 UTC
lspci before and after attached.
Sleep works fine with the debug patch 2 applied
Comment 18 Bjorn Helgaas 2024-06-18 21:35:55 UTC
Created attachment 306475 [details]
proposed patch

Proposed patch for this issue, based on v6.10-rc1.  Would love to hear any testing results.