Bug 215742 - NVMe storage quirked as SIMPLE SUSPEND makes the system fail to resume from suspend (Regression)
Summary: NVMe storage quirked as SIMPLE SUSPEND makes the system fail to resume from suspend (Regression)
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Linux ACPI Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-03-25 06:41 UTC by Jian-Hong Pan
Modified: 2024-02-23 11:49 UTC
CC List: 6 users

See Also:
Kernel Version: 5.15+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Journal log (212.54 KB, text/plain) - 2022-03-25 06:41 UTC, Jian-Hong Pan
Dump ACPI (2.42 MB, text/plain) - 2022-03-25 07:01 UTC, Jian-Hong Pan
Patch to set default to S3 (1.58 KB, application/mbox) - 2022-04-08 04:20 UTC, Mario Limonciello (AMD)
1.pci.txt (18.95 KB, text/plain) - 2024-02-08 08:09 UTC, Daniel Drake
2.pci.txt (18.88 KB, text/plain) - 2024-02-08 08:09 UTC, Daniel Drake
3.pci.txt (18.88 KB, text/plain) - 2024-02-08 08:10 UTC, Daniel Drake
4.pci.txt (18.95 KB, text/plain) - 2024-02-08 08:10 UTC, Daniel Drake
5.pci.txt (18.95 KB, text/plain) - 2024-02-08 08:10 UTC, Daniel Drake
Full system lspci -vvxxxx (124.87 KB, text/plain) - 2024-02-08 08:32 UTC, Daniel Drake

Description Jian-Hong Pan 2022-03-25 06:41:53 UTC
Created attachment 300614 [details]
Journal log

We have an ASUS B1400CEAE laptop equipped with an Intel i5-1135G7 and a Sandisk Corp WD Blue SN550 NVMe SSD.  The system resumes from suspend correctly with kernel 5.10.  However, it fails to resume from suspend and hangs with kernel 5.15 and the current 5.17, so this is a regression.

I enabled the persistent journal and read the log again and again.  I noticed there is nothing logged after the system suspends; the kernel messages produced after resume were probably never written out because writes to the NVMe failed.

Then I noticed the message "kernel: nvme 0000:01:00.0: platform quirk: setting simple suspend" and wondered why the device is quirked as simple suspend.  I traced the code and disabled the quirk, like this:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9f4f3884fefe..017a87ae999f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3096,7 +3096,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
                 */
                dev_info(&pdev->dev,
                         "platform quirk: setting simple suspend\n");
-               quirks |= NVME_QUIRK_SIMPLE_SUSPEND;
+               //quirks |= NVME_QUIRK_SIMPLE_SUSPEND;
        }
 
        /*

The system resumes successfully!  So the NVMe handling is what triggers the bug!
Comment 1 Jian-Hong Pan 2022-03-25 06:45:48 UTC
Here is the information of the NVMe:

01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009]
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0, IOMMU group 14
	Memory at 82200000 (64-bit, non-prefetchable) [size=16K]
	Memory at 82204000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
	Capabilities: [c0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8] Latency Tolerance Reporting
	Capabilities: [300] Secondary PCI Express
	Capabilities: [900] L1 PM Substates
	Kernel driver in use: nvme
Comment 2 Jian-Hong Pan 2022-03-25 07:01:26 UTC
Created attachment 300615 [details]
Dump ACPI

To understand why it needs the simple suspend quirk, I traced acpi_storage_d3() [1], which is checked in the if condition.

I noticed it runs through the whole of acpi_storage_d3(), which means the ACPI companion node has StorageD3Enable set to 1.

To confirm that, I dumped the ACPI tables and found it:

Scope (_SB.PC00)
{
    Device (SAT0)
    {
        Name (_ADR, 0x00170000)  // _ADR: Address
        Name (_DSD, Package (0x02)  // _DSD: Device-Specific Data
        {
            ToUUID ("5025030f-842f-4ab4-a561-99a5189762d0") /* Unknown UUID */, 
            Package (0x01)
            {
                Package (0x02)
                {
                    "StorageD3Enable", 
                    One
                }
            }
        })

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/acpi/device_pm.c?h=v5.17#n1381
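
For reference, the relevant logic of acpi_storage_d3() in v5.17 is roughly the following (paraphrased from the source linked above; error handling trimmed):

bool acpi_storage_d3(struct device *dev)
{
        struct acpi_device *adev = ACPI_COMPANION(dev);
        u8 val;

        if (force_storage_d3())
                return true;

        if (!adev)
                return false;

        /* Read the "StorageD3Enable" _DSD property from the ACPI companion */
        if (fwnode_property_read_u8(acpi_fwnode_handle(adev), "StorageD3Enable",
                        &val))
                return false;

        return val == 1;
}

In this trace it returns true, so nvme_probe() goes on to set NVME_QUIRK_SIMPLE_SUSPEND.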
Comment 3 Jian-Hong Pan 2022-03-25 07:10:25 UTC
I found three related commits:

* df4f9bc4fb9c ("nvme-pci: add support for ACPI StorageD3Enable property")
* e21e0243e7b0 ("nvme-pci: look for StorageD3Enable on companion ACPI device instead")
* 2744d7a07335 ("ACPI: Check StorageD3Enable _DSD property in ACPI code")

The original check for the ACPI nodes PXSX and PEGP was removed in e21e0243e7b0 ("nvme-pci: look for StorageD3Enable on companion ACPI device instead") for AMD platforms.
I do not see an ACPI node PXSX, nor PEGP, in the ACPI tables in comment #2.

So I added the check back for testing:

diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c
index cc6c97e7dcae..d5d93d3f01f7 100644
--- a/drivers/acpi/device_pm.c
+++ b/drivers/acpi/device_pm.c
@@ -1381,6 +1381,8 @@ EXPORT_SYMBOL_GPL(acpi_dev_pm_attach);
 bool acpi_storage_d3(struct device *dev)
 {
        struct acpi_device *adev = ACPI_COMPANION(dev);
+       acpi_handle handle;
+       acpi_status status;
        u8 val;
 
        if (force_storage_d3())
@@ -1388,6 +1390,14 @@ bool acpi_storage_d3(struct device *dev)
 
        if (!adev)
                return false;
+
+       status = acpi_get_handle(adev->handle, "PXSX", &handle);
+       if (ACPI_FAILURE(status)) {
+               status = acpi_get_handle(adev->handle, "PEGP", &handle);
+               if (ACPI_FAILURE(status))
+                       return false;
+       }
+
        if (fwnode_property_read_u8(acpi_fwnode_handle(adev), "StorageD3Enable",
                        &val))
                return false;

The system resumes successfully!  I think Intel platforms still need the node check that came with df4f9bc4fb9c ("nvme-pci: add support for ACPI StorageD3Enable property").
Comment 4 Jian-Hong Pan 2022-03-25 07:28:12 UTC
With the test patch in comment #3, the NVMe is no longer quirked with NVME_QUIRK_SIMPLE_SUSPEND, and the system can then resume from suspend correctly.

However, does this mean that the NVMe on this laptop does not support simple suspend?
Comment 5 Jian-Hong Pan 2022-03-25 08:15:16 UTC
If I set the PM test modes [1]:

* devices mode: System resumes correctly

# echo devices > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo mem > /sys/power/state

* platform mode: System fails to resume

# echo platform > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo mem > /sys/power/state

[1] https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt
Comment 6 Mario Limonciello (AMD) 2022-04-01 15:05:08 UTC
Is this production hardware?  Is this production firmware?
If so - can you please confirm that you have the latest firmware from the manufacturer first in case they made a mistake in the release you have.

----

Looking at your ACPI table I see that this system likely supports s2idle as it sets:
                      Low Power S0 Idle (V5) : 1

I have a theory here the disk is getting to the deepest state with that simple suspend set but other platform problems are causing the issue.

Can you please do the following:
1) Use your workaround/revert.
2) Confirm /sys/power/mem_sleep is "s2idle"
3) Run a suspend, and then observe the values of /sys/kernel/debug/pmc_core/slp_s0_residency_usec

If those are non-zero you got to the deepest state and my theory is wrong.
If they're 0 it confirms my theory.
Comment 7 Jian-Hong Pan 2022-04-06 08:18:22 UTC
(In reply to mario.limonciello from comment #6)
The firmware has been updated to the latest version, 307, from the official website.
 
> Looking at your ACPI table I see that this system likely supports s2idle as
> it sets:
>                       Low Power S0 Idle (V5) : 1
> 
> I have a theory here the disk is getting to the deepest state with that
> simple suspend set but other platform problems are causing the issue.
> 
> Can you please do the following:
> 1) Use your workaround/revert.
Sure!

> 2) Confirm /sys/power/mem_sleep is "s2idle"
It is "s2idle".

$ cat /sys/power/mem_sleep 
[s2idle] deep

> 3) Run a suspend, and then observe the values of
> /sys/kernel/debug/pmc_core/slp_s0_residency_usec
> 
> If those are non-zero you got to the deepest state and my theory is wrong.
> If they're 0 it confirms my theory.
After suspend & resume, it is 0

$ sudo cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec
0
Comment 8 Mario Limonciello (AMD) 2022-04-06 16:51:52 UTC
Thanks, so that at least provides evidence to support my hypothesis.  There are a few ways that we could approach this:
1. Use S3 instead of S2idle for this system
2. Fix the other (presumably) platform problems leading to this behavior.

As the system advertises S3 as well, that's a much easier thing to try.  I would suggest that you leave the code in place and just change mem_sleep to "deep" and try to suspend.  If that works properly, we should be able to quirk this system to prefer "deep" even though the ACPI tables advertise s2idle.

If that doesn't help, you can revert the quirk and test whether "deep" works with it reverted.
Comment 9 Jian-Hong Pan 2022-04-07 07:02:18 UTC
I checked S3 ("deep") with and without the workaround.

$ cat /sys/power/mem_sleep 
s2idle [deep]

The system can resume from suspend.

A friend of mine ordered the same laptop, but it is still being shipped.  I will check with him to make sure this is not an issue with a single unit.
Comment 10 Mario Limonciello (AMD) 2022-04-08 04:20:11 UTC
Created attachment 300716 [details]
Patch to set default to S3

Attached a patch that should do that programmatically by default.  If you can confirm this works, then after your colleague confirms it's not a hardware problem for you I can submit this up to review.
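
For context, a quirk like this is typically expressed as a DMI match in drivers/acpi/sleep.c that makes "deep" (S3) the default instead of suspend-to-idle.  A minimal sketch, assuming the init_default_s3()/acpi_sleep_default_s3 helpers used by similar upstream quirks; the match strings below are illustrative and not necessarily what the attached patch uses:

static int __init init_default_s3(const struct dmi_system_id *d)
{
        /* Prefer S3 ("deep") even though the FADT advertises Low Power S0 Idle */
        acpi_sleep_default_s3 = true;
        return 0;
}

static const struct dmi_system_id acpisleep_dmi_table[] __initconst = {
        /* ... existing entries ... */
        {
                .callback = init_default_s3,
                .ident = "ASUS B1400CEAE",
                .matches = {
                        DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."),
                        DMI_MATCH(DMI_PRODUCT_NAME, "ASUS EXPERTBOOK B1400CEAE"),
                },
        },
        {}
};

Users can still override the default at boot time with mem_sleep_default= or at runtime via /sys/power/mem_sleep.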
Comment 11 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-05-09 09:18:24 UTC
(In reply to Mario Limonciello (AMD) from comment #10)
> Created attachment 300716 [details]
> Patch to set default to S3
> 
> Attached a patch that should do that programmatically by default.  If you
> can confirm this works, then after your colleague confirms it's not a
> hardware problem for you I can submit this up to review.

Jian-Hong Pan: did you ever give this a try?
Comment 12 Jian-Hong Pan 2022-05-09 09:30:38 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #11)
> Jian-Hong Pan: did you ever give this a try?
Sorry for the late reply.  A family member got COVID, so we are in quarantine.
I will run the test when I can reach the laptop in the office.
Comment 13 Jian-Hong Pan 2022-05-10 10:20:40 UTC
(In reply to Mario Limonciello (AMD) from comment #10)
> Created attachment 300716 [details]
> Patch to set default to S3
> 
> Attached a patch that should do that programmatically by default.  If you
> can confirm this works, then after your colleague confirms it's not a
> hardware problem for you I can submit this up to review.

I just tested the patch.  My laptop can suspend and resume with the patch, which makes it use S3 instead of s2idle.  Thanks!
Comment 14 Mario Limonciello (AMD) 2022-05-10 13:14:36 UTC
Submitted up here:
https://lore.kernel.org/linux-acpi/20220510131136.1103-1-mario.limonciello@amd.com/T/#u
Comment 15 Christoph Anton Mitterer 2023-03-15 00:59:58 UTC
Hey folks.

I might suffer from the same issue (just with different hardware).

Would be highly appreciated if the experts could have a look at:
https://bugzilla.kernel.org/show_bug.cgi?id=216998

Thanks,
Chris.
Comment 16 Claudio Sampaio 2023-09-05 16:43:00 UTC
Hi Thorsten and Keith,

Thanks for the details. I'm still unsure whether responding by email is better than adding to the ticket, but here it goes: I have tried for days, both with the machine fully powered off and by cycle-booting all kernels in succession, and without exception the 6.1.x LTS and the patched 6.5.1 kernels always recognize and operate the NVMe, while the other kernels fail with the same error message. As this is my "production" desktop, it is more difficult for me to run tests on it during the week, but I will try to do so more methodically and also with vanilla 6.5.1.

As for why the Lexar doesn't get the quirk by default, I can't say I follow the complex logic of the driver activation, but I found out how to "fix" it for my case because there are three other Lexar models in the pci.c file: NM610, NM620 and NM760 (the last one with an additional quirk in the code, NVME_QUIRK_IGNORE_DEV_SUBNQN), so I guess whatever justifies the exception for them also applies to my model, the NM790. It might even be that I need NVME_QUIRK_IGNORE_DEV_SUBNQN (I'm not sure what it does) like the NM760, but the drive works correctly without it.
Comment 17 Mario Limonciello (AMD) 2023-09-05 16:46:02 UTC
This issue is long fixed; please open a new one for your problem with all the details.  We need to see a full dmesg and an acpidump.
There is a lot of nuance that has to do with the platform and the specific hardware.
Comment 18 Claudio Sampaio 2023-09-05 16:48:57 UTC
Sorry, I thought I was commenting in my original request here: https://bugzilla.kernel.org/show_bug.cgi?id=217863
Comment 19 Jian-Hong Pan 2024-01-18 09:22:35 UTC
I tested with kernel 6.7 and forced the system to use "s2idle" again.
I confirmed this issue happens if the Intel® Volume Management Device (Intel® VMD) is "disabled" in the ASUS B1400CEAE laptop's BIOS ver. 304.

If VMD is "enabled" in the ASUS B1400CEAE laptop's BIOS ver. 304, the system can do "s2idle" successfully.

Note that the ASUS website provides newer BIOS versions 311 and 314.
https://www.asus.com/laptops/for-work/expertbook/expertbook-b1-b1400-11th-gen-intel/helpdesk_bios?model2Name=ExpertBook-B1-B1400CEAE

With those newer BIOS versions, the system can s2idle successfully with VMD both enabled and disabled.
Comment 20 Daniel Drake 2024-01-30 18:37:58 UTC
After much testing we found that S3 suspend/resume is unreliable on this platform after all. The system occasionally gets into a state where it cannot wake up. In this state, the power LED is blinking (as it does throughout suspend) but there is no response to wakeup via keyboard or power button.

Also, while in this state, the battery LED gets stuck. For example, if the LED is solid white (implying the AC adapter is connected and the battery is full) and you unplug the AC adapter, the LED remains solid white (you would expect it to turn off).
This is strong evidence of a firmware crash.

That's why we're looking to get s2idle going again. I have now found the root cause of the original issue here and suggested a workaround as below. Once solved we can revert the quirk that selects S3 on this device.

(Sorry Mario - forgot to CC you on this)

https://lore.kernel.org/linux-pci/20240130183124.19985-1-drake@endlessos.org/T/#u

The Asus B1400 with production shipped firmware version 304 and VMD
disabled cannot resume from suspend: the NVMe device becomes unresponsive
and inaccessible.

This is because the NVMe device and parent PCI bridge get put into D3cold
during suspend, and this PCI bridge cannot be recovered from D3cold mode:

  echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/unbind
  echo "0000:00:06.0" > /sys/bus/pci/drivers/pcieport/unbind
  setpci -s 00:06.0 CAP_PM+4.b=03 # D3hot
  acpidbg -b "execute \_SB.PC00.PEG0.PXP._OFF"
  acpidbg -b "execute \_SB.PC00.PEG0.PXP._ON"
  setpci -s 00:06.0 CAP_PM+4.b=0 # D0
  echo "0000:00:06.0" > /sys/bus/pci/drivers/pcieport/bind
  echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/bind
  # NVMe probe fails here with -ENODEV

This appears to be an untested D3cold transition by the vendor; Intel
socwatch shows that Windows leaves the NVMe device and parent bridge in D0
during suspend, even though this firmware version has StorageD3Enable=1.

Experimenting with the DSDT, the _OFF method calls DL23() which sets a L23E
bit at offset 0xe2 into the PCI configuration space for this root port.
This is the specific write that the _ON routine is unable to recover from.
This register is not documented in the public chipset datasheet.
Comment 21 Daniel Drake 2024-02-08 08:08:59 UTC
Uploading 5 lspci dumps from this procedure:

  lspci -vvxxxxs 00:06.0 > 1.pci.txt
  echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/unbind
  echo "0000:00:06.0" > /sys/bus/pci/drivers/pcieport/unbind
  lspci -vvxxxxs 00:06.0 > 2.pci.txt
  setpci -s 00:06.0 CAP_PM+4.b=03 # D3hot
  acpidbg -b "execute \_SB.PC00.PEG0.PXP._OFF"
  acpidbg -b "execute \_SB.PC00.PEG0.PXP._ON"
  setpci -s 00:06.0 CAP_PM+4.b=0 # D0
  lspci -vvxxxxs 00:06.0 > 3.pci.txt
  echo "0000:00:06.0" > /sys/bus/pci/drivers/pcieport/bind
  lspci -vvxxxxs 00:06.0 > 4.pci.txt
  echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/bind
  # NVMe probe fails here with -ENODEV
  lspci -vvxxxxs 00:06.0 > 5.pci.txt
Comment 22 Daniel Drake 2024-02-08 08:09:29 UTC
Created attachment 305844 [details]
1.pci.txt
Comment 23 Daniel Drake 2024-02-08 08:09:46 UTC
Created attachment 305845 [details]
2.pci.txt
Comment 24 Daniel Drake 2024-02-08 08:10:02 UTC
Created attachment 305846 [details]
3.pci.txt
Comment 25 Daniel Drake 2024-02-08 08:10:21 UTC
Created attachment 305847 [details]
4.pci.txt
Comment 26 Daniel Drake 2024-02-08 08:10:38 UTC
Created attachment 305848 [details]
5.pci.txt
Comment 27 Daniel Drake 2024-02-08 08:32:54 UTC
Created attachment 305849 [details]
Full system lspci -vvxxxx
Comment 28 Daniel Drake 2024-02-21 15:38:49 UTC
The PXP._OFF function calls DL23(), which is what writes L23E = One; that is the condition I have not yet found a way to recover from. Full method:

        Method (DL23, 0, Serialized)
        {
            L23E = One
            Sleep (0x10)
            Local0 = Zero
            While (L23E)
            {
                If ((Local0 > 0x04))
                {
                    Break
                }

                Sleep (0x10)
                Local0++
            }

            SCB0 = One
        }

Here is the corresponding function called in the _ON path:

        Method (L23D, 0, Serialized)
        {
            If ((SCB0 != One))
            {
                Return (Zero)
            }

            L23R = One
            Local0 = Zero
            While (L23R)
            {
                If ((Local0 > 0x04))
                {
                    Break
                }

                Sleep (0x10)
                Local0++
            }

            SCB0 = Zero
            Local0 = Zero
            While ((LASX == Zero))
            {
                If ((Local0 > 0x08))
                {
                    Break
                }

                Sleep (0x10)
                Local0++
            }
        }

I was able to find some more meaningful names for these registers, from
https://doxygen.coreboot.org/df/dc4/rtd3_8c_source.html which appears to generate basically the same code.

 #define ACPI_REG_PCI_LINK_ACTIVE "LASX"    /* Link active status */
 #define ACPI_REG_PCI_L23_RDY_ENTRY "L23E"  /* L23_Rdy Entry Request */
 #define ACPI_REG_PCI_L23_RDY_DETECT "L23R" /* L23_Rdy Detect Transition */

Also here:
https://www.mail-archive.com/devel@edk2.groups.io/msg27375.html
+  // DL23 method puts link to L2 or L3 state. Used for RTD3 flows, before endpoint is powered down.
+  // This flow is implemented in ASL because rootport registers used for L2/L3 entry/exit
+  // are proprietary and OS drivers don't know about them.
+  //

So the intention of this code is to put the link into the PCIe L2/L3 state; the root-port registers that control this are not standardized, so the magic gets done in ACPI.

I experimented with this flow with setpci - nothing jumps out. When L23_Rdy Entry Request gets written, it clears to 0 immediately after. When L23_Rdy Detect Transition is written to 1, it also clears to 0 immediately. Link Active Status (LASX) is also 1 at this point.

I think there's room to look at exactly what the breakage is at that point, after going through _OFF and _ON. The PCI devices are enumerable at this point and the configuration space can be read, for both the bridge and the child NVMe device. I think the nvme probe is failing at the NVME_REG_CSTS check in nvme_pci_enable(). I'll examine more closely another day.
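
For reference, the check I suspect it is failing at looks roughly like this in nvme_pci_enable() (paraphrased from drivers/nvme/host/pci.c); a device whose BAR reads back as all 0xff fails it with -ENODEV:

        if (readl(dev->bar + NVME_REG_CSTS) == -1) {
                /* BAR reads as 0xffffffff: the controller is not responding */
                result = -ENODEV;
                goto disable;
        }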
Comment 29 Daniel Drake 2024-02-22 16:48:16 UTC
The issue in that case is that all the device memory reads as FFFFFFFF at that point in the test. This can be checked with `busybox devmem 0x82200000`.

But I think my test is questionable: unbind drivers, put parent bridge in D3hot, then D3cold, then power bridge back up, re-enter D0, and reload the drivers. At this point the bus seems enumerable and configuration spaces are accessible but the downstream child device is not working right. That's perhaps not surprising; the parent bridge was power cycled, but there wasn't any reset/reinit done at the child device level.

I tried a more complete test:

setpci -s 01:00.0 CAP_PM+4.b=03 # D3hot
setpci -s 00:06.0 CAP_PM+4.b=03 # D3hot
./acpidbg -b "execute \_SB.PC00.PEG0.PEGP._PS3"
./acpidbg -b "execute \_SB.PC00.PEG0._PS3"
./acpidbg -b "execute \_SB.PC00.PEG0.PXP._OFF"
./acpidbg -b "execute \_SB.PC00.PEG0.PXP._ON"
./acpidbg -b "execute \_SB.PC00.PEG0._PS0"
./acpidbg -b "execute \_SB.PC00.PEG0.PEGP._PS0"
setpci -s 00:06.0 CAP_PM+4.b=0 # D0
setpci -s 01:00.0 CAP_PM+4.b=0 # D0
echo "0000:00:06.0" > /sys/bus/pci/drivers/pcieport/bind
echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/bind

That fails (01:00.0 now disappears, can't do the setpci D0 or nvme driver bind), which may be a confirmation of this problem, or perhaps still something not quite right in my testing.

Tried a more thorough way of testing this.

Remove this code from pci_pm_runtime_suspend():

	/*
	 * If pci_dev->driver is not set (unbound), we leave the device in D0,
	 * but it may go to D3cold when the bridge above it runtime suspends.
	 * Save its config space in case that happens.
	 */
	if (!pci_dev->driver) {
		pci_save_state(pci_dev);
		return 0;
	}

(That's needed because otherwise the driverless child device won't go "properly" into D3cold; it'll get marked as D3cold when the parent bridge suspends, but the ACPI bits won't be executed. In this case the NVMe device and parent bridge both have the same ACPI power resource referenced by _PR3, so both references must be released for the problematic codepath to be hit.)

Now unbind nvme driver and enable runtime suspend on that device:

echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/unbind
echo auto > "/sys/bus/pci/devices/0000:01:00.0/power/control"

Now the NVMe device and parent bridge go into D3cold properly, with the PXP power resource turned off.

Power on bridge again:
echo on > "/sys/bus/pci/devices/0000:00:06.0/power/control"

Result:
 pcieport 0000:00:06.0: broken device, retraining non-functional downstream link at 2.5GT/s
 pcieport 0000:00:06.0: retraining failed
 pcieport 0000:00:06.0: broken device, retraining non-functional downstream link at 2.5GT/s
 pcieport 0000:00:06.0: retraining failed
 pci 0000:01:00.0: not ready 1023ms after resume; waiting
(snip)
 pci 0000:01:00.0: not ready 65535ms after resume; giving up
 pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible

The fact that the 06.0 parent bridge seems to fail early at this point might suggest that the bridge is the thing not being resumed properly. But the pci bridge config space is readable, the errors are about the downstream link, and the NVMe device config space is inaccessible. So that might suggest that the NVMe device is the thing that is not being reset properly? Also, the NVMe device has no-op _PS3 and _PS0 and the _PR3 just points at the one power resource from the root port. It feels like nothing is really managing the reset of the NVMe device.

Not sure if this gets us any closer to a way of powering the devices back up again here, or if it even really matters which of the two devices is the culprit; disabling D3cold on either one would suffice.
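
For the record, keeping a device out of D3cold from the kernel side is typically done with pci_d3cold_disable(). A minimal sketch of a fixup-style quirk; the 0x1234 device ID is a placeholder purely for illustration:

static void quirk_disable_d3cold(struct pci_dev *pdev)
{
        /* Firmware cannot power this device back up from D3cold, so forbid it */
        pci_d3cold_disable(pdev);
        pci_info(pdev, "quirk: D3cold disabled\n");
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1234, quirk_disable_d3cold);

For ad hoc testing, the same effect can be had from userspace by writing 0 to the device's d3cold_allowed sysfs attribute.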
Comment 30 Daniel Drake 2024-02-23 11:49:51 UTC
Another test.

Again with the kernel modified to allow a driverless child device to go properly into D3cold (it must release the power resource, because it is shared with the parent bridge; otherwise the parent bridge will not go into D3cold in the experiment below):

# Prevent bridge from going into D3cold
echo on > "/sys/bus/pci/devices/0000:00:06.0/power/control"

# Unbind the NVMe driver (the device itself is removed further below)
echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/unbind

# Put NVMe device in D3cold and check
echo auto > "/sys/bus/pci/devices/0000:01:00.0/power/control"
cat "/sys/bus/pci/devices/0000:01:00.0/power_state" # D3cold

# Remove NVMe device
echo 1 > "/sys/bus/pci/devices/0000:01:00.0/remove"

# Put bridge in D3cold
echo auto > "/sys/bus/pci/devices/0000:00:06.0/power/control"

# Check D3cold state with associated PXP power resource off
cat "/sys/bus/pci/devices/0000:00:06.0/power_state" # D3cold
cat /sys/bus/acpi/devices/LNXPOWER:00/path # confirm PEG0.PXP
cat /sys/bus/acpi/devices/LNXPOWER:00/status # confirm 0

# Power on bridge again
echo on > "/sys/bus/pci/devices/0000:00:06.0/power/control"

# Force rescan
echo 1 > "/sys/bus/pci/devices/0000:00:06.0/rescan"

# Check for NVMe device reappearance
ls "/sys/bus/pci/devices/0000:00:06.0/0000:01:00.0" # FAIL


This leaves me more confident that the NVMe device is the one causing problems in this scenario. It should be in D3hot at this point (we did turn on its power resource at the same time as we powered on the bridge, because it is the same power resource), yet it cannot be found by probing configuration space.
Also, no error messages were produced during the above sequence (no more messages about retraining), and in the final state the PCI bridge configuration space is accessible.
