Bug 215742 - The NVME storage quirked as SIMPLE SUSPEND makes system resume failed after suspend (Regression)
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: All Linux
Importance: P1 normal
Assignee: Linux ACPI Developers
Reported: 2022-03-25 06:41 UTC by Jian-Hong Pan
Modified: 2022-05-10 13:15 UTC

Kernel Version: 5.15+
Tree: Mainline
Regression: Yes


Attachments
Journal log (212.54 KB, text/plain)
2022-03-25 06:41 UTC, Jian-Hong Pan
Dump ACPI (2.42 MB, text/plain)
2022-03-25 07:01 UTC, Jian-Hong Pan
Patch to set default to S3 (1.58 KB, application/mbox)
2022-04-08 04:20 UTC, Mario Limonciello (AMD)

Description Jian-Hong Pan 2022-03-25 06:41:53 UTC
Created attachment 300614 [details]
Journal log

We have an ASUS B1400CEAE laptop equipped with an Intel i5-1135G7 and a Sandisk Corp WD Blue SN550 NVMe SSD.  The system resumes from suspend correctly with kernel 5.10.  However, it fails to resume and hangs with kernel 5.15 and the current 5.17, which is a regression.

I enabled the persistent journal and read the log again and again.  There is nothing in it after the system suspends, so the kernel messages after resume probably could not be written to the NVMe.

Then I noticed the message "kernel: nvme 0000:01:00.0: platform quirk: setting simple suspend" and wondered why the device is quirked as simple suspend.  I traced the code and disabled the quirk like this:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9f4f3884fefe..017a87ae999f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3096,7 +3096,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
                 */
                dev_info(&pdev->dev,
                         "platform quirk: setting simple suspend\n");
-               quirks |= NVME_QUIRK_SIMPLE_SUSPEND;
+               //quirks |= NVME_QUIRK_SIMPLE_SUSPEND;
        }
 
        /*

With this change the system resumes successfully!  The NVMe is the one that produces the bug!
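
For anyone who wants to check whether their own machine hits the same code path, the following is a minimal sketch that looks for the quirk message quoted above in a saved kernel log or on stdin.  The function name has_simple_suspend_quirk is made up for illustration; only the message text comes from this report.

```shell
#!/bin/sh
# Return 0 if the kernel log contains the NVMe simple-suspend quirk
# message quoted in this report, non-zero otherwise.
# has_simple_suspend_quirk is a hypothetical helper name.
has_simple_suspend_quirk() {
    pattern='platform quirk: setting simple suspend'
    if [ "$#" -gt 0 ]; then
        # $1: path to a saved log file
        grep -q "$pattern" "$1"
    else
        # No argument: read the log text from stdin
        grep -q "$pattern"
    fi
}

# Example (reading the kernel log may need extra privileges):
#   journalctl -k -b | has_simple_suspend_quirk && echo "quirk applied"
```

Piping the current boot's kernel messages through the function, as in the commented example, only reports whether the message was logged; it does not say why the quirk was selected.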
Comment 1 Jian-Hong Pan 2022-03-25 06:45:48 UTC
Here is the information of the NVMe:

01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009]
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0, IOMMU group 14
	Memory at 82200000 (64-bit, non-prefetchable) [size=16K]
	Memory at 82204000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
	Capabilities: [c0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8] Latency Tolerance Reporting
	Capabilities: [300] Secondary PCI Express
	Capabilities: [900] L1 PM Substates
	Kernel driver in use: nvme
Comment 2 Jian-Hong Pan 2022-03-25 07:01:26 UTC
Created attachment 300615 [details]
Dump ACPI

To understand why it needs the simple suspend quirk, I traced acpi_storage_d3() [1], which is called in that if condition.

I noticed that execution goes through the whole of acpi_storage_d3(), which means the ACPI node has StorageD3Enable set to 1.

To confirm that, I dumped the ACPI tables and found it:

Scope (_SB.PC00)
{
    Device (SAT0)
    {
        Name (_ADR, 0x00170000)  // _ADR: Address
        Name (_DSD, Package (0x02)  // _DSD: Device-Specific Data
        {
            ToUUID ("5025030f-842f-4ab4-a561-99a5189762d0") /* Unknown UUID */, 
            Package (0x01)
            {
                Package (0x02)
                {
                    "StorageD3Enable", 
                    One
                }
            }
        })

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/acpi/device_pm.c?h=v5.17#n1381
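
The same lookup can be repeated on another machine with a small helper.  This is a sketch that assumes the table has already been decompiled to a .dsl file and that the value sits on the line right after the "StorageD3Enable" string, as in the excerpt above; the function name storage_d3_values and the file name dsdt.dsl are illustrative.

```shell
#!/bin/sh
# Print the value that follows each "StorageD3Enable" key in a
# decompiled ACPI table.  Assumes the key and its value are on
# consecutive lines, as in the _DSD package shown in this comment.
# storage_d3_values is a hypothetical helper name.
storage_d3_values() {
    # $1: path to a decompiled table, e.g. dsdt.dsl
    awk '/"StorageD3Enable"/ {
        if ((getline) > 0) {      # move to the line holding the value
            gsub(/[ \t,]/, "")    # strip indentation and separators
            print
        }
    }' "$1"
}

# Example, assuming the ACPICA tools are installed:
#   acpidump -b          # write the binary tables to the current directory
#   iasl -d dsdt.dat     # decompile the DSDT to dsdt.dsl
#   storage_d3_values dsdt.dsl
```

A value of One (or 0x01) corresponds to the case described here, where acpi_storage_d3() returns true and the driver sets the simple suspend quirk.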
Comment 3 Jian-Hong Pan 2022-03-25 07:10:25 UTC
I found three related commits:

* df4f9bc4fb9c ("nvme-pci: add support for ACPI StorageD3Enable property")
* e21e0243e7b0 ("nvme-pci: look for StorageD3Enable on companion ACPI device instead")
* 2744d7a07335 ("ACPI: Check StorageD3Enable _DSD property in ACPI code")

The original check for the ACPI nodes PXSX and PEGP was removed in e21e0243e7b0 ("nvme-pci: look for StorageD3Enable on companion ACPI device instead") for AMD platforms.
I did not see either PXSX or PEGP in the ACPI tables from comment #2.

So I added the check back as a test:

diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c
index cc6c97e7dcae..d5d93d3f01f7 100644
--- a/drivers/acpi/device_pm.c
+++ b/drivers/acpi/device_pm.c
@@ -1381,6 +1381,8 @@ EXPORT_SYMBOL_GPL(acpi_dev_pm_attach);
 bool acpi_storage_d3(struct device *dev)
 {
        struct acpi_device *adev = ACPI_COMPANION(dev);
+       acpi_handle handle;
+       acpi_status status;
        u8 val;
 
        if (force_storage_d3())
@@ -1388,6 +1390,14 @@ bool acpi_storage_d3(struct device *dev)
 
        if (!adev)
                return false;
+
+       status = acpi_get_handle(adev->handle, "PXSX", &handle);
+       if (ACPI_FAILURE(status)) {
+               status = acpi_get_handle(adev->handle, "PEGP", &handle);
+               if (ACPI_FAILURE(status))
+                       return false;
+       }
+
        if (fwnode_property_read_u8(acpi_fwnode_handle(adev), "StorageD3Enable",
                        &val))
                return false;

System resumes successfully! I think Intel platforms need the check provided by df4f9bc4fb9c ("nvme-pci: add support for ACPI StorageD3Enable property").
Comment 4 Jian-Hong Pan 2022-03-25 07:28:12 UTC
With the test patch in comment #3, the NVMe is no longer quirked as NVME_QUIRK_SIMPLE_SUSPEND, and the system resumes from suspend correctly.

However, does this mean that the NVMe on this laptop does not support simple suspend?
Comment 5 Jian-Hong Pan 2022-03-25 08:15:16 UTC
I also tried the PM test modes described in [1]:

* devices mode: System resumes correctly

# echo devices > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo mem > /sys/power/state

* platform mode: System fails to resume

# echo platform > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo mem > /sys/power/state

[1] https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt
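
The two runs above can be wrapped in one helper so that pm_test is always reset afterwards.  This is only a sketch: the function name run_pm_test and the SYSFS_POWER variable are additions for illustration (the variable lets the function be pointed at a scratch directory), and the write to /sys/power/disk is left out on the assumption that it only matters for hibernation.

```shell
#!/bin/sh
# SYSFS_POWER is a hypothetical override, not a kernel interface;
# it defaults to the real sysfs directory.
SYSFS_POWER="${SYSFS_POWER:-/sys/power}"

# Run one suspend cycle in the given pm_test mode, then set pm_test
# back to "none" so later suspends behave normally.
# run_pm_test is a hypothetical helper name.
run_pm_test() {
    mode="$1"
    case "$mode" in
        freezer|devices|platform|processors|core) ;;
        *) echo "unknown pm_test mode: $mode" >&2; return 2 ;;
    esac
    echo "$mode" > "$SYSFS_POWER/pm_test" || return 1
    echo mem > "$SYSFS_POWER/state"
    rc=$?
    echo none > "$SYSFS_POWER/pm_test"
    return "$rc"
}

# Example (as root):
#   run_pm_test devices
#   run_pm_test platform
```

On the laptop in this report the first call would be expected to come back and the second to hang, which matches the results listed above.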
Comment 6 Mario Limonciello (AMD) 2022-04-01 15:05:08 UTC
Is this production hardware?  Is this production firmware?
If so - can you please confirm that you have the latest firmware from the manufacturer first in case they made a mistake in the release you have.

----

Looking at your ACPI table I see that this system likely supports s2idle as it sets:
                      Low Power S0 Idle (V5) : 1

I have a theory here the disk is getting to the deepest state with that simple suspend set but other platform problems are causing the issue.

Can you please do the following:
1) Use your workaround/revert.
2) Confirm /sys/power/mem_sleep is "s2idle"
3) Run a suspend, and then observe the values of /sys/kernel/debug/pmc_core/slp_s0_residency_usec

If those are non-zero you got to the deepest state and my theory is wrong.
If they're 0 it confirms my theory.
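
Steps 2 and 3 can be scripted roughly as follows.  This is a sketch with made-up helper names (selected_mem_sleep, s0ix_reached); it only encodes the two rules stated above: the active mode is the one shown in brackets, and a non-zero residency means the deepest state was reached.

```shell
#!/bin/sh
# Print the mode shown in brackets in the contents of
# /sys/power/mem_sleep, e.g. "[s2idle] deep" -> "s2idle".
# selected_mem_sleep is a hypothetical helper name.
selected_mem_sleep() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Return 0 if the residency value read after a suspend/resume cycle
# is a number greater than zero, non-zero otherwise.
# s0ix_reached is a hypothetical helper name.
s0ix_reached() {
    [ "$1" -gt 0 ] 2>/dev/null
}

# Example (as root, after one suspend/resume cycle):
#   selected_mem_sleep "$(cat /sys/power/mem_sleep)"
#   s0ix_reached "$(cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec)" \
#       && echo "reached the deepest state" \
#       || echo "never reached the deepest state"
```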
Comment 7 Jian-Hong Pan 2022-04-06 08:18:22 UTC
(In reply to mario.limonciello from comment #6)
The firmware has been updated to the latest version, 307, from the official website.
 
> Looking at your ACPI table I see that this system likely supports s2idle as
> it sets:
>                       Low Power S0 Idle (V5) : 1
> 
> I have a theory here the disk is getting to the deepest state with that
> simple suspend set but other platform problems are causing the issue.
> 
> Can you please do the following:
> 1) Use your workaround/revert.
Sure!

> 2) Confirm /sys/power/mem_sleep is "s2idle"
It is "s2idle".

$ cat /sys/power/mem_sleep 
[s2idle] deep

> 3) Run a suspend, and then observe the values of
> /sys/kernel/debug/pmc_core/slp_s0_residency_usec
> 
> If those are non-zero you got to the deepest state and my theory is wrong.
> If they're 0 it confirms my theory.
After suspend & resume, it is 0

$ sudo cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec
0
Comment 8 Mario Limonciello (AMD) 2022-04-06 16:51:52 UTC
Thanks, so that at least provides evidence to support my hypothesis.  There are a couple of ways that we could approach this:
1. Use S3 instead of S2idle for this system
2. Fix the other (presumably) platform problems leading to this behavior.

As the system advertises S3 as well, that's a much easier thing to try.  I would suggest that you leave the code in place and just change mem_sleep to "deep" and try to suspend.  If that works properly, we should be able to quirk this system to prefer "deep" even though the ACPI table advertises to use s2idle.

If that doesn't help, you can revert the quirk and see whether "deep" works with it reverted.
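
Switching the mode for a test could look like the sketch below.  The helper name set_mem_sleep and the SYSFS_POWER variable are illustrative additions; the function only writes the requested mode if the kernel lists it in mem_sleep.

```shell
#!/bin/sh
# SYSFS_POWER is a hypothetical override, not a kernel interface;
# it defaults to the real sysfs directory.
SYSFS_POWER="${SYSFS_POWER:-/sys/power}"

# Select a suspend mode, but only if mem_sleep offers it.
# set_mem_sleep is a hypothetical helper name.
set_mem_sleep() {
    want="$1"
    # Drop the brackets that mark the currently selected mode.
    avail=$(tr -d '[]' < "$SYSFS_POWER/mem_sleep") || return 1
    case " $avail " in
        *" $want "*) echo "$want" > "$SYSFS_POWER/mem_sleep" ;;
        *) echo "$want is not offered (available: $avail)" >&2; return 1 ;;
    esac
}

# Example (as root):
#   set_mem_sleep deep && echo mem > /sys/power/state
```

The setting does not survive a reboot.  For repeated testing, booting with the mem_sleep_default=deep kernel parameter should have the same effect, though that was not tried in this report.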
Comment 9 Jian-Hong Pan 2022-04-07 07:02:18 UTC
I checked S3 both with and without the workaround.

$ cat /sys/power/mem_sleep 
s2idle [deep]

The system resumes from suspend in both cases.

A friend of mine has ordered the same laptop, but it is still being shipped.  I will check with him once it arrives to make sure this is not an issue with a single unit.
Comment 10 Mario Limonciello (AMD) 2022-04-08 04:20:11 UTC
Created attachment 300716 [details]
Patch to set default to S3

Attached a patch that should do that programmatically by default.  If you can confirm this works, then after your colleague confirms it's not a hardware problem for you I can submit this up to review.
Comment 11 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-05-09 09:18:24 UTC
(In reply to Mario Limonciello (AMD) from comment #10)
> Created attachment 300716 [details]
> Patch to set default to S3
> 
> Attached a patch that should do that programmatically by default.  If you
> can confirm this works, then after your colleague confirms it's not a
> hardware problem for you I can submit this up to review.

Jian-Hong Pan: did you ever give this a try?
Comment 12 Jian-Hong Pan 2022-05-09 09:30:38 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #11)
> Jian-Hong Pan: did you ever give this a try?
Sorry for the late reply.  A family member got COVID, so we are in quarantine.
I will run the test once I can get to the laptop in the office.
Comment 13 Jian-Hong Pan 2022-05-10 10:20:40 UTC
(In reply to Mario Limonciello (AMD) from comment #10)
> Created attachment 300716 [details]
> Patch to set default to S3
> 
> Attached a patch that should do that programmatically by default.  If you
> can confirm this works, then after your colleague confirms it's not a
> hardware problem for you I can submit this up to review.

Just tested the patch.  My laptop suspends and resumes correctly with the patch, which makes it use S3 instead of s2idle.  Thanks!
Comment 14 Mario Limonciello (AMD) 2022-05-10 13:14:36 UTC
Submitted here:
https://lore.kernel.org/linux-acpi/20220510131136.1103-1-mario.limonciello@amd.com/T/#u
