Bug 216462 - Huawei Mate Book D16 NVMe SSD not detected (lost) after resume from suspend
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: Intel Linux
Importance: P1 high
Assignee: IO/NVME Virtual Default Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-09-08 09:13 UTC by Nikolai
Modified: 2023-08-10 08:30 UTC

See Also:
Kernel Version: 5.19.7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmidecode after UEFI firmware update to latest 1.20 version (22.29 KB, text/plain)
2022-09-09 10:38 UTC, Nikolai
ACPI DSDT table for huawei D16 (3.13 MB, text/x-csrc)
2022-09-09 10:39 UTC, Nikolai
lspci -vvnn log for stock SSD (issue reproducible) (27.75 KB, text/plain)
2022-09-09 10:40 UTC, Nikolai
lspci -vvnn log for non-stock Samsung SSD (issue non-reproducible) (27.87 KB, text/plain)
2022-09-09 10:42 UTC, Nikolai
full dmesg log mentioned in topic starting message (90.88 KB, text/plain)
2022-09-09 10:44 UTC, Nikolai

Description Nikolai 2022-09-08 09:13:13 UTC
A Huawei Mate Book D16 (Intel i5) laptop resumes from suspend with its NVMe SSD no longer detected, although the drive was working normally before the suspend.

ACPI: EC: interrupt blocked
[11168.523511] ACPI: EC: interrupt unblocked
[11168.527546] nvme 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[11168.596318] i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_69.0.3.bin version 69.0
[11168.596321] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
[11168.596599] nvme 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[11168.596652] nvme nvme0: Removing after probe failure status: -19
[11168.596662] nvme0n1: detected capacity change from 1000215216 to 0
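
The failure signature in the log above can be picked out of a saved dmesg dump automatically. A minimal Python sketch, assuming the log is available as plain text; the function name `find_d3cold_failures` and the regular expression are illustrative assumptions, not part of any kernel tooling:

```python
import re

# Matches the kernel message quoted in this report, e.g.
# "nvme 0000:01:00.0: Unable to change power state from D3cold to D0, ..."
D3COLD_FAILURE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Unable to change power state from D3cold to D0"
)


def find_d3cold_failures(log_text):
    """Return the sorted, de-duplicated PCI addresses that failed to leave D3cold."""
    return sorted({m.group("bdf") for m in D3COLD_FAILURE.finditer(log_text)})


# Lines copied from the log in this report.
sample = """\
[11168.527546] nvme 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[11168.596599] nvme 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[11168.596652] nvme nvme0: Removing after probe failure status: -19
"""

print(find_d3cold_failures(sample))  # prints ['0000:01:00.0']
```

The same function can be fed the full dmesg attachment to check whether devices other than the NVMe SSD are affected.
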
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-09-09 08:28:20 UTC
(In reply to Nikolai from comment #0)
> while was successfully working before.

when was that "before"? 5.18.y? Or an earlier 5.19 version?
Comment 2 Nikolai 2022-09-09 09:39:08 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #1)
> (In reply to Nikolai from comment #0)
> > while was successfully working before.
> 
> when was that "before"? 5.18.y? Or an earlier 5.19 version?

"before" is literally "before suspending".

The issue is also reproducible for 5.15.63 on the same laptop.

5.10.139 fails to boot on this machine (I suspect the Intel Iris graphics driver, but I didn't try to confirm that).

Actually, I found out that replacing the stock laptop SSD ("PCIe-8 SSD 512GB") with a "Samsung SSD 970 EVO Plus 250GB" eliminates the issue.

That may point to either an SSD firmware issue or an ACPI power-management compatibility issue with the new hardware.

I will add more technical details and logs for both configurations soon.
If you need specific diagnostics, please feel free to ask.
Comment 3 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-09-09 09:57:03 UTC
(In reply to Nikolai from comment #2)
> "before" is literally "before suspending".

ha, sorry, yeah, obviously.

> is also reproducible for 5.15.63

Thx for clarifying. In that case it's likely not a regression and not something for my todo list.
Comment 4 Nikolai 2022-09-09 10:10:32 UTC
The issue was first observed with laptop UEFI firmware version 1.09; updating to version 1.20 (the latest at the moment) didn't solve the problem.


SSD that reproduces the issue (a similar one is reviewed in [1]):
SN: YMA1512JA2202320B4
Model: PCIe-8 SSD 512GB
FW Rev: YM00D216

SSD without the issue:
SN: S4EUNM0R104410T
Model: Samsung SSD 970 EVO Plus 250GB
FW Rev: 2B2QEXM7

[1] https://fadvices.com/huawei-matebook-16s-review-intel-core-i9-in-sheeps-clothing-tech-reviews/
Comment 5 Nikolai 2022-09-09 10:38:03 UTC
Created attachment 301779 [details]
dmidecode after UEFI firmware update to latest 1.20 version
Comment 6 Nikolai 2022-09-09 10:39:40 UTC
Created attachment 301780 [details]
ACPI DSDT table for huawei D16
Comment 7 Nikolai 2022-09-09 10:40:50 UTC
Created attachment 301781 [details]
lspci -vvnn log for stock SSD (issue reproducible)
Comment 8 Nikolai 2022-09-09 10:42:24 UTC
Created attachment 301782 [details]
lspci -vvnn log for non-stock Samsung SSD (issue non-reproducible)
Comment 9 Nikolai 2022-09-09 10:44:54 UTC
Created attachment 301783 [details]
full dmesg log mentioned in topic starting message
Comment 10 Anatolii 2023-08-10 08:30:15 UTC
Hello.
I have run into the same issue on a Huawei MateBook D16 with, judging by attachment [1], an NVMe drive from the same vendor. It seems that not every NVMe drive supports the D3cold state.
The kernel is 6.4.2.

I have created a dirty workaround patch [2] that simply forces the D3cold state off for the specified NVMe device. It works for me. Could you recompile the kernel with it and check?

By the way (maybe a bit off-topic for this issue), there is a sysfs attribute `/sys/bus/pci/devices/0000:01:00.0/d3cold_allowed`. It can be set to zero; however, the kernel function that suspends PCI devices, `pci_set_power_state` (`drivers/pci/pci.c`), does not check it [3].
What about adding a `pci_dev_check_d3cold` check inside the `pci_set_power_state` function? That would allow disabling the D3cold state from userspace without recompiling the kernel.
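
For anyone who wants to try the sysfs attribute anyway, here is a minimal Python sketch of setting it. The helper name `disallow_d3cold` and its `sysfs_root` parameter are hypothetical and exist only for illustration; the attribute path is the one quoted above. As noted, clearing the attribute may not prevent the failure on affected kernels, so treat this as an experiment rather than a fix.

```python
from pathlib import Path


def disallow_d3cold(bdf, sysfs_root="/sys"):
    """Write 0 to the device's d3cold_allowed attribute and return the value read back.

    bdf is the PCI address, e.g. "0000:01:00.0". sysfs_root is a hypothetical
    parameter so the helper can be tried against a scratch directory.
    """
    attr = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "d3cold_allowed"
    attr.write_text("0\n")
    return attr.read_text().strip()


# Example (needs root; the address is the one from this report):
#   disallow_d3cold("0000:01:00.0")
```

The value written this way does not persist across reboots, so it would have to be reapplied at every boot, for example from a startup script.
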

[1] https://bugzilla.kernel.org/attachment.cgi?id=301781

[2] https://gist.github.com/Toliak/86340b839b45f2c6fa4337ba6d8e971b#the-solution-part

[3] https://gist.github.com/Toliak/86340b839b45f2c6fa4337ba6d8e971b#meanwhile-why-d3cold_allowed-is-not-working
