Bug 217445 - standby-resume cycle increases NVMe error count (maybe bad NVMe commands)
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: AMD Linux
Importance: P3 normal
Assignee: IO/NVME Virtual Default Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-05-15 10:22 UTC by kolAflash
Modified: 2024-04-29 00:12 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Description kolAflash 2023-05-15 10:22:52 UTC
Using smartctl, I can see that the NVMe SSD's error log entry count increases every time the system goes through a standby-resume cycle.

It's suspected that this could be caused by NVMe commands that the NVMe device doesn't understand.
Is there a way to fix this in the Linux kernel?

https://www.smartmontools.org/ticket/1722#comment:3
Please see that ticket for more details.
I'm having this issue on two systems which both use a "Seagate FireCuda 510 SSD ZP2000GM30001" NVMe. But other users seem to have similar problems with other NVMe devices.
https://www.smartmontools.org/ticket/1222
https://www.smartmontools.org/ticket/1663
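
The counter described above can be tracked programmatically. The following is a minimal sketch, assuming smartctl's JSON output (`--json`) exposes the counter as `num_err_log_entries` under `nvme_smart_health_information_log`, and assuming the device path `/dev/nvme0`. The function names are made up for this example.

```python
import json
import subprocess


def error_log_entries(smartctl_json: str) -> int:
    """Extract the NVMe error log entry counter from smartctl JSON text."""
    report = json.loads(smartctl_json)
    # Assumed key names, as emitted by smartctl --json for NVMe devices.
    return report["nvme_smart_health_information_log"]["num_err_log_entries"]


def read_error_log_entries(device: str = "/dev/nvme0") -> int:
    """Run smartctl (usually needs root) and return the current counter."""
    # smartctl's exit status is a bitmask, so a non-zero code is not
    # necessarily a failure; do not raise on it.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return error_log_entries(result.stdout)
```

Calling `read_error_log_entries()` once before suspend and once after resume, then comparing the two values, should show whether the cycle added entries.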




= systems and dmesg =

Model: HP EliteBook 735 G6 (Notebook)
CPU: Ryzen 3500U
OS: Debian-12
Kernel: Linux-6.1
(all software versions from Debian-12)
dmesg:
[    2.933308] nvme nvme0: pci function 0000:04:00.0
[    2.939022] nvme nvme0: missing or invalid SUBNQN field.
[    2.939073] nvme nvme0: Shutdown timeout set to 10 seconds
[    2.941185] nvme nvme0: 8/0/0 default/read/poll queues
[    2.941736] nvme nvme0: ctrl returned bogus length: 16 for NVME_NIDT_EUI64
[    2.943756]  nvme0n1: p1 p2 p3 p4
[...]
[  735.372074] PM: suspend entry (deep)
[...]
[  737.706880] nvme nvme0: Shutdown timeout set to 10 seconds
[  737.708557] nvme nvme0: 8/0/0 default/read/poll queues
[  737.708901] nvme nvme0: ctrl returned bogus length: 16 for NVME_NIDT_EUI64
[...]
[  739.403068] PM: suspend exit


Model: HP EliteBook 845 G8 (Notebook)
CPU: Ryzen 5650U
OS: openSUSE-15.4
Kernel: Linux-6.1.27 (compiled myself)
(all other software versions from openSUSE-15.4)
dmesg:
[    0.915830][  T449] nvme 0000:03:00.0: platform quirk: setting simple suspend
[    0.915931][  T449] nvme nvme0: pci function 0000:03:00.0
[...]
[    0.919920][   T89] nvme nvme0: missing or invalid SUBNQN field.
[    0.919939][   T89] nvme nvme0: Shutdown timeout set to 10 seconds
[    0.921188][   T89] nvme nvme0: 8/0/0 default/read/poll queues
[    0.922707][   T90]  nvme0n1: p1 p2 p3 p4
Comment 1 kolAflash 2023-05-15 10:35:03 UTC
P.S.
On the EliteBook 735 the NVMe log error count also increases by 1 on reboot.
(didn't test that on the other notebook yet)
Comment 2 kolAflash 2023-05-18 22:52:24 UTC
Output of "nvme error-log":
https://www.smartmontools.org/attachment/ticket/1722/nvme-error-log_Seagate-FireCuda-510-SSD_HP-EliteBook-735-G6-Ryzen3500U-Debian-12.txt

Sample entry:
error_count     : 447
sqid            : 0
cmdid           : 0x8
status_field    : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag       : 0
parm_err_loc    : 0x28
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0

Nearly every one of the 63 log entries has a different cmdid; all other fields are identical.
Only the following cmdids appear more than once; most cmdids seem random:
0x4, 0x8014, 0xc012, 0xd00e, 0xe
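
The `status_field` and `parm_err_loc` values in the sample entry can be split into their sub-fields. This is a sketch under two assumptions: that nvme-cli prints the status with the phase tag already shifted out (it shows `phase_tag` separately), and that the bit layout follows the NVMe completion status field (SC in bits 7:0, SCT in bits 10:8, CRD in bits 12:11, More in bit 13, DNR in bit 14). The function names are invented for illustration.

```python
def decode_status(status_field: int) -> dict:
    """Split a 15-bit NVMe status value into its sub-fields."""
    return {
        "sc": status_field & 0xFF,            # Status Code
        "sct": (status_field >> 8) & 0x7,     # Status Code Type
        "crd": (status_field >> 11) & 0x3,    # Command Retry Delay
        "more": bool(status_field & (1 << 13)),
        "dnr": bool(status_field & (1 << 14)),  # Do Not Retry
    }


def decode_parm_err_loc(loc: int) -> dict:
    """Split Parameter Error Location into byte and bit offsets."""
    byte = loc & 0xFF
    return {
        "byte": byte,
        "bit": (loc >> 8) & 0x7,
        "dword": byte // 4,  # which 32-bit command dword the byte falls in
    }
```

Under these assumptions, `0x2002` decodes to status code 0x02 with status code type 0 (generic) and the More bit set, which matches the "Invalid Field in Command" text, and `0x28` points at byte 40, the start of command dword 10.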
Comment 3 kolAflash 2023-05-24 10:28:34 UTC
Similar looking problems:

Samsung 970 EVO Plus Generates NVME Errors
https://bugzilla.kernel.org/show_bug.cgi?id=211573

Need NVME QUIRK BOGUS for Hiksemi SSD - HS-SSD-FUTURE 2048G
https://bugzilla.kernel.org/show_bug.cgi?id=217384
Comment 4 Keith Busch 2023-05-24 21:19:59 UTC
The driver is just attempting an optional command that the device doesn't support. The driver has no way to know if the device supports it without trying, so that's what it's doing. The drive can log the error if it wants to; IMO that is unnecessary for this command, but we can't do anything about that. I'd just ignore the errors.
Comment 5 kolAflash 2023-05-30 13:24:17 UTC
@Keith
Thanks for the answer!

Is there a way to manually stop the kernel from attempting this command?
Something like a boot-time kernel module parameter?

I'd like to keep the NVMe error log clean so that I'll notice if there is a real problem with the SSD.



P.S.

Any idea how other operating systems handle this?
(I only have Linux installed, so I cannot easily test anything else.)

I'm also wondering whether userland could introduce a mechanism to remember that a command failed once, so that userland could configure the kernel at boot not to try it again.
Any idea who in userland could take care of this?
Comment 6 Keith Busch 2023-05-30 15:22:34 UTC
Nothing reasonable you can do from user space. Since this is such a frequent report, let me see if we can add something in kernel to skip if we already saw it fail before.
Comment 7 kolAflash 2023-05-30 18:54:16 UTC
By the way:
Which one is the name of the unknown command?
SUBNQN or NIDT_EUI64 ?
And what is the function of that command?

I'll assume it's "SUBNQN" for this comment.


(In reply to Keith Busch from comment #6)
> Nothing reasonable you can do from user space. Since this is such a frequent
> report, let me see if we can add something in kernel to skip if we already
> saw it fail before.

But this information would be lost after reboot. Correct?
So NVMe errors would still increase with each reboot.


I would really like a way for userspace to tell the kernel in advance not to attempt this NVMe command for a specific NVMe device!
Something like: nvme_core.skip_subnqn=pci-0000:04:00.0-nvme-1
(comma separated list of NVMe devices)

So after one initial failure this would not happen again, even after rebooting.
How userspace does this wouldn't be the kernel's problem.
A first attempt could be a userspace program watching dmesg for that error. If the error appears, the userspace program could add a module parameter to /etc/modprobe.d/nvme_core_quirks.conf or /etc/default/grub -> GRUB_CMDLINE_LINUX_DEFAULT
So the kernel won't attempt that NVMe command again, not even after a reboot.
Comment 8 Keith Busch 2023-05-30 19:22:51 UTC
It's the Identify command for IO Command Set Specific Controller (or at least, I'm pretty sure that's the command that's triggering the error log entry).

And correct, the kernel proposal would not remember after a reboot.

If you're going to change the kernel runtime behavior for specific devices, normally we add "quirk" flags on a per-model basis.
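
As a rough illustration of the per-model quirk idea, here is a conceptual sketch. It is not kernel code: the kernel keeps such flags in a C table keyed by PCI vendor and device ID. The flag name `SKIP_OPTIONAL_IDENTIFY`, the table contents, and the IDs are all hypothetical placeholders.

```python
from enum import IntFlag


class Quirk(IntFlag):
    NONE = 0
    # Hypothetical flag: do not send the optional Identify variant.
    SKIP_OPTIONAL_IDENTIFY = 1


# Placeholder (vendor_id, device_id) pairs; not real hardware IDs.
QUIRK_TABLE = {
    (0xFFFF, 0x0001): Quirk.SKIP_OPTIONAL_IDENTIFY,
}


def quirks_for(vendor_id: int, device_id: int) -> Quirk:
    """Look up the quirk flags for a device model, defaulting to none."""
    return QUIRK_TABLE.get((vendor_id, device_id), Quirk.NONE)


def should_try_optional_identify(vendor_id: int, device_id: int) -> bool:
    """Attempt the optional command unless the model is flagged."""
    return Quirk.SKIP_OPTIONAL_IDENTIFY not in quirks_for(vendor_id, device_id)
```

Because the table is part of the driver rather than runtime state, a flag of this kind would apply on every boot, unlike a "remember that it failed" approach.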
Comment 9 KevinBu 2024-04-29 00:12:35 UTC
I am using a Samsung 960 Pro. I get two errors in a row on every reboot, and probably also when running some virtual machines:


```
Entry[62]   
.................
error_count	: 4556
sqid		: 0
cmdid		: 0x1b
status_field	: 0x210d(Feature Identifier Not Saveable: The Feature Identifier specified does not support a saveable value)
phase_tag	: 0
parm_err_loc	: 0x28
lba		: 0
nsid		: 0
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................
 Entry[63]   
.................
error_count	: 4555
sqid		: 0
cmdid		: 0x12
status_field	: 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag	: 0
parm_err_loc	: 0xffff
lba		: 0
nsid		: 0
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................

```
