Using smartctl, I can see that the number of error log entries on the NVMe SSD increases every time the system completes a standby-resume cycle. It's suspected that this is caused by NVMe commands that the NVMe device doesn't understand. Is there a way to fix this in the Linux kernel?

https://www.smartmontools.org/ticket/1722#comment:3
Please see that ticket for more details.

I'm having this issue on two systems, both of which use a "Seagate FireCuda 510 SSD ZP2000GM30001" NVMe drive. Other users seem to have similar problems with other NVMe devices:
https://www.smartmontools.org/ticket/1222
https://www.smartmontools.org/ticket/1663

= systems and dmesg =

Model: HP EliteBook 735 G6 (Notebook)
CPU: Ryzen 3500U
OS: Debian-12
Kernel: Linux-6.1 (all software versions from Debian-12)
dmesg:
[ 2.933308] nvme nvme0: pci function 0000:04:00.0
[ 2.939022] nvme nvme0: missing or invalid SUBNQN field.
[ 2.939073] nvme nvme0: Shutdown timeout set to 10 seconds
[ 2.941185] nvme nvme0: 8/0/0 default/read/poll queues
[ 2.941736] nvme nvme0: ctrl returned bogus length: 16 for NVME_NIDT_EUI64
[ 2.943756] nvme0n1: p1 p2 p3 p4
[...]
[ 735.372074] PM: suspend entry (deep)
[...]
[ 737.706880] nvme nvme0: Shutdown timeout set to 10 seconds
[ 737.708557] nvme nvme0: 8/0/0 default/read/poll queues
[ 737.708901] nvme nvme0: ctrl returned bogus length: 16 for NVME_NIDT_EUI64
[...]
[ 739.403068] PM: suspend exit

Model: HP EliteBook 845 G8 (Notebook)
CPU: Ryzen 5650U
OS: openSUSE-15.4
Kernel: Linux-6.1.27 (compiled myself) (all other software versions from openSUSE-15.4)
dmesg:
[ 0.915830][ T449] nvme 0000:03:00.0: platform quirk: setting simple suspend
[ 0.915931][ T449] nvme nvme0: pci function 0000:03:00.0
[...]
[ 0.919920][ T89] nvme nvme0: missing or invalid SUBNQN field.
[ 0.919939][ T89] nvme nvme0: Shutdown timeout set to 10 seconds
[ 0.921188][ T89] nvme nvme0: 8/0/0 default/read/poll queues
[ 0.922707][ T90] nvme0n1: p1 p2 p3 p4
P.S. On the EliteBook 735, the NVMe log error count also increases by 1 on reboot. (I haven't tested that on the other notebook yet.)
Output of "nvme error-log":
https://www.smartmontools.org/attachment/ticket/1722/nvme-error-log_Seagate-FireCuda-510-SSD_HP-EliteBook-735-G6-Ryzen3500U-Debian-12.txt

Sample entry:
error_count : 447
sqid : 0
cmdid : 0x8
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0x28
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0

Nearly every one of the 63 log entries has a different cmdid; the rest of each entry is always identical. Most cmdids seem random, but these appear a few more times than the others:
0x4, 0x8014, 0xc012, 0xd00e, 0xe
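To check the claim that the entries differ only in cmdid, the text output of "nvme error-log" can be grouped by status and parameter error location. The following is a minimal sketch, not a tested tool: it assumes the plain "key : value" layout shown in this thread, and the function names are made up for illustration. The sample input uses only entries quoted in this report (one from the Seagate drive, two from the Samsung 960 Pro).

```python
import re
from collections import defaultdict

# Matches "key : value" lines; separator lines and "Entry[..]" headers do not match.
FIELD_RE = re.compile(r"^\s*(\w+)\s*:\s*(.*?)\s*$")


def parse_error_log(text):
    """Split 'nvme error-log' text output into one dict per entry.

    Assumption: every entry starts with an 'error_count' line.
    """
    entries = []
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if not m:
            continue
        key, value = m.groups()
        if key == "error_count":
            entries.append({})
        if entries:
            entries[-1][key] = value
    return entries


def group_by_status(entries):
    """Map (status code, parm_err_loc) to the list of cmdids seen with it."""
    groups = defaultdict(list)
    for entry in entries:
        # Keep only the hex code, drop the "(description)" part.
        code = entry.get("status_field", "").split("(", 1)[0]
        groups[(code, entry.get("parm_err_loc"))].append(entry.get("cmdid"))
    return dict(groups)


SAMPLE = """\
error_count : 447
sqid : 0
cmdid : 0x8
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0x28
error_count : 4556
sqid : 0
cmdid : 0x1b
status_field : 0x210d(Feature Identifier Not Saveable: The Feature Identifier specified does not support a saveable value)
parm_err_loc : 0x28
error_count : 4555
sqid : 0
cmdid : 0x12
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
"""

if __name__ == "__main__":
    for key, cmdids in group_by_status(parse_error_log(SAMPLE)).items():
        print(key, cmdids)
```

If a whole log collapses into a single group, as described for the Seagate drive, every entry comes from the same kind of rejected command and only the command ID changes.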
Similar looking problems:

Samsung 970 EVO Plus Generates NVME Errors
https://bugzilla.kernel.org/show_bug.cgi?id=211573

Need NVME QUIRK BOGUS for Hiksemi SSD - HS-SSD-FUTURE 2048G
https://bugzilla.kernel.org/show_bug.cgi?id=217384
The driver is just attempting an optional command that the device doesn't support. The driver has no way to know if the device supports it without trying, so that's what it's doing. The drive can log the error if it wants to. IMO that's unnecessary for this command, but we can't do anything about that. I'd just ignore the errors.
@Keith Thanks for the answer! Is there a way to manually stop the kernel from attempting this command? Something like a kernel module parameter set at boot? I'd like to keep the NVMe error log clean, so that I'll notice if there is a real problem with the SSD.

P.S. Any idea how other operating systems handle this? (I only have Linux installed, so I can't easily test anything else.) I'm also wondering whether userland could introduce a mechanism to remember that a command failed once, so that it could configure the kernel at boot not to try it again. Any clue who in userland could take care of this?
Nothing reasonable you can do from user space. Since this is such a frequent report, let me see if we can add something in kernel to skip if we already saw it fail before.
By the way: which one is the name of the unknown command, SUBNQN or NIDT_EUI64? And what is the function of that command? I'll assume it's "SUBNQN" for this comment.

(In reply to Keith Busch from comment #6)
> Nothing reasonable you can do from user space. Since this is such a frequent
> report, let me see if we can add something in kernel to skip if we already
> saw it fail before.

But this information would be lost after a reboot, correct? So the NVMe error count would still increase with each reboot.

I really wish there were a way for userspace to tell the kernel in advance not to attempt this NVMe command for a specific NVMe device. Something like:
nvme_core.skip_subnqn=pci-0000:04:00.0-nvme-1
(comma-separated list of NVMe devices)

Then, after one initial failure, this would not happen again, even after rebooting. How userspace does this wouldn't be the kernel's problem. A first attempt could be a userspace program that watches dmesg for that error. When the error appears, the program could add a module parameter to /etc/modprobe.d/nvme_core_quirks.conf or to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, so the kernel won't attempt that NVMe command again, not even after a reboot.
It's the Identify command for IO Command Set Specific Controller (or at least, I'm pretty sure that's the command that's triggering the error log entry). And correct, the kernel proposal would not remember after a reboot. If you're going to change the kernel runtime behavior for specific devices, normally we add "quirk" flags on a per-model basis.
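The difference between the two approaches (a runtime "already failed" flag versus a per-model quirk) can be shown with a toy model. This is only a sketch in Python, not kernel code: the quirk name SKIP_CS_CTRL_IDENTIFY, the table contents, and the Controller class are all hypothetical, and keying the table by model string is a simplification.

```python
from enum import Flag, auto


class Quirk(Flag):
    NONE = 0
    SKIP_CS_CTRL_IDENTIFY = auto()  # hypothetical quirk name, not a real kernel flag


# Hypothetical per-model quirk table (illustrative entry only).
MODEL_QUIRKS = {
    "Seagate FireCuda 510 SSD ZP2000GM30001": Quirk.SKIP_CS_CTRL_IDENTIFY,
}


class Controller:
    """Toy stand-in for one driver instance; a new object models a reboot."""

    def __init__(self, model, supports_optional=False):
        self.model = model
        self.quirks = MODEL_QUIRKS.get(model, Quirk.NONE)
        self.optional_failed = False   # runtime memory, gone after a reboot
        self.error_log_count = 0       # stands in for the drive's error log
        self._supports_optional = supports_optional

    def _send_optional_identify(self):
        # A drive that rejects the command also logs an error entry.
        if self._supports_optional:
            return True
        self.error_log_count += 1
        return False

    def probe(self):
        """Called at boot and again on every resume."""
        if Quirk.SKIP_CS_CTRL_IDENTIFY in self.quirks or self.optional_failed:
            return  # never send the command (again)
        if not self._send_optional_identify():
            self.optional_failed = True


if __name__ == "__main__":
    plain = Controller("Some other model")
    plain.probe()   # boot: one error logged
    plain.probe()   # resume: skipped, flag remembers the failure
    print(plain.error_log_count)

    quirked = Controller("Seagate FireCuda 510 SSD ZP2000GM30001")
    quirked.probe()
    print(quirked.error_log_count)
```

In this model the runtime flag stops the extra entries on resume but still costs one entry per boot, while the static quirk avoids the command entirely for the listed model.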
I am using a Samsung 960 Pro. I am getting 2 errors in a row on every reboot, and probably also when running some virtual machines:

```
Entry[62]
.................
error_count : 4556
sqid : 0
cmdid : 0x1b
status_field : 0x210d(Feature Identifier Not Saveable: The Feature Identifier specified does not support a saveable value)
phase_tag : 0
parm_err_loc : 0x28
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[63]
.................
error_count : 4555
sqid : 0
cmdid : 0x12
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
```
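The two status_field values seen in this thread, 0x2002 and 0x210d, can be split into their bit fields to see how they differ. The sketch below assumes that the printed value already has the phase tag removed (it is shown separately as phase_tag), so that bits 7:0 are the status code, bits 10:8 the status code type, bit 13 "More" and bit 14 "Do Not Retry". That layout is consistent with the descriptions nvme-cli prints next to both values, but treat it as an assumption. The name table contains only the two statuses quoted here.

```python
from typing import NamedTuple


class NvmeStatus(NamedTuple):
    sc: int     # status code, bits 7:0
    sct: int    # status code type, bits 10:8 (0 = generic, 1 = command specific)
    crd: int    # command retry delay, bits 12:11
    more: bool  # bit 13: more information is available in the error log
    dnr: bool   # bit 14: do not retry


# Only the two statuses quoted in this thread; not a complete table.
KNOWN = {
    (0, 0x02): "Invalid Field in Command",
    (1, 0x0D): "Feature Identifier Not Saveable",
}


def decode_status(value):
    """Split a status_field value (phase tag already removed) into bit fields."""
    return NvmeStatus(
        sc=value & 0xFF,
        sct=(value >> 8) & 0x7,
        crd=(value >> 11) & 0x3,
        more=bool(value & 0x2000),
        dnr=bool(value & 0x4000),
    )


def describe(value):
    st = decode_status(value)
    return KNOWN.get((st.sct, st.sc), "unknown")


if __name__ == "__main__":
    for raw in (0x2002, 0x210D):
        print(hex(raw), decode_status(raw), describe(raw))
```

Under that assumption, both values have the "More" bit set, and they differ only in the status code type and status code.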