Bug 217445
| Summary: | standby-resume cycle increases NVMe error count (maybe bad NVMe commands) | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | kolAflash (kolAflash) |
| Component: | NVMe | Assignee: | IO/NVME Virtual Default Assignee (io_nvme) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | agurenko, kbusch, kernelbugs, kernelorg, pdecat, peter+linux, pmhahn |
| Priority: | P3 | | |
| Hardware: | AMD | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
kolAflash
2023-05-15 10:22:52 UTC
P.S. On the EliteBook 735 the NVMe log error count also increases by 1 on reboot. (I haven't tested that on the other notebook yet.)

Output of "nvme error-log":
https://www.smartmontools.org/attachment/ticket/1722/nvme-error-log_Seagate-FireCuda-510-SSD_HP-EliteBook-735-G6-Ryzen3500U-Debian-12.txt

Sample entry:
error_count : 447
sqid : 0
cmdid : 0x8
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0x28
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0

Nearly every one of the 63 log entries has a different cmdid. The rest is always identical. Only these cmdids appear a few more times; most cmdids seem random:
0x4, 0x8014, 0xc012, 0xd00e, 0xe

Similar looking problems:

Samsung 970 EVO Plus Generates NVME Errors
https://bugzilla.kernel.org/show_bug.cgi?id=211573

Need NVME QUIRK BOGUS for Hiksemi SSD - HS-SSD-FUTURE 2048G
https://bugzilla.kernel.org/show_bug.cgi?id=217384

The driver is just attempting an optional command that the device doesn't support. The driver has no way to know whether the device supports it without trying, so that's what it's doing. The drive can log the error if it wants to. IMO that is unnecessary for this command, but we can't do anything about that. I'd just ignore the errors.

@Keith
Thanks for the answer!

Is there a way to manually stop the kernel from attempting this command? Something like a boot-time kernel module parameter?

I'd like to keep the NVMe error log clean, so that I'll notice if there is a real problem with the SSD.

P.S.
Any idea how other operating systems handle this? (I only have Linux installed, so I can't easily test anything else.)

I'm also wondering whether userland could introduce a mechanism to remember that a command failed once, so that userland could configure the kernel at boot not to try it again. Any clue who in userland could take care of this?
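The observation in the description, that the status stays the same while most cmdid values differ, can be checked mechanically. The following is a minimal sketch, assuming the plain-text `nvme error-log` output has the `key : value` layout of the sample entry above, with each entry introduced by an `Entry[N]` line. The function names `parse_error_log` and `tally` and the `SAMPLE` text are illustrative, not part of nvme-cli.

```python
import re
from collections import Counter

# An entry starts with a line such as " Entry[ 0]" or "Entry[62]".
ENTRY_RE = re.compile(r"^\s*Entry\[\s*\d+\]")
# Fields look like "error_count : 447" or "trtype_spec_info: 0".
FIELD_RE = re.compile(r"^\s*(\w+)\s*:\s*(.*?)\s*$")


def parse_error_log(text):
    """Return one dict of field name -> raw string value per log entry."""
    entries = []
    current = None
    for line in text.splitlines():
        if ENTRY_RE.match(line):
            current = {}
            entries.append(current)
        elif current is not None:
            m = FIELD_RE.match(line)
            if m:
                current[m.group(1)] = m.group(2)
    return entries


def tally(entries):
    """Count how often each status code and each cmdid occurs."""
    # Keep only the hex code, dropping the "(description)" suffix.
    statuses = Counter(
        e.get("status_field", "?").split("(", 1)[0] for e in entries
    )
    cmdids = Counter(e.get("cmdid", "?") for e in entries)
    return statuses, cmdids


# Illustrative input: two shortened entries in the assumed layout.
SAMPLE = """\
 Entry[ 0]
.................
error_count : 447
sqid : 0
cmdid : 0x8
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0x28
.................
 Entry[ 1]
.................
error_count : 446
sqid : 0
cmdid : 0x8014
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0x28
.................
"""

if __name__ == "__main__":
    statuses, cmdids = tally(parse_error_log(SAMPLE))
    print("status codes:", dict(statuses))
    print("cmdids:", dict(cmdids))
```

Feeding it the full attached log instead of `SAMPLE` would show whether any status other than 0x2002 ever appears, which is the case that would indicate a real problem with the drive.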
Nothing reasonable you can do from user space. Since this is such a frequent report, let me see if we can add something in kernel to skip if we already saw it fail before.

By the way: Which one is the name of the unknown command? SUBNQN or NIDT_EUI64? And what is the function of that command? I'll assume it's "SUBNQN" for this comment.

(In reply to Keith Busch from comment #6)
> Nothing reasonable you can do from user space. Since this is such a frequent
> report, let me see if we can add something in kernel to skip if we already
> saw it fail before.

But this information would be lost after a reboot. Correct? So NVMe errors would still increase with each reboot.

I really wish there were a way for userspace to tell the kernel in advance not to attempt this NVMe command for a specific NVMe device! Something like:
nvme_core.skip_subnqn=pci-0000:04:00.0-nvme-1
(comma separated list of NVMe devices)

So after one initial failure this would not happen again, even after rebooting.

How userspace does this wouldn't be the kernel's problem. A first attempt could be a userspace program watching dmesg for that error. If the error appears, the userspace program could add a module parameter to
/etc/modprobe.d/nvme_core_quirks.conf
or
/etc/default/grub -> GRUB_CMDLINE_LINUX_DEFAULT
so the kernel won't attempt that NVMe command again, not even after a reboot.

It's the Identify command for IO Command Set Specific Controller (or at least, I'm pretty sure that's the command that's triggering the error log entry).

And correct, the kernel proposal would not remember after a reboot. If you're going to change the kernel runtime behavior for specific devices, normally we add "quirk" flags on a per-model basis.

I am using a Samsung 960 Pro. I am getting 2 errors in a row on every reboot, and probably when running some virtual machines too:

```
Entry[62]
.................
error_count : 4556
sqid : 0
cmdid : 0x1b
status_field : 0x210d(Feature Identifier Not Saveable: The Feature Identifier specified does not support a saveable value)
phase_tag : 0
parm_err_loc : 0x28
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[63]
.................
error_count : 4555
sqid : 0
cmdid : 0x12
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
```
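To make the two status values quoted in this thread (0x2002 and 0x210d) easier to compare, here is a small sketch that splits them into their sub-fields. It rests on two assumptions that should be checked against the NVMe base specification and the nvme-cli source: that the printed `status_field` already has the phase tag bit removed (the output shows `phase_tag` as a separate line), and that the remaining bits are laid out as status code, status code type, command retry delay, More and Do Not Retry. The function names are illustrative only.

```python
def decode_status(status_field):
    """Split a printed status_field value into its sub-fields.

    Assumed layout (phase tag already stripped):
      bits 0-7   status code (SC)
      bits 8-10  status code type (SCT)
      bits 11-12 command retry delay (CRD)
      bit 13     More (further details are in the error log)
      bit 14     Do Not Retry (DNR)
    """
    return {
        "sc": status_field & 0xFF,
        "sct": (status_field >> 8) & 0x7,
        "crd": (status_field >> 11) & 0x3,
        "more": bool(status_field & 0x2000),
        "dnr": bool(status_field & 0x4000),
    }


def decode_parm_err_loc(loc):
    """Split parm_err_loc into byte, bit and command dword.

    Assumed layout: bits 0-7 are the byte offset inside the 64-byte
    command, bits 8-10 the bit within that byte. 0xffff is treated
    here as "no specific location reported".
    """
    if loc == 0xFFFF:
        return None
    byte = loc & 0xFF
    return {"byte": byte, "bit": (loc >> 8) & 0x7, "dword": byte // 4}


if __name__ == "__main__":
    for status in (0x2002, 0x210D):
        print(hex(status), decode_status(status))
    for loc in (0x28, 0xFFFF):
        print(hex(loc), decode_parm_err_loc(loc))
```

Under these assumptions both values differ only in status code and status code type, and `parm_err_loc : 0x28` points at byte 40 of the command, the start of Command Dword 10.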