Bug 214439
Summary: | Kernel Panic during boot - with unsupported namespace (0x6) nvme SSD | |
---|---|---|---
Product: | IO/Storage | Reporter: | Stig Nielsen (stig)
Component: | NVMe | Assignee: | IO/NVME Virtual Default Assignee (io_nvme)
Status: | RESOLVED IMPLEMENTED | |
Severity: | normal | CC: | kbusch
Priority: | P1 | |
Hardware: | All | |
OS: | Linux | |
Kernel Version: | 5.13.13-200 | Subsystem: |
Regression: | No | Bisected commit-id: |
Attachments: | Screenshot, Screenshot runlevel 3 | |
Description
Stig Nielsen
2021-09-16 19:49:34 UTC
kbusch:

CNS 6 is not invalid, though. It's defined by the spec, and it's the only defined way a driver can discover length limits for non-r/w commands. There's also no way for the driver to know whether the target supports the optional CNS 6 until it tries it. If the target doesn't support it, the correct response is to just return an "Invalid Field" status code, and the driver will happily carry on.

So what's really happening here? An MCE occurring due to an optional command attempt sounds like your drive is horribly broken. What model are you using?

Stig Nielsen:

Thanks for the help. I can switch back and forth to the previous kernel and everything works fine, including the drive, so it's not broken. It may be that the host sends the CNS command continuously, as the "activity" LED keeps flashing, so some activity keeps going.

To test that an "Invalid Field" is returned from the drive, the following commands were issued to check for the correct response (running kernel 5.4.14-200). I'm not sure about the correct syntax here, though, so any suggestions are greatly appreciated:

```
$ sudo nvme admin-passthru -o 6 -4 6 -n 1 /dev/nvme0 -r -l 1
NVMe status: INVALID_FIELD: A reserved coded value or an unsupported value in a defined field(0x2)

$ sudo nvme admin-passthru -o 6 -4 1 -n 1 /dev/nvme0 -r -l 1
NVMe command result:00000000
```

kbusch:

Thanks for sending those nvme commands, that was the very next thing I was going to ask for. :) Your syntax looks correct to me, and your drive is producing the expected response to the unsupported CNS 6 request, so I am starting to believe the problem you're observing is not the CNS 6 identification. In fact, after reviewing the git commit history, we've been using Identify CNS 6 since 5.13.0, so your working 5.14.14-200 is issuing this same command sequence. There must be something unrelated to that that is breaking your boot. Since your screenshot contains very little information, I think it would be most useful to bisect. Is that something you can do?
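For reference, if a bisect were attempted, the usual git flow would look roughly like the sketch below. The tags are assumptions: the working 5.4.14-200 and failing 5.13.13-200 packages are taken to correspond to the upstream v5.4 and v5.13 bases mentioned in this report.

```sh
# Sketch of a kernel bisect between the known-good and known-bad bases.
# The v5.4/v5.13 tags are assumptions based on the distro kernel versions above.
git bisect start
git bisect bad  v5.13    # base of the failing 5.13.13-200 kernel
git bisect good v5.4     # base of the working 5.4.14-200 kernel
# Build, install, and boot the kernel that git checks out, then mark it:
#   git bisect good      # boots normally
#   git bisect bad       # panics during boot
# Repeat until git reports the first bad commit.
```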
kbusch:

Eh, sorry, you wrote 5.4.14, not 5.14.14... 5.4 definitely doesn't send that command. Right now, I don't see any path in the driver that would issue that identification repeatedly. Are you able to revert the commit that introduced this command? If you're building from git source, it should be commit 5befc7c26e5a98cd49789fb1beb52c62bd472dba.

And one last comment on your command line syntax: the 'nvme admin-passthru' parameter value for '-l' should be 4096, since that's the number of bytes this command is supposed to transfer. I suspect that difference won't matter here, though.
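A sketch of both suggestions above, assuming a kernel git tree checked out at the failing release and the standard upstream build/install targets (exact build steps vary by distro):

```sh
# Revert the commit that introduced the Identify CNS 6 request, then rebuild.
git revert 5befc7c26e5a98cd49789fb1beb52c62bd472dba
make -j"$(nproc)"
sudo make modules_install install

# Repeat the manual CNS 6 identify with the full 4096-byte transfer length,
# as suggested above.
sudo nvme admin-passthru -o 6 -4 6 -n 1 /dev/nvme0 -r -l 4096
```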
Stig Nielsen:

Created attachment 298861 [details]
Screenshot runlevel 3

Thanks for your help. Yes, -l 4096 doesn't make a difference. I tried to boot into runlevel 3 and was actually able to log in and run "dmesg -w". The last output before the "mce: CPUs not responding" messages is "nvme nvme0: I/O 18 QID 0 timeout, disable controller" (see picture). I just updated the kernels from the repositories, but I'll try to see if I can build from git. Thanks again.

kbusch:

The new screenshot indicates that a command times out during initialization, and the MCE follows shortly after. It's not clear at this point whether the timed-out command is the new identify command, but no matter which command it is, the drive is not producing a response for it. If we assume it is the recently added identification command, something about the drive is different during boot compared to when you manually submitted the command from user space later. I've tried to synthesize this error condition, and it looks to me that the driver is handling it correctly. I have no idea what could be triggering this.

Stig Nielsen:

The problem has been resolved. The SSD vendor identified the problem and changed the namespace transport mechanism in their firmware. Closing this issue.
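As a follow-up, the firmware revision the drive is actually running after the vendor update can be checked with standard nvme-cli commands; this is just a sketch, with the device path assumed from the commands earlier in this report:

```sh
# Report the active firmware revision ("fr" field of the Identify Controller data)
sudo nvme id-ctrl /dev/nvme0 | grep '^fr '

# Show the firmware slot information log (active slot and stored revisions)
sudo nvme fw-log /dev/nvme0
```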