Bug 196325
| Summary: | Resume issue: "nvme nvme0: I/O 18 QID 0 timeout, disable controller" | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Martin Peres (martin.peres) |
| Component: | Other | Assignee: | io_other |
| Status: | NEW | | |
| Severity: | normal | CC: | jan.public, kbusch, marc, systeminterrupt, tomi.p.sarvela |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | Linux-next-20170710 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description (Martin Peres, 2017-07-11 13:12:45 UTC)
It might be worth checking that the Intel 600p M.2 NVMe firmware is up to date:

```
# smartctl -a /dev/nvme0 | grep "Firmware Version"
Firmware Version: PSF121C
```

I have one in my workstation which I recently updated, so the version above is likely still current. Mine doesn't get disabled on suspend/resume on kernel 4.11.9. (However, it's an AMD Ryzen CPU, so a different chipset, etc. There are other issues on suspend/resume, but it doesn't panic.)

PS: Assuming this report is for the Intel 600p M.2 NVMe mentioned here: https://bugzilla.kernel.org/show_bug.cgi?id=196223

Yes, it is this controller. The firmware was upgraded a couple of weeks ago:

```
# smartctl -a /dev/nvme0 | grep "Firmware Version"
Firmware Version: PSF121C
```

It looks like the APST feature is causing more trouble for certain devices, and it also looks like the driver is failing to handle this condition correctly. I had requested a parameter to disable APST back when others were enabling this power-saving feature, but that didn't happen; I will submit a patch to provide that option. I also see that nvme initialization is broken if a timeout occurs for both the APST and directives enabling steps, so I will submit a patch to fix that as well. As to why the controller is failing APST feature enabling in the first place, though, I am not sure.

I posted a patch to fix the stuck work queue. It turns out there is already a way to disable APST. Could you give that a try? The kernel parameter to do so should be 'nvme_core.default_ps_max_latency=0'.

I picked up next-20170731 and tested with it. I booted the 7700K + NVMe machine with the parameter nvme_core.default_ps_max_latency=0 and started looping gem_exec_suspend@basic-S3. The behaviour was different than before: I didn't get a hard hang right away. The first loops went OK and suspend/resume worked. Then, when (I think) resuming on around the tenth loop, I got "failed to set APST feature (-4)". I could Ctrl-C back to the command line and look around, but any disk-based command just waited.
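When checking firmware levels across several machines, the "Firmware Version" line quoted above can be extracted from captured `smartctl -a` output programmatically. A minimal sketch; it parses saved text rather than running smartctl itself, and the sample string is copied from this report:

```python
import re

def firmware_version(smartctl_output):
    """Return the 'Firmware Version' field from `smartctl -a` text, or None."""
    m = re.search(r"^Firmware Version:\s*(\S+)", smartctl_output, re.MULTILINE)
    return m.group(1) if m else None

# Minimal sample, copied from the output quoted in this report.
sample = "Firmware Version: PSF121C\n"
print(firmware_version(sample))  # PSF121C
```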
Within a minute I got the following trace:

```
kbl-7700k login: [  156.914098] nvme nvme0: failed to set APST feature (-4)
[  196.097915] Buffer I/O error on device nvme0n1p2, logical block 55157764
[  246.774622] INFO: task rs:main Q:Reg:462 blocked for more than 60 seconds.
[  246.785499]       Tainted: G     U          4.13.0-rc3-next-20170731-CI-next-20170731 #1
[  246.794531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.804594] Kernel panic - not syncing: hung_task: blocked tasks
[  246.811421] CPU: 0 PID: 66 Comm: khungtaskd Tainted: G     U          4.13.0-rc3-next-20170731-C1
[  246.822714] Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F22 03/06/7
[  246.832732] Call Trace:
[  246.835984]  dump_stack+0x68/0x9f
[  246.840075]  panic+0xcf/0x212
[  246.843835]  watchdog+0x460/0x610
[  246.847933]  kthread+0x10f/0x150
[  246.851955]  ? reset_hung_task_detector+0x10/0x10
[  246.857442]  ? kthread_create_on_node+0x40/0x40
[  246.862747]  ret_from_fork+0x27/0x40
[  246.867136] Kernel Offset: disabled
[  246.972943] Rebooting in 1 seconds..
[  247.977326] ACPI MEMORY or I/O RESET_REG.
ERROR: Boot option loading failed
```

I think you just got random behavior rather than changed anything: the parameter as written was misspelled (my fault), and should be "nvme_core.default_ps_max_latency_us=0". In any case, the parameter should not be necessary if the drive functions correctly, so I've raised the issue directly with the device's product group. I'll also ping the nvme mailing list about applying the patch that lets the driver gracefully unbind in this scenario, though the storage device would still be unusable.

This keeps happening on my device, and whenever it happens the machine stops taking keyboard and mouse input and does not refresh the screen either. This makes the system unusable for about a minute or so. Funnily enough, the mouse pointer still moves. Please let me know if I can contribute diagnostic information.
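For reference, the corrected parameter can be applied persistently via the bootloader configuration. A minimal sketch, assuming a GRUB-based distribution; the file path and the surrounding options are illustrative, not taken from this report:

```shell
# Hypothetical fragment of /etc/default/grub: append the corrected
# parameter to the kernel command line to disable APST entirely.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0"

# Then regenerate the grub config and reboot (command varies by distro):
#   sudo update-grub                                # Debian/Ubuntu
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg     # Fedora/RHEL

# After reboot, confirm the driver picked it up:
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```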
This is what I see in /var/log/syslog:

```
Dec 13 20:56:31 hostnamemasked kernel: [   83.018757] nvme nvme0: I/O 7 QID 0 timeout, reset controller
Dec 13 20:57:33 hostnamemasked kernel: [  144.507062] nvme nvme0: 4/0/0 default/read/poll queues
....
Dec 13 21:04:04 hostnamemasked kernel: [  535.632765] nvme nvme0: I/O 20 QID 0 timeout, reset controller
Dec 13 21:05:05 hostnamemasked kernel: [  597.113570] nvme nvme0: 4/0/0 default/read/poll queues
```

(In reply to systeminterrupt from comment #7)
> This keeps happening on my device and whenever it happens my device stops
> taking in keyboard and mouse input and does not refresh the screen either.

Please create a new bug about this issue. To my eye, the nvme drive is timing out I/O requests, which might be caused e.g. by a failing drive (the usual case), a marginal NVMe connection, or motherboard issues. My advice would be to check the SMART values from the drive (with the 'nvme' tool) and, if there is anything suspicious, replace the drive. Suspect values are high error counts (they should be close to 0) or the available spare heading toward 0%. This doesn't look related to this bugzilla entry.

Strangely, it's an admin command timing out, and it's not happening during controller init. I'd be curious to know what the command is and where it's coming from. Do you happen to know what's submitting this admin command?
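The syslog excerpt above shows the controller taking roughly a minute to come back after each reset. That gap can be measured directly from the kernel timestamps; a small sketch, assuming the standard `[ seconds ] message` dmesg prefix, with the sample lines copied from the report:

```python
import re

def reset_durations(log_lines):
    """Pair each 'timeout, reset controller' event with the next queue-count
    line (which marks the controller coming back) and return the gaps in seconds."""
    durations, pending = [], None
    for line in log_lines:
        m = re.search(r"\[\s*([0-9.]+)\]\s+nvme nvme\d+: (.*)", line)
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        if "timeout, reset controller" in msg:
            pending = ts
        elif "default/read/poll queues" in msg and pending is not None:
            durations.append(ts - pending)
            pending = None
    return durations

sample = [
    "Dec 13 20:56:31 host kernel: [   83.018757] nvme nvme0: I/O 7 QID 0 timeout, reset controller",
    "Dec 13 20:57:33 host kernel: [  144.507062] nvme nvme0: 4/0/0 default/read/poll queues",
    "Dec 13 21:04:04 host kernel: [  535.632765] nvme nvme0: I/O 20 QID 0 timeout, reset controller",
    "Dec 13 21:05:05 host kernel: [  597.113570] nvme nvme0: 4/0/0 default/read/poll queues",
]
for d in reset_durations(sample):
    print(f"controller unavailable for {d:.1f}s")  # ~61.5s each time
```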
If you're not sure and want to find out, you could do something like the following (as root, and assuming your kernel has ftrace enabled):

```
# echo "qid == 0" > /sys/kernel/debug/tracing/events/nvme/filter
# echo 1 > /sys/kernel/debug/tracing/events/nvme/enable
```

Then, after observing the timeout message, you can retrieve the trace:

```
# cat /sys/kernel/debug/tracing/trace
```
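Once a trace is captured, the admin-queue entries can be pulled out to see which task (the first `comm-pid` column of standard ftrace output) submitted the command. A sketch; the trace excerpt below is hypothetical, since the exact nvme event format varies by kernel version:

```python
def admin_queue_entries(trace_text):
    """Yield (task, line) pairs for trace lines touching the admin queue (qid=0)."""
    for line in trace_text.splitlines():
        if line.startswith("#") or "qid=0" not in line:
            continue  # skip header lines and I/O-queue traffic
        task = line.split()[0]  # ftrace's first column is "taskname-pid"
        yield task, line.strip()

# Hypothetical trace excerpt (format approximates ftrace's nvme_setup_cmd event).
trace = """\
# tracer: nop
  kworker/u8:2-310   [001] ....   156.913: nvme_setup_cmd: nvme0: qid=0, cmdid=18, nsid=0
      fio-1234       [002] ....   156.920: nvme_setup_cmd: nvme0: qid=2, cmdid=7, nsid=1
"""
for task, entry in admin_queue_entries(trace):
    print(task, "->", entry)
```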