When AER recovery is triggered, .slot_reset tries to do an nvme_reset_ctrl_sync(). In most cases this works fine. However, when the error is a malformed TLP, nvme_reset_ctrl_sync() never returns and the entire recovery deadlocks.
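For reference, the path in question looks roughly like this (a sketch from memory of the driver's error handlers, not the exact code in this tree):

/* Sketch only -- not the exact code in this tree. */
static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	pci_restore_state(pdev);
	/*
	 * Synchronous reset: the AER recovery thread blocks here until the
	 * controller reset work finishes.  With a malformed TLP the reset
	 * never completes, so recovery never returns.
	 */
	nvme_reset_ctrl_sync(&dev->ctrl);
	return PCI_ERS_RESULT_RECOVERED;
}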
Created attachment 276031 [details]
dmesg of failure

Log with ANSI escape sequences (use wget and a color terminal instead of a browser). Even though this log was taken with FFS (firmware-first), the same happens with native AER.
Created attachment 276033 [details]
stack trace of allegedly hung kernel processes
Initial impression is that you are having a livelock problem. I'm still decoding the order of events.
Looking at the call stack, here is what I decoded; feel free to correct me, this is all very fresh. pid 554 performed a slot_reset() via the AER path and is waiting for the worker thread at pid 685 to finish performing the NVMe reset. pid 685 cannot take a block layer lock because the lock is being held by pid 1723, which is waiting for the reset by pid 534 to finish.
What lock is pid 685 trying to take? The admin request_queue shouldn't have any dependencies on any other IO task.
I think the issue is at the block layer: blk_queue_enter() vs. blk_execute_rq(). It feels like blk_queue_enter() happened before blk_execute_rq() and won't let blk_execute_rq() run to completion.
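For context, the synchronous path looks roughly like this (paraphrased and simplified from block/blk-exec.c, so take the details with a grain of salt):

/* Paraphrased and simplified from block/blk-exec.c. */
static void blk_end_sync_rq(struct request *rq, blk_status_t error)
{
	struct completion *waiting = rq->end_io_data;

	rq->end_io_data = NULL;
	complete(waiting);	/* wakes the blk_execute_rq() caller */
}

void blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
		    struct request *rq, int at_head)
{
	DECLARE_COMPLETION_ONSTACK(wait);

	rq->end_io_data = &wait;
	blk_execute_rq_nowait(q, bd_disk, rq, at_head, blk_end_sync_rq);
	/* If the request never completes, the caller sleeps here forever,
	 * unless the timeout handler steps in and fails the request. */
	wait_for_completion_io(&wait);
}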
But the blk_queue_enter in the trace is for a namespace request queue, whereas the blk_execute_rq is for the admin request_queue. They're different queues that don't share locks.
I'm going to guess that the blk_execute_rq's request was sent to the blk-mq layer and is stuck in wait_for_completion_io(), but that the admin request_queue is quiesced. It's not supposed to be quiesced, but that's the only thing I can think of, since the timeout handler didn't kick in.
I see. I'm not very familiar with the block layer, so take this with a pinch of salt. I think Alex asked this question before: why do you need to run the reset operation in a worker thread? Can't we call it directly from the AER task?
It's for the state machine and to ensure we don't have two different threads trying to reset the same controller at the same time; otherwise initialization would surely fail.
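The entry points look roughly like this (from memory of the nvme core, so treat it as a sketch):

/* Sketch from memory of the nvme core reset entry points; not verbatim. */
int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
{
	/* Only one caller wins the transition to RESETTING; everyone else
	 * gets -EBUSY instead of racing the (re)initialization. */
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		return -EBUSY;
	if (!queue_work(nvme_wq, &ctrl->reset_work))
		return -EBUSY;
	return 0;
}

int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl)
{
	int ret = nvme_reset_ctrl(ctrl);

	if (!ret)
		flush_work(&ctrl->reset_work);	/* this is what the AER path waits on */
	return ret;
}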
Alex, can you put some trace_printk("...") calls into blk_execute_rq(), blk_end_sync_rq() and blk_queue_enter() and share what you capture at the moment of failure? For each blk_execute_rq(), blk_end_sync_rq() needs to get called; let's see if that is happening.
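Something as simple as one line at the top of each function would do (placement is illustrative only):

/* Illustrative placement only -- one trace_printk() per function of
 * interest, so matching pairs show up in the trace buffer. */
trace_printk("blk_execute_rq: rq=%p\n", rq);	/* in blk_execute_rq() */
trace_printk("blk_end_sync_rq: rq=%p\n", rq);	/* in blk_end_sync_rq() */
trace_printk("blk_queue_enter: q=%p\n", q);	/* at the top of blk_queue_enter() */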
Apparently new tracing functionality was added to the NVMe driver. It would be good to enable that too.
I found out how to enable nvme tracing, but I'm getting spammed with events. I'm not sure how to filter so that I only get events for a specific PCIe drive.
Created attachment 276067 [details]
trace per device patch

I attached an untested patch that should allow you to trace per device. You specify the device you want to trace through the kernel module parameter trace_device, at /sys/module/nvme/parameters/trace_device. The code checks the device name against this to decide when to log. I hope I didn't make any pointer mistakes.
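The attachment is the authoritative version, but the gist is something like this (hypothetical names, untested):

/* Hypothetical gist of the attached patch -- see the attachment for the
 * real thing.  A module parameter names the controller to trace, and each
 * trace point checks the name before logging. */
static char trace_device[16];
module_param_string(trace_device, trace_device, sizeof(trace_device), 0644);

static bool nvme_trace_this_ctrl(struct nvme_ctrl *ctrl)
{
	/* e.g. trace_device="nvme2" matches the controller's char device */
	return !strcmp(dev_name(ctrl->device), trace_device);
}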
Where are we on this debug?
We're in Houston, TX, and Round Rock, TX.
Nice charm, I hope the weather is good there.
Created attachment 276155 [details]
[v0002] Trace with trace_device nvme2
Created attachment 276157 [details]
[v0002] stack of allegedly hung processes
pid 1049 is the only process of interest in the attachment.
Anything else you'd like me to stacktrace before I reboot?
Not quite sure yet.

kworker/u162:1-1049 [027] .... 280.943367: nvme_setup_admin_cmd: cmdid=12, flags=0x0, meta=0x0, cmd=(nvme_admin_identify cns=1, ctrlid=0)

So PID 1049 says it dispatched the command, which *should* mean the IO timer is ticking, and the driver should attempt to recover this command after 60 seconds (the default timeout) if no response arrives. I can't imagine why the timer isn't kicking in, but it's definitely weird.
Oh shit, I know why nothing is happening: the pci channel is offline, which disables the nvme's damn timeout handler. Bloody hell, we definitely can't allow that ... I blame the PowerPC people for this one. Stay tuned, I'll get a fix soon.
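For the record, the check I'm talking about looks roughly like this (from memory, not verbatim):

/* From memory -- the check in nvme_timeout() that bites us here (added for
 * EEH on ppc); not verbatim. */
static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct nvme_dev *dev = iod->nvmeq->dev;

	/*
	 * While the PCI channel is offline (error recovery in progress),
	 * keep re-arming the timer instead of recovering, which means the
	 * timeout handler can never rescue the hung admin command.
	 */
	if (pci_channel_offline(to_pci_dev(dev->dev)))
		return BLK_EH_RESET_TIMER;

	/* ... normal timeout handling continues here ... */
	return BLK_EH_RESET_TIMER;
}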
I'm guessing the reason the timeout occurred is that you're still injecting errors, and you happened to catch an error while the recovery was happening. So we're very much relying on the timeout handler in this case, but it's disabled, so we need to allow it to proceed with disabling the controller. Unfortunately, fixing this deadlock will mean the driver's controller initialization will fail, so the driver will unbind from the device. I suppose that's better than a deadlock.
Created attachment 276159 [details]
fix for deadlock

This should fix the deadlock, but will fail to initialize the controller under the conditions this test is creating.
The driver will get an error_resume callback when the link is restored. You may want to move this there.
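That is, the .resume hook in the driver's error handler table (a sketch, names as I remember them):

/* Sketch of the driver's error handler table, from memory: */
static const struct pci_error_handlers nvme_err_handler = {
	.error_detected	= nvme_error_detected,
	.slot_reset	= nvme_slot_reset,
	.resume		= nvme_error_resume,	/* called once the link is usable again */
};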
Yep, that's what the patch in comment 25 does.
Created attachment 276161 [details]
fix which also compiles
thx
(In reply to Alex G. from comment #28)
> Created attachment 276161 [details]
> fix which also compiles

You're okay with me posting this for mainline consideration?
Oh carp. My other comment from yesterday failed to get submitted. I'm fine with the patch as-is, though I have a more general concern about taking so damn long to recover.

Something as simple as doing an 'ls' or 'tree' in a directory on an mdraid set where one drive is about to fail can hang things. Normally you'd expect the drive to fall out of the mdraid set and things to continue, but when we hit this path, the entire array freezes up and all the userspace processes are stuck. The array stays stuck until the timeout finally kicks the drive out. I'm not sure we want a RAID set to get stuck for this long because of one misbehaving drive.

You also had some patches to hold the PCI topology lock during AER recovery. When recovery goes smoothly, everything's hunky-dory, but when we wait this long, we could prevent other errors from being serviced -- in completely different subnodes of the tree. Sure, a malformed TLP is a very synthetic error, but I imagine we could hit the same issue naturally.

Even with a simple timeout, I suspect we'd eventually figure out the drive isn't responding and yank it out, absent any AER reporting. With AER, we can know ahead of time that things look dire. So the only thing AER does in this case is hold more locks and create more headaches; we're not really using the info from AER to improve on the timeout-only case.

The patch is great. Recovering eventually is miles ahead of never. I'm just not confident in how everything interacts.
We have module parameters you can tune if you want things to fail faster. The defaults are insanely high.
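For example, something like nvme_core.io_timeout=5 nvme_core.admin_timeout=5 nvme_core.max_retries=1 on the kernel command line (parameter names and defaults from memory, values in seconds; check modinfo nvme_core) should make a dead drive fail out much sooner than the default 30/60-second timeouts with 5 retries.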
From experience, I've seen that users who go as far as tweaking kernel parameters tend to have custom kernels and build their own servers anyway. The majority of users are not going to go tweaking knobs.

My concern is that, from a platform vendor perspective, it might make more sense to "handle things in firmware" if OSes "can't seem to get it right". Example: a number of modern Dell server platforms hide most AER errors from the OS and just hope the OS will use the hotplug interrupts to know what's going on. And the sad part is, this strategy is working. Windows doesn't BSOD, Linux doesn't cry, and I don't know what ESXi and the others are doing, but they're pretty happy too. There's a lot of extra silicon going into those servers just to know when it's okay to hide errors from crashy OSes.

My hope is that, if we can get it right in Linux, the argument that "it pays to not be in Redmond, WA, and follow good programming practice" may actually work for once. We've come a good way in the past few months. Although the better feature to handle everyone's needs will likely be DPC, it's good to make the AER paths saner and better.
It's a bit risky to change the defaults at this point. They've been this way for so long that, even though they're insanely high, we may find someone was relying on them being that way.
I'd like to find a sane way to change the entire behavior, not just mess around with some default. Non-thoughtful example: on AER errors, why can't we just trigger the timeout to kill the transaction regardless? Speaking of "been this way for so long": http://www.wisdompills.com/2014/05/28/the-famous-social-experiment-5-monkeys-a-ladder/
Note that the 4.18 kernel has Oza's patch included: the AER fatal error handling is changing, and the code will do remove/probe as opposed to recovery. I suggest you give it a try.
(In reply to Alex G. from comment #35)
> Non-thoughtful example: on AER errors, why can't
> we just trigger the timeout to kill the transaction regardless?

That's exactly what we do. The problem with AER is that errors are handled serially, so if you get another AER while we're trying to recover from the first, it manifests as a timeout, because the driver has no way of being notified that the second AER even occurred.