Bug 199757

Summary: nvme: nvme_reset_ctrl_sync() hangs when AER error is generated by a Malformed TLP
Product: Drivers
Reporter: Alex G. (mr.nuke.me)
Component: Other
Assignee: drivers_other
Status: NEW
Severity: normal
CC: austin.bolen, kbusch, okaya
Priority: P1
Hardware: All
OS: Linux
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=199657
Kernel Version: 4.17-rc5+
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  dmesg of failure
  stack trace of allegedly hung kernel processes
  trace per device patch
  [v0002] Trace with trace_device nvme2
  [v0002] stack of allegedly hung processes
  fix for deadlock
  fix which also compiles

Description Alex G. 2018-05-17 18:13:47 UTC
When AER recovery is triggered, .slot_reset tries to do a nvme_reset_ctrl_sync(). In most cases, this works fine. However, when the error is a malformed TLP, nvme_reset_ctrl_sync() never returns, and the entire recovery deadlocks.
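For reference, a minimal sketch of the call path being described (the names follow the upstream driver, but the body is illustrative only, not the actual drivers/nvme/host/pci.c source):

/*
 * Illustrative sketch only.  The AER core calls the driver's .slot_reset
 * handler, which performs a synchronous controller reset; if that reset
 * never completes, AER recovery never completes either.
 */
static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
{
        struct nvme_dev *dev = pci_get_drvdata(pdev);

        pci_restore_state(pdev);

        /* Schedules the controller reset work and waits for it to finish.
         * With a Malformed TLP injected, this wait never returns. */
        nvme_reset_ctrl_sync(&dev->ctrl);

        return PCI_ERS_RESULT_RECOVERED;
}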
Comment 1 Alex G. 2018-05-17 18:15:20 UTC
Created attachment 276031 [details]
dmesg of failure

Log with ANSI escape sequences (use wget and a color terminal instead of a browser).

Even though this log was taken with FFS (firmware-first), the same thing happens with native AER.
Comment 2 Alex G. 2018-05-17 19:54:03 UTC
Created attachment 276033 [details]
stack trace of allegedly hung kernel processes
Comment 3 Sinan Kaya 2018-05-17 21:57:40 UTC
Initial impression is that you are having a livelock problem. I'm still decoding the order of events.
Comment 4 Sinan Kaya 2018-05-17 22:08:59 UTC
Looking at the call stack, here is what I decoded. Feel free to correct me; this is all very fresh.

pid 554 performed a slot_reset() via the AER path and is waiting for the worker thread (pid 685) to finish the NVMe reset.

pid 685 cannot take a block layer lock because it is held by pid 1723, which is waiting for the reset by pid 534 to finish.
Comment 5 Keith Busch 2018-05-17 22:12:24 UTC
What lock is pid 685 trying to take? The admin request_queue shouldn't have any dependencies on any other IO task.
Comment 6 Sinan Kaya 2018-05-17 22:16:55 UTC
I think the issue is at the block layer: blk_queue_enter() vs. blk_execute_rq().

It feels like blk_queue_enter() happened before blk_execute_rq() and won't let blk_execute_rq() run to completion.
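To make that concrete, here is a heavily simplified model of the two entry points in question (my own sketch, not the real code in block/blk-core.c and block/blk-exec.c; it only shows where callers can end up sleeping):

/* Submitters take a reference on the queue before allocating a request;
 * if the queue is frozen they sleep until it is unfrozen or dying. */
int blk_queue_enter_sketch(struct request_queue *q)
{
        while (!percpu_ref_tryget_live(&q->q_usage_counter)) {
                wait_event(q->mq_freeze_wq,
                           atomic_read(&q->mq_freeze_depth) == 0 ||
                           blk_queue_dying(q));
                if (blk_queue_dying(q))
                        return -ENODEV;
        }
        return 0;
}

/* Synchronous execution: queue the request and sleep until its end_io
 * callback fires (normal completion or the timeout handler). */
static void end_sync_rq_sketch(struct request *rq, blk_status_t error)
{
        complete(rq->end_io_data);
}

void blk_execute_rq_sketch(struct request_queue *q, struct request *rq)
{
        DECLARE_COMPLETION_ONSTACK(wait);

        rq->end_io_data = &wait;
        blk_execute_rq_nowait(q, NULL, rq, 0, end_sync_rq_sketch);
        wait_for_completion_io(&wait);  /* stuck here if the request is
                                         * neither completed nor timed out */
}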
Comment 7 Keith Busch 2018-05-17 22:23:50 UTC
But the blk_queue_enter in the trace is for a namespace request queue, whereas the blk_execute_rq is for the admin request_queue. They're different queues that don't share locks.
Comment 8 Keith Busch 2018-05-17 22:28:45 UTC
I'm going to guess that blk_execute_rq's request was sent to the blk-mq layer and is stuck in wait_for_completion_io, but that the admin request_queue is quiesced. It's not supposed to be quiesced, but that's the only thing I can think of, since the timeout handler didn't kick in.
Comment 9 Sinan Kaya 2018-05-17 22:30:42 UTC
I see. I'm not very familiar with the block layer, so take my analysis with a pinch of salt.

I think Alex asked this question before: why do you need to run the reset operation in a worker thread? Can't we call it directly from the AER task?
Comment 10 Keith Busch 2018-05-17 22:31:54 UTC
It's for the state machine, and to ensure we don't have two different threads trying to reset the same controller at the same time; otherwise initialization would surely fail.
Comment 11 Sinan Kaya 2018-05-18 00:35:30 UTC
Alex,

Can you put some trace_printk("...") calls into the blk_execute_rq(), blk_end_sync_rq() and blk_queue_enter() functions and share what you capture at the moment of failure?

For each blk_execute_rq(), blk_end_sync_rq() needs to get called. Let's see whether this is happening.
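Something along these lines, assuming the 4.17 signatures (only the trace_printk() lines are the addition; exact placement and message format are up to you, and the output ends up in /sys/kernel/debug/tracing/trace):

int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
        trace_printk("blk_queue_enter: q=%p flags=0x%x\n", q, (unsigned int)flags);
        /* ... existing body unchanged ... */
}

void blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
                    struct request *rq, int at_head)
{
        trace_printk("blk_execute_rq: q=%p rq=%p tag=%d\n", q, rq, rq->tag);
        /* ... existing body unchanged ... */
}

static void blk_end_sync_rq(struct request *rq, blk_status_t error)
{
        trace_printk("blk_end_sync_rq: rq=%p error=%d\n", rq, error);
        /* ... existing body unchanged ... */
}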
Comment 12 Sinan Kaya 2018-05-18 02:08:54 UTC
Apparently some new tracing functionality was added to the NVMe driver. It would be good to enable that too.
Comment 13 Alex G. 2018-05-18 19:11:34 UTC
I found out how to enable nvme tracing, but I'm getting spammed with events. I'm not sure how to filter so that I only get events for a specific PCIe drive.
Comment 14 Sinan Kaya 2018-05-19 14:12:04 UTC
Created attachment 276067 [details]
trace per device patch

I attached an untested patch that should let you trace per device. You specify the device you want to trace through the trace_device module parameter at /sys/module/nvme/parameters/trace_device. The code checks the device name against this parameter to decide when to log.

I hope I didn't make any pointer mistakes.
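For readers without the attachment, the mechanism is roughly the sketch below; the attached patch is the authoritative version, and the helper name nvme_should_trace() is just illustrative:

/* A writable module parameter holding a controller name, checked before
 * emitting a trace event. */
static char trace_device[32];
module_param_string(trace_device, trace_device, sizeof(trace_device), 0644);
MODULE_PARM_DESC(trace_device,
                 "Only emit nvme trace events for this controller (e.g. nvme2); empty means all");

static bool nvme_should_trace(struct nvme_ctrl *ctrl)
{
        if (!trace_device[0])           /* empty string: trace everything */
                return true;
        return !strcmp(dev_name(ctrl->device), trace_device);
}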
Comment 15 Sinan Kaya 2018-05-23 02:01:24 UTC
Where are we on this debug?
Comment 16 Alex G. 2018-05-23 13:30:47 UTC
We're in Houston, TX, and Round Rock, TX.
Comment 17 Sinan Kaya 2018-05-23 13:31:49 UTC
Nice charm, I hope the weather is good there.
Comment 18 Alex G. 2018-05-23 19:18:00 UTC
Created attachment 276155 [details]
[v0002] Trace with trace_device nvme2
Comment 19 Alex G. 2018-05-23 19:18:42 UTC
Created attachment 276157 [details]
[v0002] stack of allegedly hung processes
Comment 20 Keith Busch 2018-05-23 19:20:21 UTC
PID 1049 is the only process of interest in the attachment.
Comment 21 Alex G. 2018-05-23 19:21:13 UTC
Anything else you'd like me to stacktrace before I reboot?
Comment 22 Keith Busch 2018-05-23 19:24:01 UTC
Not quite sure yet.

  kworker/u162:1-1049  [027] ....   280.943367: nvme_setup_admin_cmd:  cmdid=12, flags=0x0, meta=0x0, cmd=(nvme_admin_identify cns=1, ctrlid=0)

So PID 1049 says it dispatched the command, which *should* mean the IO timer is ticking, and the driver should attempt to recover this command after 60 seconds (the default timeout) if no response arrives.

I can't imagine why the timer isn't kicking in, but it's definitely weird.
Comment 23 Keith Busch 2018-05-23 20:06:08 UTC
Oh, I know why nothing is happening: the PCI channel is offline, which disables the nvme timeout handler. We definitely can't allow that ... I blame the PowerPC people for this one.

Stay tuned, I'll get a fix soon.
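For context, this refers to the early exit in the driver's timeout handler, paraphrased below (not a verbatim quote of drivers/nvme/host/pci.c): while AER has the channel marked offline, every timeout just gets re-armed, so the stuck admin command from comment 22 is never reclaimed and the synchronous reset in .slot_reset never finishes.

static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
        struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
        struct nvme_dev *dev = iod->nvmeq->dev;

        /*
         * If PCI error recovery is in progress, don't reset from here;
         * just push the deadline out and try again later.
         */
        if (pci_channel_offline(to_pci_dev(dev->dev)))
                return BLK_EH_RESET_TIMER;

        /* ... normal abort/reset handling elided ... */
        return BLK_EH_HANDLED;
}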
Comment 24 Keith Busch 2018-05-23 20:11:30 UTC
I'm guessing the reason the timeout occurred is that you're still injecting errors, and you happened to catch another error while the recovery was happening. So we're very much relying on the timeout handler in this case, but it's disabled, so we need to allow it to proceed with the disabling.

Unfortunately, fixing this deadlock means the driver's controller initialization will fail, so the driver will unbind from the device. I suppose that's better than a deadlock.
Comment 25 Keith Busch 2018-05-23 20:28:18 UTC
Created attachment 276159 [details]
fix for deadlock

This should fix the deadlock, but will fail to initialize the controller under the conditions this test is creating.
Comment 26 Sinan Kaya 2018-05-23 20:49:43 UTC
The driver will get an error_resume callback when the link is restored. You may want to move this there.
Comment 27 Keith Busch 2018-05-23 20:50:49 UTC
Yep, that's what the patch in comment 25 does.
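For readers without the attachment, the shape of the change is roughly the following (a sketch of the approach from comments 25 and 26; the attachment is the authoritative patch): .slot_reset merely schedules the reset, and .resume waits for it once the link is usable again, so AER recovery no longer blocks behind a command that only the timeout handler can reclaim.

static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
{
        struct nvme_dev *dev = pci_get_drvdata(pdev);

        pci_restore_state(pdev);
        nvme_reset_ctrl(&dev->ctrl);            /* asynchronous: just queue the work */
        return PCI_ERS_RESULT_RECOVERED;
}

static void nvme_error_resume(struct pci_dev *pdev)
{
        struct nvme_dev *dev = pci_get_drvdata(pdev);

        flush_work(&dev->ctrl.reset_work);      /* wait after the link is back up */
}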
Comment 28 Alex G. 2018-05-23 23:21:13 UTC
Created attachment 276161 [details]
fix which also compiles
Comment 29 Keith Busch 2018-05-24 02:20:23 UTC
thx
Comment 30 Keith Busch 2018-05-24 14:31:24 UTC
(In reply to Alex G. from comment #28)
> Created attachment 276161 [details]
> fix which also compiles

You're okay with me posting this for mainline consideration?
Comment 31 Alex G. 2018-05-24 15:05:28 UTC
Oh carp. My other comment from yesterday failed to get submitted.

I'm fine with the patch as is, though I have a more general concern about taking so damn long to recover.

Something as simple as doing an 'ls' or 'tree' in a directory on an mdraid set where one drive is about to fail can hang things. Normally you'd expect the drive to fall out of the mdraid set and things to continue, but when we hit this path, the entire array freezes up and all the userspace processes are stuck. The array stays stuck until the timeout finally kicks the drive out.

I'm not sure we want a RAID set to get stuck for this long because of one misbehaving drive. You also had some patches to hold the PCI topology lock during AER recovery. When recovery goes smoothly, everything's hunky-dory, but when we wait this long, we could prevent other errors from being serviced -- in completely different subnodes of the tree.

Sure, a malformed TLP is a very synthetic error, but I imagine we could hit the same issue naturally. Even with a simple timeout, I suspect we'd eventually figure out that the drive isn't responding and yank it out, absent any AER reporting. With AER, we can know ahead of time that things look dire. So the only thing AER does in this case is hold more locks and create more headaches; we're not really using the information from AER to improve on the timeout-only case.

The patch is great. Recovering eventually is miles ahead of never. I'm just not confident in how everything interacts.
Comment 32 Keith Busch 2018-05-24 15:36:39 UTC
We have module parameters you can tune if you want things to fail faster. The defaults are insanely high.
Comment 33 Alex G. 2018-05-24 15:54:06 UTC
From experience, users who go as far as tweaking kernel parameters tend to have custom kernels and build their own servers anyway.

The majority of users are not going to go tweaking knobs. My concern is that, from a platform vendor perspective, it might make more sense to "handle things in firmware" if OSes "can't seem to get it right".

Example: a number of modern Dell server platforms hide most AER errors from the OS and just hope the OS will use the hotplug interrupts to figure out what's going on. The sad part is, this strategy is working: Windows doesn't BSOD, Linux doesn't cry, and I don't know what ESXi and the others are doing, but they're pretty happy too. There's a lot of extra silicon going into those servers just to know when it's okay to hide errors from crashy OSes.

My hope is that, if we can get it right in Linux, the argument that "it pays not to be in Redmond, WA, and to follow good programming practice" may actually work for once. We've come a good way in the past few months. Although the better feature for handling everyone's needs will likely be DPC, it's still good to make the AER paths saner and better.
Comment 34 Keith Busch 2018-05-24 17:09:36 UTC
It's a bit risky to change the defaults at this point. They've been this way for so long that, even though they're insanely high, we may find someone was relying on them being that way.
Comment 35 Alex G. 2018-05-24 17:12:51 UTC
I'd like to find a sane way to change the overall behavior, not just mess around with a default. Non-thoughtful example: on AER errors, why can't we just trigger the timeout to kill the transaction regardless?

Speaking of "been this way for so long":
http://www.wisdompills.com/2014/05/28/the-famous-social-experiment-5-monkeys-a-ladder/
Comment 36 Sinan Kaya 2018-05-24 17:17:20 UTC
Note that the 4.18 kernel has Oza's patch included. The AER fatal error handling is changing.

The code will do a remove/probe instead of a recovery. I suggest you give it a try.
Comment 37 Keith Busch 2018-05-24 17:22:15 UTC
(In reply to Alex G. from comment #35)
> Non-thoughtful example: on AER errors, why can't
> we just trigger the timeout to kill the transaction regardless?

That's exactly what we do. The problem with AER is that errors are handled serially, so if you get another AER while we're trying to recover from the first, it manifests as a timeout because the driver has no way of being notified that the second AER even occurred.