Bug 199657

Summary: AER recovery deadlocks with Malformed TLP and NVMe drives
Product: Drivers
Component: PCI
Assignee: drivers_pci (drivers_pci)
Reporter: Alex G. (mr.nuke.me)
Status: NEW
Severity: normal
Priority: P1
CC: bjorn, kbusch, okaya
Hardware: All
OS: Linux
Kernel Version: 4.17-rc4
Regression: No
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=199757
Attachments: lspci -tvnn
pci_aer_print_hacks.patch
dmesg of failure
dmesg - near miss - no deadlock
lspci-vv
log-20180509-0940

Description Alex G. 2018-05-08 16:04:26 UTC
Created attachment 275841 [details]
lspci -tvnn

I'm seeing what appears to be a deadlock in the AER recovery path. It seems that the device_lock() call in report_slot_reset() never returns.
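For reference, the AER broadcast path takes the device lock around each driver callback. A simplified paraphrase of report_slot_reset() from drivers/pci/pcie/aer/aerdrv_core.c (4.17-era, not an exact copy; the real function also merges the callback's vote into the broadcast result) shows where the hang sits:

static int report_slot_reset(struct pci_dev *dev, void *data)
{
        const struct pci_error_handlers *err_handler;

        device_lock(&dev->dev);         /* the call observed to never return */
        if (!dev->driver || !dev->driver->err_handler ||
            !dev->driver->err_handler->slot_reset)
                goto out;

        err_handler = dev->driver->err_handler;
        err_handler->slot_reset(dev);   /* e.g. nvme_slot_reset() */
out:
        device_unlock(&dev->dev);
        return 0;
}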

## How to reproduce:

The error is injected by changing the maximum payload size on the switch upstream port from 256 bytes to 128.

When there is I/O to the drive, the switch sees a Malformed TLP and reports the error, which AER handles. More I/O triggers another error, this time reported by the root port. AER recovery then reaches the NVMe drive behind the affected upstream port and deadlocks.
Comment 1 Alex G. 2018-05-08 16:09:10 UTC
Created attachment 275843 [details]
pci_aer_print_hacks.patch
Comment 2 Alex G. 2018-05-08 16:10:36 UTC
Created attachment 275845 [details]
dmesg of failure
Comment 3 Alex G. 2018-05-08 16:12:06 UTC
Created attachment 275847 [details]
dmesg - near miss - no deadlock
Comment 4 Alex G. 2018-05-08 17:18:25 UTC
Created attachment 275851 [details]
lspci-vv

lspci -vv
Comment 5 Alex G. 2018-05-08 17:23:58 UTC
By popular request, an example of how to change the max payload size:

# lspci -s 0000:b0:04.0 -vv |grep Capabilities |grep Express
        Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00

Here the Express capability is at 0x68; the Device Control register is 8 bytes further, so at [0x70] (Max_Payload_Size is bits 7:5 of that byte).

# setpci -s 0000:b0:04.0 70.B
3e
# setpci -s 0000:b0:04.0 70.B=1e # set to 128
# setpci -s 0000:b0:04.0 70.B=3e # change back to 256
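For completeness, a tiny standalone check of what those two byte values encode: the Max_Payload_Size field is bits 7:5 of the low Device Control byte, and the payload size is 128 << field. Nothing here is specific to this system.

#include <stdio.h>

int main(void)
{
        /* The two bytes written with setpci above. */
        unsigned char vals[] = { 0x3e, 0x1e };

        for (int i = 0; i < 2; i++) {
                unsigned int mps_field = (vals[i] >> 5) & 0x7;  /* bits 7:5 */
                printf("0x%02x -> MPS field %u -> %u bytes\n",
                       (unsigned int)vals[i], mps_field, 128u << mps_field);
        }
        return 0;
}

This prints 256 bytes for 0x3e and 128 bytes for 0x1e, matching the comments on the setpci commands above.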
Comment 6 Sinan Kaya 2018-05-08 22:23:40 UTC
Interesting; I always thought shpchp was a thing of the past. My patch only works if slot_reset is implemented by the hotplug implementation here:

https://elixir.bootlin.com/linux/v4.17-rc4/source/drivers/pci/pci.c#L4217

and it is only implemented by the pciehp driver. Your system seems to be using the shpchp driver instead.

pcieport 0000:b0:04.0
nvme 0000:b1:00.0
0000:ae:00.0
0000:af:00.0
0000:b0:04.0



85:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04) (prog-if 00 [Normal decode])
	Physical Slot: 5
	Kernel driver in use: pcieport
	Kernel modules: shpchp

ae:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

af:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

b0:04.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

b0:05.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

b0:06.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

b0:07.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

d7:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

d8:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

d9:04.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

d9:05.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

d9:06.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

d9:07.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

+-[0000:ae]-+-00.0-[af-b4]----00.0-[b0-b4]--+-04.0-[b1]----00.0  Intel Corporation Express Flash NVMe P4600 [8086:0a55]
 |           |                               +-05.0-[b2]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller 172Xa/172Xb [144d:a822]
 |           |                               +-06.0-[b3]--
 |           |                               \-07.0-[b4]--
Comment 7 Sinan Kaya 2018-05-08 22:58:04 UTC
Can you also provide a dmesg after reboot? One of your logs shows the pciehp driver.

I'm confused now.
Comment 8 Bjorn Helgaas 2018-05-08 22:58:54 UTC
I don't think Alex's system is actually using the shpchp driver.  I think the lspci "Kernel modules:" line is showing us modules that can *potentially* claim the device, based on the MODULE_DEVICE_TABLE.  For shpchp, that includes all PCI bridge devices.
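For reference, shpchp matches by class rather than by specific device IDs, which is why it shows up for every bridge. Paraphrased from drivers/pci/hotplug/shpchp_core.c (4.17-era):

/* A class-based match on PCI-to-PCI bridges: lspci lists shpchp as a
 * potential module for every bridge, even when pcieport is what is bound. */
static const struct pci_device_id shpcd_pci_tbl[] = {
        { PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x00), ~0) },
        { /* end: all zeroes */ }
};
MODULE_DEVICE_TABLE(pci, shpcd_pci_tbl);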
Comment 9 Alex G. 2018-05-08 22:59:31 UTC
What the heck is shpchp, and why the heck does lspci think anything of it?
The driver in use is pcieport; I have no idea why the module name has any relevance.

[root@g-prime linux]# ls /sys/bus/pci/drivers/shpchp/
bind  module  new_id  remove_id  uevent  unbind
[root@g-prime linux]# ls /sys/bus/pci_express/drivers/pciehp/
0000:00:1c.0:pcie004  0000:3c:06.0:pcie204  0000:b0:05.0:pcie204  0000:d9:04.0:pcie204  0000:d9:07.0:pcie204  unbind
0000:3c:04.0:pcie204  0000:3c:07.0:pcie204  0000:b0:06.0:pcie204  0000:d9:05.0:pcie204  bind
0000:3c:05.0:pcie204  0000:b0:04.0:pcie204  0000:b0:07.0:pcie204  0000:d9:06.0:pcie204  uevent
Comment 10 Sinan Kaya 2018-05-09 08:37:31 UTC
OK, so it is a hiccup in the lspci tool.

 pcieport 0000:b0:07.0: HACK: report_error_detected: locked and ready 
 pcieport 0000:b0:07.0: HACK: report_error_detected: Unlocked and DONE 
 nvme nvme2: Removing after probe failure status: -19 
 pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up

This log seems suspicious to me. 

The "nvme nvme2: Removing after probe failure" message comes from nvme_remove_dead_ctrl() which gets called from the remove() callback of the driver.

Nobody should be calling remove() in this case, since you observed a fatal error and you don't have Oza's patch yet. At the moment, upstream fatal error handling relies on the secondary bus reset and the error callbacks for recovery.

I'm still betting on the hotplug link-down handler calling the remove() function. While that removal is in progress, the link comes back up, and then funny things happen.

Can you place a  dump_stack() call into nvme_remove_dead_ctrl()?

Let's see where this remove() is coming from.
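A minimal sketch of the debug change being requested; the surrounding function body is abridged, and the dump_stack() line is the only addition:

static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
{
        dev_warn(dev->ctrl.device,
                 "Removing after probe failure status: %d\n", status);
        dump_stack();   /* show the call chain that scheduled the removal */

        /* ... existing disable/remove-work logic continues unchanged ... */
}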
Comment 11 Alex G. 2018-05-09 14:43:14 UTC
Damn it! It's a worker_thread, but no clue who queued it up:

[Wed May  9 09:41:04 2018] nvme nvme2: Removing after probe failure status: -19
[Wed May  9 09:41:04 2018] CPU: 31 PID: 974 Comm: kworker/u162:1 Not tainted 4.17.0-rc4+ #28
[Wed May  9 09:41:04 2018] Hardware name: Dell Inc. PowerEdge R740xd/07X9K0, BIOS 1.4.4 [OS AER Native 2] 03/09/2018
[Wed May  9 09:41:04 2018] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[Wed May  9 09:41:04 2018] Call Trace:
[Wed May  9 09:41:04 2018]  dump_stack+0x5c/0x7b
[Wed May  9 09:41:04 2018]  nvme_reset_work+0x98/0x16c0 [nvme]
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to+0xa2/0x4c0
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May  9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May  9 09:41:04 2018]  process_one_work+0x175/0x360
[Wed May  9 09:41:04 2018]  worker_thread+0x2e/0x380
[Wed May  9 09:41:04 2018]  ? process_one_work+0x360/0x360
[Wed May  9 09:41:04 2018]  kthread+0x113/0x130
[Wed May  9 09:41:04 2018]  ? kthread_create_worker_on_cpu+0x70/0x70
[Wed May  9 09:41:04 2018]  ret_from_fork+0x35/0x40
[Wed May  9 09:41:04 2018] nvme1n1: detected capacity change from 3200631791616 to 0
Comment 12 Alex G. 2018-05-09 14:48:37 UTC
Created attachment 275877 [details]
log-20180509-0940

Log showing where the last dmesg snippet came from.
Comment 13 Sinan Kaya 2018-05-09 21:10:32 UTC
Can you move the dump_stack() call to the top of the nvme_remove() function instead? nvme_reset_work() seems to get scheduled there.
Comment 14 Alex G. 2018-05-09 21:13:14 UTC
I am working on it, and it appears I can't reproduce the race with a dump_stack() in there. Here's a quick peek at the dump:

[  618.980217]  nvme_remove+0x1a/0x130 [nvme]
[  618.980227]  pci_device_remove+0x3b/0xb0
[  618.980239]  device_release_driver_internal+0x15a/0x220
[  619.113818]  nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
[  619.113823]  process_one_work+0x175/0x360
[  619.113825]  worker_thread+0x2e/0x380
[  619.113827]  ? process_one_work+0x360/0x360
[  619.113830]  kthread+0x113/0x130
[  619.113831]  ? kthread_create_worker_on_cpu+0x70/0x70
[  619.113834]  ret_from_fork+0x35/0x40
Comment 15 Keith Busch 2018-05-09 21:32:59 UTC
It looks like the nvme driver's asynchronous reset might be racing with the rest of the AER recovery, where a second AER occurring during that async reset breaks other things.

In nvme_slot_reset(), if we replace nvme_reset_ctrl() with nvme_reset_ctrl_sync(), that should prevent subsequent AER handling from resetting links until the NVMe driver is ready for it; otherwise such disruptions during a reset can result in the driver failing to bring up the controller, causing it to unbind itself.
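A sketch of that change, assuming the 4.17-era shape of nvme_slot_reset() in drivers/nvme/host/pci.c; the authoritative version is the patch sent to the list:

static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
{
        struct nvme_dev *dev = pci_get_drvdata(pdev);

        dev_info(dev->ctrl.device, "restart after slot reset\n");
        pci_restore_state(pdev);

        /* Synchronous reset: don't return from slot_reset() until the
         * controller is back, so a second AER event can't reset the link
         * while nvme_reset_work() is still running. */
        nvme_reset_ctrl_sync(&dev->ctrl);

        return PCI_ERS_RESULT_RECOVERED;
}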
Comment 16 Sinan Kaya 2018-05-09 21:38:34 UTC
I'm glad Keith chimed in. We posted at the same time.

I traced it back to here:

https://elixir.bootlin.com/linux/v4.17-rc4/source/drivers/pci/pcie/aer/aerdrv_core.c#L513

This is a broadcast to all drivers saying that the PCIe link has been reset via a secondary bus reset and that it is now time to restore functionality.

This ends up as a call to nvme_reset_ctrl() in the nvme_slot_reset() function.

nvme_reset_ctrl() queues the reset_work task on the reset workqueue; it then runs as nvme_reset_work(), which eventually reaches the call stack you showed here.

IMO, what is wrong is for the NVMe driver to remove the device object. It should have done a plain recovery without touching the device objects at all.

Even the message is wrong: it claims something went wrong during probe, which is not the case.

Can you put some prints into the nvme_reset_work() function to see which goto statement took us here?

https://elixir.bootlin.com/linux/v4.17-rc4/source/drivers/nvme/host/pci.c#L2393
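A hand-written illustration of that instrumentation, excerpted and approximate for 4.17; only the dev_warn() lines are additions:

        result = nvme_pci_enable(dev);
        if (result) {
                dev_warn(dev->ctrl.device,
                         "reset_work: nvme_pci_enable failed (%d)\n", result);
                goto out;
        }

        result = nvme_pci_configure_admin_queue(dev);
        if (result) {
                dev_warn(dev->ctrl.device,
                         "reset_work: configure_admin_queue failed (%d)\n",
                         result);
                goto out;
        }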
Comment 17 Sinan Kaya 2018-05-09 21:41:53 UTC
I agree, nvme_reset_ctrl_sync() looks like a better solution. The expectation is that when the driver returns from slot_reset(), the device should be functionally ready.
Comment 18 Alex G. 2018-05-09 22:49:49 UTC
Sigh, the goto disease. Forty years later and people still haven't figured out the 'return' statement.

Anyhow, it craps out at nvme_pci_configure_admin_queue().
Comment 19 Alex G. 2018-05-10 00:15:33 UTC
Keith, I applied your suggested fix, and it appears to have eliminated the deadlock (sample size == 1 test).
Comment 20 Keith Busch 2018-05-10 15:08:54 UTC
Thanks, Alex. Will send this to the list for review.
Comment 21 Keith Busch 2018-05-10 15:31:26 UTC
BTW, there is no better way in the C language to unwind error cases in a function than the 'goto'. :)
Comment 22 Alex G. 2018-05-10 15:33:23 UTC
Challenge accepted, Keith. Will send to the list for review.
Comment 23 Sinan Kaya 2018-05-15 15:35:30 UTC
Now that this issue is resolved, can we close this?
Comment 24 Alex G. 2018-05-15 18:00:35 UTC
Indeed.