Created attachment 275841 [details] lspci -tvnn

I'm seeing what appears to be a deadlock in the AER recovery path. It seems that the device_lock() call in report_slot_reset() never returns.

## How to reproduce:

The error is injected by changing the Max Payload Size on the switch upstream port from 256 to 128 bytes. With I/O running to the drive, the switch sees a Malformed TLP, reports an error, and AER handles it. More I/O triggers another error, this time reported by the root port. AER recovery then reaches the NVMe drive behind the affected upstream port and deadlocks.
Created attachment 275843 [details] pci_aer_print_hacks.patch
Created attachment 275845 [details] dmesg of failure
Created attachment 275847 [details] dmesg - near miss - no deadlock
Created attachment 275851 [details] lspci-vv

lspci -vv
By popular request, how to change the Max Payload Size. Example:

# lspci -s 0000:b0:04.0 -vv | grep Capabilities | grep Express
        Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00

Here the Express capability is at 0x68; the Device Control register is 8 bytes after that, so at [0x70].

# setpci -s 0000:b0:04.0 70.B
3e
# setpci -s 0000:b0:04.0 70.B=1e    # set to 128
# setpci -s 0000:b0:04.0 70.B=3e    # change back to 256
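For completeness, this is roughly what the same poke looks like from kernel code; an untested sketch using the generic helpers from include/linux/pci.h, not something from this bug's patches:

#include <linux/pci.h>

/*
 * Max_Payload_Size lives in the Device Control register (offset
 * PCI_EXP_DEVCTL from the Express capability), bits 7:5:
 * 000b = 128 bytes, 001b = 256 bytes, and so on.  The 0x1e vs. 0x3e
 * values in the setpci example above differ only in this field.
 */
static void set_mps(struct pci_dev *dev, bool mps_128)
{
	u16 payload = (mps_128 ? 0 : 1) << 5;	/* 128B or 256B */

	pcie_capability_clear_and_set_word(dev, PCI_EXP_DEVCTL,
					   PCI_EXP_DEVCTL_PAYLOAD, payload);
}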
Interesting, I always thought shpchp was a thing of the past. My patch only works if slot_reset is implemented by the hotplug implementation here:
https://elixir.bootlin.com/linux/v4.17-rc4/source/drivers/pci/pci.c#L4217
and it is only implemented by the pciehp driver. Your system seems to be using the shpchp driver instead.

pcieport 0000:b0:04.0
nvme 0000:b1:00.0
0000:ae:00.0 0000:af:00.0 0000:b0:04.0

85:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04) (prog-if 00 [Normal decode])
	Physical Slot: 5
	Kernel driver in use: pcieport
	Kernel modules: shpchp
ae:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
af:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
b0:04.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
b0:05.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
b0:06.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
b0:07.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
d7:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
d8:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
d9:04.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
d9:05.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
d9:06.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp
d9:07.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev aa) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	Kernel modules: shpchp

+-[0000:ae]-+-00.0-[af-b4]----00.0-[b0-b4]--+-04.0-[b1]----00.0  Intel Corporation Express Flash NVMe P4600 [8086:0a55]
|           |                               +-05.0-[b2]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller 172Xa/172Xb [144d:a822]
|           |                               +-06.0-[b3]--
|           |                               \-07.0-[b4]--
Can you also provide a dmesg after reboot? One of your logs shows the pciehp driver. I'm confused now.
I don't think Alex's system is actually using the shpchp driver. I think the lspci "Kernel modules:" line is showing us modules that can *potentially* claim the device, based on the MODULE_DEVICE_TABLE. For shpchp, that includes all PCI bridge devices.
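For illustration, shpchp's PCI ID table matches on the PCI-to-PCI bridge class code, which is why every bridge lists it as a candidate module. Quoted from memory of drivers/pci/hotplug/shpchp_core.c, so the exact form may differ slightly:

/* shpchp binds by class code, i.e. any PCI-to-PCI bridge is a candidate */
static const struct pci_device_id shpcd_pci_tbl[] = {
	{ PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x00), ~0) },
	{ /* end: all zeroes */ }
};
MODULE_DEVICE_TABLE(pci, shpcd_pci_tbl);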
What the heck is shpchp, and why the heck does lspci think anything of it? The driver in use is pcieport; I have no idea why the module name has any relevance.

[root@g-prime linux]# ls /sys/bus/pci/drivers/shpchp/
bind  module  new_id  remove_id  uevent  unbind
[root@g-prime linux]# ls /sys/bus/pci_express/drivers/pciehp/
0000:00:1c.0:pcie004  0000:3c:06.0:pcie204  0000:b0:05.0:pcie204  0000:d9:04.0:pcie204  0000:d9:07.0:pcie204  unbind
0000:3c:04.0:pcie204  0000:3c:07.0:pcie204  0000:b0:06.0:pcie204  0000:d9:05.0:pcie204  bind
0000:3c:05.0:pcie204  0000:b0:04.0:pcie204  0000:b0:07.0:pcie204  0000:d9:06.0:pcie204  uevent
OK, so it is a hiccup in the lspci tool.

pcieport 0000:b0:07.0: HACK: report_error_detected: locked and ready
pcieport 0000:b0:07.0: HACK: report_error_detected: Unlocked and DONE
nvme nvme2: Removing after probe failure status: -19
pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up

This log seems suspicious to me. The "nvme nvme2: Removing after probe failure" message comes from nvme_remove_dead_ctrl(), which gets called from the remove() callback of the driver. Nobody should call remove() in this case, since you observed a fatal error and you don't have Oza's patch yet; upstream fatal error handling currently relies on the secondary bus reset and the error callbacks for recovery. I'm still betting on the hotplug link-down handler calling the remove() function. While that removal is in progress, the link comes back up, and then you have funny things going on.

Can you place a dump_stack() call into nvme_remove_dead_ctrl()? Let's see where this remove() is coming from.
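The instrumentation would be just one debug line in drivers/nvme/host/pci.c; a sketch, with the rest of the function left as it is in v4.17-rc4:

static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
{
	dev_warn(dev->ctrl.device,
		 "Removing after probe failure status: %d\n", status);
	dump_stack();	/* debug only: who decided to tear the controller down? */

	/* ... remainder of the function unchanged ... */
}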
Damn it! It's a worker_thread, but no clue who queued it up:

[Wed May 9 09:41:04 2018] nvme nvme2: Removing after probe failure status: -19
[Wed May 9 09:41:04 2018] CPU: 31 PID: 974 Comm: kworker/u162:1 Not tainted 4.17.0-rc4+ #28
[Wed May 9 09:41:04 2018] Hardware name: Dell Inc. PowerEdge R740xd/07X9K0, BIOS 1.4.4 [OS AER Native 2] 03/09/2018
[Wed May 9 09:41:04 2018] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[Wed May 9 09:41:04 2018] Call Trace:
[Wed May 9 09:41:04 2018]  dump_stack+0x5c/0x7b
[Wed May 9 09:41:04 2018]  nvme_reset_work+0x98/0x16c0 [nvme]
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to+0xa2/0x4c0
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x34/0x70
[Wed May 9 09:41:04 2018]  ? __switch_to_asm+0x40/0x70
[Wed May 9 09:41:04 2018]  process_one_work+0x175/0x360
[Wed May 9 09:41:04 2018]  worker_thread+0x2e/0x380
[Wed May 9 09:41:04 2018]  ? process_one_work+0x360/0x360
[Wed May 9 09:41:04 2018]  kthread+0x113/0x130
[Wed May 9 09:41:04 2018]  ? kthread_create_worker_on_cpu+0x70/0x70
[Wed May 9 09:41:04 2018]  ret_from_fork+0x35/0x40
[Wed May 9 09:41:04 2018] nvme1n1: detected capacity change from 3200631791616 to 0
Created attachment 275877 [details] log-20180509-0940

Log showing where the last dmesg snippet came from.
Can you move dump_stack() to the top of the nvme_remove() function instead? nvme_reset_work() seems to get scheduled there.
I am working on it, and it appears I can't reproduce the race with a dump_stack() in there. Here's a quick peek at the dump:

[ 618.980217]  nvme_remove+0x1a/0x130 [nvme]
[ 618.980227]  pci_device_remove+0x3b/0xb0
[ 618.980239]  device_release_driver_internal+0x15a/0x220
[ 619.113818]  nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
[ 619.113823]  process_one_work+0x175/0x360
[ 619.113825]  worker_thread+0x2e/0x380
[ 619.113827]  ? process_one_work+0x360/0x360
[ 619.113830]  kthread+0x113/0x130
[ 619.113831]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 619.113834]  ret_from_fork+0x35/0x40
It looks like the nvme driver's asynchronous reset might be racing with the rest of the AER recovery: a second AER event occurring during that async reset breaks other things. In nvme_slot_reset(), if we replace nvme_reset_ctrl() with nvme_reset_ctrl_sync(), that should prevent subsequent AER handling from resetting links until the NVMe driver is ready for it. Otherwise, such disruptions during a reset can cause the driver to fail to bring up the controller, which makes it unbind itself.
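A sketch of the suggested one-line change in drivers/nvme/host/pci.c; the surrounding function body is quoted from memory of v4.17-rc4, so treat it as approximate:

static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	dev_info(dev->ctrl.device, "restart after slot reset\n");
	pci_restore_state(pdev);
	nvme_reset_ctrl_sync(&dev->ctrl);	/* was: nvme_reset_ctrl() */
	return PCI_ERS_RESULT_RECOVERED;
}

This makes slot_reset() block until the controller is fully reinitialized, so the AER core cannot start another recovery pass while the reset is still in flight.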
I'm glad Keith chimed in. We posted at the same time.

I traced it back to here:
https://elixir.bootlin.com/linux/v4.17-rc4/source/drivers/pci/pcie/aer/aerdrv_core.c#L513

This is a broadcast to all drivers saying that the PCIe link has been reset via secondary bus reset and it is now time to restore functionality. It turns into nvme_reset_ctrl() in the nvme_slot_reset() function. nvme_reset_ctrl() queues the reset_work task on the nvme-reset-wq workqueue, which then runs nvme_reset_work() and eventually ends up in the call stack you showed.

IMO, what is wrong is for the NVMe driver to remove the device object. It should have done a plain recovery without touching any of the device objects. Even the print is wrong: it thinks something went wrong during probe, which is not the case.

Can you put some prints into the nvme_reset_work() function to see which goto statement took us there?
https://elixir.bootlin.com/linux/v4.17-rc4/source/drivers/nvme/host/pci.c#L2393
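One low-effort way to get those prints is a throwaway helper (hypothetical, not anything in the tree) dropped in front of every "goto out" in nvme_reset_work():

/* Hypothetical debug helper; place before each "goto out" in nvme_reset_work() */
#define NVME_RESET_FAIL(dev, result)					\
	dev_warn((dev)->ctrl.device,					\
		 "reset_work bailing out at %s:%d (result %d)\n",	\
		 __func__, __LINE__, (result))

That prints the exact line of the failing goto, which should tell us which step of the reset path sent us to the out label.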
I agree, nvme_reset_ctrl_sync() looks like the better solution. The expectation is that when the driver returns from slot_reset(), it should be functionally ready.
Sigh, the goto disease. Forty years later and people still haven't figured out the 'return' statement. Anyhow, it craps out at nvme_pci_configure_admin_queue().
Keith, I applied your suggested fix, and it appears to have eliminated the deadlock (sample size == 1 test).
Thanks, Alex. Will send this to the list for review.
BTW, there is no better way in the C language to unwind error cases in a function than the 'goto'. :)
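For anyone following along, the idiom being defended is the usual unwind-in-reverse-order pattern; a generic userspace illustration, nothing NVMe-specific:

#include <errno.h>
#include <stdlib.h>

static int setup(void)
{
	char *a, *b;
	int ret = 0;

	a = malloc(64);
	if (!a)
		return -ENOMEM;

	b = malloc(64);
	if (!b) {
		ret = -ENOMEM;
		goto free_a;	/* unwind only what was already set up */
	}

	/* ... use a and b ... */

	free(b);
free_a:
	free(a);
	return ret;
}

int main(void)
{
	return setup() ? 1 : 0;
}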
Challenge accepted, Keith. Will send to the list for review.
Now that this issue is resolved, can we close this?
Indeed.