Bug 208855

Summary: Lockdep warning in PCI-e hotplug code
Product: Drivers
Reporter: Hans de Goede (jwrdegoede)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: RESOLVED CODE_FIX    
Severity: normal    
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.8.0
Subsystem:
Regression: No
Bisected commit-id:

Description Hans de Goede 2020-08-09 18:00:35 UTC
When I connect my X1C8 (running a 5.8.0 kernel with lockdep enabled) to a Lenovo 2nd gen Thunderbolt dock, I get the following lockdep warning:

[  139.754540] pcieport 0000:06:01.0: PCI bridge to [bus 08-2c]
[  139.754545] pcieport 0000:06:01.0:   bridge window [io  0x4000-0x4fff]
[  139.754552] pcieport 0000:06:01.0:   bridge window [mem 0xdc100000-0xe80fffff]
[  139.754558] pcieport 0000:06:01.0:   bridge window [mem 0xa0000000-0xbfffffff 64bit pref]
[  139.754856] pcieport 0000:08:00.0: enabling device (0000 -> 0003)
[  139.755955] pcieport 0000:09:02.0: enabling device (0000 -> 0003)
[  139.757097] pcieport 0000:09:04.0: enabling device (0000 -> 0002)
[  139.757622] pcieport 0000:09:04.0: pciehp: Slot #4 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+

[  139.759255] ============================================
[  139.759257] WARNING: possible recursive locking detected
[  139.759259] 5.8.0+ #16 Tainted: G            E    
[  139.759260] --------------------------------------------
[  139.759261] irq/125-pciehp/143 is trying to acquire lock:
[  139.759263] ffff95ee9f3d1f38 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_check_presence+0x23/0x80
[  139.759269] 
               but task is already holding lock:
[  139.759270] ffff95eee497e738 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xdf/0x120
[  139.759274] 
               other info that might help us debug this:
[  139.759275]  Possible unsafe locking scenario:

[  139.759276]        CPU0
[  139.759277]        ----
[  139.759278]   lock(&ctrl->reset_lock);
[  139.759279]   lock(&ctrl->reset_lock);
[  139.759281] 
                *** DEADLOCK ***

[  139.759282]  May be due to missing lock nesting notation

[  139.759283] 4 locks held by irq/125-pciehp/143:
[  139.759284]  #0: ffff95eee497e738 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xdf/0x120
[  139.759288]  #1: ffffffffa2a25e70 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pciehp_configure_device+0x22/0x110
[  139.759291]  #2: ffff95eec9a9a240 (&dev->mutex){....}-{3:3}, at: __device_attach+0x25/0x170
[  139.759296]  #3: ffff95ee9f0b19b0 (&dev->mutex){....}-{3:3}, at: __device_attach+0x25/0x170
[  139.759299] 
               stack backtrace:
[  139.759301] CPU: 5 PID: 143 Comm: irq/125-pciehp Tainted: G            E     5.8.0+ #16
[  139.759303] Hardware name: LENOVO 20U90SIT19/20U90SIT19, BIOS N2WET16W (1.06 ) 05/10/2020
[  139.759304] Call Trace:
[  139.759310]  dump_stack+0x92/0xc8
[  139.759314]  __lock_acquire.cold+0x121/0x296
[  139.759319]  lock_acquire+0xa4/0x3d0
[  139.759321]  ? pciehp_check_presence+0x23/0x80
[  139.759327]  down_read+0x45/0x130
[  139.759329]  ? pciehp_check_presence+0x23/0x80
[  139.759331]  pciehp_check_presence+0x23/0x80
[  139.759333]  pciehp_probe+0x156/0x1a0
[  139.759337]  pcie_port_probe_service+0x31/0x50
[  139.759339]  really_probe+0x2d4/0x410
[  139.759342]  driver_probe_device+0xe1/0x150
[  139.759344]  ? driver_allows_async_probing+0x50/0x50
[  139.759346]  bus_for_each_drv+0x6d/0xa0
[  139.759349]  __device_attach+0xe4/0x170
[  139.759352]  bus_probe_device+0x9f/0xb0
[  139.759354]  device_add+0x389/0x810
[  139.759356]  ? __init_waitqueue_head+0x45/0x60
[  139.759359]  pcie_port_device_register+0x296/0x520
[  139.759363]  ? disable_irq_nosync+0x10/0x10
[  139.759365]  pcie_portdrv_probe+0x2d/0xb0
[  139.759368]  local_pci_probe+0x42/0x80
[  139.759371]  pci_device_probe+0xd9/0x190
[  139.759374]  really_probe+0x167/0x410
[  139.759377]  ? disable_irq_nosync+0x10/0x10
[  139.759379]  driver_probe_device+0xe1/0x150
[  139.759381]  ? driver_allows_async_probing+0x50/0x50
[  139.759383]  bus_for_each_drv+0x6d/0xa0
[  139.759386]  __device_attach+0xe4/0x170
[  139.759389]  pci_bus_add_device+0x4b/0x70
[  139.759391]  pci_bus_add_devices+0x2c/0x70
[  139.759394]  pci_bus_add_devices+0x57/0x70
[  139.759396]  pciehp_configure_device+0x92/0x110
[  139.759399]  pciehp_handle_presence_or_link_change+0x17b/0x2a0
[  139.759402]  pciehp_ist+0x116/0x120
[  139.759404]  irq_thread_fn+0x20/0x60
[  139.759407]  ? irq_thread+0x8c/0x1b0
[  139.759409]  irq_thread+0xf0/0x1b0
[  139.759412]  ? irq_finalize_oneshot.part.0+0xd0/0xd0
[  139.759414]  ? irq_thread_check_affinity+0xb0/0xb0
[  139.759417]  kthread+0x138/0x160
[  139.759419]  ? kthread_create_worker_on_cpu+0x40/0x40
[  139.759423]  ret_from_fork+0x1f/0x30

Comment 1 Hans de Goede 2020-08-09 19:01:10 UTC
On the linux-pci list Lukas Wunner wrote the following about this:

False positive, the reset_lock is per-controller and multiple
instances of the lock are held concurrently because pciehp
controllers are nested with Thunderbolt.

This was already reported by Theodore Ts'o:
https://lore.kernel.org/linux-pci/20190402021933.GA2966@mit.edu/

So the issue is on my radar and I have some ideas how to fix it.
Let me get back to you with a solution later.  In the meantime,
thank you for the report.
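
To illustrate why lockdep flags this even though two different locks are involved, here is a minimal userspace sketch. It is a toy model under stated assumptions, not the kernel's lockdep code: `toy_lock`, `toy_acquire` and the held-lock array are hypothetical names invented for this example. The one property it borrows from lockdep is that checks are keyed on the lock class (shared by all locks initialised at the same place) rather than on the individual lock instance.

```c
#include <assert.h>
#include <string.h>

/* Toy model only; this is not the kernel's lockdep implementation. */
struct toy_lock {
    const char *class_name; /* shared by every lock initialised at the same call site */
};

#define TOY_MAX_HELD 8
static const struct toy_lock *toy_held[TOY_MAX_HELD];
static int toy_nheld;

/*
 * Record an acquisition. Returns 1 if a lock of the same class is already
 * held (what the toy would report as "possible recursive locking"), else 0.
 */
static int toy_acquire(const struct toy_lock *lock)
{
    int flagged = 0;

    assert(toy_nheld < TOY_MAX_HELD);
    for (int i = 0; i < toy_nheld; i++) {
        if (strcmp(toy_held[i]->class_name, lock->class_name) == 0)
            flagged = 1; /* same class, even if it is a different instance */
    }
    toy_held[toy_nheld++] = lock;
    return flagged;
}
```

Acquiring two distinct `toy_lock` objects that share the class name `&ctrl->reset_lock` makes the second call return 1. That mirrors the report above, where the held lock (ffff95eee497e738) and the lock being acquired (ffff95ee9f3d1f38) are different instances of the same class.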

Comment 2 Hans de Goede 2022-01-12 12:35:59 UTC
This is fixed by this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?id=42a46c70045915bcbdced3e694dc5825d124fb5c

Which should get merged into 5.17-rc1 soon-ish, closing.

Comment 3 Hans de Goede 2022-01-12 13:27:27 UTC
Note that the branch containing the commit from comment 2 was just rebased, so that commit hash will likely stop working eventually.

So for future reference the commit fixing this has the following title/subject:

"PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors"
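
The kernel's `down_read_nested()` and `down_write_nested()` take a lockdep subclass as their second argument, which lets nested instances of one lock class be told apart. The sketch below shows one plausible way such a subclass could be derived: by counting the hotplug-capable bridges above a controller. This is an assumption made for illustration and is not confirmed by this report; `toy_bridge`, `toy_hotplug_depth` and `toy_would_warn` are hypothetical userspace names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCIe bridge; not a kernel structure. */
struct toy_bridge {
    const struct toy_bridge *parent; /* upstream bridge, NULL at the top */
    int is_hotplug;                  /* nonzero if the bridge has a hotplug controller */
};

/* Count the hotplug-capable bridges above b; the result serves as the subclass. */
static int toy_hotplug_depth(const struct toy_bridge *b)
{
    int depth = 0;

    assert(b != NULL);
    for (const struct toy_bridge *p = b->parent; p != NULL; p = p->parent) {
        if (p->is_hotplug)
            depth++;
    }
    return depth;
}

/* Toy recursion check: two locks of one class only clash if the subclass matches. */
static int toy_would_warn(int held_subclass, int new_subclass)
{
    return held_subclass == new_subclass;
}
```

With a topology shaped like the one in the log (a hotplug port such as 0000:06:01.0, a non-hotplug switch port below it, then a hotplug port such as 0000:09:04.0), the outer controller gets depth 0 and the inner one depth 1, so the toy check no longer reports a clash. Note that lockdep supports only a small fixed number of subclasses (8), so a depth used this way has to stay below that limit.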