Created attachment 290843 [details] Kernel .config

My HP Spectre x360 (Icelake) laptop doesn't successfully hotplug with an HP Thunderbolt 3 dock. I'm using 5.8.0-rc7-next-20200729.

The dock's firmware has been updated from a system running Windows. The HP laptop is using the latest BIOS as of last week, and nvm_version is "80.0":

% cat /sys/devices/pci0000:00/0000:00:0d.2/domain0/0-0/nvm_version
80.0

Cold booting the system with the dock attached provides working ethernet, USB hub, etc. Unplugging and replugging the dock does not work, leaving it only providing power.

Attached are dmesg, lspci -vvnnt output, and /proc/iomem captured (1) at coldboot with the dock attached, (2) after unplugging the dock, (3) after hotplugging the dock, and (4) after hotplugging the dock when it had not been previously attached; and my kernel .config.

For search engines, the most apparent failure in dmesg is:

xhci_hcd 0000:2e:00.0: enabling device (0000 -> 0002)
xhci_hcd 0000:2e:00.0: xHCI Host Controller
xhci_hcd 0000:2e:00.0: new USB bus registered, assigned bus number 5
xhci_hcd 0000:2e:00.0: Host halt failed, -19
xhci_hcd 0000:2e:00.0: can't setup: -19
xhci_hcd 0000:2e:00.0: USB bus 5 deregistered
xhci_hcd 0000:2e:00.0: init 0000:2e:00.0 fail, -19
tg3 0000:2f:00.0: enabling device (0000 -> 0002)
tg3 0000:2f:00.0: phy probe failed, err -19
tg3 0000:2f:00.0: Problem fetching invariants of chip, aborting
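For anyone reading the dmesg excerpt above: the -19 in the xhci_hcd and tg3 messages is a negated Linux errno value, and errno 19 is ENODEV. A minimal Python sketch for looking such codes up follows; the helper name `describe_kernel_errno` is made up for this illustration and is not part of any kernel tooling.

```python
import errno
import os


def describe_kernel_errno(code: int) -> str:
    """Map a kernel-style negative return code (e.g. -19) to its errno name and text."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names on the running platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    # The value reported by both xhci_hcd and tg3 in the log above.
    print(describe_kernel_errno(-19))
```

Both drivers reporting the same "no such device" code is consistent with the devices not responding to register accesses at all, rather than with two unrelated driver bugs.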
Created attachment 290845 [details] coldplugged-lspci
Created attachment 290847 [details] unplugged-lspci
Created attachment 290849 [details] hotplugged-lspci
Created attachment 290851 [details] coldplugged-iomem
Created attachment 290853 [details] unplugged-iomem
Created attachment 290855 [details] hotplugged-iomem
Created attachment 290857 [details] coldplugged-dmesg
Created attachment 290859 [details] unplugged-dmesg
Created attachment 290861 [details] hotplugged-dmesg
Created attachment 290863 [details] hotplugged-not-previously-attached-dmesg
Created attachment 290865 [details] hotplugged-not-previously-attached-lspci
With some help from Ben Widawsky, we noticed that PCI device 2d:04.0 (Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015] [8086:1578]) doesn't get IO space allocated correctly on hotplug:

mattst88@hp-x360 ~ % head working
2d:04.0 PCI bridge [0604]: Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015] [8086:1578] (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 144
Bus: primary=2d, secondary=32, subordinate=32, sec-latency=0
I/O behind bridge: 00005000-00005fff [size=4K]
Memory behind bridge: [disabled]
Prefetchable memory behind bridge: [disabled]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-

mattst88@hp-x360 ~ % head broken
2d:04.0 PCI bridge: Intel Corporation DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C 2015] (prog-if 00 [Normal decode])
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 169
Bus: primary=2d, secondary=32, subordinate=56, sec-latency=0
I/O behind bridge: [disabled]
Memory behind bridge: 68400000-741fffff [size=190M]
Prefetchable memory behind bridge: 0000006000400000-000000601bffffff [size=444M]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-

Additionally, when I do 'echo 1 > /sys/bus/pci/rescan' I see this in dmesg:

[Aug12 15:37] pcieport 0000:2d:04.0: bridge window [io 0x1000-0x0fff] to [bus 32-56] add_size 1000
[ +0.000006] pcieport 0000:2d:04.0: BAR 7: no space for [io size 0x1000]
[ +0.000001] pcieport 0000:2d:04.0: BAR 7: failed to assign [io size 0x1000]
[ +0.000001] pcieport 0000:2d:04.0: BAR 7: no space for [io size 0x1000]
[ +0.000000] pcieport 0000:2d:04.0: BAR 7: failed to assign [io size 0x1000]
[ +0.079302] pci_bus 0000:2e: Allocating resources
[ +0.000017] pci_bus 0000:2f: Allocating resources
[ +0.000016] pci_bus 0000:30: Allocating resources
[ +0.000009] pci_bus 0000:31: Allocating resources
[ +0.000085] pci_bus 0000:2e: Allocating resources
[ +0.000015] pci_bus 0000:2f: Allocating resources
[ +0.000015] pci_bus 0000:30: Allocating resources
[ +0.000010] pci_bus 0000:31: Allocating resources

which seems to corroborate that point. And 2d:04 appears to be a critical device in the tree, according to lspci -t:

+-07.1-[2c-56]----00.0-[2d-56]--+-00.0-[2e]----00.0 ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller [1b21:1142]
|                               +-01.0-[2f]----00.0 Broadcom Inc. and subsidiaries NetXtreme BCM57762 Gigabit Ethernet PCIe [14e4:1682]
|                               +-02.0-[30]--
|                               +-03.0-[31]--
|                               \-04.0-[32-56]--
IO space is not necessary with PCIe devices so that should be fine and expected. However, MMIO resources below the dock PCIe switch upstream port look weird and I can't see any failures in the logs. One thing I can suggest trying is to enable the IOMMU, since ICL kind of expects it to be enabled; in theory, if the BIOS leaves the IOMMU configured, it could manifest like this. Can you set CONFIG_INTEL_IOMMU=y in your .config and try again?
Created attachment 290885 [details] Kernel .config diff enabling CONFIG_INTEL_IOMMU=y I enabled CONFIG_INTEL_IOMMU=y and tried again, but with the same results. :(
Can you attach the full dmesg and the output of 'sudo lspci -vv' here with the IOMMU enabled? Please follow the same steps as before: connect the dock only after you have booted up, and then take the dmesg and lspci.
While you are at it, can you also enable CONFIG_PCI_DEBUG=y before you take the dmesg so we can hopefully see some additional messages?
Created attachment 290899 [details] dmesg hotplug with CONFIG_PCI_DEBUG=y
Created attachment 290901 [details] sudo lspci -vv after hotplug
Created attachment 290903 [details] dmesg coldplug with CONFIG_PCI_DEBUG=y
Created attachment 290905 [details] sudo lspci -vv after coldplug
Thanks for the logs. For some reason the two downstream PCIe ports (2d:00.0 and 2d:01.0) that lead to the xHCI and the NIC get their bridge windows reset to 0 and this prevents drivers from accessing their MMIO registers. I also see that you are not running the mainline kernel so can you take v5.8 vanilla kernel and try that and add "pcie_port_pm=off" to the kernel command line to disable runtime PM of those ports.
(In reply to Mika Westerberg from comment #21)
> Thanks for the logs. For some reason the two downstream PCIe ports (2d:00.0
> and 2d:01.0) that lead to the xHCI and the NIC get their bridge windows
> reset to 0 and this prevents drivers from accessing their MMIO registers. I
> also see that you are not running the mainline kernel so can you take v5.8
> vanilla kernel and try that and add "pcie_port_pm=off" to the kernel command
> line to disable runtime PM of those ports.

Tried with v5.8.1. Was previously using 5.8.0-rc7-next-20200729 because I expected to be asked to test linux-next. Anyway, pcie_port_pm=off didn't help. Neither did pcie_ports=native.

I also tried adding

+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1578, quirk_no_bus_reset);

to drivers/pci/quirks.c on a whim because of something I saw in a google search, but of course that didn't help either.

Is the fact that it works if I attach the dock and then boot the system not indicative of something? Is the BIOS/EFI setup tasked with programming some stuff that the thunderbolt driver might be failing to do?

I just noticed something odd. Coldplugged with the dock working, I can suspend and resume and it will continue working. But if I unplug and replug the dock while the system is suspended, it fails to work after resume. Doesn't that indicate that the thunderbolt firmware is doing something wrong?
When you boot the system with the device connected, it is the BIOS that configures the PCIe devices. When you hot-plug the device to the running system, it is the kernel PCI stack that does the configuration (no Thunderbolt driver is even involved here, it is just plain PCIe). The Linux PCI stack should be able to do this, but for some reason on your particular system it does not work as expected: it succeeds in configuring everything just fine, but immediately afterwards the two downstream PCIe ports lose what was programmed into their bridge window registers. I suspected that runtime PM kicks in here, but apparently that is not the case.
Created attachment 292029 [details] Do not skip resource assignment Probably does not help but just in case, can you try the attached patch and see if it makes any difference? There is one device without PCI class in that system and it should not affect resource allocation of devices behind TBT but better to check.
Nope, no such luck :(
OK, can you then add "initcall_debug" to the command line and try again (with CONFIG_PCI_DEBUG=y as well). Then attach full dmesg.
Created attachment 292189 [details] hotplugged-dmesg failure with CONFIG_PCI_DEBUG=y, initcall_debug, and patch from comment #24
Created attachment 292191 [details] hotplugged-dmesg success with CONFIG_PCI_DEBUG=y, initcall_debug, and patch from comment #24
Created attachment 292193 [details] sudo lspci -vv failure with CONFIG_PCI_DEBUG=y, initcall_debug, and patch from comment #24
Created attachment 292195 [details] sudo lspci -vv success with CONFIG_PCI_DEBUG=y, initcall_debug, and patch from comment #24
Unexpectedly, hotplug worked a few times. I've attached dmesg and sudo lspci -vv output from two hotplug attempts with v5.8.3, CONFIG_PCI_DEBUG=y, initcall_debug, and the patch from comment #24 applied -- one that succeeded and one that failed.

Again we see the same pattern in lspci -vv output:

--- lspci-patched-failure 2020-08-27 12:54:22.300504263 -0700
+++ lspci-patched-success 2020-08-27 12:47:04.525430133 -0700
@@ -654,9 +654,9 @@
  Latency: 0
  Interrupt: pin A routed to IRQ 167
  Bus: primary=2d, secondary=2e, subordinate=2e, sec-latency=0
- I/O behind bridge: 00000000-00000fff [size=4K]
- Memory behind bridge: 00000000-000fffff [size=1M]
- Prefetchable memory behind bridge: 0000000000000000-00000000000fffff [size=1M]
+ I/O behind bridge: 00005000-00005fff [size=4K]
+ Memory behind bridge: 68000000-680fffff [size=1M]
+ Prefetchable memory behind bridge: 0000006000000000-00000060000fffff [size=1M]
  Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
  BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
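Spotting the failing state by eye in a long lspci -vv dump is tedious, so here is a rough Python sketch that lists bridges whose windows start at address 0, as in the failure output above. It assumes the "... behind bridge: base-limit" line format shown in this report (other lspci versions may print it differently), and the function name `find_zeroed_windows` is hypothetical.

```python
import re

# Device header such as "2d:00.0 PCI bridge [0604]: ..." (PCI domain prefix optional).
BRIDGE_RE = re.compile(r"^(?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) PCI bridge")
# Window line such as "Memory behind bridge: 68000000-680fffff [size=1M]".
WINDOW_RE = re.compile(
    r"^\s*(I/O|Memory|Prefetchable memory) behind bridge: ([0-9a-f]+)-([0-9a-f]+)"
)


def find_zeroed_windows(lspci_text):
    """Return (bdf, kind, base, limit) for every bridge window whose base is 0."""
    current = None
    hits = []
    for line in lspci_text.splitlines():
        window = WINDOW_RE.match(line)
        header = BRIDGE_RE.match(line)
        if window:
            base, limit = int(window.group(2), 16), int(window.group(3), 16)
            if current is not None and base == 0:
                hits.append((current, window.group(1), base, limit))
        elif header:
            current = header.group(1)
        elif line and not line[0].isspace():
            # Header of a non-bridge device: stop attributing windows to the last bridge.
            current = None
    return hits


if __name__ == "__main__":
    import sys

    # Example: sudo lspci -vv | python3 this_script.py
    for bdf, kind, base, limit in find_zeroed_windows(sys.stdin.read()):
        print(f"{bdf}: {kind} window {base:#x}-{limit:#x} starts at 0")
```

Windows reported as "[disabled]" are skipped on purpose, since a disabled I/O window is expected for PCIe as noted earlier in this thread; only an enabled window based at 0 is flagged.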
Thanks for the logs. I think I know what is going on. From the failure log:

Registers get saved:

[ 80.292442] pci 0000:2d:00.0: saving config space at offset 0x1c (reading 0x101)
[ 80.292446] pci 0000:2d:00.0: saving config space at offset 0x20 (reading 0x0)
[ 80.292450] pci 0000:2d:00.0: saving config space at offset 0x24 (reading 0x10001)
[ 80.292454] pci 0000:2d:00.0: saving config space at offset 0x28 (reading 0x0)
[ 80.292458] pci 0000:2d:00.0: saving config space at offset 0x2c (reading 0x0)

Resources are assigned:

[ 80.293725] pci 0000:2d:00.0: BAR 8: assigned [mem 0x68000000-0x680fffff]
[ 80.293727] pci 0000:2d:00.0: BAR 9: assigned [mem 0x6000000000-0x60000fffff 64bit pref]
[ 80.293752] pci 0000:2d:00.0: BAR 7: assigned [io 0x5000-0x5fff]
...
[ 80.293803] pci 0000:2d:00.0: PCI bridge to [bus 2e]
[ 80.293807] pci 0000:2d:00.0: bridge window [io 0x5000-0x5fff]
[ 80.293816] pci 0000:2d:00.0: bridge window [mem 0x68000000-0x680fffff]
[ 80.293823] pci 0000:2d:00.0: bridge window [mem 0x6000000000-0x60000fffff 64bit pref]

Note that there is no save happening here. Then shortly after there is register restore:

[ 80.294748] pcieport 0000:2d:00.0: runtime IRQ mapping not provided by arch
[ 80.294830] pcieport 0000:2d:00.0: restoring config space at offset 0x2c (was 0x60, writing 0x0)
[ 80.294835] pcieport 0000:2d:00.0: restoring config space at offset 0x28 (was 0x60, writing 0x0)
[ 80.294839] pcieport 0000:2d:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x10001)
[ 80.294844] pcieport 0000:2d:00.0: restoring config space at offset 0x20 (was 0x68006800, writing 0x0)
                                                                               ^^^^^^^^^^          ^^^
[ 80.294848] pcieport 0000:2d:00.0: restoring config space at offset 0x1c (was 0x5151, writing 0x101)

This ends up clearing the bridge window registers of 2d:00.0 downstream port. I guess this does not happen always because it is dependent on timing.
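The dwords in the save and restore messages above are the window registers of the standard PCI-to-PCI bridge (type 1) configuration header: I/O base/limit at 0x1c, memory base/limit at 0x20, prefetchable base/limit at 0x24, and the prefetchable upper 32 bits at 0x28 and 0x2c. As a rough illustration of how they map to the address ranges lspci prints, here is a Python sketch. The function name `decode_bridge_windows` is made up, and the sketch assumes the I/O upper 16 bits at offset 0x30 are zero, since that register does not appear in the log.

```python
def decode_bridge_windows(dw_1c, dw_20, dw_24, dw_28, dw_2c):
    """Decode type 1 header window dwords into inclusive (base, limit) address pairs."""
    # 0x1c: I/O Base in bits 7:0, I/O Limit in bits 15:8. The top nibble of each
    # byte holds address bits 15:12; bits 31:16 of the dword are Secondary Status.
    # Assumption: the I/O upper 16 bits (offset 0x30) are zero.
    io_base = (dw_1c & 0xF0) << 8
    io_limit = (((dw_1c >> 8) & 0xF0) << 8) | 0xFFF

    # 0x20: Memory Base in bits 15:0, Memory Limit in bits 31:16. The upper
    # 12 bits of each half hold address bits 31:20 (1 MiB granularity).
    mem_base = (dw_20 & 0xFFF0) << 16
    mem_limit = (((dw_20 >> 16) & 0xFFF0) << 16) | 0xFFFFF

    # 0x24: Prefetchable Base/Limit, same layout as 0x20; 0x28 and 0x2c carry
    # the upper 32 bits of the base and the limit respectively.
    pref_base = (dw_28 << 32) | ((dw_24 & 0xFFF0) << 16)
    pref_limit = (dw_2c << 32) | (((dw_24 >> 16) & 0xFFF0) << 16) | 0xFFFFF

    return {
        "io": (io_base, io_limit),
        "mem": (mem_base, mem_limit),
        "pref": (pref_base, pref_limit),
    }


if __name__ == "__main__":
    before = decode_bridge_windows(0x5151, 0x68006800, 0x10001, 0x60, 0x60)  # "was"
    after = decode_bridge_windows(0x101, 0x0, 0x10001, 0x0, 0x0)             # "writing"
    for label, windows in (("before restore", before), ("after restore", after)):
        for kind, (base, limit) in windows.items():
            print(f"{label}: {kind} {base:#x}-{limit:#x}")
```

Fed the "was" values, this yields the ranges the kernel had just assigned (io 0x5000-0x5fff, mem 0x68000000-0x680fffff, pref 0x6000000000-0x60000fffff). Fed the values being written, every window is based at 0, which matches the zeroed windows in the failing lspci -vv output.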
Created attachment 292259 [details] Save PCI bridge state right after setup Can you try the attached hack patch and see if it makes the issue go away? At least then we know that the theory is correct.
Created attachment 292269 [details] hotplugged-dmesg failure with CONFIG_PCI_DEBUG=y, initcall_debug, and patch from comment #33
Created attachment 292271 [details] sudo lspci -vv failure with CONFIG_PCI_DEBUG=y, initcall_debug, and patch from comment #33
Dang, doesn't work, but the lspci output looks like we're getting the right memory addresses (diffing against the attached lspci -vv success output).

--- lspci 2020-08-31 12:16:11.919502424 -0700
+++ lspci-patched-success 2020-08-27 12:47:04.525430133 -0700
@@ -1025,13 +1025,14 @@
  2e:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller [1b21:1142] (prog-if 30 [XHCI])
  Subsystem: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller [1b21:1142]
- Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
+ Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
+ Latency: 0, Cache Line Size: 64 bytes
  Interrupt: pin A routed to IRQ 16
- Region 0: Memory at 68000000 (64-bit, non-prefetchable) [virtual] [size=32K]
+ Region 0: Memory at 68000000 (64-bit, non-prefetchable) [size=32K]
  Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
  Address: 0000000000000000 Data: 0000
- Capabilities: [68] MSI-X: Enable- Count=8 Masked-
+ Capabilities: [68] MSI-X: Enable+ Count=8 Masked-
  Vector table: BAR=0 offset=00002000
  PBA: BAR=0 offset=00002080
  Capabilities: [78] Power Management version 3
@@ -1071,14 +1072,16 @@
  Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
  Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
  Status: NegoPending- InProgress-
+ Kernel driver in use: xhci_hcd

  2f:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM57762 Gigabit Ethernet PCIe [14e4:1682] (rev 01)
  Subsystem: Broadcom Inc. and subsidiaries NetXtreme BCM57762 Gigabit Ethernet PCIe [14e4:1682]
- Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
+ Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
+ Latency: 0
  Interrupt: pin A routed to IRQ 17
- Region 0: Memory at 6000100000 (64-bit, prefetchable) [virtual] [size=64K]
- Region 2: Memory at 6000110000 (64-bit, prefetchable) [virtual] [size=64K]
+ Region 0: Memory at 6000100000 (64-bit, prefetchable) [size=64K]
+ Region 2: Memory at 6000110000 (64-bit, prefetchable) [size=64K]
  Capabilities: [48] Power Management version 3
  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-

And the diff against the lspci -vv hotplug failure attachment:

--- lspci 2020-08-31 12:16:11.919502424 -0700
+++ lspci-patched-failure 2020-08-27 12:54:22.300504263 -0700
@@ -654,9 +654,9 @@
  Latency: 0
  Interrupt: pin A routed to IRQ 167
  Bus: primary=2d, secondary=2e, subordinate=2e, sec-latency=0
- I/O behind bridge: 00005000-00005fff [size=4K]
- Memory behind bridge: 68000000-680fffff [size=1M]
- Prefetchable memory behind bridge: 0000006000000000-00000060000fffff [size=1M]
+ I/O behind bridge: 00000000-00000fff [size=4K]
+ Memory behind bridge: 00000000-000fffff [size=1M]
+ Prefetchable memory behind bridge: 0000000000000000-00000000000fffff [size=1M]
  Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
  BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-

So the memory addresses look right, but we're missing "[virtual]". Hopefully that indicates only a small remaining problem :)
(In reply to Matt Turner from comment #36)
> So the memory addresses look right, but we're missing "[virtual]". Hopefully
> that indicates only a small remaining problem :)

Sorry, I got this backwards. In the lspci output after hotplug success we *don't* have "[virtual]" and with the patch in #33 applied (and hotplug failure) we *do* have "[virtual]".
Created attachment 292273 [details] Disable runtime PM from xHCI and PCI ports Can you try this patch instead? It should disable all runtime PM for the affected drivers. Please remove any previous patch and apply this directly on top of the mainline or stable.
No luck, didn't work. The lspci output again has

> I/O behind bridge: 00000000-00000fff [size=4K]
> Memory behind bridge: 00000000-000fffff [size=1M]
> Prefetchable memory behind bridge: 0000000000000000-00000000000fffff [size=1M]

and dmesg didn't look appreciably different. Should I bother posting them?
Yes please.
Created attachment 292713 [details] hotplugged-dmesg failure with CONFIG_PCI_DEBUG=y, initcall_debug, and patch from comment #38
Created attachment 292715 [details] sudo lspci -vv failure with CONFIG_PCI_DEBUG=y, initcall_debug, and patch from comment #38 Sorry for the delay. Please find attached the dmesg and lspci output you requested.
Do those logs show anything interesting? I just updated to v5.9 and it looks like the same behavior to me. :( This laptop is my development machine at Intel, and I'm leaving Intel in a few weeks. I'd love to see this fixed before I return the laptop. Perhaps if that can't be accomplished we can ship you the laptop when I leave.
Sorry for the delay from my side. I was on vacation last week.

From the logs I can see that the ports runtime suspend and resume, so the patch together with "pcie_port_pm=off" on the kernel command line should in theory work around the problem.

It would help if you can ship the device to me to our Finland office and if the dock is also Intel I suggest to ship that too so I can replicate the issue.
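To check from user space whether a given port is still subject to runtime PM, the standard sysfs attributes power/control and power/runtime_status under the device directory can be read. A small Python sketch follows; the function name `runtime_pm_state` and the `sysfs_root` parameter are illustrative choices, not an existing tool, and the parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def runtime_pm_state(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the runtime PM control setting and status for one PCI device.

    bdf is the full address including the domain, e.g. "0000:2d:00.0".
    Attributes that cannot be read are reported as None.
    """
    power_dir = Path(sysfs_root) / bdf / "power"
    state = {}
    for attr in ("control", "runtime_status"):
        try:
            state[attr] = (power_dir / attr).read_text().strip()
        except OSError:
            state[attr] = None
    return state


if __name__ == "__main__":
    # The two downstream ports discussed in this report.
    for port in ("0000:2d:00.0", "0000:2d:01.0"):
        print(port, runtime_pm_state(port))
```

A control value of "auto" means the kernel may runtime suspend the device, while "on" keeps it powered; a runtime_status of "suspended" shortly after hotplug would be consistent with the suspend and resume cycle visible in the logs.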
> It would help if you can ship the device to me to our Finland office and if
> the dock is also Intel I suggest to ship that too so I can replicate the
> issue.

Unfortunately my manager rejected that as an option, and I'm no longer in possession of the laptop.