Reproduce steps: Scenario 1: 1) Plug the dock 2) Unplug the dock, and in dmesg I can find "[ 220.834798] xhci_hcd 0000:3a:00.0: ERROR mismatched command completion event" 3) "[ 227.916579] pcieport 0000:00:1c.0: PME: Spurious native interrupt!" floods the dmesg 4) Replug the dock, but "xhci_hcd 0000:3a:00.0" doesn't work anymore. Scenario 2: 3) No more "PME: Spurious native interrupt!", "[ 202.724317] xhci_hcd 0000:3a:00.0: Command completion event does not match command" instead. 4) Replug the dock, but "xhci_hcd 0000:3a:00.0" doesn't work anymore.
Created attachment 283245 [details] lspci -vv
Created attachment 283247 [details] dmesg for scenario 1
Created attachment 283249 [details] dmesg for scenario 2
Cc both Mika and Mathias, because I am not sure it's a bug in PCI, Thunderbolt or xHC.
On Thu, Jun 13, 2019 at 08:46:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=203885 > > Bug ID: 203885 > Summary: Titan Ridge xHCI may stop to working after re-plugging > the dock > Product: Drivers > Version: 2.5 > Kernel Version: 5.2-rc4 All USB bugs should be sent to the linux-usb@vger.kernel.org mailing list, and not entered into bugzilla. Please bring this issue up there, if it is still a problem in the latest kernel release.
Kai-Heng, The "PME: Spurious native interrupt" doesn't have anything to do with the dock, I'm afraid. You can see that these messages start at timestamp 19.781905: > [ 19.781905] pcieport 0000:00:1c.0: PME: Spurious native interrupt! > [ 19.781909] pcieport 0000:00:1c.0: PME: Spurious native interrupt! There isn't much information we can get as for why that PME event is being generated or who's generating it for the bridge. We could try this patch to see if we get a little more information about the situation: > diff --git a/drivers/pci/pcie/pme.c b/drivers/pci/pcie/pme.c > index f38e6c19dd50..33285ef29362 100644 > --- a/drivers/pci/pcie/pme.c > +++ b/drivers/pci/pcie/pme.c > @@ -203,7 +203,7 @@ static void pcie_pme_handle_request(struct pci_dev *port, > u16 req_id) > > out: > if (!found) > - pci_info(port, "Spurious native interrupt!\n"); > + pci_info(port, "Spurious native interrupt! (Bus# %d DevFn > %d)\n", busnr, devfn); > } > > /** It could be that xHCI is generating that, but we won't know until we get the output. Anyway, moving on to XHCI. Please run the test case again but this time enable full xHCI logs (dynamic printk and tracepoints). Also, let's move this discussion to linux-usb where it should be :-) Thanks
Created attachment 283837 [details] dmesg, plugging annotated
Created attachment 283903 [details] dmesg, plugging/unplugging annotated
Created attachment 283905 [details] full kernel log collected by `journalctl -b -k`
Update latest debug progress into bug for record. Kai-Heng Feng Jul 25, 2019, 9:25 PM (6 days ago) to Mathias, Felipe, Oliver, Mika, Linux, Linux, me at 22:45, Mathias Nyman <mathias.nyman@linux.intel.com> wrote: > On 22.7.2019 12.44, Kai-Heng Feng wrote: >>>>>>> Hi Mika and Mathias, >>>>>>> >>>>>>> I’ve filed a bug [1] which renders docking station unusable. >>>>>>> >>>>>>> I am not sure it's a bug in PCI, Thunderbolt or xHCI so raise the >>>>>>> issue >>>>>>> to >>>>>>> you both. >>>>>>> >>>>>>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=203885 >>>>>>> >>>>>>> Kai-Heng >>>> >>>> I upgraded the system firmware, TBT firmware and docking station >>>> firmware >>>> to latest, and used latest mainline kernel. >>>> Now the issue can be reproduced at the very first time I plugged the >>>> docking station. >> Request log attached to Bugzilla. > > After docking station unplug we see Transfer errors from > devices connected to Titan Ridge xHC, driver tries to recover, fails, > usb devices are disconnected. > > After this xhci driver runtime suspends xHC controller as runtime pm is > allowed > by default for Titan Ridge xHC controllers. > > Interesting parts from log: > >>>> Unplug Docking Station <<< > > [ 328.102279] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on > endpoint > [ 328.118279] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on > endpoint > [ 328.134291] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on > endpoint > [ 328.150355] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on > endpoint > [ 328.166342] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on > endpoint > [ 332.178710] usb usb4-port2: Cannot enable. Maybe the USB cable is bad? > [ 332.178765] usb 4-2: USB disconnect, device number 35 > [ 332.178769] usb 4-2.3: USB disconnect, device number 36 > [ 332.179973] usb 4-2.4: USB disconnect, device number 37 > [ 332.414618] xhci_hcd 0000:39:00.0: set port remote wake mask, actual > port 0 status = 0xe0002a0 > [ 332.414639] xhci_hcd 0000:39:00.0: set port remote wake mask, actual > port 1 status = 0xe0002b0 > [ 332.414693] xhci_hcd 0000:39:00.0: xhci_hub_status_data: stopping port > polling. > [ 332.414703] xhci_hcd 0000:39:00.0: xhci_suspend: stopping port polling. > [ 332.414719] xhci_hcd 0000:39:00.0: // Setting command ring address to > 0x487da9001 > >>>> Plug Docking Station <<< > > [ 346.455568] pci_raw_set_power_state: 25 callbacks suppressed > [ 346.455574] xhci_hcd 0000:39:00.0: Refused to change power state, > currently in D3 > [ 346.539451] xhci_hcd 0000:39:00.0: enabling device (0000 -> 0002) > [ 346.539482] xhci_hcd 0000:39:00.0: // Setting command ring address to > 0x487da903f > [ 346.539487] xhci_hcd 0000:39:00.0: WARN: xHC restore state timeout > [ 346.539489] xhci_hcd 0000:39:00.0: PCI post-resume error -110! > [ 346.539490] xhci_hcd 0000:39:00.0: HC died; cleaning up > >>>> We don't have 0000:39:00 anymore <<< > > When docking station is plugged back we try to resume Titan Ridge xHC, > PCI log shows that changing power state to D0 failed, xHC is still in D3. > Resume process continues anyway, and xhci driver tries to restore state, > but fails. > Usb core will assume HC died if the pci resume callback failed > > Does disabling runtime PM for Titan Ridge xHC help? Yes, disabling runtime PM can workaround this issue. Kai-Heng > > -Mathias
Hi, Any advise what to do next?
Please try with the latest mainline (v5.3-rc4) and see if the issue still reproduces. There is a fix in the mainline touching PCI PM that might solve this one as well.
I can still reproduce the issue on v5.3-rc4: $ uname -r 5.3.0-050300rc4-generic [ 442.084333] pci_raw_set_power_state: 25 callbacks suppressed [ 442.084338] xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
What does 'sudo lspci -vv' show after you have plugged the dock back?
Created attachment 284419 [details] `lspci -vvnn` after the issue appears
Created attachment 284421 [details] `lspci -t` after the issue appears
Hmm, from the lspci output the xHCI device on the host router (39:00.0) looks fine. Does it not work now? Or do you mean the other xHCI that is part of the dock?
The original issue still happens. But before v5.3-rc4, subsequent hotplugs don't work anymore. On v5.3-rc4, subsequent hotplugs can still be detected. So the issue is alleviated but not completely solved.
OK, the original issue is that the xHCI (39:00) may not work always after plugging the dock back, correct? When that happens does the lspci output differ from what you attached previously?
Ok I spoke too soon. I can still reproduce the original issue on v5.3-rc4.
Created attachment 284427 [details] dmesg-5.3-rc4
Created attachment 284429 [details] lspci-5.3-rc4
OK, I can see that the xHCI (39:00.0) disappears at one point. Is this dock connected through Thunderbolt or USB? To me it looks like the latter. Also is the xHCI there always if you don't connect anything? I mean if you boot the system up without anything connected wait some time and check lspci is the xHCI present?
(In reply to Mika Westerberg from comment #23) > OK, I can see that the xHCI (39:00.0) disappears at one point. Is this dock > connected through Thunderbolt or USB? To me it looks like the latter. Through USB. It has dual Type-C connector: https://www.dell.com/en-us/work/shop/dell-performance-dock-wd19dc-210w-power-delivery/apd/210-ARIL/pc-accessories > Also is the xHCI there always if you don't connect anything? I mean if you > boot the system up without anything connected wait some time and check lspci > is the xHCI present? Yes it's always present on PCI bus without anything connected.
Created attachment 284447 [details] Add PCIe link delays It looks like the PCIe link to the xHCI (39:00.0) is not established in time when the kernel tries to runtime resume the device after re-plug. Can you try the attached patch and see if it helps? Please also attach dmesg.
Created attachment 284461 [details] dmesg after the patch applied Yes the patch works for me. I re-plugged the dock 50 times an I no longer see the issue.
An unrelated question: Do we still need commit 76380a607ba0 ("mfd: intel-lpss: Remove D3cold delay") after your patch? Seems like d3cold_delay only applies to PCI links, and not for other devices?
(In reply to Kai-Heng Feng from comment #26) > Created attachment 284461 [details] > dmesg after the patch applied > > Yes the patch works for me. > I re-plugged the dock 50 times an I no longer see the issue. Thanks for testing. However, looking at the dmesg output the delay is not applied to the root port and downstream port leading to the xHCI so I'm surprised it works. It does add delay to the other root port that is connected to the AMD GPU but I don't see how it can help here. When you unplug the dock does the root port enter D3cold? Typically these systems it does unless there is something keeping it awake.
(In reply to Mika Westerberg from comment #28) > Thanks for testing. However, looking at the dmesg output the delay is not > applied to the root port and downstream port leading to the xHCI so I'm > surprised it works. It does add delay to the other root port that is > connected to the AMD GPU but I don't see how it can help here. You are right. I need to wait a while to make root port runtime suspended. > When you unplug the dock does the root port enter D3cold? Typically these > systems it does unless there is something keeping it awake. The issue still happens with your patch, once root port is in D3cold.
Created attachment 284499 [details] dmesg with patch applied, root port was in D3cold
If you change the delay from 100ms to say 150ms does it help? I can see from the dmesg that the xHCI registers look fine when they are being restored but there is failure when the device is put back to D0 which makes me think it is still not ready. Also which firmware version you have? You can read it from /sys/bus/thunderbolt/devices/0-0/nvm_version.
(In reply to Mika Westerberg from comment #31) > If you change the delay from 100ms to say 150ms does it help? I can see from > the dmesg that the xHCI registers look fine when they are being restored but > there is failure when the device is put back to D0 which makes me think it > is still not ready. Changing the delay to 150ms the fail rate became really low but not completely eliminated, so I increased the delay to 200ms, which seems to completely solve the issue. > > Also which firmware version you have? You can read it from > /sys/bus/thunderbolt/devices/0-0/nvm_version. 41.1
Created attachment 284529 [details] Add debugging to pci_raw_set_power_state() Yeah, PCIe spec says that software should wait minimum of 100ms after link training completes but of course that is just minimum and not all devices seem to cope with that. Can you try the attached patch on top of my previous one (where it uses 100ms) attach dmesg when the issue happens again? It should log the PMCSR register couple of times.
Created attachment 284531 [details] testpatch, make sure xHC controller is ready in resume before register write Patch to make xhci resume more reliable, Does patch this help? It doesn't fix the the PCI side, but makes xhci resume more reliable in general. Other platforms have suggested the CNR (Controller Not Ready) check already earlier. This also adds a 0xffffffff check at resume in case PCI controller is not yet accessible, giving it an extra 100ms
Created attachment 284547 [details] dmesg with patch from #25 and diff from #33 applied
Created attachment 284549 [details] dmesg with patch from #25 and patch from #34 applied Issue still happens with Mathias' patch.
Created attachment 284551 [details] Updated PCIe link delay patch Thanks for the log. So indeed the device is not yet accessible after the 100ms delay: [ 60.254340] xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3, PMCSR=0xffff [ 60.257876] pcieport 0000:00:1c.0: PME: Spurious native interrupt! [ 60.257890] pcieport 0000:00:1c.0: PME: Spurious native interrupt! [ 60.274060] xhci_hcd 0000:39:00.0: second read, PMCSR=0x8100 I've attached an updated version of the link delay patch that probes for the downstream device once and if it is not accessible waits for another 100ms. This way we don't delay all downstream devices unnecessarily. Can you try the patch on top of v5.3-rc5 and see if the xHCI stays alive after several iterations?
(In reply to Mika Westerberg from comment #37) > Created attachment 284551 [details] > Updated PCIe link delay patch > > Thanks for the log. So indeed the device is not yet accessible after the > 100ms delay: > > [ 60.254340] xhci_hcd 0000:39:00.0: Refused to change power state, > currently in D3, PMCSR=0xffff > [ 60.257876] pcieport 0000:00:1c.0: PME: Spurious native interrupt! > [ 60.257890] pcieport 0000:00:1c.0: PME: Spurious native interrupt! > [ 60.274060] xhci_hcd 0000:39:00.0: second read, PMCSR=0x8100 > > I've attached an updated version of the link delay patch that probes for the > downstream device once and if it is not accessible waits for another 100ms. > This way we don't delay all downstream devices unnecessarily. > > Can you try the patch on top of v5.3-rc5 and see if the xHCI stays alive > after several iterations? Yes, this patch solves the issue. xHCI works every time. Please collect my tested-by: Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Thanks for testing. I just sent the patch upstream: https://lore.kernel.org/linux-pci/20190821124519.71594-1-mika.westerberg@linux.intel.com/ Let's see how it goes.
Hi Mika, Any good news on and patch upstream? If there is anything we could help please let us know. Thank you.
Hi, There is a new version of the patch here: https://lore.kernel.org/linux-pci/20191004123947.11087-1-mika.westerberg@linux.intel.com/ So far I haven't got any comments from the PCI maintainer though.