Bug 203885

Summary: Titan Ridge xHCI may stop working after re-plugging the dock
Product: Drivers          Reporter: Kai-Heng Feng (kai.heng.feng)
Component: PCI            Assignee: Greg Kroah-Hartman (greg)
Status: RESOLVED CODE_FIX
Severity: normal          CC: balbif, kent.lin, mathias.nyman, mika.westerberg, w
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.2-rc4   Subsystem:
Regression: No            Bisected commit-id:
Attachments: lspci -vv
dmesg for scenario 1
dmesg for scenario 2
dmesg, plugging annotated
dmesg, plugging/unplugging annotated
full kernel log collected by `journalctl -b -k`
`lspci -vvnn` after the issue appears
`lspci -t` after the issue appears
dmesg-5.3-rc4
lspci-5.3-rc4
Add PCIe link delays
dmesg after the patch applied
dmesg with patch applied, root port was in D3cold
Add debugging to pci_raw_set_power_state()
testpatch, make sure xHC controller is ready in resume before register write
dmesg with patch from #25 and diff from #33 applied
dmesg with patch from #25 and patch from #34 applied
Updated PCIe link delay patch

Description Kai-Heng Feng 2019-06-13 08:46:03 UTC
Reproduce steps:

Scenario 1:
1) Plug the dock
2) Unplug the dock, and in dmesg I can find "[  220.834798] xhci_hcd 0000:3a:00.0: ERROR mismatched command completion event"
3) "[  227.916579] pcieport 0000:00:1c.0: PME: Spurious native interrupt!" floods the dmesg
4) Replug the dock, but "xhci_hcd 0000:3a:00.0" doesn't work anymore.

Scenario 2 (steps 1 and 2 same as Scenario 1):
3) No more "PME: Spurious native interrupt!"; instead: "[  202.724317] xhci_hcd 0000:3a:00.0: Command completion event does not match command"
4) Replug the dock, but "xhci_hcd 0000:3a:00.0" doesn't work anymore.
Comment 1 Kai-Heng Feng 2019-06-13 08:48:47 UTC
Created attachment 283245 [details]
lspci -vv
Comment 2 Kai-Heng Feng 2019-06-13 08:49:26 UTC
Created attachment 283247 [details]
dmesg for scenario 1
Comment 3 Kai-Heng Feng 2019-06-13 08:50:15 UTC
Created attachment 283249 [details]
dmesg for scenario 2
Comment 4 Kai-Heng Feng 2019-06-13 08:51:55 UTC
Cc'ing both Mika and Mathias, because I am not sure whether it's a bug in PCI, Thunderbolt or the xHC.
Comment 5 Greg Kroah-Hartman 2019-06-13 08:55:15 UTC
On Thu, Jun 13, 2019 at 08:46:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=203885
> 
>             Bug ID: 203885
>            Summary: Titan Ridge xHCI may stop working after re-plugging
>                     the dock
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.2-rc4

All USB bugs should be sent to the linux-usb@vger.kernel.org mailing
list, and not entered into bugzilla.  Please bring this issue up there,
if it is still a problem in the latest kernel release.
Comment 6 Felipe Balbi 2019-07-19 07:05:03 UTC
Kai-Heng,


The "PME: Spurious native interrupt" doesn't have anything to do with the dock, I'm afraid. You can see that these messages start at timestamp 19.781905:


> [   19.781905] pcieport 0000:00:1c.0: PME: Spurious native interrupt!
> [   19.781909] pcieport 0000:00:1c.0: PME: Spurious native interrupt!


There isn't much information we can get as to why that PME event is being generated or who is generating it for the bridge. We could try this patch to see if we get a little more information about the situation:


> diff --git a/drivers/pci/pcie/pme.c b/drivers/pci/pcie/pme.c
> index f38e6c19dd50..33285ef29362 100644
> --- a/drivers/pci/pcie/pme.c
> +++ b/drivers/pci/pcie/pme.c
> @@ -203,7 +203,7 @@ static void pcie_pme_handle_request(struct pci_dev *port, u16 req_id)
>  
>   out:
>       if (!found)
> -             pci_info(port, "Spurious native interrupt!\n");
> +             pci_info(port, "Spurious native interrupt! (Bus# %d DevFn %d)\n", busnr, devfn);
>  }
>  
>  /**


It could be that the xHCI is generating it, but we won't know until we get the output. Anyway, moving on to xHCI: please run the test case again, but this time enable full xHCI logs (dynamic printk and tracepoints). Also, let's move this discussion to linux-usb where it should be :-)

Thanks
Comment 7 Kai-Heng Feng 2019-07-19 10:25:02 UTC
Created attachment 283837 [details]
dmesg, plugging annotated
Comment 8 Kai-Heng Feng 2019-07-22 09:33:16 UTC
Created attachment 283903 [details]
dmesg, plugging/unplugging annotated
Comment 9 Kai-Heng Feng 2019-07-22 09:34:07 UTC
Created attachment 283905 [details]
full kernel log collected by `journalctl -b -k`
Comment 10 Kent Lin 2019-07-31 08:38:46 UTC
Updating the latest debug progress into the bug for the record.


Kai-Heng Feng
Jul 25, 2019, 9:25 PM (6 days ago)
to Mathias, Felipe, Oliver, Mika, Linux, Linux, me


at 22:45, Mathias Nyman <mathias.nyman@linux.intel.com> wrote:

> On 22.7.2019 12.44, Kai-Heng Feng wrote:
>>>>>>> Hi Mika and Mathias,
>>>>>>>
>>>>>>> I’ve filed a bug [1] which renders docking station unusable.
>>>>>>>
>>>>>>> I am not sure it's a bug in PCI, Thunderbolt or xHCI, so I raise the issue to you both.
>>>>>>>
>>>>>>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=203885
>>>>>>>
>>>>>>> Kai-Heng
>>>>
>>>> I upgraded the system firmware, TBT firmware and docking station firmware to the latest, and used the latest mainline kernel.
>>>> Now the issue can be reproduced the very first time I plug the docking station.
>> Request log attached to Bugzilla.
>
> After the docking station is unplugged we see Transfer errors from
> devices connected to the Titan Ridge xHC; the driver tries to recover,
> fails, and the USB devices are disconnected.
>
> After this the xhci driver runtime suspends the xHC controller, as
> runtime PM is allowed by default for Titan Ridge xHC controllers.
>
> Interesting parts from log:
>
>>>> Unplug Docking Station <<<
>
> [  328.102279] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on endpoint
> [  328.118279] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on endpoint
> [  328.134291] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on endpoint
> [  328.150355] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on endpoint
> [  328.166342] xhci_hcd 0000:39:00.0: Transfer error for slot 36 ep 6 on endpoint
> [  332.178710] usb usb4-port2: Cannot enable. Maybe the USB cable is bad?
> [  332.178765] usb 4-2: USB disconnect, device number 35
> [  332.178769] usb 4-2.3: USB disconnect, device number 36
> [  332.179973] usb 4-2.4: USB disconnect, device number 37
> [  332.414618] xhci_hcd 0000:39:00.0: set port remote wake mask, actual port 0 status  = 0xe0002a0
> [  332.414639] xhci_hcd 0000:39:00.0: set port remote wake mask, actual port 1 status  = 0xe0002b0
> [  332.414693] xhci_hcd 0000:39:00.0: xhci_hub_status_data: stopping port polling.
> [  332.414703] xhci_hcd 0000:39:00.0: xhci_suspend: stopping port polling.
> [  332.414719] xhci_hcd 0000:39:00.0: // Setting command ring address to 0x487da9001
>
>>>> Plug Docking Station <<<
>
> [  346.455568] pci_raw_set_power_state: 25 callbacks suppressed
> [  346.455574] xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
> [  346.539451] xhci_hcd 0000:39:00.0: enabling device (0000 -> 0002)
> [  346.539482] xhci_hcd 0000:39:00.0: // Setting command ring address to 0x487da903f
> [  346.539487] xhci_hcd 0000:39:00.0: WARN: xHC restore state timeout
> [  346.539489] xhci_hcd 0000:39:00.0: PCI post-resume error -110!
> [  346.539490] xhci_hcd 0000:39:00.0: HC died; cleaning up
>
>>>> We don't have 0000:39:00 anymore <<<
>
> When the docking station is plugged back we try to resume the Titan Ridge
> xHC, but the PCI log shows that changing the power state to D0 failed; the
> xHC is still in D3.
> The resume process continues anyway, and the xhci driver tries to restore
> state, but fails.
> USB core will assume the HC died if the PCI resume callback failed.
>
> Does disabling runtime PM for Titan Ridge xHC help?

Yes, disabling runtime PM can work around this issue (see the sketch below).

Kai-Heng

>
> -Mathias
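
For the record, the runtime PM workaround can be done from userspace by writing "on" to the xHC's power/control attribute (e.g. /sys/bus/pci/devices/0000:39:00.0/power/control). In driver terms it amounts to something like the following sketch; the helper name and hook point are assumptions, not the actual change that was tested:

#include <linux/pci.h>
#include <linux/pm_runtime.h>

/*
 * Sketch only: keep the Titan Ridge xHC out of runtime suspend so it never
 * enters D3 at runtime. pm_runtime_forbid() bumps the usage count and
 * resumes the device, blocking runtime suspend until pm_runtime_allow()
 * is called again (which is what writing "on" to power/control does).
 */
static void xhci_forbid_runtime_pm(struct pci_dev *pdev)
{
	pm_runtime_forbid(&pdev->dev);
}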
Comment 11 Kent Lin 2019-08-08 07:01:29 UTC
Hi,

Any advice on what to do next?
Comment 12 Mika Westerberg 2019-08-13 14:30:59 UTC
Please try with the latest mainline (v5.3-rc4) and see if the issue still reproduces. There is a fix in the mainline touching PCI PM that might solve this one as well.
Comment 13 Kai-Heng Feng 2019-08-14 05:11:26 UTC
I can still reproduce the issue on v5.3-rc4:
$ uname -r
5.3.0-050300rc4-generic

[  442.084333] pci_raw_set_power_state: 25 callbacks suppressed
[  442.084338] xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
Comment 14 Mika Westerberg 2019-08-14 09:19:43 UTC
What does 'sudo lspci -vv' show after you have plugged the dock back?
Comment 15 Kai-Heng Feng 2019-08-15 06:05:12 UTC
Created attachment 284419 [details]
`lspci -vvnn` after the issue appears
Comment 16 Kai-Heng Feng 2019-08-15 06:05:35 UTC
Created attachment 284421 [details]
`lspci -t` after the issue appears
Comment 17 Mika Westerberg 2019-08-15 08:21:31 UTC
Hmm, from the lspci output the xHCI device on the host router (39:00.0) looks fine. Does it not work now? Or do you mean the other xHCI that is part of the dock?
Comment 18 Kai-Heng Feng 2019-08-15 10:06:41 UTC
The original issue still happens. But before v5.3-rc4, subsequent hotplugs stopped working entirely; on v5.3-rc4, subsequent hotplugs are still detected.

So the issue is alleviated but not completely solved.
Comment 19 Mika Westerberg 2019-08-15 11:08:55 UTC
OK, so the original issue is that the xHCI (39:00) may not always work after plugging the dock back, correct? When that happens, does the lspci output differ from what you attached previously?
Comment 20 Kai-Heng Feng 2019-08-15 12:20:16 UTC
OK, I spoke too soon. I can still reproduce the original issue on v5.3-rc4.
Comment 21 Kai-Heng Feng 2019-08-15 12:20:45 UTC
Created attachment 284427 [details]
dmesg-5.3-rc4
Comment 22 Kai-Heng Feng 2019-08-15 12:21:07 UTC
Created attachment 284429 [details]
lspci-5.3-rc4
Comment 23 Mika Westerberg 2019-08-15 13:04:12 UTC
OK, I can see that the xHCI (39:00.0) disappears at one point. Is this dock connected through Thunderbolt or USB? To me it looks like the latter.

Also, is the xHCI always there if you don't connect anything? I mean, if you boot the system without anything connected, wait some time, and then check lspci, is the xHCI present?
Comment 24 Kai-Heng Feng 2019-08-15 14:39:10 UTC
(In reply to Mika Westerberg from comment #23)
> OK, I can see that the xHCI (39:00.0) disappears at one point. Is this dock
> connected through Thunderbolt or USB? To me it looks like the latter.

Through USB. It has a dual Type-C connector:
https://www.dell.com/en-us/work/shop/dell-performance-dock-wd19dc-210w-power-delivery/apd/210-ARIL/pc-accessories

> Also is the xHCI there always if you don't connect anything? I mean if you
> boot the system up without anything connected wait some time and check lspci
> is the xHCI present?

Yes it's always present on PCI bus without anything connected.
Comment 25 Mika Westerberg 2019-08-16 06:52:37 UTC
Created attachment 284447 [details]
Add PCIe link delays

It looks like the PCIe link to the xHCI (39:00.0) is not established in time when the kernel tries to runtime resume the device after a re-plug. Can you try the attached patch and see if it helps? Please also attach dmesg.
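
Conceptually the change is along these lines; the helper name and call site below are assumptions, the attached patch is what to actually test:

#include <linux/delay.h>
#include <linux/pci.h>
#include "pci.h"	/* pcie_wait_for_link(), drivers/pci internal helper */

/*
 * Sketch only: after a root or downstream port is powered up and its link
 * retrained, the PCIe spec requires software to wait at least 100 ms before
 * issuing configuration requests to the device below the port. Without the
 * wait, runtime resume can touch an xHCI that is still inaccessible.
 */
static void pci_wait_for_downstream_device(struct pci_dev *bridge)
{
	/* Nothing below the port, nothing to wait for */
	if (!bridge->subordinate || list_empty(&bridge->subordinate->devices))
		return;

	/* Wait for the data link layer to report the link as active */
	pcie_wait_for_link(bridge, true);

	/* Then give the downstream device the mandatory 100 ms settle time */
	msleep(100);
}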
Comment 26 Kai-Heng Feng 2019-08-16 11:04:04 UTC
Created attachment 284461 [details]
dmesg after the patch applied

Yes the patch works for me.
I re-plugged the dock 50 times and I no longer see the issue.
Comment 27 Kai-Heng Feng 2019-08-16 11:06:45 UTC
An unrelated question: do we still need commit 76380a607ba0 ("mfd: intel-lpss: Remove D3cold delay") after your patch? It seems like d3cold_delay only applies to PCI links, and not to other devices?
Comment 28 Mika Westerberg 2019-08-19 08:07:50 UTC
(In reply to Kai-Heng Feng from comment #26)
> Created attachment 284461 [details]
> dmesg after the patch applied
> 
> Yes the patch works for me.
> I re-plugged the dock 50 times an I no longer see the issue.

Thanks for testing. However, looking at the dmesg output, the delay is not applied to the root port and the downstream port leading to the xHCI, so I'm surprised it works. It does add a delay to the other root port, which is connected to the AMD GPU, but I don't see how that can help here.

When you unplug the dock, does the root port enter D3cold? Typically on these systems it does, unless something is keeping it awake.
Comment 29 Kai-Heng Feng 2019-08-19 11:02:40 UTC
(In reply to Mika Westerberg from comment #28)
> Thanks for testing. However, looking at the dmesg output the delay is not
> applied to the root port and downstream port leading to the xHCI so I'm
> surprised it works. It does add delay to the other root port that is
> connected to the AMD GPU but I don't see how it can help here.

You are right. I need to wait a while for the root port to become runtime suspended.

> When you unplug the dock does the root port enter D3cold? Typically these
> systems it does unless there is something keeping it awake.

The issue still happens with your patch once the root port is in D3cold.
Comment 30 Kai-Heng Feng 2019-08-19 11:03:23 UTC
Created attachment 284499 [details]
dmesg with patch applied, root port was in D3cold
Comment 31 Mika Westerberg 2019-08-19 11:36:54 UTC
If you change the delay from 100ms to, say, 150ms, does it help? I can see from the dmesg that the xHCI registers look fine when they are being restored, but there is a failure when the device is put back into D0, which makes me think it is still not ready.

Also, which firmware version do you have? You can read it from /sys/bus/thunderbolt/devices/0-0/nvm_version.
Comment 32 Kai-Heng Feng 2019-08-19 14:27:22 UTC
(In reply to Mika Westerberg from comment #31)
> If you change the delay from 100ms to say 150ms does it help? I can see from
> the dmesg that the xHCI registers look fine when they are being restored but
> there is failure when the device is put back to D0 which makes me think it
> is still not ready.

After changing the delay to 150ms the failure rate became really low but the issue was not completely eliminated, so I increased the delay to 200ms, which seems to solve it completely.

> 
> Also which firmware version you have? You can read it from
> /sys/bus/thunderbolt/devices/0-0/nvm_version.
41.1
Comment 33 Mika Westerberg 2019-08-20 10:30:49 UTC
Created attachment 284529 [details]
Add debugging to pci_raw_set_power_state()

Yeah, the PCIe spec says that software should wait a minimum of 100ms after link training completes, but of course that is just the minimum and not all devices seem to cope with it.

Can you try the attached patch on top of my previous one (where it uses 100ms) and attach dmesg when the issue happens again? It should log the PMCSR register a couple of times.
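
The diff itself is in the attachment; roughly it instruments pci_raw_set_power_state() along these lines (the exact placement and the delay between the two reads are assumptions):

	u16 pmcsr;

	/* First read, right where the D0 transition is refused */
	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
	pci_info(dev, "Refused to change power state, currently in D%d, PMCSR=0x%04x\n",
		 dev->current_state, pmcsr);

	/* Read again a little later to see whether the device became accessible */
	msleep(10);
	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
	pci_info(dev, "second read, PMCSR=0x%04x\n", pmcsr);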
Comment 34 Mathias Nyman 2019-08-20 10:49:59 UTC
Created attachment 284531 [details]
testpatch, make sure xHC controller is ready in resume before register write

A patch to make xhci resume more reliable.

Does this patch help?
It doesn't fix the PCI side, but it makes xhci resume more reliable in
general. The CNR (Controller Not Ready) check has already been suggested
earlier for other platforms.

This also adds a 0xffffffff check at resume in case the PCI controller
is not yet accessible, giving it an extra 100ms.
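
Roughly, the test patch does the following in xhci_resume(), before any register writes (the attachment is authoritative; the timeout values here are assumptions):

	int retval;
	u32 status = readl(&xhci->op_regs->status);

	/* All ones means the PCI device is not reachable yet; allow an extra 100ms */
	if (status == ~(u32)0) {
		msleep(100);
		status = readl(&xhci->op_regs->status);
		if (status == ~(u32)0)
			return -ENODEV;
	}

	/* Wait for CNR (Controller Not Ready) to clear before touching registers */
	retval = xhci_handshake(&xhci->op_regs->status, STS_CNR, 0,
				10 * 1000 * 1000);
	if (retval) {
		xhci_warn(xhci, "xHC not ready at resume, CNR still set\n");
		return retval;
	}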
Comment 35 Kai-Heng Feng 2019-08-21 07:50:44 UTC
Created attachment 284547 [details]
dmesg with patch from #25 and diff from #33 applied
Comment 36 Kai-Heng Feng 2019-08-21 08:00:10 UTC
Created attachment 284549 [details]
dmesg with patch from #25 and patch from #34 applied

The issue still happens with Mathias' patch.
Comment 37 Mika Westerberg 2019-08-21 09:00:05 UTC
Created attachment 284551 [details]
Updated PCIe link delay patch

Thanks for the log. So indeed the device is not yet accessible after the 100ms delay:

[   60.254340] xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3, PMCSR=0xffff
[   60.257876] pcieport 0000:00:1c.0: PME: Spurious native interrupt!
[   60.257890] pcieport 0000:00:1c.0: PME: Spurious native interrupt!
[   60.274060] xhci_hcd 0000:39:00.0: second read, PMCSR=0x8100

I've attached an updated version of the link delay patch that probes the downstream device once and, if it is not accessible, waits for another 100ms. This way we don't delay all downstream devices unnecessarily.

Can you try the patch on top of v5.3-rc5 and see if the xHCI stays alive after several iterations?
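
For reference, the updated logic is roughly the following (sketch only; the attachment is authoritative, and the surrounding helper is assumed to run after the port's link has come back up):

	struct pci_dev *child;
	u32 id;

	/* Mandatory wait after the link is up */
	msleep(100);

	/* Probe the first device below the port once... */
	child = list_first_entry(&bridge->subordinate->devices,
				 struct pci_dev, bus_list);
	pci_read_config_dword(child, PCI_VENDOR_ID, &id);

	/* ...and only if it is still not accessible, wait another 100 ms */
	if (id == ~0U)
		msleep(100);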
Comment 38 Kai-Heng Feng 2019-08-21 10:13:39 UTC
(In reply to Mika Westerberg from comment #37)
> Created attachment 284551 [details]
> Updated PCIe link delay  patch
> 
> Thanks for the log. So indeed the device is not yet accessible after the
> 100ms delay:
> 
> [   60.254340] xhci_hcd 0000:39:00.0: Refused to change power state,
> currently in D3, PMCSR=0xffff
> [   60.257876] pcieport 0000:00:1c.0: PME: Spurious native interrupt!
> [   60.257890] pcieport 0000:00:1c.0: PME: Spurious native interrupt!
> [   60.274060] xhci_hcd 0000:39:00.0: second read, PMCSR=0x8100
> 
> I've attached an updated version of the link delay patch that probes for the
> downstream device once and if it is not accessible waits for another 100ms.
> This way we don't delay all downstream devices unnecessarily.
> 
> Can you try the patch on top of v5.3-rc5 and see if the xHCI stays alive
> after several iterations?

Yes, this patch solves the issue. xHCI works every time.

Please collect my tested-by:
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Comment 39 Mika Westerberg 2019-08-21 12:50:14 UTC
Thanks for testing. I just sent the patch upstream:

https://lore.kernel.org/linux-pci/20190821124519.71594-1-mika.westerberg@linux.intel.com/

Let's see how it goes.
Comment 40 Kent Lin 2019-10-14 06:16:42 UTC
Hi Mika,

Any good news on the patch upstream?
If there is anything we can help with, please let us know. Thank you.
Comment 41 Mika Westerberg 2019-10-21 10:24:40 UTC
Hi,

There is a new version of the patch here:

  https://lore.kernel.org/linux-pci/20191004123947.11087-1-mika.westerberg@linux.intel.com/

So far I haven't got any comments from the PCI maintainer though.