Bug 218987

Summary: USB devices are not detected after Save/Restore error on Intel xHC
Product: Drivers Reporter: Remi Pommarel (repk)
Component: USBAssignee: Default virtual assignee for Drivers/USB (drivers_usb)
Status: RESOLVED CODE_FIX    
Severity: normal CC: mathias.nyman
Priority: P3    
Hardware: Intel   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: Annotated dmesg output showing unplug/replug xhci debug logs
usbmon traces when unplug/replug Genesys Logic Hub based devices causing SRE
Patch attempt to get devices enumerated when a SRE occurs
patch v2 to fix connect detect issue after xhc resume SRE case

Description Remi Pommarel 2024-06-25 13:06:44 UTC
Created attachment 306492 [details]
Annotated dmesg output showing unplug/replug xhci debug logs

Hello,

With some (Genesys Logic based Hub at least) USB devices, Intel xHC raises Save/Restore error (SRE) if it gets suspended when those devices are unplugged. It seems that in this case (Save/Restore error) the host controller does sometimes miss Port Status Changed Event.

In xhci_resume() host controller port is actually resumed only if Port Status Changed Event are generated (xhci_pending_portevent() returns true). Because such event can be missing when an SRE happened, port can remain suspended when a USB device has been plugged to it preventing the device enumeration.

This can be seen in the attached annotated dmesg excerpt with drivers/usb/host/xhci* and drivers/usb/core/hub.c debug print enabled:

[13903.331265] pcieport 0000:00:1d.0: Intel SPT PCH root port ACS workaround enabled
[13904.829108] xhci_hcd 0000:3c:00.0: // Setting command ring address to 0xffc55001
[13904.829300] xhci_hcd 0000:3c:00.0: xHC error in resume, USBSTS 0x411, Reinit <==== SRE happened on resume
[13904.829308] usb usb3: root hub lost power or was reset
[13904.829315] usb usb4: root hub lost power or was reset

...

[13904.832695] xhci_hcd 0000:3c:00.0: xhci_resume: starting usb3 port polling.
[13904.832732] xhci_hcd 0000:3c:00.0: config port 4-1 wake bits, portsc: 0x2a0, write: 0x202a0
[13904.832747] xhci_hcd 0000:3c:00.0: config port 4-2 wake bits, portsc: 0x2a0, write: 0x202a0
[13904.832761] xhci_hcd 0000:3c:00.0: config port 3-1 wake bits, portsc: 0x2a0, write: 0x202a0
[13904.832773] xhci_hcd 0000:3c:00.0: config port 3-2 wake bits, portsc: 0x2a0, write: 0x202a0
[13904.832782] xhci_hcd 0000:3c:00.0: xhci_suspend: stopping usb3 port polling. <==== Controller goes back to suspend

There are several ways to workaround that : 
    - Disabling autosuspend (echo on > /sys/bus/usb/devices/.../power) prevent the issue to happen
    - Triggering a resume manually after the device has been replugged (lsusb -v, cat /dev/bus/usb/<bus>/001, ...)

To summarize I can see 2 different issues here:
    1) Genesys Logic (and maybe other) devices cause SRE error if controller is suspend due to it being unplugged
    2) Controller does not reflect Port Status correctly when SRE happened

Unfortunately I have no idea why 1) happen. Does anyone know what can cause a SRE error to happen, besides issuing a CMD_RUN between a save and restore ? I have attached a usbmon trace that registers the faulty device being unplugged if it can ba ofe any help.

Also after a quick glance at xHCI specification I didn't managed to understand why 2) happens either. Is it expected ? Is it a known behavior of some controller ?

It can be noted that once a problematic device has been unplugged it does not matter which device get plugged after (could be any I tried) it won't be recognized, so 2) appears to me to not be specific to a single device but does seems to be a host controller issue.

In the end in order to workaround this issue I have created the attached patch.  This forces host controller to actually try to resume if a SRE happen, even if no Port Status Change event has been received. This has the downsize to maybe try resuming the controller while no devices has been actually connected (false positive) but the controller will suspend utimately still. But maybe it is something we can live with ?

If need be, this can also be made tunable through a XHCI quirks.

Thanks
Comment 1 Remi Pommarel 2024-06-25 13:07:26 UTC
Created attachment 306493 [details]
usbmon traces when unplug/replug Genesys Logic Hub based devices causing SRE
Comment 2 Remi Pommarel 2024-06-25 13:08:22 UTC
Created attachment 306494 [details]
Patch attempt to get devices enumerated when a SRE occurs
Comment 3 Mathias Nyman 2024-06-26 09:10:16 UTC
Thanks for debugging this.

Issue 1, reason for SRE is unknown to me as well.
Issue 2 is probably due to PORTSC not immediately showing connected devices after host is restarted. This is especially true for USB3 devices where device only shows up in PORTSC after successful link training.   

A similar issue was solved in:
commit 33e321586e37 ("xhci: Add grace period after xHC start to prevent premature runtime suspend")

In your SRE case we never event start polling the roothub.

How about a solution that is similar to your patch but doesn't touch
pending_portevent. Instead we always resume roothubs in case xHC was reset during resume.

Does the oneliner below work for you?

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 37eb37b0affa..12b9c2c7f39f 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -1172,7 +1172,7 @@ int xhci_resume(struct xhci_hcd *xhci, pm_message_t msg)
                        pending_portevent = xhci_pending_portevent(xhci);
                }
 
-               if (pending_portevent) {
+               if (pending_portevent || reinit_xhc) {
                        if (xhci->shared_hcd)
                                usb_hcd_resume_root_hub(xhci->shared_hcd);
                        usb_hcd_resume_root_hub(hcd);
Comment 4 Remi Pommarel 2024-06-26 10:12:51 UTC
Thanks for looking into this issue.

(In reply to Mathias Nyman from comment #3)
...
> Issue 2 is probably due to PORTSC not immediately showing connected devices
> after host is restarted. This is especially true for USB3 devices where
> device only shows up in PORTSC after successful link training.   

Weirdly enough this issue seems to happen only on the USB 2.0 root hub part of the port not the USB 3.0 one. I first tried to add delay to see if my port events showed but no luck on that. Anyway that does not matter much.

> How about a solution that is similar to your patch but doesn't touch
> pending_portevent. Instead we always resume roothubs in case xHC was reset
> during resume.
> 
> Does the oneliner below work for you?

Yes it works. While I did initially try that, I switched to pending_portevent patch version to avoid unneeded xhci_pending_portevent() and msleep(120) calls. But both are very fine with me.
Comment 5 Mathias Nyman 2024-06-27 13:04:03 UTC
> > 
> > Does the oneliner below work for you?
> 
> Yes it works. While I did initially try that, I switched to
> pending_portevent patch version to avoid unneeded xhci_pending_portevent()
> and msleep(120) calls. But both are very fine with me.

Good point, I'll attach a second version that skips the 120ms sleep.
After that patch it will also be easier to refactor reinit_xhc and 
!reinit_xhc into separate functions
Comment 6 Mathias Nyman 2024-06-27 13:10:31 UTC
Created attachment 306504 [details]
patch v2 to fix connect detect issue after xhc resume SRE case

Does this work?
Comment 7 Remi Pommarel 2024-06-27 14:03:01 UTC
Yes this works thanks.
Just a typo in the commit message s/Restore Errer/Restore Error/.
Comment 8 Mathias Nyman 2024-06-27 14:56:20 UTC
(In reply to Remi Pommarel from comment #7)
> Yes this works thanks.
> Just a typo in the commit message s/Restore Errer/Restore Error/.

Thanks, patch with fixed commit message (typo, tags etc) sent.
https://lore.kernel.org/linux-usb/20240627145523.1453155-2-mathias.nyman@linux.intel.com/
Comment 9 Remi Pommarel 2024-06-27 15:36:35 UTC
Thanks this is resolved for me then.
Still I will take a look at why SRE happen just out of curiosity.(In reply to Mathias Nyman from comment #8)
...
> Thanks, patch with fixed commit message (typo, tags etc) sent.
> https://lore.kernel.org/linux-usb/20240627145523.1453155-2-mathias.
> nyman@linux.intel.com/

Thanks this is resolved for me then.
Still I will take a look at why SRE happen (seem to be related to the presence of HDMI) just out of curiosity.
Comment 10 Mathias Nyman 2024-06-28 08:51:05 UTC
About SRE, 

Logs show transfer errors due to disconnect. This normal.

Driver will however try to recover the transfer by resending the same transfer 3 times (soft-reset internal endpoint) before giving up and and doing a hard reset of the endpoint.

Soft reset is useful if transfer error was caused by electrical interference, but this case the soft rest if futile, device is gone. 

Not sure if it makes a difference but could be worth testing if skipping the soft reset impacts SRE.

Can be done by setting quirk:
xhci->quirks |= XHCI_NO_SOFT_RETRY
Comment 11 Remi Pommarel 2024-06-28 17:00:32 UTC
(In reply to Mathias Nyman from comment #10)
> About SRE, 
> 

...

> Not sure if it makes a difference but could be worth testing if skipping the
> soft reset impacts SRE.
> 
> Can be done by setting quirk:
> xhci->quirks |= XHCI_NO_SOFT_RETRY

Thanks.

Unfortunately this does not help.

What is funny is that it is related to if the HDMI is plugged or not. So I think it is related to DP alt mode of type C. I don't know if xHC needs to even know about alternate mode, it does not look like so to me.