Bug 214021 - The USB devices fail to be detected during boot
Summary: The USB devices fail to be detected during boot
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Default virtual assignee for Drivers/USB
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-08-10 15:26 UTC by Chris Chiu
Modified: 2022-05-13 08:29 UTC (History)
7 users (show)

See Also:
Kernel Version: 5.12 and later version
Subsystem:
Regression: No
Bisected commit-id:


Attachments
output of lsusb -v (42.00 KB, text/plain)
2021-08-10 15:26 UTC, Chris Chiu
Details
output of lsusb -t (1.37 KB, text/plain)
2021-08-10 15:27 UTC, Chris Chiu
Details
Delay registration of secondary root hub (1.05 KB, patch)
2021-08-20 14:31 UTC, Alan Stern
Details | Diff
output of journalctl -b | grep -E "(xhci|\: usb)" (1.90 KB, text/plain)
2022-04-07 21:25 UTC, stygarfield
Details
Patch 0001 usb core hcd: Add support for deferring roothub (4.52 KB, patch)
2022-04-08 11:56 UTC, Mathias Nyman
Details | Diff
PAtch 0002 xhci: set hcd flag to defer primary roothub registration (1.66 KB, patch)
2022-04-08 11:57 UTC, Mathias Nyman
Details | Diff
patch20220425.tar.gz (4.78 KB, application/x-gzip)
2022-04-25 11:00 UTC, Hongyu Xie
Details

Description Chris Chiu 2021-08-10 15:26:42 UTC
Created attachment 298255 [details]
output of lsusb -v

We found some USB devices (keyboard, storage) will fail to be detected on kernel 5.13 during boot, but it's working OK after re-plug the device. After looking into the dmesg during boot, there's an error message as down below
[   39.350435] hub 1-0:1.0: USB hub found
[   39.398835] hub 1-0:1.0: 12 ports detected
[   39.622744] usb usb1-port3: couldn't allocate usb_device

Then we try to reproduce it on older kernel (5.4 and 5.8), there's no such problem. After bisecting, we found the commit https://github.com/torvalds/linux/commit/90d28fb53d4a51299ff324dede015d5cb11b88a2 makes the difference. It seems that hub_power_on_good_delay is not long enough for this hub. The bPwrOn2PwrGood in the Hub Descriptor (please refer to the attached lsusb-v.txt) is 20ms, which seems pretty normal when I check the same thing on other hubs. Can we totally trust this? Please suggest what we should do for this kind of hub. Thanks
Comment 1 Chris Chiu 2021-08-10 15:27:43 UTC
Created attachment 298257 [details]
output of lsusb -t
Comment 2 Chris Chiu 2021-08-10 15:39:07 UTC
Forgot to mention that this problem only happens on the high-speed hub in the attached file. 
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Comment 3 Alan Stern 2021-08-11 15:38:58 UTC
You can try changing the bPwrOn2PwrGood value in the kernel source to see if that will make any difference.  The line to change is in drivers/usb/host/xhci-hub.c in the function xhci_common_hub_descriptor().  It's at the start of the function:

        desc->bPwrOn2PwrGood = 10;      /* xhci section 5.4.9 says 20ms max */

The value gets multiplied by two, so if you change it to 50 then the delay will become 100 ms.

However I suspect that the length of this delay is not the cause of the problem you see.  The actual error occurs in the usb_alloc_dev() function in drivers/usb/core/usb.c.  You can try adding some printk statements in there to see which of the "return NULL" statements is the one you're getting.
Comment 4 Chris Chiu 2021-08-12 02:31:26 UTC
The patch author Chunfeng Yun also suggest to change the bPwrOn2PwrGood. It still fails even I set the value to 20 and 40. Then I set to a 100 to see if it makes difference. The value 100 works.

I'll add some printk to get more information. Thanks
Comment 5 Chris Chiu 2021-08-16 07:09:05 UTC
I add printks to trace where the error happens. The dmesg show as down below

[   44.488980] xhci_hcd 0000:00:14.0: FIXME: allocate a command ring segment
[   44.488981] usb_alloc_dev: hcd driver alloc dev exist but return 0
[   44.488982] usb usb1-port4: couldn't allocate usb_device

The first line show for the "xhci_dbg(xhci, "FIXME: allocate a command ring segment\n");" in xhci_alloc_dev()

The second line shows the printk I add in usb_alloc_dev()

       /* Root hubs aren't true devices, so don't allocate HCD resources */
        if (usb_hcd->driver->alloc_dev && parent &&
                !usb_hcd->driver->alloc_dev(usb_hcd, dev)) {
+               printk("%s: hcd driver alloc dev exist but return 0", __func__);
                usb_put_hcd(bus_to_hcd(bus));
                kfree(dev);
                return NULL;
        }

The usb_hcd->driver->alloc_dev is equal to xhci_alloc_dev and fails the xhci_queue_slot_control(). Does that mean anything regarding this?
Comment 6 Chunfeng Yun 2021-08-16 07:29:29 UTC
If the controller doesn't work, there will be log "xHCI dying or halted, can't queue_command"? printed in queue_command()
Comment 7 Alan Stern 2021-08-16 14:16:36 UTC
This means the problem is somewhere below xhci_alloc_dev.  In other words, the problem is in the xhci-hcd driver, and it has no connection to the commit you found.  Except perhaps that increasing the time delay manages to avoid the bug somehow.

You can try enabling dynamic debugging for xhci-hcd.  On the boot command line, add:

   dyndbg="module xhci_hcd =p"

and see what shows up in the dmesg log.
Comment 8 Chris Chiu 2021-08-18 05:48:56 UTC
After lots of experiments, I agree the commit is not the root cause. Since the commit applied on 5.11 has no problem, but it happens when applying this commit on 5.4 base.
The confusing part is, I can't reproduce it by enabling the dynamic debug on xhci-hcd even with the kernel 5.4 base. I'm still trying to find out what exactly commit cause the difference.
Comment 9 Mathias Nyman 2021-08-20 11:33:45 UTC
Kishon reported this same issue. xhci_alloc_dev()->xhci_queue_slot_control()->queue_command() fails because
xhci->xhc_state still had a  XHCI_STATE_HALTED bit set.

This is possible because the xhci driver will first add the high-speed hcd/roothub, then the super-speed hcd/roothub.
xHC controller is started and xhci->xhc_state cleared only when the second, super-speed hcd is added.

So if we are unlucky the first hs hcd will create the hs roothub device, 
hub driver binds to it, hub inits, and hub thread start enumerating devies before xHC is running.

xHC port registers can be read when xHC is powered, before it's is running, so hub thread is able to see the connected devices early.
Comment 10 Alan Stern 2021-08-20 14:31:33 UTC
Created attachment 298391 [details]
Delay registration of secondary root hub

Would something like the attached patch fix the problem?  It delays registration of the secondary root hub until after the primary hcd is started, just before the primary root hub is registered.

Chris, can you test this patch?
Comment 11 Mathias Nyman 2021-08-23 12:49:40 UTC
(In reply to Alan Stern from comment #10)
> 
> Would something like the attached patch fix the problem?  It delays
> registration of the secondary root hub until after the primary hcd is
> started, just before the primary root hub is registered.

If I understood the problem correctly we would need to delay the registration of the primary roothub until secondary hcd usb_hcd_add() reaches hcd->driver->start()

xhci pci driver will first create and add primary hcd, then create and add secondary hcd.

xhci_pci_probe()
  /* create AND add primary hcd in one go */
  usb_hcd_pci_probe(dev, id, &hc_driver)
  primary_hcd = dev_get_drvdata(&dev->dev)
  /* create secondary hcd */
  secondary_hcd = usb_create_shared_hcd(...,primary_hcd)
  /* add secondary hcd */
  usb_add_hcd(secondary_hcd)


If we force xhci driver to always first create both hcds, and then add them, we could maybe do somethinh like this:


usb_hcd_add()
    ...
    if ((is_primary_hcd(hcd) && !hcd->shared_hcd) || !is_primary_hcd)
        register_roothub(primary_hcd);
    if(!is_primary_hcd(hcd))
        register_roothub(hcd);

If this makes any sense I'll turn this into real code.
Comment 12 Alan Stern 2021-08-23 14:06:29 UTC
I may have gotten the primary and secondary mixed up.  Anyway, the general idea is simple: Don't register the first root hub along with the first hcd; instead register both root hubs along with the second hcd.

The only problem is how to tell usb_hcd_add that the first hcd is going to be shared.  You'll probably need to modify xhci-hcd slightly to make this possible.  Maybe add a new flag.
Comment 13 Chris Chiu 2021-08-24 07:37:06 UTC
Mathias, thanks for the analysis since I've ran out of ideas for debugging this. In my case, even a single line printk in either xhci-hub.c or hub.c before hub_power_on() can make difference (undetected -> detected). But I can always reproduce with the clean code. Let me know if there's any patch I can help test. Thanks
Comment 14 Chris Chiu 2021-08-25 12:08:24 UTC
Thanks to the upstream patch from Kishon. I've tested and it works on my kernel 5.4 base which used to have problem.
Comment 15 devsk 2021-11-02 17:32:39 UTC
> Thanks to the upstream patch from Kishon.

Which upstream patch are you talking about? I am running into this issue and I was wondering if you could point to the actual patch. Thanks!
Comment 16 Tilman Schmidt 2022-03-25 08:30:18 UTC
Any progress on this? This is getting fixed/circumvented and rebroken repeatedly in downstream releases like Ubuntu. A definitive fix would be highly appreciated.
Comment 17 stygarfield 2022-04-07 21:25:00 UTC
Created attachment 300715 [details]
output of journalctl -b | grep -E "(xhci|\: usb)"
Comment 18 stygarfield 2022-04-07 21:27:25 UTC
(In reply to stygarfield from comment #17)
> Created attachment 300715 [details]
> output of journalctl -b | grep -E "(xhci|\: usb)"

I'm having the same issue I believe, and am hoping that there is a fix out there. This started happening fairly recently.
Comment 19 Mathias Nyman 2022-04-08 11:55:00 UTC
The roothub deferring patches fixing this triggered a race, and were reverted.
That race should now be fixed in 5.16-rc3 and in stable kernels with patches:

6ae6dc22d2d1 ("usb: hub: Fix usb enumeration issue due to address0 race")
6cca13de26ee ("usb: hub: Fix locking issues with address0_mutex")

I'm planning on reapplying the roothub deferring patches but would
need some help testing them, making sure they work and not trigger any
new issues.

Patches attached (based on 5.18-rc1).

0001-usb-core-hcd-Add-support-for-deferring-roothub-regis.patch
0002-xhci-Set-HCD-flag-to-defer-primary-roothub-registrat.patch

Thanks
Comment 20 Mathias Nyman 2022-04-08 11:56:43 UTC
Created attachment 300723 [details]
Patch 0001 usb core hcd: Add support for deferring roothub
Comment 21 Mathias Nyman 2022-04-08 11:57:44 UTC
Created attachment 300724 [details]
PAtch 0002 xhci: set hcd flag to defer primary roothub registration
Comment 22 Hongyu Xie 2022-04-24 07:50:07 UTC
Hi, everyone.
What's the status on this?
Comment 23 Hongyu Xie 2022-04-24 07:52:45 UTC
(In reply to Mathias Nyman from comment #21)
> Created attachment 300724 [details]
> PAtch 0002 xhci: set hcd flag to defer primary roothub registration

Reapplied on linux-next? Haven't seen this on linux-next.
Comment 24 Mathias Nyman 2022-04-25 10:08:32 UTC
(In reply to Hongyu Xie from comment #23)

> Reapplied on linux-next? Haven't seen this on linux-next.

As I said earlier, I'd like to get these patches tested and verified by someone who can trigger this issue before applying them.

Can you help me test them?
Comment 25 Hongyu Xie 2022-04-25 11:00:24 UTC
Created attachment 300800 [details]
patch20220425.tar.gz

On 2022/4/25 18:08, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=214021
>
> --- Comment #24 from Mathias Nyman (mathias.nyman@linux.intel.com) ---
> (In reply to Hongyu Xie from comment #23)
>
>> Reapplied on linux-next? Haven't seen this on linux-next.
> As I said earlier, I'd like to get these patches tested and verified by
> someone
> who can trigger this issue before applying them.
>
> Can you help me test them?

About to do so. But my kernel is little old, 5.4.x. So I had to changed 
it a little bit. You can see them in the

attachment.

>
Comment 26 Hongyu Xie 2022-05-13 07:29:32 UTC

(In reply to Mathias Nyman from comment #24)
> (In reply to Hongyu Xie from comment #23)
> 
> > Reapplied on linux-next? Haven't seen this on linux-next.
> 
> As I said earlier, I'd like to get these patches tested and verified by
> someone who can trigger this issue before applying them.
> 
> Can you help me test them?

It works.

Note You need to log in before you can comment on or make changes to this bug.