Bug 213081 - usb-storage / uas Genesys Logic Card Reader no longer working on 5.12
Summary: usb-storage / uas Genesys Logic Card Reader no longer working on 5.12
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Default virtual assignee for Drivers/USB
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-05-15 12:43 UTC by Peter Ganzhorn
Modified: 2021-06-05 07:29 UTC (History)
3 users (show)

See Also:
Kernel Version: 5.12.4
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg log after plugging in card reader and inserting card (4.38 KB, text/plain)
2021-05-15 12:43 UTC, Peter Ganzhorn
Details
usbmon trace 5.12 bus 6 (38.64 KB, text/plain)
2021-05-15 17:44 UTC, Peter Ganzhorn
Details
usbmon trace 5.11 bus 6 (254.45 KB, text/plain)
2021-05-15 17:44 UTC, Peter Ganzhorn
Details
usbmon trace 5.12 bus 5 (9.27 KB, text/plain)
2021-05-15 17:45 UTC, Peter Ganzhorn
Details
usbmon trace 5.11 bus 5 (8.69 KB, text/plain)
2021-05-15 17:51 UTC, Peter Ganzhorn
Details
git bisect log (4.11 KB, text/plain)
2021-05-17 19:15 UTC, Peter Ganzhorn
Details
Patch fixing accidental clearing of URB status in xhci (1.57 KB, application/mbox)
2021-05-19 12:13 UTC, Mathias Nyman
Details
usb bus 5 trace 5.12+patch (8.92 KB, text/plain)
2021-05-19 17:31 UTC, Peter Ganzhorn
Details
usb bus 6 trace 5.12+patch (16.76 KB, text/plain)
2021-05-19 17:38 UTC, Peter Ganzhorn
Details
dmesg 5.12.5+patch (106.91 KB, text/plain)
2021-05-19 21:10 UTC, Peter Ganzhorn
Details
dmesg 5.12.5+patch (2) (107.28 KB, text/plain)
2021-05-19 21:24 UTC, Peter Ganzhorn
Details
detailed trace 5.12.5+patch (803.04 KB, text/plain)
2021-05-19 21:26 UTC, Peter Ganzhorn
Details
Patch forcing a set TR deq command after halt due to stall (1.82 KB, patch)
2021-05-20 21:04 UTC, Mathias Nyman
Details | Diff
dmesg 5.12.5 + both patches (122.25 KB, text/plain)
2021-05-21 16:54 UTC, Peter Ganzhorn
Details
detailed trace 5.12.5 + both patches (3.29 MB, text/plain)
2021-05-21 17:01 UTC, Peter Ganzhorn
Details

Description Peter Ganzhorn 2021-05-15 12:43:16 UTC
Created attachment 296763 [details]
dmesg log after plugging in card reader and inserting card

Hello,

my USB SD card reader is no longer working with Linux 5.12.
Up to Linux 5.11.21 the device is working flawlessly.

T:  Bus=06 Lev=03 Prnt=03 Port=00 Cnt=01 Dev#=  4 Spd=5000 MxCh= 0
D:  Ver= 3.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 9 #Cfgs=  1
P:  Vendor=05e3 ProdID=0749 Rev=15.31
S:  Manufacturer=Generic
S:  Product=USB3.0 Card Reader
S:  SerialNumber=000000001531
C:  #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=896mA
I:  If#=0x0 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage

It is integrated in my display and probably made by Genesys Logic (display type is Benq PD3200U).

My system is a x86_64 (Intel Haswell).

After inserting a SD card into the reader usb-devices output hangs when accessing the device.
Also reading data from a inserted SD card is not possible (e.g. cfdisk /dev/sdb will hang forever).

I've attached my dmesg output. It begins after plugging in the USB cable into the display and the hub and card reader being detected. After this I inserted a SD card into the reader and tried to run cfdisk.

I've tried multiple cards and the result is the same with all of them. All the cards are working with the same reader and Linux 5.11.21.
Another USB card reader (Realtek VID 0x0BDA PID 0x0301) I own works without any issues with 5.12.

Some change after Linux 5.11 must have broken the support for this card reader.

Please tell me if I can provide further information to investigate and fix the issue.

Peter
Comment 1 Alan Stern 2021-05-15 15:36:50 UTC
Can you collect usbmon traces for bus 6 showing the failure under a 5.12 kernel and success under a 5.11.21 kernel?
Comment 2 Peter Ganzhorn 2021-05-15 17:44:36 UTC
Created attachment 296781 [details]
usbmon trace 5.12 bus 6
Comment 3 Peter Ganzhorn 2021-05-15 17:44:57 UTC
Created attachment 296783 [details]
usbmon trace 5.11 bus 6
Comment 4 Peter Ganzhorn 2021-05-15 17:45:17 UTC
Created attachment 296785 [details]
usbmon trace 5.12 bus 5
Comment 5 Peter Ganzhorn 2021-05-15 17:51:15 UTC
Created attachment 296787 [details]
usbmon trace 5.11 bus 5

Here you are.
I also recorded a trace of bus 5 since it seems a Genesys hub shows up there as well as soon as I connect my display (but I have only one physical USB cable).

What I did with both kernels was:
1. Boot the system with the SD card removed from the reader and the USB cable of my display unplugged
2. Start the trace via
cat /sys/kernel/debug/usb/usbmon/6u > usbmon-5.11-bus6.log &
cat /sys/kernel/debug/usb/usbmon/5u > usbmon-5.11-bus5.log
3. Plugged in my displays USB cable
4. After waiting a moment, inserted the SD card into the card reader slot
5. Executed cfdisk /dev/sdb
6. Waited a few seconds and stopped the cat processes

Please let me know if I can assist you with additional information.
Thanks a lot for responding so quickly to my bug report and taking care of this, I appreciate it!
Comment 6 Alan Stern 2021-05-16 01:20:27 UTC
Here's where the difference shows up.  In the 5.11 log:

ffff8881259c3540 48224904 S Bo:6:004:2 -115 31 = 55534243 03000000 60000000 80000603 00000060 00000000 00000000 000000
ffff8881259c3540 48224939 C Bo:6:004:2 0 31 >
ffff888102854900 48224942 S Bi:6:004:1 -115 96 <
ffff888102854900 48225040 C Bi:6:004:1 -121 18 = 70000200 0000000a 00000000 3a000000 0000
ffff8881259c3540 48225042 S Bi:6:004:1 -115 13 <
ffff8881259c3540 48225112 C Bi:6:004:1 -32 0
ffff8881259c3540 48225115 S Co:6:004:0 s 02 01 0000 0081 0000 0
ffff8881259c3540 48225522 C Co:6:004:0 0 0
ffff8881259c3540 48225525 S Bi:6:004:1 -115 13 <
ffff8881259c3540 48225598 C Bi:6:004:1 0 13 = 55534253 03000000 4e000000 00

In the 5.12 log:

ffff88812366e6c0 273105524 S Bo:6:010:2 -115 31 = 55534243 03000000 60000000 80000603 00000060 00000000 00000000 000000
ffff88812366e6c0 273105551 C Bo:6:010:2 0 31 >
ffff88815e5c5c00 273105558 S Bi:6:010:1 -115 96 <
ffff88815e5c5c00 273105676 C Bi:6:010:1 -121 18 = 70000200 0000000a 00000000 3a000000 0000
ffff88812366e6c0 273105682 S Bi:6:010:1 -115 13 <
ffff88812366e6c0 273105860 C Bi:6:010:1 0 0
ffff88812366e6c0 273105864 S Bi:6:010:1 -115 13 <
ffff88812366e6c0 303293804 C Bi:6:010:1 -104 0

In both of these, the computer asks for a 96-bytes of sense data and the device responds with only 18 bytes.  Then when the computer asks for a status code, the device is supposed to respond with a STALL (code -32 in the trace).  This does happen (in the sixth line) under 5.11, but under 5.12 the STALL isn't reported.  Things get worse from there.

This appears to be a bug in the xhci-hcd driver.  I can't tell you how to debug that driver; your best approach is probably to use bisection to find the commit between 5.11 and 5.12 responsible for introducing the bug.
Comment 7 Peter Ganzhorn 2021-05-16 09:51:52 UTC
Alan, thanks for looking into this and spotting the difference in the traces.
I know what bisecting is in theory, but haven't actually ever done it myself.
Do you know how could I limit the reported commits to xhci-related stuff when running git bisect or point me to according documentation/instructions?
Peter
Comment 8 Alan Stern 2021-05-16 15:01:18 UTC
The output from "git help bisect" includes this text:

  Cutting down bisection by giving more parameters to bisect start
       You can further cut down the number of trials, if you know what part of
       the tree is involved in the problem you are tracking down, by
       specifying path parameters when issuing the bisect start command:

           $ git bisect start -- arch/i386 include/asm-i386

You should read the entire output; it includes a pretty good (though rather terse) description of the whole bisection process.

(Beware: Although it's likely that the bug was introduced by a commit affecting the xhci-hcd driver; it isn't certain.  If you can't find the guilty party by a restricted search, you may have to try again with a full bisection.)

If you want more information about git bisection, there are plenty of web pages available that explain it in more detail.  Ask Google.
Comment 9 Peter Ganzhorn 2021-05-17 19:15:22 UTC
Created attachment 296815 [details]
git bisect log

Well bisecting went quicker than I expected, but I don't think I can come up with a solution for this myself.
git bisect found commit 674f8438c12125d6b4fe51d44b9316bb02b286b5 xhci: split handling halted endpoints into two steps as the culprit, but the last two builds before finding this final commit already refused to boot correctly.
I guess this one is part of a bigger patch series which the bisect process cut in half at some point.
The description of the commit mentions:

    Note: This removes the RESET_EP_QUIRK handling which was added in
    commit ac9d8fe7c6a8 ("USB: xhci: Add quirk for Fresco Logic xHCI hardware.")

    This quirk was added in 2009 for prototype xHCI hardware meant for
    evaluation purposes only, and should not reach consumers.

Which made me raise an eyebrow. After reading this I am not sure if I may have some misbehaving/broken-by-design hardware here.

I'll take the liberty of adding Mathias Nyman who authored the commit I found to be involved in this issue to the CC list.
Hopefully he can shed some light on this.
Comment 10 Alan Stern 2021-05-18 01:47:23 UTC
The way to find out for sure whether this commit caused the problem is to test the kernel built from this commit and to test the kernel built with the commit reverted.  If the first one fails and the second one works, you'll know.

However, it does seem likely that this commit is related to your problem, since they both involve halted endpoints (note that STALL is the code sent by a device when the computer tries to communicate with a halted endpoint).
Comment 11 Peter Ganzhorn 2021-05-18 16:59:54 UTC
It's obvious one should test with and without the commit in question, but as I already said it seems the particular commit is part of a bigger series and I am unable to revert it via git (git revert 674f8438c12125d6b4fe51d44b9316bb02b286b5) or manually:

/usr/src/linux-5.12# cat ../674f8438c12125d6b4fe51d44b9316bb02b286b5.patch | patch -p1 -R --dry-run
checking file drivers/usb/host/xhci-ring.c
Hunk #1 succeeded at 819 (offset 22 lines).
Hunk #2 FAILED at 834.
Hunk #3 FAILED at 855.
Hunk #4 FAILED at 894.
Hunk #5 FAILED at 934.
Hunk #6 succeeded at 1276 (offset 101 lines).
Hunk #7 succeeded at 1353 (offset 101 lines).
Hunk #8 succeeded at 1383 (offset 101 lines).
Hunk #9 FAILED at 2043.
5 out of 9 hunks FAILED
checking file drivers/usb/host/xhci.c
Hunk #1 succeeded at 1452 (offset 12 lines).
Hunk #2 succeeded at 1822 (offset 12 lines).
Hunk #3 succeeded at 3131 (offset 12 lines).
checking file drivers/usb/host/xhci.h

Could you please give me some advice how I can revert this commit without breaking anything else?
Comment 12 Alan Stern 2021-05-18 17:57:48 UTC
In the failed reversion attempt above, what commit was checked out when you made the attempt?

The right way to do this is to check out the commit and then do the reversion.  In other words, the commit you want to revert should be the top one at the time you do the reversion.  This is more or less equivalent to checking out the commit's parent.

So for example, you should do:

   git checkout 674f8438c121
   git revert 674f8438c121

That way there won't be any conflicts, and the fact that the commit is in the middle of a larger series won't matter.
Comment 13 Peter Ganzhorn 2021-05-18 19:01:42 UTC
I tried to revert the commit on 5.12.
Checking out 674f8438c121 and then reverting worked as you suggested, thanks for the hint.

Results:
1) Kernel built at / with commit 674f8438c121 stops at the following output:

...
usb-storage 6-2.3.1:1.0: USB Mass Storage device detected
scsi host6: usb-storage 6-2.3.1:1.0
scsi 6:0:0:0: Direct-Access    Generic STORAGE DEVICE   1531 PQ: 0 ANSI: 6
sd 6:0:0:0: Attached scsi generic sg2 type 0

Which is clearly related to my card reader. After that, the boot process DOES NOT CONTINUE.

2) Kernel without commit 674f8438c121: Boots just fine and my card reader works.

Seems like the commit git bisect found makes things a lot worse indeed.
I guess there were other changes involved since 5.12 at least continued to boot with the device attached.

What else could I do to help you investigating this?
Comment 14 Alan Stern 2021-05-18 19:48:32 UTC
Okay, good.  This is now something for Mathias to look into.  He'll have to take over the analysis from this point.
Comment 15 Peter Ganzhorn 2021-05-18 21:19:58 UTC
Thanks again for the support so far Alan, I appreciate it!
I'm looking forward to hear from Mathias.
Peter
Comment 16 Mathias Nyman 2021-05-19 12:13:04 UTC
Created attachment 296865 [details]
Patch fixing accidental clearing of URB status in xhci

Thanks for narrowing it down this far.

As Alan pointed out xhci is clearing the -EPIPE status of the a URB in some STALL cases.

Attached a patch that should resove this. Does it help?
Comment 17 Peter Ganzhorn 2021-05-19 17:31:16 UTC
Created attachment 296873 [details]
usb bus 5 trace 5.12+patch
Comment 18 Peter Ganzhorn 2021-05-19 17:38:55 UTC
Created attachment 296875 [details]
usb bus 6 trace 5.12+patch

Hi Mathias, thanks for your patch.

I applied the patch to v5.12.4 and it applied just fine, but the card reader is still not working.
So sadly I have to report that the issue is not fixed with it, but I spotted a few differences with it.
I recorded new traces and it seems the trace log for bus 6 is a lot smaller than it was without the patch, but the symptoms are very much the same.
After plugging in the display / card reader, it seems to be detected fine according to dmesg until

[   66.338408] usb 6-2.3.1: new SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[   66.366869] usb 6-2.3.1: New USB device found, idVendor=05e3, idProduct=0749, bcdDevice=15.31
[   66.366879] usb 6-2.3.1: New USB device strings: Mfr=3, Product=4, SerialNumber=5
[   66.366883] usb 6-2.3.1: Product: USB3.0 Card Reader
[   66.366886] usb 6-2.3.1: Manufacturer: Generic
[   66.366889] usb 6-2.3.1: SerialNumber: 000000001531
[   66.373185] usb-storage 6-2.3.1:1.0: USB Mass Storage device detected
[   66.373392] scsi host6: usb-storage 6-2.3.1:1.0
[   67.390948] scsi 6:0:0:0: Direct-Access     Generic  STORAGE DEVICE   1531 PQ: 0 ANSI: 6
[   67.391389] sd 6:0:0:0: Attached scsi generic sg2 type 0

Inserting the card once did not change anything in dmesg, so I ejected and re-inserted it. The second time I also got

[   98.068827] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[   98.096880] sd 6:0:0:0: [sdb] Attached SCSI removable disk

But still no partitions (which are present) detected and cfdisk fails just like before.
What seems different now is I am seeing a lot less errors in dmesg, but repeated

[  128.789141] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  159.509221] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  190.239273] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  220.959354] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  251.679495] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  282.389627] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  313.109866] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  343.830026] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  374.550193] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  405.270368] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[  436.000490] usb 6-2.3.1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd

If you need more information or have anything else I can test, just let me know.
Comment 19 Alan Stern 2021-05-19 17:59:31 UTC
The usbmon trace shows that the original problem of not reporting the STALL has been fixed.  But there is a new problem.  Here's where it shows up in the trace for bus 6:

ffff88810248e0c0 67313744 S Bi:6:004:1 -115 13 <
ffff88810248e0c0 97843800 C Bi:6:004:1 -104 13 = 55534243 04000000 00000000 00

This was the second packet received from ep1in after the halt was cleared.  The values in the received packet are correct (i.e., what I expect them to be), but the timestamp and the -104 error code show that the URB timed out after 30 second and was unlinked.  This should not have happened; the URB should have completed normally in less than one millisecond.

After this error, the device was reset and the same sequence of packets was exchanged, with the same result as before (apparently -- the trace ends during the second 30-second timeout window).

This indicates that the patch does not fully fix all the issues.  Evidently xhci-hcd has trouble seeing when an URB for a formerly halted endpoint has completed.
Comment 20 Mathias Nyman 2021-05-19 20:12:22 UTC
There is another recent fix for a 5.12 regression related to giving back URBs:
9b6a126ae58d xhci: Fix giving back cancelled URBs even if halted endpoint can't reset
It was added to v5.12.5 stable, so still missing from your 5.12.4 setup.

Could you add that patch as well, and if it still doesn't help then take more detailed xhci traces and logs:

mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
< connect the card reader and insert card >
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace

note that trace file grows fast so copy the trace content somewhere soon
after you see the first "reset SuperSpeed Gen 1 USB device" after failure.

Thanks
Comment 21 Peter Ganzhorn 2021-05-19 21:10:01 UTC
Created attachment 296879 [details]
dmesg 5.12.5+patch
Comment 22 Peter Ganzhorn 2021-05-19 21:24:42 UTC
Created attachment 296881 [details]
dmesg 5.12.5+patch (2)

I actually forgot to insert the card into the reader.
Rebooted and re-collected the data (start trace, plug in reader, insert card and wait for usb reset in dmesg).
Comment 23 Peter Ganzhorn 2021-05-19 21:26:41 UTC
Created attachment 296885 [details]
detailed trace 5.12.5+patch

Hi Mathias, I've also added the requested detailed trace.
I hope you find something helpful in there.

Thank you for your support, Mathias and Alan!
Peter
Comment 24 Peter Ganzhorn 2021-05-19 21:28:34 UTC
I should also have mentioned: Updating to v5.12.5 + your patch did not help.
dmesg and trace were recorded on the updated kernel version.
Comment 25 Mathias Nyman 2021-05-20 20:04:20 UTC
Thanks, the trace gives a pretty good picture of what is going on.

Before 5.12 the xhci driver handled halted endpoints (TRB completed with STALL) by immediately queuing both a reset endpoint command to clear the host side of the halted endpoint, and a "set TR dequeue" command to move past the Stalled TRB.

In 5.12 the Reset endpoint command is queued first, and "set TR Dequeue" command is queued when the reset endpoint command completes, if needed.
Traces show that hardware already moved past the stalled TRB when reset endpoint completed, so driver won't issue a "Set TR Dequeu" command at all.
I think this command is anyway needed as it will also flush the TRB cache of xHC controller.

From traces:

Stall on TRB at fffcb040:
56.024868: xhci_handle_event: EVENT: TRB 00000000fffcb040 status 'Stall Error' len 13 slot 5 ep 3 type 'Transfer Event' flags e:C

Queue reset endpoint command:
56.024873: xhci_queue_trb: CMD: Reset Endpoint Command: ctx 0000000000000000 slot 5 ep 3 flags t:C

When the reset endpoint command completes we see hardware dequeue is at fffcb050:
56.024970: xhci_handle_command: CMD: Reset Endpoint Command: ctx 0000000000000000 slot 5 ep 3 flags t:C
56.024971: xhci_handle_cmd_reset_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 4 maxp 1024 deq 00000000fffcb051 avg trb len 0

No set TR Deq to move to fffcb050 is queued.

I'll start working on a patch that forces a "Set TR dequeue" command after a stall
Comment 26 Mathias Nyman 2021-05-20 21:04:10 UTC
Created attachment 296903 [details]
Patch forcing a set TR deq command after halt due to stall

One more patch on top of the previous ones.
Does this help? 
I don't have a simple way for trying this out myself.
Comment 27 Peter Ganzhorn 2021-05-21 16:54:36 UTC
Created attachment 296915 [details]
dmesg 5.12.5 + both patches
Comment 28 Peter Ganzhorn 2021-05-21 17:01:43 UTC
Created attachment 296917 [details]
detailed trace 5.12.5 + both patches

Mathias, with both of your patches applied to 5.12.5 everything seems to be just perfect again!
My card reader is detected and seems to work like it did on 5.11 and before.

If you'd like to verify that everything is indeed fine, I've created another detailed trace and dmesg. As usual, I recorded plugging in the device, inserting a card and this time also running cfdisk.

Thank you Mathias and Alan for the great support and instructions, it was a pleasure assisting you tracking down and fixing this bug! :-)
Comment 29 Mathias Nyman 2021-05-21 19:25:52 UTC
Thank you Peter.

I'll take a quick look at the traces, and clean up the commit messages before submitting the patches.

Can I add "Reported-by" and "Tested-by" Peter Ganzhorn tags to the patches?
Comment 30 Peter Ganzhorn 2021-05-21 20:39:25 UTC
You are welcome, Mathias.
Of course you can add my name in the tags.

Just out of curiosity:
Do you have an explanation why this bug seems to only affect some devices?
As I mentioned before, another card reader I own worked just perfectly and I got the impression from your description in comment #25 that this issue may theoretically affect just about any device which apparently is not the case.
Why would the hardware move past the stalled TRB with one device and not with another?
My knowledge about the details of USB / xHCI is pretty vague, but I'd still like to grasp what you just fixed.
Comment 31 Alan Stern 2021-05-21 21:56:14 UTC
It's entirely possible that the other card reader doesn't issue any STALLs.
Comment 32 Peter Ganzhorn 2021-06-05 07:29:30 UTC
Everything has been working well with the patches for me so far and I just saw Linux 5.12.9 has the patches included.
I'll just mark this as resolved.

Thanks again for the great support!

Note You need to log in before you can comment on or make changes to this bug.