Bug 218890 - reset SuperSpeed USB device number 2 using xhci_hcd
Summary: reset SuperSpeed USB device number 2 using xhci_hcd
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: All Linux
: P3 normal
Assignee: Default virtual assignee for Drivers/USB
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-25 10:02 UTC by Pierre Tomon
Modified: 2024-06-21 19:53 UTC (History)
2 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
usbmon (216.09 KB, text/plain)
2024-05-25 10:02 UTC, Pierre Tomon
Details
usbmon_6.6.15.txt (245.38 KB, text/plain)
2024-05-28 09:02 UTC, Pierre Tomon
Details
dmesg_2024-05-29_13:28:29 (27.17 KB, text/plain)
2024-05-29 11:38 UTC, Pierre Tomon
Details
trace_2024-05-29_13:25:37 (1.03 MB, text/plain)
2024-05-29 11:39 UTC, Pierre Tomon
Details
patch to fix cancelled bulk transfer length (1.58 KB, application/mbox)
2024-05-29 15:42 UTC, Mathias Nyman
Details
usbmon_6.9.3_with_patch.txt (79.25 KB, text/plain)
2024-05-30 17:08 UTC, Pierre Tomon
Details

Description Pierre Tomon 2024-05-25 10:02:59 UTC
Created attachment 306346 [details]
usbmon

Hello,

Since kernel 6.1.82 (amd64), I got this message when turning on the external drive.
The device is accessible after about 1 minute, then I can mount the fs and access data.

It happens with those kernel versions:
6.1.82
6.7.7
6.7.9
6.8.9
Git mainline

Latest kernel versions tested and not affected:
6.1.76
6.6.15

SATA/USB controller: Bus 006 Device 003: ID 174c:5106 ASMedia Technology Inc. ASM1051 SATA 3Gb/s bridge
SATA controler: ASMedia Technology Inc. ASM1061/ASM1062 Serial ATA Controller (rev 01)

Added this module parameter: `quirks=174c:5106:t` for the usb-storage
driver, same.

Part of `dmesg`:
[   89.821220] usb 6-1: new SuperSpeed USB device number 2 using xhci_hcd
[   98.669813] usb 6-1: New USB device found, idVendor=174c, idProduct=5106, bcdDevice= 0.01
[   98.669829] usb 6-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
[   98.669836] usb 6-1: Product: AS2105
[   98.669842] usb 6-1: Manufacturer: ASMedia
[   98.669847] usb 6-1: SerialNumber:             XXXXXXXX
[   98.702925] usb-storage 6-1:1.0: USB Mass Storage device detected
[   98.703186] scsi host10: usb-storage 6-1:1.0
[   98.703323] usbcore: registered new interface driver usb-storage
[   98.712000] usbcore: registered new interface driver uas
[   99.729263] scsi 10:0:0:0: Direct-Access     ST3000DM 001-1CH166       CC29 PQ: 0 ANSI: 0
[   99.729842] sd 10:0:0:0: Attached scsi generic sg3 type 0
[   99.730034] sd 10:0:0:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[  130.381128] usb 6-1: reset SuperSpeed USB device number 2 using xhci_hcd
[  130.401814] sd 10:0:0:0: [sdc] 5860533168 512-byte logical blocks: (3.00 TB/2.73 TiB)
[  130.402173] sd 10:0:0:0: [sdc] Write Protect is off
[  130.402181] sd 10:0:0:0: [sdc] Mode Sense: 23 00 00 00
[  130.402486] sd 10:0:0:0: [sdc] No Caching mode page found
[  130.402493] sd 10:0:0:0: [sdc] Assuming drive cache: write through
[  130.403339] sd 10:0:0:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[  161.100747] usb 6-1: reset SuperSpeed USB device number 2 using xhci_hcd
[  161.395030]  sdc: sdc1 sdc2
[  161.395511] sd 10:0:0:0: [sdc] Attached SCSI disk

Output of `cat /sys/kernel/debug/usb/usbmon/6u`, see attachment.
Comment 1 Alan Stern 2024-05-25 13:54:01 UTC
There are at least two very suspicious-looking things in the usbmon trace:

ffff9244d6cf5000 1301812804 S Ci:6:004:0 s 80 06 0301 0409 00ff 255 <
ffff9244d6cf5000 1306956154 C Ci:6:004:0 -2 255 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
ffff9244d6cf5000 1306956185 S Ci:6:004:0 s 80 06 0301 0409 0002 2 <
ffff9244d6cf5000 1310626032 C Ci:6:004:0 0 2 = 2a03

The first two lines show the computer asking the drive to send one of its string descriptors and the transfer timing out after 5 seconds.  The trace says that the device sent 255 bytes of zeros, which seems unlikely, but if it is true it should have caused the transfer to end right away.  The second two lines show the computer asking for just the first two bytes of the string descriptor and the response arriving correctly, but not until an additional 3.7 seconds have passed.

ffff9244d6cf5300 1311630806 S Bo:6:004:2 -115 31 = 55534243 05000000 00020000 80000a28 00000000 00000001 00000000 000000
ffff9244d6cf5300 1311630823 C Bo:6:004:2 0 31 >
ffff9244ea669300 1311630835 S Bi:6:004:1 -115 512 <
ffff9244ea669300 1342568184 C Bi:6:004:1 -104 512 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

This shows the computer trying to read the first sector of the drive.  The transfer times out after 30 seconds (again after 512 bytes of zeros are apparently received), and this error is what causes the reset.

It's possible that the problem is in the device, but more likely the problem is in the kernel.  You've made a good start by finding that the 6.6.1 kernel works properly and the 6.7.7 kernel doesn't.  You should try doing a git bisection to find out which commit in the interval between those kernels caused the problem to appear.
Comment 2 Pierre Tomon 2024-05-28 00:19:23 UTC
I did a `git bisect` on the stable repo and it give me this as result:
d97e1c3602240bd35c48ef9aa978e0c47a511d03 is the first bad commit
Author: Martin K. Petersen <martin.petersen@oracle.com>
Date:   Tue Feb 13 09:33:06 2024 -0500
scsi: sd: usb_storage: uas: Access media prior to querying device properties
...

I also did a `git revert` on that commit and built the kernel, The bug does not appear.
Comment 3 Alan Stern 2024-05-28 01:08:41 UTC
That makes sense; the commit changes the sequence of commands during the early part of initialization.  However, to figure out what's going wrong it would be a big help to have a usbmon trace for an earlier working kernel, such as 6.6.15, for comparison.  Can you do that test and attach the trace to this bug report?

You don't have to collect a huge amount of information.  The first few seconds would probably be enough -- up to the point where the "Attached SCSI disk" message appears in the kernel log.
Comment 4 Pierre Tomon 2024-05-28 09:02:59 UTC
Created attachment 306362 [details]
usbmon_6.6.15.txt

Attachment added.
Comment 5 Mathias Nyman 2024-05-29 10:40:06 UTC
Would it possible to get xhci traces and logs? 

Alan pointed out that there's something odd with urb->actual_length 
values set by xhci driver.

Steps to take traces and logs:
mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
< Reproduce issue >
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace

trace file grows fast so please copy it as soon as possbile after issue is reproduced.
Comment 6 Pierre Tomon 2024-05-29 11:38:41 UTC
Created attachment 306365 [details]
dmesg_2024-05-29_13:28:29
Comment 7 Pierre Tomon 2024-05-29 11:39:08 UTC
Created attachment 306366 [details]
trace_2024-05-29_13:25:37
Comment 8 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-05-29 11:46:13 UTC
FWIW, to make sure this is known: another regression was bisected to the culprit and a patch for it exists:
https://lore.kernel.org/all/eec6ebbf-061b-4a7b-96dc-ea748aa4d035@googlemail.com/
Does not sound related, but I might be wrong here, so I thought I quickly mention it.
Comment 9 Mathias Nyman 2024-05-29 15:39:04 UTC
Traces show xhci driver sets the length of the cancelled 512 byte bulk transfer 
incorrectly.

I'll attach a patch for that.

The 255 byte control transfer case is not that clear.
It stops on the data stage with a 'stopped' completion code, with 'len' field indicating residual untransferred bytes is 0.

 827.939738: xhci_handle_event: EVENT: TRB 00000001511921e0 status 'Stopped' len 0 slot 2 ep 1 type 'Transfer Event' flags e:c

 827.939747: xhci_handle_transfer: CTRL: Buffer 00000001023cff00 length 255 TD size 0 intr 0 type 'Data Stage' flags i:i:c:s:I:e:C

It's also unclear why these transfers don't go through
Comment 10 Mathias Nyman 2024-05-29 15:42:51 UTC
Created attachment 306368 [details]
patch to fix cancelled bulk transfer length

grateful if this patch could be tested.
usbmon or xhci traces should show if it works.
Comment 11 Pierre Tomon 2024-05-30 11:44:06 UTC
The patch doesn't work, same message.
Comment 12 Mathias Nyman 2024-05-30 15:20:32 UTC
Did usbmon still show that the timed out bulk-in transfer received 512 bytes of zeroes like earlier?

ffff9244ea669300 1311630835 S Bi:6:004:1 -115 512 <
ffff9244ea669300 1342568184 C Bi:6:004:1 -104 512 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Comment 13 Alan Stern 2024-05-30 15:28:02 UTC
I should have pointed out that Mathias's change might not affect the timeout and reset that you're getting.  The usbmon trace showed three different problems, which might or might not be related, and the new patch is intended to fix only one of them.  Until it was tested, we had no way of knowing whether it would also fix any of the others.
Comment 14 Pierre Tomon 2024-05-30 17:08:36 UTC
Created attachment 306381 [details]
usbmon_6.9.3_with_patch.txt

Attachment: usbmon with the patch applied.
Comment 15 Alan Stern 2024-05-30 18:55:53 UTC
Here's the relevant part of the usbmon trace:

ffff97af45927780 378815216 S Bo:5:004:2 -115 31 = 55534243 08000000 00020000 80000a28 00000000 00000001 0000000 000000
ffff97af45927780 378815233 C Bo:5:004:2 0 31 >
ffff97af4255c900 378815246 S Bi:5:004:1 -115 512 <
ffff97af4255c900 410624206 C Bi:5:004:1 -104 0

There's still a timeout and cancellation, but at least now the number of bytes transferred is reported as 0 and not 512.  So the patch did fix this bug, but it didn't prevent the main problem from occurring.

It looks like the main difference between the working and non-working kernels is that the non-working one issues a READ(10) command (added by Martin Petersen's commit) whereas the working one uses only READ(16) commands.  I'll ask Martin about updating his commit to change this.
Comment 16 Alan Stern 2024-06-20 13:56:01 UTC
Pierre, unless you're still having any problems, you can go ahead and close out this bug report.
Comment 17 Pierre Tomon 2024-06-21 19:53:11 UTC
Patches are now in stable releases, 6.9.6 here, all is ok.

Commit: f0260589b439 - xhci: Set correct transferred length for cancelled bulk transfers
Releases: v6.1.95, v6.6.35 and v6.9.6

Commit: 7926d51f73e0 - scsi: sd: Use READ(16) when reading block zero on large capacity disks
Releases: v6.1.95, v6.6.35 and v6.9.6

Thanks to Alan Stern, Mathias Nyman, Martin K. Petersen and others for the help.

Note You need to log in before you can comment on or make changes to this bug.