Created attachment 306346 [details] usbmon Hello, Since kernel 6.1.82 (amd64), I got this message when turning on the external drive. The device is accessible after about 1 minute, then I can mount the fs and access data. It happens with those kernel versions: 6.1.82 6.7.7 6.7.9 6.8.9 Git mainline Latest kernel versions tested and not affected: 6.1.76 6.6.15 SATA/USB controller: Bus 006 Device 003: ID 174c:5106 ASMedia Technology Inc. ASM1051 SATA 3Gb/s bridge SATA controler: ASMedia Technology Inc. ASM1061/ASM1062 Serial ATA Controller (rev 01) Added this module parameter: `quirks=174c:5106:t` for the usb-storage driver, same. Part of `dmesg`: [ 89.821220] usb 6-1: new SuperSpeed USB device number 2 using xhci_hcd [ 98.669813] usb 6-1: New USB device found, idVendor=174c, idProduct=5106, bcdDevice= 0.01 [ 98.669829] usb 6-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1 [ 98.669836] usb 6-1: Product: AS2105 [ 98.669842] usb 6-1: Manufacturer: ASMedia [ 98.669847] usb 6-1: SerialNumber: XXXXXXXX [ 98.702925] usb-storage 6-1:1.0: USB Mass Storage device detected [ 98.703186] scsi host10: usb-storage 6-1:1.0 [ 98.703323] usbcore: registered new interface driver usb-storage [ 98.712000] usbcore: registered new interface driver uas [ 99.729263] scsi 10:0:0:0: Direct-Access ST3000DM 001-1CH166 CC29 PQ: 0 ANSI: 0 [ 99.729842] sd 10:0:0:0: Attached scsi generic sg3 type 0 [ 99.730034] sd 10:0:0:0: [sdc] Very big device. Trying to use READ CAPACITY(16). [ 130.381128] usb 6-1: reset SuperSpeed USB device number 2 using xhci_hcd [ 130.401814] sd 10:0:0:0: [sdc] 5860533168 512-byte logical blocks: (3.00 TB/2.73 TiB) [ 130.402173] sd 10:0:0:0: [sdc] Write Protect is off [ 130.402181] sd 10:0:0:0: [sdc] Mode Sense: 23 00 00 00 [ 130.402486] sd 10:0:0:0: [sdc] No Caching mode page found [ 130.402493] sd 10:0:0:0: [sdc] Assuming drive cache: write through [ 130.403339] sd 10:0:0:0: [sdc] Very big device. Trying to use READ CAPACITY(16). [ 161.100747] usb 6-1: reset SuperSpeed USB device number 2 using xhci_hcd [ 161.395030] sdc: sdc1 sdc2 [ 161.395511] sd 10:0:0:0: [sdc] Attached SCSI disk Output of `cat /sys/kernel/debug/usb/usbmon/6u`, see attachment.
There are at least two very suspicious-looking things in the usbmon trace: ffff9244d6cf5000 1301812804 S Ci:6:004:0 s 80 06 0301 0409 00ff 255 < ffff9244d6cf5000 1306956154 C Ci:6:004:0 -2 255 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ffff9244d6cf5000 1306956185 S Ci:6:004:0 s 80 06 0301 0409 0002 2 < ffff9244d6cf5000 1310626032 C Ci:6:004:0 0 2 = 2a03 The first two lines show the computer asking the drive to send one of its string descriptors and the transfer timing out after 5 seconds. The trace says that the device sent 255 bytes of zeros, which seems unlikely, but if it is true it should have caused the transfer to end right away. The second two lines show the computer asking for just the first two bytes of the string descriptor and the response arriving correctly, but not until an additional 3.7 seconds have passed. ffff9244d6cf5300 1311630806 S Bo:6:004:2 -115 31 = 55534243 05000000 00020000 80000a28 00000000 00000001 00000000 000000 ffff9244d6cf5300 1311630823 C Bo:6:004:2 0 31 > ffff9244ea669300 1311630835 S Bi:6:004:1 -115 512 < ffff9244ea669300 1342568184 C Bi:6:004:1 -104 512 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 This shows the computer trying to read the first sector of the drive. The transfer times out after 30 seconds (again after 512 bytes of zeros are apparently received), and this error is what causes the reset. It's possible that the problem is in the device, but more likely the problem is in the kernel. You've made a good start by finding that the 6.6.1 kernel works properly and the 6.7.7 kernel doesn't. You should try doing a git bisection to find out which commit in the interval between those kernels caused the problem to appear.
I did a `git bisect` on the stable repo and it give me this as result: d97e1c3602240bd35c48ef9aa978e0c47a511d03 is the first bad commit Author: Martin K. Petersen <martin.petersen@oracle.com> Date: Tue Feb 13 09:33:06 2024 -0500 scsi: sd: usb_storage: uas: Access media prior to querying device properties ... I also did a `git revert` on that commit and built the kernel, The bug does not appear.
That makes sense; the commit changes the sequence of commands during the early part of initialization. However, to figure out what's going wrong it would be a big help to have a usbmon trace for an earlier working kernel, such as 6.6.15, for comparison. Can you do that test and attach the trace to this bug report? You don't have to collect a huge amount of information. The first few seconds would probably be enough -- up to the point where the "Attached SCSI disk" message appears in the kernel log.
Created attachment 306362 [details] usbmon_6.6.15.txt Attachment added.
Would it possible to get xhci traces and logs? Alan pointed out that there's something odd with urb->actual_length values set by xhci driver. Steps to take traces and logs: mount -t debugfs none /sys/kernel/debug echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable echo 1 > /sys/kernel/debug/tracing/tracing_on < Reproduce issue > Send output of dmesg Send content of /sys/kernel/debug/tracing/trace trace file grows fast so please copy it as soon as possbile after issue is reproduced.
Created attachment 306365 [details] dmesg_2024-05-29_13:28:29
Created attachment 306366 [details] trace_2024-05-29_13:25:37
FWIW, to make sure this is known: another regression was bisected to the culprit and a patch for it exists: https://lore.kernel.org/all/eec6ebbf-061b-4a7b-96dc-ea748aa4d035@googlemail.com/ Does not sound related, but I might be wrong here, so I thought I quickly mention it.
Traces show xhci driver sets the length of the cancelled 512 byte bulk transfer incorrectly. I'll attach a patch for that. The 255 byte control transfer case is not that clear. It stops on the data stage with a 'stopped' completion code, with 'len' field indicating residual untransferred bytes is 0. 827.939738: xhci_handle_event: EVENT: TRB 00000001511921e0 status 'Stopped' len 0 slot 2 ep 1 type 'Transfer Event' flags e:c 827.939747: xhci_handle_transfer: CTRL: Buffer 00000001023cff00 length 255 TD size 0 intr 0 type 'Data Stage' flags i:i:c:s:I:e:C It's also unclear why these transfers don't go through
Created attachment 306368 [details] patch to fix cancelled bulk transfer length grateful if this patch could be tested. usbmon or xhci traces should show if it works.
The patch doesn't work, same message.
Did usbmon still show that the timed out bulk-in transfer received 512 bytes of zeroes like earlier? ffff9244ea669300 1311630835 S Bi:6:004:1 -115 512 < ffff9244ea669300 1342568184 C Bi:6:004:1 -104 512 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
I should have pointed out that Mathias's change might not affect the timeout and reset that you're getting. The usbmon trace showed three different problems, which might or might not be related, and the new patch is intended to fix only one of them. Until it was tested, we had no way of knowing whether it would also fix any of the others.
Created attachment 306381 [details] usbmon_6.9.3_with_patch.txt Attachment: usbmon with the patch applied.
Here's the relevant part of the usbmon trace: ffff97af45927780 378815216 S Bo:5:004:2 -115 31 = 55534243 08000000 00020000 80000a28 00000000 00000001 0000000 000000 ffff97af45927780 378815233 C Bo:5:004:2 0 31 > ffff97af4255c900 378815246 S Bi:5:004:1 -115 512 < ffff97af4255c900 410624206 C Bi:5:004:1 -104 0 There's still a timeout and cancellation, but at least now the number of bytes transferred is reported as 0 and not 512. So the patch did fix this bug, but it didn't prevent the main problem from occurring. It looks like the main difference between the working and non-working kernels is that the non-working one issues a READ(10) command (added by Martin Petersen's commit) whereas the working one uses only READ(16) commands. I'll ask Martin about updating his commit to change this.
Pierre, unless you're still having any problems, you can go ahead and close out this bug report.
Patches are now in stable releases, 6.9.6 here, all is ok. Commit: f0260589b439 - xhci: Set correct transferred length for cancelled bulk transfers Releases: v6.1.95, v6.6.35 and v6.9.6 Commit: 7926d51f73e0 - scsi: sd: Use READ(16) when reading block zero on large capacity disks Releases: v6.1.95, v6.6.35 and v6.9.6 Thanks to Alan Stern, Mathias Nyman, Martin K. Petersen and others for the help.