Bug 219652
Summary: | READ CAPACITY(16) not used on large USB-attached drive in recent kernels | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Kris Karas (bugs-a21) |
Component: | SCSI | Assignee: | linux-scsi (linux-scsi) |
Status: | NEW --- | ||
Severity: | normal | CC: | michael.christie, mkp, stern |
Priority: | P3 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | Yes | Bisected commit-id: | 0f11328f2f46618c8c4734041fdb2aacfa99b802 |
Attachments: |
diff without stamps of the usbmon output, good (-) to bad (+)
usbmon trace from kernel 6.12 showing failed capacity negotiation usbmon trace from kernel 6.1 showing correct capacity negotiation |
Created attachment 307436 [details]
usbmon trace from kernel 6.12 showing failed capacity negotiation
Created attachment 307437 [details]
usbmon trace from kernel 6.1 showing correct capacity negotiation
I had a feeling that Greg KH would amble by and say, "Can you bisect that?" :-) So, I figured I'd try to get to that first. And the winner is... commit 0f11328f2f46618c8c4734041fdb2aacfa99b802 Author: Mike Christie <michael.christie@oracle.com> Date: Mon Jan 22 18:22:16 2024 -0600 scsi: sd: Have midlayer retry read_capacity_10() errors This has read_capacity_10() have the SCSI midlayer retry errors instead of driving them itself. I have tested 6.12.7 with that backed-out, and confirm that read capacity works normally again. Your best bet is to instrument the new read_capacity_10() routine and its caller (sd_read_capacity()) with sd_printk() lines to find out exactly what is going on. For instance, does sd_try_rc16_first() return true? And why does the "Very big device. Trying to use READ CAPACITY(16)" line not get executed? Also, please provide the output of: # sg_readcap --hex /dev/sdb # sg_readcap --hex --16 /dev/sdb On 1/1/25 6:57 PM, bugzilla-daemon@kernel.org wrote: > * When booting into kernel 6.12.7, the external drive capacity > is mis-reported. Booting /back/ into 6.6.68, the capacity > continues to be mis-reported. One must now yank the USB > cable (or power-cycle) to get the correct size detection again. That part is really strange. I wonder if the new code is not resetting the device and the old code was somehow. > > Here is a diff of the dmesg output when plugging in the drive. > From (-) is kernel 6.1.122 > To (+) is kernel 6.12.7: > scsi host12: usb-storage 1-13:1.0 > scsi 12:0:0:0: Direct-Access TOSHIBA DT01ACA300 3.00 PQ: 0 ANSI: > 4 > sd 12:0:0:0: Attached scsi generic sg1 type 0 > -sd 12:0:0:0: [sdb] Very big device. Trying to use READ CAPACITY(16). > -sd 12:0:0:0: [sdb] 5860533167 512-byte logical blocks: (3.00 TB/2.73 TiB) > +sd 12:0:0:0: [sdb] 4294967295 512-byte logical blocks: (2.20 TB/2.00 TiB) > sd 12:0:0:0: [sdb] Write Protect is off > sd 12:0:0:0: [sdb] Mode Sense: 23 00 00 00 > sd 12:0:0:0: [sdb] No Caching mode page found > sd 12:0:0:0: [sdb] Assuming drive cache: write through > -sd 12:0:0:0: [sdb] Very big device. Trying to use READ CAPACITY(16). > - sdb: sdb1 > sd 12:0:0:0: [sdb] Attached SCSI disk > > I am attaching a diff of the usbmon output for the earlier kernel (good) and > current kernel (bad), Just to start can you tell me what the device is returning? We hit that "Very big device" code path if the device returns a size of 0xffffffff or larger so with the old code we at some point returned a large size. With the new code the device returns only 0xfffffffe (4294967294) (read_capacity_10 does a lba that we got from the device + 1 so you see 4294967295 in the log message) so we never hit that very bit device code path and try READ CAPACITY(16). I might have changed some behavior during the retries that causes us to not get the right size now. However, I couldn't fully understand the output of the usbmon traces. Can you tell me when we do the READ CAPACITY (10) what the device is returning? Is it returning a sense error like UNIT_ATTENTION a couple times and when it returns success what's in the data buffer? The two kernels perform identical USB initializations, and the two SCSI initializations also start out identically: INQUIRY, TEST UNIT READY, and READ CAPACITY(10). Under both kernels, the response to the READ CAPACITY(10) is 0xffffffff blocks with blocksize 0x00000200, no errors. At that point the bad kernel repeats the READ CAPACITY(10) command, issuing it a total of 4 times before stopping (the fourth time, the returned value says 0xfffffffe blocks instead of 0xffffffff -- I'm not sure if the device does this or if usb-storage changes the value in flight, but it doesn't really matter because the error has already occurred). By contrast, at that point the good kernel issues READ CAPACITY(16), receiving a response of 0x15d50a3ae blocks of size 0x00000200, the correct values. For Martin in comment #5: # : Working kernel sg_readcap 00 ff ff ff ff 00 00 02 00 # : Ditto --16 00 00 00 00 01 5d 50 a3 ae 00 00 02 00 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 # : Failing kernel sg_readcap 00 ff ff ff fe 00 00 02 00 # : Ditto --16 00 00 00 00 01 5d 50 a3 ae 00 00 02 00 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 As Michael surmised, the only difference I can see from the above is in the regular READ_CAPACITY, returning ffffffff normally and fffffffe under the failing patch. For Alan in comment #4 and Michael in comment #6: I spent a couple hours staring at the code (all of it new to me), and the only thing I could see pre/post patch is that the retry loop was moved one level down. Commands passed, buffers, etc., look the same. Timing/Race? Command buffer is a bit bigger (16 vs 10). sshdr.ascq is checked (== 0x00) pre-patch but not post-patch. Retry count 13 vs 10. Can you guys write me a couple sd_printk()'s you would like to see in there? I am not familiar with the structure extractors enough to craft a printk with useful info in it. Just saw Alan's comment #7 (posting race). FWIW, there are two drives showing this error. The first is a Toshiba DT01ACA300, and the second, a Seagate ST3000DM001. So it's rather unlikely to be a drive firmware bug. Given that this patch has been out since 6.9.1 (if I read git correctly), it's a bit odd that my searches for relevant/duplicate bug reports came up empty. Hardware bug with this particular USB/SATA bridge? I checked the manufacturer's website, and the model 1610 sports an embedded 8051 uP with downloadable firmware; however, there are no firmware resources on their website. I found one "smoking gun" in this report from site hddguru: https://forum.hddguru.com/viewtopic.php?t=33369 It indicates that other initio USB-to-SATA adapters report sector_count-1 in some cases. If this is specific to Initio adapters, perhaps the easiest patch here is to assume that if read_capacity_10 is reported as 0xfffffffe, then it's buggy hardware, treat it as 0xffffffff and proceed thusly. Re comment #8: I would like to know the answers to the questions asked in comment #4 about the bad kernel: In sd_read_capacity(), does sd_try_rc16_first() return true? And why doesn't the "Very big device. Trying to use READ CAPACITY(16)" line, along with the subsequent call to read_capacity_16(), get executed in the bad kernel? Or does read_capacity_16() get executed but return immediately with an error instead of communicating with the device? The whole business about 0xffffffff vs. 0xfffffffe is a red herring. If the bad kernel were working properly, the issue wouldn't even arise. Notice that you never see the erroneous value under the good kernel. For example, at the appropriate points in sd_read_capacity(), add: sd_printk(KERN_NOTICE, sdkp, "sd_try_rc16_first(): %d\n", sd_try_rc16_first(sdp)); sd_printk(KERN_NOTICE, sdkp, "sector_size: %d\n", sector_size); sd_printk(KERN_NOTICE, sdkp, "sizeof(sdkp->capacity): %d\n", (int) sizeof(sdkp->capacity)); sd_printk(KERN_NOTICE, sdkp, "sdkp->capacity: %llx\n", (unsigned long long) sdkp->capacity); On 1/2/25 9:30 PM, bugzilla-daemon@kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=219652 > > --- Comment #10 from Alan Stern (stern@rowland.harvard.edu) --- > Re comment #8: I would like to know the answers to the questions asked in > comment #4 about the bad kernel: > > In sd_read_capacity(), does sd_try_rc16_first() return true? > > And why doesn't the "Very big device. Trying to use READ CAPACITY(16)" > line,> along with the subsequent call to read_capacity_16(), get executed > in the bad > kernel? I see it. With the new code we think the command is failing. We then retry the command 3 times like you described you saw in the trace. Then because on the 3rd retry we get 0xfffffffe instead of 0xffffffff, we don't hit the check below like I mentioned before: sector_size = read_capacity_10(sdkp, sdp, buffer); if (sector_size == -EOVERFLOW) goto got_data; if (sector_size < 0) return; if ((sizeof(sdkp->capacity) > 4) && (sdkp->capacity > 0xffffffffULL)) { int old_sector_size = sector_size; sd_printk(KERN_NOTICE, sdkp, "Very big device. " "Trying to use READ CAPACITY(16).\n"); sector_size = read_capacity_16(sdkp, sdp, lim, buffer); With the old kernel, we saw the first try succeed. We then saw the 0xffffffff and then tried read_capacity_16 above. And the reason for the difference was that with the new code, I forgot to add a check for if there was even an error. We ended up always retrying 3 times and that lead us to get the 0xfffffffe value on that last retry. so we need: diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index adee6f60c966..2dcf225c7017 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -210,6 +210,9 @@ static int scsi_check_passthrough(struct scsi_cmnd *scmd, struct scsi_sense_hdr sshdr; enum sam_status status; + if (!scmd->result) + return 0; + if (!failures) return 0; Sorry for late reply, was out of town. Just checked Michael's proposed two-liner in comment #12, and am most happy to confirm it fixes this bug! FWIW, I also investigated the related issue mentioned in comment #9, from the HDDGuru forum, and can confirm that (with this same brand, Initio) of USB-to-SATA connector, when you perform a disk capacity check via USB, it returns exactly one sector less than when the enclosure is connected via eSATA. So their chipset/firmware clearly has a bug. No problem. Thanks for doing the investigation and lots of debugging. I'll send the patch upstream later today. |
Created attachment 307435 [details] diff without stamps of the usbmon output, good (-) to bad (+) Upgrading from old mainline LTS kernel 6.6 to current LTS kernel 6.12, a large 3.00TB SATA drive connected via USB through an "Initio Corporation INIC-1610P SATA bridge" (id 13fd:1e40) is falsely reported as a 2.00TiB capacity drive. According to the dmesg output, the newer kernel fails to identify the need to call read_capacity_16(), resulting in a 32-bit size calculation. I had reported this initially to the linux-usb mailing list. Alan Stern (CC'ed) wrote back, suggesting I redirect to linux-scsi, and include some usbmon traces. To upload the traces, I'm opening this on bugzilla. A few notes: * The USB-to-SATA bridge (via lsusb) is: ID 13fd:1e40 Initio Corporation INIC-1610P SATA bridge * When booting into kernel 6.12.7, the external drive capacity is mis-reported. Booting /back/ into 6.6.68, the capacity continues to be mis-reported. One must now yank the USB cable (or power-cycle) to get the correct size detection again. Here is a diff of the dmesg output when plugging in the drive. From (-) is kernel 6.1.122 To (+) is kernel 6.12.7: scsi host12: usb-storage 1-13:1.0 scsi 12:0:0:0: Direct-Access TOSHIBA DT01ACA300 3.00 PQ: 0 ANSI: 4 sd 12:0:0:0: Attached scsi generic sg1 type 0 -sd 12:0:0:0: [sdb] Very big device. Trying to use READ CAPACITY(16). -sd 12:0:0:0: [sdb] 5860533167 512-byte logical blocks: (3.00 TB/2.73 TiB) +sd 12:0:0:0: [sdb] 4294967295 512-byte logical blocks: (2.20 TB/2.00 TiB) sd 12:0:0:0: [sdb] Write Protect is off sd 12:0:0:0: [sdb] Mode Sense: 23 00 00 00 sd 12:0:0:0: [sdb] No Caching mode page found sd 12:0:0:0: [sdb] Assuming drive cache: write through -sd 12:0:0:0: [sdb] Very big device. Trying to use READ CAPACITY(16). - sdb: sdb1 sd 12:0:0:0: [sdb] Attached SCSI disk I am attaching a diff of the usbmon output for the earlier kernel (good) and current kernel (bad), in case the sequence of commands sent to/from the bridge shows whatever it is that puts the bridge into a low-capacity state. As I think I can only attach one file to this posting, I'll make separate attachments for the good and bad usbmon listings.