Bug 16181
Summary: | libata EH going crazy on hotplug PW protected HDD | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Robert de Rooy (robert.de.rooy) |
Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alan, kay, mlord, tj |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.33.5-112.fc13.i686 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
syslog
read_part_sector-sane-retries.patch syslog after patch Example session showing security lock/unlock read_part_sector-sane-retries.patch gzipped syslog kernel config file syslog without hal and udev running syslog without hal and udev running |
Description
Robert de Rooy
2010-06-11 12:09:46 UTC
Here is what I see on a T60 with ICH7 (Ultrabay is still PATA) ACPI: \_SB_.PCI0.IDE0.PRIM.MSTR - docking Jun 11 14:12:23 t60 kernel: ata5: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen Jun 11 14:12:23 t60 kernel: ata5: ACPI event Jun 11 14:12:24 t60 kernel: ata5: soft resetting link Jun 11 14:12:24 t60 kernel: ata5.00: ATA-6: HTS726060M9AT00, MH4OA6DA, max UDMA/100 Jun 11 14:12:24 t60 kernel: ata5.00: 117210240 sectors, multi 0: LBA Jun 11 14:12:24 t60 kernel: ata5.00: limited to UDMA/33 due to 40-wire cable Jun 11 14:12:24 t60 kernel: ata5.00: configured for UDMA/33 Jun 11 14:12:24 t60 kernel: ata5: EH complete Jun 11 14:12:24 t60 kernel: scsi 4:0:0:0: Direct-Access ATA HTS726060M9AT00 MH4O PQ: 0 ANSI: 5 Jun 11 14:12:24 t60 kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0 Jun 11 14:12:24 t60 kernel: sd 4:0:0:0: [sdb] 117210240 512-byte logical blocks: (60.0 GB/55.8 GiB) Jun 11 14:12:24 t60 kernel: sd 4:0:0:0: [sdb] Write Protect is off Jun 11 14:12:24 t60 kernel: sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Jun 11 14:12:24 t60 kernel: sdb: And the same continuous stream of errors Jun 11 14:12:24 t60 kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Jun 11 14:12:24 t60 kernel: ata5.00: BMDMA stat 0x25 Jun 11 14:12:24 t60 kernel: ata5.00: failed command: READ DMA Jun 11 14:12:24 t60 kernel: ata5.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in Jun 11 14:12:24 t60 kernel: res 51/04:08:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error) Jun 11 14:12:24 t60 kernel: ata5.00: status: { DRDY ERR } Jun 11 14:12:24 t60 kernel: ata5.00: error: { ABRT } Jun 11 14:12:24 t60 kernel: ata5.00: configured for UDMA/33 Jun 11 14:12:24 t60 kernel: ata5: EH complete Can you please attach the whole output of dmesg instead of snippets? How continuous is continuous? libata itself doesn't really care whether the drive is locked or not. If it's locked, IOs will fail and whoever is trying to access the drive should give up after a few tries. It looks like the failing command is read from the partition code which should give up after several retries, doesn't it? Thanks. Created attachment 26752 [details]
syslog
I mean it keeps trying it indefinably until you remove the disk, which I did in
the attached log file after a bit.
Hmmm... yeah, that is annoyingly persistent. It's just going through all the different partition types and libata can't tell whether the command failure is persistent or transient so it can't preemptively fail commands on issue. Probably the most expensive part of all that is printing out all those information to console. If you make the kernel to avoid logging to console (e.g. klogconsole -l1), it doesn't take too long to finish, right? klogconsole does not exist on Fedora 13. But your right, after a bit it does stop. It just took a few tries... # dmesg |grep EH|wc -l 280 Is there no way for libata to detect the HDD is locked before it starts scanning partition types? The thing is that ATA error reporting is not very accurate. libata can try to interpret the identify data but again the less critical part of it doesn't tend to be too reliable, so the general policy should be try-if-not-completely-sure. We can suppress or use lower logging level for failures during partition probe but I'm not quite sure whether that would be a good idea either. How long does it take to actually give up? Thanks. On the old ThinkPad T41 (ICH4) it took 20 seconds. On the newer T60 (ICH7 PATA) it took 36 seconds instead. The poor performance on the T60 is probably due to the integrated graphics. When I measured it again without following the log, it also dropped to 20 seconds. Hmmm.... I'm not sure how to work around this one. We definitely want to be retrying there but we are trying too hard. Those reads are coming from partition code and as there's no shared state between different partition probe code different partition types end up issuing reads to the same sectors regardless they failed previously or not. I think it would be best if partition probe code tries only different sectors. I'll see if that can be implemented without disturbing the code too much but I have to say it's a pretty niche problem. Thanks. Well, the next issue would be, how to actually unlock the HDD from Linux, as hdparm just fails on both ThinkPads with an IO error for any of the security commands. Regardless if the HDD is actually locked or not. # hdparm --user-master u --security-unlock test /dev/sdb security_password="test" /dev/sdb: Issuing SECURITY_UNLOCK command, password="test", user=user SECURITY_UNLOCK: Input/output error Well, that's something for Mark. cc'd. The BIOS has probably done a "SECURITY FREEZE" command. To undo that, you'll have to power-cycle the driver (just pull it out, and plug it back in again quickly). Then the other commands might work. Try 'hdparm -I /dev/sdb', and look for 'frozen' or 'not frozen' near the bottom of the output. While that is the case for my boot drive (sda), it is not the case for the just hotplugged drive sdb. $ sudo hdparm -I /dev/sdb /dev/sdb: ATA device, with non-removable media Model Number: HTS726060M9AT00 Serial Number: MRH453M4H11ARB Firmware Revision: MH4OA6DA Standards: Used: ATA/ATAPI-6 T13 1410D revision 3a Supported: 6 5 4 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 117210240 Logical/Physical Sector size: 512 bytes device size with M = 1024*1024: 57231 MBytes device size with M = 1000*1000: 60011 MBytes (60 GB) cache/buffer size = 7877 KBytes (type=DualPortCache) Capabilities: LBA, IORDY(can be disabled) Standby timer values: spec'd by Vendor, no device specific minimum R/W multiple sector transfer: Max = 16 Current = 0 Advanced power management level: 128 Recommended acoustic management value: 128, current value: 254 DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=240ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set * Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * NOP cmd * Advanced Power Management feature set Power-Up In Standby feature set * SET_FEATURES required to spinup after power up Address Offset Reserved Area Boot SET_MAX security extension Automatic Acoustic Management feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * SMART error logging * SMART self-test Security: Master password revision code = 65534 supported enabled locked not frozen not expired: security count not supported: enhanced erase Security level maximum 38min for SECURITY ERASE UNIT. HW reset results: CBLID- above Vih Device num = 0 determined by the jumper Checksum: correct With regards to the original discussion about partition probing: the IDENTIFY data (shown above) clearly shows the drive as "locked". So that could be used to prevent the pointless thrashing discussed earlier. As for unlocking the drive, does this work? hdparm --user-master m --security-unlock test /dev/sdb If not, then what makes you think "test" is the correct password ? Remember that security commands are designed to be oblique. They fail without indications as to why they failed, so as not to give clues to attackers. Thus, attempting to "unlock" a non-locked drive will fail. -ml The HDD is set in the BIOS with that password, as a "User" password (not the "Master+User" password). And if I boot with the drive present, the BIOS will prompt me for it and unlock it. Also, *any* hdparm --security-* commands fail with the same IO error regardless if the drive is locked or not. And in none of the cases is hdparm reporting it as being frozen. [root@t41 robert]# hdparm --user-master m --security-unlock test /dev/sdb security_password="test" /dev/sdb: Issuing SECURITY_UNLOCK command, password="test", user=master SECURITY_UNLOCK: Input/output error [root@t41 robert]# hdparm --user-master u --security-unlock test /dev/sdb security_password="test" /dev/sdb: Issuing SECURITY_UNLOCK command, password="test", user=user SECURITY_UNLOCK: Input/output error [root@t41 robert]# hdparm -I /dev/sdb|grep frozen not frozen (In reply to comment #14) > With regards to the original discussion about partition probing: the > IDENTIFY > data (shown above) clearly shows the drive as "locked". So that could be > used > to prevent the pointless thrashing discussed earlier. I don't really wanna do that. There could (or rather would) be devices which have garbages there or fail to update the values after unlocking or whatever. As a general rule, I don't think it would be wise to preemptively fail commands without actually trying regardless of what the IDENTIFY data says. Thanks. There's zero guarantee that the password you type into a BIOS is the exact thing the BIOS passes to the drive. It could be hashing it for all we know. Show us an example of a security command working or not-working on an unlocked drive. And use --verbose with hdparm in both cases, so we can see exactly what the drive is doing. Your right that the BIOS sets any drives that are present during POST to 'frozen', regardless if they have a PW set or not. I removed the password in the BIOS, and hotplugged the drive again after Linux was booted. # hdparm -I /dev/sdb /dev/sdb: ATA device, with non-removable media Model Number: HTS726060M9AT00 Serial Number: MRH453M4H11ARB Firmware Revision: MH4OA6DA Standards: Used: ATA/ATAPI-6 T13 1410D revision 3a Supported: 6 5 4 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 117210240 Logical/Physical Sector size: 512 bytes device size with M = 1024*1024: 57231 MBytes device size with M = 1000*1000: 60011 MBytes (60 GB) cache/buffer size = 7877 KBytes (type=DualPortCache) Capabilities: LBA, IORDY(can be disabled) Standby timer values: spec'd by Vendor, no device specific minimum R/W multiple sector transfer: Max = 16 Current = 0 Advanced power management level: 128 Recommended acoustic management value: 128, current value: 254 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=240ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * NOP cmd * Advanced Power Management feature set Power-Up In Standby feature set * SET_FEATURES required to spinup after power up Address Offset Reserved Area Boot SET_MAX security extension Automatic Acoustic Management feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * SMART error logging * SMART self-test Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count not supported: enhanced erase 38min for SECURITY ERASE UNIT. HW reset results: CBLID- above Vih Device num = 0 determined by the jumper Checksum: correct After this it seems like the hdparm --security-* commands succeed, although setting the master pw does not actually seem to have any result (command succeeds, but IDENTIFY data still reports it as not enabled), the user pw does. # hdparm --verbose --user-master u --security-set-pass test /dev/sdb security_password="test" /dev/sdb: Issuing SECURITY_SET_PASS command, password="test", user=user, mode=high oflags.lob_all=0x82, flags={ feat command } oflags.hob_all=0x82, flags={ feat } using LBA48 taskfile outgoing cdb: 85 0b 26 00 00 00 00 00 00 00 00 00 00 40 f1 00 data: 00 00 74 65 73 74 00 00 00 00 00 00 00 00 00 00 SG_IO: ATA_16 status=0x2, host_status=0x0, driver_status=0x8 SG_IO: sb[]: 72 00 00 00 00 00 00 0e 09 0c 01 00 00 00 00 00 00 00 00 00 e0 50 00 00 00 00 00 00 00 00 00 00 SG_IO: desc[]: 09 0c 01 00 00 00 00 00 00 00 00 00 ATA_16 stat=50 err=00 nsect=00 lbal=00 lbam=00 lbah=00 dev=e0 # hdparm -I /dev/sdb|grep enabled enabled After this I removed, and plugged the drive again, and indeed it is locked, but the unlock is failing # hdparm --verbose --user-master u --security-unlock test /dev/sdb security_password="test" /dev/sdb: Issuing SECURITY_UNLOCK command, password="test", user=user oflags.lob_all=0x82, flags={ feat command } oflags.hob_all=0x82, flags={ feat } using LBA48 taskfile outgoing cdb: 85 0b 26 00 00 00 00 00 00 00 00 00 00 40 f2 00 data: 00 00 74 65 73 74 00 00 00 00 00 00 00 00 00 00 SG_IO: ATA_16 status=0x2, host_status=0x0, driver_status=0x8 SG_IO: sb[]: 72 0b 00 00 00 00 00 0e 09 0c 01 04 00 00 00 00 00 00 00 00 e0 51 00 00 00 00 00 00 00 00 00 00 SG_IO: desc[]: 09 0c 01 04 00 00 00 00 00 00 00 00 ATA_16 stat=51 err=04 nsect=00 lbal=00 lbam=00 lbah=00 dev=e0 I/O error, ata_op=0xf2 ata_status=0x51 ata_error=0x04 SECURITY_UNLOCK: Input/output error # hdparm --verbose --user-master u --security-disable test /dev/sdb security_password="test" /dev/sdb: Issuing SECURITY_DISABLE command, password="test", user=user oflags.lob_all=0x82, flags={ feat command } oflags.hob_all=0x82, flags={ feat } using LBA48 taskfile outgoing cdb: 85 0b 26 00 00 00 00 00 00 00 00 00 00 40 f2 00 data: 00 00 74 65 73 74 00 00 00 00 00 00 00 00 00 00 SG_IO: ATA_16 status=0x2, host_status=0x0, driver_status=0x8 SG_IO: sb[]: 72 0b 00 00 00 00 00 0e 09 0c 01 04 00 00 00 00 00 00 00 00 e0 51 00 00 00 00 00 00 00 00 00 00 SG_IO: desc[]: 09 0c 01 04 00 00 00 00 00 00 00 00 ATA_16 stat=51 err=04 nsect=00 lbal=00 lbam=00 lbah=00 dev=e0 I/O error, ata_op=0xf2 ata_status=0x51 ata_error=0x04 SECURITY_UNLOCK: Input/output error # hdparm --verbose --user-master u --security-erase test /dev/sdb security_password="test" /dev/sdb: Issuing SECURITY_ERASE command, password="test", user=user outgoing cdb: 85 06 20 00 00 00 00 00 00 00 00 00 00 40 f3 00 SG_IO: ATA_16 status=0x2, host_status=0x0, driver_status=0x8 SG_IO: sb[]: 72 00 00 00 00 00 00 0e 09 0c 00 00 00 00 00 00 00 00 00 00 e0 50 00 00 00 00 00 00 00 00 00 00 SG_IO: desc[]: 09 0c 00 00 00 00 00 00 00 00 00 00 ATA_16 stat=50 err=00 nsect=00 lbal=00 lbam=00 lbah=00 dev=e0 oflags.lob_all=0x82, flags={ feat command } oflags.hob_all=0x82, flags={ feat } using LBA48 taskfile outgoing cdb: 85 0b 26 00 00 00 00 00 00 00 00 00 00 40 f4 00 data: 00 00 74 65 73 74 00 00 00 00 00 00 00 00 00 00 SG_IO: ATA_16 status=0x2, host_status=0x0, driver_status=0x8 SG_IO: sb[]: 72 0b 00 00 00 00 00 0e 09 0c 01 04 00 00 00 00 00 00 00 00 e0 51 00 00 00 00 00 00 00 00 00 00 SG_IO: desc[]: 09 0c 01 04 00 00 00 00 00 00 00 00 ATA_16 stat=51 err=04 nsect=00 lbal=00 lbam=00 lbah=00 dev=e0 I/O error, ata_op=0xf4 ata_status=0x51 ata_error=0x04 SECURITY_ERASE: Input/output error And the BIOS is not recognizing the password either, so it seems your right on the hashing part, and I have transformed the drive into a doorstop. Created attachment 26843 [details]
read_part_sector-sane-retries.patch
Can you please try this patch and see whether it helps?
Thanks.
Try unlocking it with the master password that you set (even though you said it didn't appear to lock after setting the master password). Created attachment 26851 [details]
syslog after patch
I applied the patch on top of 2.6.35-rc3, but it did not seem to reduce the messages, in fact it may have done the opposite
$ cat messages.txt |grep EH|wc -l
1740
I even had to compress the log file to get it under the bugzilla limit of 1MB.
(In reply to comment #20) > Try unlocking it with the master password that you set (even though you said > it > didn't appear to lock after setting the master password). After hotplugging the drive with the 2.6.35-rc3 kernel # hdparm -I /dev/sdb|grep -A9 Security: Security: Master password revision code = 65297 supported enabled locked not frozen not expired: security count not supported: enhanced erase Security level high 38min for SECURITY ERASE UNIT. # hdparm --verbose --user-master m --security-unlock test /dev/sdb security_password="test" /dev/sdb: Issuing SECURITY_UNLOCK command, password="test", user=master oflags.lob_all=0x82, flags={ feat command } oflags.hob_all=0x82, flags={ feat } using LBA48 taskfile outgoing cdb: 85 0b 26 00 00 00 00 00 00 00 00 00 00 40 f2 00 data: 01 00 74 65 73 74 00 00 00 00 00 00 00 00 00 00 SG_IO: ATA_16 status=0x2, host_status=0x0, driver_status=0x8 SG_IO: sb[]: 72 0b 00 00 00 00 00 0e 09 0c 01 04 00 00 00 00 00 00 00 00 e0 51 00 00 00 00 00 00 00 00 00 00 SG_IO: desc[]: 09 0c 01 04 00 00 00 00 00 00 00 00 ATA_16 stat=51 err=04 nsect=00 lbal=00 lbam=00 lbah=00 dev=e0 I/O error, ata_op=0xf2 ata_status=0x51 ata_error=0x04 SECURITY_UNLOCK: Input/output error # hdparm --verbose --user-master m --security-disable test /dev/sdb security_password="test" /dev/sdb: Issuing SECURITY_DISABLE command, password="test", user=master oflags.lob_all=0x82, flags={ feat command } oflags.hob_all=0x82, flags={ feat } using LBA48 taskfile outgoing cdb: 85 0b 26 00 00 00 00 00 00 00 00 00 00 40 f2 00 data: 01 00 74 65 73 74 00 00 00 00 00 00 00 00 00 00 SG_IO: ATA_16 status=0x2, host_status=0x0, driver_status=0x8 SG_IO: sb[]: 72 0b 00 00 00 00 00 0e 09 0c 01 04 00 00 00 00 00 00 00 00 e0 51 00 00 00 00 00 00 00 00 00 00 SG_IO: desc[]: 09 0c 01 04 00 00 00 00 00 00 00 00 ATA_16 stat=51 err=04 nsect=00 lbal=00 lbam=00 lbah=00 dev=e0 I/O error, ata_op=0xf2 ata_status=0x51 ata_error=0x04 SECURITY_UNLOCK: Input/output error The data going out all looks correct there. Here (attached in a sec) is a log of a session I just did here, locking and unlocking a drive on 2.6.32.5, using my latest (pre-release) of hdparm (no security changes). Created attachment 26854 [details]
Example session showing security lock/unlock
Created attachment 26859 [details]
read_part_sector-sane-retries.patch
Can you please try this one then? Maybe the IOs are not coming from partition check code at all? Can you please attach .config?
Created attachment 26864 [details]
gzipped syslog
Tejun,
Here is the syslog after the second patch, the log is still getting flooded.
Created attachment 26865 [details]
kernel config file
(In reply to comment #23) > The data going out all looks correct there. > > Here (attached in a sec) is a log of a session I just did here, locking and > unlocking a drive on 2.6.32.5, using my latest (pre-release) of hdparm (no > security changes). Strange, as you can see from my logs, I was able to lock the drive, but any unlock commands just fail with IO errors. I tried 2 different ThinkPads (T41 and T60), and it made no difference. I did use the same HDD in both cases, but am rather hesitant to try another one. Especially since the BIOS based lock/unlock had always worked fine. Any chance there may be a difference due to the fact that I used a PATA drive and you used a SATA drive? I guess the hdparm version shipped with F13 is new enough hdparm-9.27-1.fc13.x86_64 So, most of those reads aren't from kernel at all. Can you please test w/ udev and hal stopped? Created attachment 26880 [details]
syslog without hal and udev running
stopping hal does not make any difference, but stopping udev drops the number of EH messages to 55.
Can you please try with the patch applied and udev stopped? And I suppose it's about time to summon Kay (hello!). Kay, this is a rather rare corner case but sometimes disk device exists but fails all IOs. It looks like udev is a bit too persistent in retrying and keeps issuing IOs to the same sectors. Do you think this is something which can be improved from udev? Thanks. On Mon, Jun 21, 2010 at 16:42, <bugzilla-daemon@bugzilla.kernel.org> wrote: > Kay, this is a rather rare corner case but sometimes disk device exists but > fails all IOs. It looks like udev is a bit too persistent in retrying and > keeps issuing IOs to the same sectors. Do you think this is something which > can be improved from udev? Hmm, not sure if udev can ever decide that a device is not readable and switch some state here. Maybe the first thing would be to add the needed stuff to ata_id: http://git.kernel.org/?p=linux/hotplug/udev.git;a=blob;f=extras/ata_id/ata_id.c;hb=HEAD to identify locked/protected disks? And we can possibly avoid some further probing of the disk based on that information? At least the blkid call (which will do quite a few small reads) we could exclude, but there might be several things plugging into udev events which are not really under our control. Hmmm... maybe, I'm not sure. In general, for ATA, I think things should be trial based as long as possible. There are a lot of devices with extremely poorly done firmwares, especially in SSD and CF devices, and we run the risk of breaking working devices if we deny access without actually trying based on the IDENTIFY data. I wrote this earlier but I have this this nagging feeling that there gotta be devices which would be okay if security lock isn't used at all but fail to update locked status if it's used and unlocked. Maybe ata_id can check IDENTIFY data and then issue a read to check whether it's actually locked. That should be mostly safe, I think. One way or the other, I think it's generally low priority. It's far more important to keep working things working and ATA security lock feature isn't a very often used feature anyway (its usefulness as a security measure is dubious to begin with). Thanks. Created attachment 26901 [details]
syslog without hal and udev running
with the kernel patch applied, without udev & hal running. The EH is 'only' being called 13 times now.
|