Bug 57211
Summary: | WD SATA 1.0a HDDs problem with SATA LPM | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | DE (risc4all) |
Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
Status: | NEW --- | ||
Severity: | blocking | CC: | szg00000, tj |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.10.16 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | horkage-nolpm.patch, dmesg, horkage-nolpm-v2.patch, horkage-nolpm-v3.patch, libata-disable-LPM-for-some-WD-SATA-I-devices.patch |
Description
DE
2013-04-27 20:34:13 UTC
Last of the many? Hell no! On May 1 we spoke on the phone with our WD contact, and he reported that no firmware will be released for EOL products. He insisted that those disks do not support any PM and have no bug! The sound produced during regular disk activity when the PHY is in slumber is alarmingly similar (to untrained ears) to the sound of a WD HDD suffering from a head crash.

Since we were convinced that the whole JD family (except for WD800JD-##LSA1 and WD800JD-##MSA1, which are SATA-II) is affected, we searched for and bought a second-hand JD, older than ours. This time, though, it is a DELL part (Dell was probably WD's biggest customer back then, as half of the JDs in many retail shops in Greece were -75). Despite the fact that it has the older Marvell PATA2SATA bridge, it supports not only HIPM but also DIPM! After certifying it as error free, we found exactly the same behaviour. Here is some more info:

    ATA device, with non-removable media
        Model Number:       WDC WD1200JD-75GBB0
        Serial Number:      WD-WMAET165****
        Firmware Revision:  02.05D02
    Standards:
        Supported: 6 5 4
        Likely used: 6
    Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  234375000
        LBA48  user addressable sectors:  234375000
        Logical/Physical Sector size:           512 bytes
        device size with M = 1024*1024:      114440 MBytes
        device size with M = 1000*1000:      120000 MBytes (120 GB)
        cache/buffer size  = 8192 KBytes (type=DualPortCache)
    Capabilities:
        LBA, IORDY(can be disabled)
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 128, current value: 128
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
    Commands/features:
        Enabled Supported:
           *    SMART feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    DOWNLOAD_MICROCODE
                Power-Up In Standby feature set
                SET_MAX security extension
                Automatic Acoustic Management feature set
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    Gen1 signaling speed (1.5Gb/s)
           *    Host-initiated interface power management
           *    Device-initiated interface power management
    HW reset results:
        CBLID- above Vih
        Device num = 0 determined by the jumper
    Checksum: correct

Now the fact that a SATA device reports CBLID may be a bug on its own, since SATA has no jumpers... That is a different story we will check later.

We believe the following HDDs need testing BUT THEY MUST NOT BE USED FOR INSTALLATION DISKS:

    WD800JD-##HKA0  WD800JD-##HKA1  WD800JD-##JNA0  WD800JD-##JNC0
    WD1200JD-##FYB0 WD1200JD-##GBB0 WD1200JD-##HBB0 WD1200JD-##HBC0
    WD1600JD-##FYB0 WD1600JD-##GBB0 WD1600JD-##HBB0 WD1600JD-##HBC0
    WD2000JD-##FYB0 WD2000JD-##GBB0 WD2000JD-##HBB0 WD2000JD-##HBC0
    WD2500JD-##FYB0 WD2500JD-##GBB0 WD2500JD-##HBB0
    WD3000JD-*
    WD3200JD-*

    ##=00,75,...

Other JD models may be affected as well. We must inform you that we used kernel 3.8.4 with the https://raw.github.com/fenrus75/powertop/master/patches/linux-3.3.0-ahci-alpm-accounting.patch patch applied, and 2.6.32. For a symbolic cost of 15 dollars or 10 euros plus transportation fees we can send the OEM disk to you for further testing! If anyone has a Dell contact, request docs, f/w.

Since these specific disk families are still not blacklisted, the problem remains on Debian 7.2.0 running vanilla 3.10.16. As a result the disks are not usable on AHCI controllers such as our SB750.

Oops, sorry about missing this one. Jesus, this is horrifying. I'll make a patch for it and propagate it through -stable once the merge window is closed. Thanks a lot for the report.

Created attachment 118321 [details]
horkage-nolpm.patch
Can you please try this patch?
Thanks.
Hello, it was about time! After giving up waiting some time ago, we visited our bug again and now there is a reply! To sum up, WD has underestimated the impact of this bug despite our repeated contacts. We agree that it is a serious bug.

The suggested patch will do more harm than good, because your blacklist is too generic and will cause a big mess! We will give it a try, but mind that the disks will lose partial with this patch, even though partial works fine, as we thoroughly tested it on both disks. Your patch (with our modification) can be accepted as a temporary solution until many people test their disks. After that, only slumber must be disabled. Before the next crappy disk is discovered, disabling of partial only should be implemented as well.

In your patch replace

    + /*
    +  * Drives which spin down on LPM.
    +  * https://bugzilla.kernel.org/show_bug.cgi?id=57211
    +  */
    + { "WD800JD-*", NULL, ATA_HORKAGE_NOLPM },
    + { "WD1200JD-*", NULL, ATA_HORKAGE_NOLPM },
    + { "WD1600JD-*", NULL, ATA_HORKAGE_NOLPM },
    + { "WD2000JD-*", NULL, ATA_HORKAGE_NOLPM },
    + { "WD2500JD-*", NULL, ATA_HORKAGE_NOLPM },
    + { "WD3000JD-*", NULL, ATA_HORKAGE_NOLPM },
    + { "WD3200JD-*", NULL, ATA_HORKAGE_NOLPM },
    +

with

    + /*
    +  * Drives which spin up and down like crazy with LPM slumber.
    +  * https://bugzilla.kernel.org/show_bug.cgi?id=57211
    +  *
    +  * Caution, many BD and JD models are SATA II and should not be
    +  * blacklisted, as their power consumption will be higher.
    +  * Examples:
    +  * WD800BD-**
    +  * WD800JD-**LSA1
    +  * WD800JD-**MSA1
    +  *
    +  * WD BD, KD, SD models of various capacities are probably affected
    +  * and should be tested.
    +  */
    + { "WD400JD-**JBA0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD400JD-**JDA0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD400JD-**JNA0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD400JD-**JNC0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD400JD-**JRA0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD400JD-**JRC0", NULL, ATA_HORKAGE_NOLPM },
    +
    + { "WD800JD-**HKA0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD800JD-**HKA1", NULL, ATA_HORKAGE_NOLPM },
    + { "WD800JD-**JNA0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD800JD-**JNC0", NULL, ATA_HORKAGE_NOLPM },
    +
    + { "WD1200JD-**FYB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD1200JD-**GBB0", NULL, ATA_HORKAGE_NOLPM }, // tested
    + { "WD1200JD-**HBB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD1200JD-**HBC0", NULL, ATA_HORKAGE_NOLPM },
    +
    + { "WD1600JD-**FYB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD1600JD-**GBB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD1600JD-**HBB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD1600JD-**HBC0", NULL, ATA_HORKAGE_NOLPM },
    +
    + { "WD2000JD-**FYB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD2000JD-**GBB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD2000JD-**HBB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD2000JD-**HBC0", NULL, ATA_HORKAGE_NOLPM },
    +
    + { "WD2500JD-**FYB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD2500JD-**GBB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD2500JD-**HBB0", NULL, ATA_HORKAGE_NOLPM },
    + { "WD2500JD-**HBC0", NULL, ATA_HORKAGE_NOLPM }, // tested
    +

We will try hard to find a time slice in our very busy year-end schedule to test the above modified patch tomorrow (only to verify that it is going in the right direction), but no promise here. Once we have tested, we shall give you our info for the Tested-by and Reported-by tags so that anyone can communicate with us.

Hello,

Do we actually know whether some of the drives with the WDXXXXJD prefix have working LPM? Most of those drives are already EOL'd and WD is highly unlikely to fix the issue for these. I'd rather err on the side of larger coverage given the severity of the issue.

Thanks.

Hello again. Happy new year, our custom kernel is being compiled!

Maybe our position was not clear enough. We believe that the current patch is for completely broken devices, which almost certainly exist but have gone unnoticed to this day. Here we need a patch that disables slumber while keeping partial operation enabled. The patch you submitted gives large coverage, but we do not recommend it unless you want an immediate fix in Linux distributions ASAP, prior to any further digging and testing. The comment in our modification of your patch clearly says that WD800JD-**MSA1, for example, works fine with slumber and partial. Your patch would not allow that.

In our previous comment, where we wrote:

    + * WD800BD-**

we should have written the following:

    + * WD400BD-**MRA1

A last sneaky but effective solution would be a smart blacklist. To be specific, if BD or JD or KD are detected and are not SATA II models, then slumber should be blacklisted. Otherwise use our list, not yours, and we will add more later. Not to forget, do you want us to send you a broken disk (actually two, as we have another open bug too)?

Hmm... so there are devices which work properly. The reason why this has gone unnoticed as long as it has is probably that it just hasn't been very popular to use LPM on desktop / server setups. I still don't feel very excited about developing an extensive list of specific model numbers, as such a list tends to be cumbersome to maintain and it's easy to leave some out. It's not like we have access to an authoritative list. So, if BD, JD and KD with 1.5Gbps max speed is a good pattern, I'd be much happier with that.

Bad news, our custom kernel did not fix the issue :( The attached SATA port was set to min power! Is that possible? This family's breakage must be dealt with at a lower hardware level. You must modify the SATA SCR as we described above. We do not like lists either, but most of all we do not like malfunctioning hardware.

Setting lpm_policy to MAX_POWER should have achieved that. I probably made a mistake somewhere. Can you please post the output of dmesg?

Thanks.

An additional family not known to us until today was shipped in Macs. Those were WD800JD-40GBB2 and WD1600JD-40GBB2 from 2004. We checked that all source files received the required modifications. The requested dmesg attachment is below:

Created attachment 120951 [details]
dmesg
Happy new year. Hmm... the horkage didn't kick in at all. I wonder why that is. I'll prep a new patch w/ more debug info.

Created attachment 121881 [details]
horkage-nolpm-v2.patch
Can you please test the attached patch and post the dmesg output?
Thanks.
The drive still spins down, parking its heads, then spins up, and so on. The WD drive spins up and down like crazy (30+ times for parted -l) until it has processed pending commands. It does not just spin down, as you write. min_power is still available for this port! Have you actually blocked min_power (or min_power and medium_power) for this port from sysfs? Here is what dmesg prints now for libata:

    [    1.347306] libata version 3.00 loaded.
    [    1.632046] ahci 0000:00:11.0: version 3.0
    [    1.632422] ahci 0000:00:11.0: AHCI 0001.0100 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
    [    1.632427] ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ccc sxs
    [    1.636059] scsi0 : ahci
    [    1.637722] scsi1 : ahci
    [    1.639370] scsi2 : ahci
    [    1.640119] scsi3 : ahci
    [    1.642240] scsi4 : ahci
    [    1.642518] scsi5 : ahci
    [    1.642591] ata1: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7ffd00 irq 22
    [    1.642595] ata2: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7ffd80 irq 22
    [    1.642599] ata3: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7ffe00 irq 22
    [    1.642603] ata4: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7ffe80 irq 22
    [    1.642607] ata5: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7fff00 irq 22
    [    1.642611] ata6: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7fff80 irq 22
    [    1.643177] pata_atiixp 0000:00:14.1: setting latency timer to 64
    [    1.646285] scsi6 : pata_atiixp
    [    1.648098] scsi7 : pata_atiixp
    [    1.648597] ata7: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xff00 irq 14
    [    1.648600] ata8: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xff08 irq 15
    [    1.972038] ata3: SATA link down (SStatus 0 SControl 300)
    [    1.972081] ata4: SATA link down (SStatus 0 SControl 300)
    [    1.972115] ata5: SATA link down (SStatus 0 SControl 300)
    [    1.972147] ata6: SATA link down (SStatus 0 SControl 300)
    [    2.132047] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
    [    2.132078] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [    2.132514] ata1.00: XXX WD_BROKEN_LPM is 0, SATA_CAPABILITY=0x706
    [    2.132640] ata1.00: ATA-7: WDC WD400BD-75MRA1, 10.01E01, max UDMA/133
    [    2.132643] ata1.00: 78125000 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
    [    2.133110] ata1.00: XXX WD_BROKEN_LPM is 0, SATA_CAPABILITY=0x706
    [    2.133247] ata1.00: configured for UDMA/133
    [    2.133397] scsi 0:0:0:0: Direct-Access ATA WDC WD400BD-75MR 10.0 PQ: 0 ANSI: 5
    [    2.134911] ata2.00: XXX WD_BROKEN_LPM is 0, SATA_CAPABILITY=0x202
    [    2.136208] ata2.00: ATA-6: WDC WD2500JD-00HBC0, 08.02D08, max UDMA/133
    [    2.136215] ata2.00: 488397168 sectors, multi 0: LBA48
    [    2.137203] sd 0:0:0:0: [sda] 78125000 512-byte logical blocks: (40.0 GB/37.2 GiB)
    [    2.137269] sd 0:0:0:0: [sda] Write Protect is off
    [    2.137273] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
    [    2.137302] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [    2.140314] ata2.00: XXX WD_BROKEN_LPM is 0, SATA_CAPABILITY=0x202
    [    2.141576] ata2.00: configured for UDMA/133
    [    2.141715] scsi 1:0:0:0: Direct-Access ATA WDC WD2500JD-00H 08.0 PQ: 0 ANSI: 5
    [    2.141975] sd 1:0:0:0: [sdb] 488397168 512-byte logical blocks: (250 GB/232 GiB)
    [    2.142039] sd 1:0:0:0: [sdb] Write Protect is off
    [    2.142043] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
    [    2.142090] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [    2.147077] sd 0:0:0:0: Attached scsi generic sg0 type 0
    [    2.147154] sd 1:0:0:0: Attached scsi generic sg1 type 0
    [    2.163782] sdb: sdb1
    [    2.164128] sd 1:0:0:0: [sdb] Attached SCSI disk
    [    2.179151] sda: sda1 sda2 sda3
    [    2.179577] sd 0:0:0:0: [sda] Attached SCSI disk

Apparently the issue will not be fixed unless SATA SCR access takes place when a SATA I JD is detected. There is no need to reinvent the wheel! We have to use what the SATA specification allows us to do. That is, go to the SCR[2] register (the SControl register) and write 0010b to the IPM field to disable slumber. It is unknown whether the SATA controller will respect that setting, but that can be tested on all our hosts.
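For reference, the IPM field discussed here occupies bits 11:8 of SControl in the SATA specification, and the "SControl 300" values in the log are printed in hex. A minimal sketch of decoding that field follows; the helper names are made up for illustration and are not libata functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of libata. */

/* Extract the IPM field (bits 11:8) from an SControl value. */
static unsigned int scontrol_ipm(uint32_t scontrol)
{
    return (scontrol >> 8) & 0xf;
}

/* IPM bit 0 set: transitions to the Partial state are disabled. */
static bool partial_disabled(uint32_t scontrol)
{
    return (scontrol_ipm(scontrol) & 0x1) != 0;
}

/* IPM bit 1 set: transitions to the Slumber state are disabled. */
static bool slumber_disabled(uint32_t scontrol)
{
    return (scontrol_ipm(scontrol) & 0x2) != 0;
}
```

For the logged value 0x300 both helpers return true, meaning Partial and Slumber were both disabled when the link came up. A value of 0x200 would correspond to the 0010b setting proposed above (Slumber disabled, Partial still allowed).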
Created attachment 121921 [details]
horkage-nolpm-v3.patch
We were missing WDC prefix from model names so the horkage wasn't activated. With all due respect, please stop second guessing the implementation method. You can't write to a SCR register once and expect it to stick. They're configured each time libata tries to config lpm state. IOW, the SCRs aren't the master copy. They're slave to libata internal state, so you have to change the libata internal state to actually inhibit lpm transitions.
Thanks.
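The effect of the missing prefix can be reproduced outside the kernel. The sketch below uses POSIX fnmatch() as a stand-in for libata's own glob matching (an assumption: the kernel uses a separate matcher), with the model string taken from the dmesg output; `model_matches` is a hypothetical helper.

```c
#include <fnmatch.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true when the whole model string matches the
 * glob pattern. fnmatch() returns 0 on a match.
 */
static bool model_matches(const char *pattern, const char *model)
{
    return fnmatch(pattern, model, 0) == 0;
}
```

With the model string "WDC WD2500JD-00HBC0", the pattern "WD2500JD-*" does not match because the pattern must cover the string from its first character, while "WDC WD2500JD-*" does.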
If you're interested in learning how it actually works, please take a look at libata-eh.c::ata_eh_set_lpm(), libahci.c::ahci_set_lpm() and libata-core.c::sata_link_scr_lpm().

Drive now works. Yesterday we tested proper functionality, including ACPI S3 and S4 transitions. We identified the issue correctly, as the bug title proves. Before the v3 patch, the drive enters a series of sleep and wake-up events that cause crazy spin-up and spin-down of the drive in slumber. So the patch comments must change (as we started writing in comment 5). Final thoughts are written below to avoid confusion.

patch line 2221 must become:

    /* some WD SATA-I drives have issues with LPM, turn on NOLPM for them */

patch lines 4219 to 4222 must become:

    * Some WD SATA-I drives spin up and down like crazy with LPM slumber.
    * We don't have full list of all affected devices so enable the horkage
    * if the device matches one of the known prefixes and is SATA-I. As a
    * side effect LPM partial is lost too.
    *

If you agree with our modified comments, it is acceptable (but not perfect) to us. At least the drive will continue to work as it did prior to the introduction of LPM. Here is what you need.

    Reported-and-tested-by: Nikos Barkas levelwol@gmail.com
    Reported-and-tested-by: Ioannis Barkas risc4all@yahoo.com

We will definitely read the src sometime, but you leave no doubt that partial will be lost forever... Damn, not what we were waiting for! Expect more suckers for this list, as some ODDs have completely broken LPM. Good luck, go back in time and fix older kernels.

Created attachment 122281 [details]
libata-disable-LPM-for-some-WD-SATA-I-devices.patch
Applied the attached patch to libata/for-3.14 w/ stable cc'd.
Thank you very much.
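The rule stated in the proposed patch comments (known prefix and SATA-I) can be sketched in user space as follows. This is an illustration, not the applied patch. It assumes the SATA_CAPABILITY value printed by the v2 debug patch is IDENTIFY word 76, where bit 1 advertises Gen1 (1.5 Gb/s), bit 2 Gen2 (3.0 Gb/s) and bit 3 Gen3. The helper names are hypothetical, the prefix list is the one from the first patch with "WDC " prepended, and fnmatch() stands in for the kernel's glob matching.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Signaling speed bits of IDENTIFY word 76 (assumed layout, see lead-in). */
#define SATA_CAP_GEN1 (1u << 1)
#define SATA_CAP_GEN2 (1u << 2)
#define SATA_CAP_GEN3 (1u << 3)

/* True when the device advertises Gen1 (1.5 Gb/s) and nothing faster. */
static bool is_sata1_only(uint16_t sata_cap)
{
    unsigned int speeds =
        sata_cap & (SATA_CAP_GEN1 | SATA_CAP_GEN2 | SATA_CAP_GEN3);

    return speeds == SATA_CAP_GEN1;
}

/* Hypothetical check: known WD JD prefix and SATA-I only. */
static bool needs_nolpm(const char *model, uint16_t sata_cap)
{
    static const char *const prefixes[] = {
        "WDC WD800JD-*",  "WDC WD1200JD-*", "WDC WD1600JD-*",
        "WDC WD2000JD-*", "WDC WD2500JD-*", "WDC WD3000JD-*",
        "WDC WD3200JD-*",
    };

    if (!is_sata1_only(sata_cap))
        return false;

    for (size_t i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
        if (fnmatch(prefixes[i], model, 0) == 0)
            return true;
    }
    return false;
}
```

With the values from the dmesg output, the WD2500JD-00HBC0 (0x202) is flagged, while the WD400BD-75MRA1 (0x706) is not, both because it advertises Gen2 and because it is not a JD model.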
Beautiful! Who will close the bug?

Debian Wheezy 7.5.0 has this patch in its 3.2.57-3 kernel! Unfortunately the latest Squeeze (6.0.9) has an unpatched 2.6.32. As the bug was discovered on Squeeze with 2.6.32, this patch must be backported to 2.6.32 for the sake of completeness, whether it gets into Squeeze or not. The latest 2.6.32.62 does not have this patch; is there a good reason it was omitted?

Our "infamous" disk now has a Debian 7.5.0 32-bit install on an ICH5 system and has revealed a problem in the logic of the patch:

    [    3.052105] ata3.00: LPM support broken, forcing max_power
    [    3.053417] ata3.00: ATA-6: WDC WD2500JD-00HBC0, 08.02D08, max UDMA/133
    [    3.053427] ata3.00: 488397168 sectors, multi 16: LBA48
    [    3.068478] ata3.00: LPM support broken, forcing max_power

This host is handled by ata_piix, which does not support LPM, so this message should not be present, and the workaround activation should be omitted on non-AHCI hosts for now. Reading the host PCI class code could do the trick, as ahci is the only driver currently using LPM.
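The class-code idea could look roughly like this. The sketch assumes the standard PCI encoding, in which an AHCI controller reports base class 0x01 (mass storage), subclass 0x06 (SATA) and programming interface 0x01 (AHCI). The macro and helper names are hypothetical, and whether libata should gate the horkage this way is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/* 24-bit class code: base class 0x01, subclass 0x06, prog-if 0x01. */
#define CLASS_CODE_SATA_AHCI 0x010601u

/* Hypothetical helper: true when the 24-bit PCI class code denotes AHCI. */
static bool host_is_ahci(uint32_t class_code)
{
    return (class_code & 0xffffffu) == CLASS_CODE_SATA_AHCI;
}
```

A controller running in IDE mode reports subclass 0x01 instead, so the helper returns false for it and the workaround message would be skipped.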