Bug 57211

Summary: WD SATA 1.0a HDDs problem with SATA LPM
Product: IO/Storage Reporter: DE (risc4all)
Component: Serial ATA Assignee: Jeff Garzik (jgarzik)
Status: NEW ---    
Severity: blocking CC: szg00000, tj
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.10.16 Subsystem:
Regression: No Bisected commit-id:
Attachments: horkage-nolpm.patch
dmesg
horkage-nolpm-v2.patch
horkage-nolpm-v3.patch
libata-disable-LPM-for-some-WD-SATA-I-devices.patch

Description DE 2013-04-27 20:34:13 UTC
Hello to the libata team again.

Our first SATA Gen1 WD HDD to survive its warranty misbehaves on AHCI controllers (tested on SB750, SB900, JMB365) once the link enters slumber mode.
To be more precise, Nick's WD2500JD-00HBC0 is almost unusable when the PHY transitions to slumber: the disk enters standby and returns there almost instantly, even if there is disk activity. No such problem occurs in partial mode, where it stays active/idle.
For a simple command like hdparm -I or smartctl -a the motor will start and stop from 2 up to 29 times, depending on the command, before it finally spins down!!! At first we thought that the disk had died, as we had forgotten that min_power was activated on the test system.
Since it is our only new Gen1 HDD, it was used intensively for only 9 months before it was "decommissioned" (or should we say before Nick filled it up and got a bigger one) and kept for compatibility testing only.
Obviously, if the disk is used as an installation disk it will probably be dead within a month from the excessive start-stop cycles... In partial there is a noticeable performance reduction. This is not one of the very first Gen1 disks, although it is early enough to carry a PATA2SATA 88i8030-TBC1 bridge. WD SATA drives with the older 88i8030-TBC bridge (with bridge limits applied) are probably affected too. This one was manufactured in 2005, so every pre-2006 WD SATA 1.0a drive is a suspect. Probable "offenders" are: WD800JD-75HKA1 (we had one before it decided to commit digital suicide) and WD2500JD-00HBB0 (we never had this one, but it is the previous MDL with older f/w [or CC as WD calls it]). On the other hand, our WD306GD-00FLC0 stays active in slumber and has no problems, even though it also carries a Marvell PATA2SATA bridge.

In document T13 / e04149-r0 "SATA Device Initiated Power Management (DIPM)" a guy from WD reports that "SATA Interface Power Management is independent of drive power management" BUT further down says that it "Can be tied to drive PM: Standby, Sleep, Idle".
We strongly believe that on our JD, slumber on the PHY is tied to disk standby. We believe a new ata horkage is needed, ATA_HORKAGE_NOSLUMBER, which will block the blacklisted disks from entering slumber mode by writing 0010b to the IPM field of the SATA SControl register (SCR). The disk is ATA6 compliant and word 222 is empty, since that word was introduced in ATA7. As a result the disk is not detected as SATA 1.0a compliant and Linux only knows it is a 1.5 Gbps SATA device.
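
The SControl write we have in mind can be sketched as below. This is only a standalone illustration, not libata code: the helper name scontrol_disable_slumber is made up, and it assumes that the IPM field occupies bits 11:8 of SControl and that 0010b means "transitions to Slumber disabled", which is how we read the Serial ATA specification.

```c
#include <stdint.h>

/* Assumption: the IPM field is bits 11:8 of the SControl register. */
#define SCONTROL_IPM_SHIFT       8
#define SCONTROL_IPM_MASK        (0xFu << SCONTROL_IPM_SHIFT)
#define SCONTROL_IPM_NO_SLUMBER  0x2u	/* 0010b: Slumber disabled, Partial allowed */

/* Hypothetical helper: return scontrol with only the IPM field rewritten. */
uint32_t scontrol_disable_slumber(uint32_t scontrol)
{
	scontrol &= ~SCONTROL_IPM_MASK;
	scontrol |= SCONTROL_IPM_NO_SLUMBER << SCONTROL_IPM_SHIFT;
	return scontrol;
}
```

For example, an SControl value of 0x300 (Partial and Slumber both disabled) would become 0x200, which leaves Partial available. The other SControl fields are not touched.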

It must be noted that although we asked WD for firmware and docs (TRM) for this one (and for the many, many other WD disks we have), we had to give the S/N, our name and our home telephone number to get only the TRM. Unfortunately it did not make us any wiser. The TRM does not even contain the word slumber! Our contact at WD believes that this disk does not support HIPM or DIPM, but the IDENTIFY data, together with the fact that min_power is activated with this one, tell another story. Our contact also stated that they do not support Linux. That is despite the fact that many WD devices with an onboard SATA HDD are Linux powered, as their source code proves[1]!... We informed them that this bug is OS independent and asked to be put in contact with the firmware team. WD is officially informed about this bug and we are hoping for a firmware update, even though WD says that the drive is EOL. Besides that issue, the disk is in great condition. Actually brand new, if you exclude the POH counter. This WD disk series (JD) was produced in vast quantities, so how did nobody run into this bug?

This is the IDENTIFY:

/dev/sdb:

ATA device, with non-removable media
	Model Number:       WDC WD2500JD-00HBC0                     
	Serial Number:      WD-WCAL7429****
	Firmware Revision:  08.02D08
Standards:
	Supported: 6 5 4 
	Likely used: 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  488397168
	Logical/Physical Sector size:           512 bytes
	device size with M = 1024*1024:      238475 MBytes
	device size with M = 1000*1000:      250059 MBytes (250 GB)
	cache/buffer size  = 8192 KBytes (type=DualPortCache)
Capabilities:
	LBA, IORDY(can be disabled)
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Recommended acoustic management value: 128, current value: 254
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	    	Power-Up In Standby feature set
	   *	SET_FEATURES required to spinup after power up
	    	SET_MAX security extension
	   *	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Host-initiated interface power management
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Long Sector Access (AC1)
	   *	SCT LBA Segment Access (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
	not	supported: enhanced erase
Checksum: correct
 
[1] http://download.wdc.com/gpl/WD_Elements_Play_Gen2_GPL_image_release.v1.01.02.zip?v=9186
Comment 1 DE 2013-05-03 19:30:10 UTC
Last of the many? Hell no!

On May 1 we spoke on the phone with our WD contact and he reported that no firmware will be released for EOL products. He insisted that those disks do not support any PM and have no bug!
The sound produced during regular disk activity when the PHY is in slumber is alarmingly similar (to untrained ears) to the sound of a WD HDD suffering from a head crash.
Since we were convinced that the whole JD family (except for WD800JD-##LSA1 and WD800JD-##MSA1, which are SATA-II) is affected, we searched for and bought a second-hand JD, older than ours.
This time, though, it is a Dell part (Dell was probably WD's biggest customer back then, as half of the JDs in many retail shops in Greece were -75). Despite the fact that it has the older Marvell PATA2SATA bridge, it supports not only HIPM but also DIPM! After certifying it as error free, we found exactly the same behaviour.

Here is some more info:

ATA device, with non-removable media
	Model Number:       WDC WD1200JD-75GBB0               
	Serial Number:      WD-WMAET165****
	Firmware Revision:  02.05D02
Standards:
	Supported: 6 5 4 
	Likely used: 6
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  234375000
	LBA48  user addressable sectors:  234375000
	Logical/Physical Sector size:           512 bytes
	device size with M = 1024*1024:      114440 MBytes
	device size with M = 1000*1000:      120000 MBytes (120 GB)
	cache/buffer size  = 8192 KBytes (type=DualPortCache)
Capabilities:
	LBA, IORDY(can be disabled)
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Recommended acoustic management value: 128, current value: 128
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	    	Power-Up In Standby feature set
	    	SET_MAX security extension
	    	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Host-initiated interface power management
	   *	Device-initiated interface power management
HW reset results:
	CBLID- above Vih
	Device num = 0 determined by the jumper
Checksum: correct

Now the fact that a SATA device reports CBLID may be a bug of its own, since SATA has no jumpers... That is a different story which we will check later.


We believe the following HDDs need testing BUT THEY MUST NOT BE USED AS INSTALLATION DISKS:

WD800JD-##HKA0
WD800JD-##HKA1
WD800JD-##JNA0
WD800JD-##JNC0


WD1200JD-##FYB0
WD1200JD-##GBB0
WD1200JD-##HBB0
WD1200JD-##HBC0

WD1600JD-##FYB0
WD1600JD-##GBB0
WD1600JD-##HBB0
WD1600JD-##HBC0

WD2000JD-##FYB0
WD2000JD-##GBB0
WD2000JD-##HBB0
WD2000JD-##HBC0

WD2500JD-##FYB0
WD2500JD-##GBB0
WD2500JD-##HBB0

WD3000JD-*
WD3200JD-*

##=00,75,...
Other JD models may be affected as well.
 
 
We must inform you that we used kernel 3.8.4 with the https://raw.github.com/fenrus75/powertop/master/patches/linux-3.3.0-ahci-alpm-accounting.patch patch applied, as well as 2.6.32.

For a symbolic cost of 15 dollars or 10 euros plus transportation fees we can send the OEM disk to you for further testing! If anyone has a Dell contact, request docs, f/w.
Comment 2 DE 2013-11-05 21:33:05 UTC
Since these specific disk families are still not blacklisted the problem remains on Debian 7.2.0 running vanilla 3.10.16. As a result the disks are not usable on AHCI controllers such as our SB750.
Comment 3 Tejun Heo 2013-11-20 17:22:55 UTC
Oops, sorry about missing this one. Jesus, this is horrifying. I'll make a patch for it and propagate it through -stable once the merge window is closed.

Thanks a lot for the report.
Comment 4 Tejun Heo 2013-12-13 23:29:20 UTC
Created attachment 118321 [details]
horkage-nolpm.patch

Can you please try this patch?

Thanks.
Comment 5 DE 2013-12-29 22:39:48 UTC
Hello, it was about time! After giving up waiting some time ago, we visited our bug again and now there is a reply! To sum up, WD has underestimated the impact of this bug despite our repeated contacts. We agree that it is a serious bug.

The suggested patch will do more harm than good. That is because your blacklist is too generic and will cause a big mess! We will give it a try, but mind that the disks will lose partial with this patch, even though partial works fine: we tested partial thoroughly on both disks. Your patch (with our modification) can be accepted as a temporary solution until many people have tested their disks. After that, only slumber must be disabled. Rather than waiting for the next crappy disk to be discovered, a way to disable partial only should be implemented as well.

In your patch replace 

+	/*
+	 * Drives which spin down on LPM.
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=57211
+	 */
+	{ "WD800JD-*",			NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD1200JD-*",			NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD1600JD-*",			NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD2000JD-*",			NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD2500JD-*",			NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD3000JD-*",			NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD3200JD-*",			NULL,	ATA_HORKAGE_NOLPM },
+

with

+	/*
+	 * Drives which spin up and down like crazy with LPM slumber.
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=57211
+	 *
+	 * Caution, many BD and JD models are SATA II and should not be
+	 * blacklisted, as their power consumption will be higher.
+	 * Examples:
+	 * WD800BD-**
+	 * WD800JD-**LSA1
+	 * WD800JD-**MSA1
+	 *
+	 * WD BD, KD, SD models of various capacities are probably affected
+	 * and should be tested.
+	 */
+	{ "WD400JD-**JBA0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD400JD-**JDA0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD400JD-**JNA0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD400JD-**JNC0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD400JD-**JRA0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD400JD-**JRC0",		NULL,	ATA_HORKAGE_NOLPM },
+
+	{ "WD800JD-**HKA0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD800JD-**HKA1",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD800JD-**JNA0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD800JD-**JNC0",		NULL,	ATA_HORKAGE_NOLPM },
+
+	{ "WD1200JD-**FYB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD1200JD-**GBB0",		NULL,	ATA_HORKAGE_NOLPM }, // tested
+	{ "WD1200JD-**HBB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD1200JD-**HBC0",		NULL,	ATA_HORKAGE_NOLPM },
+
+	{ "WD1600JD-**FYB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD1600JD-**GBB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD1600JD-**HBB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD1600JD-**HBC0",		NULL,	ATA_HORKAGE_NOLPM },
+
+	{ "WD2000JD-**FYB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD2000JD-**GBB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD2000JD-**HBB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD2000JD-**HBC0",		NULL,	ATA_HORKAGE_NOLPM },
+
+	{ "WD2500JD-**FYB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD2500JD-**GBB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD2500JD-**HBB0",		NULL,	ATA_HORKAGE_NOLPM },
+	{ "WD2500JD-**HBC0",		NULL,	ATA_HORKAGE_NOLPM }, // tested
+

We will try hard to find a time slice in our very busy year-end schedule to test the above modified patch tomorrow (only to verify that it is going in the right direction), but no promises here. Once we have tested it, we shall give you our details for a Reported-and-tested-by tag so that anyone can contact us.
Comment 6 Tejun Heo 2013-12-31 12:06:21 UTC
Hello,

Do we actually know whether some of the drives with the WDXXXXJD prefix have working LPM? Most of those drives are already EOL'd and WD is highly unlikely to fix the issue for these. I'd rather err on the side of larger coverage given the severity of the issue.

Thanks.
Comment 7 DE 2014-01-04 15:51:15 UTC
Hello again. Happy new year, our custom kernel is being compiled! Maybe our position was not clear enough. We believe that the current patch is suited to completely broken devices, which almost certainly exist but have gone unnoticed to this day. Here we need a patch that disables slumber while keeping partial operation enabled. The patch you submitted gives large coverage, but we do not recommend it unless you want an immediate fix in Linux distributions ASAP, prior to any further digging and testing. The comment in our modification of your patch clearly says that WD800JD-**MSA1, for example, works fine with slumber and partial. Your patch would not allow that.

In our previous comment where we wrote:
+	 * WD800BD-**
we should have written the following:
+	 * WD400BD-**MRA1

A last sneaky but effective solution would be a smart blacklist. To be specific, if a BD, JD or KD model is detected and it is not a SATA II model, then slumber should be blacklisted. Otherwise use our list, not yours, and we will add more entries later.

Not to forget, do you want us to send you a broken disk (actually two, as we have another open bug too)?
Comment 8 Tejun Heo 2014-01-04 16:03:03 UTC
Hmm... so there are devices which work properly. The reason why this has gone unnoticed as long as it has is probably that it just hasn't been very popular to use LPM on desktop / server setups. I still don't feel very excited about developing an extensive list of specific model numbers, as such a list tends to be cumbersome to maintain and it's easy to leave some out. It's not like we have access to an authoritative list. So, if BD, JD and KD with 1.5Gbps max speed is a good pattern, I'd be much happier with that.
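
A rough sketch of what such a pattern test could look like, outside of libata. Everything here is an assumption for illustration: the function names are made up, the speed test assumes that IDENTIFY word 76 (Serial ATA capabilities) reports Gen1/Gen2/Gen3 support in bits 1/2/3, and the model test is a hand-rolled string check on the "WDC WD<digits>{B,J,K}D-" shape rather than libata's blacklist matching.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: IDENTIFY word 76, bit 1 = Gen1, bit 2 = Gen2, bit 3 = Gen3. */
#define SATA_CAP_SPEED_MASK 0x000Eu
#define SATA_CAP_GEN1_ONLY  0x0002u

/* Hypothetical: true if the drive advertises 1.5 Gbps and nothing faster. */
bool sata_cap_is_gen1_only(uint16_t sata_cap)
{
	return (sata_cap & SATA_CAP_SPEED_MASK) == SATA_CAP_GEN1_ONLY;
}

/* Hypothetical: true for models shaped like "WDC WD<digits>{B,J,K}D-...". */
bool wd_model_is_suspect(const char *model)
{
	const char *p = model;

	if (strncmp(p, "WDC WD", 6) != 0)
		return false;
	p += 6;
	if (!isdigit((unsigned char)*p))
		return false;
	while (isdigit((unsigned char)*p))
		p++;
	/* family code BD, JD or KD, followed by '-' */
	if ((p[0] != 'B' && p[0] != 'J' && p[0] != 'K') || p[1] != 'D')
		return false;
	return p[2] == '-';
}

/* Hypothetical: flag only suspect families that are limited to 1.5 Gbps. */
bool wd_needs_nolpm(const char *model, uint16_t sata_cap)
{
	return wd_model_is_suspect(model) && sata_cap_is_gen1_only(sata_cap);
}
```

With this shape, a 1.5 Gbps-only "WDC WD2500JD-00HBC0" would be flagged, while a BD or JD model that also advertises 3.0 Gbps would keep LPM.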
Comment 9 DE 2014-01-04 23:15:41 UTC
Bad news, our custom kernel did not fix the issue :( The attached SATA port was set to min_power! Is that possible? This family's breakage must be dealt with at a lower hardware level. You must modify the SATA SCR as we described above.

We do not like lists either, but most of all we do not like malfunctioning hardware.
Comment 10 Tejun Heo 2014-01-04 23:20:07 UTC
Setting lpm_policy to MAX_POWER should have achieved that. I probably made a mistake somewhere. Can you please post the output of dmesg? Thanks.
Comment 11 DE 2014-01-05 00:09:23 UTC
An additional family not known to us until today was shipped in Macs. Those were WD800JD-40GBB2 and WD1600JD-40GBB2 from 2004. 

We checked that all source files received the required modifications.

Requested dmesg attachment is below:
Comment 12 DE 2014-01-05 00:10:18 UTC
Created attachment 120951 [details]
dmesg
Comment 13 Tejun Heo 2014-01-12 13:31:15 UTC
Happy new year. Hmm... the horkage didn't kick in at all. I wonder why that is. I'll prep a new patch w/ more debug info.
Comment 14 Tejun Heo 2014-01-13 15:47:32 UTC
Created attachment 121881 [details]
horkage-nolpm-v2.patch

Can you please test the attached patch and post the dmesg output?

Thanks.
Comment 15 DE 2014-01-13 21:53:13 UTC
The drive still spins down, parking its heads, then spins up, and so on. The WD drive spins up and down like crazy (30+ times for parted -l) until it has processed the pending commands. It does not simply spin down, as you write. min_power is still available for this port! Have you actually blocked min_power (or min_power and medium_power) for this port in sysfs?

Here is what dmesg prints now for libata:
[    1.347306] libata version 3.00 loaded.
[    1.632046] ahci 0000:00:11.0: version 3.0
[    1.632422] ahci 0000:00:11.0: AHCI 0001.0100 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
[    1.632427] ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ccc sxs 
[    1.636059] scsi0 : ahci
[    1.637722] scsi1 : ahci
[    1.639370] scsi2 : ahci
[    1.640119] scsi3 : ahci
[    1.642240] scsi4 : ahci
[    1.642518] scsi5 : ahci
[    1.642591] ata1: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7ffd00 irq 22
[    1.642595] ata2: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7ffd80 irq 22
[    1.642599] ata3: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7ffe00 irq 22
[    1.642603] ata4: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7ffe80 irq 22
[    1.642607] ata5: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7fff00 irq 22
[    1.642611] ata6: SATA max UDMA/133 abar m1024@0xfe7ffc00 port 0xfe7fff80 irq 22
[    1.643177] pata_atiixp 0000:00:14.1: setting latency timer to 64
[    1.646285] scsi6 : pata_atiixp
[    1.648098] scsi7 : pata_atiixp
[    1.648597] ata7: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xff00 irq 14
[    1.648600] ata8: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xff08 irq 15
[    1.972038] ata3: SATA link down (SStatus 0 SControl 300)
[    1.972081] ata4: SATA link down (SStatus 0 SControl 300)
[    1.972115] ata5: SATA link down (SStatus 0 SControl 300)
[    1.972147] ata6: SATA link down (SStatus 0 SControl 300)
[    2.132047] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    2.132078] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    2.132514] ata1.00: XXX WD_BROKEN_LPM is 0, SATA_CAPABILITY=0x706
[    2.132640] ata1.00: ATA-7: WDC WD400BD-75MRA1, 10.01E01, max UDMA/133
[    2.132643] ata1.00: 78125000 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[    2.133110] ata1.00: XXX WD_BROKEN_LPM is 0, SATA_CAPABILITY=0x706
[    2.133247] ata1.00: configured for UDMA/133
[    2.133397] scsi 0:0:0:0: Direct-Access     ATA      WDC WD400BD-75MR 10.0 PQ: 0 ANSI: 5
[    2.134911] ata2.00: XXX WD_BROKEN_LPM is 0, SATA_CAPABILITY=0x202
[    2.136208] ata2.00: ATA-6: WDC WD2500JD-00HBC0, 08.02D08, max UDMA/133
[    2.136215] ata2.00: 488397168 sectors, multi 0: LBA48 
[    2.137203] sd 0:0:0:0: [sda] 78125000 512-byte logical blocks: (40.0 GB/37.2 GiB)
[    2.137269] sd 0:0:0:0: [sda] Write Protect is off
[    2.137273] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    2.137302] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.140314] ata2.00: XXX WD_BROKEN_LPM is 0, SATA_CAPABILITY=0x202
[    2.141576] ata2.00: configured for UDMA/133
[    2.141715] scsi 1:0:0:0: Direct-Access     ATA      WDC WD2500JD-00H 08.0 PQ: 0 ANSI: 5
[    2.141975] sd 1:0:0:0: [sdb] 488397168 512-byte logical blocks: (250 GB/232 GiB)
[    2.142039] sd 1:0:0:0: [sdb] Write Protect is off
[    2.142043] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    2.142090] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.147077] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    2.147154] sd 1:0:0:0: Attached scsi generic sg1 type 0
[    2.163782]  sdb: sdb1
[    2.164128] sd 1:0:0:0: [sdb] Attached SCSI disk
[    2.179151]  sda: sda1 sda2 sda3
[    2.179577] sd 0:0:0:0: [sda] Attached SCSI disk

Apparently the issue will not be fixed unless SATA SCR access takes place when a SATA I JD is detected. There is no need to reinvent the wheel! We have to use what the SATA specification allows us to do. That is, go to the SCR[2] register (the SControl register) and write 0010b to the IPM field to disable slumber. It is unknown whether the SATA controller will respect that value, but that can be tested on all our hosts.
Comment 16 Tejun Heo 2014-01-13 22:09:12 UTC
Created attachment 121921 [details]
horkage-nolpm-v3.patch

We were missing WDC prefix from model names so the horkage wasn't activated. With all due respect, please stop second guessing the implementation method. You can't write to a SCR register once and expect it to stick. They're configured each time libata tries to config lpm state. IOW, the SCRs aren't the master copy. They're slave to libata internal state, so you have to change the libata internal state to actually inhibit lpm transitions.
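
To illustrate the prefix problem, here is a minimal sketch of trailing-wildcard matching. The function model_matches is hypothetical and only stands in for the real blacklist matching in libata, which may differ in detail.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical matcher: a pattern ending in '*' matches any name that
 * starts with the text before the '*'; otherwise the strings must be equal.
 */
bool model_matches(const char *pattern, const char *name)
{
	size_t len = strlen(pattern);

	if (len > 0 && pattern[len - 1] == '*')
		return strncmp(pattern, name, len - 1) == 0;
	return strcmp(pattern, name) == 0;
}
```

Under this model, "WD2500JD-*" does not match the IDENTIFY string "WDC WD2500JD-00HBC0" because the name starts with "WDC ", while "WDC WD2500JD-*" does.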

Thanks.
Comment 17 Tejun Heo 2014-01-13 22:13:18 UTC
If you're interested in learning how it actually works, please take a look at libata-eh.c::ata_eh_set_lpm(), libahci.c::ahci_set_lpm() and libata-core.c::sata_link_scr_lpm().
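
As a very simplified model of the above (not the actual kernel code; all type and function names below are made up, and the IPM encodings assume bits 11:8 of SControl with 0011b = no Partial/Slumber and 0010b = no Slumber), the point is that the IPM bits are recomputed from the stored policy every time LPM is configured:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the LPM policies. */
enum lpm_policy { LPM_MAX_POWER, LPM_MED_POWER, LPM_MIN_POWER };

/* Simplified device state: the requested policy plus a "broken LPM" quirk. */
struct dev_state {
	enum lpm_policy target;
	bool nolpm;
};

/* The quirk overrides whatever policy was requested. */
enum lpm_policy effective_policy(const struct dev_state *dev)
{
	return dev->nolpm ? LPM_MAX_POWER : dev->target;
}

/* Recompute the IPM field (bits 11:8) from the effective policy. */
uint32_t apply_policy(const struct dev_state *dev, uint32_t scontrol)
{
	scontrol &= ~(0xFu << 8);
	switch (effective_policy(dev)) {
	case LPM_MAX_POWER:
		scontrol |= 0x3u << 8;	/* no Partial, no Slumber */
		break;
	case LPM_MED_POWER:
		scontrol |= 0x2u << 8;	/* Partial allowed, no Slumber */
		break;
	case LPM_MIN_POWER:
		break;			/* both allowed */
	}
	return scontrol;
}
```

In this model a one-off write of 0x200 to SControl is lost the next time a min_power policy is applied, whereas setting the quirk in the internal state yields 0x300 on every reconfiguration.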
Comment 18 DE 2014-01-16 10:30:32 UTC
Drive now works. Yesterday we tested proper functionality including ACPI S3 and S4 transition. We identified the issue correctly as the bug title proves.

Without the v3 patch, the drive goes through a series of sleep and wake-up events in slumber that cause it to spin up and down like crazy. So the patch comments must change (as we started writing in comment 5). Our final thoughts are written below to avoid confusion.

patch line 2221 must become:
	/* some WD SATA-I drives have issues with LPM, turn on NOLPM for them */

patch lines 4219 to 4222 must become:
	 * Some WD SATA-I drives spin up and down like crazy with LPM slumber.
	 * We don't have full list of all affected devices so enable the horkage
	 * if the device matches one of the known prefixes and is SATA-I. As a
	 * side effect LPM partial is lost too.
	 *

If you agree with our modified comments, the patch is acceptable (but not perfect) to us. At least the drive will continue to work as it did prior to the introduction of LPM. Here is what you need.
Reported-and-tested-by: Nikos Barkas levelwol@gmail.com
Reported-and-tested-by: Ioannis Barkas risc4all@yahoo.com

We will definitely read the src sometime, but you leave no doubt that partial will be lost forever... Damn, not what we were hoping for! Expect more suckers for this list, as some ODDs have completely broken LPM.

Good luck, go back in time and fix older kernels.
Comment 19 Tejun Heo 2014-01-16 14:52:19 UTC
Created attachment 122281 [details]
libata-disable-LPM-for-some-WD-SATA-I-devices.patch

Applied the attached patch to libata/for-3.14 w/ stable cc'd.

Thank you very much.
Comment 20 DE 2014-01-17 21:27:39 UTC
Beautiful! Who will close the bug?
Comment 21 DE 2014-05-21 21:26:25 UTC
Debian Wheezy 7.5.0 has this patch in its 3.2.57-3 kernel! Unfortunately the latest Squeeze (6.0.9) has an unpatched 2.6.32. As the bug was discovered on Squeeze with 2.6.32, this patch must be backported to 2.6.32 for the sake of completeness, whether it gets into Squeeze or not. The latest 2.6.32.62 does not have this patch; is there a good reason it was omitted?

Our "infamous" disk now has a Debian 7.5.0 32bit install on an ICH5 system and revealed a problem in the logic of the patch:

[    3.052105] ata3.00: LPM support broken, forcing max_power
[    3.053417] ata3.00: ATA-6: WDC WD2500JD-00HBC0, 08.02D08, max UDMA/133
[    3.053427] ata3.00: 488397168 sectors, multi 16: LBA48 
[    3.068478] ata3.00: LPM support broken, forcing max_power

This host is handled by ata_piix, which does not support LPM, so this message should not be present and the workaround should not be activated on non-AHCI hosts for now. Reading the host PCI class code could do the trick, as ahci is the only driver currently using LPM.
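
A minimal sketch of the class code test we mean. The helper name host_is_ahci is made up, and we assume the usual 24-bit PCI class code layout in which 0x010601 stands for mass storage (0x01), SATA (0x06), AHCI programming interface (0x01); how the driver would actually obtain the class code is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: base class 0x01, subclass 0x06 (SATA), prog-if 0x01 (AHCI). */
#define PCI_CLASS_CODE_AHCI 0x010601u

/* Hypothetical helper: true only for controllers operating in AHCI mode. */
bool host_is_ahci(uint32_t class_code)
{
	return (class_code & 0xFFFFFFu) == PCI_CLASS_CODE_AHCI;
}
```

A controller exposing an IDE class code (0x0101xx), as we would expect from an ICH5 handled by ata_piix, would fail this test and the "forcing max_power" path could be skipped for it.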