Bug 72191 - Thinkpad t 440s - Setting ALPM from min_power to max_power gives ATA errors, sets the disk r/o
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: IA-64 Linux
Importance: P1 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-16 09:23 UTC by Joost
Modified: 2014-10-20 14:13 UTC

See Also:
Kernel Version: 3.14.rc6
Tree: Mainline
Regression: No


Attachments
Setting /sys/class/scsi_host/host[0-2]/link_power_management_policy to max_power on another terminal (2.32 MB, image/jpeg)
2014-03-16 09:23 UTC, Joost
dmesg (57.62 KB, text/plain)
2014-03-16 09:24 UTC, Joost
lspci (1.30 KB, text/plain)
2014-03-16 09:25 UTC, Joost

Description Joost 2014-03-16 09:23:00 UTC
Created attachment 129561 [details]
Setting /sys/class/scsi_host/host[0-2]/link_power_management_policy to max_power on another terminal

Hello everybody,

this is on a T440s 20AR-S0BH00. I am investigating repeated system lock-ups when hibernating, and found that when the ALPM setting is changed from min_power to max_power, there are a lot of ATA errors and the disk is set read-only, so a hard reboot is needed because 'reboot' and 'poweroff' no longer work. Setting ALPM to medium_power does not trigger this behaviour.
I believe that this is the culprit for constant failures of hibernation (at least one).
Attached are a screenshot and some other info, please tell me what else you need.
Thanks!

Joost
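
For anyone trying to reproduce this, a minimal sketch of reading and switching the policy on all hosts at once. It assumes the sysfs path named in the attachment title and the policy names min_power, medium_power and max_performance ("max_power" in the summary is read here as max_performance, which is an assumption). The function names and the SCSI_HOST_DIR override are made up for illustration.

```sh
#!/bin/sh
# SCSI_HOST_DIR is a hypothetical override so the functions can be tried
# against a scratch directory; by default it is the real sysfs location.
SCSI_HOST_DIR="${SCSI_HOST_DIR:-/sys/class/scsi_host}"

# Print "hostN: policy" for every host that exposes the attribute.
alpm_show() {
    for f in "$SCSI_HOST_DIR"/host*/link_power_management_policy; do
        [ -e "$f" ] || continue
        host=$(basename "$(dirname "$f")")
        printf '%s: %s\n' "$host" "$(cat "$f")"
    done
}

# Write one policy to every host (needs root on a real system).
alpm_set() {
    case "$1" in
        min_power|medium_power|max_performance) ;;
        *) echo "unknown policy: $1" >&2; return 1 ;;
    esac
    for f in "$SCSI_HOST_DIR"/host*/link_power_management_policy; do
        [ -e "$f" ] || continue
        printf '%s\n' "$1" > "$f"
    done
}
```

Usage would be along the lines of `alpm_show`, then `alpm_set medium_power` as root. On the affected hardware, switching away from min_power this way may trigger exactly the errors described above, so it is best tried with nothing important mounted read-write.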
Comment 1 Joost 2014-03-16 09:24:55 UTC
Created attachment 129571 [details]
dmesg
Comment 2 Joost 2014-03-16 09:25:19 UTC
Created attachment 129581 [details]
lspci
Comment 3 Joost 2014-03-16 09:26:46 UTC
Sorry, I meant "this is at least one culprit for ...", the hibernation crashes are constant.
Comment 4 wes33 2014-03-19 12:02:58 UTC
I can confirm this, my T440s has a Samsung SSD,
MZ7TD512HAGM-0001L  (512gb) with firmware
DXT05L0Q

Please note that this problem (I suspect it is
the identical problem) has been noticed in Windows
too, see 

http://forums.lenovo.com/t5/T400-T500-and-newer-T-series/T440s-is-killing-Samsung-840-pro-SSD-s/td-p/1366903

It may well be a firmware problem with Samsung drives
(the MZ7TD512HAGM-0001L is an OEM version of the 840
series I believe).
Comment 5 Joost 2014-03-19 21:44:38 UTC
Thanks for the news!

I believe I have a 256 GB OEM PM841 inside; at least, none of the firmware updaters for the 840/840 Pro/840 EVO found an updatable disk.

Is there any way to disable ALPM switching when hibernating?

I just miss hibernation so much ...
Comment 6 Alan 2014-03-20 12:28:02 UTC
You could try adding the device to the table of busted devices. Look at

drivers/ata/libata-core.c: static const struct ata_blacklist_entry ata_device_blacklist [] = {

and you'll see that it lists drives with problems.

If you add a pattern for your drive with the ATA_HORKAGE_NOLPM flag and build a new kernel that ought to do the trick properly and you can then let us know if it works.

Alan
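
Blacklist entries are matched against the drive's model string and, optionally, its firmware revision, so the exact strings are needed first. A small sketch for pulling them out of `hdparm -I` output, assuming the "Model Number:" / "Firmware Revision:" layout that hdparm prints; `ata_ident` is a hypothetical helper name and /dev/sda in the usage line is only an example device.

```sh
#!/bin/sh
# Reads `hdparm -I` text on stdin and prints "model|firmware",
# with the padding that hdparm adds after the model string removed.
ata_ident() {
    awk -F': *' '
        /Model Number:/      { sub(/[ \t]+$/, "", $2); model = $2 }
        /Firmware Revision:/ { sub(/[ \t]+$/, "", $2); fw = $2 }
        END { printf "%s|%s\n", model, fw }
    '
}
```

Run as root with something like `hdparm -I /dev/sda | ata_ident`. The part before the `|` is what a model pattern in the table would have to match.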
Comment 7 Joost 2014-03-20 21:04:47 UTC
Would this disable ALPM completely (not acceptable for me because of the increase in power usage) or only during hibernation?

Joost
Comment 8 Alan 2014-03-20 21:16:13 UTC
It will disable LPM completely for that device - which is a good starting point for testing if LPM is the problem here and if it fixes hibernation. The detail can be refined once we know that is the case.
Comment 9 Joost 2014-03-21 16:40:10 UTC
Right now I'm getting working hibernation with this hook:

cat /etc/pm/sleep.d/00_powertop_autotune 

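#!/bin/sh
# pm-utils sleep hook: $1 is the sleep phase (hibernate, suspend, thaw, resume).

case "$1" in
    thaw|resume)
        # Back from sleep: re-enable ALPM and power saving.
        echo "SATA_ALPM_ENABLE=true" > /etc/pm/config.d/sata_alpm
        /usr/sbin/pm-powersave true
        /usr/sbin/powertop --auto-tune
        ;;
    hibernate)
        # Before hibernating: disable ALPM and power saving.
        echo "SATA_ALPM_ENABLE=false" > /etc/pm/config.d/sata_alpm
        /usr/sbin/pm-powersave
        /usr/sbin/pm-powersave false
        ;;
esac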

I don't know whether "pm-powersave" or "pm-powersave false" does the magic, but with this I have ALPM enabled and working hibernation. I don't know how to check whether ALPM is actually in effect (cat'ing /sys/.../link_power_management_policy always returns the last configuration written), so I have both commands running right now.
Comment 10 Joost 2014-03-21 16:52:37 UTC
... and re-enabled on thaw/resume of course.

Ok, powertop tells me that it's probably 

pm-powersave false

which turns off ALPM without crashing the drive/controller.

I'm going to check if that by itself is enough to let us hibernate successfully.
Comment 11 Joost 2014-03-21 17:00:34 UTC
So, right now I believe


#!/bin/sh

case "$1" in
    thaw|resume)
        /usr/sbin/pm-powersave true
        /usr/sbin/powertop --auto-tune
        ;;
    hibernate)
        /usr/sbin/pm-powersave false
        ;;
esac


is sufficient for successful hibernation.
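
A variant that skips pm-powersave and touches only the link policy might look like the sketch below. It is untested on the affected hardware and rests on two assumptions: that stepping down to medium_power before hibernation is safe (the description says medium_power does not trigger the errors), and that the saved state survives in the hibernation image. The hook name, STATE_FILE and the SCSI_HOST_DIR override are hypothetical.

```sh
#!/bin/sh
# Hypothetical pm-utils hook, e.g. /etc/pm/sleep.d/01_alpm (name is an assumption).
SCSI_HOST_DIR="${SCSI_HOST_DIR:-/sys/class/scsi_host}"
STATE_FILE="${STATE_FILE:-/var/run/alpm_saved_policy}"

alpm_hook() {
    case "$1" in
        hibernate)
            # Remember each host's current policy, then switch to medium_power.
            : > "$STATE_FILE"
            for f in "$SCSI_HOST_DIR"/host*/link_power_management_policy; do
                [ -e "$f" ] || continue
                printf '%s %s\n' "$f" "$(cat "$f")" >> "$STATE_FILE"
                echo medium_power > "$f"
            done
            ;;
        thaw)
            # Restore whatever was saved before hibernation, if anything.
            [ -f "$STATE_FILE" ] || return 0
            while read -r f policy; do
                [ -e "$f" ] && echo "$policy" > "$f"
            done < "$STATE_FILE"
            rm -f "$STATE_FILE"
            ;;
    esac
}

alpm_hook "${1:-}"
```

Restoring min_power on thaw goes through a medium_power to min_power transition, which is not the direction reported as failing here, but that is an inference rather than something verified.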
Comment 12 Stefano Zacchiroli 2014-05-07 14:23:27 UTC
FWIW, I'm seeing this problem on another T440s, with the following disk:

ATA device, with non-removable media
	Model Number:       SAMSUNG MZ7TD512HAGM-000L1              
	Serial Number:      S151NYAF303599      
	Firmware Revision:  DXT05L0Q
	Media Serial Num:   00000000000000000000
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0

I'm not using hibernation on a regular basis, but I notice the problem sporadically when plugging the AC power connector and (very empirical evidence) using the disk immediately after it (e.g. by starting mutt that will check the timestamps of a whole bunch of maildirs).

I didn't realize myself that the disk was being set read-only, but I confirm that the only "way out" at that point is a hard reboot.

I've the impression that setting CONTROL_HD_POWERMGMT=0 in /etc/laptop-mode/laptop-mode.conf (Debian/testing user here, with kernel 3.14.2) has significantly reduced the incidence of the issue when plugging the AC connect, but it has definitely not made it go away completely. (And, even with that setting in laptop-mode.conf, according to powertop SATA power management is still in use, not sure why.)

It is not clear to me if this is something which is fixable at the kernel level, or if it needs to be fixed at the firmware level anyhow. Can someone comment on that? I've tried to update the disk firmware but:

1) I'm not sure which one from www.samsung.com/samsungssd/ is the right firmware image for the above disk (anyone?)

2) as a guess I've tried DXT09B0Q (it is the most similar-looking firmware version to the one hdparm detects on my disk), but the bootable image didn't seem to be really bootable. I guess the Win EXE would work, but I've thrown away Win altogether...

Tips welcome.

And thanks a bunch for your awesome kernel work!
Comment 13 wes33 2014-05-09 12:29:44 UTC
"I've the impression that setting CONTROL_HD_POWERMGMT=0 in /etc/laptop-mode/laptop-mode.conf (Debian/testing user here, with kernel 3.14.2) has significantly reduced the incidence of the issue when plugging the AC connect, but it has definitely not made it go away completely. (And, even with that setting in laptop-mode.conf, according to powertop SATA power management is still in use, not sure why.)"

Just a note on the above: laptop-mode does not handle alpm properly.
I'm using TLP to set "medium_power" / "max_performance" instead of
"min_power". I haven't had the problem recur but I would prefer to
get maximum battery saving if possible.

This problem is reported solved on Windows machines with consumer
versions of Samsung 840 models via a firmware update. Unfortunately,
the Samsung Magician software does not work on OEM drives. It would
be interesting to see if the DXT09B0Q firmware fixes the problem.
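
For reference, a sketch of what such a TLP setup might look like. It assumes TLP's SATA_LINKPWR_ON_AC and SATA_LINKPWR_ON_BAT settings and the /etc/default/tlp location; which policy goes with which power source is a guess from the comment above, not something stated in it.

```sh
# /etc/default/tlp (assumed location; check your distribution's TLP package)

# Link power management policy while on AC power.
SATA_LINKPWR_ON_AC=max_performance

# Policy on battery: medium_power instead of min_power, trading some
# battery life for avoiding the min_power lock-ups reported in this bug.
SATA_LINKPWR_ON_BAT=medium_power
```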
Comment 14 wes33 2014-05-09 12:52:03 UTC
I screwed up my courage and tried the DXT09B0Q upgrade,
but it only reports "no supported SSDs found" and
exits ...
Comment 15 Bharath Ramesh 2014-06-10 07:04:41 UTC
I recently updated my BIOS to GJET77WW (2.27). I have re-enabled link power management and, knock on wood, have not seen this issue since. Previously, I would have this issue constantly and ended up with one of my SSDs being completely killed.
Comment 16 Joost 2014-07-07 18:48:42 UTC
I also updated the BIOS to GJET77WW (2.27), but I am experiencing further lock-ups.

BTW wes33, at least in my machine there is some special kind of OEM SSD for which no firmware updates seem to be available...
Though I would be glad to hear the opposite!
Comment 17 Stefano Zacchiroli 2014-07-07 19:01:31 UTC
@Joost: yep, same here, I've updated the BIOS but the lock-ups still happen, unfortunately. And I, too, am still unable to find firmware updates for the OEM SSDs. As a workaround I'm now regularly putting the laptop on stand-by by closing the lid every time I want to plug the AC charger in, to avoid the risk of a lock-up.
Comment 18 wes33 2014-07-07 23:35:47 UTC
@Joost - you are right - the OEM SSD does not take standard
firmware upgrades

Has anyone else used the "medium_power" setting successfully?

I've had zero lockups since starting to use it, and my power
usage is not bad (base about 5 watts).
Comment 19 bulletgani 2014-08-01 01:01:11 UTC
@wes33 I have had no lockups with the "medium_power" setting ... and my power usage is about 0.8 W below the "max_power" setting.
Comment 20 yueqili 2014-08-17 01:31:42 UTC
Hi,
 I will have a T440p. I will replace the hard drive with an 840 SSD.
 Does anybody have experience with BIOS 2.25 and SSD firmware DXT09B0Q? I assume SATA Active Link Power Management is on in the BIOS.
 Thank you.
Comment 21 yueqili 2014-08-26 02:24:59 UTC
I have tested 840 ssd with firmware DXT09B0Q.
BIOS 2.25.
I am using Fedora 20 Linux, 64-bit version, with default settings.
I have been using it for more than one week.
Everything works OK.
Comment 22 yueqili 2014-08-26 02:25:17 UTC
I have tested 840 ssd with firmware DXT09B0Q.
T440p BIOS 2.25.
I am using Fedora 20 Linux, 64-bit version, with default settings.
I have been using it for more than one week.
Everything works OK.
Comment 23 Scott R. Santos 2014-08-28 02:43:01 UTC
Here to report that the "medium_power" workaround for the T440S appears to be SSD model sensitive/specific. Currently booting off a:

=== START OF INFORMATION SECTION ===
Model Family:     Marvell based SanDisk SSDs
Device Model:     SanDisk SD6SB2M512G1022I
Serial Number:    XXXXXXXXXXXXXX
LU WWN Device Id: XXXXXXXXXXXXXX
Firmware Version: X210400
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      Unknown (0x0011)
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Aug 27 21:30:50 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

OS is up-to-date Arch with TLP for power management. Both "min_power" and "medium_power" pollute the journal with errors related to the SATA interface, particularly when waking from suspend and/or under heavy write activity while on battery. Only errors though, no lock-ups and/or data loss as others have reported. 

The saving grace in this situation is the 6+3 battery configuration; still getting ~10 hrs of mobile time under "max_performance", other TLP features enabled and 35% backlight.
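
To put a number on journal noise like this, the error lines can be counted per ATA port. A rough sketch, assuming the usual libata message wording ("exception Emask", "failed command:", "hard resetting link"); those strings are not taken from this report, and `ata_error_summary` is a made-up helper name.

```sh
#!/bin/sh
# Reads kernel log text on stdin and prints "<port> <count>" for each
# ATA port that logged one of the assumed error messages.
ata_error_summary() {
    awk '
        match($0, /ata[0-9]+(\.[0-9]+)?: (exception Emask|failed command:|hard resetting link)/) {
            port = substr($0, RSTART, RLENGTH)
            sub(/[.:].*/, "", port)   # keep just "ataN"
            count[port]++
        }
        END { for (p in count) print p, count[p] }
    ' | sort
}
```

It can be fed from `dmesg | ata_error_summary` or `journalctl -k | ata_error_summary`, which makes it easier to compare runs under "min_power", "medium_power" and "max_performance".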
Comment 24 yueqili 2014-08-28 08:58:38 UTC
Have you updated your BIOS and SSD firmware?
I think it could be a hardware bug.
Comment 25 Scott R. Santos 2014-08-28 12:07:21 UTC
Yes, BIOS and SSD firmware are the most up-to-date versions as of this posting date. Hopefully future updates to these and/or the Linux kernel will overcome this limitation.
Comment 26 wes33 2014-10-20 14:07:48 UTC
I noticed that Lenovo offers a firmware update for several
drives, including the MZ7TD512HAGM-0001L, to firmware version
DXT06L0Q

It does NOT help with this bug; I still get the errors
and am kicked to a read-only file system on (some) transitions
from min_power to the performance setting
Comment 27 Dmitry Nezhevenko 2014-10-20 14:13:10 UTC
It seems that the new Samsung 850 Pro SSD is OK with min_power (T440p machine)

Device Model:     Samsung SSD 850 PRO 256GB
LU WWN Device Id: 5 002538 8a066a5a2
Firmware Version: EXM01B6Q

Don't know about 840 on this particular machine.
