Bug 43200

Summary: ATA errors when link_power_management_policy is min_power
Product: IO/Storage Reporter: Vegar (storvann)
Component: Serial ATA Assignee: Jeff Garzik (jgarzik)
Status: RESOLVED OBSOLETE    
Severity: high CC: alan, j, risc4all, tj
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.4.0-rc5 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Errors from dmesg (kernel 3.2)
dmesg errors with 3.4
lspci -vvnn output
/proc/scsi/scsi
/proc/modules
uname -r (3.2 kernel)
Full dmesg from boot to error message
Output of requested smartctl commands

Description Vegar 2012-05-03 20:43:51 UTC
Created attachment 73168 [details]
Errors from dmesg (kernel 3.2)

When /sys/class/scsi_host/host*/link_power_management_policy is set to min_power, ATA errors (see below) show up in dmesg whenever there is significant disk I/O. The system continues to run, but I/O stalls for several seconds each time the error occurs, making it extremely slow.

This is on a ThinkPad T61 with the following SATA controller:
00:1f.2 SATA controller: Intel Corporation 82801HM/HEM (ICH8M/ICH8M-E) SATA Controller [AHCI mode] (rev 03)

The issue did not occur with the 3.0 kernel from Ubuntu 11.10, but showed up when running 3.2 after upgrading to Ubuntu 12.04.

I have tested 3.4-rc5 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-rc5-precise/, and the error is present there too.


First line of the error message from dmesg, see attachment for more:
ata1.00: exception Emask 0x10 SAct 0xfe SErr 0x48c0002 action 0xe
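
For readers who want to interpret the hex fields in that line, here is a minimal sketch that splits it apart and names the SErr bits. The bit-to-name table follows the labels libata prints in its own "SError: { ... }" line as far as I recall them; treat it as an assumption and check it against the kernel source for your version. The function name decode_exception is made up for this example.

```python
import re

# SError bit positions and the short names libata uses when it prints
# "SError: { ... }". Assumption: taken from memory of the libata error
# handler, not from this report's attachments.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}

LINE_RE = re.compile(
    r"exception Emask (0x[0-9a-f]+) SAct (0x[0-9a-f]+) "
    r"SErr (0x[0-9a-f]+) action (0x[0-9a-f]+)"
)


def decode_exception(line):
    """Split a libata 'exception' line into its numeric fields."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a libata exception line")
    emask, sact, serr, action = (int(g, 16) for g in m.groups())
    return {
        "emask": emask,
        # SAct is a bitmap of NCQ tags that were outstanding at the time.
        "active_tags": [t for t in range(32) if sact & (1 << t)],
        "serr_flags": [name for bit, name in sorted(SERR_BITS.items())
                       if serr & (1 << bit)],
        "action": action,
    }


line = "ata1.00: exception Emask 0x10 SAct 0xfe SErr 0x48c0002 action 0xe"
decoded = decode_exception(line)
print(decoded["active_tags"])
print(decoded["serr_flags"])
```

Under those assumptions, SAct 0xfe means seven NCQ commands (tags 1 to 7) were in flight, and the SErr value is a combination of link-level flags rather than a media error.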


Ubuntu bug report:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/993507
Comment 1 Vegar 2012-05-03 20:44:47 UTC
Created attachment 73169 [details]
dmesg errors with 3.4
Comment 2 Vegar 2012-05-03 20:46:05 UTC
Created attachment 73170 [details]
lspci -vvnn output
Comment 3 Vegar 2012-05-03 20:46:42 UTC
Created attachment 73171 [details]
/proc/scsi/scsi
Comment 4 Vegar 2012-05-03 20:48:52 UTC
Created attachment 73172 [details]
/proc/modules
Comment 5 Vegar 2012-05-03 20:49:49 UTC
Created attachment 73173 [details]
uname -r (3.2 kernel)
Comment 6 Jürg Billeter 2012-11-04 13:31:53 UTC
I'm seeing this issue on a ThinkPad X1 Carbon with Linux 3.6.5 x86_64.

00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
Comment 7 DE 2014-01-18 11:09:40 UTC
Hello, a similar issue has been fixed this week! If the drive is broken we will fix your issue soon. Please attach the output of sudo hdparm -C /dev/sd* and sudo hdparm -i --Istdout /dev/sda* where * is your disk drive letter. With that data we will create a patch to workaround your issue.
Comment 8 Vegar 2014-01-18 11:17:37 UTC
$ sudo hdparm -C
/dev/sda:
 drive state is:  active/idle

$ sudo hdparm -i --Istdout
/dev/sda1:

 Model=HITACHI HTS541612J9SA00, FwRev=SBDIC7JP, SerialNo=SB2D51EVG4M3UE
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7516kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 
 AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
 Drive conforms to: ATA/ATAPI-7 T13 1532D revision 1:  ATA/ATAPI-2,3,4,5,6,7

 * signifies the current active mode

045a 3fff c837 0010 0000 0000 003f 0000
0000 0000 2020 2020 2020 5342 3244 3531
4556 4734 4d33 5545 0003 3ab8 0004 5342
4449 4337 4a50 4849 5441 4348 4920 4854
5335 3431 3631 324a 3953 4130 3020 2020
2020 2020 2020 2020 2020 2020 2020 8010
0000 0f00 4000 0200 0200 0007 3fff 0010
003f fc10 00fb 0100 4bb0 0df9 0000 0007
0003 0078 0078 0078 0078 0000 0000 0000
0000 0000 0000 001f 0702 0000 005e 0044
00fc 001a 746b 7f09 6163 7469 3c09 6163
203f 0024 0000 40fe fffe 0000 80fe 0000
0000 0000 0000 0000 4bb0 0df9 0000 0000
0000 0000 0000 8848 5000 cca5 4dc2 1945
0000 0000 0000 0000 0000 0000 0000 4004
4004 0000 0000 0000 0000 0000 0000 0000
0001 000b 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 4005 4000 8000 0000
4449 0000 0000 5858 5858 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 8000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 72a5
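
The hex dump above is the raw IDENTIFY DEVICE data, eight 16-bit words per row. As a rough sketch, the first ten rows (words 0 to 79) can be decoded as shown below. The word offsets for the strings (serial 10-19, firmware 23-26, model 27-46) and the power management bits (word 76 bit 9 for host-initiated, word 78 bit 3 and word 79 bit 3 for device-initiated) are my reading of the ATA layout and the libata helper macros; please verify them against the spec before relying on the flags. The helper names are invented for this example.

```python
# First ten rows (words 0-79) of the IDENTIFY DEVICE dump in this comment.
DUMP = """
045a 3fff c837 0010 0000 0000 003f 0000
0000 0000 2020 2020 2020 5342 3244 3531
4556 4734 4d33 5545 0003 3ab8 0004 5342
4449 4337 4a50 4849 5441 4348 4920 4854
5335 3431 3631 324a 3953 4130 3020 2020
2020 2020 2020 2020 2020 2020 2020 8010
0000 0f00 4000 0200 0200 0007 3fff 0010
003f fc10 00fb 0100 4bb0 0df9 0000 0007
0003 0078 0078 0078 0078 0000 0000 0000
0000 0000 0000 001f 0702 0000 005e 0044
"""


def parse_words(dump):
    """Turn whitespace-separated hex tokens into a list of 16-bit words."""
    return [int(tok, 16) for tok in dump.split()]


def id_string(words, first, last):
    """Decode an ATA string field: two ASCII characters per word, high byte first."""
    raw = b"".join(w.to_bytes(2, "big") for w in words[first:last + 1])
    return raw.decode("ascii", errors="replace").strip()


words = parse_words(DUMP)
serial = id_string(words, 10, 19)
firmware = id_string(words, 23, 26)
model = id_string(words, 27, 46)

# Assumed bit positions, see the note above.
hipm_supported = bool(words[76] & (1 << 9))
dipm_supported = bool(words[78] & (1 << 3))
dipm_enabled = bool(words[79] & (1 << 3))

print(model, firmware, serial)
print(hipm_supported, dipm_supported, dipm_enabled)
```

The three strings should match the Model, FwRev and SerialNo fields in the hdparm -i output above, which is a quick check that the dump was parsed with the right word order.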
Comment 9 Tejun Heo 2014-01-18 12:20:23 UTC
Can you please also post full dmesg output including the errors?

Thanks.
Comment 10 DE 2014-01-18 15:10:37 UTC
Your 5K160 drive is from the 2006 era, so it is in the danger zone for blacklisting.

Since Tejun is focusing on the errors, run sudo smartctl -x /dev/sda after boot, then sudo smartctl -a /dev/sda, and then sudo smartctl -x /dev/sda again. Those will give us an idea of your drive's health.
Comment 11 Vegar 2014-02-02 14:27:54 UTC
Created attachment 124201 [details]
Full dmesg from boot to error message
Comment 12 Vegar 2014-02-02 14:32:56 UTC
Created attachment 124211 [details]
Output of requested smartctl commands
Comment 13 Vegar 2014-02-02 14:36:55 UTC
See the last two attachments for the requested output. Here's what I did:

1. reboot
2. run requested smartctl commands
3. echo min_power | sudo tee /sys/class/scsi_host/host*/link_power_management_policy
4. start thunderbird to generate disk activity and provoke the requested error message
5. dump dmesg to file, including error messages

Current kernel version (uname -r): 3.8.0-25-generic
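
Step 3 above can also be scripted, which makes it easier to record the policy before and after a test run. This is only a sketch: the function names are made up, the list of accepted policy strings is limited to the three values I know these kernels accept (newer kernels may offer more), and writing to the real sysfs files needs root.

```python
import glob

# Policy strings accepted by the kernels discussed in this report.
# Assumption: newer kernels may accept additional values.
KNOWN_POLICIES = ("min_power", "medium_power", "max_performance")

DEFAULT_PATTERN = "/sys/class/scsi_host/host*/link_power_management_policy"


def read_policies(pattern=DEFAULT_PATTERN):
    """Return {path: current policy} for every host matching the pattern."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            result[path] = f.read().strip()
    return result


def set_policy(policy, pattern=DEFAULT_PATTERN):
    """Write the policy to every matching host file and return the paths written."""
    if policy not in KNOWN_POLICIES:
        raise ValueError("unknown policy: %r" % (policy,))
    paths = sorted(glob.glob(pattern))
    for path in paths:
        with open(path, "w") as f:
            f.write(policy + "\n")
    return paths
```

Calling read_policies() before and after set_policy("min_power") as root shows which hosts actually changed. Passing a different pattern lets the functions be tried against a scratch directory instead of /sys.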
Comment 14 Tejun Heo 2014-02-03 15:38:04 UTC
Hmmm... In the initial report, you said that the problem didn't occur with 3.0 but started appearing with 3.2. I've gone through the changes in that time period but can't spot anything which may affect LPM-related issues. Would it be possible for you to verify with a 3.0 kernel that the errors definitely don't occur there? And if so, would it be possible for you to bisect the kernels between 3.0 and 3.2? If the errors are reliably reproducible, it shouldn't be too difficult, though it is somewhat laborious.

Thanks.
Comment 15 Vegar 2014-02-04 22:55:21 UTC
I've bisected kernel versions from http://kernel.ubuntu.com/~kernel-ppa/mainline. I assume these are unmodified upstream builds (their wiki says "All of the upstream kernels are published at http://kernel.ubuntu.com/~kernel-ppa/mainline/").

It seems the bug was introduced with the 3.1 kernel. I was not able to reproduce the bug with 3.0.101, but it appeared immediately with 3.1.0.

I will proceed with commit bisecting as soon as I find some time for it.
Comment 16 Alan 2015-02-19 17:35:42 UTC
This bug relates to a very old kernel. Closing as obsolete.