Bug 26702

Summary: SATA AHCI Errors: READ FPDMA QUEUED
Product: IO/Storage Reporter: Gerry Reno (greno)
Component: Serial ATAAssignee: Tejun Heo (tj)
Status: RESOLVED INVALID    
Severity: high CC: andornaut, greno, mikhail.v.gavrilov, tj, utomatoe
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.37 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg output

Description Gerry Reno 2011-01-14 16:02:09 UTC
Created attachment 43552 [details]
dmesg output

On one of my Ubuntu machines I tried upgrading the kernel to 2.6.37-12 from the kernel-ppa.  I cannot get this kernel to run successfully on my hardware.  And I found an old Ubuntu bug https://bugs.launchpad.net/ubuntu/+bug/550559 where people are describing the same problem (no resolution).  Plus there was another kernel bug https://bugzilla.kernel.org/show_bug.cgi?id=15573 about a marvell controller that showed similar problem, only I do not have a marvell controller.

Here is my lspci output:
# lspci
00:00.0 Host bridge: Advanced Micro Devices [AMD] RS780 Host Bridge
00:01.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge (int gfx)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge (PCIE port 5)
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
00:12.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:12.1 USB Controller: ATI Technologies Inc SB700 USB OHCI1 Controller
00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:13.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:13.1 USB Controller: ATI Technologies Inc SB700 USB OHCI1 Controller
00:13.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 3a)
00:14.1 IDE interface: ATI Technologies Inc SB700/SB800 IDE Controller
00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia (Intel HDA)
00:14.3 ISA bridge: ATI Technologies Inc SB700/SB800 LPC host controller
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:14.5 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI2 Controller
00:18.0 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Link Control
01:05.0 VGA compatible controller: ATI Technologies Inc Radeon HD 3300 Graphics
01:05.1 Audio device: ATI Technologies Inc RS780 Azalia controller
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
03:0e.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)

Upon booting this kernel (as well as 2.6.32 kernels) I see a lot of the following types of errors on the console and dmesg:

$ dmesg
tatus: { DRDY ERR }
[ 5305.808784] ata7.00: error: { UNC }
[ 5305.824400] ata7.00: configured for UDMA/133
[ 5305.828712] ata7: EH complete
[ 5310.222859] ata7.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
[ 5310.229321] ata7.00: irq_stat 0x40000008
[ 5310.233308] ata7.00: failed command: READ FPDMA QUEUED
[ 5310.238460] ata7.00: cmd 60/00:00:c6:04:7c/01:00:13:00:00/40 tag 0 ncq 131072 in
[ 5310.238461]          res 51/40:00:c6:04:7c/00:01:13:00:00/40 Emask 0x409 (media error) <F>
[ 5310.254240] ata7.00: status: { DRDY ERR }
[ 5310.258273] ata7.00: error: { UNC }
[ 5310.273862] ata7.00: configured for UDMA/133
[ 5310.278174] ata7: EH complete
[ 5314.522287] ata7.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
[ 5314.528751] ata7.00: irq_stat 0x40000008
[ 5314.532748] ata7.00: failed command: READ FPDMA QUEUED
[ 5314.537910] ata7.00: cmd 60/00:00:c6:05:7c/01:00:13:00:00/40 tag 0 ncq 131072 in
[ 5314.537911]          res 51/40:00:c6:05:7c/00:01:13:00:00/40 Emask 0x409 (media error) <F>
[ 5314.553645] ata7.00: status: { DRDY ERR }
[ 5314.557676] ata7.00: error: { UNC }
[ 5314.573340] ata7.00: configured for UDMA/133
[ 5314.577648] ata7: EH complete
[ 5318.930134] ata7.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
[ 5318.936591] ata7.00: irq_stat 0x40000008
[ 5318.940589] ata7.00: failed command: READ FPDMA QUEUED
[ 5318.945748] ata7.00: cmd 60/00:00:c6:04:7c/01:00:13:00:00/40 tag 0 ncq 131072 in
[ 5318.945749]          res 51/40:00:c6:04:7c/00:01:13:00:00/40 Emask 0x409 (media error) <F>
[ 5318.961511] ata7.00: status: { DRDY ERR }
[ 5318.965551] ata7.00: error: { UNC }
[ 5318.981149] ata7.00: configured for UDMA/133
[ 5318.985461] ata7: EH complete
[ 5323.229573] ata7.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
[ 5323.236084] ata7.00: irq_stat 0x40000008
[ 5323.240053] ata7.00: failed command: READ FPDMA QUEUED
[ 5323.245207] ata7.00: cmd 60/00:00:c6:05:7c/01:00:13:00:00/40 tag 0 ncq 131072 in
[ 5323.245208]          res 51/40:00:c6:05:7c/00:01:13:00:00/40 Emask 0x409 (media error) <F>
[ 5323.260970] ata7.00: status: { DRDY ERR }
[ 5323.265010] ata7.00: error: { UNC }
[ 5323.280627] ata7.00: configured for UDMA/133
[ 5323.284937] ata7: EH complete
[ 5327.587426] ata7.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
[ 5327.593929] ata7.00: irq_stat 0x40000008
[ 5327.597889] ata7.00: failed command: READ FPDMA QUEUED
[ 5327.603080] ata7.00: cmd 60/00:00:c6:04:7c/01:00:13:00:00/40 tag 0 ncq 131072 in
[ 5327.603081]          res 51/40:00:c6:04:7c/00:01:13:00:00/40 Emask 0x409 (media error) <F>
[ 5327.618848] ata7.00: status: { DRDY ERR }
[ 5327.622913] ata7.00: error: { UNC }
[ 5327.638437] ata7.00: configured for UDMA/133
[ 5327.642775] ata7: EH complete
[ 5331.886866] ata7.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
[ 5331.893354] ata7.00: irq_stat 0x40000008
[ 5331.897305] ata7.00: failed command: READ FPDMA QUEUED
[ 5331.902511] ata7.00: cmd 60/00:00:c6:05:7c/01:00:13:00:00/40 tag 0 ncq 131072 in
[ 5331.902512]          res 51/40:00:c6:05:7c/00:01:13:00:00/40 Emask 0x409 (media error) <F>
[ 5331.918300] ata7.00: status: { DRDY ERR }
[ 5331.922356] ata7.00: error: { UNC }
[ 5331.937917] ata7.00: configured for UDMA/133
[ 5331.942255] sd 6:0:0:0: [sde] Unhandled sense code
[ 5331.947070] sd 6:0:0:0: [sde]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 5331.954397] sd 6:0:0:0: [sde]  Sense Key : Medium Error [current] [descriptor]
[ 5331.961847] Descriptor sense data with sense descriptors (in hex):
[ 5331.968090]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
[ 5331.975246]         13 7c 05 c6 
[ 5331.978713] sd 6:0:0:0: [sde]  Add. Sense: Unrecovered read error - auto reallocate failed
[ 5331.987127] sd 6:0:0:0: [sde] CDB: Read(10): 28 00 13 7c 05 c6 00 01 00 00
[ 5331.994618] end_request: I/O error, dev sde, sector 326895046
[ 5332.000414] md/raid1:md6: sde3: rescheduling sector 294668656
[ 5332.006188] md/raid1:md6: sde3: rescheduling sector 294668904
[ 5332.012017] ata7: EH complete
[ 5336.344707] ata7.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
[ 5336.351201] ata7.00: irq_stat 0x40000008
[ 5336.355143] ata7.00: failed command: READ FPDMA QUEUED

I've attached the full listing.


I've tested the hard drives with manufacturers surface test tools and the drives show no errors.  Smartctl shows no errors after testing.

I've tried adding the following kernel boot switch:  libata.force=noncq
but it did not help.
Comment 1 Tejun Heo 2011-01-14 17:07:20 UTC
That's your drive reporting that it can't read a sector.  Most likely it doesn't have anything to do with the kernel upgrade.  Can you please attach the output of "smartctl -a /dev/sde"?

Thanks.
Comment 2 Gerry Reno 2011-01-14 17:41:08 UTC
I ran that command yesterday successfully but when I run it just now again it shows errors as you said.  You can see in the output where I ran my test yesterday which occurred 3 hours before any failures started showing up.  
=================
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10064   
=================

And now at 10067 hours errors are logged.  It looks then like this drive was starting to fail but not reporting errors until today.  Very odd that it occurred just as I had upgraded the kernel.

I'll try to find a replacement drive and then see what happens.


==================================================================
# smartctl -a /dev/sde
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     Hitachi HDP725050GLA360
Serial Number:    GEA554RJ340LDG
Firmware Version: GM4OA5CA
User Capacity:    500,107,862,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Jan 14 12:29:38 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (7890) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 131) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   065   065   016    Pre-fail  Always       -       3152942
  2 Throughput_Performance  0x0005   128   128   054    Pre-fail  Offline      -       161
  3 Spin_Up_Time            0x0007   175   175   024    Pre-fail  Always       -       201 (Average 237)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       62
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   127   127   020    Pre-fail  Offline      -       31
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       10092
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       62
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       109
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       109
194 Temperature_Celsius     0x0002   153   153   000    Old_age   Always       -       39 (Lifetime Min/Max 24/50)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   001   001   000    Old_age   Always       -       1899
198 Offline_Uncorrectable   0x0008   001   001   000    Old_age   Offline      -       13979
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 2052 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2052 occurred at disk power-on lifetime: 10067 hours (419 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 be bc eb e1  Error: UNC 8 sectors at LBA = 0x01ebbcbe = 32226494

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 be bc eb e1 08      07:15:21.400  READ DMA
  27 00 00 00 00 00 e0 08      07:15:21.400  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      07:15:21.400  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      07:15:21.400  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 08      07:15:21.400  READ NATIVE MAX ADDRESS EXT

Error 2051 occurred at disk power-on lifetime: 10067 hours (419 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 be bc eb e1  Error: UNC 8 sectors at LBA = 0x01ebbcbe = 32226494

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 be bc eb e1 08      07:15:17.100  READ DMA
  27 00 00 00 00 00 e0 08      07:15:17.100  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      07:15:17.100  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      07:15:17.100  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 08      07:15:17.100  READ NATIVE MAX ADDRESS EXT

Error 2050 occurred at disk power-on lifetime: 10067 hours (419 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 be bc eb e1  Error: UNC 8 sectors at LBA = 0x01ebbcbe = 32226494

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 be bc eb e1 08      07:15:12.800  READ DMA
  27 00 00 00 00 00 e0 08      07:15:12.800  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      07:15:12.800  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      07:15:12.800  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 08      07:15:12.800  READ NATIVE MAX ADDRESS EXT

Error 2049 occurred at disk power-on lifetime: 10067 hours (419 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 be bc eb e1  Error: UNC 8 sectors at LBA = 0x01ebbcbe = 32226494

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 be bc eb e1 08      07:15:08.500  READ DMA
  27 00 00 00 00 00 e0 08      07:15:08.500  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      07:15:08.500  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      07:15:08.500  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 08      07:15:08.500  READ NATIVE MAX ADDRESS EXT

Error 2048 occurred at disk power-on lifetime: 10067 hours (419 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 be bc eb e1  Error: UNC 8 sectors at LBA = 0x01ebbcbe = 32226494

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 be bc eb e1 08      07:15:04.200  READ DMA
  27 00 00 00 00 00 e0 08      07:15:04.200  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08      07:15:04.200  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      07:15:04.200  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 08      07:15:04.200  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10064         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Comment 3 andornaut 2011-02-09 05:33:15 UTC
I'm experiencing a similar issue. My system hangs periodically for about a minute while the HD resets.

Environment:
Ubuntu running in a Virtual Box VM running on WIndows 7 64bit.
Asus G53JW Laptop
Intel X25 SSD

Log excerpts:

Jan 17 14:47:58 vm rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="578" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.
Jan 17 14:48:32 vm kernel: [ 1513.120252] ata3: hard resetting link
Jan 17 14:48:33 vm kernel: [ 1513.470345] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 17 14:48:33 vm kernel: [ 1513.471232] ata3.00: configured for UDMA/133
Jan 17 14:48:33 vm kernel: [ 1513.471241] ata3.00: device reported invalid CHS sector 0
Jan 17 14:48:33 vm kernel: [ 1513.471255] ata3: EH complete

[ 1513.120200] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[ 1513.120210] ata3.00: failed command: READ FPDMA QUEUED
[ 1513.120222] ata3.00: cmd 60/08:00:30:66:4c/00:00:00:00:00/40 tag 0 ncq 4096 in
[ 1513.120224]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[ 1513.120230] ata3.00: status: { DRDY }
[ 1513.120252] ata3: hard resetting link
[ 1513.470345] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1513.471232] ata3.00: configured for UDMA/133
[ 1513.471241] ata3.00: device reported invalid CHS sector 0
[ 1513.471255] ata3: EH complete
Comment 4 Tejun Heo 2011-02-09 08:43:38 UTC
andronaut, you're seeing command errors in a virtualized environment.  Please report it to vbox devs.
Comment 5 Edwin 2011-02-17 03:19:12 UTC
On a similar board I get the same errors and some different:

Tried both Ubuntu 10.04 (2.6.35-24) and mainline kernels 2.6.35 and 2.6.37

[145415.041024] ata6.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[145415.043044] ata6.00: failed command: READ FPDMA QUEUED
[145415.045069] ata6.00: cmd 60/08:00:88:cf:22/00:00:7f:00:00/40 tag 0 ncq 4096 in
[145415.045072]          res 40/00:00:00:f4:2c/00:00:00:00:00/40 Emask 0x4 (timeout)
[145415.049235] ata6.00: status: { DRDY }
[145415.053161] ata6: hard resetting link
[145416.680969] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[145416.693556] ata6.00: configured for UDMA/133
[145416.693566] ata6.00: device reported invalid CHS sector 0
[145416.693582] ata6: EH complete

And turning off NCQ as well as tried pcie_aspm=off, results in DMA errors:

[28438.930222] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[28438.932625] ata4.00: failed command: READ DMA EXT
[28438.934695] ata4.00: cmd 25/00:88:b0:ee:f8/00:00:73:00:00/e0 tag 0 dma 69632 in
[28438.934698]          res 40/00:00:af:fb:64/00:00:72:00:00/e0 Emask 0x4 (timeout)
[28438.938729] ata4.00: status: { DRDY }
[28438.942185] ata4: hard resetting link
[28448.950122] ata4: softreset failed (1st FIS failed)
[28448.952571] ata4: hard resetting link
[28458.960096] ata4: softreset failed (1st FIS failed)
[28458.962086] ata4: hard resetting link
[28491.680112] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[28491.685013] ata4.00: configured for UDMA/133
[28491.685035] ata4: EH complete

I've seen this for READ/WRITE FPDMA QUEUED and READ/WRITE DMA EXT. I've gone as far as turned off the cache and then i get READ/WRITE DMA errors.

Seems to happen only to a single drive at a time, had it happen once to two drives. They were configured as RAID 1 or 10, using ext4 and LVM2. 

Bulk of my errors was FLUSH CACHE EXT:

[151921.080189] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[151921.082160] ata6.00: failed command: FLUSH CACHE EXT
[151921.084104] ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[151921.084107]          res 40/00:00:00:f4:2c/00:00:00:00:00/40 Emask 0x4 (timeout)
[151921.087978] ata6.00: status: { DRDY }
[151921.091783] ata6: hard resetting link
[151925.910120] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[151925.924365] ata6.00: configured for UDMA/133
[151925.924373] ata6.00: retrying FLUSH 0xea Emask 0x4
[151925.924592] ata6.00: device reported invalid CHS sector 0
[151925.924608] ata6: EH complete

Ended up replacing all my drives and PSU. Still cropped up on one of the replacement drives. Happened on 6 out of 10 Seagate Barracuda LP 2 TB (ST32000542AS) drives. Had 1 WD20EARS drive in with the original 5 that failed and it doesn't have any issues.

Also tried disable APM (hdparm -B), without any good results.

Eventually the drives fail SMART, with their reallocated sector count dipping below the pre-fail threshold.

I'm not entirely convinced it's a bug, seemed to be an oddity with SB700/800 boards.


root@codis:/var/log# lspci -nn |grep -i ahci
00:11.0 SATA controller [0106]: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode] [1002:4391] (rev 40)
03:00.0 SATA controller [0106]: JMicron Technology Corp. JMB361 AHCI/IDE [197b:2361] (rev 02)
03:00.1 IDE interface [0101]: JMicron Technology Corp. JMB361 AHCI/IDE [197b:2361] (rev 02)

I was only using AHCI in my BIOS, so the drives are all off of 00:11.0

Possible similar bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=611350
https://bugzilla.redhat.com/show_bug.cgi?id=549981
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/285892
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/397096
http://www.spinics.net/lists/linux-ide/msg39123.html
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/718598
https://bugs.launchpad.net/ubuntu/+bug/550559
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/718598

Unfortunately, I can't seem to reliably reproduce it anymore. Used to be firing up dbench/bonnie would do the trick, or dd if=/dev/sda of=/dev/null, but not anymore...
Comment 6 Tejun Heo 2011-02-17 11:23:50 UTC
Edwin, timeouts (especially on FLUSH_CACHE commands) usually indicate hardware problems.  PSU tends to be the most frequent offender.  How many drives do you have in the system?  Also, if you want to pursue it, please open a separate bug report.  This bug entry already has three completely different issues on it.

Thanks.
Comment 7 Tejun Heo 2011-02-17 11:24:30 UTC
Resolving as INVALID.
Comment 8 Mikhail 2013-10-13 09:34:49 UTC
Why this bug is closed?
I have similar issue on Fedora 20 :(
Comment 9 Tejun Heo 2013-10-13 17:12:22 UTC
Mikhail, if you are experiencing issues, please open a separate bug report. While the symptoms may seem similar, the root issues are usually completely separate depending on the hardware involved and specifics of the errors. No one can do much, if anything, without detailed information on each case.

Thanks.