Bug 43182 (RepeatingSATAError) - On average every few days my PC becomes unusable due to ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Summary: On average every few days my PC becomes unusable due to ata2.00: exception E...
Status: NEW
Alias: RepeatingSATAError
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: x86-64 Linux
Importance: P1 high
Assignee: Jeff Garzik
 
Reported: 2012-04-30 18:57 UTC by Holger Brandsmeier
Modified: 2016-03-19 16:56 UTC

Kernel Version: 3.6.9
Regression: No


Attachments
dmesg5.out (kernel 3.0.0-16) (213.47 KB, application/octet-stream)
2012-04-30 18:57 UTC, Holger Brandsmeier
dmesg.out: 3.2.12 (143.16 KB, application/octet-stream)
2012-04-30 18:58 UTC, Holger Brandsmeier
lspci output (1.72 KB, application/octet-stream)
2012-04-30 18:59 UTC, Holger Brandsmeier
hdparm_sdb.out (626 bytes, application/octet-stream)
2012-04-30 19:00 UTC, Holger Brandsmeier
hdparm_sda.out (599 bytes, application/octet-stream)
2012-04-30 19:00 UTC, Holger Brandsmeier
smart.out (5.30 KB, application/octet-stream)
2012-04-30 19:15 UTC, Holger Brandsmeier

Description Holger Brandsmeier 2012-04-30 18:57:34 UTC
Created attachment 73127
dmesg5.out (kernel 3.0.0-16)

The error has appeared with several kernel versions and on several distributions, namely:

Gentoo Linux for 3.2.12 kernel (I am mainly interested in this)
Ubuntu/Kubuntu 11.10 with kernel 3.0.0-15, 3.0.0-16

The problem has been occurring ever since I got the new laptop.

First error message from 3.0.0-16 (see more in attachment dmesg5.out):
[65304.786519] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

First error message from 3.2.12 (see attachment dmesg.out):
[84012.938152] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

I have been getting this error repeatedly for 3-5 months, about twice a week, usually after long periods of work, hibernation, etc. I have never experienced this problem in Windows (I use Linux almost exclusively, but the error should still have appeared in Windows at some point during those 3-5 months).

I have two drives:
- /dev/sda: 160GB SATA-II SSD Intel 320 Series (SSDSA2CW160G3)
- /dev/sdb: 1000GB SATA-II 5400 rpm Samsung Spinpoint M8 (HN-M101MBB)
The problem always affects the Samsung HDD and never the SSD (see below for more details). All SMART values seem to be OK (more below).

To me, the above error message and the errors that follow it (see below) sound like a problem in the kernel rather than a hardware failure. Am I correct here?

Can you explain to me what the error means and what is going wrong? I don't know whether you can reproduce the error (I attached the output of lspci), but if you tell me what to do, I can run more tests. With Gentoo I build my own kernel, and I can make any changes to it that you want. I also attached my kernel config (see below).

Thanks,
Holger
--------------------------------
The above error is always followed by:
[84012.938160] ata2.00: failed command: FLUSH CACHE EXT
[84012.938170] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[84012.938172]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[84012.938177] ata2.00: status: { DRDY }
[84012.938185] ata2: hard resetting link
[84018.284271] ata2: link is slow to respond, please be patient (ready=0)
[84022.970300] ata2: COMRESET failed (errno=-16)
[84022.970303] ata2: hard resetting link
[...]
[84073.027654] ata2.00: disabled
[84073.027657] ata2.00: device reported invalid CHS sector 0
[84073.029560] ata2: EH complete
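
As a reading aid for the `res` line above: in this libata error output, the first byte before the slash is the drive's status register, and the trailing `Emask` value names the error class. The following is a minimal Python sketch and was not part of the original report. The helper names `decode_status` and `parse_res_line` are hypothetical, and the bit names follow the conventional ATA status register layout, which is an assumption here (the log itself only confirms that 0x40 is printed as `DRDY`).

```python
import re

# Conventional ATA status register bits (assumed naming; the log above
# only confirms that 0x40 is printed as DRDY).
STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x10, "DSC"),
    (0x08, "DRQ"),
    (0x04, "CORR"),
    (0x02, "SENSE"),
    (0x01, "ERR"),
]

# Matches e.g. "res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)"
RES_RE = re.compile(
    r"res ([0-9a-f]{2})/([0-9a-f]{2}):.*Emask (0x[0-9a-f]+) \(([^)]*)\)"
)


def decode_status(status):
    """Return the names of all bits set in an ATA status byte."""
    return [name for bit, name in STATUS_BITS if status & bit]


def parse_res_line(line):
    """Extract status bits, error byte, Emask and its label from a 'res' line."""
    m = RES_RE.search(line)
    if m is None:
        return None
    return {
        "status": decode_status(int(m.group(1), 16)),
        "error": int(m.group(2), 16),
        "emask": int(m.group(3), 16),
        "reason": m.group(4),
    }


line = ("[84012.938172]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 "
        "Emask 0x4 (timeout)")
print(parse_res_line(line))
```

For the line from this report, the sketch yields a status of only `DRDY`, an error byte of 0, and the reason `timeout`, which agrees with the `status: { DRDY }` line that the kernel prints next.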

And then the following error loops:
[84073.029620] sd 1:0:0:0: [sdb] Unhandled error code
[84073.029622] sd 1:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[84073.029627] sd 1:0:0:0: [sdb] CDB: Write(10): 2a 00 2e c4 7f e5 00 00 08 00
[84073.029638] end_request: I/O error, dev sdb, sector 784629733
[84073.029641] end_request: I/O error, dev sdb, sector 784629733
[84073.029651] sd 1:0:0:0: [sdb] Unhandled error code
[84073.029656] Aborting journal on device sdb9-8.
[84073.029660] sd 1:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[84073.029664] sd 1:0:0:0: [sdb] CDB: Write(10): 2a 00 57 42 19 b8 00 00 38 00
[84073.029673] end_request: I/O error, dev sdb, sector 1463949752
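
The sector numbers in the `end_request` lines can be cross-checked against the `CDB` bytes. A WRITE(10) command block carries the opcode `0x2a`, a 32-bit big-endian logical block address in bytes 2-5, and a 16-bit transfer length in bytes 7-8. The sketch below is an illustrative addition (the function name `decode_write10` is hypothetical) that decodes the two CDBs logged above.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of blocks to transfer.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: start address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


print(decode_write10("2a 00 2e c4 7f e5 00 00 08 00"))  # (784629733, 8)
print(decode_write10("2a 00 57 42 19 b8 00 00 38 00"))  # (1463949752, 56)
```

The decoded addresses match the sectors reported in the I/O errors, which is consistent with ordinary queued writes being rejected once the device had been disabled.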
Comment 1 Holger Brandsmeier 2012-04-30 18:58:45 UTC
Created attachment 73128
dmesg.out: 3.2.12
Comment 2 Holger Brandsmeier 2012-04-30 18:59:47 UTC
Created attachment 73129
lspci output
Comment 3 Holger Brandsmeier 2012-04-30 19:00:20 UTC
Created attachment 73130
hdparm_sdb.out
Comment 4 Holger Brandsmeier 2012-04-30 19:00:52 UTC
Created attachment 73131
hdparm_sda.out
Comment 5 Holger Brandsmeier 2012-04-30 19:03:29 UTC
This is the configuration of the laptop. It is a configurable laptop from Schenker (www.mysn.de):

XMG P711 PRO Gaming Notebook 43.9 cm (17.3")
. 43.9 cm (17.3") Full-HD (1920*1080) non-glare
. NVIDIA GeForce GTX 570M 3072MB GDDR5
. Intel Core i7-2760QM - 2.40 - 3.50 GHz 6MB
. 16GB (4x4096) SO-DIMM DDR3 RAM 1333MHz (only with quad-core CPU)
. 1000GB SATA-II 5400 rpm Samsung Spinpoint M8 (HN-M101MBB)
. 160GB SATA-II SSD Intel 320 Series (SSDSA2CW160G3)
. Blu-ray Combo (Blu-ray read / DVD multi-format)
Comment 6 Holger Brandsmeier 2012-04-30 19:15:11 UTC
Created attachment 73132
smart.out

Output from SMART. Note the very large value for Program_Fail_Cnt_Total, although many people on the internet have reported large values for this HDD.
Comment 7 Alan 2012-05-12 00:39:41 UTC
Your hard drive went offline.

It could be a drive problem (the important data after it happens is the list of last failed commands in the SMART log).

What Linux saw is

- Asked drive to flush the cache
- Drive stayed busy
- Drive stayed busy
- Linux got bored
- Reset the connection
- Drive didn't respond
- Reset again
- Linux got back garbage

at which point it couldn't do much else as the drive refused to come back.

Marginal power is another possible cause of such things.
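
The sequence described above can be traced in the timestamps of the reporter's log. The following Python sketch is an illustrative addition, not something posted in the thread. The helper names `parse_events` and `seconds_until_disabled` are hypothetical, and the sample lines are copied from the 3.2.12 excerpt in the description (the part elided there with `[...]` is simply absent).

```python
import re

# "[timestamp] ataN[.MM]: message" as printed by the kernel for libata events.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(ata\d+(?:\.\d+)?): (.*)$")

SAMPLE = [
    "[84012.938152] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "[84012.938160] ata2.00: failed command: FLUSH CACHE EXT",
    "[84012.938185] ata2: hard resetting link",
    "[84018.284271] ata2: link is slow to respond, please be patient (ready=0)",
    "[84022.970300] ata2: COMRESET failed (errno=-16)",
    "[84022.970303] ata2: hard resetting link",
    "[84073.027654] ata2.00: disabled",
]


def parse_events(lines):
    """Return (timestamp, port, message) tuples for libata log lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            events.append((float(m.group(1)), m.group(2), m.group(3)))
    return events


def seconds_until_disabled(events):
    """Time from the first exception to the device being disabled."""
    start = next(t for t, _, msg in events if msg.startswith("exception"))
    end = next(t for t, _, msg in events if msg == "disabled")
    return end - start


events = parse_events(SAMPLE)
resets = sum(1 for _, _, msg in events if msg == "hard resetting link")
print(f"{resets} hard resets visible in the excerpt")
print(f"{seconds_until_disabled(events):.1f} s from first exception to 'disabled'")
```

On this excerpt the kernel spends roughly a minute retrying link resets before it gives up and disables the device, which matches the "drive refused to come back" description.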
Comment 8 Holger Brandsmeier 2012-05-12 20:27:35 UTC
By "the last failed commands in the smart log", do you mean this:

[84073.027657] ata2.00: device reported invalid CHS sector 0

Do you think this could be due to the way the Linux driver is programmed, or is this certainly a hardware issue? I still have warranty on the laptop, so it looks like I will have to sort something out with the manufacturer. A Linux error that happens on average once a week is not very easy for them to reproduce ...
Comment 9 Marcus Brinkmann 2012-12-06 20:19:55 UTC
Hi,

I am seeing pretty much the same thing on the Intel SSD 520 Series (180 GB) that came with my ThinkPad T430s. Here are some facts:

* Kernel version was 3.6.7-5.fc18.x86_64 on Fedora 18 Beta.  (will test 3.6.9 next)

* Hard drive is:

Device Model:     INTEL SSDSC2BW180A3L
Firmware Version: LE1i
User Capacity:    180,045,766,656 bytes [180 GB]

* I believe the laptop was plugged into the wall socket.

* The error is reproducible under certain heavy loads when running a test suite that uses a database server (firebird). If I run the test suite immediately after rebooting, at some point the system becomes unresponsive and then the test suite (or rather the Python interpreter running it) crashes with a segmentation fault. The system survives this. If the system has some uptime, the result is the more severe hard drive disconnect described in this bug report, resulting in a completely unusable system.

* It does not occur at boot time.

* It seems that other ThinkPad T430s users are also affected: http://forums.lenovo.com/t5/T400-T500-and-newer-T-series/T430s-Intel-SSD-520-180GB-issue/td-p/888083

* ahci:

[    0.701463] ahci 0000:00:1f.2: version 3.0
[    0.701536] ahci 0000:00:1f.2: irq 42 for MSI/MSI-X
[    0.701560] ahci: SSS flag set, parallel bus scan disabled
[    0.701603] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x13 impl SATA mode
[    0.701605] ahci 0000:00:1f.2: flags: 64bit ncq ilck stag pm led clo pio slum part ems sxs apst 
[    0.701609] ahci 0000:00:1f.2: setting latency timer to 64

* $ lspci
00:00.0 Host bridge: Intel Corporation 3rd Gen Core processor DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 2 (rev c4)
00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 3 (rev c4)
00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation QM77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6205 [Taylor Peak] (rev 34)
04:00.0 System peripheral: Ricoh Co Ltd MMC/SD Host Controller (rev 07)

Kernel log at crash:

[20667.023716] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[20667.023722] ata1.00: failed command: FLUSH CACHE EXT
[20667.023728] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[20667.023730] ata1.00: status: { DRDY }
[20667.023735] ata1: hard resetting link
[20672.365987] ata1: link is slow to respond, please be patient (ready=0)
[20677.045934] ata1: COMRESET failed (errno=-16)
[20677.045941] ata1: hard resetting link
[20682.388174] ata1: link is slow to respond, please be patient (ready=0)
[20687.068242] ata1: COMRESET failed (errno=-16)
[20687.068248] ata1: hard resetting link
[20692.411563] ata1: link is slow to respond, please be patient (ready=0)
[20722.017588] ata1: COMRESET failed (errno=-16)
[20722.017592] ata1: limiting SATA link speed to 3.0 Gbps
[20722.017594] ata1: hard resetting link
[20727.054626] ata1: COMRESET failed (errno=-16)
[20727.054631] ata1: reset failed, giving up
[20727.054633] ata1.00: disabled
[20727.054644] ata1.00: device reported invalid CHS sector 0
[20727.056536] ata1: EH complete
[20727.056576] sd 0:0:0:0: [sda] Unhandled error code
[20727.056577] sd 0:0:0:0: [sda]
[20727.056578] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[20727.056579] sd 0:0:0:0: [sda] CDB:
[20727.056580] Write(10): 2a 00 09 0e 1b f8 00 00 18 00
[20727.056584] end_request: I/O error, dev sda, sector 151919608
[20727.056590] Buffer I/O error on device dm-3, logical block 3706239
[20727.056595] EXT4-fs warning (device dm-3): ext4_end_bio:319: I/O error writing to inode 917984 (offset 8192 size 4096 starting block 3706239)
[20727.056598] Buffer I/O error on device dm-3, logical block 3706240
[20727.056600] EXT4-fs warning (device dm-3): ext4_end_bio:319: I/O error writing to inode 918961 (offset 4096 size 4096 starting block 3706240)
[20727.056603] Buffer I/O error on device dm-3, logical block 3706241
[20727.056604] EXT4-fs warning (device dm-3): ext4_end_bio:319: I/O error writing to inode 918965 (offset 4096 size 4096 starting block 3706241)

This is followed by more "sd 0:0:0:0: [sda] Unhandled error code" messages.

Please let me know if I can help debugging this.
Comment 10 Marcus Brinkmann 2012-12-06 20:26:47 UTC
Sorry, I forgot to add: SMART diagnostics are clean.
Comment 11 Marcus Brinkmann 2012-12-06 21:55:00 UTC
I confirmed that my problem persists with:

Linux localhost.localdomain 3.6.9-4.fc18.x86_64 #1 SMP Tue Dec 4 14:12:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
 
Gnu C                  4.7.2
Gnu make               3.82
binutils               2.23.51.0.1
util-linux             2.22.1
mount                  debug
module-init-tools      10
e2fsprogs              1.42.5
xfsprogs               3.1.8
pcmciautils            018
quota-tools            4.00-pre1.
PPP                    2.4.5
Linux C Library        2.16
Dynamic linker (ldd)   2.16
Procps                 3.3.3-20120807git
Kbd                    1.15.3wip
oprofile               0.9.8
Sh-utils               8.17
wireless-tools         29
Modules Loaded         fuse ebtable_nat ebtables nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack rfcomm bnep be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support arc4 iwldvm mac80211 coretemp microcode uvcvideo videobuf2_vmalloc joydev videobuf2_memops videobuf2_core videodev snd_hda_intel media snd_hda_codec i2c_i801 snd_hwdep snd_seq snd_seq_device btusb bluetooth snd_pcm iwlwifi lpc_ich cfg80211 mfd_core cdc_ncm usbnet mii cdc_wdm cdc_acm snd_page_alloc snd_timer thinkpad_acpi snd soundcore vhost_net mei e1000e tun macvtap macvlan kvm_intel kvm rfkill tpm_tis uinput tpm tpm_bios xts gf128mul dm_crypt crc32c_intel ghash_clmulni_intel i915 sdhci_pci sdhci mmc_core i2c_algo_bit drm_kms_helper drm i2c_core wmi video
