Bug 9465 - Exception and possible crash with SATAII 150 TX4
Summary: Exception and possible crash with SATAII 150 TX4
Status: CLOSED UNREPRODUCIBLE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 high
Assignee: Jeff Garzik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-11-28 02:41 UTC by Arno Wagner
Modified: 2012-05-17 15:14 UTC

See Also:
Kernel Version: 2.6.22.14
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Promise 2nd gen ASIC PRD/SG bug fix for 2.6.23 (3.44 KB, patch)
2007-11-28 05:59 UTC, Mikael Pettersson

Description Arno Wagner 2007-11-28 02:41:30 UTC
Most recent kernel where this bug did not occur: 2.6.18.8 (no time to test these in between and a test could take a while for each)

Distribution: etch

Hardware Environment: See Bug 9264. Additional SATAII 150 TX4 with two disks, 1) SAMSUNG HM160HI 2) SAMSUNG HM160JI

Software Environment: 

Problem Description: I get the following error message in /var/log/kernel repeatedly, seemingly regular but also at some irregular times:

Nov 28 04:21:30 gate kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
Nov 28 04:21:31 gate kernel: ata1.00: (port_status 0x20080000)
Nov 28 04:21:31 gate kernel: ata1.00: cmd c8/00:40:86:e1:b7/00:00:00:00:00/e4 tag 0 cdb 0x0 data 32768 in
Nov 28 04:21:31 gate kernel:          res 50/00:00:c5:e1:b7/00:00:00:00:00/e4 Emask 0x2 (HSM violation)
Nov 28 04:21:31 gate kernel: ata1: soft resetting port
Nov 28 04:21:31 gate kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Nov 28 04:21:31 gate kernel: ata1.00: configured for UDMA/133
Nov 28 04:21:31 gate kernel: ata1: EH complete
Nov 28 04:21:31 gate kernel: sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
Nov 28 04:21:31 gate kernel: sd 1:0:0:0: [sdb] Write Protect is off
Nov 28 04:21:31 gate kernel: sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
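For readers unfamiliar with the libata taskfile dump, the following is a minimal sketch of how the "cmd" and "res" lines above can be decoded. It assumes the usual libata field order (command or status, then feature or error, sector count, and the three low LBA bytes, then the five high-order bytes, then the device register) and LBA28 addressing. The function name decode_taskfile and the dictionary keys are made up for illustration.

```python
import re

# Layout assumed: first/feature:nsect:lbal:lbam:lbah/hob bytes (x5)/device
TF_RE = re.compile(
    r"(cmd|res) ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/"
    r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)


def decode_taskfile(line):
    """Decode one libata 'cmd' or 'res' line; return None if it does not match."""
    m = TF_RE.search(line)
    if m is None:
        return None
    kind, first, low, _hob, device = m.groups()
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    dev = int(device, 16)
    # LBA28: bits 27:24 live in the low nibble of the device register.
    lba = ((dev & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {
        "kind": kind,
        "first": int(first, 16),   # command opcode for cmd, status for res
        "feature_or_error": feature,
        "nsect": nsect,
        "lba28": lba,
    }


cmd = decode_taskfile("ata1.00: cmd c8/00:40:86:e1:b7/00:00:00:00:00/e4 tag 0 cdb 0x0 data 32768 in")
res = decode_taskfile("         res 50/00:00:c5:e1:b7/00:00:00:00:00/e4 Emask 0x2 (HSM violation)")

# In a 28-bit command a sector count of 0 means 256 sectors.
sectors = cmd["nsect"] or 256
print(hex(cmd["first"]), hex(cmd["lba28"]), sectors * 512)
print(hex(res["first"]), res["lba28"] - cmd["lba28"])
```

Opcode 0xc8 is READ DMA, and 0x40 sectors of 512 bytes match the "data 32768" in the log. The result status 0x50 is DRDY plus DSC with no error bit set, and the result LBA lies 63 sectors past the start, i.e. at the last sector of the 64-sector request.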


I get a similar error for the other disk. One possible cause is smartctl, which I run regularly on both disks. If so, about half of these errors may result from it, but only about a quarter of the calls to smartctl seem to cause errors. Some calls that I made manually did not cause errors at all, so some other machine state may influence the problem. The other half of the error instances are not correlated with a call to smartctl. In addition, the machine crashed hard at the time of one of the smartctl calls, which may or may not be a coincidence. The last log entry anywhere (and I get a lot from netfilter drops, more than one per minute) was at 5:01, and the only thing running at that time is a cron job that queries both disks via smartctl for changes in the reallocated sector count.

I have now disabled all SMART queries to see whether this error still shows up.

I had no issues at all with this hardware and kernel 2.6.18.8 and no such errors in the logs. Since this is my primary router and firewall, I am likely to downgrade back to 2.6.18 again, which was rock solid. 


Steps to reproduce: I am not sure. Running smartctl regularly via cron on a SATA disk may be enough to get this sometimes. Or not.
Comment 1 Arno Wagner 2007-11-28 02:47:50 UTC
Before I forget, smartd is not running on the system, so the error instances not correlated to a cron-job starttime are not from that source.
Comment 2 Andrew Morton 2007-11-28 02:51:00 UTC
pretty old kernel.  Can you find if it happens on 2.6.23 or
2.6.24-rc3?

Thanks.
Comment 3 Arno Wagner 2007-11-28 03:05:33 UTC
I am compiling 2.6.23.9 now and will report back.
Comment 4 Arno Wagner 2007-11-28 04:04:23 UTC
Ok, same thing with 2.6.23.9 after about 20 minutes uptime:

Nov 28 12:58:14 gate kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
Nov 28 12:58:14 gate kernel: ata2.00: port_status 0x20080000
Nov 28 12:58:14 gate kernel: ata2.00: cmd c8/00:10:c6:41:26/00:00:00:00:00/e6 tag 0 cdb 0x0 data 8192 in
Nov 28 12:58:14 gate kernel:          res 50/00:00:d5:41:26/00:00:00:00:00/e6 Emask 0x2 (HSM violation)
Nov 28 12:58:14 gate kernel: ata2: soft resetting port
Nov 28 12:58:14 gate kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Nov 28 12:58:14 gate kernel: ata2.00: configured for UDMA/100
Nov 28 12:58:14 gate kernel: ata2: EH complete
Nov 28 12:58:14 gate kernel: sd 2:0:0:0: [sdc] 312581808 512-byte hardware sectors (160042 MB)
Nov 28 12:58:14 gate kernel: sd 2:0:0:0: [sdc] Write Protect is off
Nov 28 12:58:14 gate kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
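As a reading aid for the "SATA link up" line, here is a small sketch that splits the SStatus value into its three 4-bit fields as defined by the SATA specification (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8). The kernel prints the value in hexadecimal without a prefix. The function name and the short descriptions are chosen for illustration only.

```python
DET = {0: "no device", 1: "device present, no phy communication",
       3: "device present, phy communication established", 4: "phy offline"}
SPD = {0: "no speed negotiated", 1: "Gen1 (1.5 Gbps)", 2: "Gen2 (3.0 Gbps)"}
IPM = {0: "no device", 1: "active", 2: "partial", 6: "slumber"}


def decode_sstatus(hex_text):
    """Split an SStatus value, given as printed in the log, into DET/SPD/IPM."""
    value = int(hex_text, 16)
    det = value & 0xF
    spd = (value >> 4) & 0xF
    ipm = (value >> 8) & 0xF
    return {
        "det": DET.get(det, "reserved"),
        "spd": SPD.get(spd, "reserved"),
        "ipm": IPM.get(ipm, "reserved"),
    }


status = decode_sstatus("113")
print(status)
```

For "SStatus 113" this gives an established link at Gen1 speed in the active power state, which agrees with the "link up 1.5 Gbps" text. "SControl 300" uses the same field layout; its IPM field value of 3 disables transitions to both partial and slumber.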


This one was most likely triggered by a query with smartctl, run in a cron job at 12:58. (I have re-enabled these.) I will disable the regular calls to smartctl again, make sure smartd is not running, and see whether these still show up.
Comment 5 Tejun Heo 2007-11-28 04:33:00 UTC
Mikael, any ideas?
Comment 6 Mikael Pettersson 2007-11-28 04:37:22 UTC
May be related to 2nd gen ASIC PRD/SG bug, which got fixed in 2.6.24-rc2. I can
provide a backported patch for 2.6.23 or 2.6.22 if you're willing to test it.
Comment 7 Arno Wagner 2007-11-28 05:34:05 UTC
I am a bit unwilling to put a -rc kernel on my server, since it also doubles as fileserver. If you provide a backported patch for 2.6.23.9, I will test it.
Comment 8 Mikael Pettersson 2007-11-28 05:59:54 UTC
Created attachment 13778 [details]
Promise 2nd gen ASIC PRD/SG bug fix for 2.6.23

Added Promise 2nd gen ASIC PRD/SG bug fix patch, backported from 2.6.24-rc2 to 2.6.23. (It applies cleanly to 2.6.23.9.)
Comment 9 Arno Wagner 2007-11-29 00:40:45 UTC
Ok, the problem is still present, but I have had only one instance in 7:40 uptime and that one was not caused by a call to smartctl (which I have on again, with calls every 5 minutes). It is possible that the patch did fix something relevant, and that either there are several triggers for the problem and the former main one is now gone, or the problem has become far less likely to manifest itself. However, that is speculation. For reference, the one instance I have in the logs:

Nov 29 02:31:49 gate kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
Nov 29 02:31:49 gate kernel: ata1.00: port_status 0x20080000
Nov 29 02:31:49 gate kernel: ata1.00: cmd c8/00:80:7e:c7:60/00:00:00:00:00/e8 tag 0 cdb 0x0 data 65536 in
Nov 29 02:31:49 gate kernel:          res 50/00:00:fd:c7:60/00:00:00:00:00/e8 Emask 0x2 (HSM violation)
Nov 29 02:31:49 gate kernel: ata1: soft resetting port
Nov 29 02:31:49 gate kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Nov 29 02:31:49 gate kernel: ata1.00: configured for UDMA/133
Nov 29 02:31:49 gate kernel: ata1: EH complete
Nov 29 02:31:49 gate kernel: sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
Nov 29 02:31:49 gate kernel: sd 1:0:0:0: [sdb] Write Protect is off
Nov 29 02:31:49 gate kernel: sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
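One way to check such a log for a correlation with the periodic smartctl calls is sketched below. It rests on assumptions not stated in this report: that the cron job fires at minutes divisible by five, that an exception caused by it would appear within 30 seconds of the tick, and that the log year is 2007. All function names are hypothetical, and the month parsing assumes an English locale.

```python
from datetime import datetime


def exception_times(log_lines, year=2007):
    """Return timestamps of all libata exception lines in a syslog excerpt."""
    times = []
    for line in log_lines:
        if "exception Emask" in line:
            stamp = " ".join(line.split()[:3])  # e.g. "Nov 29 02:31:49"
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def seconds_since_tick(t, period_minutes=5):
    """Seconds elapsed since the last assumed cron tick (minute % period == 0)."""
    return (t.minute % period_minutes) * 60 + t.second


def near_cron_tick(t, window_seconds=30, period_minutes=5):
    """True if the timestamp falls within the assumed window after a tick."""
    return seconds_since_tick(t, period_minutes) <= window_seconds


sample = [
    "Nov 29 02:31:49 gate kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2",
    "Nov 29 02:31:49 gate kernel: ata1: soft resetting port",
]
for t in exception_times(sample):
    print(t, seconds_since_tick(t), near_cron_tick(t))
```

Under these assumptions the 02:31:49 exception falls 109 seconds after the previous tick, so it would not be attributed to the cron job. The real schedule and a sensible window would have to be taken from the actual crontab.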
Comment 10 Arno Wagner 2007-12-03 19:28:51 UTC
I have had one more, that makes two in 5 days uptime. I will continue to run with the patch. The issue is still a concern. If it is the same problem, but far less frequent, it could be weeks before I experience a crash.
Comment 11 Arno Wagner 2007-12-17 17:55:09 UTC
By now at 19 days uptime, no crashes.

I just read on Heise online that some NCQ stuff has been removed from 2.6.23.11. Is this related to this bug?
Comment 12 Mikael Pettersson 2007-12-18 04:35:21 UTC
(In reply to comment #11)
> By now at 19 days uptime, no crashes.
> 
> I just read on Heise online that some NCQ stuff has been removed from
> 2.6.23.11. Is this related to this bug?
> 

Unlikely. sata_promise doesn't (yet) support NCQ.
Comment 13 NPetr 2008-01-25 18:04:09 UTC
Hello Arno, could you physically remove all Promise controllers from your PC and test with SiI controllers only, to check whether my hypothesis is right? http://bugzilla.kernel.org/show_bug.cgi?id=9474
Comment 14 NPetr 2008-01-25 18:10:10 UTC
Arno, what motherboard chipset do you have? Please specify the exact model of your motherboard. And one more test: could you test the Promise controller with Western Digital disks (physically removing all Samsung disks from the PC) to see whether the problem still exists (potential incompatibility between Promise and Samsung)? I cannot test it myself, as I do not have WDC disks.
Comment 16 Arno Wagner 2008-01-26 00:57:05 UTC
Hi NPetr. Sorry, I do not have any SiI controllers. I also do not have any WD disks, since I do not trust them. 

Mainboard is a quite old Epox board (from the broken capacitor era), model EP-8KTA+ with an 800MHz Athlon. Lspci output below. Side note: No more crashes since end of November with the manually patched 2.6.23.9 kernel.

00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133] (rev 02)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 22)
00:07.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 10)
00:07.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 10)
00:07.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 10)
00:07.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 30)
00:09.0 Ethernet controller: D-Link System Inc DL2000-based Gigabit Ethernet (rev 0c)
00:0a.0 USB Controller: NEC Corporation USB (rev 43)
00:0a.1 USB Controller: NEC Corporation USB (rev 43)
00:0a.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
00:0b.0 Mass storage controller: Promise Technology, Inc. PDC20518/PDC40518 (SATAII 150 TX4) (rev 02)
00:0c.0 SCSI storage controller: Adaptec AHA-7850 (rev 03)
00:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
01:00.0 VGA compatible controller: nVidia Corporation NV11 [GeForce2 MX/MX 400] (rev a1)
Comment 17 Mikael Pettersson 2008-01-26 13:31:31 UTC
(In reply to comment #16)
> Hi NPetr. Sorry, I do not have any SiI controllers. I also do not have any WD
> disks, since I do not trust them. 
> 
> Mainboard is a quite old Epox board (from the broken capacitor era), model
> EP-8KTA+ with an 800MHz Athlon. Lspci output below. Side note: No more crashes
> since end of November with the manually patched 2.6.23.9 kernel.

Thank you for that information. It's promising (no pun intended).
