Most recent kernel where this bug did not occur: kernels newer than 2.6.20 are broken; possibly kernels older than 2.6.20 as well.
Distribution: Debian Lenny current + Sid mdadm (with reshape patch) and smartmontools
Hardware Environment:
RAID bus controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02)
Mass storage controller: Promise Technology, Inc. PDC40775 (SATA 300 TX2plus) (rev 02)
RAID bus controller: Promise Technology, Inc. 20771 (FastTrak TX2300) (rev 02)
5x Samsung HD501LJ
Software Environment: Debian Lenny current + Sid mdadm (with reshape patch) and Sid smartmontools
Problem Description:
With the original 2.6.22-2-486 kernel from Debian Lenny, "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" occurs very often.
With the original 2.6.23.9-slh-smp-1 kernel from Sidux Eros, "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" occurs very often.
With the official 2.6.24-rc3-mm2 kernel from www.kernel.org, "COMRESET failed (errno=-16)" appears very often.
Steps to reproduce: just start a copy from one disk to another, from one disk to the RAID array, or the reverse. I include report files in the attachments.
This topic is very hot (a few links out of many):
http://lkml.org/lkml/2007/5/10/63
http://lkml.org/lkml/2007/10/10/32
http://article.gmane.org/gmane.linux.ide/23896
https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/84603
https://bugs.launchpad.net/ubuntu/+bug/104581
http://www.opensubscriber.com/message/linux-ide@vger.kernel.org/7961713.html
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09306.html
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09636.html
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09898.html
I also restarted the computer during a RAID-5 "3->4" disk reshape (http://scotgate.org/?p=107). The reshape died and I have never managed to continue it :-/.
Created attachment 13783 [details] /var/log/dmesg
Created attachment 13784 [details] /var/log/dmesg.0
Created attachment 13785 [details] dmesg
Created attachment 13786 [details] /var/log/dmesg.1.gz
Created attachment 13787 [details] /var/log/dmesg.2.gz
Created attachment 13788 [details] /var/log/dmesg.1.g3
Created attachment 13789 [details] /var/log/dmesg.4.gz
Created attachment 13790 [details] fdisk -l
Created attachment 13791 [details] lspci -vvv
Created attachment 13792 [details] mdadm --detail --scan /dev/md0
Created attachment 13793 [details] reshape patch form Neil Brown for mdadm 2.6.23
Created attachment 13794 [details] smartctl - t long "all_drives", smartctl - l selftest "all_drives", smartctl - a "all_drives"
Created attachment 13795 [details] /var/log/syslog
Created attachment 13796 [details] /var/log/syslog.0
Created attachment 13797 [details] /var/log/syslog.1.gz
Created attachment 13798 [details] uname -a
I have run the short and long tests for all disks with the Seagate utility: http://www.seagate.com/support/seatools/SeaToolsDOS110.iso No errors were found.
Mikael, both ata4 and ata6 are sata_promise SATA ports. In syslog.1, ata6 spewed an HSM violation while ata4 suffered a timeout and went down horribly. The two events are close together although they're on different controllers.

NPetr, can you please remove all non-essential drives from your machine, connect just one to the sata_promise controller, and see whether the problem goes away?

Nov 28 05:10:51 storage kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
Nov 28 05:10:51 storage kernel: ata6.00: (port_status 0x20080000)
Nov 28 05:10:51 storage kernel: ata6.00: cmd 25/00:80:07:5c:32/00:00:2c:00:00/e0 tag 0 cdb 0x0 data 65536 in
Nov 28 05:10:51 storage kernel: res 50/00:00:06:5c:32/00:00:2c:00:00/e0 Emask 0x2 (HSM violation)
Nov 28 05:10:51 storage kernel: ata6: soft resetting port
Nov 28 05:10:52 storage kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 28 05:10:52 storage kernel: ata6.00: configured for UDMA/133
Nov 28 05:10:52 storage kernel: ata6: EH complete
Nov 28 05:11:37 storage kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Nov 28 05:12:47 storage kernel: ata4.00: cmd 25/00:78:bf:0d:d4/00:00:21:00:00/e0 tag 0 cdb 0x0 data 61440 in
Nov 28 05:12:47 storage kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 28 05:12:47 storage kernel: ata4: port is slow to respond, please be patient (Status 0xff)
Nov 28 05:12:47 storage kernel: ata4: device not ready (errno=-16), forcing hardreset
Nov 28 05:12:47 storage kernel: ata4: hard resetting port
Nov 28 05:12:47 storage kernel: ata4: port is slow to respond, please be patient (Status 0xff)
Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)
Nov 28 05:12:47 storage kernel: ata4: hard resetting port
Nov 28 05:12:47 storage kernel: ata4: port is slow to respond, please be patient (Status 0xff)
Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)
Nov 28 05:12:47 storage kernel: ata4: hard resetting port
Nov 28 05:12:47 storage kernel: ata4: port is slow to respond, please be patient (Status 0xff)
Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)
Nov 28 05:12:47 storage kernel: ata4: limiting SATA link speed to 1.5 Gbps
Nov 28 05:12:47 storage kernel: ata4: hard resetting port
Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)
Nov 28 05:12:47 storage kernel: ata4: reset failed, giving up
Hmm... This looks very similar to http://bugzilla.kernel.org/show_bug.cgi?id=9465 And, NPetr, the links you posted in the initial report all point to different symptoms on various controllers.
Hello Tejun, do you think this problem is controller dependent? My problem occurs on all the controllers that I have. During #9465 I tried the 2.6.18-5-k7 kernel from Debian Etch 4.0. The problem is the same, but with a different kernel message:
ata4: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata4: status=0x40 { DriveReady }
Fixed message:
ata3: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata3: command timeout
There are seven dmesg dumps, all of the same 2.6.24-rc3-mm2 kernel, but some things change. None of the dmesg dumps indicate problems with sata_promise, but two of them indicate problems with sata_sil: a "port is slow to respond" and a 10B8B SError. The second of these occurred immediately after a manual hotplug which added a disk on a sata_promise port.

syslog.0 contains kernel logs from at least fifteen boots with varying kernel versions:
- a 2.6.23 boot showed a common sata_promise error: an HSM violation with port_status 0x20080000; it was followed by timeouts and COMRESET failures (not a common pattern)
- a 2.6.24-rc3-mm2 boot showed a timeout/"port is slow to respond" error on sata_sil
- another 2.6.24-rc3-mm2 boot showed a timeout failure on a sata_promise port followed by COMRESET failures, and occasional SError transmission errors; a port on the other Promise controller showed a hot-unplug/plug sequence, which may or may not have been a hardware hiccup
- the last 2.6.24-rc3-mm2 boot showed a manual hotplug on a Promise port (not an error), followed by a 10B8B SError on a sata_sil port

All errors shown in these logs except one are commonly caused by an overloaded PSU. The one exception is the HSM violation in the 2.6.23 boot, which was probably caused by the ASIC PRD bug fixed in 2.6.24-rc2. Especially telling is the fact that hot-adding a disk on a sata_promise port causes a hiccup on a disk on a sata_sil port.
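For triaging dumps like these, a short script can tally libata error-handler events per port. This is only an illustrative sketch; the regex covers just the two message shapes quoted in this bug, and the sample lines below are paraphrased from the excerpts above, not taken from any attachment:

```python
import re
from collections import Counter

# Match libata EH events of the kinds seen in these logs, e.g.
#   "kernel: ata4.00: exception Emask 0x0 ... action 0x2 frozen"
#   "kernel: ata4: COMRESET failed (errno=-16)"
EVENT_RE = re.compile(r'(ata\d+)(?:\.\d+)?: (?:exception .*frozen|COMRESET failed)')

def count_eh_events(lines):
    """Count frozen-exception and COMRESET-failure events per ATA port."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "Nov 28 05:10:51 storage kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen",
    "Nov 28 05:11:37 storage kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen",
    "Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)",
]
print(count_eh_events(sample))  # ata4 appears twice, ata6 once
```

Running this over each syslog makes it easy to see whether the errors cluster on one port or wander across controllers.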
Hello Mikael,
- I removed the sata_sil controller from the PC. The two Promise controllers stayed in the PC.
- I replaced the 300W PSU with another 350W one that I found.

Tejun and Mikael, the problems are still here (2.6.17, 2.6.18, 2.6.22, 2.6.23, 2.6.24-rc3 are what I tried; kernels older than 2.6.17 are not sufficient for me because they do not support reshaping of a RAID array). I will attach a big tarball with /var/log messages, dmesgs and syslogs when I arrive home. Are any other files important for you?

Steps to reproduce
==================
Configuration:
Promise 1 (2x 500 GB SATA Samsung): /dev/sdd /dev/sde
Promise 2 (2x 500 GB SATA Samsung): /dev/sdf /dev/sdg
System on MB: /dev/md1 (/dev/hda + /dev/hdb ... 2x 3.2 GB PATA Seagate)
Swap on MB: /dev/hdc (1.7 GB PATA Seagate)
Procedure:
1) dd if=/dev/sdg of=/dev/sdf
2) wait approximately five minutes
3) mount /dev/sdd1 /mnt
4) tree /mnt
5) "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" and "COMRESET failed" appear
*) if the exception does not appear, run tree /mnt more than once

Note: I guess that maximum DMA throughput on one SATA controller (or one port in the worst case) exhausts something in the Linux kernel; when traffic starts concurrently on another controller (or on another port in the worst case), the exception appears. Do you think booting the kernel with the ide=nodma option may help? I could try it, but slow PIO traffic takes a horribly long time.
Fix: dd if=/dev/sdg of=/dev/sdf bs=16384
Created attachment 13838 [details] /var/log messages, dmesgs, syslogs, kerns, debugs OK, here is the actual report.
Do NetBSD or FreeBSD use the same drivers as Linux? I could try the behaviour on a Unix system. Thank you for further support, NPetr
The BSD stuff is unrelated (beyond coming from similar docs and the fact that the authors cross-check details against other open source code), so it should be fairly close to a test with unrelated drivers.
Hello Alex, happy to see you. I found that OpenSolaris has 3114 and 3124 drivers too. Maybe the CDDL license is closer to the GPL than BSD. I cannot believe that Linux has such a buggy driver. I am trying to resurrect my data now. Is it possible to fix the "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" and "COMRESET failed (errno=-16)" bugs in the Linux driver according to the OpenSolaris or BSD ones? I prefer Linux; it has had the RAID-5 reshape ability since kernel 2.6.17. No other OS has it (RaidFrame, VxVM, ZFS [maybe in the future - http://www.opensolaris.org/jive/thread.jspa?messageID=144085]). P.S. Alex, why did your *-ac kernels disappear?
I made a terrible mistake with your name. Very, very sorry, Alan.
The Linux driver should be rock solid, and your report is very, very strange indeed, as well as apparently unique.
I do not understand. I googled a lot about "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" and "COMRESET failed (errno=-16)" on Promise or SiI controllers. I found a lot of material, which is why I added you to the CC of this bug. How can I make the report and steps to reproduce clearer, and what else could I do? I will try the kernel parameter ide=nodma and hope that it limits SATA to PIO modes as well.
Summary of what I did:
- replaced the PSU
- removed the SiI 3112 controller
- replaced the MB (Via KT3 Ultra -> nForce2 Ultra400)
- replaced the DDRs
- disabled the SiI 3112 on the MB in the BIOS
- bought other HD501LJ disks
Nothing has helped :(.
ata5: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xa frozen
ata5: hotplug_status 0x10
ata5: SError: { PHYRdyChg DevExch }
ata5: port is slow to respond, please be patient (Status 0xff)

This sounds very unlike a driver problem; the controller reports a change in the state of the SATA PHY, which suggests it thinks the drive has disconnected itself. Perhaps some kind of incompatibility between the controller and these drives? Have you tried a different model/brand of drive?
Hello Robert, I have only Samsung hard disks. But I tried them on the SiI 3112, TX2plus and TX2300. Do you think the HD501LJ is buggy with the Linux SATA interface?
I have an old machine (single Pentium 4, Intel motherboard, Promise PATA cards) which worked reliably with kernel 2.6.22.14-72.fc6 (when disks were still called /dev/hd*) but started to encounter this bug frequently (e.g. every 5 seconds) after upgrading to 2.6.23.9-85.fc8. The IDE interfaces are:
00:1f.1 IDE interface: Intel Corporation 82801BA IDE U100 Controller (rev 04)
02:0b.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
02:0c.0 Mass storage controller: Promise Technology, Inc. PDC20267 (FastTrak100/Ultra100) (rev 02)
Note that the 82801BA chip doesn't get much work, and all of the bugs are seen on the two Promise cards. Again, this bug is new; it wasn't present in 2.6.22 kernels (before the hd/sd drivers merged).
(In reply to comment #34)
> I have an old machine (single Pentium 4, Intel motherboard, Promise PATA cards)
> which worked reliably with kernel 2.6.22.14-72.fc6 (when disks were still
> called /dev/hd*) but started to encounter this bug frequently (e.g. every 5
> seconds) after upgrading to 2.6.23.9-85.fc8 -- the IDE interfaces are:
>
> 00:1f.1 IDE interface: Intel Corporation 82801BA IDE U100 Controller (rev 04)
> 02:0b.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
> 02:0c.0 Mass storage controller: Promise Technology, Inc. PDC20267
> (FastTrak100/Ultra100) (rev 02)
>
> Note that the 82801BA chip doesn't get much work, and all of the bugs are seen
> on the two Promise cards. Again, this bug is new, it wasn't present in 2.6.22
> kernels (before hd/sd drivers merged).

You should post the dmesg output you are getting from these errors. It's quite possible that you have an entirely different problem.
(In reply to comment #35)
> You should post the dmesg output you are getting from these errors. It's quite
> possible that you have an entirely different problem.

I think this problem with the new ATA drivers is generic -- the behavior matches what was posted before, but perhaps you are right in asking for more detail. This kind of problem even happens on another machine (a year old) with SATA drives only: the signature "exception Emask 0x2 SAct [...] SErr 0x0 action 0x2 frozen" is seen in the logs, and while the SAct codes vary, the error is followed by the same dance leading to an ATA soft reset. BTW, these errors go to /var/log/messages and don't appear in dmesg output. This newer machine never reported any SATA problems before being upgraded to a 2.6.23-series kernel. I'd say that both the PATA and SATA aspects of the merged hd/sd driver have problems. The SATA chip in the newer machine is:
00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller (rev 02)
So, what I know is:
(1) Two different machines create this bug's signature in /var/log/messages
(2) Both used to work fine without bugs with a 2.6.22-type kernel, and broke after upgrading to a 2.6.23-type kernel
(3) The bug affects both PATA and SATA interfaces from Promise and Intel, attached to 6 Maxtor drives (PATA) or 6 WD drives (SATA)
(4) Kernel 2.6.23 introduced a merged hd/sd driver for PATA/SATA drives
...so the simplest hypothesis is that the HW is OK but the hd/sd driver in 2.6.23+ kernels broke something. Of course, "something" could actually be several broken things, and perhaps examining a range of possible failures can help, so I'll attach a fairly long list of system logs when I get a chance.
The "exception Emask 0x2 SAct [...] SErr 0x0 action 0x2 frozen" part doesn't mean much; all it indicates is that a command timed out and that error handling kicked in. We really need the entire error details.
Created attachment 14233 [details] Parallel ATA machine, OK with 2.6.22, not OK with 2.6.23. LONG!
Created attachment 14234 [details] Serial ATA machine, OK with 2.6.22, not OK with 2.6.23.
(In reply to comment #37)
> The "exception Emask 0x2 SAct [...] SErr 0x0 action 0x2 frozen" part doesn't
> mean much, all that indicates is that a command timed out and that error
> handling kicked in. We really need the entire error details..

Done -- see the attachments I just posted. BTW, I've got another old machine with an Intel motherboard which has *NOT* seen this bug since being upgraded to 2.6.23 -- but this machine doesn't do RAID while the other two (which exhibit problems) do. Since RAID imposes bursts of work on disk interfaces, perhaps this problem is timing related. I don't think SMP plays a role -- the PATA machine is uniprocessor, runs RAID-5 and has problems; the problem-free machine has a single CPU w/hyperthreading but no RAID; and the problematic SATA machine has a dual-core CPU and runs RAID (various levels). The problem happens under heavy load on RAID machines. The single-processor PATA machine is more loaded and sees more problems.
There are two different issues here:

The Serial ATA problems are "spurious" NCQ completions. These detections were actually invalid, so the detection code has been removed in 2.6.24 (and, I believe, the latest 2.6.23 stable kernels); that problem should now be fixed.

The PATA problem would appear to be the Promise PATA driver doing something that causes the controller's BMDMA engine to choke for some reason. I would suggest opening a new bug report since it seems unrelated to the original reporter's problem. Also, I don't think there is any relationship to the 2.6.22 to 2.6.23 change since Fedora only enabled libata for PATA disks in FC7. Quite likely the libata Promise driver had this problem previously as well.
(In reply to comment #38)
> Created an attachment (id=14233) [details]
> Parallel ATA machine, OK with 2.6.22, not OK with 2.6.23. LONG!

Note that your PATA machine has both a PDC20267 and a PDC20269. The older 20267 is driven by pata_pdc202xx_old while the 20269 is driven by pata_pdc2027x. There are known problems with the 20267/pata_pdc202xx_old combination (see other bugzilla entries), but the 20269/pata_pdc2027x combination should be solid. Unfortunately, your kernel log is incomplete, so we can't tell whether the errors occur on the 20267 or the 20269.
NPetr, multiple errors occurring across multiple controllers/drives strongly indicate a hardware problem. Also, you're seeing quite a few PHY events (phy going offline, transmission errors), which also point at hardware problems. As Mikael said before, the most common cause of this kind of problem is an overloaded/faulty PSU. Please perform some hardware debugging.
* Take some drives out and see if anything changes.
* If you have access to an extra PSU, use it to power the SATA drives (I'm not sure using a separate PSU is safe for PATA, but it definitely is safe for SATA). You can power up a PSU w/o a motherboard by... http://modtown.co.uk/mt/article2.php?id=psumod
This will distribute the 12V load across two PSUs. Please see whether the errors continue to happen and whether there's any pattern according to which PSU the drives are connected to.
I have what I believe is the same problem (except for the COMRESET, but that may come once the machine has run long enough; I will see). I have two Promise PATA cards in my computer:
05:06.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
05:08.0 Mass storage controller: Promise Technology, Inc. PDC20268 (Ultra100 TX2) (rev 02)
Both with all Seagate disks... I get this quite a lot:
[ 5994.518914] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[ 5994.524893] ata8.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
[ 5994.524896] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 5999.561120] ata8: soft resetting port
[ 5999.809328] ata8.00: configured for UDMA/100
[ 5999.821326] ata8.01: configured for UDMA/100
[ 5999.821354] ata8: EH complete
[ 5999.839737] sd 7:0:0:0: [sdk] 586072368 512-byte hardware sectors (300069 MB)
[ 5999.856294] sd 7:0:0:0: [sdk] Write Protect is off
[ 5999.856903] sd 7:0:0:0: [sdk] Mode Sense: 00 3a 00 00
[ 5999.888333] sd 7:0:0:0: [sdk] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 5999.888702] sd 7:0:1:0: [sdl] 586072368 512-byte hardware sectors (300069 MB)
[ 5999.889004] sd 7:0:1:0: [sdl] Write Protect is off
[ 5999.889007] sd 7:0:1:0: [sdl] Mode Sense: 00 3a 00 00
[ 5999.889095] sd 7:0:1:0: [sdl] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 5999.889113] sd 7:0:0:0: [sdk] 586072368 512-byte hardware sectors (300069 MB)
[ 5999.889123] sd 7:0:0:0: [sdk] Write Protect is off
[ 5999.889126] sd 7:0:0:0: [sdk] Mode Sense: 00 3a 00 00
[ 5999.889142] sd 7:0:0:0: [sdk] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 5999.889158] sd 7:0:1:0: [sdl] 586072368 512-byte hardware sectors (300069 MB)
[ 5999.889168] sd 7:0:1:0: [sdl] Write Protect is off
[ 5999.889170] sd 7:0:1:0: [sdl] Mode Sense: 00 3a 00 00
[ 5999.889186] sd 7:0:1:0: [sdl] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
This happens for all the hard disks on the PATA cards, on all the PATA cards (that is 8 drives total). I do NOT have the old "IDE" driver in my kernel at all, nor as a module; only libata, as of my upgrade to:
Linux ts.local.machines 2.6.23-hardened-r4 #6 PREEMPT Thu Jan 3 01:32:33 CET 2008 i686 AMD Sempron(tm) Processor 2800+ AuthenticAMD GNU/Linux
I have another version of the kernel compiled, the only difference being that the Promise cards run on the old driver instead of libata. In both cases sata_nv is also being used. If I run the one with the old driver, everything works: no error messages, no problems, and it isn't unstable. SMART doesn't complain for any of the hard disks (short tests are run daily, long tests weekly) on the old drivers, so it isn't a drive/cable problem. Will a full dmesg, lspci or anything else be of any use from me, or do you have enough? Adding myself to CC.
Sincerely
Tobias
Oh, I'm also running RAID on all drives; some are RAID-5, some RAID-1. Same problem on all hard disks.
I just noticed that it hit all the drives one by one, and none of the drives were actually being used (even iostat shows no data transfer from either the drives or the RAID). The only things that would be touching the drives at that time are hddtemp or SMART. But when I ran those manually, neither of them triggered the error. The important part is: I get the error with little to no data transfer taking place.
Tobias, does stopping smartd fix your problem?
Smartd isn't running. I have a "watch -n 900 sh temp-test" running, where temp-test contains hddtemp /dev/sd? (or rather, a for loop to do that). So it should trigger every 15 min if it were hddtemp, but it doesn't trigger when I run it manually. The SMART check is a cron script run nightly, so it isn't that. No daemons are touching the disks, except mdadm. Doing a:
grep "soft resetting" /var/log/everything/log-2008-01-03-23\:01\:59 | grep ata5
results in:
Jan 3 22:58:33 [kernel] [ 1201.277748] ata5: soft resetting port
Jan 3 22:58:41 [kernel] [ 1209.108766] ata5: soft resetting port
Jan 3 23:14:24 [kernel] [ 2151.702897] ata5: soft resetting port
Jan 3 23:14:32 [kernel] [ 2159.058493] ata5: soft resetting port
Jan 3 23:30:16 [kernel] [ 3102.451425] ata5: soft resetting port
Jan 3 23:30:24 [kernel] [ 3110.606082] ata5: soft resetting port
Jan 3 23:46:08 [kernel] [ 4052.923316] ata5: soft resetting port
Jan 3 23:46:15 [kernel] [ 4060.694420] ata5: soft resetting port
Jan 4 00:01:59 [kernel] [ 5003.394691] ata5: soft resetting port
Jan 4 00:02:07 [kernel] [ 5011.129828] ata5: soft resetting port
Jan 4 00:17:51 [kernel] [ 5953.709660] ata5: soft resetting port
Jan 4 00:17:59 [kernel] [ 5961.932235] ata5: soft resetting port
Jan 4 00:32:54 [kernel] [ 6855.731626] ata5: soft resetting port
Jan 4 00:33:02 [kernel] [ 6863.698527] ata5: soft resetting port
One weird thing: if I use ata8 instead of ata5 through ata7, I don't get two in a row; I only get:
Jan 3 22:59:18 [kernel] [ 1246.450438] ata8: soft resetting port
Jan 3 23:15:09 [kernel] [ 2196.707464] ata8: soft resetting port
Jan 3 23:31:02 [kernel] [ 3147.703638] ata8: soft resetting port
Jan 3 23:46:53 [kernel] [ 4098.143576] ata8: soft resetting port
Jan 4 00:02:44 [kernel] [ 5048.550923] ata8: soft resetting port
Jan 4 00:18:37 [kernel] [ 5999.561120] ata8: soft resetting port
Jan 4 00:33:40 [kernel] [ 6901.994591] ata8: soft resetting port
even though it has the same number of hard disks on the cable as the other ones.
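The ~15-minute spacing of those resets can be checked mechanically from the bracketed kernel uptimes. A small sketch, with sample lines copied from the excerpt above (the parsing is illustrative, not a general syslog parser):

```python
import re

# Pull the kernel uptime out of lines like
#   "Jan  3 22:58:33 [kernel] [ 1201.277748] ata5: soft resetting port"
RESET_RE = re.compile(r'\[\s*([0-9.]+)\] (ata\d+): soft resetting port')

def reset_intervals(lines, port):
    """Seconds between consecutive soft resets of one ATA port."""
    times = []
    for line in lines:
        m = RESET_RE.search(line)
        if m and m.group(2) == port:
            times.append(float(m.group(1)))
    return [round(b - a, 1) for a, b in zip(times, times[1:])]

log = [
    "Jan  3 22:58:33 [kernel] [ 1201.277748] ata5: soft resetting port",
    "Jan  3 23:14:24 [kernel] [ 2151.702897] ata5: soft resetting port",
    "Jan  3 23:30:16 [kernel] [ 3102.451425] ata5: soft resetting port",
]
print(reset_intervals(log, "ata5"))  # intervals of roughly 950 s, i.e. ~15 min
```

An interval of roughly 950 seconds matching the `watch -n 900` cadence (plus the EH delay) would point straight at the periodic hddtemp run.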
[ 98.269159] ata5: PATA max UDMA/133 cmd 0xf88697c0 ctl 0xf8869fda bmdma 0xf8869000 irq 16
[ 98.292103] ata6: PATA max UDMA/133 cmd 0xf88695c0 ctl 0xf8869dda bmdma 0xf8869008 irq 16
[ 98.559891] ata5.00: ATA-7: ST3250823A, 3.03, max UDMA/100
[ 98.583778] ata5.01: ATA-6: ST3200822A, 3.01, max UDMA/100
[ 98.619719] ata5.00: configured for UDMA/100
[ 98.647748] ata5.01: configured for UDMA/100
[ 98.883502] ata6.00: ATA-7: ST3250823A, 3.02, max UDMA/100
[ 98.906139] ata6.01: ATA-6: ST3200822A, 3.01, max UDMA/100
[ 98.943331] ata6.00: configured for UDMA/100
[ 98.967357] ata6.01: configured for UDMA/100
[ 99.767194] ata7: PATA max UDMA/100 cmd 0xf88717c0 ctl 0xf8871fda bmdma 0xf8871000 irq 20
[ 99.789555] ata8: PATA max UDMA/100 cmd 0xf88715c0 ctl 0xf8871dda bmdma 0xf8871008 irq 20
[ 100.054144] ata7.00: ATA-7: ST3300831A, 3.06, max UDMA/100
[ 100.077306] ata7.01: ATA-6: ST3200822A, 3.01, max UDMA/100
[ 100.106086] ata7.00: configured for UDMA/100
[ 100.130109] ata7.01: configured for UDMA/100
[ 100.365760] ata8.00: ATA-7: ST3300831A, 3.06, max UDMA/100
[ 100.387634] ata8.01: ATA-7: ST3300831A, 3.06, max UDMA/100
I don't understand that part, but I see that it usually does this at about 15-minute intervals, which fits with my hddtemp script from above... So this may be a spindown/spinup issue. I'll try setting it to run once every minute and see what that does. Starting that now; I'll report back in an hour, unless I'm sleeping, in which case I'll report back when I wake up.
Sincerely
Tobias
Is the cmd always b0, as it is in the last detailed output you posted? That's a SMART command. Nothing in the kernel issues those itself, so it would have to be some daemon or something that is doing it.
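For reference, the reason "cmd b0/..." points at SMART: in a libata taskfile dump the first byte is the ATA command opcode and the byte after the slash is the feature register, which for opcode 0xB0 selects the SMART subcommand. A quick decoder sketch (subcommand names per the ATA spec; the helper itself is illustrative, not part of any tool mentioned here):

```python
# Per the ATA spec, command 0xB0 is SMART; the feature register selects
# the subcommand. Only a subset of subcommand names is listed.
SMART_FEATURES = {
    0xd0: "SMART READ DATA",
    0xd8: "SMART ENABLE OPERATIONS",
    0xd9: "SMART DISABLE OPERATIONS",
    0xda: "SMART RETURN STATUS",
}

def decode_cmd(taskfile):
    """Decode the 'b0/d8:00:...' portion of a libata 'cmd' line."""
    head, rest = taskfile.split("/", 1)
    opcode = int(head, 16)
    feature = int(rest.split(":", 1)[0], 16)
    if opcode == 0xb0:
        return SMART_FEATURES.get(feature, "SMART (subcommand 0x%02x)" % feature)
    return "non-SMART command 0x%02x" % opcode

print(decode_cmd("b0/d8:00:00:4f:c2/00:00:00:00:00/00"))  # SMART ENABLE OPERATIONS
print(decode_cmd("25/00:78:bf:0d:d4/00:00:21:00:00/e0"))  # non-SMART command 0x25
```

Note also the 4f:c2 bytes in the dump: SMART requires LBA mid/high of 0x4F/0xC2, which is consistent with this reading.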
It's always b0, but that would be explained by hddtemp, which gets the temperature via SMART commands. OK, it should have triggered by now; it hasn't:
Jan 4 01:06:50 [kernel] [ 8889.284639] ata5: soft resetting port
Jan 4 01:06:58 [kernel] [ 8897.375369] ata5: soft resetting port
Jan 4 01:07:04 [kernel] [ 8903.216683] ata6: soft resetting port
Jan 4 01:07:11 [kernel] [ 8910.843946] ata6: soft resetting port
Jan 4 01:07:19 [kernel] [ 8918.778857] ata7: soft resetting port
Jan 4 01:07:27 [kernel] [ 8926.210346] ata7: soft resetting port
Jan 4 01:07:35 [kernel] [ 8934.796580] ata8: soft resetting port
I set hddtemp to run once per minute instead of once every 15 min at 01:10. But I noticed this:
Jan 4 01:07:35 [kernel] [ 8934.736576] ata8: port is slow to respond, please be patient (Status 0xd0)
Jan 4 01:07:35 [kernel] [ 8934.796580] ata8: soft resetting port
which, I guess, would explain why ata8 only shows up once instead of twice. Just to make sure I'm not missing anything stupid, here is everything matching "ata" from within the last 40 minutes:
Jan 4 01:06:47 [kernel] [ 8886.987293] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:06:47 [kernel] [ 8886.991849] ata5.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Jan 4 01:06:50 [kernel] [ 8889.284639] ata5: soft resetting port
Jan 4 01:06:50 [kernel] [ 8889.812386] ata5.00: configured for UDMA/100
Jan 4 01:06:51 [kernel] [ 8890.337799] ata5.01: configured for UDMA/100
Jan 4 01:06:51 [kernel] [ 8890.339850] ata5: EH complete
Jan 4 01:06:54 [kernel] [ 8893.895373] ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:06:54 [kernel] [ 8893.901204] ata5.01: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/10 tag 0 cdb 0x0 data 0
Jan 4 01:06:58 [kernel] [ 8897.375369] ata5: soft resetting port
Jan 4 01:06:58 [kernel] [ 8897.615490] ata5.00: configured for UDMA/100
Jan 4 01:06:58 [kernel] [ 8897.635505] ata5.01: configured for UDMA/100
Jan 4 01:06:58 [kernel] [ 8897.635532] ata5: EH complete
Jan 4 01:07:01 [kernel] [ 8900.835424] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:01 [kernel] [ 8900.840928] ata6.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Jan 4 01:07:04 [kernel] [ 8903.216683] ata6: soft resetting port
Jan 4 01:07:04 [kernel] [ 8903.736463] ata6.00: configured for UDMA/100
Jan 4 01:07:05 [kernel] [ 8904.262569] ata6.01: configured for UDMA/100
Jan 4 01:07:05 [kernel] [ 8904.263878] ata6: EH complete
Jan 4 01:07:08 [kernel] [ 8907.699561] ata6.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:08 [kernel] [ 8907.705009] ata6.01: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/10 tag 0 cdb 0x0 data 0
Jan 4 01:07:11 [kernel] [ 8910.843946] ata6: soft resetting port
Jan 4 01:07:11 [kernel] [ 8911.084121] ata6.00: configured for UDMA/100
Jan 4 01:07:11 [kernel] [ 8911.104073] ata6.01: configured for UDMA/100
Jan 4 01:07:11 [kernel] [ 8911.104108] ata6: EH complete
Jan 4 01:07:15 [kernel] [ 8914.339957] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:15 [kernel] [ 8914.345418] ata7.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Jan 4 01:07:19 [kernel] [ 8918.778857] ata7: soft resetting port
Jan 4 01:07:20 [kernel] [ 8919.306731] ata7.00: configured for UDMA/100
Jan 4 01:07:20 [kernel] [ 8919.832337] ata7.01: configured for UDMA/100
Jan 4 01:07:20 [kernel] [ 8919.833975] ata7: EH complete
Jan 4 01:07:24 [kernel] [ 8923.289705] ata7.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:24 [kernel] [ 8923.295121] ata7.01: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/10 tag 0 cdb 0x0 data 0
Jan 4 01:07:27 [kernel] [ 8926.210346] ata7: soft resetting port
Jan 4 01:07:27 [kernel] [ 8926.458522] ata7.00: configured for UDMA/100
Jan 4 01:07:27 [kernel] [ 8926.478588] ata7.01: configured for UDMA/100
Jan 4 01:07:27 [kernel] [ 8926.478616] ata7: EH complete
Jan 4 01:07:30 [kernel] [ 8929.682383] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:30 [kernel] [ 8929.687838] ata8.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Jan 4 01:07:35 [kernel] [ 8934.736576] ata8: port is slow to respond, please be patient (Status 0xd0)
Jan 4 01:07:35 [kernel] [ 8934.796580] ata8: soft resetting port
Jan 4 01:07:35 [kernel] [ 8935.036695] ata8.00: configured for UDMA/100
Jan 4 01:07:35 [kernel] [ 8935.056804] ata8.01: configured for UDMA/100
Jan 4 01:07:35 [kernel] [ 8935.056834] ata8: EH complete
I don't think I've missed anything this time around. I still don't really get the ata8 thing, but then again, I'm not sure it is really the same issue. I know that on ata8 one of the disks, sdl, should never spin down (it is the RAID mirror of the OS), while sdk should spin down (it is just a data RAID-5 disk, which may not be used all the time). Though I fail to see how one of the disks not spinning down would result in a "slow to respond", if that is even relevant...
In any case, since it should have triggered at least 3 times (because of the 15-min thing from my prior comment), I will stop the script running hddtemp and let it sit overnight. Then I'll see what happens with a normal ls tomorrow, and then with hddtemp again tomorrow. Sincerely Tobias
OK, a normal spin-up does NOT cause this on my system. I will try re-initiating the hddtemp runs (the SMART commands) at decreasing intervals and see if that fits my spindown idea (if it does, it should hit at about 5-minute intervals at the lowest, which is the limit for my drives to spin down). But normal disk transfer after spindown does not cause this for me. Sincerely Tobias
Tobias, it has nothing to do with spindown. IIRC, it's some userland daemon issuing an invalid command, which can be fixed by updating the program. You won't see it if you put the machine into runlevel 1. Just find out which daemon or cron job is doing it and disable it.
Then how come it doesn't produce that error when I use the old driver?
The problem you're reporting was traced quite some time ago and I don't remember the details. ISTR that the difference was that the IDE driver doesn't complain or take any corrective action against an HSM violation. It's one of smartd, smartctl or hddtemp. I'll ask Bruce Allen whether he remembers better than me.
I believe it was hddtemp that had the bug..
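For anyone trying to confirm which SMART poller is the trigger, the correlation can be checked without guessing by counting the libata "frozen" exceptions in the kernel log before and after stopping the daemon. A minimal sketch, assuming a Debian-style init script and /var/log/kern.log (both are examples, not taken from this report):

```shell
#!/bin/sh
# Count libata "frozen" exceptions in a kernel log file. Run it once with
# hddtemp/smartd active and once after stopping them; if the count stops
# growing, a SMART poller is the trigger.
count_ata_exceptions() {
    grep -c 'exception Emask .* frozen' "$1"
}
```

Usage sketch (as root, hardware-dependent, with the ~15 min interval Tobias mentioned):

```shell
/etc/init.d/hddtemp stop
before=$(count_ata_exceptions /var/log/kern.log)
sleep 900
after=$(count_ata_exceptions /var/log/kern.log)
[ "$after" -eq "$before" ] && echo "no new exceptions without hddtemp"
```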
Oki, in that case, please disregard me :D Sorry for the noise. Sincerely, Tobias Ussing
(In reply to comment #41)
> [...] Also, I don't think there is any relationship to the 2.6.22
> to 2.6.23 change since Fedora only enabled libata for PATA disks in FC7. Quite
> likely the libata Promise driver had this problem previously as well.

The machine used FC6 and was very reliable under heavy load for months at a time. It never used F7. The problems are related to the F8 upgrade, without any hardware changes. Thanks for the insights.
FYI: Reverting to the hd-style drivers fixes the sd-style driver's "ata ... frozen" messages & soft resets -- although there are some "SeekComplete" errors reported by the hd* device, the raid5 driver sails through them, apparently without the effects of freezing or resets:

Jan 13 12:32:33 fw kernel: hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 13 12:32:33 fw kernel: hdi: dma_intr: error=0x40 { UncorrectableError }, LBAsect=226841460, high=13, low=8737652, sector=226841455
Jan 13 12:32:33 fw kernel: end_request: I/O error, dev hdi, sector 226841455
[...]
Jan 13 12:32:34 fw kernel: raid5:md0: read error corrected (8 sectors at 226841456 on hdi1)
Uncorrectable error is a failed disk. The timeouts for old and new ATA layer are different as are the messages printed. Looks like your drive takes a very long time to fail the I/O, far longer than libata was prepared to wait around with the machine stuck. You probably want to run a verify on the drive and it looks like libata may need to use slightly longer I/O timeouts to cope with such failure events.
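For the verify Alan suggests, smartmontools can drive it. A sketch that only prints the commands (pipe the output to sh as root to actually run them); /dev/hdi is the device from the log above, substitute your own:

```shell
#!/bin/sh
# Print the smartmontools commands for a long (surface) self-test on a
# suspect drive. Nothing is executed here, so a typo cannot touch a disk.
verify_drive_cmds() {
    dev=${1:-/dev/hdi}
    echo "smartctl -t long $dev"      # start the offline surface scan
    echo "smartctl -l selftest $dev"  # check the result once it finishes
    echo "smartctl -A $dev"           # pending/reallocated sector counts
}

verify_drive_cmds "$@"
```

The long self-test makes the drive read its whole surface itself, so a sector that takes "a very long time to fail" shows up in the self-test log instead of hanging an I/O under libata.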
Alan, timeout for reads and writes is 30secs. I don't really think we should go higher than that.
Hello, sorry for the long delay. I bring logs from another testing PC.

ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x2400000 action 0x2 frozen
ata1.00: cmd 35/00:00:20:17:06/00:04:01:00:00/e0 tag 0 cdb 0x0 data 524288 out
         res 51/04:01:1f:18:06/00:03:01:00:00/e0 Emask 0x10 (ATA bus error)
ata1: hard resetting port
ata2: exception Emask 0x10 SAct 0x0 SErr 0x80000 action 0x2 frozen
ata2: hard resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: configured for UDMA/100
ata2: EH complete
Created attachment 14567 [details] Another PC log - dmesg + kern.log
I did not test the other PC very much; when the first exception Emask occurred, I stopped testing. That means, to me, that my HW is OK.

Steps to reproduce:
1) dd if=/dev/sdd of=/dev/sda bs=16384
2) wait cca five minutes
3) mount -t ext3 /dev/sdc1 /media/
4) tree /media
5) mount -t ext3 /dev/sdb1 /mnt/
6) tree /mnt
7) exception Emask 0x10 SAct 0x0 SErr ...
*) if the exception does not appear, run tree /mnt more than once
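The steps above can be wrapped in a script for repeated runs. A sketch, using the reporter's device names; since the dd overwrites /dev/sda, every command is only echoed by default (the RUN=echo mechanism is my addition) -- clear RUN and run as root to execute for real:

```shell
#!/bin/sh
# Reproduction steps from the report as a script. Device names are the
# reporter's; dd destroys /dev/sda, so commands are echoed unless RUN is
# cleared (RUN= ./repro.sh as root).
RUN=${RUN:-echo}

repro() {
    $RUN dd if=/dev/sdd of=/dev/sda bs=16384 &  # step 1, runs in background
    $RUN sleep 300                              # step 2: wait cca five minutes
    $RUN mount -t ext3 /dev/sdc1 /media/        # steps 3-4
    $RUN tree /media
    $RUN mount -t ext3 /dev/sdb1 /mnt/          # steps 5-6
    # step *): run tree /mnt more than once if the exception does not appear
    for i in 1 2 3; do
        $RUN tree /mnt
    done
}

repro
```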
But I have some progress. The hell in my case is the Promise controllers, not the cheaper SiI. I was inspired by this article:

http://osdir.com/ml/netbsd.ports.sparc64/2006-08/msg00013.html

Of course, I am on an x86 CPU, not SPARC, but I have tested NetBSD and I have seen quite similar problems. Would you mind saying this sentence quietly? Promise HW is very unstable or undocumented, and all drivers die on undocumented exceptions, as in NetBSD. Now my system is stable without the Promise controllers. I only had to disable the APIC functions in the BIOS (EPoX 8RDA3+), because NetBSD won't boot without that, and Linux loses interrupts on the IDE/SATA controllers and works in PIO modes, not UDMA modes.
NPetr, the failure from the sata_via is an ATA bus error, and the SError register indicates an unrecognized FIS and a handshake error. SATA links are much more susceptible to these error conditions than PATA ones are, and transmission errors occur on perfectly healthy systems once in a blue moon. So, unless it happens often, it doesn't really matter. As for the Promise comment, I guess Mikael has something to say. :-)
Tejun, you are right, but in my case I have this loop:

1. exception Emask
2. system freezes for 20 s
3. system works for 10 s
4. GOTO 1

It is a pity that the other old logs from the other PC disappeared; we cannot see all the exception Emasks :(. I trashed the Promise cards. Now I have 2x Axago four-port controllers (SiI 3124 PCI) and one EPoX 8RDA3+ two-port onboard controller (SiI 3112). No more exception Emask till now.

P.S. I heard that one man (I cannot remember the URL) resolved these problems by buying an Intel chipset motherboard and new Western Digital RAID edition SATA disks. But I do not want to believe that Promise works correctly only on Intel MBs and WDC RAID disks.
I think the first thing to do is to isolate the problem - i.e. test only w/ the sata_promise controller, reproduce the problem, and try to debug that. Sigh... Debugging problems on promise and nv seems really difficult. Mikael, do you have a contact at Promise?
(In reply to comment #62)
> Created an attachment (id=14567) [details]
> Another PC log - dmesg + kern.log

What was this log supposed to show? All I saw was sata_sil in kernel 2.6.22 driving two Samsungs and getting an ATA exception on each port. And in the same kernel, sata_promise driving another Samsung but not getting any errors. All this shows is that something other than sata_promise in this system is having problems, be it the disks, cables, PSU, mainboard, chipset, or even the libata core.
(In reply to comment #64)
> But I have some progress. The hell are Promise controllers in my case, not
> cheaper SiI. I am inspired by this article:
>
> http://osdir.com/ml/netbsd.ports.sparc64/2006-08/msg00013.html
>
> Of course, I am on X86 CPU, not Sparc, but I have tested NetBSD and I have
> seen quite simillar problems. Would you mind saying this sentece silently?
> Promise HW is very unstable or undocumented and all drivers dies on
> undocumented exceptions as in NetBSD.

Problems on non-x86 can be expected because the x86 BIOS does some initialisations that at least the Linux driver did not use to do. I fixed that in the Linux driver when I initially tried to get a 2nd gen Promise SATA card working in an UltraSPARC (Ultra5) machine. Nothing prevents other operating systems from doing the same. Promise's public SATA HW documentation is however very, very poor; we get register definitions but /nothing/ about the programming model or quirks/bugs.
(In reply to comment #66)
> P.S. I heared that one man (can not remeber the URL) resolve these problems
> with buy Intel chipset motherboard and new Western raid edition SATA disks.
> But I do not want to believe that, Promise works correctly only on Intel MBs
> and WDC raid disks.

Not true. I've used Promise SATA in production on a leafnode news server for years, most of the time using standard Seagate disks, and for the last year on a VIA K8T800 chipset mainboard. What matters is the quality of the system components.
(In reply to comment #67)
> I think the first thing to do is isolating the problem - ie. testing only w/
> sata_promise controller, reproduce the problem and try to debug that. Sigh...
> Debugging problems on promise and nv seems really difficult. Mikael, do you
> have a contact with promise?

No, only Jeff seems to have any contact with them.
Experienced something similar with sata_via here:

[ 22.665005] SCSI subsystem initialized
[ 22.667119] Driver 'sd' needs updating - please use bus_type methods
[ 22.680498] libata version 3.00 loaded.
[ 22.682447] sata_via 0000:00:0b.0: version 2.3
[ 22.682471] PCI: Enabling device 0000:00:0b.0 (0000 -> 0001)
[ 22.682488] ACPI: PCI Interrupt 0000:00:0b.0[A] -> GSI 19 (level, low) -> IRQ 16
[ 22.682609] sata_via 0000:00:0b.0: routed to hard irq line 0
[ 22.682797] scsi0 : sata_via
[ 22.682885] scsi1 : sata_via
[ 22.682951] scsi2 : sata_via
[ 22.683009] ata1: SATA max UDMA/133 port i16@0xa00 bmdma 0xcc00 irq 16
[ 22.683013] ata2: SATA max UDMA/133 port i16@0x1000 bmdma 0xcc08 irq 16
[ 22.683016] ata3: PATA max UDMA/133 port i16@0x100 bmdma 0xcc10 irq 16
[ 22.991163] ata1: SATA link down (SStatus 0 SControl 310)
[ 28.350660] ata2: port is slow to respond, please be patient (Status 0xff)
[ 33.054225] ata2: device not ready (errno=-16), forcing hardreset
[ 39.241653] ata2: port is slow to respond, please be patient (Status 0xff)
[ 43.105295] ata2: COMRESET failed (errno=-16)
[ 49.292722] ata2: port is slow to respond, please be patient (Status 0xff)
[ 53.156364] ata2: COMRESET failed (errno=-16)
[ 59.343792] ata2: port is slow to respond, please be patient (Status 0xff)
[ 88.181125] ata2: COMRESET failed (errno=-16)
[ 93.192661] ata2: COMRESET failed (errno=-16)
[ 93.192714] ata2: reset failed, giving up

The connected HDD won't work at all. This is a

00:0b.0 RAID bus controller [0104]: VIA Technologies, Inc. VT6421 IDE RAID Controller [1106:3249] (rev 50)
	Subsystem: VIA Technologies, Inc. VT6421 IDE RAID Controller [1106:3249]
	Flags: bus master, medium devsel, latency 32, IRQ 16
	I/O ports at 0a00 [size=16]
	I/O ports at 1000 [size=16]
	I/O ports at 0100 [size=16]
	I/O ports at 1010 [size=16]
	I/O ports at cc00 [size=32]
	I/O ports at 8c00 [size=1K]
	[virtual] Expansion ROM at 30000000 [disabled] [size=64K]
	Capabilities: <access denied>
Can you please try irqpoll?
Oops, no. Please don't try that. Can you please file a separate bug report and cc me?
Same here with a VIA chipset, and also with an Nvidia one with ACPI disabled:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd 25/00:08:85:75:25/00:00:25:00:00/e0 tag 0 dma 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd 25/00:20:6d:ae:1d/00:00:24:00:00/e0 tag 0 dma 16384 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Matteo filed bug report #10738 for comment #75. Let's track it there.