Most recent kernel where this bug did not occur: kernels newer than 2.6.20 are broken; possibly kernels older than 2.6.20 as well.
Distribution: Debian Lenny current + Sid mdadm (with reshape patch) and smartmontools
Hardware Environment:
RAID bus controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02)
Mass storage controller: Promise Technology, Inc. PDC40775 (SATA 300 TX2plus) (rev 02)
RAID bus controller: Promise Technology, Inc. 20771 (FastTrak TX2300) (rev 02)
5x Samsung HD501LJ
Software Environment: Debian Lenny current + Sid mdadm (with reshape patch) and Sid smartmontools
Problem Description:
With the original 2.6.22-2-486 kernel from Debian Lenny, "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" occurs very often.
With the original 2.6.23.9-slh-smp-1 kernel from Sidux Eros, "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" occurs very often.
With the official 2.6.24-rc3-mm2 kernel from www.kernel.org, "COMRESET failed (errno=-16)" appears very often.
Steps to reproduce: just start a copy from one disk to another, from one disk to the RAID array, or the reverse. I include report files in the attachments.
This topic is very hot (a few links out of many):
http://lkml.org/lkml/2007/5/10/63
http://lkml.org/lkml/2007/10/10/32
http://article.gmane.org/gmane.linux.ide/23896
https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/84603
https://bugs.launchpad.net/ubuntu/+bug/104581
http://www.opensubscriber.com/message/linux-ide@vger.kernel.org/7961713.html
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09306.html
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09636.html
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09898.html
I also restarted the computer during a RAID-5 "3->4" disk reshape (http://scotgate.org/?p=107). The reshape died and I have never managed to continue it :-/.
Created attachment 13783 [details] /var/log/dmesg
Created attachment 13784 [details] /var/log/dmesg.0
Created attachment 13785 [details] dmesg
Created attachment 13786 [details] /var/log/dmesg.1.gz
Created attachment 13787 [details] /var/log/dmesg.2.gz
Created attachment 13788 [details] /var/log/dmesg.1.g3
Created attachment 13789 [details] /var/log/dmesg.4.gz
Created attachment 13790 [details] fdisk -l
Created attachment 13791 [details] lspci -vvv
Created attachment 13792 [details] mdadm --detail --scan /dev/md0
Created attachment 13793 [details] reshape patch form Neil Brown for mdadm 2.6.23
Created attachment 13794 [details] smartctl - t long "all_drives", smartctl - l selftest "all_drives", smartctl - a "all_drives"
Created attachment 13795 [details] /var/log/syslog
Created attachment 13796 [details] /var/log/syslog.0
Created attachment 13797 [details] /var/log/syslog.1.gz
Created attachment 13798 [details] uname -a
I have run the short and long tests for all disks with the Seagate utility: http://www.seagate.com/support/seatools/SeaToolsDOS110.iso No errors were found.
Mikael, both ata4 and ata6 are sata_promise SATA ports. In syslog.1, ata6 spewed an HSM violation while ata4 suffered a timeout and went down horribly. The two events are close together although they're on different controllers.

NPetr, can you please remove all non-essential drives from your machine, connect just one to the sata_promise controller, and see whether the problem goes away?

Nov 28 05:10:51 storage kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
Nov 28 05:10:51 storage kernel: ata6.00: (port_status 0x20080000)
Nov 28 05:10:51 storage kernel: ata6.00: cmd 25/00:80:07:5c:32/00:00:2c:00:00/e0 tag 0 cdb 0x0 data 65536 in
Nov 28 05:10:51 storage kernel: res 50/00:00:06:5c:32/00:00:2c:00:00/e0 Emask 0x2 (HSM violation)
Nov 28 05:10:51 storage kernel: ata6: soft resetting port
Nov 28 05:10:52 storage kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 28 05:10:52 storage kernel: ata6.00: configured for UDMA/133
Nov 28 05:10:52 storage kernel: ata6: EH complete
Nov 28 05:11:37 storage kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Nov 28 05:12:47 storage kernel: ata4.00: cmd 25/00:78:bf:0d:d4/00:00:21:00:00/e0 tag 0 cdb 0x0 data 61440 in
Nov 28 05:12:47 storage kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 28 05:12:47 storage kernel: ata4: port is slow to respond, please be patient (Status 0xff)
Nov 28 05:12:47 storage kernel: ata4: device not ready (errno=-16), forcing hardreset
Nov 28 05:12:47 storage kernel: ata4: hard resetting port
Nov 28 05:12:47 storage kernel: ata4: port is slow to respond, please be patient (Status 0xff)
Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)
Nov 28 05:12:47 storage kernel: ata4: hard resetting port
Nov 28 05:12:47 storage kernel: ata4: port is slow to respond, please be patient (Status 0xff)
Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)
Nov 28 05:12:47 storage kernel: ata4: hard resetting port
Nov 28 05:12:47 storage kernel: ata4: port is slow to respond, please be patient (Status 0xff)
Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)
Nov 28 05:12:47 storage kernel: ata4: limiting SATA link speed to 1.5 Gbps
Nov 28 05:12:47 storage kernel: ata4: hard resetting port
Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)
Nov 28 05:12:47 storage kernel: ata4: reset failed, giving up
Hmm... This looks very similar to http://bugzilla.kernel.org/show_bug.cgi?id=9465 And, NPetr, the links you posted in the initial report all point to different symptoms on various controllers.
Hello Tejun, do you think this problem is controller dependent? My problem occurs on all the controllers that I have. During #9465 I tried the 2.6.18-5-k7 kernel from Debian Etch 4.0. The problem is the same, but with a different kernel message:
ata4: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata4: status=0x40 { DriveReady }
Fixed message:
ata3: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata3: command timeout
There are seven dmesg dumps, all of the same 2.6.24-rc3-mm2 kernel, but some things change. None of the dmesg dumps indicate problems with sata_promise, but two of them indicate problems with sata_sil: a "port is slow to respond" and a 10B8B SError. The second of these occurred immediately after a manual hotplug which added a disk on a sata_promise port.

syslog.0 contains kernel logs from at least fifteen boots with varying kernel versions:
- a 2.6.23 boot showed a common sata_promise error: an HSM violation with port_status 0x20080000; it was followed by timeouts and COMRESET failures (not a common pattern)
- a 2.6.24-rc3-mm2 boot showed a timeout/"port is slow to respond" error on sata_sil
- another 2.6.24-rc3-mm2 boot showed a timeout failure on a sata_promise port followed by COMRESET failures, and occasional SError transmission errors; a port on the other Promise controller showed a hot-unplug/plug sequence, which may or may not have been a hardware hiccup
- the last 2.6.24-rc3-mm2 boot showed a manual hotplug on a Promise port (not an error), followed by a 10B8B SError on a sata_sil port

All errors shown in these logs except one are commonly caused by an overloaded PSU. The one exception is the HSM violation in the 2.6.23 boot, which was probably caused by the ASIC PRD bug fixed in 2.6.24-rc2. Especially telling is the fact that hot-adding a disk on a sata_promise port causes a hiccup on a disk on a sata_sil port.
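For triaging dumps like these, a short script can tally libata error-handler events per port. This is only an illustrative sketch; the regex covers just the two message shapes quoted in this bug, and the sample lines below are paraphrased from the excerpts above, not taken from any attachment:

```python
import re
from collections import Counter

# Match libata EH events of the kinds seen in these logs, e.g.
#   "kernel: ata4.00: exception Emask 0x0 ... action 0x2 frozen"
#   "kernel: ata4: COMRESET failed (errno=-16)"
EVENT_RE = re.compile(r'(ata\d+)(?:\.\d+)?: (?:exception .*frozen|COMRESET failed)')

def count_eh_events(lines):
    """Count frozen-exception and COMRESET-failure events per ATA port."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "Nov 28 05:10:51 storage kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen",
    "Nov 28 05:11:37 storage kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen",
    "Nov 28 05:12:47 storage kernel: ata4: COMRESET failed (errno=-16)",
]
print(count_eh_events(sample))  # ata4 appears twice, ata6 once
```

Running this over each syslog makes it easy to see whether the errors cluster on one port or wander across controllers.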
Hello Mikael,
- I removed the sata_sil controller from the PC. The two Promise controllers stayed in the PC.
- I replaced the 300W PSU with another 350W one that I found.

Tejun and Mikael, the problems are still here (2.6.17, 2.6.18, 2.6.22, 2.6.23, 2.6.24-rc3 are what I tried; kernels older than 2.6.17 are not sufficient for me because they do not support reshaping of a RAID array). I will attach a big tarball with /var/log messages, dmesgs and syslogs when I arrive home. Are any other files important for you?

Steps to reproduce
==================
Configuration:
Promise 1 (2x 500 GB SATA Samsung): /dev/sdd /dev/sde
Promise 2 (2x 500 GB SATA Samsung): /dev/sdf /dev/sdg
System on MB: /dev/md1 (/dev/hda + /dev/hdb ... 2x 3.2 GB PATA Seagate)
Swap on MB: /dev/hdc (1.7 GB PATA Seagate)
Procedure:
1) dd if=/dev/sdg of=/dev/sdf
2) wait approximately five minutes
3) mount /dev/sdd1 /mnt
4) tree /mnt
5) "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" and "COMRESET failed" appear
*) if the exception does not appear, run tree /mnt more than once

Note: I guess that maximum DMA throughput on one SATA controller (or one port in the worst case) exhausts something in the Linux kernel; when traffic starts concurrently on another controller (or on another port in the worst case), the exception appears. Do you think booting the kernel with the ide=nodma option may help? I could try it, but slow PIO traffic takes a horribly long time.
Fix: dd if=/dev/sdg of=/dev/sdf bs=16384
Created attachment 13838 [details] /var/log messages, dmesgs, syslogs, kerns, debugs OK, here is the actual report.
Do NetBSD or FreeBSD use the same drivers as Linux? I could try the behaviour on a Unix system. Thank you for further support, NPetr
The BSD stuff is unrelated (beyond coming from similar docs and the fact that the authors cross-check details against other open source code), so it should be fairly close to a test with unrelated drivers.
Hello Alex, happy to see you. I found that OpenSolaris has 3114 and 3124 drivers too. Maybe the CDDL license is closer to the GPL than BSD. I cannot believe that Linux has such a buggy driver. I am trying to resurrect my data now. Is it possible to fix the "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" and "COMRESET failed (errno=-16)" bugs in the Linux driver according to the OpenSolaris or BSD ones? I prefer Linux; it has had the RAID-5 reshape ability since kernel 2.6.17. No other OS has it (RaidFrame, VxVM, ZFS [maybe in the future - http://www.opensolaris.org/jive/thread.jspa?messageID=144085]). P.S. Alex, why did your *-ac kernels disappear?
I made a terrible mistake with your name. Very, very sorry, Alan.
The Linux driver should be rock solid, and your report is very, very strange indeed, as well as apparently unique.
I do not understand. I googled a lot about "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen" and "COMRESET failed (errno=-16)" on Promise or SiI controllers. I found a lot of material, which is why I added you to the CC of this bug. How can I make the report and steps to reproduce clearer, and what else could I do? I will try the kernel parameter ide=nodma and hope that it limits SATA to PIO modes as well.
Summary of what I did:
- replaced the PSU
- removed the SiI 3112 controller
- replaced the MB (Via KT3 Ultra -> nForce2 Ultra400)
- replaced the DDRs
- disabled the SiI 3112 on the MB in the BIOS
- bought other HD501LJ disks
Nothing has helped :(.
ata5: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xa frozen
ata5: hotplug_status 0x10
ata5: SError: { PHYRdyChg DevExch }
ata5: port is slow to respond, please be patient (Status 0xff)

This sounds very unlike a driver problem; the controller reports a change in the state of the SATA PHY, which suggests it thinks the drive has disconnected itself. Perhaps some kind of incompatibility between the controller and these drives? Have you tried a different model/brand of drive?
Hello Robert, I have only Samsung hard disks. But I tried them on the SiI 3112, TX2plus and TX2300. Do you think the HD501LJ is buggy with the Linux SATA interface?
I have an old machine (single Pentium 4, Intel motherboard, Promise PATA cards) which worked reliably with kernel 2.6.22.14-72.fc6 (when disks were still called /dev/hd*) but started to encounter this bug frequently (e.g. every 5 seconds) after upgrading to 2.6.23.9-85.fc8. The IDE interfaces are:
00:1f.1 IDE interface: Intel Corporation 82801BA IDE U100 Controller (rev 04)
02:0b.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
02:0c.0 Mass storage controller: Promise Technology, Inc. PDC20267 (FastTrak100/Ultra100) (rev 02)
Note that the 82801BA chip doesn't get much work, and all of the bugs are seen on the two Promise cards. Again, this bug is new; it wasn't present in 2.6.22 kernels (before the hd/sd drivers merged).
(In reply to comment #34)
> I have an old machine (single Pentium 4, Intel motherboard, Promise PATA cards)
> which worked reliably with kernel 2.6.22.14-72.fc6 (when disks were still
> called /dev/hd*) but started to encounter this bug frequently (e.g. every 5
> seconds) after upgrading to 2.6.23.9-85.fc8 -- the IDE interfaces are:
>
> 00:1f.1 IDE interface: Intel Corporation 82801BA IDE U100 Controller (rev 04)
> 02:0b.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
> 02:0c.0 Mass storage controller: Promise Technology, Inc. PDC20267
> (FastTrak100/Ultra100) (rev 02)
>
> Note that the 82801BA chip doesn't get much work, and all of the bugs are seen
> on the two Promise cards. Again, this bug is new, it wasn't present in 2.6.22
> kernels (before hd/sd drivers merged).

You should post the dmesg output you are getting from these errors. It's quite possible that you have an entirely different problem.
(In reply to comment #35)
> You should post the dmesg output you are getting from these errors. It's quite
> possible that you have an entirely different problem.

I think this problem with the new ATA drivers is generic -- the behavior matches what was posted before, but perhaps you are right in asking for more detail. This kind of problem even happens on another machine (a year old) with SATA drives only: the signature "exception Emask 0x2 SAct [...] SErr 0x0 action 0x2 frozen" is seen in the logs, and while the SAct codes vary, the error is followed by the same dance leading to an ATA soft reset. BTW, these errors go to /var/log/messages and don't appear in dmesg output. This newer machine never reported any SATA problems before being upgraded to a 2.6.23-series kernel. I'd say that both the PATA and SATA aspects of the merged hd/sd driver have problems. The SATA chip in the newer machine is:
00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller (rev 02)
So, what I know is:
(1) Two different machines create this bug's signature in /var/log/messages
(2) Both used to work fine without bugs with a 2.6.22-type kernel, and broke after upgrading to a 2.6.23-type kernel
(3) The bug affects both PATA and SATA interfaces from Promise and Intel, attached to 6 Maxtor drives (PATA) or 6 WD drives (SATA)
(4) Kernel 2.6.23 introduced a merged hd/sd driver for PATA/SATA drives
...so the simplest hypothesis is that the HW is OK but the hd/sd driver in 2.6.23+ kernels broke something. Of course, "something" could actually be several broken things, and perhaps examining a range of possible failures can help, so I'll attach a fairly long list of system logs when I get a chance.
The "exception Emask 0x2 SAct [...] SErr 0x0 action 0x2 frozen" part doesn't mean much; all it indicates is that a command timed out and that error handling kicked in. We really need the entire error details.
Created attachment 14233 [details] Parallel ATA machine, OK with 2.6.22, not OK with 2.6.23. LONG!
Created attachment 14234 [details] Serial ATA machine, OK with 2.6.22, not OK with 2.6.23.
(In reply to comment #37)
> The "exception Emask 0x2 SAct [...] SErr 0x0 action 0x2 frozen" part doesn't
> mean much, all that indicates is that a command timed out and that error
> handling kicked in. We really need the entire error details..

Done -- see the attachments I just posted. BTW, I've got another old machine with an Intel motherboard which has *NOT* seen this bug since being upgraded to 2.6.23 -- but this machine doesn't do RAID while the other two (which exhibit problems) do. Since RAID imposes bursts of work on disk interfaces, perhaps this problem is timing related. I don't think SMP plays a role -- the PATA machine is uniprocessor, runs RAID-5 and has problems; the problem-free machine has a single CPU w/hyperthreading but no RAID; and the problematic SATA machine has a dual-core CPU and runs RAID (various levels). The problem happens under heavy load on RAID machines. The single-processor PATA machine is more loaded and sees more problems.
There are two different issues here:

The Serial ATA problems are "spurious" NCQ completions. These detections were actually invalid, so the detection code has been removed in 2.6.24 (and, I believe, the latest 2.6.23 stable kernels); that problem should now be fixed.

The PATA problem would appear to be the Promise PATA driver doing something that causes the controller's BMDMA engine to choke for some reason. I would suggest opening a new bug report since it seems unrelated to the original reporter's problem. Also, I don't think there is any relationship to the 2.6.22 to 2.6.23 change since Fedora only enabled libata for PATA disks in FC7. Quite likely the libata Promise driver had this problem previously as well.
(In reply to comment #38)
> Created an attachment (id=14233) [details]
> Parallel ATA machine, OK with 2.6.22, not OK with 2.6.23. LONG!

Note that your PATA machine has both a PDC20267 and a PDC20269. The older 20267 is driven by pata_pdc202xx_old while the 20269 is driven by pata_pdc2027x. There are known problems with the 20267/pata_pdc202xx_old combination (see other bugzilla entries), but the 20269/pata_pdc2027x combination should be solid. Unfortunately, your kernel log is incomplete, so we can't tell whether the errors occur on the 20267 or the 20269.
NPetr, multiple errors occurring across multiple controllers/drives strongly indicate a hardware problem. Also, you're seeing quite a few PHY events (phy going offline, transmission errors), which also point at hardware problems. As Mikael said before, the most common cause of this kind of problem is an overloaded/faulty PSU. Please perform some hardware debugging.
* Take some drives out and see if anything changes.
* If you have access to an extra PSU, use it to power the SATA drives (I'm not sure using a separate PSU is safe for PATA, but it definitely is safe for SATA). You can power up a PSU w/o a motherboard by... http://modtown.co.uk/mt/article2.php?id=psumod
This will distribute the 12V load across two PSUs. Please see whether the errors continue to happen and whether there's any pattern according to which PSU the drives are connected to.
I have what I believe is the same problem (except for the COMRESET, but that may come once the machine has run long enough; I will see). I have two Promise PATA cards in my computer:
05:06.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
05:08.0 Mass storage controller: Promise Technology, Inc. PDC20268 (Ultra100 TX2) (rev 02)
Both with all Seagate disks... I get this quite a lot:
[ 5994.518914] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[ 5994.524893] ata8.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
[ 5994.524896] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 5999.561120] ata8: soft resetting port
[ 5999.809328] ata8.00: configured for UDMA/100
[ 5999.821326] ata8.01: configured for UDMA/100
[ 5999.821354] ata8: EH complete
[ 5999.839737] sd 7:0:0:0: [sdk] 586072368 512-byte hardware sectors (300069 MB)
[ 5999.856294] sd 7:0:0:0: [sdk] Write Protect is off
[ 5999.856903] sd 7:0:0:0: [sdk] Mode Sense: 00 3a 00 00
[ 5999.888333] sd 7:0:0:0: [sdk] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 5999.888702] sd 7:0:1:0: [sdl] 586072368 512-byte hardware sectors (300069 MB)
[ 5999.889004] sd 7:0:1:0: [sdl] Write Protect is off
[ 5999.889007] sd 7:0:1:0: [sdl] Mode Sense: 00 3a 00 00
[ 5999.889095] sd 7:0:1:0: [sdl] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 5999.889113] sd 7:0:0:0: [sdk] 586072368 512-byte hardware sectors (300069 MB)
[ 5999.889123] sd 7:0:0:0: [sdk] Write Protect is off
[ 5999.889126] sd 7:0:0:0: [sdk] Mode Sense: 00 3a 00 00
[ 5999.889142] sd 7:0:0:0: [sdk] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 5999.889158] sd 7:0:1:0: [sdl] 586072368 512-byte hardware sectors (300069 MB)
[ 5999.889168] sd 7:0:1:0: [sdl] Write Protect is off
[ 5999.889170] sd 7:0:1:0: [sdl] Mode Sense: 00 3a 00 00
[ 5999.889186] sd 7:0:1:0: [sdl] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
This happens for all the hard disks on the PATA cards, on all the PATA cards (that is 8 drives total). I do NOT have the old "IDE" driver in my kernel at all, nor as a module; only libata, as of my upgrade to:
Linux ts.local.machines 2.6.23-hardened-r4 #6 PREEMPT Thu Jan 3 01:32:33 CET 2008 i686 AMD Sempron(tm) Processor 2800+ AuthenticAMD GNU/Linux
I have another version of the kernel compiled, the only difference being that the Promise cards run on the old driver instead of libata. In both cases sata_nv is also being used. If I run the one with the old driver, everything works: no error messages, no problems, and it isn't unstable. SMART doesn't complain for any of the hard disks (short tests are run daily, long tests weekly) on the old drivers, so it isn't a drive/cable problem. Will a full dmesg, lspci or anything else be of any use from me, or do you have enough? Adding myself to CC.
Sincerely
Tobias
Oh, I'm also running RAID on all drives; some are RAID-5, some RAID-1. Same problem on all hard disks.
I just noticed that it hit all the drives one by one, and none of the drives were actually being used (even iostat shows no data transfer from either the drives or the RAID). The only things that would be touching the drives at that time are hddtemp or SMART. But when I ran those manually, neither of them triggered the error. The important part is: I get the error with little to no data transfer taking place.
Tobias, does stopping smartd fix your problem?
Smartd isn't running. I have a "watch -n 900 sh temp-test" running, where temp-test contains hddtemp /dev/sd? (or rather, a for loop to do that). So it should trigger every 15 min if it were hddtemp, but it doesn't trigger when I run it manually. The SMART check is a cron script run nightly, so it isn't that. No daemons are touching the disks, except mdadm. Doing a:
grep "soft resetting" /var/log/everything/log-2008-01-03-23\:01\:59 | grep ata5
results in:
Jan 3 22:58:33 [kernel] [ 1201.277748] ata5: soft resetting port
Jan 3 22:58:41 [kernel] [ 1209.108766] ata5: soft resetting port
Jan 3 23:14:24 [kernel] [ 2151.702897] ata5: soft resetting port
Jan 3 23:14:32 [kernel] [ 2159.058493] ata5: soft resetting port
Jan 3 23:30:16 [kernel] [ 3102.451425] ata5: soft resetting port
Jan 3 23:30:24 [kernel] [ 3110.606082] ata5: soft resetting port
Jan 3 23:46:08 [kernel] [ 4052.923316] ata5: soft resetting port
Jan 3 23:46:15 [kernel] [ 4060.694420] ata5: soft resetting port
Jan 4 00:01:59 [kernel] [ 5003.394691] ata5: soft resetting port
Jan 4 00:02:07 [kernel] [ 5011.129828] ata5: soft resetting port
Jan 4 00:17:51 [kernel] [ 5953.709660] ata5: soft resetting port
Jan 4 00:17:59 [kernel] [ 5961.932235] ata5: soft resetting port
Jan 4 00:32:54 [kernel] [ 6855.731626] ata5: soft resetting port
Jan 4 00:33:02 [kernel] [ 6863.698527] ata5: soft resetting port
One weird thing: if I use ata8 instead of ata5 through ata7, I don't get two in a row; I only get:
Jan 3 22:59:18 [kernel] [ 1246.450438] ata8: soft resetting port
Jan 3 23:15:09 [kernel] [ 2196.707464] ata8: soft resetting port
Jan 3 23:31:02 [kernel] [ 3147.703638] ata8: soft resetting port
Jan 3 23:46:53 [kernel] [ 4098.143576] ata8: soft resetting port
Jan 4 00:02:44 [kernel] [ 5048.550923] ata8: soft resetting port
Jan 4 00:18:37 [kernel] [ 5999.561120] ata8: soft resetting port
Jan 4 00:33:40 [kernel] [ 6901.994591] ata8: soft resetting port
even though it has the same number of hard disks on the cable as the other ones.
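The ~15-minute spacing of those resets can be checked mechanically from the bracketed kernel uptimes. A small sketch, with sample lines copied from the excerpt above (the parsing is illustrative, not a general syslog parser):

```python
import re

# Pull the kernel uptime out of lines like
#   "Jan  3 22:58:33 [kernel] [ 1201.277748] ata5: soft resetting port"
RESET_RE = re.compile(r'\[\s*([0-9.]+)\] (ata\d+): soft resetting port')

def reset_intervals(lines, port):
    """Seconds between consecutive soft resets of one ATA port."""
    times = []
    for line in lines:
        m = RESET_RE.search(line)
        if m and m.group(2) == port:
            times.append(float(m.group(1)))
    return [round(b - a, 1) for a, b in zip(times, times[1:])]

log = [
    "Jan  3 22:58:33 [kernel] [ 1201.277748] ata5: soft resetting port",
    "Jan  3 23:14:24 [kernel] [ 2151.702897] ata5: soft resetting port",
    "Jan  3 23:30:16 [kernel] [ 3102.451425] ata5: soft resetting port",
]
print(reset_intervals(log, "ata5"))  # intervals of roughly 950 s, i.e. ~15 min
```

An interval of roughly 950 seconds matching the `watch -n 900` cadence (plus the EH delay) would point straight at the periodic hddtemp run.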
[ 98.269159] ata5: PATA max UDMA/133 cmd 0xf88697c0 ctl 0xf8869fda bmdma 0xf8869000 irq 16
[ 98.292103] ata6: PATA max UDMA/133 cmd 0xf88695c0 ctl 0xf8869dda bmdma 0xf8869008 irq 16
[ 98.559891] ata5.00: ATA-7: ST3250823A, 3.03, max UDMA/100
[ 98.583778] ata5.01: ATA-6: ST3200822A, 3.01, max UDMA/100
[ 98.619719] ata5.00: configured for UDMA/100
[ 98.647748] ata5.01: configured for UDMA/100
[ 98.883502] ata6.00: ATA-7: ST3250823A, 3.02, max UDMA/100
[ 98.906139] ata6.01: ATA-6: ST3200822A, 3.01, max UDMA/100
[ 98.943331] ata6.00: configured for UDMA/100
[ 98.967357] ata6.01: configured for UDMA/100
[ 99.767194] ata7: PATA max UDMA/100 cmd 0xf88717c0 ctl 0xf8871fda bmdma 0xf8871000 irq 20
[ 99.789555] ata8: PATA max UDMA/100 cmd 0xf88715c0 ctl 0xf8871dda bmdma 0xf8871008 irq 20
[ 100.054144] ata7.00: ATA-7: ST3300831A, 3.06, max UDMA/100
[ 100.077306] ata7.01: ATA-6: ST3200822A, 3.01, max UDMA/100
[ 100.106086] ata7.00: configured for UDMA/100
[ 100.130109] ata7.01: configured for UDMA/100
[ 100.365760] ata8.00: ATA-7: ST3300831A, 3.06, max UDMA/100
[ 100.387634] ata8.01: ATA-7: ST3300831A, 3.06, max UDMA/100
I don't understand that part, but I see that it usually does this at about 15-minute intervals, which fits with my hddtemp script from above... So this may be a spindown/spinup issue. I'll try setting it to run once every minute and see what that does. Starting that now; I'll report back in an hour, unless I'm sleeping, in which case I'll report back when I wake up.
Sincerely
Tobias
Is the cmd always b0, as it is in the last detailed output you posted? That's a SMART command. Nothing in the kernel issues those itself, so it would have to be some daemon or something that is doing it.
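For reference, the reason "cmd b0/..." points at SMART: in a libata taskfile dump the first byte is the ATA command opcode and the byte after the slash is the feature register, which for opcode 0xB0 selects the SMART subcommand. A quick decoder sketch (subcommand names per the ATA spec; the helper itself is illustrative, not part of any tool mentioned here):

```python
# Per the ATA spec, command 0xB0 is SMART; the feature register selects
# the subcommand. Only a subset of subcommand names is listed.
SMART_FEATURES = {
    0xd0: "SMART READ DATA",
    0xd8: "SMART ENABLE OPERATIONS",
    0xd9: "SMART DISABLE OPERATIONS",
    0xda: "SMART RETURN STATUS",
}

def decode_cmd(taskfile):
    """Decode the 'b0/d8:00:...' portion of a libata 'cmd' line."""
    head, rest = taskfile.split("/", 1)
    opcode = int(head, 16)
    feature = int(rest.split(":", 1)[0], 16)
    if opcode == 0xb0:
        return SMART_FEATURES.get(feature, "SMART (subcommand 0x%02x)" % feature)
    return "non-SMART command 0x%02x" % opcode

print(decode_cmd("b0/d8:00:00:4f:c2/00:00:00:00:00/00"))  # SMART ENABLE OPERATIONS
print(decode_cmd("25/00:78:bf:0d:d4/00:00:21:00:00/e0"))  # non-SMART command 0x25
```

Note also the 4f:c2 bytes in the dump: SMART requires LBA mid/high of 0x4F/0xC2, which is consistent with this reading.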
It's always b0, but that would be explained by hddtemp, which gets the temperature via SMART commands. OK, it should have triggered by now; it hasn't:
Jan 4 01:06:50 [kernel] [ 8889.284639] ata5: soft resetting port
Jan 4 01:06:58 [kernel] [ 8897.375369] ata5: soft resetting port
Jan 4 01:07:04 [kernel] [ 8903.216683] ata6: soft resetting port
Jan 4 01:07:11 [kernel] [ 8910.843946] ata6: soft resetting port
Jan 4 01:07:19 [kernel] [ 8918.778857] ata7: soft resetting port
Jan 4 01:07:27 [kernel] [ 8926.210346] ata7: soft resetting port
Jan 4 01:07:35 [kernel] [ 8934.796580] ata8: soft resetting port
I set hddtemp to run once per minute instead of once every 15 min at 01:10. But I noticed this:
Jan 4 01:07:35 [kernel] [ 8934.736576] ata8: port is slow to respond, please be patient (Status 0xd0)
Jan 4 01:07:35 [kernel] [ 8934.796580] ata8: soft resetting port
which, I guess, would explain why ata8 only shows up once instead of twice. Just to make sure I'm not missing anything stupid, here is everything matching "ata" from within the last 40 minutes:
Jan 4 01:06:47 [kernel] [ 8886.987293] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:06:47 [kernel] [ 8886.991849] ata5.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Jan 4 01:06:50 [kernel] [ 8889.284639] ata5: soft resetting port
Jan 4 01:06:50 [kernel] [ 8889.812386] ata5.00: configured for UDMA/100
Jan 4 01:06:51 [kernel] [ 8890.337799] ata5.01: configured for UDMA/100
Jan 4 01:06:51 [kernel] [ 8890.339850] ata5: EH complete
Jan 4 01:06:54 [kernel] [ 8893.895373] ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:06:54 [kernel] [ 8893.901204] ata5.01: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/10 tag 0 cdb 0x0 data 0
Jan 4 01:06:58 [kernel] [ 8897.375369] ata5: soft resetting port
Jan 4 01:06:58 [kernel] [ 8897.615490] ata5.00: configured for UDMA/100
Jan 4 01:06:58 [kernel] [ 8897.635505] ata5.01: configured for UDMA/100
Jan 4 01:06:58 [kernel] [ 8897.635532] ata5: EH complete
Jan 4 01:07:01 [kernel] [ 8900.835424] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:01 [kernel] [ 8900.840928] ata6.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Jan 4 01:07:04 [kernel] [ 8903.216683] ata6: soft resetting port
Jan 4 01:07:04 [kernel] [ 8903.736463] ata6.00: configured for UDMA/100
Jan 4 01:07:05 [kernel] [ 8904.262569] ata6.01: configured for UDMA/100
Jan 4 01:07:05 [kernel] [ 8904.263878] ata6: EH complete
Jan 4 01:07:08 [kernel] [ 8907.699561] ata6.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:08 [kernel] [ 8907.705009] ata6.01: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/10 tag 0 cdb 0x0 data 0
Jan 4 01:07:11 [kernel] [ 8910.843946] ata6: soft resetting port
Jan 4 01:07:11 [kernel] [ 8911.084121] ata6.00: configured for UDMA/100
Jan 4 01:07:11 [kernel] [ 8911.104073] ata6.01: configured for UDMA/100
Jan 4 01:07:11 [kernel] [ 8911.104108] ata6: EH complete
Jan 4 01:07:15 [kernel] [ 8914.339957] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:15 [kernel] [ 8914.345418] ata7.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Jan 4 01:07:19 [kernel] [ 8918.778857] ata7: soft resetting port
Jan 4 01:07:20 [kernel] [ 8919.306731] ata7.00: configured for UDMA/100
Jan 4 01:07:20 [kernel] [ 8919.832337] ata7.01: configured for UDMA/100
Jan 4 01:07:20 [kernel] [ 8919.833975] ata7: EH complete
Jan 4 01:07:24 [kernel] [ 8923.289705] ata7.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:24 [kernel] [ 8923.295121] ata7.01: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/10 tag 0 cdb 0x0 data 0
Jan 4 01:07:27 [kernel] [ 8926.210346] ata7: soft resetting port
Jan 4 01:07:27 [kernel] [ 8926.458522] ata7.00: configured for UDMA/100
Jan 4 01:07:27 [kernel] [ 8926.478588] ata7.01: configured for UDMA/100
Jan 4 01:07:27 [kernel] [ 8926.478616] ata7: EH complete
Jan 4 01:07:30 [kernel] [ 8929.682383] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 4 01:07:30 [kernel] [ 8929.687838] ata8.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Jan 4 01:07:35 [kernel] [ 8934.736576] ata8: port is slow to respond, please be patient (Status 0xd0)
Jan 4 01:07:35 [kernel] [ 8934.796580] ata8: soft resetting port
Jan 4 01:07:35 [kernel] [ 8935.036695] ata8.00: configured for UDMA/100
Jan 4 01:07:35 [kernel] [ 8935.056804] ata8.01: configured for UDMA/100
Jan 4 01:07:35 [kernel] [ 8935.056834] ata8: EH complete
I don't think I've missed anything this time around. I still don't really get the ata8 thing, but then again, I'm not sure it is really the same issue. I know that on ata8 one of the disks, sdl, should never spin down (it is the RAID mirror of the OS), while sdk should spin down (it is just a data RAID-5 disk, which may not be used all the time). Though I fail to see how one of the disks not spinning down would result in a "slow to respond", if that is even relevant...
In any case, since it should have triggered at least 3 times (because of the 15-min thing from my prior comment), I will stop the script running hddtemp and let it sit overnight. Then I'll see what happens with a normal ls tomorrow, and then with hddtemp again tomorrow. Sincerely Tobias
OK, a normal spin-up does NOT cause this on my system. I will try re-initiating the hddtemp runs (the SMART commands) at decreasing intervals and see if that fits my spindown idea (if it does, it should hit at about 5-minute intervals at the lowest, which is the limit for my drives to spin down). But normal disk transfer after spindown does not cause this for me. Sincerely Tobias
Tobias, it has nothing to do with spindown. IIRC, it's some userland daemon issuing an invalid command, which can be fixed by updating the program. You won't see it if you put the machine into runlevel 1. Just find out which daemon or cron job is doing it and disable it.
Then how come it doesn't produce that error when I use the old driver?
The problem you're reporting was traced quite some time ago and I don't remember the details. ISTR that the difference was that the IDE driver doesn't complain or take any corrective action against an HSM violation. It's one of smartd, smartctl or hddtemp. I'll ask Bruce Allen whether he remembers better than me.
I believe it was hddtemp that had the bug..
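For anyone trying to confirm which SMART poller is the trigger, the correlation can be checked without guessing by counting the libata "frozen" exceptions in the kernel log before and after stopping the daemon. A minimal sketch, assuming a Debian-style init script and /var/log/kern.log (both are examples, not taken from this report):

```shell
#!/bin/sh
# Count libata "frozen" exceptions in a kernel log file. Run it once with
# hddtemp/smartd active and once after stopping them; if the count stops
# growing, a SMART poller is the trigger.
count_ata_exceptions() {
    grep -c 'exception Emask .* frozen' "$1"
}
```

Usage sketch (as root, hardware-dependent, with the ~15 min interval Tobias mentioned):

```shell
/etc/init.d/hddtemp stop
before=$(count_ata_exceptions /var/log/kern.log)
sleep 900
after=$(count_ata_exceptions /var/log/kern.log)
[ "$after" -eq "$before" ] && echo "no new exceptions without hddtemp"
```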
Oki, in that case, please disregard me :D Sorry for the noise. Sincerely, Tobias Ussing
(In reply to comment #41)
> [...] Also, I don't think there is any relationship to the 2.6.22
> to 2.6.23 change since Fedora only enabled libata for PATA disks in FC7. Quite
> likely the libata Promise driver had this problem previously as well.

The machine used FC6 and was very reliable under heavy load for months at a time. It never used F7. The problems are related to the F8 upgrade, without any hardware changes. Thanks for the insights.
FYI: Reverting to the hd-style drivers fixes the sd-style driver's "ata ... frozen" messages & soft resets -- although there are some "SeekComplete" errors reported by the hd* device, the raid5 driver sails through them, apparently without the effects of freezing or resets:

Jan 13 12:32:33 fw kernel: hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 13 12:32:33 fw kernel: hdi: dma_intr: error=0x40 { UncorrectableError }, LBAsect=226841460, high=13, low=8737652, sector=226841455
Jan 13 12:32:33 fw kernel: end_request: I/O error, dev hdi, sector 226841455
[...]
Jan 13 12:32:34 fw kernel: raid5:md0: read error corrected (8 sectors at 226841456 on hdi1)
Uncorrectable error is a failed disk. The timeouts for old and new ATA layer are different as are the messages printed. Looks like your drive takes a very long time to fail the I/O, far longer than libata was prepared to wait around with the machine stuck. You probably want to run a verify on the drive and it looks like libata may need to use slightly longer I/O timeouts to cope with such failure events.
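For the verify Alan suggests, smartmontools can drive it. A sketch that only prints the commands (pipe the output to sh as root to actually run them); /dev/hdi is the device from the log above, substitute your own:

```shell
#!/bin/sh
# Print the smartmontools commands for a long (surface) self-test on a
# suspect drive. Nothing is executed here, so a typo cannot touch a disk.
verify_drive_cmds() {
    dev=${1:-/dev/hdi}
    echo "smartctl -t long $dev"      # start the offline surface scan
    echo "smartctl -l selftest $dev"  # check the result once it finishes
    echo "smartctl -A $dev"           # pending/reallocated sector counts
}

verify_drive_cmds "$@"
```

The long self-test makes the drive read its whole surface itself, so a sector that takes "a very long time to fail" shows up in the self-test log instead of hanging an I/O under libata.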
Alan, timeout for reads and writes is 30secs. I don't really think we should go higher than that.
Hello, sorry for the long delay. I bring logs from another testing PC.

ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x2400000 action 0x2 frozen
ata1.00: cmd 35/00:00:20:17:06/00:04:01:00:00/e0 tag 0 cdb 0x0 data 524288 out
         res 51/04:01:1f:18:06/00:03:01:00:00/e0 Emask 0x10 (ATA bus error)
ata1: hard resetting port
ata2: exception Emask 0x10 SAct 0x0 SErr 0x80000 action 0x2 frozen
ata2: hard resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: configured for UDMA/100
ata2: EH complete
Created attachment 14567 [details] Another PC log - dmesg + kern.log
I did not test the other PC very much; when the first exception Emask occurred, I stopped testing. That means, to me, that my HW is OK.

Steps to reproduce:
1) dd if=/dev/sdd of=/dev/sda bs=16384
2) wait cca five minutes
3) mount -t ext3 /dev/sdc1 /media/
4) tree /media
5) mount -t ext3 /dev/sdb1 /mnt/
6) tree /mnt
7) exception Emask 0x10 SAct 0x0 SErr ...
*) if the exception does not appear, run tree /mnt more than once
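The steps above can be wrapped in a script for repeated runs. A sketch, using the reporter's device names; since the dd overwrites /dev/sda, every command is only echoed by default (the RUN=echo mechanism is my addition) -- clear RUN and run as root to execute for real:

```shell
#!/bin/sh
# Reproduction steps from the report as a script. Device names are the
# reporter's; dd destroys /dev/sda, so commands are echoed unless RUN is
# cleared (RUN= ./repro.sh as root).
RUN=${RUN:-echo}

repro() {
    $RUN dd if=/dev/sdd of=/dev/sda bs=16384 &  # step 1, runs in background
    $RUN sleep 300                              # step 2: wait cca five minutes
    $RUN mount -t ext3 /dev/sdc1 /media/        # steps 3-4
    $RUN tree /media
    $RUN mount -t ext3 /dev/sdb1 /mnt/          # steps 5-6
    # step *): run tree /mnt more than once if the exception does not appear
    for i in 1 2 3; do
        $RUN tree /mnt
    done
}

repro
```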
But I have some progress. The hell in my case is the Promise controllers, not the cheaper SiI. I was inspired by this article:

http://osdir.com/ml/netbsd.ports.sparc64/2006-08/msg00013.html

Of course, I am on an x86 CPU, not SPARC, but I have tested NetBSD and I have seen quite similar problems. Would you mind saying this sentence quietly? Promise HW is very unstable or undocumented, and all drivers die on undocumented exceptions, as in NetBSD. Now my system is stable without the Promise controllers. I only had to disable the APIC functions in the BIOS (EPoX 8RDA3+), because NetBSD won't boot without that, and Linux loses interrupts on the IDE/SATA controllers and works in PIO modes, not UDMA modes.
NPetr, the failure from the sata_via is an ATA bus error, and the SError register indicates an unrecognized FIS and a handshake error. SATA links are much more susceptible to these error conditions than PATA ones are, and transmission errors occur on perfectly healthy systems once in a blue moon. So, unless it happens often, it doesn't really matter. As for the Promise comment, I guess Mikael has something to say. :-)
Tejun, you are right, but in my case I have this loop:

1. exception Emask
2. system freezes for 20 s
3. system works for 10 s
4. GOTO 1

It is a pity that the other old logs from the other PC disappeared; we cannot see all the exception Emasks :(. I trashed the Promise cards. Now I have 2x Axago four-port controllers (SiI 3124 PCI) and one EPoX 8RDA3+ two-port onboard controller (SiI 3112). No more exception Emask till now.

P.S. I heard that one man (I cannot remember the URL) resolved these problems by buying an Intel chipset motherboard and new Western Digital RAID edition SATA disks. But I do not want to believe that Promise works correctly only on Intel MBs and WDC RAID disks.
I think the first thing to do is to isolate the problem - i.e. test only w/ the sata_promise controller, reproduce the problem, and try to debug that. Sigh... Debugging problems on promise and nv seems really difficult. Mikael, do you have a contact at Promise?
(In reply to comment #62)
> Created an attachment (id=14567) [details]
> Another PC log - dmesg + kern.log

What was this log supposed to show? All I saw was sata_sil in kernel 2.6.22 driving two Samsungs and getting an ATA exception on each port. And in the same kernel, sata_promise driving another Samsung but not getting any errors. All this shows is that something other than sata_promise in this system is having problems, be it the disks, cables, PSU, mainboard, chipset, or even the libata core.
(In reply to comment #64)
> But I have some progress. The hell are Promise controllers in my case, not
> cheaper SiI. I am inspired by this article:
>
> http://osdir.com/ml/netbsd.ports.sparc64/2006-08/msg00013.html
>
> Of course, I am on X86 CPU, not Sparc, but I have tested NetBSD and I have
> seen quite simillar problems. Would you mind saying this sentece silently?
> Promise HW is very unstable or undocumented and all drivers dies on
> undocumented exceptions as in NetBSD.

Problems on non-x86 can be expected because the x86 BIOS does some initialisations that at least the Linux driver did not use to do. I fixed that in the Linux driver when I initially tried to get a 2nd gen Promise SATA card working in an UltraSPARC (Ultra5) machine. Nothing prevents other operating systems from doing the same. Promise's public SATA HW documentation is however very, very poor; we get register definitions but /nothing/ about the programming model or quirks/bugs.
(In reply to comment #66)
> P.S. I heared that one man (can not remeber the URL) resolve these problems
> with buy Intel chipset motherboard and new Western raid edition SATA disks.
> But I do not want to believe that, Promise works correctly only on Intel MBs
> and WDC raid disks.

Not true. I've used Promise SATA in production on a leafnode news server for years, most of the time using standard Seagate disks, and for the last year on a VIA K8T800 chipset mainboard. What matters is the quality of the system components.
(In reply to comment #67)
> I think the first thing to do is isolating the problem - ie. testing only w/
> sata_promise controller, reproduce the problem and try to debug that. Sigh...
> Debugging problems on promise and nv seems really difficult. Mikael, do you
> have a contact with promise?

No, only Jeff seems to have any contact with them.
Experienced something similar with sata_via here:

[ 22.665005] SCSI subsystem initialized
[ 22.667119] Driver 'sd' needs updating - please use bus_type methods
[ 22.680498] libata version 3.00 loaded.
[ 22.682447] sata_via 0000:00:0b.0: version 2.3
[ 22.682471] PCI: Enabling device 0000:00:0b.0 (0000 -> 0001)
[ 22.682488] ACPI: PCI Interrupt 0000:00:0b.0[A] -> GSI 19 (level, low) -> IRQ 16
[ 22.682609] sata_via 0000:00:0b.0: routed to hard irq line 0
[ 22.682797] scsi0 : sata_via
[ 22.682885] scsi1 : sata_via
[ 22.682951] scsi2 : sata_via
[ 22.683009] ata1: SATA max UDMA/133 port i16@0xa00 bmdma 0xcc00 irq 16
[ 22.683013] ata2: SATA max UDMA/133 port i16@0x1000 bmdma 0xcc08 irq 16
[ 22.683016] ata3: PATA max UDMA/133 port i16@0x100 bmdma 0xcc10 irq 16
[ 22.991163] ata1: SATA link down (SStatus 0 SControl 310)
[ 28.350660] ata2: port is slow to respond, please be patient (Status 0xff)
[ 33.054225] ata2: device not ready (errno=-16), forcing hardreset
[ 39.241653] ata2: port is slow to respond, please be patient (Status 0xff)
[ 43.105295] ata2: COMRESET failed (errno=-16)
[ 49.292722] ata2: port is slow to respond, please be patient (Status 0xff)
[ 53.156364] ata2: COMRESET failed (errno=-16)
[ 59.343792] ata2: port is slow to respond, please be patient (Status 0xff)
[ 88.181125] ata2: COMRESET failed (errno=-16)
[ 93.192661] ata2: COMRESET failed (errno=-16)
[ 93.192714] ata2: reset failed, giving up

The connected HDD won't work at all. This is a

00:0b.0 RAID bus controller [0104]: VIA Technologies, Inc. VT6421 IDE RAID Controller [1106:3249] (rev 50)
	Subsystem: VIA Technologies, Inc. VT6421 IDE RAID Controller [1106:3249]
	Flags: bus master, medium devsel, latency 32, IRQ 16
	I/O ports at 0a00 [size=16]
	I/O ports at 1000 [size=16]
	I/O ports at 0100 [size=16]
	I/O ports at 1010 [size=16]
	I/O ports at cc00 [size=32]
	I/O ports at 8c00 [size=1K]
	[virtual] Expansion ROM at 30000000 [disabled] [size=64K]
	Capabilities: <access denied>
Can you please try irqpoll?
Oops, no. Please don't try that. Can you please file a separate bug report and cc me?
Same here with a VIA chipset, and also with an Nvidia one with ACPI disabled:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd 25/00:08:85:75:25/00:00:25:00:00/e0 tag 0 dma 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd 25/00:20:6d:ae:1d/00:00:24:00:00/e0 tag 0 dma 16384 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Matteo filed bug report #10738 for comment #75. Let's track it there.