Most recent kernel where this bug did not occur: None that I know of; I tried several kernels from 2.6.8 up to 2.6.14-rc1-mm1, all exposing the same behavior.

Distribution: Debian GNU/Linux 3.1r0a/x86_64 (stable)

Hardware Environment:
dmesg: http://noth.ceid.upatras.gr/Computers/dmesgs/dmesg.atlas
lspci -vv: http://noth.ceid.upatras.gr/Misc/lspci-vv.atlas
hdparm -I: http://noth.ceid.upatras.gr/Misc/hdparm-I.atlas

Software Environment: Debian GNU/Linux 3.1r0a/x86_64 plus Debian security updates, as of Sep 20 2005. Kernel version 2.6.14-rc1-mm1 (but the problem also occurs with many kernels all the way back to the default 2.6.8 shipped with the distribution).

Problem Description:
This machine has 6 identical Western Digital 80 GB Serial ATA disks (model number WD800JD-00JNC0) and 3 SATA controllers: an on-board 2-port Promise PDC20378, an on-board 2-port VIA VT6420, and a PCI-card 2-port SiI3112. The 6 disks participate in a RAID-5 array (5 active, 1 hot-spare). However, the array is virtually unusable, since I start getting a ton of this in the syslog the moment the assembly/reconstruction starts:

Sep 20 07:56:29 Atlas kernel: ata4: status=0x51 { DriveReady SeekComplete Error }
Sep 20 07:56:29 Atlas kernel: ata4: error=0x04 { DriveStatusError }
Sep 20 07:56:42 Atlas kernel: ata5: status=0x51 { DriveReady SeekComplete Error }
Sep 20 07:56:42 Atlas kernel: ata5: error=0x04 { DriveStatusError }

After a while it deteriorates to:

ata4: translated ATA stat/err 0xc8/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata4: status=0xc8 { Busy }
ata5: translated ATA stat/err 0xc8/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata5: status=0xc8 { Busy }

I've also tried using the two disks separately (i.e. not as parts of an md array). Mounting one of them and running bonnie++ on it produced:

ata4: translated ATA stat/err 0xca/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata4: status=0xca { Busy }

which is also reproducible with the ata5 disk. The devices causing the trouble are the disks attached to the SiI3112 card.
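For anyone skimming these log lines: the hex values are the raw ATA status register, decoded the way libata's status-dumping code does it. A small illustrative Python sketch (not from the kernel, just mirroring the bit names) shows why 0x51 reads as three flags while 0xc8 collapses to just "Busy":

```python
# Illustrative decoder for the ATA status byte seen in the log lines above.
# When BSY (0x80) is set, the other status bits are not valid, so libata
# reports only "Busy" -- which is why 0xc8 (BSY|DRDY|DRQ) prints as { Busy }.
ATA_STATUS_BITS = [
    (0x40, "DriveReady"),      # DRDY
    (0x20, "DeviceFault"),     # DF
    (0x10, "SeekComplete"),    # DSC
    (0x08, "DataRequest"),     # DRQ
    (0x04, "CorrectedError"),  # CORR
    (0x02, "Index"),           # IDX
    (0x01, "Error"),           # ERR
]

def decode_status(st):
    if st & 0x80:              # BSY masks everything else
        return "Busy"
    return " ".join(name for bit, name in ATA_STATUS_BITS if st & bit)

print(decode_status(0x51))  # DriveReady SeekComplete Error
print(decode_status(0xc8))  # Busy
print(decode_status(0xca))  # Busy
```

(The accompanying error=0x04 is the ATA ABRT bit, i.e. the drive aborted the command; libata labels that bit DriveStatusError.)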
I've tried adding the disks to sata_sil.c's blacklists (MOD15WRITE, UDMA5MAX, both) without any improvement. The disks are both in excellent shape, as verified by WD's Data Lifeguard Diagnostics tool.

This machine is to go into production in 10 days or so. If I don't get this fixed in the next few days, I'll swap the SiI3112 for a new Promise card, so I will only be able to test patches for another 3-4 days. Sorry...

Steps to reproduce:
This could be a controller-drive combination error, although I've seen others complaining about sata_sil-related problems lately. I can't really say this problem is tied to any single component, so the only way I can propose to reproduce it is by hooking a WD800JD-00JNC0 up to a Sunix SiI3112 controller and stressing it a bit.
Some more stressing resulted in:

Sep 21 02:19:29 Atlas kernel: sd 5:0:0:0: SCSI error: return code = 0x8000002
Sep 21 02:19:29 Atlas kernel: sdd: Current: sense key: Aborted Command
Sep 21 02:19:29 Atlas kernel: Additional sense: Scsi parity error
Sep 21 02:19:29 Atlas kernel: Info fld=0x5314f
Sep 21 02:19:29 Atlas kernel: end_request: I/O error, dev sdd, sector 340303
Sep 21 02:19:29 Atlas kernel: R5: read error not correctable.
Sep 21 02:19:29 Atlas kernel: ATA: abnormal status 0xD0 on port 0xFFFFC200000060C7

and a lock-up.
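For reference, the "return code = 0x8000002" value is the 2.6 SCSI midlayer's packed result word (driver, host, msg, status bytes from high to low). A quick illustrative unpacking, assuming that layout:

```python
# Unpack the 32-bit SCSI result word from the "return code = 0x8000002" line,
# using the 2.6 SCSI midlayer's byte layout: driver | host | msg | status.
def unpack_scsi_result(result):
    return {
        "driver_byte": (result >> 24) & 0xff,  # 0x08 = DRIVER_SENSE
        "host_byte":   (result >> 16) & 0xff,  # 0x00 = DID_OK
        "msg_byte":    (result >> 8) & 0xff,
        "status_byte": result & 0xff,          # 0x02 = CHECK CONDITION
    }

print(unpack_scsi_result(0x08000002))
```

So the driver returned valid sense data with a CHECK CONDITION status, which is consistent with the Aborted Command sense key on the next line and with the earlier "translated ... to SCSI SK/ASC/ASCQ 0xb/47/00" messages (sense key 0xb = ABORTED COMMAND, ASC 0x47 = SCSI parity error).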
Only to get worse, when I tried to halt after the previous error:

Unable to handle kernel paging request at ffff82bc81000030
RIP: <ffffffff8015ccb6>{free_block+102}
PGD 0
Oops: 0000 [1]
CPU 0
Modules linked in: parport_pc lp parport capability commoncap ipt_ECN ipt_TOS ipt_limit ipt_REJECT ipt_ULOG ipt_state ipt_pkttype ipt_recent ipt_iprange ipt_multiport ipt_conntrack iptable_mangle ip_nat_irc ip_nat_tftp ip_nat_ftp iptable_nat ip_conntrack_irc ip_conntrack_tftp ip_conntrack_ftp ip_conntrack iptable_filter ip_tables md5 ipv6 sg uhci_hcd ohci_hcd ehci_hcd usbcore sata_via sata_sil sata_promise libata sk98lin dm_mod psmouse
Pid: 4, comm: events/0 Not tainted 2.6.14-Atlas #1
RIP: 0010:[<ffffffff8015ccb6>] <ffffffff8015ccb6>{free_block+102}
RSP: 0018:ffff810002f59df8  EFLAGS: 00010006
RAX: ffff82bc81000000 RBX: ffff81007f417940 RCX: 0000000000000001
RDX: 0000003790000000 RSI: ffff81007f666a10 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffff8015d520 R09: 0000000000000000
R10: 000000000000000e R11: 0000000000000028 R12: 0000000000000001
R13: ffff81007f666a10 R14: ffff81007f7009e0 R15: ffffffff8043e6f0
FS: 00002aaaaae00640(0000) GS:ffffffff80472800(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff82bc81000030 CR3: 0000000078b8f000 CR4: 00000000000006e0
Process events/0 (pid: 4, threadinfo ffff810002f58000, task ffff810002f40060)
Stack: ffff810002f59e78 ffff81007f666a10 ffff81007f666a00 0000000000000001
       ffff81007f417940 ffffffff8015d4cf ffff81007f66bd40 0000000000000003
       ffff81007f4179a8 ffffffff8015d598
Call Trace:
<ffffffff8015d4cf>{drain_array_locked+111} <ffffffff8015d598>{cache_reap+120}
<ffffffff80140d40>{worker_thread+416} <ffffffff8012dc60>{default_wake_function+0}
<ffffffff8012dc60>{default_wake_function+0} <ffffffff80140ba0>{worker_thread+0}
<ffffffff80140ba0>{worker_thread+0} <ffffffff80144e38>{kthread+136}
<ffffffff8010f2d2>{child_rip+8} <ffffffff80140ba0>{worker_thread+0}
<ffffffff80144db0>{kthread+0} <ffffffff8010f2ca>{child_rip+0}

Code: 48 8b 70 30 48 8b 56 08 48 0f b7 46 30 48 39 32 48 8b 4c c3
RIP <ffffffff8015ccb6>{free_block+102} RSP <ffff810002f59df8>
CR2: ffff82bc81000030

(Sorry for posting these in separate chunks... I'm frantically trying to sort things out and just don't seem to be getting anywhere near solving this issue...)
Just another data point: 2.6.13-rc7-libata1 managed to get to almost 90% of reconstructing the 320 GB RAID-5 array with just a couple of 'status=0xc8' errors, only to lock up hard with the same error as in http://bugzilla.kernel.org/show_bug.cgi?id=5279#c1 after nearly 40 minutes of disk activity :(

In the meanwhile, I've also tried all possible combinations of 'acpi=off', 'pci=noacpi', 'noapic', 'pci=routeirq', 'pci=usepirqmask', and 'idebus=66', as well as hooking up other (though identical to the current) disks on the SiI3112, moving it to different PCI slots, changing cables, reserving IRQs from the BIOS to force sata_sil into choosing different ones, and dancing the Fandango on the keyboard. I intended to check with a 2.4 kernel, but that seems impossible with Debian 3.1r0a/x86_64 (I've filed a bug with the Debian BTS but haven't gotten the bug # yet). Is there anything I can do to narrow this down?
I no longer have either the machine or the controller, so I can't reproduce this issue or provide any further insight. Since this bug report has lain dormant all this time, I think it's safe and sane to close it.