Bug 5279 - SiI3112 problems under load with 2.6.14-rc1-mm1
Summary: SiI3112 problems under load with 2.6.14-rc1-mm1
Status: REJECTED UNREPRODUCIBLE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA (show other bugs)
Hardware: i386 Linux
: P2 blocking
Assignee: Jeff Garzik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-09-20 15:27 UTC by Nikos Ntarmos
Modified: 2005-11-23 21:54 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.14-rc1-mm1
Subsystem:
Regression: ---
Bisected commit-id:


Attachments

Description Nikos Ntarmos 2005-09-20 15:27:39 UTC
Most recent kernel where this bug did not occur:
None that I know of; tried several kernels from 2.6.8 up to 2.6.14-rc1-mm1, all
exposing the same behavior.
Distribution:
Debian GNU/Linux 31r0a/x86_64 (stable)
Hardware Environment:

dmesg: http://noth.ceid.upatras.gr/Computers/dmesgs/dmesg.atlas
lspci -vv: http://noth.ceid.upatras.gr/Misc/lspci-vv.atlas
hdparm -I: http://noth.ceid.upatras.gr/Misc/hdparm-I.atlas

Software Environment:

Debian GNU/Linux 31roa/x86_64 plus Debian security updates, as of Sep 20 2005.
Kernel version 2.6.14-rc1-mm1 (but also occurs with many kernels all the way
back to the default 2.6.8 shipped with the distribution).

Problem Description:

This machine has 6 identical Western Digital 80Gb serial ata disks (model number
WD800JD-00JNC0) and 3 SATA controllers: an on-board 2-port Promise PDC20378, an
on-board 2-port VIA VT6420, and a PCI-card 2-port SiI-3112. These 6 disks
participate in a raid-5 array (5 active, 1 hot-spare). However, the array is
virtually unusable, since I start getting a tone of this in the syslog the
moment the assembly/reconstruction starts:

Sep 20 07:56:29 Atlas kernel: ata4: status=0x51 { DriveReady SeekComplete Error }
Sep 20 07:56:29 Atlas kernel: ata4: error=0x04 { DriveStatusError }
Sep 20 07:56:42 Atlas kernel: ata5: status=0x51 { DriveReady SeekComplete Error }
Sep 20 07:56:42 Atlas kernel: ata5: error=0x04 { DriveStatusError }

After a while it deteriorates to:

ata4: translated ATA stat/err 0xc8/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata4: status=0xc8 { Busy }
ata5: translated ATA stat/err 0xc8/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata5: status=0xc8 { Busy }

I've also tried using the two disks separately (i.e. not as parts of a md
array). Mounting one of them and running bonnie++ on it produced:

ata4: translated ATA stat/err 0xca/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata4: status=0xca { Busy }

which is also reproducible with the ata5 disk.

The devices that woe are the disks attached to the Sil3112 card. I've tried
adding the disks to the sata_sil.c's blacklists (MOD15WRITE, UDMA5MAX, both)
without any improvement. The disks are both in excellent shape, as verified by
WD's Data Lifeguard Diagnostics tool.

This machine is to go into production mode in 10 days or so. If I don't get this
thing fixed in the next few days, I'll swap the SiI3112 for a new Promise card.
As of this, I will only be able to test patches and stuff for another 3-4 days.
Sorry...

Steps to reproduce:

This could be a controller-drive combination error, although I've seen others
complaining about sata_sil-related problems lately. I can't really say this
problem is pertinent to any of its components, so the only way I can propose ot
reproduce this is by hooking a WD800JD-00JNC0 on a Sunix SiI3112 controller and
stressing it a bit.
Comment 1 Nikos Ntarmos 2005-09-20 16:22:26 UTC
Some more stressing resulted in:

Sep 21 02:19:29 Atlas kernel: sd 5:0:0:0: SCSI error: return code = 0x8000002
Sep 21 02:19:29 Atlas kernel: sdd: Current: sense key: Aborted Command
Sep 21 02:19:29 Atlas kernel:     Additional sense: Scsi parity error
Sep 21 02:19:29 Atlas kernel: Info fld=0x5314f
Sep 21 02:19:29 Atlas kernel: end_request: I/O error, dev sdd, sector 340303
Sep 21 02:19:29 Atlas kernel: R5: read error not correctable.
Sep 21 02:19:29 Atlas kernel: ATA: abnormal status 0xD0 on port 0xFFFFC200000060C7

and a lock-up.
Comment 2 Nikos Ntarmos 2005-09-20 16:25:39 UTC
Only to get worse, when I tried to halt after the previous error:

Unable to handle kernel paging request at ffff82bc81000030 RIP: 
<ffffffff8015ccb6>{free_block+102}                              
PGD 0                             
Oops: 0000 [1] 
CPU 0          
Modules linked in: parport_pc lp parport capability commoncap ipt_ECN ipt_TOS
ipt_limit ipt_REJECT ipt_ULOG ipt_state ipt_pkttype ipt_recent ipt_iprange
ipt_multiport ipt_conntrack iptable_mangle ip_nat_irc ip_nat_tftp ip_nat_ftp
iptable_nat ip_conntrack_irc ip_conntrack_tftp ip_conntrack_ftp ip_conntrack
iptable_filter ip_tables md5 ipv6 sg uhci_hcd ohci_hcd ehci_hcd usbcore sata_via
sata_sil sata_promise libata sk98lin dm_mod psmouse
Pid: 4, comm: events/0 Not tainted 2.6.14-Atlas #1
RIP: 0010:[<ffffffff8015ccb6>] <ffffffff8015ccb6>{free_block+102}
RSP: 0018:ffff810002f59df8  EFLAGS: 00010006                     
RAX: ffff82bc81000000 RBX: ffff81007f417940 RCX: 0000000000000001
RDX: 0000003790000000 RSI: ffff81007f666a10 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffff8015d520 R09: 0000000000000000
R10: 000000000000000e R11: 0000000000000028 R12: 0000000000000001
R13: ffff81007f666a10 R14: ffff81007f7009e0 R15: ffffffff8043e6f0
FS:  00002aaaaae00640(0000) GS:ffffffff80472800(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b                           
CR2: ffff82bc81000030 CR3: 0000000078b8f000 CR4: 00000000000006e0
Process events/0 (pid: 4, threadinfo ffff810002f58000, task ffff810002f40060)
Stack: ffff810002f59e78 ffff81007f666a10 ffff81007f666a00 0000000000000001   
       ffff81007f417940 ffffffff8015d4cf ffff81007f66bd40 0000000000000003 
       ffff81007f4179a8 ffffffff8015d598                                   
Call Trace:<ffffffff8015d4cf>{drain_array_locked+111}
<ffffffff8015d598>{cache_reap+120}
       <ffffffff80140d40>{worker_thread+416}
<ffffffff8012dc60>{default_wake_function+0}
       <ffffffff8012dc60>{default_wake_function+0}
<ffffffff80140ba0>{worker_thread+0}
       <ffffffff80140ba0>{worker_thread+0} <ffffffff80144e38>{kthread+136}
       <ffffffff8010f2d2>{child_rip+8} <ffffffff80140ba0>{worker_thread+0}
       <ffffffff80144db0>{kthread+0} <ffffffff8010f2ca>{child_rip+0}      
                                                                    
       
Code: 48 8b 70 30 48 8b 56 08 48 0f b7 46 30 48 39 32 48 8b 4c c3 
RIP <ffffffff8015ccb6>{free_block+102} RSP <ffff810002f59df8>     
CR2: ffff82bc81000030                                        

(Sorry for posting these in separate chunks... I'm frantically trying to sort
things out and just don't seem to be getting anywhere near solving this issue...)
Comment 3 Nikos Ntarmos 2005-09-21 01:48:54 UTC
Just another data-point: 2.6.13-rc7-libata1 managed to get to almost 90% of
reconstructing the 320G raid5 array with just a couple of 'status=0xc8' errors,
only to lock-up hard with the same error as in
http://bugzilla.kernel.org/show_bug.cgi?id=5279#c1 after nearly 40' of disk
activity :(

In the meanwhile, I've also tried all possible combinations of 'acpi=off',
'pci=noacpi', 'noapic', 'pci=routeirq', 'pci=usepirqmask', and 'idebus=66', as
well as hooking up other (although identical to the current) disks on the
SiI3112, moving it to different PCI slots, changing cables, reserving irqs from
the bios to force sata_sil into choosing different ones, and dancing Fandago on
the keyboard. I intended to check with a 2.4 kernel but that seems impossible
with Debian 3.1r0a/x86_64 (I've filed a bug with the Debian BTS but haven't
gotten the bug # yet).

Is there anything I can do to narrow this down?
Comment 4 Nikos Ntarmos 2005-11-23 21:54:34 UTC
I no longer have neither the machine nor the controller, so I can't reproduce
this issue or provide any further insight. Since this bug report has laid
dormant for all this time, I think it's safe and sane to close it.

Note You need to log in before you can comment on or make changes to this bug.