Bug 1888

Summary: SiI 3112 & Asus MBoard & WD Raptor cause complete hang with DMA and heavy load
Product: IO/Storage
Component: Serial ATA
Status: CLOSED CODE_FIX
Severity: high
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.1
Subsystem:
Regression: ---
Bisected commit-id:
Reporter: Joe Rutledge (joe.rutledge)
Assignee: Jeff Garzik (jgarzik)

Description Joe Rutledge 2004-01-16 04:13:12 UTC
Distribution:
Gentoo 1.4

Hardware Environment:
Asus A7N8X Deluxe (nforce2) BIOS V1007, AMD Athlon XP (Barton) 2800+, 1GB DDR
RAM (2 x 512 as Dual Channel), WD Raptor 36G SATA HD, Asus GeForce FX 5600
(256MB), Lite-On 52x CDRW, DVD-ROM.

Software Environment:
Fresh Gentoo build optimised for Athlon-XP (GCC 3.2.3 -march athlon-xp). 2.6.1
kernel patched for forcedeth and nvidia graphics card. APIC & ACPI support
removed from kernel and turned off in BIOS. Both the IDE and libata drivers have
been built into the kernel at separate times. It makes no difference what other
applications are running.

Problem Description:
Initially using a Seagate drive I was experiencing random lockups, no kernel
panic just a complete hang. Having read about issues with DMA and some Seagates
I replaced the drive with a Western Digital Raptor. However I still see the same
lockups. I've tried a variety of options to hdparm (based around -X70 -d1) none
of them making any difference to stability. I then swapped to the libata driver
expecting this to be more solid. It does appear to last a little longer than the
IDE driver but the same problems manifest themselves. I then found that there
were potentially some problems with the Asus board and shared interrupt lines to
the SiI3112 so I upgraded the BIOS to the most recent version (1007). This has
made no difference whatsoever. I also read that APIC and ACPI support could
exacerbate this problem, so I removed them from the kernel and disabled them in
the BIOS. This has given better stability but still not to the point of a usable
system. This is a desktop system and it will become locked if any heavy disk
access is done. At the moment I'm running in PIO mode as this is the only stable
way of handling the disk. I'm not doing any RAID and have no need to.
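
For readers unfamiliar with the hdparm options mentioned above: -d1 turns the
using_dma flag on (-d0 turns it off), and the -X argument encodes a transfer
mode as described in the hdparm man page (PIO mode number plus 8, multiword DMA
mode number plus 32, UltraDMA mode number plus 64). The sketch below decodes
such a value. The function name decode_xfer_mode is hypothetical and was not
part of the original report; values outside the three documented families are
reported as unrecognised rather than guessed at.

```python
def decode_xfer_mode(value):
    """Decode an hdparm -X argument into a readable transfer mode name.

    Assumes the encoding given in the hdparm man page:
    PIO mode + 8, multiword DMA mode + 32, UltraDMA mode + 64.
    """
    families = (
        (64, "UltraDMA"),
        (32, "multiword DMA"),
        (8, "PIO"),
    )
    for base, name in families:
        # Each family is assumed to span eight consecutive mode numbers.
        if base <= value < base + 8:
            return f"{name} mode {value - base}"
    return f"unrecognised value {value}"


# The value used in the report:
print(decode_xfer_mode(70))
```

Under that encoding, -X70 requests UltraDMA mode 6.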

Steps to reproduce:
Merely a boot of my 2.6.1/2.6.0/2.4.24 system to runlevel 1 and then running
bonnie++ -u nobody will guarantee a hang before all the write checks have been
completed.
Comment 1 Gunnlaugur Thor Briem 2004-01-29 12:28:53 UTC
I am getting the exact same symptoms with almost the same setup: MSI K7N2
Delta-ILSR instead of Asus A7N8X, and two WD 74GB raptor disks, with a software
raid0 (md) device across identical partitions on the two drives. This
motherboard is also nforce2 and SATA controller is Promise 20376. Both forcedeth
and sata_promise are compiled in, not loaded as modules. Video card is MSI
Geforce FX 5900, but I'm not even running X or a fb console, just plain VGA
text. No sound card driver, no frills. Kernel is gentoo-dev-sources-2.6.1, which
does have some patches on top of the vanilla 2.6.1, but nothing that seems a
likely culprit.

bonnie++ output:
leggiero tmp # bonnie++ -u nobody
Using uid:65534, gid:65534.
Writing with putc()...done
Writing intelligently...done
Rewriting...[at this point, hang occurs after time on the order of a minute]

No oops message displayed, nothing suspicious found in logs. Machine no longer
responds to ping. Motherboard state LED indicator array stays all green (i.e.
OK), for whatever that may be worth. And cursor on console still blinks.

It's not a pathological case involving this benchmark -- happens during lengthy
stretches of compilations as well.

Haven't tried running a kernel with a serial console linked to another machine
(as suggested by LKML FAQ) because I doubt that an oops would appear there since
it doesn't appear in the screen console. (Am I wrong?)

Occasionally, after doing a soft reboot (with the case reboot switch linked to
motherboard, not with Ctrl-Alt-Del since that doesn't work of course), the
motherboard will fail to start and screen will go blank. The motherboard state
LED indicator array displays the state "Initializing Hard Drive Controller",
which seems to imply that the Promise controller is left in a bogus state after
this occurs, and the soft reboot doesn't always get it out of that bogus state.

Is there any sane procedure for extracting more info to debug this? I'm very
willing to jump through hoops if someone knowledgeable prescribes the hoops!
(e.g. places where I should insert code for dumping debug info to the console,
in order to narrow down where the hang happens).
Comment 2 Gunnlaugur Thor Briem 2004-01-29 12:40:35 UTC
As a sanity check of the hardware, I installed Windows XP on the same setup
(just another partition on one of the disks) and jostled it around quite a bit,
but could not produce a hang. So this does not **seem** to be a hardware-only
fault (although this test is clearly not conclusive ...). That install, of
course, used the Promise-supplied driver for the SATA controller, instead of the
linux sata_promise driver.
Comment 3 Jeff Garzik 2004-01-30 17:50:17 UTC
This is (unfortunately) a known bug, and is the reason why the driver is marked
with CONFIG_BROKEN.

Should have a fix soon.
Comment 4 Gunnlaugur Thor Briem 2004-02-03 01:48:22 UTC
With ATA_DEBUG and ATA_VERBOSE_DEBUG #defined in libata.h, this is what I get in
/var/log/messages just before things go awry:

Jan 31 03:29:44 leggiero ata_fill_sg: PRD[22] = (0xAEDC000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[23] = (0x36EFA000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[24] = (0x36EF0000, 0x1000)
Jan 31 03:29:44 leggiero pdc_dma_start: ENTER, ap c1bfd204
Jan 31 03:29:44 leggiero ata_scsi_rw_queue: EXIT
Jan 31 03:29:44 leggiero pdc_interrupt: ENTER
Jan 31 03:29:44 leggiero pdc_interrupt: port 0
Jan 31 03:29:44 leggiero ata_sg_clean: unmapping 25 sg elements
Jan 31 03:29:44 leggiero pdc_interrupt: port 1
Jan 31 03:29:44 leggiero pdc_interrupt: EXIT
Jan 31 03:29:44 leggiero ata_scsi_queuecmd: CDB (1:0,0,0) 2a 00 02 61 9c 8a 00 00 68
Jan 31 03:29:44 leggiero ata_scsi_rw_queue: ENTER
Jan 31 03:29:44 leggiero ata_scsi_rw_xlat: writing
Jan 31 03:29:44 leggiero ata_scsi_rw_xlat: ten-byte command
Jan 31 03:29:44 leggiero ata_dev_select: ENTER, ata1: device 0, wait 1
Jan 31 03:29:44 leggiero ata_sg_setup: ENTER, ata1, use_sg 13
Jan 31 03:29:44 leggiero ata_sg_setup: 13 sg elements mapped
Jan 31 03:29:44 leggiero pdc_fill_sg: ENTER
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[0] = (0x139DF000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[1] = (0x139D5000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[2] = (0x13C01000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[3] = (0x13D45000, 0x1000)

After this, I get a couple of thousand zero bytes, and about four megabytes of
garbage (including my kernel config and some data that was probably in the
in-memory cache from other files), followed by the syslog startup message from
the next boot. Presumably a DMA transfer went haywire and ended up in
/var/log/messages, just before everything hung. No telling what other
interesting stuff happened on the disk as well :)

Note that, in the last operation, the ata_fill_sg loop should run 13 times, but
we don't hear from it after the fourth time. Of course, this is an unreliable
hint -- the failure might well happen somewhere completely different, and the
last (crucial) bit of log output just never made it onto the disk. I'm not sure
whether these messages are flushed to disk synchronously -- presumably you are.
I'm using syslog-ng.
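
The comparison described above (number of sg elements mapped versus number of
PRD entries actually logged) can be automated for longer logs. The following is
a minimal sketch, assuming the debug messages look exactly like the excerpt in
this comment; the function name check_prd_counts and the SAMPLE constant are
hypothetical helpers, not part of libata. As noted above, a shortfall only
shows where logging stopped, not necessarily where the failure happened.

```python
import re

# Patterns assume the ATA_DEBUG message format shown in this comment.
SETUP_RE = re.compile(r"ata_sg_setup: (\d+) sg elements mapped")
PRD_RE = re.compile(r"ata_fill_sg: PRD\[(\d+)\]")

SAMPLE = """\
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[24] = (0x36EF0000, 0x1000)
Jan 31 03:29:44 leggiero ata_sg_setup: ENTER, ata1, use_sg 13
Jan 31 03:29:44 leggiero ata_sg_setup: 13 sg elements mapped
Jan 31 03:29:44 leggiero pdc_fill_sg: ENTER
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[0] = (0x139DF000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[1] = (0x139D5000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[2] = (0x13C01000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[3] = (0x13D45000, 0x1000)
""".splitlines()


def check_prd_counts(lines):
    """Return (expected, seen) pairs, one per 'sg elements mapped' message.

    PRD lines that appear before the first such message are ignored,
    because the log excerpt may start in the middle of a command.
    """
    results = []
    expected = None
    seen = 0
    for line in lines:
        match = SETUP_RE.search(line)
        if match:
            if expected is not None:
                results.append((expected, seen))
            expected = int(match.group(1))
            seen = 0
        elif expected is not None and PRD_RE.search(line):
            seen += 1
    if expected is not None:
        results.append((expected, seen))
    return results


for expected, seen in check_prd_counts(SAMPLE):
    status = "ok" if expected == seen else "INCOMPLETE"
    print(f"expected {expected} PRD entries, logged {seen}: {status}")
```

On the tail of the log above this reports 13 expected entries and 4 logged.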

Hope this helps.
Comment 5 Gunnlaugur Thor Briem 2004-02-03 18:33:27 UTC
Original poster had a Silicon Image 3112 controller, I have a Promise 20376 --
when you said it's a known bug, do you mean it's a bug in libata itself, not in
the individual drivers? Or did you just mean in the 3112 driver, and not notice
that I had a different controller? Probably, since the 3112 is marked
CONFIG_BROKEN, but the Promise driver is not.

So should my part of this be a separate bug? I'll guess (in the dark) that they
are too closely related to warrant that.

Are you at all able to narrow this down, to provide a clue? I want to help, but
am not experienced at this low level so I'm making very slow headway getting
familiar with things.
Comment 6 Jeff Garzik 2004-02-04 00:32:37 UTC
Yes, please open a separate bug for same or similar behavior on a different
controller than Silicon Image.
Comment 7 Gunnlaugur Thor Briem 2004-02-04 03:17:08 UTC
OK, opened Bug 2011 -- sorry for the confusion.
Comment 8 Jeff Garzik 2004-03-25 19:12:58 UTC
Should be fixed in latest 2.6.5-rc kernel.

Note that there may also be platform bugs.  Try booting with "noapic",
"acpi=off", "pci=noacpi", and/or "nomce".
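
When testing these options it is easy to lose track of which ones the running
kernel was actually booted with. A minimal sketch for checking that, assuming a
Linux system where /proc/cmdline is readable; the names SUGGESTED,
missing_params and read_cmdline are hypothetical and only for illustration:

```python
# Boot options suggested in this comment.
SUGGESTED = ("noapic", "acpi=off", "pci=noacpi", "nomce")


def missing_params(cmdline, suggested=SUGGESTED):
    """Return the suggested options that do not appear on the command line."""
    present = set(cmdline.split())
    return [param for param in suggested if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()


# Typical use on the affected machine:
#     print(missing_params(read_cmdline()))
```

The options are matched as whole words, so "acpi=off" is not confused with
"pci=noacpi".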