Bug 1888
| Summary: | SiI 3112 & Asus MBoard & WD Raptor cause complete hang with DMA and heavy load | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Joe Rutledge (joe.rutledge) |
| Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | high | | |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.1 | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
Description
Joe Rutledge
2004-01-16 04:13:12 UTC
I am getting the exact same symptoms with almost the same setup: MSI K7N2 Delta-ILSR instead of Asus A7N8X, and two WD 74GB Raptor disks, with a software raid0 (md) device across identical partitions on the two drives. This motherboard is also nForce2, and the SATA controller is a Promise 20376. Both forcedeth and sata_promise are compiled in, not loaded as modules. Video card is an MSI GeForce FX 5900, but I'm not even running X or a fb console, just plain VGA text. No sound card driver, no frills. Kernel is gentoo-dev-sources-2.6.1, which does have some patches on top of the vanilla 2.6.1, but nothing that seems a likely culprit.

bonnie++ output:

leggiero tmp # bonnie++ -u nobody
Using uid:65534, gid:65534.
Writing with putc()...done
Writing intelligently...done
Rewriting...[at this point, the hang occurs after a time on the order of a minute]

No oops message displayed, nothing suspicious found in logs. Machine no longer responds to ping. Motherboard state LED indicator array stays all green (i.e. OK), for whatever that may be worth. And the cursor on the console still blinks.

It's not a pathological case involving this benchmark -- it happens during lengthy stretches of compilation as well. I haven't tried running a kernel with a serial console linked to another machine (as suggested by the LKML FAQ) because I doubt that an oops would appear there, since it doesn't appear on the screen console. (Am I wrong?)

Occasionally, after doing a soft reboot (with the case reboot switch linked to the motherboard, not with Ctrl-Alt-Del, since that doesn't work of course), the motherboard will fail to start and the screen will go blank. The motherboard state LED indicator array displays the state "Initializing Hard Drive Controller", which seems to imply that the Promise controller is left in a bogus state after this occurs, and the soft reboot doesn't always get it out of that bogus state.

Is there any sane procedure for extracting more info to debug this?
I'm very willing to jump through hoops if someone knowledgeable prescribes the hoops! (e.g. places where I should insert code for dumping debug info to the console, in order to narrow down where the hang happens).

As a sanity check of the hardware, I installed Windows XP on the same setup (just another partition on one of the disks) and jostled it around quite a bit, but could not produce a hang. So this does not **seem** to be a hardware-only fault (although this test is clearly not conclusive ...). That install, of course, used the Promise-supplied driver for the SATA controller, instead of the linux sata_promise driver.

This is (unfortunately) a known bug, and is the reason why the driver is marked with CONFIG_BROKEN. Should have a fix soon.

With ATA_DEBUG and ATA_VERBOSE_DEBUG #defined in libata.h, this is what I get in /var/log/messages just before things go awry:

Jan 31 03:29:44 leggiero ata_fill_sg: PRD[22] = (0xAEDC000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[23] = (0x36EFA000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[24] = (0x36EF0000, 0x1000)
Jan 31 03:29:44 leggiero pdc_dma_start: ENTER, ap c1bfd204
Jan 31 03:29:44 leggiero ata_scsi_rw_queue: EXIT
Jan 31 03:29:44 leggiero pdc_interrupt: ENTER
Jan 31 03:29:44 leggiero pdc_interrupt: port 0
Jan 31 03:29:44 leggiero ata_sg_clean: unmapping 25 sg elements
Jan 31 03:29:44 leggiero pdc_interrupt: port 1
Jan 31 03:29:44 leggiero pdc_interrupt: EXIT
Jan 31 03:29:44 leggiero ata_scsi_queuecmd: CDB (1:0,0,0) 2a 00 02 61 9c 8a 00 00 68
Jan 31 03:29:44 leggiero ata_scsi_rw_queue: ENTER
Jan 31 03:29:44 leggiero ata_scsi_rw_xlat: writing
Jan 31 03:29:44 leggiero ata_scsi_rw_xlat: ten-byte command
Jan 31 03:29:44 leggiero ata_dev_select: ENTER, ata1: device 0, wait 1
Jan 31 03:29:44 leggiero ata_sg_setup: ENTER, ata1, use_sg 13
Jan 31 03:29:44 leggiero ata_sg_setup: 13 sg elements mapped
Jan 31 03:29:44 leggiero pdc_fill_sg: ENTER
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[0] = (0x139DF000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[1] = (0x139D5000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[2] = (0x13C01000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[3] = (0x13D45000, 0x1000)

After this, I get a couple of thousand zero bytes, and about four megabytes of garbage (including my kernel config and some data that was probably in the in-memory cache from other files), followed by the syslog startup message from the next boot. Presumably a DMA transfer went haywire and ended up in /var/log/messages, just before everything hung. No telling what other interesting stuff happened on the disk as well :)

Note that, in the last operation, the ata_fill_sg loop should run 13 times, but we don't hear from it after the fourth time. Of course, this is an unreliable hint -- the failure might well happen somewhere completely different, and the last (crucial) bit of log output just never made it onto the disk. I'm not sure whether these messages are flushed to disk synchronously -- presumably you are. I'm using syslog-ng. Hope this helps.

Original poster had a Silicon Image 3112 controller, I have a Promise 20376 -- when you said it's a known bug, do you mean it's a bug in libata itself, not in the individual drivers? Or did you just mean in the 3112 driver, and not notice that I had a different controller? Probably, since the 3112 is marked CONFIG_BROKEN, but the Promise driver is not. So should my part of this be a separate bug? I'll guess (in the dark) that they are too closely related to warrant that.

Are you at all able to narrow this down, to provide a clue? I want to help, but am not experienced at this low level so I'm making very slow headway getting familiar with things.

Yes, please open a separate bug for same or similar behavior on a different controller than Silicon Image.

OK, opened Bug 2011 -- sorry for the confusion.

Should be fixed in latest 2.6.5-rc kernel. Note that there may also be platform bugs.
Try booting with "noapic", "acpi=off", "pci=noacpi", and/or "nomce".