Distribution: Gentoo 1.4
Hardware Environment: Asus A7N8X Deluxe (nforce2) BIOS V1007, AMD Athlon XP (Barton) 2800+, 1GB DDR RAM (2 x 512MB as Dual Channel), WD Raptor 36GB SATA HD, Asus GeForce FX 5600 (256MB), Lite-On 52x CDRW, DVD-ROM.
Software Environment: Fresh Gentoo build optimised for Athlon-XP (GCC 3.2.3, -march=athlon-xp). 2.6.1 kernel patched for forcedeth and the nvidia graphics card. APIC and ACPI support removed from the kernel and turned off in the BIOS. Both the IDE and libata drivers have been built into the kernel at separate times. It makes no difference what other applications are running.
Problem Description: Initially, using a Seagate drive, I was experiencing random lockups: no kernel panic, just a complete hang. Having read about issues with DMA and some Seagates, I replaced the drive with a Western Digital Raptor. However, I still see the same lockups. I've tried a variety of options to hdparm (based around -X70 -d1), none of them making any difference to stability. I then swapped to the libata driver, expecting this to be more solid. It does appear to last a little longer than the IDE driver, but the same problems manifest themselves.
I then found that there were potentially some problems with the Asus board and shared interrupt lines to the SiI3112, so I upgraded the BIOS to the most recent version (1007). This has made no difference whatsoever. I also read that APIC and ACPI support could exacerbate this problem, so I removed them from the kernel and disabled them in the BIOS. This has given better stability, but still not to the point of a usable system.
This is a desktop system and it locks up whenever any heavy disk access is done. At the moment I'm running in PIO mode, as this is the only stable way of handling the disk. I'm not doing any RAID and have no need to.
Steps to reproduce: Merely booting my 2.6.1/2.6.0/2.4.24 system to runlevel 1 and then running bonnie++ -u nobody will guarantee a hang before all the write checks have been completed.
I am getting the exact same symptoms with almost the same setup: MSI K7N2 Delta-ILSR instead of Asus A7N8X, and two WD 74GB Raptor disks, with a software raid0 (md) device across identical partitions on the two drives. This motherboard is also nforce2 and the SATA controller is a Promise 20376. Both forcedeth and sata_promise are compiled in, not loaded as modules. Video card is an MSI GeForce FX 5900, but I'm not even running X or a fb console, just plain VGA text. No sound card driver, no frills. Kernel is gentoo-dev-sources-2.6.1, which does have some patches on top of the vanilla 2.6.1, but nothing that seems a likely culprit.
bonnie++ output:
leggiero tmp # bonnie++ -u nobody
Using uid:65534, gid:65534.
Writing with putc()...done
Writing intelligently...done
Rewriting...
[at this point, the hang occurs after a time on the order of a minute]
No oops message is displayed, and nothing suspicious is found in the logs. The machine no longer responds to ping. The motherboard state LED indicator array stays all green (i.e. OK), for whatever that may be worth. And the cursor on the console still blinks. It's not a pathological case involving this benchmark -- it happens during lengthy stretches of compilations as well.
I haven't tried running a kernel with a serial console linked to another machine (as suggested by the LKML FAQ) because I doubt that an oops would appear there, since it doesn't appear on the screen console. (Am I wrong?)
Occasionally, after doing a soft reboot (with the case reboot switch linked to the motherboard, not with Ctrl-Alt-Del, since that doesn't work of course), the motherboard will fail to start and the screen will go blank. The motherboard state LED indicator array displays the state "Initializing Hard Drive Controller", which seems to imply that the Promise controller is left in a bogus state after the hang, and the soft reboot doesn't always get it out of that bogus state.
Is there any sane procedure for extracting more info to debug this?
I'm very willing to jump through hoops if someone knowledgeable prescribes the hoops! (e.g. places where I should insert code for dumping debug info to the console, in order to narrow down where the hang happens).
As a sanity check of the hardware, I installed Windows XP on the same setup (just another partition on one of the disks) and jostled it around quite a bit, but could not produce a hang. So this does not **seem** to be a hardware-only fault (although this test is clearly not conclusive ...). That install, of course, used the Promise-supplied driver for the SATA controller, instead of the linux sata_promise driver.
This is (unfortunately) a known bug, and is the reason why the driver is marked with CONFIG_BROKEN. Should have a fix soon.
With ATA_DEBUG and ATA_VERBOSE_DEBUG #defined in libata.h, this is what I get in /var/log/messages just before things go awry:
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[22] = (0xAEDC000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[23] = (0x36EFA000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[24] = (0x36EF0000, 0x1000)
Jan 31 03:29:44 leggiero pdc_dma_start: ENTER, ap c1bfd204
Jan 31 03:29:44 leggiero ata_scsi_rw_queue: EXIT
Jan 31 03:29:44 leggiero pdc_interrupt: ENTER
Jan 31 03:29:44 leggiero pdc_interrupt: port 0
Jan 31 03:29:44 leggiero ata_sg_clean: unmapping 25 sg elements
Jan 31 03:29:44 leggiero pdc_interrupt: port 1
Jan 31 03:29:44 leggiero pdc_interrupt: EXIT
Jan 31 03:29:44 leggiero ata_scsi_queuecmd: CDB (1:0,0,0) 2a 00 02 61 9c 8a 00 00 68
Jan 31 03:29:44 leggiero ata_scsi_rw_queue: ENTER
Jan 31 03:29:44 leggiero ata_scsi_rw_xlat: writing
Jan 31 03:29:44 leggiero ata_scsi_rw_xlat: ten-byte command
Jan 31 03:29:44 leggiero ata_dev_select: ENTER, ata1: device 0, wait 1
Jan 31 03:29:44 leggiero ata_sg_setup: ENTER, ata1, use_sg 13
Jan 31 03:29:44 leggiero ata_sg_setup: 13 sg elements mapped
Jan 31 03:29:44 leggiero pdc_fill_sg: ENTER
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[0] = (0x139DF000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[1] = (0x139D5000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[2] = (0x13C01000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[3] = (0x13D45000, 0x1000)
After this, I get a couple of thousand zero bytes, and about four megabytes of garbage (including my kernel config and some data that was probably in the in-memory cache from other files), followed by the syslog startup message from the next boot. Presumably a DMA transfer went haywire and ended up in /var/log/messages, just before everything hung. No telling what other interesting stuff happened on the disk as well :)
Note that, in the last operation, the ata_fill_sg loop should run 13 times, but we don't hear from it after the fourth time.
Of course, this is an unreliable hint -- the failure might well happen somewhere completely different, and the last (crucial) bit of log output just never made it onto the disk. I'm not sure whether these messages are flushed to disk synchronously -- presumably you are. I'm using syslog-ng. Hope this helps.
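For anyone wanting to repeat the "13 mapped, only 4 logged" check on their own debug output, here is a minimal Python sketch. It is not part of libata or any kernel tooling; the function name summarize_last_command, the regular expressions, and the 512-byte sector size are my own assumptions, and the log excerpt is the tail of the output quoted above with the syslog timestamp and host prefix stripped. It reads the nine printed CDB bytes as a WRITE(10) command (opcode 0x2a, 4-byte LBA, 2-byte sector count) and compares the number of PRD entries logged with the number of sg elements mapped.

```python
import re

# Tail of the debug output quoted above (syslog prefix stripped for brevity).
LOG = """\
ata_scsi_queuecmd: CDB (1:0,0,0) 2a 00 02 61 9c 8a 00 00 68
ata_scsi_rw_queue: ENTER
ata_scsi_rw_xlat: writing
ata_scsi_rw_xlat: ten-byte command
ata_dev_select: ENTER, ata1: device 0, wait 1
ata_sg_setup: ENTER, ata1, use_sg 13
ata_sg_setup: 13 sg elements mapped
pdc_fill_sg: ENTER
ata_fill_sg: PRD[0] = (0x139DF000, 0x1000)
ata_fill_sg: PRD[1] = (0x139D5000, 0x1000)
ata_fill_sg: PRD[2] = (0x13C01000, 0x1000)
ata_fill_sg: PRD[3] = (0x13D45000, 0x1000)
"""

# Patterns are guesses based only on the message formats seen in this log.
CDB_RE = re.compile(r"ata_scsi_queuecmd: CDB \([^)]*\) ((?:[0-9a-fA-F]{2}\s?)+)")
MAPPED_RE = re.compile(r"ata_sg_setup: (\d+) sg elements mapped")
PRD_RE = re.compile(r"ata_fill_sg: PRD\[\d+\] = \(0x[0-9A-Fa-f]+, (0x[0-9A-Fa-f]+)\)")


def summarize_last_command(log_text, sector_size=512):
    """Summarize the last command in the log; sector_size is an assumption."""
    cdb, expected, prd_bytes = None, None, []
    for line in log_text.splitlines():
        m = CDB_RE.search(line)
        if m:
            # A new command starts: forget what was collected for the previous one.
            cdb = [int(b, 16) for b in m.group(1).split()]
            expected, prd_bytes = None, []
            continue
        m = MAPPED_RE.search(line)
        if m:
            expected = int(m.group(1))
            continue
        m = PRD_RE.search(line)
        if m and cdb is not None:
            prd_bytes.append(int(m.group(1), 16))

    if cdb is None or expected is None:
        return None
    if len(cdb) < 9 or cdb[0] != 0x2A:
        raise ValueError("only WRITE(10) CDBs are handled in this sketch")

    # WRITE(10): bytes 2..5 are the big-endian LBA, bytes 7..8 the sector count.
    sectors = int.from_bytes(bytes(cdb[7:9]), "big")
    return {
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),
        "sectors": sectors,
        "request_bytes": sectors * sector_size,
        "expected_prds": expected,
        "logged_prds": len(prd_bytes),
        "logged_bytes": sum(prd_bytes),
        "missing_prds": expected - len(prd_bytes),
    }


if __name__ == "__main__":
    print(summarize_last_command(LOG))
```

Under the 512-byte sector assumption, the sector count in the CDB (0x68 = 104) works out to the same size as 13 entries of 0x1000 bytes each, so the "13 sg elements mapped" line is at least consistent with the request. The caveat above still applies: missing PRD lines may only mean the log was cut short, not that the loop stopped.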
Original poster had a Silicon Image 3112 controller, I have a Promise 20376 -- when you said it's a known bug, do you mean it's a bug in libata itself, not in the individual drivers? Or did you just mean in the 3112 driver, and not notice that I had a different controller? Probably, since the 3112 is marked CONFIG_BROKEN, but the Promise driver is not. So should my part of this be a separate bug? I'll guess (in the dark) that they are too closely related to warrant that. Are you at all able to narrow this down, to provide a clue? I want to help, but am not experienced at this low level so I'm making very slow headway getting familiar with things.
Yes, please open a separate bug for same or similar behavior on a different controller than Silicon Image.
OK, opened Bug 2011 -- sorry for the confusion.
Should be fixed in latest 2.6.5-rc kernel. Note that there may also be platform bugs. Try booting with "noapic", "acpi=off", "pci=noacpi", and/or "nomce".
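When trying these options one at a time, it is easy to lose track of which ones the running kernel was actually booted with. The following is a small Python sketch, assuming a Linux system that exposes the boot command line in /proc/cmdline; the helper name missing_boot_options is hypothetical, and the option list is simply the four mentioned above.

```python
# The four options suggested above; nothing else is implied about them here.
SUGGESTED = ("noapic", "acpi=off", "pci=noacpi", "nomce")


def missing_boot_options(cmdline, wanted=SUGGESTED):
    """Return the wanted options that do not appear as whole words in cmdline."""
    present = set(cmdline.split())
    return [opt for opt in wanted if opt not in present]


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        # Not on Linux, or /proc is not mounted: treat as an empty command line.
        cmdline = ""
    print("not set on this boot:", missing_boot_options(cmdline))
```

The comparison is on whole whitespace-separated words, so "acpi=off" is not confused with "pci=noacpi" or with a different acpi= setting.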