Distribution: Gentoo 2004.0

Hardware Environment: MSI K7N2 Delta-ILSR (nforce2, Promise 20376 SATA rev 2), AMD Athlon XP 3200+, 1GB DDR RAM (2 x 512), 2x WD Raptor 74G SATA HD, MSI GeForce FX 5900 (128MB), some CDRW/DVD

Software Environment: Fresh Gentoo build optimised for Athlon-XP (GCC 3.2.3, -march=athlon-xp). 2.6.2-rc3 vanilla kernel (also tried 2.6.1 gentoo patched for forcedeth, same results).

Problem Description: Even during short periods of heavy disk activity, the kernel suddenly hangs. The machine does not respond to keyboard events or ping, or to anything other than a hard reset. It happens every time in the bonnie++ test, and often when copying large directory trees, compiling the kernel, etc. ... but it often gets through the latter without problems, too.

Steps to reproduce: Run bonnie++ -u nobody; that seems to be the most reliable way to reproduce it.

Extra info: With ATA_DEBUG and ATA_VERBOSE_DEBUG #defined in libata.h, this is what I get in /var/log/messages just before things go awry:

Jan 31 03:29:44 leggiero ata_fill_sg: PRD[22] = (0xAEDC000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[23] = (0x36EFA000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[24] = (0x36EF0000, 0x1000)
Jan 31 03:29:44 leggiero pdc_dma_start: ENTER, ap c1bfd204
Jan 31 03:29:44 leggiero ata_scsi_rw_queue: EXIT
Jan 31 03:29:44 leggiero pdc_interrupt: ENTER
Jan 31 03:29:44 leggiero pdc_interrupt: port 0
Jan 31 03:29:44 leggiero ata_sg_clean: unmapping 25 sg elements
Jan 31 03:29:44 leggiero pdc_interrupt: port 1
Jan 31 03:29:44 leggiero pdc_interrupt: EXIT
Jan 31 03:29:44 leggiero ata_scsi_queuecmd: CDB (1:0,0,0) 2a 00 02 61 9c 8a 00 00 68
Jan 31 03:29:44 leggiero ata_scsi_rw_queue: ENTER
Jan 31 03:29:44 leggiero ata_scsi_rw_xlat: writing
Jan 31 03:29:44 leggiero ata_scsi_rw_xlat: ten-byte command
Jan 31 03:29:44 leggiero ata_dev_select: ENTER, ata1: device 0, wait 1
Jan 31 03:29:44 leggiero ata_sg_setup: ENTER, ata1, use_sg 13
Jan 31 03:29:44 leggiero ata_sg_setup: 13 sg elements mapped
Jan 31 03:29:44 leggiero pdc_fill_sg: ENTER
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[0] = (0x139DF000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[1] = (0x139D5000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[2] = (0x13C01000, 0x1000)
Jan 31 03:29:44 leggiero ata_fill_sg: PRD[3] = (0x13D45000, 0x1000)

After this, the log contains a couple of thousand zero bytes and about four megabytes of garbage (including my kernel config and some data that was probably in the in-memory cache from other files), followed by the syslog startup message from the next boot. Presumably a DMA transfer went haywire and ended up in /var/log/messages, just before everything hung. No telling what other interesting stuff happened on the disk as well :)

Note that, in the last operation, the ata_fill_sg loop should run 13 times, but we don't hear from it after the fourth iteration. Of course, this is an unreliable hint -- the failure might well happen somewhere completely different, and the last (crucial) bit of log output may simply never have made it onto the disk. I'm not sure whether these messages are flushed to disk synchronously -- presumably you know. I'm using syslog-ng.

I originally filed this on Bug 1888 by mistake -- that bug report involves a different chipset and driver, so this warrants a separate bug.
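As a cross-check on the "13 times" figure, the CDB of the last logged command can be decoded by hand. The sketch below is only an illustration, not libata code: the struct and function names are made up, and it assumes 512-byte sectors and one 4 KiB page per sg element (every PRD entry in the log has length 0x1000). Under those assumptions, opcode 0x2a is WRITE(10), the LBA is 0x02619c8a, and the transfer length 0x0068 is 104 sectors, i.e. 53248 bytes or exactly 13 pages, which agrees with "use_sg 13". The log prints only nine CDB bytes; the tenth (control) byte is not needed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the fields of a 10-byte READ/WRITE CDB. */
struct rw10 {
    uint8_t  opcode;   /* byte 0: 0x28 = READ(10), 0x2a = WRITE(10) */
    uint32_t lba;      /* bytes 2..5, big-endian */
    uint16_t nblocks;  /* bytes 7..8, big-endian */
};

/* Decode the fields of a 10-byte read/write CDB.
 * Returns 0 on success, -1 if fewer than 9 bytes are available. */
int decode_rw10(const uint8_t *cdb, size_t len, struct rw10 *out)
{
    if (len < 9)
        return -1;
    out->opcode  = cdb[0];
    out->lba     = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
                   ((uint32_t)cdb[4] << 8)  |  (uint32_t)cdb[5];
    out->nblocks = (uint16_t)(((unsigned)cdb[7] << 8) | cdb[8]);
    return 0;
}

/* Number of 4 KiB pages needed for nblocks sectors of 512 bytes
 * (both sizes are assumptions, see the text above). */
uint32_t pages_needed(uint16_t nblocks)
{
    return ((uint32_t)nblocks * 512u + 4095u) / 4096u;
}
```

Feeding it the bytes 2a 00 02 61 9c 8a 00 00 68 from the log gives 104 sectors and 13 pages, so the request itself looks well-formed; the open question remains why only four PRD entries were logged.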
sata_promise.c contains several FIXMEs; are any of these possibly relevant?

In the struct ata_port_info for board_2037x:

    .udma_mask = 0x7f, /* udma0-6 ; FIXME */

(This is presumably unused, right? It would be used in pdc_sata_set_udmamode, but that's a no-op because SATA uses its own DMA mode, not the Ultra DMA modes. Same goes for .pio_mask above. If that's correctly understood, then why is this marked FIXME rather than DUMMY?)

In pdc_host_init:

    /* reduce TBG clock to 133 Mhz. FIXME: why? */
    tmp = readl(mmio + PDC_TBG_MODE);
    tmp &= ~0x30000; /* clear bit 17, 16*/
    tmp |= 0x10000; /* set bit 17:16 = 0:1 */
    writel(tmp, mmio + PDC_TBG_MODE);

    /* adjust slew rate control register. FIXME: why? */
    tmp = readl(mmio + PDC_SLEW_CTL);
    tmp &= 0xFFFFF03F; /* clear bit 11 ~ 6 */
    tmp |= 0x00000900; /* set bit 11-9 = 100b , bit 8-6 = 100 */
    writel(tmp, mmio + PDC_SLEW_CTL);

(Do the questions "why?" indicate that this code is there only because that's what the closed-source drivers do?)

In pdc_sata_init_one:

    /* FIXME: check ata_device_add return value */
    ata_device_add(probe_ent);
    kfree(probe_ent);

(I'm almost certain that this is not relevant here -- right? If the return value were 0, then my disks wouldn't have come online for the kernel.)
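For what it's worth, the mask arithmetic in the pdc_host_init snippet does match its comments. The sketch below repeats the two read-modify-write steps as pure functions so they can be checked without hardware; the function names are made up for illustration, and the readl/writel accesses are deliberately left out. It says nothing about why the driver picks these values.

```c
#include <stdint.h>

/* Same bit manipulation as the PDC_TBG_MODE update quoted above:
 * clear bits 17 and 16, then set bits 17:16 to 0:1. */
uint32_t tbg_mode_update(uint32_t tmp)
{
    tmp &= ~0x30000u;
    tmp |= 0x10000u;
    return tmp;
}

/* Same bit manipulation as the PDC_SLEW_CTL update quoted above:
 * clear bits 11..6, then set bits 11-9 = 100b and bits 8-6 = 100b. */
uint32_t slew_ctl_update(uint32_t tmp)
{
    tmp &= 0xFFFFF03Fu;
    tmp |= 0x00000900u;
    return tmp;
}

/* Extract a 3-bit field starting at bit position shift. */
uint32_t field3(uint32_t value, unsigned shift)
{
    return (value >> shift) & 0x7u;
}
```

Starting from an all-ones register, tbg_mode_update yields 0xFFFDFFFF and slew_ctl_update yields 0xFFFFF93F, and both 3-bit slew fields read back as 4 (100b), so bits outside the named fields are left alone.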
Looks like I jumped on the wrong culprit here. I just tried applying Ross Dickson's patches for APIC quirks on nForce2 chipsets: http://lkml.org/lkml/2003/12/21/7 and setting the apic_tack=2 boot parameter, and with that kernel I can't reproduce this -- haven't tested for very long, but my machine got through the bonnie++ test with this kernel, and was never even close before. So it appears that the Promise controller and driver are as innocent as the driven snow. I suggest that you close this, unless you know better (you did mention that this was a known bug, but perhaps you meant only the sil driver). Sorry for the mistargeted report!