Bug 2494
| Summary: | PDC20265 crashes when DMA is enabled | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | peter mutsaers (pmutsaers) |
| Component: | IDE | Assignee: | Bartlomiej Zolnierkiewicz (bzolnier) |
| Status: | REJECTED INSUFFICIENT_DATA | | |
| Severity: | normal | CC: | drescher0110-lists, hhielscher, sumbach |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.5 and 2.6.5-mm4, not in 2.4.x | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
| Attachments: | pdc20267 | | |
Description
peter mutsaers
2004-04-11 13:23:36 UTC
Can confirm the problem also exists in 2.6.6. Have seen this problem occur with vanilla 2.6.6 kernels as well as with Red Hat FC2 test kernels. Any heavy activity on any drive attached to a PDC20265 controller will cause this bug to manifest (such as RAID activity, filesystem rsyncs, etc.); a scripted version of this recipe appears at the end of this comment. The system generally becomes unusable and will not reboot without a power cycle. Here are some kernel logs of the timeout occurring:

```
hde: dma_timer_expiry: dma status == 0x60
hde: DMA timeout retry
PDC202XX: Primary channel reset
PDC202XX: Secondary channel reset
hde: timeout waiting for DMA
hde: dma_timer_expiry: dma status == 0x61
hde: DMA timeout error
```

The controller works perfectly under a 2.4 kernel. With either a RH 9 box upgraded to 2.6.6, or with FC2 test 3 kernels and installations, this bug always occurs under load.

Confirmed with PDC20265 running gentoo-dev-sources-2.6.5-r1 and mm-sources-2.6.6-mm5.

Hm. I'm getting similar lockups even without DMA enabled, but it usually takes longer before they occur. It's also happening for me now using only one hard drive. Seems to be an even more critical bug than first surmised.

Confirmation on kernel 2.6.5 on a Debian sarge system (SMP, dual Celeron, PDC driver built into the kernel, not as a module). Two controllers (Ultra TX2) with four HDs connected as master (one HD per channel) are fine. After connecting a third controller with one HD, this fifth HD shows DMA errors (not the four previously connected HDs). This is no IRQ, HD, or controller problem (cross-swapping tests: all components are fine). Additional observation: the Promise BIOS doesn't show the fifth HD on booting, but the kernel reports all five HDs correctly. When setting DMA=1 for the first four HDs and DMA=0 for the fifth HD, the system works well (but with all the limitations a PIO-driven HD implies).

Confirmed on 2.6.7-rc2, 2.6.7-rc2-bk1 and 2.6.7-rc2-bk2:

```
kernel: hde: dma_timer_expiry: dma status == 0x20
kernel: hde: DMA timeout retry
kernel: PDC202XX: Primary channel reset.
kernel: PDC202XX: Secondary channel reset.
kernel: hde: timeout waiting for DMA
kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
kernel: PDC202XX: Primary channel reset.
kernel: PDC202XX: Secondary channel reset.
kernel: ide2: reset: master: error (0x00?)
kernel: hde: dma_timer_expiry: dma status == 0x21
kernel: hde: DMA timeout error
kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
kernel: hde: dma_timer_expiry: dma status == 0x21
kernel: hde: DMA timeout error
kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
kernel:
kernel: hde: dma_timer_expiry: dma status == 0x21
kernel: hde: DMA timeout error
kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
```

I can confirm that this bug affects me running a 2.6.3 kernel and does not affect me running a 2.4.x kernel. I, too, have a Promise IDE controller (non-RAID mode) on my motherboard. I am running Mandrake 10. It is unstable with a 2.6 kernel, exhibiting the described problems, but is stable with a 2.4 kernel.
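Since the reports above say sustained I/O reliably triggers the timeouts, here is a minimal reproduction sketch of that "heavy activity" recipe. It is only an illustration: the device name /dev/hde and mount point /mnt/promise are placeholders for whatever actually sits on the Promise controller.

```sh
# Hammer a PDC20265-attached drive with parallel reads plus a large
# streaming write; on affected 2.6 kernels the dma_timer_expiry
# messages reportedly appear within seconds to minutes.
for i in 1 2 3 4; do
    dd if=/dev/hde of=/dev/null bs=1M count=4096 &
done
dd if=/dev/zero of=/mnt/promise/bigfile bs=1M count=2048
wait

# Check for the signature errors afterwards.
dmesg | grep -E 'dma_timer_expiry|DMA timeout|PDC202XX'
```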
I have a Soyo Dragon Plus motherboard with an Athlon XP 1700+ CPU. I recommend this bug be changed to high priority, as it affects a lot of users; the Promise IDE chipset is very popular on many motherboards.

I don't have a PDC20265, so I'm unable to reproduce it. If you are willing to help, please narrow the problem down to a specific kernel version (i.e. 2.5.xx works and 2.5.yy doesn't). Thanks.

Also an A7V board, running Arch Linux 0.6 with kernels 2.6.3/2.6.6/2.4.26 (all standard distribution kernels). Specifics:

- occurs with the 2.4.26 kernel as well
- no heavy load on the disks required; the kernel hangs at boot time (when the kernel starts init)
- no problem when using `ide=nodma` (see the boot-loader sketch after the `lspci` output below)

Output of `lspci`:

```
00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133] (rev 02)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
00:04.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 22)
00:04.1 IDE interface: VIA Technologies, Inc. VT82C586/B/686A/B PIPC Bus Master IDE (rev 10)
00:04.2 USB Controller: VIA Technologies, Inc. USB (rev 10)
00:04.3 USB Controller: VIA Technologies, Inc. USB (rev 10)
00:04.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 30)
00:0a.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 07)
00:0a.1 Input device controller: Creative Labs SB Live! MIDI/Game Port (rev 07)
00:0b.0 Ethernet controller: Macronix, Inc. [MXIC] MX987x5 (rev 20)
00:11.0 Unknown mass storage controller: Promise Technology, Inc. 20265 (rev 02)
01:00.0 VGA compatible controller: ATI Technologies Inc: Unknown device 5961 (rev 01)
01:00.1 Display controller: ATI Technologies Inc: Unknown device 5941 (rev 01)
```
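For reference, here is a minimal sketch of passing the `ide=nodma` workaround mentioned in the list above at boot. It assumes GRUB legacy; the root device and kernel image name are placeholders, so adjust both to your installation.

```
# /boot/grub/menu.lst -- example stanza only; (hd0,0), /dev/hda1 and
# the kernel image name are illustrative, not from this report.
title  Linux 2.6.6 (IDE DMA disabled)
root   (hd0,0)
kernel /boot/vmlinuz-2.6.6 root=/dev/hda1 ide=nodma
```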
I'd be happy to test any patches; I have several PDC20265 boxes. As to the specific kernels it happens with: 2.6.5, 2.6.6, 2.6.7-rc1, 2.6.7-rc2, 2.6.7-rc2-mm2.

Also confirmed on 2.6.7-rc2-bk8:

```
Jun 7 10:11:43 localhost kernel: hde: dma_intr: status=0x50 { DriveReady SeekComplete }
Jun 7 10:11:43 localhost kernel:
Jun 7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x20
Jun 7 10:11:43 localhost kernel: hde: DMA timeout retry
Jun 7 10:11:43 localhost kernel: PDC202XX: Primary channel reset.
Jun 7 10:11:43 localhost kernel: PDC202XX: Secondary channel reset.
Jun 7 10:11:43 localhost kernel: hde: timeout waiting for DMA
Jun 7 10:11:43 localhost kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 7 10:11:43 localhost kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
Jun 7 10:11:43 localhost kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 7 10:11:43 localhost kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
Jun 7 10:11:43 localhost kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 7 10:11:43 localhost kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
Jun 7 10:11:43 localhost kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 7 10:11:43 localhost kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
Jun 7 10:11:43 localhost kernel: PDC202XX: Primary channel reset.
Jun 7 10:11:43 localhost kernel: PDC202XX: Secondary channel reset.
Jun 7 10:11:43 localhost kernel: ide2: reset: master: error (0x00?)
Jun 7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x21
Jun 7 10:11:43 localhost kernel: hde: DMA timeout error
Jun 7 10:11:43 localhost kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Jun 7 10:11:43 localhost kernel:
Jun 7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x21
Jun 7 10:11:43 localhost kernel: hde: DMA timeout error
Jun 7 10:11:43 localhost kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Jun 7 10:11:43 localhost kernel:
Jun 7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x21
Jun 7 10:11:43 localhost kernel: hde: DMA timeout error
Jun 7 10:11:43 localhost kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Jun 7 10:11:43 localhost kernel:
Jun 7 10:11:43 localhost kernel: hde: DMA disabled
Jun 7 10:11:43 localhost kernel: hdg: DMA disabled
```

The box seems to recover, though, and is still usable after DMA mode is disabled. Will do further testing to see if this remains the case as the day goes by.

Oh, if it's helpful, here's one more line from the previous logfile dump that precedes the other errors:

```
Jun 7 10:11:43 localhost kernel: hde: dma_intr: bad DMA status (dma_stat=36)
```

This error is new; I haven't seen this one before with other 2.6.x kernels and the PDC20265 DMA problem. Again, this error occurs before any of the other DMA errors listed.

Need my coffee, keep forgetting details. Also, this time the DMA timeout occurred during the initial boot of the system, which was also a rare occurrence in the past. I'll see if I can reproduce this with reboots to see if it occurs whenever the box reboots. The hardware config for this system is one IDE controller, a PDC20265, and two identical 80 GB ATA100 hard drives, only one of which is mounted. The other, for the moment, is not mounted or used by the system in any way.

Also, just ran `fdisk -l`, which caused the same set of errors as in the past, but with the 2.6.7-rc2-bk8 kernel the system seems to recover almost instantly. Here are the errors with `fdisk -l`:

```
Jun 7 10:37:28 localhost kernel: hdg: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 7 10:37:28 localhost kernel: hdg: read_intr: error=0x04 { DriveStatusError }
Jun 7 10:37:28 localhost kernel: hdg: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 7 10:37:28 localhost kernel: hdg: read_intr: error=0x04 { DriveStatusError }
Jun 7 10:37:28 localhost kernel: hdg: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 7 10:37:28 localhost kernel: hdg: read_intr: error=0x04 { DriveStatusError }
Jun 7 10:37:28 localhost kernel: hdg: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 7 10:37:28 localhost kernel: hdg: read_intr: error=0x04 { DriveStatusError }
Jun 7 10:37:30 localhost kernel: PDC202XX: Secondary channel reset.
Jun 7 10:37:33 localhost kernel: PDC202XX: Primary channel reset.
Jun 7 10:37:28 localhost kernel: ide3: reset: master: error (0x00?)
```

Only polling the second drive seems to cause the error. Also, the error does not occur on subsequent retries of the `fdisk -l` command.
Running `badblocks -n` on /dev/hdg1 gets this response from the kernel:

```
Jun 7 10:40:26 localhost kernel: APIC error on CPU1: 00(60)
Jun 7 10:40:26 localhost kernel: APIC error on CPU1: 60(60)
Jun 7 10:40:26 localhost kernel: APIC error on CPU1: 60(60)
Jun 7 10:40:26 localhost kernel: APIC error on CPU0: 00(60)
Jun 7 10:40:26 localhost kernel: hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Jun 7 10:40:26 localhost kernel:
Jun 7 10:40:26 localhost kernel: hdg: drive not ready for command
Jun 7 10:42:25 localhost kernel: hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Jun 7 10:42:25 localhost kernel:
Jun 7 10:42:25 localhost kernel: hdg: drive not ready for command
Jun 7 10:42:30 localhost kernel: APIC error on CPU1: 60(60)
Jun 7 10:42:30 localhost last message repeated 3 times
Jun 7 10:42:30 localhost kernel: hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Jun 7 10:42:30 localhost kernel:
Jun 7 10:42:30 localhost kernel: hdg: drive not ready for command
```

Read-only mode for badblocks does not produce this set of errors and completes successfully.

It would be _really_ useful if people could post a test case that always (or almost always) generates the problem. I'm having a hard time generating a test case myself (though admittedly this is because I was testing drives on a different controller for most of the weekend).

In response to comment #10: I doubt very much that the bug occurs generally with 2.4.x. Initially, before I filed the bug, I tried various 2.4.x and various 2.6.x kernels. 2.4.x consistently did not exhibit the bug, whereas 2.6.x did. Currently I have a server being heavily pounded, with an uptime of 47 days; I have DMA enabled and am running 2.4.26. I find it hard to believe that it isn't some other problem causing the bug for you with 2.4.26.

In response to comment #15: please read my comment (the first, when I submitted the bug). It gives a recipe that reliably reproduces the bug within seconds.

I think more reports confirming it on 2.6.x aren't very useful. Comment #9 hit the nail on the head: the bug must have been introduced somewhere in the 2.5.x series. Someone should try various 2.5.x kernels to pin down exactly which version first introduced the bug. Alas, I have only one production server, so I cannot do this myself.

I tried every 2.6 kernel, and all have the bug. Also tried 2.5.50 and 2.5.75, which seem to be perfect, at least on my machine. Does this mean the bug was introduced between 2.5.75 and 2.6.0? Please, people, try both 2.5.75 and 2.6.0 to confirm this.

I believe that this patch fixes the problem:

```diff
--- linux-2.6.7/drivers/ide/ide-probe.c 2004-06-21 15:25:51.000000000 +0200
+++ linux/drivers/ide/ide-probe.c       2004-06-21 15:29:19.901710936 +0200
@@ -897,7 +897,7 @@
 	blk_queue_segment_boundary(q, 0xffff);
 
 	if (!hwif->rqsize)
-		hwif->rqsize = hwif->no_lba48 ? 256 : 65536;
+		hwif->rqsize = 256;
 	if (hwif->rqsize < max_sectors)
 		max_sectors = hwif->rqsize;
 	blk_queue_max_sectors(q, max_sectors);
```

I've been running with this minor change all day with a 2.6.7 kernel, and it's working without any DMA errors. UDMA5 set, drives clobbered, no problems so far. The only error I am seeing, which does not appear to be affecting anything and could be totally unrelated:

```
kernel: APIC error on CPU1: 60(60)
kernel: APIC error on CPU0: 60(60)
```

Again, this error isn't causing any perceptible problems, so I would consider this patch effective at solving this specific problem.
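For anyone who wants to try the change above, a minimal sketch of applying and rebuilding follows. The tree location, patch file name, and config option name are illustrative assumptions on my part, not details from the report.

```sh
# Save the hunk above as pdc20265-rqsize.diff, then apply it to a
# vanilla 2.6.7 tree (adjust paths to your setup).
cd /usr/src/linux-2.6.7
patch -p1 < ~/pdc20265-rqsize.diff

# Rebuild with your existing .config; the old-style Promise driver is
# CONFIG_BLK_DEV_PDC202XX_OLD in 2.6-era kernels (assumed to be
# already enabled in your config).
make && make modules_install && make install
```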
Note that as of 2.6.8.1 this fix has not yet been applied. Probably reasonable; there have been no comments here one way or the other on whether the fix works. Has anyone else tried it out? Does anyone understand ide-probe.c well enough to comment on whether this is "obviously correct", correct only for Promise IDE cards, or not actually correct at all?

The proper fix for PDC20265 was merged in 2.6.8 and was reported to work.

I can confirm that the newest kernel (2.6.8.1) works for me.

I'm still facing the same problem with 2.6.8.1. This is from dmesg:

```
hdj: dma_timer_expiry: dma status == 0x40
hdj: DMA timeout retry
PDC202XX: Primary channel reset.
PDC202XX: Secondary channel reset.
hdj: timeout waiting for DMA
hdj: dma_timer_expiry: dma status == 0x41
hdj: DMA timeout error
hdj: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
hdj: status timeout: status=0xd0 { Busy }
PDC202XX: Primary channel reset.
PDC202XX: Secondary channel reset.
hdj: drive not ready for command
ide4: reset: master: error (0x00?)
hdj: dma_timer_expiry: dma status == 0x41
hdj: DMA timeout error
hdj: dma timeout error: status=0x59 { DriveReady SeekComplete DataRequest Error }
hdj: dma timeout error: error=0x00 { }
hdj: dma_timer_expiry: dma status == 0x41
hdj: DMA timeout error
hdj: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
```

Created attachment 3848 [details]
pdc20267

This happens on a PDC20267 too. On 2.6.8.1 I must use the attached patch to make things work correctly.
I have similar problems as well, with all three machines in my setup.

Machine #1: Swan contains a PcChips M811LU motherboard, which is KT266A-based, and is equipped with one hard disk and one DVD burner:

```
hda: QUANTUM Bigfoot TX4.0AT, ATA DISK drive
hdc: ATAPI 40X DVD-ROM CD-R/RW drive, 2048kB Cache, UDMA(33)
```

Swan currently runs a home-built 2.6.7 kernel and has been tested with everything from the default Slackware 9.0 kernel (2.4.20, "bare.i") to home-built 2.6.10. Swan does not currently appear to suffer from DMA issues.

Machine #2: Stork contains an MSI MS6712 motherboard, which is KT400-based, and is equipped with three hard disks and one DVD reader:

```
hda: Maxtor 5T060H6, ATA DISK drive
hdb: Maxtor 92040U6, ATA DISK drive
hdc: HITACHI DVD-ROM GD-2500, ATAPI CD/DVD-ROM drive
hdd: Maxtor 91360U4, ATA DISK drive
```

Stork currently runs a home-built 2.6.7 kernel and has been tested with the default Slackware 10.0 kernel (2.4.26) as well as home-built 2.6.7 and 2.6.10 kernels. Stork currently has the DMA problems previously described by others. Drives hdb and hdd are used most often, as these hold root and home respectively, and both show the DMA errors previously described in this bug. I have not noticed a problem with drive hda.

Machine #3: Rainbird contains an Intel 440BX motherboard (not sure if that is the chipset or the board model; it was a gift) and is equipped with one hard disk and one DVD reader:

```
hda: Seagate Technology 1275MB - ST31276A, ATA DISK drive
hdc: Memorex DVD-632, ATAPI CD/DVD-ROM drive
```

Rainbird currently runs a home-built 2.6.7 kernel and has been tested with everything from the default Slackware 8.1 kernel (2.4.18, "bare.i") to a locally compiled 2.6.10. Rainbird currently has the DMA problems, except that the "dma status" is "0x20" when an error occurs.

General: in all cases it seems to be enough to just wait for the affected computer to get its head out of its hind end, after which it continues where it left off, otherwise unaffected. All machines run home-built 2.6.7 kernels, each adjusted to the machine's hardware and purpose. Swan and Rainbird are simple client machines, using their local disks mostly for swap space and boot information; both NFS-root from Stork. Swan also uses part of its hard disk for Windows 2000.

I forgot to mention that on Rainbird, described above, the DMA error often occurs while reiserfsck is running at startup, as well as at other times while the system is in use, and sometimes the machine has to be rebooted to wake it back up.

I'm running a 2.6.10-gentoo-r6 kernel with an A7V. If I disable DMA, things seem to work. If I enable DMA, I get strange behavior: if I use hdparm to enable DMA without "Use PCI DMA by default when available" set, the drive mounts, but accessing it gives an I/O error. If I boot with "Use PCI DMA" enabled, everything *appears* to work fine, and for smaller files it is. However, for large files (>400 MB, it seems) the data is corrupt. I could md5sum a file, copy it to the drive on the PDC20265 controller, md5sum it again, and get a different result. Interestingly, if I then reboot with DMA disabled, the md5sum is correct. So it appears that the reading of large files is broken.
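The large-file corruption just described is easy to script as a round-trip check; here is a minimal sketch. The mount point /mnt/promise and the 512 MB file size are assumptions for illustration. Note that without a remount (or reboot) between write and read, the page cache can serve back the data you wrote and mask on-disk corruption, which fits the reporter's observation that the checksum only changed after rebooting.

```sh
# Round-trip integrity check against a filesystem on the
# Promise-attached drive (mount point is a placeholder).
SRC=/tmp/testfile.bin
DST=/mnt/promise/testfile.bin

dd if=/dev/urandom of="$SRC" bs=1M count=512   # well over the ~400 MB threshold
before=$(md5sum < "$SRC")

cp "$SRC" "$DST"
sync
# Force a re-read from disk; assumes an /etc/fstab entry for the mount point.
umount /mnt/promise && mount /mnt/promise
after=$(md5sum < "$DST")

if [ "$before" = "$after" ]; then
    echo "OK: checksums match"
else
    echo "CORRUPTION: checksums differ"
fi
```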
In my system logs:

```
PDC20265: IDE controller at PCI slot 0000:00:11.0
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 12
PCI: setting IRQ 12 as level-triggered
ACPI: PCI interrupt 0000:00:11.0[A] -> GSI 12 (level, low) -> IRQ 12
PDC20265: chipset revision 2
PDC20265: 100% native mode on irq 12
PDC20265: (U)DMA Burst Bit DISABLED Primary PCI Mode Secondary PCI Mode.
ide2: BM-DMA at 0x8000-0x8007, BIOS settings: hde:pio, hdf:DMA
ide3: BM-DMA at 0x8008-0x800f, BIOS settings: hdg:DMA, hdh:DMA
Probing IDE interface ide2...
ide2: Wait for ready failed before probe !
hdf: Maxtor 91826U4, ATA DISK drive
ide2 at 0x9400-0x9407,0x9002 on irq 12
hdf: max request size: 128KiB
hdf: 35673120 sectors (18264 MB) w/2048KiB Cache, CHS=35390/16/63
hdf: cache flushes not supported
hdf: hdf1
Probing IDE interface ide3...
hdg: ST380011A, ATA DISK drive
ide3 at 0x8800-0x8807,0x8402 on irq 12
hdg: max request size: 128KiB
hdg: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=16383/255/63
hdg: cache flushes supported
hdg: hdg1
BIOS EDD facility v0.16 2004-Jun-25, 1 devices found
```

... and the errors with DMA enabled:

```
Mar 16 03:10:21 [kernel] hde: dma_timer_expiry: dma status == 0x21
Mar 16 03:10:35 [kernel] hde: DMA timeout error
Mar 16 03:10:55 [kernel] hde: dma_timer_expiry: dma status == 0x21
Mar 16 03:11:09 [kernel] hde: DMA timeout error
Mar 16 03:11:09 [kernel] end_request: I/O error, dev hde, sector 63
Mar 16 03:11:09 [kernel] Remounting filesystem read-only
Mar 16 03:11:09 [kernel] end_request: I/O error, dev hde, sector 148635727
Mar 16 03:11:09 [kernel] Remounting filesystem read-only
Mar 16 03:11:09 [kernel] end_request: I/O error, dev hde, sector 4231
- Last output repeated twice -
Mar 16 03:11:09 [kernel] Remounting filesystem read-only
Mar 16 11:03:14 [kernel] hdg: dma_timer_expiry: dma status == 0x61
Mar 16 11:03:28 [kernel] hdg: DMA timeout error
Mar 16 11:12:45 [kernel] hdg: dma_timer_expiry: dma status == 0x61
Mar 16 11:12:59 [kernel] hdg: DMA timeout error
Mar 16 11:12:59 [kernel] end_request: I/O error, dev hdg, sector 4223
- Last output repeated 3 times -
Mar 16 11:14:58 [kernel] Remounting filesystem read-only
Mar 16 11:15:01 [kernel] end_request: I/O error, dev hdg, sector 4223
```

The problem is still not solved with the following combination:

- PDC20262
- drives over 40 GB (and _only_ over 40 GB)
- 2.6.x (AFAIK only some really old 2.4 kernels work...)

I get the same DMA errors as posted above. Cheers, Csaba

I just had the exact same problem on 2.6.11-gentoo-r4 with a dual-processor Athlon MP 2400 with two Promise 20268 TX2 100 cards. The only way to get things to work reliably with my software RAID5 system that was using these two controllers was to disable DMA and set the multi-sector count to 16 (see the hdparm sketch after this comment). This was with 120 GB WD drives with 8 MB buffers. I have since replaced the Promise cards with a HighPoint RocketRAID 454 card and have not had any problems yet (a few days).

I am lucky enough to have the same problem on a Promise 20269 with a three-disk RAID5. Here are the logs:

```
kernel: [4294884.186000] hdf: dma_timer_expiry: dma status == 0x61
kernel: [4294894.186000] hdf: DMA timeout error
kernel: [4294894.186000] hdf: dma timeout error: status=0x50 { DriveReady SeekComplete }
```

Once this happens, a minute later my system hangs unless I re-enable DMA with hdparm. The problem occurs with kernel 2.6.12-9-386, Ubuntu 5.10 out of the box.

This may in fact be a different bug, but I have the same symptoms as Alexander Sandler, i.e. DMA timeouts, followed by an IDE error, followed by a crash.
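A minimal sketch of the two hdparm workarounds mentioned in the reports above: running a drive in PIO with a multi-sector count of 16, and re-enabling DMA after the kernel has turned it off following a timeout. The device names are placeholders for whatever drives sit on the Promise controller.

```sh
# Workaround: disable DMA and set the multi-sector count to 16,
# as the RAID5 reporter above did.
hdparm -d0 -m16 /dev/hde

# Recovery: re-enable DMA after a timeout has disabled it
# (only helps until the next timeout).
hdparm -d1 /dev/hdf

# Verify the current DMA and multcount settings.
hdparm -d -m /dev/hde /dev/hdf
```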
I do not know why, but if I have a serial console I get no information; with VGA I get a traceback, but of course only the last part is visible. I was running 2.6.12-1.1381 (FC3):

```
panic + 0x42
printk + 0x17
K7_machine_check + 0x1b4
K7_machine_check + 0x0
error_code + 0x4f
ide_insw + 0x8
ata_input_data + 0x13
task_file_input_data + 0x13
ide_pio_sector + 0xc8
ide_pio_multi + 0x29
task_in_intr + 0x8d
ide_intr + 0x275
timer_inter + 0x7a
task_in_intr + 0x0
```

(I may have made a mistake transcribing.) Note that the interface is now PIO, but there was no console log indicating DMA had been disabled.

A bit of rambling, but I'm not sure what might be useful. I have tried CentOS (2.6.9), FC3 (2.6.10, 2.6.12), and FC4 (2.6.11, 2.6.16); all have the same symptoms. The motherboard is an ASUS K8V-X with a Sempron 3000 (64-bit) running in 32-bit mode. There are two ITE8212 cards and a Promise 100 (20267), and I'm running two 8-way RAIDs. I have tried different combinations of IDE cards (Promise 66 - 20262, CMD648). In several combinations the system hangs starting the kernel (after GRUB outputs "kernel ...", "initrd ...", but before "Linux decompressing ..."), though not consistently. I can power off, wait, and then power on, and it will work (sometimes). I have tried resetting the BIOS. Also, if I just replace the Promise card with the CMD card, the BIOS does not seem to recognize the on-board primary interface. I seem to have a configuration that works by adding a third ITE card (broken, only the secondary interface works) and the CMD card (only using one interface!), go figure.

I have tried different cables and card slots, and exchanged the ITE and Promise cards, but the problem moves with the Promise card (either 20262 or 20267). I have tried the Promise card in slots that share interrupts and in one that doesn't. I exercise the system with `md5sum /dev/md3`. The problem occurs within minutes to an hour or so:

- 2 interfaces, shared interrupts (VIA8237): under 1 minute
- 1 interface, shared interrupts (VIA8237): tens of minutes
- 2 interfaces, no shared interrupts: about 1 hour

I have CentOS (2.6.9) running on an Intel board, a Celeron 800 MHz with a CMD card and two Promise 66 (20262) cards, and it seems to run fine. (Side note: some long time ago, when no one cared about Promise cards, I discovered that the 262s would occasionally read 64 KB of zeros using LBA48 commands on a small (80 GB) drive. I patched the driver to disable LBA48 and they ran fine on 2.4 kernels. I will try and see if limiting the request size, as for the 267 and 265, works.) The 16 drives used to be connected to a 1.2 GHz Celeron and they were working fine.

Finally, I have another (different) Intel board with a 1.2 GHz Celeron; it used to work with ITE and CMD cards but got cannibalized for the Sempron. It now has a Promise 66, a Promise 100, and a CMD card (well, two single-channel cards). It fails to boot with 2.6.10 and 2.6.12-1.1381 (as above, it just hangs after GRUB, before "uncompressing ..."), but will boot with 2.6.9-1.667 (all FC3 releases). It seems to work, but I'm still exercising it. My final server is a PIII 1 GHz running a 2.4 RH9; it is my main server, so I haven't screwed with it!

All systems were working fine (for several years now) until I decided to upgrade to 2.6 and get a faster processor for video games (the Sempron). I have six Promise cards, two and a half ITE8212 cards, and three and a half (two flavours) CMD646 cards. Locally I can only get Promise cards, so it would be nice to get them working again.
I can do most any experiment, but I would appreciate several suggestions to try at the same time, as it is a pain to take down and reconfigure my servers (described above).

Regarding the Intel board with the 1.2 GHz Celeron described above (the one that fails to boot with 2.6.10 and 2.6.12-1.1381 but boots with 2.6.9-1.667): today this computer booted on 2.6.12-1.1381 twice successfully, with NO changes at all, just power down and up. Oh well, must be gremlins.

Well, my problem is probably not the Promise card (despite the fact that the problem followed it). After five days of running without it, my machine crashed with a machine check error 0000000000000004 Bank 4: b200000000070f0f, with a similar traceback:

```
error_code + 0x4F
ide_inb + 0x3
ide_dma_timer_retry + 0xf9
ide_dma_intr + 0x0
ide_timer_expiry + 0x24
ide_timer_expiry + 0x0
run_timer_soft_irq + 0x12E
```

I too am seeing this problem under Debian's 2.6.16-1 kernel build. I'm running an old machine, a Pentium II 450 MHz with an Intel i440BX chipset. The controller is a Promise FastTrak66 (PDC20262). Hard drives are Western Digital 60, 80, and 250 GB. I tried applying the IDE RAID card compatibility patch (firmware update) from Western Digital, but only the 80 GB drive needed it, and the results are still the same. I have two FT66 cards and would happily donate one if you think it would help solve this problem. I am also happy to try kernel patches if you need testers.

Have you seen Bug 1556? I think it's a dupe of this one.

Do the drives seeing the problem all support LBA48?

Closing due to inactivity