Bug 2494

Summary: PDC20265 crashes when DMA is enabled
Product: IO/Storage
Reporter: peter mutsaers (pmutsaers)
Component: IDE
Assignee: Bartlomiej Zolnierkiewicz (bzolnier)
Status: REJECTED INSUFFICIENT_DATA
Severity: normal
CC: drescher0110-lists, hhielscher, sumbach
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.5 and 2.6.5-mm4, not in 2.4.x
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: pdc20267

Description peter mutsaers 2004-04-11 13:23:36 UTC
Distribution: slackware
Hardware Environment: A7V, with normal (VIA) IDE controller and on-board 
PDC20265
Software Environment: the "old" promise IDE driver
Problem Description:
When DMA is enabled and two disks are connected (one to each channel) and used
heavily, a crash follows soon. At first random programs start crashing, and
starting new programs results in complaints that required shared libraries
cannot be found or loaded, even when the executables and shared libraries
reside on other disks not connected to the PDC20265.
As soon as a 2.4.x kernel is used, DMA is disabled, or the disks are moved to the
normal VIA controller, the problem no longer happens.

Steps to reproduce:
Start badblocks -n on both disks; the crash happens within a few seconds.
With badblocks -n or large copies on just one disk, the same happens, but much
later and less predictably.
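
For anyone who wants to script the two-disk case, here is a minimal sketch. The device names /dev/hde and /dev/hdg are assumptions (use whichever disks sit on the two PDC20265 channels), and the helper names are made up for illustration. It defaults to a dry run that only prints the commands.

```python
import subprocess


def badblocks_commands(devices):
    """Build one non-destructive read-write badblocks command per device."""
    return [["badblocks", "-n", dev] for dev in devices]


def run_in_parallel(commands, dry_run=True):
    """Start all commands at once so both channels are loaded together."""
    if dry_run:
        # Only show what would be run; nothing touches the disks.
        for argv in commands:
            print(" ".join(argv))
        return []
    procs = [subprocess.Popen(argv) for argv in commands]
    return [p.wait() for p in procs]


# Hypothetical device names; substitute the disks on the PDC20265.
commands = badblocks_commands(["/dev/hde", "/dev/hdg"])
run_in_parallel(commands, dry_run=True)
```

Passing dry_run=False starts both badblocks processes at the same time, which is the part that seems to matter for triggering the crash quickly.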
Comment 1 Frank Elis 2004-05-12 18:23:55 UTC
Can confirm the problem also exists in 2.6.6.  Have seen this problem occur with
vanilla 2.6.6 kernels as well as with redhat FC 2 test kernels.

Any heavy activity on any drive attached to a PDC20265 controller will cause
this bug to manifest (such as RAID activity, filesystem rsyncs, etc.).  The
system generally becomes unusable and will not reboot without a power cycle.
Comment 2 Frank Elis 2004-05-12 18:27:55 UTC
Here are some kernel logs of the timeout occurring:

hde: dma_timer_expiry: dma status == 0x60
hde: DMA timeout retry
PDC202XX: Primary channel reset
PDC202XX: Secondary channel reset
hde: timeout waiting for DMA
hde: dma_timer_expiry: dma status == 0x61
hde: DMA timeout error

Controller works perfectly under 2.4 kernel.  With either a RH 9 box upgraded to
2.6.6, or with FC 2 test 3 kernels and installations this bug always occurs with
load.  
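
In case it helps with reading the "dma status" values, here is a small decoding sketch. It assumes the value is the standard PCI bus-master IDE status register; the bit labels are my own naming, not the driver's.

```python
# Bit layout of the generic PCI bus-master IDE status register
# (assumption: the logged "dma status" is this register, read as hex).
BM_STATUS_BITS = [
    (0x01, "active"),
    (0x02, "error"),
    (0x04, "interrupt"),
    (0x20, "drive0_dma_capable"),
    (0x40, "drive1_dma_capable"),
    (0x80, "simplex_only"),
]


def decode_dma_status(value):
    """Return the names of the bits set in a bus-master status byte."""
    return [name for mask, name in BM_STATUS_BITS if value & mask]


for status in (0x60, 0x61):
    print(hex(status), decode_dma_status(status))
```

On that reading, 0x61 has the active bit still set with no interrupt bit, which would fit a transfer that was started but never completed.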
Comment 3 Carl Englund 2004-05-23 13:00:23 UTC
Confirmed with PDC20265 running gentoo-dev-sources-2.6.5-r1 and
mm-sources-2.6.6-mm5.
Comment 4 Carl Englund 2004-05-24 03:37:26 UTC
Hm. I'm getting similar lockups even without DMA enabled but it usually takes
longer before they occur.
Comment 5 Frank Elis 2004-05-30 18:35:45 UTC
It's also happening for me now using only one hard drive.  Seems to be an even
more critical bug than first surmised.
Comment 6 Juergen Striegel 2004-05-31 05:52:54 UTC
Confirmed on kernel 2.6.5 on a Debian sarge system (SMP, dual Celeron, PDC
driver built into the kernel, not as a module).
Two controllers (Ultra TX2) with four HDs connected as master (one HD per
channel) are OK. After connecting a third controller with one HD, this fifth HD
shows DMA errors (not the four previously connected HDs). This is not an IRQ, HD or
controller problem (cross-swapping tests: all components are OK).
Additional observation: the Promise BIOS doesn't show the fifth HD on booting, but
the kernel reports all five HDs correctly. When setting DMA=1 for the first four HDs
and setting DMA=0 for the fifth HD, the system works well (but with all the
limitations a PIO-driven HD implies).
Comment 7 Null 2004-06-01 12:03:34 UTC
Confirmed on 2.6.7-rc2, 2.6.7-rc2-bk1 and 2.6.7-rc2-bk2:

kernel: hde: dma_timer_expiry: dma status == 0x20
kernel: hde: DMA timeout retry
kernel: PDC202XX: Primary channel reset.
kernel: PDC202XX: Secondary channel reset.
kernel: hde: timeout waiting for DMA
kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
kernel: hde: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hde: multwrite_intr: error=0x04 { DriveStatusError }
kernel: PDC202XX: Primary channel reset.
kernel: PDC202XX: Secondary channel reset.
kernel: ide2: reset: master: error (0x00?)
kernel: hde: dma_timer_expiry: dma status == 0x21
kernel: hde: DMA timeout error
kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
kernel: hde: dma_timer_expiry: dma status == 0x21
kernel: hde: DMA timeout error
kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
kernel:
kernel: hde: dma_timer_expiry: dma status == 0x21
kernel: hde: DMA timeout error
kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
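
For reference, the status=0x.. values in these lines are the ATA status register, and the names in braces are just its bits spelled out. A minimal decoding sketch follows; the names are chosen to match what the log prints, and treating Busy as overriding the other bits is an assumption based on the log output.

```python
# ATA status register bits, named the way the kernel log prints them.
ATA_STATUS_BITS = [
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ata_status(value):
    """Return the flag names for an ATA status byte."""
    if value & 0x80:
        # While Busy is set the other bits are not meaningful.
        return ["Busy"]
    return [name for mask, name in ATA_STATUS_BITS if value & mask]


for status in (0x51, 0x58):
    print(hex(status), "{", " ".join(decode_ata_status(status)), "}")
```

So 0x51 is a ready drive reporting an error, while 0x58 is a ready drive still asking for data, with no error flagged at all.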
Comment 8 Chris Thompson 2004-06-05 10:26:15 UTC
I can confirm that this bug affects me running a 2.6.3 kernel and does not 
affect me running a 2.4.x kernel.  I, too, have a promise IDE controller (non-
RAID mode) on my motherboard.  I am running Mandrake 10.  It is unstable with 
a 2.6 kernel, exhibiting the described problems, but is stable with a 2.4 
kernel.

I have a soyo dragon plus motherboard with an athlon 1700+ xp CPU.

I recommend this bug be changed to high priority as it affects a lot of 
users.  The promise IDE chipset is very popular on many motherboards.
Comment 9 Bartlomiej Zolnierkiewicz 2004-06-05 10:30:51 UTC
I don't have PDC20265 so I'm unable to reproduce it.

If you are willing to help please narrow the problem to a specific
kernel version (i.e. 2.5.xx works and 2.5.yy doesn't).  Thanks.
Comment 10 Marco Rossini 2004-06-06 13:26:02 UTC
Also A7V Board, running ArchLinux0.6 kernel 2.6.3/2.6.6/2.4.26 (all standard distribution kernels). 
Special: 
- occurs with 2.4.26 kernel as well 
- No heavy load on disks required, kernel hangs on boot time (when kernel starts INIT) 
- No problem when using `ide=nodma' 
 
output `lspci': 
00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133] (rev 02) 
00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP] 
00:04.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 22) 
00:04.1 IDE interface: VIA Technologies, Inc. VT82C586/B/686A/B PIPC Bus Master IDE (rev 10) 
00:04.2 USB Controller: VIA Technologies, Inc. USB (rev 10) 
00:04.3 USB Controller: VIA Technologies, Inc. USB (rev 10) 
00:04.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 30) 
00:0a.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 07) 
00:0a.1 Input device controller: Creative Labs SB Live! MIDI/Game Port (rev 07) 
00:0b.0 Ethernet controller: Macronix, Inc. [MXIC] MX987x5 (rev 20) 
00:11.0 Unknown mass storage controller: Promise Technology, Inc. 20265 (rev 02) 
01:00.0 VGA compatible controller: ATI Technologies Inc: Unknown device 5961 (rev 01) 
01:00.1 Display controller: ATI Technologies Inc: Unknown device 5941 (rev 01) 
Comment 11 Null 2004-06-07 05:49:21 UTC
I'd be happy to test any patches, I have several PDC20265 boxes.  As to the
specific kernels it happens with:

2.6.5
2.6.6
2.6.7-rc1
2.6.7-rc2
2.6.7-rc2-mm2
Comment 12 Null 2004-06-07 07:29:31 UTC
Also confirmed on 2.6.7-rc2-bk8:

Jun  7 10:11:43 localhost kernel: hde: dma_intr: status=0x50 { DriveReady
SeekComplete }
Jun  7 10:11:43 localhost kernel:
Jun  7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x20
Jun  7 10:11:43 localhost kernel: hde: DMA timeout retry
Jun  7 10:11:43 localhost kernel: PDC202XX: Primary channel reset.
Jun  7 10:11:43 localhost kernel: PDC202XX: Secondary channel reset.
Jun  7 10:11:43 localhost kernel: hde: timeout waiting for DMA
Jun  7 10:11:43 localhost kernel: hde: multwrite_intr: status=0x51 { DriveReady
SeekComplete Error }
Jun  7 10:11:43 localhost kernel: hde: multwrite_intr: error=0x04 {
DriveStatusError }
Jun  7 10:11:43 localhost kernel: hde: multwrite_intr: status=0x51 { DriveReady
SeekComplete Error }
Jun  7 10:11:43 localhost kernel: hde: multwrite_intr: error=0x04 {
DriveStatusError }
Jun  7 10:11:43 localhost kernel: hde: multwrite_intr: status=0x51 { DriveReady
SeekComplete Error }
Jun  7 10:11:43 localhost kernel: hde: multwrite_intr: error=0x04 {
DriveStatusError }
Jun  7 10:11:43 localhost kernel: hde: multwrite_intr: status=0x51 { DriveReady
SeekComplete Error }
Jun  7 10:11:43 localhost kernel: hde: multwrite_intr: error=0x04 {
DriveStatusError }
Jun  7 10:11:43 localhost kernel: PDC202XX: Primary channel reset.
Jun  7 10:11:43 localhost kernel: PDC202XX: Secondary channel reset.
Jun  7 10:11:43 localhost kernel: ide2: reset: master: error (0x00?)
Jun  7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x21
Jun  7 10:11:43 localhost kernel: hde: DMA timeout error
Jun  7 10:11:43 localhost kernel: hde: dma timeout error: status=0x58 {
DriveReady SeekComplete DataRequest }
Jun  7 10:11:43 localhost kernel:
Jun  7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x21
Jun  7 10:11:43 localhost kernel: hde: DMA timeout error
Jun  7 10:11:43 localhost kernel: hde: dma timeout error: status=0x58 {
DriveReady SeekComplete DataRequest }
Jun  7 10:11:43 localhost kernel:
Jun  7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x21
Jun  7 10:11:43 localhost kernel: hde: DMA timeout error
Jun  7 10:11:43 localhost kernel: hde: dma timeout error: status=0x58 {
DriveReady SeekComplete DataRequest }
Jun  7 10:11:43 localhost kernel:
Jun  7 10:11:43 localhost kernel: hde: DMA disabled
Jun  7 10:11:43 localhost kernel: hdg: DMA disabled

The box seems to recover though and is still useful after the DMA mode is
disabled.  Will do further testing to see if this remains the case as the day
goes by.
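
To compare runs across kernels, it may help to count the expiry lines per drive rather than eyeballing the dump. A rough sketch, assuming syslog lines in the format shown above; the sample list is a few lines copied from that dump, not a new log.

```python
import re
from collections import Counter

# Matches e.g. "hde: dma_timer_expiry: dma status == 0x21"
DMA_EXPIRY = re.compile(r"(hd[a-z]): dma_timer_expiry: dma status == (0x[0-9a-f]+)")


def count_dma_expiries(lines):
    """Count dma_timer_expiry events per (drive, status) pair."""
    counts = Counter()
    for line in lines:
        match = DMA_EXPIRY.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "Jun  7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x20",
    "Jun  7 10:11:43 localhost kernel: hde: DMA timeout retry",
    "Jun  7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x21",
    "Jun  7 10:11:43 localhost kernel: hde: DMA timeout error",
    "Jun  7 10:11:43 localhost kernel: hde: dma_timer_expiry: dma status == 0x21",
]
for (drive, status), n in sorted(count_dma_expiries(sample).items()):
    print(drive, status, n)
```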
Comment 13 Null 2004-06-07 07:34:30 UTC
Oh, if it's helpful, here's one more line from the previous logfile dump that
precedes the other errors:

Jun  7 10:11:43 localhost kernel: hde: dma_intr: bad DMA status (dma_stat=36)

This error is new.  Haven't seen this one before with other 2.6.x kernels and
the PDC20265 DMA problem.  Again, this error occurs before any of the other DMA
errors listed.
Comment 14 Null 2004-06-07 07:45:31 UTC
Need my coffee, keep forgetting details.

Also, this time the DMA timeout occurred during the initial boot of the system,
which has been a rare occurrence in the past.  I'll see if I can reproduce this
with reboots to see if it occurs whenever the box reboots.

The hardware config for this system is one IDE controller, PDC20265, two 80 GB
ATA100 harddrives (identical), only one of which is mounted.  The other, for the
moment, is not mounted or used by the system in any way.

Also, just ran an fdisk -l, which caused the same set of errors as in the past,
but with the 2.6.7-rc2-bk8 kernel the system seems to recover almost instantly.
Here are the errors with fdisk -l:

Jun  7 10:37:28 localhost kernel: hdg: read_intr: status=0x59 { DriveReady
SeekComplete DataRequest Error }
Jun  7 10:37:28 localhost kernel: hdg: read_intr: error=0x04 { DriveStatusError }
Jun  7 10:37:28 localhost kernel: hdg: read_intr: status=0x59 { DriveReady
SeekComplete DataRequest Error }
Jun  7 10:37:28 localhost kernel: hdg: read_intr: error=0x04 { DriveStatusError }
Jun  7 10:37:28 localhost kernel: hdg: read_intr: status=0x59 { DriveReady
SeekComplete DataRequest Error }
Jun  7 10:37:28 localhost kernel: hdg: read_intr: error=0x04 { DriveStatusError }
Jun  7 10:37:28 localhost kernel: hdg: read_intr: status=0x59 { DriveReady
SeekComplete DataRequest Error }
Jun  7 10:37:28 localhost kernel: hdg: read_intr: error=0x04 { DriveStatusError }
Jun  7 10:37:30 localhost kernel: PDC202XX: Secondary channel reset.
Jun  7 10:37:33 localhost kernel: PDC202XX: Primary channel reset.
Jun  7 10:37:28 localhost kernel: ide3: reset: master: error (0x00?)


Only polling the second drive seems to cause the error.  Also, the error does
not occur on subsequent retries of the fdisk -l command.

Running badblocks -n on /dev/hdg1 gets this response from the kernel:

Jun  7 10:40:26 localhost kernel: APIC error on CPU1: 00(60)
Jun  7 10:40:26 localhost kernel: APIC error on CPU1: 60(60)
Jun  7 10:40:26 localhost kernel: APIC error on CPU1: 60(60)
Jun  7 10:40:26 localhost kernel: APIC error on CPU0: 00(60)
Jun  7 10:40:26 localhost kernel: hdg: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Jun  7 10:40:26 localhost kernel:
Jun  7 10:40:26 localhost kernel: hdg: drive not ready for command
Jun  7 10:42:25 localhost kernel: hdg: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Jun  7 10:42:25 localhost kernel:
Jun  7 10:42:25 localhost kernel: hdg: drive not ready for command
Jun  7 10:42:30 localhost kernel: APIC error on CPU1: 60(60)
Jun  7 10:42:30 localhost last message repeated 3 times
Jun  7 10:42:30 localhost kernel: hdg: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Jun  7 10:42:30 localhost kernel:
Jun  7 10:42:30 localhost kernel: hdg: drive not ready for command

Read-only mode for badblocks does not produce this set of errors, and completes
successfully.


Comment 15 Chris Thompson 2004-06-07 08:02:12 UTC
It would be _really_ useful if people could post a test case that always (or 
almost always) generates the problem.  I'm having a hard time generating a 
test case myself (though admittedly this is because I was testing drives on a 
different controller for most of the weekend). 
Comment 16 peter mutsaers 2004-06-07 11:45:58 UTC
In response to comment #10: I doubt very much that the bug occurs generally with 
2.4.x. Initially, before I filed the bug, I tried with various 2.4.x and with 
various 2.6.x kernels. 2.4.x consistently did not exhibit the bug, whereas 2.6.x 
did.

Currently, I have a server being heavily pounded with an uptime of 47 days. I
have DMA enabled, and am running 2.4.26. I suspect that some other problem is
causing the hang for you with 2.4.26.

In response to comment #15: please read my comment (the first, when I submitted
the bug). It gives a recipe that reproduces the bug within seconds.

I think more reports to confirm with 2.6.x aren't very useful. Comment #9 hit 
the nail on the head: the bug must have been introduced somewhere in the 2.5.x 
series. Someone should try various 2.5.x to check in what exact version the bug 
was first introduced. Alas I have only 1 production server so I cannot do this 
myself.
Comment 17 Adolfo Gonz 2004-06-19 16:11:01 UTC
I tried every 2.6 kernel, and all have the bug. Also tried 2.5.50 and 2.5.75,
which seem to work perfectly, at least on my machine. Does this mean the bug was
introduced somewhere between 2.5.75 and 2.6.0?

Please people, try both 2.5.75 and 2.6.0 to confirm this.
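
If 2.5.75 is good and 2.6.0 is bad, the releases in between can be narrowed down by bisection. A minimal sketch, assuming the 2.6.0-test1 to 2.6.0-test11 releases are the candidates and that the bug stays present once introduced; is_bad stands for "boot this kernel, run the test, report whether it crashed" and is supplied by the tester.

```python
def first_bad_version(versions, is_bad):
    """Return the first version for which is_bad() is true.

    Assumes versions are in release order, versions[0] is known good,
    versions[-1] is known bad, and the bug never goes away in between.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Candidate list: an assumption about which releases to test.
CANDIDATES = (
    ["2.5.75"]
    + ["2.6.0-test%d" % n for n in range(1, 12)]
    + ["2.6.0"]
)

# Usage (is_bad must be provided by whoever boots and tests each kernel):
#   culprit = first_bad_version(CANDIDATES, is_bad)
```

With these 13 entries, at most four kernels need to be built and booted.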
Comment 18 Null 2004-06-23 11:59:30 UTC
I believe that this patch fixes the problem:

--- linux-2.6.7/drivers/ide/ide-probe.c 2004-06-21 15:25:51.000000000 +0200
+++ linux/drivers/ide/ide-probe.c 2004-06-21 15:29:19.901710936 +0200
@@ -897,7 +897,7 @@
 	blk_queue_segment_boundary(q, 0xffff);
 
 	if (!hwif->rqsize)
-		hwif->rqsize = hwif->no_lba48 ? 256 : 65536;
+		hwif->rqsize = 256;
 	if (hwif->rqsize < max_sectors)
 		max_sectors = hwif->rqsize;
 	blk_queue_max_sectors(q, max_sectors);

I've been running with this minor change all day with a 2.6.7 kernel and it's
working without any DMA errors.  UDMA5 set, drives clobbered, no problems so far.

The only error I am seeing, which does not appear to be affecting anything and
could be totally unrelated:

kernel: APIC error on CPU1: 60(60)
kernel: APIC error on CPU0: 60(60)

Again, this error isn't causing any perceptible problems, so I would consider
this patch to be effective at solving this specific problem.
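
To put the two rqsize values from the patch in perspective, here is the arithmetic as a small sketch. It assumes 512-byte sectors; the function name is made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: classic 512-byte ATA sectors


def request_limit_bytes(rqsize_sectors):
    """Largest single request, in bytes, for a given rqsize in sectors."""
    return rqsize_sectors * SECTOR_BYTES


for rqsize in (256, 65536):
    kib = request_limit_bytes(rqsize) // 1024
    print("rqsize=%d sectors -> %d KiB per request" % (rqsize, kib))
```

256 sectors is 128 KiB, the most an 8-bit sector count can express, while 65536 sectors is 32 MiB, the most a 16-bit LBA48 sector count can express. So the change caps every request at the smaller size regardless of whether LBA48 is available.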
Comment 19 Chris Thompson 2004-08-27 12:48:11 UTC
Note that as of 2.6.8.1, this fix has not yet been applied.  Probably 
reasonable, there have been no comments here one way or another on whether 
this fix works.  Anyone else tried it out?  Does anyone understand ide-probe.c 
well enough to comment on whether this is 'obviously correct', correct only 
for Promise IDE cards, or not actually correct at all? 
Comment 20 Bartlomiej Zolnierkiewicz 2004-08-31 03:48:09 UTC
Proper fix for PDC20265 was merged in 2.6.8 and was reported to work.
Comment 21 Chris Thompson 2004-09-13 20:54:15 UTC
I can confirm that the newest kernel (2.6.8.1) works for me. 
Comment 22 Rami AlZaid 2004-09-25 04:59:38 UTC
I'm still facing the same problem with 2.6.8.1:

This is from dmesg:
hdj: dma_timer_expiry: dma status == 0x40
hdj: DMA timeout retry
PDC202XX: Primary channel reset.
PDC202XX: Secondary channel reset.
hdj: timeout waiting for DMA
hdj: dma_timer_expiry: dma status == 0x41
hdj: DMA timeout error
hdj: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
hdj: status timeout: status=0xd0 { Busy }
PDC202XX: Primary channel reset.
PDC202XX: Secondary channel reset.
hdj: drive not ready for command
ide4: reset: master: error (0x00?)
hdj: dma_timer_expiry: dma status == 0x41
hdj: DMA timeout error
hdj: dma timeout error: status=0x59 { DriveReady SeekComplete DataRequest 
Error }
hdj: dma timeout error: error=0x00 { }
hdj: dma_timer_expiry: dma status == 0x41
hdj: DMA timeout error
hdj: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Comment 23 Krzysztof Chmielewski 2004-10-17 09:46:29 UTC
Created attachment 3848 [details]
pdc20267

This happens on PDC20267 too. On 2.6.8.1 I must use attached patch to make
things work correctly.
Comment 24 Vanessa Dannenberg 2005-01-14 10:35:29 UTC
I have similar problems as well, with all three machines in my setup.

Machine #1:
Swan contains a PcChips M811LU motherboard, which is KT-266A-based, and is equipped 
with one hard disk and one DVD burner:

hda: QUANTUM Bigfoot TX4.0AT, ATA DISK drive
hdc: ATAPI 40X DVD-ROM CD-R/RW drive, 2048kB Cache, UDMA(33)

Swan currently runs a home-built 2.6.7 kernel, and has been tested with everything 
from the default Slackware 9.0 kernel (2.4.20 - "bare.i") to home-built 2.6.10.  
Swan does not currently appear to suffer from DMA issues.

Machine #2:
Stork contains an MSI MS6712 motherboard, which is KT-400-based, and is equipped with 
three hard disks and one DVD reader:

hda: Maxtor 5T060H6, ATA DISK drive
hdb: Maxtor 92040U6, ATA DISK drive
hdc: HITACHI DVD-ROM GD-2500, ATAPI CD/DVD-ROM drive
hdd: Maxtor 91360U4, ATA DISK drive

Stork currently runs a home-built 2.6.7 kernel, and has been tested with the default 
Slackware 10.0 kernel (2.4.26) as well as home-built kernels 2.6.7 and 2.6.10.  Stork 
currently has DMA problems as previously described by others.  Drives hdb and hdd are 
most often used, as these are root and home, respectively, and both show the DMA 
errors previously described in this bug.  I have not noticed a problem with drive 
hda.

Machine #3:
Rainbird contains an Intel 440BX motherboard (not sure if that is the chipset or the 
board model, it was a gift), and is equipped with one hard disk and one DVD reader:

hda: Seagate Technology 1275MB - ST31276A, ATA DISK drive
hdc: Memorex DVD-632, ATAPI CD/DVD-ROM drive

Rainbird currently runs a home-built 2.6.7 kernel, and has been tested with 
everything from the default Slackware 8.1 kernel (2.4.18 - "bare.i") to kernel 2.6.10 
compiled locally.  Rainbird currently has DMA problems, with the exception that the 
"dma status" is "0x20" when an error occurs.

General:
In all cases it seems to be enough to just wait for the affected computer to get its 
head out of its hind end, where it then will continue where it left off, otherwise 
unaffected.  All machines run home-built 2.6.7 kernels, each adjusted according to 
each machine's hardware and purpose.

Swan and Rainbird are simple client machines, using their local disks mostly for swap 
space and boot information.  Both NFS-root from Stork.  Swan also uses part of its 
hard disk for Windows 2000.
Comment 25 Vanessa Dannenberg 2005-01-14 10:39:24 UTC
I forgot to mention that on Rainbird described above, the DMA error often occurs 
while reiserfsck is running at startup, as well as other times while the system is 
being used, and sometimes the machine has to be rebooted to get it to wake back up.
Comment 26 Stuart Shelton 2005-03-16 03:32:07 UTC
I'm running a 2.6.10-gentoo-r6 kernel with an A7V.

If I disable DMA, things seem to work.
If I enable DMA, then I get strange behavior:

If I use hdparm to enable DMA without having the "Use PCI DMA by default when
available" kernel option enabled, then the drive mounts, but accessing it gives
an I/O error.

If I boot with "Use PCI DMA" enabled, then everything *appears* to work fine,
and for smaller files it does.  However, for large (>400 MB, it seems) files, the
data is corrupt.

I could md5sum a file, copy it to the drive on the PDC20265 controller, and
md5sum it again and get a different result.  Interestingly, if I then reboot with
DMA disabled, the md5sum is correct.  So it appears that the reading of large
files is broken.
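
The copy-and-compare check can be scripted along these lines. This is only a sketch: the function names are made up, and a read-back done right after the copy may be served from the page cache rather than the disk, so rebooting (or otherwise dropping caches) before re-checking the copy is the more telling test.

```python
import hashlib
import shutil


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst and report whether both checksums agree."""
    before = file_md5(src)
    shutil.copyfile(src, dst)
    after = file_md5(dst)
    return before == after, before, after


# Usage, with hypothetical paths:
#   ok, before, after = copy_and_verify("/home/big.iso", "/mnt/pdc/big.iso")
```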

In my system logs:

PDC20265: IDE controller at PCI slot 0000:00:11.0
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 12
PCI: setting IRQ 12 as level-triggered
ACPI: PCI interrupt 0000:00:11.0[A] -> GSI 12 (level, low) -> IRQ 12
PDC20265: chipset revision 2
PDC20265: 100% native mode on irq 12
PDC20265: (U)DMA Burst Bit DISABLED Primary PCI Mode Secondary PCI Mode.
    ide2: BM-DMA at 0x8000-0x8007, BIOS settings: hde:pio, hdf:DMA
    ide3: BM-DMA at 0x8008-0x800f, BIOS settings: hdg:DMA, hdh:DMA
Probing IDE interface ide2...
ide2: Wait for ready failed before probe !
hdf: Maxtor 91826U4, ATA DISK drive
ide2 at 0x9400-0x9407,0x9002 on irq 12
hdf: max request size: 128KiB
hdf: 35673120 sectors (18264 MB) w/2048KiB Cache, CHS=35390/16/63
hdf: cache flushes not supported
 hdf: hdf1
Probing IDE interface ide3...
hdg: ST380011A, ATA DISK drive
ide3 at 0x8800-0x8807,0x8402 on irq 12
hdg: max request size: 128KiB
hdg: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=16383/255/63
hdg: cache flushes supported
 hdg: hdg1
BIOS EDD facility v0.16 2004-Jun-25, 1 devices found

... and the errors with DMA enabled:

Mar 16 03:10:21 [kernel] hde: dma_timer_expiry: dma status == 0x21
Mar 16 03:10:35 [kernel] hde: DMA timeout error
Mar 16 03:10:55 [kernel] hde: dma_timer_expiry: dma status == 0x21
Mar 16 03:11:09 [kernel] hde: DMA timeout error
Mar 16 03:11:09 [kernel] end_request: I/O error, dev hde, sector 63
Mar 16 03:11:09 [kernel] Remounting filesystem read-only
Mar 16 03:11:09 [kernel] end_request: I/O error, dev hde, sector 148635727
Mar 16 03:11:09 [kernel] Remounting filesystem read-only
Mar 16 03:11:09 [kernel] end_request: I/O error, dev hde, sector 4231
                - Last output repeated twice -
Mar 16 03:11:09 [kernel] Remounting filesystem read-only

Mar 16 11:03:14 [kernel] hdg: dma_timer_expiry: dma status == 0x61
Mar 16 11:03:28 [kernel] hdg: DMA timeout error
Mar 16 11:12:45 [kernel] hdg: dma_timer_expiry: dma status == 0x61
Mar 16 11:12:59 [kernel] hdg: DMA timeout error
Mar 16 11:12:59 [kernel] end_request: I/O error, dev hdg, sector 4223
                - Last output repeated 3 times -
Mar 16 11:14:58 [kernel] Remounting filesystem read-only
Mar 16 11:15:01 [kernel] end_request: I/O error, dev hdg, sector 4223
Comment 27 Macskasi Csaba 2005-05-01 05:53:45 UTC
The problem is still not solved using the following stuff:
- pdc20262
- drives over 40gb (and _only_ over 40gb)
- 2.6.x (afaik only some really old 2.4 work...)
I get the same dma-errors as posted above...

Cheers,
Csaba
Comment 28 John M. Drescher 2005-10-17 06:57:57 UTC
I just had the exact same problem on 2.6.11-gentoo-r4 with a dual processor
Athlon MP 2400 with two promise 20268 TX2 100 cards. The only way to get things
to work reliably with my software raid5 system that was using these two
controllers was to  disable dma and set the multisector count to 16. This was
with 120 GB WD drives with 8MB buffers. I have since replaced the promise cards
with a highpoint RocketRaid 454 card and have not had any problems yet (a few
days). 
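
For reference, the workaround corresponds to hdparm's -d0 (DMA off) and -m16 (multiple sector count 16) options. A small sketch that builds the command for each array member; the device names are placeholders and the helper name is made up.

```python
def hdparm_workaround(device, multcount=16):
    """Build the hdparm call that disables DMA and sets the multcount."""
    return ["hdparm", "-d0", "-m%d" % multcount, device]


# Hypothetical member disks of the RAID5 array.
for dev in ("/dev/hde", "/dev/hdg"):
    print(" ".join(hdparm_workaround(dev)))
```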
Comment 29 Alexander Sandler 2006-04-18 13:24:36 UTC
I am lucky to have the same problem on Promise 20269 + 3 disk RAID5.
Here are the logs.

kernel: [4294884.186000] hdf: dma_timer_expiry: dma status == 0x61
kernel: [4294894.186000] hdf: DMA timeout error
kernel: [4294894.186000] hdf: dma timeout error: status=0x50 { DriveReady 
SeekComplete }

Once this happens, my system hangs a minute later unless I re-enable DMA with
hdparm.
The problem occurs with kernel 2.6.12-9-386 (Ubuntu 5.10 out of the box).
Comment 30 mike 2006-06-15 15:54:15 UTC
This may in fact be a different bug, but I have the same symptoms as Alexander
Sandler, i.e. DMA timeouts, followed by an IDE error, followed by a crash. I do not know
why, but if I have a serial console I get no information; with VGA I get a
traceback, but of course only the last part is visible. I was running
2.6.12-1.1381 (FC3).

panic + 0x42
printk + 0x17
K7_machine_check + 0x1b4
K7_machine_check + 0x0
error_code + 0x4f
ide_insw +0x8
ata_input_data + 0x13
task_file_input_data + 0x13
ide_pio_sector + 0xc8
ide_pio_multi + 0x29
task_in_intr + 0x8d
ide_intr 0x275
timer_inter + 0x7a
task_in_intr + 0x0 

...

(I may have made a mistake transcribing)
Note that the interface is now PIO, but there was no console log indicating that DMA
had been disabled.

A bit of rambling, but I'm not sure what might be useful :)
I have tried Centos (2.6.9), FC3 (2.6.10, 2.6.12) and FC4 (2.6.11, 2.6.16); all have
the same symptoms. The motherboard is an ASUS k8v-x with a sempron 3000 (64bit)
running in 32 bit mode. There are 2 ite8212 cards and a promise 100 (20267) and
I'm running 2 8-way raids. I have tried different combinations of ide cards
(promise 66 - 20262, CMD 648). In several combinations, the system hangs
starting the kernel (after grub outputs "kernel ...", "initrd ...", but before
"linux decompressing ..."), though not consistently. I can power off, wait and then
power on and it will work (sometimes). I have tried resetting the BIOS. Also, if
I just replace the promise card with the CMD card, the BIOS does not seem to
recognize the on-board primary interface.
I seem to have a configuration that works, by adding a third ite card (broken,
only the secondary interface works) and the CMD card (only using 1 interface!). Go
figure. I have tried different cables and card slots, and exchanged the ite and the promise
card, but the problem moves with the promise card (either 20262 or 20267). I
have tried the promise in slots that share interrupts and one that doesn't. I
exercise the system with "md5sum /dev/md3". The problem seems to occur
within minutes to an hour or so:
2 interfaces shared interrupts (VIA8237) <1 min
1 interface shared interrupts (VIA8237) 10's mins
2 interfaces no shared interrupts 1 hr

I have Centos (2.6.9) running on an Intel board, Celeron 800MHz, with a CMD card
and 2 promise 66 (20262) cards, and it seems to run fine. (Side note: some long
time ago, when no-one cared about promise cards, I discovered that the 262's
would occasionally read 64k of zeros using LBA48 commands on a small (80GB)
drive. I patched the driver to disable LBA48 and they ran fine - 2.4 kernels. I
will try and see if limiting the request size as for 267 and 265 works.)
The 16 drives used to be connected to a 1.2G Celeron and they were working fine.

Finally I have another (different) Intel board with a 1.2 Celeron and it used to
work with ite and cmd cards, but got cannibalized for the Sempron. It now has a
promise 66, 100 and a CMD card (well, 2 single channel cards). It fails to boot
with 2.6.10 and 2.6.12-1.1381 (as above, it just hangs after grub before
"uncompressing ..."), but will boot with 2.6.9-1.667 (all FC3 releases). It seems
to work, but I'm still exercising it. My final server is a PIII 1GHz running a
2.4 RH9 and it is my main server, so I haven't screwed with it!
All systems were working fine (for several years now), until I decided to
upgrade to 2.6 and get a faster processor for video games (the Sempron).

I have 6 Promise cards, 2 1/2 ITE8212 cards and 3 1/2 (2 flavours) CMD646 cards.
Locally I can only get Promise cards so it would be nice to get them working again.

I can do most any experiment, but I would appreciate several suggestions to try
at the same time as it is a pain to take down and reconfigure my servers
Comment 31 mike 2006-06-16 14:16:36 UTC
(from above)
Finally I have another (different) Intel board with a 1.2 Celeron and it used to
work with ite and cmd cards, but got cannibalized for the Sempron. It now has a
promise 66, 100 and a CMD card (well, 2 single channel cards). It fails to boot
with 2.6.10 and 2.6.12-1.1381 (as above, it just hangs after grub before
"uncompressing ..."), but will boot with 2.6.9-1.667 (all FC3 releases).
Today this computer booted on 2.6.12-1.1381 twice successfully, with NO changes
at all, just power down and up. Oh well, must be gremlins.
Comment 32 mike 2006-06-20 13:19:56 UTC
Well my problem is probably not the promise card (despite the fact that the
problem followed it). 5 days of running without it and my machine crashed with a
Machine check error 0000000000000004
Bank 4: b200000000070f0f
 with a similar traceback

...
error_code + 0x4F
ide_inb + 0x3
ide_dma_timer_retry + 0xf9
ide_dma_intr + 0x0
ide_timer_expiry + 0x24
ide_timer_expiry + 0x0
run_timer_soft_irq + 0x12E
Comment 33 Sam Umbach 2006-07-11 22:24:20 UTC
I too am seeing this problem under Debian's 2.6.16-1 kernel build.  I'm running
an old machine, Pentium II 450MHz w/ Intel i440BX chipset.  The controller is a
Promise FastTrak66 (PDC 20262).  Hard drives are Western Digital 60, 80, and
250GB.  I tried applying the IDE RAID card compatibility patch (firmware update)
from Western Digital, but only the 80GB drive needed it and the results are
still the same.

I have two FT66 cards and would happily donate one if you think it would help
solve this problem.  I am also happy to try kernel patches if you need testers.

Have you seen Bug 1556?  I think it's a dupe of this one.
Comment 34 Alan 2007-06-05 06:21:47 UTC
Do the drives seeing the problem all support LBA48?
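
One way to answer this is to look at the drive's IDENTIFY DEVICE data. A minimal sketch, assuming the caller already has the 256 identify words as integers (how they are obtained is left open); in the ATA specification the 48-bit feature set is flagged in word 83, bit 10. The second helper only checks whether a capacity fits in 28-bit addressing.

```python
LBA28_MAX_SECTORS = 1 << 28  # 28-bit addressing limit, about 137 GB


def supports_lba48(identify_words):
    """True if IDENTIFY word 83 reports the 48-bit address feature set."""
    word83 = identify_words[83]
    # Word 83 is only valid when bit 14 is set and bit 15 is clear.
    valid = (word83 & 0xC000) == 0x4000
    return valid and bool(word83 & (1 << 10))


def fits_in_lba28(total_sectors):
    """True if the whole drive is addressable without LBA48."""
    return total_sectors <= LBA28_MAX_SECTORS


# Sector count of the 80 GB hdg from the boot log in comment 26.
print(fits_in_lba28(156301488))
```

Note that a drive small enough for LBA28 may still advertise LBA48, so capacity alone does not answer the question.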
Comment 35 Alan 2007-06-18 07:42:21 UTC
Closing due to inactivity