Most recent kernel where this bug did not occur: 2.2.13
Distribution: Any
Hardware Environment: 2x i686 (P III Coppermine) @ 650 MHz, 100 MHz FSB, twelve IDE HDDs on three pdc20262, SB16PCI (AKA Ensoniq 5880), Realtek RTL-8196, MGA G200, [3C509B, EMU10K, com90C66 Arcnet, and Lava LP card on ISA], 419856K PC133 RAM.
Software Environment: Debian Sarge, with kernel.org kernel
Problem Description: IO-APIC IRQ of DEATH on very high IRQ loads, which seem to scribble on the PTE area
Steps to reproduce: from bash
for i in a b c d e f g h i j k l ; { dd if=/dev/hd$i of=/dev/null & }
This simulates a normal load when I do a parallel backup.
Photo references:
http://dr.ea.ms/~oldfart/panics/P1010004.JPG (stock kernel)
http://dr.ea.ms/~oldfart/panics/P1010009.JPG (stock kernel with 2 bogus PTE entries)
Another report in the wild suggests this can happen on other combinations of hardware when disk and networking are exercised heavily.
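For anyone trying to reproduce this, the one-liner above can be written out as a small script that also waits for the readers and reports failures. This is only a sketch: parallel_read is a hypothetical helper name that does not appear in the report, and the /dev/hda through /dev/hdl names are taken from the reproduction step, so adjust them to the drives actually present.

```shell
#!/bin/sh
# Sketch of the parallel read load from the report.
# parallel_read is a hypothetical helper, not something from the original report.
# Note: pids, failed, dev and pid are plain globals to keep this POSIX sh.
parallel_read() {
    pids=""
    failed=0
    for dev in "$@"; do
        # One background reader per device, discarding the data it reads.
        dd if="$dev" of=/dev/null 2>/dev/null &
        pids="$pids $!"
    done
    # Wait for every reader and count the ones that exited non-zero.
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}

# Example for the twelve drives in the report (needs read access to the devices):
#   parallel_read /dev/hda /dev/hdb /dev/hdc /dev/hdd /dev/hde /dev/hdf \
#                 /dev/hdg /dev/hdh /dev/hdi /dev/hdj /dev/hdk /dev/hdl
```

The return value is the number of readers that failed, so a clean run returns 0. On the affected machine the run is not expected to get that far, since the panic happens while the readers are still active.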
vmlinux binary, .config file, lspci, etc.:
http://dr.ea.ms/~oldfart/panics/diagdata.tgz
Source tree including the PTE area fudge (CAUTION: LARGE FILE, 71.0M):
http://dr.ea.ms/~oldfart/panics/linux-2.6.13.4-panic.tgz
Hope these resources can assist in finding the problem.
Created attachment 6575 [details] do_page_fault debug patch Could you please reproduce the bug with the attached patch?
The patch gave an undefined reference to read_cr3.
It compiled after I copied the routine from the kernel exec source.
Created attachment 6577 [details] patch to the patch
Created attachment 6578 [details] Lights, Camera, Action... Death
Created attachment 6589 [details] Additional panic information capture Added an mdelay(1) in the pdc*old.c DMA routine so I could capture the remainder of the printk output before the machine went totally dark.
Created attachment 6594 [details] More debug info After attaching a serial console and doing a clean and remake, the fault address moved; however, it is pretty much consistent. This is as interesting as it is annoying, and I am beginning to suspect the problem is some sort of race condition under high load where the DMA engine gets told the wrong address.
Created attachment 7171 [details] 2.2.17 working IDE + other stuff patch Here is the entire tree patch I use. It also includes all sorts of misc bugfixes that are not related to the IDE issue, but are nice to have and could perhaps be a factor, so I am posting the entire set. Also included are the config settings that I use, so that one may be able to replicate the kernel code with gcc version 2.7.2.3 and libc5.
Is this bug still present in kernel 2.6.18?
I still need to check whether it is; I shall test it in a few days. My guess is that it probably still is.
Finally did the test. I added more options for tracing and whatnot, and got an SMP lock wedged on cpu#0, then, after some time, the debug dump. I apologize for the poor photos, but they show the same error. http://dr.ea.ms/~oldfart/panics/002.bmp http://dr.ea.ms/~oldfart/panics/005.bmp
Oh yeah, and it was tested on 2.6.19.2, for reference ;-) Sorry, forgot.
It looks like it still may be trapping on the APIC access. Can you get a picture of the main oops dump?
Andrew, can you please take a serial capture with the latest kernel? Zwane, would that be sufficient? As I understand it, you need the oops from an unmodified kernel.
This is looking like a dead bug. Let's shut it down if nobody can reproduce it in 2.6.22.
I guess I will have to test again... If it still fails, then I will assume it is a motherboard issue.
Andrew, any news? Did you get a chance to try the latest kernel?