Bug 5565

Summary: Guess of i386 APIC PTE area scribble
Product: Platform Specific/Hardware
Reporter: Andrew J. Kroll (a)
Component: i386
Assignee: Zwane Mwaikambo (zwane)
Status: REJECTED UNREPRODUCIBLE
Severity: high
CC: akpm, bunk, bzolnier, mingo, protasnb
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.13.4
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: do_page_fault debug patch
patch to the patch
Lights, Camera, Action... Death
Additional panic information capture
More debug info
2.2.17 working IDE + other stuff patch

Description Andrew J. Kroll 2005-11-07 14:27:58 UTC
Most recent kernel where this bug did not occur: 2.2.13
Distribution: Any
Hardware Environment: 2x i686 (P III Coppermine) @ 650 MHz, 100 MHz FSB, twelve IDE
HDDs on three pdc20262, SB16PCI (AKA Ensoniq 5880), Realtek RTL-8196, MGA G200,
[3C509B, EMU10K, com90C66 Arcnet, and Lava LP card on ISA], 419856K PC133 RAM.
Software Environment: Debian Sarge, with kernel.org kernel
Problem Description: IO-APIC IRQ of DEATH under very high IRQ loads, which seems to
scribble on the PTE area

Steps to reproduce: from bash
for i in a b c d e f g h i j k l ; { dd if=/dev/hd$i of=/dev/null & }

This simulates a normal load when I do a parallel backup.
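
The one-liner above uses bash's brace form of the for loop. A minimal sketch of the same load in the documented do/done form follows, assuming the twelve drives show up as /dev/hda through /dev/hdl as in the original command. The function name parallel_read and its prefix argument are hypothetical, added only so the loop can be pointed at other devices or at scratch files.

```shell
#!/bin/bash
# Start one dd reader per device in the background, then wait for all of them.
# Usage: parallel_read PREFIX SUFFIX...
# parallel_read and PREFIX are illustrative names, not part of the original report.
parallel_read() {
    local prefix=$1
    shift
    local suffix
    for suffix in "$@"; do
        # Data is discarded; only the read (and interrupt) load matters.
        dd if="${prefix}${suffix}" of=/dev/null &
    done
    wait    # block until every background dd has exited
}

# On the reporter's machine the equivalent call would be (as root):
#   parallel_read /dev/hd a b c d e f g h i j k l
```

Unlike the original one-liner, this version waits for the readers to finish before returning, which makes it easier to tell from a serial console whether the run completed or the machine wedged partway through.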

photo references:
http://dr.ea.ms/~oldfart/panics/P1010004.JPG (stock kernel)
http://dr.ea.ms/~oldfart/panics/P1010009.JPG (stock kernel with 2 bogus PTE entries)

Another report in the wild suggests this can happen on other combinations of
hardware when disk and networking are exercised heavily.
Comment 1 Andrew J. Kroll 2005-11-12 19:08:54 UTC
vmlinux binary, .config file, lspci, etc
http://dr.ea.ms/~oldfart/panics/diagdata.tgz

source tree including the pte area fudge (CAUTION LARGE FILE!!! 71.0M!!!)
http://dr.ea.ms/~oldfart/panics/linux-2.6.13.4-panic.tgz

Hope these resources can assist in finding the problem.
Comment 2 Zwane Mwaikambo 2005-11-14 00:15:05 UTC
Created attachment 6575 [details]
do_page_fault debug patch

Could you please reproduce the bug with the attached patch?
Comment 3 Andrew J. Kroll 2005-11-14 03:57:25 UTC
patch gave an undefined reference to read_cr3 ....
Comment 4 Andrew J. Kroll 2005-11-14 04:23:51 UTC
compiled after I copied the routine from the kernel exec source.
Comment 5 Andrew J. Kroll 2005-11-14 04:31:57 UTC
Created attachment 6577 [details]
patch to the patch
Comment 6 Andrew J. Kroll 2005-11-14 05:19:57 UTC
Created attachment 6578 [details]
Lights, Camera, Action... Death
Comment 7 Andrew J. Kroll 2005-11-14 20:45:07 UTC
Created attachment 6589 [details]
Additional panic information capture

Added an mdelay(1) in the pdc*old.c dma routine so I could capture the
remainder of the printk's before the machine would totally go dark.
Comment 8 Andrew J. Kroll 2005-11-15 22:08:10 UTC
Created attachment 6594 [details]
More debug info

After attaching a serial console and doing a clean rebuild, the fault
address moved... however, it is otherwise pretty much consistent. This is as
interesting as it is annoying, and I am beginning to suspect the problem is some
sort of race condition under high load where the DMA gets told the wrong address.
Comment 9 Andrew J. Kroll 2006-01-29 05:09:58 UTC
Created attachment 7171 [details]
2.2.17 working IDE + other stuff patch

Here is the entire tree patch I use... it includes all sorts of misc bugfixes
as well that are not related to the IDE issue, but are nice to have and perhaps
could be a factor, thus I am tossing up the entire set. Also included are the
config settings that I use, so that one may be able to replicate the kernel
code with gcc version 2.7.2.3 and libc5.
Comment 10 Adrian Bunk 2006-11-13 07:53:11 UTC
Is this bug still present in kernel 2.6.18?
Comment 11 Andrew J. Kroll 2006-11-27 23:49:57 UTC
I still need to check whether it is; I shall check in a few days. My guess is
that it probably still is.
Comment 12 Andrew J. Kroll 2007-01-31 18:12:00 UTC
Finally did the test... I added more options for tracing and whatnot... got an 
SMP lock wedged on cpu#0 then, after some time, the debug dump. I apologize 
for the poor photos, but they are the same error.

http://dr.ea.ms/~oldfart/panics/002.bmp
http://dr.ea.ms/~oldfart/panics/005.bmp
Comment 13 Andrew J. Kroll 2007-01-31 18:18:15 UTC
oh yeah, and it was tested on 2.6.19.2, for reference ;-) sorry, forgot.
Comment 14 Zwane Mwaikambo 2007-01-31 23:14:25 UTC
It looks like it still may be trapping on the APIC access. Can you get a picture
of the main oops dump? 
Comment 15 Natalie Protasevich 2007-07-03 18:11:08 UTC
Andrew, can you please take a serial capture with the latest kernel?
Zwane, would that be sufficient? As I understand it, you needed an oops from an unmodified kernel.
Comment 18 Andrew Morton 2007-08-02 15:44:59 UTC
This is looking like a dead bug.  Let's shut it down if
nobody can reproduce it in 2.6.22.
Comment 19 Andrew J. Kroll 2007-08-06 16:07:29 UTC
I guess I will have to test again... If it still fails, then I will assume it is a motherboard issue.
Comment 20 Natalie Protasevich 2008-03-05 22:20:01 UTC
Andrew, any news? Did you get a chance to try the latest kernel?