Bug 6142 - Skge related Oops on P3 SMP box with IRQ migration enabled
Summary: Skge related Oops on P3 SMP box with IRQ migration enabled
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: i386 Linux
Importance: P2 blocking
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-02-28 15:41 UTC by Krzysztof Oledzki
Modified: 2006-09-22 10:03 UTC

See Also:
Kernel Version: 2.6.15.4
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Full .config file (154 bytes, text/html)
2006-08-05 05:28 UTC, Krzysztof Oledzki
possible IRQ race fix (1.02 KB, patch)
2006-08-29 16:01 UTC, Stephen Hemminger

Description Krzysztof Oledzki 2006-02-28 15:41:36 UTC
Distribution: Slackware

Hardware Environment:
# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 10
cpu MHz         : 998.478
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 mmx fxsr sse
bogomips        : 1999.41

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 10
cpu MHz         : 998.478
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 mmx fxsr sse
bogomips        : 1996.47

# lspci -v
00:00.0 Host bridge: VIA Technologies, Inc. VT82C693A/694x [Apollo PRO133x] (rev c4)
        Subsystem: ABIT Computer Corp.: Unknown device a204
        Flags: bus master, medium devsel, latency 8
        Memory at d0000000 (32-bit, prefetchable) [size=16M]
        Capabilities: [a0] AGP version 2.0
        Capabilities: [c0] Power Management version 2

00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x
AGP] (prog-if 00 [Normal decode])
        Flags: bus master, 66Mhz, medium devsel, latency 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        Capabilities: [80] Power Management version 2

00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 40)
        Subsystem: ABIT Computer Corp.: Unknown device 0000
        Flags: bus master, stepping, medium devsel, latency 0
        Capabilities: [c0] Power Management version 2

00:07.1 IDE interface: VIA Technologies, Inc.
VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06) (prog-if 8a
[Master SecP PriP])
        Subsystem: VIA Technologies, Inc.
VT82C586/B/VT82C686/A/B/VT8233/A/C/VT8235 PIPC Bus Master IDE
        Flags: bus master, medium devsel, latency 32
        I/O ports at c000 [size=16]
        Capabilities: [c0] Power Management version 2

00:07.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 16) (prog-if 00 [UHCI])
        Subsystem: VIA Technologies, Inc. (Wrong ID) USB Controller
        Flags: bus master, medium devsel, latency 32, IRQ 10
        I/O ports at c400 [size=32]
        Capabilities: [80] Power Management version 2

00:07.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 16) (prog-if 00 [UHCI])
        Subsystem: VIA Technologies, Inc. (Wrong ID) USB Controller
        Flags: bus master, medium devsel, latency 32, IRQ 10
        I/O ports at c800 [size=32]
        Capabilities: [80] Power Management version 2

00:07.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40)
        Flags: medium devsel, IRQ 9
        Capabilities: [68] Power Management version 2

00:09.0 VGA compatible controller: ATI Technologies Inc 215CT [Mach64 CT] (rev
41) (prog-if 00 [VGA])
        Flags: stepping, medium devsel, IRQ 7
        Memory at d1000000 (32-bit, non-prefetchable) [size=16M]
        Expansion ROM at 88060000 [disabled] [size=64K]

00:0b.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T [Marvell]
(rev 10)
        Subsystem: 3Com Corporation 3C941 Gigabit LOM Ethernet Adapter
        Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 177
        Memory at d3000000 (32-bit, non-prefetchable) [size=16K]
        I/O ports at cc00 [size=256]
        Expansion ROM at 88000000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data

00:0c.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
        Subsystem: 3Com Corporation 3C905B Fast Etherlink XL 10/100
        Flags: bus master, medium devsel, latency 32, IRQ 169
        I/O ports at d000 [size=128]
        Memory at d3004000 (32-bit, non-prefetchable) [size=128]
        Expansion ROM at 88020000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 1

00:0e.0 Mass storage controller: Triones Technologies, Inc.
HPT366/368/370/370A/372/372N (rev 04)
        Subsystem: Triones Technologies, Inc. HPT370A
        Flags: bus master, 66Mhz, medium devsel, latency 120, IRQ 185
        I/O ports at d400 [size=8]
        I/O ports at d800 [size=4]
        I/O ports at dc00 [size=8]
        I/O ports at e000 [size=4]
        I/O ports at e400 [size=256]
        Expansion ROM at 88040000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 2

# cat /proc/interrupts
           CPU0       CPU1
  0:   86255208  115575456    IO-APIC-edge  timer
  1:          7          1    IO-APIC-edge  i8042
  4:        136        146    IO-APIC-edge  serial
  8:          1          0    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 10:          0          0   IO-APIC-level  uhci_hcd:usb1, uhci_hcd:usb2
 12:         71         23    IO-APIC-edge  i8042
 14:    1936004    2200642    IO-APIC-edge  ide0
 15:    1957196    2227564    IO-APIC-edge  ide1
169:          0          0   IO-APIC-level  eth0
177:   24238205      13926   IO-APIC-level  skge
185:    3713789    4198732   IO-APIC-level  ide2, ide3
NMI:          0          0
LOC:  201861862  201862935
ERR:          0
MIS:          0
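
The /proc/interrupts listing above can be summarized per CPU with a small script. This is only a sketch: the function name `parse_interrupts` and the embedded sample are illustrative, and the sample rows are copied from the listing in this report.

```python
SAMPLE = """\
           CPU0       CPU1
 14:    1936004    2200642    IO-APIC-edge  ide0
177:   24238205      13926   IO-APIC-level  skge
185:    3713789    4198732   IO-APIC-level  ide2, ide3
ERR:          0
"""

def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = [int(f) for f in fields[:ncpus] if f.isdigit()]
        if len(counts) != ncpus:             # rows such as ERR/MIS hold one total only
            continue
        table[label.strip()] = (counts, " ".join(fields[ncpus:]))
    return table

if __name__ == "__main__":
    table = parse_interrupts(SAMPLE)
    for irq in ("177", "185"):
        counts, desc = table[irq]
        share = counts[0] / sum(counts)
        print(f"IRQ {irq} ({desc}): {share:.1%} of interrupts handled on CPU0")
```

In this snapshot the skge interrupt (IRQ 177) was handled almost entirely on CPU0, while the IDE interrupt on IRQ 185 was spread over both CPUs.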


Software Environment:
# cat /proc/version
Linux version 2.6.15.4-debug (root@space) (gcc version 3.4.5) #1 SMP PREEMPT Sat
Feb 25 12:41:19 CET 2006

# cat /proc/modules
bonding 55396 0 - Live 0xf89c3000

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v2.6.5 (November 4, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:04:76:90:8b:ba

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:0a:5e:53:ac:1a

dmesg:
Linux version 2.6.15.4-debug (root@space) (gcc version 3.4.5) #1 SMP PREEMPT Sat
Feb 25 12:41:19 CET 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
 BIOS-e820: 000000007fff0000 - 000000007fff3000 (ACPI NVS)
 BIOS-e820: 000000007fff3000 - 0000000080000000 (ACPI data)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
2047MB LOWMEM available.
found SMP MP-table at 000f5700
On node 0 totalpages: 524272
  DMA zone: 4096 pages, LIFO batch:0
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 520176 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:0
DMI 2.3 present.
ACPI: RSDP (v000 VIA694                                ) @ 0x000f7050
ACPI: RSDT (v001 VIA694 AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x7fff3000
ACPI: FADT (v001 VIA694 AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x7fff3040
ACPI: MADT (v001 VIA694          0x00000000  0x00000000) @ 0x7fff5640
ACPI: DSDT (v001 VIA694 AWRDACPI 0x00001000 MSFT 0x0100000c) @ 0x00000000
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:8 APIC version 17
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:8 APIC version 17
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 dfl dfl)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:7ec00000)
Built 1 zonelists
Kernel command line: auto BOOT_IMAGE=Linux-2.6.15.4d ro root=900
rootflags=data=journal hdb=noprobe console=ttyS0,115200
ide_setup: hdb=noprobe
mapped APIC to ffffd000 (fee00000)
mapped IOAPIC to ffffc000 (fec00000)
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 998.478 MHz processor.
Using tsc for high-res timesource
Console: colour VGA+ 80x30
Dentry cache hash table entries: 524288 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 262144 (order: 8, 1048576 bytes)
Memory: 2070484k/2097088k available (2812k kernel code, 26068k reserved, 993k
data, 200k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 1999.41 BogoMIPS (lpj=999707)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0387fbff 00000000 00000000 00000000 00000000
00000000 00000000
CPU: After vendor identify, caps: 0387fbff 00000000 00000000 00000000 00000000
00000000 00000000
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU serial number disabled.
CPU: After all inits, caps: 0383fbf7 00000000 00000000 00000040 00000000
00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
mtrr: v2.0 (20020519)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
CPU0: Intel Pentium III (Coppermine) stepping 0a
Booting processor 1/1 eip 3000
Initializing CPU#1
Calibrating delay using timer specific routine.. 1996.47 BogoMIPS (lpj=998238)
CPU: After generic identify, caps: 0387fbff 00000000 00000000 00000000 00000000
00000000 00000000
CPU: After vendor identify, caps: 0387fbff 00000000 00000000 00000000 00000000
00000000 00000000
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU serial number disabled.
CPU: After all inits, caps: 0383fbf7 00000000 00000000 00000040 00000000
00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel Pentium III (Coppermine) stepping 0a
Total of 2 processors activated (3995.89 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
checking TSC synchronization across 2 CPUs: passed.
Brought up 2 CPUs
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfb370, last bus=1
PCI: Using configuration type 1
mtrr: your CPUs had inconsistent variable MTRR settings
mtrr: probably your BIOS does not setup all CPUs.
mtrr: corrected configuration.
ACPI: Subsystem revision 20050902
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
ACPI: Assume root bridge [\_SB_.PCI0] bus is 0
PCI quirk: region 6000-607f claimed by vt82c686 HW-mon
PCI quirk: region 5000-500f claimed by vt82c686 SMB
Boot video device is 0000:00:09.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 *7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 1 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 1 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 1 3 4 5 6 7 *10 11 12 14 15)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 11 devices
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
PCI: Bridge: 0000:00:01.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Setting latency timer of device 0000:00:01.0 to 64
IA-32 Microcode Update Driver: v1.14 <tigran@veritas.com>
audit: initializing netlink socket (disabled)
audit(1140970843.006:1): initialized
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
NTFS driver 2.1.25 [Flags: R/O].
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
PCI: Enabling Via external APIC routing
ACPI: Power Button (FF) [PWRF]
ACPI: Power Button (CM) [PWRB]
Real Time Clock Driver v1.12
PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:08: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
nbd: registered device at major 43
ACPI: PCI Interrupt 0000:00:0c.0[A] -> GSI 19 (level, low) -> IRQ 169
3c59x version LK1.1.19
eth0: 3Com PCI 3c905B Cyclone 100baseTx at 0xf8802000.
 00:04:76:90:8b:ba, IRQ 169
  product code 4d4c rev 00.12 date 04-11-01
  Internal config register is 1800000, transceivers 0xa.
  8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
  MII transceiver found at address 24, status 7849.
  Enabling bus-master transmits and whole-frame receives.
eth0: scatter/gather enabled. h/w checksums enabled
ACPI: PCI Interrupt 0000:00:0b.0[A] -> GSI 17 (level, low) -> IRQ 177
skge 1.3 addr 0xd3000000 irq 177 chip Yukon rev 1
skge eth1: addr 00:0a:5e:53:ac:1a
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 0000:00:07.1
PCI: Via IRQ fixup for 0000:00:07.1, from 255 to 0
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci0000:00:07.1
    ide0: BM-DMA at 0xc000-0xc007, BIOS settings: hda:DMA, hdb:DMA
    ide1: BM-DMA at 0xc008-0xc00f, BIOS settings: hdc:DMA, hdd:pio
Probing IDE interface ide0...
hda: WDC WD800JB-00JJC0, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
hdc: WDC WD800BB-00JHC0, ATA DISK drive
ide1 at 0x170-0x177,0x376 on irq 15
HPT370A: IDE controller at PCI slot 0000:00:0e.0
ACPI: PCI Interrupt 0000:00:0e.0[A] -> GSI 18 (level, low) -> IRQ 185
HPT370A: chipset revision 4
HPT370A: 100% native mode on irq 185
HPT37X: using 33MHz PCI clock
    ide2: BM-DMA at 0xe400-0xe407, BIOS settings: hde:DMA, hdf:pio
HPT37X: using 33MHz PCI clock
    ide3: BM-DMA at 0xe408-0xe40f, BIOS settings: hdg:DMA, hdh:pio
Probing IDE interface ide2...
hde: WDC WD800JB-00FSA0, ATA DISK drive
ide2 at 0xd400-0xd407,0xd802 on irq 185
Probing IDE interface ide3...
hdg: WDC WD800JB-00JJC0, ATA DISK drive
ide3 at 0xdc00-0xdc07,0xe002 on irq 185
Probing IDE interface ide4...
Probing IDE interface ide5...
hda: max request size: 128KiB
hda: 156301488 sectors (80026 MB) w/8192KiB Cache, CHS=65535/16/63, UDMA(100)
hda: cache flushes supported
 hda: hda1 hda2 hda3 < hda5 hda6 hda7 hda8 hda9 hda10 hda11 hda12 >
hdc: max request size: 128KiB
hdc: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
hdc: cache flushes supported
 hdc: hdc1 hdc2 hdc3 < hdc5 hdc6 hdc7 hdc8 hdc9 hdc10 hdc11 hdc12 >
hde: max request size: 1024KiB
hde: 156301488 sectors (80026 MB) w/8192KiB Cache, CHS=16383/255/63, UDMA(100)
hde: cache flushes supported
 hde: hde1 hde2 hde3 < hde5 hde6 hde7 hde8 hde9 hde10 hde11 hde12 >
hdg: max request size: 128KiB
hdg: 156301488 sectors (80026 MB) w/8192KiB Cache, CHS=65535/16/63, UDMA(100)
hdg: cache flushes supported
 hdg: hdg1 hdg2 hdg3 < hdg5 hdg6 hdg7 hdg8 hdg9 hdg10 hdg11 hdg12 >
libata version 1.20 loaded.
USB Universal Host Controller Interface driver v2.3
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10
ACPI: PCI Interrupt 0000:00:07.2[D] -> Link [LNKD] -> GSI 10 (level, low) -> IRQ 10
uhci_hcd 0000:00:07.2: UHCI Host Controller
uhci_hcd 0000:00:07.2: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:07.2: irq 10, io base 0x0000c400
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:07.3[D] -> Link [LNKD] -> GSI 10 (level, low) -> IRQ 10
uhci_hcd 0000:00:07.3: UHCI Host Controller
uhci_hcd 0000:00:07.3: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:07.3: irq 10, io base 0x0000c800
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
mice: PS/2 mouse device common for all mice
input: PC Speaker as /class/input/input0
md: raid1 personality registered as nr 3
md: raid10 personality registered as nr 9
md: raid5 personality registered as nr 4
raid5: automatically using best checksumming function: pIII_sse
input: AT Translated Set 2 keyboard as /class/input/input1
   pIII_sse  :  1964.000 MB/sec
raid5: using function: pIII_sse (1964.000 MB/sec)
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
Netfilter messages via NETLINK v0.30.
NET: Registered protocol family 2
IP route cache hash table entries: 131072 (order: 7, 524288 bytes)
TCP established hash table entries: 262144 (order: 9, 3145728 bytes)
TCP bind hash table entries: 65536 (order: 7, 786432 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 228 bytes per conntrack
ctnetlink v0.90: registering with nfnetlink.
ip_tables: (C) 2000-2002 Netfilter core team
input: ImExPS/2 Generic Explorer Mouse as /class/input/input2
ipt_time loading
ipt_random match loaded
ipt_recent v0.3.1: Stephen Frost <sfrost@snowman.net>. 
http://snowman.net/projects/ipt_recent/
arp_tables: (C) 2002 David S. Miller
TCP bic registered
TCP westwood registered
TCP highspeed registered
TCP hybla registered
TCP htcp registered
TCP vegas registered
TCP scalable registered
NET: Registered protocol family 1
NET: Registered protocol family 10
ip6_tables: (C) 2000-2002 Netfilter core team
registering ipv6 mark target
NET: Registered protocol family 17
802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
All bugs added by David S. Miller <davem@redhat.com>
Using IPI No-Shortcut mode
ACPI wakeup devices:
PCI0 USB0 USB1 MODM UAR1 UAR2
ACPI: (supports S0 S1 S4 S5)
BIOS EDD facility v0.16 2004-Jun-25, 4 devices found
md: Autodetecting RAID arrays.
(...)
md: ... autorun DONE.

Problem Description:

The kernel generates an Oops about two or three times per week, in random
places. I enabled the suggested kernel debugging options and caught two more
precise oopses:

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_PAGEALLOC=y

Unable to handle kernel paging request at virtual address 252d7a5a
 printing eip:
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: bonding
CPU:    0
EIP:    0060:[<7831dea6>]    Not tainted VLI
EFLAGS: 00010202   (2.6.15.4-debug)
EIP is at skb_copy_bits+0x11b/0x1f5
eax: 00005a5a   ebx: 00000002   ecx: 00000002   edx: a75cded8
esi: 252d7a5a   edi: a75cded8   ebp: a75cded8   esp: a5219c7c
ds: 007b   es: 007b   ss: 0068
Process httpd (pid: 32547, threadinfo=a5218000 task=b1242ae0)
Stack: 00000000 00005a90 00000036 00000000 c99d1f64 a93d1f60 000000aa 00000004
       7831d947 a93d1f60 00000036 a75cddf8 00000002 a93d1f60 a93d1f60 f76e8c00
       7831d9d9 a93d1f60 000000aa 00000004 00000020 f76e8ebc a93d1f60 f76e8c00
Call Trace:
 [<7831d947>] skb_copy_expand+0xa7/0xc5
 [<7831d9d9>] skb_pad+0x74/0xcb
 [<782abdad>] skge_xmit_frame+0x45/0x28f
 [<7832b342>] qdisc_restart+0xdf/0x1b8
 [<78321b71>] net_tx_action+0x9c/0xef
 [<78123279>] __do_softirq+0x55/0xbd
 [<78123311>] do_softirq+0x30/0x35
 [<78123374>] local_bh_enable+0x5e/0x7e
 [<78321902>] dev_queue_xmit+0x1d8/0x1df
 [<78338726>] ip_output+0x1e0/0x236
 [<78338b67>] ip_queue_xmit+0x3eb/0x461
 [<781440d3>] poison_obj+0x21/0x41
 [<7814551c>] cache_free_debugcheck+0x1cd/0x1d7
 [<78145da4>] kmem_cache_free+0x29/0x5e
 [<7819da3b>] journal_stop+0x1a0/0x1ac
 [<78195968>] __ext3_journal_stop+0x19/0x37
 [<783470c3>] tcp_transmit_skb+0x596/0x65f
 [<783be436>] _spin_unlock+0xd/0x21
 [<78347dfd>] tcp_write_xmit+0x1be/0x2d3
 [<78347f35>] __tcp_push_pending_frames+0x23/0x80
 [<7833ff90>] tcp_setsockopt+0x151/0x316
 [<7831c71f>] sock_common_setsockopt+0x1e/0x22
 [<7831a7da>] sys_setsockopt+0x58/0x70
 [<7831ad5d>] sys_socketcall+0x164/0x1a4
 [<78157f83>] sys_sendfile+0x5d/0x84
 [<78102ddb>] sysenter_past_esp+0x54/0x75
Code: 39 52 78 0f b7 44 ca 1c 89 d9 c1 e9 02 c1 fe 05 c1 e6 0c 8d b4 06 00 00 00
78 03 74 24 28 2b 74 24 08 f3 a5 89 d9 83 e1 03 74 02 <f3> a4 29 5c 24 30 0f 84
bd 00 00 00 01 5c 24 28 01 dd ff 44 24
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 <0>Rebooting in 30 seconds..


Unable to handle kernel paging request at virtual address 252d7a5a
 printing eip:
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: bonding
CPU:    1
EIP:    0060:[<7831de9d>]    Not tainted VLI
EFLAGS: 00010216   (2.6.15.4-debug)
EIP is at skb_copy_bits+0x112/0x1f5
eax: 00005a5a   ebx: 00000004   ecx: 00000001   edx: e3526ed8
esi: 252d7a5a   edi: e3526ed8   ebp: e3526ed8   esp: b2a29b78
ds: 007b   es: 007b   ss: 0068
Process httpd (pid: 3922, threadinfo=b2a28000 task=b7b73ae0)
Stack: 00000000 00005a90 00000036 00000000 97582f64 92055f60 000000aa 00000002
       7831d947 92055f60 00000036 e3526df8 00000004 92055f60 92055f60 f76e7c00
       7831d9d9 92055f60 000000aa 00000002 00000020 f76e7ebc 92055f60 f76e7c00
Call Trace:
 [<7831d947>] skb_copy_expand+0xa7/0xc5
 [<7831d9d9>] skb_pad+0x74/0xcb
 [<782abdad>] skge_xmit_frame+0x45/0x28f
 [<7832b342>] qdisc_restart+0xdf/0x1b8
 [<78321b71>] net_tx_action+0x9c/0xef
 [<78123279>] __do_softirq+0x55/0xbd
 [<78123311>] do_softirq+0x30/0x35
 [<78123374>] local_bh_enable+0x5e/0x7e
 [<78321902>] dev_queue_xmit+0x1d8/0x1df
 [<78338726>] ip_output+0x1e0/0x236
 [<78338b67>] ip_queue_xmit+0x3eb/0x461
 [<783be49f>] _spin_unlock_irqrestore+0xf/0x23
 [<78115cde>] change_page_attr+0x46/0x4d
 [<7831d157>] kfree_skbmem+0xb/0x70
 [<78115dd4>] kernel_map_pages+0x1c/0x48
 [<7814550e>] cache_free_debugcheck+0x1bf/0x1d7
 [<7831d157>] kfree_skbmem+0xb/0x70
 [<78145e59>] kfree+0x45/0x7a
 [<7831d157>] kfree_skbmem+0xb/0x70
 [<78321b30>] net_tx_action+0x5b/0xef
 [<783470c3>] tcp_transmit_skb+0x596/0x65f
 [<783be49f>] _spin_unlock_irqrestore+0xf/0x23
 [<78115cde>] change_page_attr+0x46/0x4d
 [<781440d3>] poison_obj+0x21/0x41
 [<78347dfd>] tcp_write_xmit+0x1be/0x2d3
 [<78347f35>] __tcp_push_pending_frames+0x23/0x80
 [<7833df71>] do_tcp_sendpages+0x54a/0x574
 [<7833dfe7>] tcp_sendpage+0x4c/0x5f
 [<7831998a>] sock_sendpage+0x3a/0x3e
 [<7813e210>] file_send_actor+0x32/0x49
 [<7813dbc3>] do_generic_mapping_read+0x170/0x3ed
 [<7813e26e>] generic_file_sendfile+0x47/0x58
 [<7813e1de>] file_send_actor+0x0/0x49
 [<78157e8c>] do_sendfile+0x1a3/0x23d
 [<7813e1de>] file_send_actor+0x0/0x49
 [<78157f70>] sys_sendfile+0x4a/0x84
 [<78102ddb>] sysenter_past_esp+0x54/0x75
Code: 24 30 8b 74 02 18 2b 35 90 39 52 78 0f b7 44 ca 1c 89 d9 c1 e9 02 c1 fe 05
c1 e6 0c 8d b4 06 00 00 00 78 03 74 24 28 2b 74 24 08 <f3> a5 89 d9 83 e1 03 74
02 f3 a4 29 5c 24 30 0f 84 bd 00 00 00
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 <0>Rebooting in 30 seconds..
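
One possible reading of the register dumps above (an interpretation, not something stated in this report): with CONFIG_DEBUG_SLAB the kernel fills slab objects with known poison bytes, and 0x5a is the fill byte for memory that was allocated but never initialised. The sketch below checks register values for such bytes; the helper name `poison_hits` is made up for illustration, and the fill-byte values are the kernel's slab poison constants.

```python
# Slab debugging fill bytes used by the Linux kernel (CONFIG_DEBUG_SLAB).
POISON_BYTES = {
    0x5A: "POISON_INUSE (allocated but never initialised)",
    0x6B: "POISON_FREE (object already freed)",
    0xA5: "POISON_END (end-of-object marker)",
}

def poison_hits(value, width=4):
    """Return (byte_index, name) for each byte of `value` equal to a poison byte.

    Byte index 0 is the least significant byte (little-endian order).
    """
    raw = value.to_bytes(width, "little")
    return [(i, POISON_BYTES[b]) for i, b in enumerate(raw) if b in POISON_BYTES]

if __name__ == "__main__":
    # Register values copied from the first oops in this report.
    registers = {"eax": 0x00005A5A, "esi": 0x252D7A5A, "edx": 0xA75CDED8}
    for name, value in registers.items():
        print(f"{name}={value:08x}: {poison_hits(value) or 'no poison bytes'}")
```

If that reading is right, eax (0x5a5a) would be a 16-bit field read from uninitialised slab memory, which would be consistent with skb_copy_bits walking fragment data that was never set up. The faulting address in esi also ends in 0x5a.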

I also tried the skge-fix-napi-irq-race patch, but it didn't help.

After some tests I finally discovered that disabling round-robin IRQ balancing
(echo 1 > /proc/irq/177/smp_affinity) helps: there have been no oopses for
nearly three days.
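
For reference, the value written to smp_affinity is a hexadecimal CPU bitmask in which bit n selects CPU n, so 1 pins the IRQ to CPU0 and 3 allows both CPUs of this box. A minimal sketch of the conversion follows; the function names are illustrative and not part of any kernel tool.

```python
def cpus_to_affinity_mask(cpus):
    """Return the hex string to write to /proc/irq/<N>/smp_affinity for these CPUs."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu                    # bit n selects CPU n
    return format(mask, "x")

def affinity_mask_to_cpus(mask_hex):
    """Decode a hex smp_affinity value (commas between 32-bit groups allowed)."""
    mask = int(mask_hex.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]

if __name__ == "__main__":
    print(cpus_to_affinity_mask([0]))       # CPU0 only, as in the workaround
    print(cpus_to_affinity_mask([0, 1]))    # both CPUs
    print(affinity_mask_to_cpus("3"))
```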

Everything is fine when only the other NIC (3c905B) is plugged into the network.
This Marvell-based 3c940 NIC is known to work without problems in another (UP)
server.
Comment 1 Robert Theron Brockman II 2006-03-15 15:18:47 UTC
I'm having a similar problem with 2.6.15.6 on an Athlon64 X2 3800+ running 64-bit
Gentoo.  The motherboard is an ASUS A8N-SLI nForce4-based board with two
integrated NICs, one Marvell 88E8001 and one nVidia.  The nVidia NIC works fine,
but using the Marvell NIC with the skge driver eventually causes the system to
lock up hard.  It takes a while, but usually ~10 minutes of heavy NFS traffic
(>20 MB/s) will break the system.  It's not a hardware issue, since the Marvell
NIC works fine (albeit slower and less efficiently) with the in-kernel sk98lin
driver.  The problem only manifests when using an SMP kernel.

Setting smp_affinity to 1 on the skge interrupt (82 on my system) seems to make
the problem go away.  Smells like the race condition problems haven't quite been
fixed yet.  One minor complication:  I'm using the loop-aes 3.1c patch and have
disk encryption on all of my drives.  Perhaps this is the source of the problem. 

Here is a list of things that don't seem to have any effect on the problem:

Over/Underclocking the system
2.6.16-rc5
Linux Vserver patches
Preempt vs. Non-Preempt
Monkeying around with the interrupt coalescing settings with ethtool


Side note:  The newer sk98lin driver from SysKonnect causes the system to crash
 spectacularly whenever any NFS traffic occurs unless a big chunk of SSH traffic
(>10MB) occurs first.  If the SSH transfer occurs first, the system will be rock
solid -- hours of high-speed data transfer -- from then on out.
Comment 2 Krzysztof Oledzki 2006-03-21 12:48:30 UTC

You may also try disabling rx and/or tx checksumming. With rx and tx
hardware checksumming disabled, my system is stable even with smp_affinity
set to 3. Now I only need to test which one is the real problem: rx or tx...

Best regards,

 				Krzysztof Ol
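
The remaining rx-versus-tx test can be laid out as a small matrix of ethtool settings. This is a sketch only: it builds the command lines without running them, the helper name `ethtool_offload_matrix` is hypothetical, and `eth1` is the skge interface on the reporter's machine.

```python
from itertools import product

def ethtool_offload_matrix(dev):
    """Yield (rx, tx, argv) for every on/off combination of rx/tx checksum offload."""
    for rx, tx in product(("on", "off"), repeat=2):
        yield rx, tx, ["ethtool", "-K", dev, "rx", rx, "tx", tx]

if __name__ == "__main__":
    # Print the four candidate configurations; each would be left running long
    # enough for a crash to show up before moving on to the next one.
    for rx, tx, argv in ethtool_offload_matrix("eth1"):
        print(f"rx={rx:<3} tx={tx:<3} -> {' '.join(argv)}")
```

Running each configuration for at least as long as the usual time between crashes would show whether one direction alone is enough to trigger the problem.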
Comment 3 Stephen Hemminger 2006-03-21 13:47:25 UTC
Please retest with the new 1.4 version (post 2.6.16).
You can find a diff against the 2.6.16 version at:
http://developer.osdl.org/shemminger/prototypes/skge-1.4.diff
Comment 4 Robert Theron Brockman II 2006-03-25 06:38:41 UTC
Applied skge 1.4 patch to 2.6.16-vserver (presence or absence of vserver had no
effect on crashes previously).  This time the system locked up within a few
minutes of heavy NFS traffic, so it seems the bug is still there.

The SMP affinity setting decreased the frequency of crashes, but did not
eliminate the problem entirely.

Turning off tx and rx checksumming with ethtool -K seems to have made the bug go
away for now.  This caused a performance hit of about 20% which I was able to
get rid of by messing with the interrupt coalescing settings on all the machines.
Comment 5 Stephen Hemminger 2006-04-18 15:57:22 UTC
Please send full .config of a non-working system.

I can't reproduce this with an old P3 SMP box and 2.6.16.6,
so something different is going on. It may have something to do
with bonding or VLANs.  I saw the bonding config; are you using VLANs as well?
Comment 6 Adrian Bunk 2006-08-05 04:17:43 UTC
Please reopen this bug if:
- it is still present in kernel 2.6.17 and
- you can provide the requested information.
Comment 7 Krzysztof Oledzki 2006-08-05 05:28:00 UTC
Created attachment 8711
Full .config file
Comment 8 Krzysztof Oledzki 2006-08-05 05:43:20 UTC
The bug still exists in 2.6.17.

Anyway, it takes some time before the system crashes, sometimes even a day or
two, and this server is quite busy (pop3/imap/smtp/amavis/apache/mysql/etc).

For now I'm happy with the "/usr/sbin/ethtool -K eth1 tx off" workaround.
Comment 9 Krzysztof Oledzki 2006-08-05 05:46:46 UTC
Ah, I don't use VLANs on this server - only bonding (active/backup) with eth0+eth1.
Comment 10 Stephen Hemminger 2006-08-29 16:01:04 UTC
Created attachment 8900
possible IRQ race fix

This changes the order of taking the lock and reading the IRQ register; the
old order could theoretically cause problems.
Comment 11 Stephen Hemminger 2006-09-22 10:03:52 UTC
The problems should now be fixed in 2.6.17.13 and 2.6.18.
