Bug 6942 - e1000 segfault
Summary: e1000 segfault
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: i386 Linux
Importance: P2 high
Assignee: Jesse Brandeburg
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-08-02 01:55 UTC by Alexey Maximov
Modified: 2006-12-02 01:50 UTC

See Also:
Kernel Version: 2.6.17-gentoo-r3
Subsystem:
Regression: ---
Bisected commit-id:


Description Alexey Maximov 2006-08-02 01:55:40 UTC
Aug  2 10:52:39 master CPU:    2
Aug  2 10:52:39 master EIP:    0060:[<c034ff25>]    Not tainted VLI
Aug  2 10:52:39 master EFLAGS: 00010296   (2.6.17-gentoo-r3 #2)
Aug  2 10:52:39 master EIP is at skb_over_panic+0x37/0x45
Aug  2 10:52:39 master eax: 00000073   ebx: f7905000   ecx: 00000000   edx: 00000292
Aug  2 10:52:39 master esi: f7905000   edi: 00000040   ebp: 00000000   esp: c04c4f28
Aug  2 10:52:39 master ds: 007b   es: 007b   ss: 0068
Aug  2 10:52:39 master Process apache2 (pid: 1369, threadinfo=c04c4000 task=d02d7a90)
Aug  2 10:52:39 master Stack: c03fc6ee c02e32b8 00000620 000005ea db267c00 db267c6a db26828a db267d00
Aug  2 10:52:39 master f7905000 000005ea 000000cc c02e32c3 c04c4fb8 f7919640 f7905400 00000023
Aug  2 10:52:39 master f7905000 f7548cb0 f7548cc0 f8c0fcb0 000005ea 000000cc c4809480 23000002
Aug  2 10:52:39 master Call Trace:
Aug  2 10:52:39 master <c02e32b8> e1000_clean_rx_irq+0x31e/0x52e  <c02e32c3> e1000_clean_rx_irq+0x329/0x52e
Aug  2 10:52:39 master <c02e1667> e1000_clean+0xc9/0x175  <c0355473> net_rx_action+0x99/0x148
Aug  2 10:52:39 master <c011be1f> __do_softirq+0x58/0xc2  <c0104bc5> do_softirq+0x46/0x51
Aug  2 10:52:39 master =======================



The kernel crashes under high load.

TSO has been disabled completely!

The bug is random and shows up after a long uptime under high load.


master root # lspci
00:00.0 Host bridge: Intel Corporation E7501 Memory Controller Hub (rev 01)
00:00.1 Class ff00: Intel Corporation E7500/E7501 Host RASUM Controller (rev 01)
00:02.0 PCI bridge: Intel Corporation E7500/E7501 Hub Interface B PCI-to-PCI Bridge (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801CA/CAM USB (Hub #1) (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801CA/CAM USB (Hub #2) (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801CA/CAM USB (Hub #3) (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 42)
00:1f.0 ISA bridge: Intel Corporation 82801CA LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801CA Ultra ATA Storage Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801CA/CAM SMBus Controller (rev 02)
01:1c.0 PIC: Intel Corporation 82870P2 P64H2 I/OxAPIC (rev 04)
01:1d.0 PCI bridge: Intel Corporation 82870P2 P64H2 Hub PCI Bridge (rev 04)
01:1e.0 PIC: Intel Corporation 82870P2 P64H2 I/OxAPIC (rev 04)
01:1f.0 PCI bridge: Intel Corporation 82870P2 P64H2 Hub PCI Bridge (rev 04)
02:03.0 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
02:03.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
03:02.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
03:02.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
04:01.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
Comment 1 Alexey Maximov 2006-08-02 02:04:31 UTC
Aug  2 10:52:39 master skb_over_panic: text:c02e32b8 len:1568 put:1514 head:db267c00 data:db267c6a tail:db26828a end:db267d00 dev:eth1
Aug  2 10:52:39 master ------------[ cut here ]------------
Aug  2 10:52:39 master kernel BUG at net/core/skbuff.c:94!
Aug  2 10:52:39 master invalid opcode: 0000 [#1]
Aug  2 10:52:39 master SMP
Aug  2 10:52:39 master
Aug  2 10:52:39 master Modules linked in:
Aug  2 10:52:39 master ipt_REJECT
Aug  2 10:52:39 master ipt_LOG
Aug  2 10:52:39 master xt_tcpudp
Aug  2 10:52:39 master xt_state
Aug  2 10:52:39 master xt_pkttype
Aug  2 10:52:39 master iptable_raw
Aug  2 10:52:39 master xt_CLASSIFY
Aug  2 10:52:39 master xt_CONNMARK
Aug  2 10:52:39 master xt_connmark
Aug  2 10:52:39 master ipt_owner
Aug  2 10:52:39 master ipt_recent
Aug  2 10:52:39 master ipt_iprange
Aug  2 10:52:39 master xt_conntrack
Aug  2 10:52:39 master iptable_mangle
Aug  2 10:52:39 master iptable_nat
Aug  2 10:52:39 master ip_nat
Aug  2 10:52:39 master ip_conntrack_ftp
Aug  2 10:52:39 master ip_conntrack
Aug  2 10:52:39 master nfnetlink
Aug  2 10:52:39 master iptable_filter
Aug  2 10:52:39 master ip_tables
Aug  2 10:52:39 master x_tables
Aug  2 10:52:39 master netconsole
Comment 2 Jesse Brandeburg 2006-08-02 09:11:29 UTC
Please add your dmesg output from before the crash.

Is there a possibility you could have jumbo frames on your network?

I'm specifically looking for what driver version you're running. There are a
couple of known bugs in certain versions of the code.  In this case it looks
like we tried to do a put on an skb of a very long frame, which is really odd.
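
As a rough sanity check on the skb_over_panic line in Comment 1, the pointer arithmetic can be worked through directly. The sketch below is only an illustration: the helper names parse_skb_panic and summarize are made up for this example, and it assumes len and put are decimal byte counts while head, data, tail and end are hexadecimal addresses, as the values themselves suggest.

```python
import re

# The skb_over_panic line from Comment 1, rejoined onto a single line.
PANIC_LINE = (
    "skb_over_panic: text:c02e32b8 len:1568 put:1514 "
    "head:db267c00 data:db267c6a tail:db26828a end:db267d00 dev:eth1"
)


def parse_skb_panic(line):
    """Split the key:value pairs after the 'skb_over_panic:' prefix into a dict."""
    payload = line.split(":", 1)[1]
    return dict(re.findall(r"(\w+):(\w+)", payload))


def summarize(line):
    """Derive buffer geometry from the logged pointers (hypothetical helper)."""
    f = parse_skb_panic(line)
    # Assumption: the four pointers are hex, len and put are decimal.
    head, data, tail, end = (int(f[k], 16) for k in ("head", "data", "tail", "end"))
    length, put = int(f["len"]), int(f["put"])
    return {
        "buffer_size": end - head,         # bytes between head and end
        "headroom": data - head,           # bytes between head and data
        "len_before_put": length - put,    # len minus the bytes just added
        "overrun": tail - end,             # how far tail is past end
        "tail_matches_len": tail - data == length,
    }


if __name__ == "__main__":
    for key, value in summarize(PANIC_LINE).items():
        print(key, value)
```

With the values from this report, the buffer between head and end is 256 bytes, the put was 1514 bytes, and tail ends up 1418 bytes past end. Note that 1514 bytes is the size of a full standard Ethernet frame (14-byte header plus 1500-byte payload), so the put size by itself does not indicate a jumbo frame. Why a 1514-byte put was attempted on a 256-byte buffer is not established anywhere in this thread.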
Comment 3 Alexey Maximov 2006-08-02 09:26:24 UTC
Jumbo frames ... I don't know.

The external interface is 100 Mb, the internal one is 1 Gb.

I had trouble before with vanilla kernels and with the ULOG module:

random segfaults (ULOG) and hangs involving skb with TSO on.

I used a patch from the openSUSE kernel to disable TSO completely (it has improved my uptime from 2 days to 2 weeks).

======================
From: Olaf Kirch <okir@suse.de>
Subject: [e1000] Disable TSO for now
References: 157600

It seems there is a memory corruption problem related to the use of TSO
with the e1000 driver. As a matter of caution, I am turning off
TSO by default on the e1000 for the time being.

Signed-off-by: okir@suse.de

 drivers/net/e1000/e1000_main.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: build/drivers/net/e1000/e1000_main.c
===================================================================
--- build.orig/drivers/net/e1000/e1000_main.c
+++ build/drivers/net/e1000/e1000_main.c
@@ -735,7 +735,7 @@ e1000_probe(struct pci_dev *pdev,
        }
        
 
-#ifdef NETIF_F_TSO
+#ifdef NETIF_F_TSO_default_to_off_for_now
        if ((adapter->hw.mac_type >= e1000_82544) &&
           (adapter->hw.mac_type != e1000_82547))
                netdev->features |= NETIF_F_TSO;


======================


my dmesg (new)

Linux version 2.6.17-gentoo-r3 (root@master) (gcc version 4.1.1 (Gentoo 4.1.1)) #2 SMP Sun Jul 16 10:14:08 MSD 2006
BIOS-provided physical RAM map:
Comment 4 Adrian Bunk 2006-09-06 09:40:51 UTC
What is the status of this issue in 2.6.18-rc6?
Comment 5 Alexey Maximov 2006-09-07 08:33:39 UTC
I'm unable to check on the production server until the 2.6.18 release, sorry :-| The load is too high, and 5 minutes of downtime = half of my salary. :(
Comment 6 Adrian Bunk 2006-12-02 01:50:18 UTC
Please reopen this bug if it's still present in kernel 2.6.19.
