Bug 1902

Summary: Kernel panic in drivers/net/fealnx.c interrupt handler.
Product: Drivers
Reporter: Andreas Henriksson (andreas)
Component: Network
Assignee: Jeff Garzik (jgarzik)
Status: CLOSED CODE_FIX
Severity: high
CC: romieu
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.1
Subsystem:
Regression: ---
Bisected commit-id:

Description Andreas Henriksson 2004-01-18 03:10:29 UTC
Distribution: Debian Sarge (Testing)
Hardware Environment: Pentium 166MHz, 40 MB RAM, Realtek 8139, Surecom (fealnx)
Software Environment: Gateway with NAT and some filtering plus a couple of HTB 
rules to stop bittorrent from eating up all my upstream bandwidth (which is 
only 1/10 of my downstream).

Problem Description:
My gateway machine, which runs NAT for my workstation, has had several kernel 
panics, about one per week. (Although I booted it up just before starting to 
write this and got one before I finished, so the timing is totally "random".)

With help from people at #kernelnewbies (special thanks go to coderock), I have 
debugged it and found that the problem is at line 1520 in 
drivers/net/fealnx.c
http://lxr.linux.no/source/drivers/net/fealnx.c?v=2.6.0#L1520
np->cur_tx->skbuff is somehow NULL there, and it gets dereferenced in the 
inlined function dev_kfree_skb_irq().
(for the whole story see: http://www.fjortis.info/pub/panic/my-debugging.txt )

I've previously posted information about this to linux-net and linux-kernel.
http://marc.theaimsgroup.com/?l=linux-net&m=107360949210508&w=2
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.1/0281.html


Here's one kernel panic:

EIP: 0060:[<c3834473>] Not tainted
EFLAGS: 00010206
EIP is at intr_handler+0x173/0x390 [fealnx]
eax: 00207aee ebx: 00000000 ecx: c294c040 edx: 00000000
esi: c125ea00 edi: 00000018 ebp: 00000002 esp: c033bf44
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo:c033a000, task: c02ce4c0)

Stack: 00006144 00006134 00006100 00006138 00000001 00000014 c2911620 04000001
       00000000 c033bfb0 c010a091 0000000a c125e800 c033bfb0 0000000a 00000140
       c0339b40 c2911620 c010a320 0000000a c033bfb0 c2911620 c033a000 000a0600

Call Trace:
 [<c010a091>] handle_IRQ_event+0x31/0x60
 [<c010a320>] do_IRQ+0x70/0xe0
 [<c0105000>] _stext+0x0/0x20
 [<c0108c68>] common_interrupt+0x18/0x20
 [<c0106b30>] default_idle+0x0/0x30
 [<c0105000>] _stext+0x0/0x20
 [<c0106b54>] default_idle+0x24/0x30
 [<c0106bc5>] cpu_idle+0x25/0x40
 [<c033c66d>] start_kernel+0x15d/0x190

Code: ff 8a 98 00 00 00 0f 94 c0 84 c0 0f 85 84 01 00 00 8b 86 a8
 <0>Kernel panic: Fatal exception in interrupt
In interupt handler - not syncing


I'm keeping information about this at http://fjortis.info/pub/panic/ so have a 
look there for more panics and other information.


Steps to reproduce:
Use the fealnx driver and send some traffic over it. (bittorrent seems to be 
the perfect killer. At least it happens here a lot when I have a couple of 
torrents running.)

Comment 1 Francois Romieu 2004-02-27 14:53:56 UTC
AFAICS, the np->{free_tx_count/really_tx_count} update in start_xmit is not 
atomic wrt the irq handler, and the irq handler updates these variables as well. 
So we could have: 
- start_xmit reads really_tx_count (first step to update really_tx_count); 
- first tx irq takes place and updates really_tx_count; 
- start_xmit finishes updating really_tx_count, ignoring the value set in the 
irq handler; 
- second tx irq takes place and reads an erroneously high really_tx_count 
value; 
- <censored> 
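
A minimal user-space sketch of the interleaving listed above, replayed step by step so that it is deterministic. The struct and function names (race_priv, model_tx_irq, replay_lost_update, correct_count) are hypothetical and not taken from fealnx.c, and the model assumes that the tx interrupt decrements really_tx_count by one for each completed packet.

```c
/* Hypothetical stand-in for the driver's private state; only the counter
 * discussed above is modelled. */
struct race_priv {
    int really_tx_count;   /* packets handed to the NIC and not yet reaped */
};

/* Modelled tx interrupt (assumption): one packet completed, one reaped. */
static void model_tx_irq(struct race_priv *np)
{
    --np->really_tx_count;
}

/* Replays the unprotected read-modify-write with an interrupt landing
 * between the load and the store. Returns the final counter value. */
static int replay_lost_update(int in_flight)
{
    struct race_priv np = { .really_tx_count = in_flight };

    int tmp = np.really_tx_count;   /* transmit path: load                 */
    model_tx_irq(&np);              /* first tx irq fires here             */
    np.really_tx_count = tmp + 1;   /* transmit path: store, irq work lost */

    return np.really_tx_count;
}

/* What the counter should hold: one packet reaped, one newly queued. */
static int correct_count(int in_flight)
{
    return in_flight - 1 + 1;
}
```

With two packets in flight, replay_lost_update(2) returns 3 while correct_count(2) returns 2: the counter ends up one too high, which is the state the second tx irq would then act on.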
 
Can you try to disable interrupts around the two following instructions in 
start_xmit: 
        --np->free_tx_count; 
        ++np->really_tx_count; 
 
PS: please Cc: me on followup 
 
Comment 2 Andreas Henriksson 2004-03-07 07:17:22 UTC
I've been testing a fealnx card in my workstation for about a week now and 
haven't had a single kernel panic when using spin_lock_irq(&np->lock) and 
spin_unlock_irq(&np->lock) in the start_tx function, as described by Francois 
Romieu (who also suggested changing it to spin_lock_irqsave(&np->lock, flags) 
and spin_unlock_irqrestore(&np->lock, flags)).
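
Why keeping interrupts off around the counter update cures the lost update can be sketched with a small user-space model. All names here are hypothetical. model_lock_irq and model_unlock_irq only imitate, on a single CPU, the effect of spin_lock_irq(&np->lock) and spin_unlock_irq(&np->lock) by deferring the modelled interrupt until the critical section ends. This is not the driver patch itself, and the initial free_tx_count of 4 is an arbitrary value chosen for the illustration.

```c
/* Hypothetical model of the driver state plus a one-CPU interrupt mask. */
struct locked_priv {
    int free_tx_count;
    int really_tx_count;
    int irq_masked;    /* 1 while the modelled lock is held        */
    int irq_pending;   /* an interrupt arrived while it was masked */
};

/* Modelled tx interrupt body (assumption): reap one completed packet. */
static void model_tx_irq_body(struct locked_priv *np)
{
    ++np->free_tx_count;
    --np->really_tx_count;
}

/* Hardware signals a completion: run now, or defer if masked. */
static void model_raise_tx_irq(struct locked_priv *np)
{
    if (np->irq_masked)
        np->irq_pending = 1;
    else
        model_tx_irq_body(np);
}

static void model_lock_irq(struct locked_priv *np)
{
    np->irq_masked = 1;
}

static void model_unlock_irq(struct locked_priv *np)
{
    np->irq_masked = 0;
    if (np->irq_pending) {          /* deliver what was held back */
        np->irq_pending = 0;
        model_tx_irq_body(np);
    }
}

/* Same interleaving as the race, but with the update protected. */
static int replay_with_irqs_masked(int in_flight)
{
    struct locked_priv np = { .free_tx_count = 4,
                              .really_tx_count = in_flight };

    model_lock_irq(&np);
    int tmp = np.really_tx_count;   /* load                             */
    model_raise_tx_irq(&np);        /* completion arrives, is deferred  */
    np.really_tx_count = tmp + 1;   /* store                            */
    --np.free_tx_count;
    model_unlock_irq(&np);          /* deferred irq runs on fresh state */

    return np.really_tx_count;
}
```

With two packets in flight, replay_with_irqs_masked(2) returns 2: the interrupt's decrement is applied after the increment has been stored, so neither update is lost.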

I'm changing it to RESOLVED and will continue testing (since my p166 could live 
about 3 weeks before getting a kernel panic). Francois Romieu also suggested a 
couple of ways to stress-test the driver, which I'll try "really soon"<tm>. I 
might also fire up my p166 again and test it before submitting a patch (unless 
some skilled person steps forward, deems this really trivial, and sends a 
patch to Linus or akpm).

// Andreas Henriksson (andreas@fjortis.info)

Comment 3 Denis Vlasenko 2004-03-26 12:03:32 UTC
Nice explanation, but on x86 uniprocessor it can't happen.
Andreas' machine is UP AFAIK.

Comment 4 Francois Romieu 2004-03-26 12:27:18 UTC
Bzzzt. It can happen on uniprocessor as well if interrupts are not disabled.