Bug 1902 - Kernel panic in drivers/net/fealnx.c interrupt handler.
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: i386 Linux
Importance: P2 high
Assignee: Jeff Garzik
 
Reported: 2004-01-18 03:10 UTC by Andreas Henriksson
Modified: 2004-05-16 05:36 UTC

Kernel Version: 2.6.1

Description Andreas Henriksson 2004-01-18 03:10:29 UTC
Distribution: Debian Sarge (Testing)
Hardware Environment: Pentium 166MHz, 40 MB RAM, Realtek 8139, Surecom (fealnx)
Software Environment: Gateway with NAT and some filtering plus a couple of HTB 
rules to stop bittorrent from eating up all my upstream bandwidth (which is 
only 1/10 of my downstream).

Problem Description:
My gateway machine, which runs NAT for my workstation, has had several kernel 
panics, roughly one per week. (Although I booted it just before starting to write 
this and got one before I finished, so the timing looks completely random.)

With help from people at #kernelnewbies (special thanks go to coderock), I have 
debugged it and found that the problem is at line 1520 in 
drivers/net/fealnx.c
http://lxr.linux.no/source/drivers/net/fealnx.c?v=2.6.0#L1520
np->cur_tx->skbuff is somehow NULL there and gets dereferenced in the inlined 
function dev_kfree_skb_irq().
(for the whole story see: http://www.fjortis.info/pub/panic/my-debugging.txt )

I've previously posted information about this to linux-net and linux-kernel.
http://marc.theaimsgroup.com/?l=linux-net&m=107360949210508&w=2
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.1/0281.html


Here's one kernel panic:

EIP: 0060:[<c3834473>] Not tainted
EFLAGS: 00010206
EIP is at intr_handler+0x173/0x390 [fealnx]
eax: 00207aee ebx: 00000000 ecx: c294c040 edx: 00000000
esi: c125ea00 edi: 00000018 ebp: 00000002 esp: c033bf44
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo:c033a000, task: c02ce4c0)

Stack: 00006144 00006134 00006100 00006138 00000001 00000014 c2911620 04000001
       00000000 c033bfb0 c010a091 0000000a c125e800 c033bfb0 0000000a 00000140
       c0339b40 c2911620 c010a320 0000000a c033bfb0 c2911620 c033a000 000a0600

Call Trace:
 [<c010a091>] handle_IRQ_event+0x31/0x60
 [<c010a320>] do_IRQ+0x70/0xe0
 [<c0105000>] _stext+0x0/0x20
 [<c0108c68>] common_interrupt+0x18/0x20
 [<c0106b30>] default_idle+0x0/0x30
 [<c0105000>] _stext+0x0/0x20
 [<c0106b54>] default_idle+0x24/0x30
 [<c0106bc5>] cpu_idle+0x25/0x40
 [<c033c66d>] start_kernel+0x15d/0x190

Code: ff 8a 98 00 00 00 0f 94 c0 84 c0 0f 85 84 01 00 00 8b 86 a8
 <0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing


I'm keeping information about this at http://fjortis.info/pub/panic/ so have a 
look there for more panics and other information.


Steps to reproduce:
Use the fealnx driver and send some traffic over it. (Bittorrent seems to be 
the perfect killer. At least it happens here a lot when I have a couple of 
torrents running.)
Comment 1 Francois Romieu 2004-02-27 14:53:56 UTC
AFAICS, the np->{free_tx_count/really_tx_count} update in start_xmit is not atomic 
wrt the irq handler, and the irq handler updates these variables as well. 
So we could have: 
- start_xmit reads really_tx_count (first step to update really_tx_count); 
- first tx irq takes place and updates really_tx_count; 
- start_xmit finishes updating really_tx_count, ignoring the value set in the irq 
handler; 
- second tx irq takes place and reads an erroneously high really_tx_count 
value; 
- <censored> 
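
The interleaving listed above can be replayed deterministically in user space. This is only a sketch under assumptions, not driver code: `struct fake_priv`, `fake_tx_irq()` and `racy_increment()` are hypothetical names, the handler is assumed to decrement really_tx_count when it retires a descriptor, and the increment is assumed to be compiled as a separate load and store.

```c
#include <assert.h>

/* Hypothetical stand-in for the driver's private data; only the counter
 * discussed above is modelled. */
struct fake_priv {
    int really_tx_count;   /* descriptors handed to the NIC, not yet retired */
};

/* Assumed behaviour of the tx interrupt path: retire one descriptor. */
void fake_tx_irq(struct fake_priv *np)
{
    assert(np->really_tx_count > 0);
    --np->really_tx_count;
}

/* ++np->really_tx_count written out as load / add / store. When
 * irq_in_window is non-zero, the simulated interrupt runs between the
 * load and the store. Returns the counter value after the store. */
int racy_increment(struct fake_priv *np, int irq_in_window)
{
    int tmp = np->really_tx_count;      /* load */
    if (irq_in_window)
        fake_tx_irq(np);                /* handler updates the counter */
    np->really_tx_count = tmp + 1;      /* store discards the handler's update */
    return np->really_tx_count;
}
```

Starting from really_tx_count = 1, `racy_increment(&np, 0)` leaves the counter at 2 as intended. `racy_increment(&np, 1)` also leaves it at 2, although one descriptor was retired in between and the true value is 1. That is the erroneously high value the next tx irq would then act on.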
 
Can you try to disable interrupts around the two following instructions in 
start_xmit: 
        --np->free_tx_count; 
        ++np->really_tx_count; 
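
The effect of masking interrupts around these two updates can be sketched with a small user-space model. Everything here is an assumption made for illustration: `struct fake_priv`, the `fake_irq_*` helpers and `guarded_update()` are hypothetical, and the handler is assumed to do the inverse update. In the kernel the masking would be done with real primitives such as local_irq_save()/local_irq_restore() or spin_lock_irqsave()/spin_unlock_irqrestore(), not with a flag like this.

```c
#include <assert.h>

/* Hypothetical model of the driver state plus a one-line "interrupt
 * controller": a mask flag and a count of deferred interrupts. */
struct fake_priv {
    int free_tx_count;
    int really_tx_count;
    int irq_masked;
    int irq_pending;
};

/* Assumed handler behaviour: retire one descriptor. */
void fake_tx_handler(struct fake_priv *np)
{
    ++np->free_tx_count;
    --np->really_tx_count;
}

/* An interrupt that arrives while masked is deferred, not lost. */
void fake_raise_tx_irq(struct fake_priv *np)
{
    if (np->irq_masked)
        ++np->irq_pending;
    else
        fake_tx_handler(np);
}

void fake_irq_disable(struct fake_priv *np)
{
    np->irq_masked = 1;
}

/* Unmask and deliver whatever was deferred in the meantime. */
void fake_irq_enable(struct fake_priv *np)
{
    np->irq_masked = 0;
    while (np->irq_pending > 0) {
        --np->irq_pending;
        fake_tx_handler(np);
    }
}

/* The two updates from start_xmit, done with interrupts masked. The
 * simulated interrupt is raised between the loads and the stores. */
void guarded_update(struct fake_priv *np, int irq_in_window)
{
    fake_irq_disable(np);
    int free_cnt = np->free_tx_count;       /* load */
    int really_cnt = np->really_tx_count;   /* load */
    if (irq_in_window)
        fake_raise_tx_irq(np);              /* deferred: mask is set */
    assert(np->irq_masked);
    np->free_tx_count = free_cnt - 1;       /* store */
    np->really_tx_count = really_cnt + 1;   /* store */
    fake_irq_enable(np);                    /* deferred handler runs here */
}
```

With free_tx_count = 3 and really_tx_count = 1, a call to `guarded_update(&np, 1)` ends with the counters back at 3 and 1: the queued packet and the retired descriptor are both accounted for, because the handler only runs after the stores have completed.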
 
PS: please Cc: me on followup 
 
Comment 2 Andreas Henriksson 2004-03-07 07:17:22 UTC
I've been testing out a fealnx card in my workstation for about one week now and 
haven't had any kernel panic when using spin_lock_irq(&np->lock) and 
spin_unlock_irq(&np->lock) in the start_tx function as described by Francois 
Romieu (who also suggested changing it to spin_{,un}lock_irqsave(&np->lock, 
flags).. ).

I'm changing it to RESOLVED and will continue testing (since my p166 could live 
about 3 weeks before getting a kernel panic). Francois Romieu also suggested a 
couple of ways to stress-test the driver which I'll try "really soon"<tm>. I 
might also fire up my p166 again and test it before submitting a patch (unless 
some skilled person steps forward, deems this really trivial and sends a 
patch to Linus or akpm).

// Andreas Henriksson (andreas@fjortis.info)
Comment 3 Denis Vlasenko 2004-03-26 12:03:32 UTC
Nice explanation, but on x86 uniprocessor it can't happen.
Andreas' machine is UP AFAIK.
Comment 4 Francois Romieu 2004-03-26 12:27:18 UTC
Bzzzt. It can happen on uniprocessor as well if interrupts are not disabled.
