Bug 15998

Summary: ath5k: soft lockup while using kismet
Product: Drivers Reporter: Bruno Randolf (br1)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: RESOLVED CODE_FIX    
Severity: low CC: linville, me, mickflemm
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.34 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel oops while runniing kismet
another oops #2
another oops #3
output of objdump -S net/core/skbuff.o

Description Bruno Randolf 2010-05-19 00:59:57 UTC
Created attachment 26429 [details]
kernel oops while runniing kismet

running kismet over night (many channel switches!) gives me the following kernel messages:

BUG: soft lockup - CPU#0 stuck for 61s! [kismet_client:1039]
Modules linked in: nf_nat_tftp nf_conntrack_tftp nf_nat_irc nf_conntrack_irc nf_nat_fc
Modules linked in: nf_nat_tftp nf_conntrack_tftp nf_nat_irc nf_conntrack_irc nf_nat_fc

Pid: 1039, comm: kismet_client Not tainted 2.6.34-wl #21 /
EIP: 0060:[<c109d520>] EFLAGS: 00000286 CPU: 0
EIP is at kfree+0x80/0xe0
EAX: c7046120 EBX: c70440c0 ECX: 00000000 EDX: 00000004
ESI: 00000286 EDI: c11a4390 EBP: c7de3de8 ESP: c7de3dd8
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process kismet_client (pid: 1039, ti=c7de2000 task=c7e819d0 task.ti=c7de2000)
Stack:
 c11a4390 c700eb60 c700eb60 00000000 c7de3df8 c11a4390 c700eb60 c11aed40
<0> c7de3e04 c11a4027 c700eb60 c7de3e18 c11a411b c700eb60 c700eb60 c7ea0840
<0> c7de3e24 c11aed40 c7fac1e0 c7de3e98 c89b21b3 00000000 ffffffc4 c11a4dd9
Call Trace:
 [<c11a4390>] ? skb_release_data+0x70/0xa0
 [<c11a4390>] ? skb_release_data+0x70/0xa0
 [<c11aed40>] ? netif_rx+0x80/0xf0
 [<c11a4027>] ? __kfree_skb+0x17/0x90
 [<c11a411b>] ? kfree_skb+0x3b/0x80
 [<c11aed40>] ? netif_rx+0x80/0xf0
 [<c89b21b3>] ? ieee80211_rx+0x6d3/0x900 [mac80211]
 [<c11a4dd9>] ? __alloc_skb+0x99/0x130
 [<c8a4850d>] ? ath5k_rx_skb_alloc+0x8d/0x150 [ath5k]
 [<c8a3eead>] ? ath5k_hw_get_tsf64+0x6d/0xa0 [ath5k]
 [<c8a49c58>] ? ath5k_tasklet_rx+0x2b8/0x770 [ath5k]
 [<c1031485>] ? cascade+0x65/0x80
 [<c102bd57>] ? tasklet_action+0x67/0x80
 [<c102caf9>] ? __do_softirq+0x79/0x190
 [<c1056a1c>] ? handle_IRQ_event+0x4c/0x140
 [<c102cc45>] ? do_softirq+0x35/0x40
 [<c102cd45>] ? irq_exit+0x35/0x40
 [<c121f428>] ? do_IRQ+0x48/0x94
 [<c121da28>] ? schedule+0x198/0x350
 [<c10a1562>] ? sys_read+0x42/0x70
 [<c1002f29>] ? common_interrupt+0x29/0x30
 [<c1210000>] ? show_multicast_router+0x0/0x30

... (see attachment)...
Comment 1 John W. Linville 2010-05-19 18:21:36 UTC
Strange...something corrupting an skb?  skb_release_data hasn't changed in ages...
Comment 2 Bob Copeland 2010-05-19 18:51:47 UTC
Could be, maybe it's looping in put_page if nr_frags got hosed.
Comment 3 Bob Copeland 2010-05-19 18:53:07 UTC
Bruno can you post the objdump -S of skbuff.o?  Then we can look at the counter values that match the code.
Comment 4 Bruno Randolf 2010-05-20 00:45:53 UTC
sure, i'll post an attachment.

do you really think that the real problem is there? isn't the "CPU stuck for 61s" more to worry about? could be a locking problem in ath5k. i have seen other cases which go like this. the actual location of the oops is always slightly different:

BUG: soft lockup - CPU#0 stuck for 61s! [kismet_server:1517]
Modules linked in: ath5k ath mac80211 cfg80211 nf_nat_tftp nf_conntrack_tftp nf_nat_i]
Modules linked in: ath5k ath mac80211 cfg80211 nf_nat_tftp nf_conntrack_tftp nf_nat_i]

Pid: 1517, comm: kismet_server Not tainted 2.6.34-rc7-wl #20 /
EIP: 0060:[<c109d8fa>] EFLAGS: 00000202 CPU: 0
EIP is at __kmalloc_track_caller+0x5a/0x160
EAX: c71ee120 EBX: c13087b0 ECX: c13087b0 EDX: c71ea060
ESI: c71ee120 EDI: 00000020 EBP: c7f39bb8 ESP: c7f39b94
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process kismet_server (pid: 1517, ti=c7f38000 task=c7e72000 task.ti=c7f38000)
Stack:
 c92bc030 c1316090 c92bc030 00000020 00000b44 00000202 c728b0e0 00000a40
<0> 00000020 c7f39bd4 c11a4dae c7c05870 00000000 c7cf8000 c7cf8000 000009e0
<0> c7f39bf0 c92bc030 ffffffff 00000000 c7cf8000 c71e09c0 00000000 c7f39c18
Call Trace:
 [<c92bc030>] ? ath_rxbuf_alloc+0x30/0xa0 [ath]
 [<c92bc030>] ? ath_rxbuf_alloc+0x30/0xa0 [ath]
 [<c11a4dae>] ? __alloc_skb+0x4e/0x130
 [<c92bc030>] ? ath_rxbuf_alloc+0x30/0xa0 [ath]
 [<c95b74a8>] ? ath5k_rx_skb_alloc+0x28/0x150 [ath5k]
 [<c95b762a>] ? ath5k_rxbuf_setup+0x5a/0xb0 [ath5k]
 [<c95ab1c0>] ? ath5k_hw_setup_rx_desc+0x0/0x50 [ath5k]
 [<c95b8a8c>] ? ath5k_tasklet_rx+0xcc/0x770 [ath5k]
 [<c95ab840>] ? ath5k_hw_get_isr+0xa0/0x380 [ath5k]
 [<c1126b32>] ? __const_udelay+0x22/0x30
 [<c95b08b8>] ? ath5k_hw_set_spur_mitigation_filter+0x448/0x690 [ath5k]
 [<c95b024d>] ? ath5k_hw_set_antenna_mode+0x1bd/0x3b0 [ath5k]
 [<c102bd67>] ? tasklet_action+0x67/0x80
 [<c102cb09>] ? __do_softirq+0x79/0x190
 [<c1056a1c>] ? handle_IRQ_event+0x4c/0x140
 [<c102cc55>] ? do_softirq+0x35/0x40
 [<c102cd55>] ? irq_exit+0x35/0x40
 [<c121f2d8>] ? do_IRQ+0x48/0x94
 (...)

i'll attach the oopses of two other cases as well.
Comment 5 Bruno Randolf 2010-05-20 00:46:27 UTC
Created attachment 26448 [details]
another oops #2
Comment 6 Bruno Randolf 2010-05-20 00:46:45 UTC
Created attachment 26449 [details]
another oops #3
Comment 7 Bruno Randolf 2010-05-20 00:47:31 UTC
Created attachment 26450 [details]
output of objdump -S net/core/skbuff.o
Comment 8 Bob Copeland 2010-05-20 03:35:47 UTC
(In reply to comment #4)
> sure, i'll post an attachment.
> 
> do you really think that the real problem is there? isn't the "CPU stuck for
> 61s" more to worry about? could be a locking problem in ath5k. i have seen
> other cases which go like this. the actual location of the oops is always
> slightly different:

Well, the soft lockup watchdog doesn't necessarily indicate a locking bug; it can be a spin-locked section that never exits but still does stuff.  Or could be a livelock where interrupts are coming fast and furiously so as soon as the tasklet is exited it is re-entered and the watchdog thread never gets to run.

If all of the traces were in skb_release_data it would be a good indication that nr_frags was horked.  As the errors seem to be dispersed around, probably it's an indication that we aren't exiting rx tasklet or getting irq storm?

Your disassembly looks way different from mine, so it's not too easy to see the loop counters.  I assume kfree+0x80 is put_page on your box, so the first case is in put_page but I agree it's probably not relevant.
Comment 9 Bruno Randolf 2010-06-14 01:34:48 UTC
my patch disabling tasklets while reset helps:

https://patchwork.kernel.org/patch/105527/

i ran kismet the whole weekend without an oops.