Bug 70551

Summary: kernel crash on ath9k under heavy load
Product: Drivers
Reporter: Max Sydorenko (maxim.stargazer)
Component: network-wireless
Assignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED CODE_FIX
Severity: high
CC: ath9k-devel, linville, maxim.stargazer, stf_xl
Priority: P1
Hardware: i386
OS: Linux
Kernel Version: 3.12.9
Subsystem:
Regression: No
Bisected commit-id:
Attachments: tarball with crashtool output for crashes and kernel config
ath9k_protect_tid_list.patch
New crash with proposed patch applied
mac80211_check_null_skb.patch
mac80211_check_null_skb_v2.patch

Description Max Sydorenko 2014-02-13 23:58:42 UTC
Created attachment 126001
tarball with crashtool output for crashes and kernel config

I am experiencing quite frequent kernel crashes on my ThinkPad T410, which I use as a wireless router/home server.
The crashes are observed only when an Atheros 9380 based miniPCIe card is used (with the ath9k driver). They are mostly associated with heavy wireless bandwidth use, especially when a torrent client with lots of connections runs over wifi.
The distro in use is Arch Linux i686.
Crashes have been observed on the 3.12.x kernels, as well as on the 3.10.29 LTS kernel.
It seems that I can observe these crashes only on kernels built by myself, not on the stock Arch kernels.
I can provide crash tool output (bt, log and ps concatenated into a single file) for several crash instances.
I will also attach the config file that was used to build the affected kernels. GCC 4.8.2 was used.
Please let me know what useful extra information I can provide.
Comment 1 Max Sydorenko 2014-02-14 03:18:34 UTC
A correction here: I have now also got a crash on the 3.10.29 LTS kernel with the stock Arch Linux config. It wasn't exactly the one from the repository, though (I needed to rebuild it to include debug info).
Comment 2 Stanislaw Gruszka 2014-02-14 16:33:21 UTC
What does the "gdb l *(ath_tx_aggr_sleep+0x62)" command show when called from the crash tool?
Comment 3 Max Sydorenko 2014-02-14 17:23:42 UTC
crash> gdb l *(ath_tx_aggr_sleep+0x62)
No symbol "ath_tx_aggr_sleep" in current context.
gdb: gdb request failed: l *(ath_tx_aggr_sleep+0x62)

Sorry, maybe I am doing something wrong?
Comment 4 Stanislaw Gruszka 2014-02-14 17:34:54 UTC
Try to load the module first, i.e. "mod -s ath9k". If that does not work, you should give the whole path to the module, i.e. "mod -s ath9k /usr/src/linux/drivers/net/wireless/ath/ath9k/ath9k.ko"
Comment 5 Max Sydorenko 2014-02-14 18:48:06 UTC
crash> gdb l *(ath_tx_aggr_sleep+0x62)
0xf854f072 is in ath_tx_aggr_sleep (include/linux/list.h:88).
83      in include/linux/list.h
Comment 6 Stanislaw Gruszka 2014-02-14 19:23:09 UTC
The crash happens on one of the two list_del's from:

                ath_txq_lock(sc, txq);

                buffered = ath_tid_has_buffered(tid);

                tid->sched = false;
                list_del(&tid->list);

                if (ac->sched) {
                        ac->sched = false;
                        list_del(&ac->list);
                }

but we don't know which one. Could you step down through the addresses like:

l *(ath_tx_aggr_sleep+0x61)
l *(ath_tx_aggr_sleep+0x60)
l *(ath_tx_aggr_sleep+0x5f)
...

until the command shows a line from ath9k/xmit.c, and provide that info here?
Comment 7 Max Sydorenko 2014-02-14 19:31:49 UTC
0xf854f06e is in ath_tx_aggr_sleep (drivers/net/wireless/ath/ath9k/xmit.c:1470).
1465    in drivers/net/wireless/ath/ath9k/xmit.c
Comment 8 Stanislaw Gruszka 2014-02-14 19:49:36 UTC
The crash is on list_del(&tid->list); it looks like we are trying to delete that entry twice.
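
To illustrate why a second list_del() on the same entry crashes, here is a minimal user-space sketch. It is not driver code: struct tid is a hypothetical, cut-down stand-in, the list helpers are simplified copies of the kernel ones, and the poison values are only illustrative. It also shows one way to make the removal happen at most once, assuming the flag test and the list_del() run under the same lock in the real driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Illustrative stand-ins for the kernel's list poison pointers. */
#define LIST_POISON1 ((struct list_head *)0x00100100)
#define LIST_POISON2 ((struct list_head *)0x00200200)

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink an entry, then poison its pointers so stale use is caught. */
static void list_del(struct list_head *entry)
{
	/* On a second deletion entry->next is poison and the kernel would
	 * fault on the store below. The assert stands in for that oops. */
	assert(entry->next != LIST_POISON1);
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

/* Hypothetical, cut-down TID: only the fields this sketch needs. */
struct tid {
	struct list_head list;
	bool sched;
};

/* Unschedule a TID at most once. Returns true if it was unlinked.
 * Assumption: in the driver, the flag test and list_del() would both
 * have to run with the txq lock held for this to be race-free. */
static bool tid_unschedule(struct tid *tid)
{
	if (!tid->sched)
		return false;
	tid->sched = false;
	list_del(&tid->list);
	return true;
}
```

Calling tid_unschedule() twice on the same TID unlinks it only the first time. Calling list_del() directly a second time would trip the assert, which is the user-space equivalent of the oops in include/linux/list.h.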
Comment 9 Stanislaw Gruszka 2014-02-14 19:52:35 UTC
Created attachment 126151
ath9k_protect_tid_list.patch

Proposed fix.

Max, please check if it makes the problem go away.
Comment 10 Max Sydorenko 2014-02-15 15:34:42 UTC
Thank you for the patch.
Everything looks promising, no crash so far.
Comment 11 Max Sydorenko 2014-02-15 22:39:11 UTC
Created attachment 126291
New crash with proposed patch applied

I've got a crash again on the patched kernel. 
In the log:
[ 6335.697678] BUG: unable to handle kernel NULL pointer dereference at 000000ac
[ 6335.697858] IP: [<fefcdc70>] ieee80211_report_used_skb+0x10/0x1e0 [mac80211]
[ 6335.697968] *pde = 00000000 
[ 6335.698004] Oops: 0000 [#1] PREEMPT SMP

crash> gdb l *(ieee80211_report_used_skb+0x10)
0xfefcdc70 is in ieee80211_report_used_skb (net/mac80211/status.c:389).
384     in net/mac80211/status.c

I have again attached the output from the crash tool (log, bt, etc.)
Comment 12 Stanislaw Gruszka 2014-02-16 07:38:14 UTC
Heh, everyone runs Linux wireless AP mode on routers with a single CPU; you are doing this on a multi-core SMP machine and hit bugs that nobody else does :-)

This new crash happens because we call ieee80211_free_txskb() with a NULL skb. I can see only one place where that can happen: the ieee80211_tx_h_unicast_ps_buf() function. You can verify that by checking where the *(invoke_tx_handlers+0x1322) address is. Let me know if it is not inside ieee80211_tx_h_unicast_ps_buf(); otherwise please continue testing with the patch, which I will attach shortly.
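
A minimal user-space sketch of the kind of guard meant here, not the attached patch itself. It assumes the NULL comes from dequeuing a buffer queue that another CPU emptied after its length was checked. struct sk_buff, struct skb_queue and all helper names below are hypothetical stand-ins, not the mac80211 API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff: just a queue link. */
struct sk_buff {
	struct sk_buff *next;
};

struct skb_queue {
	struct sk_buff *head, *tail;
	size_t len;
};

static void queue_tail(struct skb_queue *q, struct sk_buff *skb)
{
	skb->next = NULL;
	if (q->tail)
		q->tail->next = skb;
	else
		q->head = skb;
	q->tail = skb;
	q->len++;
}

/* Returns the oldest buffer, or NULL if the queue is empty. */
static struct sk_buff *dequeue(struct skb_queue *q)
{
	struct sk_buff *skb = q->head;

	if (!skb)
		return NULL;
	q->head = skb->next;
	if (!q->head)
		q->tail = NULL;
	q->len--;
	return skb;
}

/* Stand-in for a free routine that dereferences its argument,
 * so a NULL skb here corresponds to the oops in the log. */
static void free_txskb(struct sk_buff *skb)
{
	assert(skb != NULL);
	free(skb);
}

/* Drop the oldest buffered frame. Returns true if one was freed. */
static bool drop_oldest(struct skb_queue *q)
{
	struct sk_buff *old = dequeue(q);

	if (!old)	/* queue was emptied after its length was checked */
		return false;
	free_txskb(old);
	return true;
}
```

With the check in place, drop_oldest() on an empty queue simply reports that nothing was freed instead of handing NULL to the free routine.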
Comment 13 Stanislaw Gruszka 2014-02-16 07:40:42 UTC
Created attachment 126311
mac80211_check_null_skb.patch

Patch for the new crash.
Comment 14 Max Sydorenko 2014-02-16 12:31:44 UTC
Thank you for the patch, Stanislaw.
I've got a build error now:

net/mac80211/tx.c: In function ‘ieee80211_tx_h_unicast_ps_buf’:
net/mac80211/tx.c:487:4: error: implicit declaration of function ‘spin_lock_irqrestore’ [-Werror=implicit-function-declaration]
    spin_lock_irqrestore(&sta->ps_tx_buf[ac].lock, flags);
    ^
net/mac80211/tx.c:493:37: error: ‘old’ undeclared (first use in this function)
    ieee80211_free_txskb(&local->hw, old);
                                     ^
net/mac80211/tx.c:493:37: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors

Should I enable implicit function declarations to compile it now?
Comment 15 Stanislaw Gruszka 2014-02-16 17:11:59 UTC
Created attachment 126341
mac80211_check_null_skb_v2.patch

Sorry, I should have compile-tested the patch. This one should be fine.
Comment 16 Max Sydorenko 2014-02-18 20:17:12 UTC
Looks promising: 2 days of uptime without a crash on the patched kernel.
Thank you once again for your fast response, Stanislaw.
Comment 17 Stanislaw Gruszka 2014-02-19 12:07:44 UTC
I'll post the patches; I'm quite confident they fix the problems you reported. If the machine crashes again, it will probably be in another place. Thanks.
Comment 18 Stanislaw Gruszka 2014-02-26 09:51:41 UTC
FYI: for the second bug a different fix was applied:

https://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211.git/commit/?id=1d147bfa64293b2723c4fec50922168658e613ba

Anyway, this bug can be closed.
Comment 19 Max Sydorenko 2014-02-26 13:27:38 UTC
No crashes so far, BTW.
In which mainline kernel release can those fixes be expected?
Comment 20 Stanislaw Gruszka 2014-02-26 14:28:14 UTC
3.14, plus backports to stable/longterm kernels that are not yet end of life, i.e. 3.13.y, 3.10.y ...
Comment 21 Stanislaw Gruszka 2014-12-12 14:49:56 UTC
This is fixed.