Bug 70551
Summary: | kernel crash on ath9k under heavy load | ||
---|---|---|---|
Product: | Drivers | Reporter: | Max Sydorenko (maxim.stargazer) |
Component: | network-wireless | Assignee: | drivers_network-wireless (drivers_network-wireless) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | ath9k-devel, linville, maxim.stargazer, stf_xl |
Priority: | P1 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 3.12.9 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
tarball with crashtool output for crashes and kernel config
ath9k_protect_tid_list.patch
New crash with proposed patch applied
mac80211_check_null_skb.patch
mac80211_check_null_skb_v2.patch |
A correction here: I have now also got a crash on the 3.10.29 LTS kernel with the stock Arch Linux config. It wasn't exactly the one from the repository, though (I needed to rebuild it to include debug info).

What does the "gdb l *(ath_tx_aggr_sleep+0x62)" command show when called from the crash tool?

crash> gdb l *(ath_tx_aggr_sleep+0x62)
No symbol "ath_tx_aggr_sleep" in current context.
gdb: gdb request failed: l *(ath_tx_aggr_sleep+0x62)
Sorry, maybe I am doing something wrong?

Try to load the module first, i.e. "mod -s ath9k". If that does not work, you should give the whole path to the module, i.e. "mod -s ath9k /usr/src/linux/drivers/net/wireless/ath/ath9k/ath9k.ko"

crash> gdb l *(ath_tx_aggr_sleep+0x62)
0xf854f072 is in ath_tx_aggr_sleep (include/linux/list.h:88).
83 in include/linux/list.h

The crash happens on one of the two list_del's from:
    ath_txq_lock(sc, txq);
    buffered = ath_tid_has_buffered(tid);
    tid->sched = false;
    list_del(&tid->list);
    if (ac->sched) {
        ac->sched = false;
        list_del(&ac->list);
    }
but we don't know which one. Could you go down the addresses like:
    l *(ath_tx_aggr_sleep+0x61)
    l *(ath_tx_aggr_sleep+0x60)
    l *(ath_tx_aggr_sleep+0x5f)
    ...
until the command shows a line from ath9k/xmit.c, and provide that info here?

0xf854f06e is in ath_tx_aggr_sleep (drivers/net/wireless/ath/ath9k/xmit.c:1470).
1465 in drivers/net/wireless/ath/ath9k/xmit.c

The crash is on list_del(&tid->list); it looks like we are trying to delete that entry twice.

Created attachment 126151 [details]
ath9k_protect_tid_list.patch
Proposed fix.
Max, please check whether it makes the problem go away.
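
For readers following the diagnosis, here is a minimal userspace sketch of why deleting the same list entry twice faults, and how a "scheduled" flag can guard the removal. It is not the attached patch and not kernel code: `list_node`, `fake_tid`, `fake_tid_unschedule` and the poison values are hypothetical stand-ins for `struct list_head`, the driver's TID structure and the kernel's `LIST_POISON1`/`LIST_POISON2`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (illustration only). */
struct list_node {
    struct list_node *next, *prev;
};

/* Hypothetical poison values, playing the role of LIST_POISON1/LIST_POISON2. */
#define NODE_POISON_NEXT ((struct list_node *)0x100)
#define NODE_POISON_PREV ((struct list_node *)0x200)

static void node_list_init(struct list_node *head)
{
    head->next = head->prev = head;
}

static bool node_list_empty(const struct list_node *head)
{
    return head->next == head;
}

static void node_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Unlink and poison, in the spirit of list_del(). Calling this a second
 * time on the same node would write through the poison pointers, which
 * is the kind of fault a double delete produces.
 */
static void node_del(struct list_node *n)
{
    n->next->prev = n->prev;
    n->prev->next = n->next;
    n->next = NODE_POISON_NEXT;
    n->prev = NODE_POISON_PREV;
}

/* Hypothetical miniature of a TID: 'sched' records whether 'list' is queued. */
struct fake_tid {
    bool sched;
    struct list_node list;
};

/*
 * Guarded removal: only unlink when the flag says the node is queued.
 * Returns true if the node was unlinked. In concurrent code the flag
 * check and the unlink would have to happen under the same lock.
 */
static bool fake_tid_unschedule(struct fake_tid *tid)
{
    if (!tid->sched)
        return false;
    tid->sched = false;
    node_del(&tid->list);
    return true;
}

/* Usage: the second call is a harmless no-op instead of a double delete. */
void list_demo(void)
{
    struct list_node head;
    struct fake_tid tid = { .sched = false };

    node_list_init(&head);
    node_add_tail(&tid.list, &head);
    tid.sched = true;

    assert(fake_tid_unschedule(&tid));
    assert(!fake_tid_unschedule(&tid));
    assert(node_list_empty(&head));
}
```

The sketch is single-threaded, so it only shows the guard itself. On an SMP machine the guard helps only if no other CPU can change the flag between the check and the unlink.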
Thank you for the patch. Everything looks promising, no crash so far.

Created attachment 126291 [details]
New crash with proposed patch applied
I've got a crash again on the patched kernel.
In the log:
[ 6335.697678] BUG: unable to handle kernel NULL pointer dereference at 000000ac
[ 6335.697858] IP: [<fefcdc70>] ieee80211_report_used_skb+0x10/0x1e0 [mac80211]
[ 6335.697968] *pde = 00000000
[ 6335.698004] Oops: 0000 [#1] PREEMPT SMP
crash> gdb l *(ieee80211_report_used_skb+0x10)
0xfefcdc70 is in ieee80211_report_used_skb (net/mac80211/status.c:389).
384 in net/mac80211/status.c
I have attached again output from the crashtool (log, bt, etc.)
Heh, everyone runs Linux wireless AP mode on some routers with a single CPU; you are doing this on a multi-core SMP machine and hit bugs that nobody else does :-)

This new crash happens because we call ieee80211_free_txskb() with a NULL skb. I can see only one place where that can happen: the ieee80211_tx_h_unicast_ps_buf() function. You can verify that by checking where the *(invoke_tx_handlers+0x1322) address is. Let me know if it is not inside ieee80211_tx_h_unicast_ps_buf(); otherwise please continue testing with the patch, which I will attach shortly.

Created attachment 126311 [details]
mac80211_check_null_skb.patch
Patch for new crash.
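
As an illustration of the failure mode described above, here is a minimal userspace sketch of a "dequeue, then report and free" path that tolerates an empty queue. It is not the attached patch: `fake_frame`, `fake_queue`, `fake_dequeue`, `fake_report_frame` and `fake_drop_oldest` are hypothetical names. It assumes only that the dequeue step can return NULL and that the reporting step reads a field of the frame unconditionally. The small fault address in the oops (000000ac) is consistent with reading a structure member through a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffered frame (not struct sk_buff). */
struct fake_frame {
    int id;
    struct fake_frame *next;
};

/* Hypothetical singly linked FIFO of buffered frames. */
struct fake_queue {
    struct fake_frame *head;
    size_t len;
};

static int frames_reported;
static int last_reported_id;

/* Returns the oldest frame, or NULL if the queue is (by now) empty. */
static struct fake_frame *fake_dequeue(struct fake_queue *q)
{
    struct fake_frame *f = q->head;

    if (!f)
        return NULL;
    q->head = f->next;
    f->next = NULL;
    q->len--;
    return f;
}

/* Reads f->id unconditionally, so it would fault if handed NULL. */
static void fake_report_frame(struct fake_frame *f)
{
    last_reported_id = f->id;
    frames_reported++;
}

/*
 * Drop the oldest frame, tolerating an empty queue.
 * Returns 1 if a frame was dropped, 0 if there was nothing to drop.
 */
static int fake_drop_oldest(struct fake_queue *q)
{
    struct fake_frame *old = fake_dequeue(q);

    if (!old)
        return 0;
    fake_report_frame(old);
    return 1;
}

/* Usage: the second drop finds nothing and skips the report step. */
void drop_demo(void)
{
    struct fake_frame f = { 1, NULL };
    struct fake_queue q = { &f, 1 };

    assert(fake_drop_oldest(&q) == 1);
    assert(fake_drop_oldest(&q) == 0);
}
```

The check sits next to the dequeue, so the report step never sees a NULL frame, even if a length test done earlier is stale by the time the frame is taken off the queue.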
Thank you for the patch, Stanislaw. I've got a build error now:

net/mac80211/tx.c: In function ‘ieee80211_tx_h_unicast_ps_buf’:
net/mac80211/tx.c:487:4: error: implicit declaration of function ‘spin_lock_irqrestore’ [-Werror=implicit-function-declaration]
 spin_lock_irqrestore(&sta->ps_tx_buf[ac].lock, flags);
 ^
net/mac80211/tx.c:493:37: error: ‘old’ undeclared (first use in this function)
 ieee80211_free_txskb(&local->hw, old);
 ^
net/mac80211/tx.c:493:37: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors

Should I enable implicit function declarations to compile it now?

Created attachment 126341 [details]
mac80211_check_null_skb_v2.patch
Sorry, I should have compile-tested the patch. This one should be fine.
Promising performance: 2 days of uptime without a crash on the patched kernel. Thank you once again for your fast response, Stanislaw.

I'll post the patches; I'm quite confident they fix the problems you reported. If the machine crashes again, it will probably be in another place. Thanks.

FYI: for the second bug, another fix was applied:
>
> https://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211.git/commit/?id=1d147bfa64293b2723c4fec50922168658e613ba
Anyway, this bug can be closed.
No crashes so far, BTW. In which mainline kernel release can those fixes be expected?

3.14, plus backports to stable/longterm kernels that are not yet end of life, i.e. 3.13.y, 3.10.y ...

This is fixed.
Created attachment 126001 [details]
tarball with crashtool output for crashes and kernel config

I am experiencing quite frequent kernel crashes on my ThinkPad T410 used as a wireless router/home server. Crashes are observed only when the Atheros 9380 based miniPCIe card is used (with the ath9k driver). They are mostly associated with heavy wireless bandwidth use, especially when a torrent client with lots of connections runs over wifi.

Distro in use is Arch Linux i686. Crashes have been observed on the 3.12.x kernels, as well as on the 3.10.29 LTS kernel. It seems that I can observe these crashes only on kernels built by myself, not on the stock Arch kernels.

I can provide crash tool outputs (bt, log, ps concatenated into a single file) for several crash instances. I will also attach the config file which was used to build the affected kernels. GCC 4.8.2 has been used. Please let me know what useful extra information I can provide.