Bug 188201

Summary: Kernel crashes with ath10k radio in AP mode and Nexus 5X as client
Product: Drivers Reporter: Timo Sigurdsson (public_timo.s)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: NEW ---    
Severity: normal CC: felix+kernel, mar.kolya, raghujindia, stijn+bugs
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.4 Subsystem:
Regression: No Bisected commit-id:
Attachments: Syslog with kernel crash happening when Nexus 5X is connected to ath10k wireless access point

Description Timo Sigurdsson 2016-11-20 01:29:51 UTC
Created attachment 245181 [details]
Syslog with kernel crash happening when Nexus 5X is connected to ath10k wireless access point

I have a TP-Link Archer C7 v2 running LEDE (the OpenWrt fork) on top of  Linux 4.4. It all works well except for the fact that when I connect a Nexus 5X device to the 5GHz radio, the kernel or ath10k driver will crash after a while. 5Ghz wifi will be dead after that until I reboot the system.

This bug report is based on an email I sent earlier to the ath10k and lede-dev mailing lists [0]. I was asked to file a bug in the kernel.org bugtracker as well.

The issue I'm describing has been reported before [2] and it also has been declared as solved with newer firmwares [3] (but reopened by other users). However, even with the latest ath10k firmware from the linux-firmware repository as well as the most recent pre-release firmware 10.2.4.70.58 from Kalle Valo's Github repository [4] the issue is far from resolved. I have tried many different firmware revisions over the time (more recently 10.2.4.70.56 and 10.2.4.70.54), and I can could only find that the issue sometimes takes longer to trigger with some firmwares (which might just be random), but it would always occur at some point with API level 5 firmwares. With API level 2 firmwares (which I testesd when I was still using OpenWrt 15.05) I never saw these crashes, but the Nexus 5X had other connectivity issues with these older firmwares that made this combination no fun to use either. But this indicates that the firmware itself makes the difference here.

I actually have two Nexus 5X on my network (my wife's and my own). I can trigger the crash with either one of them. And if both Nexus 5X are connected to the 5Ghz radio, then the issue triggers much faster (can be as low as 15 minutes). My workaround is to let the Nexus 5X devices only connect to the 2.4GHz radio. This way, the device runs for weeks without any issue or crash, but of course I would prefer the actual bug being fixed rather than to circumvent it.

After I first reported this issue on the ath10k and lede-dev mailing lists about 3 weeks ago, it was suggested to me to try the unofficial ath10k driver and firmware by Candela Tech. I did that with the current BETA 18 ath10k-ct firmware and driver for the last 3 weeks and the issue did not occur with this firmware, so it seems this is a regression in the current and official API level 5 ath10k firmware blobs.

I'm appending a syslog from my access point with a crash happening on LEDE r1952 (Linux 4.4.26, compat-wireless-2016-10-08) while one Nexus 5X was connected to the 5GHz radio starting from the time the system booted. I randomized the MAC addresses. but left the first two characters unique so different clients can be distinguished.

If there is more info I could collect to help identify the cause of this issue, please let me know (and possibly how to do that as well).

Regards,

Timo

[1] https://www.mail-archive.com/lede-dev@lists.infradead.org/msg03726.html
[2] http://lists.infradead.org/pipermail/ath10k/2015-November/006413.html
[3] https://dev.openwrt.org/ticket/20854
[4] https://github.com/kvalo/ath10k-firmware/tree/master/QCA988X/hw2.0/10.2.4.70
Comment 1 Timo Sigurdsson 2016-12-03 11:37:48 UTC
I provoked the crash again on a newer build with debugging enabled (LEDE r2321, Linux 4.4.32, ath10k firmware 10.2.4.70.54). Maybe this additional information can help to identify the cause of the issue.

Here's the relevant dmesg output (with mac addresses pseudo-anonymized):

[283232.075393] ath10k_pci 0000:01:00.0: failed to delete peer 78:f8:82:00:00:00 for vdev 0: -145
[283232.084140] ath10k_pci 0000:01:00.0: found sta peer 78:f8:82:00:00:00 (ptr 86df2800 id 62) entry on vdev 0 after it was supposedly removed
[283232.096976] ------------[ cut here ]------------
[283232.101904] WARNING: CPU: 0 PID: 1841 at compat-wireless-2016-10-08/net/mac80211/sta_info.c:964 sta_set_sinfo+0x92c/0x9e0 [mac80211]()
[283232.114279] Modules linked in: pppoe ppp_async iptable_nat ath9k pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE ax88179_178a ath9k_common xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY usbnet slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt ath9k_hw em_nbyte sch_tbf sch_pie sch_gred sch_htb sch_teql cls_basic act_ipt sch_red sch_prio em_meta em_text sch_codel sch_sfq act_police sch_dsmark em_cmp sch_fq act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ath10k_pci ath10k_core ath mac80211 cfg80211 compat ledtrig_usbport ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables ifb ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common mii
[283232.221696] CPU: 0 PID: 1841 Comm: hostapd Not tainted 4.4.32 #0
[283232.227884] Stack : 803c74e4 00000000 00000001 80420000 87d65b80 8040ed83 803a8bac 00000731
[283232.227884] 	  8048379c 8740e800 00000000 7782fe94 00000000 800a7238 803ae218 80410000
[283232.227884] 	  00000003 8740e800 803ac624 86e1db2c 00000000 800a51b4 01afb952 00000000
[283232.227884] 	  00000001 801f3200 00000000 00000000 00000000 00000000 00000000 00000000
[283232.227884] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[283232.227884] 	  ...
[283232.264341] Call Trace:
[283232.266934] [<80071b80>] show_stack+0x50/0x84
[283232.271445] [<80081900>] warn_slowpath_common+0xa0/0xd0
[283232.276847] [<800819b8>] warn_slowpath_null+0x18/0x24
[283232.282164] [<875870a4>] sta_set_sinfo+0x92c/0x9e0 [mac80211]
[283232.288224] [<87587188>] __sta_info_destroy+0x30/0x48 [mac80211]
[283232.294519] [<87587238>] sta_info_destroy_addr_bss+0x38/0x60 [mac80211]
[283232.301475] [<874cc144>] cfg80211_check_station_change+0xed8/0x1390 [cfg80211]
[283232.308942] 
[283232.310543] ---[ end trace 1e173b5175dabd85 ]---
[283547.702654] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer a0:02:dc:00:00:00: -145
[283547.711846] wlan0: failed to remove key (0, a0:02:dc:00:00:00) from hardware (-145)
[283547.720952] ath10k_pci 0000:01:00.0: cipher 0 is not supported
[283547.727006] ath10k_pci 0000:01:00.0: failed to remove peer wep key 0: -122
[283547.734106] ath10k_pci 0000:01:00.0: failed to clear all peer wep keys for vdev 0: -122
[283547.742313] ath10k_pci 0000:01:00.0: failed to disassociate station: a0:02:dc:00:00:00 vdev 0: -122
[283547.751590] ------------[ cut here ]------------
[283547.756521] WARNING: CPU: 0 PID: 1841 at compat-wireless-2016-10-08/net/mac80211/sta_info.c:956 sta_set_sinfo+0x8d8/0x9e0 [mac80211]()
[283547.768872] Modules linked in: pppoe ppp_async iptable_nat ath9k pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE ax88179_178a ath9k_common xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY usbnet slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt ath9k_hw em_nbyte sch_tbf sch_pie sch_gred sch_htb sch_teql cls_basic act_ipt sch_red sch_prio em_meta em_text sch_codel sch_sfq act_police sch_dsmark em_cmp sch_fq act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ath10k_pci ath10k_core ath mac80211 cfg80211 compat ledtrig_usbport ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables ifb ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common mii
[283547.876293] CPU: 0 PID: 1841 Comm: hostapd Tainted: G        W       4.4.32 #0
[283547.883714] Stack : 803c74e4 00000000 00000001 80420000 87d65b80 8040ed83 803a8bac 00000731
[283547.883714] 	  8048379c 8740e800 00000000 7782fe94 00000000 800a7238 803ae218 80410000
[283547.883714] 	  00000003 8740e800 803ac624 86e1db2c 00000000 800a51b4 00000000 00000000
[283547.883714] 	  8040de90 801f3200 00000000 00000000 00000000 00000000 00000000 00000000
[283547.883714] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[283547.883714] 	  ...
[283547.920171] Call Trace:
[283547.922762] [<80071b80>] show_stack+0x50/0x84
[283547.927274] [<80081900>] warn_slowpath_common+0xa0/0xd0
[283547.932676] [<800819b8>] warn_slowpath_null+0x18/0x24
[283547.937995] [<87587050>] sta_set_sinfo+0x8d8/0x9e0 [mac80211]
[283547.944054] [<87587188>] __sta_info_destroy+0x30/0x48 [mac80211]
[283547.950349] [<87587238>] sta_info_destroy_addr_bss+0x38/0x60 [mac80211]
[283547.957314] [<874cc144>] cfg80211_check_station_change+0xed8/0x1390 [cfg80211]
[283547.964779] 
[283547.966381] ---[ end trace 1e173b5175dabd86 ]---
[283550.962647] ath10k_pci 0000:01:00.0: failed to delete peer a0:02:dc:00:00:00 for vdev 0: -11
[283550.971316] ath10k_pci 0000:01:00.0: found sta peer a0:02:dc:00:00:00 (ptr 86df6400 id 196) entry on vdev 0 after it was supposedly removed
[283553.982619] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer 64:bc:0c:00:00:00: -11
[283553.991720] wlan0: failed to remove key (0, 64:bc:0c:00:00:00) from hardware (-11)
[283554.017647] ath10k_pci 0000:01:00.0: cipher 0 is not supported
[283554.023706] ath10k_pci 0000:01:00.0: failed to remove peer wep key 0: -122
[283554.030778] ath10k_pci 0000:01:00.0: failed to clear all peer wep keys for vdev 0: -122
[283554.039005] ath10k_pci 0000:01:00.0: failed to disassociate station: 64:bc:0c:00:00:00 vdev 0: -122
[283557.042589] ath10k_pci 0000:01:00.0: failed to delete peer 64:bc:0c:00:00:00 for vdev 0: -11
[283557.051248] ath10k_pci 0000:01:00.0: found sta peer 64:bc:0c:00:00:00 (ptr 86e31800 id 43) entry on vdev 0 after it was supposedly removed
[283560.062553] ath10k_pci 0000:01:00.0: failed to set beacon mode for vdev 0: -11
[283563.062522] ath10k_pci 0000:01:00.0: failed to set dtim period for vdev 0: -11
[283566.077407] ath10k_pci 0000:01:00.0: failed to recalculate rts/cts prot for vdev 0: -11
[283569.082474] ath10k_pci 0000:01:00.0: failed to set protection mode 0 on vdev 0: -11
[283572.082447] ath10k_pci 0000:01:00.0: failed to set preamble for vdev 0: -11
[283826.570252] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer 30:b5:c2:00:00:00: -11
[283826.579348] wlan0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-11)
[283829.580227] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer 30:b5:c2:00:00:00: -11
[283829.589379] wlan0: failed to set key (1, ff:ff:ff:ff:ff:ff) to hardware (-11)
Comment 2 Stijn Tintel 2017-09-25 15:03:14 UTC
Have been seeing this crash on qca988x hw2.0 running in AP mode, on a D-Link DAP-2695-A1 and Unifi AP AC Pro. First time I reported it, the AP was running compat-wireless-2016-10-08 with LEDE patches and firmware 10.2.4.70.54 api 5. IIRC, the problem happened also with earlier versions. Today I saw it again on one of these APs, now running LEDE master from Hauke's branch, which contains mac80211+ath10k_pci from backports v4.14-rc1-0-g2bd6bf03f4c1 and firmware 10.2.4-1.0-00029 api 5.

Also see downstream bug report: https://bugs.lede-project.org/index.php?do=details&task_id=333
Comment 3 Stijn Tintel 2017-09-29 22:28:18 UTC
I thought it might be related to running 2 APs on a single device, but I just saw it on UAP-AC-Pro here with only 1 AP config:

[137116.099707] ath10k_pci 0000:00:00.0: failed to delete peer c0:ee:fb:4a:18:b7 for vdev 0: -145
[137116.108655] ath10k_pci 0000:00:00.0: found sta peer c0:ee:fb:4a:18:b7 (ptr 86905a00 id 95) entry on vdev 0 after it was supposedly removed
[137116.122150] ------------[ cut here ]------------
[137116.127151] WARNING: CPU: 0 PID: 1197 at /home/build/lede/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/backports-4.14-rc1/net/mac80211/sta_info.c:975 sta_set_sinfo+0x930/0x9e4 [mac80211]()
[137116.145303] Modules linked in: ath9k ath9k_common crc_ccitt ath9k_hw act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_tbf sch_htb sch_hfsc sch_ingress ath10k_pci ath10k_core ath mac80211 cfg80211 compat sha256_generic jitterentropy_rng drbg hmac cmac gpio_button_hotplug aead cryptomgr crypto_null crypto_hash
[137116.176650] CPU: 0 PID: 1197 Comm: hostapd Not tainted 4.4.88 #0
[137116.182950] Stack : 803cb224 00000000 00000001 80420000 87cb5bac 80412ca3 803ac79c 000004ad
[137116.182950]           8048378c 87e52000 00000000 77699f54 00000000 800a7720 803b1f28 80410000
[137116.182950]           00000003 87e52000 803b032c 8690fb2c 00000000 800a569c 00d0c3ca 00000000
[137116.182950]           00000001 801f8700 00000000 00000000 00000000 00000000 00000000 00000000
[137116.182950]           00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[137116.182950]           ...
[137116.220051] Call Trace:
[137116.222676] [<80071dc0>] show_stack+0x54/0x88
[137116.227272] [<80081bc0>] warn_slowpath_common+0xa0/0xd0
[137116.232775] [<80081c78>] warn_slowpath_null+0x18/0x24
[137116.238155] [<87507050>] sta_set_sinfo+0x930/0x9e4 [mac80211]
[137116.244260] [<87507134>] __sta_info_destroy+0x30/0x48 [mac80211]
[137116.250642] [<875071e4>] sta_info_destroy_addr_bss+0x38/0x60 [mac80211]
[137116.257650] [<8748de1c>] cfg80211_check_station_change+0x1fa8/0x3628 [cfg80211]
[137116.265321]
[137116.266948] ---[ end trace 351c1b7fcd54f528 ]---
[137119.269720] ath10k_pci 0000:00:00.0: failed to wait for created wmi peer c0:ee:fb:4a:18:b7 on vdev 0: -145
[137119.279830] ath10k_pci 0000:00:00.0: failed to add peer c0:ee:fb:4a:18:b7 for vdev 0 when adding a new sta: -145
[137437.629723] ath10k_pci 0000:00:00.0: failed to install key for vdev 0 peer a4:77:33:79:20:64: -11
[137437.639213] wlan0: failed to remove key (0, a4:77:33:79:20:64) from hardware (-11)
[137437.653056] ath10k_pci 0000:00:00.0: cipher 0 is not supported
[137437.659326] ath10k_pci 0000:00:00.0: failed to remove peer wep key 0: -122
[137437.666592] ath10k_pci 0000:00:00.0: failed to clear all peer wep keys for vdev 0: -122
[137437.674963] ath10k_pci 0000:00:00.0: failed to disassociate station: a4:77:33:79:20:64 vdev 0: -122
[137437.684402] ------------[ cut here ]------------
[137437.689395] WARNING: CPU: 0 PID: 1197 at /home/build/lede/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/backports-4.14-rc1/net/mac80211/sta_info.c:967 sta_set_sinfo+0x8dc/0x9e4 [mac80211]()
[137437.707527] Modules linked in: ath9k ath9k_common crc_ccitt ath9k_hw act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_tbf sch_htb sch_hfsc sch_ingress ath10k_pci ath10k_core ath mac80211 cfg80211 compat sha256_generic jitterentropy_rng drbg hmac cmac gpio_button_hotplug aead cryptomgr crypto_null crypto_hash
[137437.738862] CPU: 0 PID: 1197 Comm: hostapd Tainted: G        W       4.4.88 #0
[137437.746415] Stack : 803cb224 00000000 00000001 80420000 87cb5bac 80412ca3 803ac79c 000004ad
[137437.746415]           8048378c 87e52000 00000000 77699f54 00000000 800a7720 803b1f28 80410000
[137437.746415]           00000003 87e52000 803b032c 8690fb2c 00000000 800a569c 00000000 00000000
[137437.746415]           80411dd0 801f8700 00000000 00000000 00000000 00000000 00000000 00000000
[137437.746415]           00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[137437.746415]           ...
[137437.783518] Call Trace:
[137437.786145] [<80071dc0>] show_stack+0x54/0x88
[137437.790752] [<80081bc0>] warn_slowpath_common+0xa0/0xd0
[137437.796238] [<80081c78>] warn_slowpath_null+0x18/0x24
[137437.801617] [<87506ffc>] sta_set_sinfo+0x8dc/0x9e4 [mac80211]
[137437.807695] [<87507134>] __sta_info_destroy+0x30/0x48 [mac80211]
[137437.814070] [<875071e4>] sta_info_destroy_addr_bss+0x38/0x60 [mac80211]
[137437.821112] [<8748de1c>] cfg80211_check_station_change+0x1fa8/0x3628 [cfg80211]
[137437.828765]
[137437.830417] ---[ end trace 351c1b7fcd54f529 ]---
[137440.829706] ath10k_pci 0000:00:00.0: failed to delete peer a4:77:33:79:20:64 for vdev 0: -11
[137440.838539] ath10k_pci 0000:00:00.0: found sta peer a4:77:33:79:20:64 (ptr 86905200 id 136) entry on vdev 0 after it was supposedly removed
[137443.859697] ath10k_pci 0000:00:00.0: failed to set beacon mode for vdev 0: -11
[137446.859704] ath10k_pci 0000:00:00.0: failed to set dtim period for vdev 0: -11
[137449.869690] ath10k_pci 0000:00:00.0: failed to recalculate rts/cts prot for vdev 0: -11
[137452.869690] ath10k_pci 0000:00:00.0: failed to set cts protection for vdev 0: -11
[137455.869694] ath10k_pci 0000:00:00.0: failed to set preamble for vdev 0: -11
[172827.368634] ath10k_pci 0000:00:00.0: failed to install key for vdev 0 peer f0:9f:c2:22:30:9b: -11
[172827.377962] wlan0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-11)
[172830.378645] ath10k_pci 0000:00:00.0: failed to install key for vdev 0 peer f0:9f:c2:22:30:9b: -11
[172830.387924] wlan0: failed to set key (1, ff:ff:ff:ff:ff:ff) to hardware (-11)
[334603.013770] ath10k_pci 0000:00:00.0: failed to send pdev bss chan info request
Comment 4 WhoAmEye 2017-12-25 09:18:02 UTC
Is this issue still seen in latest kernel?
Comment 5 Stijn Tintel 2018-04-20 18:27:56 UTC
Just saw this again on my Unifi AP AC Pro running OpenWrt r6397-7cc9914aae. This contains mac80211 and ath10k from a wireless-testing snapshot taking on 2017-11-01. See https://bugs.openwrt.org/index.php?do=details&task_id=333

Trying ath10k-ct with the -ct-htt firmware now.
Comment 6 Stijn Tintel 2018-08-14 08:21:24 UTC
This *still* happens on OpenWrt built from blogic's staging tree, which contains mac80211 and ath10k from 4.18-rc7.
Comment 7 Stijn Tintel 2018-11-15 16:40:39 UTC
Anyone who still experiences issues with ath10k should have a look at ath10k-ct: http://www.candelatech.com/ath10k.php. Since the performance issues that occurred with 802.11w enabled have been fixed, I have been happily using the ath10k-ct driver with the htt firmware. I've had uptimes up to 41 days on several APs without a single issue. With my current combination of wireless clients, ath10k would reliably crash within 48h.

For those who cannot easily use the driver, it is possible to use the non-htt CT firmware with upstream ath10k, which should also greatly improve stability.