Bug 216513
Summary: | ath11k: IPQ8074: interface down times out with 'failed to flush transmit queue' | ||
---|---|---|---|
Product: | Drivers | Reporter: | Robert Marko (robimarko) |
Component: | network-wireless | Assignee: | drivers_network-wireless (drivers_network-wireless) |
Status: | NEW --- | ||
Severity: | high | CC: | ansuelsmth, brozikcz, joachim.banzhaf, josacar, kvalo, me, nikitashvets, quic_mkenna, tperalta82, wiagn233 |
Priority: | P3 | ||
Hardware: | ARM | ||
OS: | Linux | ||
Kernel Version: | 5.15.68 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Robert Marko
2022-09-22 08:53:36 UTC
I forgot to state that this is on IPQ8074, and FW WLAN.HK.2.5.0.1-01208-QCAHKSWPL_SILICONZ-1 as anything newer is crashing the remoteproc for us This is a common issue, also on older kernels. My version currently is 5.15.58 and I see this all the time on my devices. [113346.372617] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 329 [113346.586626] ath10k_pci 0000:01:00.0: could not get mac80211 beacon, vdev_id: 0 addr: 28:d1:27:4e:56:75 [113346.634517] ath10k_pci 0000:01:00.0: mac flush null vif, drop 0 queues 0xffff [113346.754534] br-lan: port 6(wlan0) entered disabled state [113351.652600] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 21 [113351.703231] br-lan: port 7(wlan1) entered disabled state [113351.743035] device wlan0 left promiscuous mode [113351.743087] br-lan: port 6(wlan0) entered disabled state [113351.843405] device wlan1 left promiscuous mode [113351.843459] br-lan: port 7(wlan1) entered disabled state Running 5.15.69, exactly same issue root@Vision:~# uname -a Linux Vision 5.15.69 #0 SMP Wed Sep 28 09:58:38 2022 aarch64 GNU/Linux root@Vision:~# [42095.697467] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 7 [42096.283197] br-lan: port 5(wlan1) entered blocking state [42096.283226] br-lan: port 5(wlan1) entered disabled state [42096.283513] device wlan1 entered promiscuous mode [42096.283627] br-lan: port 5(wlan1) entered blocking state [42096.283638] br-lan: port 5(wlan1) entered forwarding state [42096.807893] br-lan: port 5(wlan1) entered disabled state [42097.265277] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready [42097.265462] br-lan: port 5(wlan1) entered blocking state [42097.265475] br-lan: port 5(wlan1) entered forwarding state [43732.668008] device wlan1 left promiscuous mode [43732.668162] br-lan: port 5(wlan1) entered disabled state [43737.779516] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 7 [43738.367215] br-lan: port 5(wlan1) entered blocking state [43738.367241] br-lan: port 5(wlan1) entered disabled state [43738.367488] device wlan1 entered promiscuous mode [43738.367580] br-lan: port 5(wlan1) entered blocking state [43738.367590] br-lan: port 5(wlan1) entered forwarding state [43738.879785] br-lan: port 5(wlan1) entered disabled state [43739.336850] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready [43739.337032] br-lan: port 5(wlan1) entered blocking state [43739.337048] br-lan: port 5(wlan1) entered forwarding state [44288.666161] device wlan1 left promiscuous mode [44288.666320] br-lan: port 5(wlan1) entered disabled state [44293.780294] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 7 Can easily replicate this on a network with some traffic. Also there seems to be some kind of memory leak, I need to reboot the router every day, sometimes twice to prevent oom Is this still happening? Yes, this is still present Kalle if you want some additional info (since I tried for days to bisect this) To me it seems some packet are getting lost on the track... I got a good way to repro this but doesn't work 100% of the times. From what I notice the real problem is not in the failed to flush... but something more dramatic and problematic. In theory this latent bug on the long run may cause OOM and memory leak and at the very end fw crashing, but this would take month. To me it really seems a fw problem so we can't really do anything in the driver side. The bug is that packet set in the tx data ring in some very specific case are not transmitted to the tx completion ring. This cause packet in the idr list to never be freed. When the tx flush function is called, num_tx_pending never goes to 0 (as the packets are lost way before and the entire logic of 'free the packet that still needs to be handled' fail) and the function tiemouts. As you can see the number are not so big, 7... 15...worst result we had was 300 but also that is not a big number (considering the amount of packet are handled) (but still this is problematic for many reason, one of all... In the worst condition the idr is filled and nothing is transmitted) I did multiple test... 1. Reduce ring to 1 (my theory was a locking problem and some idr were freed with a different index) 2. Disable the peer migration patch (the bug was already there before) 3. Use the debug reg to force the use of specific ring (theory fw has some problem internally in using the wrong ring out of some packet corruption?) Nothing manage to solve this. To repro is quite simple: 1. Create an AP 2. Connect a client to it 3. Disconnect the client 4. repeat untile the packet gets stuck I really think this can be tracked and solved only on your side since debug mask and checking dumping the ring are needed to understand wth happen when the packet gets stuck and doesn't go to the completion ring. Hi @Robert, Are you seeing this issue with latest TOT with master branch with latest firmware revision 2.7.0.1? If the issue is seen with latest TOT, can you enable the debug_mask with 0x201A (/sys/module/ath11k/parameters/debug_mask) during the issue time and share the logs. Hi, with a Dynalink DL-WRX36 router running kernel 5.15.162 on OpenWrt 23.05.4 using firmware version fw_version 0x290604a5 fw_build_timestamp 2023-10-12 02:06 fw_build_id WLAN.HK.2.9.0.1-01977-QCAHKSWPL_SILICONZ-1 Still getting the errors: [Fri Aug 16 18:02:27 2024] 0x00000db00000-0x00000e400000 : "0:wififw" [Fri Aug 16 18:02:27 2024] 0x000007100000-0x000007a00000 : "0:wififw_1" [Fri Aug 16 18:02:34 2024] ath11k c000000.wifi: ipq8074 hw2.0 [Fri Aug 16 18:02:34 2024] ath11k c000000.wifi: FW memory mode: 0 [Fri Aug 16 18:02:35 2024] ath11k c000000.wifi: qmi ignore invalid mem req type 3 [Fri Aug 16 18:02:35 2024] ath11k c000000.wifi: chip_id 0x0 chip_family 0x0 board_id 0xff soc_id 0xffffffff [Fri Aug 16 18:02:35 2024] ath11k c000000.wifi: fw_version 0x290604a5 fw_build_timestamp 2023-10-12 02:06 fw_build_id WLAN.HK.2.9.0.1-01977-QCAHKSWPL_SILICONZ-1 [Fri Aug 16 18:02:35 2024] ath11k c000000.wifi: htt event 48 not handled [Sat Aug 17 12:34:03 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1 [Mon Aug 19 06:10:27 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1 [Mon Aug 19 07:35:50 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 5 [Mon Aug 19 07:45:52 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1 [Mon Aug 19 09:54:17 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1 [Mon Aug 19 13:58:51 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 3 [Mon Aug 19 21:53:16 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1 [Mon Aug 19 23:37:32 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 3 [Tue Aug 20 00:23:17 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 4 [Tue Aug 20 16:57:37 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1 [Fri Aug 23 15:09:37 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1 [Sat Aug 24 11:08:01 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1 [Sat Aug 24 13:53:32 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 2 [Sat Aug 24 15:29:27 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1 [Sat Aug 24 22:54:22 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 3 And more important losing the mDNS, DHCP, DNS packets for some hours. I tried setting the mask: root@openwrt:~# cat /sys/module/ath11k/parameters/debug_mask 8218 But nothing else is displayed. Regards, JL |