Bug 216513

Summary: ath11k: IPQ8074: interface down times out with 'failed to flush transmit queue'
Product: Drivers Reporter: Robert Marko (robimarko)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: NEW ---    
Severity: high CC: ansuelsmth, brozikcz, joachim.banzhaf, josacar, kvalo, me, nikitashvets, quic_mkenna, tperalta82, wiagn233
Priority: P3    
Hardware: ARM   
OS: Linux   
Kernel Version: 5.15.68 Subsystem:
Regression: No Bisected commit-id:

Description Robert Marko 2022-09-22 08:53:36 UTC
Hi,

I am one of the guys trying to bring modern Qualcomm IPQ807x SoC-s and ath11k support to OpenWrt, however, we are still having issues with ath11k.

Namely, there is an issue that after radio being up for a while trying to stop it for example doing an upgrade will take a long of time until it times out with:


root@OpenWrt:/# service wpad stop
[52465.206256] device wlan0 left promiscuous mode
[52465.206376] br-lan: port 6(wlan0) entered disabled state
root@OpenWrt:/# [52465.379202] device wlan1 left promiscuous mode
[52465.379319] br-lan: port 7(wlan1) entered disabled state
[52470.506522] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1

This is an issue as it breaks every upgrade mechanism due to sometimes 20 or so second timeout it takes.
This is just one example of the issue, sometimes it times out twice, etc.
It seems to be happening only after the radio has been up for some time and devices have roamed from and to it.

Note that despite kernel version being 5.15.68, we have backported all of the ath11k changes up to Revert "ath11k: add support for hardware rfkill for
 QCA6390" from your ath-next.

Regards,
Robert
Comment 1 Robert Marko 2022-09-22 10:29:36 UTC
I forgot to state that this is on IPQ8074, and FW WLAN.HK.2.5.0.1-01208-QCAHKSWPL_SILICONZ-1 as anything newer is crashing the remoteproc for us
Comment 2 Joachim Banzhaf 2022-09-22 10:55:42 UTC
This is a common issue, also on older kernels. My version currently is 5.15.58 and I see this all the time on my devices.

[113346.372617] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 329
[113346.586626] ath10k_pci 0000:01:00.0: could not get mac80211 beacon, vdev_id: 0 addr: 28:d1:27:4e:56:75
[113346.634517] ath10k_pci 0000:01:00.0: mac flush null vif, drop 0 queues 0xffff
[113346.754534] br-lan: port 6(wlan0) entered disabled state
[113351.652600] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 21
[113351.703231] br-lan: port 7(wlan1) entered disabled state
[113351.743035] device wlan0 left promiscuous mode
[113351.743087] br-lan: port 6(wlan0) entered disabled state
[113351.843405] device wlan1 left promiscuous mode
[113351.843459] br-lan: port 7(wlan1) entered disabled state
Comment 3 Tiago Peralta 2022-09-30 18:12:50 UTC
Running 5.15.69, exactly same issue

root@Vision:~# uname -a
Linux Vision 5.15.69 #0 SMP Wed Sep 28 09:58:38 2022 aarch64 GNU/Linux
root@Vision:~#

[42095.697467] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 7
[42096.283197] br-lan: port 5(wlan1) entered blocking state
[42096.283226] br-lan: port 5(wlan1) entered disabled state
[42096.283513] device wlan1 entered promiscuous mode
[42096.283627] br-lan: port 5(wlan1) entered blocking state
[42096.283638] br-lan: port 5(wlan1) entered forwarding state
[42096.807893] br-lan: port 5(wlan1) entered disabled state
[42097.265277] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[42097.265462] br-lan: port 5(wlan1) entered blocking state
[42097.265475] br-lan: port 5(wlan1) entered forwarding state
[43732.668008] device wlan1 left promiscuous mode
[43732.668162] br-lan: port 5(wlan1) entered disabled state
[43737.779516] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 7
[43738.367215] br-lan: port 5(wlan1) entered blocking state
[43738.367241] br-lan: port 5(wlan1) entered disabled state
[43738.367488] device wlan1 entered promiscuous mode
[43738.367580] br-lan: port 5(wlan1) entered blocking state
[43738.367590] br-lan: port 5(wlan1) entered forwarding state
[43738.879785] br-lan: port 5(wlan1) entered disabled state
[43739.336850] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[43739.337032] br-lan: port 5(wlan1) entered blocking state
[43739.337048] br-lan: port 5(wlan1) entered forwarding state
[44288.666161] device wlan1 left promiscuous mode
[44288.666320] br-lan: port 5(wlan1) entered disabled state
[44293.780294] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 7

Can easily replicate this on a network with some traffic.

Also there seems to be some kind of memory leak, I need to reboot the router every day, sometimes twice to prevent oom
Comment 4 Kalle Valo 2023-03-08 08:51:43 UTC
Is this still happening?
Comment 5 Robert Marko 2023-03-08 09:55:33 UTC
Yes, this is still present
Comment 6 Christian Marangi (Ansuel) 2023-03-21 13:37:55 UTC
Kalle if you want some additional info (since I tried for days to bisect this)

To me it seems some packet are getting lost on the track... I got a good way to repro this but doesn't work 100% of the times.

From what I notice the real problem is not in the failed to flush... but something more dramatic and problematic.

In theory this latent bug on the long run may cause OOM and memory leak and at the very end fw crashing, but this would take month.

To me it really seems a fw problem so we can't really do anything in the driver side.

The bug is that packet set in the tx data ring in some very specific case are not transmitted to the tx completion ring.

This cause packet in the idr list to never be freed.

When the tx flush function is called, num_tx_pending never goes to 0 (as the packets are lost way before and the entire logic of 'free the packet that still needs to be handled' fail) and the function tiemouts.

As you can see the number are not so big, 7... 15...worst result we had was 300 but also that is not a big number (considering the amount of packet are handled)
(but still this is problematic for many reason, one of all... In the worst condition the idr is filled and nothing is transmitted)


I did multiple test...
1. Reduce ring to 1 (my theory was a locking problem and some idr were freed with a different index)
2. Disable the peer migration patch (the bug was already there before)
3. Use the debug reg to force the use of specific ring (theory fw has some problem internally in using the wrong ring out of some packet corruption?)

Nothing manage to solve this.


To repro is quite simple:
1. Create an AP
2. Connect a client to it
3. Disconnect the client
4. repeat untile the packet gets stuck


I really think this can be tracked and solved only on your side since debug mask and checking dumping the ring are needed to understand wth happen when the packet gets stuck and doesn't go to the completion ring.
Comment 7 Maharaja Kennadyrajan 2023-05-10 06:53:42 UTC
Hi @Robert,

Are you seeing this issue with latest TOT with master branch with latest firmware revision 2.7.0.1?

If the issue is seen with latest TOT, can you enable the debug_mask with 0x201A (/sys/module/ath11k/parameters/debug_mask) during the issue time and share the logs.
Comment 8 Selu 2024-08-25 07:26:46 UTC
Hi,

with a Dynalink DL-WRX36 router running kernel 5.15.162 on OpenWrt 23.05.4 using firmware version 

fw_version 0x290604a5 fw_build_timestamp 2023-10-12 02:06 fw_build_id WLAN.HK.2.9.0.1-01977-QCAHKSWPL_SILICONZ-1

Still getting the errors:

[Fri Aug 16 18:02:27 2024] 0x00000db00000-0x00000e400000 : "0:wififw"
[Fri Aug 16 18:02:27 2024] 0x000007100000-0x000007a00000 : "0:wififw_1"
[Fri Aug 16 18:02:34 2024] ath11k c000000.wifi: ipq8074 hw2.0
[Fri Aug 16 18:02:34 2024] ath11k c000000.wifi: FW memory mode: 0
[Fri Aug 16 18:02:35 2024] ath11k c000000.wifi: qmi ignore invalid mem req type 3
[Fri Aug 16 18:02:35 2024] ath11k c000000.wifi: chip_id 0x0 chip_family 0x0 board_id 0xff soc_id 0xffffffff
[Fri Aug 16 18:02:35 2024] ath11k c000000.wifi: fw_version 0x290604a5 fw_build_timestamp 2023-10-12 02:06 fw_build_id WLAN.HK.2.9.0.1-01977-QCAHKSWPL_SILICONZ-1
[Fri Aug 16 18:02:35 2024] ath11k c000000.wifi: htt event 48 not handled
[Sat Aug 17 12:34:03 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1
[Mon Aug 19 06:10:27 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1
[Mon Aug 19 07:35:50 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 5
[Mon Aug 19 07:45:52 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1
[Mon Aug 19 09:54:17 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1
[Mon Aug 19 13:58:51 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 3
[Mon Aug 19 21:53:16 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1
[Mon Aug 19 23:37:32 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 3
[Tue Aug 20 00:23:17 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 4
[Tue Aug 20 16:57:37 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1
[Fri Aug 23 15:09:37 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1
[Sat Aug 24 11:08:01 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1
[Sat Aug 24 13:53:32 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 2
[Sat Aug 24 15:29:27 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1
[Sat Aug 24 22:54:22 2024] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 3


And more important losing the mDNS, DHCP, DNS packets for some hours.

I tried setting the mask:

root@openwrt:~# cat /sys/module/ath11k/parameters/debug_mask
8218

But nothing else is displayed.

Regards,

JL