Bug 216513 - ath11k: IPQ8074: interface down times out with 'failed to flush transmit queue'
Summary: ath11k: IPQ8074: interface down times out with 'failed to flush transmit queue'
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: ARM Linux
: P3 high
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-09-22 08:53 UTC by Robert Marko
Modified: 2023-05-10 06:53 UTC (History)
9 users (show)

See Also:
Kernel Version: 5.15.68
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Robert Marko 2022-09-22 08:53:36 UTC
Hi,

I am one of the guys trying to bring modern Qualcomm IPQ807x SoC-s and ath11k support to OpenWrt, however, we are still having issues with ath11k.

Namely, there is an issue that after radio being up for a while trying to stop it for example doing an upgrade will take a long of time until it times out with:


root@OpenWrt:/# service wpad stop
[52465.206256] device wlan0 left promiscuous mode
[52465.206376] br-lan: port 6(wlan0) entered disabled state
root@OpenWrt:/# [52465.379202] device wlan1 left promiscuous mode
[52465.379319] br-lan: port 7(wlan1) entered disabled state
[52470.506522] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 1

This is an issue as it breaks every upgrade mechanism due to sometimes 20 or so second timeout it takes.
This is just one example of the issue, sometimes it times out twice, etc.
It seems to be happening only after the radio has been up for some time and devices have roamed from and to it.

Note that despite kernel version being 5.15.68, we have backported all of the ath11k changes up to Revert "ath11k: add support for hardware rfkill for
 QCA6390" from your ath-next.

Regards,
Robert
Comment 1 Robert Marko 2022-09-22 10:29:36 UTC
I forgot to state that this is on IPQ8074, and FW WLAN.HK.2.5.0.1-01208-QCAHKSWPL_SILICONZ-1 as anything newer is crashing the remoteproc for us
Comment 2 Joachim Banzhaf 2022-09-22 10:55:42 UTC
This is a common issue, also on older kernels. My version currently is 5.15.58 and I see this all the time on my devices.

[113346.372617] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 329
[113346.586626] ath10k_pci 0000:01:00.0: could not get mac80211 beacon, vdev_id: 0 addr: 28:d1:27:4e:56:75
[113346.634517] ath10k_pci 0000:01:00.0: mac flush null vif, drop 0 queues 0xffff
[113346.754534] br-lan: port 6(wlan0) entered disabled state
[113351.652600] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 21
[113351.703231] br-lan: port 7(wlan1) entered disabled state
[113351.743035] device wlan0 left promiscuous mode
[113351.743087] br-lan: port 6(wlan0) entered disabled state
[113351.843405] device wlan1 left promiscuous mode
[113351.843459] br-lan: port 7(wlan1) entered disabled state
Comment 3 Tiago Peralta 2022-09-30 18:12:50 UTC
Running 5.15.69, exactly same issue

root@Vision:~# uname -a
Linux Vision 5.15.69 #0 SMP Wed Sep 28 09:58:38 2022 aarch64 GNU/Linux
root@Vision:~#

[42095.697467] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 7
[42096.283197] br-lan: port 5(wlan1) entered blocking state
[42096.283226] br-lan: port 5(wlan1) entered disabled state
[42096.283513] device wlan1 entered promiscuous mode
[42096.283627] br-lan: port 5(wlan1) entered blocking state
[42096.283638] br-lan: port 5(wlan1) entered forwarding state
[42096.807893] br-lan: port 5(wlan1) entered disabled state
[42097.265277] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[42097.265462] br-lan: port 5(wlan1) entered blocking state
[42097.265475] br-lan: port 5(wlan1) entered forwarding state
[43732.668008] device wlan1 left promiscuous mode
[43732.668162] br-lan: port 5(wlan1) entered disabled state
[43737.779516] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 7
[43738.367215] br-lan: port 5(wlan1) entered blocking state
[43738.367241] br-lan: port 5(wlan1) entered disabled state
[43738.367488] device wlan1 entered promiscuous mode
[43738.367580] br-lan: port 5(wlan1) entered blocking state
[43738.367590] br-lan: port 5(wlan1) entered forwarding state
[43738.879785] br-lan: port 5(wlan1) entered disabled state
[43739.336850] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[43739.337032] br-lan: port 5(wlan1) entered blocking state
[43739.337048] br-lan: port 5(wlan1) entered forwarding state
[44288.666161] device wlan1 left promiscuous mode
[44288.666320] br-lan: port 5(wlan1) entered disabled state
[44293.780294] ath11k c000000.wifi: failed to flush transmit queue, data pkts pending 7

Can easily replicate this on a network with some traffic.

Also there seems to be some kind of memory leak, I need to reboot the router every day, sometimes twice to prevent oom
Comment 4 Kalle Valo 2023-03-08 08:51:43 UTC
Is this still happening?
Comment 5 Robert Marko 2023-03-08 09:55:33 UTC
Yes, this is still present
Comment 6 Christian Marangi (Ansuel) 2023-03-21 13:37:55 UTC
Kalle if you want some additional info (since I tried for days to bisect this)

To me it seems some packet are getting lost on the track... I got a good way to repro this but doesn't work 100% of the times.

From what I notice the real problem is not in the failed to flush... but something more dramatic and problematic.

In theory this latent bug on the long run may cause OOM and memory leak and at the very end fw crashing, but this would take month.

To me it really seems a fw problem so we can't really do anything in the driver side.

The bug is that packet set in the tx data ring in some very specific case are not transmitted to the tx completion ring.

This cause packet in the idr list to never be freed.

When the tx flush function is called, num_tx_pending never goes to 0 (as the packets are lost way before and the entire logic of 'free the packet that still needs to be handled' fail) and the function tiemouts.

As you can see the number are not so big, 7... 15...worst result we had was 300 but also that is not a big number (considering the amount of packet are handled)
(but still this is problematic for many reason, one of all... In the worst condition the idr is filled and nothing is transmitted)


I did multiple test...
1. Reduce ring to 1 (my theory was a locking problem and some idr were freed with a different index)
2. Disable the peer migration patch (the bug was already there before)
3. Use the debug reg to force the use of specific ring (theory fw has some problem internally in using the wrong ring out of some packet corruption?)

Nothing manage to solve this.


To repro is quite simple:
1. Create an AP
2. Connect a client to it
3. Disconnect the client
4. repeat untile the packet gets stuck


I really think this can be tracked and solved only on your side since debug mask and checking dumping the ring are needed to understand wth happen when the packet gets stuck and doesn't go to the completion ring.
Comment 7 Maharaja Kennadyrajan 2023-05-10 06:53:42 UTC
Hi @Robert,

Are you seeing this issue with latest TOT with master branch with latest firmware revision 2.7.0.1?

If the issue is seen with latest TOT, can you enable the debug_mask with 0x201A (/sys/module/ath11k/parameters/debug_mask) during the issue time and share the logs.

Note You need to log in before you can comment on or make changes to this bug.