Bug 218623

Summary: ath11k: WCN6855: possible ring buffer corruption
Product: Drivers Reporter: Johan Hovold (johan)
Component: network-wireless Assignee: drivers_network-wireless (drivers_network-wireless)
Status: NEW ---    
Severity: normal CC: dawnxkey, jens.glathe, pbrobinson, vadikas
Priority: P3    
Hardware: ARM   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:

Description Johan Hovold 2024-03-21 10:11:15 UTC
Over the past year I've received occasional reports from users of the Lenovo ThinkPad X13s that the wifi sometimes stops working. When this happens the kernel log is filled with errors like:

[ 1164.962227] ath11k_warn: 222 callbacks suppressed
[ 1164.962238] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
[ 1164.962309] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
[ 1164.962994] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1476, expected 1484
[ 1164.963405] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1488
[ 1164.963701] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1480, expected 1484
[ 1164.963852] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1468, expected 1480
[ 1164.964491] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
[ 1164.964733] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
[ 1165.198329] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1488
[ 1165.198470] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1476
[ 1166.266513] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 2699 at byte 348 (1132 bytes left, 64788 expected)
[ 1166.542803] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 4270 at byte 348 (1128 bytes left, 63772 expected)
[ 1166.768238] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 0 at byte 376 (1112 bytes left, 11730 expected)
[ 1166.900152] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 3 at byte 790 (694 bytes left, 16256 expected)
[ 1168.499073] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 1 at byte 62 (1426 bytes left, 3089 expected)
[ 1168.818086] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 63063 at byte 1466 (10 bytes left, 50467 expected)
[ 1169.032885] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 0 at byte 364 (1120 bytes left, 12483 expected)
[ 1169.308546] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 3092 at byte 348 (1128 bytes left, 64780 expected)
[ 1169.563928] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 1 at byte 348 (1124 bytes left, 44062 expected)

which after a quick look at the driver seems to suggest that we may be hitting some kind of ring buffer corruption.
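To compare reports, it can help to tally how far each received length falls short of the expected one. The following is a minimal sketch, not part of the driver: the regular expression is an assumption based only on the message text quoted above, and `summarize_htc_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: the message format is exactly as quoted in the log excerpt above.
HTC_RE = re.compile(r"HTC Rx: insufficient length, got (\d+), expected (\d+)")


def summarize_htc_errors(log_text):
    """Return (total errors, Counter mapping shortfall in bytes to occurrences)."""
    shortfalls = Counter()
    for match in HTC_RE.finditer(log_text):
        got, expected = int(match.group(1)), int(match.group(2))
        shortfalls[expected - got] += 1
    return sum(shortfalls.values()), shortfalls


# Example use on a saved kernel log (the file name is hypothetical):
#   total, shortfalls = summarize_htc_errors(open("dmesg.txt").read())
#   for delta, count in sorted(shortfalls.items()):
#       print(f"short by {delta} bytes: {count}")
```

Note that "callbacks suppressed" lines mean the kernel rate-limited some warnings, so any tally from the log is a lower bound.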

Rebinding the driver supposedly sometimes makes things work again, but not always.
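For anyone wanting to try the rebind, here is a rough sketch using the generic PCI sysfs `unbind`/`bind` files. The driver name and device address are assumptions taken from the log prefix `ath11k_pci 0006:01:00.0` and may differ on other systems; `rebind_steps` and `rebind` are hypothetical helper names. Actually writing to sysfs requires root.

```python
from pathlib import Path

# Assumptions: driver name and PCI address come from the log prefix
# "ath11k_pci 0006:01:00.0"; adjust both for other machines.
DRIVER_DIR = Path("/sys/bus/pci/drivers/ath11k_pci")
DEVICE = "0006:01:00.0"


def rebind_steps(driver_dir=DRIVER_DIR, device=DEVICE):
    """Return the (sysfs file, value) writes that unbind and then rebind the device."""
    return [(driver_dir / "unbind", device), (driver_dir / "bind", device)]


def rebind(dry_run=True):
    """Print the writes when dry_run is True; otherwise perform them (needs root)."""
    for path, value in rebind_steps():
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            path.write_text(value)
```

Calling `rebind()` with the default `dry_run=True` only prints the equivalent shell commands, which makes it safe to check the paths first.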

The issue has been confirmed with the 6.8 kernel and the latest firmware WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.37.

I've triggered this issue twice myself with 6.6 and the .23 firmware, but the reports date back to at least 6.2 and likely involved even older firmware.

An unconfirmed hypothesis is that we may be hitting this more often when enabling the GIC ITS so that the interrupt processing is spread out over all cores (unlike when using the DWC controller's internal MSI implementation). This change is now merged for 6.10.
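To check how the interrupts are actually distributed on an affected machine, the per-CPU counters in /proc/interrupts can be summed for the relevant rows. This is only a sketch: the substring used to select rows is an assumption (the interrupt names used on a given system need to be looked up in /proc/interrupts first), and `per_cpu_irq_counts` is a hypothetical helper name.

```python
def per_cpu_irq_counts(interrupts_text, name_substring):
    """Sum per-CPU interrupt counts over all rows containing name_substring."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * n_cpus
    for line in lines[1:]:
        if name_substring not in line:
            continue
        fields = line.split()[1:1 + n_cpus]  # skip the leading "NN:" label
        # Rows without one numeric count per CPU (e.g. "Err:") are ignored.
        if len(fields) == n_cpus and all(f.isdigit() for f in fields):
            for cpu, count in enumerate(fields):
                totals[cpu] += int(count)
    return totals


# Example use (the substring is an assumption; check the actual names first):
#   print(per_cpu_irq_counts(open("/proc/interrupts").read(), "ath11k"))
```

If nearly all counts land on one CPU, the interrupts are not being spread out, which would be worth noting alongside any report of this issue.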
Comment 1 Vadim Likholetov 2024-04-15 08:08:11 UTC
I'm experiencing the same problems:
[19653.033501] wlan0: disconnect from AP e8:ed:d6:44:df:81 for new auth to e8:ed:d6:44:dd:11
[19653.238739] wlan0: authenticate with e8:ed:d6:44:dd:11 (local address=00:03:7f:12:7d:a0)
[19653.238750] wlan0: send auth to e8:ed:d6:44:dd:11 (try 1/3)
[19653.243206] wlan0: authenticated
[19653.248442] wlan0: associate with e8:ed:d6:44:dd:11 (try 1/3)
[19653.261187] wlan0: RX ReassocResp from e8:ed:d6:44:dd:11 (capab=0x411 status=0 aid=8)
[19653.273616] wlan0: associated
[19701.008210] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
[19701.008511] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1488
[19701.008571] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1476, expected 1492
[19701.008901] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1480, expected 1492
[19701.009173] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1480, expected 1484
[19701.009180] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1464, expected 1480
[19701.009611] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1492
[19701.009654] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1476
[19701.509526] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1472, expected 1476
[19701.509635] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1492
[19706.365617] ath11k_warn: 90 callbacks suppressed
[19706.365629] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
[19706.365645] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1480, expected 1488
[19706.365862] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1452, expected 1480
[19706.367046] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1464, expected 1488
[19706.367666] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1468, expected 1476
[19706.367967] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1476, expected 1480
[19706.368397] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1464, expected 1492
[19706.368618] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1456, expected 1484
[19706.847596] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1472, expected 1476
[19706.848069] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1480, expected 1492
[19711.450744] ath11k_warn: 106 callbacks suppressed

This is also with the latest .37 firmware and kernel 6.7.5 on a Lenovo X13s.
From what I can see, this happens when the device roams from one AP to another.
If I move from one room to another with the laptop sleeping (lid closed), everything is OK. If the laptop is running, this may happen.
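One way to test the roaming observation against more logs is to measure how long after each association the first length error appears. The sketch below assumes the dmesg timestamp format and the message texts shown in the excerpt above; `delays_after_association` is a hypothetical helper name.

```python
import re

# Assumption: lines look like "[19653.273616] wlan0: associated".
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def delays_after_association(log_text):
    """For each 'associated' line, return seconds until the next HTC Rx length error."""
    delays = []
    last_assoc = None
    for line in log_text.splitlines():
        m = TS_RE.match(line.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        if msg.endswith(": associated"):
            last_assoc = ts
        elif "HTC Rx: insufficient length" in msg and last_assoc is not None:
            delays.append(ts - last_assoc)
            last_assoc = None  # count only the first error after each association
    return delays
```

In the excerpt above, the first error is logged roughly 48 seconds after the reassociation completes, so a single log cannot confirm or rule out a link to roaming; collecting this delay across several occurrences may help.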
Comment 2 Jens Glathe 2024-10-13 18:25:44 UTC
Even with WiFi disabled I eventually get these odd RX cb errors. Also, spinlocking the whole cb function doesn't resolve the issue; it is just a call from the ath firmware with incorrect data.

[183265.996996] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1468, expected 1476
[183265.997222] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1456, expected 1492
[183265.997470] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
[183276.622153] ath11k_warn: 1 callbacks suppressed
[183276.622198] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1476
[183276.622239] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1476, expected 1484
[183276.622633] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1476, expected 1484

Had this on 2 volterra boxes, one with the spinlocked cb, the other without.
If you route all traffic via WLAN, it occurs much sooner.

Running the .41 firmware on all of them.

Interesting side observation: I have an HP Omnibook X14 with x1e80100, and it also has a WCN6855 in it, with a different board string but the same chip revisions as the volterra boxes. It is also running the .41 firmware, and on this laptop the issue has not happened in about 2 months of use. So maybe the firmware behaves differently, or maybe the interrupt / DMA handling is different? It is the same kernel and same binary on all of these boxes.
Comment 3 MaryWKlein 2024-10-15 03:56:16 UTC
Bug report "ath11k: WCN6855: possible ring buffer corruption" on Bugzilla mentions an issue related to Wi-Fi connectivity on Lenovo ThinkPad X13s computers. These bugs appear to be related to incorrect handling of data from the Wi-Fi hardware, resulting in an "insufficient length" condition.