Bug 218364

Summary: [BISECTED] mac80211 + ath11k crash system suspend - Dell XPS-13 9310
Product: Networking Reporter: Len Brown (lenb)
Component: WirelessAssignee: Benjamin Berg (benjamin)
Status: RESOLVED CODE_FIX    
Severity: blocking CC: benjamin, christian, gilbert.fernandes, gregory.greenman, j, johannes, kvalo, miriam.rachel.korenblit, sergoopsmoskal, warren
Priority: P3    
Hardware: Intel   
OS: Linux   
Kernel Version: 6.7, 6.8-rc1 Subsystem:
Regression: Yes Bisected commit-id: 0a3d898ee9a8303d5b3982b97ef0703919c3ea76
Attachments: dmesg 6.6
sleepgraph HTML output after revert
successful 6.7.0-0001 suspend cycle with Bluetooth disabled
Possible ath11k bugfix for the mac80211 debugfs behaviour change
dmesg 6.7 + patch
sleepgraph HTML output for 6.7 + patch successful suspend resume cycle

Description Len Brown 2024-01-11 04:24:56 UTC
The following commit causes suspend-resume to crash on my Dell XPS 13 9310
when attempting a system suspend/resume cycle.  100% reproducible.

commit 0a3d898ee9a8303d5b3982b97ef0703919c3ea76
Author: Benjamin Berg <benjamin.berg@intel.com>
Date:   Wed Dec 20 04:38:01 2023 +0200

    wifi: mac80211: add/remove driver debugfs entries as appropriate

    When an interface is removed, we should also be deleting the driver
    debugfs entries (as it might still exist in DOWN state in mac80211). At
    the same time, when adding an interface, we can check the
    IEEE80211_SDATA_IN_DRIVER flag to know whether the interface was
    previously known to the driver and is simply being reconfigured.

    Fixes: a1f5dcb1c0c1 ("wifi: mac80211: add a driver callback to add vif debugfs")
    Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
    Reviewed-by: Gregory Greenman <gregory.greenman@intel.com>
    Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
    Link: https://msgid.link/20231220043149.a9f64c359424.I7076526b5297ae8f832228079c999f7b8e147a4c@changeid
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Comment 1 Len Brown 2024-01-11 04:41:55 UTC
Created attachment 305698 [details]
dmesg 6.6

here is the dmesg when booted with Linux-6.6

The last RC to work properly ws Linux-6.7-rc8.
Linux 6.7.0 never comes back from resume -- black screen, hard power cycle is necessary to recover the machine from an attempted suspend/resume.
Comment 2 Len Brown 2024-01-11 05:09:44 UTC
Created attachment 305700 [details]
sleepgraph HTML output after revert

I have verified that reverting the offending commit from 6.7
allows the machine to survive a suspend/resume cycle.

However, it was not a successful suspend/resume cycle (attached),
apparently do to the following:

[   36.169338] Bluetooth: hci0: command 0xfc00 tx timeout
[   37.097602] Bluetooth: hci0: SSR or FW download time out
[   37.097614] hci_uart_qca serial0-0: PM: dpm_run_callback(): acpi_subsys_suspend+0x0/0x70 returns -110
[   37.097633] hci_uart_qca serial0-0: PM: failed to suspend: error -110
Comment 3 Len Brown 2024-01-11 05:17:20 UTC
Created attachment 305701 [details]
successful 6.7.0-0001 suspend cycle with Bluetooth disabled

Verified that 6.7 with the offending commit reverted
can successfully suspend/resume if Bluetooth is disabled in BIOS.
Comment 4 Miri Korenblit 2024-01-11 07:13:58 UTC
Could you please attach the dmesg/syslog of the failure? (before revert)
Comment 5 Miri Korenblit 2024-01-11 10:43:29 UTC
Better attach the netconsole please
Comment 6 Miri Korenblit 2024-01-11 10:43:43 UTC
Better attach the netconsole please
Comment 7 Benjamin Berg 2024-01-11 12:09:41 UTC
Ah, so the patch changes things around in mac80211 to delete (sometimes to re-create) the debugfs entries. This means that vif->debugfs_dir is removed recursively by mac80211 already, but ath11k calls ath11k_debugfs_remove_interface internally, trying to remove the files again (causing a use-after-free).

The fix should be simple, i.e.
 * Remove ath11k_debugfs_remove_interface entirely (also remove debugfs_twt struct member)
 * Remove calls to ath11k_debugfs_add_interface
 * Hook up ath11k_debugfs_add_interface to the vif_add_debugfs op
Comment 8 Benjamin Berg 2024-01-11 12:10:55 UTC
Note that another thing that is changed is that in remove IEEE80211_SDATA_IN_DRIVER is already cleared when calling the driver remove handler. This could affect interface iteration, but I guess that should not matter here.
Comment 9 Benjamin Berg 2024-01-11 12:22:46 UTC
Created attachment 305703 [details]
Possible ath11k bugfix for the mac80211 debugfs behaviour change

@Len, can you try this patch (I have not compile tested it). I think that likely fixes the use-after-free issue of the debugfs dentry.
Comment 10 Sergey 2024-01-11 16:13:10 UTC
I have the Honor Magicbook 16 (5600h) with ath11k wifi/bl card. And this commit (0a3d898ee9a8303d5b3982b97ef0703919c3ea76) cause the crash on every boot for me.

I tried the patch by Benjamin Berg and it helped. Startup/poweroff, suspend/resume is working normally.
Comment 11 Christian Heusel 2024-01-11 18:30:47 UTC
The patch has also been reviewed as working on the Arch Linux bugtracker and will be included before moving the 6.7 kernel to the stable repositories:
https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/issues/18#note_155581
Comment 12 Len Brown 2024-01-11 19:54:56 UTC
(In reply to Benjamin Berg from comment #9)
> Created attachment 305703 [details]
> Possible ath11k bugfix for the mac80211 debugfs behaviour change
> 
> @Len, can you try this patch (I have not compile tested it). I think that
> likely fixes the use-after-free issue of the debugfs dentry.

Yes, this patch on top of 6.7 allows the 9310 to survive suspend/resume.
Comment 13 Len Brown 2024-01-11 19:56:35 UTC
Created attachment 305705 [details]
dmesg 6.7 + patch

Here is the dmesg from 6.7 with the patchfix above booting,
and then completing three suspend/resume cycles.
Comment 14 Len Brown 2024-01-11 19:59:58 UTC
Created attachment 305706 [details]
sleepgraph HTML output for 6.7 + patch successful suspend resume cycle
Comment 15 Len Brown 2024-01-12 21:57:22 UTC
(In reply to Len Brown from comment #2)
 
> I have verified that reverting the offending commit from 6.7
> allows the machine to survive a suspend/resume cycle.
> 
> However, it was not a successful suspend/resume cycle (attached),
> apparently do to the following:
> 
> [   36.169338] Bluetooth: hci0: command 0xfc00 tx timeout
> [   37.097602] Bluetooth: hci0: SSR or FW download time out
> [   37.097614] hci_uart_qca serial0-0: PM: dpm_run_callback():
> acpi_subsys_suspend+0x0/0x70 returns -110
> [   37.097633] hci_uart_qca serial0-0: PM: failed to suspend: error -110

For the record, this appears to be an independent issue.

I ran into this issue again after patch in comment #9 was applied.
Bluetooth failed to connect to any devices,
and Bluetooth prevented suspend from working.

I found after a system power cycle that Bluetooth connected,
suspend/resume functioned, and Bluetooth continued to work after resume.
Comment 17 Kalle Valo 2024-01-18 10:02:25 UTC
Fix applied:

https://git.kernel.org/wireless/wireless/c/556857aa1d08

Hopefully will be in v6.8-rc2. Marking as resolved.
Comment 18 Len Brown 2024-01-22 04:49:56 UTC
still fails in v6.8-rc1
Comment 19 Kalle Valo 2024-01-26 06:08:12 UTC
(In reply to Kalle Valo from comment #17)
> Fix applied:
> 
> https://git.kernel.org/wireless/wireless/c/556857aa1d08
> 
> Hopefully will be in v6.8-rc2.
The fix is now in Linus' tree so it will be in v6.8-rc2.
Comment 20 Jürg Billeter 2024-02-01 09:10:47 UTC
I'm seeing this crash on suspend on Linux 6.7.3, which includes the mentioned fix, if I'm not mistaken. This is with the latest firmware: 3.6510.37. Is suspend working for anyone with stock 6.7.3?

Lenovo ThinkPad T14 Gen 3 AMD (6850U)
Comment 21 Jürg Billeter 2024-02-03 11:39:12 UTC
Please ignore my previous comment. The crash on suspend I'm seeing on Linux 6.7.3 is most likely unrelated to ath11k (current suspect is amdgpu).
Comment 22 Gilbert Fernandes 2024-02-06 14:16:39 UTC
(In reply to Jürg Billeter from comment #20)
> I'm seeing this crash on suspend on Linux 6.7.3, which includes the
> mentioned fix, if I'm not mistaken. This is with the latest firmware:
> 3.6510.37. Is suspend working for anyone with stock 6.7.3?
> 
> Lenovo ThinkPad T14 Gen 3 AMD (6850U)

Lenovo Thinkpad P16s AMD Gen2

I just received a few minutes ago the kernel 6.7.3-200.fc39.x86_64 on my Fedora 39
After a reboot I tested the standby (my laptop uses the modern standby) and I left the machine for 5 minutes sleeping.

I do not see the issue currently on my model. Wifi card info :

gf@aesir:~$ lspci -vv -s 01:00.0
01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 Wireless Network Adapter (rev 01)
	Subsystem: Lenovo Device 9309
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin ? routed to IRQ 91
	IOMMU group: 12
	Region 0: Memory at 78600000 (64-bit, non-prefetchable) [size=2M]
	Capabilities: <access denied>
	Kernel driver in use: ath11k_pci
	Kernel modules: ath11k_pci

That laptop was pre-installed with Fc39 by Lenovo. Never had any issues from the Wifi (a lot of issues of ping issues have been reported here -> https://forums.lenovo.com/t5/Other-Linux-Discussions/QCNFA765-Linux-ath11k-wifi-crippled-high-latency-packet-loss-frequent-disassociations-T14s-AMD/).

My wifi card works fine from the laptop first turn on.
I do not have currently sleep issues from kernel 6.7.3

Available for any tests, logs or kernel to try if you need any help.
Comment 23 Warren Togami 2024-02-07 13:52:53 UTC
Summary of QCNFA765 ath11k problems kernel-6.7.4 
================================================
The current 6.7.x suspend crashes are intertwined with the long standing packet loss and latency problems we've been seeing with QCNFA765 Linux ath11k.

Kernel 6.4.12-6.6.14 all had the same problem where you need the `iw dev wlp1s0 set power_save off` workaround to prevent crippling packet losses and slow speeds. With the power_save workaround applied this wifi adapter was mostly tolerable.

Kernel-6.7.3 broke suspend.

Kernel-6.7.4 included a partial fix.

6.7.4

* crashes on suspend with bluetooth enabled.

* no crash on suspend with bluetooth disabled

* no crash on suspend with bluetooth enabled plus the power_save=off workaround

* There is however a new problem after resume. Speeds become considerably slower. What was previously 15MB/sec instead crawls at 11KB/sec fluctuating to 50KB/sec.

This slow speed problem is unaffected by flipping the power_save parameter on and off. You can get the speed back by rebooting, or unloading the ath11k_pci kernel module and reloading it.
Comment 24 Johannes Berg 2024-02-07 13:56:42 UTC
All those slow, bluetooth, etc. issues should probably go to a different bug though ... This one's already marked fixed, and it _is_ fixed as reported, so everything here will just be ignored, I'd guess.