Bug 218364
Summary: | [BISECTED] mac80211 + ath11k crash system suspend - Dell XPS-13 9310 | ||
---|---|---|---|
Product: | Networking | Reporter: | Len Brown (lenb) |
Component: | Wireless | Assignee: | Benjamin Berg (benjamin) |
Status: | RESOLVED CODE_FIX | ||
Severity: | blocking | CC: | benjamin, christian, gilbert.fernandes, gregory.greenman, j, johannes, kvalo, miriam.rachel.korenblit, sergoopsmoskal, warren |
Priority: | P3 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 6.7, 6.8-rc1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | 0a3d898ee9a8303d5b3982b97ef0703919c3ea76 |
Attachments: |
dmesg 6.6
sleepgraph HTML output after revert successful 6.7.0-0001 suspend cycle with Bluetooth disabled Possible ath11k bugfix for the mac80211 debugfs behaviour change dmesg 6.7 + patch sleepgraph HTML output for 6.7 + patch successful suspend resume cycle |
Description
Len Brown
2024-01-11 04:24:56 UTC
Created attachment 305698 [details]
dmesg 6.6
here is the dmesg when booted with Linux-6.6
The last RC to work properly ws Linux-6.7-rc8.
Linux 6.7.0 never comes back from resume -- black screen, hard power cycle is necessary to recover the machine from an attempted suspend/resume.
Created attachment 305700 [details]
sleepgraph HTML output after revert
I have verified that reverting the offending commit from 6.7
allows the machine to survive a suspend/resume cycle.
However, it was not a successful suspend/resume cycle (attached),
apparently do to the following:
[ 36.169338] Bluetooth: hci0: command 0xfc00 tx timeout
[ 37.097602] Bluetooth: hci0: SSR or FW download time out
[ 37.097614] hci_uart_qca serial0-0: PM: dpm_run_callback(): acpi_subsys_suspend+0x0/0x70 returns -110
[ 37.097633] hci_uart_qca serial0-0: PM: failed to suspend: error -110
Created attachment 305701 [details]
successful 6.7.0-0001 suspend cycle with Bluetooth disabled
Verified that 6.7 with the offending commit reverted
can successfully suspend/resume if Bluetooth is disabled in BIOS.
Could you please attach the dmesg/syslog of the failure? (before revert) Better attach the netconsole please Better attach the netconsole please Ah, so the patch changes things around in mac80211 to delete (sometimes to re-create) the debugfs entries. This means that vif->debugfs_dir is removed recursively by mac80211 already, but ath11k calls ath11k_debugfs_remove_interface internally, trying to remove the files again (causing a use-after-free). The fix should be simple, i.e. * Remove ath11k_debugfs_remove_interface entirely (also remove debugfs_twt struct member) * Remove calls to ath11k_debugfs_add_interface * Hook up ath11k_debugfs_add_interface to the vif_add_debugfs op Note that another thing that is changed is that in remove IEEE80211_SDATA_IN_DRIVER is already cleared when calling the driver remove handler. This could affect interface iteration, but I guess that should not matter here. Created attachment 305703 [details]
Possible ath11k bugfix for the mac80211 debugfs behaviour change
@Len, can you try this patch (I have not compile tested it). I think that likely fixes the use-after-free issue of the debugfs dentry.
I have the Honor Magicbook 16 (5600h) with ath11k wifi/bl card. And this commit (0a3d898ee9a8303d5b3982b97ef0703919c3ea76) cause the crash on every boot for me. I tried the patch by Benjamin Berg and it helped. Startup/poweroff, suspend/resume is working normally. The patch has also been reviewed as working on the Arch Linux bugtracker and will be included before moving the 6.7 kernel to the stable repositories: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/issues/18#note_155581 (In reply to Benjamin Berg from comment #9) > Created attachment 305703 [details] > Possible ath11k bugfix for the mac80211 debugfs behaviour change > > @Len, can you try this patch (I have not compile tested it). I think that > likely fixes the use-after-free issue of the debugfs dentry. Yes, this patch on top of 6.7 allows the 9310 to survive suspend/resume. Created attachment 305705 [details]
dmesg 6.7 + patch
Here is the dmesg from 6.7 with the patchfix above booting,
and then completing three suspend/resume cycles.
Created attachment 305706 [details]
sleepgraph HTML output for 6.7 + patch successful suspend resume cycle
(In reply to Len Brown from comment #2) > I have verified that reverting the offending commit from 6.7 > allows the machine to survive a suspend/resume cycle. > > However, it was not a successful suspend/resume cycle (attached), > apparently do to the following: > > [ 36.169338] Bluetooth: hci0: command 0xfc00 tx timeout > [ 37.097602] Bluetooth: hci0: SSR or FW download time out > [ 37.097614] hci_uart_qca serial0-0: PM: dpm_run_callback(): > acpi_subsys_suspend+0x0/0x70 returns -110 > [ 37.097633] hci_uart_qca serial0-0: PM: failed to suspend: error -110 For the record, this appears to be an independent issue. I ran into this issue again after patch in comment #9 was applied. Bluetooth failed to connect to any devices, and Bluetooth prevented suspend from working. I found after a system power cycle that Bluetooth connected, suspend/resume functioned, and Bluetooth continued to work after resume. Proposed ath11k fix here: https://patchwork.kernel.org/project/linux-wireless/patch/20240111170629.1257217-1-benjamin@sipsolutions.net/ Fix applied: https://git.kernel.org/wireless/wireless/c/556857aa1d08 Hopefully will be in v6.8-rc2. Marking as resolved. still fails in v6.8-rc1 (In reply to Kalle Valo from comment #17) > Fix applied: > > https://git.kernel.org/wireless/wireless/c/556857aa1d08 > > Hopefully will be in v6.8-rc2. The fix is now in Linus' tree so it will be in v6.8-rc2. I'm seeing this crash on suspend on Linux 6.7.3, which includes the mentioned fix, if I'm not mistaken. This is with the latest firmware: 3.6510.37. Is suspend working for anyone with stock 6.7.3? Lenovo ThinkPad T14 Gen 3 AMD (6850U) Please ignore my previous comment. The crash on suspend I'm seeing on Linux 6.7.3 is most likely unrelated to ath11k (current suspect is amdgpu). (In reply to Jürg Billeter from comment #20) > I'm seeing this crash on suspend on Linux 6.7.3, which includes the > mentioned fix, if I'm not mistaken. This is with the latest firmware: > 3.6510.37. Is suspend working for anyone with stock 6.7.3? > > Lenovo ThinkPad T14 Gen 3 AMD (6850U) Lenovo Thinkpad P16s AMD Gen2 I just received a few minutes ago the kernel 6.7.3-200.fc39.x86_64 on my Fedora 39 After a reboot I tested the standby (my laptop uses the modern standby) and I left the machine for 5 minutes sleeping. I do not see the issue currently on my model. Wifi card info : gf@aesir:~$ lspci -vv -s 01:00.0 01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 Wireless Network Adapter (rev 01) Subsystem: Lenovo Device 9309 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin ? routed to IRQ 91 IOMMU group: 12 Region 0: Memory at 78600000 (64-bit, non-prefetchable) [size=2M] Capabilities: <access denied> Kernel driver in use: ath11k_pci Kernel modules: ath11k_pci That laptop was pre-installed with Fc39 by Lenovo. Never had any issues from the Wifi (a lot of issues of ping issues have been reported here -> https://forums.lenovo.com/t5/Other-Linux-Discussions/QCNFA765-Linux-ath11k-wifi-crippled-high-latency-packet-loss-frequent-disassociations-T14s-AMD/). My wifi card works fine from the laptop first turn on. I do not have currently sleep issues from kernel 6.7.3 Available for any tests, logs or kernel to try if you need any help. Summary of QCNFA765 ath11k problems kernel-6.7.4 ================================================ The current 6.7.x suspend crashes are intertwined with the long standing packet loss and latency problems we've been seeing with QCNFA765 Linux ath11k. Kernel 6.4.12-6.6.14 all had the same problem where you need the `iw dev wlp1s0 set power_save off` workaround to prevent crippling packet losses and slow speeds. With the power_save workaround applied this wifi adapter was mostly tolerable. Kernel-6.7.3 broke suspend. Kernel-6.7.4 included a partial fix. 6.7.4 * crashes on suspend with bluetooth enabled. * no crash on suspend with bluetooth disabled * no crash on suspend with bluetooth enabled plus the power_save=off workaround * There is however a new problem after resume. Speeds become considerably slower. What was previously 15MB/sec instead crawls at 11KB/sec fluctuating to 50KB/sec. This slow speed problem is unaffected by flipping the power_save parameter on and off. You can get the speed back by rebooting, or unloading the ath11k_pci kernel module and reloading it. All those slow, bluetooth, etc. issues should probably go to a different bug though ... This one's already marked fixed, and it _is_ fixed as reported, so everything here will just be ignored, I'd guess. |