Dell XPS 13 9310, 32GB version, equipped with an ath11k wifi card. Latest BIOS (3.1.0). Tested with different flavours of kernel 5.14, 5.15-rc, and the ath11k Git tree. Same everywhere.

Hibernation itself, both from the desktop environment (Mate) and from the console (echo disk > /sys/power/state), works perfectly. But on resume it loads the image and freezes, never giving back control.

If I do rmmod ath11k_pci before hibernation, I am able to hibernate and then resume successfully, but the ath11k wifi cannot be activated again: the module loads but the card does not come back (see bug https://bugzilla.kernel.org/show_bug.cgi?id=214541). I've also tried building mhi and qrtr_mhi as modules, unloading them before hibernation, and loading all of them back afterwards. Still no success.
IIRC the issue is that ath11k expects the firmware to be running during suspend. And this was because shutting down the firmware for suspend caused problems in the MHI subsystem during resume. To fix this I suspect we need changes both in ath11k and in the MHI subsystem, so not easy.
The same bug was reported recently for 6.1/6.2 kernels on openSUSE Tumbleweed: https://bugzilla.opensuse.org/show_bug.cgi?id=1207948 The current situation is that the hibernation resume leads to a kernel panic, because the mhi bus timeout is set to 90 seconds, while the PM core watchdog timeout is 60 seconds that triggers the panic. So, at least, we should reduce MHI_TIMEOUT_DEFAULT_MS to a more reasonable value for avoiding the unnecessary kernel panic, IMO.
*** Bug 216962 has been marked as a duplicate of this bug. ***
Created attachment 303921 [details] Hibernation Logs
Oops, message lost for whatever reason - stupid me. So here we go again:

I can confirm this bug is also present on an HP Pro x360 435 G9 w/ firmware WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.23 and vanilla kernel v6.1.15. Logs attached above.

Hibernation works to the extent that login and session restore are functional (with significant delays), but the ath11k_pci module gets stuck somewhere and refuses even to respond to rmmod requests. Thus a (hard) reset is required, rendering the hibernation function useless. This is (among other reasons) why I moved away from hibernation - session management is not that important for me.

@Kalle: thx for forwarding. Please let me know if I can further assist in resolving that bug. Testing would be fine.

Yours, Carsten
The root cause is that WLAN power is cut off during hibernation, so on resume MHI gets stuck for nearly 90 seconds. Since the PM core watchdog timeout in openSUSE Tumbleweed is configured as 60 seconds, the kernel crashes. Hence the workaround (WAR) is to reduce the MHI timeout value to be smaller than the PM watchdog timeout, and this can be done from userspace because the MHI module already exports it via debugfs. See https://bugzilla.opensuse.org/show_bug.cgi?id=1207948
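The timing relationship behind the panic can be written down as a small check. This is only an illustrative Python sketch, not kernel code: the 90 s and 60 s figures come from this report, while the function names and the 10 s safety margin are assumptions made up for the example.

```python
def resume_outlasts_watchdog(mhi_timeout_ms: int, dpm_watchdog_s: int) -> bool:
    """Return True if a stalled MHI wait would last at least as long as the
    PM watchdog timeout, i.e. the watchdog fires before MHI gives up."""
    return mhi_timeout_ms >= dpm_watchdog_s * 1000


def safe_mhi_timeout_ms(dpm_watchdog_s: int, margin_s: int = 10) -> int:
    """Largest MHI timeout (in ms) that still leaves margin_s seconds before
    the watchdog fires. margin_s is an arbitrary illustrative value."""
    if dpm_watchdog_s <= margin_s:
        raise ValueError("watchdog timeout must exceed the margin")
    return (dpm_watchdog_s - margin_s) * 1000


if __name__ == "__main__":
    # Values reported in this bug: 90 s MHI timeout, 60 s watchdog.
    print(resume_outlasts_watchdog(90_000, 60))
    print(safe_mhi_timeout_ms(60))
```

Any MHI timeout below the watchdog value avoids the panic; it does not make the wifi card work after resume.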
Do we really need to use such a timeout value like 90 seconds for detecting the bus stall? It's unnecessarily long.
(In reply to Takashi Iwai from comment #2)
> The same bug was reported recently for 6.1/6.2 kernels on openSUSE
> Tumbleweed:
> https://bugzilla.opensuse.org/show_bug.cgi?id=1207948
>
> The current situation is that the hibernation resume leads to a kernel
> panic, because the mhi bus timeout is set to 90 seconds, while the PM core
> watchdog timeout is 60 seconds that triggers the panic.
>
> So, at least, we should reduce MHI_TIMEOUT_DEFAULT_MS to a more reasonable
> value for avoiding the unnecessary kernel panic, IMO.

Yes, 90 seconds sounds unnecessarily long. Something like 20 seconds would sound more appropriate here. I'll look at this in detail and submit a patch.
BTW I was not able to reproduce the crash on my x86 NUC test setup, after hibernation ath11k is broken as expected but the kernel is not crashing. Maybe I don't have some watchdog enabled or something? (I have a slimmed down custom kernel) This is not important, just curious.
IIRC, the RPM watchdog default timeout (in kconfig) is 120 seconds in the upstream code, so maybe that's the reason. It's reduced to 60 seconds on the openSUSE kernel for some reason.
Thanks, but I guess you mean DPM watchdog? I didn't even have it enabled, so I did this now:

CONFIG_DPM_WATCHDOG=y
CONFIG_DPM_WATCHDOG_TIMEOUT=60

Still I don't see the crash, but it doesn't matter; I'll submit the patch changing the timeout next week anyway. At the same time I'm also talking with the MHI folks about how to fix the hibernation properly.
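For anyone comparing setups, the two options above can be read out of a kernel config with a few lines of Python. This is a rough sketch: the function name is made up for the example, and where the config text comes from (for instance a distribution's /boot/config-* file or /proc/config.gz, if present) is left to the caller.

```python
from typing import Optional, Tuple


def dpm_watchdog_settings(config_text: str) -> Tuple[bool, Optional[int]]:
    """Return (enabled, timeout_seconds) for the DPM watchdog.

    timeout_seconds is None when CONFIG_DPM_WATCHDOG_TIMEOUT is absent.
    """
    enabled = False
    timeout: Optional[int] = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == "CONFIG_DPM_WATCHDOG=y":
            enabled = True
        elif line.startswith("CONFIG_DPM_WATCHDOG_TIMEOUT="):
            timeout = int(line.split("=", 1)[1])
    return enabled, timeout


if __name__ == "__main__":
    sample = "CONFIG_DPM_WATCHDOG=y\nCONFIG_DPM_WATCHDOG_TIMEOUT=60\n"
    print(dpm_watchdog_settings(sample))
```

A kernel built without the option carries a "# CONFIG_DPM_WATCHDOG is not set" line, for which the sketch reports the watchdog as disabled.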
Here's the patch changing the timeout in ath11k: https://patchwork.kernel.org/project/linux-wireless/patch/20230329162038.8637-1-kvalo@kernel.org/ I'll try to get it to v6.3.
(In reply to Kalle Valo from comment #12)
> Here's the patch changing the timeout in ath11k:
>
> https://patchwork.kernel.org/project/linux-wireless/patch/20230329162038.8637-1-kvalo@kernel.org/
>
> I'll try to get it to v6.3.

It's applied now and should make it to v6.3:

https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless.git/commit/?id=cf5fa3ca0552f1b7ba8490de40700bbfb6979b17
For the actual hibernation support in ath11k here's a proof of concept implementation: https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/log/?h=ath11k-hibernation-support
(In reply to Kalle Valo from comment #14)
> For the actual hibernation support in ath11k here's a proof of concept
> implementation:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/log/?h=ath11k-hibernation-support

Thanks! I'll give it a try later.

But the patches seem missing to modify mhi_power_down() calls in drivers/accel/qaic/mhi_controller.c?
The initial test with those patches look good. The hibernation worked on Thinkpad T14s Gen 3, while there have been a few errors from ath11k spewed at the resume time. It'd be great if those fixes can go for 6.6, at least.
(In reply to Takashi Iwai from comment #16)
> The initial test with those patches look good. The hibernation worked on
> Thinkpad T14s Gen 3, while there have been a few errors from ath11k spewed
> at the resume time.
>
> It'd be great if those fixes can go for 6.6, at least.

Thanks for the test. Could you share kernel logs of ath11k so that we can check those resume errors?
> But the patches seem missing to modify mhi_power_down() calls in
> drivers/accel/qaic/mhi_controller.c?

Yeah, the 0-day bot also reported that. I'll fix that later.
> It'd be great if those fixes can go for 6.6, at least.

I would want the same but this depends on the MHI subsystem.
(In reply to Baochen Qiang from comment #17)
> (In reply to Takashi Iwai from comment #16)
> > The initial test with those patches look good. The hibernation worked on
> > Thinkpad T14s Gen 3, while there have been a few errors from ath11k spewed
> > at the resume time.
> >
> > It'd be great if those fixes can go for 6.6, at least.
>
> Thanks for the test. Could you share kernel logs of ath11k so that we can
> check those resume errors?

At S4 resume, starting fine with:
[ 42.214532] ath11k_pci 0000:01:00.0: chip_id 0x12 chip_family 0xb board_id 0xff soc_id 0x400c1211
[ 42.214542] ath11k_pci 0000:01:00.0: fw_version 0x110b196e fw_build_timestamp 2022-12-22 12:54 fw_build_id WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.23

then triggering this:
[ 42.285809] ath11k_pci 0000:01:00.0: Last interrupt received for each CE:
[ 42.285817] ath11k_pci 0000:01:00.0: CE_id 0 pipe_num 0 35884ms before
[ 42.285823] ath11k_pci 0000:01:00.0: CE_id 1 pipe_num 1 4524ms before
....

I attach the full dmesg output below.

Note that it's 6.4.6 kernel with backports of the mhi patches.
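The per-CE lines quoted above can be pulled out of a full dmesg dump for comparison between runs. A minimal Python sketch follows, assuming the line format is exactly as shown in this comment; the function name is hypothetical and nothing here is part of ath11k.

```python
import re

# Matches lines such as:
#   "... ath11k_pci 0000:01:00.0: CE_id 0 pipe_num 0 35884ms before"
CE_LINE = re.compile(r"CE_id (\d+) pipe_num (\d+) (\d+)ms before")


def last_ce_interrupts(dmesg_text: str) -> dict:
    """Map each CE id to the milliseconds since its last interrupt, as logged.

    If a CE id appears more than once, the last occurrence wins.
    """
    return {int(m.group(1)): int(m.group(3)) for m in CE_LINE.finditer(dmesg_text)}


if __name__ == "__main__":
    sample = (
        "[ 42.285817] ath11k_pci 0000:01:00.0: CE_id 0 pipe_num 0 35884ms before\n"
        "[ 42.285823] ath11k_pci 0000:01:00.0: CE_id 1 pipe_num 1 4524ms before\n"
    )
    print(last_ce_interrupts(sample))
```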
Created attachment 304706 [details]
dmesg output after s4 indicating ath11k errors
(In reply to Takashi Iwai from comment #20)
> (In reply to Baochen Qiang from comment #17)
> > (In reply to Takashi Iwai from comment #16)
> > > The initial test with those patches look good. The hibernation worked on
> > > Thinkpad T14s Gen 3, while there have been a few errors from ath11k spewed
> > > at the resume time.
> > >
> > > It'd be great if those fixes can go for 6.6, at least.
> >
> > Thanks for the test. Could you share kernel logs of ath11k so that we can
> > check those resume errors?
>
> At S4 resume, starting fine with:
> [ 42.214532] ath11k_pci 0000:01:00.0: chip_id 0x12 chip_family 0xb board_id 0xff soc_id 0x400c1211
> [ 42.214542] ath11k_pci 0000:01:00.0: fw_version 0x110b196e fw_build_timestamp 2022-12-22 12:54 fw_build_id WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.23
>
> then triggering this:
> [ 42.285809] ath11k_pci 0000:01:00.0: Last interrupt received for each CE:
> [ 42.285817] ath11k_pci 0000:01:00.0: CE_id 0 pipe_num 0 35884ms before
> [ 42.285823] ath11k_pci 0000:01:00.0: CE_id 1 pipe_num 1 4524ms before
> ....

Checked the attached dmesg log; ath11k is working well. Will refine later to avoid the misleading logs above. Thanks.

> I attach the full dmesg output below.
>
> Note that it's 6.4.6 kernel with backports of the mhi patches.
I now updated the branch:

ath11k-hibernation-support-202307280814

* rebase to 6.5.0-rc3-wt-ath+
* fix checkpatch warnings
* add new function mhi_power_down_no_destroy() and keep mhi_power_down() as is, this should fix the compilation error
Thanks, I'll try it later. Through a quick glance, the comment for mhi_power_down() still has a stale destroy_device, and it should be dropped.
(In reply to Takashi Iwai from comment #24)
> Thanks, I'll try it later.

I think you shouldn't spend time testing this version, there aren't that many changes. I only tried to update the code to the top of our tree and fix the compilation errors. I'll let you know once it makes sense to test again.

> Through a quick glance, the comment for mhi_power_down() still has a stale
> destroy_device, and it should be dropped.

Thanks, I'll take a look.
Created attachment 305199 [details]
ath11k dmesg on resume (ThinkPad T14 Gen 3 AMD)

I tested the ath11k-hibernation-support-202309121433 patches on top of kernel 6.5.6 and it appears to work here. Before, I had the same ath11k driver lockup mentioned earlier after resuming from hibernation.

There's a bunch of debug info printed on resume; I don't think it's too important, but I included it as an attachment.

System is ThinkPad T14 Gen 3 AMD
- CPU: AMD Ryzen 6850U
- WLAN: wcn6855 hw2.1 (17cb:1103, subsys 17aa:9309)
I have tested the patch on Thinkpad T14 Gen 4 AMD. Everything is working fine with kernel 6.6. When can we expect these changes to be included in the mainline kernel?
I updated the branch and I'm planning to submit this version for public review soon. There should not be any warnings anymore. Testing results very welcome, both positive and negative!

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/log/?h=ath11k-hibernation-support

ath11k-hibernation-support-202311091548

* rebase to ath-202310310746 (6.6.0-wt-ath+)
* remove PCI state save/restore (from Baochen's patch "wifi: ath11k: no need to save/restore PCI state")
* add patch 'wifi: ath11k: fix warning on DMA ring capabilities event'
* add patch 'wifi: ath11k: do not dump SRNG statistics during resume'
* replace patch 'wifi: ath11k: handle thermal device registration together with MAC' with patch 'wifi: ath11k: thermal: don't try to register multiple times'
* improve commit logs
* write MHI API documentation
* run more tests
* handle FIXME in ath11k_mhi_stop()
* patch 2: __mhi_prepare_for_transfer_autoqueue(): simplify if statements
I did quick tests and the new patches seem working fine. Feel free to take my tested-by tags Tested-by: Takashi Iwai <tiwai@suse.de> Thanks!
> I did quick tests and the new patches seem working fine.
> Feel free to take my tested-by tags

Perfect timing, I was just about to submit the patches for public review :) Thank you!
New update, only cosmetic changes this time based on Jeff's and Baochen's review:

ath11k-hibernation-support-202311101016

* white space fix
* spelling fixes
* 'wifi: ath11k: support hibernation': don't mention saving PCI states, we don't do that anymore
* add Takashi's Tested-by

I now posted the patches for public review:

https://lore.kernel.org/ath11k/20231110102202.3168243-1-kvalo@kernel.org/
New update:

ath11k-hibernation-support-202311271607

* rebase to ath-202311221826 (6.7.0-rc2-wt-ath+)
* 'bus: mhi: host: add mhi_power_down_no_destroy()': fix null state string for DEV_ST_TRANSITION_DISABLE_DESTROY_DEVICE
* 'bus: mhi: host: add new interfaces to handle MHI channels directly': fix typos in comments
* 'bus: mhi: host: add new interfaces to handle MHI channels directly': honour initial autoqueue configuration
* 'bus: mhi: host: add new interfaces to handle MHI channels directly': don't prepare/unprepare MHI devices that don't match with a MHI client driver
* 'wifi: ath11k: remove MHI LOOPBACK channels': remove LOOPBACK channels for QCN9074 as well

Submitted as v2:

https://lore.kernel.org/mhi/20231127162022.518834-1-kvalo@kernel.org/
A short status update, the MHI maintainer doesn't like the patchset: https://lore.kernel.org/ath11k/20231130054250.GC3043@thinkpad/
Finally a solution was found and ath11k hibernation support has been committed:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=ath-next&id=813e0ae613d6ee1b3e11f1c41f8b9e9df8ef0493

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=ath-next&id=e0cd1185900e638d41d9cccb4c259051e05f69e9

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=ath-next&id=166a490f59ac10340ee5330e51c15188ce2a7f8f

If all goes well, these commits should be in v6.10, released around July/August. If someone wants to test this, these commits are in the master branch, tag ath-202404091148 (or later), of our ath.git tree:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/
It seems to be working. I have disabled all workarounds and wifi still works after sleep & hibernate with `6.10.2-arch1-1`. Thanks!
It appears this has been reverted? https://github.com/torvalds/linux/commit/2f833e8948d6c88a3a257d4e426c9897b4907d5a
Indeed, let's reopen until a better fix lands upstream.