Bug 214649

Summary: ath11k: kernel crashes after hibernation
Product: Drivers
Reporter: Mark Herbert (mark.herbert42)
Component: network-wireless
Assignee: Kalle Valo (kvalo)
Status: REOPENED
Severity: normal
CC: bqiang, bugs+kernel, gerbilsoft, kernel, kishorv06, kvalo, marcoen, reg.krn, rycwo, serviceskernelorg, tiwai, xmb8dsv4
Priority: P1
Hardware: All
OS: Linux
Kernel Version: All
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Hibernation Logs
dmeseg output after s4 indicating ath11k errors
ath11k dmesg on resume (ThinkPad T14 Gen 3 AMD)

Description Mark Herbert 2021-10-08 13:23:03 UTC
Dell XPS 13 9310, 32GB version, equipped with an ath11k wifi card. Latest BIOS (3.1.0).

Tested with different flavours of kernel 5.14, 5.15-rc, and the ath11k GIT tree. Same everywhere.

Hibernation works perfectly, both from the desktop environment (Mate) and from the console (echo disk > /sys/power/state). But on resume the system loads the image and freezes, never giving back control.

If I do rmmod ath11k_pci before hibernation, I am able to hibernate and then resume successfully, but the ath11k wifi cannot be activated again: the module loads but the card does not come back (see bug https://bugzilla.kernel.org/show_bug.cgi?id=214541).

I've also tried building mhi and qrtr_mhi as modules, unloading them before hibernation and loading them all back afterwards. Still no success.
Comment 1 Kalle Valo 2023-02-13 12:33:25 UTC
IIRC the issue is that ath11k expects the firmware to be running during
suspend. And this was because shutting down the firmware for suspend
caused problems in the MHI subsystem during resume. To fix this I
suspect we need changes both in ath11k and in the MHI subsystem, so not
easy.
Comment 2 Takashi Iwai 2023-02-15 16:20:24 UTC
The same bug was reported recently for 6.1/6.2 kernels on openSUSE Tumbleweed:
  https://bugzilla.opensuse.org/show_bug.cgi?id=1207948

The current situation is that the hibernation resume leads to a kernel panic, because the mhi bus timeout is set to 90 seconds, while the PM core watchdog timeout is 60 seconds, so the watchdog fires first and triggers the panic.

So, at least, we should reduce MHI_TIMEOUT_DEFAULT_MS to a more reasonable value for avoiding the unnecessary kernel panic, IMO.
Comment 3 Kalle Valo 2023-03-08 09:28:29 UTC
*** Bug 216962 has been marked as a duplicate of this bug. ***
Comment 4 Carsten Hatger 2023-03-10 18:10:30 UTC
Created attachment 303921 [details]
Hibernation Logs
Comment 5 Carsten Hatger 2023-03-10 18:26:53 UTC
Ooops,

Message lost for whatever reason - stupid me.

So here we go again:

I can confirm this bug to be present on a HP Pro x360 435 G9 w/ firmware WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.23 and vanilla kernel v6.1.15, too. Logs attached above.

Hibernation works to the extent that login and session restore are functional  (with significant delays) but the ath11k_pci module got stuck somewhere between the middle of nowhere and refuses even to respond to rmmod requests. Thus a (hard) reset is required, rendering the hibernation function useless. 

This is (among other reasons) why I moved away from hibernation - session management is not that important for me.

@Kalle: thx for forwarding.

Please let me know if I can further assist in resolving that bug. Testing would be fine.
 
Yours,
Carsten
Comment 6 Baochen Qiang 2023-03-13 08:10:05 UTC
The root cause is that WLAN power is cut off during hibernation, so on resume MHI gets stuck for nearly 90 seconds. Since the PM core watchdog timeout in openSUSE Tumbleweed is configured as 60 seconds, the kernel crashes.

Hence the workaround (WAR) is to reduce the MHI timeout value to be smaller than the PM watchdog timeout. This can be done from userspace because the MHI module already exposes the timeout in debugfs. See https://bugzilla.opensuse.org/show_bug.cgi?id=1207948
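
A minimal sketch of that userspace workaround, in Python. The debugfs layout assumed below (one directory per MHI controller under /sys/kernel/debug/mhi, each with a timeout_ms file) and the 5-second safety margin are assumptions for illustration, not taken from this report; check the actual file name on your system. Writing to debugfs needs root.

```python
from pathlib import Path

# Assumed location of the MHI debugfs entries; verify on the target system.
MHI_DEBUGFS_ROOT = Path("/sys/kernel/debug/mhi")


def pick_mhi_timeout_ms(watchdog_s: int, margin_s: int = 5) -> int:
    """Return an MHI timeout (ms) that expires before the PM watchdog does."""
    if watchdog_s <= margin_s:
        raise ValueError("watchdog timeout too small for the chosen margin")
    return (watchdog_s - margin_s) * 1000


def write_mhi_timeout(timeout_file: Path, timeout_ms: int) -> None:
    """Write the timeout to a debugfs-style file (file name is an assumption)."""
    timeout_file.write_text(f"{timeout_ms}\n")


def apply_to_all(root: Path, timeout_ms: int) -> int:
    """Write timeout_ms to every <root>/<controller>/timeout_ms; return the count."""
    count = 0
    for entry in sorted(root.glob("*/timeout_ms")):
        write_mhi_timeout(entry, timeout_ms)
        count += 1
    return count


if __name__ == "__main__":
    # Dry run: only print the value that would be written for a 60 s watchdog.
    print(pick_mhi_timeout_ms(watchdog_s=60))
```

On a real system one would call apply_to_all(MHI_DEBUGFS_ROOT, pick_mhi_timeout_ms(60)) as root before hibernating. The setting is not persistent across reboots or module reloads.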
Comment 7 Takashi Iwai 2023-03-13 08:34:08 UTC
Do we really need a timeout as long as 90 seconds for detecting the bus stall?  It's unnecessarily long.
Comment 8 Kalle Valo 2023-03-22 13:09:47 UTC
(In reply to Takashi Iwai from comment #2)
> The same bug was reported recently for 6.1/6.2 kernels on openSUSE
> Tumbleweed:
>   https://bugzilla.opensuse.org/show_bug.cgi?id=1207948
> 
> The current situation is that the hibernation resume leads to a kernel
> panic, because the mhi bus timeout is set to 90 seconds, while the PM core
> watchdog timeout is 60 seconds that triggers the panic.
> 
> So, at least, we should reduce MHI_TIMEOUT_DEFAULT_MS to a more reasonable
> value for avoiding the unnecessary kernel panic, IMO.

Yes, 90 seconds sounds unnecessarily long. Something like 20 seconds would be more appropriate here. I'll look at this in detail and submit a patch.
Comment 9 Kalle Valo 2023-03-24 10:38:05 UTC
BTW I was not able to reproduce the crash on my x86 NUC test setup, after hibernation ath11k is broken as expected but the kernel is not crashing. Maybe I don't have some watchdog enabled or something? (I have a slimmed down custom kernel) This is not important, just curious.
Comment 10 Takashi Iwai 2023-03-24 12:03:19 UTC
IIRC, the RPM watchdog default timeout (in kconfig) is 120 seconds in the upstream code, so maybe that's the reason.  It's reduced to 60 seconds in the openSUSE kernel for some reason.
Comment 11 Kalle Valo 2023-03-24 14:17:12 UTC
Thanks, but I guess you mean the DPM watchdog? I didn't even have it enabled, so I enabled it now:

CONFIG_DPM_WATCHDOG=y
CONFIG_DPM_WATCHDOG_TIMEOUT=60

I still don't see the crash, but it doesn't matter; I'll submit the patch changing the timeout next week anyway. At the same time I'm also talking with the MHI folks about how to fix the hibernation properly.
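
For anyone comparing their own setup, the check discussed here can be sketched in Python: read the two DPM watchdog options from a kernel config and compare the watchdog timeout with an MHI timeout. The function names are made up for illustration, and the 90-second and 20-second MHI values are simply the figures quoted in comments 2 and 8. This only compares numbers; as noted above, enabling the watchdog does not reproduce the crash on every setup.

```python
import re
from typing import Optional


def dpm_watchdog_timeout_s(config_text: str) -> Optional[int]:
    """Return the DPM watchdog timeout in seconds, or None if it is not set."""
    if not re.search(r"^CONFIG_DPM_WATCHDOG=y$", config_text, re.MULTILINE):
        return None
    match = re.search(
        r"^CONFIG_DPM_WATCHDOG_TIMEOUT=(\d+)$", config_text, re.MULTILINE
    )
    # No default is assumed when the timeout line is missing.
    return int(match.group(1)) if match else None


def watchdog_fires_first(config_text: str, mhi_timeout_s: int) -> bool:
    """True if the watchdog would expire before the MHI timeout does."""
    timeout = dpm_watchdog_timeout_s(config_text)
    return timeout is not None and timeout < mhi_timeout_s


if __name__ == "__main__":
    config = "CONFIG_DPM_WATCHDOG=y\nCONFIG_DPM_WATCHDOG_TIMEOUT=60\n"
    print(watchdog_fires_first(config, mhi_timeout_s=90))  # MHI value from comment 2
    print(watchdog_fires_first(config, mhi_timeout_s=20))  # value proposed in comment 8
```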
Comment 12 Kalle Valo 2023-03-29 18:44:12 UTC
Here's the patch changing the timeout in ath11k:

https://patchwork.kernel.org/project/linux-wireless/patch/20230329162038.8637-1-kvalo@kernel.org/

I'll try to get it to v6.3.
Comment 13 Kalle Valo 2023-04-03 13:53:42 UTC
(In reply to Kalle Valo from comment #12)
> Here's the patch changing the timeout in ath11k:
> 
> https://patchwork.kernel.org/project/linux-wireless/patch/20230329162038.
> 8637-1-kvalo@kernel.org/
> 
> I'll try to get it to v6.3.

It's applied now and should make it to v6.3:

https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless.git/commit/?id=cf5fa3ca0552f1b7ba8490de40700bbfb6979b17
Comment 14 Kalle Valo 2023-07-26 08:16:06 UTC
For the actual hibernation support in ath11k here's a proof of concept implementation:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/log/?h=ath11k-hibernation-support
Comment 15 Takashi Iwai 2023-07-26 13:13:48 UTC
(In reply to Kalle Valo from comment #14)
> For the actual hibernation support in ath11k here's a proof of concept
> implementation:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/log/?h=ath11k-
> hibernation-support

Thanks! I'll give it a try later.

But the patches seem to miss updating the mhi_power_down() calls in drivers/accel/qaic/mhi_controller.c?
Comment 16 Takashi Iwai 2023-07-26 15:43:01 UTC
The initial test with those patches looks good.  The hibernation worked on Thinkpad T14s Gen 3, although a few errors from ath11k were spewed at resume time.

It'd be great if those fixes could go into 6.6, at least.
Comment 17 Baochen Qiang 2023-07-27 01:36:42 UTC
(In reply to Takashi Iwai from comment #16)
> The initial test with those patches look good.  The hibernation worked on
> Thinkpad T14s Gen 3, while there have been a few errors from ath11k spewed
> at the resume time.
> 
> It'd be great if those fixes can go for 6.6, at least.

Thanks for the test. Could you share kernel logs of ath11k so that we can check those resume errors?
Comment 18 Kalle Valo 2023-07-27 06:18:57 UTC
> But the patches seem missing to modify mhi_power_down() calls in
> drivers/accel/qaic/mhi_controller.c?
Yeah, the 0-day bot also reported that. I'll fix that later.
Comment 19 Kalle Valo 2023-07-27 06:21:38 UTC
> It'd be great if those fixes can go for 6.6, at least.

I would want the same but this depends on the MHI subsystem.
Comment 20 Takashi Iwai 2023-07-27 08:22:05 UTC
(In reply to Baochen Qiang from comment #17)
> (In reply to Takashi Iwai from comment #16)
> > The initial test with those patches look good.  The hibernation worked on
> > Thinkpad T14s Gen 3, while there have been a few errors from ath11k spewed
> > at the resume time.
> > 
> > It'd be great if those fixes can go for 6.6, at least.
> 
> Thanks for the test. Could you share kernel logs of ath11k so that we can
> check those resume errors?

At S4 resume, starting fine with:
[   42.214532] ath11k_pci 0000:01:00.0: chip_id 0x12 chip_family 0xb board_id 0xff soc_id 0x400c1211
[   42.214542] ath11k_pci 0000:01:00.0: fw_version 0x110b196e fw_build_timestamp 2022-12-22 12:54 fw_build_id WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.23

then triggering this:
[   42.285809] ath11k_pci 0000:01:00.0: Last interrupt received for each CE:
[   42.285817] ath11k_pci 0000:01:00.0: CE_id 0 pipe_num 0 35884ms before
[   42.285823] ath11k_pci 0000:01:00.0: CE_id 1 pipe_num 1 4524ms before
....

I attach the full dmesg output below.

Note that it's a 6.4.6 kernel with backports of the mhi patches.
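
To pull the per-CE figures out of a log like the one above, something along these lines could be used. It is a Python sketch only: the regular expression is inferred from the two "CE_id" lines quoted in this comment, so other ath11k log formats may not match, and the function name is made up for illustration.

```python
import re
from typing import Dict

# Pattern inferred from the quoted lines, e.g.
# "CE_id 0 pipe_num 0 35884ms before"
CE_LINE = re.compile(r"CE_id (\d+) pipe_num (\d+) (\d+)ms before")


def last_ce_interrupt_ages(dmesg_text: str) -> Dict[int, int]:
    """Map CE id -> milliseconds since its last interrupt, as logged."""
    ages: Dict[int, int] = {}
    for line in dmesg_text.splitlines():
        match = CE_LINE.search(line)
        if match:
            ages[int(match.group(1))] = int(match.group(3))
    return ages


if __name__ == "__main__":
    sample = (
        "[   42.285817] ath11k_pci 0000:01:00.0: CE_id 0 pipe_num 0 35884ms before\n"
        "[   42.285823] ath11k_pci 0000:01:00.0: CE_id 1 pipe_num 1 4524ms before\n"
    )
    for ce_id, age_ms in last_ce_interrupt_ages(sample).items():
        print(f"CE {ce_id}: last interrupt {age_ms} ms ago")
```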
Comment 21 Takashi Iwai 2023-07-27 08:22:41 UTC
Created attachment 304706 [details]
dmeseg output after s4 indicating ath11k errors
Comment 22 Baochen Qiang 2023-07-27 08:30:11 UTC
(In reply to Takashi Iwai from comment #20)
> (In reply to Baochen Qiang from comment #17)
> > (In reply to Takashi Iwai from comment #16)
> > > The initial test with those patches look good.  The hibernation worked on
> > > Thinkpad T14s Gen 3, while there have been a few errors from ath11k spewed
> > > at the resume time.
> > > 
> > > It'd be great if those fixes can go for 6.6, at least.
> > 
> > Thanks for the test. Could you share kernel logs of ath11k so that we can
> > check those resume errors?
> 
> At S4 resume, starting fine with:
> [   42.214532] ath11k_pci 0000:01:00.0: chip_id 0x12 chip_family 0xb
> board_id 0xff soc_id 0x400c1211
> [   42.214542] ath11k_pci 0000:01:00.0: fw_version 0x110b196e
> fw_build_timestamp 2022-12-22 12:54 fw_build_id
> WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.23
> 
> then triggering this:
> [   42.285809] ath11k_pci 0000:01:00.0: Last interrupt received for each CE:
> [   42.285817] ath11k_pci 0000:01:00.0: CE_id 0 pipe_num 0 35884ms before
> [   42.285823] ath11k_pci 0000:01:00.0: CE_id 1 pipe_num 1 4524ms before
> ....

I checked the attached dmesg log; ath11k is working well. We will refine this later to avoid the misleading logs above. Thanks.



> 
> I attach the full dmesg output below.
> 
> Note that it's 6.4.6 kernel with backports of the mhi patches.
Comment 23 Kalle Valo 2023-07-28 08:15:58 UTC
I now updated the branch:

ath11k-hibernation-support-202307280814

* rebase to 6.5.0-rc3-wt-ath+

* fix checkpatch warnings

* add new function mhi_power_down_no_destroy() and keep
  mhi_power_down() as is, this should fix the compilation error
Comment 24 Takashi Iwai 2023-07-28 11:18:22 UTC
Thanks, I'll try it later.

At a quick glance, the comment for mhi_power_down() still has a stale destroy_device, which should be dropped.
Comment 25 Kalle Valo 2023-07-28 13:55:15 UTC
(In reply to Takashi Iwai from comment #24)
> Thanks, I'll try it later.

I don't think you should spend time testing this version; there aren't that many changes. I only tried to update the code to the top of our tree and fix the compilation errors. I'll let you know once it makes sense to test again.

> Through a quick glance, the comment for mhi_power_down() still has a stale
> destroy_device, and it should be dropped.

Thanks, I'll take a look.
Comment 26 David Korth 2023-10-07 00:15:31 UTC
Created attachment 305199 [details]
ath11k dmesg on resume (ThinkPad T14 Gen 3 AMD)

I tested the ath11k-hibernation-support-202309121433 patches on top of kernel 6.5.6 and it appears to work here. Before, I had the same ath11k driver lockup after resuming from hibernation that was mentioned earlier in this bug.

There's a bunch of debug info printed on resume; I don't think it's too important, but I included it as an attachment.

System is ThinkPad T14 Gen 3 AMD
- CPU: AMD Ryzen 6850U
- WLAN: wcn6855 hw2.1 (17cb:1103, subsys 17aa:9309)
Comment 27 Kishor 2023-11-04 10:30:24 UTC
I have tested the patch on a Thinkpad T14 Gen 4 AMD. Everything is working fine with kernel 6.6. When can we expect these changes to be included in the mainline kernel?
Comment 28 Kalle Valo 2023-11-09 15:52:43 UTC
I updated the branch and I'm planning to submit this version for public review soon. There should not be any warnings anymore. Testing results very welcome, both positive and negative!

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/log/?h=ath11k-hibernation-support

ath11k-hibernation-support-202311091548

* rebase to ath-202310310746 (6.6.0-wt-ath+)

* remove PCI state save/restore (from Baochen's patch "wifi: ath11k:
  no need to save/restore PCI state")

* add patch 'wifi: ath11k: fix warning on DMA ring capabilities event'

* add patch 'wifi: ath11k: do not dump SRNG statistics during resume'

* replace patch 'wifi: ath11k: handle thermal device registration
  together with MAC' with patch 'wifi: ath11k: thermal: don't try to
  register multiple times'

* improve commit logs

* write MHI API documentation

* run more tests

* handle FIXME in ath11k_mhi_stop()

* patch 2: __mhi_prepare_for_transfer_autoqueue(): simplify if statements
Comment 29 Takashi Iwai 2023-11-10 09:59:56 UTC
I did quick tests and the new patches seem to be working fine.
Feel free to take my tested-by tags
  Tested-by: Takashi Iwai <tiwai@suse.de>

Thanks!
Comment 30 Kalle Valo 2023-11-10 10:14:27 UTC
> I did quick tests and the new patches seem working fine.
> Feel free to take my tested-by tags
Perfect timing, I was just about to submit the patches for public review :)

Thank you!
Comment 31 Kalle Valo 2023-11-10 10:26:09 UTC
New update, only cosmetic changes this time based on Jeff's and Baochen's review:

ath11k-hibernation-support-202311101016

* white space fix

* spelling fixes

* 'wifi: ath11k: support hibernation': don't mention about saving PCI
  states, we don't do that anymore

* add Takashi's Tested-by

I now posted the patches for public review:

https://lore.kernel.org/ath11k/20231110102202.3168243-1-kvalo@kernel.org/
Comment 32 Kalle Valo 2023-11-27 16:58:16 UTC
New update:

ath11k-hibernation-support-202311271607

* rebase to ath-202311221826 (6.7.0-rc2-wt-ath+)

* 'bus: mhi: host: add mhi_power_down_no_destroy()': fix null state string for DEV_ST_TRANSITION_DISABLE_DESTROY_DEVICE
  
* 'bus: mhi: host: add new interfaces to handle MHI channels
  directly': fix typos in comments

* 'bus: mhi: host: add new interfaces to handle MHI channels directly': honour initial autoqueue configuration

* 'bus: mhi: host: add new interfaces to handle MHI channels
   directly': don't prepare/unprepare MHI devices that don't match
   with a MHI client driver

* 'wifi: ath11k: remove MHI LOOPBACK channels': remove LOOPBACK channels for QCN9074 as well

Submitted as v2:

https://lore.kernel.org/mhi/20231127162022.518834-1-kvalo@kernel.org/
Comment 33 Kalle Valo 2023-12-11 18:00:44 UTC
A short status update, the MHI maintainer doesn't like the patchset:

https://lore.kernel.org/ath11k/20231130054250.GC3043@thinkpad/
Comment 34 Kalle Valo 2024-04-09 11:57:23 UTC
Finally a solution was found and ath11k hibernation support has been committed:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=ath-next&id=813e0ae613d6ee1b3e11f1c41f8b9e9df8ef0493
https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=ath-next&id=e0cd1185900e638d41d9cccb4c259051e05f69e9
https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=ath-next&id=166a490f59ac10340ee5330e51c15188ce2a7f8f

If all goes well these commits should be in v6.10 released around July/August.

If someone wants to test this, these commits are in the master branch, tag ath-202404091148 (or later), of our ath.git tree:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/
Comment 35 Matej D 2024-08-13 15:21:04 UTC
It seems to be working. I have disabled all workarounds and wifi still works after sleep & hibernate with `6.10.2-arch1-1`.

Thanks!
Comment 36 Matej D 2024-09-19 10:00:18 UTC
It appears this has been reverted? https://github.com/torvalds/linux/commit/2f833e8948d6c88a3a257d4e426c9897b4907d5a
Comment 37 Takashi Iwai 2024-09-19 13:57:02 UTC
Indeed, let's reopen until a better fix lands upstream.