Bug 214179 - ath11k: wcn6855: suspend S3 broken
Summary: ath11k: wcn6855: suspend S3 broken
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P2 normal
Assignee: Kalle Valo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-08-26 03:25 UTC by Pengyu Ma
Modified: 2021-12-27 04:44 UTC
CC List: 10 users

See Also:
Kernel Version: 5.14-rc4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
mainline kernel 5.14-rc4 dmesg (167.57 KB, text/plain)
2021-08-26 03:25 UTC, Pengyu Ma
lspci (95.86 KB, text/plain)
2021-08-26 03:29 UTC, Pengyu Ma
attachment-2895-0.html (2.61 KB, text/html)
2021-12-10 07:48 UTC, Peter Zhang

Description Pengyu Ma 2021-08-26 03:25:45 UTC
Created attachment 298477
mainline kernel 5.14-rc4 dmesg

On resume, ath11k reports an error: MHI is in MHI_STATE_RESET rather than M3.

[   46.003087] ath11k_pci 0000:03:00.0: failed to set mhi state: RESUME(6)
[   46.003796] mhi-pci-generic 0000:05:00.0: failed to resume device: -22
[   46.003814] mhi-pci-generic 0000:05:00.0: device recovery started

Even with runtime suspend disabled, the same error appears.

The firmware is from https://github.com/kvalo/ath11k-firmware.git
Comment 1 Pengyu Ma 2021-08-26 03:29:34 UTC
Created attachment 298479
lspci
Comment 2 Kalle Valo 2021-11-20 17:22:27 UTC
This is with wcn6855:

[    6.243697] ath11k_pci 0000:03:00.0: wcn6855 hw2.0
[    7.327423] ath11k_pci 0000:03:00.0: chip_id 0x2 chip_family 0xb board_id 0xff soc_id 0x400c0200
[    7.327432] ath11k_pci 0000:03:00.0: fw_version 0x110d897f fw_build_timestamp 2021-05-27 19:36 fw_build_id QC_IMAGE_VERSION_STRING=WLAN.HSP.1.1-02431-QCAHSPSWPL_V1_V2_SILICONZ_LITE-1

Changing the title to reflect that.
Comment 3 Kalle Valo 2021-11-20 17:24:03 UTC
I also noticed a warning in dmesg:

[    7.315630] WARNING: CPU: 0 PID: 561 at mm/page_alloc.c:5366 __alloc_pages+0x2a9/0x320

Comes from:

[    7.315838]  ath11k_qmi_msg_mem_request_cb+0xe3/0x190 [ath11k]
Comment 4 mani 2021-11-20 17:35:19 UTC
See relevant mailing list discussion here: https://lore.kernel.org/all/20211118174145.GA31300@thinkpad/T/
Comment 5 Kalle Valo 2021-11-22 14:59:20 UTC
(In reply to mani from comment #4)
> See relevant mailing list discussion here:
> https://lore.kernel.org/all/20211118174145.GA31300@thinkpad/T/

There are two separate suspend bugs in ath11k at the moment. Even if I revert commit 020d3b26c07a ("bus: mhi: Early MHI resume failure in non M3 state") WCN6855 suspend still fails.

I filed bug #215103 for handling the MHI issue.
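
Going by its title, commit 020d3b26c07a makes MHI resume fail early when the device is not in M3. The following is a minimal Python model of that behaviour, not kernel code: the names `MhiState` and `pm_resume_strict` are hypothetical, and returning -EINVAL from the check is an assumption. It only illustrates why a device that comes back in MHI_STATE_RESET instead of M3 gets refused, yielding the same errno (-22) that appears in the mhi-pci-generic log line in the description.

```python
import errno
from enum import Enum


class MhiState(Enum):
    # Hypothetical subset of MHI device states, for illustration only.
    RESET = "RESET"
    READY = "READY"
    M0 = "M0"
    M3 = "M3"


def pm_resume_strict(device_state: MhiState) -> int:
    """Model of an early resume check: refuse to resume unless the device is in M3.

    Returns 0 on success or a negative errno, following kernel conventions.
    """
    if device_state is not MhiState.M3:
        # The device did not keep its suspended (M3) context, e.g. it lost
        # power and is back in RESET, so the strict check bails out early.
        return -errno.EINVAL
    # The normal M3 -> M0 transition would happen here.
    return 0


if __name__ == "__main__":
    print(pm_resume_strict(MhiState.M3))     # 0
    print(pm_resume_strict(MhiState.RESET))  # -22 on Linux (-EINVAL)
```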
Comment 6 Baochen Qiang 2021-11-23 06:55:51 UTC
(In reply to Kalle Valo from comment #3)
> I also noticed a warning in dmesg:
> 
> [    7.315630] WARNING: CPU: 0 PID: 561 at mm/page_alloc.c:5366
> __alloc_pages+0x2a9/0x320
> 
> Comes from:
> 
> [    7.315838]  ath11k_qmi_msg_mem_request_cb+0xe3/0x190 [ath11k]

There is already a patch for this warning:
https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/drivers/net/wireless/ath/ath11k?id=b9b5948cdd7bc8d9fa31c78cbbb04382c815587f
Comment 7 Baochen Qiang 2021-11-23 07:17:20 UTC
(In reply to Pengyu Ma from comment #0)
> Created attachment 298477
> mainline kernel 5.14-rc4 dmesg
> 
> On resume, ath11k reports an error: MHI is in MHI_STATE_RESET rather than M3.
> 
> [   46.003087] ath11k_pci 0000:03:00.0: failed to set mhi state: RESUME(6)
> [   46.003796] mhi-pci-generic 0000:05:00.0: failed to resume device: -22
> [   46.003814] mhi-pci-generic 0000:05:00.0: device recovery started
> 
> Even with runtime suspend disabled, the same error appears.
> 
> The firmware is from https://github.com/kvalo/ath11k-firmware.git

Hi Pengyu, how did you determine that MHI is in MHI_STATE_RESET? I could not find this information in the dmesg log.
Comment 8 Pengyu Ma 2021-11-23 12:14:29 UTC
Hi Baochen,

After adding extra debug output, I confirmed the RESET state.
It should be the same issue as the NUC suspend issue that Kalle and Mani discussed.
Comment 9 Pengyu Ma 2021-11-23 12:18:32 UTC
Hi Mani,

Please Cc me on any patch, so I can help verify it.
I think recovery work after power loss, like the one in the PCI MHI driver, is necessary.

The revert patch that ignores the MHI state is only a hack; after resume, ath11k's MHI won't work anymore.
Comment 10 Pengyu Ma 2021-12-06 09:58:46 UTC
Hi Mani:

Is there a proposed patch I can help verify?
Or is there a schedule for fixing this issue?

Please let me know if I can help.
Comment 11 mani 2021-12-06 11:27:50 UTC
(In reply to Pengyu Ma from comment #9)
> Hi Mani,
> 
> Please Cc me on any patch, so I can help verify it.
> I think recovery work after power loss, like the one in the PCI MHI driver, is necessary.
> 
> The revert patch that ignores the MHI state is only a hack; after resume,
> ath11k's MHI won't work anymore.

Oh, on the NUC, if we ignore the MHI state during resume (basically reverting 020d3b26c07a), it works just fine. Is it not working on your setup?
Comment 12 Pengyu Ma 2021-12-06 15:09:29 UTC
Hi Mani,

WLAN works after reverting the patch, but I think the MHI driver is not in the right state.

The revert also introduces another regression for the LTE module [1eac:1001]:
with the revert, the recovery work is delayed by 20 seconds.

So I think a proper recovery path, like the one used for the LTE module, is better.
Comment 13 mani 2021-12-06 15:39:13 UTC
(In reply to Pengyu Ma from comment #12)
> Hi Mani,
> 
> WLAN works after reverting the patch, but I think the MHI driver is not in
> the right state.
> 

Okay, thanks for the confirmation. I tried resetting the device in ath11k, but so far I haven't been able to bring it back to a working state afterward.

On PM resume failure:

1. An MHI reset causes ath11k to crash.
2. A SoC reset requires full reinitialization of the device.

So I'm going to apply the patch suggested by Loic that forces the resume regardless of the device's MHI state. I'll CC you so that you can test it too.
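
The approach described here, resuming regardless of the MHI state the device reports, can be sketched as a small Python model. This is an illustration of the control-flow change only, not the actual MHI API: `MhiState`, `pm_resume`, and its `force` flag are hypothetical names. The strict check stays the default, and a caller such as the WLAN driver opts into forced resume.

```python
import errno
from enum import Enum


class MhiState(Enum):
    # Hypothetical subset of MHI device states, for illustration only.
    RESET = "RESET"
    M0 = "M0"
    M3 = "M3"


def pm_resume(device_state: MhiState, force: bool = False) -> int:
    """Return 0 on success or a negative errno.

    force=False: refuse to resume unless the device reports M3.
    force=True:  skip the state check and trust the caller (e.g. a WLAN
                 driver) to bring the device back up on its own.
    """
    if not force and device_state is not MhiState.M3:
        return -errno.EINVAL
    # The M3 -> M0 transition (or a best-effort resume) would go here.
    return 0


if __name__ == "__main__":
    for state in (MhiState.M3, MhiState.RESET):
        print(state.name,
              "strict:", pm_resume(state),
              "forced:", pm_resume(state, force=True))
```

Keeping the strict check as the default matches the trade-off discussed in this thread: devices such as the LTE module still fail resume immediately and start their own recovery, while a WLAN driver that cannot easily recover after an MHI or SoC reset opts out of the check.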

> The revert also introduces another regression for the LTE module [1eac:1001]:
> with the revert, the recovery work is delayed by 20 seconds.
> 

Yeah, that patch is necessary for some LTE modules.

> So I think a proper recovery path, like the one used for the LTE module, is better.

Unfortunately, that's not straightforward with WLAN devices.
Comment 14 mani 2021-12-10 07:47:56 UTC
The patch was submitted and accepted into the char-misc-linus branch. It should be available in the next -rc release.

https://lore.kernel.org/lkml/20211209131633.4168-1-manivannan.sadhasivam@linaro.org/
Comment 15 Peter Zhang 2021-12-10 07:48:14 UTC
Created attachment 299979
attachment-2895-0.html

Out of office on Thu and Fri.
Please call me if it's urgent  : (+86) 181-1611-8005.

Thanks,
Peter
Comment 16 Pengyu Ma 2021-12-27 04:43:57 UTC
Hi all,

This bug is fixed by the following firmware:
https://github.com/kvalo/ath11k-firmware/tree/master/WCN6855/hw2.0/1.1/WLAN.HSP.1.1-03003-QCAHSPSWPL_V1_V2_SILICONZ_LITE-2

No other patch is needed.
