Bug 217016 - rtw88 regression bisected: resume after suspend breaks, reboots instead
Summary: rtw88 regression bisected: resume after suspend breaks, reboots instead
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: AMD Linux
Importance: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-02-09 19:52 UTC by Paul Gover
Modified: 2023-06-27 02:06 UTC
CC List: 3 users

See Also:
Kernel Version: 6.1.1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
git bisect log (3.02 KB, text/plain)
2023-02-09 19:52 UTC, Paul Gover
Draft fix v1 (5.07 KB, patch)
2023-02-14 09:20 UTC, Ping-Ke Shih

Description Paul Gover 2023-02-09 19:52:08 UTC
Created attachment 303708
git bisect log

Suspend/Resume was working OK on kernel 6.0.13, broken ever since 6.1.1
(I've not tried kernels between those, except in the bisect)

All subsequent 6.1 kernels exhibit the same behaviour, viz:
Suspend works OK, but on Resume, there's a flicker, and then it reboots.
Sometimes the screen gets restored to its contents at the time of suspend, but less than a second later, it starts rebooting.
To reproduce, simply boot, suspend, and resume.

Subsequent dmesg include the following "Emergency" output:

x86/pm: family 0x15 cpu detected, MSR saving is needed during suspending.
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: System Fatal error.
[Hardware Error]: CPU:0 (15:70:0) MC4_STATUS[Over|UE|-|AddrV|PCC|-]: 0xf600000000070f0f
[Hardware Error]: Error Addr: 0x00000000f10011d4
[Hardware Error]: MC4 Error (node 2): Watchdog timeout due to lack of progress.
[Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)
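
For anyone reading the MC4_STATUS value by hand, here is a minimal sketch that decodes its top bits. It assumes the architectural x86 MCA status layout (bit 63 = valid, down to bit 57 = processor context corrupt). The dictionary keys mirror the Linux MCI_STATUS_* macro names, while the function name decode_mc_status is made up for illustration. It only checks that the raw value agrees with the flags the kernel already printed. It is not a diagnosis of the fault.

```python
# Top bits of an x86 machine-check status register (architectural layout).
# Keys mirror the Linux MCI_STATUS_* macros; decode_mc_status is a
# hypothetical helper written for this example only.
MCI_STATUS_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # a previous error was overwritten ("Over" in the log)
    "UC": 61,     # uncorrected error ("UE" in the log)
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid ("-" in the log means clear)
    "ADDRV": 58,  # error address is valid ("AddrV" in the log)
    "PCC": 57,    # processor context corrupt ("PCC" in the log)
}


def decode_mc_status(status: int) -> dict:
    """Return a {flag name: bool} map for the top status bits."""
    return {name: bool((status >> bit) & 1) for name, bit in MCI_STATUS_BITS.items()}


# Status value copied from the dmesg output above.
flags = decode_mc_status(0xF600000000070F0F)
for name, is_set in flags.items():
    print(f"{name:5} = {int(is_set)}")
```

With the value from the log, every flag except MISCV comes out set, which matches the printed [Over|UE|-|AddrV|PCC|-] summary.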

I'll send emails as per "Reporting issues" in the kernel documentation.
I'll attach dmesg and configs later.
Comment 1 Artem S. Tashkinov 2023-02-10 08:04:28 UTC
Please send this report as an email to:

gary.chang AT realtek.com
s.hauer AT pengutronix.de

Neither of them is subscribed to the kernel's Bugzilla.
Comment 2 Paul Gover 2023-02-10 18:10:28 UTC
I've tried kernel 6.2-rc7; the bug remains.

Kernel config for the commit identified by the bisection can be found here:
https://www.dropbox.com/s/nq95rwze856m39j/bisected.config.gz?dl=0

Boot log with said config and kernel:
https://www.dropbox.com/s/gpudjk0wnm9ufiw/bisected.boot.log.gz?dl=0

Both gzipped.
Comment 3 Paul Gover 2023-02-10 19:05:46 UTC
Doh!  The above boot log was for the bisect good kernel!

The following link is for a boot log with the bisect bad kernel:
https://www.dropbox.com/s/wbgltknrfyzb2ql/bisected.bad.log.gz?dl=0

My apologies.
Comment 4 Artem S. Tashkinov 2023-02-11 05:03:58 UTC
Please email the developers.
Comment 5 Paul Gover 2023-02-11 16:00:26 UTC
(In reply to Artem S. Tashkinov from comment #4)
> Please email the developers.

Did so yesterday.  Just putting the links here so people can find them.
Comment 6 Ping-Ke Shih 2023-02-13 06:53:39 UTC
I think I can reproduce it with the steps below:
1. ifconfig wlan0 up
2. iw wlan0 connect AP
3. use GUI to suspend
4. press space to resume
5. ifconfig wlan0 up
6. iw wlan0 connect AP <--- system stall

If I revert "wifi: rtw88: add flag check before enter or leave IPS", it does not stall at step 6.

I also tried 8822CE, but it works well with or without the patch. I think that is why we didn't find the issue before submitting.

Could you check whether this symptom is identical to yours?
Comment 7 Paul Gover 2023-02-13 13:37:16 UTC
My system uses wpa_supplicant, dhcpcd and acpi.  I'm not sure where "ifconfig" comes in that stack.

I tried stopping dhcpcd and then suspending, but got the same hardware error and reboot as described before.  However ...

I stopped wpa_supplicant and then suspended, and the system then resumed OK with wpa_supplicant still stopped.  When I restarted wpa_supplicant, the system crashed with the same hardware error and rebooted.

I think that says the symptoms are identical.

This was on kernel 6.1.11, which is what I currently use; I have others available if that's of use.
Comment 8 Ping-Ke Shih 2023-02-14 05:21:57 UTC
Thanks for the additional information. I think the symptoms are identical as well.

I will prepare a patch soon. Please give it a try.
Comment 9 Ping-Ke Shih 2023-02-14 09:20:19 UTC
Created attachment 303723
Draft fix v1
Comment 10 Ping-Ke Shih 2023-02-14 09:21:31 UTC
I have posted a draft patch (attachment 303723). Please give it a try.
Comment 11 Paul Gover 2023-02-14 15:24:49 UTC
That works for me.  I've tested on 6.2.0-rc7 compiled with both gcc and clang/LTO and 6.1.11 (which is what I'd normally use).  I also tested suspending while on mains power and resuming on batteries.  All good!

Thanks for the speedy resolution.
Comment 12 Paul Gover 2023-02-14 18:51:09 UTC
I see there's a new 6.1.12 stable kernel.
I've just tested that with the patch.
The patch works OK there too.
Comment 13 Ping-Ke Shih 2023-02-15 01:24:34 UTC
Thanks for the testing.

I will think more about this fix, and then submit the patch soon.
Comment 14 Ping-Ke Shih 2023-02-16 05:43:08 UTC
I have sent a patch upstream:

https://lore.kernel.org/linux-wireless/20230216053633.20366-1-pkshih@realtek.com/T/#u

The patch content is identical to the draft version I posted here.
Could you please post a Tested-by tag there?

Thank you.
Comment 15 Mario Limonciello (AMD) 2023-06-27 02:06:16 UTC
4a267bc5ea8f1 ("wifi: rtw88: use RTW_FLAG_POWERON flag to prevent to power on/off twice")
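
The commit title describes using a power-on flag so the device is not powered on or off twice. As a rough illustration of that idea only, here is a toy Python model. It is not the rtw88 code: the class ToyRadio, its methods and its counters are all hypothetical names invented for this sketch, and the real driver is written in C and does considerably more.

```python
class ToyRadio:
    """Hypothetical stand-in for a device; not taken from the rtw88 driver."""

    def __init__(self):
        self.powered_on = False      # plays the role of a power-on flag
        self.hw_power_on_calls = 0   # counts simulated hardware power-on sequences
        self.hw_power_off_calls = 0  # counts simulated hardware power-off sequences

    def power_on(self):
        if self.powered_on:
            return  # already on: skip the hardware sequence
        self.hw_power_on_calls += 1
        self.powered_on = True

    def power_off(self):
        if not self.powered_on:
            return  # already off: skip the hardware sequence
        self.hw_power_off_calls += 1
        self.powered_on = False


radio = ToyRadio()
radio.power_on()
radio.power_on()    # second request is ignored
radio.power_off()
radio.power_off()   # second request is ignored
print(radio.hw_power_on_calls, radio.hw_power_off_calls)
```

In this model, repeated requests in the same direction reach the simulated hardware only once, so two callers that both ask for power-on cannot run the sequence twice.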
