Bug 216438

Summary: System unusable after resuming from suspend on Ryzen 6000 Laptop (Lenovo Yoga Slim 7 ProX 14ARH7)
Product: Power Management Reporter: Sebastian S. (iam)
Component: Hibernation/Suspend Assignee: Mario Limonciello (AMD) (mario.limonciello)
Status: RESOLVED CODE_FIX    
Severity: high CC: ein4rth, iam, mario.limonciello, mpearson-lenovo
Priority: P1    
Hardware: AMD   
OS: Linux   
Kernel Version: 5.19.6 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg errors after resuming from suspend
dmesg output before entering suspend
lspci -vv
lsmod
lshw
dmesg output before entering suspend (only warnings+)
dmesg with iommu=pt (including entering and leaving suspend)
dmesg with iommu=pt (including entering and leaving suspend, only warnings+)

Description Sebastian S. 2022-09-01 18:31:56 UTC
Created attachment 301712 [details]
dmesg errors after resuming from suspend

After resuming from suspend, the system is in an unusable state. Several hardware components fail, in particular the SSD cannot write any more data and the wireless card is also unusable. Please see attached file "dmesg-after-suspend.txt" for all the errors popping up after resuming from suspend.

I could only collect the attached dmesg errors using netconsole with a USB Ethernet adapter, which also shows that not all hardware failed to resume from suspend. The system couldn't write the errors to the local log files because the SSD failed.

I tried this from a console with systemctl suspend, no GUI running. I can still type in the console that was open before entering suspend, but, e.g., systemctl poweroff only returns an "Input/Output error".

Note that I am running the vanilla Arch Linux kernel with a simple patch applied (which is already part of 5.20) to get the keyboard working [0]. The reported behavior happens with or without this patch, so I assume it is unrelated. I also tried several kernel versions (5.18.16, 5.19.4, and 5.19.6), which all show the same behavior. The most recent Ubuntu 22.04.1 LTS live system booted from a USB stick also hangs after suspend, so I assume it is not a regression.

If I have miscategorized this bug, please change it.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=216118
Comment 1 Sebastian S. 2022-09-01 18:32:58 UTC
Created attachment 301713 [details]
dmesg output before entering suspend
Comment 2 Sebastian S. 2022-09-01 18:33:21 UTC
Created attachment 301714 [details]
lspci -vv
Comment 3 Sebastian S. 2022-09-01 18:33:40 UTC
Created attachment 301715 [details]
lsmod
Comment 4 Sebastian S. 2022-09-01 18:33:58 UTC
Created attachment 301716 [details]
lshw
Comment 5 Sebastian S. 2022-09-01 18:38:11 UTC
Created attachment 301717 [details]
dmesg output before entering suspend (only warnings+)
Comment 6 Artem S. Tashkinov 2022-09-02 10:32:59 UTC
Possibly related: bug 216440
Comment 7 Mario Limonciello (AMD) 2022-09-02 12:04:47 UTC
Please have a try with 'iommu=pt' on your kernel command line.
Comment 8 Sebastian S. 2022-09-02 16:29:32 UTC
Thanks for the quick reply. Unfortunately, I will only be able to try it on Sept 13 at the earliest. Will give an update then.
Comment 9 Sebastian S. 2022-09-15 10:30:17 UTC
'iommu=pt' does indeed solve the main problems, thanks! Now, the system comes back quickly with a working SSD.

One new issue now came up: the built-in keyboard stops working after resuming from suspend. This wasn't the case without 'iommu=pt'. The relevant dmesg output is:

atkbd serio0: Failed to enable keyboard on isa0060/serio0

The wifi chip also still gets reset with dmesg showing

mt7921e 0000:02:00.0: Message 00020007 (seq 13) timeout
mt7921e 0000:02:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110
mt7921e 0000:02:00.0: PM: failed to resume async: error -110
mt7921e 0000:02:00.0: chip reset

But it still seems functional: NetworkManager successfully reestablishes the wifi connection.
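
As a reading aid for the mt7921e lines above: -110 is the Linux errno for ETIMEDOUT, which fits the "timeout" message that precedes it. A small Python sketch that pulls the code out of such PM messages. The errno table is hard-coded with Linux values so the result does not depend on the machine running the script, and decode_pm_error is a made-up helper name, not part of any tool.

```python
import re

# Linux errno values, hard-coded on purpose: other operating systems
# number ETIMEDOUT differently.
LINUX_ERRNO = {5: "EIO", 16: "EBUSY", 19: "ENODEV", 110: "ETIMEDOUT"}

# Matches "... returns -110" and "... error -110" as seen in the PM messages.
ERROR_CODE = re.compile(r"(?:returns|error) (-\d+)")


def decode_pm_error(line):
    """Return the symbolic errno for a dmesg line, or None if it has no code."""
    match = ERROR_CODE.search(line)
    if match is None:
        return None
    code = -int(match.group(1))
    return LINUX_ERRNO.get(code, f"errno {code}")


lines = [
    "mt7921e 0000:02:00.0: Message 00020007 (seq 13) timeout",
    "mt7921e 0000:02:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110",
    "mt7921e 0000:02:00.0: PM: failed to resume async: error -110",
    "mt7921e 0000:02:00.0: chip reset",
]
for line in lines:
    print(decode_pm_error(line), "|", line)
```

Lines without a numeric code (the first and last above) simply yield None.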

I will attach an updated full dmesg with 'iommu=pt' enabled. Note that after resuming from suspend, I attached the laptop to a USB-C monitor for power and an external keyboard, resulting in errors like

ACPI Error: No handler for Region [ECSI] (00000000e7945f01) [EmbeddedControl] (20220331/evregion-130)
ACPI Error: Region EmbeddedControl (ID=3) has no handler (20220331/exfldio-261)
ACPI Error: Aborting method \_SB.UBTC.ECRD due to previous error (AE_NOT_EXIST) (20220331/psparse-529)
ACPI Error: Aborting method \_SB.UBTC.NTFY due to previous error (AE_NOT_EXIST) (20220331/psparse-529)
ACPI Error: Aborting method \_SB.PCI0.LPC0.EC0._Q4F due to previous error (AE_NOT_EXIST) (20220331/psparse-529)

But that was also the case before and needs separate investigation.
Comment 10 Sebastian S. 2022-09-15 10:31:48 UTC
Created attachment 301814 [details]
dmesg with iommu=pt (including entering and leaving suspend)
Comment 11 Sebastian S. 2022-09-15 10:32:30 UTC
Created attachment 301815 [details]
dmesg with iommu=pt (including entering and leaving suspend, only warnings+)
Comment 12 Mario Limonciello (AMD) 2022-09-15 10:55:58 UTC
I suspect you might be hitting two separate bugs.  The necessity of iommu=pt to avoid the SSD issue is a pure BIOS bug that Lenovo should fix.  Something similar happened in earlier generations too.  It's worked around in the earlier generations by adding platforms to this list:
https://github.com/torvalds/linux/commit/455cd867b85b53fd3602345f9b8a8facc551adc9

To confirm there are two bugs can you please have a try with:

1) 6.0-rc5
2) This series applied: https://lore.kernel.org/linux-acpi/20220912172401.22301-1-mario.limonciello@amd.com/T/#t
3) The following on the kernel command line:

iommu=pt acpi.prefer_microsoft_guid=1
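
For anyone reproducing this test, a minimal Python sketch for confirming that both parameters actually reached the running kernel. The helper names (parse_cmdline, missing_params) and the sample boot line are made up for illustration; on a real system the input would be the contents of /proc/cmdline.

```python
import shlex

# Parameters requested in this comment.
WANTED = {"iommu": "pt", "acpi.prefer_microsoft_guid": "1"}


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    # shlex keeps quoted values such as dyndbg='file ... +p' together.
    for token in shlex.split(text):
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(text, wanted):
    """Return the wanted name=value pairs that are absent or set differently."""
    have = parse_cmdline(text)
    return [f"{name}={value}" for name, value in wanted.items()
            if have.get(name) != value]


# Hypothetical boot line, only for demonstration.
sample = "BOOT_IMAGE=/vmlinuz-linux root=/dev/nvme0n1p2 rw iommu=pt"
print(missing_params(sample, WANTED))
```

An empty list means both parameters are present with the expected values.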
Comment 13 Sebastian S. 2022-09-15 18:23:49 UTC
To 1) I can confirm that the keyboard works after resuming with 6.0-rc5. Also, the last line "mt7921e 0000:02:00.0: chip reset" is gone from the mt7921e errors. The first three are still there though.

To 3) 'acpi.prefer_microsoft_guid=1' doesn't fix the keyboard after resuming with kernel 5.19.6 (keyboard fix for ZEN devices applied, as mentioned before).

To 2) do you still need this info? If yes, applied to which kernel version? 5.19? 6.0-rc5?

Are these mt7921e errors another bug that should be reported?

And what's the procedure to get Lenovo to fix this BIOS bug? Are you reporting these bugs to Lenovo?


I observe a strange behavior on both, 5.19 and 6.0, when pressing any key on an external keyboard to resume the laptop from suspend. The laptop's keyboard backlights light up for half a second, then go dark again and the laptop stays suspended. If I now press any key on the laptop, it resumes from suspend.
Comment 14 Mario Limonciello (AMD) 2022-09-15 18:32:52 UTC
I actually had meant to test the combination of those 3 things, but now that you mention your results below let's break it down a little better.

> To 1) I can confirm that the keyboard works after resuming with 6.0-rc5.
> Also, the last line "mt7921e 0000:02:00.0: chip reset" is gone from the
> mt7921e errors. The first three are still there though.

OK so if the keyboard works after resuming in 6.0-rc5 with nothing beyond iommu=pt on kernel command line, it's probably fixed via https://github.com/torvalds/linux/commit/9946e39fe8d0a5da9eb947d8e40a7ef204ba016e 

> To 3) 'acpi.prefer_microsoft_guid=1' doesn't fix the keyboard after resuming
> with kernel 5.19.6 (keyboard fix for ZEN devices applied, as mentioned
> before).
> To 2) do you still need this info? If yes, applied to which kernel version?
> 5.19? 6.0-rc5?

Using 6.0-rc5 I'd like to see if applying that patch series and adding "acpi.prefer_microsoft_guid=1" improves anything.  Given the keyboard is working in 6.0-rc5 now I suspect it won't change anything for you though and thus there is no reason to add your system to the quirk list in that series.

> Are these mt7921e errors another bug that should be reported?

Yes; separate bug for that driver's maintainer to address.

> And what's the procedure to get Lenovo to fix this BIOS bug? Are you
> reporting these bugs to Lenovo?

The BIOS bug necessitating "iommu=pt" is pretty well understood by Lenovo.  It hit a TON of models, the fix just needs to be ported to yours.
You can report it to Lenovo's forums and see if they can do that for you.

> I observe a strange behavior on both, 5.19 and 6.0, when pressing any key on
> an external keyboard to resume the laptop from suspend. The laptop's keyboard
> backlights light up for half a second, then go dark again and the laptop
> stays suspended. If I now press any key on the laptop, it resumes from
> suspend.

Can you please contrast if this behavior still happens with that series I mentioned and kernel command line change?

If it does, I think we'll need a debugging log with the following to characterize it:

pm_debug_messages acpi.dyndbg='file drivers/acpi/x86/s2idle.c +p' pinctrl_amd.dyndbg='+p'
Comment 15 Mario Limonciello (AMD) 2022-09-15 19:34:43 UTC
BTW, *make sure* that last debugging step I mentioned is run in 6.0-rc kernels.  The reason is this commit addresses a related bug affecting wakeup from a USB keyboard.

https://github.com/torvalds/linux/commit/b8c824a869f220c6b46df724f85794349bafbf23
Comment 16 Mark Pearson 2022-09-15 19:40:30 UTC
Apologies if I've missed the obvious - but which platform is this on?
Thanks
Mark
Comment 17 Mario Limonciello (AMD) 2022-09-15 19:42:06 UTC
LENOVO 82TL (Lenovo Yoga Slim 7 ProX 14ARH7)
Comment 18 Mark Pearson 2022-09-15 19:48:03 UTC
Thanks (and argh....)
It's not in the Linux program so no promises - but I'll see if I can flag this to them as it is something that should be fixed. At least they can reference the fix from the Thinkpads...
Tracked with internal ticket LO-2023
Mark
Comment 19 Sebastian S. 2022-09-16 08:03:53 UTC
> OK so if the keyboard works after resuming in 6.0-rc5 with nothing beyond
> iommu=pt on kernel command line, it's probably fixed via
> https://github.com/torvalds/linux/commit/9946e39fe8d0a5da9eb947d8e40a7ef204ba016e

I have to retract my previous observation. Booting 6.0-rc5 another time, the keyboard issue was back: no keyboard on resume. Note also that I run 5.19.6 with this patch applied, where it also doesn't work. This patch only makes the keyboard work at all on initial boot.

(This time with 6.0-rc5, the backlight didn't briefly light up when I pressed a key on an external keyboard. Seems like either the keyboard works randomly, but then has this backlight flicker, or it doesn't work at all.)

> Using 6.0-rc5 I'd like to see if applying that patch series and adding
> "acpi.prefer_microsoft_guid=1" improves anything.  Given the keyboard is
> working in 6.0-rc5 now I suspect it won't change anything for you though and
> thus there is no reason to add your system to the quirk list in that series.

This time I can confirm that 6.0-rc5, with this patch series applied and 'acpi.prefer_microsoft_guid=1' solves all issues!
- keyboard works on resume
- even all mt7921e dmesg errors on resume are gone, so it's apparently related
- no short light-up of the laptop keyboard on an external keyboard press during suspend

So it looks like adding my model to the quirks list makes sense after all. Note that the only difference between Travis's laptop from #216473 and mine is that his has a touch screen.

> > I observe a strange behavior on both, 5.19 and 6.0, when pressing any key on
> > an external keyboard to resume the laptop from suspend. The laptop's keyboard
> > backlights light up for half a second, then go dark again and the laptop
> > stays suspended. If I now press any key on the laptop, it resumes from
> > suspend.
> 
> Can you please contrast if this behavior still happens with that series I
> mentioned and kernel command line change?
> 
> If it does, I think we'll need a debugging log with the following to
> characterize it:
> 
> pm_debug_messages acpi.dyndbg='file drivers/acpi/x86/s2idle.c +p'
> pinctrl_amd.dyndbg='+p'

Now that everything works with the patchset, do you still need this info?
Comment 20 Sebastian S. 2022-09-16 12:39:31 UTC
Just to confirm, running 6.0-rc5 with the patchset, but without 'acpi.prefer_microsoft_guid=1', still has all the issues (keyboard randomly works or doesn't on resume, mt7921e throwing errors).
Comment 21 Mario Limonciello (AMD) 2022-09-16 12:57:09 UTC
Can you please try to modify the last patch from this line:

DMI_MATCH(DMI_PRODUCT_NAME, "82V2"),

to

DMI_MATCH(DMI_PRODUCT_NAME, "82"),


Then remove acpi.prefer_microsoft_guid=1 from kernel command line.  That should hopefully make the patch series apply automatically to your system too.
If that works I'll revise it for a v3 to loop your system in too.
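
To illustrate why shortening the string should work: as far as I understand it, DMI_MATCH compares by substring (DMI_EXACT_MATCH is the strict variant), so "82" matches both product names mentioned in this thread. The Python below only models that behavior in user space; dmi_match is a made-up helper, not kernel code.

```python
def dmi_match(field_value: str, substr: str) -> bool:
    """Model substring semantics: true if substr occurs anywhere in the field."""
    return substr in field_value


# Product names mentioned in this thread: the one in the patch and the 82TL.
PRODUCT_NAMES = ["82V2", "82TL"]

for pattern in ("82V2", "82"):
    matched = [name for name in PRODUCT_NAMES if dmi_match(name, pattern)]
    print(pattern, "->", matched)
```

The flip side is that "82" also matches any other product name containing those two characters, so the quirk becomes correspondingly broader.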
Comment 22 Sebastian S. 2022-09-16 14:55:50 UTC
Can confirm that this works, thanks!
Comment 23 Mario Limonciello (AMD) 2022-09-16 18:48:17 UTC
Rolled that change into v3.
Comment 24 Mario Limonciello (AMD) 2022-09-26 13:35:39 UTC
The kernel solution has been queued up for 6.1 (https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=bleeding-edge&id=888ca9c7955e3969df84f5a1bda2143be9fa365a)