Bug 217900 - amd-sfh module causes reboot and shutdown hangs randomly on hp aero 13
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Input Devices
Hardware: AMD Linux
Importance: P3 normal
Assignee: drivers_input-devices
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-11 09:32 UTC by Mehmet Mütin Türk
Modified: 2023-09-15 14:53 UTC
CC List: 4 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output when experiencing hangs (87.84 KB, text/plain)
2023-09-11 09:32 UTC, Mehmet Mütin Türk
dmesg output that leads to no hangs (75.33 KB, text/plain)
2023-09-11 09:33 UTC, Mehmet Mütin Türk
dmesg output that causes hang on kernel v6.3.1 (81.62 KB, text/plain)
2023-09-11 14:55 UTC, Mehmet Mütin Türk
possible patch (1.97 KB, patch)
2023-09-11 16:05 UTC, Mario Limonciello (AMD)
dmesg output leading to reboot/poweroff hang after patch (81.60 KB, text/plain)
2023-09-11 20:03 UTC, Mehmet Mütin Türk
dmesg after both patches applied (110.41 KB, text/plain)
2023-09-12 08:56 UTC, Mehmet Mütin Türk
dmesg linux v6.3 (80.94 KB, text/plain)
2023-09-12 15:57 UTC, Mehmet Mütin Türk
possible patch v0.2 (2.13 KB, patch)
2023-09-13 04:34 UTC, Mario Limonciello (AMD)
dmesg with patch v0.2 applied (87.59 KB, text/plain)
2023-09-13 06:50 UTC, Mehmet Mütin Türk

Description Mehmet Mütin Türk 2023-09-11 09:32:44 UTC
Created attachment 305083 [details]
dmesg output when experiencing hangs

My laptop (HP Aero 13-be0024t) hangs at reboot and poweroff, requiring a forced power-off (long-pressing the power button), whenever the attached dmesg output is generated. This seems to be random: sometimes dmesg shows no errors related to amd_sfh and I can reboot/poweroff cleanly. Blacklisting the amd_sfh module fixes the problem. The problem started with kernel 6.2.x and is still present in 6.5.2.

During shutdown/reboot console outputs:

"Failed to umount /oldroot..."
"kvm exiting virtualization..."

but cannot complete the process (waited for more than 1 hour).
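
For reference, the blacklisting workaround mentioned above could look roughly like the sketch below. It is not taken from the report: the helper name write_blacklist and the file name blacklist-amd-sfh.conf are illustrative assumptions, and /etc/modprobe.d is assumed to be the directory modprobe reads on the distribution in use. Only the "blacklist <module>" directive itself is standard modprobe configuration.

```shell
# write_blacklist DIR
# Writes a modprobe configuration file into DIR that stops the amd_sfh
# module from being loaded automatically. The helper name and the file
# name are hypothetical; pick any *.conf name you like.
write_blacklist() {
    dir=$1
    printf 'blacklist amd_sfh\n' > "$dir/blacklist-amd-sfh.conf"
}

# Typical use, run as root:
#   write_blacklist /etc/modprobe.d
# After the next boot, `lsmod | grep amd_sfh` should print nothing.
```

If the module is also packed into the initramfs, the initramfs may need to be regenerated before the blacklist takes effect. Note that this only hides the problem and disables the sensor hub; it is a workaround, not a fix.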
Comment 1 Mehmet Mütin Türk 2023-09-11 09:33:49 UTC
Created attachment 305084 [details]
dmesg output that leads to no hangs
Comment 2 Bagas Sanjaya 2023-09-11 12:42:17 UTC
(In reply to Mehmet from comment #1)
> Created attachment 305084 [details]
> dmesg output that leads to no hangs

I don't see any backtraces or other hanging clues in that 6.5.2 dmesg.
Do you mean that this bug has been fixed there?
Comment 3 Bagas Sanjaya 2023-09-11 12:42:52 UTC
(In reply to Mehmet from comment #0)
> Created attachment 305083 [details]
> dmesg output when experiencing hangs
> 
> My laptop (hp aero 13-be0024t) hangs at reboot and poweroff requiring
> physical poweroffs (long pressing the power button) when attached dmesg
> output is generated. But this seems to be random as sometimes I have a dmesg
> with no errors related to amd_sfh and I can cleanly reboot/poweroff.
> Blacklisting amd_sfh module fixes the problem. This problem started with
> kernel 6.2.x and still present in 6.5.2.
> 
> During shutdown/reboot console outputs:
> 
> "Failed to umount /oldroot..."
> "kvm exiting virtualization..."
> 
> but cannot complete the process (waited for more than 1 hour).

Does v6.1.y stable series not have this issue?
Comment 4 Mehmet Mütin Türk 2023-09-11 13:25:45 UTC
Both dmesgs are from the same machine using the same 6.5.2 kernel. As I said, it's random: sometimes dmesg shows errors, other times it does not. You should check the "dmesg_output" file for errors, not the "dmesg" file.

I'm sure I didn't have this problem on kernel v6.1.0. But I haven't tested every single minor version of v6.1 such as v6.1.52.
Comment 5 Bagas Sanjaya 2023-09-11 13:27:25 UTC
On 11/09/2023 20:25, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217900
> 
> --- Comment #4 from Mehmet (mehmetmutinturk@gmail.com) ---
> Both dmesgs are from the same machine using the same 6.5.2 kernel. As I said
> it's random, sometimes dmesg outputs errors other times does not. You should
> check "dmesg_output" file for errors not "dmesg" file.
> 
> I'm sure I didn't have this problem on kernel v6.1.0. But I haven't tested
> every single minor version of v6.1 such as v6.1.52.
> 

Then try v6.1 (mainline, not stable).
Comment 6 Bagas Sanjaya 2023-09-11 13:28:56 UTC
On 11/09/2023 20:27, Bagas Sanjaya wrote:
> On 11/09/2023 20:25, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=217900
>>
>> --- Comment #4 from Mehmet (mehmetmutinturk@gmail.com) ---
>> Both dmesgs are from the same machine using the same 6.5.2 kernel. As I said
>> it's random, sometimes dmesg outputs errors other times does not. You should
>> check "dmesg_output" file for errors not "dmesg" file.
>>
>> I'm sure I didn't have this problem on kernel v6.1.0. But I haven't tested
>> every single minor version of v6.1 such as v6.1.52.
>>
> 
> Then try v6.1 (mainline, not stable).
> 

Oops, I didn't see that you had already tried that version. Sorry for the inconvenience.
Comment 7 Mehmet Mütin Türk 2023-09-11 13:32:46 UTC
When dmesg outputs errors, I cannot reboot/poweroff cleanly. When it does not output any errors, I can cleanly reboot/poweroff. And this seems to be random (about half of the boots).
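
A quick way to tell the two kinds of boot apart is to pull the amd_sfh lines out of a saved log. The sketch below is only an illustration: the helper name sfh_messages and the file names in the usage comment are assumptions, and it lists every line that mentions the driver, so the output still has to be read to see which lines are actual errors.

```shell
# sfh_messages FILE
# Prints the lines of a saved dmesg log that mention the amd_sfh driver,
# matching "amd_sfh" or "amd-sfh" in any letter case.
sfh_messages() {
    grep -iE 'amd[_-]sfh' "$1"
}

# Example, comparing a log from a boot that hung with one that did not
# (file names are placeholders):
#   sudo dmesg > dmesg_hang.txt
#   sfh_messages dmesg_hang.txt
#   sfh_messages dmesg_clean.txt
```

grep returns a non-zero exit status when nothing matches, which is the expected result for a log with no amd_sfh messages at all.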
Comment 8 Bagas Sanjaya 2023-09-11 13:33:42 UTC
(In reply to Mehmet from comment #4)
> Both dmesgs are from the same machine using the same 6.5.2 kernel. As I said
> it's random, sometimes dmesg outputs errors other times does not. You should
> check "dmesg_output" file for errors not "dmesg" file.
> 
> I'm sure I didn't have this problem on kernel v6.1.0. But I haven't tested
> every single minor version of v6.1 such as v6.1.52.

Can you check current mainline (v6.6-rc1)?
Comment 9 Bagas Sanjaya 2023-09-11 13:37:41 UTC
Last but not least, please do bisection (see Documentation/admin-guide/bug-bisect.rst for how to do that).
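
For anyone who has not bisected before, the core of the procedure can be sketched as below. This is a rough outline, not a replacement for the document named above: the helper name start_bisect and the ~/linux path are assumptions, and the v6.1/v6.5 tags in the usage comment are only a guess based on the versions reported so far in this bug (no problem on v6.1.0, problem on 6.5.2).

```shell
# start_bisect REPO GOOD BAD
# Begins a bisection in an existing git checkout at REPO, marking BAD as a
# known-broken revision and GOOD as a known-working one. git then checks
# out a revision roughly halfway between the two for testing.
start_bisect() {
    git -C "$1" bisect start "$3" "$2"
}

# Example (path and tags are assumptions):
#   start_bisect ~/linux v6.1 v6.5
# Build and boot the checked-out kernel, then report the result:
#   git -C ~/linux bisect good     # reboot/poweroff works
#   git -C ~/linux bisect bad      # reboot/poweroff hangs
# Repeat until git names the first bad commit, then clean up with:
#   git -C ~/linux bisect reset
```

Because the hang reportedly shows up on only about half of the boots, a revision should probably be booted several times before it is marked good; a single clean shutdown is not enough evidence.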
Comment 10 Mehmet Mütin Türk 2023-09-11 14:51:41 UTC
I used Arch Linux's archived repositories to test old kernels. There were no issues up to kernel v6.2.13, but I encountered the issue on kernel v6.3.1 and every kernel after that. Arch Linux's archive did not have any kernel versions between v6.2.13 and v6.3.1.

I have never compiled anything more than a few simple projects, and I have no idea how to use git or how to bisect. But if I can figure that out, I will provide more information.
Comment 11 Mehmet Mütin Türk 2023-09-11 14:55:16 UTC
Created attachment 305088 [details]
dmesg output that causes hang on kernel v6.3.1

This is the dmesg output on linux v6.3.1 that caused the hang.
Comment 12 Mario Limonciello (AMD) 2023-09-11 16:05:55 UTC
Created attachment 305089 [details]
possible patch

From what you describe, it sounds like list corruption caused by a race. Can you try the attached patch to see if it fixes the problem reliably?
Comment 13 Mehmet Mütin Türk 2023-09-11 19:52:19 UTC
I've built kernel v6.5.2 with your patch, but unfortunately it didn't make any difference. I'm still getting errors in dmesg and getting stuck at reboot/poweroff.
Comment 14 Mario Limonciello (AMD) 2023-09-11 19:53:18 UTC
Can you share the new log?  I want to make sure it's exactly the same.
Comment 15 Mehmet Mütin Türk 2023-09-11 20:03:16 UTC
Created attachment 305090 [details]
dmesg output leading to reboot/poweroff hang after patch

I've attached the new dmesg log.
Comment 16 Mario Limonciello (AMD) 2023-09-11 20:10:29 UTC
Can you please use that existing patch as well as this one together?

https://lore.kernel.org/linux-input/20230620200117.22261-1-mario.limonciello@amd.com/T/#u
Comment 17 Mehmet Mütin Türk 2023-09-12 08:56:33 UTC
Created attachment 305094 [details]
dmesg after both patches applied

I've applied both patches. On the first boot I experienced a system freeze after a login attempt, with the following errors:

[   31.485395] watchdog: Watchdog detected hard LOCKUP on cpu 7
[   66.777190] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   66.777240] rcu:     7-...0: (1 GPs behind) idle=25dc/1/0x4000000000000000 softirq=713/714 fqs=5370
[   66.777283] rcu:     (detected by 3, t=18002 jiffies, g=-343, q=1170 ncpus=12)
[   96.986960] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 7-...D } 18425 jiffies s: 217 root: 0x80/.
[   96.987081] rcu: blocking rcu_node structures (internal RCU debug):
login: timed out after 60 seconds

I had to do a forced shutdown (holding the power button). After powering the laptop back on and logging in, dmesg showed the errors again, and I had to do another forced shutdown. I've attached the dmesg output to this post.
Comment 18 Mehmet Mütin Türk 2023-09-12 10:03:40 UTC
There are 7 amd_sfh commits in v6.2.15 and 2 amd_sfh commits in v6.3. I'm currently building v6.2.15 and will see how that works. I'm guessing one or more of these 9 commits caused the regression.
Comment 19 Mehmet Mütin Türk 2023-09-12 15:57:32 UTC
Created attachment 305095 [details]
dmesg linux v6.3

I've tested Linux 6.2.15, 6.2.16, 6.3 and 6.3.1 (all built from source). The problem seems to appear in Linux v6.3. I've attached the dmesg output from Linux v6.3.
Comment 20 Mario Limonciello (AMD) 2023-09-12 17:00:08 UTC
The changes to sfh in 6.3 seem unlikely to cause this; it might be because of other kernel changes.  The failure seems like it's caused by other parts of the kernel racing with the driver initialization to me.

I had really expected the second patch to help. I'll think some more about how this can happen and come back with some different ideas later.
Comment 21 Mario Limonciello (AMD) 2023-09-13 04:34:04 UTC
Created attachment 305096 [details]
possible patch v0.2

Here's an alternative idea I have for this issue.  The theory here is that there is a race for accessing the linked list before it's been set up.

Can you please apply just this patch, and if it still fails share the dmesg again?
Comment 22 Mehmet Mütin Türk 2023-09-13 06:50:09 UTC
Created attachment 305097 [details]
dmesg with patch v0.2 applied

Applied patch v0.2 to linux v6.5.2 and experienced the problem again. I've attached the dmesg output.
Comment 23 Tester47 2023-09-15 13:23:23 UTC
We have the same issue, but only with kernel 6.6-rc1. Shutdown hangs at the Lenovo image and requires a cold shutdown. A cold boot takes 2 minutes and restart stalls, but works with the magic key. Not sure if this is caused by amd-sfh. The commits made to linux drm-tip on Sept 12 are the source of the bug; the previous kernel (20230909) is OK. Hardware: Ideapad 3, Ryzen 5825U/Renoir. The same issue is also present in Arch linux-mainline (miffe repo).

dmesg for cold boot:
sudo dmesg | curl -F 'file=@-' 0x0.st
http://0x0.st/HO-4.txt

The commit database only covers the last 7 days. Hope this helps.

https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2023-09-12/CHANGES
Comment 24 Mario Limonciello (AMD) 2023-09-15 13:42:59 UTC
@Mehmet:

This looks slightly different; it's not the same list that has the problem. Can you post your kconfig? I'm not sure why I'm not seeing the same issue.

@Tester47:

Your issue is different; this is the fix for it:

https://lore.kernel.org/all/20230906084842.1922052-1-heikki.krogerus@linux.intel.com/
Comment 25 Mehmet Mütin Türk 2023-09-15 14:53:17 UTC
If by kconfig you mean my kernel .config file, I'm using the default config provided by Arch Linux. But I should mention that I'm having the same trouble on other Linux distributions updated to 6.3+.

Here's the config: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/6.5.2.arch1-1/config?ref_type=tags
