Created attachment 305083 [details] dmesg output when experiencing hangs My laptop (HP Aero 13-be0024t) hangs at reboot and poweroff, requiring a forced power-off (long-pressing the power button), whenever the attached dmesg output is generated. This seems to be random, though: sometimes dmesg shows no errors related to amd_sfh and I can reboot/poweroff cleanly. Blacklisting the amd_sfh module fixes the problem. The problem started with kernel 6.2.x and is still present in 6.5.2. During shutdown/reboot the console outputs: "Failed to umount /oldroot..." "kvm exiting virtualization..." but the process never completes (I waited for more than 1 hour).
Created attachment 305084 [details] dmesg output that leads to no hangs
(In reply to Mehmet from comment #1) > Created attachment 305084 [details] > dmesg output that leads to no hangs I don't see any backtraces or other hanging clues in that 6.5.2 dmesg. Do you mean that this bug has been fixed there?
(In reply to Mehmet from comment #0) > Created attachment 305083 [details] > dmesg output when experiencing hangs > > My laptop (hp aero 13-be0024t) hangs at reboot and poweroff requiring > physical poweroffs (long pressing the power button) when attached dmesg > output is generated. But this seems to be random as sometimes I have a dmesg > with no errors related to amd_sfh and I can cleanly reboot/poweroff. > Blacklisting amd_sfh module fixes the problem. This problem started with > kernel 6.2.x and still present in 6.5.2. > > During shutdown/reboot console outputs: > > "Failed to umount /oldroot..." > "kvm exiting virtualization..." > > but cannot complete the process (waited for more than 1 hour). Does v6.1.y stable series not have this issue?
Both dmesgs are from the same machine using the same 6.5.2 kernel. As I said, it's random: sometimes dmesg outputs errors, other times it does not. You should check the "dmesg_output" file for errors, not the "dmesg" file. I'm sure I didn't have this problem on kernel v6.1.0. But I haven't tested every single minor version of v6.1, such as v6.1.52.
On 11/09/2023 20:25, bugzilla-daemon@kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=217900 > > --- Comment #4 from Mehmet (mehmetmutinturk@gmail.com) --- > Both dmesgs are from the same machine using the same 6.5.2 kernel. As I said > it's random, sometimes dmesg outputs errors other times does not. You should > check "dmesg_output" file for errors not "dmesg" file. > > I'm sure I didn't have this problem on kernel v6.1.0. But I haven't tested > every single minor version of v6.1 such as v6.1.52. > Then try v6.1 (mainline, not stable).
On 11/09/2023 20:27, Bagas Sanjaya wrote: > On 11/09/2023 20:25, bugzilla-daemon@kernel.org wrote: >> https://bugzilla.kernel.org/show_bug.cgi?id=217900 >> >> --- Comment #4 from Mehmet (mehmetmutinturk@gmail.com) --- >> Both dmesgs are from the same machine using the same 6.5.2 kernel. As I said >> it's random, sometimes dmesg outputs errors other times does not. You should >> check "dmesg_output" file for errors not "dmesg" file. >> >> I'm sure I didn't have this problem on kernel v6.1.0. But I haven't tested >> every single minor version of v6.1 such as v6.1.52. >> > > Then try v6.1 (mainline, not stable). > Oops, I didn't see that you had already tried that version. Sorry for the inconvenience.
When dmesg outputs errors, I cannot reboot/poweroff cleanly. When it does not output any errors, I can cleanly reboot/poweroff. And this seems to be random (about half of the boots).
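Since the hang correlates with amd_sfh errors in dmesg, a saved log can be checked before attempting a reboot. This is only a minimal sketch: it assumes a bad boot contains a backtrace with at least one stack frame inside the module, which the kernel prints as `function+0xOFFSET/0xSIZE [module]`. The helper name `module_trace_frames` is made up for illustration, and the exact errors in the attachments may look different.

```python
def module_trace_frames(dmesg_text: str, module: str = "amd_sfh") -> list[str]:
    """Return dmesg lines that look like stack frames inside `module`.

    Assumption: a problematic boot logs a backtrace, and frames that
    belong to a loadable module end with "[module_name]".
    """
    tag = f"[{module}]"
    frames = []
    for line in dmesg_text.splitlines():
        # "+0x" marks a symbol+offset stack frame; the tag ties it to the module.
        if tag in line and "+0x" in line:
            frames.append(line.strip())
    return frames


def boot_is_suspect(dmesg_text: str, module: str = "amd_sfh") -> bool:
    """True if the log contains at least one backtrace frame from `module`."""
    return bool(module_trace_frames(dmesg_text, module))
```

Feeding it the text of a saved `dmesg` dump returns True only when a frame from the module appears in a trace; a plain "Modules linked in:" line that merely lists amd_sfh does not count.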
(In reply to Mehmet from comment #4) > Both dmesgs are from the same machine using the same 6.5.2 kernel. As I said > it's random, sometimes dmesg outputs errors other times does not. You should > check "dmesg_output" file for errors not "dmesg" file. > > I'm sure I didn't have this problem on kernel v6.1.0. But I haven't tested > every single minor version of v6.1 such as v6.1.52. Can you check current mainline (v6.6-rc1)?
Last but not least, please do bisection (see Documentation/admin-guide/bug-bisect.rst for how to do that).
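Because the hang only shows up on about half of the boots, a single clean boot is weak evidence that a bisection step is "good". Here is a rough sketch of the arithmetic, assuming boots are independent and a bad kernel fails with a fixed probability (0.5 is taken from the "about half of the boots" estimate; both assumptions may not hold). The function names are hypothetical.

```python
import math


def boots_needed(p_fail: float, max_false_good: float) -> int:
    """Clean boots required before calling a kernel 'good'.

    A bad kernel boots cleanly n times in a row with probability
    (1 - p_fail) ** n; return the smallest n that keeps this at or
    below max_false_good.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_fail))


def bisect_steps(n_commits: int) -> int:
    """Approximate number of kernels to build and test when bisecting."""
    return math.ceil(math.log2(n_commits)) if n_commits > 1 else 0
```

With p_fail=0.5 and a 1% tolerance for a wrong "good" verdict, `boots_needed(0.5, 0.01)` gives 7 clean boots per step. One failing boot is enough to mark a step "bad".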
I used Arch Linux's archived repositories to test old kernels. There were no issues up to kernel v6.2.13. But I encountered the issue on kernel v6.3.1 and every kernel after that. Arch Linux's archive did not have any kernel versions between v6.2.13 and v6.3.1. I have never compiled anything more than a few simple projects, and I have no idea how to use git or how to bisect. But if I can figure that out I will provide more information.
Created attachment 305088 [details] dmesg output that causes hang on kernel v6.3.1 This is the dmesg output on linux v6.3.1 that caused the hang.
Created attachment 305089 [details] possible patch From what you describe, it sounds like list corruption by a race. Can you have a try with the attached patch to see if this fixes it reliably?
I've built kernel v6.5.2 with your patch but unfortunately it didn't make any difference. Still getting errors on dmesg and getting stuck at reboot/poweroff.
Can you share the new log? I want to make sure it's exactly the same.
Created attachment 305090 [details] dmesg output leading to reboot/poweroff hang after patch I've attached the new dmesg log.
Can you please use that existing patch as well as this one together? https://lore.kernel.org/linux-input/20230620200117.22261-1-mario.limonciello@amd.com/T/#u
Created attachment 305094 [details] dmesg after both patches applied I've applied both patches. Upon first boot I experienced a system freeze after a login attempt, with the following errors:
[   31.485395] watchdog: Watchdog detected hard LOCKUP on cpu 7
[   66.777190] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   66.777240] rcu: 7-...0: (1 GPs behind) idle=25dc/1/0x4000000000000000 softirq=713/714 fqs=5370
[   66.777283] rcu: (detected by 3, t=18002 jiffies, g=-343, q=1170 ncpus=12)
[   96.986960] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 7-...D } 18425 jiffies s: 217 root: 0x80/.
[   96.987081] rcu: blocking rcu_node structures (internal RCU debug):
login: timed out after 60 seconds
I had to do a forced shutdown (holding the power button). After powering the laptop on and logging in, dmesg generated errors again, and I had to do a forced shutdown again. I've attached the dmesg output to this post.
There are 7 amd_sfh commits in v6.2.15 and 2 amd_sfh commits in v6.3. I'm currently building v6.2.15 and will see how that works. I'm guessing one or more of these 9 commits caused the regression.
Created attachment 305095 [details] dmesg linux v6.3 I've tested linux 6.2.15, 6.2.16, 6.3 and 6.3.1 (all built from source). The problem seems to first appear in linux v6.3. I've attached dmesg output from linux v6.3.
The changes to sfh in 6.3 seem unlikely to cause this; it might be because of other kernel changes. To me, the failure looks like it's caused by other parts of the kernel racing with the driver initialization. I had really expected the second patch to help. I'll think some more about how this can happen and come back with some different ideas later.
Created attachment 305096 [details] possible patch v0.2 Here's an alternative idea I have for this issue. The theory here is that there is a race for accessing the linked list before it's been set up. Can you please apply just this patch, and if it still fails share the dmesg again?
Created attachment 305097 [details] dmesg with patch v0.2 applied Applied patch v0.2 to linux v6.5.2 and experienced the problem again. I've attached the dmesg output.
We have the same issue, but only with kernel 6.6-rc1. Shutdown hangs at the Lenovo image and requires a forced power-off. A cold boot takes 2 minutes, and restart stalls but works with the magic key. Not sure if it is caused by amd-sfh. The commits made to linux drm-tip on Sept 12 are the source of the bug; the previous kernel (20230909) is OK. Ideapad 3 Ryzen 5825U/Renoir. The same issue is also present in Arch linux-mainline (miffe repo). dmesg for cold boot (uploaded with: sudo dmesg | curl -F 'file=@-' 0x0.st): http://0x0.st/HO-4.txt The commit list below covers the last 7 days only. Hope this will help. https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2023-09-12/CHANGES
@Mehmet: This looks slightly different, it's not the same list with the problem. Can you post your kconfig? I'm not sure why I'm not seeing the same issue. @Tester47: Your issue is different, this is the fix for it: https://lore.kernel.org/all/20230906084842.1922052-1-heikki.krogerus@linux.intel.com/
If by kconfig you mean my kernel .config file, I'm using the default config provided by Arch Linux. But I need to mention that I'm having the same trouble on other Linux distributions updated to 6.3+. Here's the config: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/6.5.2.arch1-1/config?ref_type=tags