Bug 216616 - video/aperture patch [v2,07/11] in 6.0.3 causes CPU stalls and eventually causes a complete system freeze
Summary: video/aperture patch [v2,07/11] in 6.0.3 causes CPU stalls and eventually caus...
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: AMD Linux
Importance: P1 high
Assignee: other_other
URL: https://patchwork.freedesktop.org/pat...
Keywords:
Depends on:
Blocks:
 
Reported: 2022-10-22 14:25 UTC by Andreas
Modified: 2022-10-30 17:06 UTC

See Also:
Kernel Version: 6.0.3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (19.61 KB, application/x-xz)
2022-10-22 14:25 UTC, Andreas
my kernel .config for 6.0.3 (37.93 KB, application/x-xz)
2022-10-22 14:27 UTC, Andreas

Description Andreas 2022-10-22 14:25:32 UTC
Created attachment 303074
dmesg

6.0.2 works.

On 6.0.3 the system is very sluggish, with graphical glitches all over the place in the KDE Plasma desktop on X11 (no glitches under Wayland, but it is still sluggish). SDDM works fine.

Hardware: Lenovo Legion 5 Pro 16ACH6H: AMD Ryzen 7 5800H "Cezanne", hybrid graphics AMD "Green Sardine" (Vega 8 GCN 5.1, AMDGPU) and Nvidia GeForce RTX 3070 Mobile (GA104M, not working with nouveau; I'm not using the proprietary nvidia driver).
Comment 1 Andreas 2022-10-22 14:27:15 UTC
Created attachment 303075
my kernel .config for 6.0.3

Only CONFIG_HID_TOPRE was added in 6.0.3; otherwise it is identical to my .config for 6.0.2.
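
A comparison like the one described above can be scripted. The following is a minimal sketch, assuming two plain-text .config files; the function names and the sample fragments are hypothetical and are not taken from the attached configs. The kernel tree also ships scripts/diffconfig, which is the more complete tool for this job.

```python
def parse_kconfig(text):
    """Map each CONFIG_ symbol to its value; '# CONFIG_X is not set' becomes 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # Drop the leading "# " and the trailing " is not set".
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(old_text, new_text):
    """Return {symbol: (old_value, new_value)} for every symbol that differs."""
    old, new = parse_kconfig(old_text), parse_kconfig(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


# Made-up fragments for illustration only (not the attached configs).
old_cfg = "CONFIG_DRM_AMDGPU=y\nCONFIG_DRM_SIMPLEDRM=y\n"
new_cfg = (
    "CONFIG_DRM_AMDGPU=y\n"
    "CONFIG_DRM_SIMPLEDRM=y\n"
    "# CONFIG_HID_TOPRE is not set\n"
)
changes = diff_kconfig(old_cfg, new_cfg)
print(changes)
```

A symbol that is absent from one file shows up with None on that side, so a newly introduced option is easy to tell apart from one whose value changed.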
Comment 2 Andreas 2022-10-22 14:51:23 UTC
In /var/log/Xorg.0.log the only obvious difference is the last line:
---- snap
randr: falling back to unsynchronized pixmap sharing
---- snap
The line is present when I boot with 6.0.3, but isn't when I boot 6.0.2.

(Obviously this is when I log in to KDE with X11, not with Wayland, from SDDM.)
Comment 3 Andreas 2022-10-22 22:10:19 UTC
I did a git bisect on the stable kernels, with 6.0.3 as bad and 6.0.2 as good. This is the result:

cfecfc98a78d97a49807531b5b224459bda877de is the first bad commit
commit cfecfc98a78d97a49807531b5b224459bda877de (HEAD, refs/bisect/bad)
Author: Thomas Zimmermann <tzimmermann@suse.de>
Date:   Mon Jul 18 09:23:18 2022 +0200

    video/aperture: Disable and unregister sysfb devices via aperture helpers
    
    [ Upstream commit 5e01376124309b4dbd30d413f43c0d9c2f60edea ]
    
    Call sysfb_disable() before removing conflicting devices in aperture
    helpers. Fixes sysfb state if fbdev has been disabled.
    
    Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
    Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
    Fixes: fb84efa28a48 ("drm/aperture: Run fbdev removal before internal helpers")
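
For readers unfamiliar with bisection, here is a rough sketch of the search that git bisect performs between a known-good and a known-bad kernel. It is only an illustration: the commit list is hypothetical (placeholder names, apart from the abbreviated hash found above), and is_bad() stands in for "build, boot, and check for the glitches". It also assumes a linear history in which every commit after the first bad one is bad too.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the last one is known
    bad, and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1   # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid               # first bad commit is at mid or earlier
        else:
            lo = mid + 1           # first bad commit is after mid
    return commits[lo]


# Hypothetical history between the good and the bad release.
history = ["c1", "c2", "c3", "cfecfc98a78d", "c5", "c6"]
culprit_index = history.index("cfecfc98a78d")
tested = []


def is_bad(commit):
    tested.append(commit)          # each call represents one test boot
    return history.index(commit) >= culprit_index


result = first_bad(history, is_bad)
print(result, "found after", len(tested), "test boots")
```

Because the candidate range is halved on every step, the number of test boots grows only with the logarithm of the number of commits between the two releases.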
Comment 5 Andreas 2022-10-22 22:38:14 UTC
Okay, so I reverted v2-07-11-video-aperture-Disable-and-unregister-sysfb-devices-via-aperture-helpers.patch on stable 6.0.3 and the fault is gone.

I always logged out immediately, which worked (even though everything was very, very sluggish). Also, when I killed the X session within about 15 seconds, no error was shown (I used "systemctl stop sddm" from another virtual console).

Noteworthy: I once compiled a kernel from within the Plasma Desktop, while it was sluggish. The kernel compiled alright. When it was finished I moved the mouse to reboot, at which point it completely froze and I had to hard-reset the system.

While still running, after > 15 seconds, the fault looked like this (dmesg):
---- snap ----
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 7 jiffies s: 165 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x00000008
Call Trace:
 <TASK>
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 29 jiffies s: 165 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x00000008
Call Trace:
 <TASK>
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 8 jiffies s: 169 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x0000400e
Call Trace:
 <TASK>
 ? memcpy_toio+0x76/0xc0
 ? drm_fb_memcpy_toio+0x76/0xb0
 ? drm_fb_blit_toio+0x75/0x2b0
 ? simpledrm_simple_display_pipe_update+0x132/0x150
 ? drm_atomic_helper_commit_planes+0xb6/0x230
 ? drm_atomic_helper_commit_tail+0x44/0x80
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 30 jiffies s: 169 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x0000400e
Call Trace:
 <TASK>
 ? memcpy_toio+0x76/0xc0
 ? memcpy_toio+0x1b/0xc0
 ? drm_fb_memcpy_toio+0x76/0xb0
 ? drm_fb_blit_toio+0x75/0x2b0
 ? simpledrm_simple_display_pipe_update+0x132/0x150
 ? drm_atomic_helper_commit_planes+0xb6/0x230
 ? drm_atomic_helper_commit_tail+0x44/0x80
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 52 jiffies s: 169 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x0000400e
Call Trace:
 <TASK>
 ? memcpy_toio+0x76/0xc0
 ? memcpy_toio+0x1b/0xc0
 ? drm_fb_memcpy_toio+0x76/0xb0
 ? drm_fb_blit_toio+0x75/0x2b0
 ? simpledrm_simple_display_pipe_update+0x132/0x150
 ? drm_atomic_helper_commit_planes+0xb6/0x230
 ? drm_atomic_helper_commit_tail+0x44/0x80
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
traps: avahi-ml[4447] general protection fault ip:7fdde6a37bc1 sp:7fdde07fc920 error:0 in module-zeroconf-publish.so[7fdde6a37000+3000]
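
The stall reports above are repetitive, so it can help to reduce each one to the stalled CPU, the jiffies count, and the first frame of the trace. The following is a rough sketch, assuming the exact message format shown in this log with a single CPU inside the braces; the regular expressions and function name are hypothetical helpers, not part of any kernel tooling.

```python
import re

# Matches the header line of one stall report as printed in this log.
STALL_RE = re.compile(
    r"rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: "
    r"\{ (?P<cpu>\d+)-\S* \} (?P<jiffies>\d+) jiffies s: (?P<seq>\d+)"
)
# Matches a call-trace entry such as " ? memcpy_toio+0x76/0xc0".
FRAME_RE = re.compile(r"^\s*\? (?P<func>[\w.]+)\+0x")


def summarize_stalls(dmesg_text):
    """Return one dict per stall report: CPU, jiffies, sequence, first frame."""
    reports = []
    for line in dmesg_text.splitlines():
        stall = STALL_RE.search(line)
        if stall:
            reports.append({
                "cpu": int(stall["cpu"]),
                "jiffies": int(stall["jiffies"]),
                "seq": int(stall["seq"]),
                "top_frame": None,
            })
            continue
        frame = FRAME_RE.match(line)
        if frame and reports and reports[-1]["top_frame"] is None:
            reports[-1]["top_frame"] = frame["func"]
    return reports


# Two shortened reports copied from the log above.
sample = """\
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 7 jiffies s: 165 root: 0x2000/.
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 8 jiffies s: 169 root: 0x2000/.
 ? memcpy_toio+0x76/0xc0
 ? drm_fb_memcpy_toio+0x76/0xb0
"""
reports = summarize_stalls(sample)
for report in reports:
    print(report)
```

Frames prefixed with "?" are ones the unwinder could not verify, so the first frame is only a hint. Still, the later reports in the log all pass through simpledrm_simple_display_pipe_update, which suggests X was still drawing through simpledrm at that point.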
Comment 6 Mario Limonciello (AMD) 2022-10-24 13:50:17 UTC
To confirm whether it's a problem with the backport only (missing dependencies?) or a general problem, would it be possible to also check 6.1-rc1 or later, where that patch landed, to see if it is affected?

That would help decide whether it just needs to be reverted in stable or also needs to be fixed in 6.1.
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-10-24 13:55:01 UTC
Apparently it's an incomplete backport and the fix is already queued and part of 6.0.4-rc1: 
https://lore.kernel.org/stable/51651c2e-3706-37d7-01e7-5d473a412850@suse.de/
Comment 8 Andreas 2022-10-24 16:15:31 UTC
Thanks... In short: the additional patch did NOT fix the problem.

I don't use git and I don't know how to /cherry-pick commit/ 9d69ef183815, but I found the patch here: https://patchwork.freedesktop.org/patch/494609/

I hope that's the right one. I reintegrated v2-07-11-video-aperture-Disable-and-unregister-sysfb-devices-via-aperture-helpers.patch and also applied v2-04-11-fbdev-core-Remove-remove_conflicting_pci_framebuffers.patch, did a "make mrproper" and thereafter compiled a clean new 6.0.3 kernel (same .config).

Now the system doesn't even boot to a console. The first boot got me to an rcu_sched stall on CPUs/tasks, same as above, but this time with:
Workqueue: btrfs-cache btrfs_work_helper

I booted a second time with the same kernel, and it got stuck after mounting the root btrfs filesystem (it looked like a total freeze, but when it didn't show an rcu stall message after ~2 min I got impatient and wanted to see if I had just busted my root filesystem...)

I booted 6.0.2 and everything is fine. (I'm very glad! I definitely should update my backup right away!)

I will try 6.1-rc1 next, bear with me...
Comment 9 Andreas 2022-10-24 16:53:33 UTC
Just tested with 6.1-rc2 (tarball from kernel.org), which works.
Comment 10 Andreas 2022-10-25 19:03:55 UTC
Thanks to a quick patch by Thomas Zimmermann this is fixed and, I assume, will be so in 6.0.4 as well.

Patch for download (necessary only for 6.0.3):
https://lore.kernel.org/regressions/ef862938-3e1a-5138-2bda-d10e9188f920@suse.de/1.1.2-0001-video-aperture-Call-sysfb_disable-before-removing-PC.patch

For future reference, whole conversation:
https://lore.kernel.org/regressions/d6afe54b-f8d7-beb2-3609-186e566cbfac@gmx.net/T/#t
Comment 11 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-10-26 11:29:11 UTC
(In reply to Andreas from comment #10)
> Thanks to a quick patch by Thomas Zimmermann this is fixed and, I assume,
> will be so in 6.0.4 as well.

FWIW, it's not even in 6.0.5; Thomas' submission didn't meet the stable requirements.
Comment 12 Andreas 2022-10-26 13:00:02 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #11)
> FWIW, it's not even in 6.0.5; Thomas' submission didn't meet the stable
> requirements.

That's not good, since this is clearly breaking the kernel in some cases. It could well be that my kernel config is special in some way, but I don't see how. (Thinking not only of Gentoo Linux, but also Arch, where users tend to compile their own kernels...)

But I assume that Linux distributions will not have this issue, because even though they might enable simplefb in-kernel they'll likely have amdgpu (and nouveau etc.) compiled as modules. Hence, my case where amdgpu loads first and simplefb then messes the whole thing up (without the fix for 6.0.3) will not be possible - it's most likely the other way around (amdgpu loads after simplefb), which apparently works without problems.
But that's just me speculating.
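
The load-order guess above depends on whether amdgpu and the generic firmware-framebuffer drivers are built in (=y) or compiled as modules (=m). The following is a minimal sketch for reading that out of a plain-text .config. The function names and both sample fragments are made up for illustration; CONFIG_DRM_SIMPLEDRM is included because simpledrm appears in the call traces in comment 5, and treating "both built in" as the interesting case is an assumption, not a confirmed diagnosis.

```python
DRIVERS = (
    "CONFIG_DRM_AMDGPU",
    "CONFIG_DRM_SIMPLEDRM",
    "CONFIG_FB_SIMPLE",
    "CONFIG_SYSFB_SIMPLEFB",
)


def driver_linkage(config_text):
    """Return {symbol: 'built-in' | 'module' | 'disabled'} for DRIVERS."""
    state = {name: "disabled" for name in DRIVERS}
    for line in config_text.splitlines():
        # "# CONFIG_X is not set" has no "=", so it stays 'disabled'.
        name, sep, value = line.strip().partition("=")
        if sep and name in state:
            state[name] = {"y": "built-in", "m": "module"}.get(value, "disabled")
    return state


def both_built_in(state):
    """True if amdgpu and a generic framebuffer driver are both built in."""
    generic = (
        state["CONFIG_DRM_SIMPLEDRM"] == "built-in"
        or state["CONFIG_FB_SIMPLE"] == "built-in"
    )
    return state["CONFIG_DRM_AMDGPU"] == "built-in" and generic


# Hypothetical fragments: a self-compiled kernel and a distribution kernel.
custom_cfg = "CONFIG_DRM_AMDGPU=y\nCONFIG_DRM_SIMPLEDRM=y\nCONFIG_SYSFB_SIMPLEFB=y\n"
distro_cfg = "CONFIG_DRM_AMDGPU=m\nCONFIG_DRM_SIMPLEDRM=y\nCONFIG_SYSFB_SIMPLEFB=y\n"

for label, cfg in (("custom", custom_cfg), ("distro", distro_cfg)):
    state = driver_linkage(cfg)
    print(label, state, "both built in:", both_built_in(state))
```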

FYI, according to the changelog for 6.0.4 (https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.0.4) remove_conflicting_pci_framebuffers() has been removed. I did that too for 6.0.3, but it didn't help...

And, yes, there's no patch to solve this in 6.0.5 (https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.0.5).
Comment 13 Andreas 2022-10-26 16:02:10 UTC
The patch has been queued for 6.0, so looking good for the next 6.0 release (which will be 6.0.6).
Comment 14 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-10-26 16:49:33 UTC
(In reply to Andreas from comment #13)
> The patch has been queued for 6.0

Yup, it was just bad timing, should have mentioned that in my earlier comment, sorry.
Comment 15 Andreas 2022-10-30 17:06:13 UTC
Included in 6.0.6, thanks! (https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.0.6)
