Bug 216616
Summary: | video/aperture patch [v2,07/11] in 6.0.3 causes CPU stalls and eventually causes a complete system freeze | ||
---|---|---|---|
Product: | Drivers | Reporter: | Andreas (andreas.thalhammer) |
Component: | Video(DRI - non Intel) | Assignee: | other_other |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | mario.limonciello, regressions |
Priority: | P1 | ||
Hardware: | AMD | ||
OS: | Linux | ||
URL: | https://patchwork.freedesktop.org/patch/494608/ | ||
Kernel Version: | 6.0.3 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg
my kernel .config for 6.0.3 |
Created attachment 303075 [details]
my kernel .config for 6.0.3
Only CONFIG_HID_TOPRE was added in 6.0.3; otherwise it is identical to my .config for 6.0.2.
In /var/log/Xorg.0.log the only obvious difference is the last line:

---- snap
randr: falling back to unsynchronized pixmap sharing
---- snap

The line is present when I boot with 6.0.3, but not when I boot 6.0.2. (Obviously this is when I log in to KDE with X11, not with Wayland, from SDDM.)

I did a git bisect on the stable kernels, with 6.0.3 as bad and 6.0.2 as good. This is the result:

cfecfc98a78d97a49807531b5b224459bda877de is the first bad commit
commit cfecfc98a78d97a49807531b5b224459bda877de (HEAD, refs/bisect/bad)
Author: Thomas Zimmermann <tzimmermann@suse.de>
Date: Mon Jul 18 09:23:18 2022 +0200

    video/aperture: Disable and unregister sysfb devices via aperture helpers

    [ Upstream commit 5e01376124309b4dbd30d413f43c0d9c2f60edea ]

    Call sysfb_disable() before removing conflicting devices in aperture helpers. Fixes sysfb state if fbdev has been disabled.

    Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
    Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
    Fixes: fb84efa28a48 ("drm/aperture: Run fbdev removal before internal helpers")

Link to the suspect patch: https://patchwork.freedesktop.org/patch/msgid/20220718072322.8927-8-tzimmermann@suse.de (or https://patchwork.freedesktop.org/patch/494608/)

Okay, so I reverted v2-07-11-video-aperture-Disable-and-unregister-sysfb-devices-via-aperture-helpers.patch on stable 6.0.3 and the fault is gone.

I always logged out immediately, which worked (even though everything is very, very sluggish). Also, when I killed the X session within a couple of seconds (15 or so), no error was shown (I used "systemctl stop sddm" from another virtual console).

Noteworthy: I once compiled a kernel from within the Plasma Desktop while it was sluggish. The kernel compiled all right. When it was finished I moved the mouse to reboot, at which point the system froze completely and I had to hard-reset it.
While still running, after > 15 seconds, the fault looked like this (dmesg):

---- snap ----
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 7 jiffies s: 165 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X state:R running task stack: 0 pid: 4242 ppid: 4228 flags:0x00000008
Call Trace:
 <TASK>
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 29 jiffies s: 165 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X state:R running task stack: 0 pid: 4242 ppid: 4228 flags:0x00000008
Call Trace:
 <TASK>
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 8 jiffies s: 169 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X state:R running task stack: 0 pid: 4242 ppid: 4228 flags:0x0000400e
Call Trace:
 <TASK>
 ? memcpy_toio+0x76/0xc0
 ? drm_fb_memcpy_toio+0x76/0xb0
 ? drm_fb_blit_toio+0x75/0x2b0
 ? simpledrm_simple_display_pipe_update+0x132/0x150
 ? drm_atomic_helper_commit_planes+0xb6/0x230
 ? drm_atomic_helper_commit_tail+0x44/0x80
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 30 jiffies s: 169 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X state:R running task stack: 0 pid: 4242 ppid: 4228 flags:0x0000400e
Call Trace:
 <TASK>
 ? memcpy_toio+0x76/0xc0
 ? memcpy_toio+0x1b/0xc0
 ? drm_fb_memcpy_toio+0x76/0xb0
 ? drm_fb_blit_toio+0x75/0x2b0
 ? simpledrm_simple_display_pipe_update+0x132/0x150
 ? drm_atomic_helper_commit_planes+0xb6/0x230
 ? drm_atomic_helper_commit_tail+0x44/0x80
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 52 jiffies s: 169 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X state:R running task stack: 0 pid: 4242 ppid: 4228 flags:0x0000400e
Call Trace:
 <TASK>
 ? memcpy_toio+0x76/0xc0
 ? memcpy_toio+0x1b/0xc0
 ? drm_fb_memcpy_toio+0x76/0xb0
 ? drm_fb_blit_toio+0x75/0x2b0
 ? simpledrm_simple_display_pipe_update+0x132/0x150
 ? drm_atomic_helper_commit_planes+0xb6/0x230
 ? drm_atomic_helper_commit_tail+0x44/0x80
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
traps: avahi-ml[4447] general protection fault ip:7fdde6a37bc1 sp:7fdde07fc920 error:0 in module-zeroconf-publish.so[7fdde6a37000+3000]

To confirm whether this is a problem with the backport only (missing dependencies?) or a general problem, would it be possible to also check 6.1-rc1 or later, where that patch landed, to see whether it is affected? That would help decide whether the patch just needs to be reverted in stable or also needs to be fixed in 6.1.

Apparently it's an incomplete backport, and the fix is already queued and part of 6.0.4-rc1: https://lore.kernel.org/stable/51651c2e-3706-37d7-01e7-5d473a412850@suse.de/

Thanks... In short: the additional patch did NOT fix the problem.

I don't use git and I don't know how to /cherry-pick commit/ 9d69ef183815, but I found the patch here: https://patchwork.freedesktop.org/patch/494609/ I hope that's the right one.

I reintegrated v2-07-11-video-aperture-Disable-and-unregister-sysfb-devices-via-aperture-helpers.patch and also applied v2-04-11-fbdev-core-Remove-remove_conflicting_pci_framebuffers.patch, did a "make mrproper" and thereafter compiled a clean new 6.0.3 kernel (same .config). Now the system doesn't even boot to a console.
The first boot got me to an rcu_sched stall on CPUs/tasks, same as above, but this time with:
Workqueue: btrfs-cache btrfs_work_helper

I booted a second time with the same kernel, and it got stuck after mounting the root btrfs filesystem (it looked like a total freeze, but when it didn't show an RCU stall message after ~2 min I got impatient and wanted to see whether I had just busted my root filesystem...)

I booted 6.0.2 and everything is fine. (I'm very glad! I definitely should update my backup right away!)

I will try 6.1-rc1 next, bear with...

Just tested with 6.1-rc2 (tarball from kernel.org), which works.

Thanks to a quick patch by Thomas Zimmermann this is fixed and, I assume, will be so in 6.0.4 as well.

Patch for download (necessary only for 6.0.3): https://lore.kernel.org/regressions/ef862938-3e1a-5138-2bda-d10e9188f920@suse.de/1.1.2-0001-video-aperture-Call-sysfb_disable-before-removing-PC.patch

For future reference, whole conversation: https://lore.kernel.org/regressions/d6afe54b-f8d7-beb2-3609-186e566cbfac@gmx.net/T/#t

(In reply to Andreas from comment #10)
> Thanks to a quick patch by Thomas Zimmermann this is fixed and, I assume,
> will be so in 6.0.4 as well.
FWIW, it's not even in 6.0.5; Thomas's submission didn't meet the stable requirements.

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #11)
> FWIW, it's not even in 6.0.5; Thomas's submission didn't meet the stable
> requirements.
That's not good, since this is clearly breaking the kernel in some cases. It could well be that my kernel config is special in some way, but I don't see how. (Thinking not only of Gentoo Linux, but also Arch, where users tend to compile their own kernels...)

But I assume that Linux distributions will not have this issue, because even though they might enable simplefb in-kernel, they'll likely have amdgpu (and nouveau etc.) compiled as modules. Hence, my case, where amdgpu loads first and simplefb then messes the whole thing up (without the fix for 6.0.3), will not be possible - it's most likely the other way around (amdgpu loads after simplefb), which apparently works without problems. But that's just me speculating.

FYI, according to the changelog for 6.0.4 (https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.0.4) remove_conflicting_pci_framebuffers() has been removed. I did that too for 6.0.3, but it didn't help... And, yes, there's no patch to solve this in 6.0.5 (https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.0.5).

The patch has been queued for 6.0, so it's looking good for the next 6.0 release (which will be 6.0.6).

(In reply to Andreas from comment #13)
> The patch has been queued for 6.0
Yup, it was just bad timing; I should have mentioned that in my earlier comment, sorry.

Included in 6.0.6, thanks! (https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.0.6) |
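
Editorial sketch: the reporter's guess in the thread is that the problem needs amdgpu built into the kernel (=y) together with an enabled firmware framebuffer driver. The Python below is a minimal, untested-against-real-systems way to check a .config for that combination. The Kconfig symbol names are assumptions (the thread only says "simplefb" and "amdgpu"), the function names are hypothetical, SAMPLE is made up for illustration, and the check encodes only the reporter's speculation, not a confirmed root cause.

```python
import re

# Assumed Kconfig symbols; the thread names drivers, not symbols.
GPU_OPTION = "CONFIG_DRM_AMDGPU"
SYSFB_OPTIONS = ("CONFIG_DRM_SIMPLEDRM", "CONFIG_SYSFB_SIMPLEFB", "CONFIG_FB_SIMPLE")

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map each CONFIG_ symbol in .config text to its value ('n' if unset)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def matches_reported_setup(options):
    """True if amdgpu is built in and any assumed sysfb driver is enabled."""
    gpu_builtin = options.get(GPU_OPTION, "n") == "y"
    sysfb_enabled = any(options.get(name, "n") in ("y", "m") for name in SYSFB_OPTIONS)
    return gpu_builtin and sysfb_enabled


# Made-up sample, not taken from the attached .config.
SAMPLE = """\
CONFIG_DRM_AMDGPU=y
CONFIG_DRM_SIMPLEDRM=y
# CONFIG_FB_SIMPLE is not set
"""

if __name__ == "__main__":
    print(matches_reported_setup(parse_kconfig(SAMPLE)))
```

To use it on a real tree, read the kernel's .config file and pass its text to parse_kconfig(); a True result only means the configuration resembles the one the reporter suspected.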
Created attachment 303074 [details]
dmesg

6.0.2 works. On 6.0.3 the system is very sluggish, with graphic glitches all over the place in KDE Plasma Desktop on X11 (no graphic glitches when using Wayland, but also sluggish). SDDM works fine.

Hardware: Lenovo Legion 5 Pro 16ACH6H: AMD Ryzen 7 5800H "Cezanne", hybrid graphics AMD "Green Sardine" (Vega 8 GCN 5.1, AMDGPU) and Nvidia GeForce RTX 3070 Mobile (GA104M, not working with nouveau, I'm not using the proprietary nvidia driver).
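
Editorial sketch: the dmesg excerpt quoted in this report repeats the same RCU stall header with growing jiffies counts, which is easy to miss in a long log. The Python below is a minimal way to summarize such headers; it assumes the single-CPU header form seen in this report, and it treats the "s:" field as the expedited grace-period sequence number, which is my reading rather than something the report states. The function name is hypothetical. SAMPLE contains only the five header lines quoted in the report.

```python
import re

# Matches the header form seen in this report (one CPU inside the braces).
STALL = re.compile(
    r"rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: "
    r"\{ (\d+)-\S* \} (\d+) jiffies s: (\d+)"
)


def summarize_stalls(log_text):
    """Return {(cpu, s_value): largest jiffies count seen for that pair}."""
    worst = {}
    for cpu, jiffies, seq in STALL.findall(log_text):
        key = (int(cpu), int(seq))
        worst[key] = max(worst.get(key, 0), int(jiffies))
    return worst


# The five stall headers from the dmesg excerpt in this report.
SAMPLE = """\
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 7 jiffies s: 165 root: 0x2000/.
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 29 jiffies s: 165 root: 0x2000/.
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 8 jiffies s: 169 root: 0x2000/.
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 30 jiffies s: 169 root: 0x2000/.
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 52 jiffies s: 169 root: 0x2000/.
"""

if __name__ == "__main__":
    for (cpu, seq), jiffies in sorted(summarize_stalls(SAMPLE).items()):
        print(f"CPU {cpu}, s={seq}: stalled for up to {jiffies} jiffies")
```

Because the pattern is applied with findall, timestamps or other prefixes in front of "rcu:" do not need to be stripped first; headers listing several CPUs would need a more general pattern.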