First of all, I'm very sorry I have to post this here, since I know it should go elsewhere, but I can't use the correct issue tracker. I hope this report can be forwarded to the right people, because with any luck I did manage to bisect it correctly.

The issue I'm facing is kwin_wayland unpredictably hanging during regular use. I tried a large number of Mesa environment variables as well as both the RR and FIFO GPU schedulers, but nothing made it either happen or go away 100% of the time, so it's likely a timing-related bug. With one exception, across many boots there were no dmesg entries indicating any kind of issue, and the system journal does not show any obvious patterns or failures either, so the one time it did print anything might just be a consequence of the actual bug. In one of the boot-ups, `BUG: kernel NULL pointer dereference, address: 0000000000000008` was reported, but that was for commit b70438004a14, which should be well in the clear, since multiple later commits were tested for days of typical use in total without a single GUI hang.

When kwin_wayland's screen output freezes, SysRq+E can on rare occasions get back to a working SDDM Wayland login prompt running the Weston display server, but it almost always freezes there and, if not, it will imminently freeze during or after login. Likewise, switching to a tty is unlikely to be possible, or it will eventually freeze too. In one case the mouse pointer was still movable, but nothing reacted to interactions.

The issue started happening between Linux 6.8 pre-rc1 commits 70d201a40823 (good) and 052d534373b7 (bad). With no reliable reproducer, bisecting this was not easy and I still can't say with full confidence that I got the right culprit, but my third round of git bisection arrived at a6149f0393699308fb00149be913044977bceb56 being the first bad commit.

It may or may not be relevant that at some point (IIRC, between ca34d816558c and e013aa9ab01b) the kernel also started to severely hang when entering S3 sleep as well as at the end of the `systemctl reboot` process, but I do not know whether that's indicative of the same bug or not. When the S3 or reboot hangs happen, the PC reset button on the case is required, i.e. SysRq+B does nothing.

I did encounter https://gitlab.freedesktop.org/drm/amd/-/issues/3124 and it's seemingly similar to my issue with kwin_wayland; however, the instant GNOME hang went away during bisection, so I'm not sure whether it's a more severe form of the same underlying bug or a different one.

Hardware in use: Intel Core i5-12400 CPU and AMD RX 580 GPU, with the Intel HD730 iGPU in RC6 render standby for HEVC encoding and Vulkan compute roles. IOMMU and CET are enabled. Bisection was initially done with linux-firmware 20231211 and the third round with 20240115. If it's relevant, I have the second newest Intel ME and UEFI firmware for my platform, since I'm still waiting for enough time to go by before I dare to flash the latest unsigned firmware update. *sigh*
a6149f0393699308fb00149be913044977bceb56 is the first bad commit
commit a6149f0393699308fb00149be913044977bceb56
Author: Matthew Brost <matthew.brost@intel.com>
Date:   Mon Oct 30 20:24:36 2023 -0700

    drm/sched: Convert drm scheduler to use a work queue rather than kthread

    In Xe, the new Intel GPU driver, a choice has made to have a 1 to 1
    mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
    seems a bit odd but let us explain the reasoning below.

    1. In Xe the submission order from multiple drm_sched_entity is not
    guaranteed to be the same completion even if targeting the same hardware
    engine. This is because in Xe we have a firmware scheduler, the GuC,
    which allowed to reorder, timeslice, and preempt submissions. If a using
    shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
    apart as the TDR expects submission order == completion order. Using a
    dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.

    2. In Xe submissions are done via programming a ring buffer (circular
    buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
    limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
    control on the ring for free.

    A problem with this design is currently a drm_gpu_scheduler uses a
    kthread for submission / job cleanup. This doesn't scale if a large
    number of drm_gpu_scheduler are used. To work around the scaling issue,
    use a worker rather than kthread for submission / job cleanup.

    v2:
      - (Rob Clark) Fix msm build
      - Pass in run work queue
    v3:
      - (Boris) don't have loop in worker
    v4:
      - (Tvrtko) break out submit ready, stop, start helpers into own patch
    v5:
      - (Boris) default to ordered work queue
    v6:
      - (Luben / checkpatch) fix alignment in msm_ringbuffer.c
      - (Luben) s/drm_sched_submit_queue/drm_sched_wqueue_enqueue
      - (Luben) Update comment for drm_sched_wqueue_enqueue
      - (Luben) Positive check for submit_wq in drm_sched_init
      - (Luben) s/alloc_submit_wq/own_submit_wq
    v7:
      - (Luben) s/drm_sched_wqueue_enqueue/drm_sched_run_job_queue
    v8:
      - (Luben) Adjust var names / comments

    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
    Link: https://lore.kernel.org/r/20231031032439.1558703-3-matthew.brost@intel.com
    Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   2 +-
 drivers/gpu/drm/lima/lima_sched.c          |   2 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c    |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 131 +++++++++++++++--------------
 drivers/gpu/drm/v3d/v3d_sched.c            |  10 +--
 include/drm/gpu_scheduler.h                |  14 +--
 9 files changed, 86 insertions(+), 81 deletions(-)
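For anyone less familiar with what that commit changes, here is a rough, illustrative C sketch of the general kthread-to-workqueue submission pattern its message describes. The names and structure below are made up purely for illustration; this is not the actual drm_sched code.

/* Illustrative sketch only (not the real drm_sched code): the general
 * pattern of moving per-scheduler job submission from a dedicated kthread
 * to a work item on a workqueue, as described in the commit message above.
 * All names here are hypothetical.
 */

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct toy_sched {
	/* Old model: one kthread per scheduler, woken via wake_up_process()
	 * whenever a job becomes ready. With one scheduler per entity that
	 * means one kernel thread per entity, which does not scale.
	 */

	/* New model: one work item per scheduler, queued on a workqueue. */
	struct workqueue_struct *submit_wq;
	struct work_struct	run_job_work;
};

static void toy_run_job_work(struct work_struct *w)
{
	struct toy_sched *sched = container_of(w, struct toy_sched, run_job_work);

	/* Pick one ready job, push it to the ring, then re-queue this work
	 * item if more jobs are pending (no long-running loop in the worker).
	 */
	(void)sched;
}

static int toy_sched_init(struct toy_sched *sched)
{
	/* An ordered workqueue preserves per-scheduler submission order while
	 * letting many schedulers share the kernel's worker pool.
	 */
	sched->submit_wq = alloc_ordered_workqueue("toy-submit", 0);
	if (!sched->submit_wq)
		return -ENOMEM;

	INIT_WORK(&sched->run_job_work, toy_run_job_work);
	return 0;
}

static void toy_sched_queue(struct toy_sched *sched)
{
	/* Instead of waking a dedicated kthread, just queue the work item. */
	queue_work(sched->submit_wq, &sched->run_job_work);
}

static void toy_sched_fini(struct toy_sched *sched)
{
	cancel_work_sync(&sched->run_job_work);
	destroy_workqueue(sched->submit_wq);
}

The upshot, per the commit message, is that run-job and free-job handling now execute on workqueue workers instead of a dedicated per-scheduler kthread, which changes when and where submission/cleanup runs.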
Slight correction regarding https://gitlab.freedesktop.org/drm/amd/-/issues/3124: for me the GNOME hang usually happens at the end of the login procedure, but once it also happened right after logging in via SDDM. However, since I use Weston rather than Mutter for the login prompt, it could still be the same bug w.r.t. GNOME hanging with 6.8-rc1.
Please report here: https://gitlab.freedesktop.org/drm/amd/-/issues
Please understand that I cannot do that. Truly sorry.
I've posted it here: https://gitlab.freedesktop.org/drm/amd/-/issues/3126
Thank you. :)
I just had a Plasma freeze with a supposedly good kernel (the commit before the one bisected as the first bad one); however, it looks different on two accounts:

1) The journal was filled with these two lines on repeat (in various multiples of each):
kwin_wayland_wrapper[3337]: kwin_scene_opengl: 0x2: GL_INVALID_OPERATION in glDrawArrays
kwin_wayland_wrapper[3337]: kwin_scene_opengl: 0x2: GL_INVALID_OPERATION in glVertexAttribPointer(no array object bound)

2) After SysRq+E, plasmashell coredumped and I was returned to the SDDM login prompt, where I could log back in.

These two differences suggest, at least to me, that it's probably an unrelated issue, maybe caused by me running the git version of KDE that will become KF6 and Plasma 6, but I'm noting it here just in case.
Created attachment 305773 [details]
A supposedly good kernel hanging the GUI with a NULL pointer dereference

While I was away from the PC doing chores, the screen had gone blank, and when I tried to figure out what was wrong, it appeared to have hung irrecoverably. I'm attaching the journal output with a kernel NULL pointer dereference, which may or may not be related to this issue (the kernel in question was supposed to be good).
After more testing, I'm starting to suspect that there is either a single bug that shows itself in different ways, possibly depending on the presence of other commits as well as pure chance, or there are 2-3 GPU hang bugs plus a possibly related or unrelated S3 sleep entry/reboot bug. I'm not sure I have it in me to keep prodding the range between 70d201a40823 and 052d534373b7, which almost certainly contains the trigger and maybe also the source of all these issues.
I have tested the patch from https://gitlab.freedesktop.org/drm/amd/-/issues/3124#note_2252559 and I can confirm that the GNOME hang is resolved. Because kwin_wayland hangs randomly, I can't say for certain that everything is resolved, but it has felt perfect so far, which is a promising sign. It's hard to describe, but when the bug is present it's almost as if the frame pacing or latency is [sometimes?] off in a very subtle and barely perceptible way, and with the patch applied it feels right again.

Reported-and-tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
The bug has been fixed upstream by commit 66dbd9004a55073c5931f5f65f5fe2bbd414bdaa.