Bug 218413
| Summary: | Seemingly commit a6149f039369 broke amdgpu driver | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Niklāvs Koļesņikovs (pinkflames.linux) |
| Component: | Other | Assignee: | drivers_other |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | high | | |
| Priority: | P3 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 6.8 pre-rc1 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | a6149f0393699308fb00149be913044977bceb56 |
| Attachments: | A supposedly good kernel hanging GUI with a null pointer dereference | | |
Description
Niklāvs Koļesņikovs
2024-01-23 17:28:10 UTC
a6149f0393699308fb00149be913044977bceb56 is the first bad commit
commit a6149f0393699308fb00149be913044977bceb56
Author: Matthew Brost <matthew.brost@intel.com>
Date:   Mon Oct 30 20:24:36 2023 -0700

    drm/sched: Convert drm scheduler to use a work queue rather than kthread

    In Xe, the new Intel GPU driver, a choice has made to have a 1 to 1
    mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
    seems a bit odd but let us explain the reasoning below.

    1. In Xe the submission order from multiple drm_sched_entity is not
    guaranteed to be the same completion even if targeting the same hardware
    engine. This is because in Xe we have a firmware scheduler, the GuC,
    which allowed to reorder, timeslice, and preempt submissions. If a using
    shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
    apart as the TDR expects submission order == completion order. Using a
    dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.

    2. In Xe submissions are done via programming a ring buffer (circular
    buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
    limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
    control on the ring for free.

    A problem with this design is currently a drm_gpu_scheduler uses a
    kthread for submission / job cleanup. This doesn't scale if a large
    number of drm_gpu_scheduler are used. To work around the scaling issue,
    use a worker rather than kthread for submission / job cleanup.

    v2:
      - (Rob Clark) Fix msm build
      - Pass in run work queue
    v3:
      - (Boris) don't have loop in worker
    v4:
      - (Tvrtko) break out submit ready, stop, start helpers into own patch
    v5:
      - (Boris) default to ordered work queue
    v6:
      - (Luben / checkpatch) fix alignment in msm_ringbuffer.c
      - (Luben) s/drm_sched_submit_queue/drm_sched_wqueue_enqueue
      - (Luben) Update comment for drm_sched_wqueue_enqueue
      - (Luben) Positive check for submit_wq in drm_sched_init
      - (Luben) s/alloc_submit_wq/own_submit_wq
    v7:
      - (Luben) s/drm_sched_wqueue_enqueue/drm_sched_run_job_queue
    v8:
      - (Luben) Adjust var names / comments

    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
    Link: https://lore.kernel.org/r/20231031032439.1558703-3-matthew.brost@intel.com
    Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   2 +-
 drivers/gpu/drm/lima/lima_sched.c          |   2 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c    |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 131 +++++++++++++++-----------------
 drivers/gpu/drm/v3d/v3d_sched.c            |  10 +--
 include/drm/gpu_scheduler.h                |  14 +--
 9 files changed, 86 insertions(+), 81 deletions(-)

Slight correction regarding https://gitlab.freedesktop.org/drm/amd/-/issues/3124 . For me the GNOME hang usually happens at the end of the login procedure, but once it also happened right after login via SDDM. However, since I use Weston rather than Mutter for the login prompt, it could still be the same bug w.r.t. GNOME hanging with 6.8 rc1.

Please report here: https://gitlab.freedesktop.org/drm/amd/-/issues

Please understand that I cannot do that. Truly sorry.

I've posted it here: https://gitlab.freedesktop.org/drm/amd/-/issues/3126

Thank you.
:)

I just had a Plasma freeze with a supposedly good kernel (the commit before the one bisected as the first bad one). However, it looks different in two respects:

1) The journal was filled with these two lines on repeat (in various multiples of each):

kwin_wayland_wrapper[3337]: kwin_scene_opengl: 0x2: GL_INVALID_OPERATION in glDrawArrays
kwin_wayland_wrapper[3337]: kwin_scene_opengl: 0x2: GL_INVALID_OPERATION in glVertexAttribPointer(no array object bound)

2) After SysRq+E, plasmashell coredumped and I was returned to the SDDM login prompt, where I could log back in.

These two differences indicate, at least to me, that it's probably an unrelated issue, maybe due to me running KDE's git version that will become KF6 and Plasma 6, but just in case I'm noting it here.

Created attachment 305773
A supposedly good kernel hanging GUI with a null pointer dereference
While I was away from the PC doing chores, the screen had gone blank, and when I tried to figure out what was wrong, it seemed to have hung irrecoverably. I'm attaching journal output with a kernel null pointer dereference, which may or may not be related to this issue (the kernel in question was supposed to be good).
After more testing, I'm starting to suspect one of two things. Either there's a single bug that shows itself in different ways, possibly depending on the presence of other commits as well as pure chance, or there are 2-3 GPU hang bugs and a possibly related S3 sleep entry/reboot bug. I'm not sure I have it in me to keep prodding the range between 70d201a40823 and 052d534373b7, which almost certainly contains the trigger and maybe also the source of all those issues.

I have tested the patch from https://gitlab.freedesktop.org/drm/amd/-/issues/3124#note_2252559 and I can confirm that the GNOME hang is resolved. Because kwin_wayland hangs randomly, I can't say for certain that everything is resolved, but it's felt perfect so far, which is a promising sign. It's hard to describe, but when the bug is present it's almost like the frame pacing or latency is [sometimes?] off in a very subtle and hardly perceivable way, and with the patch applied it feels right again.

Reported-and-tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>

The bug has been fixed upstream by commit 66dbd9004a55073c5931f5f65f5fe2bbd414bdaa .