Bug 218413 - Seemingly commit a6149f039369 broke amdgpu driver
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P3 high
Assignee: drivers_other
Reported: 2024-01-23 17:28 UTC by Niklāvs Koļesņikovs
Modified: 2024-01-28 15:34 UTC
Kernel Version: 6.8 pre-rc1
Regression: Yes
Bisected commit-id: a6149f0393699308fb00149be913044977bceb56


Attachments
A supposedly good kernel hanging GUI with a null pointer dereference (9.40 KB, text/plain)
2024-01-24 09:55 UTC, Niklāvs Koļesņikovs

Description Niklāvs Koļesņikovs 2024-01-23 17:28:10 UTC
First of all, I'm very sorry to post this here: I know it belongs elsewhere, but I can't use the correct issue tracker. I hope this report can be forwarded to the right people because, with any luck, I did manage to bisect it correctly.

The issue I'm facing is kwin_wayland hanging unpredictably during regular use. I tried a large number of Mesa environment variables as well as both RR and FIFO GPU schedulers, but nothing made the hang either reproduce or disappear 100% of the time, so it's likely a timing-related bug. With one exception, across many boots there were no dmesg entries indicating any kind of issue, and the system journal does not show any obvious patterns or failures either, so the one message that was printed might be just a consequence of the actual bug.

During one boot, `BUG: kernel NULL pointer dereference, address: 0000000000000008` was reported, but that was on a kernel built from commit b70438004a14, which should be well in the clear: multiple later commits were tested for a combined total of days of typical use without a single GUI hang.

When kwin_wayland's screen output freezes, SysRq+E occasionally gets me back to SDDM's Wayland login prompt (running on the Weston display server), but the system almost always freezes there as well or, failing that, freezes during or shortly after login. Likewise, switching to a tty is usually impossible, and when it does work the tty eventually freezes too. In one case the mouse pointer could still be moved but nothing reacted to input.

The issue started happening between Linux 6.8 pre-rc1 commits 70d201a40823 (good) and 052d534373b7 (bad). Because there is no reliable reproducer, bisecting was not easy and I still can't say with full confidence that I found the right culprit, but my third round of git bisection arrived at a6149f0393699308fb00149be913044977bceb56 as the first bad commit. It may or may not be relevant that at some point (IIRC, between ca34d816558c and e013aa9ab01b) the kernel also started to hang hard when entering S3 sleep and at the end of the `systemctl reboot` process, but I do not know whether that indicates the same bug. When the S3 or reboot hangs happen, the reset button on the PC case has to be used, i.e. SysRq+B does nothing.
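
To put a rough number on why an intermittent hang makes bisection unreliable, here is a minimal sketch. It assumes (my assumption, not something measured in this report) that hangs on a bad kernel arrive at random with a constant average rate, i.e. as a Poisson process; the rate and the function names are purely illustrative.

```python
import math


def false_good_probability(hangs_per_hour: float, hours_tested: float) -> float:
    """Chance that a truly bad kernel shows no hang at all during testing.

    Assumes hangs follow a Poisson process with a constant rate, which is
    a modelling assumption and not something measured in this report.
    """
    return math.exp(-hangs_per_hour * hours_tested)


def hours_needed(hangs_per_hour: float, max_false_good: float) -> float:
    """Hang-free hours needed so a bad kernel passes at most max_false_good of the time."""
    return math.log(1.0 / max_false_good) / hangs_per_hour


if __name__ == "__main__":
    # Hypothetical rate: on average one hang per 8 hours of use on a bad kernel.
    rate = 1.0 / 8.0
    for hours in (2, 8, 24):
        p = false_good_probability(rate, hours)
        print(f"{hours:>2} h of testing: a bad kernel would still pass {p:.1%} of the time")
    print(f"hours needed for a 5% pass rate: {hours_needed(rate, 0.05):.1f}")
```

Under this model, a low hang rate means a bad commit can easily be marked good, and a single mislabeled step sends `git bisect` into the wrong half of the range. That would be consistent with needing several bisection rounds.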

I did come across https://gitlab.freedesktop.org/drm/amd/-/issues/3124 and it seems similar to my issue with kwin_wayland; however, the instant GNOME hang went away during bisection, so I'm not sure whether it's a more severe form of the same underlying bug or a different one.

Hardware in use: Intel Core i5-12400 CPU and AMD RX 580 GPU, with the Intel UHD 730 iGPU in RC6 render standby and used for HEVC encoding and Vulkan compute. IOMMU and CET are enabled. Bisection was initially done with Linux firmware 20231211, and the third bisection round with 20240115. If it's relevant, I have the second newest Intel ME and UEFI firmware for my platform, since I'm still waiting for enough time to go by before I dare to flash the latest unsigned firmware update. *sigh*
Comment 1 Niklāvs Koļesņikovs 2024-01-23 17:29:17 UTC
a6149f0393699308fb00149be913044977bceb56 is the first bad commit
commit a6149f0393699308fb00149be913044977bceb56
Author: Matthew Brost <matthew.brost@intel.com>
Date:   Mon Oct 30 20:24:36 2023 -0700

    drm/sched: Convert drm scheduler to use a work queue rather than kthread
    
    In Xe, the new Intel GPU driver, a choice has made to have a 1 to 1
    mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
    seems a bit odd but let us explain the reasoning below.
    
    1. In Xe the submission order from multiple drm_sched_entity is not
    guaranteed to be the same completion even if targeting the same hardware
    engine. This is because in Xe we have a firmware scheduler, the GuC,
    which allowed to reorder, timeslice, and preempt submissions. If a using
    shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
    apart as the TDR expects submission order == completion order. Using a
    dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
    
    2. In Xe submissions are done via programming a ring buffer (circular
    buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
    limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
    control on the ring for free.
    
    A problem with this design is currently a drm_gpu_scheduler uses a
    kthread for submission / job cleanup. This doesn't scale if a large
    number of drm_gpu_scheduler are used. To work around the scaling issue,
    use a worker rather than kthread for submission / job cleanup.
    
    v2:
      - (Rob Clark) Fix msm build
      - Pass in run work queue
    v3:
      - (Boris) don't have loop in worker
    v4:
      - (Tvrtko) break out submit ready, stop, start helpers into own patch
    v5:
      - (Boris) default to ordered work queue
    v6:
      - (Luben / checkpatch) fix alignment in msm_ringbuffer.c
      - (Luben) s/drm_sched_submit_queue/drm_sched_wqueue_enqueue
      - (Luben) Update comment for drm_sched_wqueue_enqueue
      - (Luben) Positive check for submit_wq in drm_sched_init
      - (Luben) s/alloc_submit_wq/own_submit_wq
    v7:
      - (Luben) s/drm_sched_wqueue_enqueue/drm_sched_run_job_queue
    v8:
      - (Luben) Adjust var names / comments
    
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
    Link: https://lore.kernel.org/r/20231031032439.1558703-3-matthew.brost@intel.com
    Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   2 +-
 drivers/gpu/drm/lima/lima_sched.c          |   2 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c    |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 131 +++++++++++++++--------------
 drivers/gpu/drm/v3d/v3d_sched.c            |  10 +--
 include/drm/gpu_scheduler.h                |  14 +--
 9 files changed, 86 insertions(+), 81 deletions(-)
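
As an aside on point 2 of the commit message above, the flow-control argument is plain arithmetic: if the scheduler never has more than RING_SIZE / MAX_SIZE_PER_JOB jobs in flight, the ring cannot overflow even when every job is as large as allowed. The Python sketch below is a toy model of that bound only. The sizes and the `ToyRing` class are hypothetical and are not taken from Xe or the DRM scheduler code.

```python
def job_limit(ring_size: int, max_size_per_job: int) -> int:
    """Job limit from the commit message: RING_SIZE / MAX_SIZE_PER_JOB."""
    return ring_size // max_size_per_job


class ToyRing:
    """Toy model: tracks bytes used by in-flight jobs, not a real ring buffer."""

    def __init__(self, ring_size: int, max_size_per_job: int) -> None:
        self.ring_size = ring_size
        self.max_size_per_job = max_size_per_job
        self.limit = job_limit(ring_size, max_size_per_job)
        self.in_flight = []  # sizes of submitted, not yet completed jobs

    def try_submit(self, size: int) -> bool:
        """Accept a job only while the job-count limit has not been reached."""
        if size > self.max_size_per_job:
            raise ValueError("job exceeds the per-job maximum")
        if len(self.in_flight) >= self.limit:
            return False  # caller has to wait for a completion
        self.in_flight.append(size)
        # The job-count limit alone keeps total usage within the ring.
        assert sum(self.in_flight) <= self.ring_size
        return True

    def complete_oldest(self) -> None:
        """Retire the oldest job, freeing one slot."""
        self.in_flight.pop(0)


if __name__ == "__main__":
    ring = ToyRing(ring_size=16384, max_size_per_job=1024)  # hypothetical sizes
    accepted = sum(ring.try_submit(1024) for _ in range(20))
    print(accepted, "of 20 worst-case jobs accepted before back-pressure")
```

The point of the model is that no separate byte accounting is needed: capping the number of jobs is enough, which is what the commit message means by getting flow control "for free".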
Comment 2 Niklāvs Koļesņikovs 2024-01-23 17:41:07 UTC
Slight correction to what I wrote about https://gitlab.freedesktop.org/drm/amd/-/issues/3124 earlier: for me the GNOME hang usually happens at the end of the login procedure, but once it also happened right after login via SDDM. However, since I use Weston rather than Mutter for the login prompt, it could still be the same bug w.r.t. GNOME hanging with 6.8 rc1.
Comment 3 Artem S. Tashkinov 2024-01-23 18:08:21 UTC
Please report here:

https://gitlab.freedesktop.org/drm/amd/-/issues
Comment 4 Niklāvs Koļesņikovs 2024-01-23 18:09:31 UTC
Please understand that I cannot do that. Truly sorry.
Comment 5 Artem S. Tashkinov 2024-01-23 20:54:41 UTC
I've posted it here: https://gitlab.freedesktop.org/drm/amd/-/issues/3126
Comment 6 Niklāvs Koļesņikovs 2024-01-23 22:13:32 UTC
Thank you. :)
Comment 7 Niklāvs Koļesņikovs 2024-01-24 09:22:11 UTC
I just had a Plasma freeze with a supposedly good kernel (the commit before the one bisected as the first bad one); however, it looks different in two respects:
1) the journal was filled with these two lines, repeated in varying multiples of each:
  kwin_wayland_wrapper[3337]: kwin_scene_opengl: 0x2: GL_INVALID_OPERATION in glDrawArrays
  kwin_wayland_wrapper[3337]: kwin_scene_opengl: 0x2: GL_INVALID_OPERATION in glVertexAttribPointer(no array object bound)
2) after SysRq+E, plasmashell dumped core and I was returned to the SDDM login prompt, where I could log back in.

These two differences suggest, at least to me, that it's probably an unrelated issue, perhaps caused by my running KDE's git version (which will become KF6 and Plasma 6), but I'm noting it here just in case.
Comment 8 Niklāvs Koļesņikovs 2024-01-24 09:55:42 UTC
Created attachment 305773
A supposedly good kernel hanging GUI with a null pointer dereference

While I was away from the PC doing chores, the screen went blank, and when I tried to figure out what was wrong, the system seemed to have hung irrecoverably. I'm attaching journal output containing a kernel null pointer dereference, which may or may not be related to this issue (the kernel in question was supposed to be good).
Comment 9 Niklāvs Koļesņikovs 2024-01-24 15:55:22 UTC
After more testing, I'm starting to suspect that either there's one bug that shows itself in different ways, possibly depending on the presence of other commits as well as on pure chance, or there are 2-3 GPU hang bugs plus an S3 sleep entry/reboot bug that may or may not be related. I'm not sure I have it in me to keep prodding the range between 70d201a40823 and 052d534373b7, which almost certainly contains the trigger and maybe also the source of all those issues.
Comment 10 Niklāvs Koļesņikovs 2024-01-24 21:07:42 UTC
I have tested the patch from https://gitlab.freedesktop.org/drm/amd/-/issues/3124#note_2252559 and I can confirm that the GNOME hang is resolved. Because kwin_wayland hangs randomly, I can't say for certain that everything is resolved, but it has felt perfect so far, which is a promising sign. It's hard to describe, but when the bug is present it's almost as if the frame pacing or latency is [sometimes?] off in a very subtle, barely perceptible way; with the patch applied it feels right again.

Reported-and-tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
Comment 11 Niklāvs Koļesņikovs 2024-01-27 10:18:23 UTC
The bug has been fixed upstream by commit 66dbd9004a55073c5931f5f65f5fe2bbd414bdaa.