Bug 219118 - Linux 6.10.x [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout VM fault / GPU fault detected
Status: RESOLVED ANSWERED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P3 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-08-01 16:23 UTC by Michael Evans
Modified: 2024-08-01 17:22 UTC (History)
CC List: 0 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:



Description Michael Evans 2024-08-01 16:23:33 UTC
I'm not sure if this should be filed under Console/Framebuffers, Video(DRI - non Intel), or Video(Other).

I believed I had filed the bug in the correct place (https://gitlab.freedesktop.org/drm/amd/-/issues/3510), but no maintainer has commented on or otherwise visibly interacted with that report.  I initially thought it was just an mpv bug, since VLC did not trigger the issue: https://github.com/mpv-player/mpv/issues/14600

A developer's (personal?) drm-fixes-6.11 branch appears to have cherry-picked the commit that completely fixed the issue in my test cases: https://gitlab.freedesktop.org/agd5f/linux/-/commit/f3572db3c049b4d32bb5ba77ad5305616c44c7c1

However, that branch does not cover the earlier 6.10.x series, which also needs the fix unless that series is no longer maintained.

This appears to be a Swiss-cheese sort of bug.  If software requests or provides contiguous buffers, the symptoms are more subtle, such as momentary video corruption, when the kernel's accesses are not out of bounds but only occasionally scrambled.  Only when neither userspace nor the driver enforces contiguous buffer segments do out-of-bounds accesses result in a GPU reset and, consequently, terminated userspace processes.


ArchLinux (rolling release)
Linux 6.10.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 24 Jul 2024 22:25:43 +0000 x86_64 GNU/Linux
amdgpu + OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.1.4-arch1.2
ArchLinux current stable builds


[ 1766.321165] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0a22c802
[ 1766.321171] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321172] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101F44
[ 1766.321174] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0C8002
[ 1766.321175] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1056580, write from 'TC3' (0x54433300) (200)
[ 1766.321237] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07f2a002
[ 1766.321238] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321239] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010120C
[ 1766.321240] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321241] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053196, write from 'CB2' (0x43423200) (32)
[ 1766.321244] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b29002
[ 1766.321245] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101237
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B010002
[ 1766.321248] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053239, write from 'CB3' (0x43423300) (16)
[ 1766.321255] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772e002
[ 1766.321256] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321257] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101200
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053184, write from 'CB4' (0x43423400) (160)
[ 1766.321262] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772d002
[ 1766.321263] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101232
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321265] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053234, write from 'CB4' (0x43423400) (160)
[ 1766.321268] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07729002
[ 1766.321269] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010123A
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321272] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053242, write from 'CB1' (0x43423100) (80)
[ 1766.321275] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732d002
[ 1766.321276] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321277] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001012AB
[ 1766.321278] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321279] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053355, write from 'CB2' (0x43423200) (32)
[ 1766.321282] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07126002
[ 1766.321283] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321284] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010124C
[ 1766.321285] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0E0002
[ 1766.321286] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053260, write from 'CB6' (0x43423600) (224)
[ 1766.321289] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b21002
[ 1766.321290] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321291] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101223
[ 1766.321292] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321293] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
[ 1766.321296] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
[ 1766.321297] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101277
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
[ 1766.321299] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
[ 1777.234990] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816
[ 1777.236251] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
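
For anyone cross-checking the fault lines above, here is a minimal Python sketch that pulls the fields out of a "VM fault" summary line. It relies only on what is visible in this log: the decimal page number equals the preceding VM_CONTEXT1_PROTECTION_FAULT_ADDR value, and the quoted client tag is the ASCII reading of the hex word next to it. The function names (decode_vm_fault, mc_id_from_status) are made up for illustration. The bit position used in mc_id_from_status is an assumption inferred from the numbers in this log (it reproduces the trailing number on every fault line shown here), not taken from register documentation.

```python
import re

# Matches the "VM fault" summary line in the format shown in this report.
VM_FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+), pasid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \((0x[0-9a-fA-F]+)\) \((\d+)\)"
)


def decode_vm_fault(line):
    """Return the fields of a VM fault line as a dict, or None if no match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    prot, vmid, pasid, page, access, tag, tag_hex, mc_id = m.groups()
    # The hex word is the client tag packed as big-endian ASCII, NUL padded.
    raw = int(tag_hex, 16).to_bytes(4, "big")
    return {
        "protections": int(prot, 16),
        "vmid": int(vmid),
        "pasid": int(pasid),
        "page": int(page),
        # Same value in hex, for comparison with ..._FAULT_ADDR.
        "page_hex": f"0x{int(page):08X}",
        "access": access,
        "client": tag,
        "client_from_hex": raw.rstrip(b"\0").decode("ascii"),
        "mc_id": int(mc_id),
    }


def mc_id_from_status(status):
    """ASSUMPTION (inferred from this log only): the trailing number on the
    VM fault line is bits 12..20 of ..._PROTECTION_FAULT_STATUS."""
    return (status >> 12) & 0x1FF


if __name__ == "__main__":
    sample = ("amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) "
              "at page 1056580, write from 'TC3' (0x54433300) (200)")
    print(decode_vm_fault(sample))
    print(mc_id_from_status(0x0B0C8002))
```

Run against the first fault above, the page comes back as 0x00101F44, matching the FAULT_ADDR line printed just before it.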

Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101277
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:21 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Jul 25 22:09:21 HOSTNAME kernel: amdgpu: cp is busy, skip halt cp
Jul 25 22:09:22 HOSTNAME kernel: amdgpu: rlc is busy, skip halt rlc
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Jul 25 22:09:22 HOSTNAME kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400800000).
Jul 25 22:09:22 HOSTNAME kernel: [drm] VRAM is lost due to GPU reset!
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
Jul 25 22:09:22 HOSTNAME kernel: [drm] UVD initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: [drm] VCE initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
Jul 25 22:09:22 HOSTNAME mpv[5307]: amdgpu: The CS has cancelled because the context is lost. This context is innocent.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded!
Jul 25 22:09:22 HOSTNAME systemd-coredump[5681]: Process 5307 (mpv) of user 1000 terminated abnormally with signal 6/ABRT, processing...
Jul 25 22:09:22 HOSTNAME systemd[1]: Created slice Slice /system/drkonqi-coredump-processor.
-- Subject: A start job for unit system-drkonqi\x2dcoredump\x2dprocessor.slice has finished successfully
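
As a rough aid for reading the timeout line in the journal above, the sketch below extracts the ring name and the two sequence numbers and reports their difference. Treating "emitted minus signaled" as the number of submissions still outstanding on the ring is my reading of the message, not something confirmed by a maintainer; the function name pending_fences is hypothetical.

```python
import re

# Matches the ring timeout message in the format shown in this report.
TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)


def pending_fences(line):
    """Return (ring, emitted - signaled) for a timeout line, else None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    ring = m.group(1)
    signaled = int(m.group(2))
    emitted = int(m.group(3))
    return ring, emitted - signaled


if __name__ == "__main__":
    line = ("[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
            "signaled seq=168813, emitted seq=168816")
    print(pending_fences(line))
```

For the line logged here this gives the gfx ring with a difference of 3.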
Comment 1 Artem S. Tashkinov 2024-08-01 17:22:49 UTC
Please report here instead: https://gitlab.freedesktop.org/drm/amd/-/issues
