Bug 212739 - [amdgpu] Sporadic GPU errors, screen artifacts and GPU-induced system lockups on Vega 10 (Raven Ridge)
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-21 06:49 UTC by tunas
Modified: 2021-07-26 09:12 UTC

See Also:
Kernel Version: 5.11.14-1, 5.12.rc7.d0411.gd434405-1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Example of GPU artifacts from the recoverable variant of this error (115.44 KB, image/jpeg)
2021-04-21 06:49 UTC, tunas

Description tunas 2021-04-21 06:49:18 UTC
Created attachment 296449
Example of GPU artifacts from the recoverable variant of this error

From time to time, the amdgpu driver reports a page fault, as shown below. The process named in the fault varies: sometimes pid 0, sometimes the web browser, the screen compositor, Xorg, a video player, etc.

>kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:0, for process  pid 0 thread  pid 0)
>kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800101606000 from client 27
>kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00401031
>kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
>kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
>kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
>kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
>kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
>kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
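For anyone comparing status words across several faults, the raw VM_L2_PROTECTION_FAULT_STATUS value can be split into the same fields the driver prints. The sketch below is only illustrative: the bit layout is an assumption (the report does not state it), chosen because it reproduces the decoded values in the log above, and `decode_fault_status` is a hypothetical helper name.

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS on GFX9 (not stated in
# the report). Each entry is (field name, lowest bit, width in bits).
FAULT_STATUS_FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),
    ("RW", 18, 1),
    ("VMID", 20, 4),
]


def decode_fault_status(status: int) -> dict:
    """Split a raw status word into named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FAULT_STATUS_FIELDS
    }


# Status word taken from the log excerpt in this report.
fields = decode_fault_status(0x00401031)
for name, value in fields.items():
    print(f"{name}: {value:#x}")
```

Under this assumed layout, the decoded CID is 0x8 and the VMID is 0x4, which agrees with the `TCP (0x8)` and `vmid:4` values in the quoted messages.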

This message is repeated several thousand times in dmesg (with "x callbacks suppressed" notices in between), each time with a different address of the form 0x80010160Y000, where Y is a hex digit between 1 and 8.
In the meantime, the display is completely frozen: input still goes through and music keeps playing, but the screen is static.
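To check which pages are faulting and how often, the addresses can be pulled out of a saved kernel log. This is a minimal sketch, assuming each message sits on one line as it would in `dmesg` output; `SAMPLE_LOG` is a hypothetical excerpt (only the first address comes from this report) and `count_fault_pages` is an illustrative name.

```python
import re
from collections import Counter

# Hypothetical sample in the shape of the messages quoted in this report.
# Only 0x800101606000 is taken from the report; the other address is made up.
SAMPLE_LOG = """\
kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800101606000 from client 27
kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800101603000 from client 27
kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800101606000 from client 27
"""

ADDRESS_RE = re.compile(r"in page starting at address\s+(0x[0-9a-fA-F]+)")


def count_fault_pages(log_text: str) -> Counter:
    """Count how often each faulting page address appears in the log."""
    return Counter(int(m.group(1), 16) for m in ADDRESS_RE.finditer(log_text))


counts = count_fault_pages(SAMPLE_LOG)
for address, hits in sorted(counts.items()):
    print(f"{address:#x}: {hits}")
```

Because the kernel rate-limits these messages, any counts obtained this way are lower bounds on the real number of faults.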

Then, several seconds later, it's followed by:
>kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!

And finally,

>[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

After this, the computer resumes operation, but GPU artifacts remain on the screen (see the attached screenshot for an example).

Alternatively, instead of the soft recovery message, the GPU sometimes fails to recover and the following messages appear in the kernel log:

>kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
>kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=3356413, emitted seq=3356415
>kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 14524 thread Xorg:cs0 pid 14539
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
>kernel: [drm] free PSP TMR buffer
>kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
>kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
>kernel: [drm] PSP is resuming...
>kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
>kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
>kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
>kernel: [drm] kiq ring mec 2 pipe 1 q 0
>kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
>kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(4) failed
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110

at which point rebooting is necessary as the GPU will not resume operation.
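When collecting logs from several occurrences, it may help to sort them into the two variants described here. The following is a rough sketch, assuming the marker strings quoted in this report are representative; the outcome labels and the `classify_outcome` name are hypothetical.

```python
import re

# Marker strings taken from the log excerpts in this report.
SOFT_RECOVERED_RE = re.compile(r"ring \w+ timeout, but soft\s+recovered")
RESET_FAILED_RE = re.compile(r"GPU reset\(\d+\) failed")


def classify_outcome(log_text: str) -> str:
    """Label a log excerpt as one of the two variants (labels are made up)."""
    # A failed reset takes priority: it is the variant that needs a reboot.
    if RESET_FAILED_RE.search(log_text):
        return "reset-failed"
    if SOFT_RECOVERED_RE.search(log_text):
        return "soft-recovered"
    return "unknown"


recoverable = "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered"
fatal = "kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(4) failed"

print(classify_outcome(recoverable))
print(classify_outcome(fatal))
```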

This also happens on the latest 5.12 release candidate (rc7 at the time of writing).
