Bug 216224 - AMDGPU fails to reset RX 480 after Ring GFX timeout
Status: RESOLVED ANSWERED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: Intel Linux
Importance: P1 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-07-09 06:26 UTC by happysmash27
Modified: 2022-07-09 12:27 UTC
0 users

See Also:
Kernel Version: 5.18.0 and earlier versions
Subsystem:
Regression: No
Bisected commit-id:


Attachments
SteamVR-induced crash 2022-07-08 (329.03 KB, application/octet-stream)
2022-07-09 06:26 UTC, happysmash27
SteamVR-induced crash 2022-06-18 (177.03 KB, application/octet-stream)
2022-07-09 06:28 UTC, happysmash27
SteamVR-induced crash 2022-05-26 (151.09 KB, application/octet-stream)
2022-07-09 06:29 UTC, happysmash27
GPU crash 2022-05-24 (335.40 KB, application/octet-stream)
2022-07-09 06:32 UTC, happysmash27

Description happysmash27 2022-07-09 06:26:53 UTC
Created attachment 301374
SteamVR-induced crash 2022-07-08

This is perhaps the worst bug I have ever experienced, and it has been going on for a couple of years now. A ring GFX timeout can be triggered fairly reliably in two ways: by launching SteamVR while anything else is using the GPU (even having Waterfox open usually causes a crash), or by running an Ethereum miner (every one I have tried does this, some worse than others) while something else such as Blender is using the GPU (moving around in a 3D scene should do it). Other things trigger it as well (it once happened while running Sway and LXDE at the same time), but Ethereum miners and SteamVR seem to be the programs that trigger it most reliably. I imagine running both at once would be a guaranteed GPU meltdown.

When the bug occurs, something like: 

[1767311.339261] [drm:amdgpu_job_timedout] *ERROR* ring comp_1.0.0 timeout, signaled seq=23354655, emitted seq=23354657
[1767311.339274] [drm:amdgpu_job_timedout] *ERROR* Process information: process vrcompositor pid 7701 thread vrcompositor pid 7701
[1767311.339279] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!

or: 

[57492.984178] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=9450226, emitted seq=9450228
[57492.984189] [drm:amdgpu_job_timedout] *ERROR* Process information: process vrcompositor pid 8198 thread vrcompositor pid 8198
[57492.984194] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!

or: 

[112094.633679] [drm:dm_vblank_get_counter] *ERROR* dc_stream_state is NULL for crtc '2'!
[112094.633696] [drm:dm_crtc_get_scanoutpos] *ERROR* dc_stream_state is NULL for crtc '2'!
[112094.633699] [drm:dm_vblank_get_counter] *ERROR* dc_stream_state is NULL for crtc '2'!
[112094.633703] ------------[ cut here ]------------
[112094.633704] amdgpu 0000:03:00.0: drm_WARN_ON_ONCE(drm_drv_uses_atomic_modeset(dev))
[112094.633735] WARNING: CPU: 9 PID: 12640 at drivers/gpu/drm/drm_vblank.c:728 drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x331/0x340
[112094.633741] Modules linked in: uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc bpfilter btusb btrtl btbcm btintel mptsas mptscsih mptbase
[112094.633754] CPU: 9 PID: 12640 Comm: VulkanVblankThr Not tainted 5.18.0-gentoo #1
[112094.633757] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1     03/17/2012
[112094.633758] RIP: 0010:drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x331/0x340
[112094.633762] Code: 4c 8b 6f 50 4d 85 ed 75 03 4c 8b 2f e8 28 81 32 00 48 c7 c1 f0 2a 2d 83 4c 89 ea 48 c7 c7 f2 b0 2c 83 48 89 c6 e8 67 b6 b8 00 <0f> 0b e9 d0 fe ff ff e8 23 ec c3 00 0f 1f 00 48 8b 87 b0 01 00 00
[112094.633764] RSP: 0018:ffffc9000b1e7bb0 EFLAGS: 00010082
[112094.633766] RAX: 0000000000000000 RBX: ffffffff81aeae50 RCX: 0000000000000000
[112094.633767] RDX: 0000000000000003 RSI: ffffffff8330cd18 RDI: 00000000ffffffff
[112094.633769] RBP: ffffc9000b1e7c20 R08: ffffffff837653c8 R09: 00000000ffffdfff
[112094.633770] R10: ffffffff836853e0 R11: ffffffff8373edb8 R12: 0000000000000000
[112094.633771] R13: ffff888103ee7380 R14: 0000000000000003 R15: ffff888e448e6b08
[112094.633773] FS:  00007fcf587f8640(0000) GS:ffff889dffac0000(0000) knlGS:0000000000000000
[112094.633775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[112094.633776] CR2: 00007fceec001258 CR3: 0000000363f60003 CR4: 00000000000206e0
[112094.633778] Call Trace:
[112094.633780]  <TASK>
[112094.633783]  drm_get_last_vbltimestamp+0xa5/0xb0
[112094.633789]  drm_update_vblank_count+0x7f/0x3c0
[112094.633793]  drm_vblank_enable+0x13e/0x170
[112094.633796]  drm_vblank_get+0x8b/0xd0
[112094.633798]  drm_crtc_queue_sequence_ioctl+0xee/0x2a0
[112094.633801]  ? drm_crtc_get_sequence_ioctl+0x190/0x190
[112094.633803]  drm_ioctl_kernel+0xac/0x140
[112094.633809]  drm_ioctl+0x1f5/0x3c0
[112094.633812]  ? drm_crtc_get_sequence_ioctl+0x190/0x190
[112094.633814]  ? ioctl_has_perm.constprop.0.isra.0+0xb4/0x110
[112094.633821]  amdgpu_drm_ioctl+0x44/0x80
[112094.633825]  __x64_sys_ioctl+0x7d/0xb0
[112094.633830]  do_syscall_64+0x3b/0x90
[112094.633833]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[112094.633838] RIP: 0033:0x7fcf8d9f2f0b
[112094.633840] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1b 48 8b 44 24 18 64 48 2b 04 25 28 00
[112094.633842] RSP: 002b:00007fcf587f7aa0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[112094.633844] RAX: ffffffffffffffda RBX: 00007fcf587f7b30 RCX: 00007fcf8d9f2f0b
[112094.633845] RDX: 00007fcf587f7b30 RSI: 00000000c018643c RDI: 000000000000004c
[112094.633847] RBP: 00000000c018643c R08: 0000000000000000 R09: 00007fceec000be0
[112094.633848] R10: 00007fcf89fc5ba0 R11: 0000000000000246 R12: 000055e7852de508
[112094.633849] R13: 000000000000004c R14: 000055e785460fd0 R15: 000055e7852de4c0
[112094.633852]  </TASK>
[112094.633852] ---[ end trace 0000000000000000 ]---
[112094.633854] [drm:dm_vblank_get_counter] *ERROR* dc_stream_state is NULL for crtc '2'!
[112094.633857] [drm:dm_crtc_get_scanoutpos] *ERROR* dc_stream_state is NULL for crtc '2'!
[112094.633859] [drm:dm_vblank_get_counter] *ERROR* dc_stream_state is NULL for crtc '2'!
[112114.167264] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=3094506, emitted seq=3094508
[112114.167276] [drm:amdgpu_job_timedout] *ERROR* Process information: process vrcompositor pid 12498 thread vrcompositor pid 12498
[112114.167281] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[112148.281473] elogind-daemon[4528]: New session 8 of user happysmash27.
[112174.774130] elogind-daemon[4528]: New session 9 of user happysmash27.
[112192.196386] Discord[5027]: segfault at 0 ip 00007ff10741e0d6 sp 00007ff1073fdd00 error 4 in discord_utils.node[7ff107400000+8d000]
[112192.196413] Code: 38 48 83 ff 01 77 09 49 8d 70 ff 48 21 ce eb 13 48 89 ce 4c 39 c1 72 0b 48 89 c8 31 d2 49 f7 f0 48 89 d6 48 8b 05 5a 47 27 00 <48> 8b 04 f0 48 85 c0 74 76 48 8b 18 48 85 db 74 6e 4d 8d 48 ff eb

will occur, all on kernel 5.18.0, but various similar issues also occurred on earlier kernel versions. These more recent ones are all triggered by SteamVR. On older versions (this one is 5.17.4), Ethereum miners (if I am reading this log properly) would trigger something like:

[3986931.032554] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=296571651, emitted seq=296571653
[3986931.032566] [drm:amdgpu_job_timedout] *ERROR* Process information: process blender-3.3 pid 11711 thread blender-3.:cs0 pid 11747
[3986931.032571] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[3986931.291361] amdgpu 0000:03:00.0: amdgpu: BACO reset
[3986931.477687] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[3986931.478926] [drm] PCIE GART of 256M enabled (table at 0x000000F400500000).
[3986931.478942] [drm] VRAM is lost due to GPU reset!
[3986933.521790] amdgpu:
                  failed to send message 200 ret is 0
[3986937.653740] amdgpu:
                  last message was failed ret is 0
[3986939.720118] amdgpu:
                  failed to send message 100 ret is 0
[3986941.821883] amdgpu:
                  last message was failed ret is 0
[3986941.825119] amdgpu: SMU Firmware start failed!
[3986941.825123] amdgpu: Failed to load SMU ucode.
[3986941.825125] amdgpu: fw load failed
[3986941.825126] amdgpu: smu firmware loading failed
[3986941.825129] [drm] Skip scheduling IBs!
[3986941.825131] [drm] Skip scheduling IBs!
[3986941.825140] [drm] Skip scheduling IBs!
[3986941.825145] [drm] Skip scheduling IBs!
[3986941.825160] amdgpu 0000:03:00.0: amdgpu: GPU reset(2) failed

and a large number of other messages, many repeating, including:

[3986941.825198] [drm] Skip scheduling IBs!

[3986941.825315] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!

and

[3986963.354537] amdgpu:
                  failed to send message 200 ret is 0
[3986965.419403] amdgpu:
                  last message was failed ret is 0

In all of these cases `cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover` will generally either return some vague number (-11, IIRC) or just freeze, as happened with my most recent SteamVR-induced freeze today. If frozen, cat cannot be killed with signal 2 (SIGINT) or signal 9 (SIGKILL), as it will be in the D (uninterruptible sleep) state. X and sometimes vrcompositor cannot be killed either, as they will also be in the D state. Shutdown does nothing, so the only way to reboot is to manually stop all processes and then use the magic SysRq sequence to shut everything down. In one instance even that didn't work and I had to hold the power key. SSH keeps working fine, as in other bugs related to ring GFX timeouts.
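
For anyone checking which processes are stuck after a failed reset, here is a minimal sketch (not part of the original report) that lists processes in the D state by reading `/proc/<pid>/stat`. The function names `parse_stat_line` and `uninterruptible_processes` are made up for illustration; the only assumption about the system is the standard Linux procfs layout.

```python
import os


def parse_stat_line(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Run over SSH after a hang, this would be expected to show entries such as cat, X, or vrcompositor if they are blocked as described above. Only procfs is read, so the script does not touch the GPU itself.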

The subject of this bug isn't the ring GFX timeout itself, but the fact that the GPU reset almost never works. It has worked a couple of times (after manually restarting X from SSH, of course), but 90% of the time it hangs and does not reset successfully.

I will attach several dmesg logs (with color) from the times this has happened in the past couple of months; if needed, I have logs going all the way back to February 2020 as well.
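
To compare logs like the ones quoted above, the following is a rough sketch (not part of the original report) that extracts each ring timeout and classifies the reset outcome. It matches only the message formats that appear in this report and assumes the color escape codes have been stripped from the log first. The function name `summarize` and the outcome labels are made up for illustration.

```python
import re
import sys

# Matches lines such as:
#   [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=1, emitted seq=3
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
# Matches "GPU reset(2) failed" with any attempt number.
RESET_FAILED_RE = re.compile(r"GPU reset\(\d+\) failed")


def summarize(lines):
    """Return (timeouts, outcome) for an iterable of dmesg lines."""
    timeouts = []
    began = succeeded = failed = False
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            signaled = int(match.group("signaled"))
            emitted = int(match.group("emitted"))
            timeouts.append({
                "ring": match.group("ring"),
                "signaled": signaled,
                "emitted": emitted,
                "gap": emitted - signaled,  # difference between the two counters
            })
        elif "GPU reset begin!" in line:
            began = True
        elif "GPU reset succeeded" in line:
            succeeded = True
        elif RESET_FAILED_RE.search(line):
            failed = True

    # A later failure overrides an earlier "succeeded, trying to resume".
    if failed:
        outcome = "failed"
    elif succeeded:
        outcome = "succeeded"
    elif began:
        outcome = "no result logged"
    else:
        outcome = "no reset attempted"
    return timeouts, outcome


if __name__ == "__main__":
    found, result = summarize(sys.stdin)
    for item in found:
        print(item["ring"], item["signaled"], item["emitted"], item["gap"])
    print("reset outcome:", result)
```

It could be fed with something like `dmesg | python3 summarize.py`, where the file name is arbitrary. The 5.18.0 excerpts above stop at "GPU reset begin!" and would be reported as "no result logged", while the 5.17.4 excerpt would be reported as "failed".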
Comment 1 happysmash27 2022-07-09 06:28:04 UTC
Created attachment 301375
SteamVR-induced crash 2022-06-18
Comment 2 happysmash27 2022-07-09 06:29:25 UTC
Created attachment 301376
SteamVR-induced crash 2022-05-26
Comment 3 happysmash27 2022-07-09 06:32:39 UTC
Created attachment 301377
GPU crash 2022-05-24

This was most likely caused by lolMiner on kernel 5.17.4: I upgraded to 5.18.0 on the same day this log was produced, most likely via SSH, to take advantage of the reboot opportunity.
Comment 4 Artem S. Tashkinov 2022-07-09 12:27:20 UTC
Please report to https://gitlab.freedesktop.org/drm/amd/-/issues
