Bug 210201

Summary: [amdgpu] crash when playing after suspend/resume
Product: Drivers
Reporter: Artur Bac (arturbac.ab)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED WILL_NOT_FIX
Severity: normal
CC: arturbac.ab
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.9.8, 5.6.19, 5.8.18
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Full dmesg after video crash

Description Artur Bac 2020-11-14 16:45:51 UTC
Created attachment 293669
Full dmesg after video crash

When I play Vulkan games such as Kerbal Space Program or Red Dead Redemption 2 (via Proton) after resuming from suspend, the graphics driver crashes after 30 minutes to 1 hour of play.

OS: Gentoo 
Kernel: x86_64 Linux 5.9.8
Resolution: 7680x2160 (2 monitors attached 4K free sync)
DE: KDE 5.75.0 / Plasma 5.20.3
WM: KWin
GTK Theme: Adwaita [GTK2/3]
CPU: AMD Ryzen 9 3900X 12-Core @ 24x 3.8GHz
GPU: AMD Radeon RX 5700 XT (NAVI10, DRM 3.39.0, 5.9.8, LLVM 10.0.1) Mesa 20.2.2
RAM: 32038MiB

Full dmesg attached.

[104307.850190] amdgpu 0000:0f:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[104307.850192] amdgpu 0000:0f:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[104307.850194] amdgpu 0000:0f:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[104307.850195] amdgpu 0000:0f:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[104307.850196] amdgpu 0000:0f:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[104307.850198] amdgpu 0000:0f:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[104307.850199] amdgpu 0000:0f:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[104307.850201] amdgpu 0000:0f:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[104307.850202] amdgpu 0000:0f:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[104307.850203] amdgpu 0000:0f:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[104307.850205] amdgpu 0000:0f:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[104307.850206] amdgpu 0000:0f:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[104307.850208] amdgpu 0000:0f:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[104307.850209] amdgpu 0000:0f:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[104307.850210] amdgpu 0000:0f:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
[104307.850212] amdgpu 0000:0f:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[104307.852128] [drm] recover vram bo from shadow start
[104307.872340] [drm] recover vram bo from shadow done
[104307.872342] [drm] Skip scheduling IBs!
[104307.872343] [drm] Skip scheduling IBs!
[104307.872357] [drm] Skip scheduling IBs!
[104307.872362] amdgpu 0000:0f:00.0: amdgpu: GPU reset(2) succeeded!
[104307.872373] [drm] Skip scheduling IBs!
[repeated many times...]
[104307.872440] [drm] Skip scheduling IBs!
[104314.769174] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-16)
[104314.795600] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-16)
[104314.847946] amdgpu 0000:0f:00.0: amdgpu: failed to clear page tables on GEM object close (-16)
[repeated many times...]
[104315.300254] amdgpu 0000:0f:00.0: amdgpu: failed to clear page tables on GEM object close (-16)
[104325.487235] GpuWatchdog[731778]: segfault at 0 ip 00007f9be2bf92dd sp 00007f9bd77ed670 error 6 in libcef.so[7f9bdee73000+69a4000]
[104325.488266] Code: 00 79 09 48 8b 7d a0 e8 21 80 c1 02 41 8b 85 00 01 00 00 85 c0 0f 84 ab 00 00 00 49 8b 45 00 4c 89 ef be 01 00 00 00 ff 50 58 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 c1 a5 37 03 01 80 bd 7f ff
[104335.590624] GpuWatchdog[731809]: segfault at 0 ip 00007f494f6142dd sp 00007f4944208670 error 6 in libcef.so[7f494b88e000+69a4000]
[104335.590631] Code: 00 79 09 48 8b 7d a0 e8 21 80 c1 02 41 8b 85 00 01 00 00 85 c0 0f 84 ab 00 00 00 49 8b 45 00 4c 89 ef be 01 00 00 00 ff 50 58 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 c1 a5 37 03 01 80 bd 7f ff
[104345.690401] GpuWatchdog[731833]: segfault at 0 ip 00007fcbb4c722dd sp 00007fcba9866670 error 6 in libcef.so[7fcbb0eec000+69a4000]
[104345.692176] Code: 00 79 09 48 8b 7d a0 e8 21 80 c1 02 41 8b 85 00 01 00 00 85 c0 0f 84 ab 00 00 00 49 8b 45 00 4c 89 ef be 01 00 00 00 ff 50 58 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 c1 a5 37 03 01 80 bd 7f ff
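
For readers of the log above: the trailing "(-16)" on the BO_VA and page-table lines looks like a negated errno value, as kernel messages commonly print. A minimal sketch for decoding it, assuming that convention holds for these messages (the helper name decode_errno is hypothetical, and the sample lines are copied from the excerpt above):

```python
import errno
import re

# Sample lines copied from the dmesg excerpt in this report.
LINES = [
    "[104314.769174] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-16)",
    "[104314.847946] amdgpu 0000:0f:00.0: amdgpu: failed to clear page tables on GEM object close (-16)",
]

# Matches a trailing "(-N)", assumed here to be a negated errno value.
ERRNO_RE = re.compile(r"\(-(\d+)\)\s*$")


def decode_errno(line):
    """Return the symbolic errno name for a trailing "(-N)", or None if absent."""
    match = ERRNO_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return errno.errorcode.get(code, "UNKNOWN")


for line in LINES:
    print(decode_errno(line))
```

Under that assumption, both lines decode to EBUSY, which would suggest the operations were refused because the device or resource was busy rather than because of an invalid request.
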
Comment 1 Artur Bac 2020-11-19 17:17:07 UTC
Interestingly, people report a similar crash after about 1 hour on Windows. Does amdgpu share code with the Windows driver?

https://www.reddit.com/r/AMDHelp/comments/jx4660/crash_rx5700xt
Comment 2 Artur Bac 2021-01-06 22:10:17 UTC
I can confirm this bug occurs only with a Clang-compiled kernel.
A kernel compiled with GNU GCC works fine.
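
For anyone trying to reproduce this, it may help to confirm which compiler built the running kernel. A minimal sketch, assuming the compiler is named in the banner in /proc/version (the function name kernel_compiler and the sample banner strings in the usage note are hypothetical, not taken from the affected machine):

```python
import re
from pathlib import Path


def kernel_compiler(version_text):
    """Return 'clang', 'gcc', or 'unknown' from a /proc/version-style banner."""
    text = version_text.lower()
    # Check clang first, since the check for gcc is a plain word match.
    if "clang" in text:
        return "clang"
    if re.search(r"\bgcc\b", text):
        return "gcc"
    return "unknown"


if __name__ == "__main__":
    proc_version = Path("/proc/version")
    if proc_version.exists():
        # Prints the compiler family detected for the running kernel.
        print(kernel_compiler(proc_version.read_text()))
    else:
        print("unknown")
```

As a usage illustration, kernel_compiler("Linux version 5.9.8 (root@host) (clang version 11.0.0)") would classify that made-up banner as a Clang build. The exact banner format varies between kernel versions and distributions, so treat the result as a hint only.
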