Bug 207463

Summary: [amdgpu] System freeze / corrupted graphics
Product: Drivers Reporter: Rokas Kupstys (rokups+kernel-bugs)
Component: Video(DRI - non Intel) Assignee: drivers_video-dri
Status: RESOLVED MOVED    
Severity: high CC: alexdeucher
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.6.7-arch1-1 Subsystem:
Regression: No Bisected commit-id:
Attachments: Corrupted graphics

Description Rokas Kupstys 2020-04-27 13:47:58 UTC
Created attachment 288763 [details]
Corrupted graphics

An application I am working on somehow causes the AMDGPU driver to fail, which results in a soft-ish system lockup. Graphics get corrupted, but the system still reacts to some keypresses (I can switch VT, and SysRq works).

The problem is *always* reproducible.

Steps to reproduce:
git clone --branch=hdpi-support-fontscale-viewport https://github.com/rokups/imgui   
cd imgui/examples/example_sdl_opengl3 
make
./example_sdl_opengl3

Interact with the program a bit and try to drag windows outside of the main viewport. The issue happens within seconds.


bal. 27 16:30:21 rk-PC systemd-coredump[15701]: Process 15649 (example_sdl_ope) of user 1000 dumped core.
                                                
                                                Stack trace of thread 15649:
                                                #0  0x00007fed0b12ace5 raise (libc.so.6 + 0x3bce5)
                                                #1  0x00007fed0b114857 abort (libc.so.6 + 0x25857)
                                                #2  0x00007fed0b16e2b0 __libc_message (libc.so.6 + 0x7f2b0)
                                                #3  0x00007fed0b1fe06a __fortify_fail (libc.so.6 + 0x10f06a)
                                                #4  0x00007fed0b1fc904 __chk_fail (libc.so.6 + 0x10d904)
                                                #5  0x00007fed085ab0b3 n/a (radeonsi_dri.so + 0x6340b3)
                                                #6  0x00007fed080f28b7 n/a (radeonsi_dri.so + 0x17b8b7)
                                                #7  0x00007fed080b71b8 n/a (radeonsi_dri.so + 0x1401b8)
                                                #8  0x00007fed080a8fcb n/a (radeonsi_dri.so + 0x131fcb)
                                                #9  0x00007fed082e8a63 n/a (radeonsi_dri.so + 0x371a63)
                                                #10 0x00007fed082e8e82 n/a (radeonsi_dri.so + 0x371e82)
                                                #11 0x00005564903169b5 n/a (/home/rk/src/games/Libs/imgui/cmake-build-debug/bin/example_sdl_opengl3 + 0x2a9b5)
                                                #12 0x0000556490313e69 n/a (/home/rk/src/games/Libs/imgui/cmake-build-debug/bin/example_sdl_opengl3 + 0x27e69)
                                                #13 0x00007fed0b116023 __libc_start_main (libc.so.6 + 0x27023)
                                                #14 0x00005564903136ce n/a (/home/rk/src/games/Libs/imgui/cmake-build-debug/bin/example_sdl_opengl3 + 0x276ce)

<...>

bal. 27 16:30:33 rk-PC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
bal. 27 16:30:37 rk-PC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
bal. 27 16:30:38 rk-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=542239, emitted seq=542241
bal. 27 16:30:38 rk-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process example_sdl_ope pid 15735 thread example_sd:cs0 pid 15741
bal. 27 16:30:38 rk-PC kernel: amdgpu 0000:08:00.0: GPU reset begin!
bal. 27 16:30:38 rk-PC kernel: amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
bal. 27 16:30:38 rk-PC kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
bal. 27 16:30:39 rk-PC kernel: cp is busy, skip halt cp
bal. 27 16:30:39 rk-PC kernel: rlc is busy, skip halt rlc
bal. 27 16:30:39 rk-PC kernel: amdgpu 0000:08:00.0: GPU BACO reset
bal. 27 16:30:39 rk-PC kernel: amdgpu 0000:08:00.0: GPU reset succeeded, trying to resume
bal. 27 16:30:39 rk-PC kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
bal. 27 16:30:39 rk-PC kernel: [drm] VRAM is lost due to GPU reset!
bal. 27 16:30:40 rk-PC kernel: [drm] UVD and UVD ENC initialized successfully.
bal. 27 16:30:40 rk-PC rtkit-daemon[7614]: Supervising 8 threads of 5 processes of 1 users.
bal. 27 16:30:40 rk-PC rtkit-daemon[7614]: Successfully made thread 15789 of process 7410 owned by '1000' RT at priority 5.
bal. 27 16:30:40 rk-PC rtkit-daemon[7614]: Supervising 9 threads of 5 processes of 1 users.
bal. 27 16:30:40 rk-PC kernel: [drm] VCE initialized successfully.
bal. 27 16:30:40 rk-PC kernel: [drm] recover vram bo from shadow start
bal. 27 16:30:40 rk-PC kernel: [drm] recover vram bo from shadow done
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: [drm] Skip scheduling IBs!
bal. 27 16:30:40 rk-PC kernel: amdgpu 0000:08:00.0: GPU reset(2) succeeded!
bal. 27 16:30:41 rk-PC audit[2071]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2071 comm="Xorg:gdrv0" exe="/usr/lib/Xorg" sig=11 res=1
bal. 27 16:30:41 rk-PC kernel: audit: type=1701 audit(1587994241.386:339): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2071 comm="Xorg:gdrv0" exe="/usr/lib/Xorg" sig=11 res=1
bal. 27 16:30:41 rk-PC audit: AUDIT1334 prog-id=35 op=LOAD
bal. 27 16:30:41 rk-PC kernel: audit: type=1334 audit(1587994241.399:340): prog-id=35 op=LOAD
bal. 27 16:30:41 rk-PC audit: AUDIT1334 prog-id=36 op=LOAD
bal. 27 16:30:41 rk-PC systemd[1]: Started Process Core Dump (PID 15795/UID 0).
bal. 27 16:30:41 rk-PC audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-15795-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
bal. 27 16:30:41 rk-PC kernel: audit: type=1334 audit(1587994241.399:341): prog-id=36 op=LOAD
bal. 27 16:30:41 rk-PC kernel: audit: type=1130 audit(1587994241.399:342): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-15795-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
bal. 27 16:30:41 rk-PC krunner[13252]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC konsole[11554]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC pulseaudio[7410]: X connection to :0 broken (explicit kill or server shutdown).
bal. 27 16:30:41 rk-PC kdeinit5[7253]: kdeinit5: Fatal IO error: client killed
bal. 27 16:30:41 rk-PC org_kde_powerdevil[7414]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC kdeinit5[7253]: kdeinit5: sending SIGHUP to children.
bal. 27 16:30:41 rk-PC DiscoverNotifier[7388]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC xembedsniproxy[7335]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC plasmashell[7333]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC kscreen_backend_launcher[7301]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC kactivitymanagerd[7395]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC kernel: audit: type=1701 audit(1587994241.892:343): auid=1000 uid=1000 gid=100 ses=1 pid=11413 comm="spotify" exe="/opt/spotify/spotify" sig=6 res=1
bal. 27 16:30:41 rk-PC audit[11413]: ANOM_ABEND auid=1000 uid=1000 gid=100 ses=1 pid=11413 comm="spotify" exe="/opt/spotify/spotify" sig=6 res=1
bal. 27 16:30:41 rk-PC at-spi-bus-launcher[7858]: X connection to :0 broken (explicit kill or server shutdown).
bal. 27 16:30:41 rk-PC ksmserver[7322]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC python[7420]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC kwalletd5[7086]: The X11 connection broke (error 1). Did the X11 server die?
bal. 27 16:30:41 rk-PC gmenudbusmenuproxy[7385]: The X11 connection broke (error 1). Did the X11 server die?

Comment 1 Rokas Kupstys 2020-04-27 14:01:36 UTC
I tested the 5.4.35-1-lts kernel as well. The corruption still happens, but it looks a bit different visually. Also, I can access another VT without issues, and rendering is OK there. Restarting X11 does not recover the system; a reboot is still needed.

I also forgot to specify my GPU: AMD RX 580

Kernel command line: initrd=\amd-ucode.img initrd=\initramfs-linux-lts.img rd.luks.name=<...>=cryptolvm rd.luks.options=discard,keyfile-timeout=10s rd.luks.key=<...>=/keys/root.key:UUID=<...> root=/dev/mapper/system-root resume=/dev/mapper/system-swap fastboot rw amd_iommu=on amd_iommu=pt nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 rcu_nocb_poll user_namespace.enable=1

Some info from early boot, in case it is useful:

bal. 27 15:46:49 archlinux kernel: [drm] amdgpu kernel modesetting enabled.
bal. 27 15:46:49 archlinux kernel: fb0: switching to amdgpudrmfb from EFI VGA
bal. 27 15:46:49 archlinux kernel: amdgpu 0000:08:00.0: vgaarb: deactivate vga console
bal. 27 15:46:49 archlinux kernel: [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1462:0x3417 0xE7).
bal. 27 15:46:49 archlinux kernel: [drm] register mmio base: 0xF7C00000
bal. 27 15:46:49 archlinux kernel: [drm] register mmio size: 262144
bal. 27 15:46:49 archlinux kernel: [drm] add ip block number 0 <vi_common>
bal. 27 15:46:49 archlinux kernel: [drm] add ip block number 1 <gmc_v8_0>
bal. 27 15:46:49 archlinux kernel: [drm] add ip block number 2 <tonga_ih>
bal. 27 15:46:49 archlinux kernel: [drm] add ip block number 3 <gfx_v8_0>
bal. 27 15:46:49 archlinux kernel: [drm] add ip block number 4 <sdma_v3_0>
bal. 27 15:46:49 archlinux kernel: [drm] add ip block number 5 <powerplay>
bal. 27 15:46:49 archlinux kernel: [drm] add ip block number 6 <dm>
bal. 27 15:46:49 archlinux kernel: [drm] add ip block number 7 <uvd_v6_0>
bal. 27 15:46:49 archlinux kernel: [drm] add ip block number 8 <vce_v3_0>
bal. 27 15:46:49 archlinux kernel: amdgpu 0000:08:00.0: No more image in the PCI ROM
bal. 27 15:46:49 archlinux kernel: [drm] UVD is enabled in VM mode
bal. 27 15:46:49 archlinux kernel: [drm] UVD ENC is enabled in VM mode
bal. 27 15:46:49 archlinux kernel: [drm] VCE enabled in VM mode
bal. 27 15:46:49 archlinux kernel: [drm] vm size is 512 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
bal. 27 15:46:49 archlinux kernel: amdgpu 0000:08:00.0: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
bal. 27 15:46:49 archlinux kernel: amdgpu 0000:08:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
bal. 27 15:46:49 archlinux kernel: [drm] Detected VRAM RAM=8192M, BAR=256M
bal. 27 15:46:49 archlinux kernel: [drm] RAM width 256bits GDDR5
bal. 27 15:46:49 archlinux kernel: [drm] amdgpu: 8192M of VRAM memory ready
bal. 27 15:46:49 archlinux kernel: [drm] amdgpu: 8192M of GTT memory ready.
bal. 27 15:46:49 archlinux kernel: [drm] GART: num cpu pages 65536, num gpu pages 65536
bal. 27 15:46:49 archlinux kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
bal. 27 15:46:49 archlinux kernel: [drm] Chained IB support enabled!
bal. 27 15:46:49 archlinux kernel: amdgpu: [powerplay] hwmgr_sw_init smu backed is polaris10_smu
bal. 27 15:46:49 archlinux kernel: [drm] Found UVD firmware Version: 1.130 Family ID: 16
bal. 27 15:46:49 archlinux kernel: [drm] Found VCE firmware Version: 53.26 Binary ID: 3
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB: values for Engine clock
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         300000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         600000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         927000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         1179000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         1251000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         1294000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         1339000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         1380000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB: Validation clocks:
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:    engine_max_clock: 138000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:    memory_max_clock: 200000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:    level           : 8
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB: values for Memory clock
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         300000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         1000000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:         2000000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB: Validation clocks:
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:    engine_max_clock: 138000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:    memory_max_clock: 200000
bal. 27 15:46:49 archlinux kernel: [drm] DM_PPLIB:    level           : 8
bal. 27 15:46:49 archlinux kernel: [drm] Display Core initialized with v3.2.69!
bal. 27 15:46:49 archlinux kernel: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
bal. 27 15:46:49 archlinux kernel: [drm] Driver supports precise vblank timestamp query.
bal. 27 15:46:49 archlinux kernel: [drm] UVD and UVD ENC initialized successfully.
bal. 27 15:46:49 archlinux kernel: [drm] VCE initialized successfully.
bal. 27 15:46:49 archlinux kernel: [drm] fb mappable at 0xE0830000
bal. 27 15:46:49 archlinux kernel: [drm] vram apper at 0xE0000000
bal. 27 15:46:49 archlinux kernel: [drm] size 14745600
bal. 27 15:46:49 archlinux kernel: [drm] fb depth is 24
bal. 27 15:46:49 archlinux kernel: [drm]    pitch is 10240
bal. 27 15:46:49 archlinux kernel: fbcon: amdgpudrmfb (fb0) is primary device
bal. 27 15:46:49 archlinux kernel: amdgpu 0000:08:00.0: fb0: amdgpudrmfb frame buffer device
bal. 27 15:46:49 archlinux systemd-modules-load[471]: Inserted module 'amdgpu'
bal. 27 15:46:49 archlinux kernel: [drm] Initialized amdgpu 3.36.0 20150101 for 0000:08:00.0 on minor 0
bal. 27 15:46:54 rk-PC systemd[1]: Condition check resulted in Load Kernel Module drm being skipped.
bal. 27 15:46:55 rk-PC kernel: snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
bal. 27 15:46:58 rk-PC kernel: amdgpu 0000:08:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem

Comment 2 Alex Deucher 2020-04-27 14:52:22 UTC
This is most likely an application or mesa bug.  The GPU has hung and the kernel driver has recovered it.  You'll need to restart your GUI after a GPU reset.
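
The hang-then-recover sequence described here can be read directly off the kernel messages quoted in the description. As a rough illustration only, the Python sketch below scans log text for the ring timeout, the start of the reset, the "VRAM is lost" notice, and the final "GPU reset(N) succeeded!" line. The function name `summarize_gpu_hang`, the `seq_gap` field, and the regular expressions are assumptions made for this example; they are matched against the message wording seen in this report and are not part of any amdgpu or Mesa tooling. Treating `emitted seq - signaled seq` as a count of outstanding submissions is likewise an assumption.

```python
import re

# Patterns modeled on the kernel messages quoted in this report.
# Other kernel versions may word these messages differently.
TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)
RESET_OK_RE = re.compile(r"GPU reset\(\d+\) succeeded!")
RESET_BEGIN = "GPU reset begin!"
VRAM_LOST = "VRAM is lost due to GPU reset!"


def summarize_gpu_hang(log_text):
    """Summarize the hang/reset sequence found in log_text (hypothetical helper)."""
    summary = {
        "ring": None,          # name of the ring that timed out, e.g. "gfx"
        "seq_gap": None,       # emitted seq minus signaled seq
        "reset_started": False,
        "reset_succeeded": False,
        "vram_lost": False,
    }
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match and summary["ring"] is None:
            # Record only the first timeout seen.
            summary["ring"] = match.group(1)
            summary["seq_gap"] = int(match.group(3)) - int(match.group(2))
        if RESET_BEGIN in line:
            summary["reset_started"] = True
        if RESET_OK_RE.search(line):
            summary["reset_succeeded"] = True
        if VRAM_LOST in line:
            summary["vram_lost"] = True
    return summary


# Four lines copied from the log in the description (timestamps removed).
SAMPLE = """\
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=542239, emitted seq=542241
amdgpu 0000:08:00.0: GPU reset begin!
[drm] VRAM is lost due to GPU reset!
amdgpu 0000:08:00.0: GPU reset(2) succeeded!
"""

if __name__ == "__main__":
    print(summarize_gpu_hang(SAMPLE))
```

On the sample lines, the summary reports a timeout on the `gfx` ring followed by a completed reset with VRAM lost, which matches the reading that the kernel driver recovered the GPU and that the GUI session has to be restarted afterwards.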

Comment 3 Rokas Kupstys 2020-05-04 12:29:47 UTC
This makes sense. I am pretty sure I never observed this issue with the LTS kernel in the past, and now it is present. I will report it on the Mesa bug tracker. Thank you for replying.