The screen freezes completely and does not respond even to key combinations like Ctrl+Alt+F1. I can still ssh into the device. It seems to happen more often when the system is cold :S. Similar to (maybe the same as) #151341.
Relevant dmesg:
[ 246.751055] INFO: task amdgpu_cs:0:530 blocked for more than 120 seconds.
[ 246.751067] Tainted: G O 4.15.0-qtec-standard #3
[ 246.751070] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.751074] amdgpu_cs:0 D 0 530 518 0x00080000
[ 246.751079] Call Trace:
[ 246.751107] ? __schedule+0x25c/0x860
[ 246.751113] ? dma_fence_default_wait+0x10b/0x280
[ 246.751115] ? dma_fence_default_wait+0x1c7/0x280
[ 246.751118] schedule+0x2f/0x90
[ 246.751122] schedule_timeout+0x1e0/0x430
[ 246.751237] ? amdgpu_vm_update_directories+0x2d/0x5d0 [amdgpu]
[ 246.751241] ? dma_fence_default_wait+0x10b/0x280
[ 246.751243] ? dma_fence_default_wait+0x1c7/0x280
[ 246.751246] dma_fence_default_wait+0x1f3/0x280
[ 246.751251] ? __kfifo_in+0x2e/0x40
[ 246.751254] ? dma_fence_default_wait+0x280/0x280
[ 246.751256] dma_fence_wait_timeout+0x2e/0x100
[ 246.751319] amdgpu_ctx_wait_prev_fence+0x49/0x80 [amdgpu]
[ 246.751378] amdgpu_cs_ioctl+0x26e/0x1a90 [amdgpu]
[ 246.751403] ? radix_tree_node_alloc.constprop.13+0x3d/0xc0
[ 246.751461] ? amdgpu_cs_find_mapping+0xc0/0xc0 [amdgpu]
[ 246.751493] drm_ioctl_kernel+0x59/0xb0 [drm]
[ 246.751514] drm_ioctl+0x29f/0x340 [drm]
[ 246.751572] ? amdgpu_cs_find_mapping+0xc0/0xc0 [amdgpu]
[ 246.751620] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 246.751626] do_vfs_ioctl+0x8e/0x680
[ 246.751632] ? SyS_futex+0x11d/0x150
[ 246.751635] SyS_ioctl+0x74/0x80
[ 246.751638] ? get_vtime_delta+0xe/0x40
[ 246.751642] do_syscall_64+0x6f/0x1c0
[ 246.751645] entry_SYSCALL64_slow_path+0x25/0x25
[ 246.751649] RIP: 0033:0x3c768e57e7
[ 246.751651] RSP: 002b:00007fd1220d0ba8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 246.751655] RAX: ffffffffffffffda RBX: 00007fd1220d0c88 RCX: 0000003c768e57e7
[ 246.751656] RDX: 00007fd1220d0c10 RSI: 00000000c0186444 RDI: 000000000000000c
[ 246.751658] RBP: 00007fd1220d0c10 R08: 00007fd1220d0cb0 R09: 00007fd1220d0c88
[ 246.751659] R10: 00007fd1220d0cb0 R11: 0000000000000246 R12: 00000000c0186444
[ 246.751661] R13: 000000000000000c R14: 0000000000000008 R15: 0000000000000000
Created attachment 274353 [details] dmesg
Created attachment 274355 [details] xorg log
Created attachment 274357 [details] xorg.conf
Could you please provide ftrace output for the following events? (For a HOWTO, see https://www.kernel.org/doc/Documentation/trace/ftrace.txt)
/sys/kernel/debug/tracing/events/dma_fence
/sys/kernel/debug/tracing/events/gpu_scheduler
In the amdgpu events folder (/sys/kernel/debug/tracing/events/amdgpu):
amdgpu_cs_ioctl
amdgpu_cs
Thanks, Andrey
Created attachment 274397 [details] trace.dat
Created attachment 274399 [details] trace.txt
Traces obtained with trace-cmd record -e dma_fence:* -e amdgpu:* -e gpu_sched:*
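For anyone who does not have trace-cmd on the target, the same event groups can probably be switched on through the tracing directory directly. This is only a minimal sketch: `enable_event_groups` is a hypothetical helper name, and it assumes the standard `events/<group>/enable` layout described in the ftrace HOWTO linked above. The scheduler group is spelled both `gpu_scheduler` and `gpu_sched` in this report, so the helper simply skips any group that is not present.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool mentioned in this report):
# enable whole ftrace event groups by writing 1 to
# <tracing_dir>/events/<group>/enable. Groups that are missing are skipped,
# because the scheduler group is spelled two different ways in this thread.
enable_event_groups() {
    tracing_dir=$1
    shift
    for group in "$@"; do
        enable_file="$tracing_dir/events/$group/enable"
        if [ -w "$enable_file" ]; then
            echo 1 > "$enable_file"
            echo "enabled $group"
        else
            echo "skipped $group (not present)"
        fi
    done
}

# Typical invocation as root (not executed here):
#   enable_event_groups /sys/kernel/debug/tracing dma_fence amdgpu gpu_sched gpu_scheduler
#   cat /sys/kernel/debug/tracing/trace_pipe > trace.txt
```

Taking the tracing directory as a parameter keeps the helper usable on systems where it is mounted somewhere other than /sys/kernel/debug/tracing.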
How do you reproduce it, and how often does it happen? Andrey
I just have to start up the system, and around 1 out of 10 times X will crash with no human intervention. X manages to render some of the screen, like the window decorator or the cursor, but the content of the xterm is missing. It is more likely to happen when the system is cold :S. I have rarely seen this happen after a hot reboot. Less often, a system that seems to be working fine crashes X after running dmesg (or any other command with a lot of text output) inside xterm. Once the system is frozen, there is no change on the screen and I cannot switch to a console with Ctrl+Alt+F1.
Are you starting X with startx, xinit or some desktop manager? Can you also provide the output of glxinfo? BTW, what distro are you running?
Distro: I am using the Yocto Project. I launch X with:
/etc/init.d/xserver-nodm start
which launches
su -l -c '/etc/xserver-nodm/Xserver &' $USER
which calls xinit internally. Attaching glxinfo.
Created attachment 274401 [details] glxinfo
Thanks for all the info, I will later try to reproduce it on my CZ setup. Andrey
Thanks, Andrey. Please bear in mind that #151341 seems closely related, so this bug might be happening on more platforms.
Yep, I noticed the other bug. Thanks, Andrey
(In reply to Ricardo Ribalda from comment #14)
> Please bare in mind that #151341 seems very related, [...]
What makes you think so? If it's that the backtraces look similar, that's just a generic symptom of a GPU hang, which can be caused by lots of different things.
The description of what happens and how, and the backtraces:
- able to log in remotely via ssh.
- I tried to reset the GPU by using /sys/kernel/debug/dri/0/amdgpu_gpu_reset, and the result is a NULL pointer dereference in the kernel. (I did that and had almost the same result.)
This is definitely outside my expertise :), I just don't want to waste two developers' time. Thanks again
(In reply to Ricardo Ribalda from comment #17)
> -able to login remotely via ssh.
>
> -I tried to reset the gpu by using /sys/kernel/debug/dri/0/amdgpu_gpu_reset,
> and the result is a NULL pointer dereference in the kernel. (I did that and
> had almost the same result)
That could happen with any GPU hang, no matter what caused it.
> This is definitely out of my expertise :), I just dont want to waste 2x
> developers time.
Then it's better to assume for the time being that they're separate issues.
I tried on my Ubuntu setup multiple times, both starting and stopping the desktop manager and with just xinit. I didn't observe any hang. This is the relevant SW info from my glxinfo:
OpenGL renderer string: AMD Radeon R7 Graphics (CARRIZO / DRM 3.25.0 / 4.15.0-rc4.main+, LLVM 7.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.1.0-devel (git-b494ed168c)
So I have the same kernel as you but use LLVM 7, while you have LLVM 5. We also differ in Mesa and probably libdrm. Could you try updating your Mesa and libdrm to the latest upstream and switching to LLVM 7? Thanks, Andrey
Did you power cycle your platform between attempts? By latest upstream, do you mean the tip of the git development branch? I updated to the latest releases of Mesa, libdrm, llvm and xorg-amdgpu around two weeks ago. Since there are many components involved, would it be an option to send you a root file system? Maybe we are hunting a BIOS or HW error. Please be patient, I will be at Embedded World the whole of next week. Thanks!
Yeah, you are right, I forgot all about the cold reset part you mentioned, so I will retry on Monday. In the meantime you can at least try switching to LLVM 7 and see if this makes things better. If it does, it can narrow down the problem. Thanks, Andrey
I tried cold resets around 15 times (taking out the power cord), running either the lightdm display manager or xinit directly, and haven't seen any hang. Let me know your firmware versions:
cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
Thanks, Andrey
I looked a bit more into the ftraces. A fence from the gfx ring (context=80, seqno=10), which is a drm_sched_fence.finished fence, is never signaled, so the next time the same slot (slot 10) needs to be reused, it waits forever for that fence to be signaled. The finished fence is signaled once the HW fence wrapping the IB is signaled, so it looks like the related IB never completed execution. You can try to generate a gfx ring dump with the UMR tool (https://cgit.freedesktop.org/amd/umr/). Also try applying the attached patch to your kernel to improve the UMR output. Once it is installed, please reproduce the issue and then attach the output of the following command:
sudo umr -O verbose,follow_ib -R gfx[0:2047]
Created attachment 274473 [details] UMR output improvement
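The fence analysis above (a fence on context=80, seqno=10 that is never signaled) can be approximated on a text trace with a small filter. This is a rough sketch, not the method used in the analysis: `unsignaled_fences` is a hypothetical name, and it assumes that each trace line carries `context=<n>` and `seqno=<n>` fields and that fence creation and completion show up as `dma_fence_init:` and `dma_fence_signaled:` events. Fences that were simply still in flight when the trace ended are reported as well, so the output is only a list of candidates.

```shell
#!/bin/sh
# Hypothetical helper: list fences that were initialised but never signaled
# in a text trace. The field names (context=, seqno=) and event names
# (dma_fence_init, dma_fence_signaled) are assumptions about the trace format.
unsignaled_fences() {
    awk '
    {
        ctx = ""; seq = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^context=/) ctx = substr($i, 9)   # text after "context="
            if ($i ~ /^seqno=/)   seq = substr($i, 7)   # text after "seqno="
        }
        if (ctx == "" || seq == "") next
        key = "context=" ctx " seqno=" seq
        if ($0 ~ /dma_fence_init:/)     pending[key] = 1
        if ($0 ~ /dma_fence_signaled:/) delete pending[key]
    }
    END { for (key in pending) print key }
    ' "$1" | sort
}

# Example (not executed here):
#   unsignaled_fences trace.txt
```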
Hi Andrey,
Back from Embedded World. My firmware versions:
root@qt5122:~# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x34040300
UVD feature version: 0, firmware version: 0x01570b00
MC feature version: 0, firmware version: 0x00000000
ME feature version: 37, firmware version: 0x00000092
PFP feature version: 37, firmware version: 0x000000d8
CE feature version: 37, firmware version: 0x0000007e
RLC feature version: 1, firmware version: 0x00000099
MEC feature version: 37, firmware version: 0x00000299
MEC2 feature version: 37, firmware version: 0x00000299
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x00000000
SDMA0 feature version: 0, firmware version: 0x00000022
SDMA1 feature version: 0, firmware version: 0x00000022
I am working on the umr debug. Thanks!
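To compare firmware versions between two boards, it may help to reduce the amdgpu_firmware_info output to one compact line per block. A minimal sketch, assuming the output has one entry per line in the form shown in this comment; `fw_versions` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: turn lines such as
#   "ME feature version: 37, firmware version: 0x00000092"
# into "ME 37 0x00000092". Lines in any other format are dropped.
fw_versions() {
    sed -n 's/^\([A-Z0-9]*\) feature version: \([0-9]*\), firmware version: \(0x[0-9a-f]*\)$/\1 \2 \3/p'
}

# Example (not executed here), comparing two saved dumps:
#   fw_versions < board_a.txt > a.ver
#   fw_versions < board_b.txt > b.ver
#   diff a.ver b.ver
```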
Created attachment 274535 [details] Umr dump 0
Created attachment 274537 [details] umr dump 1
Here you go!
umr dump 0 is a crash just after starting X.
umr dump 1 is a crash after running dmesg inside xterm, just after starting X.
Thanks!
Thanks for the logs. Just to clear things up: you keep saying 'crash', but you actually mean a hang, right? Or do you observe an actual Xorg process crash? To know for sure, check whether Xorg is still running when you run 'ps' or 'top', and check /var/log/Xorg.log.
Using UMR, please provide the following data:
sudo umr -O many,bits -r *.gfx80.mmGRBM_STATUS
sudo umr -O many,bits -r *.gfx80.HEADER_DUMP
sudo umr -O many,bits -r *.gfx80.CP_EOP
Hi Andrey,
I mean a stall, sorry about that.
root@qt5122:~# ps aux | grep Xorg
root 520 0.0 0.0 15936 900 ? S 07:59 0:00 xinit /etc/X11/Xsession -- /usr/bin/Xorg :0 -br -pn
root 541 0.2 1.3 859896 47412 ? S<sl 07:59 0:00 /usr/bin/Xorg :0 -br -pn
root 1072 0.0 0.0 8088 1052 pts/3 S+ 08:01 0:00 grep Xorg
root@qt5122:~# gdb -p 541
GNU gdb (GDB) 8.0
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-poky-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 541
[New LWP 614]
[New LWP 615]
[New LWP 616]
[New LWP 617]
[New LWP 618]
[New LWP 619]
[New LWP 620]
[New LWP 625]
[New LWP 636]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
0x000000321c4ee1a6 in __GI_epoll_pwait (epfd=3, events=events@entry=0x7ffc92802560, maxevents=maxevents@entry=256, timeout=474326, set=set@entry=0x0) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_pwait.c:42
42 return SYSCALL_CANCEL (epoll_pwait, epfd, events, maxevents,
(gdb) thread apply all bt
Thread 10 (Thread 0x7fb4d31cd700 (LWP 636)):
#0 0x000000321c4ee1a6 in __GI_epoll_pwait (epfd=24, events=events@entry=0x7fb4d31cc0d0, maxevents=maxevents@entry=256, timeout=timeout@entry=-1, set=set@entry=0x0) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1 0x000000321c4ee318 in epoll_wait (epfd=<optimized out>, events=events@entry=0x7fb4d31cc0d0, maxevents=maxevents@entry=256, timeout=timeout@entry=-1) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_wait.c:30
#2 0x000000000056ff24 in ospoll_wait (ospoll=0xea8b50, timeout=timeout@entry=-1) at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/os/ospoll.c:397
#3 0x000000000056d9b6 in InputThreadDoWork (arg=<optimized out>) at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/os/inputthread.c:367
#4 0x000000321c807385 in start_thread (arg=0x7fb4d31cd700) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#5 0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 9 (Thread 0x7fb4d3fff700 (LWP 625)):
#0 0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xd390f4) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0xd390a0, cond=0xd390c8) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0xd390c8, mutex=0xd390a0) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3 0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4 0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5 0x000000321c807385 in start_thread (arg=0x7fb4d3fff700) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6 0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 8 (Thread 0x7fb4d93d8700 (LWP 620)):
#0 0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xc655f8) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0xc655a8, cond=0xc655d0) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0xc655d0, mutex=0xc655a8) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3 0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4 0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5 0x000000321c807385 in start_thread (arg=0x7fb4d93d8700) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6 0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 7 (Thread 0x7fb4d9bd9700 (LWP 619)):
#0 0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xc655f8) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0xc655a8, cond=0xc655d0) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0xc655d0, mutex=0xc655a8) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3 0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4 0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5 0x000000321c807385 in start_thread (arg=0x7fb4d9bd9700) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6 0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 6 (Thread 0x7fb4da3da700 (LWP 618)):
#0 0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xc65510) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0xc654c0, cond=0xc654e8) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0xc654e8, mutex=0xc654c0) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3 0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4 0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5 0x000000321c807385 in start_thread (arg=0x7fb4da3da700) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6 0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 5 (Thread 0x7fb4dabdb700 (LWP 617)):
#0 0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xc65510) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0xc654c0, cond=0xc654e8) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0xc654e8, mutex=0xc654c0) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3 0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4 0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5 0x000000321c807385 in start_thread (arg=0x7fb4dabdb700) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6 0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 4 (Thread 0x7fb4db3dc700 (LWP 616)):
#0 0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xc65514) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0xc654c0, cond=0xc654e8) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0xc654e8, mutex=0xc654c0) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3 0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4 0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5 0x000000321c807385 in start_thread (arg=0x7fb4db3dc700) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6 0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 3 (Thread 0x7fb4dbbdd700 (LWP 615)):
#0 0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xc15d10) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0xc15cc0, cond=0xc15ce8) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0xc15ce8, mutex=0xc15cc0) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3 0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4 0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5 0x000000321c807385 in start_thread (arg=0x7fb4dbbdd700) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6 0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 2 (Thread 0x7fb4e094d700 (LWP 614)):
#0 0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xc62760) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0xc62710, cond=0xc62738) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0xc62738, mutex=0xc62710) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3 0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4 0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5 0x000000321c807385 in start_thread (arg=0x7fb4e094d700) at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6 0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7fb4e5de18c0 (LWP 541)):
#0 0x000000321c4ee1a6 in __GI_epoll_pwait (epfd=3, events=events@entry=0x7ffc92802560, maxevents=maxevents@entry=256, timeout=474326, set=set@entry=0x0) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1 0x000000321c4ee318 in epoll_wait (epfd=<optimized out>, events=events@entry=0x7ffc92802560, maxevents=maxevents@entry=256, timeout=<optimized out>) at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_wait.c:30
#2 0x000000000056ff24 in ospoll_wait (ospoll=0xc07330, timeout=<optimized out>) at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/os/ospoll.c:397
#3 0x0000000000569a4b in WaitForSomething (are_ready=<optimized out>) at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/os/WaitFor.c:226
#4 0x000000000043246b in Dispatch () at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/dix/dispatch.c:422
#5 0x000000000043647d in dix_main (argc=4, argv=0x7ffc92803388, envp=<optimized out>) at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/dix/main.c:287
#6 0x000000321c420f1c in __libc_start_main (main=0x4217e0 <main>, argc=4, argv=0x7ffc92803388, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc92803378) at /usr/src/debug/glibc/2.26-r0/git/csu/libc-start.c:308
#7 0x000000000042181a in _start () at ../sysdeps/x86_64/start.S:120
(gdb)
Created attachment 274579 [details] sudo umr -O many,bits -r *.gfx80.mmGRBM_STATUS &> stall sudo umr -O many,bits -r *.gfx80.HEADER_DUMP &>>stall sudo umr -O many,bits -r *.gfx80.CP_EOP &>>stall
The last dump belongs to a stall after running dmesg (similar to yesterday's dump 1). The next one happens after starting Xorg. This time, gdb also stalls when trying to get a dump:
root@qt5122:~# ps aux | grep Xorg
root 519 0.0 0.0 15936 900 ? S 08:19 0:00 xinit /etc/X11/Xsession -- /usr/bin/Xorg :0 -br -pn
root 540 0.3 1.3 857220 47812 ? S<sl 08:19 0:00 /usr/bin/Xorg :0 -br -pn
root 692 0.0 0.0 8088 1056 pts/2 S+ 08:20 0:00 grep Xorg
root@qt5122:~# gdb -p 540
GNU gdb (GDB) 8.0
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-poky-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 540
[New LWP 617]
[New LWP 618]
[New LWP 619]
[New LWP 620]
[New LWP 621]
[New LWP 622]
[New LWP 623]
[New LWP 624]
[New LWP 635]
Created attachment 274581 [details] Stalled after starting X. Gdb also stalls sudo umr -O many,bits -r *.gfx80.mmGRBM_STATUS &> stall sudo umr -O many,bits -r *.gfx80.HEADER_DUMP &>>stall sudo umr -O many,bits -r *.gfx80.CP_EOP &>>stall
I will take a look at the register dumps and also forward them to a few people. Did you consider running the latest mainline kernel from https://www.kernel.org/ (4.16-rc4) to see if the issue still happens? I still can't reproduce it. Thanks, Andrey
Hi Andrey,
I just tried with commit ce380619fab99036f5e745c7a865b21c59f005f6:
Linux version 4.16.0-rc4-qtec-standard+ (ricardo@neopili) (gcc version 7.3.0 (Debian 7.3.0 -5)) #1 SMP Tue Mar 6 14:11:26 CET 2018
It is still stalling, and there is an interesting message in dmesg:
[ 52.703150] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=20, last emitted seq=22
[ 52.703162] [drm] IP block:gfx_v8_0 is hung!
[ 52.703168] [drm] GPU recovery disabled.
Attached you will find the whole dmesg.
Created attachment 274589 [details] Dmesg for 4.16 rc4
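The "ring gfx timeout" line carries two sequence numbers, and the gap between them is the number of submissions that were emitted but never signaled. A minimal sketch for pulling that out of a saved dmesg; `pending_jobs` is a hypothetical name, and the pattern assumes the exact message wording quoted in this report.

```shell
#!/bin/sh
# Hypothetical helper: read dmesg text on stdin and, for every amdgpu
# "ring <name> timeout" line, print how many fences were emitted but not
# yet signaled (emitted seq minus signaled seq).
pending_jobs() {
    sed -n 's/.*ring \([a-z0-9_.]*\) timeout, last signaled seq=\([0-9]*\), last emitted seq=\([0-9]*\).*/\1 \2 \3/p' |
    while read -r ring signaled emitted; do
        echo "$ring: $((emitted - signaled)) job(s) emitted but not signaled"
    done
}

# Example (not executed here):
#   dmesg | pending_jobs
```

For the message above (signaled 20, emitted 22), this reports two outstanding jobs on the gfx ring.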
Just for the fun of it, I tried the amdgpu.gpu_recovery=1 module parameter. It seems to detect the hang properly, but fails miserably on the recovery. The screen blanks and the whole system is unusable: LEDs stop blinking, ssh does not work... I need to hard reset the system. Thanks!
(In reply to Ricardo Ribalda from comment #38)
> Just for the fun of it, I did try the amdgpu.gpu_recovery=1 module parameter,
>
> It seems to detect properly the crash, but fails miserably on the recovery.
> The screens blanks and the whole system is unusable: Leds stop blinking, ssh
> does not work... I need to hard reset the system.
>
> Thanks!
The job timeout message is expected, since you do have a hung pipe. The GPU recovery failure on CZ is also expected: it is another issue we have, where the GPU PCI config reset hard-hangs (I am looking at this as well). I suggest you try to be on the latest of everything to see if this still happens. You are already on the latest kernel, so now pick up the latest firmware we have from this git: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
Take all the carrizo* files and overwrite the ones you have under your kernel (usually in /lib/firmware`uname -r`/amdgpu or in a kernel-specific folder under lib/firmware). I think you already updated your libdrm/Mesa/LLVM, but your LLVM is still 5.0, right? Can you try switching to LLVM 7?
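Before overwriting anything, it can be useful to check whether the installed carrizo* files already match the ones from the linux-firmware tree. A minimal sketch, assuming both sets of files sit in plain directories; `compare_firmware` and both directory arguments are hypothetical, so adjust the paths to wherever your kernel actually loads firmware from.

```shell
#!/bin/sh
# Hypothetical helper: compare every carrizo* file in new_dir (for example
# the amdgpu/ directory of a linux-firmware checkout) with the file of the
# same name in cur_dir (the firmware directory the running kernel uses).
compare_firmware() {
    new_dir=$1
    cur_dir=$2
    for new_file in "$new_dir"/carrizo*; do
        [ -e "$new_file" ] || continue      # glob matched nothing
        name=$(basename "$new_file")
        if [ ! -e "$cur_dir/$name" ]; then
            echo "missing $name"
        elif cmp -s "$new_file" "$cur_dir/$name"; then
            echo "identical $name"
        else
            echo "differs $name"
        fi
    done
}

# Example (not executed here; both paths are placeholders):
#   compare_firmware ~/linux-firmware/amdgpu /lib/firmware/amdgpu
```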
It seems that I am already using the latest firmware files. I will try to switch to LLVM 7, but that will require a bit of compilation time; it is not a 5-minute job. Just to be 100% sure: by LLVM 7, do you mean the HEAD of https://github.com/llvm-mirror/llvm? Do you prefer a specific hash? Regards!
I actually use Debian packages for LLVM 7, so I can't tell you, but you will know it's LLVM 7 once you install it and run glxinfo: it shows the LLVM version you are using.
With current LLVM7 master X does not start at all. I get the following output: root@qt5122:~# X X.Org X Server 1.19.3 Release Date: 2017-03-15 X Protocol Version 11, Revision 0 Build Operating System: Linux 4.14.0-3-amd64 x86_64 Current Operating System: Linux qt5122 4.16.0-rc4-qtec-standard #1 SMP Fri Mar 9 15:13:05 CET 2018 x86_64 Kernel command line: BOOT_IMAGE=/boot/bzImage rw root=PARTUUID=c839955a-02 rootwait qtec_mem.size=64 quiet Build Date: 09 March 2018 02:29:53PM Current version of pixman: 0.34.0 Before reporting problems, check http://wiki.x.org to make sure that you have the latest version. Markers: (--) probed, (**) from config file, (==) default setting, (++) from command line, (!!) notice, (II) informational, (WW) warning, (EE) error, (NI) not implemented, (??) unknown. (==) Log file: "/var/log/Xorg.0.log", Time: Fri Mar 9 15:47:03 2018 (==) Using config file: "/etc/X11/xorg.conf" (==) Using system config directory "/usr/share/X11/xorg.conf.d" (II) [KMS] Kernel modesetting enabled. LLVM ERROR: Cannot select: 0x7f74b40a59e0: v4i32,ch = load<LD16[%22(addrspace=2)](dereferenceable)(invariant)> 0x7f74b40251c8, 0x7f74b40a5978, undef:i32 0x7f74b40a5978: i32 = add 0x7f74b409f378, Constant:i32<16> 0x7f74b409f378: i32,ch = CopyFromReg 0x7f74b40251c8, Register:i32 %4 0x7f74b409f310: i32 = Register %4 0x7f74b40a5910: i32 = Constant<16> 0x7f74b409fc68: i32 = undef In function: main LLVM ERROR: Cannot select: 0x7f74a00b0570: v4i32,ch = load<LD16[%22(addrspace=2)](dereferenceable)(invariant)> 0x7f74a002eab8, 0x7f74a00b0508, undef:i32 0x7f74a00b0508: i32 = add 0x7f74a00a9f08, Constant:i32<16> 0x7f74a00a9f08: i32,ch = CopyFromReg 0x7f74a002eab8, Register:i32 %4 0x7f74a00a9ea0: i32 = Register %4 0x7f74a00b04a0: i32 = Constant<16> 0x7f74a00aa7f8: i32 = undef In function: main root@qt5122:~#
(In reply to Ricardo Ribalda from comment #42) > With current LLVM7 master X does not start at all. I get the following > output: Did you recompile current Mesa Git master against that as well? If so, that's most likely a separate issue, either in LLVM or Mesa, which needs to be tracked separately from this one.
For the LLVM problem, please provide the log output from starting X with R600_DEBUG=vs,ps in the environment.
(In reply to Michel Dänzer from comment #43)
> (In reply to Ricardo Ribalda from comment #42)
> > With current LLVM7 master X does not start at all. I get the following
> > output:
>
> Did you recompile current Mesa Git master against that as well? If so,
> that's most likely a separate issue, either in LLVM or Mesa, which needs to
> be tracked separately from this one.
I used Mesa 17.3.6 plus these patches:
https://github.com/mesa3d/mesa/commit/78673b614b01a8a416367db23937743c0e1aaa36.diff
https://patchwork.freedesktop.org/patch/186737/
(In reply to Nicolai Hähnle from comment #44)
> For the LLVM problem, please provide the log output from starting X with
> R600_DEBUG=vs,ps in the environment.
It will have to wait until Monday, sorry :(
(In reply to Nicolai Hähnle from comment #44) > For the LLVM problem, please provide the log output from starting X with > R600_DEBUG=vs,ps in the environment. Here you are. Thanks! X.Org X Server 1.19.3 Release Date: 2017-03-15 X Protocol Version 11, Revision 0 Build Operating System: Linux 4.14.0-3-amd64 x86_64 Current Operating System: Linux qt5122 4.16.0-rc4-qtec-standard #1 SMP Fri Mar 9 15:13:05 CET 2018 x86_64 Kernel command line: BOOT_IMAGE=/boot/bzImage rw root=PARTUUID=c839955a-02 rootwait qtec_mem.size=64 quiet Build Date: 09 March 2018 02:29:53PM Current version of pixman: 0.34.0 Before reporting problems, check http://wiki.x.org to make sure that you have the latest version. Markers: (--) probed, (**) from config file, (==) default setting, (++) from command line, (!!) notice, (II) informational, (WW) warning, (EE) error, (NI) not implemented, (??) unknown. (==) Log file: "/var/log/Xorg.0.log", Time: Mon Mar 12 08:12:00 2018 (==) Using config file: "/etc/X11/xorg.conf" (==) Using system config directory "/usr/share/X11/xorg.conf.d" (II) [KMS] Kernel modesetting enabled. 
VERT PROPERTY NEXT_SHADER FRAG DCL IN[0] DCL IN[1] DCL OUT[0], POSITION DCL OUT[1].xy, GENERIC[0] 0: MOV OUT[0], IN[0] 1: MOV OUT[1].xy, IN[1].xyxx 2: END radeonsi: Compiling shader 1 TGSI shader LLVM IR: ; ModuleID = 'tgsi' source_filename = "tgsi" target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-A5" target triple = "amdgcn--" define amdgpu_vs void @main([12 x <4 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), [0 x <8 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), [0 x <4 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), [80 x <8 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), [16 x <4 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32, i32, i32, i32, i32, i32) #0 { main_body: %15 = getelementptr [16 x <4 x i32>], [16 x <4 x i32>] addrspace(2)* %4, i32 0, i32 0, !amdgpu.uniform !0 %16 = load <4 x i32>, <4 x i32> addrspace(2)* %15, align 16, !invariant.load !0 %17 = call nsz <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(<4 x i32> %16, i32 %13, i32 0, i1 false, i1 false) #3 %18 = extractelement <4 x float> %17, i32 0 %19 = extractelement <4 x float> %17, i32 1 %20 = extractelement <4 x float> %17, i32 2 %21 = extractelement <4 x float> %17, i32 3 %22 = getelementptr [16 x <4 x i32>], [16 x <4 x i32>] addrspace(2)* %4, i32 0, i32 1, !amdgpu.uniform !0 %23 = load <4 x i32>, <4 x i32> addrspace(2)* %22, align 16, !invariant.load !0 %24 = call nsz <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(<4 x i32> %23, i32 %14, i32 0, i1 false, i1 false) #3 %25 = extractelement <4 x float> %24, i32 0 %26 = extractelement <4 x float> %24, i32 1 call void @llvm.amdgcn.exp.f32(i32 12, i32 15, float %18, float %19, float %20, float %21, i1 true, i1 false) #2 call void 
@llvm.amdgcn.exp.f32(i32 32, i32 15, float %25, float %26, float undef, float undef, i1 false, i1 false) #2 ret void } ; Function Attrs: nounwind readonly declare <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(<4 x i32>, i32, i32, i1, i1) #1 ; Function Attrs: nounwind declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #2 attributes #0 = { "no-signed-zeros-fp-math"="true" } attributes #1 = { nounwind readonly } attributes #2 = { nounwind } attributes #3 = { nounwind readnone } !0 = !{} LLVM ERROR: Cannot select: 0x12844a0: v4i32,ch = load<LD16[%22(addrspace=2)](dereferenceable)(invariant)> 0xc352c8, 0x1284438, undef:i32 0x1284438: i32 = add 0x127df58, Constant:i32<16> 0x127df58: i32,ch = CopyFromReg 0xc352c8, Register:i32 %4 0x127def0: i32 = Register %4 0x12843d0: i32 = Constant<16> 0x127e848: i32 = undef In function: main
Just for the fun of it, I have tried 4.16.0-rc4 + llvm git HEAD + mesa git HEAD + libdrm 2.4.91 + xf86-video-amdgpu_18.0.0 + libxcb_1.13.bb, and X starts. Unfortunately it also freezes:
[ 34.272153] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=18, last emitted seq=20
[ 34.272166] [drm] IP block:gfx_v8_0 is hung!
[ 34.272171] [drm] GPU recovery disabled.
Is there any extra debug info that you want from me?
Is it LLVM 7.0 in this case ?
(In reply to Andrey Grodzovsky from comment #48) > Is it LLVM 7.0 in this case ? Yes: OpenGL renderer string: AMD Radeon R7 Graphics (CARRIZO / DRM 3.23.0 / 4.16.0-rc4-qtec-standard, LLVM 7.0.0)
Created attachment 274679 [details] glxinfo with llvm7
Andrey, do you have access to a Bettong board? If so I can send you an image where you can get the stall.
I don't think I have, since I don't know what it is. Since it also failed with LLVM 7, that is not the problem then. We can try again to get more detailed logs from the SAME hang, as follows:

sudo umr -O verbose,follow_ib -R gfx[0:2047]
sudo umr -O bits -wa
sudo umr -O many,bits -r *.gfx80.mmGRBM_STATUS
sudo umr -O many,bits -r *.gfx80.HEADER_DUMP
sudo umr -O many,bits -r *.gfx80.CP_EOP

This way, addresses from the ring dump can be related to register values. On the user-mode side, run X with the environment variable GALLIUM_DDEBUG=1000 and collect the resulting ~/ddebug_dumps/ files.

Thanks, Andrey
Created attachment 274683 [details] ddebug dumps with llvm7
Created attachment 274685 [details] urm dump with llvm7
Bettong is the reference Merlin Falcon board from AMD; I think they also refer to it as DB-FP4. It is also supported by coreboot.
(In reply to Ricardo Ribalda from comment #46) This was due to the constant address space change in LLVM. It has since been fixed in Mesa master.
Please try the following to see if it helps avoid the hang:

R600_DEBUG=notiling,norbplus

Try with GALLIUM_DDEBUG=flush to see if it makes a difference (it flushes after each draw call).

Try running programs that can be used without X, like KMS cube.

Andrey
Hi Andrey,

Testing with the llvm7 setup:

R600_DEBUG=notiling,norbplus xinit

does not avoid the hang :(

[ 31.200134] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=7, last emitted seq=8
[ 31.200147] [drm] IP block:gfx_v8_0 is hung!
[ 31.200152] [drm] GPU recovery disabled.

GALLIUM_DDEBUG=flush xinit

seems to have a (good) impact. The hang happened after around 20 boots.

[ 13.389894] NET: Registered protocol family 39
[ 28.640142] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=652, last emitted seq=654
[ 28.640154] [drm] IP block:gfx_v8_0 is hung!
[ 28.640160] [drm] GPU recovery disabled.
[ 246.752083] INFO: task amdgpu_cs:0:636 blocked for more than 120 seconds.
[ 246.752092] Not tainted 4.16.0-rc4-qtec-standard #1
[ 246.752095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.752099] amdgpu_cs:0 D 0 636 543 0x00080000
[ 246.752099] amdgpu_cs:0 D 0 636 543 0x00080000
[ 246.752103] Call Trace:
[ 246.752115] ? __schedule+0x25c/0x860
[ 246.752122] ? dma_fence_default_wait+0x10c/0x280
[ 246.752124] ? dma_fence_default_wait+0x1c8/0x280
[ 246.752127] schedule+0x2f/0x90
[ 246.752130] schedule_timeout+0x1f1/0x440
[ 246.752220] ? amdgpu_cs_bo_validate+0x7f/0x120 [amdgpu]
[ 246.752276] ? amdgpu_ttm_alloc_gart+0x5d/0x270 [amdgpu]
[ 246.752284] ? dma_fence_default_wait+0x10c/0x280
[ 246.752287] ? dma_fence_default_wait+0x1c8/0x280
[ 246.752290] dma_fence_default_wait+0x1f4/0x280
[ 246.752294] ? dma_fence_default_wait+0x280/0x280
[ 246.752297] dma_fence_wait_timeout+0x2e/0x100
[ 246.752359] amdgpu_ctx_wait_prev_fence+0x46/0x80 [amdgpu]
[ 246.752418] amdgpu_cs_ioctl+0x1f2/0x1af0 [amdgpu]
[ 246.752478] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[ 246.752504] drm_ioctl_kernel+0x59/0xb0 [drm]
[ 246.752524] drm_ioctl+0x29f/0x340 [drm]
[ 246.752581] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[ 246.752630] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 246.752635] do_vfs_ioctl+0x8e/0x680
[ 246.752641] ? SyS_futex+0x11d/0x150
[ 246.752644] SyS_ioctl+0x74/0x80
[ 246.752647] ? get_vtime_delta+0xe/0x40
[ 246.752650] do_syscall_64+0x7b/0x1d0
[ 246.752654] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 246.752658] RIP: 0033:0x3e292e57e7
[ 246.752660] RSP: 002b:00007f17a2924b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 246.752663] RAX: ffffffffffffffda RBX: 00007f17a2924c78 RCX: 0000003e292e57e7
[ 246.752664] RDX: 00007f17a2924bf0 RSI: 00000000c0186444 RDI: 000000000000000d
[ 246.752666] RBP: 00007f17a2924bf0 R08: 00007f17a2924ca0 R09: 00007f17a2924c78
[ 246.752667] R10: 00007f17a2924ca0 R11: 0000000000000246 R12: 00000000c0186444
[ 246.752668] R13: 000000000000000d R14: 0000000000d2c588 R15: 0000000000000002

kmscube git HEAD has not stalled after 50 attempts without any flag in GALLIUM_DDEBUG or R600_DEBUG.

It might not be relevant, but my platform can only boot with BIOS (it is based on coreboot). When you tried to reproduce this bug, did you boot with UEFI or BIOS?

Thanks
(In reply to Ricardo Ribalda from comment #59)
> It might not be relevant, but my platform can only boot with BIOS (it is
> based on coreboot). When you tried to reproduce this bug, did you boot with
> UEFI or BIOS?

I think my BIOS is set to UEFI, but that can't have any relation to this issue.

Andrey
Hi Andrey, you are quieter than you used to be ;). Do you need anything else from my side?
Hi, sorry, I was a bit busy with other stuff. Can you please provide one more bit of info by reading some debug data from the color buffer (CB)? Reproduce the hang and then run the following with umr:

sudo umr -O bits,many --bank 0 0 0 -r *.*.mmCB_DEBUG
sudo umr -O bits,many --bank 0 0 1 -r *.*.mmCB_DEBUG

Andrey
Here you are gfx80.mmCB_DEBUG_BUS_1 => 0x00008801 gfx80.mmCB_DEBUG_BUS_2 => 0x00000000 gfx80.mmCB_DEBUG_BUS_3 => 0x00000000 gfx80.mmCB_DEBUG_BUS_4 => 0x00000000 gfx80.mmCB_DEBUG_BUS_5 => 0x00000000 gfx80.mmCB_DEBUG_BUS_6 => 0x00000000 gfx80.mmCB_DEBUG_BUS_7 => 0x00000000 gfx80.mmCB_DEBUG_BUS_8 => 0x00000000 gfx80.mmCB_DEBUG_BUS_9 => 0x00000000 gfx80.mmCB_DEBUG_BUS_10 => 0x0000e000 gfx80.mmCB_DEBUG_BUS_11 => 0x00000000 gfx80.mmCB_DEBUG_BUS_12 => 0x00000000 gfx80.mmCB_DEBUG_BUS_13 => 0x00000000 gfx80.mmCB_DEBUG_BUS_14 => 0x00000000 gfx80.mmCB_DEBUG_BUS_15 => 0x00000000 gfx80.mmCB_DEBUG_BUS_16 => 0x00000000 gfx80.mmCB_DEBUG_BUS_17 => 0x00000804 .TILE_INTFC_BUSY[0:0] == 0 (0x00000000) .MU_BUSY[1:1] == 0 (0x00000000) .TQ_BUSY[2:2] == 1 (0x00000001) .AC_BUSY[3:3] == 0 (0x00000000) .CRW_BUSY[4:4] == 0 (0x00000000) .CACHE_CTRL_BUSY[5:5] == 0 (0x00000000) .MC_WR_PENDING[6:6] == 0 (0x00000000) .FC_WR_PENDING[7:7] == 0 (0x00000000) .FC_RD_PENDING[8:8] == 0 (0x00000000) .EVICT_PENDING[9:9] == 0 (0x00000000) .LAST_RD_ARB_WINNER[10:10] == 0 (0x00000000) .MU_STATE[11:18] == 1 (0x00000001) gfx80.mmCB_DEBUG_BUS_18 => 0x00000100 .TILE_RETIREMENT_BUSY[0:0] == 0 (0x00000000) .FOP_BUSY[1:1] == 0 (0x00000000) .CLEAR_BUSY[2:2] == 0 (0x00000000) .LAT_BUSY[3:3] == 0 (0x00000000) .CACHE_CTL_BUSY[4:4] == 0 (0x00000000) .ADDR_BUSY[5:5] == 0 (0x00000000) .MERGE_BUSY[6:6] == 0 (0x00000000) .QUAD_BUSY[7:7] == 0 (0x00000000) .TILE_BUSY[8:8] == 1 (0x00000001) .DCC_BUSY[9:9] == 0 (0x00000000) .DOC_BUSY[10:10] == 0 (0x00000000) .DAG_BUSY[11:11] == 0 (0x00000000) .DOC_STALL[12:12] == 0 (0x00000000) .DOC_QT_CAM_FULL[13:13] == 0 (0x00000000) .DOC_CL_CAM_FULL[14:14] == 0 (0x00000000) .DOC_QUAD_PTR_FIFO_FULL[15:15] == 0 (0x00000000) .DOC_SECTOR_MASK_FIFO_FULL[16:16] == 0 (0x00000000) .DCS_READ_WINNER_LAST[17:17] == 0 (0x00000000) .DCS_READ_EV_PENDING[18:18] == 0 (0x00000000) .DCS_WRITE_CC_PENDING[19:19] == 0 (0x00000000) .DCS_READ_CC_PENDING[20:20] == 0 (0x00000000) .DCS_WRITE_MC_PENDING[21:21] == 0 
(0x00000000) gfx80.mmCB_DEBUG_BUS_19 => 0x0005c000 .SURF_SYNC_STATE[0:1] == 0 (0x00000000) .SURF_SYNC_START[2:2] == 0 (0x00000000) .SF_BUSY[3:3] == 0 (0x00000000) .CS_BUSY[4:4] == 0 (0x00000000) .RB_BUSY[5:5] == 0 (0x00000000) .DS_BUSY[6:6] == 0 (0x00000000) .TB_BUSY[7:7] == 0 (0x00000000) .IB_BUSY[8:8] == 0 (0x00000000) .DRR_BUSY[9:9] == 0 (0x00000000) .DF_BUSY[10:10] == 0 (0x00000000) .DD_BUSY[11:11] == 0 (0x00000000) .DC_BUSY[12:12] == 0 (0x00000000) .DK_BUSY[13:13] == 0 (0x00000000) .DF_SKID_FIFO_EMPTY[14:14] == 1 (0x00000001) .DF_CLEAR_FIFO_EMPTY[15:15] == 1 (0x00000001) .DD_READY[16:16] == 1 (0x00000001) .DC_FIFO_FULL[17:17] == 0 (0x00000000) .DC_READY[18:18] == 1 (0x00000001) gfx80.mmCB_DEBUG_BUS_20 => 0x00f00820 .MC_RDREQ_CREDITS[0:5] == 32 (0x00000020) .MC_WRREQ_CREDITS[6:11] == 32 (0x00000020) .CC_RDREQ_HAD_ITS_TURN[12:12] == 0 (0x00000000) .FC_RDREQ_HAD_ITS_TURN[13:13] == 0 (0x00000000) .CM_RDREQ_HAD_ITS_TURN[14:14] == 0 (0x00000000) .CC_WRREQ_HAD_ITS_TURN[16:16] == 0 (0x00000000) .FC_WRREQ_HAD_ITS_TURN[17:17] == 0 (0x00000000) .CM_WRREQ_HAD_ITS_TURN[18:18] == 0 (0x00000000) .CC_WRREQ_FIFO_EMPTY[20:20] == 1 (0x00000001) .FC_WRREQ_FIFO_EMPTY[21:21] == 1 (0x00000001) .CM_WRREQ_FIFO_EMPTY[22:22] == 1 (0x00000001) .DCC_WRREQ_FIFO_EMPTY[23:23] == 1 (0x00000001) gfx80.mmCB_DEBUG_BUS_21 => 0x000000e3 .CM_BUSY[0:0] == 1 (0x00000001) .FC_BUSY[1:1] == 1 (0x00000001) .CC_BUSY[2:2] == 0 (0x00000000) .BB_BUSY[3:3] == 0 (0x00000000) .MA_BUSY[4:4] == 0 (0x00000000) .CORE_SCLK_VLD[5:5] == 1 (0x00000001) .REG_SCLK1_VLD[6:6] == 1 (0x00000001) .REG_SCLK0_VLD[7:7] == 1 (0x00000001) gfx80.mmCB_DEBUG_BUS_22 => 0x00000000 .OUTSTANDING_MC_READS[0:11] == 0 (0x00000000) .OUTSTANDING_MC_WRITES[12:23] == 0 (0x00000000) gfx80.mmCB_DEBUG_BUS_1 => 0x00008801 gfx80.mmCB_DEBUG_BUS_2 => 0x00000000 gfx80.mmCB_DEBUG_BUS_3 => 0x00000000 gfx80.mmCB_DEBUG_BUS_4 => 0x00000000 gfx80.mmCB_DEBUG_BUS_5 => 0x00000000 gfx80.mmCB_DEBUG_BUS_6 => 0x00000000 gfx80.mmCB_DEBUG_BUS_7 => 0x00000000 
gfx80.mmCB_DEBUG_BUS_8 => 0x00000000 gfx80.mmCB_DEBUG_BUS_9 => 0x00000000 gfx80.mmCB_DEBUG_BUS_10 => 0x0000e000 gfx80.mmCB_DEBUG_BUS_11 => 0x00000000 gfx80.mmCB_DEBUG_BUS_12 => 0x00000000 gfx80.mmCB_DEBUG_BUS_13 => 0x00000000 gfx80.mmCB_DEBUG_BUS_14 => 0x00000000 gfx80.mmCB_DEBUG_BUS_15 => 0x00000000 gfx80.mmCB_DEBUG_BUS_16 => 0x00000000 gfx80.mmCB_DEBUG_BUS_17 => 0x00000804 .TILE_INTFC_BUSY[0:0] == 0 (0x00000000) .MU_BUSY[1:1] == 0 (0x00000000) .TQ_BUSY[2:2] == 1 (0x00000001) .AC_BUSY[3:3] == 0 (0x00000000) .CRW_BUSY[4:4] == 0 (0x00000000) .CACHE_CTRL_BUSY[5:5] == 0 (0x00000000) .MC_WR_PENDING[6:6] == 0 (0x00000000) .FC_WR_PENDING[7:7] == 0 (0x00000000) .FC_RD_PENDING[8:8] == 0 (0x00000000) .EVICT_PENDING[9:9] == 0 (0x00000000) .LAST_RD_ARB_WINNER[10:10] == 0 (0x00000000) .MU_STATE[11:18] == 1 (0x00000001) gfx80.mmCB_DEBUG_BUS_18 => 0x00000100 .TILE_RETIREMENT_BUSY[0:0] == 0 (0x00000000) .FOP_BUSY[1:1] == 0 (0x00000000) .CLEAR_BUSY[2:2] == 0 (0x00000000) .LAT_BUSY[3:3] == 0 (0x00000000) .CACHE_CTL_BUSY[4:4] == 0 (0x00000000) .ADDR_BUSY[5:5] == 0 (0x00000000) .MERGE_BUSY[6:6] == 0 (0x00000000) .QUAD_BUSY[7:7] == 0 (0x00000000) .TILE_BUSY[8:8] == 1 (0x00000001) .DCC_BUSY[9:9] == 0 (0x00000000) .DOC_BUSY[10:10] == 0 (0x00000000) .DAG_BUSY[11:11] == 0 (0x00000000) .DOC_STALL[12:12] == 0 (0x00000000) .DOC_QT_CAM_FULL[13:13] == 0 (0x00000000) .DOC_CL_CAM_FULL[14:14] == 0 (0x00000000) .DOC_QUAD_PTR_FIFO_FULL[15:15] == 0 (0x00000000) .DOC_SECTOR_MASK_FIFO_FULL[16:16] == 0 (0x00000000) .DCS_READ_WINNER_LAST[17:17] == 0 (0x00000000) .DCS_READ_EV_PENDING[18:18] == 0 (0x00000000) .DCS_WRITE_CC_PENDING[19:19] == 0 (0x00000000) .DCS_READ_CC_PENDING[20:20] == 0 (0x00000000) .DCS_WRITE_MC_PENDING[21:21] == 0 (0x00000000) gfx80.mmCB_DEBUG_BUS_19 => 0x0005c000 .SURF_SYNC_STATE[0:1] == 0 (0x00000000) .SURF_SYNC_START[2:2] == 0 (0x00000000) .SF_BUSY[3:3] == 0 (0x00000000) .CS_BUSY[4:4] == 0 (0x00000000) .RB_BUSY[5:5] == 0 (0x00000000) .DS_BUSY[6:6] == 0 (0x00000000) .TB_BUSY[7:7] == 
0 (0x00000000) .IB_BUSY[8:8] == 0 (0x00000000) .DRR_BUSY[9:9] == 0 (0x00000000) .DF_BUSY[10:10] == 0 (0x00000000) .DD_BUSY[11:11] == 0 (0x00000000) .DC_BUSY[12:12] == 0 (0x00000000) .DK_BUSY[13:13] == 0 (0x00000000) .DF_SKID_FIFO_EMPTY[14:14] == 1 (0x00000001) .DF_CLEAR_FIFO_EMPTY[15:15] == 1 (0x00000001) .DD_READY[16:16] == 1 (0x00000001) .DC_FIFO_FULL[17:17] == 0 (0x00000000) .DC_READY[18:18] == 1 (0x00000001) gfx80.mmCB_DEBUG_BUS_20 => 0x00f00820 .MC_RDREQ_CREDITS[0:5] == 32 (0x00000020) .MC_WRREQ_CREDITS[6:11] == 32 (0x00000020) .CC_RDREQ_HAD_ITS_TURN[12:12] == 0 (0x00000000) .FC_RDREQ_HAD_ITS_TURN[13:13] == 0 (0x00000000) .CM_RDREQ_HAD_ITS_TURN[14:14] == 0 (0x00000000) .CC_WRREQ_HAD_ITS_TURN[16:16] == 0 (0x00000000) .FC_WRREQ_HAD_ITS_TURN[17:17] == 0 (0x00000000) .CM_WRREQ_HAD_ITS_TURN[18:18] == 0 (0x00000000) .CC_WRREQ_FIFO_EMPTY[20:20] == 1 (0x00000001) .FC_WRREQ_FIFO_EMPTY[21:21] == 1 (0x00000001) .CM_WRREQ_FIFO_EMPTY[22:22] == 1 (0x00000001) .DCC_WRREQ_FIFO_EMPTY[23:23] == 1 (0x00000001) gfx80.mmCB_DEBUG_BUS_21 => 0x000000e3 .CM_BUSY[0:0] == 1 (0x00000001) .FC_BUSY[1:1] == 1 (0x00000001) .CC_BUSY[2:2] == 0 (0x00000000) .BB_BUSY[3:3] == 0 (0x00000000) .MA_BUSY[4:4] == 0 (0x00000000) .CORE_SCLK_VLD[5:5] == 1 (0x00000001) .REG_SCLK1_VLD[6:6] == 1 (0x00000001) .REG_SCLK0_VLD[7:7] == 1 (0x00000001) gfx80.mmCB_DEBUG_BUS_22 => 0x00000000 .OUTSTANDING_MC_READS[0:11] == 0 (0x00000000) .OUTSTANDING_MC_WRITES[12:23] == 0 (0x00000000) Thanks!
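The per-field lines in the dump above follow from the raw register values by plain shift-and-mask arithmetic, which makes it easy to re-check a value by hand or to diff two dumps. Below is a small sketch of that decoding. The field tables are copied from the umr output in this comment; the helper name `decode_fields` is made up for illustration, and the reading of `NAME[lo:hi]` as an inclusive bit range counted from the least significant bit is inferred from the values shown here.

```python
def decode_fields(value, fields):
    """Decode bit fields from a 32-bit register value.

    fields is a list of (name, lo, hi) tuples with inclusive bit positions,
    lo being the least significant bit of the field.
    """
    decoded = {}
    for name, lo, hi in fields:
        width = hi - lo + 1
        decoded[name] = (value >> lo) & ((1 << width) - 1)
    return decoded


# Field layouts as printed by umr in this comment.
CB_DEBUG_BUS_17 = [
    ("TILE_INTFC_BUSY", 0, 0), ("MU_BUSY", 1, 1), ("TQ_BUSY", 2, 2),
    ("AC_BUSY", 3, 3), ("CRW_BUSY", 4, 4), ("CACHE_CTRL_BUSY", 5, 5),
    ("MC_WR_PENDING", 6, 6), ("FC_WR_PENDING", 7, 7), ("FC_RD_PENDING", 8, 8),
    ("EVICT_PENDING", 9, 9), ("LAST_RD_ARB_WINNER", 10, 10),
    ("MU_STATE", 11, 18),
]
CB_DEBUG_BUS_21 = [
    ("CM_BUSY", 0, 0), ("FC_BUSY", 1, 1), ("CC_BUSY", 2, 2),
    ("BB_BUSY", 3, 3), ("MA_BUSY", 4, 4), ("CORE_SCLK_VLD", 5, 5),
    ("REG_SCLK1_VLD", 6, 6), ("REG_SCLK0_VLD", 7, 7),
]

for reg_name, raw, layout in [
    ("mmCB_DEBUG_BUS_17", 0x00000804, CB_DEBUG_BUS_17),
    ("mmCB_DEBUG_BUS_21", 0x000000E3, CB_DEBUG_BUS_21),
]:
    nonzero = {k: v for k, v in decode_fields(raw, layout).items() if v}
    print(reg_name, hex(raw), nonzero)
```

Printing only the non-zero fields makes the interesting part of the dump stand out: in both banks the same few busy bits (TQ_BUSY, TILE_BUSY, CM_BUSY, FC_BUSY) remain set.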
Actually, I was told that a similar problem on the Bettong board was related to the IOMMU. Can you check whether IOMMU is enabled in your BIOS and, if it is, disable it and check whether the issue is still present? If you are not sure how to check and disable it, let me know. Andrey
Hi,

IOMMU is disabled:

root@qt5122:~# ls /sys/class/iommu
ls: cannot access '/sys/class/iommu': No such file or directory
root@qt5122:~# dmesg | grep -i iommu
root@qt5122:~#
(In reply to Ricardo Ribalda from comment #65) > Hi > > IOMMU is disabled: > > root@qt5122:~# ls /sys/class/iommu > ls: cannot access '/sys/class/iommu': No such file or directory > root@qt5122:~# dmesg | grep -i iommu > root@qt5122:~# To be sure, is it disabled in BIOS ?
I have coreboot. So I do not have the typical menu. I have an FPGA writing to the main memory, and that could not happen if iommu is enabled without extra configuration. If you can tell me which register to look for we can be 100% sure.
(In reply to Ricardo Ribalda from comment #67)
> I have coreboot. So I do not have the typical menu.
>
> I have an FPGA writing to the main memory, and that could not happen if
> iommu is enabled without extra configuration.
>
> If you can tell me which register to look for we can be 100% sure.

I took a Bettong board here in the office and booted it while pressing ESC, which goes into the BIOS; then I chose Chipset->GFX Configuration->IOMMU En/Dis. But I guess you mean that you have coreboot INSTEAD of a standard BIOS like I have. So are you at least sure that you have the following disabled, as in this link: https://wiki.gentoo.org/wiki/IOMMU_SWIOTLB (IOMMU Hardware Support disabled in make menuconfig), and that you also boot the kernel with iommu=off? If so, I assume you indeed have IOMMU disabled, and I will try again later to reproduce your issue with the Bettong board I just found, because before I was reproducing with a different CZ board.

Andrey
This is my kernel configuration:

ricardo@neopili:~/curro/kernel-upstream$ cat .config | grep -i IOMMU
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_SUPPORT=y
# Generic IOMMU Pagetable Support
# CONFIG_AMD_IOMMU is not set
# CONFIG_INTEL_IOMMU is not set
# CONFIG_IOMMU_DEBUG is not set

Tomorrow I will try booting with iommu=off, but my understanding was that if /sys/class/iommu is missing, the IOMMU is disabled.

Thanks!
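The two checks discussed here (which IOMMU options the kernel was built with, and whether /sys/class/iommu exists) are easy to script when the same question has to be answered on several boards. This is a sketch only: the function names are made up for illustration, the sample text is the .config excerpt from this comment, and treating a missing or empty /sys/class/iommu as "no IOMMU driver active" follows the understanding stated in this thread rather than a documented guarantee.

```python
import os


def enabled_iommu_options(config_text):
    """Return the IOMMU-related CONFIG_ symbols set to y or m in a kernel .config."""
    enabled = []
    for line in config_text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_AMD_IOMMU is not set" are comments: option is off.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if "IOMMU" in name and value in ("y", "m"):
            enabled.append(name)
    return enabled


def iommu_sysfs_present(path="/sys/class/iommu"):
    """True if the sysfs class directory exists and lists at least one IOMMU."""
    return os.path.isdir(path) and bool(os.listdir(path))


sample_config = """\
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_SUPPORT=y
# Generic IOMMU Pagetable Support
# CONFIG_AMD_IOMMU is not set
# CONFIG_INTEL_IOMMU is not set
# CONFIG_IOMMU_DEBUG is not set
"""

print("built-in IOMMU options:", enabled_iommu_options(sample_config))
print("sysfs IOMMU present:", iommu_sysfs_present())
```

For the configuration above, CONFIG_AMD_IOMMU does not appear in the enabled list, which is consistent with the empty `dmesg | grep -i iommu` output reported earlier.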
Just another wild guess: try modprobing amdgpu with power gating and clock gating disabled, to see if it makes a difference:

sudo modprobe amdgpu cg_mask=0 pg_mask=0
(In reply to Ricardo Ribalda from comment #69)
> This is my kernel configuration:
>
> ricardo@neopili:~/curro/kernel-upstream$ cat .config | grep -i IOMMU
> CONFIG_GART_IOMMU=y
> # CONFIG_CALGARY_IOMMU is not set
> CONFIG_IOMMU_HELPER=y
> CONFIG_IOMMU_SUPPORT=y
> # Generic IOMMU Pagetable Support
> # CONFIG_AMD_IOMMU is not set
> # CONFIG_INTEL_IOMMU is not set
> # CONFIG_IOMMU_DEBUG is not set
>
> Tomorrow I will try booting with iommu=off, but my understanding was that if
> /sys/class/iommu was missing, iommu was disabled.

Agreed.
With iommu=off: I can still see the error after around 3 attempts.

With iommu=off amdgpu.cg_mask=0 amdgpu.pg_mask=0: I could not see the stall after 30 boots.

With amdgpu.cg_mask=0 amdgpu.pg_mask=0: I could get the stall after 19 attempts.
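Since the hang is intermittent, it may help to estimate how much a run of clean boots actually proves. The sketch below assumes that every boot hangs independently with the same probability, and it takes "first hang after N attempts" as a crude estimate of 1/N for that probability; both are simplifying assumptions, and the function name `prob_all_clean` is made up for illustration.

```python
def prob_all_clean(p_hang, boots):
    """Probability of seeing no hang in `boots` boots if each boot hangs
    independently with probability p_hang (a simplifying assumption)."""
    return (1.0 - p_hang) ** boots


# Crude per-boot hang rates taken from the attempt counts reported above.
rates = {
    "iommu=off (hang after ~3 attempts)": 1.0 / 3.0,
    "cg_mask=0 pg_mask=0 (hang after 19 attempts)": 1.0 / 19.0,
}

for label, p_hang in rates.items():
    chance = prob_all_clean(p_hang, 30)
    print(f"{label}: chance of 30 clean boots by luck = {chance:.6f}")
```

Under these assumptions, 30 clean boots would be very unlikely by luck if the hang rate were still about one in three, but they are far from conclusive if the rate is closer to one in nineteen, so a longer automated run is the safer test for configurations that hang only rarely.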
Created attachment 274867 [details] Exclude HIQ patch So if disabling IOMMU does make a difference, can you please try to patch your kernel with this short patch and see if it helps?
Hi Andrey, the patch did not do the trick :(. Shall I also try with the patch and amdgpu.cg_mask=0 amdgpu.pg_mask=0?
(In reply to Ricardo Ribalda from comment #74) > Hi Andrey > > They patch did not do the trick :(. > > Shall I try also with the patch and amdgpu.cg_mask=0 amdgpu.pg_mask=0 Yes.
I was testing with GALLIUM_DDEBUG=flush. When I removed it:

amdgpu.cg_mask=0 amdgpu.pg_mask=0 : Stalls
iommu=off amdgpu.cg_mask=0 amdgpu.pg_mask=0 : Also stalls

:(
(In reply to Ricardo Ribalda from comment #76)
> I was testing with GALLIUM_DDEBUG=flush
>
> When I removed it:
>
> amdgpu.cg_mask=0 amdgpu.pg_mask=0 : Stalls
>
> iommu=off amdgpu.cg_mask=0 amdgpu.pg_mask=0 : Also stalls
>
> :(

I see, OK. Let me try to find time to reproduce it here. It might take a bit, since I am doing something else at the moment.

Andrey
Hello Ricardo, Did you have a chance to workaround this issue? I think I'm seeing the same one on the same hardware. Jerome
Hi Jerome,

Not really, I am waiting for some feedback from Andrey.

With the environment variable GALLIUM_DDEBUG=flush it works better, but still not perfectly.

Andrey, did you manage to replicate the issue? I can send you a root file system.

Regards!
(In reply to Ricardo Ribalda from comment #79)
> Hi Jerome
>
> Not really, I am waiting for some feedback from Andrey.
>
> With the env. variable:
>
> GALLIUM_DDEBUG=flush
>
> it works better, but still not perfect.
>
> Andrey, did you manage to replicate the issue? I can send you a root file
> system.
>
> Regards!

Unfortunately I have been sidetracked by higher-priority issues. I will definitely get to this; it's on my TODO list.
With the Bettong board it seems I have reproduced it. I am trying to debug it further and am involving Mesa/LLVM people to take a look.
Hi Andrey. Excellent news, thanks for your effort on this.
Created attachment 275403 [details] set COMPUTE_PGM_RSRC1 for SGPR/VGPR clearing Please test this patch by Nicolai who did a great job analyzing the issue, for me it fixes the hang. Andrey
Hi Andrey,

I have removed the GALLIUM_DDEBUG=flush hack and performed around 10 boots. I have not been able to replicate the bug.

I am porting the patch to my current distro and will run the test there for a couple of hours (hopefully tomorrow). But it looks pretty good!

Kudos!
(In reply to Ricardo Ribalda from comment #84) > Hi Andrey > > I have removed the GALLIUM_DDEBUG=flush hack and performed around 10 boots. > I have not been able to replicate the bug. > > I am porting the patch to my current distro and will run the test there for > a couple of hours (hopefully tomorrow). But It looks pretty good! > > Kudos! Ping me once it's done. Thanks.
(In reply to Andrey Grodzovsky from comment #85)
> (In reply to Ricardo Ribalda from comment #84)
> > Hi Andrey
> >
> > I have removed the GALLIUM_DDEBUG=flush hack and performed around 10 boots.
> > I have not been able to replicate the bug.
> >
> > I am porting the patch to my current distro and will run the test there for
> > a couple of hours (hopefully tomorrow). But It looks pretty good!
> >
> > Kudos!
>
> Ping me once it's done.
>
> Thanks.

Hi, any updates?

Andrey
Hi Andrey I have been running some manual tests (~30 boots) and it has been always ok. I did not have time to setup an automatic test machine. You are more than welcome to close the ticket and I will update it with the results later. Thanks for your help!
(In reply to Ricardo Ribalda from comment #87)
> Hi Andrey
>
> I have been running some manual tests (~30 boots) and it has been always ok.
> I did not have time to setup an automatic test machine.
>
> You are more than welcome to close the ticket and I will update it with the
> results later.
>
> Thanks for your help!

Great, thanks.

Andrey