Bug 198883 - amdgpu: carrizo: Screen stalls after starting X
Summary: amdgpu: carrizo: Screen stalls after starting X
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-02-22 10:46 UTC by Ricardo Ribalda
Modified: 2018-04-20 12:23 UTC
CC List: 6 users

See Also:
Kernel Version: 4.15.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (61.53 KB, text/plain)
2018-02-22 10:47 UTC, Ricardo Ribalda
xorg log (18.09 KB, text/plain)
2018-02-22 10:47 UTC, Ricardo Ribalda
xorg.conf (193 bytes, text/plain)
2018-02-22 10:47 UTC, Ricardo Ribalda
trace.dat (4.70 MB, application/octet-stream)
2018-02-23 15:01 UTC, Ricardo Ribalda
trace.txt (1.14 MB, text/plain)
2018-02-23 15:02 UTC, Ricardo Ribalda
glxinfo (101.81 KB, text/plain)
2018-02-23 15:29 UTC, Ricardo Ribalda
UMR output improvement (3.78 KB, application/mbox)
2018-02-27 13:58 UTC, Andrey Grodzovsky
Umr dump 0 (723.56 KB, text/plain)
2018-03-05 14:01 UTC, Ricardo Ribalda
umr dump 1 (3.55 MB, text/plain)
2018-03-05 14:02 UTC, Ricardo Ribalda
sudo umr -O many,bits -r *.gfx80.mmGRBM_STATUS &> stall sudo umr -O many,bits -r *.gfx80.HEADER_DUMP &>>stall sudo umr -O many,bits -r *.gfx80.CP_EOP &>>stall (10.23 KB, text/plain)
2018-03-06 08:09 UTC, Ricardo Ribalda
Stalled after starting X. Gdb also stalls (10.23 KB, text/plain)
2018-03-06 08:24 UTC, Ricardo Ribalda
Dmesg for 4.16 rc4 (47.11 KB, text/plain)
2018-03-06 13:39 UTC, Ricardo Ribalda
glxinfo with llvm7 (138.77 KB, text/plain)
2018-03-12 12:04 UTC, Ricardo Ribalda
ddebug dumps with llvm7 (159.89 KB, application/x-compressed-tar)
2018-03-12 14:40 UTC, Ricardo Ribalda
urm dump with llvm7 (825.68 KB, text/plain)
2018-03-12 14:41 UTC, Ricardo Ribalda
Exclude HIQ patch (709 bytes, patch)
2018-03-22 13:27 UTC, Andrey Grodzovsky
set COMPUTE_PGM_RSRC1 for SGPR/VGPR clearing (3.28 KB, patch)
2018-04-16 14:04 UTC, Andrey Grodzovsky

Description Ricardo Ribalda 2018-02-22 10:46:56 UTC
The screen freezes completely and does not even respond to commands like Alt+Ctrl+F1.

I can still ssh into the device.

Seems to happen more often when the system is cold :S.

Similar to (maybe the same as) #151341.

Relevant dmesg:

[  246.751055] INFO: task amdgpu_cs:0:530 blocked for more than 120 seconds.
[  246.751067]       Tainted: G           O     4.15.0-qtec-standard #3
[  246.751070] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.751074] amdgpu_cs:0     D    0   530    518 0x00080000
[  246.751079] Call Trace:
[  246.751107]  ? __schedule+0x25c/0x860
[  246.751113]  ? dma_fence_default_wait+0x10b/0x280
[  246.751115]  ? dma_fence_default_wait+0x1c7/0x280
[  246.751118]  schedule+0x2f/0x90
[  246.751122]  schedule_timeout+0x1e0/0x430
[  246.751237]  ? amdgpu_vm_update_directories+0x2d/0x5d0 [amdgpu]
[  246.751241]  ? dma_fence_default_wait+0x10b/0x280
[  246.751243]  ? dma_fence_default_wait+0x1c7/0x280
[  246.751246]  dma_fence_default_wait+0x1f3/0x280
[  246.751251]  ? __kfifo_in+0x2e/0x40
[  246.751254]  ? dma_fence_default_wait+0x280/0x280
[  246.751256]  dma_fence_wait_timeout+0x2e/0x100
[  246.751319]  amdgpu_ctx_wait_prev_fence+0x49/0x80 [amdgpu]
[  246.751378]  amdgpu_cs_ioctl+0x26e/0x1a90 [amdgpu]
[  246.751403]  ? radix_tree_node_alloc.constprop.13+0x3d/0xc0
[  246.751461]  ? amdgpu_cs_find_mapping+0xc0/0xc0 [amdgpu]
[  246.751493]  drm_ioctl_kernel+0x59/0xb0 [drm]
[  246.751514]  drm_ioctl+0x29f/0x340 [drm]
[  246.751572]  ? amdgpu_cs_find_mapping+0xc0/0xc0 [amdgpu]
[  246.751620]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[  246.751626]  do_vfs_ioctl+0x8e/0x680
[  246.751632]  ? SyS_futex+0x11d/0x150
[  246.751635]  SyS_ioctl+0x74/0x80
[  246.751638]  ? get_vtime_delta+0xe/0x40
[  246.751642]  do_syscall_64+0x6f/0x1c0
[  246.751645]  entry_SYSCALL64_slow_path+0x25/0x25
[  246.751649] RIP: 0033:0x3c768e57e7
[  246.751651] RSP: 002b:00007fd1220d0ba8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  246.751655] RAX: ffffffffffffffda RBX: 00007fd1220d0c88 RCX: 0000003c768e57e7
[  246.751656] RDX: 00007fd1220d0c10 RSI: 00000000c0186444 RDI: 000000000000000c
[  246.751658] RBP: 00007fd1220d0c10 R08: 00007fd1220d0cb0 R09: 00007fd1220d0c88
[  246.751659] R10: 00007fd1220d0cb0 R11: 0000000000000246 R12: 00000000c0186444
[  246.751661] R13: 000000000000000c R14: 0000000000000008 R15: 0000000000000000
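
A minimal sketch, not part of the original report, of how hung-task reports like the one above could be pulled out of a saved dmesg log. The function name and the regular expression are illustrative assumptions; only the message wording is taken from the dmesg excerpt.

```python
import re

# Matches the kernel hung-task report header, e.g.
# "[  246.751055] INFO: task amdgpu_cs:0:530 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(dmesg_text):
    """Return (timestamp, task name, pid, seconds) for each hung-task report."""
    reports = []
    for line in dmesg_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            reports.append(
                (float(m["ts"]), m["comm"], int(m["pid"]), int(m["secs"]))
            )
    return reports


# Header line copied from the dmesg excerpt above.
sample = "[  246.751055] INFO: task amdgpu_cs:0:530 blocked for more than 120 seconds."
print(find_hung_tasks(sample))
```

Note that the task name itself contains a colon (amdgpu_cs:0), so the pattern treats only the last colon-separated number as the PID.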
Comment 1 Ricardo Ribalda 2018-02-22 10:47:22 UTC
Created attachment 274353
dmesg
Comment 2 Ricardo Ribalda 2018-02-22 10:47:33 UTC
Created attachment 274355
xorg log
Comment 3 Ricardo Ribalda 2018-02-22 10:47:56 UTC
Created attachment 274357
xorg.conf
Comment 4 Andrey Grodzovsky 2018-02-23 13:02:55 UTC
Could you please provide ftrace output for the following events?
(See the HOWTO here: https://www.kernel.org/doc/Documentation/trace/ftrace.txt)

/sys/kernel/debug/tracing/events/dma_fence
/sys/kernel/debug/tracing/events/gpu_scheduler

In the amdgpu events folder:
/sys/kernel/debug/tracing/events/amdgpu

amdgpu_cs_ioctl
amdgpu_cs

Thanks,
Andrey
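
A minimal sketch of one way to switch on the events listed above by writing "1" to each event's enable file. The helper name, the default tracing root, and the idea of doing this from Python are assumptions for illustration; the directory names are copied from the list above and may differ between kernel versions. Running it against the real tracing directory needs root and a mounted debugfs.

```python
import os

# Usual debugfs location of the tracing directory (assumption: adjust it
# if tracefs is mounted somewhere else on your system).
TRACING_ROOT = "/sys/kernel/debug/tracing"

# Whole event groups plus the two single amdgpu events requested above.
EVENT_DIRS = [
    "dma_fence",
    "gpu_scheduler",
    "amdgpu/amdgpu_cs_ioctl",
    "amdgpu/amdgpu_cs",
]


def enable_events(root=TRACING_ROOT, event_dirs=EVENT_DIRS):
    """Write "1" to each event's enable file and return the paths written."""
    written = []
    for rel in event_dirs:
        path = os.path.join(root, "events", rel, "enable")
        with open(path, "w") as f:
            f.write("1")
        written.append(path)
    return written
```

Calling enable_events() as root and then reading the trace file under the same tracing root after reproducing the hang should give output comparable to what a trace-cmd recording of the same events produces.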
Comment 5 Ricardo Ribalda 2018-02-23 15:01:12 UTC
Created attachment 274397
trace.dat
Comment 6 Ricardo Ribalda 2018-02-23 15:02:34 UTC
Created attachment 274399
trace.txt
Comment 7 Ricardo Ribalda 2018-02-23 15:02:59 UTC
Traces obtained with trace-cmd record -e dma_fence:* -e amdgpu:* -e gpu_sched:*
Comment 8 Andrey Grodzovsky 2018-02-23 15:14:20 UTC
How do you reproduce it, and how often does it happen?

Andrey
Comment 9 Ricardo Ribalda 2018-02-23 15:19:25 UTC
I just have to start up the system and around 1 out of 10 times X will crash with no human intervention.

X manages to render some of the screen, like the window decorator or the cursor, but the content of the xterm is missing.

It is more likely to happen when the system is cold :S. I have rarely seen this happening after a hot reboot.

Less often, a system that seems to be working fine crashes X after running dmesg (or any other command that prints a lot of text) inside xterm.

Once the system is frozen, there is no change on the screen and I cannot alt+ctrl+f1.
Comment 10 Andrey Grodzovsky 2018-02-23 15:23:44 UTC
Are you starting X with startx, xinit, or some desktop manager? Can you also provide the output of glxinfo? BTW, what distro are you running?
Comment 11 Ricardo Ribalda 2018-02-23 15:29:02 UTC
Distro
I am using Yocto Project

Launching X with:

/etc/init.d/xserver-nodm start
that launches
su -l -c '/etc/xserver-nodm/Xserver &' $USER
that calls xinit internally.


Attaching glxinfo
Comment 12 Ricardo Ribalda 2018-02-23 15:29:16 UTC
Created attachment 274401
glxinfo
Comment 13 Andrey Grodzovsky 2018-02-23 15:31:49 UTC
Thanks for all the info, I will later try to reproduce it on my CZ setup.

Andrey
Comment 14 Ricardo Ribalda 2018-02-23 15:33:15 UTC
Thanks Andrey

Please bear in mind that #151341 seems very related, so this bug might be happening on more platforms.
Comment 15 Andrey Grodzovsky 2018-02-23 15:34:40 UTC
Yep, I noticed the other bug.

Thanks,
Andrey
Comment 16 Michel Dänzer 2018-02-23 15:42:54 UTC
(In reply to Ricardo Ribalda from comment #14)
> Please bear in mind that #151341 seems very related, [...]

What makes you think so? If it's that the backtraces look similar, that's just a generic symptom of a GPU hang, which can be caused by lots of different things.
Comment 17 Ricardo Ribalda 2018-02-23 15:45:52 UTC
The description of what/how it happens and the backtraces:


-able to login remotely via ssh.

-I tried to reset the gpu by using /sys/kernel/debug/dri/0/amdgpu_gpu_reset, and the result is a NULL pointer dereference in the kernel. (I did that and had almost the same result)


This is definitely outside my expertise :), I just don't want to waste the time of two developers.

Thanks again
Comment 18 Michel Dänzer 2018-02-23 20:20:31 UTC
(In reply to Ricardo Ribalda from comment #17)
> -able to login remotely via ssh.
> 
> -I tried to reset the gpu by using /sys/kernel/debug/dri/0/amdgpu_gpu_reset,
> and the result is a NULL pointer dereference in the kernel. (I did that and
> had almost the same result)

That could happen with any GPU hang, no matter what caused it.


> This is definitely outside my expertise :), I just don't want to waste the
> time of two developers.

Then it's better to assume for the time being that they're separate issues.
Comment 19 Andrey Grodzovsky 2018-02-23 21:44:05 UTC
I tried on my Ubuntu system multiple times, both starting and stopping the desktop manager and with just xinit. I didn't observe any hang.

This is the relevant SW info from my glxinfo:

OpenGL renderer string: AMD Radeon R7 Graphics (CARRIZO / DRM 3.25.0 / 4.15.0-rc4.main+, LLVM 7.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.1.0-devel (git-b494ed168c)

So I have the same kernel as you but use LLVM 7, while you have LLVM 5.
We also differ in Mesa and probably libdrm.

I wonder if you could update your Mesa and libdrm to the latest upstream and try switching to LLVM 7?

Thanks,
Andrey
Comment 20 Ricardo Ribalda 2018-02-24 06:41:25 UTC
Did you power cycle your platform between attempts? 

By latest upstream, do you mean the tip of the git development branch? I updated to the latest releases of Mesa, libdrm, LLVM and xorg-amdgpu around two weeks ago.

Since there are many components involved, would it be an option to send you a root file system? Maybe we are hunting a BIOS or HW error.

Please be patient, I will be at Embedded World the whole of next week.

Thanks!
Comment 21 Andrey Grodzovsky 2018-02-24 14:04:30 UTC
Yeah, you're right, I forgot all about the cold reset part you mentioned, so I will retry on Monday. In the meantime you can at least try switching to LLVM 7 and see if this makes things better. If it does, it can narrow down the problem.

Thanks,
Andrey
Comment 22 Andrey Grodzovsky 2018-02-26 16:11:14 UTC
I tried cold resets around 15 times (taking out the power cord), running the lightdm manager or xinit directly, and haven't seen any hang.

Let me know your firmware versions:
cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

Thanks,
Andrey
Comment 23 Andrey Grodzovsky 2018-02-27 13:57:41 UTC
I looked a bit more into the ftraces. The fence from the gfx ring with context=80, seqno=10, which is a drm_sched_fence.finished fence, is never signaled, so the next time the same slot (slot 10) needs to be reused it waits forever for that fence to be signaled. The finished fence is signaled once the HW fence wrapping the IB is signaled, so it looks like the related IB never completed execution.

You can try to generate a gfx ring dump with the UMR tool (https://cgit.freedesktop.org/amd/umr/).

Also try applying the attached patch to your kernel to improve the UMR output.

Once installed, please reproduce the issue and then attach the output from the following command:

sudo umr -O verbose,follow_ib -R gfx[0:2047]
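
The fence analysis described above can be approximated with a small script over the text trace. This is only a sketch: it assumes that each dma_fence event line ends in "context=N seqno=M", which is not confirmed anywhere in this thread, and the sample lines are made up for illustration rather than taken from the attached trace.txt.

```python
import re

# Assumed line layout (not confirmed by the thread):
# "<task> [cpu] <time>: dma_fence_<event>: driver=... timeline=... context=N seqno=M"
FENCE_RE = re.compile(
    r"(?P<event>dma_fence_\w+): .*?context=(?P<ctx>\d+) seqno=(?P<seq>\d+)"
)


def unsignaled_fences(trace_text):
    """Return sorted (context, seqno) pairs that were initialised but never signaled."""
    created, signaled = set(), set()
    for line in trace_text.splitlines():
        m = FENCE_RE.search(line)
        if not m:
            continue
        key = (int(m["ctx"]), int(m["seq"]))
        if m["event"] == "dma_fence_init":
            created.add(key)
        elif m["event"] == "dma_fence_signaled":
            signaled.add(key)
    return sorted(created - signaled)


# Hypothetical trace excerpt, made up for illustration.
sample = """\
Xorg-541 [001] 50.100000: dma_fence_init: driver=drm_sched timeline=gfx context=80 seqno=9
Xorg-541 [001] 50.200000: dma_fence_init: driver=drm_sched timeline=gfx context=80 seqno=10
kworker-77 [000] 50.300000: dma_fence_signaled: driver=drm_sched timeline=gfx context=80 seqno=9
"""
print(unsignaled_fences(sample))
```

Fences that are still legitimately in flight when the trace stops will also show up in the result, so the list is a set of candidates to check, not proof of a hang.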
Comment 24 Andrey Grodzovsky 2018-02-27 13:58:57 UTC
Created attachment 274473
UMR output improvement
Comment 25 Ricardo Ribalda 2018-03-05 13:26:06 UTC
Hi Andrey

Back from the Embedded World.

My firmware versions:

root@qt5122:~# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x34040300
UVD feature version: 0, firmware version: 0x01570b00
MC feature version: 0, firmware version: 0x00000000
ME feature version: 37, firmware version: 0x00000092
PFP feature version: 37, firmware version: 0x000000d8
CE feature version: 37, firmware version: 0x0000007e
RLC feature version: 1, firmware version: 0x00000099
MEC feature version: 37, firmware version: 0x00000299
MEC2 feature version: 37, firmware version: 0x00000299
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x00000000
SDMA0 feature version: 0, firmware version: 0x00000022
SDMA1 feature version: 0, firmware version: 0x00000022


I am working on the umr debug


Thanks!
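
For comparing firmware versions between two machines, the amdgpu_firmware_info output above can be turned into a dictionary. This is a minimal sketch; the function name is an illustrative assumption and the sample lines are copied from the listing above.

```python
import re

# One line per firmware block, e.g.
# "ME feature version: 37, firmware version: 0x00000092"
FW_RE = re.compile(
    r"^(?P<name>\w+) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)


def parse_firmware_info(text):
    """Map firmware block name to (feature version, firmware version)."""
    info = {}
    for line in text.splitlines():
        m = FW_RE.match(line.strip())
        if m:
            info[m["name"]] = (int(m["feature"]), int(m["fw"], 16))
    return info


# Three lines copied from the listing above.
sample = """\
ME feature version: 37, firmware version: 0x00000092
PFP feature version: 37, firmware version: 0x000000d8
SDMA0 feature version: 0, firmware version: 0x00000022
"""
fw = parse_firmware_info(sample)
print(fw["ME"])
```

Parsing the output from two systems and comparing the resulting dictionaries would show at a glance which firmware blocks differ.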
Comment 26 Ricardo Ribalda 2018-03-05 14:01:52 UTC
Created attachment 274535
Umr dump 0
Comment 27 Ricardo Ribalda 2018-03-05 14:02:13 UTC
Created attachment 274537
umr dump 1
Comment 28 Ricardo Ribalda 2018-03-05 14:03:02 UTC
Here you go!

umr dump 0 is a crash just after starting X

umr dump 1 is a crash after running dmesg inside xterm just after starting X

Thanks!
Comment 29 Andrey Grodzovsky 2018-03-05 17:27:48 UTC
Thanks for the logs. Just to clear things up: you keep saying 'crash', but you actually mean hang, right? Or do you observe an actual Xorg process crash? To know for sure, check whether Xorg is still running with 'ps' or 'top', and check /var/log/Xorg.log.
Comment 30 Andrey Grodzovsky 2018-03-05 21:22:12 UTC
Using UMR please provide following data:

sudo umr -O many,bits  -r *.gfx80.mmGRBM_STATUS
sudo umr -O many,bits  -r *.gfx80.HEADER_DUMP
sudo umr -O many,bits  -r *.gfx80.CP_EOP
Comment 31 Ricardo Ribalda 2018-03-06 08:08:20 UTC
Hi Andrey

I mean a stall, sorry about that.

root@qt5122:~# ps aux | grep Xorg
root       520  0.0  0.0  15936   900 ?        S    07:59   0:00 xinit /etc/X11/Xsession -- /usr/bin/Xorg :0 -br -pn
root       541  0.2  1.3 859896 47412 ?        S<sl 07:59   0:00 /usr/bin/Xorg :0 -br -pn
root      1072  0.0  0.0   8088  1052 pts/3    S+   08:01   0:00 grep Xorg

root@qt5122:~# gdb -p 541
GNU gdb (GDB) 8.0
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-poky-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 541
[New LWP 614]
[New LWP 615]
[New LWP 616]
[New LWP 617]
[New LWP 618]
[New LWP 619]
[New LWP 620]
[New LWP 625]
[New LWP 636]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
0x000000321c4ee1a6 in __GI_epoll_pwait (epfd=3, events=events@entry=0x7ffc92802560, 
    maxevents=maxevents@entry=256, timeout=474326, set=set@entry=0x0)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_pwait.c:42
42	  return SYSCALL_CANCEL (epoll_pwait, epfd, events, maxevents,

(gdb) thread apply all bt

Thread 10 (Thread 0x7fb4d31cd700 (LWP 636)):
#0  0x000000321c4ee1a6 in __GI_epoll_pwait (epfd=24, events=events@entry=0x7fb4d31cc0d0, 
    maxevents=maxevents@entry=256, timeout=timeout@entry=-1, set=set@entry=0x0)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1  0x000000321c4ee318 in epoll_wait (epfd=<optimized out>, events=events@entry=0x7fb4d31cc0d0, 
    maxevents=maxevents@entry=256, timeout=timeout@entry=-1)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_wait.c:30
#2  0x000000000056ff24 in ospoll_wait (ospoll=0xea8b50, timeout=timeout@entry=-1)
    at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/os/ospoll.c:397
#3  0x000000000056d9b6 in InputThreadDoWork (arg=<optimized out>)
    at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/os/inputthread.c:367
#4  0x000000321c807385 in start_thread (arg=0x7fb4d31cd700)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#5  0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7fb4d3fff700 (LWP 625)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xd390f4)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xd390a0, cond=0xd390c8)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xd390c8, mutex=0xd390a0)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3  0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4  0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5  0x000000321c807385 in start_thread (arg=0x7fb4d3fff700)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6  0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7fb4d93d8700 (LWP 620)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xc655f8)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc655a8, cond=0xc655d0)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc655d0, mutex=0xc655a8)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3  0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4  0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5  0x000000321c807385 in start_thread (arg=0x7fb4d93d8700)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6  0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7fb4d9bd9700 (LWP 619)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xc655f8)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc655a8, cond=0xc655d0)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc655d0, mutex=0xc655a8)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3  0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4  0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5  0x000000321c807385 in start_thread (arg=0x7fb4d9bd9700)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6  0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7fb4da3da700 (LWP 618)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xc65510)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc654c0, cond=0xc654e8)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc654e8, mutex=0xc654c0)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3  0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4  0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5  0x000000321c807385 in start_thread (arg=0x7fb4da3da700)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6  0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fb4dabdb700 (LWP 617)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xc65510)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc654c0, cond=0xc654e8)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc654e8, mutex=0xc654c0)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3  0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4  0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5  0x000000321c807385 in start_thread (arg=0x7fb4dabdb700)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6  0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fb4db3dc700 (LWP 616)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xc65514)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc654c0, cond=0xc654e8)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc654e8, mutex=0xc654c0)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3  0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4  0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5  0x000000321c807385 in start_thread (arg=0x7fb4db3dc700)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6  0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fb4dbbdd700 (LWP 615)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xc15d10)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc15cc0, cond=0xc15ce8)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc15ce8, mutex=0xc15cc0)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3  0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4  0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5  0x000000321c807385 in start_thread (arg=0x7fb4dbbdd700)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6  0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fb4e094d700 (LWP 614)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xc62760)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc62710, cond=0xc62738)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc62738, mutex=0xc62710)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_cond_wait.c:655
#3  0x00007fb4e40e5243 in ?? () from /usr/lib/dri/radeonsi_dri.so
#4  0x00007fb4e40e5188 in ?? () from /usr/lib/dri/radeonsi_dri.so
#5  0x000000321c807385 in start_thread (arg=0x7fb4e094d700)
    at /usr/src/debug/glibc/2.26-r0/git/nptl/pthread_create.c:465
#6  0x000000321c4ee03f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fb4e5de18c0 (LWP 541)):
#0  0x000000321c4ee1a6 in __GI_epoll_pwait (epfd=3, events=events@entry=0x7ffc92802560, 
    maxevents=maxevents@entry=256, timeout=474326, set=set@entry=0x0)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1  0x000000321c4ee318 in epoll_wait (epfd=<optimized out>, events=events@entry=0x7ffc92802560, 
    maxevents=maxevents@entry=256, timeout=<optimized out>)
    at /usr/src/debug/glibc/2.26-r0/git/sysdeps/unix/sysv/linux/epoll_wait.c:30
#2  0x000000000056ff24 in ospoll_wait (ospoll=0xc07330, timeout=<optimized out>)
    at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/os/ospoll.c:397
#3  0x0000000000569a4b in WaitForSomething (are_ready=<optimized out>)
    at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/os/WaitFor.c:226
#4  0x000000000043246b in Dispatch ()
    at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/dix/dispatch.c:422
#5  0x000000000043647d in dix_main (argc=4, argv=0x7ffc92803388, envp=<optimized out>)
    at /usr/src/debug/xserver-xorg/2_1.19.3-r0/xorg-server-1.19.3/dix/main.c:287
#6  0x000000321c420f1c in __libc_start_main (main=0x4217e0 <main>, argc=4, argv=0x7ffc92803388, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc92803378)
    at /usr/src/debug/glibc/2.26-r0/git/csu/libc-start.c:308
#7  0x000000000042181a in _start () at ../sysdeps/x86_64/start.S:120
(gdb)
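
A long "thread apply all bt" dump like the one above is easier to read once the threads are grouped by their innermost frame. The following is a minimal sketch under the assumption that each thread starts with a "Thread N (" line followed by a "#0" frame line, as in the dump above; the helper name is illustrative.

```python
import re
from collections import Counter

THREAD_RE = re.compile(r"^Thread (\d+) \(")
# Frame 0 may or may not carry an address before the function name.
FRAME0_RE = re.compile(r"^#0\s+(?:0x[0-9a-f]+ in )?(\w+) \(")


def top_frames(bt_text):
    """Count how many threads have each function as their innermost frame."""
    counts = Counter()
    in_thread = False
    for line in bt_text.splitlines():
        if THREAD_RE.match(line):
            in_thread = True
            continue
        m = FRAME0_RE.match(line)
        if in_thread and m:
            counts[m.group(1)] += 1
            in_thread = False
    return counts


# Thread headers and frame 0 lines shortened from the dump above.
sample = """\
Thread 10 (Thread 0x7fb4d31cd700 (LWP 636)):
#0  0x000000321c4ee1a6 in __GI_epoll_pwait (epfd=24, events=events@entry=0x7fb4d31cc0d0,
Thread 9 (Thread 0x7fb4d3fff700 (LWP 625)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0,
Thread 8 (Thread 0x7fb4d93d8700 (LWP 620)):
#0  0x000000321c80d2a5 in futex_wait_cancelable (private=<optimized out>, expected=0,
"""
counts = top_frames(sample)
print(counts)
```

In the full dump above, threads 2 to 9 sit in futex_wait_cancelable and threads 1 and 10 in __GI_epoll_pwait.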
Comment 32 Ricardo Ribalda 2018-03-06 08:09:23 UTC
Created attachment 274579
sudo umr -O many,bits  -r *.gfx80.mmGRBM_STATUS &> stall  sudo umr -O many,bits  -r *.gfx80.HEADER_DUMP &>>stall sudo umr -O many,bits  -r *.gfx80.CP_EOP &>>stall
Comment 33 Ricardo Ribalda 2018-03-06 08:23:40 UTC
The last dump belongs to a stall after running dmesg (similar to yesterday's dump 1).

The next one happens after running Xorg. This time, gdb also stalls when trying to get a dump:

root@qt5122:~# ps aux | grep Xorg
root       519  0.0  0.0  15936   900 ?        S    08:19   0:00 xinit /etc/X11/Xsession -- /usr/bin/Xorg :0 -br -pn
root       540  0.3  1.3 857220 47812 ?        S<sl 08:19   0:00 /usr/bin/Xorg :0 -br -pn
root       692  0.0  0.0   8088  1056 pts/2    S+   08:20   0:00 grep Xorg
root@qt5122:~# gdb -p 540
GNU gdb (GDB) 8.0
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-poky-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 540
[New LWP 617]
[New LWP 618]
[New LWP 619]
[New LWP 620]
[New LWP 621]
[New LWP 622]
[New LWP 623]
[New LWP 624]
[New LWP 635]
Comment 34 Ricardo Ribalda 2018-03-06 08:24:34 UTC
Created attachment 274581
Stalled after starting X. Gdb also stalls

sudo umr -O many,bits  -r *.gfx80.mmGRBM_STATUS &> stall 
sudo umr -O many,bits  -r *.gfx80.HEADER_DUMP &>>stall
sudo umr -O many,bits  -r *.gfx80.CP_EOP &>>stall
Comment 35 Andrey Grodzovsky 2018-03-06 12:52:18 UTC
I will take a look at the register dumps and also forward them to a few people  to take a look. 

Did you consider trying the latest mainline kernel from https://www.kernel.org/ (4.16-rc4) to see if the issue still happens? I still can't reproduce it.

Thanks,
Andrey
Comment 36 Ricardo Ribalda 2018-03-06 13:38:49 UTC
Hi Andrey

I just tried with commit ce380619fab99036f5e745c7a865b21c59f005f6

Linux version 4.16.0-rc4-qtec-standard+ (ricardo@neopili) (gcc version 7.3.0 (Debian 7.3.0-5)) #1 SMP Tue Mar 6 14:11:26 CET 2018

And it is still stalling.

It has an interesting message in dmesg:

[   52.703150] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=20, last emitted seq=22
[   52.703162] [drm] IP block:gfx_v8_0 is hung!
[   52.703168] [drm] GPU recovery disabled.

Attached you will find the whole dmesg
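
The timeout message above already tells how many submissions are stuck: it is the difference between the last emitted and the last signaled sequence number. A minimal sketch of extracting that from a dmesg line follows; the function name is an illustrative assumption, and the message text is copied from the log above.

```python
import re

TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def pending_jobs(line):
    """Return (ring, number of emitted-but-unsignaled fences), or None."""
    m = TIMEOUT_RE.search(line)
    if not m:
        return None
    return m["ring"], int(m["emitted"]) - int(m["signaled"])


# Message copied from the dmesg excerpt above.
line = (
    "[   52.703150] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=20, last emitted seq=22"
)
print(pending_jobs(line))
```

For the message above this gives two outstanding fences on the gfx ring.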
Comment 37 Ricardo Ribalda 2018-03-06 13:39:35 UTC
Created attachment 274589
Dmesg for 4.16 rc4
Comment 38 Ricardo Ribalda 2018-03-06 14:07:41 UTC
Just for the fun of it, I did try the amdgpu.gpu_recovery=1 module parameter,


It seems to detect the crash properly, but fails miserably on the recovery. The screen blanks and the whole system is unusable: LEDs stop blinking, ssh does not work... I need to hard reset the system.


Thanks!
Comment 39 Andrey Grodzovsky 2018-03-06 16:39:13 UTC
(In reply to Ricardo Ribalda from comment #38)
> Just for the fun of it, I did try the amdgpu.gpu_recovery=1 module parameter,
> 
> 
> It seems to detect properly the crash, but fails miserably on the recovery.
> The screen blanks and the whole system is unusable: LEDs stop blinking, ssh
> does not work... I need to hard reset the system.
> 
> 
> Thanks!

The job timeout message is expected since you indeed have a hung pipe.
GPU recovery failure on CZ is also expected; it is a separate issue we have with the GPU PCI config reset causing a hard hang (I am looking at this as well).

I suggest you try to be on the latest of everything to see if this still happens.
You are already on the latest kernel, so now pick up the latest firmware we have from this git: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
Take all the carrizo* files and overwrite the ones you have for your kernel (usually in /lib/firmware/`uname -r`/amdgpu or in a kernel-specific folder under /lib/firmware).

I think you already updated your libdrm/Mesa/LLVM, but your LLVM is still 5.0, right? Can you try switching to LLVM 7?
Comment 40 Ricardo Ribalda 2018-03-06 16:53:22 UTC
Seems that I am already using the latest firmware files. 

I will try to switch to LLVM 7, but that will require a bit of compilation time; it is not a 5-minute job. Just to be 100% sure, by LLVM 7 do you mean the HEAD of https://github.com/llvm-mirror/llvm? Do you prefer a specific hash?

Regards!
Comment 41 Andrey Grodzovsky 2018-03-06 17:54:15 UTC
I actually use Debian packages for LLVM 7 so I can't tell you, but you will know it's LLVM 7 once you install it and run glxinfo: it will show the LLVM version you are using.
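
A minimal sketch of checking the LLVM version from saved glxinfo output. It assumes the renderer string has the radeonsi layout quoted earlier in this thread ("..., LLVM x.y.z)"); the function name is an illustrative assumption.

```python
import re


def llvm_version(glxinfo_text):
    """Return the LLVM version tuple from the OpenGL renderer string, or None."""
    for line in glxinfo_text.splitlines():
        if line.startswith("OpenGL renderer string:"):
            m = re.search(r"LLVM (\d+)\.(\d+)\.(\d+)", line)
            if m:
                return tuple(int(x) for x in m.groups())
    return None


# Renderer string as quoted earlier in this thread.
sample = (
    "OpenGL renderer string: AMD Radeon R7 Graphics "
    "(CARRIZO / DRM 3.25.0 / 4.15.0-rc4.main+, LLVM 7.0.0)"
)
print(llvm_version(sample))
```

The function returns None when no renderer line with an LLVM version is present, for example when a non-LLVM driver is in use.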
Comment 42 Ricardo Ribalda 2018-03-09 15:46:34 UTC
With current LLVM7 master X does not start at all. I get the following output:

root@qt5122:~# X

X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.14.0-3-amd64 x86_64 
Current Operating System: Linux qt5122 4.16.0-rc4-qtec-standard #1 SMP Fri Mar 9 15:13:05 CET 2018 x86_64
Kernel command line: BOOT_IMAGE=/boot/bzImage rw root=PARTUUID=c839955a-02 rootwait qtec_mem.size=64 quiet
Build Date: 09 March 2018  02:29:53PM
 
Current version of pixman: 0.34.0
	Before reporting problems, check http://wiki.x.org
	to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
	(++) from command line, (!!) notice, (II) informational,
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Fri Mar  9 15:47:03 2018
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(II) [KMS] Kernel modesetting enabled.
LLVM ERROR: Cannot select: 0x7f74b40a59e0: v4i32,ch = load<LD16[%22(addrspace=2)](dereferenceable)(invariant)> 0x7f74b40251c8, 0x7f74b40a5978, undef:i32
  0x7f74b40a5978: i32 = add 0x7f74b409f378, Constant:i32<16>
    0x7f74b409f378: i32,ch = CopyFromReg 0x7f74b40251c8, Register:i32 %4
      0x7f74b409f310: i32 = Register %4
    0x7f74b40a5910: i32 = Constant<16>
  0x7f74b409fc68: i32 = undef
In function: main
LLVM ERROR: Cannot select: 0x7f74a00b0570: v4i32,ch = load<LD16[%22(addrspace=2)](dereferenceable)(invariant)> 0x7f74a002eab8, 0x7f74a00b0508, undef:i32
  0x7f74a00b0508: i32 = add 0x7f74a00a9f08, Constant:i32<16>
    0x7f74a00a9f08: i32,ch = CopyFromReg 0x7f74a002eab8, Register:i32 %4
      0x7f74a00a9ea0: i32 = Register %4
    0x7f74a00b04a0: i32 = Constant<16>
  0x7f74a00aa7f8: i32 = undef
In function: main
root@qt5122:~#
Comment 43 Michel Dänzer 2018-03-09 16:08:27 UTC
(In reply to Ricardo Ribalda from comment #42)
> With current LLVM7 master X does not start at all. I get the following
> output:

Did you recompile current Mesa Git master against that as well? If so, that's most likely a separate issue, either in LLVM or Mesa, which needs to be tracked separately from this one.
Comment 44 Nicolai Hähnle 2018-03-09 18:31:25 UTC
For the LLVM problem, please provide the log output from starting X with R600_DEBUG=vs,ps in the environment.
Comment 45 Ricardo Ribalda 2018-03-09 18:38:17 UTC
(In reply to Michel Dänzer from comment #43)
> (In reply to Ricardo Ribalda from comment #42)
> > With current LLVM7 master X does not start at all. I get the following
> > output:
> 
> Did you recompile current Mesa Git master against that as well? If so,
> that's most likely a separate issue, either in LLVM or Mesa, which needs to
> be tracked separately from this one.

I used Mesa 17.3.6 plus these patches:

https://github.com/mesa3d/mesa/commit/78673b614b01a8a416367db23937743c0e1aaa36.diff

https://patchwork.freedesktop.org/patch/186737/


(In reply to Nicolai Hähnle from comment #44)
> For the LLVM problem, please provide the log output from starting X with
> R600_DEBUG=vs,ps in the environment.

It will have to wait until Monday, sorry :(
Comment 46 Ricardo Ribalda 2018-03-12 08:12:41 UTC
(In reply to Nicolai Hähnle from comment #44)
> For the LLVM problem, please provide the log output from starting X with
> R600_DEBUG=vs,ps in the environment.

Here you are.

Thanks!

X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.14.0-3-amd64 x86_64 
Current Operating System: Linux qt5122 4.16.0-rc4-qtec-standard #1 SMP Fri Mar 9 15:13:05 CET 2018 x86_64
Kernel command line: BOOT_IMAGE=/boot/bzImage rw root=PARTUUID=c839955a-02 rootwait qtec_mem.size=64 quiet
Build Date: 09 March 2018  02:29:53PM
 
Current version of pixman: 0.34.0
	Before reporting problems, check http://wiki.x.org
	to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
	(++) from command line, (!!) notice, (II) informational,
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Mon Mar 12 08:12:00 2018
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(II) [KMS] Kernel modesetting enabled.
VERT
PROPERTY NEXT_SHADER FRAG
DCL IN[0]
DCL IN[1]
DCL OUT[0], POSITION
DCL OUT[1].xy, GENERIC[0]
  0: MOV OUT[0], IN[0]
  1: MOV OUT[1].xy, IN[1].xyxx
  2: END
radeonsi: Compiling shader 1
TGSI shader LLVM IR:

; ModuleID = 'tgsi'
source_filename = "tgsi"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-A5"
target triple = "amdgcn--"

define amdgpu_vs void @main([12 x <4 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), [0 x <8 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), [0 x <4 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), [80 x <8 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), [16 x <4 x i32>] addrspace(2)* byval noalias dereferenceable(18446744073709551615), i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32, i32, i32, i32, i32, i32) #0 {
main_body:
  %15 = getelementptr [16 x <4 x i32>], [16 x <4 x i32>] addrspace(2)* %4, i32 0, i32 0, !amdgpu.uniform !0
  %16 = load <4 x i32>, <4 x i32> addrspace(2)* %15, align 16, !invariant.load !0
  %17 = call nsz <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(<4 x i32> %16, i32 %13, i32 0, i1 false, i1 false) #3
  %18 = extractelement <4 x float> %17, i32 0
  %19 = extractelement <4 x float> %17, i32 1
  %20 = extractelement <4 x float> %17, i32 2
  %21 = extractelement <4 x float> %17, i32 3
  %22 = getelementptr [16 x <4 x i32>], [16 x <4 x i32>] addrspace(2)* %4, i32 0, i32 1, !amdgpu.uniform !0
  %23 = load <4 x i32>, <4 x i32> addrspace(2)* %22, align 16, !invariant.load !0
  %24 = call nsz <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(<4 x i32> %23, i32 %14, i32 0, i1 false, i1 false) #3
  %25 = extractelement <4 x float> %24, i32 0
  %26 = extractelement <4 x float> %24, i32 1
  call void @llvm.amdgcn.exp.f32(i32 12, i32 15, float %18, float %19, float %20, float %21, i1 true, i1 false) #2
  call void @llvm.amdgcn.exp.f32(i32 32, i32 15, float %25, float %26, float undef, float undef, i1 false, i1 false) #2
  ret void
}

; Function Attrs: nounwind readonly
declare <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(<4 x i32>, i32, i32, i1, i1) #1

; Function Attrs: nounwind
declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #2

attributes #0 = { "no-signed-zeros-fp-math"="true" }
attributes #1 = { nounwind readonly }
attributes #2 = { nounwind }
attributes #3 = { nounwind readnone }

!0 = !{}

LLVM ERROR: Cannot select: 0x12844a0: v4i32,ch = load<LD16[%22(addrspace=2)](dereferenceable)(invariant)> 0xc352c8, 0x1284438, undef:i32
  0x1284438: i32 = add 0x127df58, Constant:i32<16>
    0x127df58: i32,ch = CopyFromReg 0xc352c8, Register:i32 %4
      0x127def0: i32 = Register %4
    0x12843d0: i32 = Constant<16>
  0x127e848: i32 = undef
In function: main
Comment 47 Ricardo Ribalda 2018-03-12 11:00:02 UTC
Just for the fun of it, I have tried  4.16.0-rc4 + llvm git HEAD + mesa git HEAD + libdrm 2.4.91 + xf86-video-amdgpu_18.0.0 + libxcb_1.13.bb and X starts.

Unfortunately it also freezes:

[   34.272153] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=18, last emitted seq=20
[   34.272166] [drm] IP block:gfx_v8_0 is hung!
[   34.272171] [drm] GPU recovery disabled.


Is there any extra debug info you want from me?
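
For comparing hangs across runs, the timeout line can be parsed with a small script. This is only a sketch: the regular expression is modelled on the message format seen in this report, and the helper name `pending_jobs` is made up for illustration. The difference between the two sequence numbers is read here as the number of submissions emitted but not yet signaled, which is an interpretation rather than something stated in the log.

```python
import re

# Pattern modelled on the amdgpu job-timeout message as it appears in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def pending_jobs(dmesg_line):
    """Return (ring, emitted - signaled) for a timeout line, or None if it does not match."""
    m = TIMEOUT_RE.search(dmesg_line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("emitted")) - int(m.group("signaled"))


line = ("[   34.272153] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
        "last signaled seq=18, last emitted seq=20")
print(pending_jobs(line))  # ('gfx', 2)
```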
Comment 48 Andrey Grodzovsky 2018-03-12 11:55:34 UTC
Is it LLVM 7.0 in this case ?
Comment 49 Ricardo Ribalda 2018-03-12 12:04:32 UTC
(In reply to Andrey Grodzovsky from comment #48)
> Is it LLVM 7.0 in this case ?

Yes: OpenGL renderer string: AMD Radeon R7 Graphics (CARRIZO / DRM 3.23.0 / 4.16.0-rc4-qtec-standard, LLVM 7.0.0)
Comment 50 Ricardo Ribalda 2018-03-12 12:04:57 UTC
Created attachment 274679 [details]
glxinfo with llvm7
Comment 51 Ricardo Ribalda 2018-03-12 14:12:35 UTC
Andrey, do you have access to a Bettong board? If so I can send you an image where you can get the stall.
Comment 52 Andrey Grodzovsky 2018-03-12 14:32:38 UTC
I don't think I have one, since I don't know what it is.

So since it also fails with LLVM 7, the LLVM version is not the problem.

We can try again to get more detailed logs from the SAME hang, as follows:

sudo umr -O verbose,follow_ib -R gfx[0:2047]
sudo umr -O bits -wa
sudo umr -O many,bits -r *.gfx80.mmGRBM_STATUS
sudo umr -O many,bits -r *.gfx80.HEADER_DUMP
sudo umr -O many,bits -r *.gfx80.CP_EOP

This way, addresses from the ring dump can be related to register values.

On the user-mode side, run X with the environment variable GALLIUM_DDEBUG=1000 and collect the resulting ~/ddebug_dumps/ files.

Thanks,
Andrey
Comment 54 Ricardo Ribalda 2018-03-12 14:40:44 UTC
Created attachment 274683 [details]
ddebug dumps with llvm7
Comment 55 Ricardo Ribalda 2018-03-12 14:41:07 UTC
Created attachment 274685 [details]
urm dump with llvm7
Comment 56 Ricardo Ribalda 2018-03-12 14:42:41 UTC
Bettong is the reference Merlin Falcon board from AMD; I think they also refer to it as DB-FP4. It is also supported by coreboot.
Comment 57 Nicolai Hähnle 2018-03-14 08:14:40 UTC
(In reply to Ricardo Ribalda from comment #46)

This was due to the constant address space change in LLVM. It has since been fixed in Mesa master.
Comment 58 Andrey Grodzovsky 2018-03-15 17:59:11 UTC
Please try the following to see if it helps avoid the hang:

R600_DEBUG=notiling,norbplus

try with GALLIUM_DDEBUG=flush to see if it makes a difference (flushes after each draw call)

try running programs that can be used without X: like KMS cube

Andrey
Comment 59 Ricardo Ribalda 2018-03-16 11:11:42 UTC
Hi Andrey

Testing with llvm7 setup:


R600_DEBUG=notiling,norbplus xinit

does not avoid the hang :(

[   31.200134] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=7, last emitted seq=8
[   31.200147] [drm] IP block:gfx_v8_0 is hung!
[   31.200152] [drm] GPU recovery disabled.



GALLIUM_DDEBUG=flush xinit

Seems to have a (good) impact. Hang happened after around 20 boots.

[   13.389894] NET: Registered protocol family 39
[   28.640142] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=652, last emitted seq=654
[   28.640154] [drm] IP block:gfx_v8_0 is hung!
[   28.640160] [drm] GPU recovery disabled.
[  246.752083] INFO: task amdgpu_cs:0:636 blocked for more than 120 seconds.
[  246.752092]       Not tainted 4.16.0-rc4-qtec-standard #1
[  246.752095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.752099] amdgpu_cs:0     D    0   636    543 0x00080000
[  246.752099] amdgpu_cs:0     D    0   636    543 0x00080000
[  246.752103] Call Trace:
[  246.752115]  ? __schedule+0x25c/0x860
[  246.752122]  ? dma_fence_default_wait+0x10c/0x280
[  246.752124]  ? dma_fence_default_wait+0x1c8/0x280
[  246.752127]  schedule+0x2f/0x90
[  246.752130]  schedule_timeout+0x1f1/0x440
[  246.752220]  ? amdgpu_cs_bo_validate+0x7f/0x120 [amdgpu]
[  246.752276]  ? amdgpu_ttm_alloc_gart+0x5d/0x270 [amdgpu]
[  246.752284]  ? dma_fence_default_wait+0x10c/0x280
[  246.752287]  ? dma_fence_default_wait+0x1c8/0x280
[  246.752290]  dma_fence_default_wait+0x1f4/0x280
[  246.752294]  ? dma_fence_default_wait+0x280/0x280
[  246.752297]  dma_fence_wait_timeout+0x2e/0x100
[  246.752359]  amdgpu_ctx_wait_prev_fence+0x46/0x80 [amdgpu]
[  246.752418]  amdgpu_cs_ioctl+0x1f2/0x1af0 [amdgpu]
[  246.752478]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  246.752504]  drm_ioctl_kernel+0x59/0xb0 [drm]
[  246.752524]  drm_ioctl+0x29f/0x340 [drm]
[  246.752581]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  246.752630]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[  246.752635]  do_vfs_ioctl+0x8e/0x680
[  246.752641]  ? SyS_futex+0x11d/0x150
[  246.752644]  SyS_ioctl+0x74/0x80
[  246.752647]  ? get_vtime_delta+0xe/0x40
[  246.752650]  do_syscall_64+0x7b/0x1d0
[  246.752654]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  246.752658] RIP: 0033:0x3e292e57e7
[  246.752660] RSP: 002b:00007f17a2924b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  246.752663] RAX: ffffffffffffffda RBX: 00007f17a2924c78 RCX: 0000003e292e57e7
[  246.752664] RDX: 00007f17a2924bf0 RSI: 00000000c0186444 RDI: 000000000000000d
[  246.752666] RBP: 00007f17a2924bf0 R08: 00007f17a2924ca0 R09: 00007f17a2924c78
[  246.752667] R10: 00007f17a2924ca0 R11: 0000000000000246 R12: 00000000c0186444
[  246.752668] R13: 000000000000000d R14: 0000000000d2c588 R15: 0000000000000002


kmscube git HEAD has not stalled after 50 attempts, without any flag in GALLIUM_DDEBUG or R600_DEBUG.

It might not be relevant, but my platform can only boot with BIOS (it is based on coreboot). When you tried to reproduce this bug, did you boot with UEFI or BIOS?

Thanks
Comment 60 Andrey Grodzovsky 2018-03-16 14:37:08 UTC
(In reply to Ricardo Ribalda from comment #59)
> Hi Andrey
> 
> Testing with llvm7 setup:
> 
> 
> R600_DEBUG=notiling,norbplus xinit
> 
> does not avoid the hang :(
> 
> [   31.200134] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> last signaled seq=7, last emitted seq=8
> [   31.200147] [drm] IP block:gfx_v8_0 is hung!
> [   31.200152] [drm] GPU recovery disabled.
> 
> 
> 
> GALLIUM_DDEBUG=flush xinit
> 
> Seems to have a (good) impact. Hang happened after around 20 boots.
> 
> [   13.389894] NET: Registered protocol family 39
> [   28.640142] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> last signaled seq=652, last emitted seq=654
> [   28.640154] [drm] IP block:gfx_v8_0 is hung!
> [   28.640160] [drm] GPU recovery disabled.
> [  246.752083] INFO: task amdgpu_cs:0:636 blocked for more than 120 seconds.
> [  246.752092]       Not tainted 4.16.0-rc4-qtec-standard #1
> [  246.752095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  246.752099] amdgpu_cs:0     D    0   636    543 0x00080000
> [  246.752099] amdgpu_cs:0     D    0   636    543 0x00080000
> [  246.752103] Call Trace:
> [  246.752115]  ? __schedule+0x25c/0x860
> [  246.752122]  ? dma_fence_default_wait+0x10c/0x280
> [  246.752124]  ? dma_fence_default_wait+0x1c8/0x280
> [  246.752127]  schedule+0x2f/0x90
> [  246.752130]  schedule_timeout+0x1f1/0x440
> [  246.752220]  ? amdgpu_cs_bo_validate+0x7f/0x120 [amdgpu]
> [  246.752276]  ? amdgpu_ttm_alloc_gart+0x5d/0x270 [amdgpu]
> [  246.752284]  ? dma_fence_default_wait+0x10c/0x280
> [  246.752287]  ? dma_fence_default_wait+0x1c8/0x280
> [  246.752290]  dma_fence_default_wait+0x1f4/0x280
> [  246.752294]  ? dma_fence_default_wait+0x280/0x280
> [  246.752297]  dma_fence_wait_timeout+0x2e/0x100
> [  246.752359]  amdgpu_ctx_wait_prev_fence+0x46/0x80 [amdgpu]
> [  246.752418]  amdgpu_cs_ioctl+0x1f2/0x1af0 [amdgpu]
> [  246.752478]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
> [  246.752504]  drm_ioctl_kernel+0x59/0xb0 [drm]
> [  246.752524]  drm_ioctl+0x29f/0x340 [drm]
> [  246.752581]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
> [  246.752630]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
> [  246.752635]  do_vfs_ioctl+0x8e/0x680
> [  246.752641]  ? SyS_futex+0x11d/0x150
> [  246.752644]  SyS_ioctl+0x74/0x80
> [  246.752647]  ? get_vtime_delta+0xe/0x40
> [  246.752650]  do_syscall_64+0x7b/0x1d0
> [  246.752654]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [  246.752658] RIP: 0033:0x3e292e57e7
> [  246.752660] RSP: 002b:00007f17a2924b88 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [  246.752663] RAX: ffffffffffffffda RBX: 00007f17a2924c78 RCX:
> 0000003e292e57e7
> [  246.752664] RDX: 00007f17a2924bf0 RSI: 00000000c0186444 RDI:
> 000000000000000d
> [  246.752666] RBP: 00007f17a2924bf0 R08: 00007f17a2924ca0 R09:
> 00007f17a2924c78
> [  246.752667] R10: 00007f17a2924ca0 R11: 0000000000000246 R12:
> 00000000c0186444
> [  246.752668] R13: 000000000000000d R14: 0000000000d2c588 R15:
> 0000000000000002
> 
> 
> kmscube git HEAD have not stalled after 50 attempts without any flag in
> GALLIUM_DDEBUG or R600_DEBUG.
> 
> It might not be relevant, but my platform can only boot with BIOS (it is
> based on coreboot). When you tried this bug did you tried with UEFI or BIOS?
> 
> Thanks

I think my BIOS is set to UEFI but it can't have any relation to this issue.

Andrey
Comment 61 Ricardo Ribalda 2018-03-20 08:48:15 UTC
Hi Andrey

You are quieter than you used to be ;). Do you need anything else from my side?
Comment 62 Andrey Grodzovsky 2018-03-20 11:58:58 UTC
Hi, sorry, I was a bit busy with other stuff.

Can you please provide one more bit of info by reading some debug data from the color buffer (CB)?

Reproduce the hang and then run the following with UMR:

sudo umr -O bits,many --bank 0 0 0 -r *.*.mmCB_DEBUG
sudo umr -O bits,many --bank 0 0 1 -r *.*.mmCB_DEBUG 

Andrey
Comment 63 Ricardo Ribalda 2018-03-20 15:10:13 UTC
Here you are

gfx80.mmCB_DEBUG_BUS_1 => 0x00008801
gfx80.mmCB_DEBUG_BUS_2 => 0x00000000
gfx80.mmCB_DEBUG_BUS_3 => 0x00000000
gfx80.mmCB_DEBUG_BUS_4 => 0x00000000
gfx80.mmCB_DEBUG_BUS_5 => 0x00000000
gfx80.mmCB_DEBUG_BUS_6 => 0x00000000
gfx80.mmCB_DEBUG_BUS_7 => 0x00000000
gfx80.mmCB_DEBUG_BUS_8 => 0x00000000
gfx80.mmCB_DEBUG_BUS_9 => 0x00000000
gfx80.mmCB_DEBUG_BUS_10 => 0x0000e000
gfx80.mmCB_DEBUG_BUS_11 => 0x00000000
gfx80.mmCB_DEBUG_BUS_12 => 0x00000000
gfx80.mmCB_DEBUG_BUS_13 => 0x00000000
gfx80.mmCB_DEBUG_BUS_14 => 0x00000000
gfx80.mmCB_DEBUG_BUS_15 => 0x00000000
gfx80.mmCB_DEBUG_BUS_16 => 0x00000000
gfx80.mmCB_DEBUG_BUS_17 => 0x00000804
	.TILE_INTFC_BUSY[0:0]                                            ==        0 (0x00000000)
	.MU_BUSY[1:1]                                                    ==        0 (0x00000000)
	.TQ_BUSY[2:2]                                                    ==        1 (0x00000001)
	.AC_BUSY[3:3]                                                    ==        0 (0x00000000)
	.CRW_BUSY[4:4]                                                   ==        0 (0x00000000)
	.CACHE_CTRL_BUSY[5:5]                                            ==        0 (0x00000000)
	.MC_WR_PENDING[6:6]                                              ==        0 (0x00000000)
	.FC_WR_PENDING[7:7]                                              ==        0 (0x00000000)
	.FC_RD_PENDING[8:8]                                              ==        0 (0x00000000)
	.EVICT_PENDING[9:9]                                              ==        0 (0x00000000)
	.LAST_RD_ARB_WINNER[10:10]                                       ==        0 (0x00000000)
	.MU_STATE[11:18]                                                 ==        1 (0x00000001)
gfx80.mmCB_DEBUG_BUS_18 => 0x00000100
	.TILE_RETIREMENT_BUSY[0:0]                                       ==        0 (0x00000000)
	.FOP_BUSY[1:1]                                                   ==        0 (0x00000000)
	.CLEAR_BUSY[2:2]                                                 ==        0 (0x00000000)
	.LAT_BUSY[3:3]                                                   ==        0 (0x00000000)
	.CACHE_CTL_BUSY[4:4]                                             ==        0 (0x00000000)
	.ADDR_BUSY[5:5]                                                  ==        0 (0x00000000)
	.MERGE_BUSY[6:6]                                                 ==        0 (0x00000000)
	.QUAD_BUSY[7:7]                                                  ==        0 (0x00000000)
	.TILE_BUSY[8:8]                                                  ==        1 (0x00000001)
	.DCC_BUSY[9:9]                                                   ==        0 (0x00000000)
	.DOC_BUSY[10:10]                                                 ==        0 (0x00000000)
	.DAG_BUSY[11:11]                                                 ==        0 (0x00000000)
	.DOC_STALL[12:12]                                                ==        0 (0x00000000)
	.DOC_QT_CAM_FULL[13:13]                                          ==        0 (0x00000000)
	.DOC_CL_CAM_FULL[14:14]                                          ==        0 (0x00000000)
	.DOC_QUAD_PTR_FIFO_FULL[15:15]                                   ==        0 (0x00000000)
	.DOC_SECTOR_MASK_FIFO_FULL[16:16]                                ==        0 (0x00000000)
	.DCS_READ_WINNER_LAST[17:17]                                     ==        0 (0x00000000)
	.DCS_READ_EV_PENDING[18:18]                                      ==        0 (0x00000000)
	.DCS_WRITE_CC_PENDING[19:19]                                     ==        0 (0x00000000)
	.DCS_READ_CC_PENDING[20:20]                                      ==        0 (0x00000000)
	.DCS_WRITE_MC_PENDING[21:21]                                     ==        0 (0x00000000)
gfx80.mmCB_DEBUG_BUS_19 => 0x0005c000
	.SURF_SYNC_STATE[0:1]                                            ==        0 (0x00000000)
	.SURF_SYNC_START[2:2]                                            ==        0 (0x00000000)
	.SF_BUSY[3:3]                                                    ==        0 (0x00000000)
	.CS_BUSY[4:4]                                                    ==        0 (0x00000000)
	.RB_BUSY[5:5]                                                    ==        0 (0x00000000)
	.DS_BUSY[6:6]                                                    ==        0 (0x00000000)
	.TB_BUSY[7:7]                                                    ==        0 (0x00000000)
	.IB_BUSY[8:8]                                                    ==        0 (0x00000000)
	.DRR_BUSY[9:9]                                                   ==        0 (0x00000000)
	.DF_BUSY[10:10]                                                  ==        0 (0x00000000)
	.DD_BUSY[11:11]                                                  ==        0 (0x00000000)
	.DC_BUSY[12:12]                                                  ==        0 (0x00000000)
	.DK_BUSY[13:13]                                                  ==        0 (0x00000000)
	.DF_SKID_FIFO_EMPTY[14:14]                                       ==        1 (0x00000001)
	.DF_CLEAR_FIFO_EMPTY[15:15]                                      ==        1 (0x00000001)
	.DD_READY[16:16]                                                 ==        1 (0x00000001)
	.DC_FIFO_FULL[17:17]                                             ==        0 (0x00000000)
	.DC_READY[18:18]                                                 ==        1 (0x00000001)
gfx80.mmCB_DEBUG_BUS_20 => 0x00f00820
	.MC_RDREQ_CREDITS[0:5]                                           ==       32 (0x00000020)
	.MC_WRREQ_CREDITS[6:11]                                          ==       32 (0x00000020)
	.CC_RDREQ_HAD_ITS_TURN[12:12]                                    ==        0 (0x00000000)
	.FC_RDREQ_HAD_ITS_TURN[13:13]                                    ==        0 (0x00000000)
	.CM_RDREQ_HAD_ITS_TURN[14:14]                                    ==        0 (0x00000000)
	.CC_WRREQ_HAD_ITS_TURN[16:16]                                    ==        0 (0x00000000)
	.FC_WRREQ_HAD_ITS_TURN[17:17]                                    ==        0 (0x00000000)
	.CM_WRREQ_HAD_ITS_TURN[18:18]                                    ==        0 (0x00000000)
	.CC_WRREQ_FIFO_EMPTY[20:20]                                      ==        1 (0x00000001)
	.FC_WRREQ_FIFO_EMPTY[21:21]                                      ==        1 (0x00000001)
	.CM_WRREQ_FIFO_EMPTY[22:22]                                      ==        1 (0x00000001)
	.DCC_WRREQ_FIFO_EMPTY[23:23]                                     ==        1 (0x00000001)
gfx80.mmCB_DEBUG_BUS_21 => 0x000000e3
	.CM_BUSY[0:0]                                                    ==        1 (0x00000001)
	.FC_BUSY[1:1]                                                    ==        1 (0x00000001)
	.CC_BUSY[2:2]                                                    ==        0 (0x00000000)
	.BB_BUSY[3:3]                                                    ==        0 (0x00000000)
	.MA_BUSY[4:4]                                                    ==        0 (0x00000000)
	.CORE_SCLK_VLD[5:5]                                              ==        1 (0x00000001)
	.REG_SCLK1_VLD[6:6]                                              ==        1 (0x00000001)
	.REG_SCLK0_VLD[7:7]                                              ==        1 (0x00000001)
gfx80.mmCB_DEBUG_BUS_22 => 0x00000000
	.OUTSTANDING_MC_READS[0:11]                                      ==        0 (0x00000000)
	.OUTSTANDING_MC_WRITES[12:23]                                    ==        0 (0x00000000)
gfx80.mmCB_DEBUG_BUS_1 => 0x00008801
gfx80.mmCB_DEBUG_BUS_2 => 0x00000000
gfx80.mmCB_DEBUG_BUS_3 => 0x00000000
gfx80.mmCB_DEBUG_BUS_4 => 0x00000000
gfx80.mmCB_DEBUG_BUS_5 => 0x00000000
gfx80.mmCB_DEBUG_BUS_6 => 0x00000000
gfx80.mmCB_DEBUG_BUS_7 => 0x00000000
gfx80.mmCB_DEBUG_BUS_8 => 0x00000000
gfx80.mmCB_DEBUG_BUS_9 => 0x00000000
gfx80.mmCB_DEBUG_BUS_10 => 0x0000e000
gfx80.mmCB_DEBUG_BUS_11 => 0x00000000
gfx80.mmCB_DEBUG_BUS_12 => 0x00000000
gfx80.mmCB_DEBUG_BUS_13 => 0x00000000
gfx80.mmCB_DEBUG_BUS_14 => 0x00000000
gfx80.mmCB_DEBUG_BUS_15 => 0x00000000
gfx80.mmCB_DEBUG_BUS_16 => 0x00000000
gfx80.mmCB_DEBUG_BUS_17 => 0x00000804
	.TILE_INTFC_BUSY[0:0]                                            ==        0 (0x00000000)
	.MU_BUSY[1:1]                                                    ==        0 (0x00000000)
	.TQ_BUSY[2:2]                                                    ==        1 (0x00000001)
	.AC_BUSY[3:3]                                                    ==        0 (0x00000000)
	.CRW_BUSY[4:4]                                                   ==        0 (0x00000000)
	.CACHE_CTRL_BUSY[5:5]                                            ==        0 (0x00000000)
	.MC_WR_PENDING[6:6]                                              ==        0 (0x00000000)
	.FC_WR_PENDING[7:7]                                              ==        0 (0x00000000)
	.FC_RD_PENDING[8:8]                                              ==        0 (0x00000000)
	.EVICT_PENDING[9:9]                                              ==        0 (0x00000000)
	.LAST_RD_ARB_WINNER[10:10]                                       ==        0 (0x00000000)
	.MU_STATE[11:18]                                                 ==        1 (0x00000001)
gfx80.mmCB_DEBUG_BUS_18 => 0x00000100
	.TILE_RETIREMENT_BUSY[0:0]                                       ==        0 (0x00000000)
	.FOP_BUSY[1:1]                                                   ==        0 (0x00000000)
	.CLEAR_BUSY[2:2]                                                 ==        0 (0x00000000)
	.LAT_BUSY[3:3]                                                   ==        0 (0x00000000)
	.CACHE_CTL_BUSY[4:4]                                             ==        0 (0x00000000)
	.ADDR_BUSY[5:5]                                                  ==        0 (0x00000000)
	.MERGE_BUSY[6:6]                                                 ==        0 (0x00000000)
	.QUAD_BUSY[7:7]                                                  ==        0 (0x00000000)
	.TILE_BUSY[8:8]                                                  ==        1 (0x00000001)
	.DCC_BUSY[9:9]                                                   ==        0 (0x00000000)
	.DOC_BUSY[10:10]                                                 ==        0 (0x00000000)
	.DAG_BUSY[11:11]                                                 ==        0 (0x00000000)
	.DOC_STALL[12:12]                                                ==        0 (0x00000000)
	.DOC_QT_CAM_FULL[13:13]                                          ==        0 (0x00000000)
	.DOC_CL_CAM_FULL[14:14]                                          ==        0 (0x00000000)
	.DOC_QUAD_PTR_FIFO_FULL[15:15]                                   ==        0 (0x00000000)
	.DOC_SECTOR_MASK_FIFO_FULL[16:16]                                ==        0 (0x00000000)
	.DCS_READ_WINNER_LAST[17:17]                                     ==        0 (0x00000000)
	.DCS_READ_EV_PENDING[18:18]                                      ==        0 (0x00000000)
	.DCS_WRITE_CC_PENDING[19:19]                                     ==        0 (0x00000000)
	.DCS_READ_CC_PENDING[20:20]                                      ==        0 (0x00000000)
	.DCS_WRITE_MC_PENDING[21:21]                                     ==        0 (0x00000000)
gfx80.mmCB_DEBUG_BUS_19 => 0x0005c000
	.SURF_SYNC_STATE[0:1]                                            ==        0 (0x00000000)
	.SURF_SYNC_START[2:2]                                            ==        0 (0x00000000)
	.SF_BUSY[3:3]                                                    ==        0 (0x00000000)
	.CS_BUSY[4:4]                                                    ==        0 (0x00000000)
	.RB_BUSY[5:5]                                                    ==        0 (0x00000000)
	.DS_BUSY[6:6]                                                    ==        0 (0x00000000)
	.TB_BUSY[7:7]                                                    ==        0 (0x00000000)
	.IB_BUSY[8:8]                                                    ==        0 (0x00000000)
	.DRR_BUSY[9:9]                                                   ==        0 (0x00000000)
	.DF_BUSY[10:10]                                                  ==        0 (0x00000000)
	.DD_BUSY[11:11]                                                  ==        0 (0x00000000)
	.DC_BUSY[12:12]                                                  ==        0 (0x00000000)
	.DK_BUSY[13:13]                                                  ==        0 (0x00000000)
	.DF_SKID_FIFO_EMPTY[14:14]                                       ==        1 (0x00000001)
	.DF_CLEAR_FIFO_EMPTY[15:15]                                      ==        1 (0x00000001)
	.DD_READY[16:16]                                                 ==        1 (0x00000001)
	.DC_FIFO_FULL[17:17]                                             ==        0 (0x00000000)
	.DC_READY[18:18]                                                 ==        1 (0x00000001)
gfx80.mmCB_DEBUG_BUS_20 => 0x00f00820
	.MC_RDREQ_CREDITS[0:5]                                           ==       32 (0x00000020)
	.MC_WRREQ_CREDITS[6:11]                                          ==       32 (0x00000020)
	.CC_RDREQ_HAD_ITS_TURN[12:12]                                    ==        0 (0x00000000)
	.FC_RDREQ_HAD_ITS_TURN[13:13]                                    ==        0 (0x00000000)
	.CM_RDREQ_HAD_ITS_TURN[14:14]                                    ==        0 (0x00000000)
	.CC_WRREQ_HAD_ITS_TURN[16:16]                                    ==        0 (0x00000000)
	.FC_WRREQ_HAD_ITS_TURN[17:17]                                    ==        0 (0x00000000)
	.CM_WRREQ_HAD_ITS_TURN[18:18]                                    ==        0 (0x00000000)
	.CC_WRREQ_FIFO_EMPTY[20:20]                                      ==        1 (0x00000001)
	.FC_WRREQ_FIFO_EMPTY[21:21]                                      ==        1 (0x00000001)
	.CM_WRREQ_FIFO_EMPTY[22:22]                                      ==        1 (0x00000001)
	.DCC_WRREQ_FIFO_EMPTY[23:23]                                     ==        1 (0x00000001)
gfx80.mmCB_DEBUG_BUS_21 => 0x000000e3
	.CM_BUSY[0:0]                                                    ==        1 (0x00000001)
	.FC_BUSY[1:1]                                                    ==        1 (0x00000001)
	.CC_BUSY[2:2]                                                    ==        0 (0x00000000)
	.BB_BUSY[3:3]                                                    ==        0 (0x00000000)
	.MA_BUSY[4:4]                                                    ==        0 (0x00000000)
	.CORE_SCLK_VLD[5:5]                                              ==        1 (0x00000001)
	.REG_SCLK1_VLD[6:6]                                              ==        1 (0x00000001)
	.REG_SCLK0_VLD[7:7]                                              ==        1 (0x00000001)
gfx80.mmCB_DEBUG_BUS_22 => 0x00000000
	.OUTSTANDING_MC_READS[0:11]                                      ==        0 (0x00000000)
	.OUTSTANDING_MC_WRITES[12:23]                                    ==        0 (0x00000000)
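
As a cross-check of the decoded fields, the `[lo:hi]` ranges in the dump can be applied to the raw register values by hand. The sketch below assumes the ranges are inclusive with bit 0 as the least significant bit, which matches the values umr printed; the helper name `bitfield` is made up for illustration.

```python
def bitfield(value, lo, hi):
    """Extract bits lo..hi (inclusive, bit 0 = least significant) from a register value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


# Raw values copied from the dump above.
cb_debug_bus_17 = 0x00000804
cb_debug_bus_20 = 0x00F00820

print(bitfield(cb_debug_bus_17, 2, 2))    # TQ_BUSY          -> 1
print(bitfield(cb_debug_bus_17, 11, 18))  # MU_STATE         -> 1
print(bitfield(cb_debug_bus_20, 0, 5))    # MC_RDREQ_CREDITS -> 32
print(bitfield(cb_debug_bus_20, 6, 11))   # MC_WRREQ_CREDITS -> 32
```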


Thanks!
Comment 64 Andrey Grodzovsky 2018-03-20 16:27:06 UTC
Actually, I was told that a similar problem on a Bettong board was related to the IOMMU. Can you check whether the IOMMU is enabled in your BIOS? If it is, disable it and check whether the issue is still present.

In case you are not sure how to check and disable it let me know.

Andrey
Comment 65 Ricardo Ribalda 2018-03-21 08:29:39 UTC
Hi

IOMMU is disabled:

root@qt5122:~# ls /sys/class/iommu
ls: cannot access '/sys/class/iommu': No such file or directory
root@qt5122:~# dmesg | grep -i iommu
root@qt5122:~#
Comment 66 Andrey Grodzovsky 2018-03-21 12:41:06 UTC
(In reply to Ricardo Ribalda from comment #65)
> Hi
> 
> IOMMU is disabled:
> 
> root@qt5122:~# ls /sys/class/iommu
> ls: cannot access '/sys/class/iommu': No such file or directory
> root@qt5122:~# dmesg | grep -i iommu
> root@qt5122:~#

To be sure, is it disabled in BIOS ?
Comment 67 Ricardo Ribalda 2018-03-21 13:45:06 UTC
I have coreboot. So I do not have the typical menu.

I have an FPGA writing to the main memory, and that could not happen if iommu is enabled without extra configuration.

 If you can tell me which register to look for we can be 100% sure.
Comment 68 Andrey Grodzovsky 2018-03-21 14:23:53 UTC
(In reply to Ricardo Ribalda from comment #67)
> I have coreboot. So I do not have the typical menu.
> 
> I have an FPGA writing to the main memory, and that could not happen if
> iommu is enabled without extra configuration.
> 
>  If you can tell me which register to look for we can be 100% sure.

I took a Bettong board here in the office and booted it while pressing ESC, which enters the BIOS; from there I chose Chipset->GFX Configuration->IOMMU En/Dis.

But I guess you mean you have coreboot INSTEAD of a standard BIOS like I have.

So can you at least confirm that you have the following disabled, as described in this link:
https://wiki.gentoo.org/wiki/IOMMU_SWIOTLB

Is IOMMU Hardware Support disabled in make menuconfig, and do you also boot the kernel with iommu=off?

If so, I will assume you do have the IOMMU disabled, and I will try again later to reproduce your issue with the Bettong board I just found, because before I was reproducing with a different CZ board.

Andrey
Comment 69 Ricardo Ribalda 2018-03-21 14:33:08 UTC
This is my kernel configuration:


ricardo@neopili:~/curro/kernel-upstream$ cat .config | grep -i IOMMU
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_SUPPORT=y
# Generic IOMMU Pagetable Support
# CONFIG_AMD_IOMMU is not set
# CONFIG_INTEL_IOMMU is not set
# CONFIG_IOMMU_DEBUG is not set


Tomorrow I will try booting with iommu=off, but my understanding was that if /sys/class/iommu was missing, the IOMMU was disabled.

Thanks!
Comment 70 Andrey Grodzovsky 2018-03-21 14:33:38 UTC
Just another wild guess: try modprobing amdgpu with power and clock gating disabled to see if it makes a difference.

sudo modprobe amdgpu cg_mask=0 pg_mask=0
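
Whether the options actually took effect can be checked after loading, since amdgpu exposes its module parameters read-only under /sys/module/amdgpu/parameters. A minimal sketch follows; `gating_disabled` and its optional root-directory argument are hypothetical names for this example. Note that options passed to modprobe are only applied if the module is not already loaded (for instance from an initramfs); the kernel command-line form amdgpu.cg_mask=0 amdgpu.pg_mask=0 is the alternative in that case.

```shell
#!/bin/sh
# gating_disabled [ROOT]: succeed only if both cg_mask and pg_mask read as 0.
# ROOT defaults to the live system and exists only to allow a dry run
# against a fake directory tree.
gating_disabled() {
    params="${1:-}/sys/module/amdgpu/parameters"
    for name in cg_mask pg_mask; do
        # Each file holds the value the driver was loaded with, as a decimal number.
        value="$(cat "$params/$name" 2>/dev/null)" || return 1
        [ "$value" = "0" ] || return 1
    done
    return 0
}

if gating_disabled; then
    echo "clock and power gating masks are both 0"
else
    echo "gating masks are not both 0 (or amdgpu is not loaded)"
fi
```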
Comment 71 Andrey Grodzovsky 2018-03-21 14:34:09 UTC
(In reply to Ricardo Ribalda from comment #69)
> This is my kernel configuration:
> 
> 
> ricardo@neopili:~/curro/kernel-upstream$ cat .config | grep -i IOMMU
> CONFIG_GART_IOMMU=y
> # CONFIG_CALGARY_IOMMU is not set
> CONFIG_IOMMU_HELPER=y
> CONFIG_IOMMU_SUPPORT=y
> # Generic IOMMU Pagetable Support
> # CONFIG_AMD_IOMMU is not set
> # CONFIG_INTEL_IOMMU is not set
> # CONFIG_IOMMU_DEBUG is not set
> 
> 
> Tomorrow I will try booting with iommu=off, but my understanding was that if
> /sys/class/iommu was missing, iommu was disabled.

Agreed.

> 
> Thanks!
Comment 72 Ricardo Ribalda 2018-03-22 12:18:43 UTC
With iommu=off I can still see the error after around 3 attempts.

With iommu=off amdgpu.cg_mask=0 amdgpu.pg_mask=0 I could not see the stall after 30 boots.

With amdgpu.cg_mask=0 amdgpu.pg_mask=0 I could get the stall after 19 attempts.
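
Because the stall is intermittent, a run of clean boots is only weak evidence. As a rough sketch, assume every boot stalls independently with probability 1/k (an assumption; nothing in this report establishes it). Then n clean boots in a row happen by chance with probability (1 - 1/k)^n. The helper names below are hypothetical.

```shell
#!/bin/sh
# clean_run_probability K N: chance of N stall-free boots in a row when one
# boot in K stalls on average (boots assumed independent).
clean_run_probability() {
    awk -v k="$1" -v n="$2" 'BEGIN { printf "%.2f\n", (1 - 1 / k) ^ n }'
}

# boots_needed K ALPHA: smallest N for which that chance drops to ALPHA or below.
boots_needed() {
    awk -v k="$1" -v a="$2" 'BEGIN {
        n = log(a) / log(1 - 1 / k)
        printf "%d\n", (n > int(n)) ? int(n) + 1 : int(n)
    }'
}

clean_run_probability 19 30   # one stall per 19 boots, 30 clean boots observed
boots_needed 19 0.05          # clean boots needed before the chance falls to 5%
```

Under that assumption, with a stall roughly once in 19 boots, 30 clean boots would still occur by chance about one time in five, so a configuration that survived 30 boots is not necessarily fixed.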
Comment 73 Andrey Grodzovsky 2018-03-22 13:27:31 UTC
Created attachment 274867 [details]
Exclude HIQ patch

So if disabling the IOMMU does make a difference, can you please apply this short patch to your kernel and see if it helps?
Comment 74 Ricardo Ribalda 2018-03-22 13:47:44 UTC
Hi Andrey

The patch did not do the trick :(.

Shall I also try with the patch and amdgpu.cg_mask=0 amdgpu.pg_mask=0?
Comment 75 Andrey Grodzovsky 2018-03-22 13:48:22 UTC
(In reply to Ricardo Ribalda from comment #74)
> Hi Andrey
> 
> The patch did not do the trick :(.
> 
> Shall I try also with the patch and amdgpu.cg_mask=0 amdgpu.pg_mask=0

Yes.
Comment 76 Ricardo Ribalda 2018-03-22 14:02:35 UTC
I was testing with GALLIUM_DDEBUG=flush

When I removed it:


amdgpu.cg_mask=0 amdgpu.pg_mask=0 : Stalls

iommu=off amdgpu.cg_mask=0 amdgpu.pg_mask=0  : Also stalls


:(
Comment 77 Andrey Grodzovsky 2018-03-22 14:43:57 UTC
(In reply to Ricardo Ribalda from comment #76)
> I was testing with GALLIUM_DDEBUG=flush
> 
> When I removed it:
> 
> 
> amdgpu.cg_mask=0 amdgpu.pg_mask=0 : Stalls
> 
> iommu=off amdgpu.cg_mask=0 amdgpu.pg_mask=0  : Also stalls
> 
> 
> :(

I see, OK. Let me try to find time to reproduce it here. It might take a bit, since I am working on something else at the moment.

Andrey
Comment 78 Jerome Oufella 2018-04-08 22:30:22 UTC
Hello Ricardo,

Did you have a chance to work around this issue? I think I'm seeing the same
one on the same hardware.

Jerome
Comment 79 Ricardo Ribalda 2018-04-09 06:39:48 UTC
Hi Jerome

Not really, I am waiting for some feedback from Andrey. 


With the env. variable:

GALLIUM_DDEBUG=flush

it works better, but still not perfect. 

Andrey, did you manage to replicate the issue? I can send you a root file system.


Regards!
Comment 80 Andrey Grodzovsky 2018-04-09 13:18:46 UTC
(In reply to Ricardo Ribalda from comment #79)
> Hi Jerome
> 
> Not really, I am waiting for some feedback from Andrey. 
> 
> 
> With the env. variable:
> 
> GALLIUM_DDEBUG=flush
> 
> it works better, but still not perfect. 
> 
> Andrey, did you manage to replicate the issue? I can send you a root file
> system.
> 
> 
> Regards!

Unfortunately, I have been sidetracked by higher-priority issues. I will definitely get to this; it's on my TODO list.
Comment 81 Andrey Grodzovsky 2018-04-10 14:46:17 UTC
It seems I have reproduced it with the Bettong board. I am trying to debug it further and am involving Mesa/LLVM people to take a look.
Comment 82 Ricardo Ribalda 2018-04-10 14:49:22 UTC
Hi Andrey.

Excellent news, thanks for your effort on this.
Comment 83 Andrey Grodzovsky 2018-04-16 14:04:14 UTC
Created attachment 275403 [details]
set COMPUTE_PGM_RSRC1 for SGPR/VGPR clearing

Please test this patch by Nicolai, who did a great job analyzing the issue. For me it fixes the hang.

Andrey
Comment 84 Ricardo Ribalda 2018-04-16 14:56:02 UTC
Hi Andrey

I have removed the GALLIUM_DDEBUG=flush hack and performed around 10 boots. I have not been able to replicate the bug.

I am porting the patch to my current distro and will run the test there for a couple of hours (hopefully tomorrow). But it looks pretty good!

Kudos!
Comment 85 Andrey Grodzovsky 2018-04-16 15:13:42 UTC
(In reply to Ricardo Ribalda from comment #84)
> Hi Andrey
> 
> I have removed the GALLIUM_DDEBUG=flush  hack and performed around 10 boots.
> I have not been able to replicate the bug.
> 
> I am porting the patch to my current distro and will run the test there for
> a couple of hours (hopefully tomorrow). But it looks pretty good!
> 
> Kudos!

Ping me once it's done.

Thanks.
Comment 86 Andrey Grodzovsky 2018-04-20 12:19:41 UTC
(In reply to Andrey Grodzovsky from comment #85)
> (In reply to Ricardo Ribalda from comment #84)
> > Hi Andrey
> > 
> > I have removed the GALLIUM_DDEBUG=flush  hack and performed around 10
> > boots.
> > I have not been able to replicate the bug.
> > 
> > I am porting the patch to my current distro and will run the test there for
> > a couple of hours (hopefully tomorrow). But it looks pretty good!
> > 
> > Kudos!
> 
> Ping me once it's done.
> 
> Thanks.

Any updates?

Andrey
Comment 87 Ricardo Ribalda 2018-04-20 12:22:09 UTC
Hi Andrey

I have been running some manual tests (~30 boots) and it has always been OK. I did not have time to set up an automated test machine.

You are more than welcome to close the ticket and I will update it with the results later.

Thanks for your help!
Comment 88 Andrey Grodzovsky 2018-04-20 12:23:06 UTC
(In reply to Ricardo Ribalda from comment #87)
> Hi Andrey
> 
> I have been running some manual tests (~30 boots) and it has always been OK.
> I did not have time to set up an automated test machine.
> 
> You are more than welcome to close the ticket and I will update it with the
> results later.
> 
> Thanks for your help!

Great, thanks.

Andrey
