Bug 151341 - AMDGPU Hawaii: screen freeze, Xorg blocked in fence_default_wait
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-08-03 06:32 UTC by yshuiv7
Modified: 2018-02-22 10:48 UTC

See Also:
Kernel Version: 4.7.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (90.16 KB, application/octet-stream), 2016-08-06 23:45 UTC, yshuiv7
Xorg.log (86.57 KB, application/x-trash), 2016-08-06 23:46 UTC, yshuiv7
Kernel Oops when resetting the GPU (85.90 KB, application/octet-stream), 2016-08-08 06:05 UTC, yshuiv7

Description yshuiv7 2016-08-03 06:32:38 UTC
The screen freezes and Xorg hangs, but the rest of the system is still operational: I was able to log in remotely via ssh.

The freeze seems to happen more often when the CPU is under load.

Relevant dmesg:

INFO: task Xorg:539 blocked for more than 120 seconds.
      Not tainted 4.7.0 #5
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Xorg            D ffff880085387a48     0   539    537 0x00080084
 ffff880085387a48 ffff880392f1ea40 0000000000000000 ffff880085388000
 ffff88045e658000 ffff88045e658000 ffff880062516840 7fffffffffffffff
 ffff880085387a60 ffffffff815e5dd0 0000000000000000 ffff880085387ad8
Call Trace:
 [<ffffffff815e5dd0>] schedule+0x30/0x80
 [<ffffffff815e8429>] schedule_timeout+0x159/0x1a0
 [<ffffffff810a956e>] ? __wake_up_locked+0xe/0x10
 [<ffffffff8144b452>] fence_default_wait+0x182/0x1d0
 [<ffffffff8144adf0>] ? fence_free+0x20/0x20
 [<ffffffff8144af9e>] fence_wait_timeout+0x4e/0x90
 [<ffffffffc02d89dd>] amdgpu_ctx_add_fence+0x5d/0x100 [amdgpu]
 [<ffffffffc02cad00>] amdgpu_cs_ioctl+0xbf0/0xff0 [amdgpu]
 [<ffffffff813e978d>] drm_ioctl+0x14d/0x530
 [<ffffffffc02ca110>] ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
 [<ffffffff81171b6a>] ? do_readv_writev+0x11a/0x200
 [<ffffffffc02b2047>] amdgpu_drm_ioctl+0x47/0x80 [amdgpu]
 [<ffffffff81182ebd>] do_vfs_ioctl+0x8d/0x570
 [<ffffffff81001255>] ? syscall_slow_exit_work+0xb5/0x100
 [<ffffffff8153ea5d>] ? __sys_recvmsg+0x5d/0x70
 [<ffffffff810fbec6>] ? __audit_syscall_entry+0xa6/0xf0
 [<ffffffff811833dc>] SyS_ioctl+0x3c/0x70
 [<ffffffff8100179a>] do_syscall_64+0x5a/0x110
 [<ffffffff815e9040>] entry_SYSCALL64_slow_path+0x25/0x25
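
For anyone triaging similar hung-task reports, here is a minimal Python sketch that pulls the symbol names out of a call trace like the one above and separates the reliable frames from the "?" frames, which the kernel marks as unreliable stack guesses. The names `FRAME_RE`, `parse_call_trace` and `SAMPLE` are made up for this illustration, and the regular expression only targets the 4.7-era frame format shown in this report.

```python
import re

# One frame looks like:
#   [<ffffffffc02d89dd>] amdgpu_ctx_add_fence+0x5d/0x100 [amdgpu]
# A "?" before the symbol marks a frame the unwinder is not sure about.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_call_trace(text):
    """Return one dict per stack frame found in `text`, in trace order."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # header lines, register dumps, etc.
        frames.append({
            "symbol": match.group("symbol"),
            # Frames without a [module] suffix belong to the core kernel.
            "module": match.group("module") or "kernel",
            "reliable": match.group("guess") is None,
        })
    return frames


# A few frames copied from the trace in this report.
SAMPLE = """\
 [<ffffffff815e5dd0>] schedule+0x30/0x80
 [<ffffffff815e8429>] schedule_timeout+0x159/0x1a0
 [<ffffffff810a956e>] ? __wake_up_locked+0xe/0x10
 [<ffffffff8144b452>] fence_default_wait+0x182/0x1d0
 [<ffffffff8144af9e>] fence_wait_timeout+0x4e/0x90
 [<ffffffffc02d89dd>] amdgpu_ctx_add_fence+0x5d/0x100 [amdgpu]
 [<ffffffffc02cad00>] amdgpu_cs_ioctl+0xbf0/0xff0 [amdgpu]
"""

if __name__ == "__main__":
    for frame in parse_call_trace(SAMPLE):
        if frame["reliable"]:
            print(f"{frame['module']:>8}  {frame['symbol']}")
```

Read bottom-up, the reliable frames show the path that matters here: amdgpu_cs_ioctl calls amdgpu_ctx_add_fence, which waits in fence_wait_timeout and fence_default_wait and never wakes up.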
Comment 1 Alex Deucher 2016-08-03 13:20:44 UTC
Please attach your xorg log and dmesg output.
Comment 2 yshuiv7 2016-08-06 23:45:33 UTC
Created attachment 227861
dmesg
Comment 3 yshuiv7 2016-08-06 23:46:48 UTC
Created attachment 227871
Xorg.log
Comment 4 yshuiv7 2016-08-07 00:04:54 UTC
The previous dmesg is from 4.7.0; the attached one is from git HEAD.

Using the _k_ firmware seems to make this happen much less frequently.
Comment 5 yshuiv7 2016-08-08 06:04:31 UTC
I tried to reset the GPU through /sys/kernel/debug/dri/0/amdgpu_gpu_reset, and the result was a NULL pointer dereference in the kernel.

dmesg attached.
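
A rough Python sketch of that reset attempt, for anyone who wants to reproduce it. It assumes that simply reading the debugfs node is what asks the driver to reset the GPU, that debugfs is mounted at /sys/kernel/debug, and that the script runs as root. The function name `trigger_gpu_reset` is made up for this illustration. Note that on the kernel in this report the reset itself ended in an oops, so only try this on a machine you can afford to reboot.

```python
from pathlib import Path

# Path quoted in the comment above; "0" is the DRM minor of the card.
RESET_NODE = Path("/sys/kernel/debug/dri/0/amdgpu_gpu_reset")


def trigger_gpu_reset(node=RESET_NODE):
    """Read the debugfs node and return whatever text the driver prints.

    Assumption: reading the file is what triggers the reset. Needs root.
    """
    node = Path(node)
    if not node.exists():
        raise FileNotFoundError(
            f"{node} not found; is debugfs mounted and amdgpu loaded?"
        )
    return node.read_text()


if __name__ == "__main__":
    try:
        print(trigger_gpu_reset())
    except (FileNotFoundError, PermissionError) as err:
        print(f"could not trigger reset: {err}")
```

Check dmesg right after running it to see whether the reset completed or hit the same NULL pointer dereference.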
Comment 6 yshuiv7 2016-08-08 06:05:19 UTC
Created attachment 227901
Kernel Oops when resetting the GPU
Comment 7 Ricardo Ribalda 2018-02-22 10:48:26 UTC
This seems to be (very) related to https://bugzilla.kernel.org/show_bug.cgi?id=198883