Bug 196291

Summary: amdgpu: Freeze because of syscall not returning
Product: Drivers
Reporter: Tobias Auerochs (tobi291019)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED INVALID
Severity: normal
CC: christian.koenig
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.11.8
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  dmesg with lockup warning at the end
  /sys/kernel/debug/dri/0/amdgpu_fence_info after being frozen for a few minutes

Description Tobias Auerochs 2017-07-07 18:01:25 UTC
Created attachment 257397 [details]
dmesg with lockup warning at the end

An amdgpu syscall, called by plasmashell, appears to deadlock randomly and freeze X.org completely. Several graphics processes, plasmashell, and X.org are left stuck in D state (uninterruptible sleep). Everything else continues to operate correctly, including audio, networking, etc.
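
For anyone trying to confirm the same symptom, here is a minimal sketch of how the stuck tasks could be listed from another terminal or over SSH. It relies only on the standard /proc/<pid>/stat layout (PID, command name in parentheses, then the state letter); the function names `parse_stat` and `find_d_state` are made up for this example and are not part of any tool mentioned in this report.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx].strip())
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, comm) for every process in uninterruptible sleep ('D')."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck  # no procfs available (not Linux, or not mounted)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry was unreadable mid-scan
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling `find_d_state()` during a freeze should return entries for the hung graphics processes if the diagnosis above is right. A task that is only briefly in D state (for example, waiting on disk I/O) will also show up, so it is worth sampling a few times.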

The issue seems to appear more frequently whilst running games, although I am unable to find any particular pattern to it.

I am running Arch Linux with a custom-compiled linux-zen kernel (with ACS override patches) and ZFS, although as far as I can tell those are not related to the issue, plus Mesa 17.1.4 on a Radeon RX 480. The issue has been around for a while and I sadly do not remember when it first occurred, but the entire 4.11.x series is definitely affected and I am fairly sure 4.10.x was as well. The issue is too rare for me to bisect the exact cause, however.
Comment 1 Christian König 2017-07-07 18:10:46 UTC
Please provide the output of "cat /sys/kernel/debug/dri/0/amdgpu_fence_info" when this happens.
Comment 2 Tobias Auerochs 2017-07-10 18:34:20 UTC
Created attachment 257449 [details]
/sys/kernel/debug/dri/0/amdgpu_fence_info after being frozen for a few minutes

Got the freeze again at random; attached is the output from /sys/kernel/debug/dri/0/amdgpu_fence_info.
Comment 3 Christian König 2017-07-10 18:50:54 UTC
That isn't related to any system call. The problem is simply that the hardware has crashed and some task is trying to push new commands to it, waiting for previous commands to end (which never happens).

That is most likely a problem on the user space driver side and not related to the kernel at all.

Please open a bug report on FDO for this.
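
The diagnosis above (commands were emitted but never complete) can be checked mechanically against the amdgpu_fence_info dump. The sketch below assumes the dump consists of per-ring blocks with a `--- ring N (name) ---` header followed by `Last signaled fence 0x...` and `Last emitted 0x...` lines. That layout is an assumption about the debugfs output and may differ between kernel versions. The `SAMPLE` text uses made-up values, not the contents of the attachment, and `pending_rings` is a hypothetical helper name.

```python
import re

# Assumed line formats of amdgpu_fence_info; adjust if your kernel differs.
RING_RE = re.compile(r"^--- ring (\d+) \((.+?)\) ---$")
FENCE_RE = re.compile(r"^Last (signaled fence|emitted)\s+0x([0-9a-fA-F]+)$")

# Illustrative input only: invented sequence numbers, not real data.
SAMPLE = """\
--- ring 0 (gfx) ---
Last signaled fence 0x0000a1f0
Last emitted        0x0000a1f3

--- ring 1 (sdma0) ---
Last signaled fence 0x00000010
Last emitted        0x00000010
"""


def pending_rings(text):
    """Return {ring_name: (signaled, emitted)} for rings with unsignaled fences."""
    rings = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = RING_RE.match(line)
        if m:
            current = m.group(2)
            rings[current] = {}
            continue
        m = FENCE_RE.match(line)
        if m and current is not None:
            key = "signaled" if m.group(1).startswith("signaled") else "emitted"
            rings[current][key] = int(m.group(2), 16)
    # Keep only rings where work was emitted that has not been signaled yet.
    return {
        name: (v["signaled"], v["emitted"])
        for name, v in rings.items()
        if "signaled" in v and "emitted" in v and v["signaled"] != v["emitted"]
    }
```

A healthy GPU under load also shows a gap between emitted and signaled fences for short periods. A hang is only suggested when the same ring reports the same signaled value across several dumps taken seconds apart while the emitted value stays ahead of it.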
Comment 4 Tobias Auerochs 2017-07-10 19:11:01 UTC
Submitted on freedesktop.org bugzilla:
https://bugs.freedesktop.org/show_bug.cgi?id=101746
Comment 5 Tobias Auerochs 2017-09-22 01:15:25 UTC
Well, after encountering a possibly unrelated (reproducible) issue that causes the exact same symptoms, and finding that a GPU reset (in debugfs) seems to recover correctly from it, I think this issue really just comes down to GPU resets not being issued automatically on the kernel side yet.
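
For reference, a manual reset through debugfs could be scripted roughly as follows. This is a sketch under assumptions: the node name `amdgpu_gpu_reset` is what 4.11-era kernels are believed to expose next to amdgpu_fence_info (later kernels may call it `amdgpu_gpu_recover`), reading the node is assumed to be what requests the reset, and root access with debugfs mounted is required. The helper name `trigger_reset` is hypothetical.

```python
from pathlib import Path

# Assumption: node name on 4.11-era kernels; check your own debugfs tree,
# as later kernels may expose amdgpu_gpu_recover instead.
DEFAULT_RESET_FILE = Path("/sys/kernel/debug/dri/0/amdgpu_gpu_reset")


def trigger_reset(reset_file=DEFAULT_RESET_FILE):
    """Read the debugfs node and return whatever text the kernel reports.

    On amdgpu the read itself is assumed to request the GPU reset, so
    calling this on a real system has side effects.
    """
    return Path(reset_file).read_text()
```

Since a reset interrupts everything running on the GPU, this is best run from a text console or an SSH session, and only once the fence dump has been saved for the bug report.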