Bug 199357

Summary: amdgpu: hang a few seconds after logging in, most likely due to regression
Product: Drivers
Reporter: Mathias Tillman (master.homer)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: high
CC: alexdeucher, christian.koenig, harry.wentland
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: v4.16
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
Kernel log of the hang/crash
Hardware info
Kernel log with added logging

Description Mathias Tillman 2018-04-11 12:26:54 UTC
Created attachment 275291
Kernel log of the hang/crash

I've been testing kernel v4.16 on my computer, but it's basically unusable: a few seconds after logging in it soft-locks, and I can't even switch to a VT. I was, however, able to ssh into it, which is how I got the kernel log. Right as the hang happened, I can see this in the log:
Apr 11 14:04:13 homer-desktop kernel: [   45.532038] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:45:crtc-1] flip_done timed out
Apr 11 14:04:23 homer-desktop kernel: [   55.772028] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:37:plane-1] flip_done timed out
Apr 11 14:04:33 homer-desktop kernel: [   66.012282] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:44:plane-7] flip_done timed out

and after that is the regular kernel crash.
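
The timeout messages above can be pulled out of a longer log and compared by timestamp. The following is only a minimal sketch, assuming the syslog line format shown in this report; the helper names `flip_timeouts` and `gaps` are made up for illustration and are not part of any kernel tooling.

```python
import re

# Matches syslog lines of the form quoted in this report, e.g.
# "... kernel: [   45.532038] [drm:...] *ERROR* [CRTC:45:crtc-1] flip_done timed out"
FLIP_TIMEOUT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s.*\*ERROR\*\s+"
    r"\[(?P<obj>[A-Z]+:\d+:[^\]]+)\]\s+flip_done timed out"
)

def flip_timeouts(lines):
    """Return (timestamp_seconds, object) for each 'flip_done timed out' line."""
    events = []
    for line in lines:
        match = FLIP_TIMEOUT.search(line)
        if match:
            events.append((float(match.group("ts")), match.group("obj")))
    return events

def gaps(events):
    """Seconds between consecutive timeout events, rounded to milliseconds."""
    return [round(later[0] - earlier[0], 3)
            for earlier, later in zip(events, events[1:])]

# The three lines quoted in the description.
SAMPLE = [
    "Apr 11 14:04:13 homer-desktop kernel: [   45.532038] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:45:crtc-1] flip_done timed out",
    "Apr 11 14:04:23 homer-desktop kernel: [   55.772028] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:37:plane-1] flip_done timed out",
    "Apr 11 14:04:33 homer-desktop kernel: [   66.012282] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:44:plane-7] flip_done timed out",
]

if __name__ == "__main__":
    events = flip_timeouts(SAMPLE)
    for ts, obj in events:
        print(f"{ts:10.3f}s  {obj}")
    print("gaps (s):", gaps(events))
```

For the three quoted lines the gaps come out at roughly 10 seconds each, which would be consistent with one wait timing out after another rather than three independent failures.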

I have tried this on both v4.16 and v4.16.1 with the same results. However, it doesn't happen on v4.15 (which is what I'm running now), so there must be a regression between those releases.

I am running stable KDE neon (which is based on Ubuntu LTS) with precompiled kernels from the Ubuntu mainline PPA.
Comment 1 Mathias Tillman 2018-04-11 12:27:25 UTC
Created attachment 275293
Hardware info
Comment 2 Christian König 2018-04-11 13:37:48 UTC
Looks like an issue with DC to me.

Can you bisect?
Comment 3 Mathias Tillman 2018-04-11 17:37:15 UTC
I've just finished bisecting, and the result points to commit 36cc549d59864b7161f0e23d710c1c4d1b9cf022 (drm/amd/display: disable CRTCs with NULL FB on their primary plane (V2)) as the cause of the lock-up.
Let me know if you need anything else.
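
For readers unfamiliar with bisecting, the idea is a binary search between a known-good and a known-bad revision. The sketch below is only an illustration under simplifying assumptions: it treats history as a straight line (real git history is a graph), and every entry in `HISTORY` except the culprit's abbreviated hash is a made-up placeholder.

```python
def bisect_first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and the bug persists once introduced.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # bug already present: culprit is at mid or earlier
        else:
            lo = mid  # still good: culprit is after mid
    return commits[hi]

# Hypothetical linear history; "c1", "c2", "c4" and "c5" are placeholders.
HISTORY = ["v4.15", "c1", "c2", "36cc549d5986", "c4", "c5", "v4.16"]
CULPRIT_INDEX = HISTORY.index("36cc549d5986")

def hangs_after_login(commit):
    """Stand-in for building, booting and testing a kernel at this commit."""
    return HISTORY.index(commit) >= CULPRIT_INDEX

if __name__ == "__main__":
    print(bisect_first_bad(HISTORY, hangs_after_login))
```

Each test halves the remaining range, so the number of kernels to build and boot grows only logarithmically with the number of commits between the two releases.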
Comment 4 Christian König 2018-04-11 18:35:44 UTC
Thanks, yeah, that is clearly DC (display core).

Harry, can you take a look?
Comment 5 Harry Wentland 2018-04-11 20:36:02 UTC
Right now I have no idea why this causes "flip_done timed out" and locks the system, but we're also dealing with more fallout from that change, in particular a blinking/flickering display if redshift/nightlight is on. I'm reluctant to just revert the offending commit as it's not incorrect but seems to expose some other flaws in our atomic check/commit implementation.
Comment 6 Michel Dänzer 2018-04-12 08:06:44 UTC
(In reply to Harry Wentland from comment #5)
> I'm reluctant to just revert the offending commit as it's not incorrect
> but seems to expose some other flaws in our atomic check/commit
> implementation.

Since this commit introduced multiple issues, unless a fix is at least on the horizon, it would be nice for our users to revert it for the time being and re-apply it when it's safe.
Comment 7 Mathias Tillman 2018-04-12 11:41:35 UTC
I wanted to add some more info. The soft lockup releases after approximately 30 seconds, but a few seconds later the system locks up again, and the cycle repeats.
Looking at the kernel log, it seems that when the lockup happens, it takes an abnormally long time to reach the dm_pflip_high_irq function, which is supposed to trigger the flip_done message.
I've attached a new log with my added logging in case that helps.
Comment 8 Mathias Tillman 2018-04-12 11:42:23 UTC
Created attachment 275337
Kernel log with added logging
Comment 9 Mathias Tillman 2018-04-13 10:16:59 UTC
I just saw that this has been reverted in git, so I will mark this as resolved.
Comment 10 Mathias Tillman 2018-04-17 07:03:15 UTC
Since that commit was pushed to v4.16, shouldn't it also be reverted on linux-stable so that the revert makes it into a future 4.16.y release?
Comment 11 Alex Deucher 2018-04-17 14:59:25 UTC
Yes, the revert was cc'ed to stable, so it will show up in 4.16 as well.