Bug 205169

Summary: AMDGPU driver with Navi card hangs Xorg in fullscreen only.
Product: Drivers Reporter: Dmitri Seletski (drjoms)
Component: Video(DRI - non Intel) Assignee: drivers_video-dri
Status: NEW ---    
Severity: normal CC: aladjev.andrew, alexdeucher, kernelbug5193, pierre-eric.pelloux-prayer, shtetldik, witold.baryluk+kernel
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.4.0-rc2 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg Sat 12 Oct 2019 03:34:43 PM IST
.config file Sat 12 Oct 2019 03:36:01 PM IST
possible fix

Description Dmitri Seletski 2019-10-12 14:24:47 UTC
I have another problem logged with Navi + AMDGPU drivers. It's triggered independently and reliably.
https://bugzilla.kernel.org/show_bug.cgi?id=204725

With that said, starting strictly and specifically with kernel version 5.4.0* I have a new problem.

I successfully load into Xorg. I can start OpenGL and Vulkan games in non-full-screen mode. But once I start them in full screen, input devices hang and the screen freezes. The machine stays responsive over SSH/Ethernet. I can raise skinny elephants.

I tried opening a few games in non-full-screen mode and in full-screen mode, and I reliably hit the bug every time anything with OpenGL goes full screen at the native resolution of the screen.

I noticed the issue is less likely to happen if a program goes full screen at a non-native resolution.

I will attach dmesg and other details as files, and post lsmod and some other things directly as messages if they are short enough.
Comment 1 Dmitri Seletski 2019-10-12 14:35:38 UTC
Created attachment 285479 [details]
dmesg  Sat 12 Oct 2019 03:34:43 PM IST
Comment 2 Dmitri Seletski 2019-10-12 14:37:10 UTC
Created attachment 285481 [details]
.config file Sat 12 Oct 2019 03:36:01 PM IST
Comment 3 Dmitri Seletski 2019-10-12 14:37:52 UTC
Module                  Size  Used by
bridge                147456  0
stp                    16384  1 bridge
llc                    16384  2 bridge,stp
tun                    53248  2
uvcvideo              106496  0
videobuf2_vmalloc      16384  1 uvcvideo
videobuf2_memops       16384  1 videobuf2_vmalloc
videobuf2_v4l2         24576  1 uvcvideo
videodev              204800  2 videobuf2_v4l2,uvcvideo
kvm_amd                86016  0
videobuf2_common       49152  2 videobuf2_v4l2,uvcvideo
joydev                 24576  0
mousedev               24576  0
kvm                   659456  1 kvm_amd
amdgpu               3989504  12
irqbypass              16384  1 kvm
snd_virtuoso           49152  2
snd_oxygen_lib         49152  1 snd_virtuoso
snd_mpu401_uart        16384  1 snd_oxygen_lib
gpu_sched              32768  1 amdgpu
i2c_piix4              24576  0
snd_rawmidi            32768  1 snd_mpu401_uart
ttm                    94208  1 amdgpu
sr_mod                 28672  0
cdrom                  36864  1 sr_mod
k10temp                16384  0
Comment 4 Dmitri Seletski 2019-10-12 14:47:09 UTC
I realised that I had LLVM 10 and 9 on my machine at the same time. I removed LLVM 10 and recompiled Mesa.

uname -a
Linux (none)dimko's Desktop 5.4.0-rc2 #1 SMP PREEMPT Tue Oct 8 19:48:16 IST 2019 x86_64 AMD Ryzen 5 1600 Six-Core Processor AuthenticAMD GNU/Linux

I am on AMD64 Gentoo.

I will test after Mesa is recompiled with LLVM 9 support and report any changes, if any.
Comment 5 Dmitri Seletski 2019-10-12 14:48:15 UTC
Screen resolution: 3440x1440.
Refresh rate: 100 Hz; I also tried 60 Hz. It did not make any difference.
Comment 6 Dmitri Seletski 2019-10-12 16:58:40 UTC
Interesting find: under Xwayland, the same issue doesn't happen!
I won't blame it on Xorg, because under the older kernel, programs with OpenGL and fullscreen work.
Comment 7 Pierre-Eric Pelloux-Prayer 2019-10-13 08:14:48 UTC
(In reply to Dmitri Seletski from comment #0)
> I have another problem logged with Navi + AMDGPU drivers. It's triggered
> independently and reliably.
> https://bugzilla.kernel.org/show_bug.cgi?id=204725
> 
> With that said, starting strictly and specifically with kernel version
> 5.4.0* I have a new problem.
> 

What kernel version were you using before that didn't have the problem?
Comment 8 Dmitri Seletski 2019-10-13 09:43:37 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #7)
> (In reply to Dmitri Seletski from comment #0)
> > I have another problem logged with Navi + AMDGPU drivers. It's triggered
> > independently and reliably.
> > https://bugzilla.kernel.org/show_bug.cgi?id=204725
> > 
> > With that said, starting strictly and specifically with kernel version
> > 5.4.0* I have a new problem.
> > 
> 
> What kernel version were you using before that didn't have the problem?

It was 5.3.* when I could open and use OpenGL and Vulkan apps in full screen without a crash. This is the list of kernels I used from 5.3.*:

ls /boot/ |grep vmlinuz-5.3. 
vmlinuz-5.3.0+
vmlinuz-5.3.0-next-20190920
vmlinuz-5.3.0+.old
vmlinuz-5.3.0-rc6
vmlinuz-5.3.0-rc6+
vmlinuz-5.3.0-rc6+.old
vmlinuz-5.3.0-rc8
vmlinuz-5.3.0-rc8.old
Comment 9 Dmitri Seletski 2019-10-13 09:54:47 UTC
I had a couple of LLVM versions. I removed them all. Now I have version 9.0.0.
dimko@(none)dimko's Desktop ~ $ ls /boot/ |grep vmlinuz-5.3. 

 sys-devel/llvm
      Latest version available: 9.0.0
      Latest version installed: 9.0.0

I have recompiled Mesa with LLVM 9 (it was previously compiled with LLVM 10, which I removed from the system manually).

glxinfo | grep "OpenGL version"
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.3.0-devel (git-1294f01e06)
Comment 10 Pierre-Eric Pelloux-Prayer 2019-10-14 09:27:59 UTC
"git bisect" identifies this commit as the problematic one: 617089d5837a ("drm/amd/display: revert wait in pipelock").

Reverting this commit on top of amd-staging-drm-next seems to work fine.
Comment 11 Dmitri Seletski 2019-10-14 20:04:43 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #10)
> "git bisect" identifies this commit as the problematic one: 617089d5837a
> ("drm/amd/display: revert wait in pipelock").
> 
> Reverting this commit on top of amd-staging-drm-next seems to work fine.

uname -a
Linux (none)dimko's Desktop 5.3.0-rc3+ #3 SMP PREEMPT Mon Oct 14 20:49:02 IST 2019 x86_64 AMD Ryzen 5 1600 Six-Core Processor AuthenticAMD GNU/Linux


git checkout 617089d5837a^

Issue no longer happens

Major downgrade, but no more problem.
Which commit can I use to solve this issue?

Bug 205169 - AMDGPU driver with Navi card hangs Xorg in fullscreen only. (edit) 
https://bugzilla.kernel.org/show_bug.cgi?id=204725

Sorry for taking advantage of you here.
I will try to find the 5.3.0 commit. I am new to all this stuff.
Comment 12 Dmitri Seletski 2019-10-15 20:51:35 UTC
(In reply to Dmitri Seletski from comment #11)
> (In reply to Pierre-Eric Pelloux-Prayer from comment #10)
> > "git bisect" identifies this commit as the problematic one: 617089d5837a
> > ("drm/amd/display: revert wait in pipelock").
> > 
> > Reverting this commit on top of amd-staging-drm-next seems to work fine.
> 
> uname -a
> Linux (none)dimko's Desktop 5.3.0-rc3+ #3 SMP PREEMPT Mon Oct 14 20:49:02
> IST 2019 x86_64 AMD Ryzen 5 1600 Six-Core Processor AuthenticAMD GNU/Linux
> 
> 
> git checkout 617089d5837a^
> 
> Issue no longer happens
> 
> Major downgrade, but no more problem.
> Which commit can I use to solve this issue?
> 
> Bug 205169 - AMDGPU driver with Navi card hangs Xorg in fullscreen only.
> (edit) 
> https://bugzilla.kernel.org/show_bug.cgi?id=204725
> 
> Sorry for taking advantage of you here.
> I will try to find the 5.3.0 commit. I am new to all this stuff.

With regard to that other bug: it has been there since the Navi driver was first introduced.
Comment 13 ArneJ 2019-10-30 06:11:56 UTC
I had a similar issue with Borderlands 2: https://gitlab.freedesktop.org/mesa/mesa/issues/2004


After I reverted the patch mentioned in comment 10, the issue seems to be fixed.
The other hang later seems unrelated (looks like sdma is the problem with that one).
Comment 14 Shmerl 2019-11-14 21:59:12 UTC
Looks like the same issue with Pathfinder: Kingmaker:
https://bugs.freedesktop.org/show_bug.cgi?id=112266
Comment 15 Dmitri Seletski 2019-11-14 22:50:47 UTC
(In reply to ArneJ from comment #13)
> I had a similar issue with Borderlands 2:
> https://gitlab.freedesktop.org/mesa/mesa/issues/2004
> 
> 
> After I reverted the patch mentioned in comment 10, the issue seems to be
> fixed.
> The other hang later seems unrelated (looks like sdma is the problem with
> that one).

In my case it's with ALL games. Please try others and report back.
Comment 16 Dmitri Seletski 2019-11-14 22:51:06 UTC
(In reply to Shmerl from comment #14)
> Looks like the same issue with Pathfinder: Kingmaker:
> https://bugs.freedesktop.org/show_bug.cgi?id=112266

In my case it's with ALL games. Please try others and report back.
Comment 17 Shmerl 2019-11-15 00:06:38 UTC
(In reply to Dmitri Seletski from comment #16)
> (In reply to Shmerl from comment #14)
> > Looks like the same issue with Pathfinder: Kingmaker:
> > https://bugs.freedesktop.org/show_bug.cgi?id=112266
> 
> In my case it's with ALL games. Please try others and report back.

I don't know which games you mean. Some others don't hang for me, such as Ion Fury and The Bard's Tale IV. Yet others, like Hedon, hang with a gfx_0.0.0 timeout, which is not the same as the flip_done timed out hang.

Anyway, I'll try reverting that commit, to check if it helps.
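
To tell the two kinds of hang apart when reading logs over SSH, something like the following sketch could help. The match strings are taken from this thread ("flip_done timed out" and "gfx_0.0.0 timeout"); the exact dmesg wording may differ between kernel versions, and `classify_hang` and `summarize_hangs` are hypothetical helper names.

```shell
# Classify one kernel log line by hang signature.
# Patterns are assumptions based on the strings quoted in this thread.
classify_hang() {
    case $1 in
        *"flip_done timed out"*) echo "flip" ;;
        *gfx_*timeout*)          echo "gfx-ring" ;;
        *)                       echo "other" ;;
    esac
}

# Read dmesg-style text on stdin and print a count per hang signature,
# ignoring lines that match neither pattern.
summarize_hangs() {
    while IFS= read -r line; do
        classify_hang "$line"
    done | grep -v '^other$' | sort | uniq -c
}
```

Typical use would be `dmesg | summarize_hangs` on the affected machine, or feeding it a saved dmesg attachment.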
Comment 18 Shmerl 2019-11-15 01:17:48 UTC
I can confirm that reverting that commit indeed prevents the hang in Pathfinder: Kingmaker!
Comment 19 ArneJ 2019-11-15 06:20:02 UTC
(In reply to Dmitri Seletski from comment #16)
> (In reply to Shmerl from comment #14)
> > Looks like the same issue with Pathfinder: Kingmaker:
> > https://bugs.freedesktop.org/show_bug.cgi?id=112266
> 
> In my case it's with ALL games. Please try others and report back.

I tested many games all over. Many had this issue, some did not. After reverting the aforementioned kernel patch and installing the latest LLVM and Mesa from git, I had no more hangs (around 3-4 weeks without a hang now).
Comment 20 Alex Deucher 2019-11-15 16:03:10 UTC
Created attachment 285935 [details]
possible fix

Does this patch help?
Comment 21 Dmitri Seletski 2019-11-15 19:08:56 UTC
(In reply to Alex Deucher from comment #20)
> Created attachment 285935 [details]
> possible fix
> 
> Does this patch help?

It did not just solve one problem, but two!

First of all, it solved the original issue.
Second, some games were hanging right before quitting.
Xorg was responsive, but the processes did not disappear.

I was blaming it on proprietary code.

Apparently it was the same bug, just a different invocation of it.

Please close this bug report. My problem is now fixed.
Comment 22 Shmerl 2019-11-15 19:11:19 UTC
It fixes Pathfinder: Kingmaker too. But first let the patch be upstreamed, then it's OK to close the bug :)
Comment 23 ArneJ 2019-11-15 19:52:21 UTC
I just let Borderlands 2 run for about one hour in the menu, which without this patch causes a hang in at most 3 minutes.

Consider Borderlands 2 also fixed with this :)
Comment 24 Shmerl 2019-11-25 01:37:23 UTC
Just FYI, 5.4 is out, but the fix hasn't landed yet, so it still needs to be applied manually.
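
One way to check whether a given commit is already contained in the tree being built is sketched below. `has_commit` is a hypothetical helper name; the hash of the fix is not given in this thread, so it has to be supplied by whoever runs it.

```shell
# Succeeds (exit 0) if commit $1 is an ancestor of ref $2 (default: HEAD),
# i.e. if that commit is already part of the given tree.
has_commit() {
    git merge-base --is-ancestor "$1" "${2:-HEAD}"
}
```

For example, `has_commit <fix-commit> v5.4 && echo landed`, run inside a kernel git checkout, prints "landed" only if the commit is part of that release tag.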
Comment 25 Shmerl 2019-11-25 01:39:10 UTC
Also, even with the 100 ms timeout, the flip hang still happens, just very rarely and not in the usual scenarios for me. For example, when playing The Witcher 3 (Wine+dxvk) and minimizing the game window, on rare occasions the flip hang occurs even with the patch. I suppose it has something to do with KWin (though I usually keep compositing disabled in those cases).

So maybe the 100 ms value is not always enough?
Comment 27 aladjev.andrew@gmail.com 2021-08-03 15:03:54 UTC
The kernel driver hangs in production under regular usage. Such issues should be escalated as much as possible: meetings of DCN authors and developers, replacement of core developers, driver refactoring/rewrite, test coverage. But that works only in a commercial environment; open source provides TIMEOUT_FOR_FLIP_PENDING.

1.5 years have passed: TIMEOUT_FOR_FLIP_PENDING is still here and nobody cares, and I am almost sure that nobody will care about it tomorrow.

Thank you.