Bug 208893 - Navi (RX 5700 XT) system appears to hang with more than one display connected
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-08-13 05:33 UTC by Gordon
Modified: 2020-09-13 13:17 UTC

See Also:
Kernel Version: 5.8.1, 5.7.15
Subsystem:
Regression: No
Bisected commit-id:


Attachments
added logfile for system has some AMD GPU errors in it, this is when both displays are connected. (106.93 KB, text/plain)
2020-08-13 05:58 UTC, Gordon
added logfile for when single display is connected - notice no ERROR from AMDGPU anymore! (98.72 KB, text/plain)
2020-08-13 06:01 UTC, Gordon
this is the log from after my patch is applied (98.51 KB, text/plain)
2020-08-13 06:47 UTC, Gordon

Description Gordon 2020-08-13 05:33:21 UTC
I am using Fedora 32 and have tried both the 5.8.1 mainline kernel (with the latest fixes) and the stable Fedora 32 kernel.

If I connect a single monitor to my system, it works fine.

However, when I connect a FreeSync monitor (144 Hz) and another 24-inch monitor, boot takes much longer and the system hangs intermittently. The screen also flickers, but ONLY when the second monitor is connected.

Even the boot process itself is considerably delayed when the second DisplayPort monitor is connected.

It seems to be one of two problems:
- with multiple displays the card is not behaving properly
- the difference in refresh rates is causing a problem / conflict.

I do not know for certain where the problem lies, but even booting is slowing down so I do not believe this is purely a Mesa issue.

Using Wayland or Xorg makes little difference: the problem occurs on both, and even in terminal sessions (terminal output to the display slows down and the system is generally unresponsive).

However, as stated above, rebooting with a single display resolves these problems.
Comment 1 Gordon 2020-08-13 05:58:47 UTC
Created attachment 290867
added logfile for system has some AMD GPU errors in it, this is when both displays are connected.
Comment 2 Gordon 2020-08-13 06:01:05 UTC
Created attachment 290869
added logfile for when single display is connected - notice no ERROR from AMDGPU anymore!
Comment 3 Gordon 2020-08-13 06:02:32 UTC
The diff between the two log files clearly shows the problem:

[   51.925494] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   54.442825] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   56.962050] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   59.475881] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   61.752557] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   64.274971] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   66.797358] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   69.318513] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   70.395781] rfkill: input handler disabled
[   72.035479] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   74.554161] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   77.073499] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   78.755571] snd_hda_intel 0000:0c:00.1: refused to change power state from D0 to D3hot
[   79.593997] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   81.941379] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   84.460616] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   86.980582] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   89.498263] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   91.742602] rfkill: input handler enabled
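
The spacing of these errors can be measured directly from the timestamps. A minimal Python sketch, assuming the log keeps the `[ seconds] message` dmesg format shown above; the function name `smu_error_gaps` is made up for illustration:

```python
import re

# Matches a dmesg line such as "[   51.925494] amdgpu: Msg issuing ..."
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
SMU_ERROR = "Msg issuing pre-check failed"


def smu_error_gaps(log_text):
    """Return the time gaps (seconds) between consecutive SMU pre-check errors."""
    stamps = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and SMU_ERROR in match.group(2):
            stamps.append(float(match.group(1)))
    return [round(b - a, 6) for a, b in zip(stamps, stamps[1:])]


sample = """\
[   51.925494] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   54.442825] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   56.962050] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[   70.395781] rfkill: input handler disabled
"""
gaps = smu_error_gaps(sample)
print(gaps)
```

Across the full excerpt above the errors arrive roughly 2.3 to 2.7 seconds apart, which looks more like a fixed timeout expiring repeatedly than like random failures.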
Comment 4 Gordon 2020-08-13 06:39:13 UTC
Found a fix for the issue. 

My card is a PowerColor Red Devil 5700 XT.


diff --git a/drivers/gpu/drm/amd/powerplay/smu_cmn.c b/drivers/gpu/drm/amd/powerplay/smu_cmn.c
index 5c23c44c33bd..62917c447ddb 100644
--- a/drivers/gpu/drm/amd/powerplay/smu_cmn.c
+++ b/drivers/gpu/drm/amd/powerplay/smu_cmn.c
@@ -87,7 +87,7 @@ static void smu_cmn_read_arg(struct smu_context *smu,
 static int smu_cmn_wait_for_response(struct smu_context *smu)
 {
        struct amdgpu_device *adev = smu->adev;
-       uint32_t cur_value, i, timeout = adev->usec_timeout * 10;
+       uint32_t cur_value, i, timeout = adev->usec_timeout * 20;
 
        for (i = 0; i < timeout; i++) {
                cur_value = RREG32_SOC15_NO_KIQ(MP1, 0, mmMP1_SMN_C2PMSG_90);
Comment 5 Gordon 2020-08-13 06:39:58 UTC
The specific GPU card seems to make a difference here: some appear to need a longer delay, while others just work.

Would be nice to know why.
Comment 6 Gordon 2020-08-13 06:47:28 UTC
Created attachment 290871
this is the log from after my patch is applied
Comment 7 Gordon 2020-08-13 06:47:51 UTC
[   18.764922] snd_hda_intel 0000:0c:00.1: refused to change power state from D3hot to D0
[   18.869351] snd_hda_intel 0000:0c:00.1: CORB reset timeout#2, CORBRP = 65535
[   19.090949] snd_hda_codec_hdmi hdaudioC0D0: Unable to sync register 0x2f0d00. -5
[   19.310079] amdgpu: failed send message: SetHardMinByFreq (28) 	param: 0x0002036b response 0xffffffc2
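
The `response 0xffffffc2` value is easier to read as a signed 32-bit integer. A small Python sketch of that decoding; splitting `param` into a 16-bit clock index and a 16-bit frequency in MHz is an assumption about the message layout, not something the log confirms:

```python
def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def split_param(param):
    """Split a 32-bit param into (high 16 bits, low 16 bits).

    Reading these as (clock index, frequency in MHz) is an assumption.
    """
    return (param >> 16) & 0xFFFF, param & 0xFFFF


response = to_signed32(0xFFFFFFC2)
clock_index, freq = split_param(0x0002036B)
print(response, clock_index, freq)
```

The response decodes to -62, which matches -ETIME (errno 62, "Timer expired") on Linux. That would suggest the driver gave up waiting for the SMU rather than receiving an error code from it.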
Comment 8 Gordon 2020-08-13 06:48:14 UTC
I am not getting any symptoms of the problem now.
Comment 9 Gordon 2020-08-13 06:51:45 UTC
Never mind, I rebooted and the problem came back.
Comment 10 Alex Deucher 2020-08-13 15:34:33 UTC
Do things work correctly if you attach the second monitor after the system is up and running?
Comment 11 Gordon 2020-08-13 17:30:45 UTC
(In reply to Alex Deucher from comment #10)
> Do things work correctly if you attach the second monitor after the system
> is up and running?

Yep, they do indeed.
Comment 12 Gordon 2020-08-13 19:16:50 UTC
I'm not certain but:
- uint32_t cur_value, i, timeout = adev->usec_timeout * 10;

If adev->usec_timeout is the timeout in microseconds, then the * 10 is most likely wrong, because the for loop below it also counts in microseconds, so possibly that is not what was intended. Each iteration takes 1 microsecond, so the loop can run for 10 times the configured timeout. I think this explains the 'hanging', but not why it breaks inconsistently.
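
To put rough numbers on this, here is a Python sketch of the loop's time budget. It models each iteration as a 1-microsecond delay plus an optional register-read cost; the `usec_timeout` value of 100000 and the `read_overhead_us` parameter are illustrative assumptions, not values taken from the logs:

```python
def worst_case_wait_seconds(usec_timeout, multiplier, read_overhead_us=0.0):
    """Estimate how long the polling loop spins before it gives up.

    Each iteration is modelled as a 1 us delay plus an optional
    register-read cost (read_overhead_us, a hypothetical figure).
    """
    iterations = usec_timeout * multiplier
    return iterations * (1.0 + read_overhead_us) / 1_000_000


USEC_TIMEOUT = 100_000  # assumed value of adev->usec_timeout

original = worst_case_wait_seconds(USEC_TIMEOUT, 10)
patched = worst_case_wait_seconds(USEC_TIMEOUT, 20)
print(original, patched)
```

Under these assumptions, each unanswered message blocks for at least one second with `* 10` and two seconds with `* 20`. That is the same order of magnitude as the roughly 2.5 second spacing of the errors in comment 3, and it would mean a larger multiplier only lengthens the stall when the SMU never responds.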

A few other thoughts:
- Perhaps there is a period during init / startup where we are sending more messages.
- I can only think the flaky behaviour is a memory issue, or a problem where the GPU is not being fully reset on boot. It makes zero sense that it works only 'sometimes'.
- When the error occurs, the 'timeout' is 10 times the intended value, which blocks the messages for the other startup events.
- The DisplayPort output being connected, or being probed during init, might be preventing the normal startup messages from being sent to the GPU.

These GPUs have a problem in general with loss of display output, so if there is an electronics issue, maybe we don't have the 'hack' to make it work.

I don't know for sure; I'm new to kernel development. I mainly work on Godot.
Comment 13 Gordon 2020-08-20 13:45:41 UTC
To fix this problem after boot on the RX 5700 XT Red Devil, I had to run the following as root. Can we make a patch so that this is the default?

- echo "high" > /sys/class/drm/card0/device/power_dpm_force_performance_level
Comment 14 Gordon 2020-08-21 14:33:13 UTC
More info:
echo 'auto' has the flickering problem
echo 'low' seems to work fine
echo 'high' seems to work fine

Transitioning between 'high' and 'low' makes the problem appear (light flickering).

'auto' always reproduces the issue.
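
The same sysfs write can be scripted. A minimal Python sketch, assuming the GPU is `card0` as in comment 13 and that the script runs as root; `set_performance_level` is a hypothetical helper, and it only accepts the three levels tested above:

```python
from pathlib import Path

# Path from comment 13; the card index may differ on other systems.
SYSFS_NODE = Path("/sys/class/drm/card0/device/power_dpm_force_performance_level")
TESTED_LEVELS = ("auto", "low", "high")


def set_performance_level(level, node=SYSFS_NODE):
    """Write a DPM performance level to the sysfs node and return what it now reads."""
    if level not in TESTED_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    node = Path(node)
    node.write_text(level + "\n")
    return node.read_text().strip()
```

Calling `set_performance_level("high")` as root should have the same effect as the echo command. The setting does not survive a reboot, so it would have to be reapplied on every boot.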
Comment 15 Alex Deucher 2020-09-01 01:13:59 UTC
Does setting amdgpu.dpm=0 on the kernel command line in grub fix the issue?  If so, remove that and try setting amdgpu.ppfeaturemask=0xffffbffd on the kernel command line in grub.  Does that help?
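
For reference, the suggested mask differs from an all-ones mask in only two bits. A short Python sketch that lists them; the feature names in `ASSUMED_BIT_NAMES` are recalled from the driver's feature mask enum and should be treated as assumptions until checked against the kernel source:

```python
# Bit names are assumptions recalled from the driver's PP_FEATURE_MASK enum;
# verify them against the kernel source before relying on them.
ASSUMED_BIT_NAMES = {
    1: "PP_MCLK_DPM_MASK",
    14: "PP_OVERDRIVE_MASK",
}


def cleared_bits(mask, width=32):
    """Return the positions of the bits that are 0 in a width-bit mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


mask = 0xFFFFBFFD
for bit in cleared_bits(mask):
    print(bit, hex(1 << bit), ASSUMED_BIT_NAMES.get(bit, "unknown"))
```

If the names are right, bit 1 corresponds to memory clock DPM, which would be consistent with the flicker only showing up when the performance level changes in comment 14.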
Comment 16 Gordon 2020-09-12 21:29:16 UTC
Both of those settings solve the problems.

Running on: 5.8.7-200.fc32.x86_64 with those options only.

Performance is almost what I'd expect now too. I can apply this consistently across distros and it seems to work fine.

Tested out:
- GTA: San Andreas
- ZeroK (with avg 40+ fps in games, no stuttering visible)

Can we make these the defaults, or do they have a negative performance impact?
Comment 17 Gordon 2020-09-13 12:50:41 UTC
Update:
- Discord is crashing due to OpenGL problems
- Random video output appears in the bottom-right corner of the screen, most likely a bug in driver-level code
- There is no info about it in dmesg
- A web benchmark shows the card is much slower than it should be

"Your results compared to other users:
You score better than 24% of all users so far!
You score better than 26% of the people who use the same browser and OS!"

Going to test with these options removed.
Comment 18 Gordon 2020-09-13 13:03:16 UTC
With a single display connected and those options still enabled:

Canvas score - Test 1: 901 - Test 2: 2049
WebGL score - Test 1: 903 - Test 2: 1020
Total score: 4873
Your results compared to other users:
You score better than 96% of all users so far!
You score better than 99% of the people who use the same browser and OS!

Honestly, there is still a bug somewhere in the AMD code with multiple displays.

I left ppfeaturemask set for this test. I am going to run it again with dual displays.
Comment 19 Gordon 2020-09-13 13:17:14 UTC
Never mind, I was wrong:

Canvas score - Test 1: 1235 - Test 2: 2115
WebGL score - Test 1: 911 - Test 2: 1005
Total score: 5266
Your results compared to other users:
You score better than 97% of all users so far!
You score better than 100% of the people who use the same browser and OS!


With both displays connected, results seem good now, apart from the OpenGL-based crashes.
