Bug 206575

Summary: [amdgpu] [drm] No video signal on resume from suspend, R9 380
Product: Drivers
Reporter: Noel Maersk (veox+kernel)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: low
CC: alexdeucher, brandomrobor, bwyazel, dap, dickvandrake, kernel_bugzilla, sevenever, thfrkbz, veox+kernel
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.5
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: dmesg output for resume-from-suspend, linux 5.5.4
dmesg output for resume-from-suspend, linux 5.4.20
dmesg output for resume-from-hibernate, linux 5.5.4
lspci output
git bisect log to find the culprit
git bisect log to find the fix (failed)
git bisect log to find the fix (successful)

Description Noel Maersk 2020-02-17 16:28:39 UTC
Created attachment 287441 [details]
dmesg output for resume-from-suspend, linux 5.5.4

OS: Arch Linux
GPU: (MSI) Radeon R9 380

On `systemctl suspend` and subsequent resume, the monitors display "no signal". The machine is responsive, commands can be typed on the keyboard, SSH'ing is also possible.

Somewhat unexpectedly, resume from hibernation works fine (i.e. there is signal).

This started happening a few weeks ago, seemingly when `linux` v5.5.2 was installed. Was also present on v5.5.3 and v5.5.4 (current).

`linux-lts` v5.4.20 does not exhibit this behaviour; it's a regression.
Comment 1 Noel Maersk 2020-02-17 16:29:24 UTC
Created attachment 287443 [details]
dmesg output for resume-from-suspend, linux 5.4.20
Comment 2 Noel Maersk 2020-02-17 16:29:54 UTC
Created attachment 287445 [details]
dmesg output for resume-from-hibernate, linux 5.5.4
Comment 3 Noel Maersk 2020-02-17 16:30:49 UTC
Created attachment 287447 [details]
lspci output
Comment 4 Alex Deucher 2020-02-17 16:33:50 UTC
Can you bisect?
Comment 5 Noel Maersk 2020-02-17 16:38:33 UTC
I'm not able to bisect at the moment. Will try by the end of the workweek.

-----

User `muncrief` has recently reported something similar in a different bug report, here:

https://bugzilla.kernel.org/show_bug.cgi?id=204241#c48

... for Radeon R9 390, ever since linux 5.5-rc1. It was suggested that they open a new issue, but a search on Bugzilla shows they never did.
Comment 6 Thomas Frank 2020-02-18 17:47:11 UTC
I have the same graphics card and the same problem.  Do you need additional dmesg outputs from kernel 5.4.20 and 5.5.4?

I don't know if this helps but I diffed my `amdgpu` filtered dmesg outputs:

```
--- 5.4.20-1-lts_amdgpu_wo_uptime.txt	2020-02-18 18:38:07.393633705 +0100
+++ 5.5.4-arch1-1_amdgpu_wo_uptime.txt	2020-02-18 18:38:32.714488497 +0100
@@ -1,7 +1,4 @@
 [drm] amdgpu kernel modesetting enabled.
-amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xd0000000 -> 0xdfffffff
-amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xe0000000 -> 0xe01fffff
-amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xefe00000 -> 0xefe3ffff
 fb0: switching to amdgpudrmfb from VESA VGA
 amdgpu 0000:01:00.0: vgaarb: deactivate vga console
 amdgpu 0000:01:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
@@ -9,7 +6,9 @@
 [drm] amdgpu: 4096M of VRAM memory ready
 [drm] amdgpu: 4096M of GTT memory ready.
 amdgpu: [powerplay] hwmgr_sw_init smu backed is tonga_smu
+[drm:dm_helpers_parse_edid_caps [amdgpu]] *ERROR* Couldn't read SADs: -2
 fbcon: amdgpudrmfb (fb0) is primary device
 amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device
-[drm] Initialized amdgpu 3.35.0 20150101 for 0000:01:00.0 on minor 0
+[drm] Initialized amdgpu 3.36.0 20150101 for 0000:01:00.0 on minor 0
 snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
+[drm:dm_helpers_parse_edid_caps [amdgpu]] *ERROR* Couldn't read SADs: -2
```

The last line of the diff,
`[drm:dm_helpers_parse_edid_caps [amdgpu]] *ERROR* Couldn't read SADs: -2`,
is logged after resuming (with blank screen).
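
For anyone wanting to produce a comparable diff, here is a minimal Python sketch of the filtering step: keep only lines mentioning `amdgpu`, drop the leading uptime stamp so the two boots line up, then diff. The function names, file labels and the timestamp pattern are assumptions for illustration; this is not necessarily how the diff above was made.

```python
import difflib
import re

# Assumed format of a leading dmesg uptime stamp, e.g. "[    3.141592] ".
UPTIME = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def amdgpu_lines(dmesg_text):
    """Keep lines that mention amdgpu, with the uptime stamp removed."""
    kept = []
    for line in dmesg_text.splitlines():
        if "amdgpu" in line:
            kept.append(UPTIME.sub("", line))
    return kept


def diff_logs(old_text, new_text, old_name="old.txt", new_name="new.txt"):
    """Unified diff of the amdgpu-only lines of two dmesg dumps."""
    return list(difflib.unified_diff(
        amdgpu_lines(old_text), amdgpu_lines(new_text),
        fromfile=old_name, tofile=new_name, lineterm=""))
```

Calling `diff_logs(open("dmesg-5.4.txt").read(), open("dmesg-5.5.txt").read())` (hypothetical file names) returns the diff as a list of lines that can be printed one per line.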
Comment 7 Noel Maersk 2020-02-18 21:33:02 UTC
`git bisect log` output at:
https://gist.github.com/veox/36aeb77acfbcaea9c4ba1cc70052329a

Had to `skip` a few because of system instability on v5.5.4 (cause unknown, likely
unrelated to this bug); switched to v5.4.20 halfway through to avoid it.

Result as follows (e-mails changed).


1ea8751bd28d1ec2b36a56ec6bc1ac28903d09b4 is the first bad commit
commit 1ea8751bd28d1ec2b36a56ec6bc1ac28903d09b4
Author: Noah Abradjian <spam@gmail.com>
Date:   Fri Sep 27 16:30:57 2019 -0400

    drm/amd/display: Make clk mgr the only dto update point
    
    [Why]
    
    * Clk Mgr DTO update point did not cover all needed updates, as it included a
      check for plane_state which does not exist yet when the updater is called on
      driver startup
    * This resulted in another update path in the pipe programming sequence, based
      on a dppclk update flag
    * However, this alternate path allowed for stray DTO updates, some of which would
      occur in the wrong order during dppclk lowering and cause underflow
    
    [How]
    
    * Remove plane_state check and use of plane_res.dpp->inst, getting rid
      of sequence dependencies (this results in extra dto programming for unused
      pipes but that doesn't cause issues and is a small cost)
    * Allow DTOs to be updated even if global clock is equal, to account for
      edge case exposed by diags tests
    * Remove update_dpp_dto call in pipe programming sequence (leave update to
      dppclk_control there, as that update is necessary and shouldn't occur in clk
      mgr)
    * Remove call to optimize_bandwidth when committing state, as it is not needed
      and resulted in sporadic underflows even with other fixes in place
    
    Signed-off-by: Noah Abradjian <spam@gmail.com>
    Reviewed-by: Jun Lei <spam@gmail.com>
    Acked-by: Leo Li <spam@gmail.com>
    Signed-off-by: Alex Deucher <spam@gmail.com>

 .../gpu/drm/amd/display/dc/clk_mgr/dcn20/dcn20_clk_mgr.c   | 14 +++++++++-----
 drivers/gpu/drm/amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c  |  3 ++-
 drivers/gpu/drm/amd/display/dc/core/dc.c                   |  4 ----
 drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c         |  8 +-------
 4 files changed, 12 insertions(+), 17 deletions(-)
Comment 8 Noel Maersk 2020-02-18 21:44:19 UTC
The

    *ERROR* Couldn't read SADs: -2

in my and Thomas' logs is unrelated to the issue, I believe, and pertains to sound (HDMI sound?..).

Going by the function name in the log line, the error comes from dm_helpers_parse_edid_caps() in the amdgpu display code, and refers to the Short Audio Descriptors (SADs) in the monitor's EDID.

Anyway, I've seen the error on resume-from-suspend for commits that managed to "signal up" properly in the bisect above, as well as the "blank screen" cases.
Comment 9 Noel Maersk 2020-02-18 22:06:21 UTC
Some of the commits that got skipped in that `git bisect log` of mine actually come before the one above when viewing `git log`. :/

Guess I'll try the bisect again in the coming days.
Comment 10 Noel Maersk 2020-02-19 19:51:26 UTC
The repeat bisect turned up nothing different. Looks like it's 1ea8751bd28d1ec2b36a56ec6bc1ac28903d09b4 indeed.
Comment 11 Thomas Frank 2020-02-21 00:42:46 UTC
I can confirm Noel's finding. Reverting 1ea8751bd28d1ec2b36a56ec6bc1ac28903d09b4 brings back the screen output after resume for me as well.
Comment 12 Britt Yazel 2020-02-24 02:12:30 UTC
This issue is also present on a discrete R9 290, also on Arch Linux. So it seems the issue affects 200- and 300-series cards in general.

There's also a thread on Reddit about this issue:
https://www.reddit.com/r/archlinux/comments/f7oti1/issue_with_resume_from_suspend_black_backlit/
Comment 13 Duncan 2020-03-16 09:39:42 UTC
I have the same problem. My graphics card is an AMD R9 285, and for the past few kernel updates resume from suspend has not worked: the screen stays black, although the system and keyboard respond fine.
My distro is Anarchy Linux with the KDE desktop and SDDM as the display manager.
Regards.
Comment 14 Joe Ramsey 2020-03-20 19:54:29 UTC
Looks like this has been corrected in 5.6... is there any intent to include the fix in any 5.5 kernel or will we just have to wait for 5.6?
Comment 15 Alex Deucher 2020-03-20 20:15:09 UTC
(In reply to Joe Ramsey from comment #14)
> Looks like this has been corrected in 5.6... is there any intent to include
> the fix in any 5.5 kernel or will we just have to wait for 5.6?

Can you identify the fix?
Comment 16 Joe Ramsey 2020-03-20 22:03:47 UTC
(In reply to Alex Deucher from comment #15)
> (In reply to Joe Ramsey from comment #14)
> > Looks like this has been corrected in 5.6... is there any intent to include
> > the fix in any 5.5 kernel or will we just have to wait for 5.6?
> 
> Can you identify the fix?

If I understood Noel Maersk's and Thomas Frank's posts correctly, reverting 1ea8751bd28d1ec2b36a56ec6bc1ac28903d09b4 resolves the issue.  The Reddit thread that was referenced (https://www.reddit.com/r/archlinux/comments/f7oti1/issue_with_resume_from_suspend_black_backlit/) seems to indicate that it's resolved in 5.6.  Was wondering if whatever fix was applied to 5.6 would also be applied to 5.5.  Could be I've completely misunderstood things.

I'm running Slackware and have been using the -current kernel packages (currently at 5.4.25), but the kernel modules for virtualbox don't seem to be compiling under that kernel for some reason.  I tried several of the recent 5.5 releases (5.5.8-5.5.10), and can get the virtualbox kernel modules to compile under them, but they all seem to have this bug.  Was hoping to get one kernel that would allow my laptop to suspend and also compile the virtualbox modules.  :^)
Comment 17 Alex Deucher 2020-03-23 05:06:23 UTC
If you could verify that 5.6 works for you, you could bisect to see what commit fixed it.
Comment 18 Joe Ramsey 2020-03-24 20:11:36 UTC
(In reply to Alex Deucher from comment #17)
> If you could verify that 5.6 works for you, you could bisect to see what
> commit fixed it.

OK, I'm about to reveal my ignorance.  I just got a chance to compile 5.6-rc7 to confirm that resume from suspend worked (it did), but I have no idea how to bisect.  Googled for it and it looks like I need to be using git, but I'm just downloading the tarball from kernel.org to compile my kernel.  Is this even worth messing with given that it looks like we may have a stable 5.6 in the near future?
Comment 19 Duncan 2020-04-10 08:35:36 UTC
I can confirm that this issue is solved on the 5.6 kernel, but sadly I will continue using the LTS kernel because I still have problems with my webcam's fps and microphone's bitrate on other kernels.
Comment 20 Noel Maersk 2020-04-11 13:56:52 UTC
I'll do a bisect to identify the fix. Roughly 15 steps.
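
For a rough sense of where a figure like 15 comes from: each test halves the range of candidate commits, so the number of steps grows as log2 of the number of commits in the range. Below is a minimal Python sketch of that idea over a linear history. It is not git's actual algorithm (which walks a commit graph and supports `skip`); `first_new`, the integer stand-ins for commits and the `is_new` callback are all hypothetical.

```python
def first_new(commits, is_new):
    """Return (first commit for which is_new() is true, number of tests run).

    Assumes commits are in history order, that once a commit is "new"
    (fixed) every later one is too, and that commits[-1] is already new.
    """
    lo, hi = 0, len(commits) - 1  # commits[hi] is known to be new
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if is_new(commits[mid]):
            hi = mid        # the fix landed at mid or earlier
        else:
            lo = mid + 1    # the fix landed after mid
    return commits[lo], steps


# Hypothetical range of 20000 commits with the fix at index 12345.
commit, steps = first_new(list(range(20000)), lambda c: c >= 12345)
```

With a range of that size the loop needs at most 15 tests, since 2^15 = 32768 is the first power of two above 20000.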
Comment 21 Noel Maersk 2020-04-11 13:59:32 UTC
Created attachment 288351 [details]
git bisect log to find the culprit

Attaching original git bisect log (was posted to github previously).
Comment 22 Duncan 2020-04-14 09:09:29 UTC
(In reply to Duncan from comment #19)
> I can confirm that this issue is solved on the 5.6 kernel, but sadly I will
> continue using the LTS kernel because I still have problems with my webcam's
> fps and microphone's bitrate on other kernels.

My webcam and microphone issues seem to be resolved in this new kernel, but I will keep an lts kernel in case I ever have to use it again.
Comment 23 Noel Maersk 2020-04-14 13:28:19 UTC
Created attachment 288445 [details]
git bisect log to find the fix (failed)

After >45 steps, I gave up the bisect. There's a different bug that prevents the initramfs from loading at all, making it impossible to check if the issue-at-hand is still present.

After having `skip`ped the first time this happened, I made the bad call of "maybe I'll tag this condition `old`, too"; I did this just once, but it might've had a negative effect on the outcome.

I'm attaching the bisect log anyway.
Comment 24 Noel Maersk 2020-04-14 13:31:38 UTC
(In reply to Alex Deucher from comment #17)
> If you could verify that 5.6 works for you, you could bisect to see what
> commit fixed it.

I'm not 100% sure of the bug-closing process for the kernel; is this strictly required
to mark the bug as resolved?

The issue is no longer there, and the fix seems difficult to pin down. :(
Comment 25 Alex Deucher 2020-04-14 14:10:21 UTC
Go ahead and close it.  You can always open a new one if you see further issues.
Comment 26 Noel Maersk 2020-04-15 11:43:12 UTC
Will close as resolved shortly.

I did run a second bisect successfully, showing:


f2988e67144a263e33aa3b916457bf3095288c94 is the first new commit
commit f2988e67144a263e33aa3b916457bf3095288c94
Author: Yongqiang Sun <yongqiang.sun@amd.com>
Date:   Fri Oct 18 18:24:59 2019 -0400

    drm/amd/display: optimize bandwidth after commit streams.
    
    [Why]
    System is unable to enter S0i3 due to DISPLAY_OFF_MASK not asserted
    in SMU.
    
    [How]
    Optimized bandwidth should be called paired and to resolve unplug
    display underflow issue, optimize bandwidth after commit streams is
    moved to next page flip, in case of S0i3, there is a change for no
    flip coming causing display count is 1 in SMU side.
    Add optimize bandwidth after commit stream.
    
    Signed-off-by: Yongqiang Sun <yongqiang.sun@amd.com>
    Reviewed-by: Tony Cheng <Tony.Cheng@amd.com>
    Acked-by: Bhawanpreet Lakha <Bhawanpreet.Lakha@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 drivers/gpu/drm/amd/display/dc/core/dc.c | 4 ++++
 1 file changed, 4 insertions(+)
Comment 27 Noel Maersk 2020-04-15 11:44:01 UTC
Created attachment 288483 [details]
git bisect log to find the fix (successful)

Attaching successful git bisect log.
Comment 28 Noel Maersk 2020-04-15 11:45:44 UTC
Closing as resolved - fix already in tree and released versions.