Bug 196615 - amdgpu - resume from suspend is no longer working on rx480
Summary: amdgpu - resume from suspend is no longer working on rx480
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: Intel Linux
Importance: P1 high
Assignee: drivers_video-dri
URL: https://git.kernel.org/pub/scm/linux/...
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-08 20:12 UTC by Peter Spiess-Knafl
Modified: 2017-11-02 16:52 UTC

See Also:
Kernel Version: >= 4.11.3
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg log regarding the freeze. (8.27 KB, text/plain)
2017-08-09 21:21 UTC, Peter Spiess-Knafl
revert the change (1.12 KB, patch)
2017-08-17 20:12 UTC, Alex Deucher
possible fix (1.71 KB, patch)
2017-10-20 14:09 UTC, Alex Deucher

Description Peter Spiess-Knafl 2017-08-08 20:12:52 UTC
Hi!

Since 4.12.4 I can no longer resume from suspend using the amdgpu driver on my rx 480.

I did a bisect and it revealed the following commit being the problem:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.12.y&id=2dc1889ebf8501b0edf125e89a30e1cf3744a2a7

Can someone help me with fixing that?
Comment 1 Peter Spiess-Knafl 2017-08-08 20:14:49 UTC
I also started a discussion thread on the arch forum:

https://bbs.archlinux.org/viewtopic.php?pid=1729393#p1729393
Comment 2 Alex Deucher 2017-08-08 20:44:17 UTC
Please attach your dmesg output.  How exactly does resume fail?
Comment 3 Peter Spiess-Knafl 2017-08-09 21:20:37 UTC
Hi Alex!

Thanks for getting back. First there are strange artefacts where the mouse pointer should be, and shortly after that the system freezes altogether.

I'll attach a dmesg log. But I think the relevant errors are these:

Aug 08 22:30:29 rabe kernel: [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 13 test failed
Aug 08 22:30:29 rabe kernel: [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Aug 08 22:30:29 rabe kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Aug 08 22:30:29 rabe kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -110
Aug 08 22:30:29 rabe kernel: PM: Device 0000:01:00.0 failed to resume async: error -110
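
For anyone triaging similar logs, here is a minimal sketch (not part of the original report) that pulls the failing IP block and error code out of lines like the ones above. The function name `find_resume_failures` and the regular expression are illustrative assumptions, and the mapping of 110 to ETIMEDOUT assumes the generic Linux errno numbering.

```python
import re

# Assumption: generic Linux errno numbering, where ETIMEDOUT is 110.
ERRNO_NAMES = {110: "ETIMEDOUT"}

# Matches e.g. "*ERROR* resume of IP block <vce_v3_0> failed -110"
RESUME_FAIL = re.compile(
    r"resume of IP block <(?P<block>[^>]+)> failed (?P<code>-?\d+)"
)


def find_resume_failures(log_text):
    """Return (ip_block, code, errno_name) for each failed IP block resume."""
    failures = []
    for line in log_text.splitlines():
        match = RESUME_FAIL.search(line)
        if match:
            code = int(match.group("code"))
            # The kernel reports negative errno values, so negate for lookup.
            name = ERRNO_NAMES.get(-code, "unknown")
            failures.append((match.group("block"), code, name))
    return failures


sample = """\
Aug 08 22:30:29 rabe kernel: [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 13 test failed
Aug 08 22:30:29 rabe kernel: [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
"""
print(find_resume_failures(sample))
```

Read this way, the log says the VCE block timed out during resume, which fits the ring 13 test failure reported just before it.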
Comment 4 Peter Spiess-Knafl 2017-08-09 21:21:08 UTC
Created attachment 257861
dmesg log regarding the freeze.
Comment 5 Peter Spiess-Knafl 2017-08-13 10:48:45 UTC
Alex, do you need any further information?
Comment 6 Francisco J. Vazquez 2017-08-17 17:39:18 UTC
Same error on Gentoo, kernel 4.12.7, RX 470. On 4.12.7 the screen does come up on wake up from suspend after a while (20 seconds or so) but the system is unusable: the mouse cursor moves fine and the keyboard responds to keypresses but the screen updates with a 20-30s lag (if I launch a new terminal it appears after half a minute). Changing to VT with ctrl+alt+f[1-6] garbles the screen.

4.12.7 wake up:
[...]
[  128.978655] [drm] ring test on 10 succeeded in 6 usecs
[  129.025563] [drm] ring test on 11 succeeded in 1 usecs
[  129.025563] [drm] UVD initialized successfully.
[  129.126522] [drm] ring test on 12 succeeded in 0 usecs
[  129.331084] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 13 test failed
[  129.331088] [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
[  129.331092] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
[  129.331094] dpm_run_callback(): pci_pm_resume+0x0/0xd0 returns -110
[  129.331095] PM: Device 0000:01:00.0 failed to resume async: error -110



On 4.12.3 the screen comes up instantly after resume and everything works fine. 

4.12.3 wake up:
[...]
[   14.255996] [drm] ring test on 10 succeeded in 6 usecs
[   14.302859] [drm] ring test on 11 succeeded in 1 usecs
[   14.302860] [drm] UVD initialized successfully.
[   14.403827] [drm] ring test on 12 succeeded in 0 usecs
[   14.403847] [drm] ring test on 13 succeeded in 9 usecs
[   14.403848] [drm] VCE initialized successfully.
[   14.403945] [drm] ib test on ring 0 succeeded
[   14.404119] [drm] ib test on ring 1 succeeded
[...]
[   14.405487] [drm] ib test on ring 11 succeeded
[   14.405694] [drm] ib test on ring 12 succeeded
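
The two wake-up logs above can also be compared mechanically. The following is a rough sketch, assuming the message formats shown in this comment; the helper names `ring_results` and `regressed_rings` are made up for illustration and are not part of any kernel tooling.

```python
import re

# Message formats as they appear in the dmesg excerpts above.
RING_OK = re.compile(r"ring test on (\d+) succeeded")
RING_FAIL = re.compile(r"ring (\d+) test failed")


def ring_results(log_text):
    """Map ring number -> "ok" or "failed" for every ring test in a log."""
    results = {}
    for line in log_text.splitlines():
        match = RING_OK.search(line)
        if match:
            results[int(match.group(1))] = "ok"
            continue
        match = RING_FAIL.search(line)
        if match:
            results[int(match.group(1))] = "failed"
    return results


def regressed_rings(good_log, bad_log):
    """Rings that passed in good_log but failed, or never ran, in bad_log."""
    good = ring_results(good_log)
    bad = ring_results(bad_log)
    return sorted(r for r, status in good.items()
                  if status == "ok" and bad.get(r) != "ok")


good_log = """\
[   14.403827] [drm] ring test on 12 succeeded in 0 usecs
[   14.403847] [drm] ring test on 13 succeeded in 9 usecs
[   14.403945] [drm] ib test on ring 0 succeeded
"""
bad_log = """\
[  129.126522] [drm] ring test on 12 succeeded in 0 usecs
[  129.331084] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 13 test failed
"""
print(regressed_rings(good_log, bad_log))
```

On the excerpts above this singles out ring 13, the VCE ring, as the only ring test that differs between 4.12.3 and 4.12.7.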
Comment 7 Alex Deucher 2017-08-17 20:12:05 UTC
Created attachment 258001
revert the change

Does reverting it help?
Comment 8 Peter Spiess-Knafl 2017-08-17 20:22:06 UTC
Yes, it does. But I guess it was a bugfix for another problem, as indicated in your commit message.

Will you revert it?
Comment 9 Alex Deucher 2017-08-17 20:28:36 UTC
(In reply to Peter Spiess-Knafl from comment #8)
> Yes, it does. But I guess it was a bugfix for another problem, as indicated
> in your commit message.

It is a bug fix for high mclks when displays are off, but it seems to regress resume for some reason so we are just trading one bug for another.  I guess maybe there is some other fix missing.

> 
> Will you revert it?

Unless you think otherwise.
Comment 10 Peter Spiess-Knafl 2017-08-17 20:32:22 UTC
Please revert it then. Thanks for your help.
Comment 11 Peter Spiess-Knafl 2017-08-25 12:28:42 UTC
Alex, when will this be released?
Comment 12 Alex Deucher 2017-08-25 14:18:11 UTC
I sent the patch to Greg last week.
Comment 13 Łukasz Żarnowiecki 2017-09-07 17:43:50 UTC
After I updated my kernel from 4.12.9 to 4.12.10, I started experiencing screen flickering on my RX 480.  I bisected, and it turns out that commit dbe5b2d70cfdc3e1df1ceb3f715c6ef7d17fc566 makes my screen flicker.
Comment 14 Harry Wentland 2017-09-07 17:58:15 UTC
(In reply to dolohow from comment #13)
> After I updated my kernel from 4.12.9 to 4.12.10, I started experiencing
> screen flickering on my RX 480.  I bisected, and it turns out that commit
> dbe5b2d70cfdc3e1df1ceb3f715c6ef7d17fc566 makes my screen flicker.

Do you mind adding the commit subject and description in addition to the SHA? Which git tree is this from? I'm having trouble finding it.
Comment 15 Łukasz Żarnowiecki 2017-09-07 18:07:17 UTC
Sure, it's from Linus's tree


> Revert "drm/amdgpu: fix vblank_time when displays are off"
> 
> This reverts commit 2dc1889.
> 
> Fixes a suspend and resume regression.
> 
> bug: https://bugzilla.kernel.org/show_bug.cgi?id=196615
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Comment 16 klavkalashj 2017-09-28 07:32:34 UTC
This bug remains for me. It was working after the patch was reverted, and continued to work fine for the rest of the 4.12 series, but as of Linux 4.13.3 I see the same symptoms again. When browsing the sources for this kernel, it seems the patch is still there. Was it supposed to be reapplied?
Comment 17 Peter Spiess-Knafl 2017-10-01 21:29:01 UTC
Same for me here. Arch 4.13.3. Was the original patch reapplied?
Comment 18 klavkalashj 2017-10-02 07:04:56 UTC
Looks like it when browsing the source. It's in both 4.13 and 4.14-rc3. I hope it can be removed again in time for the LTS release. For now I'm holding off the upgrade to 4.13. I don't know if I'm getting this right, but it sounds like there is a choice between suspend/resume and screen flickering...
Comment 19 klavkalashj 2017-10-09 14:15:59 UTC
The code is still there in 4.14-rc4.
Comment 20 Peter Spiess-Knafl 2017-10-12 14:48:18 UTC
Alex can you help out here? Why was the patch fixing the suspend/resume issue removed in 4.13?
Comment 21 Peter Spiess-Knafl 2017-10-12 14:56:01 UTC
"git log -p drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c" reveals that the original patch (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.12.y&id=2dc1889ebf8501b0edf125e89a30e1cf3744a2a7) has been reapplied over the fix for suspend.
Comment 22 Alex Deucher 2017-10-12 15:09:00 UTC
(In reply to Peter Spiess-Knafl from comment #20)
> Alex can you help out here? Why was the patch fixing the suspend/resume
> issue removed in 4.13?

The fix was only applied to 4.12.  No one reported any problems with 4.13 or newer until later.
Comment 23 klavkalashj 2017-10-12 20:18:09 UTC
Oh. I didn't realize it worked like that. The same problem happens with all versions of 4.13 and 4.14 I tried so far.
Comment 24 klavkalashj 2017-10-13 06:03:08 UTC
I would suggest removing the fix from all kernel versions until we can confirm it doesn't break anything. Having an LTS kernel break suspend/resume for Polaris users doesn't sound too good.
Comment 25 Florent 2017-10-13 21:36:33 UTC
Hi,
Same issue here. OS freezing after resume from suspend with an AMD RX480 GPU.

$ cat /etc/redhat-release 
Fedora release 26 (Twenty Six)
$ uname -a
Linux amn 4.13.4-200.fc26.x86_64 #1 SMP Thu Sep 28 20:46:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ journalctl -k -b -2 | grep amdgpu | tail -5
Oct 12 01:46:06 amn kernel: [drm] Initialized amdgpu 3.18.0 20150101 for 0000:01:00.0 on minor 0
Oct 12 23:16:29 amn kernel: [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 14 test failed
Oct 12 23:16:29 amn kernel: [drm:amdgpu_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Oct 12 23:16:29 amn kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Oct 12 23:16:30 amn kernel: amdgpu 0000:01:00.0: ffff9514d9161800 unpin not necessary

Regards
Comment 26 Florent 2017-10-14 09:59:11 UTC
I just figured out that if instead of opening a 'Gnome on Xorg' session, I open a 'Gnome' session (which, as far as I know, starts a 'Wayland Display Server' instead of Xorg on my Fedora setup), then I don't have any more issues resuming after a suspend. It works pretty well, no more amdgpu related error messages in my journal.
Comment 27 klavkalashj 2017-10-17 07:09:14 UTC
Alex, I'm sorry for being pushy, but is anything being done about this? The next LTS kernel is closing in on release and suspend/resume is still not working. Linux 4.12.13 is still the last kernel with it working. If there is anything I can help with to solve this, like testing, info etc. just ask.
Comment 28 Alex Deucher 2017-10-20 14:09:07 UTC
Created attachment 260307
possible fix

Does the attached patch fix the issue?
Comment 29 klavkalashj 2017-10-20 20:17:35 UTC
I did a quick test on my Arch Linux install. With Linux 4.14-rc5 and this latest patch applied, it seems to work like it should! I suspended and resumed twice and there were no errors reported and the computer resumed correctly. I couldn't get 4.13 to build for some reason, but I think the fault lies in my noobieness :) Will try tomorrow to build 4.13 on Ubuntu instead and get back with results. But so far so good, great job and many thanks!
Comment 30 Florian Schmitt 2017-10-20 22:18:51 UTC
Looks like that did the trick for me. I'm using linux 4.13.8 on Fedora.
Comment 31 klavkalashj 2017-10-21 13:58:21 UTC
Yep, it seems to work fine also on 4.13 in Ubuntu. Built the current version of Ubuntu which is called 4.13.0-16-generic with the patch just posted, and the same small test with two suspend/resume cycles worked just fine with no errors.
Comment 32 Philipp Claßen 2017-10-29 23:15:29 UTC
Solved it for me, too. Tested on Arch Linux with 4.14.0-rc6-mainline (plus the patch).
Comment 33 Florian Schmitt 2017-11-02 16:52:21 UTC
Looks like the patch made it into 4.13.11. Yay. Thanks!

From the changelog:

commit 0d74253003e6370e65468f5aec8c969bdef6733e
Author: Rex Zhu <Rex.Zhu@amd.com>
Date:   Fri Oct 20 15:07:41 2017 +0800

    drm/amd/powerplay: fix uninitialized variable
    
    commit 8b95f4f730cba02ef6febbdc4ca7e55ca045b00e upstream.
    
    refresh_rate was not initialized when program
    display gap.
    this patch can fix vce ring test failed
    when do S3 on Polaris10.
    
    bug: https://bugs.freedesktop.org/show_bug.cgi?id=103102
    bug: https://bugzilla.kernel.org/show_bug.cgi?id=196615
