Bug 196615
Summary: | amdgpu - resume from suspend is no longer working on rx480 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Peter Spiess-Knafl (psk) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | NEW --- | ||
Severity: | high | CC: | alexander, alexdeucher, contact, daniel.otero, dv, harry.wentland, jbreedon98, klavkalashj, kommerz11, lukasz, maaniv, philipp.classen, psk |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
URL: | https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.12.y&id=2dc1889ebf8501b0edf125e89a30e1cf3744a2a7 | ||
Kernel Version: | >= 4.11.3 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | dmesg log regarding the freeze; revert the change; possible fix | ||
Description
Peter Spiess-Knafl
2017-08-08 20:12:52 UTC
I also started a discussion thread on the Arch forum: https://bbs.archlinux.org/viewtopic.php?pid=1729393#p1729393

Please attach your dmesg output. How exactly does resume fail?

Hi Alex! Thanks for getting back. First there are strange artefacts where the mouse pointer should be, and shortly after that the system freezes altogether. I'll attach a dmesg log, but I think the relevant errors are these:

Aug 08 22:30:29 rabe kernel: [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 13 test failed
Aug 08 22:30:29 rabe kernel: [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Aug 08 22:30:29 rabe kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Aug 08 22:30:29 rabe kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -110
Aug 08 22:30:29 rabe kernel: PM: Device 0000:01:00.0 failed to resume async: error -110

Created attachment 257861 [details]
dmesg log regarding the freeze.
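For anyone triaging a similar log, the failure signature quoted above can be picked out mechanically. The following is a minimal sketch, not part of the driver or of any tool mentioned in this report: the function name `find_resume_failures`, the regular expressions, and the small errno table are all assumptions made for illustration. The one fact it relies on is that errno 110 on Linux is ETIMEDOUT, so "failed -110" means the VCE ring test timed out.

```python
import re

# Linux errno value seen in this report. Only this one code is mapped here
# (assumption: extend the table if other codes show up in your logs).
ERRNO_NAMES = {110: "ETIMEDOUT"}

# Matches e.g. "*ERROR* amdgpu: ring 13 test failed"
RING_FAIL = re.compile(r"amdgpu: ring (?P<ring>\d+) test failed")
# Matches e.g. "*ERROR* resume of IP block <vce_v3_0> failed -110"
RESUME_FAIL = re.compile(
    r"resume of IP block <(?P<block>[^>]+)> failed -(?P<errno>\d+)"
)


def find_resume_failures(log_text):
    """Return (kind, detail) tuples for each failure line in a kernel log."""
    findings = []
    for line in log_text.splitlines():
        match = RING_FAIL.search(line)
        if match:
            findings.append(("ring", int(match.group("ring"))))
            continue
        match = RESUME_FAIL.search(line)
        if match:
            code = int(match.group("errno"))
            name = ERRNO_NAMES.get(code, "errno %d" % code)
            findings.append(("ip_block", (match.group("block"), name)))
    return findings


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for kind, detail in find_resume_failures(sys.stdin.read()):
        print(kind, detail)
```

An empty result for a given boot suggests that this particular VCE resume failure was not logged. It says nothing about other resume problems.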
Alex, do you need any further information?

Same error on Gentoo, kernel 4.12.7, RX 470. On 4.12.7 the screen does come up on wake-up from suspend after a while (20 seconds or so), but the system is unusable: the mouse cursor moves fine and the keyboard responds to keypresses, but the screen updates with a 20-30s lag (if I launch a new terminal, it appears after half a minute). Changing to a VT with ctrl+alt+f[1-6] garbles the screen.

4.12.7 wake up:
[...]
[ 128.978655] [drm] ring test on 10 succeeded in 6 usecs
[ 129.025563] [drm] ring test on 11 succeeded in 1 usecs
[ 129.025563] [drm] UVD initialized successfully.
[ 129.126522] [drm] ring test on 12 succeeded in 0 usecs
[ 129.331084] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 13 test failed
[ 129.331088] [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
[ 129.331092] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
[ 129.331094] dpm_run_callback(): pci_pm_resume+0x0/0xd0 returns -110
[ 129.331095] PM: Device 0000:01:00.0 failed to resume async: error -110

On 4.12.3 the screen comes up instantly after resume and everything works fine.

4.12.3 wake up:
[...]
[ 14.255996] [drm] ring test on 10 succeeded in 6 usecs
[ 14.302859] [drm] ring test on 11 succeeded in 1 usecs
[ 14.302860] [drm] UVD initialized successfully.
[ 14.403827] [drm] ring test on 12 succeeded in 0 usecs
[ 14.403847] [drm] ring test on 13 succeeded in 9 usecs
[ 14.403848] [drm] VCE initialized successfully.
[ 14.403945] [drm] ib test on ring 0 succeeded
[ 14.404119] [drm] ib test on ring 1 succeeded
[...]
[ 14.405487] [drm] ib test on ring 11 succeeded
[ 14.405694] [drm] ib test on ring 12 succeeded

Created attachment 258001 [details]
revert the change
Does reverting it help?
Yes, it does. But I guess it was a bugfix for another problem, as indicated in your commit message. Will you revert it?

(In reply to Peter Spiess-Knafl from comment #8)
> Yes, it does. But I guess it was a bugfix for another problem as indicated
> in your commit message.

It is a bug fix for high mclks when displays are off, but it seems to regress resume for some reason, so we are just trading one bug for another. I guess maybe there is some other fix missing.

> Will you revert it?

Unless you think otherwise.

Please revert it then. Thanks for your help.

Alex, when will this be released?

I sent the patch to Greg last week.

After I updated my kernel from 4.12.9 to 4.12.10, I started experiencing screen flickering on my RX 480. I bisected, and it turns out that commit dbe5b2d70cfdc3e1df1ceb3f715c6ef7d17fc566 makes my screen flicker.

(In reply to dolohow from comment #13)
> After I updated my kernel from 4.12.9 to 4.12.10 I started experiencing
> screen flickering on my RX 480. I bisected, and it turns out that commit
> dbe5b2d70cfdc3e1df1ceb3f715c6ef7d17fc566 makes my screen flicker.

Do you mind adding the commit subject and description in addition to the SHA? Which git tree is this from? I'm having trouble finding it.

Sure, it's from Linus's tree:
> Revert "drm/amdgpu: fix vblank_time when displays are off"
>
> This reverts commit 2dc1889.
>
> Fixes a suspend and resume regression.
>
> bug: https://bugzilla.kernel.org/show_bug.cgi?id=196615
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This bug remains for me. It was working after the patch was reverted, and continued to work fine for the rest of the 4.12 series, but as of Linux 4.13.3 the same symptoms are back. When browsing the sources for this kernel, it seems the patch is still there. Was it supposed to be reapplied?

Same for me here. Arch 4.13.3. Was the original patch reapplied? Looks like it when browsing the source.

It's in both 4.13 and 4.14-rc3. I hope it can be removed again in time for the LTS release. For now I'm holding off the upgrade to 4.13.

I don't know if I'm getting this right, but it sounds like there is a choice between suspend/resume and screen flickering...

The code is still there in 4.14-rc4.

Alex, can you help out here? Why was the patch fixing the suspend/resume issue removed in 4.13?

"git log -p drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c" reveals that the original patch (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.12.y&id=2dc1889ebf8501b0edf125e89a30e1cf3744a2a7) has been reapplied over the fix for suspend.

(In reply to Peter Spiess-Knafl from comment #20)
> Alex can you help out here? Why was the patch fixing the suspend/resume
> issue removed in 4.13?

The fix was only applied to 4.12. No one reported any problems with 4.13 or newer until later.

Oh. I didn't realize it worked like that. The same problem happens with all versions of 4.13 and 4.14 I have tried so far. I would suggest removing the fix in all kernel versions until we can confirm it doesn't break anything. Having an LTS kernel break suspend/resume for Polaris users doesn't sound too good.

Hi, same issue here: the OS freezes after resume from suspend with an AMD RX480 GPU.
$ cat /etc/redhat-release
Fedora release 26 (Twenty Six)
$ uname -a
Linux amn 4.13.4-200.fc26.x86_64 #1 SMP Thu Sep 28 20:46:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ journalctl -k -b -2 | grep amdgpu | tail -5
Oct 12 01:46:06 amn kernel: [drm] Initialized amdgpu 3.18.0 20150101 for 0000:01:00.0 on minor 0
Oct 12 23:16:29 amn kernel: [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 14 test failed
Oct 12 23:16:29 amn kernel: [drm:amdgpu_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Oct 12 23:16:29 amn kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Oct 12 23:16:30 amn kernel: amdgpu 0000:01:00.0: ffff9514d9161800 unpin not necessary

Regards

I just figured out that if instead of opening a 'Gnome on Xorg' session, I open a 'Gnome' session (which, as far as I know, starts a 'Wayland Display Server' instead of Xorg on my Fedora setup), then I don't have any more issues resuming after a suspend. It works pretty well, with no more amdgpu related error messages in my journal.

Alex, I'm sorry for being pushy, but is anything being done about this? The next LTS kernel is closing in on release and suspend/resume is still not working. Linux 4.12.13 is still the last kernel with it working. If there is anything I can help with to solve this, like testing, info etc., just ask.

Created attachment 260307 [details]
possible fix
Does the attached patch fix the issue?
I did a quick test on my Arch Linux install. With Linux 4.14-rc5 and this latest patch applied, it seems to work like it should! I suspended and resumed twice; there were no errors reported and the computer resumed correctly. I couldn't get 4.13 to build for some reason, but I think the fault lies in my noobieness :) I will try tomorrow to build 4.13 on Ubuntu instead and get back with results. But so far so good, great job and many thanks!

Looks like that did the trick for me. I'm using Linux 4.13.8 on Fedora.

Yep, it seems to work fine on 4.13 in Ubuntu as well. I built the current Ubuntu kernel, which is called 4.13.0-16-generic, with the patch just posted, and the same small test with two suspend/resume cycles worked just fine with no errors.

Solved it for me, too. Tested on Arch Linux with 4.14.0-rc6-mainline (plus the patch).

Looks like the patch made it into 4.13.11. Yay. Thanks! From the changelog:

> commit 0d74253003e6370e65468f5aec8c969bdef6733e
> Author: Rex Zhu <Rex.Zhu@amd.com>
> Date: Fri Oct 20 15:07:41 2017 +0800
>
> drm/amd/powerplay: fix uninitialized variable
>
> commit 8b95f4f730cba02ef6febbdc4ca7e55ca045b00e upstream.
>
> refresh_rate was not initialized when program display gap.
> this patch can fix vce ring test failed when do S3 on Polaris10.
>
> bug: https://bugs.freedesktop.org/show_bug.cgi?id=103102
> bug: https://bugzilla.kernel.org/show_bug.cgi?id=196615