Created attachment 276359 [details]
failed resume - journalctl output
After suspend and resume, I see the lock screen, but the mouse cursor doesn't move and pressing keys doesn't seem to do anything (I can't switch VTs either).
Sapphire Radeon RX 580 Pulse 8 GB
Two displays connected through DisplayPort: Dell P2415Q and LG 27UD69P
Cinnamon desktop (Xorg)
Happens on kernels 4.16.13 and 4.17 (even with amdgpu.dc=0)
Doesn't happen with kernel 4.14.48 (or earlier 4.14.*)
Just tested 4.15.15 and 4.16
On 4.15.15 suspend and resume works fine
On 4.16 the system freezes even with amdgpu.dc=0
Note that Arch has
# CONFIG_DRM_AMD_DC_FBC is not set
in kernel config for both 4.15 and 4.16
Created attachment 276363 [details]
Failed resume - 4.16, amdgpu.dc=0
Standard question: Can you bisect?
The logs don't show anything suspicious, so without a bisect it is probably really hard to guess what this could be.
Commit d6895ad39f3b396be199f5b6fdfb8cde4be7bbf7 seems to be the cause. Resume works on 4.16 if I revert that single commit (tested on 4.16.0, 4.16.13, with both amdgpu.dc=0 and amdgpu.dc=1).
Ok, well that is interesting.
Please provide the output of "sudo cat /proc/iomem" and "lspci -t -v -nn".
In the meantime I will try to reproduce the issue here.
Created attachment 276391 [details]
lspci -t -v -nn
Created attachment 276393 [details]
Hmm, I've tried the same ASIC (Polaris 10, 8 GB) in an AMD Threadripper system, and suspend/resume works fine there.
So the only explanation I have is that this is some strange issue with PCI BAR resizing and Intel hardware.
Is the system completely unresponsive after resume, or can you at least ping it over the network?
It seems that only the GPU is hung; I can even SSH into the machine.
But things like restarting gdm/Xorg/unplugging the monitor didn't "fix" it. "shutdown -h now" didn't work.
Actually, sometimes the mouse pointer moves, and only freezes after I press a few keys or click a few times.
Also, sometimes there is just a colored pattern in the background instead of the lock screen.
With Gnome on Wayland it takes a bit more time to break: after resume I see the desktop, but after a few clicks/key presses I see artifacts and then eventually everything freezes.
And just in case:
- The problem also occurs with only one monitor connected.
- On Windows on the same machine suspend and resume works without any problems.
I literally have no idea what I'm doing, but adding 'amdgpu_device_resize_fb_bar(adev);' line to all 'gmc_v?_?_resume()' (because I don't know which version is used for my card) "fixed" it somehow. Resume works, but there are some artifacts on screen during resume (they flash only once and then disappear). Before 'amdgpu_device_resize_fb_bar' was introduced, there were no artifacts at all.
Created attachment 276415 [details]
dmesg: resume with device_resize_fb_bar() in gmc_v?_?_resume()
(In reply to Alexander Mezin from comment #11)
> I literally have no idea what I'm doing, but adding
> 'amdgpu_device_resize_fb_bar(adev);' line to all 'gmc_v?_?_resume()'
> (because I don't know which version is used for my card) "fixed" it somehow.
> Resume works, but there are some artifacts on screen during resume (they
> flash only once and then disappear). Before 'amdgpu_device_resize_fb_bar'
> was introduced, there were no artifacts at all.
Hehe, yeah that was a really nice test and confirms my suspicion on what's going wrong here.
Because you tried to resize the BAR once more after resume, the resources in the address space are freed up and allocated again:
[ 212.484672] amdgpu 0000:65:00.0: BAR 2: releasing [mem 0xe200000000-0xe2001fffff 64bit pref]
[ 212.484673] amdgpu 0000:65:00.0: BAR 0: releasing [mem 0xe000000000-0xe1ffffffff 64bit pref]
[ 212.484683] pcieport 0000:64:00.0: BAR 15: releasing [mem 0xe000000000-0xe2ffffffff 64bit pref]
[ 212.484691] pcieport 0000:64:00.0: BAR 15: assigned [mem 0xe000000000-0xe2ffffffff 64bit pref]
[ 212.484692] amdgpu 0000:65:00.0: BAR 0: assigned [mem 0xe000000000-0xe1ffffffff 64bit pref]
[ 212.484697] amdgpu 0000:65:00.0: BAR 2: assigned [mem 0xe200000000-0xe2001fffff 64bit pref]
Since it allocates the exact same addresses we freed up before, the real issue is not the address itself, but the fact that the hardware config isn't saved and restored across suspend/resume.
That looks strongly like a bug in the BIOS and/or the Linux PCI subsystem driver for Intel hardware to me.
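The "same address" observation can be checked mechanically. Below is a minimal Python sketch (not part of the original comment) that parses the "releasing"/"assigned" lines quoted above and compares the ranges. The helper name parse_bar_events is made up for illustration.

```python
import re

# dmesg lines copied from the log quoted above
LOG = """\
[ 212.484672] amdgpu 0000:65:00.0: BAR 2: releasing [mem 0xe200000000-0xe2001fffff 64bit pref]
[ 212.484673] amdgpu 0000:65:00.0: BAR 0: releasing [mem 0xe000000000-0xe1ffffffff 64bit pref]
[ 212.484683] pcieport 0000:64:00.0: BAR 15: releasing [mem 0xe000000000-0xe2ffffffff 64bit pref]
[ 212.484691] pcieport 0000:64:00.0: BAR 15: assigned [mem 0xe000000000-0xe2ffffffff 64bit pref]
[ 212.484692] amdgpu 0000:65:00.0: BAR 0: assigned [mem 0xe000000000-0xe1ffffffff 64bit pref]
[ 212.484697] amdgpu 0000:65:00.0: BAR 2: assigned [mem 0xe200000000-0xe2001fffff 64bit pref]
"""

LINE_RE = re.compile(
    r"(?P<drv>\S+) (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): "
    r"(?P<action>releasing|assigned) "
    r"\[mem 0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
)


def parse_bar_events(log):
    """Return {(device, bar): {action: (start, end)}} for each BAR line."""
    events = {}
    for m in LINE_RE.finditer(log):
        key = (m["dev"], int(m["bar"]))
        rng = (int(m["start"], 16), int(m["end"], 16))
        events.setdefault(key, {})[m["action"]] = rng
    return events


events = parse_bar_events(LOG)
for (dev, bar), actions in sorted(events.items()):
    start, end = actions["assigned"]
    size_mib = (end - start + 1) >> 20
    unchanged = actions["releasing"] == actions["assigned"]
    print(f"{dev} BAR {bar}: {size_mib} MiB, same range reassigned: {unchanged}")
```

Every range comes back unchanged, which supports the reading that the fix comes from reprogramming the hardware rather than from moving the BARs.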
I will try to narrow this down with a few patches on Monday, but don't expect any quick fix.
Created attachment 276471 [details]
Please test if this patch helps as well.
It limits the work done during resume to reprogramming BAR 0 & 2 and not the bridge.
No, it doesn't change anything, system freezes on resume.
So the problem seems to be the bridge then.
Please provide the output of the following commands three times: once before suspending, once after resuming without any changes, and once after resuming with your hack that resizes the BAR again:
sudo setpci -s 64:00.0 COMMAND PREF_MEMORY_BASE PREF_MEMORY_LIMIT PREF_BASE_UPPER32 PREF_LIMIT_UPPER32
sudo lspci -s 64:00.0 -vvvv
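For anyone reading the setpci output: the four PREF_* registers together describe the bridge's prefetchable memory window. A minimal Python sketch of the decoding follows, assuming the standard PCI-to-PCI bridge layout (bits 15:4 of the 16-bit base/limit registers hold address bits 31:20, and the low nibble 0x1 marks a 64-bit window). The function name is made up, and the register values in the example are hypothetical, constructed to match the bridge window in the dmesg output quoted earlier; they are not the reporter's actual setpci output.

```python
def decode_pref_window(base, limit, base_upper32, limit_upper32):
    """Decode a PCI-to-PCI bridge prefetchable memory window.

    base/limit are the 16-bit PREF_MEMORY_BASE/PREF_MEMORY_LIMIT values,
    *_upper32 are the 32-bit PREF_BASE_UPPER32/PREF_LIMIT_UPPER32 values.
    Returns (start, end) as inclusive byte addresses.
    """
    is_64bit = (base & 0xF) == 0x1
    start = (base & 0xFFF0) << 16            # address bits 31:20
    end = ((limit & 0xFFF0) << 16) | 0xFFFFF  # limit covers the whole last 1 MiB
    if is_64bit:
        start |= base_upper32 << 32
        end |= limit_upper32 << 32
    return start, end


# Hypothetical register values chosen to reproduce the window from dmesg
start, end = decode_pref_window(0x0001, 0xFFF1, 0xE0, 0xE2)
print(f"window: {start:#x}-{end:#x}")
```

If the decoded window is identical before and after resume, as the identical setpci output suggests, the bridge itself is programmed correctly and the lost state must live somewhere else.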
setpci - exactly the same output in all 3 cases (verified with 'diff' to be sure):
Created attachment 276517 [details]
lspci before suspend
Created attachment 276519 [details]
lspci after resume, no hack
Created attachment 276521 [details]
lspci after resume with hack
Not sure if it'll help, but I've added more logging here:
@@ -436,6 +436,8 @@ int pci_resize_resource(struct pci_dev *dev, int resno, int size)
+ pci_info(dev, "BAR %d: resized from %d to %d", resno, old, size);
res->end = res->start + pci_rebar_size_to_bytes(size) - 1;
/* Check if the new config works by trying to assign everything. */
And suspend-resume with "re-resize" hack shows this:
amdgpu 0000:65:00.0: BAR 0: resized from 8 to 13
(this message appears in dmesg two times, first one on boot, second one during resume, exactly the same message in both cases)
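The numbers 8 and 13 in that message are encoded sizes, not bytes. A minimal Python sketch of the conversion, assuming it mirrors the kernel's pci_rebar_size_to_bytes() (which, as far as I can tell, computes 1 << (size + 20), so 0 means 1 MiB); the Python function names are made up for illustration:

```python
def rebar_size_to_bytes(size):
    """Encoded resizable-BAR size -> bytes (0 = 1 MiB, 1 = 2 MiB, ...)."""
    return 1 << (size + 20)


def bytes_to_rebar_size(nbytes):
    """Inverse of rebar_size_to_bytes(); nbytes must be a power of two >= 1 MiB."""
    if nbytes < (1 << 20) or nbytes & (nbytes - 1):
        raise ValueError("size must be a power of two of at least 1 MiB")
    return nbytes.bit_length() - 1 - 20


for size in (8, 13):
    print(f"encoded size {size} = {rebar_size_to_bytes(size) >> 20} MiB")
```

So "from 8 to 13" means BAR 0 grows from 256 MiB to 8 GiB, which matches the 8 GB of VRAM on this card.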
Your debugging efforts are better than mine.
Please provide the output of "sudo setpci -s 65:00.0 ECAP15.l ECAP15+4.l ECAP15+8.l" once before suspend and once after suspend without any changes (e.g. when the problem happens).
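For context, ECAP15 is the Resizable BAR extended capability (ID 0x15); the dword at +4 is the capability register and the one at +8 is the control register for the first resizable BAR. A minimal Python sketch for decoding the control register follows. The bit masks are an assumption based on my reading of the kernel's pci_regs.h from that era (BAR index in bits 2:0, number of resizable BARs in bits 7:5, encoded size in bits 12:8), and the example value 0xd40 is hypothetical, not taken from the reporter's system.

```python
# Assumed field layout of the Resizable BAR control register
REBAR_CTRL_BAR_IDX = 0x00000007    # BAR index
REBAR_CTRL_NBAR_MASK = 0x000000E0  # number of resizable BARs
REBAR_CTRL_NBAR_SHIFT = 5
REBAR_CTRL_BAR_SIZE = 0x00001F00   # encoded BAR size
REBAR_CTRL_BAR_SHIFT = 8


def decode_rebar_ctrl(ctrl):
    """Split a Resizable BAR control register value into its fields."""
    size = (ctrl & REBAR_CTRL_BAR_SIZE) >> REBAR_CTRL_BAR_SHIFT
    return {
        "bar_index": ctrl & REBAR_CTRL_BAR_IDX,
        "num_bars": (ctrl & REBAR_CTRL_NBAR_MASK) >> REBAR_CTRL_NBAR_SHIFT,
        "size": size,
        "size_bytes": 1 << (size + 20),
    }


# Hypothetical value: BAR 0, two resizable BARs, encoded size 13 (8 GiB)
fields = decode_rebar_ctrl(0x00000D40)
print(fields)
```

Comparing the decoded size field before suspend and after a failed resume would show directly whether the device fell back to its smaller default BAR size.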
(In reply to Christian König from comment #22)
> Your debugging efforts are better than mine.
> Please provide the output of "sudo setpci -s 65:00.0 ECAP15.l ECAP15+4.l
> ECAP15+8.l" once before suspend and once after suspend without any changes
> (e.g. when the problem happens).
Created attachment 276547 [details]
In this case please try the attached patch and see if it helps.
Yes, it works
[ 34.330683] amdgpu 0000:65:00.0: Test 0 from 8 to 13
It works for me too.
dmesg | grep amdgpu:
[ 3.437098] [drm] amdgpu kernel modesetting enabled.
[ 3.442103] fb: switching to amdgpudrmfb from EFI VGA
[ 3.442234] amdgpu 0000:01:00.0: enabling device (0006 -> 0007)
[ 3.443795] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0xd0000000-0xd01fffff 64bit pref]
[ 3.443797] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0xc0000000-0xcfffffff 64bit pref]
[ 3.443822] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x2200000000-0x23ffffffff 64bit pref]
[ 3.443827] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x2100000000-0x21001fffff 64bit pref]
[ 3.443849] amdgpu 0000:01:00.0: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[ 3.443850] amdgpu 0000:01:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[ 3.443917] [drm] amdgpu: 8192M of VRAM memory ready
[ 3.443918] [drm] amdgpu: 8192M of GTT memory ready.
[ 4.239650] fbcon: amdgpudrmfb (fb0) is primary device
[ 4.323338] amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device
[ 4.340440] [drm] Initialized amdgpu 3.25.0 20150101 for 0000:01:00.0 on minor 0
[ 10.704309] amdgpu 0000:01:00.0: 00000000a78be373 unpin not necessary
[ 10.704310] amdgpu 0000:01:00.0: 00000000a78be373 unpin not necessary
[ 10.704310] amdgpu 0000:01:00.0: 000000006047af5e unpin not necessary
[ 10.704311] amdgpu 0000:01:00.0: 000000002d9a27ec unpin not necessary
[ 11.443673] amdgpu 0000:01:00.0: Test 0 from 8 to 13
So the patch will only land in 4.19.
Are you going to fix the regression (in amdgpu) for 4.15-4.18 somehow?
Seems to be fixed in 4.18.5 by a backport.