Bug 199959 - amdgpu, regression?: system freezes after resume
Summary: amdgpu, regression?: system freezes after resume
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-06-07 00:45 UTC by Aleksandr Mezin
Modified: 2018-12-11 17:24 UTC
CC List: 5 users

See Also:
Kernel Version: 4.16
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
failed resume - journalctl output (409.26 KB, text/plain)
2018-06-07 00:45 UTC, Aleksandr Mezin
Failed resume - 4.16, amdgpu.dc=0 (311.70 KB, text/plain)
2018-06-07 03:04 UTC, Aleksandr Mezin
lspci -t -v -nn (7.21 KB, text/plain)
2018-06-08 11:13 UTC, Aleksandr Mezin
/proc/iomem (4.31 KB, text/plain)
2018-06-08 11:14 UTC, Aleksandr Mezin
dmesg: resume with device_resize_fb_bar() in gmc_v?_?_resume() (87.69 KB, text/plain)
2018-06-09 05:47 UTC, Aleksandr Mezin
Testing patch (1.19 KB, patch)
2018-06-11 14:22 UTC, Christian König
lspci before suspend (4.43 KB, text/plain)
2018-06-13 02:39 UTC, Aleksandr Mezin
lspci after resume, no hack (4.43 KB, text/plain)
2018-06-13 02:39 UTC, Aleksandr Mezin
lspci after resume with hack (4.43 KB, text/plain)
2018-06-13 02:39 UTC, Aleksandr Mezin
Possible fix (1.95 KB, patch)
2018-06-14 10:02 UTC, Christian König

Description Aleksandr Mezin 2018-06-07 00:45:42 UTC
Created attachment 276359 [details]
failed resume - journalctl output

After suspend and resume, I see the lock screen, but the mouse cursor doesn't move and pressing keys doesn't seem to change anything (I can't switch VTs either).

Sapphire Radeon RX 580 Pulse 8 GB
Two displays connected through DisplayPort: Dell P2415Q and LG 27UD69P
Cinnamon desktop (Xorg)
Arch Linux

Happens on kernels 4.16.13 and 4.17 (even with amdgpu.dc=0)
Doesn't happen with kernel 4.14.48 (and earlier 4.14.*)
Comment 1 Aleksandr Mezin 2018-06-07 03:03:54 UTC
Just tested 4.15.15 and 4.16
On 4.15.15 suspend and resume works fine
On 4.16 the system freezes even with amdgpu.dc=0

Note that Arch has

CONFIG_DRM_AMD_DC=y
CONFIG_DRM_AMD_DC_PRE_VEGA=y
# CONFIG_DRM_AMD_DC_FBC is not set
CONFIG_DRM_AMD_DC_DCN1_0=y

in kernel config for both 4.15 and 4.16
Comment 2 Aleksandr Mezin 2018-06-07 03:04:42 UTC
Created attachment 276363 [details]
Failed resume - 4.16, amdgpu.dc=0
Comment 3 Christian König 2018-06-07 07:54:04 UTC
Standard question: Can you bisect?

The logs don't show anything suspicious, so without a bisect it is probably really hard to guess what this could be.
Comment 4 Aleksandr Mezin 2018-06-08 04:13:27 UTC
Commit d6895ad39f3b396be199f5b6fdfb8cde4be7bbf7 seems to be the cause. Resume works on 4.16 if I revert that single commit (tested on 4.16.0, 4.16.13, with both amdgpu.dc=0 and amdgpu.dc=1).
Comment 5 Christian König 2018-06-08 07:53:38 UTC
Ok, well that is interesting.

Please provide the output of "sudo cat /proc/iomem" and "lspci -t -v -nn".

In the meantime I will try to reproduce the issue here.
Comment 6 Aleksandr Mezin 2018-06-08 11:13:54 UTC
Created attachment 276391 [details]
lspci -t -v -nn
Comment 7 Aleksandr Mezin 2018-06-08 11:14:44 UTC
Created attachment 276393 [details]
/proc/iomem
Comment 8 Christian König 2018-06-08 12:02:43 UTC
Mhm, I've tried the same ASIC (Polaris 10, 8 GB) in an AMD Threadripper system, and suspend/resume works fine there.

So the only explanation I have is that this is some strange issue with PCI BAR resizing and Intel hardware.

Is the system completely unresponsive after resume, or can you at least ping it over the network?
Comment 9 Aleksandr Mezin 2018-06-09 00:36:46 UTC
It seems that only the GPU is hung; I can even SSH into the machine.
But things like restarting gdm/Xorg or unplugging the monitor didn't "fix" it, and "shutdown -h now" didn't work.
Comment 10 Aleksandr Mezin 2018-06-09 03:31:53 UTC
Actually, sometimes the mouse pointer moves, and it only freezes after I press a few keys or click a few times.
Also, sometimes there is just a colored pattern in the background instead of the lock screen.
With Gnome on Wayland it takes a bit more time to break: after resume I see the desktop, but after a few clicks or key presses I see artifacts and then eventually everything freezes.

And just in case:
- The problem also occurs with only one monitor connected.
- On Windows on the same machine suspend and resume works without any problems.
Comment 11 Aleksandr Mezin 2018-06-09 05:18:05 UTC
I literally have no idea what I'm doing, but adding 'amdgpu_device_resize_fb_bar(adev);' line to all 'gmc_v?_?_resume()' (because I don't know which version is used for my card) "fixed" it somehow. Resume works, but there are some artifacts on screen during resume (they flash only once and then disappear). Before 'amdgpu_device_resize_fb_bar' was introduced, there were no artifacts at all.
Comment 12 Aleksandr Mezin 2018-06-09 05:47:58 UTC
Created attachment 276415 [details]
dmesg: resume with device_resize_fb_bar() in gmc_v?_?_resume()
Comment 13 Christian König 2018-06-09 09:37:07 UTC
(In reply to Alexander Mezin from comment #11)
> I literally have no idea what I'm doing, but adding
> 'amdgpu_device_resize_fb_bar(adev);' line to all 'gmc_v?_?_resume()'
> (because I don't know which version is used for my card) "fixed" it somehow.
> Resume works, but there are some artifacts on screen during resume (they
> flash only once and then disappear). Before 'amdgpu_device_resize_fb_bar'
> was introduced, there were no artifacts at all.

Hehe, yeah that was a really nice test and confirms my suspicion on what's going wrong here.

Because you tried to resize the BAR once more after resume, the resources in the address space are freed up and allocated again:
[  212.484672] amdgpu 0000:65:00.0: BAR 2: releasing [mem 0xe200000000-0xe2001fffff 64bit pref]
[  212.484673] amdgpu 0000:65:00.0: BAR 0: releasing [mem 0xe000000000-0xe1ffffffff 64bit pref]
[  212.484683] pcieport 0000:64:00.0: BAR 15: releasing [mem 0xe000000000-0xe2ffffffff 64bit pref]

[  212.484691] pcieport 0000:64:00.0: BAR 15: assigned [mem 0xe000000000-0xe2ffffffff 64bit pref]
[  212.484692] amdgpu 0000:65:00.0: BAR 0: assigned [mem 0xe000000000-0xe1ffffffff 64bit pref]
[  212.484697] amdgpu 0000:65:00.0: BAR 2: assigned [mem 0xe200000000-0xe2001fffff 64bit pref]

Since it allocates the exact same address we freed up before, the real issue is not the address itself, but the fact that the hardware config isn't saved and restored across suspend/resume.

That strongly looks like a bug in the BIOS and/or the Linux PCI subsystem driver for Intel hardware to me.

I will try to narrow this down with a few patches on Monday, but don't expect any quick fix.
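
The sizes behind those ranges can be checked with a few lines of Python. This is only a minimal sketch and not something posted in the thread: the regular expression and the helper name `parse_bar_line` are illustrative assumptions, and the input lines are copied from the log above.

```python
import re

# Matches the resource part of a kernel "BAR n: <action> [mem 0x...-0x...]" message.
BAR_RE = re.compile(
    r"(?P<dev>\S+): BAR (?P<bar>\d+): (?P<action>\w+) "
    r"\[mem 0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
)


def parse_bar_line(line):
    """Return (device, bar, action, size_in_bytes), or None if the line does not match."""
    m = BAR_RE.search(line)
    if m is None:
        return None
    start = int(m.group("start"), 16)
    end = int(m.group("end"), 16)
    # dmesg prints inclusive ranges, hence the +1.
    return m.group("dev"), int(m.group("bar")), m.group("action"), end - start + 1


LOG = [
    "[  212.484691] pcieport 0000:64:00.0: BAR 15: assigned [mem 0xe000000000-0xe2ffffffff 64bit pref]",
    "[  212.484692] amdgpu 0000:65:00.0: BAR 0: assigned [mem 0xe000000000-0xe1ffffffff 64bit pref]",
    "[  212.484697] amdgpu 0000:65:00.0: BAR 2: assigned [mem 0xe200000000-0xe2001fffff 64bit pref]",
]

for line in LOG:
    dev, bar, action, size = parse_bar_line(line)
    print(f"{dev} BAR {bar} {action}: {size // 2**20} MiB")
```

BAR 0 spans 8 GiB, i.e. the whole VRAM of the card, BAR 2 spans 2 MiB, and the bridge window that contains both spans 12 GiB.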
Comment 14 Christian König 2018-06-11 14:22:16 UTC
Created attachment 276471 [details]
Testing patch

Please test if this patch helps as well.

It limits the work done during resume to reprogramming BAR 0 & 2 and not the bridge.
Comment 15 Aleksandr Mezin 2018-06-12 00:40:30 UTC
No, it doesn't change anything; the system still freezes on resume.
Comment 16 Christian König 2018-06-12 08:12:33 UTC
So the problem seems to be the bridge then.

Please provide me with the output of the following commands, once before you suspended, once after you resumed without any change and once after you resumed with your hack to resize the BAR once more:

sudo setpci -s 64:00.0 COMMAND PREF_MEMORY_BASE PREF_MEMORY_LIMIT PREF_BASE_UPPER32 PREF_LIMIT_UPPER32
sudo lspci -s 64:00.0 -vvvv
Comment 17 Aleksandr Mezin 2018-06-13 02:38:34 UTC
setpci - exactly the same output in all 3 cases (verified with 'diff' to be sure):
0407
0001
fff1
000000e0
000000e2
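
For reference, these register values can be decoded into the bridge's prefetchable window. The following Python sketch assumes the standard PCI-to-PCI bridge layout (bits 15:4 of the base/limit registers hold address bits 31:20, and a low nibble of 1 marks a 64-bit window); the function name `decode_pref_window` is made up for illustration.

```python
def decode_pref_window(base, limit, base_upper32, limit_upper32):
    """Decode a PCI bridge prefetchable memory window into (start, end), inclusive.

    Assumes the standard type 1 header layout: bits 15:4 of base/limit are
    address bits 31:20, and the low nibble is 0x1 for a 64-bit window.
    """
    start = (base & 0xFFF0) << 16
    # The limit register names the last 1 MiB block, so fill in the low 20 bits.
    end = ((limit & 0xFFF0) << 16) | 0xFFFFF
    if (base & 0xF) == 0x1:
        start |= base_upper32 << 32
        end |= limit_upper32 << 32
    return start, end


# Values from the setpci output above (COMMAND = 0407 is not part of the window).
start, end = decode_pref_window(0x0001, 0xFFF1, 0x000000E0, 0x000000E2)
print(f"prefetchable window: {start:#x}-{end:#x}")
```

With the values above this gives 0xe000000000-0xe2ffffffff, the same range the kernel reports for BAR 15 of the bridge, so the bridge window itself is programmed identically in all three cases.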
Comment 18 Aleksandr Mezin 2018-06-13 02:39:03 UTC
Created attachment 276517 [details]
lspci before suspend
Comment 19 Aleksandr Mezin 2018-06-13 02:39:25 UTC
Created attachment 276519 [details]
lspci after resume, no hack
Comment 20 Aleksandr Mezin 2018-06-13 02:39:59 UTC
Created attachment 276521 [details]
lspci after resume with hack
Comment 21 Aleksandr Mezin 2018-06-13 03:42:03 UTC
Not sure if it'll help, but I've added more logging here:

--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -436,6 +436,8 @@ int pci_resize_resource(struct pci_dev *dev, int resno, int size)
        if (ret)
                return ret;
 
+       pci_info(dev, "BAR %d: resized from %d to %d", resno, old, size);
+
        res->end = res->start + pci_rebar_size_to_bytes(size) - 1;
 
        /* Check if the new config works by trying to assign everything. */

And suspend-resume with "re-resize" hack shows this:

amdgpu 0000:65:00.0: BAR 0: resized from 8 to 13

(this message appears in dmesg two times, first one on boot, second one during resume, exactly the same message in both cases)
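
The numbers 8 and 13 are resizable BAR size encodings, not byte counts. A small Python sketch of the conversion, assuming the usual encoding where a value n means 2^(n+20) bytes (which is what `pci_rebar_size_to_bytes()` in the patch context is understood to compute); the Python function name is illustrative only.

```python
def rebar_size_to_bytes(size):
    """Convert a resizable BAR size encoding to bytes (0 means 1 MiB, 1 means 2 MiB, ...)."""
    return 1 << (size + 20)


for encoding in (8, 13):
    print(f"size {encoding}: {rebar_size_to_bytes(encoding) // 2**20} MiB")
```

So "resized from 8 to 13" means BAR 0 grows from 256 MiB to 8 GiB.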
Comment 22 Christian König 2018-06-14 08:38:26 UTC
Your debugging efforts are better than mine.

Please provide the output of "sudo setpci -s 65:00.0 ECAP15.l ECAP15+4.l ECAP15+8.l" once before suspend and once after suspend without any changes (e.g. when the problem happens).
Comment 23 Aleksandr Mezin 2018-06-14 09:55:00 UTC
(In reply to Christian König from comment #22)
> Your debugging efforts are better than mine.
> 
> Please provide the output of "sudo setpci -s 65:00.0 ECAP15.l ECAP15+4.l
> ECAP15+8.l" once before suspend and once after suspend without any changes
> (e.g. when the problem happens).

before suspend:
27010015
0003f000
00000d20

after resume:
27010015
0003f000
00000820
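
The three dwords can be decoded as follows. This is a hedged Python sketch based on the Resizable BAR extended capability layout as commonly documented (extended capability header, capability register, control register); the field positions are assumptions taken from that layout, and the function names are made up for illustration.

```python
def decode_ext_cap_header(header):
    """Split a PCIe extended capability header into its fields."""
    return {
        "cap_id": header & 0xFFFF,        # 0x0015 is the Resizable BAR capability
        "version": (header >> 16) & 0xF,
        "next": (header >> 20) & 0xFFF,   # offset of the next extended capability
    }


def decode_rebar_cap(cap):
    """Return the supported size encodings; bit 4 stands for 1 MiB, bit 5 for 2 MiB, ..."""
    return [bit - 4 for bit in range(4, 32) if cap & (1 << bit)]


def decode_rebar_ctrl(ctrl):
    """Split the Resizable BAR control register into its fields."""
    return {
        "bar_index": ctrl & 0x7,
        "num_bars": (ctrl >> 5) & 0x7,
        "bar_size": (ctrl >> 8) & 0x3F,   # size encoding n means 2^(n+20) bytes
    }


before = decode_rebar_ctrl(0x00000D20)
after = decode_rebar_ctrl(0x00000820)
print(decode_ext_cap_header(0x27010015))
print("supported sizes:", decode_rebar_cap(0x0003F000))
print("BAR size before suspend:", before["bar_size"], "after resume:", after["bar_size"])
```

Under those assumptions the only difference is the BAR size field of the control register: 13 (8 GiB) before suspend and 8 (256 MiB) after resume, while the kernel still has BAR 0 assigned as an 8 GiB range.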
Comment 24 Christian König 2018-06-14 10:02:02 UTC
Created attachment 276547 [details]
Possible fix

In this case please try the attached patch and see if it helps.
Comment 25 Aleksandr Mezin 2018-06-14 10:17:57 UTC
Yes, it works

dmesg:
[   34.330683] amdgpu 0000:65:00.0: Test 0 from 8 to 13
Comment 26 Joern Hoffmann 2018-06-19 14:13:33 UTC
It works for me too.

dmesg | grep amdgpu:

[    3.437098] [drm] amdgpu kernel modesetting enabled.
[    3.442103] fb: switching to amdgpudrmfb from EFI VGA
[    3.442234] amdgpu 0000:01:00.0: enabling device (0006 -> 0007)
[    3.443795] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0xd0000000-0xd01fffff 64bit pref]
[    3.443797] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0xc0000000-0xcfffffff 64bit pref]
[    3.443822] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x2200000000-0x23ffffffff 64bit pref]
[    3.443827] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x2100000000-0x21001fffff 64bit pref]
[    3.443849] amdgpu 0000:01:00.0: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[    3.443850] amdgpu 0000:01:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[    3.443917] [drm] amdgpu: 8192M of VRAM memory ready
[    3.443918] [drm] amdgpu: 8192M of GTT memory ready.
[    4.239650] fbcon: amdgpudrmfb (fb0) is primary device
[    4.323338] amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device
[    4.340440] [drm] Initialized amdgpu 3.25.0 20150101 for 0000:01:00.0 on minor 0
[   10.704309] amdgpu 0000:01:00.0: 00000000a78be373 unpin not necessary
[   10.704310] amdgpu 0000:01:00.0: 00000000a78be373 unpin not necessary
[   10.704310] amdgpu 0000:01:00.0: 000000006047af5e unpin not necessary
[   10.704311] amdgpu 0000:01:00.0: 000000002d9a27ec unpin not necessary
[   11.443673] amdgpu 0000:01:00.0: Test 0 from 8 to 13
Comment 27 Aleksandr Mezin 2018-06-30 01:46:07 UTC
So the patch will only land in 4.19.
Are you going to fix the regression (in amdgpu) for 4.15-4.18 somehow?
Comment 28 Aleksandr Mezin 2018-08-26 00:34:40 UTC
Seems to be fixed in 4.18.5 by a backport.
