Bug 201015

Summary: [amdgpu] BUG: unable to handle kernel NULL pointer dereference on resume with 2 monitors (vega)
Product: Drivers
Reporter: Aleksandr Mezin (mezin.alexander)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED OBSOLETE
Severity: normal
CC: harry.wentland, nicholas.kazlauskas
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.19-rc2
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg, failed resume with 2 monitors
dmesg, multiple suspend-resume cycles with 1 monitor, then attached 2nd monitor, resume failed
0001-drm-amd-display-Add-null-checks-to-surface-update-in.patch
Kernel log, 2 suspend-resume attempts, second one failed
kernel log: patched kernel + xorg modesetting, resume failed
user log: patched kernel + xorg modesetting, resume failed

Description Aleksandr Mezin 2018-09-05 06:06:44 UTC
Created attachment 278309 [details]
dmesg, failed resume with 2 monitors

The NULL pointer dereference happens on resume from suspend when 2 monitors are connected (both over DisplayPort).
With 1 monitor, suspend/resume works reliably.

Vega 64 (Sapphire Nitro+)
Dell P2415Q and LG 27UD69P, connected over DisplayPort
Comment 1 Aleksandr Mezin 2018-09-05 06:08:33 UTC
Created attachment 278311 [details]
dmesg, multiple suspend-resume cycles with 1 monitor, then attached 2nd monitor, resume failed
Comment 2 Michel Dänzer 2018-09-05 07:40:07 UTC
Is this a regression in 4.19-rc compared to 4.18?
Comment 3 Aleksandr Mezin 2018-09-05 09:37:38 UTC
(In reply to Michel Dänzer from comment #2)
> Is this a regression in 4.19-rc compared to 4.18?

No, it happens on 4.18.5 too.

On 4.18, however, suspend-resume also triggers https://bugzilla.kernel.org/show_bug.cgi?id=200531 (one of the monitors turns off quickly -> resume with one monitor in standby mode -> REG_WAIT timeout). The null pointer dereference occurs there as well. With only one monitor, a quick suspend-resume on 4.18.5 works fine (exactly as it does on 4.19-rc2).
Comment 4 Nicholas Kazlauskas 2018-09-10 17:55:32 UTC
Created attachment 278425 [details]
0001-drm-amd-display-Add-null-checks-to-surface-update-in.patch

I'm unable to reproduce the issue under Ubuntu 18.04, GNOME, 4.19 rc2 and a Vega.

What's your userspace setup like?

You can try the attached patch and see if it helps. Try booting with drm.debug=6 on your kernel command line and post the log from a suspend/resume cycle with the patch applied.
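
For reference, drm.debug is a bitmask of DRM debug categories, so the value 6 selects the DRIVER (0x02) and KMS (0x04) categories. Below is a minimal Python sketch of decoding such a value. It assumes the category bits from the kernel's drm_print.h around 4.19 and lists only the lowest six; decode_drm_debug is a hypothetical helper for illustration, not part of any existing tool.

```python
from typing import List

# Lowest six DRM debug category bits (assumed from drm_print.h, ~4.19).
# Newer kernels define further bits that are not listed here.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask: int) -> List[str]:
    """Return the names of the debug categories enabled in `mask`."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


if __name__ == "__main__":
    # drm.debug=6 -> 0x02 | 0x04
    print(decode_drm_debug(6))
```

On a running system the current mask can usually be read from /sys/module/drm/parameters/debug.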
Comment 5 Aleksandr Mezin 2018-09-10 18:34:33 UTC
Created attachment 278427 [details]
Kernel log, 2 suspend-resume attempts, second one failed

(In reply to Nicholas Kazlauskas from comment #4)
> Created attachment 278425 [details]
> 0001-drm-amd-display-Add-null-checks-to-surface-update-in.patch
> 
> I'm unable to reproduce the issue under Ubuntu 18.04, GNOME, 4.19 rc2 and a
> Vega.
> 
> What's your userspace setup like?

Arch Linux, 4.19-rc3, GNOME 3.28.3

> 
> You can try the attached patch and see if that helps the problem. Try
> booting with drm.debug=6 in your bootline and post the results of the
> suspend with the patch.

This time the first suspend-resume worked fine; the second one failed.
Comment 6 Nicholas Kazlauskas 2018-09-10 18:46:34 UTC
I'd imagine you're probably running GNOME on Wayland with that setup.

The patch seems to fix the null pointer dereference, but you're probably getting a black screen from those failed atomic commits.

Might not be a problem with the driver but with the GNOME Wayland implementation - I would need to do more investigation to see which atomic commits are failing and if the failures are valid (but unchecked).

You would probably not see this occur for GNOME over Xorg.
Comment 7 Aleksandr Mezin 2018-09-10 20:34:35 UTC
(In reply to Nicholas Kazlauskas from comment #6)
> I'd imagine you're probably running GNOME on Wayland from that setup
> environment.
> 
> The patch seems to fix the null pointer dereference but you're probably
> getting a black screen from those failed atomic commits.
> 
> Might not be a problem with the driver but with the GNOME Wayland
> implementation - I would need to do more investigation to see which atomic
> commits are failing and if the failures are valid (but unchecked).
> 
> You would probably not see this occur for GNOME over Xorg.

No, it occurs with GNOME on Xorg, with the modesetting driver. GNOME on Wayland seems to handle suspend and resume fine (even on unpatched 4.19-rc3). I also tried xf86-video-amdgpu. It works like GNOME on Wayland, but sometimes after resume one display is limited to 800x600 resolution (it's a 4K display). That is probably a separate issue.

I expected the modesetting driver to work, though.
Comment 8 Aleksandr Mezin 2018-09-11 19:16:12 UTC
Created attachment 278457 [details]
kernel log: patched kernel + xorg modesetting, resume failed

Even with the patched kernel, when resume fails there are errors in the kernel log (when using the modesetting driver):

[   98.136982] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:43:crtc-0] flip_done timed out
[  103.668322] [drm:dc_remove_plane_from_context [amdgpu]] *ERROR* Existing plane_state not found; failed to detach it!
[  103.702464] [drm:dc_remove_plane_from_context [amdgpu]] *ERROR* Existing plane_state not found; failed to detach it!
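
Lines like the ones above can be pulled out of a long kernel log mechanically. The following is a rough Python sketch that extracts DRM `*ERROR*` entries, assuming the line layout shown in this comment (timestamp, `[drm:function [module]]`, `*ERROR*`, message). The names DrmError, parse_drm_error and drm_errors are hypothetical helpers, not part of any existing tool.

```python
import re
import sys
from typing import Iterable, List, NamedTuple, Optional


class DrmError(NamedTuple):
    timestamp: float  # seconds since boot, as printed by the kernel
    function: str     # kernel function that logged the error
    module: str       # kernel module the function belongs to
    message: str      # remaining text after the *ERROR* marker


# Matches e.g. "[   98.136982] [drm:func [module]] *ERROR* message"
_DRM_ERROR_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:(?P<func>\w+)\s+\[(?P<module>\w+)\]\]\s+"
    r"\*ERROR\*\s+(?P<msg>.*)$"
)


def parse_drm_error(line: str) -> Optional[DrmError]:
    """Parse one kernel log line; return None if it is not a DRM error."""
    m = _DRM_ERROR_RE.match(line.strip())
    if m is None:
        return None
    return DrmError(float(m["ts"]), m["func"], m["module"], m["msg"])


def drm_errors(lines: Iterable[str]) -> List[DrmError]:
    """Collect every DRM *ERROR* entry from an iterable of log lines."""
    parsed = (parse_drm_error(line) for line in lines)
    return [err for err in parsed if err is not None]


if __name__ == "__main__":
    # Example: dmesg | python3 this_script.py
    for err in drm_errors(sys.stdin):
        print(f"{err.timestamp:10.3f} {err.module}:{err.function}: {err.message}")
```

Grouping the result by function makes it easier to see that the flip_done timeout comes first and the dc_remove_plane_from_context errors follow a few seconds later.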
Comment 9 Aleksandr Mezin 2018-09-11 19:19:03 UTC
Created attachment 278459 [details]
user log: patched kernel + xorg modesetting, resume failed

Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (WW) modeset(0): flip queue failed: Invalid argument
Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (WW) modeset(0): Page flip failed: Invalid argument
Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (EE) modeset(0): present flip failed
Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (WW) modeset(0): flip queue failed: Invalid argument
Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (WW) modeset(0): Page flip failed: Invalid argument
Comment 10 Aleksandr Mezin 2018-10-03 16:17:00 UTC
After recent updates, the issue went away, but I'm not sure what exactly changed. I tried reverting the kernel (to 4.19-rc3) and libdrm, but I can no longer trigger it.