Bug 201015
Summary: | [amdgpu] BUG: unable to handle kernel NULL pointer dereference on resume with 2 monitors (vega) | ||
---|---|---|---|
Product: | Drivers | Reporter: | Aleksandr Mezin (mezin.alexander) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | harry.wentland, nicholas.kazlauskas |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.19-rc2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg, failed resume with 2 monitors
dmesg, multiple suspend-resume cycles with 1 monitor, then attached 2nd monitor, resume failed
0001-drm-amd-display-Add-null-checks-to-surface-update-in.patch
Kernel log, 2 suspend-resume attempts, second one failed
kernel log: patched kernel + xorg modesetting, resume failed
user log: patched kernel + xorg modesetting, resume failed |
Created attachment 278311 [details]
dmesg, multiple suspend-resume cycles with 1 monitor, then attached 2nd monitor, resume failed
Is this a regression in 4.19-rc compared to 4.18?

(In reply to Michel Dänzer from comment #2)
> Is this a regression in 4.19-rc compared to 4.18?
No, it happens on 4.18.5 too. But on 4.18, suspend-resume also triggers https://bugzilla.kernel.org/show_bug.cgi?id=200531 (one of the monitors turns off quickly -> resume with one monitor in standby mode -> triggers REG_WAIT timeout). The null pointer dereference is there too.
With only one monitor, a quick suspend-resume on 4.18.5 works fine (exactly as it does on 4.19-rc2).

Created attachment 278425 [details]
0001-drm-amd-display-Add-null-checks-to-surface-update-in.patch
I'm unable to reproduce the issue under Ubuntu 18.04, GNOME, 4.19 rc2 and a Vega.
What's your userspace setup like?
You can try the attached patch and see if that helps the problem. Try booting with drm.debug=6 in your bootline and post the results of the suspend with the patch.
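For anyone following along, `drm.debug` is a bitmask, so a value of 6 turns on two debug categories at once. Below is a minimal sketch that decodes the mask, assuming the bit assignments from the kernel's `include/drm/drm_print.h` around 4.19. The helper name `decode_drm_debug` is made up for illustration, and newer kernels may define more bits than listed here.

```python
# Debug category bits, assumed from include/drm/drm_print.h (4.19 era).
# Newer kernels may define additional bits that are not listed here.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by `mask`."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(6))  # ['DRIVER', 'KMS']
```

So `drm.debug=6` asks for driver and KMS messages, without the much noisier atomic and vblank categories.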
Created attachment 278427 [details]
Kernel log, 2 suspend-resume attempts, second one failed
(In reply to Nicholas Kazlauskas from comment #4)
> Created attachment 278425 [details]
> 0001-drm-amd-display-Add-null-checks-to-surface-update-in.patch
>
> I'm unable to reproduce the issue under Ubuntu 18.04, GNOME, 4.19 rc2 and a
> Vega.
>
> What's your userspace setup like?
Arch Linux, 4.19-rc3, GNOME 3.28.3
> You can try the attached patch and see if that helps the problem. Try
> booting with drm.debug=6 in your bootline and post the results of the
> suspend with the patch.
This time the first suspend-resume worked fine; the second one failed.

I'd imagine you're probably running GNOME on Wayland, judging by that setup.
The patch seems to fix the null pointer dereference, but you're probably getting a black screen from those failed atomic commits.
Might not be a problem with the driver but with the GNOME Wayland implementation - I would need to do more investigation to see which atomic commits are failing and if the failures are valid (but unchecked).
You would probably not see this occur for GNOME over Xorg.

(In reply to Nicholas Kazlauskas from comment #6)
> I'd imagine you're probably running GNOME on Wayland from that setup
> environment.
>
> The patch seems to fix the null pointer dereference but you're probably
> getting a black screen from those failed atomic commits.
>
> Might not be a problem with the driver but with the GNOME Wayland
> implementation - I would need to do more investigation to see which atomic
> commits are failing and if the failures are valid (but unchecked).
>
> You would probably not see this occur for GNOME over Xorg.
No, it occurs with GNOME on Xorg, with the modesetting driver. GNOME on Wayland seems to handle suspend and resume fine (even on unpatched 4.19-rc3).
Also, I tried xf86-video-amdgpu. It works like GNOME on Wayland, but sometimes after resume one display is limited to 800x600 resolution (it's a 4K display).
Probably a different issue. I expected the modesetting driver to work, though.

Created attachment 278457 [details]
kernel log: patched kernel + xorg modesetting, resume failed
Even with the patched kernel, when resume fails there are errors in the kernel log (when using the modesetting driver):
[ 98.136982] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:43:crtc-0] flip_done timed out
[ 103.668322] [drm:dc_remove_plane_from_context [amdgpu]] *ERROR* Existing plane_state not found; failed to detach it!
[ 103.702464] [drm:dc_remove_plane_from_context [amdgpu]] *ERROR* Existing plane_state not found; failed to detach it!
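When comparing logs from several resume attempts, it can help to pull out only the DRM `*ERROR*` lines and count them per reporting function. The following is a rough sketch, not a tool used in this report: the names `DRM_ERROR_RE`, `parse_drm_errors` and `count_by_function` are hypothetical, and the regular expression only assumes the line layout seen in the excerpt above.

```python
import re
from collections import Counter

# Matches lines like:
# [ 98.136982] [drm:func_name [module]] *ERROR* message
# The layout is assumed from the excerpt above; other kernels may differ.
DRM_ERROR_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:(?P<func>\w+)(?:\s+\[(?P<module>\w+)\])?\]\s+"
    r"\*ERROR\*\s+(?P<msg>.*)$"
)


def parse_drm_errors(lines):
    """Return (timestamp, function, module, message) for each DRM error line."""
    errors = []
    for line in lines:
        m = DRM_ERROR_RE.match(line.strip())
        if m:
            errors.append((float(m["ts"]), m["func"], m["module"], m["msg"]))
    return errors


def count_by_function(errors):
    """Count parsed errors per reporting kernel function."""
    return Counter(func for _, func, _, _ in errors)


# The three error lines quoted in this comment.
SAMPLE = [
    "[ 98.136982] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] "
    "*ERROR* [CRTC:43:crtc-0] flip_done timed out",
    "[ 103.668322] [drm:dc_remove_plane_from_context [amdgpu]] "
    "*ERROR* Existing plane_state not found; failed to detach it!",
    "[ 103.702464] [drm:dc_remove_plane_from_context [amdgpu]] "
    "*ERROR* Existing plane_state not found; failed to detach it!",
]

for func, n in count_by_function(parse_drm_errors(SAMPLE)).items():
    print(func, n)
```

Feeding it the full attached dmesg instead of `SAMPLE` should show whether the flip_done timeout always precedes the plane_state errors.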
Created attachment 278459 [details]
user log: patched kernel + xorg modesetting, resume failed
Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (WW) modeset(0): flip queue failed: Invalid argument
Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (WW) modeset(0): Page flip failed: Invalid argument
Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (EE) modeset(0): present flip failed
Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (WW) modeset(0): flip queue failed: Invalid argument
Sep 12 01:05:22 X299 /usr/lib/gdm-x-session[1593]: (WW) modeset(0): Page flip failed: Invalid argument
After recent updates, the issue went away, but I'm not sure what exactly changed. I tried reverting the kernel (to 4.19-rc3) and libdrm, but I can no longer trigger it. |
Created attachment 278309 [details]
dmesg, failed resume with 2 monitors
Happens on resume from suspend when 2 monitors are connected (over DisplayPort). With 1 monitor suspend/resume works reliably.
Vega 64 (Sapphire Nitro+)
Dell P2415Q and LG 27UD69P, connected over DisplayPort