Bug 196777

Summary: Virtual guest using video device QXL does not reach GDM
Product: Drivers
Component: Video(DRI - non Intel)
Reporter: Joachim Frieben (jfrieben)
Assignee: drivers_video-dri
Status: NEW
CC: jmforbes, kraxel, krissn, mike, tiwai
Severity: normal
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.12.5
Subsystem:
Regression: Yes
Bisected commit-id:

Description Joachim Frieben 2017-08-26 08:28:14 UTC
Description of problem:
When booting current Fedora 26 as a virtual guest in gnome-boxes, the system does not reach GDM. It gets stuck during the graphical boot procedure. After removing the boot option "rhgb", the system freezes when it attempts to launch the graphical login manager GDM.

Version-Release number of selected component (if applicable):
kernel-4.12.5-300.fc26

How reproducible:
Always

Steps to Reproduce:
1. Boot current Fedora 26 virtual guest in gnome-boxes.

Actual results:
System gets stuck during the graphical boot procedure.

Expected results:
System launches GDM successfully.

Additional info:
- After adding boot option "nomodeset", the system launches GDM on Xorg successfully.
- After removing "rhgb" from the kernel options and running 'startx' at run level 3, the GNOME on Xorg session starts up as expected.
- After changing the video device from "qxl" to "virtio", the system launches GDM on Wayland successfully.
- This issue was introduced in the 4.12.x kernel series and now also affects the 4.13.x kernel series.
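
The boot-option experiments listed above (dropping "rhgb", adding "nomodeset") amount to editing the kernel command line. A minimal sketch of that edit follows, assuming the command line is available as a plain space-separated string; the helper name `adjust_cmdline` and the sample command line are hypothetical and not taken from the affected guest.

```python
def adjust_cmdline(cmdline: str, remove=(), add=()) -> str:
    """Return a kernel command line with some options removed and others appended."""
    # Drop every option that matches an entry in `remove` exactly.
    kept = [opt for opt in cmdline.split() if opt not in remove]
    # Append each requested option unless it is already present.
    for opt in add:
        if opt not in kept:
            kept.append(opt)
    return " ".join(kept)


# Hypothetical example command line, for illustration only.
original = "ro root=/dev/mapper/fedora-root rhgb quiet"

# Variant 1: skip the graphical boot splash (plymouth).
print(adjust_cmdline(original, remove=("rhgb",)))

# Variant 2: disable kernel modesetting.
print(adjust_cmdline(original, add=("nomodeset",)))
```

The resulting string would still have to be applied through the boot loader, for example by editing the entry at the boot menu for a one-off test.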
Comment 1 Krzysztof Nowicki 2017-09-10 14:41:21 UTC
I have bisected this to a series of commits introducing atomic modesetting to the QXL driver (more specifically commit 3538e80a869be74764ae7db484b371894f04d0f8).
Comment 2 Gerd Hoffmann 2017-09-11 08:43:30 UTC
Can you check whether this patch fixes it?

https://www.kraxel.org/cgit/linux/commit/?h=drm-qxl-atomic&id=b16a0bb7a9d54d9dd256059b35adf6f96fddc22e
Comment 3 Krzysztof Nowicki 2017-09-12 18:44:44 UTC
(In reply to Gerd Hoffmann from comment #2)
> Can you check whether this patch fixes it?
> 
> https://www.kraxel.org/cgit/linux/commit/?h=drm-qxl-atomic&id=b16a0bb7a9d54d9dd256059b35adf6f96fddc22e

I have applied this patch against a clean 4.12.0 and unfortunately the problem is still easily reproducible.
Comment 4 Gerd Hoffmann 2017-09-13 10:31:41 UTC
Retested 4.13 + comment #2 patch.

plymouth (aka graphical boot) hangs the machine indeed.

when disabling rhgb gdm comes up just fine though, in both wayland and xorg mode.
so apparently we have two issues here, and the patch fixes only one of them.

The plymouth hang appears to be pretty serious, the whole machine appears to be toast.  I can't login over network to see what is going on, so it's not only the display which is f*cked up.  Nothing written to the logs either.  When enabling the serial console to see the logs plymouth skips the splash screen though, so the issue doesn't trigger any more.  Hmm, I'm running out of ideas ...
Comment 5 Takashi Iwai 2017-09-13 12:51:18 UTC
Isn't the plymouth hang a dup of bug 102338?
If so, it's a BUG_ON() in ttm_bo_kmap() for a non-empty bo->swap list.

BTW, with the patch in comment 2 applied, qemu itself crashes on my machine, not the VM :-<

id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7fec96800000, virt end 7fec9a7fe000, generation 0, delta 7fec96800000
id 2, group 1, virt start 7fec92400000, virt end 7fec96400000, generation 0, delta 7fec92400000
((null):6072): Spice-Warning **: red_memslots.c:69:validate_virt: virtual address out of range
    virt=0x7fec96b07018+0xff000000 slot_id=1 group_id=1
    slot=0x7fec96800000-0x7fec9a7fe000 delta=0x7fec96800000
((null):6072): Spice-ERROR **: red_parse_qxl.c:334:red_get_clip_rects: assertion `num_rects * sizeof(QXLRect) == size' failed
Thread 24 (Thread 0x7fec8189c700 (LWP 6097)):
....
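
The "virtual address out of range" warning in the log above can be checked by hand from the numbers it prints. The sketch below is only an illustration of that arithmetic, assuming the check is a simple "start and end must both lie inside the slot" test; the function name `in_slot` is hypothetical and this is not the actual spice-server code.

```python
def in_slot(virt: int, size: int, slot_start: int, slot_end: int) -> bool:
    """Return True if the range [virt, virt + size) lies inside the memslot."""
    return slot_start <= virt and virt + size <= slot_end


# Values copied from the Spice-Warning lines in the log.
virt = 0x7FEC96B07018
size = 0xFF000000
slot_start = 0x7FEC96800000
slot_end = 0x7FEC9A7FE000

# The start address itself is inside slot 1 ...
print(in_slot(virt, 0, slot_start, slot_end))

# ... but the end of the requested range lies far beyond the slot end.
print(hex(virt + size), hex(slot_end))
print(in_slot(virt, size, slot_start, slot_end))
```

With these values the start address falls inside the slot while the requested size pushes the end of the range past it, which is consistent with the warning being followed by the failed size assertion in red_get_clip_rects.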
Comment 6 Takashi Iwai 2017-09-13 13:23:11 UTC
(In reply to Takashi Iwai from comment #5)
> Isn't the plymouth hang a dup of bug 102338?

Erm, I meant the fdo bugzilla 102338,
  https://bugs.freedesktop.org/show_bug.cgi?id=102338
Comment 7 Justin M. Forbes 2017-09-14 17:23:13 UTC
(In reply to Gerd Hoffmann from comment #4)
> Retested 4.13 + comment #2 patch.
> 
> plymouth (aka graphical boot) hangs the machine indeed.
> 
> when disabling rhgb gdm comes up just fine though, in both wayland and xorg
> mode.
> so apparently we have two issues here, and the patch fixes only one of them.
> 
> The plymouth hang appears to be pretty serious, the whole machine appears to
> be toast.  I can't login over network to see what is going on, so it's not
> only the display which is f*cked up.  Nothing written to the logs either. 
> When enabling the serial console to see the logs plymouth skips the splash
> screen though, so the issue doesn't trigger any more.  Hmm, I'm running out
> of ideas ...

I dropped the patch into the 4.12.12 update for Fedora kernels. The complaints I am seeing are about screen strobing with this patch, so it has been backed out for now.  The plymouth issue is also a big one.
Comment 8 Gerd Hoffmann 2017-09-15 10:50:36 UTC
https://www.kraxel.org/cgit/linux/log/?h=qxl-4.13
please test
Comment 9 Krzysztof Nowicki 2017-09-17 08:50:40 UTC
(In reply to Gerd Hoffmann from comment #8)
> https://www.kraxel.org/cgit/linux/log/?h=qxl-4.13
> please test

Applied over vanilla 4.12 on top of the patch from comment #2.

SDDM started fine, I was able to login to the Plasma session and use it for some time. Test repeated twice with the same results. No errors found in dmesg and system is stable.

As for me - TEST PASSED

Thanks :)
Comment 10 Justin M. Forbes 2017-09-19 14:37:38 UTC
After going through these with a number of users:

qxl: fix primary surface handling - This patch is widely reported to cause serious screen flickering that is not there without it, making the system unusable.

qxl: fix pinning: This patch resolves the GDM login issues with plymouth.
Comment 11 Gerd Hoffmann 2017-09-19 15:23:01 UTC
(In reply to Justin M. Forbes from comment #10)
> After going through these with a number of users:
> 
> qxl: fix primary surface handling - This patch is widely reported to cause
> serious screen flickering that is not there without it, making the system
> unusable.

Workaround #1: turn off wayland.
Workaround #2: use virtio-vga instead. wayland doesn't use qxl 2d accel anyway.
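
For workaround #1, one common way to turn off Wayland for the GDM login screen is the `WaylandEnable=false` key in the `[daemon]` section of GDM's custom.conf (typically /etc/gdm/custom.conf on Fedora). The comment above does not say which mechanism was meant, so treat this as an assumption. A minimal sketch that applies the setting to the file contents, with the hypothetical helper name `disable_wayland`:

```python
import configparser
import io


def disable_wayland(conf_text: str) -> str:
    """Return GDM custom.conf contents with WaylandEnable=false in [daemon]."""
    parser = configparser.ConfigParser()
    # Keep the original key case; configparser lowercases keys by default.
    parser.optionxform = str
    parser.read_string(conf_text)
    if not parser.has_section("daemon"):
        parser.add_section("daemon")
    parser.set("daemon", "WaylandEnable", "false")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Illustrative input only; a real custom.conf usually has more sections.
print(disable_wayland("[daemon]\n\n[security]\n"))
```

Note that configparser drops comments when writing, so for a real system it may be simpler to edit the file by hand.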

Fundamental problem here is that the qxl virtual hardware simply doesn't support pageflip, we have to destroy + re-create the primary surface instead.  This is where the flicker comes from.

Commit "058e9f5c82 drm/qxl: simple crtc page flipping emulated using buffer copy" handles the issue with a pretty gross hack, blitting one framebuffer over the other instead of a proper primary surface update.  With atomic modesetting that doesn't work any more.

We could possibly decouple the primary surface from the drm framebuffers, so the drm framebuffers effectively become shadow framebuffers, and every display update becomes a drm framebuffer -> primary surface blit.  Not sure whether that scheme can work properly with xorg though.  Also has a high chance to cause xorg performance regressions.
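
The shadow framebuffer idea can be pictured with a toy model. The sketch below is purely conceptual and assumes nothing about the real qxl driver: the class name `ShadowDisplay` and its fields are hypothetical, and byte buffers stand in for surfaces. The point it illustrates is that a "page flip" copies into a primary surface that is never destroyed or re-created.

```python
class ShadowDisplay:
    """Toy model: framebuffers are shadows, the primary surface is fixed."""

    def __init__(self, width: int, height: int):
        # The primary surface is allocated once and then only written to.
        self.primary = bytearray(width * height)
        self.blits = 0

    def flip(self, framebuffer: bytes) -> None:
        """Emulate a page flip by blitting a shadow framebuffer."""
        if len(framebuffer) != len(self.primary):
            raise ValueError("framebuffer size does not match primary surface")
        # In-place copy: the primary surface object itself is kept.
        self.primary[:] = framebuffer
        self.blits += 1


display = ShadowDisplay(4, 2)
front = bytes([1] * 8)   # first shadow framebuffer
back = bytes([2] * 8)    # second shadow framebuffer
display.flip(front)
display.flip(back)
print(bytes(display.primary) == back, display.blits)
```

The cost of this scheme is one full copy per display update, which is where the performance concern for xorg comes from.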

> qxl: fix pinning: This patch resolves the GDM login issues with plymouth.

Good.
Comment 12 Justin M. Forbes 2017-10-02 16:24:38 UTC
(In reply to Gerd Hoffmann from comment #11)
> (In reply to Justin M. Forbes from comment #10)
> > After going through these with a number of users:
> > 
> > qxl: fix primary surface handling - This patch is widely reported to cause
> > serious screen flickering that is not there without it, making the system
> > unusable.
> 
> Workaround #1: turn off wayland.

Possible as a short term fix, but with wayland being pretty much "the way forward" it doesn't seem to be a workable long term solution.

> Workaround #2: use virtio-vga instead. wayland doesn't use qxl 2d accel
> anyway.
> 
> Fundamental problem here is that the qxl virtual hardware simply doesn't
> support pageflip, we have to destroy + re-create the primary surface
> instead.  This is where the flicker comes from.
> 
> Commit "058e9f5c82 drm/qxl: simple crtc page flipping emulated using buffer
> copy" handles the issue with a pretty gross hack, blitting one framebuffer
> over the other instead of a proper primary surface update.  With atomic
> modesetting that doesn't work any more.
> 
> We could possibly decouple the primary surface from the drm framebuffers, so
> the drm framebuffers effectively become shadow framebuffers, and every
> display update becomes a drm framebuffer -> primary surface blit.  Not sure
> whether that scheme can work properly with xorg though.  Also has a high
> chance to cause xorg performance regressions.
>

So this brings up an interesting problem in how things are to move forward. It came up as a blocker in Fedora 27 today. Let's say we find a way to force boxes to revert to virtio-vga.  That wouldn't change any existing VMs, and it is something we have no control over when the host is not Fedora as well.  It also would be a problem for non wayland guests.
Comment 13 Gerd Hoffmann 2017-10-04 10:37:10 UTC
> > Workaround #1: turn off wayland.
> 
> Possible as a short term fix, but with wayland being pretty much "the way
> forward" it doesn't seem to be a workable long term solution.

Yes.

> So this brings up an interesting problem in how things are to move forward.

Kicked off a discussion on the spice-devel list.
https://lists.freedesktop.org/archives/spice-devel/2017-October/040310.html

> It came up as a blocker in Fedora 27 today. Let's say we find a way to force
> boxes to revert to virtio-vga.  That wouldn't change any existing VMs, and
> it is something we have no control over when the host is not Fedora as well.

That would probably best be done via libosinfo (because for guests without virtio-vga guest drivers we had better not do the switch), which should be picked up by other distros and projects too.

> It also would be a problem for non wayland guests.

Why?  The xorg modesetting driver works just fine with virtio-vga.
Comment 14 Gerd Hoffmann 2017-10-19 07:07:58 UTC
> We could possibly decouple the primary surface from the drm framebuffers, so
> the drm framebuffers effectively become shadow framebuffers, and every
> display update becomes a drm framebuffer -> primary surface blit.  Not sure
> whether that scheme can work properly with xorg though.  Also has a high
> chance to cause xorg performance regressions.

Turns out there is an easy way out: shadow dumb framebuffers only.

https://www.kraxel.org/cgit/linux/log/?h=drm-qxl-atomic