Bug 196777
| Summary: | Virtual guest using video device QXL does not reach GDM | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Joachim Frieben (jfrieben) |
| Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
| Status: | NEW --- | | |
| Severity: | normal | CC: | jmforbes, kraxel, krissn, mike, tiwai |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 4.12.5 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Description
Joachim Frieben
2017-08-26 08:28:14 UTC
I have bisected this to a series of commits introducing atomic modesetting to the QXL driver (more specifically commit 3538e80a869be74764ae7db484b371894f04d0f8).

Can you check whether this patch fixes it?
https://www.kraxel.org/cgit/linux/commit/?h=drm-qxl-atomic&id=b16a0bb7a9d54d9dd256059b35adf6f96fddc22e

(In reply to Gerd Hoffmann from comment #2)
> Can you check whether this patch fixes it?
>
> https://www.kraxel.org/cgit/linux/commit/?h=drm-qxl-atomic&id=b16a0bb7a9d54d9dd256059b35adf6f96fddc22e

I have applied this patch against a clean 4.12.0 and unfortunately the problem is still easily reproducible.

Retested 4.13 + comment #2 patch.

plymouth (aka graphical boot) does indeed hang the machine.

With rhgb disabled, gdm comes up just fine, in both wayland and xorg mode.
So apparently we have two issues here, and the patch fixes only one of them.

The plymouth hang appears to be pretty serious; the whole machine appears to be toast. I can't log in over the network to see what is going on, so it's not only the display which is f*cked up. Nothing is written to the logs either. When the serial console is enabled to see the logs, plymouth skips the splash screen, so the issue doesn't trigger any more. Hmm, I'm running out of ideas ...

Isn't the plymouth hang a dup of bug 102338? If so, it's a BUG_ON() in ttm_bo_kmap() for a non-empty bo->swap list.
BTW, with the patch in comment 2 applied, qemu itself crashes on my machine, not the VM :-<

id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7fec96800000, virt end 7fec9a7fe000, generation 0, delta 7fec96800000
id 2, group 1, virt start 7fec92400000, virt end 7fec96400000, generation 0, delta 7fec92400000
((null):6072): Spice-Warning **: red_memslots.c:69:validate_virt: virtual address out of range
    virt=0x7fec96b07018+0xff000000 slot_id=1 group_id=1
    slot=0x7fec96800000-0x7fec9a7fe000 delta=0x7fec96800000
((null):6072): Spice-ERROR **: red_parse_qxl.c:334:red_get_clip_rects: assertion `num_rects * sizeof(QXLRect) == size' failed
Thread 24 (Thread 0x7fec8189c700 (LWP 6097)):
....

(In reply to Takashi Iwai from comment #5)
> Isn't the plymouth hang a dup of bug 102338?

Erm, I meant the fdo bugzilla 102338,
https://bugs.freedesktop.org/show_bug.cgi?id=102338

(In reply to Gerd Hoffmann from comment #4)
> Retested 4.13 + comment #2 patch.
>
> plymouth (aka graphical boot) does indeed hang the machine.
>
> With rhgb disabled, gdm comes up just fine, in both wayland and xorg mode.
> So apparently we have two issues here, and the patch fixes only one of them.
>
> The plymouth hang appears to be pretty serious; the whole machine appears to be toast. I can't log in over the network to see what is going on, so it's not only the display which is f*cked up. Nothing is written to the logs either. When the serial console is enabled to see the logs, plymouth skips the splash screen, so the issue doesn't trigger any more. Hmm, I'm running out of ideas ...

I dropped the patch into the 4.12.12 update for Fedora kernels. The complaints I am seeing are about screen strobing with this patch, so it has been backed out for now. The plymouth issue is also a big one.

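For anyone reading the Spice-Warning in the qemu crash report above, the numbers can be checked by hand. The following is a minimal Python sketch of the arithmetic only, not the spice-server code: the helper name `in_slot` is hypothetical, and it assumes that the `+0xff000000` in the log is the length of the requested range.

```python
# Values copied from the Spice-Warning quoted in this thread.
SLOT_START = 0x7fec96800000
SLOT_END = 0x7fec9a7fe000
VIRT = 0x7fec96b07018
ADD_SIZE = 0xff000000  # assumption: length of the requested range


def in_slot(virt, add_size, start, end):
    """Hypothetical helper: True if [virt, virt + add_size] fits in [start, end]."""
    return start <= virt and virt + add_size <= end


start_ok = SLOT_START <= VIRT <= SLOT_END
range_ok = in_slot(VIRT, ADD_SIZE, SLOT_START, SLOT_END)
slot_size = SLOT_END - SLOT_START

print(f"start address inside slot 1: {start_ok}")
print(f"whole range inside slot 1:   {range_ok}")
print(f"slot size {slot_size:#x}, requested length {ADD_SIZE:#x}")
```

Under that assumption the start address lies inside slot 1, but the requested length is larger than the entire slot, so no start address in that slot could pass the check.
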
(In reply to Gerd Hoffmann from comment #8)
> https://www.kraxel.org/cgit/linux/log/?h=qxl-4.13
> please test

Applied over vanilla 4.12 on top of the patch from comment #2. SDDM started fine, and I was able to log in to the Plasma session and use it for some time. Test repeated twice with the same results. No errors found in dmesg and the system is stable.

As for me - TEST PASSED
Thanks :)

After going through these with a number of users:

qxl: fix primary surface handling - This patch is widely reported to cause serious screen flickering that is not there without it, making the system unusable.

qxl: fix pinning - This patch resolves the GDM login issues with plymouth.

(In reply to Justin M. Forbes from comment #10)
> After going through these with a number of users:
>
> qxl: fix primary surface handling - This patch is widely reported to cause serious screen flickering that is not there without it, making the system unusable.

Workaround #1: turn off wayland.
Workaround #2: use virtio-vga instead. wayland doesn't use qxl 2d accel anyway.

The fundamental problem here is that the qxl virtual hardware simply doesn't support pageflip; we have to destroy and re-create the primary surface instead. This is where the flicker comes from.

Commit "058e9f5c82 drm/qxl: simple crtc page flipping emulated using buffer copy" handles the issue with a pretty gross hack, blitting one framebuffer over the other instead of doing a proper primary surface update. With atomic modesetting that doesn't work any more.

We could possibly decouple the primary surface from the drm framebuffers, so the drm framebuffers effectively become shadow framebuffers, and every display update becomes a drm framebuffer -> primary surface blit. Not sure whether that scheme can work properly with xorg though. It also has a high chance of causing xorg performance regressions.

> qxl: fix pinning - This patch resolves the GDM login issues with plymouth.

Good.

(In reply to Gerd Hoffmann from comment #11)
> (In reply to Justin M. Forbes from comment #10)
> > After going through these with a number of users:
> >
> > qxl: fix primary surface handling - This patch is widely reported to cause serious screen flickering that is not there without it, making the system unusable.
>
> Workaround #1: turn off wayland.

Possible as a short term fix, but with wayland being pretty much "the way forward" it doesn't seem to be a workable long term solution.

> Workaround #2: use virtio-vga instead. wayland doesn't use qxl 2d accel anyway.
>
> The fundamental problem here is that the qxl virtual hardware simply doesn't support pageflip; we have to destroy and re-create the primary surface instead. This is where the flicker comes from.
>
> Commit "058e9f5c82 drm/qxl: simple crtc page flipping emulated using buffer copy" handles the issue with a pretty gross hack, blitting one framebuffer over the other instead of doing a proper primary surface update. With atomic modesetting that doesn't work any more.
>
> We could possibly decouple the primary surface from the drm framebuffers, so the drm framebuffers effectively become shadow framebuffers, and every display update becomes a drm framebuffer -> primary surface blit. Not sure whether that scheme can work properly with xorg though. It also has a high chance of causing xorg performance regressions.

So this brings up an interesting problem in how things are to move forward. It came up as a blocker in Fedora 27 today. Let's say we find a way to force boxes to revert to virtio-vga. That wouldn't change any existing VMs, and it is something we have no control over when the host is not Fedora as well. It also would be a problem for non wayland guests.

> > Workaround #1: turn off wayland.
>
> Possible as a short term fix, but with wayland being pretty much "the way forward" it doesn't seem to be a workable long term solution.

Yes.

> So this brings up an interesting problem in how things are to move forward.

Kicked off a discussion on the spice-devel list.
https://lists.freedesktop.org/archives/spice-devel/2017-October/040310.html

> It came up as a blocker in Fedora 27 today. Let's say we find a way to force boxes to revert to virtio-vga. That wouldn't change any existing VMs, and it is something we have no control over when the host is not Fedora as well.

That would probably best be done via libosinfo (because for guests without virtio-vga guest drivers we had better not make the switch). That should be picked up by other distros and projects too.

> It also would be a problem for non wayland guests.

Why? The xorg modesetting driver works just fine with virtio-vga.

> We could possibly decouple the primary surface from the drm framebuffers, so the drm framebuffers effectively become shadow framebuffers, and every display update becomes a drm framebuffer -> primary surface blit. Not sure whether that scheme can work properly with xorg though. It also has a high chance of causing xorg performance regressions.

Turns out there is an easy way out: shadow dumb framebuffers only.
https://www.kraxel.org/cgit/linux/log/?h=drm-qxl-atomic
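
To illustrate the shadow-framebuffer idea discussed in this thread ("every display update becomes a drm framebuffer -> primary surface blit"), here is a toy Python model. It is only a sketch of the general concept: the class and attribute names are hypothetical, nothing here is taken from the qxl driver, and the thread does not show whether the final "shadow dumb framebuffers only" change is structured this way.

```python
class PrimarySurface:
    """Stand-in for the single surface the virtual device displays."""

    def __init__(self, size):
        self.pixels = bytearray(size)
        self.creations = 1  # each destroy + re-create would bump this (flicker)


class ShadowedDisplay:
    """Hypothetical model: framebuffers are shadows, the primary is long-lived."""

    def __init__(self, size):
        self.primary = PrimarySurface(size)
        self.blits = 0

    def flip(self, framebuffer):
        # Instead of destroying and re-creating the primary surface,
        # copy the new frame's contents into the existing one.
        if len(framebuffer) != len(self.primary.pixels):
            raise ValueError("framebuffer size does not match primary surface")
        self.primary.pixels[:] = framebuffer
        self.blits += 1


display = ShadowedDisplay(size=8)
front = bytes([1] * 8)
back = bytes([2] * 8)
display.flip(front)
display.flip(back)
print(display.primary.creations, display.blits)
```

In this model two flips cost two copies, but the primary surface is created only once. That is the trade-off raised earlier in the thread: no surface re-creation, at the price of an extra blit per update.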