Bug 216475
| Summary: | fbcon crashes during single gpu passthrough reattachment to host | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Sergey V. (truesmb) |
| Component: | Video(Other) | Assignee: | drivers_video-other |
| Status: | NEW --- | | |
| Severity: | normal | CC: | daniel, jared, javier |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 5.19.7 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Attachments: | My dmesg right after VM shutdown; Script to invoke the bug | | |
Description
Sergey V.
2022-09-12 14:40:14 UTC
After further investigation I found out that the crash happens when

    echo 1 > /sys/class/vtconsole/vtcon0/bind
    echo 1 > /sys/class/vtconsole/vtcon1/bind

is called before

    echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/bind

Moving the efi-framebuffer binding before the vtconsole binding avoids the crash (on 5.19.9). The only remaining problem is that vtconsole isn't working after the re-bind, but I guess that's another issue.

Apologies for missing this in the flood of mails I'm drowning under. Also thanks a lot for the bisect, this generally helps a lot. A few questions:

- In the bisect, did you check that the symptoms didn't change at all? I.e. the same WARNING backtrace followed by the BUG on that commit, and its immediate parent is bug-free? I'm asking since I can't connect the code change (which should just consolidate some cleanup) with the bug you're reporting, so it's worth checking that there haven't been a few commits that gradually created the bug.
- The patch is fairly simple, but doesn't revert cleanly on all kernels anymore. Have you tried to apply the revert and fix it up on your local latest kernel, to double check that the bisected commit is indeed the culprit? I've checked the history and subsequent patches shouldn't really impact this code (but you never know).
- I see from the logs that you're using the proprietary nvidia driver. Please reproduce without that.
- Also, can you please decode the BUG into line numbers (ideally on the latest upstream kernel so we don't have any confusion about the source code) per https://dri.freedesktop.org/docs/drm/admin-guide/reporting-issues.html?highlight=oops%20decoding#decode-failure-messages That might help me understand why things blow up.

Thanks

Created attachment 303350
Script to invoke the bug
Script which successfully invokes the kernel NULL pointer dereference at fbcon.c
Successfully triggers bug with:
* Primary desktop on nvidia-dkms 525.60.11 housing a GTX2080Ti
* Old desktop on nvidia-470xx-dkms 470.141.03 housing a GTX780 + GTX660.
It does not trigger the bug on my Dell XPS 13, which has no dGPU, only an iGPU (Iris Xe Graphics). That laptop successfully unbinds and then rebinds the efi-framebuffer and vtcons, and the virtual consoles on Ctrl+Alt+Fx still work afterwards.
So in my experience, this is pointing to something NVIDIA related. I wouldn't mind trying older kernel versions to find when this problem was introduced, and different NVIDIA driver versions too for the same reason.
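For readers without access to the attachment, the unbind/rebind sequence discussed in this report can be sketched roughly as below. This is not the attached script. The function names and the `SYSFS_ROOT` variable are hypothetical, and looping over every `vtcon*` entry (rather than just vtcon0 and vtcon1) is an assumption. The rebind step uses the ordering the description reports as avoiding the crash on 5.19.9: efi-framebuffer first, vtconsole second.

```shell
#!/bin/sh
# Hypothetical helper, NOT the script attached to this bug.
# SYSFS_ROOT defaults to /sys; point it at a scratch directory for a dry run.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Write value $1 to the attribute $2 (path relative to SYSFS_ROOT),
# silently skipping attributes that do not exist on this machine.
write_attr() {
    if [ -e "$SYSFS_ROOT/$2" ]; then
        printf '%s\n' "$1" > "$SYSFS_ROOT/$2"
    fi
}

# Write value $1 to the bind attribute of every vtconsole entry.
set_vtcons() {
    for con in "$SYSFS_ROOT"/class/vtconsole/vtcon*; do
        if [ -e "$con/bind" ]; then
            printf '%s\n' "$1" > "$con/bind"
        fi
    done
}

# Detach the virtual consoles, then release the EFI framebuffer.
unbind_console() {
    set_vtcons 0
    write_attr efi-framebuffer.0 bus/platform/drivers/efi-framebuffer/unbind
}

# Reattach in the order reported to avoid the crash:
# EFI framebuffer first, virtual consoles second.
rebind_console() {
    write_attr efi-framebuffer.0 bus/platform/drivers/efi-framebuffer/bind
    set_vtcons 1
}
```

On a real system both functions would need to be run as root, and `rebind_console` may still trigger the crash described in this bug on affected driver combinations, so try it only on a machine you can afford to reboot.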
Sorry, I should have mentioned in that comment -- I am also doing VFIO, but as you can see in the attached script, this problem doesn't actually involve VFIO. On my NVIDIA powered desktops, just unbinding the efi-framebuffer and vtcons and then rebinding them is enough to trigger it. Passing through video cards with VFIO requires you to do this too, so it's a troubling issue. You can start a VM and ignore the vtcons, and in my case they get stripped off anyway, but once you're done with the VM and go to rebind them, this crash happens regardless.

I did some elimination on this bug today. I used QEMU but gave it my current kernel and initramfs to boot with, plus a fresh Arch Linux ext4.img as a rootfs for it to mount, to minimize the number of variables.

-----

As for my tests:

1. Started QEMU with `-vga std` for a virtual display, so it's just a window in my host's graphical session. The QEMU environment appropriately chose the `bochs-drm` Linux virtual display driver for this virtual VGA display.
   The VM successfully ran the vtcon unbind/rebind script without any trouble.

2. I rebooted QEMU, this time adding the kernel argument `nomodeset` so that it cannot use `bochs-drm` on the virtual VGA interface.
   With `bochs` bound to the virtual VGA card instead: The VM FAILED to run the script, causing a fbcon.c NULL pointer dereference visible in `dmesg`.

3. Booted the VM again without a virtual display and passed through my physical GTX 2080Ti, making sure not to touch the vtcon paths on my host so I wouldn't run into this issue on my physically running kernel. Now the VM is outputting directly to my monitor as expected. I noticed the terminal was much higher quality because, surprise, nouveau kicked in and bound itself to my 2080Ti in the guest.
   With my GPU passed through to the VM and Nouveau attached to it: The VM successfully ran the vtcon unbind/rebind script without any trouble.
   So the hint to me here is that people using Nouveau as the host nvidia driver won't experience this issue at all. I'd like to try it on my real physical hardware later to be sure.

4. I booted the QEMU VM with my 2080Ti again, this time with the `nomodeset` kernel argument to keep video drivers from hooking my card.
   With my GPU passed through to the VM but no driver attached to it: The VM successfully ran the vtcon unbind/rebind script without any trouble.

5. Finally I installed `nvidia 525.60.11-1` and `nvidia-utils 525.60.11-1` in the VM's ext4 rootfs and rebooted.
   In the VM, nvidia bound to the 2080Ti: The VM FAILED to run the script, causing a fbcon.c NULL pointer dereference visible in `dmesg`.

------

To summarize, when un/rebinding your vtcons on kernel 6.0.11-arch1-1:

These combinations DON'T throw a fbcon.c NULL pointer dereference:
* bochs-drm - Linux display driver for virtual displays / QEMU's `-vga std`
* Nouveau - On a Nvidia GTX 2080Ti
* No Driver - On a Nvidia GTX 2080Ti

These combinations DO:
* bochs - (nomodeset) Linux display driver for virtual displays / QEMU's `-vga std`
* nvidia 525.60.11 - On a GTX 2080Ti, tested both on a physical host and virtually in a guest.
* nvidia 470.141.03 - On a physical host with a GTX 780 and GTX 660

Given that nvidia 470 is a legacy driver, I doubt it changes very often, so I'm leaning more towards this being a kernel behavior. I will have to try different versions of the kernel next.
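Since the outcome of each test above depends on which driver is bound at the time, it may help to record that state next to each run. The following is a minimal sketch, not something used in the tests above. The function names and `SYSFS_ROOT` are hypothetical, and the PCI address in the usage note is only an example value. It relies on the `name` and `bind` attributes under `/sys/class/vtconsole/vtcon*` and on the `driver` symlink under `/sys/bus/pci/devices/<address>/`.

```shell
#!/bin/sh
# Hypothetical diagnostic helper; not part of the original report.
# SYSFS_ROOT defaults to /sys; override it to inspect a copied tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print one line per vtconsole entry: its id, bind state and description.
list_vtcons() {
    for con in "$SYSFS_ROOT"/class/vtconsole/vtcon*; do
        [ -d "$con" ] || continue
        name=$(cat "$con/name" 2>/dev/null)
        bound=$(cat "$con/bind" 2>/dev/null)
        printf '%s bind=%s name=%s\n' "${con##*/}" "$bound" "$name"
    done
}

# Print the kernel driver bound to the PCI device at address $1,
# or "none" when no driver is attached.
pci_driver() {
    link="$SYSFS_ROOT/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo none
    fi
}
```

For example, calling `list_vtcons` and `pci_driver 0000:01:00.0` before and after the unbind/rebind script would show whether fbcon is still attached to a vtcon entry and whether the GPU is held by nouveau, nvidia, or no driver at all.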