Bug 216475 - fbcon crashes during single GPU passthrough reattachment to host
Summary: fbcon crashes during single GPU passthrough reattachment to host
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-09-12 14:40 UTC by Sergey V.
Modified: 2022-12-16 00:02 UTC
CC List: 3 users

See Also:
Kernel Version: 5.19.7
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
My dmesg right after VM shutdown (148.90 KB, text/plain)
2022-09-12 14:40 UTC, Sergey V.
Script to invoke the bug (791 bytes, text/plain)
2022-12-04 03:20 UTC, Jared J

Description Sergey V. 2022-09-12 14:40:14 UTC
Created attachment 301792
My dmesg right after VM shutdown

Hello, since the 5.19 kernel many VFIO users have had problems reattaching the GPU from the guest to the host. It worked well previously (5.18.16 for me).

More complaints about the issue:
https://www.reddit.com/r/VFIO/comments/wp85ve/linux_519_kernel_single_gpu_passthough_black/

My PC Spec:
  CPU: Ryzen 5950X
  RAM: 128GB
  GPU: NVIDIA RTX 3080
  OS: Arch Linux

How to reproduce:
  1. Have a properly configured VM with working GPU passthrough (too complicated to explain here)
  2. When the VM starts, 'start.sh' detaches the GPU from the host (see below)
  3. The VM starts properly and Windows loads properly
  4. Shut down the VM normally; 'revert.sh' should reattach the GPU to the host (see below)
Actual results (5.19.*):
  5. Windows shuts down, but the GPU is not reattached to the host: the screen stays black and the monitors go to sleep (no signal)
  5.1 dmesg contains an error message (see dmesg.txt in the attachments):
    WARNING: CPU: 30 PID: 12528 at drivers/video/fbdev/core/fbcon.c:999 fbcon_init+0x5ce/0x670
    ...
    BUG: kernel NULL pointer dereference, address: 0000000000000330
Expected result (5.18.* and earlier):
  5. Windows shuts down and the GPU is successfully reattached to the host
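
A quick way to check a saved dmesg for the two messages quoted above is sketched here. The function name `has_fbcon_crash` is made up for illustration, and the patterns are simply the strings from this report, so other kernels may word them differently.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the log on stdin contains both the
# fbcon_init WARNING and the NULL pointer dereference quoted in this report.
has_fbcon_crash() {
    local log
    log="$(cat)"
    grep -q 'WARNING: .*fbcon_init' <<<"$log" &&
        grep -q 'BUG: kernel NULL pointer dereference' <<<"$log"
}

# Usage: dmesg | has_fbcon_crash && echo "fbcon crash signature found"
```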

I tried to bisect git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git with v5.18.16 as good and v5.19.2 as bad
(this was my first bisect, so I may have done something wrong).

After some point during the bisect my Linux no longer booted, and I marked those commits as bad.
So the commit below might not be the real cause of the problem.

Commit found by the bisect:

commit 3647d6d3dbdafc55f8c4ca8225966963252abe7b (refs/bisect/bad)
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Apr 5 23:03:33 2022 +0200

    fbcon: Move more code into fbcon_release

    con2fb_release_oldinfo() has a bunch more kfree() calls than
    fbcon_exit(), but since kfree() on NULL is harmless doing that in both
    places should be ok. This is also a bit more symmetric now again with
    fbcon_open also allocating the fbcon_ops structure.

    Acked-by: Sam Ravnborg <sam@ravnborg.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Du Cheng <ducheng2@gmail.com>
    Cc: Claudio Suarez <cssk@net-c.es>
    Link: https://patchwork.freedesktop.org/patch/msgid/20220405210335.3434130-16-daniel.vetter@ffwll.ch


start.sh
========
#!/bin/bash
set -x

systemctl stop display-manager.service
while systemctl is-active --quiet "display-manager.service" ; do
    sleep 1
done

killall gdm-x-session
killall -u bormor

echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind

# Unbind EFI-Framebuffer
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

# Avoid a Race condition by waiting 2 seconds. This can be calibrated to be shorter or longer if required for your system
sleep 2

# Unload all Nvidia drivers
modprobe -r nvidia_drm
modprobe -r nvidia_modeset
modprobe -r nvidia_uvm
modprobe -r nvidia
modprobe -r nouveau

# Unbind the GPU from display driver
virsh nodedev-detach pci_0000_09_00_0
virsh nodedev-detach pci_0000_09_00_1

# Load VFIO Kernel Module  
modprobe vfio-pci


revert.sh
========
#!/bin/bash
set -x

# Unload VFIO-PCI Kernel Driver
modprobe -r vfio-pci
modprobe -r vfio_iommu_type1
modprobe -r vfio

virsh nodedev-reattach pci_0000_09_00_1
virsh nodedev-reattach pci_0000_09_00_0

echo 1 > /sys/class/vtconsole/vtcon0/bind
echo 1 > /sys/class/vtconsole/vtcon1/bind

nvidia-xconfig --query-gpu-info > /dev/null 2>&1
echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/bind

modprobe nvidia_drm
modprobe nvidia_modeset
modprobe nvidia_uvm
modprobe nvidia
modprobe nouveau


systemctl start display-manager.service
Comment 1 Sergey V. 2022-09-19 10:42:27 UTC
After further investigation I found out that the crash happens when

echo 1 > /sys/class/vtconsole/vtcon0/bind
echo 1 > /sys/class/vtconsole/vtcon1/bind

is called before

echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/bind

Moving the efi-framebuffer binding before the vtconsole binding avoids the crash (on 5.19.9).

However, the vtconsole still doesn't work after the re-bind, but I guess that's a separate issue.
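
The reordering described above could look like the following in revert.sh. This is a sketch under assumptions, not a confirmed fix: the function name `rebind_console` and the optional sysfs-root argument (for dry runs against a scratch directory) are hypothetical, and the ordering has only been observed to avoid the crash on 5.19.9.

```shell
#!/bin/bash
# Sketch of the rebind order from this comment: efi-framebuffer first,
# vtconsoles second. Pass a scratch directory as $1 for a dry run;
# the default is the real /sys (requires root).
rebind_console() {
    local root="${1:-/sys}"
    local vtcon

    echo "efi-framebuffer.0" > "$root/bus/platform/drivers/efi-framebuffer/bind" || return 1

    for vtcon in "$root"/class/vtconsole/vtcon*/bind; do
        if [ -e "$vtcon" ]; then
            echo 1 > "$vtcon" || return 1
        fi
    done
}
```

Calling `rebind_console` with no argument would replace the three `echo` lines in revert.sh; the driver `modprobe` calls and the display-manager start stay as they are.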
Comment 2 Daniel Vetter 2022-11-16 10:16:20 UTC
Apologies for missing this in the flood of mails I'm drowning under. Also thanks a lot for the bisect, this generally helps a lot. A few questions:

- in the bisect, did you check that the symptoms didn't change at all? I.e. same WARNING backtrace followed by the BUG on that commit, and its immediate parent is bug-free? I'm asking since I can't connect the code change (which should just consolidate some cleanup) with the bug you're reporting, so it's worth checking that there haven't been a few commits that gradually created the bug

- the patch is fairly simple, but doesn't revert cleanly on all kernels anymore. Have you tried to apply the revert and fix it up on your local latest kernel, to double check that the bisected commit is indeed the culprit? I've checked the history and subsequent patches shouldn't really impact this code (but you never know)

- I see from the logs that you're using the proprietary nvidia driver. Please reproduce without that.

- Also can you pls decode the BUG into line numbers (ideally on latest upstream kernels so we don't have a confusion about the source code) per https://dri.freedesktop.org/docs/drm/admin-guide/reporting-issues.html?highlight=oops%20decoding#decode-failure-messages That might help me understand why things blow up.

Thanks
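
For the decoding step, the kernel tree ships `scripts/decode_stacktrace.sh` and `scripts/faddr2line`. A sketch of how the trace from this report might be fed to them follows; `extract_symbol` is a hypothetical helper, and the `vmlinux` path assumes a kernel built with debug info that matches the running one.

```shell
#!/bin/bash
# Hypothetical helper: print the first func+0xoff/0xsize token on stdin,
# e.g. from the WARNING line in dmesg.
extract_symbol() {
    grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' | head -n 1
}

# Usage, run from the top of the kernel source tree (paths are assumptions):
#   ./scripts/decode_stacktrace.sh vmlinux < dmesg.txt
#   ./scripts/faddr2line vmlinux "$(extract_symbol < dmesg.txt)"
```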
Comment 3 Jared J 2022-12-04 03:20:32 UTC
Created attachment 303350
Script to invoke the bug

Script which successfully invokes the kernel NULL pointer dereference at fbcon.c

Successfully triggers bug with:
  * Primary desktop on nvidia-dkms 525.60.11 housing a GTX2080Ti
  * Old desktop on nvidia-470xx-dkms 470.141.03 housing a GTX780 + GTX660.

It does not trigger the bug on my Dell XPS 13, which has no dGPU, only an iGPU (Iris Xe Graphics). That laptop successfully unbinds and then rebinds the efi-framebuffer and vtcons, and the virtual console on Ctrl+Alt+Fx still works.

So in my experience, this is pointing to something NVIDIA related. I wouldn't mind trying older kernel versions to find when this problem was introduced, and different NVIDIA driver versions too for the same reason.
Comment 4 Jared J 2022-12-04 03:24:06 UTC
Sorry, I should have mentioned in that comment -- I am also doing VFIO, but as you can see in the attached script, this problem doesn't actually involve VFIO. On my NVIDIA powered desktops, just unbinding the efi-framebuffer and vtcons and then rebinding them is enough to trigger it.

Passing through video cards with VFIO requires you to do this too, so it's a troubling issue. You can start a VM and ignore the vtcons and in my case they get stripped off anyway, but obviously once you're done with the VM and go to rebind them, this crash happens regardless.
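
The unbind/rebind cycle described above can be sketched as below. This is not the attached script, only an approximation of it: the function names, the `SETTLE_SECONDS` variable and the sysfs-root argument are hypothetical, and the rebind order (vtcons before efi-framebuffer) is the one Comment 1 reports as crashing. Running it against the real /sys on an affected system is expected to trigger the NULL pointer dereference.

```shell
#!/bin/bash
# Write one value to every vtconsole bind file under the given sysfs root.
set_vtcons() {
    local root="$1" value="$2" f
    for f in "$root"/class/vtconsole/vtcon*/bind; do
        if [ -e "$f" ]; then
            echo "$value" > "$f" || return 1
        fi
    done
}

# Unbind vtcons and efi-framebuffer, then rebind them in the crashing order.
# $1 is the sysfs root; the default is the real /sys (requires root).
cycle_console() {
    local root="${1:-/sys}"
    local drv="$root/bus/platform/drivers/efi-framebuffer"

    set_vtcons "$root" 0 || return 1
    echo "efi-framebuffer.0" > "$drv/unbind" || return 1
    sleep "${SETTLE_SECONDS:-2}"   # same settle delay as start.sh
    set_vtcons "$root" 1 || return 1
    echo "efi-framebuffer.0" > "$drv/bind" || return 1
}
```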
Comment 5 Jared J 2022-12-04 08:17:36 UTC
I did some elimination on this bug today. I used QEMU but gave it my current kernel and initramfs to boot with, plus a fresh Arch Linux ext4.img as a rootfs for it to mount, to minimize the number of variables.

-----
As for my tests:

1. Started QEMU with `-vga std` for a virtual display, so it's just a window in my host's graphical session. The QEMU environment appropriately chose the `bochs-drm` Linux virtual display driver for this virtual VGA display.

The VM successfully ran the vtcon unbind/rebind script without any trouble.

2. I rebooted QEMU, this time adding the kernel argument `nomodeset` so that it cannot bind `bochs-drm` to the virtual VGA interface.

With `bochs` bound to the virtual VGA card instead:
The VM FAILED to run the script, causing a fbcon.c NULL pointer dereference visible in `dmesg`.


3. Booted the VM again without a virtual display and passed through my physical GTX 2080Ti, taking care not to touch the vtcon paths on my host so as not to hit this issue on my physically running kernel.

Now the VM is outputting directly to my monitor as expected. I noticed the terminal was much higher quality because surprise, nouveau kicked in and bound itself to my 2080Ti in the guest.

With my GPU passed through to the VM and Nouveau attached to it:
The VM successfully ran the vtcon unbind/rebind script without any trouble.

So the hint to me here is that people using Nouveau as their host NVIDIA driver won't experience this issue. I'd like to try it on my real physical hardware later to be sure.

4. I booted the QEMU VM with my 2080Ti again, this time with the `nomodeset` kernel argument to keep video drivers from binding to my card.

With my GPU passed through to the VM but no driver attached to it:
The VM successfully ran the vtcon unbind/rebind script without any trouble.
 
5. Finally I installed `nvidia 525.60.11-1` and `nvidia-utils 525.60.11-1` in the VM's ext4 rootfs and rebooted.

In the VM, nvidia bound to the 2080Ti:
The VM FAILED to run the script, causing a fbcon.c NULL pointer dereference visible in `dmesg`.

------

To summarize, when unbinding and rebinding the vtcons on kernel 6.0.11-arch1-1:

These combinations DON'T throw a fbcon.c NULL pointer dereference:
  * bochs-drm - Linux display driver for virtual displays / QEMU's `-vga std`
  * Nouveau   - On a Nvidia GTX 2080Ti
  * No Driver - On a Nvidia GTX 2080Ti


These combinations DO:
  * bochs - (nomodeset) Linux display driver for virtual displays / QEMU's `-vga std`
  * nvidia 525.60.11 -  On a GTX 2080Ti both tested on a physical host and when done in a guest virtually.
  * nvidia 470.141.03 - On a physical host with a GTX 780 and GTX 660

Given that nvidia 470 is a legacy driver, I doubt it changes very often, so I'm leaning towards this being a change in kernel behavior. I will have to try different kernel versions next.
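
To record which driver was bound in each case, something like the following could be run before the unbind/rebind script. It is a sketch: `bound_driver` is a hypothetical helper, and the PCI address in the usage lines is the one from the original report's scripts, so it will differ per system.

```shell
#!/bin/bash
# Hypothetical helper: print the kernel driver bound to a PCI device,
# or "none" if nothing is bound. $2 optionally overrides the sysfs root.
bound_driver() {
    local dev="$1" root="${2:-/sys}"
    local link="$root/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Usage:
#   bound_driver 0000:09:00.0   # GPU address from the original report
#   cat /proc/fb                # registered framebuffer devices
```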
