Bug 216475 - fbcon crashes during single GPU passthrough reattachment to host
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-09-12 14:40 UTC by Sergey V.
Modified: 2022-11-16 10:16 UTC

See Also:
Kernel Version: 5.19.7
Tree: Mainline
Regression: Yes


Attachments
My dmesg right after VM shutdown (148.90 KB, text/plain)
2022-09-12 14:40 UTC, Sergey V.

Description Sergey V. 2022-09-12 14:40:14 UTC
Created attachment 301792
My dmesg right after VM shutdown

Hello. Since kernel 5.19, many VFIO users have had problems reattaching the GPU from the guest to the host. It worked well previously (5.18.16 for me).

More complaints about the issue:
https://www.reddit.com/r/VFIO/comments/wp85ve/linux_519_kernel_single_gpu_passthough_black/

My PC Spec:
  CPU: Ryzen 5950X
  RAM: 128GB
  GPU: NVIDIA RTX 3080
  OS: Arch Linux

How to reproduce:
  1. You need a properly configured VM with working GPU passthrough (too complicated to explain here)
  2. When the VM starts, 'start.sh' detaches the GPU from the host (see below)
  3. The VM starts properly and Windows loads properly
  4. Shut down the VM normally; the GPU should be reattached by 'revert.sh' (see below)
Actual results (5.19.*):
  5. Windows shuts down, but the GPU is not reattached to the host: the screen stays black and the monitors turn off (no signal)
  5.1 dmesg contains an error message (see dmesg.txt in the attachments):
    WARNING: CPU: 30 PID: 12528 at drivers/video/fbdev/core/fbcon.c:999 fbcon_init+0x5ce/0x670
    ...
    BUG: kernel NULL pointer dereference, address: 0000000000000330
Expected results (5.18.* and earlier):
  5. Windows shuts down, and the GPU is successfully reattached to the host

I tried to bisect git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git with v5.18.16 as good and v5.19.2 as bad
(this was my first bisect, so I may have done something wrong).

At some point during the bisect my Linux stopped booting, and I tried marking those commits as bad.
The commit below might not be the real cause of the problem.
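
A commit that fails to boot cannot be judged either way, and git bisect has a `skip` verdict for exactly that case; marking such commits as bad can steer the bisect toward the wrong commit. The following is a minimal sketch of how test outcomes could map to bisect verdicts. The outcome labels (`reattach-ok`, `reattach-crash`, `boot-failed`) and the `bisect_verdict` helper are hypothetical names introduced for illustration, not part of git or of the original report.

```shell
#!/bin/bash
# Sketch only: the outcome labels are hypothetical names for what was observed
# after booting each bisect step; they are not git terminology.
bisect_verdict() {
    case "$1" in
        reattach-ok)    echo good ;;  # GPU returns to the host after VM shutdown
        reattach-crash) echo bad ;;   # fbcon WARNING/BUG from this report
        boot-failed)    echo skip ;;  # kernel cannot be tested at all
        *)              echo "unknown outcome: $1" >&2; return 1 ;;
    esac
}

# Typical session, run inside the linux-stable checkout:
#   git bisect start v5.19.2 v5.18.16            # bad revision first, then good
#   git bisect "$(bisect_verdict boot-failed)"   # expands to: git bisect skip
```

With skipped commits in the range, git bisect may end by reporting several candidate commits instead of one, which is still more reliable than a single wrong answer.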

Commit found by the bisect:

commit 3647d6d3dbdafc55f8c4ca8225966963252abe7b (refs/bisect/bad)
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Apr 5 23:03:33 2022 +0200

    fbcon: Move more code into fbcon_release

    con2fb_release_oldinfo() has a bunch more kfree() calls than
    fbcon_exit(), but since kfree() on NULL is harmless doing that in both
    places should be ok. This is also a bit more symmetric now again with
    fbcon_open also allocating the fbcon_ops structure.

    Acked-by: Sam Ravnborg <sam@ravnborg.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Du Cheng <ducheng2@gmail.com>
    Cc: Claudio Suarez <cssk@net-c.es>
    Link: https://patchwork.freedesktop.org/patch/msgid/20220405210335.3434130-16-daniel.vetter@ffwll.ch


start.sh
========
#!/bin/bash
set -x

systemctl stop display-manager.service
while systemctl is-active --quiet "display-manager.service" ; do
    sleep 1
done

killall gdm-x-session
killall -u bormor

echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind

# Unbind EFI-Framebuffer
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

# Avoid a Race condition by waiting 2 seconds. This can be calibrated to be shorter or longer if required for your system
sleep 2

# Unload all Nvidia drivers
modprobe -r nvidia_drm
modprobe -r nvidia_modeset
modprobe -r nvidia_uvm
modprobe -r nvidia
modprobe -r nouveau

# Unbind the GPU from display driver
virsh nodedev-detach pci_0000_09_00_0
virsh nodedev-detach pci_0000_09_00_1

# Load VFIO Kernel Module  
modprobe vfio-pci


revert.sh
========
#!/bin/bash
set -x

# Unload VFIO-PCI Kernel Driver
modprobe -r vfio-pci
modprobe -r vfio_iommu_type1
modprobe -r vfio

virsh nodedev-reattach pci_0000_09_00_1
virsh nodedev-reattach pci_0000_09_00_0

echo 1 > /sys/class/vtconsole/vtcon0/bind
echo 1 > /sys/class/vtconsole/vtcon1/bind

nvidia-xconfig --query-gpu-info > /dev/null 2>&1
echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/bind

modprobe nvidia_drm
modprobe nvidia_modeset
modprobe nvidia_uvm
modprobe nvidia
modprobe nouveau


systemctl start display-manager.service
Comment 1 Sergey V. 2022-09-19 10:42:27 UTC
After further investigation I found that the crash happens when

echo 1 > /sys/class/vtconsole/vtcon0/bind
echo 1 > /sys/class/vtconsole/vtcon1/bind

is called before

echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/bind

Moving the efi-framebuffer bind before the vtconsole bind avoids the crash (on 5.19.9).

However, the vtconsole does not work after the re-bind, but I guess that is a separate issue.
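
The reordering described above could be sketched as follows. This is an illustration of the ordering only, not a replacement for the full revert.sh. The `SYSFS_ROOT` variable and the `rebind_console` function are hypothetical additions (they are not in the original script) that allow the sequence to be dry-run against a scratch directory; by default the real /sys is used.

```shell
#!/bin/bash
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

rebind_console() {
    # 1. Bind the EFI framebuffer first. Per the comment above, doing this
    #    before step 2 avoided the fbcon crash on 5.19.9.
    echo "efi-framebuffer.0" > "$SYSFS_ROOT/bus/platform/drivers/efi-framebuffer/bind" || return 1

    # 2. Only then rebind the virtual consoles.
    echo 1 > "$SYSFS_ROOT/class/vtconsole/vtcon0/bind" || return 1
    echo 1 > "$SYSFS_ROOT/class/vtconsole/vtcon1/bind" || return 1
}
```

In revert.sh, a call to `rebind_console` (as root) would take the place of the two vtconsole `echo 1` lines and the later efi-framebuffer bind line.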
Comment 2 Daniel Vetter 2022-11-16 10:16:20 UTC
Apologies for missing this in the flood of mails I'm drowning under. Also thanks a lot for the bisect, this generally helps a lot. A few questions:

- in the bisect, did you check that the symptoms didn't change at all? I.e. same WARNING backtrace followed by the BUG on that commit, and its immediate parent is bug-free? I'm asking since I can't connect the code change (which should just consolidate some cleanup) with the bug you're reporting, so it's worth checking that there haven't been a few commits that gradually created the bug

- the patch is fairly simple, but doesn't revert cleanly on all kernels anymore. Have you tried to apply the revert and fix it up on your local latest kernel, to double check that the bisected commit is indeed the culprit? I've checked the history and subsequent patches shouldn't really impact this code (but you never know)

- I see from the logs that you're using the proprietary nvidia driver. Please reproduce without that.

- Also can you please decode the BUG into line numbers (ideally on the latest upstream kernel so we don't have any confusion about the source code) per https://dri.freedesktop.org/docs/drm/admin-guide/reporting-issues.html?highlight=oops%20decoding#decode-failure-messages That might help me understand why things blow up.

Thanks
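
The checks requested above (testing the immediate parent of the bisected commit, trying a revert, and decoding the oops) could be sketched as follows, assuming the commands are run from the top of a kernel source tree with a vmlinux built with debug info. The `run` wrapper, the `DRY_RUN` variable, and the function names are hypothetical helpers added for illustration; `git checkout`, `git revert`, and the kernel tree's `scripts/decode_stacktrace.sh` are the real tools.

```shell
#!/bin/bash
# Commit identified by the bisect in the description.
SUSPECT=3647d6d3dbdafc55f8c4ca8225966963252abe7b

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Check out the suspect commit's immediate parent, to confirm it is bug-free.
checkout_parent() { run git checkout "${SUSPECT}^"; }

# Revert the suspect commit on the current tree; conflicts need manual fix-up.
revert_suspect() { run git revert --no-edit "$SUSPECT"; }

# Decode an oops read from stdin into file:line form.
# $1 is the path to the vmlinux that produced the oops.
decode_oops() { run ./scripts/decode_stacktrace.sh "$1"; }
```

For example, `decode_oops vmlinux < dmesg.txt > dmesg.decoded.txt` would annotate the backtrace from the attached log, provided the vmlinux matches the kernel that crashed.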
