Bug 219619

Summary: [REGRESSION, BISECTED] vfio-pci: screen graphics artifacts after 6.12 kernel upgrade
Product: Drivers
Component: PCI
Reporter: Athul Krishna K R (athul.krishna.kr)
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal
Priority: P3
CC: alex.williamson, athul.krishna.kr, precification
Hardware: AMD
OS: Linux
Kernel Version: 6.12
Subsystem:
Regression: Yes
Bisected commit-id: f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101
Attachments:
  dmesg
  Host dmesg, 6.13-rc2 stock, 16G BAR
  Host dmesg, 6.13-rc2 vfio-pci-core without huge_fault, 16G BAR
  Host dmesgs, 6.13-rc2, QEMU 9.1.2/9.2.0, vfio-pci-core huge_fault PUD/PMD/both
  Host dmesgs, 6.12.4/6.13-rc2, QEMU 9.1.2/9.2.0, vfio-pci-core huge_fault alignment patch
  Host dmesgs, 6.12.4, QEMU 9.1.2/9.2.0, vfio-pci-core submitted huge_fault alignment patch

Description Athul Krishna K R 2024-12-21 10:10:02 UTC
Created attachment 307382
dmesg

Device: Asus Zephyrus GA402RJ
CPU: Ryzen 7 6800HS
GPU: RX 6700S
Kernel: 6.13.0-rc3-g8faabc041a00

Problem:
Launching games or GPU benchmarking tools in a QEMU Windows 11 VM causes screen artifacts; eventually QEMU pauses the VM with an unrecoverable error.

Commit:
f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 is the first bad commit
commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Mon Aug 26 16:43:53 2024 -0400

    vfio/pci: implement huge_fault support
    
    With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we can
    take advantage of PMD and PUD faults to PCI BAR mmaps and create more
    efficient mappings.  PCI BARs are always a power of two and will typically
    get at least PMD alignment without userspace even trying.  Userspace
    alignment for PUD mappings is also not too difficult.
    
    Consolidate faults through a single handler with a new wrapper for
    standard single page faults.  The pre-faulting behavior of commit
    d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault") is removed
    in this refactoring since huge_fault will cover the bulk of the faults and
    results in more efficient page table usage.  We also want to avoid that
    pre-faulted single page mappings preempt huge page mappings.
    
    Link: https://lkml.kernel.org/r/20240826204353.2228736-20-peterx@redhat.com
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gavin Shan <gshan@redhat.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Niklas Schnelle <schnelle@linux.ibm.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

 drivers/vfio/pci/vfio_pci_core.c | 60 ++++++++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 17 deletions(-)
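
The commit message notes that userspace alignment for PUD mappings "is also not too difficult". As a rough illustration only, one common way to obtain an aligned virtual address is to reserve `length + alignment` bytes and round the start up. The sketch below shows just the arithmetic; the page sizes assume x86-64 with 4 KiB base pages, `reserve_start` is a made-up address, and none of this is taken from QEMU or the kernel.

```python
# Assumed x86-64 sizes; other architectures differ.
PMD_SIZE = 2 * 1024**2   # 2 MiB
PUD_SIZE = 1024**3       # 1 GiB


def align_up(addr, alignment):
    """Round addr up to the next multiple of a power-of-two alignment."""
    return (addr + alignment - 1) & ~(alignment - 1)


def aligned_window(reserve_start, length, alignment):
    """Pick an aligned start inside a reservation of length + alignment bytes."""
    start = align_up(reserve_start, alignment)
    # The aligned window must still fit inside the over-sized reservation.
    assert start + length <= reserve_start + length + alignment
    return start


# Hypothetical address of an over-sized reservation for a 16 GiB BAR.
reserve_start = 0x7F1234567000
bar_len = 16 * 1024**3
start = aligned_window(reserve_start, bar_len, PUD_SIZE)
print(hex(start), start % PUD_SIZE == 0)
```

The unused head and tail of the reservation would normally be unmapped afterwards; that step is omitted here.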
Comment 1 precification 2024-12-30 19:34:30 UTC
Created attachment 307424
Host dmesg, 6.13-rc2 stock, 16G BAR

Kernel 6.13-rc2 (MANJARO stock), 16G BAR for the 6700XT's VRAM. Kernel log captured while starting a Linux VM with the GPU passed through. In this case, the VM fails to initialize the GPU. vfio-pci-core has its new huge_fault handler.

Created with `echo "file vfio_pci_core.c +p">/sys/kernel/debug/dynamic_debug/control`

See mailing list thread: https://lore.kernel.org/regressions/20241222223604.GA3735586@bhelgaas/
Comment 2 precification 2024-12-30 19:37:55 UTC
Created attachment 307425
Host dmesg, 6.13-rc2 vfio-pci-core without huge_fault, 16G BAR

Kernel 6.13-rc2 (MANJARO, patched vfio-pci-core), 16G BAR for the 6700XT's VRAM. Kernel log captured while starting a Linux VM with the GPU passed through. In this case, the VM successfully initializes the GPU. The only change to vfio-pci-core is the removal of its new huge_fault handler.

Created with `echo "file vfio_pci_core.c +p">/sys/kernel/debug/dynamic_debug/control`

See mailing list thread: https://lore.kernel.org/regressions/20241222223604.GA3735586@bhelgaas/
Comment 3 precification 2024-12-31 14:56:39 UTC
Created attachment 307429
Host dmesgs, 6.13-rc2, QEMU 9.1.2/9.2.0, vfio-pci-core huge_fault PUD/PMD/both

Kernel log excerpts, opening a Linux VM with the GPU passed through.

Kernel 6.13-rc2 (MANJARO), 16G BAR for 6700XT's VRAM. QEMU 9.1.2 vs. QEMU 9.2.0.
vfio-pci-core huge_fault support set to either PUD only ('no2Mpages') / PMD only ('no1Gpages') / both ('stock') using the patches by Alex https://lore.kernel.org/regressions/20241230182737.154cd33a.alex.williamson@redhat.com/ .

Configurations where the guest fails to initialize the GPU: QEMU 9.1.2 'stock'/'no2Mpages'.
Working configurations: QEMU 9.1.2 'no1Gpages'; QEMU 9.2.0, all three variants.

Created with `echo "file vfio_pci_core.c +p">/sys/kernel/debug/dynamic_debug/control`.
Comment 4 precification 2025-01-01 02:58:18 UTC
Created attachment 307432
Host dmesgs, 6.12.4/6.13-rc2, QEMU 9.1.2/9.2.0, vfio-pci-core huge_fault alignment patch

Kernel log excerpts, opening a Linux VM with the GPU passed through.

Kernels 6.12.4 and 6.13-rc2 (MANJARO), 16G BAR for the 6700XT's VRAM. QEMU 9.1.2 on both kernels, plus QEMU 9.2.0 on 6.12.4.
Logs are with the vfio-pci-core patch by Alex https://lore.kernel.org/regressions/20241231090733.5cc5504a.alex.williamson@redhat.com/ .

All configurations work as expected, with QEMU 9.2.0 getting the 1G mappings (as before) and 9.1.2 now falling back to 2M.

Created with `echo "file vfio_pci_core.c +p">/sys/kernel/debug/dynamic_debug/control`.
Comment 5 precification 2025-01-03 12:39:33 UTC
Created attachment 307444
Host dmesgs, 6.12.4, QEMU 9.1.2/9.2.0, vfio-pci-core submitted huge_fault alignment patch

Kernel log excerpts, opening a Linux VM with the GPU passed through.

Kernel 6.12.4 (MANJARO, mostly stock), 16G BAR for 6700XT's VRAM. QEMU 9.1.2 and QEMU 9.2.0.
Logs are with the submitted vfio-pci-core patch by Alex Williamson https://lore.kernel.org/lkml/2025010322-overblown-symptom-d4cd@gregkh/T/#t .

All configurations work as expected, with QEMU 9.2.0 getting the 1G mappings (as before) and 9.1.2 falling back to 2M (which it didn't before, causing the bug).

Created with `echo "file vfio_pci_core.c +p">/sys/kernel/debug/dynamic_debug/control`.
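
The fallback from 1G to 2M mappings described above can be modelled with a small sketch. It is an illustration of the general idea only, not the submitted patch: a huge mapping of a given order is assumed usable when the naturally aligned block around the faulting address lies inside the VMA and the backing pfn is aligned to the same order. The names `largest_usable_order`, `bar_pfn` and the example addresses are hypothetical, and the orders assume x86-64 with 4 KiB pages.

```python
PAGE_SHIFT = 12                 # assumed 4 KiB base pages
PAGE_SIZE = 1 << PAGE_SHIFT
PMD_ORDER = 9                   # 2 MiB
PUD_ORDER = 18                  # 1 GiB


def largest_usable_order(vaddr, vma_start, vma_end, bar_pfn):
    """Return the largest order whose block fits the VMA with an aligned pfn."""
    for order in (PUD_ORDER, PMD_ORDER, 0):
        size = PAGE_SIZE << order
        base = vaddr & ~(size - 1)          # block start in virtual space
        if base < vma_start or base + size > vma_end:
            continue                         # block sticks out of the VMA
        pfn = bar_pfn + ((base - vma_start) >> PAGE_SHIFT)
        if pfn & ((1 << order) - 1):
            continue                         # backing pfn not aligned
        return order
    return 0


bar_pfn = 0x800000000 >> PAGE_SHIFT          # made-up BAR at 32 GiB physical
bar_len = 16 * 1024**3
offset = 2 * 1024**3                         # fault 2 GiB into the BAR

# The same fault offset with three differently aligned mmap start addresses.
for vma_start in (0x7F1240000000, 0x7F1234600000, 0x7F1234567000):
    order = largest_usable_order(vma_start + offset, vma_start,
                                 vma_start + bar_len, bar_pfn)
    print(hex(vma_start), order)
```

In this model a 1G-aligned mapping gets PUD order, a mapping aligned only to 2M drops to PMD order, and a 4K-aligned mapping ends up with single pages.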