Bug 206977 - AMD GPU crashes after powering off or rebooting the VM
Summary: AMD GPU crashes after powering off or rebooting the VM
Status: NEW
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: x86-64 Linux
Importance: P1 high
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-26 10:32 UTC by Sebastian Münch
Modified: 2020-03-26 10:53 UTC (History)
0 users

See Also:
Kernel Version: linux-lts 5.4.26, 5.5.11, and 5.6-rc7 mainline kernel
Subsystem:
Regression: No
Bisected commit-id:


Attachments
AMD gpu Crash after power or reboot the VM logs (102.36 KB, text/plain)
2020-03-26 10:32 UTC, Sebastian Münch

Description Sebastian Münch 2020-03-26 10:32:23 UTC
Created attachment 288075 [details]
AMD gpu Crash after power or reboot the VM logs

Hardware
CPU:
AMD Ryzen 1700X
Mainboard:
ASRock X370 Taichi
BIOS:
2.40, the latest BIOS version for this CPU
GPU1:
AMD Radeon R7 260X; the reset works fine.
GPU2:
SAPPHIRE Nitro Plus RX VEGA 64; the reset does not work properly with the vfio-pci module.



I have a problem with a corrupt header on my AMD RX Vega 64 card after shutting down the VM.
The GPU is passed to a QEMU VM with vfio.
The bug occurs on my KVM server with the Arch Linux kernel 5.5.10, linux-lts 5.4, and Linux kernel 5.6-rc7.
After I downgraded the kernel to 5.3.5, the header was no longer corrupted and the reset with the vfio-pci module worked.
I tested the Mesa 20.0.1 beta and Arch Linux's 19.3.4-2 package, and neither fixes the bug.
See the log from lspci -v > lspciv1.log for the 5.3.5 kernel, taken after shutting down the VM.
See the log from lspci -v > lspci_header_corupt.log for the 5.4.26 or 5.5.10 kernel, taken after shutting down the VM.
See the log from dmesg > vfio_5.3.5.log for the 5.3.5 kernel, taken after shutting down the VM.
See the log from dmesg > vfio_5.4.26.log for the 5.4.26 or 5.5.10 kernel, taken after shutting down the VM.

Once the header is corrupt, the GPU is no longer cooled and the fan runs at 0 RPM.
About 30 to 40 minutes after the crash, the fan goes to 100%, and the PC has to be powered off for about 15 minutes while the GPU cools down.



Additional info:
* linux linux-lts mesa 19.3.4-2
* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
Start a QEMU VM (q35 or std) with the GPU added as a vfio device, then shut down the VM. Afterwards the header is corrupt.



I also posted this in the Arch Linux bug tracker:
https://bugs.archlinux.org/task/65956


Server log after shutting down the VM:
lspci -v >lspci_header_corupt.log
--
11:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: vfio-pci
	Kernel modules: amdgpu

11:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel
--
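
The values above (rev ff, prog-if ff, header type 7f) are what lspci prints when every config-space read returns 0xFF, which usually means the device has stopped answering config accesses. A minimal sketch for checking this on the host without lspci, assuming the card sits at 0000:11:00.0 and 0000:11:00.1 (the 0000 domain is an assumption; the function names are only illustrative):

```python
from pathlib import Path

HEADER_TYPE_OFFSET = 0x0E  # header type register in PCI config space


def header_looks_dead(config: bytes) -> bool:
    """Return True if the first 64 config bytes all read back as 0xFF."""
    head = config[:64]
    return len(head) == 64 and all(b == 0xFF for b in head)


def header_type(config: bytes) -> int:
    """Header type with the multi-function bit (bit 7) masked off."""
    return config[HEADER_TYPE_OFFSET] & 0x7F


def read_config(bdf: str) -> bytes:
    """Read config space from sysfs; bdf is e.g. '0000:11:00.0'."""
    return Path(f"/sys/bus/pci/devices/{bdf}/config").read_bytes()


if __name__ == "__main__":
    # Assumed addresses, taken from the lspci output with domain 0000.
    for bdf in ("0000:11:00.0", "0000:11:00.1"):
        try:
            cfg = read_config(bdf)
        except OSError as exc:
            print(f"{bdf}: cannot read config space ({exc})")
            continue
        print(f"{bdf}: header type {header_type(cfg):#04x}, "
              f"all 0xFF: {header_looks_dead(cfg)}")
```

Running this before starting the VM and again after shutting it down would show whether the header changes from a normal type to 7f.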
Server GRUB boot parameters:
iommu=pt amd_iommu=on vfio-pci.ids=1002:687f,1002:aaf8,1022:145c rd.driver.pre=vfio-pci nopti
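
To double-check which devices are claimed by vfio-pci at boot, the ID list can be pulled out of the kernel command line. A small sketch, assuming the command line is read from /proc/cmdline; the helper name vfio_pci_ids is hypothetical:

```python
from pathlib import Path
from typing import List, Tuple


def vfio_pci_ids(cmdline: str) -> List[Tuple[str, str]]:
    """Return (vendor, device) pairs from a vfio-pci.ids= parameter."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # The kernel treats '-' and '_' in parameter names as equivalent.
        if key.replace("-", "_") == "vfio_pci.ids":
            return [tuple(entry.split(":")[:2])
                    for entry in value.split(",") if entry]
    return []


if __name__ == "__main__":
    try:
        text = Path("/proc/cmdline").read_text()
    except OSError as exc:
        print(f"cannot read /proc/cmdline ({exc})")
    else:
        for vendor, device in vfio_pci_ids(text):
            print(f"{vendor}:{device}")
```

The printed pairs can be compared with the [vendor:device] IDs shown by lspci -nn.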

A Windows 7 VM works fine with the AMD RX Vega 64 card.
Comment 1 Sebastian Münch 2020-03-26 10:53:19 UTC
When I blacklist the amdgpu module and the radeon module, the header is fine on Linux kernels 5.4, 5.5, and 5.6-rc7.
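
To confirm that the blacklist took effect, the loaded modules can be checked on the host. A rough sketch that parses /proc/modules, where the first field of each line is the module name; the helper name loaded_modules is hypothetical:

```python
from pathlib import Path
from typing import Set


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """Return the module names listed in /proc/modules-style text."""
    return {line.split()[0]
            for line in proc_modules_text.splitlines() if line.strip()}


if __name__ == "__main__":
    try:
        text = Path("/proc/modules").read_text()
    except OSError as exc:
        print(f"cannot read /proc/modules ({exc})")
    else:
        mods = loaded_modules(text)
        for name in ("amdgpu", "radeon", "vfio_pci"):
            print(f"{name}: {'loaded' if name in mods else 'not loaded'}")
```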
