Bug 200101 - random freeze under load
Status: NEW
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: virtualization_kvm
Reported: 2018-06-17 18:16 UTC by lekto
Modified: 2020-06-27 08:21 UTC

See Also:
Kernel Version: 4.16.11, 4.18.0-rc5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
libvirt confix (6.14 KB, application/xml)
2018-07-27 13:03 UTC, lekto

Description lekto 2018-06-17 18:16:00 UTC
Hi, I get random freezes while using a virtual machine. The VM (virt-manager + libvirt + QEMU + KVM) runs Windows 10 with GPU passthrough (Nvidia 1050). The freezes occur under load, such as playing Factorio or running the Valley benchmark. The last working kernel was 4.16.10. Using git bisect I found the commit that probably causes this:
>95271aeb93d4681c65e2f94969b23ef6070367a6 is the first bad commit
>commit 95271aeb93d4681c65e2f94969b23ef6070367a6
>Author: Thomas Gleixner <tglx@linutronix.de>
>Date:   Thu May 10 20:42:48 2018 +0200
>
>    x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
>    
>    commit 47c61b3955cf712cadfc25635bf9bc174af030ea upstream
>    
>    Add the necessary logic for supporting the emulated VIRT_SPEC_CTRL MSR to
>    x86_virt_spec_ctrl().  If either X86_FEATURE_LS_CFG_SSBD or
>    X86_FEATURE_VIRT_SPEC_CTRL is set then use the new guest_virt_spec_ctrl
>    argument to check whether the state must be modified on the host. The
>    update reuses speculative_store_bypass_update() so the ZEN-specific sibling
>    coordination can be reused.
>    
>    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
>:040000 040000 1a076b17c0c8d0bf255c9aba856f1b75791deeb1 82bd77c23c8e1ade512f4afa5ce021cbad6b2ae8 M    arch

CPU: Ryzen 1600
Motherboard: ASRock X370 Gaming X
Comment 1 lekto 2018-07-27 13:02:24 UTC
With kernel 4.18.0-rc5 the system still hangs, but after I removed this patch it no longer freezes.

System: Gentoo 17.1
virt-manager 1.5.1-r1
libvirt 4.5.0-r1
qemu 2.12.0-r4
Comment 2 lekto 2018-07-27 13:03:13 UTC
Created attachment 277575
libvirt confix
Comment 3 Garry Filakhtov 2020-06-12 06:15:24 UTC
Struggling with the same issue. Also coming from Gentoo 👋 lekto!

This was a long time coming; I needed a lot of time to make sure there were no hardware issues or misconfiguration on my end before reporting here.

I have an Intel X299 platform and use it to run a Windows 10 virtual machine with PCI pass-through. I pass through an NVMe SSD (Samsung 970 EVO Plus), a PCIe USB 3.0 adapter (StarTech PEXUSB3S3GE) and a GPU (nVidia GeForce GTX 1650) to get the best possible performance and isolation from the host OS.

I had been running the 4.19 LTS kernel without any issues, but when 5.4 LTS was promoted to stable for the AMD64 architecture I switched. Since then I have been experiencing random guest freezes, happening anywhere from immediately after boot to after multiple hours of usage. When a freeze occurs, the guest completely stops responding to input, ping, etc. The host works fine and I can connect to the QEMU monitor socket without any problems. I am running QEMU 4.2.0.

A freeze lasts anywhere from 1 to 5 minutes; the VM then recovers and works properly until the next freeze. Inspecting dmesg or journalctl on the host reveals no relevant entries.

The problem appears regardless of the workload. The guest can freeze on the desktop, in the web browser or in a GPU benchmark. When I was playing music on the system, the sound started to drop out and glitch just before a freeze and then went completely silent.

Windows Event Viewer is of course as useful as a fridge at the North Pole before climate change :D (pardon my pun): no entries are produced during a freeze, and there is a gap between written entries for as long as the freeze lasted.
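
Since the only trace of a freeze inside the guest is a gap between log entries, one way to line freezes up against host-side events is to scan exported entry timestamps for unusually long silences. The following is a minimal Python sketch under stated assumptions: the timestamps are hypothetical sample data (not taken from the reporter's logs), find_gaps is an illustrative helper name, and the 60-second threshold is only a guess based on the 1 to 5 minute freezes described above. A quiet system can also produce gaps without any freeze, so matches are hints rather than proof.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(seconds=60)):
    """Return (start, end, duration) for consecutive entries further apart than threshold."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later, later - earlier))
    return gaps


# Hypothetical sample data, not taken from the reporter's event log.
sample = [
    datetime(2020, 6, 12, 10, 0, 0),
    datetime(2020, 6, 12, 10, 0, 20),
    datetime(2020, 6, 12, 10, 3, 50),
    datetime(2020, 6, 12, 10, 4, 0),
]

for start, end, duration in find_gaps(sample):
    print(f"no entries between {start} and {end} ({duration})")
```

The reported start and end times can then be compared with timestamps from dmesg or journalctl on the host.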

So far, I have tested a good variety of Kernel versions:

  [1]   linux-4.19.120-gentoo <- works fine
  [2]   linux-4.20.17-gentoo <- works fine
  [3]   linux-5.0.0-gentoo <- randomly freezes as described
  [4]   linux-5.0.21-gentoo <- randomly freezes as described
  [5]   linux-5.1.21-gentoo <- can't even boot guest, getting freeze during very early boot
  [6]   linux-5.2.20-gentoo <- qemu won't even start, complaining about KVM suberror 1
  [7]   linux-5.3.18-gentoo <- randomly freezes as described
  [8]   linux-5.4.38-gentoo <- randomly freezes as described

My takeaway is that something went wrong in 5.0.0 and has not been fixed since.
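
The table above can be reduced to a "last good / first bad" pair, which is the window a later git bisect would have to cover. Here is a small Python sketch of that reduction, assuming that the two versions which failed differently (5.1.21 and 5.2.20) are simply counted as bad. The names RESULTS, version_key and bisect_window are illustrative only.

```python
# Results copied from the table above; True means the guest ran without freezes.
# 5.1.21 and 5.2.20 failed in other ways and are counted as bad here (an assumption).
RESULTS = {
    "4.19.120": True,
    "4.20.17": True,
    "5.0.0": False,
    "5.0.21": False,
    "5.1.21": False,
    "5.2.20": False,
    "5.3.18": False,
    "5.4.38": False,
}


def version_key(version):
    """Turn '5.0.21' into (5, 0, 21) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def bisect_window(results):
    """Return (last_good, first_bad), or None if there is no clean boundary.

    A clean boundary means both good and bad results exist and every good
    version is older than every bad version.
    """
    ordered = sorted(results, key=version_key)
    good = [v for v in ordered if results[v]]
    bad = [v for v in ordered if not results[v]]
    if not good or not bad:
        return None
    if version_key(good[-1]) > version_key(bad[0]):
        return None
    return good[-1], bad[0]


print(bisect_window(RESULTS))
```

With the results listed here the window is 4.20.17 to 5.0.0, which matches the takeaway above.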

I have not yet tried to bisect the GIT source, but might give it a go, time permitting.

I am using a bare qemu-system-x86_64 command to rule out virt-manager problems. PCIe devices are attached via separate pcie-root-port devices. The guest boots with OVMF UEFI (sys-firmware/edk2-ovmf-201905) with Secure Boot enabled (disabling Secure Boot makes no difference). I also did a clean Windows 10 install to rule out any issues with the guest OS itself, but the problem persisted. I have tried the Windows-provided GPU drivers as well as the latest ones from nVidia. I am using the "host" CPU model for QEMU.

A similar problem was reported on Reddit too; the solution there was to downgrade: https://www.reddit.com/r/VFIO/comments/b1xx0g/windows_10_qemukvm_freezes_after_50x_kernel_update/

Host hardware:
Motherboard: ASUS WS X299 SAGE
CPU: Intel i9-9940x
Guest GPU: nVidia GTX 1650
Host GPU: AMD Radeon PRO WX 3100
RAM: 64 GB (4x16 GB) DDR4 2666MHz
SSD: Samsung 970 EVO Plus
PCIe adapter: StarTech PEXUSB3S3GE 3xUSB3.0 + USB Realtek Gigabit network combo adapter
Guest OS: Windows 10 Professional (1909)
QEMU version: 4.2.0

qemu options used:
-name Microsoft Windows 10 Professional
-M q35,kernel_irqchip=on,vmport=off,accel=kvm,mem-merge=off
-nodefaults
-display none
-vga none
-net none
-nographic
-monitor unix:/run/qemu/win10.sock,server,nowait
-pidfile /run/qemu/win10.pid
-cpu host,kvm=off
-smp sockets=1,cores=6,threads=2
-m size=16G
-drive if=pflash,format=raw,readonly,file=/usr/share/edk2-ovmf/OVMF_CODE.secboot.fd
-drive if=pflash,format=raw,file=/usr/share/edk2-ovmf/OVMF_VARS.secboot.fd
-rtc base=localtime
-device pcie-root-port,id=port0.0,bus=pcie.0,chassis=0,slot=0,addr=1.0
-device vfio-pci,host=19:0.0,multifunction=on,bus=port0.0,addr=0.0
-device vfio-pci,host=19:0.1,bus=pcie.0,bus=port0.0,addr=0.1
-device pcie-root-port,id=port0.2,bus=pcie.0,chassis=0,slot=2
-device vfio-pci,host=1a:0.0,bus=port0.2
-device pcie-root-port,id=port0.5,bus=pcie.0,chassis=0,slot=5
-device vfio-pci,host=b3:0.0,bus=port0.5
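
One of the vfio-pci entries above gives bus= twice (bus=pcie.0 and bus=port0.0). Whether that matters for this bug is not known, but repeated keys like this are easy to miss by eye. A minimal Python sketch that flags them follows; duplicate_keys is an illustrative helper name, and the simple comma split deliberately ignores QEMU's ",," escape for literal commas, which these values do not use.

```python
from collections import Counter


def duplicate_keys(option_value):
    """Return the keys that appear more than once in a comma-separated option value."""
    keys = [
        field.split("=", 1)[0]
        for field in option_value.split(",")
        if "=" in field
    ]
    counts = Counter(keys)
    return sorted(key for key, count in counts.items() if count > 1)


# The vfio-pci -device values from the option list above.
DEVICES = [
    "vfio-pci,host=19:0.0,multifunction=on,bus=port0.0,addr=0.0",
    "vfio-pci,host=19:0.1,bus=pcie.0,bus=port0.0,addr=0.1",
    "vfio-pci,host=1a:0.0,bus=port0.2",
    "vfio-pci,host=b3:0.0,bus=port0.5",
]

for value in DEVICES:
    repeated = duplicate_keys(value)
    if repeated:
        print(f"{value}: repeated keys {repeated}")
```

The sketch only reports the repetition; it makes no claim about which of the two values QEMU ends up using.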

I will try lekto's suggestion and report back any progress.
Comment 4 Garry Filakhtov 2020-06-27 08:21:34 UTC
Okay, I have played with all of this a bit further. I managed to get freezes on linux-4.19.120-gentoo as well, after using CPU pinning together with the RR scheduling policy at priority 1 for all vCPU threads.
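
For anyone trying to reproduce this, the pinning plus RR setup described above could be applied roughly as in the Python sketch below. It is not the exact method used here, only an illustration under several assumptions: QEMU is started with debug-threads=on in its -name option so that vCPU threads are named "CPU n/KVM" (the command line in comment 3 does not set this), the script runs as root, and the host CPU list passed in is a hypothetical choice. The helper names vcpu_index, vcpu_threads and pin_and_prioritise are made up for this example.

```python
import os
import re

# Thread names QEMU assigns to vCPU threads when debug-threads=on is set
# (an assumption; the original command line does not enable it).
VCPU_NAME = re.compile(r"^CPU (\d+)/KVM$")


def vcpu_index(thread_name):
    """Return the vCPU number encoded in a thread name, or None for other threads."""
    match = VCPU_NAME.match(thread_name.strip())
    return int(match.group(1)) if match else None


def vcpu_threads(qemu_pid):
    """Map vCPU index to thread ID by reading /proc/<pid>/task/*/comm."""
    threads = {}
    task_dir = f"/proc/{qemu_pid}/task"
    for tid in os.listdir(task_dir):
        with open(os.path.join(task_dir, tid, "comm")) as comm:
            index = vcpu_index(comm.read())
        if index is not None:
            threads[index] = int(tid)
    return threads


def pin_and_prioritise(qemu_pid, host_cpus, priority=1):
    """Pin vCPU n to host_cpus[n] and switch it to SCHED_RR at the given priority.

    Needs root or CAP_SYS_NICE. host_cpus is a hypothetical list chosen by the caller.
    """
    for index, tid in sorted(vcpu_threads(qemu_pid).items()):
        os.sched_setaffinity(tid, {host_cpus[index]})
        os.sched_setscheduler(tid, os.SCHED_RR, os.sched_param(priority))
```

With the -smp setting from comment 3 (6 cores, 2 threads) there would be 12 vCPU threads, so host_cpus would need 12 entries, and the PID could be read from the pidfile given on the command line (/run/qemu/win10.pid).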

After removing commit 47c61b3955cf712cadfc25635bf9bc174af030ea the system does indeed seem to work without freezing. I will continue testing and post updates as I get more information.
