Bug 199727

Summary: CPU freezes in KVM guests during high IO load on host
Product: Virtualization
Reporter: Gergely Kovacs (gkovacs)
Component: kvm
Assignee: virtualization_kvm
Status: NEW
Severity: high
Priority: P1
CC: bjo, cmultari, devzero, eliphas, gkovacs, jake, kernelbugs, losev45, m.keller, mgabriel, phren, rm, stefanha, yann.papouin, zimmer
Hardware: x86-64
OS: Linux
Kernel Version: 3.x, 4.x, 5.x
Regression: No

Description Gergely Kovacs 2018-05-14 20:27:34 UTC
Proxmox is a Debian-based virtualization distribution with an Ubuntu LTS-based kernel.

When there is high IO load on Proxmox v4 and v5 virtualization hosts (during vzdump backups, restores, migrations, or simply reading/writing huge files on the same storage where the KVM guests are stored), these guests show the following symptoms:

- CPU soft lockups
- rcu_sched detected stalls
- task blocked, stack dump
- huge latency in network services (even pings time out for several seconds)
- lost network connectivity (Windows guests often lose Remote Desktop connections)

The issue affects KVM guests with VirtIO, VirtIO SCSI and IDE disks, with different guest error messages. Windows and Debian 7/8 guests are affected the worst; Debian 9 and Ubuntu guests are somewhat less sensitive.

The issue affects many hardware configurations: we have tested and found it on single and dual socket Westmere, Sandy Bridge and Ivy Bridge Core i7 and Xeon based systems.

The issue is present on many local storage setups, regardless of whether HDDs or SSDs are used, and was confirmed on the following configurations:
- LVM / ext4 with qcow2 guests (tested on ICH- and Adaptec-connected single HDD, Adaptec HW mirror HDD and Adaptec HW RAID10 HDD)
- ZFS with qcow2 or zvol guests (tested on ICH- and Adaptec-connected single HDD & SSD, ZFS mirror & RAID10 & RAIDZ1, HDD & SSD)


REPRODUCTION
1. Install Proxmox 4 or 5 on bare metal (ZFS or LVM+ext4, HDD or SSD, single disk or array)
2. Create Windows and Debian 7 or 8 KVM guests on local storage (with IDE or VirtIO disks, VirtIO network)
3. Start actively polling guest network services from the network (ping, Apache load test, Remote Desktop, etc.) and observe the guest consoles
4. Start backing up VMs with the built-in backup function to the same local storage or to an NFS share on the network
5. Restore VM backups from local storage or from an NFS share on the network (or simply copy huge files to local storage from an external disk or the network)

During the backup and restore operations, KVM guests will show the symptoms above.
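
For step 5, the kind of host-side write load described above can be generated with something as simple as the following (a sketch; the target path is just an example of a local Proxmox storage directory):

# write a large file through the host page cache onto the storage holding the guests
dd if=/dev/zero of=/var/lib/vz/dump/ioload.bin bs=1M count=32768 status=progress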


MITIGATION
If vm.dirty_ratio and vm.dirty_background_ratio are set to very low values on the host (2 and 1, respectively), the problem is somewhat less severe.
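
The values above can be applied on the host like this (a minimal sketch; persist them via /etc/sysctl.d/ if they help):

# apply immediately
sysctl -w vm.dirty_ratio=2
sysctl -w vm.dirty_background_ratio=1

# or make them persistent, e.g. in /etc/sysctl.d/90-dirty.conf:
# vm.dirty_ratio = 2
# vm.dirty_background_ratio = 1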


LINKS
Many users confirmed this issue on different platforms (ZFS+zvol, ZFS+QCOW2, ext4+LVM) over the past few years:
https://forum.proxmox.com/threads/kvm-guests-freeze-hung-tasks-during-backup-restore-migrate.34362/
https://forum.proxmox.com/threads/virtio-task-xxx-blocked-for-more-than-120-seconds.25835/
https://forum.proxmox.com/threads/frequent-cpu-stalls-in-kvm-guests-during-high-io-on-host.30702/

We also filed a bug report in the Proxmox Bugzilla, but this bug is most likely in QEMU/KVM:
https://bugzilla.proxmox.com/show_bug.cgi?id=1453
Comment 1 Roland Kletzing 2021-08-20 17:25:26 UTC
I can confirm there is a severe issue here, which renders KVM/Proxmox virtually unusable when hosts are under significant IO load, i.e. when there is lots of write IO on the host or in the guests.

Whenever the disk the VMs reside on is getting saturated, the VMs start going nuts and hiccuping, i.e. severe latency is added to the guests.

They behave "jumpy": you can't use the console, or it is totally sluggish, ping goes up above 10 seconds, the kernel throws "BUG: soft lockup - CPU#X stuck for XXs!" and so on...

I have found that with cache=writeback for the virtual machine disks which reside on the HDD with the heavy IO, things go much, much more smoothly.

Without cache=writeback, a live migration/move could make a guest go crazy.

Now, with cache=writeback, I could do 3 live migrations in parallel, even with lots of IO inside the virtual machines, and even with an additional writer/reader in the host OS (dd from/to the disk); ping to the guests mostly stays <5ms.

So, to me this problem appears to be related to disk IO saturation, and probably to sync writes; what else could explain why cache=writeback helps so much?
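
For reference, in Proxmox the cache mode can be switched per disk with qm; a sketch under the assumption of a SCSI disk on a "local-zfs" storage (VM id, bus and volume are examples, keep your existing volume and options):

qm set 116 --scsi0 local-zfs:vm-116-disk-0,cache=writeback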
Comment 2 Roland Kletzing 2021-08-20 17:26:54 UTC
Oh, and it still applies to kernel 5.4.124-1-pve (which is the kernel for the current Proxmox 6.4).
Comment 3 Roland Kletzing 2021-08-21 08:53:17 UTC
I had a look at the kvm process with strace.

With the virtual disk caching option set to "Default (no cache)", kvm does IO submission via io_submit() instead of pwritev(), and apparently that can be a long blocking call.

Whenever the VM hiccups and ping times go through the roof, I see a long blocking io_submit() operation in progress.

This looks like a "design issue" to me, and "Default (no cache)" thus seems to be a bad default setting.

see:
https://lwn.net/Articles/508064/

and 
https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux
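
For anyone who wants to repeat this observation, a strace invocation roughly like the following shows how long each submission call blocks (the PID placeholder is the kvm process of the affected guest):

# -f follows all threads, -T prints the time spent inside each syscall
strace -f -T -e trace=io_submit,pwritev,fdatasync -p <kvm-pid>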
Comment 4 Roland Kletzing 2021-08-22 12:11:30 UTC
http://blog.vmsplice.net/2015/08/asynchronous-file-io-on-linux-plus-ca.html

"However, the io_submit(2) system call remains a treacherous ally in the quest for asynchronous file I/O. I don't think much has changed since 2009 in making Linux AIO the best asynchronous file I/O mechanism.

The main problem is that io_submit(2) waits for I/O in some cases. It can block! This defeats the purpose of asynchronous file I/O because the caller is stuck until the system call completes. If called from a program's event loop, the program becomes unresponsive until the system call returns. But even if io_submit(2) is invoked from a dedicated thread where blocking doesn't matter, latency is introduced to any further I/O requests submitted in the same io_submit(2) call.

Sources of blocking in io_submit(2) depend on the file system and block devices being used. There are many different cases but in general they occur because file I/O code paths contain synchronous I/O (for metadata I/O or page cache write-out) as well as locks/waiting (for serializing operations). This is why the io_submit(2) system call can be held up while submitting a request.

This means io_submit(2) works best on fully-allocated files, volumes, or block devices. Anything else is likely to result in blocking behavior and cause poor performance."
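
One practical consequence of that last paragraph: fully preallocating the guest image when it is created reduces the allocation work that io_submit(2) can block on. A sketch (file name and size are arbitrary examples):

# qcow2 with both metadata and data preallocated up front
qemu-img create -f qcow2 -o preallocation=full /var/lib/vz/images/100/vm-100-disk-0.qcow2 64G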
Comment 5 Roland Kletzing 2021-08-29 14:58:47 UTC
Apparently, things got even worse with Proxmox 7, as it seems to be using async IO (via io_uring) by default for all virtual disk IO, i.e. the workaround "cache=writeback" does not work for me anymore.

If I set aio=threads by directly editing the VM configuration, things run smoothly again.

So, I am still curious:

why do VMs get severe hiccups with async IO (via io_submit or io_uring) when the storage comes under load, and why does that NOT happen with the described workaround?
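
For reference, "directly editing VM configuration" here means adding aio=threads to the drive line in /etc/pve/qemu-server/<vmid>.conf, roughly like this (storage and volume name are made-up examples):

scsi0: local-zfs:vm-116-disk-0,cache=writeback,aio=threads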
Comment 6 Roland Kletzing 2022-01-13 12:09:08 UTC
What I have also seen are VM freezes when the backup runs on our GitLab VM server, which is apparently related to fsync/fdatasync sync writes.

At least for ZFS there is a write starvation issue: large sync writes may starve small ones, as there apparently is no fair scheduling for them, see

https://github.com/openzfs/zfs/issues/10110
Comment 7 Yann Papouin 2022-02-10 13:22:52 UTC
(In reply to Roland Kletzing from comment #5)
> apparently, things got even worse with proxmox 7, as it seems it's using
> async io (via io_uring) by default for all virtual disk IO, i.e. the
> workaround "cache=writeback" does not work for me anymore.
> 
> if i set aio=threads by directly editing VM configuration, things run
> smoothly again.

Are you using "cache=writeback" with "aio=threads"?
For me, using "aio=threads" reduces the VM freezes (high CPU load), but they still happen under high disk IO (backup/disk move).
Comment 8 Roland Kletzing 2022-02-12 00:13:46 UTC
Yes.

I found the following interesting information. I think this explains a LOT.

https://docs.openeuler.org/en/docs/20.03_LTS/docs/Virtualization/best-practices.html#i-o-thread-configuration

I/O Thread Configuration
Overview

By default, QEMU main threads handle backend VM read and write operations on the KVM. This causes the following issues:

    VM I/O requests are processed by a QEMU main thread. Therefore, the single-thread CPU usage becomes the bottleneck of VM I/O performance.
    The QEMU global lock (qemu_global_mutex) is used when VM I/O requests are processed by the QEMU main thread. If the I/O processing takes a long time, the QEMU main thread will occupy the global lock for a long time. As a result, the VM vCPU cannot be scheduled properly, affecting the overall VM performance and user experience.

You can configure the I/O thread attribute for the virtio-blk disk or virtio-scsi controller. At the QEMU backend, an I/O thread is used to process read and write requests of a virtual disk. The mapping relationship between the I/O thread and the virtio-blk disk or virtio-scsi controller can be a one-to-one relationship to minimize the impact on the QEMU main thread, enhance the overall I/O performance of the VM, and improve user experience.
Comment 9 Roland Kletzing 2022-02-12 10:26:39 UTC
https://qemu-devel.nongnu.narkive.com/I59Sm5TH/lock-contention-in-qemu
<snip>
I find the timeslice of vCPU thread in QEMU/KVM is unstable when there
are lots of read requests (for example, read 4KB each time (8GB in
total) from one file) from Guest OS. I also find that this phenomenon
may be caused by lock contention in QEMU layer. I find this problem
under following workload.
<snip>
Yes, there is a way to reduce jitter caused by the QEMU global mutex:

qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0

Now the ioeventfd and thread pool completions will be processed in
iothread0 instead of the QEMU main loop thread. This thread does not
take the QEMU global mutex so vcpu execution is not hindered.

This feature is called virtio-blk dataplane.
<snip>


I tried "VirtIO SCSI single" with "aio=threads" and "iothread=1" in Proxmox. After that, even with totally heavy read/write IO inside 2 VMs (located on the same spinning HDD, on top of ZFS lz4 + zstd datasets and qcow2), with severe write starvation (some ioping >>30s), and even while live-migrating both VM disks in parallel to another ZFS dataset on the same HDD, I get absolutely NO jitter in ping anymore. Ping to both VMs stays constantly at <0.2ms.

From the kvm process command line:
-object iothread,id=iothread-virtioscsi0  
-device virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0 
-drive file=/hddpool/vms-files-lz4/images/116/vm-116-disk-3.qcow2,if=none,id=drive-scsi0,cache=writeback,aio=threads,format=qcow2,detect-zeroes=on 
-device scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100
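
In Proxmox terms, the command line above corresponds to settings roughly like these (VM id, storage and volume are examples; the exact option spelling may differ between Proxmox versions):

qm set 116 --scsihw virtio-scsi-single
qm set 116 --scsi0 local-zfs:vm-116-disk-3,iothread=1,aio=threads,cache=writeback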
Comment 10 Roland Kletzing 2022-02-24 18:58:42 UTC
Nobody listening?

What should we do with this bug report now?
Comment 11 Kai Zimmer 2022-02-25 09:49:26 UTC
Thanks for your research, Roland Kletzing. I'm also a Proxmox user.

We spent a lot of time troubleshooting this problem and, after years, finally invested in a decent all-flash storage system; now the problem has disappeared here. But this cannot be considered a solution for all affected Proxmox users.

I fear that the error description is discouraging for any KVM developer:
"Proxmox is a Debian based virtualization distribution with an Ubuntu LTS based kernel."

Maybe test a recent vanilla kernel version and add it to the bug metadata to get more attention?
Comment 12 Stefan Hajnoczi 2022-03-02 13:33:45 UTC
Hi,
I contribute to QEMU and have encountered similar issues in the past. There are QEMU configuration options that should allow you to avoid this issue, and it sounds like you have found options that work for you.

If io_submit(2) is blocking with aio=native, try aio=io_uring. If that is not available (older kernels), use aio=threads to work around this particular problem.

I recommend cache=none. Although cache=writeback can shift the problem around, it doesn't solve it, and it leaves the VMs open to unpredictable performance (including I/O stalls like this) due to host memory pressure and host page cache I/O.

Regarding the original bug report, it's a limitation of that particular QEMU configuration. I don't think anything will be done about it in the Linux kernel. Maybe Proxmox can adjust the QEMU configuration to avoid it.
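
On a plain QEMU command line, these recommendations map to the cache= and aio= suboptions of -drive, for example (image name and ids are placeholders):

-drive file=test.img,format=raw,if=none,id=drive0,cache=none,aio=io_uring
# or, on kernels without io_uring support:
-drive file=test.img,format=raw,if=none,id=drive0,cache=none,aio=threads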
Comment 13 Roland Kletzing 2022-03-07 19:01:16 UTC
Hello, thanks - aio=io_uring is no better; the only real way to get a stable system is virtio-scsi-single/iothreads=1/aio=threads.

The question is why aio=native and io_uring have issues and threads does not...
Comment 14 Stefan Hajnoczi 2022-03-08 06:20:44 UTC
(In reply to Roland Kletzing from comment #13)
> hello, thanks - aio=io_uring is no better, the only real way to get to a
> stable system is virtio-scsi-single/iothreads=1/aio=threads
> 
> the question is why aio=native and io_uring has issues and threads has not...

Are you using cache=none with io_uring, and is the io_uring_enter(2) syscall blocking for a long period of time?

aio=threads avoids softlockups because the preadv(2)/pwritev(2)/fdatasync(2) syscalls run in worker threads that don't take the QEMU global mutex. Therefore vcpu threads can execute even when I/O is stuck in the kernel due to a lock.

io_uring should avoid that problem too because it is supposed to submit I/O truly asynchronously.
Comment 15 Roland Kletzing 2022-03-08 08:01:19 UTC
Yes, I was using cache=none, and io_uring also caused issues.

>aio=threads avoids softlockups because the preadv(2)/pwritev(2)/fdatasync(2)
> syscalls run in worker threads that don't take the QEMU global mutex. 
>Therefore vcpu threads can execute even when I/O is stuck in the kernel due to
>a lock.

Yes, it was a long search/journey to get to this information and these parameters...

Regarding io_uring: after Proxmox enabled it as the default, it was rolled back again after some issues had been reported.

Have a look at:
https://github.com/proxmox/qemu-server/blob/master/debian/changelog

Maybe it's not ready for prime time yet!?

-- Proxmox Support Team <support@proxmox.com>  Fri, 30 Jul 2021 16:53:44 +0200
qemu-server (7.0-11) bullseye; urgency=medium
<snip>
  * lvm: avoid the use of io_uring for now
<snip>
-- Proxmox Support Team <support@proxmox.com>  Fri, 23 Jul 2021 11:08:48 +0200
qemu-server (7.0-10) bullseye; urgency=medium
<snip>
  * avoid using io_uring for drives backed by LVM and configured for write-back
    or write-through cache
<snip>
 -- Proxmox Support Team <support@proxmox.com>  Mon, 05 Jul 2021 20:49:50 +0200
qemu-server (7.0-6) bullseye; urgency=medium
<snip>
  * For now do not use io_uring for drives backed by Ceph RBD, with KRBD and
    write-back or write-through cache enabled, as in that case some polling/IO
    may hang in QEMU 6.0.
<snip>
Comment 16 Stefan Hajnoczi 2022-03-08 08:26:05 UTC
(In reply to Roland Kletzing from comment #15)

Changelog messages mention cache=writethrough and cache=writeback,
which are both problematic because host memory pressure will interfere
with guest performance. That is probably not an issue with io_uring
per se, just another symptom of using cache=writeback/writethrough in
cases where it's inappropriate.

If you have trace data showing io_uring_enter(2) hanging with
cache=none then Jens Axboe and other io_uring developers may be able
to help resolve that.

Stefan
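
For anyone still hitting this with cache=none and aio=io_uring, a capture along these lines (PID and output path are placeholders) would be the kind of trace data worth attaching for the io_uring developers:

# -ttt adds absolute timestamps, -T the time spent inside each syscall
strace -f -ttt -T -e trace=io_uring_enter -o /tmp/io_uring_enter.trace -p <kvm-pid>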
Comment 17 Chris M 2022-03-26 15:17:33 UTC
Experiencing the same issue with Proxmox 7 under high IO load. To achieve the highest stability, are you setting all VMs to Async IO=threads / IO Thread / VirtIO SCSI single, or just the machines with the highest load? I moved our higher-load machines to those settings but still experience the issue at times.
Comment 18 Gergely Kovacs 2022-04-06 23:52:25 UTC
Thank you Roland Kletzing for your exhaustive investigation, and Stefan Hajnoczi for your insightful comments. This is a problem that has been affecting us (and many, many users of Proxmox, and likely of vanilla KVM) for more than a decade, yet the Proxmox developers were unable to solve it or even reproduce it (despite the large number of forum threads and bugs filed), which is why I created this bug report 4 years ago.

It looks like we are closing in: the QEMU global mutex could be the real culprit, as in our case the problems were only mostly gone after moving all our VM storage to NVMe (increasing IO bandwidth by a LOT), but fully gone after setting VirtIO SCSI Single / iothread=1 / aio=threads on all our KVM guests. For many years, VM migrations or restores could render other VMs on the same host practically unusable for the duration of the heavy IO; now these operations can be done safely.

I will experiment with io_uring in the near future and report back my findings. I will leave the status at NEW, since I reckon attention should be given to the io_uring code path to achieve the same stability as threaded IO.
Comment 19 Roland Kletzing 2022-11-29 10:03:58 UTC
@chris

>To achieve the highest stability, are you setting all VMs to async IO
>Threads/IO Thread/Virtio SCSI Single, or just the machines with the 
>highest load?

I have set ALL our virtual machines to this.

@gergely, any news on this? Should we close this ticket?
Comment 20 Marco Gabriel 2024-02-01 13:15:44 UTC
Thanks to all, especially Roland, Stefan and Gergely, for this exhaustive bug research.

Please keep this ticket open as this problem still persists in Proxmox 8.x with kernel 6.5 and it seems to get worse.

Thanks to all,
Marco
Comment 21 Roland Kletzing 2024-02-01 13:25:47 UTC
Do you have more details (e.g. a Proxmox forum thread) on this?
Comment 22 Marco Gabriel 2024-02-01 13:46:59 UTC
I have several sources: at least multiple clients suffering from possibly the same problem.

As we're in touch with Proxmox Support, I can't directly point to a forum message, but I can point to probably related/identical issues in other trackers:

- https://github.com/virtio-win/kvm-guest-drivers-windows/issues/756
- https://forum.proxmox.com/threads/redhat-virtio-developers-would-like-to-coordinate-with-proxmox-devs-re-vioscsi-reset-to-device-system-unresponsive.139160/ (I guess you know this thread already)
Comment 23 Marco Gabriel 2024-02-01 13:51:03 UTC
(In reply to Roland Kletzing from comment #13)
> hello, thanks - aio=io_uring is no better, the only real way to get to a
> stable system is virtio-scsi-single/iothreads=1/aio=threads
> 
> the question is why aio=native and io_uring has issues and threads has not...

Just for reference: using aio=threads doesn't help on our lab and customer setups (Proxmox/Ceph HCI); we still see VM freezes after several minutes when I/O load is high.
Comment 24 Gergely Kovacs 2024-02-01 19:56:53 UTC
To be clear, in our experience aio=threads and iothread=1 solved all VM freezes on local storage, regardless of whether the VM runs from a SATA/SAS HDD or SSD or an NVMe SSD, so it's a great mitigating step. Ceph is not solved for us either; therefore, using Ceph is still not recommended for KVM storage until this bug is actually fixed (instead of mitigated).