Bug 218259

Summary: High latency in KVM guests
Product:     Virtualization
Component:   kvm
Status:      NEW
Severity:    normal
Priority:    P3
Hardware:    Intel
OS:          Linux
Regression:  No
Reporter:    Joern Heissler (kernelbugs2012)
Assignee:    virtualization_kvm
CC:          devzero, f.weber, seanjc
Attachments: dmesg, lspci -v

Description Joern Heissler 2023-12-12 16:37:56 UTC
Hello,

some of my guests are experiencing heavy latency, e.g.:

* SSH sessions get stuck, sometimes for several seconds, then continue.
* PING replies can take several seconds.
* t0 = time(); sleep(1); t1 = time(); print(t1 - t0); can show several seconds (see the shell sketch after this list).
* Various services with small timeouts throw errors.
* The guest system clock appears to work correctly.
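
A minimal shell sketch of that timing check from inside an affected guest (illustrative only; the exact numbers vary):

        # measure wall-clock time around a 1-second sleep; on an affected guest
        # the reported delta occasionally jumps to several seconds
        while true; do
            t0=$(date +%s%N)
            sleep 1
            t1=$(date +%s%N)
            echo "slept for $(( (t1 - t0) / 1000000 )) ms"
        done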

Sometimes this happens once or twice an hour, sometimes not for many hours. Usually the lag is
well below a second, which is why I didn't notice it earlier.
On highly affected hosts it may happen much more often, and often in clusters, e.g. many times
within a span of a few minutes.

The affected hosts run Debian 12; until Debian 11 there was no trouble.
I git-bisected the kernel and the commit which appears to somehow cause the trouble is:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f47e5bbbc92f5d234bbab317523c64a65b6ac4e2

CPU model: Intel(R) Xeon(R) CPU E5-2695 v4 (See below for /proc/cpuinfo)
Host kernels:
    * Debian, linux-image-6.1.0-15-amd64 (6.1.66-1), affected
    * Debian, linux-image-6.5.0-0.deb12.4-amd64 (6.5.10-1~bpo12+1), affected
    * Vanilla, v6.7-rc5, affected
    * Vanilla, v5.17-rc3-349-gf47e5bbbc92f, first affected commit
    * Vanilla, v5.17-rc3-348-ga80ced6ea514, last non-affected commit
    * Vanilla, several other versions during bisecting
Host kernel arch: x86_64
Host RAM: 512 GiB.
Host storage: HW-Raid6 over spinning disks.
Guest: Debian 11, x86_64. Debian Kernels linux-image-5.10.0-26-amd64 (5.10.197-1) and linux-image-6.1.0-0.deb11.13-amd64 (6.1.55-1~bpo11+1).
Qemu command line: See below.
The problem does *not* go away when appending "kernel_irqchip=off" to the -machine parameter.
The problem *does* go away with "-accel tcg", even though the guest becomes much slower.

All affected guests run kubernetes with various workloads: mostly Java, databases like postgres,
and a few legacy 32-bit containers.

The best method I found to manually trigger the problem was to drain other kubernetes nodes,
causing many pods to start at the same time on the affected guest. But the problem still occurs
even after the initial load has settled, when there is little I/O and the guest is about 80% idle.

The problem occurs whether the host runs only a single guest or lots of other (non-kubernetes) guests.

Other (i.e. non-kubernetes) guests don't appear to be affected, but they have far fewer resources and usually less load.

I adjusted several qemu parameters, e.g. passing through the host CPU, using a different SMP
layout (2 sockets, 4 cores each, 2 threads each), removing the memory balloon, adding an I/O
thread for the disk, and setting the latest supported machine type. None of those resolved the problem.

There are no kernel logs in the host or the guest, and no userspace logs in the host.

AMD hosts with SSDs seem to be less severely affected, but they still are.

Sadly I couldn't think of any good way to trigger the problem on, say, a blank Debian guest.

If I can provide additional information or can run additional tests, please let me know!

Many thanks in advance!
Jörn Heissler

---------

/usr/bin/qemu-system-x86_64
-name guest=k8s-worker6,debug-threads=on
-S
-object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-k8s-worker6/master-key.aes"}
-machine pc-i440fx-3.1,usb=off,dump-guest-core=off,memory-backend=pc.ram
-accel kvm
-cpu qemu64
-m 131072
-object {"qom-type":"memory-backend-ram","id":"pc.ram","size":137438953472}
-overcommit mem-lock=off
-smp 16,sockets=16,cores=1,threads=1
-uuid 2c220b5b-9d0a-4d41-a13e-cd78c5551b35
-no-user-config
-nodefaults
-chardev socket,id=charmonitor,fd=30,server=on,wait=off
-mon chardev=charmonitor,id=monitor,mode=control
-rtc base=utc
-no-shutdown
-boot menu=on,strict=on
-device {"driver":"piix3-usb-uhci","id":"usb","bus":"pci.0","addr":"0x1.0x2"}
-blockdev {"driver":"host_device","filename":"/dev/vg-kvm/k8s-worker6_root","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}
-blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}
-device {"driver":"virtio-blk-pci","bus":"pci.0","addr":"0x4","drive":"libvirt-1-format","id":"virtio-disk0","bootindex":1,"write-cache":"on"}
-netdev {"type":"tap","fd":"32","vhost":true,"vhostfd":"34","id":"hostnet0"}
-device {"driver":"virtio-net-pci","netdev":"hostnet0","id":"net0","mac":"52:54:00:e7:56:ae","bus":"pci.0","addr":"0x3"}
-chardev pty,id=charserial0
-device {"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0}
-audiodev {"id":"audio1","driver":"none"}
-vnc 127.0.0.1:0,audiodev=audio1
-device {"driver":"cirrus-vga","id":"video0","bus":"pci.0","addr":"0x2"}
-incoming defer
-device {"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.0","addr":"0x5"}
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
-msg timestamp=on

---------

processor	: 71
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
stepping	: 1
microcode	: 0xb000040
cpu MHz		: 3300.000
cache size	: 46080 KB
physical id	: 1
siblings	: 36
core id		: 27
cpu cores	: 18
apicid		: 119
initial apicid	: 119
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit mmio_stale_data
bogomips	: 4199.23
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
Comment 1 Sean Christopherson 2023-12-13 16:54:45 UTC
On Tue, Dec 12, 2023, bugzilla-daemon@kernel.org wrote:
> The affected hosts run Debian 12; until Debian 11 there was no trouble.
> I git-bisected the kernel and the commit which appears to somehow cause the
> trouble is:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f47e5bbbc92f5d234bbab317523c64a65b6ac4e2

Huh.  That commit makes it so that KVM keeps non-leaf SPTEs, i.e. upper level page
table structures, when zapping/unmapping a guest memory range.  The idea is that
preserving paging structures will allow for faster unmapping (less work) and faster
repopulation if/when the guest faults the memory back in (again, less work to create
a valid mapping).

The only downside that comes to mind is that keeping upper level paging structures
will make it more costly to handle future invalidations as KVM will have to walk
deeper into the page tables before discovering more work that needs to be done.

> Qemu command line: See below.
> Problem does *not* go away when appending "kernel_irqchip=off" to the
> -machine
> parameter
> Problem *does* go away with "-accel tcg", even though the guest becomes much
> slower.

Yeah, that's expected, as that completely takes KVM out of the picture.

> All affected guests run kubernetes with various workloads, mostly Java,
> databases like postgres und a few legacy 32 bit containers.
> 
> Best method to manually trigger the problem I found was to drain other
> kubernetes nodes, causing many pods to start at the same time on the affected
> guest. But even when the initial load settled, there's little I/O and the
> guest is like 80% idle, the problem still occurs.
> 
> The problem occurs whether the host runs only a single guest or lots of other
> (non-kubernetes) guests.
> 
> Other (i.e. not kubernetes) guests don't appear to be affected, but those got
> way less resources and usually less load.

The affected flows are used only for handling mmu_notifier invalidations and for
edge cases related to non-coherent DMA.  I don't see any passthrough devices in
your setup, so that rules out the non-coherent DMA side of things.

A few things to try:

 1. Disable KSM (if enabled)

        echo 0 > /sys/kernel/mm/ksm/run

 2. Disable NUMA autobalancing (if enabled):

        echo 0 > /proc/sys/kernel/numa_balancing

 3. Disable KVM's TDP MMU.  On pre-v6.3 kernels, this can be done without having
    to reload KVM (or reboot the kernel if KVM is builtin).

        echo N > /sys/module/kvm/parameters/tdp_mmu

    On v6.3 and later kernels, tdp_mmu is a read-only module param and so needs
    to be disabled when loading kvm.ko or when booting the kernel (sketched below).
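
    For illustration, a sketch of the two usual ways to set such a read-only
    param (assuming kvm is built as a module, as on Debian kernels):

        # /etc/modprobe.d/kvm.conf -- takes effect the next time kvm.ko is loaded
        options kvm tdp_mmu=N

        # or, if KVM is built into the kernel, on the kernel command line
        kvm.tdp_mmu=N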

There are plenty more things that can be tried, but the above are relatively easy
and will hopefully narrow down the search significantly.

Oh, and one question: is your host kernel preemptible?
Comment 2 Joern Heissler 2023-12-14 07:14:57 UTC
Hi,

1. KSM is already disabled. Didn't try to enable it.
2. NUMA autobalancing was enabled on the host (value 1), not in the guest. When disabled, I can't see the issue anymore.
3. tdp_mmu was "Y", disabling it seems to make no difference.

So this might be related to NUMA. On older kernels, the flag is 1 as well.

There's one difference in the kernel messages that I hadn't noticed before. The newer one prints "pci_bus 0000:7f: Unknown NUMA node; performance will be reduced" (and the same again for bus ff). The older ones don't. No idea what this means or whether it's important, and I can't find info on the web regarding it.
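
For reference, a quick (sketched) check of which NUMA node the kernel assigned to the devices on that bus; -1 means "no node known":

        cat /sys/bus/pci/devices/0000:7f:*/numa_node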


I think the kernel is preemptible:

"uname -a" shows: "Linux vm123 6.1.0-15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.66-1 (2023-12-09) x86_64 GNU/Linux"

"grep -i" on the config shows:
CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

Attaching output of "dmesg" and "lspci -v". Perhaps there's something useful in there.
Comment 3 Joern Heissler 2023-12-14 07:15:56 UTC
Created attachment 305596 [details]
dmesg
Comment 4 Joern Heissler 2023-12-14 07:16:25 UTC
Created attachment 305597 [details]
lspci -v
Comment 5 Sean Christopherson 2023-12-18 17:07:06 UTC
On Thu, Dec 14, 2023, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=218259
> 
> --- Comment #2 from Joern Heissler (kernelbugs2012@joern-heissler.de) ---
> Hi,
> 
> 1. KSM is already disabled. Didn't try to enable it.
> 2. NUMA autobalancing was enabled on the host (value 1), not in the guest.
> When
> disabled, I can't see the issue anymore.

This is likely/hopefully the same thing Yan encountered[1].  If you are able to
test patches, the proposed fix[2] applies cleanly on v6.6 (note, I need to post a
refreshed version of the series regardless), any feedback you can provide would
be much appreciated.

KVM changes aside, I highly recommend evaluating whether or not NUMA
autobalancing is a net positive for your environment.  The interactions between
autobalancing and KVM are often less than stellar, and disabling autobalancing
is sometimes a completely legitimate option/solution.

[1] https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
[2] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com

> 3. tdp_mmu was "Y", disabling it seems to make no difference.

Hrm, that's odd.  The commit blamed by bisection was purely a TDP MMU change.
Did you relaunch VMs after disabling the module params?  While the module param
is writable, it's effectively snapshotted by each VM during creation, i.e. toggling
it won't affect running VMs.
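
A sketch of what "relaunch" means in practice, using the libvirt domain name from the QEMU
command line above (assuming the guest is managed by libvirt):

        echo N > /sys/module/kvm/parameters/tdp_mmu   # pre-v6.3, writable param
        virsh shutdown k8s-worker6                    # the new value only takes effect for
        virsh start k8s-worker6                       # VMs created after the change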

> So might be related to NUMA. On older kernels, the flag is 1 as well.
> 
> There's one difference in the kernel messages that I hadn't noticed before.
> The
> newer one prints "pci_bus 0000:7f: Unknown NUMA node; performance will be
> reduced" (same with ff again). The older ones don't. No idea what this means,
> if it's important, and can't find info on the web regarding it.

That was a new message added by commit ad5086108b9f ("PCI: Warn if no host bridge
NUMA node info"), which was first released in v5.5.  AFAICT, that warning is only
complaining about the driver code for PCI devices possibly running on the wrong
node.

However, if you are seeing that error on v6.1 or v6.6, but not v5.17, i.e. if the
message started showing up well after the printk was added, then it might be a
symptom of an underlying problem, e.g. maybe the kernel is botching parsing of
ACPI tables?

> I think the kernel is preemptible:

Ya, not fully preemptible (voluntary only), but the important part is that KVM
will drop mmu_lock if there is contention (which is a "requirement" for the bug
that Yan encountered).
Comment 6 Joern Heissler 2023-12-19 14:09:08 UTC
(In reply to Sean Christopherson from comment #5)

> This is likely/hopefully the same thing Yan encountered[1].  If you are able
> to
> test patches, the proposed fix[2] applies cleanly on v6.6 (note, I need to
> post a
> refreshed version of the series regardless), any feedback you can provide
> would
> be much appreciated.
> 
> [1] https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> [2] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com

I admit that I don't understand most of what's written in those threads.
I applied the two patches from [2] (excluding [3]) to v6.6 and it appears to solve the problem.

However, I haven't measured whether any of the changes/flags affect performance or cause any other problems. After about 1 hour of uptime it appears to be okay.

[3] https://lore.kernel.org/all/ZPtVF5KKxLhMj58n@google.com/
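
For reference, one common way to apply such a lore series on top of a tag, sketched with the b4 tool (the message-id is the one from link [2]; details may differ):

        git checkout -b tdp-mmu-fix v6.6
        b4 am 20230825020733.2849862-1-seanjc@google.com   # downloads the series as an mbox
        git am ./*.mbx                                     # apply it, then rebuild the kernel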


> KVM changes aside, I highly recommend evaluating whether or not NUMA
> autobalancing is a net positive for your environment.  The interactions
> between
> autobalancing and KVM are often less than stellar, and disabling
> autobalancing
> is sometimes a completely legitimate option/solution.

I'll have to evaluate multiple options for my production environment. Patching and building the kernel myself would only be a last resort, and it will probably take a while until Debian ships a patch for the issue. So maybe I'll disable NUMA balancing, or perhaps try to pin a VM's memory and CPUs to a single NUMA node.
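
A rough sketch of the pinning idea with libvirt (node and CPU numbers are illustrative,
assuming NUMA node 0 has enough memory and CPUs for this guest):

        # restrict the guest's memory allocation to NUMA node 0 (applies on next start)
        virsh numatune k8s-worker6 --mode strict --nodeset 0 --config
        # pin the 16 vCPUs to host CPUs on node 0 (the CPU list is illustrative)
        for vcpu in $(seq 0 15); do
            virsh vcpupin k8s-worker6 $vcpu 0-17 --config
        done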

> > 3. tdp_mmu was "Y", disabling it seems to make no difference.
> 
> Hrm, that's odd.  The commit blamed by bisection was purely a TDP MMU change.
> Did you relaunch VMs after disabling the module params?  While the module
> param
> is writable, it's effectively snapshotted by each VM during creation, i.e.
> toggling
> it won't affect running VMs.

It's quite possible that I did not restart the VM afterwards. I tried again, this time paying attention. Setting it to "N" *does* seem to eliminate the issue.


> > The newer one prints "pci_bus 0000:7f: Unknown NUMA node; performance will
> > be reduced" (same with ff again). The older ones don't.
> 
> That was a new message added by commit ad5086108b9f ("PCI: Warn if no host
> bridge
> NUMA node info"), which was first released in v5.5.

It seems I had looked at systems running older (< v5.5) kernels. On the ones with v5.10 the message is printed too.


Thanks a lot so far; I believe I've now got enough options to consider for my production environment.
Comment 7 Sean Christopherson 2023-12-19 15:16:08 UTC
On Tue, Dec 19, 2023, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=218259
> 
> --- Comment #6 from Joern Heissler (kernelbugs2012@joern-heissler.de) ---
> (In reply to Sean Christopherson from comment #5)
> 
> > This is likely/hopefully the same thing Yan encountered[1].  If you are able to
> > test patches, the proposed fix[2] applies cleanly on v6.6 (note, I need to
> > post a refreshed version of the series regardless), any feedback you can
> > provide would be much appreciated.
> > 
> > [1] https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> > [2] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com
> 
> I admit that I don't understand most of what's written in the those threads.

LOL, no worries, sometimes none of us understand what's written either ;-)

> I applied the two patches from [2] (excluding [3]) to v6.6 and it appears to
> solve the problem.
> 
> However I haven't measured how/if any of the changes/flags affect performance
> or if any other problems are caused. After about 1 hour uptime it appears to
> be
> okay.

Don't worry too much about additional testing.  Barring a straight up bug (knock
wood), the changes in those patches have a very, very low probability of
introducing unwanted side effects.

> > KVM changes aside, I highly recommend evaluating whether or not NUMA
> > autobalancing is a net positive for your environment.  The interactions
> > between
> > autobalancing and KVM are often less than stellar, and disabling
> > autobalancing
> > is sometimes a completely legitimate option/solution.
> 
> I'll have to evaluate multiple options for my production environment.
> Patching+Building the kernel myself would only be a last resort. And it will
> probably take a while until Debian ships a patch for the issue. So maybe
> disable the NUMA balancing, or perhaps try to pin a VM's memory+cpu to a
> single
> NUMA node.

Another viable option is to disable the TDP MMU, at least until the above patches
land and are picked up by Debian.  You could even reference commit 7e546bd08943
("Revert "KVM: x86: enable TDP MMU by default"") from the v5.15 stable tree if
you want a paper trail that provides some justification as to why it's ok to revert
back to the "old" MMU.

Quoting from that:

  : As far as what is lost by disabling the TDP MMU, the main selling point of
  : the TDP MMU is its ability to service page fault VM-Exits in parallel,
  : i.e. the main benefactors of the TDP MMU are deployments of large VMs
  : (hundreds of vCPUs), and in particular deployments that live-migrate such
  : VMs and thus need to fault-in huge amounts of memory on many vCPUs after
  : restarting the VM after migration.

In other words, the old MMU is not broken, e.g. it didn't suddenly become unusable
after 15+ years of use.  We enabled the newfangled TDP MMU by default because it
is the long-term replacement, e.g. it can scale to support use cases that the old
MMU falls over on, and we want to put the old MMU into maintenance-only mode.

But we are still ironing out some wrinkles in the TDP MMU, particularly for host
kernels that support preemption (the kernel has lock contention logic that is
unique to preemptible kernels).  And in the meantime, for most KVM use cases, the
old MMU is still perfectly serviceable.
Comment 8 Sean Christopherson 2024-01-11 15:55:08 UTC
On Mon, Dec 18, 2023, bugzilla-daemon@kernel.org wrote:
> > I think the kernel is preemptible:
> 
> Ya, not fully preemptible (voluntary only), but the important part is that
> KVM
> will drop mmu_lock if there is contention (which is a "requirement" for the
> bug
> that Yan encountered).

For posterity, the above is wrong.  Voluntary preemption isn't _supposed_ to enable
yielding of contended spinlocks/rwlocks, but due to a bug with dynamic preemption,
the behavior got enabled for all preempt models:

https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com
Comment 9 Joern Heissler 2024-01-16 13:29:03 UTC
(In reply to Sean Christopherson from comment #5)
> On Thu, Dec 14, 2023, bugzilla-daemon@kernel.org wrote:

> While the [tdp_mmu] module param is writable, it's effectively snapshotted by
> each VM during creation, i.e. toggling it won't affect running VMs.

How can I see which MMU a running VM is using?
Comment 10 Sean Christopherson 2024-01-17 01:05:19 UTC
On Tue, Jan 16, 2024, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=218259
> 
> --- Comment #9 from Joern Heissler (kernelbugs2012@joern-heissler.de) ---
> (In reply to Sean Christopherson from comment #5)
> > On Thu, Dec 14, 2023, bugzilla-daemon@kernel.org wrote:
> 
> > While the [tdp_mmu] module param is writable, it's effectively snapshotted by
> > each VM during creation, i.e. toggling it won't affect running VMs.
> 
> How can I see which MMU a running VM is using?

You can't, which in hindsight was a rather stupid thing for us to not make visible
somewhere.  As of v6.3, the module param is read-only, i.e. it's either enabled or
disabled for all VMs, so sadly it's unlikely that older kernels will see any kind
of "fix".
Comment 11 Friedrich Weber 2024-01-17 14:04:53 UTC
For the record: I was seeing a variant of this issue in combination with KSM, see [1] for more details and reproducer.

[1] https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/
Comment 12 Roland Kletzing 2024-11-05 00:27:00 UTC
Also keep an eye on https://bugzilla.kernel.org/show_bug.cgi?id=199727, as I/O may also be a cause of severe VM latency.