Bug 218259
| Summary: | High latency in KVM guests | | |
| --- | --- | --- | --- |
| Product: | Virtualization | Reporter: | Joern Heissler (kernelbugs2012) |
| Component: | kvm | Assignee: | virtualization_kvm |
| Status: | NEW | | |
| Severity: | normal | CC: | devzero, f.weber, seanjc |
| Priority: | P3 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | dmesg, lspci -v | | |
Description
Joern Heissler
2023-12-12 16:37:56 UTC
On Tue, Dec 12, 2023, bugzilla-daemon@kernel.org wrote:

> The affected hosts run Debian 12; until Debian 11 there was no trouble.
> I git-bisected the kernel and the commit which appears to somehow cause the trouble is:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f47e5bbbc92f5d234bbab317523c64a65b6ac4e2

Huh. That commit makes it so that KVM keeps non-leaf SPTEs, i.e. upper level page table structures, when zapping/unmapping a guest memory range. The idea is that preserving paging structures will allow for faster unmapping (less work) and faster repopulation if/when the guest faults the memory back in (again, less work to create a valid mapping).

The only downside that comes to mind is that keeping upper level paging structures will make it more costly to handle future invalidations, as KVM will have to walk deeper into the page tables before discovering more work that needs to be done.

> Qemu command line: See below.
> Problem does *not* go away when appending "kernel_irqchip=off" to the -machine parameter.
> Problem *does* go away with "-accel tcg", even though the guest becomes much slower.

Yeah, that's expected, as that completely takes KVM out of the picture.

> All affected guests run kubernetes with various workloads, mostly Java, databases like postgres, and a few legacy 32-bit containers.
>
> The best method I found to manually trigger the problem was to drain other kubernetes nodes, causing many pods to start at the same time on the affected guest. But even when the initial load has settled, there's little I/O, and the guest is around 80% idle, the problem still occurs.
>
> The problem occurs whether the host runs only a single guest or lots of other (non-kubernetes) guests.
>
> Other (i.e. not kubernetes) guests don't appear to be affected, but those have far fewer resources and usually less load.
The affected flows are used only for handling mmu_notifier invalidations and for edge cases related to non-coherent DMA. I don't see any passthrough devices in your setup, so that rules out the non-coherent DMA side of things.

A few things to try:

1. Disable KSM (if enabled):

   echo 0 > /sys/kernel/mm/ksm/run

2. Disable NUMA autobalancing (if enabled):

   echo 0 > /proc/sys/kernel/numa_balancing

3. Disable KVM's TDP MMU. On pre-v6.3 kernels, this can be done without having to reload KVM (or reboot the kernel if KVM is built in):

   echo N > /sys/module/kvm/parameters/tdp_mmu

   On v6.3 and later kernels, tdp_mmu is a read-only module param and so needs to be disabled when loading kvm.ko or when booting the kernel.

There are plenty more things that can be tried, but the above are relatively easy and will hopefully narrow down the search significantly.

Oh, and one question: is your host kernel preemptible?

Hi,

1. KSM is already disabled. Didn't try to enable it.
2. NUMA autobalancing was enabled on the host (value 1), not in the guest. When disabled, I can't see the issue anymore.
3. tdp_mmu was "Y"; disabling it seems to make no difference.

So it might be related to NUMA. On older kernels, the flag is 1 as well.

There's one difference in the kernel messages that I hadn't noticed before. The newer kernel prints "pci_bus 0000:7f: Unknown NUMA node; performance will be reduced" (same with ff again). The older ones don't. No idea what this means or whether it's important, and I can't find any info on the web about it.
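For quick reference, the state of the three knobs suggested above can be checked in one pass with a small script. This is a read-only sketch; the paths are the ones named in the thread, and the guards handle kernels where a path is absent (e.g. tdp_mmu only appears once kvm.ko is loaded):

```shell
#!/bin/sh
# Read-only status report for the three knobs discussed above.
# Missing paths are reported rather than treated as errors.
report=""
for knob in /sys/kernel/mm/ksm/run \
            /proc/sys/kernel/numa_balancing \
            /sys/module/kvm/parameters/tdp_mmu; do
    if [ -r "$knob" ]; then
        val=$(cat "$knob")
    else
        val="(not present on this kernel)"
    fi
    report="${report}${knob} = ${val}
"
done
printf '%s' "$report"
```

Note that writing to these paths (as in the suggestions above) requires root, and per the later discussion, toggling tdp_mmu only affects VMs created afterwards.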
I think the kernel is preemptible. "uname -a" shows:

Linux vm123 6.1.0-15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.66-1 (2023-12-09) x86_64 GNU/Linux

"grep -i" on the config shows:

CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

Attaching output of "dmesg" and "lspci -v". Perhaps there's something useful in there.

Created attachment 305596 [details]
dmesg
Created attachment 305597 [details]
lspci -v
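As an aside on the preemption question: on PREEMPT_DYNAMIC kernels like the one above, the active preempt mode can also be read at runtime rather than grepped out of the build config. A sketch; the debugfs file requires root and a mounted debugfs, hence the config fallback:

```shell
#!/bin/sh
# Report the host's preemption model. PREEMPT_DYNAMIC kernels expose the
# currently active mode via debugfs, e.g. "none (voluntary) full" with the
# active mode in parentheses; otherwise fall back to the build config.
if [ -r /sys/kernel/debug/sched/preempt ]; then
    mode=$(cat /sys/kernel/debug/sched/preempt)
else
    mode=$(grep -E '^CONFIG_PREEMPT' "/boot/config-$(uname -r)" 2>/dev/null)
    [ -n "$mode" ] || mode="(preempt mode not determinable on this system)"
fi
printf '%s\n' "$mode"
```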
On Thu, Dec 14, 2023, bugzilla-daemon@kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=218259
>
> --- Comment #2 from Joern Heissler (kernelbugs2012@joern-heissler.de) ---
>
> 1. KSM is already disabled. Didn't try to enable it.
> 2. NUMA autobalancing was enabled on the host (value 1), not in the guest. When disabled, I can't see the issue anymore.

This is likely/hopefully the same thing Yan encountered[1]. If you are able to test patches, the proposed fix[2] applies cleanly on v6.6 (note, I need to post a refreshed version of the series regardless); any feedback you can provide would be much appreciated.

KVM changes aside, I highly recommend evaluating whether or not NUMA autobalancing is a net positive for your environment. The interactions between autobalancing and KVM are often less than stellar, and disabling autobalancing is sometimes a completely legitimate option/solution.

[1] https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
[2] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com

> 3. tdp_mmu was "Y", disabling it seems to make no difference.

Hrm, that's odd. The commit blamed by bisection was purely a TDP MMU change. Did you relaunch VMs after disabling the module param? While the module param is writable, it's effectively snapshotted by each VM during creation, i.e. toggling it won't affect running VMs.

> So might be related to NUMA. On older kernels, the flag is 1 as well.
>
> There's one difference in the kernel messages that I hadn't noticed before. The newer one prints "pci_bus 0000:7f: Unknown NUMA node; performance will be reduced" (same with ff again). The older ones don't. No idea what this means, if it's important, and can't find info on the web regarding it.

That was a new message added by commit ad5086108b9f ("PCI: Warn if no host bridge NUMA node info"), which was first released in v5.5.
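A quick way to check whether a given host emits that warning (a sketch; dmesg may require root, hence the fallback message):

```shell
#!/bin/sh
# Look for the pci_bus NUMA warning added by commit ad5086108b9f in the
# kernel log. If dmesg is unreadable or the warning is absent, say so
# instead of failing.
numa_warn=$(dmesg 2>/dev/null | grep -i 'Unknown NUMA node' \
            || echo "warning not found (or dmesg not readable)")
printf '%s\n' "$numa_warn"
```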
AFAICT, that warning is only complaining about the driver code for PCI devices possibly running on the wrong node. However, if you are seeing that error on v6.1 or v6.6, but not v5.17, i.e. if the message started showing up well after the printk was added, then it might be a symptom of an underlying problem, e.g. maybe the kernel is botching the parsing of ACPI tables?

> I think the kernel is preemptible:

Ya, not fully preemptible (voluntary only), but the important part is that KVM will drop mmu_lock if there is contention (which is a "requirement" for the bug that Yan encountered).

(In reply to Sean Christopherson from comment #5)

> This is likely/hopefully the same thing Yan encountered[1]. If you are able to test patches, the proposed fix[2] applies cleanly on v6.6 (note, I need to post a refreshed version of the series regardless); any feedback you can provide would be much appreciated.
>
> [1] https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> [2] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com

I admit that I don't understand most of what's written in those threads.

I applied the two patches from [2] (excluding [3]) to v6.6 and it appears to solve the problem. However, I haven't measured how/if any of the changes/flags affect performance, or whether any other problems are caused. After about 1 hour of uptime it appears to be okay.

[3] https://lore.kernel.org/all/ZPtVF5KKxLhMj58n@google.com/

> KVM changes aside, I highly recommend evaluating whether or not NUMA autobalancing is a net positive for your environment. The interactions between autobalancing and KVM are often less than stellar, and disabling autobalancing is sometimes a completely legitimate option/solution.

I'll have to evaluate multiple options for my production environment. Patching and building the kernel myself would only be a last resort, and it will probably take a while until Debian ships a patch for the issue. So maybe disable NUMA balancing, or perhaps try to pin a VM's memory+cpu to a single NUMA node.

> > 3. tdp_mmu was "Y", disabling it seems to make no difference.
>
> Hrm, that's odd. The commit blamed by bisection was purely a TDP MMU change. Did you relaunch VMs after disabling the module param? While the module param is writable, it's effectively snapshotted by each VM during creation, i.e. toggling it won't affect running VMs.

It's quite possible that I did not restart the VM afterwards. I tried again, this time paying attention. Setting it to "N" *does* seem to eliminate the issue.

> > The newer one prints "pci_bus 0000:7f: Unknown NUMA node; performance will be reduced" (same with ff again). The older ones don't.
>
> That was a new message added by commit ad5086108b9f ("PCI: Warn if no host bridge NUMA node info"), which was first released in v5.5.

Seems I looked at systems running older (< v5.5) kernels. On the ones with v5.10 the message is printed too.

Thanks a lot so far, I believe I've now got enough options to consider for my production environment.

On Tue, Dec 19, 2023, bugzilla-daemon@kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=218259
>
> --- Comment #6 from Joern Heissler (kernelbugs2012@joern-heissler.de) ---
>
> I applied the two patches from [2] (excluding [3]) to v6.6 and it appears to solve the problem.
>
> I admit that I don't understand most of what's written in those threads.
LOL, no worries, sometimes none of us understand what's written either ;-)

> I applied the two patches from [2] (excluding [3]) to v6.6 and it appears to solve the problem.
>
> However I haven't measured how/if any of the changes/flags affect performance or if any other problems are caused. After about 1 hour uptime it appears to be okay.

Don't worry too much about additional testing. Barring a straight up bug (knock wood), the changes in those patches have a very, very low probability of introducing unwanted side effects.

> > KVM changes aside, I highly recommend evaluating whether or not NUMA autobalancing is a net positive for your environment.
>
> I'll have to evaluate multiple options for my production environment. Patching+Building the kernel myself would only be a last resort. And it will probably take a while until Debian ships a patch for the issue. So maybe disable the NUMA balancing, or perhaps try to pin a VM's memory+cpu to a single NUMA node.

Another viable option is to disable the TDP MMU, at least until the above patches land and are picked up by Debian. You could even reference commit 7e546bd08943 ("Revert "KVM: x86: enable TDP MMU by default"") from the v5.15 stable tree if you want a paper trail that provides some justification as to why it's ok to revert back to the "old" MMU. Quoting from that:

: As far as what is lost by disabling the TDP MMU, the main selling point of
: the TDP MMU is its ability to service page fault VM-Exits in parallel,
: i.e. the main benefactors of the TDP MMU are deployments of large VMs
: (hundreds of vCPUs), and in particular deployments that live-migrate such
: VMs and thus need to fault-in huge amounts of memory on many vCPUs after
: restarting the VM after migration.

In other words, the old MMU is not broken, e.g. it didn't suddenly become unusable after 15+ years of use. We enabled the newfangled TDP MMU by default because it is the long-term replacement, e.g. it can scale to support use cases that the old MMU falls over on, and we want to put the old MMU into maintenance-only mode. But we are still ironing out some wrinkles in the TDP MMU, particularly for host kernels that support preemption (the kernel has lock contention logic that is unique to preemptible kernels). And in the meantime, for most KVM use cases, the old MMU is still perfectly serviceable.

On Mon, Dec 18, 2023, bugzilla-daemon@kernel.org wrote:

> > I think the kernel is preemptible:
>
> Ya, not fully preemptible (voluntary only), but the important part is that KVM will drop mmu_lock if there is contention (which is a "requirement" for the bug that Yan encountered).

For posterity, the above is wrong. Voluntary preemption isn't _supposed_ to enable yielding of contended spinlocks/rwlocks, but due to a bug with dynamic preemption, the behavior got enabled for all preempt models:

https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com

(In reply to Sean Christopherson from comment #5)

> While the [tdp_mmu] module param is writable, it's effectively snapshotted by each VM during creation, i.e. toggling it won't affect running VMs.

How can I see which MMU a running VM is using?

On Tue, Jan 16, 2024, bugzilla-daemon@kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=218259
>
> --- Comment #9 from Joern Heissler (kernelbugs2012@joern-heissler.de) ---
>
> How can I see which MMU a running VM is using?
You can't, which in hindsight was a rather stupid thing for us to not make visible somewhere. As of v6.3, the module param is read-only, i.e. it's either enabled or disabled for all VMs, so sadly it's unlikely that older kernels will see any kind of "fix".

For the record: I was seeing a variant of this issue in combination with KSM; see [1] for more details and a reproducer.

[1] https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/

Also keep an eye on https://bugzilla.kernel.org/show_bug.cgi?id=199727, as I/O may also be the cause of severe VM latency.
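To make the TDP MMU workaround discussed above concrete: on v6.3 and later kernels, where tdp_mmu is read-only, it can only be set when kvm.ko is loaded or at boot. A sketch of a modprobe fragment (the filename is an arbitrary choice; the option name matches /sys/module/kvm/parameters/tdp_mmu from this thread):

```
# /etc/modprobe.d/kvm-tdp-mmu.conf
# Disable the TDP MMU for all VMs created after kvm.ko is (re)loaded.
options kvm tdp_mmu=N
```

Alternatively, passing `kvm.tdp_mmu=N` on the kernel command line should have the same effect, and also covers a built-in KVM.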