We found some latency issues in high-density, high-concurrency scenarios.
We are using Cloud Hypervisor as the VMM for lightweight VMs, with VIRTIO
net and block devices for each VM. In our tests, we saw about 50ms to
100ms+ of latency when creating a VM and registering its irqfds. After
tracing with funclatency (a tool from bcc-tools,
https://github.com/iovisor/bcc), we found the latency is introduced by the
following function:

- irq_bypass_register_consumer introduces more than 60ms per VM.
  This function is called when registering an irqfd; it registers the
  irqfd as a consumer with irqbypass and waits for a connection from an
  irqbypass producer, such as VFIO or vDPA. In our tests, a single irqfd
  registration takes about 4ms, so 5 devices with 16 irqfds in total
  introduce more than 60ms of latency.

Here is a simple test case that emulates the latency issue (the real
latency is larger). The case creates 800 idle VMs as background load, then
repeatedly creates 20 VMs and destroys them after 400ms. Every VM does one
simple thing: create an in-kernel irqchip and register 15 irqfds (emulating
5 devices with 3 irqfds each). Just trace the
"irq_bypass_register_consumer" latency and you will reproduce this kind of
latency issue. Here is a trace log from a Xeon(R) Platinum 8255C server
(96C, 2 sockets) running Linux 6.2.20.

Reproduce case:
https://github.com/zhuangel/misc/blob/main/test/kvm_irqfd_fork/kvm_irqfd_fork.c
Reproduce log:
https://github.com/zhuangel/misc/blob/main/test/kvm_irqfd_fork/test.log

To fix these latencies, I don't have a graceful method; a simple idea is to
give userspace a chance to avoid them, e.g. a new flag to disable irqbypass
for each irqfd.

Any suggestion on how to fix the issue is welcome.
On Fri, Apr 28, 2023, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217379
>
>             Bug ID: 217379
>            Summary: Latency issues in irq_bypass_register_consumer
>            Product: Virtualization
>            Version: unspecified
>           Hardware: Intel
>                 OS: Linux
>             Status: NEW
>           Severity: normal
>           Priority: P3
>          Component: kvm
>           Assignee: virtualization_kvm@kernel-bugs.osdl.org
>           Reporter: zhuangel570@gmail.com
>         Regression: No
>
> We found some latency issue in high-density and high-concurrency scenarios,
> we are using cloud hypervisor as vmm for lightweight VM, using VIRTIO net and
> block for VM. In our test, we got about 50ms to 100ms+ latency in creating VM
> and register irqfd, after trace with funclatency (a tool of bcc-tools,
> https://github.com/iovisor/bcc), we found the latency introduced by following
> functions:
>
> - irq_bypass_register_consumer introduce more than 60ms per VM.
>   This function was called when registering irqfd, the function will register
>   irqfd as consumer to irqbypass, wait for connecting from irqbypass producers,
>   like VFIO or VDPA. In our test, one irqfd register will get about 4ms
>   latency, and 5 devices with total 16 irqfd will introduce more than 60ms
>   latency.
>
> Here is a simple case, which can emulate the latency issue (the real latency
> is lager). The case create 800 VM as background do nothing, then repeatedly
> create 20 VM then destroy them after 400ms, every VM will do simple thing,
> create in kernel irq chip, and register 15 riqfd (emulate 5 devices and every
> device has 3 irqfd), just trace the "irq_bypass_register_consumer" latency,
> you will reproduce such kind latency issue. Here is a trace log on Xeon(R)
> Platinum 8255C server (96C, 2 sockets) with linux 6.2.20.
>
> Reproduce Case
> https://github.com/zhuangel/misc/blob/main/test/kvm_irqfd_fork/kvm_irqfd_fork.c
> Reproduce log
> https://github.com/zhuangel/misc/blob/main/test/kvm_irqfd_fork/test.log
>
> To fix these latencies, I didn't have a graceful method, just simple ideas
> is give user a chance to avoid these latencies, like new flag to disable
> irqbypass for each irqfd.
>
> Any suggestion to fix the issue if welcomed.

Looking at the code, it's not surprising that irq_bypass_register_consumer()
can exhibit high latencies.  The producers and consumers are stored in simple
linked lists, and a single mutex is held while traversing the lists *and*
connecting a consumer to a producer (and vice versa).

There are two obvious optimizations that can be done to reduce latency in
irq_bypass_register_consumer():

 - Use a different data type to track the producers and consumers so that
   lookups don't require a linear walk.  AIUI, the "tokens" used to match
   producers and consumers are just kernel pointers, so I _think_ XArray
   would perform reasonably well.

 - Connect producers and consumers outside of a global mutex.  Unfortunately,
   because .add_producer() and .add_consumer() can fail, and because
   connections can be established by adding a consumer _or_ a producer,
   getting the locking right without a global mutex is quite difficult.

It's certainly doable to move the (dis)connect logic out of a global lock,
but it's going to require a dedicated effort, i.e. not something that can be
sketched out in a few minutes (I played around with the code for the better
part of an hour trying to do just that and kept running into edge case race
conditions).
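For reference, the registration path in question looks roughly like this (a simplified sketch of irq_bypass_register_consumer() in virt/lib/irqbypass.c around v6.2, not verbatim kernel code; error paths and module refcounting are omitted):

```c
/* Simplified sketch: one global mutex serializes every registration
 * across ALL VMs, and both lookups are O(n) list walks. */
int irq_bypass_register_consumer(struct irq_bypass_consumer *consumer)
{
	mutex_lock(&lock);			/* single global mutex */

	list_for_each_entry(tmp, &consumers, node)	/* O(n) dup check */
		if (tmp->token == consumer->token) {
			mutex_unlock(&lock);
			return -EBUSY;
		}

	list_for_each_entry(producer, &producers, node)	/* O(n) match scan */
		if (producer->token == consumer->token) {
			__connect(producer, consumer);	/* still under lock */
			break;
		}

	list_add(&consumer->node, &consumers);
	mutex_unlock(&lock);
	return 0;
}
```

With ~800 background VMs x ~15 irqfds each, the consumer list holds on the order of 10k entries, so every new registration pays two multi-thousand-entry walks while blocking all other registrations, which matches the per-irqfd millisecond latencies reported above.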
Thanks for the suggestion, Sean.

Before we have a complete optimization for irq_bypass, do you think this
kind of fix would be acceptable?  It gives the VMM the ability to turn off
the irq_bypass feature for devices that don't need it:

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 737318b1c1d9..a7a018ce954a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1292,7 +1292,7 @@ struct kvm_xen_hvm_config {
 };
 #endif
 
-#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
 /*
  * Available with KVM_CAP_IRQFD_RESAMPLE
  *
@@ -1300,7 +1300,8 @@ struct kvm_xen_hvm_config {
  * the irqfd to operate in resampling mode for level triggered interrupt
  * emulation. See Documentation/virt/kvm/api.rst.
  */
-#define KVM_IRQFD_FLAG_RESAMPLE (1 << 1)
+#define KVM_IRQFD_FLAG_RESAMPLE (1 << 1)
+#define KVM_IRQFD_FLAG_NO_BYPASS (1 << 2)
 
 struct kvm_irqfd {
 	__u32 fd;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index b0af834ffa95..90fa203d7ef3 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -425,7 +425,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	schedule_work(&irqfd->inject);
 
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
-	if (kvm_arch_has_irq_bypass()) {
+	if (!(args->flags & KVM_IRQFD_FLAG_NO_BYPASS) && kvm_arch_has_irq_bypass()) {
 		irqfd->consumer.token = (void *)irqfd->eventfd;
 		irqfd->consumer.add_producer = kvm_arch_irq_bypass_add_producer;
 		irqfd->consumer.del_producer = kvm_arch_irq_bypass_del_producer;
@@ -587,7 +587,8 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 
 int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
 {
-	if (args->flags & ~(KVM_IRQFD_FLAG_DEASSIGN | KVM_IRQFD_FLAG_RESAMPLE))
+	if (args->flags & ~(KVM_IRQFD_FLAG_DEASSIGN | KVM_IRQFD_FLAG_RESAMPLE
+			| KVM_IRQFD_FLAG_NO_BYPASS))
 		return -EINVAL;
 
 	if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)
-- 
2.31.1