Bug 208081

Summary: Memory leak in kvm_async_pf_task_wake
Product: Virtualization
Component: kvm
Reporter: Daniel Lo Nigro (sites+kernel)
Assignee: virtualization_kvm
CC: BarinaHoppe, bonzini, wanpeng.li
Status: NEW
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.6.14
Regression: No

Description Daniel Lo Nigro 2020-06-06 00:35:39 UTC
I have several KVM virtual servers at a number of hosting providers. On just one of them, unreclaimable slab memory grows linearly over time until it hits a ceiling, when the server's memory is 100% allocated. All of that memory is allocated in kmalloc-64 slabs.

After enabling slab debugging with slub_debug=U, /sys/kernel/slab/kmalloc-64/alloc_calls shows that most of the allocations come from kvm_async_pf_task_wake.

This looks very similar to this blog post: https://darkimmortal.com/debian-10-kernel-slab-memory-leak/. Also see my post on ServerFault: https://serverfault.com/questions/1020241/debugging-kmalloc-64-slab-allocations-memory-leak

Any suggestions on how to debug this? It seems like it could be a kernel bug.
Comment 1 Daniel Lo Nigro 2020-06-06 00:37:35 UTC
I forgot to mention that this VPS is running Debian Bullseye (testing):

root@lux01:~# uname -a
Linux lux01 5.6.0-2-cloud-amd64 #1 SMP Debian 5.6.14-1 (2020-05-23) x86_64 GNU/Linux
Comment 2 Wanpeng Li 2020-06-08 00:46:50 UTC
Could you apply the patch below to the guest kernel and try again?

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index d6f22a3..93d267e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -156,6 +156,7 @@ static void apf_task_wake_one(struct kvm_task_sleep_node *n)
        hlist_del_init(&n->link);
        if (swq_has_sleeper(&n->wq))
                swake_up_one(&n->wq);
+       kfree(n);
 }
Comment 3 Daniel Lo Nigro 2020-06-09 07:17:48 UTC
@Wanpeng Li - I'll try that out and let you know how it goes.
Comment 4 Paolo Bonzini 2020-06-09 13:00:36 UTC
That patch won't work: most APFs have a node that comes from the stack. The issue must arise when you enter this branch of kvm_async_pf_task_wake:

                /*
                 * async PF was not yet handled.
                 * Add dummy entry for the token.
                 */
                n = kzalloc(sizeof(*n), GFP_ATOMIC);

but it should be handled here in kvm_async_pf_task_wait:

        if (e) {
                /* dummy entry exist -> wake up was delivered ahead of PF */
                hlist_del(&e->link);
                raw_spin_unlock(&b->lock);
                kfree(e);
                return false;
        }
Comment 5 Daniel Lo Nigro 2020-06-09 17:09:34 UTC
Yeah, after applying that patch, I ended up getting a kernel panic on the kfree() call.

A blog post I read (https://darkimmortal.com/debian-10-kernel-slab-memory-leak/) mentioned adding "no-kvmapf" to the kernel command line as a workaround for this issue. Are there any major downsides to doing that? Right now this leak fills the server's memory completely after a few days of uptime.
Comment 6 Paolo Bonzini 2020-06-11 01:03:52 UTC
No particular issue apart from a latency increase if the host is overcommitted. It will be fixed in 5.8, but it's a host-side bug, so you need to inform your hosting provider or add no-kvmapf to the guest kernel command line yourself.