Hi all,

today I updated to 6.4.10 on Arch Linux. This broke my setup of running KVM nested virtualization within a KVM VM. The problem seems to be kernel-update related, not distribution specific, since others report the same issue on a completely different setup: https://forum.proxmox.com/threads/amd-incpetion-fixes-cause-qemu-kvm-memory-leak.132057/#post-581207

Issue:
1. Start a KVM VM ("hostVM") with 60GB of memory assigned -> all works.
2. Within that hostVM, start a nestedVM with 5GB of memory assigned.
3. Memory consumption of the qemu process within the hostVM grows beyond the available memory; the nestedVM gets OOM-killed before it has even finished starting, after using more than the 60GB + swap.

I tried setting up fresh nestedVMs, with no luck: same problem. Reverting to an earlier kernel (6.4.7 on Arch Linux) makes everything work again.

host kernel:     6.4.10-arch1 (this induces the problem; the rest was unchanged)
hostVM kernel:   5.15.107+truenas
nestedVM kernel: 5.15.0-78-generic

(A rough sketch of the qemu invocations follows below the log.)

Logs from the hostVM when the OOM happens:

Aug 15 10:59:41 truenas kernel: CPU 0/KVM invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
Aug 15 10:59:42 truenas kernel: CPU: 9 PID: 7079 Comm: CPU 0/KVM Tainted: P OE 5.15.107+truenas #1
Aug 15 10:59:43 truenas kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
Aug 15 10:59:43 truenas kernel: Call Trace:
Aug 15 10:59:43 truenas kernel: <TASK>
Aug 15 10:59:43 truenas kernel: dump_stack_lvl+0x46/0x5e
Aug 15 10:59:43 truenas kernel: dump_header+0x4a/0x1f4
Aug 15 10:59:43 truenas kernel: oom_kill_process.cold+0xb/0x10
Aug 15 10:59:43 truenas kernel: out_of_memory+0x1bd/0x4f0
Aug 15 10:59:43 truenas kernel: __alloc_pages_slowpath.constprop.0+0xc30/0xd00
Aug 15 10:59:44 truenas kernel: __alloc_pages+0x1e9/0x220
Aug 15 10:59:44 truenas kernel: __get_free_pages+0xd/0x40
Aug 15 10:59:44 truenas kernel: kvm_mmu_topup_memory_cache+0x56/0x80 [kvm]
Aug 15 10:59:44 truenas kernel: mmu_topup_memory_caches+0x39/0x70 [kvm]
Aug 15 10:59:44 truenas kernel: direct_page_fault+0x3d9/0xbb0 [kvm]
Aug 15 10:59:44 truenas kernel: ? kvm_mtrr_check_gfn_range_consistency+0x61/0x120 [kvm]
Aug 15 10:59:44 truenas kernel: kvm_mmu_page_fault+0x7a/0x730 [kvm]
Aug 15 10:59:44 truenas kernel: ? ktime_get+0x38/0xa0
Aug 15 10:59:44 truenas kernel: ? lock_timer_base+0x61/0x80
Aug 15 10:59:44 truenas kernel: ? __svm_vcpu_run+0x5f/0xf0 [kvm_amd]
Aug 15 10:59:44 truenas kernel: ? __svm_vcpu_run+0x59/0xf0 [kvm_amd]
Aug 15 10:59:44 truenas kernel: ? __svm_vcpu_run+0xaa/0xf0 [kvm_amd]
Aug 15 10:59:44 truenas kernel: ? load_fixmap_gdt+0x22/0x30
Aug 15 10:59:44 truenas kernel: ? native_load_tr_desc+0x67/0x70
Aug 15 10:59:44 truenas kernel: ? x86_virt_spec_ctrl+0x43/0xb0
Aug 15 10:59:44 truenas kernel: kvm_arch_vcpu_ioctl_run+0xbff/0x1750 [kvm]
Aug 15 10:59:44 truenas kernel: kvm_vcpu_ioctl+0x278/0x660 [kvm]
Aug 15 10:59:44 truenas kernel: ? __seccomp_filter+0x385/0x5c0
Aug 15 10:59:44 truenas kernel: __x64_sys_ioctl+0x8b/0xc0
Aug 15 10:59:44 truenas kernel: do_syscall_64+0x3b/0xc0
Aug 15 10:59:44 truenas kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Aug 15 10:59:44 truenas kernel: RIP: 0033:0x7f29eee166b7
Aug 15 10:59:45 truenas kernel: Code: Unable to access opcode bytes at RIP 0x7f29eee1668d.
Aug 15 10:59:45 truenas kernel: RSP: 002b:00007f27f35fd4c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 15 10:59:45 truenas kernel: RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f29eee166b7
Aug 15 10:59:45 truenas kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
Aug 15 10:59:45 truenas kernel: RBP: 00005558a87d3f00 R08: 00005558a7e52848 R09: 00005558a827c580
Aug 15 10:59:45 truenas kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Aug 15 10:59:45 truenas kernel: R13: 00005558a8298bc0 R14: 00007f27f35fd780 R15: 0000000000802000
Aug 15 10:59:45 truenas kernel: </TASK>
Aug 15 10:59:45 truenas kernel: Mem-Info:
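For anyone trying to reproduce, the setup is roughly the following. The real hostVM is TrueNAS-managed, so the disk images, -smp counts, and exact flags here are placeholders, not my actual invocations:

    # bare-metal host (6.4.10-arch1): start the outer VM; -cpu host
    # passes SVM through (nested=1 is the kvm_amd default)
    qemu-system-x86_64 -enable-kvm -machine q35 -cpu host \
        -smp 16 -m 60G -drive file=hostvm.qcow2,if=virtio

    # inside the hostVM (5.15.107+truenas): start the nested guest;
    # this is the qemu process that balloons past 60G and gets OOM-killed
    qemu-system-x86_64 -enable-kvm -cpu host \
        -smp 4 -m 5G -drive file=nestedvm.qcow2,if=virtio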
Note: adding spec_rstack_overflow=off to the kernel command line makes the nested VM boot properly again, without any problems: https://bugs.archlinux.org/task/79384

So, spec_rstack_overflow=safe-ret is what breaks nested KVM virtualization.
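For reference, one way to set the workaround on a GRUB-based install (paths as on a typical Arch setup, adjust for your bootloader; the "..." stands for whatever parameters are already there):

    # /etc/default/grub -- append to the existing parameters
    GRUB_CMDLINE_LINUX_DEFAULT="... spec_rstack_overflow=off"

    # regenerate the config and reboot
    grub-mkconfig -o /boot/grub/grub.cfg

    # the active mitigation state is visible in sysfs
    cat /sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow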
This is going to sound completely ridiculous, but can you try the fix for the guest RFLAGS corruption issue in the return thunk? It's admittedly unlikely that the _only_ symptom would be an unexpected OOM, but it's theoretically possible, e.g. if your setup only triggers KVM (bare metal host) emulation in a handful of flows, and one of those flows just happens to send a single Jcc in the wrong direction.

https://lore.kernel.org/all/20230811155255.250835-1-seanjc@google.com
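For the curious: as I read the patch, the problem is that srso_safe_ret() adjusts %rsp with an ADD, and ADD rewrites EFLAGS. KVM's FASTOP emulation reads the emulated instruction's flags right after returning through the thunk, so flags computed by the emulated instruction get destroyed and an emulated Jcc can be sent the wrong way. A minimal user-space sketch of the ADD-vs-LEA difference (just an illustration of the mechanism, not the kernel code):

    #include <stdio.h>

    int main(void)
    {
        unsigned long flags_add, flags_lea;

        /* Set ZF via a self-compare, then adjust a register with ADD:
         * ADD recomputes EFLAGS, so the ZF from the CMP is lost. */
        asm volatile("mov  $1, %%rcx\n\t"
                     "cmp  %%rax, %%rax\n\t"    /* ZF = 1 */
                     "add  $8, %%rcx\n\t"       /* clobbers EFLAGS */
                     "pushfq\n\t"
                     "pop  %0"
                     : "=r"(flags_add) : : "rcx", "cc");

        /* Same sequence, but adjust with LEA: LEA never touches EFLAGS. */
        asm volatile("mov  $1, %%rcx\n\t"
                     "cmp  %%rax, %%rax\n\t"    /* ZF = 1 */
                     "lea  8(%%rcx), %%rcx\n\t" /* EFLAGS preserved */
                     "pushfq\n\t"
                     "pop  %0"
                     : "=r"(flags_lea) : : "rcx", "cc");

        /* ZF is bit 6 of EFLAGS: prints 0 after ADD, 1 after LEA. */
        printf("ZF after add: %lu\n", (flags_add >> 6) & 1);
        printf("ZF after lea: %lu\n", (flags_lea >> 6) & 1);
        return 0;
    }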
Sean, it does sound ridiculous, but it isn't: I tested the fix you suggested, and with that patch applied everything works again.

In the meantime I switched to a different machine to be able to test your fix, and I could confirm the problem there as well, on a 6.4.11 kernel:

Test machine setup: Gentoo, (vanilla) kernel 6.4.11

Without the patch, and with spec_rstack_overflow at its default of spec_rstack_overflow=safe-ret, the nested VMs do not start on this system either and get OOM-killed. I then applied the patch from your link, Sean, and now it works.

Cheers, Oliver
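P.S. In case anyone else wants to test: lore serves the raw message, so the patch can be applied to a kernel tree with something along the lines of

    curl -sL 'https://lore.kernel.org/all/20230811155255.250835-1-seanjc@google.com/raw' | git am

(b4 am with the message ID should work too, if you have b4 installed.)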