Background: We have a lightweight hypervisor (iKGT) which monitors only a very limited set of resources and passes everything else through to its guest. The reset I/O port 0xCF9 is also passed through, so when the guest triggers a reboot (by writing to I/O port 0xCF9), the hardware performs the platform reset directly. We ported iKGT to run under KVM, which gives a nested virtualization setup: KVM (L0), iKGT (L1), Guest (L2).

Reproduce steps: Guest (L2) writes to I/O port 0xCF9 to trigger a platform reboot.

Expected result: KVM performs a virtual platform reset and reboots the guest.

Current result: KVM seems to reset only part of the vCPU state; it does not clear the nVMX state and still emulates VM exits to iKGT (L1). We can still observe VM exits in iKGT (L1), with unexpected exit reasons.
Hmm, so KVM doesn't perform the RESET; that's handled by userspace. KVM's responsibility is purely to determine whether the OUT to 0xCF9 should be forwarded to L1 or bounced out to userspace. Does the 0xCF9 I/O access get sent to userspace? If not, can you provide L1's VMCS configuration for L2? Specifically, the settings for USE_IO_BITMAPS and the relevant bitmap bits if in use, or UNCOND_IO_EXITING if not using bitmaps. If the I/O access does show up in userspace, then it's likely a userspace bug, e.g. userspace fails to clear nested state when emulating RESET.
Currently in L1, USE_IO_BITMAPS is set, and only the S3-related I/O port (0x604) is monitored. We are using QEMU as the userspace VMM; I will file a QEMU issue and link it to this bug.
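Given that bitmap configuration, the OUT to 0xCF9 should not be reflected to L1. In VMX, each of the two 4 KiB I/O bitmaps holds one bit per port (bitmap A covers ports 0x0000-0x7FFF, bitmap B covers 0x8000-0xFFFF), and with USE_IO_BITMAPS set an access causes a VM exit only if its bit is set. A minimal sketch of that lookup (the function name and setup are illustrative, not KVM code):

```c
#include <assert.h>
#include <stdint.h>

#define IO_BITMAP_BYTES 4096  /* each VMX I/O bitmap is a 4 KiB page */

/* Nonzero if an access to `port` causes a VM exit under USE_IO_BITMAPS:
 * bitmap A covers ports 0x0000-0x7FFF, bitmap B covers 0x8000-0xFFFF,
 * one bit per port. */
static int io_port_causes_exit(const uint8_t *bitmap_a,
                               const uint8_t *bitmap_b,
                               uint16_t port)
{
    const uint8_t *bm = (port < 0x8000) ? bitmap_a : bitmap_b;
    uint16_t idx = port & 0x7FFF;  /* bit offset within the chosen bitmap */
    return (bm[idx / 8] >> (idx % 8)) & 1;
}
```

With only the bit for 0x604 set (byte 0xC0, bit 4), the bit for 0xCF9 (byte 0x19F, bit 1) is clear, so L0/KVM should handle the OUT rather than synthesizing a nested VM exit to L1.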
Hmm, QEMU is clearing "guest mode" to get the vCPU back into L1 at RESET, and IIRC QEMU will do KVM_SET_NESTED_STATE as part of its RESET emulation. My question about the 0xCF9 write still stands: does the OUT get sent to QEMU and then something goes awry during RESET emulation, or is the OUT forwarded to L1 as a nested VM exit?
I think the OUT to 0xCF9 is forwarded to QEMU, because no 0xCF9 OUT VM exit was traced in L1. Besides, the first VM exit seen in L1 after the OUT is a RDMSR VM exit, which is totally unexpected, and the guest (L2) RIP is 0xFFF0. So I guess L0 (QEMU/KVM) has reset part of the vCPU but has not cleared the nVMX state, so when L0 resumes the guest it still treats L1 as alive and emulates unexpected VM exits to L1.
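If that guess is right, the missing step at RESET is dropping the vCPU out of guest mode along with the architectural reset. A toy model of what L0's RESET emulation needs to do (the struct and flag names below are local stand-ins loosely mirroring KVM's kvm_nested_state flags, not the real UAPI):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the nested-state flags in <linux/kvm.h>; defined locally
 * so this sketch is self-contained. */
#define NESTED_GUEST_MODE   (1u << 0)
#define NESTED_RUN_PENDING  (1u << 1)

struct vcpu_model {
    uint64_t rip;
    uint32_t nested_flags;  /* analogous to struct kvm_nested_state.flags */
};

/* Besides resetting architectural state, RESET emulation must drop the
 * vCPU out of guest mode; otherwise later events are still reflected to
 * L1 as nested VM exits, which matches the symptom reported above. */
static void emulate_reset(struct vcpu_model *v)
{
    v->rip = 0xFFF0;  /* reset vector offset within the reset code segment */
    v->nested_flags &= ~(NESTED_GUEST_MODE | NESTED_RUN_PENDING);
}
```

The observed behavior looks like the RIP reset happened but the flag-clearing step did not.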
Definitely sounds like a QEMU bug. Jim pointed out that you might have an older version of QEMU, which seems likely given that the fix went into v5.1.0 and that the lack of clearing HF_GUEST_MASK would result in the behavior you're seeing. Can you try running v5.1.0 or later? Specifically, something with this commit:

commit b16c0e20c74218f2d69710cedad11da7dd4d2190
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Wed May 20 10:49:22 2020 -0400

    KVM: add support for AMD nested live migration

    Support for nested guest live migration is part of Linux 5.8, add the
    corresponding code to QEMU.  The migration format consists of a few
    flags, is an opaque 4k blob.  The blob is in VMCB format (the control
    area represents the L1 VMCB control fields, the save area represents
    the pre-vmentry state; KVM does not use the host save area since the
    AMD manual allows that) but QEMU does not really care about that.
    However, the flags need to be copied to hflags/hflags2 and back.

    In addition, support for retrieving and setting the AMD nested
    virtualization states allows the L1 guest to be reset while running
    a nested guest, but a small bug in CPU reset needs to be fixed for
    that to work.

    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index e46ab8f774..008fd93ff1 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5968,6 +5968,7 @@ static void x86_cpu_reset(DeviceState *dev)

     /* init to reset state */
     env->hflags2 |= HF2_GIF_MASK;
+    env->hflags &= ~HF_GUEST_MASK;
     cpu_x86_update_cr0(env, 0x60000010);
     env->a20_mask = ~0x0;
We are running on top of QEMU 6.0.0, and I also checked the QEMU source code: it contains the commit you mentioned.

Issue link for QEMU: https://gitlab.com/qemu-project/qemu/-/issues/1021

$ qemu-system-x86_64 --version
QEMU emulator version 6.0.0
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
This seems similar to what I observed in https://gitlab.com/qemu-project/qemu/-/issues/530. The AMD nested commit specifically seems to be the culprit in my experiments.