Bug 215964 - nVMX: KVM (L0) does not perform a platform reboot when the guest (L2) triggers a reboot event through I/O port 0xCF9
Status: NEW
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: Intel Linux
Importance: P1 high
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-10 02:45 UTC by Yadong Qi
Modified: 2023-01-26 10:03 UTC
CC List: 2 users

See Also:
Kernel Version: 5.10+
Subsystem:
Regression: No
Bisected commit-id:



Description Yadong Qi 2022-05-10 02:45:27 UTC
Background:
  We have a lightweight hypervisor (iKGT) that monitors a very limited set of resources and passes most resources through to its guest. I/O port 0xCF9 is also passed through to the guest, so when the guest triggers a reboot event through I/O port 0xCF9, the hardware performs the platform reset directly.
  We ported it to run under KVM, which makes it a nested virtualization architecture: KVM (L0), iKGT (L1), guest (L2).


Reproduce Steps:
  The guest (L2) writes to I/O port 0xCF9 to trigger a platform reboot.

Expected result:
  KVM performs a virtual platform reset and reboots the guest.

Current result:
  It seems KVM resets only part of the L2 vCPU state but does not clear the nVMX state, so it still tries to emulate VM-Exits to iKGT (L1). We can still observe VM-Exits in iKGT (L1), and the exit reason is unexpected.
Comment 1 Sean Christopherson 2022-05-10 16:01:13 UTC
Hmm, so KVM doesn't perform the RESET, that's handled by userspace.  KVM's responsibility is purely to determine whether the OUT 0xCF9 should be forwarded to L1 or bounced out to userspace.

Does the 0xCF9 I/O access get sent to userspace?  If not, can you provide L1's VMCS configuration for L2?  Specifically, the settings for USE_IO_BITMAPS and the relevant bitmap bits if in use, or UNCOND_IO_EXITING if not using bitmaps.

If the I/O access does show up in userspace, then it's likely a userspace bug, e.g. userspace fails to clear nested state when emulating RESET.
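The forwarding decision described in this comment can be sketched as follows. This is a minimal illustration, not KVM's actual code: `struct l1_io_controls`, `io_exit_goes_to_l1()` and `intercept_port()` are hypothetical names, and the bitmap layout (bitmap A for ports 0x0000-0x7FFF, bitmap B for ports 0x8000-0xFFFF, one bit per port, with USE_IO_BITMAPS taking precedence over UNCOND_IO_EXITING) reflects my understanding of the VMX I/O-bitmap rules.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the controls L1 programs for L2. */
struct l1_io_controls {
    bool use_io_bitmaps;     /* USE_IO_BITMAPS execution control */
    bool uncond_io_exiting;  /* UNCOND_IO_EXITING execution control */
    uint8_t bitmap_a[4096];  /* ports 0x0000-0x7FFF, one bit per port */
    uint8_t bitmap_b[4096];  /* ports 0x8000-0xFFFF, one bit per port */
};

/* Hypothetical helper: mark one port as intercepted by L1. */
static void intercept_port(struct l1_io_controls *c, uint16_t port)
{
    uint8_t *bitmap = (port < 0x8000) ? c->bitmap_a : c->bitmap_b;
    uint16_t idx = port & 0x7FFF;

    bitmap[idx / 8] |= (uint8_t)(1u << (idx % 8));
}

static bool port_bit_set(const struct l1_io_controls *c, uint16_t port)
{
    const uint8_t *bitmap = (port < 0x8000) ? c->bitmap_a : c->bitmap_b;
    uint16_t idx = port & 0x7FFF;

    return (bitmap[idx / 8] >> (idx % 8)) & 1;
}

/*
 * Returns true if an L2 I/O access of `size` bytes starting at `port`
 * should be reflected to L1 as a nested VM-Exit, and false if L0 keeps
 * it (in which case an access L0 does not handle itself is bounced out
 * to userspace).
 */
static bool io_exit_goes_to_l1(const struct l1_io_controls *c,
                               uint16_t port, unsigned size)
{
    /* Without bitmaps, only the unconditional control matters. */
    if (!c->use_io_bitmaps)
        return c->uncond_io_exiting;

    /* With bitmaps, any intercepted byte of the access causes an exit. */
    for (unsigned i = 0; i < size; i++) {
        uint32_t p = (uint32_t)port + i;

        if (p > 0xFFFF)          /* access wraps past the port space */
            return true;
        if (port_bit_set(c, (uint16_t)p))
            return true;
    }
    return false;
}
```

With bitmaps enabled and the bit for port 0xCF9 clear, `io_exit_goes_to_l1(&c, 0xCF9, 1)` returns false in this sketch, which corresponds to the "bounced out to userspace" case.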
Comment 2 Yadong Qi 2022-05-11 02:29:35 UTC
Currently in L1, USE_IO_BITMAPS is set and only the S3-related I/O port (0x604) is monitored.

We are using QEMU as the userspace manager, so I will submit a QEMU issue and link it to this bug.
Comment 3 Sean Christopherson 2022-05-11 13:55:14 UTC
Hmm, QEMU is clearing "guest mode" to get the vCPU back into L1 at RESET, and IIRC QEMU will do KVM_SET_NESTED_STATE as part of its RESET emulation.

My question about the 0xcf9 write still stands.  Does the OUT get sent to QEMU and then something goes awry during RESET emulation?  Or is the OUT forwarded to L1 as a nested VM-Exit?
Comment 4 Yadong Qi 2022-05-12 01:36:07 UTC
I think the OUT to 0xCF9 is forwarded to QEMU, because no VM-Exit for an OUT to 0xCF9 was traced in L1.

Besides, the first VM-Exit in L1 after the OUT is an RDMSR VM-Exit, which is totally unexpected, and the guest (L2) RIP is 0xFFF0. So I guess L0 (QEMU/KVM) has reset part of the vCPU but has not cleared the nVMX state, so when L0 resumes the guest, it still treats L1 as alive and emulates an unexpected VM-Exit to L1.
Comment 5 Sean Christopherson 2022-05-12 14:06:17 UTC
Definitely sounds like a QEMU bug.  Jim pointed out that you might have an older version of QEMU, which seems likely given the fix went into v5.1.0 and lack of clearing HF_GUEST_MASK would result in the behavior you're seeing.

Can you try running v5.1.0 or later?  Specifically, something with this commit.

commit b16c0e20c74218f2d69710cedad11da7dd4d2190
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Wed May 20 10:49:22 2020 -0400

    KVM: add support for AMD nested live migration
    
    Support for nested guest live migration is part of Linux 5.8, add the
    corresponding code to QEMU.  The migration format consists of a few
    flags, is an opaque 4k blob.
    
    The blob is in VMCB format (the control area represents the L1 VMCB
    control fields, the save area represents the pre-vmentry state; KVM does
    not use the host save area since the AMD manual allows that) but QEMU
    does not really care about that.  However, the flags need to be
    copied to hflags/hflags2 and back.
    
    In addition, support for retrieving and setting the AMD nested virtualization
    states allows the L1 guest to be reset while running a nested guest, but
    a small bug in CPU reset needs to be fixed for that to work.
    
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index e46ab8f774..008fd93ff1 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5968,6 +5968,7 @@ static void x86_cpu_reset(DeviceState *dev)
     /* init to reset state */
 
     env->hflags2 |= HF2_GIF_MASK;
+    env->hflags &= ~HF_GUEST_MASK;
 
     cpu_x86_update_cr0(env, 0x60000010);
     env->a20_mask = ~0x0;
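The effect of the one-line change in this diff can be illustrated with a toy model. This is a sketch under stated assumptions, not QEMU code: `struct sketch_cpu_state`, `sketch_cpu_reset()` and `sketch_in_guest_mode()` are hypothetical, the `SKETCH_*` bit positions are illustrative stand-ins for the real definitions in QEMU's `target/i386/cpu.h`, and the `clear_guest_mode` parameter exists only to contrast the behavior before and after the fix.

```c
#include <stdint.h>

/* Illustrative bit positions; the real masks are defined by QEMU. */
#define SKETCH_HF_GUEST_MASK (1u << 21)
#define SKETCH_HF2_GIF_MASK  (1u << 0)

/* Hypothetical, heavily reduced stand-in for the vCPU state. */
struct sketch_cpu_state {
    uint32_t hflags;   /* carries the "guest mode" flag */
    uint32_t hflags2;  /* carries the GIF flag */
    uint64_t rip;
};

/*
 * Toy reset. clear_guest_mode == 0 models a reset without the added
 * line; clear_guest_mode != 0 models a reset with it.
 */
static void sketch_cpu_reset(struct sketch_cpu_state *env,
                             int clear_guest_mode)
{
    env->hflags2 |= SKETCH_HF2_GIF_MASK;
    if (clear_guest_mode)
        env->hflags &= ~SKETCH_HF_GUEST_MASK;
    env->rip = 0xFFF0;  /* registers return to their reset values */
}

static int sketch_in_guest_mode(const struct sketch_cpu_state *env)
{
    return (env->hflags & SKETCH_HF_GUEST_MASK) != 0;
}
```

In this model, a vCPU that is reset while in guest mode without the added line ends up with RIP at 0xFFF0 but is still flagged as running a nested guest, which resembles the symptom reported in Comment 4.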
Comment 6 Yadong Qi 2022-05-13 01:50:03 UTC
We are running on top of QEMU 6.0.0. I also checked the QEMU source code, and it contains the commit you mentioned. Issue link for QEMU: https://gitlab.com/qemu-project/qemu/-/issues/1021


$ qemu-system-x86_64 --version
QEMU emulator version 6.0.0
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
Comment 7 mail 2023-01-26 10:03:27 UTC
This seems similar to what I observed in https://gitlab.com/qemu-project/qemu/-/issues/530. The AMD nested commit specifically seems to be the culprit in my experiments.
