Bug 216033

Summary: KVM VMX nested virtualization: VMXON does not check guest CR0 against IA32_VMX_CR0_FIXED0
Product: Virtualization Reporter: Eric Li (lixiaoyi13691419520)
Component: kvmAssignee: virtualization_kvm
Status: NEW ---    
Severity: normal CC: lixiaoyi13691419520, seanjc
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 5.17.8-200.fc35.x86_64 Subsystem:
Regression: No Bisected commit-id:
Attachments: Guest hypervisor to reproduce this bug (xz compressed)

Description Eric Li 2022-05-26 03:54:23 UTC
Created attachment 301050 [details]
Guest hypervisor to reproduce this bug (xz compressed)

CPU model I am running: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
Host kernel version: 5.17.8-200.fc35.x86_64
Host kernel arch: x86_64
Guest: a hypervisor I wrote myself, 32-bits, compressed and attached as c.img.xz.
QEMU command line: qemu-system-x86_64 -m 512M -smp 1 -cpu Haswell,vmx=yes -enable-kvm -serial stdio -drive media=disk,file=c.img,index=1
The problem does not go away if using -machine kernel_irqchip=off
Since the guest is a hypervisor, -accel tcg cannot be used (TCG does not support nested virtualization)

Actual behavior: serial port shows:

...
CR0        = 0x0000000080000015
CR0 fixed0 = 0x0000000080000021
CR0 fixed1 = 0x00000000ffffffff
VMXON succeeds

Expected behaivor: serial port shows:

...
CR0        = 0x0000000080000015
CR0 fixed0 = 0x0000000080000021
CR0 fixed1 = 0x00000000ffffffff
[00]: unhandled exception 13 (0xd), halting!
[00]: error code: 0x00000000
[00]: state dump follows...
[00] CS:EIP ...
...

Explanation:

When the guest hypervisor starts VMX using the VMXON instruction, the guest hypervisor's CR0 is not legal. IA32_VMX_CR0_FIXED0 = 0x0000000080000021. The 0x20 bit in this MSR is 1, which indicates that the 0x20 bit in CR0 must be 1 when executing VMXON. However, my hypervisor uses CR0 = 0x80000015 (the 0x20 bit is 0).

According to SDM 29.3, if "the values of CR0 and CR4 are not supported in VMX operation", then a general protection exception (#GP(0)) should be raised. This happens on real hardware, but not on KVM.

The relevant code in my hypervisor is:

https://github.com/lxylxy123456/uberxmhf/blob/770bdaa7afce560b9f46348bee5a05e2c680de06/xmhf/src/xmhf-core/xmhf-runtime/xmhf-startup/lhv-vmx.c#L250

The pseudo code is

print the hypervisor's CR0 to serial port
print MSR value IA32_VMX_CR0_FIXED0 to serial port
print MSR value IA32_VMX_CR0_FIXED1 to serial port
sleep for 3 seconds
run VMXON instruction
If succeed, write "VMXON succeeds" to serial port.
If VMXON receives an exception, write exception details to serial port ("[00]: unhandled exception...")

To fix this bug, handle_vmon() in arch/x86/kvm/vmx/nested.c needs to be updated. The check to CR0 and CR4 against IA32_VMX_CR0_FIXED0 etc need to be added.
Comment 1 Sean Christopherson 2022-05-26 16:18:35 UTC
Ugh, KVM is comically wrong.  It _deliberately_ avoids checking CR0/CR4 with a comment saying that "most faulting conditions have already been checked by hardware", but the SDM pseudocode makes it very clear that only the (CR0.PE = 0) or (CR4.VMXE = 0) or (RFLAGS.VM = 1) or (IA32_EFER.LMA = 1 and CS.L = 0) checks are performed before the VM-Exit occurs.

	/*
	 * The Intel VMX Instruction Reference lists a bunch of bits that are
	 * prerequisite to running VMXON, most notably cr4.VMXE must be set to
	 * 1 (see vmx_is_valid_cr4() for when we allow the guest to set this).
	 * Otherwise, we should fail with #UD.  But most faulting conditions
	 * have already been checked by hardware, prior to the VM-exit for
	 * VMXON.  We do test guest cr4.VMXE because processor CR4 always has
	 * that bit set to 1 in non-root mode.
	 */
	if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE)) {
		kvm_queue_exception(vcpu, UD_VECTOR);
		return 1;
	}
Comment 2 Eric Li 2022-09-02 18:39:14 UTC
@Sean Christopherson Thanks for submitting the fix to this bug in https://lore.kernel.org/lkml/20220607213604.3346000-4-seanjc@google.com/ . However, I recently tested this fix and the behavior is not as expected.

According to Intel's SDM, VMXON may generate 2 types of exceptions:

    IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
        THEN #UD;
    ELSIF not in VMX operation
        THEN
            IF (CPL > 0) or (in A20M mode) or
            (the values of CR0 and CR4 are not supported in VMX operation ...
                THEN #GP(0);

For example, when CR4 value is incorrect, different exceptions may be generated depending on which bit is incorrect. If CR4.VMXE = 0, #UD should be generated. Otherwise, #GP(0) should be generated. However, after the fix, #UD is always generated.
Comment 3 Sean Christopherson 2022-09-02 19:00:12 UTC
On Fri, Sep 02, 2022, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=216033
> 
> --- Comment #2 from Eric Li (ercli@ucdavis.edu) ---
> @Sean Christopherson Thanks for submitting the fix to this bug in
> https://lore.kernel.org/lkml/20220607213604.3346000-4-seanjc@google.com/ .
> However, I recently tested this fix and the behavior is not as expected.
> 
> According to Intel's SDM, VMXON may generate 2 types of exceptions:
> 
>     IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
>         THEN #UD;
>     ELSIF not in VMX operation
>         THEN
>             IF (CPL > 0) or (in A20M mode) or
>             (the values of CR0 and CR4 are not supported in VMX operation ...
>                 THEN #GP(0);
> 
> For example, when CR4 value is incorrect, different exceptions may be
> generated
> depending on which bit is incorrect. If CR4.VMXE = 0, #UD should be
> generated.
> Otherwise, #GP(0) should be generated. However, after the fix, #UD is always
> generated.

/facepalm

All that and I overlooked that the other CR0/CR4 checks take a #GP.

On the bright side, it does mean I can blame Jim at least a little bit for commit
70f3aac964ae ("kvm: nVMX: Remove superfluous VMX instruction fault checks").

Untested, but this should do the trick.

---
 arch/x86/kvm/vmx/nested.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ddd4367d4826..86ee2ab8a497 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4936,25 +4936,32 @@ static int handle_vmxon(struct kvm_vcpu *vcpu)
 		| FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX;
 
 	/*
-	 * Note, KVM cannot rely on hardware to perform the CR0/CR4 #UD checks
-	 * that have higher priority than VM-Exit (see Intel SDM's pseudocode
-	 * for VMXON), as KVM must load valid CR0/CR4 values into hardware while
-	 * running the guest, i.e. KVM needs to check the _guest_ values.
+	 * Note, KVM cannot rely on hardware to perform the CR0.PE and CR4.VMXE
+	 * #UD checks that have higher priority than VM-Exit (see Intel SDM's
+	 * pseudocode for VMXON), as KVM must load valid CR0/CR4 values into
+	 * hardware while running the guest, i.e. KVM needs to check the _guest_
+	 * values.
 	 *
 	 * Rely on hardware for the other two pre-VM-Exit checks, !VM86 and
 	 * !COMPATIBILITY modes.  KVM may run the guest in VM86 to emulate Real
 	 * Mode, but KVM will never take the guest out of those modes.
 	 */
+	if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+	    !kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	/*
+	 * All other checks that are lower priority than VM-Exit must be
+	 * checked manually, including the other CR0/CR4 reserved bit checks.
+	 */
 	if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) ||
 	    !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) {
 		kvm_queue_exception(vcpu, UD_VECTOR);
 		return 1;
 	}
 
-	/*
-	 * CPL=0 and all other checks that are lower priority than VM-Exit must
-	 * be checked manually.
-	 */
 	if (vmx_get_cpl(vcpu)) {
 		kvm_inject_gp(vcpu, 0);
 		return 1;

base-commit: 476d5fb78ea6438941559af4814a2795849cb8f0
Comment 4 Eric Li 2022-09-02 19:45:51 UTC
>       if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) ||
>           !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) {
>               kvm_queue_exception(vcpu, UD_VECTOR);
>               return 1;
>       }

Thanks for the reply. I think there is still a typo. Do you mean the following?

 	if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) ||
 	    !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) {
 		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
 		return 1;
 	}

Or maybe:

 	if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) ||
 	    !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) {
 		kvm_inject_gp(vcpu, 0);
 		return 1;
 	}

I am not familiar with KVM code so not sure which one should be used. Thanks again!
Comment 5 Sean Christopherson 2022-09-02 20:57:18 UTC
On Fri, Sep 02, 2022, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=216033
> 
> --- Comment #4 from Eric Li (ercli@ucdavis.edu) ---
> >       if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) ||
> >           !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) {
> >               kvm_queue_exception(vcpu, UD_VECTOR);
> >               return 1;
> >       }
> 
> Thanks for the reply. I think there is still a typo. Do you mean the
> following?

Yes, yes I did.

>         if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) ||
>             !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) {
>                 kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
>                 return 1;
>         }
> 
> Or maybe:
> 
>         if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) ||
>             !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) {
>                 kvm_inject_gp(vcpu, 0);
>                 return 1;
>         }
> 
> I am not familiar with KVM code

Heh, for all the good that being familiar with KVM is doing me.

> so not sure which one should be used. Thanks again!