Bug 218267 - [Sapphire Rapids][Upstream]Boot up multiple Windows VMs hang
Summary: [Sapphire Rapids][Upstream]Boot up multiple Windows VMs hang
Status: NEW
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: Intel Linux
Importance: P3 high
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-12-15 08:23 UTC by guoqiang
Modified: 2025-01-27 18:32 UTC
CC List: 5 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Boot up 8 Windows VM script (880 bytes, text/plain)
2023-12-15 08:23 UTC, guoqiang

Description guoqiang 2023-12-15 08:23:46 UTC
Created attachment 305601
Boot up 8 Windows VM script

System Environment
=======

Platform: Sapphire Rapids Platform

Host OS: CentOS Stream 9

Kernel: 6.7.0-rc1 (commit: 8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1)
QEMU: QEMU emulator version 8.1.94 (v8.2.0-rc4) (commit: 039afc5ef7367fbc8fb475580c291c2655e856cb)

Host kernel cmdline: BOOT_IMAGE=/kvm-vmlinuz root=/dev/mapper/cs_spr--2s2-root ro crashkernel=auto console=tty0 console=ttyS0,115200,8n1 3 intel_iommu=on disable_mtrr_cleanup

Bug detailed description
=======
We boot 8 Windows VMs on the host (total vCPUs > pCPUs), run random applications in each VM (e.g. editing in WPS), and wait a moment. Some of the Windows guests then hang and the console reports "KVM internal error. Suberror: 3".

Tips: We add "-cpu host,host-cache-info=on,migratable=on,hv-time=on,hv-relaxed=on,hv-vapic=on,hv-spinlocks=0x1fff" to the QEMU parameters when booting the VMs. With these options, some of the VMs hang easily.
 

Reproduce Steps
==============
1. Boot up 8 Windows VMs on the host:

for ((i=1;i<=8;i++)); do
    # Create a per-VM overlay on top of the shared base image.
    qemu-img create -b /home/guoqiang/win2k16_vdi_local.qcow2 -F qcow2 -f qcow2 /home/guoqiang/win2016$i.qcow2
    sleep 1
    qemu-system-x86_64 -accel kvm \
        -cpu host,host-cache-info=on,migratable=on,hv-time=on,hv-relaxed=on,hv-vapic=on,hv-spinlocks=0x1fff \
        -smp 30 \
        -drive file=/home/guoqiang/win2016$i.qcow2,if=none,id=virtio-disk0 \
        -device virtio-blk-pci,drive=virtio-disk0,bootindex=0 \
        -m 4096 -daemonize -vnc :$i \
        -device virtio-net-pci,netdev=nic0 \
        -netdev tap,id=nic0,br=virbr0,helper=/usr/local/libexec/qemu-bridge-helper,vhost=on
    sleep 5
done

2. Wait a moment and the VMs hang.

Host error log:
KVM internal error. Suberror: 3

extra data[0]: 0x000000008000002f

extra data[1]: 0x0000000000000020

extra data[2]: 0x0000000000000d83

extra data[3]: 0x0000000000000038

RAX=0000000000000000 RBX=0000000000000000 RCX=0000000040000070 RDX=0000000000000000

RSI=0000000000000000 RDI=ffffc58dcf552010 RBP=fffff801ed48e100 RSP=fffff801ed48e060

R8 =00000000ffffffff R9 =0000000000000000 R10=00000000ffffffff R11=0000000000000000

R12=000000133fd128fc R13=0000000000000046 R14=0000000000000000 R15=0000000000000000

RIP=fffff801eb94fd7c RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0

ES =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]

CS =0010 0000000000000000 00000000 00209b00 DPL=0 CS64 [-RA]

SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]

DS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]

FS =0053 000000000059b000 00003c00 0040f300 DPL=3 DS [-WA]

GS =002b fffff801ebb3f000 ffffffff 00c0f300 DPL=3 DS [-WA]

LDT=0000 0000000000000000 ffffffff 00c00000

TR =0040 fffff801ed486070 00000067 00008b00 DPL=0 TSS64-busy

GDT= fffff801ed485000 0000006f

IDT= fffff801ed485070 00000fff

CR0=80050031 CR2=0000000000000030 CR3=00000000001aa000 CR4=001506f8

DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000

DR6=00000000ffff0ff0 DR7=0000000000000400

EFER=0000000000000d01

Code=25 88 61 00 00 b9 70 00 00 40 0f ba 32 00 72 06 33 c0 8b d0 <0f> 30 5a 58 59 c3 cc cc cc cc cc cc 0f 1f 84 00 00 00 00 00 48 81 ec 38 01 00 00 48 8d 84

KVM internal error. Suberror: 3

extra data[0]: 0x000000008000002f

extra data[1]: 0x0000000000000020

extra data[2]: 0x0000000000000d81

extra data[3]: 0x00000000000000a2

RAX=0000000000000000 RBX=0000000000000000 RCX=0000000040000070 RDX=0000000000000000

RSI=0000000000000000 RDI=ffffdf86659d07b0 RBP=ffff96806225b100 RSP=ffff96806225b060

R8 =00000000ffffffff R9 =0000000000000000 R10=00000000ffffffff R11=0000000000000000

R12=00000013e153ce49 R13=0000000000000046 R14=0000000000000000 R15=0000000000000000

RIP=fffff8001f1ddd7c RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0

ES =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]

CS =0010 0000000000000000 00000000 00209b00 DPL=0 CS64 [-RA]

SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]

DS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]

FS =0053 0000000000604000 00007c00 0040f300 DPL=3 DS [-WA]

GS =002b ffff968062230000 ffffffff 00c0f300 DPL=3 DS [-WA]

LDT=0000 0000000000000000 ffffffff 00c00000

TR =0040 ffff968062236ac0 00000067 00008b00 DPL=0 TSS64-busy

GDT= ffff96806223db80 0000006f

IDT= ffff96806223dbf0 00000fff

CR0=80050031 CR2=0000000000000030 CR3=00000000001aa000 CR4=001506f8

DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000

DR6=00000000fffe07f0 DR7=0000000000000400

EFER=0000000000000d01

Code=25 88 61 00 00 b9 70 00 00 40 0f ba 32 00 72 06 33 c0 8b d0 <0f> 30 5a 58 59 c3 cc cc cc cc cc cc 0f 1f 84 00 00 00 00 00 48 81 ec 38 01 00 00 48 8d 84

KVM internal error. Suberror: 3

extra data[0]: 0x000000008000002f

extra data[1]: 0x0000000000000020

extra data[2]: 0x0000000000000f82

extra data[3]: 0x000000000000004b

KVM internal error. Suberror: 3

extra data[0]: 0x000000008000002f

extra data[1]: 0x0000000000000020

extra data[2]: 0x0000000000000f82

extra data[3]: 0x000000000000004b

RAX=0000000000000000 RBX=0000000000000000 RCX=0000000040000070 RDX=0000000000000000

RSI=0000000000000000 RDI=ffffe7885a932010 RBP=fffff802a5a8e100 RSP=fffff802a5a8e060

R8 =00000000ffffffff R9 =0000000000000000 R10=00000000ffffffff R11=0000000000000000

R12=000000144b0a7258 R13=0000000000000046 R14=0000000000000000 R15=0000000000000000

RIP=fffff802a3f60d7c RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0

ES =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]

CS =0010 0000000000000000 00000000 00209b00 DPL=0 CS64 [-RA]

SS =0018 0000000000000000 00000000 00409300 DPL=0 DS [-WA]

DS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]

FS =0053 0000000013b70000 00003c00 0040f300 DPL=3 DS [-WA]

GS =002b fffff802a4150000 ffffffff 00c0f300 DPL=3 DS [-WA]

LDT=0000 0000000000000000 ffffffff 00c00000

TR =0040 fffff802a5a86070 00000067 00008b00 DPL=0 TSS64-busy

GDT= fffff802a5a85000 0000006f

IDT= fffff802a5a85070 00000fff

CR0=80050031 CR2=0000000000000030 CR3=00000000001aa000 CR4=001506f8

DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000

DR6=00000000ffff0ff0 DR7=0000000000000400

EFER=0000000000000d01

Code=25 88 61 00 00 b9 70 00 00 40 0f ba 32 00 72 06 33 c0 8b d0 <0f> 30 5a 58 59 c3 cc cc cc cc cc cc 0f 1f 84 00 00 00 00 00 48 81 ec 38 01 00 00 48 8d 84
Comment 1 Sean Christopherson 2023-12-18 17:54:16 UTC
On Fri, Dec 15, 2023, bugzilla-daemon@kernel.org wrote:
> Platform: Sapphire Rapids Platform
> 
> Host OS: CentOS Stream 9
> 
> Kernel:6.7.0-rc1 (commit:8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1)

...

> Qemu: QEMU emulator version 8.1.94 (v8.2.0-rc4)
> (commit:039afc5ef7367fbc8fb475580c291c2655e856cb)
> 
> Host Kernel cmdline:BOOT_IMAGE=/kvm-vmlinuz root=/dev/mapper/cs_spr--2s2-root
> ro crashkernel=auto console=tty0 console=ttyS0,115200,8n1 3 intel_iommu=on
> disable_mtrr_cleanup
> 
> Bug detailed description
> =======
> We boot up 8 Windows VMs (total vCPUs > pCPUs) in host, random run
> application
> on each VM such as WPS editing etc, and wait for a moment, then Some of the
> Windows Guest hang and console reports "KVM internal error. Suberror: 3".

...

> Code=25 88 61 00 00 b9 70 00 00 40 0f ba 32 00 72 06 33 c0 8b d0 <0f> 30 5a
> 58
> 59 c3 cc cc cc cc cc cc 0f 1f 84 00 00 00 00 00 48 81 ec 38 01 00 00 48 8d 84
> 
> KVM internal error. Suberror: 3
> extra data[0]: 0x000000008000002f  <= Vectoring IRQ 47 (decimal)
> extra data[1]: 0x0000000000000020  <= WRMSR VM-Exit
> extra data[2]: 0x0000000000000f82
> extra data[3]: 0x000000000000004b

KVM exits with an internal error because the CPU indicates that IRQ 47 was being
delivered/vectored when the VM-Exit occurred, but the VM-Exit is due to WRMSR.
A WRMSR VM-Exit is supposed to only occur on an instruction boundary, i.e. can't
occur while delivering an IRQ (or any exception/event), and so KVM kicks out to
userspace because something has gone off the rails.
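
As an illustration of the decoding described above, here is a minimal userspace sketch that splits the two "extra data" words into their fields. The bit layout (vector in bits 7:0, type in bits 10:8, valid in bit 31, basic exit reason in bits 15:0) follows the Intel SDM; the struct and function names are made up for this example and are not KVM's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout of the VMX IDT-vectoring information field (Intel SDM). */
#define VECTORING_INFO_VECTOR_MASK 0x000000ffu /* bits 7:0  */
#define VECTORING_INFO_TYPE_MASK   0x00000700u /* bits 10:8 */
#define VECTORING_INFO_VALID_MASK  0x80000000u /* bit 31    */

/* Hypothetical helper type, not a kernel structure. */
struct vectoring_info {
	bool valid;          /* an event was being delivered at exit */
	unsigned int type;   /* 0 means external interrupt */
	unsigned int vector; /* IDT vector number */
};

static struct vectoring_info decode_vectoring_info(uint32_t raw)
{
	struct vectoring_info info = {
		.valid  = (raw & VECTORING_INFO_VALID_MASK) != 0,
		.type   = (raw & VECTORING_INFO_TYPE_MASK) >> 8,
		.vector = raw & VECTORING_INFO_VECTOR_MASK,
	};

	return info;
}

/* The basic exit reason occupies bits 15:0 of the exit-reason field. */
static unsigned int basic_exit_reason(uint32_t raw)
{
	return raw & 0xffffu;
}
```

Feeding it extra data[0] = 0x8000002f gives a valid external interrupt with vector 47, and extra data[1] = 0x20 gives basic exit reason 32, i.e. WRMSR.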

   b9 70 00 00 40          mov    ecx, 0x40000070
   0f ba 32 00             btr    DWORD PTR [rdx], 0x0
   72 06                   jb     0x16
   33 c0                   xor    eax, eax
   8b d0                   mov    edx, eax
   0f 30                   wrmsr

FWIW, the MSR in question is Hyper-V's synthetic EOI, a.k.a. HV_X64_MSR_EOI, though
I doubt the exact MSR matters.
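
To show where the MSR index comes from, the following sketch reads the 32-bit little-endian immediate out of the "b9 70 00 00 40" bytes in the Code= dump. The helper names and the sample array are hypothetical and exist only for this illustration; the value 0x40000070 for HV_X64_MSR_EOI is the one identified above.

```c
#include <stdbool.h>
#include <stdint.h>

#define HV_X64_MSR_EOI 0x40000070u /* Hyper-V synthetic EOI MSR */

/* The five bytes preceding the btr in the Code= dump. */
static const uint8_t sample_mov[] = { 0xb9, 0x70, 0x00, 0x00, 0x40 };

/* Read a 32-bit little-endian value, the way x86 encodes immediates. */
static uint32_t read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* True if the bytes start with "mov ecx, imm32" (opcode 0xb9). */
static bool is_mov_ecx_imm32(const uint8_t *insn)
{
	return insn[0] == 0xb9;
}

/* Immediate operand of "mov ecx, imm32"; call only if the check passed. */
static uint32_t mov_ecx_imm32_value(const uint8_t *insn)
{
	return read_le32(insn + 1);
}
```

Applied to the sample bytes, the immediate is 0x40000070, which is what ends up in ECX when the WRMSR executes.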

Have you tried an older host kernel?  If not, can you try something like v6.1?
Note, if you do, use base v6.1, *not* the stable tree, in case a bug was backported.

There was a recent change to relevant code, commit 50011c2a2457 ("KVM: VMX: Refresh
available regs and IDT vectoring info before NMI handling"), though I don't see
any obvious bugs.  But I'm pretty sure the only alternative explanation is a
CPU/ucode bug, so it's definitely worth checking older versions of KVM.
Comment 2 yuxiating 2024-03-27 11:59:26 UTC
Do you have any progress on this issue?

I have the same error on Windows 2008R2, but the same virtual machine works fine on an Ice Lake CPU.
Comment 3 Chao Gao 2024-04-08 05:21:38 UTC
This is not considered a Linux/KVM issue.

Guoqiang, could you close this ticket?

Yuxiating, I assume you are using APICv and also have "hv-vapic" in the QEMU command line. For now, you can remove "hv-vapic" to work around this issue. Note that APICv outperforms Hyper-V's synthetic MSRs; regardless of this bug, it is recommended to remove "hv-vapic" if KVM enables APICv.
Comment 4 Sean Christopherson 2024-04-08 17:22:24 UTC
On Mon, Apr 08, 2024, bugzilla-daemon@kernel.org wrote:
> This is not considered a Linux/KVM issue.

Can you elaborate?  E.g. if this is an SPR ucode/CPU bug, it would be nice to know
what's going wrong, so that at the very least we can more easily triage issues.
Comment 5 vkuznets 2024-08-06 11:02:11 UTC
(In reply to Chao Gao from comment #3)

> Note that, APICv outperforms Hyper-V's synthetic MSRs; regardless of this
> bug, it is recommended to remove "hv-vapic" if KVM enables APICv.

'hv-vapic' is a prerequisite for some other Hyper-V features, e.g. Enlightened VMCS, so disabling it may not be desirable.

Also, there's the 'hv-apicv' (AKA 'hv-avic') feature, which prevents AutoEOI advertisement (AutoEOI can't work with APICv, and thus KVM inhibits APICv with APICV_INHIBIT_REASON_HYPERV). AFAIR, newer Windows versions don't use AutoEOI either way, but Win8/Win7 may.
Comment 6 mlevitsk 2024-12-13 17:25:47 UTC
On Mon, 2024-04-08 at 17:22 +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=218267
> 
> --- Comment #4 from Sean Christopherson (seanjc@google.com) ---
> On Mon, Apr 08, 2024, bugzilla-daemon@kernel.org wrote:
> > This is not considered a Linux/KVM issue.
> 
> Can you elaborate?  E.g. if this is an SPR ucode/CPU bug, it would be nice to
> know what's going wrong, so that at the very least we can more easily triage
> issues.
> 

Hi!

Any update on this? We seem to hit this bug as well but so far I don't have new details on what is going on.


Best regards,
	Maxim Levitsky
Comment 7 Chao Gao 2024-12-14 01:33:59 UTC
Hi Maxim,

I was told the erratum writeup and microcode fix would be released this month.

I just checked the microcode release https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/releases. The microcode fix hasn't been released yet, but the erratum is already in SPR/EMR specification. e.g., for SPR,

SPR141. VM Exit Following MOV to CR8 Instruction May Lead to Unexpected IDT Vectoring-Information
Problem: Under certain conditions, a VM exit following execution of the MOV to CR8 instruction may unexpectedly result in setting the Valid bit (bit 31) of the IDT-Vectoring Information Field in the Virtual Machine Control Structure (VMCS).
Implication: Depending on the operation of the virtual-machine monitor (VMM), this may result in unexpected VM behavior.
Workaround: It may be possible for the BIOS to contain a workaround for this erratum.
Comment 8 Sean Christopherson 2024-12-16 19:08:13 UTC
Thanks Chao!

Until the ucode update is available, I think we can work around the issue in KVM by clearing VECTORING_INFO_VALID_MASK _immediately_ after exit, i.e. before queueing the event for re-injection, if it should be impossible for the exit to have occurred while vectoring.  I'm not sure I want to carry something like this long-term since a ucode fix is imminent, but at the least it can hopefully unblock end users.

The below uses a fairly conservative list of exits (a false positive could be quite painful).  A slightly less conservative approach would be to also include:

case EXIT_REASON_EXTERNAL_INTERRUPT:
case EXIT_REASON_TRIPLE_FAULT:
case EXIT_REASON_INIT_SIGNAL:
case EXIT_REASON_SIPI_SIGNAL:
case EXIT_REASON_INTERRUPT_WINDOW:
case EXIT_REASON_NMI_WINDOW:

as those exits should all be recognized only at instruction boundaries.

Compile tested only...

---
 arch/x86/kvm/vmx/vmx.c | 66 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 893366e53732..7240bd72b5f2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -147,6 +147,9 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
 extern bool __read_mostly allow_smaller_maxphyaddr;
 module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
 
+static bool __ro_after_init enable_spr141_erratum_workaround = true;
+module_param(enable_spr141_erratum_workaround, bool, S_IRUGO);
+
 #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
 #define KVM_VM_CR0_ALWAYS_ON				\
@@ -7163,8 +7166,67 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 	}
 }
 
+static bool is_vectoring_on_exit_impossible(struct vcpu_vmx *vmx)
+{
+	switch (vmx->exit_reason.basic) {
+	case EXIT_REASON_CPUID:
+	case EXIT_REASON_HLT:
+	case EXIT_REASON_INVD:
+	case EXIT_REASON_INVLPG:
+	case EXIT_REASON_RDPMC:
+	case EXIT_REASON_RDTSC:
+	case EXIT_REASON_VMCALL:
+	case EXIT_REASON_VMCLEAR:
+	case EXIT_REASON_VMLAUNCH:
+	case EXIT_REASON_VMPTRLD:
+	case EXIT_REASON_VMPTRST:
+	case EXIT_REASON_VMREAD:
+	case EXIT_REASON_VMRESUME:
+	case EXIT_REASON_VMWRITE:
+	case EXIT_REASON_VMOFF:
+	case EXIT_REASON_VMON:
+	case EXIT_REASON_CR_ACCESS:
+	case EXIT_REASON_DR_ACCESS:
+	case EXIT_REASON_IO_INSTRUCTION:
+	case EXIT_REASON_MSR_READ:
+	case EXIT_REASON_MSR_WRITE:
+	case EXIT_REASON_MSR_LOAD_FAIL:
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+	case EXIT_REASON_MONITOR_TRAP_FLAG:
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+	case EXIT_REASON_PAUSE_INSTRUCTION:
+	case EXIT_REASON_TPR_BELOW_THRESHOLD:
+	case EXIT_REASON_GDTR_IDTR:
+	case EXIT_REASON_LDTR_TR:
+	case EXIT_REASON_INVEPT:
+	case EXIT_REASON_RDTSCP:
+	case EXIT_REASON_PREEMPTION_TIMER:
+	case EXIT_REASON_INVVPID:
+	case EXIT_REASON_WBINVD:
+	case EXIT_REASON_XSETBV:
+	case EXIT_REASON_APIC_WRITE:
+	case EXIT_REASON_RDRAND:
+	case EXIT_REASON_INVPCID:
+	case EXIT_REASON_VMFUNC:
+	case EXIT_REASON_ENCLS:
+	case EXIT_REASON_RDSEED:
+	case EXIT_REASON_XSAVES:
+	case EXIT_REASON_XRSTORS:
+	case EXIT_REASON_UMWAIT:
+	case EXIT_REASON_TPAUSE:
+		return true;
+	}
+
+	return false;
+}
+
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if ((vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
+	    enable_spr141_erratum_workaround &&
+	    is_vectoring_on_exit_impossible(vmx))
+		vmx->idt_vectoring_info &= ~VECTORING_INFO_VALID_MASK;
+
 	__vmx_complete_interrupts(&vmx->vcpu, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -8487,6 +8549,10 @@ __init int vmx_hardware_setup(void)
 	if (!enable_apicv || !cpu_has_vmx_ipiv())
 		enable_ipiv = false;
 
+	if (boot_cpu_data.x86_vfm != INTEL_SAPPHIRERAPIDS_X &&
+	    boot_cpu_data.x86_vfm != INTEL_EMERALDRAPIDS_X)
+		enable_spr141_erratum_workaround = false;
+
 	if (cpu_has_vmx_tsc_scaling())
 		kvm_caps.has_tsc_control = true;
 

base-commit: 50e5669285fc2586c9f946c1d2601451d77cb49e
--
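
For readers who want to experiment with the idea outside the kernel, the core of the patch above can be modeled in a few lines of userspace C. This is only a sketch: the exit-reason numbers are the basic exit reasons from the Intel SDM, the list is deliberately trimmed to a handful of entries, and sanitize_vectoring_info() is a hypothetical name that does not exist in KVM.

```c
#include <stdbool.h>
#include <stdint.h>

#define VECTORING_INFO_VALID_MASK 0x80000000u /* bit 31 */

/* A few basic exit reasons, numbered as in the Intel SDM. */
enum {
	EXIT_REASON_CPUID         = 10,
	EXIT_REASON_HLT           = 12,
	EXIT_REASON_CR_ACCESS     = 28,
	EXIT_REASON_MSR_READ      = 31,
	EXIT_REASON_MSR_WRITE     = 32,
	EXIT_REASON_EPT_VIOLATION = 48,
};

/* Trimmed stand-in for the patch's is_vectoring_on_exit_impossible(). */
static bool vectoring_on_exit_impossible(unsigned int basic_reason)
{
	switch (basic_reason) {
	case EXIT_REASON_CPUID:
	case EXIT_REASON_HLT:
	case EXIT_REASON_CR_ACCESS:
	case EXIT_REASON_MSR_READ:
	case EXIT_REASON_MSR_WRITE:
		return true;
	default:
		return false;
	}
}

/*
 * Return the vectoring info that would be acted on after the workaround:
 * the valid bit is dropped only when the exit cannot have interrupted
 * event delivery. Hypothetical helper, not a KVM function.
 */
static uint32_t sanitize_vectoring_info(uint32_t info, unsigned int basic_reason)
{
	if ((info & VECTORING_INFO_VALID_MASK) &&
	    vectoring_on_exit_impossible(basic_reason))
		info &= ~VECTORING_INFO_VALID_MASK;

	return info;
}
```

With the values from the report (vectoring info 0x8000002f, exit reason 32), the valid bit is cleared, whereas the same vectoring info on an EPT violation is left untouched because that exit can legitimately occur during event delivery.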
Comment 9 Chao Gao 2024-12-17 03:13:49 UTC
On Mon, Dec 16, 2024 at 07:08:13PM +0000, bugzilla-daemon@kernel.org wrote:
>https://bugzilla.kernel.org/show_bug.cgi?id=218267
>
>--- Comment #8 from Sean Christopherson (seanjc@google.com) ---
>Thanks Chao!
>
>Until the ucode update is available, I think we can workaround the issue in
>KVM
>by clearing VECTORING_INFO_VALID_MASK _immediately_ after exit, i.e. before
>queueing the event for re-injection, if it should be impossible for the exit
>to
>have occurred while vectoring.  I'm not sure I want to carry something like

Yes. I tried a similar workaround (i.e., clearing the "valid" bit only for
EXIT_REASON_MSR_WRITE) and our tests showed that it works well.

Strictly speaking, this issue also impacts those VM exits which may occur
during event delivery, because they might be reported as occurring during event
delivery even if they didn't. KVM won't notice this, and the guest will receive
an extra event due to event re-injection. I wrote a kselftest to demonstrate
this.

Clearing the valid bit works in practice. And there is no ideal software
workaround for all cases. Disabling APICv or intercepting MOV-to-CR8 can
eliminate the issue, but neither is ideal due to the performance impact.
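
The limitation described here can be made concrete with a small three-way classification. This is a sketch under simplifying assumptions: only two instruction-boundary exits are listed (WRMSR from the original report and the external-interrupt exit mentioned in this thread), everything else is lumped together as ambiguous, and the enum and function names are invented for the example.

```c
#include <stdint.h>

#define VECTORING_INFO_VALID_MASK 0x80000000u /* bit 31 */

/* Basic exit reasons, numbered as in the Intel SDM. */
enum {
	EXIT_REASON_EXTERNAL_INTERRUPT = 1,
	EXIT_REASON_MSR_WRITE          = 32,
	EXIT_REASON_EPT_VIOLATION      = 48,
};

/* Hypothetical verdicts, not KVM terminology. */
enum vectoring_verdict {
	VECTORING_NOT_REPORTED, /* valid bit clear, nothing to re-inject */
	VECTORING_BOGUS,        /* valid bit set on a boundary-only exit */
	VECTORING_AMBIGUOUS,    /* valid bit set; genuine or erratum, unknown */
};

static enum vectoring_verdict judge_vectoring(uint32_t info,
					       unsigned int basic_reason)
{
	if (!(info & VECTORING_INFO_VALID_MASK))
		return VECTORING_NOT_REPORTED;

	switch (basic_reason) {
	case EXIT_REASON_EXTERNAL_INTERRUPT:
	case EXIT_REASON_MSR_WRITE:
		/* Safe to clear: the exit cannot interrupt event delivery. */
		return VECTORING_BOGUS;
	default:
		/* Software cannot tell a real in-flight event from the erratum. */
		return VECTORING_AMBIGUOUS;
	}
}
```

Only the VECTORING_BOGUS case can be fixed up by clearing the valid bit; in the ambiguous case the event is re-injected, which is where a spurious extra event can reach the guest.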

>this long-term since a ucode fix is imminent, but at the least it can
>hopefully
>unblock end users.
>
>The below uses a fairly conservative list of exits (a false positive could be
>quite painful).  A slightly less conservative approach would be to also
>include:
>
>case EXIT_REASON_EXTERNAL_INTERRUPT:

We need to include EXTERNAL_INTERRUPT because we observed it in real workloads
on affected CPUs.

>case EXIT_REASON_TRIPLE_FAULT:
>case EXIT_REASON_INIT_SIGNAL:
>case EXIT_REASON_SIPI_SIGNAL:
>case EXIT_REASON_INTERRUPT_WINDOW:
>case EXIT_REASON_NMI_WINDOW:
>
>as those exits should all be recognized only at instruction boundaries.
>
>Compile tested only...
>
>---

...

>@@ -8487,6 +8549,10 @@ __init int vmx_hardware_setup(void)
>        if (!enable_apicv || !cpu_has_vmx_ipiv())
>                enable_ipiv = false;
>
>+       if (boot_cpu_data.x86_vfm != INTEL_SAPPHIRERAPIDS_X &&
>+           boot_cpu_data.x86_vfm != INTEL_EMERALDRAPIDS_X)
>+               enable_spr141_erratum_workaround = false;

RaptorLake has the same issue.

https://cdrdv2.intel.com/v1/dl/getContent/740518

>+
>        if (cpu_has_vmx_tsc_scaling())
>                kvm_caps.has_tsc_control = true;
>
>
>base-commit: 50e5669285fc2586c9f946c1d2601451d77cb49e
>--
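
If Raptor Lake is affected as well, the model check in the quoted hunk would need to cover more than the two server parts. The sketch below derives family and model from a CPUID leaf 1 EAX value and tests it against a candidate list. The model numbers (0x8f, 0xcf, 0xb7, 0xba, 0xbf) are the family-6 values from Linux's intel-family.h as best recalled, and including all three Raptor Lake variants is an assumption, not something confirmed in this thread; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Display family from CPUID.1:EAX (extended family applies only to 0xf). */
static unsigned int cpuid_family(uint32_t eax)
{
	unsigned int family = (eax >> 8) & 0xf;

	if (family == 0xf)
		family += (eax >> 20) & 0xff;
	return family;
}

/* Display model from CPUID.1:EAX (extended model applies to 0x6 and 0xf). */
static unsigned int cpuid_model(uint32_t eax)
{
	unsigned int family = (eax >> 8) & 0xf;
	unsigned int model = (eax >> 4) & 0xf;

	if (family == 0x6 || family == 0xf)
		model |= ((eax >> 16) & 0xf) << 4;
	return model;
}

/* Hypothetical check; the affected-model list is an assumption. */
static bool model_may_need_workaround(uint32_t eax)
{
	if (cpuid_family(eax) != 6)
		return false;

	switch (cpuid_model(eax)) {
	case 0x8f: /* Sapphire Rapids X */
	case 0xcf: /* Emerald Rapids X */
	case 0xb7: /* Raptor Lake (assumed affected) */
	case 0xba: /* Raptor Lake P (assumed affected) */
	case 0xbf: /* Raptor Lake S (assumed affected) */
		return true;
	default:
		return false;
	}
}
```

The authoritative list of affected parts is whatever the respective Intel specification updates say, so a real patch would have to be checked against those documents.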
Comment 10 shu.info.oss 2024-12-24 06:44:02 UTC
Hi Gao,

(In reply to Chao Gao from comment #7)
> I just checked the microcode release
> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/releases.
> The microcode fix hasn't been released yet, but the erratum is already in
> SPR/EMR specification. e.g., for SPR,
> 
> SPR141. VM Exit Following MOV to CR8 Instruction May Lead to Unexpected
> IDT Vectoring-Information
> Problem: Under certain conditions, a VM exit following execution of the MOV
> to CR8
> instruction may unexpectedly result in setting the Valid bit (bit 31) of the
> IDTVectoring Information Field in the Virtual Machine Control Structure
> (VMCS).
> Implication: Depending on the operation of the virtual-machine monitor
> (VMM), this
> may result in unexpected VM behavior.
> Workaround: It may be possible for the BIOS to contain a workaround for this
> erratum

Can this bug be resolved with a BIOS update alone, if the update includes a fix for it?
If so, what is the ticket number for that BIOS update? Is it SPR141?

Thanks,
Hidehiko Matsumoto
Comment 11 mlevitsk 2025-01-22 16:43:31 UTC
On Mon, 2024-12-16 at 19:08 +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=218267
> 
> --- Comment #8 from Sean Christopherson (seanjc@google.com) ---
> Thanks Chao!
> 
> Until the ucode update is available, I think we can workaround the issue in
> KVM
> by clearing VECTORING_INFO_VALID_MASK _immediately_ after exit, i.e. before
> queueing the event for re-injection, if it should be impossible for the exit
> to
> have occurred while vectoring.  I'm not sure I want to carry something like
> this long-term since a ucode fix is imminent, but at the least it can
> hopefully
> unblock end users.
> 
> The below uses a fairly conservative list of exits (a false positive could be
> quite painful).  A slightly less conservative approach would be to also
> include:
> 
> case EXIT_REASON_EXTERNAL_INTERRUPT:
> case EXIT_REASON_TRIPLE_FAULT:
> case EXIT_REASON_INIT_SIGNAL:
> case EXIT_REASON_SIPI_SIGNAL:
> case EXIT_REASON_INTERRUPT_WINDOW:
> case EXIT_REASON_NMI_WINDOW:
> 
> as those exits should all be recognized only at instruction boundaries.
> 
> Compile tested only...
> 

Do we plan to move forward with this workaround, or do you think it adds too much complexity to KVM?

Best regards,
	Maxim Levitsky
Comment 12 Sean Christopherson 2025-01-27 18:32:21 UTC
I'm not terribly concerned about the complexity, I'm more concerned about the efficacy of a software workaround, and to a lesser extent the risk of doing more harm than good (this seems unlikely though).  E.g. if an exit that _can_ occur during vectoring collides with the bug, then KVM will inject a spurious fault into the guest.  And if our list of "impossible" exits is wrong, KVM could incorrectly suppress an exception.

I suppose we could mitigate the efficacy concerns by emitting a pr_err_once() to suggest a ucode update if the erratum is hit.
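
The warn-once idea could look roughly like the following userspace analogue. It is only a sketch: in the kernel this would be a pr_err_once() call, whereas here a static flag and fprintf() stand in for it, and both the function name and the message text are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Print the erratum hint the first time it is called and stay silent
 * afterwards. Returns true only for the call that actually printed.
 * Not thread-safe; a userspace stand-in for the kernel's pr_err_once().
 */
static bool warn_erratum_once(FILE *out)
{
	static bool warned;

	if (warned)
		return false;

	warned = true;
	fprintf(out, "unexpected IDT-vectoring info on VM-Exit, "
		     "consider updating CPU microcode\n");
	return true;
}
```

Calling it from the point where the valid bit is cleared would leave a single trace in the log that the erratum was hit, without flooding it on every affected exit.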
