Bug 213781 - KVM: x86/svm: The guest (#vcpu>1) can't boot up with QEMU "-overcommit cpu-pm=on"
Status: NEW
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-07-19 10:08 UTC by Like Xu
Modified: 2022-06-22 13:00 UTC

See Also:
Kernel Version: 5.19.0-rc1+
Subsystem:
Regression: No
Bisected commit-id:


Description Like Xu 2021-07-19 10:08:44 UTC
Hi,

This is an upstream bug that is very easy to reproduce on AMD platforms.
It was introduced by commit e72436bc3a5206f95bb384e741154166ddb3202e.

QEMU reports the following error and register dump:

KVM internal error. Suberror: 1
emulation failure
EAX=000f38b3 EBX=00000000 ECX=000002ff EDX=00000001
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00006d88
EIP=000fc95a EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008300 DPL=0 TSS16-busy
GDT=     000f50a0 00000037
IDT=     000f50de 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=34 41 0f 00 e8 5b 26 ff ff c7 05 38 41 0f 00 00 00 00 00 f4 <eb> fd fa fc 66 b8 00 c2 00 00 8e d8 8e d0 66 bc 58 f8 00 00 e9 07 f9 66 54 66 0f b7 e4 66

At the time of the failure, dump_vmcb() reports:

[47175.214140] SVM: VMCB 00000000a4006788, last attempted VMRUN on CPU 81
[47175.215862] SVM: VMCB Control Area:
[47175.216155] SVM: cr_read:            0010
[47175.216426] SVM: cr_write:           0110
[47175.216699] SVM: dr_read:            00ff
[47175.216939] SVM: dr_write:           00ff
[47175.217170] SVM: exceptions:         00060042
[47175.217400] SVM: intercepts:         bc4c8027 0000624f
[47175.217651] SVM: pause filter count: 0
[47175.217879] SVM: pause filter threshold:0
[47175.218107] SVM: iopm_base_pa:       0000000194674000
[47175.218342] SVM: msrpm_base_pa:      00000040857d4000
[47175.218589] SVM: tsc_offset:         ffff92710e0ed2c0
[47175.218823] SVM: asid:               1
[47175.219052] SVM: tlb_ctl:            0
[47175.219280] SVM: int_ctl:            03000200
[47175.219522] SVM: int_vector:         00000000
[47175.219753] SVM: int_state:          00000000
[47175.219981] SVM: exit_code:          00000400
[47175.220208] SVM: exit_info1:         0000000100000014
[47175.220441] SVM: exit_info2:         00000000000fc000
[47175.220684] SVM: exit_int_info:      00000000
[47175.220913] SVM: exit_int_info_err:  00000000
[47175.221140] SVM: nested_ctl:         1
[47175.221363] SVM: nested_cr3:         0000004184ca8000
[47175.221598] SVM: avic_vapic_bar:     0000000000000000
[47175.221823] SVM: ghcb:               0000000000000000
[47175.222047] SVM: event_inj:          00000000
[47175.222272] SVM: event_inj_err:      00000000
[47175.222497] SVM: virt_ext:           2
[47175.222739] SVM: next_rip:           0000000000000000
[47175.222968] SVM: avic_backing_page:  0000000000000000
[47175.223198] SVM: avic_logical_id:    0000000000000000
[47175.223425] SVM: avic_physical_id:   0000000000000000
[47175.223665] SVM: vmsa_pa:            0000000000000000
[47175.223885] SVM: VMCB State Save Area:
[47175.224105] SVM: es:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.224342] SVM: cs:   s: 0008 a: 049b l: ffffffff b: 0000000000000000
[47175.224588] SVM: ss:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.224817] SVM: ds:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.225043] SVM: fs:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.225266] SVM: gs:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.225486] SVM: gdtr: s: 0000 a: 0000 l: 00000037 b: 00000000000f50a0
[47175.225720] SVM: ldtr: s: 0000 a: 0082 l: 0000ffff b: 0000000000000000
[47175.225939] SVM: idtr: s: 0000 a: 0000 l: 00000000 b: 00000000000f50de
[47175.226156] SVM: tr:   s: 0000 a: 0083 l: 0000ffff b: 0000000000000000
[47175.226445] SVM: cpl:            0                efer:         0000000000001000
[47175.226682] SVM: cr0:            0000000000000011 cr2:          0000000000000000
[47175.226900] SVM: cr3:            0000000000000000 cr4:          0000000000000000
[47175.227112] SVM: dr6:            00000000ffff0ff0 dr7:          0000000000000400
[47175.227327] SVM: rip:            00000000000fc95a rflags:       0000000000000002
[47175.227554] SVM: rsp:            0000000000006d88 rax:          00000000000f38b3
[47175.227768] SVM: star:           0000000000000000 lstar:        0000000000000000
[47175.227983] SVM: cstar:          0000000000000000 sfmask:       0000000000000000
[47175.228198] SVM: kernel_gs_base: 0000000000000000 sysenter_cs:  0000000000000000
[47175.228413] SVM: sysenter_esp:   0000000000000000 sysenter_eip: 0000000000000000
[47175.228641] SVM: gpat:           0007040600070406 dbgctl:       0000000000000000
[47175.228859] SVM: br_from:        0000000000000000 br_to:        0000000000000000
[47175.229076] SVM: excp_from:      0000000000000000 excp_to:      0000000000000000

The relevant part of the BIOS code, disassembled:

   fc940:       00 00
   fc942:       72 f3                   jb     fc937 <entry_smp+0xb>
   fc944:       8b 25 34 41 0f 00       mov    0xf4134,%esp
   fc94a:       e8 5b 26 ff ff          call   eefaa <handle_smp>
   fc94f:       c7 05 38 41 0f 00 00    movl   $0x0,0xf4138
   fc956:       00 00 00
   fc959:       f4                      hlt
   fc95a:       eb fd                   jmp    fc959 <entry_smp+0x2d>
   fc95c:       fa                      cli
   fc95d:       fc                      cld
   fc95e:       66 b8 00 c2             mov    $0xc200,%ax
   fc962:       00 00                   add    %al,(%eax)
   fc964:       8e d8                   mov    %eax,%ds
   fc966:       8e d0                   mov    %eax,%ss
   fc968:       66 bc 58 f8             mov    $0xf858,%sp
   fc96c:       00 00                   add    %al,(%eax)
   fc96e:       e9 07 f9 66 54          jmp    5476c27a <code32flat_end+0x5466c27a>
   fc973:       66 0f b7 e4             movzww %sp,%sp
   fc977:       66 9c                   pushfw
   fc979:       fa                      cli
   fc97a:       fc                      cld
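
The faulting RIP 0xfc95a is the two-byte "eb fd" right after the hlt at 0xfc959 (the byte marked <eb> in the QEMU Code= line). A small sketch of how the short-jump target is computed, to confirm that this instruction loops back to the hlt. The function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Target of a two-byte short jump: opcode 0xeb followed by a signed
 * 8-bit displacement, relative to the address of the next instruction.
 */
static uint32_t short_jmp_target(uint32_t insn_addr, const uint8_t insn[2])
{
    assert(insn[0] == 0xeb);
    return insn_addr + 2 + (int8_t)insn[1];
}
```

For the bytes { 0xeb, 0xfd } at 0xfc95a this gives 0xfc959, so the vCPU had left the hlt and was fetching the jump back to it when the fault occurred.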

Or the corresponding SeaBIOS source:

// Entry point for QEMU smp sipi interrupts.
        DECLFUNC entry_smp
entry_smp:
        // Transition to 32bit mode.
        cli
        cld
        movl $2f + BUILD_BIOS_ADDR, %edx
        jmp transition32_nmi_off
        .code32
        // Acquire lock and take ownership of shared stack
1:      rep ; nop
2:      lock btsl $0, SMPLock
        jc 1b
        movl SMPStack, %esp
        // Call handle_smp
        calll _cfunc32flat_handle_smp - BUILD_BIOS_ADDR
        // Release lock and halt processor.
        movl $0, SMPLock
3:      hlt
        jmp 3b
        .code16

The related trace:

       CPU 1/KVM-1278472 [119] d..2 246654.769260: kvm_entry: vcpu 1, rip 0xfc95a
       CPU 1/KVM-1278472 [119] ...1 246654.769261: kvm_exit: vcpu 1 reason npf rip 0xfc95a info1 0x0000000100000014 info2 0x00000000000fc000 intr_info 0x00000000 error_code 0x00000000
       CPU 1/KVM-1278472 [119] ...1 246654.769262: kvm_page_fault: address fc000 error_code 14
       CPU 1/KVM-1278472 [119] d..2 246654.769262: kvm_entry: vcpu 1, rip 0xfc95a
       CPU 1/KVM-1278472 [119] ...1 246654.769263: kvm_exit: vcpu 1 reason npf rip 0xfc95a info1 0x0000000100000014 info2 0x00000000000fc000 intr_info 0x00000000 error_code 0x00000000
       CPU 1/KVM-1278472 [119] ...1 246654.769263: kvm_page_fault: address fc000 error_code 14
       CPU 1/KVM-1278472 [119] ...1 246654.769272: kvm_emulate_insn: 0:fc95a: (prot32)
       CPU 1/KVM-1278472 [119] ...1 246654.769274: kvm_emulate_insn: 0:fc95a: (prot32) failed
       CPU 1/KVM-1278472 [119] ...1 246654.769275: kvm_fpu: unload
       CPU 1/KVM-1278472 [119] ...1 246654.769275: kvm_userspace_exit: reason KVM_EXIT_INTERNAL_ERROR (17)
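
The info1 value in the npf exits above can be decoded bit by bit. A minimal sketch in C, assuming the bit layout used by KVM's PFERR_* masks (bits 0 to 4 as in a regular page-fault error code, bits 32 and 33 for the nested translation stage). The struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SVM_EXIT_NPF 0x400u

/* Error-code bits; assumed to mirror KVM's PFERR_* masks. */
#define PFERR_PRESENT     (1ULL << 0)   /* fault on a present page      */
#define PFERR_WRITE       (1ULL << 1)   /* write access                 */
#define PFERR_USER        (1ULL << 2)   /* user-mode access             */
#define PFERR_FETCH       (1ULL << 4)   /* instruction fetch            */
#define PFERR_GUEST_FINAL (1ULL << 32)  /* final guest-physical access  */
#define PFERR_GUEST_PAGE  (1ULL << 33)  /* guest page-table walk access */

struct npf_info {
    bool present, write, user, fetch, guest_final, guest_page;
};

/* Hypothetical helper: split exit_info1 of an #NPF exit into flags. */
static struct npf_info decode_npf(uint32_t exit_code, uint64_t exit_info1)
{
    assert(exit_code == SVM_EXIT_NPF);  /* only meaningful for #NPF */

    struct npf_info info = {
        .present     = (exit_info1 & PFERR_PRESENT) != 0,
        .write       = (exit_info1 & PFERR_WRITE) != 0,
        .user        = (exit_info1 & PFERR_USER) != 0,
        .fetch       = (exit_info1 & PFERR_FETCH) != 0,
        .guest_final = (exit_info1 & PFERR_GUEST_FINAL) != 0,
        .guest_page  = (exit_info1 & PFERR_GUEST_PAGE) != 0,
    };
    return info;
}
```

For exit code 0x400 with info1 0x0000000100000014, this decodes to an instruction fetch from a not-present page during the final guest-physical translation, with info2 (0xfc000) giving the faulting address.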


My early findings:

- Emulation of the instruction at EIP 0xfc95a is requested by kvm_mmu_page_fault() with (EMULTYPE_ALLOW_RETRY_PF | EMULTYPE_PF);
- __do_insn_fetch_bytes() is called from x86_decode_insn() because svm->vmcb->control.insn_len is 0 (I am not sure whether this is another #NPF erratum);
- kvm_fetch_guest_virt() returns X86EMUL_IO_NEEDED;
- Note that tools/testing/selftests/kvm/set_memory_region_test also produces "kvm_emulate_insn: ffff0000:fff0: (real) failed".

Please share your understanding of the problem, or propose a fix.

Thanks,
Like Xu
Comment 1 Maxim Levitsky 2021-07-19 10:57:28 UTC
Sadly, I know exactly why this happens, and yes, this commit is technically to blame.

But the root cause is the non-atomic memslot updates that QEMU does. I hope it will be fixed one way or another.
Comment 2 Like Xu 2021-07-29 01:57:31 UTC
Hi Maxim,

Are there any updates on this issue? Could you provide more details
about the "non-atomic memslot updates that QEMU does" so I can try to fix it?
Comment 3 Maxim Levitsky 2021-07-29 09:29:37 UTC
For all practical purposes you can just revert this commit.
The fix for the root cause is not simple, and I will work on it when I get to it.
Comment 4 Like Xu 2022-06-22 12:49:59 UTC
The issue still exists on AMD after the commit was reverted in 31c25585695a.

Just confirmed that it's caused by non-atomic accesses to memslot:
- __do_insn_fetch_bytes() from the prot32 code page #NPF;
- kvm_vm_ioctl_set_memory_region() from user space;
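
The window between the two accesses listed above can be illustrated with a toy model. This is only a sketch in plain C, not KVM code: the toy_* types and functions are hypothetical, and the GPA ranges are made up for illustration. It assumes that userspace replaces a region by first deleting the old slot (size 0) and then adding the new ones in separate calls:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 8

/* Hypothetical stand-in for a memslot: a GPA range; size 0 means unused. */
struct toy_slot {
    uint64_t base_gpa;
    uint64_t size;
};

struct toy_vm {
    struct toy_slot slots[MAX_SLOTS];
};

/* One update per call; passing size 0 deletes the slot. */
static void toy_set_region(struct toy_vm *vm, unsigned int id,
                           uint64_t base_gpa, uint64_t size)
{
    assert(id < MAX_SLOTS);
    vm->slots[id].base_gpa = base_gpa;
    vm->slots[id].size = size;
}

/* What a code fetch would do: find the slot backing a GPA, if any. */
static const struct toy_slot *toy_lookup(const struct toy_vm *vm, uint64_t gpa)
{
    for (size_t i = 0; i < MAX_SLOTS; i++) {
        const struct toy_slot *s = &vm->slots[i];

        if (s->size && gpa >= s->base_gpa && gpa - s->base_gpa < s->size)
            return s;
    }
    return NULL;  /* no memslot: the fetch cannot be satisfied */
}
```

If a slot covering 0xfc95a is deleted with toy_set_region(&vm, 0, 0, 0) and re-added only by a later call, any toy_lookup(&vm, 0xfc95a) that runs in between returns NULL. In the real system that lookup would correspond to the instruction fetch done on behalf of the vCPU.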

Consider that the expected result of the selftest test_zero_memory_regions on x86_64 is that the guest triggers an internal KVM error, because the initial code fetch hits a non-existent memslot and emulation fails.

More cases like this will gradually emerge. I'm not sure whether KVM has documentation pointing out this restriction on memslot updates (fixing only one application, QEMU, may be one-sided), or whether there is any need to add something unwise such as a gfn_to_memslot(kvm, gpa_to_gfn(cr2_or_gpa)) check in x86_emulate_instruction().

Any other suggestions?
Comment 5 mlevitsk 2022-06-22 13:00:35 UTC
On Wed, 2022-06-22 at 12:49 +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=213781
> 
> Like Xu (like.xu.linux@gmail.com) changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>      Kernel Version|5.14.0-rc1+                 |5.19.0-rc1+
> 
> --- Comment #4 from Like Xu (like.xu.linux@gmail.com) ---
> The issue still exits on the AMD after we revert the commit in 31c25585695a.
> 
> Just confirmed that it's caused by non-atomic accesses to memslot:
> - __do_insn_fetch_bytes() from the prot32 code page #NPF;
> - kvm_vm_ioctl_set_memory_region() from user space;
> 
> Considering the expected result [selftests::test_zero_memory_regions on
> x86_64]
> is that the guest will trigger an internal KVM error due to the initial code
> fetch encountering a non-existent memslot and resulting in an emulation
> failure.
> 
> More similar cases will gradually emerge. I'm not sure if KVM has
> documentation
> pointing out this restriction on memslot updates (fix one application QEMU
> may
> be one-sided), or any need to add something unwise like check
> gfn_to_memslot(kvm, gpa_to_gfn(cr2_or_gpa)) in the x86_emulate_instruction().
> 
> Any other suggestions ?
> 

Yep, agreed. This has to be fixed at both the QEMU and KVM levels: KVM needs a new API
to atomically upload a set of memslot changes (the easy part), and QEMU needs code to
batch the memslot updates when it does SMM-related memslot updates.

Best regards,
	Maxim Levitsky
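
The batching idea can be sketched with a toy model in plain C. This is not the proposed KVM API, whose shape is not specified in this thread: all toy_* names are hypothetical and the GPA ranges are made up. The sketch builds the new table off to the side from a set of changes and publishes it in a single step, so a reader sees either the old layout or the new one:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SLOTS 8

struct toy_slot   { uint64_t base_gpa, size; };        /* size 0: unused */
struct toy_table  { struct toy_slot slots[MAX_SLOTS]; };
struct toy_change { unsigned int id; uint64_t base_gpa, size; };

struct toy_vm {
    struct toy_table tables[2];
    const struct toy_table *active;  /* readers only use this pointer */
};

static void toy_vm_init(struct toy_vm *vm)
{
    memset(vm, 0, sizeof(*vm));
    vm->active = &vm->tables[0];
}

/* Apply all changes to the inactive copy, then switch to it at once. */
static void toy_apply_batch(struct toy_vm *vm,
                            const struct toy_change *changes, size_t n)
{
    struct toy_table *next = (vm->active == &vm->tables[0])
                                 ? &vm->tables[1] : &vm->tables[0];

    *next = *vm->active;
    for (size_t i = 0; i < n; i++) {
        assert(changes[i].id < MAX_SLOTS);
        next->slots[changes[i].id].base_gpa = changes[i].base_gpa;
        next->slots[changes[i].id].size = changes[i].size;
    }
    /* Single publish step; real code would need proper synchronization. */
    vm->active = next;
}

static const struct toy_slot *toy_lookup(const struct toy_table *t, uint64_t gpa)
{
    for (size_t i = 0; i < MAX_SLOTS; i++) {
        const struct toy_slot *s = &t->slots[i];

        if (s->size && gpa >= s->base_gpa && gpa - s->base_gpa < s->size)
            return s;
    }
    return NULL;
}
```

With this shape, splitting one region into two is a single toy_apply_batch() call, and a lookup of 0xfc95a succeeds against both the table that was active before the call and the one active after it.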
