Hi,

This issue is an upstream bug and is very easy to reproduce on AMD platforms. It was first introduced by commit e72436bc3a5206f95bb384e741154166ddb3202e. QEMU reports the following internal error:

KVM internal error. Suberror: 1
emulation failure
EAX=000f38b3 EBX=00000000 ECX=000002ff EDX=00000001
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00006d88
EIP=000fc95a EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008300 DPL=0 TSS16-busy
GDT=     000f50a0 00000037
IDT=     000f50de 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=34 41 0f 00 e8 5b 26 ff ff c7 05 38 41 0f 00 00 00 00 00 f4 <eb> fd fa fc 66 b8 00 c2 00 00 8e d8 8e d0 66 bc 58 f8 00 00 e9 07 f9 66 54 66 0f b7 e4 66

At the time of the failure, dump_vmcb() says:

[47175.214140] SVM: VMCB 00000000a4006788, last attempted VMRUN on CPU 81
[47175.215862] SVM: VMCB Control Area:
[47175.216155] SVM: cr_read:             0010
[47175.216426] SVM: cr_write:            0110
[47175.216699] SVM: dr_read:             00ff
[47175.216939] SVM: dr_write:            00ff
[47175.217170] SVM: exceptions:          00060042
[47175.217400] SVM: intercepts:          bc4c8027 0000624f
[47175.217651] SVM: pause filter count:  0
[47175.217879] SVM: pause filter threshold: 0
[47175.218107] SVM: iopm_base_pa:        0000000194674000
[47175.218342] SVM: msrpm_base_pa:       00000040857d4000
[47175.218589] SVM: tsc_offset:          ffff92710e0ed2c0
[47175.218823] SVM: asid:                1
[47175.219052] SVM: tlb_ctl:             0
[47175.219280] SVM: int_ctl:             03000200
[47175.219522] SVM: int_vector:          00000000
[47175.219753] SVM: int_state:           00000000
[47175.219981] SVM: exit_code:           00000400
[47175.220208] SVM: exit_info1:          0000000100000014
[47175.220441] SVM: exit_info2:          00000000000fc000
[47175.220684] SVM: exit_int_info:       00000000
[47175.220913] SVM: exit_int_info_err:   00000000
[47175.221140] SVM: nested_ctl:          1
[47175.221363] SVM: nested_cr3:          0000004184ca8000
[47175.221598] SVM: avic_vapic_bar:      0000000000000000
[47175.221823] SVM: ghcb:                0000000000000000
[47175.222047] SVM: event_inj:           00000000
[47175.222272] SVM: event_inj_err:       00000000
[47175.222497] SVM: virt_ext:            2
[47175.222739] SVM: next_rip:            0000000000000000
[47175.222968] SVM: avic_backing_page:   0000000000000000
[47175.223198] SVM: avic_logical_id:     0000000000000000
[47175.223425] SVM: avic_physical_id:    0000000000000000
[47175.223665] SVM: vmsa_pa:             0000000000000000
[47175.223885] SVM: VMCB State Save Area:
[47175.224105] SVM: es:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.224342] SVM: cs:   s: 0008 a: 049b l: ffffffff b: 0000000000000000
[47175.224588] SVM: ss:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.224817] SVM: ds:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.225043] SVM: fs:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.225266] SVM: gs:   s: 0010 a: 0c93 l: ffffffff b: 0000000000000000
[47175.225486] SVM: gdtr: s: 0000 a: 0000 l: 00000037 b: 00000000000f50a0
[47175.225720] SVM: ldtr: s: 0000 a: 0082 l: 0000ffff b: 0000000000000000
[47175.225939] SVM: idtr: s: 0000 a: 0000 l: 00000000 b: 00000000000f50de
[47175.226156] SVM: tr:   s: 0000 a: 0083 l: 0000ffff b: 0000000000000000
[47175.226445] SVM: cpl:            0                efer:          0000000000001000
[47175.226682] SVM: cr0:            0000000000000011 cr2:           0000000000000000
[47175.226900] SVM: cr3:            0000000000000000 cr4:           0000000000000000
[47175.227112] SVM: dr6:            00000000ffff0ff0 dr7:           0000000000000400
[47175.227327] SVM: rip:            00000000000fc95a rflags:        0000000000000002
[47175.227554] SVM: rsp:            0000000000006d88 rax:           00000000000f38b3
[47175.227768] SVM: star:           0000000000000000 lstar:         0000000000000000
[47175.227983] SVM: cstar:          0000000000000000 sfmask:        0000000000000000
[47175.228198] SVM: kernel_gs_base: 0000000000000000 sysenter_cs:   0000000000000000
[47175.228413] SVM: sysenter_esp:   0000000000000000 sysenter_eip:  0000000000000000
[47175.228641] SVM: gpat:           0007040600070406 dbgctl:        0000000000000000
[47175.228859] SVM: br_from:        0000000000000000 br_to:         0000000000000000
[47175.229076] SVM: excp_from:      0000000000000000 excp_to:       0000000000000000

You may also need the relevant part of the target BIOS code:

   fc940:       00 00
   fc942:       72 f3                   jb     fc937 <entry_smp+0xb>
   fc944:       8b 25 34 41 0f 00       mov    0xf4134,%esp
   fc94a:       e8 5b 26 ff ff          call   eefaa <handle_smp>
   fc94f:       c7 05 38 41 0f 00 00    movl   $0x0,0xf4138
   fc956:       00 00 00
   fc959:       f4                      hlt
   fc95a:       eb fd                   jmp    fc959 <entry_smp+0x2d>
   fc95c:       fa                      cli
   fc95d:       fc                      cld
   fc95e:       66 b8 00 c2             mov    $0xc200,%ax
   fc962:       00 00                   add    %al,(%eax)
   fc964:       8e d8                   mov    %eax,%ds
   fc966:       8e d0                   mov    %eax,%ss
   fc968:       66 bc 58 f8             mov    $0xf858,%sp
   fc96c:       00 00                   add    %al,(%eax)
   fc96e:       e9 07 f9 66 54          jmp    5476c27a <code32flat_end+0x5466c27a>
   fc973:       66 0f b7 e4             movzww %sp,%sp
   fc977:       66 9c                   pushfw
   fc979:       fa                      cli
   fc97a:       fc                      cld

Or the corresponding code from SeaBIOS:

// Entry point for QEMU smp sipi interrupts.
        DECLFUNC entry_smp
entry_smp:
        // Transition to 32bit mode.
        cli
        cld
        movl $2f + BUILD_BIOS_ADDR, %edx
        jmp transition32_nmi_off
        .code32
        // Acquire lock and take ownership of shared stack
1:      rep ; nop
2:      lock btsl $0, SMPLock
        jc 1b
        movl SMPStack, %esp
        // Call handle_smp
        calll _cfunc32flat_handle_smp - BUILD_BIOS_ADDR
        // Release lock and halt processor.
        movl $0, SMPLock
3:      hlt
        jmp 3b
        .code16

The related trace:

 CPU 1/KVM-1278472 [119] d..2 246654.769260: kvm_entry: vcpu 1, rip 0xfc95a
 CPU 1/KVM-1278472 [119] ...1 246654.769261: kvm_exit: vcpu 1 reason npf rip 0xfc95a info1 0x0000000100000014 info2 0x00000000000fc000 intr_info 0x00000000 error_code 0x00000000
 CPU 1/KVM-1278472 [119] ...1 246654.769262: kvm_page_fault: address fc000 error_code 14
 CPU 1/KVM-1278472 [119] d..2 246654.769262: kvm_entry: vcpu 1, rip 0xfc95a
 CPU 1/KVM-1278472 [119] ...1 246654.769263: kvm_exit: vcpu 1 reason npf rip 0xfc95a info1 0x0000000100000014 info2 0x00000000000fc000 intr_info 0x00000000 error_code 0x00000000
 CPU 1/KVM-1278472 [119] ...1 246654.769263: kvm_page_fault: address fc000 error_code 14
 CPU 1/KVM-1278472 [119] ...1 246654.769272: kvm_emulate_insn: 0:fc95a: (prot32)
 CPU 1/KVM-1278472 [119] ...1 246654.769274: kvm_emulate_insn: 0:fc95a: (prot32) failed
 CPU 1/KVM-1278472 [119] ...1 246654.769275: kvm_fpu: unload
 CPU 1/KVM-1278472 [119] ...1 246654.769275: kvm_userspace_exit: reason KVM_EXIT_INTERNAL_ERROR (17)

My early explorations:
- Instruction emulation at EIP 0xfc95a is triggered with (EMULTYPE_ALLOW_RETRY_PF | EMULTYPE_PF) from kvm_mmu_page_fault();
- __do_insn_fetch_bytes() is called from x86_decode_insn() because svm->vmcb->control.insn_len is 0 (not sure if this is another #NPF erratum);
- kvm_fetch_guest_virt() returns X86EMUL_IO_NEEDED for the code fetch;
- Note that tools/testing/selftests/kvm/set_memory_region_test produces a similar "kvm_emulate_insn: ffff0000:fff0: (real) failed" trace (see the sketch after this list).
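
For what it's worth, the kind of racing pattern that selftest exercises (delete a memslot while a vCPU is executing out of it) can be written as plain userspace code. The following is my own rough, untested sketch, not the selftest itself: the slot layout, MEM_GPA, the "jmp $" payload and the missing error handling are all simplifications of mine. When the slot deletion wins the race against a code fetch, it should end in the same KVM_EXIT_INTERNAL_ERROR emulation failure as above:

/*
 * Rough, untested sketch: one thread runs a vCPU that spins on "jmp $"
 * while another thread repeatedly deletes and re-creates the memslot
 * backing that code.  If the deletion window overlaps a code fetch,
 * KVM_RUN should eventually return KVM_EXIT_INTERNAL_ERROR with an
 * emulation failure, as in the report above.
 * Build with: gcc -O2 -pthread repro.c -o repro
 */
#include <stddef.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

#define MEM_GPA  0x1000UL
#define MEM_SIZE 0x1000UL

static int vm_fd;
static unsigned char *mem;

static void *flip_memslot(void *arg)
{
	struct kvm_userspace_memory_region r = {
		.slot = 0,
		.guest_phys_addr = MEM_GPA,
		.userspace_addr = (unsigned long)mem,
	};
	(void)arg;

	for (;;) {
		r.memory_size = 0;		/* delete the slot ...          */
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);
		r.memory_size = MEM_SIZE;	/* ... and immediately restore  */
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);
	}
	return NULL;
}

int main(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);

	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

	mem = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	mem[0] = 0xeb;				/* "eb fe": jmp $               */
	mem[1] = 0xfe;

	struct kvm_userspace_memory_region r = {
		.slot = 0,
		.guest_phys_addr = MEM_GPA,
		.memory_size = MEM_SIZE,
		.userspace_addr = (unsigned long)mem,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

	int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
	int run_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
	struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
				   MAP_SHARED, vcpu_fd, 0);

	/* Leave the vCPU in real mode, with CS:IP pointing at the loop. */
	struct kvm_sregs sregs;
	ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
	sregs.cs.selector = MEM_GPA >> 4;
	sregs.cs.base = MEM_GPA;
	ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);

	struct kvm_regs regs = { .rip = 0, .rflags = 2 };
	ioctl(vcpu_fd, KVM_SET_REGS, &regs);

	pthread_t flipper;
	pthread_create(&flipper, NULL, flip_memslot, NULL);

	for (;;) {
		ioctl(vcpu_fd, KVM_RUN, 0);
		if (run->exit_reason == KVM_EXIT_INTERNAL_ERROR)
			return 1;		/* reproduced the failure       */
	}
}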
Please share your understanding with me, or propose a fix if you have one.

Thanks,
Like Xu
Sadly, I know exactly why this happens, and yes, this commit is technically to blame. But the root cause is the non-atomic memslot updates that QEMU does. I hope it will be fixed one way or another.
Hi Maxim,

Do we have any updates on this issue? Could you provide more details about the "non-atomic memslot updates that QEMU does" so I can try to fix it?
For all practical purposes you can just revert this commit. The fix for the root cause is not simple, and I will work on it when I get to it.
The issue still exists on AMD after reverting the commit in 31c25585695a.

Just confirmed that it is caused by non-atomic accesses to the memslot, racing between:
- __do_insn_fetch_bytes() handling the #NPF on the prot32 code page;
- kvm_vm_ioctl_set_memory_region() from user space.

Note that the expected result of selftests::test_zero_memory_regions on x86_64 is already that the guest triggers an internal KVM error because the initial code fetch hits a non-existent memslot and the emulation fails, so more cases like this will gradually emerge.

I'm not sure whether KVM has documentation pointing out this restriction on memslot updates (fixing only one application, QEMU, may be one-sided), or whether we need to add something unwise like a gfn_to_memslot(kvm, gpa_to_gfn(cr2_or_gpa)) check in x86_emulate_instruction(); see the sketch below.

Any other suggestions?
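
To make the "unwise" option concrete, the check I have in mind would sit on the EMULTYPE_PF path in x86_emulate_instruction() and look roughly like the fragment below. This is only an illustration against my reading of the current code, not a tested patch; what to actually return when the slot is gone is left open, and it merely narrows the race window rather than closing it:

	/*
	 * Illustration only, not a proposed fix: if the faulting GPA no
	 * longer has a memslot behind it (e.g. a concurrent
	 * KVM_SET_USER_MEMORY_REGION deleted it), don't let the failed
	 * code fetch turn into an emulation failure / internal error.
	 */
	if ((emulation_type & EMULTYPE_PF) &&
	    !gfn_to_memslot(vcpu->kvm, gpa_to_gfn(cr2_or_gpa))) {
		/* What to do here is exactly the open question. */
		...
	}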
On Wed, 2022-06-22 at 12:49 +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=213781
>
> I'm not sure whether KVM has documentation pointing out this restriction
> on memslot updates (fixing only one application, QEMU, may be one-sided),
> or whether we need to add something unwise like a gfn_to_memslot(kvm,
> gpa_to_gfn(cr2_or_gpa)) check in x86_emulate_instruction().
>
> Any other suggestions?

Yep, agree. This has to be fixed at both the qemu and kvm levels: kvm needs a new API to atomically upload a set of memslot changes (the easy part), and qemu needs code to batch the memslot updates it performs around SMM-related memslot changes.

Best regards,
	Maxim Levitsky
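
A purely hypothetical sketch of what such a batched memslot API could look like is below. Nothing like this exists in KVM's uapi at the time of this thread; the struct layout, the name and the ioctl number are invented here only to make the idea concrete:

/*
 * Hypothetical uapi sketch of a batched memslot update along the lines
 * described above -- invented for illustration, not an existing or
 * proposed interface.
 */
struct kvm_userspace_memory_region_batch {
	__u32 nent;	/* number of entries that follow */
	__u32 flags;	/* reserved, must be zero */
	struct kvm_userspace_memory_region entries[];
};

/*
 * Hypothetical ioctl: apply all entries as one atomic transaction, so a
 * running vCPU never observes the intermediate "slot deleted" state
 * while userspace rearranges memslots.
 */
#define KVM_SET_USER_MEMORY_REGION_BATCH \
	_IOW(KVMIO, 0xff, struct kvm_userspace_memory_region_batch)

With something along these lines, QEMU's SMM-related delete-then-recreate sequence could be submitted as a single batch, and a concurrent #NPF-triggered instruction fetch would always see either the old or the new memslot layout.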