# Problem

We have two hypervisors and a single guest; that guest has nested VMX enabled so it can run its own guest. Let's call them 'hypervisor', 'guest' and 'guest2'.

When the guest has no VMs running and is migrated between hypervisors, no problem occurs. When the guest has one or more VMs running and is migrated between hypervisors, it hits the panic below. We have tried different kernel and QEMU versions to mitigate this problem; the latest kernel version we have tried is 4.15.0-rc9, since that contains some commits regarding VMX. Since this seems to be a double panic, we have not yet been able to obtain a kdump (the trace was captured from a virtual serial console). The issue is easily reproducible for us, should we need to test any patches; a rough sketch of the commands we use to reproduce is at the end of this report.

# Trace

[19669.932875] ------------[ cut here ]------------
[19669.960516] kernel BUG at arch/x86/kvm/x86.c:337!
[19669.964394] invalid opcode: 0000 [#1] SMP PTI
[19669.967046] Modules linked in: vhost_net vhost tap tun nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc overlay ebtable_filter ebtables binfmt_misc nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6_tables ipt_REJECT nf_reject_ipv4 xt_pkttype xt_NFLOG nfnetlink_log nfnetlink xt_limit xt_owner xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32c_generic xt_comment xt_CHECKSUM xt_tcpudp iptable_mangle iptable_raw ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd joydev evdev serio_raw pcspkr virtio_balloon
[19670.020552] button kvm_intel kvm irqbypass fou ip6_udp_tunnel udp_tunnel ip_tunnel autofs4 hid_generic usbhid hid ext4 crc16 mbcache jbd2 dm_mod dax ata_generic virtio_net virtio_blk crc32c_intel psmouse floppy ata_piix uhci_hcd ehci_hcd virtio_pci virtio_ring virtio i2c_piix4 i2c_core usbcore usb_common libata scsi_mod
[19670.042506] CPU: 0 PID: 7874 Comm: CPU 0/KVM Not tainted 4.15.0-rc9-jessie1.0 #3
[19670.047608] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[19670.058408] RIP: 0010:kvm_spurious_fault+0x0/0x10 [kvm]
[19670.062264] RSP: 0018:ffffc9000043fce8 EFLAGS: 00010246
[19670.066647] RAX: 0000000000000000 RBX: ffff880038d30040 RCX: 0000000000000000
[19670.072687] RDX: 0000000000006820 RSI: 0000000000000292 RDI: ffff880038d30040
[19670.078061] RBP: 0000000000000001 R08: ffff8800ba5b8000 R09: 0000000000000002
[19670.084080] R10: ffffc9000043fc80 R11: 00000000000002c6 R12: ffff880038d300c8
[19670.089353] R13: 000011e3a53b8523 R14: ffff880038d342e8 R15: ffff88003a519ba8
[19670.094749] FS: 00007f3b70816700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[19670.103106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19670.107364] CR2: 00000000ffffffff CR3: 000000003a5b8000 CR4: 00000000000026f0
[19670.114301] Call Trace:
[19670.116901] nested_vmx_check_msr_switch+0x99a/0x3f24 [kvm_intel]
[19670.121842] ? vmx_interrupt_allowed+0x10/0x30 [kvm_intel]
[19670.126891] kvm_arch_vcpu_runnable+0xc6/0x120 [kvm]
[19670.132700] kvm_vcpu_check_block+0x9/0x50 [kvm]
[19670.136939] kvm_vcpu_block+0x88/0x2d0 [kvm]
[19670.140653] kvm_arch_vcpu_ioctl_run+0x14b/0x1570 [kvm]
[19670.145555] ? kvm_arch_vcpu_load+0x5d/0x230 [kvm]
[19670.150841] ? kvm_vcpu_ioctl+0x302/0x590 [kvm]
[19670.155058] kvm_vcpu_ioctl+0x302/0x590 [kvm]
[19670.158478] ? __switch_to+0x31e/0x410
[19670.162554] do_vfs_ioctl+0x86/0x5d0
[19670.166483] ? kvm_on_user_return+0x5a/0x90 [kvm]
[19670.170357] ? fire_user_return_notifiers+0x32/0x40
[19670.174606] SyS_ioctl+0x71/0x80
[19670.178397] entry_SYSCALL_64_fastpath+0x20/0x83
[19670.182934] RIP: 0033:0x7f3b7bbfa1c7
[19670.186147] RSP: 002b:00007f3b70815978 EFLAGS: 00000246
[19670.186152] Code: 00 d3 e2 f6 c2 1a 75 10 81 e2 00 01 04 00 83 fa 01 19 c0 f7 d0 83 e0 02 f3 c3 0f ff b8 03 00 00 00 c3 66 0f 1f 84 00 00 00 00 00 <0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 53 89 ff 65
[19670.209812] RIP: kvm_spurious_fault+0x0/0x10 [kvm] RSP: ffffc9000043fce8
[19670.215763] ---[ end trace 35986d140a71d28d ]---
[19670.306630] BUG: unable to handle kernel paging request at ffffffff8109d594
[19670.314077] IP: _raw_spin_lock_irqsave+0x19/0x40
[19670.318103] PGD 3ee0c067 P4D 3ee0c067 PUD 3ee0d063 PMD 3e0000e1
[19670.324073] Oops: 0003 [#2] SMP PTI
[19670.327388] Modules linked in: vhost_net vhost tap tun nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc overlay ebtable_filter ebtables binfmt_misc nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6_tables ipt_REJECT nf_reject_ipv4 xt_pkttype xt_NFLOG nfnetlink_log nfnetlink xt_limit xt_owner xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32c_generic xt_comment xt_CHECKSUM xt_tcpudp iptable_mangle iptable_raw ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd joydev evdev serio_raw pcspkr virtio_balloon
[19670.395258] button kvm_intel kvm irqbypass fou ip6_udp_tunnel udp_tunnel ip_tunnel autofs4 hid_generic usbhid hid ext4 crc16 mbcache jbd2 dm_mod dax ata_generic virtio_net virtio_blk crc32c_intel psmouse floppy ata_piix uhci_hcd ehci_hcd virtio_pci virtio_ring virtio i2c_piix4 i2c_core usbcore usb_common libata scsi_mod
[19670.422984] CPU: 0 PID: 7872 Comm: vhost-7868 Tainted: G D 4.15.0-rc9-jessie1.0 #3
[19670.432299] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[19670.443035] RIP: 0010:_raw_spin_lock_irqsave+0x19/0x40
[19670.448104] RSP: 0018:ffffc9000042fb28 EFLAGS: 00010046
[19670.452169] RAX: 0000000000000000 RBX: 0000000000000082 RCX: ffffc9000043fd28
[19670.459021] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffffff8109d594
[19670.465036] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[19670.471656] R10: ffff88003a544000 R11: 0000000000000000 R12: ffffffff8109d594
[19670.479346] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000003
[19670.485161] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[19670.493660] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19670.499017] CR2: ffffffff8109d594 CR3: 000000003a5b8000 CR4: 00000000000026f0
[19670.505389] Call Trace:
[19670.507926] ? bit_waitqueue+0x30/0x30
[19670.511152] try_to_wake_up+0x28/0x410
[19670.514859] ? handle_mm_fault+0xd8/0x1e0
[19670.519054] swake_up_locked+0x1b/0x40
[19670.522620] swake_up+0x15/0x30
[19670.525425] kvm_vcpu_wake_up+0x2e/0x40 [kvm]
[19670.529449] kvm_vcpu_kick+0xd/0x50 [kvm]
[19670.533713] __apic_accept_irq+0x1ae/0x330 [kvm]
[19670.538398] kvm_irq_delivery_to_apic_fast+0xd7/0x390 [kvm]
[19670.542872] ? copyout+0x22/0x30
[19670.546254] kvm_arch_set_irq_inatomic+0x78/0x90 [kvm]
[19670.551855] irqfd_wakeup+0xf7/0x140 [kvm]
[19670.555607] ? kvm_irq_delivery_to_apic+0x2a0/0x2a0 [kvm]
[19670.560627] __wake_up_common+0x82/0x120
[19670.565208] eventfd_signal+0x52/0x70
[19670.568249] handle_rx+0x45a/0x770 [vhost_net]
[19670.571842] vhost_worker+0xce/0x140 [vhost]
[19670.576492] ? vhost_vq_avail_empty+0xe0/0xe0 [vhost]
[19670.582125] kthread+0x108/0x140
[19670.584700] ? kthread_associate_blkcg+0xa0/0xa0
[19670.588936] ret_from_fork+0x35/0x40
[19670.592918] Code: de 8b ae ff 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 53 9c 58 66 66 90 66 90 48 89 c3 fa 66 66 90 66 66 90 31 c0 ba 01 00 00 00 <3e> 0f b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 e3 76 ae ff 66
[19670.610108] RIP: _raw_spin_lock_irqsave+0x19/0x40 RSP: ffffc9000042fb28
[19670.615849] CR2: ffffffff8109d594
[19670.619077] ---[ end trace 35986d140a71d28e ]---

# Current Kernel and QEMU versions

| Level      | Kernel     | QEMU   |
| ---------- | ---------- | ------ |
| Hypervisor | 4.9.51     | 2.6.2  |
| Guest      | 4.15.0-rc9 | 2.11.0 |
| 2nd Guest  | 4.15.0-rc9 | n.a.   |

For the guests, we have also tried kernels 4.9.77 and 4.14.14. On the hypervisor, we have also tried QEMU 2.11.0. On the first guest, we have also tried QEMU 2.1 and 2.6.2.

# QEMU command-line arguments

## Hypervisor running guest:

/usr/bin/kvm -name guest=guest,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-76-guest/master-key.aes -machine pc-i440fx-2.6,accel=kvm,usb=off,dump-guest-core=off -cpu Westmere,+vmx -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 0226cc48-599e-6cd9-b31d-00123af56abc -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-76-guest/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/guest/guest.raw,format=raw,if=none,id=drive-virtio-disk0,cache=none,throttling.bps-read=734003200,throttling.bps-write=576716800,throttling.iops-read=3500,throttling.iops-write=2500 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=59,id=hostnet0,vhost=on,vhostfd=60 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:00:11:22,bus=pci.0,addr=0x3,bootindex=2 -chardev file,id=charserial0,path=/var/log/libvirt/guest.log -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input1,bus=usb.0,port=1 -vnc 0.0.0.0:2,password -k en-us -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device ES1370,id=sound0,bus=pci.0,addr=0x4 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on

## Guest running guest2:
/usr/bin/kvm -name guest=guest2,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-guest2/master-key.aes -machine pc-i440fx-2.11,accel=kvm,usb=off,dump-guest-core=off -cpu Westmere,vme=on,pclmuldq=on,vmx=on,x2apic=on,tsc-deadline=on,hypervisor=on,arat=on,tsc_adjust=on,svm=off -m 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid c95b7009-5c60-41bc-992c-af2443e3ce9c -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-guest2/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/guest2.img,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:33:66:99,bus=pci.0,addr=0x5 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 0.0.0.0:0,password -k en-us -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on

# CPU info

## Hypervisor /proc/cpuinfo

processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
stepping : 2
microcode : 0x15
cpu MHz : 2660.062
cache size : 12288 KB
physical id : 0
siblings : 12
core id : 10
cpu cores : 6
apicid : 21
initial apicid : 21
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
bugs :
bogomips : 5319.98
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

## Guest /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)
stepping : 1
microcode : 0x1
cpu MHz : 2659.998
cache size : 4096 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc rep_good nopl cpuid pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm pti retpoline tpr_shadow vnmi flexpriority ept vpid arat
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 5319.99
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

## Guest2 /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)
stepping : 1
microcode : 0x1
cpu MHz : 2660.028
cache size : 16384 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm cpuid_fault pti retpoline tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 5320.05
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
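For completeness, migration on the hypervisors is driven through libvirt. The sketch below shows roughly how we reproduce the crash; the destination host name is a placeholder and the exact virsh options may differ slightly from what our tooling uses:

```sh
# Inside the L1 guest: make sure at least one nested VM is running
# (with no nested VM running, the migration completes without problems).
virsh start guest2

# On the source hypervisor: live-migrate the L1 guest to the other
# hypervisor ('hypervisor2' is a placeholder for the destination host).
virsh migrate --live --verbose guest qemu+ssh://hypervisor2/system

# Shortly after the migration finishes, the L1 guest hits the panic
# shown in the trace above on its virtual serial console.
```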
There is no support for nVMX migration yet. So trying to migrate a guest while it is using VMX itself is expected to fail.

Some people are currently working on this. Especially to
- Migrate the nVMX state
- Properly add and indicate VMX features in the CPU model to guarantee that no VMX features will be lost during migration.
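Until that support exists, the practical way to avoid the crash is to make sure the L1 guest is not actively using VMX at the moment it gets migrated. A rough sketch (using the standard kvm_intel "nested" module parameter; adjust the domain name to your setup):

```sh
# Inside the L1 guest: stop (or avoid running) nested VMs before the L1
# itself is migrated; per the report above, migration works fine while
# no L2 is running.
virsh shutdown guest2

# Alternatively, keep nested VMX disabled in the L1 guest altogether.
# Check whether nesting is currently enabled:
cat /sys/module/kvm_intel/parameters/nested

# Reload kvm_intel with nesting off (only possible while no VM is using
# KVM inside the L1 guest):
modprobe -r kvm_intel
modprobe kvm_intel nested=0
```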
Taking the liberty to cross-reference a related discussion on the libvirt mailing list: https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
(In reply to David Hildenbrand from comment #1)
> There is no support for nVMX migration yet. So trying to migrate a guest
> while it is using VMX itself is expected to fail.
>
> Some people are currently working on this. Especially to
> - Migrate the nVMX state
> - Properly add and indicate VMX features in the CPU model to guarantee that
> no VMX features will be lost during migration.

Hmm, interesting.

FWIW, I was able to successfully migrate L2 guest to the destination L1 guest (hypervisor).

My full QEMU command-lines for both L1s and L2 are here:
https://kashyapc.fedorapeople.org/virt/Migrate-a-nested-guest-08Feb2018.txt

With the below versions:

L0:
$ uname -r; qemu-system-x86_64 --version; rpm -q libvirt-daemon-kvm
4.13.13-300.fc27.x86_64+debug
QEMU emulator version 2.11.0(qemu-2.11.0-4.fc27)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
libvirt-daemon-kvm-3.10.0-2.fc27.x86_64

On both source and destination L1s:
$ uname -r; qemu-system-x86_64 --version; rpm -q libvirt-daemon-kvm
4.14.16-200.fc26.x86_64
QEMU emulator version 2.10.0(qemu-2.10.0-4.fc26)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
libvirt-daemon-kvm-3.7.0-2.fc26.x86_64
(In reply to Kashyap Chamarthy from comment #3)
> (In reply to David Hildenbrand from comment #1)
> > There is no support for nVMX migration yet. So trying to migrate a guest
> > while it is using VMX itself is expected to fail.
> >
> > Some people are currently working on this. Especially to
> > - Migrate the nVMX state
> > - Properly add and indicate VMX features in the CPU model to guarantee that
> > no VMX features will be lost during migration.
>
> Hmm, interesting.
>
> FWIW, I was able to successfully migrate L2 guest to the destination L1
> guest (hypervisor).

Ah, hang on -- I should actually double-read the original bug description. The original bug description is talking about migrating the *L1* (while L2 is running on it) to a destination L0 hypervisor.

So I admit: I haven't yet migrated the L1 guest while L2 is running on it. (My earlier comment only talks about migrating the L2 guest to a destination L1.)
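To spell out the distinction in terms of commands (the host and domain names here are made up, not the ones from my linked setup):

```sh
# What I tested and what worked: migrating the L2 guest between the two
# L1 hypervisors (run on the source L1):
virsh migrate --live guest2 qemu+ssh://destination-l1/system

# What this bug report is about: migrating the L1 guest itself, while
# guest2 (L2) is still running inside it (run on the source L0):
virsh migrate --live guest qemu+ssh://destination-l0/system
```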
Some documentation on the limitations associated with nested guests can now be found here:

https://www.linux-kvm.org/page/Nested_Guests
https://www.linux-kvm.org/page/Migration#Problems_.2F_Todo
Related: bug 53851. Actually, this bug depends on 53851, but it seems like only the original reporter can add dependencies.