Hello,

while I was playing around with KVM and trying nested virtual machines, I got an Oops on the hardware machine (L0). I ran

$ qemu-system-i386 -enable-kvm -virtfs local,path=.,security_model=none,mount_tag=hostfs -cpu host /mnt/extras/src/qemu-image-autopkgtest2

and inside the machine, I ran a FreeDOS install image residing in the current directory (i.e. through the virtfs mount). The image is running a 5.2-rc4 kernel; note that when I run a 4.19 kernel as the L1 guest it seems to work.

It crashed very early, before the nested system printed anything to the screen. The error on L0 was:

[ 505.814203] BUG: unable to handle kernel NULL pointer dereference at 00000000
[ 505.814208] #PF error: [WRITE]
[ 505.814209] *pdpt = 0000000015f1f001 *pde = 0000000000000000
[ 505.814212] Oops: 0002 [#1] SMP NOPTI
[ 505.814216] CPU: 1 PID: 2292 Comm: qemu-system-i38 Tainted: P O 5.1.0-bughunt+ #2
[ 505.814217] Hardware name: System manufacturer System Product Name/M4N68T-M, BIOS 1301 07/05/2011
[ 505.814234] EIP: kvm_mmu_load+0x292/0x4c0 [kvm]
[ 505.814236] Code: 55 e8 e8 d1 f0 ff ff 8b 48 20 ff 40 28 8b 07 81 c1 00 00 00 40 c6 00 00 0f 1f 00 8b 87 68 02 00 00 0b 4d dc 8b 80 88 00 00 00 <89> 0c 30 c7 44 30 04 00 00 00 00 e9 6b ff ff ff 8d b6 00 00 00 00
[ 505.814238] EAX: 00000000 EBX: 00000000 ECX: 1267a001 EDX: d30c7d6c
[ 505.814239] ESI: 00000000 EDI: d2538000 EBP: d30c7dd0 ESP: d30c7d9c
[ 505.814241] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210202
[ 505.814242] CR0: 80050033 CR2: 00000000 CR3: 223e2e40 CR4: 000006f0
[ 505.814243] Call Trace:
[ 505.814256] kvm_arch_vcpu_ioctl_run+0xc87/0x1910 [kvm]
[ 505.814260] ? _copy_to_user+0x21/0x30
[ 505.814264] ? tomoyo_path_number_perm+0x5f/0x200
[ 505.814274] kvm_vcpu_ioctl+0x214/0x580 [kvm]
[ 505.814284] ? __bpf_trace_kvm_async_pf_nopresent_ready+0x30/0x30 [kvm]
[ 505.814287] do_vfs_ioctl+0x91/0x6b0
[ 505.814290] ? __audit_syscall_entry+0xb8/0x100
[ 505.814292] ? syscall_trace_enter+0x1e1/0x240
[ 505.814294] ? tomoyo_file_ioctl+0x19/0x20
[ 505.814296] ? security_file_ioctl+0x2a/0x40
[ 505.814298] ksys_ioctl+0x60/0x90
[ 505.814300] sys_ioctl+0x16/0x20
[ 505.814302] do_fast_syscall_32+0x91/0x17c
[ 505.814304] entry_SYSENTER_32+0x6b/0xbe
[ 505.814306] EIP: 0xb7f8b83d
[ 505.814307] Code: 54 cd ff ff 8b 98 58 cd ff ff 85 d2 89 c8 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
[ 505.814308] EAX: ffffffda EBX: 0000000e ECX: 0000ae80 EDX: 00000000
[ 505.814309] ESI: 0224ead0 EDI: 00000000 EBP: b50f6000 ESP: b31bbc98
[ 505.814311] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
[ 505.814313] ? nmi+0x8b/0x190
[ 505.814314] Modules linked in: snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device cpufreq_powersave cpufreq_userspace cpufreq_conservative nvidia_drm(PO) drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt nvidia_modeset(PO) nvidia(PO) binfmt_misc fuse snd_hda_codec_via snd_hda_codec_hdmi snd_hda_codec_generic nls_iso8859_2 nls_cp437 vfat kvm_amd snd_hda_intel fat kvm snd_hda_codec snd_hda_core snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd ohci_pci irqbypass ohci_hcd soundcore k10temp ehci_pci ehci_hcd forcedeth i2c_nforce2 sr_mod sata_nv cdrom sg asus_atk0110 pcc_cpufreq pcspkr acpi_cpufreq button ipmi_devintf ipmi_msghandler usblp usbcore parport_pc ppdev lp parport ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod psmouse evdev serio_raw ata_generic pata_amd libata scsi_mod
[ 505.814341] CR2: 0000000000000000
[ 505.814343] ---[ end trace f9a592688c8617bc ]---
[ 505.814354] EIP: kvm_mmu_load+0x292/0x4c0 [kvm]
[ 505.814355] Code: 55 e8 e8 d1 f0 ff ff 8b 48 20 ff 40 28 8b 07 81 c1 00 00 00 40 c6 00 00 0f 1f 00 8b 87 68 02 00 00 0b 4d dc 8b 80 88 00 00 00 <89> 0c 30 c7 44 30 04 00 00 00 00 e9 6b ff ff ff 8d b6 00 00 00 00
[ 505.814357] EAX: 00000000 EBX: 00000000 ECX: 1267a001 EDX: d30c7d6c
[ 505.814358] ESI: 00000000 EDI: d2538000 EBP: d30c7dd0 ESP: d6a0d3bc
[ 505.814359] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210202
[ 505.814360] CR0: 80050033 CR2: 00000000 CR3: 223e2e40 CR4: 000006f0

The processor on L0 is Athlon II X2 240.
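For anyone reading the trace: the `Oops: 0002` value is the x86 page-fault error code, and its low bits say what kind of access faulted. The snippet below is only a minimal sketch of that decoding; the function name `decode_pf_error_code` is made up for illustration and is not a kernel or QEMU API.

```python
def decode_pf_error_code(code: int) -> dict:
    """Decode the low five bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x01 else "not-present page",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x02 else "read",
        # bit 2: 0 = supervisor (kernel) mode, 1 = user mode
        "mode": "user" if code & 0x04 else "supervisor",
        # bit 3: a reserved bit was set in a paging-structure entry
        "reserved_bit_set": bool(code & 0x08),
        # bit 4: the fault was caused by an instruction fetch
        "instruction_fetch": bool(code & 0x10),
    }


if __name__ == "__main__":
    # The value reported in the Oops line of the trace.
    print(decode_pf_error_code(0x0002))
```

For 0x0002 this gives a supervisor-mode write to a not-present page, which matches the `[WRITE]` annotation and the all-zero CR2 in the trace.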
Created attachment 283319 [details] dmesg from L0
A patch for this is on its way to Linus.
Good! Could you point me to the patch please?
Sure: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b6b80c78af838bef17501416d5d383fedab0010a
Oh... I'm sorry I didn't mention this, but I already have this patch applied. So there must be something else.
Looking at it more closely, it is the same problem as bug 203845 after all, only in a different scenario that the original patch does not help with. Any ideas?
Created attachment 283393 [details]
Patch that fixes this problem on my system

So, I had a look around the code and found that SVM initializes the nested vcpus in such a way that ->arch.mmu points to ->arch.guest_mmu. The code in mmu.c then uses ->arch.mmu->pae_root, which is not allocated for guest_mmu, and crashes.

This patch really takes the path of least resistance. If they want to have pae_root allocated even for guest_mmu, let them have it and just allocate it. Maybe if this is specific to AMD the whole business should be in svm.c though? Or do it lazily, only when actually doing the nesting?

The patch fixes the 5.1 kernel on my machine: the KVM guest starts, and the nested guest starts as well. However, in 5.2 there will probably be more problems ahead, because I got a different error there (kvm_spurious_fault in L1).

What are your thoughts on this?
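To make the failure mode described above easier to follow, here is a toy model of it. This is not kernel code: the classes `Mmu` and `Vcpu` and the function `load_pae_roots` are hypothetical names, and the assumption that only the root MMU starts out with a `pae_root` page is taken from the description in this comment, not checked against the kernel source.

```python
from dataclasses import dataclass
from typing import List, Optional

PAE_ROOT_ENTRIES = 4  # PAE paging uses four page-directory-pointer entries


@dataclass
class Mmu:
    name: str
    pae_root: Optional[List[int]] = None  # None stands in for a NULL pointer


class Vcpu:
    def __init__(self, allocate_guest_pae_root: bool) -> None:
        # Assumption: the root MMU always gets its pae_root page.
        self.root_mmu = Mmu("root_mmu", [0] * PAE_ROOT_ENTRIES)
        # The guest MMU only gets one if we model the proposed fix.
        guest_root = [0] * PAE_ROOT_ENTRIES if allocate_guest_pae_root else None
        self.guest_mmu = Mmu("guest_mmu", guest_root)
        self.mmu = self.root_mmu

    def enter_nested(self) -> None:
        # Models ->arch.mmu being switched to ->arch.guest_mmu.
        self.mmu = self.guest_mmu


def load_pae_roots(vcpu: Vcpu, roots: List[int]) -> None:
    """Write root entries through vcpu.mmu, as the MMU load path would."""
    if vcpu.mmu.pae_root is None:
        # In the kernel this is the NULL write that produces the Oops.
        raise RuntimeError(f"{vcpu.mmu.name}.pae_root is not allocated")
    for i, root in enumerate(roots[:PAE_ROOT_ENTRIES]):
        vcpu.mmu.pae_root[i] = root


if __name__ == "__main__":
    broken = Vcpu(allocate_guest_pae_root=False)
    broken.enter_nested()
    try:
        load_pae_roots(broken, [1, 2, 3, 4])
    except RuntimeError as err:
        print("without the patch:", err)

    fixed = Vcpu(allocate_guest_pae_root=True)
    fixed.enter_nested()
    load_pae_roots(fixed, [1, 2, 3, 4])
    print("with the patch:", fixed.guest_mmu.pae_root)
```

Allocating the page eagerly for both MMUs corresponds to the "just allocate it" approach of the attached patch; the lazy alternative mentioned above would instead allocate inside `enter_nested`.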
The second patch was committed as v5.4-rc1~138^2~6.

I found this while staring at a similar-looking kvm_mmu_load NULL dereference on the hardware kernel while starting a nested VM on an AMD Ryzen 7 1800X, kernel 5.4.28. Should I try to expand this into a full report, or does your original recipe still reproduce?

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
CPU: 5 PID: 1994 Comm: CPU 7/KVM Tainted: P OE 5.4.28 #1-NixOS
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350M Pro4, BIOS P5.90 07/03/2019
RIP: 0010:kvm_mmu_load+0x2e6/0x5b0 [kvm]
Code: 2b 0d 46 c7 0c fa 83 40 50 01 49 8b 3f c6 07 00 0f 1f 40 00 49 8b 87 68 03 00 00 48 01 ca 48 0b 54 24 08 48 8b 80 b8 00 00 00 <4a> 89 14 30 e9 57 ff ff ff 48 c1 e8 0c 4c 89 ff 48 89 c6 49 89 c5
RSP: 0018:ffffbc2883aefcc8 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00006355c0000000
RDX: 00000007b954a027 RSI: ffffbc2883aefc68 RDI: ffffbc2883ab1000
RBP: 0000000000000000 R08: ffffbc2883ab1000 R09: ffffbc2883aefbf0
R10: ffffbc2883aefc68 R11: ffff9cb05e950008 R12: 0000000000000000
R13: 00000000000290a3 R14: 0000000000000000 R15: ffff9cb15c0c38f0
FS: 0000000000000000(0000) GS:ffff9cb1be940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000079fda4000 CR4: 00000000003406e0
Call Trace:
 kvm_arch_vcpu_ioctl_run+0xfe4/0x1d60 [kvm]
 ? _copy_to_user+0x28/0x30
 ? kvm_vm_ioctl+0x7ab/0x8e0 [kvm]
 kvm_vcpu_ioctl+0x215/0x5c0 [kvm]
 ? __seccomp_filter+0x7b/0x670
 do_vfs_ioctl+0x3fe/0x660
 ksys_ioctl+0x5e/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4e/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4a959ba147
Code: 00 00 90 48 8b 05 39 9d 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 9d 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f4a79ffa508 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f4a959ba147
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000022
RBP: 00005563acc09d00 R08: 00005563aab57b90 R09: 00000000000000ff
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 00005563ab1ba980 R14: 0000000000000001 R15: 0000000000000000
Modules linked in: fuse vhost_net vhost ip6table_mangle ebtable_filter ebtables iptable_mangle xt_CHECKSUM xt_comment xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter msr ip6table_nat iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter sch_fq_codel nls_iso8859_1 nls_cp437 vfat snd_hda_codec_hdmi fat snd_hda_codec_realtek wmi_bmof nvidia_drm(POE) snd_hda_codec_generic ledtrig_audio drm_kms_helper edac_mce_amd nvidia_modeset(POE) snd_hda_intel edac_core snd_intel_nhlt nvidia_uvm(OE) drm joydev snd_hda_codec evdev mousedev mac_hid deflate efi_pstore crct10dif_pclmul pstore agpgart sp5100_tco snd_hda_core crc32_pclmul fb_sys_fops watchdog syscopyarea efivars sysfillrect snd_hwdep ghash_clmulni_intel i2c_piix4 sysimgblt k10temp gpio_amdpt pinctrl_amd gpio_generic wmi button acpi_cpufreq nvidia(POE) ipmi_devintf ipmi_msghandler i2c_core snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore atkbd libps2 serio loop cpufreq_ondemand tap macvlan bridge stp llc tun efivarfs ip_tables x_tables ipv6 nf_defrag_ipv6 crc_ccitt autofs4 dm_crypt algif_skcipher af_alg input_leds led_class sd_mod hid_generic usbhid hid xhci_pci ahci xhci_hcd libahci libata usbcore aesni_intel scsi_mod crypto_simd cryptd glue_helper usb_common rtc_cmos af_packet dm_mod btrfs libcrc32c crc32c_generic crc32c_intel xor zstd_decompress zstd_compress raid6_pq kvm_amd kvm irqbypass r8169 realtek libphy
CR2: 0000000000000000
---[ end trace d6db99b9073bce58 ]---
RIP: 0010:kvm_mmu_load+0x2e6/0x5b0 [kvm]
Code: 2b 0d 46 c7 0c fa 83 40 50 01 49 8b 3f c6 07 00 0f 1f 40 00 49 8b 87 68 03 00 00 48 01 ca 48 0b 54 24 08 48 8b 80 b8 00 00 00 <4a> 89 14 30 e9 57 ff ff ff 48 c1 e8 0c 4c 89 ff 48 89 c6 49 89 c5
RSP: 0018:ffffbc2883aefcc8 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00006355c0000000
RDX: 00000007b954a027 RSI: ffffbc2883aefc68 RDI: ffffbc2883ab1000
RBP: 0000000000000000 R08: ffffbc2883ab1000 R09: ffffbc2883aefbf0
R10: ffffbc2883aefc68 R11: ffff9cb05e950008 R12: 0000000000000000
R13: 00000000000290a3 R14: 0000000000000000 R15: ffff9cb15c0c38f0
FS: 0000000000000000(0000) GS:ffff9cb1be940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000079fda4000 CR4: 00000000003406e0
(In reply to Anders Kaseorg from comment #8)
> The second patch was committed as v5.4-rc1~138^2~6.
>
> I found this while staring at a similar-looking kvm_mmu_load NULL
> dereference on the hardware kernel while starting a nested VM on an AMD
> Ryzen 7 1800X, kernel 5.4.28. Should I try to expand this into a full
> report, or does your original recipe still reproduce?

Hi. I was going to say your problem was likely to be something else, because that original problem got fixed, and anyway all of my problems with KVM were specifically related to 32-bit. However, I then tried it for myself and it is still broken, albeit with a different error message. Oh well...