Bug 198195 - RSM: set CR3.PCID only after CR4.PCIDE (exposed by guest 10af6235e0d3)
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Paolo Bonzini
Depends on:
Reported: 2017-12-18 20:53 UTC by Laszlo Ersek
Modified: 2018-01-02 13:46 UTC

See Also:
Kernel Version: v4.14.5
Tree: Mainline
Regression: No

tentative patch (2.93 KB, patch)
2017-12-19 18:20 UTC, Paolo Bonzini

Description Laszlo Ersek 2017-12-18 20:53:10 UTC
Lenny reported in private that his recently updated Fedora 27 guest started
crashing during boot, on QEMU/KVM, using the OVMF (UEFI) guest firmware. I
reproduced the same with a freshly updated Fedora 26 guest too.

Capturing a KVM trace, a triple fault is triggered during guest boot (the
common trace prefix "CPU-16361 [004]" is removed below, for legibility):

> 9362.886853: kvm_exit:          reason EPT_VIOLATION rip 0xffffffff810610bb
> info 182 0
> 9362.886853: kvm_page_fault:    address 1bfc9fff8 error_code 182
> 9362.886854: kvm_entry:         vcpu 2
> 9362.886882: kvm_exit:          reason EPT_VIOLATION rip 0xffffffff818c2267
> info 182 0
> 9362.886882: kvm_page_fault:    address 1bfc97fb0 error_code 182
> 9362.886884: kvm_entry:         vcpu 2
> 9362.886912: kvm_exit:          reason EPT_VIOLATION rip 0xffffffff810610bc
> info 182 0
> 9362.886912: kvm_page_fault:    address 1bfc8fff8 error_code 182
> 9362.886914: kvm_inj_exception: #PF (0x0)
> 9362.886914: kvm_entry:         vcpu 2
> 9362.886938: kvm_exit:          reason TRIPLE_FAULT rip 0xffffffff810610bc
> info 0 0
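
For readers decoding the trace: the value 182 printed after "info" and as error_code appears to be the hexadecimal VMX exit qualification of the EPT violation. Below is a minimal sketch of how those bits break down, assuming the bit layout from the Intel SDM's exit-qualification table for EPT violations. The table and function names are made up for illustration.

```python
# Toy decoder for the "info" / "error_code" value in the EPT_VIOLATION lines.
# Assumption: the value is the VMX exit qualification, printed in hex.
EPT_QUALIFICATION_BITS = {
    0: "data read",
    1: "data write",
    2: "instruction fetch",
    3: "EPT entry readable",
    4: "EPT entry writable",
    5: "EPT entry executable",
    7: "guest linear address valid",
    8: "access to translated address (not a page-table walk)",
}


def decode_ept_qualification(value):
    """Return the names of the modeled bits that are set, lowest bit first."""
    return [
        name
        for bit, name in sorted(EPT_QUALIFICATION_BITS.items())
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    for name in decode_ept_qualification(0x182):
        print(name)
```

Under these assumptions, 0x182 describes a data write through a valid linear address with none of the EPT permission bits set, which would mean the guest-physical page was not mapped in EPT at that moment.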

My component versions:

- host kernel: RHEL-7.4.z: 3.10.0-693.11.1.el7.x86_64 (however, the issue
  also reproduces when running pristine upstream Linux v4.14.5, built from
  source, on the host)

- QEMU: upstream v2.11.0-rc5

- OVMF: upstream 5c59537c1092 ("UefiCpuPkg: Check invalid RegisterCpuFeature
  parameter", 2017-12-13)

- last working guest kernel version: 4.13.11-200.fc26.x86_64

- broken guest kernel: 4.14.4-200.fc26.x86_64

Today I've completed the guest kernel bisection (covering the upstream range
v4.13.11..v4.14.4). The first bad guest kernel commit is 10af6235e0d3
("x86/mm: Implement PCID based optimization: try to preserve old TLB entries
using PCID", 2017-07-25).

Please find the bisection log below:

> git bisect start
> # good: [3996e9c638b8fe280971dc7f7c1f5baf3a6b4578] Linux 4.13.11
> git bisect good 3996e9c638b8fe280971dc7f7c1f5baf3a6b4578
> # bad: [51a2a68fde2035887c0d74aee1c9569c691dfd61] Linux 4.14.4
> git bisect bad 51a2a68fde2035887c0d74aee1c9569c691dfd61
> # good: [569dbb88e80deb68974ef6fdd6a13edb9d686261] Linux 4.13
> git bisect good 569dbb88e80deb68974ef6fdd6a13edb9d686261
> # bad: [3645e6d0dc80be4376f87acc9ee527768387c909] Merge tag 'md/4.14-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
> git bisect bad 3645e6d0dc80be4376f87acc9ee527768387c909
> # bad: [bafb0762cb6a906eb4105cccfb3bcd90be7f40d2] Merge tag
> 'char-misc-4.14-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
> git bisect bad bafb0762cb6a906eb4105cccfb3bcd90be7f40d2
> # good: [9657752cb5039c7498d4b27c4a75530f93b87d9b] Merge branch
> 'perf-core-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good 9657752cb5039c7498d4b27c4a75530f93b87d9b
> # bad: [e63a94f12b5fc67b2b92a89d4058e7a9021e900e] Merge tag 'tty-4.14-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> git bisect bad e63a94f12b5fc67b2b92a89d4058e7a9021e900e
> # bad: [d1ce495676644fc79b3ccd58657133c5d4a414fb] Merge tag
> 'm68k-for-v4.14-tag1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
> git bisect bad d1ce495676644fc79b3ccd58657133c5d4a414fb
> # bad: [b1b6f83ac938d176742c85757960dec2cf10e468] Merge branch
> 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect bad b1b6f83ac938d176742c85757960dec2cf10e468
> # good: [6c51e67b64d169419fb13318035bb442f9176612] Merge branch
> 'x86-syscall-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good 6c51e67b64d169419fb13318035bb442f9176612
> # bad: [6e0b52d406f64d2bd65731968a072387b91b44d2] x86/mm: Fix SME encryption
> stack ptr handling
> git bisect bad 6e0b52d406f64d2bd65731968a072387b91b44d2
> # good: [c7753208a94c73d5beb1e4bd843081d6dc7d4678] x86, swiotlb: Add memory
> encryption support
> git bisect good c7753208a94c73d5beb1e4bd843081d6dc7d4678
> # good: [3a366f791d83c690f3042e37f410f38de04fb4c7] x86/mm/dump_pagetables:
> Generalize address normalization
> git bisect good 3a366f791d83c690f3042e37f410f38de04fb4c7
> # bad: [10af6235e0d327d42e1bad974385197817923dc1] x86/mm: Implement PCID
> based optimization: try to preserve old TLB entries using PCID
> git bisect bad 10af6235e0d327d42e1bad974385197817923dc1
> # good: [44b04912fa72489d403738f39e1c782614b7ae7c] x86/mpx: Do not allow MPX
> if we have mappings above 47-bit
> git bisect good 44b04912fa72489d403738f39e1c782614b7ae7c
> # good: [ee00f4a32a76ef631394f31d5b6028d50462b357] x86/mm: Allow userspace
> have mappings above 47-bit
> git bisect good ee00f4a32a76ef631394f31d5b6028d50462b357
> # good: [77ef56e4f0fbb350d93289aa025c7d605af012d4] x86: Enable 5-level paging
> support via CONFIG_X86_5LEVEL=y
> git bisect good 77ef56e4f0fbb350d93289aa025c7d605af012d4
> # first bad commit: [10af6235e0d327d42e1bad974385197817923dc1] x86/mm:
> Implement PCID based optimization: try to preserve old TLB entries using PCID

I'm marking this issue as Severity=Blocking because the most recent Fedora
26 and Fedora 27 kernels are unbootable on the virt stack described above.
Please adjust as necessary.

(Please note that I'm on PTO, hence feedback from me will be spotty to
nonexistent for a while. The only reason I could find time for a day-long
bisection is exactly my being on PTO...)

Comment 1 Laszlo Ersek 2017-12-18 21:11:36 UTC
* OVMF build commands:

(1) perform the steps described in

(2) run:

$ source edksetup.sh
$ build \
    -a IA32 \
    -a X64 \
    -p OvmfPkg/OvmfPkgIa32X64.dsc \
    -t GCC48 \
    -b NOOPT \

* QEMU command line (generated by libvirt):

/opt/qemu-installed/bin/qemu-system-x86_64 \
  -name guest=ovmf.fedora.q35,debug-threads=on \
  -S \
  -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-37-ovmf.fedora.q35/master-key.aes \
  -machine pc-q35-2.11,accel=kvm,usb=off,smm=on,dump-guest-core=off \
  -cpu Haswell-noTSX,vmx=on \
  -global driver=cfi.pflash01,property=secure,value=on \
  -drive file=/home/virt-images/OVMF_CODE.4m.3264.fd,if=pflash,format=raw,unit=0,readonly=on \
  -drive file=/var/lib/libvirt/qemu/nvram/ovmf.fedora.q35_VARS.fd,if=pflash,format=raw,unit=1 \
  -m 5120 \
  -realtime mlock=off \
  -smp 4,sockets=1,cores=2,threads=2 \
  -uuid a51c0e4c-93b1-4485-811e-ea9727eb748c \
  -no-user-config \
  -nodefaults \
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-37-ovmf.fedora.q35/monitor.sock,server,nowait \
  -mon chardev=charmonitor,id=monitor,mode=control \
  -rtc base=utc \
  -no-shutdown \
  -global ICH9-LPC.disable_s3=0 \
  -global ICH9-LPC.disable_s4=1 \
  -boot menu=off,strict=on \
  -device i82801b11-bridge,id=pci.1,bus=pcie.0,addr=0x1e \
  -device pci-bridge,chassis_nr=2,id=pci.2,bus=pci.1,addr=0x1 \
  -device pcie-root-port,port=0x10,chassis=3,id=pci.3,bus=pcie.0,multifunction=on,addr=0x3 \
  -device pcie-root-port,port=0x11,chassis=4,id=pci.4,bus=pcie.0,multifunction=on,addr=0x3.0x1 \
  -device ich9-usb-ehci1,id=usb,bus=pci.2,addr=0x2.0x7 \
  -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.2,multifunction=on,addr=0x2 \
  -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.2,addr=0x2.0x1 \
  -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.2,addr=0x2.0x2 \
  -device qemu-xhci,p2=15,p3=15,id=usb1,bus=pcie.0,multifunction=on,addr=0x2 \
  -device qemu-xhci,p2=15,p3=15,id=usb2,bus=pcie.0,addr=0x2.0x1 \
  -device virtio-scsi-pci,id=scsi0,bus=pci.2,addr=0x5 \
  -device virtio-serial-pci,id=virtio-serial0,bus=pci.2,addr=0x6 \
  -drive file=/mnt/data/virt-images-big/ovmf.fedora.q35.img,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=writeback,discard=unmap,werror=enospc \
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
  -drive file=/usr/share/OVMF/UefiShell.iso,format=raw,if=none,media=cdrom,id=drive-sata0-0-0,readonly=on,cache=writeback \
  -device ide-cd,bus=ide.0,drive=drive-sata0-0-0,id=sata0-0-0 \
  -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:29:80:ae,bus=pci.3,addr=0x0,rombar=0 \
  -chardev pty,id=charserial0 \
  -device isa-serial,chardev=charserial0,id=serial0 \
  -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-37-ovmf.fedora.q35/org.qemu.guest_agent.0,server,nowait \
  -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
  -chardev spicevmc,id=charchannel1,name=vdagent \
  -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 \
  -device usb-tablet,id=input0,bus=usb.0,port=2 \
  -spice port=5900,addr=,disable-ticketing,streaming-video=off,seamless-migration=on \
  -device VGA,id=video0,vgamem_mb=16,bus=pcie.0,addr=0x1 \
  -device virtio-balloon-pci,id=balloon0,bus=pci.2,addr=0x4 \
  -object rng-random,id=objrng0,filename=/dev/urandom \
  -device virtio-rng-pci,rng=objrng0,id=rng0,max-bytes=1048576,period=1000,bus=pci.2,addr=0x3 \
  -global isa-debugcon.iobase=0x402 \
  -debugcon file:/tmp/ovmf.fedora.q35.log \
  -global pcie-root-port.io-reserve=0 \
  -global pcie-root-port.pref64-reserve=4G \
  -s \
  -msg timestamp=on
Comment 2 Andy Lutomirski 2017-12-18 22:20:49 UTC
Is this reproducible using the copy of OVMF that Fedora distributes?  If not, can one of you attach or otherwise send me an OVMF binary that reproduces it?
Comment 3 Laszlo Ersek 2017-12-19 10:29:02 UTC
Yes, it reproduces with Fedora's most recent OVMF build (minimally -- I didn't check other OVMF builds from Fedora):


Please install the sub-package


and from it, please use the firmware binary


(sha256sum: ae8c9dd92eac5aa5e16e72ab85b89d2f6f911ada0aa5e5812396509945fb2dc3.)

Comment 4 Laszlo Ersek 2017-12-19 10:51:03 UTC
After re-reading the message of kernel commit 10af6235e0d3, I've now figured I should try booting the same (problematic) guest kernel with the "nopcid" cmdline parameter added.

Indeed, with "nopcid" specified, the guest boots fine.

... Is there perhaps a problem with KVM's handling (or emulation?) of PCIDs?
Comment 5 Andy Lutomirski 2017-12-19 16:49:19 UTC
Debugging results so far: I added a bunch of printk calls, and we're dying some time after the first call to efi_call_virt(get_next_variable, ...) in efivar_init() when called by efivar_ssdt_load().

On inspection of the code, we seem to turn off interrupts *after* arch_efi_call_virt_setup(), which makes no sense to me.  We do *not* want to take an interrupt while running with the funny EFI CR3.  But fixing that doesn't seem to help.

I can reproduce the problem with two cpus but not with only one cpu.

I'm rather mystified as to what's actually happening, since efi->systab appears to be a bogus pointer.  Maybe efi->systab only points anywhere when we're using EFI's CR3?

I can easily imagine that something explodes if we enter SMM with CR3[11:0] != 0, but I'm skeptical that this is the actual problem we're seeing.
Comment 6 Andy Lutomirski 2017-12-19 17:13:46 UTC
I'm going to go out on a limb and suggest that the problem is that either QEMU is emulating SMM incorrectly when a non-trapping CPU is forced into SMM and PCID is in use or that OVMF has a bug.  No one has reported this problem on real hardware...
Comment 7 Andy Lutomirski 2017-12-19 17:18:55 UTC
Paolo, I think that rsm_load_state_64() may be buggy.  It loads CR3, then CR4 & ~PCIDE, then CR0, then CR4.  But, if the emulation logic works like a real CPU, setting CR4.PCIDE is bogus if CR3[11:0] != 0.  If the emulation does *not* work like a real CPU but CR3 is parsed at load time, then we might have a bogus CR3 state when all is said and done.

I would set CR3 to zero, then load CR4 & ~PCIDE, then CR0, then full CR4, and *then* CR3.

(This may be busted on 32-bit PAE, too.  Loading CR3 has side effects that depend on CR4.)
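
The ordering argument above can be sketched with a toy model. This is not KVM code: ToyCpu, its methods, and the two rsm_* functions are hypothetical names, and the only rule modeled is the architectural one that setting CR4.PCIDE faults unless CR3[11:0] is zero and the CPU is in long mode.

```python
PCIDE = 1 << 17        # CR4.PCIDE
PAE = 1 << 5           # CR4.PAE
PCID_MASK = 0xFFF      # CR3[11:0] holds the PCID when CR4.PCIDE = 1


class GeneralProtectionFault(Exception):
    pass


class ToyCpu:
    """Hypothetical model; tracks only CR3, CR4 and a long-mode flag."""

    def __init__(self):
        self.cr3 = 0
        self.cr4 = 0
        self.long_mode = False

    def set_cr3(self, value):
        self.cr3 = value

    def enter_long_mode(self):
        # Stand-in for restoring CR0 and EFER.
        self.long_mode = True

    def set_cr4(self, value):
        enabling_pcide = (value & PCIDE) and not (self.cr4 & PCIDE)
        if enabling_pcide and ((self.cr3 & PCID_MASK) or not self.long_mode):
            raise GeneralProtectionFault("cannot set CR4.PCIDE in this state")
        self.cr4 = value


def rsm_current_order(cpu, saved_cr3, saved_cr4):
    # Order described above: CR3, CR4 & ~PCIDE, CR0, CR4.
    cpu.set_cr3(saved_cr3)
    cpu.set_cr4(saved_cr4 & ~PCIDE)
    cpu.enter_long_mode()
    cpu.set_cr4(saved_cr4)      # faults if saved_cr3 carries a nonzero PCID


def rsm_suggested_order(cpu, saved_cr3, saved_cr4):
    # Suggested order: CR3 = 0, CR4 & ~PCIDE, CR0, full CR4, then CR3.
    cpu.set_cr3(0)
    cpu.set_cr4(saved_cr4 & ~PCIDE)
    cpu.enter_long_mode()
    cpu.set_cr4(saved_cr4)
    cpu.set_cr3(saved_cr3)


if __name__ == "__main__":
    saved_cr3, saved_cr4 = 0x1000 | 0x005, PCIDE | PAE
    try:
        rsm_current_order(ToyCpu(), saved_cr3, saved_cr4)
    except GeneralProtectionFault as exc:
        print("current order:", exc)
    cpu = ToyCpu()
    rsm_suggested_order(cpu, saved_cr3, saved_cr4)
    print("suggested order: cr3=%#x cr4=%#x" % (cpu.cr3, cpu.cr4))
```

In this model the current order only fails when the saved CR3 has a nonzero PCID, which would fit the observation that the problem first showed up once the guest kernel started using PCIDs.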
Comment 8 Laszlo Ersek 2017-12-19 18:08:48 UTC
FWIW, I confirm that using an OVMF build that does not have -D SMM_REQUIRE, the same guest boots fine (with PCID enabled, as it is by default). This can be checked e.g. by booting "OVMF_CODE.fd" instead of "OVMF_CODE.secboot.fd", from comment 3. For this to work, the QEMU cmdline option "-global driver=cfi.pflash01,property=secure,value=on" has to be removed as well. Thanks!
Comment 9 Paolo Bonzini 2017-12-19 18:20:14 UTC
Created attachment 261263 [details]
tentative patch

The right fix is probably to add an emulator callback that sets all the special registers (like KVM_SET_SREGS would do), but the attached patch should do it.  Untested, will try it of course before posting it formally to LKML and kvm@vger.kernel.org.
Comment 10 Laszlo Ersek 2017-12-20 09:32:18 UTC
Lowering importance to "normal" -- symptoms are only present when using SMM, and even in that case, it can be worked around on both the host side ("-cpu MODEL,pcid=off" on the QEMU cmdline) and the guest side ("nopcid" kernel param).
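
To confirm that either workaround took effect, one can check inside the guest whether the pcid CPU flag is still advertised. Below is a minimal sketch, assuming (not verified on this setup) that the flag disappears from /proc/cpuinfo in both cases. has_cpu_flag is a made-up helper name.

```python
def has_cpu_flag(cpuinfo_text, flag):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Compare whole whitespace-separated tokens so that "pcid"
        # does not match "invpcid".
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False
```

In the guest, this could be called as has_cpu_flag(open("/proc/cpuinfo").read(), "pcid").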
Comment 11 Laszlo Ersek 2017-12-20 13:24:07 UTC
(In reply to Paolo Bonzini from comment #9)
> Created attachment 261263 [details]
> tentative patch
> The right fix is probably to add an emulator callback that sets all the
> special registers (like KVM_SET_SREGS would do), but the attached patch
> should do it.  Untested, will try it of course before posting it formally to
> LKML and kvm@vger.kernel.org.

I locally rebuilt the kvm and kvm-intel modules only, with this patch applied, for my RHEL-7.4.z (3.10.0-693.11.1.el7.x86_64) host kernel. The patch applies cleanly:

> patching file arch/x86/kvm/emulate.c
> Hunk #1 succeeded at 2419 (offset 15 lines).
> Hunk #2 succeeded at 2452 (offset 15 lines).
> Hunk #3 succeeded at 2468 (offset 15 lines).
> Hunk #4 succeeded at 2514 (offset 15 lines).
> Hunk #5 succeeded at 2538 (offset 15 lines).
> Hunk #6 succeeded at 2566 (offset 15 lines).

There's one line with superfluous whitespace (right above the comment "In order to later set CR4.PCIDE..."):

> warning: 1 line adds whitespace errors.

The patch works; the guest boots fine.

I also repeated my Linux guest tests from <https://github.com/tianocore/tianocore.github.io/wiki/Testing-SMM-with-QEMU,-KVM-and-libvirt#tests-to-perform-in-the-installed-guest-fedora-26-guest>.

When you post the patch to LKML and kvm@vger.kernel.org, please add:

Tested-by: Laszlo Ersek <lersek@redhat.com>

Thank you both!
Comment 12 Laszlo Ersek 2017-12-20 13:30:39 UTC
Updating the BZ metadata to reflect that this report ultimately concerns the host side.
Comment 13 Laszlo Ersek 2017-12-22 11:22:15 UTC
Paolo's patch is on the lists:

[PATCH] kvm: x86: fix RSM when PCID is non-zero
Comment 14 Laszlo Ersek 2017-12-30 17:09:07 UTC
Fixed in Paolo's commit fae1a3e775cc ("kvm: x86: fix RSM when PCID is non-zero", 2017-12-21). Part of v4.15-rc5. Thanks everyone!
Comment 15 Laszlo Ersek 2018-01-02 13:46:47 UTC
Greg Kroah-Hartman queued the patch for the following stable trees: 4.4, 4.9, 4.14.
