QEMU/KVM VMs using PCI pass-through take a long time to boot, or never boot. The issue is only present when the VM has more than ~2.5G of RAM. I was able to get a VM with 6G to boot, but I had to let it run overnight. More info can be found in this thread: https://bbs.archlinux.org/viewtopic.php?id=203240 Someone in the forum seems to have narrowed it down to a set of commits regarding KVM and MTRR in 4.1-rc2 that are in the mainline kernel as of 4.2. Specifically commit fa61213746a706dd975661151c35795ca4dd82c2 KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type.
I can confirm this bug. As soon as I enabled PCI passthrough on my VM while using kernel 4.2.2, my VM ran extremely slowly, and increasing the amount of RAM made the problem worse. Switching to kernel 4.1.12 fixed the problem.
I can also confirm this bug. After upgrading the kernel on a Fedora box from 4.0.4 to 4.2.5, my Windows 8.1 VM hung on start. The VM would only start if the GPU passthrough configuration was removed, or if the VM's memory size was reduced to 2GB.
The issue is still present in 4.4-rc1 under the same conditions. After lowering the guest RAM below 2.5G, the VM boots normally. Anything over 2.5G makes it boot very slowly and use a lot of CPU.
I narrowed it down earlier to "simplify kvm_mtrr_get_guest_memory_type". This was indeed the commit that broke it, but it is only a small part of the feature set implementing vMTRR. To my knowledge 4.1.x versions didn't have this code at all; it was introduced in 4.2. What kvm_mtrr_get_guest_memory_type() basically does is decide which cache type to use (Uncachable, Write-Back, or Write-Through) based on mtrr_for_each_mem_type(). kvm_mtrr_get_guest_memory_type() is only called by vmx_get_mt_mask. This is all related to how the hardware is set up regarding IOMMU, MMIO, VT-d and EPT address translation. For experimenting purposes I simply modified kvm_mtrr_get_guest_memory_type to always return MTRR_TYPE_WRBACK, and that fixed the issue: the guest boots correctly. I know that hardcoding Write-Back mode is not a viable long-term solution, though. It seems that the new function (in 4.2 and onward) does not return the proper cache type for PCI passthrough configurations. I would also like to point out that https://bugzilla.kernel.org/show_bug.cgi?id=107921 may be related. The other bug mentions very similar hardware to mine, and the issues started at the same time (upgrading to 4.2.x) with a PCI passthrough configuration.
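As a rough illustration of the kind of decision described above, here is a simplified Python model of resolving a memory type from variable MTRR ranges. It is not the kernel's implementation: the function name, the tuple layout, and the Write-Back default are assumptions made for the example. Only the type encodings (UC = 0, WT = 4, WB = 6) and the "UC wins, WT beats WB" overlap rules come from the x86 architecture.

```python
# Simplified model of MTRR memory-type resolution; NOT the kernel's code.
MTRR_TYPE_UNCACHABLE = 0
MTRR_TYPE_WRTHROUGH = 4
MTRR_TYPE_WRBACK = 6


def guest_memory_type(addr, var_ranges, default_type, mtrr_enabled=True):
    """Return the effective memory type for a guest physical address.

    var_ranges is a list of (start, end, mem_type) tuples, end exclusive.
    This helper and its arguments are hypothetical, for illustration only.
    """
    if not mtrr_enabled:
        # With MTRRs disabled, all of physical memory is treated as UC.
        return MTRR_TYPE_UNCACHABLE
    types = {t for (start, end, t) in var_ranges if start <= addr < end}
    if not types:
        return default_type
    if MTRR_TYPE_UNCACHABLE in types:
        return MTRR_TYPE_UNCACHABLE  # UC takes precedence over everything
    if types == {MTRR_TYPE_WRTHROUGH, MTRR_TYPE_WRBACK}:
        return MTRR_TYPE_WRTHROUGH   # WT takes precedence over WB
    if len(types) == 1:
        return next(iter(types))
    # Other overlaps are architecturally undefined; be conservative here.
    return MTRR_TYPE_UNCACHABLE


# One 2 GB uncacheable range starting at 2 GB, Write-Back assumed elsewhere.
ranges = [(0x80000000, 0x100000000, MTRR_TYPE_UNCACHABLE)]
low_ram = guest_memory_type(0x40000000, ranges, MTRR_TYPE_WRBACK)
pci_hole = guest_memory_type(0xC0000000, ranges, MTRR_TYPE_WRBACK)
```

In this model, guest RAM below 2 GB stays Write-Back while addresses in the 2 GB to 4 GB window come back Uncachable. If a lookup wrongly reports Uncachable for ordinary RAM, every access to that RAM becomes slow, which would be consistent with the symptoms reported here.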
Please capture a trace according to http://www.linux-kvm.org/page/Tracing, so that I can see your MTRR settings and figure out why that patch breaks things. If possible, capture a trace with 2 GB of memory and one with 3 GB.
Ah, the traces are huge. You can xz them and send them through private email.
*** Bug 107921 has been marked as a duplicate of this bug. ***
I captured the traces; they take more than half a GB uncompressed. They are available in xz format under https://drive.google.com/folderview?id=0B8ebX_WjVHnGNlN4eTEzU2xtMEk&usp=sharing To make things clear: I have two hosts. HostB is a testing machine. VGA passthrough with EDKII worked out of the box on 4.1 kernels. It broke with several 4.2 versions and also with 4.3. I tried several repo versions and untouched kernel.org versions; that was about a month ago. Then I tested again a few days ago, and it works again with the 4.2.6-301.fc23 Fedora repo kernel. HostA is the problematic one. I did the trace as requested, with 2G and 3G of guest RAM. It worked with 4.1 kernels, but still doesn't work with 4.2.6-301.fc23. I can make it work either by lowering RAM to under 2.5 GB or with my aforementioned modification (kvm_mtrr_get_guest_memory_type always returning MTRR_TYPE_WRBACK). Of course I made the traces with an unmodified kernel, and only the 2G guest actually booted. The exact symptom (when I say it doesn't work) is that the guest is extremely slow. I tried booting a live Linux guest; after about 15 minutes I saw messages, but even after two hours I still couldn't get to a shell. The Windows guest only shows the white dots under the logo circling around very slowly. Once I got a blue screen, and the letters came up one by one, as if the error message were being written with a typewriter. Sometimes the guest just shuts down and qemu terminates. The UEFI shell and UEFI setup in the guest work perfectly under all conditions; the slowness starts when booting an OS. I can also provide traces with a working kernel version on HostA and on HostB for comparison, if requested. In the shared folders you can find MTRR, PAT, and lspci -vvv info for each host, along with traces for HostA as requested (2GB and 3GB). One of the members on the original Arch Linux thread suggested I put a printk in the problematic function.
The dmesg files in each folder show the arguments of vmx_get_mt_mask and what kvm_mtrr_get_guest_memory_type returns to it. The added line was (just before the return statement in vmx_get_mt_mask): printk(KERN_INFO "vmx_get_mt_mask got the following: cpu=%d, vcpu=%d, gfn=%x, MMIO=%d, cache=%x", vcpu->cpu, vcpu->vcpu_id, gfn, is_mmio, cache); It is visible from dmesg files that in case of large guest ram, the function doesn't even get called for vcpus other than 0. On the other hand it is called for all in case of small memory. The traces are NOT from the same run as the dmesg files, as they have been created before your post about tracing. Please ask if any more info, traces or dumps are needed. I would be glad to provide any help.
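For anyone wanting to check the "only vcpu 0 calls the function" observation against their own dmesg, a small Python sketch that counts calls per vCPU is shown below. The regular expression follows the printk format string quoted above. The sample lines, timestamps, and the function name are made up for illustration and are not taken from the shared dmesg files.

```python
import re
from collections import Counter

# Pattern mirrors the printk format string quoted in the comment above.
LINE_RE = re.compile(
    r"vmx_get_mt_mask got the following: cpu=(\d+), vcpu=(\d+), "
    r"gfn=([0-9a-f]+), MMIO=(\d+), cache=([0-9a-f]+)"
)


def calls_per_vcpu(dmesg_text):
    """Count how many logged vmx_get_mt_mask calls each vCPU made."""
    counts = Counter()
    for match in LINE_RE.finditer(dmesg_text):
        counts[int(match.group(2))] += 1
    return counts


# Hypothetical sample input, NOT real output from the affected hosts.
SAMPLE = """\
[  100.000001] vmx_get_mt_mask got the following: cpu=2, vcpu=0, gfn=a0, MMIO=0, cache=6
[  100.000002] vmx_get_mt_mask got the following: cpu=3, vcpu=1, gfn=c0000, MMIO=1, cache=0
[  100.000003] vmx_get_mt_mask got the following: cpu=2, vcpu=0, gfn=80000, MMIO=0, cache=0
"""

counts = calls_per_vcpu(SAMPLE)
```

Because the quoted printk has no trailing newline, consecutive messages may end up on one dmesg line; scanning with finditer instead of splitting on lines copes with that.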
Can you try the following on HostA: 1) providing *host* dmesg 2) disabling each MTRR starting from the first, until things are fast again. Then reboot and try disabling only the MTRR that made things fast again.
Another test on the 3GB machine: make it fast by nuking kvm_mtrr_get_guest_memory_type (return MTRR_TYPE_WRBACK), then cat /proc/mtrr from the guest.
Created attachment 197071 [details] OVMF log
> 1) providing host dmesg
Of course this makes no sense, you already did. And from the traces I suspect that /proc/mtrr is the same on the two guests. The culprit seems to be this MTRR:
qemu-system-x86-2228 [002] 127.763684: kvm_msr: msr_read 201 = 0xff80000800
qemu-system-x86-2228 [002] 127.763685: kvm_msr: msr_read 200 = 0x80000000
which is set up by OVMF. I'm attaching the OVMF output for reference (it can be extracted from the traces with "trace-cmd report trace-3GBmem.dat |grep 0x402 | awk --non-decimal-data '{printf "%c", 0+$NF}' > OVMF.log").
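The grep/awk pipeline above turns each byte written to I/O port 0x402 back into a character. A rough Python equivalent is sketched below, assuming the "trace-cmd report" line format seen in this thread, where the written value is the last field on the line. The function name is made up for the example, and the filter is slightly stricter than the plain "grep 0x402".

```python
def firmware_log(report_lines):
    """Rebuild firmware debug output from kvm_pio lines for port 0x402.

    Assumes each relevant line ends with the written byte as a hex value,
    e.g. '... kvm_pio: pio_write at 0x402 size 1 count 1 val 0x21'.
    """
    chars = []
    for line in report_lines:
        if "pio_write at 0x402" not in line:
            continue
        value = int(line.split()[-1], 16)  # last field, like awk's $NF
        chars.append(chr(value))
    return "".join(chars)


# Two kvm_pio lines quoted elsewhere in this thread, plus an unrelated line.
lines = [
    "qemu-system-x86-2157 [003] 76.758024: kvm_pio: pio_write at 0x402 size 1 count 1 val 0x21",
    "qemu-system-x86-2157 [003] 76.467052: kvm_fpu: load",
    "qemu-system-x86-2157 [000] 93.067524: kvm_pio: pio_write at 0x402 size 1 count 1 val 0xa",
]
text = firmware_log(lines)
```

To process a whole trace, the output of "trace-cmd report" can be passed to the function line by line.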
Created attachment 197081 [details] patch to be tested The bug happens when the VM is created with a maxphyaddr that doesn't match the host. This patch should fix your bug.
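The attached patch is not quoted in this thread, so the following is only an illustration of why a physical address width mismatch can matter for the MTRR pair seen in the trace (msr 200 = 0x80000000, msr 201 = 0xff80000800). It assumes the range end is computed as (start | ~mask) + 1, the expression that appears in the mtrr.c code quoted with the debug patches in this thread, and that mask bits above some address width are forced to 1 first. The fill_from_bit parameter and the 46-bit width are hypothetical. The type field in the low byte of the base MSR and the valid flag in bit 11 of the mask MSR are architectural.

```python
PAGE_MASK = ~0xFFF
U64 = (1 << 64) - 1


def var_mtrr_range(physbase, physmask, fill_from_bit):
    """Decode a variable MTRR base/mask MSR pair.

    fill_from_bit is a hypothetical knob: mask bits at and above this
    position are forced to 1 before the range is computed.
    Returns (start, end, mem_type, valid) with end exclusive.
    """
    mem_type = physbase & 0xFF
    valid = bool(physmask & (1 << 11))
    start = physbase & PAGE_MASK & U64
    mask = physmask & PAGE_MASK & U64
    mask |= (U64 << fill_from_bit) & U64
    end = ((start | (~mask & U64)) + 1) & U64
    return start, end, mem_type, valid


# Mask written for a 40-bit address space, decoded with a matching width.
matched = var_mtrr_range(0x80000000, 0xFF80000800, fill_from_bit=40)

# Same MSR values decoded with a wider, mismatched width (hypothetical).
mismatched = var_mtrr_range(0x80000000, 0xFF80000800, fill_from_bit=46)
```

With matching widths the pair decodes to a valid uncacheable range from 2 GB to 4 GB. With the wider width, bits 40 to 45 of the mask stay clear, the mask is no longer contiguous, and the computed end lands far above 4 GB.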
Sorry for the delay. I couldn't check sooner. Unfortunately the guest doesn't start with the patch applied. Nothing is displayed (no sync signal from the passthrough card). Host dmesg doesn't show any errors. Usually, even with a slow guest, I get the following in host dmesg when starting the VM:
...
VFIO - User Level meta-driver version: 0.3
kernel: vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
kernel: vfio_ecap_init: 0000:01:00.0 hiding ecap 0x1e@0x258
kernel: vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x900
kernel: pmd_set_huge: Cannot satisfy [mem 0xe0000000-0xe0200000] with a huge-page mapping due to MTRR override.
- then some USB resets for keyboard passthrough -
kernel: kvm: zapping shadow pages for mmio generation wraparound
...and so on; the guest screen displays the UEFI startup by now.
With the patch applied it simply stops at the pmd_set_huge line and nothing further happens: no USB resets, no guest start, no error messages. I also followed up on the other tests. Disabling host MTRRs one by one did not yield any results. The guest is still slow even with all of the host's MTRRs disabled. I tried nuking kvm_mtrr_get_guest_memory_type as requested and started a live distro. The guest did boot, but did not have a /proc/mtrr file. (Graphics scrolling was slow, I guess on account of the lack of caching, but otherwise the guest was as speedy as normal.) I saved some info from the guest, like dmesg, lspci -vvv, and some sysfs files. You can find them in the same share as before. https://drive.google.com/folderview?id=0B8ebX_WjVHnGNlN4eTEzU2xtMEk&usp=sharing
Can confirm. I am also hit by this bug, and my virtual machine doesn't start either. I tried applying the patch to both 4.3.0 and linux-next, and neither of those worked.
Created attachment 197381 [details] patch to be tested v2 There is indeed a two-character mistake in the patch. Can you guys please test this one instead?
Created attachment 197391 [details] host dmesg, after applying patch in att. 197381 After applying the patch I ran into different symptoms. Linux guest: GRUB works, the kernel doesn't start. No kernel messages are displayed even without rhgb quiet. No messages in host dmesg either. Windows guest: the white dots start to circle for a bit less than a second, then it freezes. Host dmesg shows errors. Full dmesg in the attachment.
Please try this additional debug patch on Windows, and include a trace of both Linux and Windows. Thanks! Also, has the patch broken bigger memory sizes as well, or do they still work?
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 7747b6d716fa..74c1841f12c7 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -305,6 +305,7 @@
 	 * variable MTRRs causes a #GP.
 	 */
 	*end = (*start | ~mask) + 1;
+	printk_ratelimited("start %llx end %llx\n", *start, *end);
 }

 static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 44976a596fa6..1451e3d5c6af 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8867,6 +8867,7 @@
 	}

 	cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
+	if (cache != 6) printk("gfn %x cache %x\n", (unsigned)gfn, cache);

 exit:
 	return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
I applied the debug patches and did the traces. I tested the following configurations: 2.5GB, 3GB and 24GB of guest memory, each with both Fedora Live and Windows. It seems that with the patch all memory configurations act the same; the symptoms are the same as described in my previous comment, regardless of guest memory size. I only included the traces and host dmesg for 2.5 and 3 GB. Each test was performed after a clean boot. The dmesg files are from the same run as the traces and are named accordingly. I found that with a Linux guest the host dmesg shows a few new printks (due to this new patch) every few seconds. The guest doesn't show a single message, though. I let it run for about 5 minutes, so the Linux traces are quite large this time. I also noticed that there is a kernel oops when "virsh destroy"-ing the Linux guest. The Windows guest, on the other hand, causes a warning in host dmesg, then the printks stop popping up (and the guest freezes), and I can virsh destroy the guest without any further problems. The traces are big, so I used the same share as before: https://drive.google.com/folderview?id=0B8ebX_WjVHnGNlN4eTEzU2xtMEk&usp=sharing Look for the Traces_with_test_patch folder.
I'm also experiencing this bug. I'm able to successfully pass through an NVIDIA 970 card and boot to an EFI shell with 100% reliability, but it usually hangs when attempting to boot my guest OS, Windows 10. My behavior was a bit different from schefister's: the guest VM would try to boot and fail to do so 9 out of 10 times or so, and after a random number of failed boot attempts it would eventually boot correctly. Behavior was as expected when the VM did eventually manage to boot, with working 3D acceleration and close to native Windows performance. The guest VM would sometimes bluescreen with a variety of different error messages, and once slowly printed a BSOD error to the screen at a rate of about 1 CPS, as noted by schefister. I'll have plenty of time to test on the weekends going forward. The host is Antergos 2015.11.14, running kernel 4.2.5.
schefister, what are your QEMU command lines? Because it looks like in the Fedora case you are not using MTRRs at all. Compare
trace-cmd report trace-Fedora_2.5G.dat |grep -ne 'msr_write [2f]' -e page_fault.*3be11 -e pio_write.*0x402.*0x21
65112: qemu-system-x86-2157 [003] 76.758024: kvm_pio: pio_write at 0x402 size 1 count 1 val 0x21
75123: qemu-system-x86-2157 [003] 76.764036: kvm_pio: pio_write at 0x402 size 1 count 1 val 0x21
...
23209469: qemu-system-x86-2163 [006] 95.089323: kvm_page_fault: address 3be11338 error_code 181
with the trace-Win_3G.dat trace, starting like:
16412: qemu-system-x86-2192 [001] 173.076260: kvm_msr: msr_write 2ff = 0x0
16418: qemu-system-x86-2192 [001] 173.076271: kvm_msr: msr_write 250 = 0x606060606060606
16421: qemu-system-x86-2192 [001] 173.076272: kvm_msr: msr_write 258 = 0x606060606060606
16424: qemu-system-x86-2192 [001] 173.076273: kvm_msr: msr_write 259 = 0x606060606060606
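For reference, the msr_write values in the Windows trace can be decoded by hand. MSR 0x2ff is IA32_MTRR_DEF_TYPE (default type in bits 7:0, fixed-range enable in bit 10, MTRR enable in bit 11), and MSRs 0x250, 0x258 and 0x259 are fixed-range MTRRs holding eight one-byte type fields each. The Python sketch below does that decoding. The helper names are made up for the example; the bit layout is architectural.

```python
MTRR_TYPE_NAMES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB"}


def decode_def_type(value):
    """Decode IA32_MTRR_DEF_TYPE (MSR 0x2ff)."""
    return {
        "default_type": MTRR_TYPE_NAMES.get(value & 0xFF, "reserved"),
        "fixed_enabled": bool(value & (1 << 10)),
        "mtrr_enabled": bool(value & (1 << 11)),
    }


def decode_fixed(value):
    """Decode a fixed-range MTRR MSR into its eight memory types."""
    return [
        MTRR_TYPE_NAMES.get((value >> (8 * i)) & 0xFF, "reserved")
        for i in range(8)
    ]


def_type = decode_def_type(0x0)              # msr_write 2ff = 0x0
fix64k = decode_fixed(0x606060606060606)     # msr_write 250 = 0x606060606060606
```

Read this way, the first write turns MTRRs off (a common first step when reprogramming them) and the following writes set every sub-range of those fixed MTRRs to Write-Back.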
The lack of MTRRs means that in the Fedora case it's just glacial, as shown by the fact that it takes almost 20 seconds to complete the firmware:
qemu-system-x86-2157 [003] 76.467052: kvm_fpu: load
qemu-system-x86-2157 [000] 93.067524: kvm_pio: pio_write at 0x402 size 1 count 1 val 0xa
vs. 5 seconds for Windows (all spent at the splash screen, so it's okay):
qemu-system-x86-2192 [001] 172.812258: kvm_fpu: load
qemu-system-x86-2192 [000] 177.373041: kvm_pio: pio_write at 0x402 size 1 count 1 val 0x21
(port 0x402 is only written by the firmware). On the other hand, Windows seems to be a KVM bug that I cannot explain right now. So, more debugging patches ahead...
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 9e8bf13572e6..9052cae8a6e7 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -127,7 +127,7 @@ static u8 mtrr_disabled_type(void)
 	 * IA32_MTRR_DEF_TYPE.E bit is cleared, and the UC
 	 * memory type is applied to all of physical memory.
 	 */
-	return MTRR_TYPE_UNCACHABLE;
+	return MTRR_TYPE_WRBACK;
 }

 /*
@@ -587,6 +592,7 @@ static bool mtrr_lookup_okay(struct mtrr_iter *iter)
 {
 	if (iter->fixed) {
 		iter->mem_type = iter->mtrr_state->fixed_ranges[iter->index];
+		printk("index %d type %d\n", iter->index, iter->mem_type);
 		return true;
 	}
The first should fix Fedora or at least let it proceed faster out of the firmware. The second should provide more info. Please gather trace and dmesg again. One size is enough. BTW, the fact that the size doesn't matter is a good thing. It means the patch actually did what it was meant to do.
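The firmware timings quoted above can be reproduced from the trace timestamps. A small Python sketch is shown below, assuming the "trace-cmd report" layout seen in this thread, where the timestamp in seconds follows the bracketed CPU number. The function names are made up for the example.

```python
import re
from decimal import Decimal

# Timestamp follows the "[cpu]" field, e.g. "[003] 76.467052:".
TS_RE = re.compile(r"\]\s+(\d+\.\d+):")


def timestamp(line):
    """Extract the timestamp (in seconds) from one trace report line."""
    match = TS_RE.search(line)
    if match is None:
        raise ValueError("no timestamp found in: " + line)
    return Decimal(match.group(1))


def elapsed(first_line, last_line):
    """Seconds between two trace report lines."""
    return timestamp(last_line) - timestamp(first_line)


fedora = elapsed(
    "qemu-system-x86-2157 [003] 76.467052: kvm_fpu: load",
    "qemu-system-x86-2157 [000] 93.067524: kvm_pio: pio_write at 0x402 size 1 count 1 val 0xa",
)
windows = elapsed(
    "qemu-system-x86-2192 [001] 172.812258: kvm_fpu: load",
    "qemu-system-x86-2192 [000] 177.373041: kvm_pio: pio_write at 0x402 size 1 count 1 val 0x21",
)
```

For these four lines the differences come out at about 16.6 seconds for Fedora and about 4.6 seconds for Windows, in line with the rough figures given in the comment.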
Also, check that you have this patch, recently submitted to the KVM mailing list:
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 9e8bf13572e6..adc54e1d40a9 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -267,7 +267,7 @@ static int fixed_mtrr_addr_to_seg(u64 addr)
 	for (seg = 0; seg < seg_num; seg++) {
 		mtrr_seg = &fixed_seg_table[seg];
-		if (mtrr_seg->start >= addr && addr < mtrr_seg->end)
+		if (mtrr_seg->start <= addr && addr < mtrr_seg->end)
 			return seg;
 	}
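To see what the one-character change does, the two comparisons can be modelled in Python. The segment table below follows the architectural fixed-range MTRR layout (64 KiB granules below 0x80000, 16 KiB up to 0xC0000, 4 KiB up to 0x100000). That the kernel's fixed_seg_table has exactly these bounds is an assumption of this sketch, and returning None for "no match" is a stand-in rather than the kernel's actual return value.

```python
# Assumed segment bounds (start inclusive, end exclusive), based on the
# architectural fixed-range MTRR layout; not copied from the kernel source.
FIXED_SEGS = [
    (0x00000, 0x80000),   # 64 KiB granularity
    (0x80000, 0xC0000),   # 16 KiB granularity
    (0xC0000, 0x100000),  # 4 KiB granularity
]


def seg_before_patch(addr):
    """Lookup using the old comparison: start >= addr."""
    for seg, (start, end) in enumerate(FIXED_SEGS):
        if start >= addr and addr < end:
            return seg
    return None


def seg_after_patch(addr):
    """Lookup using the patched comparison: start <= addr."""
    for seg, (start, end) in enumerate(FIXED_SEGS):
        if start <= addr and addr < end:
            return seg
    return None


for addr in (0x1000, 0xA0000, 0xD0000):
    print(hex(addr), seg_before_patch(addr), seg_after_patch(addr))
```

With the old comparison an address only matches a segment that starts at or above it, so most addresses resolve to the wrong segment or to none at all. With the patched comparison each address lands in the segment that actually contains it.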
I have good news. It seems to work this time. The lack of MTRRs was my mistake. While experimenting I disabled the mtrr CPU feature in qemu for the Fedora guest and forgot to turn it back on. Re-enabling it didn't make things work, though; it was the latter patch you posted that did. Anyway, with mtrr enabled the guest MTRRs showed only one entry, 2GB long and uncacheable, just like the last time it worked with 4.1.x kernels. At first I applied the patches in both Comment 21 and 22. This made both the Fedora and Windows guests work fine, though I consistently encountered general protection faults and NX memory protection errors after shutting down or virsh destroy-ing guests. Because of this I was also unable to start or restart any guest after the first one. Then I rolled back and applied only one patch at a time (first the one for static u8 mtrr_disabled_type(void), then only the one for static int fixed_mtrr_addr_to_seg(u64 addr)). The conclusion was that applying only the latter makes both guests work fine. The memory protection faults also disappeared on Windows shutdown or destruction. Fedora sometimes still gives an Oops on regular guest shutdown. Apart from that, performance is quite good. BTW Fedora does have an inconvenient delay when starting the kernel, but nothing unbearable. It seems to be proportional to the guest memory size. I can barely notice it with 3GB, but it lasts about 15-20 seconds with 24GB. I think it's not so important. The kernel oops bothers me more. The Windows guest seems to work, though I didn't go through the full install process. It seems to be as speedy as last time with a 4.1 kernel. Just to be sure I recorded the traces and dmesg with the working solution. (Only the fixed_mtrr_addr_to_seg patch was applied, and of course the debug patch.) Files are at the usual location. https://drive.google.com/folderview?id=0B8ebX_WjVHnGNlN4eTEzU2xtMEk&usp=sharing Look for the Traces-this_works folder.
Ok, I'll try to get what's left to Linus as quick as possible.
I tried applying the patch from Comment 22 to kernel 4.3.3. When I tried to start my Windows 10 VM (with 12 GB RAM), it still cannot boot. It is certainly better than before -- it gets past the Tianocore/OVMF logo and into the Windows boot sequence (whereas before it used to freeze on the Tianocore logo before even trying to boot Windows), but it froze there for a while. After a long wait (at least a minute), the Windows spinning dots appeared. The dots animation kept going for a loooooong time. In 4.1, Windows boots almost instantly (VM image is on an SSD). I have waited around 10 minutes now, and Windows is still showing the spinning dots animation. I am going to kill it now and reboot back into 4.1. So yeah, the patch improves things, but does not fix the issue for me. I only applied the patch from Comment 22; should I have applied something else as well? Maybe I am not doing this correctly. Perhaps I am affected by a different variation of this bug than schefister? Should I try to apply the debug patches and submit traces as well?
To quickly summarize the solution: Attachment 197381 [details] should be applied (this fixes the actual cause of the bug, the maxphyaddr issue in the vcpu), and the patch in Comment 22 should be applied as well. This combination is what worked for me. The patch in Comment 21 did not improve things for me; rather the opposite. The debug patches don't modify anything, they just print some info. You may apply them if you wish to help developers track down the issue, but they won't make your problem go away.
If you have time to test comment 21 with MTRR disabled, that would be great. I'd then modify that patch to key on the MTRR CPUID bit. Apart from that, your summary matches what I was going to send to Linus, so thanks for the confirmation!
Thank you for the clarification, schefister! I had misunderstood the instructions. I only applied the patch in Comment 22 without the one in the Attachment. Now I applied both of them and can confirm that it indeed WORKS!! THANK YOU THANK YOU Paolo Bonzini for finally fixing this issue! Sorry for the false alarm in my previous comment. The patches work. It was entirely my fault.
FWIW, I just played some CS:GO in my Windows 10 virtual machine with hardware acceleration on my VFIO-passed-through NVIDIA GTX 980 graphics card. Worked great!
Just tried with all patches applied (including Comment 21) and mtrr disabled in guest CPU with the following qemu xml: <cpu mode='host-passthrough'> <feature policy='disable' name='mtrr'/> </cpu> Windows did not boot, logo shown but no white dots at all. A machine check error was logged. Fedora did start fast, despite having 24G guest memory. But graphics drawing/scrolling was slow (as expected with lack of mtrr). An oops happened this time too, when doing a regular shutdown of the guest. Traces and dmesg in the usual location: https://drive.google.com/folderview?id=0B8ebX_WjVHnGNlN4eTEzU2xtMEk&usp=sharing Look for Traces-with_all_patch_nomtrr folder.
I can also report success with these patches (against 4.4.0-rc6). Thanks all!
Merged in 4.4-rc7.
I have the same issue after upgrading to 4.4. I also tried applying the patches in comment 22 and the attachment (but not comment 21) to 4.4-rc6, and that didn't fix the issue either.
Nothing was logged in dmesg either.
Okay, turns out that my problem was OVMF-related. I ran a bisection of OVMF on mainline kernel 4.4. It seems that any OVMF version before commit 68f06742379437e412f9699cc3c82421f4684b67 ( https://github.com/tianocore/edk2/commit/68f06742379437e412f9699cc3c82421f4684b67 ) will peg the host's CPU usage, slow the system to a crawl and boot very slowly, if it boots at all. After a few minutes, the Tianocore logo will appear and the memory check progress bar will inch along at the bottom of the screen. It seems to do nothing for several minutes after that. I didn't bother to wait and see what happens after that. If I use Kernel 4.1.12, or if I disable PCI passthrough, the VM will boot without any trouble. I would have captured a trace, but I couldn't figure out how to do that with my VM configured with libvirt; I don't know what command line to use. For what it's worth, I'll attach my VM configuration XML. So I don't know what exactly was wrong with OVMF prior to the commit I mentioned that caused this to happen; that commit doesn't seem to have anything to do with MTRRs and such. But it is somehow related to PCI passthrough and a change that occurred between 4.1 and 4.4 (or perhaps 4.2, since that's where I first spotted a problem that has these symptoms). Anyway, thanks to everyone involved for helping to fix this problem.
Created attachment 201341 [details] VM configuration