Bug 210695 - error: kvm run failed Invalid argument
Summary: error: kvm run failed Invalid argument
Status: RESOLVED CODE_FIX
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-12-15 00:11 UTC by Richard Herbert
Modified: 2020-12-18 03:10 UTC (History)
1 user (show)

See Also:
Kernel Version: 5.10.1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Error outputs (5.39 KB, text/plain)
2020-12-15 00:11 UTC, Richard Herbert
Details
qemu syslog crash output after patch for 5.10.1 (4.27 KB, text/plain)
2020-12-16 23:08 UTC, Richard Herbert
Details

Description Richard Herbert 2020-12-15 00:11:55 UTC
Created attachment 294133 [details]
Error outputs

Hi.

Starting with kernel 5.10-rc1, I've been unable to start Qemu VM's.  The only 5.10-rc that worked was 5.10-rc4 (not sure about -rc5).  I've tried with both qemu 4.1.0 and the latest 5.2.0.  The VM runs fine with kernel 5.9.13.

uname -a:

Linux starbug.dom 5.10.1 #1 SMP Mon Dec 14 18:24:40 EST 2020 x86_64 Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz GenuineIntel GNU/Linux

Motherboard: Intel DG41RQ

Thanks.
Comment 1 Sean Christopherson 2020-12-16 02:03:28 UTC
Hrm, this is more than likely related to refactoring for the introduction of the TDP MMU.

> kernel get_mmio_spte: detect reserved bits on spte, addr 0xfe013000, dump
> hierarchy:
> kernel        ------ spte 0x80000b0e level 3.

This SPTE looks odd/wrong, but unless I'm misreading things, there are no reserved bits set.

> kernel        ------ spte 0x49c0027 level 2.
> kernel        ------ spte 0x38001ffe0134d7 level 1.

And this SPTE obviously has reserved bits set, but that's intentional and this should not trigger the warning as the SPTE should be recognized as an MMIO SPTE.

I'll stare at this more tomorrow and play with things a bit to see if I can figure out what's broken.
Comment 2 Richard Herbert 2020-12-16 02:28:25 UTC
Hi, Sean.  I'm no programmer, unfortunately, but when kernel 5.10-rc4 was released, this was the patch that I suspected had fixed the problem.  I don't find this code is 5.10.1, so it must have been reverted or moved.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 27e381c9da6c2..ff28a5c6abd63 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -49,7 +49,14 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
 {
 	struct kvm_mmu_page *sp;
 
+	if (!kvm->arch.tdp_mmu_enabled)
+		return false;
+	if (WARN_ON(!VALID_PAGE(hpa)))
+		return false;
+
 	sp = to_shadow_page(hpa);
+	if (WARN_ON(!sp))
+		return false;
 
 	return sp->tdp_mmu_page && sp->root_count;
 }
Comment 3 Richard Herbert 2020-12-16 02:57:17 UTC
So sorry, Sean. I was looking in mmu.c and not tdp_mmu.c.  My very bad.  Sheesh.
Comment 4 Sean Christopherson 2020-12-16 20:49:49 UTC
Aha!  I haven't reproduce the bug (mostly because I'm pretty sure my guests aren't doing emulated MMIO accesses with paging disabled), but I'm pretty sure I know what's going on, and why -rc4 may have worked.

Your guest has paging disabled, in which case mmu->root_level will be '0' and mmu->shadow_root_level will be '3'.  If the shadow walk in get_walk() bails without ever entering the loop (due to an invalid PAE root), the returned leaf will be '0' because get_walk() uses mmu->root level instead of mmu->shadow_root level.  In get_mmio_spte(), this causes the check for reserved bits to check uninitialized/stale stack memory and return a bogus SPTE.

Pre rc6, both get_mmio_spte() and get_walk() used the bad mmu->root_level, which meant that the reserved bits check would get skipped in the above scenario.  But, get_mmio_spte() would still return a stale/bogus SPTE, so it's not at all surprising that things failed.  Actually, it's surprising that any 5.10-rc* work.  Best guess is that there is a mostly unrelated change that cause things to work by sheer dumb luck.

In rc6, the get_mmio_spte() half of the bug was fixed by commit 9a2a0d3ca163 ("kvm: x86/mmu: Fix get_mmio_spte() on CPUs supporting 5-level PT").  This cause get_mmio_spte() to resume the reserved bits check, which all but guaranteed an explosion, i.e. ensured a 100% failure rate on your end.

TL;DR: Can you try this patch?  I'll also try to reproduce the original bug on my end now that I have a smoking gun.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7a6ae9e90bd7..6880119840c1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3488,7 +3488,7 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes)
 {
        struct kvm_shadow_walk_iterator iterator;
-       int leaf = vcpu->arch.mmu->root_level;
+       int leaf = vcpu->arch.mmu->shadow_root_level;
        u64 spte;
Comment 5 Richard Herbert 2020-12-16 23:08:28 UTC
Created attachment 294167 [details]
qemu syslog crash output after patch for 5.10.1
Comment 6 Richard Herbert 2020-12-16 23:14:01 UTC
Hi, Sean.

Thanks for the patch.  I applied it and recompiled kernel 5.10.1. Qemu still crashes, but the output is slightly different.

Not sure if this has anything to do with it:

dmesg | grep -i mmu

[    0.301060] pci 0000:00:00.0: DMAR: Disabling IOMMU for graphics on this chipset

Thanks again.
Comment 7 Sean Christopherson 2020-12-17 19:54:15 UTC
Finally figured it out.  When KVM uses PAE shadow paging, the PDPTRs are effectively skipped when walking the shadow tables because a bad PDPTR will be detected much earlier and KVM essentially hardcodes the PDPTRs.  As a result, the corresponding SPTE isn't filled by get_walk() as it never "sees" the SPTE for the PDTPR.  The refactored get_mmio_spte() doesn't account for this and checks the PDPTR SPTE array value.  This reads uninitialized stack data, and in your case, this yields the garbage value '0x80000b0e' that causes things to explode (my system gets '0' most/all of the time).

I'll get a patch out later today, it's a bit of a mess....
Comment 8 Richard Herbert 2020-12-18 03:10:06 UTC
Get the so called "root" level from the low level shadow page table
walkers instead of manually attempting to calculate it higher up the
stack, e.g. in get_mmio_spte().  When KVM is using PAE shadow paging,
the starting level of the walk, from the callers perspective, is not
the CR3 root but rather the PDPTR "root".  Checking for reserved bits
from the CR3 root causes get_mmio_spte() to consume uninitialized stack
data due to indexing into sptes[] for a level that was not filled by
get_walk().  This can result in false positives and/or negatives
depending on what garbage happens to be on the stack.

Opportunistically nuke a few extra newlines.

Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU")

Signed-off-by: Sean Christopherson <seanjc@google.com>

Marking as RESOLVED, with Thanks.

Note You need to log in before you can comment on or make changes to this bug.