Bug 197951

Summary: QEMU/KVM & VFIO & PCI passthru with Windows 10 x64 guest: memory access intermittently causes CRITICAL_STRUCTURE_CORRUPTION BSOD unless swap is disabled on host, since 4.12.13
Product: Virtualization Reporter: Jimi (JimiJames.Bove)
Component: kvm Assignee: virtualization_kvm
Status: NEW ---    
Severity: high CC: alex.williamson, f.gruenbichler, JimiJames.Bove, lprosek, tyler, xjtuwjp
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.12.13 Subsystem:
Regression: No Bisected commit-id:

Description Jimi 2017-11-21 21:19:09 UTC
Originally reported here: https://bugs.launchpad.net/qemu/+bug/1728256

The title explains it all. When running a Windows 10 x64 guest with a PCIe device passed through via VFIO on host kernel 4.12.13 and up, the guest randomly crashes with the CRITICAL_STRUCTURE_CORRUPTION blue screen of death during memory access. So far only people passing through a GPU have confirmed this, so it may happen only when at least one passed-through device is a GPU. You have to be doing something memory-intensive, like mining or gaming, for it to happen often enough to really notice; under that kind of load it happens anywhere from once an hour to once every few hours.

I downgraded to 4.12.12 and the issue went away. Others disabled swap instead of downgrading, and the issue went away. Disabling swap isn't an option for me, so I can't upgrade the kernel until this is fixed.
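A minimal sketch for checking whether a host currently has swap enabled, which is the condition the workaround above removes. It assumes the Linux /proc/swaps layout (one header line, then one line per active swap area); the function name count_swap_areas is made up for illustration.

```shell
#!/bin/sh
# Count active swap areas by reading /proc/swaps (or a file in the same format).
# Assumption: the first line is a column header and every following non-empty
# line describes one active swap area.
count_swap_areas() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "${1:-/proc/swaps}"
}

if [ -r /proc/swaps ]; then
    echo "active swap areas: $(count_swap_areas)"
fi
# The workaround others used is to turn swap off as root, e.g. `swapoff -a`.
```

A count of 0 means the host is already running without swap, so the workaround is in effect.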
Comment 1 Jimi 2017-11-21 21:23:09 UTC
Also, I use an AMD GPU and another user reported using NVidia, so this seems to not be vendor-specific, if it even is GPU-specific.
Comment 2 Alex Williamson 2017-11-21 21:32:25 UTC
Seems like a simple matter of bisecting between 4.12.12 and 4.12.13 then; it's a very short list:

$ git log --oneline v4.12.12..v4.12.13
5d7d2e03e0f0 Linux 4.12.13
9f7df0bca168 xfs: XFS_IS_REALTIME_INODE() should be false if no rt device presen
da0f4931ec52 NFSv4: Fix up mirror allocation
3307d5f5099c NFS: Sync the correct byte range during synchronous writes
6f50e3a1b8c3 NFS: Fix 2 use after free issues in the I/O code
7714f302294d ARM: 8692/1: mm: abort uaccess retries upon fatal signal
b9a489e1d4a3 ARM64: dts: marvell: armada-37xx: Fix GIC maintenance interrupt
8329b5e8c6cf Bluetooth: Properly check L2CAP config option output buffer length
99dc1296b47c rt2800: fix TX_PIN_CFG setting for non MT7620 chips
2bce0fe7d0cd KVM: SVM: Limit PFERR_NESTED_GUEST_PAGE error_code check to L1 gues
9d6412aa06ce ALSA: msnd: Optimize / harden DSP and MIDI loops
846073130799 mm/memory.c: fix mem_cgroup_oom_disable() call missing
46791eb9f13e mm/swapfile.c: fix swapon frontswap_map memory leak on error
637f25e5ba94 mm: kvfree the swap cluster info if the swap file is unsatisfactory
58989dc3af0d selftests/x86/fsgsbase: Test selectors 1, 2, and 3
9ed3dc1c0431 radix-tree: must check __radix_tree_preload() return value
0af760ab3882 rtlwifi: btcoexist: Fix breakage of ant_sel for rtl8723be
8004198bb025 btrfs: resume qgroup rescan on rw remount
9a5537a76b62 nvme-fabrics: generate spec-compliant UUID NQNs
02c54b35cad8 mtd: nand: qcom: fix config error for BCH
f2339a072e47 mtd: nand: qcom: fix read failure without complete bootchain
71515c37777d mtd: nand: mxc: Fix mxc_v1 ooblayout
c54a31845019 mtd: nand: hynix: add support for 20nm NAND chips
2b8b46b24217 mtd: nand: make Samsung SLC NAND usable again

Let us know the results.
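For anyone following along, the bisect could be started roughly like this. This is a sketch, not a tested procedure: the bisect_start helper and the RUN variable are invented here, and it assumes a linux-stable checkout that has both tags.

```shell
#!/bin/sh
# Sketch of the bisect suggested above, to be run inside a linux-stable checkout.
# RUN defaults to "echo", so the commands are only printed; run with RUN= (empty)
# to execute them for real.
RUN=${RUN-echo}

bisect_start() {
    # $1 = last known good tag, $2 = first known bad tag
    $RUN git bisect start
    $RUN git bisect bad "$2"
    $RUN git bisect good "$1"
}

bisect_start v4.12.12 v4.12.13
# After building and booting each commit that git checks out, exercise the guest
# and then report the result with `git bisect good` or `git bisect bad`.
```

The dry-run default is deliberate: it lets you read the plan before touching the working tree.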
Comment 3 Jimi 2017-11-21 21:45:41 UTC
Sure, I'll start bisecting next time I get the chance (maybe tomorrow). It'll take a long time, though, since the BSOD might only happen once a day. I'll have to run the same commit for a few days before I'm confident that it isn't BSODing. Thank god for binary search.
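A back-of-the-envelope estimate of how long that takes, as a sketch. The commit count comes from the list in comment 2; the three days per build is a hypothetical stand-in for "a few days", and git's own step estimate can differ by one.

```shell
#!/bin/sh
# Rough upper bound on bisect steps: smallest k with 2^k >= number of commits.
bisect_steps() {
    n=$1 steps=0 span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

commits=24        # lines in the v4.12.12..v4.12.13 log above
days_per_step=3   # hypothetical soak time per build ("a few days")
steps=$(bisect_steps "$commits")
echo "about $steps builds, about $((steps * days_per_step)) days"
```

With 24 commits this gives about 5 builds, so roughly two weeks at the assumed soak time, compared with 24 builds when testing every commit in turn.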
Comment 4 Fabian Grünbichler 2017-11-30 09:41:06 UTC
Did you have a chance to bisect yet? We are experiencing a similar issue with 4.13 and 4.14 based kernels, and our test case and bisect point to a series not contained in 4.12.13:

https://forum.proxmox.com/threads/blue-screen-with-5-1.37664/
https://lkml.kernel.org/r/<20171130093320.66cxaoj45g2ttzoh@nora.maurer-it.com>
Comment 5 Jimi 2017-11-30 17:45:54 UTC
I've been doing it. Currently on "[3307d5f5099c186d1ae43205eb23c29fabc6f5b8] NFS: Sync the correct byte range during synchronous writes" with 2 commits left to test after it. They've all been good commits so far.
Comment 6 Ladi Prosek 2017-12-04 08:49:54 UTC
I have seen this crash on a Windows 10 x64 guest *without* any kind of device assignment. Didn't keep track of exact kernel versions but it was Fedora 26, very likely 4.12.*.

If you've been able to build a kernel where this happens for you, try cherry-picking:

commit a2b7861bb33b2538420bb5d8554153484d3f961f                       
Author: Boqun Feng <boqun.feng@gmail.com>                             
Date:   Tue Oct 3 21:36:51 2017 +0800                                 

    kvm/x86: Avoid async PF preempting the kernel incorrectly         
                                   
    Currently, in PREEMPT_COUNT=n kernel, kvm_async_pf_task_wait() could call
    schedule() to reschedule in some cases.  This could result in     
    accidentally ending the current RCU read-side critical section early,    
    causing random memory corruption in the guest, or otherwise preempting   
    the currently running task inside between preempt_disable and     
    preempt_enable.                


Keywords: "PF" (since the report mentions swap), "random memory corruption in the guest"
Comment 7 Ladi Prosek 2017-12-04 09:49:42 UTC
Correction: It looks like a2b7861bb33b2538420bb5d8554153484d3f961f is more of a guest-side fix with no effect on non-Linux guests. Please ignore it.
Comment 8 Jack Wang 2017-12-04 11:58:43 UTC
We've seen Windows 10 BSOD with CRITICAL_STRUCTURE_CORRUPTION shortly after migration. Host kernel version is 4.4.50, no device passthrough, no swap in our case.
Comment 9 Ladi Prosek 2017-12-04 12:32:51 UTC
(In reply to Jack Wang from comment #8)
> We've seen Windows 10 BSOD with CRITICAL_STRUCTURE_CORRUPTION shortly after
> migration. Host kernel version is 4.4.50, no device passthrough, no swap in
> our case.

What kernel version was running on the migration source? Thanks!
Comment 10 Jack Wang 2017-12-04 13:42:56 UTC
The source server was running kernel 3.12.45.
Comment 11 Jimi 2017-12-04 19:28:15 UTC
I'm about to spend a few days with it installed to make sure, but it looks like this commit is probably our culprit:

$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[9f7df0bca168528aba20794f400be134495551b8] xfs: XFS_IS_REALTIME_INODE() should be false if no rt device present

It looks like there's some evidence that this issue doesn't *only* come from 4.12.13. I want to reiterate, I was on 4.12.13 when this problem started happening to me, and I haven't had a single BSOD since downgrading to 4.12.12, including during this entire bisect. It was happening frequently enough that if 4.12.13 wasn't at least one of the culprits, I definitely would've had a few BSODs by now.
Comment 12 Ladi Prosek 2017-12-05 05:52:24 UTC
(In reply to Jimi from comment #11)
> I'm about to spend a few days with it installed to make sure, but it looks
> like this commit is probably our culprit:
> 
> $ git bisect good
> Bisecting: 0 revisions left to test after this (roughly 1 step)
> [9f7df0bca168528aba20794f400be134495551b8] xfs: XFS_IS_REALTIME_INODE()
> should be false if no rt device present

A few things hint at this being a red herring.

* It's the first commit before the 4.12.13 tag which means that you marked 4.12.13 as bad and everything else as good.

* There's nothing in it that would explain why it affects only virt and only Windows guests.
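The first point can be checked against the commit list quoted earlier in this report. The sketch below uses a hypothetical helper, commits_above, which counts how many commits sit above a given one in `git log --oneline` output (newest first).

```shell
#!/bin/sh
# Print how many commits sit above a given commit in `git log --oneline` output
# (newest first) read from stdin. The id may be abbreviated or full.
commits_above() {
    awk -v id="$1" '
        NF > 0 && (index($1, id) == 1 || index(id, $1) == 1) { print NR - 1; found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# Top of the v4.12.12..v4.12.13 list quoted earlier in this report.
commits_above 9f7df0bca168 <<'EOF'
5d7d2e03e0f0 Linux 4.12.13
9f7df0bca168 xfs: XFS_IS_REALTIME_INODE() should be false if no rt device presen
da0f4931ec52 NFSv4: Fix up mirror allocation
EOF
```

Only the release commit sits above the xfs change, so a bisect can only land there when every commit below it was marked good.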

> It looks like there's some evidence that this issue doesn't *only* come from
> 4.12.13. I want to reiterate, I was on 4.12.13 when this problem started
> happening to me, and I haven't had a single BSOD since downgrading to
> 4.12.12, including during this entire bisect. It was happening frequently
> enough that if 4.12.13 wasn't at least one of the culprits, I definitely
> would've had a few BSODs by now.

The bug is likely timing-sensitive, and merely rebuilding the kernel from the same sources may make it more (or less) prone to the bug, depending on how the binary is laid out, the exact compiler used, etc.

Also, we should not rule out the possibility that the problem has existed for a long time and Windows 10 got the ability to detect certain corruptions recently via a Windows Update patch.

I hit it again yesterday and the BSOD analyzes to:

CRITICAL_STRUCTURE_CORRUPTION (109)
This bugcheck is generated when the kernel detects that critical kernel code or
data have been corrupted. There are generally three causes for a corruption:
1) A driver has inadvertently or deliberately modified critical kernel code
 or data. See http://www.microsoft.com/whdc/driver/kernel/64bitPatching.mspx
2) A developer attempted to set a normal kernel breakpoint using a kernel
 debugger that was not attached when the system was booted. Normal breakpoints,
 "bp", can only be set if the debugger is attached at boot time. Hardware
 breakpoints, "ba", can be set at any time.
3) A hardware corruption occurred, e.g. failing RAM holding kernel code or data.
Arguments:
Arg1: a3a0206143b9d5b3, Reserved
Arg2: b3b72ce7963bad06, Reserved
Arg3: 0000032000000000, Failure type dependent information
Arg4: 0000000000000017, Type of corrupted region, can be
[...]
	16  : Critical floating point control register modification
	17  : Local APIC modification
	18  : Kernel notification callout modification
[...]


I'm pretty sure that last time I got it the type of corrupted region was 17 as well.
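For comparing dumps, Arg4 can be looked up against the excerpt above with a small helper. This is a sketch: region_type is a made-up name, it knows only the three entries quoted here (the full debugger list is longer), and it assumes the list keys are matched textually against Arg4 with leading zeros stripped.

```shell
#!/bin/sh
# Map Arg4 of the CRITICAL_STRUCTURE_CORRUPTION bugcheck to the region types
# quoted above. Only the three entries shown in the excerpt are covered.
region_type() {
    # Strip leading zeros and normalise case so 0000000000000017 becomes 17.
    code=$(printf '%s' "$1" | sed 's/^0*//' | tr 'A-F' 'a-f')
    case ${code:-0} in
        16) echo "Critical floating point control register modification" ;;
        17) echo "Local APIC modification" ;;
        18) echo "Kernel notification callout modification" ;;
        *)  echo "type ${code:-0} is not in the quoted excerpt" ;;
    esac
}

region_type 0000000000000017
```

Anyone who hits the BSOD can paste their own Arg4 into the call and compare the region type with the one reported here.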
Comment 13 Fabian Grünbichler 2017-12-05 08:15:36 UTC
(In reply to Fabian Grünbichler from comment #4)
> Did you have a chance to bisect yet? We are experiencing a similar issue
> with 4.13 and 4.14 based kernels, and our test case and bisect point to a
> series not contained in 4.12.13:
> 
> https://lkml.kernel.org/r/<20171130093320.66cxaoj45g2ttzoh@nora.maurer-it.com>

FWIW, the 4.13 and 4.14 issue was caused by the linked series, and a subsequent patch[1] solved it completely for us.

1: https://lkml.kernel.org/r/<20171130180546.4331-1-rkrcmar@redhat.com>
Comment 14 Jack Wang 2017-12-05 09:22:10 UTC
In my case it's region 7 (Critical MSR modification). I guess some underlying MSR state just changed during the migration from the 3.12 to the 4.4 kernel.
Comment 15 Ladi Prosek 2017-12-05 10:04:30 UTC
(In reply to Fabian Grünbichler from comment #13)
> FWIW, the 4.13 and 4.14 issue was caused by the linked series, and a
> subsequent patch[1] solved it completely for us.
> 
> 1: https://lkml.kernel.org/r/<20171130180546.4331-1-rkrcmar@redhat.com>

Thanks! I have added this patch to my kernel (4.13.16 based, built locally, reproduces the BSOD). Will report back in a few days.
Comment 16 Ladi Prosek 2017-12-14 07:57:28 UTC
(In reply to Ladi Prosek from comment #15)
> (In reply to Fabian Grünbichler from comment #13)
> > FWIW, the 4.13 and 4.14 issue was caused by the linked series, and a
> > subsequent patch[1] solved it completely for us.
> > 
> > 1: https://lkml.kernel.org/r/<20171130180546.4331-1-rkrcmar@redhat.com>
> 
> Thanks! I have added this patch to my kernel (4.13.16 based, built locally,
> reproduces the BSOD). Will report back in a few days.

No crashes so far. The fix is in Linus's tree:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b1394e745b9453dcb5b0671c205b770e87dedb87