Bug 217799 - kvm: Speculative RAS Overflow mitigation breaks old Windows guest VMs
Status: RESOLVED CODE_FIX
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: All Linux
Importance: P3 normal
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-16 08:52 UTC by Roman Mamedov
Modified: 2023-08-23 22:18 UTC

See Also:
Kernel Version: 6.1.44
Subsystem:
Regression: No
Bisected commit-id:


Description Roman Mamedov 2023-08-16 08:52:59 UTC
Hello,

I have a virtual machine running the old Windows Server 2003. On kernels 6.1.44 and 6.1.45, the QEMU VNC window stays dark, never switching to any of the guest's video modes, and the VM process uses only ~64 MB of the assigned 2 GB of RAM, indefinitely. It is as if the VM were paused, halted, or stuck before even starting. The process can be killed successfully and then restarted (with the same result), so it is not deadlocked in the kernel or the like.

Kernel 6.1.43 works fine.

I have also tried downgrading CPU microcode from 20230808 to 20230719, but that did not help.

The CPU is AMD Ryzen 5900. I suspect some of the newly added mitigations may be the culprit?
Comment 1 Roman Mamedov 2023-08-16 09:04:11 UTC
Booting the kernel with "spec_rstack_overflow=off" solves the problem.
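
For anyone checking their own host, here is a minimal sketch of how the state of this mitigation could be inspected. It assumes the kernel exposes the usual sysfs vulnerability file for SRSO; the function names and the SRSO_SYSFS override are hypothetical helpers for illustration, not part of any tool.

```shell
#!/bin/sh
# Print the SRSO mitigation state reported by the running kernel.
# SRSO_SYSFS is a hypothetical override, only there so the path can be swapped.
srso_status() {
    f="${SRSO_SYSFS:-/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow}"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown (cannot read $f)"
    fi
}

# Print "yes" if spec_rstack_overflow=off is present on a kernel command line.
# With no argument, the running kernel's /proc/cmdline is checked.
srso_disabled_on_cmdline() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    case " $cmdline " in
        *" spec_rstack_overflow=off "*) echo yes ;;
        *) echo no ;;
    esac
}
```

Calling srso_status and then srso_disabled_on_cmdline after a reboot should show whether the parameter actually took effect.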
Comment 2 Bagas Sanjaya 2023-08-16 09:22:08 UTC
(In reply to Roman Mamedov from comment #0)

Can you bisect between v6.1.44 and v6.1.45 to find the specific
mitigation that causes this regression?
Comment 3 Roman Mamedov 2023-08-16 10:58:09 UTC
Hello,

Unfortunately I am not in a position to easily do bisects.
But as noted above, setting "spec_rstack_overflow=off" is enough to solve it.

Further info, trying with an XP x64 install ISO provided by Microsoft:
https://archive.org/details/windows-xp-professional-x64-edition

With "spec_rstack_overflow=off", it works fine. But in the default state of this new mitigation (which is "safe RET, no microcode" on my machine), the install ISO hangs at the "Setup is starting Windows" message. So if anyone wants to reproduce on their local machine, there is now a quick and legal way to do so.

My QEMU command-line:

kvm -cpu host -m 2048 -machine pc,mem-merge=on,accel=kvm -vnc [::]:24 -device ide-hd,drive=drive0,bus=ide.0 -drive if=none,id=drive0,cache=writeback,aio=threads,format=raw,discard=unmap,detect-zeroes=off,file=xp.img -rtc base=localtime -cdrom xp64ce.iso -boot d

I should add that when a VM is in this stuck state, the CPU load of the QEMU process is 0% (not 100%).

I am also not sure why the default mitigation state says "no microcode", as I use the microcode package from Debian that was updated on 2023-08-08.

# dmesg | grep microcode
[    0.401618] Speculative Return Stack Overflow: IBPB-extending microcode not applied!
[    0.401618] Speculative Return Stack Overflow: Mitigation: safe RET, no microcode
[    1.051941] microcode: CPU0: patch_level=0x0a201016
[    1.051947] microcode: CPU1: patch_level=0x0a201016
[    1.051953] microcode: CPU2: patch_level=0x0a201016
[    1.051960] microcode: CPU3: patch_level=0x0a201016
[    1.051967] microcode: CPU4: patch_level=0x0a201016
[    1.051973] microcode: CPU5: patch_level=0x0a201016
[    1.051981] microcode: CPU6: patch_level=0x0a201016
[    1.051989] microcode: CPU7: patch_level=0x0a201016
[    1.051996] microcode: CPU8: patch_level=0x0a201016
[    1.052003] microcode: CPU9: patch_level=0x0a201016
[    1.052010] microcode: CPU10: patch_level=0x0a201016
[    1.052018] microcode: CPU11: patch_level=0x0a201016
[    1.052024] microcode: CPU12: patch_level=0x0a201016
[    1.052030] microcode: CPU13: patch_level=0x0a201016
[    1.052036] microcode: CPU14: patch_level=0x0a201016
[    1.052041] microcode: CPU15: patch_level=0x0a201016
[    1.052046] microcode: CPU16: patch_level=0x0a201016
[    1.052052] microcode: CPU17: patch_level=0x0a201016
[    1.052058] microcode: CPU18: patch_level=0x0a201016
[    1.052064] microcode: CPU19: patch_level=0x0a201016
[    1.052070] microcode: CPU20: patch_level=0x0a201016
[    1.052076] microcode: CPU21: patch_level=0x0a201016
[    1.052082] microcode: CPU22: patch_level=0x0a201016
[    1.052088] microcode: CPU23: patch_level=0x0a201016
[    1.052092] microcode: Microcode Update Driver: v2.2.
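
As a side note, the per-CPU lines above can be condensed to check that every core reports the same patch level. This is only a sketch: summarize_patch_levels is a hypothetical helper name, and it assumes the dmesg line format shown in this comment. It does not tell you whether a given patch level contains the IBPB-extending microcode.

```shell
#!/bin/sh
# Read dmesg-style text on stdin and print each distinct microcode
# patch level followed by the number of CPUs reporting it.
summarize_patch_levels() {
    sed -n 's/.*microcode: CPU[0-9]*: patch_level=\(0x[0-9a-fA-F]*\).*/\1/p' |
        sort |
        uniq -c |
        awk '{ print $2, $1 }'
}
```

Usage would be "dmesg | summarize_patch_levels". More than one output line would mean the cores are not all running the same microcode revision.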
Comment 4 Roman Mamedov 2023-08-16 11:17:43 UTC
Borislav, as you are the author of the patch adding the Speculative RAS Overflow mitigation, could you maybe take a look at what could be wrong here? Thanks

Windows XP-era 64-bit guest VMs in KVM no longer work with it enabled.

Windows 7 (and likely newer) does work.
Comment 5 Sean Christopherson 2023-08-16 13:50:54 UTC
As pointed out by Vitaly, this is probably the guest RFLAGS corruption bug[*], especially since it's XP specific (more likely to trigger emulation).  The fix should make its way to Linus' tree this week, and hopefully to stable kernels shortly thereafter.  Though if you can manually apply and test the fix before then, that would be very helpful.

[*] https://lore.kernel.org/all/20230811155255.250835-1-seanjc@google.com
Comment 6 Roman Mamedov 2023-08-16 17:23:26 UTC
Indeed, this patch appears to fix it. I built 6.1.46 with it added, and the
issue is no longer present. Thanks!
