Bug 219787
Summary: | Guest's applications crash with EXCEPTION_SINGLE_STEP (0x80000004) | | |
---|---|---|---|
Product: | Virtualization | Reporter: | rangemachine |
Component: | kvm | Assignee: | virtualization_kvm |
Status: | NEW | | |
Severity: | high | CC: | jonbetti, ravi.bangoria, seanjc, whanos |
Priority: | P3 | | |
Hardware: | AMD | | |
OS: | Linux | | |
Kernel Version: | 6.13 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Debugger attached to Steam.exe; CrystalDiskMark installation; bisection-log; bisection-config-culprit; 0001-KVM-x86-Snapshot-the-host-s-DEBUGCTL-in-common-x86.patch; 0002-KVM-SVM-Manually-zero-restore-DEBUGCTL-if-LBR-virtua.patch | | |
Created attachment 307666 [details]
CrystalDiskMark installation
Are you able to bisect to an exact commit? There are significant KVM changes in 6.13, but they're almost all related to memory management. I can't think of anything that would manifest as an unexpected single step #DB, especially not with any consistency.

And just to double check, the only difference in the setup is that the host kernel was upgraded from v6.12 => v6.13? E.g. there was no QEMU update or guest-side changes?

I have been able to reproduce this bug too on Linux 6.13.3 - Specifically whilst attempting to download/install any game via Steam in a GPU passthrough enabled Windows KVM guest.

Downgrading to Linux 6.12.9, with no other changes made, immediately resolves the issue for me.

(In reply to Sean Christopherson from comment #2)
> Are you able to bisect to an exact commit? There are significant KVM
> changes in 6.13, but they're almost all related to memory management. I
> can't think of anything that would manifest as an unexpected single step
> #DB, especially not with any consistency.
>
> And just to double check, the only difference in the setup is that the host
> kernel was upgraded from v6.12 => v6.13? E.g. there was no QEMU update or
> guest-side changes?

I was not able to bisect yet, sorry.

And yes, I double checked, the only change is the kernel upgrade from v6.12.10 to v6.13.2 (did not check v6.13.3 yet, but the rc version had the same behaviour).

On Thu, Feb 20, 2025, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219787
>
> whanos@sergal.fun changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |whanos@sergal.fun
>
> --- Comment #3 from whanos@sergal.fun ---
> I have been able to reproduce this bug too on Linux 6.13.3 - Specifically
> whilst attempting to download/install any game via Steam in a GPU passthrough
> enabled Windows KVM guest.

Are you also running an AMD system?
On Thu, Feb 20, 2025, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219787
>
> --- Comment #4 from rangemachine@gmail.com ---
> (In reply to Sean Christopherson from comment #2)
> > Are you able to bisect to an exact commit? There are significant KVM
> > changes in 6.13, but they're almost all related to memory management. I
> > can't think of anything that would manifest as an unexpected single step
> > #DB, especially not with any consistency.
> >
> > And just to double check, the only difference in the setup is that the host
> > kernel was upgraded from v6.12 => v6.13? E.g. there was no QEMU update or
> > guest-side changes?
>
> I was not able to bisect yet, sorry.

No need to be sorry, you didn't introduce the bug :-)

> And yes, I double checked, the only change is the kernel upgrade from v6.12.10
> to v6.13.2 (did not check v6.13.3 yet, but the rc version had the same behaviour).

Please let me know if you'll be able to bisect (or not). Unless I have a random epiphany, this will likely require bisection.

(In reply to Sean Christopherson from comment #5)
> On Thu, Feb 20, 2025, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=219787
> >
> > whanos@sergal.fun changed:
> >
> >            What    |Removed                     |Added
> > ----------------------------------------------------------------------------
> >                  CC|                            |whanos@sergal.fun
> >
> > --- Comment #3 from whanos@sergal.fun ---
> > I have been able to reproduce this bug too on Linux 6.13.3 - Specifically
> > whilst attempting to download/install any game via Steam in a GPU passthrough
> > enabled Windows KVM guest.
>
> Are you also running an AMD system?

Yep. I am running a 9800X3D in an X670E chipset motherboard.

I honestly wonder if this bug only affects people using a 9800X3D.

(In reply to Sean Christopherson from comment #6)
> Please let me know if you'll be able to bisect (or not). Unless I have a random
> epiphany, this will likely require bisection.

Yes, I will try to bisect it tomorrow.

Created attachment 307690 [details]
bisection-log
Created attachment 307691 [details]
bisection-config-culprit
Here we go: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=408eb7417a92c5354c7be34f7425b305dfe30ad9

Double-checked: both reverting the commit and unsetting X86_BUS_LOCK_DETECT fix the problem. Added bisection log and config to attachments.

Thanks for the bug report. This is what is probably happening:

BusLockTrap is controlled through DEBUGCTL MSR and currently DEBUGCTL MSR is saved/restored on guest entry/exit only if LBRV is enabled. So, if BusLockTrap is enabled on the host, it will remain enabled even after guest entry and thus, if some process inside the guest causes a BusLock, KVM will inject #DB from host to the guest.

I had a KVM patch[1] but couldn't get back to work on it. Let me try to spend some time and respin it.

[1] https://lore.kernel.org/all/20240808062937.1149-5-ravi.bangoria@amd.com

Created attachment 307694 [details]
0001-KVM-x86-Snapshot-the-host-s-DEBUGCTL-in-common-x86.patch

On Fri, Feb 21, 2025, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219787
>
> Ravi Bangoria (ravi.bangoria@amd.com) changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |ravi.bangoria@amd.com
>
> --- Comment #12 from Ravi Bangoria (ravi.bangoria@amd.com) ---
> Thanks for the bug report. This is what is probably happening:
>
> BusLockTrap is controlled through DEBUGCTL MSR and currently DEBUGCTL MSR is
> saved/restored on guest entry/exit only if LBRV is enabled. So, if BusLockTrap
> is enabled on the host, it will remain enabled even after guest entry and thus,
> if some process inside the guest causes a BusLock, KVM will inject #DB from
> host to the guest.

*sigh* Bluntly, that's horrific architecture. Why on earth isn't debugctl automatically context switched when BusLockTrap is supported?

And does AMD do _any_ testing? This doesn't even require a full reproducer, e.g. the existing debug KVM-Unit-Test fails on my system (Turin) without ever generating a split/bus lock. AFAICT, the CPU is reporting bus locks in DR6 on #DBs that are most definitely not due to bus locks.

> I had a KVM patch[1] but couldn't get back to work on it. Let me try to
> spend some time and respin it.
>
> [1] https://lore.kernel.org/all/20240808062937.1149-5-ravi.bangoria@amd.com

Virtualizing BusLockTrap won't do a damn thing. If the guest isn't using LBRs or BusLockTrap, then KVM won't enable LBR virtualization and so will run the guest with the host's DEBUGCTL.

Furthermore, running with the host's DEBUGCTL is a bug irrespective of BusLockTrap. It just happens to be fatal with BusLockTrap, but running with BTF=1 and whatever other bits may be enabled in the host most definitely isn't correct.

Bug reporters, can you test the attached patches? I have a reproducer in the form of a KVM test, but I haven't actually tested a Windows guest. Assuming squashing DEBUGCTL remedies the issue, I'll post patches after I've done a bit more testing.

Created attachment 307695 [details]
0002-KVM-SVM-Manually-zero-restore-DEBUGCTL-if-LBR-virtua.patch
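
The leak described in the two comments above can be illustrated with a toy model. This is only a sketch under stated assumptions, not KVM code: the function names are hypothetical, the DEBUGCTL bit positions are assumed for illustration, and the "fixed" variant merely mirrors what the attached patch titles suggest (zeroing DEBUGCTL around guest entry when LBR virtualization is off).

```python
# Toy model only. Names and bit positions are illustrative assumptions,
# not KVM's actual code or a hardware reference.
DEBUGCTL_LBR = 1 << 0            # last-branch recording
DEBUGCTL_BTF = 1 << 1            # single-step on branches
DEBUGCTL_BUS_LOCK_TRAP = 1 << 2  # BusLockTrap enable (assumed bit position)


def debugctl_seen_by_guest(host_debugctl, guest_debugctl, lbrv_enabled):
    """Return the DEBUGCTL value that is live while the guest runs.

    With LBR virtualization the guest's value is swapped in; without it
    nothing is switched, so the host's value stays in effect.
    """
    if lbrv_enabled:
        return guest_debugctl
    return host_debugctl


def debugctl_seen_by_guest_fixed(host_debugctl, guest_debugctl, lbrv_enabled):
    """Same model, but DEBUGCTL is zeroed manually when LBRV is off.

    The host value is assumed to be restored after guest exit (not modeled).
    """
    if lbrv_enabled:
        return guest_debugctl
    return 0


host = DEBUGCTL_BUS_LOCK_TRAP  # host has bus lock detection enabled
guest = 0                      # guest uses neither LBRs nor BusLockTrap

leaked = debugctl_seen_by_guest(host, guest, lbrv_enabled=False)
fixed = debugctl_seen_by_guest_fixed(host, guest, lbrv_enabled=False)

print(bool(leaked & DEBUGCTL_BUS_LOCK_TRAP))  # True: guest runs with the host's bit
print(bool(fixed & DEBUGCTL_BUS_LOCK_TRAP))   # False: bit no longer reaches the guest
```

In this model the same leak applies to any other host-enabled bit such as BTF, which matches the point that running with the host's DEBUGCTL is wrong irrespective of BusLockTrap.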
(In reply to Sean Christopherson from comment #13)
> Bug reporters, can you test the attached patches? I have a reproducer in the
> form of a KVM test, but I haven't actually tested a Windows guest. Assuming
> squashing DEBUGCTL remedies the issue, I'll post patches after I've done a bit
> more testing.

Tested, these 2 patches solve the issue.

(In reply to Sean Christopherson from comment #13)
> And does AMD do _any_ testing? This doesn't even require a full reproducer,
> e.g. the existing debug KVM-Unit-Test fails on my system (Turin) without ever
> generating a split/bus lock. AFAICT, the CPU is reporting bus locks in DR6 on
> #DBs that are most definitely not due to bus locks.

It seems the CPU is preserving the SW-written DR6[BusLockDetected] while generating the #DB when the CPL is 0 and DEBUGCTL[BusLockTrapEn] is set. Since most of the x86/debug.c KUT tests clear DR6[BusLockDetected] before executing the test, the bit remains cleared at exception time, which causes the tests to fail.

(In reply to whanos from comment #7)
> I honestly wonder if this bug only affects people using a 9800X3D.

I'm running a 9950X and had a repro, as another data point. (I had this issue for a few weeks but thought it was Steam until I started looking at crash dumps in Windows... and I thankfully stumbled onto this bug which restored my sanity :).)

(In reply to rangemachine from comment #15)
> (In reply to Sean Christopherson from comment #13)
> > Bug reporters, can you test the attached patches?
> Tested, these 2 patches solve the issue.

+1. Patched my kernel and the issue went away (again: 'twas Steam for me that threw the exception).

(In reply to Ravi Bangoria from comment #16)
> It seems the CPU is preserving the SW-written DR6[BusLockDetected] while
> generating the #DB when the CPL is 0 and DEBUGCTL[BusLockTrapEn] is set.

My bad, the behavior is the same for CPL 3 as well. Apparently, this is correct behavior as documented in the AMD Architecture Programmer's Manual.

I've posted a KUT patch to the KVM mailing list (more details in the patch). Please review. https://lore.kernel.org/r/20250224112601.6504-1-ravi.bangoria@amd.com
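
The DR6 behavior described in these comments can be sketched as a small model. It is hedged throughout: the function names are hypothetical, the bit positions are assumptions, and it assumes that a cleared DR6[BusLockDetected] bit is what reads as "bus lock detected", which is how the comments above read. It is not taken from the KUT patch.

```python
# Toy model of the DR6 behavior discussed above. Assumptions (not stated in
# the thread): the bit positions, and that a *cleared* bit 11 reads as
# "bus lock detected".
DR6_BS = 1 << 14        # single-step cause bit
DR6_BUS_LOCK = 1 << 11  # BusLockDetected, assumed active-low


def dr6_after_db(dr6_before, cause_bits, bus_lock_occurred):
    """Return DR6 as a #DB handler would see it in this model."""
    if bus_lock_occurred:
        bus_lock_bit = 0  # a real bus lock clears the bit
    else:
        # Otherwise the software-written value is left as it was.
        bus_lock_bit = dr6_before & DR6_BUS_LOCK
    return cause_bits | bus_lock_bit


def looks_like_bus_lock(dr6):
    """True when the (assumed active-low) bus lock bit is clear."""
    return (dr6 & DR6_BUS_LOCK) == 0


# A test that zeroes DR6 first sees a spurious "bus lock" on a plain single step.
print(looks_like_bus_lock(dr6_after_db(0, DR6_BS, False)))             # True
# Leaving the bit set beforehand avoids the false report.
print(looks_like_bus_lock(dr6_after_db(DR6_BUS_LOCK, DR6_BS, False)))  # False
```

Under these assumptions, a test that writes zero to DR6 before triggering a #DB would see the bus lock indication even though no bus lock occurred, which is consistent with the unit-test failures reported above.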
Created attachment 307665 [details]
Debugger attached to Steam.exe

Overview
========
The Linux 6.13 update introduced a problem with Windows guests' applications on AMD processors. Several applications crash with EXCEPTION_SINGLE_STEP (0x80000004). The list of confirmed software: CrystalDiskMark, Visual Studio Code, Steam, Looking Glass server, Windows Tweaker. It never happened prior to the 6.13 update. I also checked the 6.13.3rc and 6.14.1rc updates; the problem persists there too.

I did a quick check of the differences in KVM/SVM between 6.12 and 6.13 and did not find anything that could set the trap flag, so the problem could be somewhere deeper inside the kernel.

Steps to reproduce
==================
Run a VM with a Windows guest and launch any software from the list.

Hardware
========
CPU: AMD Ryzen 7 9800X3D (16) @ 5.27 GHz
MB: TUF GAMING X870-PLUS WIFI

Additional Information
======================
Steam crashes when downloading a game; Looking Glass crashes on the WinAPI QueryPerformanceCounter call.
Tested on Windows 11 22H2/23H2/24H2.