Bug 219787 - Guest's applications crash with EXCEPTION_SINGLE_STEP (0x80000004)
Summary: Guest's applications crash with EXCEPTION_SINGLE_STEP (0x80000004)
Status: NEW
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: AMD Linux
Importance: P3 high
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-02-16 09:53 UTC by rangemachine
Modified: 2025-02-24 11:36 UTC
CC List: 4 users

See Also:
Kernel Version: 6.13
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Debugger attached to Steam.exe (92.90 KB, image/png)
2025-02-16 09:53 UTC, rangemachine
CrystalDiskMark installation (81.60 KB, image/jpeg)
2025-02-16 10:01 UTC, rangemachine
bisection-log (2.97 KB, text/plain)
2025-02-21 01:31 UTC, rangemachine
bisection-config-culprit (190.89 KB, text/plain)
2025-02-21 01:31 UTC, rangemachine
0001-KVM-x86-Snapshot-the-host-s-DEBUGCTL-in-common-x86.patch (3.17 KB, text/x-diff)
2025-02-21 18:22 UTC, Sean Christopherson
0002-KVM-SVM-Manually-zero-restore-DEBUGCTL-if-LBR-virtua.patch (2.69 KB, text/x-diff)
2025-02-21 18:22 UTC, Sean Christopherson

Description rangemachine 2025-02-16 09:53:58 UTC
Created attachment 307665
Debugger attached to Steam.exe

Overview
========
The Linux 6.13 update introduced a problem with Windows guest applications on AMD processors. Several applications crash with EXCEPTION_SINGLE_STEP (0x80000004). Confirmed affected software: CrystalDiskMark, Visual Studio Code, Steam, Looking Glass server, Windows Tweaker.
It never happened prior to the 6.13 update. I also checked the 6.13.3rc and 6.14.1rc updates, and the problem persists there too. I did a quick check of the differences in KVM/SVM between 6.12 and 6.13 and did not find anything that could set the trap flag, so the problem could be somewhere deeper inside the kernel.

Steps to reproduce
==================
Run a VM with a Windows guest and launch any application from the list above.

Hardware
========
CPU: AMD Ryzen 7 9800X3D (16) @ 5.27 GHz
MB: TUF GAMING X870-PLUS WIFI

Additional Information
======================
Steam crashes when downloading a game; Looking Glass crashes on the WinAPI QueryPerformanceCounter call. Tested on Windows 11 22H2/23H2/24H2.
Comment 1 rangemachine 2025-02-16 10:01:56 UTC
Created attachment 307666
CrystalDiskMark installation
Comment 2 Sean Christopherson 2025-02-20 00:31:40 UTC
Are you able to bisect to an exact commit?  There are significant KVM changes in 6.13, but they're almost all related to memory management.  I can't think of anything that would manifest as an unexpected single step #DB, especially not with any consistency.

And just to double check, the only difference in the setup is that the host kernel was upgraded from v6.12 => v6.13?  E.g. there was no QEMU update or guest-side changes?
Comment 3 whanos 2025-02-20 02:57:31 UTC
I have also been able to reproduce this bug on Linux 6.13.3, specifically whilst attempting to download/install any game via Steam in a Windows KVM guest with GPU passthrough enabled.

Downgrading to Linux 6.12.9, with no other changes made, immediately resolves the issue for me.
Comment 4 rangemachine 2025-02-20 07:10:54 UTC
(In reply to Sean Christopherson from comment #2)
> Are you able to bisect to an exact commit?  There are significant KVM
> changes in 6.13, but they're almost all related to memory management.  I
> can't think of anything that would manifest as an unexpected single step
> #DB, especially not with any consistency.
> 
> And just to double check, the only difference in the setup is that the host
> kernel was upgraded from v6.12 => v6.13?  E.g. there was no QEMU update or
> guest-side changes?

I have not been able to bisect yet, sorry. And yes, I double-checked: the only change is the kernel upgrade from v6.12.10 to v6.13.2 (I did not check v6.13.3 yet, but the rc version had the same behaviour).
Comment 5 Sean Christopherson 2025-02-20 17:41:00 UTC
On Thu, Feb 20, 2025, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219787
> 
> whanos@sergal.fun changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |whanos@sergal.fun
> 
> --- Comment #3 from whanos@sergal.fun ---
> I have been able to reproduce this bug too on Linux 6.13.3 - Specifically
> whilst attempting to download/install any game via Steam in a GPU passthrough
> enabled Windows KVM guest.

Are you also running an AMD system?
Comment 6 Sean Christopherson 2025-02-20 17:43:44 UTC
On Thu, Feb 20, 2025, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219787
> 
> --- Comment #4 from rangemachine@gmail.com ---
> (In reply to Sean Christopherson from comment #2)
> > Are you able to bisect to an exact commit?  There are significant KVM
> > changes in 6.13, but they're almost all related to memory management.  I
> > can't think of anything that would manifest as an unexpected single step
> > #DB, especially not with any consistency.
> > 
> > And just to double check, the only difference in the setup is that the host
> > kernel was upgraded from v6.12 => v6.13?  E.g. there was no QEMU update or
> > guest-side changes?
> 
> I was not able to bisect yet, sorry.

No need to be sorry, you didn't introduce the bug :-)

> And yes, I double-checked: the only change is the kernel upgrade from v6.12.10
> to v6.13.2 (I did not check v6.13.3 yet, but the rc version had the same behaviour).

Please let me know if you'll be able to bisect (or not).  Unless I have a random
epiphany, this will likely require bisection.
Comment 7 whanos 2025-02-20 17:46:01 UTC
(In reply to Sean Christopherson from comment #5)
> On Thu, Feb 20, 2025, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=219787
> > 
> > whanos@sergal.fun changed:
> > 
> >            What    |Removed                     |Added
> >
> ----------------------------------------------------------------------------
> >                  CC|                            |whanos@sergal.fun
> > 
> > --- Comment #3 from whanos@sergal.fun ---
> > I have been able to reproduce this bug too on Linux 6.13.3 - Specifically
> > whilst attempting to download/install any game via Steam in a GPU
> passthrough
> > enabled Windows KVM guest.
> 
> Are you also running an AMD system?

Yep. I am running a 9800X3D in an X670E chipset motherboard. 
I honestly wonder if this bug only affects people using a 9800X3D.
Comment 8 rangemachine 2025-02-20 19:00:54 UTC
(In reply to Sean Christopherson from comment #6)
> Please let me know if you'll be able to bisect (or not).  Unless I have a
> random
> epiphany, this will likely require bisection.

Yes, I will try to bisect it tomorrow.
Comment 9 rangemachine 2025-02-21 01:31:08 UTC
Created attachment 307690
bisection-log
Comment 10 rangemachine 2025-02-21 01:31:49 UTC
Created attachment 307691
bisection-config-culprit
Comment 11 rangemachine 2025-02-21 01:32:22 UTC
Here we go:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=408eb7417a92c5354c7be34f7425b305dfe30ad9

Double-checked: both reverting the commit and unsetting X86_BUS_LOCK_DETECT fix the problem.

Added bisection log and config to attachments.
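
For anyone wanting to confirm whether their host kernel was built with the option named above, here is a minimal sketch of a config check. The helper name `kconfig_enabled` and the sample config strings are hypothetical and are not taken from the attached bisection config; only the `X86_BUS_LOCK_DETECT` symbol comes from this report.

```python
import re


def kconfig_enabled(config_text: str, symbol: str) -> bool:
    """Return True if CONFIG_<symbol> is set to y or m in kernel config text."""
    pattern = re.compile(rf"^CONFIG_{re.escape(symbol)}=[ym]$", re.MULTILINE)
    return pattern.search(config_text) is not None


# Hypothetical config excerpts, for illustration only.
CONFIG_WITH_BLD = "CONFIG_X86_64=y\nCONFIG_X86_BUS_LOCK_DETECT=y\n"
CONFIG_WITHOUT_BLD = "CONFIG_X86_64=y\n# CONFIG_X86_BUS_LOCK_DETECT is not set\n"

if __name__ == "__main__":
    for label, text in (("with", CONFIG_WITH_BLD), ("without", CONFIG_WITHOUT_BLD)):
        print(label, kconfig_enabled(text, "X86_BUS_LOCK_DETECT"))
```

On many distributions the running kernel's config can be read from a file such as `/boot/config-<release>` and passed to this function as text, but where the config lives depends on the distribution, so treat that path as an assumption.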
Comment 12 Ravi Bangoria 2025-02-21 10:48:04 UTC
Thanks for the bug report. This is what is probably happening:

BusLockTrap is controlled through the DEBUGCTL MSR, and currently the DEBUGCTL MSR is saved/restored on guest entry/exit only if LBRV (LBR virtualization) is enabled. So, if BusLockTrap is enabled on the host, it remains enabled after guest entry, and thus, if some process inside the guest causes a bus lock, KVM will inject a #DB from the host into the guest.
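
The mechanism described in this paragraph can be illustrated with a toy model. This is only a sketch, not KVM code: the function names are hypothetical, and the bit positions follow my reading of Linux's `DEBUGCTLMSR_*` definitions, so verify them against the kernel headers before relying on them.

```python
# Assumed DEBUGCTL bit positions (check against the kernel's msr-index.h).
DEBUGCTL_LBR = 1 << 0            # last-branch recording
DEBUGCTL_BTF = 1 << 1            # single-step on branches
DEBUGCTL_BUS_LOCK_TRAP = 1 << 2  # BusLockTrap enable


def effective_guest_debugctl(host: int, guest: int, lbrv_enabled: bool) -> int:
    """DEBUGCTL value in effect while the guest runs, per the description above.

    With LBRV enabled the guest's own value is swapped in on entry; otherwise
    nothing touches the MSR and the host's value stays live.
    """
    return guest if lbrv_enabled else host


def guest_bus_lock_raises_db(host: int, guest: int, lbrv_enabled: bool) -> bool:
    """True if a bus lock inside the guest would raise a #DB in this model."""
    live = effective_guest_debugctl(host, guest, lbrv_enabled)
    return bool(live & DEBUGCTL_BUS_LOCK_TRAP)


if __name__ == "__main__":
    host = DEBUGCTL_BUS_LOCK_TRAP  # host kernel enabled bus lock detection
    guest = 0                      # guest never asked for any DEBUGCTL feature
    print(guest_bus_lock_raises_db(host, guest, lbrv_enabled=False))
    print(guest_bus_lock_raises_db(host, guest, lbrv_enabled=True))
```

In this model a guest that never enabled anything in DEBUGCTL still takes a #DB on a bus lock whenever LBRV is off and the host has BusLockTrap set, which matches the symptom reported here.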

I had a KVM patch[1] but couldn't get back to work on it. Let me try to
spend some time and respin it.

[1] https://lore.kernel.org/all/20240808062937.1149-5-ravi.bangoria@amd.com
Comment 13 Sean Christopherson 2025-02-21 18:22:56 UTC
Created attachment 307694
0001-KVM-x86-Snapshot-the-host-s-DEBUGCTL-in-common-x86.patch

On Fri, Feb 21, 2025, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219787
> 
> Ravi Bangoria (ravi.bangoria@amd.com) changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |ravi.bangoria@amd.com
> 
> --- Comment #12 from Ravi Bangoria (ravi.bangoria@amd.com) ---
> Thanks for the bug report. This is what is probably happening:
> 
> BusLockTrap is controlled through DEBUGCTL MSR and currently DEBUGCTL MSR is
> saved/restored on guest entry/exit only if LBRV is enabled. So, if
> BusLockTrap
> is enabled on the host, it will remain enabled even after guest entry and
> thus,
> if some process inside the guest causes a BusLock, KVM will inject #DB from
> host to the guest.

*sigh*

Bluntly, that's horrific architecture.  Why on earth isn't debugctl automatically
context switched when BusLockTrap is supported?

And does AMD do _any_ testing?  This doesn't even require a full reproducer,
e.g. the existing debug KVM-Unit-Test fails on my system (Turin) without ever
generating a split/bus lock.  AFAICT, the CPU is reporting bus locks in DR6 on
#DBs that are most definitely not due to bus locks.

> I had a KVM patch[1] but couldn't get back to work on it. Let me try to
> spend some time and respin it.
> 
> [1] https://lore.kernel.org/all/20240808062937.1149-5-ravi.bangoria@amd.com

Virtualizing BusLockTrap won't do a damn thing.  If the guest isn't using LBRs
or BusLockTrap, then KVM won't enable LBR virtualization and so will run the
guest with the host's DEBUGCTL.

Furthermore, running with the host's DEBUGCTL is a bug irrespective of
BusLockTrap.  It just happens to be fatal with BusLockTrap, but running with
BTF=1 and whatever other bits may be enabled in the host most definitely isn't
correct.

Bug reporters, can you test the attached patches?  I have a reproducer in the
form of a KVM test, but I haven't actually tested a Windows guest.  Assuming
squashing DEBUGCTL remedies the issue, I'll post patches after I've done a bit
more testing.
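
To make the "squashing DEBUGCTL" idea concrete, here is a toy sketch of zeroing the register around guest execution and restoring it afterwards. It does not reproduce the attached patches, whose contents are not shown in this thread; `FakeCpu`, `guest_run`, and `debugctl_during_and_after` are hypothetical names, and the BusLockTrap bit position is an assumption.

```python
from contextlib import contextmanager
from typing import Iterator, Tuple

DEBUGCTL_BUS_LOCK_TRAP = 1 << 2  # assumed bit position, see the note above


class FakeCpu:
    """Stand-in for one CPU's DEBUGCTL MSR; purely illustrative."""

    def __init__(self, debugctl: int) -> None:
        self.debugctl = debugctl


@contextmanager
def guest_run(cpu: FakeCpu, lbrv_enabled: bool) -> Iterator[FakeCpu]:
    """Zero DEBUGCTL while the guest runs, then restore the host's value.

    When LBRV is enabled the hardware swap is assumed to handle the register,
    so this model leaves it alone (the swap itself is not modeled).
    """
    host_value = cpu.debugctl  # snapshot of the host's value
    squash = not lbrv_enabled and host_value != 0
    if squash:
        cpu.debugctl = 0
    try:
        yield cpu
    finally:
        if squash:
            cpu.debugctl = host_value


def debugctl_during_and_after(host_value: int, lbrv_enabled: bool) -> Tuple[int, int]:
    """Return (value seen while the guest runs, value after guest exit)."""
    cpu = FakeCpu(host_value)
    with guest_run(cpu, lbrv_enabled) as running:
        during = running.debugctl
    return during, cpu.debugctl


if __name__ == "__main__":
    print(debugctl_during_and_after(DEBUGCTL_BUS_LOCK_TRAP, lbrv_enabled=False))
```

The snapshot-then-restore shape means the host keeps its own bus lock detection after the guest exits, while the guest never runs with host-only bits such as BTF or BusLockTrap.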
Comment 14 Sean Christopherson 2025-02-21 18:22:57 UTC
Created attachment 307695
0002-KVM-SVM-Manually-zero-restore-DEBUGCTL-if-LBR-virtua.patch
Comment 15 rangemachine 2025-02-21 20:04:38 UTC
(In reply to Sean Christopherson from comment #13)
> Bug reporters, can you test the attached patches?  I have a reproducer in the
> form of a KVM test, but I haven't actually tested a Windows guest.  Assuming
> squashing DEBUGCTL remedies the issue, I'll post patches after I've done a
> bit
> more testing.

Tested; these 2 patches solve the issue.
Comment 16 Ravi Bangoria 2025-02-23 04:57:31 UTC
(In reply to Sean Christopherson from comment #13)

> And does AMD do _any_ testing?  This doesn't even require a full reproducer,
> e.g. the existing debug KVM-Unit-Test fails on my system (Turin) without ever
> generating a split/bus lock.  AFAICT, the CPU is reporting bus locks in DR6
> on
> #DBs that are most definitely not due to bus locks.

It seems the CPU preserves the software-written DR6[BusLockDetected] value when generating the #DB if the CPL is 0 and DEBUGCTL[BusLockTrapEn] is set.

Since most of the x86/debug.c KUT tests clear DR6[BusLockDetected] before executing the test, the bit remains cleared at exception time, which causes the tests to fail.
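
A small model may help show why a cleared bit makes the tests fail. It assumes DR6 bit 11 is active-low (0 means "bus lock detected") and bit 14 is the single-step flag, and it ignores the reserved DR6 bits entirely. The function names are hypothetical, and the hardware behavior is only what this comment reports, not something verified against the manual here.

```python
DR6_BS = 1 << 14   # single-step trap occurred
DR6_BLD = 1 << 11  # bus lock detected; assumed active-low (0 means "detected")


def dr6_after_single_step(sw_written_dr6: int) -> int:
    """DR6 as seen in the #DB handler after a single-step trap.

    Models the reported behavior: hardware sets BS but leaves the bus lock
    bit exactly as software last wrote it.
    """
    return sw_written_dr6 | DR6_BS


def looks_like_bus_lock(dr6: int) -> bool:
    """True if DR6 appears to report a bus lock (active-low bit is clear)."""
    return (dr6 & DR6_BLD) == 0


if __name__ == "__main__":
    # A test that zeroes DR6 first sees an apparent bus lock on a plain single step.
    print(looks_like_bus_lock(dr6_after_single_step(0)))
    # Writing the bit back to 1 beforehand avoids the false report.
    print(looks_like_bus_lock(dr6_after_single_step(DR6_BLD)))
```

Under these assumptions, a test that expects only the single-step flag after a plain single step would fail whenever it cleared the bus lock bit beforehand, since nothing sets it back to 1.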
Comment 17 Jon Betti 2025-02-24 11:33:43 UTC
(In reply to whanos from comment #7)
> I honestly wonder if this bug only affects people using a 9800X3D.

I'm running a 9950X and had repro as another data point. (I had this issue for a few weeks but thought it was Steam until I started looking at crash dumps in Windows... and I thankfully stumbled onto this bug which restored my sanity :).)

(In reply to rangemachine from comment #15)
> (In reply to Sean Christopherson from comment #13)
> > Bug reporters, can you test the attached patches?
> Tested; these 2 patches solve the issue.

+1. Patched my kernel and the issue went away (again: 'twas Steam for me that threw the exception).
Comment 18 Ravi Bangoria 2025-02-24 11:36:03 UTC
(In reply to Ravi Bangoria from comment #16)
> It seems, the CPU is preserving SW written DR6[BusLockDetected] while
> generating the #DB when the CPL is 0 and DEBUGCTL[BusLockTrapEn] is set.

My bad, the behavior is the same for CPL 3 as well. Apparently, this is correct behavior as documented in the AMD Architecture Programmer's Manual. I've posted a KUT patch to the KVM mailing list (more details in the patch). Please review.

https://lore.kernel.org/r/20250224112601.6504-1-ravi.bangoria@amd.com
