Bug 218707 - dosemu crashes on kernel 6.6.21 and newer on some older hardware
Summary: dosemu crashes on kernel 6.6.21 and newer on some older hardware
Status: NEW
Alias: None
Product: Process Management
Classification: Unclassified
Component: Preemption
Hardware: All Linux
Importance: P3 normal
Assignee: Robert Love
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-10 20:28 UTC by Robert Gill
Modified: 2024-04-26 23:54 UTC
CC List: 2 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lscpu output for host01 (dosemu runs fine on this system) (2.80 KB, text/plain)
2024-04-10 20:29 UTC, Robert Gill
lscpu output for host02 (1.93 KB, text/plain)
2024-04-10 20:29 UTC, Robert Gill
crash report for host02 (2.32 KB, text/plain)
2024-04-10 20:30 UTC, Robert Gill
lscpu output for host03 (2.16 KB, text/plain)
2024-04-10 20:30 UTC, Robert Gill
crash report for host03 (2.32 KB, text/plain)
2024-04-10 20:31 UTC, Robert Gill
crash report for host03 (kernel 6.8.4) (2.32 KB, text/plain)
2024-04-10 20:31 UTC, Robert Gill
crash report for host03 (CONFIG_RETHUNK disabled) (2.29 KB, text/plain)
2024-04-16 22:25 UTC, Robert Gill
Kernel 6.6.21 config file (81.44 KB, text/plain)
2024-04-16 22:26 UTC, Robert Gill
dosemu strace log (44.66 KB, text/plain)
2024-04-21 22:44 UTC, Robert Gill
Workaround patch (917 bytes, patch)
2024-04-23 19:30 UTC, Pawan Gupta

Description Robert Gill 2024-04-10 20:28:55 UTC
dosemu crashes on some older hardware when running kernel 6.6.21 or newer, but runs fine on other hardware.

The error is:
kernel: general protection fault: 0000 [#1] PREEMPT SMP PTI

I'm including attachments describing the hardware it runs on and the hardware it fails on, as well as crash reports from the hardware where it fails. host01 is the system on which it runs without problems; it is currently running kernel version 6.6.21.
Comment 1 Robert Gill 2024-04-10 20:29:25 UTC
Created attachment 306118 [details]
lscpu output for host01 (dosemu runs fine on this system)
Comment 2 Robert Gill 2024-04-10 20:29:51 UTC
Created attachment 306119 [details]
lscpu output for host02
Comment 3 Robert Gill 2024-04-10 20:30:14 UTC
Created attachment 306120 [details]
crash report for host02
Comment 4 Robert Gill 2024-04-10 20:30:36 UTC
Created attachment 306121 [details]
lscpu output for host03
Comment 5 Robert Gill 2024-04-10 20:31:06 UTC
Created attachment 306122 [details]
crash report for host03
Comment 6 Robert Gill 2024-04-10 20:31:27 UTC
Created attachment 306123 [details]
crash report for host03 (kernel 6.8.4)
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-11 17:16:18 UTC
When you say 6.6.21, which is the last kernel where this did not happen? 6.6.20? In that case: could you bisect?

I did a quick search on lore and wondered if this might be related:
https://lore.kernel.org/all/202403281553.79f5a16f-lkp@intel.com/

Are you compiling the kernel with clang as well?
Comment 8 Robert Gill 2024-04-11 23:14:27 UTC
I'm compiling on Gentoo using gcc-13.2.1_p20240210. I had to download the kernel sources for 6.6.20 from kernel.org because Gentoo masked kernel 6.6.13 and removed all releases between that and 6.6.21. 6.6.13 was the last stabilized kernel on Gentoo before a security issue was discovered that I believe was patched in 6.6.21.


Anyways, dosemu runs fine with kernel 6.6.20.
Comment 9 Robert Gill 2024-04-11 23:15:52 UTC
I should also mention that these machines are running with 32-bit kernels.
Comment 10 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-12 11:30:16 UTC
(In reply to Robert Gill from comment #9)
> I should also mention that these machines are running with 32-bit kernels.

Which do not get a lot of testing these days. :-/

(In reply to Robert Gill from comment #8)
> Anyways, dosemu runs fine with kernel 6.6.20.

Good. Then it would be ideal if you could bisect between 6.6.20 and 6.6.21 to find the culprit. https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
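
For readers unfamiliar with bisection, the search requested above can be sketched as a binary search over an ordered list of commits. This is only an illustration of the idea, not how git implements it: the function name `first_bad`, the `is_bad` callback (which stands in for "build the kernel, boot it, run dosemu") and the placeholder commit names are all assumptions made for this example.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is
    also bad (the same assumption a bisection relies on).
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the culprit is at mid or earlier
        else:
            good = mid  # the culprit is after mid
    return commits[bad]


# Hypothetical history: placeholder names, not real commit ids.
history = ["v6.6.20", "c1", "c2", "c3", "c4", "v6.6.21"]
broken_since = history.index("c3")
print(first_bad(history, lambda c: history.index(c) >= broken_since))
```

Each test halves the remaining range, so the number of kernels to build and boot grows only logarithmically with the number of commits between the two releases.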
Comment 11 Robert Gill 2024-04-13 23:11:55 UTC
I've determined that the following is the offending commit:

7a62647efcb2f7b744ff52ad91c23eda1ec9be94 is the first bad commit
commit 7a62647efcb2f7b744ff52ad91c23eda1ec9be94
Author: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Date:   Sun Mar 3 21:08:42 2024 -0800

    x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key

    commit 6613d82e617dd7eb8b0c40b2fe3acea655b1d611 upstream.

    The VERW mitigation at exit-to-user is enabled via a static branch
    mds_user_clear. This static branch is never toggled after boot, and can
    be safely replaced with an ALTERNATIVE() which is convenient to use in
    asm.

    Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user
    path. Also remove the now redundant VERW in exc_nmi() and
    arch_exit_to_user_mode().

    Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 Documentation/arch/x86/mds.rst       | 38 +++++++++++++++++++++++++-----------
 arch/x86/include/asm/entry-common.h  |  1 -
 arch/x86/include/asm/nospec-branch.h | 12 ------------
 arch/x86/kernel/cpu/bugs.c           | 15 ++++++--------
 arch/x86/kernel/nmi.c                |  3 ---
 arch/x86/kvm/vmx/vmx.c               |  2 +-
 6 files changed, 34 insertions(+), 37 deletions(-)
Comment 12 Robert Gill 2024-04-13 23:12:32 UTC
I'll take a look at it and also see if I can't find a way to work around it in dosemu.
Comment 13 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-14 06:51:33 UTC
(In reply to Robert Gill from comment #11)
>     x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
> 
>     commit 6613d82e617dd7eb8b0c40b2fe3acea655b1d611 upstream.

Ohh, that's the commit that caused the issue I had mentioned earlier. Send a reply there, let's see what happens:
https://lore.kernel.org/all/8c77ccfd-d561-45a1-8ed5-6b75212c7a58@leemhuis.info/
Comment 14 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-14 12:29:21 UTC
Not my area of expertise, so I might send you sideways -- and it's just a hunch anyway. But do you by chance have MITIGATION_RETHUNK=y in your .config? Could you check whether disabling it makes a difference? I wonder if it might be somehow related to https://lore.kernel.org/all/20240414090810.GBZhuc-lN6tyKbF_-M@fat_crate.local/
Comment 15 Robert Gill 2024-04-16 22:24:27 UTC
Disabled CONFIG_RETHUNK on host03 and dosemu still crashed. Uploading crash dump and kernel config.
Comment 16 Robert Gill 2024-04-16 22:25:59 UTC
Created attachment 306168 [details]
crash report for host03 (CONFIG_RETHUNK disabled)
Comment 17 Robert Gill 2024-04-16 22:26:37 UTC
Created attachment 306169 [details]
Kernel 6.6.21 config file
Comment 18 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-17 04:38:07 UTC
(In reply to Robert Gill from comment #15)
> Disabled CONFIG_RETHUNK on host03 and dosemu still crashed.

Thanks for checking. It seems that forwarding this to the developers did not lead to any reaction. If nothing happens soon, it would be best if you could confirm that this also happens with mainline (e.g. 6.9-rc4); ideally, afterwards also check whether reverting 6613d82e617dd7eb8b0c40b2fe3acea655b1d611 fixes the problem there.
Comment 19 Pawan Gupta 2024-04-18 00:27:18 UTC
(In reply to Robert Gill from comment #2)
> Created attachment 306119 [details]
> lscpu output for host02

Looking at the reported mitigations in lscpu, it appears that most of the mitigations are either disabled or not available. Just curious: do you have the MDS mitigation enabled on purpose? The system doesn't have the right microcode, and without the microcode, the software mitigation is useless.

If you are blocked by this issue, you may be able to work around it by using the kernel cmdline "mitigations=off" or "mds=off tsx_async_abort=off mmio_stale_data=off reg_file_data_sampling=off".

Meltdown status is "Vulnerable" in the lscpu output, so it looks like you have "nopti" in the cmdline. Could you please also try with CONFIG_PAGE_TABLE_ISOLATION=n
(or CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=n, depending on the kernel version)?

Is the #GP fault happening with other 32-bit applications? By any chance do you have strace logs with the crashing bin?
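
When trying the command-line options above, it can help to confirm what the running kernel actually received. The following is a minimal sketch, assuming a Linux system that exposes /proc/cmdline and the sysfs file /sys/devices/system/cpu/vulnerabilities/mds; the helper names `mitigation_params` and `mds_status` are made up for this example.

```python
from pathlib import Path
from typing import Dict

# Parameters named in the comment above; "mitigations" is the global switch.
MITIGATION_PARAMS = (
    "mitigations",
    "mds",
    "tsx_async_abort",
    "mmio_stale_data",
    "reg_file_data_sampling",
)


def mitigation_params(cmdline: str) -> Dict[str, str]:
    """Return the mitigation-related key=value pairs found in a kernel cmdline."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in MITIGATION_PARAMS:
            found[key] = value
    return found


def mds_status(path: str = "/sys/devices/system/cpu/vulnerabilities/mds") -> str:
    """Return the kernel's MDS status string, or 'unknown' if the file is absent."""
    p = Path(path)
    return p.read_text().strip() if p.is_file() else "unknown"


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    cmdline = proc_cmdline.read_text() if proc_cmdline.is_file() else ""
    print("cmdline mitigation params:", mitigation_params(cmdline))
    print("MDS status:", mds_status())
```

Flags without a value, such as "nopti", are deliberately ignored by this sketch because it only looks at key=value tokens.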
Comment 20 Robert Gill 2024-04-21 22:44:23 UTC
Tried turning off each of the mitigations separately. dosemu runs with "mds=off"; the other options had no effect. Crashes still occurred with CONFIG_PAGE_TABLE_ISOLATION=n as well. I'm attaching an strace log created with my default kernel command line options.
Comment 21 Robert Gill 2024-04-21 22:44:51 UTC
Created attachment 306190 [details]
dosemu strace log
Comment 22 Robert Gill 2024-04-21 23:03:38 UTC
Also, the latest available microcode is installed on both systems. The list is as follows (using iucode_tool):


Intel Xeon E5520 selected microcodes:
  069/001: sig 0x000106a4, pf_mask 0x03, 2015-06-30, rev 0x0013, size 14336
  070/001: sig 0x000106a5, pf_mask 0x03, 2018-05-11, rev 0x001d, size 12288

Intel Xeon E5405 selected microcodes:
  063/001: sig 0x00010670, pf_mask 0x80, 2007-02-09, rev 0x0005, size 4096
  064/001: sig 0x00010671, pf_mask 0x80, 2007-03-29, rev 0x0106, size 4096
  064/002: sig 0x00010671, pf_mask 0x44, 2007-03-29, rev 0x0106, size 4096
  064/003: sig 0x00010671, pf_mask 0x10, 2007-03-29, rev 0x0106, size 4096
  064/004: sig 0x00010671, pf_mask 0x01, 2007-03-29, rev 0x0106, size 4096
  065/001: sig 0x00010674, pf_mask 0x80, 2007-07-20, rev 0x0405, size 4096
  065/002: sig 0x00010674, pf_mask 0x44, 2007-06-08, rev 0x0404, size 4096
  065/003: sig 0x00010674, pf_mask 0x10, 2007-06-08, rev 0x0404, size 4096
  065/004: sig 0x00010674, pf_mask 0x01, 2007-06-08, rev 0x0404, size 4096
  066/001: sig 0x00010676, pf_mask 0x91, 2015-08-02, rev 0x0612, size 4096
  066/002: sig 0x00010676, pf_mask 0x54, 2008-01-19, rev 0x060c, size 4096
  066/003: sig 0x00010676, pf_mask 0x44, 2010-09-29, rev 0x060f, size 4096
  066/004: sig 0x00010676, pf_mask 0x40, 2015-08-02, rev 0x0612, size 4096
  066/005: sig 0x00010676, pf_mask 0x04, 2015-08-02, rev 0x0612, size 4096
  067/001: sig 0x00010677, pf_mask 0x10, 2015-08-02, rev 0x070d, size 4096
  067/002: sig 0x00010677, pf_mask 0x01, 2007-10-26, rev 0x0701, size 4096
  068/001: sig 0x0001067a, pf_mask 0xb1, 2015-07-29, rev 0x0a0e, size 8192
  068/002: sig 0x0001067a, pf_mask 0x44, 2015-07-29, rev 0x0a0e, size 8192
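
A listing like the one above can be summarised per CPU signature with a few lines of Python. This is a rough sketch that assumes the line layout shown in this comment; the function name `newest_per_signature` is hypothetical, and ignoring pf_mask (the platform mask) is a simplification made only for this example.

```python
import re
from typing import Dict, List, Tuple

# Matches the fields of one listing line, e.g.
# "066/001: sig 0x00010676, pf_mask 0x91, 2015-08-02, rev 0x0612, size 4096"
LINE_RE = re.compile(
    r"sig (0x[0-9a-f]+), pf_mask (0x[0-9a-f]+), "
    r"(\d{4}-\d{2}-\d{2}), rev (0x[0-9a-f]+)"
)


def newest_per_signature(lines: List[str]) -> Dict[str, Tuple[str, str]]:
    """Map each CPU signature to the (date, revision) of its newest entry."""
    newest: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip headers and blank lines
        sig, _pf_mask, date, rev = m.groups()
        # ISO dates compare correctly as plain strings.
        if sig not in newest or date > newest[sig][0]:
            newest[sig] = (date, rev)
    return newest


sample = [
    "  066/001: sig 0x00010676, pf_mask 0x91, 2015-08-02, rev 0x0612, size 4096",
    "  066/002: sig 0x00010676, pf_mask 0x54, 2008-01-19, rev 0x060c, size 4096",
    "  070/001: sig 0x000106a5, pf_mask 0x03, 2018-05-11, rev 0x001d, size 12288",
]
for sig, (date, rev) in newest_per_signature(sample).items():
    print(sig, date, rev)
```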
Comment 23 Robert Gill 2024-04-22 02:18:46 UTC
Also, no other 32-bit applications are crashing.
Comment 24 Pawan Gupta 2024-04-23 19:27:34 UTC
(In reply to Robert Gill from comment #21)
> Created attachment 306190 [details]
> dosemu strace log

Looking at the strace log, it appears that vm86() syscall is the culprit:

  vm86(0x1, 0x617d80, 0xb7c9fa04, 0x5c6ff4, 0x617c44) = 0
  +++ killed by SIGSEGV +++

It is likely that vm86() does not map the .entry.text section that has the VERW operand mds_verw_sel.

Attaching a debug patch based on v6.9-rc4 that moves VERW execution before CR3 is switched. This is slightly less secure, but 32-bit mode has not received some of the other mitigations, so mitigating MDS doesn't add much value anyways. Please let me know if it fixes the issue.
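
The way the strace log points at vm86() can be reproduced mechanically: find the last syscall logged before the "+++ killed by" line. The snippet below is a small sketch of that idea; the function name `last_syscall_before_kill` is made up, and the optional leading PID column is an assumption about how the log might have been captured.

```python
import re
from typing import List, Optional

# A syscall line starts with an optional PID column, then "name(".
SYSCALL_RE = re.compile(r"^\s*(?:\d+\s+)?(\w+)\(")


def last_syscall_before_kill(lines: List[str]) -> Optional[str]:
    """Return the name of the last syscall logged before a '+++ killed by' line."""
    last = None
    for line in lines:
        if "+++ killed by" in line:
            return last
        m = SYSCALL_RE.match(line)
        if m:
            last = m.group(1)
    return None  # the process was not killed by a signal in this log


log = [
    "vm86(0x1, 0x617d80, 0xb7c9fa04, 0x5c6ff4, 0x617c44) = 0",
    "+++ killed by SIGSEGV +++",
]
print(last_syscall_before_kill(log))  # prints: vm86
```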
Comment 25 Pawan Gupta 2024-04-23 19:30:03 UTC
Created attachment 306201 [details]
Workaround patch

Moves VERW execution before CR3 is switched for 32-bit mode.
Comment 26 Robert Gill 2024-04-26 00:30:21 UTC
I've tested that patch and it appears to fix the issue.
Comment 27 Pawan Gupta 2024-04-26 23:54:02 UTC
Formal patch submitted here:
https://lore.kernel.org/all/20240426-fix-dosemu-vm86-v1-1-88c826a3f378@linux.intel.com/
