dosemu crashes on some older hardware when running kernel 6.6.21 or newer, but runs fine on other hardware. The error is:
    kernel: general protection fault: 0000 [#1] PREEMPT SMP PTI
I'm including attachments describing the hardware it runs on and the hardware it fails on, as well as crash reports from the hardware where it fails. host01 is the system where it runs without problems; it is currently running kernel version 6.6.21.
Created attachment 306118 [details] lscpu output for host01 (dosemu runs fine on this system)
Created attachment 306119 [details] lscpu output for host02
Created attachment 306120 [details] crash report for host02
Created attachment 306121 [details] lscpu output for host03
Created attachment 306122 [details] crash report for host03
Created attachment 306123 [details] crash report for host03 (kernel 6.8.4)
When you say 6.6.21, which is the last kernel where this did not happen? 6.6.20? In that case: could you bisect? I did a quick search on lore and wondered if this might be related: https://lore.kernel.org/all/202403281553.79f5a16f-lkp@intel.com/ Are you compiling the kernel with clang as well?
I'm compiling on Gentoo using gcc-13.2.1_p20240210. I had to download the kernel sources for 6.6.20 from kernel.org because Gentoo masked kernel 6.6.13 and removed all releases between that and 6.6.21. 6.6.13 was the last stabilized kernel on Gentoo before a security issue was discovered that I believe was patched in 6.6.21. Anyways, dosemu runs fine with kernel 6.6.20.
I should also mention that these machines are running with 32-bit kernels.
(In reply to Robert Gill from comment #9)
> I should also mention that these machines are running with 32-bit kernels.
Which do not get a lot of testing these days. :-/
(In reply to Robert Gill from comment #8)
> Anyways, dosemu runs fine with kernel 6.6.20.
Good. Then it would be ideal if you could bisect between 6.6.20 and 6.6.21 to find the culprit. https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
I've determined that the following is the offending commit:

7a62647efcb2f7b744ff52ad91c23eda1ec9be94 is the first bad commit
commit 7a62647efcb2f7b744ff52ad91c23eda1ec9be94
Author: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Date:   Sun Mar 3 21:08:42 2024 -0800

    x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key

    commit 6613d82e617dd7eb8b0c40b2fe3acea655b1d611 upstream.

    The VERW mitigation at exit-to-user is enabled via a static branch
    mds_user_clear. This static branch is never toggled after boot, and can
    be safely replaced with an ALTERNATIVE() which is convenient to use in
    asm.

    Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user
    path. Also remove the now redundant VERW in exc_nmi() and
    arch_exit_to_user_mode().

    Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 Documentation/arch/x86/mds.rst       | 38 +++++++++++++++++++++++++-----------
 arch/x86/include/asm/entry-common.h  |  1 -
 arch/x86/include/asm/nospec-branch.h | 12 ------------
 arch/x86/kernel/cpu/bugs.c           | 15 ++++++--------
 arch/x86/kernel/nmi.c                |  3 ---
 arch/x86/kvm/vmx/vmx.c               |  2 +-
 6 files changed, 34 insertions(+), 37 deletions(-)
I'll take a look at it and also see if I can't find a way to work around it in dosemu.
(In reply to Robert Gill from comment #11)
> x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
>
> commit 6613d82e617dd7eb8b0c40b2fe3acea655b1d611 upstream.
Ohh, that's the commit that caused the issue I had mentioned earlier. I sent a reply there, let's see what happens: https://lore.kernel.org/all/8c77ccfd-d561-45a1-8ed5-6b75212c7a58@leemhuis.info/
Not my area of expertise, so I might send you sideways -- and it's just a hunch anyway. But do you by chance have MITIGATION_RETHUNK=y in your .config? Could you check whether disabling it makes a difference? I wonder if it might be somehow related to https://lore.kernel.org/all/20240414090810.GBZhuc-lN6tyKbF_-M@fat_crate.local/
Disabled CONFIG_RETHUNK on host03 and dosemu still crashed. Uploading crash dump and kernel config.
Created attachment 306168 [details] crash report for host03 (CONFIG_RETHUNK disabled)
Created attachment 306169 [details] Kernel 6.6.21 config file
(In reply to Robert Gill from comment #15)
> Disabled CONFIG_RETHUNK on host03 and dosemu still crashed.
Thx for checking. Seems forwarding this to the developers did not lead to any reaction. If nothing happens any time soon, it would be best if you could confirm that this happens with mainline (e.g. 6.9-rc4), too; ideally, also check afterwards whether reverting 6613d82e617dd7eb8b0c40b2fe3acea655b1d611 fixes the problem there.
(In reply to Robert Gill from comment #2)
> Created attachment 306119 [details]
> lscpu output for host02
Looking at the mitigations reported by lscpu, it appears that most of them are either disabled or not available. Just curious: do you have the MDS mitigation enabled on purpose? The system doesn't have the right microcode, and without the microcode the software mitigation is useless.
If you are blocked by this issue, you may be able to work around it with the kernel cmdline option "mitigations=off", or with "mds=off tsx_async_abort=off mmio_stale_data=off reg_file_data_sampling=off".
Meltdown status is "Vulnerable" in the lscpu output, so it looks like you have "nopti" in the cmdline. Could you please also try with CONFIG_PAGE_TABLE_ISOLATION=n (or CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=n, depending on the kernel version)?
Is the #GP fault happening with other 32-bit applications? By any chance, do you have strace logs for the crashing binary?
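For anyone repeating these checks, here is a minimal Python sketch that collects the same information in one place: which of the mitigation-related boot options named above appear on a kernel command line, and what the kernel reports per vulnerability. The function names are hypothetical and only for illustration. The sysfs directory /sys/devices/system/cpu/vulnerabilities is the standard location on recent kernels, but it is passed as a parameter in case it is absent on a given system.

```python
from pathlib import Path

# Boot options named in the comment above; other options are ignored here.
VALUE_OPTIONS = ("mitigations", "mds", "tsx_async_abort",
                 "mmio_stale_data", "reg_file_data_sampling")
FLAG_OPTIONS = ("nopti",)


def parse_cmdline_mitigations(cmdline: str) -> dict:
    """Return the mitigation-related options found in a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in VALUE_OPTIONS:
            found[key] = value          # e.g. "mds=off" -> {"mds": "off"}
        elif not sep and token in FLAG_OPTIONS:
            found[token] = True         # e.g. "nopti" -> {"nopti": True}
    return found


def read_vulnerabilities(base="/sys/devices/system/cpu/vulnerabilities") -> dict:
    """Map each vulnerability name to the status string the kernel reports."""
    root = Path(base)
    if not root.is_dir():
        return {}                       # older kernel or non-Linux host
    return {p.name: p.read_text().strip()
            for p in sorted(root.iterdir()) if p.is_file()}
```

On a Linux host, passing the contents of /proc/cmdline to parse_cmdline_mitigations() and printing the result of read_vulnerabilities() should give roughly the same picture as the "Vulnerabilities" section of lscpu.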
Tried turning off all the mitigations separately. dosemu runs with "mds=off"; the other options had no effect. Crashes still occurred with CONFIG_PAGE_TABLE_ISOLATION=n as well. I'm attaching an strace log created with my default kernel command line options.
Created attachment 306190 [details] dosemu strace log
Also, the latest available microcode is installed on both systems. The list is as follows (using iucode_tool):
Intel Xeon E5520 selected microcodes:
  069/001: sig 0x000106a4, pf_mask 0x03, 2015-06-30, rev 0x0013, size 14336
  070/001: sig 0x000106a5, pf_mask 0x03, 2018-05-11, rev 0x001d, size 12288
Intel Xeon E5405 selected microcodes:
  063/001: sig 0x00010670, pf_mask 0x80, 2007-02-09, rev 0x0005, size 4096
  064/001: sig 0x00010671, pf_mask 0x80, 2007-03-29, rev 0x0106, size 4096
  064/002: sig 0x00010671, pf_mask 0x44, 2007-03-29, rev 0x0106, size 4096
  064/003: sig 0x00010671, pf_mask 0x10, 2007-03-29, rev 0x0106, size 4096
  064/004: sig 0x00010671, pf_mask 0x01, 2007-03-29, rev 0x0106, size 4096
  065/001: sig 0x00010674, pf_mask 0x80, 2007-07-20, rev 0x0405, size 4096
  065/002: sig 0x00010674, pf_mask 0x44, 2007-06-08, rev 0x0404, size 4096
  065/003: sig 0x00010674, pf_mask 0x10, 2007-06-08, rev 0x0404, size 4096
  065/004: sig 0x00010674, pf_mask 0x01, 2007-06-08, rev 0x0404, size 4096
  066/001: sig 0x00010676, pf_mask 0x91, 2015-08-02, rev 0x0612, size 4096
  066/002: sig 0x00010676, pf_mask 0x54, 2008-01-19, rev 0x060c, size 4096
  066/003: sig 0x00010676, pf_mask 0x44, 2010-09-29, rev 0x060f, size 4096
  066/004: sig 0x00010676, pf_mask 0x40, 2015-08-02, rev 0x0612, size 4096
  066/005: sig 0x00010676, pf_mask 0x04, 2015-08-02, rev 0x0612, size 4096
  067/001: sig 0x00010677, pf_mask 0x10, 2015-08-02, rev 0x070d, size 4096
  067/002: sig 0x00010677, pf_mask 0x01, 2007-10-26, rev 0x0701, size 4096
  068/001: sig 0x0001067a, pf_mask 0xb1, 2015-07-29, rev 0x0a0e, size 8192
  068/002: sig 0x0001067a, pf_mask 0x44, 2015-07-29, rev 0x0a0e, size 8192
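A list like this is easier to compare across hosts once it is parsed. The following Python sketch assumes the line layout shown above (sig, pf_mask, date, rev) and picks the highest revision for a given CPU signature. The names Microcode, parse_microcodes and newest_for are made up for this illustration, and treating pf_mask as a bitmask of platform IDs is an assumption about how iucode_tool reports it.

```python
import re
from typing import NamedTuple, Optional


class Microcode(NamedTuple):
    sig: int       # CPU signature, e.g. 0x106a5
    pf_mask: int   # platform flags mask (assumed to be a bitmask)
    date: str      # release date as printed, YYYY-MM-DD
    rev: int       # microcode revision


# Matches the fields of one listing line; the leading index and size are skipped.
LINE_RE = re.compile(
    r"sig 0x([0-9a-fA-F]+), pf_mask 0x([0-9a-fA-F]+), "
    r"(\d{4}-\d{2}-\d{2}), rev 0x([0-9a-fA-F]+)"
)


def parse_microcodes(text: str) -> list:
    """Extract every microcode entry found in the listing text."""
    return [Microcode(int(m[1], 16), int(m[2], 16), m[3], int(m[4], 16))
            for m in LINE_RE.finditer(text)]


def newest_for(entries, sig: int, pf_bit: Optional[int] = None):
    """Return the highest-revision entry for sig, optionally limited to a platform bit."""
    matches = [e for e in entries
               if e.sig == sig and (pf_bit is None or e.pf_mask & pf_bit)]
    return max(matches, key=lambda e: e.rev, default=None)
```

Comparing the result of newest_for() with the revision the running CPU reports (the "microcode" field in /proc/cpuinfo) is one way to confirm that the latest available update is actually loaded.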
Also, no other 32-bit applications are crashing.
(In reply to Robert Gill from comment #21)
> Created attachment 306190 [details]
> dosemu strace log
Looking at the strace log, it appears that vm86() syscall is the culprit:
    vm86(0x1, 0x617d80, 0xb7c9fa04, 0x5c6ff4, 0x617c44) = 0
    +++ killed by SIGSEGV +++
It is likely that vm86() does not map the .entry.text section that has the VERW operand mds_verw_sel. Attaching a debug patch based on v6.9-rc4 that moves VERW execution before CR3 is switched. This is slightly less secure, but 32-bit mode has not received some of the other mitigations, so mitigating MDS doesn't add much value anyways. Please let me know if it fixes the issue.
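The same reading of the strace log can be automated for longer logs. This is a rough Python sketch that returns the last completed syscall before a "+++ killed by ..." line, along with the signal name. The function name is hypothetical, and the patterns assume plain strace output, optionally with the PID prefix that "strace -f" adds; unfinished or resumed calls are not handled.

```python
import re

# A completed call looks like: name(args) = result
SYSCALL_RE = re.compile(r"^(?:\d+\s+)?(\w+)\((.*)\)\s+=\s+(\S+)")
# Termination marker printed by strace when a signal kills the process.
KILLED_RE = re.compile(r"^(?:\d+\s+)?\+\+\+ killed by (\w+)")


def last_syscall_before_kill(log: str):
    """Return (syscall_name, signal) for the first kill marker, or None if absent."""
    last = None
    for raw in log.splitlines():
        line = raw.strip()
        killed = KILLED_RE.match(line)
        if killed:
            return last, killed.group(1)
        call = SYSCALL_RE.match(line)
        if call:
            last = call.group(1)   # remember the most recent completed call
    return None
```

Applied to the two lines quoted above, this yields vm86 and SIGSEGV, which matches the manual analysis. It only shows where the process died, not why; the missing mapping is an inference from the kernel side.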
Created attachment 306201 [details] Workaround patch Moves VERW execution before CR3 is switched for 32-bit mode.
I've tested that patch and it appears to fix the issue.
Formal patch submitted here: https://lore.kernel.org/all/20240426-fix-dosemu-vm86-v1-1-88c826a3f378@linux.intel.com/