Bug 101061
Summary: | Problems while executing 32-bit code on AMD64 | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Krzysztof Sobiecki (sobkas) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | RESOLVED CODE_FIX | ||
Severity: | blocking | CC: | austinenglish, dex, hpa, luto, sobkas, torvalds, vda.linux |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 4.2-rc3+ 9d634c410b07be7bf637ea03362d3ff132088fe3 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Test patch for revert on top of current -git
More complete revert Linuses patch witchout last chunk Smaller revert patch My .config |
Description
Krzysztof Sobiecki
2015-07-06 13:33:33 UTC
Also I have problems with 32-bit apps inside chroot, while compiling some software using pbuilder I'm getting rules[25968]: segfault at 6e6f7270 ip 00000000f7624aa6 sp 00000000ffe6cb10 error 4 in libc-2.21.so[f75b2000+176000] no idea why I can confirm this. vendor_id : AuthenticAMD cpu family : 16 model : 10 model name : AMD Phenom(tm) II X6 1055T Processor glibc 2.21 I also think this bug is related to this one in Wine: https://bugs.winehq.org/show_bug.cgi?id=38953 Posted link there, maybe someone will finally do something. To summarize problems: PCSX2 Boot CDVD doesn't work on. It causes segfaults Running 32-bit binaries inside chroot causes sometimes this: [ 604.181173] configure[7748]: segfault at 7 ip 00000000f7542a7c sp 00000000ffecf580 error 4 in libc-2.21.so[f7517000+176000] [ 1335.696203] rules[20040]: segfault at 6e6f7270 ip 00000000f7699aa6 sp 00000000ff90b0e0 error 4 in libc-2.21.so[f7627000+176000] And this: https://bugs.winehq.org/show_bug.cgi?id=38953 Everything started with 53e9accf0f7682d717c7b578b6e01fd297ba6630 And (almost) no one cares (In reply to Krzysztof Sobiecki from comment #4) > Everything started with 53e9accf0f7682d717c7b578b6e01fd297ba6630 Did you bisect it down to this commit? Does reverting it fix the problem for you? I tested the same kernel on another maschine and couldn't reproduce it there: vendor_id : GenuineIntel cpu family : 6 model : 69 model name : Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz So it seems this particular bug is AMD specific. >Did you bisect it down to this commit? Yes I did >Does reverting it fix the problem for you? I can't revert it on the current tree. I have tested 53e9accf0f7682d717c7b578b6e01fd297ba6630 and commit before it and 53e9accf0f7682d717c7b578b6e01fd297ba6630 breaks things as described while one before it don't I have also tried to revert it by hand, but it didn't do any good. I would probably have to read some docs to even start understand what's going on it that code. Some more testing is needed: So can everyone test if 53e9accf0f7682d717c7b578b6e01fd297ba6630 is really broken and if 73cbf687914f works? Thanks Created attachment 183271 [details]
Test patch for revert on top of current -git
This is an entirely untested straight manual revert of commit 53e9accf0f76 ("x86/asm/entry/32: Do not use R9 in SYSCALL32 entry point") modified for other changes in that area.
No guarantees, but testing would be appreciated.
(In reply to Linus Torvalds from comment #9) > Created attachment 183271 [details] > Test patch for revert on top of current -git > > This is an entirely untested straight manual revert of commit 53e9accf0f76 > ("x86/asm/entry/32: Do not use R9 in SYSCALL32 entry point") modified for > other changes in that area. > > No guarantees, but testing would be appreciated. Thanks for your help Linus, but I 'm sorry to disappoint you: I still have the same issue with your patch applied. Something that I might have forgotten to mention: I get this effect only about in 4/5 of tries. Sometimes the call of an 32-bit binary succeeds. Created attachment 183281 [details]
More complete revert
This is a more complete revert of the changes to the system call entry code starting with commit 53e9accf0f76.
I left the entrypoint renaming alone, and didn't undo the whitespace fixes, but this should make the actual generated code be the same as before. I didn't check if there might be some interactions with other changes, though, and this remains entirely untested.
Created attachment 183301 [details]
Linuses patch witchout last chunk
Linus suggested that removing last chunk from previous patch will fix problems with threading in 32-bit binaries so I attached modified patch for your convince. Testing is required. But with it I didn't have any problems.
(In reply to Linus Torvalds from comment #12) > Created attachment 183281 [details] > More complete revert > > This is a more complete revert of the changes to the system call entry code > starting with commit 53e9accf0f76. > > I left the entrypoint renaming alone, and didn't undo the whitespace fixes, > but this should make the actual generated code be the same as before. I > didn't check if there might be some interactions with other changes, though, > and this remains entirely untested. There seems to be some interactions. At least my box didn't boot with the patch applied. I can check later what exactly is the error, since I tried it remote. In general it looks like syscalls sometimes clobber a register. "segfault at 7" - looks like code dereferenced a pointer with value 7. I.e. some registed was clobbered, instead of previous contents which was a valid address now it contains an invalid one. It doesn't looks like SYSCALL insn inself fails to return to userspace, or clobbers EIP, or ESP. The "sometimes" part is interesting. It means that execution needs to divert from the typical "fast path" for this to occur. Maybe here: call *ia32_sys_call_table(, %rax, 8) movq %rax, RAX(%rsp) 1: movl RCX(%rsp), %ebp DISABLE_INTERRUPTS(CLBR_NONE) testl $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS) jnz sysretl_audit The jnz is usually not jumping (unless audit is active). When it is not jumping, it most likely works fine. If it jumps, we enter a much messier code. My one theory is that the bug is there. My another theory is that it (also non-deterministically) happens when bitness changes (execing 32-bit binery from 64-bit one or vice versa). It's a nasty special case where syscall entry and syscall exit have different bitness. (In reply to Krzysztof Sobiecki from comment #4) > Running 32-bit binaries inside chroot causes sometimes this: > [ 604.181173] configure[7748]: segfault at 7 ip 00000000f7542a7c sp > 00000000ffecf580 error 4 in libc-2.21.so[f7517000+176000] > [ 1335.696203] rules[20040]: segfault at 6e6f7270 ip 00000000f7699aa6 sp > 00000000ff90b0e0 error 4 in libc-2.21.so[f7627000+176000] Krzysztof, I would like to reproduce this. This last example seems simpler than installing a complex beast like PS2 emulator. Can you attach your kernel .config? Do you run your system with audit enabled (in particular, selinux always enables audit)? Are your "32-bit binaries inside chroot" can be packaged into a compact enough set (tarball) which you can email to me so that I can execute the crashing configure run on my machine? If it is feasible for you, rebuild the kernel with CONFIG_AUDITSYSCALL off. Does it still fail? Created attachment 183461 [details] Smaller revert patch > In general it looks like syscalls sometimes clobber a register. To be more specific: the clobbeterd register has to be EBP, and userspace should see it as clobbered ECX because of this register shuffling in VDSO: __kernel_vsyscall: push %ebp movl %ecx, %ebp syscall movl %ebp, %ecx popl %ebp Krzysztof, glad to hear than Linus' revert patch (patch5.diff) worked! However, I still don't see where EBP gets clobbered. Linus' patch reverted *all* changes, but some of those seem harmelss. I'm attaching patch6.diff which is a "smaller" revert: it doesn't revert comments, "xchg optimization" in auditsys_entry_common, and doesn't reintroduce ia32_sysret code path. Can you test whether patch6.diff also works for you? Created attachment 183481 [details] My .config Hope it helps, but about that chroot... it's a pbuilder environment and it's a bit bigger than I would like to send anyone: 297M, so it's not an option. On the other hand it's extremely easy to reproduce it with pbuilder I was compiling mesa when I first found that problem, so I believe that any big makefile should trigger it in make. But it depends on makefiles, it works in some cases where different makefiles are presented (like building a radeon driver for X.org) So I believe that testing PCSX2 is more reliable and the code itself that trigger it is more straightforward(they use asm in that place) than to chase optimizations in make. there is a package for it https://packages.debian.org/sid/pcsx2 I have used version from git, but I believe that stable version should trigger that problem anyway. > If it is feasible for you, rebuild the kernel with CONFIG_AUDITSYSCALL off.
> Does it still fail?
I assume that kernel should be without any patches? Right?
I will compile and test it as soon as I test patch6.diff
with patch6.diff I was able to run PCSX2 without problems, so I will assume that everything works as it should. Now I will test kernel with CONFIG_AUDITSYSCALL disabled kernel with CONFIG_AUDITSYSCALL disabled have this bug too I can confirm that patch6.diff works for me to. I still want to point out that this bug seems to only affects AMD64 and not EM64T. According to Denys Vlasenko: "It happens only on AMD because on Intel CPUs, in 32-bit mode SYSCALL instruction is not working. On Intel, a different instruction (SYSENTER) and different kernel code handles fast syscalls." That is indeed correct -- 32-bit SYSCALL is AMD only; the Intel equivalent is SYSENTER which has different characteristics. For 64-bit processes, all x86 processors use the 64-bit version of the SYSCALL instruction, which behaves differently. Fix should trickle down into Linuses tree soon This should now be fixed in 4.2-rc4 Please verify that everything works again. Thanks. Yes, it works fine now. I have tested PCSX2, Battle.net app and compilation in 32-bit environment. I'm no longer able to reproduce this bug. So I will mark this bug as solved. I have a 2/3-written test case that's designed to exercise syscall args (compat and native), their interactions with ptrace, and their fates after syscall return. All kernels appear to work as expected on Intel hardware (with Denys' patch and otherwise). On a QEMU-emulated (TCG) Opteron_G4, the results are quite different. Syscall restart is broken on all kernels I tried (i.e. ip -= 2 doesn't have the desired effect). (Al Viro pointed this out a couple years ago, and apparently it never got fixed.) On the other hand, even without the revert, I'm not seeing register corruption with or without ptrace, so I'm confused. I'd really like to have a good test case for this issue. My best current guess is that something's wrong with signal handling. FWIW, I'm testing 8a0a5da6d9cbf1b142115ff6e6b253a82633c3d9. Never mind, I think I got it. The bug was that the R9 rework broke syscall restart. In particular, the second arg (arg1, which is regs->cx) ends up being reloaded from arg5, which is regs->bp. IOW, syscall arg *changes* have never worked using SYSCALL32, but syscall *restart* (with unchanged args) work. And now I have a test case. Holy *crap* this code is a mess. On Tue, Aug 25, 2015 at 2:45 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > > On a QEMU-emulated (TCG) Opteron_G4, the results are quite different. > Syscall > restart is broken on all kernels I tried (i.e. ip -= 2 doesn't have the > desired > effect) So "sysenter" restart is really odd. The initial system call is done with "sysenter", but the "ip -= 2" trick ends up subtracting two from the sysenter return address ("VDSO32_SYSENTER_RETURN"), which points to an "int 0x80". I forget why. I think it was a misguided attempt to not have the issue with %rbp, but that makes little sense. Maybe we should replace the "int 0x80" with just sysenter (still two bytes) and make sure that the sysenter case pushes rbp and rsp as the same value (so that the re-loading of the sixth argument works) Linus --- Comment #28 from Andy Lutomirski <luto@kernel.org> --- > > IOW, syscall arg *changes* have never worked using SYSCALL32, but syscall > *restart* (with unchanged args) work. Ugh. Do you have a patch? And I realized that while "sysenter" is really odd, I think that it's correct (exactly because we load %rbp from the stack and then do "int 0x80"). And the case you have is the compat case for "syscall", which is the AMD model. And I guess Intel UD's on syscall outside of long mode, so it's a bitch to test, because you really do need AMD hardware. Linus (In reply to Linus Torvalds from comment #30) > --- Comment #28 from Andy Lutomirski <luto@kernel.org> --- > > > > IOW, syscall arg *changes* have never worked using SYSCALL32, but syscall > > *restart* (with unchanged args) work. > > Ugh. Do you have a patch? No, but I will as part of the C rewrite. > > And I realized that while "sysenter" is really odd, I think that it's > correct (exactly because we load %rbp from the stack and then do "int > 0x80"). And the case you have is the compat case for "syscall", which > is the AMD model. Yeah, the sysenter case is better. While it's fresh in my mind, here's the ABI as well as I can tell: call [vsyscall] issues a syscall with nr eax and args (ebx, ecx, edx, esi, edi, ebp). Callers absolutely require that the exit args (or at least some of them) that aren't part of the syscall signature are unchanged -- a bunch of the (hideous) glibc asm shoves things like the PIC registers and BP in there for safe keeping for syscalls with fewer than 6 args. ptrace expects regs->ebx to contain arg0, etc. (ip -= 2; ax = orig_ax;) is expected to restart the syscall. I don't know whether any user code relies on this, but the kernel definitely does. Most likely the raw semantics of the SYSCALL32 and SYSENTER instructions are not meaningful, since SYSENTER faults on AMD, SYSCALL32 faults on Intel and on native 32-bit AMD kernels, and anyone who uses SYSCALL32 directly on anything but the very latest kernels without fixing up SS manually will get bitten by the sysret_ss_attrs thing. I was thinking about cleaning this up by targeting the lowest common denominator. We don't use SYSEXIT any more, so the constraints are that SYSENTER clobbers ESP and EIP, SYSCALL32 clobbers ECX, and SYSRET32 clobbers ECX. (SYSEXIT clobbers EDX, but we don't care.) So let's just do this: __kernel_vsyscall: push %ecx push %ebp movl %esp, %ebp ALTERNATIVE (Intel with SEP): sysenter ALTERNATIVE (AMD with SYSCALL32 on 64-bit kernels): syscall hlt /* keep weird binary tracers from utterly screwing up */ ALTERNATIVE (if neither of the other cases apply): nops movl %ebp, %esp movl (%esp), %ebp movl 8(%esp), %ecx int $0x80 vsyscall_after_int80: popl %ecx popl %ebp ret Yes, this is weird. But bear with me :) First, in the case where we have neither SEP nor SYSCALL32, I claim that this Just Works. We push a couple regs, pointlessly shuffle esp, restore the regs, do int $0x80 with the same regs we started with, and then (again, pointlessly) pop the regs we pushed. Now we make the semantics of *both* syscall32 and sysenter that they load ebp into regs->sp, fetch regs->bp and regs->cx from memory, and set regs->ip to vsyscall_after_int80. (This is a wee bit slower than the current sysenter code because it does two memory fetches instead of one.) Then they proceed just like int80. Note that sysenter's slow path already sort of works like this. Now we've fixed the correctness issues but we've killed performance, as we'll use IRET instead of SYSRET to get back out. But we already know how to fix that: opportunistic sysret. If we're returning from a compat syscall that entered via sysenter or syscall32, if regs->ip == vsyscall_after_int80, regs->r11 == regs->flags, regs->ss == __USER_DS, regs->cs == __USER32_CS, and flags are sane, then return using SYSRETL. (The r11 check is probably unnecessary.) This is not quite as elegant as 64-bit opportunistic sysret, since we're zapping ecx. This should be unobservable except by debuggers, since we already know that we're returning to a 'pop ecx' instruction. For native 32-bit, we obviously can't use SYSRETL, but we can do more or less the same thing with SYSEXIT. > > And I guess Intel UD's on syscall outside of long mode, so it's a > bitch to test, because you really do need AMD hardware. QEMU contains an entire miniature and slow AMD chip :) On second thought, let's push and pop edx as well. That way the exact same code works for native 32-bit as well. |