Bug 101061

Summary: Problems while executing 32-bit code on AMD64
Product: Platform Specific/Hardware Reporter: Krzysztof Sobiecki (sobkas)
Component: x86-64Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED CODE_FIX    
Severity: blocking CC: austinenglish, dex, hpa, luto, sobkas, torvalds, vda.linux
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.2-rc3+ 9d634c410b07be7bf637ea03362d3ff132088fe3 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Test patch for revert on top of current -git
More complete revert
Linuses patch witchout last chunk
Smaller revert patch
My .config

Description Krzysztof Sobiecki 2015-07-06 13:33:33 UTC
Starting with rev: 53e9accf0f7682d717c7b578b6e01fd297ba6630 I have a problem with executing 32-bit binary on my amd64 system, "PCSX2".

So after normally starting PCSX2(git: e3adf823b32beb0330290b7639758e2d3376ba52), I select Boot CDVD(fast or full, doesn't matter), normally before that faulty rev(53e9accf0f7682d717c7b578b6e01fd297ba6630), it would open a window and start game or bios. But now(1c4c7159ed2468f3ac4ce5a7f08d79663d381a93) it just segfaults:

(gdb) bt full
#0  x86Emitter::xWrite16 (val=val@entry=4111) at /tmp/buildd/pcsx2-1.3.1/common/src/x86emitter/x86emitter.cpp:82
No locals.
#1  0x0834a6c4 in SimdPrefix (opcode=16, prefix=<optimized out>) at /tmp/buildd/pcsx2-1.3.1/common/src/x86emitter/simd.cpp:117
        is16BitOpcode = false
#2  xOpWrite0F<x86Emitter::xRegisterSSE, x86Emitter::xIndirectVoid> (param2=..., param1=..., opcode=16, prefix=<optimized out>)
    at /tmp/buildd/pcsx2-1.3.1/common/include/x86emitter/internal.h:57
No locals.
#3  x86Emitter::xImplSimd_MoveSSE::operator() (this=0x840251c <x86Emitter::xMOVUPS>, to=..., from=...)
    at /tmp/buildd/pcsx2-1.3.1/common/src/x86emitter/simd.cpp:583
No locals.
#4  0x0828f871 in VifUnpackSSE_Base::xUPK_S_32 (this=0xf30fee04) at /tmp/buildd/pcsx2-1.3.1/pcsx2/x86/newVif_UnpackSSE.cpp:94
No locals.

I have no idea what's going on. Thanks.

libc:
libc6-i686:i386:
  Installed: 2.21-0experimental0
  Candidate: 2.21-0experimental0
  Version table:
 *** 2.21-0experimental0 0
        501 ftp://ftp.pl.debian.org/debian/ experimental/main i386 Packages
        500 /var/lib/dpkg/status
     2.19-18 0
        501 ftp://ftp.pl.debian.org/debian/ unstable/main i386 Packages
        500 ftp://ftp.pl.debian.org/debian/ testing/main i386 Packages
libc6:i386:
  Installed: 2.21-0experimental0
  Candidate: 2.21-0experimental0
  Version table:
 *** 2.21-0experimental0 0
        501 ftp://ftp.pl.debian.org/debian/ experimental/main i386 Packages
        500 /var/lib/dpkg/status
     2.19-18 0
        501 ftp://ftp.pl.debian.org/debian/ unstable/main i386 Packages
        500 ftp://ftp.pl.debian.org/debian/ testing/main i386 Packages

cpuinfo:
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD FX(tm)-6350 Six-Core Processor
stepping	: 0
microcode	: 0x6000822
cpu MHz		: 1400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 6
core id		: 0
cpu cores	: 3
apicid		: 16
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vmmcall bmi1
bugs		: fxsave_leak sysret_ss_attrs
bogomips	: 8451.92
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD FX(tm)-6350 Six-Core Processor
stepping	: 0
microcode	: 0x6000822
cpu MHz		: 1400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 6
core id		: 1
cpu cores	: 3
apicid		: 17
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vmmcall bmi1
bugs		: fxsave_leak sysret_ss_attrs
bogomips	: 8451.92
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

processor	: 2
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD FX(tm)-6350 Six-Core Processor
stepping	: 0
microcode	: 0x6000822
cpu MHz		: 1400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 6
core id		: 2
cpu cores	: 3
apicid		: 18
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vmmcall bmi1
bugs		: fxsave_leak sysret_ss_attrs
bogomips	: 8451.92
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

processor	: 3
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD FX(tm)-6350 Six-Core Processor
stepping	: 0
microcode	: 0x6000822
cpu MHz		: 1400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 6
core id		: 3
cpu cores	: 3
apicid		: 19
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vmmcall bmi1
bugs		: fxsave_leak sysret_ss_attrs
bogomips	: 8451.92
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

processor	: 4
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD FX(tm)-6350 Six-Core Processor
stepping	: 0
microcode	: 0x6000822
cpu MHz		: 1400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 6
core id		: 4
cpu cores	: 3
apicid		: 20
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vmmcall bmi1
bugs		: fxsave_leak sysret_ss_attrs
bogomips	: 8451.92
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

processor	: 5
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 2
model name	: AMD FX(tm)-6350 Six-Core Processor
stepping	: 0
microcode	: 0x6000822
cpu MHz		: 1400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 6
core id		: 5
cpu cores	: 3
apicid		: 21
initial apicid	: 5
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vmmcall bmi1
bugs		: fxsave_leak sysret_ss_attrs
bogomips	: 8451.92
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro
Comment 1 Krzysztof Sobiecki 2015-07-06 16:05:33 UTC
Also I have problems with 32-bit apps inside chroot, while compiling some software using pbuilder I'm getting rules[25968]: segfault at 6e6f7270 ip 00000000f7624aa6 sp 00000000ffe6cb10 error 4 in libc-2.21.so[f75b2000+176000]
no idea why
Comment 2 Daniel Exner 2015-07-20 19:15:14 UTC
I can confirm this.

vendor_id       : AuthenticAMD
cpu family      : 16
model           : 10
model name      : AMD Phenom(tm) II X6 1055T Processor

glibc 2.21

I also think this bug is related to this one in Wine:

https://bugs.winehq.org/show_bug.cgi?id=38953
Comment 3 Krzysztof Sobiecki 2015-07-21 11:46:07 UTC
Posted link there, maybe someone will finally do something.
Comment 4 Krzysztof Sobiecki 2015-07-21 13:17:54 UTC
To summarize problems:

PCSX2 Boot CDVD doesn't work on. It causes segfaults

Running 32-bit binaries inside chroot causes sometimes this:
[  604.181173] configure[7748]: segfault at 7 ip 00000000f7542a7c sp 00000000ffecf580 error 4 in libc-2.21.so[f7517000+176000]
[ 1335.696203] rules[20040]: segfault at 6e6f7270 ip 00000000f7699aa6 sp 00000000ff90b0e0 error 4 in libc-2.21.so[f7627000+176000]

And this: https://bugs.winehq.org/show_bug.cgi?id=38953

Everything started with 53e9accf0f7682d717c7b578b6e01fd297ba6630

And (almost) no one cares
Comment 5 Daniel Exner 2015-07-21 19:42:09 UTC
(In reply to Krzysztof Sobiecki from comment #4)

> Everything started with 53e9accf0f7682d717c7b578b6e01fd297ba6630

Did you bisect it down to this commit? Does reverting it fix the problem for you?
Comment 6 Daniel Exner 2015-07-22 09:51:57 UTC
I tested the same kernel on another maschine and couldn't reproduce it there:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 69
model name      : Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz

So it seems this particular bug is AMD specific.
Comment 7 Krzysztof Sobiecki 2015-07-22 16:57:04 UTC
>Did you bisect it down to this commit?
Yes I did
>Does reverting it fix the problem for you?
I can't revert it on the current tree.
I have tested 53e9accf0f7682d717c7b578b6e01fd297ba6630 and commit before it and 53e9accf0f7682d717c7b578b6e01fd297ba6630 breaks things as described while one before it don't

I have also tried to revert it by hand, but it didn't do any good. I would probably have to read some docs to even start understand what's going on it that code.
Comment 8 Krzysztof Sobiecki 2015-07-22 18:40:59 UTC
Some more testing is needed:
  So can everyone test if 53e9accf0f7682d717c7b578b6e01fd297ba6630 is really broken and if 73cbf687914f works?
Thanks
Comment 9 Linus Torvalds 2015-07-22 19:05:44 UTC
Created attachment 183271 [details]
Test patch for revert on top of current -git

This is an entirely untested straight manual revert of commit 53e9accf0f76 ("x86/asm/entry/32: Do not use R9 in SYSCALL32 entry point") modified for other changes in that area.

No guarantees, but testing would be appreciated.
Comment 10 Daniel Exner 2015-07-22 19:57:09 UTC
(In reply to Linus Torvalds from comment #9)
> Created attachment 183271 [details]
> Test patch for revert on top of current -git
> 
> This is an entirely untested straight manual revert of commit 53e9accf0f76
> ("x86/asm/entry/32: Do not use R9 in SYSCALL32 entry point") modified for
> other changes in that area.
> 
> No guarantees, but testing would be appreciated.

Thanks for your help Linus, but I 'm sorry to disappoint you: I still have the same issue with your patch applied.
Comment 11 Daniel Exner 2015-07-22 19:58:52 UTC
Something that I might have forgotten to mention:

I get this effect only about in 4/5 of tries. Sometimes the call of an 32-bit binary succeeds.
Comment 12 Linus Torvalds 2015-07-22 21:44:17 UTC
Created attachment 183281 [details]
More complete revert

This is a more complete revert of the changes to the system call entry code starting with commit 53e9accf0f76.

I left the entrypoint renaming alone, and didn't undo the whitespace fixes, but this should make the actual generated code be the same as before. I didn't check if there might be some interactions with other changes, though, and this remains entirely untested.
Comment 13 Krzysztof Sobiecki 2015-07-23 08:42:19 UTC
Created attachment 183301 [details]
Linuses patch witchout last chunk

Linus suggested that removing last chunk from previous patch will fix problems with threading in 32-bit binaries so I attached modified patch for your convince. Testing is required. But with it I didn't have any problems.
Comment 14 Daniel Exner 2015-07-23 09:44:35 UTC
(In reply to Linus Torvalds from comment #12)
> Created attachment 183281 [details]
> More complete revert
> 
> This is a more complete revert of the changes to the system call entry code
> starting with commit 53e9accf0f76.
> 
> I left the entrypoint renaming alone, and didn't undo the whitespace fixes,
> but this should make the actual generated code be the same as before. I
> didn't check if there might be some interactions with other changes, though,
> and this remains entirely untested.

There seems to be some interactions. At least my box didn't boot with the patch applied. I can check later what exactly is the error, since I tried it remote.
Comment 15 Denys Vlasenko 2015-07-23 10:22:04 UTC
In general it looks like syscalls sometimes clobber a register. "segfault at 7" - looks like code dereferenced a pointer with value 7. I.e. some registed was clobbered, instead of previous contents which was a valid address now it contains an invalid one. It doesn't looks like SYSCALL insn inself fails to return to userspace, or clobbers EIP, or ESP.

The "sometimes" part is interesting. It means that execution needs to divert from the typical "fast path" for this to occur. Maybe here:

        call    *ia32_sys_call_table(, %rax, 8)
        movq    %rax, RAX(%rsp)
1:
        movl    RCX(%rsp), %ebp
        DISABLE_INTERRUPTS(CLBR_NONE)
        testl   $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
        jnz     sysretl_audit

The jnz is usually not jumping (unless audit is active). When it is not jumping, it most likely works fine. If it jumps, we enter a much messier code. My one theory is that the bug is there.

My another theory is that it (also non-deterministically) happens when bitness changes (execing 32-bit binery from 64-bit one or vice versa). It's a nasty special case where syscall entry and syscall exit have different bitness.

(In reply to Krzysztof Sobiecki from comment #4)
> Running 32-bit binaries inside chroot causes sometimes this:
> [  604.181173] configure[7748]: segfault at 7 ip 00000000f7542a7c sp
> 00000000ffecf580 error 4 in libc-2.21.so[f7517000+176000]
> [ 1335.696203] rules[20040]: segfault at 6e6f7270 ip 00000000f7699aa6 sp
> 00000000ff90b0e0 error 4 in libc-2.21.so[f7627000+176000]

Krzysztof, I would like to reproduce this. This last example seems simpler than installing a complex beast like PS2 emulator.

Can you attach your kernel .config?

Do you run your system with audit enabled (in particular, selinux always enables audit)?

Are your "32-bit binaries inside chroot" can be packaged into a compact enough set (tarball) which you can email to me so that I can execute the crashing configure run on my machine?

If it is feasible for you, rebuild the kernel with CONFIG_AUDITSYSCALL off. Does it still fail?
Comment 16 Denys Vlasenko 2015-07-23 14:28:03 UTC
Created attachment 183461 [details]
Smaller revert patch

> In general it looks like syscalls sometimes clobber a register.

To be more specific: the clobbeterd register has to be EBP, and userspace should see it as clobbered ECX because of this register shuffling in VDSO:

__kernel_vsyscall:
        push    %ebp
        movl    %ecx, %ebp
        syscall
        movl    %ebp, %ecx
        popl    %ebp

Krzysztof, glad to hear than Linus' revert patch (patch5.diff) worked!

However, I still don't see where EBP gets clobbered. Linus' patch reverted *all* changes, but some of those seem harmelss. I'm attaching patch6.diff which is a "smaller" revert: it doesn't revert comments, "xchg optimization" in auditsys_entry_common, and doesn't reintroduce ia32_sysret code path.

Can you test whether patch6.diff also works for you?
Comment 17 Krzysztof Sobiecki 2015-07-23 15:58:11 UTC
Created attachment 183481 [details]
My .config

Hope it helps, but about that chroot...
it's a pbuilder environment and it's a bit bigger than I would like to send anyone: 297M, so it's not an option.
On the other hand it's extremely easy to reproduce it with pbuilder
I was compiling mesa when I first found that problem, so I believe that any big makefile should trigger it in make. But it depends on makefiles, it works in some cases where different makefiles are presented (like building a radeon driver for X.org)
So I believe that testing PCSX2 is more reliable and the code itself that trigger it is more straightforward(they use asm in that place) than to chase optimizations in make.
there is a package for it https://packages.debian.org/sid/pcsx2
I have used version from git, but I believe that stable version should trigger that problem anyway.
Comment 18 Krzysztof Sobiecki 2015-07-23 16:00:58 UTC
> If it is feasible for you, rebuild the kernel with CONFIG_AUDITSYSCALL off.
> Does it still fail?

I assume that kernel should be without any patches? Right?
I will compile and test it as soon as I test patch6.diff
Comment 19 Krzysztof Sobiecki 2015-07-23 16:47:45 UTC
with patch6.diff I was able to run PCSX2 without problems, so I will assume that everything works as it should. Now I will test kernel with CONFIG_AUDITSYSCALL disabled
Comment 20 Krzysztof Sobiecki 2015-07-23 17:40:54 UTC
kernel with CONFIG_AUDITSYSCALL disabled have this bug too
Comment 21 Daniel Exner 2015-07-23 18:16:13 UTC
I can confirm that patch6.diff works for me to. I still want to point out that this bug seems to only affects AMD64 and not EM64T.
Comment 22 Krzysztof Sobiecki 2015-07-23 18:20:04 UTC
According to Denys Vlasenko:
"It happens only on AMD because on Intel CPUs,
in 32-bit mode SYSCALL instruction is not working.
On Intel, a different instruction (SYSENTER)
and different kernel code handles fast syscalls."
Comment 23 H. Peter Anvin 2015-07-23 18:39:28 UTC
That is indeed correct -- 32-bit SYSCALL is AMD only; the Intel equivalent is SYSENTER which has different characteristics.

For 64-bit processes, all x86 processors use the 64-bit version of the SYSCALL instruction, which behaves differently.
Comment 24 Krzysztof Sobiecki 2015-07-25 12:13:58 UTC
Fix should trickle down into Linuses tree soon
Comment 25 Linus Torvalds 2015-07-26 19:39:01 UTC
This should now be fixed in 4.2-rc4

Please verify that everything works again. Thanks.
Comment 26 Krzysztof Sobiecki 2015-07-27 07:31:15 UTC
Yes, it works fine now. I have tested PCSX2, Battle.net app and compilation in 32-bit environment. I'm no longer able to reproduce this bug.

So I will mark this bug as solved.
Comment 27 Andy Lutomirski 2015-08-25 21:45:57 UTC
I have a 2/3-written test case that's designed to exercise syscall args (compat and native), their interactions with ptrace, and their fates after syscall return.

All kernels appear to work as expected on Intel hardware (with Denys' patch and otherwise).

On a QEMU-emulated (TCG) Opteron_G4, the results are quite different.  Syscall restart is broken on all kernels I tried (i.e. ip -= 2 doesn't have the desired effect).  (Al Viro pointed this out a couple years ago, and apparently it never got fixed.)  On the other hand, even without the revert, I'm not seeing register corruption with or without ptrace, so I'm confused.  I'd really like to have a good test case for this issue.

My best current guess is that something's wrong with signal handling.

FWIW, I'm testing 8a0a5da6d9cbf1b142115ff6e6b253a82633c3d9.
Comment 28 Andy Lutomirski 2015-08-26 00:34:15 UTC
Never mind, I think I got it.  The bug was that the R9 rework broke syscall restart.  In particular, the second arg (arg1, which is regs->cx) ends up being reloaded from arg5, which is regs->bp.

IOW, syscall arg *changes* have never worked using SYSCALL32, but syscall *restart* (with unchanged args) work.

And now I have a test case.  Holy *crap* this code is a mess.
Comment 29 Linus Torvalds 2015-08-26 00:43:47 UTC
On Tue, Aug 25, 2015 at 2:45 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> On a QEMU-emulated (TCG) Opteron_G4, the results are quite different. 
> Syscall
> restart is broken on all kernels I tried (i.e. ip -= 2 doesn't have the
> desired
> effect)

So "sysenter" restart is really odd.

The initial system call is done with "sysenter", but the "ip -= 2"
trick ends up subtracting two from the sysenter return address
("VDSO32_SYSENTER_RETURN"), which points to an "int 0x80".

I forget why. I think it was a misguided attempt to not have the issue
with %rbp, but that makes little sense.

Maybe we should replace the "int 0x80" with just sysenter (still two
bytes) and make sure that the sysenter case pushes rbp and rsp as the
same value (so that the re-loading of the sixth argument works)

                 Linus
Comment 30 Linus Torvalds 2015-08-26 00:53:05 UTC
--- Comment #28 from Andy Lutomirski <luto@kernel.org> ---
>
> IOW, syscall arg *changes* have never worked using SYSCALL32, but syscall
> *restart* (with unchanged args) work.

Ugh. Do you have a patch?

And I realized that while "sysenter" is really odd, I think that it's
correct (exactly because we load %rbp from the stack and then do "int
0x80"). And the case you have is the compat case for "syscall", which
is the AMD model.

And I guess Intel UD's on syscall outside of long mode, so it's a
bitch to test, because you really do need AMD hardware.

             Linus
Comment 31 Andy Lutomirski 2015-08-26 01:54:45 UTC
(In reply to Linus Torvalds from comment #30)
> --- Comment #28 from Andy Lutomirski <luto@kernel.org> ---
> >
> > IOW, syscall arg *changes* have never worked using SYSCALL32, but syscall
> > *restart* (with unchanged args) work.
> 
> Ugh. Do you have a patch?

No, but I will as part of the C rewrite.

> 
> And I realized that while "sysenter" is really odd, I think that it's
> correct (exactly because we load %rbp from the stack and then do "int
> 0x80"). And the case you have is the compat case for "syscall", which
> is the AMD model.

Yeah, the sysenter case is better.

While it's fresh in my mind, here's the ABI as well as I can tell:

call [vsyscall]

issues a syscall with nr eax and args (ebx, ecx, edx, esi, edi, ebp).  Callers absolutely require that the exit args (or at least some of them) that aren't part of the syscall signature are unchanged -- a bunch of the (hideous) glibc asm shoves things like the PIC registers and BP in there for safe keeping for syscalls with fewer than 6 args.

ptrace expects regs->ebx to contain arg0, etc.

(ip -= 2; ax = orig_ax;) is expected to restart the syscall.  I don't know whether any user code relies on this, but the kernel definitely does.

Most likely the raw semantics of the SYSCALL32 and SYSENTER instructions are not meaningful, since SYSENTER faults on AMD, SYSCALL32 faults on Intel and on native 32-bit AMD kernels, and anyone who uses SYSCALL32 directly on anything but the very latest kernels without fixing up SS manually will get bitten by the sysret_ss_attrs thing.

I was thinking about cleaning this up by targeting the lowest common denominator.  We don't use SYSEXIT any more, so the constraints are that SYSENTER clobbers ESP and EIP, SYSCALL32 clobbers ECX, and SYSRET32 clobbers ECX.  (SYSEXIT clobbers EDX, but we don't care.)

So let's just do this:

__kernel_vsyscall:
  push %ecx
  push %ebp
  movl %esp, %ebp

ALTERNATIVE (Intel with SEP):
  sysenter

ALTERNATIVE (AMD with SYSCALL32 on 64-bit kernels):
  syscall
  hlt /* keep weird binary tracers from utterly screwing up */

ALTERNATIVE (if neither of the other cases apply):
  nops

  movl %ebp, %esp
  movl (%esp), %ebp
  movl 8(%esp), %ecx
  int $0x80
vsyscall_after_int80:
  popl %ecx
  popl %ebp
  ret

Yes, this is weird.  But bear with me :)

First, in the case where we have neither SEP nor SYSCALL32, I claim that this Just Works.  We push a couple regs, pointlessly shuffle esp, restore the regs, do int $0x80 with the same regs we started with, and then (again, pointlessly) pop the regs we pushed.

Now we make the semantics of *both* syscall32 and sysenter that they load ebp into regs->sp, fetch regs->bp and regs->cx from memory, and set regs->ip to vsyscall_after_int80.  (This is a wee bit slower than the current sysenter code because it does two memory fetches instead of one.)  Then they proceed just like int80.

Note that sysenter's slow path already sort of works like this.

Now we've fixed the correctness issues but we've killed performance, as we'll use IRET instead of SYSRET to get back out.  But we already know how to fix that: opportunistic sysret.  If we're returning from a compat syscall that entered via sysenter or syscall32, if regs->ip == vsyscall_after_int80, regs->r11 == regs->flags, regs->ss == __USER_DS, regs->cs == __USER32_CS, and flags are sane, then return using SYSRETL.  (The r11 check is probably unnecessary.)

This is not quite as elegant as 64-bit opportunistic sysret, since we're zapping ecx.  This should be unobservable except by debuggers, since we already know that we're returning to a 'pop ecx' instruction.

For native 32-bit, we obviously can't use SYSRETL, but we can do more or less the same thing with SYSEXIT.

> 
> And I guess Intel UD's on syscall outside of long mode, so it's a
> bitch to test, because you really do need AMD hardware.

QEMU contains an entire miniature and slow AMD chip :)
Comment 32 Andy Lutomirski 2015-08-26 02:06:01 UTC
On second thought, let's push and pop edx as well.  That way the exact same code works for native 32-bit as well.