Bug 5826

Summary: Multi-thread corefiles broken since April 2005
Product: Platform Specific/Hardware Reporter: Steve Work (swork)
Component: i386 Assignee: Stas Sergeev (stsp2)
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, bunk, swork
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.11.8 or so (5df240826c90afdc7956f55a004ea6b702df9203) Subsystem:
Regression: --- Bisected commit-id:
Attachments: Proposed fix by Stas Sergeev

Description Steve Work 2006-01-04 10:35:28 UTC
Most recent kernel where this bug did not occur: Introduced with
5df240826c90afdc7956f55a004ea6b702df9203

Distribution: Kernel built from tree 5df240826c90afdc7956f55a004ea6b702df9203 or
later; Debian and Gentoo at least
Hardware Environment: i386 PC
Software Environment: Debian and Gentoo at least
Problem Description:

Coredumps from programs with more than one thread show garbage information for
all threads except the primary.  The problem was introduced with:

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5df240826c90afdc7956f55a004ea6b702df9203

on Apr 16, 2005 ("fix crash in entry.S restore_all") and is still present in
current builds.

To see the problem, send this program SIGSEGV ("kill -SEGV") and run
"info threads" in gdb on the resulting corefile:

 #include <pthread.h>
 #include <unistd.h>   /* for sleep() */
 /* Each worker thread just sleeps forever. */
 static void* thread_sleep(void* x) { while (1) sleep(30); }
 int main(int c, char** v) {
     const static int tcount = 5;
     pthread_t thr[tcount];
     int i;
     for (i=0; i<tcount; ++i)
         pthread_create(&thr[i], NULL, thread_sleep, NULL);
     while (1)
         sleep(30);
     return 0;
 }

 (gdb) info threads
   7 process 18138  0x00000246 in ?? ()
   6 process 18139  0x00000246 in ?? ()
   5 process 18140  0x00000246 in ?? ()
   4 process 18141  0x00000246 in ?? ()
   3 process 18142  0x00000246 in ?? ()
   2 process 18143  0x00000246 in ?? ()
 * 1 process 18137  0xb7e69db6 in nanosleep () from /lib/tls/libc.so.6
 (gdb)

All of these threads should show a legitimate location (the same spot in
nanosleep), and they do on kernels prior to the commit named above.  (Notice that
one thread too many is listed here as well -- is this a related problem?)
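
The reproduction steps above only yield a corefile if the core size limit
permits one. A minimal sketch, assuming a POSIX system: the helper name
enable_core_dumps is hypothetical and is not part of the original test program.
It raises the soft RLIMIT_CORE limit to the hard limit with getrlimit() and
setrlimit().

```c
#include <sys/resource.h>

/* Hypothetical helper, not part of the original test program:
 * raise the soft core-size limit to the hard limit so that a
 * fatal signal is allowed to produce a core file.
 * Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* soft limit may not exceed the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling enable_core_dumps() at the start of main() in the test program is one
option; running "ulimit -c unlimited" in the shell beforehand is another.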

Commenting out this line (in arch/i386/kernel/process.c:copy_thread) fixes the
corefiles:

  childregs = (struct pt_regs *) ((unsigned long) childregs - 8);

but presumably re-introduces the crash the original patch was intended to fix.
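
One possible reading of the 0x00000246 values, sketched as a user-space model
rather than kernel code: if the register frame is stored 8 bytes lower than
usual but a reader still looks at the usual position, every field is read two
32-bit slots late, so the eip slot picks up eflags (0x246 is a common eflags
value). The assumptions here are the struct name model_regs, the field order
(recalled from the 2.6-era i386 struct pt_regs, not copied from the header),
the function name misread_eip, and the idea that the corefile writer reads from
the unshifted position. None of this is taken from the attached patch.

```c
#include <stdint.h>
#include <string.h>

/* User-space model of the i386 register frame: 15 slots of 32 bits.
 * The field order is an assumption based on the 2.6-era struct pt_regs. */
struct model_regs {
    uint32_t ebx, ecx, edx, esi, edi, ebp, eax;
    uint32_t xds, xes;
    uint32_t orig_eax, eip;
    uint32_t xcs, eflags, esp, xss;
};

enum { MODEL_STACK_SIZE = 8192, MODEL_SHIFT = 8 };

/* Store a frame MODEL_SHIFT bytes below the usual top-of-stack position,
 * then read it back from the unshifted position, as a reader unaware of
 * the shift would. Returns the value that reader sees in the eip slot. */
uint32_t misread_eip(uint32_t real_eip, uint32_t real_eflags)
{
    static unsigned char stack[MODEL_STACK_SIZE];
    struct model_regs regs, seen;
    unsigned char *top = stack + sizeof(stack);

    memset(stack, 0, sizeof(stack));
    memset(&regs, 0, sizeof(regs));
    regs.eip = real_eip;
    regs.eflags = real_eflags;

    /* Writer: the frame sits 8 bytes lower than the usual position. */
    memcpy(top - sizeof(regs) - MODEL_SHIFT, &regs, sizeof(regs));
    /* Reader: still assumes the usual position. */
    memcpy(&seen, top - sizeof(regs), sizeof(seen));
    return seen.eip;
}
```

In this model, misread_eip(0xb7e69db6, 0x246) returns 0x246: the real eip is
lost and the eflags value appears in its place, which resembles the gdb output
for the non-primary threads.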

Comment 1 Steve Work 2006-01-04 10:37:30 UTC
Created attachment 6931
Proposed fix by Stas Sergeev

Stas Sergeev wrote this patch and reports it appears to help.

Comment 2 Adrian Bunk 2006-01-04 10:41:00 UTC
The proposed fix by Stas is included in 2.6.15.

Can you confirm it's fixed in 2.6.15?

Comment 3 Steve Work 2006-01-05 12:05:55 UTC
Yes, confirmed fixed in 2.6.15; and the patch backports cleanly to other kernels
in the problem range and works fine there too.  Thank you!