Bug 5826 - Multi-thread corefiles broken since April 2005
Status: CLOSED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: i386
Hardware: i386 Linux
Importance: P2 normal
Assignee: Stas Sergeev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-01-04 10:35 UTC by Steve Work
Modified: 2006-01-05 12:46 UTC

See Also:
Kernel Version: 2.6.11.8 or so (5df240826c90afdc7956f55a004ea6b702df9203)
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Proposed fix by Stas Sergeev (505 bytes, patch)
2006-01-04 10:37 UTC, Steve Work

Description Steve Work 2006-01-04 10:35:28 UTC
Most recent kernel where this bug did not occur: Introduced with
5df240826c90afdc7956f55a004ea6b702df9203

Distribution: Kernel built from tree 5df240826c90afdc7956f55a004ea6b702df9203 or
later; seen on at least Debian and Gentoo
Hardware Environment: i386 PC
Software Environment: At least Debian and Gentoo
Problem Description:

Coredumps from programs with more than one thread show garbage information for
all threads except the primary.  The problem was introduced with:

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5df240826c90afdc7956f55a004ea6b702df9203

on Apr 16, 2005 ("fix crash in entry.S restore_all") and is still present in
current builds.

"kill -SEGV" this program and "info threads" the resulting corefile to see the
problem:

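 /* Build with pthread support, e.g. gcc -pthread */
 #include <pthread.h>
 #include <unistd.h>   /* sleep() */

 /* Each worker thread just sleeps forever. */
 static void* thread_sleep(void* x) { while (1) sleep(30); }

 int main(int c, char** v) {
     static const int tcount = 5;
     pthread_t thr[tcount];
     int i;
     for (i=0; i<tcount; ++i)
         pthread_create(&thr[i], NULL, thread_sleep, NULL);
     /* Keep the primary thread alive until the process is killed. */
     while (1)
         sleep(30);
     return 0;
 }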

 (gdb) info threads
   7 process 18138  0x00000246 in ?? ()
   6 process 18139  0x00000246 in ?? ()
   5 process 18140  0x00000246 in ?? ()
   4 process 18141  0x00000246 in ?? ()
   3 process 18142  0x00000246 in ?? ()
   2 process 18143  0x00000246 in ?? ()
 * 1 process 18137  0xb7e69db6 in nanosleep () from /lib/tls/libc.so.6
 (gdb)

All these threads should show a legitimate location (the same spot in nanosleep)
and do on kernels prior to the commit named above.  (Notice one too many threads
listed here also -- is this a related problem?)

Commenting out this line (in arch/i386/kernel/process.c:copy_thread) fixes the
corefiles:

  childregs = (struct pt_regs *) ((unsigned long) childregs - 8);

but presumably re-introduces the crash the original patch was intended to fix.
Comment 1 Steve Work 2006-01-04 10:37:30 UTC
Created attachment 6931
Proposed fix by Stas Sergeev

Stas Sergeev wrote this patch and reports it appears to help.
Comment 2 Adrian Bunk 2006-01-04 10:41:00 UTC
The proposed fix by Stas is included in 2.6.15.

Can you confirm it's fixed in 2.6.15?
Comment 3 Steve Work 2006-01-05 12:05:55 UTC
Yes, confirmed fixed in 2.6.15; and the patch backports cleanly to other kernels
in the problem range and works fine there too.  Thank you!
