Bug 769 - Kernel locks up when heavily using pthread keys (NPTL)
Status: CLOSED CODE_FIX
Alias: None
Product: Process Management
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P2 high
Assignee: Alexander Nyberg
Reported: 2003-06-02 17:33 UTC by Carlo Wood
Modified: 2004-12-04 06:17 UTC
Kernel Version: 2.5.69

Description Carlo Wood 2003-06-02 17:33:53 UTC
Distribution: RedHat 9
Hardware Environment: athlon 900 with 386392 kB RAM
Software Environment: RedHat 9 + updates with custom configured 2.5.69 kernel.
Problem Description: Kernel freezes (no Oops) while running user program.

Steps to reproduce:
I am reluctant to try to reproduce this (I already lost a file
during file system repair, and the file system check takes 30
minutes).  However, it seems clear to me what caused
the lockup: I was testing a program consisting of 15 threads, each
of which created 100 pthread_keys and wrote to std::cerr (it
is a C++ program) inside its key destruction routine, using a
pthread_mutex around that write.  Theoretically it should be reproducible
by getting the source of 'libcwd' from CVS (http://libcwd.sourceforge.net)
as of around 3 June 0:30 am GMT (the NEWS file should
mention version 0.99.31; if it does, you have the correct time), running
'./bootstrap', configuring with 'CC=gcc-cvs-3.4 CXX=g++-cvs-3.4 ./configure
--enable-maintainer-mode --enable-debugt --enable-debugm --enable-debug
--disable-nonthreading --disable-debug-output', and then running 'make' and
'make threads_threads_shared' inside testsuite/.  Then run
threads_threads_shared a few times (as far as I know, it doesn't
ALWAYS happen).
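
The usage pattern described above (threads that create pthread keys and
write to std::cerr from the key destructor while holding a mutex) can be
sketched roughly as follows.  This is only an illustration of the pattern,
not code from libcwd, and it is not known to reproduce the freeze.  The
names cerr_mutex, key_destructor, worker and run_key_stress are hypothetical.

```cpp
#include <pthread.h>

#include <atomic>
#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

// All names below are hypothetical; none of this is libcwd code.
static pthread_mutex_t cerr_mutex = PTHREAD_MUTEX_INITIALIZER;
static std::atomic<int> destructor_calls{0};

// Called at thread exit for each key whose value in that thread is non-null.
static void key_destructor(void* value) {
    pthread_mutex_lock(&cerr_mutex);
    std::cerr << "key destructor, value "
              << reinterpret_cast<std::intptr_t>(value) << '\n';
    pthread_mutex_unlock(&cerr_mutex);
    ++destructor_calls;
}

// Each thread creates its own keys and stores a non-null value in each one.
static void* worker(void* arg) {
    const int keys_per_thread = *static_cast<int*>(arg);
    for (int i = 0; i < keys_per_thread; ++i) {
        pthread_key_t key;
        if (pthread_key_create(&key, key_destructor) != 0)
            break;  // e.g. EAGAIN once the process-wide key limit is reached
        const int rc = pthread_setspecific(
            key, reinterpret_cast<void*>(static_cast<std::intptr_t>(i + 1)));
        assert(rc == 0);
        (void)rc;
    }
    return nullptr;  // the key destructors run after this return
}

// Starts the threads, joins them, and returns the number of destructor calls.
int run_key_stress(int thread_count, int keys_per_thread) {
    destructor_calls = 0;
    std::vector<pthread_t> threads;
    for (int t = 0; t < thread_count; ++t) {
        pthread_t tid;
        if (pthread_create(&tid, nullptr, worker, &keys_per_thread) == 0)
            threads.push_back(tid);
    }
    for (pthread_t tid : threads)
        pthread_join(tid, nullptr);
    return destructor_calls.load();
}
```

Calling run_key_stress(15, 100) from main() would mirror the numbers in
this report.  Keys are a process-wide resource limited by PTHREAD_KEYS_MAX
and this sketch never deletes them, so pthread_key_create() may start
failing before all 1500 keys exist; in that case the worker simply stops
creating keys.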
Comment 1 Carlo Wood 2003-06-02 18:05:56 UTC
Ok, it just reproduced :(.

This happened after I removed the mutex around
the std::cerr << ... in the key destructor
routine.  Also, I now noticed that all 100
key destructor routines had already been called,
and had written to std::cerr, before the kernel
froze.  It therefore seems to have more to do
with the fact that I use 100 keys than
with what I do in the key destructor routine.

Note that I call 'abort()' in the first call
to free() from _dl_deallocate_tls.  It could
be that this is the point where the freeze
actually happens.  I'll try what happens
when I replace that abort with a write and
an exit.
Comment 2 Carlo Wood 2003-06-02 18:26:29 UTC
It froze again - very reproducible, I must say;
it has now frozen 3 times out of 5 runs.

This was after I changed the 'abort()' into
a write() + _exit.  The write() was not performed,
so we don't even get there it seems.
Comment 3 Carlo Wood 2003-06-02 18:56:44 UTC
Hmm, the write() caused a stack overflow
due to an infinite loop, because I catch
most system calls too...
Perhaps the lockup is thus related to a stack
overflow (functions calling themselves
in an infinite loop).

I'll investigate this further tomorrow.
Comment 4 Carlo Wood 2003-06-03 06:03:35 UTC
Ok, I had the kernel crash several more times.
It is very reproducible.

However, please note that in the meantime I changed
the CVS of libcwd so that this crash no longer occurs.

If you want to reproduce it then you will need to
pick a time/date for the CVS checkout that will
give you libcwd/threading.cc version 1.45:

// $Header: /cvsroot/libcwd/libcwd/threading.cc,v 1.45 2003/06/02 23:19:18 libcw Exp $

I tried:

cvs update -D "2003/06/02 23:19:18 PST"

and that seemed to give me all the right versions.

Also note that in the meantime I tried to write
a little C program to reproduce this, but failed.

One of the kernel crashes happened AFTER a
core dump as a result of running 'threads_threads_shared'!
My conclusion is now that something is messed up in
the kernel state as a result of my test program, but
that it is more complex than just using a lot of
pthread_keys.  You'll have to check out libcwd from
CVS to reproduce this, I am afraid.


Comment 5 Alexander Nyberg 2004-12-04 06:14:22 UTC
This bug is extremely old, and I'm not aware of any similar NPTL problems as of now.
Closing.
