Distribution: RedHat 9
Hardware Environment: athlon 900 with 386392 kB RAM
Software Environment: RedHat 9 + updates with custom configured 2.5.69 kernel.
Problem Description: Kernel freezes (no Oops) while running a user program.
Steps to reproduce: I am reluctant to try to reproduce this (I already lost a file during file system repair, and that file system check takes 30 minutes). However, it seems clear to me what caused the lockup: I was testing a program consisting of 15 threads, each of which created 100 pthread_keys and wrote to std::cerr (it is a C++ program) inside their key destruction routine, using a pthread_mutex around that write. In theory it should be reproducible by getting the source of 'libcwd' from CVS (http://libcwd.sourceforge.net) from around 3 June 0:30 am GMT (the NEWS file should mention version 0.99.31; then you have the correct time), running './bootstrap', configuring with 'CC=gcc-cvs-3.4 CXX=g++-cvs-3.4 ./configure --enable-maintainer-mode --enable-debugt --enable-debugm --enable-debug --disable-nonthreading --disable-debug-output', then running 'make' and 'make threads_threads_shared' inside testsuite/. Then run threads_threads_shared a few times (it doesn't ALWAYS happen, that much I know).
Ok, it just reproduced :(. This happened after I removed the mutex around the std::cerr << ... in the key destructor routine. Also, I now noticed that all 100 key destructor routines had already been called, and had written to std::cerr, before the kernel froze. It therefore seems to have more to do with the fact that I use 100 keys than with what I do in the key destructor routine. Note that I call 'abort()' in the first call to free() from _dl_deallocate_tls. It could be that this is the point where the freeze actually happens. I'll try what happens when I replace that abort with a write and an exit.
It froze again - very reproducible I must say; it has now frozen 3 times out of 5 runs. This was after I changed the 'abort()' into a write() + _exit(). The write() was not performed, so it seems we don't even get that far.
Hmm, the write() caused a stack overflow due to an infinite loop, because I catch most system calls too... Perhaps the lockup is thus related to a stack overflow (functions calling themselves in an infinite loop). I'll investigate this further tomorrow.
Ok, I had the kernel crash several more times. It is very reproducible. However, please note that in the meantime I changed the CVS of libcwd so that this crash no longer occurs. If you want to reproduce it then you will need to pick a time/date for the CVS checkout that gives you libcwd/threading.cc version 1.45:

// $Header: /cvsroot/libcwd/libcwd/threading.cc,v 1.45 2003/06/02 23:19:18 libcw Exp $

I tried:

cvs update -D "2003/06/02 23:19:18 PST"

and that seemed to give me all the right versions. Also note that in the meantime I tried to write a little C program to reproduce this, but failed. One of the kernel crashes happened AFTER a core dump as a result of running 'threads_threads_shared'! My conclusion is now that something gets messed up in the kernel state as a result of my test program, but that it is more complex than just using a lot of pthread_keys. You'll have to check out libcwd from CVS to reproduce this, I am afraid.
This bug is extremely old, I'm not aware of any similar NPTL problems as of now. Closing.