Bug 16527

| Summary: | panic at do_wait->release_task->__exit_signal->BUG_ON(!atomic_read(&sig->count)); | | |
|---|---|---|---|
| Product: | Process Management | Reporter: | yi.he |
| Component: | Scheduler | Assignee: | Ingo Molnar (mingo) |
| Status: | RESOLVED OBSOLETE | | |
| Severity: | blocking | CC: | akpm, alan, oleg |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.30 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description

yi.he, 2010-08-06 04:32:50 UTC

Comment 1

Hm, once-off kernel crashes are hard. We often (and I think correctly) assume that they're caused by rare hardware failures. Has it been repeatable at all?

Comment 2

yi.he, 2010-08-26 01:17:00 UTC

(In reply to comment #1)
> Hm, once-off kernel crashes are hard. We often (and I think correctly) assume
> that they're caused by rare hardware failures.
> Has it been repeatable at all?

Hi, thanks for the comment. It occurred six times on the same box last month, but we could not catch any clues and are still confused; in the end we had to apply the following patch to try to avoid it. In 2.6.35 the related code has been refactored completely (but for some reason we cannot upgrade from 2.6.30 to 2.6.35 directly), so I think there are some unknown bugs in the related code.

```diff
--- linux/kernel/exit.c	2010-08-09 08:03:52.000000000 +0800
+++ linux/kernel/exit.c	2010-08-09 08:21:38.000000000 +0800
@@ -62,11 +62,11 @@
 static void exit_mm(struct task_struct * tsk);
 
-static void __unhash_process(struct task_struct *p)
+static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
 	detach_pid(p, PIDTYPE_PID);
-	if (thread_group_leader(p)) {
+	if (group_dead) {
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
@@ -83,16 +83,20 @@
 static void __exit_signal(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
+	bool group_dead = thread_group_leader(tsk);
 	struct sighand_struct *sighand;
 
+#ifdef CONFIG_KDB
 	BUG_ON(!sig);
 	BUG_ON(!atomic_read(&sig->count));
+#endif
 
 	sighand = rcu_dereference(tsk->sighand);
 	spin_lock(&sighand->siglock);
 
+	atomic_dec(&sig->count);
 	posix_cpu_timers_exit(tsk);
-	if (atomic_dec_and_test(&sig->count))
+	if (group_dead)
 		posix_cpu_timers_exit_group(tsk);
 	else {
 		/*
@@ -125,10 +129,9 @@
 		sig->oublock += task_io_get_oublock(tsk);
 		task_io_accounting_add(&sig->ioac, &tsk->ioac);
 		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
-		sig = NULL; /* Marker for below. */
 	}
 
-	__unhash_process(tsk);
+	__unhash_process(tsk, group_dead);
 
 	/*
 	 * Do this under ->siglock, we can race with another thread
@@ -142,7 +145,7 @@
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk,TIF_SIGPENDING);
 
-	if (sig) {
+	if (group_dead) {
 		flush_sigqueue(&sig->shared_pending);
 		taskstats_tgid_free(sig);
 		/*
```
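For readers skimming the diff: the functional change is that `__exit_signal()` now decrements `sig->count` unconditionally and keys all group-wide cleanup off a `group_dead` flag sampled via `thread_group_leader()` on entry, instead of off `atomic_dec_and_test()` and the old `sig = NULL` marker; the BUG_ON pair is also compiled in only under CONFIG_KDB. A tiny standalone model of the two predicates, with illustrative names rather than kernel code:

```c
/* Hypothetical side-by-side of the old and new "do group cleanup?"
 * predicates from the diff above; plain C11, not kernel code. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Old 2.6.30 logic: whoever drops the count to zero does cleanup. */
static bool old_group_dead(atomic_int *count)
{
	return atomic_fetch_sub(count, 1) == 1;	/* atomic_dec_and_test() */
}

/* Patched logic: decide up front, then decrement unconditionally. */
static bool new_group_dead(atomic_int *count, bool is_leader)
{
	atomic_fetch_sub(count, 1);		/* atomic_dec(&sig->count) */
	return is_leader;			/* thread_group_leader(tsk) */
}

int main(void)
{
	atomic_int count;
	bool r1, r2;

	/* Old predicate: a sub-thread is released first; the group
	 * leader is reaped last (release_task() defers the leader). */
	atomic_init(&count, 2);
	r1 = old_group_dead(&count);		/* sub-thread: 0 */
	r2 = old_group_dead(&count);		/* leader:     1 */
	printf("old: %d %d\n", r1, r2);

	/* New predicate: same outcome in the normal case. */
	atomic_init(&count, 2);
	r1 = new_group_dead(&count, false);	/* sub-thread: 0 */
	r2 = new_group_dead(&count, true);	/* leader:     1 */
	printf("new: %d %d\n", r1, r2);
	return 0;
}
```

In the normal case the two predicates agree; the patch does not claim to fix the underlying race, which is why the reporter describes it as an attempt "to avoid" the panic.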
Comment 3

Oleg, 2010-08-26

On 08/26, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> --- Comment #2 from yi.he@o2micro.com 2010-08-26 01:17:00 ---
>
> Hi, thanks for the comment. It occurred six times on the same box last
> month, but we could not catch any clues and are still confused; in the
> end we had to apply the following patch to try to avoid it.

Not sure I understand. The following patch can't be applied on top of 2.6.30, and in any case it shouldn't be: it relies on previous changes. Do you run 2.6.30 with some other patches?

> In 2.6.35 the related code has been refactored completely

Yes, this all was refactored.

> (but for some reason we cannot upgrade from 2.6.30 to 2.6.35 directly),
> so I think there are some unknown bugs in the related code.

I hope not, but everything is possible ;) In fact I _hope_ this problem was fixed even before the refactoring of this code. There were a lot of fixes in this area after 2.6.30. But I must admit I can't recall a change that could explain this particular problem, and while the signal->count logic was buggy before v2.6.31-rc7-114-g4ab6c08, I don't think that can be connected.

It looks like this task was already reaped. Are these httpd processes multithreaded? If you used something like strace, then we have even more fixes which could be connected to this problem.

And TAINT_PROPRIETARY_MODULE in the calltrace doesn't make the problem any clearer.

So, sorry, right now I have no idea. It would be great if you could test the recent kernels or find a reproducer, but I am sure you know this ;)

Oleg.
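To make the "already reaped" theory concrete: if `release_task()`, and with it `__exit_signal()`, somehow runs a second time for the same task, the shared refcount has already reached zero by the second entry, which is exactly the state the reported `BUG_ON(!atomic_read(&sig->count))` checks for. A minimal userspace sketch of that double release, again with illustrative stand-in names rather than real kernel structures:

```c
/*
 * Hypothetical model of the suspected failure mode, not kernel code.
 * In 2.6.30, sig->count counts the threads attached to the shared
 * signal_struct; release_task() drops it once per reaped thread.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

struct signal_model {
	atomic_int count;	/* stand-in for signal_struct.count */
};

static void exit_signal_model(struct signal_model *sig)
{
	/* Stand-in for BUG_ON(!atomic_read(&sig->count)). */
	assert(atomic_load(&sig->count) != 0);
	atomic_fetch_sub(&sig->count, 1);
}

int main(void)
{
	struct signal_model sig;

	atomic_init(&sig.count, 1);	/* one thread, one expected release */

	exit_signal_model(&sig);	/* normal release_task() path */
	printf("count after first release: %d\n", atomic_load(&sig.count));

	exit_signal_model(&sig);	/* double release: the assert trips
					 * here, mirroring the panic */
	return 0;
}
```

Under this model the assertion in the second call corresponds to the BUG_ON in the backtrace; it does not explain why the extra release happens, which is the open question in the thread.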