hi all,

We encountered a very strange panic on 2.6.30 that we have not been able to reproduce. It occurred at do_wait -> release_task -> __exit_signal -> BUG_ON(!atomic_read(&sig->count)). We tried our best to analyze and reproduce this bug, but failed every time. Could you help us solve it: why did this bug happen, how can we reproduce it, and how can we fix it? Thank you very much!

1. Version details:

cat /proc/version
Linux version 2.6.30 (root@DEVfc9) (gcc version 4.3.0 20080428 (Red Hat 4.3.0-8) (GCC) ) #17 SMP Mon May 24 19:38:11 CST 2010

2. Call trace:

<0>[3809706.114772] ------------[ cut here ]------------
<2>[3809706.114806] Kernel BUG at 4042c07a [verbose debug info unavailable]
<0>[3809706.114835] invalid opcode: 0000 [#1] SMP
<0>[3809706.114862] last sysfs file:
<4>[3809706.114878] Modules linked in: dos_kernel antidos antireplay syn_cookie ip_mac av_proxy_vif ring_packet aaa ipsec cryptosoft ocf kstartLog(P) log_kernel_thread pppoe_handle route_reflect route_helper report tm cservice ipr_status cf_filter wdt o2hal libcrc32c session_kernel nf_nat_sqlnet nf_nat_mms nf_nat_rtsp nf_nat_ftp nf_nat_h323 nf_nat_irc nf_nat_tftp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_conntrack_ftp nf_conntrack_h323 nf_conntrack_irc nf_conntrack_mms nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_rtsp nf_conntrack_sip nf_conntrack_sqlnet nf_conntrack_tftp dnat_kernel_api snat_kernel_api slb_kernel_api maplist_kernel_api admin_kernel_api aclv6_kernel_api aclv4_kernel_api alg_kernel_api obj_kernelv6 obj_kernel xt_tcpudp xt_limit xt_state xt_normaltosw xt_TCPMSS xt_DEBUG xt_MSG xt_bridge xt_csm ipt_aaa ip6t_LOG ipt_LOG ipt_IFSNAT ebtable_filter ebtables ip6table_filter ip6_tables nf_conntrack_session_fn iptable_mangle iptable_nat iptable_filter ip_tables nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 x_tables nf_conntrack common_kernel_api age_kernel_api triones_kernel triones_kernel_msg dump ax88742_1n2p r8169 sasic
<4>[3809706.115008]
<4>[3809706.115008] Pid: 1717, comm: httpd Tainted: P (2.6.30 #17) 945GSE-ICH7M
<4>[3809706.115008] EIP: 0060:[<4042c07a>] EFLAGS: 00010046 CPU: 0
<4>[3809706.115008] EIP is at release_task+0x32a/0x340
<4>[3809706.115008] EAX: 00000000 EBX: 3ffffffd ECX: 00000000 EDX: 00000001
<4>[3809706.115008] ESI: a4138370 EDI: aa2eae00 EBP: 00007b04 ESP: be735ec0
<4>[3809706.115008] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
<0>[3809706.115008] Process httpd (pid: 1717, ti=be734000 task=a4acc370 task.ti=be734000)
<0>[3809706.115008] Stack:
<4>[3809706.115008]  409fef68 00000001 3ffffffd 00007b04 a4138370 00007b04 4042c66f a180a000
<0>[3809706.115008]  00000100 000000d0 a180a01c 00000000 407c81d6 00000063 00000000 5c472c00
<0>[3809706.115008]  00000001 00000000 00000046 00000021 064d45de 00000000 a4138370 a4acc370
<0>[3809706.115008] Call Trace:
<0>[3809706.115008]  [<4042c66f>] ? wait_consider_task+0x5df/0x850
<0>[3809706.115008]  [<407c81d6>] ? tcp_current_mss+0x36/0x60
<0>[3809706.115008]  [<4042ca1b>] ? do_wait+0x13b/0x350
<0>[3809706.115008]  [<40423f50>] ? default_wake_function+0x0/0x10
<0>[3809706.115008]  [<4042ccac>] ? sys_wait4+0x7c/0xd0
<0>[3809706.115008]  [<4042cd27>] ? sys_waitpid+0x27/0x30
<0>[3809706.115008]  [<40402e08>] ? sysenter_do_call+0x12/0x26
<0>[3809706.115008] Code: 00 02 20 00 8b 14 24 64 a1 80 e4 9f 40 ff 0c 10 e9 4a fe ff ff 8d 74 26 00 0f 0b eb fe 8d 74 26 00 0f 0b eb fe 0f 0b 66 90 eb fc <0f> 0b 8d 74 26 00 eb fa 0f 0b 8d 74 26 00 eb fa 8d b6 00 00 00
<0>[3809706.115008] EIP: [<4042c07a>] release_task+0x32a/0x340 SS:ESP 0068:be735ec0

3. Related task info:

<4>[3809706.115008] 0xa4acd810  1713     1  0  0  S  0xa4acda18  httpd
<4>[3809706.115008] 0xa4acc370  1717     1  1  0  S  0xa4acc578 *httpd
<4>[3809706.115008] Error: no saved data for this cpu
<4>[3809706.115008] 0xbe7fedc0 30547  1717  0  0  R  0xbe7fefc8  httpd
<4>[3809706.115008] 0xa4138370 31492  1717  0  0  E  0xa4138578  httpd
<4>[3809706.115008] 0xa1d68000 32522  1713  0  0  S  0xa1d68208  httpd
<4>[3809706.115008] 0xa1d69810   693  1713  0  0  S  0xa1d69a18  httpd

4. The code that panics directly is in kernel/exit.c: release_task -> __exit_signal -> BUG_ON(!atomic_read(&sig->count)). EIP is at release_task+0x32a/0x340. The detailed disassembly analysis (objdump -S exit.o) follows:

00000950 <release_task>:
     950:	55                   	push   %ebp
     951:	57                   	push   %edi
     952:	56                   	push   %esi
     953:	89 c6                	mov    %eax,%esi
     955:	53                   	push   %ebx
     956:	b8 00 00 00 00       	mov    $0x0,%eax
     95b:	83 ec 08             	sub    $0x8,%esp
     95e:	89 04 24             	mov    %eax,(%esp)
     961:	8b 86 d0 01 00 00    	mov    0x1d0(%esi),%eax
     967:	8b 50 48             	mov    0x48(%eax),%edx
     96a:	8d 42 04             	lea    0x4(%edx),%eax
     96d:	f0 ff 4a 04          	lock decl 0x4(%edx)
     971:	89 f0                	mov    %esi,%eax
     973:	e8 fc ff ff ff       	call   974 <release_task+0x24>
     978:	b8 00 00 00 00       	mov    $0x0,%eax
     97d:	e8 fc ff ff ff       	call   97e <release_task+0x2e>

static inline void ptrace_release_task(struct task_struct *task)
{
	BUG_ON(!list_empty(&task->ptraced));
	ptrace_unlink(task);
	BUG_ON(!list_empty(&task->ptrace_entry));
}

     982:	8d 86 24 01 00 00    	lea    0x124(%esi),%eax
	BUG_ON(!list_empty(&task->ptraced));
     988:	39 86 24 01 00 00    	cmp    %eax,0x124(%esi)
     98e:	0f 85 dc 02 00 00    	jne    c70 <release_task+0x320>
     994:	8b 4e 10             	mov    0x10(%esi),%ecx
	ptrace_unlink(task);
     997:	85 c9                	test   %ecx,%ecx
     999:	0f 85 01 02 00 00    	jne    ba0 <release_task+0x250>
     99f:	8d 86 2c 01 00 00    	lea    0x12c(%esi),%eax
     9a5:	39 86 2c 01 00 00    	cmp    %eax,0x12c(%esi)
	BUG_ON(!list_empty(&task->ptrace_entry));
     9ab:	0f 85 b7 02 00 00    	jne    c68 <release_task+0x318>

static void __exit_signal(struct task_struct *tsk)
{
	struct signal_struct *sig = tsk->signal;
	struct sighand_struct *sighand;

	BUG_ON(!sig);
	BUG_ON(!atomic_read(&sig->count));

     9b1:	8b be 98 02 00 00    	mov    0x298(%esi),%edi		; struct signal_struct *sig = tsk->signal;
     9b7:	85 ff                	test   %edi,%edi
     9b9:	0f 84 c3 02 00 00    	je     c82 <release_task+0x332>	; BUG_ON(!sig);
     9bf:	8b 07                	mov    (%edi),%eax		; ==> BUG_ON(!atomic_read(&sig->count));
     9c1:	85 c0                	test   %eax,%eax
     9c3:	0f 84 b1 02 00 00    	je     c7a <release_task+0x32a>
	...
     c7a:	0f 0b                	ud2a			; <== final crash location (0x950 + 0x32a)
     c7c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
     c80:	eb fa                	jmp    c7c <release_task+0x32c>
     c82:	0f 0b                	ud2a
hm, once-off kernel crashes are hard. We often (and I think correctly) assume that they're caused by rare hardware failures. Has it been repeatable at all?
(In reply to comment #1)
> hm, once-off kernel crashes are hard. We often (and I think correctly) assume
> that they're caused by rare hardware failures.
> Has it been repeatable at all?

hi, thanks for the comment. It occurred six times on the same box last month, but we could not catch any clues and are still confused. In the end we had to apply the following patch to try to avoid it. As of 2.6.35 the related code has been refactored completely (but for some reason we cannot upgrade from 2.6.30 to 2.6.35 directly), so I think there are some unknown bugs in the related code.

--- linux/kernel/exit.c	2010-08-09 08:03:52.000000000 +0800
+++ linux/kernel/exit.c	2010-08-09 08:21:38.000000000 +0800
@@ -62,11 +62,11 @@
 static void exit_mm(struct task_struct * tsk);
 
-static void __unhash_process(struct task_struct *p)
+static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
 	detach_pid(p, PIDTYPE_PID);
-	if (thread_group_leader(p)) {
+	if (group_dead) {
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
@@ -83,16 +83,20 @@
 static void __exit_signal(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
+	bool group_dead = thread_group_leader(tsk);
 	struct sighand_struct *sighand;
 
+#ifdef CONFIG_KDB
 	BUG_ON(!sig);
 	BUG_ON(!atomic_read(&sig->count));
+#endif
 
 	sighand = rcu_dereference(tsk->sighand);
 	spin_lock(&sighand->siglock);
+	atomic_dec(&sig->count);
 
 	posix_cpu_timers_exit(tsk);
-	if (atomic_dec_and_test(&sig->count))
+	if (group_dead)
 		posix_cpu_timers_exit_group(tsk);
 	else {
 		/*
@@ -125,10 +129,9 @@
 		sig->oublock += task_io_get_oublock(tsk);
 		task_io_accounting_add(&sig->ioac, &tsk->ioac);
 		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
-		sig = NULL; /* Marker for below. */
 	}
 
-	__unhash_process(tsk);
+	__unhash_process(tsk, group_dead);
 
 	/*
 	 * Do this under ->siglock, we can race with another thread
@@ -142,7 +145,7 @@
 	__cleanup_sighand(sighand);
 	clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 
-	if (sig) {
+	if (group_dead) {
 		flush_sigqueue(&sig->shared_pending);
 		taskstats_tgid_free(sig);
 	/*
On 08/26, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> --- Comment #2 from yi.he@o2micro.com 2010-08-26 01:17:00 ---
>
> hi, thanks for the comment. It occurred six times on the same box last
> month, but we could not catch any clues and are still confused. In the end
> we had to apply the following patch to try to avoid it.

Not sure I understand. The following patch can't be applied on top of 2.6.30. And in any case it shouldn't be, it relies on previous changes. Do you run 2.6.30 with some other patches?

> about 2.6.35, related code have been refactored completely

Yes, this all was refactored.

> (but for some reason, we can not upgrade from 2.6.30 to 2.6.35 directly),
> so i think there are some unknown bugs in related code.

I hope not, but everything is possible ;) In fact I _hope_ this problem was fixed even before the refactoring of this code. There were a lot of fixes in this area after 2.6.30. But I must admit, I can't recall a change which could explain this particular problem, and while the signal->count logic was buggy before v2.6.31-rc7-114-g4ab6c08, I don't think that can be connected.

It looks like this task was already reaped. Are these httpd's multithreaded? If you used something like strace, then we have even more fixes which could be connected to this problem. And TAINT_PROPRIETARY_MODULE in the calltrace doesn't make the problem any clearer.

So, sorry, right now I have no idea. It would be great if you could test recent kernels or find a reproducer, but I am sure you know this ;)

Oleg.