Most recent kernel where this bug did not occur: Default RHEL AS4/U1 Kernel (2.6.9-11.ELsmp) Distribution: RHEL AS4/U1, but with a stock kernel from kernel.org Hardware Environment: Iwill H8501; http://www.iwill.net/product_2.asp?p_id=90&sp=Y Software Environment: RHEL AS4/U1 Problem Description: When executing a piece of chemistry software, it crashes and the following message appears in the dmesg. ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at "fs/exec.c":788 invalid operand: 0000 [1] SMP CPU 8 Modules linked in: ipv6 autofs4 sunrpc dm_mod video button battery ac ohci_hcd hw_random e100 mii e1000 floppy ext3 jbd 3w_xxxx sd_mod scsi_mod Pid: 3786, comm: l1.exe Not tainted 2.6.13 RIP: 0010:[<ffffffff8018e0c5>] <ffffffff8018e0c5>{flush_old_exec+1573} RSP: 0018:ffff81053f347ce8 EFLAGS: 00010287 RAX: ffff81044387e8b8 RBX: 00000000ffffffb0 RCX: 00000000ffffffff RDX: ffff81044387e730 RSI: ffff8103439a5700 RDI: ffff81013fffc980 RBP: ffff810343890a80 R08: ffff81053f346000 R09: 0000000000000000 R10: 0000000000000200 R11: 00000000ffffffff R12: 0000000000000001 R13: ffff810749b13500 R14: ffff810749b13d08 R15: ffff8103439a5700 FS: 00002aaab1db9ee0(0000) GS:ffffffff80519c00(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000047374c CR3: 0000000749b23000 CR4: 00000000000006e0 Process l1.exe (pid: 3786, threadinfo ffff81053f346000, task ffff81044387e730) Stack: ffff81033fd9b0d0 0000000000000000 0000000100000000 ffff81033fd9b0d0 0000000000000080 ffff81013eac7800 ffff81000c038148 ffff81053f347d70 ffffffff80371102 ffffffff80182955 Call Trace:<ffffffff80182955>{vfs_read+341} <ffffffff801b384f>{load_elf_binary+1471} <ffffffff80162a31>{buffered_rmqueue+545} <ffffffff8018d3d4>{copy_strings+516} <ffffffff801b3290>{load_elf_binary+0} <ffffffff8018ea59>{search_binary_handler+201} <ffffffff8018edbf>{do_execve+415} <ffffffff8010dc66>{system_call+126} <ffffffff8010c684>{sys_execve+68} <ffffffff8010e086>{stub_execve+106} Code: 0f 0b a3 5d 4c 37 80 ff ff ff ff c2 14 03 65 48 8b 04 25 00 RIP <ffffffff8018e0c5>{flush_old_exec+1573} RSP <ffff81053f347ce8> <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43 in_atomic():0, irqs_disabled():1 Call Trace:<ffffffff8012ed3f>{__might_sleep+191} <ffffffff80137109>{profile_task_exit+41} <ffffffff80138a52>{do_exit+34} <ffffffff8024e0a5>{do_unblank_screen+53} <ffffffff8010f665>{die+69} <ffffffff8010fab1>{do_invalid_op+145} <ffffffff8018e0c5>{flush_old_exec+1573} <ffffffff8034f0a3>{thread_return+0} <ffffffff8034f0fb>{thread_return+88} <ffffffff8012f0c5>{activate_task+149} <ffffffff8010e785>{error_exit+0} <ffffffff8018e0c5>{flush_old_exec+1573} <ffffffff8018e0ac>{flush_old_exec+1548} <ffffffff80182955>{vfs_read+341} <ffffffff801b384f>{load_elf_binary+1471} <ffffffff80162a31>{buffered_rmqueue+545} <ffffffff8018d3d4>{copy_strings+516} <ffffffff801b3290>{load_elf_binary+0} <ffffffff8018ea59>{search_binary_handler+201} <ffffffff8018edbf>{do_execve+415} <ffffffff8010dc66>{system_call+126} <ffffffff8010c684>{sys_execve+68} <ffffffff8010e086>{stub_execve+106} Steps to reproduce: Execute the piece of chemistry software
OK, this is about the third report of this bug. It dies here: BUG_ON(!thread_group_empty(current)); The most recent info I have is the below, from Linus: Roland, Ingo, any ideas? I was looking at that recently fixed 'handle_stop_signal()' bug, and while it looks pretty unlikely, it looks like getting a signal at _just_ the right moment would clear the SIGNAL_GROUP_EXIT flag. Which in turn would potentially allow a clone() to happen even though the group is supposed to exit. Which could in theory result in this bug report, I guess. Marc - despite it probably not making any difference, can you check a recent snapshot? There's at least one thread-specific bug that got fixed pretty recently (it's called "NPTL signal delivery deadlock fix", but that has more to do with the load it was found under, rather than necessarily what other symptoms it could cause). Put the thinking caps on please, gentlemen...
I concur that the handle_stop_signal bug could result in clearing ->signal->flags due to a SIGCONT. I overlooked this before when considering how that bug might relate to other potential problems. If the exec.c BUG_ON never hits in a system with that bug fixed, then I think we can put this to bed. If it still hits, then I am still short on ideas so far.
Roland, when did this handle_stop_signal bug fix go in? A user reported that the BUG_ON was hit in 2.6.13-rc6-git13. His testcase doesn't appear to be working for me though :|
Scratch the last comment, the testcase now works and yes, this happens in 2.6.13. i'll dig in but if anyone wants patched tested...
I think this does explain it (and I'm forwardign the whole thing to bugzilla). I don't think the patch is necessarily correct, since the comment about sys_times() is correct. But I'd argue that yes, the code that does while (atomic_read(&sig->count) > count) ... is arguably waiting on the wrong thing, since it waits for something that isn't "final". The other possibility is to just say that ok, there may still be stuff on the hash chains, and that we simply don't care, because they _are_ going away. And just remove the BUG_ON(). Linus On Fri, 9 Sep 2005, Alexander Nyberg wrote: > > This appears to be a pretty simple thing > > de_thread() does > while (atomic_read(&sig->count) > count) { > } > ..... > ..... > BUG_ON(!thread_group_empty(current)); > > but release_task does > __exit_signal > (this is where atomic_dec(&sig->count) is run) > __exit_sighand > __unhash_process > takes write lock on tasklist_lock > remove itself out of PIDTYPE_TGID list > > > so there's a clear (although small) window between the atomic_dec(&sig->count) > and the actual PIDTYPE_TGID unhashing of the thread. One might also say > that de_thread() uses the wrong method to find out if threads really > are gone. > > So to prove me right I did the patch below, lo and behold the testcase > hasn't triggered yet and it should have by now. But it appears according > to the comment that there's some ordering requirement from sys_times(). > > Anyway, I think these subtle ordering requirements are evil. According > to the comment in sys_times it relies heavily on the current release_task > code even if I can't see the exact ordering. > > > Roland, am I making any sense? > > > Index: linux-2.6/kernel/exit.c > =================================================================== > --- linux-2.6.orig/kernel/exit.c 2005-09-08 09:58:57.000000000 +0200 > +++ linux-2.6/kernel/exit.c 2005-09-09 21:17:24.000000000 +0200 > @@ -70,14 +70,14 @@ > if (unlikely(p->ptrace)) > __ptrace_unlink(p); > BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children)); > - __exit_signal(p); > - __exit_sighand(p); > /* > * Note that the fastpath in sys_times depends on __exit_signal having > * updated the counters before a task is removed from the tasklist of > * the process by __unhash_process. > */ > __unhash_process(p); > + __exit_signal(p); > + __exit_sighand(p); > > /* > * If we are the last non-leader member of the thread >
ye olde bug.. did this ever get fixed ?
It's all lost in the mists of time, but this talks about a BUG_ON that is not there in the current de_thread code.
I'm not sure of its status... I don't see this issue any more on recent kernels.