Bug 5170
Summary: | Kernel BUG at "fs/exec.c":788 | ||
---|---|---|---|
Product: | File System | Reporter: | Mark Williamson (mjw) |
Component: | Other | Assignee: | fs_other |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | akpm, alexn, andi-bz, mingo, roland, torvalds, zwane |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.13 | Subsystem: | |
Regression: | --- | Bisected commit-id: |
Description
Mark Williamson
2005-09-01 06:43:02 UTC
OK, this is about the third report of this bug. It dies here: BUG_ON(!thread_group_empty(current)); The most recent info I have is the below, from Linus: Roland, Ingo, any ideas? I was looking at that recently fixed 'handle_stop_signal()' bug, and while it looks pretty unlikely, it looks like getting a signal at _just_ the right moment would clear the SIGNAL_GROUP_EXIT flag. Which in turn would potentially allow a clone() to happen even though the group is supposed to exit. Which could in theory result in this bug report, I guess. Marc - despite it probably not making any difference, can you check a recent snapshot? There's at least one thread-specific bug that got fixed pretty recently (it's called "NPTL signal delivery deadlock fix", but that has more to do with the load it was found under, rather than necessarily what other symptoms it could cause). Put the thinking caps on please, gentlemen... I concur that the handle_stop_signal bug could result in clearing ->signal->flags due to a SIGCONT. I overlooked this before when considering how that bug might relate to other potential problems. If the exec.c BUG_ON never hits in a system with that bug fixed, then I think we can put this to bed. If it still hits, then I am still short on ideas so far. Roland, when did this handle_stop_signal bug fix go in? A user reported that the BUG_ON was hit in 2.6.13-rc6-git13. His testcase doesn't appear to be working for me though :| Scratch the last comment, the testcase now works and yes, this happens in 2.6.13. i'll dig in but if anyone wants patched tested...
I think this does explain it (and I'm forwardign the whole thing to
bugzilla).
I don't think the patch is necessarily correct, since the comment about
sys_times() is correct. But I'd argue that yes, the code that does
while (atomic_read(&sig->count) > count)
...
is arguably waiting on the wrong thing, since it waits for something that
isn't "final".
The other possibility is to just say that ok, there may still be stuff on
the hash chains, and that we simply don't care, because they _are_ going
away. And just remove the BUG_ON().
Linus
On Fri, 9 Sep 2005, Alexander Nyberg wrote:
>
> This appears to be a pretty simple thing
>
> de_thread() does
> while (atomic_read(&sig->count) > count) {
> }
> .....
> .....
> BUG_ON(!thread_group_empty(current));
>
> but release_task does
> __exit_signal
> (this is where atomic_dec(&sig->count) is run)
> __exit_sighand
> __unhash_process
> takes write lock on tasklist_lock
> remove itself out of PIDTYPE_TGID list
>
>
> so there's a clear (although small) window between the atomic_dec(&sig->count)
> and the actual PIDTYPE_TGID unhashing of the thread. One might also say
> that de_thread() uses the wrong method to find out if threads really
> are gone.
>
> So to prove me right I did the patch below, lo and behold the testcase
> hasn't triggered yet and it should have by now. But it appears according
> to the comment that there's some ordering requirement from sys_times().
>
> Anyway, I think these subtle ordering requirements are evil. According
> to the comment in sys_times it relies heavily on the current release_task
> code even if I can't see the exact ordering.
>
>
> Roland, am I making any sense?
>
>
> Index: linux-2.6/kernel/exit.c
> ===================================================================
> --- linux-2.6.orig/kernel/exit.c 2005-09-08 09:58:57.000000000 +0200
> +++ linux-2.6/kernel/exit.c 2005-09-09 21:17:24.000000000 +0200
> @@ -70,14 +70,14 @@
> if (unlikely(p->ptrace))
> __ptrace_unlink(p);
> BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children));
> - __exit_signal(p);
> - __exit_sighand(p);
> /*
> * Note that the fastpath in sys_times depends on __exit_signal having
> * updated the counters before a task is removed from the tasklist of
> * the process by __unhash_process.
> */
> __unhash_process(p);
> + __exit_signal(p);
> + __exit_sighand(p);
>
> /*
> * If we are the last non-leader member of the thread
>
ye olde bug.. did this ever get fixed ? It's all lost in the mists of time, but this talks about a BUG_ON that is not there in the current de_thread code. I'm not sure of its status... I don't see this issue any more on recent kernels. |