Bug 5170

Summary: Kernel BUG at "fs/exec.c":788
Product: File System Reporter: Mark Williamson (mjw)
Component: OtherAssignee: fs_other
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, alexn, andi-bz, mingo, roland, torvalds, zwane
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.13 Subsystem:
Regression: --- Bisected commit-id:

Description Mark Williamson 2005-09-01 06:43:02 UTC
Most recent kernel where this bug did not occur: Default RHEL AS4/U1 Kernel
(2.6.9-11.ELsmp)

Distribution:  RHEL AS4/U1, but with a stock kernel from kernel.org
Hardware Environment: Iwill H8501; 
http://www.iwill.net/product_2.asp?p_id=90&sp=Y

Software Environment: RHEL AS4/U1

Problem Description: When executing a piece of chemistry software, it crashes
and the following message appears in the dmesg.

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at "fs/exec.c":788
invalid operand: 0000 [1] SMP
CPU 8
Modules linked in: ipv6 autofs4 sunrpc dm_mod video button battery ac ohci_hcd
hw_random e100 mii e1000 floppy ext3 jbd 3w_xxxx sd_mod scsi_mod
Pid: 3786, comm: l1.exe Not tainted 2.6.13
RIP: 0010:[<ffffffff8018e0c5>] <ffffffff8018e0c5>{flush_old_exec+1573}
RSP: 0018:ffff81053f347ce8  EFLAGS: 00010287
RAX: ffff81044387e8b8 RBX: 00000000ffffffb0 RCX: 00000000ffffffff
RDX: ffff81044387e730 RSI: ffff8103439a5700 RDI: ffff81013fffc980
RBP: ffff810343890a80 R08: ffff81053f346000 R09: 0000000000000000
R10: 0000000000000200 R11: 00000000ffffffff R12: 0000000000000001
R13: ffff810749b13500 R14: ffff810749b13d08 R15: ffff8103439a5700
FS:  00002aaab1db9ee0(0000) GS:ffffffff80519c00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000047374c CR3: 0000000749b23000 CR4: 00000000000006e0
Process l1.exe (pid: 3786, threadinfo ffff81053f346000, task ffff81044387e730)
Stack: ffff81033fd9b0d0 0000000000000000 0000000100000000 ffff81033fd9b0d0
       0000000000000080 ffff81013eac7800 ffff81000c038148 ffff81053f347d70
       ffffffff80371102 ffffffff80182955
Call Trace:<ffffffff80182955>{vfs_read+341} <ffffffff801b384f>{load_elf_binary+1471}
       <ffffffff80162a31>{buffered_rmqueue+545} <ffffffff8018d3d4>{copy_strings+516}
       <ffffffff801b3290>{load_elf_binary+0}
<ffffffff8018ea59>{search_binary_handler+201}
       <ffffffff8018edbf>{do_execve+415} <ffffffff8010dc66>{system_call+126}
       <ffffffff8010c684>{sys_execve+68} <ffffffff8010e086>{stub_execve+106}


Code: 0f 0b a3 5d 4c 37 80 ff ff ff ff c2 14 03 65 48 8b 04 25 00
RIP <ffffffff8018e0c5>{flush_old_exec+1573} RSP <ffff81053f347ce8>
 <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1

Call Trace:<ffffffff8012ed3f>{__might_sleep+191}
<ffffffff80137109>{profile_task_exit+41}
       <ffffffff80138a52>{do_exit+34} <ffffffff8024e0a5>{do_unblank_screen+53}
       <ffffffff8010f665>{die+69} <ffffffff8010fab1>{do_invalid_op+145}
       <ffffffff8018e0c5>{flush_old_exec+1573} <ffffffff8034f0a3>{thread_return+0}
       <ffffffff8034f0fb>{thread_return+88} <ffffffff8012f0c5>{activate_task+149}
       <ffffffff8010e785>{error_exit+0} <ffffffff8018e0c5>{flush_old_exec+1573}
       <ffffffff8018e0ac>{flush_old_exec+1548} <ffffffff80182955>{vfs_read+341}
       <ffffffff801b384f>{load_elf_binary+1471}
<ffffffff80162a31>{buffered_rmqueue+545}
       <ffffffff8018d3d4>{copy_strings+516} <ffffffff801b3290>{load_elf_binary+0}
       <ffffffff8018ea59>{search_binary_handler+201}
<ffffffff8018edbf>{do_execve+415}
       <ffffffff8010dc66>{system_call+126} <ffffffff8010c684>{sys_execve+68}
       <ffffffff8010e086>{stub_execve+106}


Steps to reproduce:
Execute the piece of chemistry software
Comment 1 Andrew Morton 2005-09-01 12:11:50 UTC
OK, this is about the third report of this bug.

It dies here:

        BUG_ON(!thread_group_empty(current));

The most recent info I have is the below, from Linus:

Roland, Ingo,
  any ideas?

 I was looking at that recently fixed 'handle_stop_signal()' bug, and while
 it looks pretty unlikely, it looks like getting a signal at _just_ the
 right moment would clear the SIGNAL_GROUP_EXIT flag. Which in turn would
 potentially allow a clone() to happen even though the group is supposed to
 exit. Which could in theory result in this bug report, I guess.

 Marc - despite it probably not making any difference, can you check a 
 recent snapshot? There's at least one thread-specific bug that got fixed 
 pretty recently (it's called "NPTL signal delivery deadlock fix", but that 
 has more to do with the load it was found under, rather than necessarily 
 what other symptoms it could cause).

Put the thinking caps on please, gentlemen...
Comment 2 Roland McGrath 2005-09-01 14:11:11 UTC
I concur that the handle_stop_signal bug could result in clearing
->signal->flags due to a SIGCONT.  I overlooked this before when considering how
that bug might relate to other potential problems.  If the exec.c BUG_ON never
hits in a system with that bug fixed, then I think we can put this to bed.  If
it still hits, then I am still short on ideas so far.
Comment 3 Alexander Nyberg 2005-09-06 11:53:35 UTC
Roland, when did this handle_stop_signal bug fix go in? A user reported that the
BUG_ON was hit in 2.6.13-rc6-git13. His testcase doesn't appear to be working
for me though :|
Comment 4 Alexander Nyberg 2005-09-06 12:08:17 UTC
Scratch the last comment, the testcase now works and yes, this happens in
2.6.13. i'll dig in but if anyone wants patched tested...
Comment 5 Linus Torvalds 2005-09-09 15:08:47 UTC
I think this does explain it (and I'm forwardign the whole thing to 
bugzilla).

I don't think the patch is necessarily correct, since the comment about 
sys_times() is correct. But I'd argue that yes, the code that does

	while (atomic_read(&sig->count) > count)
		...

is arguably waiting on the wrong thing, since it waits for something that 
isn't "final".

The other possibility is to just say that ok, there may still be stuff on 
the hash chains, and that we simply don't care, because they _are_ going 
away. And just remove the BUG_ON(). 

		Linus

On Fri, 9 Sep 2005, Alexander Nyberg wrote:
>
> This appears to be a pretty simple thing
> 
> de_thread() does
> 	while (atomic_read(&sig->count) > count) {
> 	}
> 	.....
> 	.....
> 	BUG_ON(!thread_group_empty(current));
> 
> but release_task does
> 	__exit_signal
> 		(this is where atomic_dec(&sig->count) is run)
> 	__exit_sighand
> 	__unhash_process
> 		takes write lock on tasklist_lock
> 		remove itself out of PIDTYPE_TGID list
> 
> 
> so there's a clear (although small) window between the atomic_dec(&sig->count)
> and the actual PIDTYPE_TGID unhashing of the thread. One might also say
> that de_thread() uses the wrong method to find out if threads really
> are gone.
> 
> So to prove me right I did the patch below, lo and behold the testcase
> hasn't triggered yet and it should have by now. But it appears according
> to the comment that there's some ordering requirement from sys_times(). 
> 
> Anyway, I think these subtle ordering requirements are evil. According
> to the comment in sys_times it relies heavily on the current release_task
> code even if I can't see the exact ordering. 
> 
> 
> Roland, am I making any sense?
> 
> 
> Index: linux-2.6/kernel/exit.c
> ===================================================================
> --- linux-2.6.orig/kernel/exit.c	2005-09-08 09:58:57.000000000 +0200
> +++ linux-2.6/kernel/exit.c	2005-09-09 21:17:24.000000000 +0200
> @@ -70,14 +70,14 @@
>  	if (unlikely(p->ptrace))
>  		__ptrace_unlink(p);
>  	BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children));
> -	__exit_signal(p);
> -	__exit_sighand(p);
>  	/*
>  	 * Note that the fastpath in sys_times depends on __exit_signal having
>  	 * updated the counters before a task is removed from the tasklist of
>  	 * the process by __unhash_process.
>  	 */
>  	__unhash_process(p);
> +	__exit_signal(p);
> +	__exit_sighand(p);
>  
>  	/*
>  	 * If we are the last non-leader member of the thread
> 

Comment 6 Dave Jones 2007-05-17 15:04:40 UTC
ye olde bug..
did this ever get fixed ?
Comment 7 Roland McGrath 2007-05-17 18:03:50 UTC
It's all lost in the mists of time, but this talks about a BUG_ON that is not
there in the current de_thread code.
Comment 8 Mark Williamson 2007-05-18 06:10:55 UTC
I'm not sure of its status... I don't see this issue any more on recent kernels.