In http://lwn.net/Articles/280959/ there is a sample core pipe receiver which does this:

> /* Change our current working directory to that of the
>    crashing process */
>
> snprintf(cwd, PATH_MAX, "/proc/%s/cwd", argv[1]);
> chdir(cwd);

My inspection of the patches suggests that there is no guarantee that the failing process is still available when the pipe receiver activates and works on the core file. Some experiments I've done seem to confirm this:

a. If my core pipe receiver simply does sleep(30), I cannot find the /proc/<pid>/ entry for the failing process from another interactive shell window.

b. If I remove the sleep(30), my core pipe receiver can find /proc/<pid>/exe.

If my analysis is correct, I'm wondering if there is a straightforward way to hold off destruction of the failing process until the core pipe receiver "acknowledges" (somehow) that it has dealt with the image. I think this would be useful because it would then allow the receiver to inspect /proc/<pid>/* to emit useful information about the failing process.
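For concreteness, here is a minimal sketch of such a pipe receiver (a hypothetical example, not the LWN code; the receiver path, the %p argument convention and the /tmp output location are my assumptions). It reads the core image from stdin and tries to look at /proc/<pid>/exe, which is exactly the access that may or may not still be possible:

/* Hypothetical minimal core pipe receiver, assuming a pattern like
 * echo '|/usr/local/bin/receiver %p' > /proc/sys/kernel/core_pattern.
 * The output path below is illustrative only. */
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char link[PATH_MAX], exe[PATH_MAX], out[PATH_MAX], buf[4096];
	ssize_t n;
	FILE *f;

	if (argc < 2)
		return 1;

	/* Metadata lives in /proc/<pid>; it may already be gone if the
	 * kernel has finished tearing down the crashing task. */
	snprintf(link, sizeof(link), "/proc/%s/exe", argv[1]);
	n = readlink(link, exe, sizeof(exe) - 1);
	if (n >= 0)
		exe[n] = '\0';
	else
		strcpy(exe, "(proc entry already gone)");

	/* The core image itself arrives on stdin. */
	snprintf(out, sizeof(out), "/tmp/core.%s", argv[1]);
	f = fopen(out, "w");
	if (!f)
		return 1;
	while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, f);
	fclose(f);

	fprintf(stderr, "core for pid %s, exe %s\n", argv[1], exe);
	return 0;
}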
Using /proc to get the path wouldn't be sufficient anyway - the following could have occurred:

    core dump begins
    mv core-dir old-core-dir
    mkdir core-dir

You would need an actual handle to that directory to be sure. I guess the logical approach would be to start the core dumping thread in the cwd of the core-dumped task, if we don't do so already.

This isn't, however, a bug - so if you really want it fixed you need to submit patches upstream.
You miss the point here. This is using the core_pattern infrastructure, and it's the user space core collector reading the pipe of the core dump which might also need access to the crashing process's /proc/<pid> directory. We need to block the return of do_coredump until such time as the user space application is finished and exits.
I don't think I am, but maybe my description is confusing.

At the point the core pattern pipe is executed, the directory the program ran in may not even be attached to the file system. It may be in a different namespace, it may have been deleted, it may be blocked by security. Thus we cannot pass it by name.

If we allowed dead processes to get stuck forever when the core dump handler wedged, mayhem would result, so we can't block without adding a lot of risk.

So you would have to find a way of passing the directory handle somehow, and the obvious way would be to make call_usermodehelper_pipe() able to run the helper with the same cwd as the core dumping process.
I think I see our disconnect. The race we're talking about isn't with writing to a directory, it's with reading files in a directory, specifically the /proc/<pid> directory of the crashing process. That directory can't be moved, detached, etc. while the process is crashing, but when we specify core_pattern as being a pipe to an executable, we may or may not be able to access that crashing process's /proc/<pid> directory. Whether or not we can depends on the size of the crashing process and the size of the stdin pipe. If the crashing process is small, and the core image fits into the pipe without blocking, do_coredump in the kernel will complete and clean up the crashing process, removing its proc directory along the way. Since the contents of that directory can be very useful to a core_pattern receiving process, we'd like to be able to guarantee its accessibility.

I agree that simply blocking on the process completion in the kernel is prone to errors and other dangers, but I've got a few other ideas that I'm playing with at the moment.
The example you posted uses /proc/<pid> to retrieve the cwd - that cwd in /proc is useless, so the example is pointless even if you could use /proc/pid reliably. The core dump pattern formatter can insert things like the pid - because that is the only safe way to get hold of them. What do you need from /proc/pid that can be used in any other context?
I didn't post it, but regardless, the fact that the example may not work doesn't make the info in proc useless. With proc still around you can access data about the process that's not in the core file: the full path to the executable, what its cwd was (even if those things have moved), what its limits or personality flags were, etc. Relevant debug information...
Created attachment 21492 [details]
patch to block core_pattern dumping until the dump application exits

Here's a patch for you to try, Eric. I've not tested it yet, but it should do what you need. For all the reasons Alan has pointed out, and then some, I'm not really thrilled with the idea of blocking like this in do_coredump, but this should prove at least that doing so will keep the crashing process around. I'm going to fiddle with the notion of just holding a reference to the proc directory as an alternative (perhaps by setting core_pattern's cwd to /proc/<pid>), but if you could test this in the meantime, that would be great.
So put the relevant bits into the formatting options - which is what we always did before when one turned up that was really useful.
That's a lot of bits to put on the command line, but it might be an option, depending on how the above testing turns out.
You only have to pass what is really needed - you can also pass it as environment variables, which might actually be better for compatibility and for size?
(In reply to comment #7)
> I'm going to fiddle with the notion of just holding a reference to the proc
> directory as an alternative (perhaps by setting core_pattern's cwd to
> /proc/<pid>), but if you could test this in the meantime, that would be great

Looks promising for our current needs. I'll try to work it into our kernel this afternoon.
(In reply to comment #7)
> For all the reasons Alan has pointed out, and then some, I'm not really
> thrilled with the idea of blocking like this in do_coredump, but this should
> prove at least that doing so will keep the crashing process around.

I know there is concern about keeping the kernel code blocked for long periods of time. This is understandable and I share that concern. I think this is an underlying issue that already exists with the current code, and I don't see this particular weakness having been discussed (at least not in the current context). The susceptibility to blocking likely arose the moment provision was made to allow core dumps to be written anywhere other than in the file system alongside the executable (i.e. the introduction of core_pattern was perhaps the turning point). For example:

a. echo '/dev/ttya' > /proc/sys/kernel/core_pattern

In other words, the writing of the core might be rate limited by the target file/device.

b. echo '|/usr/local/bin/slowreaper' > /proc/sys/kernel/core_pattern

Suppose slowreaper is something like:

    while (...) {
        sleep(10);
        ch = getchar();
        ...
    }

Then even the current code will block in the kernel once the pipe buffers fill.

Granted these cases are somewhat pathological, but I think not much more so than the case where the wait4(2) takes too long to complete, or perhaps never completes. Admittedly all of this is rather theoretical, because in practice things usually turn out "all right". However, the kernel needs to handle lots of situations, and in particular try to be well behaved when things are not turning out right.

Some questions:

1. Is it reasonable to require that binfmt->core_dump() complete within a bounded amount of time? Perhaps a timeout scaled to the size of the executable?

2. Is it reasonable to require that the core_pattern pipe process complete within a bounded amount of time (IOW, don't wait4(2) indefinitely)?

I think it is entirely reasonable to allow the core_pattern pipe process to gather whatever diagnostic information it sees fit. As an application writer, I see /proc/<pid> as essential in this regard. I think it is somewhat unwieldy to attempt to foretell what information might (or might not) be useful and provision that in various formatting options in do_coredump(). This approach couples do_coredump() more tightly to modifications and enhancements in /proc/<pid>, and I think that would not be the best approach.
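For what it's worth, here is a runnable version of that slowreaper pseudocode (a sketch of the pathological case only, not a real tool, and the install path is assumed). Because it drains stdin one byte every 10 seconds, the pipe buffer fills and the kernel-side core writer blocks, which is the point being made above:

/* Sketch of the pathological slowreaper, assuming it is installed with:
 * echo '|/usr/local/bin/slowreaper' > /proc/sys/kernel/core_pattern
 * It drains stdin one byte every 10 seconds, so once the pipe buffer
 * fills, the kernel's core writer blocks. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int ch;

	for (;;) {
		sleep(10);
		ch = getchar();
		if (ch == EOF)
			break;
		/* Byte discarded; a real handler would store it. */
	}
	return 0;
}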
Alan: "You only have to pass what is really needed"

That's what I see as the problem: I'm not entirely sure what is really needed, and there's lots of info to choose from. I know that early users of this interface wanted to interrogate proc so they could gather program metadata for use in identifying program versions and releases as part of automatic bug filing utilities. I don't really know what others wanted to do with it. If we had a better idea of what subset of data was relevant, it might be easier. As for putting it in environment variables, that was discussed before (Ubuntu tried it that way initially). It was turned down, IIRC, as being prone to application breakage and aliasing (i.e. if we add a new env variable for relevant data later and programs are already using that variable to hold internal information, they break).

Earl: I think you make a good point regarding the timing of the program. In fact, that's probably what has let this work for so long (programs are of sufficient size that they stall the pipe writer in the kernel, giving the core collector time to interrogate proc). Regarding the use of timeouts, I'm sure that would be safe, but I'm not sure how we synchronize that timeout event with user space. I was thinking we might be able to hold an extra refcount on the task struct and prevent proc deregistration until the user space app was done with the process. Let me know how the test results go and I'll keep working on it.
(In reply to comment #7)
> Created an attachment (id=21492) [details]
> patch to block core_pattern dumping until the dump application exits

I am concerned about this part of the patch:

diff --git a/kernel/kmod.c b/kernel/kmod.c
index b750675..b4de1d4 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -257,6 +257,8 @@ static void __call_usermodehelper(struct work_struct *work)
 	pid = kernel_thread(____call_usermodehelper, sub_info,
 			    CLONE_VFORK | SIGCHLD);

+	sub_info->retval = pid;
+
 	switch (wait) {
 	case UMH_NO_WAIT:
 		break;
@@ -264,7 +266,6 @@
 	case UMH_WAIT_PROC:
 		if (pid > 0)
 			break;
-		sub_info->retval = pid;
 		/* FALLTHROUGH */
 	case UMH_WAIT_EXEC:

The code immediately preceding this is:

	/* CLONE_VFORK: wait until the usermode helper has execve'd
	 * successfully. We need the data structures to stay around
	 * until that is done. */
	if (wait == UMH_WAIT_PROC || wait == UMH_NO_WAIT)
		pid = kernel_thread(wait_for_helper, sub_info,
				    CLONE_FS | CLONE_FILES | SIGCHLD);
	else
		pid = kernel_thread(____call_usermodehelper, sub_info,
				    CLONE_VFORK | SIGCHLD);

In the case of UMH_WAIT_EXEC, the else-arm is executed, and there is code in ____call_usermodehelper() that does:

	retval = kernel_execve(sub_info->path, sub_info->argv, sub_info->envp);

	/* Exec failed? */
	sub_info->retval = retval;
	do_exit(0);

IOW, the thread executing ____call_usermodehelper() writes sub_info->retval, and I think it will race with your patch. Consider:

--- kmod.c.orig	2009-05-22 14:09:04.000000000 -0700
+++ kmod.c	2009-05-22 14:55:42.000000000 -0700
@@ -124,6 +124,7 @@
 	char **envp;
 	enum umh_wait wait;
 	int retval;
+	int pid;
 	struct file *stdin;
 	void (*cleanup)(char **argv, char **envp);
 };
@@ -255,14 +256,15 @@
 	pid = kernel_thread(____call_usermodehelper, sub_info,
 			    CLONE_VFORK | SIGCHLD);

+	sub_info->pid = pid;
+
 	switch (wait) {
 	case UMH_NO_WAIT:
 		break;
 	case UMH_WAIT_PROC:
-		if (pid > 0)
+		if (pid >= 0)
 			break;
-		sub_info->retval = pid;
 		/* FALLTHROUGH */
 	case UMH_WAIT_EXEC:
@@ -476,7 +478,10 @@
 	if (wait == UMH_NO_WAIT)	/* task has freed sub_info */
 		goto unlock;
 	wait_for_completion(&done);
-	retval = sub_info->retval;
+	if (wait == UMH_WAIT_EXEC)
+		retval = sub_info->pid;
+	else
+		retval = sub_info->retval;

 out:
 	call_usermodehelper_freeinfo(sub_info);

Note that I changed the test for success to (pid >= 0) to complement the usual test for failure (pid < 0) as per https://lists.linux-foundation.org/pipermail/kernel-janitors/2006-July/015990.html.
(In reply to comment #11)
> Looks promising for our current needs. I'll try to work it into our kernel
> this afternoon.

The outcome looks good. Our kernel has a slightly older version of kmod.c, so I had to make the change to kmod.c in an equivalent way. See the attachment.

Here's my preliminary testing with a slow reaper.

# echo '|/tmp/slowreaper %p' > /proc/sys/kernel/core_pattern
# ls -l coredump
-rwxr-xr-x 1 root root 3064 May 22 15:22 coredump
# ./coredump

Note the small size of the coredump program (essentially kill(getpid(), SIGSEGV)).

The slowreaper script is:

#!/bin/sh
echo Sleep $1 > /dev/console
sleep 30
echo Wake $1 > /dev/console
ls -l /proc/$1/ > /dev/console

After 30 seconds I see the content of /proc/$1/ on the console.
Created attachment 21497 [details]
kmod.c patch against older implementation
Ok, I've been thinking about this, and it occurred to me that this will actually be a safe solution as long as I fix the other bug you reported, which stops the core pattern handler from recursing infinitely. The resource starvation that happens if a core collection process blocks indefinitely isn't much worse than if the pipe is never drained. I'm going to fix the recursion issue and push this upstream.
Just an update: I've not forgotten about this issue. I'm working on ways to do crash recursion detection. I'll post all the changes when I have them working.
Created attachment 21759 [details]
new patch

Here's a new version of the patch. It revamps the way we detect recursive core dumps, and seems pretty robust in my testing. It also waits for core_pattern helpers to complete before cleaning up the crashed process, and adds a new sysctl to limit the number of parallel core dumps, to prevent flooding the box with long-running core collection processes. It works pretty well for me, but I would appreciate some extra testing before I post it for review. Thanks!
(In reply to comment #19)
> Here's a new version of the patch.

One new section generates a warning about possible recursion:

+	if (!last_exe_file) {
+		printk(KERN_WARNING "Abort on possible recursive core\n");
+		mmput(cdmm);
+		goto fail_fput;
+	}

As far as I can tell, last_exe_file can only be null in this context if pattern_task is null in the following context:

+	if (dump_count == 1) {
+		struct pid *ps = find_get_pid(pid);
+		pattern_task = get_pid_task(ps, PIDTYPE_PID);
+		if (pattern_task) {
+			cdmm = get_task_mm(pattern_task);
+			current_exe = get_mm_exe_file(cdmm);
+			last_exe_file = current_exe;
+			put_task_struct(pattern_task);
+			mmput(cdmm);
+		}
+		put_pid(ps);
+	}

This leaves me a little puzzled as to how the "possible recursive core" can be inferred from !last_exe_file. Would you elaborate?

Further along:

+fail_dropcount:
+	if (ispipe)
+		atomic_dec(&core_dump_count);

I'm thinking that there should be some code to reset last_exe_file:

	if (ispipe && !atomic_dec_return(&core_dump_count))
		last_exe_file = NULL;
> This leaves me a little puzzled as to how the "possible recursive core" can be
> inferred from !last_exe_file.

It can't really. That particular clause is there as a safety catch. If something is wrong with the specification of /proc/sys/kernel/core_pattern, and call_usermodehelper_pipe fails, then the core dump which sets dump_count to 1 won't have an exe file struct to point to. If another cpu is handling a second core in parallel, there's no way to compare the current exe file struct to the last one (since it's still null), but we don't want to go on, because I don't want to accidentally wait forever on a garbage pid. The real check for recursion is in the else clause of that conditional, in which we check current_exe == last_exe. I agree that "possible recursion" is a bit misleading, but I couldn't think of anything else concise to say there. Suggestions for a better error message are welcome.

> I'm thinking that there should be some code to reset last_exe_file:

We could, I suppose, but I didn't want to worry about having to lock access to that pointer and slow this path down further. If we reset it to NULL there, then it's possible that, if we decrement dump_count to zero there, a second cpu can immediately bump it back to 1, and we risk NULLing that pointer after the second cpu sets it to a new value. By just leaving it alone, we can use the counter as the lock. When we decrement to zero, the next increment will set it to its new value, which effectively prevents parallel writes (since we only do the write in one place, and it's guarded by the fact that only one execution context can see the counter as being 1 at any one time).

There's a chance that, if you have lots of core dumps happening in parallel and you try to change the core_pattern executable while they're happening, you might get a race where the recursion check fails inadvertently, but that's what the core_dump_limit sysctl can in part help with.

How has your testing gone?
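To illustrate the counter-as-lock idea in code, roughly (a sketch using the names from the patch - core_dump_count, last_exe_file, dump_count - but not the literal patch code, and with the mm/exe file reference counting omitted):

/* Rough sketch of the counter-as-lock scheme; not the literal patch. */
static atomic_t core_dump_count = ATOMIC_INIT(0);
static struct file *last_exe_file;	/* written only by the dump_count == 1 context */

	/* on the pipe path in do_coredump(), dump_count being a local int: */
	dump_count = atomic_inc_return(&core_dump_count);
	if (dump_count == 1) {
		/* Only one execution context can ever observe the counter at 1,
		 * so this is the only place last_exe_file is written; the
		 * counter itself acts as the lock. */
		last_exe_file = get_mm_exe_file(current->mm);
	} else if (last_exe_file &&
		   last_exe_file == get_mm_exe_file(current->mm)) {
		/* The crashing task is the core collector itself: recursion. */
		goto fail_dropcount;
	}
	...
fail_dropcount:
	atomic_dec(&core_dump_count);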
(In reply to comment #21)
> We could, I suppose, but I didn't want to worry about having to lock access to
> that pointer and slow this path down further. If we reset it to NULL there,
> then it's possible that, if we decrement dump_count to zero there, a second
> cpu can immediately bump it back to 1, and we risk NULLing that pointer after
> the second cpu sets it to a new value.

Yikes. Sorry about that silly suggestion. Not enough coffee this morning ;-) What I should have written was:

	if (ispipe) {
		if (dump_count == 1)
			last_exe_file = NULL;
		atomic_dec(&core_dump_count);
	}

If dump_count was one, this thread was privileged enough to be the first, and thus has setting rights on last_exe_file, so it is safe for it to also have resetting rights.

If you do not reset last_exe_file, the predicate !last_exe_file will never be true after the first core dump taken after boot --- which renders the whole if (!last_exe_file) {} clause rather useless in practical terms.

> There's a chance that, if you have lots of core dumps happening in parallel
> and you try to change the core_pattern executable while they're happening,
> you might get a race where the recursion check fails inadvertently, but
> that's what the core_dump_limit sysctl can in part help with.

Yes, there is a reasonable chance that two processes will race to dump core simultaneously. One of those processes will be the first to dump, and the other second, as counted by core_dump_count. The second might get to run this code:

+	if (!last_exe_file) {
+		printk(KERN_WARNING "Abort on possible recursive core\n");
+		mmput(cdmm);
+		goto fail_fput;
+	}

before the first, and so will fail to dump for (apparently) no good reason.

All in all, I think you're better off forgetting about the "possible" clause and going with the simpler:

+	if (dump_count > 1) {
+		cdmm = get_task_mm(current);
+		current_exe = get_mm_exe_file(cdmm);
+		if (last_exe_file && last_exe_file == current_exe) {
+			printk(KERN_WARNING "Abort on recursive core\n");
+			mmput(cdmm);
+			goto fail_fput;
+		}

It appears that get_mm_exe_file() can return NULL (if it's confused?), but I'm not sure what this code should do if that eventuates -- probably nothing. I can't see any other use of current_exe.

> How has your testing gone?

I only started looking at this today with a view to merging your patch, so I don't have any concrete test details to report (apart from these musings upon reviewing your changes).
> If dump_count was one, this thread was privileged enough to be the first, and
> thus has setting rights on last_exe_file, so it is safe for it to also have
> resetting rights.

Unfortunately, that's not the case, because the do_coredump code isn't serialized. Just because a given context grabs the dump_count = 1 reference doesn't mean that other cpus aren't processing dumps while this one is finishing. By returning last_exe_file to null, it prevents those parallel dumps from making the recursive crash check.

> before the first, and so will fail to dump for (apparently) no good reason.

That's true; my concern is that the last_exe_file == NULL when dump_count > 1 case is ambiguous in meaning. Does it mean that a previous attempt to dump failed in forking the user space app, or does it mean that dump_count == 2 raced ahead of dump_count == 1? I'm really worried about letting recursive dumps through, although it wouldn't hurt to try the other way. Feel free to make that null check change if you'd like to test with it. Let me know how your testing goes, we'll make a final version of the patch based on those tests, and then I'll post. Thanks!
(In reply to comment #23)
> Unfortunately, that's not the case, because the do_coredump code isn't
> serialized. Just because a given context grabs the dump_count = 1 reference
> doesn't mean that other cpus aren't processing dumps while this one is
> finishing. By returning last_exe_file to null, it prevents those parallel
> dumps from making the recursive crash check.

I got dragged off to some other work. I'm still working on the merge and test.

In the meantime, I thought of another approach to detect recursion which is more general and robust. Suppose we add the ability to mark a process as being a core dumper, say by introducing a COREDUMPER attribute on the process. Normally a process is not a core dumper. When the core_pattern indicates a pipe, that pipe is established and the target process is marked as a core dumper. The core dumper attribute is sticky and transfers across a fork() and across an exec(). do_coredump() then need only check whether the dumpee is a core dumper (i.e. has the COREDUMPER attribute), and if so can take evasive action.

The advantage of this approach is that it can detect likely recursion under many conditions. The evasive action on detecting that the dumpee is a core dumper is probably to revert to the default "core" action at the very least. Perhaps we can get a little more creative in this regard.
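In code, the proposal would look roughly like the following. This is entirely hypothetical: PF_COREDUMPER does not exist in the kernel, the bit value is just one of the apparently free PF_* bits, and helper_task stands for whatever task struct the pipe setup code has in hand; inheritance across fork() would typically follow since most PF_* flags are copied to the child.

/* Hypothetical sketch of the COREDUMPER idea; PF_COREDUMPER is invented
 * here for illustration only. */
#define PF_COREDUMPER	0x00000020	/* task is (descended from) a core dump helper */

	/* when the core_pattern pipe helper is set up, mark it: */
	helper_task->flags |= PF_COREDUMPER;

	/* later, in do_coredump() for a crashing task: */
	if (ispipe && (current->flags & PF_COREDUMPER)) {
		/* The dumpee is itself a core dump helper (or one of its
		 * descendants); fall back to a plain "core" file, or skip
		 * the dump, rather than re-piping and recursing. */
		ispipe = 0;
	}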
No, I'd rather not. It's not a bad idea, but to implement it we'd need to introduce a new field to task_struct, and nobody is going to want that (burning space for such a rare use case is going to be frowned upon, and all the existing flags fields are already in use). What we have in the patch above works well in my testing, and I don't see any holes in it. Please test it as is, and let me know the results.
Just set the coresize limit to zero for the dumper - that's something that root can override with care if they wish (e.g. to debug dump tools) and needs no new features?
> What we have in the patch above works well in my testing, and
> I don't see any holes in it. Please test it as is, and let me know the
> results.

Ok. I understand that there may be objections to using up valuable flag space. I believe the PF_* flags are the ones in question:

[snip]
#define PF_VCPU		0x00000010	/* I'm a virtual CPU */
#define PF_FORKNOEXEC	0x00000040	/* forked but didn't exec */
#define PF_SUPERPRIV	0x00000100	/* used super-user privileges */
[snip]
#define PF_USED_MATH	0x00002000	/* if unset the fpu must be initialized before use */
#define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
[snip]
#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
#define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
[snip]

It seems to me that the following flag bits are available:

0x00000020
0x00000080
0x00004000
0x08000000

Thus whilst there may be a great reluctance to use these, I don't think it's true to say that they're all used up.
(In reply to comment #26)
> Just set the coresize limit to zero for the dumper - that's something that
> root can override with care if they wish (e.g. to debug dump tools) and needs
> no new features?

I don't think this will work out of the box. IIRC the current strategy is to ignore the coresize limit if a pipe (i.e. |) is seen in core_pattern. In this case, do_coredump() proceeds to launch the process and to pipe in the core dump. I'm ok with changing the semantics so that the piped core_pattern proceeds only if the coresize limit is non-zero. This would not be a problem for the application(s) I have in mind. Others might have differing opinions.
Earl: Yes, you are correct, they aren't completely exhausted, technically, but I'm still very hesitant to burn a flag on indicating that a process can't recurse through do_coredump. It's just such a rare condition, and I really think that doing so isn't needed. The code I have as it stands now catches recursion in all the testing that I've done. About the only case it doesn't catch is the one in which a core dump helper forks a child and that child segfaults, which the core dump limit sysctl catches. I'd really like to get some more robust testing on what we have currently, please.

Alan: This is actually a good point. We already set RLIMIT_CORE to 0 in ____call_usermodehelper, so that part is taken care of. We are currently ignoring the RLIMIT_CORE setting when a pipe is used in do_coredump, though, because a partial core written to a usermodehelper isn't very useful. It's that ignoring of RLIMIT_CORE that opens us to this recursion problem. If we were to start honoring the core limit we could avoid the recursion issue altogether, and the price would be the need to manually set ulimit -c unlimited on any process that was to use a piped core helper. IIRC the initial argument against this was that it required the co-ordination of each process in the system to make use of this facility. Perhaps it's not too high a price to pay... Thoughts?
Interpreting 0 as "no dump" for pipe cases as well would probably not be unreasonable - the length has no meaning in a pipe but the 0 still implies a limit.
Yeah, agreed. That won't make for too pleasant a backport to many distros, but whatever, it's sounding more and more like the most sane solution. I'll rewrite this in the next few days to use RLIMIT_CORE as the protection mechanism.
Created attachment 21946 [details]
new version of core_pattern patch

Ok, here's a new variant of the patch. As we've been discussing, it takes a new approach to recursion detection. Instead of actually looking for recursive cores, we simply fail to dump any core for a process whose RLIMIT_CORE value is 0. Since the usermodehelper process sets that value, we don't recurse. Of course we could if the process in user space changed its rlimit, but it runs as root, so it's allowed to do stupid things. Process waiting and the core limit sysctl are in this version of the patch as well (details in the patch itself).

Earl, this patch has worked quite well on my test system here. Please provide feedback ASAP, and I'll get it posted for review. Thanks!
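Roughly, the protection described here amounts to the following (a sketch, not the literal patch; the helper-side setting is what ____call_usermodehelper already does today, per the earlier discussion, and the exact field accesses are illustrative):

/* Sketch of the RLIMIT_CORE based recursion protection; not the patch. */

	/* helper side, in ____call_usermodehelper(): the usermode helper
	 * runs with a zero core size limit. */
	current->signal->rlim[RLIMIT_CORE].rlim_cur = 0;
	current->signal->rlim[RLIMIT_CORE].rlim_max = 0;

	/* do_coredump() side: honor RLIMIT_CORE on the pipe path too, so a
	 * crash inside the helper (limit 0) is never piped back out. */
	if (ispipe && current->signal->rlim[RLIMIT_CORE].rlim_cur == 0)
		goto fail;	/* core dumping disabled for this task */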
Created attachment 22009 [details]
Minor edit to allow core_pipe_limit to be accessible

I had to make this modification to have core_pipe_limit readable on my kernel.
Yep, thanks for that. Is the testing going well?
> Yep, thanks for that. Is the testing going well?

Yes. No other anomalies to report so far. I'll attempt to stress some of the corner cases this afternoon after I get some other kernel code review snippets out to my colleagues.

Earl
Copy that. If I don't hear from you on this over the weekend, I'll assume all is well and post it Monday late morning. Thanks!