Latest working kernel version: ??? (2.6.15)
Earliest failing kernel version: 2.6.23
Distribution: Debian.
Hardware Environment: x86
Software Environment: ???

Problem Description:

The kernel does not honour the value of ARG_MAX provided by sysconf(_SC_ARG_MAX) or the value in limits.h.

This was broken by this commit:
~~~
commit b6a2fea39318e43fee84fa7b0b90d68bed92d2ba
Author: Ollie Wild <aaw@google.com>
Date:   Thu Jul 19 01:48:16 2007 -0700

    mm: variable length argument support

    Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly
    from the old mm into the new mm.

    We create the new mm before the binfmt code runs, and place the new stack
    at the very top of the address space. Once the binfmt code runs and
    figures out where the stack should be, we move it downwards.

    It is a bit peculiar in that we have one task with two mm's, one of which
    is inactive.

    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild <aaw@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: <linux-arch@vger.kernel.org>
    Cc: Hugh Dickins <hugh@veritas.com>
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk <bunk@stusta.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
~~~

In fs/exec.c (get_arg_page) we should ensure that a minimum of MAX_ARG_PAGES is provided, since this is what glibc is telling userspace via sysconf(_SC_ARG_MAX).

Steps to reproduce:

See attached patch. This testcase works on earlier kernels, but fails on 2.6.23.
Created attachment 14982
Test showing kernel does not honour ARG_MAX space for arg+env in execve.

Compile with `gcc -o sysconf-fail sysconf-fail.c` and run the test. It will print 'PASS:.*' as the last line if it passes, or 'FAIL:.*' as the last line if it fails.
This is a rather silly bug.

- Firstly, the test program is actually wrong, because it should also calculate the env size.

- Secondly, the limit ARG_MAX is a tad useless now that it's variable. There isn't a single good value, and the requirement for sysconf() values to be constant during a process makes it impossible to return the actual number.

- Thirdly, since the env and arg arrays are on the stack, and you limit the stack to less than the given number, it's arguably correct to fail.

I would propose to make sysconf(_SC_ARG_MAX) return _POSIX_ARG_MAX (4096) and allow for the one page. This ought to discourage its use, like the man page already does.
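The first point, that a test must count the environment as well as the arguments, can be illustrated with a small helper. This is a minimal sketch and not the attached test: the function names `vector_bytes` and `exec_payload_bytes` are hypothetical, and it counts only the strings and their terminating NUL bytes. An implementation may also charge the pointer arrays and alignment padding against the limit, which this sketch ignores.

```c
#include <stddef.h>
#include <string.h>

/* Bytes taken by the strings of a NULL-terminated vector such as argv
 * or environ, counting each string's terminating NUL. A NULL vector
 * counts as empty. */
size_t vector_bytes(char *const v[])
{
    size_t total = 0;

    for (size_t i = 0; v != NULL && v[i] != NULL; i++)
        total += strlen(v[i]) + 1;
    return total;
}

/* Combined size of argv and envp strings: the quantity a test should
 * compare against the advertised limit (an assumption of this sketch;
 * pointer arrays are not counted here). */
size_t exec_payload_bytes(char *const argv[], char *const envp[])
{
    return vector_bytes(argv) + vector_bytes(envp);
}
```

A test would call `exec_payload_bytes(argv, environ)` on the vectors it is about to pass to execve() and compare the result with the limit under test.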
Peter,

There is a real need in userspace for very-large argument passing, as proof see the @file support added to libiberty for gcc. The @file support allows a program to inject command line arguments into another exec'd program via a file. It would be nice to avoid this; unfortunately, we need a way to determine if the running kernel supports variable length argument passing.

How do we expose cool new features, like this one, to userspace?

Please see http://sourceware.org/bugzilla/show_bug.cgi?id=5786 for some background on the issue.

What do you think about adding a new RLIMIT_ARG_MAX, and having the kernel return the current value it has reserved as the limit?

I can then go suggest the POSIX.1 spec changes:

1. Define RLIMIT_ARG_MAX for getrlimit, as the number of bytes from RLIMIT_STACK which are reserved for arg and env passing.

2. Reword sysconf to say "The return value of sysconf(_SC_ARG_MAX) may change if you call setrlimit with changes to RLIMIT_ARG_MAX." Note that it already says similar things for sysconf(_SC_OPEN_MAX).

In light of all this, I still think the kernel should provide a minimum of MAX_ARG_PAGES * PAGE_SIZE bytes, as it did before, to comply with the sysconf requirement that the limit not be more restrictive than the compiled-in limit.

My only intent is to make it easier for userspace to use this feature. As it stands, no userspace process knows that there is more than sysconf(_SC_ARG_MAX) available, and the new kernel fails to provide the compiled-in minimum.
On Mon, 2008-02-25 at 12:59 -0800, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10095
>
> ------- Comment #3 from carlos@codesourcery.com 2008-02-25 12:59 -------
> Peter,
>
> There is a real need in userspace for very-large argument passing, as proof
> see the @file support added to libiberty for gcc. The @file support allows a
> program to inject command line arguments into another exec'd program via a
> file. It would be nice to avoid this; unfortunately, we need a way to
> determine if the running kernel supports variable length argument passing.
>
> How do we expose cool new features, like this one, to userspace?
>
> Please see http://sourceware.org/bugzilla/show_bug.cgi?id=5786 for some
> background on the issue.
>
> What do you think about adding a new RLIMIT_ARG_MAX, and having the kernel
> return the current value it has reserved as the limit?

Not needed; it's all stack, and we already use RLIMIT_STACK for exactly that purpose. Currently it's 1/4th of RLIMIT_STACK that can be used for env+arg.

Introducing another RLIMIT which overlaps with the stack limit doesn't make sense.

> I can then go suggest the POSIX.1 spec changes:
>
> 1. Define RLIMIT_ARG_MAX for getrlimit, as the number of bytes from
> RLIMIT_STACK which are reserved for arg and env passing.
>
> 2. Reword sysconf to say "The return value of sysconf(_SC_ARG_MAX) may change
> if you call setrlimit with changes to RLIMIT_ARG_MAX." Note that it already
> says similar things for sysconf(_SC_OPEN_MAX).

Where does the current sysconf value come from? When I grep the kernel I get this:

# git grep _SC_ARG_MAX
arch/sparc/kernel/sys_sunos.c:#define _SC_ARG_MAX 1
arch/sparc/kernel/sys_sunos.c: case _SC_ARG_MAX:
arch/sparc64/kernel/sys_sunos32.c:#define _SC_ARG_MAX 1
arch/sparc64/kernel/sys_sunos32.c: case _SC_ARG_MAX:
include/asm-sparc64/unistd.h:#define _SC_ARG_MAX 1

> In light of all this, I still think the kernel should provide a minimum of
> MAX_ARG_PAGES * PAGE_SIZE bytes, as it did before, to comply with the sysconf
> requirement that the limit not be more restrictive than the compiled-in
> limit.

A sysconf value of 1 page as defined by POSIX would do exactly that, no?

> My only intent is to make it easier for userspace to use this feature. As it
> stands, no userspace process knows that there is more than
> sysconf(_SC_ARG_MAX) available, and the new kernel fails to provide the
> compiled-in minimum.

Well, only if you set a very small stack limit. We could change it to provide a minimum of 32 pages, but if you then provide a stack of 32 pages you'll hit a wall as soon as you hit userspace anyway.
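The "1/4th of RLIMIT_STACK" rule mentioned in this reply can be estimated from userspace with getrlimit(). The following is a sketch under stated assumptions rather than kernel code: the function names are hypothetical, and treating an unlimited stack as "no bound from this rule" is a simplification made here, not something the thread states.

```c
#include <sys/resource.h>

/* Apply the "one quarter of the soft stack limit" rule described in the
 * thread. For an unlimited stack this sketch simply reports
 * RLIM_INFINITY (an assumption of the sketch). */
rlim_t quarter_of_stack(rlim_t stack_soft)
{
    if (stack_soft == RLIM_INFINITY)
        return RLIM_INFINITY;
    return stack_soft / 4;
}

/* Read the current soft RLIMIT_STACK and store the resulting env+arg
 * budget in *out. Returns 0 on success, -1 if getrlimit() fails. */
int current_arg_budget(rlim_t *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    *out = quarter_of_stack(rl.rlim_cur);
    return 0;
}
```

With a soft stack limit of 8 MiB, for instance, the rule gives a budget of 2 MiB for arguments plus environment.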
The kernel defines ARG_MAX in include/linux/limits.h. In userspace the include chain starts with /usr/include/limits.h and ends with the kernel definition. The value of ARG_MAX is currently a static value of 131072 bytes. The kernel should honour this value as a minimum. How does userspace determine there is a higher limit on ARG_MAX?
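The difference between the compile-time constant and the run-time query can be seen with a few lines of C. This is a minimal sketch with hypothetical function names. It guards the use of ARG_MAX with `#ifdef` because a C library is allowed to leave the macro undefined when the limit is not a fixed value; in that case the sketch returns -1.

```c
#include <limits.h>
#include <unistd.h>

/* Limit reported at run time by the C library. */
long runtime_arg_max(void)
{
    return sysconf(_SC_ARG_MAX);
}

/* Limit baked into the program at compile time, or -1 if <limits.h>
 * does not define ARG_MAX on this system. */
long compile_time_arg_max(void)
{
#ifdef ARG_MAX
    return ARG_MAX;
#else
    return -1;
#endif
}
```

Printing both values on the same machine shows whether the two sources agree. A program compiled against the static 131072-byte definition keeps that number regardless of which kernel it later runs on.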
I reported http://sourceware.org/bugzilla/show_bug.cgi?id=5786 , which led to Carlos filing this bug. Some thoughts:

Peter Z wrote:

> Secondly, the limit ARG_MAX is a tad useless now that it's variable.
> There isn't a single good value, and the requirement for sysconf()
> values to be constant during a process makes it impossible to return the
> actual number.

This doesn't change two points, AFAICS:

a) Portable applications need a way of determining the limit on the size available for [argv+environ].

b) POSIX.1 specifies that the way to do that is for the implementation to advertise it via the ARG_MAX limit.

The kernel should support these requirements, IMO.

Peter Z wrote:

> - Thirdly, since the env and arg arrays are on the stack,

Is that not simply an implementation detail? argv and environ could be anywhere in the process address space. Traditionally, they have sat just above the stack. But they need not do so, and POSIX.1 certainly doesn't require it.

It is the proximity of the stack and argv+environ that has led to the confusion of overloading RLIMIT_STACK with this new meaning. In retrospect, overloading RLIMIT_STACK to act as a limit on both stack size and on argv+environ was a mistake. Really, it would have been better to have a new independent read-write rlimit, named something like RLIMIT_ARG_MAX. (Carlos has proposed adding this as a read-only rlimit. That's a hack to repair the earlier design mistake. The POSIX folk probably wouldn't, and IMO, shouldn't, accept such a resource limit.)

One question is whether we want to retroactively add a read-write RLIMIT_ARG_MAX rlimit?

Peter Z. wrote

> I would propose to make sysconf(_SC_ARG_MAX) return _POSIX_ARG_MAX
> (4096) and allow for the one page. This ought to discourage its use,
> like the man page already does.

I'm not sure it's that simple. Taking its cue from the kernel, glibc has long advertised the compile time limit ARG_MAX as 131072. Lowering that to 4096 would seem to break binary compatibility. (I'm not sure on this point...)

Cheers,

Michael
On Mon, 2008-02-25 at 15:42 -0800, bugme-daemon@bugzilla.kernel.org wrote:

> ------- Comment #5 from carlos@codesourcery.com 2008-02-25 15:42 -------
> The kernel defines ARG_MAX in include/linux/limits.h. In userspace the
> include chain starts with /usr/include/limits.h and ends with the kernel
> definition. The value of ARG_MAX is currently a static value of 131072
> bytes. The kernel should honour this value as a minimum.

Argh, so it's all statically built into everything. How inconvenient.

So sysconf is more of a userspace construct, yet it provides kernel limits; that seems like a horrible construct.

> How does userspace determine there is a higher limit on ARG_MAX?

Kernel version and RLIMIT_STACK?
On Tue, 2008-02-26 at 02:20 -0800, bugme-daemon@bugzilla.kernel.org wrote:

> ------- Comment #6 from michael.kerrisk@gmail.com 2008-02-26 02:19 -------
> I reported http://sourceware.org/bugzilla/show_bug.cgi?id=5786 , which led to
> Carlos filing this bug. Some thoughts:
>
> Peter Z wrote:
>
> > Secondly, the limit ARG_MAX is a tad useless now that it's variable.
> > There isn't a single good value, and the requirement for sysconf()
> > values to be constant during a process makes it impossible to return the
> > actual number.
>
> This doesn't change two points, AFAICS:
>
> a) Portable applications need a way of determining the limit on the size
> available for [argv+environ].

Agreed, that is useful.

> b) POSIX.1 specifies that the way to do that is for the implementation to
> advertise it via the ARG_MAX limit.

Well, here we get into trouble. The sysconf stuff is IMHO quite useless: it seems to be a statically compiled userspace construct, yet it provides kernel limits. Not really a usable interaction here.

> The kernel should support these requirements, IMO.
>
> Peter Z wrote:
>
> > - Thirdly, since the env and arg arrays are on the stack,
>
> Is that not simply an implementation detail? argv and environ could be
> anywhere in the process address space. Traditionally, they have sat just
> above the stack. But they need not do so, and POSIX.1 certainly doesn't
> require it.

Ah, ok. I wasn't aware this was an implementation detail.

> It is the proximity of the stack and argv+environ that has led to the
> confusion of overloading RLIMIT_STACK with this new meaning. In retrospect,
> overloading RLIMIT_STACK to act as a limit on both stack size and on
> argv+environ was a mistake. Really, it would have been better to have a new
> independent read-write rlimit, named something like RLIMIT_ARG_MAX. (Carlos
> has proposed adding this as a read-only rlimit. That's a hack to repair the
> earlier design mistake. The POSIX folk probably wouldn't, and IMO,
> shouldn't, accept such a resource limit.)
>
> One question is whether we want to retroactively add a read-write
> RLIMIT_ARG_MAX rlimit?

Sure, if you say the env+arg arrays are not part of the stack proper per spec, that makes sense.

> Peter Z. wrote
>
> > I would propose to make sysconf(_SC_ARG_MAX) return _POSIX_ARG_MAX
> > (4096) and allow for the one page. This ought to discourage its use,
> > like the man page already does.
>
> I'm not sure it's that simple. Taking its cue from the kernel, glibc has
> long advertised the compile time limit ARG_MAX as 131072. Lowering that to
> 4096 would seem to break binary compatibility. (I'm not sure on this
> point...)

This is merely a problem of our flawed implementation of sysconf, as far as I can see.
What do you think of the following summary?

1. Kernel should honour, for binary compatibility, the minimum [arg + environ] space of 131072 bytes, or ARG_MAX as is defined in the kernel headers.

2. A robust and future proof implementation likely involves a new RLIMIT, let us say RLIMIT_ARG_MAX.

- The kernel will allow this value to be read and written.

- Reading RLIMIT_ARG_MAX returns the current number of bytes the kernel will allow for [arg + environ] space.

- Increasing RLIMIT_ARG_MAX beyond the current value requires CAP_SYS_RESOURCE, and the kernel will allow the VM to grow to RLIMIT_ARG_MAX + RLIMIT_STACK space.

- Decreasing RLIMIT_ARG_MAX below ARG_MAX returns an error.

- Glibc intends to read the value to provide userspace with a return value for sysconf(_SC_ARG_MAX). Newer applications can detect the presence of RLIMIT_ARG_MAX, and use this to determine the current limit, set it lower, or set it higher.

Notes:
- If we allow for RLIMIT_ARG_MAX + RLIMIT_STACK space in fs/exec.c, this would *truly* allow a process to control how much space it needed. This would be an exciting feature!
Reply-To: michael.kerrisk@googlemail.com

> Peter Z wrote:
> > I'm not sure it's that simple. Taking its cue from the kernel, glibc has
> > long advertised the compile time limit ARG_MAX as 131072. Lowering that to
> > 4096 would seem to break binary compatibility. (I'm not sure on this
> > point...)
>
> This is merely a problem of our flawed implementation of sysconf, as far as
> I can see.

No, it's more than that. Up until now, userland has expected to have 128k for argv+environ. With the 2.6.23 change, that is no longer true.
Reply-To: michael.kerrisk@googlemail.com

[Oliver, please take a look at this bug, since you made the earlier kernel change.]

On Tue, Feb 26, 2008 at 3:17 PM, <bugme-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10095
>
> ------- Comment #9 from carlos@codesourcery.com 2008-02-26 06:17 -------
> What do you think of the following summary?
>
> 1. Kernel should honour, for binary compatibility, the minimum [arg +
> environ] space of 131072 bytes, or ARG_MAX as is defined in the kernel
> headers.
>
> 2. A robust and future proof implementation likely involves a new RLIMIT,
> let us say RLIMIT_ARG_MAX.
>
> - The kernel will allow this value to be read and written.
>
> - Reading RLIMIT_ARG_MAX returns the current number of bytes the kernel will
> allow for [arg + environ] space.
>
> - Increasing RLIMIT_ARG_MAX beyond the current value requires
> CAP_SYS_RESOURCE, and the kernel will allow the VM to grow to
> RLIMIT_ARG_MAX + RLIMIT_STACK space.

Not quite: increasing the soft rlimit above the hard value, or increasing the hard value, requires CAP_SYS_RESOURCE.

> - Decreasing RLIMIT_ARG_MAX below ARG_MAX returns an error.

My one concern here is whether there might be legitimate cases where we want to permit this.

> - Glibc intends to read the value to provide userspace with a return value
> for sysconf(_SC_ARG_MAX). Newer applications can detect the presence of
> RLIMIT_ARG_MAX, and use this to determine the current limit, set it lower,
> or set it higher.
>
> Notes:
> - If we allow for RLIMIT_ARG_MAX + RLIMIT_STACK space in fs/exec.c, this
> would *truly* allow a process to control how much space it needed. This
> would be an exciting feature!

Other than the comments above, and the remark that this subtly changes the new ABI that was put in place in 2.6.23, the above sounds reasonable.

Cheers,

Michael
One thing I'm trying to understand is why the test code works at all. In pre-2.6.23 kernels, the arguments were still stored on the stack and still counted against the RLIMIT_STACK limit. I suspect installation of the argument pages simply bypassed the acct_stack_growth() call. If that's the case, this test only works because it doesn't attempt to grow the stack beyond its initial (overly large) allocation.

To test this, I added

char bogus[4096];

after the declaration of str. Lo and behold, the child crashes immediately upon startup. This points more to a bug in the pre-2.6.23 implementation than a bug in the new implementation.

That said, moving the arguments out of the stack seems like a reasonable idea as a future enhancement. It's not one I feel particularly motivated to pursue, though.
Some further thoughts on Carlos's ideas:

Carlos O'Donell wrote:

> 1. Kernel should honour, for binary compatibility, the minimum [arg +
> environ] space of 131072 bytes, or ARG_MAX as is defined in the kernel
> headers.

I'd rather say that the default must be at least 131072 bytes. A lower limit should be permissible (see below).

> 2. A robust and future proof implementation likely involves a new RLIMIT, let
> us say RLIMIT_ARG_MAX.
>
> - The kernel will allow this value to be read and written.
>
> - Reading RLIMIT_ARG_MAX returns the current number of bytes the kernel will
> allow for [arg + environ] space.
>
> - Increasing RLIMIT_ARG_MAX beyond the current value requires
> CAP_SYS_RESOURCE, and the kernel will allow the VM to grow to RLIMIT_ARG_MAX
> + RLIMIT_STACK space.
>
> - Decreasing RLIMIT_ARG_MAX below ARG_MAX returns an error.

There's no reason why we should impose this restriction. Providing a new rlimit gives the user explicit control of the ABI. The only thing we should guarantee is that, by default, at least the old ARG_MAX bytes of space are available. If the user chooses to explicitly set a lower rlimit, there's no reason not to honor that.
Michael, Yes, you are probably right. Could the manpage indicate that lowering this value below the compiled ARG_MAX violates POSIX? I hope that Peter's RLIMIT_ARG_MAX patch gets picked up. This would give us forward progress.
Michael - can we document that and close the bug as DOCUMENTED ?
(In reply to comment #15)
> Michael - can we document that and close the bug as DOCUMENTED ?

Hi Alan,

In fact, the problem has by now been fixed; I just hadn't got round to documenting it. See Linus' http://thread.gmane.org/gmane.linux.kernel/646709/focus=648101, checked into 2.6.25 as commit a64e715fc74b1a7dcc5944f848acc38b2c4d4ee2. This commit imposes a floor on ARG_MAX, so that it can't be set lower than the pre-2.6.23 limit of 32 pages (128k on x86-32). (Ulrich also checked in some changes to glibc so that sysconf(_SC_ARG_MAX) nowadays reports things correctly again.)

I'm making some changes to the execve(2) man page to document this (for man-pages-3.11). They should be available in the next hour or so via git.
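Taken together with the quarter-of-stack rule discussed earlier in the thread, the floor described here can be modelled in userspace as a simple maximum. This is a sketch of the described behaviour and not the kernel's code: the function name and the `LEGACY_ARG_PAGES` macro are hypothetical, and the value 32 is the pre-2.6.23 page count given in the comment above.

```c
/* Pre-2.6.23 argument page count, as stated in the comment above.
 * The macro name is an assumption of this sketch. */
#define LEGACY_ARG_PAGES 32UL

/* Model of the limit after the fix: a quarter of the soft stack limit,
 * but never less than the legacy 32 pages. Both inputs are in bytes. */
unsigned long arg_limit_with_floor(unsigned long stack_soft,
                                   unsigned long page_size)
{
    unsigned long quarter = stack_soft / 4;
    unsigned long minimum = LEGACY_ARG_PAGES * page_size;

    return quarter > minimum ? quarter : minimum;
}
```

With 4096-byte pages, an 8 MiB stack limit gives 2 MiB under this model, while a very small 64 KiB stack limit still gives the 128k floor.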