Bug 10095

Summary: Kernel does not honour _ARG_MAX when computing size of argp+envp in execve'd new process image.
Product: Process Management Reporter: Carlos O'Donell (carlos)
Component: OtherAssignee: Michael Kerrisk (mtk.manpages)
Status: CLOSED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: a.p.zijlstra, aaw, akpm, michael.kerrisk, mtk.manpages
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.23 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Test showing kernel does not honour _ARG_MAX space for arg+env in execve.

Description Carlos O'Donell 2008-02-25 09:07:16 UTC
Latest working kernel version: ??? (2.6.15)
Earliest failing kernel version: 2.6.23
Distribution: Debian.
Hardware Environment: x86
Software Environment: ???
Problem Description:

The kernel does not honour the value of _ARG_MAX provided by sysconf(_SC_ARG_MAX) or the value in limits.h. 

This was broken by this commit:
~~~
commit b6a2fea39318e43fee84fa7b0b90d68bed92d2ba
Author: Ollie Wild <aaw@google.com>
Date:   Thu Jul 19 01:48:16 2007 -0700

    mm: variable length argument support
    
    Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
    the old mm into the new mm.
    
    We create the new mm before the binfmt code runs, and place the new stack at
    the very top of the address space.  Once the binfmt code runs and figures out
    where the stack should be, we move it downwards.
    
    It is a bit peculiar in that we have one task with two mm's, one of which is
    inactive.
    
    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild <aaw@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: <linux-arch@vger.kernel.org>
    Cc: Hugh Dickins <hugh@veritas.com>
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk <bunk@stusta.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
~~~

In fs/exec.c (get_arg_page) we should ensure that a minimum of MAX_ARG_PAGES is provided, since this is what glibc is telling userspace via sysconf (_SC_ARG_MAX).


Steps to reproduce:
See attached patch. This testcase works on earlier kernels, but fails on 2.6.23.
Comment 1 Carlos O'Donell 2008-02-25 09:10:49 UTC
Created attachment 14982 [details]
Test showing kernel does not honour _ARG_MAX space for arg+env in execve.

Compile with `gcc -o sysconf-fail sysconf-fail.c' and run the test. It will print a 'PASS:.*' as the last line if it passes or 'FAIL:.*' as the last line if it fails.
Comment 2 Peter Zijlstra 2008-02-25 10:22:36 UTC
This is a rather silly bug.

 - Firstly, the test program is actually wrong, because it should also
calculate the env size.
 - Secondly, the limit ARG_MAX is a tad useless now that its variable.
There isn't a single good value, and the requirement for sysconf()
values to be constant during a process make it impossible to return the
actual number.
 - Thirdly, since the env and arg arrays are on the stack, and you limit
the stack to less than the given number, its arguable correct to fail.

I would propose to make sysconf(_SC_ARG_MAX) return _POSIX_MAX_ARG
(4096) and allow for the one page. This ought to discourage its use -
like the man page already does.
Comment 3 Carlos O'Donell 2008-02-25 12:59:04 UTC
Peter,

There is a real need in userspace for very-large argument passing, as proof see the @file support added to libiberty for gcc. The @file support allows a program to inject command line arguments into another exec'd program via a file. It would be nice to avoid this, unfortunately, we need a way to determine if the running kernel supports variable length argument passing.

How do we expose cool new features, like this one, to userspace?

Please see http://sourceware.org/bugzilla/show_bug.cgi?id=5786 for some background on the issue.

What do you think about adding a new RLIMIT_ARG_MAX, and having the kernel return the current value it has reserved as the limit?

I can then go suggest the POSIX.1 spec changes:

1. Define RLIMIT_ARG_MAX for getrlimit, as the number of bytes from RLIMIT_STACK which are reserved for arg and env passing.

2. Reword sysconf to say "The return value of sysconf(_SC_ARG_MAX) may change if you call setrlimit with changes to RLIMIT_ARG_MAX." Note that it already says similar things for sysconf(_SC_OPEN_MAX).

In light of all this, I still think the kernel should provide a minimum of MAX_ARG_PAGES * PAGE_SIZE bytes, as it did before, to comply with the sysconf requirement that the limit not be more restrictive than the compiled in limit.

My only intent is to make it easier for userspace to use this feature. As it stands no userspace process knows that there is more than sysconf(_SC_ARG_MAX) available, and the new kernel fails to provide the compiled in minimum.
Comment 4 Peter Zijlstra 2008-02-25 13:17:48 UTC
On Mon, 2008-02-25 at 12:59 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10095
> 
> 
> 
> 
> 
> ------- Comment #3 from carlos@codesourcery.com  2008-02-25 12:59 -------
> Peter,
> 
> There is a real need in userspace for very-large argument passing, as proof
> see
> the @file support added to libiberty for gcc. The @file support allows a
> program to inject command line arguments into another exec'd program via a
> file. It would be nice to avoid this, unfortunately, we need a way to
> determine
> if the running kernel supports variable length argument passing.
> 
> How do we expose cool new features, like this one, to userspace?
> 
> Please see http://sourceware.org/bugzilla/show_bug.cgi?id=5786 for some
> background on the issue.
> 
> What do you think about adding a new RLIMIT_ARG_MAX, and having the kernel
> return the current value it has reserved as the limit?

Not needed its all stack and we already use RLIMIT_STACK for exactly
that purpose. Currently its 1/4th of RLIMIT_STACK that can be used for
env+arg.

Introducing another RLIMIT which overlaps with the stack limit doesn't
make sense.

> I can then go suggest the POSIX.1 spec changes:
> 
> 1. Define RLIMIT_ARG_MAX for getrlimit, as the number of bytes from
> RLIMIT_STACK which are reserved for arg and env passing.
> 
> 2. Reword sysconf to say "The return value of sysconf(_SC_ARG_MAX) may change
> if you call setrlimit with changes to RLIMIT_ARG_MAX." Note that it already
> says similar things for sysconf(_SC_OPEN_MAX).

Where does the current sysconf value come from? When I grep the kernel I
get this:

# git grep _SC_ARG_MAX
arch/sparc/kernel/sys_sunos.c:#define   _SC_ARG_MAX             1
arch/sparc/kernel/sys_sunos.c:  case _SC_ARG_MAX:
arch/sparc64/kernel/sys_sunos32.c:#define   _SC_ARG_MAX             1
arch/sparc64/kernel/sys_sunos32.c:      case _SC_ARG_MAX:
include/asm-sparc64/unistd.h:#define   _SC_ARG_MAX             1

> In light of all this, I still think the kernel should provide a minimum of
> MAX_ARG_PAGES * PAGE_SIZE bytes, as it did before, to comply with the sysconf
> requirement that the limit not be more restrictive than the compiled in
> limit.

A sysconf value of 1 page as defined by POSIX would do exactly that, no?

> My only intent is to make it easier for userspace to use this feature. As it
> stands no userspace process knows that there is more than
> sysconf(_SC_ARG_MAX)
> available, and the new kernel fails to provide the compiled in minimum.

Well, only if you set a very small stack limit. We could change it to
provide a minimum of 32 pages, but if you then provide a stack of 32
pages you'll hit a wall as soon as you hit userspace anyway.
Comment 5 Carlos O'Donell 2008-02-25 15:42:29 UTC
The kernel defines ARG_MAX in include/linux/limits.h. In userspace the include chain starts with  /usr/include/limits.h and ends with the kernel definition. The value of ARG_MAX is currently a static value of 131072 bytes. The kernel should honour this value as a minimum.

How does userspace determine there is a higher limit on ARG_MAX?
Comment 6 Michael Kerrisk 2008-02-26 02:19:59 UTC
I reported http://sourceware.org/bugzilla/show_bug.cgi?id=5786 , which led to Carlos filing this bug.  Some thoughts:

Peter Z wrote:

> Secondly, the limit ARG_MAX is a tad useless now that its variable.
> There isn't a single good value, and the requirement for sysconf()
> values to be constant during a process make it impossible to return the
> actual number.

This doesn't change two points, AFAICS:

a) Portable applications need a way of determining the limit on the size available for [argv+environ].  

b) POSIX.1 specifies that the way to do that for the implementation to advertise via the ARG_MAX limit.

The kernel should support these requirements, IMO.

Peter Z wrote:

> - Thirdly, since the env and arg arrays are on the stack, 

Is that not simply an implementation detail?  argv and environ could be anywhere in the process address space.  Traditionally, they have sat just above the stack.  But they need not do so, and POSIX.1 certainly doesn't require it.

It is the proximity of the stack and argv+enviorn that has led to the confusion of overloading RLIMIT_STACK with this new meaning.  In retrospect, overloading RLIMIT_STACK to act as a limit on both stack size and on argv+environ was a mistake.  Really, it would have been better to have a new independent read-write rlmit, named something like RLIMIT_ARG_MAX.  (Carlos has proposed adding this as a read-only rlimit.  That's a hack to repair the earlier design mistake.  The POSIX folk probably wouldn't, and IMO, shouldn't, accept such a resource limit.)

One question is whether we want to retroactively add a read-write RLIMIT_ARG_MAX rlimit?

Peter Z. wrote

> I would propose to make sysconf(_SC_ARG_MAX) return _POSIX_MAX_ARG
> (4096) and allow for the one page. This ought to discourage its use -
> like the man page already does.

I'm not sure it's that simple.  Taking it's cue from the kernel, glibc has long advertised the compile time limit ARG_MAX as 131072.  Lowering that to 4096 would seem to break binary compatibility.  (I'm not sure on this point...)

Cheers,

Michael
Comment 7 Peter Zijlstra 2008-02-26 04:12:47 UTC
On Mon, 2008-02-25 at 15:42 -0800, bugme-daemon@bugzilla.kernel.org
wrote:

> ------- Comment #5 from carlos@codesourcery.com  2008-02-25 15:42 -------
> The kernel defines ARG_MAX in include/linux/limits.h. In userspace the
> include
> chain starts with  /usr/include/limits.h and ends with the kernel definition.
> The value of ARG_MAX is currently a static value of 131072 bytes. The kernel
> should honour this value as a minimum.

Argh, so its all statically build into everything. How inconvenient.
So sysconf is more a userspace construct while it provides kernel
limits, seems like a horrible construct.

> How does userspace determine there is a higher limit on ARG_MAX?

Kernel version and RLIMIT_STACK?
Comment 8 Peter Zijlstra 2008-02-26 04:18:38 UTC
On Tue, 2008-02-26 at 02:20 -0800, bugme-daemon@bugzilla.kernel.org
wrote:

> ------- Comment #6 from michael.kerrisk@gmail.com  2008-02-26 02:19 -------
> I reported http://sourceware.org/bugzilla/show_bug.cgi?id=5786 , which led to
> Carlos filing this bug.  Some thoughts:
> 
> Peter Z wrote:
> 
> > Secondly, the limit ARG_MAX is a tad useless now that its variable.
> > There isn't a single good value, and the requirement for sysconf()
> > values to be constant during a process make it impossible to return the
> > actual number.
> 
> This doesn't change two points, AFAICS:
> 
> a) Portable applications need a way of determining the limit on the size
> available for [argv+environ].  

Agreed, that is useful.

> b) POSIX.1 specifies that the way to do that for the implementation to
> advertise via the ARG_MAX limit.

Well, here we get into trouble. The sysconf stuff is IMHO quite useless,
it seems a statically compiled userspace construct while it provides
kernel limits. Not really a usable interaction here.

> The kernel should support these requirements, IMO.
> 
> Peter Z wrote:
> 
> > - Thirdly, since the env and arg arrays are on the stack, 
> 
> Is that not simply an implementation detail?  argv and environ could be
> anywhere in the process address space.  Traditionally, they have sat just
> above
> the stack.  But they need not do so, and POSIX.1 certainly doesn't require
> it.

Ah, ok. I wasn't aware this was an implementation detail.

> It is the proximity of the stack and argv+enviorn that has led to the
> confusion
> of overloading RLIMIT_STACK with this new meaning.  In retrospect,
> overloading
> RLIMIT_STACK to act as a limit on both stack size and on argv+environ was a
> mistake.  Really, it would have been better to have a new independent
> read-write rlmit, named something like RLIMIT_ARG_MAX.  (Carlos has proposed
> adding this as a read-only rlimit.  That's a hack to repair the earlier
> design
> mistake.  The POSIX folk probably wouldn't, and IMO, shouldn't, accept such a
> resource limit.)
> 
> One question is whether we want to retroactively add a read-write
> RLIMIT_ARG_MAX rlimit?

Sure, if you say the env+arg arrays are not part of the stack proper per
spec, that makes sense.

> Peter Z. wrote
> 
> > I would propose to make sysconf(_SC_ARG_MAX) return _POSIX_MAX_ARG
> > (4096) and allow for the one page. This ought to discourage its use -
> > like the man page already does.
> 
> I'm not sure it's that simple.  Taking it's cue from the kernel, glibc has
> long
> advertised the compile time limit ARG_MAX as 131072.  Lowering that to 4096
> would seem to break binary compatibility.  (I'm not sure on this point...)

This is merely problem of our flawed implementation of sysconf as far as
I can see.
Comment 9 Carlos O'Donell 2008-02-26 06:17:00 UTC
What do you think of the following summary?

1. Kernel should honour, for binary compatibility, the minimum [arg + environ] space of 131072 bytes, or ARG_MAX as is defined in the kernel headers.

2. A robust and future proof implementation likely involves a new RLIMIT, let us say RLIMIT_ARG_MAX.

- The kernel will allow this value to be read and write. 

- Reading RLIMIT_ARG_MAX returns the current number of bytes the kernel will allow for [arg + environ] space.

- Increasing RLIMIT_ARG_MAX beyond the current value requires CAP_SYS_RESOURCE, and the kernel will allow the VM to grow to RLIMIT_ARG_MAX + RLIMIT_STACK space.

- Decreasing RLIMIT_ARG_MAX below ARG_MAX returns an error.

- Glibc intends to read the value to provide userspace with a return value for sysconf(_SC_ARG_MAX). Newer applications can detect the presence of RLIMIT_ARG_MAX, and use this to determine the current limit, set it lower, or set it higher.

Notes:
- If we allow for RLIMIT_ARG_MAX + RLIMIT_STACK space in fs/exec.c, this would *truly* allow a process to control how much space it needed. This would be an exciting feature!
Comment 10 Anonymous Emailer 2008-02-26 06:35:58 UTC
Reply-To: michael.kerrisk@googlemail.com

> Peter Z wrote:
>  > I'm not sure it's that simple.  Taking it's cue from the kernel, glibc has
>  long
>  > advertised the compile time limit ARG_MAX as 131072.  Lowering that to
>  4096
>  > would seem to break binary compatibility.  (I'm not sure on this point...)
>
>  This is merely problem of our flawed implementation of sysconf as far as
>  I can see.

No, it's more than that.  Up until now, userland has expected to have
128k for argv+environ.  With the 2.6.23 change, that is no longer
true.
Comment 11 Anonymous Emailer 2008-02-26 06:44:30 UTC
Reply-To: michael.kerrisk@googlemail.com

[Oliver, please take a look at this bug, since you made the earlier
kernel change.]

On Tue, Feb 26, 2008 at 3:17 PM,  <bugme-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10095
>
>
>
>
>
>  ------- Comment #9 from carlos@codesourcery.com  2008-02-26 06:17 -------
>  What do you think of the following summary?
>
>  1. Kernel should honour, for binary compatibility, the minimum [arg +
>  environ]
>  space of 131072 bytes, or ARG_MAX as is defined in the kernel headers.
>
>  2. A robust and future proof implementation likely involves a new RLIMIT,
>  let
>  us say RLIMIT_ARG_MAX.
>
>  - The kernel will allow this value to be read and write.
>
>  - Reading RLIMIT_ARG_MAX returns the current number of bytes the kernel will
>  allow for [arg + environ] space.
>
>  - Increasing RLIMIT_ARG_MAX beyond the current value requires
>  CAP_SYS_RESOURCE,
>  and the kernel will allow the VM to grow to RLIMIT_ARG_MAX + RLIMIT_STACK
>  space.

Not quite - increasing the soft rlimit above the hard value, or
increasing the hard value, requires CAP_SYS_RESOURCE.

>  - Decreasing RLIMIT_ARG_MAX below ARG_MAX returns an error.

My one concern here is whether there might be legitimate cases where
we want to permit this.

>  - Glibc intends to read the value to provide userspace with a return value
>  for
>  sysconf(_SC_ARG_MAX). Newer applications can detect the presence of
>  RLIMIT_ARG_MAX, and use this to determine the current limit, set it lower,
>  or
>  set it higher.
>
>  Notes:
>  - If we allow for RLIMIT_ARG_MAX + RLIMIT_STACK space in fs/exec.c, this
>  would
>  *truly* allow a process to control how much space it needed. This would be
>  an
>  exciting feature!

Other than the comments above, and the remark that this subtly changes
the new ABI that was put in place in 2.6.23, the above sounds
reasonable.

Cheers,

Michael
Comment 12 Ollie Wild 2008-02-26 17:41:43 UTC
One thing I'm trying to understand is why the test code works at all.  In pre-2.6.23 kernels, the arguments were still stored on the stack and still counted against the RLIMIT_STACK limit.  I suspect installation of the argument pages simply bypassed the acct_stack_growth() call.  If that's the case, this test only works because it doesn't attempt to grow the stack beyond its initial (overly large) allocation.

To test this, I added

  char bogus[4096];

after the declaration of str.  Low and behold, the child crashes immediately upon startup.

This points more to a bug in the pre-2.6.23 implementation than a bug in the new implementation.

That said, moving the arguments out of the stack seems like a reasonable idea as a future enhancement.  It's not one I feel particularly motivated to pursue, though.
Comment 13 Michael Kerrisk 2008-02-28 13:09:24 UTC
Some further thoughts on Carlos's ideas:

Carlos O'Donell wrote:

> 1. Kernel should honour, for binary compatibility, the minimum [arg +
> environ]
> space of 131072 bytes, or ARG_MAX as is defined in the kernel headers.

I'd rather say that default must be at least 131072 bytes.  A lower limit should be permissible (see below).

> 2. A robust and future proof implementation likely involves a new RLIMIT, let
> us say RLIMIT_ARG_MAX.
>
> - The kernel will allow this value to be read and write. 
>
> - Reading RLIMIT_ARG_MAX returns the current number of bytes the kernel will
> allow for [arg + environ] space.
> 
> - Increasing RLIMIT_ARG_MAX beyond the current value requires 
> CAP_SYS_RESOURCE, and the kernel will allow the VM to grow to RLIMIT_ARG_MAX 
> + RLIMIT_STACK space.
>
> - Decreasing RLIMIT_ARG_MAX below ARG_MAX returns an error.

There's no reason why we should impose this restriction.  Providing a new rlimit gives the user explicit control of the ABI.  The only thing we should guarantee is that by default, at least the old ARG_MAX bytes of space is guaranteed.  If the user chooses to explicitly set a lower rlimit, there's no reason not to honor that.
Comment 14 Carlos O'Donell 2008-02-29 08:56:58 UTC
Michael,

Yes, you are probably right. Could the manpage indicate that lowering this value below the compiled ARG_MAX violates POSIX?

I hope that Peter's RLIMIT_ARG_MAX patch gets picked up. This would give us forward progress.
Comment 15 Alan 2008-09-23 04:06:57 UTC
Michael - can we document that and close the bug as DOCUMENTED ?
Comment 16 Michael Kerrisk 2008-10-04 19:32:56 UTC
(In reply to comment #15)
> Michael - can we document that and close the bug as DOCUMENTED ?

Hi Alan,

In fact, the problem has by now been fixed, I just hadn't got round to documenting it.  See Linus' http://thread.gmane.org/gmane.linux.kernel/646709/focus=648101,
checked into 2.6.25 as commit a64e715fc74b1a7dcc5944f848acc38b2c4d4ee2.
This commit imposes a floor on ARG_MAX, so that it can't be set lower than the pre-2.6.23 limit of 32 pages (128k on x86-32).  (Ulrich also checked in some changes to glibc so that syscon(_SC_ARG_MAX) nowadays reports things correctly again.)

I'm making some changes to the execve(2) man page to document this (for man-pages-3.11).  They should be available in the next hour or so via git.