Bug 13302 - "bad pmd" on fork() of process with hugepage shared memory segments attached
Status: CLOSED CODE_FIX
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-05-13 19:54 UTC by starlight
Modified: 2012-06-07 10:35 UTC
CC List: 1 user

See Also:
Kernel Version: 2.6.29.1
Tree: Mainline
Regression: Yes


Attachments

Description starlight 2009-05-13 19:54:09 UTC
Kernel reports "bad pmd" errors when process with hugepage
shared memory segments attached executes fork() system call.
Using vfork() avoids the issue.

Bug also appears in RHEL5 2.6.18-128.1.6.el5 and causes
leakage of huge pages.

Bug does not appear in RHEL4 2.6.9-78.0.13.ELsmp.

See bug 12134 for an example of the errors reported
by 'dmesg'.
Comment 1 Andrew Morton 2009-05-13 20:09:21 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

(Please read this ^^^^ !)

On Wed, 13 May 2009 19:54:10 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=13302
> 
>            Summary: "bad pmd" on fork() of process with hugepage shared
>                     memory segments attached
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 2.6.29.1
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: akpm@linux-foundation.org
>         ReportedBy: starlight@binnacle.cx
>         Regression: Yes
> 
> 
> Kernel reports "bad pmd" errors when process with hugepage
> shared memory segments attached executes fork() system call.
> Using vfork() avoids the issue.
> 
> Bug also appears in RHEL5 2.6.18-128.1.6.el5 and causes
> leakage of huge pages.
> 
> Bug does not appear in RHEL4 2.6.9-78.0.13.ELsmp.
> 
> See bug 12134 for an example of the errors reported
> by 'dmesg'.
>
Comment 2 Anonymous Emailer 2009-05-14 10:53:34 UTC
Reply-To: mel@csn.ul.ie

On Wed, May 13, 2009 at 01:08:46PM -0700, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> (Please read this ^^^^ !)
> 
> On Wed, 13 May 2009 19:54:10 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=13302
> > 
> >            Summary: "bad pmd" on fork() of process with hugepage shared
> >                     memory segments attached
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 2.6.29.1
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Other
> >         AssignedTo: akpm@linux-foundation.org
> >         ReportedBy: starlight@binnacle.cx
> >         Regression: Yes
> > 
> > 
> > Kernel reports "bad pmd" errors when process with hugepage
> > shared memory segments attached executes fork() system call.
> > Using vfork() avoids the issue.
> > 
> > Bug also appears in RHEL5 2.6.18-128.1.6.el5 and causes
> > leakage of huge pages.
> > 
> > Bug does not appear in RHEL4 2.6.9-78.0.13.ELsmp.
> > 
> > See bug 12134 for an example of the errors reported
> > by 'dmesg'.
> > 

This seems familiar and I believe it couldn't be reproduced the last time
and then the problem reporter went away. We need a reproduction case, so
I modified one of the libhugetlbfs tests to do what I think you described
above. However, it does not trigger the problem for me on x86 or x86-64
running 2.6.29.1.

starlight@binnacle.cx, can you try the reproduction steps on your system
please? If it reproduces, can you send me your .config please? If it
does not reproduce, can you look at the test program and tell me what
it's doing differently from your reproduction case?

1. wget http://heanet.dl.sourceforge.net/sourceforge/libhugetlbfs/libhugetlbfs-2.3.tar.gz
2. tar -zxf libhugetlbfs-2.3.tar.gz
3. cd libhugetlbfs-2.3
4. wget http://www.csn.ul.ie/~mel/shm-fork.c (program is below for reference)
5. mv shm-fork.c tests/
6. make
7. ./obj/hugeadm --create-global-mounts
8. ./obj/hugeadm --pool-pages-min 2M:20
	(Adjust pagesize of 2M if necessary. If x86 and not 2M, tell me
	and send me your .config)
9. ./tests/obj32/shm-fork 10 2

On my two systems, I saw something like

# ./tests/obj32/shm-fork 10 2
Starting testcase "./tests/obj32/shm-fork", pid 3527
Requesting 4194304 bytes for each test
Spawning children glibc_fork:..........glibc_fork
Spawning children glibc_vfork:..........glibc_vfork
Spawning children sys_fork:..........sys_fork
PASS

Test program I used is below and is a modified version of what's in
libhugetlbfs. It does not compile standalone. The steps it takes are

1. Gets the hugepage size
2. Calls shmget() to create a suitably large shared memory segment
3. Creates a requested number of children
4.   Each child attaches to the shared memory segment
5.     Each child creates a grandchild
6.   The child and grandchildren write the segment
7.   The grandchild exits, the child waits for the grandchild
8.   The child detaches and exits
9. The parent waits for the child to exit

It does this for glibc fork, glibc vfork and a direct call to the system
call fork().

Thanks

==== CUT HERE ====

/*
 * libhugetlbfs - Easy use of Linux hugepages
 * Copyright (C) 2005-2006 David Gibson & Adam Litke, IBM Corporation.
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public License
 * as published by the Free Software Foundation; either version 2.1 of
 * the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with this library; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <hugetlbfs.h>
#include "hugetests.h"

#define P "shm-fork"
#define DESC \
	"* Test shared memory behavior when multiple threads are attached  *\n"\
	"* to a segment.  A segment is created and then children are       *\n"\
	"* spawned which attach, write, read (verify), and detach from the *\n"\
	"* shared memory segment.                                          *"

extern int errno;

/* Global Configuration */
static int nr_hugepages;
static int numprocs;
static int shmid = -1;

#define MAX_PROCS 200
#define BUF_SZ 256

#define GLIBC_FORK  0
#define GLIBC_VFORK 1
#define SYS_FORK    2
static char *testnames[] = { "glibc_fork", "glibc_vfork", "sys_fork" };

#define CHILD_FAIL(thread, fmt, ...) \
	do { \
		verbose_printf("Thread %d (pid=%d) FAIL: " fmt, \
			       thread, getpid(), __VA_ARGS__); \
		exit(1); \
	} while (0)

void cleanup(void)
{
	remove_shmid(shmid);
}

static void do_child(int thread, unsigned long size, int testtype)
{
	volatile char *shmaddr;
	int j, k;
	int pid, status;

	verbose_printf(".");
	for (j=0; j<5; j++) {
		shmaddr = shmat(shmid, 0, SHM_RND);
		if (shmaddr == MAP_FAILED)
			CHILD_FAIL(thread, "shmat() failed: %s",
				   strerror(errno));

		/* Create even more children to double up the work */
		switch (testtype) {
			case GLIBC_FORK:
				if ((pid = fork()) < 0)
					FAIL("glibc_fork(): %s", strerror(errno));
				break;
			case GLIBC_VFORK:
				if ((pid = vfork()) < 0)
					FAIL("glibc_vfork(): %s", strerror(errno));
				break;
			case SYS_FORK:
				if ((pid = syscall(__NR_fork)) < 0)
					FAIL("sys_fork(): %s", strerror(errno));
				break;
			default:
				FAIL("Test type %d not implemented\n", testtype);
		}

		/* Child and parent access the shared area */
		for (k=0;k<size;k++)
			shmaddr[k] = (char) (k);
		for (k=0;k<size;k++)
			if (shmaddr[k] != (char)k)
				CHILD_FAIL(thread, "Index %d mismatch", k);

		/* Child exits */
		if (pid == 0)
			exit(0);
		
		/* Parent waits for child and detaches */
		waitpid(pid, &status, 0);
		if (shmdt((const void *)shmaddr) != 0)
			CHILD_FAIL(thread, "shmdt() failed: %s",
				   strerror(errno));
	}
	exit(0);
}

static void do_test(unsigned long size, int testtype)
{
	int wait_list[MAX_PROCS];
	int i;
	int pid, status;
	char *testname = testnames[testtype];

	if ((shmid = shmget(2, size, SHM_HUGETLB|IPC_CREAT|SHM_R|SHM_W )) < 0)
		FAIL("shmget(): %s", strerror(errno));

	verbose_printf("Spawning children %s:", testname);
	for (i=0; i<numprocs; i++) {
		switch (testtype) {
			case GLIBC_FORK:
				if ((pid = fork()) < 0)
					FAIL("glibc_fork(): %s", strerror(errno));
				break;
			case GLIBC_VFORK:
				if ((pid = vfork()) < 0)
					FAIL("glibc_vfork(): %s", strerror(errno));
				break;
			case SYS_FORK:
				if ((pid = syscall(__NR_fork)) < 0)
					FAIL("sys_fork(): %s", strerror(errno));
				break;
			default:
				FAIL("Test type %d not implemented\n", testtype);
		}

		if (pid == 0)
			do_child(i, size, testtype);

		wait_list[i] = pid;
	}

	for (i=0; i<numprocs; i++) {
		waitpid(wait_list[i], &status, 0);
		if (WEXITSTATUS(status) != 0)
			FAIL("Thread %d (pid=%d) failed", i, wait_list[i]);

		if (WIFSIGNALED(status))
			FAIL("Thread %d (pid=%d) received unhandled signal",
			     i, wait_list[i]);
	}
	printf("%s\n", testname);
}

int main(int argc, char ** argv)
{
	unsigned long size;
	long hpage_size;

	test_init(argc, argv);

	if (argc < 3)
		CONFIG("Usage:  %s <# procs> <# pages>", argv[0]);

	numprocs = atoi(argv[1]);
	nr_hugepages = atoi(argv[2]);

	if (numprocs > MAX_PROCS)
		CONFIG("Cannot spawn more than %d processes", MAX_PROCS);

	check_hugetlb_shm_group();

	hpage_size = check_hugepagesize();
	size = hpage_size * nr_hugepages;

	verbose_printf("Requesting %lu bytes for each test\n", size);
	do_test(size, GLIBC_FORK);
	do_test(size, GLIBC_VFORK);
	do_test(size, SYS_FORK);
	PASS();
}
Comment 3 Anonymous Emailer 2009-05-14 10:59:31 UTC
Reply-To: mel@csn.ul.ie

On Thu, May 14, 2009 at 11:53:27AM +0100, Mel Gorman wrote:
> On Wed, May 13, 2009 at 01:08:46PM -0700, Andrew Morton wrote:
> > 
> > (switched to email.  Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> > 
> > (Please read this ^^^^ !)
> > 
> > On Wed, 13 May 2009 19:54:10 GMT
> > bugzilla-daemon@bugzilla.kernel.org wrote:
> > 
> > > http://bugzilla.kernel.org/show_bug.cgi?id=13302
> > > 
> > >            Summary: "bad pmd" on fork() of process with hugepage shared
> > >                     memory segments attached
> > >            Product: Memory Management
> > >            Version: 2.5
> > >     Kernel Version: 2.6.29.1
> > >           Platform: All
> > >         OS/Version: Linux
> > >               Tree: Mainline
> > >             Status: NEW
> > >           Severity: normal
> > >           Priority: P1
> > >          Component: Other
> > >         AssignedTo: akpm@linux-foundation.org
> > >         ReportedBy: starlight@binnacle.cx
> > >         Regression: Yes
> > > 
> > > 
> > > Kernel reports "bad pmd" errors when process with hugepage
> > > shared memory segments attached executes fork() system call.
> > > Using vfork() avoids the issue.
> > > 
> > > Bug also appears in RHEL5 2.6.18-128.1.6.el5 and causes
> > > leakage of huge pages.
> > > 
> > > Bug does not appear in RHEL4 2.6.9-78.0.13.ELsmp.
> > > 
> > > See bug 12134 for an example of the errors reported
> > > by 'dmesg'.
> > > 
> 
> This seems familiar and I believe it couldn't be reproduced the last time
> and then the problem reporter went away. We need a reproduction case so
> I modified on of the libhugetlbfs tests to do what I think you described
> above. However, it does not trigger the problem for me on x86 or x86-64
> running 2.6.29.1.
> 
> starlight@binnacle.cz, can you try the reproduction steps on your system
> please? If it reproduces, can you send me your .config please? If it
> does not reproduce, can you look at the test program and tell me what
> it's doing different to your reproduction case?
> 

Another question on top of this.

At any point, do you call madvise(MADV_WILLNEED), fadvise(FADV_WILLNEED)
or readahead() on the shared memory segment?
Comment 4 starlight 2009-05-14 17:16:44 UTC
Will try it out, but it has to wait till this weekend.


At 11:53 AM 5/14/2009 +0100, Mel Gorman wrote:
>starlight@binnacle.cx, can you try the reproduction steps on your system
>please? If it reproduces, can you send me your .config please? If it
>does not reproduce, can you look at the test program and tell me what
>it's doing different to your reproduction case?
>
Comment 5 starlight 2009-05-14 17:20:42 UTC
Definitely no.

The possibly unusual thing done is that a file is read into 
something like 30% of the segment, and the remaining pages are 
not touched.


At 11:59 AM 5/14/2009 +0100, Mel Gorman wrote:
>Another question on top of this.
>
>At any point, do you call madvise(MADV_WILLNEED),
>fadvise(FADV_WILLNEED) or readahead() on the share memory segment?
Comment 6 Anonymous Emailer 2009-05-14 17:49:55 UTC
Reply-To: mel@csn.ul.ie

On Thu, May 14, 2009 at 01:20:09PM -0400, starlight@binnacle.cx wrote:
> At 11:59 AM 5/14/2009 +0100, Mel Gorman wrote:
> >Another question on top of this.
> >
> >At any point, do you call madvise(MADV_WILLNEED),
> >fadvise(FADV_WILLNEED) or readahead() on the share memory segment?
>
> Definately no.
> 
> The possibly unusual thing done is that a file is read into 
> something like 30% of the segment, and the remaining pages are 
> not touched.
> 

Ok, I just tried that there - parent writing 30% of the shared memory
before forking but still did not reproduce the problem :(
Comment 7 starlight 2009-05-14 19:10:02 UTC
At 06:49 PM 5/14/2009 +0100, Mel Gorman wrote:
>Ok, I just tried that there - parent writing 30% of the shared memory
>before forking but still did not reproduce the problem :(

Maybe it makes a difference to have lots of RAM (16GB on this 
server), and about 1.5 GB of hugepage shared memory allocated in 
the forking process in about four segments.  Often have all free 
memory consumed by the file cache, but I don't believe this is 
necessary to produce the problem as it will happen even right 
after a reboot.  [RHEL5 meminfo attached]

Other possible factors:
   daemon is non-root but has explicit
      CAP_IPC_LOCK, CAP_NET_RAW, CAP_SYS_NICE set via
      'setcap cap_net_raw,cap_ipc_lock,cap_sys_nice+ep daemon'
   ulimit -Hl and -Sl are set to <unlimited>
   process group is set in /proc/sys/vm/hugetlb_shm_group
   /proc/sys/vm/nr_hugepages is set to 2048
   daemon has 200 threads at time of fork()
   shared memory segments explicitly located [RHEL5 pmap -x attached]
   between fork & exec these syscalls are issued
      sched_getscheduler/sched_setscheduler
      getpriority/setpriority
      seteuid(getuid())
      setegid(getgid())
   with vfork() work-around, no syscalls are made before exec()

Don't think it's anything specific to the DL160 (Intel E5430)
we have, because the DL165 (Opteron 2354) also exhibits the problem.

Will run the test cases provided this weekend for certain and 
will let you know if bug is reproduced.

Have to go silent on this till the weekend.
Comment 9 starlight 2009-05-15 05:44:20 UTC
Whacked away at this, attempting to build a testcase from a 
combination of the original daemon strace in the bug report
and knowledge of what the daemon is doing.

What emerged is something that will destroy RHEL5 
2.6.18-128.1.6.el5 100% every time.  Completely fills the kernel 
message log with "bad pmd" errors and wrecks hugepages.

Unfortunately it only occasionally breaks 2.6.29.1.  Haven't
been able to produce "bad pmd" messages, but did get the
kernel to think it's out of large page memory when in
theory it was not.  Saw a lot of really strange accounting
in the hugepage section of /proc/meminfo.

For what it's worth, the testcase code is attached.

Note that hugepages=2048 is assumed--the bug seems to require 
use of more than 50% of large page memory.

Definitely will post this under the RHEL5 bug report, which is 
the more pressing issue here than far-future kernel support.

In addition, the original segment attach bug 
http://bugzilla.kernel.org/show_bug.cgi?id=12134 is still there 
and can be reproduced every time with the 'create_seg_strace' 
and 'access_seg_straceX' sequences.
Comment 10 Anonymous Emailer 2009-05-15 14:55:13 UTC
Reply-To: mel@csn.ul.ie

On Fri, May 15, 2009 at 01:32:38AM -0400, starlight@binnacle.cx wrote:
> Whacked at a this, attempting to build a testcase from a 
> combination of the original daemon strace in the bug report
> and knowledge of what the daemon is doing.
> 
> What emerged is something that will destroy RHEL5 
> 2.6.18-128.1.6.el5 100% every time.  Completely fills the kernel 
> message log with "bad pmd" errors and wrecks hugepages.
> 

Ok, I can confirm that more or less. I reproduced the problem on 2.6.18-92.el5
on x86-64 running RHEL 5.2. I didn't have access to a machine with enough
memory though so I dropped the requirements slightly. It still triggered
a failure though.

However, when I ran 2.6.18, 2.6.19 and 2.6.29.1 on the same machine, I could
not reproduce the problem, nor could I cause hugepages to leak so I'm leaning
towards believing this is a distribution bug at the moment.

On the plus side, due to your good work, there is enough available for them
to bisect this problem hopefully.

> Unfortunately it only occasionally breaks 2.6.29.1.  Haven't
> been able to produce "bad pmd" messages, but did get the
> kernel to think it's out of large page memory when in
> theory it was not.  Saw a lot of really strange accounting
> in the hugepage section of /proc/meminfo.
> 

What sort of strange accounting? The accounting has changed since 2.6.18
so I want to be sure you're really seeing something weird. When I was
testing, I didn't see anything out of the ordinary but maybe I'm looking
in a different place.

> For what it's worth, the testcase code is attached.
> 

I cleaned the test up a bit and wrote a wrapper script to run this
multiple times while checking for hugepage leaks. I have it running in a
loop while the machine runs sysbench as a stress test, to see if I can
cause anything out of the ordinary to happen. Nothing so far though.

> Note that hugepages=2048 is assumed--the bug seems to require 
> use of more than 50% of large page memory.
> 
> Definately will be posted under the RHEL5 bug report, which is 
> the more pressing issue here than far-future kernel support.
> 

If you've filed a RedHat bug, this modified testcase and wrapper script
might help them. The program exits and cleans up after itself, and the
memory requirements are lower. The script sets the machine up in a way
that breaks for me, where the breakage is bad pmd messages and hugepages
leaking.
Comment 11 starlight 2009-05-15 15:20:02 UTC
At 03:55 PM 5/15/2009 +0100, Mel Gorman wrote:
>On Fri, May 15, 2009 at 01:32:38AM -0400, starlight@binnacle.cx 
>wrote:
>> Whacked at a this, attempting to build a testcase from a 
>> combination of the original daemon strace in the bug report
>> and knowledge of what the daemon is doing.
>> 
>> What emerged is something that will destroy RHEL5 
>> 2.6.18-128.1.6.el5 100% every time.  Completely fills the kernel
>> message log with "bad pmd" errors and wrecks hugepages.
>
>Ok, I can confirm that more or less. I reproduced the problem on 
>2.6.18-92.el5 on x86-64 running RHEL 5.2. I didn't have access 
>to a machine with enough memory though so I dropped the 
>requirements slightly. It still triggered a failure though.
>
>However, when I ran 2.6.18, 2.6.19 and 2.6.29.1 on the same 
>machine, I could not reproduce the problem, nor could I cause 
>hugepages to leak so I'm leaning towards believing this is a 
>distribution bug at the moment.
>
>On the plus side, due to your good work, there is enough 
>available for them to bisect this problem hopefully.

Good to hear that the testcase works on other machines.

>> Unfortunately it only occasionally breaks 2.6.29.1.  Haven't
>> been able to produce "bad pmd" messages, but did get the
>> kernel to think it's out of large page memory when in
>> theory it was not.  Saw a lot of really strange accounting
>> in the hugepage section of /proc/meminfo.
>>

>What sort of strange accounting? The accounting has changed 
>since 2.6.18 so I want to be sure you're really seeing something 
>weird. When I was testing, I didn't see anything out of the 
>ordinary but maybe I'm looking in a different place.

Saw things like both free and used set to zero, used set to 2048 
when it should not have been (in association with the failure).  
Often the counters would correct themselves after segments were 
removed with 'ipcrm'.  Sometimes not--usually when it broke.  
Also saw some truly insane usage counts like 32520 and less 
egregious off-by-one-or-two inaccuracies.

>> For what it's worth, the testcase code is attached.
>> 
>I cleaned the test up a bit and wrote a wrapper script to run 
>this multiple times while checking for hugepage leaks. I've it 
>running in a loop while the machine runs sysbench as a stress 
>test to see can I cause anything out of the ordinary to happen. 
>Nothing so far though.
>
>> Note that hugepages=2048 is assumed--the bug seems to require 
>> use of more than 50% of large page memory.
>> 
>> Definately will be posted under the RHEL5 bug report, which is 
>> the more pressing issue here than far-future kernel support.
>> 
>If you've filed a RedHat bug, this modified testcase and wrapper 
>script might help them. The program exists and cleans up after 
>itself and the memory requirements are less. The script sets the 
>machine up in a way that breaks for me where the breakage is bad 
>pmd messages and hugepages leaking.

Thank you for your efforts.  Could you post to the RH bug along 
with a back-reference to this?  Might improve the chances 
someone will pay attention to it.  It's at

https://bugzilla.redhat.com/show_bug.cgi?id=497653

In a week or two I'll see if I can make time to turn the 100% 
failure scenario into a testcase.  This is just the run of a
segment loader followed by running a status checker three times. 
In 2.6.29.1 I'm wondering if the "bad pmd" I saw was just a bit 
of bad memory, so might as well focus on the thing that fails 
with certainty.  Possibly the "bad pmd" case requires a few hours 
of live data runtime before it emerges--a tougher nut.
Comment 12 starlight 2009-05-15 18:47:56 UTC
This was really bugging me, so I hacked out
the test case for the attach failure.

Hoses 2.6.29.1 100% every time.  Run it like this:

tcbm_att
tcbm_att -
tcbm_att -
tcbm_att -

It will break on the last iteration with ENOMEM
and ENOMEM is all any shmget() or shmat() call
gets forever more.

After removing the segments this appears:

HugePages_Total:    2048
HugePages_Free:     2048
HugePages_Rsvd:     1280
HugePages_Surp:        0

Even though no segments show in 'ipcs'.
Comment 13 starlight 2009-05-15 18:53:31 UTC
Here's another possible clue:

I tried the first 'tcbm' testcase on a 2.6.27.7
kernel that was hanging around from a few months
ago and it breaks it 100% of the time.

Completely hoses huge memory.  Enough "bad pmd"
errors to fill the kernel log.
Comment 14 Anonymous Emailer 2009-05-18 16:36:58 UTC
Reply-To: mel@csn.ul.ie

On Fri, May 15, 2009 at 02:44:29PM -0400, starlight@binnacle.cx wrote:
> This was really bugging me, so I hacked out
> the test case for the attach failure.
> 
> Hoses 2.6.29.1 100% every time.  Run it like this:
> 
> tcbm_att
> tcbm_att -
> tcbm_att -
> tcbm_att -
> 
> It will break on the last iteration with ENOMEM
> and ENOMEM is all any shmget() or shmat() call
> gets forever more.
> 
> After removing the segments this appears:
> 
> HugePages_Total:    2048
> HugePages_Free:     2048
> HugePages_Rsvd:     1280
> HugePages_Surp:        0
> 

Ok, the critical fact was that one process mapped the segment read-write
and populated it. Each subsequent process mapped it read-only. The core VM
sets VM_SHARED for shared read-write file mappings but not for shared
read-only ones. hugetlbfs confused the two, using VM_SHARED where it meant
to check whether the mapping was MAP_SHARED. A straightforward mistake,
with the consequence that reservations "leaked" and future mappings failed
as a result.

Can you try this patch out please? It is against 2.6.29.1 and mostly
applies to 2.6.27.7. The reject is trivially resolved by editing
mm/hugetlb.c and changing the VM_SHARED at the end of
hugetlb_reserve_pages() to VM_MAYSHARE.

Thing is, this patch fixes a reservation issue. The bad pmd messages do
show up for the original test on 2.6.27.7 for x86-64 (not x86) but it's a
separate issue and I have not determined what it is yet. Can you test this
patch to begin with please?

==== CUT HERE ====
Account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in hugetlbfs

hugetlbfs reserves huge pages and accounts for them differently depending on
whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. However, the check
it makes against the VMA in some places is VM_SHARED and not VM_MAYSHARE.
For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the
mapping is MAP_SHARED *and* it is read-write. If a shared memory mapping
was created read-write for populating of data and mapped read-only by other
processes, then hugetlbfs gets the accounting wrong and reservations leak.

This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when
the intent of the code was to check whether the VMA was mapped MAP_SHARED
or MAP_PRIVATE.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 mm/hugetlb.c |   26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28c655b..e83ad2c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -316,7 +316,7 @@ static void resv_map_release(struct kref *ref)
 static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	if (!(vma->vm_flags & VM_SHARED))
+	if (!(vma->vm_flags & VM_MAYSHARE))
 		return (struct resv_map *)(get_vma_private_data(vma) &
 							~HPAGE_RESV_MASK);
 	return NULL;
@@ -325,7 +325,7 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	VM_BUG_ON(vma->vm_flags & VM_SHARED);
+	VM_BUG_ON(vma->vm_flags & VM_MAYSHARE);
 
 	set_vma_private_data(vma, (get_vma_private_data(vma) &
 				HPAGE_RESV_MASK) | (unsigned long)map);
@@ -334,7 +334,7 @@ static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
 static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	VM_BUG_ON(vma->vm_flags & VM_SHARED);
+	VM_BUG_ON(vma->vm_flags & VM_MAYSHARE);
 
 	set_vma_private_data(vma, get_vma_private_data(vma) | flags);
 }
@@ -353,7 +353,7 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
 	if (vma->vm_flags & VM_NORESERVE)
 		return;
 
-	if (vma->vm_flags & VM_SHARED) {
+	if (vma->vm_flags & VM_MAYSHARE) {
 		/* Shared mappings always use reserves */
 		h->resv_huge_pages--;
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
@@ -369,14 +369,14 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	if (!(vma->vm_flags & VM_SHARED))
+	if (!(vma->vm_flags & VM_MAYSHARE))
 		vma->vm_private_data = (void *)0;
 }
 
 /* Returns true if the VMA has associated reserve pages */
 static int vma_has_reserves(struct vm_area_struct *vma)
 {
-	if (vma->vm_flags & VM_SHARED)
+	if (vma->vm_flags & VM_MAYSHARE)
 		return 1;
 	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		return 1;
@@ -924,7 +924,7 @@ static long vma_needs_reservation(struct hstate *h,
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
-	if (vma->vm_flags & VM_SHARED) {
+	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		return region_chg(&inode->i_mapping->private_list,
 							idx, idx + 1);
@@ -949,7 +949,7 @@ static void vma_commit_reservation(struct hstate *h,
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
-	if (vma->vm_flags & VM_SHARED) {
+	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		region_add(&inode->i_mapping->private_list, idx, idx + 1);
 
@@ -1893,7 +1893,7 @@ retry_avoidcopy:
 	 * at the time of fork() could consume its reserves on COW instead
 	 * of the full address range.
 	 */
-	if (!(vma->vm_flags & VM_SHARED) &&
+	if (!(vma->vm_flags & VM_MAYSHARE) &&
 			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
 			old_page != pagecache_page)
 		outside_reserve = 1;
@@ -2000,7 +2000,7 @@ retry:
 		clear_huge_page(page, address, huge_page_size(h));
 		__SetPageUptodate(page);
 
-		if (vma->vm_flags & VM_SHARED) {
+		if (vma->vm_flags & VM_MAYSHARE) {
 			int err;
 			struct inode *inode = mapping->host;
 
@@ -2104,7 +2104,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto out_mutex;
 		}
 
-		if (!(vma->vm_flags & VM_SHARED))
+		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
 								vma, address);
 	}
@@ -2289,7 +2289,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * to reserve the full area even if read-only as mprotect() may be
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
-	if (!vma || vma->vm_flags & VM_SHARED)
+	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		chg = region_chg(&inode->i_mapping->private_list, from, to);
 	else {
 		struct resv_map *resv_map = resv_map_alloc();
@@ -2330,7 +2330,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * consumed reservations are stored in the map. Hence, nothing
 	 * else has to be done for private mappings here
 	 */
-	if (!vma || vma->vm_flags & VM_SHARED)
+	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		region_add(&inode->i_mapping->private_list, from, to);
 	return 0;
 }
Comment 15 Anonymous Emailer 2009-05-20 11:35:33 UTC
Reply-To: mel@csn.ul.ie

On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> Here's another possible clue:
> 
> I tried the first 'tcbm' testcase on a 2.6.27.7
> kernel that was hanging around from a few months
> ago and it breaks it 100% of the time.
> 
> Completely hoses huge memory.  Enough "bad pmd"
> errors to fill the kernel log.
> 

So I investigated what's wrong with 2.6.27.7. The problem is a race between
exec() and the handling of mlock()ed VMAs, but I can't see where. The normal
teardown of pages is applied to a shared memory segment as if VM_HUGETLB
were not set.

This was fixed between 2.6.27 and 2.6.28, but apparently by accident during the
introduction of CONFIG_UNEVICTABLE_LRU. That patchset made a number of changes
to how mlock()ed VMAs are handled, but I didn't spot which change fixed the
problem, and reverse bisecting didn't help. I've added two people who were
working on the unevictable LRU patches to see if they spot something.

For context, the two attached files are used to reproduce a problem
where bad pmd messages are scribbled all over the console on 2.6.27.7.
Do something like

echo 64 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs none /mnt
sh ./test-tcbm.sh

I did confirm that it didn't matter to 2.6.29.1 whether CONFIG_UNEVICTABLE_LRU
was set or not.  It's possible the race is still there, but I don't know where
it is.

Any ideas where the race might be?
Comment 16 Anonymous Emailer 2009-05-20 14:29:49 UTC
Reply-To: mel@csn.ul.ie

On Wed, May 20, 2009 at 12:35:25PM +0100, Mel Gorman wrote:
> On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> > Here's another possible clue:
> > 
> > I tried the first 'tcbm' testcase on a 2.6.27.7
> > kernel that was hanging around from a few months
> > ago and it breaks it 100% of the time.
> > 
> > Completely hoses huge memory.  Enough "bad pmd"
> > errors to fill the kernel log.
> > 
> 
> So I investigated what's wrong with 2.6.27.7. The problem is a race between
> exec() and the handling of mlock()ed VMAs but I can't see where. The normal
> teardown of pages is applied to a shared memory segment as if VM_HUGETLB
> was not set.
> 
> This was fixed between 2.6.27 and 2.6.28 but apparently by accident during
> the
> introduction of CONFIG_UNEVITABLE_LRU. This patchset made a number of changes
> to how mlock()ed are handled but I didn't spot which was the relevant change
> that fixed the problem and reverse bisecting didn't help. I've added two
> people
> that were working on the unevictable LRU patches to see if they spot
> something.
> 
> For context, the two attached files are used to reproduce a problem
> where bad pmd messages are scribbled all over the console on 2.6.27.7.
> Do something like
> 
> echo 64 > /proc/sys/vm/nr_hugepages
> mount -t hugetlbfs none /mnt
> sh ./test-tcbm.sh
> 
> I did confirm that it didn't matter to 2.6.29.1 if CONFIG_UNEVITABLE_LRU is
> set or not.  It's possible the race it still there but I don't know where
> it is.
> 
> Any ideas where the race might be?
> 

With all the grace of a drunken elephant in a china shop, I gave up on being
clever as it wasn't working and brute-force attacked this to make a list of the
commits needed for CONFIG_UNEVICTABLE_LRU on top of 2.6.27.7. This is the list

# Prereq commits for UNEVICT patches to apply
b69408e88bd86b98feb7b9a38fd865e1ddb29827 vmscan: Use an indexed array for LRU variables
62695a84eb8f2e718bf4dfb21700afaa7a08e0ea vmscan: move isolate_lru_page() to vmscan.c
f04e9ebbe4909f9a41efd55149bc353299f4e83b swap: use an array for the LRU pagevecs
68a22394c286a2daf06ee8d65d8835f738faefa5 vmscan: free swap space on swap-in/activation
b2e185384f534781fd22f5ce170b2ad26f97df70 define page_file_cache() function
4f98a2fee8acdb4ac84545df98cccecfd130f8db vmscan: split LRU lists into anon & file sets
556adecba110bf5f1db6c6b56416cfab5bcab698 vmscan: second chance replacement
7e9cd484204f9e5b316ed35b241abf088d76e0af vmscan: fix pagecache reclaim referenced
33c120ed2843090e2bd316de1588b8bf8b96cbde more aggressively use lumpy reclaim

# Part 1: Initial patches for UNEVICTABLE_LRU
8a7a8544a4f6554ec2d8048ac9f9672f442db5a2 pageflag helpers for configed-out flags
894bc310419ac95f4fa4142dc364401a7e607f65 Unevictable LRU Infrastructure
bbfd28eee9fbd73e780b19beb3dc562befbb94fa unevictable lru: add event counting with stat
7b854121eb3e5ba0241882ff939e2c485228c9c5 Unevictable LRU Page Statistics
ba9ddf49391645e6bb93219131a40446538a5e76 Ramfs and Ram Disk pages are unevictable
89e004ea55abe201b29e2d6e35124101f1288ef7 SHM_LOCKED pages are unevictable

# Part 2: Critical patch that makes the problem go away
b291f000393f5a0b679012b39d79fbc85c018233 mlock: mlocked pages are unevictable

# Part 3: Rest of UNEVICTABLE_LRU
fa07e787733416c42938a310a8e717295934e33c doc: unevictable LRU and mlocked pages doc
8edb08caf68184fb170f4f69c7445929e199eaea mlock: downgrade mmap sem while pop mlock
ba470de43188cdbff795b5da43a1474523c6c2fb mmap: handle mlocked pages during map, remap
5344b7e648980cc2ca613ec03a56a8222ff48820 vmstat: mlocked pages statistics
64d6519dda3905dfb94d3f93c07c5f263f41813f swap: cull unevictable pages in fault path
af936a1606246a10c145feac3770f6287f483f02 vmscan: unevictable LRU scan sysctl
985737cf2ea096ea946aed82c7484d40defc71a8 mlock: count attempts to free mlocked page
902d2e8ae0de29f483840ba1134af27343b9564d vmscan: kill unused lru functions
e0f79b8f1f3394bb344b7b83d6f121ac2af327de vmscan: don't accumulate scan pressure on un
c11d69d8c830e09a0e7b3935c952afb26c48bba8 mlock: revert mainline handling of mlock error
9978ad583e100945b74e4f33e73317983ea32df9 mlock: make mlock error return Posixly Correct

I won't get the chance to start picking apart
b291f000393f5a0b679012b39d79fbc85c018233 to see what's so special in there
until Friday but maybe someone else will spot the magic before I do.  Again,
it does not matter if UNEVICTABLE_LRU is set or not once that critical patch
is applied.

For what it's worth, this bug affects the SLES 11 kernel which is based on
2.6.27. I imagine they'd like to have this fixed but may not be so keen on
applying so many patches.
Comment 17 Lee Schermerhorn 2009-05-20 14:53:47 UTC
On Wed, 2009-05-20 at 12:35 +0100, Mel Gorman wrote:
> On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> > Here's another possible clue:
> > 
> > I tried the first 'tcbm' testcase on a 2.6.27.7
> > kernel that was hanging around from a few months
> > ago and it breaks it 100% of the time.
> > 
> > Completely hoses huge memory.  Enough "bad pmd"
> > errors to fill the kernel log.
> > 
> 
> So I investigated what's wrong with 2.6.27.7. The problem is a race between
> exec() and the handling of mlock()ed VMAs but I can't see where. The normal
> teardown of pages is applied to a shared memory segment as if VM_HUGETLB
> was not set.
> 
> This was fixed between 2.6.27 and 2.6.28 but apparently by accident during
> the
> introduction of CONFIG_UNEVITABLE_LRU. This patchset made a number of changes
> to how mlock()ed are handled but I didn't spot which was the relevant change
> that fixed the problem and reverse bisecting didn't help. I've added two
> people
> that were working on the unevictable LRU patches to see if they spot
> something.

Hi, Mel:
and still do.  With the unevictable lru, mlock()/mmap('LOCKED) now move
the mlocked pages to the unevictable lru list and munlock, including at
exit, must rescue them from the unevictable list.   Since hugepages are
not maintained on the lru and don't get reclaimed, we don't want to move
them to the unevictable list,  However, we still want to populate the
page tables.  So, we still call [_]mlock_vma_pages_range() for hugepage
vmas, but after making the pages present to preserve prior behavior, we
remove the VM_LOCKED flag from the vma.
The basic change to handling of hugepage handling with the unevictable
lru patches is that we no longer keep a huge page vma marked with
VM_LOCKED.  So, at exit time, there is no record that this is a vmlocked
vma.

A bit of context:  before the unevictable lru, mlock() or
mmap(MAP_LOCKED) would just set the VM_LOCKED flag and
"make_pages_present()" for all but a few vma types.  We've always
excluded those that get_user_pages() can't handle and still do.  With
the unevictable lru, mlock()/mmap('LOCKED) now move the mlocked pages to
the unevictable lru list and munlock, including at exit, must rescue
them from the unevictable list.   Since hugepages are not maintained on
the lru and don't get reclaimed, we don't want to move them to the
unevictable list,  However, we still want to populate the page tables.
So, we still call [_]mlock_vma_pages_range() for hugepage vmas, but
after making the pages present to preserve prior behavior, we remove the
VM_LOCKED flag from the vma.

This may have resulted in the apparent fix to the subject problem in
2.6.28...

> 
> For context, the two attached files are used to reproduce a problem
> where bad pmd messages are scribbled all over the console on 2.6.27.7.
> Do something like
> 
> echo 64 > /proc/sys/vm/nr_hugepages
> mount -t hugetlbfs none /mnt
> sh ./test-tcbm.sh
> 
> I did confirm that it didn't matter to 2.6.29.1 if CONFIG_UNEVITABLE_LRU is
> set or not.  It's possible the race it still there but I don't know where
> it is.
> 
> Any ideas where the race might be?

No, sorry.  Haven't had time to investigate this.

Lee
>
Comment 18 Lee Schermerhorn 2009-05-20 15:05:50 UTC
On Wed, 2009-05-20 at 10:53 -0400, Lee Schermerhorn wrote:
> On Wed, 2009-05-20 at 12:35 +0100, Mel Gorman wrote:
> > On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> > > Here's another possible clue:
> > > 
> > > I tried the first 'tcbm' testcase on a 2.6.27.7
> > > kernel that was hanging around from a few months
> > > ago and it breaks it 100% of the time.
> > > 
> > > Completely hoses huge memory.  Enough "bad pmd"
> > > errors to fill the kernel log.
> > > 
> > 
> > So I investigated what's wrong with 2.6.27.7. The problem is a race between
> > exec() and the handling of mlock()ed VMAs but I can't see where. The normal
> > teardown of pages is applied to a shared memory segment as if VM_HUGETLB
> > was not set.
> > 
> > This was fixed between 2.6.27 and 2.6.28 but apparently by accident during
> the
> > introduction of CONFIG_UNEVITABLE_LRU. This patchset made a number of
> changes
> > to how mlock()ed are handled but I didn't spot which was the relevant
> change
> > that fixed the problem and reverse bisecting didn't help. I've added two
> people
> > that were working on the unevictable LRU patches to see if they spot
> something.
> 
> Hi, Mel:
> and still do.  With the unevictable lru, mlock()/mmap('LOCKED) now move
> the mlocked pages to the unevictable lru list and munlock, including at
> exit, must rescue them from the unevictable list.   Since hugepages are
> not maintained on the lru and don't get reclaimed, we don't want to move
> them to the unevictable list,  However, we still want to populate the
> page tables.  So, we still call [_]mlock_vma_pages_range() for hugepage
> vmas, but after making the pages present to preserve prior behavior, we
> remove the VM_LOCKED flag from the vma.

Wow!  that got garbled.  not sure how.  Message was intended to start
here:

> The basic change to handling of hugepage handling with the unevictable
> lru patches is that we no longer keep a huge page vma marked with
> VM_LOCKED.  So, at exit time, there is no record that this is a vmlocked
> vma.
> 
> A bit of context:  before the unevictable lru, mlock() or
> mmap(MAP_LOCKED) would just set the VM_LOCKED flag and
> "make_pages_present()" for all but a few vma types.  We've always
> excluded those that get_user_pages() can't handle and still do.  With
> the unevictable lru, mlock()/mmap('LOCKED) now move the mlocked pages to
> the unevictable lru list and munlock, including at exit, must rescue
> them from the unevictable list.   Since hugepages are not maintained on
> the lru and don't get reclaimed, we don't want to move them to the
> unevictable list,  However, we still want to populate the page tables.
> So, we still call [_]mlock_vma_pages_range() for hugepage vmas, but
> after making the pages present to preserve prior behavior, we remove the
> VM_LOCKED flag from the vma.
> 
> This may have resulted in the apparent fix to the subject problem in
> 2.6.28...
> 
> > 
> > For context, the two attached files are used to reproduce a problem
> > where bad pmd messages are scribbled all over the console on 2.6.27.7.
> > Do something like
> > 
> > echo 64 > /proc/sys/vm/nr_hugepages
> > mount -t hugetlbfs none /mnt
> > sh ./test-tcbm.sh
> > 
> > I did confirm that it didn't matter to 2.6.29.1 if CONFIG_UNEVITABLE_LRU is
> > set or not.  It's possible the race it still there but I don't know where
> > it is.
> > 
> > Any ideas where the race might be?
> 
> No, sorry.  Haven't had time to investigate this.
> 
> Lee
> > 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Comment 19 Anonymous Emailer 2009-05-20 15:41:35 UTC
Reply-To: mel@csn.ul.ie

On Wed, May 20, 2009 at 11:05:15AM -0400, Lee Schermerhorn wrote:
> On Wed, 2009-05-20 at 10:53 -0400, Lee Schermerhorn wrote:
> > On Wed, 2009-05-20 at 12:35 +0100, Mel Gorman wrote:
> > > On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> > > > Here's another possible clue:
> > > > 
> > > > I tried the first 'tcbm' testcase on a 2.6.27.7
> > > > kernel that was hanging around from a few months
> > > > ago and it breaks it 100% of the time.
> > > > 
> > > > Completely hoses huge memory.  Enough "bad pmd"
> > > > errors to fill the kernel log.
> > > > 
> > > 
> > > So I investigated what's wrong with 2.6.27.7. The problem is a race
> between
> > > exec() and the handling of mlock()ed VMAs but I can't see where. The
> normal
> > > teardown of pages is applied to a shared memory segment as if VM_HUGETLB
> > > was not set.
> > > 
> > > This was fixed between 2.6.27 and 2.6.28 but apparently by accident
> during the
> > > introduction of CONFIG_UNEVITABLE_LRU. This patchset made a number of
> changes
> > > to how mlock()ed are handled but I didn't spot which was the relevant
> change
> > > that fixed the problem and reverse bisecting didn't help. I've added two
> people
> > > that were working on the unevictable LRU patches to see if they spot
> something.
> > 
> > Hi, Mel:
> > and still do.  With the unevictable lru, mlock()/mmap('LOCKED) now move
> > the mlocked pages to the unevictable lru list and munlock, including at
> > exit, must rescue them from the unevictable list.   Since hugepages are
> > not maintained on the lru and don't get reclaimed, we don't want to move
> > them to the unevictable list,  However, we still want to populate the
> > page tables.  So, we still call [_]mlock_vma_pages_range() for hugepage
> > vmas, but after making the pages present to preserve prior behavior, we
> > remove the VM_LOCKED flag from the vma.
> 
> Wow!  that got garbled.  not sure how.  Message was intended to start
> here:
> 
> > The basic change to handling of hugepage handling with the unevictable
> > lru patches is that we no longer keep a huge page vma marked with
> > VM_LOCKED.  So, at exit time, there is no record that this is a vmlocked
> > vma.
> > 

Basic, and in this case apparently the critical factor. This patch on
2.6.27.7 makes the problem disappear as well, by never setting VM_LOCKED on
hugetlb-backed VMAs. Obviously it's a hatchet job and almost certainly the
wrong fix, but it indicates that the handling of VM_LOCKED && VM_HUGETLB
is wrong somewhere. Now I have a better idea what to search for on
Friday. Thanks Lee.

--- mm/mlock.c	2009-05-20 16:36:08.000000000 +0100
+++ mm/mlock-new.c	2009-05-20 16:28:17.000000000 +0100
@@ -64,7 +64,8 @@
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, make_pages_present below will bring it back.
 	 */
-	vma->vm_flags = newflags;
+	if (!(vma->vm_flags & VM_HUGETLB))
+		vma->vm_flags = newflags;
 
 	/*
 	 * Keep track of amount of locked VM.
Comment 20 KOSAKI Motohiro 2009-05-21 00:41:53 UTC
Hi

> Basic and in this case, apparently the critical factor. This patch on
> 2.6.27.7 makes the problem disappear as well by never setting VM_LOCKED on
> hugetlb-backed VMAs. Obviously, it's a hachet job and almost certainly the
> wrong fix but it indicates that the handling of VM_LOCKED && VM_HUGETLB
> is wrong somewhere. Now I have a better idea now what to search for on
> Friday. Thanks Lee.
> 
> --- mm/mlock.c        2009-05-20 16:36:08.000000000 +0100
> +++ mm/mlock-new.c    2009-05-20 16:28:17.000000000 +0100
> @@ -64,7 +64,8 @@
>        * It's okay if try_to_unmap_one unmaps a page just after we
>        * set VM_LOCKED, make_pages_present below will bring it back.
>        */
> -     vma->vm_flags = newflags;
> +     if (!(vma->vm_flags & VM_HUGETLB))

the meaning of this condition isn't obvious to me. Could you please
consider adding a comment?


> +             vma->vm_flags = newflags;
>  
>       /*
>        * Keep track of amount of locked VM.
Comment 21 Anonymous Emailer 2009-05-22 16:41:10 UTC
Reply-To: mel@csn.ul.ie

On Thu, May 21, 2009 at 09:41:46AM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > Basic and in this case, apparently the critical factor. This patch on
> > 2.6.27.7 makes the problem disappear as well by never setting VM_LOCKED on
> > hugetlb-backed VMAs. Obviously, it's a hachet job and almost certainly the
> > wrong fix but it indicates that the handling of VM_LOCKED && VM_HUGETLB
> > is wrong somewhere. Now I have a better idea now what to search for on
> > Friday. Thanks Lee.
> > 
> > --- mm/mlock.c      2009-05-20 16:36:08.000000000 +0100
> > +++ mm/mlock-new.c  2009-05-20 16:28:17.000000000 +0100
> > @@ -64,7 +64,8 @@
> >      * It's okay if try_to_unmap_one unmaps a page just after we
> >      * set VM_LOCKED, make_pages_present below will bring it back.
> >      */
> > -   vma->vm_flags = newflags;
> > +   if (!(vma->vm_flags & VM_HUGETLB))
> 
> this condition meaning isn't so obvious to me. could you please
> consider comment adding?
> 

I should have used the helper, but anyway, the check was to see if the VMA was
backed by hugetlbfs or not. This wasn't the right fix. It was only intended
to show that it was something to do with the VM_LOCKED flag.

The real problem has something to do with pagetable sharing of hugetlb-backed
segments. After fork(), VM_LOCKED gets cleared, so when huge_pmd_share()
is called, some of the pagetables are shared and others are not. I believe
this is resulting in pagetables being freed prematurely. I'm cc'ing the
author and ackers of the pagetable-sharing patch to see if they can shed more
light on whether this is the right patch. Kenneth, Hugh?

==== CUT HERE ====
x86: Ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not

On x86 and x86-64, it is possible for page tables to be shared between shared
mappings backed by hugetlbfs. As part of this, page_table_shareable() compares
the vm_flags of a pair of VMAs, and they must match if the page tables are to
be shared. All VMA flags are taken into account, including VM_LOCKED.

The problem is that VM_LOCKED is cleared on fork(). When a process with a
shared memory segment fork()s to exec() a helper, there will be shared VMAs
with different flags. The impact is that the shared segment is sometimes
considered shareable and other times not, depending on which process is
checking. A test process that forks and execs heavily can trigger a
flood of "bad pmd" messages in the kernel log and leak hugepages.

I believe what happens is that the segment page tables are being shared but
the count is inaccurate depending on the ordering of events.

Strictly speaking, this affects mainline, but the problem is masked by the
changes made for CONFIG_UNEVICTABLE_LRU, as the kernel now never has VM_LOCKED
set for a hugetlbfs-backed mapping. It does affect the 2.6.27 stable branch
and distributions based on that kernel, such as SLES 11.

This patch addresses the problem by comparing all flags except VM_LOCKED when
deciding whether pagetables should be shared for a hugetlbfs-backed mapping.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 arch/x86/mm/hugetlbpage.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8f307d9..16e4bcc 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	unsigned long sbase = saddr & PUD_MASK;
 	unsigned long s_end = sbase + PUD_SIZE;
 
+	/* Allow segments to share if only one is locked */
+	unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
+	unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
+
 	/*
 	 * match the virtual addresses, permission and the alignment of the
 	 * page table page.
 	 */
 	if (pmd_index(addr) != pmd_index(saddr) ||
-	    vma->vm_flags != svma->vm_flags ||
+	    vm_flags != svm_flags ||
 	    sbase < svma->vm_start || svma->vm_end < s_end)
 		return 0;
Comment 22 Anonymous Emailer 2009-05-25 08:51:45 UTC
Reply-To: mel@csn.ul.ie

On Sun, May 24, 2009 at 10:44:29PM +0900, KOSAKI Motohiro wrote:
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > --- 
> >  arch/x86/mm/hugetlbpage.c |    6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index 8f307d9..16e4bcc 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct
> vm_area_struct *svma,
> >     unsigned long sbase = saddr & PUD_MASK;
> >     unsigned long s_end = sbase + PUD_SIZE;
> >  
> > +   /* Allow segments to share if only one is locked */
> > +   unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
> > +   unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
>                                   svma?
> 

/me slaps self

svma indeed.

With the corrected patch applied, I still cannot trigger the bad pmd
messages, so I'm convinced the bug is related to hugetlb pagetable
sharing and that this is more or less the fix. Any opinions?

>  - kosaki
> 
> > +
> >     /*
> >      * match the virtual addresses, permission and the alignment of the
> >      * page table page.
> >      */
> >     if (pmd_index(addr) != pmd_index(saddr) ||
> > -       vma->vm_flags != svma->vm_flags ||
> > +       vm_flags != svm_flags ||
> >         sbase < svma->vm_start || svma->vm_end < s_end)
> >             return 0;
> >  
> 
> 
>
Comment 23 Anonymous Emailer 2009-05-25 10:10:16 UTC
Reply-To: hugh.dickins@tiscali.co.uk

On Mon, 25 May 2009, Mel Gorman wrote:
> On Sun, May 24, 2009 at 10:44:29PM +0900, KOSAKI Motohiro wrote:
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > --- 
> > >  arch/x86/mm/hugetlbpage.c |    6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > > index 8f307d9..16e4bcc 100644
> > > --- a/arch/x86/mm/hugetlbpage.c
> > > +++ b/arch/x86/mm/hugetlbpage.c
> > > @@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct
> vm_area_struct *svma,
> > >   unsigned long sbase = saddr & PUD_MASK;
> > >   unsigned long s_end = sbase + PUD_SIZE;
> > >  
> > > + /* Allow segments to share if only one is locked */
> > > + unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
> > > + unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
> >                                   svma?
> > 
> 
> /me slaps self
> 
> svma indeed.
> 
> With the patch corrected, I still cannot trigger the bad pmd messages as
> applied so I'm convinced the bug is related to hugetlb pagetable
> sharing and this is more or less the fix. Any opinions?

Yes, you gave a very good analysis, and I agree with you, your patch
does seem to be needed for 2.6.27.N, and the right thing to do there
(though I prefer the way 2.6.28 mlocking skips huge areas completely).

One nit, doesn't really matter, but if I'm not too late: please change
-	/* Allow segments to share if only one is locked */
+	/* Allow segments to share if only one is marked locked */
since locking is such a no-op on hugetlb areas.

Hugetlb pagetable sharing does scare me some nights: it's a very easily
forgotten corner of mm, worrying that we do something so different in
there; but IIRC this is actually the first bug related to it, much to
Ken's credit (and Dave McCracken's).

(I'm glad Kosaki-san noticed the svma before I acked your previous
version!  And I've still got to go back to your VM_MAYSHARE patch:
seems right, but still wondering about the remaining VM_SHAREDs -
will report back later.)

Feel free to add an
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
to your fixed version.

Hugh

> 
> >  - kosaki
> > 
> > > +
> > >   /*
> > >    * match the virtual addresses, permission and the alignment of the
> > >    * page table page.
> > >    */
> > >   if (pmd_index(addr) != pmd_index(saddr) ||
> > > -     vma->vm_flags != svma->vm_flags ||
> > > +     vm_flags != svm_flags ||
> > >       sbase < svma->vm_start || svma->vm_end < s_end)
> > >           return 0;
Comment 24 Anonymous Emailer 2009-05-25 13:17:11 UTC
Reply-To: mel@csn.ul.ie

On Mon, May 25, 2009 at 11:10:11AM +0100, Hugh Dickins wrote:
> On Mon, 25 May 2009, Mel Gorman wrote:
> > On Sun, May 24, 2009 at 10:44:29PM +0900, KOSAKI Motohiro wrote:
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > --- 
> > > >  arch/x86/mm/hugetlbpage.c |    6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > > > index 8f307d9..16e4bcc 100644
> > > > --- a/arch/x86/mm/hugetlbpage.c
> > > > +++ b/arch/x86/mm/hugetlbpage.c
> > > > @@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct
> vm_area_struct *svma,
> > > >         unsigned long sbase = saddr & PUD_MASK;
> > > >         unsigned long s_end = sbase + PUD_SIZE;
> > > >  
> > > > +       /* Allow segments to share if only one is locked */
> > > > +       unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
> > > > +       unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
> > >                                   svma?
> > > 
> > 
> > /me slaps self
> > 
> > svma indeed.
> > 
> > With the patch corrected, I still cannot trigger the bad pmd messages as
> > applied so I'm convinced the bug is related to hugetlb pagetable
> > sharing and this is more or less the fix. Any opinions?
> 
> Yes, you gave a very good analysis, and I agree with you, your patch
> does seem to be needed for 2.6.27.N, and the right thing to do there
> (though I prefer the way 2.6.28 mlocking skips huge areas completely).
> 

I similarly prefer how 2.6.28 simply makes the pages present and then gets
rid of the flag. I was tempted to back-port something similar, but it felt
better to fix where hugetlb was going wrong. Even though it's essentially a
no-op on mainline, I'd like to apply the patch there as well in case there
is ever another change in mlock() with respect to hugetlbfs.

> One nit, doesn't really matter, but if I'm not too late: please change
> -     /* Allow segments to share if only one is locked */
> +     /* Allow segments to share if only one is marked locked */
> since locking is such a no-op on hugetlb areas.
> 

It's not too late and that change makes sense.

> Hugetlb pagetable sharing does scare me some nights: it's a very easily
> forgotten corner of mm, worrying that we do something so different in
> there; but IIRC this is actually the first bug related to it, much to
> Ken's credit (and Dave McCracken's).
> 

I had totally forgotten about it, which is why it took me so long to identify
it as the problem area. I don't remember there ever being a problem with
this area before either.

> (I'm glad Kosaki-san noticed the svma before I acked your previous
> version!  And I've still got to go back to your VM_MAYSHARE patch:
> seems right, but still wondering about the remaining VM_SHAREDs -
> will report back later.)
> 

Thanks.

> Feel free to add an
> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> to your fixed version.
> 

Thanks again Hugh.
