Bug 13302
| Summary: | "bad pmd" on fork() of process with hugepage shared memory segments attached | | |
|---|---|---|---|
| Product: | Memory Management | Reporter: | starlight |
| Component: | Other | Assignee: | Andrew Morton (akpm) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | alan |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.29.1 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Description
starlight
2009-05-13 19:54:09 UTC
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

(Please read this ^^^^ !)

On Wed, 13 May 2009 19:54:10 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13302
>
> Summary: "bad pmd" on fork() of process with hugepage shared
>          memory segments attached
> Kernel Version: 2.6.29.1
> ReportedBy: starlight@binnacle.cx
> Regression: Yes
>
> Kernel reports "bad pmd" errors when process with hugepage
> shared memory segments attached executes fork() system call.
> Using vfork() avoids the issue.
>
> Bug also appears in RHEL5 2.6.18-128.1.6.el5 and causes
> leakage of huge pages.
>
> Bug does not appear in RHEL4 2.6.9-78.0.13.ELsmp.
>
> See bug 12134 for an example of the errors reported
> by 'dmesg'.

Reply-To: mel@csn.ul.ie

On Wed, May 13, 2009 at 01:08:46PM -0700, Andrew Morton wrote:
> > Kernel reports "bad pmd" errors when process with hugepage
> > shared memory segments attached executes fork() system call.
> > Using vfork() avoids the issue.

This seems familiar and I believe it couldn't be reproduced the last time, and then the problem reporter went away. We need a reproduction case, so I modified one of the libhugetlbfs tests to do what I think you described above. However, it does not trigger the problem for me on x86 or x86-64 running 2.6.29.1.

starlight@binnacle.cx, can you try the reproduction steps on your system please? If it reproduces, can you send me your .config please? If it does not reproduce, can you look at the test program and tell me what it's doing differently to your reproduction case?

1. wget http://heanet.dl.sourceforge.net/sourceforge/libhugetlbfs/libhugetlbfs-2.3.tar.gz
2. tar -zxf libhugetlbfs-2.3.tar.gz
3. cd libhugetlbfs-2.3
4. wget http://www.csn.ul.ie/~mel/shm-fork.c (program is below for reference)
5. mv shm-fork.c tests/
6. make
7. ./obj/hugeadm --create-global-mounts
8. ./obj/hugeadm --pool-pages-min 2M:20
   (Adjust pagesize of 2M if necessary. If x86 and not 2M, tell me and send me your .config)
9. ./tests/obj32/shm-fork 10 2

On my two systems, I saw something like

# ./tests/obj32/shm-fork 10 2
Starting testcase "./tests/obj32/shm-fork", pid 3527
Requesting 4194304 bytes for each test
Spawning children glibc_fork:..........glibc_fork
Spawning children glibc_vfork:..........glibc_vfork
Spawning children sys_fork:..........sys_fork
PASS

The test program I used is below and is a modified version of what's in libhugetlbfs. It does not compile standalone. The steps it takes are:

1. Get the hugepage size
2. Call shmget() to create a suitably large shared memory segment
3. Create a requested number of children
4. Each child attaches to the shared memory segment
5. Each child creates a grandchild
6. The child and grandchild write the segment
7. The grandchild exits, and the child waits for the grandchild
8. The child detaches and exits
9. The parent waits for the child to exit

It does this for glibc fork, glibc vfork and a direct call to the system call fork().

Thanks

==== CUT HERE ====

/*
 * libhugetlbfs - Easy use of Linux hugepages
 * Copyright (C) 2005-2006 David Gibson & Adam Litke, IBM Corporation.
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public License
 * as published by the Free Software Foundation; either version 2.1 of
 * the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with this library; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <sys/wait.h>

#include <hugetlbfs.h>
#include "hugetests.h"

#define P "shm-fork"
#define DESC \
	"* Test shared memory behavior when multiple threads are attached  *\n"\
	"* to a segment. A segment is created and then children are        *\n"\
	"* spawned which attach, write, read (verify), and detach from the *\n"\
	"* shared memory segment.                                          *"

extern int errno;

/* Global Configuration */
static int nr_hugepages;
static int numprocs;
static int shmid = -1;

#define MAX_PROCS 200
#define BUF_SZ 256

#define GLIBC_FORK	0
#define GLIBC_VFORK	1
#define SYS_FORK	2
static char *testnames[] = { "glibc_fork", "glibc_vfork", "sys_fork" };

#define CHILD_FAIL(thread, fmt, ...) \
	do { \
		verbose_printf("Thread %d (pid=%d) FAIL: " fmt, \
			       thread, getpid(), __VA_ARGS__); \
		exit(1); \
	} while (0)

void cleanup(void)
{
	remove_shmid(shmid);
}

static void do_child(int thread, unsigned long size, int testtype)
{
	volatile char *shmaddr;
	int j, k;
	int pid, status;

	verbose_printf(".");
	for (j=0; j<5; j++) {
		shmaddr = shmat(shmid, 0, SHM_RND);
		if (shmaddr == MAP_FAILED)
			CHILD_FAIL(thread, "shmat() failed: %s",
				   strerror(errno));

		/* Create even more children to double up the work */
		switch (testtype) {
		case GLIBC_FORK:
			if ((pid = fork()) < 0)
				FAIL("glibc_fork(): %s", strerror(errno));
			break;
		case GLIBC_VFORK:
			if ((pid = vfork()) < 0)
				FAIL("glibc_vfork(): %s", strerror(errno));
			break;
		case SYS_FORK:
			if ((pid = syscall(__NR_fork)) < 0)
				FAIL("sys_fork(): %s", strerror(errno));
			break;
		default:
			FAIL("Test type %d not implemented\n", testtype);
		}

		/* Child and parent access the shared area */
		for (k=0;k<size;k++)
			shmaddr[k] = (char) (k);
		for (k=0;k<size;k++)
			if (shmaddr[k] != (char)k)
				CHILD_FAIL(thread, "Index %d mismatch", k);

		/* Children exit */
		if (pid == 0)
			exit(0);

		/* Parent waits for child and detaches */
		waitpid(pid, &status, 0);

		if (shmdt((const void *)shmaddr) != 0)
			CHILD_FAIL(thread, "shmdt() failed: %s",
				   strerror(errno));
	}
	exit(0);
}

static void do_test(unsigned long size, int testtype)
{
	int wait_list[MAX_PROCS];
	int i;
	int pid, status;
	char *testname = testnames[testtype];

	if ((shmid = shmget(2, size, SHM_HUGETLB|IPC_CREAT|SHM_R|SHM_W)) < 0)
		FAIL("shmget(): %s", strerror(errno));

	verbose_printf("Spawning children %s:", testname);
	for (i=0; i<numprocs; i++) {
		switch (testtype) {
		case GLIBC_FORK:
			if ((pid = fork()) < 0)
				FAIL("glibc_fork(): %s", strerror(errno));
			break;
		case GLIBC_VFORK:
			if ((pid = vfork()) < 0)
				FAIL("glibc_vfork(): %s", strerror(errno));
			break;
		case SYS_FORK:
			if ((pid = syscall(__NR_fork)) < 0)
				FAIL("sys_fork(): %s", strerror(errno));
			break;
		default:
			FAIL("Test type %d not implemented\n", testtype);
		}

		if (pid == 0)
			do_child(i, size, testtype);

		wait_list[i] = pid;
	}

	for (i=0; i<numprocs; i++) {
		waitpid(wait_list[i], &status, 0);
		if (WEXITSTATUS(status) != 0)
			FAIL("Thread %d (pid=%d) failed", i, wait_list[i]);

		if (WIFSIGNALED(status))
			FAIL("Thread %d (pid=%d) received unhandled signal",
			     i, wait_list[i]);
	}
	printf("%s\n", testname);
}

int main(int argc, char ** argv)
{
	unsigned long size;
	long hpage_size;

	test_init(argc, argv);

	if (argc < 3)
		CONFIG("Usage: %s <# procs> <# pages>", argv[0]);

	numprocs = atoi(argv[1]);
	nr_hugepages = atoi(argv[2]);

	if (numprocs > MAX_PROCS)
		CONFIG("Cannot spawn more than %d processes", MAX_PROCS);

	check_hugetlb_shm_group();
	hpage_size = check_hugepagesize();
	size = hpage_size * nr_hugepages;

	verbose_printf("Requesting %lu bytes for each test\n", size);
	do_test(size, GLIBC_FORK);
	do_test(size, GLIBC_VFORK);
	do_test(size, SYS_FORK);
	PASS();
}

Reply-To: mel@csn.ul.ie

On Thu, May 14, 2009 at 11:53:27AM +0100, Mel Gorman wrote:
> starlight@binnacle.cx, can you try the reproduction steps on your system
> please? If it reproduces, can you send me your .config please? If it
> does not reproduce, can you look at the test program and tell me what
> it's doing differently to your reproduction case?

Another question on top of this.

At any point, do you call madvise(MADV_WILLNEED), fadvise(FADV_WILLNEED) or readahead() on the shared memory segment?
At 11:53 AM 5/14/2009 +0100, Mel Gorman wrote:
>starlight@binnacle.cx, can you try the reproduction steps on your system
>please? If it reproduces, can you send me your .config please? If it
>does not reproduce, can you look at the test program and tell me what
>it's doing differently to your reproduction case?

Will try it out, but it has to wait till this weekend.

At 11:59 AM 5/14/2009 +0100, Mel Gorman wrote:
>Another question on top of this.
>
>At any point, do you call madvise(MADV_WILLNEED),
>fadvise(FADV_WILLNEED) or readahead() on the shared memory segment?

Definitely no.

The possibly unusual thing done is that a file is read into something like 30% of the segment, and the remaining pages are not touched.

Reply-To: mel@csn.ul.ie

On Thu, May 14, 2009 at 01:20:09PM -0400, starlight@binnacle.cx wrote:
> Definitely no.
>
> The possibly unusual thing done is that a file is read into
> something like 30% of the segment, and the remaining pages are
> not touched.

Ok, I just tried that there - parent writing 30% of the shared memory before forking but still did not reproduce the problem :(

At 06:49 PM 5/14/2009 +0100, Mel Gorman wrote:
>Ok, I just tried that there - parent writing 30% of the shared memory
>before forking but still did not reproduce the problem :(
Maybe it makes a difference to have lots of RAM (16GB on this
server), and about 1.5 GB of hugepage shared memory allocated in
the forking process in about four segments. Often have all free
memory consumed by the file cache, but I don't believe this is
necessary to produce the problem as it will happen even right
after a reboot. [RHEL5 meminfo attached]
Other possible factors:
daemon is non-root but has explicit
CAP_IPC_LOCK, CAP_NET_RAW, CAP_SYS_NICE set via
'setcap cap_net_raw,cap_ipc_lock,cap_sys_nice+ep daemon'
ulimit -Hl and -Sl are set to <unlimited>
process group is set in /proc/sys/vm/hugetlb_shm_group
/proc/sys/vm/nr_hugepages is set to 2048
daemon has 200 threads at time of fork()
shared memory segments explicitly located [RHEL5 pmap -x attached]
between fork & exec these syscalls are issued (see the sketch after this message):
sched_getscheduler/sched_setscheduler
getpriority/setpriority
seteuid(getuid())
setegid(getgid())
with vfork() work-around, no syscalls are made before exec()
Don't think it's anything specific to the DL160 (Intel E5430)
we have because the DL165 (Opteron 2354) also exhibits the problem.
Will run the test cases provided this weekend for certain and
will let you know if bug is reproduced.
Have to go silent on this till the weekend.
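For reference, a minimal C sketch of the fork-then-exec sequence listed above, reconstructed from the report. The helper binary path and the policy/priority values are hypothetical placeholders; only the order of syscalls is taken from the report. With the vfork() workaround, none of the calls between fork and exec are made.

/* Sketch of the daemon's fork()+exec() path per the list above.
 * Helper path and parameter values are hypothetical. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();	/* the fork() that triggers the "bad pmd" reports */
	if (pid == 0) {
		/* Syscalls the daemon issues between fork and exec */
		struct sched_param sp = { .sched_priority = 0 };
		int policy = sched_getscheduler(0);
		sched_setscheduler(0, policy, &sp);

		int prio = getpriority(PRIO_PROCESS, 0);
		setpriority(PRIO_PROCESS, 0, prio);

		seteuid(getuid());
		setegid(getgid());

		/* hypothetical helper binary */
		execl("/usr/local/bin/helper", "helper", (char *)NULL);
		_exit(127);
	}
	waitpid(pid, NULL, 0);
	return 0;
}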
Whacked at this, attempting to build a testcase from a combination of the original daemon strace in the bug report and knowledge of what the daemon is doing.

What emerged is something that will destroy RHEL5 2.6.18-128.1.6.el5 100% of the time. It completely fills the kernel message log with "bad pmd" errors and wrecks hugepages.

Unfortunately it only occasionally breaks 2.6.29.1. Haven't been able to produce "bad pmd" messages, but did get the kernel to think it's out of large page memory when in theory it was not. Saw a lot of really strange accounting in the hugepage section of /proc/meminfo.

For what it's worth, the testcase code is attached. Note that hugepages=2048 is assumed--the bug seems to require use of more than 50% of large page memory.

Definitely will be posted under the RHEL5 bug report, which is the more pressing issue here than far-future kernel support.

In addition, the original segment attach bug http://bugzilla.kernel.org/show_bug.cgi?id=12134 is still there and can be reproduced every time with the 'create_seg_strace' and 'access_seg_straceX' sequences.

Reply-To: mel@csn.ul.ie

On Fri, May 15, 2009 at 01:32:38AM -0400, starlight@binnacle.cx wrote:
> What emerged is something that will destroy RHEL5
> 2.6.18-128.1.6.el5 100% of the time. It completely fills the
> kernel message log with "bad pmd" errors and wrecks hugepages.

Ok, I can confirm that more or less. I reproduced the problem on 2.6.18-92.el5 on x86-64 running RHEL 5.2. I didn't have access to a machine with enough memory though, so I dropped the requirements slightly. It still triggered a failure.

However, when I ran 2.6.18, 2.6.19 and 2.6.29.1 on the same machine, I could not reproduce the problem, nor could I cause hugepages to leak, so I'm leaning towards believing this is a distribution bug at the moment.

On the plus side, due to your good work, there is hopefully enough available for them to bisect this problem.

> Unfortunately it only occasionally breaks 2.6.29.1. Haven't
> been able to produce "bad pmd" messages, but did get the
> kernel to think it's out of large page memory when in
> theory it was not. Saw a lot of really strange accounting
> in the hugepage section of /proc/meminfo.

What sort of strange accounting? The accounting has changed since 2.6.18, so I want to be sure you're really seeing something weird. When I was testing, I didn't see anything out of the ordinary, but maybe I'm looking in a different place.

> For what it's worth, the testcase code is attached.

I cleaned the test up a bit and wrote a wrapper script to run it multiple times while checking for hugepage leaks. I have it running in a loop while the machine runs sysbench as a stress test, to see if I can cause anything out of the ordinary to happen. Nothing so far though.

> Definitely will be posted under the RHEL5 bug report, which is
> the more pressing issue here than far-future kernel support.

If you've filed a RedHat bug, this modified testcase and wrapper script might help them. The program exits and cleans up after itself, and the memory requirements are lower. The script sets the machine up in a way that breaks for me, where the breakage is bad pmd messages and hugepages leaking.
At 03:55 PM 5/15/2009 +0100, Mel Gorman wrote:
>Ok, I can confirm that more or less. I reproduced the problem on
>2.6.18-92.el5 on x86-64 running RHEL 5.2.

Good to hear that the testcase works on other machines.

>What sort of strange accounting? The accounting has changed
>since 2.6.18, so I want to be sure you're really seeing something
>weird.

Saw things like both free and used set to zero, and used set to 2048 when it should not have been (in association with the failure). Often the counters would correct themselves after segments were removed with 'ipcs'. Sometimes not--usually when it broke. Also saw some truly insane usage counts like 32520, and less egregious off-by-one-or-two inaccuracies.

>If you've filed a RedHat bug, this modified testcase and wrapper
>script might help them.

Thank you for your efforts. Could you post to the RH bug along with a back-reference to this one? It might improve the chances someone will pay attention to it. It's at

https://bugzilla.redhat.com/show_bug.cgi?id=497653

In a week or two I'll see if I can make time to turn the 100% failure scenario into a testcase. This is just the run of a segment loader followed by running a status checker three times. On 2.6.29.1 I'm wondering if the "bad pmd" I saw was just a bit of bad memory, so might as well focus on the thing that fails with certainty. Possibly the "bad pmd" case requires a few hours of live data runtime before it emerges--a tougher nut.
This was really bugging me, so I hacked out the test case for the attach failure.

It hoses 2.6.29.1 100% of the time. Run it like this:

tcbm_att
tcbm_att -
tcbm_att -
tcbm_att -

It will break on the last iteration with ENOMEM, and ENOMEM is all any shmget() or shmat() call gets forever more.

After removing the segments this appears, even though no segments show in 'ipcs':

HugePages_Total:  2048
HugePages_Free:   2048
HugePages_Rsvd:   1280
HugePages_Surp:      0

Here's another possible clue: I tried the first 'tcbm' testcase on a 2.6.27.7 kernel that was hanging around from a few months ago, and it breaks it 100% of the time. Completely hoses huge memory. Enough "bad pmd" errors to fill the kernel log.

Reply-To: mel@csn.ul.ie

On Fri, May 15, 2009 at 02:44:29PM -0400, starlight@binnacle.cx wrote:
> It will break on the last iteration with ENOMEM, and ENOMEM is all
> any shmget() or shmat() call gets forever more.
>
> After removing the segments this appears, even though no segments
> show in 'ipcs':
>
> HugePages_Total:  2048
> HugePages_Free:   2048
> HugePages_Rsvd:   1280
> HugePages_Surp:      0

Ok, the critical fact was that one process mapped the segment read-write and populated it. Each subsequent process mapped it read-only. The core VM sets VM_SHARED for file-backed shared read-write mappings, but not for file-backed shared read-only mappings. Hugetlbfs was confused in how it used VM_SHARED, because the flag was being used to check whether the mapping was MAP_SHARED. A straightforward mistake, with the consequence that reservations "leaked" and future mappings failed as a result.
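To make the trigger concrete, here is a minimal C sketch of that pattern. It is a hypothetical stand-in for the attached tcbm_att program, not the original testcase; the IPC key and segment size are made up.

/* One run with "create" maps the segment read-write and populates it;
 * subsequent runs attach the same segment read-only, so their VMAs get
 * VM_MAYSHARE but not VM_SHARED.  Key and size are hypothetical. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SEG_KEY  0x1234			/* hypothetical IPC key */
#define SEG_SIZE (256UL << 20)		/* 256MB: 128 huge pages of 2MB */

int main(int argc, char **argv)
{
	int create = (argc > 1 && strcmp(argv[1], "create") == 0);
	int shmid = shmget(SEG_KEY, SEG_SIZE,
			   SHM_HUGETLB | (create ? IPC_CREAT | SHM_R | SHM_W : 0));
	if (shmid < 0) {
		perror("shmget");	/* ENOMEM here once reservations leak */
		return 1;
	}

	char *p = shmat(shmid, NULL, create ? 0 : SHM_RDONLY);
	if (p == (void *)-1) {
		perror("shmat");
		return 1;
	}

	if (create)
		memset(p, 0xaa, SEG_SIZE);	/* populate every page read-write */
	else
		for (unsigned long off = 0; off < SEG_SIZE; off += 4096)
			(void)*(volatile char *)(p + off);	/* read-only touch */

	shmdt(p);
	return 0;
}

On an affected kernel, one "create" run followed by a few read-only attach runs should leave HugePages_Rsvd elevated after the segment is removed, matching the meminfo output above.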
Can you try this patch out please? It is against 2.6.29.1 and mostly applies to 2.6.27.7. The reject is trivially resolved by editing mm/hugetlb.c and changing the VM_SHARED at the end of hugetlb_reserve_pages() to VM_MAYSHARE.

Thing is, this patch fixes a reservation issue. The bad pmd messages do show up for the original test on 2.6.27.7 for x86-64 (not x86), but it's a separate issue and I have not determined what it is yet. Can you test this patch to begin with please?

==== CUT HERE ====

Account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in hugetlbfs

hugetlbfs reserves huge pages and accounts for them differently depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. However, the check it makes against the VMA in some places is VM_SHARED and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the mapping is MAP_SHARED *and* it is read-write. If a shared memory mapping was created read-write for populating of data and mapped read-only by other processes, then hugetlbfs gets the accounting wrong and reservations leak.

This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when the intent of the code was to check whether the VMA was mapped MAP_SHARED or MAP_PRIVATE.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/hugetlb.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28c655b..e83ad2c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -316,7 +316,7 @@ static void resv_map_release(struct kref *ref)
 static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	if (!(vma->vm_flags & VM_SHARED))
+	if (!(vma->vm_flags & VM_MAYSHARE))
 		return (struct resv_map *)(get_vma_private_data(vma) &
 							~HPAGE_RESV_MASK);
 	return NULL;
@@ -325,7 +325,7 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	VM_BUG_ON(vma->vm_flags & VM_SHARED);
+	VM_BUG_ON(vma->vm_flags & VM_MAYSHARE);

 	set_vma_private_data(vma, (get_vma_private_data(vma) &
 				HPAGE_RESV_MASK) | (unsigned long)map);
@@ -334,7 +334,7 @@ static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
 static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	VM_BUG_ON(vma->vm_flags & VM_SHARED);
+	VM_BUG_ON(vma->vm_flags & VM_MAYSHARE);

 	set_vma_private_data(vma, get_vma_private_data(vma) | flags);
 }
@@ -353,7 +353,7 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
 	if (vma->vm_flags & VM_NORESERVE)
 		return;

-	if (vma->vm_flags & VM_SHARED) {
+	if (vma->vm_flags & VM_MAYSHARE) {
 		/* Shared mappings always use reserves */
 		h->resv_huge_pages--;
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
@@ -369,14 +369,14 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	if (!(vma->vm_flags & VM_SHARED))
+	if (!(vma->vm_flags & VM_MAYSHARE))
 		vma->vm_private_data = (void *)0;
 }

 /* Returns true if the VMA has associated reserve pages */
 static int vma_has_reserves(struct vm_area_struct *vma)
 {
-	if (vma->vm_flags & VM_SHARED)
+	if (vma->vm_flags & VM_MAYSHARE)
 		return 1;
 	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		return 1;
@@ -924,7 +924,7 @@ static long vma_needs_reservation(struct hstate *h,
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;

-	if (vma->vm_flags & VM_SHARED) {
+	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		return region_chg(&inode->i_mapping->private_list,
 							idx, idx + 1);
@@ -949,7 +949,7 @@ static void vma_commit_reservation(struct hstate *h,
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;

-	if (vma->vm_flags & VM_SHARED) {
+	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		region_add(&inode->i_mapping->private_list, idx, idx + 1);
@@ -1893,7 +1893,7 @@ retry_avoidcopy:
 	 * at the time of fork() could consume its reserves on COW instead
 	 * of the full address range.
 	 */
-	if (!(vma->vm_flags & VM_SHARED) &&
+	if (!(vma->vm_flags & VM_MAYSHARE) &&
 			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
 			old_page != pagecache_page)
 		outside_reserve = 1;
@@ -2000,7 +2000,7 @@ retry:
 		clear_huge_page(page, address, huge_page_size(h));
 		__SetPageUptodate(page);

-		if (vma->vm_flags & VM_SHARED) {
+		if (vma->vm_flags & VM_MAYSHARE) {
 			int err;
 			struct inode *inode = mapping->host;
@@ -2104,7 +2104,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto out_mutex;
 		}

-		if (!(vma->vm_flags & VM_SHARED))
+		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
 								vma, address);
 	}
@@ -2289,7 +2289,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * to reserve the full area even if read-only as mprotect() may be
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
-	if (!vma || vma->vm_flags & VM_SHARED)
+	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		chg = region_chg(&inode->i_mapping->private_list, from, to);
 	else {
 		struct resv_map *resv_map = resv_map_alloc();
@@ -2330,7 +2330,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * consumed reservations are stored in the map. Hence, nothing
 	 * else has to be done for private mappings here
 	 */
-	if (!vma || vma->vm_flags & VM_SHARED)
+	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		region_add(&inode->i_mapping->private_list, from, to);
 	return 0;
 }

Reply-To: mel@csn.ul.ie

On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> Here's another possible clue:
>
> I tried the first 'tcbm' testcase on a 2.6.27.7
> kernel that was hanging around from a few months
> ago, and it breaks it 100% of the time.
>
> Completely hoses huge memory. Enough "bad pmd"
> errors to fill the kernel log.

So I investigated what's wrong with 2.6.27.7. The problem is a race between exec() and the handling of mlock()ed VMAs, but I can't see where. The normal teardown of pages is applied to a shared memory segment as if VM_HUGETLB was not set.

This was fixed between 2.6.27 and 2.6.28, but apparently by accident during the introduction of CONFIG_UNEVICTABLE_LRU. This patchset made a number of changes to how mlock()ed pages are handled, but I didn't spot which was the relevant change that fixed the problem, and reverse bisecting didn't help. I've added two people that were working on the unevictable LRU patches to see if they spot something.

For context, the two attached files are used to reproduce a problem where bad pmd messages are scribbled all over the console on 2.6.27.7. Do something like

echo 64 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs none /mnt
sh ./test-tcbm.sh

I did confirm that it didn't matter to 2.6.29.1 if CONFIG_UNEVICTABLE_LRU is set or not. It's possible the race is still there, but I don't know where it is.

Any ideas where the race might be?
Reply-To: mel@csn.ul.ie

On Wed, May 20, 2009 at 12:35:25PM +0100, Mel Gorman wrote:
> This was fixed between 2.6.27 and 2.6.28, but apparently by accident
> during the introduction of CONFIG_UNEVICTABLE_LRU. This patchset made
> a number of changes to how mlock()ed pages are handled, but I didn't
> spot which was the relevant change that fixed the problem, and reverse
> bisecting didn't help.

With all the grace of a drunken elephant in a china shop, I gave up on being clever as it wasn't working and brute-force attacked this to make a list of the commits needed for CONFIG_UNEVICTABLE_LRU on top of 2.6.27.7. This is the list:

# Prereq commits for UNEVICT patches to apply
b69408e88bd86b98feb7b9a38fd865e1ddb29827 vmscan: Use an indexed array for LRU variables
62695a84eb8f2e718bf4dfb21700afaa7a08e0ea vmscan: move isolate_lru_page() to vmscan.c
f04e9ebbe4909f9a41efd55149bc353299f4e83b swap: use an array for the LRU pagevecs
68a22394c286a2daf06ee8d65d8835f738faefa5 vmscan: free swap space on swap-in/activation
b2e185384f534781fd22f5ce170b2ad26f97df70 define page_file_cache() function
4f98a2fee8acdb4ac84545df98cccecfd130f8db vmscan: split LRU lists into anon & file sets
556adecba110bf5f1db6c6b56416cfab5bcab698 vmscan: second chance replacement for anonymous pages
7e9cd484204f9e5b316ed35b241abf088d76e0af vmscan: fix pagecache reclaim referenced bit check
33c120ed2843090e2bd316de1588b8bf8b96cbde more aggressively use lumpy reclaim

# Part 1: Initial patches for UNEVICTABLE_LRU
8a7a8544a4f6554ec2d8048ac9f9672f442db5a2 pageflag helpers for configed-out flags
894bc310419ac95f4fa4142dc364401a7e607f65 Unevictable LRU Infrastructure
bbfd28eee9fbd73e780b19beb3dc562befbb94fa unevictable lru: add event counting with statistics
7b854121eb3e5ba0241882ff939e2c485228c9c5 Unevictable LRU Page Statistics
ba9ddf49391645e6bb93219131a40446538a5e76 Ramfs and Ram Disk pages are unevictable
89e004ea55abe201b29e2d6e35124101f1288ef7 SHM_LOCKED pages are unevictable

# Part 2: Critical patch that makes the problem go away
b291f000393f5a0b679012b39d79fbc85c018233 mlock: mlocked pages are unevictable

# Part 3: Rest of UNEVICTABLE_LRU
fa07e787733416c42938a310a8e717295934e33c doc: unevictable LRU and mlocked pages doc
8edb08caf68184fb170f4f69c7445929e199eaea mlock: downgrade mmap sem while populating mlocked regions
ba470de43188cdbff795b5da43a1474523c6c2fb mmap: handle mlocked pages during map, remap, unmap
5344b7e648980cc2ca613ec03a56a8222ff48820 vmstat: mlocked pages statistics
64d6519dda3905dfb94d3f93c07c5f263f41813f swap: cull unevictable pages in fault path
af936a1606246a10c145feac3770f6287f483f02 vmscan: unevictable LRU scan sysctl
985737cf2ea096ea946aed82c7484d40defc71a8 mlock: count attempts to free mlocked page
902d2e8ae0de29f483840ba1134af27343b9564d vmscan: kill unused lru functions
e0f79b8f1f3394bb344b7b83d6f121ac2af327de vmscan: don't accumulate scan pressure on unrelated lists
c11d69d8c830e09a0e7b3935c952afb26c48bba8 mlock: revert mainline handling of mlock error return
9978ad583e100945b74e4f33e73317983ea32df9 mlock: make mlock error return Posixly Correct

I won't get the chance to start picking apart b291f000393f5a0b679012b39d79fbc85c018233 to see what's so special in there until Friday, but maybe someone else will spot the magic before I do. Again, it does not matter if UNEVICTABLE_LRU is set or not once that critical patch is applied.

For what it's worth, this bug affects the SLES 11 kernel, which is based on 2.6.27. I imagine they'd like to have this fixed but may not be so keen on applying so many patches.

On Wed, 2009-05-20 at 12:35 +0100, Mel Gorman wrote:
> So I investigated what's wrong with 2.6.27.7. The problem is a race between
> exec() and the handling of mlock()ed VMAs, but I can't see where. The normal
> teardown of pages is applied to a shared memory segment as if VM_HUGETLB
> was not set.

Hi, Mel:

The basic change to hugepage handling with the unevictable lru patches is that we no longer keep a huge page vma marked with VM_LOCKED. So, at exit time, there is no record that this is a vmlocked vma.

A bit of context: before the unevictable lru, mlock() or mmap(MAP_LOCKED) would just set the VM_LOCKED flag and "make_pages_present()" for all but a few vma types. We've always excluded those that get_user_pages() can't handle and still do. With the unevictable lru, mlock()/mmap('LOCKED) now move the mlocked pages to the unevictable lru list, and munlock, including at exit, must rescue them from the unevictable list. Since hugepages are not maintained on the lru and don't get reclaimed, we don't want to move them to the unevictable list. However, we still want to populate the page tables. So, we still call [_]mlock_vma_pages_range() for hugepage vmas, but after making the pages present to preserve prior behavior, we remove the VM_LOCKED flag from the vma.

This may have resulted in the apparent fix to the subject problem in 2.6.28...

> Any ideas where the race might be?
No, sorry. Haven't had time to investigate this.

Lee

Reply-To: mel@csn.ul.ie

On Wed, May 20, 2009 at 11:05:15AM -0400, Lee Schermerhorn wrote:
> The basic change to hugepage handling with the unevictable lru
> patches is that we no longer keep a huge page vma marked with
> VM_LOCKED. So, at exit time, there is no record that this is a
> vmlocked vma.

Basic, and in this case, apparently the critical factor. This patch on 2.6.27.7 makes the problem disappear as well by never setting VM_LOCKED on hugetlb-backed VMAs. Obviously, it's a hatchet job and almost certainly the wrong fix, but it indicates that the handling of VM_LOCKED && VM_HUGETLB is wrong somewhere. Now I have a better idea what to search for on Friday. Thanks Lee.

--- mm/mlock.c	2009-05-20 16:36:08.000000000 +0100
+++ mm/mlock-new.c	2009-05-20 16:28:17.000000000 +0100
@@ -64,7 +64,8 @@
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, make_pages_present below will bring it back.
 	 */
-	vma->vm_flags = newflags;
+	if (!(vma->vm_flags & VM_HUGETLB))
+		vma->vm_flags = newflags;

 	/*
 	 * Keep track of amount of locked VM.

Hi

> This patch on 2.6.27.7 makes the problem disappear as well by never
> setting VM_LOCKED on hugetlb-backed VMAs.
>
> +	if (!(vma->vm_flags & VM_HUGETLB))

this condition meaning isn't so obvious to me. could you please consider comment adding?

> +		vma->vm_flags = newflags;
Reply-To: mel@csn.ul.ie

On Thu, May 21, 2009 at 09:41:46AM +0900, KOSAKI Motohiro wrote:
> > +	if (!(vma->vm_flags & VM_HUGETLB))
>
> this condition meaning isn't so obvious to me. could you please
> consider comment adding?

I should have used the helper, but anyway, the check was to see if the VMA was backed by hugetlbfs or not.

This wasn't the right fix. It was only intended to show that it was something to do with the VM_LOCKED flag. The real problem has something to do with pagetable-sharing of hugetlb-backed segments. After fork(), the VM_LOCKED gets cleared, so when huge_pmd_share() is called, some of the pagetables are shared and others are not. I believe this is resulting in pagetables being freed prematurely. I'm cc'ing the author and acks of the pagetable-sharing patch to see if they can shed more light on whether this is the right patch or not. Kenneth, Hugh?

==== CUT HERE ====

x86: Ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not

On x86 and x86-64, it is possible that page tables are shared between shared mappings backed by hugetlbfs. As part of this, page_table_shareable() checks a pair of vma->vm_flags, and they must match if they are to be shared. All VMA flags are taken into account, including VM_LOCKED.

The problem is that VM_LOCKED is cleared on fork(). When a process with a shared memory segment forks() to exec() a helper, there will be shared VMAs with different flags. The impact is that the shared segment is sometimes considered shareable and other times not, depending on what process is checking.

A test process that forks and execs heavily can trigger a number of "bad pmd" messages appearing in the kernel log and hugepages being leaked. I believe what happens is that the segment page tables are being shared, but the count is inaccurate depending on the ordering of events.

Strictly speaking, this affects mainline, but the problem is masked by the changes made for CONFIG_UNEVICTABLE_LRU, as the kernel now never has VM_LOCKED set for hugetlbfs-backed mappings. This does affect the stable branch of 2.6.27 and distributions based on that kernel such as SLES 11.

This patch addresses the problem by comparing all flags except VM_LOCKED when deciding if pagetables should be shared or not for hugetlbfs-backed mappings.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 arch/x86/mm/hugetlbpage.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8f307d9..16e4bcc 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	unsigned long sbase = saddr & PUD_MASK;
 	unsigned long s_end = sbase + PUD_SIZE;

+	/* Allow segments to share if only one is locked */
+	unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
+	unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
+
 	/*
 	 * match the virtual addresses, permission and the alignment of the
 	 * page table page.
 	 */
 	if (pmd_index(addr) != pmd_index(saddr) ||
-	    vma->vm_flags != svma->vm_flags ||
+	    vm_flags != svm_flags ||
 	    sbase < svma->vm_start || svma->vm_end < s_end)
 		return 0;

Reply-To: mel@csn.ul.ie

On Sun, May 24, 2009 at 10:44:29PM +0900, KOSAKI Motohiro wrote:
> > +	unsigned long svm_flags = vma->vm_flags & ~VM_LOCKED;
>
> svma?
>
> - kosaki

/me slaps self

svma indeed. With the patch corrected and applied, I still cannot trigger the bad pmd messages, so I'm convinced the bug is related to hugetlb pagetable sharing and this is more or less the fix. Any opinions?
Reply-To: hugh.dickins@tiscali.co.uk

On Mon, 25 May 2009, Mel Gorman wrote:
> With the patch corrected and applied, I still cannot trigger the bad
> pmd messages, so I'm convinced the bug is related to hugetlb pagetable
> sharing and this is more or less the fix. Any opinions?

Yes, you gave a very good analysis, and I agree with you: your patch does seem to be needed for 2.6.27.N, and the right thing to do there (though I prefer the way 2.6.28 mlocking skips huge areas completely).

One nit, doesn't really matter, but if I'm not too late: please change

-	/* Allow segments to share if only one is locked */
+	/* Allow segments to share if only one is marked locked */

since locking is such a no-op on hugetlb areas.

Hugetlb pagetable sharing does scare me some nights: it's a very easily forgotten corner of mm, worrying that we do something so different in there; but IIRC this is actually the first bug related to it, much to Ken's credit (and Dave McCracken's).

(I'm glad Kosaki-san noticed the svma before I acked your previous version! And I've still got to go back to your VM_MAYSHARE patch: seems right, but still wondering about the remaining VM_SHAREDs - will report back later.)

Feel free to add an
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
to your fixed version.

Hugh

Reply-To: mel@csn.ul.ie

On Mon, May 25, 2009 at 11:10:11AM +0100, Hugh Dickins wrote:
> Yes, you gave a very good analysis, and I agree with you: your patch
> does seem to be needed for 2.6.27.N, and the right thing to do there
> (though I prefer the way 2.6.28 mlocking skips huge areas completely).

I similarly prefer how 2.6.28 simply makes the pages present and then gets rid of the flag. I was tempted to back-port something similar, but it felt better to fix where hugetlb was going wrong. Even though it's essentially a no-op on mainline, I'd like to apply the patch there as well in case there is ever another change in mlock() with respect to hugetlbfs.

> One nit, doesn't really matter, but if I'm not too late: please change
> -	/* Allow segments to share if only one is locked */
> +	/* Allow segments to share if only one is marked locked */
> since locking is such a no-op on hugetlb areas.

It's not too late and that change makes sense.

> Hugetlb pagetable sharing does scare me some nights: it's a very easily
> forgotten corner of mm, worrying that we do something so different in
> there; but IIRC this is actually the first bug related to it, much to
> Ken's credit (and Dave McCracken's).

I had totally forgotten about it, which is why it took me so long to identify it as the problem area. I don't remember there ever being a problem with this area either.

> (I'm glad Kosaki-san noticed the svma before I acked your previous
> version! And I've still got to go back to your VM_MAYSHARE patch:
> seems right, but still wondering about the remaining VM_SHAREDs -
> will report back later.)

Thanks.

> Feel free to add an
> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> to your fixed version.

Thanks again Hugh.