Created attachment 54662 [details] mprotect test program Description of problem: see attached test program. This program can be run like this: ./mprotect <times> <length> <flag> times: how many times the mprotect() function execute; length: same as the "length" option in mprotect() function; flag: when flag is set to 1, the program would touch every page within the range [addr, addr+length-1] before it calls mprotect(). to reproduce the stuck, execute: ./mprotect 50 128 1 Note that the stuck only happens when the following conditions are all satisfied: flag == 1, i.e. touch page before mprotect() proto = PROT_NONE in mprotect() THP is enabled with "always" option Version-Release number of selected component (if applicable): Linux version 2.6.39-rc3 (caspar@caspar-gentoo) (gcc version 4.5.2 (Gentoo 4.5.2 p1.0, pie-0.4.5) ) #1 SMP Tue Apr 19 12:32:20 CST 2011 How reproducible: very often Actual results: test program got stuck when touching pages + THP always enabled: caspar-gentoo tmp # echo always > /sys/kernel/mm/transparent_hugepage/enabled caspar-gentoo tmp # ./mprotect 50 128 1 ^C <- stuck caspar-gentoo tmp # ./mprotect 50 128 1 ^C caspar-gentoo tmp # ./mprotect 50 128 1 ^C caspar-gentoo tmp # ./mprotect 50 128 1 ^C caspar-gentoo tmp # echo madvise > /sys/kernel/mm/transparent_hugepage/enabled caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # echo never > /sys/kernel/mm/transparent_hugepage/enabled caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # ./mprotect 50 128 1 done caspar-gentoo tmp # ./mprotect 50 128 1 Expected results: test program exit normally Additional info: This reproducer was similar to a test program in upstream test suite: libMicro (http://hub.opensolaris.org/bin/view/Project+libmicro/) strace ouput: # strace ./mprotect 50 128 1 execve("./mprotect", ["./mprotect", "50", "128", "1"], [/* 35 vars */]) = 0 brk(0) = 0x16ad000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa0c266000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=163815, ...}) = 0 mmap(NULL, 163815, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffa0c23e000 close(3) = 0 open("/lib64/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\357\1\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1608912, ...}) = 0 mmap(NULL, 3718152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffa0bcbc000 mprotect(0x7ffa0be3f000, 2093056, PROT_NONE) = 0 mmap(0x7ffa0c03e000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x182000) = 0x7ffa0c03e000 mmap(0x7ffa0c043000, 19464, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffa0c043000 close(3) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa0c23d000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa0c23c000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa0c23b000 arch_prctl(ARCH_SET_FS, 0x7ffa0c23c700) = 0 mprotect(0x7ffa0c03e000, 16384, PROT_READ) = 0 mprotect(0x600000, 4096, PROT_READ) = 0 mprotect(0x7ffa0c267000, 4096, PROT_READ) = 0 munmap(0x7ffa0c23e000, 163815) = 0 open("/dev/zero", O_RDWR) = 3 mmap(NULL, 6553600, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x7ffa0b67c000 mprotect(0x7ffa0b67c000, 131072, PROT_NONE) = 0 mprotect(0x7ffa0b69c000, 131072, PROT_NONE) = 0 mprotect(0x7ffa0b6bc000, 131072, PROT_NONE) = 0 mprotect(0x7ffa0b6dc000, 131072, PROT_NONE) = 0 mprotect(0x7ffa0b6fc000, 131072, PROT_NONE) = 0 mprotect(0x7ffa0b71c000, 131072, PROT_NONE) = 0 mprotect(0x7ffa0b73c000, 131072, PROT_NONE) = 0 mprotect(0x7ffa0b75c000, 131072, PROT_NONE) = 0 <repeated random times, snip>
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Tue, 19 Apr 2011 05:25:41 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=33682 > > Summary: mprotect got stuck when THP is "always" enabled > Product: Memory Management > Version: 2.5 > Kernel Version: 2.6.38-r1 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > AssignedTo: akpm@linux-foundation.org > ReportedBy: bugs@casparzhang.com > Regression: No > > > Created an attachment (id=54662) > --> (https://bugzilla.kernel.org/attachment.cgi?id=54662) > mprotect test program > > Description of problem: > > see attached test program. This program can be run like this: > > ./mprotect <times> <length> <flag> > > times: how many times the mprotect() function execute; > length: same as the "length" option in mprotect() function; > flag: when flag is set to 1, the program would touch every page within the > range [addr, addr+length-1] before it calls mprotect(). > > to reproduce the stuck, execute: ./mprotect 50 128 1 > > Note that the stuck only happens when the following conditions are all > satisfied: > > flag == 1, i.e. touch page before mprotect() > proto = PROT_NONE in mprotect() > THP is enabled with "always" option > > Version-Release number of selected component (if applicable): > > Linux version 2.6.39-rc3 (caspar@caspar-gentoo) (gcc version 4.5.2 (Gentoo > 4.5.2 p1.0, pie-0.4.5) ) #1 SMP Tue Apr 19 12:32:20 CST 2011 > > How reproducible: > very often > > Actual results: > test program got stuck when touching pages + THP always enabled: > > caspar-gentoo tmp # echo always > /sys/kernel/mm/transparent_hugepage/enabled > caspar-gentoo tmp # ./mprotect 50 128 1 > ^C <- stuck > caspar-gentoo tmp # ./mprotect 50 128 1 > ^C > caspar-gentoo tmp # ./mprotect 50 128 1 > ^C > caspar-gentoo tmp # ./mprotect 50 128 1 > ^C > caspar-gentoo tmp # echo madvise > > /sys/kernel/mm/transparent_hugepage/enabled > caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # echo never > > /sys/kernel/mm/transparent_hugepage/enabled > caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # ./mprotect 50 128 1 > done caspar-gentoo tmp # ./mprotect 50 128 1 > > Expected results: > test program exit normally > > Additional info: > > This reproducer was similar to a test program in upstream test suite: > libMicro > (http://hub.opensolaris.org/bin/view/Project+libmicro/) > > strace ouput: > > # strace ./mprotect 50 128 1 > execve("./mprotect", ["./mprotect", "50", "128", "1"], [/* 35 vars */]) = 0 > brk(0) = 0x16ad000 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = > 0x7ffa0c266000 > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or > directory) > open("/etc/ld.so.cache", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=163815, ...}) = 0 > mmap(NULL, 163815, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffa0c23e000 > close(3) = 0 > open("/lib64/libc.so.6", O_RDONLY) = 3 > read(3, > "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\357\1\0\0\0\0\0"..., 832) > = > 832 > fstat(3, {st_mode=S_IFREG|0755, st_size=1608912, ...}) = 0 > mmap(NULL, 3718152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = > 0x7ffa0bcbc000 > mprotect(0x7ffa0be3f000, 2093056, PROT_NONE) = 0 > mmap(0x7ffa0c03e000, 20480, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x182000) = 0x7ffa0c03e000 > mmap(0x7ffa0c043000, 19464, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffa0c043000 > close(3) = 0 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = > 0x7ffa0c23d000 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = > 0x7ffa0c23c000 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = > 0x7ffa0c23b000 > arch_prctl(ARCH_SET_FS, 0x7ffa0c23c700) = 0 > mprotect(0x7ffa0c03e000, 16384, PROT_READ) = 0 > mprotect(0x600000, 4096, PROT_READ) = 0 > mprotect(0x7ffa0c267000, 4096, PROT_READ) = 0 > munmap(0x7ffa0c23e000, 163815) = 0 > open("/dev/zero", O_RDWR) = 3 > mmap(NULL, 6553600, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x7ffa0b67c000 > mprotect(0x7ffa0b67c000, 131072, PROT_NONE) = 0 > mprotect(0x7ffa0b69c000, 131072, PROT_NONE) = 0 > mprotect(0x7ffa0b6bc000, 131072, PROT_NONE) = 0 > mprotect(0x7ffa0b6dc000, 131072, PROT_NONE) = 0 > mprotect(0x7ffa0b6fc000, 131072, PROT_NONE) = 0 > mprotect(0x7ffa0b71c000, 131072, PROT_NONE) = 0 > mprotect(0x7ffa0b73c000, 131072, PROT_NONE) = 0 > mprotect(0x7ffa0b75c000, 131072, PROT_NONE) = 0 > <repeated random times, snip> >
Reply-To: aarcange@redhat.com On Mon, Apr 18, 2011 at 11:06:51PM -0700, Andrew Morton wrote: > > (switched to email. Please respond via emailed reply-to-all, not via the > bugzilla web interface). Ok. /dev/zero mapped with MAP_PRIVATE is special as it doesn't set vm_ops but is vm_file backed. The page fault code only checks the vm_ops being NULL to decide if to use THP or not (it doesn't check vm_file). But then the huge_memory.c code also checks vm_file (like in vma_adjust_trans_huge). So I think what's going on in the special case of /dev/zero MAP_PRIVATE is that we could have THP enabled on a filebacked vma (but no vm_ops backed) that doesn't fully enable all operations necessary to be safe like vma_adjust_trans_huge because vm_file being set. To let /dev/zero to use THP safely I guess we can remove the vm_file checks and only use the vm_ops checks across huge_memory.c/huge_mm.h or to add add "!vma->vm_ops && !vma->vm_file" to the page fault so THP isn't used. The former is more risky than adding the vm_file check in the page fault path but it'd match more closely what do_anonymous_page does, so I'm thinking what are the implications of removing vm_file across the huge_memory.c/huge_mm.h. I didn't manage to reboot into a fixed kernel to test yet... I'll let you know the results after some testing.
Reply-To: aarcange@redhat.com Hi, this should fix bug https://bugzilla.kernel.org/show_bug.cgi?id=33682 . ==== Subject: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups From: Andrea Arcangeli <aarcange@redhat.com> The huge_memory.c THP page fault was allowed to run if vm_ops was null (which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't setup a special vma->vm_ops and it would fallback to regular anonymous memory) but other THP logics weren't fully activated for vmas with vm_file not NULL (/dev/zero has a not NULL vma->vm_file). So this removes the vm_file checks so that /dev/zero also can safely use THP (the other albeit safer approach to fix this bug would have been to prevent the THP initial page fault to run if vm_file was set). After removing the vm_file checks, this also makes huge_memory.c stricter in khugepaged for the DEBUG_VM=y case. It doesn't replace the vm_file check with a is_pfn_mapping check (but it keeps checking for VM_PFNMAP under VM_BUG_ON) because for a is_cow_mapping() mapping VM_PFNMAP should only be allowed to exist before the first page fault, and in turn when vma->anon_vma is null (so preventing khugepaged registration). So I tend to think the previous comment saying if vm_file was set, VM_PFNMAP might have been set and we could still be registered in khugepaged (despite anon_vma was not NULL to be registered in khugepaged) was too paranoid. The is_linear_pfn_mapping check is also I think superfluous (as described by comment) but under DEBUG_VM it is safe to stay. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index df29c8f..8847c8c 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -117,7 +117,7 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long end, long adjust_next) { - if (!vma->anon_vma || vma->vm_ops || vma->vm_file) + if (!vma->anon_vma || vma->vm_ops) return; __vma_adjust_trans_huge(vma, start, end, adjust_next); } diff --git a/include/linux/mm.h b/include/linux/mm.h index 692dbae..2348db2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -137,7 +137,8 @@ extern unsigned int kobjsize(const void *objp); #define VM_RandomReadHint(v) ((v)->vm_flags & VM_RAND_READ) /* - * special vmas that are non-mergable, non-mlock()able + * Special vmas that are non-mergable, non-mlock()able. + * Note: mm/huge_memory.c VM_NO_THP depends on this definition. */ #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 470dcda..83326ad 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1408,6 +1408,9 @@ out: return ret; } +#define VM_NO_THP (VM_SPECIAL|VM_INSERTPAGE|VM_MIXEDMAP|VM_SAO| \ + VM_HUGETLB|VM_SHARED|VM_MAYSHARE) + int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, int advice) { @@ -1416,11 +1419,7 @@ int hugepage_madvise(struct vm_area_struct *vma, /* * Be somewhat over-protective like KSM for now! */ - if (*vm_flags & (VM_HUGEPAGE | - VM_SHARED | VM_MAYSHARE | - VM_PFNMAP | VM_IO | VM_DONTEXPAND | - VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE | - VM_MIXEDMAP | VM_SAO)) + if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) return -EINVAL; *vm_flags &= ~VM_NOHUGEPAGE; *vm_flags |= VM_HUGEPAGE; @@ -1436,11 +1435,7 @@ int hugepage_madvise(struct vm_area_struct *vma, /* * Be somewhat over-protective like KSM for now! */ - if (*vm_flags & (VM_NOHUGEPAGE | - VM_SHARED | VM_MAYSHARE | - VM_PFNMAP | VM_IO | VM_DONTEXPAND | - VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE | - VM_MIXEDMAP | VM_SAO)) + if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP)) return -EINVAL; *vm_flags &= ~VM_HUGEPAGE; *vm_flags |= VM_NOHUGEPAGE; @@ -1574,10 +1569,14 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma) * page fault if needed. */ return 0; - if (vma->vm_file || vma->vm_ops) + if (vma->vm_ops) /* khugepaged not yet working on file or special mappings */ return 0; - VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma)); + /* + * If is_pfn_mapping() is true is_learn_pfn_mapping() must be + * true too, verify it here. + */ + VM_BUG_ON(is_linear_pfn_mapping(vma) || vma->vm_flags & VM_NO_THP); hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; hend = vma->vm_end & HPAGE_PMD_MASK; if (hstart < hend) @@ -1828,12 +1827,15 @@ static void collapse_huge_page(struct mm_struct *mm, (vma->vm_flags & VM_NOHUGEPAGE)) goto out; - /* VM_PFNMAP vmas may have vm_ops null but vm_file set */ - if (!vma->anon_vma || vma->vm_ops || vma->vm_file) + if (!vma->anon_vma || vma->vm_ops) goto out; if (is_vma_temporary_stack(vma)) goto out; - VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma)); + /* + * If is_pfn_mapping() is true is_learn_pfn_mapping() must be + * true too, verify it here. + */ + VM_BUG_ON(is_linear_pfn_mapping(vma) || vma->vm_flags & VM_NO_THP); pgd = pgd_offset(mm, address); if (!pgd_present(*pgd)) @@ -2066,13 +2068,16 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, progress++; continue; } - /* VM_PFNMAP vmas may have vm_ops null but vm_file set */ - if (!vma->anon_vma || vma->vm_ops || vma->vm_file) + if (!vma->anon_vma || vma->vm_ops) goto skip; if (is_vma_temporary_stack(vma)) goto skip; - - VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma)); + /* + * If is_pfn_mapping() is true is_learn_pfn_mapping() + * must be true too, verify it here. + */ + VM_BUG_ON(is_linear_pfn_mapping(vma) || + vma->vm_flags & VM_NO_THP); hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; hend = vma->vm_end & HPAGE_PMD_MASK;
Reply-To: riel@redhat.com On 04/19/2011 09:51 AM, Andrea Arcangeli wrote: > Hi, > > this should fix bug > https://bugzilla.kernel.org/show_bug.cgi?id=33682 . > > ==== > Subject: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups > > From: Andrea Arcangeli<aarcange@redhat.com> > > The huge_memory.c THP page fault was allowed to run if vm_ops was null (which > would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't setup a > special vma->vm_ops and it would fallback to regular anonymous memory) but > other THP logics weren't fully activated for vmas with vm_file not NULL > (/dev/zero has a not NULL vma->vm_file). > Signed-off-by: Andrea Arcangeli<aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com>
Reply-To: mel@csn.ul.ie On Tue, Apr 19, 2011 at 03:51:19PM +0200, Andrea Arcangeli wrote: > Hi, > > this should fix bug > https://bugzilla.kernel.org/show_bug.cgi?id=33682 . > > ==== > Subject: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups > > From: Andrea Arcangeli <aarcange@redhat.com> > > The huge_memory.c THP page fault was allowed to run if vm_ops was null (which > would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't setup a > special vma->vm_ops and it would fallback to regular anonymous memory) but > other THP logics weren't fully activated for vmas with vm_file not NULL > (/dev/zero has a not NULL vma->vm_file). > > So this removes the vm_file checks so that /dev/zero also can safely > use THP (the other albeit safer approach to fix this bug would have > been to prevent the THP initial page fault to run if vm_file was set). > > After removing the vm_file checks, this also makes huge_memory.c > stricter in khugepaged for the DEBUG_VM=y case. It doesn't replace the > vm_file check with a is_pfn_mapping check (but it keeps checking for > VM_PFNMAP under VM_BUG_ON) because for a is_cow_mapping() mapping > VM_PFNMAP should only be allowed to exist before the first page fault, > and in turn when vma->anon_vma is null (so preventing khugepaged > registration). So I tend to think the previous comment saying if > vm_file was set, VM_PFNMAP might have been set and we could still be > registered in khugepaged (despite anon_vma was not NULL to be > registered in khugepaged) was too paranoid. The is_linear_pfn_mapping > check is also I think superfluous (as described by comment) but under > DEBUG_VM it is safe to stay. > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie>
A patch referencing this bug report has been merged in v2.6.39-rc5-274-g609cfda: commit 78f11a255749d09025f54d4e2df4fbcb031530e2 Author: Andrea Arcangeli <aarcange@redhat.com> Date: Wed Apr 27 15:26:45 2011 -0700 mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups