Bug 56881

Summary: MAP_HUGETLB mmap fails for certain sizes
Product: Memory Management Reporter: iceman_dvd
Component: OtherAssignee: Andrew Morton (akpm)
Status: CLOSED CODE_FIX    
Severity: high CC: alan, leonardoborda
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.5.0-27 Tree: Mainline
Regression: No
Attachments: Test program source code

Description iceman_dvd 2013-04-20 03:00:29 UTC
This is on an Ubuntu 12.10 desktop, but the same issue has been found on 12.04 with 3.5.0 kernel.
See the sample program. An allocation with MAP_HUGETLB consistently fails with certain sizes, while it succeeds with others.
The allocation sizes are well below the number of free huge pages.

$ uname -a Linux davide-lnx2 3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux


# echo 100 > /proc/sys/vm/nr_hugepages

# cat /proc/meminfo
...
AnonHugePages:         0 kB
HugePages_Total:     100
HugePages_Free:      100
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB


$ ./mmappu $((5 * 2 * 1024 * 1024 - 4096))
size=10481664	0x9ff000
hugepage mmap: Invalid argument


$ ./mmappu $((5 * 2 * 1024 * 1024 - 4095))
size=10481665	0x9ff001
OK!


It seems the trigger point is a normal page size.
The same binary works flawlessly in previous kernels.
Comment 1 iceman_dvd 2013-04-20 15:12:12 UTC
Created attachment 99481 [details]
Test program source code
Comment 2 Andrew Morton 2013-04-23 20:25:24 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat, 20 Apr 2013 03:00:30 +0000 (UTC) bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=56881
> 
>            Summary: MAP_HUGETLB mmap fails for certain sizes
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 3.5.0-27

Thanks.

It's a post-3.4 regression, testcase included.  Does someone want to
take a look, please?

>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: akpm@linux-foundation.org
>         ReportedBy: iceman_dvd@yahoo.com
>         Regression: No
> 
> 
> This is on an Ubuntu 12.10 desktop, but the same issue has been found on
> 12.04
> with 3.5.0 kernel.
> See the sample program. An allocation with MAP_HUGETLB consistently fails
> with
> certain sizes, while it succeeds with others.
> The allocation sizes are well below the number of free huge pages.
> 
> $ uname -a Linux davide-lnx2 3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25
> 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> # echo 100 > /proc/sys/vm/nr_hugepages
> 
> # cat /proc/meminfo
> ...
> AnonHugePages:         0 kB
> HugePages_Total:     100
> HugePages_Free:      100
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> 
> 
> $ ./mmappu $((5 * 2 * 1024 * 1024 - 4096))
> size=10481664    0x9ff000
> hugepage mmap: Invalid argument
> 
> 
> $ ./mmappu $((5 * 2 * 1024 * 1024 - 4095))
> size=10481665    0x9ff001
> OK!
> 
> 
> It seems the trigger point is a normal page size.
> The same binary works flawlessly in previous kernels.
Comment 3 Naoya Horiguchi 2013-04-23 22:26:39 UTC
On Tue, Apr 23, 2013 at 01:25:22PM -0700, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Sat, 20 Apr 2013 03:00:30 +0000 (UTC) bugzilla-daemon@bugzilla.kernel.org
> wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=56881
> > 
> >            Summary: MAP_HUGETLB mmap fails for certain sizes
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 3.5.0-27
> 
> Thanks.
> 
> It's a post-3.4 regression, testcase included.  Does someone want to
> take a look, please?

Let me try it.

  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
  {                                                                            
          struct inode *inode = file->f_path.dentry->d_inode;
          loff_t len, vma_len;                               
          int ret;                                           
          struct hstate *h = hstate_file(file);              
          ...                                                                               
          if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))              
                  return -EINVAL;                                              

This code checks only whether a given hugetlb vma covers (1 << order)
pages, not whether it's exactly hugepage aligned.
Before 2b37c35e6552 "fs/hugetlbfs/inode.c: fix pgoff alignment
checking on 32-bit", it was

  if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))

, but this made no sense because ~(huge_page_mask(h) >> PAGE_SHIFT) is
0xff for 2M hugepage.
I think the reported problem is not a bug because the behavior before
this change was wrong or not as expected.

If we want to make sure that a given address range fits hugepage size,
something like below can be useful.

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 78bde32..a98304b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -113,11 +113,11 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND | VM_DONTDUMP;
 	vma->vm_ops = &hugetlb_vm_ops;
 
-	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
-		return -EINVAL;
-
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 
+	if (vma->len & ~huge_page_mask(h))
+		return -EINVAL;
+
 	mutex_lock(&inode->i_mutex);
 	file_accessed(file);
 

Thanks,
Naoya Horiguchi

> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: high
> >           Priority: P1
> >          Component: Other
> >         AssignedTo: akpm@linux-foundation.org
> >         ReportedBy: iceman_dvd@yahoo.com
> >         Regression: No
> > 
> > 
> > This is on an Ubuntu 12.10 desktop, but the same issue has been found on
> 12.04
> > with 3.5.0 kernel.
> > See the sample program. An allocation with MAP_HUGETLB consistently fails
> with
> > certain sizes, while it succeeds with others.
> > The allocation sizes are well below the number of free huge pages.
> > 
> > $ uname -a Linux davide-lnx2 3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25
> > 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> > 
> > 
> > # echo 100 > /proc/sys/vm/nr_hugepages
> > 
> > # cat /proc/meminfo
> > ...
> > AnonHugePages:         0 kB
> > HugePages_Total:     100
> > HugePages_Free:      100
> > HugePages_Rsvd:        0
> > HugePages_Surp:        0
> > Hugepagesize:       2048 kB
> > 
> > 
> > $ ./mmappu $((5 * 2 * 1024 * 1024 - 4096))
> > size=10481664    0x9ff000
> > hugepage mmap: Invalid argument
> > 
> > 
> > $ ./mmappu $((5 * 2 * 1024 * 1024 - 4095))
> > size=10481665    0x9ff001
> > OK!
> > 
> > 
> > It seems the trigger point is a normal page size.
> > The same binary works flawlessly in previous kernels.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Comment 4 Johannes Weiner 2013-04-24 08:15:08 UTC
On Tue, Apr 23, 2013 at 01:25:22PM -0700, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Sat, 20 Apr 2013 03:00:30 +0000 (UTC) bugzilla-daemon@bugzilla.kernel.org
> wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=56881
> > 
> >            Summary: MAP_HUGETLB mmap fails for certain sizes
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 3.5.0-27
> 
> Thanks.
> 
> It's a post-3.4 regression, testcase included.  Does someone want to
> take a look, please?

commit 40716e29243de46720e5773797791466c28904ec
Author: Steven Truelove <steven.truelove@utoronto.ca>
Date:   Wed Mar 21 16:34:14 2012 -0700

    hugetlbfs: fix alignment of huge page requests
    
    When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
    PAGE_SIZE, but this is not sufficient.
    
    Modify hugetlb_file_setup() to align requests to the huge page size, and
    to accept an address argument so that all alignment checks can be
    performed in hugetlb_file_setup(), rather than in its callers.  Change
    newseg() and mmap_pgoff() to match the new prototype and eliminate a now
    redundant alignment check.
    
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Steven Truelove <steven.truelove@utoronto.ca>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

This pushes down the length alignment from mmap_pgoff() into
hugetlb_file_setup() and failed to observe that mmap_pgoff() continues
to work with that now unaligned length parameter.  I.e. this part:

diff --git a/mm/mmap.c b/mm/mmap.c
index 9e0c0de..a19cc27 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1099,9 +1099,9 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 		 * A dummy user value is used because we are not locking
 		 * memory so no accounting is necessary
 		 */
-		len = ALIGN(len, huge_page_size(&default_hstate));
-		file = hugetlb_file_setup(HUGETLB_ANON_FILE, len, VM_NORESERVE,
-						&user, HUGETLB_ANONHUGE_INODE);
+		file = hugetlb_file_setup(HUGETLB_ANON_FILE, addr, len,
+						VM_NORESERVE, &user,
+						HUGETLB_ANONHUGE_INODE);
 		if (IS_ERR(file))
 			return PTR_ERR(file);
 	}

It would probably be best to revert this commit for the most part and
fix up the alignment in the shmem code.

> > # echo 100 > /proc/sys/vm/nr_hugepages
> > 
> > # cat /proc/meminfo
> > ...
> > AnonHugePages:         0 kB
> > HugePages_Total:     100
> > HugePages_Free:      100
> > HugePages_Rsvd:        0
> > HugePages_Surp:        0
> > Hugepagesize:       2048 kB
> > 
> > 
> > $ ./mmappu $((5 * 2 * 1024 * 1024 - 4096))
> > size=10481664    0x9ff000
> > hugepage mmap: Invalid argument
> > 
> > 
> > $ ./mmappu $((5 * 2 * 1024 * 1024 - 4095))
> > size=10481665    0x9ff001
> > OK!

hugetlb_get_unmapped_area() expects a hugepage-aligned size argument.
Before do_mmap_pgoff() calls it, it does len = PAGE_ALIGN(len), which
is why the second case works and the first one does not.

How about this (untested) partial revert of the above patch that keeps
the shm alignment fix?

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 84e3d85..13a7d51 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -929,9 +929,8 @@ static struct dentry_operations anon_ops = {
 	.d_dname = hugetlb_dname
 };
 
-struct file *hugetlb_file_setup(const char *name, unsigned long addr,
-				size_t size, vm_flags_t acctflag,
-				struct user_struct **user,
+struct file *hugetlb_file_setup(const char *name, size_t size,
+				vm_flags_t acctflag, struct user_struct **user,
 				int creat_flags, int page_size_log)
 {
 	struct file *file = ERR_PTR(-ENOMEM);
@@ -939,8 +938,6 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
 	struct path path;
 	struct super_block *sb;
 	struct qstr quick_string;
-	struct hstate *hstate;
-	unsigned long num_pages;
 	int hstate_idx;
 
 	hstate_idx = get_hstate_idx(page_size_log);
@@ -980,12 +977,10 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
 	if (!inode)
 		goto out_dentry;
 
-	hstate = hstate_inode(inode);
-	size += addr & ~huge_page_mask(hstate);
-	num_pages = ALIGN(size, huge_page_size(hstate)) >>
-			huge_page_shift(hstate);
 	file = ERR_PTR(-ENOMEM);
-	if (hugetlb_reserve_pages(inode, 0, num_pages, NULL, acctflag))
+	if (hugetlb_reserve_pages(inode, 0,
+			size >> huge_page_shift(hstate_inode(inode)), NULL,
+			acctflag))
 		goto out_inode;
 
 	d_instantiate(path.dentry, inode);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 16e4e9a..437875c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -185,8 +185,7 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
-struct file *hugetlb_file_setup(const char *name, unsigned long addr,
-				size_t size, vm_flags_t acct,
+struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
 				struct user_struct **user, int creat_flags,
 				int page_size_log);
 
@@ -205,9 +204,9 @@ static inline int is_file_hugepages(struct file *file)
 
 #define is_file_hugepages(file)			0
 static inline struct file *
-hugetlb_file_setup(const char *name, unsigned long addr, size_t size,
-		vm_flags_t acctflag, struct user_struct **user, int creat_flags,
-		int page_size_log)
+hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
+		   struct user_struct **user, int creat_flags,
+		   int page_size_log)
 {
 	return ERR_PTR(-ENOSYS);
 }
diff --git a/ipc/shm.c b/ipc/shm.c
index cb858df..c1293d1 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -491,10 +491,13 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 
 	sprintf (name, "SYSV%08x", key);
 	if (shmflg & SHM_HUGETLB) {
+		unsigned int hugesize;
+
 		/* hugetlb_file_setup applies strict accounting */
 		if (shmflg & SHM_NORESERVE)
 			acctflag = VM_NORESERVE;
-		file = hugetlb_file_setup(name, 0, size, acctflag,
+		hugesize = ALIGN(size, huge_page_size(&default_hstate));
+		file = hugetlb_file_setup(name, hugesize, acctflag,
 				  &shp->mlock_user, HUGETLB_SHMFS_INODE,
 				(shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
 	} else {
diff --git a/mm/mmap.c b/mm/mmap.c
index 6466699..36342dd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1335,7 +1335,8 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 		 * A dummy user value is used because we are not locking
 		 * memory so no accounting is necessary
 		 */
-		file = hugetlb_file_setup(HUGETLB_ANON_FILE, addr, len,
+		len = ALIGN(len, huge_page_size(&default_hstate));
+		file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
 				VM_NORESERVE,
 				&user, HUGETLB_ANONHUGE_INODE,
 				(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
Comment 5 Anonymous Emailer 2013-04-24 08:22:06 UTC
Reply-To: wujianguo@huawei.com

On 2013/4/24 6:26, Naoya Horiguchi wrote:

> On Tue, Apr 23, 2013 at 01:25:22PM -0700, Andrew Morton wrote:
>>
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Sat, 20 Apr 2013 03:00:30 +0000 (UTC) bugzilla-daemon@bugzilla.kernel.org
>> wrote:
>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=56881
>>>
>>>            Summary: MAP_HUGETLB mmap fails for certain sizes
>>>            Product: Memory Management
>>>            Version: 2.5
>>>     Kernel Version: 3.5.0-27
>>
>> Thanks.
>>
>> It's a post-3.4 regression, testcase included.  Does someone want to
>> take a look, please?
> 
> Let me try it.
> 
>   static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct
>   *vma)
>   {                                                                           
>           struct inode *inode = file->f_path.dentry->d_inode;
>           loff_t len, vma_len;                               
>           int ret;                                           
>           struct hstate *h = hstate_file(file);              
>           ...                                                                 
>           if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))             
>                   return -EINVAL;                                             
> 
> This code checks only whether a given hugetlb vma covers (1 << order)
> pages, not whether it's exactly hugepage aligned.
> Before 2b37c35e6552 "fs/hugetlbfs/inode.c: fix pgoff alignment
> checking on 32-bit", it was
> 
>   if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
> 
> , but this made no sense because ~(huge_page_mask(h) >> PAGE_SHIFT) is
> 0xff for 2M hugepage.
> I think the reported problem is not a bug because the behavior before
> this change was wrong or not as expected.
> 
> If we want to make sure that a given address range fits hugepage size,
> something like below can be useful.
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 78bde32..a98304b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -113,11 +113,11 @@ static int hugetlbfs_file_mmap(struct file *file,
> struct vm_area_struct *vma)
>       vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND | VM_DONTDUMP;
>       vma->vm_ops = &hugetlb_vm_ops;
>  
> -     if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
> -             return -EINVAL;
> -
>       vma_len = (loff_t)(vma->vm_end - vma->vm_start);
>  
> +     if (vma->len & ~huge_page_mask(h))
> +             return -EINVAL;
> +
>       mutex_lock(&inode->i_mutex);
>       file_accessed(file);
>  
> 
> Thanks,
> Naoya Horiguchi
> 

Hi Naoya,

I think the -EINVAL is returned from hugetlb_get_unmapped_area(),
for the two testcases:
1) $ ./mmappu $((5 * 2 * 1024 * 1024 - 4096))	//len1 = 0x9ff000
2) $ ./mmappu $((5 * 2 * 1024 * 1024 - 4095))	//len2 = 0x9ff001

In do_mmap_pgoff(), after "len = PAGE_ALIGN(len);", len1 = 0x9ff000,
len2 = 0xa00000, so len2 will pass "if (len & ~huge_page_mask(h))" check in
hugetlb_get_unmapped_area(), and len1 will return -EINVAL. As follow:

do_mmap_pgoff()
{
	...
	/* Careful about overflows.. */
	len = PAGE_ALIGN(len);
	...
	get_unmapped_area()
		-->hugetlb_get_unmapped_area()
		   {
			...
			if (len & ~huge_page_mask(h))
				return -EINVAL;
			...
		   }
}

do we need to align len to hugepage size if it's hugetlbfs mmap? something like below:

---
 mm/mmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 0db0de1..bd42be24 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1188,7 +1188,10 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 		addr = round_hint_to_min(addr);
 
 	/* Careful about overflows.. */
-	len = PAGE_ALIGN(len);
+	if (file && is_file_hugepages(file))
+		len = ALIGN(len, huge_page_size(hstate_file(file)));
+	else
+		len = PAGE_ALIGN(len);
 	if (!len)
 		return -ENOMEM;
Comment 6 Naoya Horiguchi 2013-04-24 14:17:23 UTC
On Wed, Apr 24, 2013 at 04:20:50PM +0800, Jianguo Wu wrote:
...
> Hi Naoya,
> 
> I think the -EINVAL is returned from hugetlb_get_unmapped_area(),
> for the two testcases:
> 1) $ ./mmappu $((5 * 2 * 1024 * 1024 - 4096)) //len1 = 0x9ff000
> 2) $ ./mmappu $((5 * 2 * 1024 * 1024 - 4095)) //len2 = 0x9ff001
> 
> In do_mmap_pgoff(), after "len = PAGE_ALIGN(len);", len1 = 0x9ff000,
> len2 = 0xa00000, so len2 will pass "if (len & ~huge_page_mask(h))" check in
> hugetlb_get_unmapped_area(), and len1 will return -EINVAL. As follow:
> 
> do_mmap_pgoff()
> {
>       ...
>       /* Careful about overflows.. */
>       len = PAGE_ALIGN(len);
>       ...
>       get_unmapped_area()
>               -->hugetlb_get_unmapped_area()
>                  {
>                       ...
>                       if (len & ~huge_page_mask(h))
>                               return -EINVAL;
>                       ...
>                  }
> }

You are right, Jianguo. Thanks you.
I totally missed the point.

> 
> do we need to align len to hugepage size if it's hugetlbfs mmap? something
> like below:
> 
> ---
>  mm/mmap.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 0db0de1..bd42be24 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1188,7 +1188,10 @@ unsigned long do_mmap_pgoff(struct file *file,
> unsigned long addr,
>               addr = round_hint_to_min(addr);
>  
>       /* Careful about overflows.. */
> -     len = PAGE_ALIGN(len);
> +     if (file && is_file_hugepages(file))
> +             len = ALIGN(len, huge_page_size(hstate_file(file)));
> +     else
> +             len = PAGE_ALIGN(len);
>       if (!len)
>               return -ENOMEM;
>  
> -- 

I like putting this alignment code in if (flags & MAP_HUGETLB) branch in
SYSCALL_DEFINE6(mmap_pgoff) as Johannes pointed out in another subthread,
because it adds no impact on mmap calls with !MAP_HUGETLB.

Thanks,
Naoya Horiguchi
Comment 7 Naoya Horiguchi 2013-04-24 15:16:44 UTC
On Wed, Apr 24, 2013 at 04:14:54AM -0400, Johannes Weiner wrote:
> On Tue, Apr 23, 2013 at 01:25:22PM -0700, Andrew Morton wrote:
...
> hugetlb_get_unmapped_area() expects a hugepage-aligned size argument.
> Before do_mmap_pgoff() calls it, it does len = PAGE_ALIGN(len), which
> is why the second case works and the first one does not.
> 
> How about this (untested) partial revert of the above patch that keeps
> the shm alignment fix?

I have one comment about the suggested patch.

> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 84e3d85..13a7d51 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -929,9 +929,8 @@ static struct dentry_operations anon_ops = {
>       .d_dname = hugetlb_dname
>  };
>  
> -struct file *hugetlb_file_setup(const char *name, unsigned long addr,
> -                             size_t size, vm_flags_t acctflag,
> -                             struct user_struct **user,
> +struct file *hugetlb_file_setup(const char *name, size_t size,
> +                             vm_flags_t acctflag, struct user_struct **user,
>                               int creat_flags, int page_size_log)
>  {
>       struct file *file = ERR_PTR(-ENOMEM);
> @@ -939,8 +938,6 @@ struct file *hugetlb_file_setup(const char *name,
> unsigned long addr,
>       struct path path;
>       struct super_block *sb;
>       struct qstr quick_string;
> -     struct hstate *hstate;
> -     unsigned long num_pages;
>       int hstate_idx;
>  
>       hstate_idx = get_hstate_idx(page_size_log);
> @@ -980,12 +977,10 @@ struct file *hugetlb_file_setup(const char *name,
> unsigned long addr,
>       if (!inode)
>               goto out_dentry;
>  
> -     hstate = hstate_inode(inode);
> -     size += addr & ~huge_page_mask(hstate);
> -     num_pages = ALIGN(size, huge_page_size(hstate)) >>
> -                     huge_page_shift(hstate);
>       file = ERR_PTR(-ENOMEM);
> -     if (hugetlb_reserve_pages(inode, 0, num_pages, NULL, acctflag))
> +     if (hugetlb_reserve_pages(inode, 0,
> +                     size >> huge_page_shift(hstate_inode(inode)), NULL,
> +                     acctflag))
>               goto out_inode;
>  
>       d_instantiate(path.dentry, inode);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 16e4e9a..437875c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -185,8 +185,7 @@ static inline struct hugetlbfs_sb_info
> *HUGETLBFS_SB(struct super_block *sb)
>  
>  extern const struct file_operations hugetlbfs_file_operations;
>  extern const struct vm_operations_struct hugetlb_vm_ops;
> -struct file *hugetlb_file_setup(const char *name, unsigned long addr,
> -                             size_t size, vm_flags_t acct,
> +struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t
> acct,
>                               struct user_struct **user, int creat_flags,
>                               int page_size_log);
>  
> @@ -205,9 +204,9 @@ static inline int is_file_hugepages(struct file *file)
>  
>  #define is_file_hugepages(file)                      0
>  static inline struct file *
> -hugetlb_file_setup(const char *name, unsigned long addr, size_t size,
> -             vm_flags_t acctflag, struct user_struct **user, int
> creat_flags,
> -             int page_size_log)
> +hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
> +                struct user_struct **user, int creat_flags,
> +                int page_size_log)
>  {
>       return ERR_PTR(-ENOSYS);
>  }
> diff --git a/ipc/shm.c b/ipc/shm.c
> index cb858df..c1293d1 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -491,10 +491,13 @@ static int newseg(struct ipc_namespace *ns, struct
> ipc_params *params)
>  
>       sprintf (name, "SYSV%08x", key);
>       if (shmflg & SHM_HUGETLB) {
> +             unsigned int hugesize;
> +
>               /* hugetlb_file_setup applies strict accounting */
>               if (shmflg & SHM_NORESERVE)
>                       acctflag = VM_NORESERVE;
> -             file = hugetlb_file_setup(name, 0, size, acctflag,
> +             hugesize = ALIGN(size, huge_page_size(&default_hstate));
> +             file = hugetlb_file_setup(name, hugesize, acctflag,
>                                 &shp->mlock_user, HUGETLB_SHMFS_INODE,
>                               (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
>       } else {

Would it be better to find proper hstate instead of using default_hstate?

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 6466699..36342dd 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1335,7 +1335,8 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr,
> unsigned long, len,
>                * A dummy user value is used because we are not locking
>                * memory so no accounting is necessary
>                */
> -             file = hugetlb_file_setup(HUGETLB_ANON_FILE, addr, len,
> +             len = ALIGN(len, huge_page_size(&default_hstate));
> +             file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
>                               VM_NORESERVE,
>                               &user, HUGETLB_ANONHUGE_INODE,
>                               (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
> 

ditto.

Thanks,
Naoya Horiguchi
Comment 8 Johannes Weiner 2013-04-24 15:40:06 UTC
On Wed, Apr 24, 2013 at 11:16:39AM -0400, Naoya Horiguchi wrote:
> On Wed, Apr 24, 2013 at 04:14:54AM -0400, Johannes Weiner wrote:
> > @@ -491,10 +491,13 @@ static int newseg(struct ipc_namespace *ns, struct
> ipc_params *params)
> >  
> >     sprintf (name, "SYSV%08x", key);
> >     if (shmflg & SHM_HUGETLB) {
> > +           unsigned int hugesize;
> > +
> >             /* hugetlb_file_setup applies strict accounting */
> >             if (shmflg & SHM_NORESERVE)
> >                     acctflag = VM_NORESERVE;
> > -           file = hugetlb_file_setup(name, 0, size, acctflag,
> > +           hugesize = ALIGN(size, huge_page_size(&default_hstate));
> > +           file = hugetlb_file_setup(name, hugesize, acctflag,
> >                               &shp->mlock_user, HUGETLB_SHMFS_INODE,
> >                             (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
> >     } else {
> 
> Would it be better to find proper hstate instead of using default_hstate?

You are probably right, I guess we can't assume default_hstate anymore
after page_size_log can be passed in.

Can we have hugetlb_file_setup() return an adjusted length, or an
alignment requirement?

Or pull the hstate lookup into the callsites (since they pass in
page_size_log to begin with)?
Comment 9 Naoya Horiguchi 2013-04-24 23:05:39 UTC
On Wed, Apr 24, 2013 at 11:39:51AM -0400, Johannes Weiner wrote:
> On Wed, Apr 24, 2013 at 11:16:39AM -0400, Naoya Horiguchi wrote:
> > On Wed, Apr 24, 2013 at 04:14:54AM -0400, Johannes Weiner wrote:
> > > @@ -491,10 +491,13 @@ static int newseg(struct ipc_namespace *ns, struct
> ipc_params *params)
> > >  
> > >   sprintf (name, "SYSV%08x", key);
> > >   if (shmflg & SHM_HUGETLB) {
> > > +         unsigned int hugesize;
> > > +
> > >           /* hugetlb_file_setup applies strict accounting */
> > >           if (shmflg & SHM_NORESERVE)
> > >                   acctflag = VM_NORESERVE;
> > > -         file = hugetlb_file_setup(name, 0, size, acctflag,
> > > +         hugesize = ALIGN(size, huge_page_size(&default_hstate));
> > > +         file = hugetlb_file_setup(name, hugesize, acctflag,
> > >                             &shp->mlock_user, HUGETLB_SHMFS_INODE,
> > >                           (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
> > >   } else {
> > 
> > Would it be better to find proper hstate instead of using default_hstate?
> 
> You are probably right, I guess we can't assume default_hstate anymore
> after page_size_log can be passed in.
> 
> Can we have hugetlb_file_setup() return an adjusted length, or an
> alignment requirement?

Yes, it's possible if callers pass the pointer of size (length) to
hugetlb_file_setup() and make it adjusted inside the function.
And as for alignment, I think it's not a hugetlb_file_setup's job,
so we don't have to do it in this function.

> Or pull the hstate lookup into the callsites (since they pass in
> page_size_log to begin with)?

This is also a possible solution, where we might need to define and
export a function converting hugepage order to hstate.

I like the former one, so wrote a patch like below.
# I added your Signed-off-by: because this's based on your draft patch.
# if you don't like it, please let me know.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Wed, 24 Apr 2013 16:44:19 -0400
Subject: [PATCH] hugetlbfs: fix mmap failure in unaligned size request

As reported in https://bugzilla.kernel.org/show_bug.cgi?id=56881, current
kernel returns -EINVAL unless a given mmap length is "almost" hugepage
aligned. This is because in sys_mmap_pgoff() the given length is passed to
vm_mmap_pgoff() as it is without being aligned with hugepage boundary.

This is a regression introduced in commit 40716e29243d "hugetlbfs: fix
alignment of huge page requests", where alignment code is pushed into
hugetlb_file_setup() and the variable len in caller side is not changed.

To fix this, this patch partially reverts that commit, and changes
the type of parameter size from size_t to (size_t *) in order to
align the size in caller side.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/hugetlbfs/inode.c    | 20 ++++++++++----------
 include/linux/hugetlb.h |  7 +++----
 ipc/shm.c               |  2 +-
 mm/mmap.c               |  2 +-
 4 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 523464e..7adbd7b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -929,9 +929,8 @@ static struct dentry_operations anon_ops = {
 	.d_dname = hugetlb_dname
 };
 
-struct file *hugetlb_file_setup(const char *name, unsigned long addr,
-				size_t size, vm_flags_t acctflag,
-				struct user_struct **user,
+struct file *hugetlb_file_setup(const char *name, size_t *sizeptr,
+				vm_flags_t acctflag, struct user_struct **user,
 				int creat_flags, int page_size_log)
 {
 	struct file *file = ERR_PTR(-ENOMEM);
@@ -939,9 +938,8 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
 	struct path path;
 	struct super_block *sb;
 	struct qstr quick_string;
-	struct hstate *hstate;
-	unsigned long num_pages;
 	int hstate_idx;
+	size_t size;
 
 	hstate_idx = get_hstate_idx(page_size_log);
 	if (hstate_idx < 0)
@@ -951,6 +949,10 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
 	if (!hugetlbfs_vfsmount[hstate_idx])
 		return ERR_PTR(-ENOENT);
 
+	size = 1 << hstate_index_to_shift(hstate_idx);
+	if (sizeptr)
+		*sizeptr = ALIGN(*sizeptr, size);
+
 	if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
 		*user = current_user();
 		if (user_shm_lock(size, *user)) {
@@ -980,12 +982,10 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
 	if (!inode)
 		goto out_dentry;
 
-	hstate = hstate_inode(inode);
-	size += addr & ~huge_page_mask(hstate);
-	num_pages = ALIGN(size, huge_page_size(hstate)) >>
-			huge_page_shift(hstate);
 	file = ERR_PTR(-ENOMEM);
-	if (hugetlb_reserve_pages(inode, 0, num_pages, NULL, acctflag))
+	if (hugetlb_reserve_pages(inode, 0,
+			size >> huge_page_shift(hstate_inode(inode)), NULL,
+			acctflag))
 		goto out_inode;
 
 	d_instantiate(path.dentry, inode);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8220a8a..ca67d42 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -193,8 +193,7 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 
 extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
-struct file *hugetlb_file_setup(const char *name, unsigned long addr,
-				size_t size, vm_flags_t acct,
+struct file *hugetlb_file_setup(const char *name, size_t *size, vm_flags_t acct,
 				struct user_struct **user, int creat_flags,
 				int page_size_log);
 
@@ -213,8 +212,8 @@ static inline int is_file_hugepages(struct file *file)
 
 #define is_file_hugepages(file)			0
 static inline struct file *
-hugetlb_file_setup(const char *name, unsigned long addr, size_t size,
-		vm_flags_t acctflag, struct user_struct **user, int creat_flags,
+hugetlb_file_setup(const char *name, size_t *size, vm_flags_t acctflag,
+		struct user_struct **user, int creat_flags,
 		int page_size_log)
 {
 	return ERR_PTR(-ENOSYS);
diff --git a/ipc/shm.c b/ipc/shm.c
index cb858df..e2cb809 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -494,7 +494,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 		/* hugetlb_file_setup applies strict accounting */
 		if (shmflg & SHM_NORESERVE)
 			acctflag = VM_NORESERVE;
-		file = hugetlb_file_setup(name, 0, size, acctflag,
+		file = hugetlb_file_setup(name, NULL, acctflag,
 				  &shp->mlock_user, HUGETLB_SHMFS_INODE,
 				(shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
 	} else {
diff --git a/mm/mmap.c b/mm/mmap.c
index 2664a47..d03a541 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1333,7 +1333,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 		 * A dummy user value is used because we are not locking
 		 * memory so no accounting is necessary
 		 */
-		file = hugetlb_file_setup(HUGETLB_ANON_FILE, addr, len,
+		file = hugetlb_file_setup(HUGETLB_ANON_FILE, &len,
 				VM_NORESERVE,
 				&user, HUGETLB_ANONHUGE_INODE,
 				(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
Comment 10 Naoya Horiguchi 2013-04-24 23:13:10 UTC
On Wed, Apr 24, 2013 at 07:05:35PM -0400, Naoya Horiguchi wrote:
...
> diff --git a/ipc/shm.c b/ipc/shm.c
> index cb858df..e2cb809 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -494,7 +494,7 @@ static int newseg(struct ipc_namespace *ns, struct
> ipc_params *params)
>               /* hugetlb_file_setup applies strict accounting */
>               if (shmflg & SHM_NORESERVE)
>                       acctflag = VM_NORESERVE;
> -             file = hugetlb_file_setup(name, 0, size, acctflag,
> +             file = hugetlb_file_setup(name, NULL, acctflag,
>                                 &shp->mlock_user, HUGETLB_SHMFS_INODE,
>                               (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
>       } else {

Ugh, NULL is not correct, it should be &size.
But anyway, I want your comment, and after that I'll repost with fixing this.

Naoya
Comment 11 Johannes Weiner 2013-04-24 23:26:17 UTC
On Wed, Apr 24, 2013 at 07:05:35PM -0400, Naoya Horiguchi wrote:
> On Wed, Apr 24, 2013 at 11:39:51AM -0400, Johannes Weiner wrote:
> > On Wed, Apr 24, 2013 at 11:16:39AM -0400, Naoya Horiguchi wrote:
> > > On Wed, Apr 24, 2013 at 04:14:54AM -0400, Johannes Weiner wrote:
> > > > @@ -491,10 +491,13 @@ static int newseg(struct ipc_namespace *ns,
> struct ipc_params *params)
> > > >  
> > > >         sprintf (name, "SYSV%08x", key);
> > > >         if (shmflg & SHM_HUGETLB) {
> > > > +               unsigned int hugesize;
> > > > +
> > > >                 /* hugetlb_file_setup applies strict accounting */
> > > >                 if (shmflg & SHM_NORESERVE)
> > > >                         acctflag = VM_NORESERVE;
> > > > -               file = hugetlb_file_setup(name, 0, size, acctflag,
> > > > +               hugesize = ALIGN(size,
> huge_page_size(&default_hstate));
> > > > +               file = hugetlb_file_setup(name, hugesize, acctflag,
> > > >                                   &shp->mlock_user,
> HUGETLB_SHMFS_INODE,
> > > >                                 (shmflg >> SHM_HUGE_SHIFT) &
> SHM_HUGE_MASK);
> > > >         } else {
> > > 
> > > Would it be better to find proper hstate instead of using default_hstate?
> > 
> > You are probably right, I guess we can't assume default_hstate anymore
> > after page_size_log can be passed in.
> > 
> > Can we have hugetlb_file_setup() return an adjusted length, or an
> > alignment requirement?
> 
> Yes, it's possible if callers pass the pointer of size (length) to
> hugetlb_file_setup() and make it adjusted inside the function.
> And as for alignment, I think it's not a hugetlb_file_setup's job,
> so we don't have to do it in this function.
> 
> > Or pull the hstate lookup into the callsites (since they pass in
> > page_size_log to begin with)?
> 
> This is also a possible solution, where we might need to define and
> export a function converting hugepage order to hstate.

After thinking about it some more, I would actually prefer this.  The
callsites have all the information and the file setup code should not
really care about the alignment requirements of the callers.

I.e. export something like get_hstate_idx() but which returns hstate,
then make the callers look it up, do the alignment, pass in the
aligned size and hstate instead of page_size_log.  Then they are free
to use the aligned size (mmap) or use the original size (shm).

> I like the former one, so wrote a patch like below.
> # I added your Signed-off-by: because this's based on your draft patch.
> # if you don't like it, please let me know.

Thanks, I appreciate it.  But usually if you take and modify a patch
add the original From: line to the changelog to give credit, then add
your own signoff and only add other people's signoff after they agree.

> @@ -929,9 +929,8 @@ static struct dentry_operations anon_ops = {
>       .d_dname = hugetlb_dname
>  };
>  
> -struct file *hugetlb_file_setup(const char *name, unsigned long addr,
> -                             size_t size, vm_flags_t acctflag,
> -                             struct user_struct **user,
> +struct file *hugetlb_file_setup(const char *name, size_t *sizeptr,
> +                             vm_flags_t acctflag, struct user_struct **user,
>                               int creat_flags, int page_size_log)
>  {
>       struct file *file = ERR_PTR(-ENOMEM);
> @@ -939,9 +938,8 @@ struct file *hugetlb_file_setup(const char *name,
> unsigned long addr,
>       struct path path;
>       struct super_block *sb;
>       struct qstr quick_string;
> -     struct hstate *hstate;
> -     unsigned long num_pages;
>       int hstate_idx;
> +     size_t size;
>  
>       hstate_idx = get_hstate_idx(page_size_log);
>       if (hstate_idx < 0)
> @@ -951,6 +949,10 @@ struct file *hugetlb_file_setup(const char *name,
> unsigned long addr,
>       if (!hugetlbfs_vfsmount[hstate_idx])
>               return ERR_PTR(-ENOENT);
>  
> +     size = 1 << hstate_index_to_shift(hstate_idx);
> +     if (sizeptr)
> +             *sizeptr = ALIGN(*sizeptr, size);

You always assume the file will just be one hugepage in size?
Comment 12 Johannes Weiner 2013-04-24 23:44:50 UTC
On Wed, Apr 24, 2013 at 07:13:08PM -0400, Naoya Horiguchi wrote:
> On Wed, Apr 24, 2013 at 07:05:35PM -0400, Naoya Horiguchi wrote:
> ...
> > diff --git a/ipc/shm.c b/ipc/shm.c
> > index cb858df..e2cb809 100644
> > --- a/ipc/shm.c
> > +++ b/ipc/shm.c
> > @@ -494,7 +494,7 @@ static int newseg(struct ipc_namespace *ns, struct
> ipc_params *params)
> >             /* hugetlb_file_setup applies strict accounting */
> >             if (shmflg & SHM_NORESERVE)
> >                     acctflag = VM_NORESERVE;
> > -           file = hugetlb_file_setup(name, 0, size, acctflag,
> > +           file = hugetlb_file_setup(name, NULL, acctflag,
> >                               &shp->mlock_user, HUGETLB_SHMFS_INODE,
> >                             (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
> >     } else {
> 
> Ugh, NULL is not correct, it should be &size.

No, shm does not want its size variable aligned, it wants the segment
to be the originally requested size...  only mmap uses the aligned
size later on.
Comment 13 Anonymous Emailer 2013-04-25 03:03:42 UTC
Reply-To: wujianguo@huawei.com

On 2013/4/25 7:05, Naoya Horiguchi wrote:

> On Wed, Apr 24, 2013 at 11:39:51AM -0400, Johannes Weiner wrote:
>> On Wed, Apr 24, 2013 at 11:16:39AM -0400, Naoya Horiguchi wrote:
>>> On Wed, Apr 24, 2013 at 04:14:54AM -0400, Johannes Weiner wrote:
>>>> @@ -491,10 +491,13 @@ static int newseg(struct ipc_namespace *ns, struct
>>>> ipc_params *params)
>>>>  
>>>>    sprintf (name, "SYSV%08x", key);
>>>>    if (shmflg & SHM_HUGETLB) {
>>>> +          unsigned int hugesize;
>>>> +
>>>>            /* hugetlb_file_setup applies strict accounting */
>>>>            if (shmflg & SHM_NORESERVE)
>>>>                    acctflag = VM_NORESERVE;
>>>> -          file = hugetlb_file_setup(name, 0, size, acctflag,
>>>> +          hugesize = ALIGN(size, huge_page_size(&default_hstate));
>>>> +          file = hugetlb_file_setup(name, hugesize, acctflag,
>>>>                              &shp->mlock_user, HUGETLB_SHMFS_INODE,
>>>>                            (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
>>>>    } else {
>>>
>>> Would it be better to find proper hstate instead of using default_hstate?
>>
>> You are probably right, I guess we can't assume default_hstate anymore
>> after page_size_log can be passed in.
>>
>> Can we have hugetlb_file_setup() return an adjusted length, or an
>> alignment requirement?
> 
> Yes, it's possible if callers pass the pointer of size (length) to
> hugetlb_file_setup() and make it adjusted inside the function.
> And as for alignment, I think it's not a hugetlb_file_setup's job,
> so we don't have to do it in this function.
> 
>> Or pull the hstate lookup into the callsites (since they pass in
>> page_size_log to begin with)?
> 
> This is also a possible solution, where we might need to define and
> export a function converting hugepage order to hstate.
> 
> I like the former one, so wrote a patch like below.
> # I added your Signed-off-by: because this's based on your draft patch.
> # if you don't like it, please let me know.
> 
> Thanks,
> Naoya Horiguchi
> ---
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Wed, 24 Apr 2013 16:44:19 -0400
> Subject: [PATCH] hugetlbfs: fix mmap failure in unaligned size request
> 
> As reported in https://bugzilla.kernel.org/show_bug.cgi?id=56881, current
> kernel returns -EINVAL unless a given mmap length is "almost" hugepage
> aligned. This is because in sys_mmap_pgoff() the given length is passed to
> vm_mmap_pgoff() as it is without being aligned with hugepage boundary.
> 
> This is a regression introduced in commit 40716e29243d "hugetlbfs: fix
> alignment of huge page requests", where alignment code is pushed into
> hugetlb_file_setup() and the variable len in caller side is not changed.
> 
> To fix this, this patch partially reverts that commit, and changes
> the type of parameter size from size_t to (size_t *) in order to
> align the size in caller side.
> 

Hi Naoya,

This patch only fix anonymous hugetlb mmap case, should also fix hugetlbfs file mmap case?

diff --git a/mm/mmap.c b/mm/mmap.c
index 0db0de1..5ed9561 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1327,6 +1327,8 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 		file = fget(fd);
 		if (!file)
 			goto out;
+		else if (is_file_hugepages(file))
+			len = ALIGN(len, huge_page_size(hstate_file(file)));
 	} else if (flags & MAP_HUGETLB) {
 		struct user_struct *user = NULL;
 		/*

Thanks,
Jianguo Wu

> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/hugetlbfs/inode.c    | 20 ++++++++++----------
>  include/linux/hugetlb.h |  7 +++----
>  ipc/shm.c               |  2 +-
>  mm/mmap.c               |  2 +-
>  4 files changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 523464e..7adbd7b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -929,9 +929,8 @@ static struct dentry_operations anon_ops = {
>       .d_dname = hugetlb_dname
>  };
>  
> -struct file *hugetlb_file_setup(const char *name, unsigned long addr,
> -                             size_t size, vm_flags_t acctflag,
> -                             struct user_struct **user,
> +struct file *hugetlb_file_setup(const char *name, size_t *sizeptr,
> +                             vm_flags_t acctflag, struct user_struct **user,
>                               int creat_flags, int page_size_log)
>  {
>       struct file *file = ERR_PTR(-ENOMEM);
> @@ -939,9 +938,8 @@ struct file *hugetlb_file_setup(const char *name,
> unsigned long addr,
>       struct path path;
>       struct super_block *sb;
>       struct qstr quick_string;
> -     struct hstate *hstate;
> -     unsigned long num_pages;
>       int hstate_idx;
> +     size_t size;
>  
>       hstate_idx = get_hstate_idx(page_size_log);
>       if (hstate_idx < 0)
> @@ -951,6 +949,10 @@ struct file *hugetlb_file_setup(const char *name,
> unsigned long addr,
>       if (!hugetlbfs_vfsmount[hstate_idx])
>               return ERR_PTR(-ENOENT);
>  
> +     size = 1 << hstate_index_to_shift(hstate_idx);
> +     if (sizeptr)
> +             *sizeptr = ALIGN(*sizeptr, size);
> +
>       if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
>               *user = current_user();
>               if (user_shm_lock(size, *user)) {
> @@ -980,12 +982,10 @@ struct file *hugetlb_file_setup(const char *name,
> unsigned long addr,
>       if (!inode)
>               goto out_dentry;
>  
> -     hstate = hstate_inode(inode);
> -     size += addr & ~huge_page_mask(hstate);
> -     num_pages = ALIGN(size, huge_page_size(hstate)) >>
> -                     huge_page_shift(hstate);
>       file = ERR_PTR(-ENOMEM);
> -     if (hugetlb_reserve_pages(inode, 0, num_pages, NULL, acctflag))
> +     if (hugetlb_reserve_pages(inode, 0,
> +                     size >> huge_page_shift(hstate_inode(inode)), NULL,
> +                     acctflag))
>               goto out_inode;
>  
>       d_instantiate(path.dentry, inode);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8220a8a..ca67d42 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -193,8 +193,7 @@ static inline struct hugetlbfs_sb_info
> *HUGETLBFS_SB(struct super_block *sb)
>  
>  extern const struct file_operations hugetlbfs_file_operations;
>  extern const struct vm_operations_struct hugetlb_vm_ops;
> -struct file *hugetlb_file_setup(const char *name, unsigned long addr,
> -                             size_t size, vm_flags_t acct,
> +struct file *hugetlb_file_setup(const char *name, size_t *size, vm_flags_t
> acct,
>                               struct user_struct **user, int creat_flags,
>                               int page_size_log);
>  
> @@ -213,8 +212,8 @@ static inline int is_file_hugepages(struct file *file)
>  
>  #define is_file_hugepages(file)                      0
>  static inline struct file *
> -hugetlb_file_setup(const char *name, unsigned long addr, size_t size,
> -             vm_flags_t acctflag, struct user_struct **user, int
> creat_flags,
> +hugetlb_file_setup(const char *name, size_t *size, vm_flags_t acctflag,
> +             struct user_struct **user, int creat_flags,
>               int page_size_log)
>  {
>       return ERR_PTR(-ENOSYS);
> diff --git a/ipc/shm.c b/ipc/shm.c
> index cb858df..e2cb809 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -494,7 +494,7 @@ static int newseg(struct ipc_namespace *ns, struct
> ipc_params *params)
>               /* hugetlb_file_setup applies strict accounting */
>               if (shmflg & SHM_NORESERVE)
>                       acctflag = VM_NORESERVE;
> -             file = hugetlb_file_setup(name, 0, size, acctflag,
> +             file = hugetlb_file_setup(name, NULL, acctflag,
>                                 &shp->mlock_user, HUGETLB_SHMFS_INODE,
>                               (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
>       } else {
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2664a47..d03a541 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1333,7 +1333,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr,
> unsigned long, len,
>                * A dummy user value is used because we are not locking
>                * memory so no accounting is necessary
>                */
> -             file = hugetlb_file_setup(HUGETLB_ANON_FILE, addr, len,
> +             file = hugetlb_file_setup(HUGETLB_ANON_FILE, &len,
>                               VM_NORESERVE,
>                               &user, HUGETLB_ANONHUGE_INODE,
>                               (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
Comment 14 Naoya Horiguchi 2013-04-25 21:00:35 UTC
On Wed, Apr 24, 2013 at 07:26:00PM -0400, Johannes Weiner wrote:
> On Wed, Apr 24, 2013 at 07:05:35PM -0400, Naoya Horiguchi wrote:
> > On Wed, Apr 24, 2013 at 11:39:51AM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 24, 2013 at 11:16:39AM -0400, Naoya Horiguchi wrote:
> > > > On Wed, Apr 24, 2013 at 04:14:54AM -0400, Johannes Weiner wrote:
> > > > > @@ -491,10 +491,13 @@ static int newseg(struct ipc_namespace *ns,
> struct ipc_params *params)
> > > > >  
> > > > >       sprintf (name, "SYSV%08x", key);
> > > > >       if (shmflg & SHM_HUGETLB) {
> > > > > +             unsigned int hugesize;
> > > > > +
> > > > >               /* hugetlb_file_setup applies strict accounting */
> > > > >               if (shmflg & SHM_NORESERVE)
> > > > >                       acctflag = VM_NORESERVE;
> > > > > -             file = hugetlb_file_setup(name, 0, size, acctflag,
> > > > > +             hugesize = ALIGN(size,
> huge_page_size(&default_hstate));
> > > > > +             file = hugetlb_file_setup(name, hugesize, acctflag,
> > > > >                                 &shp->mlock_user,
> HUGETLB_SHMFS_INODE,
> > > > >                               (shmflg >> SHM_HUGE_SHIFT) &
> SHM_HUGE_MASK);
> > > > >       } else {
> > > > 
> > > > Would it be better to find proper hstate instead of using
> default_hstate?
> > > 
> > > You are probably right, I guess we can't assume default_hstate anymore
> > > after page_size_log can be passed in.
> > > 
> > > Can we have hugetlb_file_setup() return an adjusted length, or an
> > > alignment requirement?
> > 
> > Yes, it's possible if callers pass the pointer of size (length) to
> > hugetlb_file_setup() and make it adjusted inside the function.
> > And as for alignment, I think it's not a hugetlb_file_setup's job,
> > so we don't have to do it in this function.
> > 
> > > Or pull the hstate lookup into the callsites (since they pass in
> > > page_size_log to begin with)?
> > 
> > This is also a possible solution, where we might need to define and
> > export a function converting hugepage order to hstate.
> 
> After thinking about it some more, I would actually prefer this.  The
> callsites have all the information and the file setup code should not
> really care about the alignment requirements of the callers.
> 
> I.e. export something like get_hstate_idx() but which returns hstate,
> then make the callers look it up, do the alignment, pass in the
> aligned size and hstate instead of page_size_log.  Then they are free
> to use the aligned size (mmap) or use the original size (shm).

OK. I'll do this.

> > I like the former one, so wrote a patch like below.
> > # I added your Signed-off-by: because this's based on your draft patch.
> > # if you don't like it, please let me know.
> 
> Thanks, I appreciate it.  But usually if you take and modify a patch
> add the original From: line to the changelog to give credit, then add
> your own signoff and only add other people's signoff after they agree.

OK, got it.

> > @@ -929,9 +929,8 @@ static struct dentry_operations anon_ops = {
> >     .d_dname = hugetlb_dname
> >  };
> >  
> > -struct file *hugetlb_file_setup(const char *name, unsigned long addr,
> > -                           size_t size, vm_flags_t acctflag,
> > -                           struct user_struct **user,
> > +struct file *hugetlb_file_setup(const char *name, size_t *sizeptr,
> > +                           vm_flags_t acctflag, struct user_struct **user,
> >                             int creat_flags, int page_size_log)
> >  {
> >     struct file *file = ERR_PTR(-ENOMEM);
> > @@ -939,9 +938,8 @@ struct file *hugetlb_file_setup(const char *name,
> unsigned long addr,
> >     struct path path;
> >     struct super_block *sb;
> >     struct qstr quick_string;
> > -   struct hstate *hstate;
> > -   unsigned long num_pages;
> >     int hstate_idx;
> > +   size_t size;
> >  
> >     hstate_idx = get_hstate_idx(page_size_log);
> >     if (hstate_idx < 0)
> > @@ -951,6 +949,10 @@ struct file *hugetlb_file_setup(const char *name,
> unsigned long addr,
> >     if (!hugetlbfs_vfsmount[hstate_idx])
> >             return ERR_PTR(-ENOENT);
> >  
> > +   size = 1 << hstate_index_to_shift(hstate_idx);
> > +   if (sizeptr)
> > +           *sizeptr = ALIGN(*sizeptr, size);
> 
> You always assume the file will just be one hugepage in size?

No, this line means that *sizeptr (given by the caller) is round up
to the multiple of hugepage's size. So assuming size is 2MB, for example,
if 5MB is given it's round up to 6MB in return (3 hugepages.)
Comment 15 Naoya Horiguchi 2013-04-25 21:03:23 UTC
On Thu, Apr 25, 2013 at 11:02:59AM +0800, Jianguo Wu wrote:
> On 2013/4/25 7:05, Naoya Horiguchi wrote:
> 
> > On Wed, Apr 24, 2013 at 11:39:51AM -0400, Johannes Weiner wrote:
> >> On Wed, Apr 24, 2013 at 11:16:39AM -0400, Naoya Horiguchi wrote:
> >>> On Wed, Apr 24, 2013 at 04:14:54AM -0400, Johannes Weiner wrote:
> >>>> @@ -491,10 +491,13 @@ static int newseg(struct ipc_namespace *ns, struct
> ipc_params *params)
> >>>>  
> >>>>          sprintf (name, "SYSV%08x", key);
> >>>>          if (shmflg & SHM_HUGETLB) {
> >>>> +                unsigned int hugesize;
> >>>> +
> >>>>                  /* hugetlb_file_setup applies strict accounting */
> >>>>                  if (shmflg & SHM_NORESERVE)
> >>>>                          acctflag = VM_NORESERVE;
> >>>> -                file = hugetlb_file_setup(name, 0, size, acctflag,
> >>>> +                hugesize = ALIGN(size,
> huge_page_size(&default_hstate));
> >>>> +                file = hugetlb_file_setup(name, hugesize, acctflag,
> >>>>                                    &shp->mlock_user,
> HUGETLB_SHMFS_INODE,
> >>>>                                  (shmflg >> SHM_HUGE_SHIFT) &
> SHM_HUGE_MASK);
> >>>>          } else {
> >>>
> >>> Would it be better to find proper hstate instead of using default_hstate?
> >>
> >> You are probably right, I guess we can't assume default_hstate anymore
> >> after page_size_log can be passed in.
> >>
> >> Can we have hugetlb_file_setup() return an adjusted length, or an
> >> alignment requirement?
> > 
> > Yes, it's possible if callers pass the pointer of size (length) to
> > hugetlb_file_setup() and make it adjusted inside the function.
> > And as for alignment, I think it's not a hugetlb_file_setup's job,
> > so we don't have to do it in this function.
> > 
> >> Or pull the hstate lookup into the callsites (since they pass in
> >> page_size_log to begin with)?
> > 
> > This is also a possible solution, where we might need to define and
> > export a function converting hugepage order to hstate.
> > 
> > I like the former one, so wrote a patch like below.
> > # I added your Signed-off-by: because this's based on your draft patch.
> > # if you don't like it, please let me know.
> > 
> > Thanks,
> > Naoya Horiguchi
> > ---
> > From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Date: Wed, 24 Apr 2013 16:44:19 -0400
> > Subject: [PATCH] hugetlbfs: fix mmap failure in unaligned size request
> > 
> > As reported in https://bugzilla.kernel.org/show_bug.cgi?id=56881, current
> > kernel returns -EINVAL unless a given mmap length is "almost" hugepage
> > aligned. This is because in sys_mmap_pgoff() the given length is passed to
> > vm_mmap_pgoff() as it is without being aligned with hugepage boundary.
> > 
> > This is a regression introduced in commit 40716e29243d "hugetlbfs: fix
> > alignment of huge page requests", where alignment code is pushed into
> > hugetlb_file_setup() and the variable len in caller side is not changed.
> > 
> > To fix this, this patch partially reverts that commit, and changes
> > the type of parameter size from size_t to (size_t *) in order to
> > align the size in caller side.
> > 
> 
> Hi Naoya,
> 
> This patch only fix anonymous hugetlb mmap case, should also fix hugetlbfs
> file mmap case?

Right, thank you, Jianguo.

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 0db0de1..5ed9561 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1327,6 +1327,8 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr,
> unsigned long, len,
>               file = fget(fd);
>               if (!file)
>                       goto out;
> +             else if (is_file_hugepages(file))
> +                     len = ALIGN(len, huge_page_size(hstate_file(file)));
>       } else if (flags & MAP_HUGETLB) {
>               struct user_struct *user = NULL;
>               /*

I'll added this in next post.

Thanks,
Naoya Horiguchi
Comment 16 Leonardo Borda 2013-05-16 18:24:39 UTC
Hi Naoya,

Thanks for your work on this issue.
Did you have a chance to make a patch for the hugetlbfs file mmap case ?
We're also affected by this issue.

Thanks 
Leo
Comment 17 Aneesh Kumar K.V 2013-06-12 12:16:43 UTC
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> On Wed, Apr 24, 2013 at 11:39:51AM -0400, Johannes Weiner wrote:
>> On Wed, Apr 24, 2013 at 11:16:39AM -0400, Naoya Horiguchi wrote:
>> > On Wed, Apr 24, 2013 at 04:14:54AM -0400, Johannes Weiner wrote:
>> > > @@ -491,10 +491,13 @@ static int newseg(struct ipc_namespace *ns, struct
>> ipc_params *params)
>> > >  
>> > >          sprintf (name, "SYSV%08x", key);
>> > >          if (shmflg & SHM_HUGETLB) {
>> > > +                unsigned int hugesize;
>> > > +
>> > >                  /* hugetlb_file_setup applies strict accounting */
>> > >                  if (shmflg & SHM_NORESERVE)
>> > >                          acctflag = VM_NORESERVE;
>> > > -                file = hugetlb_file_setup(name, 0, size, acctflag,
>> > > +                hugesize = ALIGN(size,
>> huge_page_size(&default_hstate));
>> > > +                file = hugetlb_file_setup(name, hugesize, acctflag,
>> > >                                    &shp->mlock_user,
>> HUGETLB_SHMFS_INODE,
>> > >                                  (shmflg >> SHM_HUGE_SHIFT) &
>> SHM_HUGE_MASK);
>> > >          } else {
>> > 
>> > Would it be better to find proper hstate instead of using default_hstate?
>> 
>> You are probably right, I guess we can't assume default_hstate anymore
>> after page_size_log can be passed in.
>> 
>> Can we have hugetlb_file_setup() return an adjusted length, or an
>> alignment requirement?
>
> Yes, it's possible if callers pass the pointer of size (length) to
> hugetlb_file_setup() and make it adjusted inside the function.
> And as for alignment, I think it's not a hugetlb_file_setup's job,
> so we don't have to do it in this function.
>
>> Or pull the hstate lookup into the callsites (since they pass in
>> page_size_log to begin with)?
>
> This is also a possible solution, where we might need to define and
> export a function converting hugepage order to hstate.
>
> I like the former one, so wrote a patch like below.
> # I added your Signed-off-by: because this's based on your draft patch.
> # if you don't like it, please let me know.
>
> Thanks,
> Naoya Horiguchi
> ---
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Wed, 24 Apr 2013 16:44:19 -0400
> Subject: [PATCH] hugetlbfs: fix mmap failure in unaligned size request
>
> As reported in https://bugzilla.kernel.org/show_bug.cgi?id=56881, current
> kernel returns -EINVAL unless a given mmap length is "almost" hugepage
> aligned. This is because in sys_mmap_pgoff() the given length is passed to
> vm_mmap_pgoff() as it is without being aligned with hugepage boundary.
>
> This is a regression introduced in commit 40716e29243d "hugetlbfs: fix
> alignment of huge page requests", where alignment code is pushed into
> hugetlb_file_setup() and the variable len in caller side is not changed.
>
> To fix this, this patch partially reverts that commit, and changes
> the type of parameter size from size_t to (size_t *) in order to
> align the size in caller side.

After the change af73e4d9506d3b797509f3c030e7dcd554f7d9c4 we have
alignment related failures in libhugetlbfs test suite. misalign test
fails with 3.10-rc5, while it works with 3.9.

-aneesh
Comment 18 Andrew Morton 2013-06-13 21:29:46 UTC
On Wed, 12 Jun 2013 17:46:16 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> > From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Date: Wed, 24 Apr 2013 16:44:19 -0400
> > Subject: [PATCH] hugetlbfs: fix mmap failure in unaligned size request
> >
> > As reported in https://bugzilla.kernel.org/show_bug.cgi?id=56881, current
> > kernel returns -EINVAL unless a given mmap length is "almost" hugepage
> > aligned. This is because in sys_mmap_pgoff() the given length is passed to
> > vm_mmap_pgoff() as it is without being aligned with hugepage boundary.
> >
> > This is a regression introduced in commit 40716e29243d "hugetlbfs: fix
> > alignment of huge page requests", where alignment code is pushed into
> > hugetlb_file_setup() and the variable len in caller side is not changed.
> >
> > To fix this, this patch partially reverts that commit, and changes
> > the type of parameter size from size_t to (size_t *) in order to
> > align the size in caller side.
> 
> After the change af73e4d9506d3b797509f3c030e7dcd554f7d9c4 we have
> alignment related failures in libhugetlbfs test suite. misalign test
> fails with 3.10-rc5, while it works with 3.9.

What does this mean.  Is 3.10-rc5 more strict, or less strict?

If "less strict" then that's expected and old userspace should be OK
with the change and the test should be updated (sorry).
Comment 19 Aneesh Kumar K.V 2013-06-18 11:15:03 UTC
Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 12 Jun 2013 17:46:16 +0530 "Aneesh Kumar K.V"
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
>> > From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> > Date: Wed, 24 Apr 2013 16:44:19 -0400
>> > Subject: [PATCH] hugetlbfs: fix mmap failure in unaligned size request
>> >
>> > As reported in https://bugzilla.kernel.org/show_bug.cgi?id=56881, current
>> > kernel returns -EINVAL unless a given mmap length is "almost" hugepage
>> > aligned. This is because in sys_mmap_pgoff() the given length is passed to
>> > vm_mmap_pgoff() as it is without being aligned with hugepage boundary.
>> >
>> > This is a regression introduced in commit 40716e29243d "hugetlbfs: fix
>> > alignment of huge page requests", where alignment code is pushed into
>> > hugetlb_file_setup() and the variable len in caller side is not changed.
>> >
>> > To fix this, this patch partially reverts that commit, and changes
>> > the type of parameter size from size_t to (size_t *) in order to
>> > align the size in caller side.
>> 
>> After the change af73e4d9506d3b797509f3c030e7dcd554f7d9c4 we have
>> alignment related failures in libhugetlbfs test suite. misalign test
>> fails with 3.10-rc5, while it works with 3.9.
>
> What does this mean.  Is 3.10-rc5 more strict, or less strict?
>
> If "less strict" then that's expected and old userspace should be OK
> with the change and the test should be updated (sorry).

3.10_rc5 is less strict. Also Naoya Horiguchi updated that relevant
changes to libhugetlbfs test is also posted at 

http://www.mail-archive.com/libhugetlbfs-devel@lists.sourceforge.net/msg13317.html

-aneesh