Bug 210023

Summary:	Crash when allocating > 2 TB memory
Product:	Memory Management	Reporter:	hsinhuiwu
Component:	Page Allocator	Assignee:	Andrew Morton (akpm)
Status:	NEW ---
Severity:	blocking	CC:	hsinhuiwu, raquini
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	3.10.0-957.27.2.el7.x86_64	Subsystem:
Regression:	No	Bisected commit-id:

Description hsinhuiwu 2020-11-03 18:50:07 UTC

With a machine with 3 TB (more than 2 TB memory). If you use vmalloc to allocate > 2 TB memory, the array_size below will be overflowed.

The array_size is an unsigned int and can only be used to allocate less than 2 TB memory. If you pass 2*1028*1028*1024*1024 = 2 * 2^40 in the argument of vmalloc. The array_size will become 2*2^31 = 2^32. The 2^32 cannot be store with a 32 bit integer.

The fix is to change the type of array_size to unsigned long.

vmalloc.c

1762 void *vmalloc(unsigned long size)
1763 {
1764     return __vmalloc_node_flags(size, NUMA_NO_NODE,
1765                     GFP_KERNEL | __GFP_HIGHMEM);
1766 }

# diff -Nurp mm/vmalloc.c mm/vmalloc.c.new
--- mm/vmalloc.c 2020-08-11 18:40:55.000000000 -0400
+++ mm/vmalloc.c.new 2020-10-20 10:02:22.071509308 -0400
@@ -1603,11 +1603,12 @@ static void *__vmalloc_area_node(struct
{
const int order = 0;
struct page **pages;
- unsigned int nr_pages, array_size, i;
+ unsigned int nr_pages, i;
+ unsigned long array_size;
gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;

nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
- array_size = (nr_pages * sizeof(struct page *));
+ array_size = ((unsigned long)(nr_pages) * sizeof(struct page *));

area->nr_pages = nr_pages;
/* Please note that the recursion is strictly bounded. */

Comment 1 Andrew Morton 2020-11-04 00:27:43 UTC

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 03 Nov 2020 18:50:07 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=210023
> 
>             Bug ID: 210023
>            Summary: Crash when allocating > 2 TB memory
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 3.10.0-957.27.2.el7.x86_64
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: blocking
>           Priority: P1
>          Component: Slab Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: hsinhuiwu@gmail.com
>         Regression: No
> 
> With a machine with 3 TB (more than 2 TB memory). If you use vmalloc to
> allocate > 2 TB memory, the array_size below will be overflowed.

How was this observed?

Is there any know userspace operation which causes the kernel to try to
vmalloc such a large hunk of memory?

> The array_size is an unsigned int and can only be used to allocate less than
> 2
> TB memory. If you pass 2*1028*1028*1024*1024 = 2 * 2^40 in the argument of
> vmalloc. The array_size will become 2*2^31 = 2^32. The 2^32 cannot be store
> with a 32 bit integer.
> 
> The fix is to change the type of array_size to unsigned long.
> 
> vmalloc.c
> 
> 1762 void *vmalloc(unsigned long size)
> 1763 {
> 1764     return __vmalloc_node_flags(size, NUMA_NO_NODE,
> 1765                     GFP_KERNEL | __GFP_HIGHMEM);
> 1766 }

OK, thanks.  Against current mainline your proposed change would look
like this, yes?

--- a/mm/vmalloc.c~a
+++ a/mm/vmalloc.c
@@ -2461,9 +2461,11 @@ static void *__vmalloc_area_node(struct
 {
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
 	unsigned int nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
-	unsigned int array_size = nr_pages * sizeof(struct page *), i;
+	unsigned long array_size
+	unsigned int i;
 	struct page **pages;
 
+	array_size = (unsigned long)nr_pages * sizeof(struct page *);
 	gfp_mask |= __GFP_NOWARN;
 	if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
 		gfp_mask |= __GFP_HIGHMEM;
_

Comment 2 willy 2020-11-04 00:47:11 UTC

On Tue, Nov 03, 2020 at 04:27:40PM -0800, Andrew Morton wrote:
>       const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
>       unsigned int nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
> -     unsigned int array_size = nr_pages * sizeof(struct page *), i;
> +     unsigned long array_size
> +     unsigned int i;
>       struct page **pages;
>  
> +     array_size = (unsigned long)nr_pages * sizeof(struct page *);

This is only pushing the problem out ~5 years.  If somebody tries
to allocate a 16TB area, 16TB / 4kB is 4GB and nr_pages overflows.
That's only 3 doublings of RAM away.

I think we need to change vm_struct's nr_pages from an unsigned int to
an unsigned long.  It won't take up any more room because it's sandwiched
between a pointer and a phys_addr_t.

Comment 3 hsinhuiwu 2020-11-04 13:55:24 UTC

Created attachment 293461 [details]
attachment-10548-0.html

> With a machine with 3 TB (more than 2 TB memory). If you use vmalloc to
> allocate > 2 TB memory, the array_size below will be overflowed.

How was this observed?

Is there any know userspace operation which causes the kernel to try to
vmalloc such a large hunk of memory?

[Frank] The Dell PowerEdge R740/R940 can have up to  3TB/6TB memory.
installed. Our application requires reserve consecutive memory in the kernel
space and protected from userspace programs.

----------------------------------------------------------------------------------------------------
OK, thanks.  Against current mainline your proposed change would look
like this, yes?

[Frank] Yes. This will support up to less than 16 TB. If you want to support
 more than 16 TB, we need to expand nr_pages to unsigned long as
Matthew pointed out.

Will it be possible to add this to kernel 3.10.0-957.27.2.el7.x86_64?

Thanks,
Frank

On Tue, Nov 3, 2020 at 7:27 PM Andrew Morton <akpm@linux-foundation.org>
wrote:

> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Tue, 03 Nov 2020 18:50:07 +0000 bugzilla-daemon@bugzilla.kernel.org
> wrote:
>
> > https://bugzilla.kernel.org/show_bug.cgi?id=210023
> >
> >             Bug ID: 210023
> >            Summary: Crash when allocating > 2 TB memory
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 3.10.0-957.27.2.el7.x86_64
> >           Hardware: All
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: blocking
> >           Priority: P1
> >          Component: Slab Allocator
> >           Assignee: akpm@linux-foundation.org
> >           Reporter: hsinhuiwu@gmail.com
> >         Regression: No
> >
> > With a machine with 3 TB (more than 2 TB memory). If you use vmalloc to
> > allocate > 2 TB memory, the array_size below will be overflowed.
>
> How was this observed?
>
> Is there any know userspace operation which causes the kernel to try to
> vmalloc such a large hunk of memory?
>
> > The array_size is an unsigned int and can only be used to allocate less
> than 2
> > TB memory. If you pass 2*1028*1028*1024*1024 = 2 * 2^40 in the argument
> of
> > vmalloc. The array_size will become 2*2^31 = 2^32. The 2^32 cannot be
> store
> > with a 32 bit integer.
> >
> > The fix is to change the type of array_size to unsigned long.
> >
> > vmalloc.c
> >
> > 1762 void *vmalloc(unsigned long size)
> > 1763 {
> > 1764     return __vmalloc_node_flags(size, NUMA_NO_NODE,
> > 1765                     GFP_KERNEL | __GFP_HIGHMEM);
> > 1766 }
>
> OK, thanks.  Against current mainline your proposed change would look
> like this, yes?
>
> --- a/mm/vmalloc.c~a
> +++ a/mm/vmalloc.c
> @@ -2461,9 +2461,11 @@ static void *__vmalloc_area_node(struct
>  {
>         const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) |
> __GFP_ZERO;
>         unsigned int nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
> -       unsigned int array_size = nr_pages * sizeof(struct page *), i;
> +       unsigned long array_size
> +       unsigned int i;
>         struct page **pages;
>
> +       array_size = (unsigned long)nr_pages * sizeof(struct page *);
>         gfp_mask |= __GFP_NOWARN;
>         if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
>                 gfp_mask |= __GFP_HIGHMEM;
> _
>
>

Comment 4 Andrew Morton 2020-11-04 18:19:09 UTC

On Wed, 4 Nov 2020 08:55:11 -0500 Hsin-Hui Wu <hsinhuiwu@gmail.com> wrote:

> > With a machine with 3 TB (more than 2 TB memory). If you use vmalloc to
> > allocate > 2 TB memory, the array_size below will be overflowed.
> 
> How was this observed?
> 
> Is there any know userspace operation which causes the kernel to try to
> vmalloc such a large hunk of memory?
> 
> [Frank] The Dell PowerEdge R740/R940 can have up to  3TB/6TB memory.
> installed. Our application requires reserve consecutive memory in the kernel
> space and protected from userspace programs.

Did this require custom kernel changes?  If not, precisely which system
calls were used to cause this allocation attempt?

>
> ----------------------------------------------------------------------------------------------------
> OK, thanks.  Against current mainline your proposed change would look
> like this, yes?
> 
> [Frank] Yes. This will support up to less than 16 TB. If you want to support
>  more than 16 TB, we need to expand nr_pages to unsigned long as
> Matthew pointed out.
> 
> Will it be possible to add this to kernel 3.10.0-957.27.2.el7.x86_64?

That is up to Red Hat to decide.

Comment 5 hsinhuiwu 2020-11-04 20:53:49 UTC

Created attachment 293489 [details]
attachment-10105-0.html

Did this require custom kernel changes?  If not, precisely which system
calls were used to cause this allocation attempt?

[Frank] No. vmalloc().

That is up to Red Hat to decide.

{Frank] Understood.


On Wed, Nov 4, 2020 at 1:19 PM Andrew Morton <akpm@linux-foundation.org>
wrote:

> On Wed, 4 Nov 2020 08:55:11 -0500 Hsin-Hui Wu <hsinhuiwu@gmail.com> wrote:
>
> > > With a machine with 3 TB (more than 2 TB memory). If you use vmalloc to
> > > allocate > 2 TB memory, the array_size below will be overflowed.
> >
> > How was this observed?
> >
> > Is there any know userspace operation which causes the kernel to try to
> > vmalloc such a large hunk of memory?
> >
> > [Frank] The Dell PowerEdge R740/R940 can have up to  3TB/6TB memory.
> > installed. Our application requires reserve consecutive memory in the
> kernel
> > space and protected from userspace programs.
>
> Did this require custom kernel changes?  If not, precisely which system
> calls were used to cause this allocation attempt?
>
> >
>
> ----------------------------------------------------------------------------------------------------
> > OK, thanks.  Against current mainline your proposed change would look
> > like this, yes?
> >
> > [Frank] Yes. This will support up to less than 16 TB. If you want to
> support
> >  more than 16 TB, we need to expand nr_pages to unsigned long as
> > Matthew pointed out.
> >
> > Will it be possible to add this to kernel 3.10.0-957.27.2.el7.x86_64?
>
> That is up to Red Hat to decide.
>
>

Comment 6 Andrew Morton 2020-11-05 00:38:24 UTC

On Wed, 4 Nov 2020 15:53:36 -0500 Hsin-Hui Wu <hsinhuiwu@gmail.com> wrote:

> Did this require custom kernel changes?  If not, precisely which system
> calls were used to cause this allocation attempt?
> 
> [Frank] No. vmalloc().
> 

vmalloc() is not a system call.

Please fully describe how vmalloc() came to be called with such a large
request size.  Fully.  Are you able to provide example userspace code?

Comment 7 hsinhuiwu 2020-11-05 03:52:53 UTC

Created attachment 293503 [details]
attachment-3513-0.html

We don't use userspace program to allocate memory. We use kernel module to
do so.
Here is the sample kernel program you can use to test on a machine that has
more than 2 TB memory.

Use "insmod memtest.ko mb=2100000" to allocate more than 2 TB memory.
Use "rmmod memtest" to free the memory.

memtest.c
=============================================================
#include <linux/version.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#ifdef MODVERSIONS
#include <linux/modversions.h>
#endif
#include <asm/io.h>

static char *mem_ptr;
MODULE_LICENSE("GPL");
int mb = 0;
MODULE_PARM_DESC(mb, "alloc MB memory");
module_param(mb, int, S_IWUSR | S_IRUGO);

static int __init memtest_init(void) {
    mem_ptr = NULL;
    if (mb > 0) {
        unsigned long alloc_len;
        alloc_len = mb * 1024 * 1024;

        mem_ptr = vmalloc(alloc_len);

        if (mem_ptr) {
            printk("memtest: vmalloc for size %u MB.\n", mb);
        } else {
            printk("memtest: failed to vmalloc for bytes %u MB.\n", mb);
        }
    } else {
        printk("memtest len_MB\n");
    }
    return(0);
}

static void __exit memtest_exit(void) {
    if (mem_ptr) {
        vfree(mem_ptr);
        printk("memtest: vfree for size %u MB.\n", mb);
        mem_ptr = NULL;
    }
}

module_init(memtest_init);
module_exit(memtest_exit);

On Wed, Nov 4, 2020 at 7:38 PM Andrew Morton <akpm@linux-foundation.org>
wrote:

> On Wed, 4 Nov 2020 15:53:36 -0500 Hsin-Hui Wu <hsinhuiwu@gmail.com> wrote:
>
> > Did this require custom kernel changes?  If not, precisely which system
> > calls were used to cause this allocation attempt?
> >
> > [Frank] No. vmalloc().
> >
>
> vmalloc() is not a system call.
>
> Please fully describe how vmalloc() came to be called with such a large
> request size.  Fully.  Are you able to provide example userspace code?
>
>

Comment 8 hsinhuiwu 2021-01-06 15:43:37 UTC

To correct the overflow problem in my test program. Here is the update. I should use unsigned long long for alloc_len to avoid overflow. l

static int __init memtest_init(void) {
    mem_ptr = NULL;  // init memory pointer
    if (mb > 0) {
        unsigned long long alloc_len;
        alloc_len = (unsigned long long)mb * 1024 * 1024;

        mem_ptr = vmalloc(alloc_len);

        if (mem_ptr) {
            printk("memtest: vmalloc for size %u MB.\n", mb);
        } else {
            printk("memtest: failed to vmalloc for bytes %u MB.\n", mb);
        }
    } else {
        printk("memtest len_MB\n");
    }
    return(0);
}

Comment 9 Rafae Aquini 2021-06-23 17:42:06 UTC

Andrew, 

Just a quick note: 
I think we can close this one now, as it is addressed by:

commit 34fe653716b0d340bc26dd4823d2dbe00c57f849
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Mon Dec 14 19:08:43 2020 -0800

    mm/vmalloc.c:__vmalloc_area_node(): avoid 32-bit overflow