Bug 13690

Summary: nodes_clear cause hugepage unusable on non-NUMA machine
Product: Platform Specific/Hardware Reporter: Alex Shi (alex.shi)
Component: i386Assignee: platform_i386
Status: CLOSED CODE_FIX    
Severity: high CC: rjw, yinghai
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.31-rc1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 13615    
Attachments: i386 kernel config file on stoakley
dmesg after apply yinghai's patch.

Description Alex Shi 2009-07-02 01:22:23 UTC
73d60b7f747176dbdff826c4127d22e1fd3f9f74  commit introduced a nodes_clear function for NUMA machine. But seems the commit omits non-NUMA machine.
 If find_zone_movable_pfns_for_nodes/early_calculate_totalpages has no
 chance to run. nodes_clear will block HUPEPAGE using in my specjbb2005
 testing on my Stoakely(i386/x86_64), waybridge(i386), IBM T61(i386)
 
+       /*
+        * find_zone_movable_pfns_for_nodes/early_calculate_totalpages init
+        * that node_mask, clear it at first
+        */
+       nodes_clear(node_states[N_HIGH_MEMORY]);
Comment 1 Alex Shi 2009-07-02 01:27:50 UTC
My specjbb2005 test setting: 
1, Using 2GB hugepage memory setting. 
2, JAVA_OPTION= -Xmx2g -Xms2g -Xns1g -XXaggressive -Xlargepages -XXlazyUnlocking -Xgc:genpar -XXtlasize:min=16k,preferred=64k  -Djava.awt.headless=true
3, JVM is jrockit-R27.3.1-jre1.5.0_11 

System memory on my Stoakley is 3GB.
Comment 2 Alex Shi 2009-07-02 01:30:48 UTC
lease check if following patch fixed your problem

[PATCH] x86: only clear node_states for 64bit

Nathan reported that
| commit 73d60b7f747176dbdff826c4127d22e1fd3f9f74
| Author: Yinghai Lu <yinghai@kernel.org>
| Date:   Tue Jun 16 15:33:00 2009 -0700
|
|    page-allocator: clear N_HIGH_MEMORY map before we set it again
|    
|    SRAT tables may contains nodes of very small size.  The arch code may
|    decide to not activate such a node.  However, currently the early boot
|    code sets N_HIGH_MEMORY for such nodes.  These nodes therefore seem to be
|    active although these nodes have no present pages.
|    
|    For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too

broke the cpuset.mems cgroup attribute on an i386 kvm guest

fix it by only clearing node_states[N_NORMAL_MEMORY] for 64bit only.
and need to do save/restore for that in find_zone_movable_pfn

Reported-by: Nathan Lynch <ntl@pobox.com>
Tested-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/mm/init_64.c |    2 ++
 mm/page_alloc.c       |   13 +++++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -598,6 +598,8 @@ void __init paging_init(void)
 
        sparse_memory_present_with_active_regions(MAX_NUMNODES);
        sparse_init();
+       /* clear the default setting with node 0 */
+       nodes_clear(node_states[N_NORMAL_MEMORY]);
        free_area_init_nodes(max_zone_pfns);
 }
 
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -4037,6 +4037,8 @@ static void __init find_zone_movable_pfn
        int i, nid;
        unsigned long usable_startpfn;
        unsigned long kernelcore_node, kernelcore_remaining;
+       /* save the state before borrow the nodemask */
+       nodemask_t saved_node_state = node_states[N_HIGH_MEMORY];
        unsigned long totalpages = early_calculate_totalpages();
        int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
 
@@ -4064,7 +4066,7 @@ static void __init find_zone_movable_pfn
 
        /* If kernelcore was not specified, there is no ZONE_MOVABLE */
        if (!required_kernelcore)
-               return;
+               goto out;
 
        /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
        find_usable_zone_for_movable();
@@ -4163,6 +4165,10 @@ restart:
        for (nid = 0; nid < MAX_NUMNODES; nid++)
                zone_movable_pfn[nid] =
                        roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
+
+out:
+       /* restore the node_state */
+       node_states[N_HIGH_MEMORY] = saved_node_state;
 }
 
 /* Any regular memory on that node ? */
@@ -4247,11 +4253,6 @@ void __init free_area_init_nodes(unsigne
                                                early_node_map[i].start_pfn,
                                                early_node_map[i].end_pfn);
 
-       /*
-        * find_zone_movable_pfns_for_nodes/early_calculate_totalpages init
-        * that node_mask, clear it at first
-        */
-       nodes_clear(node_states[N_HIGH_MEMORY]);
        /* Initialise every node */
        mminit_verify_pageflags_layout();
        setup_nr_node_ids();
Comment 3 Alex Shi 2009-07-02 01:34:36 UTC
Above patch come from Yinghai Lu, with it the specjbb2005 still can not run with initial java options.
Comment 4 Andrew Morton 2009-07-02 01:34:57 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 2 Jul 2009 01:22:24 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=13690
> 
>            Summary: nodes_clear cause hugepage unusable on non-NUMA
>                     machine
>            Product: Platform Specific/Hardware
>            Version: 2.5
>     Kernel Version: 2.6.31-rc1
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: i386
>         AssignedTo: platform_i386@kernel-bugs.osdl.org
>         ReportedBy: alex.shi@intel.com
>                 CC: yinghai@kernel.org
>         Regression: Yes
> 
> 
> 73d60b7f747176dbdff826c4127d22e1fd3f9f74  commit introduced a nodes_clear
> function for NUMA machine. But seems the commit omits non-NUMA machine.
>  If find_zone_movable_pfns_for_nodes/early_calculate_totalpages has no
>  chance to run. nodes_clear will block HUPEPAGE using in my specjbb2005
>  testing on my Stoakely(i386/x86_64), waybridge(i386), IBM T61(i386)
> 
> +       /*
> +        * find_zone_movable_pfns_for_nodes/early_calculate_totalpages init
> +        * that node_mask, clear it at first
> +        */
> +       nodes_clear(node_states[N_HIGH_MEMORY]);

Thanks.

fyi, with recently-occurring bugs and regressions of this nature, it is (I
think) best to deal with them via email rather than bugzilla.  Bugzilla is
better-suited to longer-lived bugs where we have a need to track them,
generate statistics, etc.
Comment 5 Alex Shi 2009-07-02 01:39:13 UTC
Created attachment 22173 [details]
i386 kernel config file on stoakley
Comment 6 Alex Shi 2009-07-02 02:06:06 UTC
Created attachment 22174 [details]
dmesg after apply yinghai's patch. 

As Andrew's suggestion, this bug will tracking via e-mail and it will be closed after this problem resolving.
Comment 7 Yinghai Lu 2009-07-02 02:14:35 UTC
that looks strange...

config is 32bit. 

the second patch only do save and restore. and should be right right.

please check following patch on today's linus tree. and send out /proc/iomem

Thanks

Yinghai

[PATCH] x86: add boundary check for 32bit res before expand e820 resource to alignment

fix hang with HIGHMEM_64G and 32bit resource.
according to hpa and Linus, use (resource_size_t)-1 to fend off big ranges.

analyized by hpa

Reported-and-tested-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/include/asm/proto.h |    3 ---
 arch/x86/kernel/e820.c       |   20 ++++++++++++--------
 include/linux/kernel.h       |    5 +++++
 3 files changed, 17 insertions(+), 11 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -1367,9 +1367,9 @@ void __init e820_reserve_resources(void)
 }
 
 /* How much should we pad RAM ending depending on where it is? */
-static unsigned long ram_alignment(resource_size_t pos)
+static u64 ram_alignment(u64 pos)
 {
-	unsigned long mb = pos >> 20;
+	u64 mb = pos >> 20;
 
 	/* To 64kB in the first megabyte */
 	if (!mb)
@@ -1383,6 +1383,8 @@ static unsigned long ram_alignment(resou
 	return 32*1024*1024;
 }
 
+#define MAX_RESOURCE_SIZE ((resource_size_t)-1)
+
 void __init e820_reserve_resources_late(void)
 {
 	int i;
@@ -1400,17 +1402,19 @@ void __init e820_reserve_resources_late(
 	 * avoid stolen RAM:
 	 */
 	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *entry = &e820_saved.map[i];
-		resource_size_t start, end;
+		struct e820entry *entry = &e820.map[i];
+		u64 start, end;
 
 		if (entry->type != E820_RAM)
 			continue;
 		start = entry->addr + entry->size;
-		end = round_up(start, ram_alignment(start));
-		if (start == end)
+		end = round_up(start, ram_alignment(start)) - 1;
+		if (end > MAX_RESOURCE_SIZE)
+			end = MAX_RESOURCE_SIZE;
+		if (start > end)
 			continue;
-		reserve_region_with_split(&iomem_resource, start,
-						  end - 1, "RAM buffer");
+		reserve_region_with_split(&iomem_resource, start, end,
+					  "RAM buffer");
 	}
 }
Comment 8 Yinghai Lu 2009-07-02 08:50:51 UTC
Alex Shi wrote:
> The new patch works for my stoakley i386 machine. But for x86_64 machine
> the specjbb2005 still can not run with hugepage. The specjbb2005 use the
> same java setting as i386 system. After apply your patch, the iomem of
> x86_64 is:

please check

[PATCH] x86: don't clear nodes_states[N_NORMAL_MEMORY] when numa is not compiled in

Alex found:
for x86_64 machine the specjbb2005 still can not run with hugepage

only happens when numa is not compiled in

the root cause: node_set_state will not set it back for us in that case

so don't clear that when numa is not select in config

Reported-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/mm/init_64.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -598,8 +598,14 @@ void __init paging_init(void)
 
 	sparse_memory_present_with_active_regions(MAX_NUMNODES);
 	sparse_init();
-	/* clear the default setting with node 0 */
+#if MAX_NUMNODES > 1
+	/*
+	 * clear the default setting with node 0
+	 * note: don't clear it, node_set_state will do nothing
+	 *	 (aka set it back) when numa support is not compiled in
+	 */
 	nodes_clear(node_states[N_NORMAL_MEMORY]);
+#endif
 	free_area_init_nodes(max_zone_pfns);
 }
Comment 9 Anonymous Emailer 2009-07-02 14:11:51 UTC
Reply-To: cl@linux-foundation.org

On Thu, 2 Jul 2009, Yinghai Lu wrote:

> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -598,8 +598,14 @@ void __init paging_init(void)
>
>       sparse_memory_present_with_active_regions(MAX_NUMNODES);
>       sparse_init();
> -     /* clear the default setting with node 0 */
> +#if MAX_NUMNODES > 1
> +     /*
> +      * clear the default setting with node 0
> +      * note: don't clear it, node_set_state will do nothing
> +      *       (aka set it back) when numa support is not compiled in
> +      */
>       nodes_clear(node_states[N_NORMAL_MEMORY]);

The problem was that nodes_clear() does not fall back to a noop on !NUMA.
The node_set/clear_states() operations do become noops.

Could we make it more consistent by using only operations of the same
type? F.e. Add a node_clearall_states() in include/linux/nodemask.h that
falls back to a noop on !NUMA like the node_*_states operation?

Another options is to restore node_states[N_NORMAL_MEMORY] to its
initial condition. See the definition of node_states in page_alloc.c.
Comment 10 Alex Shi 2009-07-02 14:14:11 UTC
Yes, the patch fixes this bug! 

Alex 

>-----Original Message-----
>From: Yinghai Lu [mailto:yinghai@kernel.org]
>Sent: 2009年7月2日 16:51
>To: Shi, Alex; Andrew Morton; Ingo Molnar
>Cc: bugzilla-daemon@bugzilla.kernel.org; bugme-daemon@bugzilla.kernel.org;
>Christoph Lameter; Mel Gorman; linux-kernel@vger.kernel.org; Zhang, Yanmin;
>Chen, Tim C
>Subject: Re: [Bugme-new] [Bug 13690] New: nodes_clear cause hugepage unusable
>on non-NUMA machine
>
>Alex Shi wrote:
>> The new patch works for my stoakley i386 machine. But for x86_64 machine
>> the specjbb2005 still can not run with hugepage. The specjbb2005 use the
>> same java setting as i386 system. After apply your patch, the iomem of
>> x86_64 is:
>
>please check
>
>[PATCH] x86: don't clear nodes_states[N_NORMAL_MEMORY] when numa is not
>compiled in
>
>Alex found:
>for x86_64 machine the specjbb2005 still can not run with hugepage
>
>only happens when numa is not compiled in
>
>the root cause: node_set_state will not set it back for us in that case
>
>so don't clear that when numa is not select in config
>
>Reported-by: Alex Shi <alex.shi@intel.com>
>Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>
>---
> arch/x86/mm/init_64.c |    8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
>Index: linux-2.6/arch/x86/mm/init_64.c
>===================================================================
>--- linux-2.6.orig/arch/x86/mm/init_64.c
>+++ linux-2.6/arch/x86/mm/init_64.c
>@@ -598,8 +598,14 @@ void __init paging_init(void)
>
>       sparse_memory_present_with_active_regions(MAX_NUMNODES);
>       sparse_init();
>-      /* clear the default setting with node 0 */
>+#if MAX_NUMNODES > 1
>+      /*
>+       * clear the default setting with node 0
>+       * note: don't clear it, node_set_state will do nothing
>+       *       (aka set it back) when numa support is not compiled in
>+       */
>       nodes_clear(node_states[N_NORMAL_MEMORY]);
>+#endif
>       free_area_init_nodes(max_zone_pfns);
> }
>
Comment 11 Yinghai Lu 2009-07-03 15:39:54 UTC
Christoph Lameter wrote:
> On Thu, 2 Jul 2009, Yinghai Lu wrote:
> 
>> Index: linux-2.6/arch/x86/mm/init_64.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init_64.c
>> +++ linux-2.6/arch/x86/mm/init_64.c
>> @@ -598,8 +598,14 @@ void __init paging_init(void)
>>
>>      sparse_memory_present_with_active_regions(MAX_NUMNODES);
>>      sparse_init();
>> -    /* clear the default setting with node 0 */
>> +#if MAX_NUMNODES > 1
>> +    /*
>> +     * clear the default setting with node 0
>> +     * note: don't clear it, node_set_state will do nothing
>> +     *       (aka set it back) when numa support is not compiled in
>> +     */
>>      nodes_clear(node_states[N_NORMAL_MEMORY]);
> 
> The problem was that nodes_clear() does not fall back to a noop on !NUMA.
> The node_set/clear_states() operations do become noops.
> 
> Could we make it more consistent by using only operations of the same
> type? F.e. Add a node_clearall_states() in include/linux/nodemask.h that
> falls back to a noop on !NUMA like the node_*_states operation?
> 
> Another options is to restore node_states[N_NORMAL_MEMORY] to its
> initial condition. See the definition of node_states in page_alloc.c.

could use node_clear_state(0, N_NORMAL_MEMORY) instead. because default one only have node 0 set in that mask.

YH
Comment 12 Alex Shi 2009-07-06 03:08:50 UTC
Yinghai:

The 31-rc2 kernel still can not use hugepage on non-NUMA machine. And
this patch did not appear on rc2 kernel. Are there some concern about
this? 

BRG
Alex 


On Thu, 2009-07-02 at 16:50 +0800, Yinghai Lu wrote:
> Alex Shi wrote:
> > The new patch works for my stoakley i386 machine. But for x86_64 machine
> > the specjbb2005 still can not run with hugepage. The specjbb2005 use the
> > same java setting as i386 system. After apply your patch, the iomem of
> > x86_64 is:
> 
> please check
> 
> [PATCH] x86: don't clear nodes_states[N_NORMAL_MEMORY] when numa is not
> compiled in
> 
> Alex found:
> for x86_64 machine the specjbb2005 still can not run with hugepage
> 
> only happens when numa is not compiled in
> 
> the root cause: node_set_state will not set it back for us in that case
> 
> so don't clear that when numa is not select in config
> 
> Reported-by: Alex Shi <alex.shi@intel.com>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> 
> ---
>  arch/x86/mm/init_64.c |    8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -598,8 +598,14 @@ void __init paging_init(void)
>  
>       sparse_memory_present_with_active_regions(MAX_NUMNODES);
>       sparse_init();
> -     /* clear the default setting with node 0 */
> +#if MAX_NUMNODES > 1
> +     /*
> +      * clear the default setting with node 0
> +      * note: don't clear it, node_set_state will do nothing
> +      *       (aka set it back) when numa support is not compiled in
> +      */
>       nodes_clear(node_states[N_NORMAL_MEMORY]);
> +#endif
>       free_area_init_nodes(max_zone_pfns);
>  }
>
Comment 13 Yinghai Lu 2009-07-07 00:07:56 UTC
Alex Shi wrote:
> Yinghai:
> 
> The 31-rc2 kernel still can not use hugepage on non-NUMA machine. And
> this patch did not appear on rc2 kernel. Are there some concern about
> this? 
> 

can you check
http://lkml.org/lkml/2009/7/2/326

YH
Comment 14 Alex Shi 2009-07-07 07:18:33 UTC
On Tue, 2009-07-07 at 08:07 +0800, Yinghai Lu wrote:
> Alex Shi wrote:
> > Yinghai:
> > 
> > The 31-rc2 kernel still can not use hugepage on non-NUMA machine. And
> > this patch did not appear on rc2 kernel. Are there some concern about
> > this? 
> > 
> 
> can you check
> http://lkml.org/lkml/2009/7/2/326
> 
> YH

It works on my Stoakley i386 and x86_64 with latest Linus' kernel tree. 



Alex
Comment 15 Rafael J. Wysocki 2009-07-07 10:42:11 UTC
Handled-By : Yinghai Lu <yinghai@kernel.org>
Patch : http://patchwork.kernel.org/patch/33744/
Comment 16 Rafael J. Wysocki 2009-07-26 23:03:24 UTC
Fixed by commit 44b572809581d5a10dbe35aa6bf689f32b9c5ad6 .