Bug 13690
Summary: | nodes_clear cause hugepage unusable on non-NUMA machine | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Alex Shi (alex.shi) |
Component: | i386 | Assignee: | platform_i386 |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | rjw, yinghai |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.31-rc1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 13615 | ||
Attachments: |
i386 kernel config file on stoakley
dmesg after apply yinghai's patch. |
Description
Alex Shi
2009-07-02 01:22:23 UTC
My specjbb2005 test setting: 1, Using 2GB hugepage memory setting. 2, JAVA_OPTION= -Xmx2g -Xms2g -Xns1g -XXaggressive -Xlargepages -XXlazyUnlocking -Xgc:genpar -XXtlasize:min=16k,preferred=64k -Djava.awt.headless=true 3, JVM is jrockit-R27.3.1-jre1.5.0_11 System memory on my Stoakley is 3GB. lease check if following patch fixed your problem [PATCH] x86: only clear node_states for 64bit Nathan reported that | commit 73d60b7f747176dbdff826c4127d22e1fd3f9f74 | Author: Yinghai Lu <yinghai@kernel.org> | Date: Tue Jun 16 15:33:00 2009 -0700 | | page-allocator: clear N_HIGH_MEMORY map before we set it again | | SRAT tables may contains nodes of very small size. The arch code may | decide to not activate such a node. However, currently the early boot | code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be | active although these nodes have no present pages. | | For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too broke the cpuset.mems cgroup attribute on an i386 kvm guest fix it by only clearing node_states[N_NORMAL_MEMORY] for 64bit only. and need to do save/restore for that in find_zone_movable_pfn Reported-by: Nathan Lynch <ntl@pobox.com> Tested-by: Nathan Lynch <ntl@pobox.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> --- arch/x86/mm/init_64.c | 2 ++ mm/page_alloc.c | 13 +++++++------ 2 files changed, 9 insertions(+), 6 deletions(-) Index: linux-2.6/arch/x86/mm/init_64.c =================================================================== --- linux-2.6.orig/arch/x86/mm/init_64.c +++ linux-2.6/arch/x86/mm/init_64.c @@ -598,6 +598,8 @@ void __init paging_init(void) sparse_memory_present_with_active_regions(MAX_NUMNODES); sparse_init(); + /* clear the default setting with node 0 */ + nodes_clear(node_states[N_NORMAL_MEMORY]); free_area_init_nodes(max_zone_pfns); } Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -4037,6 +4037,8 @@ static void __init find_zone_movable_pfn int i, nid; unsigned long usable_startpfn; unsigned long kernelcore_node, kernelcore_remaining; + /* save the state before borrow the nodemask */ + nodemask_t saved_node_state = node_states[N_HIGH_MEMORY]; unsigned long totalpages = early_calculate_totalpages(); int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]); @@ -4064,7 +4066,7 @@ static void __init find_zone_movable_pfn /* If kernelcore was not specified, there is no ZONE_MOVABLE */ if (!required_kernelcore) - return; + goto out; /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */ find_usable_zone_for_movable(); @@ -4163,6 +4165,10 @@ restart: for (nid = 0; nid < MAX_NUMNODES; nid++) zone_movable_pfn[nid] = roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES); + +out: + /* restore the node_state */ + node_states[N_HIGH_MEMORY] = saved_node_state; } /* Any regular memory on that node ? */ @@ -4247,11 +4253,6 @@ void __init free_area_init_nodes(unsigne early_node_map[i].start_pfn, early_node_map[i].end_pfn); - /* - * find_zone_movable_pfns_for_nodes/early_calculate_totalpages init - * that node_mask, clear it at first - */ - nodes_clear(node_states[N_HIGH_MEMORY]); /* Initialise every node */ mminit_verify_pageflags_layout(); setup_nr_node_ids(); Above patch come from Yinghai Lu, with it the specjbb2005 still can not run with initial java options. (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Thu, 2 Jul 2009 01:22:24 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=13690 > > Summary: nodes_clear cause hugepage unusable on non-NUMA > machine > Product: Platform Specific/Hardware > Version: 2.5 > Kernel Version: 2.6.31-rc1 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: high > Priority: P1 > Component: i386 > AssignedTo: platform_i386@kernel-bugs.osdl.org > ReportedBy: alex.shi@intel.com > CC: yinghai@kernel.org > Regression: Yes > > > 73d60b7f747176dbdff826c4127d22e1fd3f9f74 commit introduced a nodes_clear > function for NUMA machine. But seems the commit omits non-NUMA machine. > If find_zone_movable_pfns_for_nodes/early_calculate_totalpages has no > chance to run. nodes_clear will block HUPEPAGE using in my specjbb2005 > testing on my Stoakely(i386/x86_64), waybridge(i386), IBM T61(i386) > > + /* > + * find_zone_movable_pfns_for_nodes/early_calculate_totalpages init > + * that node_mask, clear it at first > + */ > + nodes_clear(node_states[N_HIGH_MEMORY]); Thanks. fyi, with recently-occurring bugs and regressions of this nature, it is (I think) best to deal with them via email rather than bugzilla. Bugzilla is better-suited to longer-lived bugs where we have a need to track them, generate statistics, etc. Created attachment 22173 [details]
i386 kernel config file on stoakley
Created attachment 22174 [details]
dmesg after apply yinghai's patch.
As Andrew's suggestion, this bug will tracking via e-mail and it will be closed after this problem resolving.
that looks strange... config is 32bit. the second patch only do save and restore. and should be right right. please check following patch on today's linus tree. and send out /proc/iomem Thanks Yinghai [PATCH] x86: add boundary check for 32bit res before expand e820 resource to alignment fix hang with HIGHMEM_64G and 32bit resource. according to hpa and Linus, use (resource_size_t)-1 to fend off big ranges. analyized by hpa Reported-and-tested-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: Yinghai Lu <yinghai@kernel.org> --- arch/x86/include/asm/proto.h | 3 --- arch/x86/kernel/e820.c | 20 ++++++++++++-------- include/linux/kernel.h | 5 +++++ 3 files changed, 17 insertions(+), 11 deletions(-) Index: linux-2.6/arch/x86/kernel/e820.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/e820.c +++ linux-2.6/arch/x86/kernel/e820.c @@ -1367,9 +1367,9 @@ void __init e820_reserve_resources(void) } /* How much should we pad RAM ending depending on where it is? */ -static unsigned long ram_alignment(resource_size_t pos) +static u64 ram_alignment(u64 pos) { - unsigned long mb = pos >> 20; + u64 mb = pos >> 20; /* To 64kB in the first megabyte */ if (!mb) @@ -1383,6 +1383,8 @@ static unsigned long ram_alignment(resou return 32*1024*1024; } +#define MAX_RESOURCE_SIZE ((resource_size_t)-1) + void __init e820_reserve_resources_late(void) { int i; @@ -1400,17 +1402,19 @@ void __init e820_reserve_resources_late( * avoid stolen RAM: */ for (i = 0; i < e820.nr_map; i++) { - struct e820entry *entry = &e820_saved.map[i]; - resource_size_t start, end; + struct e820entry *entry = &e820.map[i]; + u64 start, end; if (entry->type != E820_RAM) continue; start = entry->addr + entry->size; - end = round_up(start, ram_alignment(start)); - if (start == end) + end = round_up(start, ram_alignment(start)) - 1; + if (end > MAX_RESOURCE_SIZE) + end = MAX_RESOURCE_SIZE; + if (start > end) continue; - reserve_region_with_split(&iomem_resource, start, - end - 1, "RAM buffer"); + reserve_region_with_split(&iomem_resource, start, end, + "RAM buffer"); } } Alex Shi wrote: > The new patch works for my stoakley i386 machine. But for x86_64 machine > the specjbb2005 still can not run with hugepage. The specjbb2005 use the > same java setting as i386 system. After apply your patch, the iomem of > x86_64 is: please check [PATCH] x86: don't clear nodes_states[N_NORMAL_MEMORY] when numa is not compiled in Alex found: for x86_64 machine the specjbb2005 still can not run with hugepage only happens when numa is not compiled in the root cause: node_set_state will not set it back for us in that case so don't clear that when numa is not select in config Reported-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> --- arch/x86/mm/init_64.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) Index: linux-2.6/arch/x86/mm/init_64.c =================================================================== --- linux-2.6.orig/arch/x86/mm/init_64.c +++ linux-2.6/arch/x86/mm/init_64.c @@ -598,8 +598,14 @@ void __init paging_init(void) sparse_memory_present_with_active_regions(MAX_NUMNODES); sparse_init(); - /* clear the default setting with node 0 */ +#if MAX_NUMNODES > 1 + /* + * clear the default setting with node 0 + * note: don't clear it, node_set_state will do nothing + * (aka set it back) when numa support is not compiled in + */ nodes_clear(node_states[N_NORMAL_MEMORY]); +#endif free_area_init_nodes(max_zone_pfns); } Reply-To: cl@linux-foundation.org On Thu, 2 Jul 2009, Yinghai Lu wrote: > Index: linux-2.6/arch/x86/mm/init_64.c > =================================================================== > --- linux-2.6.orig/arch/x86/mm/init_64.c > +++ linux-2.6/arch/x86/mm/init_64.c > @@ -598,8 +598,14 @@ void __init paging_init(void) > > sparse_memory_present_with_active_regions(MAX_NUMNODES); > sparse_init(); > - /* clear the default setting with node 0 */ > +#if MAX_NUMNODES > 1 > + /* > + * clear the default setting with node 0 > + * note: don't clear it, node_set_state will do nothing > + * (aka set it back) when numa support is not compiled in > + */ > nodes_clear(node_states[N_NORMAL_MEMORY]); The problem was that nodes_clear() does not fall back to a noop on !NUMA. The node_set/clear_states() operations do become noops. Could we make it more consistent by using only operations of the same type? F.e. Add a node_clearall_states() in include/linux/nodemask.h that falls back to a noop on !NUMA like the node_*_states operation? Another options is to restore node_states[N_NORMAL_MEMORY] to its initial condition. See the definition of node_states in page_alloc.c. Yes, the patch fixes this bug! Alex >-----Original Message----- >From: Yinghai Lu [mailto:yinghai@kernel.org] >Sent: 2009年7月2日 16:51 >To: Shi, Alex; Andrew Morton; Ingo Molnar >Cc: bugzilla-daemon@bugzilla.kernel.org; bugme-daemon@bugzilla.kernel.org; >Christoph Lameter; Mel Gorman; linux-kernel@vger.kernel.org; Zhang, Yanmin; >Chen, Tim C >Subject: Re: [Bugme-new] [Bug 13690] New: nodes_clear cause hugepage unusable >on non-NUMA machine > >Alex Shi wrote: >> The new patch works for my stoakley i386 machine. But for x86_64 machine >> the specjbb2005 still can not run with hugepage. The specjbb2005 use the >> same java setting as i386 system. After apply your patch, the iomem of >> x86_64 is: > >please check > >[PATCH] x86: don't clear nodes_states[N_NORMAL_MEMORY] when numa is not >compiled in > >Alex found: >for x86_64 machine the specjbb2005 still can not run with hugepage > >only happens when numa is not compiled in > >the root cause: node_set_state will not set it back for us in that case > >so don't clear that when numa is not select in config > >Reported-by: Alex Shi <alex.shi@intel.com> >Signed-off-by: Yinghai Lu <yinghai@kernel.org> > >--- > arch/x86/mm/init_64.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > >Index: linux-2.6/arch/x86/mm/init_64.c >=================================================================== >--- linux-2.6.orig/arch/x86/mm/init_64.c >+++ linux-2.6/arch/x86/mm/init_64.c >@@ -598,8 +598,14 @@ void __init paging_init(void) > > sparse_memory_present_with_active_regions(MAX_NUMNODES); > sparse_init(); >- /* clear the default setting with node 0 */ >+#if MAX_NUMNODES > 1 >+ /* >+ * clear the default setting with node 0 >+ * note: don't clear it, node_set_state will do nothing >+ * (aka set it back) when numa support is not compiled in >+ */ > nodes_clear(node_states[N_NORMAL_MEMORY]); >+#endif > free_area_init_nodes(max_zone_pfns); > } > Christoph Lameter wrote:
> On Thu, 2 Jul 2009, Yinghai Lu wrote:
>
>> Index: linux-2.6/arch/x86/mm/init_64.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init_64.c
>> +++ linux-2.6/arch/x86/mm/init_64.c
>> @@ -598,8 +598,14 @@ void __init paging_init(void)
>>
>> sparse_memory_present_with_active_regions(MAX_NUMNODES);
>> sparse_init();
>> - /* clear the default setting with node 0 */
>> +#if MAX_NUMNODES > 1
>> + /*
>> + * clear the default setting with node 0
>> + * note: don't clear it, node_set_state will do nothing
>> + * (aka set it back) when numa support is not compiled in
>> + */
>> nodes_clear(node_states[N_NORMAL_MEMORY]);
>
> The problem was that nodes_clear() does not fall back to a noop on !NUMA.
> The node_set/clear_states() operations do become noops.
>
> Could we make it more consistent by using only operations of the same
> type? F.e. Add a node_clearall_states() in include/linux/nodemask.h that
> falls back to a noop on !NUMA like the node_*_states operation?
>
> Another options is to restore node_states[N_NORMAL_MEMORY] to its
> initial condition. See the definition of node_states in page_alloc.c.
could use node_clear_state(0, N_NORMAL_MEMORY) instead. because default one only have node 0 set in that mask.
YH
Yinghai:
The 31-rc2 kernel still can not use hugepage on non-NUMA machine. And
this patch did not appear on rc2 kernel. Are there some concern about
this?
BRG
Alex
On Thu, 2009-07-02 at 16:50 +0800, Yinghai Lu wrote:
> Alex Shi wrote:
> > The new patch works for my stoakley i386 machine. But for x86_64 machine
> > the specjbb2005 still can not run with hugepage. The specjbb2005 use the
> > same java setting as i386 system. After apply your patch, the iomem of
> > x86_64 is:
>
> please check
>
> [PATCH] x86: don't clear nodes_states[N_NORMAL_MEMORY] when numa is not
> compiled in
>
> Alex found:
> for x86_64 machine the specjbb2005 still can not run with hugepage
>
> only happens when numa is not compiled in
>
> the root cause: node_set_state will not set it back for us in that case
>
> so don't clear that when numa is not select in config
>
> Reported-by: Alex Shi <alex.shi@intel.com>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>
> ---
> arch/x86/mm/init_64.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -598,8 +598,14 @@ void __init paging_init(void)
>
> sparse_memory_present_with_active_regions(MAX_NUMNODES);
> sparse_init();
> - /* clear the default setting with node 0 */
> +#if MAX_NUMNODES > 1
> + /*
> + * clear the default setting with node 0
> + * note: don't clear it, node_set_state will do nothing
> + * (aka set it back) when numa support is not compiled in
> + */
> nodes_clear(node_states[N_NORMAL_MEMORY]);
> +#endif
> free_area_init_nodes(max_zone_pfns);
> }
>
Alex Shi wrote: > Yinghai: > > The 31-rc2 kernel still can not use hugepage on non-NUMA machine. And > this patch did not appear on rc2 kernel. Are there some concern about > this? > can you check http://lkml.org/lkml/2009/7/2/326 YH On Tue, 2009-07-07 at 08:07 +0800, Yinghai Lu wrote:
> Alex Shi wrote:
> > Yinghai:
> >
> > The 31-rc2 kernel still can not use hugepage on non-NUMA machine. And
> > this patch did not appear on rc2 kernel. Are there some concern about
> > this?
> >
>
> can you check
> http://lkml.org/lkml/2009/7/2/326
>
> YH
It works on my Stoakley i386 and x86_64 with latest Linus' kernel tree.
Alex
Handled-By : Yinghai Lu <yinghai@kernel.org> Patch : http://patchwork.kernel.org/patch/33744/ Fixed by commit 44b572809581d5a10dbe35aa6bf689f32b9c5ad6 . |