Bug 36192 - Kernel panic when booting the 2.6.39+ kernel based off of the 2.6.32 kernel
Summary: Kernel panic when booting the 2.6.39+ kernel based off of the 2.6.32 kernel
Status: CLOSED CODE_FIX
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Page Allocator
Hardware: All Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-05-30 02:38 UTC by qcui
Modified: 2012-06-13 15:06 UTC
CC List: 2 users

See Also:
Kernel Version: 2.6.39+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
kernel panic console output (15.51 KB, application/octet-stream)
2011-05-30 02:38 UTC, qcui

Description qcui 2011-05-30 02:38:30 UTC
Created attachment 60012 [details]
kernel panic console output

When I updated the kernel from 2.6.32 to 2.6.39+ on a server with an AMD Magny-Cours CPU, the server could not boot the 2.6.39+ kernel successfully. The console output showed 'Kernel panic - not syncing: Attempted to kill the idle task!' I tried setting the kernel parameter idle=poll in the grub file, but it still failed to boot with the same error. The same kernel boots successfully on a server with an Intel CPU. The full console output is attached.

Steps to reproduce:
1. Install the 2.6.32 kernel
2. Compile and install the 2.6.39+ kernel
3. Reboot
Comment 1 Andrew Morton 2011-05-30 06:20:43 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Mon, 30 May 2011 02:38:33 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=36192
> 
>            Summary: Kernel panic when boot the 2.6.39+ kernel based off of
>                     2.6.32 kernel
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 2.6.39+
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Page Allocator
>         AssignedTo: akpm@linux-foundation.org
>         ReportedBy: qcui@redhat.com
>         Regression: Yes
> 
> 
> Created an attachment (id=60012)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=60012)
> kernel panic console output
> 
> When I updated the kernel from 2.6.32 to 2.6.39+ on a server with AMD
> Magny-Cours CPU, the server can not boot the 2.6.39+ kernel successfully. The
> console ouput showed 'Kernel panic - not syncing: Attempted to kill the idle
> task!' I have tried to set the kernel parameter idle=poll in the grub file.
> It
> still failed to reboot due to the same error. But it can reboot successfully
> on
> the server with Intel CPU. The full console output is attached.
> 
> Steps to reproduce:
> 1. install the 2.6.32 kernel
> 2. compile and install the kernel 2.6.39+
> 3. reboot
> 

hm, this is not good.  Might be memcg-related?

> BUG: unable to handle kernel paging request at 0000000000001c08
> IP: [<ffffffff811076cc>] __alloc_pages_nodemask+0x7c/0x1f0
> PGD 0 
> Oops: 0000 [#1] SMP 
> last sysfs file: 
> CPU 0 
> Modules linked in:
> 
> Pid: 0, comm: swapper Not tainted 2.6.39+ #1 AMD DRACHMA/DRACHMA
> RIP: 0010:[<ffffffff811076cc>]  [<ffffffff811076cc>]
> __alloc_pages_nodemask+0x7c/0x1f0
> RSP: 0000:ffffffff81a01e48  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000008 RDI: 00000000000002d0
> RBP: ffffffff81a01ea8 R08: ffffffff81c03680 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000001 R12: 00000000000002d0
> R13: 0000000000001c00 R14: ffffffff81a01fa8 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffff880437800000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000001c08 CR3: 0000000001a03000 CR4: 00000000000006b0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a0b020)
> Stack:
>  0000000000000000 0000000000000000 ffffffff81a01eb8 000002d000000008
>  ffffffff00000020 ffffffff81a01ec8 ffffffff81a01e88 0000000000000008
>  0000000000100000 0000000000000000 ffffffff81a01fa8 0000000000093cf0
> Call Trace:
>  [<ffffffff81107d7f>] alloc_pages_exact_nid+0x5f/0xc0
>  [<ffffffff814b2dea>] alloc_page_cgroup+0x2a/0x80
>  [<ffffffff814b2ece>] init_section_page_cgroup+0x8e/0x110
>  [<ffffffff81c4a2f1>] page_cgroup_init+0x6e/0xa7
>  [<ffffffff81c22de4>] start_kernel+0x2ae/0x366
>  [<ffffffff81c22346>] x86_64_start_reservations+0x131/0x135
>  [<ffffffff81c2244d>] x86_64_start_kernel+0x103/0x112
> Code: e0 08 83 f8 01 44 89 e0 19 db c1 e8 13 f7 d3 83 e0 01 83 e3 02 09 c3 8b
> 05 22 e5 af 00 44 21 e0 a8 10 89 45 bc 0f 85 c4 00 00 00 
>  83 7d 08 00 0f 84 dd 00 00 00 65 4c 8b 34 25 c0 cc 00 00 41 
> RIP  [<ffffffff811076cc>] __alloc_pages_nodemask+0x7c/0x1f0
>  RSP <ffffffff81a01e48>
> CR2: 0000000000001c08
Comment 2 Minchan Kim 2011-05-30 07:12:39 UTC
On Mon, May 30, 2011 at 4:01 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Sun, 29 May 2011 23:19:48 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
>>
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Mon, 30 May 2011 02:38:33 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
>>
>> > https://bugzilla.kernel.org/show_bug.cgi?id=36192
>> >
>> >            Summary: Kernel panic when boot the 2.6.39+ kernel based off of
>> >                     2.6.32 kernel
>> >            Product: Memory Management
>> >            Version: 2.5
>> >     Kernel Version: 2.6.39+
>> >           Platform: All
>> >         OS/Version: Linux
>> >               Tree: Mainline
>> >             Status: NEW
>> >           Severity: normal
>> >           Priority: P1
>> >          Component: Page Allocator
>> >         AssignedTo: akpm@linux-foundation.org
>> >         ReportedBy: qcui@redhat.com
>> >         Regression: Yes
>> >
>> >
>> > Created an attachment (id=60012)
>> >  --> (https://bugzilla.kernel.org/attachment.cgi?id=60012)
>> > kernel panic console output
>> >
>> > When I updated the kernel from 2.6.32 to 2.6.39+ on a server with AMD
>> > Magny-Cours CPU, the server can not boot the 2.6.39+ kernel successfully.
>> The
>> > console ouput showed 'Kernel panic - not syncing: Attempted to kill the
>> idle
>> > task!' I have tried to set the kernel parameter idle=poll in the grub
>> file. It
>> > still failed to reboot due to the same error. But it can reboot
>> successfully on
>> > the server with Intel CPU. The full console output is attached.
>> >
>> > Steps to reproduce:
>> > 1. install the 2.6.32 kernel
>> > 2. compile and install the kernel 2.6.39+
>> > 3. reboot
>> >
>>
>> hm, this is not good.  Might be memcg-related?
>>
>
> yes, and the system may be able to boot with a boot option of
> cgroup_disable=memory.
> but the problem happens in __alloc_pages_nodemask with NULL pointer access.
> Hmm, doesn't this imply some error in building zone/pgdat ?

I have tracked down this issue.
http://marc.info/?l=linux-mm&m=130616558019604&w=2

Qiannan, could you test with the patches mentioned at the above URL reverted?

>
> Thanks,
> -Kame
>
>
>> > BUG: unable to handle kernel paging request at 0000000000001c08
>> > IP: [<ffffffff811076cc>] __alloc_pages_nodemask+0x7c/0x1f0
>> > PGD 0
>> > Oops: 0000 [#1] SMP
>> > last sysfs file:
>> > CPU 0
>> > Modules linked in:
>> >
>> > Pid: 0, comm: swapper Not tainted 2.6.39+ #1 AMD DRACHMA/DRACHMA
>> > RIP: 0010:[<ffffffff811076cc>]  [<ffffffff811076cc>]
>> __alloc_pages_nodemask+0x7c/0x1f0
>> > RSP: 0000:ffffffff81a01e48  EFLAGS: 00010246
>> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>> > RDX: 0000000000000000 RSI: 0000000000000008 RDI: 00000000000002d0
>> > RBP: ffffffff81a01ea8 R08: ffffffff81c03680 R09: 0000000000000000
>> > R10: 0000000000000001 R11: 0000000000000001 R12: 00000000000002d0
>> > R13: 0000000000001c00 R14: ffffffff81a01fa8 R15: 0000000000000000
>> > FS:  0000000000000000(0000) GS:ffff880437800000(0000)
>> knlGS:0000000000000000
>> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> > CR2: 0000000000001c08 CR3: 0000000001a03000 CR4: 00000000000006b0
>> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > Process swapper (pid: 0, threadinfo ffffffff81a00000, task
>> ffffffff81a0b020)
>> > Stack:
>> >  0000000000000000 0000000000000000 ffffffff81a01eb8 000002d000000008
>> >  ffffffff00000020 ffffffff81a01ec8 ffffffff81a01e88 0000000000000008
>> >  0000000000100000 0000000000000000 ffffffff81a01fa8 0000000000093cf0
>> > Call Trace:
>> >  [<ffffffff81107d7f>] alloc_pages_exact_nid+0x5f/0xc0
>> >  [<ffffffff814b2dea>] alloc_page_cgroup+0x2a/0x80
>> >  [<ffffffff814b2ece>] init_section_page_cgroup+0x8e/0x110
>> >  [<ffffffff81c4a2f1>] page_cgroup_init+0x6e/0xa7
>> >  [<ffffffff81c22de4>] start_kernel+0x2ae/0x366
>> >  [<ffffffff81c22346>] x86_64_start_reservations+0x131/0x135
>> >  [<ffffffff81c2244d>] x86_64_start_kernel+0x103/0x112
>> > Code: e0 08 83 f8 01 44 89 e0 19 db c1 e8 13 f7 d3 83 e0 01 83 e3 02 09 c3
>> 8b 05 22 e5 af 00 44 21 e0 a8 10 89 45 bc 0f 85 c4 00 00 00
>> >  83 7d 08 00 0f 84 dd 00 00 00 65 4c 8b 34 25 c0 cc 00 00 41
>> > RIP  [<ffffffff811076cc>] __alloc_pages_nodemask+0x7c/0x1f0
>> >  RSP <ffffffff81a01e48>
>> > CR2: 0000000000001c08
>>
>>
>
Comment 3 KAMEZAWA Hiroyuki 2011-05-30 07:44:18 UTC
On Sun, 29 May 2011 23:19:48 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Mon, 30 May 2011 02:38:33 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=36192
> > 
> >            Summary: Kernel panic when boot the 2.6.39+ kernel based off of
> >                     2.6.32 kernel
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 2.6.39+
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Page Allocator
> >         AssignedTo: akpm@linux-foundation.org
> >         ReportedBy: qcui@redhat.com
> >         Regression: Yes
> > 
> > 
> > Created an attachment (id=60012)
> >  --> (https://bugzilla.kernel.org/attachment.cgi?id=60012)
> > kernel panic console output
> > 
> > When I updated the kernel from 2.6.32 to 2.6.39+ on a server with AMD
> > Magny-Cours CPU, the server can not boot the 2.6.39+ kernel successfully.
> The
> > console ouput showed 'Kernel panic - not syncing: Attempted to kill the
> idle
> > task!' I have tried to set the kernel parameter idle=poll in the grub file.
> It
> > still failed to reboot due to the same error. But it can reboot
> successfully on
> > the server with Intel CPU. The full console output is attached.
> > 
> > Steps to reproduce:
> > 1. install the 2.6.32 kernel
> > 2. compile and install the kernel 2.6.39+
> > 3. reboot
> > 
> 
> hm, this is not good.  Might be memcg-related?
> 

Yes, and the system may be able to boot with the boot option cgroup_disable=memory.
But the problem happens in __alloc_pages_nodemask() with a NULL pointer access.
Hmm, doesn't this imply some error in building the zone/pgdat?

Thanks,
-Kame


> > BUG: unable to handle kernel paging request at 0000000000001c08
> > IP: [<ffffffff811076cc>] __alloc_pages_nodemask+0x7c/0x1f0
> > PGD 0 
> > Oops: 0000 [#1] SMP 
> > last sysfs file: 
> > CPU 0 
> > Modules linked in:
> > 
> > Pid: 0, comm: swapper Not tainted 2.6.39+ #1 AMD DRACHMA/DRACHMA
> > RIP: 0010:[<ffffffff811076cc>]  [<ffffffff811076cc>]
> __alloc_pages_nodemask+0x7c/0x1f0
> > RSP: 0000:ffffffff81a01e48  EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > RDX: 0000000000000000 RSI: 0000000000000008 RDI: 00000000000002d0
> > RBP: ffffffff81a01ea8 R08: ffffffff81c03680 R09: 0000000000000000
> > R10: 0000000000000001 R11: 0000000000000001 R12: 00000000000002d0
> > R13: 0000000000001c00 R14: ffffffff81a01fa8 R15: 0000000000000000
> > FS:  0000000000000000(0000) GS:ffff880437800000(0000)
> knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000000001c08 CR3: 0000000001a03000 CR4: 00000000000006b0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process swapper (pid: 0, threadinfo ffffffff81a00000, task
> ffffffff81a0b020)
> > Stack:
> >  0000000000000000 0000000000000000 ffffffff81a01eb8 000002d000000008
> >  ffffffff00000020 ffffffff81a01ec8 ffffffff81a01e88 0000000000000008
> >  0000000000100000 0000000000000000 ffffffff81a01fa8 0000000000093cf0
> > Call Trace:
> >  [<ffffffff81107d7f>] alloc_pages_exact_nid+0x5f/0xc0
> >  [<ffffffff814b2dea>] alloc_page_cgroup+0x2a/0x80
> >  [<ffffffff814b2ece>] init_section_page_cgroup+0x8e/0x110
> >  [<ffffffff81c4a2f1>] page_cgroup_init+0x6e/0xa7
> >  [<ffffffff81c22de4>] start_kernel+0x2ae/0x366
> >  [<ffffffff81c22346>] x86_64_start_reservations+0x131/0x135
> >  [<ffffffff81c2244d>] x86_64_start_kernel+0x103/0x112
> > Code: e0 08 83 f8 01 44 89 e0 19 db c1 e8 13 f7 d3 83 e0 01 83 e3 02 09 c3
> 8b 05 22 e5 af 00 44 21 e0 a8 10 89 45 bc 0f 85 c4 00 00 00 
> >  83 7d 08 00 0f 84 dd 00 00 00 65 4c 8b 34 25 c0 cc 00 00 41 
> > RIP  [<ffffffff811076cc>] __alloc_pages_nodemask+0x7c/0x1f0
> >  RSP <ffffffff81a01e48>
> > CR2: 0000000000001c08
> 
>
Comment 4 KAMEZAWA Hiroyuki 2011-05-30 08:02:54 UTC
On Mon, 30 May 2011 16:29:04 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 30 May 2011 16:01:14 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> 
> 
> I want to see .config and 2.6.32's boot log (dmesg) and 2.6.39+'s boot log
> if possible.
> 
Ah, sorry. The log was in bugzilla.
==
SRAT: Node 1 PXM 1 0-a0000
SRAT: Node 1 PXM 1 100000-c8000000
SRAT: Node 1 PXM 1 100000000-438000000
SRAT: Node 3 PXM 3 438000000-838000000
SRAT: Node 5 PXM 5 838000000-c38000000
SRAT: Node 7 PXM 7 c38000000-1038000000

Initmem setup node 1 0000000000000000-0000000438000000
  NODE_DATA [0000000437fd9000 - 0000000437ffffff]
Initmem setup node 3 0000000438000000-0000000838000000
  NODE_DATA [0000000837fd9000 - 0000000837ffffff]
Initmem setup node 5 0000000838000000-0000000c38000000
  NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
Initmem setup node 7 0000000c38000000-0000001038000000
  NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
[ffffea000ec40000-ffffea000edfffff] potential offnode page_structs
[ffffea001cc40000-ffffea001cdfffff] potential offnode page_structs
[ffffea002ac40000-ffffea002adfffff] potential offnode page_structs
==

Hmm... there are four nodes (1, 3, 5, 7), but no memory on node 0?

Could you show 2.6.32's log?

Thanks,
-Kame
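
[Editor's note: to make the failure mode concrete, here is a minimal userspace sketch, NOT the kernel's actual code. When boot-time code derives node id 0 from an uninitialized page->flags, an allocation directed at node 0 dereferences a pg_data_t that was never set up. The names only mirror kernel concepts; the node layout follows the SRAT log above.]
==
/*
 * Minimal userspace sketch (not kernel code): why an allocation aimed
 * at a node that has no pg_data_t crashes. Only nodes 1, 3, 5 and 7
 * get a pg_data_t here, mirroring the SRAT log above.
 */
#include <stdio.h>
#include <stdlib.h>

#define MAX_NUMNODES 8

struct pg_data { int node_id; };

static struct pg_data *node_data[MAX_NUMNODES];	/* NODE_DATA(nid) analogue */

static void *alloc_on_node(int nid, size_t size)
{
	struct pg_data *pgdat = node_data[nid];

	/*
	 * The real __alloc_pages_nodemask() reads per-node data at a small
	 * offset from the pgdat pointer; with pgdat == NULL that matches
	 * the reported oops address 0x1c08 (small offset from a NULL base).
	 */
	if (!pgdat) {
		fprintf(stderr, "oops: NODE_DATA(%d) is NULL\n", nid);
		abort();
	}
	return malloc(size);
}

int main(void)
{
	for (int nid = 1; nid < MAX_NUMNODES; nid += 2)
		node_data[nid] = calloc(1, sizeof(struct pg_data));

	alloc_on_node(1, 64);	/* fine: node 1 exists */
	alloc_on_node(0, 64);	/* uninitialized page->flags said "node 0" */
	return 0;
}
==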
Comment 5 KAMEZAWA Hiroyuki 2011-05-30 08:17:38 UTC
On Mon, 30 May 2011 16:01:14 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Sun, 29 May 2011 23:19:48 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > 
> > (switched to email.  Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> > 
> > On Mon, 30 May 2011 02:38:33 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
> > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=36192
> > > 
> > >            Summary: Kernel panic when boot the 2.6.39+ kernel based off
> of
> > >                     2.6.32 kernel
> > >            Product: Memory Management
> > >            Version: 2.5
> > >     Kernel Version: 2.6.39+
> > >           Platform: All
> > >         OS/Version: Linux
> > >               Tree: Mainline
> > >             Status: NEW
> > >           Severity: normal
> > >           Priority: P1
> > >          Component: Page Allocator
> > >         AssignedTo: akpm@linux-foundation.org
> > >         ReportedBy: qcui@redhat.com
> > >         Regression: Yes
> > > 
> > > 
> > > Created an attachment (id=60012)
> > >  --> (https://bugzilla.kernel.org/attachment.cgi?id=60012)
> > > kernel panic console output
> > > 
> > > When I updated the kernel from 2.6.32 to 2.6.39+ on a server with AMD
> > > Magny-Cours CPU, the server can not boot the 2.6.39+ kernel successfully.
> The
> > > console ouput showed 'Kernel panic - not syncing: Attempted to kill the
> idle
> > > task!' I have tried to set the kernel parameter idle=poll in the grub
> file. It
> > > still failed to reboot due to the same error. But it can reboot
> successfully on
> > > the server with Intel CPU. The full console output is attached.
> > > 
> > > Steps to reproduce:
> > > 1. install the 2.6.32 kernel
> > > 2. compile and install the kernel 2.6.39+
> > > 3. reboot
> > > 
> > 
> > hm, this is not good.  Might be memcg-related?
> > 
> 
> yes, and the system may be able to boot with a boot option of
> cgroup_disable=memory.
> but the problem happens in __alloc_pages_nodemask with NULL pointer access.
> Hmm, doesn't this imply some error in building zone/pgdat ?
> 

I want to see .config and 2.6.32's boot log (dmesg) and 2.6.39+'s boot log
if possible.

Thanks,
-Kame
Comment 6 KAMEZAWA Hiroyuki 2011-05-30 08:59:21 UTC
On Mon, 30 May 2011 16:54:53 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 30 May 2011 16:29:04 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> SRAT: Node 1 PXM 1 0-a0000
> SRAT: Node 1 PXM 1 100000-c8000000
> SRAT: Node 1 PXM 1 100000000-438000000
> SRAT: Node 3 PXM 3 438000000-838000000
> SRAT: Node 5 PXM 5 838000000-c38000000
> SRAT: Node 7 PXM 7 c38000000-1038000000
> 
> Initmem setup node 1 0000000000000000-0000000438000000
>   NODE_DATA [0000000437fd9000 - 0000000437ffffff]
> Initmem setup node 3 0000000438000000-0000000838000000
>   NODE_DATA [0000000837fd9000 - 0000000837ffffff]
> Initmem setup node 5 0000000838000000-0000000c38000000
>   NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
> Initmem setup node 7 0000000c38000000-0000001038000000
>   NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
> [ffffea000ec40000-ffffea000edfffff] potential offnode page_structs
> [ffffea001cc40000-ffffea001cdfffff] potential offnode page_structs
> [ffffea002ac40000-ffffea002adfffff] potential offnode page_structs
> ==
> 
> Hmm..there are four nodes 1,3,5,7 but....no memory on node 0 hmm ?
> 

I think I found the reason, and this is a possible fix, but it needs to be tested.
Suggestions for a better fix than this band-aid are appreciated.

==
From b95edcf43619312f72895476c3e6ef46079bb05f Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Mon, 30 May 2011 16:49:59 +0900
Subject: [PATCH][BUGFIX] fallbacks at page_cgroup allocation.

Under SPARSEMEM, the memmap (struct page array) is allocated per section.
pfn_valid() is then "true" for the whole section and page structs exist,
but this is unrelated to the node's valid [start_pfn, end_pfn) range,
and some page structs may not be initialized correctly because they are
not valid pages.
(memmap_init_zone() skips pages that are not covered by early_node_map[],
 so their page->flags stays initialized to 0.)

In this case, page->flags can be '0'. Assume a case where
node 0 has no memory....

page_cgroup is allocated onto the node

   - page_to_nid(head of section pfn)

The head pfn will be valid (a struct page exists), but its page->flags is 0
and thus contains node_id 0. This causes an allocation onto NODE_DATA(0),
which causes the panic.

This patch makes page_cgroup use alloc_pages_exact_nid() only when the NID
is in N_NORMAL_MEMORY, falling back to alloc_pages_exact() otherwise.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page_cgroup.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 74ccff6..3b0c8f2 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -134,7 +134,10 @@ static void *__meminit alloc_page_cgroup(size_t size, int nid)
 {
 	void *addr = NULL;
 
-	addr = alloc_pages_exact_nid(nid, size, GFP_KERNEL | __GFP_NOWARN);
+	if (node_state(nid, N_NORMAL_MEMORY))
+		addr = alloc_pages_exact_nid(nid, size, GFP_KERNEL | __GFP_NOWARN);
+	if (!addr)
+		addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN);
 	if (addr)
 		return addr;
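
[Editor's note: for context on why a zeroed page->flags reads back as node 0: under SPARSEMEM the node id is encoded in the upper bits of page->flags, so an all-zero flags word decodes to node 0. Below is a simplified sketch of that decoding; the 3-bit field width and its position are assumptions for illustration, since the real kernel computes them at build time.]
==
/* Sketch of page_to_nid() decoding; field width/position are assumed. */
#include <assert.h>

#define NODES_SHIFT	3			/* assumed: up to 8 nodes */
#define NODES_PGSHIFT	(64 - NODES_SHIFT)	/* node id in the top bits */
#define NODES_MASK	((1UL << NODES_SHIFT) - 1)

struct page { unsigned long flags; };

static int page_to_nid(const struct page *page)
{
	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}

int main(void)
{
	struct page initialized   = { .flags = 5UL << NODES_PGSHIFT };
	struct page uninitialized = { .flags = 0 };	/* skipped by memmap_init_zone() */

	assert(page_to_nid(&initialized) == 5);
	/* An uninitialized memmap entry silently reports node 0: */
	assert(page_to_nid(&uninitialized) == 0);
	return 0;
}
==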
Comment 7 qcui 2011-05-31 07:58:34 UTC
(In reply to comment #6)
> On Mon, 30 May 2011 16:54:53 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Mon, 30 May 2011 16:29:04 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > SRAT: Node 1 PXM 1 0-a0000
> > SRAT: Node 1 PXM 1 100000-c8000000
> > SRAT: Node 1 PXM 1 100000000-438000000
> > SRAT: Node 3 PXM 3 438000000-838000000
> > SRAT: Node 5 PXM 5 838000000-c38000000
> > SRAT: Node 7 PXM 7 c38000000-1038000000
> > 
> > Initmem setup node 1 0000000000000000-0000000438000000
> >   NODE_DATA [0000000437fd9000 - 0000000437ffffff]
> > Initmem setup node 3 0000000438000000-0000000838000000
> >   NODE_DATA [0000000837fd9000 - 0000000837ffffff]
> > Initmem setup node 5 0000000838000000-0000000c38000000
> >   NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
> > Initmem setup node 7 0000000c38000000-0000001038000000
> >   NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
> > [ffffea000ec40000-ffffea000edfffff] potential offnode page_structs
> > [ffffea001cc40000-ffffea001cdfffff] potential offnode page_structs
> > [ffffea002ac40000-ffffea002adfffff] potential offnode page_structs
> > ==
> > 
> > Hmm..there are four nodes 1,3,5,7 but....no memory on node 0 hmm ?
> > 
> 
> I think I found a reason and this is a possible fix. But need to be tested.
> And suggestion for better fix rather than this band-aid is appreciated.
> 
> ==
> From b95edcf43619312f72895476c3e6ef46079bb05f Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Mon, 30 May 2011 16:49:59 +0900
> Subject: [PATCH][BUGFIX] fallbacks at page_cgroup allocation.
> 
> Under SPARSEMEM, the page_struct is allocated per section.
> Then, pfn_valid() for the whole section is "true" and there are page
> structs. But, it's not related to valid range of [start_pfn, end_pfn)
> and some page structs may not be initialized collectly because
> it's not a valid pages.
> (memmap_init_zone() skips a page which is not correct in
>  early_node_map[] and page->flags is initialized to be 0.)
> 
> In this case, a page->flags can be '0'. Assume a case where
> node 0 has no memory....
> 
> page_cgroup is allocated onto the node
> 
>    - page_to_nid(head of section pfn)
> 
> Head's pfn will be valid (struct page exists) but page->flags is 0 and
> contains
> node_id:0. This causes allocation onto NODE_DATA(0) and cause panic.
> 
> This patch makes page_cgroup to use alloc_pages_exact() only
> when NID is N_NORMAL_MEMORY.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/page_cgroup.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 74ccff6..3b0c8f2 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -134,7 +134,10 @@ static void *__meminit alloc_page_cgroup(size_t size,
> int
> nid)
>  {
>      void *addr = NULL;
> 
> -    addr = alloc_pages_exact_nid(nid, size, GFP_KERNEL | __GFP_NOWARN);
> +    if (node_state(nid, N_NORMAL_MEMORY))
> +        addr = alloc_pages_exact_nid(nid, size, GFP_KERNEL | __GFP_NOWARN);
> +    if (!addr)
> +        addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN);
>      if (addr)
>          return addr;

I tested this patch with the 2.6.39-rc7 kernel and the latest 3.0.0-rc1 kernel, and it worked well.
Comment 8 Johannes Weiner 2011-06-06 14:21:55 UTC
Cc Mel for memory model

On Mon, May 30, 2011 at 05:51:40PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 30 May 2011 16:54:53 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Mon, 30 May 2011 16:29:04 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > SRAT: Node 1 PXM 1 0-a0000
> > SRAT: Node 1 PXM 1 100000-c8000000
> > SRAT: Node 1 PXM 1 100000000-438000000
> > SRAT: Node 3 PXM 3 438000000-838000000
> > SRAT: Node 5 PXM 5 838000000-c38000000
> > SRAT: Node 7 PXM 7 c38000000-1038000000
> > 
> > Initmem setup node 1 0000000000000000-0000000438000000
> >   NODE_DATA [0000000437fd9000 - 0000000437ffffff]
> > Initmem setup node 3 0000000438000000-0000000838000000
> >   NODE_DATA [0000000837fd9000 - 0000000837ffffff]
> > Initmem setup node 5 0000000838000000-0000000c38000000
> >   NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
> > Initmem setup node 7 0000000c38000000-0000001038000000
> >   NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
> > [ffffea000ec40000-ffffea000edfffff] potential offnode page_structs
> > [ffffea001cc40000-ffffea001cdfffff] potential offnode page_structs
> > [ffffea002ac40000-ffffea002adfffff] potential offnode page_structs
> > ==
> > 
> > Hmm..there are four nodes 1,3,5,7 but....no memory on node 0 hmm ?
> > 
> 
> I think I found a reason and this is a possible fix. But need to be tested.
> And suggestion for better fix rather than this band-aid is appreciated.
> 
> ==
> >From b95edcf43619312f72895476c3e6ef46079bb05f Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Mon, 30 May 2011 16:49:59 +0900
> Subject: [PATCH][BUGFIX] fallbacks at page_cgroup allocation.
> 
> Under SPARSEMEM, the page_struct is allocated per section.
> Then, pfn_valid() for the whole section is "true" and there are page
> structs. But, it's not related to valid range of [start_pfn, end_pfn)
> and some page structs may not be initialized collectly because
> it's not a valid pages.
> (memmap_init_zone() skips a page which is not correct in
>  early_node_map[] and page->flags is initialized to be 0.)
> 
> In this case, a page->flags can be '0'. Assume a case where
> node 0 has no memory....
> 
> page_cgroup is allocated onto the node
> 
>    - page_to_nid(head of section pfn)
> 
> Head's pfn will be valid (struct page exists) but page->flags is 0 and
> contains
> node_id:0. This causes allocation onto NODE_DATA(0) and cause panic.
> 
> This patch makes page_cgroup to use alloc_pages_exact() only
> when NID is N_NORMAL_MEMORY.

I don't like this much as it essentially will allocate the array from
a (semantically) random node, as long as it has memory.

IMO, the problem is either 1) looking at PFNs outside known node
ranges, or 2) having present/valid sections partially outside of node
ranges.  I am leaning towards 2), so I am wondering about the
following fix:

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] sparse: only mark sections present when fully covered by memory

When valid memory ranges are to be registered with sparsemem, make
sure that only fully covered sections are marked as present.

Otherwise we end up with PFN ranges that are reported present and
valid but are actually backed by uninitialized mem map.

The page_cgroup allocator relies on pfn_present() being reliable for
all PFNs between 0 and max_pfn, and then retrieves the node id stored
in the corresponding page->flags to allocate the per-section
page_cgroup arrays on the local node.

This led to at least one crash in the page allocator on a system where
the uninitialized page struct returned the id for node 0, which had no
memory itself.

Reported-by: qcui@redhat.com
Debugged-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Not-Yet-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---

diff --git a/mm/sparse.c b/mm/sparse.c
index aa64b12..a4fbeb8 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -182,7 +182,9 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
 {
 	unsigned long pfn;
 
-	start &= PAGE_SECTION_MASK;
+	start = ALIGN(start, PAGES_PER_SECTION);
+	end &= PAGE_SECTION_MASK;
+
 	mminit_validate_memmodel_limits(&start, &end);
 	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
 		unsigned long section = pfn_to_section_nr(pfn);

---
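
[Editor's note: to illustrate what the two alignment changes in memory_present() above do: rounding the start PFN up and masking the end PFN down drops the partially covered boundary sections. A small sketch follows; the section size and example PFNs are illustrative assumptions, not values from this report.]
==
/* Sketch of the boundary handling; section size and PFNs are made up. */
#include <stdio.h>

#define PAGES_PER_SECTION	(1UL << 15)	/* assumed section size */
#define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION - 1))
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	/* A range that only partially covers its first and last section: */
	unsigned long start = 0x00a0, end = 0x17f00;

	/* Old behavior: start is masked down, so boundary sections are
	 * marked present even though parts of their memmap are never
	 * initialized. */
	printf("old: sections [%lu, %lu)\n",
	       (start & PAGE_SECTION_MASK) / PAGES_PER_SECTION,
	       ALIGN(end, PAGES_PER_SECTION) / PAGES_PER_SECTION);

	/* Patched behavior: only the fully covered middle section remains. */
	printf("new: sections [%lu, %lu)\n",
	       ALIGN(start, PAGES_PER_SECTION) / PAGES_PER_SECTION,
	       (end & PAGE_SECTION_MASK) / PAGES_PER_SECTION);
	return 0;
}
==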

> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/page_cgroup.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 74ccff6..3b0c8f2 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -134,7 +134,10 @@ static void *__meminit alloc_page_cgroup(size_t size,
> int nid)
>  {
>       void *addr = NULL;
>  
> -     addr = alloc_pages_exact_nid(nid, size, GFP_KERNEL | __GFP_NOWARN);
> +     if (node_state(nid, N_NORMAL_MEMORY))
> +             addr = alloc_pages_exact_nid(nid, size, GFP_KERNEL |
> __GFP_NOWARN);
> +     if (!addr)
> +             addr = alloc_pages_exact(size, GFP_KERNEL | __GFP_NOWARN);
>       if (addr)
>               return addr;
>
Comment 9 Andrew Morton 2011-06-06 21:45:59 UTC
On Mon, 6 Jun 2011 14:54:21 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Cc Mel for memory model
> 
> On Mon, May 30, 2011 at 05:51:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 30 May 2011 16:54:53 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Mon, 30 May 2011 16:29:04 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > SRAT: Node 1 PXM 1 0-a0000
> > > SRAT: Node 1 PXM 1 100000-c8000000
> > > SRAT: Node 1 PXM 1 100000000-438000000
> > > SRAT: Node 3 PXM 3 438000000-838000000
> > > SRAT: Node 5 PXM 5 838000000-c38000000
> > > SRAT: Node 7 PXM 7 c38000000-1038000000
> > > 
> > > Initmem setup node 1 0000000000000000-0000000438000000
> > >   NODE_DATA [0000000437fd9000 - 0000000437ffffff]
> > > Initmem setup node 3 0000000438000000-0000000838000000
> > >   NODE_DATA [0000000837fd9000 - 0000000837ffffff]
> > > Initmem setup node 5 0000000838000000-0000000c38000000
> > >   NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
> > > Initmem setup node 7 0000000c38000000-0000001038000000
> > >   NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
> > > [ffffea000ec40000-ffffea000edfffff] potential offnode page_structs
> > > [ffffea001cc40000-ffffea001cdfffff] potential offnode page_structs
> > > [ffffea002ac40000-ffffea002adfffff] potential offnode page_structs
> > > ==
> > > 
> > > Hmm..there are four nodes 1,3,5,7 but....no memory on node 0 hmm ?
> > > 
> > 
> > I think I found a reason and this is a possible fix. But need to be tested.
> > And suggestion for better fix rather than this band-aid is appreciated.
> > 
> > ==
> > >From b95edcf43619312f72895476c3e6ef46079bb05f Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Mon, 30 May 2011 16:49:59 +0900
> > Subject: [PATCH][BUGFIX] fallbacks at page_cgroup allocation.
> > 
> > Under SPARSEMEM, the page_struct is allocated per section.
> > Then, pfn_valid() for the whole section is "true" and there are page
> > structs. But, it's not related to valid range of [start_pfn, end_pfn)
> > and some page structs may not be initialized collectly because
> > it's not a valid pages.
> > (memmap_init_zone() skips a page which is not correct in
> >  early_node_map[] and page->flags is initialized to be 0.)
> > 
> > In this case, a page->flags can be '0'. Assume a case where
> > node 0 has no memory....
> > 
> > page_cgroup is allocated onto the node
> > 
> >    - page_to_nid(head of section pfn)
> > 
> > Head's pfn will be valid (struct page exists) but page->flags is 0 and
> contains
> > node_id:0. This causes allocation onto NODE_DATA(0) and cause panic.
> > 
> > This patch makes page_cgroup to use alloc_pages_exact() only
> > when NID is N_NORMAL_MEMORY.

fyi, the reporter has gone in via the bugzilla UI and says he has
tested the patch and it worked well.

Please don't do that!  See this?

: (switched to email.  Please respond via emailed reply-to-all, not via the
: bugzilla web interface).

So we have a tested-by if we use this patch.

> I don't like this much as it essentially will allocate the array from
> a (semantically) random node, as long as it has memory.
> 
> IMO, the problem is either 1) looking at PFNs outside known node
> ranges, or 2) having present/valid sections partially outside of node
> ranges.  I am leaning towards 2), so I am wondering about the
> following fix:
> 
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] sparse: only mark sections present when fully covered by
> memory
> 
> When valid memory ranges are to be registered with sparsemem, make
> sure that only fully covered sections are marked as present.
> 
> Otherwise we end up with PFN ranges that are reported present and
> valid but are actually backed by uninitialized mem map.
> 
> The page_cgroup allocator relies on pfn_present() being reliable for
> all PFNs between 0 and max_pfn, then retrieve the node id stored in
> the corresponding page->flags to allocate the per-section page_cgroup
> arrays on the local node.
> 
> This lead to at least one crash in the page allocator on a system
> where the uninitialized page struct returned the id for node 0, which
> had no memory itself.
> 
> Reported-by: qcui@redhat.com
> Debugged-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Not-Yet-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aa64b12..a4fbeb8 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -182,7 +182,9 @@ void __init memory_present(int nid, unsigned long start,
> unsigned long end)
>  {
>       unsigned long pfn;
>  
> -     start &= PAGE_SECTION_MASK;
> +     start = ALIGN(start, PAGES_PER_SECTION);
> +     end &= PAGE_SECTION_MASK;
> +
>       mminit_validate_memmodel_limits(&start, &end);
>       for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
>               unsigned long section = pfn_to_section_nr(pfn);
> 

Hopefully he can test this one for us as well, thanks.
Comment 10 KAMEZAWA Hiroyuki 2011-06-07 00:45:54 UTC
On Mon, 6 Jun 2011 14:45:19 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 6 Jun 2011 14:54:21 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Cc Mel for memory model
> > 
> > On Mon, May 30, 2011 at 05:51:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 30 May 2011 16:54:53 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Mon, 30 May 2011 16:29:04 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > SRAT: Node 1 PXM 1 0-a0000
> > > > SRAT: Node 1 PXM 1 100000-c8000000
> > > > SRAT: Node 1 PXM 1 100000000-438000000
> > > > SRAT: Node 3 PXM 3 438000000-838000000
> > > > SRAT: Node 5 PXM 5 838000000-c38000000
> > > > SRAT: Node 7 PXM 7 c38000000-1038000000
> > > > 
> > > > Initmem setup node 1 0000000000000000-0000000438000000
> > > >   NODE_DATA [0000000437fd9000 - 0000000437ffffff]
> > > > Initmem setup node 3 0000000438000000-0000000838000000
> > > >   NODE_DATA [0000000837fd9000 - 0000000837ffffff]
> > > > Initmem setup node 5 0000000838000000-0000000c38000000
> > > >   NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
> > > > Initmem setup node 7 0000000c38000000-0000001038000000
> > > >   NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
> > > > [ffffea000ec40000-ffffea000edfffff] potential offnode page_structs
> > > > [ffffea001cc40000-ffffea001cdfffff] potential offnode page_structs
> > > > [ffffea002ac40000-ffffea002adfffff] potential offnode page_structs
> > > > ==
> > > > 
> > > > Hmm..there are four nodes 1,3,5,7 but....no memory on node 0 hmm ?
> > > > 
> > > 
> > > I think I found a reason and this is a possible fix. But need to be
> tested.
> > > And suggestion for better fix rather than this band-aid is appreciated.
> > > 
> > > ==
> > > >From b95edcf43619312f72895476c3e6ef46079bb05f Mon Sep 17 00:00:00 2001
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Date: Mon, 30 May 2011 16:49:59 +0900
> > > Subject: [PATCH][BUGFIX] fallbacks at page_cgroup allocation.
> > > 
> > > Under SPARSEMEM, the page_struct is allocated per section.
> > > Then, pfn_valid() for the whole section is "true" and there are page
> > > structs. But, it's not related to valid range of [start_pfn, end_pfn)
> > > and some page structs may not be initialized collectly because
> > > it's not a valid pages.
> > > (memmap_init_zone() skips a page which is not correct in
> > >  early_node_map[] and page->flags is initialized to be 0.)
> > > 
> > > In this case, a page->flags can be '0'. Assume a case where
> > > node 0 has no memory....
> > > 
> > > page_cgroup is allocated onto the node
> > > 
> > >    - page_to_nid(head of section pfn)
> > > 
> > > Head's pfn will be valid (struct page exists) but page->flags is 0 and
> contains
> > > node_id:0. This causes allocation onto NODE_DATA(0) and cause panic.
> > > 
> > > This patch makes page_cgroup to use alloc_pages_exact() only
> > > when NID is N_NORMAL_MEMORY.
> 
> fyi, the reporter has gone in via the bugzilla UI and says he has
> tested the patch and it worked well.
> 
> Please don't do that!  See this?
> 
> : (switched to email.  Please respond via emailed reply-to-all, not via the
> : bugzilla web interface).
> 
> So we have a tested-by if we use this patch.
> 
> > I don't like this much as it essentially will allocate the array from
> > a (semantically) random node, as long as it has memory.
> > 
> > IMO, the problem is either 1) looking at PFNs outside known node
> > ranges, or 2) having present/valid sections partially outside of node
> > ranges.  I am leaning towards 2), so I am wondering about the
> > following fix:
> > 
> > ---
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Subject: [patch] sparse: only mark sections present when fully covered by
> memory
> > 
> > When valid memory ranges are to be registered with sparsemem, make
> > sure that only fully covered sections are marked as present.
> > 
> > Otherwise we end up with PFN ranges that are reported present and
> > valid but are actually backed by uninitialized mem map.
> > 
> > The page_cgroup allocator relies on pfn_present() being reliable for
> > all PFNs between 0 and max_pfn, then retrieve the node id stored in
> > the corresponding page->flags to allocate the per-section page_cgroup
> > arrays on the local node.
> > 
> > This lead to at least one crash in the page allocator on a system
> > where the uninitialized page struct returned the id for node 0, which
> > had no memory itself.
> > 
> > Reported-by: qcui@redhat.com
> > Debugged-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Not-Yet-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> > 
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index aa64b12..a4fbeb8 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -182,7 +182,9 @@ void __init memory_present(int nid, unsigned long
> start, unsigned long end)
> >  {
> >     unsigned long pfn;
> >  
> > -   start &= PAGE_SECTION_MASK;
> > +   start = ALIGN(start, PAGES_PER_SECTION);
> > +   end &= PAGE_SECTION_MASK;
> > +
> >     mminit_validate_memmodel_limits(&start, &end);
> >     for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> >             unsigned long section = pfn_to_section_nr(pfn);
> > 
> 
> Hopefully he can test this one for us as well, thanks.
> 


My concern is ARM. I know ARM unmaps 'struct page' even for pages that
are in an existing section. So I wonder whether the above fix may cause
a new panic at boot on ARM. That's why I used a dirty hack ;)

But I'm not sure. ARM is too special in its memmap handling.

I think this patch of Johannes's should be tested on ARM (and possibly
on all architectures) before merging.

Thanks,
-Kame
Comment 11 KAMEZAWA Hiroyuki 2011-06-07 01:04:36 UTC
On Mon, 6 Jun 2011 14:45:19 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> Hopefully he can test this one for us as well, thanks.
> 

A patch of mine with a better description is here.
Anyway, I felt I needed a fix for the ARM special case.

==
fix-init-page_cgroup-for-sparsemem-taking-care-of-broken-page-flags.patch
Even with SPARSEMEM, there is some magical memmap.

If a node is not aligned to a section, the memmap for PFNs outside the
node's range is not initialized, and page->flags contains 0.

If node 0 doesn't exist, NODE_DATA(pfn_to_nid(pfn)) causes an error.

In another case, for example, ARM frees memmap that will never be used,
even under SPARSEMEM. In that case, page->flags will contain a broken
value.

This patch does a strict check on the nid obtained via pfn_to_page()
and uses a proper NID for the page_cgroup allocation.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/page_cgroup.c |   36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

Index: linux-3.0-rc1/mm/page_cgroup.c
===================================================================
--- linux-3.0-rc1.orig/mm/page_cgroup.c
+++ linux-3.0-rc1/mm/page_cgroup.c
@@ -168,6 +168,7 @@ static int __meminit init_section_page_c
 	struct mem_section *section;
 	unsigned long table_size;
 	unsigned long nr;
+	unsigned long tmp;
 	int nid, index;
 
 	nr = pfn_to_section_nr(pfn);
@@ -175,8 +176,41 @@ static int __meminit init_section_page_c
 
 	if (section->page_cgroup)
 		return 0;
+	/*
+	 * Check the node ID. Because 'pfn' is obtained by calculation, the
+	 * pfn may "not exist" or be "already freed". Even if pfn_valid()
+	 * returns true, page->flags may contain a broken value and
+	 * pfn_to_nid() may return a bad value.
+	 * (See CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and ARM's free_memmap().)
+	 * So we need to do a careful check here.
+	 */
+	for (tmp = pfn;
+	     tmp < pfn + PAGES_PER_SECTION;
+	     tmp += MAX_ORDER_NR_PAGES, nid = -1) {
+		struct page *page;
+
+		if (!pfn_valid(tmp))
+			continue;
+
+		page = pfn_to_page(tmp);
+		nid = page_to_nid(page);
 
-	nid = page_to_nid(pfn_to_page(pfn));
+		/*
+		 * If 'page' isn't initialized or was freed, it may contain
+		 * broken information.
+		 */
+		if (!node_state(nid, N_NORMAL_MEMORY))
+			continue;
+		if (page_to_pfn(pfn_to_page(tmp)) != tmp)
+			continue;
+		/*
+		 * The page seems valid and this 'nid' is safe to access,
+		 * at least.
+		 */
+		break;
+	}
+	if (nid == -1)
+		return 0;
 	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
 	base = alloc_page_cgroup(table_size, nid);
Comment 12 Johannes Weiner 2011-06-07 07:52:11 UTC
On Tue, Jun 07, 2011 at 09:57:08AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 6 Jun 2011 14:45:19 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Hopefully he can test this one for us as well, thanks.
> > 
> 
> A  patch with better description (of mine) is here.
> Anyway, I felt I needed a fix for ARM special case.

It's a different issue that warrants a separate patch, I think.

> fix-init-page_cgroup-for-sparsemem-taking-care-of-broken-page-flags.patch
> Even with SPARSEMEM, there are some magical memmap.
> 
> If a Node is not aligned to SECTION, memmap of pfn which is out of
> Node's range is not initialized. And page->flags contains 0.
> 
> If Node(0) doesn't exist, NODE_DATA(pfn_to_nid(pfn)) causes error.
> 
> In another case, for example, ARM frees memmap which is never be used
> even under SPARSEMEM. In that case, page->flags will contain broken
> value.
> 
> This patch does a strict check on nid which is obtained by
> pfn_to_page() and use proper NID for page_cgroup allocation.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> ---
>  mm/page_cgroup.c |   36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
> 
> Index: linux-3.0-rc1/mm/page_cgroup.c
> ===================================================================
> --- linux-3.0-rc1.orig/mm/page_cgroup.c
> +++ linux-3.0-rc1/mm/page_cgroup.c
> @@ -168,6 +168,7 @@ static int __meminit init_section_page_c
>       struct mem_section *section;
>       unsigned long table_size;
>       unsigned long nr;
> +     unsigned long tmp;
>       int nid, index;
>  
>       nr = pfn_to_section_nr(pfn);
> @@ -175,8 +176,41 @@ static int __meminit init_section_page_c
>  
>       if (section->page_cgroup)
>               return 0;
> +     /*
> +      * check Node-ID. Because we get 'pfn' which is obtained by
> calculation,
> +      * the pfn may "not exist" or "alreay freed". Even if pfn_valid()
> returns
> +      * true, page->flags may contain broken value and pfn_to_nid() returns
> +      * bad value.
> +      * (See CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and ARM's free_memmap())
> +      * So, we need to do careful check, here.
> +      */
> +     for (tmp = pfn;
> +          tmp < pfn + PAGES_PER_SECTION;
> +          tmp += MAX_ORDER_NR_PAGES, nid = -1) {
> +             struct page *page;
> +
> +             if (!pfn_valid(tmp))
> +                     continue;
> +
> +             page = pfn_to_page(tmp);
> +             nid = page_to_nid(page);
>  
> -     nid = page_to_nid(pfn_to_page(pfn));
> +             /*
> +              * If 'page' isn't initialized or freed, it may contains broken
> +              * information.
> +              */
> +             if (!node_state(nid, N_NORMAL_MEMORY))
> +                     continue;
> +             if (page_to_pfn(pfn_to_page(tmp)) != tmp)
> +                     continue;

This looks quite elaborate just to figure out the node id.

Here is what I wrote before I went with the sparsemem model fix (with
a modified changelog, because it also fixed the off-node range
problem).

It just iterates nodes in the first place, so the node id is never a
question.  The memory hotplug callback still relies on pfn_to_nid()
but ARM, at least as of now, does not support hotplug anyway.

What do you think?

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] page_cgroup: do not rely on memmap outside of node ranges

On ARM, memmap for present sections may partially be released.

As a consequence of this, a PFN walker like the page_cgroup array
allocator may not rely on the struct page that corresponds to a
pfn_present() PFN before it has been validated with something like
memmap_valid_within().

However, since this code only requires the node ID from the PFN range,
this patch changes it from a pure PFN walker to a PFN in nodes walker.
The node ID information is then inherently available through the node
that is currently being walked.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_cgroup.c |   41 +++++++++++++++++++++++++----------------
 1 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 74ccff6..46b6814 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -162,13 +162,13 @@ static void free_page_cgroup(void *addr)
 }
 #endif
 
-static int __meminit init_section_page_cgroup(unsigned long pfn)
+static int __meminit init_section_page_cgroup(int nid, unsigned long pfn)
 {
 	struct page_cgroup *base, *pc;
 	struct mem_section *section;
 	unsigned long table_size;
 	unsigned long nr;
-	int nid, index;
+	int index;
 
 	nr = pfn_to_section_nr(pfn);
 	section = __nr_to_section(nr);
@@ -176,7 +176,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
 	if (section->page_cgroup)
 		return 0;
 
-	nid = page_to_nid(pfn_to_page(pfn));
 	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
 	base = alloc_page_cgroup(table_size, nid);
 
@@ -222,13 +221,16 @@ int __meminit online_page_cgroup(unsigned long start_pfn,
 	unsigned long start, end, pfn;
 	int fail = 0;
 
+	if (nid < 0)
+		nid = pfn_to_nid(start_pfn);
+
 	start = start_pfn & ~(PAGES_PER_SECTION - 1);
 	end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
 
 	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
 		if (!pfn_present(pfn))
 			continue;
-		fail = init_section_page_cgroup(pfn);
+		fail = init_section_page_cgroup(nid, pfn);
 	}
 	if (!fail)
 		return 0;
@@ -283,23 +285,30 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
 
 void __init page_cgroup_init(void)
 {
-	unsigned long pfn;
-	int fail = 0;
+	pg_data_t *pgdat;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
-		if (!pfn_present(pfn))
-			continue;
-		fail = init_section_page_cgroup(pfn);
-	}
-	if (fail) {
-		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
-		panic("Out of memory");
-	} else {
-		hotplug_memory_notifier(page_cgroup_callback, 0);
+	for_each_online_pgdat(pgdat) {
+		unsigned long start;
+		unsigned long end;
+		unsigned long pfn;
+
+		start = pgdat->node_start_pfn & ~(PAGES_PER_SECTION - 1);
+		end = ALIGN(pgdat->node_start_pfn + pgdat->node_spanned_pages,
+			    PAGES_PER_SECTION);
+		for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
+			if (!pfn_present(pfn))
+				continue;
+			if (!init_section_page_cgroup(pgdat->node_id, pfn))
+				continue;
+			printk(KERN_CRIT
+			       "try 'cgroup_disable=memory' boot option\n");
+			panic("Out of memory");
+		}
 	}
+	hotplug_memory_notifier(page_cgroup_callback, 0);
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
 	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
 	" want memory cgroups\n");
Comment 13 KAMEZAWA Hiroyuki 2011-06-07 08:03:02 UTC
On Tue, 7 Jun 2011 09:51:31 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Jun 07, 2011 at 09:57:08AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 6 Jun 2011 14:45:19 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > Hopefully he can test this one for us as well, thanks.
> > > 
> > 
> > A  patch with better description (of mine) is here.
> > Anyway, I felt I needed a fix for ARM special case.
> 
> It's a different issue that warrants a separate patch, I think.
> 
> > fix-init-page_cgroup-for-sparsemem-taking-care-of-broken-page-flags.patch
> > Even with SPARSEMEM, there are some magical memmap.
> > 
> > If a Node is not aligned to SECTION, memmap of pfn which is out of
> > Node's range is not initialized. And page->flags contains 0.
> > 
> > If Node(0) doesn't exist, NODE_DATA(pfn_to_nid(pfn)) causes error.
> > 
> > In another case, for example, ARM frees memmap which is never be used
> > even under SPARSEMEM. In that case, page->flags will contain broken
> > value.
> > 
> > This patch does a strict check on nid which is obtained by
> > pfn_to_page() and use proper NID for page_cgroup allocation.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > ---
> >  mm/page_cgroup.c |   36 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 35 insertions(+), 1 deletion(-)
> > 
> > Index: linux-3.0-rc1/mm/page_cgroup.c
> > ===================================================================
> > --- linux-3.0-rc1.orig/mm/page_cgroup.c
> > +++ linux-3.0-rc1/mm/page_cgroup.c
> > @@ -168,6 +168,7 @@ static int __meminit init_section_page_c
> >     struct mem_section *section;
> >     unsigned long table_size;
> >     unsigned long nr;
> > +   unsigned long tmp;
> >     int nid, index;
> >  
> >     nr = pfn_to_section_nr(pfn);
> > @@ -175,8 +176,41 @@ static int __meminit init_section_page_c
> >  
> >     if (section->page_cgroup)
> >             return 0;
> > +   /*
> > +    * check Node-ID. Because we get 'pfn' which is obtained by
> calculation,
> > +    * the pfn may "not exist" or "alreay freed". Even if pfn_valid()
> returns
> > +    * true, page->flags may contain broken value and pfn_to_nid() returns
> > +    * bad value.
> > +    * (See CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and ARM's free_memmap())
> > +    * So, we need to do careful check, here.
> > +    */
> > +   for (tmp = pfn;
> > +        tmp < pfn + PAGES_PER_SECTION;
> > +        tmp += MAX_ORDER_NR_PAGES, nid = -1) {
> > +           struct page *page;
> > +
> > +           if (!pfn_valid(tmp))
> > +                   continue;
> > +
> > +           page = pfn_to_page(tmp);
> > +           nid = page_to_nid(page);
> >  
> > -   nid = page_to_nid(pfn_to_page(pfn));
> > +           /*
> > +            * If 'page' isn't initialized or freed, it may contains broken
> > +            * information.
> > +            */
> > +           if (!node_state(nid, N_NORMAL_MEMORY))
> > +                   continue;
> > +           if (page_to_pfn(pfn_to_page(tmp)) != tmp)
> > +                   continue;
> 
> This looks quite elaborate just to figure out the node id.
> 

Hm, but a node is not required to be aligned to a section, okay?



> Here is what I wrote before I went with the sparsemem model fix (with
> a modified changelog, because it also fixed the off-node range
> problem).
> 
> It just iterates nodes in the first place, so the node id is never a
> question.  The memory hotplug callback still relies on pfn_to_nid()
> but ARM, at least as of now, does not support hotplug anyway.
> 
> What do you think?
> 

I have a concern; see below.


> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] page_cgroup: do not rely on memmap outside of node ranges
> 
> On ARM, memmap for present sections may partially be released.
> 
> As a consequence of this, a PFN walker like the page_cgroup array
> allocator may not rely on the struct page that corresponds to a
> pfn_present() PFN before it has been validated with something like
> memmap_valid_within().
> 
> However, since this code only requires the node ID from the PFN range,
> this patch changes it from a pure PFN walker to a PFN in nodes walker.
> The node ID information is then inherently available through the node
> that is currently being walked.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_cgroup.c |   41 +++++++++++++++++++++++++----------------
>  1 files changed, 25 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 74ccff6..46b6814 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -162,13 +162,13 @@ static void free_page_cgroup(void *addr)
>  }
>  #endif
>  
> -static int __meminit init_section_page_cgroup(unsigned long pfn)
> +static int __meminit init_section_page_cgroup(int nid, unsigned long pfn)
>  {
>       struct page_cgroup *base, *pc;
>       struct mem_section *section;
>       unsigned long table_size;
>       unsigned long nr;
> -     int nid, index;
> +     int index;
>  
>       nr = pfn_to_section_nr(pfn);
>       section = __nr_to_section(nr);
> @@ -176,7 +176,6 @@ static int __meminit init_section_page_cgroup(unsigned
> long pfn)
>       if (section->page_cgroup)
>               return 0;
>  
> -     nid = page_to_nid(pfn_to_page(pfn));
>       table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
>       base = alloc_page_cgroup(table_size, nid);
>  
> @@ -222,13 +221,16 @@ int __meminit online_page_cgroup(unsigned long
> start_pfn,
>       unsigned long start, end, pfn;
>       int fail = 0;
>  
> +     if (nid < 0)
> +             nid = pfn_to_nid(start_pfn);
> +
>       start = start_pfn & ~(PAGES_PER_SECTION - 1);
>       end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
>  
>       for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
>               if (!pfn_present(pfn))
>                       continue;
> -             fail = init_section_page_cgroup(pfn);
> +             fail = init_section_page_cgroup(nid, pfn);
>       }
>       if (!fail)
>               return 0;
> @@ -283,23 +285,30 @@ static int __meminit page_cgroup_callback(struct
> notifier_block *self,
>  
>  void __init page_cgroup_init(void)
>  {
> -     unsigned long pfn;
> -     int fail = 0;
> +     pg_data_t *pgdat;
>  
>       if (mem_cgroup_disabled())
>               return;
>  
> -     for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> -             if (!pfn_present(pfn))
> -                     continue;
> -             fail = init_section_page_cgroup(pfn);
> -     }
> -     if (fail) {
> -             printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> -             panic("Out of memory");
> -     } else {
> -             hotplug_memory_notifier(page_cgroup_callback, 0);
> +     for_each_online_pgdat(pgdat) {
> +             unsigned long start;
> +             unsigned long end;
> +             unsigned long pfn;
> +
> +             start = pgdat->node_start_pfn & ~(PAGES_PER_SECTION - 1);
> +             end = ALIGN(pgdat->node_start_pfn + pgdat->node_spanned_pages,
> +                         PAGES_PER_SECTION);
> +             for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> +                     if (!pfn_present(pfn))
> +                             continue;
> +                     if (!init_section_page_cgroup(pgdat->node_id, pfn))
> +                             continue;

AFAIK, nodes can overlap, so this [start, end) scan doesn't work: sections
may be initialised multiple times, in the wrong way. What can we trust here,
the nid in page->flags or early_node_map[]? But keeping a record like
early_node_map[], which lives in an __init section, is... complicated.

Hmm, the problem is a boot failure, which is scary and needs to be backported
to the stable kernels. And stable kernels require no regressions.

Can't we start from a band-aid? We can clean up later.

Thanks,
-Kame
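
To make the distrust of page->flags concrete: the nid lives in the top bits
of page->flags, so a struct page whose flags were never initialized decodes
to node 0. A minimal userspace sketch; the shift and field width below are
made up, the real packing is config-dependent:
==
#include <stdio.h>

/* Hypothetical field layout for illustration only; the real
 * section/node/zone packing in page->flags depends on the config. */
#define NODES_SHIFT   6
#define NODES_PGSHIFT 52
#define NODES_MASK    ((1UL << NODES_SHIFT) - 1)

static int sketch_flags_to_nid(unsigned long flags)
{
	return (flags >> NODES_PGSHIFT) & NODES_MASK;
}

int main(void)
{
	/* memmap_init_zone() never ran for this page: flags stayed 0 */
	unsigned long uninitialized_flags = 0;

	/* decodes to node 0, which is fatal when node 0 has no memory */
	printf("nid = %d\n", sketch_flags_to_nid(uninitialized_flags));
	return 0;
}
==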
Comment 14 KAMEZAWA Hiroyuki 2011-06-07 08:51:19 UTC
On Tue, 7 Jun 2011 09:45:30 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jun 07, 2011 at 08:45:30AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 6 Jun 2011 14:45:19 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > On Mon, 6 Jun 2011 14:54:21 +0200
> > > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > 
> > > > Cc Mel for memory model
> > > > 
> > > > On Mon, May 30, 2011 at 05:51:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > On Mon, 30 May 2011 16:54:53 +0900
> > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > 
> > > > > > On Mon, 30 May 2011 16:29:04 +0900
> > > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > 
> > > > > > SRAT: Node 1 PXM 1 0-a0000
> > > > > > SRAT: Node 1 PXM 1 100000-c8000000
> > > > > > SRAT: Node 1 PXM 1 100000000-438000000
> > > > > > SRAT: Node 3 PXM 3 438000000-838000000
> > > > > > SRAT: Node 5 PXM 5 838000000-c38000000
> > > > > > SRAT: Node 7 PXM 7 c38000000-1038000000
> > > > > > 
> > > > > > Initmem setup node 1 0000000000000000-0000000438000000
> > > > > >   NODE_DATA [0000000437fd9000 - 0000000437ffffff]
> > > > > > Initmem setup node 3 0000000438000000-0000000838000000
> > > > > >   NODE_DATA [0000000837fd9000 - 0000000837ffffff]
> > > > > > Initmem setup node 5 0000000838000000-0000000c38000000
> > > > > >   NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
> > > > > > Initmem setup node 7 0000000c38000000-0000001038000000
> > > > > >   NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
> > > > > > [ffffea000ec40000-ffffea000edfffff] potential offnode page_structs
> > > > > > [ffffea001cc40000-ffffea001cdfffff] potential offnode page_structs
> > > > > > [ffffea002ac40000-ffffea002adfffff] potential offnode page_structs
> > > > > > ==
> > > > > > 
> > > > > > Hmm..there are four nodes 1,3,5,7 but....no memory on node 0 hmm ?
> > > > > > 
> > > > > 
> > > > > I think I found a reason and this is a possible fix. But need to be
> tested.
> > > > > And suggestion for better fix rather than this band-aid is
> appreciated.
> > > > > 
> > > > > ==
> > > > > >From b95edcf43619312f72895476c3e6ef46079bb05f Mon Sep 17 00:00:00
> 2001
> > > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > Date: Mon, 30 May 2011 16:49:59 +0900
> > > > > Subject: [PATCH][BUGFIX] fallbacks at page_cgroup allocation.
> > > > > 
> > > > > Under SPARSEMEM, the page_struct is allocated per section.
> > > > > Then, pfn_valid() for the whole section is "true" and there are page
> > > > > structs. But, it's not related to valid range of [start_pfn, end_pfn)
> > > > > and some page structs may not be initialized collectly because
> > > > > it's not a valid pages.
> > > > > (memmap_init_zone() skips a page which is not correct in
> > > > >  early_node_map[] and page->flags is initialized to be 0.)
> > > > > 
> > > > > In this case, a page->flags can be '0'. Assume a case where
> > > > > node 0 has no memory....
> > > > > 
> > > > > page_cgroup is allocated onto the node
> > > > > 
> > > > >    - page_to_nid(head of section pfn)
> > > > > 
> > > > > Head's pfn will be valid (struct page exists) but page->flags is 0
> and contains
> > > > > node_id:0. This causes allocation onto NODE_DATA(0) and cause panic.
> > > > > 
> > > > > This patch makes page_cgroup to use alloc_pages_exact() only
> > > > > when NID is N_NORMAL_MEMORY.
> > > 
> > > fyi, the reporter has gone in via the bugzilla UI and says he has
> > > tested the patch and it worked well.
> > > 
> > > Please don't do that!  See this?
> > > 
> > > : (switched to email.  Please respond via emailed reply-to-all, not via
> the
> > > : bugzilla web interface).
> > > 
> > > So we have a tested-by if we use this patch.
> > > 
> > > > I don't like this much as it essentially will allocate the array from
> > > > a (semantically) random node, as long as it has memory.
> > > > 
> > > > IMO, the problem is either 1) looking at PFNs outside known node
> > > > ranges, or 2) having present/valid sections partially outside of node
> > > > ranges.  I am leaning towards 2), so I am wondering about the
> > > > following fix:
> > > > 
> > > > ---
> > > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > > Subject: [patch] sparse: only mark sections present when fully covered
> by memory
> > > > 
> > > > When valid memory ranges are to be registered with sparsemem, make
> > > > sure that only fully covered sections are marked as present.
> > > > 
> > > > Otherwise we end up with PFN ranges that are reported present and
> > > > valid but are actually backed by uninitialized mem map.
> > > > 
> > > > The page_cgroup allocator relies on pfn_present() being reliable for
> > > > all PFNs between 0 and max_pfn, then retrieve the node id stored in
> > > > the corresponding page->flags to allocate the per-section page_cgroup
> > > > arrays on the local node.
> > > > 
> > > > This lead to at least one crash in the page allocator on a system
> > > > where the uninitialized page struct returned the id for node 0, which
> > > > had no memory itself.
> > > > 
> > > > Reported-by: qcui@redhat.com
> > > > Debugged-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Not-Yet-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > ---
> > > > 
> > > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > > index aa64b12..a4fbeb8 100644
> > > > --- a/mm/sparse.c
> > > > +++ b/mm/sparse.c
> > > > @@ -182,7 +182,9 @@ void __init memory_present(int nid, unsigned long
> start, unsigned long end)
> > > >  {
> > > >         unsigned long pfn;
> > > >  
> > > > -       start &= PAGE_SECTION_MASK;
> > > > +       start = ALIGN(start, PAGES_PER_SECTION);
> > > > +       end &= PAGE_SECTION_MASK;
> > > > +
> > > >         mminit_validate_memmodel_limits(&start, &end);
> > > >         for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> > > >                 unsigned long section = pfn_to_section_nr(pfn);
> > > > 
> > > 
> > > Hopefully he can test this one for us as well, thanks.
> > > 
> > 
> > 
> > My concern is ARM. I know ARM unmaps 'struct page' even if pages are in
> > existing section.
> 
> Yes, but not outside zone boundaries. The problem for ARM is having
> zones unaligned to sections. The struct pages for the non-resident
> memory gets unmapped. This is a problem for linear PFN walkers that
> align to boundaries unrelated to the zone such as to MAX_ORDER_NR_PAGES
> or pageblock_nr_pages.
> 

The zone boundary is not the problem. If the memmap for the head of a section
is unmapped and reused, we'll see the wrong node because page->flags is broken.

Thanks,
-Kame
Comment 15 Anonymous Emailer 2011-06-07 09:09:49 UTC
Reply-To: mgorman@suse.de

On Tue, Jun 07, 2011 at 05:43:55PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 7 Jun 2011 09:45:30 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Tue, Jun 07, 2011 at 08:45:30AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 6 Jun 2011 14:45:19 -0700
> > > Andrew Morton <akpm@linux-foundation.org> wrote:
> > > 
> > > > On Mon, 6 Jun 2011 14:54:21 +0200
> > > > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > 
> > > > > Cc Mel for memory model
> > > > > 
> > > > > On Mon, May 30, 2011 at 05:51:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > On Mon, 30 May 2011 16:54:53 +0900
> > > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > > 
> > > > > > > On Mon, 30 May 2011 16:29:04 +0900
> > > > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > > 
> > > > > > > SRAT: Node 1 PXM 1 0-a0000
> > > > > > > SRAT: Node 1 PXM 1 100000-c8000000
> > > > > > > SRAT: Node 1 PXM 1 100000000-438000000
> > > > > > > SRAT: Node 3 PXM 3 438000000-838000000
> > > > > > > SRAT: Node 5 PXM 5 838000000-c38000000
> > > > > > > SRAT: Node 7 PXM 7 c38000000-1038000000
> > > > > > > 
> > > > > > > Initmem setup node 1 0000000000000000-0000000438000000
> > > > > > >   NODE_DATA [0000000437fd9000 - 0000000437ffffff]
> > > > > > > Initmem setup node 3 0000000438000000-0000000838000000
> > > > > > >   NODE_DATA [0000000837fd9000 - 0000000837ffffff]
> > > > > > > Initmem setup node 5 0000000838000000-0000000c38000000
> > > > > > >   NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
> > > > > > > Initmem setup node 7 0000000c38000000-0000001038000000
> > > > > > >   NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
> > > > > > > [ffffea000ec40000-ffffea000edfffff] potential offnode
> page_structs
> > > > > > > [ffffea001cc40000-ffffea001cdfffff] potential offnode
> page_structs
> > > > > > > [ffffea002ac40000-ffffea002adfffff] potential offnode
> page_structs
> > > > > > > ==
> > > > > > > 
> > > > > > > Hmm..there are four nodes 1,3,5,7 but....no memory on node 0 hmm
> ?
> > > > > > > 
> > > > > > 
> > > > > > I think I found a reason and this is a possible fix. But need to be
> tested.
> > > > > > And suggestion for better fix rather than this band-aid is
> appreciated.
> > > > > > 
> > > > > > ==
> > > > > > >From b95edcf43619312f72895476c3e6ef46079bb05f Mon Sep 17 00:00:00
> 2001
> > > > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > > Date: Mon, 30 May 2011 16:49:59 +0900
> > > > > > Subject: [PATCH][BUGFIX] fallbacks at page_cgroup allocation.
> > > > > > 
> > > > > > Under SPARSEMEM, the page_struct is allocated per section.
> > > > > > Then, pfn_valid() for the whole section is "true" and there are
> page
> > > > > > structs. But, it's not related to valid range of [start_pfn,
> end_pfn)
> > > > > > and some page structs may not be initialized collectly because
> > > > > > it's not a valid pages.
> > > > > > (memmap_init_zone() skips a page which is not correct in
> > > > > >  early_node_map[] and page->flags is initialized to be 0.)
> > > > > > 
> > > > > > In this case, a page->flags can be '0'. Assume a case where
> > > > > > node 0 has no memory....
> > > > > > 
> > > > > > page_cgroup is allocated onto the node
> > > > > > 
> > > > > >    - page_to_nid(head of section pfn)
> > > > > > 
> > > > > > Head's pfn will be valid (struct page exists) but page->flags is 0
> and contains
> > > > > > node_id:0. This causes allocation onto NODE_DATA(0) and cause
> panic.
> > > > > > 
> > > > > > This patch makes page_cgroup to use alloc_pages_exact() only
> > > > > > when NID is N_NORMAL_MEMORY.
> > > > 
> > > > fyi, the reporter has gone in via the bugzilla UI and says he has
> > > > tested the patch and it worked well.
> > > > 
> > > > Please don't do that!  See this?
> > > > 
> > > > : (switched to email.  Please respond via emailed reply-to-all, not via
> the
> > > > : bugzilla web interface).
> > > > 
> > > > So we have a tested-by if we use this patch.
> > > > 
> > > > > I don't like this much as it essentially will allocate the array from
> > > > > a (semantically) random node, as long as it has memory.
> > > > > 
> > > > > IMO, the problem is either 1) looking at PFNs outside known node
> > > > > ranges, or 2) having present/valid sections partially outside of node
> > > > > ranges.  I am leaning towards 2), so I am wondering about the
> > > > > following fix:
> > > > > 
> > > > > ---
> > > > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > > > Subject: [patch] sparse: only mark sections present when fully
> covered by memory
> > > > > 
> > > > > When valid memory ranges are to be registered with sparsemem, make
> > > > > sure that only fully covered sections are marked as present.
> > > > > 
> > > > > Otherwise we end up with PFN ranges that are reported present and
> > > > > valid but are actually backed by uninitialized mem map.
> > > > > 
> > > > > The page_cgroup allocator relies on pfn_present() being reliable for
> > > > > all PFNs between 0 and max_pfn, then retrieve the node id stored in
> > > > > the corresponding page->flags to allocate the per-section page_cgroup
> > > > > arrays on the local node.
> > > > > 
> > > > > This lead to at least one crash in the page allocator on a system
> > > > > where the uninitialized page struct returned the id for node 0, which
> > > > > had no memory itself.
> > > > > 
> > > > > Reported-by: qcui@redhat.com
> > > > > Debugged-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > Not-Yet-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > ---
> > > > > 
> > > > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > > > index aa64b12..a4fbeb8 100644
> > > > > --- a/mm/sparse.c
> > > > > +++ b/mm/sparse.c
> > > > > @@ -182,7 +182,9 @@ void __init memory_present(int nid, unsigned long
> start, unsigned long end)
> > > > >  {
> > > > >       unsigned long pfn;
> > > > >  
> > > > > -     start &= PAGE_SECTION_MASK;
> > > > > +     start = ALIGN(start, PAGES_PER_SECTION);
> > > > > +     end &= PAGE_SECTION_MASK;
> > > > > +
> > > > >       mminit_validate_memmodel_limits(&start, &end);
> > > > >       for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> > > > >               unsigned long section = pfn_to_section_nr(pfn);
> > > > > 
> > > > 
> > > > Hopefully he can test this one for us as well, thanks.
> > > > 
> > > 
> > > 
> > > My concern is ARM. I know ARM unmaps 'struct page' even if pages are in
> > > existing section.
> > 
> > Yes, but not outside zone boundaries. The problem for ARM is having
> > zones unaligned to sections. The struct pages for the non-resident
> > memory gets unmapped. This is a problem for linear PFN walkers that
> > align to boundaries unrelated to the zone such as to MAX_ORDER_NR_PAGES
> > or pageblock_nr_pages.
> > 
> 
> zone boundary is not problem. If memmap for head of section is unmapped and
> reused, we'll see wrong node because page->flags is broken.
> 

I should have said "nodes", even though the end result is the same. The
problem at the moment is that cgroup initialisation is checking PFNs outside
node boundaries. It should be ensuring that the start and end PFNs it
uses are within those boundaries.
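
A minimal sketch of the kind of bounded walk being suggested, with made-up
pfn values and an illustrative section size: the walk starts at the node's
own first pfn and steps to section heads from there, so it never consults a
struct page below node_start_pfn:
==
#include <stdio.h>

#define PAGES_PER_SECTION 32768UL  /* illustrative: 128M sections, 4K pages */

int main(void)
{
	unsigned long node_start_pfn = 0xa0;	/* deliberately unaligned */
	unsigned long node_end_pfn = 0x18000;	/* made-up node end */
	unsigned long pfn;

	for (pfn = node_start_pfn; pfn < node_end_pfn;
	     pfn = (pfn + PAGES_PER_SECTION) & ~(PAGES_PER_SECTION - 1)) {
		/* every pfn probed here is >= node_start_pfn, so its
		 * memmap was initialized by this node */
		printf("probe pfn %#lx (section %lu)\n",
		       pfn, pfn / PAGES_PER_SECTION);
	}
	return 0;
}
==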
Comment 16 KAMEZAWA Hiroyuki 2011-06-07 09:13:53 UTC
On Tue, 7 Jun 2011 10:03:13 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jun 07, 2011 at 09:57:08AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 6 Jun 2011 14:45:19 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > Hopefully he can test this one for us as well, thanks.
> > > 
> > 
> > A  patch with better description (of mine) is here.
> > Anyway, I felt I needed a fix for ARM special case.
> > 
> > ==
> > fix-init-page_cgroup-for-sparsemem-taking-care-of-broken-page-flags.patch
> > Even with SPARSEMEM, there are some magical memmap.
> > 
> 
> Who wants to introduce SPARSEMEM_MAGICAL?
> 

ARM guys ;)

> > If a Node is not aligned to SECTION, memmap of pfn which is out of
> > Node's range is not initialized. And page->flags contains 0.
> > 
> 
> This is tangential but it might be worth introducing
> CONFIG_DEBUG_MEMORY_MODEL that WARN_ONs page->flag == 0 in
> pfn_to_page() to catch some accesses outside node boundaries. Not for
> this bug though.
> 

Hmm, but if zone == 0 && section == 0 && nid == 0, page->flags is 0.



> > If Node(0) doesn't exist, NODE_DATA(pfn_to_nid(pfn)) causes error.
> > 
> 
> Well, not in itself. It causes a bug when we try allocate memory
> from node 0 but there is a subtle performance bug here as well. For
> unaligned nodes, the cgroup information can be allocated from node
> 0 instead of node-local.
> 
> > In another case, for example, ARM frees memmap which is never be used
> > even under SPARSEMEM. In that case, page->flags will contain broken
> > value.
> > 
> 
> Again, not as such. In that case, struct page is not valid memory
> at all.

Hmm, IIUC, ARM's code frees the memmap via free_bootmem(), so the memory
that held the struct pages is still valid and accessible (but it no longer
holds struct pages).

If my English sounds strange, I'm sorry.

How about this?
== 
 In another case, for example, ARM frees memmap which will never be used
 and reuses that memory for other purposes. So, in that case, what
 pfn_to_page(pfn) returns may not be a struct page.
==



> 
> > This patch does a strict check on nid which is obtained by
> > pfn_to_page() and use proper NID for page_cgroup allocation.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > ---
> >  mm/page_cgroup.c |   36 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 35 insertions(+), 1 deletion(-)
> > 
> > Index: linux-3.0-rc1/mm/page_cgroup.c
> > ===================================================================
> > --- linux-3.0-rc1.orig/mm/page_cgroup.c
> > +++ linux-3.0-rc1/mm/page_cgroup.c
> > @@ -168,6 +168,7 @@ static int __meminit init_section_page_c
> >     struct mem_section *section;
> >     unsigned long table_size;
> >     unsigned long nr;
> > +   unsigned long tmp;
> >     int nid, index;
> >  
> >     nr = pfn_to_section_nr(pfn);
> > @@ -175,8 +176,41 @@ static int __meminit init_section_page_c
> >  
> >     if (section->page_cgroup)
> >             return 0;
> > +   /*
> > +    * check Node-ID. Because we get 'pfn' which is obtained by
> calculation,
> > +    * the pfn may "not exist" or "alreay freed". Even if pfn_valid()
> returns
> > +    * true, page->flags may contain broken value and pfn_to_nid() returns
> > +    * bad value.
> > +    * (See CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and ARM's free_memmap())
> > +    * So, we need to do careful check, here.
> > +    */
> 
> You don't really need to worry about ARM here as long as you stay
> within node boundaries and you only care about the first valid page
> in the node. Why not lookup NODE_DATA(nid) and make sure start and
> end are within the node boundaries?
> 

I thought ARM's code just takes care of MAX_ORDER alignment and doesn't
take care of making holes in a zone/node. Am I wrong?

== arch/arm/mm/init.c===
        for_each_bank(i, mi) {
                struct membank *bank = &mi->bank[i];

                bank_start = bank_pfn_start(bank);

#ifdef CONFIG_SPARSEMEM
                /*
                 * Take care not to free memmap entries that don't exist
                 * due to SPARSEMEM sections which aren't present.
                 */
                bank_start = min(bank_start,
                                 ALIGN(prev_bank_end, PAGES_PER_SECTION));
#endif
                /*
                 * If we had a previous bank, and there is a space
                 * between the current bank and the previous, free it.
                 */
                if (prev_bank_end && prev_bank_end < bank_start)
                        free_memmap(prev_bank_end, bank_start);

                /*
                 * Align up here since the VM subsystem insists that the
                 * memmap entries are valid from the bank end aligned to
                 * MAX_ORDER_NR_PAGES.
                 */
                prev_bank_end = ALIGN(bank_pfn_end(bank), MAX_ORDER_NR_PAGES);
        }
===

ARM frees the memmap for holes between valid memory banks.

Do you mean that a "memory bank" ultimately represents a node?


Thanks,
-Kame
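
For what it's worth, the quoted loop can be re-enacted in userspace with two
made-up banks to see which memmap range would be handed back; all values and
sizes below are illustrative only:
==
#include <stdio.h>

#define PAGES_PER_SECTION  32768UL  /* illustrative */
#define MAX_ORDER_NR_PAGES 2048UL   /* illustrative: MAX_ORDER 11, 4K pages */
#define ALIGN_UP(x, a)     (((x) + (a) - 1) & ~((a) - 1))
#define MIN(a, b)          ((a) < (b) ? (a) : (b))

int main(void)
{
	/* two banks with a hole between them, in pfn units */
	unsigned long bank_start[2] = { 0x00000, 0x40000 };
	unsigned long bank_end[2]   = { 0x10800, 0x60000 };
	unsigned long prev_bank_end = 0;
	int i;

	for (i = 0; i < 2; i++) {
		/* don't free memmap of sections that were never present */
		unsigned long start = MIN(bank_start[i],
				ALIGN_UP(prev_bank_end, PAGES_PER_SECTION));

		if (prev_bank_end && prev_bank_end < start)
			printf("free memmap for pfns [%#lx, %#lx)\n",
			       prev_bank_end, start);

		prev_bank_end = ALIGN_UP(bank_end[i], MAX_ORDER_NR_PAGES);
	}
	return 0;
}
==
The freed range [0x10800, 0x18000) lies inside a section that is still
marked present, so pfn_valid() would stay true for pfns whose memmap is
gone; that is the case being worried about here.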
Comment 17 Anonymous Emailer 2011-06-07 09:17:50 UTC
Reply-To: mgorman@suse.de

On Tue, Jun 07, 2011 at 08:45:30AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 6 Jun 2011 14:45:19 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Mon, 6 Jun 2011 14:54:21 +0200
> > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > > Cc Mel for memory model
> > > 
> > > On Mon, May 30, 2011 at 05:51:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > On Mon, 30 May 2011 16:54:53 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 
> > > > > On Mon, 30 May 2011 16:29:04 +0900
> > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 
> > > > > SRAT: Node 1 PXM 1 0-a0000
> > > > > SRAT: Node 1 PXM 1 100000-c8000000
> > > > > SRAT: Node 1 PXM 1 100000000-438000000
> > > > > SRAT: Node 3 PXM 3 438000000-838000000
> > > > > SRAT: Node 5 PXM 5 838000000-c38000000
> > > > > SRAT: Node 7 PXM 7 c38000000-1038000000
> > > > > 
> > > > > Initmem setup node 1 0000000000000000-0000000438000000
> > > > >   NODE_DATA [0000000437fd9000 - 0000000437ffffff]
> > > > > Initmem setup node 3 0000000438000000-0000000838000000
> > > > >   NODE_DATA [0000000837fd9000 - 0000000837ffffff]
> > > > > Initmem setup node 5 0000000838000000-0000000c38000000
> > > > >   NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
> > > > > Initmem setup node 7 0000000c38000000-0000001038000000
> > > > >   NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
> > > > > [ffffea000ec40000-ffffea000edfffff] potential offnode page_structs
> > > > > [ffffea001cc40000-ffffea001cdfffff] potential offnode page_structs
> > > > > [ffffea002ac40000-ffffea002adfffff] potential offnode page_structs
> > > > > ==
> > > > > 
> > > > > Hmm..there are four nodes 1,3,5,7 but....no memory on node 0 hmm ?
> > > > > 
> > > > 
> > > > I think I found a reason and this is a possible fix. But need to be
> tested.
> > > > And suggestion for better fix rather than this band-aid is appreciated.
> > > > 
> > > > ==
> > > > >From b95edcf43619312f72895476c3e6ef46079bb05f Mon Sep 17 00:00:00 2001
> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Date: Mon, 30 May 2011 16:49:59 +0900
> > > > Subject: [PATCH][BUGFIX] fallbacks at page_cgroup allocation.
> > > > 
> > > > Under SPARSEMEM, the page_struct is allocated per section.
> > > > Then, pfn_valid() for the whole section is "true" and there are page
> > > > structs. But, it's not related to valid range of [start_pfn, end_pfn)
> > > > and some page structs may not be initialized collectly because
> > > > it's not a valid pages.
> > > > (memmap_init_zone() skips a page which is not correct in
> > > >  early_node_map[] and page->flags is initialized to be 0.)
> > > > 
> > > > In this case, a page->flags can be '0'. Assume a case where
> > > > node 0 has no memory....
> > > > 
> > > > page_cgroup is allocated onto the node
> > > > 
> > > >    - page_to_nid(head of section pfn)
> > > > 
> > > > Head's pfn will be valid (struct page exists) but page->flags is 0 and
> contains
> > > > node_id:0. This causes allocation onto NODE_DATA(0) and cause panic.
> > > > 
> > > > This patch makes page_cgroup to use alloc_pages_exact() only
> > > > when NID is N_NORMAL_MEMORY.
> > 
> > fyi, the reporter has gone in via the bugzilla UI and says he has
> > tested the patch and it worked well.
> > 
> > Please don't do that!  See this?
> > 
> > : (switched to email.  Please respond via emailed reply-to-all, not via the
> > : bugzilla web interface).
> > 
> > So we have a tested-by if we use this patch.
> > 
> > > I don't like this much as it essentially will allocate the array from
> > > a (semantically) random node, as long as it has memory.
> > > 
> > > IMO, the problem is either 1) looking at PFNs outside known node
> > > ranges, or 2) having present/valid sections partially outside of node
> > > ranges.  I am leaning towards 2), so I am wondering about the
> > > following fix:
> > > 
> > > ---
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Subject: [patch] sparse: only mark sections present when fully covered by
> memory
> > > 
> > > When valid memory ranges are to be registered with sparsemem, make
> > > sure that only fully covered sections are marked as present.
> > > 
> > > Otherwise we end up with PFN ranges that are reported present and
> > > valid but are actually backed by uninitialized mem map.
> > > 
> > > The page_cgroup allocator relies on pfn_present() being reliable for
> > > all PFNs between 0 and max_pfn, then retrieve the node id stored in
> > > the corresponding page->flags to allocate the per-section page_cgroup
> > > arrays on the local node.
> > > 
> > > This lead to at least one crash in the page allocator on a system
> > > where the uninitialized page struct returned the id for node 0, which
> > > had no memory itself.
> > > 
> > > Reported-by: qcui@redhat.com
> > > Debugged-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Not-Yet-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > > 
> > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > index aa64b12..a4fbeb8 100644
> > > --- a/mm/sparse.c
> > > +++ b/mm/sparse.c
> > > @@ -182,7 +182,9 @@ void __init memory_present(int nid, unsigned long
> start, unsigned long end)
> > >  {
> > >   unsigned long pfn;
> > >  
> > > - start &= PAGE_SECTION_MASK;
> > > + start = ALIGN(start, PAGES_PER_SECTION);
> > > + end &= PAGE_SECTION_MASK;
> > > +
> > >   mminit_validate_memmodel_limits(&start, &end);
> > >   for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> > >           unsigned long section = pfn_to_section_nr(pfn);
> > > 
> > 
> > Hopefully he can test this one for us as well, thanks.
> > 
> 
> 
> My concern is ARM. I know ARM unmaps 'struct page' even if pages are in
> existing section.

Yes, but not outside zone boundaries. The problem for ARM is having
zones unaligned to sections. The struct pages for the non-resident
memory get unmapped. This is a problem for linear PFN walkers that
align to boundaries unrelated to the zone, such as MAX_ORDER_NR_PAGES
or pageblock_nr_pages.
Comment 18 Anonymous Emailer 2011-06-07 09:17:50 UTC
Reply-To: mgorman@suse.de

On Tue, Jun 07, 2011 at 09:57:08AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 6 Jun 2011 14:45:19 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Hopefully he can test this one for us as well, thanks.
> > 
> 
> A  patch with better description (of mine) is here.
> Anyway, I felt I needed a fix for ARM special case.
> 
> ==
> fix-init-page_cgroup-for-sparsemem-taking-care-of-broken-page-flags.patch
> Even with SPARSEMEM, there are some magical memmap.
> 

Who wants to introduce SPARSEMEM_MAGICAL?

> If a Node is not aligned to SECTION, memmap of pfn which is out of
> Node's range is not initialized. And page->flags contains 0.
> 

This is tangential, but it might be worth introducing a
CONFIG_DEBUG_MEMORY_MODEL that WARN_ONs page->flags == 0 in
pfn_to_page() to catch some accesses outside node boundaries. Not for
this bug though.

> If Node(0) doesn't exist, NODE_DATA(pfn_to_nid(pfn)) causes error.
> 

Well, not in itself. It causes a bug when we try to allocate memory
from node 0, but there is a subtle performance bug here as well. For
unaligned nodes, the cgroup information can be allocated from node
0 instead of the local node.

> In another case, for example, ARM frees memmap which is never be used
> even under SPARSEMEM. In that case, page->flags will contain broken
> value.
> 

Again, not as such. In that case, struct page is not valid memory
at all.

> This patch does a strict check on nid which is obtained by
> pfn_to_page() and use proper NID for page_cgroup allocation.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> ---
>  mm/page_cgroup.c |   36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
> 
> Index: linux-3.0-rc1/mm/page_cgroup.c
> ===================================================================
> --- linux-3.0-rc1.orig/mm/page_cgroup.c
> +++ linux-3.0-rc1/mm/page_cgroup.c
> @@ -168,6 +168,7 @@ static int __meminit init_section_page_c
>       struct mem_section *section;
>       unsigned long table_size;
>       unsigned long nr;
> +     unsigned long tmp;
>       int nid, index;
>  
>       nr = pfn_to_section_nr(pfn);
> @@ -175,8 +176,41 @@ static int __meminit init_section_page_c
>  
>       if (section->page_cgroup)
>               return 0;
> +     /*
> +      * check Node-ID. Because we get 'pfn' which is obtained by
> calculation,
> +      * the pfn may "not exist" or "alreay freed". Even if pfn_valid()
> returns
> +      * true, page->flags may contain broken value and pfn_to_nid() returns
> +      * bad value.
> +      * (See CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and ARM's free_memmap())
> +      * So, we need to do careful check, here.
> +      */

You don't really need to worry about ARM here as long as you stay
within node boundaries and you only care about the first valid page
in the node. Why not look up NODE_DATA(nid) and make sure start and
end are within the node boundaries?

> +     for (tmp = pfn;
> +          tmp < pfn + PAGES_PER_SECTION;
> +          tmp += MAX_ORDER_NR_PAGES, nid = -1) {
> +             struct page *page;
> +
> +             if (!pfn_valid(tmp))
> +                     continue;
> +
> +             page = pfn_to_page(tmp);
> +             nid = page_to_nid(page);
>  
> -     nid = page_to_nid(pfn_to_page(pfn));
> +             /*
> +              * If 'page' isn't initialized or freed, it may contains broken
> +              * information.
> +              */
> +             if (!node_state(nid, N_NORMAL_MEMORY))
> +                     continue;
> +             if (page_to_pfn(pfn_to_page(tmp)) != tmp)
> +                     continue;
> +             /*
> +              * The page seems valid and this 'nid' is safe to access,
> +              * at least.
> +              */
> +             break;
> +     }
> +     if (nid == -1)
> +             return 0;
>       table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
>       base = alloc_page_cgroup(table_size, nid);
>
Comment 19 Anonymous Emailer 2011-06-07 09:27:48 UTC
Reply-To: mgorman@suse.de

On Mon, Jun 06, 2011 at 02:54:21PM +0200, Johannes Weiner wrote:
> Cc Mel for memory model
> 
> On Mon, May 30, 2011 at 05:51:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 30 May 2011 16:54:53 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Mon, 30 May 2011 16:29:04 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > SRAT: Node 1 PXM 1 0-a0000
> > > SRAT: Node 1 PXM 1 100000-c8000000
> > > SRAT: Node 1 PXM 1 100000000-438000000
> > > SRAT: Node 3 PXM 3 438000000-838000000
> > > SRAT: Node 5 PXM 5 838000000-c38000000
> > > SRAT: Node 7 PXM 7 c38000000-1038000000
> > > 
> > > Initmem setup node 1 0000000000000000-0000000438000000
> > >   NODE_DATA [0000000437fd9000 - 0000000437ffffff]
> > > Initmem setup node 3 0000000438000000-0000000838000000
> > >   NODE_DATA [0000000837fd9000 - 0000000837ffffff]
> > > Initmem setup node 5 0000000838000000-0000000c38000000
> > >   NODE_DATA [0000000c37fd9000 - 0000000c37ffffff]
> > > Initmem setup node 7 0000000c38000000-0000001038000000
> > >   NODE_DATA [0000001037fd7000 - 0000001037ffdfff]
> > > [ffffea000ec40000-ffffea000edfffff] potential offnode page_structs
> > > [ffffea001cc40000-ffffea001cdfffff] potential offnode page_structs
> > > [ffffea002ac40000-ffffea002adfffff] potential offnode page_structs
> > > ==
> > > 
> > > Hmm..there are four nodes 1,3,5,7 but....no memory on node 0 hmm ?
> > > 
> > 
> > I think I found a reason and this is a possible fix. But need to be tested.
> > And suggestion for better fix rather than this band-aid is appreciated.
> > 
> > ==
> > >From b95edcf43619312f72895476c3e6ef46079bb05f Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Mon, 30 May 2011 16:49:59 +0900
> > Subject: [PATCH][BUGFIX] fallbacks at page_cgroup allocation.
> > 
> > Under SPARSEMEM, the page_struct is allocated per section.
> > Then, pfn_valid() for the whole section is "true" and there are page
> > structs. But, it's not related to valid range of [start_pfn, end_pfn)
> > and some page structs may not be initialized collectly because
> > it's not a valid pages.
> > (memmap_init_zone() skips a page which is not correct in
> >  early_node_map[] and page->flags is initialized to be 0.)
> > 
> > In this case, a page->flags can be '0'. Assume a case where
> > node 0 has no memory....
> > 
> > page_cgroup is allocated onto the node
> > 
> >    - page_to_nid(head of section pfn)
> > 
> > Head's pfn will be valid (struct page exists) but page->flags is 0 and
> contains
> > node_id:0. This causes allocation onto NODE_DATA(0) and cause panic.
> > 
> > This patch makes page_cgroup to use alloc_pages_exact() only
> > when NID is N_NORMAL_MEMORY.
> 
> I don't like this much as it essentially will allocate the array from
> a (semantically) random node, as long as it has memory.
> 

Agreed. It means that on some configurations the struct pages will be
node-local and on others node-remote, depending on whether the node
starts are section-aligned or not. That will be difficult to detect,
so minimally it should spit out a big warning when the struct pages
are allocated on remote nodes.

> IMO, the problem is either 1) looking at PFNs outside known node
> ranges, or 2) having present/valid sections partially outside of node
> ranges. 
> I am leaning towards 2), so I am wondering about the
> following fix:
> 

I strongly suspect it's 1 :)

While sections have valid (albeit potentially uninitialised) struct
pages outside node boundaries, within the node boundaries the
page-to-nid mapping is always valid.

> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] sparse: only mark sections present when fully covered by
> memory
> 
> When valid memory ranges are to be registered with sparsemem, make
> sure that only fully covered sections are marked as present.
> 

This potentially wastes a lot of memory on architectures with large
sections.
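
Roughly quantified, assuming the 128M sections of x86-64 with 4K pages
(section sizes are architecture-dependent, so this is only a sketch):
==
#include <stdio.h>

int main(void)
{
	/* illustrative: x86-64 default section size with 4K pages */
	unsigned long section_mb = 128;

	/* a node unaligned at both ends forfeits up to two partial
	 * sections if only fully covered sections are marked present */
	printf("worst case per node: ~%lu MiB dropped\n", 2 * section_mb);
	return 0;
}
==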

> Otherwise we end up with PFN ranges that are reported present and
> valid but are actually backed by uninitialized mem map.
> 
> The page_cgroup allocator relies on pfn_present() being reliable for
> all PFNs between 0 and max_pfn, then retrieve the node id stored in
> the corresponding page->flags to allocate the per-section page_cgroup
> arrays on the local node.
> 
> This lead to at least one crash in the page allocator on a system
> where the uninitialized page struct returned the id for node 0, which
> had no memory itself.
> 
> Reported-by: qcui@redhat.com
> Debugged-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Not-Yet-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aa64b12..a4fbeb8 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -182,7 +182,9 @@ void __init memory_present(int nid, unsigned long start,
> unsigned long end)
>  {
>       unsigned long pfn;
>  
> -     start &= PAGE_SECTION_MASK;
> +     start = ALIGN(start, PAGES_PER_SECTION);
> +     end &= PAGE_SECTION_MASK;
> +
>       mminit_validate_memmodel_limits(&start, &end);
>       for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
>               unsigned long section = pfn_to_section_nr(pfn);
> 

I'm afraid I do not like this because of the potential memory wastage.

I think the real problem is that the page cgroup allocator expects that
sections are fully populated. Look at this place, for example:

int __meminit online_page_cgroup(unsigned long start_pfn,
                        unsigned long nr_pages,
                        int nid)
{
        unsigned long start, end, pfn;
        int fail = 0;

        start = start_pfn & ~(PAGES_PER_SECTION - 1);
        end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);

        for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
                if (!pfn_present(pfn))
                        continue;
                fail = init_section_page_cgroup(pfn);
        }
        if (!fail)

That is fully expecting aligned sections and there is no guarantee of
that.

During this walk, you also need to verify that the struct page is for
the node you expect, with something like:

       for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
                if (!pfn_present(pfn))
                        continue;
		/* Watch for overlapping or unaligned nodes */
		if (page_to_nid(pfn_to_page(pfn)) != nid)
			continue;
                fail = init_section_page_cgroup(pfn);
        }
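
A self-contained simulation of that guarded walk, with a made-up
pfn-to-nid mapping, shows the off-node section being skipped:
==
#include <stdio.h>

#define PAGES_PER_SECTION 32768UL	/* illustrative */

/* made-up topology: sections 0-2 belong to node 1, section 3 to node 3 */
static int fake_pfn_to_nid(unsigned long pfn)
{
	return (pfn / PAGES_PER_SECTION) < 3 ? 1 : 3;
}

int main(void)
{
	unsigned long end = 4 * PAGES_PER_SECTION;
	unsigned long pfn;
	int nid = 1;	/* the node whose sections we are initializing */

	for (pfn = 0; pfn < end; pfn += PAGES_PER_SECTION) {
		if (fake_pfn_to_nid(pfn) != nid) {
			printf("section %lu: skipped, belongs elsewhere\n",
			       pfn / PAGES_PER_SECTION);
			continue;
		}
		printf("section %lu: init page_cgroup for node %d\n",
		       pfn / PAGES_PER_SECTION, nid);
	}
	return 0;
}
==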
Comment 20 KAMEZAWA Hiroyuki 2011-06-07 09:40:43 UTC
On Tue, 7 Jun 2011 10:09:00 +0100
Mel Gorman <mgorman@suse.de> wrote:
 
> I should have said "nodes" even though the end result is the same. The
> problem at the moment is cgroup initialisation is checking PFNs outside
> node boundaries. It should be ensuring that the start and end PFNs it
> uses are within boundaries.
> 
Maybe you will like this kind of fix. Yes, this fixes the problem from the
bugzilla report. My concern is that this will not work for ARM.

This patch (and all the other patches) works on my test host.
==
make sparsemem's page_cgroup_init to be node aware.

With sparsemem, page_cgroup_init scans pfn from 0 to max_pfn.
But this may scan a pfn which is not on any node and can access
memmap which is not initialized.

This makes page_cgroup_init() for SPARSEMEM node aware.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page_cgroup.c |   26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

Index: linux-3.0-rc1/mm/page_cgroup.c
===================================================================
--- linux-3.0-rc1.orig/mm/page_cgroup.c
+++ linux-3.0-rc1/mm/page_cgroup.c
@@ -285,14 +285,32 @@ void __init page_cgroup_init(void)
 {
 	unsigned long pfn;
 	int fail = 0;
+	int node;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
-		if (!pfn_present(pfn))
-			continue;
-		fail = init_section_page_cgroup(pfn);
+	for_each_node_state(node, N_HIGH_MEMORY) {
+		unsigned long start_pfn, end_pfn;
+
+		start_pfn = NODE_DATA(node)->node_start_pfn;
+		end_pfn = start_pfn + NODE_DATA(node)->node_spanned_pages;
+		/*
+		 * This calculation makes sure that this nid detection for
+		 * section can work even if node->start_pfn is not aligned to
+		 * section. For sections on not-node-boundary, we see head
+		 * page of sections.
+		 */
+		for (pfn = start_pfn;
+		     !fail & (pfn < end_pfn);
+		     pfn = ALIGN(pfn + PAGES_PER_SECTION, PAGES_PER_SECTION)) {
+			if (!pfn_present(pfn))
+				continue;
+			/* Nodes can be overlapped */
+			if (pfn_to_nid(pfn) != node)
+				continue;
+			fail = init_section_page_cgroup(pfn);
+		}
 	}
 	if (fail) {
 		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
Comment 21 Anonymous Emailer 2011-06-07 10:13:32 UTC
Reply-To: mgorman@suse.de

On Tue, Jun 07, 2011 at 06:06:30PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 7 Jun 2011 10:03:13 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Tue, Jun 07, 2011 at 09:57:08AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 6 Jun 2011 14:45:19 -0700
> > > Andrew Morton <akpm@linux-foundation.org> wrote:
> > > 
> > > > Hopefully he can test this one for us as well, thanks.
> > > > 
> > > 
> > > A  patch with better description (of mine) is here.
> > > Anyway, I felt I needed a fix for ARM special case.
> > > 
> > > ==
> > > fix-init-page_cgroup-for-sparsemem-taking-care-of-broken-page-flags.patch
> > > Even with SPARSEMEM, there are some magical memmap.
> > > 
> > 
> > Who wants to introduce SPARSEMEM_MAGICAL?
> > 
> 
> ARM guys ;)
> 
> > > If a Node is not aligned to SECTION, memmap of pfn which is out of
> > > Node's range is not initialized. And page->flags contains 0.
> > > 
> > 
> > This is tangential but it might be worth introducing
> > CONFIG_DEBUG_MEMORY_MODEL that WARN_ONs page->flag == 0 in
> > pfn_to_page() to catch some accesses outside node boundaries. Not for
> > this bug though.
> > 
> 
> Hmm, buf if zone == 0 && section == 0 && nid == 0, page->flags is 0.
> 

Sorry, what I meant to suggest was that page->flags outside of
boundaries be initialised to a poison value that is an impossible
combination of flags and check that.

> > > If Node(0) doesn't exist, NODE_DATA(pfn_to_nid(pfn)) causes error.
> > > 
> > 
> > Well, not in itself. It causes a bug when we try allocate memory
> > from node 0 but there is a subtle performance bug here as well. For
> > unaligned nodes, the cgroup information can be allocated from node
> > 0 instead of node-local.
> > 
> > > In another case, for example, ARM frees memmap which is never be used
> > > even under SPARSEMEM. In that case, page->flags will contain broken
> > > value.
> > > 
> > 
> > Again, not as such. In that case, struct page is not valid memory
> > at all.
> 
> Hmm, IIUC, ARM's code frees memmap by free_bootmem().....so, memory used 
> for 'struct page' is valid and can access (but it's not struct page.)
> 
> If my English sounds strange, I'm sorry. Hm
> 
> How about this ?
> == 
>  In another case, for example, ARM frees memmap which is never be used
>  and reuse memory for memmap for other purpose. So, in that case,
>  a page got by pfn_to_page(pfn) may not a struct page.
> ==
> 

Much better.

> 
> 
> > 
> > > This patch does a strict check on nid which is obtained by
> > > pfn_to_page() and use proper NID for page_cgroup allocation.
> > > 
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > ---
> > >  mm/page_cgroup.c |   36 +++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 35 insertions(+), 1 deletion(-)
> > > 
> > > Index: linux-3.0-rc1/mm/page_cgroup.c
> > > ===================================================================
> > > --- linux-3.0-rc1.orig/mm/page_cgroup.c
> > > +++ linux-3.0-rc1/mm/page_cgroup.c
> > > @@ -168,6 +168,7 @@ static int __meminit init_section_page_c
> > >   struct mem_section *section;
> > >   unsigned long table_size;
> > >   unsigned long nr;
> > > + unsigned long tmp;
> > >   int nid, index;
> > >  
> > >   nr = pfn_to_section_nr(pfn);
> > > @@ -175,8 +176,41 @@ static int __meminit init_section_page_c
> > >  
> > >   if (section->page_cgroup)
> > >           return 0;
> > > + /*
> > > +  * check Node-ID. Because we get 'pfn' which is obtained by
> calculation,
> > > +  * the pfn may "not exist" or "alreay freed". Even if pfn_valid()
> returns
> > > +  * true, page->flags may contain broken value and pfn_to_nid() returns
> > > +  * bad value.
> > > +  * (See CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and ARM's free_memmap())
> > > +  * So, we need to do careful check, here.
> > > +  */
> > 
> > You don't really need to worry about ARM here as long as you stay
> > within node boundaries and you only care about the first valid page
> > in the node. Why not lookup NODE_DATA(nid) and make sure start and
> > end are within the node boundaries?
> > 
> 
> I thought ARM's code just takes care of MAX_ORDER alignment..

That is not the same as section alignment, and whatever alignment it is
using, the start of the node is still going to be valid.
Comment 22 Anonymous Emailer 2011-06-07 10:19:34 UTC
Reply-To: mgorman@suse.de

On Tue, Jun 07, 2011 at 06:33:02PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 7 Jun 2011 10:09:00 +0100
> Mel Gorman <mgorman@suse.de> wrote:
>  
> > I should have said "nodes" even though the end result is the same. The
> > problem at the moment is cgroup initialisation is checking PFNs outside
> > node boundaries. It should be ensuring that the start and end PFNs it
> > uses are within boundaries.
> > 
> Maybe you like this kind of fix. Yes, this can fix the problem on bugzilla.
> My concern is this will not work for ARM. 
> 
> This patch (and all other patch) works on my test host.
> ==
> make sparsemem's page_cgroup_init to be node aware.
> 
> With sparsemem, page_cgroup_init scans pfn from 0 to max_pfn.
> But this may scan a pfn which is not on any node and can access
> memmap which is not initialized.
> 
> This makes page_cgroup_init() for SPARSEMEM node aware.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/page_cgroup.c |   26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> Index: linux-3.0-rc1/mm/page_cgroup.c
> ===================================================================
> --- linux-3.0-rc1.orig/mm/page_cgroup.c
> +++ linux-3.0-rc1/mm/page_cgroup.c
> @@ -285,14 +285,32 @@ void __init page_cgroup_init(void)
>  {
>       unsigned long pfn;
>       int fail = 0;
> +     int node;
>  
>       if (mem_cgroup_disabled())
>               return;
>  
> -     for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> -             if (!pfn_present(pfn))
> -                     continue;
> -             fail = init_section_page_cgroup(pfn);
> +     for_each_node_state(node, N_HIGH_MEMORY) {
> +             unsigned long start_pfn, end_pfn;
> +
> +             start_pfn = NODE_DATA(node)->node_start_pfn;
> +             end_pfn = start_pfn + NODE_DATA(node)->node_spanned_pages;
> +             /*
> +              * This calculation makes sure that this nid detection for
> +              * section can work even if node->start_pfn is not aligned to
> +              * section. For sections on not-node-boundary, we see head
> +              * page of sections.
> +              */
> +             for (pfn = start_pfn;
> +                  !fail & (pfn < end_pfn);

&& instead of & there?

> +                  pfn = ALIGN(pfn + PAGES_PER_SECTION, PAGES_PER_SECTION)) {
> +                     if (!pfn_present(pfn))
> +                             continue;
> +                     /* Nodes can be overlapped */
> +                     if (pfn_to_nid(pfn) != node)
> +                             continue;
> +                     fail = init_section_page_cgroup(pfn);
> +             }

So this is finding the first valid PFN in a node. Even the overlapping
problem should not be an issue here, unless node memory initialisation
is overwriting flags belonging to other nodes. The paranoia does not
hurt.

I also don't think ARM punching holes in the memmap is a problem,
because we'd at least expect the start of the node to be valid.
Comment 23 Johannes Weiner 2011-06-07 10:26:54 UTC
On Tue, Jun 07, 2011 at 04:55:37PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 7 Jun 2011 09:51:31 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> > @@ -283,23 +285,30 @@ static int __meminit page_cgroup_callback(struct
> notifier_block *self,
> >  
> >  void __init page_cgroup_init(void)
> >  {
> > -   unsigned long pfn;
> > -   int fail = 0;
> > +   pg_data_t *pgdat;
> >  
> >     if (mem_cgroup_disabled())
> >             return;
> >  
> > -   for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> > -           if (!pfn_present(pfn))
> > -                   continue;
> > -           fail = init_section_page_cgroup(pfn);
> > -   }
> > -   if (fail) {
> > -           printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> > -           panic("Out of memory");
> > -   } else {
> > -           hotplug_memory_notifier(page_cgroup_callback, 0);
> > +   for_each_online_pgdat(pgdat) {
> > +           unsigned long start;
> > +           unsigned long end;
> > +           unsigned long pfn;
> > +
> > +           start = pgdat->node_start_pfn & ~(PAGES_PER_SECTION - 1);
> > +           end = ALIGN(pgdat->node_start_pfn + pgdat->node_spanned_pages,
> > +                       PAGES_PER_SECTION);
> > +           for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> > +                   if (!pfn_present(pfn))
> > +                           continue;
> > +                   if (!init_section_page_cgroup(pgdat->node_id, pfn))
> > +                           continue;
> 
> AFAIK, nodes can overlap. So, this [start, end) scan doesn't work. sections
> may be initizalised mulitple times ...in wrong way. At here, what we can
> trust
> is nid in page->flags or early_node_map[]?.

Sections are not initialized multiple times. Once their
page_cgroup array is allocated, they are skipped if considered again
later.

Second, even if there are two nodes backing the memory of a single
section, there is still just a single page_cgroup array per section;
we have to take the memory from one node or the other.

So if both nodes N1 and N2 fall into section SN, SN->page_cgroup will
be an array of page_cgroup structures, allocated on N1, to represent
the pages of SN.

The first section considered when walking the PFNs of N2 will be SN,
which will be skipped because of !!SN->page_cgroup.

I do not see the problem.
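
A toy model of that skip logic, with section->page_cgroup reduced to an
array of pointers, shows a shared section being claimed exactly once by
whichever node walks it first (node numbers and spans are made up):
==
#include <stdio.h>
#include <stdlib.h>

#define NR_SECTIONS 4

/* stand-in for mem_section->page_cgroup */
static void *section_page_cgroup[NR_SECTIONS];

static int sketch_init_section(int section, int nid)
{
	if (section_page_cgroup[section]) {
		printf("section %d: already covered, skipped\n", section);
		return 0;
	}
	section_page_cgroup[section] = malloc(64);	/* placeholder */
	if (!section_page_cgroup[section])
		return -1;
	printf("section %d: array allocated on node %d\n", section, nid);
	return 0;
}

int main(void)
{
	int s;

	/* node 1 spans sections 0-2, node 2 spans sections 2-3;
	 * the shared section 2 goes to node 1, which walks first */
	for (s = 0; s <= 2; s++)
		sketch_init_section(s, 1);
	for (s = 2; s <= 3; s++)
		sketch_init_section(s, 2);
	return 0;
}
==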
Comment 24 KAMEZAWA Hiroyuki 2011-06-07 23:48:08 UTC
On Tue, 7 Jun 2011 11:18:57 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jun 07, 2011 at 06:33:02PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 7 Jun 2011 10:09:00 +0100
> > Mel Gorman <mgorman@suse.de> wrote:
> >  
> > > I should have said "nodes" even though the end result is the same. The
> > > problem at the moment is cgroup initialisation is checking PFNs outside
> > > node boundaries. It should be ensuring that the start and end PFNs it
> > > uses are within boundaries.
> > > 
> > Maybe you like this kind of fix. Yes, this can fix the problem on bugzilla.
> > My concern is this will not work for ARM. 
> > 
> > This patch (and all other patch) works on my test host.
> > ==
> > make sparsemem's page_cgroup_init to be node aware.
> > 
> > With sparsemem, page_cgroup_init scans pfn from 0 to max_pfn.
> > But this may scan a pfn which is not on any node and can access
> > memmap which is not initialized.
> > 
> > This makes page_cgroup_init() for SPARSEMEM node aware.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/page_cgroup.c |   26 ++++++++++++++++++++++----
> >  1 file changed, 22 insertions(+), 4 deletions(-)
> > 
> > Index: linux-3.0-rc1/mm/page_cgroup.c
> > ===================================================================
> > --- linux-3.0-rc1.orig/mm/page_cgroup.c
> > +++ linux-3.0-rc1/mm/page_cgroup.c
> > @@ -285,14 +285,32 @@ void __init page_cgroup_init(void)
> >  {
> >     unsigned long pfn;
> >     int fail = 0;
> > +   int node;
> >  
> >     if (mem_cgroup_disabled())
> >             return;
> >  
> > -   for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> > -           if (!pfn_present(pfn))
> > -                   continue;
> > -           fail = init_section_page_cgroup(pfn);
> > +   for_each_node_state(node, N_HIGH_MEMORY) {
> > +           unsigned long start_pfn, end_pfn;
> > +
> > +           start_pfn = NODE_DATA(node)->node_start_pfn;
> > +           end_pfn = start_pfn + NODE_DATA(node)->node_spanned_pages;
> > +           /*
> > +            * This calculation makes sure that this nid detection for
> > +            * section can work even if node->start_pfn is not aligned to
> > +            * section. For sections on not-node-boundary, we see head
> > +            * page of sections.
> > +            */
> > +           for (pfn = start_pfn;
> > +                !fail & (pfn < end_pfn);
> 
> && instead of & there?
> 
> > +                pfn = ALIGN(pfn + PAGES_PER_SECTION, PAGES_PER_SECTION)) {
> > +                   if (!pfn_present(pfn))
> > +                           continue;
> > +                   /* Nodes can be overlapped */
> > +                   if (pfn_to_nid(pfn) != node)
> > +                           continue;
> > +                   fail = init_section_page_cgroup(pfn);
> > +           }
> 
> So this is finding the first valid PFN in a node. Even the
> overlapping problem should not be an issue here as unless node memory
> initialisation is overwriting flags belonging to other nodes. The
> paranoia does not hurt.
> 
> I also don't think the ARM punching holes in the memmap is a problem
> because we'd at least expect the start of the node to be valid.
> 

OK, I'll post a fixed and cleaned-up one. (The above patch has a bug. ;(

Thanks,
-Kame
Comment 25 KAMEZAWA Hiroyuki 2011-06-07 23:53:19 UTC
On Tue, 7 Jun 2011 12:26:11 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Jun 07, 2011 at 04:55:37PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 7 Jun 2011 09:51:31 +0200
> > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > @@ -283,23 +285,30 @@ static int __meminit page_cgroup_callback(struct
> notifier_block *self,
> > >  
> > >  void __init page_cgroup_init(void)
> > >  {
> > > - unsigned long pfn;
> > > - int fail = 0;
> > > + pg_data_t *pgdat;
> > >  
> > >   if (mem_cgroup_disabled())
> > >           return;
> > >  
> > > - for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> > > -         if (!pfn_present(pfn))
> > > -                 continue;
> > > -         fail = init_section_page_cgroup(pfn);
> > > - }
> > > - if (fail) {
> > > -         printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> > > -         panic("Out of memory");
> > > - } else {
> > > -         hotplug_memory_notifier(page_cgroup_callback, 0);
> > > + for_each_online_pgdat(pgdat) {
> > > +         unsigned long start;
> > > +         unsigned long end;
> > > +         unsigned long pfn;
> > > +
> > > +         start = pgdat->node_start_pfn & ~(PAGES_PER_SECTION - 1);
> > > +         end = ALIGN(pgdat->node_start_pfn + pgdat->node_spanned_pages,
> > > +                     PAGES_PER_SECTION);
> > > +         for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> > > +                 if (!pfn_present(pfn))
> > > +                         continue;
> > > +                 if (!init_section_page_cgroup(pgdat->node_id, pfn))
> > > +                         continue;
> > 
> > AFAIK, nodes can overlap. So, this [start, end) scan doesn't work. sections
> > may be initizalised mulitple times ...in wrong way. At here, what we can
> trust
> > is nid in page->flags or early_node_map[]?.
> 
> Sections are not be initialized multiple times.  Once their
> page_cgroup array is allocated they are skipped if considered again
> later.
> 
> Second, even if there are two nodes backing the memory of a single
> section, there is still just a single page_cgroup array per section,
> we have to take the memory from one node or the other.
> 
> So if both node N1 and N2 fall into section SN, SN->page_cgroup will
> be an array of page_cgroup structures, allocated on N1, to represent
> the pages of SN.
> 
> The first section considered when walking the PFNs of N2 will be SN,
> which will be skipped because of !!SN->page_cgroup.
> 
> I do not see the problem.
> 

Assume a host with the node layout

N0 | N1 | N2 | N3 | N4 | N0

Pages for all nodes will be allocated onto node 0.

Thanks,
-Kame
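
A simulation of that layout (one made-up section per region) shows why:
node 0's pgdat span brackets everyone else, so a span-based walk that runs
node 0 first claims every section and allocates every array from node 0:
==
#include <stdio.h>

#define NR_SECTIONS 6

/* the interleaved layout above, one section per region */
static const int section_nid[NR_SECTIONS] = { 0, 1, 2, 3, 4, 0 };
static int owner[NR_SECTIONS] = { -1, -1, -1, -1, -1, -1 };

int main(void)
{
	int node, s;

	/* walk each node's [first, last] section span; first claim wins */
	for (node = 0; node <= 4; node++) {
		int first = -1, last = -1;

		for (s = 0; s < NR_SECTIONS; s++) {
			if (section_nid[s] != node)
				continue;
			if (first < 0)
				first = s;
			last = s;
		}
		if (first < 0)
			continue;	/* node has no sections */
		for (s = first; s <= last; s++)
			if (owner[s] < 0)
				owner[s] = node;
	}
	for (s = 0; s < NR_SECTIONS; s++)
		printf("section %d (node %d memory): array on node %d\n",
		       s, section_nid[s], owner[s]);
	return 0;
}
==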
Comment 26 KAMEZAWA Hiroyuki 2011-06-08 00:49:53 UTC
On Wed, 8 Jun 2011 08:40:34 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 7 Jun 2011 11:18:57 +0100
> Mel Gorman <mgorman@suse.de> wrote:
 
> > I also don't think the ARM punching holes in the memmap is a problem
> > because we'd at least expect the start of the node to be valid.
> > 
> 
> Ok, I'll post a fixed and cleaned one. (above patch has bug ;(
> 
> Thanks,
> -Kame
> 

A fixed one is here. I found another bug in node hotplug and will post
a fix for it later.
==

With sparsemem, page_cgroup_init scans pfn from 0 to max_pfn.
But this may scan a pfn which is not on any node and can access
memmap which is not initialized.

This makes page_cgroup_init() for SPARSEMEM node aware and removes
the code that gets the nid from page->flags. (Then, we'll always use
a valid NID.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page_cgroup.c |   41 +++++++++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 8 deletions(-)

Index: linux-3.0-rc1/mm/page_cgroup.c
===================================================================
--- linux-3.0-rc1.orig/mm/page_cgroup.c
+++ linux-3.0-rc1/mm/page_cgroup.c
@@ -162,21 +162,25 @@ static void free_page_cgroup(void *addr)
 }
 #endif
 
-static int __meminit init_section_page_cgroup(unsigned long pfn)
+static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
 {
 	struct page_cgroup *base, *pc;
 	struct mem_section *section;
 	unsigned long table_size;
 	unsigned long nr;
-	int nid, index;
+	int index;
 
+	/*
+	 * Even if passed 'pfn' is not aligned to section, we need to align
+	 * it to section boundary because of SPARSEMEM pfn calculation.
+	 */
+	pfn = ALIGN(pfn, PAGES_PER_SECTION);
 	nr = pfn_to_section_nr(pfn);
 	section = __nr_to_section(nr);
 
 	if (section->page_cgroup)
 		return 0;
 
-	nid = page_to_nid(pfn_to_page(pfn));
 	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
 	base = alloc_page_cgroup(table_size, nid);
 
@@ -228,7 +232,7 @@ int __meminit online_page_cgroup(unsigne
 	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
 		if (!pfn_present(pfn))
 			continue;
-		fail = init_section_page_cgroup(pfn);
+		fail = init_section_page_cgroup(pfn, nid);
 	}
 	if (!fail)
 		return 0;
@@ -285,14 +289,35 @@ void __init page_cgroup_init(void)
 {
 	unsigned long pfn;
 	int fail = 0;
+	int node;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
-		if (!pfn_present(pfn))
-			continue;
-		fail = init_section_page_cgroup(pfn);
+	for_each_node_state(node, N_HIGH_MEMORY) {
+		unsigned long start_pfn, end_pfn;
+
+		start_pfn = NODE_DATA(node)->node_start_pfn;
+		end_pfn = start_pfn + NODE_DATA(node)->node_spanned_pages;
+		/*
+	 * Because we cannot trust page->flags of pages outside the node
+	 * boundary, we skip pfn < start_pfn.
+		 */
+		for (pfn = start_pfn;
+		     !fail && (pfn < end_pfn);
+		     pfn = ALIGN(pfn + PAGES_PER_SECTION, PAGES_PER_SECTION)) {
+			if (!pfn_present(pfn))
+				continue;
+			/*
+			 * Nodes can overlap.
+			 * We know some arches can have a node layout like
+			 * -------------pfn-------------->
+			 * N0 | N1 | N2 | N0 | N1 | N2 |.....
+			 */
+			if (pfn_to_nid(pfn) != node)
+				continue;
+			fail = init_section_page_cgroup(pfn, node);
+		}
 	}
 	if (fail) {
 		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
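
For context, a sketch of the failure path the removed page_to_nid()
call could take (inferred from the patch and the original boot oops;
not code from the thread): for a pfn inside a memory hole, the memmap
entry was never initialized, so page->flags holds garbage.

	nid  = page_to_nid(pfn_to_page(pfn));	/* garbage flags => bogus nid */
	base = alloc_page_cgroup(table_size, nid);

The bogus nid then feeds bad node data into the page allocator, which
is consistent with the reported oops in __alloc_pages_nodemask().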
Comment 27 Anonymous Emailer 2011-06-08 07:44:59 UTC
Reply-To: mgorman@suse.de

On Wed, Jun 08, 2011 at 09:42:19AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 8 Jun 2011 08:40:34 +0900
> <SNIP>

Missing a subject.

> 
> With sparsemem, page_cgroup_init scans pfns from 0 to max_pfn.
> But this may scan pfns which are not on any node and access
> memmap which is not initialized.
> 
> This makes page_cgroup_init() node aware for SPARSEMEM and removes
> the code that gets the nid from page->flags. (Then, we'll always
> use a valid NID.)
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/page_cgroup.c |   41 +++++++++++++++++++++++++++++++++--------
>  1 file changed, 33 insertions(+), 8 deletions(-)
> 
> Index: linux-3.0-rc1/mm/page_cgroup.c
> ===================================================================
> --- linux-3.0-rc1.orig/mm/page_cgroup.c
> +++ linux-3.0-rc1/mm/page_cgroup.c
> @@ -162,21 +162,25 @@ static void free_page_cgroup(void *addr)
>  }
>  #endif
>  
> -static int __meminit init_section_page_cgroup(unsigned long pfn)
> +static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
>  {
>       struct page_cgroup *base, *pc;
>       struct mem_section *section;
>       unsigned long table_size;
>       unsigned long nr;
> -     int nid, index;
> +     int index;
>  
> +     /*
> +      * Even if the passed 'pfn' is not section-aligned, we must align
> +      * it to the section boundary because of the SPARSEMEM pfn calculation.
> +      */
> +     pfn = ALIGN(pfn, PAGES_PER_SECTION);
>       nr = pfn_to_section_nr(pfn);

This comment is a bit opaque and from the context of the patch,
it's hard to know why the alignment is necessary. At least move the
alignment next to where section->page_cgroup is set, because it'll
be easier to understand what is going on and why.

>       section = __nr_to_section(nr);
>  
>       if (section->page_cgroup)
>               return 0;
>  
> -     nid = page_to_nid(pfn_to_page(pfn));
>       table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
>       base = alloc_page_cgroup(table_size, nid);
>  
> @@ -228,7 +232,7 @@ int __meminit online_page_cgroup(unsigne
>       for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
>               if (!pfn_present(pfn))
>                       continue;
> -             fail = init_section_page_cgroup(pfn);
> +             fail = init_section_page_cgroup(pfn, nid);
>       }
>       if (!fail)
>               return 0;
> @@ -285,14 +289,35 @@ void __init page_cgroup_init(void)
>  {
>       unsigned long pfn;
>       int fail = 0;
> +     int node;
>  

Very nit-picky, but you sometimes use node and sometimes use nid.
Personally, nid is my preferred choice of name as its meaning is
unambiguous.

>       if (mem_cgroup_disabled())
>               return;
>  
> -     for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> -             if (!pfn_present(pfn))
> -                     continue;
> -             fail = init_section_page_cgroup(pfn);
> +     for_each_node_state(node, N_HIGH_MEMORY) {
> +             unsigned long start_pfn, end_pfn;
> +
> +             start_pfn = NODE_DATA(node)->node_start_pfn;
> +             end_pfn = start_pfn + NODE_DATA(node)->node_spanned_pages;
> +             /*
> +              * Because we cannot trust page->flags of pages outside the node
> +              * boundary, we skip pfn < start_pfn.
> +              */
> +             for (pfn = start_pfn;
> +                  !fail && (pfn < end_pfn);
> +                  pfn = ALIGN(pfn + PAGES_PER_SECTION, PAGES_PER_SECTION)) {
> +                     if (!pfn_present(pfn))
> +                             continue;

Why did you not use pfn_valid()? 

pfn_valid checks a section has SECTION_HAS_MEM_MAP
pfn_present checks a section has SECTION_MARKED_PRESENT

SECTION_MARKED_PRESENT does not necessarily mean mem_map has been
allocated, although I admit that this is somewhat unlikely. I'm just
curious if you had a reason for avoiding pfn_valid()?

> +                     /*
> +                      * Nodes can overlap.
> +                      * We know some arches can have a node layout like
> +                      * -------------pfn-------------->
> +                      * N0 | N1 | N2 | N0 | N1 | N2 |.....
> +                      */
> +                     if (pfn_to_nid(pfn) != node)
> +                             continue;
> +                     fail = init_section_page_cgroup(pfn, node);
> +             }
>       }
>       if (fail) {
>               printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> 

FWIW, overall I think this is heading in the right direction.

Thanks.
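
For readers following along, the distinction Mel draws comes down to
two SPARSEMEM section flags. A simplified sketch of the two helpers
(paraphrased from include/linux/mmzone.h of that era; the real ones
also range-check the section number):

	static inline int pfn_present(unsigned long pfn)
	{
		struct mem_section *ms = __pfn_to_section(pfn);
		return ms && (ms->section_mem_map & SECTION_MARKED_PRESENT);
	}

	static inline int pfn_valid(unsigned long pfn)
	{
		struct mem_section *ms = __pfn_to_section(pfn);
		return ms && (ms->section_mem_map & SECTION_HAS_MEM_MAP);
	}

Only pfn_valid() guarantees that pfn_to_page(pfn) points at initialized
memmap, which is what page_to_nid() relies on.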
Comment 28 Anonymous Emailer 2011-06-08 09:04:03 UTC
Reply-To: mgorman@suse.de

On Wed, Jun 08, 2011 at 05:45:05PM +0900, KAMEZAWA Hiroyuki wrote:
> > > <SNIP>
> > > +                 /*
> > > +                  * Nodes can overlap.
> > > +                  * We know some arches can have a node layout like
> > > +                  * -------------pfn-------------->
> > > +                  * N0 | N1 | N2 | N0 | N1 | N2 |.....
> > > +                  */
> > > +                 if (pfn_to_nid(pfn) != node)
> > > +                         continue;
> > > +                 fail = init_section_page_cgroup(pfn, node);
> > > +         }
> > >   }
> > >   if (fail) {
> > >           printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> > > 
> > 
> > FWIW, overall I think this is heading in the right direction.
> > 
> Thank you. And I noticed I misunderstood what ALIGN() does.
> 

And I missed it :/

> This patch is made against the latest mainline git tree.
> Tested on my host, at least.
> ==
> From 0485201fec6a9bbfabc4c2674756360c05080155 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Wed, 8 Jun 2011 17:13:37 +0900
> Subject: [PATCH] [BUGFIX] Avoid getting nid from invalid struct page at
> page_cgroup allocation.
> 
> With sparsemem, page_cgroup_init scans pfns from 0 to max_pfn.
> But this may scan pfns which are not on any node and access
> memmap which is not initialized.
> 
> This makes page_cgroup_init() node aware for SPARSEMEM and removes
> the code that gets the nid from page->flags. (Then, we'll always
> use a valid NID.)
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 

I do not see any problems now. Thanks!

Acked-by: Mel Gorman <mgorman@suse.de>
Comment 29 Johannes Weiner 2011-06-08 09:33:42 UTC
On Wed, Jun 08, 2011 at 08:45:38AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 7 Jun 2011 12:26:11 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Tue, Jun 07, 2011 at 04:55:37PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 7 Jun 2011 09:51:31 +0200
> > > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > @@ -283,23 +285,30 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
> > > >  
> > > >  void __init page_cgroup_init(void)
> > > >  {
> > > > -       unsigned long pfn;
> > > > -       int fail = 0;
> > > > +       pg_data_t *pgdat;
> > > >  
> > > >         if (mem_cgroup_disabled())
> > > >                 return;
> > > >  
> > > > -       for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> > > > -               if (!pfn_present(pfn))
> > > > -                       continue;
> > > > -               fail = init_section_page_cgroup(pfn);
> > > > -       }
> > > > -       if (fail) {
> > > > -               printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> > > > -               panic("Out of memory");
> > > > -       } else {
> > > > -               hotplug_memory_notifier(page_cgroup_callback, 0);
> > > > +       for_each_online_pgdat(pgdat) {
> > > > +               unsigned long start;
> > > > +               unsigned long end;
> > > > +               unsigned long pfn;
> > > > +
> > > > +               start = pgdat->node_start_pfn & ~(PAGES_PER_SECTION - 1);
> > > > +               end = ALIGN(pgdat->node_start_pfn + pgdat->node_spanned_pages,
> > > > +                           PAGES_PER_SECTION);
> > > > +               for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> > > > +                       if (!pfn_present(pfn))
> > > > +                               continue;
> > > > +                       if (!init_section_page_cgroup(pgdat->node_id, pfn))
> > > > +                               continue;
> > > 
> > > AFAIK, nodes can overlap. So, this [start, end) scan doesn't work. Sections
> > > may be initialised multiple times ... in the wrong way. Here, what we can
> > > trust is the nid in page->flags or early_node_map[]?
> > 
> > Sections are not initialized multiple times.  Once their
> > page_cgroup array is allocated they are skipped if considered again
> > later.
> > 
> > Second, even if there are two nodes backing the memory of a single
> > section, there is still just a single page_cgroup array per section,
> > we have to take the memory from one node or the other.
> > 
> > So if both node N1 and N2 fall into section SN, SN->page_cgroup will
> > be an array of page_cgroup structures, allocated on N1, to represent
> > the pages of SN.
> > 
> > The first section considered when walking the PFNs of N2 will be SN,
> > which will be skipped because of !!SN->page_cgroup.
> > 
> > I do not see the problem.
> > 
> 
> Assume a host with a node layout of
> 
> N0 | N1 | N2 | N3 | N4 | N0
> 
> 
> pages for all nodes will be allocated onto node-0.

That makes sense, thanks for the explanation!
Comment 30 KAMEZAWA Hiroyuki 2011-06-08 09:42:55 UTC
On Wed, 8 Jun 2011 08:43:50 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Jun 08, 2011 at 09:42:19AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 8 Jun 2011 08:40:34 +0900
> > <SNIP>
> 
> Missing a subject.
> 
> > 
> > With sparsemem, page_cgroup_init scans pfns from 0 to max_pfn.
> > But this may scan pfns which are not on any node and access
> > memmap which is not initialized.
> > 
> > This makes page_cgroup_init() node aware for SPARSEMEM and removes
> > the code that gets the nid from page->flags. (Then, we'll always
> > use a valid NID.)
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/page_cgroup.c |   41 +++++++++++++++++++++++++++++++++--------
> >  1 file changed, 33 insertions(+), 8 deletions(-)
> > 
> > Index: linux-3.0-rc1/mm/page_cgroup.c
> > ===================================================================
> > --- linux-3.0-rc1.orig/mm/page_cgroup.c
> > +++ linux-3.0-rc1/mm/page_cgroup.c
> > @@ -162,21 +162,25 @@ static void free_page_cgroup(void *addr)
> >  }
> >  #endif
> >  
> > -static int __meminit init_section_page_cgroup(unsigned long pfn)
> > +static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
> >  {
> >     struct page_cgroup *base, *pc;
> >     struct mem_section *section;
> >     unsigned long table_size;
> >     unsigned long nr;
> > -   int nid, index;
> > +   int index;
> >  
> > +   /*
> > +    * Even if the passed 'pfn' is not section-aligned, we must align
> > +    * it to the section boundary because of the SPARSEMEM pfn calculation.
> > +    */
> > +   pfn = ALIGN(pfn, PAGES_PER_SECTION);
> >     nr = pfn_to_section_nr(pfn);
> 
> This comment is a bit opaque and from the context of the patch,
> it's hard to know why the alignment is necessary. At least move the
> alignment next to where section->page_cgroup is set, because it'll
> be easier to understand what is going on and why.
> 

ok.


> >     section = __nr_to_section(nr);
> >  
> >     if (section->page_cgroup)
> >             return 0;
> >  
> > -   nid = page_to_nid(pfn_to_page(pfn));
> >     table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> >     base = alloc_page_cgroup(table_size, nid);
> >  
> > @@ -228,7 +232,7 @@ int __meminit online_page_cgroup(unsigne
> >     for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
> >             if (!pfn_present(pfn))
> >                     continue;
> > -           fail = init_section_page_cgroup(pfn);
> > +           fail = init_section_page_cgroup(pfn, nid);
> >     }
> >     if (!fail)
> >             return 0;
> > @@ -285,14 +289,35 @@ void __init page_cgroup_init(void)
> >  {
> >     unsigned long pfn;
> >     int fail = 0;
> > +   int node;
> >  
> 
> Very nit-picky, but you sometimes use node and sometimes use nid.
> Personally, nid is my preferred choice of name as its meaning is
> unambiguous.
> 

ok.


> >     if (mem_cgroup_disabled())
> >             return;
> >  
> > -   for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> > -           if (!pfn_present(pfn))
> > -                   continue;
> > -           fail = init_section_page_cgroup(pfn);
> > +   for_each_node_state(node, N_HIGH_MEMORY) {
> > +           unsigned long start_pfn, end_pfn;
> > +
> > +           start_pfn = NODE_DATA(node)->node_start_pfn;
> > +           end_pfn = start_pfn + NODE_DATA(node)->node_spanned_pages;
> > +           /*
> > +            * Because we cannot trust page->flags of pages outside the node
> > +            * boundary, we skip pfn < start_pfn.
> > +            */
> > +           for (pfn = start_pfn;
> > +                !fail && (pfn < end_pfn);
> > +                pfn = ALIGN(pfn + PAGES_PER_SECTION, PAGES_PER_SECTION)) {
> > +                   if (!pfn_present(pfn))
> > +                           continue;
> 
> Why did you not use pfn_valid()? 
> 
> pfn_valid checks a section has SECTION_HAS_MEM_MAP
> pfn_present checks a section has SECTION_MARKED_PRESENT
> 
> SECTION_MARKED_PRESENT does not necessarily mean mem_map has been
> allocated, although I admit that this is somewhat unlikely. I'm just
> curious if you had a reason for avoiding pfn_valid()?
> 

hm, maybe I misunderstand something. I'll use pfn_valid().


> > +                   /*
> > +                    * Nodes can overlap.
> > +                    * We know some arches can have a node layout like
> > +                    * -------------pfn-------------->
> > +                    * N0 | N1 | N2 | N0 | N1 | N2 |.....
> > +                    */
> > +                   if (pfn_to_nid(pfn) != node)
> > +                           continue;
> > +                   fail = init_section_page_cgroup(pfn, node);
> > +           }
> >     }
> >     if (fail) {
> >             printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> > 
> 
> FWIW, overall I think this is heading in the right direction.
> 
Thank you. And I noticed I misunderstood what ALIGN() does.

This patch is made against the latest mainline git tree.
Tested on my host, at least.
==
From 0485201fec6a9bbfabc4c2674756360c05080155 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Wed, 8 Jun 2011 17:13:37 +0900
Subject: [PATCH] [BUGFIX] Avoid getting nid from invalid struct page at page_cgroup allocation.

With sparsemem, page_cgroup_init scans pfns from 0 to max_pfn.
But this may scan pfns which are not on any node and access
memmap which is not initialized.

This makes page_cgroup_init() node aware for SPARSEMEM and removes
the code that gets the nid from page->flags. (Then, we'll always
use a valid NID.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changelog:
  - moved the pfn alignment calculation next to where it's really meaningful
  - use "nid" instead of node.
  - use pfn_valid() instead of pfn_present().
  - fixed usage of ALIGN()...I had misunderstood it.
---
 mm/page_cgroup.c |   43 ++++++++++++++++++++++++++++++++++---------
 1 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 74ccff6..ccc0c87 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -162,13 +162,13 @@ static void free_page_cgroup(void *addr)
 }
 #endif
 
-static int __meminit init_section_page_cgroup(unsigned long pfn)
+static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
 {
 	struct page_cgroup *base, *pc;
 	struct mem_section *section;
 	unsigned long table_size;
 	unsigned long nr;
-	int nid, index;
+	int index;
 
 	nr = pfn_to_section_nr(pfn);
 	section = __nr_to_section(nr);
@@ -176,7 +176,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
 	if (section->page_cgroup)
 		return 0;
 
-	nid = page_to_nid(pfn_to_page(pfn));
 	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
 	base = alloc_page_cgroup(table_size, nid);
 
@@ -196,7 +195,11 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
 		pc = base + index;
 		init_page_cgroup(pc, nr);
 	}
-
+	/*
+	 * Even if the passed 'pfn' is not section-aligned, we must align
+	 * it to the section boundary because of the SPARSEMEM pfn calculation.
+	 */
+	pfn = pfn & ~(PAGES_PER_SECTION - 1);
 	section->page_cgroup = base - pfn;
 	total_usage += table_size;
 	return 0;
@@ -228,7 +231,7 @@ int __meminit online_page_cgroup(unsigned long start_pfn,
 	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
 		if (!pfn_present(pfn))
 			continue;
-		fail = init_section_page_cgroup(pfn);
+		fail = init_section_page_cgroup(pfn, nid);
 	}
 	if (!fail)
 		return 0;
@@ -285,14 +288,36 @@ void __init page_cgroup_init(void)
 {
 	unsigned long pfn;
 	int fail = 0;
+	int nid;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
-		if (!pfn_present(pfn))
-			continue;
-		fail = init_section_page_cgroup(pfn);
+	for_each_node_state(nid, N_HIGH_MEMORY) {
+		unsigned long start_pfn, end_pfn;
+
+		start_pfn = NODE_DATA(nid)->node_start_pfn;
+		end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+		/*
+	 * Because we cannot trust page->flags of pages outside the node
+	 * boundary, we skip pfn < start_pfn.
+		 */
+		for (pfn = start_pfn;
+		     !fail && (pfn < end_pfn);
+		     pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
+
+			if (!pfn_valid(pfn))
+				continue;
+			/*
+			 * Nodes can overlap.
+			 * We know some arches can have a node layout like
+			 * -------------pfn-------------->
+			 * N0 | N1 | N2 | N0 | N1 | N2 |.....
+			 */
+			if (pfn_to_nid(pfn) != nid)
+				continue;
+			fail = init_section_page_cgroup(pfn, nid);
+		}
 	}
 	if (fail) {
 		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
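
The reason the mask belongs right before the assignment is the pointer
arithmetic that the lookup later undoes. A sketch close to the
mm/page_cgroup.c lookup of that era:

	struct page_cgroup *lookup_page_cgroup(struct page *page)
	{
		unsigned long pfn = page_to_pfn(page);
		struct mem_section *section = __pfn_to_section(pfn);

		if (!section->page_cgroup)
			return NULL;
		/* base - aligned_pfn + pfn indexes [0, PAGES_PER_SECTION) */
		return section->page_cgroup + pfn;
	}

Since the lookup adds the raw pfn back, "base - pfn" must be computed
with the section-aligned pfn; storing an unaligned one would make
lookups for lower pfns in the same section point before the array.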
Comment 31 Johannes Weiner 2011-06-08 10:15:57 UTC
On Wed, Jun 08, 2011 at 05:45:05PM +0900, KAMEZAWA Hiroyuki wrote:
> @@ -196,7 +195,11 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
>               pc = base + index;
>               init_page_cgroup(pc, nr);
>       }
> -
> +     /*
> +      * Even if the passed 'pfn' is not section-aligned, we must align
> +      * it to the section boundary because of the SPARSEMEM pfn calculation.
> +      */
> +     pfn = pfn & ~(PAGES_PER_SECTION - 1);

PAGE_SECTION_MASK?
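
For reference, PAGE_SECTION_MASK is just the named form of the
open-coded mask (from include/linux/mmzone.h; PAGES_PER_SECTION is a
power of two):

	#define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION - 1))

	/* so the following two statements are equivalent: */
	pfn = pfn & ~(PAGES_PER_SECTION - 1);
	pfn = pfn & PAGE_SECTION_MASK;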

>       section->page_cgroup = base - pfn;
>       total_usage += table_size;
>       return 0;
> @@ -228,7 +231,7 @@ int __meminit online_page_cgroup(unsigned long start_pfn,
>       for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
>               if (!pfn_present(pfn))
>                       continue;
> -             fail = init_section_page_cgroup(pfn);
> +             fail = init_section_page_cgroup(pfn, nid);

AFAICS, nid can be -1 in the hotplug callbacks when there is a new
section added to a node that already has memory, and then the
allocation will fall back to numa_node_id().

So I think we either need to trust start_pfn has valid mem map backing
it (ARM has no memory hotplug support) and use pfn_to_nid(start_pfn),
or find another way to the right node, no?

> @@ -285,14 +288,36 @@ void __init page_cgroup_init(void)
>  {
>       unsigned long pfn;
>       int fail = 0;
> +     int nid;
>  
>       if (mem_cgroup_disabled())
>               return;
>  
> -     for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> -             if (!pfn_present(pfn))
> -                     continue;
> -             fail = init_section_page_cgroup(pfn);
> +     for_each_node_state(nid, N_HIGH_MEMORY) {
> +             unsigned long start_pfn, end_pfn;
> +
> +             start_pfn = NODE_DATA(nid)->node_start_pfn;
> +             end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
> +             /*
> +              * Because we cannot trust page->flags of page out of node
> +              * boundary, we skip pfn < start_pfn.
> +              */
> +             for (pfn = start_pfn;
> +                  !fail && (pfn < end_pfn);
> +                  pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {

If we don't bother to align the pfn on the first iteration, I don't
think we should for subsequent iterations.  init_section_page_cgroup()
has to be able to cope anyway.  How about

	pfn += PAGES_PER_SECTION

instead?
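
A sketch of the fallback Johannes is asking for, which the next
revision adopts (assumption: hotplug's start_pfn is backed by valid
memmap, since memory hotplug itself derives the nid from it):

	/* online_page_cgroup(), hotplug path */
	if (nid == -1)
		nid = pfn_to_nid(start_pfn);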
Comment 32 KAMEZAWA Hiroyuki 2011-06-09 01:11:40 UTC
On Wed, 8 Jun 2011 12:15:11 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, Jun 08, 2011 at 05:45:05PM +0900, KAMEZAWA Hiroyuki wrote:
> > @@ -196,7 +195,11 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
> >             pc = base + index;
> >             init_page_cgroup(pc, nr);
> >     }
> > -
> > +   /*
> > +    * Even if the passed 'pfn' is not section-aligned, we must align
> > +    * it to the section boundary because of the SPARSEMEM pfn calculation.
> > +    */
> > +   pfn = pfn & ~(PAGES_PER_SECTION - 1);
> 
> PAGE_SECTION_MASK?
> 
Will use it.

> >     section->page_cgroup = base - pfn;
> >     total_usage += table_size;
> >     return 0;
> > @@ -228,7 +231,7 @@ int __meminit online_page_cgroup(unsigned long start_pfn,
> >     for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
> >             if (!pfn_present(pfn))
> >                     continue;
> > -           fail = init_section_page_cgroup(pfn);
> > +           fail = init_section_page_cgroup(pfn, nid);
> 
> AFAICS, nid can be -1 in the hotplug callbacks when there is a new
> section added to a node that already has memory, and then the
> allocation will fall back to numa_node_id().
> 
Ah, thank you for pointing that out.

> So I think we either need to trust start_pfn has valid mem map backing
> it (ARM has no memory hotplug support) and use pfn_to_nid(start_pfn),
> or find another way to the right node, no?
> 

memory hotplug itself does nid = page_to_nid(pfn_to_page(pfn)),
so we can assume the memmap of start_pfn will be valid.



> > @@ -285,14 +288,36 @@ void __init page_cgroup_init(void)
> >  {
> >     unsigned long pfn;
> >     int fail = 0;
> > +   int nid;
> >  
> >     if (mem_cgroup_disabled())
> >             return;
> >  
> > -   for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> > -           if (!pfn_present(pfn))
> > -                   continue;
> > -           fail = init_section_page_cgroup(pfn);
> > +   for_each_node_state(nid, N_HIGH_MEMORY) {
> > +           unsigned long start_pfn, end_pfn;
> > +
> > +           start_pfn = NODE_DATA(nid)->node_start_pfn;
> > +           end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
> > +           /*
> > +            * Because we cannot trust page->flags of pages outside the node
> > +            * boundary, we skip pfn < start_pfn.
> > +            */
> > +           for (pfn = start_pfn;
> > +                !fail && (pfn < end_pfn);
> > +                pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
> 
> If we don't bother to align the pfn on the first iteration, I don't
> think we should for subsequent iterations.  init_section_page_cgroup()
> has to be able to cope anyway.  How about
> 
>       pfn += PAGES_PER_SECTION
> 
> instead?
> 

I thought of that, but it means the (pfn < end_pfn) check goes wrong
if pfn is not aligned:

                   pfn-------->end_pfn----------->pfn+PAGES_PER_SECTION

Pages in [pfn..end_pfn) will not be handled. But I'd like to think about
this in the next version.

Thank you for review.

Thanks,
-Kame
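
A worked example of the case Kame is guarding against (numbers are
purely illustrative; assume PAGES_PER_SECTION == 0x1000):

	/*
	 * start_pfn = 0x0800, end_pfn = 0x1400: the node straddles two
	 * sections.
	 *
	 * pfn += PAGES_PER_SECTION visits 0x0800 only; the next value,
	 * 0x1800, is >= end_pfn, so the node's pages in section 1,
	 * [0x1000, 0x1400), are never initialized.
	 *
	 * pfn = ALIGN(pfn + 1, PAGES_PER_SECTION) visits 0x0800, then
	 * 0x1000, covering both sections.
	 */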
Comment 33 KAMEZAWA Hiroyuki 2011-06-09 01:49:21 UTC
On Thu, 9 Jun 2011 10:04:34 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 8 Jun 2011 12:15:11 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:

> Thank you for review.

Updated.
==
From a4f043565ad2388f930d956958b1167761e7d6a0 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Wed, 8 Jun 2011 17:13:37 +0900
Subject: [PATCH] [BUGFIX] Avoid getting nid from invalid struct page at page_cgroup allocation.

With sparsemem, page_cgroup_init scans pfns from 0 to max_pfn.
But this may scan pfns which are not on any node and access
memmap which is not initialized.

This makes page_cgroup_init() node aware for SPARSEMEM and removes
the code that gets the nid from page->flags. (Then, we'll always
use a valid NID.)

This is a fix for Bug 36192.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changelog:
Jun/9
  - use PAGE_SECTION_MASK
  - add comments for init_section_page_cgroup
  - changed return value for init_section_page_cgroup
  - changed loop style.
  - fixed nid handling at memory hotplug.
Jun/8
  - moved ALIGN(pfn, PAGES_PER_SECTION) next to where it's really meaningful
  - use "nid" instead of node.
  - use pfn_valid() instead of pfn_present().
  - fixed usage of ALIGN()...I had misunderstood it.
---
 mm/page_cgroup.c |   88 +++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 63 insertions(+), 25 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 74ccff6..9ef0657 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -162,21 +162,28 @@ static void free_page_cgroup(void *addr)
 }
 #endif
 
-static int __meminit init_section_page_cgroup(unsigned long pfn)
+/** init_section_page_cgroup - initialize page_cgroup for a section.
+ * @pfn - a pfn within the section to be initialized.
+ * @nid - nid the memory should be allocated from.
+ *
+ * Allocates and initializes page_cgroup for a section. On success, this
+ * returns the start_pfn of the next section. On failure, this returns 0.
+ */
+static unsigned long
+__meminit init_section_page_cgroup(unsigned long pfn, int nid)
 {
 	struct page_cgroup *base, *pc;
 	struct mem_section *section;
 	unsigned long table_size;
 	unsigned long nr;
-	int nid, index;
+	int index;
 
 	nr = pfn_to_section_nr(pfn);
 	section = __nr_to_section(nr);
 
 	if (section->page_cgroup)
-		return 0;
+		goto out;
 
-	nid = page_to_nid(pfn_to_page(pfn));
 	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
 	base = alloc_page_cgroup(table_size, nid);
 
@@ -189,17 +196,22 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
 
 	if (!base) {
 		printk(KERN_ERR "page cgroup allocation failure\n");
-		return -ENOMEM;
+		return 0;
 	}
 
 	for (index = 0; index < PAGES_PER_SECTION; index++) {
 		pc = base + index;
 		init_page_cgroup(pc, nr);
 	}
-
+	/*
+	 * Even if the passed 'pfn' is not section-aligned, we must align
+	 * it to the section boundary because of the SPARSEMEM pfn calculation.
+	 */
+	pfn = pfn & PAGE_SECTION_MASK;
 	section->page_cgroup = base - pfn;
 	total_usage += table_size;
-	return 0;
+out:
+	return ALIGN(pfn + 1, PAGES_PER_SECTION);
 }
 #ifdef CONFIG_MEMORY_HOTPLUG
 void __free_page_cgroup(unsigned long pfn)
@@ -220,19 +232,25 @@ int __meminit online_page_cgroup(unsigned long start_pfn,
 			int nid)
 {
 	unsigned long start, end, pfn;
-	int fail = 0;
 
 	start = start_pfn & ~(PAGES_PER_SECTION - 1);
 	end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
 
-	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
-		if (!pfn_present(pfn))
+	if (nid == -1)
+		nid = pfn_to_nid(start_pfn);
+
+	pfn = start_pfn;
+	while (pfn < end) {
+		if (!pfn_valid(pfn)) {
+			pfn = ALIGN(pfn + 1, PAGES_PER_SECTION);
 			continue;
+		}
-		fail = init_section_page_cgroup(pfn);
+		pfn = init_section_page_cgroup(pfn, nid);
+		if (!pfn)
+			goto rollback;
 	}
-	if (!fail)
-		return 0;
-
+	return 0;
+rollback:
 	/* rollback */
 	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
 		__free_page_cgroup(pfn);
@@ -284,25 +300,45 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
 void __init page_cgroup_init(void)
 {
 	unsigned long pfn;
-	int fail = 0;
+	int nid;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
-		if (!pfn_present(pfn))
-			continue;
-		fail = init_section_page_cgroup(pfn);
-	}
-	if (fail) {
-		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
-		panic("Out of memory");
-	} else {
-		hotplug_memory_notifier(page_cgroup_callback, 0);
+	for_each_node_state(nid, N_HIGH_MEMORY) {
+		unsigned long start_pfn, end_pfn;
+
+		start_pfn = NODE_DATA(nid)->node_start_pfn;
+		end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+		/*
+	 * Because we cannot trust page->flags of pages outside the node
+	 * boundary, we skip pfn < start_pfn.
+		 */
+		pfn = start_pfn;
+		while (pfn < end_pfn) {
+			/*
+			 * Nodes can overlap.
+			 * We know some arches can have a node layout like
+			 * -------------pfn-------------->
+			 * N0 | N1 | N2 | N0 | N1 | N2 |.....
+			 */
+			if (!pfn_valid(pfn) || (pfn_to_nid(pfn) != nid)) {
+				pfn = ALIGN(pfn + 1, PAGES_PER_SECTION);
+				continue;
+			}
+			pfn = init_section_page_cgroup(pfn, nid);
+			if (!pfn)
+				goto fail;
+		}
 	}
+	hotplug_memory_notifier(page_cgroup_callback, 0);
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
 	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
 	" want memory cgroups\n");
+	return;
+fail:
+	printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+	panic("Out of memory");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
Comment 34 Florian Mickler 2011-06-21 11:57:34 UTC
A patch referencing this bug report has been merged in v3.0-rc4:

commit 37573e8c718277103f61f03741bdc5606d31b07e
Author: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date:   Wed Jun 15 15:08:42 2011 -0700

    memcg: fix init_page_cgroup nid with sparsemem
