Bug 12832 - kernel leaks a lot of memory
Summary: kernel leaks a lot of memory
Status: CLOSED OBSOLETE
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Importance: P1 high
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-07 11:27 UTC by Pierre Ossman
Modified: 2012-05-30 14:34 UTC
CC List: 1 user

See Also:
Kernel Version: 2.6.27
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description Pierre Ossman 2009-03-07 11:27:12 UTC
Latest working kernel version: 2.6.26
Earliest failing kernel version: 2.6.27
Distribution: Fedora
Hardware Environment: x86_64
Software Environment: Fedora 9 & 10
Problem Description:

Starting from 2.6.27, the kernel eats up a whole lot more memory (hundreds of MB) for no gain.

I've compared what I can against 2.6.26 and so far haven't found where this missing memory has gone.

Original bug in RH's bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=481448
Comment 1 Anonymous Emailer 2009-03-07 12:25:03 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat,  7 Mar 2009 11:27:12 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12832
> 
>            Summary: kernel leaks a lot of memory
>            Product: Memory Management
>            Version: 2.5
>      KernelVersion: 2.6.27
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: akpm@osdl.org
>         ReportedBy: drzeus-bugzilla@drzeus.cx
> 
> 
> Latest working kernel version: 2.6.26
> Earliest failing kernel version: 2.6.27
> Distribution: Fedora
> Hardware Environment: x86_64
> Software Environment: Fedora 9 & 10
> Problem Description:
> 
> Starting from 2.6.27, the kernel eats up a whole lot more of memory (hundreds
> of MB) at no gain.
> 
> I've compared what I can from 2.6.26 and so far haven't found where this
> missing memory has disappeared.
> 
> Original bug in RH's bugzilla:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=481448
> 

hm, not a lot to go on there.

We have quite a lot of instrumentation for memory consumption - were
you able to work out where it went by comparing /proc/meminfo,
/proc/slabinfo, `echo m > /proc/sysrq-trigger', etc?
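The comparison Andrew suggests is easiest to do offline; a minimal sketch (labels and paths here are illustrative, not from the thread) that snapshots the world-readable accounting files once per kernel:

```shell
# Snapshot /proc memory accounting under a label, e.g. "2.6.26" on one
# boot and "2.6.27" on the other, then diff the directories offline.
snap() {
    dir="/tmp/memsnap-$1"
    mkdir -p "$dir"
    for f in meminfo vmstat zoneinfo; do
        cat "/proc/$f" > "$dir/$f"
    done
}
```

Afterwards, `diff -u /tmp/memsnap-2.6.26/meminfo /tmp/memsnap-2.6.27/meminfo` (and likewise for the others); /proc/slabinfo and the `echo m > /proc/sysrq-trigger` dump need root and can be captured the same way.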

Is the memory missing on initial boot up, or does it take some time for
the problem to become evident?
Comment 2 Anonymous Emailer 2009-03-07 13:01:10 UTC
Reply-To: drzeus@drzeus.cx

On Sat, 7 Mar 2009 12:24:52 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> hm, not a lot to go on there.
> 
> We have quite a lot of instrumentation for memory consumption - were
> you able to work out where it went by comparing /proc/meminfo,
> /proc/slabinfo, `echo m > /proc/sysrq-trigger', etc?
> 

The redhat entry contains all the info, and I've compared meminfo and
slabinfo without finding anything even close to the chunks of lost
memory.

I've attached the sysrq memory stats from 2.6.26 and 2.6.27. The only
difference, though, is in the reported free pages.

I'm not very familiar with all the instrumentation, so pointers are
very welcome.

> Is the memory missing on initial boot up, or does it take some time for
> the problem to become evident?
> 

Initial boot as far as I can tell.


Rgds
Comment 3 Anonymous Emailer 2009-03-07 14:13:52 UTC
Reply-To: akpm@linux-foundation.org

On Sat, 7 Mar 2009 22:00:55 +0100 Pierre Ossman <drzeus@drzeus.cx> wrote:

> On Sat, 7 Mar 2009 12:24:52 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > 
> > hm, not a lot to go on there.
> > 
> > We have quite a lot of instrumentation for memory consumption - were
> > you able to work out where it went by comparing /proc/meminfo,
> > /proc/slabinfo, `echo m > /proc/sysrq-trigger', etc?
> > 
> 
> The redhat entry contains all the info, and I've compared meminfo and
> slabinfo without finding anything even close to the chunks of lost
> memory.

Ok.

> I've attached the sysrq memory stats from 2.6.26 and 2.6.27. The only
> difference though is in the reported free pages

Drat.

> I'm not very familiar with all the instrumentation, so pointers are
> very welcome.
> 
> > Is the memory missing on initial boot up, or does it take some time for
> > the problem to become evident?
> > 
> 
> Initial boot as far as I can tell.

OK.  In that case it might be that someone gobbled a lot of bootmem.

Unfortunately we only added the bootmem_debug boot option in 2.6.27.

Below is a super-quick hackport of that patch into 2.6.26.  That will
allow us (ie: you ;)) to compare bootmem allocations between the two
kernels.

Unfortunately bootmem-debugging doesn't tell us _who_ allocated the
memory, so I stuck a dump_stack() in there too.


diff -puN mm/bootmem.c~bdebug mm/bootmem.c
--- a/mm/bootmem.c~bdebug
+++ a/mm/bootmem.c
@@ -48,6 +48,22 @@ unsigned long __init bootmem_bootmap_pag
 	return mapsize;
 }
 
+static int bootmem_debug;
+
+static int __init bootmem_debug_setup(char *buf)
+{
+	bootmem_debug = 1;
+	return 0;
+}
+early_param("bootmem_debug", bootmem_debug_setup);
+
+#define bdebug(fmt, args...) ({				\
+	if (unlikely(bootmem_debug))			\
+		printk(KERN_INFO			\
+			"bootmem::%s " fmt,		\
+			__FUNCTION__, ## args);		\
+})
+
 /*
  * link bdata in order
  */
@@ -213,10 +229,10 @@ static void __init free_bootmem_core(boo
 	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
 		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
 
-	for (i = sidx; i < eidx; i++) {
-		if (unlikely(!test_and_clear_bit(i, bdata->node_bootmem_map)))
-			BUG();
-	}
+	for (i = sidx; i < eidx; i++)
+		if (test_and_set_bit(i, bdata->node_bootmem_map))
+			bdebug("hm, page %lx reserved twice.\n",
+				PFN_DOWN(bdata->node_boot_start) + i);
 }
 
 /*
@@ -252,6 +268,12 @@ __alloc_bootmem_core(struct bootmem_data
 	if (!bdata->node_bootmem_map)
 		return NULL;
 
+	bdebug("size=%lx [%lu pages] align=%lx goal=%lx limit=%lx\n",
+		size, PAGE_ALIGN(size) >> PAGE_SHIFT,
+		align, goal, limit);
+	if (bootmem_debug)
+		dump_stack();
+
 	/* bdata->node_boot_start is supposed to be (12+6)bits alignment on x86_64 ? */
 	node_boot_start = bdata->node_boot_start;
 	node_bootmem_map = bdata->node_bootmem_map;
@@ -359,6 +381,10 @@ found:
 		ret = phys_to_virt(start * PAGE_SIZE + node_boot_start);
 	}
 
+	bdebug("start=%lx end=%lx\n",
+		start + PFN_DOWN(bdata->node_boot_start),
+		start + areasize + PFN_DOWN(bdata->node_boot_start));
+
 	/*
 	 * Reserve the area now:
 	 */
@@ -432,6 +458,7 @@ static unsigned long __init free_all_boo
 	}
 	total += count;
 	bdata->node_bootmem_map = NULL;
+	bdebug("released=%lx\n", count);
 
 	return total;
 }
_
Comment 4 Anonymous Emailer 2009-03-07 14:16:43 UTC
Reply-To: akpm@linux-foundation.org


Now another possibility is that someone is gobbling lots of memory
during initcalls.

So here's an untested addition to the `initcall_debug' boot option
which should permit us to work out how much memory each initcall
consumed:

--- a/init/main.c~a
+++ a/init/main.c
@@ -714,6 +714,7 @@ static void __init do_one_initcall(initc
 		print_fn_descriptor_symbol("initcall %s", fn);
 		printk(" returned %d after %Ld msecs\n", result,
 			(unsigned long long) delta.tv64 >> 20);
+		printk("remaining memory: %d\n", nr_free_buffer_pages());
 	}
 
 	msgbuf[0] = 0;
_
Comment 5 Anonymous Emailer 2009-03-07 14:53:42 UTC
Reply-To: drzeus@drzeus.cx

On Sat, 7 Mar 2009 14:13:16 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> Below is a super-quick hackport of that patch into 2.6.26.  That will
> allow us (ie: you ;)) to compare bootmem allocations between the two
> kernels.
> 

Compiling...

I take it you couldn't see anything like this on your end?

Rgds
Comment 6 Anonymous Emailer 2009-03-08 03:00:21 UTC
Reply-To: drzeus@drzeus.cx

On Sat, 7 Mar 2009 14:13:16 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> Below is a super-quick hackport of that patch into 2.6.26.  That will
> allow us (ie: you ;)) to compare bootmem allocations between the two
> kernels.
> 
> Unfortunately bootmem-debugging doesn't tell us _who_ allocated the
> memory, so I stuck a dump_stack() in there too.
> 

I'm having problems booting this machine on a vanilla 2.6.26. Fedora's
kernel works nice though, so I guess they have a bug fix for this. I've
attached a screenshot in case it rings any bells.

I'm working on getting the data from the 2.6.27 kernel, but right now
it doesn't seem like we're getting any numbers for comparison. :/

Rgds
Comment 7 Anonymous Emailer 2009-03-08 03:36:48 UTC
Reply-To: drzeus@drzeus.cx

On Sun, 8 Mar 2009 11:00:06 +0100
Pierre Ossman <drzeus@drzeus.cx> wrote:

> 
> I'm having problems booting this machine on a vanilla 2.6.26. Fedora's
> kernel works nice though, so I guess they have a bug fix for this. I've
> attached a screenshot in case it rings any bells.
> 

It turns out it's your backported patch that's the problem. I'll see if
I can get it working. :)

Rgds
Comment 8 Wu Fengguang 2009-03-08 05:40:05 UTC
On Sun, Mar 08, 2009 at 11:36:19AM +0100, Pierre Ossman wrote:
> On Sun, 8 Mar 2009 11:00:06 +0100
> Pierre Ossman <drzeus@drzeus.cx> wrote:
> 
> > 
> > I'm having problems booting this machine on a vanilla 2.6.26. Fedora's
> > kernel works nice though, so I guess they have a bug fix for this. I've
> > attached a screenshot in case it rings any bells.
> > 
> 
> It turns out it's your backported patch that's the problem. I'll see if
> I can get it working. :)

Pierre, you can try the following fixed and combined patch and boot kernel
with "initcall_debug bootmem_debug".
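For reference, both options go on the kernel command line together; on a Fedora box of that era the boot stanza might look like this (a sketch — the root device and image names are assumptions, not from the thread):

```
# illustrative /boot/grub/grub.conf entry
title Debug kernel (bootmem + initcall tracing)
    kernel /vmlinuz-2.6.26-bdebug ro root=/dev/sda1 initcall_debug bootmem_debug
    initrd /initrd-2.6.26-bdebug.img
```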

The boot hang was due to this chunk having floated from reserve_bootmem_core() into
free_bootmem_core()...

        @@ -213,10 +229,10 @@ static void __init free_bootmem_core(boo
                if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
                        eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);

        -       for (i = sidx; i < eidx; i++) {
        -               if (unlikely(!test_and_clear_bit(i, bdata->node_bootmem_map)))
        -                       BUG();
        -       }
        +       for (i = sidx; i < eidx; i++)
        +               if (test_and_set_bit(i, bdata->node_bootmem_map))
        +                       bdebug("hm, page %lx reserved twice.\n",
        +                               PFN_DOWN(bdata->node_boot_start) + i);
         }

         /*

Thanks,
Fengguang
---
From: Andrew Morton <akpm@linux-foundation.org>

---
 init/main.c  |    2 ++
 mm/bootmem.c |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

--- mm.orig/mm/bootmem.c
+++ mm/mm/bootmem.c
@@ -48,6 +48,22 @@ unsigned long __init bootmem_bootmap_pag
 	return mapsize;
 }
 
+static int bootmem_debug;
+
+static int __init bootmem_debug_setup(char *buf)
+{
+	bootmem_debug = 1;
+	return 0;
+}
+early_param("bootmem_debug", bootmem_debug_setup);
+
+#define bdebug(fmt, args...) ({				\
+	if (unlikely(bootmem_debug))			\
+		printk(KERN_INFO			\
+			"bootmem::%s " fmt,		\
+			__FUNCTION__, ## args);		\
+})
+
 /*
  * link bdata in order
  */
@@ -172,6 +188,14 @@ static void __init reserve_bootmem_core(
 	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
 		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
 
+	bdebug("size=%lx [%lu pages] start=%lx end=%lx flags=%x\n",
+		size, PAGE_ALIGN(size) >> PAGE_SHIFT,
+		sidx + PFN_DOWN(bdata->node_boot_start),
+		eidx + PFN_DOWN(bdata->node_boot_start),
+		flags);
+	if (bootmem_debug)
+		dump_stack();
+
 	for (i = sidx; i < eidx; i++) {
 		if (test_and_set_bit(i, bdata->node_bootmem_map)) {
 #ifdef CONFIG_DEBUG_BOOTMEM
@@ -252,6 +276,12 @@ __alloc_bootmem_core(struct bootmem_data
 	if (!bdata->node_bootmem_map)
 		return NULL;
 
+	bdebug("size=%lx [%lu pages] align=%lx goal=%lx limit=%lx\n",
+		size, PAGE_ALIGN(size) >> PAGE_SHIFT,
+		align, goal, limit);
+	if (bootmem_debug)
+		dump_stack();
+
 	/* bdata->node_boot_start is supposed to be (12+6)bits alignment on x86_64 ? */
 	node_boot_start = bdata->node_boot_start;
 	node_bootmem_map = bdata->node_bootmem_map;
@@ -359,6 +389,10 @@ found:
 		ret = phys_to_virt(start * PAGE_SIZE + node_boot_start);
 	}
 
+	bdebug("start=%lx end=%lx\n",
+		start + PFN_DOWN(bdata->node_boot_start),
+		start + areasize + PFN_DOWN(bdata->node_boot_start));
+
 	/*
 	 * Reserve the area now:
 	 */
@@ -432,6 +466,7 @@ static unsigned long __init free_all_boo
 	}
 	total += count;
 	bdata->node_bootmem_map = NULL;
+	bdebug("released=%lx\n", count);
 
 	return total;
 }
--- mm.orig/init/main.c
+++ mm/init/main.c
@@ -60,6 +60,7 @@
 #include <linux/sched.h>
 #include <linux/signal.h>
 #include <linux/idr.h>
+#include <linux/swap.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -714,6 +715,7 @@ static void __init do_one_initcall(initc
 		print_fn_descriptor_symbol("initcall %s", fn);
 		printk(" returned %d after %Ld msecs\n", result,
 			(unsigned long long) delta.tv64 >> 20);
+		printk("remaining memory: %d\n", nr_free_buffer_pages());
 	}
 
 	msgbuf[0] = 0;
Comment 9 Anonymous Emailer 2009-03-08 07:27:24 UTC
Reply-To: drzeus@drzeus.cx

On Sun, 8 Mar 2009 20:38:25 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
> Pierre, you can try the following fixed and combined patch and boot kernel
> with "initcall_debug bootmem_debug".
> 
> The boot hang was due to this chunk having floated from reserve_bootmem_core() into
> free_bootmem_core()...
> 

Yeah, I found that as well. I'm getting a decent output now.

Included are the dmesg dumps of bootmem_debug. I'll get the initcall
stuff in a bit.

Rgds
Comment 10 Anonymous Emailer 2009-03-08 08:54:14 UTC
Reply-To: drzeus@drzeus.cx

I've gone through the dumps now, and still no meaningful difference.
All the big bootmem allocations are present in both kernels, and the
remaining memory in initcall is also the same for both (and doesn't
really decrease by any meaningful amount).

I also tried booting with init=/bin/sh, and the lost memory is present
even at that point.

More ideas?

Rgds
Comment 11 Anonymous Emailer 2009-03-08 12:11:54 UTC
Reply-To: akpm@linux-foundation.org

On Sun, 8 Mar 2009 16:54:03 +0100 Pierre Ossman <drzeus@drzeus.cx> wrote:

> I've gone through the dumps now, and still no meaningful difference.
> All the big bootmem allocations are present in both kernels, and the
> remaining memory in initcall is also the same for both (and doesn't
> really decrease by any meaningful amount).
> 
> I also tried booting with init=/bin/sh, and the lost memory is present
> even at that point.
> 

So we know that the memory gets consumed after end-of-initcalls and
before exec-of-init?  
Comment 12 Anonymous Emailer 2009-03-08 12:24:07 UTC
Reply-To: drzeus@drzeus.cx

On Sun, 8 Mar 2009 12:11:43 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sun, 8 Mar 2009 16:54:03 +0100 Pierre Ossman <drzeus@drzeus.cx> wrote:
> 
> > I've gone through the dumps now, and still no meaningful difference.
> > All the big bootmem allocations are present in both kernels, and the
> > remaining memory in initcall is also the same for both (and doesn't
> > really decrease by any meaningful amount).
> > 
> > I also tried booting with init=/bin/sh, and the lost memory is present
> > even at that point.
> > 
> 
> So we know that the memory gets consumed after end-of-initcalls and
> before exec-of-init?  

This is a fedora machine, so initrd might be the provoking party here.
I haven't yet tried the adventure of booting without initrd. It's after
initcalls at least.

Right now I'm compiling 2.6.27-rc1 in an effort to bisect this, but if
you have something more worthwhile then shoot. :)
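The bisect Pierre mentions follows the standard workflow; a sketch as a helper function, assuming a mainline git clone with the usual version tags (the thread itself never reports a finished bisect):

```shell
# Mark the endpoints from the report: 2.6.26 good, 2.6.27 bad.
# git then proposes revisions to build, boot, and classify by hand.
start_leak_bisect() {
    git bisect start &&
    git bisect bad v2.6.27 &&
    git bisect good v2.6.26
}
# after booting each proposed revision:
#   git bisect good    # memory accounted for
#   git bisect bad     # memory missing
```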

Rgds
Comment 13 Wu Fengguang 2009-03-08 18:38:58 UTC
Hi Pierre,

On Sat, Mar 07, 2009 at 10:00:55PM +0100, Pierre Ossman wrote:
> On Sat, 7 Mar 2009 12:24:52 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > 
> > hm, not a lot to go on there.
> > 
> > We have quite a lot of instrumentation for memory consumption - were
> > you able to work out where it went by comparing /proc/meminfo,
> > /proc/slabinfo, `echo m > /proc/sysrq-trigger', etc?
> > 
> 
> The redhat entry contains all the info, and I've compared meminfo and
> slabinfo without finding anything even close to the chunks of lost
> memory.

The "free" pages in sysrq mem-info report should be equal to "MemFree"
in /proc/meminfo. So I'd expect meminfo numbers to be different in
.26/.27 as well.

Maybe the memory is taken by some user space program, so it would be
helpful to know the numbers in /proc/meminfo, /proc/vmstat and
/proc/zoneinfo.

> I've attached the sysrq memory stats from 2.6.26 and 2.6.27. The only
> difference though is in the reported free pages

The "free" entries in mem-info:

                      2.6.26     2.6.27
 ---------------------------------------
 free:                103730      62265  (pages)
 Node 0 DMA free:      10292kB     9448kB
 Node 0 DMA32 free:   404628kB   239612kB

So there are about 160MB fewer free pages in .27. Are you sure that the
initrd is freed after booting?

Thanks,
Fengguang
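One quick way to answer the initrd question (a sketch; the exact log wording varies by kernel version) is to look for the kernel's release message:

```shell
# The kernel logs a line like "Freeing initrd memory: 8192k freed" when
# it returns the initrd pages to the page allocator after boot.
dmesg | grep -i 'freeing initrd' || echo 'no initrd-free message in the log'
```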

> I'm not very familiar with all the instrumentation, so pointers are
> very welcome.
> 
> > Is the memory missing on initial boot up, or does it take some time for
> > the problem to become evident?
> > 
> 
> Initial boot as far as I can tell.
> 
> 
> Rgds
> -- 
>      -- Pierre Ossman
> 
>   WARNING: This correspondence is being monitored by the
>   Swedish government. Make sure your server uses encryption
>   for SMTP traffic and consider using PGP for end-to-end
>   encryption.

> Linux builder.drzeus.cx 2.6.26.6-79.fc9.x86_64 #1 SMP Fri Oct 17 14:20:33 EDT
> 2008 x86_64 x86_64 x86_64 GNU/Linux
> SysRq : Show Memory
> Mem-info:
> Node 0 DMA per-cpu:
> CPU    0: hi:    0, btch:   1 usd:   0
> Node 0 DMA32 per-cpu:
> CPU    0: hi:  186, btch:  31 usd: 115
> Active:8937 inactive:6285 dirty:48 writeback:0 unstable:0
>  free:103730 slab:5612 mapped:2148 pagetables:817 bounce:0
> Node 0 DMA free:10292kB min:48kB low:60kB high:72kB active:0kB inactive:0kB
> present:8908kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 489 489 489
> Node 0 DMA32 free:404628kB min:2804kB low:3504kB high:4204kB active:35748kB
> inactive:25140kB present:500896kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 3*4kB 5*8kB 4*16kB 4*32kB 3*64kB 3*128kB 3*256kB 3*512kB 3*1024kB
> 2*2048kB 0*4096kB = 10292kB
> Node 0 DMA32: 3*4kB 5*8kB 2*16kB 2*32kB 2*64kB 1*128kB 3*256kB 2*512kB
> 3*1024kB 3*2048kB 96*4096kB = 404628kB
> 9730 total pagecache pages
> Swap cache: add 0, delete 0, find 0/0
> Free swap  = 524280kB
> Total swap = 524280kB
> 131056 pages of RAM
> 3772 reserved pages
> 7750 pages shared
> 0 pages swap cached
> 

> Linux builder.drzeus.cx 2.6.27.4-19.fc9.x86_64 #1 SMP Thu Oct 30 19:30:01 EDT
> 2008 x86_64 x86_64 x86_64 GNU/Linux
> SysRq : Show Memory
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU    0: hi:    0, btch:   1 usd:   0
> Node 0 DMA32 per-cpu:
> CPU    0: hi:  186, btch:  31 usd:  86
> Active:8879 inactive:6265 dirty:8 writeback:0 unstable:0
>  free:62265 slab:5543 mapped:2154 pagetables:821 bounce:0
> Node 0 DMA free:9448kB min:40kB low:48kB high:60kB active:0kB inactive:0kB
> present:7804kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 489 489 489
> Node 0 DMA32 free:239612kB min:2808kB low:3508kB high:4212kB active:35516kB
> inactive:25060kB present:500896kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 4*4kB 3*8kB 2*16kB 5*32kB 4*64kB 2*128kB 2*256kB 2*512kB 3*1024kB
> 2*2048kB 0*4096kB = 9448kB
> Node 0 DMA32: 1*4kB 7*8kB 6*16kB 1*32kB 1*64kB 4*128kB 3*256kB 3*512kB
> 1*1024kB 3*2048kB 56*4096kB = 239612kB
> 9692 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap  = 524280kB
> Total swap = 524280kB
> 131056 pages RAM
> 4046 pages reserved
> 7770 pages shared
> 61196 pages non-shared
> 
Comment 14 Anonymous Emailer 2009-03-09 00:41:48 UTC
Reply-To: drzeus@drzeus.cx

On Mon, 9 Mar 2009 10:07:01 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Mon, Mar 09, 2009 at 09:37:42AM +0800, Wu Fengguang wrote:
> > 
> > The "free" pages in sysrq mem-info report should be equal to "MemFree"
> > in /proc/meminfo. So I'd expect meminfo numbers to be different in
> > .26/.27 as well.
> > 
> > Maybe the memory is taken by some user space program, so it would be
> > helpful to know the numbers in /proc/meminfo, /proc/vmstat and
> > /proc/zoneinfo.
> 
> And maybe piggyback /proc/slabinfo in case it is a kernel bug :-)
> 

Big dump of relevant /proc files:

[root@builder ~]# free
             total       used       free     shared    buffers     cached
Mem:        509108     236988     272120          0        228      14760
-/+ buffers/cache:     222000     287108
Swap:       524280        228     524052

[root@builder ~]# cat /proc/meminfo 
MemTotal:       509108 kB
MemFree:        272172 kB
Buffers:           240 kB
Cached:          14788 kB
SwapCached:         64 kB
Active:          32544 kB
Inactive:         5900 kB
SwapTotal:      524280 kB
SwapFree:       524052 kB
Dirty:            5980 kB
Writeback:           0 kB
AnonPages:       23404 kB
Mapped:           8648 kB
Slab:            23148 kB
SReclaimable:     5420 kB
SUnreclaim:      17728 kB
PageTables:       3324 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
WritebackTmp:        0 kB
CommitLimit:    778832 kB
Committed_AS:    85196 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      1740 kB
VmallocChunk: 34359736619 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
HugePages_Surp:      0
Hugepagesize:     2048 kB
DirectMap4k:      2032
DirectMap2M:  18446744073709551613
DirectMap1G:         0

[root@builder ~]# cat /proc/vmstat 
nr_free_pages 68035
nr_inactive 1479
nr_active 8137
nr_anon_pages 5851
nr_mapped 2162
nr_file_pages 3777
nr_dirty 132
nr_writeback 0
nr_slab_reclaimable 1354
nr_slab_unreclaimable 4440
nr_page_table_pages 831
nr_unstable 0
nr_bounce 0
nr_vmscan_write 324
nr_writeback_temp 0
numa_hit 18985527
numa_miss 0
numa_foreign 0
numa_interleave 44220
numa_local 18985527
numa_other 0
pgpgin 379025
pgpgout 820238
pswpin 16
pswpout 57
pgalloc_dma 295454
pgalloc_dma32 18721928
pgalloc_normal 0
pgalloc_movable 0
pgfree 19085491
pgactivate 60797
pgdeactivate 47199
pgfault 25624481
pgmajfault 2490
pgrefill_dma 8144
pgrefill_dma32 103508
pgrefill_normal 0
pgrefill_movable 0
pgsteal_dma 4503
pgsteal_dma32 179395
pgsteal_normal 0
pgsteal_movable 0
pgscan_kswapd_dma 4999
pgscan_kswapd_dma32 180546
pgscan_kswapd_normal 0
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 384
pgscan_direct_normal 0
pgscan_direct_movable 0
pginodesteal 0
slabs_scanned 153856
kswapd_steal 183628
kswapd_inodesteal 35303
pageoutrun 3794
allocstall 3
pgrotated 72
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0

[root@builder ~]# cat /proc/zoneinfo 
Node 0, zone      DMA
  pages free     2524
        min      12
        low      15
        high     18
        scanned  0 (a: 27 i: 24)
        spanned  4096
        present  2180
    nr_free_pages 2524
    nr_inactive  0
    nr_active    8
    nr_anon_pages 8
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 16
    nr_slab_unreclaimable 7
    nr_page_table_pages 15
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 292
    nr_writeback_temp 0
    numa_hit     295370
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   295370
    numa_other   0
        protection: (0, 489, 489, 489)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         0
Node 0, zone    DMA32
  pages free     65515
        min      700
        low      875
        high     1050
        scanned  0 (a: 0 i: 0)
        spanned  126960
        present  125224
    nr_free_pages 65515
    nr_inactive  1482
    nr_active    8137
    nr_anon_pages 5843
    nr_mapped    2162
    nr_file_pages 3789
    nr_dirty     128
    nr_writeback 0
    nr_slab_reclaimable 1331
    nr_slab_unreclaimable 4429
    nr_page_table_pages 816
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 32
    nr_writeback_temp 0
    numa_hit     18690260
    numa_miss    0
    numa_foreign 0
    numa_interleave 44220
    numa_local   18690260
    numa_other   0
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 69
              high:  186
              batch: 31
  vm stats threshold: 6
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096

[root@builder ~]# cat /proc/slabinfo 
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
rpc_inode_cache       39     39    832   39    8 : tunables    0    0    0 : slabdata      1      1      0
nf_conntrack_expect      0      0    240   34    2 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                 34     34    960   34    8 : tunables    0    0    0 : slabdata      1      1      0
TCPv6                 18     18   1792   18    8 : tunables    0    0    0 : slabdata      1      1      0
kmalloc_dma-512       32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
dm_snap_pending_exception    144    144    112   36    1 : tunables    0    0    0 : slabdata      4      4      0
kcopyd_job             0      0    360   45    4 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache     387   1554    768   42    8 : tunables    0    0    0 : slabdata     37     37      0
ext3_xattr            46     46     88   46    1 : tunables    0    0    0 : slabdata      1      1      0
journal_handle       170    170     24  170    1 : tunables    0    0    0 : slabdata      1      1      0
journal_head          42     42     96   42    1 : tunables    0    0    0 : slabdata      1      1      0
revoke_table         256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
revoke_record        128    128     32  128    1 : tunables    0    0    0 : slabdata      1      1      0
cfq_io_context        44     48    168   24    1 : tunables    0    0    0 : slabdata      2      2      0
mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
isofs_inode_cache      0      0    616   26    4 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache     28     28    584   28    4 : tunables    0    0    0 : slabdata      1      1      0
dquot                  0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
inotify_event_cache    612    612     40  102    1 : tunables    0    0    0 : slabdata      6      6      0
fasync_cache      313798 313820     24  170    1 : tunables    0    0    0 : slabdata   1846   1846      0
shmem_inode_cache    735    738    792   41    8 : tunables    0    0    0 : slabdata     18     18      0
pid_namespace          0      0   2104   15    8 : tunables    0    0    0 : slabdata      0      0      0
nsproxy                0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
UNIX                  92     92    704   46    8 : tunables    0    0    0 : slabdata      2      2      0
xfrm_dst_cache         0      0    384   42    4 : tunables    0    0    0 : slabdata      0      0      0
ip_dst_cache          51     75    320   25    2 : tunables    0    0    0 : slabdata      3      3      0
TCP                   19     19   1664   19    8 : tunables    0    0    0 : slabdata      1      1      0
blkdev_integrity       0      0    120   34    1 : tunables    0    0    0 : slabdata      0      0      0
blkdev_queue          34     34   1824   17    8 : tunables    0    0    0 : slabdata      2      2      0
blkdev_requests       38     52    304   26    2 : tunables    0    0    0 : slabdata      2      2      0
sock_inode_cache     138    138    704   46    8 : tunables    0    0    0 : slabdata      3      3      0
file_lock_cache       42     42    192   42    2 : tunables    0    0    0 : slabdata      1      1      0
taskstats             26     26    312   26    2 : tunables    0    0    0 : slabdata      1      1      0
proc_inode_cache      90    162    600   27    4 : tunables    0    0    0 : slabdata      6      6      0
sigqueue              25     25    160   25    1 : tunables    0    0    0 : slabdata      1      1      0
radix_tree_node      623   2581    560   29    4 : tunables    0    0    0 : slabdata     89     89      0
bdev_cache            42     42    768   42    8 : tunables    0    0    0 : slabdata      1      1      0
sysfs_dir_cache     7084   7089     80   51    1 : tunables    0    0    0 : slabdata    139    139      0
inode_cache         1505   1708    568   28    4 : tunables    0    0    0 : slabdata     61     61      0
dentry              2555   4485    208   39    2 : tunables    0    0    0 : slabdata    115    115      0
avc_node            1735   2128     72   56    1 : tunables    0    0    0 : slabdata     38     38      0
buffer_head         1583   5472    112   36    1 : tunables    0    0    0 : slabdata    152    152      0
mm_struct             75     78    832   39    8 : tunables    0    0    0 : slabdata      2      2      0
vm_area_struct      2223   2438    176   46    2 : tunables    0    0    0 : slabdata     53     53      0
files_cache           78     84    768   42    8 : tunables    0    0    0 : slabdata      2      2      0
signal_cache         105    108    896   36    8 : tunables    0    0    0 : slabdata      3      3      0
sighand_cache         85     90   2112   15    8 : tunables    0    0    0 : slabdata      6      6      0
task_struct          141    145   5840    5    8 : tunables    0    0    0 : slabdata     29     29      0
anon_vma             741    768     32  128    1 : tunables    0    0    0 : slabdata      6      6      0
shared_policy_node     85     85     48   85    1 : tunables    0    0    0 : slabdata      1      1      0
numa_policy           56     60    136   30    1 : tunables    0    0    0 : slabdata      2      2      0
idr_layer_cache      269    270    536   30    4 : tunables    0    0    0 : slabdata      9      9      0
kmalloc-4096         247    248   4096    8    8 : tunables    0    0    0 : slabdata     31     31      0
kmalloc-2048         345    352   2048   16    8 : tunables    0    0    0 : slabdata     22     22      0
kmalloc-1024         396    416   1024   32    8 : tunables    0    0    0 : slabdata     13     13      0
kmalloc-512          297    320    512   32    4 : tunables    0    0    0 : slabdata     10     10      0
kmalloc-256          985    992    256   32    2 : tunables    0    0    0 : slabdata     31     31      0
kmalloc-128         1899   2016    128   32    1 : tunables    0    0    0 : slabdata     63     63      0
kmalloc-64          6795   9600     64   64    1 : tunables    0    0    0 : slabdata    150    150      0
kmalloc-32         20735  20736     32  128    1 : tunables    0    0    0 : slabdata    162    162      0
kmalloc-16        138778 139264     16  256    1 : tunables    0    0    0 : slabdata    544    544      0
kmalloc-8           8190   8192      8  512    1 : tunables    0    0    0 : slabdata     16     16      0
kmalloc-192          972   1050    192   42    2 : tunables    0    0    0 : slabdata     25     25      0
kmalloc-96          2815   2856     96   42    1 : tunables    0    0    0 : slabdata     68     68      0
kmem_cache_node        0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
Comment 15 Wu Fengguang 2009-03-09 07:23:35 UTC
Hi Pierre,

On Mon, Mar 09, 2009 at 08:40:45AM +0100, Pierre Ossman wrote:
> On Mon, 9 Mar 2009 10:07:01 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Mon, Mar 09, 2009 at 09:37:42AM +0800, Wu Fengguang wrote:
> > > 
> > > The "free" pages in sysrq mem-info report should be equal to "MemFree"
> > > in /proc/meminfo. So I'd expect meminfo numbers to be different in
> > > .26/.27 as well.
> > > 
> > > Maybe the memory is taken by some user space program, so it would be
> > > helpful to know the numbers in /proc/meminfo, /proc/vmstat and
> > > /proc/zoneinfo.
> > 
> > And maybe piggyback /proc/slabinfo in case it is a kernel bug :-)
> > 
> 
> Big dump of relevant /proc files:

Thanks for the data! Now it seems that some pages are totally missing
from bootmem or slabs or page cache or any application consumptions...
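The cross-check being described can be sketched directly: sum the consumers /proc/meminfo itemizes and subtract from MemTotal; whatever the kernel cannot attribute (bootmem, kernel text, struct page arrays, ...) is the residue. A rough illustration with a hypothetical helper name, not part of any posted tool:

```python
def unaccounted_kb(meminfo):
    """Estimate the memory /proc/meminfo cannot attribute, in kB.

    meminfo maps field names to values in kB, as parsed from /proc/meminfo.
    The fields below are the major itemized consumers; kernel text, bootmem
    allocations and the mem_map are never itemized, so a residue of a few
    tens of MB is normal -- hundreds of MB suggests pages went missing.
    """
    itemized = ("MemFree", "Buffers", "Cached", "SwapCached", "AnonPages",
                "Slab", "PageTables", "VmallocUsed")
    return meminfo["MemTotal"] - sum(meminfo[f] for f in itemized)
```

Plugging in the numbers from the dump quoted below gives roughly 166 MB unaccounted, in line with the ~170 MB gap discussed later in this thread.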

Will searching through /proc/kpageflags for reserved pages help
identify the problem?

Oh kpageflags_read() does not include support for PG_reserved:

        #define KPF_LOCKED     0
        #define KPF_ERROR      1
        #define KPF_REFERENCED 2
        #define KPF_UPTODATE   3
        #define KPF_DIRTY      4
        #define KPF_LRU        5
        #define KPF_ACTIVE     6
        #define KPF_SLAB       7
        #define KPF_WRITEBACK  8
        #define KPF_RECLAIM    9
        #define KPF_BUDDY     10
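For reference, each read of /proc/kpageflags returns one little-endian u64 per page frame, with bits numbered as in the list above. A minimal decoder sketch (hypothetical helper names; reading the file itself needs root on a kernel that exports it):

```python
import struct

# Bit numbers exactly as listed above (2.6.27-era; no PG_reserved bit yet)
KPF = {
    "LOCKED": 0, "ERROR": 1, "REFERENCED": 2, "UPTODATE": 3, "DIRTY": 4,
    "LRU": 5, "ACTIVE": 6, "SLAB": 7, "WRITEBACK": 8, "RECLAIM": 9,
    "BUDDY": 10,
}

def decode(flags):
    """Return the set of flag names set in one kpageflags word."""
    return {name for name, bit in KPF.items() if flags & (1 << bit)}

def read_kpageflags(start_pfn=0, count=1024, path="/proc/kpageflags"):
    """Yield (pfn, flags) pairs: one 64-bit entry per page frame."""
    with open(path, "rb") as f:
        f.seek(start_pfn * 8)
        data = f.read(count * 8)
    for i, (flags,) in enumerate(struct.iter_unpack("<Q", data)):
        yield start_pfn + i, flags
```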

Thanks,
Fengguang

> [root@builder ~]# free
>              total       used       free     shared    buffers     cached
> Mem:        509108     236988     272120          0        228      14760
> -/+ buffers/cache:     222000     287108
> Swap:       524280        228     524052
> 
> [root@builder ~]# cat /proc/meminfo 
> MemTotal:       509108 kB
> MemFree:        272172 kB
> Buffers:           240 kB
> Cached:          14788 kB
> SwapCached:         64 kB
> Active:          32544 kB
> Inactive:         5900 kB
> SwapTotal:      524280 kB
> SwapFree:       524052 kB
> Dirty:            5980 kB
> Writeback:           0 kB
> AnonPages:       23404 kB
> Mapped:           8648 kB
> Slab:            23148 kB
> SReclaimable:     5420 kB
> SUnreclaim:      17728 kB
> PageTables:       3324 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> WritebackTmp:        0 kB
> CommitLimit:    778832 kB
> Committed_AS:    85196 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:      1740 kB
> VmallocChunk: 34359736619 kB
> HugePages_Total:     0
> HugePages_Free:      0
> HugePages_Rsvd:      0
> HugePages_Surp:      0
> Hugepagesize:     2048 kB
> DirectMap4k:      2032
> DirectMap2M:  18446744073709551613

This field looks weird.

> DirectMap1G:         0
> 
> [root@builder ~]# cat /proc/vmstat 
> nr_free_pages 68035
> nr_inactive 1479
> nr_active 8137
> nr_anon_pages 5851
> nr_mapped 2162
> nr_file_pages 3777
> nr_dirty 132
> nr_writeback 0
> nr_slab_reclaimable 1354
> nr_slab_unreclaimable 4440
> nr_page_table_pages 831
> nr_unstable 0
> nr_bounce 0
> nr_vmscan_write 324
> nr_writeback_temp 0
> numa_hit 18985527
> numa_miss 0
> numa_foreign 0
> numa_interleave 44220
> numa_local 18985527
> numa_other 0
> pgpgin 379025
> pgpgout 820238
> pswpin 16
> pswpout 57
> pgalloc_dma 295454
> pgalloc_dma32 18721928
> pgalloc_normal 0
> pgalloc_movable 0
> pgfree 19085491
> pgactivate 60797
> pgdeactivate 47199
> pgfault 25624481
> pgmajfault 2490
> pgrefill_dma 8144
> pgrefill_dma32 103508
> pgrefill_normal 0
> pgrefill_movable 0
> pgsteal_dma 4503
> pgsteal_dma32 179395
> pgsteal_normal 0
> pgsteal_movable 0
> pgscan_kswapd_dma 4999
> pgscan_kswapd_dma32 180546
> pgscan_kswapd_normal 0
> pgscan_kswapd_movable 0
> pgscan_direct_dma 0
> pgscan_direct_dma32 384
> pgscan_direct_normal 0
> pgscan_direct_movable 0
> pginodesteal 0
> slabs_scanned 153856
> kswapd_steal 183628
> kswapd_inodesteal 35303
> pageoutrun 3794
> allocstall 3
> pgrotated 72
> htlb_buddy_alloc_success 0
> htlb_buddy_alloc_fail 0
> 
> [root@builder ~]# cat /proc/zoneinfo 
> Node 0, zone      DMA
>   pages free     2524
>         min      12
>         low      15
>         high     18
>         scanned  0 (a: 27 i: 24)
>         spanned  4096
>         present  2180
>     nr_free_pages 2524
>     nr_inactive  0
>     nr_active    8
>     nr_anon_pages 8
>     nr_mapped    0
>     nr_file_pages 0
>     nr_dirty     0
>     nr_writeback 0
>     nr_slab_reclaimable 16
>     nr_slab_unreclaimable 7
>     nr_page_table_pages 15
>     nr_unstable  0
>     nr_bounce    0
>     nr_vmscan_write 292
>     nr_writeback_temp 0
>     numa_hit     295370
>     numa_miss    0
>     numa_foreign 0
>     numa_interleave 0
>     numa_local   295370
>     numa_other   0
>         protection: (0, 489, 489, 489)
>   pagesets
>     cpu: 0
>               count: 0
>               high:  0
>               batch: 1
>   vm stats threshold: 2
>   all_unreclaimable: 0
>   prev_priority:     12
>   start_pfn:         0
> Node 0, zone    DMA32
>   pages free     65515
>         min      700
>         low      875
>         high     1050
>         scanned  0 (a: 0 i: 0)
>         spanned  126960
>         present  125224
>     nr_free_pages 65515
>     nr_inactive  1482
>     nr_active    8137
>     nr_anon_pages 5843
>     nr_mapped    2162
>     nr_file_pages 3789
>     nr_dirty     128
>     nr_writeback 0
>     nr_slab_reclaimable 1331
>     nr_slab_unreclaimable 4429
>     nr_page_table_pages 816
>     nr_unstable  0
>     nr_bounce    0
>     nr_vmscan_write 32
>     nr_writeback_temp 0
>     numa_hit     18690260
>     numa_miss    0
>     numa_foreign 0
>     numa_interleave 44220
>     numa_local   18690260
>     numa_other   0
>         protection: (0, 0, 0, 0)
>   pagesets
>     cpu: 0
>               count: 69
>               high:  186
>               batch: 31
>   vm stats threshold: 6
>   all_unreclaimable: 0
>   prev_priority:     12
>   start_pfn:         4096
> 
> [root@builder ~]# cat /proc/slabinfo 
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> rpc_inode_cache       39     39    832   39    8 : tunables    0    0    0 : slabdata      1      1      0
> nf_conntrack_expect      0      0    240   34    2 : tunables    0    0    0 : slabdata      0      0      0
> UDPv6                 34     34    960   34    8 : tunables    0    0    0 : slabdata      1      1      0
> TCPv6                 18     18   1792   18    8 : tunables    0    0    0 : slabdata      1      1      0
> kmalloc_dma-512       32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
> dm_snap_pending_exception    144    144    112   36    1 : tunables    0    0    0 : slabdata      4      4      0
> kcopyd_job             0      0    360   45    4 : tunables    0    0    0 : slabdata      0      0      0
> dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
> ext3_inode_cache     387   1554    768   42    8 : tunables    0    0    0 : slabdata     37     37      0
> ext3_xattr            46     46     88   46    1 : tunables    0    0    0 : slabdata      1      1      0
> journal_handle       170    170     24  170    1 : tunables    0    0    0 : slabdata      1      1      0
> journal_head          42     42     96   42    1 : tunables    0    0    0 : slabdata      1      1      0
> revoke_table         256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
> revoke_record        128    128     32  128    1 : tunables    0    0    0 : slabdata      1      1      0
> cfq_io_context        44     48    168   24    1 : tunables    0    0    0 : slabdata      2      2      0
> mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
> isofs_inode_cache      0      0    616   26    4 : tunables    0    0    0 : slabdata      0      0      0
> hugetlbfs_inode_cache     28     28    584   28    4 : tunables    0    0    0 : slabdata      1      1      0
> dquot                  0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
> inotify_event_cache    612    612     40  102    1 : tunables    0    0    0 : slabdata      6      6      0
> fasync_cache      313798 313820     24  170    1 : tunables    0    0    0 : slabdata   1846   1846      0
> shmem_inode_cache    735    738    792   41    8 : tunables    0    0    0 : slabdata     18     18      0
> pid_namespace          0      0   2104   15    8 : tunables    0    0    0 : slabdata      0      0      0
> nsproxy                0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
> UNIX                  92     92    704   46    8 : tunables    0    0    0 : slabdata      2      2      0
> xfrm_dst_cache         0      0    384   42    4 : tunables    0    0    0 : slabdata      0      0      0
> ip_dst_cache          51     75    320   25    2 : tunables    0    0    0 : slabdata      3      3      0
> TCP                   19     19   1664   19    8 : tunables    0    0    0 : slabdata      1      1      0
> blkdev_integrity       0      0    120   34    1 : tunables    0    0    0 : slabdata      0      0      0
> blkdev_queue          34     34   1824   17    8 : tunables    0    0    0 : slabdata      2      2      0
> blkdev_requests       38     52    304   26    2 : tunables    0    0    0 : slabdata      2      2      0
> sock_inode_cache     138    138    704   46    8 : tunables    0    0    0 : slabdata      3      3      0
> file_lock_cache       42     42    192   42    2 : tunables    0    0    0 : slabdata      1      1      0
> taskstats             26     26    312   26    2 : tunables    0    0    0 : slabdata      1      1      0
> proc_inode_cache      90    162    600   27    4 : tunables    0    0    0 : slabdata      6      6      0
> sigqueue              25     25    160   25    1 : tunables    0    0    0 : slabdata      1      1      0
> radix_tree_node      623   2581    560   29    4 : tunables    0    0    0 : slabdata     89     89      0
> bdev_cache            42     42    768   42    8 : tunables    0    0    0 : slabdata      1      1      0
> sysfs_dir_cache     7084   7089     80   51    1 : tunables    0    0    0 : slabdata    139    139      0
> inode_cache         1505   1708    568   28    4 : tunables    0    0    0 : slabdata     61     61      0
> dentry              2555   4485    208   39    2 : tunables    0    0    0 : slabdata    115    115      0
> avc_node            1735   2128     72   56    1 : tunables    0    0    0 : slabdata     38     38      0
> buffer_head         1583   5472    112   36    1 : tunables    0    0    0 : slabdata    152    152      0
> mm_struct             75     78    832   39    8 : tunables    0    0    0 : slabdata      2      2      0
> vm_area_struct      2223   2438    176   46    2 : tunables    0    0    0 : slabdata     53     53      0
> files_cache           78     84    768   42    8 : tunables    0    0    0 : slabdata      2      2      0
> signal_cache         105    108    896   36    8 : tunables    0    0    0 : slabdata      3      3      0
> sighand_cache         85     90   2112   15    8 : tunables    0    0    0 : slabdata      6      6      0
> task_struct          141    145   5840    5    8 : tunables    0    0    0 : slabdata     29     29      0
> anon_vma             741    768     32  128    1 : tunables    0    0    0 : slabdata      6      6      0
> shared_policy_node     85     85     48   85    1 : tunables    0    0    0 : slabdata      1      1      0
> numa_policy           56     60    136   30    1 : tunables    0    0    0 : slabdata      2      2      0
> idr_layer_cache      269    270    536   30    4 : tunables    0    0    0 : slabdata      9      9      0
> kmalloc-4096         247    248   4096    8    8 : tunables    0    0    0 : slabdata     31     31      0
> kmalloc-2048         345    352   2048   16    8 : tunables    0    0    0 : slabdata     22     22      0
> kmalloc-1024         396    416   1024   32    8 : tunables    0    0    0 : slabdata     13     13      0
> kmalloc-512          297    320    512   32    4 : tunables    0    0    0 : slabdata     10     10      0
> kmalloc-256          985    992    256   32    2 : tunables    0    0    0 : slabdata     31     31      0
> kmalloc-128         1899   2016    128   32    1 : tunables    0    0    0 : slabdata     63     63      0
> kmalloc-64          6795   9600     64   64    1 : tunables    0    0    0 : slabdata    150    150      0
> kmalloc-32         20735  20736     32  128    1 : tunables    0    0    0 : slabdata    162    162      0
> kmalloc-16        138778 139264     16  256    1 : tunables    0    0    0 : slabdata    544    544      0
> kmalloc-8           8190   8192      8  512    1 : tunables    0    0    0 : slabdata     16     16      0
> kmalloc-192          972   1050    192   42    2 : tunables    0    0    0 : slabdata     25     25      0
> kmalloc-96          2815   2856     96   42    1 : tunables    0    0    0 : slabdata     68     68      0
> kmem_cache_node        0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> 
> -- 
>      -- Pierre Ossman
> 
>   WARNING: This correspondence is being monitored by the
>   Swedish government. Make sure your server uses encryption
>   for SMTP traffic and consider using PGP for end-to-end
>   encryption.
Comment 16 Anonymous Emailer 2009-03-09 08:02:58 UTC
Reply-To: drzeus@drzeus.cx

On Mon, 9 Mar 2009 22:22:41 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
> Thanks for the data! Now it seems that some pages are totally missing
> from bootmem or slabs or page cache or any application consumptions...
> 

So it isn't just me that's blind. That's something I guess. :)

> Will searching through /proc/kpageflags for reserved pages help
> identify the problem?
> 
> Oh kpageflags_read() does not include support for PG_reserved:
> 

I can probably hack together something that outputs the reserved pages.
Anything else that is of interest?

> > DirectMap2M:  18446744073709551613
> 
> This field looks weird.
> 

Sorry, red herring. I'm in the middle of a bisect and that particular
old bug happened to surface. It was not present in the released
2.6.27.

Rgds
Comment 17 Wu Fengguang 2009-03-09 19:42:33 UTC
On Mon, Mar 09, 2009 at 05:02:16PM +0200, Pierre Ossman wrote:
> On Mon, 9 Mar 2009 22:22:41 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> >
> > Thanks for the data! Now it seems that some pages are totally missing
> > from bootmem or slabs or page cache or any application consumptions...
> >
>
> So it isn't just me that's blind. That's something I guess. :)
>
> > Will searching through /proc/kpageflags for reserved pages help
> > identify the problem?
> >
> > Oh kpageflags_read() does not include support for PG_reserved:
> >
>
> I can probably hack together something that outputs the reserved pages.
> Anything else that is of interest?

Sure, Matt Mackall provides some example scripts for interpreting the
kpageflags file:

        http://selenic.com/repo/pagemap/

> > > DirectMap2M:  18446744073709551613
> >
> > This field looks weird.
> >
>
> Sorry, red herring. I'm in the middle of a bisect and that particular
> old bug happened to surface. It was not present in the released
> 2.6.27.

That's OK.

        pgfault 25624481
        pgmajfault 2490
        pgrefill_dma 8144
        pgrefill_dma32 103508
        pgsteal_dma 4503
        pgsteal_dma32 179395
        pgscan_kswapd_dma 4999
        pgscan_kswapd_dma32 180546
        pgscan_direct_dma32 384
        slabs_scanned 153856

The above vmstat numbers are a bit large, maybe it's not a fresh booted system?

Thanks,
Fengguang
Comment 18 Anonymous Emailer 2009-03-09 23:56:48 UTC
Reply-To: drzeus@drzeus.cx

On Tue, 10 Mar 2009 10:41:35 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
>         pgfault 25624481
>         pgmajfault 2490
>         pgrefill_dma 8144
>         pgrefill_dma32 103508
>         pgsteal_dma 4503
>         pgsteal_dma32 179395
>         pgscan_kswapd_dma 4999
>         pgscan_kswapd_dma32 180546
>         pgscan_direct_dma32 384
>         slabs_scanned 153856
> 
> The above vmstat numbers are a bit large, maybe it's not a fresh booted
> system?
> 

Probably not. I just grabbed those stats as it was compiling the next
kernel. It takes two hours, so I'm trying to do as many things as
possible in parallel. :/

Rgds
Comment 19 Wu Fengguang 2009-03-10 01:20:00 UTC
Hi Pierre,

On Tue, Mar 10, 2009 at 10:41:35AM +0800, Wu Fengguang wrote:
> On Mon, Mar 09, 2009 at 05:02:16PM +0200, Pierre Ossman wrote:
> > On Mon, 9 Mar 2009 22:22:41 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > >
> > > Thanks for the data! Now it seems that some pages are totally missing
> > > from bootmem or slabs or page cache or any application consumptions...
> > >
> >
> > So it isn't just me that's blind. That's something I guess. :)
> >
> > > Will searching through /proc/kpageflags for reserved pages help
> > > identify the problem?
> > >
> > > Oh kpageflags_read() does not include support for PG_reserved:
> > >
> >
> > I can probably hack together something that outputs the reserved pages.
> > Anything else that is of interest?

Here is the initial patch and tool for finding the missing pages.

In the following example, the pages with no flags set is kind of too
many (1816MB), but hopefully your missing pages will have PG_reserved
or other flags set ;-)

# ./page-types
L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked
 
 flags        symbolic-flags    page-count            MB
0x0000        ______________        464967          1816
0x0004        __R___________             1             0
0x0008        ___U__________             2             0
0x0014        __R_D_________             5             0
0x0020        _____L________             1             0
0x0028        ___U_L________          5956            23
0x002c        __RU_L________          5415            21
0x0038        ___UDL________             7             0
0x0068        ___U_LA_______           520             2
0x006c        __RU_LA_______          2083             8
0x0080        _______S______         10820            42
0x0228        ___U_L___x____           104             0
0x022c        __RU_L___x____            52             0
0x0268        ___U_LA__x____            22             0
0x026c        __RU_LA__x____            95             0
0x0400        __________B___           477             1
0x0800        ___________r__         18734            73
0x2008        ___U_________b             9             0
0x2068        ___U_LA______b          4644            18
0x206c        __RU_LA______b            33             0
0x2078        ___UDLA______b             4             0
0x207c        __RUDLA______b            17             0
 
 total                              513968          2007
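The table above is essentially a histogram over distinct flag words, with each word also rendered as a fixed-width symbolic string. The core of such a tool can be sketched as follows (hypothetical names and a simplified legend; not the actual page-types source):

```python
from collections import Counter

# One letter per bit in bit order, following the legend printed above
LETTERS = "LERUDLASWxBr"          # bit 0 .. bit 11

def symbolic(flags):
    """Render a flags word like page-types does, e.g. 0x28 -> '___U_L______'."""
    return "".join(c if flags & (1 << i) else "_" for i, c in enumerate(LETTERS))

def summarize(words, page_kb=4):
    """Histogram raw kpageflags words into (hex, symbolic, count, MB) rows."""
    hist = Counter(words)
    return [(hex(f), symbolic(f), n, n * page_kb // 1024)
            for f, n in sorted(hist.items())]
```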

Thanks,
Fengguang
Comment 20 Anonymous Emailer 2009-03-10 02:56:06 UTC
Reply-To: drzeus@drzeus.cx

On Tue, 10 Mar 2009 16:19:17 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
> Here is the initial patch and tool for finding the missing pages.
> 
> In the following example, the pages with no flags set is kind of too
> many (1816MB), but hopefully your missing pages will have PG_reserved
> or other flags set ;-)
> 
> # ./page-types
> L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab
> W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked
>  

Thanks. I'll have a look in a bit. Right now I'm very close to a
complete bisect. It is just ftrace commits left though, so I'm somewhat
sceptical that it is correct. ftrace isn't even turned on in the
kernels I've been testing.

The remaining commits are ec1bb60bb..6712e299.

Rgds
Comment 21 Wu Fengguang 2009-03-10 05:23:06 UTC
On Tue, Mar 10, 2009 at 11:55:23AM +0200, Pierre Ossman wrote:
> On Tue, 10 Mar 2009 16:19:17 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 
> > Here is the initial patch and tool for finding the missing pages.
> > 
> > In the following example, the pages with no flags set is kind of too
> > many (1816MB), but hopefully your missing pages will have PG_reserved
> > or other flags set ;-)
> > 
> > # ./page-types
> > L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab
> W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked
> >  
> 
> Thanks. I'll have a look in a bit. Right now I'm very close to a
> complete bisect. It is just ftrace commits left though, so I'm somewhat
> sceptical that it is correct. ftrace isn't even turned on in the
> kernels I've been testing.
> 
> The remaining commits are ec1bb60bb..6712e299.

And here's my progress, some more page flags are introduced:

# ./page-types
  flags page-count       MB    symbolic-flags    long-symbolic-flags
0x00000       3978       15  __________________  
0x00004          1        0  __R_______________  referenced
0x00014          5        0  __R_D_____________  referenced,dirty
0x00020          2        0  _____l____________  lru
0x00028       8835       34  ___U_l____________  uptodate,lru
0x0002c       9588       37  __RU_l____________  referenced,uptodate,lru
0x00068       1031        4  ___U_lA___________  uptodate,lru,active
0x0006c       3032       11  __RU_lA___________  referenced,uptodate,lru,active
0x00080      11001       42  _______S__________  slab
0x00228        140        0  ___U_l___x________  uptodate,lru,reclaim
0x0022c         79        0  __RU_l___x________  referenced,uptodate,lru,reclaim
0x00268         43        0  ___U_lA__x________  uptodate,lru,active,reclaim
0x0026c        110        0  __RU_lA__x________  referenced,uptodate,lru,active,reclaim
0x00400       1102        4  __________B_______  buddy
0x00800      18735       73  ___________r______  reserved
0x02008         13        0  ___U_________b____  uptodate,swapbacked
0x02068       9371       36  ___U_lA______b____  uptodate,lru,active,swapbacked
0x0206c       1339        5  __RU_lA______b____  referenced,uptodate,lru,active,swapbacked
0x02078         21        0  ___UDlA______b____  uptodate,dirty,lru,active,swapbacked
0x0207c         17        0  __RUDlA______b____  referenced,uptodate,dirty,lru,active,swapbacked
0x20000     445525     1740  _________________n  noflags
  total     513968     2007

Thanks,
Fengguang
Comment 22 Wu Fengguang 2009-03-10 06:13:35 UTC
On Tue, Mar 10, 2009 at 08:22:10PM +0800, Wu Fengguang wrote:
> On Tue, Mar 10, 2009 at 11:55:23AM +0200, Pierre Ossman wrote:
> > On Tue, 10 Mar 2009 16:19:17 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > 
> > > Here is the initial patch and tool for finding the missing pages.
> > > 
> > > In the following example, the pages with no flags set is kind of too
> > > many (1816MB), but hopefully your missing pages will have PG_reserved
> > > or other flags set ;-)
> > > 
> > > # ./page-types
> > > L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab
> W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked
> > >  
> > 
> > Thanks. I'll have a look in a bit. Right now I'm very close to a
> > complete bisect. It is just ftrace commits left though, so I'm somewhat
> > sceptical that it is correct. ftrace isn't even turned on in the
> > kernels I've been testing.
> > 
> > The remaining commits are ec1bb60bb..6712e299.

Another tool to show the page locations with specified flags:

# ./page-areas 0x20000 | head
    offset      len         KB
        11        1        4KB
        13        3       12KB
        17        7       28KB
        25        1        4KB
        31        1        4KB
        33       31      124KB
        65       63      252KB
       129       15       60KB
       145        7       28KB
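page-areas is in effect a run-length encoding of the PFNs whose flags match the requested value; the grouping can be sketched like this (hypothetical helper taking (pfn, flags) pairs, e.g. from a kpageflags reader):

```python
def page_areas(entries, want):
    """Group consecutive PFNs whose flags equal `want` into (offset, len)
    runs, mirroring the offset/len columns in the output above."""
    runs, start, prev = [], None, None
    for pfn, flags in entries:
        if flags == want:
            if start is None:
                start = pfn                             # open a new run
            elif pfn != prev + 1:
                runs.append((start, prev - start + 1))  # gap: close the run
                start = pfn
            prev = pfn
        elif start is not None:
            runs.append((start, prev - start + 1))      # flags changed
            start = None
    if start is not None:
        runs.append((start, prev - start + 1))          # flush the last run
    return runs
```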

If we run eatmem or the following commands to take up free memory,
the missing pages will show up :-)

        dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
        cp /tmp/s /dev/null

Thanks,
Fengguang
Comment 23 Anonymous Emailer 2009-03-10 08:52:57 UTC
Reply-To: drzeus@drzeus.cx

My bisect has run into a wall. I cannot run any of the intermediate
kernels that are left. I could try reverting the commits one at a time,
but I'll take a break and test your code here. Now we just have to wait
for the kernel to compile. :)

Rgds
Comment 24 Anonymous Emailer 2009-03-10 13:04:51 UTC
Reply-To: drzeus@drzeus.cx

Ok, I think I've found some, but not all of the missing memory.

I had to remove PG_swapbacked and PG_private2 as 2.6.26/2.6.27 didn't
have those bits.

After that, a comparison shows that this row is in 2.6.27, but not
2.6.26:

0x00020	     20576       80  _____l____________  lru

Unfortunately there are about 170 MB of missing memory, not 80. So we
probably need to dig deeper. But does the above say anything to you?

Rgds
Comment 25 Anonymous Emailer 2009-03-10 13:22:26 UTC
Reply-To: drzeus@drzeus.cx

On Tue, 10 Mar 2009 21:11:55 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> If we run eatmem or the following commands to take up free memory,
> the missing pages will show up :-)
> 
>         dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
>         cp /tmp/s /dev/null
> 

Not here, which now means I've "found" all of my missing 170 MB.

On 2.6.27, when I fill the page cache I still get over 90 MB left in
"noflags":

0x20000	     24394       95  _________________n  noflags

The same thing with 2.6.26 almost completely drains it:

0x20000	      3697       14  _________________n  noflags

Another interesting data point is that those 80 MB always seem to be
the exact same number of pages every boot.

Rgds
Comment 26 KOSAKI Motohiro 2009-03-10 17:20:34 UTC
Hi

> > Here is the initial patch and tool for finding the missing pages.
> > 
> > In the following example, the pages with no flags set is kind of too
> > many (1816MB), but hopefully your missing pages will have PG_reserved
> > or other flags set ;-)
> > 
> > # ./page-types
> > L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab
> W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked
> >  
> 
> Thanks. I'll have a look in a bit. Right now I'm very close to a
> complete bisect. It is just ftrace commits left though, so I'm somewhat
> sceptical that it is correct. ftrace isn't even turned on in the
> kernels I've been testing.
> 
> The remaining commits are ec1bb60bb..6712e299.

Can you try to turn off CONFIG_FTRACE* build option?
Comment 27 Wu Fengguang 2009-03-10 18:38:43 UTC
On Tue, Mar 10, 2009 at 10:21:18PM +0200, Pierre Ossman wrote:
> On Tue, 10 Mar 2009 21:11:55 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > If we run eatmem or the following commands to take up free memory,
> > the missing pages will show up :-)
> >
> >         dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
> >         cp /tmp/s /dev/null
> >
>
> Not here, which now means I've "found" all of my missing 170 MB.
>
> On 2.6.27, when I fill the page cache I still get over 90 MB left in
> "noflags":
>
> 0x20000            24394       95  _________________n  noflags
>
> The same thing with 2.6.26 almost completely drains it:
>
> 0x20000             3697       14  _________________n  noflags
>
> Another interesting data point is that those 80 MB always seem to be
> the exact same number of pages every boot.

This 80MB noflags pages together with the below 80MB lru pages are
very close to the missing page numbers :-) Could you run the following
commands on fresh booted 2.6.27 and post the output files? Thank you!

        dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
        cp /tmp/s /dev/null

        ./page-flags > flags
        ./page-areas =0x20000 > areas-noflags
        ./page-areas =0x00020 > areas-lru

The attached page-areas.c can do the above exact flags matching.

> After that, a comparison shows that this row is in 2.6.27, but not
> 2.6.26:
>
> 0x00020      20576       80  _____l____________  lru
>
> Unfortunately there are about 170 MB of missing memory, not 80. So we
> probably need to dig deeper. But does the above say anything to you?

> I had to remove PG_swapbacked and PG_private2 as 2.6.26/2.6.27 didn't
> have those bits.

Ah sorry! I forgot to switch the tree back to 2.6.27 to run a test.

Thanks,
Fengguang
Comment 28 Anonymous Emailer 2009-03-10 23:58:12 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 09:37:40 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
> This 80MB noflags pages together with the below 80MB lru pages are
> very close to the missing page numbers :-) Could you run the following
> commands on fresh booted 2.6.27 and post the output files? Thank you!
> 
>         dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
>         cp /tmp/s /dev/null
> 
>         ./page-flags > flags
>         ./page-areas =0x20000 > areas-noflags
>         ./page-areas =0x00020 > areas-lru
> 

Attached.

I have to say, the patterns look very much like some kind of leak.

Rgds
Comment 29 Wu Fengguang 2009-03-11 00:16:07 UTC
On Wed, Mar 11, 2009 at 08:57:03AM +0200, Pierre Ossman wrote:
> On Wed, 11 Mar 2009 09:37:40 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 
> > This 80MB noflags pages together with the below 80MB lru pages are
> > very close to the missing page numbers :-) Could you run the following
> > commands on fresh booted 2.6.27 and post the output files? Thank you!
> > 
> >         dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
> >         cp /tmp/s /dev/null
> > 
> >         ./page-flags > flags
> >         ./page-areas =0x20000 > areas-noflags
> >         ./page-areas =0x00020 > areas-lru
> > 
> 
> Attached.

Thank you very much!

> I have to say, the patterns look very much like some kind of leak.

Wow it looks really interesting.  The lru pages and noflags pages make
a perfect 1-page interleaved pattern...

Thanks,
Fengguang
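The 1-page interleave can be checked mechanically: merging the two offset columns from the attached files should give one consecutive run in which the two sources strictly alternate. A small sketch (hypothetical helper; the run offsets from page-areas are assumed expanded to individual PFNs):

```python
def interleaved(a_pfns, b_pfns):
    """True if the union of the two PFN lists is one consecutive run in
    which membership strictly alternates between the two lists."""
    merged = sorted(a_pfns + b_pfns)
    consecutive = all(y == x + 1 for x, y in zip(merged, merged[1:]))
    a = set(a_pfns)
    alternating = all((x in a) != (y in a)
                      for x, y in zip(merged, merged[1:]))
    return consecutive and alternating
```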

areas-lru
>     offset      len         KB
>      86016        1        4KB
>      86018        1        4KB
>      86020        1        4KB
>      86022        1        4KB
>      86024        1        4KB
>      86026        1        4KB
>      86028        1        4KB
>      86030        1        4KB
>      86032        1        4KB
>      86034        1        4KB
>      86036        1        4KB
>      86038        1        4KB
>      86040        1        4KB
>      86042        1        4KB
>      86044        1        4KB
>      86046        1        4KB
>      86048        1        4KB
>      86050        1        4KB
>      86052        1        4KB
>      86054        1        4KB
>      86056        1        4KB
>      86058        1        4KB
>      86060        1        4KB
>      86062        1        4KB
>      86064        1        4KB
>      86066        1        4KB
>      86068        1        4KB
>      86070        1        4KB
>      86072        1        4KB
>      86074        1        4KB
>      86076        1        4KB
>      86078        1        4KB
>      86080        1        4KB
>      86082        1        4KB
>      86084        1        4KB
>      86086        1        4KB
>      86088        1        4KB
>      86090        1        4KB
>      86092        1        4KB
>      86094        1        4KB
>      86096        1        4KB
>      86098        1        4KB
>      86100        1        4KB
>      86102        1        4KB
>      86104        1        4KB

areas-noflags
>      86017        1        4KB
>      86019        1        4KB
>      86021        1        4KB
>      86023        1        4KB
>      86025        1        4KB
>      86027        1        4KB
>      86029        1        4KB
>      86031        1        4KB
>      86033        1        4KB
>      86035        1        4KB
>      86037        1        4KB
>      86039        1        4KB
>      86041        1        4KB
>      86043        1        4KB
>      86045        1        4KB
>      86047        1        4KB
>      86049        1        4KB
>      86051        1        4KB
>      86053        1        4KB
>      86055        1        4KB
>      86057        1        4KB
>      86059        1        4KB
>      86061        1        4KB
>      86063        1        4KB
>      86065        1        4KB
>      86067        1        4KB
>      86069        1        4KB
>      86071        1        4KB
>      86073        1        4KB
>      86075        1        4KB
>      86077        1        4KB
>      86079        1        4KB
>      86081        1        4KB
>      86083        1        4KB
>      86085        1        4KB
>      86087        1        4KB
>      86089        1        4KB
>      86091        1        4KB
>      86093        1        4KB
>      86095        1        4KB
>      86097        1        4KB
>      86099        1        4KB
>      86101        1        4KB
>      86103        1        4KB

>   flags       page-count       MB    symbolic-flags      long-symbolic-flags
> 0x00000             1892        7  __________________
> 0x00004                1        0  __R_______________  referenced
> 0x00008              454        1  ___U______________  uptodate
> 0x0000c               94        0  __RU______________  referenced,uptodate
> 0x00020            20576       80  _____l____________  lru
> 0x00028              226        0  ___U_l____________  uptodate,lru
> 0x0002c            67911      265  __RU_l____________  referenced,uptodate,lru
> 0x00068             6621       25  ___U_lA___________  uptodate,lru,active
> 0x0006c             1222        4  __RU_lA___________  referenced,uptodate,lru,active
> 0x00078                1        0  ___UDlA___________  uptodate,dirty,lru,active
> 0x00080             3523       13  _______S__________  slab
> 0x000c0               55        0  ______AS__________  active,slab
> 0x00228                5        0  ___U_l___x________  uptodate,lru,reclaim
> 0x0022c                1        0  __RU_l___x________  referenced,uptodate,lru,reclaim
> 0x00268               23        0  ___U_lA__x________  uptodate,lru,active,reclaim
> 0x0026c               52        0  __RU_lA__x________  referenced,uptodate,lru,active,reclaim
> 0x00400                9        0  __________B_______  buddy
> 0x00408               60        0  ___U______B_______  uptodate,buddy
> 0x00800             4042       15  ___________r______  reserved
> 0x04020                9        0  _____l________P___  lru,private
> 0x04024               14        0  __R__l________P___  referenced,lru,private
> 0x04028                4        0  ___U_l________P___  uptodate,lru,private
> 0x0402c                1        0  __RU_l________P___  referenced,uptodate,lru,private
> 0x04060               10        0  _____lA_______P___  lru,active,private
> 0x04064                7        0  __R__lA_______P___  referenced,lru,active,private
> 0x04068               16        0  ___U_lA_______P___  uptodate,lru,active,private
> 0x20000            24227       94  _________________n  noflags
>   total           131056      511

> MemTotal:       508056 kB
> MemFree:          7716 kB
> Buffers:           220 kB
> Cached:         280468 kB
> SwapCached:          0 kB
> Active:          31184 kB
> Inactive:       271508 kB
> SwapTotal:      524280 kB
> SwapFree:       524232 kB
> Dirty:            1284 kB
> Writeback:           0 kB
> AnonPages:       22044 kB
> Mapped:           8652 kB
> Slab:            21508 kB
> SReclaimable:     4212 kB
> SUnreclaim:      17296 kB
> PageTables:       3036 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> WritebackTmp:        0 kB
> CommitLimit:    778308 kB
> Committed_AS:    80544 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:      1740 kB
> VmallocChunk: 34359736619 kB
> HugePages_Total:     0
> HugePages_Free:      0
> HugePages_Rsvd:      0
> HugePages_Surp:      0
> Hugepagesize:     2048 kB
> DirectMap4k:      8128 kB
> DirectMap2M:    516096 kB
Comment 30 Anonymous Emailer 2009-03-11 00:23:20 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 09:19:32 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> 
> Can you try to turn off CONFIG_FTRACE* build option?
> 

That's just it, it is off.

Rgds
Comment 31 Anonymous Emailer 2009-03-11 00:27:29 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 15:14:45 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Wed, Mar 11, 2009 at 08:57:03AM +0200, Pierre Ossman wrote:
> > On Wed, 11 Mar 2009 09:37:40 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > 
> > > This 80MB noflags pages together with the below 80MB lru pages are
> > > very close to the missing page numbers :-) Could you run the following
> > > commands on fresh booted 2.6.27 and post the output files? Thank you!
> > > 
> > >         dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
> > >         cp /tmp/s /dev/null
> > > 
> > >         ./page-flags > flags
> > >         ./page-areas =0x20000 > areas-noflags
> > >         ./page-areas =0x00020 > areas-lru
> > > 
> > 
> > Attached.
> 
> Thank you very much!
> 
> > I have to say, the patterns look very much like some kind of leak.
> 
> Wow it looks really interesting.  The lru pages and noflags pages make
> perfect 1-page interleaved pattern...
> 

Another breakthrough. I turned off everything in kernel/trace, and now
the missing memory is back. Here's the relevant diff against the
original .config:

@@ -3677,18 +3639,15 @@
 # CONFIG_BACKTRACE_SELF_TEST is not set
 # CONFIG_LKDTM is not set
 # CONFIG_FAULT_INJECTION is not set
-CONFIG_LATENCYTOP=y
+# CONFIG_LATENCYTOP is not set
 # CONFIG_SYSCTL_SYSCALL_CHECK is not set
 CONFIG_HAVE_FTRACE=y
 CONFIG_HAVE_DYNAMIC_FTRACE=y
-CONFIG_TRACER_MAX_TRACE=y
-CONFIG_TRACING=y
 # CONFIG_FTRACE is not set
-CONFIG_IRQSOFF_TRACER=y
-CONFIG_SYSPROF_TRACER=y
-CONFIG_SCHED_TRACER=y
-CONFIG_CONTEXT_SWITCH_TRACER=y
-# CONFIG_FTRACE_STARTUP_TEST is not set
+# CONFIG_IRQSOFF_TRACER is not set
+# CONFIG_SYSPROF_TRACER is not set
+# CONFIG_SCHED_TRACER is not set
+# CONFIG_CONTEXT_SWITCH_TRACER is not set

I'll enable them one at a time and see when the bug reappears, but if
you have some ideas on which it could be, that would be helpful. The
machine takes some time to recompile a kernel. :)

Rgds
Comment 32 Wu Fengguang 2009-03-11 00:37:38 UTC
(add cc)

On Wed, Mar 11, 2009 at 09:26:58AM +0200, Pierre Ossman wrote:
> On Wed, 11 Mar 2009 15:14:45 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Wed, Mar 11, 2009 at 08:57:03AM +0200, Pierre Ossman wrote:
> > > On Wed, 11 Mar 2009 09:37:40 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > 
> > > > This 80MB noflags pages together with the below 80MB lru pages are
> > > > very close to the missing page numbers :-) Could you run the following
> > > > commands on fresh booted 2.6.27 and post the output files? Thank you!
> > > > 
> > > >         dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
> > > >         cp /tmp/s /dev/null
> > > > 
> > > >         ./page-flags > flags
> > > >         ./page-areas =0x20000 > areas-noflags
> > > >         ./page-areas =0x00020 > areas-lru
> > > > 
> > > 
> > > Attached.
> > 
> > Thank you very much!
> > 
> > > I have to say, the patterns look very much like some kind of leak.
> > 
> > Wow it looks really interesting.  The lru pages and noflags pages make
> > perfect 1-page interleaved pattern...
> > 
> 
> Another breakthrough. I turned off everything in kernel/trace, and now
> the missing memory is back. Here's the relevant diff against the
> original .config:
> 
> @@ -3677,18 +3639,15 @@
>  # CONFIG_BACKTRACE_SELF_TEST is not set
>  # CONFIG_LKDTM is not set
>  # CONFIG_FAULT_INJECTION is not set
> -CONFIG_LATENCYTOP=y
> +# CONFIG_LATENCYTOP is not set
>  # CONFIG_SYSCTL_SYSCALL_CHECK is not set
>  CONFIG_HAVE_FTRACE=y
>  CONFIG_HAVE_DYNAMIC_FTRACE=y
> -CONFIG_TRACER_MAX_TRACE=y
> -CONFIG_TRACING=y
>  # CONFIG_FTRACE is not set
> -CONFIG_IRQSOFF_TRACER=y
> -CONFIG_SYSPROF_TRACER=y
> -CONFIG_SCHED_TRACER=y
> -CONFIG_CONTEXT_SWITCH_TRACER=y
> -# CONFIG_FTRACE_STARTUP_TEST is not set
> +# CONFIG_IRQSOFF_TRACER is not set
> +# CONFIG_SYSPROF_TRACER is not set
> +# CONFIG_SCHED_TRACER is not set
> +# CONFIG_CONTEXT_SWITCH_TRACER is not set
> 
> I'll enable them one at a time and see when the bug reappears, but if
> you have some ideas on which it could be, that would be helpful. The
> machine takes some time to recompile a kernel. :)

A quick question: are there any possibility of ftrace memory reservation?

Thanks,
Fengguang
Comment 33 Anonymous Emailer 2009-03-11 00:58:28 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 15:36:19 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
> A quick question: are there any possibility of ftrace memory reservation?
> 

You tell me. CONFIG_FTRACE was always disabled, but CONFIG_HAVE_*FTRACE
is always on. FTRACE wasn't included in 2.6.26 though, and the bisect
showed only ftrace commits. So it would explain things.

Rgds
Comment 34 Wu Fengguang 2009-03-11 01:22:39 UTC
On Wed, Mar 11, 2009 at 09:57:38AM +0200, Pierre Ossman wrote:
> On Wed, 11 Mar 2009 15:36:19 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 
> > A quick question: are there any possibility of ftrace memory reservation?
> > 
> 
> You tell me. CONFIG_FTRACE was always disabled, but CONFIG_HAVE_*FTRACE
> is always on. FTRACE wasn't included in 2.6.26 though, and the bisect
> showed only ftrace commits. So it would explain things.

There are some __get_free_page() calls in kernel/trace/ring_buffer.c,
maybe the pages are consumed by one of them?

Thanks,
Fengguang
Comment 35 Wu Fengguang 2009-03-11 06:02:44 UTC
Hi Pierre,

On Wed, Mar 11, 2009 at 09:57:38AM +0200, Pierre Ossman wrote:
> On Wed, 11 Mar 2009 15:36:19 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 
> > A quick question: are there any possibility of ftrace memory reservation?
> > 
> 
> You tell me. CONFIG_FTRACE was always disabled, but CONFIG_HAVE_*FTRACE
> is always on. FTRACE wasn't included in 2.6.26 though, and the bisect
> showed only ftrace commits. So it would explain things.

I worked up a simple debugging patch. Since the missing pages span a
contiguous range, a few stack dumps should be enough to catch
the page consumer.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 27b8681..c0df7fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1087,6 +1087,13 @@ again:
 			goto failed;
 	}
 
+	/* wfg - hunting the 40000 missing pages */
+	{
+		unsigned long pfn = page_to_pfn(page);
+		if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
+			dump_stack();
+	}
+
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
Comment 36 Anonymous Emailer 2009-03-11 06:06:36 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 16:20:38 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
> There are some __get_free_page() calls in kernel/trace/ring_buffer.c,
> maybe the pages are consumed by one of them?
> 

Perhaps. I enabled CONFIG_SYSPROF_TRACER (which pulls in
ring_buffer.c). That made the "noflags" memory disappear, but the "lru"
section is not there. I.e. I've lost about 80 MB instead of 170 MB.

The diff against the fully broken conf is now:

@@ -3677,17 +3640,16 @@
 # CONFIG_BACKTRACE_SELF_TEST is not set
 # CONFIG_LKDTM is not set
 # CONFIG_FAULT_INJECTION is not set
-CONFIG_LATENCYTOP=y
+# CONFIG_LATENCYTOP is not set
 # CONFIG_SYSCTL_SYSCALL_CHECK is not set
 CONFIG_HAVE_FTRACE=y
 CONFIG_HAVE_DYNAMIC_FTRACE=y
-CONFIG_TRACER_MAX_TRACE=y
 CONFIG_TRACING=y
 # CONFIG_FTRACE is not set
-CONFIG_IRQSOFF_TRACER=y
+# CONFIG_IRQSOFF_TRACER is not set
 CONFIG_SYSPROF_TRACER=y
-CONFIG_SCHED_TRACER=y
-CONFIG_CONTEXT_SWITCH_TRACER=y
+# CONFIG_SCHED_TRACER is not set
+# CONFIG_CONTEXT_SWITCH_TRACER is not set
 # CONFIG_FTRACE_STARTUP_TEST is not set
 CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
 # CONFIG_FIREWIRE_OHCI_REMOTE_DMA is not set


Rgds
Comment 37 Steven Rostedt 2009-03-11 07:26:08 UTC

On Wed, 11 Mar 2009, Wu Fengguang wrote:
> > > > > 
> > > > > This 80MB noflags pages together with the below 80MB lru pages are
> > > > > very close to the missing page numbers :-) Could you run the
> following
> > > > > commands on fresh booted 2.6.27 and post the output files? Thank you!
> > > > > 
> > > > >         dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
> > > > >         cp /tmp/s /dev/null
> > > > > 
> > > > >         ./page-flags > flags
> > > > >         ./page-areas =0x20000 > areas-noflags
> > > > >         ./page-areas =0x00020 > areas-lru
> > > > > 
> > > > 
> > > > Attached.
> > > 
> > > Thank you very much!
> > > 
> > > > I have to say, the patterns look very much like some kind of leak.
> > > 
> > > Wow it looks really interesting.  The lru pages and noflags pages make
> > > perfect 1-page interleaved pattern...
> > > 
> > 
> > Another breakthrough. I turned off everything in kernel/trace, and now
> > the missing memory is back. Here's the relevant diff against the
> > original .config:
> > 
[..]
> > 
> > I'll enable them one at a time and see when the bug reappears, but if
> > you have some ideas on which it could be, that would be helpful. The
> > machine takes some time to recompile a kernel. :)
> 
> A quick question: are there any possibility of ftrace memory reservation?

The ring buffer is allocated at start up (although I'm thinking of making 
it allocated when it is first used), and the allocations are done percpu. 

It allocates around 3 megs per cpu. How many CPUs were on this box?

-- Steve
Comment 38 Anonymous Emailer 2009-03-11 07:35:54 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 10:25:10 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> The ring buffer is allocated at start up (although I'm thinking of making 
> it allocated when it is first used), and the allocations are done percpu. 
> 
> It allocates around 3 megs per cpu. How many CPUs were on this box?
> 

One. :)

Rgds
Comment 39 Anonymous Emailer 2009-03-11 08:03:26 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 21:00:22 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
> I worked up a simple debugging patch. Since the missing pages are
> continuously spanned, several stack dumping shall be enough to catch
> the page consumer.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 27b8681..c0df7fd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1087,6 +1087,13 @@ again:
>                       goto failed;
>       }
>  
> +     /* wfg - hunting the 40000 missing pages */
> +     {
> +             unsigned long pfn = page_to_pfn(page);
> +             if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
> +                     dump_stack();
> +     }
> +
>       __count_zone_vm_events(PGALLOC, zone, 1 << order);
>       zone_statistics(preferred_zone, zone);
>       local_irq_restore(flags);

This got very noisy, but here's what was in the ring buffer once it had
booted.

Note that this is where only the "noflags" pages have been allocated,
not "lru".

Rgds
Comment 40 Steven Rostedt 2009-03-11 08:48:13 UTC

On Wed, 11 Mar 2009, Pierre Ossman wrote:

> On Wed, 11 Mar 2009 21:00:22 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 
> > I worked up a simple debugging patch. Since the missing pages are
> > continuously spanned, several stack dumping shall be enough to catch
> > the page consumer.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 27b8681..c0df7fd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1087,6 +1087,13 @@ again:
> >                     goto failed;
> >     }
> >  
> > +   /* wfg - hunting the 40000 missing pages */
> > +   {
> > +           unsigned long pfn = page_to_pfn(page);
> > +           if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
> > +                   dump_stack();
> > +   }
> > +
> >     __count_zone_vm_events(PGALLOC, zone, 1 << order);
> >     zone_statistics(preferred_zone, zone);
> >     local_irq_restore(flags);
> 
> This got very noisy, but here's what was in the ring buffer once it had
> booted.
> 
> Note that this is where only the "noflags" pages have been allocated,
> not "lru".

BTW, which kernel are you testing?  In 2.6.27, ftrace had its own special 
buffering system. It played tricks with the page structs of the pages in 
the buffer, using the lru fields of the pages to link-list itself.
I just booted a straight 2.6.27 with tracing configured.

# cat /debug/tracing/trace_entries 
65586

This is the old method to see the amount of data used. There are a total 
of 65,586 entries of 88 bytes each: 5,771,568 bytes.  And since we also have
a "snapshot" buffer for max latencies, the total is 11,543,136 bytes. That is 
quite a lot of memory for one CPU :-/

Starting with 2.6.28, we now have the unified ring buffer. It removes all 
of the page struct hackery in the original code.

In 2.6.28, trace_entries is a misnomer. The conversion to the ring 
buffer changed it from representing the number of entries 
(entries in the ring buffer are now variable length) to the 
number of bytes each CPU buffer takes up (*2 because of the "snapshot" 
buffer).

# cat /debug/tracing/trace_entries 
1441792

Now we have 1,441,792 or about 3 megs as the default.

Today, we now have it as:

# cat /debug/tracing/buffer_size_kb 
1410


Still the 3 megs. But going from 10 megs a CPU to 3 megs is a big 
difference. Do you see the same amount of lost memory with the later 
kernels?

I'll have to make the option to expand the ring buffer when a tracer is 
registered. That will be the default option.

-- Steve
Comment 41 Anonymous Emailer 2009-03-11 09:47:40 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 11:47:16 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> BTW, which kernel are you testing?  2.6.27, ftrace had its own special 
> buffering system. It played tricks with the page structs of the pages in 
> the buffer. It used the lru parts of the pages to link list itself.
> I just booted on a straight 2.6.27 with tracing configured.
> 

I've been primarily testing 2.6.27, yes. I think I tested 2.6.29-rc7 at
the beginning of this, but my memory is a bit fuzzy so I better retest.

Rgds
Comment 42 Anonymous Emailer 2009-03-11 09:56:46 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 10:25:10 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> The ring buffer is allocated at start up (although I'm thinking of making 
> it allocated when it is first used), and the allocations are done percpu. 
> 
> It allocates around 3 megs per cpu. How many CPUs were on this box?
> 

Is this per actual CPU though? Or per CONFIG_NR_CPUS? 3 MB times 64
equals roughly the lost memory. But then again, you said it was 10 MB
per CPU for 2.6.27...

Rgds
Comment 43 Steven Rostedt 2009-03-11 10:29:31 UTC
On Wed, 11 Mar 2009, Pierre Ossman wrote:

> On Wed, 11 Mar 2009 10:25:10 -0400 (EDT)
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > 
> > The ring buffer is allocated at start up (although I'm thinking of making 
> > it allocated when it is first used), and the allocations are done percpu. 
> > 
> > It allocates around 3 megs per cpu. How many CPUs were on this box?
> > 
> 
> Is this per actual CPU though? Or per CONFIG_NR_CPUS? 3 MB times 64
> equals roughly the lost memory. But then again, you said it was 10 MB
> per CPU for 2.6.27...

It uses the possible_cpu mask. How many possible CPUs are on your box? 
I've thought about making this handle hot plug CPUs, but that will
require a little more overhead for everyone, whether or not you hot plug a 
cpu.

-- Steve
Comment 44 Anonymous Emailer 2009-03-11 11:34:43 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 13:28:31 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> On Wed, 11 Mar 2009, Pierre Ossman wrote:
> 
> > 
> > Is this per actual CPU though? Or per CONFIG_NR_CPUS? 3 MB times 64
> > equals roughly the lost memory. But then again, you said it was 10 MB
> > per CPU for 2.6.27...
> 
> It uses the possible_cpu mask. How many possible CPUs are on your box? 
> I've thought about making this handle hot plug CPUs, but that will
> require a little more overhead for everyone, whether or not you hot plug a 
> cpu.
> 

CONFIG_NR_CPUS is 64 for these compiles.

Rgds
Comment 45 Steven Rostedt 2009-03-11 11:48:59 UTC
On Wed, 11 Mar 2009, Pierre Ossman wrote:

> On Wed, 11 Mar 2009 13:28:31 -0400 (EDT)
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > 
> > On Wed, 11 Mar 2009, Pierre Ossman wrote:
> > 
> > > 
> > > Is this per actual CPU though? Or per CONFIG_NR_CPUS? 3 MB times 64
> > > equals roughly the lost memory. But then again, you said it was 10 MB
> > > per CPU for 2.6.27...
> > 
> > It uses the possible_cpu mask. How many possible CPUs are on your box? 
> > I've thought about making this handle hot plug CPUs, but that will
> > require a little more overhead for everyone, whether or not you hot plug a 
> > cpu.
> > 
> 
> CONFIG_NR_CPUS is 64 for these compiles.

Hmm, I assumed (but could be wrong) that on boot up, the system checked 
how many CPUs were physically possible, and updated the possible CPU 
mask accordingly (default being NR_CPUS).

If this is not the case, then I'll have to implement hot plug allocation. 
:-/

Thanks,

-- Steve
Comment 46 Anonymous Emailer 2009-03-11 11:56:59 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 14:48:02 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> Hmm, I assumed (but could be wrong) that on boot up, the system checked 
> how many CPUs were physically possible, and updated the possible CPU 
> mask accordingly (default being NR_CPUS).
> 
> If this is not the case, then I'll have to implement hot plug allocation. 
> :-/
> 

I have no idea, but not every system suffers from this problem, so
there is something more to this. Modern Fedora kernels have NR_CPUS set
to 512, and it's not like I'm missing 1.5 GB here. :)

Rgds
Comment 47 Steven Rostedt 2009-03-11 12:04:49 UTC
On Wed, 11 Mar 2009, Pierre Ossman wrote:

> On Wed, 11 Mar 2009 14:48:02 -0400 (EDT)
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > 
> > Hmm, I assumed (but could be wrong) that on boot up, the system checked 
> > how many CPUs were physically possible, and updated the possible CPU 
> > mask accordingly (default being NR_CPUS).
> > 
> > If this is not the case, then I'll have to implement hot plug allocation. 
> > :-/
> > 
> 
> I have no idea, but every system doesn't suffer from this problem so
> there is something more to this. Modern fedora kernels have NR_CPUS set
> to 512, and it's not like I'm missing 1.5 GB here. :)
> 

I'm thinking it is a system-dependent feature. I'm working on implementing 
the ring buffers to only allocate for online CPUs. I just realized that 
there's a check of a ring buffer cpu mask to see if it is OK to write to 
that CPU buffer. This works out perfectly, to keep non-allocated buffers 
from being written to.

Thanks,

-- Steve
Comment 48 Anonymous Emailer 2009-03-11 14:44:24 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 17:46:38 +0100
Pierre Ossman <drzeus@drzeus.cx> wrote:

> On Wed, 11 Mar 2009 11:47:16 -0400 (EDT)
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > 
> > BTW, which kernel are you testing?  2.6.27, ftrace had its own special 
> > buffering system. It played tricks with the page structs of the pages in 
> > the buffer. It used the lru parts of the pages to link list itself.
> > I just booted on a straight 2.6.27 with tracing configured.
> > 
> 
> I've been primarily testing 2.6.27, yes. I think I tested 2.6.29-rc7 at
> the beginning of this, but my memory is a bit fuzzy so I better retest.
> 

Annoying... 2.6.28 and newer refuse to boot. Has someone broken the
virtio_blk interface?

I'll reconfigure it to use piix tomorrow and see if I can get it
running.

Rgds
Comment 49 Wu Fengguang 2009-03-11 18:23:25 UTC
On Wed, Mar 11, 2009 at 05:02:23PM +0200, Pierre Ossman wrote:
> On Wed, 11 Mar 2009 21:00:22 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 
> > I worked up a simple debugging patch. Since the missing pages are
> > continuously spanned, several stack dumping shall be enough to catch
> > the page consumer.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 27b8681..c0df7fd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1087,6 +1087,13 @@ again:
> >                     goto failed;
> >     }
> >  
> > +   /* wfg - hunting the 40000 missing pages */
> > +   {
> > +           unsigned long pfn = page_to_pfn(page);
> > +           if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
> > +                   dump_stack();
> > +   }
> > +
> >     __count_zone_vm_events(PGALLOC, zone, 1 << order);
> >     zone_statistics(preferred_zone, zone);
> >     local_irq_restore(flags);
> 
> This got very noisy, but here's what was in the ring buffer once it had
> booted.

It's about 20 stack dumps, hehe. Could you please paste some of them?
Thank you!

> Note that this is where only the "noflags" pages have been allocated,
> not "lru".

The lru pages have even-numbered pfns, the noflags pages have
odd-numbered pfns. So if these are 1-page allocations, the
((pfn & 0xfff) <= 1) check will match both lru and noflags pages.

Thanks,
Fengguang
Comment 50 KOSAKI Motohiro 2009-03-11 19:47:39 UTC
> 
> On Wed, 11 Mar 2009, Pierre Ossman wrote:
> 
> > On Wed, 11 Mar 2009 14:48:02 -0400 (EDT)
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > 
> > > Hmm, I assumed (but could be wrong) that on boot up, the system checked 
> > > how many CPUs were physically possible, and updated the possible CPU 
> > > mask accordingly (default being NR_CPUS).
> > > 
> > > If this is not the case, then I'll have to implement hot plug allocation. 
> > > :-/

Pierre, could you please run the following command and post the result?

# cat /sys/devices/system/cpu/possible

This outputs the possible CPUs of your system.



> > I have no idea, but every system doesn't suffer from this problem so
> > there is something more to this. Modern fedora kernels have NR_CPUS set
> > to 512, and it's not like I'm missing 1.5 GB here. :)
> > 
> 
> I'm thinking it is a system dependent feature. I'm working on implementing 
> the ring buffers to only allocate for online CPUS. I just realized that 
> there's a check of a ring buffer cpu mask to see if it is OK to write to 
> that CPU buffer. This works out perfectly, to keep non allocated buffers 
> from being written to.
> 
> Thanks,
> 
> -- Steve
> 
Comment 51 Anonymous Emailer 2009-03-11 23:51:13 UTC
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 22:43:53 +0100
Pierre Ossman <drzeus@drzeus.cx> wrote:

> 
> I'll reconfigure it to use piix tomorrow and see if I can get it
> running.
> 

No dice. In both cases (virtio_blk and piix), it sees the disk and
reads the partitions, but then fails to find any volume groups. Does
this ring any bells?

Rgds
Comment 52 Anonymous Emailer 2009-03-11 23:54:05 UTC
Reply-To: drzeus@drzeus.cx

On Thu, 12 Mar 2009 11:46:31 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> 
> Pierre, Could you please operate following command and post result?
> 
> # cat /sys/devices/system/cpu/possible
> 

[root@builder ~]# cat /sys/devices/system/cpu/possible
0-15

16 times 11 MB is also about the amount of lost memory, so this seems
reasonable.

Rgds
Comment 53 Anonymous Emailer 2009-03-11 23:56:34 UTC
Reply-To: drzeus@drzeus.cx

On Thu, 12 Mar 2009 09:08:16 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Wed, Mar 11, 2009 at 05:02:23PM +0200, Pierre Ossman wrote:
> > On Wed, 11 Mar 2009 21:00:22 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > 
> > > I worked up a simple debugging patch. Since the missing pages are
> > > continuously spanned, several stack dumping shall be enough to catch
> > > the page consumer.
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 27b8681..c0df7fd 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1087,6 +1087,13 @@ again:
> > >                   goto failed;
> > >   }
> > >  
> > > + /* wfg - hunting the 40000 missing pages */
> > > + {
> > > +         unsigned long pfn = page_to_pfn(page);
> > > +         if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
> > > +                 dump_stack();
> > > + }
> > > +
> > >   __count_zone_vm_events(PGALLOC, zone, 1 << order);
> > >   zone_statistics(preferred_zone, zone);
> > >   local_irq_restore(flags);
> > 
> > This got very noisy, but here's what was in the ring buffer once it had
> > booted.
> 
> It's about 20 stack dumps, hehe. Could you please paste some of them?
> Thank you!
> 

Ooops, I meant to attach the dmesg output. Let's try again. :)

Rgds
Comment 54 Wu Fengguang 2009-03-12 00:30:31 UTC
On Thu, Mar 12, 2009 at 08:55:30AM +0200, Pierre Ossman wrote:
> On Thu, 12 Mar 2009 09:08:16 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Wed, Mar 11, 2009 at 05:02:23PM +0200, Pierre Ossman wrote:
> > > On Wed, 11 Mar 2009 21:00:22 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > 
> > > > I worked up a simple debugging patch. Since the missing pages are
> > > > continuously spanned, several stack dumping shall be enough to catch
> > > > the page consumer.
> > > > 
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 27b8681..c0df7fd 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1087,6 +1087,13 @@ again:
> > > >                         goto failed;
> > > >         }
> > > >  
> > > > +       /* wfg - hunting the 40000 missing pages */
> > > > +       {
> > > > +               unsigned long pfn = page_to_pfn(page);
> > > > +               if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
> > > > +                       dump_stack();
> > > > +       }
> > > > +
> > > >         __count_zone_vm_events(PGALLOC, zone, 1 << order);
> > > >         zone_statistics(preferred_zone, zone);
> > > >         local_irq_restore(flags);
> > > 
> > > This got very noisy, but here's what was in the ring buffer once it had
> > > booted.
> > 
> > It's about 20 stack dumps, hehe. Could you please paste some of them?
> > Thank you!
> > 
> 
> Ooops, I meant to attach the dmesg output. Let's try again. :)

Oops, there is no ftrace in the dmesg; these are pretty normal
page faults. I overlooked the possibility of repeated alloc/free
cycles on the same pfn...

Anyway please go on with Steven's ftrace patchset :-)

Thanks,
Fengguang

Note You need to log in before you can comment on or make changes to this bug.