Latest working kernel version: 2.6.26
Earliest failing kernel version: 2.6.27
Distribution: Fedora
Hardware Environment: x86_64
Software Environment: Fedora 9 & 10
Problem Description:

Starting from 2.6.27, the kernel eats up a whole lot more memory (hundreds of MB) at no gain.

I've compared what I can from 2.6.26 and so far haven't found where this missing memory has disappeared to.

Original bug in RH's bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=481448
Reply-To: akpm@linux-foundation.org

(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Sat, 7 Mar 2009 11:27:12 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12832
>
>            Summary: kernel leaks a lot of memory
>            Product: Memory Management
>            Version: 2.5
>      KernelVersion: 2.6.27
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: akpm@osdl.org
>         ReportedBy: drzeus-bugzilla@drzeus.cx
>
>
> Latest working kernel version: 2.6.26
> Earliest failing kernel version: 2.6.27
> Distribution: Fedora
> Hardware Environment: x86_64
> Software Environment: Fedora 9 & 10
> Problem Description:
>
> Starting from 2.6.27, the kernel eats up a whole lot more of memory (hundreds
> of MB) at no gain.
>
> I've compared what I can from 2.6.26 and so far haven't found where this
> missing memory has disappeared.
>
> Original bug in RH's bugzilla:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=481448
>

hm, not a lot to go on there.

We have quite a lot of instrumentation for memory consumption - were you able to work out where it went by comparing /proc/meminfo, /proc/slabinfo, `echo m > /proc/sysrq-trigger', etc?

Is the memory missing on initial boot up, or does it take some time for the problem to become evident?
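The /proc/meminfo comparison suggested above can be done mechanically once a snapshot has been saved from each kernel. The following is a minimal sketch in Python, not something posted in the thread; the function names, the 1024 kB threshold and the two sample snapshots are assumptions made up for illustration, and the sample numbers are not measurements from this report.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.split():
            continue  # skip blank or malformed lines
        fields[name.strip()] = int(rest.split()[0])
    return fields


def diff_meminfo(old, new, threshold_kb=1024):
    """Return {field: new - old} for fields that moved by at least threshold_kb."""
    deltas = {}
    for name in sorted(set(old) & set(new)):
        delta = new[name] - old[name]
        if abs(delta) >= threshold_kb:
            deltas[name] = delta
    return deltas


# Hypothetical snapshots, for illustration only (not data from this bug).
SAMPLE_OLD = """MemTotal:  509108 kB
MemFree:   430000 kB
Slab:       22000 kB
"""
SAMPLE_NEW = """MemTotal:  509108 kB
MemFree:   270000 kB
Slab:       23000 kB
"""

if __name__ == "__main__":
    changes = diff_meminfo(parse_meminfo(SAMPLE_OLD), parse_meminfo(SAMPLE_NEW))
    for name, delta in changes.items():
        print(f"{name:12s} {delta:+d} kB")
```

A field that shrinks without any other field growing by a similar amount is the pattern described in this report: memory that is neither free nor attributed to a known consumer.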
Reply-To: drzeus@drzeus.cx

On Sat, 7 Mar 2009 12:24:52 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

>
> hm, not a lot to go on there.
>
> We have quite a lot of instrumentation for memory consumption - were
> you able to work out where it went by comparing /proc/meminfo,
> /proc/slabinfo, `echo m > /proc/sysrq-trigger', etc?
>

The redhat entry contains all the info, and I've compared meminfo and slabinfo without finding anything even close to the chunks of lost memory.

I've attached the sysrq memory stats from 2.6.26 and 2.6.27. The only difference though is in the reported free pages.

I'm not very familiar with all the instrumentation, so pointers are very welcome.

> Is the memory missing on initial boot up, or does it take some time for
> the problem to become evident?
>

Initial boot as far as I can tell.

Rgds
Reply-To: akpm@linux-foundation.org

On Sat, 7 Mar 2009 22:00:55 +0100 Pierre Ossman <drzeus@drzeus.cx> wrote:

> On Sat, 7 Mar 2009 12:24:52 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> >
> > hm, not a lot to go on there.
> >
> > We have quite a lot of instrumentation for memory consumption - were
> > you able to work out where it went by comparing /proc/meminfo,
> > /proc/slabinfo, `echo m > /proc/sysrq-trigger', etc?
> >
>
> The redhat entry contains all the info, and I've compared meminfo and
> slabinfo without finding anything even close to the chunks of lost
> memory.

Ok.

> I've attached the sysrq memory stats from 2.6.26 and 2.6.27. The only
> difference though is in the reported free pages

Drat.

> I'm not very familiar with all the instrumentation, so pointers are
> very welcome.
>
> > Is the memory missing on initial boot up, or does it take some time for
> > the problem to become evident?
> >
>
> Initial boot as far as I can tell.

OK.

In that case it might be that someone gobbled a lot of bootmem. Unfortunately we only added the bootmem_debug boot option in 2.6.27.

Below is a super-quick hackport of that patch into 2.6.26. That will allow us (ie: you ;)) to compare bootmem allocations between the two kernels.

Unfortunately bootmem-debugging doesn't tell us _who_ allocated the memory, so I stuck a dump_stack() in there too.

diff -puN mm/bootmem.c~bdebug mm/bootmem.c
--- a/mm/bootmem.c~bdebug
+++ a/mm/bootmem.c
@@ -48,6 +48,22 @@ unsigned long __init bootmem_bootmap_pag
 	return mapsize;
 }
 
+static int bootmem_debug;
+
+static int __init bootmem_debug_setup(char *buf)
+{
+	bootmem_debug = 1;
+	return 0;
+}
+early_param("bootmem_debug", bootmem_debug_setup);
+
+#define bdebug(fmt, args...) ({				\
+	if (unlikely(bootmem_debug))			\
+		printk(KERN_INFO			\
+			"bootmem::%s " fmt,		\
+			__FUNCTION__, ## args);		\
+})
+
 /*
  * link bdata in order
  */
@@ -213,10 +229,10 @@ static void __init free_bootmem_core(boo
 	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
 		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
 
-	for (i = sidx; i < eidx; i++) {
-		if (unlikely(!test_and_clear_bit(i, bdata->node_bootmem_map)))
-			BUG();
-	}
+	for (i = sidx; i < eidx; i++)
+		if (test_and_set_bit(i, bdata->node_bootmem_map))
+			bdebug("hm, page %lx reserved twice.\n",
+				PFN_DOWN(bdata->node_boot_start) + i);
 }
 
 /*
@@ -252,6 +268,12 @@ __alloc_bootmem_core(struct bootmem_data
 	if (!bdata->node_bootmem_map)
 		return NULL;
 
+	bdebug("size=%lx [%lu pages] align=%lx goal=%lx limit=%lx\n",
+		size, PAGE_ALIGN(size) >> PAGE_SHIFT,
+		align, goal, limit);
+	if (bootmem_debug)
+		dump_stack();
+
 	/* bdata->node_boot_start is supposed to be (12+6)bits alignment on x86_64 ? */
 	node_boot_start = bdata->node_boot_start;
 	node_bootmem_map = bdata->node_bootmem_map;
@@ -359,6 +381,10 @@ found:
 		ret = phys_to_virt(start * PAGE_SIZE + node_boot_start);
 	}
 
+	bdebug("start=%lx end=%lx\n",
+		start + PFN_DOWN(bdata->node_boot_start),
+		start + areasize + PFN_DOWN(bdata->node_boot_start));
+
 	/*
 	 * Reserve the area now:
 	 */
@@ -432,6 +458,7 @@ static unsigned long __init free_all_boo
 	}
 	total += count;
 	bdata->node_bootmem_map = NULL;
+	bdebug("released=%lx\n", count);
 
 	return total;
 }
_
Reply-To: akpm@linux-foundation.org

Now another possibility is that someone is gobbling lots of memory during initcalls. So here's an untested addition to the `initcall_debug' boot option which should permit us to work out how much memory each initcall consumed:

--- a/init/main.c~a
+++ a/init/main.c
@@ -714,6 +714,7 @@ static void __init do_one_initcall(initc
 		print_fn_descriptor_symbol("initcall %s", fn);
 		printk(" returned %d after %Ld msecs\n", result,
 			(unsigned long long) delta.tv64 >> 20);
+		printk("remaining memory: %lu\n", nr_free_buffer_pages());
 	}
 
 	msgbuf[0] = 0;
_
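With that printk in place, each initcall line in the boot log is followed by a "remaining memory" line, so per-initcall consumption is the drop between consecutive values. A minimal Python sketch of that bookkeeping follows. It assumes the log lines look exactly as the format strings in the patch above would print them; the function name and the sample log (initcall names and numbers) are invented for illustration.

```python
import re

# Patterns derived from the printk format strings in the patch above.
INITCALL_RE = re.compile(r"initcall (\S+) returned (-?\d+) after (\d+) msecs")
REMAINING_RE = re.compile(r"remaining memory: (\d+)")


def initcall_memory_deltas(log_text):
    """Return [(initcall, pages_consumed)] from paired log lines.

    pages_consumed is the drop in "remaining memory" since the previous
    initcall; the first initcall has no baseline and is skipped.
    """
    deltas = []
    current = None
    previous_remaining = None
    for line in log_text.splitlines():
        match = INITCALL_RE.search(line)
        if match:
            current = match.group(1)
            continue
        match = REMAINING_RE.search(line)
        if match and current is not None:
            remaining = int(match.group(1))
            if previous_remaining is not None:
                deltas.append((current, previous_remaining - remaining))
            previous_remaining = remaining
            current = None
    return deltas


# Hypothetical log excerpt; names and numbers are invented.
SAMPLE_LOG = """\
initcall init_a+0x0/0x10 returned 0 after 0 msecs
remaining memory: 120000
initcall init_b+0x0/0x20 returned 0 after 3 msecs
remaining memory: 119500
initcall init_c+0x0/0x30 returned 0 after 1 msecs
remaining memory: 119498
"""

if __name__ == "__main__":
    ranked = sorted(initcall_memory_deltas(SAMPLE_LOG), key=lambda item: -item[1])
    for name, pages in ranked:
        print(f"{pages:8d} pages  {name}")
```

Running the same script over the dmesg of both kernels and comparing the largest entries would show whether any single initcall accounts for the difference.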
Reply-To: drzeus@drzeus.cx

On Sat, 7 Mar 2009 14:13:16 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

>
> Below is a super-quick hackport of that patch into 2.6.26. That will
> allow us (ie: you ;)) to compare bootmem allocations between the two
> kernels.
>

Compiling...

I take it you couldn't see anything like this on your end?

Rgds
Reply-To: drzeus@drzeus.cx

On Sat, 7 Mar 2009 14:13:16 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

>
> Below is a super-quick hackport of that patch into 2.6.26. That will
> allow us (ie: you ;)) to compare bootmem allocations between the two
> kernels.
>
> Unfortunately bootmem-debugging doesn't tell us _who_ allocated the
> memory, so I stuck a dump_stack() in there too.
>

I'm having problems booting this machine on a vanilla 2.6.26. Fedora's kernel works nicely though, so I guess they have a bug fix for this. I've attached a screenshot in case it rings any bells.

I'm working on getting the data from the 2.6.27 kernel, but right now it doesn't seem like we're getting any numbers for comparison. :/

Rgds
Reply-To: drzeus@drzeus.cx

On Sun, 8 Mar 2009 11:00:06 +0100 Pierre Ossman <drzeus@drzeus.cx> wrote:

>
> I'm having problems booting this machine on a vanilla 2.6.26. Fedora's
> kernel works nice though, so I guess they have a bug fix for this. I've
> attached a screenshot in case it rings any bells.
>

It turns out it's your backported patch that's the problem. I'll see if I can get it working. :)

Rgds
On Sun, Mar 08, 2009 at 11:36:19AM +0100, Pierre Ossman wrote:
> On Sun, 8 Mar 2009 11:00:06 +0100
> Pierre Ossman <drzeus@drzeus.cx> wrote:
>
> >
> > I'm having problems booting this machine on a vanilla 2.6.26. Fedora's
> > kernel works nice though, so I guess they have a bug fix for this. I've
> > attached a screenshot in case it rings any bells.
> >
>
> It turns out it's your backported patch that's the problem. I'll see if
> I can get it working. :)

Pierre, you can try the following fixed and combined patch and boot the kernel with "initcall_debug bootmem_debug".

The boot hang was due to this chunk having floated from reserve_bootmem_core() into free_bootmem_core()...

@@ -213,10 +229,10 @@ static void __init free_bootmem_core(boo
 	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
 		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
 
-	for (i = sidx; i < eidx; i++) {
-		if (unlikely(!test_and_clear_bit(i, bdata->node_bootmem_map)))
-			BUG();
-	}
+	for (i = sidx; i < eidx; i++)
+		if (test_and_set_bit(i, bdata->node_bootmem_map))
+			bdebug("hm, page %lx reserved twice.\n",
+				PFN_DOWN(bdata->node_boot_start) + i);
 }
 
 /*

Thanks,
Fengguang
---
From: Andrew Morton <akpm@linux-foundation.org>

---
 init/main.c  |    2 ++
 mm/bootmem.c |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

--- mm.orig/mm/bootmem.c
+++ mm/mm/bootmem.c
@@ -48,6 +48,22 @@ unsigned long __init bootmem_bootmap_pag
 	return mapsize;
 }
 
+static int bootmem_debug;
+
+static int __init bootmem_debug_setup(char *buf)
+{
+	bootmem_debug = 1;
+	return 0;
+}
+early_param("bootmem_debug", bootmem_debug_setup);
+
+#define bdebug(fmt, args...) ({				\
+	if (unlikely(bootmem_debug))			\
+		printk(KERN_INFO			\
+			"bootmem::%s " fmt,		\
+			__FUNCTION__, ## args);		\
+})
+
 /*
  * link bdata in order
  */
@@ -172,6 +188,14 @@ static void __init reserve_bootmem_core(
 	if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
 		eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
 
+	bdebug("size=%lx [%lu pages] start=%lx end=%lx flags=%x\n",
+		size, PAGE_ALIGN(size) >> PAGE_SHIFT,
+		sidx + PFN_DOWN(bdata->node_boot_start),
+		eidx + PFN_DOWN(bdata->node_boot_start),
+		flags);
+	if (bootmem_debug)
+		dump_stack();
+
 	for (i = sidx; i < eidx; i++) {
 		if (test_and_set_bit(i, bdata->node_bootmem_map)) {
 #ifdef CONFIG_DEBUG_BOOTMEM
@@ -252,6 +276,12 @@ __alloc_bootmem_core(struct bootmem_data
 	if (!bdata->node_bootmem_map)
 		return NULL;
 
+	bdebug("size=%lx [%lu pages] align=%lx goal=%lx limit=%lx\n",
+		size, PAGE_ALIGN(size) >> PAGE_SHIFT,
+		align, goal, limit);
+	if (bootmem_debug)
+		dump_stack();
+
 	/* bdata->node_boot_start is supposed to be (12+6)bits alignment on x86_64 ? */
 	node_boot_start = bdata->node_boot_start;
 	node_bootmem_map = bdata->node_bootmem_map;
@@ -359,6 +389,10 @@ found:
 		ret = phys_to_virt(start * PAGE_SIZE + node_boot_start);
 	}
 
+	bdebug("start=%lx end=%lx\n",
+		start + PFN_DOWN(bdata->node_boot_start),
+		start + areasize + PFN_DOWN(bdata->node_boot_start));
+
 	/*
 	 * Reserve the area now:
 	 */
@@ -432,6 +466,7 @@ static unsigned long __init free_all_boo
 	}
 	total += count;
 	bdata->node_bootmem_map = NULL;
+	bdebug("released=%lx\n", count);
 
 	return total;
 }
--- mm.orig/init/main.c
+++ mm/init/main.c
@@ -60,6 +60,7 @@
 #include <linux/sched.h>
 #include <linux/signal.h>
 #include <linux/idr.h>
+#include <linux/swap.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -714,6 +715,7 @@ static void __init do_one_initcall(initc
 		print_fn_descriptor_symbol("initcall %s", fn);
 		printk(" returned %d after %Ld msecs\n", result,
 			(unsigned long long) delta.tv64 >> 20);
+		printk("remaining memory: %lu\n", nr_free_buffer_pages());
 	}
 
 	msgbuf[0] = 0;
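To compare the bootmem_debug output of two kernels, it helps to total the logged page counts per function rather than read the raw dmesg. The sketch below is one way to do that in Python. It assumes the lines have the shape produced by the bdebug() format strings in the patch above ("bootmem::<function> size=<hex> [<n> pages] ..."); the function names and the sample log are made up for illustration, and the 2.6.27 mainline output may differ in detail.

```python
import re
from collections import Counter

# Matches the "size=... [N pages]" lines emitted via bdebug() in the patch.
BDEBUG_RE = re.compile(r"bootmem::(\w+) size=([0-9a-f]+) \[(\d+) pages\]")


def bootmem_pages_by_function(dmesg_text):
    """Total the logged page counts per bootmem function."""
    totals = Counter()
    for match in BDEBUG_RE.finditer(dmesg_text):
        function, _size_hex, pages = match.groups()
        totals[function] += int(pages)
    return totals


def compare_totals(old, new):
    """Return {function: new - old} for functions whose totals differ."""
    return {
        function: new[function] - old[function]
        for function in sorted(set(old) | set(new))
        if new[function] != old[function]
    }


# Hypothetical dmesg excerpt; the values are invented.
SAMPLE_DMESG = """\
bootmem::reserve_bootmem_core size=8000 [8 pages] start=1 end=9 flags=0
bootmem::__alloc_bootmem_core size=1000 [1 pages] align=40 goal=0 limit=0
bootmem::__alloc_bootmem_core start=200 end=201
bootmem::__alloc_bootmem_core size=2400 [3 pages] align=1000 goal=0 limit=0
"""

if __name__ == "__main__":
    for function, pages in bootmem_pages_by_function(SAMPLE_DMESG).items():
        print(f"{function:24s} {pages} pages")
```

The "start=... end=..." lines carry no page count and are ignored by the pattern, so each allocation is counted once.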
Reply-To: drzeus@drzeus.cx

On Sun, 8 Mar 2009 20:38:25 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

>
> Pierre, you can try the following fixed and combined patch and boot the kernel
> with "initcall_debug bootmem_debug".
>
> The boot hang was due to this chunk having floated from reserve_bootmem_core()
> into free_bootmem_core()...
>

Yeah, I found that as well. I'm getting decent output now.

Included are the dmesg dumps of bootmem_debug. I'll get the initcall stuff in a bit.

Rgds
Reply-To: drzeus@drzeus.cx

I've gone through the dumps now, and still no meaningful difference. All the big bootmem allocations are present in both kernels, and the remaining memory in initcall is also the same for both (and doesn't really decrease by any meaningful amount).

I also tried booting with init=/bin/sh, and the lost memory is present even at that point.

More ideas?

Rgds
Reply-To: akpm@linux-foundation.org

On Sun, 8 Mar 2009 16:54:03 +0100 Pierre Ossman <drzeus@drzeus.cx> wrote:

> I've gone through the dumps now, and still no meaningful difference.
> All the big bootmem allocations are present in both kernels, and the
> remaining memory in initcall is also the same for both (and doesn't
> really decrease by any meaningful amount).
>
> I also tried booting with init=/bin/sh, and the lost memory is present
> even at that point.
>

So we know that the memory gets consumed after end-of-initcalls and before exec-of-init?
Reply-To: drzeus@drzeus.cx

On Sun, 8 Mar 2009 12:11:43 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sun, 8 Mar 2009 16:54:03 +0100 Pierre Ossman <drzeus@drzeus.cx> wrote:
>
> > I've gone through the dumps now, and still no meaningful difference.
> > All the big bootmem allocations are present in both kernels, and the
> > remaining memory in initcall is also the same for both (and doesn't
> > really decrease by any meaningful amount).
> >
> > I also tried booting with init=/bin/sh, and the lost memory is present
> > even at that point.
> >
>
> So we know that the memory gets consumed after end-of-initcalls and
> before exec-of-init?

This is a Fedora machine, so the initrd might be the provoking party here. I haven't yet tried the adventure of booting without an initrd. It's after initcalls at least.

Right now I'm compiling 2.6.27-rc1 in an effort to bisect this, but if you have something more worthwhile then shoot. :)

Rgds
Hi Pierre,

On Sat, Mar 07, 2009 at 10:00:55PM +0100, Pierre Ossman wrote:
> On Sat, 7 Mar 2009 12:24:52 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> >
> > hm, not a lot to go on there.
> >
> > We have quite a lot of instrumentation for memory consumption - were
> > you able to work out where it went by comparing /proc/meminfo,
> > /proc/slabinfo, `echo m > /proc/sysrq-trigger', etc?
> >
>
> The redhat entry contains all the info, and I've compared meminfo and
> slabinfo without finding anything even close to the chunks of lost
> memory.

The "free" pages in sysrq mem-info report should be equal to "MemFree" in /proc/meminfo. So I'd expect meminfo numbers to be different in .26/.27 as well.

Maybe the memory is taken by some user space program, so it would be helpful to know the numbers in /proc/meminfo, /proc/vmstat and /proc/zoneinfo.

> I've attached the sysrq memory stats from 2.6.26 and 2.6.27. The only
> difference though is in the reported free pages

The "free" entries in mem-info:

                            2.6.26      2.6.27
        --------------------------------------
        free:               103730       62265   (pages)
        Node 0 DMA free:   10292kB      9448kB
        Node 0 DMA32 free: 404628kB   239612kB

So there are 160MB less free pages in .27. Are you sure that initrd is freed after booting?

Thanks,
Fengguang

> I'm not very familiar with all the instrumentation, so pointers are
> very welcome.
>
> > Is the memory missing on initial boot up, or does it take some time for
> > the problem to become evident?
> >
>
> Initial boot as far as I can tell.
>
>
> Rgds
> --
> -- Pierre Ossman
>
> WARNING: This correspondence is being monitored by the
> Swedish government. Make sure your server uses encryption
> for SMTP traffic and consider using PGP for end-to-end
> encryption.

> Linux builder.drzeus.cx 2.6.26.6-79.fc9.x86_64 #1 SMP Fri Oct 17 14:20:33 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> SysRq : Show Memory
> Mem-info:
> Node 0 DMA per-cpu:
> CPU 0: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: hi: 186, btch: 31 usd: 115
> Active:8937 inactive:6285 dirty:48 writeback:0 unstable:0
> free:103730 slab:5612 mapped:2148 pagetables:817 bounce:0
> Node 0 DMA free:10292kB min:48kB low:60kB high:72kB active:0kB inactive:0kB present:8908kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 489 489 489
> Node 0 DMA32 free:404628kB min:2804kB low:3504kB high:4204kB active:35748kB inactive:25140kB present:500896kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 3*4kB 5*8kB 4*16kB 4*32kB 3*64kB 3*128kB 3*256kB 3*512kB 3*1024kB 2*2048kB 0*4096kB = 10292kB
> Node 0 DMA32: 3*4kB 5*8kB 2*16kB 2*32kB 2*64kB 1*128kB 3*256kB 2*512kB 3*1024kB 3*2048kB 96*4096kB = 404628kB
> 9730 total pagecache pages
> Swap cache: add 0, delete 0, find 0/0
> Free swap = 524280kB
> Total swap = 524280kB
> 131056 pages of RAM
> 3772 reserved pages
> 7750 pages shared
> 0 pages swap cached
>

> Linux builder.drzeus.cx 2.6.27.4-19.fc9.x86_64 #1 SMP Thu Oct 30 19:30:01 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> SysRq : Show Memory
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU 0: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: hi: 186, btch: 31 usd: 86
> Active:8879 inactive:6265 dirty:8 writeback:0 unstable:0
> free:62265 slab:5543 mapped:2154 pagetables:821 bounce:0
> Node 0 DMA free:9448kB min:40kB low:48kB high:60kB active:0kB inactive:0kB present:7804kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 489 489 489
> Node 0 DMA32 free:239612kB min:2808kB low:3508kB high:4212kB active:35516kB inactive:25060kB present:500896kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 4*4kB 3*8kB 2*16kB 5*32kB 4*64kB 2*128kB 2*256kB 2*512kB 3*1024kB 2*2048kB 0*4096kB = 9448kB
> Node 0 DMA32: 1*4kB 7*8kB 6*16kB 1*32kB 1*64kB 4*128kB 3*256kB 3*512kB 1*1024kB 3*2048kB 56*4096kB = 239612kB
> 9692 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap = 524280kB
> Total swap = 524280kB
> 131056 pages RAM
> 4046 pages reserved
> 7770 pages shared
> 61196 pages non-shared
>
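The "160MB" figure can be checked directly from the two sysrq reports quoted above. The short Python sketch below does the arithmetic and also verifies one of the buddy-list lines against its stated total. The free page counts and the buddy line are copied from the 2.6.26 and 2.6.27 dumps; the 4 kB page size and the helper names are assumptions made for this example.

```python
import re

PAGE_KB = 4  # assumes 4 kB pages, as is usual on x86_64


def pages_to_mb(pages):
    """Convert a page count to megabytes."""
    return pages * PAGE_KB / 1024


def buddy_total_kb(line):
    """Sum a buddy-list line such as '3*4kB 5*8kB ... = 10292kB'."""
    listed = line.split("=")[0]  # ignore the total printed after '='
    return sum(int(count) * int(size)
               for count, size in re.findall(r"(\d+)\*(\d+)kB", listed))


# Values copied from the sysrq reports quoted above.
FREE_PAGES_26 = 103730
FREE_PAGES_27 = 62265
DMA_26 = ("3*4kB 5*8kB 4*16kB 4*32kB 3*64kB 3*128kB 3*256kB 3*512kB "
          "3*1024kB 2*2048kB 0*4096kB = 10292kB")

if __name__ == "__main__":
    missing_pages = FREE_PAGES_26 - FREE_PAGES_27
    print(f"missing: {missing_pages} pages, about {pages_to_mb(missing_pages):.0f} MB")
    print(f"DMA buddy list sums to {buddy_total_kb(DMA_26)} kB")
```

The page-based difference agrees with the per-zone figures: (404628 - 239612) kB for DMA32 plus (10292 - 9448) kB for DMA is 165860 kB, the same as 41465 pages of 4 kB.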
Reply-To: drzeus@drzeus.cx On Mon, 9 Mar 2009 10:07:01 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote: > On Mon, Mar 09, 2009 at 09:37:42AM +0800, Wu Fengguang wrote: > > > > The "free" pages in sysrq mem-info report should be equal to "MemFree" > > in /proc/meminfo. So I'd expect meminfo numbers to be different in > > .26/.27 as well. > > > > Maybe the memory is taken by some user space program, so it would be > > helpful to know the numbers in /proc/meminfo, /proc/vmstat and > > /proc/zoneinfo. > > And maybe piggyback /proc/slabinfo in case it is a kernel bug :-) > Big dump of relevant /proc files: [root@builder ~]# free total used free shared buffers cached Mem: 509108 236988 272120 0 228 14760 -/+ buffers/cache: 222000 287108 Swap: 524280 228 524052 [root@builder ~]# cat /proc/meminfo MemTotal: 509108 kB MemFree: 272172 kB Buffers: 240 kB Cached: 14788 kB SwapCached: 64 kB Active: 32544 kB Inactive: 5900 kB SwapTotal: 524280 kB SwapFree: 524052 kB Dirty: 5980 kB Writeback: 0 kB AnonPages: 23404 kB Mapped: 8648 kB Slab: 23148 kB SReclaimable: 5420 kB SUnreclaim: 17728 kB PageTables: 3324 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 778832 kB Committed_AS: 85196 kB VmallocTotal: 34359738367 kB VmallocUsed: 1740 kB VmallocChunk: 34359736619 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 2032 DirectMap2M: 18446744073709551613 DirectMap1G: 0 [root@builder ~]# cat /proc/vmstat nr_free_pages 68035 nr_inactive 1479 nr_active 8137 nr_anon_pages 5851 nr_mapped 2162 nr_file_pages 3777 nr_dirty 132 nr_writeback 0 nr_slab_reclaimable 1354 nr_slab_unreclaimable 4440 nr_page_table_pages 831 nr_unstable 0 nr_bounce 0 nr_vmscan_write 324 nr_writeback_temp 0 numa_hit 18985527 numa_miss 0 numa_foreign 0 numa_interleave 44220 numa_local 18985527 numa_other 0 pgpgin 379025 pgpgout 820238 pswpin 16 pswpout 57 pgalloc_dma 295454 pgalloc_dma32 18721928 pgalloc_normal 0 pgalloc_movable 0 pgfree 
19085491 pgactivate 60797 pgdeactivate 47199 pgfault 25624481 pgmajfault 2490 pgrefill_dma 8144 pgrefill_dma32 103508 pgrefill_normal 0 pgrefill_movable 0 pgsteal_dma 4503 pgsteal_dma32 179395 pgsteal_normal 0 pgsteal_movable 0 pgscan_kswapd_dma 4999 pgscan_kswapd_dma32 180546 pgscan_kswapd_normal 0 pgscan_kswapd_movable 0 pgscan_direct_dma 0 pgscan_direct_dma32 384 pgscan_direct_normal 0 pgscan_direct_movable 0 pginodesteal 0 slabs_scanned 153856 kswapd_steal 183628 kswapd_inodesteal 35303 pageoutrun 3794 allocstall 3 pgrotated 72 htlb_buddy_alloc_success 0 htlb_buddy_alloc_fail 0 [root@builder ~]# cat /proc/zoneinfo Node 0, zone DMA pages free 2524 min 12 low 15 high 18 scanned 0 (a: 27 i: 24) spanned 4096 present 2180 nr_free_pages 2524 nr_inactive 0 nr_active 8 nr_anon_pages 8 nr_mapped 0 nr_file_pages 0 nr_dirty 0 nr_writeback 0 nr_slab_reclaimable 16 nr_slab_unreclaimable 7 nr_page_table_pages 15 nr_unstable 0 nr_bounce 0 nr_vmscan_write 292 nr_writeback_temp 0 numa_hit 295370 numa_miss 0 numa_foreign 0 numa_interleave 0 numa_local 295370 numa_other 0 protection: (0, 489, 489, 489) pagesets cpu: 0 count: 0 high: 0 batch: 1 vm stats threshold: 2 all_unreclaimable: 0 prev_priority: 12 start_pfn: 0 Node 0, zone DMA32 pages free 65515 min 700 low 875 high 1050 scanned 0 (a: 0 i: 0) spanned 126960 present 125224 nr_free_pages 65515 nr_inactive 1482 nr_active 8137 nr_anon_pages 5843 nr_mapped 2162 nr_file_pages 3789 nr_dirty 128 nr_writeback 0 nr_slab_reclaimable 1331 nr_slab_unreclaimable 4429 nr_page_table_pages 816 nr_unstable 0 nr_bounce 0 nr_vmscan_write 32 nr_writeback_temp 0 numa_hit 18690260 numa_miss 0 numa_foreign 0 numa_interleave 44220 numa_local 18690260 numa_other 0 protection: (0, 0, 0, 0) pagesets cpu: 0 count: 69 high: 186 batch: 31 vm stats threshold: 6 all_unreclaimable: 0 prev_priority: 12 start_pfn: 4096 [root@builder ~]# cat /proc/slabinfo slabinfo - version: 2.1 # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables 
<limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> rpc_inode_cache 39 39 832 39 8 : tunables 0 0 0 : slabdata 1 1 0 nf_conntrack_expect 0 0 240 34 2 : tunables 0 0 0 : slabdata 0 0 0 UDPv6 34 34 960 34 8 : tunables 0 0 0 : slabdata 1 1 0 TCPv6 18 18 1792 18 8 : tunables 0 0 0 : slabdata 1 1 0 kmalloc_dma-512 32 32 512 32 4 : tunables 0 0 0 : slabdata 1 1 0 dm_snap_pending_exception 144 144 112 36 1 : tunables 0 0 0 : slabdata 4 4 0 kcopyd_job 0 0 360 45 4 : tunables 0 0 0 : slabdata 0 0 0 dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0 ext3_inode_cache 387 1554 768 42 8 : tunables 0 0 0 : slabdata 37 37 0 ext3_xattr 46 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0 journal_handle 170 170 24 170 1 : tunables 0 0 0 : slabdata 1 1 0 journal_head 42 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0 revoke_table 256 256 16 256 1 : tunables 0 0 0 : slabdata 1 1 0 revoke_record 128 128 32 128 1 : tunables 0 0 0 : slabdata 1 1 0 cfq_io_context 44 48 168 24 1 : tunables 0 0 0 : slabdata 2 2 0 mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0 isofs_inode_cache 0 0 616 26 4 : tunables 0 0 0 : slabdata 0 0 0 hugetlbfs_inode_cache 28 28 584 28 4 : tunables 0 0 0 : slabdata 1 1 0 dquot 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0 inotify_event_cache 612 612 40 102 1 : tunables 0 0 0 : slabdata 6 6 0 fasync_cache 313798 313820 24 170 1 : tunables 0 0 0 : slabdata 1846 1846 0 shmem_inode_cache 735 738 792 41 8 : tunables 0 0 0 : slabdata 18 18 0 pid_namespace 0 0 2104 15 8 : tunables 0 0 0 : slabdata 0 0 0 nsproxy 0 0 56 73 1 : tunables 0 0 0 : slabdata 0 0 0 UNIX 92 92 704 46 8 : tunables 0 0 0 : slabdata 2 2 0 xfrm_dst_cache 0 0 384 42 4 : tunables 0 0 0 : slabdata 0 0 0 ip_dst_cache 51 75 320 25 2 : tunables 0 0 0 : slabdata 3 3 0 TCP 19 19 1664 19 8 : tunables 0 0 0 : slabdata 1 1 0 blkdev_integrity 0 0 120 34 1 : tunables 0 0 0 : slabdata 0 0 0 blkdev_queue 34 34 1824 17 8 : tunables 0 0 0 : slabdata 2 2 0 
blkdev_requests 38 52 304 26 2 : tunables 0 0 0 : slabdata 2 2 0 sock_inode_cache 138 138 704 46 8 : tunables 0 0 0 : slabdata 3 3 0 file_lock_cache 42 42 192 42 2 : tunables 0 0 0 : slabdata 1 1 0 taskstats 26 26 312 26 2 : tunables 0 0 0 : slabdata 1 1 0 proc_inode_cache 90 162 600 27 4 : tunables 0 0 0 : slabdata 6 6 0 sigqueue 25 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0 radix_tree_node 623 2581 560 29 4 : tunables 0 0 0 : slabdata 89 89 0 bdev_cache 42 42 768 42 8 : tunables 0 0 0 : slabdata 1 1 0 sysfs_dir_cache 7084 7089 80 51 1 : tunables 0 0 0 : slabdata 139 139 0 inode_cache 1505 1708 568 28 4 : tunables 0 0 0 : slabdata 61 61 0 dentry 2555 4485 208 39 2 : tunables 0 0 0 : slabdata 115 115 0 avc_node 1735 2128 72 56 1 : tunables 0 0 0 : slabdata 38 38 0 buffer_head 1583 5472 112 36 1 : tunables 0 0 0 : slabdata 152 152 0 mm_struct 75 78 832 39 8 : tunables 0 0 0 : slabdata 2 2 0 vm_area_struct 2223 2438 176 46 2 : tunables 0 0 0 : slabdata 53 53 0 files_cache 78 84 768 42 8 : tunables 0 0 0 : slabdata 2 2 0 signal_cache 105 108 896 36 8 : tunables 0 0 0 : slabdata 3 3 0 sighand_cache 85 90 2112 15 8 : tunables 0 0 0 : slabdata 6 6 0 task_struct 141 145 5840 5 8 : tunables 0 0 0 : slabdata 29 29 0 anon_vma 741 768 32 128 1 : tunables 0 0 0 : slabdata 6 6 0 shared_policy_node 85 85 48 85 1 : tunables 0 0 0 : slabdata 1 1 0 numa_policy 56 60 136 30 1 : tunables 0 0 0 : slabdata 2 2 0 idr_layer_cache 269 270 536 30 4 : tunables 0 0 0 : slabdata 9 9 0 kmalloc-4096 247 248 4096 8 8 : tunables 0 0 0 : slabdata 31 31 0 kmalloc-2048 345 352 2048 16 8 : tunables 0 0 0 : slabdata 22 22 0 kmalloc-1024 396 416 1024 32 8 : tunables 0 0 0 : slabdata 13 13 0 kmalloc-512 297 320 512 32 4 : tunables 0 0 0 : slabdata 10 10 0 kmalloc-256 985 992 256 32 2 : tunables 0 0 0 : slabdata 31 31 0 kmalloc-128 1899 2016 128 32 1 : tunables 0 0 0 : slabdata 63 63 0 kmalloc-64 6795 9600 64 64 1 : tunables 0 0 0 : slabdata 150 150 0 kmalloc-32 20735 20736 32 128 1 : tunables 0 0 0 : 
slabdata 162 162 0 kmalloc-16 138778 139264 16 256 1 : tunables 0 0 0 : slabdata 544 544 0 kmalloc-8 8190 8192 8 512 1 : tunables 0 0 0 : slabdata 16 16 0 kmalloc-192 972 1050 192 42 2 : tunables 0 0 0 : slabdata 25 25 0 kmalloc-96 2815 2856 96 42 1 : tunables 0 0 0 : slabdata 68 68 0 kmem_cache_node 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
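A rough way to see how much memory is unaccounted for in a single snapshot is to subtract the free memory and the main known consumers from MemTotal. The Python sketch below does this with the /proc/meminfo values from the dump above. Which fields count as "accounted" is an assumption made for this example, and the result is only approximate, since it ignores smaller consumers such as vmalloc areas and kernel stacks.

```python
# Values in kB, copied from the /proc/meminfo output above.
MEMINFO_KB = {
    "MemTotal": 509108,
    "MemFree": 272172,
    "Buffers": 240,
    "Cached": 14788,
    "SwapCached": 64,
    "AnonPages": 23404,
    "Slab": 23148,
    "PageTables": 3324,
}

# Assumption for this sketch: these fields cover the main known consumers.
ACCOUNTED_FIELDS = (
    "MemFree", "Buffers", "Cached", "SwapCached",
    "AnonPages", "Slab", "PageTables",
)


def unaccounted_kb(meminfo):
    """MemTotal minus free memory and the listed known consumers."""
    return meminfo["MemTotal"] - sum(meminfo[name] for name in ACCOUNTED_FIELDS)


if __name__ == "__main__":
    gap = unaccounted_kb(MEMINFO_KB)
    print(f"unaccounted: {gap} kB (about {gap / 1024:.0f} MB)")
```

For this snapshot the gap is 171968 kB, which is the same order of magnitude as the free-page difference between the two sysrq reports.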
Hi Pierre,

On Mon, Mar 09, 2009 at 08:40:45AM +0100, Pierre Ossman wrote:
> On Mon, 9 Mar 2009 10:07:01 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > On Mon, Mar 09, 2009 at 09:37:42AM +0800, Wu Fengguang wrote:
> > >
> > > The "free" pages in sysrq mem-info report should be equal to "MemFree"
> > > in /proc/meminfo. So I'd expect meminfo numbers to be different in
> > > .26/.27 as well.
> > >
> > > Maybe the memory is taken by some user space program, so it would be
> > > helpful to know the numbers in /proc/meminfo, /proc/vmstat and
> > > /proc/zoneinfo.
> >
> > And maybe piggyback /proc/slabinfo in case it is a kernel bug :-)
> >
>
> Big dump of relevant /proc files:

Thanks for the data! Now it seems that some pages are totally missing: they are not in bootmem, slabs, the page cache or any application's consumption...

Will searching through /proc/kpageflags for reserved pages help identify the problem?

Oh, kpageflags_read() does not include support for PG_reserved:

#define KPF_LOCKED      0
#define KPF_ERROR       1
#define KPF_REFERENCED  2
#define KPF_UPTODATE    3
#define KPF_DIRTY       4
#define KPF_LRU         5
#define KPF_ACTIVE      6
#define KPF_SLAB        7
#define KPF_WRITEBACK   8
#define KPF_RECLAIM     9
#define KPF_BUDDY       10

Thanks,
Fengguang

> [root@builder ~]# free
>              total       used       free     shared    buffers     cached
> Mem:        509108     236988     272120          0        228      14760
> -/+ buffers/cache:     222000     287108
> Swap:       524280        228     524052
>
> [root@builder ~]# cat /proc/meminfo
> MemTotal: 509108 kB
> MemFree: 272172 kB
> Buffers: 240 kB
> Cached: 14788 kB
> SwapCached: 64 kB
> Active: 32544 kB
> Inactive: 5900 kB
> SwapTotal: 524280 kB
> SwapFree: 524052 kB
> Dirty: 5980 kB
> Writeback: 0 kB
> AnonPages: 23404 kB
> Mapped: 8648 kB
> Slab: 23148 kB
> SReclaimable: 5420 kB
> SUnreclaim: 17728 kB
> PageTables: 3324 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 778832 kB
> Committed_AS: 85196 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 1740 kB
> VmallocChunk: 34359736619 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 2032
> DirectMap2M: 18446744073709551613

This field looks weird.

> DirectMap1G: 0
>
> [...]
0 0 : > slabdata 152 152 0 > mm_struct 75 78 832 39 8 : tunables 0 0 0 : > slabdata 2 2 0 > vm_area_struct 2223 2438 176 46 2 : tunables 0 0 0 : > slabdata 53 53 0 > files_cache 78 84 768 42 8 : tunables 0 0 0 : > slabdata 2 2 0 > signal_cache 105 108 896 36 8 : tunables 0 0 0 : > slabdata 3 3 0 > sighand_cache 85 90 2112 15 8 : tunables 0 0 0 : > slabdata 6 6 0 > task_struct 141 145 5840 5 8 : tunables 0 0 0 : > slabdata 29 29 0 > anon_vma 741 768 32 128 1 : tunables 0 0 0 : > slabdata 6 6 0 > shared_policy_node 85 85 48 85 1 : tunables 0 0 0 : > slabdata 1 1 0 > numa_policy 56 60 136 30 1 : tunables 0 0 0 : > slabdata 2 2 0 > idr_layer_cache 269 270 536 30 4 : tunables 0 0 0 : > slabdata 9 9 0 > kmalloc-4096 247 248 4096 8 8 : tunables 0 0 0 : > slabdata 31 31 0 > kmalloc-2048 345 352 2048 16 8 : tunables 0 0 0 : > slabdata 22 22 0 > kmalloc-1024 396 416 1024 32 8 : tunables 0 0 0 : > slabdata 13 13 0 > kmalloc-512 297 320 512 32 4 : tunables 0 0 0 : > slabdata 10 10 0 > kmalloc-256 985 992 256 32 2 : tunables 0 0 0 : > slabdata 31 31 0 > kmalloc-128 1899 2016 128 32 1 : tunables 0 0 0 : > slabdata 63 63 0 > kmalloc-64 6795 9600 64 64 1 : tunables 0 0 0 : > slabdata 150 150 0 > kmalloc-32 20735 20736 32 128 1 : tunables 0 0 0 : > slabdata 162 162 0 > kmalloc-16 138778 139264 16 256 1 : tunables 0 0 0 : > slabdata 544 544 0 > kmalloc-8 8190 8192 8 512 1 : tunables 0 0 0 : > slabdata 16 16 0 > kmalloc-192 972 1050 192 42 2 : tunables 0 0 0 : > slabdata 25 25 0 > kmalloc-96 2815 2856 96 42 1 : tunables 0 0 0 : > slabdata 68 68 0 > kmem_cache_node 0 0 64 64 1 : tunables 0 0 0 : > slabdata 0 0 0 > > -- > -- Pierre Ossman > > WARNING: This correspondence is being monitored by the > Swedish government. Make sure your server uses encryption > for SMTP traffic and consider using PGP for end-to-end > encryption.
Reply-To: drzeus@drzeus.cx

On Mon, 9 Mar 2009 22:22:41 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> Thanks for the data! Now it seems that some pages are totally missing
> from bootmem or slabs or page cache or any application consumptions...
>

So it isn't just me that's blind. That's something I guess. :)

> Will searching through /proc/kpageflags for reserved pages help
> identify the problem?
>
> Oh kpageflags_read() does not include support for PG_reserved:
>

I can probably hack together something that outputs the reserved pages.
Anything else that is of interest?

> > DirectMap2M: 18446744073709551613
>
> This field looks weird.
>

Sorry, red herring. I'm in the middle of a bisect and that particular
old bug happened to surface. It was not present in the released 2.6.27.

Rgds
On Mon, Mar 09, 2009 at 05:02:16PM +0200, Pierre Ossman wrote: > On Mon, 9 Mar 2009 22:22:41 +0800 > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > > > Thanks for the data! Now it seems that some pages are totally missing > > from bootmem or slabs or page cache or any application consumptions... > > > > So it isn't just me that's blind. That's something I guess. :) > > > Will searching through /proc/kpageflags for reserved pages help > > identify the problem? > > > > Oh kpageflags_read() does not include support for PG_reserved: > > > > I can probably hack together something that outputs the served pages. > Anything else that is of interest? Sure, Matt Mackall provides some example scripts for interpreting the kpageflags file: http://selenic.com/repo/pagemap/ > > > DirectMap2M: 18446744073709551613 > > > > This field looks weird. > > > > Sorry, red herring. I'm in the middle of a bisect and that particular > old bug happened to surface. It was not present with the releases > 2.6.27. That's OK. pgfault 25624481 pgmajfault 2490 pgrefill_dma 8144 pgrefill_dma32 103508 pgsteal_dma 4503 pgsteal_dma32 179395 pgscan_kswapd_dma 4999 pgscan_kswapd_dma32 180546 pgscan_direct_dma32 384 slabs_scanned 153856 The above vmstat numbers are a bit large, maybe it's not a fresh booted system? Thanks, Fengguang
Reply-To: drzeus@drzeus.cx

On Tue, 10 Mar 2009 10:41:35 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> pgfault 25624481
> pgmajfault 2490
> pgrefill_dma 8144
> pgrefill_dma32 103508
> pgsteal_dma 4503
> pgsteal_dma32 179395
> pgscan_kswapd_dma 4999
> pgscan_kswapd_dma32 180546
> pgscan_direct_dma32 384
> slabs_scanned 153856
>
> The above vmstat numbers are a bit large, maybe it's not a fresh booted
> system?
>

Probably not. I just grabbed those stats as it was compiling the next
kernel. It takes two hours, so I'm trying to do as many things in
parallel as I can. :/

Rgds
Hi Pierre,

On Tue, Mar 10, 2009 at 10:41:35AM +0800, Wu Fengguang wrote:
> On Mon, Mar 09, 2009 at 05:02:16PM +0200, Pierre Ossman wrote:
> > On Mon, 9 Mar 2009 22:22:41 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > >
> > > Thanks for the data! Now it seems that some pages are totally missing
> > > from bootmem or slabs or page cache or any application consumptions...
> > >
> >
> > So it isn't just me that's blind. That's something I guess. :)
> >
> > > Will searching through /proc/kpageflags for reserved pages help
> > > identify the problem?
> > >
> > > Oh kpageflags_read() does not include support for PG_reserved:
> > >
> >
> > I can probably hack together something that outputs the served pages.
> > Anything else that is of interest?

Here is the initial patch and tool for finding the missing pages.

In the following example, the pages with no flags set is kind of too
many (1816MB), but hopefully your missing pages will have PG_reserved
or other flags set ;-)

# ./page-types
L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked

 flags  symbolic-flags  page-count    MB
0x0000  ______________      464967  1816
0x0004  __R___________           1     0
0x0008  ___U__________           2     0
0x0014  __R_D_________           5     0
0x0020  _____L________           1     0
0x0028  ___U_L________        5956    23
0x002c  __RU_L________        5415    21
0x0038  ___UDL________           7     0
0x0068  ___U_LA_______         520     2
0x006c  __RU_LA_______        2083     8
0x0080  _______S______       10820    42
0x0228  ___U_L___x____         104     0
0x022c  __RU_L___x____          52     0
0x0268  ___U_LA__x____          22     0
0x026c  __RU_LA__x____          95     0
0x0400  __________B___         477     1
0x0800  ___________r__       18734    73
0x2008  ___U_________b           9     0
0x2068  ___U_LA______b        4644    18
0x206c  __RU_LA______b          33     0
0x2078  ___UDLA______b           4     0
0x207c  __RUDLA______b          17     0
 total                      513968  2007

Thanks,
Fengguang
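The legend printed by ./page-types lists one letter per flag bit, and each row shows the flag word as a string of those letters. A minimal Python sketch of that decoding follows. It assumes that bit 0 corresponds to the first letter of the legend, which agrees with the rows in the table; the name `symbolic_flags` is hypothetical and is not taken from the page-types source.

```python
# Letters in bit order, copied from the legend printed by ./page-types.
# Assumption: bit 0 is the leftmost letter; this agrees with the table rows.
LEGEND = "LERUDLASWxBrcb"


def symbolic_flags(flags: int) -> str:
    """Render a flag word as the symbolic string used in the table."""
    return "".join(
        letter if flags & (1 << bit) else "_"
        for bit, letter in enumerate(LEGEND)
    )


# 0x2068 has bits 3, 5, 6 and 13 set: uptodate, lru, active, swapbacked.
print(symbolic_flags(0x2068))
print(symbolic_flags(0x0800))
```

The same approach extends to the wider 18-character strings in the later version of the tool, given its legend.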
Reply-To: drzeus@drzeus.cx On Tue, 10 Mar 2009 16:19:17 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote: > > Here is the initial patch and tool for finding the missing pages. > > In the following example, the pages with no flags set is kind of too > many (1816MB), but hopefully your missing pages will have PG_reserved > or other flags set ;-) > > # ./page-types > L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab > W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked > Thanks. I'll have a look in a bit. Right now I'm very close to a complete bisect. It is just ftrace commits left though, so I'm somewhat sceptical that it is correct. ftrace isn't even turned on in the kernels I've been testing. The remaining commits are ec1bb60bb..6712e299. Rgds
On Tue, Mar 10, 2009 at 11:55:23AM +0200, Pierre Ossman wrote:
> On Tue, 10 Mar 2009 16:19:17 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> >
> > Here is the initial patch and tool for finding the missing pages.
> >
> > In the following example, the pages with no flags set is kind of too
> > many (1816MB), but hopefully your missing pages will have PG_reserved
> > or other flags set ;-)
> >
> > # ./page-types
> > L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab
> W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked
> >
>
> Thanks. I'll have a look in a bit. Right now I'm very close to a
> complete bisect. It is just ftrace commits left though, so I'm somewhat
> sceptical that it is correct. ftrace isn't even turned on in the
> kernels I've been testing.
>
> The remaining commits are ec1bb60bb..6712e299.

And here's my progress, some more page flags are introduced:

# ./page-types
  flags  page-count    MB  symbolic-flags      long-symbolic-flags
0x00000        3978    15  __________________
0x00004           1     0  __R_______________  referenced
0x00014           5     0  __R_D_____________  referenced,dirty
0x00020           2     0  _____l____________  lru
0x00028        8835    34  ___U_l____________  uptodate,lru
0x0002c        9588    37  __RU_l____________  referenced,uptodate,lru
0x00068        1031     4  ___U_lA___________  uptodate,lru,active
0x0006c        3032    11  __RU_lA___________  referenced,uptodate,lru,active
0x00080       11001    42  _______S__________  slab
0x00228         140     0  ___U_l___x________  uptodate,lru,reclaim
0x0022c          79     0  __RU_l___x________  referenced,uptodate,lru,reclaim
0x00268          43     0  ___U_lA__x________  uptodate,lru,active,reclaim
0x0026c         110     0  __RU_lA__x________  referenced,uptodate,lru,active,reclaim
0x00400        1102     4  __________B_______  buddy
0x00800       18735    73  ___________r______  reserved
0x02008          13     0  ___U_________b____  uptodate,swapbacked
0x02068        9371    36  ___U_lA______b____  uptodate,lru,active,swapbacked
0x0206c        1339     5  __RU_lA______b____  referenced,uptodate,lru,active,swapbacked
0x02078          21     0  ___UDlA______b____  uptodate,dirty,lru,active,swapbacked
0x0207c          17     0  __RUDlA______b____  referenced,uptodate,dirty,lru,active,swapbacked
0x20000      445525  1740  _________________n  noflags
  total      513968  2007

Thanks,
Fengguang
On Tue, Mar 10, 2009 at 08:22:10PM +0800, Wu Fengguang wrote:
> On Tue, Mar 10, 2009 at 11:55:23AM +0200, Pierre Ossman wrote:
> > On Tue, 10 Mar 2009 16:19:17 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > >
> > > Here is the initial patch and tool for finding the missing pages.
> > >
> > > In the following example, the pages with no flags set is kind of too
> > > many (1816MB), but hopefully your missing pages will have PG_reserved
> > > or other flags set ;-)
> > >
> > > # ./page-types
> > > L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab
> W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked
> > >
> >
> > Thanks. I'll have a look in a bit. Right now I'm very close to a
> > complete bisect. It is just ftrace commits left though, so I'm somewhat
> > sceptical that it is correct. ftrace isn't even turned on in the
> > kernels I've been testing.
> >
> > The remaining commits are ec1bb60bb..6712e299.

Another tool to show the page locations with specified flags:

# ./page-areas 0x20000 | head
offset  len     KB
    11    1    4KB
    13    3   12KB
    17    7   28KB
    25    1    4KB
    31    1    4KB
    33   31  124KB
    65   63  252KB
   129   15   60KB
   145    7   28KB

If we run eatmem or the following commands to take up free memory,
the missing pages will show up :-)

dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
cp /tmp/s /dev/null

Thanks,
Fengguang
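The page-areas output reads as a run-length summary: each row is a starting page offset, the number of consecutive pages carrying the requested flags, and the size of that run. A rough Python sketch of such a grouping is shown here, assuming 4 KB pages. It is an illustration of the idea only, not the logic of the attached page-areas.c, and `page_areas` plus the sample input are hypothetical.

```python
def page_areas(pfns, page_kb=4):
    """Group page frame numbers into (offset, len, KB) runs of consecutive pages."""
    areas = []
    for pfn in sorted(set(pfns)):
        if areas and areas[-1][0] + areas[-1][1] == pfn:
            areas[-1][1] += 1          # pfn extends the current run
        else:
            areas.append([pfn, 1])     # pfn starts a new run
    return [(off, n, n * page_kb) for off, n in areas]


# Hypothetical input, chosen so the result matches the first three rows above.
sample = [11, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23]
for off, n, kb in page_areas(sample):
    print(f"{off:>6} {n:>4} {kb}KB")
```

Long runs suggest one large allocation, while many one-page runs suggest pages taken one at a time.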
Reply-To: drzeus@drzeus.cx

My bisect has run into a wall. I cannot run any of the intermediate
kernels that are left. I could try reverting the commits one at a time,
but I'll take a break and test your code here.

Now we just have to wait for the kernel to compile. :)

Rgds
Reply-To: drzeus@drzeus.cx Ok, I think I've found some, but not all of the missing memory. I had to remove PG_swapbacked and PG_private2 as 2.6.26/2.6.27 didn't have those bits. After that, a comparison shows that this row is in 2.6.27, but not 2.6.26: 0x00020 20576 80 _____l____________ lru Unfortunately there are about 170 MB of missing memory, not 80. So we probably need to dig deeper. But does the above say anything to you? Rgds
Reply-To: drzeus@drzeus.cx On Tue, 10 Mar 2009 21:11:55 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote: > If we run eatmem or the following commands to take up free memory, > the missing pages will show up :-) > > dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024 > cp /tmp/s /dev/null > Not here, which now means I've "found" all of my missing 170 MB. On 2.6.27, when I fill the page cache I still get over 90 MB left in "noflags": 0x20000 24394 95 _________________n noflags The same thing with 2.6.26 almost completely drains it: 0x20000 3697 14 _________________n noflags Another interesting data point is that those 80 MB always seem to be the exact same number of pages every boot. Rgds
Hi > > Here is the initial patch and tool for finding the missing pages. > > > > In the following example, the pages with no flags set is kind of too > > many (1816MB), but hopefully your missing pages will have PG_reserved > > or other flags set ;-) > > > > # ./page-types > > L:locked E:error R:referenced U:uptodate D:dirty L:lru A:active S:slab > W:writeback x:reclaim B:buddy r:reserved c:swapcache b:swapbacked > > > > Thanks. I'll have a look in a bit. Right now I'm very close to a > complete bisect. It is just ftrace commits left though, so I'm somewhat > sceptical that it is correct. ftrace isn't even turned on in the > kernels I've been testing. > > The remaining commits are ec1bb60bb..6712e299. Can you try to turn off CONFIG_FTRACE* build option?
On Tue, Mar 10, 2009 at 10:21:18PM +0200, Pierre Ossman wrote: > On Tue, 10 Mar 2009 21:11:55 +0800 > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > If we run eatmem or the following commands to take up free memory, > > the missing pages will show up :-) > > > > dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024 > > cp /tmp/s /dev/null > > > > Not here, which now means I've "found" all of my missing 170 MB. > > On 2.6.27, when I fill the page cache I still get over 90 MB left in > "noflags": > > 0x20000 24394 95 _________________n noflags > > The same thing with 2.6.26 almost completely drains it: > > 0x20000 3697 14 _________________n noflags > > Another interesting data point is that those 80 MB always seem to be > the exact same number of pages every boot. This 80MB noflags pages together with the below 80MB lru pages are very close to the missing page numbers :-) Could you run the following commands on fresh booted 2.6.27 and post the output files? Thank you! dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024 cp /tmp/s /dev/null ./page-flags > flags ./page-areas =0x20000 > areas-noflags ./page-areas =0x00020 > areas-lru The attached page-areas.c can do the above exact flags matching. > After that, a comparison shows that this row is in 2.6.27, but not > 2.6.26: > > 0x00020 20576 80 _____l____________ lru > > Unfortunately there are about 170 MB of missing memory, not 80. So we > probably need to dig deeper. But does the above say anything to you? > I had to remove PG_swapbacked and PG_private2 as 2.6.26/2.6.27 didn't > have those bits. Ah sorry! I forgot to switch the tree back to 2.6.27 to run a test. Thanks, Fengguang
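The claim that the two groups of pages come close to the missing memory can be checked with the page counts quoted in this message. The Python sketch below assumes 4 KB pages; the variable names are made up for the example, and the counts are the ones reported for 2.6.26 and 2.6.27.

```python
PAGE_KB = 4  # assumption: 4 KB pages


def pages_to_mb(pages):
    """Convert a page count to megabytes."""
    return pages * PAGE_KB / 1024


lru_only_2627 = 20576   # 0x00020 row, seen on 2.6.27 but not on 2.6.26
noflags_2627 = 24394    # 0x20000 row on 2.6.27 after filling the page cache
noflags_2626 = 3697     # 0x20000 row on 2.6.26 after filling the page cache

extra_noflags = noflags_2627 - noflags_2626
suspect_pages = lru_only_2627 + extra_noflags

print(f"lru-only:      {pages_to_mb(lru_only_2627):.1f} MB")
print(f"extra noflags: {pages_to_mb(extra_noflags):.1f} MB")
print(f"together:      {suspect_pages} pages, {pages_to_mb(suspect_pages):.1f} MB")
```

The two groups add up to 41273 pages, or about 161 MB, which is in the same range as the roughly 170 MB reported as missing.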
Reply-To: drzeus@drzeus.cx On Wed, 11 Mar 2009 09:37:40 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote: > > This 80MB noflags pages together with the below 80MB lru pages are > very close to the missing page numbers :-) Could you run the following > commands on fresh booted 2.6.27 and post the output files? Thank you! > > dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024 > cp /tmp/s /dev/null > > ./page-flags > flags > ./page-areas =0x20000 > areas-noflags > ./page-areas =0x00020 > areas-lru > Attached. I have to say, the patterns look very much like some kind of leak. Rgds
On Wed, Mar 11, 2009 at 08:57:03AM +0200, Pierre Ossman wrote:
> On Wed, 11 Mar 2009 09:37:40 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> >
> > This 80MB noflags pages together with the below 80MB lru pages are
> > very close to the missing page numbers :-) Could you run the following
> > commands on fresh booted 2.6.27 and post the output files? Thank you!
> >
> > dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
> > cp /tmp/s /dev/null
> >
> > ./page-flags > flags
> > ./page-areas =0x20000 > areas-noflags
> > ./page-areas =0x00020 > areas-lru
> >
>
> Attached.

Thank you very much!

> I have to say, the patterns look very much like some kind of leak.

Wow it looks really interesting. The lru pages and noflags pages make
perfect 1-page interleaved pattern...

Thanks,
Fengguang

areas-lru
> offset len KB
> 86016 1 4KB
> 86018 1 4KB
> 86020 1 4KB
> 86022 1 4KB
> 86024 1 4KB
> 86026 1 4KB
> 86028 1 4KB
> 86030 1 4KB
> 86032 1 4KB
> 86034 1 4KB
> 86036 1 4KB
> 86038 1 4KB
> 86040 1 4KB
> 86042 1 4KB
> 86044 1 4KB
> 86046 1 4KB
> 86048 1 4KB
> 86050 1 4KB
> 86052 1 4KB
> 86054 1 4KB
> 86056 1 4KB
> 86058 1 4KB
> 86060 1 4KB
> 86062 1 4KB
> 86064 1 4KB
> 86066 1 4KB
> 86068 1 4KB
> 86070 1 4KB
> 86072 1 4KB
> 86074 1 4KB
> 86076 1 4KB
> 86078 1 4KB
> 86080 1 4KB
> 86082 1 4KB
> 86084 1 4KB
> 86086 1 4KB
> 86088 1 4KB
> 86090 1 4KB
> 86092 1 4KB
> 86094 1 4KB
> 86096 1 4KB
> 86098 1 4KB
> 86100 1 4KB
> 86102 1 4KB
> 86104 1 4KB

areas-noflags
> 86017 1 4KB
> 86019 1 4KB
> 86021 1 4KB
> 86023 1 4KB
> 86025 1 4KB
> 86027 1 4KB
> 86029 1 4KB
> 86031 1 4KB
> 86033 1 4KB
> 86035 1 4KB
> 86037 1 4KB
> 86039 1 4KB
> 86041 1 4KB
> 86043 1 4KB
> 86045 1 4KB
> 86047 1 4KB
> 86049 1 4KB
> 86051 1 4KB
> 86053 1 4KB
> 86055 1 4KB
> 86057 1 4KB
> 86059 1 4KB
> 86061 1 4KB
> 86063 1 4KB
> 86065 1 4KB
> 86067 1 4KB
> 86069 1 4KB
> 86071 1 4KB
> 86073 1 4KB
> 86075 1 4KB
> 86077 1 4KB
> 86079 1 4KB
> 86081 1 4KB
> 86083 1 4KB
> 86085 1 4KB
> 86087 1 4KB
> 86089 1 4KB
> 86091 1 4KB
> 86093 1 4KB
> 86095 1 4KB
> 86097 1 4KB
> 86099 1 4KB
> 86101 1 4KB
> 86103 1 4KB

> flags page-count MB symbolic-flags long-symbolic-flags
> 0x00000 1892 7 __________________
> 0x00004 1 0 __R_______________ referenced
> 0x00008 454 1 ___U______________ uptodate
> 0x0000c 94 0 __RU______________ referenced,uptodate
> 0x00020 20576 80 _____l____________ lru
> 0x00028 226 0 ___U_l____________ uptodate,lru
> 0x0002c 67911 265 __RU_l____________ referenced,uptodate,lru
> 0x00068 6621 25 ___U_lA___________ uptodate,lru,active
> 0x0006c 1222 4 __RU_lA___________ referenced,uptodate,lru,active
> 0x00078 1 0 ___UDlA___________ uptodate,dirty,lru,active
> 0x00080 3523 13 _______S__________ slab
> 0x000c0 55 0 ______AS__________ active,slab
> 0x00228 5 0 ___U_l___x________ uptodate,lru,reclaim
> 0x0022c 1 0 __RU_l___x________ referenced,uptodate,lru,reclaim
> 0x00268 23 0 ___U_lA__x________ uptodate,lru,active,reclaim
> 0x0026c 52 0 __RU_lA__x________ referenced,uptodate,lru,active,reclaim
> 0x00400 9 0 __________B_______ buddy
> 0x00408 60 0 ___U______B_______ uptodate,buddy
> 0x00800 4042 15 ___________r______ reserved
> 0x04020 9 0 _____l________P___ lru,private
> 0x04024 14 0 __R__l________P___ referenced,lru,private
> 0x04028 4 0 ___U_l________P___ uptodate,lru,private
> 0x0402c 1 0 __RU_l________P___ referenced,uptodate,lru,private
> 0x04060 10 0 _____lA_______P___ lru,active,private
> 0x04064 7 0 __R__lA_______P___ referenced,lru,active,private
> 0x04068 16 0 ___U_lA_______P___ uptodate,lru,active,private
> 0x20000 24227 94 _________________n noflags
> total 131056 511

> MemTotal: 508056 kB
> MemFree: 7716 kB
> Buffers: 220 kB
> Cached: 280468 kB
> SwapCached: 0 kB
> Active: 31184 kB
> Inactive: 271508 kB
> SwapTotal: 524280 kB
> SwapFree: 524232 kB
> Dirty: 1284 kB
> Writeback: 0 kB
> AnonPages: 22044 kB
> Mapped: 8652 kB
> Slab: 21508 kB
> SReclaimable: 4212 kB
> SUnreclaim: 17296 kB
> PageTables: 3036 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 778308 kB
> Committed_AS: 80544 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 1740 kB
> VmallocChunk: 34359736619 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 8128 kB
> DirectMap2M: 516096 kB
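The interleaving noted in this message can be verified mechanically from the two area lists. The following Python sketch checks that two sets of single-page offsets alternate with no gaps. The function name `interleaved` is hypothetical, and the ranges simply restate the offsets in the quoted excerpts (even pages in areas-lru, odd pages in areas-noflags).

```python
def interleaved(a_offsets, b_offsets):
    """True if the pages form one gap-free run that alternates a, b, a, b, ..."""
    merged = sorted([(off, "a") for off in a_offsets] +
                    [(off, "b") for off in b_offsets])
    pairs = list(zip(merged, merged[1:]))
    contiguous = all(nxt[0] - cur[0] == 1 for cur, nxt in pairs)
    alternating = all(cur[1] != nxt[1] for cur, nxt in pairs)
    return contiguous and alternating


areas_lru = range(86016, 86105, 2)       # 86016, 86018, ..., 86104
areas_noflags = range(86017, 86104, 2)   # 86017, 86019, ..., 86103
print(interleaved(areas_lru, areas_noflags))
```

A strict alternation like this suggests the pages were allocated in pairs by a single consumer, with each page of a pair treated differently.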
Reply-To: drzeus@drzeus.cx On Wed, 11 Mar 2009 09:19:32 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote: > > Can you try to turn off CONFIG_FTRACE* build option? > That's just it, it is off. Rgds
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 15:14:45 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Wed, Mar 11, 2009 at 08:57:03AM +0200, Pierre Ossman wrote:
> > On Wed, 11 Mar 2009 09:37:40 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > >
> > > This 80MB noflags pages together with the below 80MB lru pages are
> > > very close to the missing page numbers :-) Could you run the following
> > > commands on fresh booted 2.6.27 and post the output files? Thank you!
> > >
> > > dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024
> > > cp /tmp/s /dev/null
> > >
> > > ./page-flags > flags
> > > ./page-areas =0x20000 > areas-noflags
> > > ./page-areas =0x00020 > areas-lru
> > >
> >
> > Attached.
>
> Thank you very much!
>
> > I have to say, the patterns look very much like some kind of leak.
>
> Wow it looks really interesting. The lru pages and noflags pages make
> perfect 1-page interleaved pattern...
>

Another breakthrough. I turned off everything in kernel/trace, and now
the missing memory is back. Here's the relevant diff against the
original .config:

@@ -3677,18 +3639,15 @@
 # CONFIG_BACKTRACE_SELF_TEST is not set
 # CONFIG_LKDTM is not set
 # CONFIG_FAULT_INJECTION is not set
-CONFIG_LATENCYTOP=y
+# CONFIG_LATENCYTOP is not set
 # CONFIG_SYSCTL_SYSCALL_CHECK is not set
 CONFIG_HAVE_FTRACE=y
 CONFIG_HAVE_DYNAMIC_FTRACE=y
-CONFIG_TRACER_MAX_TRACE=y
-CONFIG_TRACING=y
 # CONFIG_FTRACE is not set
-CONFIG_IRQSOFF_TRACER=y
-CONFIG_SYSPROF_TRACER=y
-CONFIG_SCHED_TRACER=y
-CONFIG_CONTEXT_SWITCH_TRACER=y
-# CONFIG_FTRACE_STARTUP_TEST is not set
+# CONFIG_IRQSOFF_TRACER is not set
+# CONFIG_SYSPROF_TRACER is not set
+# CONFIG_SCHED_TRACER is not set
+# CONFIG_CONTEXT_SWITCH_TRACER is not set

I'll enable them one at a time and see when the bug reappears, but if
you have some ideas on which it could be, that would be helpful. The
machine takes some time to recompile a kernel. :)

Rgds
(add cc) On Wed, Mar 11, 2009 at 09:26:58AM +0200, Pierre Ossman wrote: > On Wed, 11 Mar 2009 15:14:45 +0800 > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > On Wed, Mar 11, 2009 at 08:57:03AM +0200, Pierre Ossman wrote: > > > On Wed, 11 Mar 2009 09:37:40 +0800 > > > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > > > > > > > > > This 80MB noflags pages together with the below 80MB lru pages are > > > > very close to the missing page numbers :-) Could you run the following > > > > commands on fresh booted 2.6.27 and post the output files? Thank you! > > > > > > > > dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024 > > > > cp /tmp/s /dev/null > > > > > > > > ./page-flags > flags > > > > ./page-areas =0x20000 > areas-noflags > > > > ./page-areas =0x00020 > areas-lru > > > > > > > > > > Attached. > > > > Thank you very much! > > > > > I have to say, the patterns look very much like some kind of leak. > > > > Wow it looks really interesting. The lru pages and noflags pages make > > perfect 1-page interleaved pattern... > > > > Another breakthrough. I turned off everything in kernel/trace, and now > the missing memory is back. 
Here's the relevant diff against the > original .config: > > @@ -3677,18 +3639,15 @@ > # CONFIG_BACKTRACE_SELF_TEST is not set > # CONFIG_LKDTM is not set > # CONFIG_FAULT_INJECTION is not set > -CONFIG_LATENCYTOP=y > +# CONFIG_LATENCYTOP is not set > # CONFIG_SYSCTL_SYSCALL_CHECK is not set > CONFIG_HAVE_FTRACE=y > CONFIG_HAVE_DYNAMIC_FTRACE=y > -CONFIG_TRACER_MAX_TRACE=y > -CONFIG_TRACING=y > # CONFIG_FTRACE is not set > -CONFIG_IRQSOFF_TRACER=y > -CONFIG_SYSPROF_TRACER=y > -CONFIG_SCHED_TRACER=y > -CONFIG_CONTEXT_SWITCH_TRACER=y > -# CONFIG_FTRACE_STARTUP_TEST is not set > +# CONFIG_IRQSOFF_TRACER is not set > +# CONFIG_SYSPROF_TRACER is not set > +# CONFIG_SCHED_TRACER is not set > +# CONFIG_CONTEXT_SWITCH_TRACER is not set > > I'll enable them one at a time and see when the bug reappears, but if > you have some ideas on which it could be, that would be helpful. The > machine takes some time to recompile a kernel. :) A quick question: are there any possibility of ftrace memory reservation? Thanks, Fengguang
Reply-To: drzeus@drzeus.cx On Wed, 11 Mar 2009 15:36:19 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote: > > A quick question: are there any possibility of ftrace memory reservation? > You tell me. CONFIG_FTRACE was always disabled, but CONFIG_HAVE_*FTRACE is always on. FTRACE wasn't included in 2.6.26 though, and the bisect showed only ftrace commits. So it would explain things. Rgds
On Wed, Mar 11, 2009 at 09:57:38AM +0200, Pierre Ossman wrote: > On Wed, 11 Mar 2009 15:36:19 +0800 > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > > > A quick question: are there any possibility of ftrace memory reservation? > > > > You tell me. CONFIG_FTRACE was always disabled, but CONFIG_HAVE_*FTRACE > is always on. FTRACE wasn't included in 2.6.26 though, and the bisect > showed only ftrace commits. So it would explain things. There are some __get_free_page() calls in kernel/trace/ring_buffer.c, maybe the pages are consumed by one of them? Thanks, Fengguang
Hi Pierre,

On Wed, Mar 11, 2009 at 09:57:38AM +0200, Pierre Ossman wrote:
> On Wed, 11 Mar 2009 15:36:19 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> >
> > A quick question: are there any possibility of ftrace memory reservation?
> >
>
> You tell me. CONFIG_FTRACE was always disabled, but CONFIG_HAVE_*FTRACE
> is always on. FTRACE wasn't included in 2.6.26 though, and the bisect
> showed only ftrace commits. So it would explain things.

I worked up a simple debugging patch. Since the missing pages are
continuously spanned, several stack dumping shall be enough to catch
the page consumer.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 27b8681..c0df7fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1087,6 +1087,13 @@ again:
 			goto failed;
 	}
 
+	/* wfg - hunting the 40000 missing pages */
+	{
+		unsigned long pfn = page_to_pfn(page);
+		if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
+			dump_stack();
+	}
+
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
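The condition in the patch acts as a sampling filter: it dumps a stack only for page frame numbers above 0x1000 whose low 12 bits are 0 or 1, that is, two frames out of every 4096. The Python sketch below mirrors just that condition, to show which frames would trigger a dump. The scanned range is a hypothetical choice that covers the region listed in areas-lru (offset 86016 is 0x15000).

```python
def would_dump(pfn):
    """Mirror of the pfn test in the debugging patch."""
    return pfn > 0x1000 and (pfn & 0xfff) <= 1


# Hypothetical range around the interleaved region reported earlier.
hits = [pfn for pfn in range(0x14000, 0x18000) if would_dump(pfn)]
print([hex(p) for p in hits])
print(would_dump(86016), would_dump(86017), would_dump(86018))
```

Because the interleaved pages alternate between two kinds, sampling two adjacent frames per 4096 gives a chance of catching the allocation path for both.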
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 16:20:38 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> There are some __get_free_page() calls in kernel/trace/ring_buffer.c,
> maybe the pages are consumed by one of them?
>

Perhaps. I enabled CONFIG_SYSPROF_TRACER (which pulls in
ring_buffer.c). That made the "noflags" memory disappear, but the "lru"
section is not there. I.e. I've lost about 80 MB instead of 170 MB.

The diff against the fully broken conf is now:

@@ -3677,17 +3640,16 @@
 # CONFIG_BACKTRACE_SELF_TEST is not set
 # CONFIG_LKDTM is not set
 # CONFIG_FAULT_INJECTION is not set
-CONFIG_LATENCYTOP=y
+# CONFIG_LATENCYTOP is not set
 # CONFIG_SYSCTL_SYSCALL_CHECK is not set
 CONFIG_HAVE_FTRACE=y
 CONFIG_HAVE_DYNAMIC_FTRACE=y
-CONFIG_TRACER_MAX_TRACE=y
 CONFIG_TRACING=y
 # CONFIG_FTRACE is not set
-CONFIG_IRQSOFF_TRACER=y
+# CONFIG_IRQSOFF_TRACER is not set
 CONFIG_SYSPROF_TRACER=y
-CONFIG_SCHED_TRACER=y
-CONFIG_CONTEXT_SWITCH_TRACER=y
+# CONFIG_SCHED_TRACER is not set
+# CONFIG_CONTEXT_SWITCH_TRACER is not set
 # CONFIG_FTRACE_STARTUP_TEST is not set
 CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
 # CONFIG_FIREWIRE_OHCI_REMOTE_DMA is not set

Rgds
On Wed, 11 Mar 2009, Wu Fengguang wrote: > > > > > > > > > > This 80MB noflags pages together with the below 80MB lru pages are > > > > > very close to the missing page numbers :-) Could you run the > following > > > > > commands on fresh booted 2.6.27 and post the output files? Thank you! > > > > > > > > > > dd if=/dev/zero of=/tmp/s bs=1M count=1 seek=1024 > > > > > cp /tmp/s /dev/null > > > > > > > > > > ./page-flags > flags > > > > > ./page-areas =0x20000 > areas-noflags > > > > > ./page-areas =0x00020 > areas-lru > > > > > > > > > > > > > Attached. > > > > > > Thank you very much! > > > > > > > I have to say, the patterns look very much like some kind of leak. > > > > > > Wow it looks really interesting. The lru pages and noflags pages make > > > perfect 1-page interleaved pattern... > > > > > > > Another breakthrough. I turned off everything in kernel/trace, and now > > the missing memory is back. Here's the relevant diff against the > > original .config: > > [..] > > > > I'll enable them one at a time and see when the bug reappears, but if > > you have some ideas on which it could be, that would be helpful. The > > machine takes some time to recompile a kernel. :) > > A quick question: are there any possibility of ftrace memory reservation? The ring buffer is allocated at start up (although I'm thinking of making it allocated when it is first used), and the allocations are done percpu. It allocates around 3 megs per cpu. How many CPUs were on this box? -- Steve
Reply-To: drzeus@drzeus.cx On Wed, 11 Mar 2009 10:25:10 -0400 (EDT) Steven Rostedt <rostedt@goodmis.org> wrote: > > The ring buffer is allocated at start up (although I'm thinking of making > it allocated when it is first used), and the allocations are done percpu. > > It allocates around 3 megs per cpu. How many CPUs were on this box? > One. :) Rgds
Reply-To: drzeus@drzeus.cx On Wed, 11 Mar 2009 21:00:22 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote: > > I worked up a simple debugging patch. Since the missing pages are > continuously spanned, several stack dumping shall be enough to catch > the page consumer. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 27b8681..c0df7fd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1087,6 +1087,13 @@ again: > goto failed; > } > > + /* wfg - hunting the 40000 missing pages */ > + { > + unsigned long pfn = page_to_pfn(page); > + if (pfn > 0x1000 && (pfn & 0xfff) <= 1) > + dump_stack(); > + } > + > __count_zone_vm_events(PGALLOC, zone, 1 << order); > zone_statistics(preferred_zone, zone); > local_irq_restore(flags); This got very noisy, but here's what was in the ring buffer once it had booted. Note that this is where only the "noflags" pages have been allocated, not "lru". Rgds
On Wed, 11 Mar 2009, Pierre Ossman wrote: > On Wed, 11 Mar 2009 21:00:22 +0800 > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > > > I worked up a simple debugging patch. Since the missing pages are > > continuously spanned, several stack dumping shall be enough to catch > > the page consumer. > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 27b8681..c0df7fd 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -1087,6 +1087,13 @@ again: > > goto failed; > > } > > > > + /* wfg - hunting the 40000 missing pages */ > > + { > > + unsigned long pfn = page_to_pfn(page); > > + if (pfn > 0x1000 && (pfn & 0xfff) <= 1) > > + dump_stack(); > > + } > > + > > __count_zone_vm_events(PGALLOC, zone, 1 << order); > > zone_statistics(preferred_zone, zone); > > local_irq_restore(flags); > > This got very noisy, but here's what was in the ring buffer once it had > booted. > > Note that this is where only the "noflags" pages have been allocated, > not "lru". BTW, which kernel are you testing? 2.6.27, ftrace had its own special buffering system. It played tricks with the page structs of the pages in the buffer. It used the lru parts of the pages to link list itself. I just booted on a straight 2.6.27 with tracing configured. # cat /debug/tracing/trace_entries 65586 This is the old method to see the amount of data used. There are a total of 65,586 entries all of 88 bytes each: 5,771,568 And since we also have a "snapshot" buffer for max latencies, the total is: 11,543,136. That is quite a lot of memory for one CPU :-/ Starting with 2.6.28, we now have the unified ring buffer. It removes all of the page struct hackery in the original code. In 2.6.28, the trace_entries is a misnomer. The conversion to the ring buffer brought had the change from representing the number of entries (entries in the ring buffer are now variable length) and the count is the number of bytes each CPU buffer takes up (*2 because of the "snapshot" buffer). 
 # cat /debug/tracing/trace_entries
1441792

Now we have 1,441,792 bytes per buffer, or about 3 megs as the default.
Today, we have it as:

 # cat /debug/tracing/buffer_size_kb
1410

Still the 3 megs. But going from 10 megs a CPU to 3 megs is a big
difference. Do you see the same amount of lost memory with the later
kernels?

I'll have to add an option to expand the ring buffer only when a tracer
is registered. That will be the default option.

-- Steve

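The per-CPU figures in the message above can be re-derived with a few lines of C. This is only a sketch of the arithmetic quoted in the mail, not kernel code: the function names are made up for illustration, and the factor of two assumes, as the mail states, that a second "snapshot" buffer of the same size is always allocated.

```c
#include <stdint.h>

/*
 * 2.6.27: trace_entries reports a number of fixed-size entries.
 * Per-CPU cost = entries * entry size, doubled for the max-latency
 * "snapshot" buffer. (Hypothetical helper, not a kernel function.)
 */
uint64_t ftrace_2627_bytes_per_cpu(uint64_t entries, uint64_t entry_size)
{
	return entries * entry_size * 2;
}

/*
 * 2.6.28 and later: trace_entries already reports bytes per CPU buffer,
 * so only the doubling for the "snapshot" buffer remains.
 * (Hypothetical helper, not a kernel function.)
 */
uint64_t ftrace_2628_bytes_per_cpu(uint64_t buffer_bytes)
{
	return buffer_bytes * 2;
}
```

With the values from the mail, `ftrace_2627_bytes_per_cpu(65586, 88)` gives 11,543,136 bytes (about 11 MB) and `ftrace_2628_bytes_per_cpu(1441792)` gives 2,883,584 bytes (just under 3 MB).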
Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 11:47:16 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

>
> BTW, which kernel are you testing? 2.6.27, ftrace had its own special
> buffering system. It played tricks with the page structs of the pages in
> the buffer. It used the lru parts of the pages to link list itself.
> I just booted on a straight 2.6.27 with tracing configured.
>

I've been primarily testing 2.6.27, yes. I think I tested 2.6.29-rc7 at
the beginning of this, but my memory is a bit fuzzy so I better retest.

Rgds

Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 10:25:10 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

>
> The ring buffer is allocated at start up (although I'm thinking of making
> it allocated when it is first used), and the allocations are done percpu.
>
> It allocates around 3 megs per cpu. How many CPUs were on this box?
>

Is this per actual CPU though? Or per CONFIG_NR_CPUS? 3 MB times 64
equals roughly the lost memory. But then again, you said it was 10 MB
per CPU for 2.6.27...

Rgds

On Wed, 11 Mar 2009, Pierre Ossman wrote:

> On Wed, 11 Mar 2009 10:25:10 -0400 (EDT)
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> >
> > The ring buffer is allocated at start up (although I'm thinking of making
> > it allocated when it is first used), and the allocations are done percpu.
> >
> > It allocates around 3 megs per cpu. How many CPUs were on this box?
> >
>
> Is this per actual CPU though? Or per CONFIG_NR_CPUS? 3 MB times 64
> equals roughly the lost memory. But then again, you said it was 10 MB
> per CPU for 2.6.27...

It uses the possible_cpu mask. How many possible CPUs are on your box?
I've thought about making this handle hot plug CPUs, but that will
require a little more overhead for everyone, whether or not you hot plug a
cpu.

-- Steve

Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 13:28:31 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

>
> On Wed, 11 Mar 2009, Pierre Ossman wrote:
>
> >
> > Is this per actual CPU though? Or per CONFIG_NR_CPUS? 3 MB times 64
> > equals roughly the lost memory. But then again, you said it was 10 MB
> > per CPU for 2.6.27...
>
> It uses the possible_cpu mask. How many possible CPUs are on your box?
> I've thought about making this handle hot plug CPUs, but that will
> require a little more overhead for everyone, whether or not you hot plug a
> cpu.
>

CONFIG_NR_CPUS is 64 for these compiles.

Rgds

On Wed, 11 Mar 2009, Pierre Ossman wrote:

> On Wed, 11 Mar 2009 13:28:31 -0400 (EDT)
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> >
> > On Wed, 11 Mar 2009, Pierre Ossman wrote:
> >
> > >
> > > Is this per actual CPU though? Or per CONFIG_NR_CPUS? 3 MB times 64
> > > equals roughly the lost memory. But then again, you said it was 10 MB
> > > per CPU for 2.6.27...
> >
> > It uses the possible_cpu mask. How many possible CPUs are on your box?
> > I've thought about making this handle hot plug CPUs, but that will
> > require a little more overhead for everyone, whether or not you hot plug a
> > cpu.
> >
>
> CONFIG_NR_CPUS is 64 for these compiles.

Hmm, I assumed (but could be wrong) that on boot up, the system checked
how many CPUs were physically possible, and updated the possible CPU
mask accordingly (default being NR_CPUS).

If this is not the case, then I'll have to implement hot plug allocation.
:-/

Thanks,

-- Steve

Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 14:48:02 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

>
> Hmm, I assumed (but could be wrong) that on boot up, the system checked
> how many CPUs were physically possible, and updated the possible CPU
> mask accordingly (default being NR_CPUS).
>
> If this is not the case, then I'll have to implement hot plug allocation.
> :-/
>

I have no idea, but not every system suffers from this problem, so
there is something more to this. Modern Fedora kernels have NR_CPUS set
to 512, and it's not like I'm missing 1.5 GB here. :)

Rgds

On Wed, 11 Mar 2009, Pierre Ossman wrote:

> On Wed, 11 Mar 2009 14:48:02 -0400 (EDT)
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> >
> > Hmm, I assumed (but could be wrong) that on boot up, the system checked
> > how many CPUs were physically possible, and updated the possible CPU
> > mask accordingly (default being NR_CPUS).
> >
> > If this is not the case, then I'll have to implement hot plug allocation.
> > :-/
> >
>
> I have no idea, but every system doesn't suffer from this problem so
> there is something more to this. Modern fedora kernels have NR_CPUS set
> to 512, and it's not like I'm missing 1.5 GB here. :)
>

I'm thinking it is a system-dependent feature. I'm working on implementing
the ring buffers to only allocate for online CPUs. I just realized that
there's a check of a ring buffer cpu mask to see if it is OK to write to
that CPU buffer. This works out perfectly to keep non-allocated buffers
from being written to.

Thanks,

-- Steve

Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 17:46:38 +0100
Pierre Ossman <drzeus@drzeus.cx> wrote:

> On Wed, 11 Mar 2009 11:47:16 -0400 (EDT)
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> >
> > BTW, which kernel are you testing? 2.6.27, ftrace had its own special
> > buffering system. It played tricks with the page structs of the pages in
> > the buffer. It used the lru parts of the pages to link list itself.
> > I just booted on a straight 2.6.27 with tracing configured.
> >
>
> I've been primarily testing 2.6.27, yes. I think I tested 2.6.29-rc7 at
> the beginning of this, but my memory is a bit fuzzy so I better retest.
>

Annoying... 2.6.28 and newer refuse to boot. Has someone broken the
virtio_blk interface?

I'll reconfigure it to use piix tomorrow and see if I can get it
running.

Rgds

On Wed, Mar 11, 2009 at 05:02:23PM +0200, Pierre Ossman wrote:
> On Wed, 11 Mar 2009 21:00:22 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> >
> > I worked up a simple debugging patch. Since the missing pages are
> > continuously spanned, several stack dumping shall be enough to catch
> > the page consumer.
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 27b8681..c0df7fd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1087,6 +1087,13 @@ again:
> >  			goto failed;
> >  	}
> >
> > +	/* wfg - hunting the 40000 missing pages */
> > +	{
> > +		unsigned long pfn = page_to_pfn(page);
> > +		if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
> > +			dump_stack();
> > +	}
> > +
> >  	__count_zone_vm_events(PGALLOC, zone, 1 << order);
> >  	zone_statistics(preferred_zone, zone);
> >  	local_irq_restore(flags);
>
> This got very noisy, but here's what was in the ring buffer once it had
> booted.

It's about 20 stack dumps, hehe. Could you please paste some of them?
Thank you!

> Note that this is where only the "noflags" pages have been allocated,
> not "lru".

The lru pages have even-numbered pfns, the noflags pages have odd-numbered
pfns. So for 1-page allocations, the ((pfn & 0xfff) <= 1) test will
match both lru and noflags pages.

Thanks,
Fengguang

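To see which page frames the filter in the debugging patch selects, the condition can be lifted into a standalone predicate. The sketch below copies the expression from the patch verbatim; the function name is hypothetical, and the statement about window size assumes 4 KB pages.

```c
#include <stdbool.h>

/*
 * Same condition as in the debugging patch: ignore everything up to and
 * including pfn 0x1000, then match only the first two page frames
 * (offsets 0 and 1) of each aligned block of 0x1000 frames, which is
 * 16 MB per block if pages are 4 KB.
 * (Hypothetical helper name; only the expression comes from the patch.)
 */
bool pfn_is_sampled(unsigned long pfn)
{
	return pfn > 0x1000 && (pfn & 0xfff) <= 1;
}
```

Offset 0 is an even pfn and offset 1 is an odd one, which is why single-page allocations of both the even-numbered (lru) and odd-numbered (noflags) pages can trigger a stack dump. For instance, `pfn_is_sampled(0x2000)` and `pfn_is_sampled(0x2001)` are both true, while `pfn_is_sampled(0x2002)` is false.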
> On Wed, 11 Mar 2009, Pierre Ossman wrote:
>
> > On Wed, 11 Mar 2009 14:48:02 -0400 (EDT)
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > >
> > > Hmm, I assumed (but could be wrong) that on boot up, the system checked
> > > how many CPUs were physically possible, and updated the possible CPU
> > > mask accordingly (default being NR_CPUS).
> > >
> > > If this is not the case, then I'll have to implement hot plug allocation.
> > > :-/

Pierre, could you please run the following command and post the result?

 # cat /sys/devices/system/cpu/possible

It prints the possible CPUs of your system.

> > I have no idea, but every system doesn't suffer from this problem so
> > there is something more to this. Modern fedora kernels have NR_CPUS set
> > to 512, and it's not like I'm missing 1.5 GB here. :)
> >
>
> I'm thinking it is a system dependent feature. I'm working on implementing
> the ring buffers to only allocate for online CPUS. I just realized that
> there's a check of a ring buffer cpu mask to see if it is OK to write to
> that CPU buffer. This works out perfectly, to keep non allocated buffers
> from being written to.
>
> Thanks,
>
> -- Steve

Reply-To: drzeus@drzeus.cx

On Wed, 11 Mar 2009 22:43:53 +0100
Pierre Ossman <drzeus@drzeus.cx> wrote:

>
> I'll reconfigure it to use piix tomorrow and see if I can get it
> running.
>

No dice. In both cases (virtio_blk and piix), it sees the disk and
reads the partitions, but then fails to find any volume groups. Does
this ring any bells?

Rgds

Reply-To: drzeus@drzeus.cx

On Thu, 12 Mar 2009 11:46:31 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

>
> Pierre, Could you please operate following command and post result?
>
> # cat /sys/devices/system/cpu/possible
>

[root@builder ~]# cat /sys/devices/system/cpu/possible
0-15

16 times 11 MB is also roughly the amount of lost memory, so this seems
reasonable.

Rgds

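The estimate above (possible CPUs times per-CPU buffer size) can be checked mechanically. The following is a rough sketch, not kernel code: both helper names are hypothetical, the parser handles only the simple `a-b,c` list form seen in the output above, and the per-CPU figure is whatever the caller passes in.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Count the CPUs named in a list such as "0-15" or "0-3,8-11", the
 * form printed by /sys/devices/system/cpu/possible in this thread.
 * Returns 0 on malformed input. (Hypothetical helper.)
 */
unsigned int cpulist_count(const char *list)
{
	unsigned int count = 0;
	const char *p = list;

	while (*p && *p != '\n') {
		char *end;
		unsigned long first = strtoul(p, &end, 10);
		unsigned long last = first;

		if (end == p)
			return 0;	/* no number where one was expected */
		p = end;
		if (*p == '-') {	/* a range "first-last" */
			p++;
			last = strtoul(p, &end, 10);
			if (end == p || last < first)
				return 0;
			p = end;
		}
		count += (unsigned int)(last - first + 1);
		if (*p == ',')
			p++;
		else if (*p && *p != '\n')
			return 0;	/* unexpected character */
	}
	return count;
}

/* Total trace buffer memory if every possible CPU gets a buffer. */
uint64_t trace_buffer_total_bytes(unsigned int possible_cpus,
				  uint64_t bytes_per_cpu)
{
	return (uint64_t)possible_cpus * bytes_per_cpu;
}
```

With the "0-15" output and the 11,543,136 bytes per CPU quoted earlier for 2.6.27, `trace_buffer_total_bytes(cpulist_count("0-15"), 11543136)` comes to 184,690,176 bytes, about 176 MB. That is in the same range as the 40000 missing pages from the debugging patch comment, which would be roughly 156 MB if pages are 4 KB.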
Reply-To: drzeus@drzeus.cx

On Thu, 12 Mar 2009 09:08:16 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Wed, Mar 11, 2009 at 05:02:23PM +0200, Pierre Ossman wrote:
> > On Wed, 11 Mar 2009 21:00:22 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > >
> > > I worked up a simple debugging patch. Since the missing pages are
> > > continuously spanned, several stack dumping shall be enough to catch
> > > the page consumer.
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 27b8681..c0df7fd 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1087,6 +1087,13 @@ again:
> > >  			goto failed;
> > >  	}
> > >
> > > +	/* wfg - hunting the 40000 missing pages */
> > > +	{
> > > +		unsigned long pfn = page_to_pfn(page);
> > > +		if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
> > > +			dump_stack();
> > > +	}
> > > +
> > >  	__count_zone_vm_events(PGALLOC, zone, 1 << order);
> > >  	zone_statistics(preferred_zone, zone);
> > >  	local_irq_restore(flags);
> >
> > This got very noisy, but here's what was in the ring buffer once it had
> > booted.
>
> It's about 20 stack dumps, hehe. Could you please paste some of them?
> Thank you!
>

Ooops, I meant to attach the dmesg output. Let's try again. :)

Rgds

On Thu, Mar 12, 2009 at 08:55:30AM +0200, Pierre Ossman wrote:
> On Thu, 12 Mar 2009 09:08:16 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > On Wed, Mar 11, 2009 at 05:02:23PM +0200, Pierre Ossman wrote:
> > > On Wed, 11 Mar 2009 21:00:22 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >
> > > >
> > > > I worked up a simple debugging patch. Since the missing pages are
> > > > continuously spanned, several stack dumping shall be enough to catch
> > > > the page consumer.
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 27b8681..c0df7fd 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1087,6 +1087,13 @@ again:
> > > >  			goto failed;
> > > >  	}
> > > >
> > > > +	/* wfg - hunting the 40000 missing pages */
> > > > +	{
> > > > +		unsigned long pfn = page_to_pfn(page);
> > > > +		if (pfn > 0x1000 && (pfn & 0xfff) <= 1)
> > > > +			dump_stack();
> > > > +	}
> > > > +
> > > >  	__count_zone_vm_events(PGALLOC, zone, 1 << order);
> > > >  	zone_statistics(preferred_zone, zone);
> > > >  	local_irq_restore(flags);
> > >
> > > This got very noisy, but here's what was in the ring buffer once it had
> > > booted.
> >
> > It's about 20 stack dumps, hehe. Could you please paste some of them?
> > Thank you!
> >
>
> Ooops, I meant to attach the dmesg output. Let's try again. :)

Ooops, there is no ftrace in the dmesg. The dumps are pretty normal page
faults. I overlooked the possibility of repeated alloc/free cycles on the
same pfn...

Anyway, please go on with Steven's ftrace patchset :-)

Thanks,
Fengguang