Bug 15058 - oom-killer kicks in: how to debug?
Summary: oom-killer kicks in: how to debug?
Status: CLOSED OBSOLETE
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Importance: P1 blocking
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-01-14 16:39 UTC by Michael Reinelt
Modified: 2012-05-14 15:34 UTC
CC List: 6 users

See Also:
Kernel Version: 2.6.32.4 2.6.33-rc5
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
/var/log/messages (4.29 KB, text/plain)
2010-01-14 16:39 UTC, Michael Reinelt
.config from 2.6.33-rc5 (70.78 KB, application/octet-stream)
2010-01-25 05:58 UTC, Michael Reinelt
/var/log/syslog (265.18 KB, text/plain)
2010-03-30 20:37 UTC, James Stevens
kernel config when running 2.6.32.2 (62.47 KB, application/octet-stream)
2010-03-30 20:38 UTC, James Stevens
Current kernel config (2.6.32.9) (52.41 KB, application/octet-stream)
2010-03-30 20:40 UTC, James Stevens
/var/log/messages (zipped) (379.89 KB, text/plain)
2010-05-08 14:10 UTC, James Stevens
/var/log/syslog (zipped) (488.23 KB, application/x-zip-compressed)
2010-05-08 14:15 UTC, James Stevens

Description Michael Reinelt 2010-01-14 16:39:51 UTC
Created attachment 24560 [details]
/var/log/messages

I have a computer with a Core2 Duo T7700 CPU and 4 GB RAM. I recently upgraded my kernel from 32 to 64 bit, and now the oom-killer keeps killing processes, but I don't understand why. I never had these problems when running 32 bit.

Every time the oom-killer kicks in, VMware Workstation 7 (64 bit) is involved. There's only one virtual machine running, with only 1G RAM, plus a KDE4 desktop with iceweasel and icedove (firefox and thunderbird on debian). It's not really reproducible, but it happens often when there's heavy disk I/O (e.g. shrinking a virtual disk).

At the moment VMware, icedove and iceweasel are up and running, and free -lm reports:

merlin.merlin.local:/home/michi # free -lm
             total       used       free     shared    buffers     cached
Mem:          3880       3701        178          0        100       1832
Low:          3880       3701        178
High:            0          0          0
-/+ buffers/cache:       1768       2111
Swap:         3906         54       3851
 
As you can see, there should be more than enough memory available; swap space is nearly unused.

Fortunately I kept a /var/log/messages from the last oom invocation, but I don't understand all the numbers in it. Maybe somebody can give me a hint what's wrong here....


TIA, Michael
Comment 1 Johannes Engel 2010-01-21 12:19:30 UTC
Probably a duplicate of #14933.
Comment 2 Michael Reinelt 2010-01-22 06:16:56 UTC
looks similar, indeed. Thanks a lot for the hint!

As the proposed patch is in 2.6.32.4, I just compiled this kernel, and I will do some testing.
Comment 3 Michael Reinelt 2010-01-22 07:52:26 UTC
Bad news - oom kicked in again with 2.6.32.4, again with VMware and heavy disk I/O.

I'm increasing the priority because this behaviour makes the system nearly unusable. There's a high risk that killing processes during heavy disk writes destroys data...

PLEASE could anybody have a look at my dmesg output (attached /var/log/messages) and help me isolate this bug!


TIA, Michael
Comment 4 David Rientjes 2010-01-23 00:46:46 UTC
This looks like the same symptoms as reported by Roman Jarosz on LKML (see my reply: http://marc.info/?l=linux-kernel&m=126420724222325).

You don't see this happening with 2.6.32.4 ARCH=i386?

Would it be possible to try commit f50de2d38 from Linus' tree (or just try 2.6.33-rc5)?
Comment 5 Michael Reinelt 2010-01-23 06:16:38 UTC
Thanks for the pointer David, you're right, that looks *very* similar...

I haven't seen this with either 2.6.32.2 i386 or any earlier i386 kernel. Unfortunately it's not that easy to switch back to i386.

Your analysis matches my logs, too:

DMA free:15892kB min:28kB low:32kB high:40kB ...
lowmem_reserve[]: 0 3247 4004 4004

15892kB - (4kB * 4004) = -124 kB (definitely < 28kB)

DMA32 free:9520kB min:6552kB low:8188kB high:9828kB ...
lowmem_reserve[]: 0 0 757 757

9520kB - (4kB * 757) = 6492kB (which is OK, I think)


If you can tell me how to fetch commit f50de2d38, I surely will give it a try; otherwise I'll go for 2.6.33-rc5.
Comment 6 Michael Reinelt 2010-01-23 06:22:16 UTC
Question: how can I monitor these lowmem_reserve values during normal operation?
Comment 7 Michael Reinelt 2010-01-23 09:17:19 UTC
Bad news again: same problem with 2.6.33-rc5:

kernel: Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0
kernel: Pid: 2467, comm: Xorg Not tainted 2.6.33-rc5 #1
kernel: Call Trace:
kernel:  [<ffffffff8108e282>] T.379+0x62/0x160
kernel:  [<ffffffff8108e3f6>] T.378+0x76/0x140
kernel:  [<ffffffff8108e5e0>] __out_of_memory+0x120/0x1a0
kernel:  [<ffffffff8108e788>] pagefault_out_of_memory+0x48/0x90
kernel:  [<ffffffff81024450>] mm_fault_error+0x40/0xc0
kernel:  [<ffffffff810247a8>] do_page_fault+0x2d8/0x2e0
kernel:  [<ffffffff8136ef8f>] page_fault+0x1f/0x30
kernel: Mem-Info:
kernel: DMA per-cpu:
kernel: CPU    0: hi:    0, btch:   1 usd:   0
kdm: localhost:0[2490]: pam_unix(kdm:session): session closed for user michi
kernel: CPU    1: hi:    0, btch:   1 usd:   0
kernel: DMA32 per-cpu:
kernel: CPU    0: hi:  186, btch:  31 usd: 162
kernel: CPU    1: hi:  186, btch:  31 usd: 101
kernel: Normal per-cpu:
kernel: CPU    0: hi:  186, btch:  31 usd: 181
kernel: CPU    1: hi:  186, btch:  31 usd: 153
kernel: active_anon:561965 inactive_anon:152447 isolated_anon:0
kernel:  active_file:78978 inactive_file:137085 isolated_file:32
kernel:  unevictable:0 dirty:56826 writeback:17076 unstable:0
kernel:  free:6649 slab_reclaimable:24716 slab_unreclaimable:8400
kernel:  mapped:51887 shmem:48449 pagetables:5573 bounce:0
kernel: DMA free:15884kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15264kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
kernel: lowmem_reserve[]: 0 3247 4004 4004
kernel: DMA32 free:9524kB min:6552kB low:8188kB high:9828kB active_anon:2068744kB inactive_anon:419740kB active_file:211724kB inactive_file:323960kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3324960kB mlocked:0kB dirty:126496kB writeback:41480kB mapped:132500kB shmem:134872kB slab_reclaimable:85532kB slab_unreclaimable:8108kB kernel_stack:712kB pagetables:11436kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:96 all_unreclaimable? no
kernel: lowmem_reserve[]: 0 0 757 757
kernel: Normal free:1188kB min:1528kB low:1908kB high:2292kB active_anon:179116kB inactive_anon:190048kB active_file:104188kB inactive_file:224380kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:775680kB mlocked:0kB dirty:100708kB writeback:26924kB mapped:75048kB shmem:58924kB slab_reclaimable:13332kB slab_unreclaimable:25492kB kernel_stack:1672kB pagetables:10856kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:96 all_unreclaimable? no
kernel: lowmem_reserve[]: 0 0 0 0
kernel: DMA: 1*4kB 1*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15884kB
kernel: DMA32: 837*4kB 0*8kB 3*16kB 3*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 9444kB
kernel: Normal: 213*4kB 7*8kB 10*16kB 2*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1196kB
kernel: 264728 total pagecache pages
kernel: 195 pages in swap cache
kernel: Swap cache stats: add 1023, delete 828, find 33/49
kernel: Free swap  = 3996716kB
kernel: Total swap = 4000176kB
kernel: 1048560 pages RAM
kernel: 55282 pages reserved
kernel: 823006 pages shared
kernel: 253325 pages non-shared
kernel: Out of memory: kill process 2589 (kdeinit4) score 262356 or a child
kernel: Killed process 2590 (klauncher) vsz:194160kB, anon-rss:3636kB, file-rss:5324kB

kernel: Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0
kernel: Pid: 2467, comm: Xorg Not tainted 2.6.33-rc5 #1
kernel: Call Trace:
kernel:  [<ffffffff8108e282>] T.379+0x62/0x160
kernel:  [<ffffffff8108e3f6>] T.378+0x76/0x140
kernel:  [<ffffffff8108e5e0>] __out_of_memory+0x120/0x1a0
kernel:  [<ffffffff8108e788>] pagefault_out_of_memory+0x48/0x90
kernel:  [<ffffffff81024450>] mm_fault_error+0x40/0xc0
kernel:  [<ffffffff810247a8>] do_page_fault+0x2d8/0x2e0
kernel:  [<ffffffff8136ef8f>] page_fault+0x1f/0x30
kernel: Mem-Info:
kernel: DMA per-cpu:
kernel: CPU    0: hi:    0, btch:   1 usd:   0
kernel: CPU    1: hi:    0, btch:   1 usd:   0
kernel: DMA32 per-cpu:
kernel: CPU    0: hi:  186, btch:  31 usd:  68
kernel: CPU    1: hi:  186, btch:  31 usd: 129
kernel: Normal per-cpu:
kernel: CPU    0: hi:  186, btch:  31 usd: 210
kernel: CPU    1: hi:  186, btch:  31 usd: 150
kernel: active_anon:563813 inactive_anon:152447 isolated_anon:0
kernel:  active_file:78978 inactive_file:135312 isolated_file:32
kernel:  unevictable:0 dirty:54706 writeback:19133 unstable:0
kernel:  free:6705 slab_reclaimable:24528 slab_unreclaimable:8468
kernel:  mapped:51899 shmem:50522 pagetables:5498 bounce:0
kernel: DMA free:15884kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15264kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
kernel: lowmem_reserve[]: 0 3247 4004 4004
kernel: DMA32 free:9492kB min:6552kB low:8188kB high:9828kB active_anon:2073036kB inactive_anon:419740kB active_file:211724kB inactive_file:320284kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3324960kB mlocked:0kB dirty:121316kB writeback:46808kB mapped:132648kB shmem:139164kB slab_reclaimable:84780kB slab_unreclaimable:8108kB kernel_stack:712kB pagetables:11436kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:96 all_unreclaimable? no
kernel: lowmem_reserve[]: 0 0 757 757
kernel: Normal free:1444kB min:1528kB low:1908kB high:2292kB active_anon:182216kB inactive_anon:190048kB active_file:104188kB inactive_file:220964kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:775680kB mlocked:0kB dirty:97508kB writeback:29724kB mapped:74948kB shmem:62924kB slab_reclaimable:13332kB slab_unreclaimable:25764kB kernel_stack:1672kB pagetables:10556kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:64 all_unreclaimable? no
kernel: lowmem_reserve[]: 0 0 0 0
kernel: DMA: 1*4kB 1*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15884kB
kernel: DMA32: 477*4kB 110*8kB 41*16kB 3*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 9492kB
kernel: Normal: 179*4kB 73*8kB 5*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1412kB
kernel: 265063 total pagecache pages
kernel: 195 pages in swap cache
kernel: Swap cache stats: add 1023, delete 828, find 33/49
kernel: Free swap  = 3996716kB
kernel: Total swap = 4000176kB
kernel: 1048560 pages RAM
kernel: 55282 pages reserved
kernel: 823245 pages shared
kernel: 251196 pages non-shared
kernel: Out of memory: kill process 2589 (kdeinit4) score 238085 or a child
kernel: Killed process 2600 (ksmserver) vsz:248300kB, anon-rss:4964kB, file-rss:9000kB


Again 4004 pages of lowmem_reserve.


Where does this value come from? How can I read it during "normal operation"? Maybe I can trace when this value increases....
Comment 8 Michael Reinelt 2010-01-24 07:53:27 UTC
Well, this is strange: I just found out that I can print such a memory report by pressing Shift-ScrollLock. I rebooted the machine into runlevel 1, and got:

DMA free:15884kB min:28kB low:32kB high:40kB
lowmem_reserve[]: 0 3247 4004 4004

And these values do not change when I enter multiuser mode and X

According to David Rientjes' calculations, I'd be low on memory from the beginning:

4004 pages * 4 kB = 16016 kB reserved, but only 15884 kB free....
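
In case it's useful to others: the same Mem-Info report can apparently also be triggered from a shell (assuming CONFIG_MAGIC_SYSRQ is enabled and sysrq hasn't been disabled via /proc/sys/kernel/sysrq), and the per-zone reserves show up continuously as the "protection:" lines in /proc/zoneinfo, so they can be watched without the console magic key:

# echo m > /proc/sysrq-trigger      (dumps the memory report into dmesg)
# grep -E 'Node|protection' /proc/zoneinfo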
Comment 9 David Rientjes 2010-01-24 23:32:13 UTC
You can manipulate the lowmem_reserve number of pages by tuning /proc/sys/vm/lowmem_reserve_ratio, which is given as a fraction of the total system RAM.  See Documentation/sysctl/vm.txt for more information.

I don't recommend that as a solution, however, because a load requiring heavy disk I/O will still deplete ZONE_DMA (or ZONE_DMA32 on x86_64) later regardless of any reservation you make and the same problem will repeat.  It will simply eat the lowmem and then require an oom kill as the result of the page fault.

The default values are calculated at boot depending on the size of system RAM and won't change unless done so from userspace via the sysctl.
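
For reference, the current ratios can be read and (temporarily) changed like this; the reserve for a lower zone is roughly the size of the higher zones divided by the ratio, so a larger ratio means a smaller reserve. On this x86_64 box three values should be shown (DMA, DMA32, Normal); the "1024" below is only an illustrative value, not a recommendation:

# cat /proc/sys/vm/lowmem_reserve_ratio
# echo "1024 256 32" > /proc/sys/vm/lowmem_reserve_ratio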

Would you be able to post your e820 memory map (shown early in your /var/log/messages when you boot) so we can see how much of the low 16MB (ZONE_DMA) is usable and not reserved?

Since the oom killer has only been called for such pagefault oom conditions since 2.6.29, would it be possible to try that kernel (and 2.6.30, 2.6.31, 2.6.32) to see whether this is indeed a recent regression or not?
Comment 10 David Rientjes 2010-01-24 23:39:53 UTC
This looks similar to the problem reported in http://marc.info/?l=linux-kernel&m=126169821317492, which was not reproducible with 2.6.31, so you may want to try that kernel first.  Since you reported that 2.6.32.4 shows the same problem, this will help us bisect the problem to between those two kernel releases.

It would still be helpful to see your e820 memory map as well, to determine whether the lowmem_reserve ratio is sane given the amount of ZONE_DMA (low 16MB) memory.
Comment 11 KOSAKI Motohiro 2010-01-25 02:44:45 UTC
Hi

1. Please attach your .config and full dmesg
2. If your .config has any of the following options enabled, please try disabling them.
   It might help debugging.
   - CONFIG_DRM_I915
   - CONFIG_RELAY
   - CONFIG_MSPEC
   - CONFIG_VIDEOBUF_DMA_SG
Comment 12 Michael Reinelt 2010-01-25 05:55:23 UTC
Hi there,

now I'm running 2.6.31.12 for about an hour, creating heavy disk I/O in the background (defragmenting a vmware virtual disk, which triggered the oom after a few minutes with 2.6.32 and .33), and everything's fine so far! I'll stay with this kernel version for the next few days, just to be sure.

my e820 memory map (created by 2.6.33-rc5) is here:

kernel: BIOS-provided physical RAM map:
kernel:  BIOS-e820: 0000000000000000 - 000000000009e000 (usable)
kernel:  BIOS-e820: 000000000009e000 - 00000000000a0000 (reserved)
kernel:  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
kernel:  BIOS-e820: 0000000000100000 - 00000000cf6d0000 (usable)
kernel:  BIOS-e820: 00000000cf6d0000 - 00000000cf6df000 (ACPI data)
kernel:  BIOS-e820: 00000000cf6df000 - 00000000cf6e3000 (ACPI NVS)
kernel:  BIOS-e820: 00000000cf6e3000 - 00000000d0000000 (reserved)
kernel:  BIOS-e820: 00000000f8000000 - 00000000fc000000 (reserved)
kernel:  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
kernel:  BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
kernel:  BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
kernel:  BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
kernel:  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
kernel:  BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
kernel:  BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
kernel: NX (Execute Disable) protection: active
kernel: DMI 2.4 present.
kernel: Phoenix BIOS detected: BIOS may corrupt low RAM, working around it.


.config from 2.6.33-rc5 is attached.

I have CONFIG_DRM_I915 set to "y", but none of the other options (CONFIG_RELAY, CONFIG_MSPEC, VIDEOBUF_DMA_SG). 

I will give 2.6.33-rc5 without DRM_i915 a try, and provide a clean but full dmesg then.
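
For anyone who wants to check their own build, something like this should show the relevant options (names as listed in comment #11; the last one appears as CONFIG_VIDEOBUF_DMA_SG in the config file, if I'm not mistaken):

# grep -E 'CONFIG_DRM_I915|CONFIG_RELAY|CONFIG_MSPEC|CONFIG_VIDEOBUF_DMA_SG' .config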
Comment 13 Michael Reinelt 2010-01-25 05:58:29 UTC
Created attachment 24700 [details]
.config from 2.6.33-rc5
Comment 14 David Rientjes 2010-01-25 10:49:55 UTC
You have about ~15.6MB of usable memory in ZONE_DMA as revealed by your e820 memory map; it's sane for the lowmem_reserve to report this:

DMA free:15884kB min:28kB low:32kB high:40kB
lowmem_reserve[]: 0 3247 4004 4004

because it is reserving 4004 pages (all usable ZONE_DMA memory) from page allocations that can be satisfied from ZONE_NORMAL, i.e. allocations that are not GFP_DMA.  Thus, relaxing the reserve via /proc/sys/vm/lowmem_reserve_ratio will hold off the problem temporarily but will not be a long-term fix.

We'll have to look at differences between 2.6.31.12 and 2.6.32.4 that are causing this problem, the same kernels Roman Jarosz reported on LKML as being bad.  Until we have a better idea, it might help to see if 2.6.32-rc1 fails or whether you can bisect the problem if it's easily reproducible.

Thanks!
Comment 15 Michael Reinelt 2010-01-25 12:10:52 UTC
David, thanks for your explanations!

Of course I'm willing to test 2.6.32-rc1, I think I can reproduce the problem quite easily.
Comment 16 Ernst Herzberg 2010-01-25 12:41:31 UTC
same problem here during a parallel restore of a big postgresql database.

Workaround that works for me:

set /sys/block/sd[ab]/queue/iosched/low_latency to 0
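
i.e. (this only exists for the cfq scheduler; the device names below are just an example, adjust them to your disks):

# echo 0 > /sys/block/sda/queue/iosched/low_latency
# echo 0 > /sys/block/sdb/queue/iosched/low_latency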
Comment 17 Michael Reinelt 2010-01-25 14:27:41 UTC
Now that sounds interesting, Ernst!

I just tested 2.6.32-rc1 (or -rc2, I downloaded rc1 from www.kernel.org and compiled it, but this kernel calls itself rc2), and I did not encounter the problem. But this version does not contain the low_latency stuff....

2.6.31.12 has been running here under heavy load for some hours, without any problems.

Can someone tell me with which kernel version this low_latency stuff was introduced? I will try this kernel version...
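
If somebody has Linus' tree handy, I guess something like this would show when it came in (just my guess at where to look):

# git log --oneline v2.6.31..v2.6.32 -- block/cfq-iosched.c | grep -i latency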
Comment 18 Michael Reinelt 2010-01-25 15:26:24 UTC
Again, bad news: I just had an OOM with 2.6.32.4 and /sys/block/sda/queue/iosched/low_latency set to 0
Comment 19 Michael Reinelt 2010-01-25 18:04:34 UTC
Even worse news: OOM with 2.6.32-rc2

OTOH, maybe this is good news: as rc2 does not contain the low_latency stuff, the bug is likely to be found in another area.

I found a quite reliable way to reproduce the bug: start a VMware with 2GB virtual memory, and defragment/shrink a vmdk file from another VM using vmware-vdiskmanager

So I will test 2.6.31.12 again using this scenario, just to make sure...


Did I tell you that this is driving me crazy? At least a bit?
Comment 20 David Rientjes 2010-01-25 21:45:36 UTC
Ok, so the regression seems to be between 2.6.31.12 and 2.6.32-rc1 (it was called -rc2 in `uname' because of a mistake in editing the Makefile during the release cycle).  I'm inclined to believe this is a problem related to the I/O scheduler, since Ernst reported that disabling low_latency held off the memory allocations enough that ZONE_NORMAL would not fall below its low watermark before reclaim could keep up with the disk.

Are you using cfq as your default scheduler (CONFIG_DEFAULT_CFQ is set)?  Would it be possible to try deadline or AS to see if it makes any difference on 2.6.32-rc1 (see Documentation/block/switching-sched.txt)?
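
Roughly (device name is only an example, adjust to your disks; the currently active scheduler is shown in brackets by the first command):

# cat /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sda/queue/scheduler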
Comment 21 KOSAKI Motohiro 2010-01-27 01:41:20 UTC
The suspected bad commit is this one.

----------------------------------------------------
commit 07f73f6912667621276b002e33844ef283d98203
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Sep 14 16:50:30 2009 +0100

    drm/i915: Improve behaviour under memory pressure

    Due to the necessity of having to take the struct_mutex, the i915
    shrinker can not free the inactive lists if we fail to allocate memory
    whilst processing a batch buffer, triggering an OOM and an ENOMEM that
    is reported back to userspace. In order to fare better under such
    circumstances we need to manually retry a failed allocation after
    evicting inactive buffers.

    To do so involves 3 steps:
    1. Marking the backing shm pages as NORETRY.
    2. Updating the get_pages() callers to evict something on failure and then
       retry.
    3. Revamping the evict something logic to be smarter about the required
       buffer size and prefer to use volatile or clean inactive pages.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
----------------------------------------------------


Current analysis:
  - The OOM is invoked via VM_FAULT_OOM in the page fault path.
  - GEM uses shmem heavily internally, and i915 uses GEM.
  - The VM_FAULT_OOM is produced by shmem.
  - shmem allocates memory using mapping_gfp_mask(inode->i_mapping),
    and if that allocation fails it can return -ENOMEM; -ENOMEM generates VM_FAULT_OOM.
  - But GEM has the following code.


drm_gem.c drm_gem_object_alloc()
--------------------
        obj->filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);
(snip)
        /* Basically we want to disable the OOM killer and handle ENOMEM
         * ourselves by sacrificing pages from cached buffers.
         * XXX shmem_file_[gs]et_gfp_mask()
         */
        mapping_set_gfp_mask(obj->filp->f_path.dentry->d_inode->i_mapping,
                             GFP_HIGHUSER |
                             __GFP_COLD |
                             __GFP_FS |
                             __GFP_RECLAIMABLE |
                             __GFP_NORETRY |
                             __GFP_NOWARN |
                             __GFP_NOMEMALLOC);


This comment is not correct: __GFP_NORETRY causes shmem, not GEM itself, to return -ENOMEM,
and GEM can neither handle nor recover from that.
Comment 22 Michael Reinelt 2010-01-27 05:32:54 UTC
I can confirm that my OOMs have gone away with 2.6.32.4 plus Motohiro's partial revert patch posted on LKML.
Comment 23 James Stevens 2010-03-30 20:33:24 UTC
Hi Andrew,

I see this bug was assigned to you. It sounds very similar to our experience.

We're running a number of qemu/kvm VMs (about 20) on a server that started as 2.6.32.2 but was upgraded to 2.6.32.9 because under 2.6.32.2 we were getting lots of OOM kills.

The same 20 VMs used to be running on a lower power server that ran 2.6.28.4 and we never had OOM kills on that.

With 2.6.32.9 it's not nearly so bad; it had been running for a few days before it started killing again. Heavy IO also seems to be a possible trigger - although that could just be because the cache used up all real RAM.

The server has 48Gb RAM but only had 4Gb swap (of which it was using nearly zero), and I thought that might be the issue (too little swap), so I rebooted the server and started creating 10Gb swap files using "dd". While creating the third one, the OOM killer killed pretty much all processes and I lost contact with the server, including console login, so we had to hit reset.

Here's a snip from the syslog at that time :-

Mar 30 12:55:19 th02 kernel: 6368827 total pagecache pages
Mar 30 12:55:19 th02 kernel: 0 pages in swap cache
Mar 30 12:55:19 th02 kernel: Swap cache stats: add 0, delete 0, find 0/0
Mar 30 12:55:19 th02 kernel: Free swap  = 24383768kB
Mar 30 12:55:19 th02 kernel: Total swap = 24383768kB
Mar 30 12:55:20 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 1074 (udevd) score 0 or a child

Hard for a non-kernel man to understand why it thought there wasn't enough memory??

The VMs are now all back up with 24Gb swap. I'm sure it will be fine at least until the cache starts filling the memory again - current cache is 9Gb so I have 31Gb actually free - the VMs are using 8.8Gb.

             total       used       free  shared   buffers    cached
Mem:      49488636   18045788   31442848       0    104256   9107080
-/+ buffers/cache:    8834452   40654184
Swap:     24383768          0   24383768


I've not tried disabling the "low_latency" option, but will try that tomorrow. I can't disable "CONFIG_RELAY"; "make menuconfig" won't let me.

I did notice I had the "Preemption Model" incorrectly set to "Desktop" and the kernel was optimised for speed not size.



James
Comment 24 James Stevens 2010-03-30 20:37:19 UTC
Created attachment 25767 [details]
/var/log/syslog
Comment 25 James Stevens 2010-03-30 20:38:10 UTC
Created attachment 25768 [details]
kernel config when running 2.6.32.2
Comment 26 Andrew Morton 2010-03-30 20:39:29 UTC
OK, thanks.  I'll mark this bug as a regression.

Nothing much seems to have happened for a couple of months.  This looks bad :(
Comment 27 James Stevens 2010-03-30 20:40:22 UTC
Created attachment 25769 [details]
Current kernel config (2.6.32.9)

This config is not *exactly* the same as the config that was running when we had the OOM killer problem, but it is very similar. 

However, the config I uploaded from the 2.6.32.2 is probably closer to what we were actually running in the 2.6.32.9 kernel when we hit the OOM killer issue.
Comment 28 James Stevens 2010-03-31 08:54:58 UTC
Just to clarify, save you reading my syslog and emphasise my point: this is the OOM-killer's killing spree referred to in #24:

$ grep oom_kill_allocating_task /var/log/syslog
...
Mar 30 12:54:21 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2557 (qemu-system-x86) score 0 or a child
Mar 30 12:54:21 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2485 (qemu-system-x86) score 0 or a child
Mar 30 12:54:21 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2444 (qemu-system-x86) score 0 or a child
Mar 30 12:55:01 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2674 (qemu-system-x86) score 0 or a child
Mar 30 12:55:01 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2668 (qemu-system-x86) score 0 or a child
Mar 30 12:55:02 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2790 (qemu-system-x86) score 0 or a child
Mar 30 12:55:09 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2792 (qemu-system-x86) score 0 or a child
Mar 30 12:55:09 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2665 (qemu-system-x86) score 0 or a child
Mar 30 12:55:09 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2888 (ps) score 0 or a child
Mar 30 12:55:15 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2913 (qemu-system-x86) score 0 or a child
Mar 30 12:55:15 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2895 (qemu-system-x86) score 0 or a child
Mar 30 12:55:15 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2784 (qemu-system-x86) score 0 or a child
Mar 30 12:55:16 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2969 (qemu-system-x86) score 0 or a child
Mar 30 12:55:16 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 3035 (bash) score 0 or a child
Mar 30 12:55:18 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2685 (sshd) score 0 or a child
Mar 30 12:55:18 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2685 (sshd) score 0 or a child
Mar 30 12:55:19 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 2921 (qemu-system-x86) score 0 or a child
Mar 30 12:55:20 th02 kernel: Out of memory (oom_kill_allocating_task): kill process 1074 (udevd) score 0 or a child


On a system with 48Gb of RAM and 24Gb of swap - bearing in mind each qemu is limited to 2Gb (very few are allowed that much - all 20 only add up to 8.8Gb) and I had just rebooted, so the cache would have been pretty empty - running out of memory was the last problem I was expecting.

One of the VMs was probably re-building a MySQL database from a CSV file at the time - this is very disk intensive - as well as me doing the "dd" to create the third 10Gb swap file. All the other VMs were probably relatively idle - obviously some are busier than others, but (for example) one acts as a recursive resolver for the others and nothing else - so pretty much idle, especially for disk I/O.


My colleague has also seen the OOM killer issue on his media centre PC with .32, so he regressed to .31.
Comment 29 James Stevens 2010-03-31 10:28:13 UTC
I have a much less powerful server that also runs VMs - 2Gb RAM and 1 SATA disk - and running either 2.6.32.7 or 2.6.32.9 I have been unable to reproduce this problem.

I ran a script that unpacked a load of big ZIP files on the VM (to generate a lot of reads & writes on the VM) and then ran the "dd" on the host - the result, with both kernel versions, was that the VM pretty much suspended until the "dd" had finished.

Setting or clearing /sys/block/sda/queue/iosched/low_latency made no difference to the "vmstat" or the fact the VM suspended. 

On that machine "Preemption Model" is also set to "Desktop".

"vmstat" reports the cache goes up to 1.3Gb and pretty much stays there. "free" hovers between 50 & 55 Mb.

FYI: The 48Gb host has a "LSI Logic / Symbios Logic MegaRAID SAS 1078" - six disks in a RAID-6, typically doing about 2,500 block writes per sec and virtually no disk reads! "vmstat" reports almost no "wio" %age. It's a fast, responsive server; the only issue is these darn OOM kills!
Comment 30 James Stevens 2010-04-01 08:48:11 UTC
I have no idea what I'm talking about here, but when I changed "Optimise for ..." back from "speed" to "size" I was surprised the kernel reduced in size by so much from 4.2Mb to 3.1Mb - is it possible this is related?

Generally, my attitude with open source is to try and use the options that everybody else uses as these will have had the most testing and hence be the most reliable. But I thought I was being a smart alec optimising for speed.

I've re-compiled the 2.6.32.9 with the options in the config I uploaded and run "echo 0 > /sys/block/sda/queue/iosched/low_latency" - so I now have "Optimise for size" and "Preemption Model" set to "Server".
Comment 31 James Stevens 2010-05-06 17:27:24 UTC
Having regressed to 2.6.31.12 about a month ago, we had not seen this OOM-killer issue since then - but yesterday and today it happened again, big time:

# grep oom_kill_allocating_task /var/log/syslog | wc -l
57

Of its 48Gb, the server had over 23Gb of RAM free and about the same of free swap, and still the OOM-killer started killing off programs:

May  6 17:33:08 (none) kernel: Free swap  = 24037784kB
May  6 17:33:08 (none) kernel: Total swap = 24383768kB
May  6 17:33:08 (none) kernel: Out of memory (oom_kill_allocating_task): kill process 3591 (qemu-system-x86) score 0 or a child

Note - this server (pretty much) only runs KVM guests.

I think the killing was triggered when I created 8Gb of data on one of the VMs.

Suddenly creating a large amount of data on disk seems to be the trigger for this issue. 

I have been unable to re-create this issue on our test machine.

Also, I have "oom_kill_allocating_task" enabled, but it seems to favour killing the more idle tasks.
Comment 32 James Stevens 2010-05-06 18:23:14 UTC
will running with "vm.overcommit_memory=2" guarantee that this REALLY DUMB oom-killer doesn't go about killing large numbers of my programs even when there is LOADS of free memory (both real & swap)???
Comment 33 David Rientjes 2010-05-06 20:21:15 UTC
Be aware that enabling /proc/sys/vm/oom_kill_allocating_task as you have done will automatically kill the task trying to allocate memory when the oom killer is called, and not necessarily the ideal task.  When you see a "score 0" in the kill message (as it is for all of those that you quoted), that's the badness score of the task: zero means it hasn't allocated any memory, so killing it isn't going to free any memory; that's why you see a killing spree of tasks that were just forked.
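
For what it's worth, you can also see the badness score the oom killer would currently assign to any given task, and its oom_adj, directly from /proc (<pid> is just a placeholder for whatever task you're interested in):

# cat /proc/<pid>/oom_score
# cat /proc/<pid>/oom_adj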
Comment 34 James Stevens 2010-05-06 22:00:01 UTC
Thank you for your comment (referring to #28), it's nice to see someone is reading this stuff - I was beginning to feel I was the only one ... but oom-killer didn't just kill processes that had "just forked", it killed pretty much everything running on the system (including syslogd, klogd, udevd, sshd and so forth) - I think the "bash" I was actually logged in to at the time was the last process standing - as you can see, even my "ps" was killed, which wasn't going to free much memory, either compared to the 48GB RAM the system has or compared to the memory the VMs had each been using.

Please could you clarify under what circumstances "the ideal task" means every program on the system?

If a process had only just forked, it is because it was run under "init" and the OOM-killer had only just killed its predecessor, so init re-forked it. But mostly these were VM clients that had all been recently re-started after a reboot.

"udevd" would have been started at boot time and not been re-started since.

The system had LOADS of free RAM and LOADS of free swap. I have exactly the same VMs running on the same box now, with the same memory allocation, and I'm getting "Committed_AS: 12696604 kB" - as I have 48Gb of RAM and 33.8Gb of SWAP, I have now set "vm.overcommit_memory=2" in the vain hope this will stop the BUGGY oom-killer from decimating my system.


BTW: I had the same set of VMs, plus a few more, running on a 2.6.28.4 system with only 32Gb RAM for months with no trouble. No doubt, over time, they have picked up a little more work, but on the whole the load is pretty similar. Sadly 2.6.28.4 can't support the firmware in the new RAID controller.

I was hoping that setting "oom_kill_allocating_task" might simply change the behaviour such that it wasn't quite so destructive. "qemu" pretty much allocates all its memory at start time, so if it had a bug that caused it to run away allocating, chances are that would be the program that was killed - well that was my theory.


The key point here is that my system HAD NOT run out of either physical or virtual memory - in fact, you can see from the oom entries in my syslog that it hadn't used any of its 24GB of SWAP - so your comments about what should happen under normal circumstances have less validity, as this is clearly triggering some kind of unplanned scenario - aka a bug.


What seems to happen is that cached disk blocks seem to be *so* fresh that the kernel decides they can't possibly be thrown out, so it looks to regain memory some other way. That is, when I "dd" a 10Gb file, that takes up 10Gb of cache, and if I "dd" another 10Gb file it takes another 10Gb of cache - do this often enough and you fill the cache (i.e. the cache uses all remaining available RAM). If you then "dd" another 10Gb when the cache is full, in theory the least recently used blocks should get thrown out, but it somehow seems to lock blocks into the cache if they are less than a certain age (maybe they are dirty blocks waiting to get flushed?). With the blocks locked in cache it appears to decide RAM is all used up and goes on this killing spree.

Whatever the issue, it is related to a lot of disk writes (i/o) - seems this is also similar to Michael Reinelt's experience.

My 2p/2c worth - and I know nothing about the kernel code.
Comment 35 David Rientjes 2010-05-06 22:17:42 UTC
"Ideal task" never means everything on the system :)  I recently rewrote the entire oom killer because its selection was so bad, it's pending in the -mm tree and awaiting inclusion hopefully in 2.6.35.  It really sounds like there's some OOM_DISABLE task that is hogging memory and can't be killed, but we'll need more information to figure that out.  Would it be possible to do echo 0 > /proc/sys/vm/oom_kill_allocating_task so it can make a "smart" decision and do echo 1 > /proc/sys/vm/oom_dump_tasks so we can see the memory usage of each application during an oom kill?  This will help us figure out why the oom killer is being called, what application is to blame, and why we're not killing the right thing.
Comment 36 James Stevens 2010-05-06 23:01:38 UTC
I'm sorry you don't believe me, but it did mean pretty much everything. Nobody likes their baby being called ugly, but the oom-killer is.

I was on the system at the time - once init stopped re-spawning the VMs (that the oom-killer had killed) and everything calmed down there was pretty much nothing left - just my bash & init plus nearly 48Gb of free RAM and 24GB of free SWAP - the syslog I uploaded confirms the free swap - not sure if you can see the free RAM from the log.

Mar 30 12:55:20 th02 kernel: Free swap  = 24383768kB
Mar 30 12:55:20 th02 kernel: Total swap = 24383768kB

Why is the oom-killer being invoked AT ALL if there is this much free swap?

My problem is not with the oom-killer's selection, but the fact it is running at all. Currently, "free" says 10Gb is used, /proc/meminfo says 12.6Gb is committed, so the rest of memory usage is cache & buffers - all of which should be re-claimable in preference to killing my programs, I'd have thought.

"sounds like there's some OOM_DISABLE task that is hogging memory" - what??? you HAVE to be kidding - one process hogging 72Gb of memory?? Get real. This is a 32bit system anyway - no one process could hog that much, surely?

I've run those two "echo"s, but to be honest, my aim is to have *NO* processes killed by OOM, so I now have even more swap (~34Gb) and have set "vm.overcommit_memory=2".

I can show you a "ps" of what is running now - there isn't anything running now that wasn't running just before the killing spree, except me doing a "dd" three times to create the additional swap files - this is what appears to have triggered the oom-killer.

# dd if=/dev/zero of=swap1 bs=1024K count=10000

I ran that once (followed by mkswap & swapon) then a second time, and that worked fine, but when I ran it a third time the killing spree started.

I already had 4GB of swap in a swap partition - added to the 2 x 10Gb that gives the 24Gb reported. I wasn't planning on needing any swap, hence the relatively small 4Gb (on a 1.4Tb disk RAID) - but having already suffered a load of unwarranted oom-killings, I was creating a load of swap in the hope it wouldn't happen again - ironically this is what seems to have triggered it.


The only process running right now that does not have an "oom_adj" of zero is "udevd", which has the magic value -17 that (I read) is supposed to protect it - and that was the last program the oom-killer killed in the killing spree (see my syslog, or comment #28).


Will setting "vm.overcommit_memory=2" stop the oom-killer from ever starting?
Comment 37 James Stevens 2010-05-06 23:09:56 UTC
from #23 "While creating the third OOM killer killed pretty much all processes and I lost contact with the server, including console login, so we had to hit reset."
Comment 38 James Stevens 2010-05-06 23:12:14 UTC
I had forgotten it was that bad - we have a remote management card in that machine so I used that to first try the console, then to hit the reset.
Comment 39 James Stevens 2010-05-08 13:58:21 UTC
*@@*?**!* oom-killer just went killing my VMs again - once more still LOADS of free SWAP available 

May  8 12:44:11 (none) kernel: Free swap  = 34623752kB
May  8 12:44:11 (none) kernel: Total swap = 34623752kB

and still only 12.8Gb of Committed_AS (of the 48GB of RAM I have) and "vm.overcommit_memory=2" (which I thought disabled the oom-killer).

# sysctl vm.overcommit_memory
vm.overcommit_memory = 2

VMs were using about 10Gb of RAM (according to free), but a lot of the rest of RAM was used by cache - mostly because I created 2 x 5Gb swap files yesterday - so that used 10Gb of cache which seemed to remain used.

The VMs that were killed were mostly the ones that were working - i.e. the REALLY important ones - the idle ones were left alone.

The two echoes you requested had been in place - syslog & messages uploaded.

Note :-

May  8 12:40:55 (none) init: Id "Z18" respawning too fast: disabled for 5 minutes

"Z18" is VM No.18 - oom-killer kept killing it *so* much init gave up!!!

It seems to be that when cache is full, instead of throwing blocks out to get more cache, it starts killing programs.

Here's a question: if I use "dd" to create a file, its blocks will be in cache; if I then run "mkswap" and "swapon" quickly afterwards, will that cause those blocks to get locked into cache?


May  8 02:38:25 (none) kernel: Mem-Info:
May  8 02:38:25 (none) kernel: 13107200 pages RAM
May  8 02:38:25 (none) kernel: 12879362 pages HighMem
May  8 02:38:25 (none) kernel: 630820 pages reserved
May  8 02:38:25 (none) kernel: 6095368 pages shared
May  8 02:38:25 (none) kernel: 6007705 pages non-shared
May  8 02:38:25 (none) kernel: [ pid ]   uid  tgid total_vm ...

This seems to report only 38.6Gb - what happened to my 48Gb? Or was the rest of the memory free? And if there was 10Gb of memory free, what the heck is the oom-killer doing waking up?
Comment 40 James Stevens 2010-05-08 14:02:46 UTC
(duplicate of comment #39)
Comment 41 James Stevens 2010-05-08 14:10:38 UTC
Created attachment 26280 [details]
/var/log/messages (zipped)
Comment 42 James Stevens 2010-05-08 14:15:15 UTC
Created attachment 26281 [details]
/var/log/syslog (zipped)
Comment 43 James Stevens 2010-05-08 14:21:36 UTC
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0    776   308 ?        Ss   14:26   0:00 init [4]
root         2  0.0  0.0      0     0 ?        S<   14:26   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S<   14:26   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S<   14:26   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   14:26   0:00 [migration/1]
root         6  0.0  0.0      0     0 ?        S<   14:26   0:00 [ksoftirqd/1]
root         7  0.0  0.0      0     0 ?        S<   14:26   0:00 [migration/2]
root         8  0.0  0.0      0     0 ?        S<   14:26   0:00 [ksoftirqd/2]
root         9  0.0  0.0      0     0 ?        S<   14:26   0:00 [migration/3]
root        10  0.0  0.0      0     0 ?        S<   14:26   0:00 [ksoftirqd/3]
root        11  0.0  0.0      0     0 ?        S<   14:26   0:00 [migration/4]
root        12  0.0  0.0      0     0 ?        S<   14:26   0:00 [ksoftirqd/4]
root        13  0.0  0.0      0     0 ?        S<   14:26   0:00 [migration/5]
root        14  0.0  0.0      0     0 ?        S<   14:26   0:00 [ksoftirqd/5]
root        15  0.0  0.0      0     0 ?        S<   14:26   0:00 [migration/6]
root        16  0.0  0.0      0     0 ?        S<   14:26   0:00 [ksoftirqd/6]
root        17  0.0  0.0      0     0 ?        S<   14:26   0:00 [migration/7]
root        18  0.0  0.0      0     0 ?        S<   14:26   0:00 [ksoftirqd/7]
root        19  0.0  0.0      0     0 ?        S<   14:26   0:00 [events/0]
root        20  0.0  0.0      0     0 ?        S<   14:26   0:00 [events/1]
root        21  0.0  0.0      0     0 ?        S<   14:26   0:00 [events/2]
root        22  0.0  0.0      0     0 ?        S<   14:26   0:00 [events/3]
root        23  0.0  0.0      0     0 ?        S<   14:26   0:00 [events/4]
root        24  0.0  0.0      0     0 ?        S<   14:26   0:00 [events/5]
root        25  0.0  0.0      0     0 ?        S<   14:26   0:00 [events/6]
root        26  0.0  0.0      0     0 ?        S<   14:26   0:00 [events/7]
root        27  0.0  0.0      0     0 ?        S<   14:26   0:00 [khelper]
root        32  0.0  0.0      0     0 ?        S<   14:26   0:00 [async/mgr]
root       180  0.0  0.0      0     0 ?        S<   14:26   0:00 [kblockd/0]
root       181  0.0  0.0      0     0 ?        S<   14:26   0:00 [kblockd/1]
root       182  0.0  0.0      0     0 ?        S<   14:26   0:00 [kblockd/2]
root       183  0.0  0.0      0     0 ?        S<   14:26   0:00 [kblockd/3]
root       184  0.0  0.0      0     0 ?        S<   14:26   0:00 [kblockd/4]
root       185  0.0  0.0      0     0 ?        S<   14:26   0:00 [kblockd/5]
root       186  0.0  0.0      0     0 ?        S<   14:26   0:00 [kblockd/6]
root       187  0.0  0.0      0     0 ?        S<   14:26   0:00 [kblockd/7]
root       188  0.0  0.0      0     0 ?        S<   14:26   0:00 [kacpid]
root       189  0.0  0.0      0     0 ?        S<   14:26   0:00 [kacpi_notify]
root       190  0.0  0.0      0     0 ?        S<   14:26   0:00 [kacpi_hotplug]
root       296  0.0  0.0      0     0 ?        S<   14:26   0:00 [ata/0]
root       297  0.0  0.0      0     0 ?        S<   14:26   0:00 [ata/1]
root       298  0.0  0.0      0     0 ?        S<   14:26   0:00 [ata/2]
root       299  0.0  0.0      0     0 ?        S<   14:26   0:00 [ata/3]
root       300  0.0  0.0      0     0 ?        S<   14:26   0:00 [ata/4]
root       301  0.0  0.0      0     0 ?        S<   14:26   0:00 [ata/5]
root       302  0.0  0.0      0     0 ?        S<   14:26   0:00 [ata/6]
root       303  0.0  0.0      0     0 ?        S<   14:26   0:00 [ata/7]
root       304  0.0  0.0      0     0 ?        S<   14:26   0:00 [ata_aux]
root       308  0.0  0.0      0     0 ?        S<   14:26   0:00 [ksuspend_usbd]
root       312  0.0  0.0      0     0 ?        S<   14:26   0:00 [khubd]
root       315  0.0  0.0      0     0 ?        S<   14:26   0:00 [kseriod]
root       435  0.0  0.0      0     0 ?        S    14:26   0:00 [pdflush]
root       436  0.0  0.0      0     0 ?        S    14:26   0:00 [pdflush]
root       437  0.0  0.0      0     0 ?        S<   14:26   0:00 [kswapd0]
root       484  0.0  0.0      0     0 ?        S<   14:26   0:00 [aio/0]
root       485  0.0  0.0      0     0 ?        S<   14:26   0:00 [aio/1]
root       486  0.0  0.0      0     0 ?        S<   14:26   0:00 [aio/2]
root       487  0.0  0.0      0     0 ?        S<   14:26   0:00 [aio/3]
root       488  0.0  0.0      0     0 ?        S<   14:26   0:00 [aio/4]
root       489  0.0  0.0      0     0 ?        S<   14:26   0:00 [aio/5]
root       490  0.0  0.0      0     0 ?        S<   14:26   0:00 [aio/6]
root       491  0.0  0.0      0     0 ?        S<   14:26   0:00 [aio/7]
root       499  0.0  0.0      0     0 ?        S<   14:26   0:00 [crypto/0]
root       500  0.0  0.0      0     0 ?        S<   14:26   0:00 [crypto/1]
root       501  0.0  0.0      0     0 ?        S<   14:26   0:00 [crypto/2]
root       502  0.0  0.0      0     0 ?        S<   14:26   0:00 [crypto/3]
root       503  0.0  0.0      0     0 ?        S<   14:26   0:00 [crypto/4]
root       504  0.0  0.0      0     0 ?        S<   14:26   0:00 [crypto/5]
root       505  0.0  0.0      0     0 ?        S<   14:26   0:00 [crypto/6]
root       506  0.0  0.0      0     0 ?        S<   14:26   0:00 [crypto/7]
root       714  0.0  0.0      0     0 ?        S<   14:27   0:00 [scsi_eh_0]
root       740  0.0  0.0      0     0 ?        S<   14:27   0:00 [scsi_eh_1]
root       743  0.0  0.0      0     0 ?        S<   14:27   0:00 [scsi_eh_2]
root       797  0.0  0.0      0     0 ?        S<   14:27   0:00 [kpsmoused]
root       802  0.0  0.0      0     0 ?        S<   14:27   0:00 [kondemand/0]
root       803  0.0  0.0      0     0 ?        S<   14:27   0:00 [kondemand/1]
root       804  0.0  0.0      0     0 ?        S<   14:27   0:00 [kondemand/2]
root       805  0.0  0.0      0     0 ?        S<   14:27   0:00 [kondemand/3]
root       806  0.0  0.0      0     0 ?        S<   14:27   0:00 [kondemand/4]
root       807  0.0  0.0      0     0 ?        S<   14:27   0:00 [kondemand/5]
root       808  0.0  0.0      0     0 ?        S<   14:27   0:00 [kondemand/6]
root       809  0.0  0.0      0     0 ?        S<   14:27   0:00 [kondemand/7]
root       909  0.0  0.0      0     0 ?        S<   14:27   0:00 [usbhid_resumer]
root       965  0.0  0.0   2108   644 ?        S<s  14:27   0:00 /sbin/udevd --daemon
root      1567  0.0  0.0      0     0 ?        S<   14:27   0:00 [reiserfs/0]
root      1568  0.0  0.0      0     0 ?        S<   14:27   0:00 [reiserfs/1]
root      1569  0.0  0.0      0     0 ?        S<   14:27   0:00 [reiserfs/2]
root      1570  0.0  0.0      0     0 ?        S<   14:27   0:00 [reiserfs/3]
root      1571  0.0  0.0      0     0 ?        S<   14:27   0:00 [reiserfs/4]
root      1572  0.0  0.0      0     0 ?        S<   14:27   0:00 [reiserfs/5]
root      1573  0.0  0.0      0     0 ?        S<   14:27   0:00 [reiserfs/6]
root      1574  0.0  0.0      0     0 ?        S<   14:27   0:00 [reiserfs/7]
root      1592  0.0  0.0   1852   752 ?        Ss   14:27   0:00 /usr/sbin/syslogd -n -a /opt/rom/dev/log
root      1593  0.0  0.0   1708   520 ?        Ss   14:27   0:00 /usr/sbin/klogd -c 3 -x -n
root      1719  0.0  0.0   3932  1004 ?        Ss   14:27   0:00 /usr/sbin/sshd
root      1726  0.0  0.0   1700   568 ?        Ss   14:27   0:00 /usr/sbin/acpid
81        1737  0.0  0.0   2296   692 ?        Ss   14:27   0:00 /usr/bin/dbus-daemon --system
82        1742  0.0  0.0   5580  3628 ?        Ss   14:27   0:00 /usr/sbin/hald --daemon=yes
root      1743  0.0  0.0   3080  1016 ?        S    14:27   0:00 hald-runner
root      1774  0.0  0.0   3176  1016 ?        S    14:27   0:00 hald-addon-input: Listening on /dev/input/event0
82        1778  0.0  0.0   2232   944 ?        S    14:27   0:00 hald-addon-acpi: listening on acpid socket /var/
root      1783  0.0  0.0   3176  1012 ?        S    14:27   0:00 hald-addon-storage: polling /dev/sr0 (every 2 se
root      1793  0.0  0.0   1916   644 ?        S    14:27   0:00 /usr/sbin/crond -l10
daemon    1795  0.0  0.0   1916   420 ?        Ss   14:27   0:00 /usr/sbin/atd -b 15 -l 1
root      1877  0.0  0.0   1708   536 tty1     Ss+  14:28   0:00 /sbin/agetty 38400 tty1 linux
root      1878  0.0  0.0   1708   536 tty2     Ss+  14:28   0:00 /sbin/agetty 38400 tty2 linux
root      2210  3.4  0.1 113144 65352 ?        Ssl  14:28   1:48 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2211  2.7  0.6 572916 340624 ?       Ssl  14:28   1:26 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2212  2.0  0.1 112780 70416 ?        Ssl  14:28   1:04 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2315  3.9  0.5 1082820 250020 ?      Ssl  14:30   2:01 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2345 16.2  3.0 2110060 1541436 ?     Ssl  14:30   8:14 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2384  1.2  0.1 112780 63056 ?        Ssl  14:31   0:37 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2425  1.7  0.5 310320 272116 ?       Ssl  14:32   0:52 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2457  1.6  0.4 581152 224316 ?       Ssl  14:32   0:49 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2491  1.5  0.3 573148 159972 ?       Ssl  14:32   0:45 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2546  5.5  0.2 581416 117656 ?       Ssl  14:33   2:38 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2580  1.5  0.1 178816 90244 ?        Ssl  14:33   0:43 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2617  1.3  0.2 310188 105660 ?       Ssl  14:33   0:38 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2659  1.3  0.1 178540 72440 ?        Ssl  14:34   0:37 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2720  3.9  1.0 1082896 541264 ?      Ssl  14:34   1:50 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2763  1.4  0.1 178540 98544 ?        Ssl  14:35   0:39 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2798  1.3  0.1 178512 72352 ?        Ssl  14:35   0:35 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2835  3.2  3.0 1595908 1542776 ?     Ssl  14:35   1:29 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2873 13.2  0.2 844536 133784 ?       Ssl  14:36   5:56 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      2988 39.0  4.0 2110088 2016516 ?     Ssl  14:39  16:05 /usr/local/bin/qemu-system-x86_64 -hda /opt/kvm/
root      3965  0.0  0.0   6656  2060 ?        Ss   15:17   0:00 sshd: root@pts/0
root      3968  0.0  0.0   3140  1816 pts/0    Ss   15:17   0:00 -bash
root      4092  0.0  0.0   2568   940 pts/0    R+   15:21   0:00 ps axu
root@th02:~#
Comment 44 James Stevens 2010-05-08 14:22:37 UTC
root@th02:~# cat /proc/meminfo
MemTotal:       49905520 kB
MemFree:        34478276 kB
Buffers:          104068 kB
Cached:          7319752 kB
SwapCached:            0 kB
Active:          9388520 kB
Inactive:        5782364 kB
Active(anon):    7747116 kB
Inactive(anon):        0 kB
Active(file):    1641404 kB
Inactive(file):  5782364 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:      49410668 kB
HighFree:       34292324 kB
LowTotal:         494852 kB
LowFree:          185952 kB
SwapTotal:      34623752 kB
SwapFree:       34623752 kB
Dirty:               728 kB
Writeback:             0 kB
AnonPages:       7747064 kB
Mapped:             7216 kB
Slab:              95820 kB
SReclaimable:      75348 kB
SUnreclaim:        20472 kB
PageTables:        16740 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    72052892 kB
Committed_AS:   12696264 kB
VmallocTotal:     122880 kB
VmallocUsed:       35332 kB
VmallocChunk:      59704 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        4088 kB
DirectMap2M:      907264 kB
root@th02:~# cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/sda3                               partition       3903784 0       -1
/opt/swap1                              file            10239992        0       -2
/opt/swap2                              file            10239992        0       -3
/opt/swap3                              file            5119992 0       -4
/opt/swap4                              file            5119992 0       -5
root@th02:~# free
             total       used       free     shared    buffers     cached
Mem:      49905520   15427848   34477672          0     104068    7320244
-/+ buffers/cache:    8003536   41901984
Swap:     34623752          0   34623752
root@th02:~#
Comment 45 James Stevens 2010-05-08 14:23:50 UTC
BTW: I just re-booted the server, as this was the only way to stop the oom-killer from killing.

All the info above (given today) is from after the reboot - but the number and size of the VMs has not changed.
Comment 46 James Stevens 2010-05-08 14:25:34 UTC
Is it possible that I'm running out of "LowFree" ?
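
If so, I suppose something like this would show whether it keeps shrinking while the VMs and the "dd" are busy:

# watch -n 5 'grep -E "LowTotal|LowFree" /proc/meminfo'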
Comment 47 James Stevens 2010-05-10 10:07:25 UTC
Here's some thoughts on this subject :-

With reference to the oom-killing spree from the 8th: (1) the system still had 10Gb of free RAM and 34Gb of free SWAP - therefore the only possible conclusion I can come to is that the "normal" trigger by which the oom-killer is invoked (i.e. running out of "main memory") was NOT (as in *NOT*) what happened in my case. I am totally certain, David, that until you can take this on board it will not be possible for you to make progress on this issue.

However, clearly the kernel thought that the server had run out of some resource whose size was limited (and couldn't be expanded - e.g. the size is set & fixed at boot time, or is a physical limit of the machine), and therefore killing tasks was the only way to regain this resource. I'm not doubting the kernel was right in this analysis; I simply don't know what resource it had reached the limit of, or how to make more of that resource available.

The resource is clearly somehow related to disk I/O and/or cached blocks. However, what this relationship is, I don't know - could it be related to dentries or inodes?

My guess would be that the best way to stop this happening again (until we get to the bottom of it) is either to regularly reboot the server or to run "sync; echo 3 > /proc/sys/vm/drop_caches" from a cron job. Although this will obviously trigger a load of disk reads immediately afterwards, it's not that bad, as each virtual server has already built up its own internal disk cache anyway.
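For reference, a minimal sketch of such a cron entry (the schedule and path are illustrative only, not a recommendation):

  # /etc/cron.d/drop_caches -- hypothetical example, runs nightly at 04:00
  0 4 * * *  root  /bin/sync && /bin/echo 3 > /proc/sys/vm/drop_caches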

One unusual feature of this server that might not be within a developer's typical thinking is that all this server does is run KVM VMs. Therefore there are a (relatively) small number of processes that use between 256MB and 2GB of RAM each. Each process will probably only open one file (its virtual disk), which is a reasonably large sparse file (between 150GB and 500GB) - this seems to rule out dentry or inode issues. It will open the file when the process is started and probably never close it. The system has 48GB of RAM, which is mostly used for disk cache (75% cache, 25% processes) - this is probably the opposite way round from a typical system - maybe there are some structures / tables that point into the cache that are limited in size?

It reminds me of a disk capacity issue I once had, on an embedded system, where I had a small disk with lots of small files, so (until I worked out what was happening) I kept running out of inodes, because the "typical" file:inode ratio (as set at mkfs time) wasn't appropriate. My guess would be that this is the kind of thing that is happening here.

I'm wondering if "vfs_cache_pressure" or "zone_reclaim_mode" might be relevant, but the former seems to be deprecated (or maybe I've used config options that have removed it?). They also smack of a workaround (like the cron job), not a fix.
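If those knobs are present on this kernel (they live under /proc/sys/vm), experimenting with them would look something like this - the values are illustrative guesses, not tested recommendations:

  sysctl vm.vfs_cache_pressure=200   # reclaim dentry/inode caches more aggressively (default 100)
  sysctl vm.zone_reclaim_mode=1      # reclaim within a zone before falling back to other zones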
Comment 48 KOSAKI Motohiro 2010-05-11 01:10:17 UTC
I haven't read your bug report in detail, but...

> HighTotal:      49410668 kB
> HighFree:       34292324 kB
> LowTotal:         494852 kB
> LowFree:          185952 kB

Well, 49.4GB of highmem is not a suitable configuration; too much highmem may
cause lowmem exhaustion. Can you please try 64-bit Linux?
Comment 49 James Stevens 2010-05-11 10:13:41 UTC
I had thought too much memory might be the problem as the same VMs had been running on a 32Gb system with no problem for many months.

What is exhausted? and can I limit this resource in some way? Why is this resource getting over-committed?

I would have thought that with "vm.overcommit_memory=2" *NO* memory resource should be over-committed and therefore there would NEVER be the need for oom-killing. Is there some way to completely disable the oom-killer? If not, there should be.

As a sysadmin, my preference would ALWAYS be for performance degradation over arbitrary process killing (especially killing VMs) - even if it means a program waiting for a resource to become free - to me this is a much better solution than random killing. Randomly killing MY programs, especially when there is LOADS of free memory and LOADS of free swap should be a last resort - i.e. only done if the only alternative is a panic.


However, what I find almost impossible to understand is that I have ~20 VMs that all allocate their memory resource at start-up, I have overcommit disabled and once the VMs are started I don't start any more programs on that server, yet after a week or so of faultless operation the oom-killer comes in and starts killing them off - despite there being LOADS of free memory and free SWAP - this kind of unpredictable behaviour is not really acceptable.

If I tell my VMs to use more memory, so there is less available for cache, will that help - i.e. increase "Committed_AS" - or what if I decrease or increase "CommitLimit"?
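For what it's worth, CommitLimit is (as far as I know) derived as SwapTotal + MemTotal * overcommit_ratio / 100, and with the meminfo above the numbers only add up if vm.overcommit_ratio is set to 75 on this box:

  49905520 kB * 75 / 100 = 37429140 kB
  37429140 kB + 34623752 kB = 72052892 kB   (= the CommitLimit reported)

so raising vm.overcommit_ratio or adding swap raises CommitLimit - but this only governs userland commit accounting, not kernel memory.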

Is there any (realistic) option for a 64-bit upgrade other than a complete re-install? E.g. can I replace just the kernel, leaving userland 32-bit?

Why is my "LowTotal" *so* low at 494852kB? Is this just because of the large memory? the 32Gb VM host has 625932kB, the 16Gb VMs have 764956kB, but most of the rest of our machines (which have 8Gb or 2GB) have 800Mb or above.
Comment 50 Michael Reinelt 2010-05-11 10:50:33 UTC
James,

I remember applying a patch from KOSAKI Motohiro which added some debugging code to the kernel, telling you in more detail *which* memory, zone or whatever triggered the oom.

Unfortunately I can't find it here. Maybe it was sent by mail? Sadly I can't even find the email :-(

With this patch applied, we found out that my OOM problem was caused by running out of shmem: GEM uses lots of shmem, and i915 uses GEM.


@Motohiro: any chance you remember this patch? It was about adding an extra printk() before every "goto oom" somewhere in the mm code.
Comment 51 James Stevens 2010-05-11 11:12:11 UTC
It's not a userland program that's causing the memory usage - it's the kernel trying to keep track of the blocks in the cache. Sadly, killing userland programs is a very poor solution to the problem of running out of cache-tracking memory - not adding more blocks to the cache would be a better one, or dropping existing blocks from the cache.

The problem is (I think) that I have loads of free memory, so the kernel (quite reasonably) thinks it can add more blocks to the cache. The problem is that it needs some other resource (e.g. buffer_heads) to track and make use of the extra blocks in the cache, but with insufficient LowMEM it fails to allocate any, so it sends the oom-killer off on a killing spree.

What might be a one-off oom-kill turns into a killing spree because killing userland processes, of course, does not empty the cache, and so does not free up whatever kernel resource has taken all the LowMEM. Chucking blocks out of the cache would be a better solution to this problem, but with 10GB of free RAM I can see why it looks to the kernel like this isn't necessary.

Is there a way to limit cache usage, or will I just have to limit total memory, say with "highmem=32Gb"?
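If it comes to that, capping memory at boot would presumably look something like one of these kernel command-line options (illustrative values; "highmem=" is 32-bit x86 only, and the exact bootloader syntax depends on the setup):

  # appended to the kernel line in the bootloader config - hypothetical
  mem=32G          # limit total RAM the kernel uses
  highmem=31G      # or force the highmem zone to a given size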

Blowing away the cache with "echo 3 > /proc/sys/vm/drop_caches" frees up pretty much all my "LowMem" again - so currently I'm using this as a way of stopping the oom-killer from being triggered. It's not a terrible solution, as each VM has its own disk cache, which is unaffected by clearing the host cache.

Is there any way I can increase the memory block size and so increase the size of each block cached? That way I could cache the same amount of disk space but use less LowMEM to keep track of it.
Comment 52 James Stevens 2010-05-11 11:50:45 UTC
We also have two VM hosts that are running 2.6.28.4 - they have 500GB of HDD and 16GB of RAM, and they have a *LOT* more free LOWMEM, despite not having been rebooted for 377 days.

LowTotal:         764956 kB
LowFree:          258516 kB

I also have a small VM host at home running 2.6.32.9, with 500GB of HDD and 2GB of RAM - this one, however, has very little free LOWMEM:

LowTotal:         887212 kB
LowFree:           49404 kB

This LowMEM can't be used up by tracking cached disk blocks, as the server only has 2GB of RAM - so what is using all the LOWMEM on this system? "slabtop" reports :-

 Active / Total Size (% used)       : 23593.02K / 27915.55K (84.5%)

as in "not a lot" - as per the 48GB system, on the 2GB system running "echo 3 > /proc/sys/vm/drop_caches" frees up nearly all the LOWMEM ...

LowTotal:         887212 kB
LowFree:          760784 kB

But what was using it in the first place? Could there be some kind of LOWMEM leak associated with the cache?


On the 48GB system the oom-killing problem was definitely much worse when running 2.6.32.6, slightly better when running 2.6.32.9, and a lot better when running 2.6.31.12.

With 32.6 it would run for a day or two before OOM killing, with 32.9 it would be about a week, with 31.12 it managed nearly a month. However, the 2.6.28.4 system has been running 377 days since last reboot, has VMs that haven't been re-started since then and has *no* recorded oom-kills.
Comment 53 James Stevens 2010-05-11 12:34:30 UTC
I've just reproduced this problem on my 2GB VM host - proving this is *NOT* related to the fact that the large server has 48GB.

My 2GB VM host is used for testing. It only runs one VM. It's been stable for ages, even though it runs 2.6.32.9.

Just now, all I did was log into the VM host and run "md5sum <65Gb-file>", and (after a while) that caused the VM client to die. Whenever it was restarted it was unstable, until I ran "echo 3 > /proc/sys/vm/drop_caches" on the VM host.

While it was unstable it was reporting very little free LOWMEM.
Comment 54 KOSAKI Motohiro 2010-05-12 02:06:42 UTC
Hi James

> I had thought too much memory might be the problem as the same VMs had been
> running on a 32Gb system with no problem for many months.
> 
> What is exhausted? and can I limit this resource in some way? Why is this
> resource getting over-committed?

Highmem is for userland process memory; lowmem can be used for both kernel
and userland memory.
IOW, a large amount of total memory may cause a large amount of kernel memory
to be consumed, and that may exhaust lowmem. Overcommit is purely unrelated.

Also, Linux has a virtual memory subsystem and a rich I/O subsystem, so the
kernel itself uses a non-negligible amount of memory.


> I would have thought that with "vm.overcommit_memory=2" *NO* memory resource
> should be over-committed and therefore there would NEVER be the need for
> oom-killing. Is there some way to completely disable the oom-killer? If not,
> there should be.

Unfortunately, Linux is NOT designed that way, because Linux can't swap out the
kernel itself. If you want to change that, please write and post a patch on LKML.


> As a sysadmin, my preference would ALWAYS be for performance degradation over
> arbitrary process killing (especially killing VMs) - even if it means a
> program
> waiting for a resource to become free - to me this is a much better solution
> than random killing. Randomly killing MY programs, especially when there is
> LOADS of free memory and LOADS of free swap should be a last resort - i.e.
> only
> done if the only alternative is a panic.

I agree in general, but there are a few significant exceptions.
32-bit CPUs have some 32-bit limitations, so our kernel naturally inherits
some limitations from the hardware. IOW, if you are using highmem, the
guarantee you are asking for above cannot be given; that is why I recommended
switching to the 64-bit world.

Some months ago, Linus said that using more than 16GB on 32-bit is not sane.


> However, what I find almost impossible to understand is that I have ~20 VMs
> that all allocate their memory resource at start-up, I have overcommit
> disabled
> and once the VMs are started I don't start any more programs on that server,
> yet after a week or so of faultless operation the oom-killer comes in and
> starts killing them off - despite there being LOADS of free memory and free
> SWAP - this kind of unpredictable behaviour is not really acceptable.

Yes, I think this behavior is a bug.

If the system has enough free memory, it shouldn't cause an OOM, should it?

However, you seem to misunderstand what Linux memory overcommit is.
vm.overcommit_memory is the knob for userland process memory
overcommitment; it doesn't limit kernel memory.


> If I tell my VMs to use more memory, so there is less available for cache,
> will
> that help - i.e. increase "Committed_AS" - or what if I decrease or increase
> "CommitLimit"?

AFAIK, that's impossible.


> Is there any (realistic) option for a 64 bit upgrade other than a complete
> re-install? e.g. can I replace just the kernel leaving user-land in 32bit ?

Well, we are kernel developers; we don't know such things ;)
Please don't ask us; instead, please read your distribution's manual.


> Why is my "LowTotal" *so* low at 494852kB? Is this just because of the large
> memory? the 32Gb VM host has 625932kB, the 16Gb VMs have 764956kB, but most
> of
> the rest of our machines (which have 8Gb or 2GB) have 800Mb or above.

I guess it is used for struct page. The kernel needs this structure to manage every page.
The x86 page size is 4KB, and struct page is 32 bytes (on x86) or 64 bytes (on x86_64).
So total-memory / 4KB = number-of-pages, and number-of-pages x 32 bytes is consumed
by struct page.

If you have 50GB of memory, 50GB / 4KB x 32 ~= 400MB.
896MB (the theoretical lowmem limit) - 400MB ~= 496MB.
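(Plugging in the exact MemTotal from comment 44, and assuming the 32-byte struct page above:

  49905520 kB / 4 kB per page  = 12476380 pages
  12476380 pages * 32 bytes   ~= 380 MB of lowmem for struct page
  896 MB - 380 MB             ~= 516 MB, in the same ballpark as the 494852 kB LowTotal reported

with the remaining difference presumably going to other boot-time reservations.)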
Comment 55 KOSAKI Motohiro 2010-05-12 02:13:57 UTC
Hi Michael

> @Motohiro: any chance you remember this patch? It was about adding an extra
> printk() before every "goto oom" somewhere in the mm code

Unfortunately, such a patch wouldn't help James's issue ;)
In your case, the OOM was raised from the page fault handler, so we needed to dig
into who invoked the page fault OOM handler, because we suspected it was not
caused by a page allocation failure.

But James's log shows that his OOM is definitely invoked from a page allocation failure.
Comment 56 KOSAKI Motohiro 2010-05-12 02:14:51 UTC
James, you seem to have a good reproduction test case. Can you please share
that reproducer with us?
Comment 57 KOSAKI Motohiro 2010-05-12 02:25:57 UTC
A few additional questions.
Does pure qemu also reproduce your issue? Are you using KSM?
Comment 58 KOSAKI Motohiro 2010-05-12 03:01:25 UTC
And can you please reproduce the bug on the latest kernel?
Because:

1) your issue is a lowmem issue, so we need to see the lowmem stats
   (IOW, lots of free highmem can't prevent your OOM)
2) your kernel version doesn't report those stats, but the latest does.


Ideally, comparing the 32-bit and 64-bit kernel results would be best.
Comment 59 James Stevens 2010-05-12 09:32:06 UTC
Thanks for all that feedback.

I accept that 64-bit is a better option, but choices were made based on what was available at install time. If a complete re-install is the only upgrade option, it's not going to happen any day soon.

The problem seems to be that the oom-killer is being used to free LOWMEM, which is being occupied by cache blocks - so killing userland processes doesn't help, hence the killing spree. The kernel needs a separate "clear lowmem" function for when lowmem is needed, which may (or may not) include oom-killing, but could start by chucking cache blocks out, or moving them into highmem.

It seems to me there may even be a case for saying that on 32-bit systems with >8GB RAM, LOWMEM should be reserved for kernel use only - with or without a switch to enable/disable this behaviour.


"because Linux can't swap out the kernel itself" - sure, BUT that is NOT the problem I hit - the problem I hit was that cached disk blocks (or pointers to cache blocks) were occupying LOWMEM and oom-killer was invoked to remove them - this was never going to happen - so the result was the all process in userland were oom-killed, but still LOWMEM was full.


On my 2GB system I have a single dual-core Athlon, a 500GB HDD and one VM running qemu over KVM (not qkvm) - then, on the VM host, I run "find / -type f -exec md5sum {} \;" and after a while the VM is killed. There are 2 x 65GB files on the VM host - it is when the "find" hits these that the VM is killed. I think this is because data is being pulled off the disk faster, as there is less fragmentation.
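(Roughly, the reproducer boils down to something like this, run on the VM host while one KVM guest is up - the path below is a placeholder for whatever large file is on the host:

  # hammer the host page cache with sequential reads of large files
  find / -type f -exec md5sum {} \;
  # or, more directly, just checksum one big file:
  md5sum /path/to/65GB-file
  # ...while watching lowmem in another terminal:
  watch -n5 'grep -i low /proc/meminfo'

and, here at least, the guest gets oom-killed once LowFree collapses.)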

I have replaced the kernel and tried the same test on 2.6.33.3, and it survives fine. Free LOWMEM immediately drops to about 50MB and stays pretty much there (~48MB to 52MB). I will try 2.6.33.3 on the 48GB system as soon as possible.

I don't know whether the qemu/kvm is relevant or (because of its large memory use) it simply makes an obvious target for the oom-killer, but I have seen the oom-killer kill everything on the system (inc klogd, syslogd, sshd, udevd etc.)


"vm.overcommit_memory is the knob for userland process memory overcommitment." - RIGHT - so if I set no overcommit on userland processes there should NEVER be a need to oom-kill userland processes.

If the kernel runs out of kernel memory it needs to deal with it in a better way - killing userland processes isn't a great solution; in fact, in my case it isn't a solution at all, because it doesn't result in any (or enough) kernel memory being released, but simply results in a killing spree.
Comment 60 James Stevens 2010-05-12 10:59:46 UTC
The same test on the 32GB system with 2.6.33.3 also results in *NO* oom-killing - it appears that LowMEM is not being used for cache blocks. Cache rose to 30GB and stopped there, with typical memory stats looking like this :-

HighTotal:      32359752 kB
HighFree:          66060 kB
LowTotal:         622012 kB
LowFree:          426412 kB

This memory profile continued throughout the test - so it looks like LOWMEM is either reserved for kernel use or deliberately not used for disk cache.

Disk I/O (vmstat "bi") was occasionally degraded by 25 to 50% during periods of slightly higher system CPU - but I prefer this to invoking the oom-killer :)

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1      0 485588  99036 29521972    0    0 140738   166 5904 7270  0  4 79 17
 1  1      0 472056  99016 29535288    0    0 131258   110 6033 7144  0  3 74 23
 1  1      0 478496  98840 29529204    0    0 57472   404 5302 5838  0  9 75 16
 2  1      0 496020  98776 29511984    0    0 131374   120 6127 6899 10  5 64 22
...
 1  2      0 501032  98572 29507140    0    0 105098   158 5511 6572  0  6 71 23
 0  2      0 495260  98492 29513496    0    0 107506    88 5405 6308  0  3 74 23
 1  1      0 506784  98416 29501100    0    0 61436   228 4813 5659  0  7 72 21
 0  2      0 506280  98492 29501004    0    0 105818   166 5497 6510  0  5 75 20
 0  2      0 506436  98384 29501036    0    0 140016    86 6053 7104  0  2 77 21
Comment 61 KOSAKI Motohiro 2010-05-13 02:46:30 UTC
I'm happy too that you are freed from the annoying oom-killer issue :)

> I accept that 64-bit is a better option, but choices were made based on what
> was available at install time. If a complete re-install is the only upgrade
> option, it's not going to happen any day soon.
> 
> The problem seems to be that the oom-killer is being used to free LOWMEM,
> which is being occupied by cache blocks - so killing userland processes
> doesn't help, hence the killing spree. The kernel needs a separate "clear
> lowmem" function for when lowmem is needed, which may (or may not) include
> oom-killing, but could start by chucking cache blocks out, or moving them
> into highmem.

How do you know your issue is caused by occupied cache?


> It seems to me there may even be a case for saying that on 32-bit systems
> with >8GB RAM, LOWMEM should be reserved for kernel use only - with or
> without a switch to enable/disable this behaviour.

Hmm...
I think this suggestion is worth considering.

> 
> 
> "because Linux can't swap out the kernel itself" - sure, BUT that is NOT the
> problem I hit - the problem I hit was that cached disk blocks (or pointers to
> cache blocks) were occupying LOWMEM and oom-killer was invoked to remove them
> -
> this was never going to happen - so the result was the all process in
> userland
> were oom-killed, but still LOWMEM was full.

True.


> On my 2GB system I have a single dual-core Athlon, a 500GB HDD and one VM
> running qemu over KVM (not qkvm) - then, on the VM host, I run
> "find / -type f -exec md5sum {} \;" and after a while the VM is killed. There
> are 2 x 65GB files on the VM host - it is when the "find" hits these that the
> VM is killed. I think this is because data is being pulled off the disk
> faster, as there is less fragmentation.
> 
> I have replaced the kernel and tried the same test on 2.6.33.3, and it
> survives fine. Free LOWMEM immediately drops to about 50MB and stays pretty
> much there (~48MB to 52MB). I will try 2.6.33.3 on the 48GB system as soon as
> possible.
> 
> I don't know whether the qemu/kvm is relevant or (because of its large memory
> use) it simply makes an obvious target for the oom-killer, but I have seen
> the oom-killer kill everything on the system (inc klogd, syslogd, sshd, udevd
> etc.)

In your case, the oom selection logic is unrelated, I think.
I agree the OOM-killer doesn't help your case, but we don't have an alternative
recovery path.

So I guess your suggestion above is best: no recovery, but instead never
entering the lowmem starvation situation at all.


> "vm.overcommit_memory is the knob for userland process memory
> overcommitment."
> - RIGHT - so if I set no overcommit on userland processes there should NEVER
> be
> a need to oom-kill userland processes.
> 
> If the kernel runs out of kernel memory it needs to deal with it in a better
> way - killing userland processes isn't a great solution - in fact in my case
> it
> isn't a solution at all, because it doesn't result in any (or enough) kernel
> memory being release, but simply results in a killing spree.

In general, if the kernel can't get enough memory, we only have two options: 1) kill one random unlucky process, or 2) kernel panic. There is no great solution; there are no miracles and no magic.

If you want (2), please consider using /proc/sys/vm/panic_on_oom.
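(For example - illustrative only, whether an automatic reboot is desirable is a site policy decision:

  echo 1 > /proc/sys/vm/panic_on_oom    # or: sysctl vm.panic_on_oom=1
  echo 30 > /proc/sys/kernel/panic      # optionally reboot 30 seconds after any panic
)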
Comment 62 James Stevens 2010-05-13 10:10:42 UTC
> How do you know your issue is caused by occupied cache?

Mostly guess. But

1) The VMs allocate all their resources, inc memory, opening files etc, 
at start-up - so their resource use is fixed - so what else would it be? 
There is nothing else running on that server.

They don't use all their allocated RAM immediately, but they will have 
done after a few days.

2) oom-killer is triggered by heavy disk i/o - big clue.


>> It seems to me there may even be a case for saying that on 32-bit systems
>> with >8GB RAM, LOWMEM should be reserved for kernel use only - with or
>> without a switch to enable/disable this behaviour.
>
> Hmm...
> I think this suggestion is worth considering.

I think this idea has some merit, but 2.6.33 seems to allocate userland 
and cache from HIGH before touching LOW, which considerably helps with 
this issue.

However, over time, it might be possible that LOW is filled with cache
blocks, leading to LOW memory exhaustion.


>> I don't know whether the qemu/kvm is relevant or (because of its large
>> memory use) it simply makes an obvious target for the oom-killer, but I have
>> seen the oom-killer kill everything on the system (inc klogd, syslogd, sshd,
>> udevd etc.)
>
> In your case, oom selection logic is unrelated, I think.

I agree - the problem is in why oom-killer was invoked, not which 
process it killed.

It killed the most busy VMs, which are the ones with the most memory - 
so its choice was probably correct for its algorithm - but I don't know.

Of course, choosing which program to kill is never going to produce a 
"good" choice - the only answer that is acceptable to the user is "none".

> I agree the OOM-killer doesn't help your case, but we don't have an
> alternative recovery path.

Yes you do - I had loads of free HIGH & SWAP - see below.

> So I guess your suggestion above is best: no recovery, but instead never
> entering the lowmem starvation situation at all.

If lowmem starvation, caused by cache or userland processes living 
there, can cause oom-killer even when HIGH and SWAP have considerable 
free, then there is a bug that needs fixing.

> In general, if the kernel can't get enough memory, we only have two options

Only partly true - You are left with only those two options when *ALL* 
memory (Low, High & Swap) are exhausted and when you have thrown ALL 
cache blocks out of memory.

BUT (VERY BIG BUT), if LOW is exhausted and there is still free HIGH or 
SWAP then the first thing you should do is start moving user processes 
and cache blocks out of low mem into high

Also, if there is no HIGH or SWAP free, but there are cache blocks in 
LOW then throw them out - just drop them. It will cause more disk i/o, 
but this is much better than invoking oom-killer.

In general - if there are any disk blocks in cache then oom-killer 
should NOT be invoked until they have all been removed. IMHO cache is 
"nice to have" when compared with randomly killing a user's process.



This is where "ulimit" (ideally) should really be used to set a 
realistic maximum so one process can't take over the system unless 
specifically allowed - having "ulimit" default to "unlimited" is not 
good. I can see the merit in allowing over-allocation, but having 
completely unrestricted over-allocation isn't nice.

As a rough guide, I would say NEVER invoke oom-killer, unless there is 
absolutely no other option. And even if you THINK there is no other 
option, think of one - if in doubt, feel free to ask me, I'll give you 
another option.

1) you could simply suspend the process to disk - allocate just enough 
extra swap to swap out the process that you would have oom-killed, and 
SIGSTOP it. Then every time a program is closed (or more real swap is 
added) you could automatically check to see if stopped processes could 
be swapped back in and re-started.

If this was integrated properly with the window manager it could work 
quite well - the user could open as many programs as they like (if they 
have the disk space), but as they open more, older ones would simply 
automatically grey out until enough of the live ones are closed.

They would have a visual clue that something is going wrong, they wouldn't 
lose any work, and they could simply allocate more real swap to get 
things working again. In the "ps" listing it could show with a special 
status and tools could be used to force alternative processes to suspend 
to disk, bringing back in other ones.


2) always dynamically allocate the same amount of swap as you have 
memory committed (Committed_AS) - I agree this only works well for 
programs that alloc at start-up.


To a user / sysadmin, oom-killer will simply appear like their Linux is 
unreliable. This is not a good reputation to aim for. Linux is VERY 
reliable.

e.g. they leave their PC on overnight because they are in the middle of 
a big piece of work and "updatedb", running under cron, has caused 
oom-killer to kill off half their programs (losing all their work) - 
this scenario would be solved by my suggestion above. I have seen users 
complaining about this scenario in forums.

To a user / sysadmin oom-killer (or panic) is NEVER the right answer. If 
you want to allow over-commit on by default, then you need to find a 
more graceful way to deal with oom scenarios. Switching my PC off frees 
a lot of memory - but its not a good solution - same goes for oom-killer.

I would prefer over-commit (and oom-killer) off by default 
(vm.overcommit_memory=2) - this is much closer to what the user would 
expect. If they have insufficient resources to open yet another program 
then it doesn't open, instead of opening it then killing it when they're 
not looking. I agree this will only work well for programs that do all 
their alloc at start-up.



Also, if you do oom-kill a program a message that tells you what is 
going on would be nice - something nicer than just "Killed", which is 
all I got.



Of course, the problem I was seeing was LOWMEM exhaustion, which is a 
completely different problem from overcommit!



James
Comment 63 KOSAKI Motohiro 2010-05-13 10:55:37 UTC
> > How do you know your issue is caused by occupied cache?
> 
> Mostly guess. But
> 
> 1) The VMs allocate all their resources, inc memory, opening files etc, 
> at start-up - so their resource use is fixed - so what else would it be? 
> There is nothing else running on that server.
> 
> They don't use all their allocated RAM immediately, but they will have 
> done after a few days.
> 
> 2) oom-killer is triggered by heavy disk i/o - big clue.

First of all, if lowmem still has droppable file cache or swappable pages,
the kernel doesn't invoke the oom-killer, by design. So, if your guess is
correct, we've hit another bug.

Do you use ext4? If so, please disable the delayed allocation feature
(via a mount option); it has a known oom-killer issue.
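(For reference, the ext4 mount option in question is, as far as I know, "nodelalloc", e.g.

  mount -o nodelalloc /dev/sdXN /mountpoint   # or add nodelalloc to the fstab options

with /dev/sdXN and /mountpoint as placeholders.)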



> >> It seems to me there may even be a case for saying that on 32-bit systems
> >> with >8GB RAM, LOWMEM should be reserved for kernel use only - with or
> >> without a switch to enable/disable this behaviour.
> >
> > Hmm...
> > I think this suggestion is worth considering.
> 
> I think this idea has some merit, but 2.6.33 seems to allocate userland 
> and cache from HIGH before touching LOW, which considerably helps with 
> this issue.
> 
> However, over time, it might be possible that LOW is filled with cache
> blocks, leading to LOW memory exhaustion.

Again, the kernel drops cache before going OOM (except for the known ext4 issue).


> >> I don't know whether the qemu/kvm is relevant or (because of its large
> >> memory use) it simply makes an obvious target for the oom-killer, but I
> >> have seen the oom-killer kill everything on the system (inc klogd, syslogd,
> >> sshd, udevd etc.)
> >
> > In your case, oom selection logic is unrelated, I think.
> 
> I agree - the problem is in why oom-killer was invoked, not which 
> process it killed.
> 
> It killed the most busy VMs, which are the ones with the most memory - 
> so its choice was probably correct for its algorithm - but I don't know.
> 
> Of course, choosing which program to kill is never going to produce a 
> "good" choice - the only answer that is acceptable to the user is "none".

Agreed.


> > I agree the OOM-killer doesn't help your case, but we don't have an
> > alternative recovery path.
> 
> Yes you do - I had loads of free HIGH & SWAP - see below.
> 
> > So I guess your suggestion above is best: no recovery, but instead never
> > entering the lowmem starvation situation at all.
> 
> If lowmem starvation, caused by cache or userland processes living 
> there, can cause oom-killer even when HIGH and SWAP have considerable 
> free, then there is a bug that needs fixing.

Currently, the kernel doesn't have LOWMEM-to-HIGHMEM migration, but if
swap space is available it doesn't call the oom-killer; it swaps out instead.

So, if your guess is true, it's a bug. Why not?


> > In general, if the kernel can't get enough memory, we only have two options
> 
> Only partly true - You are left with only those two options when *ALL* 
> memory (Low, High & Swap) are exhausted and when you have thrown ALL 
> cache blocks out of memory.

This already happens.


> BUT (VERY BIG BUT), if LOW is exhausted and there is still free HIGH or 
> SWAP then the first thing you should do is start moving user processes 
> and cache blocks out of low mem into high
> 
> Also, if there is no HIGH or SWAP free, but there are cache blocks in 
> LOW then throw them out - just drop them. It will cause more disk i/o, 
> but this is much better than invoking oom-killer.
> 
> In general - if there are any disk blocks in cache then oom-killer 
> should NOT be invoked until they have all been removed. IMHO cache is 
> "nice to have" when compared with randomly killing a user's process.

This already happens.


> This is where "ulimit" (ideally) should really be used to set a 
> realistic maximum so one process can't take over the system unless 
> specifically allowed - having "ulimit" default to "unlimited" is not 
> good. I can see the merit in allowing over-allocation, but having 
> completely unrestricted over-allocation isn't nice.

Well, this is distro policy, not a kernel issue. Also, you can change the
default ulimit value (how depends on the distro; please see your distro manual).


> As a rough guide, I would say NEVER invoke oom-killer, unless there is 
> absolutely no other option. And even if you THINK there is no other 
> option, think of one - if in doubt, feel free to ask me, I'll give you 
> another option.
> 
> 1) you could simply suspend the process to disk - allocate just enough 
> extra swap to swap out the process that you would have oom-killed, and 
> SIGSTOP it. Then every time a program is closed (or more real swap is 
> added) you could automatically check to see if stopped processes could 
> be swapped back in and re-started.
> 
> If this was integrated properly with the window manager it could work 
> quite well - the user could open as many programs as they like (if they 
> have the disk space), but as they open more, older ones would simply 
> automatically grey out until enough of the live ones are closed.
> 
> They would have a visual clue that something is going wrong, they wouldn't 
> lose any work, and they could simply allocate more real swap to get 
> things working again. In the "ps" listing it could show with a special 
> status and tools could be used to force alternative processes to suspend 
> to disk, bringing back in other ones.

Well, I think we don't need SIGSTOP; instead, swapping out all of the process's
pages is good enough. That already happens.


> 2) always dynamically allocate the same amount of swap as you have 
> memory committed (Committed_AS) - I agree this only works well for 
> programs that alloc at start-up.
> 
> To a user / sysadmin, oom-killer will simply appear like their Linux is 
> unreliable. This is not a good reputation to aim for. Linux is VERY 
> reliable.
> 
> e.g. they leave their PC on overnight because they are in the middle of 
> a big piece of work and "updatedb", running under cron, has caused 
> oom-killer to kill off half their programs (losing all their work) - 
> this scenario would be solved by my suggestion above. I have seen users 
> complaining about this scenario in forums.
> 
> To a user / sysadmin oom-killer (or panic) is NEVER the right answer. If 
> you want to allow over-commit on by default, then you need to find a 
> more graceful way to deal with oom scenarios. Switching my PC off frees 
> a lot of memory - but its not a good solution - same goes for oom-killer.

For example, MS Windows simply crashes in such extreme cases instead; that is
NEVER the right answer either.
So I would like to keep the discussion of this bug separate from the discussion
of that extreme-case behaviour.


> I would prefer over-commit (and oom-killer) off by default 
> (vm.overcommit_memory=2) - this is much closer to what the user would 
> expect. If they have insufficient resources to open yet another program 
> then it doesn't open, instead of opening it then killing it when they're 
> not looking. I agree this will only work well for programs that do all 
> their alloc at start-up.

Nope.
Very many programs depend on overcommit behaviour; for example, some Java VMs
may reserve 10x more virtual address space than the memory they actually use.

vm.overcommit_memory=2 would bring us other trouble; unfortunately, kernel
developers can't control userland habits.


> Also, if you do oom-kill a program a message that tells you what is 
> going on would be nice - something nicer than just "Killed", which is 
> all I got.

Linux already has such an OOM notifier.


> Of course, the problem I was seeing was LOWMEM exhaustion, which is a 
> completely different problem from overcommit!

Agreed.


Last of all, if you see a strange OOM again, please report it again.
Newer kernels output more detailed messages than older ones, so we can
probably make a better analysis, I think.


Thank you for your patience and debugging help.
Comment 64 James Stevens 2010-05-13 11:13:21 UTC
We don't use ext4 - only ext2 (small & mounted read-only - e.g. /boot) and reiserFS.


This morning one VM died on 2.6.33.3 - NOT oom-killed (no oom-kill in syslog), so I'm not sure why it died - not good.

I have now upgraded qemu from 0.12.3 to 0.12.4 - but I hadn't seen a VM just die like that before the kernel upgrade, except from an OOM kill, so right now I'm not very happy with 2.6.33.3.

I have also now added "-cpu host" (to the qemu invocation), as this reduces the use of emulation code.



IMHO: SIGSTOP is better than SIGKILL :) - I can recover my data

Or oom-kill with a core dump, and then have the option to resume the process from the core - not as nice as SIGSTOP when the process runs under "init" and has already been auto-restarted.

 ... but agree - separate discussion.
