Bug 93111

Summary: Pages madvise'd as MADV_DONTNEED are slowly returned to the program's RSS
Product: Memory Management
Reporter: Keith Randall (keithr)
Component: Page Allocator
Assignee: Andrew Morton (akpm)
Status: NEW
Severity: low
CC: dvyukov, huww98, yszhou4tech
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.13.0-44-generic
Regression: No
Attachments: reproducing example code
             worse example code

Description Keith Randall 2015-02-12 00:39:26 UTC
Created attachment 166521
reproducing example code

Attached is a program which does one big 256MB mmap, then some madvise(..., MADV_DONTNEED) calls on every other 32K chunk of that memory.  Right after the madvise calls are done, the resident set size of the process is correct at about 128MB.  However, as the program subsequently idles, the resident set size slowly grows until it reaches back up to about 256MB.
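
A minimal sketch of what the repro does (the actual attachment may differ in details such as timing and progress output):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define TOTAL (256UL << 20)   /* 256MB mapping */
#define CHUNK (32UL << 10)    /* 32KB chunks */

int main(void)
{
    char *p = mmap(NULL, TOTAL, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    memset(p, 1, TOTAL);       /* fault everything in: RSS ~256MB */
    for (size_t off = 0; off < TOTAL; off += 2 * CHUNK)
        madvise(p + off, CHUNK, MADV_DONTNEED);  /* drop every other 32KB chunk: RSS ~128MB */
    for (;;)                   /* idle; watch RSS creep back up */
        sleep(60);
}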

The time taken to grow back to 256MB varies widely.  On an Amazon ec2 Ubuntu 14.04 LTS c4.xlarge instance (kernel 3.13.0-44-generic) it takes only 2 minutes.  On my laptop it takes 45 minutes (same kernel).  On my desktop it takes 52 minutes (3.13.0-43-generic).  Other machines seem not to have the bug, at least not within 2+ hours of running (3.8.0-44-generic desktop).

This problem first surfaced in the Go runtime, where we return portions of our heap to the OS when they are no longer in use.  The attached repro produces similar behavior to what we see with some Go programs (https://github.com/golang/go/issues/8832).

The program makes 4096 madvise calls that cannot be coalesced, so we may be tickling something that is not often exercised.

On my ec2 instance, I can see the bug with as little as 16 madvise calls, although the growth does not make it all the way back to 256MB.
Comment 1 Andrew Morton 2015-02-20 23:56:37 UTC
That would be very odd.

I ran it for an hour on a couple of machines (3.13 and 3.19 kernels) but couldn't reproduce it.

I guess it's possible that some other vma is growing? I suggest you take snapshots of /proc/pid/smaps while the code is running and see if you can work out which vma's rss is growing.
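
For example, something along these lines (with <pid> replaced by the process id) captures a snapshot every five minutes:

$ while sleep 300; do date >> smaps.log; cat /proc/<pid>/smaps >> smaps.log; done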
Comment 2 Keith Randall 2015-02-22 00:22:19 UTC
Below is the state of the 256MB mapping at 9 different times during a run.  I'm not entirely sure what is going on here, but the growing AnonHugePages entries make me think the kernel is replacing small pages with huge pages, which would explain the observed behavior.

$ grep -i hugepage /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Maybe khugepaged is doing this in the background?

$ ps ax | grep huge
   78 ?        SN     0:00 [khugepaged]

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

map1:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0 
map1-Size:             262144 kB
map1-Rss:              163840 kB
map1-Pss:              163840 kB
map1-Shared_Clean:          0 kB
map1-Shared_Dirty:          0 kB
map1-Private_Clean:         0 kB
map1-Private_Dirty:    163840 kB
map1-Referenced:       163840 kB
map1-Anonymous:        163840 kB
map1-AnonHugePages:     65536 kB
map1-Swap:                  0 kB
map1-KernelPageSize:        4 kB
map1-MMUPageSize:           4 kB
map1-Locked:                0 kB
--
map2:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0 
map2-Size:             262144 kB
map2-Rss:              172032 kB
map2-Pss:              172032 kB
map2-Shared_Clean:          0 kB
map2-Shared_Dirty:          0 kB
map2-Private_Clean:         0 kB
map2-Private_Dirty:    172032 kB
map2-Referenced:       172032 kB
map2-Anonymous:        172032 kB
map2-AnonHugePages:     81920 kB
map2-Swap:                  0 kB
map2-KernelPageSize:        4 kB
map2-MMUPageSize:           4 kB
map2-Locked:                0 kB
--
map3:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0 
map3-Size:             262144 kB
map3-Rss:              180224 kB
map3-Pss:              180224 kB
map3-Shared_Clean:          0 kB
map3-Shared_Dirty:          0 kB
map3-Private_Clean:         0 kB
map3-Private_Dirty:    180224 kB
map3-Referenced:       180224 kB
map3-Anonymous:        180224 kB
map3-AnonHugePages:     98304 kB
map3-Swap:                  0 kB
map3-KernelPageSize:        4 kB
map3-MMUPageSize:           4 kB
map3-Locked:                0 kB
--
map4:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0 
map4-Size:             262144 kB
map4-Rss:              196608 kB
map4-Pss:              196608 kB
map4-Shared_Clean:          0 kB
map4-Shared_Dirty:          0 kB
map4-Private_Clean:         0 kB
map4-Private_Dirty:    196608 kB
map4-Referenced:       196608 kB
map4-Anonymous:        196608 kB
map4-AnonHugePages:    131072 kB
map4-Swap:                  0 kB
map4-KernelPageSize:        4 kB
map4-MMUPageSize:           4 kB
map4-Locked:                0 kB
--
map5:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0 
map5-Size:             262144 kB
map5-Rss:              204800 kB
map5-Pss:              204800 kB
map5-Shared_Clean:          0 kB
map5-Shared_Dirty:          0 kB
map5-Private_Clean:         0 kB
map5-Private_Dirty:    204800 kB
map5-Referenced:       204800 kB
map5-Anonymous:        204800 kB
map5-AnonHugePages:    147456 kB
map5-Swap:                  0 kB
map5-KernelPageSize:        4 kB
map5-MMUPageSize:           4 kB
map5-Locked:                0 kB
--
map6:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0 
map6-Size:             262144 kB
map6-Rss:              221184 kB
map6-Pss:              221184 kB
map6-Shared_Clean:          0 kB
map6-Shared_Dirty:          0 kB
map6-Private_Clean:         0 kB
map6-Private_Dirty:    221184 kB
map6-Referenced:       221184 kB
map6-Anonymous:        221184 kB
map6-AnonHugePages:    180224 kB
map6-Swap:                  0 kB
map6-KernelPageSize:        4 kB
map6-MMUPageSize:           4 kB
map6-Locked:                0 kB
--
map7:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0 
map7-Size:             262144 kB
map7-Rss:              237568 kB
map7-Pss:              237568 kB
map7-Shared_Clean:          0 kB
map7-Shared_Dirty:          0 kB
map7-Private_Clean:         0 kB
map7-Private_Dirty:    237568 kB
map7-Referenced:       237568 kB
map7-Anonymous:        237568 kB
map7-AnonHugePages:    212992 kB
map7-Swap:                  0 kB
map7-KernelPageSize:        4 kB
map7-MMUPageSize:           4 kB
map7-Locked:                0 kB
--
map8:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0 
map8-Size:             262144 kB
map8-Rss:              245760 kB
map8-Pss:              245760 kB
map8-Shared_Clean:          0 kB
map8-Shared_Dirty:          0 kB
map8-Private_Clean:         0 kB
map8-Private_Dirty:    245760 kB
map8-Referenced:       245760 kB
map8-Anonymous:        245760 kB
map8-AnonHugePages:    229376 kB
map8-Swap:                  0 kB
map8-KernelPageSize:        4 kB
map8-MMUPageSize:           4 kB
map8-Locked:                0 kB
--
map9:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0 
map9-Size:             262144 kB
map9-Rss:              261120 kB
map9-Pss:              261120 kB
map9-Shared_Clean:          0 kB
map9-Shared_Dirty:          0 kB
map9-Private_Clean:         0 kB
map9-Private_Dirty:    261120 kB
map9-Referenced:       261120 kB
map9-Anonymous:        261120 kB
map9-AnonHugePages:    260096 kB
map9-Swap:                  0 kB
map9-KernelPageSize:        4 kB
map9-MMUPageSize:           4 kB
map9-Locked:                0 kB
Comment 3 Dmitry Vyukov 2015-02-23 12:15:50 UTC
This looks like an issue in the kernel. I can understand why the kernel combines pages into huge pages, but why does it page them in?
For huge pages that are completely madvised as DONTNEED, the kernel should not page them in. For partially resident huge pages it is not so obvious, but it still looks wrong for the kernel to significantly increase a process's RSS because of transparent huge pages.
Comment 4 Keith Randall 2015-02-23 16:23:17 UTC
Looking through the kernel code, this seems intentional, at least to some degree.

 * max_ptes_none controls if khugepaged should collapse hugepages over
 * any unmapped ptes in turn potentially increasing the memory
 * footprint of the vmas. When max_ptes_none is 0 khugepaged will not
 * reduce the available free memory in the system as it
 * runs. Increasing max_ptes_none will instead potentially reduce the
 * free memory in the system during the khugepaged scan.
 */

So if max_ptes_none is >0, the code is designed to spend some physical memory to "fill in" unmapped regions in order to convert them to a huge page.

static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)

HPAGE_PMD_ORDER is 21-12=9 on amd64, so there are 512 4KB pages in a 2MB huge page, and by default khugepaged is happy to "fill in" 511 of those 512 pages in order to convert them to a huge page.  That seems overly aggressive to me.  I have a modified test case that demonstrates exactly this: a 512x growth in RSS (I will attach it).

$ cat /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none 
511

If I set /sys/kernel/mm/transparent_hugepage/enabled to "never", the problem goes away.
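
For reference, the setting can be flipped at runtime (as root):

$ echo never > /sys/kernel/mm/transparent_hugepage/enabled

and, going by the comment quoted above, setting max_ptes_none to 0 should also stop khugepaged from filling in unmapped ptes:

$ echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none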
Comment 5 Keith Randall 2015-02-23 16:24:40 UTC
Created attachment 168051
worse example code

Example program whose RSS grows by 512x.
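
Roughly, the idea is to keep only one resident 4KB page in each 2MB huge-page extent, so khugepaged can grow RSS by up to a factor of 512 as it collapses the extents back into huge pages.  A minimal sketch along those lines (the attached program may differ in details):

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define TOTAL (256UL << 20)   /* 256MB of usable space */
#define HPAGE (2UL << 20)     /* 2MB huge-page extent */
#define PAGE  (4UL << 10)     /* 4KB base page */

int main(void)
{
    /* Over-allocate by one extent so the region can be 2MB-aligned;
     * khugepaged collapses 2MB-aligned (PMD) ranges, so alignment keeps
     * the one-page-per-extent pattern exact. */
    char *raw = mmap(NULL, TOTAL + HPAGE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return 1;
    char *p = (char *)(((uintptr_t)raw + HPAGE - 1) & ~(HPAGE - 1));

    memset(p, 1, TOTAL);       /* fault everything in */
    /* Drop all but the first 4KB page of every 2MB extent: RSS falls to
     * TOTAL/512 (~512KB here).  With max_ptes_none=511, khugepaged may then
     * collapse each extent back into a huge page, growing RSS ~512x. */
    for (size_t off = 0; off < TOTAL; off += HPAGE)
        madvise(p + off + PAGE, HPAGE - PAGE, MADV_DONTNEED);
    for (;;)                   /* idle; watch RSS grow back toward 256MB */
        sleep(60);
}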