Created attachment 166521 [details]
reproducing example code

Attached is a program which does one big 256MB mmap, then madvise(..., MADV_DONTNEED) calls on every other 32K chunk of that memory. Right after the madvise calls are done, the resident set size of the process is correct at about 128MB. However, as the program subsequently idles, the resident set size slowly grows until it is back up to about 256MB.

The time taken to grow back to 256MB varies widely. On an Amazon EC2 Ubuntu 14.04 LTS c4.xlarge instance (kernel 3.13.0-44-generic) it only takes 2 minutes. On my laptop it takes 45 minutes (same kernel), and on my desktop it takes 52 minutes (3.13.0-43-generic). Other machines seem not to have the bug, at least over 2+ hours (3.8.0-44-generic desktop).

This problem first surfaced in the Go runtime, where we return portions of our heap to the OS when they are no longer in use. The attached repro produces behavior similar to what we see with some Go programs (https://github.com/golang/go/issues/8832). The program does 4K uncoalescable madvise calls, so we may be tickling something that is not often exercised. On my EC2 instance I can see the bug with as few as 16 madvise calls, although the growth does not make it all the way back to 256MB.
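Since the attachment isn't reproduced inline, here is a minimal sketch of a program of the shape described above. The memset to fault the pages in, the exact loop structure, and the VmRSS-printing idle loop are assumptions based on the description, not the attachment's actual code:

/* Minimal sketch (not the attachment): one 256MB anonymous mapping,
 * fault it all in, MADV_DONTNEED every other 32K chunk, then idle
 * while printing the process's VmRSS once a minute. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define TOTAL (256UL << 20)   /* 256 MB mapping */
#define CHUNK (32UL << 10)    /* 32 KB chunks   */

int main(void) {
    char *p = mmap(NULL, TOTAL, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 1, TOTAL);                         /* fault everything in */

    /* Release every other 32K chunk: 4096 uncoalescable madvise calls. */
    for (unsigned long off = 0; off < TOTAL; off += 2 * CHUNK)
        madvise(p + off, CHUNK, MADV_DONTNEED);

    /* Idle and watch RSS; right after the madvise loop it should be ~128MB. */
    for (;;) {
        char line[128];
        FILE *f = fopen("/proc/self/status", "r");
        if (f) {
            while (fgets(line, sizeof line, f))
                if (strncmp(line, "VmRSS", 5) == 0)
                    fputs(line, stdout);
            fclose(f);
        }
        sleep(60);
    }
}

On an affected kernel the printed VmRSS creeps back toward 256MB while the program does nothing but sleep.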
That would be very odd. I ran it for an hour on a couple of machines (3.13 and 3.19 kernels) and couldn't reproduce it. I guess it's possible that some other vma is growing? I suggest taking snapshots of /proc/pid/smaps while the code is running and seeing if you can work out which vma's Rss is growing.
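In case it helps, here is a minimal sketch of one way to take those snapshots, assuming the target's PID is passed on the command line; it prints each vma header together with its Rss line once a minute (the interval and output format are arbitrary choices):

/* Sketch: snapshot /proc/<pid>/smaps once a minute, printing each vma
 * header together with its Rss line, so a growing vma stands out. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/smaps", argv[1]);

    for (;;) {
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        char line[256], vma[256] = "";
        while (fgets(line, sizeof line, f)) {
            if (!isupper((unsigned char)line[0]))
                strcpy(vma, line);          /* vma header: "addr-addr perms ..." */
            else if (strncmp(line, "Rss:", 4) == 0) {
                fputs(vma, stdout);
                fputs(line, stdout);
            }
        }
        fclose(f);
        puts("--");
        sleep(60);
    }
}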
Below is the state of the 256MB mapping at 9 points during a run (snapshots map1 through map9 of /proc/pid/smaps). I'm not entirely sure what is going on here, but the steadily growing AnonHugePages numbers make me think the kernel is replacing the small pages with huge pages, which would cause the observed behavior.

$ grep -i hugepage /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Maybe khugepaged is doing this in the background?

$ ps ax | grep huge
   78 ?        SN     0:00 [khugepaged]

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Every snapshot shows the same vma, 7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0, with Size: 262144 kB, KernelPageSize and MMUPageSize of 4 kB, and Pss, Private_Dirty, Referenced, and Anonymous equal to Rss; Shared_Clean, Shared_Dirty, Private_Clean, Swap, and Locked are all 0 kB. The values that change are:

snapshot    Rss (kB)    AnonHugePages (kB)
map1         163840           65536
map2         172032           81920
map3         180224           98304
map4         196608          131072
map5         204800          147456
map6         221184          180224
map7         237568          212992
map8         245760          229376
map9         261120          260096
This looks like an issue in the kernel. I can understand why the kernel combines pages into huge pages, but why does it page them in? For huge pages that are completely madvised as DONTNEED, the kernel should not page them in. For partially resident huge pages it is less obvious, but it still looks wrong for the kernel to significantly increase a process's RSS because of transparent huge pages.
Looking through the kernel code, this seems intentional, at least to some degree. The comment describing max_ptes_none says:

 * max_ptes_none controls if khugepaged should collapse hugepages over
 * any unmapped ptes in turn potentially increasing the memory
 * footprint of the vmas. When max_ptes_none is 0 khugepaged will not
 * reduce the available free memory in the system as it
 * runs. Increasing max_ptes_none will instead potentially reduce the
 * free memory in the system during the khugepaged scan.
 */

So if max_ptes_none is >0, the code is designed to spend some physical memory to "fill in" unmapped regions in order to convert them to a huge page.

static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;

#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)

HPAGE_PMD_ORDER is 21-12 = 9 on amd64, so there are 512 small pages in a huge page, and by default khugepaged is happy to "fill in" up to 511 of the 512 pages in order to collapse a region into a huge page. That seems overly aggressive to me. I have a modified test case that demonstrates exactly this, a 512x growth in RSS (I will attach it).

$ cat /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
511

If I set /sys/kernel/mm/transparent_hugepage/enabled to "never", the problem goes away.
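As a sketch of the two knobs mentioned above (both writes need root, and only the "enabled" one has actually been tested here), either of the following should stop khugepaged from filling in unmapped pages:

$ echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
$ echo never > /sys/kernel/mm/transparent_hugepage/enabled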
Created attachment 168051 [details]
worse example code

Example program whose RSS grows by 512x.
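The attachment isn't reproduced inline; below is a minimal sketch of a program of this shape, written from the descriptions in this thread rather than from the attachment. It touches one 4K page in each 2MB-aligned block of a 256MB anonymous mapping, so RSS starts at about 512KB; if khugepaged collapses each block into a huge page (which the default max_ptes_none of 511 permits), RSS grows toward 256MB, i.e. 512x. The sizes and the alignment trick are assumptions:

/* Sketch (not the attachment): one small page resident per 2MB block. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE (2UL << 20)      /* assume 2MB huge pages */
#define TOTAL (256UL << 20)

int main(void) {
    /* Over-map by one huge page so we can align to a 2MB boundary. */
    char *raw = mmap(NULL, TOTAL + HPAGE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) { perror("mmap"); return 1; }
    char *p = (char *)(((uintptr_t)raw + HPAGE - 1) & ~(HPAGE - 1));

    for (unsigned long off = 0; off < TOTAL; off += HPAGE)
        p[off] = 1;                        /* one 4K page per 2MB block */

    for (;;) {                             /* idle, printing VmRSS */
        char line[128];
        FILE *f = fopen("/proc/self/status", "r");
        if (f) {
            while (fgets(line, sizeof line, f))
                if (strncmp(line, "VmRSS", 5) == 0)
                    fputs(line, stdout);
            fclose(f);
        }
        sleep(60);
    }
}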