Bug 93111
| Summary: | Pages madvise'd as MADV_DONTNEED are slowly returned to the program's RSS | | |
|---|---|---|---|
| Product: | Memory Management | Reporter: | Keith Randall (keithr) |
| Component: | Page Allocator | Assignee: | Andrew Morton (akpm) |
| Status: | NEW | | |
| Severity: | low | CC: | dvyukov, huww98, yszhou4tech |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 3.13.0-44-generic | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | reproducing example code, worse example code | | |
Description
Keith Randall
2015-02-12 00:39:26 UTC
That would be very odd. I ran it for an hour on a couple of machines (3.13 and 3.19 kernels) and couldn't reproduce it. I guess it's possible that some other vma is growing? I suggest that you take snapshots of /proc/pid/smaps while the code is running and see if you can work out which vma's RSS is growing.

Below is the state of the 256MB mapping at 9 different times during a run. I'm not entirely sure what is going on here, but the growing nonzero AnonHugePages entries make me think the kernel is replacing small pages with huge pages, which would cause the observed behavior.

$ grep -i hugepage /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Maybe khugepaged is doing this in the background?

$ ps ax | grep huge
   78 ?        SN     0:00 [khugepaged]

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

map1:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0
map1-Size: 262144 kB
map1-Rss: 163840 kB
map1-Pss: 163840 kB
map1-Shared_Clean: 0 kB
map1-Shared_Dirty: 0 kB
map1-Private_Clean: 0 kB
map1-Private_Dirty: 163840 kB
map1-Referenced: 163840 kB
map1-Anonymous: 163840 kB
map1-AnonHugePages: 65536 kB
map1-Swap: 0 kB
map1-KernelPageSize: 4 kB
map1-MMUPageSize: 4 kB
map1-Locked: 0 kB
--
map2:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0
map2-Size: 262144 kB
map2-Rss: 172032 kB
map2-Pss: 172032 kB
map2-Shared_Clean: 0 kB
map2-Shared_Dirty: 0 kB
map2-Private_Clean: 0 kB
map2-Private_Dirty: 172032 kB
map2-Referenced: 172032 kB
map2-Anonymous: 172032 kB
map2-AnonHugePages: 81920 kB
map2-Swap: 0 kB
map2-KernelPageSize: 4 kB
map2-MMUPageSize: 4 kB
map2-Locked: 0 kB
--
map3:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0
map3-Size: 262144 kB
map3-Rss: 180224 kB
map3-Pss: 180224 kB
map3-Shared_Clean: 0 kB
map3-Shared_Dirty: 0 kB
map3-Private_Clean: 0 kB
map3-Private_Dirty: 180224 kB
map3-Referenced: 180224 kB
map3-Anonymous: 180224 kB
map3-AnonHugePages: 98304 kB
map3-Swap: 0 kB
map3-KernelPageSize: 4 kB
map3-MMUPageSize: 4 kB
map3-Locked: 0 kB
--
map4:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0
map4-Size: 262144 kB
map4-Rss: 196608 kB
map4-Pss: 196608 kB
map4-Shared_Clean: 0 kB
map4-Shared_Dirty: 0 kB
map4-Private_Clean: 0 kB
map4-Private_Dirty: 196608 kB
map4-Referenced: 196608 kB
map4-Anonymous: 196608 kB
map4-AnonHugePages: 131072 kB
map4-Swap: 0 kB
map4-KernelPageSize: 4 kB
map4-MMUPageSize: 4 kB
map4-Locked: 0 kB
--
map5:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0
map5-Size: 262144 kB
map5-Rss: 204800 kB
map5-Pss: 204800 kB
map5-Shared_Clean: 0 kB
map5-Shared_Dirty: 0 kB
map5-Private_Clean: 0 kB
map5-Private_Dirty: 204800 kB
map5-Referenced: 204800 kB
map5-Anonymous: 204800 kB
map5-AnonHugePages: 147456 kB
map5-Swap: 0 kB
map5-KernelPageSize: 4 kB
map5-MMUPageSize: 4 kB
map5-Locked: 0 kB
--
map6:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0
map6-Size: 262144 kB
map6-Rss: 221184 kB
map6-Pss: 221184 kB
map6-Shared_Clean: 0 kB
map6-Shared_Dirty: 0 kB
map6-Private_Clean: 0 kB
map6-Private_Dirty: 221184 kB
map6-Referenced: 221184 kB
map6-Anonymous: 221184 kB
map6-AnonHugePages: 180224 kB
map6-Swap: 0 kB
map6-KernelPageSize: 4 kB
map6-MMUPageSize: 4 kB
map6-Locked: 0 kB
--
map7:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0
map7-Size: 262144 kB
map7-Rss: 237568 kB
map7-Pss: 237568 kB
map7-Shared_Clean: 0 kB
map7-Shared_Dirty: 0 kB
map7-Private_Clean: 0 kB
map7-Private_Dirty: 237568 kB
map7-Referenced: 237568 kB
map7-Anonymous: 237568 kB
map7-AnonHugePages: 212992 kB
map7-Swap: 0 kB
map7-KernelPageSize: 4 kB
map7-MMUPageSize: 4 kB
map7-Locked: 0 kB
--
map8:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0
map8-Size: 262144 kB
map8-Rss: 245760 kB
map8-Pss: 245760 kB
map8-Shared_Clean: 0 kB
map8-Shared_Dirty: 0 kB
map8-Private_Clean: 0 kB
map8-Private_Dirty: 245760 kB
map8-Referenced: 245760 kB
map8-Anonymous: 245760 kB
map8-AnonHugePages: 229376 kB
map8-Swap: 0 kB
map8-KernelPageSize: 4 kB
map8-MMUPageSize: 4 kB
map8-Locked: 0 kB
--
map9:7f1d198e3000-7f1d298e3000 rw-p 00000000 00:00 0
map9-Size: 262144 kB
map9-Rss: 261120 kB
map9-Pss: 261120 kB
map9-Shared_Clean: 0 kB
map9-Shared_Dirty: 0 kB
map9-Private_Clean: 0 kB
map9-Private_Dirty: 261120 kB
map9-Referenced: 261120 kB
map9-Anonymous: 261120 kB
map9-AnonHugePages: 260096 kB
map9-Swap: 0 kB
map9-KernelPageSize: 4 kB
map9-MMUPageSize: 4 kB
map9-Locked: 0 kB

This looks like an issue in the kernel. I can understand why the kernel combines pages into huge pages, but why does it page them in? For huge pages that are completely madvised as DONTNEED, the kernel should not page them in. For partially resident huge pages it is not so obvious, but it still looks wrong if the kernel significantly increases a process's RSS because of transparent huge pages.

Looking through the kernel code, this seems intentional, at least to some degree:

 * max_ptes_none controls if khugepaged should collapse hugepages over
 * any unmapped ptes in turn potentially increasing the memory
 * footprint of the vmas. When max_ptes_none is 0 khugepaged will not
 * reduce the available free memory in the system as it
 * runs. Increasing max_ptes_none will instead potentially reduce the
 * free memory in the system during the khugepaged scan.
 */

So if max_ptes_none is >0, the code is designed to spend some physical memory to "fill in" unmapped regions in order to convert them to a huge page.

static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;

#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)

HPAGE_PMD_ORDER is 21-12=9 on amd64, so there are 512 base pages in a huge page, and by default khugepaged is happy to "fill in" 511 of the 512 pages in order to convert a region to a huge page. That seems overly aggressive to me. I have a modified test case that demonstrates exactly this, a growth of 512x in RSS (I will attach it).

$ cat /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
511

If I set /sys/kernel/mm/transparent_hugepage/enabled to "never" the problem goes away.

Created attachment 168051 [details]
worse example code
Example program whose RSS grows by 512x.
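
The attachment itself is not reproduced here. The following is a hypothetical sketch of that kind of program, assuming 4 KiB base pages, 2 MiB transparent huge pages, and the default max_ptes_none of 511: it faults in a 256 MB anonymous mapping, uses MADV_DONTNEED to release all but one 4 KiB page per 2 MiB region (RSS drops to roughly 512 kB), and then watches RSS climb back toward 256 MB as khugepaged collapses each region into a fully resident huge page.

```c
/* Hypothetical reproducer sketch (not the actual attachment). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define ARENA   (256UL << 20)  /* 256 MB mapping, as in the smaps dumps above */
#define HUGE_SZ (2UL << 20)    /* 2 MiB transparent huge page */
#define PAGE_SZ 4096UL         /* 4 KiB base page */

static void print_rss(void)
{
    /* The second field of /proc/self/statm is the resident set size in pages. */
    long rss_pages;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f && fscanf(f, "%*ld %ld", &rss_pages) == 1)
        printf("RSS: %lu kB\n", (unsigned long)rss_pages * PAGE_SZ / 1024);
    if (f)
        fclose(f);
}

int main(void)
{
    char *p = mmap(NULL, ARENA, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(p, 1, ARENA);          /* fault everything in: RSS ~256 MB */
    print_rss();

    /* Release all but one 4 KiB page in each 2 MiB region: RSS ~512 kB. */
    for (size_t off = 0; off < ARENA; off += HUGE_SZ)
        madvise(p + off + PAGE_SZ, HUGE_SZ - PAGE_SZ, MADV_DONTNEED);
    print_rss();

    /* With max_ptes_none=511, khugepaged may collapse each 2 MiB region
     * back into a fully resident huge page; RSS creeps toward 256 MB. */
    for (int i = 0; i < 60; i++) {
        sleep(10);
        print_rss();
    }
    munmap(p, ARENA);
    return 0;
}
```

How quickly RSS grows back depends on khugepaged's scan settings (e.g. pages_to_scan and scan_sleep_millisecs under /sys/kernel/mm/transparent_hugepage/khugepaged/), so the observation loop may need to run for several minutes.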
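Setting /sys/kernel/mm/transparent_hugepage/enabled to "never" is a system-wide workaround. A narrower, per-mapping option from the same madvise(2) interface, not mentioned in this report, is MADV_NOHUGEPAGE, which asks the kernel not to back a specific range with transparent huge pages and so keeps khugepaged from re-filling pages released with MADV_DONTNEED. A minimal sketch, assuming a kernel built with THP support:

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 256UL << 20;   /* 256 MB arena, matching the report */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Opt this mapping out of transparent huge pages so khugepaged
     * will not collapse (and thereby repopulate) DONTNEED'd ranges. */
    if (madvise(p, len, MADV_NOHUGEPAGE) != 0)
        perror("madvise(MADV_NOHUGEPAGE)");   /* e.g. EINVAL if THP is compiled out */

    /* ... use the arena, releasing cold pages with MADV_DONTNEED as before ... */
    if (madvise(p, len, MADV_DONTNEED) != 0)
        perror("madvise(MADV_DONTNEED)");

    munmap(p, len);
    return 0;
}
```

Another middle ground is setting transparent_hugepage/enabled to "madvise", in which case only mappings explicitly marked with MADV_HUGEPAGE are eligible for collapse.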