Bug 202089 - transparent hugepage not compatible with madvise(MADV_DONTNEED)
Summary: transparent hugepage not compatible with madvise(MADV_DONTNEED)
Status: NEW
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 high
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-29 09:00 UTC by jianpanlanyue
Modified: 2019-02-01 11:58 UTC

See Also:
Kernel Version: 4.4.0-117
Subsystem:
Regression: No
Bisected commit-id:



Description jianpanlanyue 2018-12-29 09:00:22 UTC
environment:
  1. kernel 4.4.0 on x86_64
  2. echo always > /sys/kernel/mm/transparent_hugepage/enabled
     echo always > /sys/kernel/mm/transparent_hugepage/defrag
     echo 2000000 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan (makes khugepaged scan more pages per pass, so the problem reproduces sooner)

problem:
  1. Use mmap() to allocate 4096 bytes, 1024*512 times (4096*1024*512 = 2G).
  2. Use madvise(MADV_DONTNEED) to free most of those pages but keep a few (via if(i%33==0) continue;). The process's physical memory usage first drops, but after a few seconds it rises back to 2G and never comes down again.
  3. If I delete this condition (if(i%33==0) continue;), or disable transparent_hugepage by setting 'enabled' and 'defrag' to never, everything works and the physical memory usage drops as expected.

  It seems that transparent_hugepage has problems with non-contiguous madvise(MADV_DONTNEED).
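
A rough check of the numbers above may help. The sketch below counts how many pages the i%33 condition keeps and how many 2 MiB-sized ranges still contain at least one kept page. It assumes the 4096-byte mappings end up adjacent in the address space and that page 0 starts on a 2 MiB boundary; neither is guaranteed, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Values taken from the test program and from x86-64 page sizes. */
#define PAGE_COUNT     (1024L * 512L)  /* 4 KiB pages mapped */
#define KEEP_EVERY     33L             /* pages with i % 33 == 0 are kept */
#define PAGES_PER_HUGE 512L            /* 2 MiB huge page / 4 KiB page */

/* Number of pages the loop skips, i.e. never hands to MADV_DONTNEED. */
long retained_pages(long count, long keep_every)
{
    assert(count >= 0 && keep_every > 0);
    return (count + keep_every - 1) / keep_every;
}

/*
 * Number of huge-page-sized ranges holding at least one retained page,
 * assuming (hypothetically) that page 0 starts on a 2 MiB boundary.
 */
long ranges_still_populated(long count, long keep_every, long per_huge)
{
    assert(count >= 0 && keep_every > 0 && per_huge > 0);
    long ranges = 0;
    for (long start = 0; start < count; start += per_huge) {
        long end = (start + per_huge < count) ? start + per_huge : count;
        /* First index >= start that is a multiple of keep_every. */
        long first_kept = (start + keep_every - 1) / keep_every * keep_every;
        if (first_kept < end)
            ranges++;
    }
    return ranges;
}

/* Print the footprint right after madvise and after a full re-collapse. */
void print_estimate(void)
{
    long kept = retained_pages(PAGE_COUNT, KEEP_EVERY);
    long ranges = ranges_still_populated(PAGE_COUNT, KEEP_EVERY, PAGES_PER_HUGE);
    printf("kept after madvise: %ld pages (%ld KiB)\n", kept, kept * 4);
    printf("if every range is collapsed: %ld huge pages (%ld MiB)\n",
           ranges, ranges * 2);
}
```

Under those assumptions, 15888 pages (about 62 MiB) are kept, and they are spread over all 1024 ranges. If khugepaged is willing to collapse a range that has only a few pages present, every range can become a full 2 MiB page again, which would match the observed rise back to 2G.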


Below is the test code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <errno.h>
#include <assert.h>

#define PAGE_SIZE 4096
#define PAGE_COUNT (1024 * 512)

int main(void)
{
  void **table = malloc(sizeof(void *) * PAGE_COUNT);
  assert(table != NULL);
  printf("begin mmap...\n");

  /* Map and touch 1024*512 single pages (2G in total). */
  for (int i = 0; i < PAGE_COUNT; i++) {
    table[i] = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(table[i] != MAP_FAILED);
    memset(table[i], 1, PAGE_SIZE);
  }

  printf("mmap ok, press enter to free most of them\n");
  getchar();

  /* Unexpected behaviour: after most pages are freed, THP makes RSS rise to 2G again. */
  for (int i = 0; i < PAGE_COUNT; i++) {
    if (i % 33 == 0) continue;  /* keep every 33rd page */
    if (madvise(table[i], PAGE_SIZE, MADV_DONTNEED) != 0)
      printf("madvise error, errno:%d\n", errno);
  }

  printf("madvise finished\n");
  free(table);
  getchar();
  getchar();
  return 0;
}
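
To watch the physical memory usage from inside the process instead of from an external tool, a helper along these lines could be called before and after the madvise loop. This is only a sketch: it assumes /proc/self/statm is readable, and rss_bytes is a hypothetical name that is not part of the test program above.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Return the resident set size of the calling process in bytes,
 * or -1 if /proc/self/statm cannot be read or parsed.
 */
long rss_bytes(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL)
        return -1;

    /* The first two fields are total program size and resident pages. */
    long size_pages = 0, resident_pages = 0;
    int fields = fscanf(f, "%ld %ld", &size_pages, &resident_pages);
    fclose(f);

    if (fields != 2)
        return -1;
    return resident_pages * page_size;
}
```

Printing rss_bytes() once right after the loop and again some seconds later should show the drop followed by the rise described in step 2.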
Comment 1 jianpanlanyue 2018-12-29 14:55:54 UTC
I find that kernel versions prior to 4.4.0 also have this problem.
Comment 2 Andrew Morton 2018-12-29 20:53:20 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

Comment 3 Kirill A. Shutemov 2018-12-29 22:48:49 UTC
It's expected behaviour.

MADV_DONTNEED doesn't guarantee that the range will not be repopulated
(with or without direct action on application behalf). It's just a hint
for the kernel.

For sparse mappings, consider using MADV_NOHUGEPAGE.
Comment 4 jianpanlanyue 2018-12-30 04:30:08 UTC
> MADV_DONTNEED doesn't guarantee that the range will not be repopulated

First, thanks for your suggestion (MADV_NOHUGEPAGE). However, I find that this problem never appears after kernel 4.15.0, so it seems to have already been fixed (or optimized). I looked through the git log, and although there are some commits matching "thp.*MADV_DONTNEED", I'm not sure which commit is responsible.

I just want to know what was changed to resolve this problem. Thanks.
Comment 5 jianpanlanyue 2018-12-31 04:51:53 UTC
(In reply to Kirill A. Shutemov from comment #3)
> It's expected behaviour.
> 
> MADV_DONTNEED doesn't guarantee that the range will not be repopulated
> (with or without direct action on application behalf). It's just a hint
> for the kernel.
> 
> For sparse mappings, consider using MADV_NOHUGEPAGE.

Thanks for your suggestion (MADV_NOHUGEPAGE). However, I find that this problem never appears after kernel 4.15.0, so it seems to have already been fixed (or optimized). I looked through the git log, and although there are some commits matching "thp.*MADV_DONTNEED", I'm not sure which commit is responsible.

I just want to know what was changed to resolve this problem. Thanks.
Comment 6 Michal Hocko 2019-01-03 10:03:44 UTC
On Sun 30-12-18 01:48:43, Kirill A. Shutemov wrote:
> It's expected behaviour.
> 
> MADV_DONTNEED doesn't guarantee that the range will not be repopulated
> (with or without direct action on application behalf). It's just a hint
> for the kernel.

I agree with Kirill here but I would be interested in the underlying
usecase that triggered this. The test case is clearly artificial but is
any userspace actually relying on MADV_DONTNEED reducing the rss
longterm?

> For sparse mappings, consider using MADV_NOHUGEPAGE.

Yes, or use a high threshold for khugepaged for collapsing.
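
One knob that seems to fit this description is khugepaged's max_ptes_none, which limits how many not-yet-mapped small pages a range may contain and still be collapsed; reading it as the "threshold" meant here is an assumption. The sketch below reads and writes such a single-integer file. The helper names are hypothetical, and writing the real sysfs file requires root.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed location of the knob on kernels built with THP support. */
#define MAX_PTES_NONE_PATH \
    "/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none"

/* Read one non-negative integer from `path`; returns -1 on failure. */
long read_knob(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Write one integer to `path`; returns 0 on success, -1 on failure. */
int write_knob(const char *path, long value)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%ld\n", value) > 0;
    /* The data is flushed on close, so a rejected write shows up here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, write_knob(MAX_PTES_NONE_PATH, 0) would ask khugepaged to collapse only ranges in which every small page is already present, at the cost of fewer huge pages system-wide.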
Comment 7 willy 2019-01-03 14:35:07 UTC
On Thu, Jan 03, 2019 at 10:44:22AM +0100, Michal Hocko wrote:
> I agree with Kirill here but I would be interested in the underlying
> usecase that triggered this. The test case is clearly artificial but is
> any userspace actually relying on MADV_DONTNEED reducing the rss
> longterm?
> 
> > For sparse mappings, consider using MADV_NOHUGEPAGE.

Should the MADV_DONTNEED hint imply MADV_NOHUGEPAGE?  It'd prevent
coalescing elsewhere in the VMA, so that might negatively affect other
programs.
Comment 8 Michal Hocko 2019-01-03 14:41:11 UTC
On Thu 03-01-19 06:35:02, Matthew Wilcox wrote:
> Should the MADV_DONTNEED hint imply MADV_NOHUGEPAGE?  It'd prevent
> coalescing elsewhere in the VMA, so that might negatively affect other
> programs.

I really do not think this is a good idea. MADV_DONTNEED doesn't really
imply anything about future rss. It only wipes out the current content.
In other words, do we want to stop fault-around/readahead or any other
optimistic faulting on MADV_DONTNEED?
Comment 9 Kirill A. Shutemov 2019-01-03 14:53:09 UTC
On Thu, Jan 03, 2019 at 06:35:02AM -0800, Matthew Wilcox wrote:
> Should the MADV_DONTNEED hint imply MADV_NOHUGEPAGE?  It'd prevent
> coalescing elsewhere in the VMA, so that might negatively affect other
> programs.

MADV_NOHUGEPAGE often creates a new VMA (or two), which has performance
implications. And creating a new VMA would require down_write(mmap_sem),
which is a no-go for MADV_DONTNEED.
Comment 10 jianpanlanyue 2019-01-08 04:07:31 UTC
> I agree with Kirill here but I would be interested in the underlying
> usecase that triggered this. The test case is clearly artificial but is
> any userspace actually relying on MADV_DONTNEED reducing the rss
> longterm?
>

Yes. User-space memory pools and the garbage collectors of some languages often use MADV_DONTNEED instead of free() (or munmap()) to improve performance, e.g. tcmalloc, jemalloc and golang. Below are the problems they encountered, which are the same as mine.

jemalloc: https://github.com/jemalloc/jemalloc/issues/1127
tcmalloc: https://github.com/gperftools/gperftools/issues/990
golang:   https://bugzilla.kernel.org/show_bug.cgi?id=93111
         (https://github.com/golang/go/issues/8832)


Strangely, this problem no longer exists after kernel 4.15.0. Has it already been fixed?
Comment 11 jianpanlanyue 2019-02-01 11:58:16 UTC
?
