Bug 5493

Summary: [PATCH] mprotect usage causing slow system performance and freezing
Product: Memory Management    Reporter: Douglas Crosher (dtc)
Component: Other              Assignee: Andrew Morton (akpm)
Status: NEW
Severity: low                 CC: alan, bunk, eugeneteo, linuxhippy, lkbugs, protasnb, Scott, vrajesh
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.31        Subsystem:
Regression: Yes               Bisected commit-id:

Description Douglas Crosher 2005-10-25 04:04:51 UTC
Most recent kernel where this bug did not occur: 2.4
Distribution: FC4
Hardware Environment: AMD64 1GB memory.
Software Environment: FC4 AMD64
Problem Description: 

An application that uses page write protection to implement a write
barrier for garbage collection shows very poor performance when the
amount of memory used approaches the amount of system memory.  One
would expect the system to start swapping, but instead it responds
very slowly and is subject to freezes lasting longer than 10 seconds.
On 2.4 kernels the same code proceeds to swap, but makes good
progress, without the system losing responsiveness.


Steps to reproduce:

The following code reproduces the problem.

/*
 * Demonstration of freezing that occurs under Linux 2.6 kernels.  When run
 * this program takes much longer to complete on 2.6 kernels compared with 2.4
 * kernels, and causes freezing of response from the Linux system on 2.6
 * kernels.
 *
 * The number of pages used will need to be adjusted to cause the onset of
 * swapping. The NUM_PAGES setting below demonstrates the issue on a system
 * with 1GB memory.
 *
 * Note that to run this code it will be necessary for the kernel vm parameter
 * max_map_count to be larger than 2 * NUM_PAGES below, otherwise the
 * mprotect system call may fail.
 *
 *   echo 500000 > /proc/sys/vm/max_map_count
 *
 * Compiling: gcc -m32 -o linux-freeze linux-freeze.c
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/mman.h>

/*
 * For a system with 1G Byte main memory:
 *   1171M -  300000 pages.
 *
 * For a system with 4G Byte main memory:
 *   3906M - 1000000 pages
 */

#define NUM_PAGES 300000

/* #define BASE 0x4000000000L*/
#define BASE 0x48000000

int main()
{
  unsigned  i, page;

  if (mmap((void *) BASE, (size_t) NUM_PAGES * 4096,
	   PROT_READ | PROT_WRITE | PROT_EXEC,
	   MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED)
    perror("mmap error");

  /*
   * Write to every second page.
   */

  for (page = 0; page < NUM_PAGES; page += 2)
    {
      long  *addr = (long *) ((char *) BASE + 4096L * page);
      *addr = 1;
    }

  printf("  Write protecting...\n");
  for (page = 0; page < NUM_PAGES; page++)
    {
      long  *addr = (long *) ((char *) BASE + 4096L * page);

      if (mprotect(addr, 4096, PROT_READ | PROT_EXEC) == -1)
	perror("mprotect error");
    }

  printf("  Writing...\n");
  for (i = 0; i < NUM_PAGES; i++)
    {
      unsigned  page = random() % NUM_PAGES;
      long  *addr = (long *) ((char *) BASE + 4096L * page);

      if (mprotect(addr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC) == -1)
	perror("mprotect error");
      
      (*addr)++;
    }

  return 0;
}
Comment 1 Scott L. Burson 2005-10-25 19:01:11 UTC
Hi.  I (Scott Burson, Scott@ergy.com) am the customer of Douglas' who ran into
this problem.  Let me give you some context.  I am working on a project that
involves running very large Lisp processes using Douglas' 64-bit Common Lisp.
By "large", I mean that I'd like to run process sizes upwards of 10GB,
eventually maybe 50GB.

I have a dual Opteron box (though this problem doesn't appear to be SMP-related)
with 4GB of DRAM.  When I first saw the problem, my heap size was around 3GB,
and the system was already bogged down badly (taking several seconds just to
change window focus, for example).  As the process continued to grow toward
3.4GB, the system eventually became completely unresponsive and I had to reboot
it.  I did have a `top' running, and it managed to make occasional updates,
which showed over 99.9% of the CPU time in the kernel, and zero swap usage: the
process was still entirely in memory.

I hope it's not too presumptuous of me to bump this bug up to "blocking".  I
really can't make much progress on my project until it is resolved.  On the
other hand, I'm happy to help in any way I can -- in fact [searches Google for
"linux kernel profiling"] let me see if OProfile turns up any useful info.
Comment 2 Douglas Crosher 2005-10-25 19:33:09 UTC
Can confirm this issue also occurs on the 2.6.14-rc5 and 2.6.14-rc5-mm1 kernels.
Comment 3 Andrew Morton 2005-10-25 20:34:58 UTC
yup, thanks - this looks like a search complexity failure in the reverse
mapping code.  I suspect we're screwed.  Let me work on it a bit.

Comment 4 Andrew Morton 2005-10-25 21:49:35 UTC
 http://bugzilla.kernel.org/show_bug.cgi?id=5493

This real-world application is failing due to the new rmap code.  There are
a large number of vmas and the linear searches in rmap.c are completely
killing us.

Profile:

c01608f0 __link_path_walk                              1   0.0003
c016ace8 __d_lookup                                    1   0.0036
c0191b70 gcc2_compiled.                                1   0.0071
c023a368 _raw_spin_unlock                              1   0.0078
c02fe410 ide_inb                                       1   0.0625
c0117580 write_profile                                 2   0.0377
c013da20 kmem_flagcheck                                2   0.0385
c013e4c8 kmem_cache_alloc                              2   0.0143
c01406d4 shrink_list                                   2   0.0020
c0143a10 zap_pte_range                                 2   0.0032
c014ba50 get_swap_page                                 2   0.0032
c0236634 radix_tree_preload                            2   0.0179
c02da3e0 generic_make_request                          2   0.0044
c02fe468 ide_outb                                      2   0.1250
c041f47c _write_unlock_irqrestore                      2   0.0833
c01395c0 __alloc_pages                                 3   0.0031
c0139ccc __mod_page_state                              3   0.1250
c013e150 cache_alloc_debugcheck_after                  3   0.0112
c0144910 do_wp_page                                    3   0.0037
c0149ec0 page_referenced                               3   0.0197
c014bcbc swap_info_get                                 3   0.0197
c015807c bio_alloc                                     3   0.0750
c041f494 _write_unlock_irq                             4   0.2000
c013cac4 check_poison_obj                              5   0.0137
c0142bd0 page_address                                  9   0.0625
c041f890 do_page_fault                                 9   0.0054
c0149bbc page_check_address                           11   0.0573
c041f40c _spin_unlock_irq                             18   0.9000
c0139380 buffered_rmqueue                             32   0.0748
c014a024 try_to_unmap_one                            782   1.7455
c014a400 try_to_unmap_anon                          1511  11.1103
c0149c7c page_referenced_one                        4169  17.0861
c0149d70 page_referenced_anon                       8498  62.4853
00000000 total                                     15109   0.0046

Comment 5 Martin J. Bligh 2005-10-25 23:09:52 UTC
Presumably if they set up separate spaces to start with, rather than one huge
one that they then punch holes in, it'd work better.
Comment 6 Rajesh Venkatasubramanian 2005-10-26 00:02:52 UTC
Hugh initially worried about anon_vma preventing vma_merge
when mprotect changes permissions page-by-page in sequence.
That got fixed along with the objrmap patches.  So the "Write
protecting..." case in the test program should be okay.
At the end of it, we should have only one vma on
the anon_vma list (modulo vma_merge or mprotect bugs?).

However, the "Writing..." case is interesting because pages
are randomly chosen and mprotected.  This can lead to a long
anon_vma list -- vma merge cannot help here.

We can convert anon_vma into prio_tree. The main complication
is that a vma can be both on anon list and prio_tree. These
are normally file-backed MAP_PRIVATE vmas or tmpfs vmas.

I think we can work out something like below.

For file-backed anon page rmap:

* Store a pointer to address_space (or file) in struct anon_vma.
  Use that to get to the appropriate prio_tree from an anon page.
  Use prio_tree to find similar vmas. How to choose similar vmas?
  Actually, vma->anon_vma should be same as page->anon_vma.

For non file-backed (MAP_ANON|MAP_PRIVATE) anon page rmap:

* vma->shared is unused in this case.  Use it for the prio_tree.
  struct anon_vma can point to the head of the tree.

It involves some serious surgery, though. Moreover, I may be
missing something totally.

Thanks.
Comment 7 Douglas Crosher 2005-10-26 17:31:08 UTC
Can confirm that an easy workaround is to not allocate just one
large area.  Splitting the area avoids the problem.  The customer has
been provided with this workaround.
Comment 8 Douglas Crosher 2005-10-26 22:43:43 UTC
Unfortunately the customer reports that this workaround is not effective.
Comment 9 Hugh Dickins 2005-10-27 10:10:25 UTC
A good workaround for the demo program, linux-freeze.c, is simply to replace its
MAP_PRIVATE by MAP_SHARED.  Could that be done in the real application?
Comment 10 Scott L. Burson 2005-10-27 17:59:09 UTC
Hugh's comment is certainly interesting.  It was not immediately obvious to me,
on reading it, what a shared anonymous mapping does -- the `mmap' man page says
that it's "implemented", but doesn't say what it does!  But I found
`shmem_zero_setup' and `shmem_file_setup', and I see that it makes a file in the
tmpfs to back the requested memory region.  I also see a comment in the
declaration of `vm_area_struct' to the effect that a shared mapping cannot be in
an `anon_vma' list.

So far so good, but this raises a couple of questions:

() Why doesn't the mmap code then convert _all_ anonymous mappings to shared? 
Wouldn't that obviate the need for the entire `anon_vma' reverse mapping mechanism?

() Are there, for instance, possible performance or other reasons why swapping
via tmpfs is not such a good idea?

() Will there be any problem if this tmpfs file grows to tens of GB, presuming
of course that there is that much swap space?

I can confirm that the example program posted above, changed to use
`MAP_SHARED', no longer manifests the problem.  I've asked Douglas to make this
change in his Lisp also.  We'll see shortly what happens.

Comment 11 Andrew Morton 2005-10-27 18:07:54 UTC
bugme-daemon@kernel-bugs.osdl.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5493
> 
> ------- Additional Comments From Scott@ergy.com  2005-10-27 17:59 -------
> Hugh's comment is certainly interesting.  It was not immediately obvious to me,
> on reading it, what a shared anonymous mapping does -- the `mmap' man page says
> that it's "implemented", but doesn't say what it does!  But I found
> `shmem_zero_setup' and `shmem_file_setup', and I see that it makes a file in the
> tmpfs to back the requested memory region.  I also see a comment in the
> declaration of `vm_area_struct' to the effect that a shared mapping cannot be in
> an `anon_vma' list.

He means that the memory should be allocated via mmap(flags=MAP_ANON). 
There's no need to open any file.  The kernel will internally create tmpfs
backing for the memory.

> So far so good, but this raises a couple of questions:
> 
> () Why doesn't the mmap code then convert _all_ anonymous mappings to shared? 

It could, I suppose.  That would require that suitable tmpfs mounts be made.

> Wouldn't that obviate the need for the entire `anon_vma' reverse mapping mechanism?
> 
> () Are there, for instance, possible performance or other reasons why swapping
> via tmpfs is not such a good idea?

Not many.  However SMP scalability of MAP_ANON-backed pages during
pagefaulting is very poor.  Normal anon memory doesn't have this problem.

> () Will there be any problem if this tmpfs file grows to tens of GB, presuming
> of course that there is that much swap space?

Shouldn't be.

> I can confirm that the example program posted above, changed to use
> `MAP_SHARED', no longer manifests the problem.  I've asked Douglas to make this
> change in his Lisp also.  We'll see shortly what happens.

Try MAP_ANON.

Comment 12 Scott L. Burson 2005-10-27 19:03:21 UTC
> http://cvs-mirror.mozilla.org/webtools/bugzilla/show_bug.cgi?id=5493
> ------- Additional Comments From akpm@osdl.org  2005-10-27 18:07 -------
> > ------- Additional Comments From Scott@ergy.com  2005-10-27 17:59 -------
> > Hugh's comment is certainly interesting.  It was not immediately obvious
> > to me, on reading it, what a shared anonymous mapping does
>
> He means that the memory should be allocated via mmap(flags=MAP_ANON).
> There's no need to open any file.  The kernel will internally create tmpfs
> backing for the memory.

I think there may be some confusion.  `MAP_ANON | MAP_PRIVATE' is what we have 
been using.  Hugh suggested `MAP_ANON | MAP_SHARED'.

> > () Why doesn't the mmap code then convert _all_ [private] anonymous 
> > mappings to shared?
>
> That would require that suitable tmpfs mounts be made.

Ah, okay, you don't want to count on tmpfs being mounted.  Fair enough.

> SMP scalability of MAP_ANON-backed pages during
> pagefaulting is very poor.  Normal anon memory doesn't have this problem.

I assume you mean, SMP scalability of tmpfs-backed pages?  Okay, well, I guess 
I'll find out how bad this is.  It can't be nearly as bad as the problem I'm
having now.

Comment 13 Scott L. Burson 2005-10-28 09:40:34 UTC
Okay, running with `MAP_ANON | MAP_SHARED' proves to be a considerable improvement.

On the other hand, the problem is not solved completely.  There are still times,
apparently during large garbage collections, that the machine bogs down
substantially and `top' shows upwards of 98% of the time being spent in the
kernel.  (A lot of it is going to `kswapd'; I've seen this even before the onset
of actual paging to the disk.)  Sometimes it even becomes fairly unresponsive,
but unlike before, it recovers after a couple of minutes.

However, I think this will be enough of an improvement that I can continue my
development efforts.

Let me switch to a recent kernel (I'm still on SLES 9 SP2, a patched 2.6.5) and
do some profiling.  I'll get back to you with the results in a few days.
Comment 14 Natalie Protasevich 2007-05-22 13:00:06 UTC
Hi Scott,
Any new developments/observations with the mprotect problem? Have you been
running with the suggested solution ever since? How about newer kernel, have you
tried it?
Thanks,
--Natalie
Comment 15 Scott L. Burson 2007-05-22 14:21:36 UTC
My recollection, which is a bit fuzzy, is that I did try a newer kernel (the one in SLES 10).  It still didn't 
work, so I gave up and switched to Solaris.  Solaris has no trace of this problem; I've run 35GB heaps 
without difficulty.
Comment 16 Alan 2009-03-17 08:47:23 UTC
Still causes significant indigestion in 2.6.29rc8
Comment 17 Anonymous Emailer 2009-03-17 14:08:08 UTC
Reply-To: akpm@linux-foundation.org

On Tue, 17 Mar 2009 08:47:24 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=5493
> alan@lxorguk.ukuu.org.uk changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>       KernelVersion|2.6.13                      |2.6.29
> 
> ------- Comment #16 from alan@lxorguk.ukuu.org.uk  2009-03-17 08:47 -------
> Still causes significant indigestion in 2.6.29rc8

I don't think we know how to fix this :(
Comment 18 Clemens Eisserer 2009-12-15 09:37:41 UTC
OMG, still open after 4 years :/
Comment 19 Christian Ehrhardt 2010-03-05 09:52:28 UTC
I've taken the time to come up with a patch that
a) makes the vma prio trees somewhat more reusable
b) uses a prio tree for the anon_vma lists as well.

Timings for the test program provided by the original reporter improve significantly.

See http://marc.info/?l=linux-kernel&m=126778234032288&w=2 for the patch.
Comment 20 Scott L. Burson 2010-03-05 20:59:45 UTC
Thank you!!!!
Comment 22 Clemens Eisserer 2010-03-16 09:37:11 UTC
Thanks a lot!
Comment 23 Clemens Eisserer 2012-02-18 02:30:23 UTC
What is the reason for this patch getting so little attention?