Most recent kernel where this bug did not occur: 2.4
Distribution: FC4
Hardware Environment: AMD64, 1GB memory
Software Environment: FC4 AMD64

Problem Description:
An application that uses page write protection to implement a write barrier for garbage collection shows very poor performance when the amount of memory used approaches the amount of system memory. It would be expected that the system will start swapping, but instead the system responds very slowly and is subject to freezes lasting more than 10 seconds. On 2.4 kernels the same code proceeds to swap, but makes good progress without the system losing responsiveness.

Steps to reproduce:
The following code reproduces the problem.

/*
 * Demonstration of freezing that occurs under Linux 2.6 kernels. When run,
 * this program takes much longer to complete on 2.6 kernels than on 2.4
 * kernels, and causes the Linux system to freeze and stop responding on
 * 2.6 kernels.
 *
 * The number of pages used will need to be adjusted to cause the onset of
 * swapping. The NUM_PAGES setting below demonstrates the issue on a system
 * with 1GB memory.
 *
 * Note that to run this code the kernel vm parameter max_map_count must be
 * larger than 2 * NUM_PAGES below, otherwise the mprotect system call may
 * fail:
 *
 *   echo 500000 > /proc/sys/vm/max_map_count
 *
 * Compiling: gcc -m32 -o linux-freeze linux-freeze.c
 */

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/mman.h>

/*
 * For a system with 1GB main memory:
 *   1171MB - 300000 pages.
 *
 * For a system with 4GB main memory:
 *   3906MB - 1000000 pages.
 */
#define NUM_PAGES 300000

/* #define BASE 0x4000000000L */
#define BASE 0x48000000

int main()
{
    unsigned i, page;

    if (mmap((void *) BASE, NUM_PAGES * 4096,
             PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0) == (void *) -1)
        perror("mmap error");

    /* Write to every second page. */
    for (page = 0; page < NUM_PAGES; page += 2) {
        long *addr = (long *) ((void *) BASE + 4096 * page);
        *addr = 1;
    }

    printf(" Write protecting...\n");
    for (page = 0; page < NUM_PAGES; page++) {
        long *addr = (long *) ((void *) BASE + 4096 * page);
        if (mprotect(addr, 4096, PROT_READ | PROT_EXEC) == -1)
            perror("mprotect error");
    }

    printf(" Writing...\n");
    for (i = 0; i < NUM_PAGES; i++) {
        unsigned page = random() % NUM_PAGES;
        long *addr = (long *) ((void *) BASE + 4096 * page);
        if (mprotect(addr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC) == -1)
            perror("mprotect error");
        (*addr)++;
    }

    return 0;
}
Hi. I (Scott Burson, Scott@ergy.com) am the customer of Douglas' who ran into this problem. Let me give you some context. I am working on a project that is coming to involve running very large Lisp processes using Douglas' 64-bit Common Lisp. By "large", I mean that I'd like to run process sizes upwards of 10GB, eventually maybe 50GB. I have a dual Opteron box (though this problem doesn't appear to be SMP-related) with 4GB of DRAM. When I first saw the problem, my heap size was around 3GB, and the system was already bogged down badly (taking several seconds just to change window focus, for example). As the process continued to grow toward 3.4GB, the system eventually became completely unresponsive and I had to reboot it. I did have a `top' running, and it managed to make occasional updates, which showed over 99.9% of the CPU time in the kernel, and zero swap usage: the process was still entirely in memory. I hope it's not too presumptuous of me to bump this bug up to "blocking". I really can't make much progress on my project until it is resolved. On the other hand, I'm happy to help in any way I can -- in fact [searches Google for "linux kernel profiling"] let me see if OProfile turns up any useful info.
Can confirm this issue also occurs on the 2.6.14-rc5 and 2.6.14-rc5-mm1 kernels.
yup, thanks - this looks like a search complexity failure in the reverse mapping code. I suspect we're screwed. Let me work on it a bit.
http://bugzilla.kernel.org/show_bug.cgi?id=5493

This real-world application is failing due to the new rmap code. There are a large number of vmas and the linear searches in rmap.c are completely killing us.

Profile:

c01608f0 __link_path_walk                 1  0.0003
c016ace8 __d_lookup                       1  0.0036
c0191b70 gcc2_compiled.                   1  0.0071
c023a368 _raw_spin_unlock                 1  0.0078
c02fe410 ide_inb                          1  0.0625
c0117580 write_profile                    2  0.0377
c013da20 kmem_flagcheck                   2  0.0385
c013e4c8 kmem_cache_alloc                 2  0.0143
c01406d4 shrink_list                      2  0.0020
c0143a10 zap_pte_range                    2  0.0032
c014ba50 get_swap_page                    2  0.0032
c0236634 radix_tree_preload               2  0.0179
c02da3e0 generic_make_request             2  0.0044
c02fe468 ide_outb                         2  0.1250
c041f47c _write_unlock_irqrestore         2  0.0833
c01395c0 __alloc_pages                    3  0.0031
c0139ccc __mod_page_state                 3  0.1250
c013e150 cache_alloc_debugcheck_after     3  0.0112
c0144910 do_wp_page                       3  0.0037
c0149ec0 page_referenced                  3  0.0197
c014bcbc swap_info_get                    3  0.0197
c015807c bio_alloc                        3  0.0750
c041f494 _write_unlock_irq                4  0.2000
c013cac4 check_poison_obj                 5  0.0137
c0142bd0 page_address                     9  0.0625
c041f890 do_page_fault                    9  0.0054
c0149bbc page_check_address              11  0.0573
c041f40c _spin_unlock_irq                18  0.9000
c0139380 buffered_rmqueue                32  0.0748
c014a024 try_to_unmap_one               782  1.7455
c014a400 try_to_unmap_anon             1511 11.1103
c0149c7c page_referenced_one           4169 17.0861
c0149d70 page_referenced_anon          8498 62.4853
00000000 total                        15109  0.0046
Presumably if they set up separate spaces to start with, rather than one huge one, and then try to punch holes in it, it'd work better.
Hugh initially worried about anon_vma preventing vma_merge when mprotect changes permissions page-by-page in sequence. That got fixed along with the objrmap patches, so the "Write protecting..." case in the test program should be okay: at the end of it, we should have only one vma on the anon_vma list (modulo vma_merge or mprotect bugs?).

However, the "Writing..." case is interesting because pages are randomly chosen and mprotected. This can lead to a long anon_vma list, and vma merging cannot help here.

We can convert anon_vma into a prio_tree. The main complication is that a vma can be on both the anon list and the prio_tree; these are normally file-backed MAP_PRIVATE vmas or tmpfs vmas. I think we can work out something like below.

For file-backed anon page rmap:

* Store a pointer to the address_space (or file) in struct anon_vma. Use that to get from an anon page to the appropriate prio_tree, and use the prio_tree to find similar vmas. How to choose similar vmas? Actually, vma->anon_vma should be the same as page->anon_vma.

For non-file-backed (MAP_ANON|MAP_PRIVATE) anon page rmap:

* vma->shared is unused in this case. Use it for the prio_tree; struct anon_vma can point to the head of the tree.

It involves some serious surgery, though. Moreover, I may be missing something totally. Thanks.
Can confirm that an easy workaround is to not allocate just one large area. Splitting the area avoids the problem. Customer has been provided with this workaround.
Unfortunately the customer reports that this workaround is not effective.
A good workaround for the demo program, linux-freeze.c, is simply to replace its MAP_PRIVATE by MAP_SHARED. Could that be done in the real application?
Hugh's comment is certainly interesting. It was not immediately obvious to me, on reading it, what a shared anonymous mapping does -- the `mmap' man page says that it's "implemented", but doesn't say what it does! But I found `shmem_zero_setup' and `shmem_file_setup', and I see that it makes a file in tmpfs to back the requested memory region. I also see a comment in the declaration of `vm_area_struct' to the effect that a shared mapping cannot be in an `anon_vma' list.

So far so good, but this raises a couple of questions:

() Why doesn't the mmap code then convert _all_ anonymous mappings to shared? Wouldn't that obviate the need for the entire `anon_vma' reverse mapping mechanism?

() Are there, for instance, possible performance or other reasons why swapping via tmpfs is not such a good idea?

() Will there be any problem if this tmpfs file grows to tens of GB, presuming of course that there is that much swap space?

I can confirm that the example program posted above, changed to use `MAP_SHARED', no longer manifests the problem. I've asked Douglas to make this change in his Lisp also. We'll see shortly what happens.
bugme-daemon@kernel-bugs.osdl.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5493
>
> ------- Additional Comments From Scott@ergy.com 2005-10-27 17:59 -------
> Hugh's comment is certainly interesting. It was not immediately obvious to me,
> on reading it, what a shared anonymous mapping does -- the `mmap' man page says
> that it's "implemented", but doesn't say what it does! But I found
> `shmem_zero_setup' and `shmem_file_setup', and I see that it makes a file in the
> tmpfs to back the requested memory region. I also see a comment in the
> declaration of `vm_area_struct' to the effect that a shared mapping cannot be in
> an `anon_vma' list.

He means that the memory should be allocated via mmap(flags=MAP_ANON). There's no need to open any file. The kernel will internally create tmpfs backing for the memory.

> So far so good, but this raises a couple of questions:
>
> () Why doesn't the mmap code then convert _all_ anonymous mappings to shared?

I could, I suppose. That would require that suitable tmpfs mounts be made.

> Wouldn't that obviate the need for the entire `anon_vma' reverse mapping mechanism?
>
> () Are there, for instance, possible performance or other reasons why swapping
> via tmpfs is not such a good idea?

Not many. However SMP scalability of MAP_ANON-backed pages during pagefaulting is very poor. Normal anon memory doesn't have this problem.

> () Will there be any problem if this tmpfs file grows to tens of GB, presuming
> of course that there is that much swap space?

Shouldn't be.

> I can confirm that the example program posted above, changed to use
> `MAP_SHARED', no longer manifests the problem. I've asked Douglas to make this
> change in his Lisp also. We'll see shortly what happens.

Try MAP_ANON.
> http://cvs-mirror.mozilla.org/webtools/bugzilla/show_bug.cgi?id=5493
> ------- Additional Comments From akpm@osdl.org 2005-10-27 18:07 -------
> > ------- Additional Comments From Scott@ergy.com 2005-10-27 17:59 -------
> > Hugh's comment is certainly interesting. It was not immediately obvious
> > to me, on reading it, what a shared anonymous mapping does
>
> He means that the memory should be allocated via mmap(flags=MAP_ANON).
> There's no need to open any file. The kernel will internally create tmpfs
> backing for the memory.

I think there may be some confusion. `MAP_ANON | MAP_PRIVATE' is what we have been using. Hugh suggested `MAP_ANON | MAP_SHARED'.

> > () Why doesn't the mmap code then convert _all_ [private] anonymous
> > mappings to shared?
>
> That would require that suitable tmpfs mounts be made.

Ah, okay, you don't want to count on tmpfs being mounted. Fair enough.

> SMP scalability of MAP_ANON-backed pages during
> pagefaulting is very poor. Normal anon memory doesn't have this problem.

I assume you mean, SMP scalability of tmpfs-backed pages? Okay, well, I guess I'll find out how bad this is. It can't be nearly as bad as the problem I'm having now.
Okay, running with `MAP_ANON | MAP_SHARED' proves to be a considerable improvement. On the other hand, the problem is not solved completely. There are still times, apparently during large garbage collections, that the machine bogs down substantially and `top' shows upwards of 98% of the time being spent in the kernel. (A lot of it is going to `kswapd'; I've seen this even before the onset of actual paging to the disk.) Sometimes it even becomes fairly unresponsive, but unlike before, it recovers after a couple of minutes. However, I think this will be enough of an improvement that I can continue my development efforts. Let me switch to a recent kernel (I'm still on SLES 9 SP2, a patched 2.6.5) and do some profiling. I'll get back to you with the results in a few days.
Hi Scott, Any new developments/observations with the mprotect problem? Have you been running with the suggested solution ever since? How about newer kernel, have you tried it? Thanks, --Natalie
My recollection, which is a bit fuzzy, is that I did try a newer kernel (the one in SLES 10). It still didn't work, so I gave up and switched to Solaris. Solaris has no trace of this problem; I've run 35GB heaps without difficulty.
Still causes significant indigestion in 2.6.29rc8
Reply-To: akpm@linux-foundation.org

On Tue, 17 Mar 2009 08:47:24 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=5493
>
> alan@lxorguk.ukuu.org.uk changed:
>
>            What    |Removed     |Added
> ----------------------------------------------------------------------------
>  KernelVersion     |2.6.13      |2.6.29
>
> ------- Comment #16 from alan@lxorguk.ukuu.org.uk 2009-03-17 08:47 -------
> Still causes significant indigestion in 2.6.29rc8

I don't think we know how to fix this :(
OMG, still open after 4 years :/
I've taken the time to come up with a patch that a) makes the vma prio trees somewhat more reusable, and b) uses a prio tree for the anon_vma lists as well. Timings for the test program provided by the original reporter improve significantly. See http://marc.info/?l=linux-kernel&m=126778234032288&w=2 for the patch.
Thank you!!!!
Updated and fixed patch at: http://marc.info/?l=linux-kernel&m=126847717927202&w=2 http://marc.info/?l=linux-kernel&m=126847735627431&w=2 http://marc.info/?l=linux-kernel&m=126847748527584&w=2
Thanks a lot!
What is the reason for this patch getting so little attention?