Bug 5493
| Summary: | [PATCH] mprotect usage causing slow system performance and freezing | | |
|---|---|---|---|
| Product: | Memory Management | Reporter: | Douglas Crosher (dtc) |
| Component: | Other | Assignee: | Andrew Morton (akpm) |
| Status: | NEW | | |
| Severity: | low | CC: | alan, bunk, eugeneteo, linuxhippy, lkbugs, protasnb, Scott, vrajesh |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.31 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Description
Douglas Crosher
2005-10-25 04:04:51 UTC
Hi. I (Scott Burson, Scott@ergy.com) am the customer of Douglas' who ran into this problem. Let me give you some context. I am working on a project that is coming to involve running very large Lisp processes using Douglas' 64-bit Common Lisp. By "large", I mean that I'd like to run process sizes upwards of 10GB, eventually maybe 50GB. I have a dual Opteron box (though this problem doesn't appear to be SMP-related) with 4GB of DRAM. When I first saw the problem, my heap size was around 3GB, and the system was already bogged down badly (taking several seconds just to change window focus, for example). As the process continued to grow toward 3.4GB, the system eventually became completely unresponsive and I had to reboot it. I did have a `top' running, and it managed to make occasional updates, which showed over 99.9% of the CPU time in the kernel, and zero swap usage: the process was still entirely in memory.

I hope it's not too presumptuous of me to bump this bug up to "blocking". I really can't make much progress on my project until it is resolved. On the other hand, I'm happy to help in any way I can -- in fact [searches Google for "linux kernel profiling"] let me see if OProfile turns up any useful info.

Can confirm this issue also occurs on the 2.6.14-rc5 and 2.6.14-rc5-mm1 kernels.

Yup, thanks - this looks like a search complexity failure in the reverse mapping code. I suspect we're screwed. Let me work on it a bit.

http://bugzilla.kernel.org/show_bug.cgi?id=5493

This real-world application is failing due to the new rmap code. There are a large number of vmas and the linear searches in rmap.c are completely killing us. Profile:

    c01608f0 __link_path_walk                  1  0.0003
    c016ace8 __d_lookup                        1  0.0036
    c0191b70 gcc2_compiled.                    1  0.0071
    c023a368 _raw_spin_unlock                  1  0.0078
    c02fe410 ide_inb                           1  0.0625
    c0117580 write_profile                     2  0.0377
    c013da20 kmem_flagcheck                    2  0.0385
    c013e4c8 kmem_cache_alloc                  2  0.0143
    c01406d4 shrink_list                       2  0.0020
    c0143a10 zap_pte_range                     2  0.0032
    c014ba50 get_swap_page                     2  0.0032
    c0236634 radix_tree_preload                2  0.0179
    c02da3e0 generic_make_request              2  0.0044
    c02fe468 ide_outb                          2  0.1250
    c041f47c _write_unlock_irqrestore          2  0.0833
    c01395c0 __alloc_pages                     3  0.0031
    c0139ccc __mod_page_state                  3  0.1250
    c013e150 cache_alloc_debugcheck_after      3  0.0112
    c0144910 do_wp_page                        3  0.0037
    c0149ec0 page_referenced                   3  0.0197
    c014bcbc swap_info_get                     3  0.0197
    c015807c bio_alloc                         3  0.0750
    c041f494 _write_unlock_irq                 4  0.2000
    c013cac4 check_poison_obj                  5  0.0137
    c0142bd0 page_address                      9  0.0625
    c041f890 do_page_fault                     9  0.0054
    c0149bbc page_check_address               11  0.0573
    c041f40c _spin_unlock_irq                 18  0.9000
    c0139380 buffered_rmqueue                 32  0.0748
    c014a024 try_to_unmap_one                782  1.7455
    c014a400 try_to_unmap_anon              1511  11.1103
    c0149c7c page_referenced_one            4169  17.0861
    c0149d70 page_referenced_anon           8498  62.4853
    00000000 total                         15109  0.0046

Presumably if they set up separate spaces to start with, rather than one huge one, and then try to punch holes in it, it'd work better.

Hugh initially worried about anon_vma preventing vma_merge when mprotect changes permissions page-by-page in sequence. That got fixed along with the objrmap patches. So the "Write protecting..." case in the test program should be okay: at the end of it, we should have only one vma on the anon_vma list (modulo vma_merge or mprotect bugs?). However, the "Writing..." case is interesting because pages are randomly chosen and mprotected. This can lead to a long anon_vma list -- vma merging cannot help here. We can convert anon_vma into a prio_tree.
The main complication is that a vma can be on both the anon list and the prio_tree. These are normally file-backed MAP_PRIVATE vmas or tmpfs vmas. I think we can work out something like below.

For file-backed anon page rmap:

* Store a pointer to the address_space (or file) in struct anon_vma. Use that to get to the appropriate prio_tree from an anon page. Use the prio_tree to find similar vmas. How to choose similar vmas? Actually, vma->anon_vma should be the same as page->anon_vma.

For non-file-backed (MAP_ANON|MAP_PRIVATE) anon page rmap:

* vma->shared is unused in this case. Use it for the prio_tree. struct anon_vma can point to the head of the tree.

It involves some serious surgery, though. Moreover, I may be missing something totally. Thanks.

Can confirm that an easy workaround is to not allocate just one large area. Splitting the area avoids the problem. The customer has been provided with this workaround.

Unfortunately the customer reports that this workaround is not effective.

A good workaround for the demo program, linux-freeze.c, is simply to replace its MAP_PRIVATE by MAP_SHARED. Could that be done in the real application?

Hugh's comment is certainly interesting. It was not immediately obvious to me, on reading it, what a shared anonymous mapping does -- the `mmap' man page says that it's "implemented", but doesn't say what it does! But I found `shmem_zero_setup' and `shmem_file_setup', and I see that it makes a file in the tmpfs to back the requested memory region. I also see a comment in the declaration of `vm_area_struct' to the effect that a shared mapping cannot be in an `anon_vma' list. So far so good, but this raises a couple of questions:

* Why doesn't the mmap code then convert _all_ anonymous mappings to shared? Wouldn't that obviate the need for the entire `anon_vma' reverse mapping mechanism?
* Are there, for instance, possible performance or other reasons why swapping via tmpfs is not such a good idea?
* Will there be any problem if this tmpfs file grows to tens of GB, presuming of course that there is that much swap space?

I can confirm that the example program posted above, changed to use `MAP_SHARED', no longer manifests the problem. I've asked Douglas to make this change in his Lisp also. We'll see shortly what happens.

bugme-daemon@kernel-bugs.osdl.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=5493
>
> ------- Additional Comments From Scott@ergy.com 2005-10-27 17:59 -------
> Hugh's comment is certainly interesting. It was not immediately obvious to me,
> on reading it, what a shared anonymous mapping does -- the `mmap' man page says
> that it's "implemented", but doesn't say what it does! But I found
> `shmem_zero_setup' and `shmem_file_setup', and I see that it makes a file in the
> tmpfs to back the requested memory region. I also see a comment in the
> declaration of `vm_area_struct' to the effect that a shared mapping cannot be in
> an `anon_vma' list.

He means that the memory should be allocated via mmap(flags=MAP_ANON). There's no need to open any file. The kernel will internally create tmpfs backing for the memory.

> So far so good, but this raises a couple of questions:
>
> () Why doesn't the mmap code then convert _all_ anonymous mappings to shared?

I could, I suppose. That would require that suitable tmpfs mounts be made.

> Wouldn't that obviate the need for the entire `anon_vma' reverse mapping mechanism?
>
> () Are there, for instance, possible performance or other reasons why swapping
> via tmpfs is not such a good idea?

Not many.
However, SMP scalability of MAP_ANON-backed pages during pagefaulting is very poor. Normal anon memory doesn't have this problem.

> () Will there be any problem if this tmpfs file grows to tens of GB, presuming
> of course that there is that much swap space?

Shouldn't be.

> I can confirm that the example program posted above, changed to use
> `MAP_SHARED', no longer manifests the problem. I've asked Douglas to make this
> change in his Lisp also. We'll see shortly what happens.

Try MAP_ANON.

> http://cvs-mirror.mozilla.org/webtools/bugzilla/show_bug.cgi?id=5493
>
> ------- Additional Comments From akpm@osdl.org 2005-10-27 18:07 -------
> > ------- Additional Comments From Scott@ergy.com 2005-10-27 17:59 -------
> > Hugh's comment is certainly interesting. It was not immediately obvious
> > to me, on reading it, what a shared anonymous mapping does
>
> He means that the memory should be allocated via mmap(flags=MAP_ANON).
> There's no need to open any file. The kernel will internally create tmpfs
> backing for the memory.

I think there may be some confusion. `MAP_ANON | MAP_PRIVATE' is what we have been using. Hugh suggested `MAP_ANON | MAP_SHARED'.

> > () Why doesn't the mmap code then convert _all_ [private] anonymous
> > mappings to shared?
>
> That would require that suitable tmpfs mounts be made.

Ah, okay, you don't want to count on tmpfs being mounted. Fair enough.

> SMP scalability of MAP_ANON-backed pages during
> pagefaulting is very poor. Normal anon memory doesn't have this problem.

I assume you mean, SMP scalability of tmpfs-backed pages? Okay, well, I guess I'll find out how bad this is. It can't be nearly as bad as the problem I'm having now.

Okay, running with `MAP_ANON | MAP_SHARED' proves to be a considerable improvement. On the other hand, the problem is not solved completely. There are still times, apparently during large garbage collections, when the machine bogs down substantially and `top' shows upwards of 98% of the time being spent in the kernel. (A lot of it is going to `kswapd'; I've seen this even before the onset of actual paging to disk.) Sometimes it even becomes fairly unresponsive, but unlike before, it recovers after a couple of minutes. However, I think this will be enough of an improvement that I can continue my development efforts. Let me switch to a recent kernel (I'm still on SLES 9 SP2, a patched 2.6.5) and do some profiling. I'll get back to you with the results in a few days.

Hi Scott, any new developments/observations with the mprotect problem? Have you been running with the suggested solution ever since? How about a newer kernel, have you tried it? Thanks, --Natalie

My recollection, which is a bit fuzzy, is that I did try a newer kernel (the one in SLES 10). It still didn't work, so I gave up and switched to Solaris. Solaris has no trace of this problem; I've run 35GB heaps without difficulty.
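(For context, here is a minimal sketch of the access pattern under discussion: one large MAP_PRIVATE|MAP_ANON arena whose pages are mprotected at random offsets, fragmenting it into many vmas that all share a single anon_vma. The actual linux-freeze.c test program is not attached to this report, so the arena size, iteration count, and protection pattern below are illustrative assumptions only. Hugh's suggested workaround corresponds to switching MAP_PRIVATE to MAP_SHARED in the mmap call.)

```c
/* Illustrative sketch only -- not the original linux-freeze.c.
 * A single large private anonymous arena is populated, then single
 * pages at random offsets are write-protected or write-enabled.
 * Each page-sized mprotect at a random offset can split a vma, so
 * the mapping fragments into many vmas sharing one anon_vma, which
 * the rmap code of that era walked linearly.
 * Hugh's workaround: change MAP_PRIVATE to MAP_SHARED below. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define ARENA_SIZE (1UL << 30)   /* 1 GB; the real heaps were several GB */
#define TOUCHES    100000        /* number of random mprotect calls */

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = ARENA_SIZE / page;

    char *arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touch every page so the whole arena is actually instantiated. */
    for (size_t i = 0; i < npages; i++)
        arena[i * page] = 1;

    /* Alternately write-protect and write-enable randomly chosen pages,
     * leaving a patchwork of protections that cannot merge back into a
     * single vma. */
    srandom(1);
    for (long i = 0; i < TOUCHES; i++) {
        size_t p = (size_t)random() % npages;
        int prot = (i & 1) ? PROT_READ : (PROT_READ | PROT_WRITE);

        if (mprotect(arena + p * page, page, prot) != 0) {
            perror("mprotect");
            return 1;
        }
    }
    return 0;
}
```

With MAP_SHARED the pages are tmpfs-backed and, as noted above, never enter an anon_vma list, which is presumably why the workaround sidestepped the linear searches in rmap.c.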
Still causes significant indigestion in 2.6.29rc8.

Reply-To: akpm@linux-foundation.org

On Tue, 17 Mar 2009 08:47:24 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=5493
>
> alan@lxorguk.ukuu.org.uk changed:
>
> What           |Removed  |Added
> ----------------------------------------------------------------------------
> KernelVersion  |2.6.13   |2.6.29
>
> ------- Comment #16 from alan@lxorguk.ukuu.org.uk 2009-03-17 08:47 -------
> Still causes significant indigestion in 2.6.29rc8

I don't think we know how to fix this :(

OMG, still open after 4 years :/ I've taken the time to come up with a patch that a) makes the vma prio trees somewhat more reusable, and b) uses a prio tree for the anon_vma lists as well. Timings for the test program provided by the original reporter improve significantly. See http://marc.info/?l=linux-kernel&m=126778234032288&w=2 for the patch.

Thank you!!!!

Updated and fixed patch at:
http://marc.info/?l=linux-kernel&m=126847717927202&w=2
http://marc.info/?l=linux-kernel&m=126847735627431&w=2
http://marc.info/?l=linux-kernel&m=126847748527584&w=2

Thanks a lot! What is the reason for this patch getting so little attention?
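(For anyone re-testing against a patched kernel: a crude way to watch the fragmentation from userspace, without kernel profiling, is to count the entries in /proc/self/maps while the protection pattern above is being applied. This diagnostic is an assumption of mine, not something suggested in the thread.)

```c
/* Count the lines in /proc/self/maps, which approximates the number of
 * vmas in the current process.  Calling this periodically from within
 * the reproducer sketch above shows the mapping count growing as the
 * random page-granularity mprotect calls fragment the arena. */
#include <stdio.h>

static long count_vmas(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    long n = 0;
    int c;

    if (!f)
        return -1;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            n++;
    fclose(f);
    return n;
}

int main(void)
{
    printf("vmas: %ld\n", count_vmas());
    return 0;
}
```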