Latest working kernel version: don't know
Earliest failing kernel version: don't know
Distribution: kernel.org
Hardware Environment: HP DL160 G5 w/ dual E5430's & 16GB PC2-5300 FB-DIMMs
Software Environment: F9

Problem Description:

can't shmat() 1GB hugepage segment from second process more than one time

Steps to reproduce:

create 1GB or more hugepage shmget/shmat segment
attached at explicit virtual address 0x4_00000000

run another program that attaches segment

run it again, fails

eventually get attached 'dmesg' output

works fine under RHEL 4.6
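A minimal reproducer along the lines of these steps might look like the sketch below. This is a hedged sketch, not the reporter's code: the key and permission bits are assumptions; the 1GB size and the 0x400000000 attach address follow the description here, and enough huge pages are assumed to have been reserved beforehand via the vm.nr_hugepages sysctl.

/* creator: hedged sketch of the segment-creating process described above */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000   /* from <linux/shm.h> */
#endif

#define SEG_KEY   0x1234                  /* hypothetical key */
#define SEG_SIZE  (1024UL << 20)          /* 1GB */
#define SEG_ADDR  ((void *)0x400000000UL)

int main(void)
{
    int id = shmget(SEG_KEY, SEG_SIZE, IPC_CREAT | SHM_HUGETLB | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    void *p = shmat(id, SEG_ADDR, 0);
    if (p == (void *)-1) { perror("shmat"); return 1; }

    memset(p, 0, SEG_SIZE);               /* touch every huge page */
    pause();                              /* hold the segment while other programs attach */
    return 0;
}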
Created attachment 19094 [details] dmesg output with errors
Reply-To: akpm@linux-foundation.org

(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Mon, 1 Dec 2008 18:01:39 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12134
>
> Summary: can't shmat() 1GB hugepage segment from second process
>          more than one time
> Product: Memory Management
> Version: 2.5
> KernelVersion: 2.6.27.7
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: high
> Priority: P1
> Component: Other
> AssignedTo: akpm@osdl.org
> ReportedBy: starlight@binnacle.cx
>
>
> Latest working kernel version: don't know
> Earliest failing kernel version: don't know
> Distribution: kernel.org
> Hardware Environment: HP DL160 G5 w/ dual E5430's & 16GB PC2-5300 FB-DIMMs
> Software Environment: F9
>
> Problem Description:
>
> can't shmat() 1GB hugepage segment from second process more than one time
>
> Steps to reproduce:
>
> create 1GB or more hugepage shmget/shmat segment
> attached at explicit virtual address 0x4_00000000
>
> run another program that attaches segment
>
> run it again, fails
>
> eventually get attached 'dmesg' output
>
> works fine under RHEL 4.6
On Mon, 2008-12-01 at 18:14 -0800, Andrew Morton wrote:
> > can't shmat() 1GB hugepage segment from second process more than one time
> >
> > Steps to reproduce:

starlight@binnacle.cx: I need more information to reproduce this bug. Please read on. I've tried these steps and haven't been able to reproduce. Are these reproduction steps actually a description of what a more complex program is doing, or have you reproduced this with simple C programs that implement nothing more than the instructions provided in this bug? It would make it easier to diagnose this if you could provide a simple C program that causes the bad behavior.

> > create 1GB or more hugepage shmget/shmat segment
> > attached at explicit virtual address 0x4_00000000

You must mean either 0x400000000 or 0x4000000000; please clarify. (I tried both addresses and was unable to reproduce.)

Are you touching any of the pages in the shared memory segment with this process? What flags are you passing to shmget and shmat? Could you provide an strace for each program run?

> > run another program that attaches segment

Does this second program do anything besides attaching the segment (i.e. faulting any of the huge pages)?

> > run it again, fails
> >
> > eventually get attached 'dmesg' output
> >
> > works fine under RHEL 4.6
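For reference, a second program of the kind asked about here can be as small as the following hedged sketch: it only looks up the existing segment (no IPC_CREAT), attaches it at the same fixed address, faults a single huge page, and detaches. The key, size, and address are assumptions matching the creator sketch above.

/* attacher: hedged sketch of a minimal second process */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SEG_KEY   0x1234                  /* hypothetical key */
#define SEG_SIZE  (1024UL << 20)          /* 1GB */
#define SEG_ADDR  ((void *)0x400000000UL)

int main(void)
{
    int id = shmget(SEG_KEY, SEG_SIZE, 0600);   /* lookup only, no IPC_CREAT */
    if (id < 0) { perror("shmget"); return 1; }

    void *p = shmat(id, SEG_ADDR, 0);
    if (p == (void *)-1) { perror("shmat"); return 1; }

    char c = *(volatile char *)p;               /* fault in one huge page */
    (void)c;

    if (shmdt(p) < 0) { perror("shmdt"); return 1; }
    return 0;
}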
I'll collect a more detailed picture in the next day or so and send the info. Maybe create a test-case.

Several other 128MB segments are created before the 1GB segment. They all run in the 0x300000000 range on 256MB boundaries (second digit changes) and the big one goes at 0x400000000. 'mlockall()' is called periodically as well--perhaps that's the antagonist. Have SHM_HUGETLB set even for no-create attaches, which I'm not sure is proper. It works on RHEL though.

Memory is touched in each segment, 100% for the smaller ones and a small % for the big one. Didn't think it made any difference since it's all locked by implication.
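The attach pattern just described differs from the minimal attacher sketch above in exactly two respects: SHM_HUGETLB is passed even on the no-create shmget(), and mlockall() is called afterward. A hedged sketch of that pattern follows; the key is hypothetical and the 0x400000000 address comes from the description above. This is not the reporter's daemon.

/* hedged sketch of the described attach pattern, not the actual daemon */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000   /* from <linux/shm.h> */
#endif

#define BIG_KEY   0x1234                  /* hypothetical key */
#define BIG_SIZE  (1024UL << 20)          /* 1GB */
#define BIG_ADDR  ((void *)0x400000000UL)

int main(void)
{
    /* SHM_HUGETLB on a lookup is the "not sure is proper" part above;
     * the flag should only be needed at creation time. */
    int id = shmget(BIG_KEY, BIG_SIZE, SHM_HUGETLB | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    void *p = shmat(id, BIG_ADDR, 0);
    if (p == (void *)-1) { perror("shmat"); return 1; }

    /* called periodically in the real daemon; once is enough to lock
     * current and future mappings */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) { perror("mlockall"); return 1; }

    return 0;
}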
At 13:24 12/2/2008 -0600, Adam Litke wrote:
>starlight@binnacle.cx: I need more information
>to reproduce this bug.

I'm too swamped to build a test-case, but here are straces that show the relevant system calls and the failure.

The 'daemon_strace.txt' file is of the master daemon that creates all the segments and provides a service. The 'client_strace[12].txt' files are of a client daemon that attaches two segments, does some work, and then quits.

The first client trace shows the first invocation, which is successful. The second client trace shows the second invocation, which fails. All other attempts to attach the segment fail, and the daemon cannot be successfully restarted until the system is rebooted. I.e. the kernel is getting hosed.

Happy to answer any questions and provide more details if desired.

Regards
Created attachment 19137 [details]
daemon_strace1.txt

Not sure if the attachments made it along with the above email.
Created attachment 19138 [details] client_strace1
Created attachment 19139 [details] client_strace2
Created attachment 19140 [details]
meminfo

Noticed that hugepage memory is not freed when the segments are deleted.
Created attachment 19141 [details]
pmap

Here's a 'pmap -x' output for the daemon. This was taken while running with non-hugepage normal shared memory. With hugepages it looks the same, except the segments all show (deleted), which I assume should read (hugepage)--so 'pmap' likely has a bug there.
On Wed, 2008-12-03 at 22:15 -0500, starlight@binnacle.cx wrote:
> At 13:24 12/2/2008 -0600, Adam Litke wrote:
> >starlight@binnacle.cx: I need more information
> >to reproduce this bug.
>
> I'm too swamped to build a test-case, but here are straces
> that show the relevant system calls and the failure.

Starlight,

Thanks for the strace output. As I suspected, this is more complex than it first appeared. There are several hugetlb shared memory segments involved. Couple that with threading and an interesting approach to mlocking the address space and I've got a very difficult to reproduce scenario. Is it possible/practical for me to have access to your program? If so, I could quickly bisect the kernel and identify the guilty patch. Without the program, I am left stabbing in the dark. Could you try on a 2.6.18 kernel to see if it works or not? Thanks.
At 11:17 12/5/2008 -0600, you wrote:
>On Wed, 2008-12-03 at 22:15 -0500, starlight@binnacle.cx wrote:
>> At 13:24 12/2/2008 -0600, Adam Litke wrote:
>> >starlight@binnacle.cx: I need more information
>> >to reproduce this bug.
>>
>> I'm too swamped to build a test-case, but here are straces
>> that show the relevant system calls and the failure.
>
>Starlight,
>
>Thanks for the strace output. As I suspected, this is more
>complex than it first appeared. There are several hugetlb
>shared memory segments involved. Couple that with threading and
>an interesting approach to mlocking the address space and I've
>got a very difficult to reproduce scenario. Is it
>possible/practical for me to have access to your program?

Sorry, I'm not permitted to share the code.

The program fork/execs a script in addition to creating many worker threads (have contemplated switching to 'posix_spawn()', but it seems it does a fork/exec anyway). I wonder if that has anything to do with it. Will try disabling that and then disabling the 'mlock()' calls to see if either eliminates the issue. Doubt that worker thread creation is a factor.

>If so, I could quickly bisect the kernel and identify the guilty
>patch. Without the program, I am left stabbing in the dark.
>Could you try on a 2.6.18 kernel to see if it works or not?
>Thanks.

Any particular version of 2.6.18?
On Fri, 2008-12-05 at 12:49 -0500, starlight@binnacle.cx wrote:
> At 11:17 12/5/2008 -0600, you wrote:
> >On Wed, 2008-12-03 at 22:15 -0500, starlight@binnacle.cx wrote:
> >> At 13:24 12/2/2008 -0600, Adam Litke wrote:
> >> >starlight@binnacle.cx: I need more information
> >> >to reproduce this bug.
> >>
> >> I'm too swamped to build a test-case, but here are straces
> >> that show the relevant system calls and the failure.
> >
> >Starlight,
> >
> >Thanks for the strace output. As I suspected, this is more
> >complex than it first appeared. There are several hugetlb
> >shared memory segments involved. Couple that with threading and
> >an interesting approach to mlocking the address space and I've
> >got a very difficult to reproduce scenario. Is it
> >possible/practical for me to have access to your program?
>
> Sorry, I'm not permitted to share the code.
>
> The program fork/execs a script in addition to creating many
> worker threads (have contemplated switching to 'posix_spawn()',
> but it seems it does a fork/exec anyway). I wonder if that has
> anything to do with it. Will try disabling that and then
> disabling the 'mlock()' calls to see if either eliminates
> the issue. Doubt that worker thread creation is a factor.

Great. I was going to ask you to disable mlock() as well. Is this the same machine that was running your workload on RHEL4 successfully?

One theory I've been contemplating is that, with all of the mlocking and threads, you might be running out of memory for page tables and that perhaps the hugetlb code is not handling that case correctly. When do the bad pmd messages appear? When the daemon starts? When the first separate process attaches? When the second one does? or later?

> >If so, I could quickly bisect the kernel and identify the guilty
> >patch. Without the program, I am left stabbing in the dark.
> >Could you try on a 2.6.18 kernel to see if it works or not?
> >Thanks.
>
> Any particular version of 2.6.18?

Nothing specific. You could try 2.6.18.8 (latest -stable). We could probably bisect this with approximately 8 kernel build-boot-test cycles if you are willing to engage on that. I am looking forward to your disabled-mlock() results.
At 12:57 12/5/2008 -0600, Adam Litke wrote:
>Great. I was going to ask you to disable mlock() as well. Is this the
>same machine that was running your workload on RHEL4 successfully?

No, that was an old Athlon 4800+ dev box.

>One theory I've been contemplating is that, with all of the mlocking and
>threads, you might be running out of memory for page tables and that
>perhaps the hugetlb code is not handling that case correctly.

Seems unlikely. Have 13GB of free RAM.

>When do
>the bad pmd messages appear? When the daemon starts? When the first
>separate process attaches? When the second one does? or later?

Only after starting, stopping, and attempting to restart the server daemon. The 'dmesg' errors don't appear synchronously with the initial failure.

>> >If so, I could quickly bisect the kernel and identify the guilty
>> >patch. Without the program, I am left stabbing in the dark.
>> >Could you try on a 2.6.18 kernel to see if it works or not?
>> >Thanks.
>>
>> Any particular version of 2.6.18?
>
>Nothing specific. You could try 2.6.18.8 (latest -stable). We could
>probably bisect this with approximately 8 kernel build-boot-test cycles
>if you are willing to engage on that. I am looking forward to your
>disabled-mlock() results.

Ok, but this could take a while. Can only spare a few hours a week on it. Hopefully my suspicion of the fork() call is on target. Forking a 3GB process seems like an extreme operation to me.
Went back and tried a few things. Finally figured out that the problem can be reproduced with a simple shared memory segment loader utility we have. No threads, no forks, nothing fancy. Just create a segment and read the contents of a big file into it. Two segments actually. The only difference is that the accessing program has to be run three times instead of two to produce the failure. You might be able to accomplish the same result just using 'memset()' to touch all the memory.

Then tried this out with the F9 kernel 2.6.26.5-45.fc9.x86_64 and everything worked perfectly.

This is all I can do. Have burned way too many hours on it and am now retreating to the warm safety of the RHEL kernel. Only reason I was playing with the kernel.org kernel is we're trying to get an Intel 82575 working with the 'igb' driver in multiple-RX-queue mode, and the 'e1000-devel' guys said to use the latest. However that's looking like a total bust, so it's time to retreat, wait six months, and hope it's all working by then with a supported kernel.

I've attached the 'strace' files. Don't know where those 'mmap's are coming from, except perhaps from a library somewhere. There are none in our code.

Good luck.
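A loader of the sort described might look roughly like the hedged sketch below: create a hugepage segment, attach it at a fixed address, and read a file into it, faulting each page as it is written. The key, size, attach address, and input file name are assumptions, not the actual utility.

/* hedged sketch of a segment loader, not the reporter's utility */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000   /* from <linux/shm.h> */
#endif

#define SEG_KEY   0x1234                  /* hypothetical key */
#define SEG_SIZE  (1024UL << 20)          /* 1GB */
#define SEG_ADDR  ((void *)0x400000000UL)

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "bigfile.dat";   /* hypothetical input */

    int id = shmget(SEG_KEY, SEG_SIZE, IPC_CREAT | SHM_HUGETLB | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    char *seg = shmat(id, SEG_ADDR, 0);
    if (seg == (void *)-1) { perror("shmat"); return 1; }

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* read the file into the segment; every page written is faulted in */
    size_t off = 0;
    ssize_t n;
    while (off < SEG_SIZE && (n = read(fd, seg + off, SEG_SIZE - off)) > 0)
        off += (size_t)n;

    close(fd);
    return 0;
}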
Created attachment 19166 [details]
create_seg_strace

More straces.
Created attachment 19167 [details] access_seg_strace1
Created attachment 19168 [details] access_seg_strace2
Created attachment 19169 [details] access_seg_strace3
Finally figured this out. The same kernel message appears on RHEL5, with different specific failures from the application's point of view, but not on RHEL4.

On RHEL5 it turns out that a 'fork()' of a script is evoking the problem, and that it can be worked around with 'vfork()'. No problem with 'fork()' under RHEL4. So it looks like a kernel bug exists in the logic that copies big-page SVR4 shared memory page tables during a fork(). 'vfork()' does not copy page tables and avoids the "bad pmd" kernel error and varying subsequent failures.

Will try 'vfork()' with the KORG kernel sometime soon.
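The workaround described here, spawning the helper script with vfork()+exec instead of fork() so that the parent's hugepage page tables are never copied, could look like the hedged sketch below. The script path is hypothetical; this is not the reporter's daemon code.

/* hedged sketch of the vfork() workaround */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static int run_script(const char *path)
{
    pid_t pid = vfork();
    if (pid < 0) {
        perror("vfork");
        return -1;
    }
    if (pid == 0) {
        /* in a vfork() child only exec*() or _exit() are safe */
        execl(path, path, (char *)NULL);
        _exit(127);                        /* exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main(void)
{
    return run_script("/path/to/helper.sh") == 0 ? 0 : 1;   /* hypothetical path */
}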
Tried it, and the problem named in the title of this report remains: shmat() of the big hugepage segment fails with ENOMEM after the second try. However, 'vfork()' did eliminate the "bad pmd" errors in the 'dmesg' log, so that would be a different bug, I suppose.
Forgot to mention: the last test was under 2.6.29.1.
Sorry, but nobody is reading this bug report. I tried to divert it to email (right there in comment #2) but somehow it has ended up hidden back in bugzilla again.

I suggest that you create two new and separate bug reports from scratch and email them to

  linux-mm@kvack.org
  Adam Litke <agl@us.ibm.com>
  Andrew Morton <akpm@linux-foundation.org>

If you like you can include the text "[Bug 12134]" in that email's subject and Cc bugzilla-daemon@bugzilla.kernel.org on the email so that the conversation is appropriately captured.

Thanks.
Did create new bug 13302.
New life for an old bug. Reproduced under 2.6.29.1. Also discerned a separate hugepage fork() issue, now reported under bug 13302. Sorry I keep forgetting to stay with e-mail. Bugzillas are easier to keep track of over many months.
*** Bug 13192 has been marked as a duplicate of this bug. ***
2.6.29 is now obsolete; if this bug is still present in recent kernels, please update and re-open.