Bug 12134 - can't shmat() 1GB hugepage segment from second process more than one time
Summary: can't shmat() 1GB hugepage segment from second process more than one time
Status: RESOLVED OBSOLETE
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 high
Assignee: Andrew Morton
URL:
Keywords:
Duplicates: 13192
Depends on:
Blocks:
 
Reported: 2008-12-01 18:01 UTC by starlight
Modified: 2013-12-10 16:18 UTC
CC List: 1 user

See Also:
Kernel Version: 2.6.27.7
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg output with errors (80.34 KB, text/plain) - 2008-12-01 18:02 UTC, starlight
daemon_strace1.txt (216.96 KB, text/plain) - 2008-12-03 19:24 UTC, starlight
client_strace1 (3.99 KB, text/plain) - 2008-12-03 19:24 UTC, starlight
client_strace2 (3.73 KB, text/plain) - 2008-12-03 19:24 UTC, starlight
meminfo (858 bytes, text/plain) - 2008-12-03 20:06 UTC, starlight
pmap (20.22 KB, text/plain) - 2008-12-03 20:08 UTC, starlight
create_seg_strace (13.23 KB, text/plain) - 2008-12-05 21:30 UTC, starlight
access_seg_strace1 (4.21 KB, text/plain) - 2008-12-05 21:30 UTC, starlight
access_seg_strace2 (4.21 KB, text/plain) - 2008-12-05 21:31 UTC, starlight
access_seg_strace3 (4.67 KB, text/plain) - 2008-12-05 21:31 UTC, starlight

Description starlight 2008-12-01 18:01:39 UTC
Latest working kernel version: don't know
Earliest failing kernel version: don't know
Distribution: kernel.org
Hardware Environment: HP DL160 G5 w/ dual E5430's & 16GB PC2-5300 FB-DIMMs
Software Environment: F9
Problem Description:

can't shmat() 1GB hugepage segment from second process more than one time

Steps to reproduce:

create 1GB or more hugepage shmget/shmat segment
attached at explicit virtual address 0x4_00000000

run another program that attaches segment

run it again, fails

eventually get attached 'dmesg' output

works fine under RHEL 4.6
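
For illustration, a minimal sketch of the reproduction steps above, assuming 2MB huge pages have been reserved via /proc/sys/vm/nr_hugepages. The key is made up and the fixed address follows comment #4 below, so this is not the reporter's actual code:

    /* Sketch of the creator: a 1GB SHM_HUGETLB segment attached at a fixed
     * address.  Requires huge pages reserved beforehand (vm.nr_hugepages).
     * SEG_KEY is illustrative; SEG_ADDR follows comment #4 below. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000              /* value from <linux/shm.h> */
    #endif

    #define SEG_KEY  0x1234
    #define SEG_SIZE (1024UL * 1024 * 1024)
    #define SEG_ADDR ((void *)0x400000000UL)

    int main(void)
    {
        int id = shmget(SEG_KEY, SEG_SIZE, IPC_CREAT | SHM_HUGETLB | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        char *p = shmat(id, SEG_ADDR, 0);
        if (p == (char *)-1) { perror("shmat"); return 1; }

        memset(p, 0, SEG_SIZE / 64);       /* touch a small fraction of it */
        pause();                           /* hold the segment while clients attach */
        return 0;
    }

A second program would shmget() the same key without IPC_CREAT and shmat() it; per this report, its second run fails.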
Comment 1 starlight 2008-12-01 18:02:42 UTC
Created attachment 19094 [details]
dmesg output with errors
Comment 2 Anonymous Emailer 2008-12-01 18:15:11 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Mon,  1 Dec 2008 18:01:39 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12134
> 
>            Summary: can't shmat() 1GB hugepage segment from second process
>                     more than one time
>            Product: Memory Management
>            Version: 2.5
>      KernelVersion: 2.6.27.7
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: akpm@osdl.org
>         ReportedBy: starlight@binnacle.cx
> 
> 
> Latest working kernel version: don't know
> Earliest failing kernel version: don't know
> Distribution: kernel.org
> Hardware Environment: HP DL160 G5 w/ dual E5430's & 16GB PC2-5300 FB-DIMMs
> Software Environment: F9
> Problem Description:
> 
> can't shmat() 1GB hugepage segment from second process more than one time
> 
> Steps to reproduce:
> 
> create 1GB or more hugepage shmget/shmat segment
> attached at explicit virtual address 0x4_00000000
> 
> run another program that attaches segment
> 
> run it again, fails
> 
> eventually get attached 'dmesg' output
> 
> works fine under RHEL 4.6
> 
> 
Comment 3 Adam Litke 2008-12-02 11:29:31 UTC
On Mon, 2008-12-01 at 18:14 -0800, Andrew Morton wrote:
> > can't shmat() 1GB hugepage segment from second process more than one time
> > 
> > Steps to reproduce:

starlight@binnacle.cx:  I need more information to reproduce this bug.
Please read on.

I've tried these steps and haven't been able to reproduce.  Are these
reproduction steps actually a description of what a more complex program
is doing, or have you reproduced this with simple C programs that
implement nothing more than the instructions provided in this bug?

It would make it easier to diagnose this if you could provide a simple C
program that causes the bad behavior.

> > 
> > create 1GB or more hugepage shmget/shmat segment
> > attached at explicit virtual address 0x4_00000000

You must mean either 0x400000000 or 0x4000000000; please clarify.  (I
tried both addresses and was unable to reproduce.)  Are you touching any
of the pages in the shared memory segment with this process?  What flags
are you passing to shmget and shmat?  Could you provide an strace for
each program run?

> > run another program that attaches segment

Does this second program do anything besides attaching the segment (i.e.
faulting any of the huge pages)?

> > run it again, fails
> > 
> > eventually get attached 'dmesg' output
> > 
> > works fine under RHEL 4.6
> > 
> > 
Comment 4 starlight 2008-12-02 11:48:09 UTC
I'll collect a more detailed picture in the next day or so and 
send the info.  Maybe create a test-case.

Several other 128MB segments are created before the 1GB segment. 
They all sit in the 0x300000000 range on 256MB boundaries 
(the second digit changes) and the big one goes at 0x400000000.

'mlockall()' is called periodically as well--perhaps
that's the antagonist.

Have SHM_HUGETLB set even for no-create attaches, which I'm not 
sure is proper.  It works on RHEL though.

Memory is touched in each segment, 100% for the smaller
ones and small % for the big one.  Didn't think it made
any difference since it's all locked by implication.
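
As a rough sketch of the attach pattern described in this comment (key, size, and address are placeholders, and this is not the reporter's program): an existing hugepage segment is attached with SHM_HUGETLB still set, and the address space is then mlocked.

    /* Attach an existing hugepage segment (no IPC_CREAT) and lock memory,
     * roughly as described above.  key/size/addr are illustrative. */
    #include <stdio.h>
    #include <stddef.h>
    #include <sys/ipc.h>
    #include <sys/mman.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000
    #endif

    int attach_and_lock(key_t key, size_t size, void *addr)
    {
        /* SHM_HUGETLB is passed even though the segment already exists;
         * the comment above notes this may not be strictly proper */
        int id = shmget(key, size, SHM_HUGETLB | 0600);
        if (id < 0) { perror("shmget"); return -1; }

        if (shmat(id, addr, 0) == (void *)-1) { perror("shmat"); return -1; }

        /* the real program calls mlockall() periodically; once is enough here */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) { perror("mlockall"); return -1; }
        return 0;
    }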
Comment 5 starlight 2008-12-03 19:18:22 UTC
At 13:24 12/2/2008 -0600, Adam Litke wrote:
>starlight@binnacle.cx:  I need more information
>to reproduce this bug.

I'm too swamped to build a test-case, but here are straces
that show the relevant system calls and the failure.

The 'daemon_strace.txt' file is of the master daemon
that creates all the segments and provides a service.

The 'client_strace[12].txt' files are of a client daemon
that attaches two segments, does some work, and then
quits.

The first client trace shows the first invocation, which
is successful.  The second client trace shows the second
invocation, which fails.  All other attempts to attach
the segment fail, and the daemon cannot be successfully
restarted until the system is rebooted.  I.e. the
kernel is getting hosed.

Happy to answer any questions and provide more details
if desired.

Regards
Comment 6 starlight 2008-12-03 19:24:04 UTC
Created attachment 19137 [details]
daemon_strace1.txt

not sure if the attachments made it along with the above email
Comment 7 starlight 2008-12-03 19:24:29 UTC
Created attachment 19138 [details]
client_strace1
Comment 8 starlight 2008-12-03 19:24:51 UTC
Created attachment 19139 [details]
client_strace2
Comment 9 starlight 2008-12-03 20:06:36 UTC
Created attachment 19140 [details]
meminfo

Noticed that hugepage memory is not freed when the segments are deleted.
Comment 10 starlight 2008-12-03 20:08:44 UTC
Created attachment 19141 [details]
pmap

Here's a 'pmap -x' output for the daemon.  This was taken
while running with non-hugepage normal shared memory.
With hugepages it looks the same except that the segments
all show (deleted), which I assume is supposed to read
(hugepage); presumably 'pmap' has a bug there.
Comment 11 Adam Litke 2008-12-05 09:18:23 UTC
On Wed, 2008-12-03 at 22:15 -0500, starlight@binnacle.cx wrote:
> At 13:24 12/2/2008 -0600, Adam Litke wrote:
> >starlight@binnacle.cx:  I need more information
> >to reproduce this bug.
> 
> I'm too swamped to build a test-case, but here are straces
> that show the relevant system calls and the failure.

Starlight,

Thanks for the strace output.  As I suspected, this is more complex than
it first appeared.  There are several hugetlb shared memory segments
involved.  Couple that with threading and an interesting approach to
mlocking the address space and I've got a very difficult to reproduce
scenario.  Is it possible/practical for me to have access to your
program?  If so, I could quickly bisect the kernel and identify the
guilty patch.  Without the program, I am left stabbing in the dark.
Could you try on a 2.6.18 kernel to see if it works or not?  Thanks.
Comment 12 starlight 2008-12-05 09:51:34 UTC
At 11:17 12/5/2008 -0600, you wrote:
>On Wed, 2008-12-03 at 22:15 -0500, starlight@binnacle.cx wrote:
>> At 13:24 12/2/2008 -0600, Adam Litke wrote:
>> >starlight@binnacle.cx:  I need more information
>> >to reproduce this bug.
>> 
>> I'm too swamped to build a test-case, but here are straces
>> that show the relevant system calls and the failure.
>
>Starlight,
>
>Thanks for the strace output.  As I suspected, this is more 
>complex than it first appeared.  There are several hugetlb 
>shared memory segments involved.  Couple that with threading and 
>an interesting approach to mlocking the address space and I've 
>got a very difficult to reproduce scenario.  Is it 
>possible/practical for me to have access to your program?

Sorry, I'm not permitted to share the code.

The program fork/execs a script in addition to creating many 
worker threads (have contemplated switching to 'pthread_spawn()', 
but it seems it does a fork/exec anyway).  I wonder if that has 
anything to do with it.  Will try disabling that and then 
disabling the 'mlock()' calls to see if either eliminates
the issue.   Doubt that worker thread creation is a factor.

>If so, I could quickly bisect the kernel and identify the guilty 
>patch.  Without the program, I am left stabbing in the dark. 
>Could you try on a 2.6.18 kernel to see if it works or not?  
>Thanks.

Any particular version of 2.6.18?
Comment 13 Adam Litke 2008-12-05 10:58:45 UTC
On Fri, 2008-12-05 at 12:49 -0500, starlight@binnacle.cx wrote:
> At 11:17 12/5/2008 -0600, you wrote:
> >On Wed, 2008-12-03 at 22:15 -0500, starlight@binnacle.cx wrote:
> >> At 13:24 12/2/2008 -0600, Adam Litke wrote:
> >> >starlight@binnacle.cx:  I need more information
> >> >to reproduce this bug.
> >> 
> >> I'm too swamped to build a test-case, but here are straces
> >> that show the relevant system calls and the failure.
> >
> >Starlight,
> >
> >Thanks for the strace output.  As I suspected, this is more 
> >complex than it first appeared.  There are several hugetlb 
> >shared memory segments involved.  Couple that with threading and 
> >an interesting approach to mlocking the address space and I've 
> >got a very difficult to reproduce scenario.  Is it 
> >possible/practical for me to have access to your program?
> 
> Sorry, I'm not permitted to share the code.
> 
> The program fork/execs a script in addition to creating many 
> worker threads (have contemplated switching to 'pthread_spawn()', 
> but it seems it does a fork/exec anyway).  I wonder if that has 
> anything to do with it.  Will try disabling that and then 
> disabling the 'mlock()' calls to see if either eliminates
> the issue.   Doubt that worker thread creation is a factor.

Great.  I was going to ask you to disable mlock() as well.  Is this the
same machine that was running your workload on RHEL4 successfully?  One
theory I've been contemplating is that, with all of the mlocking and
threads, you might be running out of memory for page tables and that
perhaps the hugetlb code is not handling that case correctly.  When do
the bad pmd messages appear?  When the daemon starts?  When the first
separate process attaches?  When the second one does?  or later?

> >If so, I could quickly bisect the kernel and identify the guilty 
> >patch.  Without the program, I am left stabbing in the dark. 
> >Could you try on a 2.6.18 kernel to see if it works or not?  
> >Thanks.
> 
> Any particular version of 2.6.18?

Nothing specific.  You could try 2.6.18.8 (latest -stable).  We could
probably bisect this with approximately 8 kernel build-boot-test cycles
if you are willing to engage on that.  I am looking forward to your
disabled-mlock() results.
Comment 14 starlight 2008-12-05 11:05:52 UTC
At 12:57 12/5/2008 -0600, Adam Litke wrote:
>Great.  I was going to ask you to disable mlock() as well.  Is this the
>same machine that was running your workload on RHEL4 successfully?

No, that was an old Athlon 4800+ dev box.

>One theory I've been contemplating is that, with all of the mlocking and
>threads, you might be running out of memory for page tables and that
>perhaps the hugetlb code is not handling that case correctly.

Seems unlikely.  Have 13GB of free RAM.

>When do
>the bad pmd messages appear?  When the daemon starts?  When the first
>separate process attaches?  When the second one does?  or later?

Only after starting, stopping, and attempting to restart the
server daemon.  The 'dmesg' errors don't appear synchronously
with the initial failure.

>
>> >If so, I could quickly bisect the kernel and identify the guilty 
>> >patch.  Without the program, I am left stabbing in the dark. 
>> >Could you try on a 2.6.18 kernel to see if it works or not?  
>> >Thanks.
>> 
>> Any particular version of 2.6.18?
>
>Nothing specific.  You could try 2.6.18.8 (latest -stable).  We could
>probably bisect this with approximately 8 kernel build-boot-test cycles
>if you are willing to engage on that.  I am looking forward to your
>disabled-mlock() results.

Ok, but this could take a while.  Can only spare a few hours
a week on it.  Hopefully my suspicion of the fork() call is
on target.  Forking a 3GB process seems like an extreme
operation to me.
Comment 15 starlight 2008-12-05 21:27:50 UTC
Went back and tried a few things.

Finally figured out that the problem can be reproduced with a 
simple shared memory segment loader utility we have.  No 
threads, no forks, nothing fancy. Just create a segment and read 
the contents of a big file into it.  Two segments actually.  The 
only difference is the accessing program has to be run three 
times instead of two times to produce the failure.  You might be 
able to accomplish the same result just using 'memset()' to 
touch all the memory.

Then tried this out with the F9 kernel 2.6.26.5-45.fc9.x86_64 
and everything worked perfectly.

This is all I can do.  Have burned way too many hours on it and 
am now retreating to the warm safety of the RHEL kernel.  Only 
reason I was playing with the kernel.org kernel is we're trying 
to get an Intel 82575 working with the 'igb' driver in 
multiple-RX-queue mode and the 'e1000-devel' guys said to use 
the latest.  However that's looking like a total bust, so it's 
time to retreat, wait for six months and hope it's all working
by then with a supported kernel.

I've attached the 'strace' files.  Don't know where those 
'mmap's are coming from, except perhaps from a library 
somewhere.  There are none in our code.

Good luck.
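
For reference, a sketch of what such an accessing program might look like (key, size, and address are placeholders; the loader utility itself was not posted): attach the existing segment and touch every page with memset().  Per this report, a later run of it fails in shmat() with ENOMEM.

    /* Sketch of an accessing program along the lines described above:
     * attach an existing hugepage segment and fault in all of its pages.
     * Key, size, and address are assumptions, not taken from the report. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000
    #endif

    int main(void)
    {
        size_t size = 1024UL * 1024 * 1024;
        int id = shmget(0x1234, size, SHM_HUGETLB | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        char *p = shmat(id, (void *)0x400000000UL, 0);
        if (p == (char *)-1) { perror("shmat"); return 1; }   /* ENOMEM on the failing runs */

        memset(p, 0xab, size);    /* touch every huge page in the segment */
        shmdt(p);
        return 0;
    }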
Comment 16 starlight 2008-12-05 21:30:28 UTC
Created attachment 19166 [details]
create_seg_strace

more straces
Comment 17 starlight 2008-12-05 21:30:45 UTC
Created attachment 19167 [details]
access_seg_strace1
Comment 18 starlight 2008-12-05 21:31:03 UTC
Created attachment 19168 [details]
access_seg_strace2
Comment 19 starlight 2008-12-05 21:31:21 UTC
Created attachment 19169 [details]
access_seg_strace3
Comment 20 starlight 2009-05-13 18:52:22 UTC
Finally figured this out.

Same kernel message, but with different specific failures from 
the application's point of view; seen on RHEL5 but not on RHEL4.

On RHEL5 it turns out that a 'fork()' of a script is evoking the 
problem and that it can be worked around with 'vfork()'.
No problem with 'fork()' under RHEL4.

So it looks like a kernel bug exists in the logic that copies 
big-page SVR4 shared memory page tables during a fork().  
'vfork()' does not copy page tables and avoids the "bad pmd" 
kernel error and varying subsequent failures.

Will try the 'vfork()' with the KORG kernel sometime soon.
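
To illustrate the workaround being described (a sketch only; the script path and error handling are placeholders, not the reporter's code), the fork()+exec of the script is replaced by vfork()+exec, which does not copy the parent's page tables:

    /* Run a helper script with vfork()+exec instead of fork()+exec.
     * Per the comment above, vfork() avoids copying the hugepage page
     * tables and sidesteps the "bad pmd" errors.  Path is a placeholder. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int run_script(const char *path)
    {
        pid_t pid = vfork();
        if (pid < 0) { perror("vfork"); return -1; }
        if (pid == 0) {
            /* child: only exec*() or _exit() are safe after vfork() */
            execl(path, path, (char *)NULL);
            _exit(127);               /* exec failed */
        }
        int status = 0;
        waitpid(pid, &status, 0);
        return status;
    }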
Comment 21 starlight 2009-05-13 19:28:06 UTC
Tried it, and the problem this bug was reported under
remains: shmat() of a big hugepage segment fails with
ENOMEM after the second try.

However 'vfork()' did eliminate the "bad pmd" errors
in the 'dmesg' log.  So that would be a different bug
I suppose.
Comment 22 starlight 2009-05-13 19:29:22 UTC
Forgot to mention that the last test was under 2.6.29.1.
Comment 23 Andrew Morton 2009-05-13 20:06:51 UTC
Sorry, but nobody is reading this bug report.  I tried to divert it to email (right there in comment #2) but somehow it has ended up hidden back in bugzilla again.

I suggest that you create two new and separate bug reports from scratch and email
them to

linux-mm@kvack.org
Adam Litke <agl@us.ibm.com>
Andrew Morton <akpm@linux-foundation.org>

If you like, you can include the text "[Bug 12134]" in that email's subject
and Cc bugzilla-daemon@bugzilla.kernel.org on the email so that the conversation
is appropriately captured.

Thanks.
Comment 24 starlight 2009-05-13 20:17:09 UTC
Did create new bug 13302.
Comment 25 starlight 2009-05-13 20:24:24 UTC
New life for old bug.

Reproduced under 2.6.29.1.

Also discerned separate hugepage fork() issue now reported
under bug 13302.

Sorry I keep forgetting to stay with e-mail.  Bugzillas
are easier to keep track of over many months.
Comment 26 Alan 2012-05-30 16:29:41 UTC
*** Bug 13192 has been marked as a duplicate of this bug. ***
Comment 27 Alan 2013-12-10 16:18:53 UTC
2.6.29 is now obsolete.  If this bug is still present in recent kernels, please update and re-open.
