Bug 43811 - call to 'mlock()' hangs system in presence of multiple RealTime threads
Summary: call to 'mlock()' hangs system in presence of multiple RealTime threads
Status: RESOLVED OBSOLETE
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 high
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-25 22:28 UTC by rickdic
Modified: 2013-12-10 23:01 UTC

See Also:
Kernel Version: 2.6.35.13 x86_64
Subsystem:
Regression: No
Bisected commit-id:


Attachments
test case (10.67 KB, text/x-c)
2012-06-25 22:32 UTC, rickdic
SysRq hung task log (comment #3) (301.64 KB, application/octet-stream)
2012-06-26 16:50 UTC, rickdic

Description rickdic 2012-06-25 22:28:54 UTC
When multiple RealTime threads are executing on their own, individual CPUs, a call to 'mlock()' from a non-realtime thread on a different CPU will hang. If some of the RealTime threads lower their priority, the call to mlock will complete.

-----------------------
Steps to reproduce:
-----------------------

I have created a test case (mlock_test) which demonstrates the problem. It requires a minimum of 4 CPUs. It also requires running in text mode (i.e. runlevel 3), because the RealTime threads occupy most of the CPUs.

1. compile mlock_test using:
    gcc -Wall -Werror -o mlock_test mlock_test.c -lpthread
2. open a console window in text mode (i.e. sudo init 3), then execute mlock_test using:
    sudo ./mlock_test
3. wait for the text "aux_main: starting in ..."; this will count down from 5s
4. wait for the text "about to lock..."

- AT THIS POINT, THE THREAD IS HUNG IN 'mlock()'

After a few more seconds, each of the RealTime threads will lower its priority, displaying the text "... adjusted scheduling to SCHED_OTHER."
5. Once all the RealTime threads are running at normal priority, the 'mlock()' call will complete, displaying the message "memory locked (0x...)"
6. Several allocation/lock messages will be displayed, after which the <ENTER> key can be pressed to exit the application.

-----------------------
Additional Information:
-----------------------

1. The test case executes properly on kernel version 2.6.18.x (CentOS 5.5, CentOS 5.7 & RHEL 5.4); that is, the call to 'mlock()' does NOT hang.

2. The test case fails identically on:
  - Quad-core Xeon (4 CPUs)
  - Dual Quad-core Xeon (8 CPUs)
  - Dual Quad-core Xeon w/Hyperthreads (16 CPUs)

3. Theory of operation:
  - the 'main()' thread creates an 'aux_main()' thread, affinitizes each to CPU 0, then blocks on a barrier, awaiting the 'aux_main' thread completion
  - the 'aux_main()' thread creates several (1 fewer than the CPU count) RealTime threads and affinitizes each one to its own CPU (starting w/highest CPU # and proceeding down to #1). Each of these threads spins, awaiting an 'exit_flag' to be set (via a memory location)
  - after creating the RealTime threads, the 'aux_main()' thread allocates a large block of memory and attempts to 'mlock()' it (where it hangs). It does this the same # of times as there are RealTime threads.

4. If the code is changed such that the RealTime threads are instead created as normally-scheduled threads, then the call to 'mlock()' DOES NOT HANG. This can be demonstrated by changing line 227 to read:
  "if( 0 )"
This prevents the test threads from being elevated to RealTime threads.

5. The amount of memory allocated/locked is large, but I have seen identical results using a memory size of 4 KB.

6. This problem also manifests itself in kernel v3.4.4 (from kernel.org) as well as on v2.6.32-220 (CentOS 6.2).
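
The theory of operation in item 3 can be sketched roughly as follows. This is NOT the attached mlock_test.c; it is a minimal illustration, and every name in it (run_mlock_sketch, spinner, MAX_SPINNERS, spin_seconds) is hypothetical. Two deliberate differences from the description above: the spinning threads give up after a deadline, so the sketch cannot spin forever, and a failure to obtain RealTime priority (e.g. when not run as root) is tolerated rather than fatal. Note that it pins the calling thread to CPU 0 as a side effect.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define MAX_SPINNERS 64          /* arbitrary cap for this sketch */

static atomic_int exit_flag;     /* set once the mlock() attempt is over */

struct spinner_arg {
    int cpu;                     /* CPU this thread tries to pin itself to */
    int spin_seconds;            /* safety limit on the busy loop */
};

static void *spinner(void *p)
{
    const struct spinner_arg *arg = p;
    struct sched_param sp = { .sched_priority = 1 };
    struct timespec start, now;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(arg->cpu, &set);
    /* Best effort: if pinning fails the thread keeps its inherited mask. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* Becoming RealTime needs root or CAP_SYS_NICE. */
    if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) != 0)
        fprintf(stderr, "cpu %d: could not switch to SCHED_FIFO\n", arg->cpu);

    clock_gettime(CLOCK_MONOTONIC, &start);
    while (!atomic_load(&exit_flag)) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now.tv_sec - start.tv_sec > arg->spin_seconds)
            break;               /* coarse deadline, so the loop always ends */
    }
    return NULL;
}

/* Returns the number of spinner threads started, or -1 if the buffer could
 * not be allocated. The result of mlock() is stored in *mlock_rc. */
int run_mlock_sketch(size_t bytes, int spin_seconds, int *mlock_rc)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    int want = ncpus > 1 ? (int)ncpus - 1 : 0;
    struct timespec settle = { 0, 200 * 1000 * 1000 };
    struct spinner_arg args[MAX_SPINNERS];
    pthread_t tid[MAX_SPINNERS];
    cpu_set_t cpu0;
    void *buf = NULL;
    int started = 0;

    if (want > MAX_SPINNERS)
        want = MAX_SPINNERS;
    if (posix_memalign(&buf, (size_t)sysconf(_SC_PAGESIZE), bytes) != 0)
        return -1;

    atomic_store(&exit_flag, 0);
    CPU_ZERO(&cpu0);
    CPU_SET(0, &cpu0);
    sched_setaffinity(0, sizeof(cpu0), &cpu0);   /* caller stays on CPU 0 */

    /* One spinner per remaining CPU, highest CPU number first. */
    for (int i = 0; i < want; i++) {
        args[started].cpu = (int)ncpus - 1 - i;
        args[started].spin_seconds = spin_seconds;
        if (pthread_create(&tid[started], NULL, spinner, &args[started]) == 0)
            started++;
    }
    nanosleep(&settle, NULL);    /* give the spinners time to start */

    *mlock_rc = mlock(buf, bytes);   /* the call reported to hang */
    if (*mlock_rc == 0)
        munlock(buf, bytes);

    atomic_store(&exit_flag, 1);
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    free(buf);
    return started;
}
```

Calling run_mlock_sketch(4096, 5, &rc) as root on a machine with 4 or more CPUs (built with -lpthread, as in the compile step above) would be expected to show 'mlock()' returning only after the spinners stop, if the kernel is affected. That expectation is inferred from this report and has not been verified with this sketch.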
Comment 1 rickdic 2012-06-25 22:32:31 UTC
Created attachment 74271
test case

See the original comment for compilation/execution instructions.
Comment 2 Andrew Morton 2012-06-25 23:52:29 UTC
huh.  I suspect it's stuck in lru_add_drain_all().

Can you please capture a kernel stack trace of the hung task?  Write a 1 to /proc/sys/kernel/sysrq, make it hang, type sysrq-t?
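
If typing sysrq-t on the console is inconvenient, the same task dump can be requested by writing to procfs. The helper below is only a sketch with hypothetical function names (write_str, request_task_dump). It relies on the standard /proc/sys/kernel/sysrq and /proc/sysrq-trigger files, must run as root, and sends the dump to the kernel log, which can then be read with dmesg.

```c
#include <stdio.h>

/* Write a short string to a file; returns 0 on success, -1 on failure. */
int write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");
    int rc;

    if (f == NULL)
        return -1;
    rc = (fputs(s, f) == EOF) ? -1 : 0;
    if (fclose(f) != 0)          /* the write may only fail at flush time */
        rc = -1;
    return rc;
}

/* Ask the kernel to dump the state of all tasks to its log.
 * Returns 0 on success, -1 if either write failed (e.g. not root). */
int request_task_dump(void)
{
    /* "1" enables all SysRq functions, as requested in the comment above. */
    if (write_str("/proc/sys/kernel/sysrq", "1") != 0)
        return -1;
    /* "t" is the task-state dump, the same action as typing sysrq-t. */
    return write_str("/proc/sysrq-trigger", "t");
}
```

One possible use is to call request_task_dump() from a separate normal-priority thread a few seconds after the "about to lock..." message, while the 'mlock()' call is still blocked.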
Comment 3 rickdic 2012-06-26 16:49:09 UTC
Sorry, I had forgotten to attach the log with my original comment.

Right you are, it appears to be hung in 'lru_add_drain_all()'.

The SysRq log is attached (mlock_test_sysrq_20120626_0.log).
Comment 4 rickdic 2012-06-26 16:50:23 UTC
Created attachment 74311
SysRq hung task log (comment #3)
Comment 5 Andrew Morton 2012-06-27 05:56:54 UTC
here we go.... http://marc.info/?l=linux-mm&m=134074683924229&w=2
