Bug 194231

Summary:	Group Imbalance bug - performance drop by up to a factor of 10
Product:	Process Management	Reporter:	Jirka Hladky (hladky.jiri)
Component:	Scheduler	Assignee:	Ingo Molnar (mingo)
Status:	RESOLVED CODE_FIX
Severity:	normal	CC:	hladky.jiri, skarmarkar
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	4.10.0-0.rc6	Subsystem:
Regression:	No	Bisected commit-id:
Attachments:	Paper describing bug - see chapter 3.1. Reproducer for Group Imbalance bug.

Description Jirka Hladky 2017-02-06 23:31:53 UTC

Created attachment 254391
Paper describing bug - see chapter 3.1.

Description of problem:

We report that the Group Imbalance bug can degrade performance by up to a factor of 10 on a server with 4 NUMA nodes.

The bug was first described in chapter 3.1 of this paper:

http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

The scheduler does not balance load correctly on a server with 4 NUMA nodes in the following scenario:
 * there are three independent ssh connections
 * the first two ssh connections each run a single-threaded CPU-intensive workload
 * the last ssh session runs a multi-threaded application that uses almost all cores in the system

We used:
 * stress --cpu 1 as the single-threaded CPU-intensive workload
   http://people.seas.harvard.edu/~apw/stress/
 * the lu.C.x benchmark from the NAS Parallel Benchmarks suite as the multi-threaded application
   https://www.nas.nasa.gov/publications/npb.html

Version-Release number of selected component (if applicable):
Reproduced on 

kernel 4.10.0-0.rc6


How reproducible:

Reproducing the bug requires a server with at least 2 NUMA nodes. The problem gets worse on a server with 4 NUMA nodes.


Steps to Reproduce:
1. Start 3 ssh connections to the server.
2. In the first two ssh connections, run stress --cpu 1.
3. In the third ssh connection, run the lu.C.x benchmark with the number of threads equal to the number of CPUs in the system minus 4.
4. Run either Intel's numatop
echo "N" | numatop -d log >/dev/null 2>&1 &
or mpstat -P ALL 5, and check the load distribution across the NUMA nodes. The mpstat output can be aggregated per NUMA node with the mpstat2node.py utility:
https://github.com/jhladka/MPSTAT2NODE/blob/master/mpstat2node.py

mpstat -P ALL 5 | mpstat2node.py --lscpu <(lscpu)

5. Compare the results against the same workload started from ONE ssh session (all processes are in one group).
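
As a rough sketch of step 3, the thread count can be derived from the number of online CPUs. The helper name lu_thread_count is made up for illustration, and setting OMP_NUM_THREADS assumes the OpenMP build of lu.C.x; neither is taken from the attached reproducer. The script only prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: thread count for lu.C.x as described in step 3,
# i.e. the number of CPUs in the system minus 4.
lu_thread_count() {
    local cpus=$1
    echo $(( cpus - 4 ))
}

# Number of online CPUs as seen by this shell.
cpus=$(getconf _NPROCESSORS_ONLN)
threads=$(lu_thread_count "$cpus")

# Assumption: lu.C.x is the OpenMP build, so OMP_NUM_THREADS sets the thread count.
echo "third ssh session: OMP_NUM_THREADS=$threads ./lu.C.x"
```

On a 64-CPU system this gives 60 threads for lu.C.x, which together with the two stress processes makes 62 CPU-intensive threads.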


Actual results:

Uneven load across NUMA nodes:
Average:    NODE    %usr     %idle
Average:     all   66.12      33.51
Average:       0   37.97      61.74
Average:       1   31.67      68.15
Average:       2   97.50       1.98
Average:       3   97.33       2.19

Note that although 62 CPU-intensive threads are running on this 64-CPU system, NUMA nodes #0 and #1 are underutilized.
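
The imbalance in the table above can be condensed into a single number, for example the spread between the busiest and the least busy node. A minimal sketch, assuming the per-node output keeps the "Average: NODE %usr %idle" layout shown above; the function name node_usr_spread is hypothetical and not part of mpstat2node.py.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read per-node "Average:" lines on stdin and print
# the difference between the highest and the lowest %usr across NUMA nodes.
node_usr_spread() {
    LC_ALL=C awk '
        # Only per-node rows: the second field is a node number,
        # which skips the header row and the "all" row.
        $1 == "Average:" && $2 ~ /^[0-9]+$/ {
            usr = $3 + 0
            if (n == 0 || usr > max) max = usr
            if (n == 0 || usr < min) min = usr
            n++
        }
        END { if (n > 0) printf "%.2f\n", max - min }
    '
}

# Per-node averages copied from the table above.
node_usr_spread <<'EOF'
Average:    NODE    %usr     %idle
Average:     all   66.12      33.51
Average:       0   37.97      61.74
Average:       1   31.67      68.15
Average:       2   97.50       1.98
Average:       3   97.33       2.19
EOF
```

For the data above this prints 65.83, i.e. the busiest node shows about 66 percentage points more user time than the least busy one. With evenly balanced load the spread would be expected to be small.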

The real (wall-clock) runtime of the lu.C.x benchmark went up from 114 seconds to 846 seconds!
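
For reference, the slowdown follows directly from the two runtimes. A small sketch of the arithmetic; the function name slowdown is chosen here for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: slowdown factor = slow runtime / baseline runtime.
slowdown() {
    local baseline=$1 slow=$2
    LC_ALL=C awk -v b="$baseline" -v s="$slow" 'BEGIN { printf "%.1f\n", s / b }'
}

# Runtimes in seconds from this report: 114 before, 846 with the imbalance.
slowdown 114 846   # prints 7.4
```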

Expected results:

The load is evenly balanced across all NUMA nodes. The real runtime of the lu.C.x benchmark is the same regardless of whether the jobs were started from one ssh session or from multiple ssh sessions.

Additional info:

A proposed patch for kernel 4.1 is available at
https://github.com/jplozi/wastedcores/blob/master/patches/group_imbalance_linux_4.1.patch

Comment 1 Jirka Hladky 2017-02-07 00:14:05 UTC

Created attachment 254401
Reproducer for Group Imbalance bug.

This is the reproducer for the Group Imbalance bug.

Requires:
 * a server with 4 NUMA nodes
 * an ssh server running on the test machine
 * tmux

1) Compile the tests (needed only once, to install them)
./compile.sh

2) Run the test - use a server with at least 2 NUMA nodes

Start TWO ssh connections to the server.
* In the first connection, start tmux. It will be used to start jobs later.
* In the second connection, run ./reproduce.sh from this tarball. Do not attempt to start tmux in this second ssh shell!
./reproduce.sh will automatically start the stress --cpu 1 jobs in the tmux session started in the first ssh session.

Results are stored in a directory named <kernel_name>_<timestamp>, for example:
4.10.0-0.rc6.git0.1.el7.x86_64_2017-Feb-06_23h12m54s

3) Examine the results
grep -H total ${NAME}*log 
grep -H seconds ${NAME}*log
grep -H -i Average *numa

Files with "GROUP" in their name were produced by using different job groups (different ssh connections).
Files with "NORMAL" in their name were produced by starting the whole workload from one ssh connection.

Both results should be the same. The typical bug symptoms are:
- uneven load across NUMA nodes (check *GROUP*numa file)
- much longer runtimes (by a factor of 5 to 10) for the lu.C.x benchmark

Results for kernel 4.10.0-0.rc6 are included (directory 4.10.0-0.rc6.git0.1.el7.x86_64_2017-Feb-06_23h12m54s).

Comment 2 Jirka Hladky 2020-06-25 13:48:07 UTC

This was fixed in kernel v5.5.