Created attachment 254391 [details]
Paper describing bug - see chapter 3.1.

Description of problem:
We report that the Group Imbalance bug can cause performance degradation of up to 10x on a 4-NUMA-node server. The bug was first described in chapter 3.1 of this paper: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

The scheduler does not balance load correctly on a 4-NUMA-node server in the following scenario:
* there are three independent ssh connections
* the first two ssh connections each run a single-threaded CPU-intensive workload
* the last ssh session runs a multi-threaded application that requires almost all cores in the system

We used:
* stress --cpu 1 as the single-threaded CPU-intensive workload
  http://people.seas.harvard.edu/~apw/stress/
* the lu.C.x benchmark from the NAS Parallel Benchmarks suite as the multi-threaded application
  https://www.nas.nasa.gov/publications/npb.html

Version-Release number of selected component (if applicable):
Reproduced on kernel 4.10.0-0.rc6

How reproducible:
Requires a server with at least 2 NUMA nodes. The problem gets worse on a 4-NUMA-node server.

Steps to Reproduce:
1. Start 3 ssh connections to the server.
2. In the first two ssh connections run: stress --cpu 1
3. In the third ssh connection run the lu.C.x benchmark with the number of threads equal to the number of CPUs in the system minus 4.
4. Run either Intel's numatop
     echo "N" | numatop -d log >/dev/null 2>&1 &
   or
     mpstat -P ALL 5
   and check the load distribution across the NUMA nodes. The mpstat output can be processed by the mpstat2node.py utility to aggregate data across NUMA nodes:
   https://github.com/jhladka/MPSTAT2NODE/blob/master/mpstat2node.py
     mpstat -P ALL 5 | mpstat2node.py --lscpu <(lscpu)
5. Compare the results against the same workload started from ONE ssh session (all processes in one group).

Actual results:
Uneven load across NUMA nodes:

Average:  NODE    %usr   %idle
Average:   all   66.12   33.51
Average:     0   37.97   61.74
Average:     1   31.67   68.15
Average:     2   97.50    1.98
Average:     3   97.33    2.19

Note that while there are 62 CPU-intensive threads on this 64-CPU system, NUMA nodes #0 and #1 are underutilized. Real runtime for the lu.C.x benchmark went up from 114 seconds to 846 seconds!

Expected results:
Load evenly balanced across all NUMA nodes. Real runtime for the lu.C.x benchmark should be the same regardless of whether the jobs were started from one ssh session or from multiple ssh sessions.

Additional info:
See https://github.com/jplozi/wastedcores/blob/master/patches/group_imbalance_linux_4.1.patch for a proposed patch against kernel 4.1.
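The mechanism behind the bug, as described in the paper and addressed by the wastedcores patch linked above, can be illustrated with a small sketch. This is a simplified model, not kernel code; the group names and load numbers are made up. When the load balancer compares scheduling groups by their *average* load, a group containing one very-high-load thread plus idle CPUs can look as busy as a fully loaded group, so no tasks migrate to its idle CPUs. The patch changes the comparison to use the *minimum* load instead.

```python
# Simplified model of the Group Imbalance effect (NOT actual kernel code).
# Each list holds hypothetical per-CPU loads of one scheduling group
# (e.g. one NUMA node).

def group_load_avg(cpu_loads):
    """Average load - the metric used by the buggy balancer."""
    return sum(cpu_loads) / len(cpu_loads)

def group_load_min(cpu_loads):
    """Minimum load - the metric proposed by the wastedcores patch."""
    return min(cpu_loads)

# Group A: one high-load thread (e.g. a lone `stress --cpu 1` in its own
# session, carrying a large share of its group's load), other CPUs idle.
group_a = [1024, 0, 0, 0]
# Group B: four equally busy benchmark threads.
group_b = [256, 256, 256, 256]

# With the average, both groups look equally loaded (256 vs 256), so the
# balancer sees no imbalance and group A's idle CPUs stay unused.
print(group_load_avg(group_a), group_load_avg(group_b))  # 256.0 256.0

# With the minimum, group A's idle CPUs become visible (0 vs 256),
# so the balancer migrates work onto them.
print(group_load_min(group_a), group_load_min(group_b))  # 0 256
```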
Created attachment 254401 [details]
Reproducer for Group Imbalance bug.

This is the reproducer for the Group Imbalance bug.

Requires:
- 4-NUMA-node server
- ssh server running on the test machine
- tmux

1) Compiling the test (needed only once, to install the tests):
     ./compile.sh

2) Running the test - use a server with at least 2 NUMA nodes.
   Start TWO ssh connections to the server.
   * In the first connection, start tmux. It will be used to start jobs later.
   * In the second connection, run ./reproduce.sh from this tarball. Do not attempt to start tmux in this second ssh shell!
   ./reproduce.sh will automatically start the stress --cpu 1 jobs in the tmux session started in the first ssh session.
   Results are stored in a directory named <kernel_name>_<timestamp>, e.g.:
     4.10.0-0.rc6.git0.1.el7.x86_64_2017-Feb-06_23h12m54s

3) Examine the results:
     grep -H total ${NAME}*log
     grep -H seconds ${NAME}*log
     grep -H -i Average *numa

Files with "GROUP" in the name were produced by using different job groups (different ssh connections). Files with "NORMAL" in the name were produced by starting the whole workload from one ssh connection. Both results should be the same.

The typical bug symptoms are:
- uneven load across NUMA nodes (check the *GROUP*numa file)
- much longer (5x-10x) runtimes for the lu.C.x benchmark

Results for kernel 4.10.0-0.rc6 are included (directory 4.10.0-0.rc6.git0.1.el7.x86_64_2017-Feb-06_23h12m54s).
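The per-node averages reported in the *numa result files (produced by the mpstat2node.py utility linked in the original report) can be approximated with a short sketch. This is assumed behavior, not the actual script; the CPU-to-node mapping and %usr values below are made up for illustration.

```python
# Sketch of aggregating per-CPU utilization into per-NUMA-node averages,
# similar in spirit to mpstat2node.py (mapping and values are invented).

# CPU -> NUMA node mapping, as could be derived from `lscpu` output.
cpu_to_node = {0: 0, 1: 0, 2: 1, 3: 1}

# Per-CPU %usr values, as reported per processor by `mpstat -P ALL`.
usr = {0: 95.0, 1: 5.0, 2: 90.0, 3: 92.0}

# Collect per-CPU values under their node, then average per node.
node_usr = {}
for cpu, node in cpu_to_node.items():
    node_usr.setdefault(node, []).append(usr[cpu])

averages = {node: sum(vals) / len(vals) for node, vals in node_usr.items()}
print(averages)  # {0: 50.0, 1: 91.0}
```

An uneven spread in these per-node averages (as in the Actual results above, where nodes #0 and #1 sit largely idle) is the symptom to look for.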
This was fixed in kernel v5.5.