Bug 217316

Summary: Integer overflow when calculating imbalance for CFS scheduler
Product: Linux Reporter: Tingjia Cao (tcao34)
Component: Kernel Assignee: Virtual assignee for kernel bugs (linux-kernel)
Status: RESOLVED CODE_FIX    
Severity: normal CC: chrisxinchao, mingo, vincent.guittot
Priority: P1 Flags: mricon: bugbot+
Hardware: All   
OS: Linux   
Kernel Version: 6.3-rc5 Subsystem: SCHEDULER
Regression: No Bisected commit-id:
Attachments: A patch to trace the variables, c files to reproduce the issue

Description Tingjia Cao 2023-04-10 04:59:59 UTC
Created attachment 304105 [details]
A patch to trace the variables, c files to reproduce the issue

We have identified an issue in the rebalance algorithm of the CFS scheduler on kernel versions 6.0 and 6.3-rc5. Specifically, the calculate_imbalance function in kernel/sched/fair.c may produce incorrect results due to an integer overflow bug.

The algorithm is designed to pull some tasks from the busiest group to the local group. But when both groups are, or will become, overloaded, the algorithm avoids pushing the local group above the average load of the sched domain. However, in some cases the imbalance calculation can be wrong, causing meaningless migrations and even amplifying the imbalance.

The problem arises when the function executes the following lines:
env->imbalance = min(
	(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
	(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;

In some cases (e.g. when local->group_type = group_fully_busy and busiest->group_type = group_overloaded), the system cannot guarantee that sds->avg_load is larger than local->avg_load. Since both sds->avg_load and local->avg_load are unsigned long, evaluating (sds->avg_load - local->avg_load) wraps around to a huge value. The calculated imbalance is then no longer the minimum of (busiest - sds) and (sds - local). As a result, the rebalance algorithm is likely to perform meaningless migrations and push the local group's load above the sched domain's average load after the migration.


Reproduce
=============
We attach a patch and a module that can trace the local->avg_load, sds->avg_load, and busiest->avg_load. 

We used the following workloads to test the algorithm:
- First, we create two task groups:
	mkdir /sys/fs/cgroup/t0
	mkdir /sys/fs/cgroup/t1
	echo 100000 10000 > /sys/fs/cgroup/t0/cpu.max

- Then, we run the following programs in a shell (the C files are attached):
	./test0 &
	./test0 &
	./test0 &
	./test1 &

In test0, we clone 9 processes into cgroup t0 and make them sleep occasionally. In test1, we clone 10 processes into cgroup t1 and keep them running continuously.

- Then, we use the trace module provided in the attachment to collect a trace; here are example trace items we get:
         392168005763 nsecs IC    236  103   104   133
         392168015305 nsecs LB    8    1     85    133

The first line means that in the calculate_imbalance function, 
1) busiest->avg_load = 236
2) local->avg_load = 104
3) sds->avg_load = 103
Because (103 - 104) wraps around to a huge unsigned value, min() effectively selects the (236 - 103) operand and yields an imbalance of 133. The expected value is min(236 - 103, 103 - 104) = -1.

The second line means that the wrong imbalance calculation leads to an unreasonable migration from CPU 8 to CPU 1, even though the local group's load will far exceed sds->avg_load after the migration. The migrated task's h_load is 85, and nr_balance_failed = 0. If the imbalance were -1, the migration would not happen; with the wrong imbalance of 133, the task is migrated.

Kernel Info
===============
Host OS: Ubuntu 20.04
Processor: Two Intel Xeon Silver 4114 10-core CPUs at 2.20 GHz (hyper-threading disabled)
Kernel Versions: 6.3-rc5, 6.0