Bug 45001 - Load average computation seems wrong
Summary: Load average computation seems wrong
Status: RESOLVED CODE_FIX
Alias: None
Product: Process Management
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: process_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-22 09:42 UTC by embedded
Modified: 2016-09-01 15:47 UTC
CC List: 4 users

See Also:
Kernel Version: >= 2.6.36
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Kernel 3.4.0, daily graph, idle box (19.81 KB, image/png) - 2012-07-22 09:42 UTC, embedded
Kernel 3.4.0, hourly graph, idle box (14.35 KB, image/png) - 2012-07-22 09:43 UTC, embedded
Kernel 3.2.1, hourly graph (27.02 KB, image/png) - 2012-07-22 09:44 UTC, embedded
Kernel 3.2.1, zoom on the hourly graph (18.32 KB, image/png) - 2012-07-22 09:45 UTC, embedded
Kernel 3.2.1 (third box), hourly graph (18.22 KB, image/png) - 2012-07-22 09:47 UTC, embedded
Kernel 3.2.1 (third box), zoom on the hourly graph (20.03 KB, image/png) - 2012-07-22 09:47 UTC, embedded
loadavg patch for 3.x kernels (808 bytes, patch) - 2016-01-18 15:33 UTC, VTX
Just a graph of what I said on-list about rounding and the steady state error (24.49 KB, image/png) - 2016-01-21 22:18 UTC, Doug Smythies
Forgot to label axis (26.34 KB, image/png) - 2016-01-21 22:25 UTC, Doug Smythies
loadavg patch for 4.x kernels, round down, round up (671 bytes, patch) - 2016-01-22 09:33 UTC, VTX

Description embedded 2012-07-22 09:42:33 UTC
Created attachment 75821 [details]
Kernel 3.4.0, daily graph, idle box

Hello kernel gurus,

Since upgrading to various 3.x kernels (3.2 and 3.4, depending on the computer), we see a strange 'load average' behaviour. It is *not* a 'load is too high' problem; it's the fact that the 5- and 15-minute averages are 'wrong' when the computers are almost idle.

If we have a sustained 1-minute load average of 0.0 (box completely idle), it seems quite strange to have a sustained 5-min average of 0.01 and a 15-min average of 0.05. Not a big deal, I confess, but mathematically troubling. This bottom line of 0.01/0.05 seems to be the floor for the 5- and 15-min averages.

As soon as the 1-min load average moves up (when the computers are in use), the 5-min and 15-min averages follow suit (obviously) and the aforementioned weirdness disappears. It shows up again as soon as the load goes down.

We are using stock kernels, compiled in-house (maybe we made the same configuration error everywhere), and Gentoo distributions (32- and 64-bit).

I attach some graphs of those load averages on our computers; they may be clearer than my poor explanations.
Comment 1 embedded 2012-07-22 09:43:31 UTC
Created attachment 75831 [details]
Kernel 3.4.0, hourly graph, idle box
Comment 2 embedded 2012-07-22 09:44:44 UTC
Created attachment 75841 [details]
Kernel 3.2.1, hourly graph
Comment 3 embedded 2012-07-22 09:45:14 UTC
Created attachment 75851 [details]
Kernel 3.2.1, zoom on the hourly graph
Comment 4 embedded 2012-07-22 09:47:03 UTC
Created attachment 75861 [details]
Kernel 3.2.1 (third box), hourly graph
Comment 5 embedded 2012-07-22 09:47:30 UTC
Created attachment 75871 [details]
Kernel 3.2.1 (third box), zoom on the hourly graph
Comment 6 Anonymous Emailer 2012-07-22 12:22:40 UTC
Reply-To: mingo@kernel.org

Does the v3.5 kernel work any better? In particular this commit:

5167e8d5417b sched/nohz: Rewrite and fix load-avg computation -- again

might have done the trick.

Thanks,

	Ingo
Comment 7 embedded 2012-07-24 07:52:06 UTC
Hi,
We have set up a 3.5.0 kernel on one of the affected systems. Will keep you informed of the results in a few days.
Comment 8 embedded 2012-07-27 12:36:48 UTC
Hello,
Unfortunately the 3.5.0 kernel did not solve the problem. We still see 5-minute and 15-minute averages consistently above the 1-minute average when the system is almost idle.
Comment 9 Alan 2012-08-09 13:33:06 UTC
*** Bug 45471 has been marked as a duplicate of this bug. ***
Comment 10 embedded 2012-08-12 14:01:22 UTC
I am not sure this bug is the same as bug 45471.

Bug 45471 is a 'weird/too-high load average' computation problem. Here, it's only a matter of the 5-minute and 15-minute averages not being consistent with the 1-minute average history, which seems to appear only on almost-idle systems.
Comment 11 Konstantin Svist 2012-08-12 16:04:57 UTC
I agree, my bug is completely different from yours
Comment 12 embedded 2012-08-14 16:40:24 UTC
After playing with 'git bisect', it seems this bug appeared between 2.6.35 and 2.6.36 (and not in > 3.0 as previously stated).
Comment 13 Doug Smythies 2013-10-20 22:48:21 UTC
While looking for something else, I found this bug report.
Personally, I do not consider this a bug, but rather a natural consequence of finite-precision integer arithmetic combined with the strong filter coefficients needed for the longer-time-constant load averages. Due to their highly sub-sampled nature, the load averages aren't highly accurate anyhow. The overriding requirement for the code is minimal overhead.

An example calculation:

FSHIFT = 11 is the number of bits of precision kept.
EXP_15 = 2037 is the 15 minute filter coefficient (where 2048 = 1.0)
93 = the smallest old load that still scales to 0.05, with rounding.
i.e. 93/2048 = 0.05 to two decimal places.

Now, consider old_15_minute_load_ave = 93 and new load = 0, then we get:

new_15_minute_load_ave = (old * 2037 + load * (2048 - 2037) + a half)/ 2048
                       = (93 * 2037 + 0 + 1024)/ 2048
                       = 93

So, in an unloaded system, once the 15 minute load average decays to 0.05 (or 93 as saved internally), it will not decay any further.
Similarly for the 5 minute load average.
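
For illustration, a minimal standalone sketch of that fixed-point update (not a verbatim copy of the kernel's calc_load; decay_once is just a placeholder name, and the constants are the ones quoted above). With zero active load the 15-minute value indeed sticks at 93:

#include <stdio.h>

#define FSHIFT   11                 /* bits of precision */
#define FIXED_1  (1UL << FSHIFT)    /* 2048 == 1.0 */
#define EXP_15   2037               /* 15-minute filter coefficient */

/* one 5-second decay step, including the "+ a half" rounding term */
static unsigned long decay_once(unsigned long load, unsigned long exp,
                                unsigned long active)
{
        load *= exp;
        load += active * (FIXED_1 - exp);
        load += 1UL << (FSHIFT - 1);
        return load >> FSHIFT;
}

int main(void)
{
        unsigned long avg = 93;     /* internal value that reads back as 0.05 */
        int i;

        for (i = 0; i < 5; i++) {
                avg = decay_once(avg, EXP_15, 0);
                printf("tick %d: internal %lu (~%.2f)\n",
                       i + 1, avg, (double)avg / FIXED_1);
        }
        return 0;
}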
Comment 14 VTX 2016-01-18 15:33:53 UTC
Created attachment 200421 [details]
loadavg patch for 3.x kernels

Fixes the 1-minute EXP value, which for some mysterious reason is calculated over 60.5 seconds instead of 60 seconds.

Fixes the "load average: 0.00, 0.01, 0.05" bug.

The patch can also be used on 4.x kernels; the bug is still present in version 4.4. The culprit code has moved from core.c to loadavg.c, so the patch needs to be adjusted accordingly.
Comment 15 VTX 2016-01-18 15:54:28 UTC
No idea why the state of this bug is RESOLVED OBSOLETE, since the bug is still present in kernel 4.4.

Note: the load avg calculation is wrong on three counts:

1. the 1 minute EXP value EXP_1 is wrong. It should be 1877, but it is 1884.
  2048 - 2048 / (60 / 5) = 1877.3333...

The EXP_5 and EXP_15 values are, strangely enough, correct...

2. The rounding line that was added to the kernel more than 5 years ago adds a false feedback load into the load calculation:
  load += 1UL << (FSHIFT - 1);

Removing that line fixes the erroneous "load average: 0.00, 0.01, 0.05" uptime output that everyone is seeing on idle systems.

The above two problems are fixed by the patch attached just above; a sketch of the change appears at the end of this comment.

3. The current calc_load function (with or without the rounding fix) uses an algorithm that unfortunately "measures" the load over a much longer period than people may expect, and than is specified:
- the indicated 1-minute average still reflects some of the load of more than 5.58 minutes ago(!)
- the indicated 5-minute average bears the load weight of up to 20.5 minutes ago
- the indicated 15-minute average is influenced by the system's load of up to 46.75 minutes ago.

The indicated values were calculated by running a simulation, and later verified on an actual running system.

I do have a fix available for this point as well. I designed a new algorithm and implemented it in a new calc_load function. It measures the average load over exactly 1, 5 and 15 minutes respectively. Unfortunately this algorithm has two major flaws that I have been unable to fix, so I was hoping someone could help:
- the algorithm only works for a single-core system. I tested my implementation and it works really smoothly. For multi-core systems the algorithm needs to be adjusted; I know where it goes wrong, but I cannot implement it.
- the algorithm only works well for (single-core) loads between 0.00 and 1.00. It still operates in overload situations, but then the reported load remains too high for too long.

I'm not familiar enough with the core.c/loadavg.c code to fix this myself.
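
For illustration, a minimal sketch of the two changes described in points 1 and 2 above (this is not the attached patch itself; the function name is a placeholder, and the constants are the values stated in this comment):

#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)    /* 2048 == 1.0 */
#define EXP_1    1877               /* proposed: 2048 - 2048/12, instead of 1884 */
#define EXP_15   2037               /* unchanged */

static unsigned long calc_load_sketch(unsigned long load, unsigned long exp,
                                      unsigned long active)
{
        load *= exp;
        load += active * (FIXED_1 - exp);
        /* the "load += 1UL << (FSHIFT - 1);" rounding line is gone, so an
           idle average can now decay all the way down to 0 */
        return load >> FSHIFT;
}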
Comment 16 Alan 2016-01-18 16:06:42 UTC
The exponent looks like it might be a real bug.  I don't know about #2 (worth asking linux-kernel or the scheduler folks).

On #3

The load average of a Unix box is defined as a 1-minute, 5-minute and 15-minute decaying exponential average, not the exact load over or at that time. That is how all Unix boxes do it (and in fact 'load average' computed this way is older than Unix; Unix adopted an existing, even older convention, from Tenex I believe).
Comment 17 Doug Smythies 2016-01-18 16:27:43 UTC
I agree with the comments added by Alan. The filter is an IIR (Infinite Impulse Response) filter.

The rounding issue might change the low-load side of things, but it will then also have an effect on the high-load side (or so I think). Load average is a very coarse calculation anyhow and, due to its sub-sampled nature, can be very inaccurate; i.e., with proper timing one can load a system heavily yet get a reported load average of zero.
Comment 18 VTX 2016-01-18 16:57:08 UTC
Regarding the rounding: I asked Peter Zijlstra, who probably added that line, but he could not remember exactly why it was added five years ago.

In any case, I removed it, tested the kernel, and I see no adverse side effects.

Doug, indeed it does also have an effect on the high-load side. As a result of this rounding, the reported load is slightly lower than the actual load when that load approaches 1.00 per CPU. Given that the avgload calculation is coarse, the rounding does not really make sense anyway.

Regarding #3: the algorithm currently used (I had a look at it) behaves like a queue where the current load slice is added at the front and the last entry is popped off the array. The added value is known (it is 'active'); for the slice that is removed from the total there is no memory of its value, so the average itself is used instead. If you optimize this calculation, you end up with the very implementation used for calc_load. I actually went down that path while trying to optimize the algorithm, only to find that I ended up where I started :)
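
A quick sketch of that equivalence (illustration only; N stands for the number of 5-second slices in the averaging window): if the slice being removed from the queue is approximated by the current average itself, then

  new_avg = old_avg + (active - old_avg) / N
          = old_avg * (N - 1) / N + active * 1 / N

which, in fixed point, is exactly the calc_load form load * EXP + active * (FIXED_1 - EXP) with EXP = FIXED_1 * (N - 1) / N. For N = 12 (one minute of 5-second ticks) that gives 2048 * 11 / 12 = 1877.33, which is presumably where the 1877 value from comment 15 comes from.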
Comment 19 Doug Smythies 2016-01-21 22:18:34 UTC
Created attachment 200921 [details]
Just a graph of what I said on-list about rounding and the steady state error

As mentioned on list: removing the rounding would change the maximum error in the 15-minute reported load average from +-5% (+-0.05) to +0-10% (+0-0.1), depending on whether the steady-state load has increased or decreased from its previous value.
Comment 20 Doug Smythies 2016-01-21 22:25:07 UTC
Created attachment 200931 [details]
Forgot to label axis
Comment 21 VTX 2016-01-22 09:33:40 UTC
Created attachment 200951 [details]
loadavg patch for 4.x kernels, round down, round up

This patch for kernel 4.x fixes not only the "avgload 0.00, 0.01, 0.05" issue, but also the issue for which the rounding line was probably added in the first place.

This modified calc_load function rounds down when the newly added active load is less than the old average load, and rounds up otherwise. The behaviour is now symmetrical at both ends, 0.00 and 1.00.
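
A sketch of the behaviour described above (illustration only, not the attached patch; the function name is a placeholder): the result is rounded down while the new active load is below the old average, and rounded up otherwise.

static unsigned long calc_load_sym(unsigned long load, unsigned long exp,
                                   unsigned long active)
{
        unsigned long newload;

        newload = load * exp + active * (FIXED_1 - exp);
        if (active >= load)
                newload += FIXED_1 - 1;   /* round up while the load is rising */

        return newload / FIXED_1;         /* the division Doug comments on below */
}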

Doug, after your nice graphs and calculations I repeated the tests on my first patch tonight, but my tests do indeed reach 1.00 per core, most likely because 'active' occasionally and briefly exceeds FIXED_1 (2048).

I'm quite convinced that this new patch will also make your test reach 1.00 with whatever load generator you are using. I have been running it since tonight on my build host, and for me it behaves the same way as the other patch.
Comment 22 Doug Smythies 2016-01-22 15:37:08 UTC
@VTX: I like your new patch. It changes the 15-minute steady-state load average error from +-5% (+-0.05) to +-0% (+-0.00). I did not do a new graph, because there is no point, but I did run the numbers.

To save clock cycles, I would suggest changing this line:

return newload / FIXED_1;

To this:

return newload >> FSHIFT;

I am running your new proposed code now, but I do not expect any surprises.
Comment 23 VTX 2016-01-22 18:44:03 UTC
I appreciate your comments, your testing, and your graphics. I guess the last one will be a diagonal in a rectangular box ;)

But (the last time I had this discussion was over 15 years ago): please, please, please have a look at the object code or assembler code generated for x/2048 and x>>11. It will probably surprise you, and you will admire those gcc compiler folks with the awe they deserve.
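
For what it's worth, a tiny example of the point being made (illustration only): for the unsigned types used here, gcc already turns a division by a power of two into a single shift, so the two forms below typically compile to the same machine code. (For signed operands the compiler has to add a small fix-up before the shift, because C division truncates toward zero.)

unsigned long div_form(unsigned long x)   { return x / 2048; }   /* becomes a shift */
unsigned long shift_form(unsigned long x) { return x >> 11; }    /* same code at -O2 */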
Comment 24 Doug Smythies 2016-01-22 19:33:23 UTC
(In reply to VTX from comment #23)
> I guess the last one will be a diagonal in a rectangular box ;)

Yes.

> But (the last time I had this discussion was over 15 years ago): please,
> please, please have a look at the object code or assembler code generated
> for x/2048 and x>>11. It will probably surprise you, and you will admire
> those gcc compiler folks with the awe they deserve.

I'll take your word for it.

Here are my last 3 uptime checks with this version 2 patch, running a 100% load on CPU 7 and my test server otherwise idle (no surprise, as expected):

doug@s15:~$ uptime
 08:30:51 up 24 min,  3 users,  load average: 1.02, 1.01, 0.81
doug@s15:~$ uptime
 10:47:55 up  2:41,  3 users,  load average: 1.00, 1.00, 1.00
doug@s15:~$ uptime
 11:26:50 up  3:20,  3 users,  load average: 1.00, 1.00, 1.00

Yesterday, with version 1, after about 3 hours it was 1.00, ? (I forget), 0.91
Comment 25 Doug Smythies 2016-02-06 01:31:48 UTC
@VTX: Was there ever a new patch submitted? I never got an e-mail that there was, but I might not have been on the list. I don't see it in kernel 4.5-rc2, but maybe it is queued for 4.6?
Comment 26 VTX 2016-09-01 14:14:55 UTC
Someone who can, please close this bug. It was fixed in 4.6 and has been backported in the meantime to older kernels.
