Bug 69821 - setting CONFIG_HZ_PERIODIC locks cpu at the lowest frequency using ONDEMAND
Summary: setting CONFIG_HZ_PERIODIC locks cpu at the lowest frequency using ONDEMAND
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: cpufreq
Hardware: All Linux
Importance: P1 normal
Assignee: Chen Yu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-01 20:09 UTC by j.fikar
Modified: 2016-05-18 01:09 UTC
8 users

See Also:
Kernel Version: 3.13.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
fix idle calc in cpufreq governor load (598 bytes, application/octet-stream)
2015-12-16 02:15 UTC, Chen Yu
Details
Revert error idle time calculating (1.49 KB, text/plain)
2016-04-21 07:07 UTC, Chen Yu
Details
Make idle time monotonic (2.07 KB, text/plain)
2016-04-21 07:08 UTC, Chen Yu
Details
cpufreq: governor: fix get_cpu_idle_time() usage in cpufreq_start_governor() (2.44 KB, patch)
2016-04-21 20:50 UTC, Rafael J. Wysocki
Details | Diff
[update] cpufreq: governor: fix get_cpu_idle_time() usage in cpufreq_start_governor() (2.44 KB, patch)
2016-04-21 21:05 UTC, Rafael J. Wysocki
Details | Diff
[update 2x] cpufreq: governor: fix get_cpu_idle_time() usage in cpufreq_start_governor() (2.45 KB, patch)
2016-04-21 21:07 UTC, Rafael J. Wysocki
Details | Diff
cpufreq: do not use cputime_to_usecs() in get_cpu_idle_time_jiffy() (650 bytes, patch)
2016-04-21 21:16 UTC, Rafael J. Wysocki
Details | Diff
dmesg trace (3.12 KB, text/plain)
2016-04-22 03:35 UTC, Timo Valtoaho
Details
failed boot (56.07 KB, application/octet-stream)
2016-04-23 07:17 UTC, Timo Valtoaho
Details
v2 cpufreq governor fix get_cpu_idle_time usage in cpufreq_start_governor() (2.71 KB, patch)
2016-04-23 11:22 UTC, Chen Yu
Details | Diff
v2_cpufreq do not use cputime_to_usecs in get_cpu_idle_time_jiffy (1.18 KB, patch)
2016-04-23 11:23 UTC, Chen Yu
Details | Diff
cpufreq: governor: Fix handling of special cases in dbs_update() (4.10 KB, patch)
2016-04-28 13:02 UTC, Rafael J. Wysocki
Details | Diff
[v2] cpufreq: governor: Fix handling of special cases in dbs_update() (4.17 KB, patch)
2016-04-28 13:07 UTC, Rafael J. Wysocki
Details | Diff
[v3] cpufreq: governor: Fix handling of special cases in dbs_update() (4.36 KB, patch)
2016-05-05 12:57 UTC, Rafael J. Wysocki
Details | Diff
[v4] cpufreq: governor: Fix handling of special cases in dbs_update() (5.58 KB, patch)
2016-05-05 22:00 UTC, Rafael J. Wysocki
Details | Diff

Description j.fikar 2014-02-01 20:09:43 UTC
With CONFIG_HZ_PERIODIC=y the CPU stays at the lowest frequency (800 MHz) even if usage goes to 100%; the frequency does not scale up. The governor in use is ondemand; conservative does not work either. The performance and userspace governors work as expected.

With CONFIG_NO_HZ_IDLE or CONFIG_NO_HZ_FULL the CPU scales up with ondemand as expected.

The CPU is an AMD family 10h and uses CONFIG_X86_ACPI_CPUFREQ.
Comment 1 Len Brown 2014-02-11 00:52:41 UTC
just curious, why are you using CONFIG_HZ_PERIODIC=y ?
Comment 2 j.fikar 2014-02-11 10:01:18 UTC
Well, I was trying the new CONFIG_NO_HZ_FULL and comparing it to the two old CONFIG_NO_HZ_IDLE and CONFIG_HZ_PERIODIC. To my surprise the oldest (and I thought most stable and most tested) CONFIG_HZ_PERIODIC causes problems with CPU scaling.

On the other hand, I'm aware that CONFIG_NO_HZ_IDLE is the recommended default and that there was a lot of recent development due to the new CONFIG_NO_HZ_FULL. 

Anyway, I think it is a bug and not expected behaviour, right?
Comment 3 Zhang Rui 2014-06-03 06:14:20 UTC
ping Len...
Comment 4 Chen Yu 2015-12-15 00:12:02 UTC
CONFIG_NO_HZ_FULL uses ktime_get() to obtain the idle time in get_cpu_idle_time_us(); it is based on the clocksource. If the kernel is configured with CONFIG_HZ_PERIODIC, the governor gets the idle time from jiffies (ticks, with a minimum unit of 4 milliseconds).
The former is more accurate than the latter; this might be one of the reasons for the problem. I can try to reproduce it and take a look.
Comment 5 Chen Yu 2015-12-16 02:09:55 UTC
There is a problem in cpufreq_governor.c in how the load is calculated
for each policy:

it uses get_cpu_idle_time() to get the total idle time for the current CPU. If the kernel
is configured with CONFIG_NO_HZ_FULL, this is OK, because as mentioned in #Comment 4 the time is returned by ktime_get(), which is always increasing. However,
if it is configured with CONFIG_HZ_PERIODIC, the situation changes:
the tick count is used to calculate the idle time via get_cpu_idle_time_jiffy().
Consider the following scenario:

At T1: 
idle_tick_1 = total_tick_1 - user_tick_1

At T2: ( T2 = T1 + 80ms):
idle_tick_2 = total_tick_2 - user_tick_2


The current algorithm gets the idle time for the past sampling period by calculating (idle_tick_2 - idle_tick_1), but you CANNOT guarantee that idle_tick_2 is bigger than idle_tick_1, so we might get a negative value for the idle time during the past sample. This can make the system think its idle time
is very big and its busy time near zero, which causes the governor to always choose the lowest cpufreq state, and hence this problem.

There are two solutions to this problem:
1. Since the root cause is relying on the idle tick in every sample period, we should rely on the busy time directly in each sample instead; the latter is how the 'top' command implements its feature.

2. Alternatively, we can work around it by making sure the idle time is strictly increasing in each sample. This solution needs minimal modification and the RFC patch is attached.
Comment 6 Chen Yu 2015-12-16 02:15:17 UTC
Created attachment 197491 [details]
fix idle calc in cpufreq governor  load
Comment 7 Chen Yu 2015-12-17 04:21:26 UTC
patch sent
https://patchwork.kernel.org/patch/7859871/
Comment 8 Chen Yu 2016-04-21 07:07:57 UTC
Created attachment 213501 [details]
Revert error idle time calculating
Comment 9 Chen Yu 2016-04-21 07:08:39 UTC
Created attachment 213511 [details]
Make idle time monotonic
Comment 10 Chen Yu 2016-04-21 07:11:09 UTC
Hi Fikar,
do you have a chance to help test whether the patches from #Comment 8 and #Comment 9 work for you?
My previous patch introduced a regression, and this is another solution for it.
Thanks.
Comment 11 Chen Yu 2016-04-21 07:11:39 UTC
related link:
https://bugzilla.kernel.org/show_bug.cgi?id=115261
Comment 12 Rafael J. Wysocki 2016-04-21 20:50:51 UTC
Created attachment 213611 [details]
cpufreq: governor: fix get_cpu_idle_time() usage in cpufreq_start_governor()

The patch from comment #9 can be combined with this one (on top of 4.6-rc4; untested, but builds OK) and then all should actually work.
Comment 13 Rafael J. Wysocki 2016-04-21 20:55:20 UTC
@Chen Yu: Of course, the cputime_to_usecs() in get_cpu_idle_time_jiffy() would still be problematic as the uptime/idle time will grow over the 32-bit limit eventually, so that needs to be dealt with too.
Comment 14 Rafael J. Wysocki 2016-04-21 21:05:18 UTC
Created attachment 213621 [details]
[update] cpufreq: governor: fix get_cpu_idle_time() usage in cpufreq_start_governor()

Updated patch with one mistake fixed.
Comment 15 Rafael J. Wysocki 2016-04-21 21:07:21 UTC
Created attachment 213631 [details]
[update 2x] cpufreq: governor: fix get_cpu_idle_time() usage in cpufreq_start_governor()

Fix the mistake for real now.
Comment 16 Rafael J. Wysocki 2016-04-21 21:16:12 UTC
Created attachment 213641 [details]
cpufreq: do not use cputime_to_usecs() in get_cpu_idle_time_jiffy()

For completeness, that's what I'd do about the cputime_to_usecs() in get_cpu_idle_time_jiffy().

@Chen Yu: Do you see any problems with this?
Comment 17 Timo Valtoaho 2016-04-22 03:35:31 UTC
Created attachment 213661 [details]
dmesg trace

To make sure, there should be 4 patches applied on top of 4.6-rc4:

https://bugzilla.kernel.org/attachment.cgi?id=213501
https://bugzilla.kernel.org/attachment.cgi?id=213511
https://bugzilla.kernel.org/attachment.cgi?id=213631
https://bugzilla.kernel.org/attachment.cgi?id=213641

Anyway, they are all applied and I get what's inside the attachment.
Comment 18 Chen Yu 2016-04-22 07:49:45 UTC
(In reply to Rafael J. Wysocki from comment #16)
> Created attachment 213641 [details]
> cpufreq: do not use cputime_to_usecs() in get_cpu_idle_time_jiffy()
> 
> For completeness, that's what I'd do about the cputime_to_usecs() in
> get_cpu_idle_time_jiffy().
> 
> @Chen Yu: Do you see any problems with this?

When CONFIG_TICK_CPU_ACCOUNTING is not set, the cputime is not
nanosecond-based but more likely a jiffy (tick),
because:
#define cputime_to_jiffies(__ct)        (__force unsigned long)(__ct)
so we might not get the correct result by dividing it directly by NSEC_PER_USEC,
and we might get a 'divide by zero' exception as depicted in #Comment 17:
uptime = cur_uptime - j_cdbs->prev_cpu_wall;
maybe uptime is zero in the above situation; I'll run it on my laptop to confirm.
Comment 19 Rafael J. Wysocki 2016-04-23 00:10:40 UTC
(In reply to Chen Yu from comment #18)
> (In reply to Rafael J. Wysocki from comment #16)
> > Created attachment 213641 [details]
> > cpufreq: do not use cputime_to_usecs() in get_cpu_idle_time_jiffy()
> > 
> > For completeness, that's what I'd do about the cputime_to_usecs() in
> > get_cpu_idle_time_jiffy().
> > 
> > @Chen Yu: Do you see any problems with this?
> 
> When CONFIG_TICK_CPU_ACCOUNTING is not set, the cputime is not
> nanosecond-based, but more likely a jiffy(tick),

Good point.
Comment 20 Timo Valtoaho 2016-04-23 07:17:15 UTC
Created attachment 213781 [details]
failed boot

I made some tests and whenever the patch in comment #15 is applied, my system won't boot properly. Attached is a full dmesg of failed boot (single mode). This one had comment #15 and comment #16 patched in on top of 4.6-rc4.
Comment 21 Chen Yu 2016-04-23 11:19:17 UTC
I made some modifications to the order of initialization of prev_cpu_wall in #Comment 15, since some policies share the same governor data; otherwise, other policies would not cover this code path. Meanwhile, I bypass a zero delta wall_time to avoid the 'divide by zero' issue.

For #Comment 16, I copied the code from jiffies_to_usecs().
Comment 22 Chen Yu 2016-04-23 11:22:50 UTC
Created attachment 213801 [details]
v2 cpufreq governor fix get_cpu_idle_time usage in cpufreq_start_governor()
Comment 23 Chen Yu 2016-04-23 11:23:09 UTC
Created attachment 213811 [details]
v2_cpufreq do not use cputime_to_usecs in get_cpu_idle_time_jiffy
Comment 24 Timo Valtoaho 2016-04-23 12:56:20 UTC
#22 and #23 applied on top of clean 4.6-rc4. Is this how it should be?

Good news, system boots up. Also, no wild horses running around, meaning that cpufreq is working fine.
Comment 25 Chen Yu 2016-04-23 13:37:46 UTC
(In reply to Timo Valtoaho from comment #24)
> #22 and #23 applied on top of clean 4.6-rc4. Is this how it should be?
> 
> Good news, system boots up. Also, no wild horses running around, meaning
> that cpufreq is working fine.

These should also be applied:
https://bugzilla.kernel.org/attachment.cgi?id=213501
https://bugzilla.kernel.org/attachment.cgi?id=213511
Comment 26 Timo Valtoaho 2016-04-23 13:53:11 UTC
You mean that the correct way is to patch them all?

Is there order in which they should be applied, ie. first 213501, then 213511, 213801 and finally 213811?
Comment 27 Chen Yu 2016-04-23 14:25:34 UTC
(In reply to Timo Valtoaho from comment #26)
> You mean that the correct way is to patch them all?
> 
> Is there order in which they should be applied, ie. first 213501, then
> 213511, 213801 and finally 213811?
Yes, you can apply them in this order.
Comment 28 Timo Valtoaho 2016-04-23 16:07:23 UTC
Ok, done. Everything ok, uptime 1:25 and no problems.

How about the stable 4.5 series? Fix for it coming too?
Comment 29 Rafael J. Wysocki 2016-04-25 00:33:18 UTC
(In reply to Chen Yu from comment #22)
> Created attachment 213801 [details]
> v2 cpufreq governor fix get_cpu_idle_time usage in cpufreq_start_governor()

Actually, I think that j_cdbs->prev_load should always be initialized to 0, because it will then cause dbs_update() to always compute the load when it is run first time after the initialization.
Comment 30 Rafael J. Wysocki 2016-04-25 13:15:42 UTC
(In reply to Rafael J. Wysocki from comment #29)
> (In reply to Chen Yu from comment #22)
> > Created attachment 213801 [details]
> > v2 cpufreq governor fix get_cpu_idle_time usage in cpufreq_start_governor()
> 
> Actually, I think that j_cdbs->prev_load should always be initialized to 0,
> because it will then cause dbs_update() to always compute the load when it
> is run first time after the initialization.

Patch to do that has been posted: https://patchwork.kernel.org/patch/8922431/

It replaces the one from comment #22.
Comment 31 Chen Yu 2016-04-25 15:36:19 UTC
(In reply to Rafael J. Wysocki from comment #30)
> (In reply to Rafael J. Wysocki from comment #29)
> > (In reply to Chen Yu from comment #22)
> > > Created attachment 213801 [details]
> > > v2 cpufreq governor fix get_cpu_idle_time usage in
> cpufreq_start_governor()
> > 
> > Actually, I think that j_cdbs->prev_load should always be initialized to 0,
> > because it will then cause dbs_update() to always compute the load when it
> > is run first time after the initialization.
> 
> Patch to do that has been posted: https://patchwork.kernel.org/patch/8922431/
> 
> It replaces the one from comment #22.

Thanks! I've tested it on my laptop together with the other patches; it runs well.
Comment 32 Chen Yu 2016-04-25 15:38:02 UTC
Timo,
would you please apply these patches and test if it works for you:

first 213501, then 213511, 
then, https://patchwork.kernel.org/patch/8922431/
 and finally 213811?
Comment 33 Timo Valtoaho 2016-04-25 16:55:37 UTC
Did apply them, still on top of 4.6-rc4, and been running it for a while now. Everything fine here.
Comment 34 Rafael J. Wysocki 2016-04-27 23:28:27 UTC
@Chen Yu: Let's do some maths here.

The unpatched get_cpu_idle_time_jiffy() returns:

cur_wall_time = uptime + INITIAL_JIFFIES
idle_time = uptime + INITIAL_JIFFIES - busy_time

In the CONFIG_TICK_CPU_ACCOUNTING set case those undergo 64-bit conversion to microseconds, so we get

cur_wall_time_us = uptime_us + INITIAL_JIFFIES_us
idle_time_us = uptime_us + INITIAL_JIFFIES_us - busy_time_us

up to rounding errors.

After the patch from comment #30 the only user of those values is dbs_update() and it computes

wall_time = uptime_us + INITIAL_JIFFIES_us - prev_uptime_us - INITIAL_JIFFIES_us = uptime_us - prev_uptime_us
idle_time = uptime_us + INITIAL_JIFFIES_us - busy_time_us - prev_uptime_us - INITIAL_JIFFIES_us + prev_busy_time_us = uptime_us - prev_uptime_us - busy_time_us + prev_busy_time_us

Since INITIAL_JIFFIES_us cancel out, do we still need the patch from comment #9?
Comment 35 Chen Yu 2016-04-28 02:35:47 UTC
(In reply to Rafael J. Wysocki from comment #34)
> @Chen Yu: Let's do some maths here.
> 
> The unpatched get_cpu_idle_time_jiffy() returns:
> 
> cur_wall_time = uptime + INITIAL_JIFFIES
> idle_time = uptime + INITIAL_JIFFIES - busy_time
> 
> In the CONFIG_TICK_CPU_ACCOUNTING set case those undergo 64-bit conversion
> to microseconds, so we get
> 
> cur_wall_time_us = uptime_us + INITIAL_JIFFIES_us
> idle_time_ms = uptime_us + INITIAL_JIFFIES_us - busy_time_us
> 
> up to rounding errors.
> 
> After the patch from comment #30 the only user of those values is
> dbs_update() and it computes
> 
> wall_time = uptime_us + INITIAL_JIFFIES_us - prev_uptime_us -
> INITIAL_JIFFIES_us = uptime_us - prev_uptime_us
> idle_time = uptime_us + INITIAL_JIFFIES_us - busy_time_us - prev_uptime_us -
> INITIAL_JIFFIES_us + prev_busy_time_us = uptime_us - prev_uptime_us -
> busy_time_us + prev_busy_time_us
> 
> Since INITIAL_JIFFIES_us cancel out, do we still need the patch from comment
> #9?

@Rafael

But we cannot guarantee that the 'total' idle time is always increasing, which might
cause a negative delta idle time, especially when the CPU is very busy (the idle increment is very small):

idle_time
= uptime_us - prev_uptime_us - busy_time_us + prev_busy_time_us
= (uptime_us - busy_time_us) - (prev_uptime_us - prev_busy_time_us)

So although uptime_us is bigger than prev_uptime_us, and busy_time_us is bigger than prev_busy_time_us, the subtraction result might not always be positive.

(Jiffies updating is done by CPU0, and idle time statistics sampling may be carried out on non-CPU0 CPUs; since there is no synchronization between CPU0.update_jiffies and CPU1.stat[USERS], they do not always increase at the same speed.)
Comment 36 Rafael J. Wysocki 2016-04-28 12:09:17 UTC
(In reply to Chen Yu from comment #35)
> (In reply to Rafael J. Wysocki from comment #34)
> > @Chen Yu: Let's do some maths here.
> > 
> > The unpatched get_cpu_idle_time_jiffy() returns:
> > 
> > cur_wall_time = uptime + INITIAL_JIFFIES
> > idle_time = uptime + INITIAL_JIFFIES - busy_time
> > 
> > In the CONFIG_TICK_CPU_ACCOUNTING set case those undergo 64-bit conversion
> > to microseconds, so we get
> > 
> > cur_wall_time_us = uptime_us + INITIAL_JIFFIES_us
> > idle_time_ms = uptime_us + INITIAL_JIFFIES_us - busy_time_us
> > 
> > up to rounding errors.
> > 
> > After the patch from comment #30 the only user of those values is
> > dbs_update() and it computes
> > 
> > wall_time = uptime_us + INITIAL_JIFFIES_us - prev_uptime_us -
> > INITIAL_JIFFIES_us = uptime_us - prev_uptime_us
> > idle_time = uptime_us + INITIAL_JIFFIES_us - busy_time_us - prev_uptime_us
> -
> > INITIAL_JIFFIES_us + prev_busy_time_us = uptime_us - prev_uptime_us -
> > busy_time_us + prev_busy_time_us
> > 
> > Since INITIAL_JIFFIES_us cancel out, do we still need the patch from
> comment
> > #9?
> 
> @Rafael
> 
> But we can not guarantee 'total' idle time is always increasing, thus might
> cause a negative delta idle time, especially when the cpu is very busy(idle
> increment is very small):
>  
> idle_time
> = uptime_us - prev_uptime_us - busy_time_us + prev_busy_time_us
> = (uptime_us - busy_time_us) - (prev_uptime_us - prev_busy_time_us)
> 
> so although uptime_us is bigger than prev_uptime_us, and busy_time_us is
> bigger than prev_busy_time_us, but the subtraction result might not always
> be positive,

The patch from comment #9 won't address that problem anyway, will it?

> (jiffies updating is done by CPU0, and idle time statistic sampling may be
> carried  on non-CPU0 CPUs, since there is no synchronization between
> CPU0.update_jiffies and CPU1.stat[USERS], they are not always increasing at
> the same speed)

Right.

In that case we have wall_time < idle_time in dbs_update() and 0 is returned as the load.  But that can only happen if (uptime_us - prev_uptime_us) is comparable with (busy_time_us - prev_busy_time_us), so it should be safe to assume that the CPU wasn't really idle at all during that time.  So dbs_update() should return 100 in that case.
Comment 37 Rafael J. Wysocki 2016-04-28 13:02:03 UTC
Created attachment 214601 [details]
cpufreq: governor: Fix handling of special cases in dbs_update()

So I have this patch that should address the "negative idle time" problem IMO (it is on top of https://patchwork.kernel.org/patch/8964001/).

Please tell me what you think.
Comment 38 Rafael J. Wysocki 2016-04-28 13:07:22 UTC
Created attachment 214611 [details]
[v2] cpufreq: governor: Fix handling of special cases in dbs_update()

The previous version was incomplete, sorry.
Comment 39 Chen Yu 2016-05-02 07:40:02 UTC
Hi!
Sorry for the late response, I just got back home.

(In reply to Rafael J. Wysocki from comment #36)
> 
> The patch from comment #9 won't address that problem anyway, will it?
> 
If I understand correctly, #Comment 9 plus #Comment 23 should avoid the negative idle delta issue, because on one CPU:
1. cpu_stat[IDLE] should always be increasing.
2. Even if cpu_stat[IDLE] increases faster than jiffies_64, which
   satisfies wall_time < idle_time so that load = 0 is returned,
   this is the actual expected result (the CPU is very idle).


> > (jiffies updating is done by CPU0, and idle time statistic sampling may be
> > carried  on non-CPU0 CPUs, since there is no synchronization between
> > CPU0.update_jiffies and CPU1.stat[USERS], they are not always increasing at
> > the same speed)
> 
> Right.
> 
> In that case we have wall_time < idle_time in dbs_update() and 0 is returned
> as the load.  But that can only happen if (uptime_us - prev_uptime_us) is
> comparable with (busy_time_us - prev_busy_time_us), so it should be safe to
> assume that the CPU wasn't really idle at all during that time.  So
> dbs_update() should return 100 in that case.
Yes, agreed.
Comment 40 Chen Yu 2016-05-02 07:48:13 UTC
(In reply to Rafael J. Wysocki from comment #37)
> Created attachment 214601 [details]
> cpufreq: governor: Fix handling of special cases in dbs_update()
> 
> So I have this patch that should address the "negative idle time" problem
> IMO (it is on top of https://patchwork.kernel.org/patch/8964001/).
> 
> Please tell me what you think.
For the following line:
load = (int)idle_time < 0 ? 100 : 0;
is idle_time expected to be truncated to 32 bits by cputime_to_usecs(), or
do we need #Comment 23 and to change the cast to (s64)idle_time?
Comment 41 Chen Yu 2016-05-02 07:56:49 UTC
(In reply to Rafael J. Wysocki from comment #38)
> Created attachment 214611 [details]
> [v2] cpufreq: governor: Fix handling of special cases in dbs_update()
> 
> The previous version was incomplete, sorry.

I'll test it on my laptop.
Comment 42 Rafael J. Wysocki 2016-05-02 21:35:04 UTC
(In reply to Chen Yu from comment #40)
> (In reply to Rafael J. Wysocki from comment #37)
> > Created attachment 214601 [details]
> > cpufreq: governor: Fix handling of special cases in dbs_update()
> > 
> > So I have this patch that should address the "negative idle time" problem
> > IMO (it is on top of https://patchwork.kernel.org/patch/8964001/).
> > 
> > Please tell me what you think.
> For the following line:
> load = (int)idle_time < 0 ? 100 : 0;
> idle_time is expected to be truncated into 32bit by cputime_to_usecs, or
> do we need #Comment 23 and change it to (s64)idle_time?

idle_time is unsigned int.
Comment 43 Chen Yu 2016-05-04 03:48:49 UTC
Tested on top of 4.6-rc6 and with 4 patches:

https://patchwork.kernel.org/patch/8964001/
cpufreq: governor: Change confusing struct field and variable names

https://patchwork.kernel.org/patch/8922431/
cpufreq: governor: Fix prev_load initialization in cpufreq_governor_start()

#Comment 23 + #Comment 38
For the first few hours cpufreq behaved normally; however, it dropped to a low frequency under 100% stress CPU load after a few hours. I'll confirm whether it is related to TDP. Timo, would you please help double-check? Thanks.
Comment 44 Chen Yu 2016-05-04 16:24:40 UTC
For #Comment 43, the package power limit was reached, which reduced the performance, so this is working as expected.

(Bits 10~11 are set to 1 after 100% load for some time; this means "Indicates package power limit is forcing one or more
processors to operate below OS-requested P-state":
# rdmsr 0x19c
88200c02
)

@Rafael,
for #Comment 38 
Tested-by: Chen Yu <yu.c.chen@intel.com>
Comment 45 Timo Valtoaho 2016-05-04 17:58:45 UTC
Testing according to comment #43, except that I had to apply 8922431 before 8964001.

Anyway, uptime is now 4 hours 19 minutes and I have the opposite problem from you. The config still has

CONFIG_HZ_PERIODIC=y
# CONFIG_TICK_CPU_ACCOUNTING is not set

All cores run at max frequency under load, and for some reason they just get stuck there, so the frequency does not come down when the load is removed.
Comment 46 Rafael J. Wysocki 2016-05-04 20:09:31 UTC
(In reply to Timo Valtoaho from comment #45)
> Testing according to comment #43, except that I had to apply 8922431 before
> 8964001.

That's correct, 8964001 is on top of 8922431.

Can you please try just the patch from comment #38 on top of these two?
Comment 47 Chen Yu 2016-05-05 03:36:23 UTC
(In reply to Timo Valtoaho from comment #45)
> Testing according to comment #43, except that I had to apply 8922431 before
> 8964001.
> 
> Anyway, uptime is now 4 hours 19 minutes and I have opposite problem than
> you. config still has
> 
> CONFIG_HZ_PERIODIC=y
> # CONFIG_TICK_CPU_ACCOUNTING is not set
> 
> All cores are running at max freq under load, and for some reason they just
> stucked there, so frequency is not coming down when unloaded.
Which tool are you using to generate the cpu workload?
Comment 48 Timo Valtoaho 2016-05-05 04:11:21 UTC
@Rafael, I applied -Rp1 for the patch in comment #23 and am running the kernel now. I will let you know of any odd behaviour. After my previous post I tried to boot into a stable kernel, but I had to hard reset, since the normal shutdown/restart procedures hung the system completely.

@Chen Yu, I'm just compiling some source code with various -j values. In that case I was compiling the kernel with "make -j6 bindeb-pkg" to upset my 4-core CPU when it freaked out. So I'm not using any tool; should I? From my point of view, load is load regardless of where it comes from.
Comment 49 Rafael J. Wysocki 2016-05-05 12:57:32 UTC
Created attachment 215331 [details]
[v3] cpufreq: governor: Fix handling of special cases in dbs_update()

@Timo: There is a scenario that may lead to the effect you have observed with the patch from comment #38.

Namely, if idle_time turns out to be negative, both load and
j_cdbs->prev_load become 100, so if the system goes idle subsequently,
dbs_update() will return 100 and reset j_cdbs->prev_load.  However,
if idle_time turns out to be negative on the next invocation of
dbs_update(), the whole exercise will repeat.  If that goes in
cycles, you get load of 100 all the time.

This version of the patch forces the load to be computed again on the next invocation of dbs_update() after a negative idle_time is detected, which should prevent the above scenario from happening.

Please try this one if possible.
Comment 50 Timo Valtoaho 2016-05-05 18:50:46 UTC
OK, I have 4.6-rc6 topped with 8922431, 8964001 and 215331. Uptime is 1:25 and there are no issues. I will comment if something new appears.

Also, the last version, per comment #46, was working OK for me; it ran for 12 hours without problems.
Comment 51 Rafael J. Wysocki 2016-05-05 21:57:44 UTC
OK, thanks!

So they both work.  I'm sort of divided now. :-)

I slightly prefer the one from comment #38 as it is a bit less kludgy, so I'm going to post something equivalent to it.
Comment 52 Rafael J. Wysocki 2016-05-05 22:00:00 UTC
Created attachment 215351 [details]
[v4] cpufreq: governor: Fix handling of special cases in dbs_update()

That's what I'm going to post.

Technically, it is equivalent to the patch from comment #38 (at least if I didn't mess it up) and hence the Tested-by tags, but please test it too if you can, just to be sure.
Comment 53 j.fikar 2016-05-06 08:20:19 UTC
I'll try the patches as well; I just have a slight problem with vanilla 4.6-rcX on the original computer: it hits a kernel panic shortly after boot, probably something related to EDAC...
Comment 54 Chen Yu 2016-05-06 09:17:55 UTC
uptime 2:40 hours, no problem found.
patchwork 8922431 8964001  and 9028101.

Meanwhile, I'm still a little curious why #Comment 23 made Timo's system stick at the highest frequency after the load was removed.
Comment 55 j.fikar 2016-05-06 13:30:54 UTC
I'm able to run 4.6-rc6 now and things are getting interesting.

On vanilla 4.6-rc6 with CONFIG_HZ_PERIODIC, the CPU now scales up correctly but seems to stay at high frequencies after I stop the load. It looks like both CPUs think they are still busy; the usage is not stable and oscillates around 70%, as can be seen with e.g. htop, but there is of course no process causing the usage.

It looks like the symptoms have changed since my first report of the bug, nevertheless you found the correct cause.

With the proposed patches 8922431 8964001 and 9028101 everything seems to be fine. CPUs scale up and down and they show usage 0% when idle, as expected.

I'll observe it a bit longer though...
Comment 56 Timo Valtoaho 2016-05-06 14:11:11 UTC
Are 215351 here and patchwork 9028101 the same?

I replaced 215331 with 215351. All patches applied now are 8922431, 8964001 and 215351. Uptime 10:15 and all is OK. Scaling works as it should, and there are no sticky frequencies.
Comment 57 j.fikar 2016-05-06 14:18:57 UTC
Sorry, it seems the unexplained ~70% load in the vanilla case was caused by a process (sslh); after killing it, even the kernel without the patches behaves normally.

Not sure how to reproduce the bug on a recent kernel.
Comment 58 Rafael J. Wysocki 2016-05-06 19:47:33 UTC
(In reply to j.fikar from comment #57)
> Sorry, it seems like the unexplained load about 70% in vanilla case was
> caused by a process (sslh), after killing it even the kernel without the
> patches behaves normally.
> 
> Not sure how to reproduce the bug on recent kernel.

No worries.  I'm glad that it works. :-)
Comment 59 Rafael J. Wysocki 2016-05-06 19:48:42 UTC
The patch from comment #52 posted as:

https://patchwork.kernel.org/patch/9028101/
Comment 60 Chen Yu 2016-05-07 03:49:24 UTC
(In reply to Chen Yu from comment #54)
> uptime 2:40 hours, no problem found.
> patchwork 8922431 8964001  and 9028101.
> 
> Meanwhile I'm still a little curious why #Comment 23 has made Timo's system
> stick at highest freq after unload.

#Comment 23 is not needed IMO, since the code currently only cares about the delta of the idle time/update time, and the delta is of type unsigned int (which is 32 bits), so even if cputime_to_usecs() truncates the result to 32 bits, it still works. For example:

u64 prev_idle = 0xfffffffe;
u64 cur_idle = 0x2;
u32 idle_time = cur_idle - prev_idle = 0x4
Comment 61 Rafael J. Wysocki 2016-05-18 01:09:39 UTC
Merged as commit 9485e4ca0b48 "cpufreq: governor: Fix handling of special cases in dbs_update()".

Closing.
