With CONFIG_HZ_PERIODIC=y the CPU stays at the lowest frequency (800 MHz) even when usage goes to 100%; the frequency does not scale up. The governor in use is ondemand; conservative does not work either. The performance and userspace governors work as expected. With CONFIG_NO_HZ_IDLE or CONFIG_NO_HZ_FULL the CPU scales up with ondemand as expected. The CPU is an AMD family 10h and uses CONFIG_X86_ACPI_CPUFREQ.
Just curious, why are you using CONFIG_HZ_PERIODIC=y?
Well, I was trying the new CONFIG_NO_HZ_FULL and comparing it to the two older options, CONFIG_NO_HZ_IDLE and CONFIG_HZ_PERIODIC. To my surprise, the oldest (and, I thought, most stable and best-tested) option, CONFIG_HZ_PERIODIC, causes problems with CPU frequency scaling. On the other hand, I'm aware that CONFIG_NO_HZ_IDLE is the recommended default and that there was a lot of recent development due to the new CONFIG_NO_HZ_FULL. Anyway, I think this is a bug and not expected behaviour, right?
ping Len...
With CONFIG_NO_HZ_FULL, get_cpu_idle_time_us() uses ktime_get() to obtain the idle time, which is based on the clocksource. If the kernel is configured with CONFIG_HZ_PERIODIC, the governor instead gets the idle time from jiffies (ticks, with a minimum unit of 4 ms at HZ=250). The former is more accurate than the latter, which might be one reason for the problem. I can try to reproduce this and take a look.
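For reference, a simplified sketch of how the two paths get selected (paraphrased from memory of the 4.x drivers/cpufreq/cpufreq.c, not a verbatim copy):

u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy)
{
	/* NO_HZ path: precise, clocksource-based idle time */
	u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);

	if (idle_time == -1ULL)
		/* NO_HZ accounting unavailable (e.g. CONFIG_HZ_PERIODIC):
		 * fall back to tick-granularity jiffies accounting */
		return get_cpu_idle_time_jiffy(cpu, wall);
	else if (!io_busy)
		idle_time += get_cpu_iowait_time_us(cpu, wall);

	return idle_time;
}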
There is a problem in cpufreq_governor.c in how the load is calculated for each policy: it uses get_cpu_idle_time() to get the total idle time for the current CPU. With CONFIG_NO_HZ_FULL this is OK because, as mentioned in #Comment 4, the time is returned by ktime_get(), which is always increasing. With CONFIG_HZ_PERIODIC, however, the situation changes: the idle time is now calculated from tick counts by get_cpu_idle_time_jiffy(). Consider the following scenario:

At T1: idle_tick_1 = total_tick_1 - user_tick_1
At T2 (T2 = T1 + 80ms): idle_tick_2 = total_tick_2 - user_tick_2

The current algorithm obtains the idle time over the past sampling period by calculating (idle_tick_2 - idle_tick_1). But since you CANNOT guarantee that idle_tick_2 is bigger than idle_tick_1, we might get a negative value for the idle time during the past sample. The system then thinks its idle time is very big and its busy time is near zero, which causes the governor to always choose the lowest cpufreq state. That is this problem.

There are two solutions:
1. Since the root cause is that we should not rely on the idle ticks in each sample period, we could rely on the busy time directly in each sample instead; this is how the 'top' command implements its feature.
2. Alternatively, we can work around it by making sure the idle time is strictly increasing in each sample. This solution needs minimal modification, and an RFC patch is attached.
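To make the failure mode concrete, here is a tiny standalone demonstration of the unsigned wrap, with hypothetical sample values (not real kernel data):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Tick-derived idle readings: under CONFIG_HZ_PERIODIC nothing
	 * guarantees the second sample is >= the first. */
	uint64_t idle_t1 = 1000;	/* sample at T1 */
	uint64_t idle_t2 = 998;		/* sample at T2 = T1 + 80ms */

	/* The unsigned subtraction wraps to a huge value... */
	unsigned int idle_delta = (unsigned int)(idle_t2 - idle_t1);

	/* ...so the governor sees an enormous idle time, a busy time of
	 * nearly zero, and keeps picking the lowest frequency. */
	printf("idle_delta = %u (0x%x)\n", idle_delta, idle_delta);
	return 0;
}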
Created attachment 197491 [details] fix idle calc in cpufreq governor load
Patch sent: https://patchwork.kernel.org/patch/7859871/
Created attachment 213501 [details] Revert error idle time calculating
Created attachment 213511 [details] Make idle time monotonic
Hi Fikar, do you have a chance to test whether #Comment 8 and #Comment 9 work for you? My previous patch introduced a regression, and here is another solution for it. Thanks.
related link: https://bugzilla.kernel.org/show_bug.cgi?id=115261
Created attachment 213611 [details] cpufreq: governor: fix get_cpu_idle_time() usage in cpufreq_start_governor() The patch from comment #9 can be combined with this one (on top of 4.6-rc4; untested, but builds OK) and then all should actually work.
@Chen Yu: Of course, the cputime_to_usecs() in get_cpu_idle_time_jiffy() would still be problematic as the uptime/idle time will grow over the 32-bit limit eventually, so that needs to be dealt with too.
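For scale, a quick back-of-the-envelope check (plain userspace arithmetic, just to illustrate the concern): a microseconds value truncated to 32 bits wraps after roughly 71.6 minutes of accumulated time.

#include <stdio.h>

int main(void)
{
	/* A 32-bit microsecond counter wraps after 2^32 us. */
	unsigned long long wrap_us = 1ULL << 32;

	printf("wraps after %.0f s (~%.1f min)\n",
	       wrap_us / 1e6, wrap_us / 1e6 / 60);	/* ~4295 s, ~71.6 min */
	return 0;
}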
Created attachment 213621 [details] [update] cpufreq: governor: fix get_cpu_idle_time() usage in cpufreq_start_governor() Updated patch with one mistake fixed.
Created attachment 213631 [details] [update 2x] cpufreq: governor: fix get_cpu_idle_time() usage in cpufreq_start_governor() Fix the mistake for real now.
Created attachment 213641 [details] cpufreq: do not use cputime_to_usecs() in get_cpu_idle_time_jiffy() For completeness, that's what I'd do about the cputime_to_usecs() in get_cpu_idle_time_jiffy(). @Chen Yu: Do you see any problems with this?
Created attachment 213661 [details] dmesg trace

To make sure: there should be 4 patches applied on top of 4.6-rc4:

https://bugzilla.kernel.org/attachment.cgi?id=213501
https://bugzilla.kernel.org/attachment.cgi?id=213511
https://bugzilla.kernel.org/attachment.cgi?id=213631
https://bugzilla.kernel.org/attachment.cgi?id=213641

Anyway, they are all applied, and I get what's shown in the attachment.
(In reply to Rafael J. Wysocki from comment #16)
> Created attachment 213641 [details]
> cpufreq: do not use cputime_to_usecs() in get_cpu_idle_time_jiffy()
>
> For completeness, that's what I'd do about the cputime_to_usecs() in
> get_cpu_idle_time_jiffy().
>
> @Chen Yu: Do you see any problems with this?

When CONFIG_TICK_CPU_ACCOUNTING is not set, the cputime is not nanosecond-based but is more likely a jiffy (tick), because:

#define cputime_to_jiffies(__ct) (__force unsigned long)(__ct)

so dividing it directly by NSEC_PER_USEC might not give the correct result, and we might even get a 'divide by zero' exception as described in #Comment 17:

uptime = cur_uptime - j_cdbs->prev_cpu_wall;

uptime may be zero in that situation. I'll run it on my laptop to confirm.
(In reply to Chen Yu from comment #18)
> When CONFIG_TICK_CPU_ACCOUNTING is not set, the cputime is not
> nanosecond-based but is more likely a jiffy (tick),

Good point.
Created attachment 213781 [details] failed boot

I made some tests, and whenever the patch in comment #15 is applied, my system won't boot properly. Attached is a full dmesg of the failed boot (single-user mode). This one had the patches from comment #15 and comment #16 applied on top of 4.6-rc4.
I made some modifications to the order of initialization of prev_cpu_wall in #Comment 15, since some policies share the same governor data; otherwise the other policies would not cover this code path. Meanwhile, I bypass a zero wall_time delta to avoid the 'divide by zero' issue. For #Comment 16, I copied the code from jiffies_to_usecs().
Created attachment 213801 [details] v2 cpufreq governor fix get_cpu_idle_time usage in cpufreq_start_governor()
Created attachment 213811 [details] v2_cpufreq do not use cputime_to_usecs in get_cpu_idle_time_jiffy
#22 and #23 applied on top of clean 4.6-rc4. Is this how it should be? Good news, system boots up. Also, no wild horses running around, meaning that cpufreq is working fine.
(In reply to Timo Valtoaho from comment #24)
> #22 and #23 applied on top of clean 4.6-rc4. Is this how it should be?
>
> Good news, system boots up. Also, no wild horses running around, meaning
> that cpufreq is working fine.

You should also have
https://bugzilla.kernel.org/attachment.cgi?id=213501
https://bugzilla.kernel.org/attachment.cgi?id=213511
applied.
You mean that the correct way is to patch them all?

Is there an order in which they should be applied, i.e. first 213501, then 213511, then 213801 and finally 213811?
(In reply to Timo Valtoaho from comment #26)
> Is there an order in which they should be applied, i.e. first 213501, then
> 213511, then 213801 and finally 213811?

Yes, you can apply them in this order.
OK, done. Everything is fine, uptime 1:25 and no problems. How about the stable 4.5 series, is a fix coming for it too?
(In reply to Chen Yu from comment #22)
> Created attachment 213801 [details]
> v2 cpufreq governor fix get_cpu_idle_time usage in cpufreq_start_governor()

Actually, I think that j_cdbs->prev_load should always be initialized to 0, because that will cause dbs_update() to always compute the load the first time it is run after the initialization.
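A minimal sketch of that initialization (field and helper names assumed from the discussion above; an illustration, not the posted patch):

/* Sketch only: on governor start, take fresh wall/idle snapshots but
 * leave prev_load at 0, so dbs_update() computes the load from scratch
 * the first time it runs afterwards. */
static void dbs_init_cpu(struct cpu_dbs_info *j_cdbs, unsigned int cpu,
			 int io_busy)
{
	j_cdbs->prev_cpu_idle = get_cpu_idle_time(cpu, &j_cdbs->prev_cpu_wall,
						  io_busy);
	j_cdbs->prev_load = 0;	/* 0 == "no cached load; recompute" */
}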
(In reply to Rafael J. Wysocki from comment #29)
> Actually, I think that j_cdbs->prev_load should always be initialized to 0,
> because that will cause dbs_update() to always compute the load the first
> time it is run after the initialization.

A patch to do that has been posted: https://patchwork.kernel.org/patch/8922431/

It replaces the one from comment #22.
(In reply to Rafael J. Wysocki from comment #30)
> A patch to do that has been posted:
> https://patchwork.kernel.org/patch/8922431/
>
> It replaces the one from comment #22.

Thanks! I've tested it on my laptop together with the other patches; it runs well.
Timo, would you please apply these patches and test whether they work for you: first 213501, then 213511, then https://patchwork.kernel.org/patch/8922431/, and finally 213811?
I did apply them, still on top of 4.6-rc4, and have been running it for a while now. Everything is fine here.
@Chen Yu: Let's do some maths here.

The unpatched get_cpu_idle_time_jiffy() returns:

cur_wall_time = uptime + INITIAL_JIFFIES
idle_time = uptime + INITIAL_JIFFIES - busy_time

In the CONFIG_TICK_CPU_ACCOUNTING case those undergo 64-bit conversion to microseconds, so we get

cur_wall_time_us = uptime_us + INITIAL_JIFFIES_us
idle_time_us = uptime_us + INITIAL_JIFFIES_us - busy_time_us

up to rounding errors.

After the patch from comment #30 the only user of those values is dbs_update() and it computes

wall_time = uptime_us + INITIAL_JIFFIES_us - prev_uptime_us - INITIAL_JIFFIES_us
          = uptime_us - prev_uptime_us
idle_time = uptime_us + INITIAL_JIFFIES_us - busy_time_us - prev_uptime_us - INITIAL_JIFFIES_us + prev_busy_time_us
          = uptime_us - prev_uptime_us - busy_time_us + prev_busy_time_us

Since the INITIAL_JIFFIES_us terms cancel out, do we still need the patch from comment #9?
(In reply to Rafael J. Wysocki from comment #34)
> [...]
> Since the INITIAL_JIFFIES_us terms cancel out, do we still need the patch
> from comment #9?

@Rafael

But we cannot guarantee that the 'total' idle time is always increasing, which might cause a negative idle-time delta, especially when the CPU is very busy (the idle increment is very small):

idle_time = uptime_us - prev_uptime_us - busy_time_us + prev_busy_time_us
          = (uptime_us - busy_time_us) - (prev_uptime_us - prev_busy_time_us)

So although uptime_us is bigger than prev_uptime_us, and busy_time_us is bigger than prev_busy_time_us, the subtraction result might not always be positive. (The jiffies update is done by CPU0, while idle-time statistics sampling may be carried out on non-CPU0 CPUs; since there is no synchronization between CPU0.update_jiffies and CPU1.stat[USERS], they are not always increasing at the same speed.)
(In reply to Chen Yu from comment #35)
> But we cannot guarantee that the 'total' idle time is always increasing,
> which might cause a negative idle-time delta, especially when the CPU is
> very busy (the idle increment is very small):
> [...]
> So although uptime_us is bigger than prev_uptime_us, and busy_time_us is
> bigger than prev_busy_time_us, the subtraction result might not always be
> positive.

The patch from comment #9 won't address that problem anyway, will it?

> (The jiffies update is done by CPU0, while idle-time statistics sampling
> may be carried out on non-CPU0 CPUs; since there is no synchronization
> between CPU0.update_jiffies and CPU1.stat[USERS], they are not always
> increasing at the same speed.)

Right.

In that case we have wall_time < idle_time in dbs_update() and 0 is returned as the load. But that can only happen if (uptime_us - prev_uptime_us) is comparable with (busy_time_us - prev_busy_time_us), so it should be safe to assume that the CPU wasn't really idle at all during that time. So dbs_update() should return 100 in that case.
Created attachment 214601 [details] cpufreq: governor: Fix handling of special cases in dbs_update() So I have this patch that should address the "negative idle time" problem IMO (it is on top of https://patchwork.kernel.org/patch/8964001/). Please tell me what you think.
Created attachment 214611 [details] [v2] cpufreq: governor: Fix handling of special cases in dbs_update() The previous version was incomplete, sorry.
Hi! Sorry for the late response, I just came back home.

(In reply to Rafael J. Wysocki from comment #36)
> The patch from comment #9 won't address that problem anyway, will it?

If I understand correctly, #Comment 9 plus #Comment 23 should avoid the negative idle delta issue, because on any one CPU:
1. cpu_stat[IDLE] should always be increasing.
2. Even if cpu_stat[IDLE] increases faster than jiffies_64, so that wall_time < idle_time and load = 0 is returned, that is the actually expected result (the CPU is very idle).

> In that case we have wall_time < idle_time in dbs_update() and 0 is
> returned as the load. But that can only happen if (uptime_us -
> prev_uptime_us) is comparable with (busy_time_us - prev_busy_time_us), so
> it should be safe to assume that the CPU wasn't really idle at all during
> that time. So dbs_update() should return 100 in that case.

Yes, agreed.
(In reply to Rafael J. Wysocki from comment #37)
> So I have this patch that should address the "negative idle time" problem
> IMO (it is on top of https://patchwork.kernel.org/patch/8964001/).
>
> Please tell me what you think.

For the following line:

load = (int)idle_time < 0 ? 100 : 0;

is idle_time expected to have been truncated to 32 bits by cputime_to_usecs(), or do we need #Comment 23 and a change to (s64)idle_time?
(In reply to Rafael J. Wysocki from comment #38)
> Created attachment 214611 [details]
> [v2] cpufreq: governor: Fix handling of special cases in dbs_update()
>
> The previous version was incomplete, sorry.

I'll test it on my laptop.
(In reply to Chen Yu from comment #40)
> For the following line:
>
> load = (int)idle_time < 0 ? 100 : 0;
>
> is idle_time expected to have been truncated to 32 bits by
> cputime_to_usecs(), or do we need #Comment 23 and a change to
> (s64)idle_time?

idle_time is unsigned int.
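To spell out why the cast in that line works, a standalone illustration with made-up values (two's complement assumed, as in the kernel; this is not the patch itself):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t prev_idle = 1000;
	uint64_t cur_idle = 998;	/* the reading went "backwards" */

	/* The u64 subtraction wraps; truncating to unsigned int leaves
	 * the top bit set for any small negative delta... */
	unsigned int idle_time = (unsigned int)(cur_idle - prev_idle);

	/* ...so casting to int recovers the sign and flags the case: */
	int load = (int)idle_time < 0 ? 100 : 0;

	printf("idle_time = 0x%x, load = %d\n", idle_time, load); /* 100 */
	return 0;
}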
Tested on top of 4.6-rc6 with 4 patches:

https://patchwork.kernel.org/patch/8964001/ cpufreq: governor: Change confusing struct field and variable names
https://patchwork.kernel.org/patch/8922431/ cpufreq: governor: Fix prev_load initialization in cpufreq_governor_start()
#Comment 23 + #Comment 38

For the first few hours cpufreq behaved normally; however, after a few hours it dropped to a low frequency under 100% stress CPU load. I'll confirm whether this is related to TDP. Timo, would you please help double-check? Thanks.
For #Comment 43: the package power limit was reached, so the performance was reduced, meaning this is working as expected. (Bits 10-11 are set to 1 after running at 100% load for some time, which "Indicates package power limit is forcing one or more processors to operate below OS-requested P-state":

# rdmsr 0x19c
88200c02
)

@Rafael, for #Comment 38:
Tested-by: Chen Yu <yu.c.chen@intel.com>
Testing according to comment #43, except that I had to apply 8922431 before 8964001.

Anyway, uptime is now 4 hours 19 minutes and I have the opposite problem to yours. The config still has

CONFIG_HZ_PERIODIC=y
# CONFIG_TICK_CPU_ACCOUNTING is not set

All cores are running at max frequency under load, and for some reason they just got stuck there, so the frequency does not come down when the load is removed.
(In reply to Timo Valtoaho from comment #45)
> Testing according to comment #43, except that I had to apply 8922431 before
> 8964001.

That's correct, 8964001 is on top of 8922431.

Can you please try just the patch from comment #38 on top of these two?
(In reply to Timo Valtoaho from comment #45)
> All cores are running at max frequency under load, and for some reason they
> just got stuck there, so the frequency does not come down when the load is
> removed.

Which tool are you using to generate the CPU workload?
@Rafael, I did -Rp1 for the patch in comment #23 and I'm running the kernel now. Will let you know of any odd behaviour. After my previous post I tried to boot into a stable kernel, but I had to hard reset, since the normal system shutdown/restart procedures hung the system completely.

@Chen Yu, I'm just compiling some source code with various -j values. In that case I was compiling the kernel with "make -j6 bindeb-pkg" to upset my 4-core CPU when it freaked out. So I'm not using any tool; should I? From my point of view, load is load regardless of where it comes from.
Created attachment 215331 [details] [v3] cpufreq: governor: Fix handling of special cases in dbs_update()

@Timo: There is a scenario that may lead to the effect you have observed with the patch from comment #38. Namely, if idle_time turns out to be negative, both load and j_cdbs->prev_load become 100, so if the system goes idle subsequently, dbs_update() will return 100 and reset j_cdbs->prev_load. However, if idle_time turns out to be negative on the next invocation of dbs_update(), the whole exercise will repeat. If that goes in cycles, you get a load of 100 all the time.

This version of the patch forces the load to be computed again on the next invocation of dbs_update() after detecting a negative idle_time, which should prevent the above scenario from happening. Please try this one if possible.
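Roughly, the described behaviour could be sketched as follows (assumed names and structure; not the actual attachment):

/* Sketch of the v3 idea: a wrapped ("negative") idle_time is reported
 * as full load, but the cached prev_load is cleared so the next sample
 * recomputes from scratch instead of replaying 100 forever. */
static unsigned int dbs_compute_load(unsigned int time_elapsed,
				     unsigned int idle_time,
				     unsigned int *prev_load)
{
	unsigned int load;

	if (!time_elapsed)
		return *prev_load;	/* nothing elapsed yet */

	if ((int)idle_time < 0) {
		load = 100;		/* treat as fully busy this once */
		*prev_load = 0;		/* force a fresh computation next time */
	} else if (idle_time >= time_elapsed) {
		load = 0;		/* the CPU was essentially idle */
		*prev_load = 0;
	} else {
		load = 100 * (time_elapsed - idle_time) / time_elapsed;
		*prev_load = load;
	}
	return load;
}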
OK, I have 4.6-rc6 topped with 8922431, 8964001 and 215331. Uptime is 1:25 and no issues. Will comment if something new appears.

Also, the last version, as per comment #46, was working OK for me; it ran for 12 hours without problems.
OK, thanks! So they both work. I'm sort of divided now. :-) I slightly prefer the one from comment #38 as it is a bit less kludgy, so I'm going to post something equivalent to it.
Created attachment 215351 [details] [v4] cpufreq: governor: Fix handling of special cases in dbs_update() That's what I'm going to post. Technically, it is equivalent to the patch from comment #38 (at least if I didn't mess it up) and hence the Tested-by tags, but please test it too if you can, just to be sure.
I'll try the patches as well; I just have a slight problem with vanilla 4.6-rcX on the original computer: it hits a kernel panic shortly after boot, probably something related to EDAC...
Uptime 2:40 hours, no problems found. Patchwork 8922431, 8964001 and 9028101.

Meanwhile, I'm still a little curious why #Comment 23 made Timo's system stick at the highest frequency after the load was removed.
I'm able to run 4.6-rc6 now and things are getting interesting. On vanilla 4.6-rc6 with CONFIG_HZ_PERIODIC the CPU now scales up correctly, but it seems to stay at high frequencies after I stop the load. It looks like both CPUs think they are still busy; the usage is not stable and oscillates around 70%, as can be seen in e.g. htop, but there is of course no process causing the usage. It looks like the symptoms have changed since my first report of the bug; nevertheless, you found the correct cause.

With the proposed patches 8922431, 8964001 and 9028101 everything seems to be fine. The CPUs scale up and down and they show 0% usage when idle, as expected. I'll observe it a bit longer though...
Are 215351 here and patchwork 9028101 the same?

I replaced 215331 with 215351. All patches applied now are 8922431, 8964001 and 215351. Uptime 10:15 and all is OK. Scaling like it should, and no sticky frequencies.
Sorry, it seems that the unexplained ~70% load in the vanilla case was caused by a process (sslh); after killing it, even the kernel without the patches behaves normally.

Not sure how to reproduce the bug on a recent kernel.
(In reply to j.fikar from comment #57)
> Sorry, it seems that the unexplained ~70% load in the vanilla case was
> caused by a process (sslh); after killing it, even the kernel without the
> patches behaves normally.
>
> Not sure how to reproduce the bug on a recent kernel.

No worries. I'm glad that it works. :-)
The patch from comment #52 has been posted as: https://patchwork.kernel.org/patch/9028101/
(In reply to Chen Yu from comment #54)
> Meanwhile, I'm still a little curious why #Comment 23 made Timo's system
> stick at the highest frequency after the load was removed.

#Comment 23 is not needed IMO, since the current code only cares about the deltas of the idle time and the update time, and each delta is of type unsigned int (which is 32-bit). So even if cputime_to_usecs() truncates the result to 32 bits, the computation still works. For example:

u64 prev_idle = 0xfffffffe;
u64 cur_idle = 0x2;
u32 idle_time = cur_idle - prev_idle = 0x4
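A standalone demonstration of that modular-arithmetic point, mirroring the values above:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Even when the absolute counters wrap at 32 bits, the truncated
	 * delta is still exact as long as the true delta fits in 32 bits. */
	uint64_t prev_idle = 0xfffffffe;
	uint64_t cur_idle = 0x2;	/* wrapped past the 32-bit boundary */

	uint32_t idle_time = (uint32_t)(cur_idle - prev_idle);

	printf("idle_time = 0x%x\n", idle_time);	/* prints 0x4 */
	return 0;
}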
Merged as commit 9485e4ca0b48 "cpufreq: governor: Fix handling of special cases in dbs_update()". Closing.