Bug 115261

Summary: BISECTED: stuck at max freq after 5 minutes if CONFIG_TICK_CPU_ACCOUNTING=n due to jiffies overflow
Product: Power Management Reporter: Timo Valtoaho (timo.valtoaho)
Component: cpufreqAssignee: Rafael J. Wysocki (rjw)
Status: CLOSED CODE_FIX    
Severity: normal CC: lenb, rjw, rui.zhang, yu.c.chen
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.5 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: config of bisected kernel
debug_governor.diff
get_idle_fix.diff
debug_governor.diff
Logfile of degug_governor.diff
grep-file
debug_cpufreq_after_fix.log
v2_fix_get_cpu_idle.diff
v2_debug_governor.diff
calculate idle time directly rather than calculating busy time first
dmesg2.txt
v4_fix_cpu_idle.patch
v5_fix_cpu_idle
v6_fix_cpu_idle
v6_fix_cpu_idle
Revert "cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC"
cpufreq: Make idle_time monotonic increasing to avoid incorrect idle_time
config file of kernel

Description Timo Valtoaho 2016-03-26 09:05:55 UTC
Created attachment 210711 [details]
config of bisected kernel

Switching to 4.5 kernel, cpufreq ondemand governor started to act oddly. Without any load, with freshly booted system, it pumped cpu frequency up to maximum at some point of time and stayed there. uptime can be anything, but usually less than 15 minutes when this happens. Driver is acpi-cpufreq.

I have git-bisected this to 0df35026c6a527e65b53bc895ad672d648a248f3. There is a mention about certain kernel config values that are ok. My values differ and I get this kind of behaviour.

CPU is AMD Phenom(tm) II X4 905e Processor.
Comment 1 Chen Yu 2016-03-29 08:21:40 UTC
Timo, could you please help test with following steps:

1. patch debug_governor.diff
2. after boot up, try to reproduce the problem
3. when this problem is reproduced, please:
   3.1 echo 1 > /proc/sys/kernel/debug_cpufreq_idle
   3.2 wait for a few seconds, dmesg -c > debug_cpufreq_before_fix.log
4. patch get_idle_fix.diff on top of debug_governor.diff
5. repeat step 2 ~ 3
6. provide debug_cpufreq_before_fix.log and debug_cpufreq_after_fix.log

thanks
Comment 2 Chen Yu 2016-03-29 08:23:12 UTC
Created attachment 210951 [details]
debug_governor.diff
Comment 3 Chen Yu 2016-03-29 08:31:00 UTC
Created attachment 210961 [details]
get_idle_fix.diff

This patch is a fix for get_cpu_idle_time_jiffy,
actually the calculation of wall time in this function might be
not so accurate, it uses get_jiffies_64() directly, however the initial
value for jiffies is INITIAL_JIFFIES, which is -300HZ, thus 5 minutes.
Maybe my previous patch has exposed this issue more frequently.
Comment 4 Chen Yu 2016-03-29 11:06:55 UTC
please also provide grep . /sys/devices/system/cpu/cpu*/cpufreq/* when reproduced.
Comment 5 Chen Yu 2016-03-29 11:58:43 UTC
Created attachment 210991 [details]
debug_governor.diff
Comment 6 Timo Valtoaho 2016-03-29 20:51:56 UTC
Created attachment 211031 [details]
Logfile of degug_governor.diff

Hi

1. Patched 4.5.0 kernel source with debug_governor.diff

2. Could reproduce the problem: please find attachment debug_cpufreq_before_fix.log
Comment 7 Timo Valtoaho 2016-03-29 20:54:05 UTC
Created attachment 211041 [details]
grep-file

3. Applied patch get_idle_fix.diff. So far, could not reproduce the problem.
Comment 8 Timo Valtoaho 2016-03-29 21:03:06 UTC
It seems that this has a huge impact on my systems stability: every time the cpufreq gets upset, I can't for example compile a kernel without fatal errors. Have to reboot to a stable kernel first to do that.
Comment 9 Chen Yu 2016-03-30 05:38:52 UTC
OK, I know the problem.
Here are two issues:
1. linux should not use jiffies directly to get the wall time, since jiffies is not counted from zero but from -300HZ.
2. the CONFIG_TICK_CPU_ACCOUNTING is not set in your config, so the kernel use an inaccurate jiffies_to_usecs to calculate the actual idle time, however, jiffies_to_usecs is constrained to unsigned int, which may overflow when the wall time exceed 32bit.
3. my previous patch exposed this issue more frequently, so..

I'll try to provide a fix for this. thanks.
Comment 10 Timo Valtoaho 2016-03-30 13:17:00 UTC
Created attachment 211081 [details]
debug_cpufreq_after_fix.log

Finally it jumped up to max again. Hope this helps. Just let me know if you need any more info or testing.
Comment 11 Chen Yu 2016-03-31 14:39:32 UTC
Created attachment 211251 [details]
v2_fix_get_cpu_idle.diff

Please help check if this works for you.
(you may need to wait for a while to confirm, at least:
wait time = (2016-03-30 13:17:00  -  2016-03-29 20:54:05)
)
Comment 12 Chen Yu 2016-03-31 14:57:45 UTC
Created attachment 211261 [details]
v2_debug_governor.diff

If previous patch v2_fix_get_cpu_idle.diff does not work, please apply both v2_fix_get_cpu_idle.diff and v2_debug_governor.diff, and when reproduced, type:
echo 1 > /proc/sys/kernel/governor_debug_oneshot
for MANY times, and then show dmesg.
Comment 13 Chen Yu 2016-04-03 00:48:20 UTC
ping Timo
Comment 14 Timo Valtoaho 2016-04-03 07:04:59 UTC
Hi Yu

Sorry for such a late answer.

I've been running 4.5.0 patched with your v2_fix_get_cpu_idle.diff, and everything seems to be in order. System uptime is  "2 days, 12:30", and there has been no problems with cpufreq at all. Although mentioned uptime includes some suspended (systemctl suspend) periods, I put my computer to sleep when I'm not using it. Does it affect to testing?

Anyway, from my point of view, I haven't had any problems after applying the mentioned patch.
Comment 15 Chen Yu 2016-04-04 16:39:46 UTC
thanks Timo, the test looks ok to me,
I've sent this patch to 
https://patchwork.kernel.org/patch/8742721/
Comment 16 Timo Valtoaho 2016-04-09 13:30:13 UTC
Great, thank you for fixing this!

I'm closing this as resolved.
Comment 17 Len Brown 2016-04-11 23:53:14 UTC
patch feedback:

INITIAL_JIFFIES should not be used
init using current value of jiffies

long term:

tick-sched.c should export function giving both cpufreq and cpuidle the idle duration.
Comment 18 Timo Valtoaho 2016-04-13 16:36:10 UTC
So, there is going to be another solution for this?
Comment 19 Chen Yu 2016-04-15 10:36:31 UTC
Created attachment 212841 [details]
calculate idle time directly rather than calculating busy time first

Yes, Timo, here's 3nd version with minimal changes to cpufreq, and without touching the INITIAL_JIFFIES, please help test if this patch work for you.
Also please apply v2_debug_governor.diff and provide info as well. thanks.
Comment 20 Timo Valtoaho 2016-04-17 10:12:21 UTC
Hi

I'm testing with 4.5.1 but I guess it doesn't make any difference, or does it?

Anyway , fix fails leading to all 4 cores to run at max freq. Please find attached dmesg2.txt with debug info.
Comment 21 Timo Valtoaho 2016-04-17 10:13:07 UTC
Created attachment 213091 [details]
dmesg2.txt
Comment 22 Chen Yu 2016-04-17 16:50:28 UTC
Created attachment 213191 [details]
v4_fix_cpu_idle.patch

please check this one. Hope time system maintainer would like this version :0
Comment 23 Chen Yu 2016-04-18 04:00:51 UTC
Created attachment 213221 [details]
v5_fix_cpu_idle

please check v5 instead.
Comment 24 Timo Valtoaho 2016-04-18 18:01:54 UTC
Hi

I was running v4_fix for some time, like 20 hours without any problems if that info is useful. Now running 4.5.1 with your latest v5_fix_cpu_idle.diff without v2_debug_governor.diff applied.

So far so good, no problems, uptime 1 hour 17 minutes now. That is a short time, but it seems to be the case that the frequency jumps up quite fast after boot if the problem is present.
Comment 25 Chen Yu 2016-04-19 03:54:08 UTC
Version 5 has been sent to mail list at:
https://patchwork.kernel.org/patch/8876381/
Comment 26 Timo Valtoaho 2016-04-19 07:40:10 UTC
What I understand about code is zero, but v5_fix_cpu_idle seems to me different than the patch shown in here: 

https://patchwork.kernel.org/patch/8876381/
Comment 27 Chen Yu 2016-04-19 08:12:03 UTC
Yes, above patch is based on latest 4.6-rc3, and v5 is based on 4.5-rc1, there is some modifications during this period, but the logic is the same.
Comment 28 Timo Valtoaho 2016-04-19 13:33:37 UTC
Ok, thanks for clearing this to me :).
Comment 29 Chen Yu 2016-04-20 03:50:30 UTC
Created attachment 213361 [details]
v6_fix_cpu_idle

Feedback from community suggest a version 6, could you help try this one
Comment 30 Chen Yu 2016-04-20 04:20:33 UTC
Created attachment 213371 [details]
v6_fix_cpu_idle

Feedback from community suggest a version 6, could you help try this one
Comment 31 Rafael J. Wysocki 2016-04-20 10:39:28 UTC
@Timo: Can you please check if the patch from comment #30 helps?
Comment 32 Chen Yu 2016-04-20 10:59:57 UTC
Hi @Rafael,

Comment 30 might not work, looks like it should be changed to 's32' rather than 's64', because we are facing the 32 bit overflow problem brought by idle time calculation.

But it brought another issue after we changed it to 's32',
for example,   if cur_idle = 0xf000 0001, and prev_idle = 0xf000 0000,
since (s32) cur_idle  < (s32) prev_idle, we will still always get the idle 0% util next overflow, actually the system might be very idle.

Lv said it might be looked like:

if (cur_idle  - prev_cpu > (INT_32_MAX/2 + 1))
	cur_idle = 0;

I'm trying to do some modifications.
Comment 33 Rafael J. Wysocki 2016-04-20 12:50:40 UTC
(In reply to Chen Yu from comment #32)
> Hi @Rafael,
> 
> Comment 30 might not work, looks like it should be changed to 's32' rather
> than 's64', because we are facing the 32 bit overflow problem brought by
> idle time calculation.

OK, scratch that, then.

We need to be somewhat smarter here, because we're comparing wall_time to idle_time in the same function going forward, so they both need to be treated similarly.
Comment 34 Rafael J. Wysocki 2016-04-20 13:25:51 UTC
@Chen Yu: cputime_to_usecs() called by get_cpu_idle_time_jiffy() should resolve to

cputime_div(__ct, NSEC_PER_USEC)

and that is defined as

div_u64((__force u64)__ct, divisor)

so the result should be a proper 64-bit number.

Where does the 32-bit zapping happen then?
Comment 35 Chen Yu 2016-04-20 13:39:05 UTC
(In reply to Rafael J. Wysocki from comment #34)
> @Chen Yu: cputime_to_usecs() called by get_cpu_idle_time_jiffy() should
> resolve to
> 
> cputime_div(__ct, NSEC_PER_USEC)
> 
> and that is defined as
> 
> div_u64((__force u64)__ct, divisor)
> 
> so the result should be a proper 64-bit number.
> 
> Where does the 32-bit zapping happen then?

When CONFIG_TICK_CPU_ACCOUNTING is not set:

include/asm-generic/cputime_jiffies.h

#define cputime_to_usecs(__ct)          \
        jiffies_to_usecs(cputime_to_jiffies(__ct))



unsigned int jiffies_to_usecs(const unsigned long j)
{
        /*
         * Hz usually doesn't go much further MSEC_PER_SEC.
         * jiffies_to_usecs() and usecs_to_jiffies() depend on that.
         */
        BUILD_BUG_ON(HZ > USEC_PER_SEC);

#if !(USEC_PER_SEC % HZ)
        return (USEC_PER_SEC / HZ) * j;
#else
# if BITS_PER_LONG == 32
        return (HZ_TO_USEC_MUL32 * j) >> HZ_TO_USEC_SHR32;
# else
        return (j * HZ_TO_USEC_NUM) / HZ_TO_USEC_DEN;
# endif
#endif
}

Since jiffies_to_usecs returns a unsigned int, which is 32bit, we cut the result from 64bit to 32bit.
Comment 36 Chen Yu 2016-04-20 13:44:17 UTC
(In reply to Rafael J. Wysocki from comment #33)
> (In reply to Chen Yu from comment #32)
> > Hi @Rafael,
> > 
> > Comment 30 might not work, looks like it should be changed to 's32' rather
> > than 's64', because we are facing the 32 bit overflow problem brought by
> > idle time calculation.
> 
> OK, scratch that, then.
> 
> We need to be somewhat smarter here, because we're comparing wall_time to
> idle_time in the same function going forward, so they both need to be
> treated similarly.

wall_time does not have 32-bit zapping issue, so the delta of wall_time should be ok I think?(we compare delta of wall_time of idle_time)
Comment 37 Rafael J. Wysocki 2016-04-20 20:31:27 UTC
(In reply to Chen Yu from comment #36)
> (In reply to Rafael J. Wysocki from comment #33)
> > (In reply to Chen Yu from comment #32)
> > > Hi @Rafael,
> > > 
> > > Comment 30 might not work, looks like it should be changed to 's32'
> rather
> > > than 's64', because we are facing the 32 bit overflow problem brought by
> > > idle time calculation.
> > 
> > OK, scratch that, then.
> > 
> > We need to be somewhat smarter here, because we're comparing wall_time to
> > idle_time in the same function going forward, so they both need to be
> > treated similarly.
> 
> wall_time does not have 32-bit zapping issue, so the delta of wall_time
> should be ok I think?(we compare delta of wall_time of idle_time)

Are you sure about that?  It undergoes the same manipulations in get_cpu_idle_time_jiffy().
Comment 38 Rafael J. Wysocki 2016-04-20 20:41:21 UTC
Created attachment 213461 [details]
Revert "cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC"

In any case, I think that the problematic commit should just be reverted at this point and the original problem it attempted to fix (incorrectly) should be addressed in a different way.

Attached is a revert patch (on top of subsequent code changes).  Please double check in case I made some silly mistake (again).
Comment 39 Rafael J. Wysocki 2016-04-20 20:42:34 UTC
The patch in commit #38 is on top of 4.6-rc4.

@Timo: Please test.
Comment 40 Rafael J. Wysocki 2016-04-20 20:52:28 UTC
(In reply to Rafael J. Wysocki from comment #39)
> The patch in commit #38 is on top of 4.6-rc4.

I mean comment #38, of course.
Comment 41 Timo Valtoaho 2016-04-20 21:31:49 UTC
Ok, I have it running now, 4.6-rc4 patched with attachment in comment #38. Uptime only 22 mins now, no problems so far. Like I've mentioned earlier, frequency pops up very fast if it is present. So this one seems to be good.
Comment 42 Rafael J. Wysocki 2016-04-20 23:35:06 UTC
OK, thanks!

Let's give it time till tomorrow and then I'll send it to the mailing lists.
Comment 43 Chen Yu 2016-04-21 01:22:13 UTC
(In reply to Rafael J. Wysocki from comment #37)
> (In reply to Chen Yu from comment #36)
> > (In reply to Rafael J. Wysocki from comment #33)
> > > (In reply to Chen Yu from comment #32)
> > > > Hi @Rafael,
> > > > 
> > > > Comment 30 might not work, looks like it should be changed to 's32'
> rather
> > > > than 's64', because we are facing the 32 bit overflow problem brought
> by
> > > > idle time calculation.
> > > 
> > > OK, scratch that, then.
> > > 
> > > We need to be somewhat smarter here, because we're comparing wall_time to
> > > idle_time in the same function going forward, so they both need to be
> > > treated similarly.
> > 
> > wall_time does not have 32-bit zapping issue, so the delta of wall_time
> > should be ok I think?(we compare delta of wall_time of idle_time)
> 
> Are you sure about that?  It undergoes the same manipulations in
> get_cpu_idle_time_jiffy().

Ah yes, it is the same situation for wall_time.
Comment 44 Chen Yu 2016-04-21 01:38:39 UTC
(In reply to Rafael J. Wysocki from comment #38)
> Created attachment 213461 [details]
> Revert "cpufreq: governor: Fix negative idle_time when configured with
> CONFIG_HZ_PERIODIC"
> 
> In any case, I think that the problematic commit should just be reverted at
> this point and the original problem it attempted to fix (incorrectly) should
> be addressed in a different way.
> 
> Attached is a revert patch (on top of subsequent code changes).  Please
> double check in case I made some silly mistake (again).

Yes,  sorry for my regression.
I've checked the code and it is OK.


BTW, I'm thinking of this method, after reverting my previous patch,
we can change the idle time calculation to the complement set for 
current 'busy_time', thus we don't have to substract busy_time from wall _time,
and the 'delta' of idle_time in governor's dbs_update should always make sense. Something like this:

@@ -140,14 +140,10 @@ static inline u64 get_cpu_idle_time_jiff
 
 	cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
 
-	busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
-	busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
-
-	idle_time = cur_wall_time - busy_time;
+	idle_time = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
+	idle_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
+	idle_time += kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST];
+	idle_time += kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST_NICE];
 	if (wall)
 		*wall = cputime_to_usecs(cur_wall_time);


I'll check some data.
Comment 45 Chen Yu 2016-04-21 07:03:48 UTC
Created attachment 213491 [details]
cpufreq: Make idle_time monotonic increasing to avoid  incorrect idle_time

Timo, could you please help check if this patch still work for you(also with
patch from #Comment 38 applied: Revert "cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC" )
Comment 46 Chen Yu 2016-04-21 07:12:37 UTC
related link for previous negative idle issue:
https://bugzilla.kernel.org/show_bug.cgi?id=69821
Comment 47 Rafael J. Wysocki 2016-04-21 16:50:12 UTC
(In reply to Chen Yu from comment #45)
> Created attachment 213491 [details]
> cpufreq: Make idle_time monotonic increasing to avoid  incorrect idle_time

The reason why this is not a good approach is that cpufreq_governor_start() (as of 4.6-rc4) does this:

 j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_cpu_wall, io_busy);
 prev_load = j_cdbs->prev_cpu_wall - j_cdbs->prev_cpu_idle;

so if j_cdbs->prev_cpu_idle and j_cdbs->prev_cpu_wall are shifted with respect to each other (and they will be after your change), the value of prev_load will not be useful.
Comment 48 Timo Valtoaho 2016-04-21 16:52:35 UTC
Created attachment 213581 [details]
config file of kernel

Got a compile warning, not too serious I guess:

drivers/cpufreq/cpufreq.c: In function ‘get_cpu_idle_time_jiffy’:
drivers/cpufreq/cpufreq.c:139:6: warning: unused variable ‘busy_time’ [-Wunused-variable]
  u64 busy_time;

Anyway, attachment from comment #45 has been applied on top of #38 and everything seems to be in order.

Please check my attached config that it is still good for this testing.
Comment 49 Rafael J. Wysocki 2016-04-21 16:56:02 UTC
(In reply to Timo Valtoaho from comment #48)
> Created attachment 213581 [details]
> config file of kernel
> 
> Got a compile warning, not too serious I guess:
> 
> drivers/cpufreq/cpufreq.c: In function ‘get_cpu_idle_time_jiffy’:
> drivers/cpufreq/cpufreq.c:139:6: warning: unused variable ‘busy_time’
> [-Wunused-variable]
>   u64 busy_time;

That's OK.  The variable is declared but not used any more.

> Anyway, attachment from comment #45 has been applied on top of #38 and
> everything seems to be in order.

OK, thanks!

@Chen Yu: We'll need to reopen bug #69821 after applying the patch from comment #38, so let's move the discussion there.