Bug 217597

Summary: lscpu reporting incorrect CPU current frequencies
Product: Power Management
Reporter: Keyon (yang.jie)
Component: cpufreq
Assignee: Keyon (yang.jie)
Status: CLOSED CODE_FIX
Severity: high
CC: dsmythies, yang.jie
Priority: P2
Hardware: Intel
OS: Linux
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id: f3eca381bd49
Attachments: A possible patch for this issue
Stale frequency results for various drivers and governors
Stale frequency results for various drivers and governors
intel_cpufreq driver, hwp disabled, schedutil governor, odd pstate requests
Stale frequency results for various drivers and governors - policy cur method
Yet another patch to fix this.

Description Keyon 2023-06-26 17:44:45 UTC
From Linux kernel v5.18 on, we notice that the cur_freq of CPU cores is incorrect from time to time; the cur_freq can even be greater than the maximum frequency, as below:


# lscpu -e
CPU SOCKET CORE ONLINE    MAXMHZ   MINMHZ       MHZ
  0      0    0    yes 4200.0000 800.0000 4200.0000
  1      0    0    yes 4200.0000 800.0000 4200.0000
  2      0    1    yes 4200.0000 800.0000 4200.0000
  3      0    1    yes 4200.0000 800.0000 3600.0000
  4      0    2    yes 3000.0000 800.0000 1732.7410
  5      0    3    yes 3000.0000 800.0000 1685.1310
  6      0    4    yes 3000.0000 800.0000 1715.8409
  7      0    5    yes 3000.0000 800.0000 3600.0000
  8      0    6    yes 3000.0000 800.0000 1716.0790
  9      0    7    yes 3000.0000 800.0000 3600.0000
 10      0    8    yes 3000.0000 800.0000 1603.4210
 11      0    9    yes 3000.0000 800.0000 3600.0000

Here, the cur_freq of cores 7, 9, and 11 looks incorrect.
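
The MHZ column can be cross-checked against the cpufreq sysfs files it is derived from:

# grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq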


With v5.17, we get the correct values, as below:

# lscpu -e
CPU SOCKET CORE ONLINE    MAXMHZ   MINMHZ       MHZ
  0      0    0    yes 4200.0000 800.0000 4199.8760
  1      0    0    yes 4200.0000 800.0000 1444.7260
  2      0    1    yes 4200.0000 800.0000 1554.8870
  3      0    1    yes 4200.0000 800.0000 1577.0760
  4      0    2    yes 3000.0000 800.0000 1800.5040
  5      0    3    yes 3000.0000 800.0000 1767.4659
  6      0    4    yes 3000.0000 800.0000 1800.3500
  7      0    5    yes 3000.0000 800.0000 1600.4460
  8      0    6    yes 3000.0000 800.0000 1597.0200
  9      0    7    yes 3000.0000 800.0000 1681.0930
 10      0    8    yes 3000.0000 800.0000 1700.8730
 11      0    9    yes 3000.0000 800.0000 1785.5389
 12      0   10    yes 2100.0000 800.0000 1202.8650
 13      0   11    yes 2100.0000 800.0000 1597.0270
Comment 1 Keyon 2023-06-26 17:58:46 UTC
I have been debugging this; assigning it to myself.
Comment 2 Doug Smythies 2023-06-26 18:37:44 UTC
As of kernel 5.18, if a CPU has been idle and is still idle, it is not woken just to read its frequency, nor is the stale CPU frequency listed. The nominal non-turbo max frequency is listed instead.

The related patch set email thread:

https://lore.kernel.org/linux-pm/?q=x86%2Fcpu%3A+Consolidate+APERF%2FMPERF+code

I am not certain this is your issue, but it sounds like it.
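
For background, the value reported here comes from arch_freq_get_on_cpu(), which computes the frequency from APERF/MPERF counter deltas accumulated at scheduler tick time. A sketch of the calculation (the same div64_u64() expression appears in the diffs later in this thread):

/* acnt = delta APERF, mcnt = delta MPERF, accumulated at tick time.
 * The reported frequency scales the calibrated cpu_khz by the ratio
 * of actual to reference cycles:
 *
 *         freq_khz = cpu_khz * acnt / mcnt
 *
 * If no tick has refreshed the sample within MAX_SAMPLE_AGE, the
 * sample is considered stale and the fallback path is taken instead. */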
Comment 3 Keyon 2023-06-26 22:29:58 UTC
(In reply to Doug Smythies from comment #2)
> As of kernel 5.18, if a CPU has been idle and is still idle, it is not
> woken just to read its frequency, nor is the stale CPU frequency listed.
> The nominal non-turbo max frequency is listed instead.
> 
> The related patch set email thread:
> 
> https://lore.kernel.org/linux-pm/
> ?q=x86%2Fcpu%3A+Consolidate+APERF%2FMPERF+code
> 
> I am not certain this is your issue, but it sounds like it.

Hi Doug, exactly: that series introduced this issue. I have submitted a patch to address it here:
https://lore.kernel.org/lkml/20230626193601.9169-1-yang.jie@linux.intel.com/
Comment 4 Doug Smythies 2023-06-27 00:01:12 UTC
From the patch:

> have been comparing ticks and milliseconds by mistake

I do not think so.

I think this line is correct:
>- if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)

I think this line is not correct:
> + if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE * cpu_khz)

The definition of MAX_SAMPLE_AGE is already 20 milliseconds converted to jiffies.

The code is doing what was desired (I think), and it was expected that the CPU frequency listings would change. Myself, I think it would be more informative to list a frequency of 0 for stale info instead of cpu_khz, but that wasn't accepted when I proposed it (I think; I'd have to review more to be sure).

Note: I am not on the x86@kernel.org email list, only the linux-pm list. The patch should have been copied to linux-pm, as it was originally getting the bugzilla emails and was copied on the original patch series.
Comment 5 Keyon 2023-06-27 00:12:43 UTC
(In reply to Doug Smythies from comment #4)
> From the patch:
> 
> > have been comparing ticks and milliseconds by mistake
> 
> I do not think so.
> 
> I think this line is correct:
> >- if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
> 
> I think this line is not correct:
> > + if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE * cpu_khz)
> 
> The definition of MAX_SAMPLE_AGE is already 20 milliseconds converted to
> jiffies.

I don't think so, see here:

#define MAX_SAMPLE_AGE	((unsigned long)HZ / 50)

CONFIG_HZ is usually configured to be 1000, so MAX_SAMPLE_AGE is 20; I have printed it out and confirmed that.

> 
> The code is doing what was desired (I think) and it was expected that CPU
> frequency listings would change. Myself, I think it would be more
> informative to list a frequency of 0 for stale info instead of cpu_khz, but
> that wasn't accepted when I proposed it (I think, I'd have to review more to
> be sure).

Ah, I can imagine that; people are complaining about the cur_freq > min_freq here as well.

> 
> Note: I am not on the x86@kernel.org email list, only the linux-pm list. The
> patch should have been copied to linux-pm, as it was originally getting the
> bugzilla emails and was copied on the original patch series.

oops, maybe I should send it to linux-pm first?
Comment 6 Keyon 2023-06-27 00:15:39 UTC
And when I printed more logs, you can see that delta > 20 happens quite frequently, which leads to the default cpu_khz being used as the cur_freq from time to time:

[   57.904839] Keyon: jiffies - last: 50
[   57.909061] Keyon: jiffies - last: 0
[   57.913161] Keyon: jiffies - last: 60
[   57.917170] Keyon: jiffies - last: 62
[   57.921277] Keyon: jiffies - last: 174
[   57.925393] Keyon: jiffies - last: 4
[   57.929586] Keyon: jiffies - last: 993
[   57.933586] Keyon: jiffies - last: 165
Comment 7 Doug Smythies 2023-06-27 14:34:44 UTC
(In reply to Keyon from comment #5)
> (In reply to Doug Smythies from comment #4)
> I don't think so, see here:
> 
> #define MAX_SAMPLE_AGE        ((unsigned long)HZ / 50)
> 
> CONFIG_HZ is usually configured to be 1000 and MAX_SAMPLE_AGE is 20, I have
> printed that out and confirmed it.

Yes, exactly. 20 jiffies at a 1000 Hz tick rate is the desired 20 milliseconds.
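
Spelled out, with CONFIG_HZ=1000:

#define HZ 1000                                  /* CONFIG_HZ=1000 */
#define MAX_SAMPLE_AGE ((unsigned long)HZ / 50)  /* 1000 / 50 = 20 jiffies */
/* jiffies advances once per tick, i.e. every 1/HZ = 1 ms here, so the
 * check (jiffies - last) > MAX_SAMPLE_AGE compares jiffies to jiffies:
 * 20 jiffies at HZ=1000 is exactly the intended 20 ms. */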

> the delta > 20 happens quite frequently

Yes. If the system is somewhat idle, it'll happen very often. Example from my idle system (4.1 GHz means stale):

$ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:4100000
/sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:4100000
/sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:4100000
/sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:4100000
/sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:4100000
/sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:2200179
/sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:2200009
/sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:4100000
/sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:4100000
/sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:4100000
/sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:2139420
/sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:2155407

But if I put a light 100 hertz load on every CPU, then the stale trigger conditions will never be met:

$ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:900001
/sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:806972
/sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:898252
/sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:898536
/sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:900024
/sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:845586
/sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:900013
/sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:847413
/sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:898391
/sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:900014
/sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:888358
/sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:900020
Comment 8 Keyon 2023-06-27 17:39:55 UTC
(In reply to Doug Smythies from comment #7)
> 
> Yes. If the system is somewhat idle, it'll happen very often. Example from
> my idle system (4.1 GHz means stale):
> ...
> But if I put a light 100 hertz load on every CPU, then the stale trigger
> conditions will never be met:
> ...

Thank you for cross-confirming, Doug. Setting aside whether showing 4.1 GHz at that load level makes sense or confuses people, the current code (comparing ticks with milliseconds) is not aligned with the design, going by the 20 ms comment.
Doug, if you agree with me about that, do you mind adding your Acked-by tag?
Comment 9 Doug Smythies 2023-06-27 18:44:21 UTC
I do not agree, and I have been saying all along that the current code (before your patch) is working as expected.

However, I do agree that it can be confusing. My input as to a possible remedy patch would be:

$ git diff
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index fdbb5f07448f..53c6a508a366 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -439,7 +439,7 @@ unsigned int arch_freq_get_on_cpu(int cpu)

 fallback:
        freq = cpufreq_quick_get(cpu);
-       return freq ? freq : cpu_khz;
+       return freq ? freq : 0;
 }

 static int __init bp_init_aperfmperf(void)

But that also has inconsistencies in what is reported, depending on which CPU frequency scaling driver and governor is being used (some of those inconsistencies also exist now; I seem to have missed them, or made some mistake, back on the original thread referenced in comment 2 above, where I was soliciting consistent reporting across all CPU scaling drivers and governors).
Comment 10 Keyon 2023-06-27 21:57:14 UTC
Hi Doug, IMHO the scaling_cur_freq should show the real current running/electrical frequency set for the corresponding CPU core. The core can run at different frequencies, ranging from min to max, and should not report 0 except when it is put offline. Usually, when idle, the core should still run at some frequency, e.g. the default, base, or min frequency.

All right, there are 2 issues here per my understanding:
1. Is cpu_khz shared across all CPU cores? On Intel's recent hybrid CPUs, what does this cpu_khz read from CPUID really mean? I am seeing cpu_khz = 3.6 GHz for E-cores with a max frequency of 3 GHz. We should fix that, no?
2. We don't want to wake up cores just because of sysfs queries, so we introduced a fallback mechanism here; what is our clear design for it?

So, before discussing those issues, we should get alignment on these first:
1. What is the fallback, and when should we fall back? From the comment, it looks like we wanted to use cpu_khz for cores that haven't executed any task during the last 20 ms; this sounds reasonable, no?
2. What frequency should we show in the fallback case? This could be controversial: 0? min_freq? base_freq? Or the last calculated one?

So, I think what you proposed above is pushing alignment item #2 toward using '0', while it has no conflict with the patch I sent to fix what I believe is an existing typo (Thomas Gleixner might need to chime in) in alignment item #1.

I get your point that it looks consistent with the design that 'when a light 100 hertz load is put on every CPU, the stale trigger conditions will never be met'. Can you share how you did that? Let me try it at my end also.
Comment 11 Doug Smythies 2023-06-28 00:40:12 UTC
For this:

> the scaling_cur_freq should show the real 
> current running/electrical frequency
> set to the corresponding CPU core,
> the core itself could run in different
> frequencies, range from min to max,
> should not be 0 except that when it is
> put offline. Usually when Idle, the 
> core should still run on a frequency,

That is not true. In deep idle states the CPU clock stops.
And that is the point: there is no "current" CPU frequency, it is stale.
The discussion at the time (see the thread referenced in comment 2 above) seemed to be against using the last known value. The issue I have with using cpu_khz is that there is no easy way to distinguish between a CPU really running at cpu_khz and a CPU that was idle for that sample. Some driver and governor combinations seem to report min_freq.

I think this conversation needs to include the participants from the original thread.

Note: I forgot to simplify above and am now using this:

$ git diff
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index fdbb5f07448f..486207336d8b 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -414,7 +414,7 @@ void arch_scale_freq_tick(void)
 unsigned int arch_freq_get_on_cpu(int cpu)
 {
        struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
-       unsigned int seq, freq;
+       unsigned int seq;
        unsigned long last;
        u64 acnt, mcnt;

@@ -438,8 +438,7 @@ unsigned int arch_freq_get_on_cpu(int cpu)
        return div64_u64((cpu_khz * acnt), mcnt);

 fallback:
-       freq = cpufreq_quick_get(cpu);
-       return freq ? freq : cpu_khz;
+       return cpufreq_quick_get(cpu);
 }

 static int __init bp_init_aperfmperf(void)
Comment 12 Keyon 2023-06-28 01:00:31 UTC
Agreed, Doug. I have just included the linux-pm ML and the cpufreq driver maintainers on the email; we can discuss it there next. Thank you.
Comment 13 Keyon 2023-06-28 20:16:40 UTC
Hi Doug, with the method you provided, I am seeing the trigger condition happen quite frequently:


# ./consume_test.sh 
consume: 3 100 300  PID: 8446
consume: 3 100 300  PID: 8448
consume: 3 100 300  PID: 8450
consume: 3 100 300  PID: 8452
consume: 3 100 300  PID: 8454
consume: 3 100 300  PID: 8456
consume: 3 100 300  PID: 8458
consume: 3 100 300  PID: 8460
consume: 3 100 300  PID: 8462
consume: 3 100 300  PID: 8464
consume: 3 100 300  PID: 8466
consume: 3 100 300  PID: 8468
consume: 3 100 300  PID: 8470
consume: 3 100 300  PID: 8472


# grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:1766635
/sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:1788979
/sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:1746332
/sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:1694200
/sys/devices/system/cpu/cpufreq/policy12/scaling_cur_freq:1199936
/sys/devices/system/cpu/cpufreq/policy13/scaling_cur_freq:3600000
/sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:1768728
/sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:3600000
/sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:3600000
/sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:1785094
/sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:1760354
/sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:1629422
/sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:1781190
/sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:3600000

I can even see it with frac=12, but I don't see it at frac=15.
Comment 14 Doug Smythies 2023-06-28 23:39:50 UTC
Interesting. What CPU frequency scaling driver and governor are you using? Is HWP enabled? What is your kernel tick rate? I'll try to reproduce your results.
Comment 15 Keyon 2023-06-29 00:21:36 UTC
(In reply to Doug Smythies from comment #14)
> Interesting. What CPU frequency scaling driver and governor are you using?
> Is HWP enabled? What is your kernel tick rate? I'll try to reproduce your
> results.

scaling driver: intel_pstate
scaling_governor: powersave
HWP enabled.
Kernel tick rate: do you mean CONFIG_HZ?
CONFIG_HZ=1000
kernel source: 6.1.33 kernel base
Comment 16 Doug Smythies 2023-06-29 15:52:40 UTC
Yes, I mean CONFIG_HZ when I talk about kernel tick rate.

Okay, I can re-create your results. It just means the CPU needs to be awake on a jiffy boundary, I think. If I change away from a fixed work packet to a fixed-timing workload, and ensure the CPU is awake for >= 1 millisecond and not asleep for more than 20 milliseconds, then the trigger condition never occurs. I.e.:

COUNTER=0
while [ $COUNTER -lt 12 ];
do
  taskset -c $COUNTER /home/doug/c/consume 10.1 50.1 60 &
  sleep 0.0122
  COUNTER=$((COUNTER+1))
done
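
(For reference: a minimal sketch of what a consume-style load generator could look like. This is a hypothetical reconstruction, since the actual /home/doug/c/consume was not posted; the argument meaning is inferred from the thread as busy percentage, work frequency in Hz, and duration in seconds.)

/* consume.c (sketch): spin busy for load% of each 1/freq period, sleep the rest. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <load%%> <freq_hz> <duration_s>\n", argv[0]);
		return 1;
	}

	double load   = atof(argv[1]) / 100.0;	/* busy fraction of each period */
	double freq   = atof(argv[2]);		/* work periods per second */
	double period = 1.0 / freq;
	double busy   = period * load;		/* e.g. 1% of 20.04 ms = 0.2 ms */
	double end    = now_sec() + atof(argv[3]);

	while (now_sec() < end) {
		double t0 = now_sec();
		double rest = period - busy;
		struct timespec rq;

		while (now_sec() - t0 < busy)
			;			/* spin: keep the CPU awake */
		rq.tv_sec  = (time_t)rest;
		rq.tv_nsec = (long)((rest - (double)rq.tv_sec) * 1e9);
		nanosleep(&rq, NULL);		/* idle for the rest of the period */
	}
	return 0;
}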


But we digress, as this isn't really relevant to the discussion.
Comment 17 Keyon 2023-06-30 00:52:35 UTC
Hi Doug, I think this can test the assumption that the driver is working as expected, i.e. that we will not go to the fallback if the CPU schedules a task at least once every 20 ms:

taskset -c $COUNTER /home/doug/c/consume 1 49.9 60 &

This tries to run the task for 0.2 ms of active time in every ~20 ms, which means the CPU never sleeps for > 20 ms, but we are still seeing the fallback with this.

Furthermore, I can still see the fallback with 'consume 4 200'.

So this still looks tricky to me, but anyway, as you said, let's not digress too much. I am fully open to, and appreciative of, any reasonable fix for the 2 issues I summarized in the mail thread.
Comment 18 Doug Smythies 2023-06-30 14:43:37 UTC
Running a task for 0.2 ms of active time every ~20 ms doesn't ensure the CPU is awake on a tick boundary. With a 1 ms tick, the 0.2 ms awake burst lands on a tick boundary only about 20% of the time (0.2/1.0), and only a CPU that is awake at tick time gets a fresh frequency sample. Let's try to verify by experiment:

Run similar to your command on CPU 5:

$ taskset -c 5 /home/doug/c/consume 1 49.9 1000

Meanwhile, sample its CPU frequency every 20 milliseconds for 40000 samples. (Note: my utility program uses sleep here, so with sleep errors and without compensating for code time, it was actually sampling at 20.184 milliseconds per sample.)

Results, test 1:
9703 samples, 24%, were of the actual CPU frequency.
30297 samples, 76%, were the stale response.

Results, test 2:
9794 samples, 24%, were of the actual CPU frequency.
30206 samples, 76%, were the stale response.

Note that while my test server is nominally pretty idle, there will still be occasional other things for CPU 5 to do.

The other thing we expect to see in the data is a beat-frequency pattern, as the awake time for CPU 5 gradually moves with respect to the tick time, since its period is 1/49.9 s, or 20.04 ms. We do see this pattern:

doug@s19:~/idle/perf$ tail -25 freq-03.csv
39975, 4.100
39976, 4.100
39977, 4.100
39978, 4.100
39979, 0.800
39980, 0.800
39981, 0.800
39982, 0.800
39983, 0.800
39984, 0.800
39985, 4.100
39986, 4.100
39987, 4.100
39988, 4.100
39989, 4.100
39990, 4.100
39991, 4.100
39992, 4.100
39993, 4.100
39994, 4.100
39995, 4.100
39996, 4.100
39997, 4.100
39998, 4.100
39999, 4.100

The 0.2 millisecond work time every 5 milliseconds, for 'consume 4 200', behaves a little differently as we get into the idle-state regions where the tick may or may not turn off, depending on the timing. The beat frequency should be 0 for this case, but is only extremely low. I do not want to go down a rat hole here, so I am not pursuing it further.

For the actual issue here: I am thinking more about my response in comment 11 above, and whether it would be possible to return the policy min_freq as the stale response, and whether it would make sense to do so.
Comment 19 Keyon 2023-06-30 18:49:32 UTC
(In reply to Doug Smythies from comment #18)
> Running a task for 0.2 ms of active time every ~20 ms doesn't ensure the
> CPU is awake on a tick boundary. With a 1 ms tick, the 0.2 ms awake burst
> lands on a tick boundary only about 20% of the time (0.2/1.0), and only a
> CPU that is awake at tick time gets a fresh frequency sample. Let's try to
> verify by experiment:
> 
> Run similar to your command on CPU 5:
> 
> $ taskset -c 5 /home/doug/c/consume 1 49.9 1000
> 
> Meanwhile, sample its CPU frequency every 20 milliseconds for 40000
> samples. (Note: my utility program uses sleep here, so with sleep errors
> and without compensating for code time, it was actually sampling at 20.184
> milliseconds per sample.)
> 
> Results, test 1:
> 9703 samples, 24%, were of the actual CPU frequency.
> 30297 samples, 76%, were the stale response.
> 
> Results, test 2:
> 9794 samples, 24%, were of the actual CPU frequency.
> 30206 samples, 76%, were the stale response.
> 

Hi Doug, are you running another test script to read the cur_freq every 25 ms? Can you share how you got these data?
Comment 20 Keyon 2023-07-01 00:33:44 UTC
Hi Doug, basically, I have difficulty understanding what '(jiffies - last) > 20' really means in the upstream kernel.

I have made these changes:

1. Changed the fallback tick threshold from 20 to 1000, i.e. '(jiffies - last) > 1000'. I suppose this could mean "if the CPU didn't run any task during the last scheduler tick, let's treat it as IDLE and report the fallback cur_freq".
2. Wrote a script (get_cur_freq.sh) to check the CPU status every 25us.

Then, I performed a boundary test today, running:
1. taskset -c 13 ./consume 0.1 125 300          // just a very tiny task every 8 ms
2. ./get_cur_freq.sh 13 80000

CPU #13 is the one that is idle most of the time by default.

Then, I get this:

total samples:80000
idle samples:1136
active samples:78864
idle percentage:1.00

With this, CPU #13 almost canNOT idle. If I change to 'consume 0.1 143 300' (a very tiny task every 7 ms), CPU #13 can never idle.
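
(For reference: a minimal sketch of a sampler in the spirit of get_cur_freq.sh. This is a hypothetical reconstruction, since the actual script was not posted; it counts a sample as "idle" when scaling_cur_freq returns the known fallback value.)

/* freq_sampler.c (sketch): usage: freq_sampler <cpu> <nsamples> <fallback_khz> */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long n, fallback, idle = 0;
	char path[128];

	if (argc != 4) {
		fprintf(stderr, "usage: %s <cpu> <nsamples> <fallback_khz>\n", argv[0]);
		return 1;
	}
	n = atol(argv[2]);
	fallback = atol(argv[3]);	/* e.g. 3600000: the stale response */
	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpufreq/policy%d/scaling_cur_freq",
		 atoi(argv[1]));

	for (long i = 0; i < n; i++) {
		long khz = 0;
		FILE *f = fopen(path, "r");

		if (!f)
			return 1;
		if (fscanf(f, "%ld", &khz) != 1)
			khz = 0;
		fclose(f);
		if (khz == fallback)
			idle++;
		usleep(25);		/* 25 us between samples, as above */
	}
	printf("total samples:%ld\nidle samples:%ld\nactive samples:%ld\n",
	       n, idle, n - idle);
	return 0;
}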


OMG, this is a really deep rat hole. I am actually fine with your proposal to just return 0 for this fallback if we only want to 'fix' the original issue, as per my understanding it is not a big deal at all for most users.
Comment 21 Keyon 2023-07-01 00:52:03 UTC
(In reply to Doug Smythies from comment #18)
> 
> For the actual issue here: I am thinking more about my response in comment
> 11 above, and if it would be possible to return the policy min_freq as the
> stale response, and if it would make sense to do so.

Yes, something like this looks fine to me as well (on top of your change in comment #11 above):

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 20c5ca219afe..2c44b1d3f116 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -713,7 +713,7 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
        else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
                ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
        else
-               ret = sprintf(buf, "%u\n", policy->cur);
+               ret = sprintf(buf, "%u\n", policy->min);/* show min in idle */
        return ret;
 }
Comment 22 Doug Smythies 2023-07-01 22:22:53 UTC
(In reply to Keyon from comment #20)
> I have difficulty understanding what '(jiffies - last) > 20' really means
> in the upstream kernel.

For our 1000 Hz kernels the translation of that code is:

if the current sample is over 20 jiffies old.
or
if the current sample is over 20 kernel ticks old.
or
if the current sample is over 20 milliseconds old.
> 
> I have made these changes:
> 
> 1. Changed the fallback tick threshold from 20 to 1000, i.e.
> '(jiffies - last) > 1000'. I suppose this could mean "if the CPU didn't
> run any task during the last scheduler tick, let's treat it as IDLE and
> report the fallback cur_freq".

No, that is incorrect.
It would mean "if the current sample is over 1 second old".

> 2. Wrote a script (get_cur_freq.sh) to check the CPU status every 25us.

Did you mean 25 Milliseconds?

> 
> Then, I performed a boundary test today, running:
> 1. taskset -c 13 ./consume 0.1 125 300          // just a very tiny task
> every 8 ms
> 2. ./get_cur_freq.sh 13 80000
> 
> CPU #13 is the one that is idle most of the time by default.
> 
> Then, I get this:
> 
> total samples:80000
> idle samples:1136
> active samples:78864
> idle percentage:1.00

No, your terminology is incorrect. Idle time and if a sample is stale or not are different things.
 
> With this, CPU #13 almost canNOT idle.

You are not measuring idle time here. If you want to measure idle time then look at, and calculate it from, the idle stats. Example (for CPU 5):

$ grep . /sys/devices/system/cpu/cpu5/cpuidle/state*/time ; sleep 10 ; grep . /sys/devices/system/cpu/cpu5/cpuidle/state*/time
/sys/devices/system/cpu/cpu5/cpuidle/state0/time:2017995
/sys/devices/system/cpu/cpu5/cpuidle/state1/time:268419062
/sys/devices/system/cpu/cpu5/cpuidle/state2/time:786807281
/sys/devices/system/cpu/cpu5/cpuidle/state3/time:247975570890
/sys/devices/system/cpu/cpu5/cpuidle/state0/time:2017995
/sys/devices/system/cpu/cpu5/cpuidle/state1/time:268421655
/sys/devices/system/cpu/cpu5/cpuidle/state2/time:786824863
/sys/devices/system/cpu/cpu5/cpuidle/state3/time:247984398971

That is 8.848 seconds out of the 10, or 88.48% idle.
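
The arithmetic, for reference (per-state residency deltas in microseconds over the 10 second window):

state0: 2017995      - 2017995      = 0
state1: 268421655    - 268419062    = 2593
state2: 786824863    - 786807281    = 17582
state3: 247984398971 - 247975570890 = 8828081

total:  0 + 2593 + 17582 + 8828081  = 8848256 us ~= 8.848 s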
Comment 23 Doug Smythies 2023-07-03 01:21:53 UTC
(In reply to Keyon from comment #21)
 
> Yes, something like this looks fine to me as well (on top of your change in
> comment #11 above)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 20c5ca219afe..2c44b1d3f116 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -713,7 +713,7 @@ static ssize_t show_scaling_cur_freq(struct
> cpufreq_policy *policy, char *buf)
>         else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
>                 ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
>         else
> -               ret = sprintf(buf, "%u\n", policy->cur);
> +               ret = sprintf(buf, "%u\n", policy->min);/* show min in idle
> */
>         return ret;
>  }

The issue with it is that the reply for /proc/cpuinfo is different from that for /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq.
Comment 24 Doug Smythies 2023-07-03 01:24:16 UTC
Created attachment 304531 [details]
A possible patch for this issue

I had some odd behaviour with HWP enabled, passive, schedutil, which I need to look into further.
Comment 25 Doug Smythies 2023-07-03 01:32:16 UTC
Created attachment 304532 [details]
Stale frequency results for various drivers and governors

Sorry for the picture type file.

Where:
MHz is a stale freq from /proc/cpuinfo
and cur_freq is a stale freq from /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq.
Comment 26 Keyon 2023-07-05 18:25:31 UTC
(In reply to Doug Smythies from comment #22)
> ...

I think all these comments are good, thank you.
Comment 27 Keyon 2023-07-05 18:43:16 UTC
(In reply to Doug Smythies from comment #24)
> Created attachment 304531 [details]
> A possible patch for this issue

This looks good to me, thanks.

> 
> I had some odd behaviour with HWP enabled, passive, schedutil, which I need
> to look into further.

Could that be related to 'base_frequency'? I haven't tried passive mode though.
Comment 28 Doug Smythies 2023-07-07 21:31:41 UTC
(In reply to Keyon from comment #27)
> (In reply to Doug Smythies from comment #24)
> > Created attachment 304531 [details]
> > 
> > I had some odd behaviour with HWP enabled, passive, schedutil, which I need
> > to look into further.
> 
> > Could that be related to 'base_frequency'? I haven't tried passive mode
> > though.

I do not think it is related to 'base_frequency', but please go ahead and provide any evidence you have that it is.

As I mentioned, there are odd things going on with the HWP enabled, passive, schedutil configuration that I need to get to the bottom of. For example, right now I have a stale frequency of 3.9 GHz for CPU 0, 1.2 GHz for CPU 11, and 0.9 GHz for all the others, even though min is set to 1.0 GHz.
Comment 29 Keyon 2023-07-10 17:00:45 UTC
Thank you, Doug. I will rely on you to proceed with the fix for this issue; let me know if there is anything I can help with. :-)
Comment 30 Doug Smythies 2023-07-10 20:51:13 UTC
The issue with HWP enabled, passive, schedutil seems to be in the function cpufreq_quick_get in the file drivers/cpufreq/cpufreq.c. It seems that policy->cur is never updated under these conditions. HWP disabled, passive, schedutil works fine. It is a bit of a different issue, but I want to fix it at the same time with another patch.
Comment 31 Keyon 2023-07-12 18:41:03 UTC
Thank you, Doug. The acpi-cpufreq driver doesn't work on my platform, but it looks like HWP enabled with passive/schedutil works, and I get a reasonable policy->cur here, even without any of the fixes we are discussing.
Comment 32 Doug Smythies 2023-07-13 00:07:49 UTC
Okay, then we need to figure out the difference. I get a fine response from the function when the frequency is not stale (i.e. the other return path from the function, I think). But policy->cur is never updated. It is used for the stale return path.

Example (Unmodified kernel 6.5-rc1):

scaling_min_freq for all CPUs = 0.8 GHz:

$ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:800293
/sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:799993

Okay, all the stale values are good. Note: one can tell because the numbers are exact (I also did a test compile with flagged return values).

Now set the scaling_min_freq for all CPUs = 2.5 GHz:

$ echo 2500000 | sudo tee /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq

And check:

$ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:2526076
/sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:2500047

Huh? wrong.

Now switch to the ondemand governor and then switch back and check again:

$ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:4800127
/sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:2500000
/sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:4800262

Now it is Okay.

There is other weirdness also. I am now trying to test an additional patch on top of what I proposed above. I'll post it tomorrow.

You might have to disable HWP in your BIOS to be able to boot with the acpi-cpufreq CPU frequency scaling driver. While it should be a boot-time decision, some BIOSes (ASUS) force it even if "auto" is selected in the BIOS.
Comment 33 Doug Smythies 2023-07-13 15:12:33 UTC
I am testing this addition to the proposed patch from comment 24:

$ git diff
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 6230fd7162ee..44cc96864d94 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -418,9 +418,10 @@ unsigned int arch_freq_get_on_cpu(int cpu)
        unsigned long last;
        u64 acnt, mcnt;

-       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-               goto fallback;
-
+       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF)){
+               freq = cpufreq_quick_get(cpu);
+               return freq ? freq : cpufreq_quick_get_min(cpu);
+       }
        do {
                seq = raw_read_seqcount_begin(&s->seq);
                last = s->last_update;
@@ -433,13 +434,9 @@ unsigned int arch_freq_get_on_cpu(int cpu)
         * which covers idle and NOHZ full CPUs.
         */
        if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
-               goto fallback;
+               return cpufreq_quick_get_min(cpu);

        return div64_u64((cpu_khz * acnt), mcnt);
-
-fallback:
-       freq = cpufreq_quick_get(cpu);
-       return freq ? freq : cpufreq_quick_get_min(cpu);
 }

 static int __init bp_init_aperfmperf(void)

This was after thinking about, and trying, just changing the return value in drivers/cpufreq/cpufreq.c from policy->cur to policy->min. The concern is that I don't know whether that would break the expected response for some of the other callers of that function, for example drivers/devfreq/tegra30-devfreq.c, drivers/thermal/cpufreq_cooling.c, and drivers/powercap/dtpm_cpu.c.
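
For reference, a sketch of what the cpufreq_quick_get_min() helper used above could look like (hypothetical: its actual definition belongs to the proposed patch from comment 24 and is not inlined here), modeled on the existing cpufreq_quick_get():

/* Sketch only: report the policy minimum frequency for a CPU, intended
 * as the reply when the APERF/MPERF sample is stale. */
unsigned int cpufreq_quick_get_min(unsigned int cpu)
{
	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
	unsigned int ret_freq = 0;

	if (policy) {
		ret_freq = policy->min;
		cpufreq_cpu_put(policy);
	}

	return ret_freq;
}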
Comment 34 Keyon 2023-07-13 17:56:07 UTC
(In reply to Doug Smythies from comment #32)
> Okay, then we need to figure out the difference. I get a fine response from
> the function when the frequency is not stale (i.e. the other return path
> from the function, I think). But policy->cur is never updated. It is used
> for the stale return path.
> ...
> Huh? wrong.
> 
> Now switch to the ondemand governor and then switch back and check again:
> ...
> Now it is Okay.

Ah, these are interesting. I must admit I haven't played a lot with different governors and usually just use the default driver and governor.

> 
> There is other weirdness also. I am now trying to test an additional patch
> on top of what I proposed above. I'll post it tomorrow.
> 
> You might have to disable HWP in your BIOS to be able to boot to the
> acpi-cpufreq CPU frequency scaling driver. While it should be a boot time
> decision, some BIOS' (ASUS) force it even if "auto" is selected in the BIOS.

Thanks, that's a good point. I recall someone did tell me it's not doable to disable HWP on our platform; I just used 'intel_pstate=disable' on the kernel boot cmdline.
Comment 35 Keyon 2023-07-13 18:04:23 UTC
(In reply to Doug Smythies from comment #33)
> I am testing this addition to the proposed patch from comment 24:
> 
> $ git diff
> diff --git a/arch/x86/kernel/cpu/aperfmperf.c
> b/arch/x86/kernel/cpu/aperfmperf.c
> index 6230fd7162ee..44cc96864d94 100644
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -418,9 +418,10 @@ unsigned int arch_freq_get_on_cpu(int cpu)
>         unsigned long last;
>         u64 acnt, mcnt;
> 
> -       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> -               goto fallback;
> -
> +       if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF)){
> +               freq = cpufreq_quick_get(cpu);
> +               return freq ? freq : cpufreq_quick_get_min(cpu);
> +       }
>         do {
>                 seq = raw_read_seqcount_begin(&s->seq);
>                 last = s->last_update;
> @@ -433,13 +434,9 @@ unsigned int arch_freq_get_on_cpu(int cpu)
>          * which covers idle and NOHZ full CPUs.
>          */
>         if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
> -               goto fallback;
> +               return cpufreq_quick_get_min(cpu);
> 
>         return div64_u64((cpu_khz * acnt), mcnt);
> -
> -fallback:
> -       freq = cpufreq_quick_get(cpu);
> -       return freq ? freq : cpufreq_quick_get_min(cpu);
>  }
> 
>  static int __init bp_init_aperfmperf(void)
> 
> This was after thinking about, and trying, just changing the return value in
> drivers/cpufreq/cpufreq.c from policy->cur to policy->min. The concern with
> it being that I don't know if that would break the expected response for
> some of the other callers of that function. For example
> drivers/devfreq/tegra30-devfreq.c and drivers/thermal/cpufreq_cooling.c and
> drivers/powercap/dtpm_cpu.c

I don't quite follow here. Returning min at the 2nd 'goto' makes sense, avoiding the superfluous call to cpufreq_quick_get() (as it looks like it always returns 0 in this condition). But those calls to cpufreq_quick_get() from tegra30-devfreq.c, cpufreq_cooling.c and dtpm_cpu.c won't be impacted anyway, as we only added a new cpufreq_quick_get_min() helper and kept cpufreq_quick_get() intact; am I missing something here?
Comment 36 Doug Smythies 2023-07-18 17:09:12 UTC
(In reply to Keyon from comment #35)
> (In reply to Doug Smythies from comment #33)
...
> > This was after thinking about, and trying, just changing the return value
> > in drivers/cpufreq/cpufreq.c from policy->cur to policy->min. The concern
> > is that I don't know whether that would break the expected response for
> > some of the other callers of that function, for example
> > drivers/devfreq/tegra30-devfreq.c, drivers/thermal/cpufreq_cooling.c,
> > and drivers/powercap/dtpm_cpu.c.
> 
> I don't quite follow here. Returning min at the 2nd 'goto' makes sense,
> avoiding the superfluous call to cpufreq_quick_get() (as it looks like it
> always returns 0 in this condition). But those calls to cpufreq_quick_get()
> from tegra30-devfreq.c, cpufreq_cooling.c and dtpm_cpu.c won't be impacted
> anyway, as we only added a new cpufreq_quick_get_min() helper and kept
> cpufreq_quick_get() intact; am I missing something here?

I was talking about another option:

$ git diff
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 5ea08a576131..487ac547e19e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1768,7 +1768,7 @@ unsigned int cpufreq_quick_get(unsigned int cpu)

        policy = cpufreq_cpu_get(cpu);
        if (policy) {
-               ret_freq = policy->cur;
+               ret_freq = policy->min;
                cpufreq_cpu_put(policy);
        }

But there are other callers of cpufreq_quick_get that might be affected so I went to the code listed and left cpufreq_quick_get unchanged. Due to limited available time, I have not yet finished testing.
Comment 37 Keyon 2023-07-20 00:39:41 UTC
(In reply to Doug Smythies from comment #36)
> (In reply to Keyon from comment #35)
> > (In reply to Doug Smythies from comment #33)
> ...
> > > This was after thinking about, and trying, just changing the return
> > > value in drivers/cpufreq/cpufreq.c from policy->cur to policy->min.
> > > The concern is that I don't know whether that would break the expected
> > > response for some of the other callers of that function, for example
> > > drivers/devfreq/tegra30-devfreq.c, drivers/thermal/cpufreq_cooling.c,
> > > and drivers/powercap/dtpm_cpu.c.
> > 
> > I don't quite follow here. Returning min at the 2nd 'goto' makes sense,
> > avoiding the superfluous call to cpufreq_quick_get() (as it looks like
> > it always returns 0 in this condition). But those calls to
> > cpufreq_quick_get() from tegra30-devfreq.c, cpufreq_cooling.c and
> > dtpm_cpu.c won't be impacted anyway, as we only added a new
> > cpufreq_quick_get_min() helper and kept cpufreq_quick_get() intact;
> > am I missing something here?
> 
> I was talking about another option:
> 
> $ git diff
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 5ea08a576131..487ac547e19e 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1768,7 +1768,7 @@ unsigned int cpufreq_quick_get(unsigned int cpu)
> 
>         policy = cpufreq_cpu_get(cpu);
>         if (policy) {
> -               ret_freq = policy->cur;
> +               ret_freq = policy->min;
>                 cpufreq_cpu_put(policy);
>         }
> 
> But there are other callers of cpufreq_quick_get that might be affected so I

Gotcha, fair enough.

> went to the code listed and left cpufreq_quick_get unchanged. Due to limited
> available time, I have not yet finished testing.

Take your time, Doug. I will stay tuned for your updates.
Comment 38 Doug Smythies 2023-07-25 15:45:44 UTC
Created attachment 304694 [details]
Stale frequency results for various drivers and governors

For a patched and an unpatched kernel 6.5-rc3. There might be pushback on the changes to the performance governor responses. There is also an ongoing thread about "cpufreq: schedutil: next_freq need update when cpufreq_limits changed" on the linux-pm list that might change the unpatched response for the schedutil governor.
Comment 39 Keyon 2023-07-25 22:12:57 UTC
(In reply to Doug Smythies from comment #38)
> Created attachment 304694 [details]
> Stale frequency results for various drivers and governors
> 
> For a patched and unpatched kernel 6.5-RC3. There might be push back for the
> changes to the performance governor responses. There is also an ongoing
> thread about "cpufreq: schedutil: next_freq need update when cpufreq_limits
> changed" on the linux-pm list that might change the unpatched response for
> the schedutil governor.

Thank you for the notice Doug.
Comment 40 Doug Smythies 2023-07-30 14:29:17 UTC
Created attachment 304734 [details]
intel_cpufreq driver, hwp disabled, schedutil governor, odd pstate requests

From a trace. Note the odd pstate requests: 48 for no load and 28 to 30 for full load. policy->cur often shows 48 for the idle CPU for long periods (observe the duration numbers), which is what has been requested, not what has been granted.
Comment 41 Doug Smythies 2023-07-30 14:38:35 UTC
Created attachment 304735 [details]
Stale frequency results for various drivers and governors - policy cur method

Rafael suggested this instead:

doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 50bbc969ffe5..4e91169a83f5 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1768,7 +1768,7 @@ unsigned int cpufreq_quick_get(unsigned int cpu)

        policy = cpufreq_cpu_get(cpu);
        if (policy) {
-               ret_freq = policy->cur;
+               ret_freq = policy->cur ? policy->cur : policy->min;
                cpufreq_cpu_put(policy);
        }

The results are attached. policy->cur can be rather unrelated to what the actual frequency has ever been. The frequency is stale anyhow. I still like my suggested patch better.
Comment 42 Keyon 2023-08-04 19:24:34 UTC
(In reply to Doug Smythies from comment #40)
> Created attachment 304734 [details]
> intel_cpufreq driver, hwp disabled, schedutil governor, odd pstate requests
> 
> From a trace. Note the odd pstate requests. 48 for no load and 28 to 30 for
> full load, which seems odd. policy->cur often shows 48 for the idle CPU for
> long periods (observe the duration numbers), which is what has been
> requested not what has been granted.

This looks very informative, but I don't have enough knowledge to understand it yet. I guess most people don't actually care about the HWP-disabled case that much.
Comment 43 Keyon 2023-08-04 19:34:19 UTC
(In reply to Doug Smythies from comment #41)
> Created attachment 304735 [details]
> Stale frequency results for various drivers and governors - policy cur method
> 
> Rafael suggested this instead:
> 
> doug@s19:~/kernel/linux$ git diff
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 50bbc969ffe5..4e91169a83f5 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1768,7 +1768,7 @@ unsigned int cpufreq_quick_get(unsigned int cpu)
> 
>         policy = cpufreq_cpu_get(cpu);
>         if (policy) {
> -               ret_freq = policy->cur;
> +               ret_freq = policy->cur ? policy->cur : policy->min;
>                 cpufreq_cpu_put(policy);
>         }
> 
> The results are attached. policy->cur can be rather unrelated to what the
> actual frequency has ever been. The frequency is stale anyhow. I still like
> my suggested patch better.

Doug, IMHO it is not a big deal, simply a matter of making life easier. Returning min was your original idea IIRC; if that is more acceptable to the maintainers, I would suggest going that way.
Comment 44 Doug Smythies 2023-08-08 16:33:36 UTC
My original idea was this:

        policy = cpufreq_cpu_get(cpu);
        if (policy) {
-               ret_freq = policy->cur;
+               ret_freq = policy->min;
                cpufreq_cpu_put(policy);
        }

Anyway, I'll submit the patch as per comment 41.
Comment 45 Keyon 2023-08-15 01:03:25 UTC
(In reply to Doug Smythies from comment #44)
> My original idea was this:
> 
>         policy = cpufreq_cpu_get(cpu);
>         if (policy) {
> -               ret_freq = policy->cur;
> +               ret_freq = policy->min;
>                 cpufreq_cpu_put(policy);
>         }
> 
> anyway, I'll submit the patch as per comment 41.

Thank you Doug.
Comment 46 Doug Smythies 2023-08-20 20:51:54 UTC
Created attachment 304916 [details]
Yet another patch to fix this.

I actually never liked the previously posted patch, even though I said I would submit it. This version has been submitted and addresses the issue at its source.
Comment 47 Keyon 2023-08-23 16:47:21 UTC
(In reply to Doug Smythies from comment #46)
> Created attachment 304916 [details]
> Yet another patch to fix this.
> 
> I actually never liked the previously posted patch, even though I said I
> would submit it. This version has been submitted and addresses the issue at
> its source.

Checked and verified that this works for me as well. Thank you so much for making it happen, Doug. Much appreciated!
Comment 48 Keyon 2023-08-23 16:48:37 UTC
The fix has landed in the linux-pm tree; we will see it in the 6.6 kernel.
Comment 51 Keyon 2023-09-06 00:28:52 UTC
Verified working. Thank you so much, Doug! Now closing.