From Linux kernel v5.18 on, we notice that the cur_freq of CPU cores are not correct from time to time, the cur_freq could even be > the maximum ones, as below: # lscpu -e CPU SOCKET CORE ONLINE MAXMHZ MINMHZ MHZ 0 0 0 yes 4200.0000 800.0000 4200.0000 1 0 0 yes 4200.0000 800.0000 4200.0000 2 0 1 yes 4200.0000 800.0000 4200.0000 3 0 1 yes 4200.0000 800.0000 3600.0000 4 0 2 yes 3000.0000 800.0000 1732.7410 5 0 3 yes 3000.0000 800.0000 1685.1310 6 0 4 yes 3000.0000 800.0000 1715.8409 7 0 5 yes 3000.0000 800.0000 3600.0000 8 0 6 yes 3000.0000 800.0000 1716.0790 9 0 7 yes 3000.0000 800.0000 3600.0000 10 0 8 yes 3000.0000 800.0000 1603.4210 11 0 9 yes 3000.0000 800.0000 3600.0000 Here, the cur_freq of core 7, 9, 11 looks incorrect. With v5.17, we get the correct ones as below: # lscpu -e CPU SOCKET CORE ONLINE MAXMHZ MINMHZ MHZ 0 0 0 yes 4200.0000 800.0000 4199.8760 1 0 0 yes 4200.0000 800.0000 1444.7260 2 0 1 yes 4200.0000 800.0000 1554.8870 3 0 1 yes 4200.0000 800.0000 1577.0760 4 0 2 yes 3000.0000 800.0000 1800.5040 5 0 3 yes 3000.0000 800.0000 1767.4659 6 0 4 yes 3000.0000 800.0000 1800.3500 7 0 5 yes 3000.0000 800.0000 1600.4460 8 0 6 yes 3000.0000 800.0000 1597.0200 9 0 7 yes 3000.0000 800.0000 1681.0930 10 0 8 yes 3000.0000 800.0000 1700.8730 11 0 9 yes 3000.0000 800.0000 1785.5389 12 0 10 yes 2100.0000 800.0000 1202.8650 13 0 11 yes 2100.0000 800.0000 1597.0270
Have been debugged this, assigning this to myself.
As of Kernel 5.18 if a CPU has been idle and is idle, it is not woken just to read CPU frequency, nor is the stale CPU frequency listed. The nominal non-tubro max frequency is listed. The related patch set email thread: https://lore.kernel.org/linux-pm/?q=x86%2Fcpu%3A+Consolidate+APERF%2FMPERF+code I am not certain this is your issue, but it sounds like it.
(In reply to Doug Smythies from comment #2) > As of Kernel 5.18 if a CPU has been idle and is idle, it is not woken just > to read CPU frequency, nor is the stale CPU frequency listed. The nominal > non-tubro max frequency is listed. > > The related patch set email thread: > > https://lore.kernel.org/linux-pm/ > ?q=x86%2Fcpu%3A+Consolidate+APERF%2FMPERF+code > > I am not certain this is your issue, but it sounds like it. Hi Doug, exactly that the series introduced this issue, I have submitted a patch to address it here: https://lore.kernel.org/lkml/20230626193601.9169-1-yang.jie@linux.intel.com/
From the patch: > have been comparing ticks and milliseconds by mistake I do not think so. I think this line is correct: >- if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE) I think this line is not correct: > + if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE * cpu_khz) The definition of MAX_SAMPLE_AGE is already 20 milliseconds converted to jiffies. The code is doing what was desired (I think) and it was expected that CPU frequency listings would change. Myself, I think it would be more informative to list a frequency of 0 for stale info instead of cpu_khz, but that wasn't accepted when I proposed it (I think, I'd have to review more to be sure). Note: I am not on the x86@kernel.org email list, only the linux-pm list. The patch should have been copied to linux-pm, as it was originally getting the bugzilla emails and was copied on the original patch series.
(In reply to Doug Smythies from comment #4) > From the patch: > > > have been comparing ticks and milliseconds by mistake > > I do not think so. > > I think this line is correct: > >- if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE) > > I think this line is not correct: > > + if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE * cpu_khz) > > The definition of MAX_SAMPLE_AGE is already 20 milliseconds converted to > jiffies. I don't think so, see here: #define MAX_SAMPLE_AGE ((unsigned long)HZ / 50) CONFIG_HZ is usually configured to be 1000 and MAX_SAMPLE_AGE is 20, I have printed that out and confirmed it. > > The code is doing what was desired (I think) and it was expected that CPU > frequency listings would change. Myself, I think it would be more > informative to list a frequency of 0 for stale info instead of cpu_khz, but > that wasn't accepted when I proposed it (I think, I'd have to review more to > be sure). Ah I can imagine that, people are complaining about the cur_freq > min_freq here as well. > > Note: I am not on the x86@kernel.org email list, only the linux-pm list. The > patch should have been copied to linux-pm, as it was originally getting the > bugzilla emails and was copied on the original patch series. oops, maybe I should send it to linux-pm first?
And, when I printed more logs, you can see that the delta > 20 happens quite frequently which lead to the default cpu_khz is used as the cur_freq from time to time: [ 57.904839] Keyon: jiffies - last: 50 [ 57.909061] Keyon: jiffies - last: 0 [ 57.913161] Keyon: jiffies - last: 60 [ 57.917170] Keyon: jiffies - last: 62 [ 57.921277] Keyon: jiffies - last: 174 [ 57.925393] Keyon: jiffies - last: 4 [ 57.929586] Keyon: jiffies - last: 993 [ 57.933586] Keyon: jiffies - last: 165
(In reply to Keyon from comment #5) > (In reply to Doug Smythies from comment #4) > I don't think so, see here: > > #define MAX_SAMPLE_AGE ((unsigned long)HZ / 50) > > CONFIG_HZ is usually configured to be 1000 and MAX_SAMPLE_AGE is 20, I have > printed that out and confirmed it. Yes, exactly. 20 jiffies at a 1000 Hz tick rate is the desired 20 milliseconds. > the delta > 20 happens quite frequently Yes. If the system is somewhat idle, it'll happen very often. Example from my idle system (4.1 GHz means stale): $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:4100000 /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:4100000 /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:4100000 /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:4100000 /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:4100000 /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:2200179 /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:2200009 /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:4100000 /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:4100000 /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:4100000 /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:2139420 /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:2155407 But if I put a light 100 hertz load on every CPU, then the stale trigger conditions will never be met: $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:900001 /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:806972 /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:898252 /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:898536 /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:900024 /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:845586 /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:900013 /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:847413 /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:898391 /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:900014 /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:888358 /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:900020
(In reply to Doug Smythies from comment #7) > > Yes. If the system is somewhat idle, it'll happen very often. Example from > my idle system (4.1 GHz means stale): > > $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq > /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:4100000 > /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:4100000 > /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:4100000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:4100000 > /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:4100000 > /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:2200179 > /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:2200009 > /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:4100000 > /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:4100000 > /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:4100000 > /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:2139420 > /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:2155407 > > But if I put a light 100 hertz load on every CPU, then the stale trigger > conditions will never be met: > > $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq > /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:900001 > /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:806972 > /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:898252 > /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:898536 > /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:900024 > /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:845586 > /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:900013 > /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:847413 > /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:898391 > /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:900014 > /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:888358 > /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:900020 Thank you cross confirm Doug. Not saying if showing 4.1GHz at that frequency level making sense or confusing people, at least the current code (ticks comparing with ms) is not aligned with the design according to the 20ms comment. Doug, if you agree with me about that, do you mind adding your Acked-by tag?
I do not agree, and have been saying all along that current code (before your patch) is working as expected. However, I do agree that it can be confusing. My input as to a possible remedy patch would be: $ git diff diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c index fdbb5f07448f..53c6a508a366 100644 --- a/arch/x86/kernel/cpu/aperfmperf.c +++ b/arch/x86/kernel/cpu/aperfmperf.c @@ -439,7 +439,7 @@ unsigned int arch_freq_get_on_cpu(int cpu) fallback: freq = cpufreq_quick_get(cpu); - return freq ? freq : cpu_khz; + return freq ? freq : 0; } static int __init bp_init_aperfmperf(void) But that also has inconsistencies between what is reported as a function of CPU frequency scaling driver and governor that is being used (some of those inconsistencies also exist now and I seemed to have missed them or made some mistake back on the original thread referenced in comment 2 above, where I soliciting for consistent reporting across all CPU scaling drivers and governors).
Hi Doug, IMHO, the scaling_cur_freq should show the real current running/electrical frequency set to the corresponding CPU core, the core itself could run in different frequencies, range from min to max, should not be 0 except that when it is put offline. Usually when Idle, the core should still run on a freequency, e.g. the default, base or min frequency. All right, there are 2 issues here per my understanding: 1. the cpu_khz shared for all CPU cores? In Intel's recent Hybrid CPUs, what does this cpu_khz read from cpuid really mean? I am seeing cpu_khz=3.6GHz for E-cores with Max frequecy 3GHz. We should fix that, no? 2. We don't want to wake up cores just because of the sysfs queries, so we introduced fallback mechanism here, what is our clear design about that? So, before discussing those issues, we should get alignment on these first: 1. What is fallback and When should we fallback. From the comment, looks we wanted to use cpu_khz for Cores haven't executed any task during the last 20ms, this sounds reasonable, no? 2. What frequencies should we show in fallback case. This could be controversial, 0? min_freq? base_freq? or last calculated one? So, I think what you proposed above is trying to push the #2 alignment above to use '0', while it has no conflict with the patch I sent trying to fix the existed typo (I believe so, might need Thomas Gleixner to chime in) of the #1 alignment. I get your point that it looks consist with the design that 'when put a light 100 hertz load on every CPU, then the stale trigger conditions will never be met', can you share me how did you do that? Let me try it at my end also.
For this: > the scaling_cur_freq should show the real > current running/electrical frequency > set to the corresponding CPU core, > the core itself could run in different > frequencies, range from min to max, > should not be 0 except that when it is > put offline. Usually when Idle, the > core should still run on a frequency, That is not true. In deep idle states the CPU clock stops. And that is the point, there is no "current" CPU frequency, it is stale. The discussion at the time (see the thread referenced in comment 2 above) seemed to be against using the last known value. The issue I have with using cpu_khz is that there is no easy way to distinguish between a CPU really running at cpu_khz and CPU that was idle for that sample. Some drivers and governor combinations seem to report min_freq. I think this conversation needs to include the participants from the original thread. Note: I forgot to simplify above and am now using this: $ git diff diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c index fdbb5f07448f..486207336d8b 100644 --- a/arch/x86/kernel/cpu/aperfmperf.c +++ b/arch/x86/kernel/cpu/aperfmperf.c @@ -414,7 +414,7 @@ void arch_scale_freq_tick(void) unsigned int arch_freq_get_on_cpu(int cpu) { struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu); - unsigned int seq, freq; + unsigned int seq; unsigned long last; u64 acnt, mcnt; @@ -438,8 +438,7 @@ unsigned int arch_freq_get_on_cpu(int cpu) return div64_u64((cpu_khz * acnt), mcnt); fallback: - freq = cpufreq_quick_get(cpu); - return freq ? freq : cpu_khz; + return cpufreq_quick_get(cpu); } static int __init bp_init_aperfmperf(void)
Agree, Doug, I just include the linux-pm ML and maintainers of the cpufreq driver in the email, we can discuss it there next, thank you.
Hi Doug, with the method you provided, I am seeing the trigger condition happens quite frequently: # ./consume_test.sh consume: 3 100 300 PID: 8446 consume: 3 100 300 PID: 8448 consume: 3 100 300 PID: 8450 consume: 3 100 300 PID: 8452 consume: 3 100 300 PID: 8454 consume: 3 100 300 PID: 8456 consume: 3 100 300 PID: 8458 consume: 3 100 300 PID: 8460 consume: 3 100 300 PID: 8462 consume: 3 100 300 PID: 8464 consume: 3 100 300 PID: 8466 consume: 3 100 300 PID: 8468 consume: 3 100 300 PID: 8470 consume: 3 100 300 PID: 8472 # grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:1766635 /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:1788979 /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:1746332 /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:1694200 /sys/devices/system/cpu/cpufreq/policy12/scaling_cur_freq:1199936 /sys/devices/system/cpu/cpufreq/policy13/scaling_cur_freq:3600000 /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:1768728 /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:3600000 /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:3600000 /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:1785094 /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:1760354 /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:1629422 /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:1781190 /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:3600000 I can even see it with frac=12 but don't find it at frac=15.
Interesting. What CPU frequency scaling driver and governor are you using? Is HWP enabled? What is your kernel tick rate? I'll try to reproduce your results.
(In reply to Doug Smythies from comment #14) > Interesting. What CPU frequency scaling driver and governor are you using? > Is HWP enabled? What is your kernel tick rate? I'll try to reproduce your > results. scaling driver: intel_pstate scaling_governor: powersave HWP enabled. kernel tick rate, do you mean the CONFIG_HZ? CONFIG_HZ=1000 kernel source: 6.1.33 kernel base
Yes, I mean CONFIG_HZ when I talk about kernel tick rate. Okay, I can re-create your results. It just means the CPU needs to be awake on a jiffy boundary, I think. If I change away from fixed work packet to fixed timing workflow and ensure the CPU is awake for >= 1 millisecond, and not asleep for more then 20 Milliseconds, then the trigger condition will never occur. I.e.: COUNTER=0 while [ $COUNTER -lt 12 ]; do taskset -c $COUNTER /home/doug/c/consume 10.1 50.1 60 & sleep 0.0122 COUNTER=$((COUNTER+1)) done But we digress, as this isn't really relevant to the discussion.
Hi Doug, I think this can prove the assumption that the driver is working as expected that we will not go to fallback if the CPU is scheduling task at least once in every 20ms: taskset -c $COUNTER /home/doug/c/consume 1 49.9 60 & This will try to run the task for 0.2ms active in every ~20ms, which means the cpu will never sleep > 20ms, but we are still seeing fallback with this. Furthermore, I can still see the fallback with 'consume 4 200' So this still looks tricky to me, but anyway as you said, let's not digress too much, I am fully open and appreciate to any reasonable fix for the 2 issues I summarized in the mail thread.
Running a task for 0.2ms active in every ~20ms doesn't ensure the CPU will be awake on the tick boundary. There is only a 20% chance that the CPU will be active at the tick time, thus ensuring a frequency sample. Let's try to verify by experiment: Run similar to your command on CPU 5: $ taskset -c 5 /home/doug/c/consume 1 49.9 1000 Meanwhile sample its CPU frequency every 20 millseconds for 40000 samples. (Note: my utility program uses sleep here, so with sleep errors and not compensating for code time, it was actually sampling at 20.184 milliseconds per sample). Results, test1: 9703 samples, 24%, were of actual CPU frequency. 30297 samples, 76%, were the stale response. Results, test1: 9794 samples, 24%, were of actual CPU frequency. 30206 samples, 76%, were the stale response. Note that while my test server is nominally pretty idle, there will be still be occasional other things for CPU 5 to do. The other thing we expect to see in the data is the beat frequency pattern as the awake time for CPU 5 gradually moves with respect to the tick time, since its period is 1/49.9, or 20.04 mSec. We do see this pattern: doug@s19:~/idle/perf$ tail -25 freq-03.csv 39975, 4.100 39976, 4.100 39977, 4.100 39978, 4.100 39979, 0.800 39980, 0.800 39981, 0.800 39982, 0.800 39983, 0.800 39984, 0.800 39985, 4.100 39986, 4.100 39987, 4.100 39988, 4.100 39989, 4.100 39990, 4.100 39991, 4.100 39992, 4.100 39993, 4.100 39994, 4.100 39995, 4.100 39996, 4.100 39997, 4.100 39998, 4.100 39999, 4.100 The .2 millisecond work time every 5 milliseconds, for 'consume 4 200', is getting a little different as we get into the idle state regions where the tick may or may not turn off, depending on the timing. The beat frequency should be 0 for this case, but is only extremely low. I do not want to go down a rat hole here, so am not pursuing further. For the actual issue here: I am thinking more about my response in comment 11 above, and if it would be possible to return the policy min_freq as the stale response, and if it would make sense to do so.
(In reply to Doug Smythies from comment #18) > Running a task for 0.2ms active in every ~20ms doesn't ensure the CPU will > be awake on the tick boundary. There is only a 20% chance that the CPU will > be active at the tick time, thus ensuring a frequency sample. Let's try to > verify by experiment: > > Run similar to your command on CPU 5: > > $ taskset -c 5 /home/doug/c/consume 1 49.9 1000 > > Meanwhile sample its CPU frequency every 20 millseconds for 40000 samples. > (Note: my utility program uses sleep here, so with sleep errors and not > compensating for code time, it was actually sampling at 20.184 milliseconds > per sample). > > Results, test1: > 9703 samples, 24%, were of actual CPU frequency. > 30297 samples, 76%, were the stale response. > > Results, test1: > 9794 samples, 24%, were of actual CPU frequency. > 30206 samples, 76%, were the stale response. > Hi Doug, are you running another test script to reading the cur_freq in every 25 ms, can you share me how did you get these data?
Hi Doug, basically, I have difficult to understand what it really means with '(jiffies - last) > 20' in the upstream kernel. I have changes of this: 1. change the fallback tick from 20 to 1000. That is, '(jiffies - last) > 1000'. I suppose this could mean that "if the CPU didn't run any task during the last scheduler tick, let's treat it as IDLE and report fallback cur_freq". 2. write a script(get_cur_freq.sh) to check the CPU status every 25us. Then, I performed a boundary test today with runnig: 1. taskset -c 13 ./consume 0.1 125 300 //just a very tiny task every 8ms. 2. ./get_cur_freq.sh 13 80000 The number #13 CPU is the one idle most the time by default. Then, I get this: total samples:80000 idle samples:1136 active samples:78864 idle percentage:1.00 With this, the #13 CPU almost canNOT idle. If I change to 'consume 0.1 143 300' (a very tiny task every 7ms), the #13 CPU can never idle. OMG, this is really deep rat hole here. I am actually fine with your proposal to just return 0 for this fallback if we just want to 'fix' the original issue, as it is actually not a big deal at all for most of the users per my understanding.
(In reply to Doug Smythies from comment #18) > > For the actual issue here: I am thinking more about my response in comment > 11 above, and if it would be possible to return the policy min_freq as the > stale response, and if it would make sense to do so. Yes, something like this looks fine to me as well (on top of your change in comment #11 above) diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 20c5ca219afe..2c44b1d3f116 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -713,7 +713,7 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf) else if (cpufreq_driver->setpolicy && cpufreq_driver->get) ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu)); else - ret = sprintf(buf, "%u\n", policy->cur); + ret = sprintf(buf, "%u\n", policy->min);/* show min in idle */ return ret; }
(In reply to Keyon from comment #20) > I have difficult to understand what it really means with > '(jiffies - last) > 20' in the upstream kernel. For our 1000 Hz kernels the translation of that code is: if the current sample is over 20 jiffies old. or if the current sample is over 20 kernel ticks old. or if the current sample is over 20 milliseconds old. > > I have changes of this: > > 1. change the fallback tick from 20 to 1000. > That is, '(jiffies - last) > 1000'. > I suppose this could mean that "if the CPU didn't run any task during the > last scheduler tick, let's treat it as IDLE and report fallback cur_freq". No, that is incorrect. It would mean "if the current sample is over 1 second old". > 2. write a script(get_cur_freq.sh) to check the CPU status every 25us. Did you mean 25 Milliseconds? > > Then, I performed a boundary test today with runnig: > 1. taskset -c 13 ./consume 0.1 125 300 //just a very tiny task > every 8ms. > 2. ./get_cur_freq.sh 13 80000 > > The number #13 CPU is the one idle most the time by default. > > Then, I get this: > > total samples:80000 > idle samples:1136 > active samples:78864 > idle percentage:1.00 No, your terminology is incorrect. Idle time and if a sample is stale or not are different things. > With this, the #13 CPU almost canNOT idle. You are not measuring idle time here. If you want to measure idle time then look at, and calculate it from, the idle stats. Example (for CPU 5): $ grep . /sys/devices/system/cpu/cpu5/cpuidle/state*/time ; sleep 10 ; grep . /sys/devices/system/cpu/cpu5/cpuidle/state*/time /sys/devices/system/cpu/cpu5/cpuidle/state0/time:2017995 /sys/devices/system/cpu/cpu5/cpuidle/state1/time:268419062 /sys/devices/system/cpu/cpu5/cpuidle/state2/time:786807281 /sys/devices/system/cpu/cpu5/cpuidle/state3/time:247975570890 /sys/devices/system/cpu/cpu5/cpuidle/state0/time:2017995 /sys/devices/system/cpu/cpu5/cpuidle/state1/time:268421655 /sys/devices/system/cpu/cpu5/cpuidle/state2/time:786824863 /sys/devices/system/cpu/cpu5/cpuidle/state3/time:247984398971 Or 8.848 Seconds out of the 10, 88.48% idle.
(In reply to Keyon from comment #21) > Yes, something like this looks fine to me as well (on top of your change in > comment #11 above) > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > index 20c5ca219afe..2c44b1d3f116 100644 > --- a/drivers/cpufreq/cpufreq.c > +++ b/drivers/cpufreq/cpufreq.c > @@ -713,7 +713,7 @@ static ssize_t show_scaling_cur_freq(struct > cpufreq_policy *policy, char *buf) > else if (cpufreq_driver->setpolicy && cpufreq_driver->get) > ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu)); > else > - ret = sprintf(buf, "%u\n", policy->cur); > + ret = sprintf(buf, "%u\n", policy->min);/* show min in idle > */ > return ret; > } The issue with it is that the reply for /proc/cpuinfo is different than for /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq.
Created attachment 304531 [details] A possible patch for this issue I had some odd behaviour with HWP enabled, passive, schedutil, which I need to look into further.
Created attachment 304532 [details] Stale frequency results for various drivers and govnerors Sorry for the picture type file. Where: MHz is a stale freq from /proc/cpuinfo and cur_freq is a stale freq from /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq.
(In reply to Doug Smythies from comment #22) > (In reply to Keyon from comment #20) > > I have difficult to understand what it really means with > > '(jiffies - last) > 20' in the upstream kernel. > > For our 1000 Hz kernels the translation of that code is: > > if the current sample is over 20 jiffies old. > or > if the current sample is over 20 kernel ticks old. > or > if the current sample is over 20 milliseconds old. > > > > I have changes of this: > > > > 1. change the fallback tick from 20 to 1000. > > That is, '(jiffies - last) > 1000'. > > I suppose this could mean that "if the CPU didn't run any task during the > > last scheduler tick, let's treat it as IDLE and report fallback cur_freq". > > No, that is incorrect. > It would mean "if the current sample is over 1 second old". > > > 2. write a script(get_cur_freq.sh) to check the CPU status every 25us. > > Did you mean 25 Milliseconds? > > > > > Then, I performed a boundary test today with runnig: > > 1. taskset -c 13 ./consume 0.1 125 300 //just a very tiny task > > every 8ms. > > 2. ./get_cur_freq.sh 13 80000 > > > > The number #13 CPU is the one idle most the time by default. > > > > Then, I get this: > > > > total samples:80000 > > idle samples:1136 > > active samples:78864 > > idle percentage:1.00 > > No, your terminology is incorrect. Idle time and if a sample is stale or not > are different things. > > > With this, the #13 CPU almost canNOT idle. > > You are not measuring idle time here. If you want to measure idle time then > look at, and calculate it from, the idle stats. Example (for CPU 5): > > $ grep . /sys/devices/system/cpu/cpu5/cpuidle/state*/time ; sleep 10 ; grep > . /sys/devices/system/cpu/cpu5/cpuidle/state*/time > /sys/devices/system/cpu/cpu5/cpuidle/state0/time:2017995 > /sys/devices/system/cpu/cpu5/cpuidle/state1/time:268419062 > /sys/devices/system/cpu/cpu5/cpuidle/state2/time:786807281 > /sys/devices/system/cpu/cpu5/cpuidle/state3/time:247975570890 > /sys/devices/system/cpu/cpu5/cpuidle/state0/time:2017995 > /sys/devices/system/cpu/cpu5/cpuidle/state1/time:268421655 > /sys/devices/system/cpu/cpu5/cpuidle/state2/time:786824863 > /sys/devices/system/cpu/cpu5/cpuidle/state3/time:247984398971 > > Or 8.848 Seconds out of the 10, 88.48% idle. I think all these comments are good, thank you.
(In reply to Doug Smythies from comment #24) > Created attachment 304531 [details] > A possible patch for this issue This looks good to me, thanks. > > I had some odd behaviour with HWP enabled, passive, schedutil, which I need > to look into further. could that be related with 'base_frequency'? I haven't tried passive mode though.
(In reply to Keyon from comment #27) > (In reply to Doug Smythies from comment #24) > > Created attachment 304531 [details] > > > > I had some odd behaviour with HWP enabled, passive, schedutil, which I need > > to look into further. > > could that be related with 'base_frequency'? I haven't tried passive mode > though. I do not think it is related to 'base_frequency', but please go ahead and provide any evidence you have that it is. As I mentioned, there are odd things going on with the HWP enabled, passive, schedutil configuration that I need to get to bottom of. For example, right now I have stale frequency for CPU 0 of 3.9 GHz and for CPU 11 of 1.2 GHz and 0.9 GHz for all the others, even though min is set to 1.0 GHz.
Thank you Doug. So I will rely on you to proceeding the fix of the issue, let me know if anything I can help. :-)
The issue with HWP enabled, passive, schedutil seems to be in the function cpufreq_quick_get in the file drivers/cpufreq/cpufreq.c. It seems that policy->cur is never updated under these conditions. HWP disabled, passive, schedutil works fine. It is a bit of a different issue, but I want to fix it at the same time with another patch.
Thank you Doug. The acpi-cpufreq driver doesn't work on my platform, but looks HWP enabled with passive/schedutil works and I get reasonable policy->cur here, even without any fixes we are talking here.
Okay, then we need to figure out the difference. I get a fine response from the function when the frequency is not stale (i.e. the other return path from the function, I think). But policy->cur is never updated. It is used for the stale return path. Example (Unmodified kernel 6.5-rc1): scaling_min_freq for all CPUs = 0.8 GHz: $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:800293 /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:799993 Okay, all the stale values are good. Note: one can tell because the numbers are exact (I also did a test compile with flagged return values). Now set the scaling_min_freq for all CPUs = 2.5 GHz: $ echo 2500000 | sudo tee /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq And check: $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:2526076 /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:800000 /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:2500047 Huh? wrong. Now switch to the ondemand governor and then switch back and check again: $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:4800127 /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:2500000 /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:4800262 Now it is Okay. There is other weirdness also. I am now trying to test an additional patch on top of what I proposed above. I'll post it tomorrow. You might have to disable HWP in your BIOS to be able to boot to the acpi-cpufreq CPU frequency scaling driver. While it should be a boot time decision, some BIOS' (ASUS) force it even if "auto" is selected in the BIOS.
I am testing this addition to the proposed patch from comment 24: $ git diff diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c index 6230fd7162ee..44cc96864d94 100644 --- a/arch/x86/kernel/cpu/aperfmperf.c +++ b/arch/x86/kernel/cpu/aperfmperf.c @@ -418,9 +418,10 @@ unsigned int arch_freq_get_on_cpu(int cpu) unsigned long last; u64 acnt, mcnt; - if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF)) - goto fallback; - + if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF)){ + freq = cpufreq_quick_get(cpu); + return freq ? freq : cpufreq_quick_get_min(cpu); + } do { seq = raw_read_seqcount_begin(&s->seq); last = s->last_update; @@ -433,13 +434,9 @@ unsigned int arch_freq_get_on_cpu(int cpu) * which covers idle and NOHZ full CPUs. */ if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE) - goto fallback; + return cpufreq_quick_get_min(cpu); return div64_u64((cpu_khz * acnt), mcnt); - -fallback: - freq = cpufreq_quick_get(cpu); - return freq ? freq : cpufreq_quick_get_min(cpu); } static int __init bp_init_aperfmperf(void) This was after thinking about, and trying, just changing the return value in drivers/cpufreq/cpufreq.c from policy->cur to policy->min. The concern with it being that I don't know if that would break the expected response for some of the other callers of that function. For example drivers/devfreq/tegra30-devfreq.c and drivers/thermal/cpufreq_cooling.c and drivers/powercap/dtpm_cpu.c
(In reply to Doug Smythies from comment #32) > Okay, then we need to figure out the difference. I get a fine response from > the function when the frequency is not stale (i.e. the other return path > from the function, I think). But policy->cur is never updated. It is used > for the stale return path. > > Example (Unmodified kernel 6.5-rc1): > > scaling_min_freq for all CPUs = 0.8 GHz: > > $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq > /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:800293 > /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:799993 > > Okay, all the stale values are good. Note: one can tell because the numbers > are exact (I also did a test compile with flagged return values). > > Now set the scaling_min_freq for all CPUs = 2.5 GHz: > > $ echo 2500000 | sudo tee > /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq > > And check: > > $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq > /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:2526076 > /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:800000 > /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:2500047 > > Huh? wrong. > > Now switch to the ondemand governor and then switch back and check again: > > $ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq > /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy10/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy11/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq:4800127 > /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy8/scaling_cur_freq:2500000 > /sys/devices/system/cpu/cpufreq/policy9/scaling_cur_freq:4800262 > > Now it is Okay. Ah, these are interesting. I must admit I haven't played a lot about different governors and usually just use the default driver and governor. > > There is other weirdness also. I am now trying to test an additional patch > on top of what I proposed above. I'll post it tomorrow. > > You might have to disable HWP in your BIOS to be able to boot to the > acpi-cpufreq CPU frequency scaling driver. While it should be a boot time > decision, some BIOS' (ASUS) force it even if "auto" is selected in the BIOS. Thanks, that's good point, I recall someone did tell me it's not doable to disable HWP on our platform. I just used 'intel_pstate=disable' on kernel boot cmdline.
(In reply to Doug Smythies from comment #33) > I am testing this addition to the proposed patch from comment 24: > > $ git diff > diff --git a/arch/x86/kernel/cpu/aperfmperf.c > b/arch/x86/kernel/cpu/aperfmperf.c > index 6230fd7162ee..44cc96864d94 100644 > --- a/arch/x86/kernel/cpu/aperfmperf.c > +++ b/arch/x86/kernel/cpu/aperfmperf.c > @@ -418,9 +418,10 @@ unsigned int arch_freq_get_on_cpu(int cpu) > unsigned long last; > u64 acnt, mcnt; > > - if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF)) > - goto fallback; > - > + if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF)){ > + freq = cpufreq_quick_get(cpu); > + return freq ? freq : cpufreq_quick_get_min(cpu); > + } > do { > seq = raw_read_seqcount_begin(&s->seq); > last = s->last_update; > @@ -433,13 +434,9 @@ unsigned int arch_freq_get_on_cpu(int cpu) > * which covers idle and NOHZ full CPUs. > */ > if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE) > - goto fallback; > + return cpufreq_quick_get_min(cpu); > > return div64_u64((cpu_khz * acnt), mcnt); > - > -fallback: > - freq = cpufreq_quick_get(cpu); > - return freq ? freq : cpufreq_quick_get_min(cpu); > } > > static int __init bp_init_aperfmperf(void) > > This was after thinking about, and trying, just changing the return value in > drivers/cpufreq/cpufreq.c from policy->cur to policy->min. The concern with > it being that I don't know if that would break the expected response for > some of the other callers of that function. For example > drivers/devfreq/tegra30-devfreq.c and drivers/thermal/cpufreq_cooling.c and > drivers/powercap/dtpm_cpu.c Not quite follow here. Returning min in the 2nd 'goto' makes sense with avoiding the superfluous calling of cpufreq_quick_get() (as it looks that it returns always 0 in this condition). But those calls to cpufreq_quick_get() from tegra30-devfreq.c, cpufreq_cooling.c and dtpm_cpu.c won't be impacted anyway as we only added a new cpufreq_quick_get_min() helper and kept the cpufreq_quick_get() intact, do I miss something here?
(In reply to Keyon from comment #35) > (In reply to Doug Smythies from comment #33) ... > > This was after thinking about, and trying, just changing the return value > in > > drivers/cpufreq/cpufreq.c from policy->cur to policy->min. The concern with > > it being that I don't know if that would break the expected response for > > some of the other callers of that function. For example > > drivers/devfreq/tegra30-devfreq.c and drivers/thermal/cpufreq_cooling.c and > > drivers/powercap/dtpm_cpu.c > > Not quite follow here. Returning min in the 2nd 'goto' makes sense with > avoiding the superfluous calling of cpufreq_quick_get() (as it looks that it > returns always 0 in this condition). But those calls to cpufreq_quick_get() > from tegra30-devfreq.c, cpufreq_cooling.c and dtpm_cpu.c won't be impacted > anyway as we only added a new cpufreq_quick_get_min() helper and kept the > cpufreq_quick_get() intact, do I miss something here? I was talking about another option: $ git diff diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 5ea08a576131..487ac547e19e 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -1768,7 +1768,7 @@ unsigned int cpufreq_quick_get(unsigned int cpu) policy = cpufreq_cpu_get(cpu); if (policy) { - ret_freq = policy->cur; + ret_freq = policy->min; cpufreq_cpu_put(policy); } But there are other callers of cpufreq_quick_get that might be affected so I went to the code listed and left cpufreq_quick_get unchanged. Due to limited available time, I have not yet finished testing.
(In reply to Doug Smythies from comment #36) > (In reply to Keyon from comment #35) > > (In reply to Doug Smythies from comment #33) > ... > > > This was after thinking about, and trying, just changing the return value > > in > > > drivers/cpufreq/cpufreq.c from policy->cur to policy->min. The concern > with > > > it being that I don't know if that would break the expected response for > > > some of the other callers of that function. For example > > > drivers/devfreq/tegra30-devfreq.c and drivers/thermal/cpufreq_cooling.c > and > > > drivers/powercap/dtpm_cpu.c > > > > Not quite follow here. Returning min in the 2nd 'goto' makes sense with > > avoiding the superfluous calling of cpufreq_quick_get() (as it looks that > it > > returns always 0 in this condition). But those calls to cpufreq_quick_get() > > from tegra30-devfreq.c, cpufreq_cooling.c and dtpm_cpu.c won't be impacted > > anyway as we only added a new cpufreq_quick_get_min() helper and kept the > > cpufreq_quick_get() intact, do I miss something here? > > I was talking about another option: > > $ git diff > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > index 5ea08a576131..487ac547e19e 100644 > --- a/drivers/cpufreq/cpufreq.c > +++ b/drivers/cpufreq/cpufreq.c > @@ -1768,7 +1768,7 @@ unsigned int cpufreq_quick_get(unsigned int cpu) > > policy = cpufreq_cpu_get(cpu); > if (policy) { > - ret_freq = policy->cur; > + ret_freq = policy->min; > cpufreq_cpu_put(policy); > } > > But there are other callers of cpufreq_quick_get that might be affected so I Gotcha, fair enough. > went to the code listed and left cpufreq_quick_get unchanged. Due to limited > available time, I have not yet finished testing. Take your time Doug. I will keep tuned for your updates.
Created attachment 304694 [details] Stale frequency results for various drivers and governors For a patched and unpatched kernel 6.5-RC3. There might be push back for the changes to the performance governor responses. There is also an ongoing thread about "cpufreq: schedutil: next_freq need update when cpufreq_limits changed" on the linux-pm list that might change the unpatched response for the schedutil governor.
(In reply to Doug Smythies from comment #38) > Created attachment 304694 [details] > Stale frequency results for various drivers and governors > > For a patched and unpatched kernel 6.5-RC3. There might be push back for the > changes to the performance governor responses. There is also an ongoing > thread about "cpufreq: schedutil: next_freq need update when cpufreq_limits > changed" on the linux-pm list that might change the unpatched response for > the schedutil governor. Thank you for the notice Doug.
Created attachment 304734 [details] intel_cpufreq driver, hwp disabled, schedutil governor, odd pstate requests From a trace. Note the odd pstate requests. 48 for no load and 28 to 30 for full load, which seems odd. policy->cur often shows 48 for the idle CPU for long periods (observe the duration numbers), which is what has been requested not what has been granted.
Created attachment 304735 [details] Stale frequency results for various drivers and governors - policy cur method Rafael suggested this instead: doug@s19:~/kernel/linux$ git diff diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 50bbc969ffe5..4e91169a83f5 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -1768,7 +1768,7 @@ unsigned int cpufreq_quick_get(unsigned int cpu) policy = cpufreq_cpu_get(cpu); if (policy) { - ret_freq = policy->cur; + ret_freq = policy->cur ? policy->cur : policy->min; cpufreq_cpu_put(policy); } The results are attached. policy->cur can be rather unrelated to what the actual frequency has ever been. The frequency is stale anyhow. I still like my suggested patch better.
(In reply to Doug Smythies from comment #40) > Created attachment 304734 [details] > intel_cpufreq driver, hwp disabled, schedutil governor, odd pstate requests > > From a trace. Note the odd pstate requests. 48 for no load and 28 to 30 for > full load, which seems odd. policy->cur often shows 48 for the idle CPU for > long periods (observe the duration numbers), which is what has been > requested not what has been granted. This looks very informative but I don't have enough knowledge to understand it yet, I guess most people are actually not caring hwp disabled case that much.
(In reply to Doug Smythies from comment #41) > Created attachment 304735 [details] > Stale frequency results for various drivers and governors - policy cur method > > Rafael suggested this instead: > > doug@s19:~/kernel/linux$ git diff > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > index 50bbc969ffe5..4e91169a83f5 100644 > --- a/drivers/cpufreq/cpufreq.c > +++ b/drivers/cpufreq/cpufreq.c > @@ -1768,7 +1768,7 @@ unsigned int cpufreq_quick_get(unsigned int cpu) > > policy = cpufreq_cpu_get(cpu); > if (policy) { > - ret_freq = policy->cur; > + ret_freq = policy->cur ? policy->cur : policy->min; > cpufreq_cpu_put(policy); > } > > The results are attached. policy->cur can be rather unrelated to what the > actual frequency has ever been. The frequency is stale anyhow. I still like > my suggested patch better. Doug, IMHO it is not a big deal, simple making life easier. Returning MIN was your original idea IIRC, if that is more acceptable by maintainers, I would suggest to go in that way.
My original idea was this: policy = cpufreq_cpu_get(cpu); if (policy) { - ret_freq = policy->cur; + ret_freq = policy->min; cpufreq_cpu_put(policy); } anyway, I'll submit the patch as per comment 41.
(In reply to Doug Smythies from comment #44) > My original idea was this: > > policy = cpufreq_cpu_get(cpu); > if (policy) { > - ret_freq = policy->cur; > + ret_freq = policy->min; > cpufreq_cpu_put(policy); > } > > anyway, I'll submit the patch as per comment 41. Thank you Doug.
Created attachment 304916 [details] Yet another patch to fix this. I actually never liked the previously posted patch, even though I said I would submit it. This version has been submitted and addresses the issue at its source.
(In reply to Doug Smythies from comment #46) > Created attachment 304916 [details] > Yet another patch to fix this. > > I actually never liked the previously posted patch, even though I said I > would submit it. This version has been submitted and addresses the issue at > its source. Checked and verified this works for me as well. Thank you so much to make it happen Doug, appreciate!
fix launched to linux-pm kernel and we will see it in 6.6 kernel.
https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/patch/?id=d51847acb018d83186e4af67bc93f9a00a8644f7
The patch has launched in mainline kernel here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d51847acb018d83186e4af67bc93f9a00a8644f7
it is verified work, Thank you so much Doug! Now closing it.