Bug 19702 - i5-450M CPU gets stuck in low/lowest state
Status: CLOSED DOCUMENTED
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Processor
Hardware: All Linux
Importance: P1 high
Assignee: Thomas Renninger
Duplicates: 17001

Reported: 2010-10-04 06:41 UTC by Heinz Diehl
Modified: 2016-06-24 19:25 UTC
CC List: 15 users

Regression: No

Attachments
Output of acpidump (393.47 KB, text/plain)
2010-10-04 06:41 UTC, Heinz Diehl
dmesg output (84.15 KB, text/plain)
2010-10-04 06:43 UTC, Heinz Diehl
Patch from Thomas R. (2.49 KB, text/plain)
2010-10-04 06:44 UTC, Heinz Diehl
Output of the commands asked by Thomas R. (1.12 KB, application/x-bzip)
2010-10-08 19:06 UTC, Heinz Diehl
Simple CPU tester and reporter. (4.68 KB, text/plain)
2010-10-14 05:58 UTC, mobusby
Patch introducing boot param to disable HW_COORD type (1.09 KB, patch)
2010-10-19 03:11 UTC, Thomas Renninger
Provide /sys/devices/system/cpu/cpu*/cpufreq/shared_type (2.09 KB, patch)
2010-10-19 03:51 UTC, Thomas Renninger
CPUFREQ: Fix HW_ALL core dependencies (1.58 KB, patch)
2010-10-28 15:13 UTC, Thomas Renninger
Patch to workaround the possible HW defect (1.63 KB, patch)
2010-10-29 20:19 UTC, Thomas Renninger
intel_idle: Do not load if user overrides idle function via idle= boot param (484 bytes, patch)
2010-11-02 09:03 UTC, Thomas Renninger

Description Heinz Diehl 2010-10-04 06:41:28 UTC
Created attachment 32502
Output of acpidump

Hi,

I own a brand new ASUS U45JC laptop with an Intel i5-450M CPU built in.
During the first compile on it, I noticed that the CPU reached max. 1.33 GHz
using governor "ondemand". Kicking the CPU with a kernel compile using
"make -j4" didn't help either, max. frequency was nailed at 1.33 GHz.
Both the latest stable 2.6.35.7 and release candidate 2.6.36-rc6-git2 are
affected.

After 2 days investigating (I'm not primarily a kernel programmer/developer),
I came across the solution, and there was already a patch available from
Thomas Renninger @suse, which I have included in this mail. I blindly
guess now that a lot of people own a notebook with an Intel i5 inside, and
maybe (could) have the same problem.

Any chance to get this included in the kernel, or is it just me who
encounters this problem? The patch applies cleanly against the latest rc,
and I can confirm that it also fixes the above mentioned problem in
2.6.35.7 (I backported it and did try).

Attached are the output of acpidump and dmesg, and the patch from Thomas Renninger which fixes the problem for me. BIOS is the latest version 207 from Asus.

Thanks,
Heinz.
Comment 1 Heinz Diehl 2010-10-04 06:43:17 UTC
Created attachment 32512
dmesg output
Comment 2 Heinz Diehl 2010-10-04 06:44:22 UTC
Created attachment 32522
Patch from Thomas R.
Comment 3 Thomas Renninger 2010-10-08 12:59:38 UTC
Cpufreq info is embedded into dynamically loaded ACPI tables which you need to dump manually. Please run:
acpidump --addr 0xAADA3918 --length 0x3FB >/tmp/CPU0IST
acpidump --addr 0xAADA2A98  --length 0x303 >/tmp/APIST
acpidump --addr 0xAADA1018 --length 0x8A9 >/tmp/CPU0CST
acpidump --addr 0xAADA0D98 --length 0x119 >/tmp/APCST
and attach (zipped?) output files. Thanks.
Comment 4 Heinz Diehl 2010-10-08 19:04:00 UTC
Hi,
the output of these commands is attached here.

Thanks,
-heinz
Comment 5 Heinz Diehl 2010-10-08 19:06:03 UTC
Created attachment 32902
Output of the commands asked by Thomas R.
Comment 6 Thomas Renninger 2010-10-12 11:19:47 UTC
The BIOS latency values look sane, so I wondered why my patch helps.
I was blind: this is about reducing up_threshold, and it is a
duplicate of bug #17001.
Can you try lowering up_threshold to 53 or below, as mentioned there, and then run
cat /dev/zero >/dev/null &
multiple times to load the CPUs?

Looks like something broke some time before 2.6.35.7.
Do you know of a kernel version that did not show this?
Comment 7 mobusby 2010-10-14 05:56:36 UTC
I am having a similar problem, but changing the CPU governor or threshold doesn't help.  I wrote a little script to test my CPU, and it appears that the BIOS is limiting the CPU at high workloads, but not at low workloads.  After some time at full BIOS limiting, ACPI throttling begins.  The temperature never gets high enough for throttling.

I am using the stock Kernel from Ubuntu 10.10.  This behavior was not present in Ubuntu 10.04, and is not present when running Windows.
Comment 8 mobusby 2010-10-14 05:58:41 UTC
Created attachment 33562
Simple CPU tester and reporter.

The results from running the test on my computer are attached to the bottom of the script in comments.
Comment 9 Thomas Renninger 2010-10-14 09:11:54 UTC
Mobusby: You have a different problem, which is BIOS related.
Best check for the latest BIOS. The BIOS limits the frequency on purpose; you have to find out why...
Also check your BIOS settings (ideally after an upgrade).
Some things to check:
Temperature might be related, AC/battery, a dirty fan slot, or does it only happen after suspend (disk/ram)?
If nothing helps, processor.ignore_ppc=1 will ignore the limit from BIOS.
If you have tried all that, then please open a new bug, attach dmesg and acpidump output, and assign it to me.
Comment 10 Thomas Renninger 2010-10-14 09:22:48 UTC
Mobusby: Here is a similar bug (but the root cause in BIOS is probably different):
https://bugzilla.kernel.org/show_bug.cgi?id=16362
Comment 11 Heinz Diehl 2010-10-14 15:20:38 UTC
Hi Thomas,
lowering up_threshold helps. Without the patch, the default after booting is 95; with your patch it is set to 40. I haven't installed many different kernels yet, because the machine is brand new, but I noticed to my surprise that 2.6.36-rc7-git3 seems to have cured the problem. There's no cpufreq related patch in there, as far as I could see, so I wonder what the f*ck is going on here :-)
Comment 12 Thomas Renninger 2010-10-15 10:07:06 UTC
Thanks!
So it looks like we have a regression in the 2.6.35 kernel which got fixed between 2.6.36-rc6-git2 and 2.6.36-rc7-git3.
It affects idle/busy/io CPU accounting in a way that the cpufreq ondemand governor does not switch up the frequency (unless you work around it by dramatically decreasing the up_threshold tunable).

The fix does not seem to be in any cpufreq related code (according to Heinz; I didn't double check, but I haven't seen anything on the cpufreq list lately).

Rafael/Ingo/Peter: Do you have an idea which patch could have solved this issue?
This should probably go to .35 stable...
Comment 13 Thomas Renninger 2010-10-15 10:11:52 UTC
*** Bug 17001 has been marked as a duplicate of this bug. ***
Comment 14 Thomas Renninger 2010-10-15 10:15:06 UTC
In bug #17001 it's stated:
> As far as I can tell, the problem exists at least since kernel version 2.6.30
But this would be something machine specific then; someone should have reported it by now.
It could also be that the above statement is wrong, that something went wrong when testing 2.6.30, and that this is, as mentioned, a regression introduced in 2.6.35 and fixed somewhere between 2.6.36-rc6-git2 and 2.6.36-rc7-git3...
Comment 15 vyncere 2010-10-15 17:52:16 UTC
Hi all,

I have an Intel Core i5 520 M (Thinkpad T410). The problem became visible when I switched from Ubuntu 10.04 (2.6.32.15) to Ubuntu 10.10 (2.6.35.4).

* With the stock kernel of Ubuntu 10.04 : (Ondemand WORKS)
cpufreq-info said for all virtual CPUs :

hardware limits: 1.20 GHz - 2.40 GHz
current policy: frequency should be within 1.20 GHz and 2.40 GHz.
The governor "ondemand" may decide which speed to use within this range.
current CPU frequency is 1.20 GHz.
cpufreq stats: 2.40 GHz:3.26%, 2.40 GHz:0.02%, 2.27 GHz:0.05%, 2.13 GHz:0.05%, 2.00 GHz:0.03%, 1.87 GHz:0.02%, 1.73 GHz:0.03%, 1.60 GHz:0.01%, 1.47 GHz:0.03%, 1.33 GHz:0.03%, 1.20 GHz:96.48%  (235)

* With the stock kernel of Ubuntu 10.10 : (Ondemand FAILS)
cpufreq-info said for all virtual CPUs :
hardware limits: 1.20 GHz - 2.40 GHz
current policy: frequency should be within 1.20 GHz and 1.20 GHz.
current CPU frequency is 1.20 GHz.
cpufreq stats: 2.40 GHz:0.00%, 2.40 GHz:0.00%, 2.27 GHz:0.00%, 2.13 GHz:0.00%, 2.00 GHz:0.00%, 1.87 GHz:0.00%, 1.73 GHz:0.00%, 1.60 GHz:0.00%, 1.47 GHz:0.00%, 1.33 GHz:0.00%, 1.20 GHz:100.00%

I took my custom reference configuration (.config) for my laptop, which works perfectly with the 2.6.32.15 Ubuntu kernel, and tested it with several kernel versions from kernel.org, in order to rule out the influence of Ubuntu patches.

* Results :

2.6.32.15 : Ondemand scaling OK
2.6.32.16 : Ondemand scaling OK
2.6.32.17 : Ondemand scaling OK
2.6.32.18 : Ondemand scaling OK
2.6.32.19 : Ondemand scaling OK
2.6.32.20 : Ondemand scaling OK
2.6.32.21 : Ondemand scaling OK
2.6.32.22 : Ondemand scaling OK
2.6.32.23 : Ondemand scaling OK
2.6.32.24 : Ondemand scaling OK

So, at this stage, can we conclude that the problem appeared after 2.6.32? Not so sure! As pointed out earlier in this thread, this problem may well be machine specific... In any case, this is the second bug report affecting a Core i5 CPU. On Ubuntu Launchpad, someone with a Core 2 Duo T7200 CPU is affected by this problem too (2.6.35.4 from Ubuntu 10.10).

I have tested too with a clean 2.6.35.7 with and without the patch from Thomas R. For both, the result is the same : Ondemand scaling KO.

I will tell you what happens with >= 2.6.33 kernels.
Comment 16 Thomas Renninger 2010-10-15 19:24:59 UTC
It could have been related to cpuidle and the new intel_idle driver, but I see no recent fixes there either.
If this is true:
fixed between 2.6.36-rc6-git2 and 2.6.36-rc7-git3
it shouldn't be that hard to bisect on those rather stable versions.
Be aware that you have to swap git bisect "good" and "bad" if you search for a fix and not for a regression/bug.

That would be:
git bisect start <bad>           <good>
git bisect start 2.6.36-rc7-git3 2.6.36-rc6-git2
If a version works well, mark it with git bisect bad;
if it shows the bug, mark it with git bisect good.

gitX is not tagged in git though, no idea how one can find out the git id of these subcommits/merges?

Heinz: How sure are you that it got fixed in 2.6.36-rc7-git3?
Comment 17 Thomas Renninger 2010-10-15 19:36:59 UTC
If someone wants to test whether this has to do with intel_idle driver you should be able to check the used driver (new intel_idle vs acpi_idle):
cat /sys/devices/system/cpu/cpuidle/current_driver

The new cpupowerutils (formerly cpufrequtils) show this via:
cpuidle-info

If intel_idle is used the boot param:
intel_idle.max_cstate=0
should make the machine fall back to acpi_idle...
Just an idea..., idle driver sounds possibly related to wrong idle accounting... and this is an easy check compared to compiling/bisecting...
Comment 18 vyncere 2010-10-16 08:56:37 UTC
I just did some tests with >= 2.6.33 Kernels.

* Results :

2.6.33    : KERNEL_PANIC (Ouch !!!)
2.6.33.1  : Ondemand scaling KO (froze my laptop after a few minutes... I did not investigate further...)

In any case, just before the freeze, I had enough time to check cpufreq-info and the cpuidle state.

* cpufreq-info printed the same "hardware limits" and "cpufreq stats" that I reported earlier in this thread, when the Ondemand governor failed to scale.

So the regression has most likely been present since 2.6.33.

* The current driver used for cpuidle was "acpi_idle".

So, if the intel_idle driver is not to blame, we can check the differences between the 2.6.32.24 and 2.6.33[.1] kernel trees, which may have caused this regression.

* Comparing the two "drivers/cpufreq" directories, there are some changes in:
- cpufreq.c (add bios limit reading and release the rwsem around governor)
- cpufreq_conservative.c (some changes, but I never use this governor)
- cpufreq_ondemand.c (very few: add a condition to read the new min policy)
- freq_table.c (some function name refactoring)

I'm not a kernel hacker and I do not know (not yet) how the cpufreq API works.

I hope this information will help you.
Comment 19 vyncere 2010-10-16 09:19:07 UTC
Hi all,

I just tested my fresh and unstable 2.6.33.1 kernel with the working cpufreq driver from 2.6.32.24.

* Result : Ondemand scaling KO

So the regression may come from somewhere else... !
Comment 20 Thomas Renninger 2010-10-19 02:50:10 UTC
I possibly know the problem.
What does:
cat /sys/devices/system/cpu/cpu0/cpufreq/{affected_cpus,related_cpus}
say?
I expect (and from ACPI table info it's very likely this is the case for Heinz's system) that ACPI_PDC_SMP_P_HWCOORD is used.
Compare with 8.4.4.5 _PSD (P-State Dependency) (ACPI spec):
CoordType: DWordConst
The type of coordination that exists (hardware) or is required (software) as a result of the underlying hardware dependency. Could be either 0xFC (SW_ALL), 0xFD (SW_ANY) or 0xFE (HW_ALL) indicating whether OSPM is responsible for coordinating the P-state transitions among processors with dependencies (and needs to initiate the transition on all or any processor in the domain) or whether the hardware will perform this coordination.

Heinz's BIOS evaluates _PDC (OS capabilities) and exports HW_ALL if the kernel tells the BIOS it can handle it (and it does).

I remember another machine (Jean Delvare's) where frequency switching was totally messed up in that case.
So this may not be a real kernel bug. I can provide a patch so that the HW_COORD OS capability is not set; that should help Heinz, and we can verify whether that's it.

It's hard to check which coordination type is used at runtime.
I once looked into it a bit, and if affected_cpus and related_cpus are not the same it must be HW_ALL, iirc.
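
The runtime check described above could be scripted roughly as follows. This is only a sketch: the function names and the sysfs_root parameter are invented for illustration, and the idea that differing files point to hardware coordination is the unverified recollection from this comment, not a documented rule.

```python
from pathlib import Path

# _PSD CoordType values as quoted from the ACPI spec in this comment.
COORD_TYPES = {0xFC: "SW_ALL", 0xFD: "SW_ANY", 0xFE: "HW_ALL"}


def coord_type_name(value):
    """Return the symbolic name of a _PSD CoordType value."""
    return COORD_TYPES.get(value, "unknown (0x%02X)" % value)


def read_cpu_list(path):
    """Parse a space-separated CPU list such as '0 1 2 3'."""
    return sorted(int(tok) for tok in Path(path).read_text().split())


def lists_differ(cpu=0, sysfs_root="/sys"):
    """True if affected_cpus and related_cpus differ for the given CPU."""
    base = Path(sysfs_root) / "devices/system/cpu" / ("cpu%d" % cpu) / "cpufreq"
    affected = read_cpu_list(base / "affected_cpus")
    related = read_cpu_list(base / "related_cpus")
    return affected != related


if __name__ == "__main__":
    print(coord_type_name(0xFE))
    try:
        # Needs a loaded cpufreq driver; the files may be absent otherwise.
        print("lists differ on cpu0:", lists_differ(0))
    except OSError as err:
        print("cpufreq sysfs files not readable:", err)
```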
Comment 21 Thomas Renninger 2010-10-19 03:11:56 UTC
Created attachment 34092
Patch introducing boot param to disable HW_COORD type

Please try:
processor.disable_hw_coord=1
boot param with this patch.
Hm, _PDC is called rather early these days. The param won't be considered for the first _PDC call in arch/x86/kernel/acpi/boot.c.
But that should not matter, and I hope it gets used later; otherwise it must be an early param...
I'll try to come up with another patch to export the coordination type to be able to check this...
Comment 22 Thomas Renninger 2010-10-19 03:51:38 UTC
Created attachment 34102
Provide /sys/devices/system/cpu/cpu*/cpufreq/shared_type

This should be:
CPUFREQ_SHARED_TYPE_ALL  (2)
if you added the boot param provided in the previous patch
and
CPUFREQ_SHARED_TYPE_HW   (1)
by default.
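
For reading the numbers this file prints, a small decoder may help. It is a sketch only: values 1 and 2 are the ones named in this comment, while 0 and 3 are taken from the kernel's include/linux/cpufreq.h as recalled and should be double-checked; the function name is made up.

```python
# 1 and 2 are given in the comment above; 0 and 3 are recalled from
# include/linux/cpufreq.h and are an assumption here.
SHARED_TYPES = {
    0: "CPUFREQ_SHARED_TYPE_NONE",
    1: "CPUFREQ_SHARED_TYPE_HW",
    2: "CPUFREQ_SHARED_TYPE_ALL",
    3: "CPUFREQ_SHARED_TYPE_ANY",
}


def decode_shared_type(text):
    """Map the content of a shared_type file to its symbolic name."""
    value = int(text.strip())
    return SHARED_TYPES.get(value, "unknown (%d)" % value)


if __name__ == "__main__":
    for raw in ("1\n", "2\n"):
        print(raw.strip(), "->", decode_shared_type(raw))
```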
Comment 23 Peter Ganzhorn 2010-10-19 07:34:20 UTC
Concerning bug 17001, which I filed, I have to report that the issue has not magically disappeared for me with the 2.6.36-rc kernels.
I finally managed to compile 2.6.36-rc7 and -rc8 (had an error building the i915 driver before) and Core2 Duo T7300 CPU still is stuck at the lowest frequency without lowering up_threshold to 53 or lower.

Would it help if I provided you with some ACPI information? What do you need?
Comment 24 Thomas Renninger 2010-10-19 09:40:32 UTC
> Would it help if I provided you with some ACPI information? What do you need?
acpidump output for now. Need to look up the addresses of dynamic tables.
Possibly you can look it up yourself.
Do:
acpixtract acpidump
iasl -d *.dat
grep -i load *.dsl
You may see something like that then:
SSDT.dsl:                    Load (IST0, HI0)
SSDT.dsl:                    Load (CST0, HC0)
SSDT.dsl:                Load (CST1, HC1)
SSDT.dsl:                Load (IST1, HI1)
The first argument names the region (address/length) of the dynamically loaded table.
On this system there is:
OperationRegion (IST0, SystemMemory, DerefOf (Index (SSDT, One)), DerefOf (Index (SSDT, 0x02)))
and
Name (SSDT, Package (0x0C)
        {
            "CPU0IST ",
            0xAADA3918,
            0x000003FB,
            "APIST   ",
            0xAADA2A98,
            0x00000303,
            "CPU0CST ",
            0xAADA1018,
            0x000008A9,
            "APCST   ",
            0xAADA0D98,
            0x00000119
        })

These are the names/addresses/lengths of the tables you need to extract manually with
acpidump --addr 0xAADA3918 --length 0x000003FB >CPU0IST.dat
acpidump --addr 0xAADA2A98 --length 0x00000303 >APIST.dat

They can possibly already be found here:
/sys/firmware/acpi/tables/
not sure.

But you could also just give my two patches a try and show us:
cat /sys/devices/system/cpu/cpu*/cpufreq/shared_type
If it shows CPUFREQ_SHARED_TYPE_ALL  (2),
it's worth trying the boot param mentioned in comment #21.
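
Since the package lists name, address and length in groups of three, the manual acpidump calls can be generated instead of typed. The sketch below assumes the acpidump --addr/--length options used in this thread; the function name and the .dat file naming are just illustrative choices.

```python
def acpidump_commands(package):
    """Turn a flat [name, address, length, ...] list into acpidump calls."""
    if len(package) % 3 != 0:
        raise ValueError("expected name/address/length triplets")
    commands = []
    for i in range(0, len(package), 3):
        name, addr, length = package[i:i + 3]
        commands.append(
            "acpidump --addr 0x%08X --length 0x%X >%s.dat"
            % (addr, length, name.strip())
        )
    return commands


# Values copied from the SSDT package shown in this comment.
SSDT = [
    "CPU0IST ", 0xAADA3918, 0x000003FB,
    "APIST   ", 0xAADA2A98, 0x00000303,
    "CPU0CST ", 0xAADA1018, 0x000008A9,
    "APCST   ", 0xAADA0D98, 0x00000119,
]

if __name__ == "__main__":
    for cmd in acpidump_commands(SSDT):
        print(cmd)
```

Generating the commands from the package keeps each address paired with the right table name, which is easy to get wrong by hand.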
Comment 25 Peter Ganzhorn 2010-10-19 14:13:07 UTC
Here's what I have in

# ls /sys/firmware/acpi/tables/
APIC  DSDT     FACP  HPET  SLIC   SSDT2  SSDT4	SSDT6  SSDT8
BOOT  dynamic  FACS  MCFG  SSDT1  SSDT3  SSDT5	SSDT7  TCPA

I guess you want the SSDT* . Do you need any other files from there?

Further I applied your two patches to 2.6.36-rc8 and without the boot parameter I'm getting:

# cat /sys/devices/system/cpu/cpu*/cpufreq/shared_type
2
2

With processor.disable_hw_coord=1 I'm getting the exact same thing:


# uname -rs
Linux 2.6.36-rc8-pgzh
# dmesg | grep hw_coord
Kernel command line: BOOT_IMAGE=/vmlinuz.exp root=/dev/sda3 ro processor.disable_hw_coord=1
centauri:/home/pgzh# cat /sys/devices/system/cpu/cpu*/cpufreq/shared_type
2
2

There's no change in CPUFREQ behavior as well - I'm stuck with the lowest freq unless I lower up_threshold.

In the CPUFREQ drivers section in kconfig I have the following set:
# CONFIG_X86_PCC_CPUFREQ is not set
Should this be enabled?
I only got the ACPI driver enabled right now. (CONFIG_X86_ACPI_CPUFREQ=m)
Comment 26 vyncere 2010-10-19 22:47:14 UTC
Hi all,

So with my Core i5 520 M (2 physical cores, 4 logical cores):

* 2.6.32.15 (ubuntu), 2.6.35.7 (kernel.org), 2.6.35.7 (kernel.org + 2 patches) :

- Values of /sys/devices/system/cpu/cpu*/cpufreq/affected_cpus are : "0", "1", "2" and "3" respectively for cpu0, cpu1, cpu2 and cpu3.

- Output of /sys/devices/system/cpu/cpu*/cpufreq/related_cpus is always : "0 1 2 3"

No problem.

* 2.6.35.7 (kernel.org + 2 patches) with and without boot parameter "processor.disable_hw_coord=1" :

- Output of /sys/devices/system/cpu/cpu*/cpufreq/shared_type is always :
1

Always default value 1 (CPUFREQ_SHARED_TYPE_HW) : so, Thomas, according to what you said, there is something wrong here...

- cpufreq-info returns the same result as before (lowest freq).
- /sys/devices/system/cpu/cpufreq/ondemand/up_threshold is 95 by default.
- Lowering up_threshold has no effect for me.
Comment 27 Thomas Renninger 2010-10-28 15:12:08 UTC
I expect acpi-cpufreq is fundamentally broken with respect to HW_ALL coordination.
The only thing acpi-cpufreq or the cpufreq subsystem takes into account in the HW_ALL case is making sure that the same governor is running on all dependent CPUs if CONFIG_HOTPLUG_CPU is set in .config.
Otherwise the dependent CPUs are only shown as "related" in sysfs, that's all.

The only thing which makes me wonder is: Why has this not come up earlier...

From ACPI spec it's impossible to guess how OS should deal with HW_ALL.
Googling about it leads to this bug :) and one interesting discussion:
http://www.mail-archive.com/linux-acpi@vger.kernel.org/msg11682.html
-> adding Len and Rui into CC.
But from there it's also not 100% clear.

While SW_ALL is very clear, I could imagine the difference between HW_ALL and SW_ANY is that the (MSR/HW) status registers may or may not get updated. Or that HW may or may not transition the other core(s) into the same state and SW has to re-evaluate (which would be rather stupid, and the SW_ALL algorithm should apply then).

Hmm, I found:
14.2 P-STATE HARDWARE COORDINATION
in
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3A: System Programming Guide, Part 1

But the code in there is so poor.
Essentially it tells us to use aperf/mperf for switching decisions.
The part saying that aperf/mperf should get reset to 0 should vanish from this document: it's possible to write an algorithm which can handle register overflows, so setting them back to zero is wrong.

And this comment there:
// This example does not cover the additional logic or algorithms
// necessary to coordinate multiple logical processors to a target P-state.
makes me wonder whether we also miss this additional logic in acpi-cpufreq/ondemand.

The ondemand governor must know about the dependency and look at the utilization of all dependent cores when making decisions, which is currently not the case.
I expect Heinz and Peter get "half way correct" switching at an up_threshold of about 52 because they have two "related" cores.
Vyncere, you have 4 dependent cores; if your core(s) are switched up with an up_threshold of 25, I expect you see the same bug.

My patches, or rather workarounds, may help for Heinz's BIOS; they may not for others.
Also, this must get fixed properly. For this, some input from Intel people on how HW_ALL must be handled, or better, how it's done on Windows, would be great.

You could try to find out how your HW behaves by not loading any cpufreq driver, getting the msr-tools package, setting the frequency on single cores manually, and then reading out the status MSR to see whether the dependent cores switched as well (the question is whether the status is true then; this may be CPU specific and may have nothing to do with how the OS must implement the HW_ALL algorithm).
The MSRs are:
#define MSR_IA32_PERF_STATUS 0x00000198
#define MSR_IA32_PERF_CTL    0x00000199
You have to be careful that only 16 bits (0-15, cf. chapter 14.3.2.2 in the above mentioned document) are used when you write to PERF_CTL; you have to read out the others, keep them, and write them back as well.
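
The read-modify-write rule for PERF_CTL boils down to a mask operation. The sketch below only computes the value to write; actually reading and writing the MSR (for instance with rdmsr/wrmsr from msr-tools) is left out, and the function name and sample values are invented for illustration.

```python
MSR_IA32_PERF_STATUS = 0x198
MSR_IA32_PERF_CTL = 0x199
PSTATE_MASK = 0xFFFF  # bits 0-15 carry the requested P-state


def merge_perf_ctl(current, new_pstate):
    """Return the PERF_CTL value to write: new low 16 bits, rest preserved."""
    if not 0 <= new_pstate <= PSTATE_MASK:
        raise ValueError("P-state value must fit into bits 0-15")
    return (current & ~PSTATE_MASK) | new_pstate


if __name__ == "__main__":
    # Made-up register content: one upper bit set, some old P-state below.
    old = 0x0000000100000A12
    print(hex(merge_perf_ctl(old, 0x0C18)))
```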

Long story short (I provided my whole research so others can comment on whether I have a thinko):
HW_ALL is about taking aperf/mperf into account. We do not rely on the BIOS for that, but determine it by reading out the aperf/mperf cpuid capabilities of the CPU directly. Still, the governor must consider all dependent cores, which is currently not the case and which is a major bug.
The question still is whether it takes all cores (like SW_ALL) or whether only one core (like SW_ANY) is enough to switch. SW_ALL should work for sure -> will provide a patch.
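
As an illustration of the point about counter overflows: an average frequency can be derived from two aperf/mperf readings without ever resetting the registers. This is a sketch under assumptions, namely 64-bit free-running counters, at most one wrap between readings, and mperf ticking at a fixed base frequency; the names are made up and this is not the kernel's code.

```python
COUNTER_MOD = 1 << 64  # aperf/mperf are assumed to be 64-bit counters


def counter_delta(old, new):
    """Difference of two readings, tolerating a single wrap-around."""
    return (new - old) % COUNTER_MOD


def average_khz(base_khz, aperf_old, aperf_new, mperf_old, mperf_new):
    """Average frequency while not idle: base_khz * d(aperf) / d(mperf)."""
    d_aperf = counter_delta(aperf_old, aperf_new)
    d_mperf = counter_delta(mperf_old, mperf_new)
    if d_mperf == 0:
        return None  # no non-idle time between the two readings
    return base_khz * d_aperf // d_mperf


if __name__ == "__main__":
    # Both counters wrap between the readings; the result is still sane.
    print(average_khz(2400000, 2**64 - 250, 250, 2**64 - 500, 500))
```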
Comment 28 Thomas Renninger 2010-10-28 15:13:22 UTC
Created attachment 35352
CPUFREQ: Fix HW_ALL core dependencies

Please give it a try...
Comment 29 Peter Ganzhorn 2010-10-29 11:06:22 UTC
I applied the patch "CPUFREQ: Fix HW_ALL core dependencies" to a vanilla 2.6.36 (without any of the previously posted patches) and it did NOT fix the problem for me. I'm still stuck at the lowest freq until I lower up_threshold.
Comment 30 Thomas Renninger 2010-10-29 12:14:38 UTC
Thanks. I still think not considering core dependencies in the HW_ALL case is wrong, but this needs further/separate discussion/evaluation.

Peter, as you said with exactly same HW and kernel version/.config:
This one is broken:
Intel Core2 Duo T7300 2.0 GHz processor
and this one is not:
Intel Core2 Quad Q9550

Could you show us:
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
and
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
does it make a difference switching, e.g. to hpet?
echo xy >/sys/devices/system/clocksource/clocksource0/current_clocksource

Ok, I checked Heinz's and Peter's dmesg:

Heinz explicitly enables hpet via boot param (what is this for?)
clocksource=hpet acpi_skip_timer_override

And Peter has:
Marking TSC unstable due to TSC halts in idle
Switching to clocksource hpet

It's always only CPU0 not switching up, right?
Looks like something with hpet is wrong.
Comment 31 Thomas Renninger 2010-10-29 13:00:19 UTC
I also wonder, if you bind a task that utilizes 100% CPU to CPU0:
numactl --physcpubind=0 cat /dev/zero >/dev/null
whether top (hit "1" there to see each CPU's utilization) really shows you
100% (sy + us) running and 0% idle time?
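
The same per-CPU idle figure can be cross-checked without top by comparing two samples of a cpuN line from /proc/stat. This is a rough sketch: it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal), and the function names are invented.

```python
def parse_cpu_line(line):
    """Split a /proc/stat 'cpuN ...' line into its name and first 8 counters."""
    fields = line.split()
    return fields[0], [int(v) for v in fields[1:9]]


def idle_percent(before, after):
    """Idle share (in %) between two samples of the same cpuN line."""
    name_b, b = parse_cpu_line(before)
    name_a, a = parse_cpu_line(after)
    if name_a != name_b:
        raise ValueError("samples belong to different CPUs")
    deltas = [y - x for x, y in zip(b, a)]
    total = sum(deltas)
    if total <= 0:
        return None  # no ticks elapsed between the samples
    return 100.0 * deltas[3] / total  # index 3 is the idle counter


if __name__ == "__main__":
    # Made-up samples: 5 user ticks, 95 system ticks, no idle ticks.
    print(idle_percent("cpu0 100 0 100 800 0 0 0 0",
                       "cpu0 105 0 195 800 0 0 0 0"))
```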
Comment 32 Peter Ganzhorn 2010-10-29 15:49:01 UTC
numactl --physcpubind=0 cat /dev/zero >/dev/null

produces about 95%sys and 5%us load on _ONE_ core. idle goes down to 0.0% immediately. This behavior does not change when changing up_threshold.

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

So my system is already using the HPET clocksource.
Changing to acpi_pm has no effect on cpu frequency scaling (with the default up_threshold, of course).
The system did switch to acpi_pm as dmesg reveals:
Switching to clocksource acpi_pm

Concerning HPET and TSC I got the following from dmesg:
ACPI: HPET 000000007f736b66 00038 (v01 TOSHIB A0054    20070816 TASM 04010000)
ACPI: HPET id: 0x8086a201 base: 0xfed00000
hpet clockevent registered
Fast TSC calibration failed
TSC: PIT calibration matches HPET. 1 loops
HPET: 3 timers in total, 0 timers will be used for per-cpu timer
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
hpet0: 3 comparators, 64-bit 14.318180 MHz counter
Switching to clocksource tsc
Marking TSC unstable due to TSC halts in idle
Switching to clocksource hpet
rtc0: alarms up to one year, 114 bytes nvram, hpet irqs
CE: hpet increased min_delta_ns to 7500 nsec
CE: hpet increased min_delta_ns to 11250 nsec

The "hpet increased min_delta_ns" message shows up on the system with the Core2 Quad Q9550 as well, but obviously this does not affect cpu frequency scaling either. To me everything else looks fine.

BTW, do you still need the ACPI tables you mentioned in Comment #24? If so, please tell which of the ones I got (see Comment #25) you would like to have.
Comment 33 Thomas Renninger 2010-10-29 20:15:27 UTC
I wonder why Heinz has:
clocksource=hpet acpi_skip_timer_override

Peter's machine has an unstable TSC, which is strange since the cpuflags show constant_tsc.
So it seems it fails some boot check for incrementing constantly.

Hmm, you both seem to have problems with TSC..., aperf/mperf may be fed by the same internal clock as TSC. If the average frequency calculated via aperf/mperf is very low, the frequency would never get ramped up.

It's hard to believe, because you have such different CPUs, but it looks like you both have broken TSC/aperf/mperf timers on CPU 0? At least Peter has?
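
Why a bogus average frequency would pin the CPU at the lowest state can be shown with a toy model of the governor's decision. It assumes that ondemand scales the measured load by the aperf/mperf average frequency and compares the product with up_threshold times the current frequency, which is how the code of that era appears to work; treat that as an assumption. The function name and all numbers are made up.

```python
def wants_max_freq(load_pct, avg_khz, cur_khz, up_threshold=95):
    """Simplified check: ramp up if load * avg_freq > up_threshold * cur_freq."""
    return load_pct * avg_khz > up_threshold * cur_khz


if __name__ == "__main__":
    cur = 800000  # hypothetical lowest P-state, in kHz
    # Sane counters: the average equals the current frequency.
    print(wants_max_freq(100, cur, cur))
    # Broken counters reporting a too-low average: never ramps up at 95 ...
    print(wants_max_freq(100, 500000, cur))
    # ... but does once up_threshold is lowered far enough.
    print(wants_max_freq(100, 500000, cur, up_threshold=53))
```

In this model a fully loaded CPU still fails the check whenever the reported average is well below the current frequency, which would match the observation that only a drastically lowered up_threshold helps.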

Another patch to try.., please use:
acpi_cpufreq.disable_average=1
boot param and check dmesg that it got applied:
"acpi-cpufreq: average (aperf/mperf) accounting disabled by user"
Comment 34 Thomas Renninger 2010-10-29 20:19:32 UTC
Created attachment 35492
Patch to workaround the possible HW defect

If this really helps, please also try the cpufreq-aperf tool from a recent cpufrequtils package and check what kind of broken values aperf/mperf provide on CPU0.
Comment 35 Peter Ganzhorn 2010-11-01 10:49:11 UTC
# dmesg | grep acpi-cpufreq
acpi-cpufreq: average (aperf/mperf) accounting disabled by user
acpi-cpufreq: average (aperf/mperf) accounting disabled by user

This patch actually *FIXED* the problem for me.

Please give me some instructions on how to use cpufreq-aperf - how do I use the tool? When running it, it gives me an output like this (with the patched kernel):

# cpufreq-aperf 
CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	0500250			00 sec 041 ms	00 sec 958 ms	04
001	0460230			00 sec 056 ms	00 sec 943 ms	05

000	0500250			00 sec 045 ms	00 sec 954 ms	04
001	0620310			00 sec 056 ms	00 sec 943 ms	05

000	0480240			00 sec 033 ms	00 sec 966 ms	03
001	0520260			00 sec 054 ms	00 sec 945 ms	05

000	0580290			00 sec 109 ms	00 sec 890 ms	10
001	0580290			00 sec 123 ms	00 sec 876 ms	12

000	0560280			00 sec 179 ms	00 sec 820 ms	17
001	0540270			00 sec 239 ms	00 sec 760 ms	23

Do I have to create load on one or both cpu cores? Do I have to pass particular options? And if I just have to run it, how long should I run it?
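
Output like the above is easier to compare across runs once it is parsed. The following sketch assumes exactly the column layout shown in this comment; the regular expression and function names are illustrative and are not part of cpufrequtils.

```python
import re

# One data row: CPU, average kHz, "SS sec MMM ms" twice, C0 percentage.
ROW = re.compile(
    r"^(\d+)\s+(\d+)\s+(\d+) sec (\d+) ms\s+(\d+) sec (\d+) ms\s+(\d+)\s*$"
)


def parse_aperf(text):
    """Return one dict per data row; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        cpu, khz, c0_s, c0_ms, cx_s, cx_ms, pct = (int(g) for g in match.groups())
        rows.append({
            "cpu": cpu,
            "avg_khz": khz,
            "c0_ms": c0_s * 1000 + c0_ms,
            "cx_ms": cx_s * 1000 + cx_ms,
            "c0_pct": pct,
        })
    return rows


def mean_khz_per_cpu(rows):
    """Average the reported frequency per CPU over all samples."""
    sums = {}
    for row in rows:
        total, count = sums.get(row["cpu"], (0, 0))
        sums[row["cpu"]] = (total + row["avg_khz"], count + 1)
    return {cpu: total // count for cpu, (total, count) in sums.items()}
```

Feeding it the text captured from a run under load and from an idle run gives two per-CPU averages that can be compared directly.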
Comment 36 vyncere 2010-11-01 13:42:47 UTC
Hi everyone, Hi Thomas,

Here are some results on my Thinkpad T410, Intel Core i5 520 M.


* 2.6.32.15 (Kernel Reference with functional cpufreq)
- Boot params : None

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

# While CPU is idling : cpufreq-aperf
CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	1824000			00 sec 067 ms	00 sec 932 ms	06
001	1392000			00 sec 007 ms	00 sec 992 ms	00
002	1272000			00 sec 010 ms	00 sec 989 ms	01
003	1488000			00 sec 005 ms	00 sec 994 ms	00

000	1656000			00 sec 052 ms	00 sec 947 ms	05
001	2232000			00 sec 016 ms	00 sec 983 ms	01
002	1272000			00 sec 009 ms	00 sec 990 ms	00
003	1464000			00 sec 005 ms	00 sec 994 ms	00

000	1248000			00 sec 026 ms	00 sec 973 ms	02
001	1992000			00 sec 043 ms	00 sec 956 ms	04
002	1272000			00 sec 011 ms	00 sec 988 ms	01
003	1488000			00 sec 006 ms	00 sec 993 ms	00

# While kernel is compiling (make -j 3) : cpufreq-aperf

CPU     Average freq(KHz)       Time in C0      Time in Cx      C0 percentage
000	2664000			00 sec 997 ms	00 sec 002 ms	99
001	2664000			00 sec 512 ms	00 sec 487 ms	51
002	2664000			00 sec 812 ms	00 sec 187 ms	81
003	2664000			00 sec 876 ms	00 sec 123 ms	87

000	2664000			00 sec 771 ms	00 sec 228 ms	77
001	2664000			00 sec 857 ms	00 sec 142 ms	85
002	2664000			00 sec 814 ms	00 sec 185 ms	81
003	2664000			00 sec 731 ms	00 sec 268 ms	73

000	2664000			00 sec 446 ms	00 sec 553 ms	44
001	2664000			00 sec 885 ms	00 sec 114 ms	88
002	2664000			00 sec 962 ms	00 sec 037 ms	96
003	2664000			00 sec 990 ms	00 sec 009 ms	99 

# cpufreq-info
current policy: frequency should be within 1.20 GHz and 2.40 GHz.

* 2.6.36 (+ 2 Patches HW_COORD, SHARED_TYPE)
- Boot params : None

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm 
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

# While CPU is idling : cpufreq-aperf

CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	1200000			00 sec 066 ms	00 sec 933 ms	06
001	1224000			00 sec 038 ms	00 sec 961 ms	03
002	1368000			00 sec 002 ms	00 sec 997 ms	00
003	1320000			00 sec 002 ms	00 sec 997 ms	00

000	1200000			00 sec 055 ms	00 sec 944 ms	05
001	1224000			00 sec 033 ms	00 sec 966 ms	03
002	1344000			00 sec 002 ms	00 sec 997 ms	00
003	1272000			00 sec 001 ms	00 sec 998 ms	00

000	1224000			00 sec 057 ms	00 sec 942 ms	05
001	1224000			00 sec 033 ms	00 sec 966 ms	03
002	1392000			00 sec 002 ms	00 sec 997 ms	00
003	1344000			00 sec 001 ms	00 sec 998 ms	00

# While kernel is compiling (make -j 3) : cpufreq-aperf

CPU     Average freq(KHz)       Time in C0      Time in Cx      C0 percentage
000	1176000			00 sec 585 ms	00 sec 414 ms	58
001	1176000			00 sec 719 ms	00 sec 280 ms	71
002	1200000			00 sec 825 ms	00 sec 174 ms	82
003	1176000			00 sec 951 ms	00 sec 048 ms	95

000	1200000			00 sec 874 ms	00 sec 125 ms	87
001	1176000			00 sec 864 ms	00 sec 135 ms	86
002	1200000			00 sec 776 ms	00 sec 223 ms	77
003	1200000			00 sec 586 ms	00 sec 413 ms	58

000	1176000			00 sec 903 ms	00 sec 096 ms	90
001	1200000			00 sec 841 ms	00 sec 158 ms	84
002	1200000			00 sec 682 ms	00 sec 317 ms	68
003	1176000			00 sec 702 ms	00 sec 297 ms	70

# cpufreq-info
current policy: frequency should be within 1.20 GHz and 1.20 GHz.

* 2.6.36 (+ 3 Patches HW_COORD, SHARED_TYPE, HW_ALL)
- Boot params : None

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

# While CPU is idling : cpufreq-aperf

CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	1224000			00 sec 053 ms	00 sec 946 ms	05
001	1248000			00 sec 022 ms	00 sec 977 ms	02
002	1320000			00 sec 002 ms	00 sec 997 ms	00
003	1344000			00 sec 002 ms	00 sec 997 ms	00

000	1224000			00 sec 061 ms	00 sec 938 ms	06
001	1272000			00 sec 018 ms	00 sec 981 ms	01
002	1344000			00 sec 002 ms	00 sec 997 ms	00
003	1344000			00 sec 003 ms	00 sec 996 ms	00

000	1224000			00 sec 060 ms	00 sec 939 ms	06
001	1248000			00 sec 015 ms	00 sec 984 ms	01
002	1416000			00 sec 002 ms	00 sec 997 ms	00
003	1368000			00 sec 002 ms	00 sec 997 ms	00

# While kernel is compiling (make -j 3) : cpufreq-aperf

CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	1200000			00 sec 101 ms	00 sec 898 ms	10
001	1200000			00 sec 079 ms	00 sec 920 ms	07
002	1200000			00 sec 201 ms	00 sec 798 ms	20
003	1200000			00 sec 828 ms	00 sec 171 ms	82

000	1200000			00 sec 112 ms	00 sec 887 ms	11
001	1200000			00 sec 515 ms	00 sec 484 ms	51
002	1200000			00 sec 006 ms	00 sec 993 ms	00
003	1200000			00 sec 528 ms	00 sec 471 ms	52

000	1200000			00 sec 567 ms	00 sec 432 ms	56
001	1200000			00 sec 275 ms	00 sec 724 ms	27
002	1176000			00 sec 253 ms	00 sec 746 ms	25
003	1176000			00 sec 120 ms	00 sec 879 ms	12

# cpufreq-info
current policy: frequency should be within 1.20 GHz and 1.20 GHz.

* 2.6.36 (+ 4 Patches HW_COORD, SHARED_TYPE, HW_ALL, HW_STATISTICS)
- Boot params for patch 4 : acpi_cpufreq.disable_average=1

# dmesg | grep cpufreq
acpi-cpufreq: average (aperf/mperf) accounting disabled by user

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

# While CPU is idling : cpufreq-aperf

CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	1200000			00 sec 084 ms	00 sec 915 ms	08
001	1248000			00 sec 016 ms	00 sec 983 ms	01
002	1368000			00 sec 011 ms	00 sec 988 ms	01
003	1344000			00 sec 003 ms	00 sec 996 ms	00

000	1224000			00 sec 056 ms	00 sec 943 ms	05
001	1224000			00 sec 059 ms	00 sec 940 ms	05
002	1320000			00 sec 010 ms	00 sec 989 ms	01
003	1296000			00 sec 004 ms	00 sec 995 ms	00

000	1224000			00 sec 054 ms	00 sec 945 ms	05
001	1224000			00 sec 033 ms	00 sec 966 ms	03
002	1368000			00 sec 010 ms	00 sec 989 ms	01
003	1368000			00 sec 003 ms	00 sec 996 ms	00

# While kernel is compiling (make -j 3) : cpufreq-aperf

CPU     Average freq(KHz)       Time in C0      Time in Cx      C0 percentage
000	1176000			00 sec 815 ms	00 sec 184 ms	81
001	1200000			00 sec 532 ms	00 sec 467 ms	53
002	1176000			00 sec 997 ms	00 sec 002 ms	99
003	1200000			00 sec 997 ms	00 sec 002 ms	99

000	1200000			00 sec 501 ms	00 sec 498 ms	50
001	1200000			00 sec 731 ms	00 sec 268 ms	73
002	1176000			00 sec 997 ms	00 sec 002 ms	99
003	1200000			00 sec 997 ms	00 sec 002 ms	99

000	1176000			00 sec 714 ms	00 sec 285 ms	71
001	1176000			00 sec 889 ms	00 sec 110 ms	88
002	1176000			00 sec 934 ms	00 sec 065 ms	93
003	1176000			00 sec 778 ms	00 sec 221 ms	77 

# cpufreq-info
current policy: frequency should be within 1.20 GHz and 1.20 GHz. 

It's very interesting. With the good old 2.6.32 kernel (with working cpufreq), while the CPU is idling, the clock speeds reported by cpufreq-aperf fluctuate between 1.20GHz and 1.80GHz, sometimes up to 2.40GHz. Hmm... that is not very good for power saving, and it may explain why my CPU is always near 50°C. It's slightly better with the 2.6.36 kernel (thanks to intel_idle, maybe? I don't know).

I don't know whether these are the true clock speeds, since my conky monitor always shows the 4 virtual cores at 1.2 GHz, but I think cpufreq-aperf is more accurate than anything else. cpufreq-info always reports a 1.20GHz max limit.

In my case, the problem is not solved. With all 3 kernel configurations (2.6.36 + patches), clock speeds still remain at the lowest state under high CPU load. Ironically, cpufreq-aperf shows that frequencies never exceed 1.20GHz at full load, whereas they do at idle! It's an upside-down world...
Comment 37 Thomas Renninger 2010-11-01 17:43:25 UTC
vyncere: Your problem is different. Could it be that
cat /sys/devices/system/cpu/cpu0/cpufreq/bios_limit
does not show the max freq?
Then it is related to:
https://bugzilla.kernel.org/show_bug.cgi?id=14771
But in contrast to Peter's problem it's BIOS related (and the root cause may be different and related to ACPI).
If the above is true (bios_limit is below the max freq), best update your BIOS and check the related BIOS options. If neither BIOS fiddling nor the processor.ignore_ppc=1 workaround helps, please open a new bug and assign it to the ACPI component.
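
A minimal sketch of that check, assuming the usual cpufreq sysfs files bios_limit and cpuinfo_max_freq (both in kHz). The helper name bios_limit_report is made up for illustration, and bios_limit only exists where the kernel and driver export it:

```python
from pathlib import Path


def bios_limit_report(cpufreq_dir):
    """Compare the BIOS frequency limit with the hardware maximum (both in kHz).

    Returns (bios_khz, max_khz, limited). bios_khz is None when the
    bios_limit file is not exported for this CPU.
    """
    d = Path(cpufreq_dir)
    max_khz = int((d / "cpuinfo_max_freq").read_text().strip())
    limit_file = d / "bios_limit"
    if not limit_file.exists():
        return None, max_khz, False
    bios_khz = int(limit_file.read_text().strip())
    # The BIOS is capping the CPU if its limit is below the hardware maximum.
    return bios_khz, max_khz, bios_khz < max_khz
```

On a real system it would be called as bios_limit_report("/sys/devices/system/cpu/cpu0/cpufreq"); a True in the last field means the BIOS is holding the CPU below its maximum.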

Peter: Your CPU, or at least its counters (tsc, aperf, mperf), is broken.
Heinz may have a similar problem. If this is more common, a check could be added that tests whether the getavg values are far outside min/max, and getavg would then no longer be used. As this is a hot path, the check could be done once at boot, similar to how the TSC is checked: e.g. test 3 times over 100ms whether getavg is inside the ~min/~max limits, and if not, don't use it.
If this normally never happens it might not be worth it and just the module param could be added. -> Waiting for a test from Heinz to decide whether the boot param, a runtime timer check, or both should be added.
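
The proposed plausibility check could look roughly like this. It is only a userspace sketch of the idea, not kernel code; the function name, the 10% slack and the sample values are assumptions for illustration:

```python
def getavg_looks_sane(samples_khz, min_khz, max_khz, slack=0.10):
    """Return True if every measured average frequency lies inside the
    supported range, widened by `slack` to tolerate measurement noise."""
    low = min_khz * (1.0 - slack)
    high = max_khz * (1.0 + slack)
    return all(low <= s <= high for s in samples_khz)


# Three samples (e.g. taken ~100 ms apart) far below the lowest P-state:
# the averages cannot be trusted and getavg should be ignored.
broken = getavg_looks_sane([500000, 510000, 495000], 800000, 2000000)
```

The slack matters because a short sampling window never yields exactly the nominal frequency; how large it has to be on real hardware is an open question.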
Comment 38 vyncere 2010-11-01 18:03:04 UTC
Some results with :

- acpi_cpufreq.disable_average=1 clocksource=hpet

and

- acpi_cpufreq.disable_average=1 clocksource=hpet acpi_skip_timer_override

* 2.6.36 (+ 4 Patches HW_COORD, SHARED_TYPE, HW_ALL, HW_STATISTICS)
- Boot params for patch 4 : acpi_cpufreq.disable_average=1
- Boot params : clocksource=hpet

# dmesg | grep cpufreq
acpi-cpufreq: average (aperf/mperf) accounting disabled by user

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

# While CPU is idling : cpufreq-aperf

CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	2688000			00 sec 033 ms	00 sec 966 ms	03
001	1944000			00 sec 017 ms	00 sec 982 ms	01
002	1776000			00 sec 001 ms	00 sec 998 ms	00
003	1536000			00 sec 002 ms	00 sec 997 ms	00

000	2496000			00 sec 028 ms	00 sec 971 ms	02
001	1656000			00 sec 032 ms	00 sec 967 ms	03
002	1536000			00 sec 018 ms	00 sec 981 ms	01
003	1416000			00 sec 003 ms	00 sec 996 ms	00

000	2712000			00 sec 036 ms	00 sec 963 ms	03
001	2064000			00 sec 014 ms	00 sec 985 ms	01
002	1416000			00 sec 001 ms	00 sec 998 ms	00
003	1392000			00 sec 001 ms	00 sec 998 ms	00

000	2688000			00 sec 034 ms	00 sec 965 ms	03
001	1920000			00 sec 017 ms	00 sec 982 ms	01
002	1584000			00 sec 001 ms	00 sec 998 ms	00
003	1536000			00 sec 002 ms	00 sec 997 ms	00

000	2712000			00 sec 041 ms	00 sec 958 ms	04
001	1968000			00 sec 023 ms	00 sec 976 ms	02
002	1512000			00 sec 008 ms	00 sec 991 ms	00
003	1560000			00 sec 004 ms	00 sec 995 ms	00

000	2784000			00 sec 040 ms	00 sec 959 ms	04
001	1896000			00 sec 022 ms	00 sec 977 ms	02
002	1416000			00 sec 008 ms	00 sec 991 ms	00
003	1488000			00 sec 003 ms	00 sec 996 ms	00

000	2736000			00 sec 042 ms	00 sec 957 ms	04
001	2160000			00 sec 018 ms	00 sec 981 ms	01
002	1512000			00 sec 007 ms	00 sec 992 ms	00
003	1608000			00 sec 003 ms	00 sec 996 ms	00

# While kernel is compiling (make -j 3) : cpufreq-aperf
CPU     Average freq(KHz)       Time in C0      Time in Cx      C0 percentage
000     2640000                 00 sec 767 ms   00 sec 232 ms   76
001     2304000                 00 sec 960 ms   00 sec 039 ms   96
002     2112000                 00 sec 641 ms   00 sec 358 ms   64
003     2256000                 00 sec 886 ms   00 sec 113 ms   88

000     2640000                 00 sec 709 ms   00 sec 290 ms   70
001     2208000                 00 sec 969 ms   00 sec 030 ms   96
002     2016000                 00 sec 677 ms   00 sec 322 ms   67
003     2160000                 00 sec 881 ms   00 sec 118 ms   88

000     2640000                 00 sec 844 ms   00 sec 155 ms   84
001     2400000                 00 sec 937 ms   00 sec 062 ms   93
002     2328000                 00 sec 695 ms   00 sec 304 ms   69
003     2376000                 00 sec 866 ms   00 sec 133 ms   86

000     2640000                 00 sec 882 ms   00 sec 117 ms   88
001     2424000                 00 sec 756 ms   00 sec 243 ms   75
002     2400000                 00 sec 674 ms   00 sec 325 ms   67
003     2472000                 00 sec 991 ms   00 sec 008 ms   99

# cpufreq-info

current policy: frequency should be within 1.20 GHz and 1.20 GHz.

* 2.6.36 (+ 4 Patches HW_COORD, SHARED_TYPE, HW_ALL, HW_STATISTICS)
- Boot params for patch 4 : acpi_cpufreq.disable_average=1
- Boot params : clocksource=hpet acpi_skip_timer_override

# dmesg | grep cpufreq
acpi-cpufreq: average (aperf/mperf) accounting disabled by user

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

# dmesg | grep apic

ACPI: Core revision 20100702
Setting APIC routing to flat
..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...
..... (found apic 0 pin 0) ...
....... works.

# dmesg | grep intel_idle

intel_idle: MWAIT substates: 0x1120
intel_idle: v0.4 model 0x25
intel_idle: lapic_timer_reliable_states 0xffffffff
ACPI: acpi_idle yielding to intel_idle

# While CPU is idling : cpufreq-aperf 

CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	2184000			00 sec 030 ms	00 sec 969 ms	03
001	1320000			00 sec 038 ms	00 sec 961 ms	03
002	1392000			00 sec 032 ms	00 sec 967 ms	03
003	1536000			00 sec 002 ms	00 sec 997 ms	00

000	1752000			00 sec 031 ms	00 sec 968 ms	03
001	1680000			00 sec 051 ms	00 sec 948 ms	05
002	1488000			00 sec 015 ms	00 sec 984 ms	01
003	1272000			00 sec 015 ms	00 sec 984 ms	01

000	2640000			00 sec 034 ms	00 sec 965 ms	03
001	1632000			00 sec 027 ms	00 sec 972 ms	02
002	1392000			00 sec 003 ms	00 sec 996 ms	00
003	1464000			00 sec 002 ms	00 sec 997 ms	00

000	2712000			00 sec 038 ms	00 sec 961 ms	03
001	1776000			00 sec 020 ms	00 sec 979 ms	02
002	1464000			00 sec 002 ms	00 sec 997 ms	00
003	1488000			00 sec 003 ms	00 sec 996 ms	00

# cpufreq-info

current policy: frequency should be within 1.20 GHz and 1.20 GHz.

In all cases, cpufreq-info reports a 1.20GHz max limit, but cpufreq-aperf shows the opposite: under high load, frequencies jump up to 2.64GHz. Compiling the kernel took the same amount of time as with my reference kernel (2.6.32), and the CPU reached 72°C. I conclude that the CPU states reported by cpufreq-info are wrong. (My conky desktop monitor applet does not seem to read its info from cpufreq-aperf or an equivalent, because it always reports 1.20GHz, like cpufreq-info.)
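
The comparison can be automated with a small parser for the cpufreq-aperf table. This is a sketch that assumes the column layout shown in the outputs above; the function names are made up for illustration:

```python
def parse_aperf(text):
    """Parse cpufreq-aperf rows into dicts; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        t = line.split()
        # Data rows look like: 000 2640000 00 sec 767 ms 00 sec 232 ms 76
        if len(t) != 11 or not t[0].isdigit():
            continue
        rows.append({
            "cpu": int(t[0]),
            "avg_khz": int(t[1]),
            "c0_ms": int(t[2]) * 1000 + int(t[4]),
            "cx_ms": int(t[6]) * 1000 + int(t[8]),
            "c0_pct": int(t[10]),
        })
    return rows


def exceeds_policy_max(rows, policy_max_khz):
    """Return the rows whose measured average is above the reported policy maximum."""
    return [r for r in rows if r["avg_khz"] > policy_max_khz]
```

Feeding it the compile-time output above with a policy maximum of 1200000 kHz would flag every core, which is the contradiction described here.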

With the hpet clocksource, average frequencies at idle are very high, higher than with the tsc clocksource (where they are already strangely high for an idle system), especially for CPU 0 (2.7 GHz at 3% load!), but this does not seem to raise the temperature. I would expect all my virtual cores to sit at 1.20GHz (or less) at idle, but even with my reference kernel this ideal state was never reached.

With the acpi_skip_timer_override parameter, the kernel prints some additional IO-APIC related messages at boot time.
Comment 39 vyncere 2010-11-01 18:16:46 UTC
Thomas :

* 2.6.32.15 (Reference kernel with functional cpufreq)
- Boot params : None

--> No /sys/devices/system/cpu/cpu0/cpufreq/bios_limit

* 2.6.36 (+ 4 Patches HW_COORD, SHARED_TYPE, HW_ALL, HW_STATISTICS)
- Boot params for patch 4 : acpi_cpufreq.disable_average=1
- Boot params : clocksource=hpet

# cat /sys/devices/system/cpu/cpu0/cpufreq/bios_limit
1199000

So, 1.20GHz.

Thank you. I will investigate the BIOS/ACPI track.
Comment 40 Peter Ganzhorn 2010-11-02 07:09:00 UTC
Seems like I am lucky in picking the broken CPUs...the Core2 Quad Q9550 I have also has a minor flaw, it has broken temperature sensors - those are stuck at one temperature, no matter how much load I put on it :-P
BTW, I never overclocked or fiddled with any of those CPUs and bought them unused and new...

So, do you need anything else from me?
If it helps: with the patched kernel, CPU frequency scaling tends to ramp the freq up a little more slowly and to scale down a little earlier than the unpatched kernel with lowered up_threshold. So system responsiveness is a bit worse. You notice that, but it's not too bad, and it definitely WORKS as it's supposed to. I'm now probably saving a bit of power compared to before.

Thanks for looking into this again! If you want me to try more sophisticated patches just let me know.
Comment 41 Thomas Renninger 2010-11-02 08:59:37 UTC
> So, do you need anything else from me?
Eh, yes, I found something else:
You have the constant_tsc feature, but not the nonstop_tsc feature.
CPU idle drivers mark tsc unstable if C-states are used. If C-states also affect aperf/mperf timers, they must not be used as well in this case.
Can you try the idle=halt boot param?
Argh, the intel_idle driver does not recognize idle= overrides.
Please try idle=halt with another patch I'll provide.
Or try both params (then the patch should not be needed):
intel_idle.max_cstate=0 idle=halt
Does cpufreq-aperf then give you sane average frequency values in the range of min/max freq, and does the cpufreq subsystem work as expected?
Comment 42 Thomas Renninger 2010-11-02 09:03:25 UTC
Created attachment 35852 [details]
intel_idle: Do not load if user overrides idle function via idle= boot param

Trying this only makes sense if a cpuidle driver was in use before and the machine supports C2 and deeper sleep states. You can check (without the idle=halt override) whether your machine uses deeper sleep states via:
cat /sys/devices/system/cpu/cpu*/cpuidle/state*/name
Comment 43 Thomas Renninger 2010-11-03 20:42:44 UTC
> CPU idle drivers mark tsc unstable if C-states are used. If C-states also
> affect aperf/mperf timers, they must not be used as well in this case.
That is wrong, sorry about that:
aperf/mperf timers always stop in C-states, but the resulting average frequency during C0 (not idle) must still be correct.
Also the fact that only one CPU is affected very much points to defective HW.
As long as this only shows up on one machine it's not worth adding extra code to the kernel. If you want to work around this in self-built kernels, best remove the aperfmperf capabilities line from arch/x86/include/asm/cpufeature.h, as the capability is also used in the scheduler.
Still waiting some days for feedback from Heinz before closing this one as "documented". As it's a totally different CPU it's probably something else.
Comment 44 Peter Ganzhorn 2010-11-04 08:15:11 UTC
> Also the fact that only one CPU is affected very much points to defective HW.

I have only one CPU, but two CPU cores - by CPU you meant "CPU core", right?

> remove the aperfmperf capabilities line from
> arch/x86/include/asm/cpufeature.h

I'll try that with a vanilla kernel to see if I run into other problems with that. (That will remove the capability system-wide, not only for the cpufreq subsystem, right?)

If it helps: I installed Windows XP on another partition on my affected laptop, frequency scaling works just fine there...
Comment 45 Thomas Renninger 2010-11-04 08:41:33 UTC
Another possibility that does not require kernel rebuilding is to use the userspace governor and a cpufreq userspace daemon. I expect they do not use aperf/mperf.
Here is a nice overview of packages that are out there (see the section "Other CPU Speed Utilities"):
http://www.gentoo.org/doc/en/power-management-guide.xml

> by CPU you meant "CPU core", right?
Yes.
Comment 46 Peter Ganzhorn 2010-11-08 12:32:34 UTC
Concerning the changes to arch/x86/include/asm/cpufeature.h:

I can't just kill the line
#define X86_FEATURE_APERFMPERF (3*32+28) /* APERFMPERF */
since this will prevent the kernel from being compiled.
For now I changed the line to
#define X86_FEATURE_APERFMPERF (6*32+11) /* APERFMPERF */
which changes the checked feature to SSE5 instead of APERFMPERF - my CPU does not support SSE5, so this should report a non-set bit.

Not the most elegant solution and of course not really portable to newer CPUs, but it sure does the trick.
Do you have a better idea to hard-code that feature to zero in the kernel code? Is there a CPUID bit that's 0 by default for all processors?

I looked at related kernel code, but I don't think there's an easier or better fix at another place in the code, as the way I chose makes the feature reported as non-present for all code in the kernel.
Comment 47 Thomas Renninger 2010-11-08 12:49:05 UTC
> since this will prevent the kernel from being compiled.
Yep, you would need to touch the other places where it gets evaluated as well.

Easiest for you should be to take the patch you verified working from comment #34 (with the boot/module param described somewhere...):
acpi-cpufreq: Provide param to disable average HW statistics for broken timers
If you want to make sure scheduler also does not make use of the timers you can additionally add something like:

struct cpuinfo_x86 *c;
...
c = &cpu_data(cpu);
clear_cpu_cap(c, X86_FEATURE_APERFMPERF);

This is an ugly hack, but should work for you.
Comment 48 Peter Ganzhorn 2010-11-09 17:38:09 UTC
I modified the patch like you suggested:

+	if (cpu_has(c, X86_FEATURE_APERFMPERF)) {
+		if (disable_average) {
+			printk(KERN_INFO "acpi-cpufreq: average (aperf/mperf) "
+			       "accounting disabled by user\n");
+			clear_cpu_cap(c, X86_FEATURE_APERFMPERF);
+		}
+		else
+			acpi_cpufreq_driver.getavg = cpufreq_get_measured_perf;
+	}

Will the scheduler adapt to the changed cpu features with only that single change? Since the sched code probably gets initialized earlier, I wonder if it will notice the change.
Anyway, the system feels a bit more responsive with the modified patch, but that might just be a placebo effect since I know that I made that change...
Comment 49 vyncere 2010-11-10 11:17:50 UTC
Hi all,

Thank you very much Thomas for your hint. As you suggested, I checked the BIOS limit (which was 1.20 GHz in my case). This morning I set the "Performance" profile in my BIOS for all modes (AC/DC Power + Battery), instead of the "Power-saving" one.

Now the BIOS limit is 2.40GHz, and cpufreq manages to do its job without any patches (2.6.36).

Thank you very much again, tracking this bug was very instructive. ^^
Comment 50 Thomas Renninger 2010-11-10 12:51:11 UTC
> Will the scheduler adapt to the changed cpu features with only that single
> change?
Yes, the only part I found using it always checks for:
if (has_cpu_cap(X86_FEATURE_APERFMPERF))
     use_it()
So unsetting the cap effectively disables usage there on the next schedule.
This may change in the future, as said it's a hack...
Perfect would be a boot param disable_cpu_cap=xy, but as most of these caps are evaluated early it's hard to implement.

vyncere: Looks like Linux works correctly according to your BIOS settings :)

Not sure whether I should set this to invalid. I set it to documented: if there are more machines with broken aperf/mperf timers, their owners might find this report, and if this turns out to be a more common issue, a fix/workaround for the kernel can still be considered.
Comment 51 Len Brown 2011-07-31 17:58:18 UTC
> if there are more machines with broken aperf/mperf timers

Thomas,
In what way are the aperf/mperf timers broken on this box?
Comment 52 Thomas Renninger 2011-08-01 07:35:05 UTC
Something already seems to go wrong with the TSC at early boot:
Fast TSC calibration failed
TSC: PIT calibration matches HPET. 1 loops

Most interesting are comments 33, 34 and 35.
cpufreq-aperf (measuring the average freq using aperf/mperf) shows frequencies around 500 MHz which is wrong (afaik the cpu only supports 800 and higher freqs). That is the reason why the cpufreq subsystem, taking aperf values to calculate the next frequency into account never raises the frequency.
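
For reference, the average frequency is derived from the ratio of the two counters: APERF advances at the actual clock and MPERF at a fixed reference clock, both only while the core is in C0. A sketch of that calculation, where the function name, the reference frequency and the counter deltas are illustrative and not values from Peter's machine:

```python
def average_freq_khz(ref_khz, aperf_delta, mperf_delta):
    """Average C0 frequency over one sampling interval, in kHz.

    ref_khz is the fixed reference frequency MPERF counts at; the deltas
    are the differences between two reads of each counter.
    """
    if mperf_delta <= 0:
        raise ValueError("MPERF did not advance during the interval")
    return ref_khz * aperf_delta // mperf_delta


# If APERF advances only a quarter as fast as MPERF on a 2 GHz reference
# clock, the result is 500 MHz, which is below an 800 MHz lowest P-state.
suspicious = average_freq_khz(2000000, 250, 1000)
```

A result below the lowest supported frequency, as in this made-up case, cannot come from a correctly working pair of counters.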

Removing the cpufreq code to look at aperf/mperf values to calculate the next desired frequency fixed the problem for Peter and the machine starts switching frequencies as expected.

Be aware that vyncere's problem is something totally different, but that came out later in the bug.

Just a guess: Could it be that the BIOS misconfigured some clock multiplier and tsc and mperf are running too slow?
I didn't look at the rate tsc/mperf/aperf are really running at.
Comment 53 David Tomaschik 2011-08-11 12:59:49 UTC
I believe I'm having a similar problem.  My hardware is a Dell Latitude E5420 with an i5-2520M CPU.  I have the latest Dell BIOS for this system.  I'm currently on the Ubuntu kernel 2.6.38-10-generic, and after I resume from suspend, CPU frequency scaling no longer works.  Before suspending, cpufreq-aperf shows sane values.  Afterwards, it gives frequencies of about 625MHz:
cpufreq-aperf
CPU	Average freq(KHz)	Time in C0	Time in Cx	C0 percentage
000	0625250			00 sec 077 ms	00 sec 922 ms	07
001	0625250			00 sec 007 ms	00 sec 992 ms	00
002	0600240			00 sec 098 ms	00 sec 901 ms	09
003	0625250			00 sec 002 ms	00 sec 997 ms	00

I believe that, because of this, the cpufreq scaling won't bump things up. To quote Thomas: "That is the reason why the cpufreq subsystem, taking aperf values to calculate the next frequency into account never raises the frequency."

Any idea why it would only occur after a suspend/resume cycle and what I can do to fix the issue?
Comment 54 David Tomaschik 2011-08-12 13:20:36 UTC
An update on my post above.  Disabling TurboBoost in the BIOS seems to have resolved the issue.  I wonder if TurboBoost somehow causes mis-reported aperf statistics.

Interestingly enough, the ratios are the same as the turboboost ratio on this CPU.  That is, 3.2GHz (Turbo)/2.5 GHz (Stock) == 800 MHz (Real minimum)/625 MHz (reported minimum via aperf).  I don't know if it matters, just thought it was notable.
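
The ratio can be checked exactly with integer fractions. This is a sketch using the frequencies quoted above in MHz; the helper rescale_khz is only illustrative:

```python
from fractions import Fraction

turbo_over_stock = Fraction(3200, 2500)    # 3.2 GHz turbo / 2.5 GHz stock
real_over_reported = Fraction(800, 625)    # 800 MHz real / 625 MHz reported

# Both reduce to 32/25 (= 1.28), so the two ratios are identical.
ratios_match = turbo_over_stock == real_over_reported


def rescale_khz(reported_khz, ratio=turbo_over_stock):
    """Scale a reported average by the turbo/stock ratio (truncated to whole kHz)."""
    return int(reported_khz * ratio)
```

Applied to the 625250 kHz reading from the table above, rescale_khz gives a value just over 800 MHz, consistent with the counters being off by the turbo/stock factor.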
Comment 55 Thomas Renninger 2011-08-15 15:04:25 UTC
You could use
tools/power/x86/turbostat.c
from the latest mainline kernel and replace two lines:
print_counters(cnt_delta);
with
dump_cnt(cnt_delta);
and compare with/without turboboost.
You could also use tools/power/cpupower/ with the debug option compiled in (Makefile) and then run cpupower -d monitor -m Mperf,
but this won't be as nicely formatted.

You may be able to disable turboboost at runtime by reading an MSR, changing one bit and writing the value back.
According to chapter
14.3.2.2 OS Control of Opportunistic Processor Performance Operation
of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A,
the bit is bit 32 (counting from 0) of the IA32_PERF_CTL MSR (0199H).
You have to make sure the msr driver is compiled in or loaded as a module (modprobe msr);
then you can use msr-tools:
rdmsr 0x199
will show you the 64 bit register.
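Per that chapter, setting bit 32 disengages turbo mode and clearing it re-engages it, so with turboboost active you should find the bit unset.
If I haven't overlooked something you can disable/enable turbo mode via:
# rdmsr prints hex without a 0x prefix, so add one for shell arithmetic
IA32_PERF_CTL=0x`rdmsr 0x199`
# disable (set bit 32)
wrmsr -a 0x199 $(((1 << 32) | $IA32_PERF_CTL))
# enable (clear bit 32)
wrmsr -a 0x199 $((~(1 << 32) & $IA32_PERF_CTL))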
The -a option only exists in the latest msr-tools git version, which can be found here:
git://git.kernel.org/pub/scm/utils/cpu/msr-tools/msr-tools.git
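
The read-modify-write on bit 32 can be tried out on plain numbers first, without touching the MSR. This is a sketch with made-up helper names and a made-up sample register value; it relies on rdmsr printing hexadecimal without a 0x prefix by default:

```python
TURBO_BIT = 32  # bit 32 of IA32_PERF_CTL (MSR 0x199), counted from 0
MASK_64 = 0xFFFFFFFFFFFFFFFF


def parse_rdmsr(output):
    """Convert rdmsr's default output (hex digits, no 0x prefix) to an int."""
    return int(output.strip(), 16)


def set_bit(value, bit=TURBO_BIT):
    """Return value with the given bit set."""
    return value | (1 << bit)


def clear_bit(value, bit=TURBO_BIT):
    """Return value with the given bit cleared, kept within 64 bits."""
    return value & ~(1 << bit) & MASK_64


def bit_is_set(value, bit=TURBO_BIT):
    """Return True if the given bit is set in value."""
    return bool(value >> bit & 1)


current = parse_rdmsr("1500\n")   # hypothetical register content
to_write = set_bit(current)       # value that would be handed to wrmsr
```

Which state of the bit corresponds to turbo being off should be taken from the IA32_PERF_CTL description in the manual before writing anything back.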

Something to play with..., hopefully you find out something pointing to the root cause...
Comment 56 Thomas Renninger 2011-08-15 15:08:25 UTC
You can verify whether turboboost is enabled/disabled by:
- utilizing one core:
  cat /dev/zero >/dev/null &
- run turbostat or "cpupower monitor -m Mperf"
  and check the average frequency of the utilized core to see whether it got boosted
Comment 57 Len Brown 2012-06-18 19:11:20 UTC
The issue reported by the original poster is gone.
Other people have come in and out of this bug report,
some leaving with their issue solved, some not.

If yours is not fixed, please open a new bug,
because this one is closed.
