Bug 14742

Summary: 2.6.32 new menu idle governor causes very high CPU temp - HP zv5000 (P4/2.66GHz)
Product: Power Management Reporter: akwatts
Component: Other Assignee: power-management_other
Status: CLOSED CODE_FIX    
Severity: high CC: akpm, akwatts, arjan, arjan, felipe.contreras, lenb, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 14230    
Attachments: Powertop -d output on pristine 2.6.32 kernel (aka hot kernel)
Powertop -d output on patched 2.6.32 kernel (aka cool kernel)
dmidecode output

Description akwatts 2009-12-05 17:24:15 UTC
On an HP zv5000 (P4/2.66GHz) laptop running 2.6.32, the symptom is a CPU temp that creeps up continuously until leveling off at 55-57 C (2nd fan keeps it from going beyond that). Using the "nolapic" parameter or reverting the commit below resolves the issue and CPU idle temp returns to 34-35 C. (Note: temperature increase is confirmed at fan exhausts)

   problem commit (determined through git bisection):
   ----------
   commit 69d25870f20c4b2563304f2b79c5300dd60a067e
   Author: Arjan van de Ven <arjan@infradead.org>
   Date:   Mon Sep 21 17:04:08 2009 -0700
   cpuidle: fix the menu governor to boost IO performance
   ----------

The kernel is compiled with Local APIC and IO-APIC support for uniprocessors. The .config is based on one used for the 2.6.31.x branch w/o issues.

For now, I run 2.6.32 with 69d25870f20c4b2563304f2b79c5300dd60a067e reverted ("nolapic" creates IRQ problems on this HW) and have noticed no problems resulting from this reversion.

Please let me know if I can provide any additional information that would be of assistance.

~Andy
Comment 1 Arjan van de Ven 2009-12-05 20:01:02 UTC
On Sat, 5 Dec 2009 17:24:17 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:
> For now, I run 2.6.32 with 69d25870f20c4b2563304f2b79c5300dd60a067e
> reverted ("nolapic" creates IRQ problems on this HW) and have noticed
> no problems resulting from this reversion.
> 
> Please let me know if I can provide any additional information that
> would be of assistance.

Can you run powertop -d and give us the result?
It's a good first diagnostic...
Comment 2 akwatts 2009-12-06 18:07:09 UTC
Created attachment 24056 [details]
Powertop -d output on pristine 2.6.32 kernel (aka hot kernel)
Comment 3 akwatts 2009-12-06 18:09:54 UTC
Created attachment 24057 [details]
Powertop -d output on patched 2.6.32 kernel (aka cool kernel)
Comment 4 Arjan van de Ven 2009-12-06 18:26:02 UTC
ok so there is something very interesting here:

Your system, for some reason, seems to exit the C2 state immediately
after entering it. (C2 is only available because the BIOS announces
its presence.) This is a hardware/BIOS bug, and a bad one at that.

In the old code, for some reason, C2 is not used in
practice. With the new code, the governor will try to use C2,
repeatedly. 

I think the real solution is not to change the governor, but to make a
quirk for your system so that Linux will just not use C2 on your
system... 

Can you attach the output of "dmidecode"; we'll need that to make a
quirk.
Comment 5 akwatts 2009-12-07 13:15:01 UTC
Created attachment 24075 [details]
dmidecode output

Many thanks for so quickly pinpointing a likely BIOS/HW problem and for your suggested quirk solution. I tried to view similar Cn residency stats using Intel's PowerInformer 1.2 on Windows but was met with "Add Pdh counter...failed...".

Needless to say, not too happy that I might have either broken HW or a buggy BIOS. Can you think of any other parts of the kernel that can be negatively affected by a broken, yet announced, C2?

Attached is my dmidecode output for quirk creation.

~Andy
Comment 6 Rafael J. Wysocki 2010-01-08 23:27:54 UTC
On Friday 08 January 2010, Andrew Watts wrote:
> All:
> 
> Though there has been no activity on this bug since my last comment on
> 12/7/09, the issue remains unresolved. I have confirmed that the problem
> occurs only on linux using the new menu idle governor code
> (69d25870f20c4b2563304f2b79c5300dd60a067e) . The HW performs perfectly well
> either: (a) in linux with the old idle governor code or (b) under windows XP.
> 
> ~Andy
Comment 7 Arjan van de Ven 2010-01-09 21:50:34 UTC
On Fri, 8 Jan 2010 23:29:55 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:


the following patch should make this work:

acpi: Add the HP Pavilion zv5000 to the power DMI table

The HP Pavilion zv5000 is reported (see bug 14742) to not work well in C2;
in fact the system exits C2 immediately.

This patch adds a DMI entry for this system so that C2 is not used on this
machine.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index d1676b1..97d2ee6 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -110,6 +110,10 @@ static struct dmi_system_id __cpuinitdata processor_power_dmi_table[] = {
 	  DMI_MATCH(DMI_BIOS_VENDOR,"Phoenix Technologies LTD"),
 	  DMI_MATCH(DMI_BIOS_VERSION,"SHE845M0.86C.0013.D.0302131307")},
 	 (void *)2},
+	{ set_max_cstate, "Pavilion zv5000", {
+	  DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+	  DMI_MATCH(DMI_PRODUCT_NAME,"Pavilion zv5000 (DS502A#ABA)")},
+	 (void *)1},
 	{},
 };
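
The matching rule used by the DMI entry above can be sketched in userspace. This is a minimal illustration, not kernel code: it assumes the running system exposes /sys/class/dmi/id/sys_vendor and /sys/class/dmi/id/product_name, and the names QUIRKS, quirk_max_cstate and read_dmi are hypothetical helpers invented for this sketch. DMI_MATCH compares by substring, so the sketch does the same.

```python
from pathlib import Path

# Hypothetical userspace mirror of the two DMI_MATCH lines in the patch.
# Each entry: (fields that must all match by substring, max C-state to allow).
QUIRKS = [
    ({"sys_vendor": "Hewlett-Packard",
      "product_name": "Pavilion zv5000 (DS502A#ABA)"}, 1),
]


def quirk_max_cstate(dmi, quirks=QUIRKS):
    """Return the max C-state of the first matching quirk, or None."""
    for matches, max_cstate in quirks:
        # DMI_MATCH is a substring test, so "in" is used rather than "==".
        if all(needle in dmi.get(field, "") for field, needle in matches.items()):
            return max_cstate
    return None


def read_dmi(base="/sys/class/dmi/id"):
    """Read the two DMI fields used above; missing files become empty strings."""
    out = {}
    for field in ("sys_vendor", "product_name"):
        try:
            out[field] = (Path(base) / field).read_text().strip()
        except OSError:
            out[field] = ""
    return out


if __name__ == "__main__":
    print(quirk_max_cstate(read_dmi()))
```

Run on the affected laptop, this would be expected to report a limit of 1 only if the firmware strings contain exactly the vendor and product text used in the patch; on any other machine it reports None.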
Comment 8 Arjan van de Ven 2010-01-11 16:29:19 UTC
[ A request for people who see a similar issue: Please file a separate bug for each different machine. So unless you have an HP Pavilion zv5000, you need a different bug]
Comment 9 Felipe Contreras 2010-01-11 16:54:07 UTC
(In reply to comment #8)
> [ A request for people who see a similar issue: Please file a separate bug
> for each different machine. So unless you have an HP Pavilion zv5000, you
> need a different bug]

If this bug is for only one machine, shouldn't that be specified on the summary?
Comment 10 akwatts 2010-01-11 22:48:47 UTC
Confirmed - Arjan's patch applied to 2.6.32.3 limits the laptop to C1 and prevents the overheating seen with the new menu idle governor code (similar result can be achieved using processor.max_cstate=1 or idle=halt). 

With any of these restrictions powertop doesn't report detailed C-state statistics as they're no longer available. Why does the kernel no longer provide information on C0, polling, and C1?
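
For reference, per-state idle counters can also be read directly from the cpuidle sysfs directory when the kernel exposes it. The sketch below is only an illustration: read_cpuidle_states is a hypothetical helper, and whether these files are populated on 2.6.32 with the C-state limit in place is an assumption that has not been checked here. The "time" attribute is the total time spent in the state, in microseconds.

```python
from pathlib import Path


def read_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return a list of {name, usage, time_us} dicts, one per stateN directory."""
    base = Path(cpu_dir)
    if not base.is_dir():
        return []  # cpuidle sysfs not available on this kernel/configuration

    # Keep only directories named state<digits>, ordered by state index.
    dirs = [p for p in base.glob("state*") if p.name[5:].isdigit()]
    dirs.sort(key=lambda p: int(p.name[5:]))

    states = []
    for state_dir in dirs:
        try:
            states.append({
                "name": (state_dir / "name").read_text().strip(),
                "usage": int((state_dir / "usage").read_text()),    # entry count
                "time_us": int((state_dir / "time").read_text()),   # microseconds
            })
        except (OSError, ValueError):
            continue  # skip states with missing or unparsable attributes
    return states


if __name__ == "__main__":
    for s in read_cpuidle_states():
        print(f"{s['name']:<10} entries={s['usage']:>10} time={s['time_us']:>12} us")
```

An empty result means the directory is absent, which would at least distinguish "the kernel does not export the counters" from "powertop does not display them".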

Also, zv5000 represents the laptop family and the exact model is zv5030us (with unique identifier DS502A#ABA as used by Arjan). Arjan, should the first zv5000 in your DMI entry patch be a zv5030us instead to reflect this? Sorry for not pointing this out from the start.

Finally, isn't it a little early to conclude this is a single-machine-type bug?

~Andy
Comment 11 Arjan van de Ven 2010-01-11 23:20:50 UTC
On 1/11/2010 14:48, bugzilla-daemon@bugzilla.kernel.org wrote:
> Also, zv5000 represents the laptop family and the exact model is zv5030us
> (with unique identifier DS502A#ABA as used by Arjan). Arjan, should the
> first zv5000 in your DMI entry patch be a zv5030us instead to reflect this?
> Sorry for not pointing this out from the start.

it's only a cosmetic thing....

>
> Finally, isn't it a little early to conclude this is a single-machine-type
> bug?

so far, with Fedora 12 shipping this patch already, there are 2 machines total.
Having non-working C states is actually rather rare in general; I don't expect
many more machines.
(And even then, doing this kind of table is the right thing; you really do not have C2,
and you don't even want to pretend you have it because there's a cost associated with that,
which normally gets offset by power savings, but not in your case.)
Comment 12 akwatts 2010-01-12 00:42:38 UTC
Ah, I see there was a report from an ASUSTek user this month (http://patchwork.kernel.org/patch/71962/).

Is there a way for the kernel to still provide detailed C-state information after applying your patch Arjan? Right now that's not the case.

Thanks,

~Andy
Comment 13 Rafael J. Wysocki 2010-02-07 22:18:08 UTC
Handled-By : Arjan van de Ven <arjan@linux.intel.com>
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=14742#c7
Comment 14 Rafael J. Wysocki 2010-02-21 21:34:56 UTC
Fixed by commit 370d5cd88509b93b76eb2f5f97efbd71c25061cb.
Comment 15 Len Brown 2010-04-03 20:06:38 UTC
Andy,
On Windows, can you run perfmon and see if they are able to get
into C2 at all? (you have to add (+) the appropriate cpu counters)

Also, with the DMI entry backed out to see the bug again,
can you bring up the system in single user mode to see
if the problem is seen when the network interfaces are not up?
bug 15377 sees a similar failure, but only after the network is probed.
Comment 16 Len Brown 2010-04-03 20:44:59 UTC
also, please paste here the output from "cat /proc/cpuinfo"
Comment 17 akwatts 2010-04-05 09:43:04 UTC
Len, the perfmon information you requested is quite interesting especially when compared to my powertop output. Hopefully, you can shed some light here...

Regarding bug #15377, I compiled 2.6.33.2 with 370d5cd88509b93b76eb2f5f97efbd71c25061cb reverted and the CPU temp shoots up in single user mode (powertop -d shows 180530.5 wakeups-from-idle per second). When I boot with init=/bin/bash it is cool up until I add processor.ko.

========================

Windows Perfmon

A. Idling
% C1 Time                 0.000
% C2 Time                96.683
% C3 Time                 0.000
% Idle Time              96.875
% Processor Time          3.125
C1 Transitions/sec        0.000
C2 Transitions/sec      138.003
C3 Transitions/sec        0.000
Interrupts/sec          176.003

B. Scanning for viruses
% C1 Time                18.244
% C2 Time                11.424
% C3 Time                 0.000
% Idle Time              31.250
% Processor Time         68.750
C1 Transitions/sec      160.001
C2 Transitions/sec      160.001
C3 Transitions/sec        0.000
Interrupts/sec          478.004
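
A rough way to compare the Windows figures above with the powertop figure quoted earlier in this comment (180530.5 wakeups-from-idle per second) is to convert both into time per idle period. The sketch below is back-of-the-envelope arithmetic only; the function names are hypothetical, and it assumes that "% C2 Time" divided by "C2 Transitions/sec" gives a meaningful average residency.

```python
def mean_residency_ms(pct_time, transitions_per_sec):
    """Average time per visit to a C-state, in milliseconds."""
    if transitions_per_sec <= 0:
        return None  # state never entered; no meaningful average
    return (pct_time / 100.0) / transitions_per_sec * 1000.0


def mean_wakeup_interval_us(wakeups_per_sec):
    """Average spacing between wakeups, in microseconds.

    An idle period cannot on average last longer than this.
    """
    return 1e6 / wakeups_per_sec


if __name__ == "__main__":
    # Windows, idling: 96.683 % C2 time at 138.003 C2 transitions/sec.
    print(f"Windows idle C2: {mean_residency_ms(96.683, 138.003):.3f} ms per visit")
    # Windows, virus scan: 11.424 % C2 time at 160.001 C2 transitions/sec.
    print(f"Windows busy C2: {mean_residency_ms(11.424, 160.001):.3f} ms per visit")
    # Linux without the DMI entry: 180530.5 wakeups-from-idle per second.
    print(f"Linux hot kernel: {mean_wakeup_interval_us(180530.5):.2f} us between wakeups")
```

With these inputs, Windows averages roughly 7 ms per C2 visit while idling, whereas the Linux figure leaves only about 5.5 microseconds between wakeups, a difference of about three orders of magnitude.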

==========================

/proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.66GHz
stepping        : 9
cpu MHz         : 2666.970
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips        : 5333.94
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 32 bits virtual
power management:
Comment 18 akwatts 2010-04-28 20:29:58 UTC
*BUMP*