Bug 56601

Summary: bogus temperature reported --- HP nw9440
Product: Power Management Reporter: Matthias (morpheusxyz123)
Component: Thermal Assignee: Zhang Rui (rui.zhang)
Status: CLOSED DOCUMENTED    
Severity: normal CC: auxsvr, hegge, jake
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: linux-3.9-rc6 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg
kernelconfig
Measurements for comment 2 linux-3.8.6
Measurements for comment 2 linux-3.9-rc1
Measurements for comment 3 linux-3.9-rc1
info from various commands for fan debugging

Description Matthias 2013-04-14 10:36:03 UTC
Created attachment 98551 [details]
dmesg

The fan on my nw9440 laptop running linux-3.9-rc6 is not dethrottling.

To reproduce this do the following:

1: Put load on each cpu core to generate some heat: cat /dev/zero > /dev/null
2: Wait for the fan to speed up 
3: Terminate the cat jobs. Machine is now idle.
4: Wait
5: Fan is still spinning at the speed it was when the machine was under load.

Expected behavior: Decrease fan speed with decreasing temperatures.
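
The reproduction steps above can be scripted. The following is a minimal sketch, not part of the original report: it starts one busy-loop child process per core (comparable to running cat /dev/zero > /dev/null once per core), waits, and then stops them so the machine goes idle again. The function name load_all_cores and its parameters are hypothetical.

```python
import os
import subprocess
import sys
import time


def load_all_cores(seconds, workers=None):
    """Run one busy-loop child process per core for `seconds`, then stop them.

    Hypothetical helper: each child spins forever, which generates heat in
    roughly the same way as `cat /dev/zero > /dev/null` on every core.
    """
    count = workers or os.cpu_count() or 1
    children = [
        subprocess.Popen([sys.executable, "-c", "while True: pass"])
        for _ in range(count)
    ]
    try:
        time.sleep(seconds)
    finally:
        # Step 3 of the report: terminate the load so the machine is idle.
        for child in children:
            child.terminate()
        for child in children:
            child.wait()
    return count
```

After a call such as load_all_cores(120) returns, the machine is idle and the fan would be expected to slow down as the temperature drops.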

First affected version: linux-3.9-rc1
Last known version to work: linux-3.8.6
Comment 1 Matthias 2013-04-14 10:36:37 UTC
Created attachment 98561 [details]
kernelconfig
Comment 2 Zhang Rui 2013-04-14 16:28:24 UTC
please attach the output of "ll /sys/class/thermal/t*/c*" for both 3.8.6 and 3.9-rc1

please attach the output of "grep . /sys/class/thermal/*/*" when
1. before heating up
2. when the temperature is high
3. after the machine idles for a while
in both 3.8.6 and 3.9-rc1.
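
To make the three measurements easier to compare, the same data can be collected programmatically. This is only a sketch under assumptions: thermal_snapshot and changed_attributes are hypothetical helper names, and the snapshot only roughly mirrors what grep . /sys/class/thermal/*/* prints (it stores the whole content of each readable, non-empty attribute file).

```python
import glob
import os


def thermal_snapshot(root="/sys/class/thermal"):
    """Return {relative_path: contents} for every readable attribute file."""
    snapshot = {}
    for path in sorted(glob.glob(os.path.join(root, "*", "*"))):
        if not os.path.isfile(path):
            continue  # skip subdirectories and links to directories
        try:
            with open(path) as handle:
                text = handle.read().strip()
        except (OSError, UnicodeDecodeError):
            continue  # some attributes are write-only or fail to read
        if text:
            snapshot[os.path.relpath(path, root)] = text
    return snapshot


def changed_attributes(before, after):
    """Return {path: (old, new)} for attributes whose value differs."""
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(set(before) | set(after))
        if before.get(key) != after.get(key)
    }
```

Taking one snapshot before heating up, one at high temperature, and one after idling, and then diffing them with changed_attributes, shows which zones and cooling devices changed between the states.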
Comment 3 Zhang Rui 2013-04-15 01:09:08 UTC
please attach the output of
grep . /sys/class/thermal/*/device/path
ll /sys/class/thermal/t*/c*

please turn on the fans one by one and check which cooling device makes the
temperature bogus.
Comment 4 Jake Edge 2013-04-15 14:28:58 UTC
This doesn't look like the bug I am seeing either.  That bug is always associated with resume on the hp2510p, always spins the fan up to 100%, and the 'sensors' output has weirdness for 'temp6'.  It has also been in the kernel since the 3.7 merge window (3.7-rc1 has it, 3.6.11 and earlier do not).

Do you want me to start a new bug for this or continue here?   If the latter, what do you need from me?  I attached my acpidump to the earlier bug.  I am using the step-wise governor when building the kernels.

It seems likely that all of these are related to the thermal changes Rui did for 3.7, so I would guess they are all related in that sense.
Comment 5 Matthias 2013-04-15 14:53:50 UTC
On resume my fan is also spinning at full speed. So far I have been shutting it down manually. With linux-3.9-rc1 suspend is totally broken for my machine. But one bug at a time.

@ Zhang Rui: I will provide you with the information you asked for shortly.
Comment 6 Matthias 2013-04-15 16:16:53 UTC
Created attachment 98791 [details]
Measurements for comment 2 linux-3.8.6
Comment 7 Matthias 2013-04-15 16:17:42 UTC
Created attachment 98801 [details]
Measurements for comment 2 linux-3.9-rc1
Comment 8 Matthias 2013-04-15 16:18:41 UTC
Created attachment 98811 [details]
Measurements for comment 3 linux-3.9-rc1
Comment 9 Matthias 2013-04-15 16:32:45 UTC
The bogus reported temperatures of thermal_zone5 seem as if it is reporting the fan speed in percent. Both times when it reported 100°C the fan was spinning at full speed. I made the measurement on an idle machine and the reported temperatures concur with the fan speeds I gave you in Bug 50041. 
I will have an eye on thermal_zone5 and the temperatures it reports.
Comment 10 Torstein Hegge 2013-04-15 21:12:17 UTC
(In reply to comment #9)
> The bogus reported temperatures of thermal_zone5 seem as if it is reporting the
> fan speed in percent. Both times when it reported 100°C the fan was spinning at
> full speed. I made the measurement on an idle machine and the reported
> temperatures concur with the fan speeds I gave you in Bug 50041.
> I will have an eye on thermal_zone5 and the temperatures it reports.

I see the same thing with my HP 2510p with the 
git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux.git thermal branch.

/sys/class/thermal/thermal_zone4/temp tracks the fan speed in percent,
in levels: 0 -> 30 -> 50 -> 70 -> -> 90 -> 100, with a linear change in
"temperature" between levels when the fan speed changes. The fan speed
and thermal_zone4/temp never decrease while the laptop is on.

However, with 3.9-rc7 the fan never starts and thermal_zone4/temp stays
at zero. After suspend and resume the fan runs at full speed and
thermal_zone4/temp is 100 C.
Comment 11 Zhang Rui 2013-04-16 08:23:12 UTC
(In reply to comment #4)
> This doesn't look like the bug I am seeing either.  That bug is always
> associated with resume on the hp2510p, always spins the fan up to 100%, and the
> 'sensors' output has weirdness for 'temp6'.  It has also been in the kernel
> since the 3.7 merge window (3.7-rc1 has it, 3.6.11 and earlier do not).
> 
> Do you want me to start a new bug for this or continue here?   If the latter,
> what do you need from me?  I attached my acpidump to the earlier bug.  I am
> using the step-wise governor when building the kernels.
> 
> It seems likely that all of these are related to the thermal changes Rui did
> for 3.7, so I would guess they are all related in that sense.

yes.

there are three problems that I can see.
1. bogus temperature reported
2. fan always on after resume
3. fan always on after the temperature becomes high once.
these three may or may not be related.
But I think I've found the root cause of the 3rd problem. please try the patch at https://bugzilla.kernel.org/show_bug.cgi?id=56591#c13 first.
Comment 12 Matthias 2013-04-16 13:12:13 UTC
I tested your patch from https://bugzilla.kernel.org/show_bug.cgi?id=56591#c13 with linux-3.9-rc7 and the fan dethrottles. Nice catch. Thanks!
But thermal_zone5 still reports bogus (fan speed in percent?) temperatures.
Comment 13 Zhang Rui 2013-04-16 14:39:58 UTC
Okay.
Let me summarize the thermal problems on your machine.
1. fan cooling device does not reflect the actual state. this is a regression between 3.4 and 3.5, and it is reported in bug 50041.
2. fan does not throttle since 3.7. this is fixed by the patch in comment #13 of bug #56591.
3. thermal_zone reports bogus temperature. that is the bug we want to check in this bug report. this may not be a regression. the difference is that it used to show a lower temperature when the fan can throttle, and 100C when problem 2 occurs. right?
Comment 14 Zhang Rui 2013-04-16 14:53:26 UTC
here is the AML code for reporting thermal temperature.
thermal_zone0:
            Method (_TMP, 0, Serialized)  // _TMP: Temperature
            {
                If (LEqual (C320, 0x00))
                {
                    \_TZ.C32F ()
                    Store (0x01, C320)
                }

                Return (C331 (0x00))
            }
thermal_zone1:
            Method (_TMP, 0, Serialized)  // _TMP: Temperature
            {
                Return (C331 (0x01))
            }
thermal_zone2:
            Method (_TMP, 0, Serialized)  // _TMP: Temperature
            {
                Store (C331 (0x02), Local0)
                Store (Local0, C325)
                Return (Local0)
            }
thermal_zone3:
            Method (_TMP, 0, Serialized)  // _TMP: Temperature
            {
                Return (C331 (0x03))
            }
thermal_zone4:
            Method (_TMP, 0, Serialized)  // _TMP: Temperature
            {
                Return (C331 (0x04))
            }
thermal_zone5:
            Method (_TMP, 0, Serialized)  // _TMP: Temperature
            {
                Store (0x1E, Local0)
                Acquire (\_SB.C003.C004.C006.C155, 0xFFFF)
                If (\_SB.C003.C004.C006.C157)
                {
                    Store (\_SB.C003.C004.C006.C189, Local0)
                }

                Release (\_SB.C003.C004.C006.C155)
                If (LGreater (Local0, 0x64))
                {
                    Store (0x64, Local0)
                }

                Multiply (Local0, 0x0A, Local0)
                Add (Local0, 0x0AAC, Local0)
                Return (Local0)
            }
first, the _TMP method is quite different from the others.
second,
                If (LGreater (Local0, 0x64))
                {
                    Store (0x64, Local0)
                }
this means if the temperature reported by C189 is higher than 100 (0x64), override it to 100 instead. But there is only a critical trip point 110C in this thermal zone, so IMO, this is a bogus thermal zone, because we'll never take any action for this thermal zone.
And the information you attached in comment #8 convinces me that this is a BIOS problem.
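
The arithmetic of the thermal_zone5 _TMP method can be worked through in a few lines. This is an illustrative sketch, not code from the kernel or the BIOS: the function names are hypothetical, and the parameters c157 and c189 simply stand for the values of the identically named AML objects, whose real meaning is not known from the dump. ACPI _TMP methods return tenths of a kelvin, and 0x0AAC is 2732, so the method effectively reports Local0 as degrees Celsius.

```python
ZERO_CELSIUS_DK = 0x0AAC  # 2732 tenths of a kelvin, the offset the AML adds


def zone5_tmp(c157, c189):
    """Mirror the arithmetic of the thermal_zone5 _TMP method.

    c157 and c189 stand for the AML objects of the same name; the result is
    in tenths of a kelvin, as the AML returns it.
    """
    local0 = 0x1E  # default of 30 when C157 is not set
    if c157:
        local0 = c189
    if local0 > 0x64:
        local0 = 0x64  # clamp to 100
    return local0 * 0x0A + ZERO_CELSIUS_DK


def dk_to_celsius(deci_kelvin):
    """Convert tenths of a kelvin to degrees Celsius using the same offset."""
    return (deci_kelvin - ZERO_CELSIUS_DK) / 10
```

With C157 unset the method returns 3032, i.e. 30 C. Any C189 value above 100 is clamped, so the zone can never report more than 3732, i.e. 100 C, which is below the 110C critical trip point mentioned above.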
Comment 15 Zhang Rui 2013-04-16 14:59:29 UTC
So I'll make this bug report address the bogus temperature problem only, and close it now.

For issue 2 addressed in comment #13, I'll try to push the patch upstream, and we can check the status in bug #56591.
For issue 1 addressed in comment #13, let's continue to debug it in bug #50041.

BTW, this is really a good bug report, thanks for your quick and valuable feedback. It is really helpful.
Comment 16 Zhang Rui 2013-04-16 15:08:29 UTC
(In reply to comment #4)
> This doesn't look like the bug I am seeing either.

Maybe it does. :p

> That bug is always
> associated with resume on the hp2510p, always spins the fan up to 100%,

can you try the patch at https://bugzilla.kernel.org/show_bug.cgi?id=56591#c13
and see if the problem still exists after resume?

> and the
> 'sensors' output has weirdness for 'temp6'.

If the fan can be turned off, do you still get weird temperature output?

>  It has also been in the kernel
> since the 3.7 merge window (3.7-rc1 has it, 3.6.11 and earlier do not).
>
Maybe the temperature looks normal just because the fan is running at low speed.
please check the information in comment #8. can you get a similar result?
 
> Do you want me to start a new bug for this or continue here?

if the bogus temperature is not related to the fan state, which I do not think is the case after checking your acpidump attached in https://bugzilla.kernel.org/show_bug.cgi?id=56591#c1, please file a new bug.
Comment 17 Jake Edge 2013-05-04 14:31:26 UTC
Ok, sorry for disappearing for a while, but I am not sure where this stands.  The problem still exists in 3.9 even if I apply the patch in bug #56591 (to be expected I think).  I will attach the info asked for in comments 2 and 3, but I think the root of the problem is as you described above.

In the "good" state (first boot, before sleep/resume), temp 6 is like 20° or something, doing a:

echo 1 >/sys/bus/acpi/drivers/fan/PNP0C0B\:00/thermal_cooling/cur_state

causes the fan to come on full blast *and* the temperature in temp6 to go to 100° and stay there.  Turning the fan back off (echo 0 ...) throttles back the fan to off, and temp 6 returns (over 10-15 seconds) to 20°.

This sounds like (part of) what Matthias saw too, but I'm still not clear on what the fix or the plan is.
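
The manual experiment above (switch the fan, then watch the reported temperature) can be sketched as two small helpers. These are hypothetical and only illustrate the idea: set_cooling_state writes a cur_state attribute the same way the echo command does, and read_zone_celsius reads a thermal zone's temp attribute, which sysfs exposes in millidegrees Celsius. Which thermal zone corresponds to 'temp6' in the sensors output is not stated here, so the directories passed in are placeholders.

```python
import os


def set_cooling_state(cooling_dir, state):
    """Write `state` to the cur_state attribute of a cooling device directory."""
    with open(os.path.join(cooling_dir, "cur_state"), "w") as handle:
        handle.write(f"{state}\n")


def read_zone_celsius(zone_dir):
    """Read a thermal zone's temp attribute (millidegrees Celsius) as degrees."""
    with open(os.path.join(zone_dir, "temp")) as handle:
        return int(handle.read().strip()) / 1000
```

Calling set_cooling_state on the fan's cooling device directory with 1 and then polling read_zone_celsius on the suspect zone for 10-15 seconds would reproduce the observation that the reported temperature follows the fan state. Writing to sysfs normally requires root.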
Comment 18 Jake Edge 2013-05-04 14:41:02 UTC
Created attachment 100701 [details]
info from various commands for fan debugging

This is in reference to c#2 and c#3 ... along with my previous comment.
Comment 19 Zhang Rui 2013-05-16 01:11:12 UTC
(In reply to comment #17)
> Ok, sorry for disappearing for a while, but I am not sure where this stands.
> The problem still exists in 3.9 even if I apply the patch in bug #56591 (to be
> expected I think).  I will attach the info asked for in comments 2 and 3, but I
> think the root of the problem is as you described above.
> 
> In the "good" state (first boot, before sleep/resume), temp 6 is like 20° or
> something, doing a:
> 
> echo 1 >/sys/bus/acpi/drivers/fan/PNP0C0B\:00/thermal_cooling/cur_state
> 
> causes the fan to come on full blast *and* the temperature in temp6 to go to
> 100° and stay there.  Turning the fan back off (echo 0 ...) throttles back the
> fan to off, and temp 6 returns (over 10-15 seconds) to 20°.
> 
> This sounds like (part of) what Matthias saw too, but I'm still not clear on
> what the fix or the plan is.

https://bugzilla.kernel.org/show_bug.cgi?id=56591#c32
is the fix.

bogus temperature report is a BIOS bug that can not be fixed.

What we can do here is to turn the fan OFF when it should be (as I did in the patch), then temp 6 should return a normal value because the fan is off.
Comment 20 Jake Edge 2013-05-16 01:17:13 UTC
(In reply to comment #19)

> https://bugzilla.kernel.org/show_bug.cgi?id=56591#c32
> is the fix.

It does not work for me.  I applied that patch on top of 3.9 and still have the same problem when the system resumes.

> bogus temperature report is a BIOS bug that can not be fixed.

I understand.  But, before things changed, I did not get this behavior, so something needs to change.  Right now I have to manually shut the fans off after a resume.

Do I need to open a separate bug or do you want to debug it in bug #56591 (I assume you don't want to do it in this bug)?
Comment 21 Zhang Rui 2013-05-16 02:19:46 UTC
(In reply to comment #20)
> (In reply to comment #19)
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=56591#c32
> > is the fix.
> 
> It does not work for me.  I applied that patch on top of 3.9 and still have the
> same problem when the system resumes.
> 
I see.
It seems that your problem is related to suspend/resume, which is probably not covered by the previous patch.
please file a new bug report.
and please attach the output of "grep . /sys/class/thermal/*/*" when the bug is reproduced after resume.