Bug 13918 - Processor not throttled when overheating - HP 2510p
Status: CLOSED DOCUMENTED
Alias: None
Product: ACPI
Classification: Unclassified
Component: BIOS
Hardware: All Linux
Importance: P1 normal
Assignee: acpi_bios
Reported: 2009-08-05 11:52 UTC by Frans Pop
Modified: 2009-11-05 23:02 UTC

Kernel Version: 2.6.31-rc5
Tree: Mainline
Regression: No


Attachments
ACPI DSDT table (601.90 KB, text/plain)
2009-08-05 11:52 UTC, Frans Pop
dmesg after booting the system (33.51 KB, text/plain)
2009-08-05 11:54 UTC, Frans Pop
Kernel configuration (68.51 KB, text/plain)
2009-08-05 11:54 UTC, Frans Pop
debug patch (1.76 KB, patch)
2009-08-11 08:55 UTC, Zhang Rui
dmesg with debugging from patch in comment#4 (38.27 KB, text/plain)
2009-08-11 12:02 UTC, Frans Pop
debug patch v2 (2.09 KB, patch)
2009-08-12 05:56 UTC, Zhang Rui
dmesg from 2nd patch + /sys/class/thermal contents (12.33 KB, text/plain)
2009-08-12 08:31 UTC, Frans Pop
debug patch: show all the thermal zones temperature and cooling device state when overheating (1.21 KB, patch)
2009-08-13 05:59 UTC, Zhang Rui

Description Frans Pop 2009-08-05 11:52:27 UTC
Created attachment 22610 [details]
ACPI  DSDT table

System: HP 2510p Core Duo notebook, x86_64 kernel; Debian stable ("Lenny")

Yesterday my notebook (HP 2510p Core Duo) shut down hard due to overheating while compiling a kernel and running vlc.

Both processor cores continued to run at full speed until the system shut down. IIUC the cores should get slowed down (T-state followed by P-state) before that happens.
I was using the ondemand governor for frequency scaling.

One thing I noticed is that, although the cores are registered as cooling devices, they are not bound to any thermal zones:

/sys/class/thermal$ grep . cooling_device*/type
cooling_device0/type:Fan
cooling_device1/type:Fan
cooling_device2/type:Fan
cooling_device3/type:Fan
cooling_device4/type:Fan
cooling_device5/type:Fan
cooling_device6/type:Fan
cooling_device7/type:Processor
cooling_device8/type:Processor
cooling_device9/type:LCD

/sys/class/thermal$ ls -1d thermal_zone*/cdev[0-9]
thermal_zone0/cdev0
thermal_zone0/cdev1
thermal_zone0/cdev2
thermal_zone0/cdev3
thermal_zone1/cdev0
thermal_zone1/cdev1
thermal_zone1/cdev2
thermal_zone1/cdev3
thermal_zone1/cdev4
thermal_zone1/cdev5
thermal_zone1/cdev6
thermal_zone3/cdev0
thermal_zone3/cdev1
thermal_zone4/cdev0

As you can see, cdev7 and cdev8 are not listed. But according to Documentation/thermal/sysfs-api.txt, they should be listed "if the processor is listed in _PSL method", and AFAICT that is the case in my DSDT for zones 0, 3, 4 and 6 (through method C3B4).

AFAICT all required modules are loaded (cpufreq_ondemand is compiled in):
cpufreq_conservative    7904  0
cpufreq_userspace       3604  0
cpufreq_stats           4808  0
cpufreq_powersave       1776  0
coretemp                6720  0
acpi_cpufreq            8096  0
processor              39960  3 acpi_cpufreq
thermal                16160  0
thermal_sys            16768  4 video,processor,thermal,fan
Comment 1 Frans Pop 2009-08-05 11:54:00 UTC
Created attachment 22611 [details]
dmesg after booting the system
Comment 2 Frans Pop 2009-08-05 11:54:43 UTC
Created attachment 22612 [details]
Kernel configuration
Comment 3 Zhang Rui 2009-08-11 08:54:58 UTC
please attach the full acpidump output.
please attach the output of "grep . /proc/acpi/thermal_zone/*/*"
Comment 4 Zhang Rui 2009-08-11 08:55:46 UTC
Created attachment 22676 [details]
debug patch

please apply this debug patch and attach the dmesg output after boot.
Comment 5 Frans Pop 2009-08-11 12:02:27 UTC
Created attachment 22678 [details]
dmesg with debugging from patch in comment#4

/proc/acpi/thermal_zone$ grep . */*
TZ0/cooling_mode:<setting not supported>
TZ0/polling_frequency:<polling disabled>
TZ0/state:state:                   ok
TZ0/temperature:temperature:       60 C
TZ0/trip_points:critical (S5):     256 C
TZ0/trip_points:passive:           99 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
TZ0/trip_points:active[0]:         88 C: devices=C3C8
TZ0/trip_points:active[1]:         82 C: devices=C3C9
TZ0/trip_points:active[2]:         68 C: devices=C3CA
TZ0/trip_points:active[3]:         50 C: devices=C3CB
TZ0/trip_points:active[4]:         40 C: devices=C3CC
TZ1/cooling_mode:<setting not supported>
TZ1/polling_frequency:<polling disabled>
TZ1/state:state:                   ok
TZ1/temperature:temperature:       60 C
TZ1/trip_points:critical (S5):     110 C
TZ3/cooling_mode:<setting not supported>
TZ3/polling_frequency:<polling disabled>
TZ3/state:state:                   ok
TZ3/temperature:temperature:       51 C
TZ3/trip_points:critical (S5):     105 C
TZ3/trip_points:passive:           95 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
TZ4/cooling_mode:<setting not supported>
TZ4/polling_frequency:<polling disabled>
TZ4/state:state:                   ok
TZ4/temperature:temperature:       34 C
TZ4/trip_points:critical (S5):     110 C
TZ4/trip_points:passive:           60 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
TZ5/cooling_mode:<setting not supported>
TZ5/polling_frequency:<polling disabled>
TZ5/state:state:                   ok
TZ5/temperature:temperature:       50 C
TZ5/trip_points:critical (S5):     110 C
TZ6/cooling_mode:<setting not supported>
TZ6/polling_frequency:<polling disabled>
TZ6/state:state:                   ok
TZ6/temperature:temperature:       25 C
TZ6/trip_points:critical (S5):     70 C
TZ6/trip_points:passive:           60 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
TZ6/trip_points:active[0]:         54 C: devices=C3B1
TZ6/trip_points:active[1]:         48 C: devices=C3B2

Hmmm. So /proc/acpi/thermal_zone does list the CPUs while /sys/class/thermal does not.

And the trip points for the CPUs look to be so high that some critical temperature is reached and the system shuts down before they get activated?
Or is there still something wrong?
Comment 6 Frans Pop 2009-08-11 12:18:10 UTC
I've just done a little test by running empty loops on both cores while watching contents of /proc/acpi/thermal_zone. It looks as if the critical points of zones 1 and 5 (which don't have CPU trips) may get reached before the higher trip points in other zones are reached.
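
A minimal sketch of how that suspicion could be checked from the grep output in comment #5: scan the lines for zones that list a critical trip point but no passive one. The function name and the regular expression are illustrative assumptions; only the line layout is taken from the output shown above.

```python
import re

# Matches lines such as
#   TZ0/trip_points:critical (S5):     256 C
#   TZ0/trip_points:passive:           99 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
# (layout assumed from the /proc/acpi/thermal_zone output in comment #5).
TRIP_RE = re.compile(
    r"^(?P<zone>[^/]+)/trip_points:(?P<kind>critical|passive)\b[^:]*:\s*(?P<temp>\d+) C"
)


def zones_without_passive(grep_output):
    """Return {zone: critical_temp_C} for zones that have no passive trip point."""
    critical = {}
    passive = set()
    for line in grep_output.splitlines():
        match = TRIP_RE.match(line.strip())
        if not match:
            continue  # temperature, state, active[] lines etc. are ignored
        if match["kind"] == "critical":
            critical[match["zone"]] = int(match["temp"])
        else:
            passive.add(match["zone"])
    return {zone: temp for zone, temp in critical.items() if zone not in passive}
```

Fed the full output from comment #5, this should report TZ1 and TZ5, the two zones that can only react by shutting down.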
Comment 7 Zhang Rui 2009-08-12 05:56:48 UTC
Created attachment 22682 [details]
debug patch v2

Weird, the dmesg shows that cooling_device7/8 are bound to the thermal zone successfully...
Please apply this debug patch and attach the dmesg output.
Please attach the output of "grep . /sys/class/thermal/*/*" at the same time.
Comment 8 Frans Pop 2009-08-12 08:31:28 UTC
Created attachment 22687 [details]
dmesg from 2nd patch + /sys/class/thermal contents

Thanks Rui. Here's the new info (only included relevant parts of dmesg).

I wonder if the problem could be related to the fact that the fan cooling devices are discovered before the thermal zones, while the cpu cooling devices are discovered after?
Comment 9 Zhang Rui 2009-08-13 03:54:54 UTC
Hah, I see the problem.
The processors are bound to the thermal zones successfully.

thermal_zone1/cdev0
thermal_zone1/cdev1
thermal_zone1/cdev2
thermal_zone1/cdev3
thermal_zone1/cdev4
thermal_zone1/cdev5
thermal_zone1/cdev6

are not the same as
cooling_device0
cooling_device1
...
cooling_device6

They just stand for the 1st/2nd/.../7th cooling devices in the current thermal zone.

You can run cat thermal_zone1/cdev*/type to get the real cooling device type.

Now the problem is why the processor is not throttled before the critical shutdown.
I agree with your guess in comment #6:
the critical trip point of one thermal zone is reached while the passive trip points of the other thermal zones have not been reached.
Hmm, I'll generate a debug patch to get the full thermal zone status at critical shutdown, probably posted here later today. :)
Comment 10 Zhang Rui 2009-08-13 05:59:25 UTC
Created attachment 22696 [details]
debug patch: show all the thermal zones temperature and cooling device state when overheating

Please apply this patch and put the system into an overheating state,
and then attach the /var/log/messages file after the critical shutdown.
Comment 11 Frans Pop 2009-08-13 18:31:37 UTC
> [cdevX is not equal to cooling_deviceX]
> they just stand for the 1st/2nd/.../7th cooling devices in the current
> thermal zone.

Aargh, you are completely correct. It can also be seen by listing the 
symlinks:
/sys/class/thermal/thermal_zone1$ ls -l cdev[0-9]
lrwxrwxrwx 1 root root 0 2009-08-13 09:32 cdev0 -> ../cooling_device6
lrwxrwxrwx 1 root root 0 2009-08-13 09:32 cdev1 -> ../cooling_device5
lrwxrwxrwx 1 root root 0 2009-08-13 09:32 cdev2 -> ../cooling_device4
lrwxrwxrwx 1 root root 0 2009-08-13 09:32 cdev3 -> ../cooling_device3
lrwxrwxrwx 1 root root 0 2009-08-13 09:32 cdev4 -> ../cooling_device2
lrwxrwxrwx 1 root root 0 2009-08-13 09:32 cdev5 -> ../cooling_device7
lrwxrwxrwx 1 root root 0 2009-08-13 09:32 cdev6 -> ../cooling_device8

That is *extremely* confusing. And the similarity in names (cdev is an 
obvious abbreviation of cooling_device) does not help at all.

I had even done an ls -l a few times, but still missed that they did not 
match. (It's also one of the reasons I still don't like sysfs: things are 
too hard to find and even when you find them it often remains confusing 
and obfuscated.)
Still, sorry for missing that. I still somehow feel I should have seen it 
myself.
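
To make the cdevN versus cooling_deviceN distinction harder to miss, here is a minimal sketch that resolves every cdevN symlink and reports the real device and its type. It assumes the sysfs layout shown in this report (cdevN symlinks pointing at cooling_deviceM directories, each with a "type" file); the function name is hypothetical.

```python
import os
import re


def map_zone_cdevs(root="/sys/class/thermal"):
    """Return {zone: [(cdevN, cooling_deviceM, type), ...]} for every thermal zone."""
    result = {}
    for zone in sorted(os.listdir(root)):
        if not zone.startswith("thermal_zone"):
            continue
        zone_dir = os.path.join(root, zone)
        # Keep only the cdevN links; skip files such as cdevN_trip_point.
        links = [e for e in os.listdir(zone_dir) if re.fullmatch(r"cdev\d+", e)]
        bindings = []
        for link in sorted(links, key=lambda name: int(name[4:])):
            # Follow the symlink to the cooling device it really refers to.
            target = os.path.realpath(os.path.join(zone_dir, link))
            with open(os.path.join(target, "type")) as f:
                dev_type = f.read().strip()
            bindings.append((link, os.path.basename(target), dev_type))
        result[zone] = bindings
    return result
```

On the system described here, calling map_zone_cdevs() would be expected to list cdev5 and cdev6 of thermal_zone1 as cooling_device7 and cooling_device8, both of type Processor.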
Comment 12 Frans Pop 2009-08-13 19:17:37 UTC
> Please apply this patch and put the system into an overheating state,
> and then attach the /var/log/messages file after the critical shutdown.

There are two problems with that.

1) When the system shut down we had very high temperatures for NL (close 
to 30C), but currently it's only ~18C so it is doubtful I can get the 
system to overheat the same way. I could probably simulate hot temps by 
wrapping the notebook in a towel or something, but the circumstances 
would still be different.

2) Any messages just before a critical shutdown do *not* show up in the 
logs. The existing KERN_EMERG message in thermal_zone_device_update()
    Critical temperature reached (%ld C)
is nowhere to be found in the logs for my previous critical shutdown.
Still, it probably makes sense to add the zone in that message.

I assume that is because the shutdown happens too fast for the syslog 
daemon to process the message and write it to the log files. (There is a 
call to emergency_sync() in orderly_poweroff(), so I assume the problem 
must be earlier, probably in the syslog daemon.)

I could possibly work around that by using netconsole to capture console 
messages on another system. That may just be fast enough to see them.

But before I try that I have two questions.

Is thermal throttling of processors effective at all?
From the code it looks as if the processor would only be throttled *one 
step at a time* when a trip point is reached. For my system there is only 
one CPU trip point per thermal zone, and that at fairly high 
temperatures. So if that would only result in a change from T0 (100%) to 
T1 (88%), the change may well be insufficient to prevent the overheating.

Is there any way to add or modify trip point without changing the BIOS?
For example, it looks as if for my system it would make sense to either 
have CPU throttling occur at lower temperatures in zone 1 than the 
current 99C, or to add CPU trip point(s) in zones 3 and 5.
It would be nice if that could be done through a kernel interface.
Comment 13 Zhang Rui 2009-08-17 03:10:40 UTC
(In reply to comment #12)
> 
> But before I try that I have two questions.
> 
> Is thermal throttling of processors effective at all?
> From the code it looks as if the processor would only be throttled *one 
> step at a time* when a trip point is reached.

The passive trip point means that the thermal driver should poke the passive cooling devices (usually processors). It is not mapped to one of the processor's throttling states.
I.e. after the passive trip point is reached, processors may enter deeper throttling states if the temperature is still increasing.

> For my system there is only 
> one CPU trip point per thermal zone,

that's okay. it's true on all platforms

> and that at fairly high temperatures.

Hmm, for a passive trip point, a temperature 10C lower than the critical trip point is also normal.

> Is there any way to add or modify trip point without changing the BIOS?

Yes, there is.
If the thermal driver is built as a module, load it with the module parameter thermal.psv=60.
If the thermal driver is built in, boot the laptop with the boot option thermal.psv=60.
But note that it changes the passive trip point in all the thermal zones.

Anyway, you can give it a try.
you can even set it to a lower value, say 40C to see if the processor can be throttled correctly by the thermal driver.
Comment 14 Frans Pop 2009-08-17 10:53:51 UTC
> i.e. after passive trip point is reached, processors may enter deeper
> throttling state if the temperature is still increasing.

OK.

> > Is there any way to add or modify trip point without changing the
> > BIOS?
>
> Yes, there is: thermal.psv=<temp>
> But note that it changes the passive trip point in all the thermal
> zones.

Thanks. Pity that it can't be set per trip point. Hmm. Could that possibly 
be implemented using something like 'psv=0:85,1:80,3:90' (leaving trip 
points for zones that are not mentioned at their defaults)?

I've tested by setting psv=65 and then doing a kernel build. Results were 
interesting. In general it works nicely, but some comments below.

FYI, my possible cpu frequencies are (using ondemand governor):
   1333000, 1067000, 1200000, 933000, 800000

The thermal limiting tripped in thermal_zone 1, which is probably the 
sensor for the CPU itself.

The first thing that happened as soon as the limit was reached was that 
the cpufreq scaling_max_freq was set to 800000 (lowest value). I guess 
that is correct?
After that it took exactly 60 (!) seconds before the CPU throttling state 
was changed from T0 to T1. 30 seconds later T1 to T2 and again 30 seconds 
later T2 to T3. These times are 100% reproducible.
At that point zone 1 got below the limit. Immediately throttling went from 
T3 to T1 and a bit later to T0. Later again scaling_max_freq was raised 
to 1066400 (see below) and eventually 1333000.

In total it took exactly 2 minutes to reach T3 and it also took around 2 
minutes to get back to full speed again (I did not time that as exactly).

Issues
- After 'modprobe -r thermal; modprobe thermal' /proc/acpi/thermal_zone/
  is not restored: the files that were there before the module is removed
  get deleted, but no new ones are created after reloading the module.
  /sys/class/thermal/ does get created correctly again. There are no
  kernel errors during module unload/reload.
- The first period of 60 seconds before CPU throttling goes from T0 to T1
  seems rather long to me.
- The scaling_max_freq value of 1066400 looks like some kind of rounding
  error. As it is smaller than the closest possible value of 1067000, the
  max is effectively set to only 1200000.

Conclusion is that my original issue was almost certainly due to a zone
that does *not* have a passive trip point overheating to its critical value.
I think I will set 'thermal.psv=85' for now in the hope that that will 
avoid an emergency shutdown in the future.

I'll try to take a look at the above issues myself, but any suggestions or 
debugging patches from you will be most appreciated.

Thanks a lot for all the help and info so far!
Comment 15 Frans Pop 2009-08-17 18:26:05 UTC
> - After 'modprobe -r thermal; modprobe thermal' /proc/acpi/thermal_zone/
>   is not restored

The same happens for the fan module. I've done some basic debugging and AFAICT acpi_thermal_add_fs() runs correctly. So it looks as if files are created, but just not visible to users.

One thing I noticed is that after module removal there still exists an empty directory /proc/acpi/thermal_zone. Could it be that that old dir "masks" the newly created one?

The same happens with .30 and .28 (I tried battery in .28), so it's not some recent change. I can take the issue to lkml myself if you like.
Comment 16 Frans Pop 2009-08-17 18:47:03 UTC
> - After 'modprobe -r thermal; modprobe thermal' /proc/acpi/thermal_zone/
>   is not restored

This looks to be a KDE issue:

# lsof | grep /proc/acpi/ | awk '{print $1" "$5" "$9}'
acpid REG /proc/acpi/event
ksysguard DIR /proc/acpi/battery
ksysguard DIR /proc/acpi/fan
ksysguard DIR /proc/acpi/thermal_zone

No idea why it keeps the dirs locked when it's not using any files in them.

So, let's concentrate on the other two more interesting issues I mentioned in comment #14 :-)
Comment 17 Zhang Rui 2009-08-18 08:24:57 UTC
(In reply to comment #14)
> > Yes, there is: thermal.psv=<temp>
> > But note that it changes the passive trip point in all the thermal
> > zones.
> 
> Thanks. Pity that it can't be set per trip point. Hmm. Could that possibly 
> be implemented using something like 'psv=0:85,1:80,3:90' (leaving trip 
> points for zones that are not mentioned at their defaults)?
> 
No, at least we won't do this in the ACPI thermal driver,
because overriding the passive trip point is not a good idea in the first place.
It's the BIOS that decides when to notify the ACPI thermal driver to check the temperature. For example, the BIOS will send a notification if the default passive trip point is triggered.
But if we override the trip point, we probably won't get a notification when the new trip point is hit.
thermal.psv claims to override the passive trip point, but this is misleading in some cases.

> I've tested by setting psv=65 and then doing a kernel build. Results were 
> interesting. In general it works nicely, but some comments below.
> 
> FYI, my possible cpu frequencies are (using ondemand governor):
>    1333000, 1067000, 1200000, 933000, 800000
> 
> The thermal limiting tripped in thermal_zone 1, which is probably the 
> sensor for the CPU itself.
> 
> The first thing that happened as soon as the limit was reached was that 
> the cpufreq scaling_max_freq was set to 800000 (lowest value). I guess 
> that is correct?

yes.

> After that it took exactly 60 (!) seconds before the CPU throttling state 
> was changed from T0 to T1. 30 seconds later T1 to T2 and again 30 seconds 
> later T2 to T3. These times are 100% reproducible.

This depends on when the BIOS generates ACPI thermal notifications.
The ACPI thermal driver checks the temperature and decides whether to change the CPU P/T state only when it receives such a notification.

> At that point zone 1 got below the limit. Immediately throttling went from 
> T3 to T1 and a bit later to T0. Later again scaling_max_freq was raised 
> to 1066400 (see below) and eventually 1333000.
> 
> In total it took exactly 2 minutes to reach T3 and it also took around 2 
> minutes to get back to full speed again (I did not time that as exactly).
> 
> Issues
> - After 'modprobe -r thermal; modprobe thermal' /proc/acpi/thermal_zone/
>   is not restored: the files that were there before the module is removed
>   get deleted, but no new ones are created after reloading the module.

That's weird.
The procfs files are surely created again when the driver is loaded.
Could you please double-check?

>   /sys/class/thermal/ does get created correctly again. There are no
>   kernel errors during module unload/reload.
> - The first period of 60 seconds before CPU throttling goes from T0 to T1
>   seems rather long to me.

that's because system is not overheating. :)

> - The scaling_max_freq value of 1066400 looks like some kind of rounding
>   error.

No, 1066400 = 1333000 * 80%
the cooling states of a processor are:
0. T0, P0 (full frequency)
1. T0, P0 *80%
2. T0, P0 *60%
3. T0, P0 *40%
4. T1, P0 *40%
5. T2, P0 *40%
...

> Conclusion is that my original issue was almost certainly due overheating 
> to critical value of a zone that does *not* have a passive trip point.

yes, I agree.
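
The cooling-state list above can be turned into a small calculation. This is only a sketch of the list as written in this comment, assuming the frequency cap is a plain integer percentage of the maximum frequency; the function name is hypothetical, and states beyond 5 are deliberately rejected because the list elides them.

```python
def cooling_state(state, max_freq_khz):
    """Return (throttling_state, frequency_cap_khz) for cooling states 0..5."""
    if not 0 <= state <= 5:
        raise ValueError("only states 0..5 are spelled out in the list above")
    # States 0-3 stay in T0 and lower the cap in 20% steps: 100, 80, 60, 40.
    percent = 100 - 20 * min(state, 3)
    # States 4 and 5 keep the 40% cap and add T-state throttling (T1, T2).
    t_state = "T%d" % max(state - 3, 0)
    return t_state, max_freq_khz * percent // 100
```

With max_freq_khz=1333000, state 1 gives ("T0", 1066400), which is the scaling_max_freq value that looked like a rounding error.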
Comment 18 Zhang Rui 2009-08-18 08:31:08 UTC
(In reply to comment #15)
> 
> This looks to be a KDE issue:
> 
> # lsof | grep /proc/acpi/ | awk '{print $1" "$5" "$9}'
> acpid REG /proc/acpi/event
> ksysguard DIR /proc/acpi/battery
> ksysguard DIR /proc/acpi/fan
> ksysguard DIR /proc/acpi/thermal_zone
> 
> No idea why it keeps the dirs locked when it's not using any files in them.
> 
> So, let's concentrate on the other two more interesting issues I mentioned in
> comment #14 :-)

Good to know.
IMO, all the problems here are verified, and there is nothing we need to do in Linux/ACPI, right?
Closing this bug report as it's not a kernel bug.
Please re-open it if you still have any questions. :)
Comment 19 Frans Pop 2009-08-20 13:53:19 UTC
On Tuesday 18 August 2009, you wrote:
> > Thanks. Pity that it can't be set per trip point. Hmm. Could that
> > possibly be implemented using something like 'psv=0:85,1:80,3:90'
> > (leaving trip points for zones that are not mentioned at their
> > defaults)?
>
> no, at least we won't do this in ACPI thermal driver.
> Because overriding the passive trip point is not a good idea from the
> beginning.
> it's BIOS that decides when to notify ACPI thermal driver to check the
> temperature. For example, BIOS will send a notification if the default
> passive trip point is triggered.
> But if we override the trip point, we probably can't get the
> notification when the new trip point is hit.

OK. So setting psv=85 is not a solution and I need a modified BIOS.

[Some time passes while /me tries to hack his DSDT...]

 TZ1/temperature:temperature:   62 C
 TZ1/trip_points:critical (S5): 110 C
+TZ1/trip_points:passive:       95 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1

Cool, so now I have a passive trip point on TZ1 too :-)
Pity that loading a custom DSDT taints the kernel.

> > FYI, my possible cpu frequencies are (using ondemand governor):
> >    1333000, 1067000, 1200000, 933000, 800000
[...]
> > - The scaling_max_freq value of 1066400 looks like some kind of
> > rounding error.
>
> No, 1066400 = 1333000 * 80%

Ugh. Still very, very strange that it comes out so close to a valid
cpufreq value, but ends up being just below it. Seems illogical.
The cpufreq values look almost too nice. As if they've taken 80% and then
looked for a nice looking number close to that value.

Thanks again for your excellent help on this issue!
Comment 20 Zhang Rui 2009-08-21 03:38:13 UTC
(In reply to comment #19)
> On Tuesday 18 August 2009, you wrote:
> > > Thanks. Pity that it can't be set per trip point. Hmm. Could that
> > > possibly be implemented using something like 'psv=0:85,1:80,3:90'
> > > (leaving trip points for zones that are not mentioned at their
> > > defaults)?
> >
> > no, at least we won't do this in ACPI thermal driver.
> > Because overriding the passive trip point is not a good idea from the
> > beginning.
> > it's BIOS that decides when to notify ACPI thermal driver to check the
> > temperature. For example, BIOS will send a notification if the default
> > passive trip point is triggered.
> > But if we override the trip point, we probably can't get the
> > notification when the new trip point is hit.
> 
> OK. So setting psv=85 is not a solution and I need a modified BIOS.
> 
> [Some time passes while /me tries to hack his DSDT...]
> 
>  TZ1/temperature:temperature:   62 C
>  TZ1/trip_points:critical (S5): 110 C
> +TZ1/trip_points:passive:       95 C: tc1=1 tc2=2 tsp=300 devices=CPU0 CPU1
> 
> Cool, so now I have a passive trip point on TZ1 too :-)
> Pity that loading a custom DSDT taints the kernel.
> 
Hah, I think you can try this in the thermal sysfs I/F:
1. Enter a thermal zone that has no passive trip point.
2. Echo a proper value aaa to the "passive" file.
The thermal sysfs driver will bind the processor cooling device to this thermal zone, together with a fake passive trip point aaa. :)
Comment 21 Frans Pop 2009-08-21 05:58:23 UTC
On Friday 21 August 2009, you wrote:
> hah, I think you can try this in the thermal sysfs I/F:
> 1. enter the thermal zones that without passive trip point.
> 2. echo a proper value aaa to the "passive" file
> the thermal sysfs driver will bind the processor cooling device to this
> thermal zone, together with a fake passive trip point aaa. :)

Yes, Matthew Garrett had the same suggestion for me on IRC. Is that option 
missing in Documentation/thermal/sysfs-api.txt? And even then that doc is 
not really suitable for end users.

I tried it for TZ5, but the system immediately became dreadfully slow (big 
latencies). I'll open a new BR for that issue later today :-)

Also, it's a bit strange that you have to set the temperature in sysfs, 
but AFAICT the polling_frequency can only be set in procfs (and I think 
the same goes for cooling_mode)?
Comment 22 Frans Pop 2009-08-21 08:12:40 UTC
> I tried it for TZ5, but the system immediately became dreadfully slow
> (big latencies).

Ah, I've found the problem. I did 'echo -n 90 >passive', but the temp
has to be in millidegrees, so 'echo -n 90000 >passive'.

So the latency was probably due to the system being throttled all the
way down to nothing due to overheating :-)
With the limit set to 90000 it's much happier.
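
A minimal sketch of a wrapper that avoids the unit mix-up: it takes degrees Celsius, converts to millidegrees, and refuses values that look implausible. The function name and the 0..200 C plausibility bounds are assumptions made for illustration, not anything the kernel enforces; only the "passive" file name and the millidegree unit come from this report.

```python
import os


def set_passive_trip(zone_dir, celsius):
    """Write a passive trip point, given in degrees Celsius, as millidegrees."""
    # Hypothetical sanity bounds: a value like 90000 passed as "Celsius" is
    # almost certainly already in millidegrees, so reject it early.
    if not 0 < celsius < 200:
        raise ValueError("expected degrees Celsius, got %r" % (celsius,))
    millidegrees = int(round(celsius * 1000))
    with open(os.path.join(zone_dir, "passive"), "w") as f:
        f.write(str(millidegrees))
    return millidegrees
```

For example, set_passive_trip("/sys/class/thermal/thermal_zone5", 90) would write 90000, the value that worked above. The zone number is the sysfs one, which need not match the TZn name under /proc/acpi.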
Comment 23 Frans Pop 2009-08-21 14:32:30 UTC
> Ah, I've found the problem. I did 'echo -n 90 >passive', but the temp
> has to be in millidegrees, so 'echo -n 90000 >passive'.

Would something like the patch below make sense to prevent my mistake?
If you like it I'll test it and submit it to the lists.

--- a/drivers/thermal/thermal_sys.c
+++ b/drivers/thermal/thermal_sys.c
@@ -225,6 +225,12 @@ passive_store(struct device *dev,
        if (!sscanf(buf, "%d\n", &state))
                return -EINVAL;

+       /* sanity check: values below 40000 millidegrees don't make sense
+        * and can cause the system to go into a thermal heart attack
+        */
+       if (state && state < 40000)
+               return -EINVAL;
+
        if (state && !tz->forced_passive) {
                mutex_lock(&thermal_list_lock);
                list_for_each_entry(cdev, &thermal_cdev_list, node) {
Comment 24 Zhang Rui 2009-09-18 03:38:33 UTC
Frans,
what's the status of these patches?
Comment 25 Frans Pop 2009-09-18 09:20:01 UTC
> what's the status of these patches?

As far as I'm concerned they are ready for integration. All patches except 
for 1/6 (trivial documentation improvement) and 4/6 have been acked.

The only patch that's still somewhat open is 4/6, but I think my version 2 
is a good compromise for the concerns raised by Matthew.

I sent a summary mail with a request to include them for .32 to Len, with
CCs to the acpi list, you and Matthew last week:
http://www.spinics.net/lists/linux-acpi/msg24533.html

I have not had any response to that.
Comment 26 Zhang Rui 2009-09-28 05:35:02 UTC
Ping Len...
I think we can ship this patch set in 2.6.32.
Comment 27 Len Brown 2009-11-05 23:01:36 UTC
This bug report documents a property of the HP BIOS --
that on a hot day, it could hit a critical shutdown
in a thermal zone that didn't have a passive trip point.

As Linux was faithfully implementing the BIOS design,
changing this bug report to the BIOS category, and closed
as DOCUMENTED.

That said, the patches on the list associated with
working around this issue are useful and are now
applied to the acpi tree.
