Bug 10658 - thermal shutdown - Dell Precision M20, Latitude D610
Summary: thermal shutdown - Dell Precision M20, Latitude D610
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal
Hardware: All Linux
Importance: P1 blocking
Assignee: Matthew Garrett
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-05-09 10:39 UTC by Martin Ling
Modified: 2009-05-08 14:29 UTC
CC List: 10 users

See Also:
Kernel Version: 2.6.25.2
Tree: Mainline
Regression: No


Attachments
/proc/cpuinfo (459 bytes, text/plain)
2008-05-09 10:39 UTC, Martin Ling
acpidump output (79.85 KB, text/plain)
2008-05-09 10:40 UTC, Martin Ling
/proc/interrupts (870 bytes, text/plain)
2008-05-12 00:27 UTC, Martin Ling
/proc/interrupts after heatup (870 bytes, text/plain)
2008-05-12 00:37 UTC, Martin Ling
Enable polling when no _TZP method in thermal zone (1.30 KB, patch)
2008-05-13 17:49 UTC, Matthew Garrett
Export the list of processor handles from the ACPI core to drivers (806 bytes, patch)
2008-05-13 17:50 UTC, Matthew Garrett
Add a passive cooling limit to zones which don't have one (2.67 KB, patch)
2008-05-13 17:53 UTC, Matthew Garrett
Add a passive cooling limit to zones which don't have one (3.14 KB, patch)
2008-05-16 11:36 UTC, Matthew Garrett
Export the list of processor handles from the ACPI core to drivers (811 bytes, patch)
2008-05-21 05:58 UTC, Matthew Garrett
grep . /sys/firmware/acpi/interrupts/* output when system hot. (1.51 KB, text/plain)
2008-06-03 12:12 UTC, Martin Ling
dmidecode output (13.28 KB, text/plain)
2008-06-03 12:12 UTC, Martin Ling
Implement management in the generic thermal class (17.59 KB, patch)
2008-06-11 09:56 UTC, Matthew Garrett
Log of DSDT debug messages, annotated with temperature and CPU frequency. (5.65 KB, text/plain)
2009-02-27 08:14 UTC, Martin Ling

Description Martin Ling 2008-05-09 10:39:05 UTC
Latest working kernel version: Not sure if this ever worked.
Earliest failing kernel version: 2.6.23.12 definitely, probably much earlier.
Hardware Environment: Dell Precision M20, aka Latitude D610.

Problem Description:

This laptop has a very poor cooling system. If the CPU is run fast for too long it will trip the thermal threshold at 101 deg C. I have been working around this for some time by using the powersave governor which keeps the CPU at the minimum 800MHz and prevents overheat. It would be nice to be able to use ondemand reliably.

The ondemand governor is okay for normal loads. Under heavy load though, as the temperature increases it will only reduce the CPU frequency from the maximum of 2GHz to the next lower setting of 1.6GHz, and then often allow it quickly back to 2GHz. This is not enough to keep the temperature down, so the system will trip and shut down.

Steps to reproduce:

Obtain one of these badly designed laptops. Set cpufreq governor to ondemand, performance, or anything else that lets the CPU stay at 1.6GHz+. while(1);
Comment 1 Martin Ling 2008-05-09 10:39:45 UTC
Created attachment 16087 [details]
/proc/cpuinfo
Comment 2 Martin Ling 2008-05-09 10:40:21 UTC
Created attachment 16088 [details]
acpidump output
Comment 3 Martin Ling 2008-05-09 10:45:02 UTC
I should add that I have checked the fan which is clean and runs at full speed, the heatsink is attached and looks okay. The system works okay in Windows.
Comment 4 Dave Jones 2008-05-11 09:25:23 UTC
Odd. ACPI thermal throttling should have kicked in at some point.

Can you paste the output of the contents of /proc/acpi/thermal_zone/*/* ?

Relying on cpufreq to keep this functional at all seems wrong.
Comment 5 Martin Ling 2008-05-11 10:09:38 UTC
From /proc/acpi/thermal_zone/THM:

cooling_mode: <setting not supported>
polling_frequency: <polling disabled>
state: state: ok
temperature: temperature: 60 C
trip_points: critical (S5): 101 C

The reported temperature does climb to the trip point.

Some throttling clearly does kick in when fully loaded using ondemand, as the frequency drops to 1.6GHz for a bit, and the temperature drops back into the 90s. As soon as it does though, the frequency jumps back to 2GHz. Eventually everything heats up sufficiently that the short drops in frequency aren't enough to keep the core under the limit when it goes back to 2GHz.
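
A minimal Python sketch for watching these values from a script, assuming the text has exactly the format pasted above. The function name parse_thermal_zone and the regular expressions are illustrative assumptions, not part of any kernel interface.

```python
import re

# Sample text copied from the thermal zone output quoted in this comment.
SAMPLE = """\
state: state: ok
temperature: temperature: 60 C
trip_points: critical (S5): 101 C
"""

def parse_thermal_zone(text):
    """Return (current_temp_c, critical_temp_c); either may be None."""
    temp = re.search(r"temperature:\s+(\d+) C", text)
    crit = re.search(r"critical \(S5\):\s+(\d+) C", text)
    return (int(temp.group(1)) if temp else None,
            int(crit.group(1)) if crit else None)

current, critical = parse_thermal_zone(SAMPLE)
headroom = critical - current  # degrees left before the critical trip point
print(f"{current} C now, {critical} C critical, {headroom} C headroom")
```

With the sample above, the headroom is the 41 degrees between the 60 C reading and the 101 C critical trip point.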
Comment 6 Dave Jones 2008-05-11 10:33:37 UTC
I'm reassigning this to ACPI.  Cpufreq doesn't really have enough information at the layer it lives at to make the sort of decisions necessary.  ACPI does have code to limit the number of P states that cpufreq will scale to, so I suspect that code needs tweaking in some manner so that the limiting happens for a longer period.
Comment 7 Martin Ling 2008-05-11 11:04:11 UTC
Yeah, it looks like this is the business of drivers/acpi/processor_thermal.c, but I don't really know enough about this stuff to debug it.
Comment 8 Zhang Rui 2008-05-11 18:36:48 UTC
Hmm, it's true that ACPI may change the processor P/T-state for thermal control.
But all of this is done _IF_ the processor is used as the passive cooling device in this thermal zone, i.e. it's listed in the _PSL method, which is not true in the acpidump attached.
Comment 9 Matthew Garrett 2008-05-11 18:40:30 UTC
If there's a defined critical temperature, then we should probably ensure that the processor doesn't reach it - regardless of whether we have a listed passive cooling method or not.
Comment 10 Zhang Rui 2008-05-11 19:27:44 UTC
Right, so the main problem is the overheating.
I saw an interrupt storm bug report on the Dell M20, don't know if it is related.
Please attach the contents of "/proc/interrupts".
Comment 11 ykzhao 2008-05-11 19:53:15 UTC
From the acpidump in comment #2 it seems that only the "critical trip
point temperature" is defined. When the temperature reaches the critical
value, OSPM will perform the critical shutdown.
> Method (_CRT, 0, NotSerialized)
> {
>     Store (0x65, Local0)
>     Multiply (Local0, 0x0A, Local0)
>     Add (Local0, 0x0AAC, Local0)
>     Return (Local0)
> }
What Rui said in comment #8 is right. If ACPI changes the processor
P/T-state for thermal control, the _PSL object is required, which returns the
passive cooling device list. (Of course the _PSV object is also required.) If
ACPI turns the fan device on/off for thermal control, the _ALx object is
required, which returns the list of fan cooling devices. (Of course the _ACx
object is also required.)

Unfortunately the above objects don't exist on this laptop. And there is no
cooling device that can be used to cool the system temperature.
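
The arithmetic in the _CRT method above can be checked with a short Python sketch. It relies on ACPI reporting temperatures in tenths of a kelvin, so 0x0AAC (2732) is the 273.2 K offset for 0 C. The function names are illustrative only.

```python
def crt_raw():
    """Mirror the three ASL operations in the _CRT method quoted above."""
    local0 = 0x65               # Store: 101
    local0 = local0 * 0x0A      # Multiply: 1010
    local0 = local0 + 0x0AAC    # Add: 1010 + 2732 = 3742
    return local0               # value in tenths of a kelvin

def deci_kelvin_to_celsius(value):
    """Convert an ACPI temperature (tenths of a kelvin) to degrees Celsius."""
    return (value - 2732) / 10.0

raw = crt_raw()
print(raw, deci_kelvin_to_celsius(raw))
```

The result, 3742 tenths of a kelvin, corresponds to the 101 C critical trip point reported in comment #5.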
Comment 12 Callum Lerwick 2008-05-11 21:36:35 UTC
Have you taken the heatsink *off*, scrubbed off the old goo and applied fresh
heatsink paste? I've fixed two laptops with severe overheating problems this
way.
Comment 13 Martin Ling 2008-05-12 00:27:27 UTC
Created attachment 16105 [details]
/proc/interrupts

After system startup, with governor set to powersave.
Comment 14 Martin Ling 2008-05-12 00:37:01 UTC
Created attachment 16106 [details]
/proc/interrupts after heatup

/proc/interrupts after heating to 98C with ondemand governor and full load.
Comment 15 Martin Ling 2008-05-12 00:45:52 UTC
The fan on this system can be monitored and controlled via the i8k module. The BIOS seems to do a good job of managing it by itself - it runs slow most of the time and ramps up to high speed above about 85C - so I don't bother loading that module and running the userspace daemons that go with it for configurable fan control. Making that driver register a cooling device with the generic thermal code would probably be straightforward.
Comment 16 Martin Ling 2008-05-12 00:52:01 UTC
Re: heatsink paste, it might help, but the reason this is a bug is that Windows copes fine, and that it is clearly possible to stop the system overheating. From the point of view of a non-technical user, Windows works but Linux just switches itself off when they do anything intensive for a long period.
Comment 17 ykzhao 2008-05-12 17:59:33 UTC
What Martin said in comment #15 is right. It seems that the fan can be monitored and controlled via the I8K module.( The I8K module is dedicated to some Dell laptops).
Please set "CONFIG_I8K" in kernel configuration and see whether the system still is overheated.
Thanks.
Comment 18 Matthew Garrett 2008-05-13 01:25:02 UTC
Dell do not recommend the use of the i8k module to control the fans. This overrides BIOS control.
Comment 19 Martin Ling 2008-05-13 01:31:29 UTC
Having the i8k module loaded doesn't make any difference to the thermal situation. All it does is expose some ioctls that let you read and write the fan status. Without it, or with it loaded and the fan status not changed, the BIOS deals with the fans. The BIOS is already setting the fan speed to maximum long before the CPU overheats.
Comment 20 kvolny 2008-05-13 02:19:47 UTC
I have similar problem with FSC Amilo Pro V3505 ... there is something wrong between temperature sensing and the fan control, it often happens that the fan runs at the same speed regardless of the temperature. In such case, if the system gets under continuous load, it reaches the limit and switches off.

As comments #9 and #16 suggest, something should be done to stop the processor generating more heat before the critical point is reached, not relying on anything external like increasing the fan speed etc. Even if "something" means completely stopping the system and doing only empty cycles for a few seconds, it would be better than leaving it to die, risking filesystem corruption etc.
Comment 21 Martin Ling 2008-05-13 04:19:54 UTC
Re: comment #20, failure to spin up the fan on the V3505 is a separate problem and should be filed as a separate bug, I think. The issue being discussed here is about throttling the CPU to prevent an overheat when the fan is not doing enough. Getting this right might prevent an overheat on the V3505 machine also, but the fan problem needs to be addressed separately as well.
Comment 22 kvolny 2008-05-13 05:24:36 UTC
(In reply to comment #21)
> Re: comment #20, failure to spin up the fan on the V3505 is a separate
> problem and should be filed as a separate bug, I think.

I'll have to investigate further, I suspect some hardware failure ...

> The issue being discussed here is about throttling the CPU to prevent
> an overheat when the fan is not doing enough. Getting this right might
> prevent an overheat on the V3505 machine also, but the fan problem needs
> to be addressed separately as well.
 
exactly - I just wanted to say that it is not a matter of renewing the heatsink paste, but rather a common problem which should be prevented at any point where it is possible
Comment 23 Len Brown 2008-05-13 17:39:56 UTC
I agree that it is a Linux problem that the D610 overheats
when running, when it doesn't overheat running Windows.

However, that doesn't mean that it is either a
Linux/cpufreq bug or a Linux/ACPI bug.

cpufreq has zero responsibility for cooling the system.
Even though you've been successful at using it for that
purpose, that is not what it is designed to do.
(nor do I think should it be -- particularly for a system
like this one that doesn't provide any OS thermal control)

Further, the D610 is not providing Linux the ACPI hooks needed
to either control the fan or passively cool the CPU.

The fact that the system overheats when the fans are running
full blast means that either the cooling hardware is failing
or the system was designed outside thermal guidelines.
Don't assume that the latter is impossible...

I had a D600 a while back that would invoke throttling
via SMM -- confused the heck out of my benchmark results...
I could sometimes observe this by tracking
the contents of /proc/acpi/processor/*/throttling

My assumption is that Windows has some "special sauce" from
Dell in the form of a platform specific driver to help
the D610 run Windows properly.  The fix for Linux is for
Dell to provide the same assistance to Linux.
The i8k is the only weapon we have, as far as i know.

Dell never shipped Linux on this box, so I have zero expectation
that they'd go provide something at this point.  Indeed, for a box
this old, I wouldn't even expect a BIOS update from Dell.
However, as the BIOS is the code controlling your thermals
when Linux is running, you should certainly verify that you're
running the latest BIOS that is available...

I'm closing this as "Documented".

After you verify that you're running the latest BIOS
and that it provides no BIOS SETUP knobs related to cooling, and
after you re-assemble your thermal solution using arctic-silver
or whatever and find that it doesn't help...
I recommend that you work around this by using ondemand,
but limit the maximum frequency via
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
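
A minimal Python sketch of that workaround, assuming root privileges and a cpufreq driver that exposes the file named above. The helper name cap_max_freq is hypothetical. cpufreq expresses this value in kHz; the right cap for this machine is not stated in this report, so any value passed in is a placeholder to tune.

```python
from pathlib import Path

# Path taken from the comment above; cpufreq values in this file are in kHz.
SYSFS_MAX_FREQ = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"

def cap_max_freq(khz, path=SYSFS_MAX_FREQ):
    """Write a frequency cap in kHz and return the value read back.

    The kernel may round the request to a frequency the CPU supports,
    so the value read back is what actually took effect.
    """
    target = Path(path)
    target.write_text(f"{khz}\n")
    return int(target.read_text().strip())
```

Called as cap_max_freq(800000), this would pin the ceiling at the 800MHz minimum mentioned in the description. The path parameter exists only so the helper can be exercised against a scratch file without touching sysfs.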
Comment 24 Matthew Garrett 2008-05-13 17:47:22 UTC
Len,

I disagree. Here's a patch-set I've been working on that provides what should be a decent workaround for this issue, without having any negative impact on functional machines.
Comment 25 Matthew Garrett 2008-05-13 17:49:14 UTC
Created attachment 16133 [details]
Enable polling when no _TZP method in thermal zone

Fix Linux to conform to 11.3.18 of the ACPI spec 3.0. In the absence of a _TZP method, we should be polling at a default frequency. While it can be argued that "0" is a default frequency, I don't think that's what the spec authors had in mind.
Comment 26 Matthew Garrett 2008-05-13 17:50:51 UTC
Created attachment 16134 [details]
Export the list of processor handles from the ACPI core to drivers

Add a convenience structure to the ACPI core that allows drivers to obtain the list of CPU devices. This is left in the core since the scanning is performed at boot time and the drivers may be built as modules. There's probably a cleaner way to do this.
Comment 27 Matthew Garrett 2008-05-13 17:53:51 UTC
Created attachment 16135 [details]
Add a passive cooling limit to zones which don't have one

If a thermal zone is provided with a critical temperature, then there is obviously a concern on the part of the vendor that it may overheat. Currently Linux will only attempt to do something about that if the vendor has explicitly added a passive cooling trip point. However, it's clear that allowing the system to hit the critical trip point is far from ideal - the system will immediately shut down, and data will almost certainly be lost. This patch adds a default passive cooling zone if the platform does not provide its own, with the default being to have it be 5 degrees below the critical shutoff temperature. This should avoid the kernel limiting performance unless it's genuinely likely that the hardware is about to overheat and shut down. The default temperature value can be overridden by passing the thermal.psv argument at boot or module load time.
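
The trip point selection described here can be sketched in a few lines of Python. This is an illustration of the rule as worded in this comment, not the patch itself. The function and parameter names are hypothetical, and the precedence between a firmware-provided passive trip point and the thermal.psv override is an assumption.

```python
DEFAULT_MARGIN_C = 5  # default from the comment: 5 degrees below critical

def passive_trip_c(critical_c, firmware_psv_c=None, override_c=None):
    """Pick the passive trip point for a zone, in degrees Celsius.

    Assumed precedence: a firmware-provided passive trip point wins,
    then a user override (standing in for thermal.psv), then the default.
    """
    if firmware_psv_c is not None:
        return firmware_psv_c
    if override_c is not None:
        return override_c
    return critical_c - DEFAULT_MARGIN_C

print(passive_trip_c(101))  # zone with only a 101 C critical trip point
```

For the 101 C critical trip point on this laptop, the default rule gives a passive trip point of 96 C.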
Comment 28 Len Brown 2008-05-14 00:36:33 UTC
re: comment #25

The wording in ACPI 3.0 is incorrect, and will be fixed in ACPI 4.0:
http://www.acpica.org/bugzilla/show_bug.cgi?id=714

This has come up before, and until the ACPI bug report above,
the "documentation" has been this commit log:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=730ff34de766a6fddee25ac1c32bc49c1a2fd758

If we implemented the patch in comment #25, we would be
swimming against the stream of "common industry practice".
The proof is systems like that in bug #8842, which would break.

Thus, the patch in comment #25 must not be applied.
Comment 29 Martin Ling 2008-05-14 04:40:18 UTC
> After you verify that you're running the latest BIOS
> and that it provides no BIOS SETUP knobs related to cooling

I can confirm that the BIOS is the latest version, A06. There are no options related to cooling in the BIOS setup. 

> and after you re-assemble your thermal solution using arctic-silver
> or whatever and find that it doesn't help...

I am not rebuilding my laptop at this point. Even if some fancy thermal paste helps, I very much doubt I am the only person with this issue, and Linux should work for everybody, not just those willing to hack around with the kernel and disassemble their laptops. I am keeping it as-is for now so that we have a readily accessible test case.

Matthew - I will test the patches you have posted.
Comment 30 Martin Ling 2008-05-15 04:36:35 UTC
I have now tested the patches from comments #25-27 on 2.6.25.2, and they work perfectly. Passive cooling kicks in above 96C and uses all the available frequencies. I can deliberately block the fan and it still won't overheat, but is still able to spend most of its time at 2GHz by short bursts of passive cooling.

Presumably however this all depends on the temperature being polled due to the patch in comment #25. If that patch is not acceptable, is there a better way to do this?

If the trip point is programmable, perhaps in this case (critical trip point only) we could reprogram the hardware trip point to a few C below that and start polling only when we hit that lower limit, but still keep the shutdown threshold at the BIOS-defined critical setting.
Comment 31 Matthew Garrett 2008-05-15 04:48:33 UTC
My proposal would be to only enable polling if there's no existing passive trip point. That would avoid breaking the system in #8842.
Comment 32 Martin Ling 2008-05-15 04:50:03 UTC
Note that there is an error in the comment #26 patch, the EXPORT_SYMBOL(processor_list) line should read EXPORT_SYMBOL(acpi_processor_list).
Comment 33 Thomas Renninger 2008-05-15 10:42:55 UTC
I agree with Len that:
Enable polling when no _TZP method in thermal zone
in general is not a good idea.
tzp was a module parameter for thermal, does this not work anymore for latest kernels?

IMO this should better be done via dmi for the D610 only, there are already dmi specific hooks in thermal.c setting tzp...
It's ugly, but you could set up the whole passive trip point inside a dmi entry.
Len probably can accept this?

On longterm there will be the possibility to do passive cooling without ACPI, either in userspace, hopefully in kernel space. It's something I like to see for quite some time already. Matthew, if you are interested I'd like to CC you in relevant posts. Getting some support in this area would be great.

Len, pls reopen. Dell Latitude not being supported is not acceptable. Not sure, but AFAIK D610 and D800 were even one of default Novell employee laptops... (too late to bring up enough humour and add a smiley here).

BTW: Has someone an idea why these are overheating now?
Did you make sure the fan slots are clean (I fixed up a bug like that recently).
Comment 34 Martin Ling 2008-05-15 11:35:05 UTC
> BTW: Has someone an idea why these are overheating now?
> Did you make sure the fan slots are clean (I fixed up a bug like that
> recently).

The fan and heatsink are clean, as I've said already. The only other thing I could do is to apply new heatsink paste, but I'm trying to retain a consistent test case for this bug. If the paste is a problem it will apply to many laptops as they age. Windows seems to cope with that, so Linux should too.

The same issue could arise just as easily in other ways, e.g. using a system in a hot climate, when the BIOS engineers tested it in an air-conditioned office. It seems to me that this applies to any system with a critical trip point but without a passive cooling point defined, so I'd rather see something like the patch in comment #27 applied, than a hack for the D610.

As to the remaining issue of how to find out we're approaching the limit, could anyone tell me if my suggestion in comment #30 is feasible?
Comment 35 Matthew Garrett 2008-05-16 11:36:56 UTC
Created attachment 16166 [details]
Add a passive cooling limit to zones which don't have one

This patch does two new things. Firstly, it only enables polling when the hardware doesn't already have a passive zone. That stops bug #8842 from hitting us. Secondly, if a thermal zone contains a _TZD package, that is used rather than flagging it as applying to all CPUs. This won't currently do anything for devices other than CPUs as there's no kernel support, but it could be tied in through the generic thermal layer once we start seeing devices that can handle it.
Comment 36 Martin Ling 2008-05-19 04:33:35 UTC
The patch in comment #35 works for me.
Comment 37 Thomas Renninger 2008-05-19 08:24:13 UTC
I agree with Len and would also not activate thermal polling by default. This often is very slow and might cause problems on other machines where the BIOS vendor intended to not poll the temperature (and e.g. notify OS through a thermal event).

Matthew, what is so wrong about putting this into a dmi list?
Comment 38 Matthew Garrett 2008-05-19 08:34:03 UTC
Because there's no reason to believe that it's something that can be well determined at the per-model level. I can't see any situation in which polling could trigger bugs, other than the one described in #8842 (which this won't hit). I appreciate the concerns about performance, though I haven't been able to trigger any on machines here. If it's an issue, then we could simply skip this on any hardware where reading the trip points takes a significant quantity of time. I suspect that most of the thermal zone performance issues have vanished now we default to burst mode in the ec driver, though.
Comment 39 Martin Ling 2008-05-19 14:34:11 UTC
Unless I'm missing something we're talking about polling every 10 seconds until we hit the passive limit, and then every second. It surprises me that this is a performance issue. Does one need some slow & intensive polled I/O loop to read the temperature sensor or something?
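
The polling schedule as understood in this comment can be written down as a small Python sketch. The names are hypothetical and the 5 degree margin is the default from comment #27. It assumes that zones with a firmware passive trip point stay event driven, as the patch in comment #35 describes.

```python
def poll_interval_s(temp_c, critical_c, firmware_passive_c=None, margin_c=5):
    """Return the polling interval in seconds, or None for no polling.

    Polling is only used for zones that have a critical trip point but no
    firmware passive trip point; all other zones are left event driven.
    """
    if critical_c is None or firmware_passive_c is not None:
        return None
    passive_c = critical_c - margin_c  # synthesized passive trip point
    return 1 if temp_c >= passive_c else 10

print(poll_interval_s(60, 101))  # cool: slow polling
print(poll_interval_s(97, 101))  # past the passive limit: fast polling
```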
Comment 40 Matthew Garrett 2008-05-19 14:39:37 UTC
It used to be the case that reading some hardware attached to embedded controllers would block the kernel for a significant period of time while it polled for completion, but I believe that we're interrupt driven on pretty much all hardware now.
Comment 41 Thomas Renninger 2008-05-20 00:55:02 UTC
We (until OpenSUSE 10.x) used to set tzp by default on older distros (via userspace override) and got convinced by Len that the thermal management might still work reliably on most/all machines.
There were (in the end rare) problems in the past with EC reads because of thermal polling, but yes, really rare or even no problems at all in the end (Alexey should know better about possible problems here).

Still having a non-polling system by default should be the goal, IMO.
Polling is a well-known workaround in a lot of places; if everything is worked around in this way, you get a whole bunch of polling threads and the whole asynchronous concept (not only in ACPI) is not worth much anymore. And polling on slow IO HW is even worse...

Adding Alexey and Jean, they have a lot experience in reading/writing on such HW.
Alexey,Jean: Matthew wants to enable temperature polling every 10 seconds if a thermal zone does not export a polling variable. He also wants to introduce a passive trip point (if none exists) which is set 5 degree below the critical temperature trip point.

Disadvantage:
   - Some machines inform OS via thermal events. Those are doing it right and
     get punished by polling
   - EC problems on some machines?
   - IMO ACPI thermal.c is the wrong place. On long-term this should be
     integrated into arch independent hwmon structures where the trip points
     are stored now.
   - Normally the ACPI BIOS design should still be capable of avoiding a
     critical shutdown. If not this is a BIOS defect and such things (if not
     a general problem) should IMO be solved in a blacklist.

Advantage:
   - Theoretically no critical shutdowns any more -> this should work out on
     a lot of machines, even on those with dusty fan slots.
   - On OpenSUSE 11.0 we enabled a 3D Desktop feature by default, causing a lot
     of machines with underpowered graphics cards to run much hotter, some with
     critical temperature shutdowns (Experimenting a bit with graphics card
     drivers on the Dell might also be worthwhile to learn more about the
     shutdowns...)

Summary (from my side):
I tend to like this workaround. We need something for short-term in thermal.c, the question is whether it should get embedded into a dmi list or activated by default. As I now remember that we had tzp set by default, I do not see a stability problem, it's not perfect, but I cannot judge (Alexey, Jean?) how bad the HW accesses are or could be on some worst case machines.
IMO, on longterm such an interface should exist for hwmon, tunable (polling, lowering passive trip point) from userland.
Comment 42 Jean Delvare 2008-05-20 02:04:54 UTC
How much effort it takes to access the temperature registers depends on the type of sensor device. LPC access is very fast, while SMBus access is slower. However, reading a single register value over SMBus is not a problem, and polling every 10 seconds is totally reasonable.

So I see no problem with Matthew's proposal in this respect.
Comment 43 Martin Ling 2008-05-21 01:47:51 UTC
> Alexey,Jean: Matthew wants to enable temperature polling every 10 seconds if
> a
> thermal zone does not export a polling variable. He also wants to introduce a
> passive trip point (if none exists) which is set 5 degree below the critical
> temperature trip point.

To clarify: Matthew's updated patch in comment #35 only enables polling where the system has a critical trip point but no passive trip point. For machines that define both, everything would still be done asynchronously.

It's just this one awkward case where without polling, we only know about the overheat when it's too late to do anything.
Comment 44 Thomas Renninger 2008-05-21 05:26:32 UTC
> To clarify: Matthew's updated patch in comment #35 only enables polling where
> the system has a critical trip point but no passive trip point.
This should be a lot of laptops and most recent desktops.
In very rare cases you could also have several thermal zones with a critical trip point defined.

I like the patch (when several EC reads do not take more than a second anymore and HW access is not a concern which seem to be the case...), it makes the whole ACPI thermal management more robust.
Comment 45 Matthew Garrett 2008-05-21 05:58:27 UTC
Created attachment 16227 [details]
Export the list of processor handles from the ACPI core to drivers

Fixed version of the processor list patch
Comment 46 Len Brown 2008-06-03 10:29:37 UTC
Consider the cause of bug 8842.
The BIOS exported a bogus _PSV of 50C.
Linux polled the thermal zone and dutifully throttled the
processor when it hit 50C, severely impacting
performance under normal use.

Why did windows not run into this problem?
Because Windows doesn't poll, and thus never
exposed a very obvious BIOS bug.

If Linux polls, it exposes itself to an area
of BIOS and EC which was NOT VALIDATED ON WINDOWS.

Whelp, the reality today is that in the horizontal PC
computer industry, if it isn't validated on Windows,
it isn't validated at all.

So the stated goal of the Linux/ACPI sub-system is to
take the pragmatic course of making systems work
by attempting to exercise the validated paths through
their BIOS/EC/firmware rather than the non-validated paths.
However, any other strategy would be wildly impractical.

That is why Linux must NOT enable
polling by default on any system which hasn't
been validated to handle it properly.

So I'll be delighted to accept a patch that recognizes
a broken machine via DMI and invokes a workaround --
-- as long as it doesn't hurt other instances of
the same machine that do not have the problem.
The workaround must not be deployed for all machines --
even if the profile where it is applied seems narrow,
it would apply to thousands of models.

Re: making Linux smarter about thermals.
I agree with this desire, and the generic thermal I/F
was implemented explicitly to help out with this.
It applies to both ACPI and non-ACPI systems.
However, you must realize that on the population of
ACPI systems in the marketplace, there are some
significant practical constraints on the ability of the OS
to do something other than what ACPI intends.

In particular, the ACPI BIOS owns the trip points --
Linux does not.  The ACPI EC decides if and when
to send an event to the OS.  If we change what
Linux thinks is a trip point, that doesn't mean
that the EC will send us an event when the temperature
crosses it.  Further, the ACPI BIOS has the right to re-define
trip points at run time, and to implement hysteresis it often does this.
On some machines the EC will send us interesting
temperature change events, and on some machines it will not.
ie. the phantom _PSV scheme may work on some systems
and not on others -- it would require polling to make sure
we notice the temperature change;
but polling is itself problematic per above.

Martin,

> The system works okay in Windows.

Can you figure out what it is doing?
Are there any Dell or platforms specific drivers present?
Is it throttling the processor when it gets hot?

Please attach the output from this command when the machine is warm:
grep . /sys/firmware/acpi/interrupts/*
It will show us which GPEs are firing and perhaps give some insight
into a DSDT which is apparently full of SMI abuse.
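
For capturing the same information from a script, here is a minimal Python sketch that mimics the grep command above: it prints each non-empty line of each file, prefixed with the file name. The function name dump_counters is hypothetical, and nothing is assumed about the contents of the files.

```python
from pathlib import Path

def dump_counters(directory="/sys/firmware/acpi/interrupts"):
    """Return 'path:line' strings for every non-empty line in the directory."""
    lines = []
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue
        for line in entry.read_text().splitlines():
            if line:  # 'grep .' only matches lines with at least one character
                lines.append(f"{entry}:{line}")
    return lines
```

Calling print("\n".join(dump_counters())) while the machine is warm gives output suitable for attaching. Diffing two captures taken a few seconds apart shows which counters are still incrementing.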
Comment 47 Matthew Garrett 2008-06-03 10:55:57 UTC
Len,

if there were any significant body of hardware where infrequently reading the temperature caused problems, I suspect we'd have heard about it. However, this is a straightforward patch with no further dependencies - it's trivial to back it out if it turns out it does break things.

Our hardware usage profile is always going to diverge from Windows to some extent. Where that results in undesirable behaviour, the obvious fix is to modify our behaviour to be more like Windows. That doesn't mean that we should constrain our functionality by refusing to diverge from Windows' behaviour even if we have no reason to believe it would break anything! The issue with bug #8842 was not the polling per se, but the ridiculously low passive cooling point. Nobody has yet demonstrated a system where polling would genuinely cause problems, but we have a demonstrated case where this code helps existing users. 

In the long run we'll want functionality like this for non-broken hardware anyway (datacentre thermal constraints, for instance), and putting it in the kernel is significantly safer than leaving it sitting in userspace with unknown latency requirements. It's something that could be done at the generic thermal class layer instead, but I'd prefer for that to mature further before moving it there.
Comment 48 Len Brown 2008-06-03 11:40:45 UTC
Phantom ACPI _PSV trip points are not how datacenter
thermal constraints will be managed.  In datacenters,
Node Manager informs the OS to stay within a maximum
P-state, and if that is insufficient, then it resorts
to a maximum T-state.  This mechanism is already in place
and already working.  And it is not dependent on Node Manager,
that is just an example of a commercial implementation.

The object of this bug report is to work-around a single instance
of a single 3-year-old notebook that has broken thermals --
while not breaking any other systems.

If we could figure out how Windows copes with this box
in the process, that might also be useful in pointing
out a gap in the Linux implementation.

So let's get this sighting dealt with and if we find a torrent
of systems such that the DMI list becomes large, then it makes
sense to consider broader deployment.

Martin,
Please attach the output from dmidecode.
Comment 49 Matthew Garrett 2008-06-03 12:00:27 UTC
Your description of node manager is effectively identical to using a fake _PSV - it's just got a wider range of temperature information available to it. In either case you're polling the temperature sensors and limiting P states if the temperature rises above a certain point and is trending upwards. Implementing this in-kernel in a generic way would facilitate both, and avoid the issues with handling it in userspace (such as OOM situations allowing your machine to leave its thermal envelope - the current implementation of the generic thermal class even disables in-kernel handling of critical shutdown temperatures!)

But this is not purely an issue with a single machine. Distribution bugzillas have multiple entries from users facing this issue, on a range of hardware platforms. If the sole objection to this is "Windows doesn't behave like this, and hypothetical hardware might not like it" then I'd rather we went with it until an example of such hardware is shown to exist.
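
As an illustration of the policy described in this comment (poll the sensor, and limit P states once the temperature is above a chosen point and still trending upwards), here is a minimal Python sketch. It is not the kernel patch: the function name, the 95 C trip value and the sample readings are all assumptions made for the example.

```python
def should_throttle(samples_c, passive_trip_c):
    """Return True if the latest reading is above the trip point and rising.

    samples_c: temperature readings in degrees C, oldest first
               (hypothetical data from polling a sensor at a fixed interval).
    passive_trip_c: the (possibly faked) passive trip point in degrees C.
    """
    if len(samples_c) < 2:
        return False  # two readings are needed to see a trend
    latest, previous = samples_c[-1], samples_c[-2]
    return latest > passive_trip_c and latest > previous


if __name__ == "__main__":
    readings = [91, 93, 96, 97]  # made-up values, not from the attached logs
    print(should_throttle(readings, passive_trip_c=95))
```

A real implementation would scale the amount of throttling with the trend rather than return a yes/no answer; the sketch only shows the two conditions named above.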
Comment 50 Martin Ling 2008-06-03 12:12:02 UTC
Created attachment 16383 [details]
grep . /sys/firmware/acpi/interrupts/* output when system hot.
Comment 51 Martin Ling 2008-06-03 12:12:37 UTC
Created attachment 16384 [details]
dmidecode output
Comment 52 Martin Ling 2008-06-03 12:21:27 UTC
See previous attachments, and I will see what I can do about figuring out why Windows works.
Comment 53 Zhang Rui 2008-06-03 18:09:58 UTC
(In reply to comment #49)
> the current implemetation of the generic thermal
> class even disables in-kernel handling of critical shutdown temperatures!
> 
Hah, after talking with Thomas, the guy who works on the Menlow thermal user application, I'm about to send out a patch to fix this,
i.e. the ACPI thermal driver will handle critical shutdown even in "user" mode.
Comment 54 Matthew Garrett 2008-06-11 09:56:56 UTC
Created attachment 16459 [details]
Implement management in the generic thermal class

Martin, any chance you can give this a go? It works fine here, keeping the temperature at the defined level. It uses the generic thermal class rather than the somewhat hacky approach of doing it in the ACPI layer.
Comment 55 Martin Ling 2008-06-12 13:14:58 UTC
I've just tested the patch in comment #54 on 2.6.26-rc5. I had to add an #include <linux/workqueue.h> to the patched thermal_sys.c to build it. It works just the same as the previous ACPI patch, i.e. perfectly.
Comment 56 Venkatesh Pallipadi 2008-06-12 13:16:15 UTC
I will be on vacation from 05/30 until 06/22 and will be back on 06/23. Please contact Suresh B Siddha / Len Brown / Arjan van de Ven for any urgent issues. Thanks, Venki
Comment 57 Martin Ling 2008-06-12 13:42:17 UTC
I've just had a look at what Windows XP is up to as well.

I had a look through the drivers visible in Device Manager. All the ACPI and other system devices are the standard Microsoft drivers, I don't see any Dell magic.

To get a better idea what was going on I ran the I8KfanGUI software[1] with all its control options turned off - i.e. just using it as a monitoring app. What happened was extremely similar to the results of Matthew's patches - at around 96-98 degrees, the CPU speed would drop until the temperature recovered.

Now it's possible, based solely on those results, that my checking the temperature sensors was causing the OS to notice the rising temperature when it normally wouldn't. But there must still be some logic somewhere that caused it to attempt passive cooling, without a _PSV trip point in the DSDT. Also, even with I8KfanGUI shut down, I still can't get the system to shut down with 100% load and the fan outlet deliberately blocked.

[1] http://www.diefer.de/i8kfan/index.html
Comment 58 Thomas Renninger 2008-06-24 04:50:17 UTC
Comment #57 is the proof that Windows is:
  a) Polling the temperature even though the BIOS does not tell the OS to do so
     (at least not in an ACPI-specified way)
  b) Providing a kind of virtual passive trip point

The sad story is that this has been brought up already (set polling by default, provide a possibility to lower or create passive trip points).

It took 3 years to prove that Windows is doing it. Nobody will prove whether only XP is polling or also Vista or other flavors. Nobody will ever prove whether (quite likely) this is a machine/model specific Windows workaround.

So while "Windows is doing it" is the keyword for Len to add something, this is a nice example that the "Windows compatibility" argument is not worth much (for most/all things not related to general ASL syntax).
IMO this should still go into a DMI blacklist (as the machine violates the specs). General rule should be blacklisting spec violating machines, not adding "Windows compatibility" bug workarounds in general. Windows behavior might change again in two years, staying close to the specs is always the best.

But please add something... (I even don't mind adding this workaround in general, punishing systems who take care about the Spec and Linux), there has been enough bad publicity like "Linux may overheat your system, you don't get a Cent if this happens and your vendor does not support Linux". I mean of course you are better off buying a SUSE pre-loaded Linux supported Lenovo T61 if you are working on Linux...

Thermal management affected -> increasing severity of the bug. Pfff, I cannot even change the severity.
Why is the bug still set to needinfo? Can someone remove it and assign it to Len again?
Comment 59 Martin Ling 2008-06-24 07:21:53 UTC
Thomas - just curious, how is the machine violating the specs? You said in comment #44 that having a critical trip point without a passive one was normal in "a lot laptops and most recent desktops"?

I'd much rather see a generic solution to this than just a DMI special case for this machine. There are lots of similar reports in various forums and distro bugzillas, and the solutions people are using are messy userspace hacks.

How do people feel about Matthew's approach in the comment #54 patch, of doing this in the generic thermal layer rather than in ACPI?
Comment 60 Thomas Renninger 2008-06-25 06:58:11 UTC
> how is the machine violating the specs?
The thermal zone must provide a sane _TZP Thermal Zone Polling and a passive cooling trip point to provide a proper thermal management.
Ok, in strict sense they are not violating the spec here if the critical shut down is intended. But it is not.

> I'd much rather see a generic solution to this...
I am all against compatibility to specific Microsoft OS bugs and workarounds. (This depends, say all workarounds that can be fixed with _OSI hooks).
We will end up with double polling of the temperature on a lot of machines and punish those who are doing it right, e.g. HP sends thermal events. Meanwhile Microsoft will fix their bug with their next OS release or a Service Pack.
But as I already said, I agree that thermal management is too important and a generic solution is also very appreciated from my side.
Comment 61 Thomas Renninger 2008-09-04 06:07:02 UTC
I am currently rewriting this to get something accepted.
The first step is moving the polling to drivers/thermal/ and also the polling_frequency option from /proc to /sys. A test as soon as I have something would be great.

Anyway, I checked the DSDT table. One thing I realized is that:
TZP -> Thermal Zone Polling (in deci-seconds) is nearly never used by any vendor (also by this one, I found one BIOS implementing it out of several dozens).
Also TZD -> Thermal Zone Device: This is what we could pass as name for the thermal zone, but again only one (the same machine) uses it.

But TSP -> Thermal Sampling for Passive, the polling frequency when passive mode is entered is provided *really often*

One workaround that probably works for a lot of machines could be to take the _TSP value as the general thermal polling value if no _TZP value is provided.
The HW is supposed to work fine in passive mode with this polling frequency and it also will if not in passive mode.

Another idea specific to this ASL implementation:
There is exactly one way the BIOS can tell the OS to check the temperature asynchronously (without polling). This is:
\_GPE._L1D    # HW issues an ACPI (SCI) interrupt with id 0x1D
   +->  NEVT()
         +-> SMIE()
           +-> Notify (\_TZ.THM, 0x80)

Has it already been double-checked whether thermal ACPI events come in through /proc/acpi/event? (I couldn't find anything by a quick search in this bug.)

If there are no events, it should be double checked whether the GPE:
0x1D
is active. Yakui or Zhang may be able to help here.

Zhang: you mentioned in a mail on the acpi list that it is known that Microsoft does not poll. Where do you know that from? Maybe there are still special solutions? (Maybe this info is outdated, especially given comment #57 here?)
Comment 62 Thomas Renninger 2008-09-04 06:08:36 UTC
> One workaround that probably works for a lot machines could be to take _TSP
> value as general thermal polling value if no _TZP value is provided.
This does not work here of course, because no passive cooling functions are provided at all...
Comment 63 Zhang Rui 2008-09-04 23:15:21 UTC
(In reply to comment #61)
> I currently rewrite this to get something accepted.
> First step is moving the polling to drivers/thermal/ and also the
> polling_frequency option from /proc to /sys, a test as soon as I have
> something
> would be great.
> 
> Anyway, I checked the DSDT table. One thing I realized is that:
> TZP -> Thermal Zone Polling (in deci-seconds) is nearly never used by any
> vendor (also by this one, I found one BIOS implementing it out of several
> dozens).
You're right. _TZP is not well implemented because Windows doesn't use it.

> Also TZD -> Thermal Zone Device: This is what we could pass as name for the
> thermal zone, but again only one (the same machine) uses it.
> 

> But TSP -> Thermal Sampling for Passive, the polling frequency when passive
> mode is entered is provided *really often*
> 
All the platforms that support passive cooling should have this method.

> One workaround that probably works for a lot machines could be to take _TSP
> value as general thermal polling value if no _TZP value is provided.
> The HW is supposed to work fine in passive mode with this polling frequency 
> and it also will if not in passive mode.
> 
Well, I think we are talking about the platforms which don't support passive cooling; that's why we wanted to fake a passive trip point in the first place.
But on these platforms, _TSP is not available at all.

> Another idea specific to this ASL implementation:
> There is exactly one way how BIOS can tell the OS to check for the
> temperature
> asynchronoulsy (without polling). This is:
> \_GPE._L1D    # HW issues an ACPI (SCI) interrupt with id 0x1D
>    +->  NEVT()
>          +-> SMIE()
>            +-> Notify (\_TZ.THM, 0x80)
> 
> Has there already been double checked whether thermal acpi events come in
> through /proc/acpi/event  (couldn't find anything by a quick search in this
> bug)
> 
> If there are no events, it should be double checked whether the GPE:
> 0x1D
> is active. Yakui or Zhang may be able to help here.
> 
I'm afraid this is not a generic solution...

> Zhang: you mentioned in mail on the acpi list that it is known that Microsoft
> does not poll. From where do you know that? Maybe there are still special
> solutions? (Maybe this info is outdated, especially through commment #57
> here?)
> 
Please refer to comment #28 in this thread.
Comment 64 Martin Ling 2008-09-07 03:36:34 UTC
(in reply to comment #61)

> Another idea specific to this ASL implementation:
> There is exactly one way how BIOS can tell the OS to check for the
> temperature
> asynchronoulsy (without polling). This is:
> \_GPE._L1D    # HW issues an ACPI (SCI) interrupt with id 0x1D
>    +->  NEVT()
>          +-> SMIE()
>            +-> Notify (\_TZ.THM, 0x80)
> 
> Has there already been double checked whether thermal acpi events come in
> through /proc/acpi/event  (couldn't find anything by a quick search in this
> bug)

Bingo! Yes, as the chip heats up, the count in /sys/firmware/acpi/interrupts/gpe1D starts increasing. Then the following come through on /proc/acpi/event as the chip heats up and then Matthew's patch causes it to be throttled:

processor CPU0 00000080 00000001
processor CPU0 00000080 00000000

Presumably this is what Windows is using to detect the overheat on this machine, then.
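
For anyone scripting this, the two event lines above can be split into fields with a few lines of Python. This is only a sketch: the field names (device class, bus id, event type, data) are an assumption about the /proc/acpi/event line layout, and the parser expects exactly four whitespace-separated fields with the last two in hexadecimal, as in the lines quoted above.

```python
from typing import NamedTuple


class AcpiEvent(NamedTuple):
    device_class: str  # e.g. "processor" (field names are assumed)
    bus_id: str        # e.g. "CPU0"
    event_type: int    # third field, parsed as hexadecimal
    data: int          # fourth field, parsed as hexadecimal


def parse_acpi_event(line):
    """Split one /proc/acpi/event line into its four fields."""
    device_class, bus_id, event_type, data = line.split()
    return AcpiEvent(device_class, bus_id, int(event_type, 16), int(data, 16))


if __name__ == "__main__":
    for line in ("processor CPU0 00000080 00000001",
                 "processor CPU0 00000080 00000000"):
        print(parse_acpi_event(line))
```

With the two lines from this comment, the event type parses to 0x80 in both cases and only the data field changes (1, then 0).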
Comment 65 Thomas Renninger 2008-09-07 18:52:49 UTC
Do you know how to read and override DSDT?
Here is a description
http://www.lesswatts.org/projects/acpi/faq.php

You should cluster above mentioned functions with lines like:
Store ("Variable VARIABLE is:", debug)
Store (VARIABLE, debug)

Maybe we get a clue why 0x1D GPEs do not end up as thermal notifications even though they possibly should.

The problem is that Dell in general (and especially at these places) makes a lot use of SMI calls. SMI triggers BIOS code getting executed behind our back (impossible to track in OS what is done there).

There is the Dell dcdbas driver which is passing userspace requests to the kernel, converts them into a SMI and triggers BIOS code.
You should double check whether this one got loaded and if yes try without. Best make sure that it never got loaded, e.g. by temporarily moving the .ko driver in /lib/modules/`uname -r`/... away.
Best you always have a look at /proc/acpi/event, there should pop up thermal events if things start to work.

Or maybe you need the userspace app (more unlikely). I do not know it, I expect it is a binary Dell app, but maybe it's even open source, you should be able to find it.
Comment 66 Martin Ling 2008-09-09 04:46:17 UTC
The dcdbas driver isn't present, I've never built it.

I'll have a look at patching the DSDT.
Comment 67 Zhang Rui 2008-12-16 21:18:25 UTC
Some updates:
Matthew has sent out two patches which can fix this problem.
http://marc.info/?l=linux-acpi&m=122832726128914&w=2
http://marc.info/?l=linux-acpi&m=122832694828391&w=2
Let's see what Len's comments on these patches are.
Comment 68 Zhang Rui 2009-02-23 23:05:01 UTC
The patches are shipped in the ACPI tree.
http://marc.info/?l=linux-acpi&m=123517357817664&w=2
http://patchwork.kernel.org/patch/8229/
Comment 69 Len Brown 2009-02-25 14:46:06 UTC
Martin, Matthew, Thomas,

Thank you for your tenacity.

I do now agree that it is worth the risk to try to deploy
this w/o DMI, for there will be other boxes like this one.

As Zhang-Rui pointed out, I staged Matthew's latest patch
for 2.6.30.  Martin, hopefully your old laptop
is still available to test and benefit from this patch?

The main things I wonder about right now are:

1. is polling really necessary?
   The observation in comment #64 suggests that on this box
   it should not be necessary.  Notify(0x80) on the thermal zone
   is a thermal zone status change, which is exactly what we should need
   to re-evaluate the temperature and compare to trip points.

   Martin, it would be interesting to discover at what temperature
   you start seeing these events.

2. observability
   We should probably print something when we manufacture a _PSV
   and we should occasionally print a warning when throttling is invoked.

   It is a risk that we may over-throttle and impact performance
   more than we need to.  Note that this will behave differently
   for the CPU_FREQ=y and n cases.  In the former, we try to exhaust
   the P-states before using T-states.  In the latter, we run at HFM
   and use T-states from there.

3. default constants
   _TZP in the ACPI spec is defined to have a minimum value of 30 seconds.
   (and a maximum of 5 minutes).  We are defaulting to poll every 10 sec.
    and I wonder if that is more often than necessary.  _TSP is basically
    a guess.  I don't have any better suggestions, but we should probably
    at least comment that these are guesses.
Comment 70 Martin Ling 2009-02-25 15:35:26 UTC
Len,

Thanks for getting this patch accepted. It sounds like it provides a useful catch for similar problems with other machines, even if we can ultimately resolve this particular case without resorting to polling.

I have kept the laptop in its original state, and have been using Matthew's patches, although I am currently running an *unpatched* 2.6.28.1 kernel. This is what happens with that, if I run the CPU flat out:

The GPE 0x1D interrupts start at 91 C, i.e. 10 degrees below the critical trip point. They then come in roughly every 10 seconds after that (not consistent, seems to be something like 7-15 sec, maybe related to the increasing temperature).

When the temperature hits 97 C, several things happen. I get "processor CPU0 00000080 00000001" on /proc/acpi/event. The CPU clock frequency drops from the max 2GHz to 1.6GHz, and the temperature drops immediately to about 85 degrees. Then a few seconds later, I get "processor CPU0 00000080 00000000" on /proc/acpi/event. The CPU goes back up to 2GHz, and the temperature jumps back to 97-98 degrees. This cycle continues, spending what seems to be 5 seconds at each frequency, but the temperature on return to 2GHz keeps increasing towards the trip point.

So I guess the BIOS is already trying to do something itself here via SMI, but it's not achieving the desired results. It is sending out ACPI events though, which presumably we should be using instead of polling to trigger our own passive cooling efforts.
Comment 71 Matthew Garrett 2009-02-25 15:48:56 UTC
The CPU notifications are indicating that the _PPC contents have been updated - this effectively gives the OS the set of available frequencies. The first is presumably removing the higher P states, but the latter restores them and allows the processor to go back to full speed. In that respect, we're behaving correctly here.

What I wonder is whether Windows implements more of a lag in the switching - alternatively, it could simply be us failing to keep some other system component within its designed thermal envelope and that altering the thermal characteristics of the system. We /seem/ to be doing the right thing, and the BIOS seems to be going to some effort to do something useful, so the fact that it isn't working is slightly curious...
Comment 72 Matthew Garrett 2009-02-25 16:00:11 UTC
Looking at the ACPI dump, GPE 1d seems to be a simple "SMI event" trigger to let the OS know that the low-level firmware has noticed something happening. If you want to diagnose this any further, then follow the instructions in http://www.lesswatts.org/projects/acpi/overridingDSDT.php and print the contents of Local0 after the Store(SMI(blah statements in NEVT and SMIE.
Comment 73 Thomas Renninger 2009-02-26 01:06:27 UTC
Some Acer also seem to need thermal polling:
https://bugzilla.novell.com/show_bug.cgi?id=404245
I can make the guy try these and look deeper whether it could help as soon as the patch is mainline.
Thanks Matthew.
Comment 74 Thomas Renninger 2009-02-26 03:00:07 UTC
Hmm, there is the corner case of machines which do have an active and a passive thermal trip point, but do not throw thermal events and need thermal polling. But maybe we get away with it, if BIOS developers who implement a passive trip point think far enough ahead to provide events.
Comment 75 Martin Ling 2009-02-27 08:14:20 UTC
Created attachment 20380 [details]
Log of DSDT debug messages, annotated with temperature and CPU frequency.

Okay, I have patched the DSDT as Matthew suggested with a debug statement after the initial Store() in the NEVT and SMIE methods.

I then wrote a quick script to annotate these events with the temperature and CPU frequency as they appear in the log.

The output is attached. I ran a while loop and allowed the CPU to heat up to 99C before stopping it.
Comment 76 Matthew Garrett 2009-02-27 08:24:12 UTC
Interesting. We do indeed get thermal notifications as we head towards the trip point, and as we get even closer it fires the CPU notifications. So in this case it sounds like we don't need to poll - the firmware lets us know that we're approaching the critical temperature, at which point we possibly ought to be throttling back ourselves. I've got a machine here that shows some "interesting" thermal characteristics, so I'll see what its behaviour is.
Comment 77 Martin Ling 2009-02-27 08:54:45 UTC
It would seem that if the BIOS is starting to cut off P-states, we're in a passive cooling situation and should probably start treating it as such even if there wasn't a threshold. The temperature when that notification comes in can be taken to be the threshold, the zone set to passive, and the passive cooling equation in acpi_thermal_passive() can start being applied.

But I suppose that depends on knowing the thermal zone is associated with that processor, which may not be clear?
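
For reference, the passive cooling equation from the ACPI specification is deltaP[%] = _TC1 * (Tn - Tn-1) + _TC2 * (Tn - Tt), where Tn is the current temperature, Tn-1 the previous sample and Tt the passive trip point. A small Python sketch of that calculation follows, taking the temperature at the notification as Tt as suggested above. The _TC1/_TC2 constants and the temperatures in the example are assumed values, since this zone provides none, and the function name is made up for illustration.

```python
def passive_delta_percent(tc1, tc2, temp_now, temp_prev, trip):
    """Performance change in percent requested by the passive cooling formula.

    delta_P = _TC1 * (Tn - Tn-1) + _TC2 * (Tn - Tt)
    A positive result means performance should be reduced by that amount;
    a negative result means it may be raised again.
    """
    return tc1 * (temp_now - temp_prev) + tc2 * (temp_now - trip)


if __name__ == "__main__":
    # Assumed constants and readings, for illustration only.
    print(passive_delta_percent(tc1=1, tc2=2, temp_now=97, temp_prev=96, trip=95))
    print(passive_delta_percent(tc1=1, tc2=2, temp_now=93, temp_prev=95, trip=95))
```

The first call models a zone that is above the trip point and still rising, the second one that has dropped below it and is cooling.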
Comment 78 Matthew Garrett 2009-02-27 08:58:15 UTC
Right, your thermal zone isn't tied to the CPU, so we can't work back from there. I suspect that getting a thermal notification and realising that we're within 5 degrees of the critical cutoff temperature ought to be enough indication that we should be passively cooling. Len, what are your thoughts?
Comment 79 Martin Ling 2009-02-27 09:57:26 UTC
I wonder if it would make sense to split up the functionality in Matthew's patch into two parts. Rather than a single sysfs entry at /sys/class/thermal/xxx/passive, used to both add a forced passive trip point and force polling for it, have two:

/sys/class/thermal/xxx/force_passive_temp (temperature at which to force passive, default to actual passive trip if defined, or 5 degrees below critical)

/sys/class/thermal/xxx/force_passive_poll (interval at which to poll for tripping this temperature, default to 0)

Then for machines that behave like this one, the additional trip point alone would be sufficient, and for any that actually need polling, that can be set separately.
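
The default proposed above for force_passive_temp could be sketched in Python like this. Both sysfs entries are only a proposal in this comment, so the function name and its parameters are hypothetical; the 101 C critical value in the example is inferred from comment #70 (interrupts starting at 91 C, 10 degrees below critical).

```python
def default_force_passive_temp(critical_c, passive_c=None, margin_c=5):
    """Pick the default forced passive temperature in degrees C.

    Use the real passive trip point if the zone defines one; otherwise
    fall back to margin_c degrees below the critical trip point.
    """
    if passive_c is not None:
        return passive_c
    return critical_c - margin_c


if __name__ == "__main__":
    # No passive trip point defined, critical assumed to be 101 C.
    print(default_force_passive_temp(critical_c=101))
    # Zone with a real passive trip point at 90 C (made-up value).
    print(default_force_passive_temp(critical_c=101, passive_c=90))
```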
Comment 80 Len Brown 2009-03-15 12:22:35 UTC
Thanks for the log in comment #75

> 16:07:04 - 92C
> 16:07:28 - 97C

Running at 2GHz, the temperature on this box rose 5C in only 14 seconds?!

I don't know how the timestamps and the temperatures line up,
but the temperature rises seem to leap even faster when we go from
1.6GHz back up to 2GHz -- 12C in 3 seconds?!

Then between 16:08:09 to 16:08:13 the temperature is reported
to drop from 99C to 68C -- 31C in 4 seconds?!

I don't think this could possibly reflect reality.
The temperature readings here seem to go in the right direction,
but to be un-calibrated or something.

Martin,  Can you boot your XP partition again and observe if
they see the temperatures jump around this fast also?
Also, on Windows, what is the highest temperature you
can achieve -- can you force it up to the ACPI CRT threshold?

I'm wondering if thermal.nocrt is the right workaround for this box
(what do you see when you try it -- how high can you go?)
for it seems we are trying to design an algorithm to handle
what appears to be semi-random input...
Comment 81 Martin Ling 2009-04-02 22:32:55 UTC
I have just had another look at this under Windows, again using the I8KFanGUI tool. The temperatures reported are the same as under Linux. It seems there are big jumps whenever the CPU clock speed changes, sometimes as big as 98C to 68C in a second when dropping from 2GHz to 800MHz. Within a single clock speed the temperatures are more believable. I even wonder if they would make more sense with 10C per P-state subtracted.

The other news, however, is that I can no longer reproduce the results from comment #57: Windows now overheats the system in the same way as Linux, after a few minutes of flat-out CPU usage.

There have been several Windows XP updates since then so it's possible one of these has changed the behavior, either by changing the thermal management or by just making Windows run the system hotter somehow.

Another explanation would be that Linux was always running a bit hotter than Windows in some way, causing it to just trip the limit while Windows did not, and then some aspect of the cooling system (perhaps heatsink paste as suggested above) has degraded further over the past year until both operating systems can cause the overheat.
Comment 82 Zhang Rui 2009-05-05 06:47:25 UTC
The patches are shipped upstream.
commit b1569e99c795bf83b4ddf41c4f1c42761ab7f75e
commit 03a971a2899886006f19f3495973bbd646d8bdae
Comment 83 Len Brown 2009-05-07 19:55:28 UTC
Martin,
Thanks for verifying that Windows also observes the huge temperature jumps
that we see in Linux.  It could be that the cooling hardware has become
intermittent, or that the EC/sensors on this box are not reliable.
In either case, it is good to know it isn't a Linux specific issue.

Also, thanks for verifying that Windows too can overheat
and shut-down this system.  This suggests that Linux does
not have an inherent disadvantage vs. Windows in this area.

b1569e99c795bf83b4ddf41c4f1c42761ab7f75e
ACPI: move thermal trip handling to generic thermal layer
shipped in Linux-2.6.30-rc1

03a971a2899886006f19f3495973bbd646d8bdae
thermal: support forcing support for passive cooling
shipped in Linux-2.6.30-rc1

Indeed, Matthew's patches above make Linux more robust
than Windows on this box.  Please let us know what you experience.

Note, however, that throttling is a last resort solution.
The best it can do is to make your system run slowly
to prevent it from shutting down.  If you want your machine
to run fast under heavy load, it will likely require some hardware
maintenance.
Comment 84 Martin Ling 2009-05-08 14:29:48 UTC
I'm now running 2.6.30-rc4, with 95000 written to the thermal zone passive setting. This works fine, as per previous patches. 
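
For anyone wanting to script the same setting, here is a rough Python sketch. It assumes the passive attribute takes millidegrees Celsius (which is consistent with 95000 standing for 95 C) and that the caller supplies the path to the zone's passive file; the thermal_zone0 name in the usage note is only an example, and writing to sysfs needs root.

```python
def celsius_to_millidegrees(temp_c):
    """Convert degrees C to the millidegree integer assumed by the sysfs file."""
    return int(round(temp_c * 1000))


def set_forced_passive(passive_path, temp_c):
    """Write a forced passive temperature to the given 'passive' file.

    passive_path: path to a thermal zone's passive attribute (supplied by
                  the caller; the zone name varies between machines).
    """
    with open(passive_path, "w") as f:
        f.write(str(celsius_to_millidegrees(temp_c)))


if __name__ == "__main__":
    print(celsius_to_millidegrees(95))
```

Usage would be something like set_forced_passive("/sys/class/thermal/thermal_zone0/passive", 95), with the zone name adjusted to match the machine.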

Now that the question of finding a specific Linux issue on this machine has closed, I'll strip it down and see what I can do to improve the thermal situation.

Thanks to everyone who's helped with this. I'm sorry this bug appears to have been a bit of a wild goose chase as far as a Linux/Windows difference is concerned, but that certainly did seem to be what was happening at first.
