Bug 8865 - Critical temperature reached (95 C), shutting down.
Status: CLOSED UNREPRODUCIBLE
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal
Hardware: All Linux
Importance: P1 normal
Assignee: Len Brown
 
Reported: 2007-08-08 13:01 UTC by Devon C Miller
Modified: 2008-07-01 05:57 UTC
Kernel Version: 2.6.22

Description Devon C Miller 2007-08-08 13:01:46 UTC
Entered per bug 3584, comment 51.

From bug 3584, comment 47:
I'm running an HP Pavilion laptop ze1250 (AMD Mobile XP 1800+) and I've seen
this on and off since 2.6.8. With 2.6.22 it has gotten much, much worse.

Adding a printk to acpi_thermal_get_temperature gives me output like this:
Temperature is 76C
Temperature is 76C
Temperature is 77C
Temperature is 95C
Critical temperature reached (95 C), shutting down.
Temperature is 76C

The only clue I have to add is that I haven't seen it happen with cpu frequency
scaling (CONFIG_CPU_FREQ) disabled or with the governor set to powersave or
performance.

If someone can give me some suggestions on where to go from here I'll be more
than happy to help troubleshoot. 

From bug 3584, comment 50:

Checking my config I found I had CONFIG_HWMON=y and CONFIG_SENSORS_VIA686A=y.

I don't have lm-sensors installed at the moment (probably did at some point in
the past), so I recompiled without those options. Been running 2 days now
without a single thermal fault. Much better than the previous behavior of 3
faults before completing a cold boot.

So, since I don't have lm-sensors, that means the hwmon and/or via sensor
drivers are causing problems just by being there.

I'm happy since my system is running better.

However, since I have a system that will misbehave, if you need a guinea pig to
test or help debug, I'll be glad to help; just tell me what to do.
Comment 1 Andrew Morton 2007-08-08 13:13:10 UTC
Len, Dave: could you please take a look in here, let us know whether this
is likely to be a cpufreq problem, an ACPI problem or an hwmon problem?

Mark, it'd be good if you can make suitable bugzilla alterations so that
the hwmon reports get assigned to yourself, but I'm not sure how that is
done.  Martin Bligh will know...

Thanks.
Comment 2 Jean Delvare 2007-08-08 13:40:58 UTC
(In reply to comment #1)
> Len, Dave: could you please take a look in here, let us know whether this
> is likely to be a cpufreq problem, an ACPI problem or an hwmon problem?

I asked Devon to create this bug so that I can investigate. If I come to the conclusion that the via686a driver is innocent, then Len and Dave can take over.

> Mark, it'd be good if you can make suitable bugzilla alterations so that
> the hwmon reports get assigned to yourself, but I'm not sure how that is
> done.  Martin Bligh will know...

I am perfectly fine being the default assignee for hwmon bugs. If Mark really wants the job, that's fine with me, but otherwise there's no reason to change anything (for now, at least.)
Comment 3 Jean Delvare 2007-08-08 13:58:15 UTC
Devon, it is quite unlikely (although not strictly impossible) that the via686a driver is causing your problems just by being loaded. The fact that you "don't have lm-sensors" doesn't mean that nobody is making use of the via686a driver. Some applications (e.g. gkrellm) read the temperature values from /sys directly. So, please double-check that you really don't have any application reporting hardware monitoring information coming from the via686a driver.

BTW, does your system really have a VIA VT82C686 chip? I can't remember ever seeing this chip used in a laptop. Please attach the output of lspci.

If the mere fact of loading the via686a driver (without ever reading from its sysfs files) really causes trouble, then the problem should stay after you unload it. Please try compiling the via686a driver as a module, let it load at boot time, unload it, and see if you still have the problem.

Your original description suggests that the problem happens at boot time? Only at boot time, or more frequently at boot time, or...? Please comment on this.
Comment 4 Jean Delvare 2007-09-23 00:28:02 UTC
Devon, can you please answer my questions in comment #3?
Comment 5 Devon C Miller 2007-09-24 06:17:12 UTC
Sorry for the delay, been swamped with work.

It may very well be cpufreq related. I've been running for a while now with CONFIG_HWMON=n and CONFIG_SENSORS_VIA686A=n. It still happens, but not as often: once in a great while during boot (versus 2-3 times per boot with those options set). I've also had it happen when starting a build. That fits with cpufreq, since the build drives up the load and powernowd ratchets up the CPU frequency.

A bit confused by the lspci output, though; it lists the VT82C686 under the IDE controller and under the multimedia audio controller. Not sure what to make of that.

00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133] (rev 80)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
00:0a.0 CardBus bridge: O2 Micro, Inc. OZ601/6912/711E0 CardBus/SmartCardBus Controller
00:0c.0 FireWire (IEEE 1394): Texas Instruments TSB43AB21 IEEE-1394a-2000 Controller (PHY/Link)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8231 [PCI-to-ISA Bridge] (rev 10)
00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:11.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1e)
00:11.4 Bridge: VIA Technologies, Inc. VT8235 ACPI (rev 10)
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT82C686 AC97 Audio Controller (rev 40)
00:11.6 Communication controller: VIA Technologies, Inc. AC'97 Modem Controller (rev 20)
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 51)
01:00.0 VGA compatible controller: S3 Inc. VT8636A [ProSavage KN133] AGP4X VGA Controller (TwisterK) (rev 01)
Comment 6 Jean Delvare 2007-09-24 06:22:36 UTC
Can you please attach the output of "lspci -n"?
Comment 7 Devon C Miller 2007-09-26 11:22:31 UTC
00:00.0 0600: 1106:0305 (rev 80)
00:01.0 0604: 1106:8305
00:0a.0 0607: 1217:6972
00:0c.0 0c00: 104c:8026
00:11.0 0601: 1106:8231 (rev 10)
00:11.1 0101: 1106:0571 (rev 06)
00:11.2 0c03: 1106:3038 (rev 1e)
00:11.4 0680: 1106:8235 (rev 10)
00:11.5 0401: 1106:3058 (rev 40)
00:11.6 0780: 1106:3068 (rev 20)
00:12.0 0200: 1106:3065 (rev 51)
01:00.0 0300: 5333:8d02 (rev 01)
Comment 8 Jean Delvare 2007-09-26 11:46:36 UTC
As I suspected, you do not have a VT82C686 chip but a VT8231. This means that the via686a driver has _no_ effect on your system. So, if CONFIG_HWMON=y and CONFIG_SENSORS_VIA686A=y are the only hwmon-related options you had enabled, this means that your problem has nothing to do with hwmon.

Reassigning to ACPI maintainer.
Comment 9 Michal Suchanek 2007-11-20 03:40:52 UTC
I get something that looks similar on a Pentium M notebook with Ali chipset.

When it runs for a long time it sometimes shuts down. I do not think it is related to real temperature readings, as they are pretty much constant after an hour or so. It also does not shut down during heavy processing; it usually happens when the system is pretty much idle and I am doing something like text editing.

Would it be possible to make the driver more cautious, and only shut down if multiple readings are high?
Also, if there is one high reading in a run of low readings, it should print a warning to the log.

Under normal circumstances the temperature cannot jump ten degrees between readings; it would rarely change by more than one. Any quick change is more likely an error in the reading than a real hardware condition.
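
The filtering proposed here can be sketched in plain C. This is only an illustration of the idea, not the kernel's thermal driver: the struct, the function name and the value of `CRIT_CONFIRM_READINGS` are hypothetical, and the ten-degree step limit is taken from the estimate above.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical tuning values; neither comes from the kernel's thermal driver. */
#define CRIT_CONFIRM_READINGS 3   /* consecutive critical readings required */
#define MAX_PLAUSIBLE_STEP_C  10  /* jumps this large between polls are suspect */

struct crit_filter {
    int last_temp_c;    /* previous reading, valid only if have_last is set */
    bool have_last;
    int crit_streak;    /* consecutive readings at or above the critical point */
    int suspect_count;  /* readings that jumped implausibly far */
};

/*
 * Feed one reading in degrees Celsius. Returns true only once
 * CRIT_CONFIRM_READINGS consecutive readings are at or above crit_temp_c.
 */
static bool crit_filter_update(struct crit_filter *f, int temp_c, int crit_temp_c)
{
    /* An implausible jump is counted; a real driver would log a warning here. */
    if (f->have_last && abs(temp_c - f->last_temp_c) >= MAX_PLAUSIBLE_STEP_C)
        f->suspect_count++;

    f->last_temp_c = temp_c;
    f->have_last = true;

    if (temp_c >= crit_temp_c)
        f->crit_streak++;
    else
        f->crit_streak = 0;

    return f->crit_streak >= CRIT_CONFIRM_READINGS;
}
```

Fed the sequence from the description (76, 76, 77, 95, 76) with a critical point of 95, this filter never requests a shutdown and flags two suspect jumps. The trade-off is that a genuine overheat is acted on a few polling intervals later.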
Comment 10 Len Brown 2008-01-09 00:07:36 UTC
Devon,
please confirm that you can reproduce the failure with
CONFIG_HWMON=n and the latest stable kernel.

please use thermal.nocrt=1 to disable the shutdown on critical trip point.
You should still, however, get a warning in dmesg that you reached 95C.
please enable CONFIG_PRINTK_TIME=y and enable your
acpi_thermal_get_temperature hook so from dmesg we'll be able to see
how fast the reported temperature is changing.

which cpufreq governor is in use when the problem occurs?
is the problem still seen when using the "performance" cpufreq governor?

Michal, please use thermal.nocrt=1 and printk time like above
to see if perhaps the shutdown is from a transient erroneous
reading, or if the temperature really is critical.
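
As a rough aid for reading such a log, the rate of change between two timestamped lines can be computed as below. This is a sketch under assumptions: it expects lines of the form `[  100.000000] Temperature is 77C`, which is the CONFIG_PRINTK_TIME prefix followed by the message format of the printk hook from the description. The function names are hypothetical.

```c
#include <stdio.h>

/*
 * Parse one dmesg line of the assumed form
 * "[  100.000000] Temperature is 77C". Returns 1 on success, 0 otherwise.
 */
static int parse_temp_line(const char *line, double *seconds, int *temp_c)
{
    return sscanf(line, " [%lf] Temperature is %d", seconds, temp_c) == 2;
}

/*
 * Compute the change in degrees Celsius per second between two log lines.
 * Returns 0 on success, -1 if a line does not parse or time does not advance.
 */
static int temp_rate(const char *prev, const char *cur, double *rate_c_per_s)
{
    double t0, t1;
    int c0, c1;

    if (!parse_temp_line(prev, &t0, &c0) || !parse_temp_line(cur, &t1, &c1))
        return -1;
    if (t1 <= t0)
        return -1;

    *rate_c_per_s = (c1 - c0) / (t1 - t0);
    return 0;
}
```

A jump from 77C to 95C across two lines that are two seconds apart (made-up timestamps) gives 9 degrees per second, which would point to a transient misreading rather than real heating.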
Comment 11 Michal Suchanek 2008-02-05 02:50:33 UTC
I can no longer reproduce the problem. Perhaps a newer kernel fixed it or perhaps I am not using the notebook often enough to see it.
I can try to compile with printk time. However, the message appears only once during shutdown, which takes some time. So either it is not produced very often (in which case the exact printk time would be of no use), or it is only produced on positive readings and this happens only once.
Comment 12 Len Brown 2008-03-25 18:56:07 UTC
Michal, Devon,
Please re-open if this is reproducible.
Comment 13 Michal Suchanek 2008-07-01 05:57:59 UTC
I now see the problem with a 2.6.25 kernel (system shut down while idle due to critical temp, debian 2.6.25-2).

Also, when I experimented with fans earlier I found that only the first fan is ever started, and that starting one of the other fans reduces the critical zone temperature significantly. Still, I have never observed a temperature anywhere near 80 °C.

There are three fan objects which probably implement different speed levels of a single fan - but I haven't looked inside.

Cannot reopen.
