Bug 3584

Summary: lm_sensors vs. ACPI - Critical temperature reached (157 C), shutting down.
Product: Drivers Reporter: Tom Malfrere (tom.malfrere)
Component: Hardware Monitoring Assignee: Jean Delvare (jdelvare)
Status: REJECTED INSUFFICIENT_DATA    
Severity: normal CC: acpi-bugzilla, devon.c.miller, encolpe, epprecht, jdelvare, kernel, pmiscml, protasnb
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.11.5 Subsystem:
Regression: --- Bisected commit-id:
Attachments: lm_sensors service script
sensors.conf
dmesg output for the case, the fan doesn't start
acpidump output
Detect conflicting ACPI I/O accesses

Description Tom Malfrere 2004-10-17 14:27:48 UTC
Distribution: SuSE 9.1 Professional
Hardware Environment: 
 MEDION 9580A (ASUS L8400)
 CPU Intel PIII-1000 (mobile with speedstep technology)
 chipset: Intel 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge
          Intel 82371AB/EB/MB PIIX4 ACPI
Software Environment:

Problem Description:
I often get a "critical temperature reached (xxx
Comment 1 Shaohua 2004-10-19 01:50:25 UTC
I believe this problem has been fixed in latest 2.6 kernel.
Comment 2 Len Brown 2004-11-17 00:29:05 UTC
please verify that this is still an issue
or if it is indeed fixed in 2.6.9
Comment 3 Len Brown 2005-01-03 18:02:47 UTC
Please re-open if still an issue with linux 2.6.10 or later.
Comment 4 Tom Malfrere 2005-03-23 04:58:37 UTC
I've compiled a 2.6.11.5 kernel myself on sunday and installed it on monday.
I've made NO changes to the kernel source.
Tuesday 23/03/2005 I had the same problem again...

a quick google with the keywords "critical temperature reached" kernel acpi
shows that there are several other cases reported...

I'm willing to help to debug the problem, but I don't know the kernel sources...
Comment 5 Tom Malfrere 2005-03-26 06:36:19 UTC
problem was seen with 2.6.5 and 2.6.11.5
Comment 6 Tom Malfrere 2005-07-31 17:42:11 UTC
I'm running SuSE 9.3 standard kernel now, this is kernel 2.6.11.4-21.7-default
I did not have a single thermal shutdown in 2 weeks until I installed the
lm_sensors service some days ago...
Today I had 4 thermal shutdowns... the only operation I was running was
downloading images from my digital camera over USB.
I disabled the lm_sensors service and I had no problems anymore...
I suspect there is some sort of resource locking problem between the kernel and
the lm_sensors service...

I hope this helps you a bit further, or should I report this to the lm_sensors team?
Comment 7 Ortwin Glück 2005-08-01 08:34:48 UTC
I have been seeing this occasionally on 2.6.12.2 on my Acer laptop as well.
Looks like the sensor reports an invalid value. It would be nice to re-read the
sensor and verify the value in this case before shutting down.

In polling mode: wait for three consecutive values above the limit
In event mode: just re-read the temperature three times (maybe wait 1 second
between reads)
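
The re-read idea above can be sketched as follows. This is only an illustration in Python, not kernel code: read_temp stands for whatever returns the current temperature in degrees Celsius, and the function name, the three reads and the one second delay are assumptions taken from the suggestion above.

```python
import time

def confirmed_over_limit(read_temp, limit_c, reads=3, delay_s=1.0, sleep=time.sleep):
    """Return True only if `reads` consecutive readings are all above limit_c."""
    for attempt in range(reads):
        if read_temp() <= limit_c:
            return False          # one sane reading is enough to cancel
        if attempt < reads - 1:
            sleep(delay_s)        # give a glitch time to clear before re-reading
    return True
```

With a sequence such as 157, 67, 66 and a 92 C limit the check returns False on the second read, so a single bogus value would not trigger a shutdown.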
Comment 8 Jean Delvare 2005-10-05 05:11:00 UTC
I'm joining the discussion here after Tom opened an lm_sensors support ticket
pointing to this bug.

Tom, I'd need more information about what you call "lm_sensors service". Please
tell us exactly what enabling this service does. Does it load kernel drivers? If
so, which ones? Does it run programs, such as "sensors -s"? Does it start
daemons, such as "sensord"?

Can you also tell which hardware monitoring chip your laptop uses?
"sensors-detect" with the "lm_sensors service" disabled should tell you.

I would also be interested in the output of "sensors" with the "lm_sensors
service" enabled, and your /etc/sensors.conf file.

My first guess is that the ACPI and lm_sensors are fighting for the hardware
monitoring device, but there may be more than that.
Comment 9 Tom Malfrere 2005-10-06 13:28:27 UTC
Created attachment 6245 [details]
lm_sensors service script

I've added the script that controls the lm_sensors service.
As you can see it runs "sensors -s"
Comment 10 Tom Malfrere 2005-10-06 13:37:18 UTC
here's the output of the sensors-detect:

Client found at address 0x37
Client found at address 0x4e
Probing for `National Semiconductor LM75'... Failed!
Probing for `Dallas Semiconductor DS1621'... Failed!
Probing for `Analog Devices ADM1021'... Failed!
Probing for `Analog Devices ADM1021A/ADM1023'... Failed!
Probing for `Maxim MAX1617'... Success!
    (confidence 3, driver `adm1021')
Probing for `Maxim MAX1617A'... Success!
    (confidence 7, driver `adm1021')
Probing for `TI THMC10'... Failed!
Probing for `National Semiconductor LM84'... Failed!
Probing for `Genesys Logic GL523SM'... Failed!
Probing for `Onsemi MC1066'... Failed!
Probing for `Maxim MAX1619'... Failed!
Probing for `National Semiconductor LM82'... Failed!
Probing for `National Semiconductor LM83'... Failed!
Probing for `Maxim MAX6659'... Success!
    (confidence 4, driver `to-be-written')
Probing for `Maxim MAX6633/MAX6634/MAX6635'... Failed!
Client found at address 0x50
Probing for `SPD EEPROM'... Success!
    (confidence 8, driver `eeprom')
Probing for `DDC monitor'... Failed!
Probing for `Maxim MAX6900'... Failed!
Client found at address 0x69
Comment 11 Tom Malfrere 2005-10-06 13:41:35 UTC
Created attachment 6246 [details]
sensors.conf
Comment 12 Tom Malfrere 2005-10-06 13:46:16 UTC
linux:/home/tom # /etc/init.d/lm_sensors start
Starting up sensors:                                                 done
linux:/home/tom # sensors
eeprom-i2c-0-50
Adapter: SMBus PIIX4 adapter at 2180
Memory type:            SDR SDRAM DIMM
Memory size (MB):       128

max1617a-i2c-0-4e
Adapter: SMBus PIIX4 adapter at 2180
Board:       +56
Comment 13 Tom Malfrere 2005-10-06 13:57:12 UTC
I also think there's a conflict between the lm_sensors and the ACPI, which
results in bad temperature values.
The temperature values that are received by the ACPI are way too high and cause
an immediate thermal shutdown.

Is there some sort of resource locking on the sensor chip? Is it possible to do
something like that?
Comment 14 Jean Delvare 2005-10-09 11:11:47 UTC
Try adding the following line to /etc/modprobe.conf:
  options adm1021 read_only=1
And see if it helps.

Also, please provide the output of:
  modprobe i2c-dev
  i2cdump 0 0x4e

Anything in /proc/acpi/thermal_zone? Especially the contents of the temperature
and trip_points files, if they exist, would be of interest. I wonder if ACPI and
lm_sensors agree on the current and limit temperatures.

Note that the limit temperature for your CPU is set to 67 degrees according to
the adm1021 driver, this matches one of your logged alerts. Your CPU temperature
reads 64 degrees, which is a bit high unless it is heavily loaded, and at any
rate is just a few degrees less than the high limit. Maybe you have a real
overheating problem after all.

NB: Please set the attachment type to text/plain.
Comment 15 Bastian M. Wojek 2005-10-09 12:17:58 UTC
I've got the same problem as described by Tom on my MD 9580A.
At the moment I'm running two different kernels, one with ACPI Subsystem 
revision 20050408 and the other one with revision 20050902. The "thermal 
shutdown" only occurs with the newer ACPI!
For now I exchanged the poweroff-command in the thermal.c and create a log-entry 
of the temperature in /proc/acpi/thermal_zone/THRM/temperature instead.

The result: cat /var/log/messages

Oct  9 13:57:56 psycho kernel: acpi_thermal-0472 [06] thermal_critical      : 
Critical trip point
Oct  9 13:57:56 psycho logger: temperature:             55 C
Oct  9 13:57:56 psycho logger: temperature:             55 C
Oct  9 13:57:56 psycho logger: ACPI group thermal_zone / action THRM is not 
defined
Oct  9 13:57:56 psycho logger: ACPI group thermal_zone / action THRM is not 
defined

and so on... 
The logged temperature values are always between 55 and 65 C - but the values 
displayed while working in a console are up to 191 C.

The /proc/acpi/thermal_zone/THRM/trip_points is the following:

critical (S5):           92 C
passive:                 78 C: tc1=1 tc2=4 tsp=60 devices=0xc1275d88
active[0]:               65 C: devices=0xc1275748

Maybe it helps to know that this problem refers only to specific ACPI versions.
Please let me know if I can provide additional output of interest.
Comment 16 Tom Malfrere 2005-10-09 14:27:50 UTC
linux:/home/tom # modprobe i2c-dev
linux:/home/tom # i2cdump 0 0x4e
No size specified (using byte-data access)
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-0, address 0x4e, mode byte
Continue? [Y/n]
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 38 41 00 00 01 7f bf 43 bf 01 01 01 01 01 01 01    8A..???C????????
10: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
20: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
30: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
40: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
50: 01 01 01 01 01 01 41 01 01 01 01 01 01 01 01 01    ??????A?????????
60: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
70: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
80: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
90: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
a0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
b0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
c0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
d0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
e0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
f0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 4d 01    ??????????????M?
linux:/home/tom #
Comment 17 Tom Malfrere 2005-10-09 14:37:05 UTC
linux:/home/tom # cd /proc/acpi/thermal_zone/
linux:/proc/acpi/thermal_zone # cd THRM/
linux:/proc/acpi/thermal_zone/THRM # ls
.  ..  cooling_mode  polling_frequency  state  temperature  trip_points
linux:/proc/acpi/thermal_zone/THRM # cat cooling_mode
cooling mode:   active
linux:/proc/acpi/thermal_zone/THRM # cat polling_frequency
polling frequency:       5 seconds
linux:/proc/acpi/thermal_zone/THRM # cat state
state:                   passive
linux:/proc/acpi/thermal_zone/THRM # cat temperature
temperature:             59 C
linux:/proc/acpi/thermal_zone/THRM # cat trip_points
critical (S5):           92 C
passive:                 78 C: tc1=1 tc2=4 tsp=60 devices=0xc1277c20
active[0]:               65 C: devices=0xc1277980
linux:/proc/acpi/thermal_zone/THRM #
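
To compare the ACPI view with other readings, the /proc output above can be parsed with a few lines of Python. This is a minimal sketch that assumes the "label: NN C" layout shown in this comment; the function name is hypothetical.

```python
import re

# trip_points content as shown in the transcript above
TRIP_POINTS = """\
critical (S5):           92 C
passive:                 78 C: tc1=1 tc2=4 tsp=60 devices=0xc1277c20
active[0]:               65 C: devices=0xc1277980
"""

def parse_celsius(text):
    """Map each label (text before the first ':') to the 'NN C' value on its line."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s+(-?\d+) C", line)
        if match:
            values[match.group(1).strip()] = int(match.group(2))
    return values

if __name__ == "__main__":
    trips = parse_celsius(TRIP_POINTS)
    current = parse_celsius("temperature:             59 C")["temperature"]
    # Margin between the current reading and the critical trip point
    print(trips["critical (S5)"] - current)
```

For the values in this comment the margin to the critical trip point is 33 degrees.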
Comment 18 Tom Malfrere 2005-10-09 14:59:58 UTC
I've added the 'options adm1021 read_only=1' and started the lm_sensors service
again. Now I'm waiting for a possible shutdown, I will let my laptop run
overnight. Without the extra line added, it normally shuts down during the night.

BTW. I'm an embedded software engineer myself (68000 and arm7), no embedded
linux though, I'm familiar with the i2c protocol. I'm interested in the meaning
of the dumped data.
If I can help, just ask...
Comment 19 Tom Malfrere 2005-10-10 00:03:21 UTC
This morning, my laptop was shut down, probably because of a thermal shutdown. I
didn't have time this morning. I will check my kernel messages tonight.

So, the extra line didn't really help... :-(
Comment 20 Jean Delvare 2005-10-10 02:15:57 UTC
The dump simply shows the MAX1617A register map. You can get a datasheet from
Maxim if you are curious:
  http://www.maxim-ic.com/quick_view2.cfm/qv_pk/1964
It seems to confirm that this really is a MAX1617A chip, the only surprising
thing is the value 0x41 at address 0x56, while this chip isn't supposed to have a
register there. Is it still there if you attempt a second dump?

If the shutdown still occurs with read_only=1, this means that the problem is
not caused by any kind of chip reprogramming. Considering that loading the
driver does not do much per se, I suspect that you have some program making use
of the driver and triggering reads to the chip. Is it true? Do you have any of
sensord, gkrellm, xsensors, ksensors, wmtemp or anything of that kind loaded?

The strange readings (157 degrees) could be caused by SMBus collisions. Don't
you see any error message from "piix4" in your logs? Messages such as "Failed!"
followed by a number, or "SMBus Timeout!" could indicate SMBus problems, in turn
causing bad reads from the MAX1617A chip.
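
As an illustration of reading the dump from comment 16, here is a minimal sketch that decodes its first row. The register offsets (0x00 local temperature, 0x01 remote temperature, 0x05/0x06 local high/low limit, 0x07/0x08 remote high/low limit) are assumed from the MAX1617A datasheet and should be checked against it; the names are made up for this example. Temperatures are treated as signed 8-bit values in degrees Celsius.

```python
# First row of the i2cdump from comment 16 (registers 0x00..0x08).
ROW0 = bytes.fromhex("38 41 00 00 01 7f bf 43 bf")

# Assumed register layout; verify against the datasheet before relying on it.
REGISTERS = {
    0x00: "local_temp",
    0x01: "remote_temp",
    0x05: "local_high",
    0x06: "local_low",
    0x07: "remote_high",
    0x08: "remote_low",
}

def to_signed(byte):
    """Interpret an unsigned byte as a two's complement 8-bit value."""
    return byte - 256 if byte >= 0x80 else byte

def decode(row):
    return {name: to_signed(row[offset]) for offset, name in REGISTERS.items()}

if __name__ == "__main__":
    for name, value in decode(ROW0).items():
        print(f"{name:12s} {value:+d} C")
```

Under those assumptions the row decodes to 56 C local and 65 C remote with a remote high limit of 67 C, which is in line with the 67 degree limit mentioned in comment 14.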
Comment 21 Bastian M. Wojek 2005-10-10 10:07:52 UTC
Here are my sensors and i2cdump outputs (twice) for comparison:

psycho:~# sensors
eeprom-i2c-0-50
Adapter: SMBus PIIX4 adapter at 2180
Memory type:            SDR SDRAM DIMM
Memory size (MB):       128

max1617a-i2c-0-4e
Adapter: SMBus PIIX4 adapter at 2180
Board:       +55 C  (low  =   -55 C, high =  +127 C)
CPU:         +63 C  (low  =   -55 C, high =  +127 C)

psycho:~# i2cdump 0 0x4e
No size specified (using byte-data access)
  WARNING! This program can confuse your I2C bus, cause data loss and worse!
  I will probe file /dev/i2c-0, address 0x4e, mode byte
  You have five seconds to reconsider and press CTRL-C!

     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 36 3b 00 00 01 7f bf 43 bf 01 01 01 01 01 01 01    6;..???C????????
10: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
20: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
30: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
40: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
50: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
60: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
70: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
80: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
90: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
a0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
b0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
c0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
d0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
e0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
f0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 4d 01    ??????????????M?

psycho:~# i2cdump 0 0x4e
No size specified (using byte-data access)
  WARNING! This program can confuse your I2C bus, cause data loss and worse!
  I will probe file /dev/i2c-0, address 0x4e, mode byte
  You have five seconds to reconsider and press CTRL-C!

     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 37 3e 00 00 02 7f c9 7f c9 01 01 01 01 01 01 01    7>..????????????
10: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
20: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
30: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
40: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
50: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
60: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
70: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
80: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
90: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
a0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
b0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
c0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
d0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
e0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
f0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 4d 01    ??????????????M?

And I run ksensors for displaying the sensordata.
Comment 22 Bastian M. Wojek 2005-10-10 11:11:24 UTC
And there are no error messages from piix4 or SMBus as you described in the 
syslog when the shutdown occurs - at least not on my system.
Comment 23 Tom Malfrere 2005-10-14 11:52:09 UTC
I don't have that value anymore...
linux:/home/tom # i2cdump 0 0x4e
No size specified (using byte-data access)
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-0, address 0x4e, mode byte
Continue? [Y/n] y
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 32 3e 00 00 01 7f bf 43 bf 01 01 01 01 01 01 01    2>..???C????????
10: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
20: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
30: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
40: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
50: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
60: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
70: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
80: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
90: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
a0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
b0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
c0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
d0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
e0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01    ????????????????
f0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 4d 01    ??????????????M?

I'm using gkrellm to display the temp values...
Is that wrong? What's the use of the sensors service if you can't use it???
Comment 24 Jean Delvare 2006-03-23 00:55:09 UTC
Any updates on this problem? Is it still there on newer kernels? Was it reported
on other systems since?
Comment 25 Bastian M. Wojek 2006-06-15 09:19:34 UTC
Yes, it's still there in newer kernels. At the moment I'm using 2.6.17-rc6-mm1 
including ACPI subsystem revision 20060310.
The system still claims reaching the critical trip point although the 
temperature is not too high.

Here are two examples from the log:

Jun 11 18:24:03 psycho kernel: ACPI Warning (acpi_thermal-0456): Critical trip 
point [20060310]
Jun 11 18:24:03 psycho last message repeated 2 times
Jun 11 18:24:03 psycho kernel: ACPI Error (exmutex-0283): Cannot release Mutex 
[MTXS], incorrect sync_level [20060310]
Jun 11 18:24:03 psycho kernel: ACPI Error (psparse-0522): Method parse/
execution failed [\_TZ_.RBYT] (Node cffc39e0), AE_AML_MUTEX_ORDER
Jun 11 18:24:03 psycho kernel: ACPI Error (psparse-0522): Method parse/
execution failed [\_TZ_.RTMX] (Node cffc4940), AE_AML_MUTEX_ORDER
Jun 11 18:24:03 psycho kernel: ACPI Error (psparse-0522): Method parse/
execution failed [\_TZ_.THRM._TMP] (Node cffc46e0), AE_AML_MUTEX_ORDER

...

Jun 14 19:34:18 psycho kernel: ACPI Warning (acpi_thermal-0456): Critical trip 
point [20060310]
Jun 14 19:34:18 psycho logger: temperature:             67 C
Comment 26 Konstantin Karasyov 2006-06-19 03:29:49 UTC
To all,

Could you clarify if thermal shutdowns happen only with lm_sensors 
installed/enabled, or in other cases too?
If anyone has observed unexpected thermal shutdowns, please send 'dmesg' 
and 'acpidump' outputs.
Comment 27 Bastian M. Wojek 2006-06-24 10:10:43 UTC
Well, I don't know yet, if the shutdowns also occur if lm_sensors is not 
running. Until now it was always running on my system - but I'll turn it off 
now.
There was also no new thermal shutdown, but another ACPI problem, that seems to 
be correlated with the shutdowns (because it gives the same error messages): 
the fan doesn't start to work when the trip point is reached, until I reboot.

dmesg and acpidump output follow
Comment 28 Bastian M. Wojek 2006-06-24 10:13:16 UTC
Created attachment 8410 [details]
dmesg output for the case, the fan doesn't start
Comment 29 Bastian M. Wojek 2006-06-24 10:14:07 UTC
Created attachment 8411 [details]
acpidump output
Comment 30 Bastian M. Wojek 2006-07-01 11:26:40 UTC
Hi!

I ran my machine the last week with an unchanged kernel and without any sensor 
modules loaded (i2c-piix4,adm1021,eeprom).
There has NOT been any problem with any thermal shutdowns or fan malfunctions.

I would not call this strong proof that the problem depends on the i2c 
sensors only... but it would at least be possible.
Comment 31 Robert Moore 2006-07-10 14:44:33 UTC
RE: AE_AML_MUTEX_ORDER

There is a possible timeout in the _TZ_.RBYT method:

        Method (RBYT, 2, NotSerialized)
        {
            Store (One, GO25)
            Store (One, GO26)
            Acquire (MTXS, 0x0FFF)

The Acquire should have an infinite wait, coded as:

            Acquire (MTXS, 0xFFFF)

This is a common typo. Since the mutex has a synclevel of 4, an attempt to 
release the mutex without a successful acquire will cause the errors seen.

BTW, newer versions of iASL catch this error:

dsdt.dsl   465:             Acquire (MTXS, 0x0FFF)
Warning  1103 -                                 ^ Possible operator timeout is 
ignored
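
The failure mode can be mimicked outside AML. The following is only an analogy in Python, not the firmware's code: a reentrant lock stands in for MTXS, the acquire uses a finite timeout, and its result is ignored, so the later release fails when the lock was never obtained. All names are hypothetical.

```python
import threading

def rbyt_like(mutex, timeout_s):
    """Mimic the flawed pattern: ignore whether the acquire timed out."""
    mutex.acquire(timeout=timeout_s)   # return value ignored, as in the AML
    try:
        return "read ok"
    finally:
        mutex.release()                # RuntimeError if we never owned the mutex

def demo():
    mutex = threading.RLock()
    held = threading.Event()
    done = threading.Event()

    def holder():
        # Another party keeps the mutex busy for longer than the timeout.
        with mutex:
            held.set()
            done.wait()

    worker = threading.Thread(target=holder)
    worker.start()
    held.wait()
    try:
        rbyt_like(mutex, 0.05)
        outcome = "no error"
    except RuntimeError:
        outcome = "release failed"
    finally:
        done.set()
        worker.join()
    return outcome

if __name__ == "__main__":
    print(demo())
```

With an uncontended lock the same function returns normally, which matches the observation that the errors only show up some of the time.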

Comment 32 Tom Malfrere 2006-07-11 15:51:12 UTC
Does this mean there's a fix for the problem?
Is it possible to create a patch?
What code should be patched?
The kernel or lmsensors or one of the lmsensors drivers?

Is it possible to give more info in order to test the fix?

PS: I now have a new laptop so I can setup my old laptop under test conditions...
Comment 33 Robert Hancock 2006-07-25 23:59:43 UTC
Assuming that Robert Moore is correct, then the real fix would be for the laptop
manufacturer to fix their broken ACPI DSDT in their BIOS code. Failing that, you
could decompile the DSDT, fix the bug, recompile and configure the kernel to use
the modified DSDT instead.
Comment 34 Bastian M. Wojek 2006-08-02 12:54:51 UTC
Ok, I changed the ACPI DSDT according to Robert Moore's message and also 
reactivated my ADM1021 sensor but still both problems are there.

At some time the fan was not starting when it should have (now simply without a 
message in the syslog) and at another time a thermal shutdown occurred. 

dmesg showed:
cpufreq: change failed with new_state 0 and result 3
ACPI: Critical trip point
Critical temperature reached (157 C), shutting down.
Critical temperature reached (67 C), shutting down.

So I'll at least turn off the temperature sensor again.
Comment 35 Konstantin Karasyov 2006-10-25 20:37:58 UTC
It still looks like an lm_sensors problem, so I'm rejecting the bug.
Comment 36 Jean Delvare 2006-10-26 00:28:14 UTC
I wouldn't call it "an lm_sensors problem". As I understand it it's a conflict
between ACPI and lm_sensors, so it's as much an ACPI problem as an lm_sensors
problem. Given that the lm_sensors drivers properly request the resources they
are using, and ACPI isn't, you can't really blame lm_sensors.

Now, the fact is that ACPI is broken by design in that respect (at least this is
how I see it), and modern machines don't work properly (if at all) without ACPI
support, so even if ACPI is to blame here, the only solution at the moment is to
not use the lm_sensors drivers. I hope that future versions of ACPI will handle
resource access properly so that ACPI and lm_sensors (and other non-ACPI
drivers) can coexist peacefully.
Comment 37 Konstantin Karasyov 2006-11-02 02:18:47 UTC
> Critical temperature reached (157 C), shutting down.
> Critical temperature reached (67 C), shutting down.

The temperature difference here looks like the difference between Fahrenheit 
and Celsius scales. Is it possible that lm_sensors somehow switches the thermal 
sensor output data format?
ACPI expects the temperature to be always reported in tenth of Celsius.
Comment 38 Tom Malfrere 2006-11-02 02:26:49 UTC
For me, lmsensors always displayed the temperature in degrees Celsius.
However it might be possible that lmsensors internally fetched temperature
values in Fahrenheit.

How are the temperature values reported by an Intel PIII coppermine(as in my
laptop)? In Celsius or in Fahrenheit or is it selectable?
Comment 39 Konstantin Karasyov 2006-11-02 02:50:57 UTC
It should be Celsius by default, I think, or you'd be getting a critical 
shutdown on boot.

This could be evaluated from _TMP method definition.
Comment 40 Konstantin Karasyov 2006-11-03 02:38:33 UTC
> ACPI expects the temperature to be always reported in tenth of Celsius.

Sorry, that should be Kelvin instead of Celsius.
Comment 41 Jean Delvare 2006-11-03 04:00:06 UTC
All hardware monitoring chips I know of (and that's many) report the temperature
in degrees Celsius. I've never seen a chip reporting temperature in degrees
Fahrenheit. All lm_sensors drivers use degrees Celsius.

I doubt degrees Fahrenheit have anything to do with the problem. My guess is
that the unsynchronized concurrent accesses to the monitoring chip cause
register reads to return values from other registers, resulting in apparently
weird temperatures.

BTW, a tenth of degree Celsius is the same as a tenth of Kelvin ;)
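
A quick arithmetic check of the numbers discussed in comments 37 to 41, as a sketch. It assumes the ACPI _TMP value is in tenths of a Kelvin and takes 273.2 K as the zero point, which is an approximation; the function names are illustrative only.

```python
def decikelvin_to_celsius(dk):
    """Tenths of a Kelvin to degrees Celsius (273.2 K taken as 0 C)."""
    return (dk - 2732) / 10

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

if __name__ == "__main__":
    print(decikelvin_to_celsius(3402))             # raw value for a 67 C reading
    print(round(celsius_to_fahrenheit(67), 1))     # 67 C expressed in Fahrenheit
    print(round(fahrenheit_to_celsius(157), 1))    # 157 taken as Fahrenheit
    print(0xBF, 0xBF - 256)                        # limit byte from the dumps, unsigned vs signed
```

67 C is 152.6 F and 157 F is about 69.4 C, so a unit mix-up comes close for the pair in comment 34 without matching it exactly, and it cannot explain the 1135 C / 41 C pair reported in comment 45. Separately, 0xbf, the low-limit byte in the dumps, is 191 when read as unsigned, the same number as the highest reading mentioned in comment 15. That may be coincidence, but it would fit reads that occasionally return another register's value.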
Comment 42 Len Brown 2006-11-14 00:08:23 UTC
> ...the only solution at the moment is to not use the lm_sensors drivers.   
   
True.   
   
AFAIK, there is zero coordination between lm_sensors and ACPI.   
lm_sensors is trying to expose what ACPI is trying to abstract.   
So the conflict is potentially worse than two Linux sub-systems,  
because lm_sensors may actually conflict with the platform  
ACPI firmware supplied by the vendor...  
   
Until such coordination exists, the prudent course appears 
to be this:   
   
diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig   
index e76d919..0e0b958 100644   
--- a/drivers/hwmon/Kconfig   
+++ b/drivers/hwmon/Kconfig   
@@ -6,6 +6,7 @@ menu "Hardware Monitoring support"   
   
 config HWMON   
        tristate "Hardware Monitoring support"   
+       depends on !ACPI   
        default y   
        help   
          Hardware monitoring devices let you monitor the hardware health   
   
Comment 43 Jean Delvare 2006-11-14 00:53:59 UTC
Ah ah, Len, _that_ was funny! :) But I think I have a better one, see:

--- linux-2.6.19-rc5.orig/drivers/acpi/Kconfig  2006-10-05
+++ linux-2.6.19-rc5/drivers/acpi/Kconfig       2006-11-14
@@ -3,6 +3,7 @@
 #
 
 menu "ACPI (Advanced Configuration and Power Interface) Support"
+       depends on BROKEN
        depends on !X86_VISWS
        depends on !IA64_HP_SIM
        depends on IA64 || X86


Oh wait, you were serious? You really wanted to kill the hwmon subsystem right
away because acpi can't even request the resources it uses like every other
subsystem is required to do?

It's not a conflict between hwmon and acpi. It's a conflict between acpi and the
rest of the world. It might be more visible with hwmon, but every other
subsystem could conflict as well. For example, i2c is affected, as acpi accesses
SMBus masters without telling anyone. As a result, every I2C/SMBus chip driver
is virtually affected too, including eeprom and RTC drivers.

Let's recap the facts:
* lm_sensors has been around since 1998. It does one thing and does it well. All
the i2c bus drivers and hwmon drivers request all the resources they use.
* ACPI support was added to Linux in (I think) 2001, it handles half a dozen
mostly unrelated things, from interrupt routing to power management to thermal
management to screen brightness changes, and accesses the hardware directly
without properly requesting the resources.

And your conclusion is that we should disable lm_sensors? Interesting. ACPI is
broken here.
Comment 44 Jean Delvare 2007-03-07 01:36:13 UTC
Created attachment 10637 [details]
Detect conflicting ACPI I/O accesses

Here comes a patch (against 2.6.20.1) which will detect and log the ACPI
accesses to I/O areas already requested by other drivers. I would like all
users affected by ACPI vs lm_sensors conflicts to give it a try and post the
generated logs.

A 2.6.21-rc3 version of the patch is also available:
http://jdelvare.pck.nerim.net/sensors/acpi-check-io-ports.patch
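
The idea behind such a check can be sketched as follows. This is not the attached patch, only an illustration of testing an I/O access against regions that other drivers have claimed. The base address 0x2180 comes from the sensors output earlier in this report; the 8-port length and all names are assumptions.

```python
from typing import List, NamedTuple, Optional

class Region(NamedTuple):
    start: int   # first I/O port of the region
    end: int     # last I/O port, inclusive
    owner: str   # driver that requested it

def find_conflict(regions: List[Region], port: int, width: int = 1) -> Optional[Region]:
    """Return the claimed region that an access of `width` ports at `port` touches, if any."""
    last = port + width - 1
    for region in regions:
        if port <= region.end and last >= region.start:
            return region
    return None

# Hypothetical table of claimed regions: the PIIX4 SMBus adapter at 0x2180.
CLAIMED = [Region(0x2180, 0x2187, "piix4_smbus")]

if __name__ == "__main__":
    hit = find_conflict(CLAIMED, 0x2180)
    if hit is not None:
        print(f"access to 0x2180 conflicts with region claimed by {hit.owner}")
```

An access from AML that lands inside a claimed region is exactly the case worth logging, since the owning driver has no way to know the hardware was touched behind its back.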
Comment 45 Encolpe Degoute 2007-07-06 15:06:02 UTC
I didn't have this bug before 2.6.20 on my Debian kernel, but since then I hit this critical trip point anywhere from three times a day to only once a week.
I don't think it's only a conversion bug (see below).
My dual core is reported in the logs to be at 45°C usually, and never reaches 60°C under load.

# grep -A 4 "Critical trip point" syslog.?
syslog.2:Jul  5 00:33:06 gosseyn kernel: ACPI: Critical trip point
syslog.2-Jul  5 00:33:06 gosseyn kernel: Critical temperature reached (1135 C), shutting down.
syslog.2-Jul  5 00:33:06 gosseyn shutdown[13784]: shutting down for system halt
syslog.2-Jul  5 00:33:06 gosseyn init: Switching to runlevel: 0
syslog.2-Jul  5 00:33:08 gosseyn kernel: Critical temperature reached (41 C), shutting down.
--
syslog.3:Jul  4 00:49:49 gosseyn kernel: ACPI: Critical trip point
syslog.3-Jul  4 00:49:49 gosseyn kernel: Critical temperature reached (5487 C), shutting down.
syslog.3-Jul  4 00:49:49 gosseyn shutdown[10029]: shutting down for system halt
syslog.3-Jul  4 00:49:49 gosseyn init: Switching to runlevel: 0
syslog.3-Jul  4 00:49:51 gosseyn kernel: Critical temperature reached (53 C), shutting down.
--
syslog.5:Jul  1 21:49:23 gosseyn kernel: ACPI: Critical trip point
syslog.5-Jul  1 21:49:23 gosseyn kernel: Critical temperature reached (1135 C), shutting down.
syslog.5-Jul  1 21:49:23 gosseyn shutdown[8231]: shutting down for system halt
syslog.5-Jul  1 21:49:23 gosseyn init: Switching to runlevel: 0
syslog.5-Jul  1 21:50:20 gosseyn syslog-ng[3338]: syslog-ng starting up; version='2.0.0'
--
syslog.6:Jul  1 00:04:08 gosseyn kernel: ACPI: Critical trip point
syslog.6-Jul  1 00:04:08 gosseyn kernel: Critical temperature reached (1135 C), shutting down.
syslog.6-Jul  1 00:04:08 gosseyn shutdown[12701]: shutting down for system halt
syslog.6-Jul  1 00:04:08 gosseyn init: Switching to runlevel: 0
syslog.6-Jul  1 00:04:10 gosseyn kernel: Critical temperature reached (41 C), shutting down.
Comment 46 Jean Delvare 2007-07-07 00:12:32 UTC
(In reply to comment #45)
> I didn't have this bug before 2.6.20 on my Debian kernel, but since then I hit
> this critical trip point anywhere from three times a day to only once a week.
> I don't think it's only a conversion bug (see below).

This may or may not be the same problem as the original bug report. There are several causes for these "Critical temperature reached" bugs. Are you using a non-ACPI hardware monitoring driver at all? If you do, do the problems go away if you stop loading this driver? If not, then your problem is different and you should open a separate bug.
Comment 47 Devon C Miller 2007-07-22 16:52:14 UTC
I'm running an HP Pavilion laptop ze1250 (AMD Mobile XP 1800+) and I've seen this on and off since 2.6.8. With 2.6.22 it has gotten much, much worse.

Adding a printk to acpi_thermal_get_temperature gives me output like this:
Temperature is 76C
Temperature is 76C
Temperature is 77C
Temperature is 95C
Critical temperature reached (95 C), shutting down.
Temperature is 76C

The only clue I have to add is that I haven't seen it happen with cpu frequency scaling (CONFIG_CPU_FREQ) disabled or with the governor set to powersave or performance.

If someone can give me some suggestions on where to go from here I'll be more than happy to help troubleshoot. 
Comment 48 Jean Delvare 2007-07-23 06:50:28 UTC
(In reply to comment #47)
> I'm running an HP Pavilion laptop ze1250 (AMD Mobile XP 1800+) and I've seen
> this on and off since 2.6.8. With 2.6.22 it has gotten much, much worse.

Are you using lm-sensors at all?
Comment 49 Len Brown 2007-07-23 10:35:47 UTC
Devon,
Jean is right.
If you still have the problem with CONFIG_HWMON=n,
then your sighting isn't related to this report (HWMON vs. ACPI conflict)
In that case, please file a new sighting against ACPI/Power-Thermal
Comment 50 Devon C Miller 2007-07-28 07:12:51 UTC
Checking my config I found I had CONFIG_HWMON=y and CONFIG_SENSORS_VIA686A=y.

I don't have lm-sensors installed at the moment (probably did at some point in the past), so I recompiled without those options. Been running 2 days now without a single thermal fault. Much better than the previous behavior of 3 faults before completing a cold boot.

So, since I don't have lm-sensors, that means the hwmon and/or via sensor drivers are causing problems just by being there.

I'm happy since my system is running better.

However, since I have a system that will misbehave, if you need a guinea pig to test or help debug, I'll be glad to help; just tell me what to do.
Comment 51 Jean Delvare 2007-08-08 10:37:09 UTC
Devon, please open a separate bug with the information from comments #47 and #50, in category Drivers/Hardware Monitoring. Assign it to me, I'll take care of it.
Comment 52 Natalie Protasevich 2007-09-22 21:22:09 UTC
Thanks Devon, so it's bug #8865 now.
Comment 53 Robert Epprecht 2008-09-07 08:02:50 UTC
I see a very similar problem running 2.6.26.3 from kernel.org on my Asus Eee 701 model. The system gives the following error messages while booting and does a shutdown before completing the boot process:
[   32.976558] ACPI: Critical trip point
[   32.976591] Critical temperature reached (144 C), shutting down.

The problem happens mostly when booting on battery power and the battery is not 100% full. It looks as if it can be avoided by blacklisting the thermal or the battery module. Testing is difficult because the bug has the following behaviour:

The bug tends to show up (or not) consistently across many consecutive test boots, but it can then disappear and allow a whole series of boots, one after the other, without problems. So it is very intermittent overall, yet its behaviour stays constant when I boot many times in a row for testing.

Many Debian users have experienced the same problem on different Asus Eee PC models since they started running kernel 2.6.26 (from Debian Lenny). The system shuts down without completing the boot process. We did not see the problem with 2.6.25.

In all these cases there was CONFIG_HWMON=y
Comment 54 Jean Delvare 2008-09-07 08:21:45 UTC
Robert, can the problem be reproduced with CONFIG_HWMON=n? Or, if you can't test that, without any hwmon driver loaded?

As a side note, it makes little sense for a distribution to have CONFIG_HWMON=y. They should set it to m.
Comment 55 Robert Epprecht 2008-09-08 08:06:54 UTC
(In reply to comment #54)
> Robert, can the problem be reproduced with CONFIG_HWMON=n? 

I cannot set CONFIG_HWMON=n, it gets reset to m

> Or, if you can't test that, without any hwmon driver loaded?

I'm not sure which drivers count as 'hwmon driver'
i.e. does eeepc_laptop count?

Would a test with CONFIG_HWMON=m and blacklisting it be useful?

The pattern I have described in comment #53 makes it *very* time consuming to test if a version does not have the bug...

Robert Epprecht
Comment 56 Robert Epprecht 2008-09-08 08:36:22 UTC
> I'm not sure which drivers count as 'hwmon driver'

Then there are CONFIG_THERMAL and CONFIG_THERMAL_HWMON
What about these?

Robert
Comment 57 Jean Delvare 2008-09-08 09:56:53 UTC
(In reply to comment #55)
> I cannot set CONFIG_HWMON=n, it gets reset to m

Apparently THINKPAD_ACPI selects HWMON. Try disabling THINKPAD_ACPI and then you should be able to remove HWMON entirely. More generally, you can look at the help text of CONFIG_HWMON in the kernel configuration tool to find out what is selecting it.

> I'm not sure which drivers count as 'hwmon driver'
> i.e. does eeepc_laptop count?

Yes, eeepc_laptop counts. Other than laptop-specific drivers, all hwmon drivers are in the "Hardware Monitoring support" menu entry, so if you unselect all of them, you should be done.

Alternatively, you can keep them, select HWMON as a module, and then do "lsmod | grep hwmon" to see any loaded driver that depends on hwmon.
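
As a rough illustration of what to look for in that output: the fourth column of the hwmon line ("Used by") names the loaded drivers that depend on it. The sketch below parses lsmod-style text; the function name hwmon_users is made up, and the sample output (including the sizes) is invented for the example rather than taken from a real machine.

```python
def hwmon_users(lsmod_text):
    """Return the modules listed in the 'Used by' column of the hwmon line,
    or an empty list if hwmon is not loaded or has no users."""
    for line in lsmod_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "hwmon":
            # Dependents are comma-separated; drop any empty entries.
            return [name for name in fields[3].split(",") if name]
    return []

# Invented sample in the usual "Module  Size  Used by" layout.
SAMPLE_LSMOD = """\
Module                  Size  Used by
eeepc_laptop           12345  0
hwmon                   6789  1 eeepc_laptop
"""

print(hwmon_users(SAMPLE_LSMOD))  # ['eeepc_laptop']
```

An empty result means no loaded driver is currently using hwmon.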

> Would a test with CONFIG_HWMON=m and blacklisting it be useful?

I have never had any luck with blacklisting.

(In reply to comment #56)
> Then there are CONFIG_THERMAL and CONFIG_THERMAL_HWMON
> What about these?

These are the standard hwmon interface to the ACPI thermal zones, so they are safe ACPI-wise by definition.
Comment 58 Robert Epprecht 2008-09-09 07:01:41 UTC
(In reply to comment #57)
This configuration allows me to set CONFIG_HWMON=n

I can reproduce the bug with that kernel image: still does a shutdown in the middle of the boot process.

What does this mean now?

Robert
Comment 59 Jean Delvare 2008-09-09 07:06:06 UTC
This means that your problem is unrelated to lm-sensors conflicting with ACPI. So it has to be an ACPI bug, either in your BIOS, or in the Linux implementation. Please open a new bug under Product ACPI.