Bug 3584
Summary: | lm_sensors vs. ACPI - Critical temperature reached (157 C), shutting down. | ||
---|---|---|---|
Product: | Drivers | Reporter: | Tom Malfrere (tom.malfrere) |
Component: | Hardware Monitoring | Assignee: | Jean Delvare (jdelvare) |
Status: | REJECTED INSUFFICIENT_DATA | ||
Severity: | normal | CC: | acpi-bugzilla, devon.c.miller, encolpe, epprecht, jdelvare, kernel, pmiscml, protasnb |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.11.5 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | lm_sensors service script; sensors.conf; dmesg output for the case, the fan doesn't start; acpidump output; Detect conflicting ACPI I/O accesses | ||
Description
Tom Malfrere
2004-10-17 14:27:48 UTC
I believe this problem has been fixed in the latest 2.6 kernel. Please verify whether this is still an issue or whether it is indeed fixed in 2.6.9.

Please re-open if still an issue with Linux 2.6.10 or later.

I compiled a 2.6.11.5 kernel myself on Sunday and installed it on Monday. I made NO changes to the kernel source. Tuesday 23/03/2005 I had the same problem again... A quick Google search with the keywords "critical temperature reached" kernel acpi shows that several other cases have been reported... I'm willing to help debug the problem, but I don't know the kernel sources...

problem was seen with 2.6.5 and 2.6.11.5

I'm running the SuSE 9.3 standard kernel now, which is kernel 2.6.11.4-21.7-default. I did not have a single thermal shutdown in 2 weeks until I installed the lm_sensors service some days ago... Today I had 4 thermal shutdowns... The only operation I was running was downloading images from my digital camera over USB. I disabled the lm_sensors service and had no more problems... I suspect there is some sort of resource locking problem between the kernel and the lm_sensors service... I hope this helps you a bit further, or should I report this to the lm_sensors team?

I have been seeing this occasionally on 2.6.12.2 on my Acer laptop as well. Looks like the sensor reports an invalid value. It would be nice to re-read the sensor and verify the value in this case before shutting down.
In polling mode: wait for three consecutive values above the limit
In event mode: just re-read the temperature three times (maybe wait 1 second between reads)

I'm joining the discussion here after Tom opened an lm_sensors support ticket pointing to this bug.

Tom, I'd need more information about what you call "lm_sensors service". Please tell us exactly what enabling this service does. Does it load kernel drivers? If so, which ones? Does it run programs, such as "sensors -s"? Does it start daemons, such as "sensord"?

Can you also tell which hardware monitoring chip your laptop uses? "sensors-detect" with the "lm_sensors service" disabled should tell you. I would also be interested in the output of "sensors" with the "lm_sensors service" enabled, and your /etc/sensors.conf file.

My first guess is that ACPI and lm_sensors are fighting for the hardware monitoring device, but there may be more than that.

Created attachment 6245 [details]
lm_sensors service script
I've added the script that controls the lm_sensors service.
As you can see it runs "sensors -s"
Here's the output of sensors-detect:
Client found at address 0x37
Client found at address 0x4e
Probing for `National Semiconductor LM75'... Failed!
Probing for `Dallas Semiconductor DS1621'... Failed!
Probing for `Analog Devices ADM1021'... Failed!
Probing for `Analog Devices ADM1021A/ADM1023'... Failed!
Probing for `Maxim MAX1617'... Success! (confidence 3, driver `adm1021')
Probing for `Maxim MAX1617A'... Success! (confidence 7, driver `adm1021')
Probing for `TI THMC10'... Failed!
Probing for `National Semiconductor LM84'... Failed!
Probing for `Genesys Logic GL523SM'... Failed!
Probing for `Onsemi MC1066'... Failed!
Probing for `Maxim MAX1619'... Failed!
Probing for `National Semiconductor LM82'... Failed!
Probing for `National Semiconductor LM83'... Failed!
Probing for `Maxim MAX6659'... Success! (confidence 4, driver `to-be-written')
Probing for `Maxim MAX6633/MAX6634/MAX6635'... Failed!
Client found at address 0x50
Probing for `SPD EEPROM'... Success! (confidence 8, driver `eeprom')
Probing for `DDC monitor'... Failed!
Probing for `Maxim MAX6900'... Failed!
Client found at address 0x69

Created attachment 6246 [details]
sensors.conf
linux:/home/tom # /etc/init.d/lm_sensors start
Starting up sensors: done
linux:/home/tom # sensors
eeprom-i2c-0-50
Adapter: SMBus PIIX4 adapter at 2180
Memory type: SDR SDRAM DIMM
Memory size (MB): 128
max1617a-i2c-0-4e
Adapter: SMBus PIIX4 adapter at 2180
Board: +56

I also think there's a conflict between lm_sensors and ACPI, which results in bad temperature values. The temperature values that ACPI receives are way too high and cause an immediate thermal shutdown. Is there some sort of resource locking on the sensor chip? Is it possible to do something like that?

Try adding the following line to /etc/modprobe.conf:
options adm1021 read_only=1
And see if it helps.

Also, please provide the output of:
modprobe i2c-dev
i2cdump 0 0x4e

Anything in /proc/acpi/thermal_zone? Especially the contents of the temperature and trip_points files, if they exist, would be of interest. I wonder if ACPI and lm_sensors agree on the current and limit temperatures.

Note that the limit temperature for your CPU is set to 67 degrees according to the adm1021 driver; this matches one of your logged alerts. Your CPU temperature reads 64 degrees, which is a bit high unless it is heavily loaded, and at any rate is just a few degrees less than the high limit. Maybe you have a real overheating problem after all.

NB: Please set the attachments type to text/plain.

I've got the same problem as described by Tom on my MD 9580A. At the moment I'm running two different kernels, one with ACPI Subsystem revision 20050408 and the other one with revision 20050902. The "thermal shutdown" only occurs with the newer ACPI!

For now I have replaced the poweroff command in thermal.c with a log entry of the temperature in /proc/acpi/thermal_zone/THRM/temperature.
The result: cat /var/log/messages Oct 9 13:57:56 psycho kernel: acpi_thermal-0472 [06] thermal_critical : Critical trip point Oct 9 13:57:56 psycho logger: temperature: 55 C Oct 9 13:57:56 psycho logger: temperature: 55 C Oct 9 13:57:56 psycho logger: ACPI group thermal_zone / action THRM is not defined Oct 9 13:57:56 psycho logger: ACPI group thermal_zone / action THRM is not defined and so on... The logged temperature values are always between 55 and 65 C - but the values displayed while working in a console are up to 191 C. The /proc/acpi/thermal_zone/THRM/trip_points is the following: critical (S5): 92 C passive: 78 C: tc1=1 tc2=4 tsp=60 devices=0xc1275d88 active[0]: 65 C: devices=0xc1275748 Maybe it helps to know that this problem refers only to specific ACPI versions. Please let me know if I can provide additional output of interest. linux:/home/tom # modprobe i2c-dev linux:/home/tom # i2cdump 0 0x4e No size specified (using byte-data access) WARNING! This program can confuse your I2C bus, cause data loss and worse! I will probe file /dev/i2c-0, address 0x4e, mode byte Continue? [Y/n] 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef 00: 38 41 00 00 01 7f bf 43 bf 01 01 01 01 01 01 01 8A..???C???????? 10: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 20: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 30: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 40: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 50: 01 01 01 01 01 01 41 01 01 01 01 01 01 01 01 01 ??????A????????? 60: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 70: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 80: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 90: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? a0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? b0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 
c0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? d0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? e0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? f0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 4d 01 ??????????????M? linux:/home/tom # linux:/home/tom # cd /proc/acpi/thermal_zone/ linux:/proc/acpi/thermal_zone # cd THRM/ linux:/proc/acpi/thermal_zone/THRM # ls . .. cooling_mode polling_frequency state temperature trip_points linux:/proc/acpi/thermal_zone/THRM # cat cooling_mode cooling mode: active linux:/proc/acpi/thermal_zone/THRM # cat polling_frequency polling frequency: 5 seconds linux:/proc/acpi/thermal_zone/THRM # cat state state: passive linux:/proc/acpi/thermal_zone/THRM # cat temperature temperature: 59 C linux:/proc/acpi/thermal_zone/THRM # cat trip_points critical (S5): 92 C passive: 78 C: tc1=1 tc2=4 tsp=60 devices=0xc1277c20 active[0]: 65 C: devices=0xc1277980 linux:/proc/acpi/thermal_zone/THRM # I've added the 'options adm1021 read_only=1' and started the lm_sensors service again. Now I'm waiting for a possible shutdown, I will let my laptop run overnight. Without the extra line added, it normally shuts down in during the night. BTW. I'm an embedded software engineer myself (68000 and arm7), no embedded linux though, I'm familiar with the i2c protocol. I'm interested in the meaning of the dumped data. If I can help, just ask... This morning, my laptop was shutdown, probably because of a thermal shutdown. I didn't had time this morning. I will check my kernel message tonight. So, the extra line didn't really help... :-( The dump simply shows the MAX1617A register map. You can get a datasheet from Maxim if you are curious: http://www.maxim-ic.com/quick_view2.cfm/qv_pk/1964 It seems to confirm that this really is a MAX1617A chip, the only surprising thing is the value 0x41 at address 0x56, while ths chip isn't supposed to have a register there. Is it still there is you attempt a second dump? 
If the shutdown still occurs with read_only=1, this means that the problem is not caused by any kind of chip reprogramming. Considering that loading the driver does not do much per se, I suspect that you have some program making use of the driver and triggering reads to the chip. Is it true? Do you have any of sensord, gkrellm, xsensors, ksensors, wmtemp or anything of that kind loaded? The strange readings (157 degrees) could be caused by SMBus collisions. Don't you see any error message from "piix4" in your logs? Messages such as "Failed!" followed by a number, or "SMBus Timeout!" could indicate SMBus problems, in turn causing bad reads from the MAX1617A chip. Here are my sensors and i2cdump outputs (twice) for comparison: psycho:~# sensors eeprom-i2c-0-50 Adapter: SMBus PIIX4 adapter at 2180 Memory type: SDR SDRAM DIMM Memory size (MB): 128 max1617a-i2c-0-4e Adapter: SMBus PIIX4 adapter at 2180 Board: +55 C (low = -55 C, high = +127 C) CPU: +63 C (low = -55 C, high = +127 C) psycho:~# i2cdump 0 0x4e No size specified (using byte-data access) WARNING! This program can confuse your I2C bus, cause data loss and worse! I will probe file /dev/i2c-0, address 0x4e, mode byte You have five seconds to reconsider and press CTRL-C! 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef 00: 36 3b 00 00 01 7f bf 43 bf 01 01 01 01 01 01 01 6;..???C???????? 10: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 20: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 30: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 40: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 50: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 60: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 70: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 80: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 90: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 
a0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? b0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? c0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? d0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? e0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? f0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 4d 01 ??????????????M? psycho:~# i2cdump 0 0x4e No size specified (using byte-data access) WARNING! This program can confuse your I2C bus, cause data loss and worse! I will probe file /dev/i2c-0, address 0x4e, mode byte You have five seconds to reconsider and press CTRL-C! 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef 00: 37 3e 00 00 02 7f c9 7f c9 01 01 01 01 01 01 01 7>..???????????? 10: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 20: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 30: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 40: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 50: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 60: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 70: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 80: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 90: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? a0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? b0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? c0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? d0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? e0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? f0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 4d 01 ??????????????M? And I run ksensors for displaying the sensordata. 
And there are no error messages from piix4 or SMBus as you described in the syslog when the shutdown occurs - at least not on my system. I don't have that value anymore... linux:/home/tom # i2cdump 0 0x4e No size specified (using byte-data access) WARNING! This program can confuse your I2C bus, cause data loss and worse! I will probe file /dev/i2c-0, address 0x4e, mode byte Continue? [Y/n] y 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef 00: 32 3e 00 00 01 7f bf 43 bf 01 01 01 01 01 01 01 2>..???C???????? 10: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 20: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 30: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 40: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 50: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 60: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 70: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 80: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? 90: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? a0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? b0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? c0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? d0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? e0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 ???????????????? f0: 01 01 01 01 01 01 01 01 01 01 01 01 01 01 4d 01 ??????????????M? I'm using gkrellm to display the temp values... Is that wrong? What's the use of the sensors service if you can't use it??? Any updates on this problem? Is it still there on newer kernels? Was is reported on other systems since? Yes, it's still there in newer kernels. At the moment I'm using 2.6.17-rc6-mm1 including ACPI subsystem revision 20060310. The system still claims reaching the critical trip point although the temperature is not too high. 
Here are two examples from the log:
Jun 11 18:24:03 psycho kernel: ACPI Warning (acpi_thermal-0456): Critical trip point [20060310]
Jun 11 18:24:03 psycho last message repeated 2 times
Jun 11 18:24:03 psycho kernel: ACPI Error (exmutex-0283): Cannot release Mutex [MTXS], incorrect sync_level [20060310]
Jun 11 18:24:03 psycho kernel: ACPI Error (psparse-0522): Method parse/execution failed [\_TZ_.RBYT] (Node cffc39e0), AE_AML_MUTEX_ORDER
Jun 11 18:24:03 psycho kernel: ACPI Error (psparse-0522): Method parse/execution failed [\_TZ_.RTMX] (Node cffc4940), AE_AML_MUTEX_ORDER
Jun 11 18:24:03 psycho kernel: ACPI Error (psparse-0522): Method parse/execution failed [\_TZ_.THRM._TMP] (Node cffc46e0), AE_AML_MUTEX_ORDER
...
Jun 14 19:34:18 psycho kernel: ACPI Warning (acpi_thermal-0456): Critical trip point [20060310]
Jun 14 19:34:18 psycho logger: temperature: 67 C

To all: could you clarify whether thermal shutdowns happen only with lm_sensors installed/enabled, or in other cases too? If anyone has observed unexpected thermal shutdowns, please send the 'dmesg' and 'acpidump' outputs.

Well, I don't know yet whether the shutdowns also occur if lm_sensors is not running. Until now it was always running on my system, but I'll turn it off now. There was also no new thermal shutdown, but there was another ACPI problem that seems to be correlated with the shutdowns (because it gives the same error messages): the fan doesn't start to work when the trip point is reached, until I reboot. dmesg and acpidump output follow.

Created attachment 8410 [details]
dmesg output for the case, the fan doesn't start
Created attachment 8411 [details]
acpidump output
Hi! I ran my machine for the last week with an unchanged kernel and without any sensor modules loaded (i2c-piix4, adm1021, eeprom). There has NOT been any problem with thermal shutdowns or fan malfunctions. I would not call this strong proof that the problem depends on the i2c sensors only... but it would at least be possible.

RE: AE_AML_MUTEX_ORDER
There is a possible timeout in the _TZ_.RBYT method:
Method (RBYT, 2, NotSerialized)
{
    Store (One, GO25)
    Store (One, GO26)
    Acquire (MTXS, 0x0FFF)
The Acquire should have an infinite wait, coded as:
    Acquire (MTXS, 0xFFFF)
This is a common typo. Since the mutex has a synclevel of 4, an attempt to release the mutex without a successful acquire will cause the errors seen.
BTW, newer versions of iASL catch this error:
dsdt.dsl 465: Acquire (MTXS, 0x0FFF)
Warning 1103 - ^ Possible operator timeout is ignored

Does this mean there's a fix for the problem? Is it possible to create a patch? What code should be patched? The kernel, lmsensors, or one of the lmsensors drivers? Is it possible to give more info in order to test the fix?
PS: I now have a new laptop, so I can set up my old laptop under test conditions...

Assuming that Robert Moore is correct, the real fix would be for the laptop manufacturer to fix the broken ACPI DSDT in their BIOS code. Failing that, you could decompile the DSDT, fix the bug, recompile, and configure the kernel to use the modified DSDT instead.

OK, I changed the ACPI DSDT according to Robert Moore's message and also reactivated my ADM1021 sensor, but both problems are still there. At some point the fan did not start when it should have (now simply without a message in the syslog), and at another time a thermal shutdown occurred. dmesg showed:
cpufreq: change failed with new_state 0 and result 3
ACPI: Critical trip point
Critical temperature reached (157 C), shutting down.
Critical temperature reached (67 C), shutting down.
So I'll at least turn off the temperature sensor again.
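For readers who want to see the Acquire timeout issue in isolation, here is a rough Python analogy (not AML, and not the actual DSDT code). It assumes that an ownership-tracking lock (`threading.RLock`) is a fair stand-in for the AML mutex; the function names are made up for illustration. The buggy variant ignores the result of a timed acquire and then releases a lock it never got, which is the pattern described above.

```python
import threading


def read_byte_ignoring_timeout(mutex, timeout_s):
    """Buggy pattern: the result of the timed acquire is ignored."""
    mutex.acquire(timeout=timeout_s)   # returns False if the wait timed out
    try:
        return "read done"
    finally:
        mutex.release()                # fails if this thread never owned the mutex


def read_byte_waiting_forever(mutex):
    """Fixed pattern: block until the mutex is really owned (like 0xFFFF)."""
    with mutex:
        return "read done"


def demo():
    mutex = threading.RLock()          # RLock tracks which thread owns it
    held = threading.Event()
    done = threading.Event()

    def holder():
        # Another agent keeps the mutex busy for a while.
        with mutex:
            held.set()
            done.wait()

    worker = threading.Thread(target=holder)
    worker.start()
    held.wait()
    try:
        read_byte_ignoring_timeout(mutex, 0.05)
        outcome = "no error"
    except RuntimeError:
        outcome = "release without ownership"
    done.set()
    worker.join()
    return outcome, read_byte_waiting_forever(mutex)
```

Calling `demo()` shows the buggy variant failing on release while the mutex is busy, and the blocking variant succeeding once the mutex is free. As reported above, fixing this in the DSDT did not make the shutdowns go away, so this only illustrates the mutex errors in the log, not the root cause.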
It still looks like an lm_sensors problem, so I'm rejecting the bug.

I wouldn't call it "an lm_sensors problem". As I understand it, it's a conflict between ACPI and lm_sensors, so it's as much an ACPI problem as an lm_sensors problem. Given that the lm_sensors drivers properly request the resources they are using, and ACPI doesn't, you can't really blame lm_sensors.

Now, the fact is that ACPI is broken by design in that respect (at least this is how I see it), and modern machines don't work properly (if at all) without ACPI support, so even if ACPI is to blame here, the only solution at the moment is to not use the lm_sensors drivers. I hope that future versions of ACPI will handle resource access properly so that ACPI and lm_sensors (and other non-ACPI drivers) can coexist peacefully.

> Critical temperature reached (157 C), shutting down.
> Critical temperature reached (67 C), shutting down.
The temperature difference here looks like the difference between the Fahrenheit
and Celsius scales. Is it possible that lm_sensors somehow switches the thermal
sensor output data format?
ACPI expects the temperature to be always reported in tenth of Celsius.
For me, lmsensors always displayed the temperature in degrees Celsius. However, it might be possible that lmsensors internally fetched temperature values in Fahrenheit. How are the temperature values reported by an Intel PIII Coppermine (as in my laptop)? In Celsius or in Fahrenheit, or is it selectable?

It should be Celsius by default, I think, or you'd be getting a critical shutdown on boot. This could be evaluated from the _TMP method definition.

> ACPI expects the temperature to be always reported in tenth of Celsius.
Sorry? there should be Kelvin instead of Celsius.
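To make the unit question concrete, here is a small Python sketch of the conversions involved. It assumes the usual reading of the ACPI convention (thermal values in tenths of a kelvin) and rounds 273.15 K to 2732 tenths; the function names are invented for this example and are not taken from the kernel.

```python
def decikelvin_to_celsius(raw):
    """Convert a value in tenths of a kelvin to degrees Celsius.

    Assumption: 0 C is taken as 2732 tenths of a kelvin (273.15 K rounded).
    """
    return (raw - 2732) / 10.0


def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def fahrenheit_to_celsius(fahrenheit):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32.0) * 5.0 / 9.0
```

With these, 67 C works out to 152.6 F, and 157 F to roughly 69.4 C. So the logged pair (157 C, then 67 C) is close to a Fahrenheit/Celsius pair but not an exact one.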
All hardware monitoring chips I know of (and that's many) report the temperature in degrees Celsius. I've never seen a chip reporting temperature in degrees Fahrenheit. All lm_sensors drivers use degrees Celsius. I doubt degrees Fahrenheit have anything to do with the problem. My guess is that the unsynchronized concurrent accesses to the monitoring chip cause register reads to return values from other registers, resulting in apparently weird temperatures.

BTW, a tenth of a degree Celsius is the same as a tenth of a Kelvin ;)

> ...the only solution at the moment is to not use the lm_sensors drivers.
True.
AFAIK, there is zero coordination between lm_sensors and ACPI.
lm_sensors is trying to expose what ACPI is trying to abstract.
So the conflict is potentially worse than two Linux sub-systems,
because lm_sensors may actually conflict with the platform
ACPI firmware supplied by the vendor...
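A toy Python model may help show why uncoordinated access matters. It is only a sketch under loud assumptions: the real PIIX4 SMBus controller has more registers than this, and which registers ACPI or the adm1021 driver actually read is not known from this report. The register values are taken from the first i2cdump of the chip at 0x4e earlier in this report; the class and function names are hypothetical.

```python
class ToyController:
    """Toy host controller with one shared command register (assumption)."""

    def __init__(self, chip_registers):
        self.chip = chip_registers
        self.command = 0x00

    def set_command(self, reg):
        # Step 1 of a read: select which chip register to fetch.
        self.command = reg

    def run_read(self):
        # Step 2 of a read: fetch whatever register is currently selected.
        return self.chip[self.command]


# A few register values from the first i2cdump of the chip at 0x4e.
CHIP = {0x00: 0x38, 0x01: 0x41, 0x05: 0x7F, 0x07: 0x43}


def read_serialized(ctrl, reg):
    """Both steps run back to back, with nobody else touching the controller."""
    ctrl.set_command(reg)
    return ctrl.run_read()


def read_interleaved(ctrl, my_reg, other_reg):
    """A second client overwrites the command register between the two steps."""
    ctrl.set_command(my_reg)       # first client selects its register
    ctrl.set_command(other_reg)    # second client selects a different one
    return ctrl.run_read()         # first client now gets the wrong register
```

In this model, a serialized read of register 0x01 returns 0x41 (65), while an interleaved read that collides with a request for register 0x05 returns 0x7F (127) instead, which would be above the 92 C critical trip point shown earlier.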
Until such coordination exists, the prudent course appears
to be this:
diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
index e76d919..0e0b958 100644
--- a/drivers/hwmon/Kconfig
+++ b/drivers/hwmon/Kconfig
@@ -6,6 +6,7 @@ menu "Hardware Monitoring support"
 config HWMON
 	tristate "Hardware Monitoring support"
+	depends on !ACPI
 	default y
 	help
 	  Hardware monitoring devices let you monitor the hardware health
Ah ah, Len, _that_ was funny! :) But I think I have a better one, see:

--- linux-2.6.19-rc5.orig/drivers/acpi/Kconfig 2006-10-05
+++ linux-2.6.19-rc5/drivers/acpi/Kconfig 2006-11-14
@@ -3,6 +3,7 @@
 #
 menu "ACPI (Advanced Configuration and Power Interface) Support"
+	depends on BROKEN
 	depends on !X86_VISWS
 	depends on !IA64_HP_SIM
 	depends on IA64 || X86

Oh wait, you were serious? You really wanted to kill the hwmon subsystem right away because acpi can't even request the resources it uses like every other subsystem is required to do?

It's not a conflict between hwmon and acpi. It's a conflict between acpi and the rest of the world. It might be more visible with hwmon, but every other subsystem could conflict as well. For example, i2c is affected, as acpi accesses SMBus masters without telling anyone. As a result, every I2C/SMBus chip driver is potentially affected too, including eeprom and RTC drivers.

Let's recap the facts:
* lm_sensors has been around since 1998. It does one thing and does it well. All the i2c bus drivers and hwmon drivers request all the resources they use.
* ACPI support was added to Linux in (I think) 2001. It handles half a dozen mostly unrelated things, from interrupt routing to power management to thermal management to screen brightness changes, and accesses the hardware directly without properly requesting the resources.

And your conclusion is that we should disable lm_sensors? Interesting. ACPI is broken here.

Created attachment 10637 [details]
Detect conflicting ACPI I/O accesses
Here comes a patch (against 2.6.20.1) which will detect and log the ACPI accesses to I/O areas already requested by other drivers. I would like all users affected by ACPI vs lm_sensors conflicts to give it a try and post the generated logs.
A 2.6.21-rc3 version of the patch is also available: http://jdelvare.pck.nerim.net/sensors/acpi-check-io-ports.patch I didn't have this bug before 2.6.20 on my debian kernel but since I can have this critical trop point three times a day like only one time a week. I didn't think it's only a conversion bug (see below). My dual core is reported in logs to be at 45°C usualy, and never reach 60°C under load. # grep -A 4 "Critical trip point" syslog.? syslog.2:Jul 5 00:33:06 gosseyn kernel: ACPI: Critical trip point syslog.2-Jul 5 00:33:06 gosseyn kernel: Critical temperature reached (1135 C), shutting down. syslog.2-Jul 5 00:33:06 gosseyn shutdown[13784]: shutting down for system halt syslog.2-Jul 5 00:33:06 gosseyn init: Switching to runlevel: 0 syslog.2-Jul 5 00:33:08 gosseyn kernel: Critical temperature reached (41 C), shutting down. -- syslog.3:Jul 4 00:49:49 gosseyn kernel: ACPI: Critical trip point syslog.3-Jul 4 00:49:49 gosseyn kernel: Critical temperature reached (5487 C), shutting down. syslog.3-Jul 4 00:49:49 gosseyn shutdown[10029]: shutting down for system halt syslog.3-Jul 4 00:49:49 gosseyn init: Switching to runlevel: 0 syslog.3-Jul 4 00:49:51 gosseyn kernel: Critical temperature reached (53 C), shutting down. -- syslog.5:Jul 1 21:49:23 gosseyn kernel: ACPI: Critical trip point syslog.5-Jul 1 21:49:23 gosseyn kernel: Critical temperature reached (1135 C), shutting down. syslog.5-Jul 1 21:49:23 gosseyn shutdown[8231]: shutting down for system halt syslog.5-Jul 1 21:49:23 gosseyn init: Switching to runlevel: 0 syslog.5-Jul 1 21:50:20 gosseyn syslog-ng[3338]: syslog-ng starting up; version='2.0.0' -- syslog.6:Jul 1 00:04:08 gosseyn kernel: ACPI: Critical trip point syslog.6-Jul 1 00:04:08 gosseyn kernel: Critical temperature reached (1135 C), shutting down. 
syslog.6-Jul 1 00:04:08 gosseyn shutdown[12701]: shutting down for system halt syslog.6-Jul 1 00:04:08 gosseyn init: Switching to runlevel: 0 syslog.6-Jul 1 00:04:10 gosseyn kernel: Critical temperature reached (41 C), shutting down. (In reply to comment #45) > I didn't have this bug before 2.6.20 on my debian kernel but since I can have > this critical trop point three times a day like only one time a week. > I didn't think it's only a conversion bug (see below). This may or may not be the same problem as the original bug report. There are several causes for these "Critical temperature reached" bugs. Are you using a non-ACPI hardware monitoring driver at all? If you do, do the problems go away if you stop loading this driver? If not, then your problem is different and you should open a separate bug. I'm running an HP Pavilion laptop ze1250 (AMD Mobile XP 1800+) and I've seen this on and off since 2.6.8. With 2.6.22 it has gotten much, much worse. Adding a printk to acpi_thermal_get_temperature gives me output like this: Temperature is 76C Temperature is 76C Temperature is 77C Temperature is 95C Critical temperature reached (95 C), shutting down. Temperature is 76C The only clue I have to add is that I haven't seen it happen with cpu frequency scaling (CONFIG_CPU_FREQ) disabled or with the governor set to powersave or performance. If someone can give me some suggestions on where to go from here I'll be more than happy to help troubleshoot. (In reply to comment #47) > I'm running an HP Pavilion laptop ze1250 (AMD Mobile XP 1800+) and I've seen > this on and off since 2.6.8. With 2.6.22 it has gotten much, much worse. Are you using lm-sensors at all? Devon, Jean is right. If you still have the problem with CONFIG_HWMON=n, then your sighting isn't related to this report (HWMON vs. ACPI conflict) In that case, please file a new sighting against ACPI/Power-Thermal Checking my config I found I had CONFIG_HWMON=y and CONFIG_SENSORS_VIA686A=y. 
I don't have lm-sensors installed at the moment (probably did at some point in the past), so I recompiled without those options. It has been running for 2 days now without a single thermal fault. Much better than the previous behavior of 3 faults before completing a cold boot. So, since I don't have lm-sensors, that means the hwmon and/or via sensor drivers are causing problems just by being there. I'm happy since my system is running better. However, since I have a system that will misbehave, if you need a guinea pig to test or help debug, I'll be glad to help; just tell me what to do.

Devon, please open a separate bug with the information from comments #47 and #50, in category Drivers/Hardware Monitoring. Assign it to me, I'll take care of it.

Thanks Devon, so it's bug #8865 now.

I see a very similar problem running 2.6.26.3 from kernel.org on my Asus Eee 701 model. The system gives the following error messages while booting and shuts down before completing the boot process:
[ 32.976558] ACPI: Critical trip point
[ 32.976591] Critical temperature reached (144 C), shutting down.
The problem happens mostly when booting on battery power when the battery is not 100% full. It looks as if it can be avoided by blacklisting the thermal or the battery module. Testing is difficult because the bug behaves as follows: it tends to show up (or not) consistently over many tests in a row, but can then disappear and allow a series of test boots (one after the other) without problems. So it is very intermittent overall, but stays constant if I boot many times in a row for tests.
Many Debian users have experienced the same problem on different Asus Eee PC models since they started running kernel 2.6.26 (from Debian Lenny). The system shuts down without completing the boot process. We did not see the problem with 2.6.25. In all these cases there was CONFIG_HWMON=y

Robert, can the problem be reproduced with CONFIG_HWMON=n? Or, if you can't test that, without any hwmon driver loaded?
As a side note, it makes little sense for a distribution to have CONFIG_HWMON=y. They should set it to m.

(In reply to comment #54)
> Robert, can the problem be reproduced with CONFIG_HWMON=n?
I cannot set CONFIG_HWMON=n, it gets reset to m
> Or, if you can't test that, without any hwmon driver loaded?
I'm not sure which drivers count as 'hwmon driver'
i.e. does eeepc_laptop count?
Would a test with CONFIG_HWMON=m and blacklisting it be useful?
The pattern I have described in comment #53 makes it *very* time consuming to test if a version does not have the bug...
Robert Epprecht

> I'm not sure which drivers count as 'hwmon driver'
Then there are CONFIG_THERMAL and CONFIG_THERMAL_HWMON
What about these?
Robert
(In reply to comment #55)
> I cannot set CONFIG_HWMON=n, it gets reset to m
Apparently THINKPAD_ACPI selects HWMON. Try disabling THINKPAD_ACPI and then you should be able to remove HWMON entirely. More generally, you can ask for the help of CONFIG_HWMON to find out who is selecting it.
> I'm not sure which drivers count as 'hwmon driver'
> i.e. does eeepc_laptop count?
Yes, eeepc_laptop counts. Other than laptop-specific drivers, all hwmon drivers are in the "Hardware Monitoring support" menu entry, so if you unselect all of them, you should be done. Alternatively, you can keep them, select HWMON as a module, and then do "lsmod | grep hwmon" to see any loaded driver that depends on hwmon.
> Would a test with CONFIG_HWMON=m and blacklisting it be useful?
I have never had any luck with blacklisting.

(In reply to comment #56)
> Then there are CONFIG_THERMAL and CONFIG_THERMAL_HWMON
> What about these?
These are the standard hwmon interface to the ACPI thermal zones, so they are safe ACPI-wise by definition.

(In reply to comment #57)
This configuration allows setting CONFIG_HWMON=n
I can reproduce the bug with that kernel image: it still shuts down in the middle of the boot process.
What does this mean now?
Robert

This means that your problem is unrelated to lm-sensors conflicting with ACPI. So it has to be an ACPI bug, either in your BIOS, or in the Linux implementation. Please open a new bug under Product ACPI.