About 1 hour after I started a heavy computing batch (using MATLAB 7.6, distributing the load across all 4 cores), the machine suddenly shut down. kern.log shows these messages:

Jun 18 14:32:48 bcipc038 kernel: [1117042.573713] ACPI Exception (thermal-0479): AE_ERROR, ACPI thermal trip point state changed
Jun 18 14:32:50 bcipc038 kernel: [1117042.573717] Please send acpidump to linux-acpi@vger.kernel.org
Jun 18 14:32:50 bcipc038 kernel: [1117042.573719] [20080926]
Jun 18 14:32:50 bcipc038 kernel: [1117042.574046] ACPI: Critical trip point
Jun 18 14:32:50 bcipc038 kernel: [1117042.574072] Critical temperature reached (72 C), shutting down.
Jun 18 14:32:50 bcipc038 kernel: [1117042.574098] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jun 18 14:32:58 bcipc038 kernel: [1117048.576698] Critical temperature reached (58 C), shutting down.
Jun 18 14:32:58 bcipc038 kernel: [1117049.920186] [drm] Resetting GPU
Jun 18 14:32:58 bcipc038 kernel: [1117050.517645] mtrr: MTRR 5 not used
Jun 18 14:57:23 bcipc038 kernel: Inspecting /boot/System.map-2.6.28-11-generic

The same behavior was observed on the same machine a few months ago and was reported here
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/314001
and here
http://marc.info/?l=linux-acpi&m=123120299000668&w=1

At the time, either the problem went away or I did not have time to try to reproduce it.

Based on the previous feedback in http://marc.info/?l=linux-acpi&m=123120299000668&w=2 , I attach
acpidump.dump
dmesg.dump
dmidecode.dump
kern.log.20090619181656

However, I have no idea how to try the boot option "acpi.power_nocheck=1".

The problem actually happened twice (see kern.log):

Jun 18 14:32:48 bcipc038 kernel: [1117042.573713] ACPI Exception (thermal-0479): AE_ERROR, ACPI thermal trip point state changed
Jun 18 16:22:19 bcipc038 kernel: [ 5402.772605] ACPI Exception (thermal-0479): AE_ERROR, ACPI thermal trip point state changed

Therefore, it seems to be reproducible again.

I also noticed this message (4 times) in kern.log:

Jun 18 16:32:46 bcipc038 kernel: [ 4.263055] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.

Is this important? What can I do about it? I have no experience with kernel/ACPI internals. Let me know if I can do anything to help track down this issue.

Alois
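For anyone repeating this kind of report, the critical-shutdown events can be pulled out of a kern.log excerpt mechanically. The following is only a minimal sketch in Python: the function name `critical_temperatures` is made up for illustration, and the pattern assumes the message wording stays exactly as quoted in the log above.

```python
import re

# Matches the kernel message quoted in this report, e.g.
# "Critical temperature reached (72 C), shutting down."
CRITICAL_RE = re.compile(r"Critical temperature reached \((\d+) C\)")


def critical_temperatures(log_text):
    """Return the temperature (in degrees C) of every critical-shutdown line."""
    return [int(match.group(1)) for match in CRITICAL_RE.finditer(log_text)]


# Three lines copied from the kern.log excerpt in this report.
sample = """\
Jun 18 14:32:50 bcipc038 kernel: [1117042.574046] ACPI: Critical trip point
Jun 18 14:32:50 bcipc038 kernel: [1117042.574072] Critical temperature reached (72 C), shutting down.
Jun 18 14:32:58 bcipc038 kernel: [1117048.576698] Critical temperature reached (58 C), shutting down.
"""

print(critical_temperatures(sample))
```

Feeding it the whole of /var/log/kern.log instead of the sample would list every reading that triggered a shutdown, which makes it easier to compare runs.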
Created attachment 21991 [details] zip file containing kern.log, acpidump.dump dmesg.dump and dmidecode.dump The zip file with kern.log, acpidump.dump dmesg.dump and dmidecode.dump was not accepted. Therefore, I make it available here http://hci.tugraz.at/~schloegl/acpi_linux_dumplog.zip
This looks worth looking at further: (thermal-0479): AE_ERROR, ACPI thermal trip point state changed The code is from Rui AFAIK, he might know what's going on here. It looks like an ACPI function related to the thermal device and its trip point fails at some point and then the whole trip point gets invalidated. Unfortunately your dmesg doesn't show such a case. > Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. This means CPU freq does not work because your BIOS misses some tables. A BIOS upgrade might fix that.
please attach the output of "grep . /proc/acpi/thermal_zone/*/*".

And there are several problems with the ACPI thermal control on your laptop:
1. > Critical temperature reached (72 C), shutting down.
   the critical trip point on your laptop is 72C, which is quite low...
2. the ACPI thermal active cooling doesn't work because it uses a fake ACPI fan.
3. the ACPI thermal passive cooling doesn't help much because processor frequency change is not available.

so my question is: can you hear the fan spinning when the computer gets hot?
is the computer really hot when it shuts down?
bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13573
>
> Zhang Rui <rui.zhang@intel.com> changed:
>
>        What |Removed                                  |Added
> ----------------------------------------------------------------------------
>      Status |NEW                                      |NEEDINFO
>   Component |BIOS                                     |Power-Thermal
>  AssignedTo |acpi_bios@kernel-bugs.osdl.org           |acpi_power-thermal@kernel-bugs.osdl.org
>     Summary |ACPI: Unable to turn cooling device 'on' |Critical temperature reached (72 C), shutting
>             |(Quadcore-AMD64, Ubuntu64)               |down - Quadcore-AMD64, Ubuntu64
>
> --- Comment #3 from Zhang Rui <rui.zhang@intel.com> 2009-06-19 03:19:06 ---
> please attach the output of "grep . /proc/acpi/thermal_zone/*/*".

/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: ok
/proc/acpi/thermal_zone/THRM/temperature:temperature: 68 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN

> And there are several problems of the ACPI thermal control on your laptop,

It's not a laptop but a desktop machine.

> 1. > Critical temperature reached (72 C), shutting down.
> the critical trip point on your laptop is 72C, which is quite low...
> 2. the ACPI thermal active cooling doesn't work because it uses a fake ACPI
> fan.
> 3. the ACPI thermal passive cooling doesn't help a lot well because the
> processor frequency change is not available.
>
> so my questions is, can you hear the fan spinning when the computer goes hot?

The fan is always spinning, even in idle mode.

> is the computer really hot when it shutdown?

Each time I start the computing job, using 400% CPU for some time, the machine shuts down. This has now happened 3 times out of 3 tries within 24 h, so it is reproducible.

The computing job just executes a plain MATLAB script doing a lot of floating-point operations and occasionally saving the intermediate results to a file. There is no special hardware access involved.

I have no means to measure the temperature. I opened the case and saw a big fan on top of the CPU; beside it there is a passive cooler, which is also square, with a side length of about 60% of that of the CPU cooler. When I touch it, it is hot (I do not know how hot).

Then, there is a two-digit 7-segment red LED display. In the morning it showed about 44; after I started the MATLAB job, it immediately started to rise. It seemed to settle at about 68 (30-40 min after starting the job). After the job had run for about 50 min, the machine shut down again. During the shutdown process, the LED display dropped to 58, I think. When the machine was up again, it showed about 60 and dropped further to about 46 (I did not start the computing job).

Therefore, I'm pretty sure it is a thermal problem.

Alois
You should really first look out for the latest BIOS. Have the Quad-cores been added later? The BIOS must know the CPU to be able to provide cpufreq info.
please retry with CONFIG_THERMAL set if it's not in the previous test. please set "/sys/class/thermal/thermal_zone*/passive" before the test. Matthew's passive cooling may help in this case.
(In reply to comment #6) > please set "/sys/class/thermal/thermal_zone*/passive" before the test. It accepts a decimal value in millidegrees celsius, e.g. echo 62000 > /sys/class/thermal/thermal_zone0/passive.
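Since the attribute takes millidegrees rather than degrees, a tiny helper avoids unit mistakes. This is a hedged sketch in Python, not a tested tool: the function names are invented for illustration, and the write is skipped when the "passive" attribute is absent, which can happen on kernels whose thermal sysfs interface does not provide it.

```python
from pathlib import Path


def celsius_to_millidegrees(celsius):
    """Convert degrees Celsius to the millidegree integer sysfs expects."""
    return int(round(celsius * 1000))


def set_passive_trip(zone_dir, celsius):
    """Write a passive trip point for one thermal zone.

    Returns False (and writes nothing) when the zone has no "passive"
    attribute, so the caller can tell that the kernel lacks the interface.
    """
    attribute = Path(zone_dir) / "passive"
    if not attribute.exists():
        return False
    attribute.write_text(f"{celsius_to_millidegrees(celsius)}\n")
    return True


print(celsius_to_millidegrees(62))
```

Calling `set_passive_trip("/sys/class/thermal/thermal_zone0", 62)` as root would be the equivalent of the echo command above, assuming the attribute exists.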
(In reply to comment #5) > You should really first look out for the latest BIOS. > Have the Quad-cores been added later? The BIOS must know the CPU to be able > to > provide cpufreq info. I bought the whole system from DiTech http://www.ditech.at/ . I plan to contact them, in order to resolve this issue. Is there anything else that should be taken into account ? Alois
(In reply to comment #8)
> (In reply to comment #5)
> > You should really first look out for the latest BIOS.
> > Have the Quad-cores been added later? The BIOS must know the CPU to be able
> > to provide cpufreq info.
>
> I bought the whole system from DiTech http://www.ditech.at/ .
> I plan to contact them, in order to resolve this issue. Is there anything
> else that should be taken into account ?
>
> Alois

I tried as a regular user and as root, but got this error:

# echo 62000 > /sys/class/thermal/thermal_zone0/passive
bash: /sys/class/thermal/thermal_zone0/passive: No such file or directory

Because there is a symbolic link,

# ls -al /sys/class/thermal/thermal_zone0
lrwxrwxrwx 1 root root 0 2009-06-22 09:07 /sys/class/thermal/thermal_zone0 -> ../../devices/virtual/thermal/thermal_zone0

I tried this, too:

# echo 62000 > /sys/devices/virtual/thermal/thermal_zone0/passive
bash: /sys/devices/virtual/thermal/thermal_zone0/passive: No such file or directory

When I tried to create the file "passive" with an editor, it could not be saved. The permissions of the respective directory entries (and all intermediate directories up to /) are set to 0755 with owner root:root.

Here are the entries in .../thermal_zone0/*

# ls -al /sys/devices/virtual/thermal/thermal_zone0/
total 0
drwxr-xr-x 3 root root 0 2009-06-22 09:07 .
drwxr-xr-x 8 root root 0 2009-06-22 09:07 ..
lrwxrwxrwx 1 root root 0 2009-06-22 09:23 cdev0 -> ../cooling_device0
-r--r--r-- 1 root root 4096 2009-06-22 09:23 cdev0_trip_point
lrwxrwxrwx 1 root root 0 2009-06-22 09:23 device -> ../../../LNXSYSTM:00/LNXTHERM:00/LNXTHERM:01
-rw-r--r-- 1 root root 4096 2009-06-22 09:23 mode
drwxr-xr-x 2 root root 0 2009-06-22 09:23 power
lrwxrwxrwx 1 root root 0 2009-06-22 09:23 subsystem -> ../../../../class/thermal
-r--r--r-- 1 root root 4096 2009-06-22 09:23 temp
-r--r--r-- 1 root root 4096 2009-06-22 09:23 trip_point_0_temp
-r--r--r-- 1 root root 4096 2009-06-22 09:23 trip_point_0_type
-r--r--r-- 1 root root 4096 2009-06-22 09:23 trip_point_1_temp
-r--r--r-- 1 root root 4096 2009-06-22 09:23 trip_point_1_type
-r--r--r-- 1 root root 4096 2009-06-22 09:23 type
-rw-r--r-- 1 root root 4096 2009-06-22 09:07 uevent

So, I do not know how to set "/sys/class/thermal/thermal_zone*/passive".
(In reply to comment #5)
> You should really first look out for the latest BIOS.
> Have the Quad-cores been added later? The BIOS must know the CPU to be able
> to provide cpufreq info.

I bought the whole system from http://www.ditech.at/ . I contacted them, and they referred me to this page:
http://www.sapphiretech.com/ge/support/drivers.php

The mainboard is a SAPPHIRE PI-AM2RS780G with SB700. I've downloaded
http://us.sapphiretech.com/drivers/usb_format_20090619_8590.zip
http://us.sapphiretech.com/drivers/78SAPV09_20090522_4854.zip
but failed to boot from the USB stick (the machine has no floppy drive).

I'm also looking at coreboot,
http://www.coreboot.org/pipermail/coreboot/2009-June/050038.html
but there is no readily available solution.

Before I pursue this path further, I'm also wondering how likely it is that the BIOS is responsible for the problems described above. Are you sure that updating the BIOS will solve the issue?
I flashed the BIOS with the latest version. Unfortunately, it did not solve the problem: http://www.coreboot.org/pipermail/coreboot/2009-June/050347.html
Hi,

It seems that this is a BIOS bug. On this box the fan is controlled by the BIOS. An ACPI FAN device does exist on this box, but it is bogus and does nothing. At the same time, an incorrect passive cooling device is returned by the _PSL object:

>Name (_PSL, Package (0x01)
>{
>    \_PR.CPU0    // the correct name should be \_PR.C000
>})

The critical temperature threshold is obtained by evaluating the _CRT object (from the info in comment #4 we know that the threshold is 70). The temperature is obtained by using the following object:

>Method (_TMP, 0, NotSerialized)
>{
>    And (SENF, 0x01, Local6)
>    If (LEqual (Local6, 0x01))
>    {
>        Return (RTMP ())
>    }
>    Else
>    {
>        Return (0x0B86)
>    }
>}

This is also BIOS-related. From the above analysis it seems that this bug is caused by the broken BIOS, and it is best fixed by upgrading the BIOS.

If you can confirm that it is still safe even when the temperature reaches the critical threshold, you can avoid the shutdown by adding the boot option "thermal.nocrt=1". Alternatively, the box could be added to the quirk table that ignores the critical threshold.

Hi, Rui
How about rejecting this bug, as it seems to be a BIOS bug? We can do nothing about it. Or we add the box to the quirk table that ignores the critical threshold.
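For readers unfamiliar with the raw values in the _TMP method: ACPI thermal objects such as _TMP and _CRT report temperatures in tenths of a kelvin, so the fallback constant 0x0B86 (2950 decimal) is 295.0 K, roughly 21.85 °C. A minimal Python sketch of the conversion follows; the function name is made up for illustration, and the kernel itself uses integer arithmetic with rounding rather than this floating-point form.

```python
def deci_kelvin_to_celsius(raw):
    """Convert an ACPI thermal value (tenths of a kelvin) to degrees Celsius."""
    return raw / 10.0 - 273.15


# 0x0B86 is the constant _TMP returns when the SENF bit is not set.
fallback = 0x0B86
print(fallback, round(deci_kelvin_to_celsius(fallback), 2))
```

In other words, when the sensor flag is clear the method reports a fixed room-temperature value instead of a real reading.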
Created attachment 22225 [details] customized DSDT please apply this customized DSDT and attach the output of "grep . /proc/acpi/thermal_zone/*/*"
Thanks, after some hassles, I was able to install it. This is the result I get:

$ grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: ok
/proc/acpi/thermal_zone/THRM/temperature:temperature: 48 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 68 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN
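When comparing several of these dumps across test runs, it can help to reduce each one to its trip points. The following Python sketch does that for the output format shown above; the function name `parse_trip_points` and the regular expression are illustrative assumptions based only on the lines quoted in this thread, not on a documented format.

```python
import re

# Captures the trip-point kind and its temperature from lines such as
# ".../trip_points:critical (S5): 70 C" or ".../trip_points:active[0]: 68 C: ...".
TRIP_RE = re.compile(r"trip_points:(critical|passive|active\[\d+\])[^:]*:\s*(\d+) C")


def parse_trip_points(grep_output):
    """Map each trip-point kind to its temperature in degrees C."""
    return {kind: int(temp) for kind, temp in TRIP_RE.findall(grep_output)}


sample = """\
/proc/acpi/thermal_zone/THRM/temperature:temperature: 48 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 68 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN
"""

print(parse_trip_points(sample))
```

With the customized DSDT, the passive entry appears in the result; with the original DSDT dump earlier in this report it would be missing.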
Next, I tested whether the original problem went away. Unfortunately, the machine shut down again about 40 min after starting the job.

In order to check whether the DSDT was really loaded, I ran
dmesg |grep DSDT
and
grep DSDT /var/log/kern.log
The results are shown below. The custom DSDT was found for the first time at
Jul 6 12:12:49 bcipc038 kernel: [ 0.237073] ACPI: Found DSDT in DSDT.aml.
Shortly after I sent the previous report, the computational job was started (at about 12:40). At 13:21 I had to boot the machine, because it had shut down. I also noticed that the fan became louder (was speeding up) a few minutes after 12:40.

$ dmesg |grep DSDT
[ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
[ 0.008586] ACPI: Checking initramfs for custom DSDT
[ 0.237143] ACPI: Found DSDT in DSDT.aml.
[ 0.237147] ACPI: Override [DSDT-AWRDACPI], this is unsafe: tainting kernel
[ 0.237152] ACPI: Table DSDT replaced by host OS
[ 0.237155] ACPI: DSDT 00000000, 6CE1 (r1 RS780 AWRDACPI 1000 INTL 20081204)
[ 0.237159] ACPI: DSDT override uses original SSDTs unless "acpi_no_auto_ssdt"
[ 0.568497] ACPI: EC: Look up EC in DSDT

$ grep DSDT /var/log/kern.log
Jul 1 07:53:59 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 1 07:53:59 bcipc038 kernel: [ 0.008561] ACPI: Checking initramfs for custom DSDT
Jul 1 07:53:59 bcipc038 kernel: [ 0.572380] ACPI: EC: Look up EC in DSDT
Jul 2 08:30:41 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 2 08:30:41 bcipc038 kernel: [ 0.008573] ACPI: Checking initramfs for custom DSDT
Jul 2 08:30:41 bcipc038 kernel: [ 0.572366] ACPI: EC: Look up EC in DSDT
Jul 2 10:54:00 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 2 10:54:00 bcipc038 kernel: [ 0.008565] ACPI: Checking initramfs for custom DSDT
Jul 2 10:54:00 bcipc038 kernel: [ 0.572359] ACPI: EC: Look up EC in DSDT
Jul 2 11:08:34 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 2 11:08:34 bcipc038 kernel: [ 0.008564] ACPI: Checking initramfs for custom DSDT
Jul 2 11:08:34 bcipc038 kernel: [ 0.572483] ACPI: EC: Look up EC in DSDT
Jul 3 08:06:16 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 3 08:06:16 bcipc038 kernel: [ 0.008562] ACPI: Checking initramfs for custom DSDT
Jul 3 08:06:16 bcipc038 kernel: [ 0.572478] ACPI: EC: Look up EC in DSDT
Jul 3 18:21:09 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 3 18:21:09 bcipc038 kernel: [ 0.008558] ACPI: Checking initramfs for custom DSDT
Jul 3 18:21:09 bcipc038 kernel: [ 0.572476] ACPI: EC: Look up EC in DSDT
Jul 6 11:13:37 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 6 11:13:37 bcipc038 kernel: [ 0.009197] ACPI: Checking initramfs for custom DSDT
Jul 6 11:13:37 bcipc038 kernel: [ 0.577431] ACPI: EC: Look up EC in DSDT
Jul 6 11:17:55 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 6 11:17:55 bcipc038 kernel: [ 0.008600] ACPI: Checking initramfs for custom DSDT
Jul 6 11:17:55 bcipc038 kernel: [ 0.572472] ACPI: EC: Look up EC in DSDT
Jul 6 12:12:49 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 6 12:12:49 bcipc038 kernel: [ 0.008572] ACPI: Checking initramfs for custom DSDT
Jul 6 12:12:49 bcipc038 kernel: [ 0.237073] ACPI: Found DSDT in DSDT.aml.
Jul 6 12:12:49 bcipc038 kernel: [ 0.237078] ACPI: Override [DSDT-AWRDACPI], this is unsafe: tainting kernel
Jul 6 12:12:49 bcipc038 kernel: [ 0.237082] ACPI: Table DSDT replaced by host OS
Jul 6 12:12:49 bcipc038 kernel: [ 0.237084] ACPI: DSDT 00000000, 6CE1 (r1 RS780 AWRDACPI 1000 INTL 20081204)
Jul 6 12:12:49 bcipc038 kernel: [ 0.237088] ACPI: DSDT override uses original SSDTs unless "acpi_no_auto_ssdt"
Jul 6 12:12:49 bcipc038 kernel: [ 0.568469] ACPI: EC: Look up EC in DSDT
Jul 6 13:21:31 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 6 13:21:31 bcipc038 kernel: [ 0.008586] ACPI: Checking initramfs for custom DSDT
Jul 6 13:21:31 bcipc038 kernel: [ 0.237143] ACPI: Found DSDT in DSDT.aml.
Jul 6 13:21:31 bcipc038 kernel: [ 0.237147] ACPI: Override [DSDT-AWRDACPI], this is unsafe: tainting kernel
Jul 6 13:21:31 bcipc038 kernel: [ 0.237152] ACPI: Table DSDT replaced by host OS
Jul 6 13:21:31 bcipc038 kernel: [ 0.237155] ACPI: DSDT 00000000, 6CE1 (r1 RS780 AWRDACPI 1000 INTL 20081204)
Jul 6 13:21:31 bcipc038 kernel: [ 0.237159] ACPI: DSDT override uses original SSDTs unless "acpi_no_auto_ssdt"
Jul 6 13:21:31 bcipc038 kernel: [ 0.568497] ACPI: EC: Look up EC in DSDT
if the ACPI thermal driver is built in, please boot with thermal.psv=60 if the ACPI thermal driver is compiled as a module, please load the thermal driver manually with module parameter psv=60, i.e. "modprobe thermal psv=60" and see if it helps.
I tried this, but this attempt was not successful either. The machine shut down again.

I looked into the problem a little more and found that throttling is not supported.

/proc/acpi/processor/C001$ cat info
processor id: 1
acpi id: 1
bus mastering control: yes
power management: no
throttling control: no
limit interface: no

powernowd reports this message:

$ sudo powernowd
powernowd: PowerNow Daemon v1.00, (c) 2003-2008 John Clemens
/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus: No such file or directory
powernowd: err=2
powernowd: Found 4 scalable units: -- 1 'CPU' per scalable unit
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq: No such file or directory
PowerNowd encountered and error and could not start.
Please make sure that:
 - You are running a v2.6.7 kernel or later
 - That you have sysfs mounted /sys
 - That you have the core cpufreq and cpufreq-userspace modules loaded into your kernel
 - That you have the cpufreq driver for your cpu loaded, (for example: powernow-k7), and that it works. Check 'dmesg' for errors.
If all of the above are true, and you still have problems, please email the author: clemej@alum.rpi.edu

Maybe the BIOS error is preventing powernow-k8 from starting. The DSDT workaround does not prevent this error message in dmesg:

[ 4.114549] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.
[ 4.114612] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.
[ 4.114671] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.
[ 4.114730] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.
Alois, you should really try to upgrade the BIOS and come back if you still have problems. You might also want to go through power/thermal related BIOS settings *after* upgrading. > Maybe the bios error is preventing powernow-k8 from starting. Look at the second part of comment #2.
please reload the thermal driver with psv=60 crt=80 and re-attach the output of "grep . /proc/acpi/thermal_zone/*/*" remember all these tests are done with the customized DSDT. :)
First of all, let me thank you for your effort. You are much more responsive than the vendors. I appreciate this very much.

re #18:
I've updated the BIOS with the latest version provided:
http://us.sapphiretech.com/drivers/78SAPV09_20090522_4854.zip
The latest version still has the problem reported in kern.log:
> Your BIOS does not provide ACPI _PSS objects in a way that Linux understands.

I've now contacted the mainboard manufacturer
http://www1.sapphiretech.com/us/support/respondtoticket.php?t=1S53102G37668S53102G1-37668S53102G8279590S53102G5
http://tinyurl.com/msw9sq
It seems they are aware of the issue but need to work with their vendor (Phoenix Technology?).

I went through the BIOS settings. These are the settings I found:

Power Management Setup
  ACPI Suspend Type S3(STR)
  C2 Disable/Enable Disabled
  Power Management Option: User Define
  ...
PC Health Status
  Show PC Health In Post: Enabled
  Shutdown Temperature: 70°C/158°F
SMART FAN Configuration
  CPUFAN Smart Mode: Enabled
  CPUFAN Full-speed Temp: 60
  CPUFAN Idle Temp: 50
Thermal Throttling Option
  CPU Thermal Throttling: Enabled
  CPU Throttling Temp: 70°C
  CPU Throttling Duty: 75.0%

Because both the Shutdown and the Throttling temperatures were set to 70, I changed the Throttling Temp to 68°C. The test again caused a shutdown. The Shutdown Temp only offers the options 60°C/65°C/70°C/disable. So I disabled Shutdown Temperature, and the result was strange: the temperature was fixed at 40 C. It seems that this switches off the temperature sensor. I reverted this change and set the Shutdown Temp to 70°C again.

re #19:
I converted the DSDT.hex file into DSDT.aml with this short C program:

#include <stdio.h>
#include "DUMP.hex"   /* defines AmlCode[] */

int main(void)
{
    /* "wb": write the table as raw binary */
    FILE *fid = fopen("DSDT.aml", "wb");
    if (fid == NULL)
        return 1;
    /* 0x6ce1 is the length of the customized DSDT in bytes */
    fwrite(AmlCode, 1, 0x6ce1, fid);
    fclose(fid);
    return 0;
}

and followed the instructions along the lines of http://ubuntuforums.org/showthread.php?t=1036051

sudo cp DSDT.aml /etc/initramfs-tools/DSDT.aml
sudo update-initramfs -u -k 2.6.28-13-generic

My understanding is that this installs the DSDT permanently.
I confirmed this by looking at kern.log with

grep DSDT /var/log/kern.log

Before installing the DSDT, a reboot showed this:

Jul 6 11:17:55 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 6 11:17:55 bcipc038 kernel: [ 0.008600] ACPI: Checking initramfs for custom DSDT
Jul 6 11:17:55 bcipc038 kernel: [ 0.572472] ACPI: EC: Look up EC in DSDT

Afterwards, kern.log has these messages:

Jul 6 12:12:49 bcipc038 kernel: [ 0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780 AWRDACPI 1000 MSFT 3000000)
Jul 6 12:12:49 bcipc038 kernel: [ 0.008572] ACPI: Checking initramfs for custom DSDT
Jul 6 12:12:49 bcipc038 kernel: [ 0.237073] ACPI: Found DSDT in DSDT.aml.
Jul 6 12:12:49 bcipc038 kernel: [ 0.237078] ACPI: Override [DSDT-AWRDACPI], this is unsafe: tainting kernel
Jul 6 12:12:49 bcipc038 kernel: [ 0.237082] ACPI: Table DSDT replaced by host OS
Jul 6 12:12:49 bcipc038 kernel: [ 0.237084] ACPI: DSDT 00000000, 6CE1 (r1 RS780 AWRDACPI 1000 INTL 20081204)
Jul 6 12:12:49 bcipc038 kernel: [ 0.237088] ACPI: DSDT override uses original SSDTs unless "acpi_no_auto_ssdt"
Jul 6 12:12:49 bcipc038 kernel: [ 0.568469] ACPI: EC: Look up EC in DSDT

I assume this shows that the DSDT is loaded. Tell me if I'm wrong, or if there is a better method to check whether the DSDT is loaded.

As suggested, I changed the boot options to psv=60 crt=80. I confirmed this with this command:

$ grep Command /var/log/kern.log
Jul 8 15:58:24 bcipc038 kernel: [ 0.000000] Command line: root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro quiet splash psv=60 crt=80

Here is the result of

$ grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: ok
/proc/acpi/thermal_zone/THRM/temperature:temperature: 66 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN

It survived the test for about 5 hours, then it shut down again with the same error as always:

Jul 8 21:03:30 bcipc038 kernel: [18341.205549] ACPI: Critical trip point
Jul 8 21:03:30 bcipc038 kernel: [18341.205577] Critical temperature reached (71 C), shutting down.
Jul 8 21:03:30 bcipc038 kernel: [18341.205603] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 8 21:03:31 bcipc038 kernel: [18342.604759] [drm] Resetting GPU
Jul 8 21:03:31 bcipc038 kernel: [18342.747361] mtrr: MTRR 5 not used
Jul 8 21:03:36 bcipc038 kernel: [18347.204187] Critical temperature reached (57 C), shutting down.

I'm wondering whether there is a way to test whether throttling was applied or not, or whether this is just a variation due to environmental changes (the weather was cooler in the last few days).
If the thermal driver is built in, you should use the boot options thermal.psv=60 and thermal.crt=80.
If the thermal driver is a module, you should load it manually using "modprobe thermal psv=60 crt=80".
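The prefix matters because a built-in driver only receives command-line parameters written as "module.parameter=value"; a bare "psv=60" is not handed to the thermal driver. As a hedged illustration, this Python sketch checks a kernel command line for thermal-prefixed parameters. The function name `thermal_params` is made up here, and the sample string is the command line reported earlier in this thread.

```python
def thermal_params(cmdline):
    """Return the thermal.* parameters found on a kernel command line."""
    params = {}
    for token in cmdline.split():
        # Only tokens of the form "thermal.<name>=<value>" reach a
        # built-in thermal driver; bare "psv=60" does not.
        if token.startswith("thermal.") and "=" in token:
            name, value = token[len("thermal."):].split("=", 1)
            params[name] = value
    return params


without_prefix = ("root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e "
                  "ro quiet splash psv=60 crt=80")
with_prefix = "ro quiet splash thermal.psv=60 thermal.crt=80"

print(thermal_params(without_prefix))
print(thermal_params(with_prefix))
```

On a running system the same check could be applied to the contents of /proc/cmdline.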
Loading the driver as a module does not work:

$ sudo modprobe thermal psv=60 crt=80
FATAL: Module thermal not found.

So I added a boot option in /boot/grub/menu.lst

title Ubuntu 9.04, kernel 2.6.28-13-with_thermal_fix
uuid 4a0d6592-2929-4050-a603-87e463ceed0e
kernel /boot/vmlinuz-2.6.28-13-generic root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro quiet splash thermal.psv=60 thermal.crt=80
initrd /boot/initrd.img-2.6.28-13-generic
quiet

and rebooted.

$ grep Command /var/log/kern.log
Jul 13 09:32:24 bcipc038 kernel: [ 0.000000] Command line: root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro quiet splash thermal.psv=60 thermal.crt=80

$ grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: passive
/proc/acpi/thermal_zone/THRM/temperature:temperature: 67 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 80 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN

I have now run the test for over 24 h without any shutdown. It seems the problem is gone. Changing the critical temperature to 80 C did "fix" it. But is this really a fix, or just a hack to avoid a shutdown?

I get frequent warnings (about 2 per minute):

$ grep ACPI /var/log/kern.log
...
Jul 14 10:08:32 bcipc038 kernel: [88685.696666] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:08:38 bcipc038 kernel: [88691.696673] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:08:50 bcipc038 kernel: [88703.696664] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:09:50 bcipc038 kernel: [88763.696667] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:09:56 bcipc038 kernel: [88769.696667] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:10:50 bcipc038 kernel: [88823.696665] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:10:56 bcipc038 kernel: [88829.696664] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:11:32 bcipc038 kernel: [88865.696665] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'

and

$ cat /proc/acpi/processor/*/info
processor id: 0
acpi id: 0
bus mastering control: yes
power management: no
throttling control: no
limit interface: no
processor id: 1
acpi id: 1
bus mastering control: yes
power management: no
throttling control: no
limit interface: no
processor id: 2
acpi id: 2
bus mastering control: yes
power management: no
throttling control: no
limit interface: no
processor id: 3
acpi id: 3
bus mastering control: yes
power management: no
throttling control: no
limit interface: no

No throttling is applied. I guess I need to find a way to enable throttling. Do you have any suggestions?
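The per-processor info dump can be summarized mechanically, which is handy when repeating the check after each BIOS or boot-option change. This is a minimal Python sketch under the assumption that the records look like the `cat /proc/acpi/processor/*/info` output in this comment (one "key: value" pair per line, each record starting with "processor id"); the function names are hypothetical.

```python
def parse_processor_info(text):
    """Split concatenated processor info records into one dict per CPU."""
    processors = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "processor id":
            # A new record starts with every "processor id" line.
            processors.append({})
        if processors:
            processors[-1][key] = value
    return processors


def without_throttling(processors):
    """Return the ids of processors that do not report throttling control."""
    return [p["processor id"] for p in processors
            if p.get("throttling control") != "yes"]


sample = """\
processor id: 0
acpi id: 0
power management: no
throttling control: no
processor id: 1
acpi id: 1
power management: no
throttling control: no
"""

print(without_throttling(parse_processor_info(sample)))
```

An empty list would mean that every listed processor reports throttling control.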
> I guess I need to find a way to enable throttling Hmm, better powernow gets enabled (not sure, but I could imagine this machine does not support throttling at all), it seems the newest BIOS still does not export the frequency tables: [ 4.114549] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor. Can you attach acpidump after the BIOS update, I can have a look whether it's really the BIOS' fault. > Jul 14 10:08:32 bcipc038 kernel: [88685.696666] ACPI: Unable to turn cooling > device [ffff88012f815a60] 'on' Interesting. I haven't looked at the details, but this sounds like a critical bug, BIOS or kernel.
Created attachment 22338 [details] new ACPIDUMP after Bios update (re #23)
Strange, doing:
acpixtract -a acpidump
should create the extracted tables as files, but I get dozens of corrupt files/tables:
10.dat ??1.dat 2.dat 31.dat 45.dat 61.dat 82.dat ?apa.dat ??.dat ?d?F.dat ??IN.dat ?P6E.dat ?prx.dat ??SC1.dat 11.dat ??1.dat ?2.dat ... (and many more)

Hm, the file looks like a DSDT, but doing:
iasl -d acpidump
results in:
...
**** ACPI table terminates in the middle of a data structure!

Could you try again and make sure you use the latest acpidump version? It's included here:
http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/pmtools-20071116.tar.bz2

Just doing:
acpidump >/tmp/acpidump
and then sending /tmp/acpidump is what is needed.
Created attachment 22339 [details] new ACPIDUMP after Bios update (re #25)
I think I have a candidate for the problem. The BIOS setting for AMD Cool&Quiet control was disabled. After setting it to AUTO, this message no longer appears:

[ 4.114549] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.

Before the change, I saw this:

$ cpufreq-info
cpufrequtils 004: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to cpufreq@lists.linux.org.uk, please.
analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
analyzing CPU 1:
  no or unknown cpufreq driver is active on this CPU
analyzing CPU 2:
  no or unknown cpufreq driver is active on this CPU
analyzing CPU 3:
  no or unknown cpufreq driver is active on this CPU

After changing the setting, it is this:

$ cpufreq-info
cpufrequtils 004: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to cpufreq@lists.linux.org.uk, please.
analyzing CPU 0:
  driver: powernow-k8
  CPUs which need to switch frequency at the same time: 0
  hardware limits: 1.20 GHz - 2.40 GHz
  available frequency steps: 2.40 GHz, 1.20 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance
  current policy: frequency should be within 1.20 GHz and 2.40 GHz.
                  The governor "ondemand" may decide which speed to use within this range.
  current CPU frequency is 2.40 GHz.
  cpufreq stats: 2.40 GHz:74.96%, 1.20 GHz:25.04% (149)
analyzing CPU 1:
  driver: powernow-k8
...

I still need to revert the other changes and run the test, but I'm telling you now so that you do not waste more of your time on this. Sorry.
No, this change in the BIOS setting is not sufficient to fix the problem. I booted without the options thermal.psv and thermal.crt (the DSDT was still in place), and after running the test for about 35 min, the machine shut down again with this error message:
Jul 14 15:29:29 bcipc038 kernel: [ 2227.313508] ACPI: Critical trip point
Jul 14 15:29:29 bcipc038 kernel: [ 2227.313508] Critical temperature reached (76 C), shutting down.
Jul 14 15:29:29 bcipc038 kernel: [ 2227.314940] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 15:29:30 bcipc038 kernel: [ 2228.296284] [drm] Resetting GPU
Jul 14 15:29:30 bcipc038 kernel: [ 2228.504024] mtrr: MTRR 5 not used
Jul 14 15:29:35 bcipc038 kernel: [ 2233.314108] Critical temperature reached (57 C), shutting down.
The 76 C reading is strange, because the temperature display (in the gnome panel) was mostly in the range of 67-68. Next I tried the boot option thermal.psv=60; again a shutdown after about 40 min:
Jul 14 16:19:21 bcipc038 kernel: [ 2416.905003] ACPI: Critical trip point
Jul 14 16:19:21 bcipc038 kernel: [ 2416.905502] Critical temperature reached (71 C), shutting down.
Jul 14 16:19:21 bcipc038 kernel: [ 2416.907031] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 16:19:22 bcipc038 kernel: [ 2417.946331] [drm] Resetting GPU
Jul 14 16:19:22 bcipc038 kernel: [ 2418.125020] mtrr: MTRR 5 not used
Jul 14 16:19:27 bcipc038 kernel: [ 2422.904779] Critical temperature reached (56 C), shutting down.
I could try again with thermal.psv=60 thermal.crt=80, but earlier there was no shutdown with these settings, so I guess this might work again. The frequent messages (#22)
ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
might trigger the shutdown; setting crt=80 seems to prevent the shutdown.
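To compare the temperatures reported at each shutdown, the "Critical temperature reached" lines can be pulled out of kern.log. The following is only a rough sketch: the regular expression and the function name critical_events are assumptions made for illustration, matched against the message format quoted above.

```python
import re

# Matches kernel log lines such as:
#   ... kernel: [ 2227.313508] Critical temperature reached (76 C), shutting down.
CRIT_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+Critical temperature reached \((?P<temp>\d+) C\)"
)

def critical_events(log_text):
    """Return (uptime_seconds, temperature_C) for every critical-temperature line."""
    return [(float(m.group("uptime")), int(m.group("temp")))
            for m in CRIT_RE.finditer(log_text)]

# Lines copied from the kern.log excerpt in this comment.
sample = """\
Jul 14 15:29:29 bcipc038 kernel: [ 2227.313508] ACPI: Critical trip point
Jul 14 15:29:29 bcipc038 kernel: [ 2227.313508] Critical temperature reached (76 C), shutting down.
Jul 14 15:29:35 bcipc038 kernel: [ 2233.314108] Critical temperature reached (57 C), shutting down.
"""

events = critical_events(sample)
for uptime, temp in events:
    print(f"{uptime:12.6f} s  {temp} C")
```

On a real system the same function could be fed the contents of /var/log/kern.log instead of the sample string.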
Even though powernowd is running, I get this result:
$ cat /proc/acpi/processor/*/info
processor id: 0
acpi id: 0
bus mastering control: yes
power management: no
throttling control: no
limit interface: no
processor id: 1
acpi id: 1
bus mastering control: yes
power management: no
throttling control: no
limit interface: no
processor id: 2
acpi id: 2
bus mastering control: yes
power management: no
throttling control: no
limit interface: no
processor id: 3
acpi id: 3
bus mastering control: yes
power management: no
throttling control: no
limit interface: no
No throttling. Any idea?
> The frequent messages (#22) > ACPI: Unable to turn cooling device [ffff88012f815a60] 'on' > might trigger the shutdown, setting crt=80 seems to prevent the shutdown. Possibly indirectly. The shutdown is triggered as soon as: /proc/acpi/thermal_zone/THRM/temperature:temperature exceeds the critical temperature. > The (76 C) are strange, because the temperature display (in the gnome panel) > was mostly in the range of 67-68. Hmm, you could verify the acpi temperature readings using hwmon. This is included in sensors and libsensors packages on SUSE. You first run sensors-detect to identify the right kernel driver you need. Then load it and run sensors. This one is directly accessing the HW temperature monitor. It could happen that the driver conflicts with the ACPI thermal driver, therefore you could read out acpi temp, unload the thermal driver, load the other one and run sensors all in a row. Hmm, I didn't look at the DSDT/ACPI code yet, but I have the feeling it would be better if ACPI keeps its fingers away from thermal management at all on this machine?
> ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
This is not a problem in the latest kernel. It would be great if you could give it a try.
> Jul 14 15:29:35 bcipc038 kernel: [ 2233.314108] Critical temperature reached
> (57 C), shutting down.
Critical shutdown at 57 C? This is bad. Please run "grep . /proc/acpi/thermal_zone/*/*" after changing the BIOS option and adding the boot option "thermal.psv=60".
(In reply to comment #29)
I ran sensors-detect; after pressing OK several times, I got this:
...
Some south bridges, CPUs or memory controllers may also contain embedded sensors. Do you want to scan for them? (YES/no):
Silicon Integrated Systems SIS5595... No
VIA VT82C686 Integrated Sensors... No
VIA VT8231 Integrated Sensors... No
AMD K8 thermal sensors... No
AMD K10 thermal sensors... Success! (driver `to-be-written')
Intel Core family thermal sensor... No
Intel AMB FB-DIMM thermal sensor... No
Now follows a summary of the probes I have just done.
Just press ENTER to continue:
Driver `f71882fg' (should be inserted):
  Detects correctly:
  * ISA bus, address 0x225
    Chip `Fintek F71882FG/F71883FG Super IO Sensors' (confidence: 9)
Driver `to-be-written' (should be inserted):
  Detects correctly:
  * Chip `AMD K10 thermal sensors' (confidence: 9)
I will now generate the commands needed to load the required modules.
Just press ENTER to continue:
To load everything that is needed, add this to /etc/modules:
#----cut here----
# Chip drivers
f71882fg
# no driver for AMD K10 thermal sensors yet
#----cut here----
Any idea where to get the driver for "AMD K10 thermal sensors"?
$ sensors
acpitz-virtual-0
Adapter: Virtual device
temp1: +43.0°C (crit = +70.0°C)
$ sensors -v
sensors version 3.0.2 with libsensors version 3.0.2
Perhaps I should look for a newer version of libsensors. Zhang, what is your opinion on removing the DSDT hack?
(In reply to comment #30)
> > ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
>
> this is not a problem in the latest kernel.
> it would be great if you can give it a try.
OK, this will take some time. Which version, 2.6.30.1 or 2.6.31-rc3?
> > Jul 14 15:29:35 bcipc038 kernel: [ 2233.314108] Critical temperature reached
> > (57 C), shutting down.
>
> critical shutdown at 57 C? this is bad.
Please note that 6 s earlier the temp was 76 C (maybe some random fluctuations exceeded the crt=80):
Jul 14 15:29:29 bcipc038 kernel: [ 2227.313508] Critical temperature reached (76 C), shutting down.
Isn't it possible that this triggered the shutdown, processes were stopped, and the CPU cooled down within these 6 seconds? Same pattern here:
..
Jul 14 16:19:21 bcipc038 kernel: [ 2416.905502] Critical temperature reached (71 C), shutting down.
..
Jul 14 16:19:27 bcipc038 kernel: [ 2422.904779] Critical temperature reached (56 C), shutting down.
> please run "grep . /proc/acpi/thermal_zone/*/*" after you changing the BIOS
> option and adding boot option "thermal.psv=60"
$ grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: ok
/proc/acpi/thermal_zone/THRM/temperature:temperature: 46 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN
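For comparing such dumps between boots, the grep output can be reduced to the current temperature and the trip points. This is a minimal sketch that assumes the line format shown above; parse_thermal_zone is a made-up helper name, not part of any tool mentioned in this report.

```python
import re

def parse_thermal_zone(grep_output):
    """Parse `grep . /proc/acpi/thermal_zone/*/*` output into a small dict."""
    info = {"trip_points": {}}
    for line in grep_output.splitlines():
        # Each line is "<file path>:<file content>"; split at the first colon.
        path, _, value = line.partition(":")
        name = path.rsplit("/", 1)[-1]
        value = value.strip()
        if name == "temperature":
            info["temperature"] = int(re.search(r"(\d+) C", value).group(1))
        elif name == "trip_points":
            m = re.match(r"(critical|passive|active\[\d+\])[^:]*:\s*(\d+) C", value)
            if m:
                info["trip_points"][m.group(1)] = int(m.group(2))
    return info

# Output copied from this comment.
sample = """\
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: ok
/proc/acpi/thermal_zone/THRM/temperature:temperature: 46 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN
"""

info = parse_thermal_zone(sample)
# Distance between the current reading and the critical trip point.
margin = info["trip_points"]["critical"] - info["temperature"]
print(info, "margin to critical:", margin, "C")
```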
Some info and questions:

As _PSL wrongly points to CPU0, which must be C000, I wonder why we do not see an "Invalid passive threshold\n"; shouldn't acpi_evaluate_reference fail?
In drivers/thermal/thermal.xyx:
    status = acpi_evaluate_reference(tz->device->handle, "_PSL",
                                     NULL, &devices);
    if (ACPI_FAILURE(status)) {
        printk(KERN_WARNING PREFIX
               "Invalid passive threshold\n");
        tz->trips.passive.flags.valid = 0;
    }
Does this come from the modified DSDT?
The message looks wrong or too general (from 2.6.30), shouldn't it be something like:
"Couldn't reference passive cooling device(s)\n");
What exactly happens if it cannot be referenced? I'd expect the above message and the passive trip point to be marked invalid.

It's not the first time I see this (wrong CPU as passive device ref); it looks like a generic AMD BIOS bug after upgrading to quad-core capable BIOSes. IMO we should try to fetch all processor objects as passive cooling devices if something like that happens.

It could also happen that the temp monitor device is accessed by another driver. This is the OpRegion used:
OperationRegion (IP, SystemIO, 0x0225, 0x02)
In this case it could happen that ACPI reads totally wrong temperature values (normally even more wrong than 76C, e.g. 3000C) and would shut down if a race with another driver happens.
Hmm, we see a conflict, not with the ACPI temp device, but still...:
ACPI: I/O resource piix4_smbus [0xb00-0xb07] conflicts with ACPI region SOR2 [0xb00-0xb0f]
Hmm,
OperationRegion (SOR2, SystemIO, SBA2, 0x10)
This one (and the other OpRegion SOR1, defining the same SMBus IOs) seems to be used only by the WMI device. But it may get used (from dmesg):
ACPI: WMI: Mapper loaded
Maybe now it's time to add Jean...

Alois: The attachment "zip file containing kern.log, acpidump.dump dmesg.dump and dmidecode.dump" only contains plain text dmesg. Can you attach dmidecode, please?
I am curious what kind of machine this is and depending on the OEM, also like to poke about the wrong passive cooling device reference...
Created attachment 22352 [details] dump of dmidecode
Is this some kind of devel machine?: System Information Manufacturer: Unknow Product Name: Unknow Version: Unknow What is the vendor and model of it?
(In reply to comment #32)
> Some info and questions:
>
> As _PSL is wrong pointing to CPU0, which must be C000, I wonder why we do not see
> an "Invalid passive threshold\n", the acpi_evaluate_reference should fail?:
> In drivers/thermal/thermal.xyx:
>     status = acpi_evaluate_reference(tz->device->handle, "_PSL",
>                                      NULL, &devices);
>     if (ACPI_FAILURE(status)) {
>         printk(KERN_WARNING PREFIX
>                "Invalid passive threshold\n");
>         tz->trips.passive.flags.valid = 0;
>     }
>
> Does this come from the modified DSDT?
Right. I renamed CPU0 to C000 in the customized DSDT. Please refer to comment #14; you can see C000 is used instead of CPU0.
> The message looks wrong or too general (from 2.6.30), shouldn't it be something like:
> "Couldn't reference passive cooling device(s)\n");
> What exactly happens if it cannot be referenced? I'd expected above message and
> the passive trip point is marked invalid.
That's right. Please refer to the output of "grep . /proc/acpi/thermal_zone*/*" in comment #4.
> I am curious what kind of machine this is and depending on the OEM, also like to poke
> about the wrong passive cooling device reference...
How do we work around the wrong passive cooling device reference? Assume _PSL returns all the processors?
(In reply to comment #34) > Is this some kind of devel machine?: > System Information > Manufacturer: Unknow > Product Name: Unknow > Version: Unknow > What is the vendor and model of it? I bought the system from www.ditech.at with this specification PCDM4H3 PC-System - dimotion Mini M4H3 AMD® Phenom™ X4 9750, 2,4GHz, Quad-Core 4 GB DDR2-RAM, 640 GB HDD Cardreader, DVD-Writer, Sound, 1 GBit LAN ATI® Radeon™ HD4850, 512MB I did not know how lousy they are.
Created attachment 22382 [details] Workaround the invalid passive cooling device reference Alois, can you try out this patch without the modified DSDT and see whether you get a valid passive trip point. Also echo a 1 into the cooling_mode file. Then the active and passive trip points should switch and you "should" get a totally passively cooled system (cat the trip points afterwards). It's a rarely implemented, but really "cool" feature. Unfortunately the fan seem to not be controlled on your system correctly, but it's worth a test.
re #36: The vendor (ditech) offered another mainboard. Although they have no solution, they are interested in fixing the problem.

re #37: I removed the DSDT and tried to compile the kernel. First, I followed these instructions
http://www.cyberciti.biz/tips/compiling-linux-kernel-26.html
but was not successful. Probably because some restricted modules were not included, or because the default settings were different for Ubuntu, or because I was using 2.6.30.1; I do not know. So, I followed these instructions:
- http://www.howtoforge.com/kernel_compilation_ubuntu,
- installed linux-restricted-modules,
- downloaded the kernel sources 2.6.30.2,
- applied your patch,
- and used .config from the previous kernel version.
I rebooted without the option thermal.psv=60, and got this:
# grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: ok
/proc/acpi/thermal_zone/THRM/temperature:temperature: 45 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 68 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN
# echo 1 > /proc/acpi/thermal_zone/THRM/cooling_mode
# grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: ok
/proc/acpi/thermal_zone/THRM/temperature:temperature: 44 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 68 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN
It seems echoing a 1 does not make a difference; the file /proc/acpi/thermal_zone/THRM/cooling_mode is unchanged.
When I booted with thermal.psv=60, I got this:
# grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: ok
/proc/acpi/thermal_zone/THRM/temperature:temperature: 63 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN
# echo 1 > /proc/acpi/thermal_zone/THRM/cooling_mode
root@bcipc038:/home/schloegl# grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state: ok
/proc/acpi/thermal_zone/THRM/temperature:temperature: 64 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5): 70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive: 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003
/proc/acpi/thermal_zone/THRM/trip_points:active[0]: 68 C: devices= FAN
Perhaps the active and passive trip points are both 68. Again, no difference when echoing a 1 into cooling_mode. Did I miss anything? I started the test; it has run for about 70 min without a problem, and I'll continue the test.
Thanks for the test! That means my workaround seems to work and I'll send it mainline, right?
> It seems echoing a 1 does not make a difference, the file /proc/acpi/thermal_zone/THRM/cooling_mode is unchanged.
Yes, there are several reasons why nothing changed:
- The procfs output is always the same. In drivers/acpi/thermal line 1091:
  seq_puts(seq, "0 - Active; 1 - Passive\n");
  It should show the current state instead; I'll try to fix this, no testing needed.
- The BIOS is reading the active/passive trip point values from IO, therefore I couldn't see the real values. As they are both the same, you do not see that they flipped. Nothing to worry about, but a stupid implementation by the BIOS developers.
- When you provide thermal.psv=60 you override the BIOS-provided passive trip point, and the override is not affected by the cooling_mode setting.
-> Everything is fine (besides that the BIOS implementation of the cooling mode is useless, but that's not the kernel's fault).
> Although they have no solution, they are interested in fixing the problem.
Most important, they should fix the wrong reference in the ACPI _PSL function of the thermal device to match the CPUs. This would look like this:
--- DSDT.dsl.orig 2009-07-20 14:50:08.447874000 +0200
+++ DSDT.dsl 2009-07-20 14:50:44.457613000 +0200
@@ -3192,7 +3192,10 @@
 Name (_PSL, Package (0x01)
 {
-    \_PR.CPU0
+    \_PR.C000
+    \_PR.C001
+    \_PR.C002
+    \_PR.C003
 })
 Name (_TSP, 0x3C)
 Name (_TC1, 0x04)
Have I overlooked anything that still needs an answer? If the patch gets accepted, can the bug be closed?
> Although they have no solution, they are interested in fixing the problem.
If they want to do it right, they should take care that the fan can be controlled via ACPI. They should then define the active trip point below the passive one, e.g. at 55C. Then the fan kicks in at 55C. In extreme circumstances, the temperature might rise to 68C; then the passive trip point kicks in and you get a slower system, but at least no critical shutdown. If you then switch the cooling mode to passive, the values for active and passive get switched. Passive cooling then kicks in at 55C, the frequency gets reduced, and you get a slower but absolutely quiet system.
-> that's the idea...
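The intended ordering of the trip points can be illustrated with a toy model. This is only a sketch of the idea described above, not the kernel's thermal code: the function name cooling_actions is made up, the 55/68/70 values are the ones mentioned in this thread, and the use of ">=" for the comparisons is an assumption.

```python
def cooling_actions(temp_c, active_trip, passive_trip, critical_trip,
                    passive_mode=False):
    """Return the cooling actions expected at temp_c in this simplified model.

    In passive mode the active and passive thresholds trade places,
    as described for the cooling_mode switch.
    """
    if passive_mode:
        active_trip, passive_trip = passive_trip, active_trip
    actions = []
    if temp_c >= critical_trip:
        actions.append("shutdown")
    if temp_c >= passive_trip:
        actions.append("throttle")
    if temp_c >= active_trip:
        actions.append("fan")
    return actions

# Active trip at 55 C, passive at 68 C, critical at 70 C.
for temp in (50, 60, 69, 71):
    print(temp, cooling_actions(temp, 55, 68, 70),
          cooling_actions(temp, 55, 68, 70, passive_mode=True))
```

At 60 C the model turns on only the fan in active mode and only throttles in passive mode, which is the quiet-but-slower behaviour the switch is meant to give.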
Created attachment 22413 [details] .config used to compile the kernel 2.6.30.2 with acpi patch
The machine shut down again after almost 4 hours running the test. The kernel was booted with thermal.psv=60 Here is the snippet of /var/log/kern.log ... Jul 20 13:40:17 bcipc038 kernel: [ 0.000000] Linux version 2.6.30.2-some-string-here (root@bcipc038) (gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) ) #2 SMP Mon Jul 20 11:14:08 CEST 2009 Jul 20 13:40:17 bcipc038 kernel: [ 0.000000] Command line: root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro quiet splash thermal.psv=60 ... Jul 20 17:38:47 bcipc038 kernel: [14343.489356] Critical temperature reached (71 C), shutting down. Jul 20 17:38:49 bcipc038 kernel: [14344.607387] [drm] Resetting GPU Jul 20 17:38:49 bcipc038 kernel: [14344.774229] mtrr: MTRR 5 not used ...
Did you double check whether the frequency got reduced as soon as the temp is above 60C?:
cat /proc/cpuinfo
or
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies
The first must get reduced when the passive trip point temp is exceeded. If the machine still overheats with reduced freq, either the temp is wrongly read (but it looks fine, no insane values..) or you may need more powerful fan(s).
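The same check can be scripted by comparing scaling_max_freq with cpuinfo_max_freq for every CPU. A minimal sketch: the helper name throttled_cpus and its cpu_root parameter are made up for illustration, and it assumes that a lowered scaling_max_freq is how passive cooling shows up in sysfs.

```python
import glob
import os

def throttled_cpus(cpu_root="/sys/devices/system/cpu"):
    """Map each cpuN to True when scaling_max_freq is below cpuinfo_max_freq."""
    result = {}
    for d in sorted(glob.glob(os.path.join(cpu_root, "cpu[0-9]*", "cpufreq"))):
        with open(os.path.join(d, "scaling_max_freq")) as f:
            limit = int(f.read())
        with open(os.path.join(d, "cpuinfo_max_freq")) as f:
            hw_max = int(f.read())
        # The directory above "cpufreq" is named cpu0, cpu1, ...
        result[os.path.basename(os.path.dirname(d))] = limit < hw_max
    return result

if __name__ == "__main__":
    for cpu, limited in throttled_cpus().items():
        print(cpu, "limited" if limited else "full speed")
```

Run it while the load test is active and the temperature is above the passive trip point; every CPU should then be reported as limited.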
Be aware that you should still do some computation, even when the frequency is lowered, to test overheating. A totally idle system with reduced freq produces much less heat than a utilized one. The fan must easily be able to keep the system at, say, 60C when the frequency is reduced, even if the CPUs are busy. Did you check whether the fan gets controlled by ACPI? You may want to add thermal.act=50. Does the fan get louder at this temp then? Best take a latest kernel for that again; there were some changes recently. The
"Unable to turn cooling device"
message does not exist in the latest kernels anymore and the kernel should still try to set the requested state.
From comment #50:
> I noticed also the fan became loader (was speeding up) a few minutes after...
You also may want to play with sensors, which should be able to show fan activity and temperature.
(In reply to comment #43)
> Did you double check whether the frequency got reduced as soon as temp is above
> 60C?:
> cat /proc/cpuinfo
> or
> cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
> cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies
> The first must get reduced when the passive trip point temp is exceeded.
> If the machine still overheats with reduced freq, either the temp is wrongly
> read (but it looks fine, no insane values..) or you may need a more powerful
> fan(s).
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
2400000
2400000
2400000
2400000
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies
2400000 1200000
2400000 1200000
2400000 1200000
2400000 1200000
I guess the frequency does not get reduced. (It's a guess, because I do not know how reliable the tools are.) I'm using the gnome applets, and the CPU frequency monitor does not have an update interval. The sensor applet for the temp has a 2 s interval by default. However, when I start the test, the temp rises to about 67-68 C, and the CPU freq still does not drop. I was not suspicious, because there might be some short-term throttling, and if only the maximum frequency in a given interval is shown, it will mostly display 2.4 GHz. At least that's what I thought.

Now I notice that, when running the test (the temp is typically 67-68 C), I can enforce a reduction to 1.2 GHz when I type the command
# echo 1 >/proc/acpi/thermal_zone/THRM/cooling_mode
Simultaneously, the CPU freq of all 4 cores is reduced to 1.2 GHz and the temp drops to about 57 C. This lasts for about 13 s (counting from the echo command), then the freq is back at 2.4 GHz and the temp rises to 66 C. The freq stays at 2.4; only another echo command will reduce the frequency. It seems some periodic resetting of the cooling_mode is happening. Any idea what causes this and how to avoid it?
I think the fan is controlled by ACPI for the following reasons: First, /var/log/kern.log contains this line. Jul 20 19:14:39 bcipc038 kernel: [ 2.645668] ACPI: Fan [FAN] (on) Second, the new kernel (2.6.30.2) has also a sensor for the fan speed it shows 1800 RPM when idle. When I increase the load, and the temp raises above 60 C, the fan becomes faster (about 3600 RPM), when the load is reduced and the temp drops, the fan reduces the speed (about 1800 RPM) at about 55 C.
after the cpu frequency changes to 1.2G HZ, please attach the output of "grep . /sys/devices/system/cpu/cpu*/cpufreq/*"
(In reply to comment #46) > after the cpu frequency changes to 1.2G HZ, please attach the output of > "grep . /sys/devices/system/cpu/cpu*/cpufreq/*" Ok, here is what I did. First, I start the test (heavy numerical computations. The temp raises above 60 C, then I issue this commands: # echo 1 >/proc/acpi/thermal_zone/THRM/cooling_mode # grep . /sys/devices/system/cpu/cpu*/cpufreq/* /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus:0 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq:1200000 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq:2400000 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq:1200000 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency:4000 /sys/devices/system/cpu/cpu0/cpufreq/related_cpus:0 /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies:2400000 1200000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:1200000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:powernow-k8 /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:ondemand /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:1920000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq:1200000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed:<unsupported> /sys/devices/system/cpu/cpu1/cpufreq/affected_cpus:1 /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq:1200000 /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq:2400000 /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_min_freq:1200000 /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_transition_latency:4000 /sys/devices/system/cpu/cpu1/cpufreq/related_cpus:1 /sys/devices/system/cpu/cpu1/cpufreq/scaling_available_frequencies:2400000 1200000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:1200000 
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:powernow-k8 /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor:ondemand /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:1920000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq:1200000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed:<unsupported> /sys/devices/system/cpu/cpu2/cpufreq/affected_cpus:2 /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_cur_freq:1200000 /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_max_freq:2400000 /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_min_freq:1200000 /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_transition_latency:4000 /sys/devices/system/cpu/cpu2/cpufreq/related_cpus:2 /sys/devices/system/cpu/cpu2/cpufreq/scaling_available_frequencies:2400000 1200000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:1200000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:powernow-k8 /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor:ondemand /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:1920000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq:1200000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_setspeed:<unsupported> /sys/devices/system/cpu/cpu3/cpufreq/affected_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_cur_freq:1200000 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_max_freq:2400000 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_min_freq:1200000 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency:4000 /sys/devices/system/cpu/cpu3/cpufreq/related_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_frequencies:2400000 1200000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:1200000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:powernow-k8 
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor:ondemand /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:1920000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq:1200000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_setspeed:<unsupported> When the cpu frequency is back at 2.4 (about 12 s later), I get this: # grep . /sys/devices/system/cpu/cpu*/cpufreq/* /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus:0 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq:2400000 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq:2400000 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq:1200000 /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency:4000 /sys/devices/system/cpu/cpu0/cpufreq/related_cpus:0 /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies:2400000 1200000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2400000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:powernow-k8 /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:ondemand /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:2400000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq:1200000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed:<unsupported> /sys/devices/system/cpu/cpu1/cpufreq/affected_cpus:1 /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq:2400000 /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq:2400000 /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_min_freq:1200000 /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_transition_latency:4000 /sys/devices/system/cpu/cpu1/cpufreq/related_cpus:1 /sys/devices/system/cpu/cpu1/cpufreq/scaling_available_frequencies:2400000 1200000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:2400000 
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:powernow-k8 /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor:ondemand /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:2400000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq:1200000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed:<unsupported> /sys/devices/system/cpu/cpu2/cpufreq/affected_cpus:2 /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_cur_freq:2400000 /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_max_freq:2400000 /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_min_freq:1200000 /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_transition_latency:4000 /sys/devices/system/cpu/cpu2/cpufreq/related_cpus:2 /sys/devices/system/cpu/cpu2/cpufreq/scaling_available_frequencies:2400000 1200000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:2400000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:powernow-k8 /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor:ondemand /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:2400000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq:1200000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_setspeed:<unsupported> /sys/devices/system/cpu/cpu3/cpufreq/affected_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_cur_freq:2400000 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_max_freq:2400000 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_min_freq:1200000 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency:4000 /sys/devices/system/cpu/cpu3/cpufreq/related_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_frequencies:2400000 1200000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:2400000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:powernow-k8 
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor:ondemand /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:2400000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq:1200000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_setspeed:<unsupported>
(In reply to comment #44) > Be aware that you should still do some computation, even when frequency is > lowered to test overheating. A total idle system with reduced freq consumes > much less heat than a utilized one. Of course, in the (almost) idle state, the temp is in the range of 45 C. Only when the test (which contains heavy FPU computation) is started, the temp raises towards 68 C. > The fan must easily be able to keep the system at say 60C when frequency is > reduced, even if CPUs are busy. When the frequency on all 4 cpus is lowered to 1.2 GHz, the temp settles around 57 C. > Did you try whether the fan gets controlled by ACPI? > You may want to add thermal.act=50 > Does the fan get loader at this temp then? I did not test thermal.act=50 yet, but try to install 2.6.31-rc3. Please see also comment #45. > Best take a latest kernel for that again, there were some changes recently, > the: > "Unable to turn cooling device" > message does not exist in latest kernels anymore and the kernel should still > try to set the requested state. The last time I saw this message was Jul 14 16:19:21 bcipc038 kernel: [ 2416.905003] ACPI: Critical trip point Jul 14 16:19:21 bcipc038 kernel: [ 2416.905502] Critical temperature reached (71 C), shutting down. Jul 14 16:19:21 bcipc038 kernel: [ 2416.907031] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on' Jul 14 16:19:22 bcipc038 kernel: [ 2417.946331] [drm] Resetting GPU Jul 14 16:19:22 bcipc038 kernel: [ 2418.125020] mtrr: MTRR 5 not used Jul 14 16:19:27 bcipc038 kernel: [ 2422.904779] Critical temperature reached (56 C), shutting down. The two shutdowns afterwards did not contain this message but stopped with this message: Jul 20 17:38:47 bcipc038 kernel: [14343.489356] Critical temperature reached (71 C), shutting down. Jul 20 19:13:24 bcipc038 kernel: [ 5623.501001] Critical temperature reached (71 C), shutting down. The tests on Jul 20 were run with the kernel 2.6.30.2 + your acpi patch. 
This seems to confirm that the newer kernel got rid of the problem "Unable to turn cooling device".
> From comment #50:
> > I noticed also the fan became loader (was speeding up) a few minutes after...
> You also may want to play with sensors which should be able to show fan
> activity and temperature.
So the remaining question is: why does a temp>60 not trigger the freq reduction? I'm going to test 2.6.31-rc3, or do I need a more recent version, e.g. snapshot 2.6.31-rc3-git5? When I compiled the kernel 2.6.31-rc3 without your patch (I forgot to include it), echoing 1 did not reduce the frequency.
I have installed 2.6.31-rc3 with your patch, and booted with thermal.psv=60 thermal.act=50

[ 0.000000] Linux version 2.6.31-rc3-some-string-here (root@bcipc038) (gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) ) #1 SMP Tue Jul 21 18:42:43 CEST 2009
[ 0.000000] Command line: root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro thermal.psv=60 thermal.act=50 quiet splash

Despite thermal.act=50, the fan speeds up (from ~1800 to 3600 RPM) only at a temp of 60 C. A temp above 60 C does not cause passive cooling (i.e. frequency reduction).

However, if the temp is above 60 C and all CPUs are running at full speed (2.4 GHz), then

# echo 1 > /proc/acpi/thermal_zone/THRM/cooling_mode

reduces the freq to 1.2 GHz for about 12 s. In the meantime, the temp also drops to about 57 C. After these 12 s, the freq returns to 2.4 GHz, and the temp increases again. Although the temp is above 60 C, the freq is not reduced. Only another echo will reduce the freq for about 12 s.
I think I know why this does not work. Also increase the polling frequency:

echo 10 >/proc/acpi/thermal_zone/*/polling_frequency

Rui, do you agree with patch from comment #37?
While it may not help with this totally broken BIOS, it may help with others.
Do I get your reviewed-by or signed-off-by?
# cat /proc/acpi/thermal_zone/*/polling_frequency
<polling disabled>

When I do

# echo 1 > /proc/acpi/thermal_zone/THRM/cooling_mode
# echo 10 >/proc/acpi/thermal_zone/*/polling_frequency

and start the test, the freq is reduced to 1.2 GHz when the temp exceeds thermal.psv. The CPU cools down (to about 57 C), and after a few seconds the freq goes up to 2.4 GHz again. So, this is working for me. If I remember correctly, 2.6.31-rc3 without the patch did not reduce the freq.

Concerning comment #37: why do you think this is "broken" bios? Is there some standard for how a BIOS should deal with this? Or could you imagine that this is just some alternative definition? I'm asking to get a better understanding of the problem.
(In reply to comment #50)
> Rui, do you agree with patch from comment #37?
> While it may not help with this totally broken BIOS, it may help with others.
> Do I get your reviewed-by or signed-off-by?

yes,
signed-off-by: Zhang Rui <rui.zhang@intel.com>

(In reply to comment #51)
> Concerning comment #37: why do you think this is "broken" bios?

There are four processor devices (C000, C001, C002, C003) defined in the BIOS. _PSL is a control method through which the BIOS tells the OS which devices (mostly processors) should be used for passive cooling. But here is the _PSL in your BIOS:

Name (_PSL, Package (0x01)
{
    \_PR.CPU0
})

It references a non-existent device, which is surely broken.
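To make the _PSL point concrete, here is a minimal Python sketch of the consistency check being described: every entry in the passive list should resolve to a processor device that the BIOS actually defines. The function name and the full \_PR.C00x paths are assumptions for illustration (the comment above only gives the short names C000 to C003), and this is not how the kernel performs the lookup.

```python
def unresolved_psl_entries(psl, defined_devices):
    """Return the _PSL entries that do not name a defined device."""
    defined = set(defined_devices)
    return [entry for entry in psl if entry not in defined]


# Assumed full namespace paths for the four processors named above.
processors = [r"\_PR.C000", r"\_PR.C001", r"\_PR.C002", r"\_PR.C003"]
# The passive list as quoted from this BIOS.
psl = [r"\_PR.CPU0"]

# A non-empty result means the passive list points at devices that do not exist.
print(unresolved_psl_entries(psl, processors))
```

With a dangling reference like this, the OS has no processor to throttle for passive cooling, which would be consistent with the behaviour reported in this thread.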
> Concerning comment #37: why do you think this is "broken" bios?

Yep. And the next point is that the BIOS should notify the OS when the temperature exceeds a trip point. Otherwise the OS has to poll the temperature and check itself.

There was a lot of discussion about whether thermal polling should be enabled by default (SUSE did this some time ago, but Len convinced us not to do that), or whether it should be enabled if there is a passive trip point, etc., because there are other BIOSes (not many, but it hurts) which do not notify on trip point changes.
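The difference between notification and polling can be illustrated with a small Python sketch. It assumes a hypothetical read_temp callable and a fixed trip point: if the firmware never sends a notification, a crossing is only noticed when something like this loop is running. The names and the plain threshold comparison are assumptions for illustration, not the kernel's thermal code.

```python
import time


def poll_trip_point(read_temp, trip_c, interval_s, samples, sleep=time.sleep):
    """Poll the temperature and collect the samples that exceed trip_c.

    read_temp is a hypothetical callable returning degrees Celsius.
    Returns a list of (sample_index, temperature) pairs.
    """
    crossings = []
    for i in range(samples):
        temp = read_temp()
        if temp > trip_c:
            crossings.append((i, temp))
        if i < samples - 1:
            sleep(interval_s)
    return crossings


# Example with canned readings instead of a real sensor.
readings = iter([57, 59, 61, 63, 58])
print(poll_trip_point(lambda: next(readings), trip_c=60, interval_s=0, samples=5))
```

The cost of this approach is the periodic wakeup, which is one plausible reason why enabling it by default was debated.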
Noooo....

Although cpufreq has problems on this machine, fixing it will never address the thermal problem seen when all cores are running at max frequency for a long period.

Although ACPI fan control is broken on this machine, "fixing it" is probably not the way to go, as approximately 0 desktop machines actually _have_ underlying ACPI fan control.

Although ACPI throttling is broken on this machine, fixing it with a kernel workaround is _not_ the way to go because this is a quad-core desktop machine. It is a class of system that should be able to run at maximum performance for an indefinite period of time with _no_ thermal throttling.

My advice is to return this piece of junk and get a real computer.

If that is not possible, then the workaround mentioned in comment #12 should be all you need. Boot with "thermal.nocrt=1"

This is a quad-core desktop. If the machine overheats, then the supplier did not install sufficient fans -- get bigger ones, or some fans that spin at a faster speed at a given voltage.

The only thing here that I think could be a kernel issue worth fixing is that the temperature reading jumps around and we shut down even if it "gets better" after the initial event. This may be a junk sensor, or it could be an issue with the Linux EC driver. But we should probably re-read the critical temperature to be sure it is still over-temp before shutting down.

closing this as an invalid BIOS issue.
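For reference, the "re-read before shutting down" idea could be sketched as follows in Python. This is only an illustration under assumptions: read_temp is a hypothetical callable, the 70 C critical trip point is a made-up value, and the kernel does not necessarily implement anything like this.

```python
def confirm_critical(read_temp, crit_c, rereads=1):
    """Return True only if the first reading and every re-read are at or above crit_c."""
    for _ in range(1 + rereads):
        if read_temp() < crit_c:
            return False
    return True


# Readings taken from the Jul 14 log in this thread: 71 C, then 56 C seconds later.
# The 70 C critical trip point is a hypothetical value for this example.
readings = iter([71, 56])
print(confirm_critical(lambda: next(readings), crit_c=70))
```

With a sensor that spikes once and then drops back, a check like this would skip the shutdown, whereas a sustained over-temperature condition would still trigger it.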