Bug 13573 - Critical temperature reached (72 C), shutting down - Quadcore-AMD64, Ubuntu64
Status: CLOSED INVALID
Product: ACPI
Classification: Unclassified
Component: BIOS
Hardware: All Linux
Importance: P1 high
Assigned To: acpi_bios
Reported: 2009-06-18 16:09 UTC by Alois Schlögl
Modified: 2009-08-29 22:21 UTC (History)

See Also:
Kernel Version:
Tree: Mainline
Regression: No


Attachments
zip file containing kern.log, acpidump.dump dmesg.dump and dmidecode.dump (132.87 KB, text/plain)
2009-06-18 16:21 UTC, Alois Schlögl
customized DSDT (255.50 KB, application/octet-stream)
2009-07-06 03:38 UTC, Zhang Rui
new ACPIDUMP after Bios update (re #23) (29.80 KB, application/octet-stream)
2009-07-14 11:25 UTC, Alois Schlögl
new ACPIDUMP after Bios update (re #25) (138.33 KB, application/octet-stream)
2009-07-14 12:08 UTC, Alois Schlögl
dump of dmidecode (17.52 KB, application/octet-stream)
2009-07-15 10:39 UTC, Alois Schlögl
Workaround the invalid passive cooling device reference (2.57 KB, patch)
2009-07-16 23:24 UTC, Thomas Renninger
.config used to compile the kernel 2.6.30.2 with acpi patch (96.74 KB, application/octet-stream)
2009-07-20 13:07 UTC, Alois Schlögl

Description Alois Schlögl 2009-06-18 16:09:52 UTC
About 1 hour after I started a heavy computing batch (using matlab 7.6, distributing the load across all 4 cores), the machine suddenly shut down. kern.log shows these messages: 

Jun 18 14:32:48 bcipc038 kernel: [1117042.573713] ACPI Exception (thermal-0479): AE_ERROR, ACPI thermal trip point state changed
Jun 18 14:32:50 bcipc038 kernel: [1117042.573717] Please send acpidump to linux-acpi@vger.kernel.org
Jun 18 14:32:50 bcipc038 kernel: [1117042.573719]  [20080926]
Jun 18 14:32:50 bcipc038 kernel: [1117042.574046] ACPI: Critical trip point
Jun 18 14:32:50 bcipc038 kernel: [1117042.574072] Critical temperature reached (72 C), shutting down.
Jun 18 14:32:50 bcipc038 kernel: [1117042.574098] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jun 18 14:32:58 bcipc038 kernel: [1117048.576698] Critical temperature reached (58 C), shutting down.
Jun 18 14:32:58 bcipc038 kernel: [1117049.920186] [drm] Resetting GPU
Jun 18 14:32:58 bcipc038 kernel: [1117050.517645] mtrr: MTRR 5 not used
Jun 18 14:57:23 bcipc038 kernel: Inspecting /boot/System.map-2.6.28-11-generic

The same behavior was observed on the same machine a few months ago and was reported here
  
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/314001

and here 

   http://marc.info/?l=linux-acpi&m=123120299000668&w=1


The problem then went away, or rather I did not have time to try to reproduce it. 


Based on the previous feedback in 
http://marc.info/?l=linux-acpi&m=123120299000668&w=2 , I attach 
  acpidump.dump  
  dmesg.dump  
  dmidecode.dump  
  kern.log.20090619181656

However, I've no idea how to try the boot option of "acpi.power_nocheck=1".  


The problem actually happened twice (see kern.log):  

Jun 18 14:32:48 bcipc038 kernel: [1117042.573713] ACPI Exception (thermal-0479): AE_ERROR, ACPI thermal trip point state changed

Jun 18 16:22:19 bcipc038 kernel: [ 5402.772605] ACPI Exception (thermal-0479): AE_ERROR, ACPI thermal trip point state changed

Therefore, it seems to be reproducible again. I also noticed this message (4 times) in kern.log: 

Jun 18 16:32:46 bcipc038 kernel: [    4.263055] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.


Is this important? What can I do about it? I have no experience with kernel/ACPI internals. Let me know if I can do anything to track down this issue. 

Alois
Comment 1 Alois Schlögl 2009-06-18 16:21:08 UTC
Created attachment 21991 [details]
zip file containing kern.log, acpidump.dump dmesg.dump and dmidecode.dump

The zip file with kern.log, acpidump.dump dmesg.dump and dmidecode.dump was not accepted. Therefore, I make it available here http://hci.tugraz.at/~schloegl/acpi_linux_dumplog.zip
Comment 2 Thomas Renninger 2009-06-18 16:28:03 UTC
This looks worth looking at further:
(thermal-0479): AE_ERROR, ACPI thermal trip point state changed

The code is from Rui AFAIK, he might know what's going on here.
It looks like an ACPI function related to the thermal device and its trip point fails at some point and then the whole trip point gets invalidated.
Unfortunately your dmesg doesn't show such a case.

> Your BIOS does not provide ACPI _PSS objects in a way that Linux understands.
This means CPU freq does not work because your BIOS misses some tables. A BIOS upgrade might fix that.
Comment 3 Zhang Rui 2009-06-19 03:19:06 UTC
please attach the output of "grep . /proc/acpi/thermal_zone/*/*".

And there are several problems with the ACPI thermal control on your laptop:

1. > Critical temperature reached (72 C), shutting down.
the critical trip point on your laptop is 72 C, which is quite low...
2. the ACPI thermal active cooling doesn't work because it uses a fake ACPI fan.
3. the ACPI thermal passive cooling doesn't help much because processor frequency scaling is not available. 

So my questions are: can you hear the fan spinning when the computer gets hot?
Is the computer really hot when it shuts down?
Comment 4 Alois Schlögl 2009-06-19 08:40:30 UTC
bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13573
> 
> 
> Zhang Rui <rui.zhang@intel.com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |NEEDINFO
>           Component|BIOS                        |Power-Thermal
>          AssignedTo|acpi_bios@kernel-bugs.osdl. |acpi_power-thermal@kernel-b
>                    |org                         |ugs.osdl.org
>             Summary|ACPI: Unable to turn        |Critical temperature
>                    |cooling device 'on'         |reached (72 C), shutting
>                    |(Quadcore-AMD64, Ubuntu64)  |down - Quadcore-AMD64,
>                    |                            |Ubuntu64
> 
> 
> 
> 
> --- Comment #3 from Zhang Rui <rui.zhang@intel.com>  2009-06-19 03:19:06 ---
> please attach the output of "grep . /proc/acpi/thermal_zone/*/*".

/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   ok
/proc/acpi/thermal_zone/THRM/temperature:temperature:             68 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           70 C
/proc/acpi/thermal_zone/THRM/trip_points:active[0]:               68 C: devices= FAN


> 
> And there are several problems of the ACPI thermal control on your laptop,

It's not a laptop but a desktop machine.

> 
> 1. > Critical temperature reached (72 C), shutting down.
> the critical trip point on your laptop is 72C, which is quite low...
> 2. the ACPI thermal active cooling doesn't work because it uses a fake ACPI
> fan.
> 3. the ACPI thermal passive cooling doesn't help a lot well because the
> processor frequency change is not available. 
> 
> so my questions is, can you hear the fan spinning when the computer goes hot?

The fan is always spinning, even in idle mode.

> is the computer really hot when it shutdown?
> 

Each time I start the computing job, using 400% CPU for some time, the
machine shuts down. This has now happened 3 times out of 3 tries within 24 h,
so it's reproducible.
The computing job just executes a plain matlab script that does a lot
of floating point operations and occasionally saves intermediate
results to a file. There is no special hardware access involved.

I have no means of measuring the temperature. I opened the case and saw a
big fan on top of the CPU. Beside it there is a passive cooler, also
square, with a side length of about 60% of the CPU cooler. When I
touch it, it is hot (I do not know how hot). There is also a two-digit
7-segment red LED display. In the morning it showed about 44; after I
started the matlab job, it immediately started to rise. It seemed to
settle at about 68 (30-40 min after starting the job).

After the job had run for about 50 min, the machine shut down again. During
the shutdown process, the LED display dropped to 58, I think. When the
machine was up again, it showed about 60 and then dropped further to about
46 (I did not start the computing job). Therefore, I'm pretty sure it is
a thermal problem.


Alois



Comment 5 Thomas Renninger 2009-06-19 11:07:35 UTC
You should really first look out for the latest BIOS.
Have the Quad-cores been added later? The BIOS must know the CPU to be able to provide cpufreq info.
Comment 6 Zhang Rui 2009-06-22 02:23:12 UTC
please retry with CONFIG_THERMAL set if it's not in the previous test.
please set "/sys/class/thermal/thermal_zone*/passive" before the test.
Matthew's passive cooling may help in this case.
Comment 7 Zhang Rui 2009-06-22 02:25:02 UTC
(In reply to comment #6)
> please set "/sys/class/thermal/thermal_zone*/passive" before the test.

It accepts a decimal value in millidegrees celsius,
e.g. echo 62000 > /sys/class/thermal/thermal_zone0/passive.
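
The conversion from whole degrees Celsius to the millidegree value described above can be sketched as follows. This is only an illustration: the helper names `celsius_to_millidegrees` and `set_passive_trip` are hypothetical, and the write is attempted only when the kernel actually exposes a writable `passive` file in the given zone directory.

```shell
# Convert whole degrees Celsius to millidegrees Celsius (62 -> 62000).
celsius_to_millidegrees() {
    echo $(( $1 * 1000 ))
}

# Write a passive trip point to a thermal zone directory.
# $1 = zone directory (e.g. /sys/class/thermal/thermal_zone0)
# $2 = temperature in whole degrees Celsius
# Returns non-zero if the zone has no writable 'passive' attribute.
set_passive_trip() {
    zone_dir=$1
    if [ -w "$zone_dir/passive" ]; then
        celsius_to_millidegrees "$2" > "$zone_dir/passive"
    else
        echo "no writable 'passive' file in $zone_dir" >&2
        return 1
    fi
}
```

A call such as `set_passive_trip /sys/class/thermal/thermal_zone0 62` (as root) would then be equivalent to the echo command above, and it reports a clear error when the attribute is missing.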
Comment 8 Alois Schlögl 2009-06-22 07:58:23 UTC
(In reply to comment #5)
> You should really first look out for the latest BIOS.
> Have the Quad-cores been added later? The BIOS must know the CPU to be able to
> provide cpufreq info.


I bought the whole system from DiTech http://www.ditech.at/ . 
I plan to contact them, in order to resolve this issue. Is there anything else that should be taken into account ? 


Alois
Comment 9 Alois Schlögl 2009-06-22 07:58:55 UTC
(In reply to comment #8)
> (In reply to comment #5)
> > You should really first look out for the latest BIOS.
> > Have the Quad-cores been added later? The BIOS must know the CPU to be able to
> > provide cpufreq info.
> 
> 
> I bought the whole system from DiTech http://www.ditech.at/ . 
> I plan to contact them, in order to resolve this issue. Is there anything else
> that should be taken into account ? 
> 
> 
> Alois


I tried in user space and as root, but got this error:   


# echo 62000 > /sys/class/thermal/thermal_zone0/passive
bash: /sys/class/thermal/thermal_zone0/passive: No such file or directory


Because there is a symbolic link, 

# ls -al /sys/class/thermal/thermal_zone0
lrwxrwxrwx 1 root root 0 2009-06-22 09:07 /sys/class/thermal/thermal_zone0 -> ../../devices/virtual/thermal/thermal_zone0

I tried this, too: 

# echo 62000 > /sys/devices/virtual/thermal/thermal_zone0/passive
bash: /sys/devices/virtual/thermal/thermal_zone0/passive: No such file or directory


When I tried to create the file "passive" with an editor, the editor could not save it. The permissions of the respective directory entries (and all intermediate directories up to /) are set to 0755 with owner root:root . 

Here are the entries in .../thermal_zone0/*

# ls -al /sys/devices/virtual/thermal/thermal_zone0/
total 0
drwxr-xr-x 3 root root    0 2009-06-22 09:07 .
drwxr-xr-x 8 root root    0 2009-06-22 09:07 ..
lrwxrwxrwx 1 root root    0 2009-06-22 09:23 cdev0 -> ../cooling_device0
-r--r--r-- 1 root root 4096 2009-06-22 09:23 cdev0_trip_point
lrwxrwxrwx 1 root root    0 2009-06-22 09:23 device -> ../../../LNXSYSTM:00/LNXTHERM:00/LNXTHERM:01
-rw-r--r-- 1 root root 4096 2009-06-22 09:23 mode
drwxr-xr-x 2 root root    0 2009-06-22 09:23 power
lrwxrwxrwx 1 root root    0 2009-06-22 09:23 subsystem -> ../../../../class/thermal
-r--r--r-- 1 root root 4096 2009-06-22 09:23 temp
-r--r--r-- 1 root root 4096 2009-06-22 09:23 trip_point_0_temp
-r--r--r-- 1 root root 4096 2009-06-22 09:23 trip_point_0_type
-r--r--r-- 1 root root 4096 2009-06-22 09:23 trip_point_1_temp
-r--r--r-- 1 root root 4096 2009-06-22 09:23 trip_point_1_type
-r--r--r-- 1 root root 4096 2009-06-22 09:23 type
-rw-r--r-- 1 root root 4096 2009-06-22 09:07 uevent


So, I do not know how to set 
"/sys/class/thermal/thermal_zone*/passive"
Comment 10 Alois Schlögl 2009-06-23 09:16:58 UTC
(In reply to comment #5)
> You should really first look out for the latest BIOS.
> Have the Quad-cores been added later? The BIOS must know the CPU to be able to
> provide cpufreq info.

I bought the whole system from http://www.ditech.at/ . I contacted them, and they referred me to this page: 
http://www.sapphiretech.com/ge/support/drivers.php
The mainboard is a SAPPHIRE PI-AM2RS780G with SB700. 

I've downloaded 
http://us.sapphiretech.com/drivers/usb_format_20090619_8590.zip
http://us.sapphiretech.com/drivers/78SAPV09_20090522_4854.zip

but failed to boot from the USB-stick (the machine has no floppy). 


I'm also looking at coreboot, 
http://www.coreboot.org/pipermail/coreboot/2009-June/050038.html
but there is no readily available solution. 


Before I pursue this path further, I'm also wondering how likely it is that the BIOS is responsible for the problems described above. Are you sure that updating the BIOS will solve the issue?
Comment 11 Alois Schlögl 2009-07-01 07:23:57 UTC
I flashed the BIOS with the latest version. Unfortunately, it did not solve the problem: 
http://www.coreboot.org/pipermail/coreboot/2009-June/050347.html
Comment 12 ykzhao 2009-07-03 03:55:24 UTC
Hi
   It seems that this is a BIOS bug. On this box the fan is controlled by the BIOS. An ACPI FAN device does exist on this box, but it is bogus and does nothing. In addition, the _PSL object returns an incorrect passive cooling device. 
   >Name (_PSL, Package (0x01)
            {
                \_PR.CPU0 // the correct name should be \_PR.C000
            })
   
    The critical temperature threshold is obtained by evaluating the _CRT object. (From the info in comment #4 we know that the threshold is 70.)
    The current temperature is obtained by using the following object:
    >Method (_TMP, 0, NotSerialized)
            {
                And (SENF, 0x01, Local6)
                If (LEqual (Local6, 0x01))
                {
                    Return (RTMP ())
                }
                Else
                {
                    Return (0x0B86)
                }
            }
     This depends on the BIOS.

     From the above analysis it seems that this bug is caused by the broken BIOS, and it is best fixed by upgrading the BIOS. 

     If you can confirm that it is still safe when the temperature reaches the critical threshold, you can avoid the shutdown by adding the boot option "thermal.nocrt=1".
     Alternatively, your box could be added to the quirk table of machines on which the critical threshold is ignored.
   
Hi, Rui
    How about rejecting this bug, as it seems to be a BIOS bug? We can do nothing about it.
    Or we add the box to the quirk table that ignores the critical threshold.
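
As a side note on the values above: ACPI thermal objects such as _TMP and _CRT return temperatures in tenths of a kelvin, so the fallback constant 0x0B86 returned by _TMP is 2950 tenths of a kelvin, or about 21.8 C. A minimal sketch of that conversion, assuming 2732 as the usual integer approximation of 273.15 K (the helper name `decikelvin_to_celsius` is hypothetical):

```shell
# Convert an ACPI temperature (tenths of a kelvin, decimal or 0x-prefixed hex)
# to degrees Celsius with one decimal place.
decikelvin_to_celsius() {
    tenths=$(( $1 - 2732 ))          # tenths of a degree Celsius
    sign=""
    if [ "$tenths" -lt 0 ]; then     # handle sub-zero values explicitly
        sign="-"
        tenths=$(( 0 - tenths ))
    fi
    printf '%s%d.%d\n' "$sign" $(( tenths / 10 )) $(( tenths % 10 ))
}
```

For example, `decikelvin_to_celsius 0x0B86` prints 21.8, and a 70 C critical threshold corresponds to a _CRT value of 3432.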
Comment 13 Zhang Rui 2009-07-06 03:38:22 UTC
Created attachment 22225 [details]
customized DSDT

please apply this customized DSDT and attach the output of "grep . /proc/acpi/thermal_zone/*/*"
Comment 14 Alois Schlögl 2009-07-06 10:38:27 UTC
Thanks, after some hassles, I was able to install it. 

This is the result I get: 
$ grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   ok
/proc/acpi/thermal_zone/THRM/temperature:temperature:             48 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive:                 68 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003 
/proc/acpi/thermal_zone/THRM/trip_points:active[0]:               68 C: devices= FAN
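
The margin between the current temperature and the critical trip point can be read off output like the above with a small filter. This is a sketch only: the helper name `thermal_headroom` is hypothetical, and it assumes the line layout shown in this report (value and unit as the last two fields).

```shell
# Reads the output of  grep . /proc/acpi/thermal_zone/*/*  on stdin and prints
# (critical trip point - current temperature) in degrees Celsius.
# Exits non-zero if either value is missing from the input.
thermal_headroom() {
    awk '
        /temperature:temperature:/ { temp = $(NF-1) }   # e.g. "... 48 C"
        /trip_points:critical/     { crit = $(NF-1) }   # e.g. "... 70 C"
        END {
            if (temp == "" || crit == "") exit 1
            print crit - temp
        }
    '
}
```

With the values above (48 C current, 70 C critical), `grep . /proc/acpi/thermal_zone/*/* | thermal_headroom` would print 22.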
Comment 15 Alois Schlögl 2009-07-06 12:08:59 UTC
Next, I tested whether the original problem went away. Unfortunately, the machine shut down again about 40 min after starting the job. 

In order to investigate whether the DSDT was really loaded, I ran 
   dmesg |grep DSDT
and 
   grep DSDT /var/log/kern.log

The results are shown below. The custom DSDT was found for the first time at
 Jul  6 12:12:49 bcipc038 kernel: [    0.237073] ACPI: Found DSDT in DSDT.aml.

Shortly after I sent the previous report, the computational job was started (at about 12:40). At 13:21 I had to boot the machine again, because it had shut down. 

I also noticed that the fan became louder (it was speeding up) a few minutes after 12:40. 




$ dmesg |grep DSDT
[    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
[    0.008586] ACPI: Checking initramfs for custom DSDT
[    0.237143] ACPI: Found DSDT in DSDT.aml.
[    0.237147] ACPI: Override [DSDT-AWRDACPI], this is unsafe: tainting kernel
[    0.237152] ACPI: Table DSDT replaced by host OS
[    0.237155] ACPI: DSDT 00000000, 6CE1 (r1 RS780  AWRDACPI     1000 INTL 20081204)
[    0.237159] ACPI: DSDT override uses original SSDTs unless "acpi_no_auto_ssdt"
[    0.568497] ACPI: EC: Look up EC in DSDT


$ grep DSDT /var/log/kern.log
Jul  1 07:53:59 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  1 07:53:59 bcipc038 kernel: [    0.008561] ACPI: Checking initramfs for custom DSDT
Jul  1 07:53:59 bcipc038 kernel: [    0.572380] ACPI: EC: Look up EC in DSDT
Jul  2 08:30:41 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  2 08:30:41 bcipc038 kernel: [    0.008573] ACPI: Checking initramfs for custom DSDT
Jul  2 08:30:41 bcipc038 kernel: [    0.572366] ACPI: EC: Look up EC in DSDT
Jul  2 10:54:00 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  2 10:54:00 bcipc038 kernel: [    0.008565] ACPI: Checking initramfs for custom DSDT
Jul  2 10:54:00 bcipc038 kernel: [    0.572359] ACPI: EC: Look up EC in DSDT
Jul  2 11:08:34 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  2 11:08:34 bcipc038 kernel: [    0.008564] ACPI: Checking initramfs for custom DSDT
Jul  2 11:08:34 bcipc038 kernel: [    0.572483] ACPI: EC: Look up EC in DSDT
Jul  3 08:06:16 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  3 08:06:16 bcipc038 kernel: [    0.008562] ACPI: Checking initramfs for custom DSDT
Jul  3 08:06:16 bcipc038 kernel: [    0.572478] ACPI: EC: Look up EC in DSDT
Jul  3 18:21:09 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  3 18:21:09 bcipc038 kernel: [    0.008558] ACPI: Checking initramfs for custom DSDT
Jul  3 18:21:09 bcipc038 kernel: [    0.572476] ACPI: EC: Look up EC in DSDT
Jul  6 11:13:37 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  6 11:13:37 bcipc038 kernel: [    0.009197] ACPI: Checking initramfs for custom DSDT
Jul  6 11:13:37 bcipc038 kernel: [    0.577431] ACPI: EC: Look up EC in DSDT
Jul  6 11:17:55 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  6 11:17:55 bcipc038 kernel: [    0.008600] ACPI: Checking initramfs for custom DSDT
Jul  6 11:17:55 bcipc038 kernel: [    0.572472] ACPI: EC: Look up EC in DSDT
Jul  6 12:12:49 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  6 12:12:49 bcipc038 kernel: [    0.008572] ACPI: Checking initramfs for custom DSDT
Jul  6 12:12:49 bcipc038 kernel: [    0.237073] ACPI: Found DSDT in DSDT.aml.
Jul  6 12:12:49 bcipc038 kernel: [    0.237078] ACPI: Override [DSDT-AWRDACPI], this is unsafe: tainting kernel
Jul  6 12:12:49 bcipc038 kernel: [    0.237082] ACPI: Table DSDT replaced by host OS
Jul  6 12:12:49 bcipc038 kernel: [    0.237084] ACPI: DSDT 00000000, 6CE1 (r1 RS780  AWRDACPI     1000 INTL 20081204)
Jul  6 12:12:49 bcipc038 kernel: [    0.237088] ACPI: DSDT override uses original SSDTs unless "acpi_no_auto_ssdt"
Jul  6 12:12:49 bcipc038 kernel: [    0.568469] ACPI: EC: Look up EC in DSDT
Jul  6 13:21:31 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  6 13:21:31 bcipc038 kernel: [    0.008586] ACPI: Checking initramfs for custom DSDT
Jul  6 13:21:31 bcipc038 kernel: [    0.237143] ACPI: Found DSDT in DSDT.aml.
Jul  6 13:21:31 bcipc038 kernel: [    0.237147] ACPI: Override [DSDT-AWRDACPI], this is unsafe: tainting kernel
Jul  6 13:21:31 bcipc038 kernel: [    0.237152] ACPI: Table DSDT replaced by host OS
Jul  6 13:21:31 bcipc038 kernel: [    0.237155] ACPI: DSDT 00000000, 6CE1 (r1 RS780  AWRDACPI     1000 INTL 20081204)
Jul  6 13:21:31 bcipc038 kernel: [    0.237159] ACPI: DSDT override uses original SSDTs unless "acpi_no_auto_ssdt"
Jul  6 13:21:31 bcipc038 kernel: [    0.568497] ACPI: EC: Look up EC in DSDT
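
Since kern.log accumulates many boots, the check for the DSDT override can be narrowed to the most recent boot in the input. A minimal sketch, assuming the message texts shown in the logs above (the "Checking initramfs for custom DSDT" line is used as the per-boot marker, which may be specific to this distribution kernel; the helper name `dsdt_overridden` is hypothetical):

```shell
# Reads kernel log text on stdin. Succeeds (exit 0) if the last boot found in
# the input reports that the DSDT was replaced by a custom table.
dsdt_overridden() {
    awk '
        /ACPI: Checking initramfs for custom DSDT/ { found = 0 }  # a new boot starts
        /ACPI: Table DSDT replaced by host OS/     { found = 1 }  # override applied
        END { exit(found ? 0 : 1) }
    '
}
```

Typical use would be `dmesg | dsdt_overridden && echo "custom DSDT active"`, or the same with `/var/log/kern.log` as input.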
Comment 16 Zhang Rui 2009-07-07 05:49:49 UTC
if the ACPI thermal driver is built in, please boot with thermal.psv=60
if the ACPI thermal driver is compiled as a module, please load the thermal driver manually with module parameter psv=60, i.e. "modprobe thermal psv=60"

and see if it helps.
Comment 17 Alois Schlögl 2009-07-07 10:41:17 UTC
I tried this, but this attempt was not successful either. The machine shut down again. 

I looked into the problem a little more and found that throttling is not supported.   

/proc/acpi/processor/C001$ cat info
processor id:            1
acpi id:                 1
bus mastering control:   yes
power management:        no
throttling control:      no
limit interface:         no

powernowd reports this message: 

$ sudo powernowd 
powernowd: PowerNow Daemon v1.00, (c) 2003-2008 John Clemens
/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus: No such file or directory
powernowd: err=2
powernowd: Found 4 scalable units:  -- 1 'CPU' per scalable unit
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq: No such file or directory

PowerNowd encountered and error and could not start.
Please make sure that:
 - You are running a v2.6.7 kernel or later
 - That you have sysfs mounted /sys
 - That you have the core cpufreq and cpufreq-userspace
   modules loaded into your kernel
 - That you have the cpufreq driver for your cpu loaded,
   (for example: powernow-k7), and that it works. Check
   'dmesg' for errors.
If all of the above are true, and you still have problems,
please email the author: clemej@alum.rpi.edu


Maybe the BIOS error is preventing powernow-k8 from starting. 
 
The DSDT workaround does not prevent this error message in dmesg:

[    4.114549] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.
[    4.114612] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.
[    4.114671] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.
[    4.114730] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI _PSS objects in a way that Linux understands. Please report this to the Linux ACPI maintainers and complain to your BIOS vendor.
Comment 18 Thomas Renninger 2009-07-07 15:14:18 UTC
Alois, you should really try to upgrade the BIOS and come back if you still have problems. You might also want to go through power/thermal related BIOS settings *after* upgrading.

> Maybe the bios error is preventing powernow-k8 from starting. 
Look at the second part of comment #2.
Comment 19 Zhang Rui 2009-07-08 00:45:02 UTC
please reload the thermal driver with psv=60 crt=80
and re-attach the output of "grep . /proc/acpi/thermal_zone/*/*"
remember all these tests are done with the customized DSDT. :)
Comment 20 Alois Schlögl 2009-07-10 08:56:20 UTC
First of all, let me thank you for your effort. You are much more responsive than the vendors. I appreciate this very much. 

re #18: 
I've updated the bios with the latest version provided
http://us.sapphiretech.com/drivers/78SAPV09_20090522_4854.zip

The latest version still has the problem reported in kern.log
> Your BIOS does not provide ACPI _PSS objects in a way that Linux understands.

I've now contacted the mainboard manufacturer  
http://www1.sapphiretech.com/us/support/respondtoticket.php?t=1S53102G37668S53102G1-37668S53102G8279590S53102G5
http://tinyurl.com/msw9sq
It seems they are aware of the issue but need to work with their vendor (Phoenix Technology? ). 

I went through the Bios settings. This was the setting I found: 
Power Management Setup
	ACPI Suspend Type	S3(STR)
	C2 Disable/Enable	Disabled
	Power Management Option: User Define
	... 

PC Health Status
	Show PC Health In Post:	Enabled
	Shutdown Temperature:	70°C/158°F
	SMART FAN Configuration
		CPUFAN Smart Mode:	Enabled
		CPUFAN Full-speed Temp:	60
		CPUFAN Idle Temp:	50

Thermal Throttling Option
	CPU Thermal Throttling: Enabled
	CPU Throttling Temp:	70°C
	CPU Throttling Duty:	75.0%
	
Because both the Shutdown Temp and the Throttling Temp were set to 70, I changed the Throttling Temp to 68°C. 

The test again caused a shutdown. 

The Shutdown Temp only offered the options 60°C/65°C/70°C/disable. So I disabled the Shutdown Temperature, and the result was strange: the temp was fixed at 40 C. It seems that this switches off the temperature sensor. I reverted this change and set the Shutdown Temp back to 70°C.

re #19: 

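I converted the DSDT.hex file into DSDT.aml with this short C program:  
        #include <stdio.h>
        #include "DUMP.hex"   /* provides the AmlCode byte array */
        int main(void) {
            /* "wb": the AML image is binary data */
            FILE *fid = fopen("DSDT.aml", "wb");
            if (fid == NULL) { perror("DSDT.aml"); return 1; }
            /* 0x6ce1 is the length of this particular DSDT */
            size_t n = fwrite(AmlCode, 1, 0x6ce1, fid);
            fclose(fid);
            return n == 0x6ce1 ? 0 : 1;
        }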

and followed the instruction along the lines of  
http://ubuntuforums.org/showthread.php?t=1036051

        sudo cp DSDT.aml /etc/initramfs-tools/DSDT.aml
        sudo update-initramfs -u -k 2.6.28-13-generic

My understanding is that it installs the DSDT permanently. 

I confirmed this by looking at the kern.log with 
    grep DSDT /var/log/kern.log

Before installing the DSDT, reboot showed this, 

Jul  6 11:17:55 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  6 11:17:55 bcipc038 kernel: [    0.008600] ACPI: Checking initramfs for custom DSDT
Jul  6 11:17:55 bcipc038 kernel: [    0.572472] ACPI: EC: Look up EC in DSDT

Afterwards, kern.log has these messages. 

Jul  6 12:12:49 bcipc038 kernel: [    0.000000] ACPI: DSDT CFFE3200, 73F3 (r1 RS780  AWRDACPI     1000 MSFT  3000000)
Jul  6 12:12:49 bcipc038 kernel: [    0.008572] ACPI: Checking initramfs for custom DSDT
Jul  6 12:12:49 bcipc038 kernel: [    0.237073] ACPI: Found DSDT in DSDT.aml.
Jul  6 12:12:49 bcipc038 kernel: [    0.237078] ACPI: Override [DSDT-AWRDACPI], this is unsafe: tainting kernel
Jul  6 12:12:49 bcipc038 kernel: [    0.237082] ACPI: Table DSDT replaced by host OS
Jul  6 12:12:49 bcipc038 kernel: [    0.237084] ACPI: DSDT 00000000, 6CE1 (r1 RS780  AWRDACPI     1000 INTL 20081204)
Jul  6 12:12:49 bcipc038 kernel: [    0.237088] ACPI: DSDT override uses original SSDTs unless "acpi_no_auto_ssdt"
Jul  6 12:12:49 bcipc038 kernel: [    0.568469] ACPI: EC: Look up EC in DSDT

I assume this shows that the DSDT is loaded. Tell me if I'm wrong, or if there is a better method to check whether DSDT is loaded. 


As suggested, I changed the boot options to  psv=60 crt=80. I confirmed this with this command 
  $ grep Command /var/log/kern.log
Jul  8 15:58:24 bcipc038 kernel: [    0.000000] Command line: root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro quiet splash psv=60 crt=80


Here is the output of 
$  grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   ok
/proc/acpi/thermal_zone/THRM/temperature:temperature:             66 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive:                 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003 
/proc/acpi/thermal_zone/THRM/trip_points:active[0]:               68 C: devices= FAN 


It survived the test for about 5 hours, then it shut down again with the same error as always: 
Jul  8 21:03:30 bcipc038 kernel: [18341.205549] ACPI: Critical trip point
Jul  8 21:03:30 bcipc038 kernel: [18341.205577] Critical temperature reached (71 C), shutting down.
Jul  8 21:03:30 bcipc038 kernel: [18341.205603] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul  8 21:03:31 bcipc038 kernel: [18342.604759] [drm] Resetting GPU
Jul  8 21:03:31 bcipc038 kernel: [18342.747361] mtrr: MTRR 5 not used
Jul  8 21:03:36 bcipc038 kernel: [18347.204187] Critical temperature reached (57 C), shutting down.

I'm wondering whether there is a way to test whether throttling was applied or not, or whether this is just a variation due to different environmental changes (the weather was cooler in the last few days.)
Comment 21 Zhang Rui 2009-07-13 01:05:16 UTC
If the thermal driver is built in, you should use the boot options thermal.psv=60 and thermal.crt=80
If the thermal driver is a module, you should load it manually using "modprobe thermal psv=60 crt=80"
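
Which of the two cases applies can be read from the kernel configuration. The following is a sketch under assumptions: it takes CONFIG_ACPI_THERMAL to be the option that controls the ACPI thermal driver, and the helper name `thermal_param_hint` is hypothetical.

```shell
# Reads a kernel .config on stdin and prints how to pass the psv/crt overrides
# to the ACPI thermal driver.
# $1 = passive trip point (C), $2 = critical trip point (C)
thermal_param_hint() {
    # Extract "y" (built in) or "m" (module); empty if the option is not set.
    state=$(sed -n 's/^CONFIG_ACPI_THERMAL=\([ym]\)$/\1/p')
    case "$state" in
        y) echo "boot with: thermal.psv=$1 thermal.crt=$2" ;;
        m) echo "run: modprobe thermal psv=$1 crt=$2" ;;
        *) echo "ACPI thermal driver not enabled in this config" ;;
    esac
}
```

On distributions that install the kernel configuration under /boot (an assumption about the local setup), it could be used as `thermal_param_hint 60 80 < /boot/config-$(uname -r)`.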
Comment 22 Alois Schlögl 2009-07-14 08:26:25 UTC
Driver module does not work: 
$ sudo modprobe  thermal psv=60 crt=80
FATAL: Module thermal not found.

So I added a boot option in /boot/grub/menu.lst

title		Ubuntu 9.04, kernel 2.6.28-13-with_thermal_fix
uuid		4a0d6592-2929-4050-a603-87e463ceed0e
kernel		/boot/vmlinuz-2.6.28-13-generic root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro quiet splash thermal.psv=60 thermal.crt=80
initrd		/boot/initrd.img-2.6.28-13-generic
quiet

and rebooted. 

$ grep Command /var/log/kern.log
Jul 13 09:32:24 bcipc038 kernel: [    0.000000] Command line: root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro quiet splash thermal.psv=60 thermal.crt=80

$ grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   passive 
/proc/acpi/thermal_zone/THRM/temperature:temperature:             67 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           80 C
/proc/acpi/thermal_zone/THRM/trip_points:passive:                 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003 
/proc/acpi/thermal_zone/THRM/trip_points:active[0]:               68 C: devices= FAN 


I have now run the test for over 24 h without any shutdown. It seems the problem is gone. Changing the critical temp to 80 C did "fix" it. But is this really a fix, or just a hack to avoid a shutdown? 


I get frequent (about 2 per minute) warnings  

$ grep ACPI /var/log/kern.log
... 
Jul 14 10:08:32 bcipc038 kernel: [88685.696666] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:08:38 bcipc038 kernel: [88691.696673] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:08:50 bcipc038 kernel: [88703.696664] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:09:50 bcipc038 kernel: [88763.696667] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:09:56 bcipc038 kernel: [88769.696667] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:10:50 bcipc038 kernel: [88823.696665] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:10:56 bcipc038 kernel: [88829.696664] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 10:11:32 bcipc038 kernel: [88865.696665] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'

and 

 $ cat /proc/acpi/processor/*/info
processor id:            0
acpi id:                 0
bus mastering control:   yes
power management:        no
throttling control:      no
limit interface:         no
processor id:            1
acpi id:                 1
bus mastering control:   yes
power management:        no
throttling control:      no
limit interface:         no
processor id:            2
acpi id:                 2
bus mastering control:   yes
power management:        no
throttling control:      no
limit interface:         no
processor id:            3
acpi id:                 3
bus mastering control:   yes
power management:        no
throttling control:      no
limit interface:         no


No throttling is applied. I guess I need to find a way to enable throttling. Do you have any suggestions?
Comment 23 Thomas Renninger 2009-07-14 10:29:27 UTC
> I guess I need to find a way to enable throttling
Hmm, it would be better to get powernow enabled (not sure, but I could imagine this machine does not support throttling at all). It seems the newest BIOS still does not export the frequency tables:
[    4.114549] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI
_PSS objects in a way that Linux understands. Please report this to the Linux
ACPI maintainers and complain to your BIOS vendor.

Can you attach the acpidump taken after the BIOS update? Then I can have a look at whether it's really the BIOS' fault.

> Jul 14 10:08:32 bcipc038 kernel: [88685.696666] ACPI: Unable to turn cooling
> device [ffff88012f815a60] 'on'
Interesting. I haven't looked at the details, but this sounds like a critical bug, BIOS or kernel.
Comment 24 Alois Schlögl 2009-07-14 11:25:41 UTC
Created attachment 22338 [details]
new ACPIDUMP after Bios update (re #23)
Comment 25 Thomas Renninger 2009-07-14 11:57:56 UTC
Strange, doing:
acpixtract -a acpidump
should create the extracted tables as files, but I get dozens of corrupt files/tables:
10.dat    ??1.dat       2.dat       31.dat      45.dat      61.dat      82.dat  ?apa.dat     ??.dat  ?d?F.dat   ??IN.dat   ?P6E.dat   ?prx.dat   ??SC1.dat
    11.dat    ??1.dat      ?2.dat ... (and many more)

Hm, the file looks like a DSDT, but doing:
iasl -d acpidump
results in:
...
**** ACPI table terminates in the middle of a data structure!

Could you try again and make sure you use the latest acpidump version, it's included here:
http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/pmtools-20071116.tar.bz2
Just doing:
acpidump >/tmp/acpidump
and then send /tmp/acpidump is what is needed.
Comment 26 Alois Schlögl 2009-07-14 12:08:04 UTC
Created attachment 22339 [details]
new ACPIDUMP after Bios update (re #25)
Comment 27 Alois Schlögl 2009-07-14 12:51:27 UTC
I think I have a candidate for the problem. The BIOS setting for AMD Cool&Quiet control was disabled.

Setting it to AUTO does not cause this message anymore: 

 [    4.114549] [Firmware Bug]: powernow-k8: Your BIOS does not provide ACPI
_PSS objects in a way that Linux understands. Please report this to the Linux
ACPI maintainers and complain to your BIOS vendor.

And before I saw this, 

$ cpufreq-info
cpufrequtils 004: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to cpufreq@lists.linux.org.uk, please.
analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
analyzing CPU 1:
  no or unknown cpufreq driver is active on this CPU
analyzing CPU 2:
  no or unknown cpufreq driver is active on this CPU
analyzing CPU 3:
  no or unknown cpufreq driver is active on this CPU

After changing the setting, it's this:

$ cpufreq-info
cpufrequtils 004: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to cpufreq@lists.linux.org.uk, please.
analyzing CPU 0:
  driver: powernow-k8
  CPUs which need to switch frequency at the same time: 0
  hardware limits: 1.20 GHz - 2.40 GHz
  available frequency steps: 2.40 GHz, 1.20 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance
  current policy: frequency should be within 1.20 GHz and 2.40 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 2.40 GHz.
  cpufreq stats: 2.40 GHz:74.96%, 1.20 GHz:25.04%  (149)
analyzing CPU 1:
  driver: powernow-k8
... 

I still need to revert the changes and rerun the test, but I'm telling you now so that you do not waste more of your time on this. Sorry.
Comment 28 Alois Schlögl 2009-07-14 14:42:11 UTC
No, this change in the bios setting is not sufficient to fix the problem. 
I booted without the options thermal.psv and thermal.crt (DSDT was still in place), and after running the test for about 35 min, the machine shut down again with this error message.

Jul 14 15:29:29 bcipc038 kernel: [ 2227.313508] ACPI: Critical trip point
Jul 14 15:29:29 bcipc038 kernel: [ 2227.313508] Critical temperature reached (76 C), shutting down.
Jul 14 15:29:29 bcipc038 kernel: [ 2227.314940] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 15:29:30 bcipc038 kernel: [ 2228.296284] [drm] Resetting GPU
Jul 14 15:29:30 bcipc038 kernel: [ 2228.504024] mtrr: MTRR 5 not used
Jul 14 15:29:35 bcipc038 kernel: [ 2233.314108] Critical temperature reached (57 C), shutting down.


The 76 C reading is strange, because the temperature display (in the gnome panel) was mostly in the range of 67-68 C.

Next I tried with the boot option thermal.psv=60, again a shutdown after about 40 min. 

Jul 14 16:19:21 bcipc038 kernel: [ 2416.905003] ACPI: Critical trip point
Jul 14 16:19:21 bcipc038 kernel: [ 2416.905502] Critical temperature reached (71 C), shutting down.
Jul 14 16:19:21 bcipc038 kernel: [ 2416.907031] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 16:19:22 bcipc038 kernel: [ 2417.946331] [drm] Resetting GPU
Jul 14 16:19:22 bcipc038 kernel: [ 2418.125020] mtrr: MTRR 5 not used
Jul 14 16:19:27 bcipc038 kernel: [ 2422.904779] Critical temperature reached (56 C), shutting down.

I could try again with thermal.psv=60 thermal.crt=80, but earlier there was no shutdown, so I guess this might work again. The frequent messages (#22)
      ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
might trigger the shutdown, setting crt=80 seems to prevent the shutdown. 

Even though powernowd is running, I get this result:

$ cat /proc/acpi/processor/*/info
processor id:            0
acpi id:                 0
bus mastering control:   yes
power management:        no
throttling control:      no
limit interface:         no
processor id:            1
acpi id:                 1
bus mastering control:   yes
power management:        no
throttling control:      no
limit interface:         no
processor id:            2
acpi id:                 2
bus mastering control:   yes
power management:        no
throttling control:      no
limit interface:         no
processor id:            3
acpi id:                 3
bus mastering control:   yes
power management:        no
throttling control:      no
limit interface:         no

No throttling. Any idea ?
Comment 29 Thomas Renninger 2009-07-14 15:27:13 UTC
> The frequent messages (#22)
>      ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
> might trigger the shutdown, setting crt=80 seems to prevent the shutdown.
Possibly indirectly. The shutdown is triggered as soon as:
/proc/acpi/thermal_zone/THRM/temperature:temperature
exceeds the critical temperature.

> The (76 C) are strange, because the temperature display (in the gnome panel)
> was mostly in the range of 67-68.
Hmm, you could verify the acpi temperature readings using hwmon. This is included in sensors and libsensors packages on SUSE.
You first run sensors-detect to identify the right kernel driver you need. Then load it and run sensors. This one is directly accessing the HW temperature monitor. It could happen that the driver conflicts with the ACPI thermal driver, therefore you could read out acpi temp, unload the thermal driver, load the other one and run sensors all in a row.

Hmm, I didn't look at the DSDT/ACPI code yet, but I have the feeling it would be better if ACPI kept its fingers away from thermal management entirely on this machine?
Comment 30 Zhang Rui 2009-07-15 02:48:56 UTC
> ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'

this is not a problem in the latest kernel.
it would be great if you can give it a try.

> Jul 14 15:29:35 bcipc038 kernel: [ 2233.314108] Critical temperature reached
> (57 C), shutting down.

critical shutdown at 57 C? this is bad. 
please run "grep . /proc/acpi/thermal_zone/*/*" after changing the BIOS option and adding boot option "thermal.psv=60"
Comment 31 Alois Schlögl 2009-07-15 07:06:21 UTC
(In reply to comment #29)
I ran sensors-detect, and after pressing Ok several times, I got this:

... 
Some south bridges, CPUs or memory controllers may also contain
embedded sensors. Do you want to scan for them? (YES/no): 
Silicon Integrated Systems SIS5595...                       No
VIA VT82C686 Integrated Sensors...                          No
VIA VT8231 Integrated Sensors...                            No
AMD K8 thermal sensors...                                   No
AMD K10 thermal sensors...                                  Success!
    (driver `to-be-written')
Intel Core family thermal sensor...                         No
Intel AMB FB-DIMM thermal sensor...                         No

Now follows a summary of the probes I have just done.
Just press ENTER to continue: 

Driver `f71882fg' (should be inserted):
  Detects correctly:
  * ISA bus, address 0x225
    Chip `Fintek F71882FG/F71883FG Super IO Sensors' (confidence: 9)

Driver `to-be-written' (should be inserted):
  Detects correctly:
  * Chip `AMD K10 thermal sensors' (confidence: 9)

I will now generate the commands needed to load the required modules.
Just press ENTER to continue: 

To load everything that is needed, add this to /etc/modules:

#----cut here----
# Chip drivers
f71882fg
# no driver for AMD K10 thermal sensors yet
#----cut here----

Any idea where to get the driver for "AMD K10 thermal sensors"?

$ sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:       +43.0°C  (crit = +70.0°C)    

$ sensors -v
sensors version 3.0.2 with libsensors version 3.0.2

Perhaps, I should look for a newer version of libsensors. 


Zhang, what is your opinion on removing the DSDT hack ?


(In reply to comment #30)
> > ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
> 
> this is not a problem in the latest kernel.
> it would be great if you can give it a try.

Ok, this will take some time. Which version 2.6.30.1 or 2.6.31-rc3 ? 

> 
> > Jul 14 15:29:35 bcipc038 kernel: [ 2233.314108] Critical temperature reached
> > (57 C), shutting down.
> 
> critical shutdown at 57 C? this is bad. 

Please note that 6 s earlier the temp was 76 C (maybe some random fluctuations exceeded the crt=80)

Jul 14 15:29:29 bcipc038 kernel: [ 2227.313508] Critical temperature reached
(76 C), shutting down.

Isn't it possible that this triggered the shutdown, processes were stopped, and the CPU cooled down within these 6 seconds?


Same pattern here: 

.. 
Jul 14 16:19:21 bcipc038 kernel: [ 2416.905502] Critical temperature reached
(71 C), shutting down.
.. 
Jul 14 16:19:27 bcipc038 kernel: [ 2422.904779] Critical temperature reached
(56 C), shutting down.
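
To check the timing between the two messages, a small sketch like the following can pull the kernel uptime stamp and the temperature out of the log lines. The names PATTERN and critical_events are only illustrative, and the pattern assumes each message sits on one line, as in kern.log itself (the lines quoted above are wrapped by the bug tracker).

```python
import re

# Uptime stamp in brackets, then the message with the temperature in parentheses.
PATTERN = re.compile(r"\[\s*(\d+\.\d+)\] Critical temperature reached \((\d+) C\)")

def critical_events(log_text):
    """Return [(uptime_seconds, temp_C), ...] for each critical-temperature message."""
    return [(float(t), int(c)) for t, c in PATTERN.findall(log_text)]

# The two messages from the first shutdown, copied from kern.log.
LOG = """\
Jul 14 15:29:29 bcipc038 kernel: [ 2227.313508] Critical temperature reached (76 C), shutting down.
Jul 14 15:29:35 bcipc038 kernel: [ 2233.314108] Critical temperature reached (57 C), shutting down.
"""

events = critical_events(LOG)
(t0, c0), (t1, c1) = events
print(f"{c0} C -> {c1} C in {t1 - t0:.1f} s")
```

This only measures the gap between the two messages; it does not tell whether the second message is a new trigger or a repeat while the shutdown is already in progress.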


> please run "grep . /proc/acpi/thermal_zone/*/*" after you changing the BIOS
> option and adding boot option "thermal.psv=60"


$ grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   ok
/proc/acpi/thermal_zone/THRM/temperature:temperature:             46 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive:                 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003 
/proc/acpi/thermal_zone/THRM/trip_points:active[0]:               68 C: devices= FAN
Comment 32 Thomas Renninger 2009-07-15 10:13:33 UTC
Some info and questions:

As _PSL is wrongly pointing to CPU0, which must be C000, I wonder why we do not see
an "Invalid passive threshold\n", the acpi_evaluate_reference should fail?:
In drivers/thermal/thermal.xyx:
	status = acpi_evaluate_reference(tz->device->handle, "_PSL",
							NULL, &devices);
		if (ACPI_FAILURE(status)) {
			printk(KERN_WARNING PREFIX
				"Invalid passive threshold\n");
			tz->trips.passive.flags.valid = 0;
		}

Does this come from the modified DSDT?
The message looks wrong or too general (from 2.6.30), shouldn't it be something like:
"Couldn't reference passive cooling device(s)\n");
What exactly happens if it cannot be referenced? I'd expected above message and the passive trip point is marked invalid.
It's not the first time I see this (wrong CPU as passive device ref), it looks like a generic AMD BIOS bug after upgrading to quad-core capable BIOSes.

IMO we should try to fetch all processor objects as passive cooling devices if something like that happens.

It could also happen that the temp monitor device is accessed by another driver,
this is the OpRegion used:
OperationRegion (IP, SystemIO, 0x0225, 0x02)
In this case it could happen that ACPI reads totally wrong (normally even more wrong than 76C, e.g. 3000C) temperature values and would shutdown if a race with another driver happens.

Hmm, we see a conflict, not with the ACPI temp device, but still...:
ACPI: I/O resource piix4_smbus [0xb00-0xb07] conflicts with ACPI region SOR2 [0xb00-0xb0f]
Hmm, 
        OperationRegion (SOR2, SystemIO, SBA2, 0x10)

This one (and the other OpRegion SOR1, defining the same SMBus IOs) seems to only be used by the WMI device. But it may get used (from dmesg):
ACPI: WMI: Mapper loaded

Maybe now it's time to add Jean...

Alois:  The attachment: 
"zip file containing kern.log, acpidump.dump dmesg.dump and dmidecode.dump"
only contains plain text dmesg. Can you attach dmidecode, please. I am curious what kind of machine this is and depending on the OEM, also like to poke about the wrong passive cooling device reference...
Comment 33 Alois Schlögl 2009-07-15 10:39:07 UTC
Created attachment 22352 [details]
dump of dmidecode
Comment 34 Thomas Renninger 2009-07-15 12:18:33 UTC
Is this some kind of devel machine?:
System Information
        Manufacturer: Unknow
        Product Name: Unknow
        Version: Unknow
What is the vendor and model of it?
Comment 35 Zhang Rui 2009-07-16 05:23:03 UTC
(In reply to comment #32)
> Some info and questions:
> 
> As _PSL is wrong pointing to CPU0, which must be C000, I wonder why we do not
> see
> an "Invalid passive threshold\n", the acpi_evaluate_reference should fail?:
> In drivers/thermal/thermal.xyx:
>     status = acpi_evaluate_reference(tz->device->handle, "_PSL",
>                             NULL, &devices);
>         if (ACPI_FAILURE(status)) {
>             printk(KERN_WARNING PREFIX
>                 "Invalid passive threshold\n");
>             tz->trips.passive.flags.valid = 0;
>         }
> 
> Does this come from the modified DSDT?

right. I rename CPU0 to C000 in the customized DSDT.
please refer to comment #14, you can see C000 is used instead of CPU0

> The message looks wrong or too general (from 2.6.30), shouldn't it be something
> like:
> "Couldn't reference passive cooling device(s)\n");
> What exactly happens if it cannot be referenced? I'd expected above message and
> the passive trip point is marked invalid.

that's right. please refer to the output of "grep . /proc/acpi/thermal_zone*/*" in comment #4

> I am curious
> what kind of machine this is and depending on the OEM, also like to poke about
> the wrong passive cooling device reference...

how do we work around the wrong passive cooling device reference?
assume _PSL returns all the processors?
Comment 36 Alois Schlögl 2009-07-16 06:11:59 UTC
(In reply to comment #34)
> Is this some kind of devel machine?:
> System Information
>         Manufacturer: Unknow
>         Product Name: Unknow
>         Version: Unknow
> What is the vendor and model of it?

I bought the system from www.ditech.at with this specification 

PCDM4H3 PC-System - dimotion Mini M4H3
AMD® Phenom™ X4 9750, 2,4GHz, Quad-Core
4 GB DDR2-RAM, 640 GB HDD
Cardreader, DVD-Writer, Sound, 1 GBit LAN
ATI® Radeon™ HD4850, 512MB

I did not know how lousy they are.
Comment 37 Thomas Renninger 2009-07-16 23:24:12 UTC
Created attachment 22382 [details]
Workaround the invalid passive cooling device reference

Alois, can you try out this patch without the modified DSDT and see whether you get a valid passive trip point.
Also echo a 1 into the cooling_mode file. Then the active and passive trip points should switch and you "should" get a totally passively cooled system (cat the trip points afterwards). It's a rarely implemented, but really "cool" feature. Unfortunately the fan does not seem to be controlled correctly on your system, but it's worth a test.
Comment 38 Alois Schlögl 2009-07-20 12:30:43 UTC
re #36: The vendor (ditech) did offer another mainboard. Although they have no solution, they are interested in fixing the problem.


re #37:

I removed the DSDT, and tried to compile the kernel. First, I was following these instructions http://www.cyberciti.biz/tips/compiling-linux-kernel-26.html
but was not successful. Probably because some restricted modules were not included, or because the default settings were different for ubuntu, or because I was using 2.6.30.1, I do not know.

So, I followed these instructions:
- http://www.howtoforge.com/kernel_compilation_ubuntu, 
- installed linux-restricted-modules,  
- downloaded the kernel sources 2.6.30.2, 
- applied your patch, 
- and used .config from the previous kernel version. 


I rebooted without the option thermal.psv=60, and get this:

# grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   ok
/proc/acpi/thermal_zone/THRM/temperature:temperature:             45 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive:                 68 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003 
/proc/acpi/thermal_zone/THRM/trip_points:active[0]:               68 C: devices= FAN 

# echo 1 > /proc/acpi/thermal_zone/THRM/cooling_mode 

# grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   ok
/proc/acpi/thermal_zone/THRM/temperature:temperature:             44 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive:                 68 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003 
/proc/acpi/thermal_zone/THRM/trip_points:active[0]:               68 C: devices= FAN 


It seems echoing a 1 does not make a difference, the file /proc/acpi/thermal_zone/THRM/cooling_mode is unchanged. 


When I booted with thermal.psv=60, I get this  

# grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   ok
/proc/acpi/thermal_zone/THRM/temperature:temperature:             63 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive:                 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003 
/proc/acpi/thermal_zone/THRM/trip_points:active[0]:               68 C: devices= FAN 

# echo 1 > /proc/acpi/thermal_zone/THRM/cooling_mode

root@bcipc038:/home/schloegl# grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   ok
/proc/acpi/thermal_zone/THRM/temperature:temperature:             64 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           70 C
/proc/acpi/thermal_zone/THRM/trip_points:passive:                 60 C: tc1=4 tc2=3 tsp=60 devices=C000 C001 C002 C003 
/proc/acpi/thermal_zone/THRM/trip_points:active[0]:               68 C: devices= FAN 

Perhaps active and passive trip_points are both 68.



Again, no difference when echoing a 1 into cooling_mode. Did I miss anything ? 

I started the test; it ran for about 70 min without a problem. I'll continue the test.
Comment 39 Thomas Renninger 2009-07-20 12:54:41 UTC
Thanks for the test!
That means my workaround seems to work and I'll send it mainline, right?

> It seems echoing a 1 does not make a difference, the file
> /proc/acpi/thermal_zone/THRM/cooling_mode is unchanged.

Yes, several things why nothing changed:
  - The output of the cooling_mode file is always the same:
    In drivers/acpi/thermal line 1091:
    seq_puts(seq, "0 - Active; 1 - Passive\n");
    It should show the current state instead. I'll try to fix this, no testing needed.

  - The BIOS is reading the active/passive trip point values from IO, therefore
    I couldn't see the real value. As they are both the same, you do not see
    that they flipped. Nothing to worry about, but a stupid implementation by
    BIOS developers.

  - when you provide thermal.psv=60 you override the BIOS provided passive trip
    point which won't get used with cooling_mode settings.

-> everything is fine (besides the fact that the BIOS implementation of the cooling mode is useless, but that's not the kernel's fault).

> Although they have no solution, they are interested in fixing the problem.
Most important, they should fix up the wrong reference in the ACPI _PSL function of the thermal device to match the CPUs. This would look like this:

--- DSDT.dsl.orig       2009-07-20 14:50:08.447874000 +0200
+++ DSDT.dsl    2009-07-20 14:50:44.457613000 +0200
@@ -3192,7 +3192,10 @@

-            Name (_PSL, Package (0x01)
+            Name (_PSL, Package (0x04)
             {
-                \_PR.CPU0
+                \_PR.C000,
+                \_PR.C001,
+                \_PR.C002,
+                \_PR.C003
             })
             Name (_TSP, 0x3C)
             Name (_TC1, 0x04)

Have I overlooked anything I should answer?
If the patch gets accepted, can the bug be closed?
Comment 40 Thomas Renninger 2009-07-20 13:01:28 UTC
> Although they have no solution, they are interested in fixing the problem.
If they want to do it right, they should take care that the fan can be controlled via ACPI.
They should then define the active trip point below the passive one, e.g. to 55C.
Then the fan kicks in at 55C. In extreme circumstances, the temperature might rise up to 68C, then the passive trip point kicks in and you get a slower system, but at least no critical shut down.
If you then switch the cooling mode to passive, the values for active and passive get switched. You then have passive cooling kicking in at 55C, the frequency gets reduced and you get a slower, but absolutely quiet system.
-> that's the idea...
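
As a rough illustration of that idea, here is a minimal sketch of which cooling action would apply at a given temperature in each mode. It is not the kernel's algorithm (the real passive cooling code also uses _TC1/_TC2/_TSP); the function name cooling_actions and its parameters are made up for this example, with 55C and 68C taken from the text above and 70C from the critical trip point reported earlier in this bug.

```python
def cooling_actions(temp, low_trip=55, high_trip=68, critical=70, prefer_passive=False):
    """Illustrative only: list the cooling actions that apply at `temp` (deg C).

    In active mode the fan uses the low trip point and throttling the high one;
    in passive mode the two trip points are swapped.
    """
    if temp >= critical:
        return ["shutdown"]
    if prefer_passive:
        fan_trip, throttle_trip = high_trip, low_trip
    else:
        fan_trip, throttle_trip = low_trip, high_trip
    actions = []
    if temp >= fan_trip:
        actions.append("fan")
    if temp >= throttle_trip:
        actions.append("throttle")
    return actions

for mode in (False, True):
    for temp in (50, 60, 69, 72):
        print(f"prefer_passive={mode} temp={temp}: {cooling_actions(temp, prefer_passive=mode)}")
```

With the trip points ordered like this, throttling acts as a second line of defence before the critical shutdown, which is what is missing when active and passive are both at 68C.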
Comment 41 Alois Schlögl 2009-07-20 13:07:19 UTC
Created attachment 22413 [details]
.config used to compile the kernel 2.6.30.2 with acpi patch
Comment 42 Alois Schlögl 2009-07-20 16:02:24 UTC
The machine shut down again after almost 4 hours running the test. 
The kernel was booted with thermal.psv=60 

Here is the snippet of /var/log/kern.log 

...
Jul 20 13:40:17 bcipc038 kernel: [    0.000000] Linux version 2.6.30.2-some-string-here (root@bcipc038) (gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) ) #2 SMP Mon Jul 20 11:14:08 CEST 2009
Jul 20 13:40:17 bcipc038 kernel: [    0.000000] Command line: root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro quiet splash  thermal.psv=60
...
Jul 20 17:38:47 bcipc038 kernel: [14343.489356] Critical temperature reached (71 C), shutting down.
Jul 20 17:38:49 bcipc038 kernel: [14344.607387] [drm] Resetting GPU
Jul 20 17:38:49 bcipc038 kernel: [14344.774229] mtrr: MTRR 5 not used
...
Comment 43 Thomas Renninger 2009-07-20 16:10:03 UTC
Did you double check whether the frequency got reduced as soon as temp is above 60C?:
cat /proc/cpuinfo
or
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies
The first must get reduced when the passive trip point temp is exceeded.
If the machine still overheats with reduced freq, either the temp is wrongly read (but it looks fine, no insane values..) or you may need more powerful fan(s).
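
The same check can be scripted. This is a minimal sketch, assuming the standard cpufreq files scaling_max_freq and cpuinfo_max_freq are present for each CPU; the function names read_khz and thermal_limit_report are just illustrative. A CPU is reported as limited when its scaling maximum is below the hardware maximum.

```python
import glob
import os

def read_khz(path):
    """Read a single integer (frequency in kHz) from a sysfs file."""
    with open(path) as f:
        return int(f.read().strip())

def thermal_limit_report(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: (scaling_max_khz, hardware_max_khz, limited)}."""
    report = {}
    for d in sorted(glob.glob(os.path.join(cpu_root, "cpu[0-9]*", "cpufreq"))):
        cpu = os.path.basename(os.path.dirname(d))
        scaling_max = read_khz(os.path.join(d, "scaling_max_freq"))
        hw_max = read_khz(os.path.join(d, "cpuinfo_max_freq"))
        # limited is True when something (e.g. passive cooling) lowered the cap
        report[cpu] = (scaling_max, hw_max, scaling_max < hw_max)
    return report
```

Calling thermal_limit_report() while the test load is running and the temperature is above the passive trip point should show limited as True for every CPU if passive cooling is working. Note that a user-set frequency cap would look the same, so this is a hint rather than proof.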
Comment 44 Thomas Renninger 2009-07-20 16:32:36 UTC
Be aware that you should still do some computation when testing for overheating, even with the frequency lowered. A totally idle system with reduced freq produces much less heat than a utilized one.
The fan must easily be able to keep the system at say 60C when frequency is reduced, even if CPUs are busy.
Did you try whether the fan gets controlled by ACPI?
You may want to add thermal.act=50
Does the fan get louder at this temp then?
Best take a latest kernel for that again, there were some changes recently, the:
"Unable to turn cooling device"
message does not exist in latest kernels anymore and the kernel should still try to set the requested state.
From comment #50:
> I noticed also the fan became loader (was speeding up) a few minutes after...
You also may want to play with sensors which should be able to show fan activity and temperature.
Comment 45 Alois Schlögl 2009-07-20 17:50:26 UTC
(In reply to comment #43)
> Did you double check whether the frequency got reduced as soon as temp is above
> 60C?:
> cat /proc/cpuinfo
> or
> cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
> cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies
> The first must get reduced when the passive trip point temp is exceeded.
> If the machine still overheats with reduced freq, either the temp is wrongly
> read (but it looks fine, no insane values..) or you may need a more powerful
> fan(s).


$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
2400000
2400000
2400000
2400000

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies
2400000 1200000 
2400000 1200000 
2400000 1200000 
2400000 1200000 


I guess the frequency does not get reduced. (It's a guess, because I do not know how reliable the tools are.) I'm using the gnome applets, and the CPU frequency monitor does not have an update interval setting. The sensor applet for the temp has a 2 s interval by default.

However, when I start the test, the temp rises to about 67-68 C, and the CPU freq still does not drop. I was not suspicious, because there might be some short term throttling, and if only the maximum frequency in a given interval is shown, it will mostly display 2.4 GHz. At least that's what I thought.


Now I notice, when running the test (the temp is typically 67-68 C) I can enforce a reduction to 1.2 GHz when I type the command 

# echo 1 >/proc/acpi/thermal_zone/THRM/cooling_mode 

Simultaneously, the CPU freq of all 4 cores is reduced to 1.2 GHz and the temp drops to about 57 C. This lasts for about 13 s (counting from the echo command), then the freq is back to 2.4 GHz and the temp rises to 66 C. The freq stays at 2.4 GHz; only another echo command will reduce the frequency.

It seems some periodic resetting of the cooling_mode is happening. Any idea what's the cause of this and how to avoid it?
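
To time this more precisely than by counting seconds, a polling sketch like the following could be used. It assumes the value to watch is a single integer in a file such as /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq; the function names poll_file and changes are only illustrative.

```python
import time

def poll_file(path, samples, interval=1.0):
    """Read an integer from `path` `samples` times, `interval` seconds apart.

    Returns a list of (elapsed_seconds, value) tuples.
    """
    start = time.monotonic()
    readings = []
    for i in range(samples):
        with open(path) as f:
            readings.append((time.monotonic() - start, int(f.read().strip())))
        if i + 1 < samples:
            time.sleep(interval)
    return readings

def changes(readings):
    """Return the readings whose value differs from the previous reading."""
    return [cur for prev, cur in zip(readings, readings[1:]) if cur[1] != prev[1]]

# Example use (run right after the echo command, as root or a normal user):
# r = poll_file("/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq", 30)
# print(changes(r))
```

Each entry returned by changes() gives the elapsed time at which the value switched, so a reset after roughly 13 s should show up as one entry near that mark.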


I think the fan is controlled by ACPI for the following reasons:  
First, /var/log/kern.log contains this line. 
Jul 20 19:14:39 bcipc038 kernel: [    2.645668] ACPI: Fan [FAN] (on)

Second, the new kernel (2.6.30.2) also has a sensor for the fan speed; it shows 1800 RPM when idle. When I increase the load and the temp rises above 60 C, the fan becomes faster (about 3600 RPM); when the load is reduced and the temp drops, the fan slows down again (to about 1800 RPM) at about 55 C.
Comment 46 Zhang Rui 2009-07-21 05:29:52 UTC
after the cpu frequency changes to 1.2 GHz, please attach the output of
"grep . /sys/devices/system/cpu/cpu*/cpufreq/*"
Comment 47 Alois Schlögl 2009-07-21 08:14:52 UTC
(In reply to comment #46)
> after the cpu frequency changes to 1.2 GHz, please attach the output of
> "grep . /sys/devices/system/cpu/cpu*/cpufreq/*"


Ok, here is what I did. First, I start the test (heavy numerical computations).
The temp rises above 60 C, then I issue these commands:

# echo 1 >/proc/acpi/thermal_zone/THRM/cooling_mode 
# grep . /sys/devices/system/cpu/cpu*/cpufreq/*
/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq:2400000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency:4000
/sys/devices/system/cpu/cpu0/cpufreq/related_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies:2400000 1200000 
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:powernow-k8
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:1920000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu1/cpufreq/affected_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq:1200000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq:2400000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_min_freq:1200000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_transition_latency:4000
/sys/devices/system/cpu/cpu1/cpufreq/related_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq/scaling_available_frequencies:2400000 1200000 
/sys/devices/system/cpu/cpu1/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:powernow-k8
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:1920000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq:1200000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu2/cpufreq/affected_cpus:2
/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_cur_freq:1200000
/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_max_freq:2400000
/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_min_freq:1200000
/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_transition_latency:4000
/sys/devices/system/cpu/cpu2/cpufreq/related_cpus:2
/sys/devices/system/cpu/cpu2/cpufreq/scaling_available_frequencies:2400000 1200000 
/sys/devices/system/cpu/cpu2/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:powernow-k8
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:1920000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq:1200000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu3/cpufreq/affected_cpus:3
/sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_cur_freq:1200000
/sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_max_freq:2400000
/sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_min_freq:1200000
/sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency:4000
/sys/devices/system/cpu/cpu3/cpufreq/related_cpus:3
/sys/devices/system/cpu/cpu3/cpufreq/scaling_available_frequencies:2400000 1200000 
/sys/devices/system/cpu/cpu3/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:powernow-k8
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:1920000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq:1200000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_setspeed:<unsupported>


When the CPU frequency is back at 2.4 GHz (about 12 s later), I get this:

# grep . /sys/devices/system/cpu/cpu*/cpufreq/*
/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq:2400000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq:2400000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency:4000
/sys/devices/system/cpu/cpu0/cpufreq/related_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies:2400000 1200000 
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2400000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:powernow-k8
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:2400000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu1/cpufreq/affected_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq:2400000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq:2400000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_min_freq:1200000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_transition_latency:4000
/sys/devices/system/cpu/cpu1/cpufreq/related_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq/scaling_available_frequencies:2400000 1200000 
/sys/devices/system/cpu/cpu1/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:2400000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:powernow-k8
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:2400000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq:1200000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu2/cpufreq/affected_cpus:2
/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_cur_freq:2400000
/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_max_freq:2400000
/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_min_freq:1200000
/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_transition_latency:4000
/sys/devices/system/cpu/cpu2/cpufreq/related_cpus:2
/sys/devices/system/cpu/cpu2/cpufreq/scaling_available_frequencies:2400000 1200000 
/sys/devices/system/cpu/cpu2/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:2400000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:powernow-k8
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq:2400000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq:1200000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu3/cpufreq/affected_cpus:3
/sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_cur_freq:2400000
/sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_max_freq:2400000
/sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_min_freq:1200000
/sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency:4000
/sys/devices/system/cpu/cpu3/cpufreq/related_cpus:3
/sys/devices/system/cpu/cpu3/cpufreq/scaling_available_frequencies:2400000 1200000 
/sys/devices/system/cpu/cpu3/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:2400000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:powernow-k8
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:2400000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq:1200000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_setspeed:<unsupported>
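
A minimal sketch, not part of the original report, of how two dumps like the ones above could be compared automatically. It parses `path:value` lines as printed by `grep .` and lists the CPUs whose `scaling_max_freq` is below `cpuinfo_max_freq`, which is how the capped state shows up in the first dump (1920000 vs. 2400000). The function name `capped_cpus` is hypothetical.

```python
import re

# Matches lines such as
# /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:1920000
LINE_RE = re.compile(r"/cpu(\d+)/cpufreq/(\w+):(.*)$")


def capped_cpus(dump):
    """Return {cpu: (scaling_max_freq, cpuinfo_max_freq)} for capped CPUs."""
    values = {}
    for line in dump.splitlines():
        match = LINE_RE.search(line.strip())
        if match:
            cpu, key, value = match.groups()
            values.setdefault(int(cpu), {})[key] = value.strip()

    capped = {}
    for cpu, attrs in sorted(values.items()):
        try:
            scaling_max = int(attrs["scaling_max_freq"])
            hw_max = int(attrs["cpuinfo_max_freq"])
        except (KeyError, ValueError):
            continue  # attribute missing or not numeric, e.g. "<unsupported>"
        if scaling_max < hw_max:
            capped[cpu] = (scaling_max, hw_max)
    return capped
```

The text passed in would be the captured output of the `grep .` command; an empty result means no CPU is currently limited below its hardware maximum.
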
Comment 48 Alois Schlögl 2009-07-21 17:20:05 UTC
(In reply to comment #44)
> Be aware that you should still do some computation, even when frequency is
> lowered to test overheating. A total idle system with reduced freq consumes
> much less heat than a utilized one.

Of course. In the (almost) idle state, the temp is around 45 C. Only
when the test (which contains heavy FPU computation) is started does the temp rise towards 68 C.

> The fan must easily be able to keep the system at say 60C when frequency is
> reduced, even if CPUs are busy.

When the frequency on all 4 cpus is lowered to 1.2 GHz, the temp settles around 57 C. 

> Did you try whether the fan gets controlled by ACPI?
> You may want to add thermal.act=50
> Does the fan get louder at this temp then?

I have not tested thermal.act=50 yet, but I will try to install 2.6.31-rc3.
Please see also comment #45.


> Best take a latest kernel for that again, there were some changes recently,
> the:
> "Unable to turn cooling device"
> message does not exist in latest kernels anymore and the kernel should still
> try to set the requested state.

The last time I saw this message was 
Jul 14 16:19:21 bcipc038 kernel: [ 2416.905003] ACPI: Critical trip point
Jul 14 16:19:21 bcipc038 kernel: [ 2416.905502] Critical temperature reached (71 C), shutting down.
Jul 14 16:19:21 bcipc038 kernel: [ 2416.907031] ACPI: Unable to turn cooling device [ffff88012f815a60] 'on'
Jul 14 16:19:22 bcipc038 kernel: [ 2417.946331] [drm] Resetting GPU
Jul 14 16:19:22 bcipc038 kernel: [ 2418.125020] mtrr: MTRR 5 not used
Jul 14 16:19:27 bcipc038 kernel: [ 2422.904779] Critical temperature reached (56 C), shutting down.

The two later shutdowns did not show this message; the log simply ended with:

Jul 20 17:38:47 bcipc038 kernel: [14343.489356] Critical temperature reached (71 C), shutting down.

Jul 20 19:13:24 bcipc038 kernel: [ 5623.501001] Critical temperature reached (71 C), shutting down.

The tests on Jul 20 were run with the kernel 2.6.30.2 + your acpi patch. This seems to confirm that the newer kernel got rid of the problem "Unable to turn cooling device". 


> From comment #50:
> > I noticed also the fan became louder (was speeding up) a few minutes after...
> You also may want to play with sensors which should be able to show fan
> activity and temperature.

So the remaining question is, why does a temp>60 not trigger the freq reduction? 

I'm going to test 2.6.31-rc3, or do I need a more recent version, e.g. snapshot 2.6.31-rc3-git5 ? 

When I compiled the kernel 2.6.31-rc3 without your patch (I forgot to include it), echoing 1 did not reduce the frequency.
Comment 49 Alois Schlögl 2009-07-21 17:53:54 UTC
I have installed 2.6.31-rc3 with your patch, and booted with thermal.psv=60 thermal.act=50 

[    0.000000] Linux version 2.6.31-rc3-some-string-here (root@bcipc038) (gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) ) #1 SMP Tue Jul 21 18:42:43 CEST 2009
[    0.000000] Command line: root=UUID=4a0d6592-2929-4050-a603-87e463ceed0e ro thermal.psv=60 thermal.act=50 quiet splash


Despite thermal.act=50, the fan speeds up (from ~1800 to 3600 RPM) only at a temp of 60. 

A temp>60 does not trigger passive cooling (i.e. frequency reduction).

However, if the temp is >60 and all CPUs are running at full speed (2.4 GHz), then
  # echo 1 > /proc/acpi/thermal_zone/THRM/cooling_mode 
reduces the freq to 1.2 GHz for about 12 s. In the meantime, the temp also drops to about 57 C. After these 12 s, the freq returns to 2.4 GHz and the temp increases again. Although the temp is above 60 C, the freq is not reduced. Only another echo will reduce the freq for about 12 s.
Comment 50 Thomas Renninger 2009-07-21 20:36:43 UTC
I think I know why this does not work. Also increase the polling frequency:
echo 10 >/proc/acpi/thermal_zone/*/polling_frequency

Rui, do you agree with patch from comment #37?
While it may not help with this totally broken BIOS, it may help with others.
Do I get your reviewed-by or signed-off-by?
Comment 51 Alois Schlögl 2009-07-22 16:44:41 UTC
# cat /proc/acpi/thermal_zone/*/polling_frequency
<polling disabled>

When I do 
 # echo 1 > /proc/acpi/thermal_zone/THRM/cooling_mode 
 # echo 10 >/proc/acpi/thermal_zone/*/polling_frequency
 
and start the test, the freq is reduced to 1.2 GHz when the temp exceeds thermal.psv.
The CPU cools down (to about 57 C), and after a few seconds the freq goes up to 2.4 GHz again.

So, this is working for me. If I remember correctly, 2.6.31-rc3 without the patch did not reduce the freq. 

Concerning comment #37: why do you think this is a "broken" BIOS? Is there some standard for how a BIOS should deal with this? Or could this just be an alternative definition? I'm asking to get a better understanding of the problem.
Comment 52 Zhang Rui 2009-07-23 01:25:08 UTC
(In reply to comment #50)
> Rui, do you agree with patch from comment #37?
> While it may not help with this totally broken BIOS, it may help with others.
> Do I get your reviewed-by or signed-off-by?

yes,
signed-off-by: Zhang Rui <rui.zhang@intel.com>

(In reply to comment #51)
> Concerning comment #37: why do you think this is "broken" bios?

There are four processor devices (C000, C001, C002, C003) defined in the BIOS.
And _PSL is the control method by which the BIOS tells the OS which devices (mostly processors) should be used for passive cooling.
But here is the _PSL in your BIOS:
            Name (_PSL, Package (0x01)
            {
                \_PR.CPU0
            })
It references a non-existent device, which is surely broken.
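
To illustrate the mismatch, here is a small sketch (plain Python, not kernel code) that compares the names referenced by `_PSL` against the processor objects that are actually declared. The `\_PR.` scope for the four declared devices is an assumption, since only their short names are given above; `dangling_references` is a hypothetical helper name.

```python
def dangling_references(declared, referenced):
    """Return the referenced ACPI names that match no declared device."""
    declared_set = set(declared)
    return [name for name in referenced if name not in declared_set]


# Processor devices named above; the \_PR. scope is assumed for illustration.
declared = [r"\_PR.C000", r"\_PR.C001", r"\_PR.C002", r"\_PR.C003"]
# The single entry of the _PSL package quoted above.
psl = [r"\_PR.CPU0"]

print(dangling_references(declared, psl))
```

Any name in the result is a passive cooling device the OS cannot resolve, so it has nothing to throttle when the passive trip point is crossed.
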
Comment 53 Thomas Renninger 2009-07-23 08:42:27 UTC
> Concerning comment #37: why do you think this is "broken" bios?
Yep. And the next point is that the BIOS should notify the OS when the temperature exceeds a trip point. Otherwise the OS has to poll the temperature and check itself.
There was a lot of discussion about whether thermal polling should be enabled by default (SUSE did this some time ago, but Len convinced us not to), or whether it should be enabled if there is a passive trip point, etc., because there are other BIOSes (not many, but it hurts) which do not notify on trip point changes.
Comment 54 Len Brown 2009-08-29 22:21:36 UTC
Noooo....

Although cpufreq has problems on this machine,
fixing it will never address the thermal problem seen
when all cores are running at max frequency for a long period.

Although ACPI fan control is broken on this machine,
"fixing it" is probably not the way to go, as approximately
0 desktop machines actually _have_ underlying ACPI fan control.

Although ACPI throttling is broken on this machine,
fixing it with a kernel workaround is _not_ the way to go
because this is a quad-core desktop machine.
It is a class of system that should be able to run
at maximum performance for an indefinite period of
time with _no_ thermal throttling.

My advice is to return this piece of junk and get
a real computer.  If that is not possible, then
the workaround mentioned in comment #12 should be
all you need.  Boot with "thermal.nocrt=1"

This is a quad-core desktop.
If the machine overheats, then the supplier did not
install sufficient fans -- get bigger ones, or some
fans that spin at a faster speed at a given voltage.

The only thing here that I think could be a kernel
issue worth fixing is that the temperature reading jumps
around and we shut-down even if it "gets better" after
the initial event.  This may be a junk sensor, or it
could be an issue with the Linux EC driver.  But we should
probably re-read the critical temperature to be sure
it is still over-temp before shutting down.
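
The re-read idea could look roughly like the sketch below. It is plain Python for illustration only, not the kernel's thermal code; `read_temp`, `confirm_critical`, the number of re-reads, and the 70 C threshold are all hypothetical.

```python
import time


def confirm_critical(read_temp, critical_c, rereads=3, delay_s=0.0):
    """Return True only if every re-read is still at or above critical_c.

    read_temp is any callable returning the current temperature in C.
    """
    for _ in range(rereads):
        if read_temp() < critical_c:
            return False  # reading recovered, treat the event as spurious
        time.sleep(delay_s)
    return True


# Readings similar to the log in comment #48: 71 C, then 56 C a few seconds later.
readings = iter([71, 56])
print(confirm_critical(lambda: next(readings), critical_c=70))
```

With a sensor that jumps from 71 C back to 56 C, the second read falls below the threshold and the shutdown would be skipped; a reading that stays high on every re-read would still trigger it.
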

closing this as an invalid BIOS issue.
