Bug 15946

Summary: critical shutdown because of bogus temperature got from EC address space - Asus V1S
Product: ACPI Reporter: bugzilla
Component: ECAssignee: Lan Tianyu (tianyu.lan)
Status: CLOSED WILL_FIX_LATER    
Severity: normal CC: acpi-bugzilla, alan, astarikovskiy, b.weggenmann+kernel, johnmbryant, lenb, public.oss, rain, rui.zhang, sv, tianyu.lan
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.4 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Logs and file listings and DSDT
Acpidump without -b switch
2.6.34 Powertop -d output
12MB DMESG Output with full ACPI Debug messages
debug patch
acpi readings, dump, dmesg and heat watchdog (no shutdown)
acpi readings, dump, dmesg and heat watchdog (shutdown on May 16th 23:15)
6 last critical shutdowns, commented very light log greps
Explicit ACPI/EC errors in log context
complete /var/log/messages for 2 identical shutdowns, different messaging
More ACPI error messages, less system handling ?
Open Hardware Monitor screencap under Wind'Oz Vista
Open Hardware Monitor text report under Wind'Oz Vista

Description bugzilla 2010-05-09 17:27:14 UTC
Created attachment 26300 [details]
Logs and file listings and DSDT

Hello,

this is my first reported bug ever, so please be a little bit patient.

I have a Asus V1S Laptop with an Intel 7300 Dual Core CPU

I'm a Suse User and since updating the kernel from Suse 11.1 my laptop suddenly shut down. In  message.log I can read "Critical temperature reached (127C), shutting down". 
So in 2.6.27 I had no problem, I tried the following kernel versions 2.6.31,2.6.32,2.6.33,2.6.34 in different subversions and all showed the problem

Under heavy load (like copying large files, converting raw files..) suddenly the fan spins up in 2-3 steps to top speed and seconds later the systems shuts down with the above message.

To narrow down the problem I tried all vanilla versions of the above kernels, open source nv drivers or proprietary nvidia driver in different versions. None solved the problem, only the rate of shutdowns varied between kernel versions. No difference could be found between vanilla or suse kernel or nv vs. proprietory driver. 
 
After an internet resarch I tried kernel params thermal.nocrt=1 and thermal.psv=80 on boot without effect. acpi -d showed no change in passive trip point.

For facilitating bug search I attached the output of 

dmesg after boot
grep . /sys/class/thermal/*/*
grep . /proc/acpi/*/*
ls -l /sys/class/thermal/*/*
acpidump

Is anything more needed?
Comment 1 Zhang Rui 2010-05-10 06:14:10 UTC
does it help if you boot with kernel parameter "acpi_enforce_resources=lax"?

(In reply to comment #0)
> After an internet resarch I tried kernel params thermal.nocrt=1 and
> thermal.psv=80 on boot without effect. acpi -d showed no change in passive
> trip
> point.
> 
what do you mean by "without effect"? the laptops still shutdown?
these two params only work if the thermal driver is built in.
if the thermal driver is built as a module, please re-load it with module parameter "thermal.nocrt=1" and "thermal.psv=80".

BTW: how did you get the acpidump.dat file? please use "acpidump > acpidump.dat", without any params.
Comment 2 bugzilla 2010-05-11 07:17:56 UTC
Created attachment 26330 [details]
Acpidump without -b switch
Comment 3 bugzilla 2010-05-11 07:22:48 UTC
I've added 
options thermal nocrt=1 psv=80 

to /etc/modules/99-local.conf. Now I'll have to wait if the laptop still shuts down. It's difficult as the bug cannot be reproduced in a simple way, for example 
two or three shells with cat /dev/zero > /dev/null increase the load but the system doesn't shut down. Sometimes copying files is a problem, sometimes not. 

The acpidump file is uploaded.
Comment 4 Zhang Rui 2010-05-11 07:33:42 UTC
/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:105000
/sys/class/thermal/thermal_zone0/trip_point_1_type:passive

this is no fan control on this laptop, and the passive trip point is only 5C below the critical trip point.
So it's easy to overheat under heavy load.
I'm wondering why the problem doesn't exist in 2.6.27.

you can watch the temperature of the thermal zone and attach the output of "grep . /sys/class/thermal/*/*" when the temp is higher than 80C.
Comment 5 bugzilla 2010-05-12 12:33:05 UTC
Here I have the output of grep . /sys/class/thermal/*/* 

/sys/class/thermal/cooling_device0/cur_state:0
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
/sys/class/thermal/cooling_device1/cur_state:0
/sys/class/thermal/cooling_device1/max_state:10
/sys/class/thermal/cooling_device1/type:Processor
/sys/class/thermal/cooling_device2/cur_state:0
/sys/class/thermal/cooling_device2/max_state:15
/sys/class/thermal/cooling_device2/type:LCD
/sys/class/thermal/thermal_zone0/cdev0_trip_point:1
/sys/class/thermal/thermal_zone0/cdev1_trip_point:1
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/temp:84000
/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:80000
/sys/class/thermal/thermal_zone0/trip_point_1_type:passive
/sys/class/thermal/thermal_zone0/type:acpitz

and grep . /proc/acpi/themal_zone/*/*

/proc/acpi/thermal_zone/THRM/cooling_mode:0 - Active; 1 - Passive
/proc/acpi/thermal_zone/THRM/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THRM/state:state:                   ok
/proc/acpi/thermal_zone/THRM/temperature:temperature:             83 C
/proc/acpi/thermal_zone/THRM/trip_points:critical (S5):           110 C <disabled>
/proc/acpi/thermal_zone/THRM/trip_points:passive:                 80 C: tc1=2 tc2=10 tsp=100 devices=CPU0 CPU1 

I wonder what causes this error. Still no fan visible? As it changes the speed depending on load I assume it is BIOS controlled. Correct? Is this a buggy DSDT?
Comment 6 Zhang Rui 2010-05-13 02:53:25 UTC
(In reply to comment #5)
> I wonder what causes this error. Still no fan visible? As it changes the
> speed
> depending on load I assume it is BIOS controlled. Correct?

Yes. I should say there is no ACPI fan control on this laptop.
The fan may be controlled via BIOS or hwmon.

The processor is not throttled in the last test, this is probably because a thermal notification is not generated when temperature is at 83C.

could you please redo the test in comment #4 with thermal.psv=60 please?
i.e. send the output of
"grep . /sys/class/thermal/*/*" when the temp is higher than 90C.
We should see the processor(/sys/class/thermal/cooling_device1/cur_state) is in deep cooling state is time.
Comment 7 bugzilla 2010-05-13 08:38:53 UTC
As requested I changed thermal.psv to 60 and created a new initrd

Here is the result after Loading with 3x glxgears and 3x cat /dev/random > dev/null

/sys/class/thermal/cooling_device0/cur_state:3
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
/sys/class/thermal/cooling_device1/cur_state:3
/sys/class/thermal/cooling_device1/max_state:10
/sys/class/thermal/cooling_device1/type:Processor
/sys/class/thermal/cooling_device2/cur_state:0
/sys/class/thermal/cooling_device2/max_state:15
/sys/class/thermal/cooling_device2/type:LCD
/sys/class/thermal/thermal_zone0/cdev0_trip_point:1
/sys/class/thermal/thermal_zone0/cdev1_trip_point:1
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/temp:70000
/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:60000
/sys/class/thermal/thermal_zone0/trip_point_1_type:passive
/sys/class/thermal/thermal_zone0/type:acpitz

Furthermore I did echo 0xFFFF0000 > /proc/acpi/debug_layer but it just showed the temperature and no throttleing event.

Acpi_listen didn't give a result too.

Yet apparently the system throttles as the fan doesn't spin up and the temperature doesn't rise above 70°

Is there a way to change the passive trip point at run time?
Comment 8 Zhang Rui 2010-05-19 07:04:00 UTC
please attach the output of "powertop -d" in kernel 2.6.27 and 2.6.33
Comment 9 bugzilla 2010-05-24 12:25:29 UTC
Created attachment 26518 [details]
2.6.34 Powertop -d output

Hello, 

here is the requested Powertop output from 2.6.34

I use EXT4 filesystem, 2.6.27 doesn't support it so i cannot boot with this kernel. Any suggestion

Thanks
Comment 10 Zhang Rui 2010-05-28 03:20:13 UTC
the powertop output is pretty normal.
It seems that the ACPI thermal driver is working well.
Now the problem is why this is a regression.
can you hear the fan spin in 2.6.27? and please attach the output of "grep . /proc/acpi/themal_zone/*/*" in 2.6.27 as well.
Comment 11 bugzilla 2010-06-07 17:16:08 UTC
Hello,

something interesting happened. 
I prepared a script that ran in the background and executed 
grep . /sys/class/thermal/*/*  >> /home/CritLog/sysclassthermal.txt in case the temp was over 90°C. 

Just yesterday my system shut down again when the above script was running. luckily everything was written to the disk before shutdown occured.
Just look what I found:

from nvclock --info
-- Sensor info --
Sensor: GPU Internal Sensor
GPU temperature: 62C

so this is not the culprit, but this is suspicious

/sys/class/thermal/cooling_device0/cur_state:0
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
/sys/class/thermal/cooling_device1/cur_state:0
/sys/class/thermal/cooling_device1/max_state:10
/sys/class/thermal/cooling_device1/type:Processor
/sys/class/thermal/cooling_device2/cur_state:0
/sys/class/thermal/cooling_device2/max_state:15
/sys/class/thermal/cooling_device2/type:LCD
/sys/class/thermal/thermal_zone0/cdev0_trip_point:1
/sys/class/thermal/thermal_zone0/cdev1_trip_point:1
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/temp:127000
/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:80000
/sys/class/thermal/thermal_zone0/trip_point_1_type:passive
/sys/class/thermal/thermal_zone0/type:acpitz

To me it looks as if the system was at full speed and the temp was strange. 127°C looks strangely, like some error message.
The fans were spinning at top speed, yet the load was only average. Just one youtube vid, vlc as mp3 player and 2-3 browser tabs.
So what do you think happened here.

Until now I didn't manage a 2.6.27 dump
Comment 12 Zhang Rui 2010-06-09 07:38:34 UTC
well.
127 equals 0x7f. this seems like an magic number.
As it is got from EC address space, which suggests this may be an EC problem.

could you attach the full /home/CritLog/sysclassthermal.txt file?
so that we can see if the temperature grows step by step or it jumps to 127C from a normal value.
Comment 13 Zhang Rui 2010-06-09 07:40:04 UTC
you may want to change the script to writing to /home/CritLog/sysclassthermal.txt every 10 seconds, no matter what the temperature is.
Comment 14 bugzilla 2010-06-09 08:55:37 UTC
Ok the script is changed. Any further suggestions on what should be dumped regularly?

What is an EC problem? Where can I read about it? On this machine the problem only occurs under linux, on the dual boot XP it never occured.
Indeed, the temp. suddenly rises to 127°C. 
I will post again, once i got a usable file after a crash, as I cannot reliably induce it.
Comment 15 Zhang Rui 2010-06-09 09:02:59 UTC
(In reply to comment #14)
> Ok the script is changed. Any further suggestions on what should be dumped
> regularly?
> 
Not yet.

> What is an EC problem? Where can I read about it? On this machine the problem
> only occurs under linux, on the dual boot XP it never occured.
> Indeed, the temp. suddenly rises to 127°C. 

then this sounds like a problem that Linux gets incorrect value from EC address space. We need the help from alexey, who is the ACPI EC driver maintainer. :)

> I will post again, once i got a usable file after a crash, as I cannot
> reliably
> induce it.

okay.
Comment 16 bugzilla 2010-06-09 14:20:37 UTC
I seems today is crash day.

 The machine crashed twice, my onscreen temp monitor showed 12/°C again and the fans were suddenly spinning high.

here the last seconds from /sys/class/thermal, regrettably the system shut down before the complete file was written. Yet it is apparent that is is not throttled. The GPU at the same time was at unsuspicios 66°C.

Mi 9. Jun 16:12:51 CEST 2010
/sys/class/thermal/cooling_device0/cur_state:0
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
/sys/class/thermal/cooling_device1/cur_state:0
/sys/class/thermal/cooling_device1/max_state:10
/sys/class/thermal/cooling_device1/type:Processor
/sys/class/thermal/cooling_device2/cur_state:0
/sys/class/thermal/cooling_device2/max_state:15
/sys/class/thermal/cooling_device2/type:LCD
/sys/class/thermal/thermal_zone0/cdev0_trip_point:1
/sys/class/thermal/thermal_zone0/cdev1_trip_point:1
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/temp:63000
/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:80000
/sys/class/thermal/thermal_zone0/trip_point_1_type:passive
/sys/class/thermal/thermal_zone0/type:acpitz
Mi 9. Jun 16:12:56 CEST 2010
/sys/class/thermal/cooling_device0/cur_state:0
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
/sys/class/thermal/cooling_device1/cur_state:0
/sys/class/thermal/cooling_device1/max_state:10
/sys/class/thermal/cooling_device1/type:Processor
/sys/class/thermal/cooling_device2/cur_state:0
/sys/class/thermal/cooling_device2/max_state:15
/sys/class/thermal/cooling_device2/type:LCD
/sys/class/thermal/thermal_zone0/cdev0_trip_point:1
/sys/class/thermal/thermal_zone0/cdev1_trip_point:1
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/temp:62000
/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:80000
/sys/class/thermal/thermal_zone0/trip_point_1_type:passive
/sys/class/thermal/thermal_zone0/type:acpitz
Mi 9. Jun 16:13:01 CEST 2010
/sys/class/thermal/cooling_device0/cur_state:0
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
/sys/class/thermal/cooling_device1/cur_state:0
/sys/class/thermal/cooling_device1/max_state:10
/sys/class/thermal/cooling_device1/type:Processor
/sys/class/



Is the powertop dump from 2.6.27 still needed?
Comment 17 Zhang Rui 2010-06-10 01:14:20 UTC
no, you don't.

we need the help from Alexey, who is on vacation right now... :p
Comment 18 bugzilla 2010-06-15 16:41:51 UTC
Today after some shutdowns I noticed sth. interesting.

Adding thermal.nocrt=1 to my modules configuration made the system shutting down after 1-2 seconds of high fan noise without going backwards in the run level. It simply turns of. After restart the fan turns at high speed until some of the kernel parts are loaded. 
I tried to exhaust the system with furmark and linpack in windows, yet there the error doesn't appear. The Graphic card too is indifferent to temps around  90°C under heavy load. 
So I ask myself if it is possible that some bogus value is read or written, the bios spins the fans up and shuts down the pc as it sees a wrong temp. Yet this value seems to be retained during a shutdown-restart cycle. It that possible?
Comment 19 bugzilla 2010-07-05 13:27:32 UTC
Can I contribute any more info in order to solve the problem?
Comment 20 Alexey Starikovskiy 2010-07-05 15:23:51 UTC
Rui,

127 is not the number EC gets... Remember that in thermal module reading is in deci-Kelvins, so it needs to subtract 2730 and then divide by 10 to get this number...
Comment 21 Zhang Rui 2010-07-06 05:31:45 UTC
(In reply to comment #20)
> Rui,
> 
> 127 is not the number EC gets... Remember that in thermal module reading is
> in
> deci-Kelvins, so it needs to subtract 2730 and then divide by 10 to get this
> number...

            Method (_TMP, 0, NotSerialized)
            {
                Store (0x05, Local1)
                While (Local1)
                {
                    Store (RTMP (), Local0)
                    If (LGreater (Local0, TCRT))
                    {
                        Decrement (Local1)
                    }
                    Else
                    {
                        Store (Zero, Local1)
                    }
                }

                Return (KELV (Local0))
            }

RTMP returns 0x7F (getting from \_SB.PCI0.SBRG.EC0.ECPU) and
KELV() translate 0x7f to Kelvins.
Comment 22 bugzilla 2010-07-10 12:19:23 UTC
Today I noticed sth. interesting.

After a sudden shutdown and manually restarting the system the core temp is no longer displayed on my desktop. I noticed it today as usually the desktop is hidden by autostart programs.

dmesg was filled with this

[ 3695.304073] ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20100121/evregion-474)
[ 3695.304103] ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.RTMP] (Node ffff88007d0032b0), AE_TIME
[ 3695.304235] ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.THRM._TMP] (Node ffff88007d0033f0), AE_TIME
linux-1x2r:/home/Ali # ^C

This is 
grep . /sys/class/thermal/*/*
/sys/class/thermal/cooling_device0/cur_state:1
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
/sys/class/thermal/cooling_device1/cur_state:1
/sys/class/thermal/cooling_device1/max_state:10
/sys/class/thermal/cooling_device1/type:Processor
/sys/class/thermal/cooling_device2/cur_state:0
/sys/class/thermal/cooling_device2/max_state:15
/sys/class/thermal/cooling_device2/type:LCD
/sys/class/thermal/thermal_zone0/cdev0_trip_point:1
/sys/class/thermal/thermal_zone0/cdev1_trip_point:1
/sys/class/thermal/thermal_zone0/mode:enabled
grep: /sys/class/thermal/thermal_zone0/temp: #Tranlated: No suitable device found
/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:80000
/sys/class/thermal/thermal_zone0/trip_point_1_type:passive
/sys/class/thermal/thermal_zone0/type:acpitz

What's this? Is it a hardware problem?
Comment 23 bugzilla 2010-07-27 07:05:26 UTC
I updated my kernel to 2.6.35-rc5-10-desktop but the problem persists.
Occasional shutdowns with temp=127°C from no apparent reason.

How can I produce more logs that might help?
Comment 24 bugzilla 2010-07-28 10:03:43 UTC
Created attachment 27280 [details]
12MB DMESG Output with full ACPI Debug messages

Today I finally managed to get a dmesg copy to file with a 16Mb Ringbuffer with full ACPI debug output.

The first seconds are fine, yet in the end I get this 4002K Error.
But I do not understand the ACPI Process, can anybody help?
Comment 25 Len Brown 2010-08-26 17:52:51 UTC
*** Bug 16522 has been marked as a duplicate of this bug. ***
Comment 26 Len Brown 2010-08-26 17:57:10 UTC
Rather than using acpi=off as a workaround,
you can  disable all thermal trip points via "thermal.crt=-1"

Also, to observe the issue w/o actually shutting down,
you can disable critical trip point actions by booting
with "thermal.nocrt=1".
Comment 27 bugzilla 2010-08-28 16:03:42 UTC
Hello,

thanks for the advice. On the 15. of June I tried the hint, the result is as posted on this day.

The Kernel doesn't go back through the runlevels but the machine simply stays at 127° at top fan speed and then shuts down. After a restart the fan continues spinning at top speed until some acpi modules are loaded again. see #18

Never observed this under windows, so I consider it a linux kernel related bug. A complete dmesg output is attached to this bug. 

I think it is not a problem of clocking down under load (see #7 where this works) but rather sth. different see #11.
Comment 28 bugzilla 2010-09-05 19:04:37 UTC
The problem persists on 2.6.36-rc3 even on runlevel 3
Comment 29 BW 2010-10-12 00:33:57 UTC
(In reply to comment #28)
> The problem persists on 2.6.36-rc3 even on runlevel 3

Hi, I have exactly the same problem since kernel 2.6.31, on basically the same laptop (Asus V1S with T7500) running gentoo (currently 2.6.35-r4 amd64). I found in the kernel changelog that there were many changes to thermal stuff in 2.6.30, so maybe it was introduced there. The problem never occured under Windows Vista and 7 (Which is good, since it implies the hardware is not broken, and which also is bad, since it implies something not so good about Linux).
Since I changed from pre-2.6.30 to 2.6.31, I got arbitrary thermal shutdowns. Within a second, the temperature would jump from the normal 60-80 degrees to 127 degrees, and then the system would shut down. The strange thing is that this *never* happened e.g. when playing 3D games like Quake Live or nexuiz, but mostly only when the system was under low to moderate load (e.g. word processing, playing music, surfing the web etc.).

I recently found some people writing that adding the kernel parameter acpi_osi="Linux" helps them with fan control on their laptops (the brand was Toshiba, not Asus). I added the parameter to my kernel boot line, and for the last two days no thermal shutdown occurred. Unfortunately, this doesn't mean too much since as explained earlier, the problem occurs sporadically. However, I also noticed that the fan would generally spin faster. I suspect that without the acpi_osi parameter, Linux somehow overtakes the fan control or temperature reading from the BIOS, and while overtaking also "overlooks" that the CPU and GPU share the same fan. So it might only monitor the CPU temperature, while the GPU might get really hot since the fan is not spinning fast enough (which agrees with my observation that the thermal shutdown never occured when playing games; also note that the V1S has the infamous 8600m that had overheating issues in some MacBook Pro laptops, although this shouldn't be the problem here since nothing overheats when booting Windows)
I tried heavy load applications as well as low load. Maybe you could try if adding that kernel parameter makes a difference on your system. Also, maybe the Linux devs could look into this further to clarify what's really going on.
Comment 30 Zhang Rui 2010-10-12 01:55:20 UTC
I'd rather think this is an EC regression. ACPI thermal driver spins the fan, shutdown the laptop based on the temperature value it reads from EC.
Now EC reports bogus temperature, and this explains why the system shutdown is load independent.
Comment 31 bugzilla 2010-10-15 08:14:22 UTC
Nice to hear sth. in this case and find another victim.
I almost thought it was a hardware problem, yet in Windows it never occured.

The kernel parameter I already tried since 2.6.31 yet it is no cure. 
I constantly monitor my gpu temp. and in case of a shutdown it is almost around 66-70°C. I use the latest 256 Series Nvidia driver.

The only cure is the kernel parameter Acpi=off, but this is no option for a mobile laptop. 
Thermal.nocrt=1 is no cure either as controlled shutdown is prevented but after 2-3 secs at 127° the computer just shuts down without going back through the runlevels. After reboot the fan spins at top speed until thermal module is loaded again, so there some sort of reset/reinitialising seems to happen.

I will try 2.6.36 too when it is finally released, but rc3 still produced the error.
Comment 32 BW 2010-10-17 22:07:15 UTC
Hi, unfortunately the other victim is me =) And it seems both of you you are right (again unfortunately ;). The Terminator-Bug came back today :( Hope there will be some fix soon, because it's really frustrating when you are in the middle of doing something.

For me the fan also keeps spinning at top speed after turning the laptop on again. Since right after the reboot the ACPI driver isn't active, this makes me wonder if the cause of the problem could be that some bad value might be *written* to somewhere (in memory or a location that retains its state after a reboot) that shouldn't be written there.
Comment 33 bugzilla 2010-10-28 17:37:27 UTC
Regrettably the bug is still found in 2.6.36-87.

It hit me while browsing kde-apps with no other open windows, so the system load was very small for a Core2Duo processor.

Same procedure as every year, suddenly the fans spin at top speed. When I look at the current temp monitor on my x desktop 127°C is shown. After two seconds the system shuts down (No regular shutdown as thermal.nocrt=1 is used).

At reboot the fans still spin at top speed until acpi modules are loaded.

Any logs needed? Any hints?
Comment 34 bugzilla 2010-11-14 17:27:33 UTC
 I was really fed with these sudden shutdowns as they made me use Windows for productive work. So I decided to try the stable 2.6.34 kernel in the 32 bit version as a change but after two days I have to confirm: 

The bug is present in 2.6.34_x86 as well.
Comment 35 Zhang Rui 2010-11-15 02:24:09 UTC
maybe you need to build in thermal driver and use "thermal.crt=-1" to disable the ACPI
Comment 36 Zhang Rui 2010-11-15 02:26:06 UTC
oops, hit the "commit" button by mistake.

maybe you need to build in thermal driver and use "thermal.crt=-1" to disable
the ACPI thermal critical trip point for now. until we find a way to get the correct temperature.
Comment 37 bugzilla 2010-11-15 12:18:59 UTC
Is is this the same a thermal.nocrt=1 as I used this but somehow 127°C was read and the system suddenly shut down without properly going back through the run levels. 

Strangely after turning on the Laptop again the fans still run a maximum speed until acpi is reloaded.

How could I contribute to find the bug?
Comment 38 bugzilla 2010-12-31 11:24:43 UTC
I had a look at the dmesg log again.

At 3217.434331 the returned temp was 3412°mK eq. 68,2°C

AT 3217.972578 about 0.5 sec later the temp. rose to 127°C. 

At 3217.946899 a pointer is returned where the wrong temperature is returned. 

Is there a way to further investigate these 0,5 seconds?
Comment 39 JohnMB 2011-01-03 09:31:23 UTC
I have the same behaviour on a Toshiba Equium L40-17M (PSL49E).

It has been a problem using ubuntu and Mandriva kernels since 2.6.32

As observed, under no particular load, the system fan goes to maximum and a bogus temperature of 127C is reported, followed by a shutdown.
Comment 40 Zdenek Behan 2011-01-06 01:32:02 UTC
Same message happens on my Asus U6V, in my dmesg right after forced halt is:
Jan  5 19:16:49 falcon kernel: [89597.368754] Critical temperature reached (127 C), shutting down.
Jan  5 19:16:49 falcon logger: ACPI event unhandled: thermal_zone THRM 000000f0 00000001

However, this is only half-bogus. This has happened lately several times, every time under heavy load, and I managed to decrease the incidence of the problem almost entirely by vacuuming the heatsink, which has probably gotten to the point of being too dusty and not cooling enough over time.

I have been watching the output of "sensors" and "grep . /sys/class/thermal/*/*" closely just prior to the forced halt. In idle state (just desktop running), the temperature stays just under 60C according to all outputs. Putting everything under very heavy load can raise this all the way up to 82-85C, which is pretty hot, but hardly critical.

According to /sys/class/thermal/thermal_zone0/, both trip points are set to 110C, and it's very unlikely that the CPU would suddenly jump over 40 degrees C to 127.

Maybe the output of the EC is simply wrongly understood? Maybe there's a flipping bit that is not used for actual temperature output, but is incorrectly interpreted that way. Maybe the other people who have this problem too are simply running too hot normally, so this "fake trip point" happens to them without any load. Only thing that seems to be rather certain that up to the point of the magical jump to 127C, the temperature is reported correctly, because it is indicated almost identically by two independent sensors (the builtin core sensor, acpi-tz).

Just wild guesses, but I thought I'll report my findings. I can probably "fix" the problem for myself by doing something more serious about heatsink, and keeping the laptop cool, but this is an annoying bug nonetheless.
Comment 41 bugzilla 2011-01-06 09:38:36 UTC
Finally more people appear who suffer from this bug too.

@zdebek behan: I have already tried vacuuming, regreasing the heatsink and running the laptop without covers above a fan. It stays cool but even without external input is suddenly shuts down.

So you might save your time
Comment 42 JohnMB 2011-01-06 09:41:18 UTC
Heatsink/fan have also been cleaned here, to no avail. Temperatures under normal conditions are not unduly elevated.

The problem appears to be entirely kernel version dependant.
Comment 43 Zdenek Behan 2011-01-06 14:35:09 UTC
(In reply to comment #41)
> @zdebek behan: I have already tried vacuuming, regreasing the heatsink and
> running the laptop without covers above a fan. It stays cool but even without
> external input is suddenly shuts down.
> 
> So you might save your time

Well, in my case, vacuuming did have a significant effect. As a matter of fact, when this problem manifested itself for the first time, it was extremely massive, and my laptop was willing to shut itself down several times a day. After vacuuming, it took over two weeks and really heavy load to have it force itself shut down again.

Nonetheless, there were no significant objective differences in internal temperatures, looking at the output of sensors. Even the cooler is still running pretty loud.

That might mean that either I'm exactly on the border of the "critical" temperature as perceived by acpi-tz and a marginal increase kills it, or that it's simply based on something else than temperature.

Anyway, in my case, there's a direct causality between load (temperature) and the shutdowns, but the shutdowns still seem to be bogus, happening at way lower temperature than they should happen.
Comment 44 JohnMB 2011-01-06 15:08:10 UTC
(In reply to comment #43)

> Well, in my case, vacuuming did have a significant effect.

No effect at all here, both in terms of the frequency of shutdowns and reported temperatures.

> Anyway, in my case, there's a direct causality between load (temperature) and
> the shutdowns

Again, seems to happen completely at random here with no correlation.

It is an odd one though.......
Comment 45 Zdenek Behan 2011-01-06 15:15:23 UTC
(In reply to comment #42)
> Heatsink/fan have also been cleaned here, to no avail. Temperatures under
> normal conditions are not unduly elevated.
> 
> The problem appears to be entirely kernel version dependant.

Just curious, what's the turnaround for shutdown occurence for you? Right now,
I'm seeing the bug very rarely, and I'm hoping to do some changes to see it
even more rarely or never. But if your computer shuts down reliably within a
reasonable timeframe (minutes), you could try to bisect the problem, provided
it is a regression and you have a known stable kernel version.
Comment 46 JohnMB 2011-02-02 09:45:21 UTC
(In reply to comment #45)
> Just curious, what's the turnaround for shutdown occurence for you? Right
> now,
> I'm seeing the bug very rarely, and I'm hoping to do some changes to see it
> even more rarely or never. But if your computer shuts down reliably within a
> reasonable timeframe (minutes), you could try to bisect the problem, provided
> it is a regression and you have a known stable kernel version.

It's somewhat unpredictable. The machine can run for a few days without shutdown on an effected kernel. Or at other times there can be several occurrences within a few hours. Longer time intervals are more the 'norm' though.

For example, I've just been running the latest ubuntu test 2.3.35 kernel (2.6.38 RCs just crash so far. Unrelated to this bug I think), and that ran fine for a few days. I thought 'great', maybe one of the minor changes they applied has fixed something? Alas, a few mins ago, under hardly any load and at minimal real temp, the bug reared it's head again.

Previously I had been using the old karmic 2.6.31 kernel, as that is rock solid and unaffected by this bug.

If there is anything I can do to usefully bisect the problem by running the 2.6.35 kernel, then I'm happy to do so. (I must re-read what diagnostic steps have already been suggested). I just can't reproduce the bug with any predictability.
Comment 47 Stefan Vogel 2011-02-06 15:27:04 UTC
I have the same problem here with my Asus V1Sn/T7700 CPU and an 2.6.34.7 kernel (OpenSuSe). Apparently the EC returns the wrong temperature. Some time ago, I had the problem, that the EC always returned a temperature much to high and the fan was blowing very loud, although the sensors showed a lower temperature. After several weeks I found out that you can reset the EC by removing the power/laptop battery for 5 minutes. After this the EC temperature was correct.

The problem with the notebook shutting down occasionally because it thinks that the temperature is 127C (which it is not) remains unfortunately.
Comment 48 BW 2011-02-06 15:37:01 UTC
I'm currently running 2.6.36-gentoo-r3 and the shutdown occurs about once every
2-5 days. I tried with BIOS versions 301, 302, and 303 (inofficial) to no
avail.
With 2.6.37, I have the (subjective) impression that the shutdown occurs about
5 times as often (!), i.e. once a day or more, which is unacceptable for me. I
tried to switch back and forth from 2.6.36 to .37 several times and it seems to
support this impression (while it still might be just a coincidence as I, too,
haven't found anything that deterministically causes the bug to be triggered).
Can somebody else support or refute this on their systems?
Last kernel ver without this bug was 2.6.30 or .31 for me.

One funny reproducible thing I noticed is the following: After the shutdown,
when you turn the Laptop on again the fan keeps spinning at max speed for
several seconds (unrelated to loading of kernel acpi modules or anything, it
also stops after a few seconds e.g. when you enter the bios or wait at the grub
menu). However, if you unplug the power (and battery) before turning on the
machine, the fans run back at normal speed. So something 'remembers' the
falsely reported high temperature or fan speed. Maybe this helps to locate the
bug.
Comment 49 BW 2011-02-06 15:41:55 UTC
Oh, and I noticed there is an acpi.ec_delay kernel parameter. Since I was keen to experiment, I changed this to 2500 (from the default which should be 500), however again to no avail.
Comment 50 JohnMB 2011-02-06 17:11:53 UTC
Has anyone tried the 2.6.38 RCs yet?

I can't here as it quickly panics on this hardware. Hopefully that will be fixed, but it would interesting to here if this bug persists in 2.6.38. TIA.
Comment 51 Zdenek Behan 2011-03-11 09:54:34 UTC
I had my laptop "deep-cleaned" in an authorized service (essentially, completely disassemble of the heatsink part, cleaning, applying new thermal paste, assemble).

Temperatures during heavy load have gone down by as much as 20 degrees, and I am no longer able to reproduce this no matter how much load I impose on the poor thing. In fact I'm unable to cause the fan to go 100%, unlike before where it sometimes went to 100% even without any serious load.
Before the servicing, I also managed to eliminate the issue by using an active cooling pad. So for me, this was definitely caused by temperature being too high, as I reported previously.

Still, I wouldn't expect a laptop to panic when core temperature peaks at 85 celsius degrees as reported by acpi-tz, and the reported temperature in panic message is entirely off (127 degrees). In fact I could imagine some laptops with not so well designed cooling, for which this temperature may be an out-of-the-box (no thick layer of dust) experience under heavy load. Perhaps that is the case of some other reporters who simply run too hot normally, and get "random" panics.
Comment 52 bugzilla 2011-03-11 21:01:06 UTC
@Zdenek Behan: I have deep cleaned my laptop more than once. My problem is definitly not heat related as the temp. suddenly jumps from 55°-60° or even 45°idle to 127° in one second with no load. 

This is clearly an EC read error.
Comment 53 JohnMB 2011-03-11 21:06:32 UTC
(In reply to comment #52)
> My problem is
> definitly not heat related as the temp. suddenly jumps from 55°-60° or even
> 45°idle to 127° in one second with no load. 
> 
> This is clearly an EC read error.

Same here. Doesn't relate one bit to load or temperature at the time the error occurs.
Comment 54 bugzilla 2011-03-23 12:01:41 UTC
The problem of sudden rise of measured core temperature followed by 2 seconds of high rpm fan spinning and finally a hard shutdown of the laptop persists in 2.6.38-22.

It happens independently of load (even in RL 3 with nothing happening in the background) on a perfectly cleaned laptop. 

The temp jumps from around 54° idle to 127°C with no intermediate measurements. 

The only solution so far is disabling ACPI completely to have a usable laptop.
Comment 55 Zhang Rui 2011-03-24 00:45:08 UTC
Created attachment 51842 [details]
debug patch

the dmesg log in comment #24 is really helpful.
Can you please apply this patch on top of 2.6.38 and try to get the same dmesg log when the bogus temperature is reported.
Comment 56 Zhang Rui 2011-03-31 08:53:23 UTC
ping...
Comment 57 bugzilla 2011-03-31 16:41:28 UTC
I'm on it.

I have already patched 2.6.38-22 and compiled the kernel. 
Yet it produces frequent soft lock ups during boot or when I was successful booting to init 3 and echo 0xffffffff to Debug_level and Debug_layer the system usually hangs during dmesg.

At the moment I try getting a usable kernel to be able to wait for the error

@Rui 

Where exactly was the debug log helpful? I'm just curious as I didn't understand a bit.

Thanks
Comment 58 Zhang Rui 2011-04-01 05:55:16 UTC
when the acpi thermal driver reads a bogus temperature, I find this in the dmesg you attached:
[ 3217.967136] **** Context Switch from TID ffff88007cafe580 to TID ffff
[ 3217.967138] **** Context Switch from TID ffff88007cafe580 to TID ffff88002f292540 ****

and it suggests there is a racing condition that we didn't care about before.
The patch in comment #55 should fix this problem.
If the patch in comment #55 doesn't help for this bug report, at least we can see what's the difference between reading a good temperature and a bad one.
Comment 59 bugzilla 2011-04-05 05:47:48 UTC
Until now I tried 2.6.38-22 and 2.6.38-30 with the patch. 
I got this nasty DRM problem on booting, could bypass it yet during use the system just locks up and a power off is necessary. Even the Magic Sysrq keys no longer work.

The only cure is Acpi=off but this makes temperature testing impossible.

@Z. Rui

Is it possible to use the patch from comment #55 against 2.6.37 as this runs stable on my system and is known to have the temperature problem too?
Comment 60 Zhang Rui 2011-04-06 02:00:27 UTC
yes, sure.
you can use boot option "nomodeset" to get rid of the DRM problem in 2.6.38 kernel, or just apply the patch on top of 2.6.37 kernel.
Comment 61 bugzilla 2011-04-13 16:28:26 UTC
Compiled multiple version.

Now 2.6.37.6 is up and running with nouveau.
KDE regrettably often pauses for several seconds so this is not a real productive environment.
But in RL 3 as test I didn't get a reboot in around 24h runtime. But its difficult to say the problem is solved as it happens only occasionally, but I'll continue testing
Comment 62 Zhang Rui 2011-04-14 01:44:49 UTC
so how long it usually took to reproduce the problem?
Comment 63 bugzilla 2011-04-14 17:34:58 UTC
it depends, as my system is not running 24h/day. So sometimes it happens multiple times a day, sometimes not in a week.

Continue testing..., regrettably it doesn't work while everyday work in kde
Comment 64 bugzilla 2011-04-20 10:14:48 UTC
I have a problem. I try running the patched kernel without acpi=off in init 3 over night but in the morning the screen is black, the keyboard no longer responds and MagicSysRQ Keys don't work either. 
LCD Dimmer is switched of in BIOS so what more can I do?
Comment 65 Zhang Rui 2011-04-22 02:18:29 UTC
well, at least this doesn't sound like a critical shutdown issue.
is the screen blank or switched off without backlight?
Comment 66 bugzilla 2011-04-23 21:06:34 UTC
The screen is blank, the cursor is in the lower left corner but no longer blinking and the keyboard is completely non responding, even the magic sysrq keys are dead. 

So how can I proceed to debug.
Comment 67 Zhang Rui 2011-04-25 08:18:27 UTC
please add boot option "nomodeset" and check if the problem still exists.
Comment 68 bugzilla 2011-04-29 17:49:09 UTC
So far it's working in RL 3, i'll try to figure out how to get to RL 5 next week and check again. 

Thank you so far
Comment 69 Zhang Rui 2011-05-03 01:20:21 UTC
(In reply to comment #68)
> So far it's working in RL 3,

you mean the critical shutdown problem is gone if you boot with "nomodeset" in RL3?
Comment 70 bugzilla 2011-05-04 09:08:36 UTC
So far I did not encounter a critical shutdown in RL 3. Yet the system occasionally hangs with no longer responding keyboard either during boot or while waiting for the error in RL 3. Regrettably I get no error message

Until now I didn't reach RL 5 but I continue to try.
Comment 71 bugzilla 2011-05-12 11:32:59 UTC
Now using patched 2.6.38.6-26. I can sometimes reach Rl3, sometimes RL 5 using proprietary NVIDIA driver.
With acpi=off everything is fine
Without acpi=off I get random soft lockups during boot or while watching the kernel log output via CRTL-ALT-F10
I use a 1650*1080 Framebuffer, yet the log is too long to be read. 
I read using serial output might be a solution, so I ordered a RS 232 crossover cable and a usb adapter for a second pc. I will enable the serial console and hope to able to deliver the error messages
Comment 72 BW 2011-05-27 03:33:14 UTC
Have been running 2.6.39 since 3 weeks (rc6 on gentoo, rc5 on kubuntu natty) and haven't had a single thermal shutdown since then. Releases 2.6.37 and .38 were really unstable (apart from the thermal shutdowns), and I had the impression that above patch only changed the thermal shutdown into lockups (so the system would just hang instead of shutting down with the thermal shutdown message)
My laptop had the thermal shutdown problem for several years now (since 2.6.30 was released) and it was quite frustrating, so I am quite relieved that they seem to be finally gone :)
Comment 73 bugzilla 2011-05-30 19:10:49 UTC
I'm still having problems with the serial console. Didn't manage to capture a single complete error log.

After comment #72 I'll try 2.6.39
Comment 74 JohnMB 2011-05-30 20:12:43 UTC
(In reply to comment #72)
> Have been running 2.6.39 since 3 weeks (rc6 on gentoo, rc5 on kubuntu natty)
> and haven't had a single thermal shutdown since then. Releases 2.6.37 and .38
> were really unstable (apart from the thermal shutdowns), and I had the
> impression that above patch only changed the thermal shutdown into lockups
> (so
> the system would just hang instead of shutting down with the thermal shutdown
> message)
> My laptop had the thermal shutdown problem for several years now (since
> 2.6.30
> was released) and it was quite frustrating, so I am quite relieved that they
> seem to be finally gone :)

Using 2.6.39, the problem has certainly been less frequent here.

However, I've had a recurrence earlier today, so the issue is still there.
Comment 75 bugzilla 2011-06-01 17:20:18 UTC
Same for me, 

I was glad hearing the problem was gone, yet after downloading and installing the kernel the system shut down during the first boot.

Any idea what may have happened?
Comment 76 bugzilla 2011-08-07 06:20:20 UTC
Regrettably the sudden shutdown issue is still there.
Comment 77 JohnMB 2011-09-05 12:23:47 UTC
Still present in kernel 3.0.4.
Comment 78 Zhang Rui 2012-01-18 02:04:17 UTC
It's great that kernel bugzilla is back.

can you please verify if the problem still exists in the latest upstream
kernel?
Comment 79 JohnMB 2012-01-18 07:48:12 UTC
It was still present in kernel 3.2, but so far not had an occurrence in 3.2.1.

However, it's the sort of bug that seems to be able to occur 10 times in a couple of days, then not happen for a week or so. Pretty random. So cannot say for sure.
Comment 80 bugzilla 2012-01-30 19:06:20 UTC
It's nice to hear somebody is still looking after this bug after such a long time. 

Regrettably for me it is still persistent in 3.2.1, I expirienced it on the first boot. Didn't dare to try it again.
Comment 81 Xavier Hourcade 2012-05-13 21:00:17 UTC
Hello all!

I cannot believe I had not found this ticket earlier, thanks! :)

> Finally more people appear who suffer from this bug too.

Exactly :D From my own searches, I suspect many more like us out there.
Not a developer myself, but *extremely* motivated to contribute on this one.
This is an over-frustrating bug, I can reproduce it often and provide data.

Hardware is also Asus V1S, BIOS 301, with the "infamous" NVidia 8600M GT
(replaced once under warranty, as Windoz driver "had let it burn" they said).

Issue still occurs under Fedora 16 KDE latest stable.
No matter which recent kernel/mod/DE I am running. Currently:
  kernel-3.3.4-3.fc16.x86_64
  kmod-nvidia-295.49-1.fc16.1.x86_64
  kde-workspace-4.8.3-3.fc16.x86_64

Each time, without exception:
- Fan speed goes to maximum, all of a sudden, during few seconds
- Then the system powers off immediately.
- No matter when I next restart the laptop, fan is at maximum speed again
  during a few seconds, typically the time to reach grub screen
So there is for sure a flag set somewhere on the hardware/BIOS side.

System was just unusable in production with Fedora 14 kernels and nouveau.
Then switched to kmod-nvidia, issue had almost disappeared (once p.month maybe).

With Fedora 16 kernels I first tried nouveau again, as unusable as before.
Then switched to kmod-nvidia and it became more, but "randomly" usable
(from unusable, to several p.week with "non-sense" exceptions).

10 days ago, cooler cleaned & thermal pastes changed :
- greatly dropped overall temperatures by 10+ degrees Celcius
- greatly reduced occurrences, from several p.day to several p.week
- permits again the use of dual head i.e. nVidia TwinView
- system is silent 90% of the time, as new, fan speed doesn't even go any high
  unless I really put the system under long and heavy load

Since soon-after KDE 4.8.3 (where KWin seems to be less GPU aggressive ?)
- reduced even more the average frequency of the issue
  (twice a week, plus "unfortunate non-sense exceptions"!).
Hence my increased suspicion of nVidia board.

Critical shutdowns now occur more randomly than ever before:
- None over the past 3 days with dual display, VBox guest, "heavy load and all".
- Today I had three in a row (!) with internal display only, almost no load.
  Worth to mention (?) outside air temperature was clearly hotter than usual.
  Last of this awful series of 3, did occur at boot during KDE session opening,
  which was the first time ever, so I wouldn't say this are getting better.
  But then, no further shutdown until now, 4 hours later.
- Could occur, typically, while closing/re-opening several times consecutively
  eg. Firefox web browser, multiple tabs with "heavy" or rather long web pages,
  (such as redhat bugzilla * query results or kernel.org changelog :)

Whenever the random devil is on my side, I can run fine about anything at once:
- a VM guest (host partitions sit in a LUKS VG, hence it's rather intensive)
- yum updates, prelink -a, rpm -Va, rsync some 20-gigabytes-sized tarballs
- play some flash video full screen on internal display with Firefox
- play some other video full screen an external display with VLC
- also run browser, email/chat/IRC clients, even desktop effects
and still fail to cause the issue :
- CPU/GPU would both come close to a maximum of 78ºC
- unique fan spins to maximum speed, succeeds to cool the system down
- KDE is only getting a little slow, nothing else.
GNU/Linux rocks, then :)

Critical shutdowns, as ever, "may or may not" appear in /var/log/messages.
System "may or may not" have time to step down RLs (or just sync disk ?)
Whenever logged, the "bogus" 127 ºC value is always reported.

So, I wrote a poor-man watchdog/logger script to track this. So far
- I run it within a root screen started from tty2, when KDM login is ready
- It uses `nvidia-smi` and `apci` to monitor temperatures and set priorities.
  I shall now modify it to read *also* from /sys/class/thermal,
  as well as `nvidia-settings -t -q` for the GPU
- it reduces delay between readings as the priority is increasing

      CPU_NOTICE  CPU_WARNING  CPU_CRITICAL ->   72 74 76
      GPU_NOTICE  GPU_WARNING  GPU_CRITICAL ->   70 72 74
WAIT WAIT_NOTICE WAIT_WARNING WAIT_CRITICAL -> 3  2  1  0

More test if >= Warning eg. CPU load average (which increases delay by 0.5s)
Even more if >= Critical:  top 5 processes for CPU and RAM usage.
Logger is called at each first or returning "notice", or any above prio.
Sync is run after each logger, but it doesn't seem to help catching any more.
Stars as in "ºC*", denote a new highest temperature within current execution.

# grep -E "ºC|temperature|\(proc\) stopped|kmsg started" /var/log/messages
May 13 16:46:36 venus /root/heat.sh:   log | notice   | GPU:  69 ºC   82 MB | CPU:  70 ºC
May 13 16:56:02 venus /root/heat.sh:   log | notice   | GPU:  69 ºC* 105 MB | CPU:  71 ºC*
May 13 16:56:16 venus /root/heat.sh:   log | notice   | GPU:  69 ºC   82 MB | CPU:  72 ºC*
May 13 17:01:12 venus kernel: imklog 5.8.10, log source = /proc/kmsg started.
May 13 17:07:34 venus kernel: [  460.807678] Critical temperature reached (127 C), shutting down.
May 13 17:07:34 venus kernel: Kernel logging (proc) stopped.
May 13 17:10:59 venus kernel: imklog 5.8.10, log source = /proc/kmsg started.
May 13 17:25:08 venus /root/heat.sh:   log | notice   | GPU:  68 ºC*  58 MB | CPU:  69 ºC*
May 13 18:45:04 venus /root/heat.sh:   log | notice   | GPU:  67 ºC*  86 MB | CPU:  73 ºC*
May 13 18:45:15 venus /root/heat.sh:   log | notice   | GPU:  68 ºC*  91 MB | CPU:  72 ºC

My poor-man conclusion so far :
- Warning wasn't even reached when 2nd of today's shutdowns occurred at 17:00
  (which was *not* logged)
- Notice wasn't even reached when 3rd of today's shutdowns occurred at 17:07
  (which *was* magically logged this time)
- No such real temperature jump could occur within a 1-second laps time
- here the "127" value either is bogus indeed, or means something else.

Any suggestion more than welcome.
Thank you if you read through :)
Comment 82 Xavier Hourcade 2012-05-14 22:12:18 UTC
Created attachment 73296 [details]
acpi readings, dump, dmesg and heat watchdog (no shutdown)

Here is a first data set, despite the devil being on my side today :)

grep . /proc/acpi/*/*
grep . /sys/class/thermal/*/*
ls -la /proc/acpi/ /proc/acpi/*
ls -la /sys/class/thermal/*/*
acpidump
dmesg
sed -n '/May 14 21/{h;b};H;${x;p}' /var/log/messages \
  | grep -E "(ºC|temperature|ACPI|\(proc\) stopped|kmsg started)"

Fresh booted (now 3.3.5-2.fc16.x86_64), switched tty at KDM login, started my heat.sh screen (didn't extend sensor readings, yet) and collected the above a first time
cf. acpi-kdm.log

Then I launched KDE session, attached an external display, stressed the system quite a lot and collect the above a second time:
cf. acpi-session.log

Tree under /proc/acpi is smaller than reported in previous comments (?)
Both trip points are, also, very close to each other (all defaults)
Highest temperatures were 75 & 80 ºC respectively, all fine.

Should I collect these again, or anything else, once only or via the watcher prior to future shutdowns ?
Comment 83 Xavier Hourcade 2012-05-19 20:36:23 UTC
Created attachment 73332 [details]
acpi readings, dump, dmesg and heat watchdog (shutdown on May 16th 23:15)

An unattended shutdown did occur here, again, on May 16th 23:15, first time using this kernel (latest stable at fedora, as always):

  kernel-3.3.5-2.fc16.x86_64

At the time of the event, system was "only" playing a flash video full screen (using firefox @fedora, flash-plugin @adobe, kmod-nvidia @rpmfusion, using external display only at 1680x1050 -- nothing huge).

I booted again immediately after that, and since then, well, "the devil has been back on my side" : system has been running 24/7 with no issues, including extremely heavy loads during several consecutive hours (I pushed it, really).


What surprises me, is that, at the time of the shutdown :
- my "heat.sh" script *had* the time to detect and call logger
- it did log "crit" events -- not just once, but even *twice*
- again, there is a 0.5s delay to estimate the CPU average load,
  hence my "crit" level was reached for more than 1 second (and less than 1.5s)
  before system was powered off.
- syslog/kernel however... did *not* log anything !

"wt...?" :)


Please see the grep'ed /var/log/messages (at the end of the attached report):

Line 7010 :

* I had set "notice" event level to the following temperature ranges :
  - CPU: 74..77 ºC (as read by acpi -t, i.e. /sys/class/thermal)
  - GPU: 71..73 ºC (as read by nvidia-smi -q | sed)

* I had set a 3 seconds pause when temperatures were at my "yawn" level
  a reduced 2 seconds pause when "noticed" level was reached
  a reduced 1 and 0 seconds for "warning" and "crit", respectively
  (plus the 0.5s laps to measure CPU average usage, for every levels)

Lines 7012 through 7026 :

* Since the flash video started to be played full screen,
  "notice" was reached on a regular basis (up to 6 times per minute, maximum),
  but never persistently (i.e. temperature did drop again to "yawn", every time)

Lines 7027 though 7039 :
   
* All of a sudden (that means within a 3.5 seconds time laps, maximum,
  after the preceding "yawn" event which did not have to log anything),
  then event level switched to "crit" (no "notice" or "warning")
  CPU temperature holding the 127 magic value once again

* System was under no particular load (for such usage) at the time of the bug.
  Even GPU temperature was, well, rather low...


Does this helps a little ? This bug -- and above everything, the absence of syslog/kernel event -- really sucks :/

Please advise, thanks.
Comment 84 Xavier Hourcade 2012-05-30 07:50:06 UTC
Created attachment 73462 [details]
6 last critical shutdowns,  commented very light log greps

Still applies here to all kernels up to (including) 

  kernel-3.3.6-3.fc16.x86_64
  kernel-3.3.7-1.fc16.x86_64

Attached greps are commented, 6 critical shutdowns over 6 days :/

In some of them, the log was more or less abruptly closed, as shown.

No. 6 occurred while re-starting right after no. 5, during the root partition passphrase (luks) prompt.

Is it the same ACPI code which is running at this time ?

In regards to the fan speed at power on: I think the hardware simply stores, *always* (i.e. even under normal conditions) the temperature or fan speed at power off and replicates it blindly at the next power on (then adjusting it after a few seconds, before any boot loader is executed if it has time to do so). I have noticed quite high fan speed (but not maximum) when I did poweroff/on fine, just to boot a new kernel after a long session with heavy load.
Comment 85 Xavier Hourcade 2012-05-30 14:06:38 UTC
Created attachment 73467 [details]
Explicit ACPI/EC errors in log context

Explicit ACPI/EC error messages were caught here, first time:
( same kernel as previous comment, second crash today  o/ )


ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20120111/evregion-501)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.SBRG.EC0_.GBTT] (Node ffff880131dc4be0), AE_TIME (20120111/psparse-536)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.BAT1._BST] (Node ffff880131ddd410), AE_TIME (20120111/psparse-536)
ACPI Exception: AE_TIME, Evaluating _BST (20120111/battery-455)
ACPI: EC: input buffer is not empty, aborting transaction
ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20120111/evregion-501)
ACPI Error: Method parse/execution failed [\_TZ_.RTMP] (Node ffff880131dc5988), AE_TIME (20120111/psparse-536)
ACPI Error: Method parse/execution failed [\_TZ_.THRM._TMP] (Node ffff880131dc57f8), AE_TIME (20120111/psparse-536)


Notes (please see attachment)

* Errors appeared within a series of 4 "critical" events of mine,
  so the 127 value was readable during more than 2 seconds

* After these errors, /sys/class/thermal/thermal_zone0/temp seems to be gone.
  However,`acpi -t` (as used by my script) still returns 127 (?)

* The "Critical temperature" error wasn't logged however
  (and the log wasn't closed correctly)
Comment 86 Xavier Hourcade 2012-05-30 22:21:13 UTC
Created attachment 73472 [details]
complete /var/log/messages for 2 identical shutdowns, different messaging

Full messages might help for these last two, also today o/

Both occurred *exactly* while performing the same task:
* No Flash involved! Just Konsole, Kate and some erroneous Firefox profile,
  which launch locked the CPU every time at 100% - and beyond, said `ps` :)
* My script WAITs were set to 1,1,0,0 (+ 0.5s for cpu % average, still)

Experience was also, exactly the same i.e. Firefox launch, sudden fan max speed, then very brief OS shutdown screen/console messages, power off.

However, messaging was again, "head or tail":

May 30 16:25:18---21
* My script caught *five* critical events this time (2.5+ seconds), all complete
* Kernel reported *part* of the ACPI/EC explicit error (cf. previous comment)
* But no "Critical temperature shutdown" message, log was aborted.

May 30 18:02:14
* My script had reported a *unique* (!) "notice" at 74 ºC* two minutes before,
  but had cooled back to "yawn" after that (hence looping each 1.5s)
* Kernel reported (part of) the same ACPI/EC error message cf. previous comment
* The "Critical temperature shutdown" message *was* reported, log completed.

So it seems like I can reproduce the issue "at will" and my reports are all scripted - it's only messaging which is still "head or tail".

Could anyone please guide me any further ?
Comment 87 Xavier Hourcade 2012-05-30 22:24:26 UTC
Erratum for the second one, sorry :

May 30 18:02:14
* My script had reported a *unique* (!) "notice" at 74 ºC* two minutes before,
  but had cooled back to "yawn" after that (hence looping each 1.5s)
* Kernel reported *none_of_the* ACPI/EC error messages
* The "Critical temperature shutdown" message *was* reported, log completed.
Comment 88 Stefan Vogel 2012-05-31 17:13:25 UTC
I have a V1Sn with a Core2Duo T7700.
It suffers the same problem - spurious shutdown with critical temperature at 127ºC (== 0x3F). It happens with both Bios 301 and 302, EC Bios is 1000B80001 (both for 301 and 302).

With OpenSuse 12.1 and kernel-desktop 3.3.x (from the Tubleweed repo) this happened only every other week. After updating to kernel-desktop 3.4.0 the shutdown happens 2-4 times a day. 

This is definitely not a temperature problem. I can put heavy load on the CPU, so the temperature raises to 80ºC for an hour - and no shutdown happens. Sometimes, with nearly no load the CPU has a temperature of 60-70ºC, and suddenly pretends to raise to 127ºC within a second.

I am switching back to 3.3.6 now.
Comment 89 Len Brown 2012-06-05 05:02:31 UTC
> This is definitely not a temperature problem.

It wasn't useful for the original submitter, because their EC
shuts down the system, but possibly your box is different...
You can use the kernel cmdline "thermal.nocrt=1"
or the module parameter "options thermal nocrt=1"
and that will disable the action taken when the
critical trip point fires.

When it fires, you can then monitor the temperature --
eg. see if it stays high, or spike goes away etc.

It seems that the EC on this box is basically going out to lunch.
If we have Linux ignore the bogus temperature reading, the EC
takes matters into its own hands and shuts off the system.

One possible approach would be to neuter the ACPI EC driver
altogether so that it doesn't load.  For it may be interaction
with the Linux EC driver that is confusing the motherboard EC.
Comment 90 Xavier Hourcade 2012-06-07 23:06:53 UTC
Created attachment 73535 [details]
More ACPI error messages, less system handling ?

  % grep ºº heat.after.Jun.7.01.13.log-serie

will show my inserted comments.

No error in 3 days then 3 in a row tonight, with slightly more detailed ACPI messaging than I did catch so far.
Comment 91 Xavier Hourcade 2012-06-08 00:06:20 UTC
(In reply to comment #88)
> at 127ºC (== 0x3F).

Am I missing something or is it not 0x7f ? Could the value have another meaning yet to be understood, while comment #18 suggested the hardware was fine under Wind'Oz, and while 0x7f and 0x80 are only a shift away ?

$ grep "THRM" acpi/DSDT.dsl

Notify (\_TZ.THRM, 0x80)
Notify (\_TZ.THRM, 0x80)
Notify (THRM, 0x81)
Notify (\_TZ.THRM, 0x80)

--

If (Arg0)
  {
      ShiftLeft (CR25, 0x02, Local1)
      If (Arg1)
      {
          Or (CR02, 0x80, CR02)
      }
      Else
      {
          And (CR02, 0x7F, CR02)
      }
  }
Comment 92 Xavier Hourcade 2012-06-16 16:08:51 UTC
erratum: s/shift/one's complement/

(In reply to comment #88)
> EC Bios is 1000B80001 (both for 301 and 302).

Same EC version here (and Bios 301)

(In reply to comment #89)
> You can use the kernel cmdline "thermal.nocrt=1"

Sorry I hadn't confirmed this, using this option makes no difference here, certainly as you described here :

> It seems that the EC on this box is basically going out to lunch.
> If we have Linux ignore the bogus temperature reading, the EC
> takes matters into its own hands and shuts off the system.

Yesterday I have ran a native Wind'Oz Vista on this system i.e. the OEM install up to date. It includes Asus's ACPI drivers named as "ATKxxx". This was the third time since I experience this issue under Linux, and Vista still didn't fail despite heavy load. Is there anything I could do/extract/dump from ths Vista installation to help better understand the issue ?

> One possible approach would be to neuter the ACPI EC driver
> altogether so that it doesn't load.  For it may be interaction
> with the Linux EC driver that is confusing the motherboard EC.

Could you please point me to further instructions on this path ?
Comment 93 Xavier Hourcade 2012-06-16 16:21:58 UTC
Created attachment 73721 [details]
Open Hardware Monitor screencap under Wind'Oz Vista

All seemed fine there.

Two distinct sensors were found and reported by Open Hardware Monitor, one for each CPU Core. Their recorded values only differ very slightly, which would make sense, up to 3 Celsius thou.

How many physical sensors would there be, one or two ?
Comment 94 Xavier Hourcade 2012-06-16 16:23:50 UTC
Created attachment 73731 [details]
Open Hardware Monitor text report under Wind'Oz Vista
Comment 95 Stefan Vogel 2012-06-16 16:24:51 UTC
After falling back to linux 3.3.6 (3.3.6-jng5-desktop #1 SMP PREEMPT Tue May 15 16:09:46 UTC 2012 (241577b) x86_64) on May 31, I didn't suffer a single shutdown in the last 16 days. With 3.4 it did shut down 2 to 4 times a day.
Comment 96 Xavier Hourcade 2012-06-16 17:16:18 UTC
(In reply to comment #95)
> After falling back to linux 3.3.6 (3.3.6-jng5-desktop #1 SMP PREEMPT Tue May
> 15
> 16:09:46 UTC 2012 (241577b) x86_64) on May 31, I didn't suffer a single
> shutdown in the last 16 days. With 3.4 it did shut down 2 to 4 times a day.

You're very lucky :)

Here I haven't tried the 3.4 series yet, and I think all Fedora 16 stock kernels have presented the issue so far, for sure these latest ones since I monitored everything much more carefully:

  kernel-3.3.5-2.fc16.x86_64 (since May 14)
  kernel-3.3.6-3.fc16.x86_64 (since May 22)
  kernel-3.3.7-1.fc16.x86_64 (since May 27)
  kernel-3.3.8-1.fc16.x86_64 (since Jun 14)

All of these did *feel* more stable sometimes, i.e. several days long without any issues, but again this is just an illusion : all of a sudden sh*t may happen at any time (last was yesterday here, several in a row again).

I am also under the impression there are boots that will work and others that will fail. In other words, I try to "sleep" rather than poweroff, and so far it has proven to much extend the usability duration, but obviously this is simply impossible to prove.
Comment 97 Xavier Hourcade 2012-07-02 17:50:38 UTC
(In reply to comment #95)
> After falling back to linux 3.3.6 (3.3.6-jng5-desktop #1 SMP PREEMPT Tue May
> 15
> 16:09:46 UTC 2012 (241577b) x86_64) on May 31, I didn't suffer a single
> shutdown in the last 16 days. With 3.4 it did shut down 2 to 4 times a day.

Stefan et al, you might want to give another chance to the 3.4 series:

  kernel-3.4.2-1.fc16.x86_64 (since Jun 20): "so far so good, hope !"

That was strictly my first boot of any 3.4 kernel on this hardware: I've been suspending it every day instead of powering off, almost 12 days lies uptime. That's stock kernel from Fedora 16 stable update repositories, I've had all sorts of very heavy loads, and ambient temperature even massively increased in the mean time (which seemed to make faults more frequent).

So, either the issue is fixed for my hardware (no related changelog I could find, thou ?) ...or this could just confirm my "impression" cf. comment #94. Anyway, I'll have to reboot quite a few times this week. If disillusion comes, I will share :)
Comment 98 Xavier Hourcade 2012-07-02 17:53:34 UTC
(errata sorry)

(In reply to comment #95)
> After falling back to linux 3.3.6 (3.3.6-jng5-desktop #1 SMP PREEMPT Tue May
> 15
> 16:09:46 UTC 2012 (241577b) x86_64) on May 31, I didn't suffer a single
> shutdown in the last 16 days. With 3.4 it did shut down 2 to 4 times a day.

Stefan et al, you might want to give another chance to the 3.4 series:

  kernel-3.4.2-1.fc16.x86_64 (since Jun 20): "so far so good, hope !"

That was strictly my first boot of any 3.4 kernel on this hardware: I've been
suspending it every day instead of powering off, almost 12 days lies uptime.
That's stock kernel from Fedora 16 stable update repositories, I've had all
sorts of very heavy loads, and ambient temperature even massively increased in
the mean time (which *used* to to make faults more frequent, previously).

So, either the issue is fixed for my hardware (no related changelog I could
find, thou ?) ...or this could just confirm my "impression" cf. comment * #96 *.
Anyway, I'll have to reboot quite a few times this week. If disillusion comes,
I will share :)
Comment 99 Xavier Hourcade 2012-07-06 11:15:35 UTC
Devil came back this morning, still running the same kernel.

  kernel-3.4.2-1.fc16.x86_64 (since Jun 20): fault occurred

I did power off every day since July 2nd. Yesterday, I even "challenged" the issue, to make sure I could safely run some important process, but all remained just fine. I "challenged" it again this morning, and this time it occurred immediately. This was my 7th cold boot running this kernel.

So, 3.4 is definitely *no worse* nor better than any other in this regard. Whether this is related or not, my overall experience here is actually better.

I'll now *suspend* the machine as much as I can again, to extend my lucky heaven session by a few weeks. Feels like drinking to forget something ! :)
Comment 100 Xavier Hourcade 2012-07-23 14:25:36 UTC
Still occurs, neither worse nor better cf. frequency, with:

  kernel-3.4.4-4.fc16.x86_64 (since July 10): fault occurred

I've tested more kernels, no more luck, hence default to this one ever since.

Interestingly enough, issue also occurs with... Vista's Memory Diagnostic tool (as available from Vista boot menu entry). But it still does *not* occur under Vista itself. This is an OEM install from Asus DVD, with all factory drivers updated, and all Windows Update patches applied. I only use it as a "witness" OS for hardware checks, and I cannot tell about its "Safe Mode", since it has *always* given me a BSOD here, nothing else... :D

Vista is a poem by itself anyway, present other core issues here, and is not even a "safe" option for this hardware : yet, I could see graphic artefacts under heavy GPU load, which to my experience are premises to nVidia adapter physical damages [1]. All is fine on this side under GNU/Linux. You've been warned!

Maybe worth mentioning, Open Hardware Monitor [2] reads ACPI fine, despite confusing FSB speed *if* launched under heavy load [3].

Asus' drivers and tools are still available at [4], including their ACPI driver, named "ATK0100" [5]. Besides, their hardware monitoring tool, named "NBProbe" [6], no longer works here since Windows Update's July patches were applied (but this is strictly a GUI-related issue I think, and Open Hardware Monitor still works well).

So to my understanding, this would confirm the Asus ACPI driver knows and achieve to workaround our BIOS/EC factory defect, while Linux kernel doesn't know how to deal with it.

=> Could anyone please confirm ?

=> Could we make any further reading/sniffing under Vista maybe, in order to help replicating this workaround as a patch to Linux kernels, finally ?

[1] http://www.nvidiadefect.com
[2] https://code.google.com/p/open-hardware-monitor
[3] https://code.google.com/p/open-hardware-monitor/issues/detail?id=309
[4] http://support.asus.com/download.aspx?SLanguage=en&p=3&m=V1S
[5] http://dlcdnet.asus.com/pub/ASUS/nb/F3Tc/ATKDrv_Vista32bit_070126.zip
[6] http://dlcdnet.asus.com/pub/ASUS/nb/Apps/NB_Probe/vista/V3.0.0033/NB_Probe_VT_070918.zip
Comment 101 Xavier Hourcade 2012-07-23 14:48:18 UTC
Maybe worth mentioning also : Open Hardware Monitor reports distinct thermal values for each CPU Core. See my reports at [1] for a slight example of this. Under heavy load, I could observe differences as high as 3 or 4 degrees Celsius between cores, at a given second. I'm not sure how much sense does this make.

=> Could hardware sensors be better exposed or recognized, under Vista with Asus ACPI drivers, than by Linux kernel ?

=> Could our "bogus" 127 value be intentional and part of some sequence, meant to say, for example : « here will follows Core1 temperature, as it now differs from Core0 » ?

[1] https://code.google.com/p/open-hardware-monitor/issues/detail?id=309#c3
Comment 102 Xavier Hourcade 2012-07-23 15:01:22 UTC
=> Or, could the temperature be passed over more than 7 bits in our case ? Hence meant to be interpreted as Core1 temperature *if* the lowest 7 bits were equal to 127, thus supposed to be taken off before reading the effective value ?

I'm not sure if such speculation is any credible or even possible -- that would be pretty ugly coding ! How could we double check this ?
Comment 103 Lan Tianyu 2013-04-11 05:55:34 UTC
Could you try the follow patch and attach the dmesg with open ec debug patch?
Thanks

diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index d45b287..6376fd2 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -223,7 +223,7 @@ static int ec_check_sci_sync(struct acpi_ec *ec, u8 state)
 static int ec_poll(struct acpi_ec *ec)
 {
        unsigned long flags;
-       int repeat = 2; /* number of command restarts */
+       int repeat = 4; /* number of command restarts */
        while (repeat--) {
                unsigned long delay = jiffies +
                        msecs_to_jiffies(ec_delay);
@@ -241,8 +241,10 @@ static int ec_poll(struct acpi_ec *ec)
                        }
                        advance_transaction(ec, acpi_ec_read_status(ec));
                } while (time_before(jiffies, delay));
-               if (acpi_ec_read_status(ec) & ACPI_EC_FLAG_IBF)
-                       break;
+
+               //if (acpi_ec_read_status(ec) & ACPI_EC_FLAG_IBF)
+               //      break;
+               pr_info(PREFIX "%s: repeat %d \n", __func__, 4 - repeat);
                pr_debug(PREFIX "controller reset, restart transaction\n");
                spin_lock_irqsave(&ec->lock, flags);
                start_transaction(ec);


Open ec debug patch.
diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index d45b287..34189f4 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -27,7 +27,7 @@
  */

 /* Uncomment next line to get verbose printout */
-/* #define DEBUG */
+#define DEBUG

 #include <linux/kernel.h>
 #include <linux/module.h>
Comment 104 Lan Tianyu 2013-04-16 01:33:07 UTC
ping ...
Comment 105 Stefan Vogel 2013-04-16 09:24:23 UTC
Maybe its superstition, but when I suspend to RAM immediately after a boot, the shutdown happens only about every other two weeks. Without the suspend, the temperature shutdown occurs every other hour.

Now I am up since 6 days :D

BTW. there is code in the thermal driver to perform a shutdown due to high temperature. But even after disabling the thermal driver, the shutdown occurs. So I think, the shutdown is initiated by EC and BIOS directly.
Comment 106 Lan Tianyu 2013-04-16 11:47:12 UTC
At last, EC driver doesn't initiate show down.
Have you tried the patch in the comment 103?
The patch is to fix the following error log.

ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl]
(20120111/evregion-501)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.SBRG.EC0_.GBTT] (Node
ffff880131dc4be0), AE_TIME (20120111/psparse-536)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.BAT1._BST] (Node
ffff880131ddd410), AE_TIME (20120111/psparse-536)
ACPI Exception: AE_TIME, Evaluating _BST (20120111/battery-455)
ACPI: EC: input buffer is not empty, aborting transaction
ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl]
(20120111/evregion-501)
ACPI Error: Method parse/execution failed [\_TZ_.RTMP] (Node ffff880131dc5988),
AE_TIME (20120111/psparse-536)
ACPI Error: Method parse/execution failed [\_TZ_.THRM._TMP] (Node
ffff880131dc57f8), AE_TIME (20120111/psparse-536)

From the previous comments I see, the shut down is triggered by 127C.
Comment 107 Xavier Hourcade 2013-04-16 14:57:46 UTC
(In reply to comment #104)
> ping ...

Thank you Lan for your patch, which I will try for sure then report here, but not before a few weeks from now.

It seems to me there are several, distinct issues. Here is a paste of my notes to date, if this may help as yet?

How to reproduce, typically:
- dual-head, allocating one CPU core to a VM guest
- closing/re-opening the Fatest-Firefox-profile-ever™ several times
- if needed, start eg. building a kernel and/or heavy LUKS write/swap.

Hardware observations
- cooling system is common to CPU/GPU on this hardware
- nouveau kernel module gets the GPU to heat more,
  and brings many more failures than nvidia driver (by far).
- the higher ambient temperature, the more occurrences
- at ambient temperature of 16ºC or under, issue seems to
  entirely disappear, no matter the system usage
- at any time and under any OS, maximum "true" temperatures
  seen were 86/75ºC (CPU/GPU), which matches specifications.

Shutdown observations
- may occur at grub screen
- may occur under Vista's boot-time "memory test" software
- may occur from GNU/Linux pre-boot or at any later stage
- may occur disregarding to the "thermal.nocrt" parameter,
- may follow bogus value report from kernel itself
- may follow bogus value report from monitor script, reading from /sys
  during several seconds, sometimes in the absence of kernel messaging
- may occur while any of these reports are printed to screen,
  but before they could be written to disk (via logger)
- may follow none of these reports at all
- most often occurs in series, same day
- series often start while CPU is under heavy load
- pm-suspend clearly extends "good mood" by several weeks
- when EC is "in a good moon", overall fan quieter from boot

Applies to
- all Fedora stock kernels at least until 3.6.11
  (later kernels not yet tested on this hardware)

Does NOT seem to apply to (only a few test so far)
- older Debian kernels e.g. 2.6 (Squeeze) or even 3.4 (Wheezy)
- Vista "normal" session

Hence there might be several distinct issues (?)
- a built-in EC defect (soft-fixed by Asus/Vista ?)
- kernel /sys reading clearly not consistent
- kernel messaging absent while monitor script succeeds in reporting
  and is even logging to disk over several seconds (via logger)
- kernel ignoring thermal.crt and shutting down nevertheless (?)

OS  Since       Linux  Distro Video  Occurrences
--------------------------------------------------------
4 : 2013-01-16  2.6.32 5.deb6 nouveau  0 (3 cold boots)
5 : 2013-01-16  3.2.35  .deb7 nouveau  0 (2 cold boots)
1 : 2012-07-28  3.4.6  1.fc16 nvidia   0
1 : 2012-08-07  3.4.7  1.fc16 nvidia   2 (over 2 days)
1 : 2012-08-19  3.4.9  1.fc16 nvidia   2 (over 2 days)
1 : 2012-08-31  3.4.9  2.fc16 nvidia  15 (over 7 days)
1 : 2012-09-23  3.4.11 1.fc16 nvidia   2 (over 2 days)
2 : 2012-08-12  3.5.1  1.fc17 nouveau Dies within secs
3 : 2012-09-05  3.6.0  x.fc18 nouveau  8 (20+ cold boots)
1 : 2012-10-24  3.6.2  1.fc16 nvidia   0 (cold boot)
1 : 2012-11-06  3.6.5  2.fc16 nvidia   0 (warm boot)
1 : 2012-11-09  3.6.6  1.fc16 nvidia   0 (warm boot)
1 : 2012-11-10  3.6.6  1.fc16 nouveau  1 (warm boot)
1 : 2012-11-29  3.6.7  4.fc16 nouveau  9 (over 3 days)
1 : 2012-12-18  3.6.7  4.fc16 nvidia   2 (over 1 day)
1 : 2012-12-18  3.6.10 2.fc16 nvidia  13 (over 7 days)
3 : 2013-01-05  3.6.10 4.fc18 nouveau  2 (3rd cold boot)
1 : 2013-01-05  3.6.11 1.fc16 nvidia  12 (over 6 days)
1 : 2013-01-28  3.6.11 4.fc16 nvidia   9 (over 7 days)

OS notes
1 : Fedora 16 stable HD install, intensive use 365d/y!
    3.4.6 to 3.4.11: hotest ambient temperature
    3.4.9: more cold boots (between OS 2&3 tests)
    3.6.2 to 3.6.5 : did not try nouveau
    3.6.2 to 3.6.6 : 16ºC - ambient temperature
    3.6.7 to 3.6.11: warmer ambient temperature
2 : Fedora 17 stable HD install, for test, light usage
3 : Fedora 18 Live, just opening many apps at once
4 : Squeeze 6.0.6 live, heavy testing, no issues
    (building kernel and other tasks all at once)
5 : Wheezy 7.b3 live, heavy testing, no issues

Other

Sometimes (but not always) immediately after installing a new kernel with yum, or running plymouth-theme -R, firmware gets upset - system becomes "clunky" (half of keystokes would get lost while typing) while messages log is filled up with the following pattern :

Dec 18 14:45:33 hostname kernel: [50913.287022] ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20120711/evregion-501)
Dec 18 14:45:33 hostname kernel: [50913.287040] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SBRG.EC0_.RRAM] (Node ffff880131dccc58), AE_TIME (20120711/psparse-536)
Dec 18 14:45:33 hostname kernel: [50913.287054] ACPI Error: Method parse/execution failed [\ODTS] (Node ffff880131de4f00), AE_TIME (20120711/psparse-536)
Dec 18 14:45:33 hostname kernel: [50913.287062] ACPI Error: Method parse/execution failed [\_GPE._L02] (Node ffff880131db8280), AE_TIME (20120711/psparse-536)
Dec 18 14:45:33 hostname kernel: [50913.287073] ACPI Exception: AE_TIME, while evaluating GPE method [_L02] (20120711/evgpe-560)
Dec 18 14:45:34 hostname kernel: [50914.498033] ACPI: EC: input buffer is not empty, aborting transaction
Dec 18 14:45:34 hostname kernel: [50914.498041] ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20120711/evregion-501)
Comment 108 Xavier Hourcade 2013-04-16 15:27:13 UTC
(In reply to comment #105)
> Maybe its superstition, but when I suspend to RAM immediately after a boot,
> the
> shutdown happens only about every other two weeks. Without the suspend, the
> temperature shutdown occurs every other hour.
> 
> Now I am up since 6 days :D

Surely not superstition, I can also confirm suspend MUCH helps. From my stats here above, in the end I discarded all strict conclusion thou, it's all very random (without suspending, it may well be stable too during days here). Being a little cautious (and "lucky"), it's easy to get several weeks in the row without any shutdown. And since we can "probe" stability prior to run heavier processes, the hardware remains usable generally, without serious impact for daily usage (and no data loss, thanks to ext4 here).

> BTW. there is code in the thermal driver to perform a shutdown due to high
> temperature. But even after disabling the thermal driver, the shutdown
> occurs.
> So I think, the shutdown is initiated by EC and BIOS directly.

Yes -- unless we have a distinct issue since I saw kernel messaging announcing shutdown several times despite it was disabled.

Nevertheless it did occur also frequently at GRUB menu or even under Vista memory test (which I ran twice just to test this issue, cf. observations above). Hence yes, the EC/BIOS is actively shutting down the hardware. I think it even happened at the BIOS password screen, even once within BIOS interface IIRC (within a series, immediately after reboot).
Comment 109 Lan Tianyu 2013-04-22 02:36:19 UTC
Hi all, pleas try test this issue with kernel parameter "thermal.nocrt=1" and record the temp per second. I'd like to see whether the temp will return to normal value after reaching 127C.

Before testing, please apply the patch in the comment 103, it resolves some EC transaction timeout problem and it seems not related with shutdown problem. From the log in the comment #85. The temp already reached 127C before the timeout issue happened. If the issue happen, no temp will return from the thermal driver  except for -ENODEV error code.

	/sys/class/thermal/cooling_device2/cur_state:0
	/sys/class/thermal/cooling_device2/max_state:15
	/sys/class/thermal/cooling_device2/type:LCD
	/sys/class/thermal/thermal_zone0/cdev0_trip_point:1
	/sys/class/thermal/thermal_zone0/cdev1_trip_point:1
	/sys/class/thermal/thermal_zone0/mode:enabled
	/sys/class/thermal/thermal_zone0/temp:127000
	/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
	/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
	/sys/class/thermal/the
May 30 12:49:18 venus kernel: [38169.530102] ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20120111/evregion-501)
May 30 12:49:18 venus kernel: [38169.530140] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SBRG.EC0_.GBTT] (Node ffff880131dc4be0), AE_TIME (20120111/psparse-536)
May 30 12:49:18 venus kernel: [38169.530171] ACPI Error: Method parse/execution failed [\_SB_.PCI0.BAT1._BST] (Node ffff880131ddd410), AE_TIME (20120111/psparse-536)
May 30 12:49:18 venus kernel: [38169.530263] ACPI Exception: AE_TIME, Evaluating _BST (20120111/battery-455)
May 30 12:49:18 venus kernel: [38170.030055] ACPI: EC: input buffer is not empty, aborting transaction
Comment 110 Lan Tianyu 2013-05-05 12:28:42 UTC
ping... Someone can do the test?
Comment 111 Lan Tianyu 2013-05-28 01:22:33 UTC
Since no response for long time. Close this bug as will fix later. Please feel free to reopen this bug if able to provide more debug info.