Bug 20242

Summary: 2.6.31 -> 2.6.32 regression: intermittent Thermal zone sensor reports 0°C --- Quanta tw8, tw9 --- probably caused by commit 2a84cb9852f52c0cd1c48bca41a8792d44ad06cc
Product: ACPI
Reporter: Jan-Matthias Braun (jan_braun)
Component: EC
Assignee: Zhang Rui (rui.zhang)
Status: CLOSED CODE_FIX
Severity: high
CC: florian, lenb, lv.zheng
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 2.6.32 upwards
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: acpidump output
custom _TMP method
dmesg for 2.6.31-rc8 with DEBUG defined in ec.c
dmesg for 2.6.36 with DEBUG defined in ec.c
dmidecode output
dmidecode output for the Quanta TW9
patch: enable MSI workaround for Quanta laptops

Description Jan-Matthias Braun 2010-10-13 12:36:10 UTC
Created attachment 33462 [details]
acpidump output

After a random uptime, the thermal zone starts reporting a system temperature of 0°C.
The fan immediately turns off.
The system then halts semi-automatically once it gets above 110°C.

This behavior was introduced between kernel versions 2.6.31 and 2.6.32. It persists up to 2.6.36-rc6 and was introduced before or with commit b963bd39c9000328f6ce4f12aa52abbb0c68ee91. (I have started testing with “git bisect”, but since the time until the thermal zone is lost is arbitrary and can range from 5 minutes to more than a week, I have difficulty marking any commit as good.)

No dependency on other conditions could be identified so far. Trying random kernel boot flags from other problem descriptions (e.g. acpi_osi, acpi.power_nocheck, noapic and nolapic) did not improve the situation.
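
Since the failure can take anywhere from minutes to more than a week to appear, a small watcher that records the moment the zone first reads 0 might make individual bisect runs easier to judge. The following is only a sketch: the function names temp_is_zero and watch_zone are made up for illustration, and it assumes the sysfs temp file format (millidegrees Celsius, so a failed zone reads exactly 0).

```shell
# Hypothetical helpers, not part of any kernel tooling.

temp_is_zero() {
    # $1: file holding the temperature in millidegrees Celsius (sysfs format)
    [ "$(cat "$1" 2>/dev/null)" = "0" ]
}

watch_zone() {
    # $1: temperature file, $2: poll interval in seconds,
    # $3: maximum number of polls (0 = poll forever)
    file=$1; interval=$2; max=$3; n=0
    while [ "$max" -eq 0 ] || [ "$n" -lt "$max" ]; do
        if temp_is_zero "$file"; then
            # /proc/uptime holds the uptime in seconds as its first field
            echo "zone reads 0 after $n polls, uptime $(cut -d' ' -f1 /proc/uptime 2>/dev/null)s"
            return 0
        fi
        n=$((n + 1))
        sleep "$interval"
    done
    return 1   # the zone never read 0 within the allowed number of polls
}
```

A call such as watch_zone /sys/class/thermal/thermal_zone0/temp 10 0 would poll every ten seconds until the zone drops to 0; on older kernels the path and file format may differ.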
Comment 1 Jan-Matthias Braun 2010-10-13 12:44:01 UTC
Besides: "Hello!" to everybody.

Is there any other information needed?
Comment 2 Jan-Matthias Braun 2010-10-13 12:46:26 UTC
On the architecture side x86 and x86_64 have been tested with kernels up to 2.6.33, showing the same behavior.
Comment 3 Zhang Rui 2010-10-18 01:38:24 UTC
Created attachment 33872 [details]
custom _TMP method

please
1. mount the debugfs
2. apply this custom _TMP method by running "cat _TMP.aml > /sys/kernel/debug/acpi/custom_method"
3. enable the ACPI debug output by running "echo 1 > /sys/module/acpi/parameters/aml_debug_output"
4. dmesg -c
5. read the temperature by running "grep . /sys/class/thermal/thermal_zone*/*"
6. dmesg > dmesg.log
and then attach the dmesg.log here.

please do the same tests both when the temperature is normal and when it drops to 0C.
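
The six steps above could be wrapped in a small script so that both runs (normal temperature and 0C) are done the same way. This is a rough sketch only: the helper names run and collect_tmp_log are invented here, the mount command is an assumption about how debugfs gets mounted on the test system, and everything must run as root. With DRY_RUN=1 the commands are only printed.

```shell
# Hypothetical wrapper around the manual steps; run as root.

run() {
    # Execute a command line, or only print it when DRY_RUN=1.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        sh -c "$*"
    fi
}

collect_tmp_log() {
    # $1: AML file with the custom _TMP method, $2: output log file
    # (paths containing spaces are not handled in this sketch)
    aml=$1; out=$2
    run "mount -t debugfs none /sys/kernel/debug"               # step 1 (assumed mount point)
    run "cat $aml > /sys/kernel/debug/acpi/custom_method"       # step 2
    run "echo 1 > /sys/module/acpi/parameters/aml_debug_output" # step 3
    run "dmesg -c"                                              # step 4: print and clear the ring buffer
    run "grep . /sys/class/thermal/thermal_zone*/*"             # step 5: read the temperature
    run "dmesg > $out"                                          # step 6
}
```

For example, DRY_RUN=1 collect_tmp_log _TMP.aml dmesg.log prints the six commands without touching the system.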
Comment 4 Zhang Rui 2010-10-18 01:40:15 UTC
BTW, you need to do this test in a kernel newer than 2.6.35.
Comment 5 Jan-Matthias Braun 2010-10-19 07:39:01 UTC
Thanks for your commitment.

The test actually gives no (measurable) result: dmesg.log is empty (0 bytes) in both cases. The kernel used was 2.6.36-rc6 (version string for git kernel: 2.6.36-rc6-00137-g7c6d45e).
(Of course, I never got any message there in the normal setup either.)
dmesg -c pulls out the last lines containing
    [ACPI Debug]  String [0x15] "_Q80 : Temperature Up"
Reading the temperature via grep, however, produces no messages in dmesg, only the following values on stdout:
    /sys/class/thermal/thermal_zone0/cdev0_trip_point:2
    /sys/class/thermal/thermal_zone0/cdev1_trip_point:2
    /sys/class/thermal/thermal_zone0/mode:enabled
    /sys/class/thermal/thermal_zone0/temp:0
    /sys/class/thermal/thermal_zone0/trip_point_0_temp:105000
    /sys/class/thermal/thermal_zone0/trip_point_0_type:critical
    /sys/class/thermal/thermal_zone0/trip_point_1_temp:104000
    /sys/class/thermal/thermal_zone0/trip_point_1_type:hot
    /sys/class/thermal/thermal_zone0/trip_point_2_temp:96000
    /sys/class/thermal/thermal_zone0/trip_point_2_type:passive
    /sys/class/thermal/thermal_zone0/type:acpitz

Actually, I made both runs consecutively without a shutdown in between. If this is a problem, I will (of course) repeat the last run. 0°C was reached after 45 minutes of uptime.
Comment 6 Jan-Matthias Braun 2010-10-19 08:47:45 UTC
Additional info:
The BIOS of the laptop is an infamous InsydeH2O.

The above-mentioned commit was the first bad one in a git bisect run, with the previous commits running well for roughly a week. "Roughly" here means five to seven days during work hours (not consecutive); since I can't trigger the problem, this is just a guess. But I don't feel this result really helps, as this commit is basically empty. So I am testing the previous ones again.

If you have advice on how I could help narrow down the problem, I would be grateful.

Thanks,
Jan
Comment 7 Len Brown 2010-10-20 17:35:39 UTC
In the working 2.6.31, what do you see in /proc/acpi/thermal_zone/*/* ?
Comment 8 Jan-Matthias Braun 2010-10-20 22:01:49 UTC
Hi! Welcome on this bug. :-)

The output of "ls /proc/acpi/thermal_zone/*/*" is
    /proc/acpi/thermal_zone/TZ01/cooling_mode
    /proc/acpi/thermal_zone/TZ01/polling_frequency
    /proc/acpi/thermal_zone/TZ01/state
    /proc/acpi/thermal_zone/TZ01/temperature
    /proc/acpi/thermal_zone/TZ01/trip_points
using the working 2.6.31.14 kernel. With cat on 
    cooling_mode giving "<setting not supported>"
    polling_frequency giving "<polling disabled>"
    state giving "state: ok"
    temperature giving a plausible temperature in °C
    trip_points "critical (S5):           105 C
                    hot (S4):                104 C
                    passive:                 96 C: tc1=2 tc2=3 tsp=50 devices=CPU0 CPU1"

I try to trigger the bug using the stress utility or by doing multiple compiles with "make -j3". The laptop has a Core 2 Duo processor. The system hardly ever reaches 84°C using these procedures. But the behaviour triggers even when the system is idle, so I am unsure whether this generated load really helps.

In my attempt to do a "git bisect" I found a false good marker and I am on my way back through another branch. I am now between commits
    bf25400e889dac3f9a3d5a5b77e8ec4c170a5006 (bad) and
    7a92d803227a523a9a5546e4e0dce1325a4b5926 (good)
with the reservations stated above.

Hope this helps!
Jan
Comment 9 Zhang Rui 2010-10-21 00:14:03 UTC
(In reply to comment #8)
> Hi! Welcome on this bug. :-)
> 
> The output of "ls /proc/acpi/thermal_zone/*/*" is
>     /proc/acpi/thermal_zone/TZ01/cooling_mode
>     /proc/acpi/thermal_zone/TZ01/polling_frequency
>     /proc/acpi/thermal_zone/TZ01/state
>     /proc/acpi/thermal_zone/TZ01/temperature
>     /proc/acpi/thermal_zone/TZ01/trip_points

what's the output of "grep . /proc/acpi/thermal_zone/*/*"?
Comment 10 Jan-Matthias Braun 2010-10-21 07:46:55 UTC
Output of "grep . /proc/acpi/thermal_zone/*/*"
/proc/acpi/thermal_zone/TZ01/cooling_mode:<setting not supported>
/proc/acpi/thermal_zone/TZ01/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/TZ01/state:state:                   ok
/proc/acpi/thermal_zone/TZ01/temperature:temperature:             49 C
/proc/acpi/thermal_zone/TZ01/trip_points:critical (S5):           105 C
/proc/acpi/thermal_zone/TZ01/trip_points:hot (S4):                104 C
/proc/acpi/thermal_zone/TZ01/trip_points:passive:                 96 C: tc1=2 tc2=3 tsp=50 devices=CPU0 CPU1
Comment 11 Jan-Matthias Braun 2010-10-26 21:08:41 UTC
Hm, although my remaining git bisect range holds just 16 commits, I don't know if there is anything useful in it. Could someone please be so kind as to check the commits and tell me whether it is worth testing mostly good commits for weeks?

Any hints on how I could track that thing down?

Cheers,
    Jan
Comment 12 Jan-Matthias Braun 2010-10-28 11:37:25 UTC
Okay, I got another bad commit 
f25752e67d9d9ee7562ae9944314dd8c057d3fa2
Comment 13 Jan-Matthias Braun 2010-10-28 11:43:05 UTC
Hm, text got through to the server too early…

What is this ACPI EC thingie and could it be connected? I am now testing a commit that has to do with an IRQ/polling decision in this context. Actually I will have to test another thing, as the coretemp module for the Core 2 Duo is not updating its temperature properly (I might have mentioned this earlier, I know; sorry if it would have been useful, but I have been sitting on this bug for quite some time now).

Actually the coretemp module reports a temperature which seems identical to the thermal zone temperature at system boot (or resume from suspend to disk, for that matter). But afterwards it stays constant. The thermal zone is still updated up to the point it freezes at 0°C.

Hope this helps you to help me. ^^

Cheers,
    Jan
Comment 14 Jan-Matthias Braun 2010-10-28 16:59:11 UTC
Latest results from git bisect testing. First bad commit seems to be:
    2a84cb9852f52c0cd1c48bca41a8792d44ad06cc
The previous commit is versioned 2.6.31-rc8, which I am going to test, just to be sure, but which should be okay. So this seems to be the commit that introduced this bug on my system.

I really hope this helps. :-)
Comment 15 Jan-Matthias Braun 2010-11-01 19:58:33 UTC
Created attachment 35782 [details]
dmesg for 2.6.31-rc8 with DEBUG defined in ec.c

DEBUG was defined in drivers/acpi/ec.c.
This is the output for the (working) 2.6.31-rc8 kernel.
Comment 16 Jan-Matthias Braun 2010-11-01 20:01:10 UTC
Created attachment 35792 [details]
dmesg for 2.6.36 with DEBUG defined in ec.c

DEBUG was defined in drivers/acpi/ec.c.
This is the output for the 2.6.36 kernel.
This kernel fails in the sense of the bug description.

In comparison to the 2.6.31 debug output you often see
    ACPI: EC: ---> status = 0x00
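
To put a number on "often", the two attached logs could be compared by counting the status lines in each. A minimal sketch, assuming both dmesg outputs have been saved to files (the function name count_ec_status and any file names are placeholders):

```shell
# Hypothetical helper: count EC debug lines reporting a given status byte.
count_ec_status() {
    # $1: saved dmesg log, $2: status value as printed, e.g. 0x00
    # -F matches the text literally, -c prints the number of matching lines
    grep -cF -- "ACPI: EC: ---> status = $2" "$1"
}
```

Running count_ec_status on the 2.6.31-rc8 log and on the 2.6.36 log with 0x00 as the second argument gives two counts that can be compared directly, keeping in mind that the logs cover different uptimes.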
Comment 17 Jan-Matthias Braun 2010-11-01 20:03:11 UTC
Addition to the last comment: Better take a look at the end of the file, as the fan and thermal modules were loaded later (in the running system).
After that, it took only approx. 5 min for the bug to occur.
Comment 18 Len Brown 2010-11-09 03:01:28 UTC
can you attach the dmesg for the working 2.6.31
with CONFIG_PRINTK_TIME=y and DEBUG _disabled_ in ec.c?
would like to see it for a significant uptime, maybe
with some change in thermal load, maybe some polling of the
temperature to stress the EC.  We are looking to see
if some old workarounds are being triggered on this box.
Comment 19 Jan-Matthias Braun 2010-11-09 11:50:33 UTC
I will try this, but it might take some time. In the meantime a friend of mine and I have found a workaround by setting EC_FLAGS_MSI in the module, after finding the following line in the logs
    ACPI: EC: missing confirmations, switch off interrupt mode.

In the current 2.6 series hard-wiring a call to ec_flag_msi worked, too.

This might be far away from optimal, but is working since Friday.
Comment 20 Zhang Rui 2010-11-10 00:47:15 UTC
please attach the dmidecode output of your laptop
Comment 21 Jan-Matthias Braun 2010-11-24 11:48:00 UTC
Created attachment 38022 [details]
dmidecode output
Comment 22 Jan-Matthias Braun 2010-11-24 11:52:06 UTC
Sorry for the long delay, I have had very little time in the last few weeks.

@Len Brown: Shall I still do the tests with the 2.6.31 kernel? Actually it is no problem for me to do, but if nobody is interested in this, I would skip it. :-)
Comment 23 Jan-Matthias Braun 2010-11-24 11:53:53 UTC
Concerning the dmidecode output: Actually I have been trying different BIOS versions first. This now is the latest BIOS version available for this laptop, but previous versions failed, too. At least those I have tried.

A question for advice: to whom should I turn if some hotkeys no longer work after a necessary BIOS update?
Comment 24 Kyle Blake 2011-01-10 00:34:52 UTC
Created attachment 43032 [details]
dmidecode output for the Quanta TW9

I've been having the same problem on a Quanta TW9, and the workaround in comment #19 works for me. Since Jan's laptop is a Quanta TW8, I would guess that this problem (and the workaround) would apply to all laptops in the TW series.
Comment 25 Jan-Matthias Braun 2011-02-17 20:39:40 UTC
@Len Brown

Now I have finally done some testing on the vanilla 2.6.31.14 version of the Linux kernel (again).

Actually nothing of interest seems to happen at all. I am not even seeing the message from comment #19
     ACPI: EC: missing confirmations, switch off interrupt mode.
with the desired configuration.

Here are the results of grepping for "ACPI: EC:" on three runs (the full logs are available):
(1) uptime: 1 day, 12 min
[    0.101636] ACPI: EC: Look up EC in DSDT
[    0.105002] ACPI: EC: non-query interrupt received, switching to interrupt mode
[    0.121609] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[    0.121611] ACPI: EC: driver started in interrupt mode

(2) uptime: 1 day,  6:23
[    0.101489] ACPI: EC: Look up EC in DSDT
[    0.105002] ACPI: EC: non-query interrupt received, switching to interrupt mode
[    0.122509] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[    0.122511] ACPI: EC: driver started in interrupt mode

(3) uptime: 2 days, 22:34
[    0.100722] ACPI: EC: Look up EC in DSDT
[    0.104002] ACPI: EC: non-query interrupt received, switching to interrupt mode
[    0.120891] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[    0.120893] ACPI: EC: driver started in interrupt mode

And then nothing else. I have recompiled the whole system (on gentoo) now while from time to time doing EC queries with 
    (a) for i in `seq 1000`; do sensors ; done
and
    (b) for i in `seq 10000`; do 
         cat /proc/acpi/thermal_zone/TZ01/temperature ; done

with no influence on the logs at all. The uptimes are optimistic, as I had some suspend time in between, but the real uptime of run 3 (from dmesg) is beyond 159028 seconds (44 h).
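
As a quick check of the conversion from the dmesg timestamp to hours, the arithmetic can be done in the shell. This is just an illustrative helper (the name fmt_uptime is made up):

```shell
# Hypothetical helper: format a number of seconds as hours, minutes, seconds.
fmt_uptime() {
    s=$1
    printf '%dh %dm %ds\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

fmt_uptime 159028   # prints "44h 10m 28s"
```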

Anything else I could test or try?

Cheers,

Jan
Comment 26 Zhang Rui 2011-03-23 02:54:30 UTC
Hi, Jan,

can you reproduce the problem in the latest upstream kernel?
If no, can you please verify if the patch attached below (DMI quirks) helps?
Comment 27 Zhang Rui 2011-03-23 03:03:30 UTC
Created attachment 51682 [details]
patch: enable MSI workaround for Quanta laptops

please apply this patch on 2.6.38 kernel and see if it helps.
Comment 28 Jan-Matthias Braun 2011-03-27 21:03:28 UTC
First, I want to thank you for your continuing support.

1. Vanilla 2.6.38 problem reproduction
To be honest, I am experiencing strange problems with disk access using vanilla 2.6.38. With a patched kernel these problems vanish. I had no time to do further checks on these disk problems, so I can't tell if these two behaviours are related.

2. The patch
The patch seems to work fine. From dmesg

    ACPI: EC: Detected MSI hardware, enabling workarounds.
    ACPI: EC: Look up EC in DSDT
    ACPI: Executed 1 blocks of module-level executable AML code
    [Firmware Bug]: ACPI: BIOS _OSI(Linux) query ignored

so I wouldn't expect any misbehaviour, but I will continue the test over the following week. (And I will try to come back to the vanilla 2.6.38 kernel.)
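
For anyone else testing the patch, whether the quirk was actually applied can be checked by looking for the message quoted above in a saved dmesg output. A small sketch, with msi_quirk_active as an invented helper name:

```shell
# Hypothetical helper: succeed if the saved dmesg output shows the MSI
# workaround message quoted in this comment.
msi_quirk_active() {
    # $1: file holding saved dmesg output
    grep -qF "ACPI: EC: Detected MSI hardware, enabling workarounds" "$1"
}
```

After dmesg > dmesg.log, running msi_quirk_active dmesg.log && echo "quirk active" reports whether the line is present. The message may already have scrolled out of the ring buffer after a long uptime, so a missing line is not proof that the quirk is off.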

Thanks,

Jan
Comment 29 Zhang Rui 2011-03-28 06:42:39 UTC
okay. Bug closed.
please feel free to re-open it if the problem can be reproduced in the latest upstream kernel, after applying the patch.
Comment 30 Jan-Matthias Braun 2011-03-29 19:19:13 UTC
Okay with me. Thanks to all!

Jan
Comment 31 Florian Mickler 2011-05-30 07:26:05 UTC
A patch referencing this bug report has been merged in v3.0-rc1:

commit 534bc4e3d27096e2f3fc00c14a20efd597837a4f
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Tue Apr 26 16:30:02 2011 +0800

    ACPI EC: enable MSI workaround for Quanta laptops
Comment 32 Lv Zheng 2015-05-13 08:25:39 UTC
Hi,

Can anyone here try the upstream kernel without this quirk?
We are about to remove the quirk as it is covered by the following commit:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9e295ac

Thanks in advance.

Best regards
-Lv
Comment 33 Jan-Matthias Braun 2015-05-26 14:52:10 UTC
Hi,

I am sorry, but I no longer have access to the device in question.

Thanks for keeping this issue in mind!

Jan