Bug 11878 - fan doesn't turn off
Summary: fan doesn't turn off
Status: REJECTED WILL_NOT_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal
Hardware: All Linux
Importance: P1 high
Assignee: ykzhao
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-10-28 12:48 UTC by Martin Capitanio
Modified: 2008-11-06 01:31 UTC

See Also:
Kernel Version: 2.6.27 - 2.6.28rc2
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg 2.6.28-rc2 (35.39 KB, text/plain)
2008-10-28 13:27 UTC, Martin Capitanio
acpidump (153.41 KB, text/plain)
2008-10-29 02:34 UTC, Martin Capitanio
dmesg-acpi.debug-XXXX.txt (105.60 KB, text/plain)
2008-10-29 02:37 UTC, Martin Capitanio
dmesg-2.6.26.7.txt (44.26 KB, text/plain)
2008-10-29 09:38 UTC, Martin Capitanio
dmesg-thermtrip.txt (48.55 KB, text/plain)
2008-11-03 09:04 UTC, Martin Capitanio

Description Martin Capitanio 2008-10-28 12:48:13 UTC
Latest working kernel version: 2.6.26.7 tested o.k.
Earliest failing kernel version: 2.6.27 - 2.6.28rc2
Distribution: Ubuntu/Independent

Problem Description:
The fan keeps rotating at low speed and doesn't turn off. The dmesg difference
(+2.6.26.7 -2.6.27) is:
-[    1.905701] thermal LNXTHERM:01: registered as thermal_zone0
-[    1.906095] ACPI: Transitioning device [FAN] to D3
-[    1.906285] ACPI: Thermal Zone [THRM] (50 C)
+[    5.499848] ACPI: LNXTHERM:01 is registered as thermal_zone0
+[    5.499848] ACPI: Thermal Zone [THRM] (56 C)

The value of /proc/acpi/thermal_zone/THRM/temperature is updated
only at irregular intervals in both cases (old and new kernel).

The weird thing is that changing the value of

/sys/class/backlight/acpi_video0/brightness

causes a temperature value update and the fan goes off
(as it does automatically in the 2.6.26 kernel).
Comment 1 Martin Capitanio 2008-10-28 13:04:32 UTC
The value of /proc/acpi/fan/FAN/state is always on.
Comment 2 Martin Capitanio 2008-10-28 13:27:19 UTC
Created attachment 18485
dmesg 2.6.28-rc2
Comment 3 ykzhao 2008-10-28 18:05:07 UTC
Will you please attach the output of acpidump?
Will you please add the boot option "acpi.power_nocheck=1" and see whether the problem still exists?
Thanks.
Comment 4 ykzhao 2008-10-28 18:18:27 UTC
Will you please enable CONFIG_ACPI_DEBUG in the kernel configuration and add the boot options "acpi.debug_layer=0x04810000 acpi.debug_level=0x1f"? (The boot option "acpi.power_nocheck=1" is still needed.)
   After the system has booted, please change the value of /sys/class/backlight/acpi_video0/brightness
   and then attach the output of dmesg.
   thanks.
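
A minimal sketch for checking that the requested options actually reached the kernel, assuming the running command line is read from /proc/cmdline. The function name missing_boot_options is hypothetical; the option strings are the ones quoted in the comments above. It prints every requested option that is absent and prints nothing when all are present.

```shell
# Print each requested option that does not appear as a whole word on the
# given kernel command line. (Hypothetical helper, not part of any tool.)
missing_boot_options() {
    local cmdline="$1"; shift
    local opt
    for opt in "$@"; do
        case " $cmdline " in
            *" $opt "*) ;;                  # option is present
            *) printf '%s\n' "$opt" ;;      # option is missing
        esac
    done
}

# Usage on the running system:
#   missing_boot_options "$(cat /proc/cmdline)" \
#       acpi.power_nocheck=1 acpi.debug_layer=0x04810000 acpi.debug_level=0x1f
```

Note that CONFIG_ACPI_DEBUG is a build-time setting and is not visible on the command line, so it has to be checked separately.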
   
Comment 5 Shaohua 2008-10-28 23:19:28 UTC
Can you please attach the output of 'acpidump'?
Comment 6 Martin Capitanio 2008-10-29 02:31:13 UTC
[    0.000000] Command line: root=/dev/sda8 locale=de_DE ro acpi.power_nocheck=1
[    2.783591] thermal LNXTHERM:01: registered as thermal_zone0
[    2.784013] ACPI: Thermal Zone [THRM] (45 C)

Yes, that reverts to the 2.6.26.7 behavior, which IIRC is:
(/proc/acpi/thermal_zone/THRM/temperature)
* ? -> 56 fan low speed
* 56 -> 61 fan hi speed [kernel make -j3]
* 56|61 -> ? fan off
* the temperature value gets updated only on a fan state transition or a
  brightness value change
Comment 7 Martin Capitanio 2008-10-29 02:34:19 UTC
Created attachment 18492
acpidump
Comment 8 Martin Capitanio 2008-10-29 02:37:13 UTC
Created attachment 18493
dmesg-acpi.debug-XXXX.txt

Search for the XXXX comments.
Comment 9 Martin Capitanio 2008-10-29 03:39:49 UTC
Eh, I was wrong. After some time the fan got stuck at low speed again:

... the fan probably goes on here:

Oct 29 10:57:09 marvin kernel: [ 3798.120111]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3252 dK
Oct 29 10:57:11 marvin kernel: [ 3800.119663]    utils-0291 [00] evaluate_integer      : Return value [3252]
Oct 29 10:57:11 marvin kernel: [ 3800.119676]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3252 dK
Oct 29 10:57:13 marvin kernel: [ 3802.118496]    utils-0291 [00] evaluate_integer      : Return value [3252]
Oct 29 10:57:13 marvin kernel: [ 3802.118508]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3252 dK
Oct 29 10:57:15 marvin kernel: [ 3804.119988]    utils-0291 [00] evaluate_integer      : Return value [3252]
Oct 29 10:57:15 marvin kernel: [ 3804.120001]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3252 dK
Oct 29 10:57:17 marvin kernel: [ 3806.119536]    utils-0291 [00] evaluate_integer      : Return value [3252]
Oct 29 10:57:17 marvin kernel: [ 3806.119549]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3252 dK
Oct 29 10:57:19 marvin kernel: [ 3808.117668]    utils-0291 [00] evaluate_integer      : Return value [3292]
Oct 29 10:57:19 marvin kernel: [ 3808.117680]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3292 dK
Oct 29 10:57:21 marvin kernel: [ 3810.119828]    utils-0291 [00] evaluate_integer      : Return value [3272]
Oct 29 10:57:21 marvin kernel: [ 3810.119841]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3272 dK
Oct 29 10:57:23 marvin kernel: [ 3812.117618]    utils-0291 [00] evaluate_integer      : Return value [3272]
Oct 29 10:57:23 marvin kernel: [ 3812.117631]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3272 dK

... fan stays on forever, until

Oct 29 11:16:51 marvin kernel: [ 4980.120131]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3272 dK
Oct 29 11:16:53 marvin kernel: [ 4982.116690]    utils-0291 [00] evaluate_integer      : Return value [3272]
Oct 29 11:16:53 marvin kernel: [ 4982.116703]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3272 dK
Oct 29 11:16:55 marvin kernel: [ 4984.120116]    utils-0291 [00] evaluate_integer      : Return value [3272]
Oct 29 11:16:55 marvin kernel: [ 4984.120128]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3272 dK
Oct 29 11:16:57 marvin kernel: [ 4986.117660]    utils-0291 [00] evaluate_integer      : Return value [3272]
Oct 29 11:16:57 marvin kernel: [ 4986.117673]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3272 dK

... brightness value changed

Oct 29 11:16:59 marvin kernel: [ 4988.120127]    utils-0291 [00] evaluate_integer      : Return value [3062]
Oct 29 11:16:59 marvin kernel: [ 4988.120139]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3062 dK
Oct 29 11:17:01 marvin kernel: [ 4990.118439]    utils-0291 [00] evaluate_integer      : Return value [3062]
Oct 29 11:17:01 marvin kernel: [ 4990.118452]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3062 dK
Oct 29 11:17:03 marvin kernel: [ 4992.120085]    utils-0291 [00] evaluate_integer      : Return value [3062]
Oct 29 11:17:03 marvin kernel: [ 4992.120098]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3062 dK
Oct 29 11:17:05 marvin kernel: [ 4994.117660]    utils-0291 [00] evaluate_integer      : Return value [3062]
Oct 29 11:17:05 marvin kernel: [ 4994.117673]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3062 dK

... works again, fan goes on

Oct 29 11:27:11 marvin kernel: [ 5600.120210]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3162 dK
Oct 29 11:27:13 marvin kernel: [ 5602.119714]    utils-0291 [00] evaluate_integer      : Return value [3162]
Oct 29 11:27:13 marvin kernel: [ 5602.119726]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3162 dK
Oct 29 11:27:15 marvin kernel: [ 5604.119251]    utils-0291 [00] evaluate_integer      : Return value [3162]
Oct 29 11:27:15 marvin kernel: [ 5604.119264]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3162 dK
Oct 29 11:27:17 marvin kernel: [ 5606.119789]    utils-0291 [00] evaluate_integer      : Return value [3292]
Oct 29 11:27:17 marvin kernel: [ 5606.119802]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3292 dK
Oct 29 11:27:19 marvin kernel: [ 5608.118172]    utils-0291 [00] evaluate_integer      : Return value [3292]
Oct 29 11:27:19 marvin kernel: [ 5608.118184]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3292 dK

... fan goes off

Oct 29 11:27:53 marvin kernel: [ 5642.117678]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3292 dK
Oct 29 11:27:55 marvin kernel: [ 5644.120187]    utils-0291 [00] evaluate_integer      : Return value [3292]
Oct 29 11:27:55 marvin kernel: [ 5644.120200]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3292 dK
Oct 29 11:27:57 marvin kernel: [ 5646.119685]    utils-0291 [00] evaluate_integer      : Return value [3162]
Oct 29 11:27:57 marvin kernel: [ 5646.119698]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3162 dK
Oct 29 11:27:59 marvin kernel: [ 5648.119892]    utils-0291 [00] evaluate_integer      : Return value [3162]
Oct 29 11:27:59 marvin kernel: [ 5648.119905]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3162 dK

... etc.
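
For reading the log values, a small conversion sketch may help. It assumes the "dK" readings are tenths of a kelvin with 273.2 K taken as 0 C, and rounds to whole degrees; under that assumption 3292 dK corresponds to the 56 C reported elsewhere in this thread. The function name dk_to_celsius is hypothetical.

```shell
# Convert a reading in tenths of a kelvin (dK) to whole degrees Celsius,
# rounding to the nearest degree. Assumes 2732 dK == 0 C.
dk_to_celsius() {
    local dk="$1"
    local tenths=$(( dk - 2732 ))       # tenths of a degree Celsius
    if [ "$tenths" -ge 0 ]; then
        echo $(( (tenths + 5) / 10 ))
    else
        echo $(( (tenths - 5) / 10 ))
    fi
}

# Usage: dk_to_celsius 3292
```

With this mapping, the values in the log above are 3252 dK = 52 C, 3272 dK = 54 C, 3292 dK = 56 C, 3162 dK = 43 C and 3062 dK = 33 C.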
Comment 10 Martin Capitanio 2008-10-29 06:00:03 UTC
It happened again after 2 hours. It seems to occur much less frequently than
without acpi.power_nocheck=1. Is there some option to trigger the debug
message only on a temperature value change?

Oct 29 13:30:51 marvin kernel: [13020.119669]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3282 dK

./backlight-togle 
#!/bin/bash
echo 6 > /sys/class/backlight/acpi_video0/brightness
echo 7 > /sys/class/backlight/acpi_video0/brightness

Temperature jump, ca. 50 -> 30:

Oct 29 13:30:53 marvin kernel: [13022.117649]    utils-0291 [00] evaluate_integer      : Return value [3062]
Oct 29 13:30:53 marvin kernel: [13022.117661]  thermal-0262 [00] thermal_get_temperatur: Temperature is 3062 dK
Oct 29 13:30:55 marvin kernel: [13024.119852]    utils-0291 [00] evaluate_integer      : Return value [3062]

Another experiment:
If I run backlight-togle frequently, the fan off -> fan on hysteresis
(luckily) stays the same, but the fan on -> fan off state change happens much
more quickly.
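
As a userspace alternative to a kernel debug option, the readings can be polled and filtered so that only changes are printed. This is a sketch, not a kernel feature: print_changes and poll_zone are hypothetical helper names, and the 2 second default interval is an assumption. Each poll reads the temperature file and therefore also triggers the kernel's temperature evaluation.

```shell
# Read one reading per line from stdin and print a line only when the
# reading differs from the previous one.
print_changes() {
    local prev="" cur
    while IFS= read -r cur; do
        if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then
            printf '%s -> %s\n' "$prev" "$cur"
        fi
        prev="$cur"
    done
}

# Emit the contents of a file every $2 seconds (default 2) until interrupted.
poll_zone() {
    local file="$1" interval="${2:-2}"
    while :; do
        cat "$file"
        sleep "$interval"
    done
}

# Usage:
#   poll_zone /proc/acpi/thermal_zone/THRM/temperature 2 | print_changes
```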
Comment 11 Martin Capitanio 2008-10-29 09:38:46 UTC
Created attachment 18497
dmesg-2.6.26.7.txt

Double-check for the 2.6.26.7 kernel:

diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 84c795f..2615776 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -251,6 +251,10 @@ static int acpi_thermal_get_temperature(struct acpi_thermal *tz)
        if (!tz)
                return -EINVAL;

+       if (tz->last_temperature != tz->temperature)
+               printk(KERN_NOTICE "ACPI: temperature %lu -> %lu\n",
+                      tz->last_temperature, tz->temperature);
+               
        tz->last_temperature = tz->temperature;

---

-> 3292 the fan always goes on
-> 3162 the fan always goes off

Where should I look now?

Thanks,
Martin
Comment 12 Martin Capitanio 2008-10-30 01:05:39 UTC
The debug output was triggered by a temperature monitoring applet
reading /proc/acpi/thermal_zone/THRM/temperature.

Without it, changing the brightness still causes a temperature
update / fan stop, but no debug output.
Comment 13 ykzhao 2008-11-03 01:51:01 UTC
Hi, Martin
    Thanks for the test and info.
    From the acpidump it seems that the ACPI FAN device is tied to the power resource FN00, in which the _ON/_OFF objects are bogus.
    > PowerResource (FN00, 0x00, 0x0000)
        {
            Method (_STA, 0, Serialized)
            {
                Store (0xF1, P80H)
                Return (One) // This means that FN00 is always on, so of course the ACPI FAN device will report that it is on.
            }

            Method (_ON, 0, Serialized)
            {
                Store (0xF1, P80H)
            }

            Method (_OFF, 0, Serialized)
            {
                Store (0xF0, P80H)
            }
            // The _ON/_OFF object definitions are bogus.
        }
    In such a case /proc/acpi/fan/FAN/state is always on and the fan is not controlled through the ACPI FAN device; it is controlled by the BIOS.

    When the brightness is changed through /sys/class/backlight/acpi_video0/brightness, the _BCM object is evaluated, which triggers an SMI. Maybe the fan is turned on/off by the BIOS at that point.
    So IMO this bug is related to the BIOS.
    

    
Comment 14 ykzhao 2008-11-03 01:59:57 UTC
Hi, Martin
   As the fan can't be controlled through the ACPI thermal FAN device, IMO the bug is related to the BIOS.
   Also, because the bogus method is provided by the BIOS, /proc/acpi/fan/FAN/state will always be on.
   In that case I think the problem is not related to Linux ACPI. It is best fixed by a BIOS upgrade.
Comment 15 Martin Capitanio 2008-11-03 09:04:47 UTC
Created attachment 18637
dmesg-thermtrip.txt

Hi Zhao,
thanks, I see, definitely the BIOS is broken, but ...

The reality is that most of the companies designing this funny stuff are, I
would say, living on some unknown planet. If their gadgets work in
the planned environment (e.g. by accident), they stop any further action.
Luckily the Linux kernel is open source, and there must be something in the
kernel disturbing the (broken) BIOS.

If I try some sort of rational reasoning (I don't yet have the time
to RTFSMIM), the facts visible to me are:

* There is no way to manage the fan state directly (which is IMO a good thing),
  but on top of that the state is always reported to the OS as on (BUG_1).

* IMO the BIOS, through SMI or whatever, *always* triggers the temperature object
  state update irregularly (BUG_2), much less often than it should.
  But in the normal case the fan sooner or later gets *switched off*.

* In *some* Linux kernels, after an uptime of 5 min to 2 h, the fan gets stuck on
  (BUG_3).

* I patched acpi_video_device_lcd_set_level() and it turns out that even writing
  a constant, *unchanged* brightness value (i.e. not actually touching the
  backlight voltage value) kicks the SMI system into updating the
  temperature/fan state.

* During a make -j3 kernel build, one of the cores got rescued from meltdown (BUG_4):

> mcelog 
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC cf9edaa98
Processor core below trip temperature. Throttling disabled
STATUS 882d0100 MCGSTATUS 0

I can only try to guess:
* it is perfectly fine; the CPU is designed for this
* wrong or old microcode:
[   20.406893] IA-32 Microcode Update Driver: v1.14a <tigran@aivazian.fsnet.co.uk>
[   20.409382] firmware: requesting intel-ucode/06-0f-0d
[   20.441544] firmware: requesting intel-ucode/06-0f-0d
[
* the CPU is buggy
* some damage somehow related to this issue

* Well, the kernel has to carry a full load of quirks everywhere ;) It may not
  be such a crazy idea to quirk this, i.e. set a timer to 'kick the SMI' this
  or some other (yet to be found) way. I think every half second would make sense.

I am curious enough to try to understand and track down the bug (particularly
BUG_3) to some single option or patch, so any explanations or hints on where to
look or how to instrument the kernel are really appreciated.

Results after more than 2 h of uptime, always using the same (latest) Ubuntu
'boot machinery', IIRC:

* the working kernels are:
  vanilla 2.6.26.7 *2.6.27.4*
  ubuntu 2.6.24-21

* fan stays on:
  vanilla 2.6.28 rc1, rc2
  current ubuntu 2.6.27-7.14
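
The 'kick the SMI' timer idea from the list above can be tried from userspace before writing any kernel quirk. The sketch below rewrites the current brightness value to the same file at a fixed interval. It rests on an assumption: the report that an unchanged value still kicks the SMI was made with a patched acpi_video_device_lcd_set_level(), so an unpatched kernel may ignore the write. The function names rewrite_brightness and kick_loop are hypothetical.

```shell
# Read the current value of a brightness file and write the same value back.
rewrite_brightness() {
    local file="$1" value
    value=$(cat "$file") || return 1
    printf '%s\n' "$value" > "$file"
}

# Repeat the rewrite every $2 seconds (default 1) until a write fails.
kick_loop() {
    local file="$1" interval="${2:-1}"
    while rewrite_brightness "$file"; do
        sleep "$interval"
    done
}

# Usage (as root):
#   kick_loop /sys/class/backlight/acpi_video0/brightness 1
```

The half-second period suggested above would need a sleep implementation that accepts fractional seconds. If the same-value write has no effect, toggling between two adjacent values as in the backlight-togle script is the variant already shown to work in this report.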
Comment 16 Martin Capitanio 2008-11-03 17:22:08 UTC
Does the difference between these messages indicate something wrong?
[    0.004000] CPU: Processor Core ID: 0
[    0.004000] CPU0: Thermal monitoring handled by SMI
...
[    0.004000] CPU: Processor Core ID: 1
[    0.004000] CPU1: Thermal monitoring enabled (TM2)

...
[    0.346227] ACPI: Power Resource [FN00] (on)
[    2.742654] fan PNP0C0B:00: registered as cooling_device0
[    2.742718] ACPI: Fan [FAN] (on)
[    2.756812] thermal LNXTHERM:01: registered as thermal_zone0

IMO this is a bogus message, because the OS is not meant to handle the fan.
Or does it actually do something?
[    2.757230] ACPI: Transitioning device [FAN] to D3
[    2.757426] ACPI: Thermal Zone [THRM] (56 C)
Comment 17 ykzhao 2008-11-06 01:30:15 UTC
I understand what you said.
   During boot, Linux ACPI enumerates all the ACPI devices and loads the corresponding drivers. As an ACPI FAN device exists, the ACPI fan driver is loaded, which is why the messages you quoted are reported. Normally a userspace application or another module could then control the FAN device, but unfortunately the FAN control method is bogus, so the fan can't be controlled through the ACPI FAN interface. This is not the fault of ACPI.
   When you change the LCD brightness, the _BCM object is evaluated, which triggers an SMI. The OS can't know what the SMI changes.
   
   So IMO this is a BIOS bug, and Linux ACPI can do nothing about it.
Comment 18 ykzhao 2008-11-06 01:31:12 UTC
As it is a BIOS bug, Linux ACPI can do nothing about it. It is best fixed by a BIOS upgrade.
   So the bug will be rejected.
