Bug 12620 - kacpi_notify chews up constant 50-60% CPU when machine gets hot.
Summary: kacpi_notify chews up constant 50-60% CPU when machine gets hot.
Status: REJECTED WILL_NOT_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: ykzhao
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-02-01 16:35 UTC by Erik de Castro Lopo
Modified: 2009-02-16 17:03 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.28.2
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Gzipped output of acpidump (25.87 KB, application/x-gzip)
2009-02-01 17:07 UTC, Erik de Castro Lopo
Details
try the custom DSDT in which the _PS0/_PS3 object is added for the FAN device (181.48 KB, patch)
2009-02-01 18:47 UTC, ykzhao
Details | Diff
Output from command "acpidump --addr 0xff810 --length 0x40 -o temm", Gzipped. (79 bytes, application/x-gzip)
2009-02-03 13:06 UTC, Erik de Castro Lopo
Details
dmesg output just ofter boot when machine was still cold. (10.20 KB, application/x-gzip)
2009-02-15 14:58 UTC, Erik de Castro Lopo
Details
dmesg output after machine was rebooted when already hot. (10.10 KB, application/x-gzip)
2009-02-15 14:59 UTC, Erik de Castro Lopo
Details

Description Erik de Castro Lopo 2009-02-01 16:35:51 UTC
Latest working kernel version:
Earliest failing kernel version: Also seen with 2.6.27.8.
Distribution: Ubuntu
Hardware Environment: Mini ATX mobo with Intel Corporation Mobile GME965/GLE960 Memory Controller, Intel Celeron R 2GHz, 1 G ram.
Software Environment:
Problem Description: kacpi_notify chews up too much CPU.

Steps to reproduce:

 a) Power up machine.
 b) Run CPU intensive tasks with air holes blocked until machine heats up significantly.
 c) Reboot.

After reboot, machine comes up with kacpi_notify consuming about 60% of CPU.
Comment 1 Erik de Castro Lopo 2009-02-01 16:45:13 UTC
If the machine is powered down and allowed to cool, the next boot has kacpi_notify behaving normally again.
Comment 2 ykzhao 2009-02-01 16:48:18 UTC
Hi, Erik
   Will you please attach the output of acpidump?
   Thanks.
Comment 3 Erik de Castro Lopo 2009-02-01 17:07:51 UTC
Created attachment 20068 [details]
Gzipped output of acpidump

Machine was allowed to cool before doing acpidump.
Comment 4 ykzhao 2009-02-01 18:33:45 UTC
Hi, Erik
    Thanks for so quick response.
    From the acpidump there exists the ACPI fan device. When the temperature is above the threshold returned by the _AC0 object, OS will try to turn on the ACPI fan device to cool the temperature. But unfortunately it is a bogus FAN device. There is no ACPI object(_PS0/_PS3) to turn on/off the FAN device.
    >Device (FAN)
        {
            Name (_HID, EisaId ("PNP0C0B"))
            Method (_INI, 0, NotSerialized)
            {
                Store (TP1H, CTOS)
                Store (TP1L, CTHY)
            }
        }
     
    At the same time there exists the following definition under the scope of \_GPE:
    >Method (_L00, 0, NotSerialized)
        {
            Notify (\_TZ.THRM, 0x80)
        }
    
    Maybe when the temperature is above the threshold, the ACPI event(0x80) is sent to the thermal object. But OS can do nothing to cool the temperature. So the ACPI event(0x80) will continually be sent to the thermal object. In such case the kacpi_notify will chew up 50-60% CPU.
    Will you please check whether there exists the BIOS option related with FAN device?
    Thanks.
    Will 
Comment 5 ykzhao 2009-02-01 18:47:51 UTC
Created attachment 20069 [details]
try the custom DSDT in which the _PS0/_PS3 object is added for the FAN device

Will you please try the custom DSDT and see whether the problem still exists?
    In the custom DSDT the _PS3/_PS0 object is added for the ACPI fan device. 
    How to use the custom DSDT can be found in:
      www.lesswatts.org/projects/acpi/faq.php
    Thanks.
Comment 6 Erik de Castro Lopo 2009-02-01 19:16:02 UTC
(In reply to comment #4)
>     Thanks for so quick response.

Thank you!

>     Maybe when the temperature is above the threshold, the ACPI event(0x80)
>     is
> sent to the thermal object. But OS can do nothing to cool the temperature. So
> the ACPI event(0x80) will continually be sent to the thermal object. In such
> case the kacpi_notify will chew up 50-60% CPU.

One of my colleagues was investigating an heat related issue when he noticed the kacpi_notify CPU usage. He then called me in to look at that.

Shouldn't kacpi_notify realise that something is wrong and limit itself to say 5-10% CPU?


>     Will you please check whether there exists the BIOS option related with
>     FAN
> device?
>     Thanks.
>     Will 
> 
Comment 7 ykzhao 2009-02-01 19:50:03 UTC
The kacpi_notify is used to process the notification event sent from BIOS. And it is difficult to judge whether something is wrong. 
    Will you please try the custom DSDT and see whether the problem still exists?
    Thanks.
Comment 8 Erik de Castro Lopo 2009-02-02 19:20:06 UTC
Ok, I did the following:

a) cd linux-2.6.28.2
b) wget "http://bugzilla.kernel.org/attachment.cgi?id=20069&action=view" -O drivers/acpi/DSDT.hex
c) Edited .config and added:
    CONFIG_ACPI_CUSTOM_DSDT=y
    CONFIG_ACPI_CUSTOM_DSDT_FILE="DSDT.hex"
d) Built and installed the kernel.
e) Rebooted and checked for "ACPI: Table DSDT replaced by host OS" in the dmesg outout (it was present).
f) Repeated my testing from before.

After the machine had heated up considerably, and was then rebooted (again checking dmesg), kacpi_notify again took up 50-60% CPU.

Just so that we're clear, the kacpi_notify problem is something that only happens when deliberately heat stressing a machine to try and replicate a problem experienced by a customer in the field.
Comment 9 Erik de Castro Lopo 2009-02-02 19:43:28 UTC
I should also mention that although I can see the real CPU temperature in the BIOS, so the hardware is working, but the file /proc/acpi/thermal_zone/THRM/temperature says 40 degress C regardless.
Comment 10 ykzhao 2009-02-02 22:51:42 UTC
Hi, Erik
    Thanks for the test. 
    Will you please confirm whether the FAN device is turned on/off when the custom DSDT is used?
    
    Based on your description it seems that 40 degree is reported by /proc/acpi/thermal_zone/THRM/temperature regardless of real CPU temperature. Maybe this is caused by the BIOS bug.
    The temperature of thermal zone is obtained by the _TMP object, in which the RTMP is evaluated. The definition of RTMP is listed in the following:
    >Method (RTMP, 0, NotSerialized)
    {
   ....
         If (LEqual (SSHU, 0x01))
        {
            Return (0x0C3C)   // IF the SSHU is one, the 40 degree will be returned.
        }  
        Else
        {
            Return (Local0)
        }
    }
    
    Will you please attach the following output?
    > ./acpidump --addr 0xff810 --length 0x40 -o temm
   Thanks.
    
Comment 11 ykzhao 2009-02-02 22:53:03 UTC
Will you please also attach the output of "/proc/acpi/thermal_zone/thrm/* "?
   Thanks.
Comment 12 Erik de Castro Lopo 2009-02-03 13:05:09 UTC
ykzhao, thanks. Data you requested below.

1) Both the CPU and the system fan are on.

2 ) > acpidump --addr 0xff810 --length 0x40 -o temm
    > hexdump temm

      0000000 0f5c 0f3e 0f5c 0f3e 0f5c 0001 6967 7365
      0000010 202c 744c 0000 0000 1000 0000 7869 5420
      0000020 6365 6e68 0000 000d 0000 0000 0000 3f6e
      0000030 1e44 40b8 8e00 a1d8 0013 cf1f e963 f61f
      0000040

      I'll also add this as a gzippped attachment.

3) > cat /proc/acpi/thermal_zone/THRM/*

    0 - Active; 1 - Passive
    <polling disabled>
    state:                   ok
    temperature:             40 C
    critical (S5):           120 C
    passive:                 120 C: tc1=4 tc2=3 tsp=60 devices=CPU0 
    active[0]:               120 C: devices= FAN 

I did at one stage enable polling and got "<polling enabled>" in the above but that didn't help.
Comment 13 Erik de Castro Lopo 2009-02-03 13:06:46 UTC
Created attachment 20098 [details]
Output from command "acpidump --addr 0xff810 --length 0x40 -o temm", Gzipped.
Comment 14 ykzhao 2009-02-04 18:11:33 UTC
Hi, Erik
    Thanks for the info. 
    From the log in comment #13 it is confirmed that this issue is caused by broken BIOS.
    a.SSHU is one. In such case the 40 degree is always reported by thermal zone.
    
    b. the active threshold is 120. If so, only when the temperature is above the threshold, the FAN device will be turned on. But in fact on this box this is a bogus FAN. There is no method that can turn on/off the FAN device.

    c. When the temperature rises, the GPE _L00 method will send ACPI notification event(0x80) to thermal zone so that OS can take some actions to cool the temperature. Unfortunately OS can do nothing because of bogus thermal zone.  As the temperature can't be decreased, it will continue to send the notification event.

     As this bug is related with the broken bios, it had better be fixed by BIOS upgrading.
    
     Will you please add the boot option of "thermal.act=35" and see whether this issue still exists? The driver/acpi/thermal should be compiled as built-in kernel and the custom DSDT should also be used.
    Thanks.
Comment 15 Erik de Castro Lopo 2009-02-08 17:39:18 UTC
I tried the thermal.act=35 boot option, but got an error saying "unknown boot option". I am sure I have CONFIG_ACPI_THERMAL set to yes. I checked twice.
Comment 16 ykzhao 2009-02-08 21:09:51 UTC
Will you please double check it again?
     
Comment 17 Erik de Castro Lopo 2009-02-10 18:50:23 UTC
Just to make sure I did the following:

 a) Compiled kernel 2.6.28.4 with the custom DSDT.hex file, and
    CONFIG_ACPI_THERMAL=y and CONFIG_IKCONFIG_PROC=y.

 b) Rebooted and checked /proc/config.gz for CONFIG_ACPI_THERMAL=y as well
    as the custom DSDT.hex file.

 c) Allowed machine to heat up.

 d) Rebooted with "thermal.act=35".

This time the kernel accepted the thermal.act parameter, but kcapi_notify
still ran at 50-50% CPU.
Comment 18 ykzhao 2009-02-11 19:01:07 UTC
Will you please cat the output of /proc/acpi/fan/*/* and /proc/acpi/thermal_zone/*/*?
   Thanks.
   
Comment 19 Erik de Castro Lopo 2009-02-12 14:04:05 UTC
prompt $ uname -a
Linux sn-11067d15 2.6.28.4test3 #1 SMP Thu Feb 12 14:50:34 EST 2009 i686 GNU/Linux

prompt $ gunzip -c --stdout /proc/config.gz | grep THERMAL
CONFIG_ACPI_THERMAL=y
CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y

prompt $ gunzip -c --stdout /proc/config.gz | grep DSDT   
CONFIG_ACPI_CUSTOM_DSDT_FILE="DSDT.hex"
CONFIG_ACPI_CUSTOM_DSDT=y


prompt $ cat /proc/acpi/fan/*/* 
status:                  off

prompt $ cat /proc/acpi/thermal_zone/*/*
0 - Active; 1 - Passive
<polling disabled>
state:                   ok
temperature:             40 C
critical (S5):           120 C
passive:                 120 C: tc1=4 tc2=3 tsp=60 devices=CPU0 
active[0]:               120 C: devices= FAN 
Comment 20 ykzhao 2009-02-12 16:53:59 UTC
hi, Erik
    Is the boot option of "thermal.act=35" added while doing the test in comment #19?
    If not, had better add it.
    Thanks.
Comment 21 Zhang Rui 2009-02-12 17:43:48 UTC
when kacpid_notify is busy, please attach the output of "grep . /sys/firmware/acpi/interrupts/*".
And run this command twice in a row to see if there is any GPE/fixed event that keeps on firing.

Yakui, note that the kacpi_notify is busy after REBOOT.
we probably hit some interrupt/notification storm problems here.
Comment 22 ykzhao 2009-02-12 18:35:55 UTC
Understand what you said.
   In fact the problem is triggered only when the temperature is very high.
   >Power up machine.
   >Run CPU intensive tasks with air holes blocked until machine heats up
significantly.
   >Reboot.And machine comes up with kacpi_notify consuming about 60% of CPU.


   From the comment #1 the kacpid_notify is normal if the temperature is very low.
   Thanks.
Comment 23 Erik de Castro Lopo 2009-02-12 19:25:53 UTC
A couple of answers:

> Is the boot option of "thermal.act=35" added while doing the test in
> comment #19?

Yes, I added this to the grub menu.list:

    prompt $ cat /proc/cmdline 
    root=UUID=bde07cb2-73a3-4502-ae4c-781dc8c392fe ro quiet splash 
    thermal.act=35

> From the comment #1 the kacpid_notify is normal if the temperature is very
> low.

Thats correct. If machine is powered down and left to cool down, on boot kacpi_notify CPU usage is low (less than 1%). I *only* see high kacpi_notify CPU usage when the machine is rebooted when *already* heat stressed.

> when kacpid_notify is busy, please attach the output of "grep .
> /sys/firmware/acpi/interrupts/*"

prompt $ grep . /sys/firmware/acpi/interrupts/*
/sys/firmware/acpi/interrupts/error:       0
/sys/firmware/acpi/interrupts/ff_gbl_lock:       0      enabled
/sys/firmware/acpi/interrupts/ff_pmtimer:       0       invalid
/sys/firmware/acpi/interrupts/ff_pwr_btn:       0       enabled
/sys/firmware/acpi/interrupts/ff_rt_clk:       0        disabled
/sys/firmware/acpi/interrupts/ff_slp_btn:       0       invalid
/sys/firmware/acpi/interrupts/gpe00: 1614732    enabled
/sys/firmware/acpi/interrupts/gpe01:       0    invalid
/sys/firmware/acpi/interrupts/gpe02:       0    invalid
/sys/firmware/acpi/interrupts/gpe03:       0    disabled
/sys/firmware/acpi/interrupts/gpe04:       0    disabled
/sys/firmware/acpi/interrupts/gpe05:       0    disabled
/sys/firmware/acpi/interrupts/gpe06:       0    invalid
/sys/firmware/acpi/interrupts/gpe07:       0    invalid
/sys/firmware/acpi/interrupts/gpe08:       0    disabled
/sys/firmware/acpi/interrupts/gpe09:       0    disabled
/sys/firmware/acpi/interrupts/gpe0A:       0    invalid
/sys/firmware/acpi/interrupts/gpe0B:       0    disabled
/sys/firmware/acpi/interrupts/gpe0C:       0    disabled
/sys/firmware/acpi/interrupts/gpe0D:       0    disabled
/sys/firmware/acpi/interrupts/gpe0E:       0    disabled
/sys/firmware/acpi/interrupts/gpe0F:       0    invalid
/sys/firmware/acpi/interrupts/gpe10:       0    invalid
/sys/firmware/acpi/interrupts/gpe11:       0    invalid
/sys/firmware/acpi/interrupts/gpe12:       0    invalid
/sys/firmware/acpi/interrupts/gpe13:       0    invalid
/sys/firmware/acpi/interrupts/gpe14:       0    invalid
/sys/firmware/acpi/interrupts/gpe15:       0    invalid
/sys/firmware/acpi/interrupts/gpe16:       0    invalid
/sys/firmware/acpi/interrupts/gpe17:       0    invalid
/sys/firmware/acpi/interrupts/gpe18:       0    invalid
/sys/firmware/acpi/interrupts/gpe19:       0    invalid
/sys/firmware/acpi/interrupts/gpe1A:       0    invalid
/sys/firmware/acpi/interrupts/gpe1B:       0    invalid
/sys/firmware/acpi/interrupts/gpe1C:       0    invalid
/sys/firmware/acpi/interrupts/gpe1D:       0    invalid
/sys/firmware/acpi/interrupts/gpe1E:       0    invalid
/sys/firmware/acpi/interrupts/gpe1F:       0    invalid
/sys/firmware/acpi/interrupts/gpe_all: 1614732
/sys/firmware/acpi/interrupts/sci: 1614732

And again about 10 seconds later:

/sys/firmware/acpi/interrupts/error:       0
/sys/firmware/acpi/interrupts/ff_gbl_lock:       0      enabled
/sys/firmware/acpi/interrupts/ff_pmtimer:       0       invalid
/sys/firmware/acpi/interrupts/ff_pwr_btn:       0       enabled
/sys/firmware/acpi/interrupts/ff_rt_clk:       0        disabled
/sys/firmware/acpi/interrupts/ff_slp_btn:       0       invalid
/sys/firmware/acpi/interrupts/gpe00: 1660965    enabled
/sys/firmware/acpi/interrupts/gpe01:       0    invalid
/sys/firmware/acpi/interrupts/gpe02:       0    invalid
/sys/firmware/acpi/interrupts/gpe03:       0    disabled
/sys/firmware/acpi/interrupts/gpe04:       0    disabled
/sys/firmware/acpi/interrupts/gpe05:       0    disabled
/sys/firmware/acpi/interrupts/gpe06:       0    invalid
/sys/firmware/acpi/interrupts/gpe07:       0    invalid
/sys/firmware/acpi/interrupts/gpe08:       0    disabled
/sys/firmware/acpi/interrupts/gpe09:       0    disabled
/sys/firmware/acpi/interrupts/gpe0A:       0    invalid
/sys/firmware/acpi/interrupts/gpe0B:       0    disabled
/sys/firmware/acpi/interrupts/gpe0C:       0    disabled
/sys/firmware/acpi/interrupts/gpe0D:       0    disabled
/sys/firmware/acpi/interrupts/gpe0E:       0    disabled
/sys/firmware/acpi/interrupts/gpe0F:       0    invalid
/sys/firmware/acpi/interrupts/gpe10:       0    invalid
/sys/firmware/acpi/interrupts/gpe11:       0    invalid
/sys/firmware/acpi/interrupts/gpe12:       0    invalid
/sys/firmware/acpi/interrupts/gpe13:       0    invalid
/sys/firmware/acpi/interrupts/gpe14:       0    invalid
/sys/firmware/acpi/interrupts/gpe15:       0    invalid
/sys/firmware/acpi/interrupts/gpe16:       0    invalid
/sys/firmware/acpi/interrupts/gpe17:       0    invalid
/sys/firmware/acpi/interrupts/gpe18:       0    invalid
/sys/firmware/acpi/interrupts/gpe19:       0    invalid
/sys/firmware/acpi/interrupts/gpe1A:       0    invalid
/sys/firmware/acpi/interrupts/gpe1B:       0    invalid
/sys/firmware/acpi/interrupts/gpe1C:       0    invalid
/sys/firmware/acpi/interrupts/gpe1D:       0    invalid
/sys/firmware/acpi/interrupts/gpe1E:       0    invalid
/sys/firmware/acpi/interrupts/gpe1F:       0    invalid
/sys/firmware/acpi/interrupts/gpe_all: 1660965
/sys/firmware/acpi/interrupts/sci: 1660965
Comment 24 ykzhao 2009-02-12 21:28:42 UTC
Now it is very clear that the issue is related with GPE00 storm. 
   From the chipset datasheet we know that the GPE00 is driven by the external THRM signal. When the temperature arises, it will send the notification event to thermal zone so that OS can take actions to cool the temperature. But unfortunately OS can do nothing about it. In such case the gpe00 will be triggered again.
   
   Do you mean that the boot option of "thermal.act=35" is already added in the test of comment #19?
   But it seems that it has no effect and the active threshold is still 120.
    >active[0]:               120 C: devices= FAN

   If it is effective, the active threshold will be 35 and the FAN device will be turned on.
   Will you please double check it again?(had better attach the output of dmesg).
Comment 25 Erik de Castro Lopo 2009-02-15 14:56:56 UTC
Tested again.

 - Powered up after machine had been switched off over the weekend and boot
   into default kernel.

 - Check everything is in place:

    prompt:~$ cat /proc/cmdline 
    root=UUID=bde07cb2-73a3-4502-ae4c-781dc8c392fe ro thermal.act=35


    prompt:~$ gunzip -c --stdout /proc/config.gz  | grep THERMAL
    CONFIG_ACPI_THERMAL=y
    CONFIG_THERMAL=y
    CONFIG_THERMAL_HWMON=y

    prompt:~$ gunzip -c --stdout /proc/config.gz  | grep DSDT   
    CONFIG_ACPI_CUSTOM_DSDT_FILE="DSDT.hex"
    CONFIG_ACPI_CUSTOM_DSDT=y

 - Capture dmesg when machine still cold (attachment dmsg-cold.txt.gz).

 - Allow machine to heat up and then reboot.

 - On reboot kacpi_notify still runing at 50-60% CPU.

 - Check /proc/cmdline and /proc/config.gz was as before (it was).
 
 - Capture dmesg (attachment dmsg-hot.txt.gz).
 
Attachments coming soon.
Comment 26 Erik de Castro Lopo 2009-02-15 14:58:02 UTC
Created attachment 20257 [details]
dmesg output just ofter boot when machine was still cold.
Comment 27 Erik de Castro Lopo 2009-02-15 14:59:05 UTC
Created attachment 20258 [details]
dmesg output after machine was rebooted when already hot.
Comment 28 ykzhao 2009-02-15 21:38:11 UTC
Hi, Erik
    From the attached test it seems that the boot option of "thermal.act=35" is already accepted. But the output of /proc/acpi/thermal_zone/*/* indicates that the boot parameter has no effect.
   > <polling disabled>
   > state:                   ok
   > temperature:             40 C
   > critical (S5):           120 C
   > passive:                 120 C: tc1=4 tc2=3 tsp=60 devices=CPU0 
   >active[0]:               120 C: devices= FAN 
   (If takes effect, the active threshold should be 35 instead of 120).
   
   In such case the fan device will be in off-state. 
   > status:                  off
    
   How about the issue if adding the boot option of "thermal.act=-1" and the fan device is put in on-state.
   > echo 0 > /proc/acpi/fan/FAN/state
   
   Thanks.
Comment 29 Erik de Castro Lopo 2009-02-16 15:44:33 UTC
ykzhao,

You should be aware that regardless of the value of
/proc/acpi/fan/FAN/state, the fans (CPU and system fan) are on, all
the time, regardless of temperature. Their speed also does not vary
with temperature.

Anyway, here's what I did:

  a) Booted from a cold start (machine switched off overnight).

  b) Set "thermal.act=-1" in grub's menu.lst

  c) Rebooted.

  d) On boot, checked /etc/cmdline for new thermal.act setting (ok).

  e) Messed with /proc/acpi/fan/FAN/state (added it to /etc/rc.local)

     prompt# echo 0 > /proc/acpi/fan/FAN/state
     prompt# cat /proc/acpi/fan/FAN/state
     status:                  on

  f) Let the machine heat up.
  
  g) Rebooted.
 
On reboot, the behaviour was the same as before, kacpi_notify running
at 50-60% CPU.

Also checked that everything was as expected (thermal.act in /proc/cmdline,
/proc/acpi/fan/FAN/state and THERMAL settings in /proc/config.gz). Everything
was as expected.

At this point I am willing to accept that the BIOS on this board is buggy
and that there is nothing that can be done in the kernel to work around
the broken-ness.

Thanks very much for you exemplary level of support on this issue. It was
much, much better than I get from any commercial company I've dealt with
in the last 5 years.
Comment 30 ykzhao 2009-02-16 17:03:04 UTC
Hi, Erik
    Thanks for your understanding  and detailed test.
    Now it seems that the ACPI FAN device is already turned on. But it doesn't help to fix the issue of kacpid_notify.  This kacpid_notify issue is related with the GPE00 storm. 
    
    In fact there exist several problems on this box.
    > incorrect threshold
    > the thermal_zone temperature is constant
    > bogus fan device : there is no method that can turn on/off the FAN device.(_PS0, _PS3)
    
    thanks for your understanding that this bug can't be workaround in kernel. So this bug will be rejected and marked as "Will_NOT_FIX".
    
     

Note You need to log in before you can comment on or make changes to this bug.