Bug 12620
Summary: | kacpi_notify chews up constant 50-60% CPU when machine gets hot. | ||
---|---|---|---|
Product: | ACPI | Reporter: | Erik de Castro Lopo (mle+reportbug) |
Component: | Power-Thermal | Assignee: | ykzhao (yakui.zhao) |
Status: | REJECTED WILL_NOT_FIX | ||
Severity: | normal | CC: | acpi-bugzilla |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.28.2 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
Gzipped output of acpidump
try the custom DSDT in which the _PS0/_PS3 object is added for the FAN device Output from command "acpidump --addr 0xff810 --length 0x40 -o temm", Gzipped. dmesg output just ofter boot when machine was still cold. dmesg output after machine was rebooted when already hot. |
Description
Erik de Castro Lopo
2009-02-01 16:35:51 UTC
If the machine is powered down and allowed to cool, the next boot has kacpi_notify behaving normally again. Hi, Erik Will you please attach the output of acpidump? Thanks. Created attachment 20068 [details]
Gzipped output of acpidump
Machine was allowed to cool before doing acpidump.
Hi, Erik Thanks for so quick response. From the acpidump there exists the ACPI fan device. When the temperature is above the threshold returned by the _AC0 object, OS will try to turn on the ACPI fan device to cool the temperature. But unfortunately it is a bogus FAN device. There is no ACPI object(_PS0/_PS3) to turn on/off the FAN device. >Device (FAN) { Name (_HID, EisaId ("PNP0C0B")) Method (_INI, 0, NotSerialized) { Store (TP1H, CTOS) Store (TP1L, CTHY) } } At the same time there exists the following definition under the scope of \_GPE: >Method (_L00, 0, NotSerialized) { Notify (\_TZ.THRM, 0x80) } Maybe when the temperature is above the threshold, the ACPI event(0x80) is sent to the thermal object. But OS can do nothing to cool the temperature. So the ACPI event(0x80) will continually be sent to the thermal object. In such case the kacpi_notify will chew up 50-60% CPU. Will you please check whether there exists the BIOS option related with FAN device? Thanks. Will Created attachment 20069 [details]
try the custom DSDT in which the _PS0/_PS3 object is added for the FAN device
Will you please try the custom DSDT and see whether the problem still exists?
In the custom DSDT the _PS3/_PS0 object is added for the ACPI fan device.
How to use the custom DSDT can be found in:
www.lesswatts.org/projects/acpi/faq.php
Thanks.
(In reply to comment #4) > Thanks for so quick response. Thank you! > Maybe when the temperature is above the threshold, the ACPI event(0x80) > is > sent to the thermal object. But OS can do nothing to cool the temperature. So > the ACPI event(0x80) will continually be sent to the thermal object. In such > case the kacpi_notify will chew up 50-60% CPU. One of my colleagues was investigating an heat related issue when he noticed the kacpi_notify CPU usage. He then called me in to look at that. Shouldn't kacpi_notify realise that something is wrong and limit itself to say 5-10% CPU? > Will you please check whether there exists the BIOS option related with > FAN > device? > Thanks. > Will > The kacpi_notify is used to process the notification event sent from BIOS. And it is difficult to judge whether something is wrong. Will you please try the custom DSDT and see whether the problem still exists? Thanks. Ok, I did the following: a) cd linux-2.6.28.2 b) wget "http://bugzilla.kernel.org/attachment.cgi?id=20069&action=view" -O drivers/acpi/DSDT.hex c) Edited .config and added: CONFIG_ACPI_CUSTOM_DSDT=y CONFIG_ACPI_CUSTOM_DSDT_FILE="DSDT.hex" d) Built and installed the kernel. e) Rebooted and checked for "ACPI: Table DSDT replaced by host OS" in the dmesg outout (it was present). f) Repeated my testing from before. After the machine had heated up considerably, and was then rebooted (again checking dmesg), kacpi_notify again took up 50-60% CPU. Just so that we're clear, the kacpi_notify problem is something that only happens when deliberately heat stressing a machine to try and replicate a problem experienced by a customer in the field. I should also mention that although I can see the real CPU temperature in the BIOS, so the hardware is working, but the file /proc/acpi/thermal_zone/THRM/temperature says 40 degress C regardless. Hi, Erik Thanks for the test. Will you please confirm whether the FAN device is turned on/off when the custom DSDT is used? Based on your description it seems that 40 degree is reported by /proc/acpi/thermal_zone/THRM/temperature regardless of real CPU temperature. Maybe this is caused by the BIOS bug. The temperature of thermal zone is obtained by the _TMP object, in which the RTMP is evaluated. The definition of RTMP is listed in the following: >Method (RTMP, 0, NotSerialized) { .... If (LEqual (SSHU, 0x01)) { Return (0x0C3C) // IF the SSHU is one, the 40 degree will be returned. } Else { Return (Local0) } } Will you please attach the following output? > ./acpidump --addr 0xff810 --length 0x40 -o temm Thanks. Will you please also attach the output of "/proc/acpi/thermal_zone/thrm/* "? Thanks. ykzhao, thanks. Data you requested below. 1) Both the CPU and the system fan are on. 2 ) > acpidump --addr 0xff810 --length 0x40 -o temm > hexdump temm 0000000 0f5c 0f3e 0f5c 0f3e 0f5c 0001 6967 7365 0000010 202c 744c 0000 0000 1000 0000 7869 5420 0000020 6365 6e68 0000 000d 0000 0000 0000 3f6e 0000030 1e44 40b8 8e00 a1d8 0013 cf1f e963 f61f 0000040 I'll also add this as a gzippped attachment. 3) > cat /proc/acpi/thermal_zone/THRM/* 0 - Active; 1 - Passive <polling disabled> state: ok temperature: 40 C critical (S5): 120 C passive: 120 C: tc1=4 tc2=3 tsp=60 devices=CPU0 active[0]: 120 C: devices= FAN I did at one stage enable polling and got "<polling enabled>" in the above but that didn't help. Created attachment 20098 [details]
Output from command "acpidump --addr 0xff810 --length 0x40 -o temm", Gzipped.
Hi, Erik Thanks for the info. From the log in comment #13 it is confirmed that this issue is caused by broken BIOS. a.SSHU is one. In such case the 40 degree is always reported by thermal zone. b. the active threshold is 120. If so, only when the temperature is above the threshold, the FAN device will be turned on. But in fact on this box this is a bogus FAN. There is no method that can turn on/off the FAN device. c. When the temperature rises, the GPE _L00 method will send ACPI notification event(0x80) to thermal zone so that OS can take some actions to cool the temperature. Unfortunately OS can do nothing because of bogus thermal zone. As the temperature can't be decreased, it will continue to send the notification event. As this bug is related with the broken bios, it had better be fixed by BIOS upgrading. Will you please add the boot option of "thermal.act=35" and see whether this issue still exists? The driver/acpi/thermal should be compiled as built-in kernel and the custom DSDT should also be used. Thanks. I tried the thermal.act=35 boot option, but got an error saying "unknown boot option". I am sure I have CONFIG_ACPI_THERMAL set to yes. I checked twice. Will you please double check it again? Just to make sure I did the following: a) Compiled kernel 2.6.28.4 with the custom DSDT.hex file, and CONFIG_ACPI_THERMAL=y and CONFIG_IKCONFIG_PROC=y. b) Rebooted and checked /proc/config.gz for CONFIG_ACPI_THERMAL=y as well as the custom DSDT.hex file. c) Allowed machine to heat up. d) Rebooted with "thermal.act=35". This time the kernel accepted the thermal.act parameter, but kcapi_notify still ran at 50-50% CPU. Will you please cat the output of /proc/acpi/fan/*/* and /proc/acpi/thermal_zone/*/*? Thanks. prompt $ uname -a Linux sn-11067d15 2.6.28.4test3 #1 SMP Thu Feb 12 14:50:34 EST 2009 i686 GNU/Linux prompt $ gunzip -c --stdout /proc/config.gz | grep THERMAL CONFIG_ACPI_THERMAL=y CONFIG_THERMAL=y CONFIG_THERMAL_HWMON=y prompt $ gunzip -c --stdout /proc/config.gz | grep DSDT CONFIG_ACPI_CUSTOM_DSDT_FILE="DSDT.hex" CONFIG_ACPI_CUSTOM_DSDT=y prompt $ cat /proc/acpi/fan/*/* status: off prompt $ cat /proc/acpi/thermal_zone/*/* 0 - Active; 1 - Passive <polling disabled> state: ok temperature: 40 C critical (S5): 120 C passive: 120 C: tc1=4 tc2=3 tsp=60 devices=CPU0 active[0]: 120 C: devices= FAN hi, Erik Is the boot option of "thermal.act=35" added while doing the test in comment #19? If not, had better add it. Thanks. when kacpid_notify is busy, please attach the output of "grep . /sys/firmware/acpi/interrupts/*". And run this command twice in a row to see if there is any GPE/fixed event that keeps on firing. Yakui, note that the kacpi_notify is busy after REBOOT. we probably hit some interrupt/notification storm problems here. Understand what you said. In fact the problem is triggered only when the temperature is very high. >Power up machine. >Run CPU intensive tasks with air holes blocked until machine heats up significantly. >Reboot.And machine comes up with kacpi_notify consuming about 60% of CPU. From the comment #1 the kacpid_notify is normal if the temperature is very low. Thanks. A couple of answers: > Is the boot option of "thermal.act=35" added while doing the test in > comment #19? Yes, I added this to the grub menu.list: prompt $ cat /proc/cmdline root=UUID=bde07cb2-73a3-4502-ae4c-781dc8c392fe ro quiet splash thermal.act=35 > From the comment #1 the kacpid_notify is normal if the temperature is very > low. Thats correct. If machine is powered down and left to cool down, on boot kacpi_notify CPU usage is low (less than 1%). I *only* see high kacpi_notify CPU usage when the machine is rebooted when *already* heat stressed. > when kacpid_notify is busy, please attach the output of "grep . > /sys/firmware/acpi/interrupts/*" prompt $ grep . /sys/firmware/acpi/interrupts/* /sys/firmware/acpi/interrupts/error: 0 /sys/firmware/acpi/interrupts/ff_gbl_lock: 0 enabled /sys/firmware/acpi/interrupts/ff_pmtimer: 0 invalid /sys/firmware/acpi/interrupts/ff_pwr_btn: 0 enabled /sys/firmware/acpi/interrupts/ff_rt_clk: 0 disabled /sys/firmware/acpi/interrupts/ff_slp_btn: 0 invalid /sys/firmware/acpi/interrupts/gpe00: 1614732 enabled /sys/firmware/acpi/interrupts/gpe01: 0 invalid /sys/firmware/acpi/interrupts/gpe02: 0 invalid /sys/firmware/acpi/interrupts/gpe03: 0 disabled /sys/firmware/acpi/interrupts/gpe04: 0 disabled /sys/firmware/acpi/interrupts/gpe05: 0 disabled /sys/firmware/acpi/interrupts/gpe06: 0 invalid /sys/firmware/acpi/interrupts/gpe07: 0 invalid /sys/firmware/acpi/interrupts/gpe08: 0 disabled /sys/firmware/acpi/interrupts/gpe09: 0 disabled /sys/firmware/acpi/interrupts/gpe0A: 0 invalid /sys/firmware/acpi/interrupts/gpe0B: 0 disabled /sys/firmware/acpi/interrupts/gpe0C: 0 disabled /sys/firmware/acpi/interrupts/gpe0D: 0 disabled /sys/firmware/acpi/interrupts/gpe0E: 0 disabled /sys/firmware/acpi/interrupts/gpe0F: 0 invalid /sys/firmware/acpi/interrupts/gpe10: 0 invalid /sys/firmware/acpi/interrupts/gpe11: 0 invalid /sys/firmware/acpi/interrupts/gpe12: 0 invalid /sys/firmware/acpi/interrupts/gpe13: 0 invalid /sys/firmware/acpi/interrupts/gpe14: 0 invalid /sys/firmware/acpi/interrupts/gpe15: 0 invalid /sys/firmware/acpi/interrupts/gpe16: 0 invalid /sys/firmware/acpi/interrupts/gpe17: 0 invalid /sys/firmware/acpi/interrupts/gpe18: 0 invalid /sys/firmware/acpi/interrupts/gpe19: 0 invalid /sys/firmware/acpi/interrupts/gpe1A: 0 invalid /sys/firmware/acpi/interrupts/gpe1B: 0 invalid /sys/firmware/acpi/interrupts/gpe1C: 0 invalid /sys/firmware/acpi/interrupts/gpe1D: 0 invalid /sys/firmware/acpi/interrupts/gpe1E: 0 invalid /sys/firmware/acpi/interrupts/gpe1F: 0 invalid /sys/firmware/acpi/interrupts/gpe_all: 1614732 /sys/firmware/acpi/interrupts/sci: 1614732 And again about 10 seconds later: /sys/firmware/acpi/interrupts/error: 0 /sys/firmware/acpi/interrupts/ff_gbl_lock: 0 enabled /sys/firmware/acpi/interrupts/ff_pmtimer: 0 invalid /sys/firmware/acpi/interrupts/ff_pwr_btn: 0 enabled /sys/firmware/acpi/interrupts/ff_rt_clk: 0 disabled /sys/firmware/acpi/interrupts/ff_slp_btn: 0 invalid /sys/firmware/acpi/interrupts/gpe00: 1660965 enabled /sys/firmware/acpi/interrupts/gpe01: 0 invalid /sys/firmware/acpi/interrupts/gpe02: 0 invalid /sys/firmware/acpi/interrupts/gpe03: 0 disabled /sys/firmware/acpi/interrupts/gpe04: 0 disabled /sys/firmware/acpi/interrupts/gpe05: 0 disabled /sys/firmware/acpi/interrupts/gpe06: 0 invalid /sys/firmware/acpi/interrupts/gpe07: 0 invalid /sys/firmware/acpi/interrupts/gpe08: 0 disabled /sys/firmware/acpi/interrupts/gpe09: 0 disabled /sys/firmware/acpi/interrupts/gpe0A: 0 invalid /sys/firmware/acpi/interrupts/gpe0B: 0 disabled /sys/firmware/acpi/interrupts/gpe0C: 0 disabled /sys/firmware/acpi/interrupts/gpe0D: 0 disabled /sys/firmware/acpi/interrupts/gpe0E: 0 disabled /sys/firmware/acpi/interrupts/gpe0F: 0 invalid /sys/firmware/acpi/interrupts/gpe10: 0 invalid /sys/firmware/acpi/interrupts/gpe11: 0 invalid /sys/firmware/acpi/interrupts/gpe12: 0 invalid /sys/firmware/acpi/interrupts/gpe13: 0 invalid /sys/firmware/acpi/interrupts/gpe14: 0 invalid /sys/firmware/acpi/interrupts/gpe15: 0 invalid /sys/firmware/acpi/interrupts/gpe16: 0 invalid /sys/firmware/acpi/interrupts/gpe17: 0 invalid /sys/firmware/acpi/interrupts/gpe18: 0 invalid /sys/firmware/acpi/interrupts/gpe19: 0 invalid /sys/firmware/acpi/interrupts/gpe1A: 0 invalid /sys/firmware/acpi/interrupts/gpe1B: 0 invalid /sys/firmware/acpi/interrupts/gpe1C: 0 invalid /sys/firmware/acpi/interrupts/gpe1D: 0 invalid /sys/firmware/acpi/interrupts/gpe1E: 0 invalid /sys/firmware/acpi/interrupts/gpe1F: 0 invalid /sys/firmware/acpi/interrupts/gpe_all: 1660965 /sys/firmware/acpi/interrupts/sci: 1660965 Now it is very clear that the issue is related with GPE00 storm. From the chipset datasheet we know that the GPE00 is driven by the external THRM signal. When the temperature arises, it will send the notification event to thermal zone so that OS can take actions to cool the temperature. But unfortunately OS can do nothing about it. In such case the gpe00 will be triggered again. Do you mean that the boot option of "thermal.act=35" is already added in the test of comment #19? But it seems that it has no effect and the active threshold is still 120. >active[0]: 120 C: devices= FAN If it is effective, the active threshold will be 35 and the FAN device will be turned on. Will you please double check it again?(had better attach the output of dmesg). Tested again. - Powered up after machine had been switched off over the weekend and boot into default kernel. - Check everything is in place: prompt:~$ cat /proc/cmdline root=UUID=bde07cb2-73a3-4502-ae4c-781dc8c392fe ro thermal.act=35 prompt:~$ gunzip -c --stdout /proc/config.gz | grep THERMAL CONFIG_ACPI_THERMAL=y CONFIG_THERMAL=y CONFIG_THERMAL_HWMON=y prompt:~$ gunzip -c --stdout /proc/config.gz | grep DSDT CONFIG_ACPI_CUSTOM_DSDT_FILE="DSDT.hex" CONFIG_ACPI_CUSTOM_DSDT=y - Capture dmesg when machine still cold (attachment dmsg-cold.txt.gz). - Allow machine to heat up and then reboot. - On reboot kacpi_notify still runing at 50-60% CPU. - Check /proc/cmdline and /proc/config.gz was as before (it was). - Capture dmesg (attachment dmsg-hot.txt.gz). Attachments coming soon. Created attachment 20257 [details]
dmesg output just ofter boot when machine was still cold.
Created attachment 20258 [details]
dmesg output after machine was rebooted when already hot.
Hi, Erik From the attached test it seems that the boot option of "thermal.act=35" is already accepted. But the output of /proc/acpi/thermal_zone/*/* indicates that the boot parameter has no effect. > <polling disabled> > state: ok > temperature: 40 C > critical (S5): 120 C > passive: 120 C: tc1=4 tc2=3 tsp=60 devices=CPU0 >active[0]: 120 C: devices= FAN (If takes effect, the active threshold should be 35 instead of 120). In such case the fan device will be in off-state. > status: off How about the issue if adding the boot option of "thermal.act=-1" and the fan device is put in on-state. > echo 0 > /proc/acpi/fan/FAN/state Thanks. ykzhao, You should be aware that regardless of the value of /proc/acpi/fan/FAN/state, the fans (CPU and system fan) are on, all the time, regardless of temperature. Their speed also does not vary with temperature. Anyway, here's what I did: a) Booted from a cold start (machine switched off overnight). b) Set "thermal.act=-1" in grub's menu.lst c) Rebooted. d) On boot, checked /etc/cmdline for new thermal.act setting (ok). e) Messed with /proc/acpi/fan/FAN/state (added it to /etc/rc.local) prompt# echo 0 > /proc/acpi/fan/FAN/state prompt# cat /proc/acpi/fan/FAN/state status: on f) Let the machine heat up. g) Rebooted. On reboot, the behaviour was the same as before, kacpi_notify running at 50-60% CPU. Also checked that everything was as expected (thermal.act in /proc/cmdline, /proc/acpi/fan/FAN/state and THERMAL settings in /proc/config.gz). Everything was as expected. At this point I am willing to accept that the BIOS on this board is buggy and that there is nothing that can be done in the kernel to work around the broken-ness. Thanks very much for you exemplary level of support on this issue. It was much, much better than I get from any commercial company I've dealt with in the last 5 years. Hi, Erik Thanks for your understanding and detailed test. Now it seems that the ACPI FAN device is already turned on. But it doesn't help to fix the issue of kacpid_notify. This kacpid_notify issue is related with the GPE00 storm. In fact there exist several problems on this box. > incorrect threshold > the thermal_zone temperature is constant > bogus fan device : there is no method that can turn on/off the FAN device.(_PS0, _PS3) thanks for your understanding that this bug can't be workaround in kernel. So this bug will be rejected and marked as "Will_NOT_FIX". |