Bug 8274

Summary: kacpi use up 100% CPU after thermal throttling invoked - Sony PCG-XR1G, Vaio SR17
Product: ACPI
Reporter: Cao Thanh Tung (thanhtung)
Component: Power-Thermal
Assignee: ykzhao (yakui.zhao)
Status: REJECTED UNREPRODUCIBLE
Severity: high
CC: acpi-bugzilla, akkzilla, bunk, johfel, messerli, protasnb
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.20
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: dmesg -s64000 output
acpidump output
Kernel config file (2.6.20)
Kernel config file (2.6.21-rc5)
cat /proc/acpi/processor/CPU0/*
cat /proc/acpi/thermal_zone/ATF0/*
dmesg -s64000 output (probably not very enlightening)
Output of acpidump
more CPU0/*

Description Cao Thanh Tung 2007-03-28 02:49:19 UTC
Most recent kernel where this bug did *NOT* occur: No, I have tried kernel
version 2.6.21rc4, the bug is still there.

Distribution: Arch Linux

Hardware Environment: Sony laptop PCG-XR1G, Celeron 466 MHz.

Problem Description: When the laptop temperature rises above 85 degrees, the CPU
is throttled down to reduce the temperature. However, after running a heavy task
for a while, the CPU can no longer be throttled down, and kacpid keeps running at
100% CPU.

Steps to reproduce:
Just run any job at 100% CPU for a few minutes. 

During the test, the fan is still running. The behavior of ACPI is correct at
first (throttling the CPU to reduce speed, and returning to normal when the
temperature gets lower), but after a while of this, kacpid hangs, I think. This
bug happens even when I start the laptop with no services (no acpid,
powersaved,...) and just try to run glxgears on X.
Comment 1 Len Brown 2007-03-29 21:23:11 UTC
Still an issue in 2.6.21-rc5?
Comment 2 Cao Thanh Tung 2007-03-30 05:41:14 UTC
Just tested kernel 2.6.21-rc5, the problem is still there. I cannot run glxgears
for more than 1 minute.
Comment 3 Len Brown 2007-03-30 13:16:34 UTC
This bug was filed against 2.6.20, and you confirmed it is
still present in 2.6.21-rc5.  What is the most recent release
before 2.6.20 where this bug was not present?

Please attach the complete output from dmesg -s64000 from that
release, plus your 2.6.21-rc5 boot.

Also, please attach the output from acpidump (running any release)

Comment 4 Cao Thanh Tung 2007-03-31 06:06:55 UTC
Sorry for the confusion, I just thought that I should file the bug against the
latest released kernel version. As far as I have tried, there is no kernel in
which this bug was not present (including 2.4.33.3 and 2.6.17.13 of Slackware,
2.6.19, 2.6.20).  I have tried the BSD kernel of FreeBSD and there is no such
error, so I think this is not a hardware problem (hopefully).
Comment 5 Cao Thanh Tung 2007-03-31 06:08:06 UTC
Created attachment 11012 [details]
dmesg -s64000 output
Comment 6 Cao Thanh Tung 2007-03-31 06:08:49 UTC
Created attachment 11013 [details]
acpidump output
Comment 7 Zhang Rui 2007-04-02 20:48:25 UTC
Could you attach your kernel config file and the output of
#cat /proc/acpi/processor/CPU0/*
#cat /proc/acpi/thermal_zone/ATF0/* when CPU is busy?
Comment 8 Cao Thanh Tung 2007-04-03 01:50:50 UTC
Created attachment 11029 [details]
Kernel config file (2.6.20)

Sorry, I've lost the kernel config for 2.6.21-rc5 kernel. Basically, it's the
same one. For those new parameters, I have chosen No to all of them.
Comment 9 Cao Thanh Tung 2007-04-03 01:57:42 UTC
Created attachment 11030 [details]
Kernel config file (2.6.21-rc5)

Well, found it.
Comment 10 Cao Thanh Tung 2007-04-03 01:58:17 UTC
Created attachment 11031 [details]
cat /proc/acpi/processor/CPU0/*
Comment 11 Cao Thanh Tung 2007-04-03 01:59:44 UTC
Created attachment 11032 [details]
cat /proc/acpi/thermal_zone/ATF0/*

This is the result before CPU is busy. When CPU is busy, any attempt to access
the temperature file (including acpi -t) will hang.
Comment 12 Akkana Peck 2007-06-25 10:41:21 UTC
I've been seeing this for quite a while on my Vaio SR17 (since Ubuntu Dapper, which I think was 2.6.15 -- see https://bugs.launchpad.net/bugs/75174), but it seemed to get a lot worse in 2.6.20 because it's apparently no longer possible to kill kacpid or prevent it from running. Once kacpid gets going, the only solution is to pull the power plug: the machine won't shut down normally.

I also see the problem with accessing the temperature file once the problem starts. I tried this: in one window, run a big compile (gimp, kernel). In another window, run top. In a third, run a loop that cats the thermal_zone file every 5 seconds. If I don't run the thermal zone loop, then top will show kacpid as the runaway process; if I do, then "cat" will be the runaway process (kill -9 won't kill the cat), but either way the system becomes unusable after a few minutes of compiling. The last temperature shown is typically about 71/72C (fairly typical for this machine during a compile).
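
A rough way to put a number on "runaway process" instead of eyeballing top is to sample the CPU ticks of the suspect task twice. The sketch below is only an illustration: it assumes the usual /proc/<pid>/stat field layout (utime and stime as fields 14 and 15) and a tick rate of 100 per second, and the function names are made up for this example.

```python
def cpu_ticks(stat_line):
    """Return (comm, utime + stime) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split around the last ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # 'tail' starts at field 3 (state); utime is field 14, stime field 15.
    utime, stime = int(fields[11]), int(fields[12])
    return comm, utime + stime


def cpu_share(ticks_before, ticks_after, interval_s, hz=100):
    """Fraction of one CPU used between two samples interval_s apart.

    hz=100 is an assumption; the real value is os.sysconf("SC_CLK_TCK").
    """
    return (ticks_after - ticks_before) / (hz * interval_s)
```

Reading the stat file of the kacpid task (or of the stuck cat) every few seconds and feeding the two tick counts to cpu_share() gives a value near 1.0 when the task is consuming a whole CPU.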
Comment 13 Akkana Peck 2007-06-25 12:18:07 UTC
Created attachment 11872 [details]
dmesg -s64000 output (probably not very enlightening)
Comment 14 Akkana Peck 2007-06-25 12:22:15 UTC
Created attachment 11873 [details]
Output of acpidump
Comment 15 Akkana Peck 2007-06-25 12:24:25 UTC
Created attachment 11874 [details]
more CPU0/*
Comment 16 Len Brown 2007-08-13 22:40:42 UTC
When provoking the failure, please kill acpid (the user process)
and cat /proc/acpi/event
And also watch the acpi line in /proc/interrupts
do you see any changes when the problem starts, and when it persists?

Please build with CONFIG_ACPI_THERMAL=n and report if the issue
persists, changes, or goes away.
Comment 17 Akkana Peck 2007-08-14 09:29:14 UTC
Nothing shows up in /proc/acpi/event. The acpi line in /proc/interrupts is the same after the problem occurs as it was right after booting:

  9:      42414    XT-PIC-XT        acpi, ohci1394, yenta, uhci_hcd:usb1, YMFPCI, eth0

That was with 2.6.21.3 and CONFIG_ACPI_THERMAL turned on. Will try 22.2 with thermal turned off next.
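
To compare the acpi interrupt count between two points in time, a line such as the one quoted above can be parsed and differenced. This is a minimal sketch that assumes the single-CPU /proc/interrupts layout shown here (one count column); the helper names are illustrative only. Note that IRQ 9 is shared with several other devices on this machine, so a rising count cannot be attributed to ACPI alone.

```python
def parse_irq_line(line):
    """Split a single-CPU /proc/interrupts line into its parts.

    Returns (irq, count, controller, devices).
    """
    # Split on the first ':' only, since device names may contain ':'.
    irq, rest = line.split(":", 1)
    count, controller, names = rest.split(None, 2)
    devices = [name.strip() for name in names.split(",")]
    return irq.strip(), int(count), controller, devices


def irq_rate(count_before, count_after, seconds):
    """Average interrupts per second between two samples."""
    return (count_after - count_before) / seconds
```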
Comment 18 Len Brown 2007-08-14 10:12:53 UTC
> sonypi: Sony Programmable I/O Controller Driver v1.26.
> sonypi: detected type1 model, verbose = 0, fnkeyinit = off, camera = off,
> compat = off, mask = 0xffffffff, useinput = on, acpi = on
> sonypi: enabled at irq=11, port1=0x10c0, port2=0x10c4
> sonypi: device allocated minor is 63

Please test w/o the sonypi driver.
Comment 19 Akkana Peck 2007-08-14 11:24:20 UTC
Of course there is one difference: that second line ("CPU0") was 165 at boot time, 42414 after -- didn't mean to imply no irq9s were received.

It looks like turning off CONFIG_ACPI_THERMAL is doing the trick here. I thought I had tried that already! (Though someone else with the same kacpid problem in bug 6944 reported CONFIG_ACPI_THERMAL didn't help.)

I haven't yet tried with thermal but without sonypi, later today. I just noticed there are two sonypi options now: the old one, and a new legacy mode in "sony laptop support", so it looks like there are several combinations to test.
Comment 20 Akkana Peck 2007-08-14 14:55:19 UTC
> Please test w/o the sonypi driver.

Problem still occurs with CONFIG_ACPI_THERMAL=y but # CONFIG_SONY_LAPTOP is not set and # CONFIG_SONYPI is not set. So it's not either of the sonypi drivers, and thermal is looking like a likely culprit.
Comment 21 Len Brown 2007-08-14 15:57:41 UTC
please paste the contents of
more /proc/acpi/thermal_zone/*/* | cat

Please boot 2.6.23-rc3.
thermal.off=1 should give you the same result as CONFIG_ACPI_THERMAL=n
Try thermal.psv=-1 to disable just the passive trip function,
Verify that the passive trip points are now missing from /proc/acpi/thermal_zone/*/trip_points
and see if the problem is still gone.

Then please build your kernel with CONFIG_CPU_FREQ=n to make sure
that with no boot params, the error still occurs.

Then please build with CONFIG_ACPI_DEBUG=y
# echo 0x04000000 > /sys/module/parameters/acpi/debug_layer
# echo 0xffffffff > /sys/module/parameters/acpi/debug_level

and capture the messages as you provoke thermal throttling
and the kacpid loop.
Comment 22 Akkana Peck 2007-08-14 17:52:32 UTC
thermal_zone on older kernels (this is actually from an Ubuntu 2.6.20-16, let me know if that's not good enough):

::::::::::::::
/proc/acpi/thermal_zone/ATF0/cooling_mode
::::::::::::::
<setting not supported>
cooling mode:   passive
::::::::::::::
/proc/acpi/thermal_zone/ATF0/polling_frequency
::::::::::::::
<polling disabled>
::::::::::::::
/proc/acpi/thermal_zone/ATF0/state
::::::::::::::
state:                   ok
::::::::::::::
/proc/acpi/thermal_zone/ATF0/temperature
::::::::::::::
temperature:             47 C
::::::::::::::
/proc/acpi/thermal_zone/ATF0/trip_points
::::::::::::::
critical (S5):           100 C
passive:                 73 C: tc1=1 tc2=1 tsp=50 devices=0xcabb5338 

If I boot .23-rc3 with thermal.psv=-1, then trip_points changes to:
::::::::::::::
/proc/acpi/thermal_zone/ATF0/trip_points
::::::::::::::
critical (S5):           100 C
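
Checking that the passive trip point really disappeared can be scripted. The following is a small sketch that parses trip_points text in the format pasted above; the function name and the set of recognized trip names (critical, hot, passive, active[N]) are assumptions made for this example.

```python
import re

# Matches lines such as "critical (S5):   100 C" or "passive:   73 C: tc1=1 ...".
_TRIP_RE = re.compile(
    r"\s*(critical|hot|passive|active\[\d+\])(?:\s*\(\w+\))?:\s+(\d+)\s*C"
)


def parse_trip_points(text):
    """Map trip-point name -> temperature in degrees C."""
    trips = {}
    for line in text.splitlines():
        match = _TRIP_RE.match(line)
        if match:
            trips[match.group(1)] = int(match.group(2))
    return trips
```

With the default boot the result contains both "critical" and "passive"; with thermal.psv=-1 only "critical" should remain.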

But it's looking like with thermal.psv=-1 it's not going to do its thing -- I'm about halfway through a kernel build and kacpid hasn't run away yet, and the fan isn't staying on full (like it normally does when kacpid is going to run away). I'll go ahead and send this and see if it locks up right afterward. :-)

CPU_FREQ has been off all along. (Someone in one of the kacpid bugs said turning it off helped, so I tried it; it didn't help here but I guess I left it off.)

ACPI_DEBUG messages to follow in a while.
Comment 23 Akkana Peck 2007-08-14 19:30:42 UTC
Oops, I don't have a /sys/module/parameters. This kernel is mostly non-modular (makes it easier to build it on a faster machine and copy it over). Would I make all of ACPI modular to make /sys/module/parameters appear? Or is there a way to enable this in a non-modular kernel that has CONFIG_ACPI=y and CONFIG_ACPI_DEBUG=y?
Comment 24 Zhang Rui 2007-08-14 19:34:54 UTC
># echo 0x04000000 > /sys/module/parameters/acpi/debug_layer
># echo 0xffffffff > /sys/module/parameters/acpi/debug_level
It's a typo, and it should be:
# echo 0x04000000 > /sys/module/acpi/parameters/debug_layer
# echo 0xffffffff > /sys/module/acpi/parameters/debug_level
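
For anyone scripting this, the two corrected writes can be expressed as below. This is only a sketch: the helper names are invented for illustration, the default values are the ones from the commands above, and actually applying the writes needs root and a kernel built with CONFIG_ACPI_DEBUG=y.

```python
ACPI_PARAMS = "/sys/module/acpi/parameters"


def acpi_debug_writes(layer=0x04000000, level=0xFFFFFFFF, base=ACPI_PARAMS):
    """Return the (path, text) pairs equivalent to the two echo commands."""
    return [
        (f"{base}/debug_layer", "0x%08x" % layer),
        (f"{base}/debug_level", "0x%08x" % level),
    ]


def apply_writes(writes):
    """Write each value to its sysfs file (requires root)."""
    for path, text in writes:
        with open(path, "w") as handle:
            handle.write(text + "\n")
```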
Comment 25 Akkana Peck 2007-08-14 23:05:59 UTC
Well, I guess I should have tested .23-rc3 earlier. I've been trying all evening with various combinations of config options, and can't reproduce the problem on .23-rc3. I'll keep running rc3 and see if the problem recurs, but it looks like it might be fixed!
Comment 26 Len Brown 2007-08-20 12:20:16 UTC
If this problem went away, it would be good to find out when and why.
It may be that we need to back-port a fix for users of older kernels.

Akkana,
You've seen the kacpid looping problem in 2.6.15 - 2.6.20,
but it appears gone in 2.6.23, yes?
(please verify that you're not running with any kernel patches
 or tweaked kernel parameters in 2.6.23, and report if you are
 able to get the machine up to the 73C passive trip point)

What is the latest release that still exhibits the problem?


Cao,
You've seen the problem in 2.4.33.3 and 2.6.17.13 of Slackware, 2.6.19,
2.6.20, and then 2.6.21.

how about 2.6.22 and 2.6.23?
Comment 27 Akkana Peck 2007-08-21 20:04:56 UTC
I've seen the problem as late as 2.6.22.2, and it appears gone in 2.6.23.

In both .22.2 and .23, the machine goes up to and sometimes over the passive trip point. Going over the trip point doesn't seem to be what triggers the problem (it tends to bounce up and down between 74 and 71 with occasional forays to 75 or 60; passive trip is 73).

It does seem related to thermal_zone. If I run both top and
while true; do cat /proc/acpi/thermal_zone/*/*; sleep 10; done
during a kernel build, the hang will be in the cat (ps shows cat hung, and nothing obvious shows up in top -- kacpid isn't up near the top). If I don't run the thermal_zone loop, and just watch top, then kacpid is always #1 in top (but after the problem occurs I can't cat /proc/acpi/thermal_zone/ATF0/temperature any more; it'll just hang).
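
Because the reader itself hangs once the problem starts, a monitoring loop is easier to work with if each read runs in a child process with a deadline. The sketch below is one possible approach, not something that was used in this report: probe() and read_temperature() are hypothetical names, and the child is deliberately not waited for after a timeout, since a cat stuck in the kernel may ignore the kill.

```python
import subprocess


def probe(argv, timeout_s):
    """Run argv; return its stdout, or None if it did not finish in time."""
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return out
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait() here, because a reader blocked
        # inside the kernel may never exit, even after SIGKILL.
        proc.kill()
        return None


def read_temperature(path, timeout_s=5.0):
    """Read one thermal_zone file via cat; None means the read hung."""
    return probe(["cat", path], timeout_s)
```

A None result from read_temperature() marks the moment the thermal_zone file stopped responding, which can then be lined up with the kacpid CPU usage seen in top.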

Here's the bad news: in 2.6.22.2 with CONFIG_ACPI_DEBUG=y, I haven't been able to trigger it, with or without the thermal debugging suggested by zhang rui.
Comment 28 johfel 2007-09-12 08:26:51 UTC
I have the same problem (high cpu usage of kacpid) on 2.6.23-rc6 (!). The problem appears on a new laptop (Intel Core 2 Duo,  Intel 82801H Hardware).

In dmesg (perhaps interesting):
hald-addon-acpi[3920]: segfault at 00000a58 eip b7e7979c esp bf81b568 error 4

Access to /proc/acpi/thermal_zone/THRM/* is *not* blocked, so my
bug is perhaps not related to thermal_zone.
The system is usable (if I set kacpid to nice level 19), but a little slowed.

I'm going to do some experiments to find something about the bug.
Comment 29 johfel 2007-09-13 07:34:34 UTC
About my problem on 2.6.23-rc6:
The bug occurs only if I run KDE with kpowersave. Without kpowersave everything seems normal.
Comment 30 Fu Michael 2007-10-22 06:01:10 UTC
Len, does comment# 27 help?
Comment 31 Akkana Peck 2007-11-16 15:12:12 UTC
Bad news: I just saw this on 2.6.23.1. I've been running that kernel for a while and this is the first time I've seen it, so it's happening much less often than before, but it's not gone.

Unfortunately I wasn't prepared for it and didn't have debug on, so I couldn't capture any ACPI debug messages. I'll try to make sure it's always on for future builds. (Is there any way to have it always on by default, rather than echoing those lines manually after booting?)
Comment 32 Zhang Rui 2007-11-18 21:39:30 UTC
>Is there any way to have it always on by default, rather than
>echoing those lines manually after booting?
For a kernel with CONFIG_ACPI_DEBUG set, you can add these two kernel parameters acpi.debug_layer=0x04 acpi.debug_level=0x8800001f.
Comment 33 Len Brown 2007-12-14 21:56:37 UTC
> The bug occurs only if I run KDE with kpowersave.

Oh goody, I wonder what kpowersave is doing...

Akkana, are you running kpowersave also?

johfel,
can you provide the info in comment #22 but for your system?
Comment 34 johfel 2007-12-15 05:28:57 UTC
For me the 100%-usage-kacpid-problem seems to have disappeared: I can't reproduce it anymore (with 2.6.23-rc3, -rc6 and 2.6.23.8).

I made some system upgrades (I'm using debian sid) and changed some small things in my kernel config (but nothing directly acpi related). I sadly lost the original kernel config with which the problem occurred. I tried the version of kpowersave with which the problem occurred again, but now everything seems normal. The whole thing is very strange.

Here it is for 2.6.23-rc3:
# cat /proc/acpi/thermal_zone/THRM/*
0 - Active; 1 - Passive
<polling disabled>
state:                   ok
temperature:             52 C
critical (S5):           105 C
passive:                 95 C: tc1=4 tc2=3 tsp=100 devices=CPU0
active[0]:               60 C: devices= FAN
Comment 35 Akkana Peck 2007-12-15 11:31:18 UTC
I don't know what happened that one time (I wonder if I wasn't running the kernel I thought I was?) but I haven't seen the problem again on 2.6.23.1, .23.8 or .24-pre.

I wasn't running kpowersave, but maybe that makes it worse somehow.
Comment 36 Len Brown 2008-01-13 23:48:14 UTC
okay, this failure seems to be vanishing.
if somebody can reproduce it with recent software
often enough that we can debug it, please re-open this report.
Comment 37 Jithin Emmanuel 2008-03-04 21:42:04 UTC
I use Ubuntu 2.6.22-14 and have this problem: kacpid is eating 70+% of my CPU.

When I try to run cat /proc/acpi/thermal_zone/THRM/* I get:
cat: /proc/acpi/thermal_zone/THRM/*: No such file or directory
Comment 38 ykzhao 2008-03-04 23:12:47 UTC
Hi, Jithin
    Your laptop has a different problem.
    Please open a new bug and attach the output of dmesg, acpidump.
    thanks.