Kernel Bug Tracker – Bug 8274
kacpi use up 100% CPU after thermal throttling invoked - Sony PCG-XR1G, Vaio SR17
Last modified: 2008-03-04 23:12:47 UTC
Most recent kernel where this bug did *NOT* occur: No, I have tried kernel
version 2.6.21rc4, the bug is still there.
Distribution: Arch Linux
Hardware Environment: Sony laptop PCG-XR1G, Celeron 466 MHz.
Problem Description: When the laptop temperature rises above 85 degrees, the CPU
is throttled down to reduce the temperature. However, after running some heavy task
for a while, the CPU can no longer be throttled down, and kacpi keeps running at 100% CPU.
Steps to reproduce:
Just run any job at 100% CPU for a few minutes.
During the test, the fan is still running. The behavior of ACPI is correct
(throttling the CPU to reduce speed, and returning to normal when the temperature
gets lower), but after a while of doing that, kacpi hangs, I think. This bug
happens even when I start the laptop with no services (no acpid, powersaved, ...)
and just try to run glxgears on X.
Still an issue in 2.6.21-rc5?
Just tested kernel 2.6.21-rc5, the problem is still there. I cannot run glxgears
more than 1 minute.
This bug was filed against 2.6.20, and you confirmed it is
still present in 2.6.21-rc5. What is the most recent release
before 2.6.20 where this bug was not present?
Please attach the complete output from dmesg -s64000 from that
release, plus your 2.6.21-rc5 boot.
Also, please attach the output from acpidump (running any release)
Sorry for the confusion, I just thought that I should file the bug against the latest
released kernel version. As far as I have tried, there is no kernel in which this
bug was not present (including 18.104.22.168 and 22.214.171.124 of Slackware, 2.6.19,
2.6.20). I have tried the BSD kernel of FreeBSD and there is no such error, so
I think this is not a hardware problem (hopefully).
Created attachment 11012 [details]
dmesg -s64000 output
Created attachment 11013 [details]
Could you attach your kernel config file and the output of
# cat /proc/acpi/thermal_zone/ATF0/* when CPU is busy?
Created attachment 11029 [details]
Kernel config file (2.6.20)
Sorry, I've lost the kernel config for 2.6.21-rc5 kernel. Basically, it's the
same one. For those new parameters, I have chosen No to all of them.
Created attachment 11030 [details]
Kernel config file (2.6.21-rc5)
Well, found it.
Created attachment 11031 [details]
Created attachment 11032 [details]
This is the result before CPU is busy. When CPU is busy, any attempt to access
the temperature file (including acpi -t) will hang.
I've been seeing this for quite a while on my Vaio SR17 (since Ubuntu Dapper, which I think was 2.6.15 -- see https://bugs.launchpad.net/bugs/75174), but it seemed to get a lot worse in 2.6.20 because it's apparently no longer possible to kill kacpid or prevent it from running. Once kacpid gets going, the only solution is to pull the power plug: the machine won't shut down normally.
I also see the problem with accessing the temperature file once the problem starts. I tried this: in one window, run a big compile (gimp, kernel). In another window, run top. In a third, run a loop that cats the thermal_zone file every 5 seconds. If I don't run the thermal zone loop, then top will show kacpid as the runaway process; if I do, then "cat" will be the runaway process (kill -9 won't kill the cat), but either way the system becomes unusable after a few minutes of compiling. The last temperature shown is typically about 71/72C (fairly typical for this machine during a compile).
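The polling loop described above can be sketched as follows. This is only a sketch: the function names parse_temp and poll_temp are made up for illustration, the zone path in the usage comment is an assumption (ATF0 is the zone name requested earlier in this thread for the PCG-XR1G and may differ on other machines), and the "temperature: NN C" format is the one shown in thermal_zone output quoted elsewhere in this thread. Each reading is timestamped so that the last line printed before a hang is easy to spot.

```sh
# Extract the integer temperature from text such as "temperature: 47 C" on stdin.
parse_temp() {
    awk '/^temperature:/ { print $2; exit }'
}

# Poll a temperature file every $2 seconds (default 5) and print
# timestamped readings. Runs until interrupted. If the read itself hangs,
# as reported in this thread, output simply stops and the previous line
# is the last good reading.
poll_temp() {
    file=$1
    interval=${2:-5}
    while true; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$(parse_temp < "$file")"
        sleep "$interval"
    done
}

# Usage (path is an assumption, adjust the zone name):
#   poll_temp /proc/acpi/thermal_zone/ATF0/temperature 5
```

Redirecting the output to a file on disk keeps the readings available after a forced power-off.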
Created attachment 11872 [details]
dmesg -s64000 output (probably not very enlightening)
Created attachment 11873 [details]
Output of acpidump
Created attachment 11874 [details]
When provoking the failure, please kill acpid (the user process)
and cat /proc/acpi/event
And also watch the acpi line in /proc/interrupts
do you see any changes when the problem starts, and when it persists?
Please build with CONFIG_ACPI_THERMAL=n and report if the issue
persists, changes, or goes away.
Nothing shows up in /proc/acpi/event. The acpi line in /proc/interrupts is the same after the problem occurs as it was right after booting:
9: 42414 XT-PIC-XT acpi, ohci1394, yenta, uhci_hcd:usb1, YMFPCI, eth0
That was with 126.96.36.199 and CONFIG_ACPI_THERMAL turned on. Will try 22.2 with thermal turned off next.
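For anyone repeating this check, the acpi interrupt count can be pulled out of /proc/interrupts and compared between two samples. A minimal sketch, assuming a single CPU column as in the line quoted above (SMP machines have one count column per CPU); the function name acpi_irq_count is made up for illustration:

```sh
# Print the interrupt count from the first line that mentions "acpi".
# Reads /proc/interrupts-style text on stdin. Field 1 is the IRQ number
# ("9:"), field 2 is the count for the first CPU.
acpi_irq_count() {
    awk '/acpi/ { print $2; exit }'
}

# Usage: sample twice, 10 seconds apart, and print the difference.
#   before=$(acpi_irq_count < /proc/interrupts)
#   sleep 10
#   after=$(acpi_irq_count < /proc/interrupts)
#   echo $((after - before))
```

A difference that keeps growing quickly while kacpid is spinning would point at an interrupt storm; a flat count would point elsewhere.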
> sonypi: Sony Programmable I/O Controller Driver v1.26.
> sonypi: detected type1 model, verbose = 0, fnkeyinit = off, camera = off, compat = off, mask = 0xffffffff, useinput = on, acpi = on
> sonypi: enabled at irq=11, port1=0x10c0, port2=0x10c4
> sonypi: device allocated minor is 63
Please test w/o the sonypi driver.
Of course there is one difference: that second line ("CPU0") was 165 at boot time, 42414 after -- didn't mean to imply no irq9s were received.
It looks like turning off CONFIG_ACPI_THERMAL is doing the trick here. I thought I had tried that already! (Though someone else with the same kacpid problem in bug 6944 reported CONFIG_ACPI_THERMAL didn't help.)
I haven't yet tried with thermal but without sonypi; I'll do that later today. I just noticed there are two sonypi options now: the old one, and a new legacy mode in "sony laptop support", so it looks like there are several combinations to test.
> Please test w/o the sonypi driver.
Problem still occurs with CONFIG_ACPI_THERMAL=y but # CONFIG_SONY_LAPTOP is not set and # CONFIG_SONYPI is not set. So it's not either of the sonypi drivers, and thermal is looking like a likely culprit.
please paste the contents of
more /proc/acpi/thermal_zone/*/* | cat
Please boot 2.6.23-rc3.
thermal.off=1 should give you the same result as CONFIG_ACPI_THERMAL=n
Try thermal.psv=-1 to disable just the passive trip function,
Verify that the passive trip points are now missing from /proc/acpi/thermal_zone/*/trip_points
and see if the problem is still gone.
Then please build your kernel with CONFIG_CPU_FREQ=n to make sure
that with no boot params, the error still occurs.
Then please build with CONFIG_ACPI_DEBUG=y
# echo 0x04000000 > /sys/module/parameters/acpi/debug_layer
# echo 0xffffffff > /sys/module/parameters/acpi/debug_level
and capture the messages as you provoke thermal throttling
and the kacpid loop.
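Before drawing conclusions from a boot with thermal.off=1 or thermal.psv=-1, it may be worth confirming that the parameter actually reached the kernel by checking /proc/cmdline. A small sketch; the function name cmdline_has is hypothetical and the sample command line in the usage comment is invented:

```sh
# Succeeds (exit status 0) if the kernel command line on stdin contains
# exactly the parameter given as $1, e.g. "thermal.psv=-1".
# The command line is split on spaces and each word is compared as a
# whole fixed string, so "thermal.psv=-1" does not match "thermal.psv=-10".
cmdline_has() {
    tr ' ' '\n' | grep -qxF -- "$1"
}

# Usage:
#   cmdline_has 'thermal.psv=-1' < /proc/cmdline && echo "passive trip disabled at boot"
```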
thermal_zone on older kernels (this is actually from an Ubuntu 2.6.20-16, let me know if that's not good enough):
<setting not supported>
cooling mode: passive
temperature: 47 C
critical (S5): 100 C
passive: 73 C: tc1=1 tc2=1 tsp=50 devices=0xcabb5338
If I boot .23-rc3 with thermal.psv=-1, then trip_points changes to:
critical (S5): 100 C
But it's looking like with thermal.psv=-1 it's not going to do its thing -- I'm about halfway through a kernel build and kacpid hasn't run away yet, and the fan isn't staying on full (like it normally does when kacpid is going to run away). I'll go ahead and send this and see if it locks up right afterward. :-)
CPU_FREQ has been off all along. (Someone in one of the kacpid bugs said turning it off helped, so I tried it; it didn't help here but I guess I left it off.)
ACPI_DEBUG messages to follow in a while.
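The trip_points verification asked for earlier can be scripted roughly like this. It is a sketch that assumes the trip_points format shown above, where the passive trip point appears on a line beginning with "passive:"; the function name has_passive_trip is made up for illustration:

```sh
# Succeeds (exit status 0) if the trip_points text on stdin contains a
# passive trip point line. Anchoring at the start of the line avoids
# matching "cooling mode: passive" when several files are concatenated.
has_passive_trip() {
    grep -q '^[[:space:]]*passive:'
}

# Usage:
#   cat /proc/acpi/thermal_zone/*/trip_points | has_passive_trip \
#       && echo "passive trip still present" \
#       || echo "passive trip disabled"
```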
Oops, I don't have a /sys/module/parameters. This kernel is mostly non-modular (makes it easier to build it on a faster machine and copy it over). Would I make all of ACPI modular to make /sys/module/parameters appear? Or is there a way to enable this in a non-modular kernel that has CONFIG_ACPI=y and CONFIG_ACPI_DEBUG=y?
># echo 0x04000000 > /sys/module/parameters/acpi/debug_layer
># echo 0xffffffff > /sys/module/parameters/acpi/debug_level
It's a typo, and it should be:
# echo 0x04000000 > /sys/module/acpi/parameters/debug_layer
# echo 0xffffffff > /sys/module/acpi/parameters/debug_level
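The two corrected commands can be wrapped so that a missing or read-only parameter file is reported instead of failing silently. A sketch only: the function name set_acpi_debug and the optional directory argument are made up for illustration, while the values are the ones given above.

```sh
# Write the ACPI debug layer and level requested in this thread.
# $1 is the parameters directory; it defaults to the corrected path.
# Returns non-zero if either file is missing or not writable.
set_acpi_debug() {
    dir=${1:-/sys/module/acpi/parameters}
    [ -w "$dir/debug_layer" ] && [ -w "$dir/debug_level" ] || {
        echo "ACPI debug parameters not writable in $dir (root and CONFIG_ACPI_DEBUG=y needed?)" >&2
        return 1
    }
    echo 0x04000000 > "$dir/debug_layer"
    echo 0xffffffff > "$dir/debug_level"
}

# Usage (as root):
#   set_acpi_debug && dmesg -c > /dev/null   # then provoke the problem and capture dmesg
```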
Well, I guess I should have tested .23-rc3 earlier. I've been trying all evening with various combinations of config options, and can't reproduce the problem on .23-rc3. I'll keep running rc3 and see if the problem recurs, but it looks like it might be fixed!
If this problem went away, it would be good to find out when and why.
It may be that we need to back-port a fix for users of older kernels.
You've seen the kacpid looping problem in 2.6.15 - 2.6.20,
but it appears gone in 2.6.23, yes?
(please verify that you're not running with any kernel patches
or tweaked kernel parameters in 2.6.23, and report if you are
able to get the machine up to the 73C passive trip point)
What is the latest release that still exhibits the problem?
You've seen the problem in 188.8.131.52 and 184.108.40.206 of Slackware, 2.6.19,
2.6.20, and then 2.6.21.
how about 2.6.22 and 2.6.23?
I've seen the problem as late as 220.127.116.11, and it appears gone in 2.6.23.
In both .22.2 and .23, the machine goes up to and sometimes over the passive trip point. Going over the trip point doesn't seem to be what triggers the problem (it tends to bounce up and down between 74 and 71 with occasional forays to 75 or 60; passive trip is 73).
It does seem related to thermal_zone. If I run both top and
while true; do cat /proc/acpi/thermal_zone/*/*; sleep 10; done
during a kernel build, the hang will be in the cat (ps shows cat hung, and nothing obvious shows up in top -- kacpid isn't up near the top). If I don't run the thermal_zone loop, and just watch top, then kacpid is always #1 in top (but after the problem occurs I can't cat /proc/acpi/thermal_zone/ATF0/temperature any more; it'll just hang).
Here's the bad news: in 18.104.22.168 with CONFIG_ACPI_DEBUG=y, I haven't been able to trigger it, with or without the thermal debugging suggested by zhang rui.
I have the same problem (high cpu usage of kacpid) on 2.6.23-rc6 (!). The problem appears on a new laptop (Intel Core 2 Duo, Intel 82801H Hardware).
In dmesg (perhaps interesting):
hald-addon-acpi: segfault at 00000a58 eip b7e7979c esp bf81b568 error 4
Access to /proc/acpi/thermal_zone/THRM/* is *not* blocked, so my
bug is perhaps not related to thermal_zone.
The system is usable (if I set kacpid to nice level 19), but a little slowed.
I'm going to do some experiments to find something about the bug.
About my problem on 2.6.23-rc6:
The bug occurs only if I run KDE with kpowersave. Without kpowersave everything seems normal.
Len, does comment# 27 help?
Bad news: I just saw this on 22.214.171.124. I've been running that kernel for a while and this is the first time I've seen it, so it's happening much less often than before, but it's not gone.
Unfortunately I wasn't prepared for it and didn't have debug on, so I couldn't capture any ACPI debug messages. I'll try to make sure it's always on for future builds. (Is there any way to have it always on by default, rather than echoing those lines manually after booting?)
>Is there any way to have it always on by default, rather than
>echoing those lines manually after booting?
For a kernel with CONFIG_ACPI_DEBUG set, you can add these two kernel parameters acpi.debug_layer=0x04 acpi.debug_level=0x8800001f.
> The bug occurs only if I run KDE with kpowersave.
Oh goody, I wonder what kpowersave is doing...
Akkana, are you running kpowersave also?
can you provide the info in comment #22 but for your system?
For me the 100%-usage-kacpid problem seems to have disappeared: I can't reproduce it anymore (with 2.6.23-rc3, -rc6 and 126.96.36.199).
I made some system upgrades (I'm using Debian sid) and changed a few small things in my kernel config (nothing directly ACPI-related). Sadly, I lost the original kernel config with which the problem occurred. I tried the version of kpowersave with which the problem occurred again, but now everything seems normal. The whole thing is very strange.
Here it is for 2.6.23-rc3:
# cat /proc/acpi/thermal_zone/THRM/*
0 - Active; 1 - Passive
temperature: 52 C
critical (S5): 105 C
passive: 95 C: tc1=4 tc2=3 tsp=100 devices=CPU0
active: 60 C: devices= FAN
I don't know what happened that one time (I wonder if I wasn't running the kernel I thought I was?) but I haven't seen the problem again on 188.8.131.52, .23.8 or .24-pre.
I wasn't running kpowersave, but maybe that makes it worse somehow.
okay, this failure seems to be vanishing.
if somebody can reproduce it with recent software
often enough that we can debug it, please re-open this report.
I use Ubuntu 2.6.22-14 and have this problem: kacpid is eating 70+% of my CPU.
When I try to run cat /proc/acpi/thermal_zone/THRM/* I get:
cat: /proc/acpi/thermal_zone/THRM/*: No such file or directory
Your laptop has a different problem.
Please open a new bug and attach the output of dmesg, acpidump.