Distribution: Fedora 9 (problem has been seen in Ubuntu Hardy and earlier versions of both Fedora and Ubuntu). Hardware Environment: HP Pavilion 503n, Intel Celeron 1.7GHz and Pentium 4 Northwood (single core). Problem Description: Like the bugs mentioned at http://ubuntuforums.org/showthread.php?t=399619http://ubuntuforums.org/showthread.php?t=399619 and http://bugzilla.kernel.org/show_bug.cgi?id=10224http://bugzilla.kernel.org/show_bug.cgi?id=10224 , the kacpi_notify workqueue process consumes 90% of the CPU after things like a suspend or the CPU getting hot. After a lot of printk tracing, I found that acpi_thermal_notify (in thermal.c) with event type ACPI_THERMAL_NOTIFY_THRESHOLDS is triggered, leading to acpi_power_off_device (in power.c). When it gets to: status = acpi_evaluate_object(resource->device->handle, "_OFF", NULL, NULL); It does turn down the fan like it should, but it also triggers the acpi_ex_opcode_2A_0T_0R opcode, leading to a notify event, and the cycle begins again. Moreover, the ACPI driver can't determine the state of the fan (I've only heard it in high power and low power, never off). It also thinks it should turn down the fan when the CPU gets hot. However, these are separate issues that I'm not too concerned about. The kacpid and kacpi_notify problems appear in many different manifestations, and all of them are ostensibly due to buggy ACPI BIOS implementations. What I posted above is likely going to be a different call chain than someone else's. Nevertheless, these problems indicate that the ACPI driver should be able to detect when a workqueue is getting flooded and queue delayed events when it is. Steps to reproduce: On my system, letting the CPU get above 51C or so for a while or suspending and resuming will turn down (that's right, turn down) the fan and trigger the infinite CPU hogging. echo 3 > /proc/acpi/fan/FAN1/state will turn down the fan, but will not trigger the spinning. Proposed solution: Create a flood protection mechanism for events queued to kacpid and kacpi_notify. For instance, if more than 10 events are sent to one of these queues within a tenth of a second, more events should be delayed by 10 seconds until a timer elapses. Thus, normal operation should not be interfered with to a severe extent, and the kacpi_notify and kacpid queue flooding will cause unnoticeable impact. However, if a whole bunch of ACPI activity happens before the timer is put in place (e.g. the timer source is based on ACPI), my proposed general solution could possibly stall correct ACPI operation and not get the timer up and running at all. Thus, flood protection should be disabled until, say, ten seconds of ACPI driver uptime (thus verifying that the clock works).
Will you please attach the output of acpidump , lspci -vxxx? Will you please set "CONFIG_ACPI_DEBUG" in kernel configuration and attach the output of dmesg ?(Please boot the system with acpi enabled) Thanks.
Created attachment 16665 [details] This is the output of acpidump run as root.
Created attachment 16666 [details] This is the output of /sbin/lspci -vxxx run as root.
Created attachment 16667 [details] The output of dmesg before suspending (nothing wrong)
Created attachment 16668 [details] The output of dmesg after resuming from suspend. kacpi_notify is consuming CPU, and CONFIG_ACPI_DEBUG is set.
Created attachment 16669 [details] Like the above dmesg, but ACPI_LV_FUNCTIONS is enabled with /proc/acpi/debug_level (if I'm not mistaken, this enabled function tracing).
I moved from Fedora 9's 2.6.25.6 to the 2.6.25.9 kernel from kernel.org and used the config file from Fedora 9 as well as enabled CONFIG_ACPI_DEBUG and CONFIG_ACPI_DEBUG_FUNC_TRACE (the latter doesn't seem to affect the output of dmesg unless I enable ACPI_LV_FUNCTIONS with /proc/acpi/debug_level (like I did in the last dmesg)). The bug is still present. Did I attach the above 5 files correctly, or is there an option in Bugzilla to attach a bunch of files at once without sending 5 emails to you guys? I wrote a workqueue flood protector similar to what I proposed above. I'm cleaning it up and getting ready to test it. Afterward, I will post it here (or should I post it to the mailing list instead?) as a patch against drivers/acpi/osl.c in the 2.6.25.9 kernel. The main thing it lacks currently is an interface (perhaps a /proc/acpi filesystem entry) through which it can be enabled/disabled and configured.
Thanks for the info. From the acpidump it seems that GPE 0x1D is shared by several ACPI devices: Power button, KBC, Mouse. And the issue is fixed by the following commit: commit 729b2bdbfa19dd9be98dbd49caf2773b3271cc24 Author: Zhao Yakui <yakui.zhao@intel.com> Date: Wed Mar 19 13:26:54 2008 +0800 ACPI : Disable the device's ability to wake the sleeping system in the boot phase Will you please try the latest kernel(for example: 2.6.26-rc7) and see whether the problem still exists? After the system is booted, please do the following test. a. echo 0x00010004 > /sys/module/acpi/parameters/debug_layer && echo 0x08000017 > /sys/module/acpi/parameters/debug_level b. cat /sys/firmware/interrupts/* c. when the problem appears, cat /sys/firmware/interrupts/* and /proc/interrupts/ d. dmesg >dmesg_after After the test, please attach the output. Thanks.
From the dmesg in comment #5 it seems that OS complains the following warning message. ACPI: Transitioning device [FAN1] to D3 ACPI: Unable to turn cooling device [f78122a0] 'off' From the acpidump we can know that : PowerResource (FN01, 0x00, 0x0000) { Method (_STA, 0, NotSerialized) { Return (0x01) // 1 means that the power resource is on; 0 means that the power resource is off. } And this powerresource is referred by the Fan1 device. Device (FAN1) { Name (_PR0, Package (0x01) { FN01 }) When the thermal temperature is below pre-defined temperature, OS will try to turn off the Fan device. But as the _STA objec of FN01 always reports that the power resource is on, OS will complain that ACPI: Unable to turn cooling device [f78122a0] 'off'. At the same time when OS try to turn off the Fan device, it will call the _OFF object of FN01, in which the notification event(0x81) will be sent to thermal zone. When the thermal zone receives the 0x81 notification event, OS will recheck the trip-point of thermalzone and try to turn off the FAN device again. So the notification event(0x81) will be sent again to the thermalzone and OS will try to process the event again, which causes that OS enter the infinite loop. IMO this is a very obvious BIOS bug and had better be fixed by bios upgrading.
Will you please attach the output of dmidecode ? According to the analysis in comment #9 it seems that it is a BIOS bug. In the upstream kernel when the _OFF object is called to turn off the power resource, OS will check the status of power resource to verify whether the power resource is turned off. Maybe for the bogus bios it is unnecessary to check whether the power resource is turned off. I will try to write a workaround patch to fix this issue.
When I compiled and tested the 2.6.26-rc8 kernel from kernel.org, the kacpi_notify spinning symptom went away, but it's because the thermal_zone THRM device (/proc/acpi/thermal_zone/THRM) did not load (it did in earlier kernels). The problem is that my BIOS reports temperatures in degrees Celsius (i.e. 40 degrees Celsius would be '40'), but the ACPI spec clearly states that it should report temperatures in kelvins times 10 (i.e. 300 kelvins would be '3000'). The ACPI driver doesn't know this, so it acts like the reported values are in kelvins and does the conversion, resulting in temperature values like -269C and -273C. Because of this additional code: /* * Treat freezing temperatures as invalid as well; some * BIOSes return really low values and cause reboots at startup. * Below zero (Celcius) values clearly aren't right for sure.. * ... so lets discard those as invalid. */ if (ACPI_FAILURE(status) || tz->trips.critical.temperature <= 2732) { tz->trips.critical.flags.valid = 0; ACPI_EXCEPTION((AE_INFO, status, "No or invalid critical threshold")); return -ENODEV; The THRM device fails to load because the critical temperature looks like it's below freezing. I added a simple sanity check to thermal.c for all the temperature reads I could find. It will work on correct BIOSes as well, provided that the CPU really isn't -216.7 degrees Celsius or colder. The patch is posted below. When I made this change, THRM came back, and the kacpi_notify spinning problem returned as before. The requested information is posted below. I triggered the kacpi_notify spinning by suspending and resuming. By the way, my computer is from around December 2001, if you're interested. I'm finishing up the event queue flood protector. It's sort of large because it now has a procfs folder for configuration. I'll post the patch for that when I'm done and have tested it on my machine.
Created attachment 16683 [details] Patch against thermal.c to add temperature sanity check This patch against thermal.c adds a sanity check to make buggy BIOSes that report temperature in degrees Celsius rather than kelvins times 10 work correctly.
Created attachment 16684 [details] Output of each file in /sys/firmware/acpi/interrupts/* both before and after suspend.
Created attachment 16685 [details] Output of dmesg before suspending.
Created attachment 16686 [details] cat /proc/interrupts before suspending.
Created attachment 16687 [details] Output of each file in /sys/firmware/acpi/interrupts/* after suspend (same as before suspend)
Created attachment 16688 [details] cat /proc/interrupts after suspending.
Created attachment 16689 [details] Output of dmesg after suspending.
Created attachment 16690 [details] Output of dmidecode (it's the same before and after suspending).
Created attachment 16708 [details] try the attached workround patch series(tar.gz package) Will you please try the attached workaround patch to see whether the problem still exists?(Please add the boot option of "acpi.power_nocheck=1") Note: the attached tar package contains four patches and the corresponding patch series. After testing, please attach the output of dmesg. Thanks.
Created attachment 16742 [details] Patch adds a flood protector to kacpid/kacpi_notify queues This is the workqueue flood protector I proposed earlier. This is my first submission of code to the Linux kernel, so please bear with and inform me of any incorrect spinlocks, poorly formatted code, etc. in my patch :) It now has a /proc/acpi/flood directory for options and help, but it lacks boot command line configuration options (e.g. acpi.flood=on acpi.flood.start_wait_time=10000). This can serve as a generic quick fix to any instances of kacpid and/or kacpi_notify spinning, but I'd think users should still file bug reports if they have to enable this. I added a copyright signature for myself at the top of osl.c (where the vast majority of the flood protector code resides). Is this correct protocol? I did test your patch series, and it does indeed seem to clear up all the problems I've mentioned here. I did remember to include the thermal sanity patch I posted earlier, so /proc/thermal_zone/THRM was there. The kacpi_notify spinning problem is completely gone, and I can echo 0 and 3 to /proc/fan/FAN1/state to turn the fan up and down respectively (rather than it being stuck at down). I can't force the acpi.power_nocheck option off with the command line, and this appears to be because my machine type is listed under power_nocheck_dmi_table and set with dmi_check_system(power_nocheck_dmi_table); . To finish up testing on my flood protector, I commented out that line in my working code (but the dmesg's I will attach as well as this patch will not reflect this). I'm recompiling the kernel to get rid of any stray printk traces and to make sure all the patches posted so far work with the 2.6.26-rc8 kernel. I'll post the dmesg before and after suspend pretty soon after that finishes (the flood protector will probably not affect things since there won't be flooding, but will post a warning if there is flooding).
Created attachment 16743 [details] A patch series consisting of all of the patches posted so far This is all of the patches posted so far (along with a series file): They are (in this order): Power patch series Thermal sanity check (thermal_sanitycheck.patch) Flood protector (acpi_floodprotector.patch) I ran patch -p1 with all the patches in the order of the series file within a vanilla 2.6.26-rc8 source directory, and there were no errors or offsets. Thanks for fixing my ACPI issues :)
Hi, Joey Thanks for your work. Your proposal about the workqueue flood detector is very good. But maybe it is too agressive. If it is applied, it will have side effect: the kacpid/kacpid_notify workqueue will be deferred for some time. And it will have an impact on the performance. For example: The notifiy event will be sent to some device when GPE interrupt is processed. It will take more time to process the notification event. Before the notification event is processed, the corresponding GPE will be disabled. If some delay time is added, it will have impact on the performance. At the same time the workqueue flood detector can't solve the problem. OS still needs to process a lot of work tasks caused by the broken BIOS. Of course the kacpid/kacpid_notify usage will be decreased. IMO the patch about the workqueue flood detector is too agressive and have some side effects. Anyway we can discuss it in acpi_mail list. thanks.
After the patch in comment #20 is applied, the problem disappears. So the bug will be marked as the resolved.
Hi, I have a similar problem with kacpi_notify running out of control and eating all my CPU. It`s triggered by pressing the 'undock' button on the dockstation my Dell D600 is attached. I`m running 2.6.26 and the patches you provided in commment #20 don`t help in my case. I`ve attached my lspci, my acpidump and dmesg messages with ACPI debug enabled. I`m running Gentoo and I`ve created a report on their bugzilla too: http://bugs.gentoo.org/show_bug.cgi?id=235404
Created attachment 17392 [details] The dmesg after pressing 'undock' with ACPI debug enabled.
Created attachment 17393 [details] Output of acpidump run as root.
Created attachment 17394 [details] lspci -vvvvxxx output
Maciej, that's another problem. Please file a new bug report. :)
Created attachment 17735 [details] Patch 1/3: Attach the ACPI device to the ACPI handle as early as possible
Created attachment 17736 [details] Patch 2/3: Get the device power state in the course of scanning device
Created attachment 17737 [details] Patch 3/3: :Add "acpi.power_nocheck=1" to disable power state check in power transition
The tar.gz in comment #20 is not appropriate. So the patches set are attached instead of the tar package. Thanks.
patch in comment #30 included in acpi test tree: uthor: Zhao Yakui <yakui.zhao@intel.com> 2008-08-11 02:54:16 Committer: Len Brown <len.brown@intel.com> 2008-09-19 14:57:44 Parent: 9a252fe1069f759a4fe76fdffce3b589b899bcdf (ACPI: Get the device power state in the course of scanning device) Child: f2e969acd6d5981e6b1272810002558650d0736e (ACPI: Add "acpi.power_nocheck=1" to disable power state check in power transition) Branches: bugfixes, test Follows: v2.6.27-rc6 Precedes: ACPI: Attach the ACPI device to the ACPI handle as early as possible patch in comment #31: Author: Zhao Yakui <yakui.zhao@intel.com> 2008-08-11 02:55:05 Committer: Len Brown <len.brown@intel.com> 2008-09-19 14:57:43 Parent: a7be432189eb771907fb52b3f28f3e9c9309683d (acpi: replace AE_BAD_ADDRESS exception code with AE_ERROR) Child: 05afbbf73abcbb4f6fc9fcf0a6cbbea623928a7a (ACPI: Attach the ACPI device to the ACPI handle as early as possible) Branches: bugfixes, test Follows: v2.6.27-rc6 Precedes: ACPI: Get the device power state in the course of scanning device patch in commednt #32: Author: Zhao Yakui <yakui.zhao@intel.com> 2008-08-11 02:57:50 Committer: Len Brown <len.brown@intel.com> 2008-09-19 14:57:45 Parent: 05afbbf73abcbb4f6fc9fcf0a6cbbea623928a7a (ACPI: Attach the ACPI device to the ACPI handle as early as possible) Child: 2f76f1bd9b3ed0992b987d0b3a9b60ee698467d4 (ACPI: Add DMI check to disable power state check in power transition) Branches: bugfixes, test Follows: v2.6.27-rc6 Precedes: ACPI: Add "acpi.power_nocheck=1" to disable power state check in power transition I don't know what the attachment in comment #33 is supposed to be, it appears to be html
Hi, Len What I said in comment #33 is that the tar.gz package is attached in comment #20. So the patch set is attached again instead of the tar.gz package. thanks.
re: comment #33 confusion pilot error on my part, for some reason i must have thought the link was an attachment and saved comment #22 as a file... sorry, and thanks.
*** Bug 8049 has been marked as a duplicate of this bug. ***
shipped in linux-2.6.28-rc1 closed commit 6415e12ba0f92a54f02d9c4ecaa3c82f35f3d335 Author: Zhao Yakui <yakui.zhao@intel.com> Date: Mon Aug 11 14:59:59 2008 +0800 ACPI: Add DMI check to disable power state check in power transition ... commit f5adfaa372c76423b6e8e4727a9701330374f364 Author: Zhao Yakui <yakui.zhao@intel.com> Date: Mon Aug 11 14:57:50 2008 +0800 ACPI: Add "acpi.power_nocheck=1" to disable power state check in power transition ... commit eab4b645769fa2f8703f5a3cb0cc4ac090d347af Author: Zhao Yakui <yakui.zhao@intel.com> Date: Mon Aug 11 14:54:16 2008 +0800 ACPI: Attach the ACPI device to the ACPI handle as early as possible