Bug 15924
Summary: | kacpid consumes ~100% CPU, system freezes randomly | ||
---|---|---|---|
Product: | Drivers | Reporter: | Jaroslav Kameník (jaroslav) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | acpi-bugzilla, antekgrzymala, astarikovskiy, lenb, mehmet, mjg59-kernel, ozan, rjw, rostedt, rui.zhang, tomas_pavelka, trenn |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.34 RC6 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 15310, 56331 | ||
Attachments: |
acpidump
acpi interupts on 2.6.34-rc5 acpidump with 2.6.34-rc6 acpi/wakeup with 2.6.33 acpi/wakeup with 2.3.34.rc6 lspci -vvxxx acpi interrupts in 2.6.33.3 kernel dmesg for Comment #12 on RC6 patch-1 patch-2 Make sure GPEs stay disabled ACPI / ACPICA: Fix problems related to GPE reference counting Debug patch dmesg for comment #47 Debug patch #2 dsdt - f.08 dmesg for comment #51 Debug patch #3 Debug patch #4 dmesg for Comment #55 PCI / PM: Do not use native PCIe PME by default |
Description
Jaroslav Kameník
2010-05-06 21:12:21 UTC
Created attachment 26269 [details]
acpi interupts on 2.6.34-rc5
Please attach the output of acpidump and the contents of /proc/acpi/wakeup . Created attachment 26271 [details]
acpidump with 2.6.34-rc6
Created attachment 26272 [details]
acpi/wakeup with 2.6.33
Created attachment 26273 [details]
acpi/wakeup with 2.3.34.rc6
please attach the lspci -vvxxx output when the interrupt storms happens. > /sys/firmware/acpi/interrupts/gpe01: 246416 enabled GPE01 appears to be the problem Does the kacpid thrashing stop after this? # echo disable > /sys/firmware/acpi/interrupts/gpe01 # cat /sys/firmware/acpi/interrupts/gpe01 > Device S-state Status Sysfs node > LANC S0 *disabled pci:0000:00:19.0 > HDEF S3 *disabled pci:0000:00:1b.0 > EHC1 S0 *disabled pci:0000:00:1d.0 > EHC2 S0 *disabled pci:0000:00:1a.0 > ... All of the devices in /proc/acpi/wakeup have the run_wake flag set in 2.6.34, while none had this set in 2.6.33 It should be possible to bisect which commit caused this regression, particularly if the bisect is limited to the PCI and ACPI code. GPE01's handler seems to be involved in PCI hotplug notifies: Method (_L01, 0, NotSerialized) { Add (L01C, 0x01, L01C) \_SB.PCI0.RP01.HPLG () \_SB.PCI0.RP02.HPLG () \_SB.PCI0.RP03.HPLG () \_SB.PCI0.RP04.HPLG () \_SB.PCI0.RP05.HPLG () \_SB.PCI0.RP06.HPLG () } It isn't immediately clear from the AML how the handler clears the event that caused the interrupt. BTW. there seem to be checksum errors in the acpi tables, please attach the full output from acpidump Created attachment 26308 [details]
lspci -vvxxx
(In reply to comment #7) # echo disable > /sys/firmware/acpi/interrupts/gpe01 # cat /sys/firmware/acpi/interrupts/gpe01 writes something as " 204208 enabled" (number is growing) and kacpid continues in eating cpu. "acpidump with 2.6.34-rc6" that I attached is all acpidump command gave me (std+err output in one file) and I don't see nothing like verbose/.. flag there. What exactly you mean with full output? please attach the acpi interrupts in 2.6.33.3 kernel. Created attachment 26311 [details]
acpi interrupts in 2.6.33.3 kernel
(In reply to comment #9) > (In reply to comment #7) > > # echo disable > /sys/firmware/acpi/interrupts/gpe01 > # cat /sys/firmware/acpi/interrupts/gpe01 > > writes something as " 204208 enabled" (number is growing) and kacpid > continues in eating cpu. > this is weird. we should disable the gpe by running this command. please run "echo 0x44 > /sys/module/acpi/parameters/debug_layer && echo 0x88000104 > /sys/module/acpi/parameters/debug_level && dmesg -c && echo disable > /sys/firmware/acpi/interrupts/gpe01 && dmesg > /tmp/dmesg" and attach the /tmp/dmesg file here. Created attachment 26354 [details] dmesg for Comment #12 on RC6 Created attachment 26361 [details] patch-1 please try this patch and do the same test as the one in comment #9. Created attachment 26362 [details] patch-2 this patch-1 doesn't help, please convert that patch, apply this one and do the test in comment #9. I applied second patch and it seems to be working;). Test #9 wrote disabled and kacpid stopped eating cpu. Just to clarify, do you mean the patch from comment #15? Yes, the patch from #15 which changes the system.c file. Kacpid still eats cpu at startup but I can disable that bad gpe01 with commands from #9 now and stop it. I understood #15 that I shouldn't try the patch 1 from #14, is it ok? Please the patch from comment #14 too, we'd like to know if it works anyway. With both patches applied test from #9 writes " 0 disabled" but kacpid does not stop and there is still lots of interupts in 'grep . /sys/firmware/acpi/interrupts/*' (on the row gpe01 is disabled but number still growing). (In reply to comment #20) > With both patches applied test from #9 writes " 0 disabled" but kacpid does > not > stop and there is still lots of interupts in 'grep . > /sys/firmware/acpi/interrupts/*' (on the row gpe01 is disabled but number > still > growing). do you mean the GPE storm can be disabled if you apply patch-2 only, but it can not be disabled if you apply both patch-1 and patch-2? Justo to confirm I have experienced the same regression on a similar machine: product: HP EliteBook 8540p version: 68CVD Ver. F.06 (02/25/2010) i7 620M, 4GB RAM. Previous BIOS release (one I got the machine shipped with) exhibited same behaviour. (In reply to comment #21) Yes. With patch-2 I was able to disable it with 'echo disable > /sys/firmware/acpi/interrupts/gpe01' and storm stopped, 'echo enable ...' started it again. With both patches it looks like it is disabled since boot but behaves as it is not. After boot 'cat /sys/firmware/acpi/interrupts/gpe01' writes ' somenumber disabled', number is still growing quickly. After 'echo disable > /sys/firmware/acpi/interrupts/gpe01', it still writes disabled, number drops to one and grows quickly again. short listing (.disable are commands from #9) with few secs between commands - cat /sys/firmware/acpi/interrupts/gpe01 74437 disabled cat /sys/firmware/acpi/interrupts/gpe01 76697 disabled cat /sys/firmware/acpi/interrupts/gpe01 84399 disabled ./disable 1 disabled cat /sys/firmware/acpi/interrupts/gpe01 2684 disabled cat /sys/firmware/acpi/interrupts/gpe01 6059 disabled cat /sys/firmware/acpi/interrupts/gpe01 34875 disabled ./disable 1 disabled cat /sys/firmware/acpi/interrupts/gpe01 1603 disabled cat /sys/firmware/acpi/interrupts/gpe01 2673 disabled I have recompiled it once again to be sure with patch-2 only and it works as I wrote in #23. The similar sequence of commands after boot gave cat /sys/firmware/acpi/interrupts/gpe01 85898 enabled cat /sys/firmware/acpi/interrupts/gpe01 87292 enabled echo disable > /sys/firmware/acpi/interrupts/gpe01 cat /sys/firmware/acpi/interrupts/gpe01 104925 disabled cat /sys/firmware/acpi/interrupts/gpe01 104925 disabled echo enable > /sys/firmware/acpi/interrupts/gpe01 cat /sys/firmware/acpi/interrupts/gpe01 111831 enabled cat /sys/firmware/acpi/interrupts/gpe01 113668 enabled disable > /sys/firmware/acpi/interrupts/gpe01 cat /sys/firmware/acpi/interrupts/gpe01 117682 disabled cat /sys/firmware/acpi/interrupts/gpe01 117682 disabled acpi_ev_asynch_enable_gpe enables the GPE if gpe_register_info->enable_for_run is not cleared, so I guess we can not disable the GPE storm because it's enabled by the deferred GPE handler. IMO, this sounds like a problem in the gpe reference counting patch. Say, we don't update the runtime_count/wakeup_count when invoking acpi_set_gpe. so in order to make acpi_set_gpe work properly, we can not use runtime_count/wakeup_count to update the gpe_register_info->enable_for_run/enable_for_wake flag, as we do in acpi_ev_update_gpe_enable_masks currently. But I'm still wondering why patch-1 breaks patch-2... Hm. It's correct that acpi_set_gpe() doesn't change the refcounting - but we probably need to set a force_enable/disable flag that's tested as well as the runtime count. I'll look at this tomorrow. I'm not sure what you mean. Do you think acpi_set_gpe() shouldn't use acpi_ev_enable_gpe(), but instead it should call acpi_hw_clear_gpe() and acpi_hw_write_gpe_enable_reg() directly (in the "enable" case)? I think they're orthogonal. acpi_set_gpe() in this case should mean "Turn off the gpe even if the reference counting says it should be enabled", and then any codepath that tests the runtime reference count should also test that flag. But that doesn't answer the question of why this GPE is enabled in the first place. acpi_set_gpe() turns off the GPE unconditionally in the "disable" case. Only the "enable" case takes the enable mask read from gpe_event_info->register_info into account. But we have acpi_ev_asynch_enable_gpe which may then enable it again... (In reply to comment #28) > I think they're orthogonal. acpi_set_gpe() in this case should mean "Turn off > the gpe even if the reference counting says it should be enabled", and then > any > codepath that tests the runtime reference count should also test that flag. I agree. > But > that doesn't answer the question of why this GPE is enabled in the first > place. sorry I don't understand. what do you mean by "this GPE is enabled IN THE FIRST PLACE"? Created attachment 26458 [details]
Make sure GPEs stay disabled
Can you test this without any other patches? It should result in disable working correctly when you write it. We still need to work out why this GPE is enabled at all, though.
With last patch only after boot -- kacpid takes ~100% #cat /sys/firmware/acpi/interrupts/gpe01 62586 enabled # cat /sys/firmware/acpi/interrupts/gpe01 64478 enabled # echo disable > /sys/firmware/acpi/interrupts/gpe01 -- kacpid is ok # cat /sys/firmware/acpi/interrupts/gpe01 73971 enabled # cat /sys/firmware/acpi/interrupts/gpe01 73971 enabled # echo enable > /sys/firmware/acpi/interrupts/gpe01 -- kacpid is still ok # cat /sys/firmware/acpi/interrupts/gpe01 0 enabled # cat /sys/firmware/acpi/interrupts/gpe01 0 enabled # echo disable > /sys/firmware/acpi/interrupts/gpe01 # cat /sys/firmware/acpi/interrupts/gpe01 0 enabled # cat /sys/firmware/acpi/interrupts/gpe01 0 enabled (In reply to comment #32) > Created an attachment (id=26458) [details] > Make sure GPEs stay disabled Well, it looks like we need to do something like this. Although it seems we can avoid adding a new flag by making acpi_set_gpe() claer the GPE's bit in its register's enable_for_run mask. Ah, that's not going to work without this patch: https://patchwork.kernel.org/patch/101264/. (In reply to comment #36) Should I try it alone or with earlier patches? And still against RC6 or newer kernel? I'm going to prepare a patch on top of 2.6.34 (final) for you. For the record, the Matthew's patch in comment #32 has a problem that it will set ACPI_GPE_DISABLED for all wakeup GPEs during suspend, which isn't correct. Created attachment 26532 [details]
ACPI / ACPICA: Fix problems related to GPE reference counting
Please check if this patch (on top of 2.6.34) changes anything.
I got similar result as in #33 after boot -- kacpid takes ~100% #cat /sys/firmware/acpi/interrupts/gpe01 126811 enabled #cat /sys/firmware/acpi/interrupts/gpe01 129801 enabled #echo disable > /sys/firmware/acpi/interrupts/gpe01 -- kacpid is ok #cat /sys/firmware/acpi/interrupts/gpe01 155255 enabled #cat /sys/firmware/acpi/interrupts/gpe01 155255 enabled #echo enable > /sys/firmware/acpi/interrupts/gpe01 -- kacpid is still ok #cat /sys/firmware/acpi/interrupts/gpe01 0 enabled #cat /sys/firmware/acpi/interrupts/gpe01 0 enabled #echo disable > /sys/firmware/acpi/interrupts/gpe01 #cat /sys/firmware/acpi/interrupts/gpe01 0 enabled #cat /sys/firmware/acpi/interrupts/gpe01 0 enabled I have an old box (Compaq Preserio) that I bought back in 2003. I'm currently installing Fedora 12 on it. The default .31 kernel could not handle the PCI wifi card, .32 (and .33) would lock up in X, and .34 which seemed to fix the previous two, would have kacpi_notify running at 100%. I ran ftrace on it (workqueue and sched events along with tracing the acpi_ev_*). The kacpi_notify never goes to sleep. It is scheduled out via a "cond_resched" in the running state. This means on a non preempt, non voluntary preempt kernel, this would lock up the box. Here's what I did: # echo 'sched_switch' 'sched_wakeup' sched_wakeup_new 'workqueue' irq > /debug/tracing/set_event # echo acpi_os_execute_deferred >> /debug/tracing/set_ftrace_filter # echo 'acpi_ev_*' >> /debug/tracing/set_ftrace_filter # echo function > /debug/tracing/current_tracer # echo 1 > /debug/tracing/options/stacktrace Which would produce the following: kacpi_notify-17 [000] 807.509455: acpi_ev_is_notify_object <-acpi_ex_opcode_2A_0T_0R kacpi_notify-17 [000] 807.509456: acpi_ev_queue_notify_request <-acpi_ex_opcode_2A_0T_0R kacpi_notify-17 [000] 807.509462: workqueue_insertion: thread=kacpi_notify:17 func=acpi_os_execute_deferred kacpi_notify-17 [000] 807.509462: <stack trace> => ftrace_raw_event_workqueue_insertion => insert_work => __queue_work => queue_work_on => queue_work => __acpi_os_execute => acpi_os_execute => acpi_ev_queue_notify_request kacpi_notify-17 [000] 807.509485: acpi_ev_address_space_dispatch <-acpi_ex_access_region I'm currently bisecting to find where the issue is. On a bad kernel, kacpi_notify takes up 100% of the CPU, on a good kernel, it does not even show up. I forgot to say, I added the patch in comment 40 and still get the issue. The above comment included that patch. I think the problem went in with this commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3f0be67188c60ebf1b5d00354b44b4b24f5af313;hp=f517709d65beed95f52f021b43e3035b52ef791a and the problem most likely is that one of the GPEs that wasn't enabled in the previous kernels in now enabled. please apply the patch set https://patchwork.kernel.org/patch/104903/ https://patchwork.kernel.org/patch/104912/ https://patchwork.kernel.org/patch/104909/ https://patchwork.kernel.org/patch/104911/ https://patchwork.kernel.org/patch/104910/ on top of the latest upstream kernel and see if it helps. Comment #45 I tried it with 2.6.34-git16, it fires lots of interrupts after boot but it is possible to disable it and enable again. I got this: -- kacpid eats ~100% # cat /sys/firmware/acpi/interrupts/gpe01 123457 enabled # cat /sys/firmware/acpi/interrupts/gpe01 124981 enabled # echo disable > /sys/firmware/acpi/interrupts/gpe01 -- kacpid is ok # cat /sys/firmware/acpi/interrupts/gpe01 132234 disabled # cat /sys/firmware/acpi/interrupts/gpe01 132234 disabled # echo enable > /sys/firmware/acpi/interrupts/gpe01 -- kacpid eats ~100% # cat /sys/firmware/acpi/interrupts/gpe01 143519 enabled # cat /sys/firmware/acpi/interrupts/gpe01 145585 enabled # echo disable > /sys/firmware/acpi/interrupts/gpe01 -- kacpid is ok # cat /sys/firmware/acpi/interrupts/gpe01 149538 disabled # cat /sys/firmware/acpi/interrupts/gpe01 149538 disabled Created attachment 26705 [details]
Debug patch
Please apply this patch on top of the previous ones and attach dmesg output generated after a fresh boot.
Created attachment 26715 [details] dmesg for comment #47 It does seem that GPE01 is legitimately enabled in acpi_ev_initialize_gpe_block(), so the question is what the old code did to disable it afterwards (or how did it know not to enable it). Jaroslav, I cannot extract the DSDT from your output of acpidump, can you attach the contents of /proc/acpi/dsdt from your system? Never mind, I managed to extract the DSDT after all. Created attachment 26721 [details]
Debug patch #2
Please apply this patch on top of the previous one and send the output of dmesg from a fresh boot.
Created attachment 26722 [details]
dsdt - f.08
HP released bios update (F.08), it did not help with kacpid but maybe there are changes..
Created attachment 26723 [details] dmesg for comment #51 Well, the last debug patch was too general, it seems, and we need some more precise information. I'll do my best to prepare another debug patch for you in the next couple of days. Apparently, GPE01 is legitimately enabled on your system, but we need to find out why it keeps firing. Created attachment 26741 [details]
Debug patch #3
Please apply this patch on top of the previous ones and attach the dmesg output (this can generate quite a lot of output).
Created attachment 26746 [details]
Debug patch #4
Also please check if the problem is reproducible with the attached patch applied
(it shouldn't matter whether or not the previous patches are applied).
Created attachment 26770 [details] dmesg for Comment #55 This appears to be correct. Please try the patch from comment #56, then. (In reply to comment #56) It seems to be OK with this patch applied;) cat /sys/firmware/acpi/interrupts/gpe01 writes 0 enabled after boot, I can enable/disable it and it shows correct state and no interupt storm starts. OK, thanks for testing. You can safely run with this patch applied. The issue is that we call the BIOS to allow us to use the native PCIe PME mechanism, which is coupled to the native PCIe hotplug. Since the BIOS allows us to use the registers of PCIe root ports directly, we do just that, but then the BIOS gets confused and fires the hotplug notification. So, in fact we shouldn't try to use the native PCIe PME on your machine. I'm not sure what the long-term solution of the problem will be chosen, but at least we know what the problem _is_. Jaroslav, can you please test if the kernel works correctly on your machine, without the patch from comment #56, if the pcie_pme=off kernel command line switch is used? Created attachment 26830 [details]
PCI / PM: Do not use native PCIe PME by default
Please test the kernel with this patch applied without the previous patches
and report back.
(In reply to comment #62) > Created an attachment (id=26830) [details] > PCI / PM: Do not use native PCIe PME by default > > Please test the kernel with this patch applied without the previous patches > and report back. Works fine patched against the latest release candidate (2.6.35-rc3): $ cat /sys/firmware/acpi/interrupts/gpe01 0 enabled Thanks for the confirmation, I'll push the patch upstream shortly. Rafael, so this is introduced by the new pcie pme driver which is shipped in 2.6.34, right? Yes. To be precise, by its initialization code. Hi, I'm having a very similar problem with the same laptop running 2.6.36_rc6 but this time it's a kworker which consumes ~100% CPU. I first tried rmmod'ing all the modules but it didn't help. Then I built the kernel with some ACPI/PM verbose flags turned on and found out an infinite number of the following message was written in kernel log: scsi host1: __pm_runtime_resume() returns 1! I then passed "on\n" to all power/control files in sysfs but the logging didn't appear. I disabled PM_RUNTIME to see if it was causing the kworker stuff, but nope, it still continues. Then I found this report and tried the gpe01 thing. Disabling it stopped the kworker CPU consumation problem. When enabled the number of interrupts in /sys/firmware/acpi/interrupts/gpe01 increases very fast. But the problem is not fixed with the pcie_pme=off switch so I think the main problem is different but the solution is the same. Should I file a new bug report or can this be considered as a duplicate to this? How can I debug to find out the real issue? Thanks The original issue should be fixed in 2.6.36-rc, so I think that's a different problem with the same symptom. Please file a separate bug report and add my address to the CC list in there. (In reply to comment #68) > I'm having a very similar problem with the same laptop running 2.6.36_rc6 but > this time it's a kworker which consumes ~100% CPU. Same here, with 2.6.36 final. Will try to file a separate bug report ASAP. Hello, just testing the output format. Since the preview gives me everything formatted without CR (one line). Thank you, Zoran I'm currently attending Linux Plumbers. My email responses will be very slow. If you don't hear from me by Monday November 19th, please send me a gentle ping. -- Steve |