Bug 20232
Summary: | kworker consumes ~100% CPU on HP Elitebook 8540w running 2.6.36_rc6-git4 | ||
---|---|---|---|
Product: | ACPI | Reporter: | Ozan Caglayan (ozan) |
Component: | Config-Interrupts | Assignee: | Rafael J. Wysocki (rjw) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | acpi-bugzilla, dv, florian, henry78, ibrahim, lenb, m.debruijne, odi, ozan, rjw, rui.zhang |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.36_rc6-git4 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 7216, 16444 | ||
Attachments: |
acpidump
dmesg of bad kernel
dmesg of good kernel 2.6.35.7
dmesg of 2.6.36_rc8-git4 with pcie_ports=compat
PCI / Hotplug: Fix unexpected driver unregister in pciehp_acpi.c
PCI / ACPI: Request _OSC control once for each root bridge
Good dmesg 2.6.37_rc7_git4
PCI / ACPI: Pass all _OSC support bits to the BIOS simultaneously
2.6.37_rc8+patch#29 ACPI trace
PCI / ACPI: Request _OSC control once for each root bridge (v3)
dmesg of vanilla 2.6.37_rc8-git4 |
Description
Ozan Caglayan
2010-10-13 06:13:38 UTC
BTW, 2.6.32.24 doesn't have this symptom. We didn't have much time to bisect the issue, but if we can't find the cause without bisecting, I'll try to bisect too.

(In reply to comment #0)
> I'm having a serious CPU hogging problem with an HP Elitebook 8540w running
> 2.6.36_rc6. A kworker consumes ~100% CPU during all the uptime since booting.
>
what's the latest good kernel? what's the earliest bad kernel?

> Then I found a similar report and tried writing "disable" to
> /sys/firmware/acpi/interrupts/gpe01 and it stopped the kworker CPU
> consumption problem *although the load average doesn't drop under ~1.3*.
> When enabled the number of interrupts in
> /sys/firmware/acpi/interrupts/gpe01 increases very fast.
>
> The problem is not fixed with the pcie_pme=off trick suggested in the other
> bug report related to this laptop.
>
hmm, seems an ACPI interrupt storm.
please attach the acpidump of this laptop.
please attach the dmesg output after boot for both the good and the bad kernel.

Created attachment 33442 [details]
acpidump
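
For anyone wanting to check for the same kind of GPE storm, here is a minimal Python sketch that samples the counters under /sys/firmware/acpi/interrupts twice and reports which ones are climbing fast. It assumes the first whitespace-separated field of each gpeXX file is the interrupt count; the function names and the 1000 interrupts/s threshold are made up for illustration and are not from this report.

```python
import time
from pathlib import Path

GPE_DIR = Path("/sys/firmware/acpi/interrupts")  # path mentioned in this report


def parse_gpe_count(text):
    """Return the interrupt count from the contents of one gpeXX file.

    Assumption: the count is the first whitespace-separated field,
    followed by status words such as "enabled".
    """
    fields = text.split()
    if not fields:
        raise ValueError("empty GPE counter file")
    return int(fields[0])


def read_counters(directory=GPE_DIR):
    """Read every gpe* counter in the directory into a name -> count dict."""
    counters = {}
    for path in sorted(directory.glob("gpe*")):
        try:
            counters[path.name] = parse_gpe_count(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or unexpected format: skip this file
    return counters


def gpe_rates(before, after, interval_s):
    """Interrupts per second for each counter present in both samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return {
        name: (after[name] - before[name]) / interval_s
        for name in before
        if name in after
    }


def storming(rates, threshold_per_s=1000.0):
    """Names of counters whose rate meets the (arbitrary) threshold."""
    return sorted(n for n, r in rates.items() if r >= threshold_per_s)


if __name__ == "__main__":
    first = read_counters()
    time.sleep(2.0)
    second = read_counters()
    for name in storming(gpe_rates(first, second, 2.0)):
        print("possible storm on", name)
```

A counter that stays constant between samples (as gpe_all does later in this report with the v3 patch) yields a rate of 0 and is not reported.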
Created attachment 33452 [details]
dmesg of bad kernel
I'll try to post the earliest bad and the latest good kernel as soon as possible.

okay 2.6.35.7 is good, 2.6.36_rc1 is bad. I'm attaching the good one's dmesg. Seems that it's broken by a commit during the merge window.

Created attachment 33472 [details]
dmesg of good kernel 2.6.35.7
Can you test 2.6.36-rc7 with pcie_ports=compat, please?

// Add me to CC

OK will try that and post the result here.

I didn't have time to rebuild rc8 but I tried that parameter on rc6 and it fixed the kworker issue, but even on a completely idle system the load average doesn't drop under ~1.09. I'll post some details later.

Created attachment 33912 [details]
dmesg of 2.6.36_rc8-git4 with pcie_ports=compat
OK, pcie_ports=compat still works under 2.6.36_rc8-git4, but there's an "unexpected driver unregister!" backtrace in the dmesg that I've just attached. Also, the load average is still above 1; I don't know if it's related or not.

so with pcie_ports=compat, the interrupt storm goes away, as indicated by the acpi interrupt in /proc/interrupts and the gpe in /sys/firmware/acpi/interrupts no longer incrementing quickly, but you still have something running on the cpu 100% of the time? what does top(1) show?

There is a skew in the load average due to a commit introduced in 2.6.36:

commit 74f5187ac873042f502227701ed1727e7c5fbfa9
Author: Peter Zijlstra <a.p.zijlstra@chello.nl> 2010-04-22 21:50:19
Committer: Ingo Molnar <mingo@elte.hu> 2010-04-23 11:02:02

    sched: Cure load average vs NO_HZ woes

Check https://bugzilla.kernel.org/show_bug.cgi?id=16525 and the referenced threads for more details on that. It is not clear to me whether that skew is accompanied by any harmful symptoms, though...

Regards,
Flo

p.s.: for convenience, I post the backtrace mentioned in comment #12:

[ 1.120742] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[ 1.120744] ------------[ cut here ]------------
[ 1.120749] WARNING: at drivers/base/driver.c:262 driver_unregister+0x36/0x6f()
[ 1.120751] Hardware name: HP EliteBook 8540w
[ 1.120752] Unexpected driver unregister!
[ 1.120753] Modules linked in:
[ 1.120756] Pid: 1, comm: swapper Not tainted 2.6.36_rc8-143 #1
[ 1.120757] Call Trace:
[ 1.120762] [<ffffffff81045ec4>] warn_slowpath_common+0x80/0x98
[ 1.120765] [<ffffffff81045f70>] warn_slowpath_fmt+0x41/0x43
[ 1.120769] [<ffffffff81b7666d>] ? pci_hotplug_init+0x0/0x4e
[ 1.120772] [<ffffffff812c0604>] driver_unregister+0x36/0x6f
[ 1.120775] [<ffffffff8123939c>] pcie_port_service_unregister+0xd/0xf
[ 1.120777] [<ffffffff81b76914>] pciehp_acpi_slot_detection_init+0x96/0x132
[ 1.120780] [<ffffffff81b766c9>] ? pcied_init+0x0/0x79
[ 1.120782] [<ffffffff81b766d7>] pcied_init+0xe/0x79
[ 1.120786] [<ffffffff81000348>] do_one_initcall+0x7a/0x132
[ 1.120789] [<ffffffff81b4fd54>] kernel_init+0x17d/0x20b
[ 1.120792] [<ffffffff81003924>] kernel_thread_helper+0x4/0x10
[ 1.120794] [<ffffffff81b4fbd7>] ? kernel_init+0x0/0x20b
[ 1.120796] [<ffffffff81003920>] ? kernel_thread_helper+0x0/0x10
[ 1.120803] ---[ end trace 6d450e935ee1897c ]---

well yes, the storm and the kworker which hogs the CPUs go away with pcie_ports=compat. But even on a basic console login which stays idle for hours the load average always stays above 1. The output of top doesn't show anything surprising, no task which uses the cpu excessively. But I don't know why the load average doesn't converge to 0.

I'm experiencing the same problem in my laptop. Two or four kworker processes constantly at the top of top, consuming between 1 and ~20% cpu, load average of ~1. The trackpad is unusably jerky in X. pcie_ports=compad didn't change anything. Back to 2.6.34.7 for now (with 2.6.35.x I had a similar problem: several kslowd00x processes hogging my cpu and making my trackpad jerky. At least they don't appear with 2.6.36...).

(In reply to comment #16)
> I'm experiencing the same problem in my laptop. Two or four kworker processes
> constantly at the top of top, consuming between 1 and ~20% cpu, load average
> of ~1. The trackpad is unusably jerky in X. pcie_ports=compad didn't change
> anything.

That's pcie_ports=compat, not pcie_ports=compad. If the latter is what you have tested, please retest and report back.

If pcie_ports=compat doesn't help on your machine, the problem you're seeing is certainly different. In that case, please file a separate bug report for that issue.

so what's the status of this bug? :) Still continues with 2.6.36. pcie_ports=compat still fixes the issue, but the backtrace is still there. any suggestions?

The load average is unrelated to this bug. Check the patch in bug #16525 for that.
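
As background on why a load average stuck near 1 looks odd on an idle machine: the load average is an exponentially damped average of the number of tasks counted as active, sampled roughly every 5 seconds. The Python sketch below uses floating point rather than the kernel's fixed-point arithmetic, and all names in it are hypothetical. It only illustrates the decay behaviour; it does not model the accounting skew discussed in bug #16525.

```python
import math

SAMPLE_INTERVAL_S = 5.0  # assumption: one sample roughly every 5 seconds


def decay_factor(window_s, interval_s=SAMPLE_INTERVAL_S):
    """Weight kept from the previous value at each sample."""
    return math.exp(-interval_s / window_s)


def step(load, active, window_s=60.0):
    """One update of the damped average with `active` tasks counted."""
    e = decay_factor(window_s)
    return load * e + active * (1.0 - e)


def simulate(load, active, steps, window_s=60.0):
    """Apply `steps` updates with a constant number of active tasks."""
    for _ in range(steps):
        load = step(load, active, window_s)
    return load


if __name__ == "__main__":
    # Ten minutes of samples starting from the ~1.3 reported above.
    print("nothing counted as active:", simulate(1.3, 0, 120))
    print("one task always counted:  ", simulate(1.3, 1, 120))
```

With nothing counted as active the 1-minute figure falls toward 0 within minutes, so a value that sits at ~1 for hours means the accounting keeps seeing about one active task per sample, whether or not top shows one.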
Len, that means (Ozan, please correct me if I'm wrong):

(In reply to comment #13)
> so with pcie_ports=compat, the interrupt storm goes away,
> as indicated by the acpi interrupt in /proc/interrupts and
> the gpe in /sys/firmware/acpi/interrupts no longer incrementing
> quickly, but you still have something running on the cpu 100% of the time?
> what does top(1) show?

Answer: With pcie_ports=compat the kworker @100%cpu goes away, and everything is fine.

Not everything, the backtrace is still there, that needs to be fixed. I'll take care of this shortly.

Created attachment 39752 [details]
PCI / Hotplug: Fix unexpected driver unregister in pciehp_acpi.c

This patch should fix the warning in comment #14, please verify.

Okay, I'll try the patch ASAP, but will the users of this laptop have to pass pcie_ports=compat explicitly to fix the issue? If yes, this is bad. If there's some sort of DMI quirk list that will be patched, that's reasonable.

We're hoping to have a better fix than a DMI quirk, but not in 2.6.37, so please use the command line workaround for now.

Ozan, can you please send the output of "ls /sys/bus/pci/drivers" and "ls /sys/bus/pci_express/drivers" ?

Sorry, not this information. The output of "ls /sys/bus/pci/slots/". Also please rmmod the pciehp module and modprobe acpiphp module instead. Please check if the problem is reproducible with that in place.

Created attachment 41832 [details]
PCI / ACPI: Request _OSC control once for each root bridge
The attached patch may help, so please test it.
If it doesn't help, please send the output of "dmesg | grep _OSC" generated
right after a fresh boot.
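
In case it helps anyone collecting that output, here is a small Python sketch of the same filter as "dmesg | grep _OSC", which additionally decodes the mask in lines of the form "ACPI _OSC control granted for 0x1c" (such lines appear in the dmesg diff further down in this report). The bit labels are my reading of the _OSC control field for PCI host bridges and should be treated as an assumption, as should the helper names.

```python
import re

# Optional "[ 1.234567] " timestamp, then "<device>: ACPI _OSC control granted for 0x..".
OSC_GRANTED = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?(?P<dev>.+?): "
    r"ACPI _OSC control granted for (?P<mask>0x[0-9a-fA-F]+)$"
)

# Assumed meaning of control-field bits 0..4 (labels are descriptive only).
CONTROL_BITS = {
    0: "PCIe native hot plug",
    1: "SHPC native hot plug",
    2: "PCIe native PME",
    3: "PCIe AER",
    4: "PCIe capability structure",
}


def osc_lines(dmesg_text):
    """Equivalent of `grep _OSC`: keep only lines mentioning _OSC."""
    return [line for line in dmesg_text.splitlines() if "_OSC" in line]


def granted_controls(dmesg_text):
    """Map each device to the list of controls named in its granted mask."""
    result = {}
    for line in osc_lines(dmesg_text):
        match = OSC_GRANTED.match(line.strip())
        if match:
            mask = int(match.group("mask"), 16)
            result[match.group("dev")] = [
                name for bit, name in CONTROL_BITS.items() if mask & (1 << bit)
            ]
    return result


if __name__ == "__main__":
    sample = (
        "io scheduler cfq registered (default)\n"
        "pcieport 0000:00:01.0: ACPI _OSC control granted for 0x1c\n"
        "pcieport 0000:00:01.0: setting latency timer to 64\n"
    )
    for dev, controls in granted_controls(sample).items():
        print(dev, "->", ", ".join(controls))
```

Under those assumed labels, a mask of 0x1c corresponds to PME, AER and capability structure control, which matches the "Signaling PME through PCIe PME interrupt" lines that accompany it in the vanilla dmesg.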
On 2.6.36.1:

/sys/bus/pci/slots is empty.
/sys/bus/pci_express/drivers contains pciehp pci_pme aer

As pciehp is built into the kernel image, I could not find any way to prevent it from loading, so I'll need time to recompile and try what you've suggested in #28 and #29.

The patch from comment #29 is on top of the current mainline (2.6.37-rc8 at the moment).

Ok I'll try with that kernel.

Created attachment 41872 [details]
Good dmesg 2.6.37_rc7_git4
Well, I tried 2.6.37_rc7-git4 + the patch in #23 and did a normal reboot, i.e. without pcie_ports=compat, and the issue seems to be fixed; the new dmesg is attached. Let's keep the bug report open until 2.6.37 gets released and I'll close this as fixed if 2.6.37 works OK. Sorry for being late to switch to 2.6.37_rc*.

Do you mean that 2.6.37-rc7-git4 with the patch from comment #23 works for you without pcie_ports=compat and without the patch from comment #29 ? If so, 2.6.37-rc8 should work for you too (it contains the patch from comment #23). Please confirm.

Yes exactly. But what I've tried as 2.6.37_rc7-git4 + patch in comment #23 was not vanilla at all. It's carrying a patch from upstream that seems related to the issue, so maybe it was this commit which fixed the issue:

commit 885c252ffb059dc493200bdb981bdd21cabe4442
Author: Matthew Garrett <mjg@redhat.com>
Date: Thu Dec 9 18:31:59 2010 -0500

    PCI: _OSC "supported" field should contain supported features, not enabled ones

    From testing with Windows, the call to the PCI root _OSC method includes the full set of features supported by the operating system even if the hardware has already indicated that it doesn't support ASPM or MSI. https://bugzilla.redhat.com/show_bug.cgi?id=638912 is a case where making the _OSC call will incorrectly configure the chipset unless the supported field has bits 1, 2 and 4 set. Rework the functionality to ensure that we match this behaviour.

Anyway, I'll try with a vanilla 2.6.37_rc8 with and without the patch in comment #29 to see the outcome.

First, what do you mean by "upstream"? Second, if the "PCI: _OSC "supported" field should contain supported features, not enabled ones" patch helps, the patch from comment #29 rather won't help. I'll attach a patch on top of the one from comment #29 that may help.

Created attachment 41892 [details]
PCI / ACPI: Pass all _OSC support bits to the BIOS simultaneously

Patch to test on top of the patch from comment #29.
Okay, I'll try to summarize what's going on as I think I've caused a little bit of confusion:

- 2.6.36.x is still showing the issue on those laptops
- The problem goes away on 2.6.36.x with pcie_ports=compat but this gives a backtrace while unregistering a driver (patch to fix is available in comment #23)
- A complete solution is offered within the patch in comment #29

Then I tried 2.6.37_rc7-git4 with the patch in comment #23 to see at least if the backtrace is fixed when booting with pcie_ports=compat. A plain reboot (with no pcie_ports=compat) cured the kworker issue. Either the switch to 2.6.37_rc* cured the issue, or the patch that I've taken from Fedora f-15 entitled "PCI: _OSC "supported" field should contain supported features, not enabled ones" did. That was the patch I misleadingly described as "from upstream", sorry.

Then last night, I switched to 2.6.37_rc8 which already contains your patch in comment #23. I also put the patch in comment #29 on top of it and dropped the "PCI: _OSC .." patch from Matthew Garrett. But unfortunately a lot of machines broke while booting this kernel. I'll send the photo just after this comment.

Created attachment 41992 [details]
2.6.37_rc8+patch#29 ACPI trace
The crash seems to be caused by the patch from comment #29, which apparently tries to parse the HEST table too early. However, you appear to say that the patch from comment #29 on top of 2.6.36.y works correctly. Is that also the case on machines that crash with 2.6.37-rc8 + the patch from comment #29 (I mean, if those machines are booted with 2.6.36.y + patch from comment #29, do they boot correctly or crash)?

Can you please attach a dmesg output from vanilla 2.6.37-rc8 on one of the machines that crash with the patch from comment #29 on top of that kernel?

No, I didn't say that 2.6.36.y + patch from comment #29 booted correctly. As you'd said that the patch was against the top of the current mainline, I didn't even try to patch 2.6.36.y. I'll send you the dmesg from vanilla 2.6.37_rc8-git1 tomorrow. Sorry, I'm not an owner of this laptop so things are going slowly..

Created attachment 42132 [details]
PCI / ACPI: Request _OSC control once for each root bridge (v3)

In that case it's better if you test the attached patch when you have access to the machine in question. It is a replacement for the patch in comment #29 that should fix the problem with HEST parsing attempted too early.

Created attachment 42312 [details]
dmesg of vanilla 2.6.37_rc8-git4
Okay, vanilla 2.6.37_rc8-git4 is still problematic. I've attached the dmesg of it. Applying your v3 patch on top of it *seems* to fix the issue. I'm occasionally seeing a kworker in top with ~20-50% CPU usage but at least it does not hog the CPU eternally. /sys/firmware/acpi/interrupts/gpe_all is constant at 127 since booting and does not increment insanely with time. Here's a diff between the vanilla and the patched dmesg's:

--- dmesg.vanilla 2011-01-04 10:13:47.694000464 +0200
+++ dmesg.patched 2011-01-04 10:13:58.577000494 +0200
@@ -247,6 +247,7 @@
 ACPI: Power Resource [APPR] (off)
 ACPI: Power Resource [LPP] (on)
 ACPI: No dock devices found.
+HEST: Table not found.
 PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
 ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
 pci_root PNP0A08:00: host bridge window [io 0x0000-0x0cf7]
@@ -405,7 +406,6 @@
 ACPI: PCI Interrupt Link [LNKF] (IRQs 1 3 4 5 6 7 11 12 14 15) *10
 ACPI: PCI Interrupt Link [LNKG] (IRQs 1 3 4 5 6 7 10 12 14 15) *0, disabled.
 ACPI: PCI Interrupt Link [LNKH] (IRQs 1 3 4 5 6 7 11 12 14 15) *0, disabled.
-HEST: Table is not found!
 vgaarb: device added: PCI:0000:01:00.0,decodes=io+mem,owns=io+mem,locks=none
 vgaarb: loaded
 SCSI subsystem initialized
@@ -635,39 +635,12 @@
 io scheduler noop registered
 io scheduler deadline registered
 io scheduler cfq registered (default)
-pcieport 0000:00:01.0: ACPI _OSC control granted for 0x1c
 pcieport 0000:00:01.0: setting latency timer to 64
 pcieport 0000:00:01.0: irq 40 for MSI/MSI-X
-pcieport 0000:00:1c.0: ACPI _OSC control granted for 0x1c
-pcieport 0000:00:1c.0: setting latency timer to 64
-pcieport 0000:00:1c.0: irq 41 for MSI/MSI-X
-pcieport 0000:00:1c.1: ACPI _OSC control granted for 0x1c
-pcieport 0000:00:1c.1: setting latency timer to 64
-pcieport 0000:00:1c.1: irq 42 for MSI/MSI-X
-pcieport 0000:00:1c.3: ACPI _OSC control granted for 0x1c
-pcieport 0000:00:1c.3: setting latency timer to 64
-pcieport 0000:00:1c.3: irq 43 for MSI/MSI-X
-pcieport 0000:00:1c.7: ACPI _OSC control granted for 0x1c
-pcieport 0000:00:1c.7: setting latency timer to 64
-pcieport 0000:00:1c.7: irq 44 for MSI/MSI-X
-pcieport 0000:00:01.0: Signaling PME through PCIe PME interrupt
-pci 0000:01:00.0: Signaling PME through PCIe PME interrupt
-pci 0000:01:00.1: Signaling PME through PCIe PME interrupt
-pcie_pme 0000:00:01.0:pcie01: service driver pcie_pme loaded
-pcieport 0000:00:1c.0: Signaling PME through PCIe PME interrupt
-pcie_pme 0000:00:1c.0:pcie01: service driver pcie_pme loaded
-pcieport 0000:00:1c.1: Signaling PME through PCIe PME interrupt
-pcie_pme 0000:00:1c.1:pcie01: service driver pcie_pme loaded
-pcieport 0000:00:1c.3: Signaling PME through PCIe PME interrupt
-pci 0000:44:00.0: Signaling PME through PCIe PME interrupt
-pcie_pme 0000:00:1c.3:pcie01: service driver pcie_pme loaded
-pcieport 0000:00:1c.7: Signaling PME through PCIe PME interrupt
-pci 0000:45:00.0: Signaling PME through PCIe PME interrupt
-pcie_pme 0000:00:1c.7:pcie01: service driver pcie_pme loaded
 pci_hotplug: PCI Hot Plug PCI Core version: 0.5
 pciehp: PCI Express Hot Plug Controller Driver version: 0.4
 pci-stub: invalid id string ""
@@ -676,35 +649,34 @@
 ACPI: Power Button [PWRF]
 ACPI: acpi_idle registered with cpuidle
 Monitor-Mwait will be used to enter C-1 state
-Monitor-Mwait will be used to enter C-2 state
 Monitor-Mwait will be used to enter C-3 state
 thermal LNXTHERM:00: registered as thermal_zone0

OK, thanks for testing!

Apparently, with the patch from comment #44 _OSC is not executed on your system, so it doesn't use native PCI Express services and that's why the GPE storm doesn't appear any more (so the patch definitely helps). Which appears to be fine, because your system doesn't support ASPM, as indicated by the ACPI tables.

Handled-By : Rafael J. Wysocki <rjw@sisk.pl>
Patch : https://patchwork.kernel.org/patch/449231/

Thanks. BTW, if it is not too invasive for 2.6.36, it will be good to CC stable@kernel.org.

Rafael, can you check the following screenshot? The user says that he gets this trace with 2.6.37 + your v3 patch. I'm not quite sure that he's booting the right kernel, but the trace seems to be a little different from the one caused by your v2 patch?

http://bugs.pardus.org.tr/attachment.cgi?id=6374

Thanks,

This is an entirely different bug. It's the aer_service_init() code path that should be executed way after acpi_pci_root_init() that calls acpi_hest_init() in the patch from comment #47. Apart from this, it looks like the user actually _has_ HEST.

Can you open a new bug entry for this one, please, and put the slide in there along with (non-failing) boot log and the output of acpidump from the affected machine?

merged in .38-rc1:

commit 415e12b2379239973feab91850b0dce985c6058a
Author: Rafael J. Wysocki <rjw@sisk.pl>
Date: Fri Jan 7 00:55:09 2011 +0100

    PCI/ACPI: Request _OSC control once for each root bridge (v3)

Even though the patch fixes the initial problem, it reoccurs after a suspend to RAM / resume cycle.

On top of what kernel?
Linux ortwin-hp 2.6.37 #14 SMP PREEMPT Mon Feb 7 18:48:47 CET 2011 x86_64 Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz GenuineIntel GNU/Linux

I think you're seeing a different problem. Please file a separate bug for it and put my address into the CC list.

Confirming unfortunately that the issue reappears after a suspend/resume cycle. perf top after resume:

-------------------------------------------------------------------------------------------------------------------
PerfTop: 1138 irqs/sec kernel:90.7% exact: 0.0% [1000Hz cycles], (all, 4 CPUs)
-------------------------------------------------------------------------------------------------------------------

samples  pcnt function                            DSO
_______ _____ ___________________________________ _________________________________

3873.00 33.6% __acpi_acquire_global_lock          /lib/modules/2.6.37/build/vmlinux
1043.00  9.1% acpi_os_read_port                   /lib/modules/2.6.37/build/vmlinux
 879.00  7.6% acpi_ns_search_one_scope            /lib/modules/2.6.37/build/vmlinux
 577.00  5.0% acpi_ns_lookup                      /lib/modules/2.6.37/build/vmlinux
 474.00  4.1% acpi_ps_peek_opcode                 /lib/modules/2.6.37/build/vmlinux
 367.00  3.2% acpi_ex_name_segment                /lib/modules/2.6.37/build/vmlinux
 324.00  2.8% __acpi_release_global_lock          /lib/modules/2.6.37/build/vmlinux
 303.00  2.6% acpi_ps_get_next_namestring         /lib/modules/2.6.37/build/vmlinux
 269.00  2.3% acpi_ex_system_memory_space_handler /lib/modules/2.6.37/build/vmlinux
 216.00  1.9% pci_conf1_read                      /lib/modules/2.6.37/build/vmlinux
 199.00  1.7% kmem_cache_free                     /lib/modules/2.6.37/build/vmlinux
 188.00  1.6% __memset                            /lib/modules/2.6.37/build/vmlinux
 181.00  1.6% acpi_ps_parse_loop                  /lib/modules/2.6.37/build/vmlinux
 179.00  1.6% kmem_cache_alloc                    /lib/modules/2.6.37/build/vmlinux
 139.00  1.2% acpi_os_write_port                  /lib/modules/2.6.37/build/vmlinux
 120.00  1.0% acpi_ps_get_next_package_end        /lib/modules/2.6.37/build/vmlinux
 100.00  0.9% acpi_ex_get_name_string             /lib/modules/2.6.37/build/vmlinux
  90.00  0.8% add_preempt_count                   /lib/modules/2.6.37/build/vmlinux
  83.00  0.7% acpi_ut_create_generic_state        /lib/modules/2.6.37/build/vmlinux
  78.00  0.7% _raw_spin_lock_irqsave              /lib/modules/2.6.37/build/vmlinux
  75.00  0.7% acpi_ps_get_opcode_info             /lib/modules/2.6.37/build/vmlinux
  54.00  0.5% acpi_ps_get_next_simple_arg         /lib/modules/2.6.37/build/vmlinux
  54.00  0.5% _raw_spin_unlock_irqrestore         /lib/modules/2.6.37/build/vmlinux
  53.00  0.5% acpi_ds_exec_end_op                 /lib/modules/2.6.37/build/vmlinux
  50.00  0.4% kfree                               /lib/modules/2.6.37/build/vmlinux
  50.00  0.4% acpi_ps_append_arg                  /lib/modules/2.6.37/build/vmlinux
  46.00  0.4% acpi_ut_update_object_reference     /lib/modules/2.6.37/build/vmlinux
  44.00  0.4% sub_preempt_count                   /lib/modules/2.6.37/build/vmlinux
  38.00  0.3% acpi_ex_extract_from_field          /lib/modules/2.6.37/build/vmlinux
  38.00  0.3% acpi_ds_exec_begin_op               /lib/modules/2.6.37/build/vmlinux

Please open a new bug. Or please let me know its number in case you've done it already.

I have now opened #29722

A patch referencing this bug report has been merged in v2.6.38-8569-g16c29da:

commit 8b8bae901ce23addbdcdb54fa1696fb2d049feb5
Author: Rafael J. Wysocki <rjw@sisk.pl>
Date: Sat Mar 5 13:21:51 2011 +0100

    PCI/ACPI: Report ASPM support to BIOS if not disabled from command line