Bug 15954
Summary: | At random times, kacpid eats around 80% of CPU time | ||
---|---|---|---|
Product: | Drivers | Reporter: | leander256 |
Component: | network-wireless | Assignee: | Reinette Chatre (reinette.chatre) |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | alan, linville, reinette.chatre |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.32.11, 2.6.32.12 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 56331 | ||
Attachments: |
backtrace of all active CPUs
ACPI interrupts /var/log/messages acpidump log file /var/log/messages for 2.6.34 vanilla /var/log/messages for 2.6.34 with patch /var/log/messages for 2.6.34 with debug=0x43fff lspci -s 05:00.0 -xxxx |
please attach the output of "grep . /sys/firmware/acpi/interrupts/*". please attach the full dmesg after the problem appears once. Created attachment 26341 [details]
ACPI interrupts
Created attachment 26342 [details]
/var/log/messages
I don't know how I didn't notice that... Must be tired.
This file begins around 10 minutes *before* kacpid goes mad. Everything before is fine and then out of the blue I get these error messages. It ends around 1 or 2 minutes *after* kacpid started hogging the CPU, that's when I used SysRq to retrieve additional information right before rebooting.
please attach the acpidump of your laptop. does the problem still exists if you disable the wireless card? please attach the full dmesg after disabling wireless card if the problem still exists. Created attachment 26352 [details]
acpidump log file
I blacklisted the iwlcore module and rebooted. I'll try to not reboot for a few days.
does the problem still exist?:) ping ... Sorry for the delay, I wanted to make sure my system is now stable because the last time the problem occurred, I think I had at least 2 days of uptime (probably 3). Now it's almost 8 days of uptime so we can assume it comes from iwlcore/iwlagn. Waiting for your instructions on what to do next. > iwlagn 0000:05:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0
> free buffers remaining.
this is printed by this piece of code: (iwl_rx_allocate)
skb = alloc_skb(priv->hw_params.rx_buf_size + 256,
priority);
if (!skb) {
if (net_ratelimit())
IWL_DEBUG_INFO(priv, "Failed to allocate SKB buffer.\n");
if ((rxq->free_count <= RX_LOW_WATERMARK) &&
net_ratelimit())
IWL_CRIT(priv, "Failed to allocate SKB buffer with %s. Only %u free buffers remaining.\n",
priority == GFP_ATOMIC ? "GFP_ATOMIC" : "GFP_KERNEL",
rxq->free_count);
/* We don't reschedule replenish work here -- we will
* call the restock method and if it still needs
* more buffers it will schedule replenish */
break;
}
I saw there are several changes in iwl_rx_allocate().
Please try the latest upstream kernel (2.6.34-rc7) and see if the problem still exists.
Created attachment 26509 [details]
/var/log/messages for 2.6.34 vanilla
Hi,
I tried the vanilla 2.6.34 and the problem occurred again, the iwlagn driver crashed somehow and kacpid started eating 90% of CPU again. I attached the /var/log/messages complete with boot and a suspend-to-ram in the middle. Also, due to a previous bug with that driver, my kernels are compiled with CONFIG_IWLWIFI_DEBUG=y .
As the problem occurs after the iwlcore/iwlagn crash, let's assign this bug to the wireless experts to see if they can help. When the driver tries to read from a device register and the device returns 0xFFFFFFFF it means that the device has essentially been disconnected from the PCI bus, perhaps due to some powersaving mechanism. We have seen this problem intermittently but have not been able to clear it up. We have a test patch available that we are unable to get clear indication whether it works or not. Could you try it out? I am not sure on which kernel this patch was based, let me know if you need any help if backporting is required. http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037#c112 I patched the 2.6.34 without any problem, and restarted. Please note that I didn't specify any option (e.g. debug=xxxxxx) for modprobe. Created attachment 26541 [details]
/var/log/messages for 2.6.34 with patch
So it took just one hour before the problem shows up again. Connection loss and kacpid going wild again. I was doing something on the computer at that time and downloading at around 600 KB/s, so it's not like some major power-saving event was supposed to kick-in.
This new log looks significantly different from the one you posted in comment #10. I doubt that it has anything to do with the patch though since the settings this patch attempts to clear in your device is not set. The interesting thing in the new log is that the unresponsiveness of device seems to start after a hardware error. The logs from comment #10 does not contain a hardware error. When this hardware error occurs the driver does attempt to restart the device but cannot communicate with it at all. Since the patch is not working, could we look into perhaps what happens around the time the device becomes unhappy? Is it possible to reproduce with debug=0x43fff (please note that this will generate a lot of data and it will be better if you can use syslog to capture all kernel messages to a file)? This will hopefully give us an idea where the issue is so that we can dig deeper. Created attachment 26719 [details]
/var/log/messages for 2.6.34 with debug=0x43fff
It took many days before showing up, but it finally occurred. For six days I didn't notice much, then suddenly there were disconnections every few minutes and then the final crash.
Is the PCI config space still accessible when the device gets into this state? That is, when you start seeing the "MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF"? Could you please check the device id with lspci and then run "lspci -s <device> -xxxx" when this error occurs? Created attachment 27039 [details]
lspci -s 05:00.0 -xxxx
It took a while but the bug show up again, and I could do the lspci command.
Besides, I just realized there's probably a pattern linked to this bug, it seems to occur "quickly" (let's say a few hours) after I do a suspend to RAM. I used to bring my laptop everyday to work and that's when I had trouble, now that I'm not moving it often (around once every two weeks) the bug seems elusive but shows up quickly after I moved the laptop.
For the record I'm using the hibernate-ram script from tuxonice (but not the patch, my kernel is definitely a vanilla one).
Is this still seen with modern kernels ? |
Created attachment 26319 [details] backtrace of all active CPUs I think this problem occurs randomly since I've been using the 2.6.32 branch but it's hard to remember when it first appeared. It seems to occur more quickly when I'm using a wifi connection and downloading big files (for instance, the OOo package). My laptop is an Acer Travelmate 6292 with a Core2 Duo T7300, the wifi card is a 4965AGN. I retrieved a lot of information during the previous bug occurrence with the help of magic SysRq (backtrace of all active CPUs, Memory, Regs, clockevent devices & pending hrtimers, State, Blocked State), please tell me if you need one of them. I'm attaching the CPUs backtrace.