Bug 15954

Summary: At random times, kacpid eats around 80% of CPU time
Product: Drivers Reporter: leander256
Component: network-wirelessAssignee: Reinette Chatre (reinette.chatre)
Status: RESOLVED OBSOLETE    
Severity: normal CC: alan, linville, reinette.chatre
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32.11, 2.6.32.12 Subsystem:
Regression: No Bisected commit-id:
Bug Depends on:    
Bug Blocks: 56331    
Attachments: backtrace of all active CPUs
ACPI interrupts
/var/log/messages
acpidump log file
/var/log/messages for 2.6.34 vanilla
/var/log/messages for 2.6.34 with patch
/var/log/messages for 2.6.34 with debug=0x43fff
lspci -s 05:00.0 -xxxx

Description leander256 2010-05-10 17:02:53 UTC
Created attachment 26319 [details]
backtrace of all active CPUs

I think this problem occurs randomly since I've been using the 2.6.32 branch but it's hard to remember when it first appeared. It seems to occur more quickly when I'm using a wifi connection and downloading big files (for instance, the OOo package). My laptop is an Acer Travelmate 6292 with a Core2 Duo T7300, the wifi card is a 4965AGN.

I retrieved a lot of information during the previous bug occurrence with the help of magic SysRq (backtrace of all active CPUs, Memory, Regs, clockevent devices & pending hrtimers, State, Blocked State), please tell me if you need one of them. I'm attaching the CPUs backtrace.
Comment 1 Zhang Rui 2010-05-11 09:08:28 UTC
please attach the output of "grep . /sys/firmware/acpi/interrupts/*".
please attach the full dmesg after the problem appears once.
Comment 2 leander256 2010-05-11 12:34:12 UTC
Created attachment 26341 [details]
ACPI interrupts
Comment 3 leander256 2010-05-11 12:39:42 UTC
Created attachment 26342 [details]
/var/log/messages

I don't know how I didn't notice that... Must be tired.

This file begins around 10 minutes *before* kacpid goes mad. Everything before is fine and then out of the blue I get these error messages. It ends around 1 or 2 minutes *after* kacpid started hogging the CPU, that's when I used SysRq to retrieve additional information right before rebooting.
Comment 4 Zhang Rui 2010-05-12 02:59:26 UTC
please attach the acpidump of your laptop.

does the problem still exists if you disable the wireless card?
please attach the full dmesg after disabling wireless card if the problem still exists.
Comment 5 leander256 2010-05-12 09:44:22 UTC
Created attachment 26352 [details]
acpidump log file

I blacklisted the iwlcore module and rebooted. I'll try to not reboot for a few days.
Comment 6 Zhang Rui 2010-05-13 02:54:17 UTC
does the problem still exist?:)
Comment 7 Zhang Rui 2010-05-19 06:47:39 UTC
ping ...
Comment 8 leander256 2010-05-19 18:30:43 UTC
Sorry for the delay, I wanted to make sure my system is now stable because the last time the problem occurred, I think I had at least 2 days of uptime (probably 3). Now it's almost 8 days of uptime so we can assume it comes from iwlcore/iwlagn. Waiting for your instructions on what to do next.
Comment 9 Zhang Rui 2010-05-20 02:40:22 UTC
> iwlagn 0000:05:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0
> free buffers remaining.

this is printed by this piece of code: (iwl_rx_allocate)
                skb = alloc_skb(priv->hw_params.rx_buf_size + 256,
                                                priority);

                if (!skb) {
                        if (net_ratelimit())
                                IWL_DEBUG_INFO(priv, "Failed to allocate SKB buffer.\n");
                        if ((rxq->free_count <= RX_LOW_WATERMARK) &&
                            net_ratelimit())
                                IWL_CRIT(priv, "Failed to allocate SKB buffer with %s. Only %u free buffers remaining.\n",
                                         priority == GFP_ATOMIC ?  "GFP_ATOMIC" : "GFP_KERNEL",
                                         rxq->free_count);
                        /* We don't reschedule replenish work here -- we will
                         * call the restock method and if it still needs
                         * more buffers it will schedule replenish */
                        break;
                }


I saw there are several changes in iwl_rx_allocate().
Please try the latest upstream kernel (2.6.34-rc7) and see if the problem still exists.
Comment 10 leander256 2010-05-23 12:08:28 UTC
Created attachment 26509 [details]
/var/log/messages for 2.6.34 vanilla

Hi,

I tried the vanilla 2.6.34 and the problem occurred again, the iwlagn driver crashed somehow and kacpid started eating 90% of CPU again. I attached the /var/log/messages complete with boot and a suspend-to-ram in the middle. Also, due to a previous bug with that driver, my kernels are compiled with CONFIG_IWLWIFI_DEBUG=y .
Comment 11 Zhang Rui 2010-05-24 07:09:08 UTC
As the problem occurs after the iwlcore/iwlagn crash, let's assign this bug to the wireless experts to see if they can help.
Comment 12 Reinette Chatre 2010-05-24 17:36:58 UTC
When the driver tries to read from a device register and the device returns 0xFFFFFFFF it means that the device has essentially been disconnected from the PCI bus, perhaps due to some powersaving mechanism.

We have seen this problem intermittently but have not been able to clear it up. We have a test patch available that we are unable to get clear indication whether it works or not. Could you try it out? I am not sure on which kernel this patch was based, let me know if you need any help if backporting is required.

http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037#c112
Comment 13 leander256 2010-05-24 19:01:19 UTC
I patched the 2.6.34 without any problem, and restarted. Please note that I didn't specify any option (e.g. debug=xxxxxx) for modprobe.
Comment 14 leander256 2010-05-25 19:44:59 UTC
Created attachment 26541 [details]
/var/log/messages for 2.6.34 with patch

So it took just one hour before the problem shows up again. Connection loss and kacpid going wild again. I was doing something on the computer at that time and downloading at around 600 KB/s, so it's not like some major power-saving event was supposed to kick-in.
Comment 15 Reinette Chatre 2010-06-03 22:21:17 UTC
This new log looks significantly different from the one you posted in comment #10. I doubt that it has anything to do with the patch though since the settings this patch attempts to clear in your device is not set.

The interesting thing in the new log is that the unresponsiveness of device seems to start after a hardware error. The logs from comment #10 does not contain a hardware error. When this hardware error occurs the driver does attempt to restart the device but cannot communicate with it at all.

Since the patch is not working, could we look into perhaps what happens around the time the device becomes unhappy? Is it possible to reproduce with debug=0x43fff (please note that this will generate a lot of data and it will be better if you can use syslog to capture all kernel messages to a file)? This will hopefully give us an idea where the issue is so that we can dig deeper.
Comment 16 leander256 2010-06-10 21:25:13 UTC
Created attachment 26719 [details]
/var/log/messages for 2.6.34 with debug=0x43fff

It took many days before showing up, but it finally occurred. For six days I didn't notice much, then suddenly there were disconnections every few minutes and then the final crash.
Comment 17 Reinette Chatre 2010-06-21 21:37:52 UTC
Is the PCI config space still accessible when the device gets into this state? That is, when you start seeing the "MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF"? Could you please check the device id with lspci and then run "lspci -s <device> -xxxx" when this error occurs?
Comment 18 leander256 2010-07-07 21:09:03 UTC
Created attachment 27039 [details]
lspci -s 05:00.0 -xxxx

It took a while but the bug show up again, and I could do the lspci command.

Besides, I just realized there's probably a pattern linked to this bug, it seems to occur "quickly" (let's say a few hours) after I do a suspend to RAM. I used to bring my laptop everyday to work and that's when I had trouble, now that I'm not moving it often (around once every two weeks) the bug seems elusive but shows up quickly after I moved the laptop.

For the record I'm using the hibernate-ram script from tuxonice (but not the patch, my kernel is definitely a vanilla one).
Comment 19 Alan 2012-06-14 17:21:41 UTC
Is this still seen with modern kernels ?