Bug 13032

Summary: S3: 2.6.29 regression: network interfaces drop after resume - Dell Inspiron 600m laptop
Product: ACPI Reporter: Jose Marino (braket)
Component: Power-Processor Assignee: Len Brown (lenb)
Severity: normal CC: acpi-bugzilla, alan, braket, daniel, kvalo, lenb, linville, reinette.chatre, rjw, rui.zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: Tree: Mainline
Regression: Yes
Bug Depends on:    
Bug Blocks: 7216, 12398    
Attachments: Kernel logs right after loading b44 driver: modprobe b44 b44_debug=31
Kernel logs after loading ipw2200 driver: modprobe ipw2200
Kernel logs after loading ipw2200 driver: modprobe ipw2200 debug=0x43fff
Messages after adding "dump_stack" to b44_halt()
Output from lspci -vv
patch vs 2.6.30-rc4 to restore BM_RLD on resume
patch vs 2.6.30-rc4 to restore BM_RLD on resume

Description Jose Marino 2009-04-07 00:08:12 UTC
Created attachment 20841 [details]
Kernel logs right after loading b44 driver: modprobe b44 b44_debug=31

After a fresh reboot everything works just fine.
The problems appear after resuming from suspend to ram at least once. Both network interfaces (ethernet and wireless) keep dropping continuously. Dropping frequency seems to depend on traffic.
BTW suspend to ram seems to work perfectly fine.

This problem happens on a Dell Inspiron 600m laptop. The ethernet card uses the b44 driver and the wireless card uses ipw2200.
The problem occurs with kernels 2.6.29 and, but not with 2.6.28 or less.

I'll attach some kernel logs showing what happens from the moment I load the network drivers. The interface drops happen within seconds after the connection is established.

One more thing, while using the wireless card the connection drops happen much more frequently when using encrypted networks (both WEP and WPA).
Comment 1 Jose Marino 2009-04-07 00:09:44 UTC
Created attachment 20842 [details]
Kernel logs after loading ipw2200 driver: modprobe ipw2200
Comment 2 Jose Marino 2009-04-07 00:10:40 UTC
Created attachment 20843 [details]
Kernel logs after loading ipw2200 driver: modprobe ipw2200 debug=0x43fff
Comment 3 John W. Linville 2009-04-08 16:56:46 UTC
Not sure what would account for drops on both wired and wireless interfaces.
Comment 4 Jose Marino 2009-04-08 17:29:03 UTC
I have some new info on this issue.

I noticed that the problems don't appear if I have a USB optical mouse plugged in. I played around with plugging and unplugging the mouse and this is what I get:

After fresh reboot: 
- USB mouse plugged in: no drops
- USB mouse unplugged: no drops

After a suspend/resume:
- USB mouse plugged in: no drops
- USB mouse unplugged: lots of drops

I can also confirm that this happens with and not with (both vanilla kernels).

Just in case the kind of mouse matters, this is what shows up in the kernel logs when I plug it in:

Apr  8 13:17:48 flux generic-usb 0003:0461:4D03.0003: input: USB HID v1.00 Mouse [ARROW STRONG USB 3D Mouse] on usb-0000:00:1d.0-1/input0
Comment 5 Reinette Chatre 2009-04-09 21:19:07 UTC
Could there be something going on with the system power saving? As the ipw2200 errors do not help much here (all we have is a fw sysassert), can you perhaps dig into the b44 errors? Perhaps adding a "dump_stack" to b44_halt() will give us an idea of why the system is bringing the device down at that time.
Comment 6 Jose Marino 2009-04-09 22:43:15 UTC
Created attachment 20912 [details]
Messages after adding "dump_stack" to b44_halt()

I inserted a call to dump_stack at the bottom of b44_halt:

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index dc5f051..1e066c0 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -1337,6 +1337,7 @@ static void b44_halt(struct b44 *bp)
 	/* now reset the chip, but without enabling the MAC&PHY
 	 * part of it. This has to be done _after_ we shut down the PHY */
 	b44_chip_reset(bp, B44_CHIP_RESET_PARTIAL);
+	dump_stack();
 }
 
 /* bp->lock is held. */

Find attached the kernel messages showing 3 consecutive network drops. The messages show from the time I unplug the USB mouse to the time I plug it back in (and the drops stop).
Comment 7 Reinette Chatre 2009-04-09 22:59:13 UTC
That stack dump shows that the reset is triggered after the device reports an error to the driver via an interrupt. In the ipw2200 case there is also a hardware error, but the hardware resets itself (no need for driver intervention). So, something is going on in the lower levels of your platform. Unfortunately I do not know how to debug this further.
Comment 8 Rafael J. Wysocki 2009-04-10 12:30:13 UTC
Jose, can you test commit 0db29af1e767464d71b89410d61a1e5b668d0370, please?
Comment 9 Jose Marino 2009-04-10 17:41:30 UTC
Commit 0db29af1e767464d71b89410d61a1e5b668d0370 refers to MSI stuff: "PCI/MSI: bugfix/utilize for msi_capability_init()". I don't seem to have MSI enabled in my config:

# CONFIG_PCI_MSI is not set

I went ahead anyway and undid the commit which made no difference. Make told me my build was up to date.
Comment 10 Rafael J. Wysocki 2009-04-10 17:46:47 UTC
Sorry, I didn't mean for you to revert the commit, but to try the kernel where this commit is the head.
Comment 11 Jose Marino 2009-04-10 19:18:09 UTC
The kernel at that commit does not have the problem. 
I've been testing with both wired and wireless for quite a while and everything works fine. Not a single drop.
Comment 12 Rafael J. Wysocki 2009-04-10 20:34:45 UTC
OK, now please test the kernel where the head is commit 8efb8c76fcdccf5050c0ea059dac392789baaff2 .
Comment 13 Rafael J. Wysocki 2009-04-10 20:35:58 UTC
Sorry, comment #12 is wrong, please discard it.
Comment 14 Rafael J. Wysocki 2009-04-10 20:37:59 UTC
Please test the kernel where the head is commit aa8c6c93747f7b55fa11e1624fec8ca33763a805 .
Comment 15 Jose Marino 2009-04-10 21:26:41 UTC
After trying the first commit you suggested I got curious and started a bisect.
Here's the bisect log at this point:

git bisect start
# good: [0db29af1e767464d71b89410d61a1e5b668d0370] PCI/MSI: bugfix/utilize for msi_capability_init()
git bisect good 0db29af1e767464d71b89410d61a1e5b668d0370
# bad: [8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84] Linux 2.6.29
git bisect bad 8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84
# bad: [1db8508cf483dc1ecf66141f90a7c03659d69512] hugetlbfs: fix build failure with !CONFIG_HUGETLBFS
git bisect bad 1db8508cf483dc1ecf66141f90a7c03659d69512
# good: [8b1e3a2f7f84484a8c208671adac39eb148c7d61] headers_check fix: linux/video_decoder.h
git bisect good 8b1e3a2f7f84484a8c208671adac39eb148c7d61

So right now I'm between:
8b1e3a2f7f84484a8c208671adac39eb148c7d61  (good)  and
1db8508cf483dc1ecf66141f90a7c03659d69512  (bad)

I believe the commit you suggest in comment #14 is before my current good bisect point. Do you want me to test it anyway?
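
For readers unfamiliar with the procedure, the narrowing shown in the bisect log above is a binary search over the commit history. The following is a minimal Python sketch of that idea, not git's actual implementation: the `first_bad` helper, the linear 4096-commit history, and the `is_bad` predicate are all hypothetical, and real kernel history is a graph rather than a list.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of tests run).

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1  # one rebuild + reboot + suspend/resume cycle
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests


# Hypothetical linear history; the regression is assumed to land at "c2500".
history = [f"c{i}" for i in range(4096)]
culprit, tests = first_bad(history, lambda c: int(c[1:]) >= 2500)
print(culprit, tests)
```

With these made-up numbers the search needs 12 tests, which illustrates why a range of thousands of commits can be narrowed down with roughly a dozen rebuild and reboot cycles.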
Comment 16 Rafael J. Wysocki 2009-04-10 23:45:13 UTC
No, thanks, please continue the bisection.
Comment 17 Jose Marino 2009-04-11 16:53:16 UTC
After 12 rebuilds and reboots I finally finished the bisection. Here's the result:

31878dd86b7df9a147f5e6cc6e07092b4308782b is first bad commit
commit 31878dd86b7df9a147f5e6cc6e07092b4308782b
Author: Len Brown <len.brown@intel.com>
Date:   Wed Jan 28 18:28:09 2009 -0500

    ACPI: remove BM_RLD access from idle entry path
    It is true that BM_RLD needs to be set to enable
    bus master activity to wake an older chipset (eg PIIX4) from C3.
    This is contrary to the erroneous wording the ACPI 2.0, 3.0
    specifications that suggests that BM_RLD is an indicator
    rather than a control bit.
    ACPI 1.0's correct wording should be restored in ACPI 4.0:
    But the kernel should not have to clear BM_RLD
    when entering a non C3-type state just to set
    it again when entering a C3-type C-state.
    We should be able to set BM_RLD at boot time
    and leave it alone -- removing the overhead of
    accessing this IO register from the idle entry path.
    Signed-off-by: Len Brown <len.brown@intel.com>

:040000 040000 999629a2940ce92d1b2022fe510eecd2f57599ee 892206507142af444079b2329eeecf1d86f54d19 M	drivers
Comment 18 Rafael J. Wysocki 2009-04-11 18:17:41 UTC
Thanks for bisecting it!

First-Bad-Commit : 31878dd86b7df9a147f5e6cc6e07092b4308782b
Comment 19 Jose Marino 2009-04-11 18:53:56 UTC
To trigger this bug all I had to do was create some network traffic, either by downloading something or by ssh-ing to a remote machine and running "find .".

While bisecting the bug I noticed that I could not trigger the bug while the kernel was compiling (cpu at 100%) or when I ssh to a machine in the local network. With powertop I checked the Cn state of the cpu:

**** network drops ****
-- USB mouse unplugged, traffic with distant machine
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (10.2%)         2.00 Ghz     9.0%
polling           0.0ms ( 0.0%)         1.80 Ghz     1.1%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.1%
C2                4.2ms (15.6%)          800 Mhz     1.0%
C3                0.3ms ( 0.1%)          600 Mhz    88.7%
C4                3.8ms (74.1%)

**** NO network drops ****
-- USB mouse unplugged, traffic with machine in local network
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (99.9%)         2.00 Ghz    42.4%
polling           0.0ms ( 0.0%)         1.80 Ghz     0.0%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.0%
C2                0.9ms ( 0.1%)          800 Mhz     2.5%
C3                0.0ms ( 0.0%)          600 Mhz    55.0%
C4                0.0ms ( 0.0%)

-- Mouse plugged in
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 2.8%)         2.00 Ghz     6.2%
polling           0.0ms ( 0.0%)         1.80 Ghz     0.0%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.0%
C2                5.1ms (97.2%)         1400 Mhz     0.0%
C3                0.0ms ( 0.0%)          600 Mhz    93.8%
C4                0.0ms ( 0.0%)

It seems that the network drops only happen whenever the cpu is allowed to go lower than C2. Somehow plugging in the USB mouse is preventing this.
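
The comparison above can be made mechanical by summing the residency of every C-state deeper than C2 in each powertop table. This is a rough Python sketch under stated assumptions: the `deep_residency` helper and its regular expression are hypothetical and only handle the table layout pasted in this comment, not powertop output in general.

```python
import re

# Matches lines such as "C4                3.8ms (74.1%)" and captures the
# C-state number and its residency percentage. Assumes the layout shown
# in this report; other powertop versions may print a different format.
CSTATE_RE = re.compile(r"^C(\d+)\b.*?\(\s*([\d.]+)%\)")


def deep_residency(table: str, deeper_than: int = 2) -> float:
    """Sum the residency (in percent) of C-states deeper than `deeper_than`."""
    total = 0.0
    for line in table.splitlines():
        m = CSTATE_RE.match(line.strip())
        if m and int(m.group(1)) > deeper_than:
            total += float(m.group(2))
    return total


# Tables copied from the comment above.
DROPS = """\
C0 (cpu running)        (10.2%)         2.00 Ghz     9.0%
polling           0.0ms ( 0.0%)         1.80 Ghz     1.1%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.1%
C2                4.2ms (15.6%)          800 Mhz     1.0%
C3                0.3ms ( 0.1%)          600 Mhz    88.7%
C4                3.8ms (74.1%)
"""

MOUSE_PLUGGED = """\
C0 (cpu running)        ( 2.8%)         2.00 Ghz     6.2%
polling           0.0ms ( 0.0%)         1.80 Ghz     0.0%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.0%
C2                5.1ms (97.2%)         1400 Mhz     0.0%
C3                0.0ms ( 0.0%)          600 Mhz    93.8%
C4                0.0ms ( 0.0%)
"""

print(f"drops:         {deep_residency(DROPS):.1f}% below C2")
print(f"mouse plugged: {deep_residency(MOUSE_PLUGGED):.1f}% below C2")
```

For the two tables embedded here the sums are 74.2% and 0.0%, which matches the observation that drops only occur when the CPU spends time in C3 or C4.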
Comment 20 Rafael J. Wysocki 2009-04-26 11:48:39 UTC
Notify-Also : ACPI Bugzilla <acpi-bugzilla@lists.sourceforge.net>
Comment 21 Len Brown 2009-04-28 01:40:30 UTC
please attach the output from lspci -vv
Comment 22 Jose Marino 2009-04-28 22:18:27 UTC
Created attachment 21164 [details]
Output from lspci -vv
Comment 23 Len Brown 2009-05-08 02:27:36 UTC
Created attachment 21271 [details]
patch vs 2.6.30-rc4 to restore BM_RLD on resume

please apply this patch, suspend, resume,
and show the output of "dmesg | grep BM_RLD"
and report if the patch fixes the regression.
Comment 24 Len Brown 2009-05-08 02:33:33 UTC
Created attachment 21272 [details]
patch vs 2.6.30-rc4 to restore BM_RLD on resume

fixed printk typo in previous version, please test this one instead.
Comment 25 Jose Marino 2009-05-08 13:43:27 UTC
This patch fixes the regression here. Yay.

The output of "dmesg | grep BM_RLD" after one suspend/resume:

ACPI: saved: BM_RLD 1
ACPI: resumed: BM_RLD 0
ACPI: restored BM_RLD 1

Thanks :)
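
The three log lines above suggest a save, compare, and restore sequence around suspend/resume. The attached patch itself is not quoted in this thread, so the following is only a user-space Python model inferred from those messages, not the kernel code: the `PmControl` class, the `firmware_resume` method, and `suspend_resume` are hypothetical names, and the assumption that the bit comes back cleared after S3 is taken from the "resumed: BM_RLD 0" line.

```python
from typing import List


class PmControl:
    """Toy stand-in for the register holding the BM_RLD bit (hypothetical)."""

    def __init__(self) -> None:
        # Set once at boot and then left alone, as the bisected commit intends.
        self.bm_rld = 1

    def firmware_resume(self) -> None:
        # Assumption: the bit is lost across S3, per "resumed: BM_RLD 0".
        self.bm_rld = 0


def suspend_resume(reg: PmControl) -> List[str]:
    """Model one suspend/resume cycle and return the messages it would log."""
    log = []
    saved = reg.bm_rld                      # remember the value before suspend
    log.append(f"ACPI: saved: BM_RLD {saved}")

    reg.firmware_resume()                   # platform comes back from S3
    log.append(f"ACPI: resumed: BM_RLD {reg.bm_rld}")

    if reg.bm_rld != saved:                 # write back only if it changed
        reg.bm_rld = saved
        log.append(f"ACPI: restored BM_RLD {reg.bm_rld}")
    return log


reg = PmControl()
messages = suspend_resume(reg)
for line in messages:
    print(line)
```

In this model, skipping the restore step leaves the bit at 0 after resume, which would be consistent with bus master activity no longer waking the CPU from C3 and with the drops seen only after a suspend/resume cycle.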
Comment 26 Len Brown 2009-05-16 03:09:06 UTC
*** Bug 13277 has been marked as a duplicate of this bug. ***