Bug 13032 - S3: 2.6.29 regression: network interfaces drop after resume - Dell Inspiron 600m laptop
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Processor
Hardware: All Linux
Importance: P1 normal
Assignee: Len Brown
URL:
Keywords:
Depends on:
Blocks: 7216 12398
Reported: 2009-04-07 00:08 UTC by Jose Marino
Modified: 2009-05-16 21:46 UTC
CC List: 10 users

See Also:
Kernel Version: 2.6.29.1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Kernel logs right after loading b44 driver: modprobe b44 b44_debug=31 (2.95 KB, text/plain)
2009-04-07 00:08 UTC, Jose Marino
Kernel logs after loading ipw2200 driver: modprobe ipw2200 (2.45 KB, text/plain)
2009-04-07 00:09 UTC, Jose Marino
Kernel logs after loading ipw2200 driver: modprobe ipw2200 debug=0x43fff (103.05 KB, text/plain)
2009-04-07 00:10 UTC, Jose Marino
Messages after adding "dump_stack" to b44_halt() (4.19 KB, text/plain)
2009-04-09 22:43 UTC, Jose Marino
Output from lspci -vv (10.92 KB, text/plain)
2009-04-28 22:18 UTC, Jose Marino
patch vs 2.6.30-rc4 to restore BM_RLD on resume (2.03 KB, patch)
2009-05-08 02:27 UTC, Len Brown
patch vs 2.6.30-rc4 to restore BM_RLD on resume (2.03 KB, patch)
2009-05-08 02:33 UTC, Len Brown

Description Jose Marino 2009-04-07 00:08:12 UTC
Created attachment 20841 [details]
Kernel logs right after loading b44 driver: modprobe b44 b44_debug=31

After a fresh reboot everything works just fine.
The problems appear after resuming from suspend to RAM at least once. Both network interfaces (ethernet and wireless) keep dropping continuously. The dropping frequency seems to depend on traffic.
BTW, suspend to RAM itself seems to work perfectly fine.

This problem happens on a Dell Inspiron 600m laptop. The ethernet card uses the b44 driver and the wireless card uses ipw2200.
The problem occurs with kernels 2.6.29 and 2.6.29.1, but not with 2.6.28 or earlier.

I'll attach some kernel logs showing what happens from the moment I load the network drivers. The interface drops happen within seconds after the connection is established.

One more thing: while using the wireless card, the connection drops happen much more frequently on encrypted networks (both WEP and WPA).
Comment 1 Jose Marino 2009-04-07 00:09:44 UTC
Created attachment 20842 [details]
Kernel logs after loading ipw2200 driver: modprobe ipw2200
Comment 2 Jose Marino 2009-04-07 00:10:40 UTC
Created attachment 20843 [details]
Kernel logs after loading ipw2200 driver: modprobe ipw2200 debug=0x43fff
Comment 3 John W. Linville 2009-04-08 16:56:46 UTC
Not sure what would account for drops on both wired and wireless interfaces.
Comment 4 Jose Marino 2009-04-08 17:29:03 UTC
I have some new info on this issue.

I noticed that the problems don't appear if I have a USB optical mouse plugged in. I played around with plugging and unplugging the mouse and this is what I get:

After fresh reboot: 
- USB mouse plugged in: no drops
- USB mouse unplugged: no drops

After a suspend/resume:
- USB mouse plugged in: no drops
- USB mouse unplugged: lots of drops

I can also confirm that this happens with 2.6.29.1 and not with 2.6.28.9 (both vanilla kernels).

Just in case the kind of mouse matters, this is what shows in the kernel logs when I plug it:

Apr  8 13:17:48 flux generic-usb 0003:0461:4D03.0003: input: USB HID v1.00 Mouse [ARROW STRONG USB 3D Mouse] on usb-0000:00:1d.0-1/input0
Comment 5 Reinette Chatre 2009-04-09 21:19:07 UTC
Could there be something going on with the system power saving? As the ipw2200 errors do not help much here (all we have is a fw sysassert), can you perhaps dig into the b44 errors? Perhaps adding a "dump_stack" into b44_halt() will give us an idea of why the system is bringing the device down at that time.
Comment 6 Jose Marino 2009-04-09 22:43:15 UTC
Created attachment 20912 [details]
Messages after adding "dump_stack" to b44_halt()

I inserted a call to dump_stack at the bottom of b44_halt:

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index dc5f051..1e066c0 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -1337,6 +1337,7 @@ static void b44_halt(struct b44 *bp)
 	/* now reset the chip, but without enabling the MAC&PHY
 	 * part of it. This has to be done _after_ we shut down the PHY */
 	b44_chip_reset(bp, B44_CHIP_RESET_PARTIAL);
+	dump_stack();
 }
 
 /* bp->lock is held. */

Find attached the kernel messages showing 3 consecutive network drops. The messages show from the time I unplug the USB mouse to the time I plug it back in (and the drops stop).
Comment 7 Reinette Chatre 2009-04-09 22:59:13 UTC
That stack dump shows that the reset is triggered after the device reports an error to the driver via an interrupt. In the ipw2200 case there is also a hardware error, but the hardware resets itself (no need for driver intervention). So, something is going on in the lower levels of your platform. Unfortunately I do not know how to debug this further.
Comment 8 Rafael J. Wysocki 2009-04-10 12:30:13 UTC
Jose, can you test commit 0db29af1e767464d71b89410d61a1e5b668d0370, please?
Comment 9 Jose Marino 2009-04-10 17:41:30 UTC
Commit 0db29af1e767464d71b89410d61a1e5b668d0370 refers to MSI stuff: "PCI/MSI: bugfix/utilize for msi_capability_init()". I don't seem to have MSI enabled in my config:

CONFIG_ARCH_SUPPORTS_MSI=y
# CONFIG_PCI_MSI is not set

I went ahead and reverted the commit anyway, which made no difference: make told me my build was up to date.
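
For anyone repeating this check on their own .config, here is a minimal sketch of reading an option's state. It relies only on the two line forms shown above ("CONFIG_X=value" and "# CONFIG_X is not set"). The helper name config_option_state is made up for illustration and is not part of any kernel tooling.

```python
def config_option_state(config_text, option):
    """Return the option's value (e.g. 'y' or 'm'), or None if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        # Kconfig writes disabled options as a comment in this exact form.
        if line == f"# {option} is not set":
            return None
        # Appending "=" avoids matching longer names that share the prefix.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# The two lines quoted in the comment above.
config = """CONFIG_ARCH_SUPPORTS_MSI=y
# CONFIG_PCI_MSI is not set
"""

print(config_option_state(config, "CONFIG_ARCH_SUPPORTS_MSI"))  # y
print(config_option_state(config, "CONFIG_PCI_MSI"))            # None
```

With CONFIG_PCI_MSI unset, the MSI code is presumably not compiled at all, which would fit make reporting nothing to rebuild after the revert.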
Comment 10 Rafael J. Wysocki 2009-04-10 17:46:47 UTC
Sorry, I didn't mean for you to revert the commit, but to try the kernel where this commit is the head.
Comment 11 Jose Marino 2009-04-10 19:18:09 UTC
The kernel at that commit does not have the problem. 
I've been testing with both wired and wireless for quite a while and everything works fine. Not a single drop.
Comment 12 Rafael J. Wysocki 2009-04-10 20:34:45 UTC
OK, now please test the kernel where the head is commit 8efb8c76fcdccf5050c0ea059dac392789baaff2 .
Comment 13 Rafael J. Wysocki 2009-04-10 20:35:58 UTC
Sorry, comment #12 is wrong, please discard it.
Comment 14 Rafael J. Wysocki 2009-04-10 20:37:59 UTC
Please test the kernel where the head is commit aa8c6c93747f7b55fa11e1624fec8ca33763a805 .
Comment 15 Jose Marino 2009-04-10 21:26:41 UTC
After trying the first commit you suggested I got curious and started a bisect.
Here's the bisect log at this point:

git bisect start
# good: [0db29af1e767464d71b89410d61a1e5b668d0370] PCI/MSI: bugfix/utilize for msi_capability_init()
git bisect good 0db29af1e767464d71b89410d61a1e5b668d0370
# bad: [8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84] Linux 2.6.29
git bisect bad 8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84
# bad: [1db8508cf483dc1ecf66141f90a7c03659d69512] hugetlbfs: fix build failure with !CONFIG_HUGETLBFS
git bisect bad 1db8508cf483dc1ecf66141f90a7c03659d69512
# good: [8b1e3a2f7f84484a8c208671adac39eb148c7d61] headers_check fix: linux/video_decoder.h
git bisect good 8b1e3a2f7f84484a8c208671adac39eb148c7d61

So right now I'm between:
8b1e3a2f7f84484a8c208671adac39eb148c7d61  (good)  and
1db8508cf483dc1ecf66141f90a7c03659d69512  (bad)

I believe the commit you suggest in comment #14 is before my current good bisect point. Do you want me to test it anyway?
Comment 16 Rafael J. Wysocki 2009-04-10 23:45:13 UTC
No, thanks, please continue the bisection.
Comment 17 Jose Marino 2009-04-11 16:53:16 UTC
After 12 rebuilds and reboots I finally finished the bisection. Here's the result:

31878dd86b7df9a147f5e6cc6e07092b4308782b is first bad commit
commit 31878dd86b7df9a147f5e6cc6e07092b4308782b
Author: Len Brown <len.brown@intel.com>
Date:   Wed Jan 28 18:28:09 2009 -0500

    ACPI: remove BM_RLD access from idle entry path
    
    It is true that BM_RLD needs to be set to enable
    bus master activity to wake an older chipset (eg PIIX4) from C3.
    
    This is contrary to the erroneous wording the ACPI 2.0, 3.0
    specifications that suggests that BM_RLD is an indicator
    rather than a control bit.
    
    ACPI 1.0's correct wording should be restored in ACPI 4.0:
    http://www.acpica.org/bugzilla/show_bug.cgi?id=689
    
    But the kernel should not have to clear BM_RLD
    when entering a non C3-type state just to set
    it again when entering a C3-type C-state.
    
    We should be able to set BM_RLD at boot time
    and leave it alone -- removing the overhead of
    accessing this IO register from the idle entry path.
    
    Signed-off-by: Len Brown <len.brown@intel.com>

:040000 040000 999629a2940ce92d1b2022fe510eecd2f57599ee 892206507142af444079b2329eeecf1d86f54d19 M	drivers
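
As a rough sanity check on the effort involved: a plain binary search over n candidate commits needs about ceil(log2(n)) test builds. The sketch below only illustrates that arithmetic. The real commit count between the good and bad points is not stated in this report, and git bisect on a branching history can deviate from the idealized figure.

```python
def bisect_steps(n_commits):
    """Idealized number of test builds to isolate one bad commit among n_commits."""
    if n_commits <= 1:
        return 0
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, without float rounding.
    return (n_commits - 1).bit_length()


def max_commits_for(steps):
    """Largest candidate range an idealized bisection can resolve in `steps` builds."""
    return 2 ** steps


print(bisect_steps(4096))    # 12
print(bisect_steps(2048))    # 11
print(max_commits_for(12))   # 4096
```

Under this idealized model, 12 builds would correspond to a range of a few thousand commits.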
Comment 18 Rafael J. Wysocki 2009-04-11 18:17:41 UTC
Thanks for bisecting it!

First-Bad-Commit : 31878dd86b7df9a147f5e6cc6e07092b4308782b
Comment 19 Jose Marino 2009-04-11 18:53:56 UTC
To trigger this bug, all I had to do was create some network traffic, either by downloading something or by ssh-ing to a remote machine and running "find .".

While bisecting, I noticed that I could not trigger the bug while the kernel was compiling (CPU at 100%) or when I ssh'd to a machine on the local network. With powertop I checked the C-state residency of the CPU:

**** network drops ****
-- USB mouse unplugged, traffic with distant machine
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (10.2%)         2.00 Ghz     9.0%
polling           0.0ms ( 0.0%)         1.80 Ghz     1.1%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.1%
C2                4.2ms (15.6%)          800 Mhz     1.0%
C3                0.3ms ( 0.1%)          600 Mhz    88.7%
C4                3.8ms (74.1%)


**** NO network drops ****
-- USB mouse unplugged, traffic with machine in local network
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (99.9%)         2.00 Ghz    42.4%
polling           0.0ms ( 0.0%)         1.80 Ghz     0.0%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.0%
C2                0.9ms ( 0.1%)          800 Mhz     2.5%
C3                0.0ms ( 0.0%)          600 Mhz    55.0%
C4                0.0ms ( 0.0%)

-- Mouse plugged in
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 2.8%)         2.00 Ghz     6.2%
polling           0.0ms ( 0.0%)         1.80 Ghz     0.0%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.0%
C2                5.1ms (97.2%)         1400 Mhz     0.0%
C3                0.0ms ( 0.0%)          600 Mhz    93.8%
C4                0.0ms ( 0.0%)


It seems that the network drops only happen when the CPU is allowed to enter C-states deeper than C2 (C3/C4). Somehow plugging in the USB mouse prevents this.
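
The comparison across the three powertop tables above can be made explicit with a small sketch that sums the residency of every C-state deeper than C2. The parsing is an assumption based only on the column layout pasted in this comment, so other powertop versions may format rows differently. The function and variable names are illustrative.

```python
import re

# A C-state row: "C<n>", then the first parenthesized percentage (residency share).
ROW = re.compile(r"^C(\d+)\b.*?\(\s*([\d.]+)%\)")


def deep_residency(table, deeper_than=2):
    """Sum residency percentages of C-states numbered above `deeper_than`."""
    total = 0.0
    for line in table.splitlines():
        m = ROW.match(line.strip())
        if m and int(m.group(1)) > deeper_than:
            total += float(m.group(2))
    return round(total, 1)


# Tables copied verbatim from the powertop output above.
DROPS = """\
C0 (cpu running)        (10.2%)         2.00 Ghz     9.0%
polling           0.0ms ( 0.0%)         1.80 Ghz     1.1%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.1%
C2                4.2ms (15.6%)          800 Mhz     1.0%
C3                0.3ms ( 0.1%)          600 Mhz    88.7%
C4                3.8ms (74.1%)
"""

LOCAL_TRAFFIC = """\
C0 (cpu running)        (99.9%)         2.00 Ghz    42.4%
polling           0.0ms ( 0.0%)         1.80 Ghz     0.0%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.0%
C2                0.9ms ( 0.1%)          800 Mhz     2.5%
C3                0.0ms ( 0.0%)          600 Mhz    55.0%
C4                0.0ms ( 0.0%)
"""

MOUSE_PLUGGED = """\
C0 (cpu running)        ( 2.8%)         2.00 Ghz     6.2%
polling           0.0ms ( 0.0%)         1.80 Ghz     0.0%
C1 halt           0.0ms ( 0.0%)         1.60 Ghz     0.0%
C2                5.1ms (97.2%)         1400 Mhz     0.0%
C3                0.0ms ( 0.0%)          600 Mhz    93.8%
C4                0.0ms ( 0.0%)
"""

for name, table in [("drops", DROPS), ("local traffic", LOCAL_TRAFFIC),
                    ("mouse plugged in", MOUSE_PLUGGED)]:
    print(f"{name}: {deep_residency(table)}% of time in C3 or deeper")
```

On these samples the dropping case spends 74.2% of the time in C3/C4, while both non-dropping cases spend 0.0% there.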
Comment 20 Rafael J. Wysocki 2009-04-26 11:48:39 UTC
Notify-Also : ACPI Bugzilla <acpi-bugzilla@lists.sourceforge.net>
Comment 21 Len Brown 2009-04-28 01:40:30 UTC
please attach the output from lspci -vv
Comment 22 Jose Marino 2009-04-28 22:18:27 UTC
Created attachment 21164 [details]
Output from lspci -vv
Comment 23 Len Brown 2009-05-08 02:27:36 UTC
Created attachment 21271 [details]
patch vs 2.6.30-rc4 to restore BM_RLD on resume

please apply this patch, suspend, resume,
and show the output of "dmesg | grep BM_RLD"
and report if the patch fixes the regression.
Comment 24 Len Brown 2009-05-08 02:33:33 UTC
Created attachment 21272 [details]
patch vs 2.6.30-rc4 to restore BM_RLD on resume

fixed printk typo in previous version, please test this one instead.
Comment 25 Jose Marino 2009-05-08 13:43:27 UTC
This patch fixes the regression here. Yay.

The output of "dmesg | grep BM_RLD" after one suspend/resume:

ACPI: saved: BM_RLD 1
ACPI: resumed: BM_RLD 0
ACPI: restored BM_RLD 1

Thanks :)
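
The three log lines above suggest a save/restore sequence around suspend. The toy model below reproduces that sequence purely as an illustration; it is not the attached patch and not kernel code. The class and function names are hypothetical, and the register being cleared across S3 is taken only from the "resumed: BM_RLD 0" line on this machine.

```python
class ToyChipset:
    """Hypothetical stand-in for the chipset bit; not a kernel structure."""

    def __init__(self):
        self.bm_rld = 0  # assumed default before the kernel touches it


def boot(chip):
    # Per the bisected commit message, BM_RLD is set once at boot and left alone.
    chip.bm_rld = 1


def suspend_resume(chip, restore):
    """Model one S3 cycle and return the log lines it would print."""
    log = []
    saved = chip.bm_rld
    log.append(f"ACPI: saved: BM_RLD {saved}")
    chip.bm_rld = 0  # assumption: the bit comes back cleared after S3
    log.append(f"ACPI: resumed: BM_RLD {chip.bm_rld}")
    if restore:
        chip.bm_rld = saved
        log.append(f"ACPI: restored BM_RLD {chip.bm_rld}")
    return log


chip = ToyChipset()
boot(chip)
for line in suspend_resume(chip, restore=True):
    print(line)
```

Calling suspend_resume(chip, restore=False) leaves bm_rld at 0. If the bisected commit message is right that BM_RLD must be set for bus master activity to wake the chipset from C3, that would be consistent with the drops appearing only after a resume and only when the CPU reaches C3 or deeper.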
Comment 26 Len Brown 2009-05-16 03:09:06 UTC
*** Bug 13277 has been marked as a duplicate of this bug. ***
