Bug 14538 - Unable to associate with AP after resume since 2.6.32-rc6
Status: CLOSED CODE_FIX
Product: Networking
Classification: Unclassified
Component: Wireless
Hardware: All Linux
Importance: P1 normal
Assigned To: Larry Finger
Depends on:
Blocks: 7216 14230
Reported: 2009-11-03 22:07 UTC by Christian Casteyde
Modified: 2009-12-29 21:04 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.32-rc6
Tree: Mainline
Regression: Yes


Attachments
git bisect log (2.34 KB, text/plain), 2009-11-04 22:28 UTC, Christian Casteyde
Possible patch (571 bytes, patch), 2009-11-21 04:26 UTC, Larry Finger
Patch to log the ssb core scan results (794 bytes, patch), 2009-11-21 14:49 UTC, Larry Finger
Patch to log information at resume time (616 bytes, patch), 2009-11-21 19:14 UTC, Larry Finger
dmesg output after boot, with 3 patches applied (25.14 KB, text/plain), 2009-11-22 16:11 UTC, Christian Casteyde
dmesg output after resume, with 3 patches applied (31.23 KB, text/plain), 2009-11-22 16:12 UTC, Christian Casteyde
Test patch (942 bytes, patch), 2009-11-23 04:08 UTC, Larry Finger

Description Christian Casteyde 2009-11-03 22:07:36 UTC
Kernel version: 2.6.32-rc6
Last working kernel: 2.6.32-rc5
Athlon 64 3000 single core
64 bits version, Bluewhite64

After a suspend to ram/resume, the wireless interface is dead and cannot reassociate anymore. Trying to restart wpa/dhcpd gives the following errors:

root@athor:/home/christian# /etc/rc.d/rc.inet1 stop
root@athor:/home/christian# /etc/rc.d/rc.inet1 start
SIOCSIFFLAGS: Erreur inconnue 132
/etc/rc.d/rc.inet1:  eth1 information: 'Any ESSID'
Error for wireless request "Set Nickname" (8B1C) :
    SET failed on device eth1 ; Operation not supported.
SIOCSIFFLAGS: Unknown error 132
Could not set interface 'eth1' UP
SIOCSIFFLAGS: Erreur inconnue 132
Polling for DHCP server on interface eth1:
err, eth1: ioctl SIOCSIFFLAGS: Unknown error 132

The "Set Nickname" error always occurs and is not the problem. However, the "Unknown error 132" is completely new.

Moreover, iwconfig says the eth1 interface is not associated anymore.
Comment 1 Rafael J. Wysocki 2009-11-03 22:39:25 UTC
What adapter?  Is it PCI, PCMCIA or USB?

Also, it should be relatively easy to find the commit that introduced the issue by bisection.
Comment 2 Christian Casteyde 2009-11-03 23:00:01 UTC
oops, sorry.
It's a PCI b43 adapter of an Aspire 1511 Lmi laptop.
But I'm confident it's not the b43 driver, since I tested the commit made to this driver, and that commit worked on a 2.6.32-rc kernel source base.
Comment 3 Christian Casteyde 2009-11-03 23:00:40 UTC
re-oops, the commit worked on -rc5
Comment 4 Rafael J. Wysocki 2009-11-04 00:08:37 UTC
OK, thanks for the info.
Comment 5 John W. Linville 2009-11-04 00:18:29 UTC
#define ERFKILL         132     /* Operation not possible due to RF-kill */

Did you happen to change your rfkill switch during suspend/resume cycle?
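For readers decoding the SIOCSIFFLAGS failure from the description: the define quoted above says errno 132 is ERFKILL. A minimal sketch of that mapping follows, assuming a Linux system where ERFKILL has this value. The helper name describe_errno is hypothetical, and the fallback define only mirrors the value quoted in this comment for C libraries whose headers do not know ERFKILL.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Fallback for headers without ERFKILL; 132 is the value quoted above. */
#ifndef ERFKILL
#define ERFKILL 132
#endif

/*
 * Hypothetical helper: turn an errno value into a readable message.
 * ERFKILL is handled explicitly because an older C library may only
 * print "Unknown error 132" for it.
 */
const char *describe_errno(int err)
{
    assert(err >= 0);
    if (err == ERFKILL)
        return "Operation not possible due to RF-kill";
    return strerror(err);
}
```

With this helper, the value reported by the failing ioctl reads as an RF-kill condition rather than an unknown error, which is presumably why the tools in the description could only print the bare number.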
Comment 6 Christian Casteyde 2009-11-04 06:56:21 UTC
No, just doing:

closing the lid
-> acpi event that starts a script that does:

rc.inet1 eth1_stop
echo mem > /sys/power/state

opening the lid
-> acpi event that resumes the script here:

rc.inet1 eth1_start   ** this fails**

then try to ping (fails), restart the script, etc. fails.

rc.inet1 script does basically wpa/iwconfig + dhcp on slack
Comment 7 John W. Linville 2009-11-04 13:48:18 UTC
What sort of laptop is it?

Are you able to do the git bisect between 2.6.32-rc5 and 2.6.32-rc6?
Comment 8 Christian Casteyde 2009-11-04 18:21:41 UTC
The laptop is an Acer Aspire 1511Lmi, quite old but very good to get suspend/resume and wireless problems :-)

I'm bisecting, still 7 reboots to do.
Comment 9 Christian Casteyde 2009-11-04 22:25:59 UTC
In the end, I didn't manage to complete the bisection, because towards the end suspend to RAM fails systematically. However, I narrowed down the candidate commits; see the attached git bisect log.
Comment 10 Christian Casteyde 2009-11-04 22:28:11 UTC
Created attachment 23656
git bisect log

I finally gave up after too many test failures (skip = the computer freezes when suspending).

Another interesting point is that each time it fails, I don't get the warning reported in
http://bugzilla.kernel.org/show_bug.cgi?id=13987

so this may well be an RF kill problem.
Comment 11 Christian Casteyde 2009-11-05 19:45:18 UTC
Well, since it was apparently related to rfkill, I reverted the most probable patch on top of vanilla 2.6.32-rc6, that is:

--- linux-2.6.32-rc5/drivers/net/wireless/b43/rfkill.c  2009-11-03 19:48:11.805090636 +0000
+++ linux-2.6.32-rc6/drivers/net/wireless/b43/rfkill.c  2009-11-03 19:48:17.955464847 +0000
@@ -33,7 +33,8 @@
                       & B43_MMIO_RADIO_HWENABLED_HI_MASK))
                         return 1;
         } else {
-                if (b43_read16(dev, B43_MMIO_RADIO_HWENABLED_LO)
+                if (b43_status(dev) >= B43_STAT_STARTED &&
+                    b43_read16(dev, B43_MMIO_RADIO_HWENABLED_LO)
                     & B43_MMIO_RADIO_HWENABLED_LO_MASK)
                         return 1;
         }

and indeed it works. So this is the commit that broke RF kill on my laptop.
That is, if I use:

                if (/* b43_status(dev) >= B43_STAT_STARTED &&*/

the problem does not appear anymore.

I've also seen this patch:

--- linux-2.6.32-rc5/drivers/net/wireless/b43/main.c    2009-11-03 19:48:11.801464075 +0000
+++ linux-2.6.32-rc6/drivers/net/wireless/b43/main.c    2009-11-03 19:48:17.952088831 +0000
@@ -4501,7 +4501,6 @@


         cancel_work_sync(&(wl->beacon_update_trigger));
 
-        wiphy_rfkill_stop_polling(hw->wiphy);
         mutex_lock(&wl->mutex);
         if (b43_status(dev) >= B43_STAT_STARTED) {
                 dev = b43_wireless_core_stop(dev);

that could be involved, but I didn't test the 4 combinations of these patches.

Please note that the b43 commit fixes http://bugzilla.kernel.org/show_bug.cgi?id=14277

which I tested successfully, but the proposed patch only contained the bounce buffer fix, not the RF kill changes. Maybe a partial revert should be done (I don't know if the RF kill bug slipped in with the bounce buffer fix; I didn't check the commit numbers).

Forget what I said about http://bugzilla.kernel.org/show_bug.cgi?id=13987 in #10, since apparently the NMI occurs only at association or nearby, but as RFkill prevents association, I cannot see this other bug while this one is there.
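The effect of the rfkill.c hunk quoted earlier in this comment can be illustrated with a small user-space model. This is only a sketch, not driver code: the struct, the fake register, the mask value and every model_ or MODEL_ name are hypothetical. The one number taken from this report is 2 as the "started" status (see comment #19). The model assumes the status is below "started" when the check runs, which a later comment in this report observes after resume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Status values; 2 means "started" per this report, names are illustrative. */
enum model_status {
    MODEL_STAT_UNINIT = 0,
    MODEL_STAT_INITIALIZED = 1,
    MODEL_STAT_STARTED = 2,
};

/* Arbitrary bit standing in for the "radio hardware enabled" flag. */
#define MODEL_HWENABLED_LO_MASK 0x0010u

/* Stand-in for the device: a status plus one fake register. */
struct model_dev {
    enum model_status status;
    uint16_t hwenabled_lo;   /* fake register contents */
    unsigned int reg_reads;  /* counts accesses to the fake register */
};

static uint16_t model_read16(struct model_dev *dev)
{
    dev->reg_reads++;
    return dev->hwenabled_lo;
}

/* Old check (-rc5 side of the diff): always read the register. */
int radio_enabled_rc5(struct model_dev *dev)
{
    assert(dev != NULL);
    return (model_read16(dev) & MODEL_HWENABLED_LO_MASK) ? 1 : 0;
}

/* New check (-rc6 side of the diff): read the register only once started. */
int radio_enabled_rc6(struct model_dev *dev)
{
    assert(dev != NULL);
    if (dev->status >= MODEL_STAT_STARTED &&
        (model_read16(dev) & MODEL_HWENABLED_LO_MASK))
        return 1;
    return 0;
}
```

In this model, a device whose status is 0 but whose radio bit is set is reported as enabled by the old check and as disabled by the new one, without the register ever being read. A caller that trusts the new result would treat the radio as RF-killed, which is consistent with the error 132 seen when bringing the interface up.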
Comment 12 Christian Casteyde 2009-11-14 09:54:30 UTC
Still present in 2.6.32-rc7, and commenting out "b43_status(dev) >= B43_STAT_STARTED" still solves the problem.
Comment 13 Christian Casteyde 2009-11-20 21:52:39 UTC
I've found the commit that triggers the problem (no problem before, and reverting it solves the problem).
It's:

d50bae33d1358b909ade05ae121d83d3a60ab63f

Beware that reverting it would reopen bug #14181, as indicated in the comment. So both fixes are indeed broken.
Comment 14 Rafael J. Wysocki 2009-11-20 23:42:16 UTC
Caused by:

commit d50bae33d1358b909ade05ae121d83d3a60ab63f
Author: Larry Finger <Larry.Finger@lwfinger.net>
Date:   Fri Oct 16 10:18:09 2009 -0500

    b43: Fix Bugzilla #14181 and the bug from the previous 'fix'

    Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

(Christian, please also add the subject of the commit and ideally the author to the report in future, thanks).

First-Bad-Commit : d50bae33d1358b909ade05ae121d83d3a60ab63f
Comment 15 Larry Finger 2009-11-21 03:26:41 UTC
What are the details of the Broadcom device? Please show the results from

dmesg | egrep "b43|ssb"
Comment 16 Larry Finger 2009-11-21 04:26:06 UTC
Created attachment 23851
Possible patch

Please test this patch. It appears that there was/is a bug in our specs.
Comment 17 Christian Casteyde 2009-11-21 09:35:24 UTC
The proposed patch on -rc8 fails.

The output of dmesg is appended below:

christian@athor:~$ dmesg | egrep "b43|ssb"
b43-pci-bridge 0000:02:08.0: PCI INT A -> Link[LNK4] -> GSI 19 (level, low) -> IRQ 19
ssb: Sonics Silicon Backplane found on PCI device 0000:02:08.0
b43-phy0: Broadcom 4306 WLAN found (core revision 5)
b43 ssb0:0: firmware: requesting b43/ucode5.fw
b43 ssb0:0: firmware: requesting b43/pcm5.fw
b43 ssb0:0: firmware: requesting b43/b0g0initvals5.fw
b43 ssb0:0: firmware: requesting b43/b0g0bsinitvals5.fw
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
Comment 18 Larry Finger 2009-11-21 14:49:43 UTC
Created attachment 23856
Patch to log the ssb core scan results

Please add this patch and then resubmit the results of

dmesg | egrep "b43|ssb"
Comment 19 Larry Finger 2009-11-21 19:14:52 UTC
Created attachment 23859
Patch to log information at resume time

There is something that I don't understand. You say that if you eliminate the b43_status(dev) >= B43_STAT_STARTED test, then it works. On my system, however, when this routine is entered, the value of b43_status(dev) is 2, which is the value for B43_STAT_STARTED.

Please add this patch, which will print the value of b43_status(dev), and send the dmesg output from the "ACPI: Waking up from system sleep state S4" point.
Comment 20 Christian Casteyde 2009-11-22 16:10:58 UTC
OK, I've patched the kernel with the 3 patches from comment #16, #18 and #19, and rebooted.

The first attached dmesg output is just after boot.
The second one is after resume. In this case the status is not 2 anymore, but 0.
Comment 21 Christian Casteyde 2009-11-22 16:11:40 UTC
Created attachment 23868
dmesg output after boot, with 3 patches applied
Comment 22 Christian Casteyde 2009-11-22 16:12:17 UTC
Created attachment 23869
dmesg output after resume, with 3 patches applied
Comment 23 Larry Finger 2009-11-22 16:29:50 UTC
Thanks for testing. Based on your findings, the status of 0 makes sense, I just don't know why. That is the real bug here.

FYI, the code change to test for status >= 2 is needed as some architectures will fault and crash the system if one attempts to read a register when the interface is in the state indicated by status of 0 or 1.

I'll let you know when I have another patch for testing.
Comment 24 Larry Finger 2009-11-23 04:08:19 UTC
Created attachment 23876
Test patch

Please try this patch. It is a bandaid rather than a fix, and there will likely be resistance to including it; however, I want to know if your system works after including it. Note: This is the only patch that should be included. The others have been deleted.
Comment 25 Christian Casteyde 2009-11-23 19:18:07 UTC
This patch works better than expected. When applied on 2.6.32-rc8, not only can I connect to the network at resume, but it also seems to fix the NMI regression I reported in http://bugzilla.kernel.org/show_bug.cgi?id=13987

I do not know why it works at all (whereas reverting the commit mentioned in #14 does not suffice to suppress the NMI), but it works: I did several suspend/resume cycles in a row and never managed to get the NMI anymore.

So for me this fixes both bugs, at least it lets me use my network after resume, and seems to prevent code execution that would trigger the NMI also.
Comment 26 Christian Casteyde 2009-12-03 21:01:40 UTC
Update: Still present in 2.6.32, and the proposed patch still fixes it (and http://bugzilla.kernel.org/show_bug.cgi?id=13987).
Comment 27 John W. Linville 2009-12-03 22:27:19 UTC
Yes, sorry...it (or rather its successor) didn't get pulled in time for 2.6.32...
Comment 28 Christian Casteyde 2009-12-19 10:51:09 UTC
Update: Still present in 2.6.32.2, and the patch still fixes it.
Btw, 2.6.32.2 has the patch for b43 legacy, but not for current b43 :-)
Seems to be integrated in 2.6.33-rc1, I will test it soon.
Comment 29 Larry Finger 2009-12-19 17:32:02 UTC
I missed the Cc for stable in the b43 patch, but included it in b43legacy. :)

A note has been sent to GregKH and stable. The patch should be in 2.6.32.3 or .4.
Comment 30 Rafael J. Wysocki 2009-12-29 16:24:53 UTC
Is it the patch from comment #25 or another one?
Comment 31 Larry Finger 2009-12-29 16:37:56 UTC
On 12/29/2009 10:24 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14538
>
> --- Comment #30 from Rafael J. Wysocki <rjw@sisk.pl>  2009-12-29 16:24:53 ---
> Is it the patch from comment #25 or another one?

It is actually the patch in mainline commit
c2ff581acab16c6af56d9e8c1a579bf041ec00b1. The code does the same things, but was
rearranged to make it clearer and some comments have been added.

Larry
Comment 32 Rafael J. Wysocki 2009-12-29 21:04:59 UTC
OK, so I'm closing this as fixed in the mainline and please make sure it appears in -stable.
