Bug 42576

Summary: brcmsmac driver: loosing wireless signal causes tasks to hang
Product: Drivers Reporter: Patrick (ragamuffin)
Component: network-wirelessAssignee: Arend van Spriel (aspriel)
Status: RESOLVED CODE_FIX    
Severity: normal CC: florian, linville, rvossen, stf_xl
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.2.1 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: see kernel oops messages at the bottom
another symptom of the same issue
another dmesg. 1st stack trace NOT relevant here
some lockdep output
some (hopefully) relevant lockdep output
another dmesg, not "polluted" by btrfs bugs
dmesg including sysrq("w") output while several tasks blocked
not much further details included in this dmesg
brcmsmac_timeout_on_flushing_queue.patch
bug fixed by brcmsmac_timeout_on_flushing_queue.patch

Description Patrick 2012-01-14 02:03:30 UTC
Created attachment 72064 [details]
see kernel oops messages at the bottom

Hello

I used this laptop with 2.6.38-11-generic from ubuntu for a long time without experiencing this problem.

Although I am not 100% certain, which driver that's been in use there. (Will check shortly.)

After upgrading to a custom-built vanilla 3.2.1 stable kernel, i experience this "task hangs". I noticed that all java tasks were hanging and was wondering why? Judging from the dmesg kernel log, where there were several oopses showing up, I decided it might be the wireless driver causing those problems.

Actually my wireless signal allways goes off whenever i close the laptop lid. But because i like to close that when working with the external monitor, I plug in an ethernet cable then. So that's what I did this time as well. But: before, I would then just have to chancel one request by network-manager to re-enter the wlan pre-shared key and after that, the wireless connection was disconnected while the wired connection was allready connected and everything ran smooth.

After the upgrade to 3.2.1 however, network-manager is not bringing up that dialog any more and some tasks in the system start hanging and the kernel generates those oopses. The solution was to quickly re-open the laptop lid, which brought the tasks back to live (that means they died and were reaped properly, while before they were in a zombie state after my unsuccessful attempts to kill -KILL them). Then i disconnected the wireless conn and then closed the lid and restarted the tasks.
Comment 1 Patrick 2012-01-14 12:28:20 UTC
Created attachment 72065 [details]
another symptom of the same issue

this time the wireless signal was not even there ever ... firefox hangs right after *booting with lid closed*
Comment 2 Stanislaw Gruszka 2012-01-14 15:03:11 UTC
There is deadlock involving rtnl lock, but it's hard to tell where is exact reason of that. Can you compile kernel with CONFIG_LOCKDEP=y and provide dmesg (it should show detailed information about deadlock reason) ?
Comment 3 Patrick 2012-01-14 20:10:18 UTC
Created attachment 72067 [details]
another dmesg. 1st stack trace NOT relevant here

Well... I have configured CONFIG_LOCKDEP=y now.

The dmesg output is attached, but I think it is mostly equal to the other 2 traces I uploaded.

I.e. I can't identify any output by kernel/lockdep.c

Any other ideas how to tackle this?

I can't really reproduce it that easily. What always "worked" now, was booting / wakup with the lid closed. Then firefox hangs on startup / wakeup. The solution is then allways very simple: open the lid quickly and the tasks immediately wake up. This I find interresting and could give a hint as to where to look for the bug ... if you have got any ide, please tell me - I'd enjoy looking at the code but have no idea which file / function. Do you think it's the driver? The problem only seems to surface if i plug in the ethernet as well but i'm not 100% certain. However, could this mean it's somewhere further up in the network layer architecture?

Btw. The signal is actually also not completely gone when the lid is closed. But reliability and throuput are severly affected... i.e.

ping www.google.com
64 bytes: icmp_req=1 ttl=54 time=46.4 ms
64 bytes: icmp_req=2 ttl=54 time=67.0 ms
64 bytes: icmp_req=3 ttl=54 time=45.3 ms
64 bytes: icmp_req=4 ttl=54 time=35.8 ms
64 bytes: icmp_req=5 ttl=54 time=43.6 ms
64 bytes: icmp_req=6 ttl=54 time=34.2 ms
64 bytes: icmp_req=7 ttl=54 time=35.6 ms
------close lid-------
64 bytes: icmp_req=8 ttl=54 time=1011 ms
64 bytes: icmp_req=9 ttl=54 time=143 ms
64 bytes: icmp_req=10 ttl=54 time=10653 ms
64 bytes: icmp_req=11 ttl=54 time=9655 ms
64 bytes: icmp_req=12 ttl=54 time=8662 ms
64 bytes: icmp_req=13 ttl=54 time=8568 ms
64 bytes: icmp_req=14 ttl=54 time=7581 ms
64 bytes: icmp_req=15 ttl=54 time=6587 ms

So I *need* to plug in the ethernet cable...
Comment 4 Stanislaw Gruszka 2012-01-14 21:07:12 UTC
With CONFIG_LOCKDEP there should be "Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar" message in dmesg. It depend on some other CONFIG options, the best way to turn it on is "make menuconfig" and mark:
  Kernel hacking  --->
    * Lock debugging: detect incorrect freeing of live locks
    * Lock debugging: prove locking correctness
Comment 5 Patrick 2012-01-15 14:45:32 UTC
Created attachment 72072 [details]
some lockdep output

That was a mistake. Obviously "$ echo CONFIG_LOCKDEP=y >> .config; make oldconfig" didn't do the trick before. Now I used menuconfig, as you suggested.

I might be wrong but I thing the information you are looking for might still not be included in this dmesg. I guess the _btrfs_ stuff is a different issue. I have gnome and gdm auto-login enabled and thunderbird and firefox are auto-started. Then I boot with lid closed. Thunderbird eventually starts up, but firefox allways blocks until i quickly open the laptop lid. Just for the record: Although I manually disconnect from the (auto-connected on startup) wireless network with the network-manager "tray icon" menu, firefox is still blocked and i need to open the lid to wake it up.

....

But: As said, it seems there is no "rtnl"-related lockdep output in the log. Do you know why "INFO: lockdep is turned off." and how I could re-enable it?
Comment 6 Patrick 2012-01-15 15:07:36 UTC
Created attachment 72073 [details]
some (hopefully) relevant lockdep output

Found some answers here: http://lxr.linux.no/#linux+v3.2.1/lib/debug_locks.c#L15 : "[...] we report the first one and shut up after that."

Maybe I'll just hack debug_locks.c to keep it enabled because i can't find a way to re-enable "debug_locks".

Anyway, what I did to avoid "debug_locks" beeing disabled early, was to not mount my btrfs partition on boot-up. Now there you go with some (hopefully) relevant info on the "rtnl_lock" issue you are looking for.
Comment 7 Patrick 2012-01-15 16:48:44 UTC
Created attachment 72074 [details]
another dmesg, not "polluted" by btrfs bugs

The kernel that produced this dmesg is this one here: https://github.com/P2000/linux/commits/lockdep-debug
Comment 8 Stanislaw Gruszka 2012-01-15 22:18:57 UTC
Hmm, it is not that I expected. I expected rather something similar to btrfs report. Looks like there is no deadlock, but we keep local->mtx for long time in some other task than printed, or somewhere local->mtx is not unlocked after a lock (but that should be detected by lockdep).

First thing to try is patch that revert mac80211 channel switch optimizations, it can be downloaded from:

>
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=mac80211_offchannel_rework_revert.patch;h=859799714cd85a58450ecde4a1dabc5adffd5100;hb=refs/heads/f16

If it does not help, let's try to grab lockdep messages without hacking lockdep. Reverting commit:

commit 660d3f6cde552323578b85fc5a09a6742f1fe804
Author: Josef Bacik <josef@redhat.com>
Date:   Fri Dec 9 11:18:51 2011 -0500

    Btrfs: fix how we do delalloc reservations and how we free reservations on error

should prevent the btrfs deadlock, hence stopping lockdep to disable.

If it still will print the same massages, try to see if dump all blocked task by sysrq:

echo w > /proc/sysrq-trigger 

will show some more task, perhaps the one that hold mac80211 local->mtx
This action may require a few repeats. Hopefully this give us some new info, but if not, let me know too.
Comment 9 Patrick 2012-01-16 01:54:43 UTC
Created attachment 72079 [details]
dmesg including sysrq("w") output while several tasks blocked

Ok, I applied the mac80211_offchannel_rework_revert patch.

Not much of a change - i.e. no change at all.

I haven't reverted the btrfs commit yet, but already produced some sysrq("w") output.

I was not able to echo w > /proc/sysrq because right after booting up with lid closed, when the bug surfaces, even sudo blocks for some reason (see dmesg). So I triggered it using my external keyboard (Alt-SysRq-w).

Hope there is something useful in the output because I am not convinced that reverting the btrfs and my lockdep commits will change much. I will do it nevertheless.

It is really funny how little is needed to immediately wake up the blocked tasks by just opening the lid quickly -- and not even fully -- a few inches will do and as if by macig, the firefox window appears on the external monitor...
Comment 10 Patrick 2012-01-16 02:59:13 UTC
Created attachment 72080 [details]
not much further details included in this dmesg

So now it's this kernel: https://github.com/P2000/linux/commits/b269702

Lockdep hacks removed. Offending btrfs commit removed.

Another interresting detail: the bug is somehow triggered by WPA / WPA2 association. It will *not surface* (i.e. no blocked tasks) if I do *not supply the wpa key* during gnome session startup.

I.e. I do not supply my gnome-keyring password when the dialog pops up at gnome session start time. I had actually diabled the password from the keyring, as I use full hard-disk encryption with dm-crypt anyway and don't want to enter 10'000 passwords each time I turn onn my PC. But it annoyingly came back somehow...
Comment 11 Stanislaw Gruszka 2012-01-16 07:02:43 UTC
We sleep for undefined period in brcms_c_wait_for_tx_completion()

> 302.841411] kworker/u:3     D ffff8801097d4800     0   181      2 0x00000000
> [  302.841418]  ffff88011b651ac0 0000000000000046 ffff88011b651af0
> ffff880120e38000
> [  302.841425]  0000000000000282 ffff88011b651fd8 ffff88011b650000
> ffff88011b651fd8
> [  302.841431]  ffff8800b42d5e80 ffff88011b4f1f80 ffff88011b651aa0
> ffffffff810a795d
> [  302.841438] Call Trace:
> [  302.841451]  [<ffffffff810a795d>] ? trace_hardirqs_on_caller+0x13d/0x1b0
> [  302.841457]  [<ffffffff8163732f>] schedule+0x3f/0x60
> [  302.841461]  [<ffffffff81637896>] schedule_timeout+0x196/0x310
> [  302.841469]  [<ffffffff81077b00>] ? cascade+0xa0/0xa0
> [  302.841485]  [<ffffffffa03da005>] ? brcms_msleep+0x25/0x40 [brcmsmac]
> [  302.841490]  [<ffffffff81637a2e>]
> schedule_timeout_uninterruptible+0x1e/0x20
> [  302.841495]  [<ffffffff81079a50>] msleep+0x20/0x30
> [  302.841505]  [<ffffffffa03da00d>] brcms_msleep+0x2d/0x40 [brcmsmac]
> [  302.841519]  [<ffffffffa03e639e>] brcms_c_wait_for_tx_completion+0x2e/0x80
> [brcmsmac]
> [  302.841528]  [<ffffffffa03d82ab>] brcms_ops_flush+0x3b/0x60 [brcmsmac]
> [  302.841547]  [<ffffffffa0395155>] __ieee80211_recalc_idle+0x215/0x2b0
> [mac80211]
> [  302.841562]  [<ffffffffa0396105>] ieee80211_recalc_idle+0x35/0x60
> [mac80211]
> [  302.841576]  [<ffffffffa0392516>] ieee80211_work_work+0x4c6/0x1540
> [mac80211]

and that's the root of the problem here. This should be handled by brcmsmac maintainers. Please assign to rvossen@broadcom.com (seems I can not do it).
Comment 12 Stanislaw Gruszka 2012-01-16 07:06:09 UTC
Created attachment 72082 [details]
brcmsmac_timeout_on_flushing_queue.patch

You can also check this patch. It will print the warning, but should prevent live deadlock (is this actually live deadlock, or system goes back to normal usage after printing messages ?)
Comment 13 Patrick 2012-01-16 11:25:26 UTC
Ok, I'm compiling this https://github.com/P2000/linux/commits/brcmsmac-tx-fix kernel now. I'm totally convinced that you found the problem, Stanislaw. Thank you very much!



I'm glad this is fixed. I had *many crashes under 2.6.38* that must have been caused by the wlan driver as well. Under 2.6.38, the system would crash whenever I uploaded large files at a high data-rate. It often took me ages to realize this but I could then fix it by throtling the upload bandwith in the software. After the throtling the system would not crash any more. There was no way to fix that bug though, because the machine just crashed completely. (!) So getting around this Issue was actually one of the main motivations to upgrade to 3.2 (besides having hibernate working together with the intel i915 dri video driver).



As for the btrfs issue that surfaced once thanks to enabling lockdep debugging. It hasn't surfaced any more since then. Even though I expected it to surface in one case where I booted a kernel that didn't have 660d3f6cde5 reverted. So it might be hard to reproduce. Therefore I applied 660d3f6cde5 again an am keeping lockdep debugging enabled.



Will post new dmesg shortly...




--------- bugzilla says -------------
You tried to change the Assignee field from drivers_network-wireless@kernel-bugs.osdl.org to rvossen@broadcom.com , but only the assignee of the bug, or a user with the required permissions may change that field.
Comment 14 Patrick 2012-01-16 11:56:07 UTC
Created attachment 72088 [details]
bug fixed by brcmsmac_timeout_on_flushing_queue.patch

:-) ... smooth!

That was it.

The dmesg might be of use to the btrfs devs as well - am posting it here: https://bugzilla.kernel.org/show_bug.cgi?id=42582
Comment 15 John W. Linville 2012-01-16 15:40:48 UTC
Stanislaw, please post the brcmsmac patch to the list ASAP -- thanks!
Comment 16 Roland Vossen 2012-01-17 11:37:27 UTC
Changed 'assigned to' since Arend will follow up.
Comment 17 Stanislaw Gruszka 2012-01-17 11:40:25 UTC
> Stanislaw, please post the brcmsmac patch to the list ASAP -- thanks!
Ok, just posted, but again, this is just workaround.(In reply to comment #15)
Comment 18 Florian Mickler 2012-02-01 21:02:06 UTC
A patch referencing this bug report has been merged in Linux v3.3-rc2:

commit f96b08a7e6f69c0f0a576554df3df5b1b519c479
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Tue Jan 17 12:38:50 2012 +0100

    brcmsmac: fix tx queue flush infinite loop
Comment 19 Florian Mickler 2012-03-12 22:21:56 UTC
A patch referencing this bug report has been merged in Linux v3.3-rc7:

commit 85091fc0a75653e239dc8379658515e577544927
Author: Arend van Spriel <arend@broadcom.com>
Date:   Thu Feb 23 18:38:22 2012 +0100

    brcm80211: smac: fix endless retry of A-MPDU transmissions
Comment 20 Florian Mickler 2012-04-04 14:55:08 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit 5e379203c7788b7af01150bfadbc74d2797a2ef4
Author: Arend van Spriel <arend@broadcom.com>
Date:   Thu Feb 9 21:08:58 2012 +0100

    brcm80211: smac: fix endless retry of A-MPDU transmissions
Comment 21 Patrick 2012-04-13 14:01:19 UTC
I consider this fixed, as my current kernel, 3.2.14, is now working.

In kernel 3.3 however, where the brcmsmac driver moved to bcma, there are new problems however. See https://bugzilla.kernel.org/show_bug.cgi?id=43100
Comment 22 Florian Mickler 2013-02-23 10:28:13 UTC
A patch referencing a commit referencing this bug report has been merged in Linux v3.8-rc7:

commit 7b2385b95363c9ff49801c19cba2489a05bd6672
Author: Arend van Spriel <arend@broadcom.com>
Date:   Sat Feb 2 14:36:50 2013 +0100

    brcmsmac: rework of mac80211 .flush() callback operation