Bug 14342

Summary: Hard system lockup during association with WEP/WPA2 networks using ath5k
Product: Drivers Reporter: jsbronder
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED CODE_FIX    
Severity: normal CC: linville, me, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.28.6 2.6.30.8 2.6.32-rc2-wl-39303-g8e025fe Subsystem:
Regression: No Bisected commit-id:
Attachments: Work around concurrent reset crash
Another workaround attempt

Description jsbronder 2009-10-07 16:19:09 UTC
Latest working kernel:
    Unknown
Failing kernels:
    2.6.28.6, 2.6.30.8, wireless-testing 2.6.32-rc2-wl-39303-g8e025fe
Distribution:
    Gentoo
Wireless Hardware:
    AR5005G 802.11abg NIC (rev 01)
Wireless Software:
    net-wireless/wireless-tools-29
    net-wireless/wpa_supplicant-0.5.7

The ath5k driver occasionally causes hard system lockup during
association with WEP/WPA2 networks.  The keyboard becomes
unresponsive (caps lock/num lock do nothing), SysRq command do
not work and acpid does not recognize the power button being
pressed.  Attaching a serial console is of no help even when
using EARLY_PRINTK or nmi_watchdog.

There are a number of ways to reproduce this, but it does
typically take some number of hours for the lockup to occur.

1.)  Cycle wpa_cli {disconnect,reassociate}
2.)  Cycle ifconfig {up,down}
3.)  Enter a nonexistent ssid, let wpa_supplicant run.
4.)  Enter an invalid key, let wpa_supplicant run.

I should also mention that I cannot reproduce this on older
hardware.  The current system is an Intel Q9400 (2.66GHz quad
core) with 8 GB of RAM, whereas the older system used two AMD
Opteron 265 HE (1.8GHz) with 4 GB of RAM.  I've been testing on
the older system for over a week without a single issue.  Across
seven of the newer machines, dedicated to reproducing this
condition with various kernel configurations, I typically get at
least three hanging when left to run overnight.

I'll be happy to help in testing anything.  Thanks in advance.
Comment 1 John W. Linville 2009-10-07 17:29:29 UTC
Not much to go on...have you tried using netconsole?
Comment 2 jsbronder 2009-10-07 17:45:01 UTC
I was under the impression that serial was more reliable, but
I'll go ahead and start monitoring via netconsole as well.
Comment 3 jsbronder 2009-10-07 20:36:05 UTC
Netconsole just before a hang:
[ 4892.316943] wlan0: deauthenticating from 00:16:01:d1:ad:cb by local choice (reason=3)
[ 4892.473738] wlan0: direct probe to AP 00:16:01:d1:ad:cb (try 1)
[ 4892.516001] wlan0: direct probe responded
[ 4892.528011] wlan0: authenticate with AP 00:16:01:d1:ad:cb (try 1)
[ 4892.547673] wlan0: authenticated
[ 4892.557393] wlan0: associate with AP 00:16:01:d1:ad:cb (try 1)
[ 4892.576702] wlan0: RX AssocResp from 00:16:01:d1:ad:cb (capab=0x431 status=0 aid=5)
[ 4892.599651] wlan0: associated

Serial was nearly the same as above, however there was another
linebreak and a '['.  Looks as though the kernel was trying to
report something.  When functioning normally, the next line would
have been the deauthenticating one.


Also, as I was actively watching monitors this time, I also
noticed a number of the following warnings shoot by.  I have no
clue if this is related.  These have never occurred just before a
system hang (I typically page back a little to see if anything
interesting was reported).

[  236.860709] wlan0: deauthenticating from 00:16:01:d1:ad:cb by local choice (reason=3)
[  236.953397] ------------[ cut here ]------------
[  236.967246] WARNING: at net/wireless/core.c:613 wdev_cleanup_work+0x50/0xbf()
[  236.988647] Hardware name: C2SBX
[  237.000380] Modules linked in: netconsole cls_u32 sch_sfq sch_htb iTCO_wdt ath5k ath b3dfg(C) iTCO_vendor_support led_class ftdi_sio i2c_i801 e1000e
[  237.000393] Pid: 17, comm: events/2 Tainted: G         C 2.6.32-rc2-wl-39303-g8e025fe #1
[  237.000396] Call Trace:
[  237.000401]  [<ffffffff8103bd4e>] ? need_resched+0x1e/0x28
[  237.000404]  [<ffffffff813b6d86>] ? wdev_cleanup_work+0x50/0xbf
[  237.000408]  [<ffffffff81049e37>] warn_slowpath_common+0x77/0x8f
[  237.000411]  [<ffffffff81049e5e>] warn_slowpath_null+0xf/0x11
[  237.000414]  [<ffffffff813b6d86>] wdev_cleanup_work+0x50/0xbf
[  237.000418]  [<ffffffff81062a47>] worker_thread+0x10d/0x1ac
[  237.000421]  [<ffffffff81066bf6>] ? autoremove_wake_function+0x0/0x38
[  237.000424]  [<ffffffff8106293a>] ? worker_thread+0x0/0x1ac
[  237.000427]  [<ffffffff81066893>] kthread+0x69/0x71
[  237.000430]  [<ffffffff8100caea>] child_rip+0xa/0x20
[  237.000433]  [<ffffffff8106682a>] ? kthread+0x0/0x71
[  237.000435]  [<ffffffff8100cae0>] ? child_rip+0x0/0x20
[  237.000437] ---[ end trace 487b4b39a10f8e8f ]---
Comment 4 Bob Copeland 2009-10-13 12:14:03 UTC
Do you have lockdep turned on?  If not, can you?

(In reply to comment #3)

> [  236.967246] WARNING: at net/wireless/core.c:613
> wdev_cleanup_work+0x50/0xbf()

I think this is:

        if (WARN_ON(rdev->scan_req && rdev->scan_req->dev == wdev->netdev)) {

But I don't have any idea what that warning indicates.
Comment 5 jsbronder 2009-10-15 15:32:36 UTC
(In reply to comment #4)
> Do you have lockdep turned on?  If not, can you?

I've enabled it.

> 
> (In reply to comment #3)
> 
> > [  236.967246] WARNING: at net/wireless/core.c:613
> > wdev_cleanup_work+0x50/0xbf()
> 
> I think this is:
> 
>         if (WARN_ON(rdev->scan_req && rdev->scan_req->dev == wdev->netdev)) {

That is the correct line.

> 
> But I don't have any idea what that warning indicates.
Comment 6 Bob Copeland 2009-10-25 19:59:21 UTC
So the last thing I heard on this via irc was that with lockdep enabled, the crash doesn't happen (I would guess due to timing changes), is that still the case?  And there are no lockdep reports in dmesg?
Comment 7 jsbronder 2009-10-26 16:32:10 UTC
(In reply to comment #6)
> So the last thing I heard on this via irc was that with lockdep enabled, the
> crash doesn't happen (I would guess due to timing changes), is that still the
> case?  And there are no lockdep reports in dmesg?

Actually, at first I messed up and left the nvidia module on the system, which disabled lockdep.  In this configuration, the machines ran for nearly a week before one locked up (~170k on/off cycles).  At that point, I realized my mistake and removed the nvidia module.

All four machines ran over the weekend (10-70k cycles) and locked up.  There were no messages from lockdep via the serial port, and grepping kernel.log also gave no new information.  If I can trust what was reported via serial, 2 were bringing wireless up and 2 were bringing it down.
Comment 8 jsbronder 2009-10-28 20:45:56 UTC
Here's another set of warnings.  I got these via netconsole and not the serial port just before a lockup.

[ 3527.387488] wlan1: authenticated
[ 3527.397174] wlan1: associate with AP 00:16:01:d1:ad:cb (try 1)
[ 3527.612106] wlan1: associate with AP 00:16:01:d1:ad:cb (try 2)
[ 3527.664547] wlan1: RX AssocResp from 00:16:01:d1:ad:cb (capab=0x431 status=0 aid=6)
[ 3527.687482] wlan1: associated
[ 3527.712121] wlan1: deauthenticating from 00:16:01:d1:ad:cb by local choice (reason=3)
[ 3527.735627] ------------[ cut here ]------------
[ 3527.749470] WARNING: at net/wireless/mlme.c:96 cfg80211_send_rx_assoc+0xe2/0x1bd()
[ 3527.772156] Hardware name: C2SBX
[ 3527.781851] Modules linked in: netconsole cls_u32 sch_sfq sch_htb ath5k ath led_class iTCO_wdt i2c_i801 iTCO_vendor_support e1000e
[ 3527.817626] Pid: 3887, comm: phy1 Not tainted 2.6.32-rc2-wl-39303-g8e025fe #2
[ 3527.839011] Call Trace:
[ 3527.846342]  [<ffffffff813d5831>] ? cfg80211_send_rx_assoc+0xe2/0x1bd
[ 3527.865636]  [<ffffffff8104a52e>] warn_slowpath_common+0x77/0x8f
[ 3527.883649]  [<ffffffff8104a555>] warn_slowpath_null+0xf/0x11
[ 3527.900885]  [<ffffffff813d5831>] cfg80211_send_rx_assoc+0xe2/0x1bd
[ 3527.919664]  [<ffffffff81041443>] ? wake_up_process+0x10/0x12
[ 3527.936892]  [<ffffffff813e4552>] ieee80211_sta_work+0x933/0x143e
[ 3527.955161]  [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97
[ 3527.973149]  [<ffffffff81075812>] ? register_lock_class+0x20/0x2cc
[ 3527.991657]  [<ffffffff81075eb9>] ? __lock_acquire+0x216/0x3d7
[ 3528.009129]  [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97
[ 3528.027122]  [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97
[ 3528.045117]  [<ffffffff81075812>] ? register_lock_class+0x20/0x2cc
[ 3528.063626]  [<ffffffff81075eb9>] ? __lock_acquire+0x216/0x3d7
[ 3528.081102]  [<ffffffff813e3c1f>] ? ieee80211_sta_work+0x0/0x143e
[ 3528.099353]  [<ffffffff81063ab2>] worker_thread+0x171/0x274
[ 3528.116042]  [<ffffffff81063a5b>] ? worker_thread+0x11a/0x274
[ 3528.133255]  [<ffffffff8103be32>] ? need_resched+0x1e/0x28
[ 3528.149689]  [<ffffffff81067eb2>] ? autoremove_wake_function+0x0/0x38
[ 3528.168979]  [<ffffffff81063941>] ? worker_thread+0x0/0x274
[ 3528.185675]  [<ffffffff81067c9a>] kthread+0x69/0x71
[ 3528.200284]  [<ffffffff8100cb6a>] child_rip+0xa/0x20
[ 3528.215159]  [<ffffffff81067c31>] ? kthread+0x0/0x71
[ 3528.230031]  [<ffffffff8100cb60>] ? child_rip+0x0/0x20
[ 3528.245419] ---[ end trace f8a61aff09ec1d5d ]---
[ 3528.259259] ------------[ cut here ]------------
[ 3528.273094] ath5k phy1: unsupported jumbo
[ 3528.285133] WARNING: at net/wireless/sme.c:417 __cfg80211_connect_result+0x26c/0x2c3()
[ 3528.308879] Hardware name: C2SBX
[ 3528.318558] Modules linked in: netconsole cls_u32 sch_sfq sch_htb ath5k ath led_class iTCO_wdt i2c_i801 iTCO_vendor_support e1000e
[ 3528.354327] Pid: 3887, comm: phy1 Tainted: G        W  2.6.32-rc2-wl-39303-g8e025fe #2
[ 3528.378045] Call Trace:
[ 3528.385379]  [<ffffffff813cc7e8>] ? cfg80211_get_bss+0x11d/0x132
[ 3528.403372]  [<ffffffff813d7786>] ? __cfg80211_connect_result+0x26c/0x2c3
[ 3528.423721]  [<ffffffff8104a52e>] warn_slowpath_common+0x77/0x8f
[ 3528.441722]  [<ffffffff8104a555>] warn_slowpath_null+0xf/0x11
[ 3528.458939]  [<ffffffff813d7786>] __cfg80211_connect_result+0x26c/0x2c3
[ 3528.478773]  [<ffffffff8100f387>] ? dump_stack+0x72/0x7b
[ 3528.494701]  [<ffffffff813d590a>] cfg80211_send_rx_assoc+0x1bb/0x1bd
[ 3528.513748]  [<ffffffff81041443>] ? wake_up_process+0x10/0x12
[ 3528.530956]  [<ffffffff813e4552>] ieee80211_sta_work+0x933/0x143e
[ 3528.549210]  [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97
[ 3528.567199]  [<ffffffff81075812>] ? register_lock_class+0x20/0x2cc
[ 3528.585728]  [<ffffffff81075eb9>] ? __lock_acquire+0x216/0x3d7
[ 3528.603211]  [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97
[ 3528.621222]  [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97
[ 3528.639245]  [<ffffffff81075812>] ? register_lock_class+0x20/0x2cc
[ 3528.657757]  [<ffffffff81075eb9>] ? __lock_acquire+0x216/0x3d7
[ 3528.675234]  [<ffffffff813e3c1f>] ? ieee80211_sta_work+0x0/0x143e
[ 3528.693486]  [<ffffffff81063ab2>] worker_thread+0x171/0x274
[ 3528.710175]  [<ffffffff81063a5b>] ? worker_thread+0x11a/0x274
[ 3528.727387]  [<ffffffff8103be32>] ? need_resched+0x1e/0x28
[ 3528.743820]  [<ffffffff81067eb2>] ? autoremove_wake_function+0x0/0x38
[ 3528.763111]  [<ffffffff81063941>] ? worker_thread+0x0/0x274
[ 3528.779802]  [<ffffffff81067c9a>] kthread+0x69/0x71
[ 3528.794432]  [<ffffffff8100cb6a>] child_rip+0xa/0x20
[ 3528.809314]  [<ffffffff81067c31>] ? kthread+0x0/0x71
[ 3528.824189]  [<ffffffff8100cb60>] ? child_rip+0x0/0x20
[ 3528.839576] ---[ end trace f8a61aff09ec1d5e ]---
[ 3564.268491] ath5k phy1: unsupported jumbo
Comment 9 Bob Copeland 2009-11-07 12:45:24 UTC
Ok I have the hardware and am running some tests now.  Any suggested workload?
Comment 10 Rafael J. Wysocki 2009-11-16 21:06:34 UTC
Johannes says it may have been fixed by commit 7400f42e9d765fa0656b432f3ab1245f9710f190 (cfg80211: fix NULL ptr deref).

Can you confirm or deny?

For now, I'm going to close this bug as fixed, so please reopen if it's not in fact.
Comment 11 Bob Copeland 2009-11-17 00:47:05 UTC
I don't think that's the same bug.
Comment 12 Bob Copeland 2009-11-17 00:49:18 UTC
(Note the original report says 2.6.28 and 2.6.30 stable, IIRC the null ptr was added late in 2.6.32-rc).

As for reproducing with the hardware, I've had some success with w-t as of a few days ago, but it depends heavily on timing -- I have much less luck with debugging knobs turned on.
Comment 13 jsbronder 2009-11-18 18:45:26 UTC
I'm still able to reproduce in 2.6.32-rc7-wl-42932-gb648d86 which does contain the mentioned commit.
Comment 14 Bob Copeland 2010-03-03 23:27:07 UTC
Created attachment 25347 [details]
Work around concurrent reset crash

This one fixed one hang of mine, but jsbronder's workload still fails.  I'll be spending this weekend trying again.
Comment 15 Bob Copeland 2010-03-18 16:23:56 UTC
Created attachment 25592 [details]
Another workaround attempt

When we hooked this up to a PCI analyzer we saw that the card would emit bus errors in this state, which is consistent with my getting NMI exceptions when the device is confused.

Looking over the code, I see a few potential problem spots described in the patch.  However, with the exception of the interrupt mask register fix -- which doesn't fix the problem in isolation -- I'm not sure if these are real fixes or just changing delay profile (e.g., spinlocking around read/write seems dubious, but ath9k does that at least).  That said, so far the results are working well for me.
Comment 16 John W. Linville 2010-08-11 18:01:42 UTC
Does this issue persist with 2.6.35 (or later) kernels?
Comment 17 jsbronder 2010-08-23 15:56:56 UTC
I attempted to reproduce with 2.6.35.2.  Over seven days, I ran stress testing on four machines, each hitting ~198k attempts to trigger this bug.  They are all still alive and running as expected.

Thanks.
Comment 18 Bob Copeland 2010-08-24 12:56:24 UTC
Many thanks Justin for the testing, hardware, and patience.  In 2.6.36, there will be a couple of additional concurrency fixes, but it is good to know that we've finally squashed this one.