Latest working kernel: Unknown Failing kernels: 2.6.28.6, 2.6.30.8, wireless-testing 2.6.32-rc2-wl-39303-g8e025fe Distribution: Gentoo Wireless Hardware: AR5005G 802.11abg NIC (rev 01) Wireless Software: net-wireless/wireless-tools-29 net-wireless/wpa_supplicant-0.5.7 The ath5k driver occasionally causes hard system lockup during association with WEP/WPA2 networks. The keyboard becomes unresponsive (caps lock/num lock do nothing), SysRq command do not work and acpid does not recognize the power button being pressed. Attaching a serial console is of no help even when using EARLY_PRINTK or nmi_watchdog. There are a number of ways to reproduce this, but it does typically take some number of hours for the lockup to occur. 1.) Cycle wpa_cli {disconnect,reassociate} 2.) Cycle ifconfig {up,down} 3.) Enter a nonexistent ssid, let wpa_supplicant run. 4.) Enter an invalid key, let wpa_supplicant run. I should also mention that I cannot reproduce this on older hardware. The current system is an Intel Q9400 (2.66GHz quad core) with 8 GB of RAM, whereas the older system used two AMD Opteron 265 HE (1.8GHz) with 4 GB of RAM. I've been testing on the older system for over a week without a single issue. Across seven of the newer machines, dedicated to reproducing this condition with various kernel configurations, I typically get at least three hanging when left to run overnight. I'll be happy to help in testing anything. Thanks in advance.
Not much to go on...have you tried using netconsole?
I was under the impression that serial was more reliable, but I'll go ahead and start monitoring via netconsole as well.
Netconsole just before a hang: [ 4892.316943] wlan0: deauthenticating from 00:16:01:d1:ad:cb by local choice (reason=3) [ 4892.473738] wlan0: direct probe to AP 00:16:01:d1:ad:cb (try 1) [ 4892.516001] wlan0: direct probe responded [ 4892.528011] wlan0: authenticate with AP 00:16:01:d1:ad:cb (try 1) [ 4892.547673] wlan0: authenticated [ 4892.557393] wlan0: associate with AP 00:16:01:d1:ad:cb (try 1) [ 4892.576702] wlan0: RX AssocResp from 00:16:01:d1:ad:cb (capab=0x431 status=0 aid=5) [ 4892.599651] wlan0: associated Serial was nearly the same as above, however there was another linebreak and a '['. Looks as though the kernel was trying to report something. When functioning normally, the next line would have been the deauthenticating one. Also, as I was actively watching monitors this time, I also noticed a number of the following warnings shoot by. I have no clue if this is related. These have never occurred just before a system hang (I typically page back a little to see if anything interesting was reported). [ 236.860709] wlan0: deauthenticating from 00:16:01:d1:ad:cb by local choice (reason=3) [ 236.953397] ------------[ cut here ]------------ [ 236.967246] WARNING: at net/wireless/core.c:613 wdev_cleanup_work+0x50/0xbf() [ 236.988647] Hardware name: C2SBX [ 237.000380] Modules linked in: netconsole cls_u32 sch_sfq sch_htb iTCO_wdt ath5k ath b3dfg(C) iTCO_vendor_support led_class ftdi_sio i2c_i801 e1000e [ 237.000393] Pid: 17, comm: events/2 Tainted: G C 2.6.32-rc2-wl-39303-g8e025fe #1 [ 237.000396] Call Trace: [ 237.000401] [<ffffffff8103bd4e>] ? need_resched+0x1e/0x28 [ 237.000404] [<ffffffff813b6d86>] ? wdev_cleanup_work+0x50/0xbf [ 237.000408] [<ffffffff81049e37>] warn_slowpath_common+0x77/0x8f [ 237.000411] [<ffffffff81049e5e>] warn_slowpath_null+0xf/0x11 [ 237.000414] [<ffffffff813b6d86>] wdev_cleanup_work+0x50/0xbf [ 237.000418] [<ffffffff81062a47>] worker_thread+0x10d/0x1ac [ 237.000421] [<ffffffff81066bf6>] ? autoremove_wake_function+0x0/0x38 [ 237.000424] [<ffffffff8106293a>] ? worker_thread+0x0/0x1ac [ 237.000427] [<ffffffff81066893>] kthread+0x69/0x71 [ 237.000430] [<ffffffff8100caea>] child_rip+0xa/0x20 [ 237.000433] [<ffffffff8106682a>] ? kthread+0x0/0x71 [ 237.000435] [<ffffffff8100cae0>] ? child_rip+0x0/0x20 [ 237.000437] ---[ end trace 487b4b39a10f8e8f ]---
Do you have lockdep turned on? If not, can you? (In reply to comment #3) > [ 236.967246] WARNING: at net/wireless/core.c:613 > wdev_cleanup_work+0x50/0xbf() I think this is: if (WARN_ON(rdev->scan_req && rdev->scan_req->dev == wdev->netdev)) { But I don't have any idea what that warning indicates.
(In reply to comment #4) > Do you have lockdep turned on? If not, can you? I've enabled it. > > (In reply to comment #3) > > > [ 236.967246] WARNING: at net/wireless/core.c:613 > > wdev_cleanup_work+0x50/0xbf() > > I think this is: > > if (WARN_ON(rdev->scan_req && rdev->scan_req->dev == wdev->netdev)) { That is the correct line. > > But I don't have any idea what that warning indicates.
So the last thing I heard on this via irc was that with lockdep enabled, the crash doesn't happen (I would guess due to timing changes), is that still the case? And there are no lockdep reports in dmesg?
(In reply to comment #6) > So the last thing I heard on this via irc was that with lockdep enabled, the > crash doesn't happen (I would guess due to timing changes), is that still the > case? And there are no lockdep reports in dmesg? Actually, at first I messed up and left the nvidia module on the system, which disabled lockdep. In this configuration, the machines ran for nearly a week before one locked up (~170k on/off cycles). At that point, I realized my mistake and removed the nvidia module. All four machines ran over the weekend (10-70k cycles) and locked up. There were no messages from lockdep via the serial port, and grepping kernel.log also gave no new information. If I can trust what was reported via serial, 2 were bringing wireless up and 2 were bringing it down.
Here's another set of warnings. I got these via netconsole and not the serial port just before a lockup. [ 3527.387488] wlan1: authenticated [ 3527.397174] wlan1: associate with AP 00:16:01:d1:ad:cb (try 1) [ 3527.612106] wlan1: associate with AP 00:16:01:d1:ad:cb (try 2) [ 3527.664547] wlan1: RX AssocResp from 00:16:01:d1:ad:cb (capab=0x431 status=0 aid=6) [ 3527.687482] wlan1: associated [ 3527.712121] wlan1: deauthenticating from 00:16:01:d1:ad:cb by local choice (reason=3) [ 3527.735627] ------------[ cut here ]------------ [ 3527.749470] WARNING: at net/wireless/mlme.c:96 cfg80211_send_rx_assoc+0xe2/0x1bd() [ 3527.772156] Hardware name: C2SBX [ 3527.781851] Modules linked in: netconsole cls_u32 sch_sfq sch_htb ath5k ath led_class iTCO_wdt i2c_i801 iTCO_vendor_support e1000e [ 3527.817626] Pid: 3887, comm: phy1 Not tainted 2.6.32-rc2-wl-39303-g8e025fe #2 [ 3527.839011] Call Trace: [ 3527.846342] [<ffffffff813d5831>] ? cfg80211_send_rx_assoc+0xe2/0x1bd [ 3527.865636] [<ffffffff8104a52e>] warn_slowpath_common+0x77/0x8f [ 3527.883649] [<ffffffff8104a555>] warn_slowpath_null+0xf/0x11 [ 3527.900885] [<ffffffff813d5831>] cfg80211_send_rx_assoc+0xe2/0x1bd [ 3527.919664] [<ffffffff81041443>] ? wake_up_process+0x10/0x12 [ 3527.936892] [<ffffffff813e4552>] ieee80211_sta_work+0x933/0x143e [ 3527.955161] [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97 [ 3527.973149] [<ffffffff81075812>] ? register_lock_class+0x20/0x2cc [ 3527.991657] [<ffffffff81075eb9>] ? __lock_acquire+0x216/0x3d7 [ 3528.009129] [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97 [ 3528.027122] [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97 [ 3528.045117] [<ffffffff81075812>] ? register_lock_class+0x20/0x2cc [ 3528.063626] [<ffffffff81075eb9>] ? __lock_acquire+0x216/0x3d7 [ 3528.081102] [<ffffffff813e3c1f>] ? ieee80211_sta_work+0x0/0x143e [ 3528.099353] [<ffffffff81063ab2>] worker_thread+0x171/0x274 [ 3528.116042] [<ffffffff81063a5b>] ? worker_thread+0x11a/0x274 [ 3528.133255] [<ffffffff8103be32>] ? need_resched+0x1e/0x28 [ 3528.149689] [<ffffffff81067eb2>] ? autoremove_wake_function+0x0/0x38 [ 3528.168979] [<ffffffff81063941>] ? worker_thread+0x0/0x274 [ 3528.185675] [<ffffffff81067c9a>] kthread+0x69/0x71 [ 3528.200284] [<ffffffff8100cb6a>] child_rip+0xa/0x20 [ 3528.215159] [<ffffffff81067c31>] ? kthread+0x0/0x71 [ 3528.230031] [<ffffffff8100cb60>] ? child_rip+0x0/0x20 [ 3528.245419] ---[ end trace f8a61aff09ec1d5d ]--- [ 3528.259259] ------------[ cut here ]------------ [ 3528.273094] ath5k phy1: unsupported jumbo [ 3528.285133] WARNING: at net/wireless/sme.c:417 __cfg80211_connect_result+0x26c/0x2c3() [ 3528.308879] Hardware name: C2SBX [ 3528.318558] Modules linked in: netconsole cls_u32 sch_sfq sch_htb ath5k ath led_class iTCO_wdt i2c_i801 iTCO_vendor_support e1000e [ 3528.354327] Pid: 3887, comm: phy1 Tainted: G W 2.6.32-rc2-wl-39303-g8e025fe #2 [ 3528.378045] Call Trace: [ 3528.385379] [<ffffffff813cc7e8>] ? cfg80211_get_bss+0x11d/0x132 [ 3528.403372] [<ffffffff813d7786>] ? __cfg80211_connect_result+0x26c/0x2c3 [ 3528.423721] [<ffffffff8104a52e>] warn_slowpath_common+0x77/0x8f [ 3528.441722] [<ffffffff8104a555>] warn_slowpath_null+0xf/0x11 [ 3528.458939] [<ffffffff813d7786>] __cfg80211_connect_result+0x26c/0x2c3 [ 3528.478773] [<ffffffff8100f387>] ? dump_stack+0x72/0x7b [ 3528.494701] [<ffffffff813d590a>] cfg80211_send_rx_assoc+0x1bb/0x1bd [ 3528.513748] [<ffffffff81041443>] ? wake_up_process+0x10/0x12 [ 3528.530956] [<ffffffff813e4552>] ieee80211_sta_work+0x933/0x143e [ 3528.549210] [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97 [ 3528.567199] [<ffffffff81075812>] ? register_lock_class+0x20/0x2cc [ 3528.585728] [<ffffffff81075eb9>] ? __lock_acquire+0x216/0x3d7 [ 3528.603211] [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97 [ 3528.621222] [<ffffffff81075624>] ? look_up_lock_class+0x2e/0x97 [ 3528.639245] [<ffffffff81075812>] ? register_lock_class+0x20/0x2cc [ 3528.657757] [<ffffffff81075eb9>] ? __lock_acquire+0x216/0x3d7 [ 3528.675234] [<ffffffff813e3c1f>] ? ieee80211_sta_work+0x0/0x143e [ 3528.693486] [<ffffffff81063ab2>] worker_thread+0x171/0x274 [ 3528.710175] [<ffffffff81063a5b>] ? worker_thread+0x11a/0x274 [ 3528.727387] [<ffffffff8103be32>] ? need_resched+0x1e/0x28 [ 3528.743820] [<ffffffff81067eb2>] ? autoremove_wake_function+0x0/0x38 [ 3528.763111] [<ffffffff81063941>] ? worker_thread+0x0/0x274 [ 3528.779802] [<ffffffff81067c9a>] kthread+0x69/0x71 [ 3528.794432] [<ffffffff8100cb6a>] child_rip+0xa/0x20 [ 3528.809314] [<ffffffff81067c31>] ? kthread+0x0/0x71 [ 3528.824189] [<ffffffff8100cb60>] ? child_rip+0x0/0x20 [ 3528.839576] ---[ end trace f8a61aff09ec1d5e ]--- [ 3564.268491] ath5k phy1: unsupported jumbo
Ok I have the hardware and am running some tests now. Any suggested workload?
Johannes says it may have been fixed by commit 7400f42e9d765fa0656b432f3ab1245f9710f190 (cfg80211: fix NULL ptr deref). Can you confirm or deny? For now, I'm going to close this bug as fixed, so please reopen if it's not in fact.
I don't think that's the same bug.
(Note the original report says 2.6.28 and 2.6.30 stable, IIRC the null ptr was added late in 2.6.32-rc). As for reproducing with the hardware, I've had some success with w-t as of a few days ago, but it depends heavily on timing -- I have much less luck with debugging knobs turned on.
I'm still able to reproduce in 2.6.32-rc7-wl-42932-gb648d86 which does contain the mentioned commit.
Created attachment 25347 [details] Work around concurrent reset crash This one fixed one hang of mine, but jsbronder's workload still fails. I'll be spending this weekend trying again.
Created attachment 25592 [details] Another workaround attempt When we hooked this up to a PCI analyzer we saw that the card would emit bus errors in this state, which is consistent with my getting NMI exceptions when the device is confused. Looking over the code, I see a few potential problem spots described in the patch. However, with the exception of the interrupt mask register fix -- which doesn't fix the problem in isolation -- I'm not sure if these are real fixes or just changing delay profile (e.g., spinlocking around read/write seems dubious, but ath9k does that at least). That said, so far the results are working well for me.
Does this issue persist with 2.6.35 (or later) kernels?
I attempted to reproduce with 2.6.35.2. Over seven days, I ran stress testing on four machines, each hitting ~198k attempts to trigger this bug. They are all still alive and running as expected. Thanks.
Many thanks Justin for the testing, hardware, and patience. In 2.6.36, there will be a couple of additional concurrency fixes, but it is good to know that we've finally squashed this one.