Bug 15692 - Oopses in conjunction with wireless, ath9k
Summary: Oopses in conjunction with wireless, ath9k
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 low
Assignee: Luis Chamberlain
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-04-04 18:43 UTC by Sebastian Kemper
Modified: 2010-05-28 17:47 UTC (History)
6 users (show)

See Also:
Kernel Version: 2.6.32.11, 2.6.33.2
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Kernel configuration (44.26 KB, text/plain)
2010-04-04 18:47 UTC, Sebastian Kemper
Details
dmesg from April 3rd (25.85 KB, text/plain)
2010-04-04 18:52 UTC, Sebastian Kemper
Details
dmesg from April 4th (30.06 KB, text/plain)
2010-04-04 18:54 UTC, Sebastian Kemper
Details

Description Sebastian Kemper 2010-04-04 18:43:56 UTC
Hello,

I'm seeing what I think are oopses in my syslog. They appear since I update the kernel from 2.6.32.10 to .11 (that's my best guess anyway, never witnessed any of them with .10).

I don't know if they're worth looking at. For one in the beginning it always shows "WARNING" (so not really an error I guess) and I can't really say that anything doesn't work after the oops occurs. I have the feeling that my wifi connections is lacking the stability I experienced during the rule of the previous minor stable version, but it's really mere subjective.

But, I saw the oopses twice now, two days in a row.

First one:
wlan0: deauthenticated from 00:1a:2b:0d:ac:5e (Reason: 7)
------------[ cut here ]------------
WARNING: at kernel/softirq.c:143 local_bh_enable_ip+0x3c/0x8a()
Hardware name: System Product Name
Modules linked in: sha256_generic serpent cbc dm_crypt
Pid: 49, comm: phy0 Not tainted 2.6.32.11 #1
Call Trace:
 [<ffffffff810350eb>] ? local_bh_enable_ip+0x3c/0x8a
 [<ffffffff810350eb>] ? local_bh_enable_ip+0x3c/0x8a
 [<ffffffff81030b69>] ? warn_slowpath_common+0x77/0xa3
 [<ffffffff810350eb>] ? local_bh_enable_ip+0x3c/0x8a
 [<ffffffff81215d04>] ? ath_tx_node_cleanup+0x13c/0x15c
 [<ffffffff812fd9f7>] ? __sta_info_unlink+0x164/0x19e
 [<ffffffff812fda59>] ? sta_info_unlink+0x28/0x37
 [<ffffffff81302d40>] ? ieee80211_set_disassoc+0x193/0x1e7
 [<ffffffff81302e0b>] ? ieee80211_rx_mgmt_deauth+0x77/0xa8
 [<ffffffff81305184>] ? ieee80211_sta_work+0x1ac/0xe84
 [<ffffffff81318208>] ? thread_return+0x4e/0x10f
 [<ffffffff810391e3>] ? mod_timer+0x114/0x124
 [<ffffffff81304fd8>] ? ieee80211_sta_work+0x0/0xe84
 [<ffffffff8103ef23>] ? worker_thread+0xf5/0x179
 [<ffffffff81041f20>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff8103ee2e>] ? worker_thread+0x0/0x179
 [<ffffffff81041cb1>] ? kthread+0x79/0x81
 [<ffffffff8100bdfa>] ? child_rip+0xa/0x20
 [<ffffffff81041c38>] ? kthread+0x0/0x81
 [<ffffffff8100bdf0>] ? child_rip+0x0/0x20
---[ end trace 8b59f214aa8c10be ]---
wlan0: direct probe to AP 00:1a:2b:0d:ac:5e (try 1)
wlan0: direct probe responded
wlan0: authenticate with AP 00:1a:2b:0d:ac:5e (try 1)
wlan0: authenticated
wlan0: associate with AP 00:1a:2b:0d:ac:5e (try 1)
wlan0: RX AssocResp from 00:1a:2b:0d:ac:5e (capab=0x11 status=0 aid=1)
wlan0: associated

Second:
No probe response from AP 00:1a:2b:0d:ac:5e after 500ms, disconnecting.
------------[ cut here ]------------
WARNING: at kernel/softirq.c:143 local_bh_enable_ip+0x3c/0x8a()
Hardware name: System Product Name
Modules linked in:
Pid: 49, comm: phy0 Not tainted 2.6.32.11 #3
Call Trace:
 [<ffffffff810350eb>] ? local_bh_enable_ip+0x3c/0x8a
 [<ffffffff810350eb>] ? local_bh_enable_ip+0x3c/0x8a
 [<ffffffff81030b69>] ? warn_slowpath_common+0x77/0xa3
 [<ffffffff810350eb>] ? local_bh_enable_ip+0x3c/0x8a
 [<ffffffff81215d04>] ? ath_tx_node_cleanup+0x13c/0x15c
 [<ffffffff812fd9f7>] ? __sta_info_unlink+0x164/0x19e
 [<ffffffff812fda59>] ? sta_info_unlink+0x28/0x37
 [<ffffffff81302d40>] ? ieee80211_set_disassoc+0x193/0x1e7
 [<ffffffff81318dd9>] ? mutex_lock+0xd/0x31
 [<ffffffff81304fd8>] ? ieee80211_sta_work+0x0/0xe84
 [<ffffffff813055a8>] ? ieee80211_sta_work+0x5d0/0xe84
 [<ffffffff81318208>] ? thread_return+0x4e/0x10f
 [<ffffffff810391e3>] ? mod_timer+0x114/0x124
 [<ffffffff81304fd8>] ? ieee80211_sta_work+0x0/0xe84
 [<ffffffff8103ef23>] ? worker_thread+0xf5/0x179
 [<ffffffff81041f20>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff8103ee2e>] ? worker_thread+0x0/0x179
 [<ffffffff81041cb1>] ? kthread+0x79/0x81
 [<ffffffff8100bdfa>] ? child_rip+0xa/0x20
 [<ffffffff81041c38>] ? kthread+0x0/0x81
 [<ffffffff8100bdf0>] ? child_rip+0x0/0x20
---[ end trace 3adfa5f3dbe1ecae ]---
wlan0: direct probe to AP 00:1a:2b:0d:ac:5e (try 1)

Kind regards
Sebastian
Comment 1 Sebastian Kemper 2010-04-04 18:47:45 UTC
Created attachment 25850 [details]
Kernel configuration

I didn't change the configuration when upgrading from .10 to .11
Comment 2 Sebastian Kemper 2010-04-04 18:52:10 UTC
Created attachment 25851 [details]
dmesg from April 3rd
Comment 3 Sebastian Kemper 2010-04-04 18:54:40 UTC
Created attachment 25852 [details]
dmesg from April 4th
Comment 4 John W. Linville 2010-04-05 15:27:47 UTC
Could be a problem with one of the backported ath9k patches...perhaps you could try reverting them in order?  Or perhaps bisecting?

/home/linville/git/linux-2.6.32.y
[linville-t400.local]:> git log v2.6.32.10..v2.6.32.11 drivers/net/wireless/ath/ath9k/
commit 948dc19cef84be418aadc6b3ade94207c0888937
Author: Vivek Natarajan <vnatarajan@atheros.com>
Date:   Thu Mar 11 13:03:01 2010 -0800

    ath9k: Enable IEEE80211_HW_REPORTS_TX_ACK_STATUS flag for ath9k
    
    commit 05df49865be08b30e7ba91b9d3d94d7d52dd3033 upstream.
    
    Signed-off-by: Vivek Natarajan <vnatarajan@atheros.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
    Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

commit 525ad995580f1d760cbf4a3b42df1cba3df6821b
Author: Senthil Balasubramanian <senthilkumar@atheros.com>
Date:   Thu Mar 11 11:50:54 2010 -0800

    ath9k: Enable TIM timer interrupt only when needed.
    
    commit 3f7c5c10e9dc6bf90179eb9f7c06151d508fb324 upstream.
    
    The TIM timer interrupt is enabled even before the ACK of nullqos
    is received which is unnecessary.
    
    Also clean up the CONF_PS part of config callback properly for
    better readability.
    
    Signed-off-by: Senthil Balasubramanian <senthilkumar@atheros.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
    Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

commit fa8e8558a6874707d8e30c85bb386f43cb61a6e1
Author: Felix Fietkau <nbd@openwrt.org>
Date:   Fri Mar 12 04:02:43 2010 +0100

    ath9k: fix BUG_ON triggered by PAE frames
    
    commit 4fdec031b9169b3c17938b9c4168f099f457169c upstream.
    
    When I initially stumbled upon sequence number problems with PAE frames
    in ath9k, I submitted a patch to remove all special cases for PAE
    frames and let them go through the normal transmit path.
    Out of concern about crypto incompatibility issues, this change was
    merged instead:
    
    commit 6c8afef551fef87a3bf24f8a74c69a7f2f72fc82
    Author: Sujith <Sujith.Manoharan@atheros.com>
    Date:   Tue Feb 9 10:07:00 2010 +0530
    
        ath9k: Fix sequence numbers for PAE frames
    
    After a lot of testing, I'm able to reliably trigger a driver crash on
    rekeying with current versions with this change in place.
    It seems that the driver does not support sending out regular MPDUs with
    the same TID while an A-MPDU session is active.
    This leads to duplicate entries in the TID Tx buffer, which hits the
    following BUG_ON in ath_tx_addto_baw():
    
        index  = ATH_BA_INDEX(tid->seq_start, bf->bf_seqno);
        cindex = (tid->baw_head + index) & (ATH_TID_MAX_BUFS - 1);
    
        BUG_ON(tid->tx_buf[cindex] != NULL);
    
    I believe until we actually have a reproducible case of an
    incompatibility with another AP using no PAE special cases, we should
    simply get rid of this mess.
    
    This patch completely fixes my crash issues in STA mode and makes it
    stay connected without throughput drops or connectivity issues even
    when the AP is configured to a very short group rekey interval.
    
    Signed-off-by: Felix Fietkau <nbd@openwrt.org>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

commit 0dcc9985f34aef3c60bffab3dfc7f7ba3748f35a
Author: Ming Lei <tom.leiming@gmail.com>
Date:   Sun Feb 28 00:56:24 2010 +0800

    ath9k: fix lockdep warning when unloading module
    
    commit a9f042cbe5284f34ccff15f3084477e11b39b17b upstream.
    
    Since txq->axq_lock may be hold in softirq context, it must be
    acquired with spin_lock_bh() instead of spin_lock() if softieq is
    enabled.
    
    The patch fixes the lockdep warning below when unloading ath9k modules.
    
    =================================
    [ INFO: inconsistent lock state ]
    2.6.33-wl #12
    ---------------------------------
    inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
    rmmod/3642 [HC0[0]:SC0[0]:HE1:SE1] takes:
     (&(&txq->axq_lock)->rlock){+.?...}, at: [<ffffffffa03568c3>] ath_tx_node_cl
    {IN-SOFTIRQ-W} state was registered at:
      [<ffffffff8107577d>] __lock_acquire+0x2f6/0xd35
      [<ffffffff81076289>] lock_acquire+0xcd/0xf1
      [<ffffffff813a7486>] _raw_spin_lock_bh+0x3b/0x6e
      [<ffffffffa0356b49>] spin_lock_bh+0xe/0x10 [ath9k]
      [<ffffffffa0358ec7>] ath_tx_tasklet+0xcd/0x391 [ath9k]
      [<ffffffffa0354f5f>] ath9k_tasklet+0x70/0xc8 [ath9k]
      [<ffffffff8104e601>] tasklet_action+0x8c/0xf4
      [<ffffffff8104f459>] __do_softirq+0xf8/0x1cd
      [<ffffffff8100ab1c>] call_softirq+0x1c/0x30
      [<ffffffff8100c2cf>] do_softirq+0x4b/0xa3
      [<ffffffff8104f045>] irq_exit+0x4a/0x8c
      [<ffffffff813acccc>] do_IRQ+0xac/0xc3
      [<ffffffff813a7d53>] ret_from_intr+0x0/0x16
      [<ffffffff81302d52>] cpuidle_idle_call+0x9e/0xf8
      [<ffffffff81008be7>] cpu_idle+0x62/0x9d
      [<ffffffff81391c1a>] rest_init+0x7e/0x80
      [<ffffffff818bbd38>] start_kernel+0x3e8/0x3f3
      [<ffffffff818bb2bc>] x86_64_start_reservations+0xa7/0xab
      [<ffffffff818bb3b8>] x86_64_start_kernel+0xf8/0x107
    irq event stamp: 42037
    hardirqs last  enabled at (42037): [<ffffffff813a7b21>] _raw_spin_unlock_irq
    hardirqs last disabled at (42036): [<ffffffff813a72f4>] _raw_spin_lock_irqsa
    softirqs last  enabled at (42000): [<ffffffffa0353ea6>] spin_unlock_bh+0xe/0
    softirqs last disabled at (41998): [<ffffffff813a7463>] _raw_spin_lock_bh+0x
    
    other info that might help us debug this:
    4 locks held by rmmod/3642:
     #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8132c10d>] rtnl_lock+0x17/0x19
     #1:  (&wdev->mtx){+.+.+.}, at: [<ffffffffa01e53f2>] cfg80211_netdev_notifie
     #2:  (&ifmgd->mtx){+.+.+.}, at: [<ffffffffa0260834>] ieee80211_mgd_deauth+0
     #3:  (&local->sta_mtx){+.+.+.}, at: [<ffffffffa025a381>] sta_info_destroy_a
    
    stack backtrace:
    Pid: 3642, comm: rmmod Not tainted 2.6.33-wl #12
    Call Trace:
     [<ffffffff81074469>] valid_state+0x178/0x18b
     [<ffffffff81014f94>] ? save_stack_trace+0x2f/0x4c
     [<ffffffff81074e08>] ? check_usage_backwards+0x0/0x88
     [<ffffffff8107458f>] mark_lock+0x113/0x230
     [<ffffffff810757f1>] __lock_acquire+0x36a/0xd35
     [<ffffffff8101018d>] ? native_sched_clock+0x2d/0x5f
     [<ffffffffa03568c3>] ? ath_tx_node_cleanup+0x62/0x180 [ath9k]
     [<ffffffff81076289>] lock_acquire+0xcd/0xf1
     [<ffffffffa03568c3>] ? ath_tx_node_cleanup+0x62/0x180 [ath9k]
     [<ffffffff810732eb>] ? trace_hardirqs_off+0xd/0xf
     [<ffffffff813a7193>] _raw_spin_lock+0x36/0x69
     [<ffffffffa03568c3>] ? ath_tx_node_cleanup+0x62/0x180 [ath9k]
     [<ffffffffa03568c3>] ath_tx_node_cleanup+0x62/0x180 [ath9k]
     [<ffffffff810749ed>] ? trace_hardirqs_on+0xd/0xf
     [<ffffffffa0353950>] ath9k_sta_remove+0x22/0x26 [ath9k]
     [<ffffffffa025a08f>] __sta_info_destroy+0x1ad/0x38c [mac80211]
     [<ffffffffa025a394>] sta_info_destroy_addr+0x3e/0x5e [mac80211]
     [<ffffffffa02605d6>] ieee80211_set_disassoc+0x175/0x180 [mac80211]
     [<ffffffffa026084d>] ieee80211_mgd_deauth+0x58/0x17e [mac80211]
     [<ffffffff813a60c1>] ? __mutex_lock_common+0x37f/0x3a4
     [<ffffffffa01e53f2>] ? cfg80211_netdev_notifier_call+0x28d/0x46d [cfg80211]
     [<ffffffffa026786e>] ieee80211_deauth+0x1e/0x20 [mac80211]
     [<ffffffffa01f47f9>] __cfg80211_mlme_deauth+0x130/0x13f [cfg80211]
     [<ffffffffa01e53f2>] ? cfg80211_netdev_notifier_call+0x28d/0x46d [cfg80211]
     [<ffffffff810732eb>] ? trace_hardirqs_off+0xd/0xf
     [<ffffffffa01f7eee>] __cfg80211_disconnect+0x111/0x189 [cfg80211]
     [<ffffffffa01e5433>] cfg80211_netdev_notifier_call+0x2ce/0x46d [cfg80211]
     [<ffffffff813aa9ea>] notifier_call_chain+0x37/0x63
     [<ffffffff81068c98>] raw_notifier_call_chain+0x14/0x16
     [<ffffffff81322e97>] call_netdevice_notifiers+0x1b/0x1d
     [<ffffffff8132386d>] dev_close+0x6a/0xa6
     [<ffffffff8132395f>] rollback_registered_many+0xb6/0x2f4
     [<ffffffff81323bb8>] unregister_netdevice_many+0x1b/0x66
     [<ffffffffa026494f>] ieee80211_remove_interfaces+0xc5/0xd0 [mac80211]
     [<ffffffffa02580a2>] ieee80211_unregister_hw+0x47/0xe8 [mac80211]
     [<ffffffffa035290e>] ath9k_deinit_device+0x7a/0x9b [ath9k]
     [<ffffffffa035bc26>] ath_pci_remove+0x38/0x76 [ath9k]
     [<ffffffff8120940a>] pci_device_remove+0x2d/0x51
     [<ffffffff8129d797>] __device_release_driver+0x7b/0xd1
     [<ffffffff8129d885>] driver_detach+0x98/0xbe
     [<ffffffff8129ca7a>] bus_remove_driver+0x94/0xb7
     [<ffffffff8129ddd6>] driver_unregister+0x6c/0x74
     [<ffffffff812096d2>] pci_unregister_driver+0x46/0xad
     [<ffffffffa035bae1>] ath_pci_exit+0x15/0x17 [ath9k]
     [<ffffffffa035e1a2>] ath9k_exit+0xe/0x2f [ath9k]
     [<ffffffff8108050a>] sys_delete_module+0x1c7/0x236
     [<ffffffff813a7df5>] ? retint_swapgs+0x13/0x1b
     [<ffffffff810749b5>] ? trace_hardirqs_on_caller+0x119/0x144
     [<ffffffff8109b9f6>] ? audit_syscall_entry+0x11e/0x14a
     [<ffffffff81009bb2>] system_call_fastpath+0x16/0x1b
    wlan1: deauthenticating from 00:23:cd:e1:f9:b2 by local choice (reason=3)
    PM: Removing info for No Bus:wlan1
    cfg80211: Calling CRDA to update world regulatory domain
    PM: Removing info for No Bus:rfkill2
    PM: Removing info for No Bus:phy1
    ath9k 0000:16:00.0: PCI INT A disabled
    
    Signed-off-by: Ming Lei <tom.leiming@gmail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Comment 5 Sebastian Kemper 2010-04-06 17:50:48 UTC
Hi all,

I think it's the last one you mentioned: 0dcc9985f34aef3c60bffab3dfc7f7ba3748f35a

I've added Ming to the CC list of this report.

I've found that the trace can be reproduced by simply shutting down the access point (I cold-started it).

So I reversed patch #1, found that the trace still occurred, applied patch #1 again, then reversed patch #2 and so on.

Whenever I reverse this patch I don't get the trace:

diff --git a/drivers/net/wireless/ath/ath9k/xmit.c b/drivers/net/wireless/ath/ath9k/xmit.c
index 9009bac..5184af2 100644
--- a/drivers/net/wireless/ath/ath9k/xmit.c
+++ b/drivers/net/wireless/ath/ath9k/xmit.c
@@ -2210,7 +2210,7 @@ void ath_tx_node_cleanup(struct ath_softc *sc, struct ath_node *an)
 		if (ATH_TXQ_SETUP(sc, i)) {
 			txq = &sc->tx.txq[i];
 
-			spin_lock(&txq->axq_lock);
+			spin_lock_bh(&txq->axq_lock);
 
 			list_for_each_entry_safe(ac,
 					ac_tmp, &txq->axq_acq, list) {
@@ -2231,7 +2231,7 @@ void ath_tx_node_cleanup(struct ath_softc *sc, struct ath_node *an)
 				}
 			}
 
-			spin_unlock(&txq->axq_lock);
+			spin_unlock_bh(&txq->axq_lock);
 		}
 	}
 }

Whenever I apply it again the trace shows up again (when switching off the Access Point).

Kind regards
Sebastian
Comment 6 Sebastian Kemper 2010-04-12 08:26:46 UTC
Kernel 2.6.33.2 also shows the issue, see this mailing list thread: http://marc.info/?t=127083441400007&r=1&w=2
Comment 7 Sebastian Kemper 2010-05-28 04:11:16 UTC
Hello all,

issue was fixed with commits 4ee20bc28f74a77307091b15d498e44c7f54645e and 0afe7e4c1136dda6486587be78c5d277f7aac999 for 2.6.32 and .33.

Thanks!
Sebastian
Comment 8 Luis Chamberlain 2010-05-28 17:47:01 UTC
Thanks for the report.

Note You need to log in before you can comment on or make changes to this bug.