Bug 16099 - joining a mesh causes kernel fault with rt73
Status: CLOSED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: Wireless
Hardware: All Linux
Importance: P1 normal
Assignee: John W. Linville
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-06-01 20:47 UTC by Christian Mehlis
Modified: 2010-08-23 14:55 UTC
CC List: 6 users

See Also:
Kernel Version: 2.6.34-generic
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel.log (81.69 KB, text/plain)
2010-06-01 20:47 UTC, Christian Mehlis
lsusb (674 bytes, text/plain)
2010-06-01 20:48 UTC, Christian Mehlis
lspci -v (22.27 KB, text/plain)
2010-06-01 20:49 UTC, Christian Mehlis
full syslog (209.33 KB, text/plain)
2010-06-02 20:27 UTC, Christian Mehlis
full kern.log (165.06 KB, text/plain)
2010-06-02 20:30 UTC, Christian Mehlis
full messages (153.75 KB, application/x-bzip)
2010-06-02 20:31 UTC, Christian Mehlis
0001-mac80211-avoid-scheduling-while-atomic-in-mesh_rx_pl.patch (4.89 KB, patch)
2010-06-21 21:04 UTC, John W. Linville
0001-mac80211-avoid-scheduling-while-atomic-in-mesh_rx_pl.patch (5.44 KB, patch)
2010-06-21 21:19 UTC, John W. Linville

Description Christian Mehlis 2010-06-01 20:47:56 UTC
Created attachment 26604 [details]
kernel.log

When two mesh devices merge into one mesh, the rt73 module fails; see the attachment.
Comment 1 Christian Mehlis 2010-06-01 20:48:43 UTC
Created attachment 26605 [details]
lsusb
Comment 2 Christian Mehlis 2010-06-01 20:49:12 UTC
Created attachment 26606 [details]
lspci -v
Comment 3 Gertjan van Wingerde 2010-06-02 20:17:00 UTC
Hmmm, the kernel.log file looks a bit odd: the information about why the stack trace was generated seems to be missing.

Is this really all there is in the kernel log, or did you only include parts of the full log?
Comment 4 Christian Mehlis 2010-06-02 20:27:50 UTC
Created attachment 26623 [details]
full syslog
Comment 5 Christian Mehlis 2010-06-02 20:30:30 UTC
Created attachment 26624 [details]
full kern.log

3.5 MB
Comment 6 Christian Mehlis 2010-06-02 20:31:09 UTC
Created attachment 26625 [details]
full messages

3.3 MB
Comment 7 Christian Mehlis 2010-06-02 20:33:11 UTC
Steps to reproduce:
1. Create a mesh device on host A (rt73).
2. Create a mesh device on host B.
3. Running "iw event -t" on host A shows: 1275510283.394517: mesh0: new station 00:16:e3:97:2c:82
4. The failure shown in the attached logs then occurs on host A.
Comment 8 Gertjan van Wingerde 2010-06-02 21:42:40 UTC
OK. It seems to be a "Scheduling in atomic" error, indicating we are trying to schedule with a spinlock held.

John, looking at this, it seems that the mesh code is calling the mac80211 bss_info_changed callback function of the driver with a spinlock held, where the rest of mac80211 will never do that.

Therefore I believe this to be a bug in the mesh code, rather than in rt2x00.
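
To illustrate the pattern described in comment 8, here is a minimal userspace sketch in C. It is not kernel code and not taken from mac80211 or rt2x00: model_spin_lock, model_spin_unlock, model_bss_info_changed and model_plink_established_buggy are hypothetical stand-ins, and the two counters only model what the kernel would report as "scheduling while atomic".

```c
/* Userspace model only: none of these names are kernel APIs. */
#include <assert.h>

static int atomic_depth;      /* > 0 while a modelled spinlock is held */
static int sleep_violations;  /* sleeping calls made while "atomic" */

static void model_spin_lock(void)
{
    atomic_depth++;
}

static void model_spin_unlock(void)
{
    assert(atomic_depth > 0);  /* unlock without a matching lock is a bug */
    atomic_depth--;
}

/*
 * Stand-in for a driver callback that is allowed to sleep, for example
 * a USB driver that performs a blocking register write.
 */
static void model_bss_info_changed(void)
{
    if (atomic_depth > 0)
        sleep_violations++;  /* the kernel would complain here */
}

/* Buggy shape: the notification runs while the lock is still held. */
static void model_plink_established_buggy(int *estab_count)
{
    model_spin_lock();
    (*estab_count)++;
    model_bss_info_changed();  /* called in atomic context */
    model_spin_unlock();
}
```

Calling model_plink_established_buggy once increments sleep_violations, because the callback is reached before the modelled lock is released. A driver whose callback never sleeps would not notice the problem, which is consistent with the fault showing up only with a USB device here.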
Comment 9 John W. Linville 2010-06-09 17:31:50 UTC
Should we consider marking rt2x00 as not supporting mesh mode?
Comment 10 Ivo van Doorn 2010-06-09 22:43:07 UTC
I've already submitted a patch which disables Mesh mode upstream.
I'll recheck when I get back from Berlin later this week.
Comment 11 Ivo van Doorn 2010-06-13 18:29:00 UTC
Apparently I forgot that patch earlier. I sent it upstream a few minutes ago:

[PATCH 1/2] mac80211: Fix bss_info_changed comment regarding sleeping
[PATCH 2/2] rt2x00: Disable Mesh mode for USB drivers

The first one is just a documentation fix for mac80211, while the second patch solves the actual problem (although it simply removes the Mesh feature for USB devices).
Comment 12 John W. Linville 2010-06-21 20:57:16 UTC
Alright, sorry for my delay...

It looks to me like this is caused by calling mesh_plink_inc_estab_count (which calls ieee80211_bss_info_change_notify) from inside mesh_rx_plink_frame while holding sta->lock.  I don't really see why we need to hold sta->lock while incrementing that count.  Am I on crack? :-)

*time passes*

OK, so the bits related to mesh_plink_dec_estab_count were slightly more complicated.  Hopefully I'm not missing anything -- patch to follow...
Comment 13 John W. Linville 2010-06-21 21:04:40 UTC
Created attachment 26888 [details]
0001-mac80211-avoid-scheduling-while-atomic-in-mesh_rx_pl.patch
Comment 14 John W. Linville 2010-06-21 21:19:05 UTC
Created attachment 26889 [details]
0001-mac80211-avoid-scheduling-while-atomic-in-mesh_rx_pl.patch
Comment 15 John W. Linville 2010-08-23 14:55:36 UTC
commit c937019761a758f2749b1f3a032b7a91fb044753
Author: John W. Linville <linville@tuxdriver.com>
Date:   Mon Jun 21 17:14:07 2010 -0400

    mac80211: avoid scheduling while atomic in mesh_rx_plink_frame
    
    While mesh_rx_plink_frame holds sta->lock...
    
    mesh_rx_plink_frame ->
        mesh_plink_inc_estab_count ->
                ieee80211_bss_info_change_notify
    
    ...but ieee80211_bss_info_change_notify is allowed to sleep.  A driver
    taking advantage of that allowance can cause a scheduling while
    atomic bug.  Similar paths exist for mesh_plink_dec_estab_count,
    so work around those as well.
    
    http://bugzilla.kernel.org/show_bug.cgi?id=16099
    
    Also, correct a minor kerneldoc comment error (mismatched function names).
    
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
    Cc: stable@kernel.org
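
The general shape of the workaround described in the commit message (update state under the lock, but defer the possibly sleeping notification until the lock is released) can be sketched in userspace C. This is an assumption-laden illustration, not the actual patch: struct model_mesh, its max_links rule, and every model_ function are hypothetical names invented for this example.

```c
/* Userspace model only: none of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>

struct model_mesh {
    int estab_count;  /* established peer links */
    int max_links;    /* hypothetical limit on peer links */
    bool accepting;   /* whether new peer links are accepted */
};

static int atomic_depth;      /* > 0 while a modelled spinlock is held */
static int sleep_violations;  /* sleeping calls made while "atomic" */
static int notifications;     /* total notifications delivered */

static void model_spin_lock(void)
{
    atomic_depth++;
}

static void model_spin_unlock(void)
{
    assert(atomic_depth > 0);
    atomic_depth--;
}

/* Stand-in for a notification that is allowed to sleep. */
static void model_bss_info_changed(void)
{
    if (atomic_depth > 0)
        sleep_violations++;
    notifications++;
}

/*
 * Safe to call under the lock: it only updates state and reports
 * whether a notification is needed, without sending one itself.
 */
static bool model_inc_estab_count(struct model_mesh *mesh)
{
    bool accepting;

    mesh->estab_count++;
    accepting = mesh->estab_count < mesh->max_links;
    if (accepting == mesh->accepting)
        return false;
    mesh->accepting = accepting;
    return true;
}

/* Fixed shape: decide under the lock, notify after dropping it. */
static void model_plink_established_fixed(struct model_mesh *mesh)
{
    bool changed;

    model_spin_lock();
    changed = model_inc_estab_count(mesh);
    model_spin_unlock();

    if (changed)
        model_bss_info_changed();  /* no lock held, sleeping is fine */
}
```

With this split, the helper that runs under the lock never calls into the driver, so sleep_violations stays at zero however the callback is implemented. The cost is that the notification is delivered slightly after the state change rather than atomically with it.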
