Bug 34992 - (athk5) Regression with ath5k, cannot find any wireless network
Status: CLOSED CODE_FIX
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 high
Assigned To: drivers_network-wireless@kernel-bugs.osdl.org
Depends on:
Blocks: 27352
Reported: 2011-05-12 11:24 UTC by Joshua Covington
Modified: 2011-07-09 07:40 UTC

See Also:
Kernel Version: 2.6.38.x
Tree: Mainline
Regression: Yes


Attachments
dmesg with DEBUG=Y (122.77 KB, text/plain)
2011-05-12 11:24 UTC, Joshua Covington
/var/log/messages with DEBUG=Y (9.80 KB, text/plain)
2011-05-12 11:26 UTC, Joshua Covington
/sys/kernel/debug/ieee80211 directory with debug=0xffffffff (13.36 KB, application/octet-stream)
2011-05-12 11:28 UTC, Joshua Covington
kernel-panic after nm-applet starts (53 bytes, text/plain)
2011-05-15 00:10 UTC, Joshua Covington
another kernel panic (52 bytes, text/plain)
2011-05-15 08:33 UTC, Joshua Covington
another foto of the kernel-panic (seems to show more information than the previous fotos) (49 bytes, text/plain)
2011-05-19 21:00 UTC, Joshua Covington
dmesg with slub_debug=FPZ (62.62 KB, text/plain)
2011-05-20 14:52 UTC, Joshua Covington
screen shot of the kernel oops with slub_debug=FPZ (50 bytes, text/plain)
2011-05-20 14:54 UTC, Joshua Covington
dmesg debug=0xffffffff (123.00 KB, text/plain)
2011-05-20 18:16 UTC, Joshua Covington
2.6.39 + compat-wireless-2011-05-16 + Nick's patch (53 bytes, text/plain)
2011-05-21 07:51 UTC, Joshua Covington
dmesg showing the network loss at about 266s (65.71 KB, text/plain)
2011-06-15 13:27 UTC, Fabio

Description Joshua Covington 2011-05-12 11:24:29 UTC
Created attachment 57542 [details]
dmesg with DEBUG=Y

Description:

My ath5k-based card was working perfectly until I upgraded to kernel-2.6.38.5-24.fc15.x86_64. Now it cannot see any wireless network.

`iwlist wlan0 scan` returns "no wireless found"

I tested this with the latest compat-wireless-2011-05-11 and DEBUG=Y without luck. The log/debug files are attached.


Device:

08:04.0 Ethernet controller: Atheros Communications Inc. AR2413 802.11bg NIC (rev 01)
        Subsystem: AMBIT Microsystem Corp. Device 0418
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 168 (2500ns min, 7000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 21
        Region 0: Memory at c0200000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
        Kernel driver in use: ath5k
        Kernel modules: ath5k


Latest working kernel:
kernel-2.6.37-2.fc15


First non-working kernel:
kernel-2.6.38-0.rc2.git0.1.fc15
Comment 1 Joshua Covington 2011-05-12 11:26:02 UTC
Created attachment 57552 [details]
/var/log/messages with DEBUG=Y

I filtered the file for NetworkManager.
Comment 2 Joshua Covington 2011-05-12 11:28:39 UTC
Created attachment 57562 [details]
/sys/kernel/debug/ieee80211 directory with debug=0xffffffff

This is the /sys/kernel/debug/ieee80211 directory, taken with the ath5k driver compiled from compat-wireless-2011-05-11 with DEBUG=Y.
Comment 3 Nick Kossifidis 2011-05-14 18:46:18 UTC
Can you try out this patch?
http://www.kernel.org/pub/linux/kernel/people/mickflemm/01-fast-chan-switch-modparm
Comment 4 Joshua Covington 2011-05-15 00:10:01 UTC
Created attachment 57872 [details]
kernel-panic after nm-applet starts

The patch seems to work, because I can scan for networks.

As you can see from the attached photo, I get a kernel panic the moment nm-applet starts and tries to pass the password for a secured network. I'm not sure whether this is a problem in the driver itself or something related to NetworkManager.

The NetworkManager service itself seems to be fine, but as I said, the kernel locks up when nm-applet tries to pass a password. What do you think: is this a bug in the driver or in nm-applet (NetworkManager)?
Comment 5 Joshua Covington 2011-05-15 00:24:23 UTC
Was the algorithm for processing 63-ASCII passwords changed between 2.6.37 and 2.6.38?

This is the only logical explanation for me because a single program with user rights cannot cause a kernel-panic, can it?
Comment 6 Joshua Covington 2011-05-15 08:33:36 UTC
Created attachment 57912 [details]
another kernel panic

I updated to the latest wpa_supplicant-0.7.3-8.fc15 and now I have another kernel panic. At the moment I can't tell whether this is a driver bug or not.
Comment 7 Joshua Covington 2011-05-15 08:34:51 UTC
In both cases the kernel panics when the nm-applet tries to make the connection.
Comment 8 Joshua Covington 2011-05-19 21:00:00 UTC
Created attachment 58662 [details]
another foto of the kernel-panic (seems to show more information than the previous fotos)

This is another photo of the kernel panic that shows a little more information than the other ones.

Does anyone have any idea what's causing this lockup?
Comment 9 Bob Copeland 2011-05-20 02:21:32 UTC
Well, I'm not sure what to make of the photos.  rt_cache_flush (this is networking code nominally outside of the driver) seems to be implicated in 2 of them, while the other looks like it's just randomly in default_idle (but I can't tell what the actual error is in the latter).

Can you turn on slab/slub debugging and lockdep if they aren't already on?
Comment 10 Joshua Covington 2011-05-20 04:05:05 UTC
I looked into the config file and found the following:

CONFIG_LOCKDEP_SUPPORT=y

# CONFIG_SLAB is not set
CONFIG_SLABINFO=y

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set

So some of them have already been set. Should I still set something (like a debug level, etc.)? And if I have to recompile the kernel, are there any other options that need to be set? I don't like recompiling the kernel because it takes approximately 55 minutes, so I'd like to set everything that could help in one go (and save some recompiling).

Is there a shortcut that avoids recompiling everything? Maybe rebuilding just the wireless parts?
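
The options quoted above can also be checked mechanically rather than by reading the config by hand. The following is a minimal sketch in Python, assuming the config uses the usual `CONFIG_X=y` / `# CONFIG_X is not set` line format; the function name `parse_kconfig` and the `SAMPLE` string are illustrative and not part of any kernel tool.

```python
import re

def parse_kconfig(text):
    """Map each CONFIG_ option to its value, or None if it 'is not set'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = None
            continue
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options

# Sample text copied from the options listed in this comment.
SAMPLE = """\
CONFIG_LOCKDEP_SUPPORT=y
# CONFIG_SLAB is not set
CONFIG_SLABINFO=y
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
"""

opts = parse_kconfig(SAMPLE)
for name in ("CONFIG_SLUB_DEBUG", "CONFIG_SLUB_DEBUG_ON"):
    print(name, "->", opts.get(name) or "not set")
```

With CONFIG_SLUB_DEBUG=y and CONFIG_SLUB_DEBUG_ON unset, SLUB debugging support is compiled in but not active by default, so it can be switched on from the boot command line without a rebuild.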
Comment 11 Bob Copeland 2011-05-20 11:23:28 UTC
(In reply to comment #10)
> So some of them have been already set. Should I still set something (like debug
> level etc...) 

Yeah - please boot with kernel param "slub_debug=FPZ"
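
For reference, the letters in "slub_debug=FPZ" each select one kind of check. The sketch below decodes them in Python; the flag meanings follow the kernel's SLUB documentation, while the function name `slub_debug_flags` and the example command line are assumptions made for illustration only.

```python
# Flag letters as described in the kernel's SLUB documentation.
SLUB_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning",
    "U": "user tracking (alloc/free)",
    "T": "trace",
}

def slub_debug_flags(cmdline):
    """Return the decoded slub_debug flags found on a kernel command line."""
    for token in cmdline.split():
        if token == "slub_debug":
            # A bare "slub_debug" switches on full debugging.
            return ["all debugging enabled"]
        if token.startswith("slub_debug="):
            # An optional ",<slab names>" suffix restricts the flags to some caches.
            letters = token.split("=", 1)[1].split(",", 1)[0]
            return [SLUB_FLAGS.get(ch, "unknown flag " + ch) for ch in letters]
    return []

# Hypothetical command line; on a running system read /proc/cmdline instead.
example = "ro root=/dev/sda1 quiet slub_debug=FPZ"
print(slub_debug_flags(example))
```

After rebooting, running the same function on the contents of /proc/cmdline is one way to confirm that the parameter actually reached the kernel.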
Comment 12 Joshua Covington 2011-05-20 14:52:45 UTC
Created attachment 58772 [details]
dmesg with slub_debug=FPZ

There's nothing unusual here.
Comment 13 Joshua Covington 2011-05-20 14:54:22 UTC
Created attachment 58792 [details]
screen shot of the kernel oops with slub_debug=FPZ

I hope this one can help.
Comment 14 Bob Copeland 2011-05-20 16:27:48 UTC
Ok that does help, yes...

So it looks like we were in the middle of adding the debugfs file for the STA, then you got a timer interrupt, it ran softirqs, which caused rcu_process_callbacks to run... somewhere in there IP got set to 0x10 so there is maybe a dangling pointer, or a race with the debugfs code.  Not sure yet what might be in the callback.
Comment 15 Bob Copeland 2011-05-20 16:37:24 UTC
Did you apply Nick's patch to compat-wireless or did you rebuild the whole kernel?  There seem to be some issues with the backports of some rcu code in compat-wireless where the structures have different layouts.  If that's the case, can you try testing vanilla kernel.org kernel + Nick's patch?
Comment 16 Joshua Covington 2011-05-20 17:44:08 UTC
All the output is from kernel-2.6.38.5-24.fc15.x86_64 + compat-wireless-2011-05-11 + Nick's patch. It's a lot easier to use this than recompiling everything. 

I'll try to use some of the stable releases available here: http://wireless.kernel.org/en/users/Download/stable/ because it takes such a long time to rebuild the whole kernel in my case.

I'll let you know what works. Thanks for the fast response!
Comment 17 Joshua Covington 2011-05-20 18:16:46 UTC
Created attachment 58812 [details]
dmesg debug=0xffffffff

This is the output from 2.6.38.5-24.fc15.x86_64 + compat-wireless-2.6.39-rc6-1-sp.tar.bz2 + Nick's patch. ATH5K_DEBUG=Y

Everything works again.

It looks like the issues with the backports of some rcu code in compat-wireless that you mentioned have either been fixed in 2.6.39 or not yet been applied to the stable branch.

So what should we do:
1. investigate further whether those patches will be added to the stable release (because they'll cause the lockup again), or
2. close this bug IF those backports have already been fixed?
Comment 18 Joshua Covington 2011-05-21 07:51:13 UTC
Created attachment 58852 [details]
2.6.39 + compat-wireless-2011-05-16 + Nick's patch

I decided to test this further and used the kernel-2.6.39-0.fc16.x86_64 + compat-wireless-2011-05-16 + Nick's patch.

Obviously these rcu changes still exist in the upstream compat-wireless. Now the backtrace shows some signs of the faulty rcu structure. I hope this is fixed soon, before it lands in the mainline kernel.
Comment 19 Bob Copeland 2011-05-22 21:55:19 UTC
I don't think it's a mainline problem since structure layouts and APIs by definition are the same for any single version of the kernel.  That is, if you can reproduce with a vanilla kernel.org kernel + Nick's patch, then it is something we should look at, otherwise it is something compat-wireless needs to address.
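
To illustrate why a layout mismatch between a kernel and an out-of-tree module can produce a bogus pointer, here is a small Python sketch using ctypes. The two structures are hypothetical and are NOT the real kernel rcu structures; they only show that the same bytes read through two different layouts yield different field values.

```python
import ctypes

# Hypothetical layouts: NOT the real kernel structures, only an illustration.
class HeadAsBuilt(ctypes.Structure):       # layout the running kernel uses
    _fields_ = [("next", ctypes.c_void_p),
                ("func", ctypes.c_void_p)]

class HeadAsBackported(ctypes.Structure):  # layout an out-of-tree module assumed
    _fields_ = [("flags", ctypes.c_size_t),
                ("next", ctypes.c_void_p),
                ("func", ctypes.c_void_p)]

# The kernel fills in the structure using its own layout...
obj = HeadAsBuilt(next=0x1000, func=0x2000)

# ...and the module reinterprets the same bytes using the other layout.
buf = ctypes.create_string_buffer(ctypes.sizeof(HeadAsBackported))
ctypes.memmove(buf, ctypes.byref(obj), ctypes.sizeof(obj))
seen = HeadAsBackported.from_buffer(buf)

print("func offset, kernel view:", HeadAsBuilt.func.offset)
print("func offset, module view:", HeadAsBackported.func.offset)
print("module reads 'next' as:", hex(seen.next or 0))
print("module reads 'func' as:", seen.func)
```

In this sketch the module sees the kernel's callback address in its `next` field and an empty `func`, which is the kind of confusion that can end in a jump to a nonsensical address. Within a single kernel build both sides share one definition, so this cannot happen.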
Comment 20 Joshua Covington 2011-05-23 16:50:41 UTC
Ok,

As I said in comment #17 everything is working again now with Nick's patch and therefore I propose to close this bug.

Can you please report when this is submitted and accepted in the vanilla kernel?
Comment 22 John W. Linville 2011-05-31 17:15:18 UTC
Nick, are you planning to submit this patch?  Or did I miss it?
Comment 23 Nick Kossifidis 2011-05-31 17:19:08 UTC
Sorry for the delay :-( I've been very busy lately, I'll post it tomorrow...
Comment 24 Fabio 2011-06-04 22:21:56 UTC
Was the patch pushed somewhere?
Comment 25 Joshua Covington 2011-06-05 01:58:49 UTC
yes, finally:

author	Nick Kossifidis <mickflemm@gmail.com>	
	Thu, 2 Jun 2011 00:09:48 +0000 (03:09 +0300)
committer	John W. Linville <linville@tuxdriver.com>	
	Fri, 3 Jun 2011 18:19:49 +0000 (14:19 -0400)
commit	a99168eece601d2a79ecfcb968ce226f2f30cf98
tree	01598dfa43a08038f9b33cdae902f71156647471
parent	bdf492f502ad4f646e9905db1b89e11822826edd

It was pushed to wireless-2.6.git and wireless-testing.git, but it was too late for 2.6.38.8. I hope it lands in 2.6.38.9 and newer, because all of the latest versions (including 2.6.39 and 3.0) have this problem.
Comment 26 John W. Linville 2011-06-06 15:01:05 UTC
I neglected to add 'Cc: stable@kernel.org' to the commit message, but I sent something to that address today requesting inclusion in 2.6.38.y.
Comment 27 Fabio 2011-06-15 13:27:52 UTC
Created attachment 62132 [details]
dmesg showing the network loss at about 266s

I am still having this problem, maybe it's a different bug but I'll report here anyway, let me know if I have to open a new one.

The problem is that I am able to connect to the wireless network (using NetworkManager from Ubuntu 11.04), but after about 210s I lose the connection and can no longer reconnect. I have to wait half an hour; then I can reconnect for another ~200s, and so on.

I am using 2.6.39.1 with the patch from this bug + this patch http://patches.aircrack-ng.org/ath5k_regdomain_override.patch (to use 30dB power but it makes no difference) + this https://patchwork.kernel.org/patch/103589/ (to avoid the -1 error with airmon). I enabled the following ath5k options:
CONFIG_ATH5K=m
CONFIG_ATH5K_DEBUG=y
CONFIG_ATH5K_TRACER=y
CONFIG_ATH5K_PCI=y

I should say that I connect to a public AP with WPA which is far from my home (I get about 70~90dB), but it always worked fine until some time ago. I am attaching my dmesg; note the gap between 53 and 266, during which the network worked. After that it fails and tries to reconnect without success.
Comment 28 Florian Mickler 2011-06-15 15:55:22 UTC
Yes, please file a new bug. This bug has been solved by the patch you have already applied. So you do not experience _this_ bug.

Please file a new bug with an appropriate description. If you can determine a kernel release that does not exhibit your problem, then mark it as a regression and make it block the corresponding tracker bug (see bug #15790). 

Regards,
Flo

p.s.: for completeness' sake you can post that bug number here, so that any interested party can follow...
Comment 29 Fabio 2011-06-16 10:23:15 UTC
Thanks, filed as bug #37612.
Comment 30 Joshua Covington 2011-06-26 18:51:46 UTC
(In reply to comment #26)
> I neglected to add 'Cc: stable@kernel.org' to the commit message, but I sent
> something to that address today requesting inclusion in 2.6.38.y.

Sadly, this was left out of the current 2.6.39.2 and I bet it won't be in 2.6.38.9 either.
Comment 31 Joshua Covington 2011-07-09 07:40:11 UTC
Commit bdc5ce7ef6b7a4aa7a9ae7c60767783e6c5e438a in 2.6.39.3 and commit a99168eece601d2a79ecfcb968ce226f2f30cf98 upstream. Thanks.
