Bug 13581

Summary: ath9k doesn't work with newer kernels
Product: Networking Reporter: Matteo Croce (rootkit85)
Component: Wireless Assignee: Luis Chamberlain (mcgrof)
Status: CLOSED CODE_FIX    
Severity: normal CC: adam, andrej, ath9k-devel, info, j, linville, mcgrof, rjw, sujith
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 13070    
Attachments: modprobe ath9k debug=0xffffffff
iwlist wlan0 scan
ram align hack

Description Matteo Croce 2009-06-19 12:04:19 UTC
After upgrading from 2.6.29 to 2.6.30, wireless stopped working.
Some info:

# wpa_supplicant -i wlan0 -D wext -c /etc/wpa_supplicant/wpa_supplicant.conf
CTRL-EVENT-SCAN-RESULTS                                                     
Trying to associate with 00:18:84:81:00:fd (SSID='OpenWrt' freq=2452 MHz)   
Association request to the driver failed                                    
Associated with 00:18:84:81:00:fd                                           
CTRL-EVENT-DISCONNECTED - Disconnect event - remove keys                    
ioctl[SIOCSIWENCODEEXT]: No such file or directory                          
ioctl[SIOCSIWSCAN]: Device or resource busy                                 
Failed to initiate AP scan.                                                 
Authentication with 00:00:00:00:00:00 timed out.                            
CTRL-EVENT-SCAN-RESULTS                                                     
Trying to associate with 00:18:84:81:00:fd (SSID='OpenWrt' freq=2452 MHz)   
Association request to the driver failed
Associated with 00:18:84:81:00:fd
CTRL-EVENT-DISCONNECTED - Disconnect event - remove keys
ioctl[SIOCSIWENCODEEXT]: No such file or directory
ioctl[SIOCSIWSCAN]: Device or resource busy
Failed to initiate AP scan.
Authentication with 00:00:00:00:00:00 timed out.

# dmesg
cfg80211: Using static regulatory domain info
cfg80211: Regulatory domain: US
        (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
        (2402000 KHz - 2472000 KHz @ 40000 KHz), (600 mBi, 2700 mBm)
        (5170000 KHz - 5190000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
        (5190000 KHz - 5210000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
        (5210000 KHz - 5230000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
        (5230000 KHz - 5330000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
        (5735000 KHz - 5835000 KHz @ 40000 KHz), (600 mBi, 3000 mBm)
cfg80211: Calling CRDA for country: US
ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ath9k 0000:03:00.0: setting latency timer to 64
phy0: Selected rate control algorithm 'ath9k_rate_control'
cfg80211: Calling CRDA for country: AT
Registered led device: ath9k-phy0::radio
Registered led device: ath9k-phy0::assoc
Registered led device: ath9k-phy0::tx
Registered led device: ath9k-phy0::rx
phy0: Atheros AR5418 MAC/BB Rev:2 AR5133 RF Rev:81: mem=0xffffc200042c0000, irq=17
ADDRCONF(NETDEV_UP): wlan0: link is not ready
wlan0: authenticate with AP 00:18:84:81:00:fd
wlan0: authenticated
wlan0: associate with AP 00:18:84:81:00:fd
wlan0: RX AssocResp from 00:18:84:81:00:fd (capab=0x431 status=0 aid=3)
wlan0: associated
ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
wlan0: disassociating by local choice (reason=3)
wlan0: no IPv6 routers present
wlan0: authenticate with AP 00:18:84:81:00:fd
wlan0: authenticated
wlan0: associate with AP 00:18:84:81:00:fd
wlan0: RX AssocResp from 00:18:84:81:00:fd (capab=0x431 status=0 aid=3)
wlan0: associated
wlan0: disassociating by local choice (reason=3)
wlan0: authenticate with AP 00:18:84:81:00:fd
wlan0: authenticated
wlan0: associate with AP 00:18:84:81:00:fd
wlan0: RX AssocResp from 00:18:84:81:00:fd (capab=0x431 status=0 aid=3)
wlan0: associated
wlan0: disassociating by local choice (reason=3)
Comment 1 Luis Chamberlain 2009-06-22 15:51:07 UTC
ioctl[SIOCSIWSCAN]: Device or resource busy

Hm, can you try this:

iwlist wlan0 scan

and provide the dmesg output of that.

Are there no other instances of wpa_supplicant running? Can you also install the latest iw, and provide the output of 'iw event -t'.

http://wireless.kernel.org/en/users/Documentation/iw
http://wireless.kernel.org/en/users/Documentation/Reporting_bugs
Comment 2 Jouni Malinen 2009-06-22 15:53:24 UTC
Do you happen to have NetworkManager running in the background when you start
wpa_supplicant manually? It is known to disconnect the connection created by
wpa_supplicant if it was not the one asking for the connection in the first
place; this results in a dmesg output that looks like the one shown here.

If you do not have NetworkManager (or some other software that could behave
similarly) running, please attach more verbose debug output from wpa_supplicant
(-ddt on command line).
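
One way to rule out the competing-daemon case is to list the connection managers that are running before starting wpa_supplicant by hand. The sketch below is only illustrative: the helper name competing_managers and the list of daemon names (NetworkManager, connmand, wicd) are assumptions, not something taken from this report.

```shell
# List connection managers found in a process list (one command name per
# line on stdin). The set of names is an assumption; extend it as needed.
competing_managers() {
    grep -E -x 'NetworkManager|connmand|wicd|wpa_supplicant' | sort -u
}

# Typical use on a live system:
#   ps -e -o comm= | competing_managers
```

An empty result means none of the listed daemons is running. A wpa_supplicant entry is expected only if you have already started it yourself.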
Comment 3 Matteo Croce 2009-06-23 16:17:32 UTC
Yes, I know that NetworkManager and connman disconnect me, but I was starting the only wpa_supplicant instance by hand.
Comment 4 Luis Chamberlain 2009-06-24 21:53:50 UTC
Matteo, can you provide more details as I asked?
Comment 5 Matteo Croce 2009-06-25 00:59:35 UTC
sure:

root@macbook-luca:~# iwlist wlan0 scan
wlan0     Interface doesn't support scanning : Network is down

root@macbook-luca:~# ifconfig wlan0 up
root@macbook-luca:~# iwlist wlan0 scan
wlan0     Interface doesn't support scanning : Device or resource busy

root@macbook-luca:~# iw event -t
^C
root@macbook-luca:~# iw event -t
# start wpa_supplicant
# kill wpa_supplicant
1245891191.161903: wlan0 (phy #0): scan aborted
# start wpa_supplicant
# kill wpa_supplicant
1245891209.162864: wlan0 (phy #0): scan aborted
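
When the event log gets long, the "scan aborted" lines can be tallied per interface. This is a minimal sketch that assumes the line format shown in this comment (timestamp, interface, phy, event text); the function name count_scan_aborts is hypothetical.

```shell
# Count "scan aborted" events per interface in `iw event -t` output read
# from stdin. Field 2 is the interface name in the format shown above.
count_scan_aborts() {
    awk '/scan aborted/ { n[$2]++ } END { for (i in n) print i, n[i] }'
}

# Typical use:
#   iw event -t | tee events.log      # stop with Ctrl-C
#   count_scan_aborts < events.log
```

Lines that do not contain "scan aborted", such as the "# start wpa_supplicant" notes, are ignored.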
Comment 6 Luis Chamberlain 2009-07-27 13:51:25 UTC
Matteo, are you sure you do not have rfkill button pressed?
Comment 7 Luis Chamberlain 2009-07-27 13:53:50 UTC
If you do not have rfkill button enabled, please try loading ath9k with debugging enabled. Please read:

http://wireless.kernel.org/en/users/Drivers/ath9k/debug

Please use 0xffffffff for debug and attach the compressed log here or somewhere for retrieval.
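
For anyone following along, the request above could be scripted roughly as follows. The function name, the default output file name, and the wait before dumping the log are assumptions; the debug module parameter and the 0xffffffff mask are the ones asked for here. It must be run as root.

```shell
# Reload ath9k with full debugging and save a compressed kernel log.
# Must be run as root; nothing is executed until the function is called.
capture_ath9k_debug() {
    out=${1:-ath9k-debug.log.gz}   # output file (name is an assumption)
    wait_s=${2:-30}                # seconds to let the failure reproduce
    modprobe -r ath9k &&
    modprobe ath9k debug=0xffffffff &&
    sleep "$wait_s" &&
    dmesg | gzip -c > "$out"
}

# Typical use:
#   capture_ath9k_debug /tmp/ath9k-debug.log.gz 60
```

With every debug bit set the kernel log fills quickly, so a short wait is probably better than a long one if the ring buffer is small.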
Comment 8 Matteo Croce 2009-07-27 14:05:00 UTC
Can't test now, the notebook is far away from here,
but MacBooks don't have a wifi button.
Comment 9 Luis Chamberlain 2009-08-03 15:10:33 UTC
Please provide feedback
Comment 10 Matteo Croce 2009-08-03 16:19:24 UTC
I have no rfkill module loaded
Comment 11 Luis Chamberlain 2009-08-03 17:57:38 UTC
Please provide the compressed log of running ath9k with debugging enabled.
Comment 12 Chi 2009-09-06 11:54:21 UTC
I have the same problem. My ath9k driver is broken with 2.6.31-rc8-zen1. The strange thing is that iwlist wlan0 scan works, but ifup fails. I've attached my dmesg output with the debug=0xffffffff option on.
Comment 13 Chi 2009-09-06 11:56:04 UTC
Created attachment 23019 [details]
modprobe ath9k debug=0xffffffff

dmesg dump of modprobe ath9k debug=0xffffffff on an Amilo xa 3530 laptop.
Comment 14 Chi 2009-09-06 13:00:12 UTC
Created attachment 23023 [details]
iwlist wlan0 scan

Dump of dmesg of iwlist wlan0 scan.
Comment 15 Andrej Podzimek 2009-10-10 14:44:55 UTC
Exactly the same problem here, since 2.6.30. 2.6.31 is affected as well.

Here comes an *important* note: This probably has nothing to do with ath9k. I can see exactly the same problem with both ipw2200 and ath9k. AFAIK, the former doesn't even use the mac80211 module.

The problem could be somewhere much deeper than in ath9k. There could be something wrong with wpa_supplicant or with the wireless extensions API implementation it talks to.

BTW, sometimes ping6 -q -i .001 <address-of-my-server> helps, but it takes up to 15 seconds of this terrible ping flood before the network starts working again. Lower packet frequencies mostly don't help and the interface needs to be brought down and up again.

It seems to me that ping6 -q -i .1 <my-server> significantly reduces the probability of total network freezes. Disassociations still *do* occur, but the connection recovers automatically in most cases, unlike situations with no ping at all. Unfortunately, this recovery usually takes a couple of seconds, which is just enough to interrupt all the data streams and VoIP calls and make the user scream with anger.
Comment 16 Adam 2009-10-14 15:48:22 UTC
I'm having a similar problem here with 2.6.31.4 in Arch Linux on an Eee 1000HE.

It seems like it might be rfkill related since when I have bluetooth enabled, I can usually get a connection, but it will disconnect me shortly after.  If I have bluetooth disabled, I cannot get a connection at all and get:

SIOCSIFFLAGS: Unknown error 132

Which is rfkill related.

However, I don't know if this is asus-laptop or ath9k related, so I will be creating a new report when I get home and have access to my laptop again.
Comment 17 Luis Chamberlain 2009-10-14 16:40:12 UTC
This does indeed seem rfkill related, note that rfkill was completely rewritten for the 2.6.31 kernel.

Please try out the new rfkill userspace application to see if you can query the rfkill status:

http://wireless.kernel.org/en/users/Documentation/rfkill

I think there is support for a command:

rfkill unblock all
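
If the rfkill userspace tool is not installed, the same state can be read from sysfs. The sketch below assumes the kernel exposes /sys/class/rfkill/rfkillN/{name,type,state}, which is what the kernel's sysfs-class-rfkill ABI describes; the function name rfkill_report is hypothetical. As far as I know, error 132 on most architectures is ERFKILL, which matches the SIOCSIFFLAGS failure reported above.

```shell
# Print name, type and decoded state for every rfkill device under a sysfs
# directory (default /sys/class/rfkill). State values follow the kernel's
# sysfs-class-rfkill ABI: 0 = soft blocked, 1 = unblocked, 2 = hard blocked.
rfkill_report() {
    base=${1:-/sys/class/rfkill}
    for dev in "$base"/rfkill*; do
        [ -r "$dev/state" ] || continue
        case $(cat "$dev/state") in
            0) state='soft blocked' ;;
            1) state='unblocked' ;;
            2) state='hard blocked' ;;
            *) state='unknown' ;;
        esac
        printf '%s %s: %s\n' "$(cat "$dev/name")" "$(cat "$dev/type")" "$state"
    done
}

# Typical use:
#   rfkill_report
```

A wlan entry that shows "soft blocked" is the case that rfkill unblock all is meant to clear; a hard block has to be lifted by the hardware switch or platform driver.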
Comment 18 Adam 2009-10-15 16:42:43 UTC
I didn't get to play around too much, but I was able to try the userspace application.  It looks like eee-laptop exposes a second set of devices(?) and when I enable or disable one things act strange.  However, one time I was able to reboot, unblock all, and everything worked fine.  Hopefully tonight or tomorrow I will be able to do some more testing and post results.
Comment 19 Luis Chamberlain 2009-11-04 23:43:50 UTC
Can you try:

git revert 5d423ccd7ba4285f1084e91b26805e1d0ae978ed
Comment 20 Andrej Podzimek 2009-11-05 03:20:06 UTC
> git revert 5d423ccd7ba4285f1084e91b26805e1d0ae978ed

Does this apply to eee only or to ath9k supported devices in general? (I have a PCMCIA device and don't see any "second set of devices" as reported by Adam.)

Honestly, I don't know how to get the source. Cloned the kernel repository, switched branch to v2.6.31 and reverted the change -- that worked fine. But the kernel source I obtained was 2.6.31, not 2.6.31.5 (the version I'm using right now), judging by the Makefile. There was no tag called v2.6.31.5 on the list.

Should I just ignore this and try 2.6.31? Or is there a better way to get the source and revert one commit? I guess there are more GIT paths to play with and I just cloned from the wrong one...
Comment 21 Adam 2009-11-05 12:21:12 UTC
(In reply to comment #19)
> Can you try:
> 
> git revert 5d423ccd7ba4285f1084e91b26805e1d0ae978ed

On 2.6.31.5 or 2.6.32?
Comment 22 Luis Chamberlain 2009-11-05 15:44:05 UTC
No, the commit 5d423ccd7ba4285f1084e91b26805e1d0ae978ed could be affecting other devices as well.

To get the tree:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-allstable.git

Since you were on 2.6.31 go ahead and just try that:

git checkout -b linux-2.6.31.5 v2.6.31.5

Compile and test that to ensure it does not work.

Then revert the patch I am indicating to you:

git checkout -b linux-2.6.31.5-revert-applied
git revert 5d423ccd7ba4285f1084e91b26805e1d0ae978ed

compile and test that, see if your ath9k then works.

If it does fix it, then please try a patch for 2.6.31.5 which might fix the issue without a full revert of the patch in question. I will attach the patch next. The patch applies on a clean 2.6.31.5, so you would do:


git checkout linux-2.6.31.5
git checkout -b linux-2.6.31.5-with-new-fix
git am ram-align-hack.patch

compile and test that.
Comment 23 Luis Chamberlain 2009-11-05 15:46:50 UTC
Created attachment 23664 [details]
ram align hack

This is the ram-align-hack.patch; apply it on top of 2.6.31.5 and compile/test. This is supposed to fix the issue introduced by the patch I asked you to revert, but it *might* not fix it; I would like your test results.

Testing this patch is pointless unless you confirm reverting the patch in question helps.
Comment 24 Luis Chamberlain 2009-11-05 15:48:16 UTC
Also please append "debug" to the kernel parameters line. On grub
this means editing /boot/grub/menu.lst and adding debug to your specific
kernel line.

Then please post your full dmesg output on each boot, with the 2.6.31.5 kernel, with the revert and then with the new ram-align-hack.patch.
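
For those who prefer not to edit menu.lst by hand, the change can be previewed with a small filter. This is a sketch under assumptions: it targets the GRUB legacy menu.lst format (lines starting with "kernel"), the function name add_debug_param is hypothetical, and it only prints the modified file rather than writing it.

```shell
# Print a GRUB legacy menu.lst with "debug" appended to every kernel line
# that mentions the given version string and does not already carry it.
#   $1 = path to menu.lst, $2 = version string to match (e.g. 2.6.31.5)
add_debug_param() {
    awk -v ver="$2" '
    /^[ \t]*kernel[ \t]/ && index($0, ver) {
        found = 0
        for (i = 2; i <= NF; i++) if ($i == "debug") found = 1
        if (!found) { sub(/[ \t]+$/, ""); $0 = $0 " debug" }
    }
    { print }
    ' "$1"
}

# Typical use (review the result before replacing the real file):
#   add_debug_param /boot/grub/menu.lst 2.6.31.5 > /tmp/menu.lst.new
```

Writing to a separate file first keeps a broken edit from leaving the machine unbootable.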
Comment 25 Adam 2009-11-05 17:35:10 UTC
Unfortunately, only my Eee uses ath9k which takes quite a while to finish compiling and I need it for school.  So it might not be until Saturday that I can try this.  I will report back when I finish the compiling and testing.
Comment 26 Matteo Croce 2009-11-05 18:26:39 UTC
Compiling a kernel on my eeepc takes 17 minutes...
Comment 27 Luis Chamberlain 2009-11-05 18:33:36 UTC
You could cross compile. This is simple if you have a beefy machine with the same architecture around. Just make sure you copy a good config over and use

make tar-pkg

Then untar that stuff to / on the eeepc. I had a hard time doing this myself on my eeepc when I had one, due to the limited space on the eeepc. But if you have a ~4 GB USB stick this shouldn't be too bad. Don't forget to generate an initramfs if you need one.
Comment 28 Adam 2009-11-05 19:17:09 UTC
(In reply to comment #26)
> Compiling a kernel on my eeepc takes 17 minutes...

Feel free to test this then, I will do it when I'm not at school, work, or spending time with my family.
Comment 29 Adam 2009-11-08 17:43:12 UTC
Patch looks good so far here.  My problem was somewhat intermittent, so if it comes back I'll open a new issue.

Thanks again for all the help.
Comment 30 Andrej Podzimek 2009-11-08 19:20:41 UTC
Both the reverted and the patched versions work just as badly as the original version and much, much worse than last week's compat-wireless.

Failures occur about once in 30 minutes with compat-wireless. All the other versions fail every five minutes or so. That makes the user feel like throwing the whole computer out of the window.

Those failures are almost "reproducible"... Just wave your hand quickly in front of the card's antenna and it fails immediately. As I have already said many times, this must be a rate control issue. I'm convinced that multiple rate control algorithms should be tested before trying anything else.

Surprisingly, most failures don't even get logged! The disassociation events correspond to the hardest failures that block the interface for minutes or forever. When shorter failures are "handled" and fixed by a flood ping in time, they don't get logged at all.
Comment 31 Adam 2009-11-20 18:00:11 UTC
Take a look here: http://bugzilla.kernel.org/show_bug.cgi?id=13807

This seems to be helping some people.
Comment 32 Luis Chamberlain 2009-12-29 20:40:02 UTC
This issue seems to have been the align issue, which the user reported as fixed.
Comment 33 Luis Chamberlain 2009-12-29 20:40:23 UTC
The fix was upstream and non-ath9k related.
Comment 34 Matteo Croce 2009-12-29 21:28:19 UTC
I can confirm too that it's fixed