Bug 12110

Summary: ath9k causes computer to hang after long data transmissions
Product: Networking Reporter: Barry Green (barry)
Component: Wireless Assignee: Luis Chamberlain (mcgrof)
Status: CLOSED CODE_FIX    
Severity: normal CC: aaronkelley, anders1, stubenschrott, sujith
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.27.10 (Ubuntu 2.6.27-11-generic) Subsystem:
Regression: --- Bisected commit-id:
Attachments: Debug messages
wpa_supplicant messages before computer hang
syslog of communication with two AP
SMP fixes v3
Serialization v4
Backport of serialization patches -- v6 serialization for 2.6.27-2.6.29

Description Barry Green 2008-11-27 03:23:01 UTC
Latest working kernel version:N/A
Earliest failing kernel version:2.6.27-7-generic
Distribution:Ubuntu 8.10
Hardware Environment:Athlon 64 X2 5600+.  Linksys WMP300N PCI card, Linksys WAG325N Wireless-N router.
Software Environment:compat-wireless-2008-11-26

Problem Description:

After doing a "sudo modprobe ath9k", the system will continue to run for just under 3 minutes.  I start a ping to the local router, which reports destination unreachable for a couple of minutes, then starts to respond.  About 10 seconds after it starts to respond, the whole system hangs.  Video freezes, no mouse/keyboard, no disk activity, and I have to do a hard reset.  The last time I tested, the Linksys router actually crashed at the same time and I had to hard reboot it too.  This has only happened once though, so may have been a coincidence (hell of a coincidence though).

Steps to reproduce:

- Blacklist ath9k so that it doesn't freeze the computer on startup.
- Do "sudo modprobe ath9k".

I get the same problem whether I'm using WPA2 encryption, or no encryption at all.
Comment 1 Luis Chamberlain 2008-12-01 15:02:14 UTC
You are reporting a bug on 2.6.27-7 (whatever that is) when using compat-wireless. This means the issue you are facing is not present on 2.6.27 but on wireless-testing. Please understand compat-wireless comes from wireless-testing and as such it's under development. Our best bet is to take this on the lists as we probably already have a fix for this. Please see the mailing lists for pending patches. John will soon update wireless-testing with any pending patches from last week.
Comment 2 Barry Green 2008-12-02 00:13:43 UTC
2.6.27-7-generic is what "uname -a" reports for Ubuntu 8.10.  However, it is not correct that the issue is not present on 2.6.27.  I have also experienced this same issue on the stock 2.6.27 kernel as included with Ubuntu 8.10.  I logged the issue as using compat-wireless because I thought it was a good idea to use the latest version of the ath9k code.  So can this be reopened?
Comment 3 Luis Chamberlain 2008-12-02 00:30:50 UTC
Yes, sorry about that. Please try disabling 11n and see if you get the issue. You seem to be able to reproduce it easily so this should hopefully help.

Also please try to reproduce by going into a virtual terminal, so that if there is an oops maybe you can at least take a picture of it. You can also try to enable more debugging messages in ath9k. You do that by changing drivers/net/wireless/ath9k/core.h as follows:

diff --git a/drivers/net/wireless/ath9k/core.h b/drivers/net/wireless/ath9k/core.h
index f0c5437..b831271 100644
--- a/drivers/net/wireless/ath9k/core.h
+++ b/drivers/net/wireless/ath9k/core.h
@@ -111,7 +111,7 @@ enum ATH_DEBUG {
        ATH_DBG_ANY             = 0xffffffff
 };
 
-#define DBG_DEFAULT (ATH_DBG_FATAL)
+#define DBG_DEFAULT (ATH_DBG_ANY)
 
 #define        DPRINTF(sc, _m, _fmt, ...) do {                 \
                if (sc->sc_debug & (_m))                \



This means just change ATH_DBG_FATAL to ATH_DBG_ANY and recompile. This should work either on 2.6.27 or on wireless-testing/compat-wireless.
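
To illustrate why this one-line change turns on every message: the DPRINTF macro in the diff only prints when the message's category bit is set in sc_debug, and ATH_DBG_ANY sets every bit. The following is a minimal Python sketch of that gating, not the driver's code. ATH_DBG_ANY is taken from the diff; the other two bit values and the dprintf helper are assumptions made for illustration and may not match the real enum ATH_DEBUG.

```python
# Sketch of the mask test performed by the DPRINTF macro in the diff above.
# ATH_DBG_ANY comes from the diff; the other bit values are ASSUMED for
# illustration and may differ from the real enum ATH_DEBUG in core.h.
ATH_DBG_RESET = 0x00000001   # assumed value
ATH_DBG_FATAL = 0x00002000   # assumed value
ATH_DBG_ANY = 0xFFFFFFFF     # from core.h in the diff


def dprintf(sc_debug, mask, message):
    """Return the message if its category bit is enabled, else None."""
    if sc_debug & mask:
        return message
    return None


# Default mask (fatal only): a reset-category message is dropped.
print(dprintf(ATH_DBG_FATAL, ATH_DBG_RESET, "reset"))  # None
# ATH_DBG_ANY: every category passes.
print(dprintf(ATH_DBG_ANY, ATH_DBG_RESET, "reset"))    # reset
```
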
Comment 4 Luis Chamberlain 2008-12-02 19:16:19 UTC
Please provide some debug messages if possible.
Comment 5 Barry Green 2008-12-02 23:31:05 UTC
Created attachment 19120 [details]
Debug messages

I've attached the contents of /var/log/messages (chopped the middle out since there was a lot of repeated debug information).  This is with ATH_DBG_ANY.  The hang happened at timestamp Dec 2 23:47:12.  You can see the restart message when I had to hard reboot at Dec 2 23:50:29.  This time it did bring down my router again.  I did switch to a virtual console but there was no debug output printed to the console.  I also tried turning off 802.11n at the router, but still had the same problem.  This log is from when 802.11n was on.
Comment 6 Luis Chamberlain 2008-12-04 20:30:46 UTC
Barry, please try these patches:

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-22/27-IOMMU-01/

Let me know how it goes.
Comment 7 Barry Green 2008-12-06 16:06:05 UTC
Does the latest compat-wireless have these patches applied, and can I just use that?  I'd rather not have to rebuild my whole kernel if at all possible.
Comment 8 Luis Chamberlain 2008-12-06 16:10:51 UTC
Affirmative.
Comment 9 Barry Green 2008-12-06 22:05:07 UTC
Now I don't get a connection, but the driver no longer hangs my PC.  I'm not getting any association with the AP when using wpa_supplicant on 802.11n.
Comment 10 Barry Green 2008-12-13 19:30:07 UTC
OK, an update.  I've experimented and managed association with the AP on channel 10 only, in 802.11n mode, with WPA.  However, the connection is so slow (less than 1KB/s) as to make it unusable.  So far, any other channel I try to associate with (2,3,8 and 11) causes the whole computer to hang.  I'm now using compat-wireless-2008-12-11, and am just about to recompile with ATH_DBG_ANY to get more information.
Comment 11 Barry Green 2008-12-14 00:12:40 UTC
Created attachment 19286 [details]
wpa_supplicant messages before computer hang

Turning on ATH_DBG_ANY didn't give me any more messages than having it turned off.  I've done some more testing and found the following:

- If I turn off encryption completely, I can associate with the AP, but again the transfer rate is less than 1KB/s.
- Both 11n and 11g cause the computer to hang, except when I associate on channel 10 as explained above.

I've attached a screenshot of the wpa_supplicant messages leading up to the hang.
Comment 12 Barry Green 2008-12-29 17:23:46 UTC
Not sure where to go from here - any suggestions?
Comment 13 Luis Chamberlain 2009-01-03 19:35:53 UTC
Please boot into single user mode, then connect manually using iwconfig. See if you can reproduce an issue and see if you can get an oops message.
Comment 14 Serge Wellyman 2009-01-14 01:58:33 UTC
(In reply to comment #12)
> Not sure where to go from here - any suggestions?

Barry!

I definitely reckon your issue with pc hanging is due to AMD cpu in your computer. I’ve been facing different difficulties in computers based on some AMD cpus much more often than in the ones built on Intel architecture. For example, the latest Freebsd 7.1 freezes on AMD Phenom X3 even while booting.

As for Ath9k, my laptop with Intel Pentium M 1.6 never hangs while connecting to AP in 802.11n . I’m using channel 1.

See my comment to the bug in speed of ath9k http://bugzilla.kernel.org/show_bug.cgi?id=12373#c4 
Comment 15 Luis Chamberlain 2009-01-16 12:15:40 UTC
Please close this bug report if it is not present on a stable release of the kernel. If you are testing compat-wireless / wireless-testing stuff then please use the linux-wireless mailing list to report issues.
Comment 16 Barry Green 2009-01-17 17:26:56 UTC
OK, I have tried this on the 2.6.27-9 kernel that is now current in Ubuntu 8.10.  With no encryption, I can associate and so far have not experienced a crash.  With WPA2 encryption, I get the crash as soon as wpa_supplicant tries to associate with the AP.

I will now try to use the nmi_watchdog setting to get an oops message.  Will update in a few hours.
Comment 17 Barry Green 2009-01-18 02:57:10 UTC
Can't get a crash on a console, only when I'm in Gnome (that's what I meant above when I mentioned that I still got the crash).  I tried nmi_watchdog=1 (I do have IO-APIC) but it didn't give me anything, although as I say, I was in Gnome at the time.
Comment 18 Luis Chamberlain 2009-01-18 08:48:37 UTC
Barry, Ubuntu's latest kernel is 2.6.27-11 from what I can understand. Can you confirm whether you can upgrade?

On Ubuntu, please provide the output of:

cat /proc/version_signature

That will tell us the exact kernel version, the "2.6.27-9" is completely useless to me. I need something like 2.6.27.9 or 2.6.27.11.

Also can you clarify what you mean about how you can get a crash?

Am I understanding correctly that you cannot reproduce a hard crash when on a console and not using GNOME?
Comment 19 Barry Green 2009-01-18 14:33:20 UTC
Output of "cat /proc/version_signature":

Ubuntu 2.6.27-9.19-generic

To clarify - I can only get a crash when using GNOME.  You are understanding correctly in saying that I cannot reproduce a hard crash when on a console.
Comment 20 Luis Chamberlain 2009-01-20 11:16:41 UTC
Unfortunately it seems Ubuntu started using /proc/version_signature to determine the exact kernel version on Jaunty, not Intrepid, so this is useless to me. I did manage to find out what an Ubuntu "linux-image-2.6.27-9-generic" uses though: it's 2.6.27.2. This is ancient!!! And you need at least 2.6.27.8 to get the latest 2.6.27 ath9k critical fixes.

I checked with the Ubuntu kernel team and they have a "2.6.27-12" kernel package being worked on which will have all updates through 2.6.27.11; this is only available through their git tree right now. They do have some other kernels, but those are only available through the intrepid-proposed sources. For example, their 2.6.27-11 kernel has updates through 2.6.27.10. Per Ubuntu policy, packages from -proposed get posted into -updates (what users have by default) only about once per quarter. Because of this, and since ath9k got its critical DMA updates in the beginning of December, it explains why Ubuntu users don't yet have anything > 2.6.27.2 and why some may be experiencing issues with ath9k.
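
The package-to-stable-release reasoning above can be sketched in a few lines of Python. This is only an illustration: the three mappings and the 2.6.27.8 threshold are the ones stated in this comment, the helper names are made up here, and any package not listed is reported as unknown rather than guessed.

```python
# Mapping taken from the comment above: Ubuntu package ABI -> upstream
# stable release it includes. Only these three pairs are stated there.
UBUNTU_TO_STABLE = {
    "2.6.27-9": "2.6.27.2",
    "2.6.27-11": "2.6.27.10",
    "2.6.27-12": "2.6.27.11",
}
MIN_STABLE_FOR_ATH9K_FIXES = "2.6.27.8"


def version_tuple(version):
    """Turn '2.6.27.10' into (2, 6, 27, 10) so it compares numerically."""
    return tuple(int(part) for part in version.split("."))


def abi_from_signature(signature):
    """Extract '2.6.27-9' from 'Ubuntu 2.6.27-9.19-generic'."""
    release = signature.split()[1]          # '2.6.27-9.19-generic'
    base, rest = release.split("-", 1)      # '2.6.27', '9.19-generic'
    abi = rest.split(".")[0].split("-")[0]  # '9'
    return f"{base}-{abi}"


def has_ath9k_fixes(ubuntu_abi):
    """True/False when the mapping is known, None when it is not."""
    stable = UBUNTU_TO_STABLE.get(ubuntu_abi)
    if stable is None:
        return None
    return version_tuple(stable) >= version_tuple(MIN_STABLE_FOR_ATH9K_FIXES)


abi = abi_from_signature("Ubuntu 2.6.27-9.19-generic")
print(abi, has_ath9k_fixes(abi))  # 2.6.27-9 False
```

Comparing tuples of integers rather than strings matters here, since "2.6.27.10" would sort before "2.6.27.8" as plain text.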

To get Ubuntu's 2.6.27-11 kernel please add to your /etc/apt/sources.list the following entries:

# Ubuntu proposed changes
deb http://us.archive.ubuntu.com/ubuntu/ intrepid-proposed main restricted
deb-src http://us.archive.ubuntu.com/ubuntu/ intrepid-proposed main restricted

Then do:

sudo apt-get install linux-image-2.6.27-12-generic

Please let me know if that fixes your issues.
Comment 21 Luis Chamberlain 2009-01-20 11:17:04 UTC
Sorry I meant:

sudo apt-get install linux-image-2.6.27-11-generic

as -12 is not yet available.
Comment 22 Barry Green 2009-01-20 13:40:55 UTC
OK, I will try this later today.  However, note that above I was also testing using compat-wireless (I tried versions right up until the end of December last year) and got the same issue.
Comment 23 Luis Chamberlain 2009-01-20 14:03:07 UTC
The end of December is old for wireless-testing purposes. There's been quite a few patches in since then.
Comment 24 Barry Green 2009-01-24 01:47:06 UTC
Tested with 2.6.27-11 as suggested above and I can still reproduce the issue.  However, only while in GNOME and only when I set the router to n-only mode.
Comment 25 Luis Chamberlain 2009-01-24 12:40:42 UTC
Barry, I have been stress testing ath9k using iperf over TCP on Ubuntu's 2.6.27-11 on an AR5416 abg Cardbus card against a WRT610N AP with 802.11n enabled only on 5 GHz using HT40 configuration with no encryption for over 4 1/2 hours so far and it's still cruising with rates averaging 18 Mbps - 28 Mbps but staying more around 28 Mbps. No crashes, no outages.

Can you please tell me what AP you have, what exact settings you are using for "802.11n only". I need to know if you are using HT20, HT40, encryption and if so what type.

Also can you please come up with steps to reproduce this? Are you transferring data? Are you using iperf? Are you idle?

Since this is occurring only when using GNOME, are you certain it's an ath9k issue? If so what makes you believe that is the case? Are you using any proprietary graphics driver?
Comment 26 Barry Green 2009-01-24 13:10:54 UTC
AP: I've tried 2 different APs - the Linksys WAG325N Wireless-N router and the Billion 7402NX (which is what I'm currently using).  The problem can be reproduced with either.
Settings: The last crash I got was when using WPA2 with AES.  I don't know if it's using HT20 or HT40 since the AP only has a setting to choose "20MHz" or "40/20MHz".  Can I get this information from ath9k?  Is there a way to force ath9k to use HT40?

I can get the crash by:
1. "modprobe ath9k".
2. Start wpa_supplicant.
3. Start to transfer a large file via FTP.  After up to 3 mins, I get the hang.

Obviously without any hard evidence I can't be certain that it's an ath9k issue.  However, my PC never hangs like this if I'm not trying to use ath9k.  Previously I was using the same wireless card with the madwifi driver and it was solid as a rock.  I am using the proprietary nvidia graphics driver.
Comment 27 Luis Chamberlain 2009-01-24 15:15:29 UTC
HT configuration is negotiated with the AP as far as I can tell. Will have to check how exactly that is figured out when the AP enables both. Anyway I've just stress tested the driver against 11n-only mode AP WPA2 AES in HT40 and HT20 and no crashes occurred.

Can you do:

sudo rmmod ath9k
sudo dmesg -c > /dev/null
sudo modprobe -l ath9k
sudo modprobe ath9k
sudo dmesg -c
uname -a

and paste everything here?

Also, can you try your tests and rmmod nvidia module prior to testing?

If it still crashes with the proprietary nvidia module present can you try using no encryption too and see if that makes a difference. Since I cannot reproduce this I want for you to be able to narrow the issue down.

You can also enable debugging as indicated in comment #3 and see if upon a crash there is something useful in the log.
Comment 28 Marco Tessarotto 2009-01-24 18:11:54 UTC
hello,
I have the same wifi hardware so I report this.
Hope it helps, this happens to my pc:

Ubuntu 8.10 (cat /proc/version_signature: Ubuntu 2.6.27-9.19-generic)
Linksys WMP300N PCI card (AR5008 rev 01)
Linksys WAG325N Wireless-N router (firmware v1.00.12)

AP config: 11n enabled, WPA-PSK

no crash but pc never connects to AP:
- networkmanager scans and reports correctly the AP
- when supplicant connection state changes from 2 -> 3, authentication times out

I will try to install linux-image-2.6.27-11-generic and report what happens

BR/Marco
Comment 29 Luis Chamberlain 2009-01-24 18:28:18 UTC
Ubuntu 2.6.27-9.19-generic is ancient, please add intrepid-proposed to your /etc/apt/sources.list and upgrade and try 2.6.27-11.
Comment 30 Barry Green 2009-01-24 18:45:02 UTC
Command output:

mythtv@anpanman:~$ sudo rmmod ath9k
mythtv@anpanman:~$ sudo dmesg -c >/dev/null
mythtv@anpanman:~$ sudo modprobe -l ath9k
/lib/modules/2.6.27-11-generic/kernel/drivers/net/wireless/ath9k/ath9k.ko
mythtv@anpanman:~$ sudo modprobe ath9k
mythtv@anpanman:~$ sudo dmesg -c
[74759.572396] ath9k: 0.1
[74759.572967] ath9k 0000:01:01.0: PCI INT A -> Link[LNKB] -> GSI 17 (level, low) -> IRQ 17
[74760.006381] phy1: Selected rate control algorithm 'ath9k_rate_control'
[74760.010727] phy1: Atheros 5416: mem=0xfaca0000, irq=17
[74760.012362] udev: renamed network interface wlan0 to ath0
mythtv@anpanman:~$ uname -a
Linux anpanman 2.6.27-11-generic #1 SMP Thu Jan 22 17:22:40 UTC 2009 i686 GNU/Linux
Comment 31 Marco Tessarotto 2009-01-24 19:27:10 UTC
root@marco-desktop:/home/marco# rmmod ath9k
root@marco-desktop:/home/marco# dmesg -c > /dev/null
root@marco-desktop:/home/marco# modprobe -l ath9k
/lib/modules/2.6.27-11-generic/kernel/drivers/net/wireless/ath9k/ath9k.ko
root@marco-desktop:/home/marco# modprobe ath9k
root@marco-desktop:/home/marco# dmesg -c
[  652.844783] ath9k: 0.1
[  652.844842] ath9k 0000:01:01.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
[  653.282128] phy1: Selected rate control algorithm 'ath9k_rate_control'
[  653.283276] phy1: Atheros 5416: mem=0xffffc200109c0000, irq=21
root@marco-desktop:/home/marco# uname -a
Linux marco-desktop 2.6.27-11-generic #1 SMP Fri Jan 23 13:58:13 UTC 2009 x86_64 GNU/Linux


it does not connect to the AP, although now I get the following: 

Jan 25 03:58:19 marco-desktop kernel: [  811.602744] wlan0: authenticate with AP 00:18:39:a7:e1:d1
Jan 25 03:58:19 marco-desktop kernel: [  811.605045] wlan0: authenticated
Jan 25 03:58:19 marco-desktop kernel: [  811.605052] wlan0: associate with AP 00:18:39:a7:e1:d1
Jan 25 03:58:19 marco-desktop kernel: [  811.609536] wlan0: RX AssocResp from 00:18:39:a7:e1:d1 (capab=0x431 status=0 aid=1)
Jan 25 03:58:19 marco-desktop kernel: [  811.609544] wlan0: associated
Jan 25 03:58:19 marco-desktop kernel: [  811.609742] wlan0 (WE) : Wireless Event too big (366)
Jan 25 03:59:04 marco-desktop kernel: [  856.935477] wlan0: disassociating by local choice (reason=3)
Comment 32 Barry Green 2009-01-25 04:13:15 UTC
I've now tested after removing the nvidia module and using the nv driver instead.  However, I still got the crash.  I'm currently testing with no encryption, and so far, even after 10 minutes of iperf, no crash.

So next I'll compile the latest compat-wireless and turn on debugging, then test again with encryption on.
Comment 33 Barry Green 2009-01-25 06:11:37 UTC
Looks like I spoke too soon.  With encryption off, I started an iperf test lasting for 60 minutes.  At some point during this, the crash happened.
Comment 34 Luis Chamberlain 2009-01-25 11:48:03 UTC
Marco -- if you cannot even associate with your AP that is a separate issue; please open up a separate bug report on that and please use the latest vanilla stock kernel on kernel.org. I'd simply recommend upgrading to 2.6.28.2 (Ubuntu Jaunty has this) though, as there are a lot of changes and enhancements for ath9k there. Alternatively you can also just try to install compat-wireless, which is based on the wireless-testing git tree:

http://wireless.kernel.org/en/users/Download

Barry -- I cannot reproduce on 5 GHz on all sorts of different configurations, I'm going to try 2 GHz now. Since we cannot get an oops message to work with can you help us try to narrow it down further? For instance can you try to scp over a 100M file to a box, if it hits the bug then try 50 M, if it doesn't then try 200M (after a fresh reboot).
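
The halve-or-double procedure suggested above can be written down as a small search. The Python below is a rough sketch under two assumptions that may not hold for this bug: that a crash at one size implies a crash at every larger size, and that each probe is repeatable. The crashes callback is hypothetical; in practice each call stands for a fresh reboot followed by one manual transfer.

```python
def narrow_crash_size(crashes, start_mb=100, min_mb=1, max_mb=3200,
                      tolerance_mb=10):
    """Return (largest size seen OK, smallest size seen crashing) in MB.

    `crashes` is a HYPOTHETICAL callback: it stands for one transfer of
    the given size after a fresh reboot and returns True if the box hung.
    Either element of the result is None if it was never observed.
    """
    ok, bad = None, None
    size = start_mb
    # Phase 1: halve after a crash, double after a success, until both
    # an OK size and a crashing size are known (or the limits are hit).
    while (ok is None or bad is None) and min_mb <= size <= max_mb:
        if crashes(size):
            bad = size
            size //= 2
        else:
            ok = size
            size *= 2
    # Phase 2: bisect between the two until the gap is small enough.
    while ok is not None and bad is not None and bad - ok > tolerance_mb:
        mid = (ok + bad) // 2
        if crashes(mid):
            bad = mid
        else:
            ok = mid
    return ok, bad
```

If the hang turns out to be timing-dependent rather than size-dependent, this search will give inconsistent answers, which is itself a useful data point.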

Also please provide the output of:

free -m
cat /proc/cpuinfo
cat /proc/interrupts

lspci
Comment 35 Luis Chamberlain 2009-01-25 14:31:00 UTC
Been running iperf over WPA2 AES on 11n-only HT20 configuration and I see no crashes. I did see a lag in communication though, but after a few seconds everything was running smoothly. So, so far I am unable to reproduce this. Unfortunately, without a captured oops message and without being able to reproduce this, we cannot help you. Will try to see if we have the APs you described at the office, but what would help the most is to try to get an oops from you.

Can you try to use the latest vanilla 2.6.27, which is now at 2.6.27.13, from kernel.org?

http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.27.13.tar.bz2

This kernel doesn't have any ath9k specific patches but I am curious if other parts of the kernel have been massaged which may fix your issue.

Additionally you can try the latest wireless-testing git tree which is based as of this writing on 2.6.29-rc2, it just has also the latest wireless subsystem enhancements.

Wireless testing:

http://wireless.kernel.org/en/developers/Documentation/git-guide

If you don't want to compile the entire kernel you can just try compat-wireless to get only the wireless subsystem upgraded:

http://wireless.kernel.org/en/users/Download

I am very interested in getting your issue fixed on the 2.6.27 kernel series, kernel crashes are simply unacceptable, but would like to know if you still are seeing your issue on the latest wireless subsystem (through wireless-testing or compat-wireless) on later kernels (2.6.28 or 2.6.29-rc2).
Comment 36 Luis Chamberlain 2009-01-25 14:31:57 UTC
Oh BTW my new tests were now on 2.4 GHz, so no luck there either on trying to reproduce this.
Comment 37 Luis Chamberlain 2009-01-25 14:37:37 UTC
The fact that you cannot reproduce on a console makes me wonder if this is really ath9k specific. Your /proc/interrupts output is appreciated.

Can you try to test again over the console and see if you can reproduce?
Comment 38 Marco Tessarotto 2009-01-25 16:30:49 UTC
Created attachment 19987 [details]
syslog of communication with two AP

before reading the last posts I recompiled ath9k driver with:
+#define DBG_DEFAULT (ATH_DBG_ANY)

Hoping it helps, I am attaching a syslog where two AP are used:
- 00:14:6c:a9:71:40 : NETGEAR DG834G + Linksys WMP300N PCI card, WPA-PSK: association works and link is established
- 00:18:39:a7:e1:d1 : Linksys WAG325N Wireless-N router + Linksys WMP300N PCI card, WPA-PSK: association does not work 

the preshared key is the same for both AP, also the SSID

now I will try compat-wireless.
Comment 39 Sujith 2009-01-25 18:47:29 UTC
Please disable Network Manager, and use wpa_supplicant from the console and see if the problem still occurs.
Comment 40 Sujith 2009-01-25 19:03:11 UTC
NetworkManager: <WARN>  get_secrets_cb(): Couldn't get connection secrets: applet-device-wifi.c.1512 (get_secrets_dialog_response_cb): canceled. 
NetworkManager: <info>  (wlan0): device state change: 6 -> 9 
NetworkManager: <info>  Activation (wlan0) failed for access point (NTGR_p5ypqysp1fqt3cdu5iaarxz) 
NetworkManager: <info>  Marking connection 'Auto NTGR_p5ypqysp1fqt3cdu5iaarxz' invalid. 
NetworkManager: <info>  Activation (wlan0) failed. 
NetworkManager: <info>  (wlan0): device state change: 9 -> 3 
NetworkManager: <info>  (wlan0): deactivating device (reason: 0).

That was from the log. So maybe NM has something to do with this.
Comment 41 Marco Tessarotto 2009-01-26 00:34:58 UTC
I have already tried to stop NetworkManager and use wpa_supplicant from the console, but the result is the same.
Comment 42 Luis Chamberlain 2009-01-26 07:21:20 UTC
Marco -- please keep your issue, which is irrelevant to this bug report, separate. This specific bug report is about a hang that is caused by connecting to an AP. Your issue is not being able to establish a connection at all. Please file a separate bug report.
Comment 43 Barry Green 2009-01-31 22:22:54 UTC
Output from "free -m":

             total       used       free     shared    buffers     cached
Mem:          2024       1789        234          0         48       1150
-/+ buffers/cache:        590       1433
Swap:         4690          3       4687

Output from /proc/cpuinfo:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 67
model name	: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
stepping	: 3
cpu MHz		: 1000.000
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips	: 1999.95
clflush size	: 64
power management: ts fid vid ttp tm stc

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 67
model name	: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
stepping	: 3
cpu MHz		: 1000.000
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips	: 1999.95
clflush size	: 64
power management: ts fid vid ttp tm stc


Output from /proc/interrupts:

           CPU0       CPU1       
  0:        112       2466   IO-APIC-edge      timer
  1:          0          2   IO-APIC-edge      i8042
  3:          1          6   IO-APIC-edge    
  4:          1        976   IO-APIC-edge      serial
  7:          1          0   IO-APIC-edge      parport0
  8:          0         62   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          0          4   IO-APIC-edge      i8042
 14:          0          0   IO-APIC-edge      pata_amd
 15:          0          0   IO-APIC-edge      pata_amd
 16:          9       1532   IO-APIC-fasteoi   ivtv1
 17:       1076     165749   IO-APIC-fasteoi   ivtv0
 18:      36742   10315086   IO-APIC-fasteoi   bttv0, bt878
 19:      41434   22896955   IO-APIC-fasteoi   ohci1394, nvidia
 21:      19669    4214618   IO-APIC-fasteoi   sata_nv, HDA Intel
 22:      79967   36610948   IO-APIC-fasteoi   ehci_hcd:usb2, sata_nv
 23:       8906    2074623   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
218:     110570   42612308   PCI-MSI-edge      eth2
NMI:          0          0   Non-maskable interrupts
LOC:   48716368   46390275   Local timer interrupts
RES:    3540254    1631848   Rescheduling interrupts
CAL:      98308      96379   function call interrupts
TLB:     218727     160647   TLB shootdowns
SPU:          0          0   Spurious interrupts
ERR:          1
MIS:          0

I'm just about to try the latest compat-wireless.
Comment 44 Barry Green 2009-02-08 00:08:31 UTC
Finally reproduced on a single-user console, no encryption, using 11n at 2.4GHz.  Ran iperf once, it ran OK.  Ran it again, got an immediate crash.  Using compat-wireless-2009-02-06 with the serialization patches applied.
Comment 45 Luis Chamberlain 2009-02-09 10:39:38 UTC
Barry, good to hear you are able to reproduce, how easily can you reproduce?

Did you manage to see a panic at all?

Also on the above /proc/interrupts output I do not see ath9k, can you print out that file with ath9k loaded?

Also can you reproduce the crash if you boot with this option:

maxcpus=1

You can add that to your grub menu.lst.
Comment 46 Luis Chamberlain 2009-02-09 15:26:52 UTC
Barry, please remember to enable NMI watchdog when reproducing over the console:

nmi_watchdog=1
Comment 47 Barry Green 2009-02-14 19:48:04 UTC
Just got another crash.  Using compat-wireless-2009-02-14, with your latest serialisation patches, nmi_watchdog=1.  Same as before, single user console, no encryption, but this time it happened as soon as I did "modprobe ath9k".  Didn't get any console output, just a complete freeze.  I'll try with maxcpus=1 now.

I did notice that in /proc/interrupts, I was seeing ivtv and ath sharing an interrupt.  Might this be an issue?
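
For anyone wanting to check for shared interrupt lines without reading the table by eye, here is a small Python sketch that lists every numbered IRQ with more than one handler. It assumes the /proc/interrupts layout shown in comment 43 (one count column per CPU, then the controller type, then a comma-separated handler list); shared_irqs is a name made up for this example.

```python
def shared_irqs(interrupts_text):
    """Map IRQ number -> handler names, for IRQs with 2+ handlers."""
    shared = {}
    lines = interrupts_text.splitlines()
    if not lines:
        return shared
    ncpus = len(lines[0].split())       # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():         # skip NMI, LOC, ERR, ...
            continue
        fields = rest.split()
        # Layout: one count per CPU, the controller type, then handlers.
        handlers = " ".join(fields[ncpus + 1:])
        names = [h.strip() for h in handlers.split(",") if h.strip()]
        if len(names) > 1:
            shared[int(label)] = names
    return shared


# Typical use on a live system:
#   with open("/proc/interrupts") as f:
#       print(shared_irqs(f.read()))
```

Run against the listing in comment 43 (taken without ath9k loaded), this should report IRQs 18, 19, 21, 22 and 23 as shared. With ath9k loaded, the dmesg output in comment 30 suggests it would show up on IRQ 17 next to ivtv0.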
Comment 48 Barry Green 2009-02-14 20:45:59 UTC
I enabled maxcpus=1 and I've spent the last 45 minutes trying to produce a crash, but it's been rock solid, both with and without encryption.  I used iperf to perform 10M, 100M and 1000M tests, and could not get a crash.
Comment 49 Martin Stubenschrott 2009-02-16 07:53:38 UTC
I am having the exact same issue with my AR5008 based card and ath9k and ubuntu 8.10 (or 9.04alpha4) on a corei7 cpu - and it also works flawlessly with maxcpus=1. I couldn't test on windows, as i don't have a windows cd.

I just saw windows users also have NMI:parity errors with AR5008 sometimes, so this might be related. Actually someone has a "modded" inf files which seems to work, no idea what they did to it, maybe it is a help though:

http://www.laptopvideo2go.com/forum/lofiversion/index.php/t15297-0.html

I also noticed that even when using madwifi, I get *occasional* lockups with my card, but usually "just" once a day or so, and only after transferring large amounts of data, so maybe the problem lies in a different part of the kernel or probably really in a hardware compatibility issue with multicore, as the same card works flawlessly with ath9k on my girlfriend's single CPU machine.
Comment 50 jasin colegrove 2009-02-22 23:41:10 UTC
I am pretty sure I ran into this bug as well on archlinux using the 2.6.28 kernel. I also have a dual core AMD athlonX2 4200+ cpu. The wireless card I have is a linksys wmp110. It reports to lspci as an AR5008.

My issue is when running this with the ath9k module loaded it randomly hard locks my computer. I always have to use the reboot. I have seen it hard lock while doing iwlist wlan0 scan. I have also seen it work for 2 hours without issue, then for no apparent reason it hard locks.

I have to admit, this is my first wireless setup. But, getting the connection setup is not the problem. It's keeping it without locking up my PC. I have tried with encryption and no encryption. It doesn't make a difference. 

I will try the maxcpus=1 on boot and see if that helps. After that I will try the 2.6.29 rc kernel and see if that helps. I will report back, I need to get this fixed.
Comment 51 jasin colegrove 2009-02-23 10:15:08 UTC
Like Barry, I enabled maxcpus=1 on boot and my computer has been rock solid for 20 hours now. I have downloaded multiple torrents over 1.4gb and used youtube.com under four tabs in FF for couple hours. Could not get my computer to hard lock. I also tried spamming the iwlist wlan0 scan command which had also caused a hard lock for me. Still rock solid.
Comment 52 Luis Chamberlain 2009-02-27 10:58:05 UTC
Created attachment 20383 [details]
SMP fixes v3

Barry can you try these patches.
Comment 53 jasin colegrove 2009-02-28 14:38:10 UTC
I can't get to the patch, says 'Not Found'

Not that they would work for me, but I would at least like to try them out as well.
Comment 54 Martin Stubenschrott 2009-03-01 01:45:46 UTC
Using a little common sense (browsing the ath9k directory of that link), I found that: http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2009-02-24/serialization-v3.patch

But why wouldn't they work for you?
Comment 55 jasin colegrove 2009-03-01 08:51:42 UTC
Issue is resolved for me, when can we expect to see this patch pushed into the main tree?
Comment 56 Barry Green 2009-03-02 04:00:36 UTC
I used compat-wireless-2009-02-26 with the new patches, and while I would love to report success, I actually got a crash the second time I ran iperf with a 50MB data transfer.
Comment 57 Luis Chamberlain 2009-03-02 17:59:30 UTC
It's not going upstream just yet, as I am not happy with the patch; in fact, I think the d1 d0 patch is just a nasty paper-bag fix for another issue, which I'm trying to zero in on now.

Will post as soon as I have updates. Thanks for testing and the reports -- they help a lot.
Comment 58 Luis Chamberlain 2009-03-09 19:52:33 UTC
Created attachment 20475 [details]
Serialization v4

Give these a shot, these are now on their way to wireless-testing.
Comment 59 Luis Chamberlain 2009-03-10 20:13:59 UTC
You can try out the compat-wireless stuff now too, that has these patches merged.
Comment 60 Barry Green 2009-03-11 03:39:42 UTC
Tested iperf with 10M, 100M, 1000M and 2000M and no crash.  Looking good, well done Luis!
Comment 61 Martin Stubenschrott 2009-03-12 12:35:02 UTC
Doesn't seem to work for me, tested with compat-wireless-2009-03-12

At first I could not even load the module correctly, as I have those options:
options cfg80211 ieee80211_regdom=EU

[   94.988866] cfg80211: Using static regulatory domain info
[   94.988871] cfg80211: Regulatory domain: EU
[   94.988873] 	(start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[   94.988875] 	(2402000 KHz - 2482000 KHz @ 40000 KHz), (600 mBi, 2000 mBm)
[   94.988878] 	(5170000 KHz - 5190000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[   94.988880] 	(5190000 KHz - 5210000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[   94.988883] 	(5210000 KHz - 5230000 KHz @ 40000 KHz), (600 mBi, 2300 mBm)
[   94.988885] 	(5230000 KHz - 5330000 KHz @ 40000 KHz), (600 mBi, 2000 mBm)
[   94.988887] 	(5490000 KHz - 5710000 KHz @ 40000 KHz), (600 mBi, 3000 mBm)
[   95.084872] ath9k 0000:09:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[   95.518451] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[   95.518454] IP: [<ffffffffa0c1934e>] freq_reg_info_regd+0x2e/0x180 [cfg80211]
[   95.518461] PGD 1a98ba067 PUD 14d1d3067 PMD 0 
[   95.518463] Oops: 0000 [1] SMP 
[   95.518465] CPU 3 
[   95.518466] Modules linked in: ath9k(+) mac80211 rfkill cfg80211 led_class binfmt_misc rfcomm bridge stp bnep sco l2cap bluetooth ipv6 ppdev acpi_cpufreq cpufreq_powersave cpufreq_stats cpufreq_ondemand freq_table cpufreq_conservative cpufreq_userspace container sbs sbshc video output pci_slot wmi battery iptable_filter ip_tables x_tables ac sbp2 parport_pc lp parport snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy nvidia(P) snd_seq_oss i2c_core snd_seq_midi pcspkr evdev snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd button soundcore shpchp pci_hotplug snd_page_alloc ext3 jbd mbcache sr_mod cdrom pata_acpi sd_mod crc_t10dif sg pata_jmicron usbhid hid ata_generic ahci libata ohci1394 scsi_mod ieee1394 uhci_hcd dock ehci_hcd usbcore thermal processor fan fbcon tileblit font bitblit softcursor fuse
[   95.518503] Pid: 6841, comm: modprobe Tainted: P          2.6.27-13-generic #1
[   95.518504] RIP: 0010:[<ffffffffa0c1934e>]  [<ffffffffa0c1934e>] freq_reg_info_regd+0x2e/0x180 [cfg80211]
[   95.518509] RSP: 0018:ffff88014d519b48  EFLAGS: 00010286
[   95.518510] RAX: 0000000000000000 RBX: ffffffffa0cadfe0 RCX: ffff88014d519ba0
[   95.518511] RDX: ffff88014d519bac RSI: 000000000024cde0 RDI: ffff88015e5840a0
[   95.518513] RBP: ffff88014d519b78 R08: ffffffffa0ca5b28 R09: 0000000000000004
[   95.518514] R10: ffff88014d519b08 R11: ffffffffa0c23228 R12: 0000000000000000
[   95.518515] R13: ffff88015e585e58 R14: 0000000000000000 R15: ffff88014d519bac
[   95.518517] FS:  00007fee20e7e6e0(0000) GS:ffff8801af06b080(0000) knlGS:0000000000000000
[   95.518518] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   95.518519] CR2: 0000000000000004 CR3: 00000001aa5f3000 CR4: 00000000000006e0
[   95.518521] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   95.518522] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   95.518524] Process modprobe (pid: 6841, threadinfo ffff88014d518000, task ffff880152088000)
[   95.518525] Stack:  ffff88014d519ba0 ffffffffa0cadfe0 0000000000000000 ffff88015e585e58
[   95.518528]  0000000000000000 0000000000000000 ffff88014d519bd8 ffffffffa0c1959b
[   95.518531]  0000000000000004 ffffffffa0ca5b28 ffff88015e5840a0 0000000000000000
[   95.518534] Call Trace:
[   95.518538]  [<ffffffffa0c1959b>] wiphy_apply_custom_regulatory+0xdb/0x180 [cfg80211]
[   95.518546]  [<ffffffffa0c8d99c>] ath_attach+0x5fc/0xb70 [ath9k]
[   95.518551]  [<ffffffff80254929>] ? tasklet_init+0x9/0x30
[   95.518557]  [<ffffffffa0c95652>] ath_pci_probe+0x192/0x370 [ath9k]
[   95.518561]  [<ffffffff803a4d6a>] ? kobject_get+0x1a/0x30
[   95.518565]  [<ffffffff803bc219>] ? pci_match_id+0x9/0xa0
[   95.518568]  [<ffffffff803bd1c0>] pci_device_probe+0xe0/0x140
[   95.518572]  [<ffffffff8034c033>] ? sysfs_create_link+0x13/0x20
[   95.518577]  [<ffffffff80431a82>] really_probe+0x72/0x1a0
[   95.518580]  [<ffffffff80431c00>] driver_probe_device+0x50/0x60
[   95.518583]  [<ffffffff80431c9b>] __driver_attach+0x8b/0x90
[   95.518585]  [<ffffffff80431c10>] ? __driver_attach+0x0/0x90
[   95.518588]  [<ffffffff8043120b>] bus_for_each_dev+0x6b/0xa0
[   95.518591]  [<ffffffff802e23eb>] ? kmem_cache_alloc+0x8b/0xd0
[   95.518596]  [<ffffffffa0087000>] ? ath9k_init+0x0/0x57 [ath9k]
[   95.518600]  [<ffffffff804318e1>] driver_attach+0x21/0x30
[   95.518602]  [<ffffffff80430a78>] bus_add_driver+0x1f8/0x270
[   95.518607]  [<ffffffffa0087000>] ? ath9k_init+0x0/0x57 [ath9k]
[   95.518611]  [<ffffffff80431e95>] driver_register+0x75/0x170
[   95.518616]  [<ffffffffa0087000>] ? ath9k_init+0x0/0x57 [ath9k]
[   95.518619]  [<ffffffff803bd4c2>] __pci_register_driver+0x72/0xc0
[   95.518624]  [<ffffffffa0087000>] ? ath9k_init+0x0/0x57 [ath9k]
[   95.518630]  [<ffffffffa0c95853>] ath_pci_init+0x23/0x28 [ath9k]
[   95.518635]  [<ffffffffa008701e>] ath9k_init+0x1e/0x57 [ath9k]
[   95.518639]  [<ffffffff8020a041>] do_one_initcall+0x41/0x170
[   95.518642]  [<ffffffff8026c3e1>] ? __blocking_notifier_call_chain+0x21/0x90
[   95.518647]  [<ffffffff8027d1a5>] sys_init_module+0xb5/0x1f0
[   95.518650]  [<ffffffff8021285a>] system_call_fastpath+0x16/0x1b
[   95.518651] 
[   95.518652] 
[   95.518652] Code: e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 e8 4a 93 5f df 48 8b 05 5b 20 01 00 4c 8b 1d 4c 20 01 00 48 89 4d d0 4d 85 c0 49 89 d7 <8b> 40 04 4d 0f 45 d8 83 f8 03 0f 84 aa 00 00 00 83 e8 01 0f 84 
[   95.518671] RIP  [<ffffffffa0c1934e>] freq_reg_info_regd+0x2e/0x180 [cfg80211]
[   95.518676]  RSP <ffff88014d519b48>
[   95.518677] CR2: 0000000000000004
[   95.518680] ---[ end trace 1cbd0f4b44c744df ]---
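
A trace like the one above can be condensed before it is posted to a list. The following is a minimal Python sketch, not part of the original report, that pulls out the faulting symbol and the reliable call-chain entries (those without a leading `?`). The function name `summarize_oops` and the regular expression are illustrative assumptions, and `SAMPLE` is a shortened copy of the log above.

```python
import re

# Matches entries such as:
#   [<ffffffffa0c8d99c>] ath_attach+0x5fc/0xb70 [ath9k]
#   [<ffffffff80254929>] ? tasklet_init+0x9/0x30
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)

# Shortened excerpt of the oops above, used only as sample input.
SAMPLE = """\
[   95.518454] IP: [<ffffffffa0c1934e>] freq_reg_info_regd+0x2e/0x180 [cfg80211]
[   95.518534] Call Trace:
[   95.518538]  [<ffffffffa0c1959b>] wiphy_apply_custom_regulatory+0xdb/0x180 [cfg80211]
[   95.518546]  [<ffffffffa0c8d99c>] ath_attach+0x5fc/0xb70 [ath9k]
[   95.518551]  [<ffffffff80254929>] ? tasklet_init+0x9/0x30
[   95.518557]  [<ffffffffa0c95652>] ath_pci_probe+0x192/0x370 [ath9k]
[   95.518651]
[   95.518671] RIP  [<ffffffffa0c1934e>] freq_reg_info_regd+0x2e/0x180 [cfg80211]
"""


def summarize_oops(text):
    """Return ((symbol, module), [(symbol, module), ...]) for an oops.

    The first element is taken from the first "IP:" line; the list holds
    call-trace entries not marked with "?". Module is None for symbols
    built into the kernel.
    """
    fault = None
    frames = []
    in_trace = False
    for line in text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        match = FRAME_RE.search(line)
        if match is None:
            # The first non-frame line ends the call trace.
            in_trace = False
            continue
        unreliable, symbol, module = match.groups()
        if in_trace:
            if not unreliable:
                frames.append((symbol, module))
        elif fault is None and "IP:" in line:
            fault = (symbol, module)
    return fault, frames


fault, frames = summarize_oops(SAMPLE)
print("fault:", fault)
for symbol, module in frames:
    print("  called from:", symbol, "in", module or "kernel")
```

On this trace the summary points at `freq_reg_info_regd` in cfg80211, reached from `ath_attach` through `wiphy_apply_custom_regulatory`.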


Without those options, loading the US regulatory settings, I could load the module without any errors in dmesg. But a few seconds later (I guess when nm-applet tries to connect to the network), the same complete lockup as before: No mouse, no keyboard, nothing. Reboot required. 

Maybe this is a slightly different error, as I always get the lockups while connecting, while some other users here, like Barry, could connect with the 'old' ath9k but only got lockups after transferring data?
Comment 62 Luis Chamberlain 2009-03-12 14:00:44 UTC
The serialization patch series was reverted from wireless-testing, which is why you got a hang. We're going to submit something smaller.
Comment 63 Luis Chamberlain 2009-03-13 15:43:21 UTC
Created attachment 20518 [details]
Backport of serialization patches -- v6 serialization for 2.6.27-2.6.29

The 2.6.30 patches have been submitted to John. If you want to use this on older kernels, you can use the backported versions of the patches. Once John propagates the 2.6.29 patches to Linus, the other stable series will get the fixes for 2.6.27 and 2.6.28. Those who can't wait can use the patches in the folder here.
Comment 64 jasin colegrove 2009-03-14 06:45:46 UTC
I just thought I'd chime back in on this issue. First of all, thank you Luis and Barry (debugging work) for working on this issue and getting it resolved. I sure do appreciate it.

I applied the patch to the 2.6.28 source and compiled the kernel. All went well; there were absolutely no issues with the patch or the compilation. The machine has also been running for close to 12 hours now with no signs of problems.

I think I am fully satisfied with this patch for now, thanks again for all the hard work.