Bug 42092

Summary: ath9k: TX hangs every 30 seconds with ANI enabled
Product: Networking Reporter: Robert (robert.hogberg)
Component: Wireless Assignee: Adrian Chadd (adrian)
Status: ASSIGNED ---    
Severity: normal CC: adrian, alan, ath9k-devel, caleb, inbox-bpnyjac, linville, markuz, mcgrof, shafi.wireless, Steffen.Public, unsuspicious.fakename+kernel
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.5 Subsystem:
Regression: No Bisected commit-id:

Description Robert 2011-08-30 17:57:54 UTC
My NIC is an AR5416 based PCI card:
Atheros Communications Inc. AR5008 Wireless Network Adapter [168c:0023] (rev 01)

Every 30 seconds it stops transmitting packets for almost two seconds:

> ping -i 0.1 192.168.6.60
[ ... ]
> 64 bytes from 192.168.6.60: icmp_req=93 ttl=64 time=1.84 ms
> 64 bytes from 192.168.6.60: icmp_req=94 ttl=64 time=1.47 ms
> 64 bytes from 192.168.6.60: icmp_req=95 ttl=64 time=1959 ms
> 64 bytes from 192.168.6.60: icmp_req=96 ttl=64 time=1854 ms

Enabling debug output reveals the reason for this:
> ath: tx hung, resetting the chip

The 30 seconds come from this define:
> ath9k.h:#define ATH_LONG_CALINTERVAL      30000   /* 30 seconds */

Changing this value changes the frequency of the problem.

Somehow this problem is also related to ANI, since if I disable ANI (echo 1 > /sys/kernel/debug/ieee80211/phy0/ath9k/disable_ani) the problem disappears.

I use bleeding edge compat-wireless from 2011-08-27.
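
A minimal sketch for pulling the stalls out of ping output like the above, so the interval between them can be measured rather than eyeballed. The function names, the 500 ms threshold and the `max_gap` value are illustrative assumptions, not anything taken from ath9k.

```python
import re

# Matches both the older "icmp_req=" and the newer "icmp_seq=" ping output.
PING_RE = re.compile(r"icmp_(?:req|seq)=(\d+).*?time=([\d.]+) ms")


def find_spikes(lines, threshold_ms=500.0):
    """Return (sequence number, RTT in ms) for every reply at or above threshold_ms."""
    spikes = []
    for line in lines:
        m = PING_RE.search(line)
        if m and float(m.group(2)) >= threshold_ms:
            spikes.append((int(m.group(1)), float(m.group(2))))
    return spikes


def spike_starts(spikes, max_gap=5):
    """Collapse runs of slow replies into the first sequence number of each run.

    Two slow replies belong to the same run when their sequence numbers
    differ by at most max_gap (an arbitrary, assumed value).
    """
    starts = []
    prev = None
    for seq, _rtt in spikes:
        if prev is None or seq - prev > max_gap:
            starts.append(seq)
        prev = seq
    return starts


# The four replies quoted in the description.
sample = [
    "64 bytes from 192.168.6.60: icmp_req=93 ttl=64 time=1.84 ms",
    "64 bytes from 192.168.6.60: icmp_req=94 ttl=64 time=1.47 ms",
    "64 bytes from 192.168.6.60: icmp_req=95 ttl=64 time=1959 ms",
    "64 bytes from 192.168.6.60: icmp_req=96 ttl=64 time=1854 ms",
]
```

On the sample, `find_spikes` flags requests 95 and 96 and `spike_starts` reduces them to a single stall starting at 95. With `ping -i 0.1`, multiplying the difference between successive start numbers by 0.1 gives the stall period in seconds.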
Comment 1 Robert 2011-08-30 19:43:32 UTC
This bug has been around since at least September 2010. The bug was unmasked by this fix:
https://patchwork.kernel.org/patch/181422/
Comment 2 John W. Linville 2011-08-30 19:57:04 UTC
Have you tried reverting that patch?
Comment 3 Robert 2011-08-30 20:40:22 UTC
(In reply to comment #2)
> Have you tried reverting that patch?

I haven't tried to revert it from current wireless code.

However, I took compat-wireless-2.6.36-4, which is the most recent compat-wireless release that works well for me, applied the patch mentioned in comment 1, and saw this bug surface.

So the patch in comment 1 seems to reveal this bug, but I don't think there's anything wrong with the patch itself. I think compat-wireless-2.6.36-4 works well for me because that code has an ath9k problem where ANI is erroneously disabled, and with ANI disabled this bug isn't triggered. The patch in comment 1 fixes that ANI problem, which re-enables ANI, and then this bug shows up.

Maybe I should have used this link for the patch mentioned earlier. It contains a comment explaining the change:
http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2010/09/14/ps-fixes-09-14/v2.6.36/0003-ath9k-fix-enabling-ANI-tx-monitor-after-bg-scan.patch
Comment 4 Caleb Hearon 2011-08-31 00:52:35 UTC
Same problem.

$ echo 0 > /sys/kernel/debug/ieee80211/phy0/ath9k/disable_ani
$ ping google.com
PING google.com (74.125.91.147) 56(84) bytes of data.
64 bytes from qy-in-f147.1e100.net (74.125.91.147): icmp_req=1 ttl=48 time=50.9 ms
...
64 bytes from qy-in-f147.1e100.net (74.125.91.147): icmp_req=29 ttl=48 time=1077 ms
...omitted 30 pings with ~40ms response time, then:
64 bytes from qy-in-f147.1e100.net (74.125.91.147): icmp_req=60 ttl=48 time=2019 ms
64 bytes from qy-in-f147.1e100.net (74.125.91.147): icmp_req=61 ttl=48 time=1020 ms
...Another 30 pings with ~40ms response, then:
64 bytes from qy-in-f147.1e100.net (74.125.91.147): icmp_req=92 ttl=48 time=1992 ms
Etc...

As soon as I echo 1 into /sys/kernel/debug/ieee80211/phy0/ath9k/disable_ani, the problem goes away.
I used compat-wireless-2011-08-25 to compile the ath9k driver, but I first noticed the bug in
the version bundled with Ubuntu 11.04.

I haven't tried applying the patch Robert has mentioned.  I'm using the AR5008 chipset:

05:01.0 Network controller [0280]: Atheros Communications Inc. AR5008 Wireless Network Adapter [168c:0023] (rev 01)

It is a D-Link DWA-552
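
The on/off comparison above can be scripted so that the same ping run is repeated with ANI enabled and disabled. This is only a sketch: the debugfs path is the one quoted in this report, `phy0` is assumed to be the affected card, and the helper name is made up for illustration. Writing to the real file needs root and a mounted debugfs.

```python
from pathlib import Path

# Path as quoted in this report; adjust phy0 if the card registered differently.
ANI_KNOB = Path("/sys/kernel/debug/ieee80211/phy0/ath9k/disable_ani")


def set_ani_disabled(disabled, knob=ANI_KNOB):
    """Write 1 to disable ANI or 0 to re-enable it, mirroring the echo commands above."""
    Path(knob).write_text("1\n" if disabled else "0\n")
```

Calling `set_ani_disabled(True)` before one ping run and `set_ani_disabled(False)` before the next gives two logs that can be compared directly.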
Comment 5 Adrian Chadd 2011-08-31 03:03:12 UTC
I'll take a look at this; I have the hardware lying about.

ANI on FreeBSD/AR5416 has worked for me, so either:

* something in how ath9k is doing ANI is a bit strange; or
* my environment doesn't annoy the AR5416+ANI like these posters are seeing.
Comment 6 markuz 2012-05-28 12:24:43 UTC
Any updates on this? It makes my Linux box pretty much unusable because of the lag :(
Comment 7 Alan 2012-08-30 13:29:34 UTC
What kernel are you still seeing this on?
Comment 8 Robert 2012-09-18 08:57:29 UTC
I tested a 3.5.4 kernel from the Ubuntu kernel PPA (http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.4-quantal/) on an Ubuntu 12.04.1 installation and I could not reproduce this bug with that kernel.

I haven't tried a vanilla kernel yet. I'll try that next, but it'll take a while since I don't have easy access to this NIC at the moment.
Comment 9 Robert 2012-10-27 16:40:33 UTC
I was wrong in comment 8. I still see this in vanilla Linux kernel 3.6.2.

One thing that's probably worth pointing out is that the bug doesn't seem to show if I first boot Windows XP and then boot Linux (soft reboot). If I first boot Windows XP, the NIC works perfectly with any kernel. If I cold boot into Linux, this bug always shows.
Comment 10 Adrian Chadd 2012-10-29 01:45:52 UTC
There aren't many parameters that could have changed in that way.

The ath9k driver should be doing a complete cold reset of the chip upon attach. There are only a few parameters that can't be modified after power-on.

Can you please provide me with which windows driver version you're using? Every bit of information is going to help - I'll try to track down the internal driver build from that period of time and see what registers have changed.

The trouble is that on the pre-AR9220/AR9280 NICs, a bunch of register values were written in via a serial shift register (all the "addac" and PCIe PHY updates), so it's not as easy as "dump the register contents to figure out the differences".

But that's a great data point, thanks!

What about if you boot Linux -> Windows -> Linux?

(I'd even say - what if you booted FreeBSD -> Linux? FreeBSD's install disk should have AR5416 support, so if you bring up a livefs and configure the supplicant manually (/etc/wpa_supplicant.conf) you should be able to connect to your local network and pass some traffic. If you download a FreeBSD-HEAD snapshot then it'll do 802.11n out of the box on Atheros NICs.)

If you _could_ test FreeBSD that would be great - we all have access to FreeBSD's HAL code. :-)
Comment 11 Robert 2012-11-12 16:56:37 UTC
(In reply to comment #10)
> 
> Can you please provide me with which windows driver version you're using? 

This is the driver:
http://www.tp-link.com/resources/software/200912252180711.zip

The driver version reported by Windows XP is 7.7.0.329


> What about if you boot Linux -> Windows -> Linux?

The Windows driver fails to "repair" the situation here. Bug still appears when booting Linux a second time.


> If you _could_ test FreeBSD that would be great - we all have access to
> FreeBSD's HAL code. :-)

I tried to boot FreeBSD 9.0 and 9.1 RC3 from a USB stick, but FreeBSD failed to boot on this machine (couldn't mount root). I could probably try harder to get FreeBSD running if that would help.
Comment 12 Adrian Chadd 2012-11-12 17:17:06 UTC
Grr. Would you just type '?' at the freebsd mountroot prompt? Let's see what filesystems are available.

Hopefully it's just a case of "someone messed up the installer" and you can manually enter the root filesystem there.

I've started asking around internally to see if I can find the source tree used to build revision '7.7.0.329' of the driver.

Thanks!



Adrian
Comment 13 Robert 2012-11-12 17:39:55 UTC
(In reply to comment #12)
> Grr. Would you just type '?' at the freebsd mountroot prompt? Let's see what
> filesystems are available.

I think da0 was the USB device (kernel listed it as that), but it wasn't listed when I typed '?'. Only the internal IDE HDD and its partitions were listed.

I could probably boot a FreeBSD CD/DVD or move the NIC to a more modern computer to get FreeBSD running.

The NIC is currently out of my reach, but I'll get back to you with the results from trying out FreeBSD within a few weeks. Sorry for the delays and thanks for your patience :-)
Comment 14 Robert 2012-12-08 14:37:53 UTC
I did some tests with FreeBSD 9.1 RC3:

FreeBSD -> Linux:
Working    Working

Linux -> FreeBSD -> Linux:
Broken   Working    Working

Seems like booting FreeBSD helps avoid this bug.
Comment 15 unsuspicious.fakename+kernel 2013-03-17 11:24:40 UTC
Same problem here. 

What's the "NEEDINFO"? 
Any "experiments" still required?
Comment 16 inbox-bpnyjac 2014-08-30 02:01:22 UTC
I have an Atheros AR9271 using the ath9k_htc driver (tested with both the manually compiled open firmware and the normal firmware), and I experienced a similar bug.
The solution was to specify the BSSID manually in the NetworkManager applet (nm-applet).
This was double checked: without the BSSID set the problem is back; with it set the problem disappears. The BSSID is the MAC address of the wireless interface of your wireless router (not the router's WAN/LAN MAC address).
Comment 17 Steffen.Public 2014-10-09 16:08:14 UTC
Setting the BSSID also fixes the problem for me.
ID 0846:9030 NetGear, Inc. WNA1100 Wireless-N 150 [Atheros AR9271]
Comment 18 Robert 2014-10-09 16:43:41 UTC
Comment #16 and comment #17 describe a problem related to the background scans done by NetworkManager (NetworkManager bug here: https://bugzilla.gnome.org/show_bug.cgi?id=513820), which is not the same problem as the one in this bug's original description.

Explicitly setting BSSID in NetworkManager fixes the problem with NetworkManager background scans, but will not fix the "tx hung" errors this bug is about.

When you experience the NetworkManager problem you probably don't see the network glitches every 30 seconds, but rather every 120 seconds.
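
A rough way to tell the two apart from a ping log is to look at the period between stalls. The sketch below assumes you already have the ping sequence number at which each stall started; the function name, the labels and the 20% tolerance are illustrative assumptions, and only the 30-second and 120-second figures come from this report.

```python
from statistics import median


def classify_period(start_seqs, ping_interval_s=1.0, tolerance=0.2):
    """Guess which problem a series of stalls matches from their median period.

    start_seqs: ping sequence numbers at which each stall began, in order.
    ping_interval_s: the interval passed to ping -i (1.0 if -i was not used).
    """
    if len(start_seqs) < 2:
        return "unknown"
    gaps = [(b - a) * ping_interval_s for a, b in zip(start_seqs, start_seqs[1:])]
    period = median(gaps)
    candidates = (
        (30.0, "tx-hang (this bug, 30 s period)"),
        (120.0, "NetworkManager background scan (120 s period)"),
    )
    for expected, label in candidates:
        if abs(period - expected) <= tolerance * expected:
            return label
    return "unknown"
```

For example, the stalls in comment 4 begin at requests 29, 60 and 92 with the default one-second ping interval, which gives a median period of 31.5 seconds and so matches the 30-second case.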
Comment 19 inbox-bpnyjac 2014-10-09 23:12:01 UTC
Hello Robert!

No. There are THREE problems. 
1) Periodic hangs - easily detected with the built-in network latency monitor of OpenArena (a Quake 3 based shooter). Just connect to *any* server and watch the corner of the screen. These are caused by NetworkManager polling for the BSSID. Specifying the BSSID stops these lags. This can be tested over and over. It is not firmware related, as the proprietary and open firmware behave the same. Replacing NetworkManager (e.g. with wicd) also fixes the problem. Note that ping is not affected by this.

2) Periodic stalls. These cause the system to lag significantly: a ping of google.com returns something between 700 and 1000 ms, compared with 50 ms when not affected. Every single ping response is delayed. This can be triggered by disconnecting from and reconnecting to the same network in NM. This is the re-polling problem. It's not the same as the BSSID problem.

3) Tx hangs. I was not affected by this, and this is a firmware/hardware issue.