Bug 14802

Summary: ath5k + hostapd causes flood of interrupts
Product: Drivers Reporter: Bartek Knapek (kernel)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED CODE_FIX    
Severity: normal CC: linville, me
Priority: P1    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.32 Subsystem:
Regression: No Bisected commit-id:
Attachments: ath5k debug intr
ath5k+intr debug+timestamp
ath5k debug beacon
beacon_int - different values
beacon debug - an event every 5s
don't use the external clock
patch applied, beacon_int=1000

Description Bartek Knapek 2009-12-13 20:57:48 UTC
I am running a home wireless router on Slackware 13.0 / 2.6.29.6-smp.
WPA2 and g only.
Athlon 2400+ (single-core CPU) / Asus A7N8X-X. 
The wireless device is Planet WL-8310 (Atheros AR5213A).

madwifi 0.9.4 + hostapd 0.6.9 work fine.

I try to migrate to ath5k.

Initially I have used a patch to introduce the AP mode support into ath5k from my kernel, but after having started hostapd the kernel was complaining about "ath5k phy0: beacon queue 7 didn't stop", and besides the AP was not stable: hard to connect, then dropping the connections at random intervals.

So I have upgraded compat-wireless to 2.6.32. I no longer see the "beacon queue" message in the logs, but the AP is still not stable. 

From the server perspective I see that the system chokes. Runs fine a couple of seconds, then everything goes very slow for a couple of seconds, and so on. On "top" I see that the software interrupts (si) figure changes between 30-90%. With madwifi it is usually 0-3%. 

I have checked /proc/interrupts and calculated the delta increase, every 1 second, of the interrupts the wireless device is using (loop: cat /proc/interrupt >> file ; sleep 1). This is the only device on this interrupt. The following pattern repeats all the time (the exact values differ of course):

new interrupts related to wireless device, every 1s
:
91 
41
5
445
1059
1181
1229
1493
909
7
3
4
615
:
Using madwifi the number of new interrupts every second is always ~60.

Also the Local timer interrupts are affected:
:
1350
1018
1018
1018
1361
2575
1078
1017
1018
1328
:
With madwifi it is always ~1018.

The peaks in Local timer interrupts are not correlated with the peak of wireless device interrupts.

I have set kernel debug to 8 (sysrq), but seen nothing, until after some 1-2 hours the following appeared:

ath5k phy0: failed to warm reset the MAC Chip                    
ath5k phy0: can't reset hardware (-5) 


I though it might be related to Bug#12647, so I have applied Patch#20356, but it did not help. 

It is 100% reproducable. I can provide traces, but you need to tell me what is of interest to you.
Comment 1 Bartek Knapek 2009-12-14 16:46:55 UTC
I have upgraded to 2.6.32 today.
Still the same: the moment I start hostapd, the flood of interrupts begin.
Comment 2 Bartek Knapek 2009-12-14 21:34:12 UTC
Created attachment 24189 [details]
ath5k debug intr

Log taken during a couple of seconds time after having loaded the ath5k module with debug=0x2.
Comment 3 Bartek Knapek 2009-12-14 22:14:07 UTC
Am I correct that the flooding interrupts are due to:
 - SW Beacon Alert
 - Beacon Not Ready ?

What does it mean ?
Comment 4 Bob Copeland 2009-12-15 01:04:49 UTC
Thanks for looking into this.  The background:

The software beacon alert (SWBA) should be fired before every beacon interval (every 100ms or so, by default) -- it tells the driver to load a beacon into the beacon queue on the hardware.  So it's normal to get these in AP mode.

After a short period of time, the dma beacon alert (DBA) happens, this is when it actually transmits the beacon from memory to the card and sends it.  Beacon not ready means we didn't actually load a beacon in time for the hardware.  This is bad.

Do you have any timing information in the logs?  Could you try with CONFIG_PRINTK_TIME?
Comment 5 Bartek Knapek 2009-12-15 10:22:22 UTC
Created attachment 24193 [details]
ath5k+intr debug+timestamp

I attach a time-stamped interrupt log.

I can see two different time intervals in the log:
 - status 0x100001 usually appears every ~100ms, but there are exceptions with longer time in-between.
 - status 0x110000 floods every ~0.85ms.
Comment 6 Bob Copeland 2009-12-15 17:59:24 UTC
Ok, what do you have beacon_int set to in your hostapd.conf?  Can you try using ATH5K_DEBUG_BEACON and post the results?  (modprobe ath5k debug=0x10)
Comment 7 Bartek Knapek 2009-12-15 20:37:01 UTC
Created attachment 24203 [details]
ath5k debug beacon

Attaching a 60s log with debug beacon enabled.

I have beacon_int=100 in the hostapd.conf.
Comment 8 Bartek Knapek 2009-12-16 05:26:17 UTC
Created attachment 24206 [details]
beacon_int - different values

I have tried different values of beacon_int in hostapd.conf.

 - beacon_int=10000 - results in beacon every ~8.4ms
 - beacon_int=50000 - results in beacon every ~42ms (5x8.4ms)

See the attached log.

With beacon_int=50000 everything _seems_ to work fine. The interrupts are no longer flooding the system, and it is possible to use the WLAN. 

So I keep this setting as a temporary workarround.
Comment 9 Bartek Knapek 2009-12-16 07:04:38 UTC
Now I have noticed the following message in the logs:

[72530.904742] __ratelimit: 138 callbacks suppressed
[72530.904750] ath5k phy5: beacon queue 7 didn't start/stop ?
[72551.510681] ath5k phy5: beacon queue 7 didn't start/stop ?

It happened about the time I was suspending the laptop (client) this morning.
Previous suspend, and resume, did not cause any event in the logs.
Comment 10 Bartek Knapek 2009-12-19 13:42:32 UTC
I have been evaluating the workarround for a couple of days now, and it does not work OK. 

1)
The main issue is that my IBM T42 does not see this WLAN at all. When I change to madwifi, it discovers the network immediately. And it can see all the other neighboring WLANs. I have tried changing channels, tx power, tried changing from g to b, updated windows and ibm drivers. The other laptop (Compaq 6710b) does not have this problem.

2)
Sporadically "beacon queue 7 didn't start/stop ?" appears in the logs. It never appears with "beacon debug" enabled, so I cannot provide more data. (WLAN unaffected - established connections not dropped).

3)
When I enable beacon debug, the following messages repeat every 5s (a log file will follow). The number of "callback suppressed" changes from time to time, and then the new value is reported for a longer time.


Throughput is fine: 2,5 MByte/s both UL and DL.
Comment 11 Bartek Knapek 2009-12-19 13:43:28 UTC
Created attachment 24231 [details]
beacon debug - an event every 5s
Comment 12 Bob Copeland 2009-12-20 20:36:06 UTC
Very strange.  If I read it right, SWBA is being invoked every 100 microseconds instead of 100 TUs (1 TU approx = 1 ms).  But nexttbtt looks correct compared to the TSF in the beacon debugging.  There must be something wrong with the timer programming.
Comment 13 Bartek Knapek 2009-12-20 21:01:18 UTC
The SWBA interrupts - are they generated by the PC CPU, or do they come from the WLAN card? Is it possible that the Planet WLAN device I am using is broken - I mean it has some sort of a bug that results in ~1000x times faster expiration of some internal timers? The timers ath5k is using, but madwifi was not?

Since this issue is kind of a stopper for my small home network (my wife is using IBM T42 and this laptop refuses to see the network when using ath5k ;-) I have ordered another WLAN device (TL-WN722N, AR5008) - I will try setting up the AP using this one. Perhaps it will shed some more light on the problem.
Comment 14 Bob Copeland 2009-12-20 21:14:34 UTC
So on my system things appear to be working fine:

[91955.828217] ath5k phy0: (ath5k_beacon_send:2126): in beacon_send
[91955.930644] ath5k phy0: (ath5k_beacon_send:2126): in beacon_send
[91956.033126] ath5k phy0: (ath5k_beacon_send:2126): in beacon_send
[91956.135553] ath5k phy0: (ath5k_beacon_send:2126): in beacon_send

Perhaps timer setup is different on 5316... I'll take a look at madwifi to see what I can find.

By the way, wrt to #1 in your three points above, with 50 second beacon interval, you are quite likely to not see the wireless network from other computers for a while.  This is because beacon responses are the way the AP announces its presence (though you should see probe responses if you do active scans).
Comment 15 Bob Copeland 2009-12-20 21:17:36 UTC
Madwifi and ath5k should be using the same timer setup as far as I know.  You can compare ath5k_beacon_update_timers() with the equivalent in madwifi if you want to know what they are doing, but I'll see if I can spot anything on my end as well.
Comment 16 Bartek Knapek 2009-12-20 23:14:02 UTC
Correct - I have tried decreasing beacon_int, and at =1000 the T42 discovered the WLAN. But the server load is too high at this setting, due to the interrupts..
Comment 17 Bartek Knapek 2009-12-21 21:07:03 UTC
The moment beacon is enabled, the BNR+SWBA interrupts start to appear, at a rate 1000x faster than given by beacon_int. The SWBA alone also appears, at a much less frequent rate, which resembles the rate stated by beacon_int (though not always).

Speculations:

We could assume the SWBA alone is scheduled by us, while the BNR+SWBA come from the device. Perhaps it is only the BNR the device wants to complain about, but SWBA is added for some reason. Something 5213A-specific?

There is no support for BNR in ath5k_intr, so I have updated ath5k_intr a bit, so it schedules beacontq only if BNR is not set. It became possible to use beacon_int=1000: beacon_send called much less frequent and quite regularly, the system load low. However after some time the "no further tx buf available, dropping packet" messages started to appear, and the connectivity died. I have seen this message before, running beacon_int=50000 without any code modifications, but then it appeared after a couple of days of running the AP.

The question is what triggers the BNR+SWBA interrupt, and why is the frequency of this interrupt dependent on beacon_int. Why do BNR come together with SWBA?  

The DBA event you mentioned before - where is it? Perhaps the WLAN device is ordered to transmit beacon too often, so it complains since most of the time it does not have any data to transmit due to the less-frequent SWBA:s ?
Comment 18 Bartek Knapek 2009-12-22 19:59:13 UTC
I have got down to the place where the timers and intervals is written to the device with iowrite32(). Everything seems to be fine..

I would like to check whether the base addresses are OK. I see that there have been a lot of changes between 5210 and 5211. Perhaps 5213A has some peculiarities? Is some spec available?
Comment 19 Bartek Knapek 2009-12-22 21:41:08 UTC
You wrote that 1 TU ~= 1 ms.
But looking at the ath5k_beacon_update_timers debug, the hw_tu is incremented ~1200x per 1 ms. While nexttbt = hw_tu + intval, where intval is in ms.
Comment 20 Bob Copeland 2009-12-22 22:58:01 UTC
Created attachment 24259 [details]
don't use the external clock

Can you try the attached patch?

To be precise, 1 TU = 1.024 ms.  So your TSF timer is running way too fast (in the one I looked at, I thought it was only 60x but still, that's not going to work).  You can verify this by mounting debugfs and looking at the file /debug/ieee80211/phyX/tsf - it should tick in microseconds.

60x sounds like an issue with the sleep clock, so the patch turns it off.
Comment 21 Bartek Knapek 2009-12-22 23:47:20 UTC
Created attachment 24261 [details]
patch applied, beacon_int=1000

Worked! Thanks! :-)

I attach a log with beacon_int set to 1000.
No more BNR interrupts, beacons sent as expected.

Setting beacon_int to 100 also worked (not in the log).


What is the issue with the sleep clock?
Is there something wrong with my device, or is it some driver problem?
Comment 22 John W. Linville 2010-03-03 16:44:43 UTC
commit 5d6ce628f986d1a3c523cbb0a5a52095c48cc332
Author: Bob Copeland <me@bobcopeland.com>
Date:   Wed Jan 20 23:51:03 2010 -0500

    ath5k: dont use external sleep clock in AP mode
    
    When using the external sleep clock in AP mode, the
    TSF increments too quickly, causing beacon interval
    to be much lower than it is supposed to be, resulting
    in lots of beacon-not-ready interrupts.
    
    This fixes http://bugzilla.kernel.org/show_bug.cgi?id=14802.
    
    Signed-off-by: Bob Copeland <me@bobcopeland.com>
    Acked-by: Nick Kossifidis <mickflemm@gmail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>