Kernel Bug Tracker – Bug 14802
ath5k + hostapd causes flood of interrupts
Last modified: 2010-03-03 16:45:13 UTC
I am running a home wireless router on Slackware 13.0 / 126.96.36.199-smp.
WPA2 and g only.
Athlon 2400+ (single-core CPU) / Asus A7N8X-X.
The wireless device is Planet WL-8310 (Atheros AR5213A).
madwifi 0.9.4 + hostapd 0.6.9 work fine.
I try to migrate to ath5k.
Initially I have used a patch to introduce the AP mode support into ath5k from my kernel, but after having started hostapd the kernel was complaining about "ath5k phy0: beacon queue 7 didn't stop", and besides the AP was not stable: hard to connect, then dropping the connections at random intervals.
So I have upgraded compat-wireless to 2.6.32. I no longer see the "beacon queue" message in the logs, but the AP is still not stable.
From the server perspective I see that the system chokes. Runs fine a couple of seconds, then everything goes very slow for a couple of seconds, and so on. On "top" I see that the software interrupts (si) figure changes between 30-90%. With madwifi it is usually 0-3%.
I have checked /proc/interrupts and calculated the delta increase, every 1 second, of the interrupts the wireless device is using (loop: cat /proc/interrupt >> file ; sleep 1). This is the only device on this interrupt. The following pattern repeats all the time (the exact values differ of course):
new interrupts related to wireless device, every 1s
Using madwifi the number of new interrupts every second is always ~60.
Also the Local timer interrupts are affected:
With madwifi it is always ~1018.
The peaks in Local timer interrupts are not correlated with the peak of wireless device interrupts.
I have set kernel debug to 8 (sysrq), but seen nothing, until after some 1-2 hours the following appeared:
ath5k phy0: failed to warm reset the MAC Chip
ath5k phy0: can't reset hardware (-5)
I though it might be related to Bug#12647, so I have applied Patch#20356, but it did not help.
It is 100% reproducable. I can provide traces, but you need to tell me what is of interest to you.
I have upgraded to 2.6.32 today.
Still the same: the moment I start hostapd, the flood of interrupts begin.
Created attachment 24189 [details]
ath5k debug intr
Log taken during a couple of seconds time after having loaded the ath5k module with debug=0x2.
Am I correct that the flooding interrupts are due to:
- SW Beacon Alert
- Beacon Not Ready ?
What does it mean ?
Thanks for looking into this. The background:
The software beacon alert (SWBA) should be fired before every beacon interval (every 100ms or so, by default) -- it tells the driver to load a beacon into the beacon queue on the hardware. So it's normal to get these in AP mode.
After a short period of time, the dma beacon alert (DBA) happens, this is when it actually transmits the beacon from memory to the card and sends it. Beacon not ready means we didn't actually load a beacon in time for the hardware. This is bad.
Do you have any timing information in the logs? Could you try with CONFIG_PRINTK_TIME?
Created attachment 24193 [details]
I attach a time-stamped interrupt log.
I can see two different time intervals in the log:
- status 0x100001 usually appears every ~100ms, but there are exceptions with longer time in-between.
- status 0x110000 floods every ~0.85ms.
Ok, what do you have beacon_int set to in your hostapd.conf? Can you try using ATH5K_DEBUG_BEACON and post the results? (modprobe ath5k debug=0x10)
Created attachment 24203 [details]
ath5k debug beacon
Attaching a 60s log with debug beacon enabled.
I have beacon_int=100 in the hostapd.conf.
Created attachment 24206 [details]
beacon_int - different values
I have tried different values of beacon_int in hostapd.conf.
- beacon_int=10000 - results in beacon every ~8.4ms
- beacon_int=50000 - results in beacon every ~42ms (5x8.4ms)
See the attached log.
With beacon_int=50000 everything _seems_ to work fine. The interrupts are no longer flooding the system, and it is possible to use the WLAN.
So I keep this setting as a temporary workarround.
Now I have noticed the following message in the logs:
[72530.904742] __ratelimit: 138 callbacks suppressed
[72530.904750] ath5k phy5: beacon queue 7 didn't start/stop ?
[72551.510681] ath5k phy5: beacon queue 7 didn't start/stop ?
It happened about the time I was suspending the laptop (client) this morning.
Previous suspend, and resume, did not cause any event in the logs.
I have been evaluating the workarround for a couple of days now, and it does not work OK.
The main issue is that my IBM T42 does not see this WLAN at all. When I change to madwifi, it discovers the network immediately. And it can see all the other neighboring WLANs. I have tried changing channels, tx power, tried changing from g to b, updated windows and ibm drivers. The other laptop (Compaq 6710b) does not have this problem.
Sporadically "beacon queue 7 didn't start/stop ?" appears in the logs. It never appears with "beacon debug" enabled, so I cannot provide more data. (WLAN unaffected - established connections not dropped).
When I enable beacon debug, the following messages repeat every 5s (a log file will follow). The number of "callback suppressed" changes from time to time, and then the new value is reported for a longer time.
Throughput is fine: 2,5 MByte/s both UL and DL.
Created attachment 24231 [details]
beacon debug - an event every 5s
Very strange. If I read it right, SWBA is being invoked every 100 microseconds instead of 100 TUs (1 TU approx = 1 ms). But nexttbtt looks correct compared to the TSF in the beacon debugging. There must be something wrong with the timer programming.
The SWBA interrupts - are they generated by the PC CPU, or do they come from the WLAN card? Is it possible that the Planet WLAN device I am using is broken - I mean it has some sort of a bug that results in ~1000x times faster expiration of some internal timers? The timers ath5k is using, but madwifi was not?
Since this issue is kind of a stopper for my small home network (my wife is using IBM T42 and this laptop refuses to see the network when using ath5k ;-) I have ordered another WLAN device (TL-WN722N, AR5008) - I will try setting up the AP using this one. Perhaps it will shed some more light on the problem.
So on my system things appear to be working fine:
[91955.828217] ath5k phy0: (ath5k_beacon_send:2126): in beacon_send
[91955.930644] ath5k phy0: (ath5k_beacon_send:2126): in beacon_send
[91956.033126] ath5k phy0: (ath5k_beacon_send:2126): in beacon_send
[91956.135553] ath5k phy0: (ath5k_beacon_send:2126): in beacon_send
Perhaps timer setup is different on 5316... I'll take a look at madwifi to see what I can find.
By the way, wrt to #1 in your three points above, with 50 second beacon interval, you are quite likely to not see the wireless network from other computers for a while. This is because beacon responses are the way the AP announces its presence (though you should see probe responses if you do active scans).
Madwifi and ath5k should be using the same timer setup as far as I know. You can compare ath5k_beacon_update_timers() with the equivalent in madwifi if you want to know what they are doing, but I'll see if I can spot anything on my end as well.
Correct - I have tried decreasing beacon_int, and at =1000 the T42 discovered the WLAN. But the server load is too high at this setting, due to the interrupts..
The moment beacon is enabled, the BNR+SWBA interrupts start to appear, at a rate 1000x faster than given by beacon_int. The SWBA alone also appears, at a much less frequent rate, which resembles the rate stated by beacon_int (though not always).
We could assume the SWBA alone is scheduled by us, while the BNR+SWBA come from the device. Perhaps it is only the BNR the device wants to complain about, but SWBA is added for some reason. Something 5213A-specific?
There is no support for BNR in ath5k_intr, so I have updated ath5k_intr a bit, so it schedules beacontq only if BNR is not set. It became possible to use beacon_int=1000: beacon_send called much less frequent and quite regularly, the system load low. However after some time the "no further tx buf available, dropping packet" messages started to appear, and the connectivity died. I have seen this message before, running beacon_int=50000 without any code modifications, but then it appeared after a couple of days of running the AP.
The question is what triggers the BNR+SWBA interrupt, and why is the frequency of this interrupt dependent on beacon_int. Why do BNR come together with SWBA?
The DBA event you mentioned before - where is it? Perhaps the WLAN device is ordered to transmit beacon too often, so it complains since most of the time it does not have any data to transmit due to the less-frequent SWBA:s ?
I have got down to the place where the timers and intervals is written to the device with iowrite32(). Everything seems to be fine..
I would like to check whether the base addresses are OK. I see that there have been a lot of changes between 5210 and 5211. Perhaps 5213A has some peculiarities? Is some spec available?
You wrote that 1 TU ~= 1 ms.
But looking at the ath5k_beacon_update_timers debug, the hw_tu is incremented ~1200x per 1 ms. While nexttbt = hw_tu + intval, where intval is in ms.
Created attachment 24259 [details]
don't use the external clock
Can you try the attached patch?
To be precise, 1 TU = 1.024 ms. So your TSF timer is running way too fast (in the one I looked at, I thought it was only 60x but still, that's not going to work). You can verify this by mounting debugfs and looking at the file /debug/ieee80211/phyX/tsf - it should tick in microseconds.
60x sounds like an issue with the sleep clock, so the patch turns it off.
Created attachment 24261 [details]
patch applied, beacon_int=1000
Worked! Thanks! :-)
I attach a log with beacon_int set to 1000.
No more BNR interrupts, beacons sent as expected.
Setting beacon_int to 100 also worked (not in the log).
What is the issue with the sleep clock?
Is there something wrong with my device, or is it some driver problem?
Author: Bob Copeland <firstname.lastname@example.org>
Date: Wed Jan 20 23:51:03 2010 -0500
ath5k: dont use external sleep clock in AP mode
When using the external sleep clock in AP mode, the
TSF increments too quickly, causing beacon interval
to be much lower than it is supposed to be, resulting
in lots of beacon-not-ready interrupts.
This fixes http://bugzilla.kernel.org/show_bug.cgi?id=14802.
Signed-off-by: Bob Copeland <email@example.com>
Acked-by: Nick Kossifidis <firstname.lastname@example.org>
Signed-off-by: John W. Linville <email@example.com>