When a network card using ath5k has been used in adhoc mode for a while (from minutes to hours) the txbuf fills up and no more packets get through. This error message is flooded in the kernel log: ath5k phy0: no further txbuf available, dropping packet Network card: Atheros AR5BXB63 03:00.0 Ethernet controller: Atheros Communications Inc. AR242x 802.11abg Wireless PCI Express Adapter (rev 01) It probably relates to this bug that was fixed for AP mode in this patch: commit a951ae2176b982574ffa197455db6c89359fd5eb Author: Bob Copeland <me@bobcopeland.com> Date: Wed Jan 20 23:51:04 2010 -0500 ath5k: fix setup for CAB queue The beacon sent gating doesn't seem to work with any combination of flags. Thus, buffered frames tend to stay buffered forever, using up tx descriptors. Instead, use the DBA gating and hold transmission of the buffered frames until 80% of the beacon interval has elapsed using the ready time. This fixes the following error in AP mode: ath5k phy0: no further txbuf available, dropping packet Add a comment to acknowledge that this isn't the best solution. Signed-off-by: Bob Copeland <me@bobcopeland.com> Acked-by: Nick Kossifidis <mickflemm@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Created attachment 25520 [details] don't send buffered frames I'm working on a debug patch using the tracing infrastructure but in the meantime does it go away with this patch?
(In reply to comment #1) > Created an attachment (id=25520) [details] > don't send buffered frames > > I'm working on a debug patch using the tracing infrastructure but in the > meantime does it go away with this patch? I have tested the patch and I still get the error.
(In reply to comment #2) > (In reply to comment #1) > > Created an attachment (id=25520) [details] [details] > > don't send buffered frames > > > > I'm working on a debug patch using the tracing infrastructure but in the > > meantime does it go away with this patch? > > I have tested the patch and I still get the error. Ok, in that case it is unrelated to the AP mode patch. I'll try to get that tracing patch done today.
Created attachment 25594 [details] add some tracepoints to figure out what's going on See the attached patch. If so inclined, you can get the same info via printk() -- but tracepoints are a little nicer. I have used this before to see which queues are consuming the tx buffers, it should show if we have leaks as well.
Created attachment 25604 [details] kernel log
Created attachment 25605 [details] Trace Since I am doing a flood ping the traces are huge. I have cut the part when no more ping packets go through. It is totally four traces. To get the interface up again I do a "ifconfig wlan0 down" and "ifconfig wlan0 192.168.1.116 up"
Created attachment 25611 [details] fix tx race condition Ok so we definitely know the main tx queue is stalling even though there are packets queued. Please try this patch and report back.
Seems like I still get the error. I will make more tests tomorrow and submit some new traces.
Created attachment 25619 [details] New trace after the fix race condition patch It seems like we have the same problem after the fix race condition patch
Ok, can you also do the following (this will generate even more logs, but again the end is the interesting part) as root: $ cd /debug/tracing $ echo function > current_tracer $ echo ath5k_tasklet_tx > set_ftrace_filter $ echo 1 > events/ath5k/enable $ echo 1 > tracing_on $ echo 0 > tracing_enabled $ echo 1 > tracing_enabled This should show if we entered the tasklet but did not process some packet.
Created attachment 25622 [details] function trace Here is a new trace including function trace.
Created attachment 25625 [details] Another function trace This trace is from another machine running ath5k in adhoc mode.
The traces show TX completion interrupts just stop being triggered. How long does it take to trigger the bug, and any tips? I'm using two ath5k cards in adhoc with ping -f on both sides and after a few hours I haven't hit it yet. I'm using 2.6.34-wl on one side and 2.6.32 on the other.
I get the error most of the time within 5 minutes. Very rarely it can run for hours. I have tested on two different Acer Aspire One A110 machines (I have about 30 of them)
Created attachment 25723 [details] No card on PCI After several days / weeks of uptime I usually get the problem that the card disappears. I also have this problem on the 2.6.29-gentoo kernel. I don't know if it can be related or not. If I reboot the machines the card is still not present. I have to first cut the power and then start to get the card back again.
What about the file /proc/interrupts - does the line with ath stop increasing after it breaks?
(In reply to comment #16) > What about the file /proc/interrupts - does the line with ath stop increasing > after it breaks? Yes it continues to get interrupts, but at a much slower rate, possible only USB interrupts 11: 42486542 XT-PIC-XT ath, ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5 I have ordered another laptop to see if I can trigger the same bug on different hardware platform.
As one of my tests, I put some printks in the interrupt handler, and it has shown that interrupts stop arriving, and then storm of RXORN occurs. No strange interrupts were seen ether while card did work.
Created attachment 25906 [details] Interrupt handlers I cannot trigger the bug on an Asus EEE PC 1005P. It seems like the interrupt handlers are different compared to the Acer Aspire One A110 (where I easily can trigger the bug). Any clues there?
Hrm, in the second case it's a shared irq, and since it's UP it's using a plain old PIC instead of IOAPIC as in the first case. I glanced at ath5k_intr and it should handle this case ok, but who knows -- it happens to be unshared on mine. I guess you can try either: a) adding printk in the if clause in ath5k_intr where we return IRQ_NONE, to see if it's correlated with the times when interrupts stop. This might be too noisy to get anything useful since you'll probably be getting usb interrupts all the time... b) using 0 instead of IRQF_SHARED in the call to request_irq. to rule out any issues there. As a shot in the dark you can also try acpi=off boot param. The other thing suggested on the mailing list recently is to change the DMA size from AR5K_DMASIZE_128B to something smaller (AR5K_DMASIZE_4B).. but I'd like to find a solution that doesn't need that.
(In reply to comment #20) > Hrm, in the second case it's a shared irq, and since it's UP it's using a > plain > old PIC instead of IOAPIC as in the first case. I glanced at ath5k_intr and > it > should handle this case ok, but who knows -- it happens to be unshared on > mine. > I guess you can try either: > > a) adding printk in the if clause in ath5k_intr where we return IRQ_NONE, to > see if it's correlated with the times when interrupts stop. This might be > too > noisy to get anything useful since you'll probably be getting usb interrupts > all the time... > b) using 0 instead of IRQF_SHARED in the call to request_irq. > > to rule out any issues there. As a shot in the dark you can also try > acpi=off > boot param. > > The other thing suggested on the mailing list recently is to change the DMA > size from AR5K_DMASIZE_128B to something smaller (AR5K_DMASIZE_4B).. but I'd > like to find a solution that doesn't need that. I tried to change IRQF_SHARED to 0, DMA to AR5K_DMASIZE_4B and acpi=off, but I can still easily trigger the bug.
Bob, any more ideas? :-)
Please see if fix from https://bugzilla.kernel.org/show_bug.cgi?id=13892 works. It does work for me.
After applying this patch I have been running flood ping for more than two days now and the network card is still working. I will continue to test, but it looks promising.
commit 6ccf15a1a76d2ff915cdef6ae4d12d0170087118 Author: Maxim Levitsky <maximlevitsky@gmail.com> Date: Fri Aug 13 11:27:28 2010 -0400 ath5k: disable ASPM L0s for all cards Atheros PCIe wireless cards handled by ath5k do require L0s disabled. For distributions shipping with CONFIG_PCIEASPM (this will be enabled by default in the future in 2.6.36) this will also mean both L1 and L0s will be disabled when a pre 1.1 PCIe device is detected. We do know L1 works correctly even for all ath5k pre 1.1 PCIe devices though but cannot currently undue the effect of a blacklist, for details you can read pcie_aspm_sanity_check() and see how it adjusts the device link capability. It may be possible in the future to implement some PCI API to allow drivers to override blacklists for pre 1.1 PCIe but for now it is best to accept that both L0s and L1 will be disabled completely for distributions shipping with CONFIG_PCIEASPM rather than having this issue present. Motivation for adding this new API will be to help with power consumption for some of these devices. Example of issues you'd see: - On the Acer Aspire One (AOA150, Atheros Communications Inc. AR5001 Wireless Network Adapter [168c:001c] (rev 01)) doesn't work well with ASPM enabled, the card will eventually stall on heavy traffic with often 'unsupported jumbo' warnings appearing. Disabling ASPM L0s in ath5k fixes these problems. - On the same card you would see a storm of RXORN interrupts even though medium is idle. Credit for root causing and fixing the bug goes to Jussi Kivilinna. Cc: David Quan <David.Quan@atheros.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Tim Gardner <tim.gardner@canonical.com> Cc: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Cc: stable@kernel.org Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Can you mark this as CLOSED? This patch is upstream. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6ccf15a1a76d2ff915cdef6ae4d12d0170087118
What is the difference between RESOLVED and CLOSED anyway?