Bug 15538

Summary: ath5k: txbuf fills up in adhoc mode
Product: Drivers Reporter: Johannes Karlsson (johannes.karlsson)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED CODE_FIX    
Severity: normal CC: johannes.karlsson, linville, maximlevitsky, me
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.34-rc1 Subsystem:
Regression: No Bisected commit-id:
Attachments: don't send buffered frames
add some tracepoints to figure out what's going on
kernel log
Trace
fix tx race condition
New trace after the fix race condition patch
function trace
Another function trace
No card on PCI
Interrupt handlers

Description Johannes Karlsson 2010-03-15 09:13:46 UTC
When a network card using ath5k has been used in adhoc mode for a while (from minutes to hours) the txbuf fills up and no more packets get through. This error message is flooded in the kernel log:
ath5k phy0: no further txbuf available, dropping packet

Network card: Atheros AR5BXB63 
03:00.0 Ethernet controller: Atheros Communications Inc. AR242x 802.11abg Wireless PCI Express Adapter (rev 01)

It probably relates to this bug that was fixed for AP mode in this patch:

commit a951ae2176b982574ffa197455db6c89359fd5eb
Author: Bob Copeland <me@bobcopeland.com>
Date:   Wed Jan 20 23:51:04 2010 -0500

    ath5k: fix setup for CAB queue
    
    The beacon sent gating doesn't seem to work with any combination
    of flags.  Thus, buffered frames tend to stay buffered forever,
    using up tx descriptors.
    
    Instead, use the DBA gating and hold transmission of the buffered
    frames until 80% of the beacon interval has elapsed using the ready
    time.  This fixes the following error in AP mode:
    
       ath5k phy0: no further txbuf available, dropping packet
    
    Add a comment to acknowledge that this isn't the best solution.
    
    Signed-off-by: Bob Copeland <me@bobcopeland.com>
    Acked-by: Nick Kossifidis <mickflemm@gmail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
Comment 1 Bob Copeland 2010-03-15 13:43:07 UTC
Created attachment 25520 [details]
don't send buffered frames

I'm working on a debug patch using the tracing infrastructure but in the meantime does it go away with this patch?
Comment 2 Johannes Karlsson 2010-03-15 16:22:49 UTC
(In reply to comment #1)
> Created an attachment (id=25520) [details]
> don't send buffered frames
> 
> I'm working on a debug patch using the tracing infrastructure but in the
> meantime does it go away with this patch?

I have tested the patch and I still get the error.
Comment 3 Bob Copeland 2010-03-15 17:53:56 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > Created an attachment (id=25520) [details] [details]
> > don't send buffered frames
> > 
> > I'm working on a debug patch using the tracing infrastructure but in the
> > meantime does it go away with this patch?
> 
> I have tested the patch and I still get the error.

Ok, in that case it is unrelated to the AP mode patch.  I'll try to get that tracing patch done today.
Comment 4 Bob Copeland 2010-03-18 18:51:22 UTC
Created attachment 25594 [details]
add some tracepoints to figure out what's going on

See the attached patch.  If so inclined, you can get the same info via printk() -- but tracepoints are a little nicer.  I have used this before to see which queues are consuming the tx buffers, it should show if we have leaks as well.
Comment 5 Johannes Karlsson 2010-03-19 09:31:07 UTC
Created attachment 25604 [details]
kernel log
Comment 6 Johannes Karlsson 2010-03-19 09:36:27 UTC
Created attachment 25605 [details]
Trace

Since I am doing a flood ping the traces are huge. I have cut the part when no more ping packets go through. It is totally four traces. To get the interface up again I do a "ifconfig wlan0 down" and "ifconfig wlan0 192.168.1.116 up"
Comment 7 Bob Copeland 2010-03-19 12:21:54 UTC
Created attachment 25611 [details]
fix tx race condition

Ok so we definitely know the main tx queue is stalling even though there are packets queued.  Please try this patch and report back.
Comment 8 Johannes Karlsson 2010-03-19 18:45:57 UTC
Seems like I still get the error. I will make more tests tomorrow and submit some new traces.
Comment 9 Johannes Karlsson 2010-03-20 08:59:14 UTC
Created attachment 25619 [details]
New trace after the fix race condition patch

It seems like we have the same problem after the fix race condition patch
Comment 10 Bob Copeland 2010-03-20 14:10:29 UTC
Ok, can you also do the following (this will generate even more logs, but again the end is the interesting part)

as root:
$ cd /debug/tracing
$ echo function > current_tracer
$ echo ath5k_tasklet_tx > set_ftrace_filter
$ echo 1 > events/ath5k/enable
$ echo 1 > tracing_on
$ echo 0 > tracing_enabled
$ echo 1 > tracing_enabled

This should show if we entered the tasklet but did not process some packet.
Comment 11 Johannes Karlsson 2010-03-20 17:07:03 UTC
Created attachment 25622 [details]
function trace

Here is a new trace including function trace.
Comment 12 Johannes Karlsson 2010-03-20 21:41:04 UTC
Created attachment 25625 [details]
Another function trace

This trace is from another machine running ath5k in adhoc mode.
Comment 13 Bob Copeland 2010-03-25 23:08:19 UTC
The traces show TX completion interrupts just stop being triggered.

How long does it take to trigger the bug, and any tips?  I'm using two ath5k cards in adhoc with ping -f on both sides and after a few hours I haven't hit it yet.  I'm using 2.6.34-wl on one side and 2.6.32 on the other.
Comment 14 Johannes Karlsson 2010-03-26 10:56:01 UTC
I get the error most of the time within 5 minutes. Very rarely it can run for hours. I have tested on two different Acer Aspire One A110 machines (I have about 30 of them)
Comment 15 Johannes Karlsson 2010-03-26 11:02:48 UTC
Created attachment 25723 [details]
No card on PCI

After several days / weeks of uptime I usually get the problem that the card disappears. I also have this problem on the 2.6.29-gentoo kernel. I don't know if it can be related or not. If I reboot the machines the card is still not present. I have to first cut the power and then start to get the card back again.
Comment 16 Bob Copeland 2010-04-01 21:50:46 UTC
What about the file /proc/interrupts - does the line with ath stop increasing after it breaks?
Comment 17 Johannes Karlsson 2010-04-03 09:49:00 UTC
(In reply to comment #16)
> What about the file /proc/interrupts - does the line with ath stop increasing
> after it breaks?

Yes it continues to get interrupts, but at a much slower rate, possible only USB interrupts
11:   42486542    XT-PIC-XT        ath, ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5

I have ordered another laptop to see if I can trigger the same bug on different hardware platform.
Comment 18 Maxim Levitsky 2010-04-03 12:25:55 UTC
As one of my tests, I put some printks in the interrupt handler, and it has shown that interrupts stop arriving, and then storm of RXORN occurs.
No strange interrupts were seen ether while card did work.
Comment 19 Johannes Karlsson 2010-04-07 13:31:03 UTC
Created attachment 25906 [details]
Interrupt handlers

I cannot trigger the bug on an Asus EEE PC 1005P. It seems like the interrupt handlers are different compared to the Acer Aspire One A110 (where I easily can trigger the bug). Any clues there?
Comment 20 Bob Copeland 2010-04-07 14:55:00 UTC
Hrm, in the second case it's a shared irq, and since it's UP it's using a plain old PIC instead of IOAPIC as in the first case.  I glanced at ath5k_intr and it should handle this case ok, but who knows -- it happens to be unshared on mine.  I guess you can try either:

a) adding printk in the if clause in ath5k_intr where we return IRQ_NONE, to see if it's correlated with the times when interrupts stop.  This might be too noisy to get anything useful since you'll probably be getting usb interrupts all the time...
b) using 0 instead of IRQF_SHARED in the call to request_irq.

to rule out any issues there.  As a shot in the dark you can also try acpi=off boot param. 

The other thing suggested on the mailing list recently is to change the DMA size from AR5K_DMASIZE_128B to something smaller (AR5K_DMASIZE_4B).. but I'd like to find a solution that doesn't need that.
Comment 21 Johannes Karlsson 2010-04-07 19:10:35 UTC
(In reply to comment #20)
> Hrm, in the second case it's a shared irq, and since it's UP it's using a
> plain
> old PIC instead of IOAPIC as in the first case.  I glanced at ath5k_intr and
> it
> should handle this case ok, but who knows -- it happens to be unshared on
> mine.
>  I guess you can try either:
> 
> a) adding printk in the if clause in ath5k_intr where we return IRQ_NONE, to
> see if it's correlated with the times when interrupts stop.  This might be
> too
> noisy to get anything useful since you'll probably be getting usb interrupts
> all the time...
> b) using 0 instead of IRQF_SHARED in the call to request_irq.
> 
> to rule out any issues there.  As a shot in the dark you can also try
> acpi=off
> boot param. 
> 
> The other thing suggested on the mailing list recently is to change the DMA
> size from AR5K_DMASIZE_128B to something smaller (AR5K_DMASIZE_4B).. but I'd
> like to find a solution that doesn't need that.

I tried to change IRQF_SHARED to 0, DMA to AR5K_DMASIZE_4B and acpi=off, but I can still easily trigger the bug.
Comment 22 John W. Linville 2010-05-13 15:12:00 UTC
Bob, any more ideas? :-)
Comment 23 Maxim Levitsky 2010-06-18 00:24:22 UTC
Please see if fix from https://bugzilla.kernel.org/show_bug.cgi?id=13892 works.

It does work for me.
Comment 24 Johannes Karlsson 2010-06-26 17:25:51 UTC
After applying this patch I have been running flood ping for more than two days now and the network card is still working. I will continue to test, but it looks promising.
Comment 25 John W. Linville 2010-08-18 18:58:01 UTC
commit 6ccf15a1a76d2ff915cdef6ae4d12d0170087118
Author: Maxim Levitsky <maximlevitsky@gmail.com>
Date:   Fri Aug 13 11:27:28 2010 -0400

    ath5k: disable ASPM L0s for all cards
    
    Atheros PCIe wireless cards handled by ath5k do require L0s disabled.
    For distributions shipping with CONFIG_PCIEASPM (this will be enabled
    by default in the future in 2.6.36) this will also mean both L1 and L0s
    will be disabled when a pre 1.1 PCIe device is detected. We do know L1
    works correctly even for all ath5k pre 1.1 PCIe devices though but cannot
    currently undue the effect of a blacklist, for details you can read
    pcie_aspm_sanity_check() and see how it adjusts the device link
    capability.
    
    It may be possible in the future to implement some PCI API to allow
    drivers to override blacklists for pre 1.1 PCIe but for now it is
    best to accept that both L0s and L1 will be disabled completely for
    distributions shipping with CONFIG_PCIEASPM rather than having this
    issue present. Motivation for adding this new API will be to help
    with power consumption for some of these devices.
    
    Example of issues you'd see:
    
      - On the Acer Aspire One (AOA150, Atheros Communications Inc. AR5001
        Wireless Network Adapter [168c:001c] (rev 01)) doesn't work well
        with ASPM enabled, the card will eventually stall on heavy traffic
        with often 'unsupported jumbo' warnings appearing. Disabling
        ASPM L0s in ath5k fixes these problems.
    
      - On the same card you would see a storm of RXORN interrupts
        even though medium is idle.
    
    Credit for root causing and fixing the bug goes to Jussi Kivilinna.
    
    Cc: David Quan <David.Quan@atheros.com>
    Cc: Matthew Garrett <mjg59@srcf.ucam.org>
    Cc: Tim Gardner <tim.gardner@canonical.com>
    Cc: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
    Cc: stable@kernel.org
    Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
    Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
Comment 26 Maxim Levitsky 2010-08-20 07:59:15 UTC
Can you mark this as CLOSED?

This patch is upstream.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6ccf15a1a76d2ff915cdef6ae4d12d0170087118
Comment 27 John W. Linville 2010-08-23 14:21:13 UTC
What is the difference between RESOLVED and CLOSED anyway?