Bug 14502 (doddel) - ath9k leaks memory as function of traffic amount and forces reboot
Summary: ath9k leaks memory as function of traffic amount and forces reboot
Status: CLOSED CODE_FIX
Alias: doddel
Product: Networking
Classification: Unclassified
Component: Wireless
Hardware: All Linux
Importance: P1 high
Assignee: Luis Chamberlain
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-10-28 23:07 UTC by doddel
Modified: 2010-02-12 23:00 UTC
CC List: 5 users

See Also:
Kernel Version: Linux version 2.6.30.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments
debugfs entries before and after 350MB of wireless outbound traffic (1.44 KB, patch)
2009-10-28 23:07 UTC, doddel
mac80211: add the total ampdu length to tx info (1.42 KB, patch)
2009-11-19 22:54 UTC, Luis Chamberlain
ath9k: get rid of tx_info_priv (14.94 KB, patch)
2009-11-19 22:55 UTC, Luis Chamberlain

Description doddel 2009-10-28 23:07:43 UTC
Created attachment 23571
debugfs entries before and after 350MB of wireless outbound traffic

In station mode, regardless of encryption (tried 'none' and wpa-tkip), about 10 kB of memory moves from free to unreclaimable for every 100 MB of wireless traffic.
On my RouterStation (ar7161) with R52n miniPCI (ar9220), running OpenWrt trunk,
this forces a reboot after about 350 MB of traffic.
The same traffic routed between wired ethernet ports shows no leak.
ath9k in debug mode shows, to my non-expert eyes, no error messages in any of its reports. I have tried builds with various compat-wireless packages up to and including the 2009-10-27 one.
Comment 1 doddel 2009-10-28 23:19:21 UTC
correction: per 100 MB traffic about 10 MB of memory gets eaten.
Comment 2 John W. Linville 2009-10-30 17:48:37 UTC
Have you tried a 2.6.31 or later kernel?
Comment 3 doddel 2009-10-30 18:06:16 UTC
No, but I will give it a try and report back. The default settings in the
OpenWrt 'trunk' development tree use 2.6.30.8.

Thanks for giving this attention !


Comment 4 doddel 2009-10-31 10:54:44 UTC
Have recompiled OpenWrt using Linux version 2.6.31.5 and compat-wireless
of Oct 28 2009 for the ar71xx platform (RouterStation) with ath9k driver
(Mikrotik R52n miniPCI).
The memory leak is still there, in exactly the same ratio as before
with 2.6.30.8: about 1 MB of available memory is lost to Slab and
SUnreclaim for every 10 MB of traffic.
I attach the new dmesg for reference of the system used.

In dmesg I notice that 4 LEDs get registered on PHY0.
The Mikrotik R52n board does not have LEDs though. Do not know whether
that could do harm.

In the test I am not using wpa, so neither the supplicant nor hostap is
active. I assume the modules active during data transfer are:
Module                  Size  Used by
ath9k                  63504  0 
ath9k_hw              200432  1 ath9k
ath                     6832  2 ath9k,ath9k_hw
mac80211              136528  1 ath9k
cfg80211              102096  3 ath9k,ath,mac80211

Your help is much appreciated in pinpointing the origin of this leak.
The way things are my system cannot be deployed as it should carry much
more traffic than just a few hundred MB.
Comment 5 Johannes Berg 2009-10-31 10:59:25 UTC
Can you figure out what is leaking?

Try
(for i in /sys/kernel/slab/* ; do echo "$(cat $i/objects) - $i" ; done)|sort -n|tail
Comment 6 Johannes Berg 2009-10-31 11:01:06 UTC
And/or check which one is increasing all the time. You may need some kernel config options to get those files, not sure.
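
To find the cache that "is increasing all the time", two snapshots of the per-cache object counts can be compared. The following is only a minimal sketch and was not posted in this report: the function names slab_snapshot and slab_growth are made up for illustration, and it assumes a SLUB kernel that exposes /sys/kernel/slab/<cache>/objects with the object count as the first field.

```shell
# slab_snapshot [DIR]: print "<cache> <objects>" for every cache under DIR.
# DIR defaults to /sys/kernel/slab (assumption: SLUB sysfs is available).
slab_snapshot() {
    dir=${1:-/sys/kernel/slab}
    for c in "$dir"/*; do
        [ -r "$c/objects" ] || continue
        # Keep only the first field; some kernels append per-node details.
        n=$(awk '{ print $1; exit }' "$c/objects")
        echo "$(basename "$c") $n"
    done
}

# slab_growth BEFORE AFTER: print "<delta> <cache>" for every cache whose
# object count grew between the two snapshot files, largest growth last.
slab_growth() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && $2 > before[$1] { print $2 - before[$1], $1 }' \
        "$1" "$2" | sort -n
}
```

A possible use is "slab_snapshot > /tmp/before" after boot, "slab_snapshot > /tmp/after" after a traffic run, then "slab_growth /tmp/before /tmp/after". SLUB merges caches of compatible size, so several names may report the same count.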
Comment 7 doddel 2009-10-31 11:52:36 UTC
Hi Johannes,
There is no /sys/kernel/slab directory on my system.
Must I recompile with certain options to get it?

Comment 8 doddel 2009-10-31 13:01:10 UTC
I am currently recompiling with a number of kernel debug options
activated. For a beginner the slab, slub, slob terminology is a bit
confusing, but I will give it a try to see if I can get more info on what
fills SUnreclaim.
Comment 9 doddel 2009-10-31 17:44:45 UTC
Can you please indicate what to do to make the system produce
/sys/kernel/slab/?

Comment 10 Luis Chamberlain 2009-10-31 17:51:02 UTC
I'm not sure exactly which option enables it, but with these enabled I get /sys/kernel/slab:

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLABINFO=y
Comment 11 doddel 2009-10-31 17:58:48 UTC
tnx Luis, will give that a try !
Comment 12 doddel 2009-11-01 09:41:01 UTC
Here is the /sys/kernel/slab data, sorted as suggested with
(for i in /sys/kernel/slab/* ; do echo "$(cat $i/objects) - $i" ; done)|sort -n|tail

After boot, before 100 MB of wireless traffic (no encryption or access control):
3204 - /sys/kernel/slab/:t-0000048
3204 - /sys/kernel/slab/blkdev_ioc
3204 - /sys/kernel/slab/sysfs_dir_cache
5462 - /sys/kernel/slab/arp_cache
5473 - /sys/kernel/slab/filp
5502 - /sys/kernel/slab/bio-0
5506 - /sys/kernel/slab/kmalloc-128
5520 - /sys/kernel/slab/:t-0000128
5543 - /sys/kernel/slab/mnt_cache
5600 - /sys/kernel/slab/sgpool-8

after ~100MB wireless traffic
3228 - /sys/kernel/slab/:t-0000048
3228 - /sys/kernel/slab/blkdev_ioc
3228 - /sys/kernel/slab/sysfs_dir_cache
90704 - /sys/kernel/slab/sgpool-8
90717 - /sys/kernel/slab/mnt_cache
90783 - /sys/kernel/slab/filp
90784 - /sys/kernel/slab/arp_cache
90797 - /sys/kernel/slab/:t-0000128
90812 - /sys/kernel/slab/kmalloc-128
90813 - /sys/kernel/slab/bio-0

ifconfig wlan0 after 100MB:
Link encap:Ethernet  HWaddr 00:0C:42:3A:EA:BC  
inet addr:192.168.20.100  Bcast:192.168.20.255  Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
RX packets:73782 errors:0 dropped:0 overruns:0 frame:0
TX packets:85216 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000 
RX bytes:7054546 (6.7 MiB)  TX bytes:121105272 (115.4 MiB)

cat /proc/meminfo after 100 MB
MemTotal:          61664 kB
MemFree:           29992 kB
Buffers:            2160 kB
Cached:             6936 kB
SwapCached:            0 kB
Active:             5020 kB
Inactive:           5112 kB
Active(anon):       1104 kB
Inactive(anon):        0 kB
Active(file):       3916 kB
Inactive(file):     5112 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:          1052 kB
Mapped:             1596 kB
Slab:              19244 kB
SReclaimable:        872 kB
SUnreclaim:        18372 kB
PageTables:          152 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:       30832 kB
Committed_AS:       5484 kB
VmallocTotal:    1048404 kB
VmallocUsed:         336 kB
VmallocChunk:    1047432 kB
Comment 13 doddel 2009-11-01 10:09:41 UTC
I can add that in /proc/slabinfo only the last line, kmalloc-128,
reacts to the traffic; all other caches remain about equal.

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>

after 100M traffic:
kmalloc-128        90659  90720    128   32    1 : tunables    0    0    0 : slabdata   2835   2835      0

after 200M traffic:
kmalloc-128       175327 175328    128   32    1 : tunables    0    0    0 : slabdata   5479   5479      0
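
As a cross-check of the leak rate, the two kmalloc-128 samples can be converted into bytes. This is just arithmetic on the numbers quoted in this comment, written as a small sketch; the helper name slab_growth_bytes is hypothetical.

```shell
# slab_growth_bytes OBJS_BEFORE OBJS_AFTER OBJSIZE:
# bytes held by the objects added between two samples.
slab_growth_bytes() {
    echo $(( ($2 - $1) * $3 ))
}

# active_objs after 100M and after 200M of traffic, objsize 128 bytes
bytes=$(slab_growth_bytes 90659 175327 128)
echo "$bytes bytes, about $(( bytes / 1024 )) kB per 100 MB of traffic"
# prints "10837504 bytes, about 10583 kB per 100 MB of traffic"
```

That is roughly 10 MB per 100 MB of traffic, in line with the correction in comment 1. The kmalloc-128 counts in comment 12 grew by 90812 - 5506 = 85306 objects while ifconfig reported 85216 TX packets, which would be consistent with about one 128-byte allocation leaking per transmitted packet; this is an inference from the figures, not something stated in the report.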
Comment 14 doddel 2009-11-02 12:51:10 UTC
Is there a way to trace the SUnreclaim growth back to the routines that
cause it?
I have tried to activate tracing. Wireless throughput then drops dramatically
and lots of output is generated in dmesg, but I cannot interpret it.
What could be a next step to find the leak?

Or is there a way to force the return of memory from Slab to free memory
that is faster than a system reboot? I could have a cron job monitor
MemFree and trigger something at the moment of underrun, if need be with
a momentary interruption of wireless traffic. Anything is better than
unpredictable behaviour and unforeseeable reboots.

tnx for your help
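
The monitoring half of that cron-job idea could look roughly like the sketch below. The function names mem_free_kb and low_memory are invented for illustration, and the sketch only detects low memory: nothing in this report confirms a way to return leaked slab memory short of a reboot.

```shell
# mem_free_kb [FILE]: print the MemFree value (in kB) from a meminfo-style
# file. FILE defaults to /proc/meminfo.
mem_free_kb() {
    awk '$1 == "MemFree:" { print $2; exit }' "${1:-/proc/meminfo}"
}

# low_memory THRESHOLD_KB [FILE]: succeed when MemFree is below the threshold.
low_memory() {
    [ "$(mem_free_kb "$2")" -lt "$1" ]
}

# Example for a cron job (the 4096 kB threshold is an arbitrary assumption):
#   low_memory 4096 && echo "MemFree below 4 MB" >&2
```

Run from cron every minute or so, this gives a warning or a hook for a controlled restart before the kernel runs out of memory on its own.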
  
Comment 15 Luis Chamberlain 2009-11-02 16:30:45 UTC
I know recompiling your kernel may be painful, but I believe a more suitable tool here is kmemleak, to see if we can detect actual leaks. kmemleak will help us narrow the leak down even further.

Please try as follows:

1) First please enable kmemleak in your kernel. You do this by enabling CONFIG_DEBUG_KMEMLEAK.

2) Mount debugfs upon bootup. You will have already done this for the slab stuff. But just in case:

  # mount -t debugfs nodev /sys/kernel/debug/
  # cat /sys/kernel/debug/kmemleak

3) After bootup please clear all the current kmemleaks:

echo clear > /sys/kernel/debug/kmemleak

4) Now, I want you to run a scan prior to running some TX or RX:

echo scan > /sys/kernel/debug/kmemleak

5) Now that we have a fresh scan please go ahead and run your TX test. This is under the assumption your leak happens under TX and not RX (why do you believe this?). Please TX 1 GB or so of data.

6) After your TX test please re-run a scan:

echo scan > /sys/kernel/debug/kmemleak

7) Now capture the kmemleak output:

cat /sys/kernel/debug/kmemleak > /root/kmemleak-out.txt

Please attach your kmemleak-out.txt. If you can avoid using your hard drive during the TX that would be great, as that would rule out the hard drive modules/partition modules or reduce the noise from them.

If you suspect RX is the issue you can re-run this test with RX instead of TX. And so on.
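
Steps 3 to 7 can be wrapped into one helper so that the same sequence is easy to repeat for TX and RX. This is a sketch only: the names KMEMLEAK, kmemleak_cmd and kmemleak_session are made up here, it assumes a kernel built with CONFIG_DEBUG_KMEMLEAK and debugfs mounted at /sys/kernel/debug, and the traffic test itself is whatever command the caller supplies.

```shell
# Path of the kmemleak control file; can be overridden for a dry run.
KMEMLEAK=${KMEMLEAK:-/sys/kernel/debug/kmemleak}

# kmemleak_cmd WORD: send a command such as "clear" or "scan" to kmemleak.
kmemleak_cmd() {
    echo "$1" > "$KMEMLEAK"
}

# kmemleak_session OUTFILE COMMAND...: run steps 3 to 7 around COMMAND.
kmemleak_session() {
    out=$1
    shift
    kmemleak_cmd clear          # step 3: forget the leaks reported so far
    kmemleak_cmd scan           # step 4: scan before any traffic
    "$@"                        # step 5: the TX (or RX) test given by the caller
    kmemleak_cmd scan           # step 6: scan again after the traffic
    cat "$KMEMLEAK" > "$out"    # step 7: save the report
}
```

For example, "kmemleak_session /root/kmemleak-out.txt your_tx_test" would run it once, where your_tx_test is a placeholder for the actual traffic generator.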
Comment 16 doddel 2009-11-02 16:54:43 UTC
Tnx for the advice.
Unfortunately kmemleak cannot be enabled, unless I overrule the
menuconfig system, as it is only defined for x86 and ARM architectures.
The ar71xx is MIPS ....
Do you think kmemleak will work on MIPS as well?
Comment 17 Luis Chamberlain 2009-11-02 17:02:44 UTC
Not sure, I don't recall seeing arch-specific code in kmemleak.c. I'll check.
Comment 18 doddel 2009-11-02 17:19:22 UTC
The KMEMLEAK documentation says in the very last line in chapter
Limitations and Drawbacks that "Only the ARM and x86 architectures are
currently supported".

Comment 19 Luis Chamberlain 2009-11-02 17:24:04 UTC
It does not say why though.
Comment 20 doddel 2009-11-03 10:12:47 UTC
Just to make sure that no error was introduced by the many
experiments, I rebuilt the complete OpenWrt cross-compiler toolchain
and compiled an image for the ar71xx RouterStation platform,
directly choosing kernel 2.6.31.5 and using the latest
compat-wireless of Nov. 2 2009.
Unfortunately exactly the same memory leak presents itself.
ath9k itself reports no errors in its debug info, which I attach
(taken after wlan0: RX bytes:8029844 (7.6 MiB)  TX bytes:162154562
(154.6 MiB)).

Any further ideas on how to get from the large number of 128-byte slab
objects to the routine(s) holding on to them, apart from kmemleak, which
does not support MIPS?
Comment 21 doddel 2009-11-03 16:34:32 UTC
I have tried the same Mikrotik R52n card with the ath9k driver in a
laptop (P4, 2.8 GHz) with a miniPCI slot, running the latest Ubuntu with
kernel 2.6.31-14-generic (the same machine also hosts the OpenWrt
cross-compiler toolchain for the ar71xx).
I tried data transfers in 802.11g and 802.11a modes: no memory leak.
So the problem appears specific to the ar9220 radio on a RouterBoard with
ar7161 rev.2 CPU and/or to the cross-compiler used or its settings.
Is there anything x86-specific in ath9k that perhaps needs tweaking via
a patch for proper compilation for an ar7161 MIPS environment?
Comment 22 doddel 2009-11-15 15:51:32 UTC
Some progress by accident:
Because of instability in packet timing I tried to replace iwconfig with a version that includes the power setting ('iwconfig wlan0 power off' gives stable timing), by replacing the wireless-tools package that carries iwconfig with a modified one. This procedure did not complete due to an error in opkg, the package manager in OpenWrt.
Hence iwconfig, which I also use to set channel and rate when bringing up the interface, became dysfunctional altogether.
The interface still came up (in station mode the channel is not needed, and the rate initialises at 6 Mbps when left untouched), but the memory leak was gone!
It appears that iwconfig and ath9k/mac80211 have a love/hate relationship: to get the timing of packets right you need iwconfig to switch power management off, but to avoid the memory leak you should not use it. Their interaction clearly deserves further code inspection. Perhaps iw can be extended to make iwconfig a tool of the past. This bug report can be closed as its description is inaccurate:
ath9k does not leak memory per se but can be made to do so by applying iwconfig settings.
Comment 23 Luis Chamberlain 2009-11-19 22:53:48 UTC
Can you please try the 2.6.32-rc7 stable compat-wireless release:

http://wireless.kernel.org/en/users/Download/stable

and apply the two patches I'm about to attach and let me know if this issue disappears.
Comment 24 Luis Chamberlain 2009-11-19 22:54:58 UTC
Created attachment 23836
mac80211: add the total ampdu length to tx info

First patch is simple, and is required by the second patch
Comment 25 Luis Chamberlain 2009-11-19 22:55:55 UTC
Created attachment 23837
ath9k: get rid of tx_info_priv

This second patch is a backport of a patch already applied on wireless-testing. It removes the private info struct which we kmalloc/free.
Comment 26 doddel 2009-11-22 14:16:29 UTC
Hi, first of all a big thank you to all involved in improving ath9k /
mac80211.

Have tested the suggested patches by compiling OpenWrt for the 
RouterStation ar71xx platform:
OpenWrt Trunk r 18465 (Nov. 22 2009)
compat-wireless-2009-11-22 (which had these patches in it already)
Kernel 2.6.31.6
(ubuntu Pentium4 OpenWrt cross-compile environment) 

The good news: no memory leak! So my bug report can be closed.
The bad news: a stable wireless connection still requires issuing
the command "iwconfig wlan0 power off".
If that is not done while bringing up the interface, the typical
behaviour is still that pings from sta to ap take ~2.5 ms while pings
from ap to sta take ~50 ms, with lots of variation and frame errors.
After switching off power management it typically goes to 1.5 ms
both ways.

After stabilizing the link with the power off setting, typical
performance in 802.11a mode for an sftp file transfer is 12 Mbit/s net
with WPA-TKIP and 19 Mbit/s without encryption.

Link topology: R52n on RouterStation in STA mode, a Wispstation5 in AP
mode, and a laptop as the remote point and source of data. I have sent a
few GB through it to make sure.

It seems necessary either to have ath9k default to 'power management
off' at initialisation or to debug this power management; as it is now,
it completely disrupts communication.

Comment 27 doddel 2009-12-14 21:46:17 UTC
Dec. 14 2009: the Nov. 22 build of ath9k has now been in operation for 3 weeks on the RouterStation platform with two R52n mini-PCI radios; it is stable and shows no memory leak.
Comment 28 Luis Chamberlain 2009-12-14 22:05:02 UTC
Thanks, that patch is rather large, but since it fixes a memory leak we'll see if it's acceptable.
Comment 29 Luis Chamberlain 2010-02-12 23:00:05 UTC
This is upstream, not backported though.
