Created attachment 23571: debugfs entries before and after 350MB of wireless outbound traffic. In station mode, regardless of encryption (tried 'none' and wpa-tkip), about 10 kB of memory moves from free to unreclaimable for every 100 MB of wireless traffic. On my RouterStation (ar7161) with an R52n miniPCI card (ar9220), running OpenWrt trunk, this forces the kernel to reboot after about 350 MB of traffic. The same traffic routed between wired ethernet ports shows no leak. To my non-expert eyes, ath9k in debug mode shows no error messages in any of its reports. I have tried builds with various compat-wireless packages up to the 2009-10-27 one.
correction: per 100 MB traffic about 10 MB of memory gets eaten.
Have you tried a 2.6.31 or later kernel?
No, but I will give it a try and report back. The default settings in the 'trunk' development tree of OpenWrt use 2.6.30.8. Thanks for giving this attention!
I have recompiled OpenWrt using Linux version 2.6.31.5 and the compat-wireless of Oct 28 2009 for the ar71xx platform (RouterStation) with the ath9k driver (Mikrotik R52n miniPCI). The memory leak is still there, with exactly the same 1:10 ratio between traffic and loss of available memory into Slab and SUnreclaim as before with 2.6.30.8. I attach the new dmesg for reference of the system used. In dmesg I notice that 4 LEDs get registered on PHY0. The Mikrotik R52n board does not have LEDs, though; I do not know whether that could do harm. In the test I am not using wpa, so neither supplicant nor hostap is active. The modules I assume are active during data transfer:
    Module                  Size  Used by
    ath9k                  63504  0
    ath9k_hw              200432  1 ath9k
    ath                     6832  2 ath9k,ath9k_hw
    mac80211              136528  1 ath9k
    cfg80211              102096  3 ath9k,ath,mac80211
Your help in pinpointing the origin of this leak is much appreciated. As things are, my system cannot be deployed, since it should carry much more traffic than just a few hundred MB.
Can you figure out what is leaking? Try
    (for i in /sys/kernel/slab/* ; do echo "$(cat $i/objects) - $i" ; done)|sort -n|tail
And/or check which one is increasing all the time. You may need some kernel config options to get those files, not sure.
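To see which cache "is increasing all the time", two snapshots can be compared instead of eyeballing the sorted list. The following is a minimal sketch, assuming the SLUB sysfs layout used in the command above (one directory per cache, each with an `objects` file); the function names `slab_snapshot` and `slab_growth` are made up for illustration.

```shell
#!/bin/sh
# Snapshot the object count of every cache under a SLUB sysfs directory.
# Prints one "<cache-name> <objects>" line per cache.
slab_snapshot() {
    dir="${1:-/sys/kernel/slab}"
    for d in "$dir"/*; do
        [ -r "$d/objects" ] || continue
        # The file may carry per-node detail after the total; keep the first field only.
        read -r count _ < "$d/objects"
        echo "${d##*/} $count"
    done
}

# Compare two snapshot files and print the caches that grew, largest growth first.
# Output format: "<growth> <cache-name>".
slab_growth() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && $2 > before[$1] { print $2 - before[$1], $1 }' "$1" "$2" |
        sort -rn
}

# Typical use (paths under /tmp are arbitrary):
#   slab_snapshot > /tmp/slab.before
#   ... run the wireless traffic test ...
#   slab_snapshot > /tmp/slab.after
#   slab_growth /tmp/slab.before /tmp/slab.after | head
```

Merged caches show up under several names with the same count, so the top entries of the growth list may be aliases of one underlying cache.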
Hi Johannes, there is no such directory /sys/kernel/slab. Must I recompile with certain options to get it?
I am currently recompiling with a number of kernel debug options activated. For the beginner, the slab, slub, slob terminology is a bit confusing, but I will give it a try to see if I can get more info on what fills SUnreclaim.
Can you please indicate what to do to make the system produce a '/sys/kernel/slab/' directory?
Not sure exactly which one enables it, but I have these enabled and I get /sys/kernel/slab:
    CONFIG_SLUB_DEBUG=y
    CONFIG_SLUB=y
    CONFIG_SLABINFO=y
tnx Luis, will give that a try !
Following is the /sys/kernel/slab data, sorted as suggested with (for i in /sys/kernel/slab/* ; do echo "$(cat $i/objects) - $i" ; done)| sort -n|tail
After boot, before 100MB wireless traffic, no encryption or conditional access:
    3204 - /sys/kernel/slab/:t-0000048
    3204 - /sys/kernel/slab/blkdev_ioc
    3204 - /sys/kernel/slab/sysfs_dir_cache
    5462 - /sys/kernel/slab/arp_cache
    5473 - /sys/kernel/slab/filp
    5502 - /sys/kernel/slab/bio-0
    5506 - /sys/kernel/slab/kmalloc-128
    5520 - /sys/kernel/slab/:t-0000128
    5543 - /sys/kernel/slab/mnt_cache
    5600 - /sys/kernel/slab/sgpool-8
After ~100MB wireless traffic:
    3228 - /sys/kernel/slab/:t-0000048
    3228 - /sys/kernel/slab/blkdev_ioc
    3228 - /sys/kernel/slab/sysfs_dir_cache
    90704 - /sys/kernel/slab/sgpool-8
    90717 - /sys/kernel/slab/mnt_cache
    90783 - /sys/kernel/slab/filp
    90784 - /sys/kernel/slab/arp_cache
    90797 - /sys/kernel/slab/:t-0000128
    90812 - /sys/kernel/slab/kmalloc-128
    90813 - /sys/kernel/slab/bio-0
ifconfig wlan0 after 100MB:
    Link encap:Ethernet  HWaddr 00:0C:42:3A:EA:BC
    inet addr:192.168.20.100  Bcast:192.168.20.255  Mask:255.255.255.0
    UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
    RX packets:73782 errors:0 dropped:0 overruns:0 frame:0
    TX packets:85216 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:7054546 (6.7 MiB)  TX bytes:121105272 (115.4 MiB)
cat /proc/meminfo after 100 MB:
    MemTotal:          61664 kB
    MemFree:           29992 kB
    Buffers:            2160 kB
    Cached:             6936 kB
    SwapCached:            0 kB
    Active:             5020 kB
    Inactive:           5112 kB
    Active(anon):       1104 kB
    Inactive(anon):        0 kB
    Active(file):       3916 kB
    Inactive(file):     5112 kB
    Unevictable:           0 kB
    Mlocked:               0 kB
    SwapTotal:             0 kB
    SwapFree:              0 kB
    Dirty:                 0 kB
    Writeback:             0 kB
    AnonPages:          1052 kB
    Mapped:             1596 kB
    Slab:              19244 kB
    SReclaimable:        872 kB
    SUnreclaim:        18372 kB
    PageTables:          152 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:       30832 kB
    Committed_AS:       5484 kB
    VmallocTotal:    1048404 kB
    VmallocUsed:         336 kB
    VmallocChunk:    1047432 kB
I can add from /proc/slabinfo that it is its last line, on kmalloc-128, that reacts to the traffic; all others remain about equal.
    # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
After 100M traffic:
    kmalloc-128 90659 90720 128 32 1 : tunables 0 0 0 : slabdata 2835 2835 0
After 200M traffic:
    kmalloc-128 175327 175328 128 32 1 : tunables 0 0 0 : slabdata 5479 5479 0
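The two kmalloc-128 lines above can be turned into a leak rate with a little arithmetic. This is only a sketch: the helper name `slab_delta` is made up, and it assumes the /proc/slabinfo column order shown in the header line (name, active_objs, num_objs, objsize).

```shell
#!/bin/sh
# Print the growth of one cache between two /proc/slabinfo-style lines.
# Fields used: $1 = name, $2 = active_objs, $4 = objsize in bytes.
slab_delta() {
    printf '%s\n%s\n' "$1" "$2" |
        awk 'NR == 1 { objs = $2 }
             NR == 2 { d = $2 - objs
                       printf "%s: +%d objects, +%d bytes (%.1f MiB)\n", $1, d, d * $4, d * $4 / 1048576 }'
}

# The two samples reported above (after 100M and after 200M of traffic).
slab_delta \
    "kmalloc-128 90659 90720 128 32 1 : tunables 0 0 0 : slabdata 2835 2835 0" \
    "kmalloc-128 175327 175328 128 32 1 : tunables 0 0 0 : slabdata 5479 5479 0"
# prints: kmalloc-128: +84668 objects, +10837504 bytes (10.3 MiB)
```

That is about 10 MiB lost per 100 MB of traffic, in line with the corrected 1:10 ratio. The growth in the first 100 MB (roughly 85,000 objects in the sysfs listing) is also close to the 85216 TX packets that ifconfig reports, which suggests, without proving it, that about one 128-byte allocation is retained per transmitted packet.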
Is there a way to trace the SUnreclaim back to the routines that cause it? I have tried to activate trace. Wireless throughput then drops dramatically and lots of output is generated in dmesg, but I cannot interpret it. What can be a next step to find the leak? Or is there a way to force the return of memory from Slab to MemFree that is faster than a system reboot? I could have a cron job monitor MemFree and trigger something at the moment of underrun, if need be with a momentary interruption of wireless traffic. Anything is better than unpredictable behaviour and unforeseeable reboots. Thanks for your help.
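The cron-job idea can be sketched as below. This is an illustration only: the function names `mem_free_kb` and `mem_low` and any threshold are made up, and the thread does not establish that anything short of a reboot actually returns the leaked memory, so the recovery action is deliberately left out.

```shell
#!/bin/sh
# Read MemFree (in kB) from a meminfo-style file (default: /proc/meminfo).
mem_free_kb() {
    awk '$1 == "MemFree:" { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed (exit status 0) when MemFree is below the threshold.
# $1 = threshold in kB, $2 = optional meminfo file for testing.
mem_low() {
    [ "$(mem_free_kb "$2")" -lt "$1" ]
}
```

A cron entry could call `mem_low` with a threshold chosen for the board (for example a few MB on this 64 MB system) and run whatever recovery step turns out to work when it succeeds.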
I know recompiling your kernel may be painful, but I believe a more suitable utility here may be kmemleak, to see if we can detect actual leaks. kmemleak will help us narrow the leak down even further. Please try as follows:
1) First please enable kmemleak in your kernel. You do this by enabling CONFIG_DEBUG_KMEMLEAK.
2) Mount debugfs upon bootup. You will have already done this for the slab stuff. But just in case:
    # mount -t debugfs nodev /sys/kernel/debug/
    # cat /sys/kernel/debug/kmemleak
3) After bootup please clear all the current kmemleaks:
    echo clear > /sys/kernel/debug/kmemleak
4) Now, I want you to run a scan prior to running some TX or RX:
    echo scan > /sys/kernel/debug/kmemleak
5) Now that we have a fresh scan please go ahead and run your TX test. This is under the assumption that your leak happens under TX and not RX (why do you believe this?). Please TX 1 GB or so of data.
6) After your TX test please re-run a scan:
    echo scan > /sys/kernel/debug/kmemleak
7) Now capture the kmemleak output:
    cat /sys/kernel/debug/kmemleak > /root/kmemleak-out.txt
Please attach your kmemleak-out.txt. If you can avoid using your hard drive during the TX that would be great, as that would rule out the hard drive modules/partition modules or reduce the noise from them. If you suspect RX is the issue you can re-run this test with RX instead of TX. And so on.
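Steps 3 to 7 above can be wrapped in two small helpers so the scans happen right before and right after the traffic test. This is a sketch only: the function names and the `KMEMLEAK` variable are made up for illustration, while the commands written to the control file are the ones listed above. It requires a kernel built with CONFIG_DEBUG_KMEMLEAK and a mounted debugfs.

```shell
#!/bin/sh
# Path of the kmemleak control file; can be overridden for a dry run.
KMEMLEAK="${KMEMLEAK:-/sys/kernel/debug/kmemleak}"

# Run before the traffic test.
kmemleak_baseline() {
    echo clear > "$KMEMLEAK"   # step 3: drop everything reported so far
    echo scan > "$KMEMLEAK"    # step 4: fresh scan before TX/RX
}

# Run after the traffic test; $1 is the file that receives the report.
kmemleak_capture() {
    echo scan > "$KMEMLEAK"    # step 6: scan again after the test
    cat "$KMEMLEAK" > "$1"     # step 7: save the report for attaching
}

# Typical use:
#   kmemleak_baseline
#   ... transmit about 1 GB ...
#   kmemleak_capture /root/kmemleak-out.txt
```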
Thanks for the advice. Unfortunately kmemleak cannot be enabled unless I overrule the menuconfig system, as it is only defined for the x86 and ARM architectures. The ar71xx is MIPS. Do you think kmemleak will work for MIPS as well?
Not sure, I don't recall seeing arch-specific code in kmemleak.c. I'll check.
The KMEMLEAK documentation says in the very last line of the chapter Limitations and Drawbacks that "Only the ARM and x86 architectures are currently supported".
It does not say why though.
Just to make sure that no error was introduced by the many experiments, I recompiled the complete cross-compiler toolchain of OpenWrt and compiled an image for the ar71xx RouterStation platform directly, having chosen kernel 2.6.31.5 and using the latest compat-wireless of Nov. 2 2009. Unfortunately exactly the same memory leak presents itself. Ath9k itself reports no errors in its debug info. I attach its debug info (taken after wlan0: RX bytes:8029844 (7.6 MiB) TX bytes:162154562 (154.6 MiB)). Any further ideas on how to get from the large numbers of 128-byte slab objects to the routine(s) holding on to them, apart from KMEMLEAK, which does not support MIPS?
have tried the same Mikrotik R52n card in a laptop (P4, 2.8 GHz) with miniPCI slot running latest Ubuntu, kernel 2.6.31-14-generic, (which also has the OpenWrt cross-compiler toolchain for the ar71xx) and ath9k driver. Tried data transfers in 802.11g and 802.11a modes: no memory leak. So the problem is specific to the ar9220 radio on a RouterBoard with ar7161 rev.2 CPU and/or the cross-compiler used or its settings. Is there anything x86 specific in ath9k that perhaps needs tweaking via a patch for proper compilation towards an ar7161 MIPS environment ?
Some progress by accident: because of instability in packet timing I wanted an iwconfig that includes the power setting ('iwconfig wlan0 power off' gives stable timing), so I tried to replace the wireless-tools package that carries iwconfig with a modified one. This procedure did not complete due to an error in opkg, the package management in OpenWrt. Hence iwconfig, which I also use to set channel and rate when bringing up the interface, became dysfunctional altogether. The interface still came up (in station mode the channel is not needed and the rate initialises at 6 Mbps when left untouched), but the memory leak was gone! It appears that iwconfig and ath9k/mac80211 have a love/hate relationship. To get the timing of packets right you need to use it to switch power management off; to not have the memory leak you should not use it. Their interaction clearly deserves further code inspection. Perhaps iw can be extended to make iwconfig a tool of the past. This bug report can be closed, as its description is inaccurate: ath9k does not leak memory per se but can be made to do so by applying iwconfig settings.
Can you please try the 2.6.32-rc7 stable compat-wireless release: http://wireless.kernel.org/en/users/Download/stable and apply the two patches I'm about to attach, and let me know if this issue disappears.
Created attachment 23836: mac80211: add the total ampdu length to tx info. The first patch is simple, and is required by the second patch.
Created attachment 23837: ath9k: get rid of tx_info_priv. This second patch is a backport of a patch already applied on wireless-testing. It removes the private info struct which we kmalloc/free.
Hi, first of all a big thank you to all involved in improving ath9k / mac80211. I have tested the suggested patches by compiling OpenWrt for the RouterStation ar71xx platform:
    OpenWrt Trunk r 18465 (Nov. 22 2009)
    compat-wireless-2009-11-22 (which had these patches in it already)
    Kernel 2.6.31.6 (ubuntu Pentium4 OpenWrt cross-compile environment)
The good news: no memory leak! So my bug report can be closed. The bad news: a stable wireless connection still needs the command "iwconfig wlan0 power off". If that is not issued while bringing up the interface, the typical behaviour is still that pings from sta to ap take ~ 2.5 ms and pings from ap to sta take ~ 50 ms, with lots of variation and frame errors. After switching off power management it goes to typically 1.5 ms both ways. After stabilizing the link with the power off setting, typical net performance for an sftp file transfer in 802.11a mode is 12 Mbit/s with WPA-TKIP and 19 Mbit/s without encryption. Link topology: R52n on RouterStation in STA mode, a Wispstation5 in AP mode, and a laptop as the remote point and source of data. I have sent a few GB through it to make sure. It seems necessary either to have ath9k initialise with power management off by default or to debug this power management; as it is now, it completely disrupts communication.
Dec. 14 2009: the Nov. 22 version of ath9k has now been in operation for 3 weeks on the RouterStation platform with two R52n mini-PCI radios; it is stable and without memory leak.
Thanks, that patch is rather large, but since it fixes a memory leak we'll see if it's acceptable.
This is upstream, not backported though.