Distribution: SuSE Linux 9.1
Hardware Environment: Toshiba Satellite A25-S207 Laptop (All stock)
Software Environment: SuSE Linux 9.1 with custom compiled kernel(s) from kernel.org
Problem Description: After about 5 days with a kernel >= 2.6.10, memory runs out completely and the OOM killer begins killing every single process. Leading up to it, memory begins to dwindle; a calculation at about 4 days' uptime showed 331MB of memory unaccounted for. Fixing the problem requires nothing less than a cold boot; a warm reboot DOES NOT clear the "missing" RAM, the cs probe fails due to not having any free RAM, and I get "IRQ11: nobody cared!" Kernels <= 2.6.9 do not exhibit this issue.
Steps to reproduce: Get my computer, compile a kernel >= 2.6.10 and run it.
This is my first bug report, so be kind and inform me of any extra information which would be of assistance in solving this issue.
Please wait for it to happen again, then gather the output of:
    cat /proc/meminfo
    cat /proc/slabinfo
    echo m > /proc/sysrq-trigger && dmesg -s 1000000
Would that be when it starts slowing down, after the OOM-killer finishes (if possible) or after a warm boot? Or a combination of all three?
Ultimately just before the OOM-killer hits, but if you take some when it starts slowing down it should give an indication of what's going on.
I'll try to get it when OOM-killer starts, if not (Last time it was completely frozen out while OOM-killer was working) then I'll try to get it after a reboot... should have the results within a week or two...
Sounds strange. Please precisely define what you mean by "warm boot".
Warm boot in this case being Alt + SysRq + B
Sorry, but I'm really doubting that the kernel is still short of memory after a sysrq-B. Can you please triple-check that? Let it run for three days, do sysrq-B, then capture /proc/meminfo? Also, do
    dmesg -s 1000000 > dmesg.warm
then coldboot and do
    dmesg -s 1000000 > dmesg.cold
Thanks.
Does 2.6.12-rc5 still have problems?
I recently switched distributions to Gentoo (stage 1, of course) and I installed the 2.6.11.7 vanilla sources... The problem of the swap just filling up _seems_ to be gone, but it still begins to grind to a halt after some uptime, only now it lasts longer, about 7 or 8 days... I'm considering updating the whole system sooner or later, I'll test 2.6.12-rc5 or later if you wish.
Hm ok. Yeah, please use vanilla 2.6.12-rc5. And like Andrew asked the first time, please gather
    cat /proc/meminfo
    cat /proc/slabinfo
    echo m > /proc/sysrq-trigger && dmesg -s 1000000
as late as possible. Putting it in a cron job that gathers every hour would be good; then send us the info from the last output. Please tell us if you still get the "IRQ11: nobody cared!" with 2.6.12-rc5 too!
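For anyone following along, the hourly gathering suggested above could be sketched roughly as below. This is only an illustration, not something posted in the thread: the function name `snapshot_mem`, its two arguments, and the default output directory `/var/log/memleak` are all assumptions. It copies `meminfo` and `slabinfo` into a timestamped directory and, only when reading the real `/proc` with write access to `sysrq-trigger`, also captures the sysrq-m dump via `dmesg`.

```shell
# Hypothetical helper (not from the thread): snapshot_mem [PROC_DIR] [OUT_DIR]
# Saves meminfo and slabinfo into a timestamped subdirectory of OUT_DIR
# and prints the path of that subdirectory.
snapshot_mem() {
    proc_dir=${1:-/proc}
    out_dir=${2:-/var/log/memleak}   # assumed location, adjust as needed
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$out_dir/$stamp"

    mkdir -p "$dest" || return 1
    cat "$proc_dir/meminfo"  > "$dest/meminfo"  || return 1
    cat "$proc_dir/slabinfo" > "$dest/slabinfo" || return 1

    # The sysrq 'm' memory dump goes to the kernel log, so it is only
    # attempted against the real /proc and only if the trigger is writable.
    if [ "$proc_dir" = /proc ] && [ -w /proc/sysrq-trigger ]; then
        echo m > /proc/sysrq-trigger && dmesg -s 1000000 > "$dest/dmesg"
    fi

    echo "$dest"
}
```

A small script that defines this function and then calls `snapshot_mem` with no arguments could be run from root's crontab with a schedule of `0 * * * *`; the last directory written before the machine grinds to a halt is the one to attach.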
Sorry I'm a little slow on this, I have an accelerated summer class that eats up most (all) of my time... As for the IRQ glitch, it seems to occur whenever doing a warm reboot... despite how long the computer has been on. It usually happens 90% of the time, so when I reboot now I usually just turn it off and rest it for about a minute then no issue when I boot it again. More to come... sometime (I promise!).
As of kernel 2.6.12, the swap fill-up seems to be gone, but the system still begins to slow down around 8 days of use... Checking over time indicated that even with heavy memory loads (I was messing with qemu the past couple days) the swap will go back down instead of the way it used to go up and stay up. So whatever was done since then seems to have had an effect.
Created attachment 5295
Memory report after (approximately) 11 days
This would be Piyoko's memory report after about 11 days (it's actually 10 days and 23 hours, but that shouldn't make a difference), by which point the slowdown had already been quite severe for about 3 days. It is running kernel 2.6.12 downloaded from kernel.org. Hope this helps!
Created attachment 5296
Slab leak detector
Looks like someone is leaking size-64 slabs. Could you please apply the patch I just sent (Slab leak detector) and under "Kernel hacking" in the kernel configuration enable "Debug memory allocations" and "Compile the kernel with frame pointers". When the machine has been on for a while do:
    echo "size-64 0 0 0" > /proc/slabinfo
which will tell us who is allocating but not letting go of objects (via dmesg -s 1000000). Make sure you use a vanilla (kernel.org) kernel with no extra patches please.
Patch applied to a clean vanilla 2.6.13-rc1 kernel sources, debug options applied and compiling with debug indication in its name (that last part is for my reference, it's not an important piece of information for anyone here).
Cancel comment 16... kernel build failed due to unknown issue... Trying manual download of 2.6.12.2 sources with same patching.
Note to self: Don't post until AFTER it's built. Issue was an OOM. Rebuilding after quick process killoffs.
Created attachment 5322
Slab usage dump after appx 6d
Here's the output of [ echo "size-64 0 0 0" > /proc/slabinfo ] as gleaned from the kernel logs rather than dmesg because dmesg cut them off. Ignore anything above the slab information, that was a completely unrelated problem (Lousy MS mouse).
The current slab dump you sent doesn't give any hard appearance of a memory leak (a few acpi_os_allocate but looks fairly innocent). How does your /proc/slabinfo look? Has the size-64 cache grown large yet?
According to what I gleaned from the cache column in slabtop, yes, the cache grows VERY large along with the drop in available memory. It throws the alignment off in slabtop, in fact, because the values for that line are so large.
Ok, could you send a new batch of output from
    cat /proc/slabinfo
and
    echo "size-64 0 0 0" > /proc/slabinfo
    dmesg -s 1000000
It'll be easier to make something out of it if /proc/slabinfo output is also attached. Thanks
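As a rough way to answer "has the size-64 cache grown large yet?" without eyeballing slabtop, the relevant line of a saved slabinfo file can be pulled out with awk. The sketch below is an assumption-laden illustration: the function name `slab_cache_usage` is made up, and it assumes the slabinfo 2.x column order (name, active_objs, num_objs, objsize, ...). The size it prints is only num_objs times objsize, so it ignores per-slab overhead and will differ when "Debug memory allocations" changes the object size.

```shell
# Hypothetical helper (not from the thread): slab_cache_usage SLABINFO_FILE CACHE_NAME
# Prints: <name> <active_objs> <num_objs> <objsize> <approx_kB>
slab_cache_usage() {
    awk -v cache="$2" '
        $1 == cache {
            # Assumed slabinfo 2.x columns: name active_objs num_objs objsize ...
            # approx_kB = num_objs * objsize / 1024 (slab overhead not counted)
            printf "%s %d %d %d %d\n", $1, $2, $3, $4, ($3 * $4) / 1024
        }' "$1"
}
```

For example, `slab_cache_usage /proc/slabinfo size-64` (run as root if the file is not world-readable) prints a single line whose last field approximates the kilobytes held by that cache.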
Sure, I'll get that as soon as I reach about 5-6 days uptime again and toss it up here.
Created attachment 5396
Slab usage dump + slabinfo
Here's the output of /proc/slabinfo as well as the size-64 usage output. This time it was grabbed after a little more than seven and a half days, nearer to the point where things will grind to a halt, which usually occurs around 10 days uptime.
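When two slabinfo dumps from different uptimes are available, comparing them shows which caches are actually growing. The following is a minimal sketch under the same assumed slabinfo 2.x column layout; `slab_growth` is a hypothetical name, and it only compares the num_objs column, so a cache that is new in the second dump is not reported.

```shell
# Hypothetical helper (not from the thread): slab_growth OLD_SLABINFO NEW_SLABINFO
# Prints "<growth in num_objs> <cache name>" for every cache whose
# num_objs increased between the two dumps, largest growth first.
slab_growth() {
    awk '
        /^slabinfo/ || /^#/ { next }                   # skip the header lines
        FILENAME == ARGV[1] { old[$1] = $3; next }     # older dump: remember num_objs
        ($1 in old) && ($3 + 0 > old[$1] + 0) {        # newer dump: report growth only
            print $3 - old[$1], $1
        }
    ' "$1" "$2" | sort -rn
}
```

Running it on an early dump and a late one should put a leaking cache such as size-64 at or near the top of the list, while caches that merely fluctuate stay small or disappear.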
(We can email into bugme-daemon@kernel-bugs.osdl.org now, and the bug gets updated.)

Alexander Nyberg <alexn@telia.com> wrote:
> [acpi-devel has decided I'm a spammer so I can't post to acpi-devel, so
> I'm throwing this out to some people I think are interested...]
>
> Can the acpi folks take a look at this? It appears that there is a
> leak on his machine.

hm, what a shame slab-leak-detector doesn't generate backtraces. Hard, I guess.

> Furthermore, can we kill acpi_os_allocate() or at least make it
> a static inline in some header file cause it makes seeing caller
> difficult in places like slab object owner tracing.

Send patch ;)
acpi_os_allocate exists because the ACPI CA code is integrated into at least 10 different operating systems. The purpose of the OSL is to abstract the operating system services in order to simplify integration of ACPICA with no source code changes within the core code itself.

Bob

> -----Original Message-----
> From: Andrew Morton [mailto:akpm@osdl.org]
> Sent: Friday, July 29, 2005 11:54 AM
> To: Alexander Nyberg; bugme-daemon@kernel-bugs.osdl.org
> Cc: Brown, Len; Moore, Robert; Yu, Luming; Li, Shaohua
> Subject: Re: [bugme-daemon@kernel-bugs.osdl.org: [Bug 4377] Severe memory leak issue]
"Moore, Robert" <robert.moore@intel.com> wrote: > > acpi_os_allocate exists because the ACPI CA code is integrated into at > least 10 different operating systems. The purpose of the OSL is to > abstract the operating system services in order to simplify integration > of ACPICA with no source code changes within the core code itself. Yup. But we can make it an inline function, yes? That'll be faster, will save stack space and will allow the slab-leak-detector to determine the real callsite for acpi slab allocations.
Sure, it can be implemented in whatever way is best.

> -----Original Message-----
> From: Andrew Morton [mailto:akpm@osdl.org]
> Sent: Friday, July 29, 2005 1:15 PM
> To: Moore, Robert
> Cc: alexn@telia.com; bugme-daemon@kernel-bugs.osdl.org; Brown, Len; Yu, Luming; Li, Shaohua
> Subject: Re: [bugme-daemon@kernel-bugs.osdl.org: [Bug 4377] Severe memory leak issue]
"Moore, Robert" <robert.moore@intel.com> wrote: > > Sure, it can be implemented in whatever way is best. > Alex, that's your cue ;)
Well, after getting impatient, I upgraded to 2.6.13-rc5, and I noticed that the leak seems to have been plugged, will keep Piyoko running and keep you all posted.
My bad... it's rc6. -.-
Upgraded to 2.6.13, and I think the leak is back. size-64 seems to be rising... and rising... and rising... I think I also noticed that the Toshiba interface (Like for fnfxd) was inoperative in 2.6.13-rc6; It's working again in 2.6.13 but I think the leak is back with it... It is early yet, so I'll keep you posted.
Are you thinking that the leak is associated with the toshiba interface? If so, that should be pretty easy to confirm by just disabling that piece of kernel config, no? Alex, where's that inline-acpi_os_allocate patch? ;)
Actually, I was just going to try that. Check out bug 4707 filed about two months after this one, will report when I build the new one.
Yeah, I think we can safely say that the memory leak is in the Toshiba ACPI Interface. Bug 4707 is a duplicate of this one.
(akpm: sorry for the delay. However, an inline patch of acpi_os_allocate unfortunately wouldn't help much, as they have more than one wrapper for memory allocation (!!!))

Yes, this indeed looks like the same thing as 4707, which was clearly an acpi bug. The toshiba machinery calls acpi_evaluate_object(), which allocates memory, so that has to be the one responsible. I'm afraid I'm gonna have to pass this one to someone who understands acpi, sorry.
4707 was fixed in 2.6.14, so this bug should be gone in 2.6.15?
I'm assuming this issue is already fixed in recent 2.6 kernels. Please reopen this bug if it's still present in kernel 2.6.16.
The problem has been gone for a while now, sorry about taking so long to respond. I've closed it now.