Bug 4377

Summary: Severe memory leak issue
Product: Memory Management
Component: Other
Reporter: Edward Swiftwood (payphoneed)
Assignee: Andrew Morton (akpm)
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: P2
CC: akpm, alexn, bunk
Hardware: i386
OS: Linux
Kernel Version: >= 2.6.10
Subsystem:
Regression: ---
Bisected commit-id:
Attachments:
  Memory report after (approximately) 11 days
  Slab leak detector
  Slab usage dump after appx 6d
  Slab usage dump + slabinfo

Description Edward Swiftwood 2005-03-20 06:16:35 UTC
Distribution: SuSE Linux 9.1
Hardware Environment: Toshiba Satellite A25-S207 Laptop (All stock)
Software Environment: SuSE Linux 9.1 with custom compiled kernel(s) from kernel.org
Problem Description: After about 5 days with a kernel >= 2.6.10, memory runs out
completely and the OOM killer begins killing every single process. Leading up to
that, free memory steadily dwindles; a calculation at about 4 days of uptime
showed 331MB of memory unaccounted for. Only a cold boot fixes the problem; a
warm reboot DOES NOT clear the "missing" RAM, and the cs probe fails because
there is no free RAM, and I get "IRQ11: nobody cared!" Kernels <= 2.6.9 do not
exhibit this issue.

Steps to reproduce: Get my computer, compile a kernel >= 2.6.10 and run it.

This is my first bug report, so be kind and inform me of any extra information
which would be of assistance in solving this issue.
Comment 1 Andrew Morton 2005-03-20 15:08:03 UTC
Please wait for it to happen again then gather the output of

cat /proc/meminfo
cat /proc/slabinfo
echo m > /proc/sysrq-trigger && dmesg -s 1000000
Comment 2 Edward Swiftwood 2005-03-23 04:08:39 UTC
Would that be when it starts slowing down, after the OOM-killer finishes (if
possible) or after a warm boot? Or a combination of all three?
Comment 3 Alexander Nyberg 2005-03-24 07:11:07 UTC
Ultimately just before the OOM-killer hits, but if you take some when it starts
slowing down it should give an indication of what's going on.
Comment 4 Edward Swiftwood 2005-03-25 16:33:13 UTC
I'll try to get it when OOM-killer starts, if not (Last time it was completely
frozen out while OOM-killer was working) then I'll try to get it after a
reboot... should have the results within a week or two...
Comment 5 Andrew Morton 2005-03-25 16:36:27 UTC
Sounds strange.

Please precisely define what you mean by "warm boot".
Comment 6 Edward Swiftwood 2005-03-25 16:39:06 UTC
Warm boot in this case being Alt + SysRq + B
Comment 7 Andrew Morton 2005-03-25 16:48:47 UTC
Sorry, but I'm really doubting that the kernel is still short of
memory after a sysrq-B.

Can you please triple-check that?  Let it run for three days, 
do sysrq-B then capture /proc/meminfo?

Also, do

   dmesg -s 1000000 > dmesg.warm

then coldboot and do

   dmesg -s 1000000 > dmesg.cold

Thanks.
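
One way to use the warm-boot and cold-boot captures requested in Comment 7 is to compare the memory line the kernel prints at boot. The sketch below assumes the saved logs contain the usual i386 "Memory: ...k/...k available" boot line; the function names are made up for illustration and are not from the thread.

```shell
#!/bin/sh
# Hypothetical helpers (names not from the thread): compare the boot-time
# "Memory: ...k/...k available" line in two saved dmesg captures.

# Print the first "Memory: ... available" line found in a saved dmesg file.
boot_mem_line() {
    grep 'Memory: .*available' "$1" | head -n 1
}

# Show both lines and return success only if they are identical.
compare_boot_mem() {
    warm=$(boot_mem_line "$1")
    cold=$(boot_mem_line "$2")
    printf 'warm: %s\ncold: %s\n' "$warm" "$cold"
    [ "$warm" = "$cold" ]
}

# Usage: compare_boot_mem WARM_DMESG_FILE COLD_DMESG_FILE
```

If the two lines differ, the warm-booted kernel really did see less usable memory at boot, which would support the "missing RAM" observation.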
Comment 8 Andrew Morton 2005-05-25 16:33:26 UTC
Does 2.6.12-rc5 still have problems?
Comment 9 Edward Swiftwood 2005-05-27 13:21:51 UTC
I recently switched distributions to Gentoo (stage 1, of course) and I installed
the 2.6.11.7 vanilla sources...

The problem of the swap just filling up _seems_ to be gone, but it still begins
to grind to a halt after some uptime, only now it lasts longer, about 7 or 8 days...

I'm considering updating the whole system sooner or later, I'll test 2.6.12-rc5
or later if you wish.
Comment 10 Alexander Nyberg 2005-05-27 13:44:16 UTC
Hm ok. Yeah please use vanilla 2.6.12-rc5. 
And like Andrew asked the first time, please gather

cat /proc/meminfo
cat /proc/slabinfo
echo m > /proc/sysrq-trigger && dmesg -s 1000000

as late as possible. Running it as an hourly cron job would be good; then send
us the info from the last output.

Please tell if you still get the "IRQ11: nobody cared!" with 2.6.12-rc5 too!
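
The hourly collection suggested in Comment 10 could be sketched roughly as follows. This is only an illustration: the function name and the PROC_ROOT and SNAP_DIR variables are assumptions introduced here, and the sysrq step only works when run as root.

```shell
#!/bin/sh
# Hypothetical snapshot helper; PROC_ROOT and SNAP_DIR are illustrative
# names introduced here, not something from the thread.
snapshot_mem() {
    proc_root=${PROC_ROOT:-/proc}
    out_dir=${SNAP_DIR:-/var/tmp/memsnap}/$(date +%Y%m%d-%H%M%S)
    mkdir -p "$out_dir" || return 1

    cat "$proc_root/meminfo"  > "$out_dir/meminfo"
    cat "$proc_root/slabinfo" > "$out_dir/slabinfo"

    # sysrq 'm' asks the kernel to log its memory state; this needs root.
    if [ -w "$proc_root/sysrq-trigger" ]; then
        echo m > "$proc_root/sysrq-trigger"
    fi
    # Save the kernel log; do not abort the snapshot if dmesg is unavailable.
    dmesg -s 1000000 > "$out_dir/dmesg" 2>/dev/null || true

    # Print where the snapshot was written.
    echo "$out_dir"
}
```

Saved as a script that ends with a call to snapshot_mem, this could be run from root's crontab once an hour, so that the last directory written before the machine grinds to a halt holds the most useful data.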
Comment 11 Edward Swiftwood 2005-06-21 05:48:15 UTC
Sorry I'm a little slow on this, I have an accelerated summer class that eats up
most (all) of my time...

As for the IRQ glitch, it seems to occur whenever doing a warm reboot... despite
how long the computer has been on. It usually happens 90% of the time, so when I
reboot now I usually just turn it off and rest it for about a minute then no
issue when I boot it again.

More to come... sometime (I promise!).
Comment 12 Edward Swiftwood 2005-07-06 15:39:02 UTC
As of kernel 2.6.12, the swap fill-up seems to be gone, but the system still
begins to slow down around 8 days of use... Checking over time indicated that
even with heavy memory loads (I was messing with qemu the past couple days) the
swap will go back down instead of the way it used to go up and stay up. So
whatever was done since then seems to have had an effect.
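
The swap behaviour described in Comment 12 can be tracked with a single number per sample. A minimal sketch, assuming the standard SwapTotal and SwapFree fields of /proc/meminfo (the function name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: print swap in use, in kB, from a meminfo-style file.
# Defaults to /proc/meminfo; a saved copy can be passed as the first argument.
swap_used_kb() {
    file=${1:-/proc/meminfo}
    awk '$1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { free = $2 }
         END { print total - free }' "$file"
}
```

Logging this value periodically makes it easy to see whether swap use falls back after a heavy load or only ever climbs.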
Comment 13 Edward Swiftwood 2005-07-07 06:26:36 UTC
Created attachment 5295 [details]
Memory report after (approximately) 11 days

This is Piyoko's memory report after about 11 days (actually 10 days and 23
hours, but that shouldn't make a difference), by which point the slowdown had
already been quite severe for about 3 days. The machine is running kernel
2.6.12 downloaded from kernel.org.

Hope this helps!
Comment 14 Alexander Nyberg 2005-07-07 07:34:16 UTC
Created attachment 5296 [details]
Slab leak detector
Comment 15 Alexander Nyberg 2005-07-07 07:35:51 UTC
Looks like someone is leaking size-64 slabs. Could you please apply the patch I
just sent (Slab leak detector) and under "Kernel hacking" in the kernel
configuration enable "Debug memory allocations" and "Compile the kernel with
frame pointers".

When the machine has been on for a while do:
echo "size-64 0 0 0" > /proc/slabinfo

which will tell us who is allocating but not letting go of objects (via dmesg -s
1000000). Make sure you use vanilla (kernel.org) kernel with no extra patches
please.
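
Before booting the patched kernel it may be worth confirming that both debug options from Comment 15 actually ended up in the build. The sketch below assumes that "Debug memory allocations" corresponds to CONFIG_DEBUG_SLAB and "Compile the kernel with frame pointers" to CONFIG_FRAME_POINTER, which is the mapping in 2.6.12-era trees as far as I know; check your own tree's Kconfig if in doubt. The function name is made up.

```shell
#!/bin/sh
# Hypothetical check: report whether the two debug options are enabled in a
# kernel .config. The symbol names are an assumption about 2.6.12-era trees.
check_leak_debug_config() {
    config=${1:-.config}
    missing=0
    for opt in CONFIG_DEBUG_SLAB CONFIG_FRAME_POINTER; do
        if grep -q "^${opt}=y" "$config"; then
            echo "$opt: enabled"
        else
            echo "$opt: NOT enabled"
            missing=1
        fi
    done
    # Non-zero exit status if at least one option is missing.
    return "$missing"
}

# Usage: check_leak_debug_config /path/to/linux/.config
```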
Comment 16 Edward Swiftwood 2005-07-07 08:03:07 UTC
Patch applied to a clean vanilla 2.6.13-rc1 kernel sources, debug options
applied and compiling with debug indication in its name (that last part is for
my reference, it's not an important piece of information for anyone here).
Comment 17 Edward Swiftwood 2005-07-07 08:21:56 UTC
Cancel comment 16... kernel build failed due to unknown issue... Trying manual
download of 2.6.12.2 sources with same patching.
Comment 18 Edward Swiftwood 2005-07-07 08:24:40 UTC
Note to self: Don't post until AFTER it's built. Issue was an OOM. Rebuilding
after quick process killoffs.
Comment 19 Edward Swiftwood 2005-07-13 07:17:58 UTC
Created attachment 5322 [details]
Slab usage dump after appx 6d

Here's the output of [ echo "size-64 0 0 0" > /proc/slabinfo ] as gleaned from
the kernel logs rather than dmesg because dmesg cut them off. Ignore anything
above the slab information, that was a completely unrelated problem (Lousy MS
mouse).
Comment 20 Alexander Nyberg 2005-07-14 03:39:32 UTC
The current slab dump you sent doesn't give any hard appearance of a memory leak
(a few acpi_os_allocate but looks fairly innocent). How does your /proc/slabinfo
look? Has the size-64 cache grown large yet? 
Comment 21 Edward Swiftwood 2005-07-21 07:04:02 UTC
According to what I gleaned from the cache column in slabtop, yes, the cache
grows VERY large along with the drop in available memory. It throws the
alignment off in slabtop, in fact, because the values for that line are so large.
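
The size of a single cache can also be read straight from /proc/slabinfo without slabtop. A rough sketch, assuming the slabinfo version 2.x column order (name, active_objs, num_objs, objsize, ...); the function name is hypothetical and the kB figure ignores per-slab overhead:

```shell
#!/bin/sh
# Hypothetical helper: summarise one cache from a slabinfo-style file.
# Assumes slabinfo 2.x columns: name active_objs num_objs objsize ...
slab_usage() {
    cache=$1
    file=${2:-/proc/slabinfo}
    awk -v c="$cache" '$1 == c {
        printf "%s active=%d total=%d objsize=%d approx_kb=%d\n",
               $1, $2, $3, $4, ($3 * $4) / 1024
    }' "$file"
}

# Usage: slab_usage size-64
```

The exact match on the first column keeps a neighbouring cache such as size-64(DMA) out of the result.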
Comment 22 Alexander Nyberg 2005-07-21 07:21:00 UTC
Ok, could you send a new batch of output from /proc/slabinfo and echo "size-64 0
0 0" > /proc/slabinfo and dmesg -s 1000000

It'll be easier to make something out of it if /proc/slabinfo output is also
attached. Thanks
Comment 23 Edward Swiftwood 2005-07-22 06:12:44 UTC
Sure, I'll get that as soon as I reach about 5-6 days uptime again and toss it
up here.
Comment 24 Edward Swiftwood 2005-07-28 05:48:47 UTC
Created attachment 5396 [details]
Slab usage dump + slabinfo

Here's the output of /proc/slabinfo as well as the size-64 usage output. This
time it was grabbed after a little more than seven and a half days, nearer to
the point where things will grind to a halt, which usually occurs around 10
days uptime.
Comment 25 Andrew Morton 2005-07-29 11:54:53 UTC
(We can email into bugme-daemon@kernel-bugs.osdl.org now, and the bug gets
updated.

Alexander Nyberg <alexn@telia.com> wrote:
>
> 
> [acpi-devel has decided I'm a spammer so I can't post to acpi-devel, so
> I'm throwing this out to some people I think are interested...]
> 
> Can the acpi folks take a look at this? It appears that there is a
> leak on his machine.

hm, what a shame slab-leak-detector doesn't generate backtraces.  Hard, I
guess.

> Furthermore, can we kill acpi_os_allocate() or at least make it
> a static inline in some header file cause it makes seeing caller
> difficult in places like slab object owner tracing.

Send patch ;)

Comment 26 Robert Moore 2005-07-29 12:22:15 UTC
acpi_os_allocate exists because the ACPI CA code is integrated into at
least 10 different operating systems.  The purpose of the OSL is to
abstract the operating system services in order to simplify integration
of ACPICA with no source code changes within the core code itself.

Bob


> -----Original Message-----
> From: Andrew Morton [mailto:akpm@osdl.org]
> Sent: Friday, July 29, 2005 11:54 AM
> To: Alexander Nyberg; bugme-daemon@kernel-bugs.osdl.org
> Cc: Brown, Len; Moore, Robert; Yu, Luming; Li, Shaohua
> Subject: Re: [bugme-daemon@kernel-bugs.osdl.org: [Bug 4377] Severe memory leak issue]
> 
> 
> (We can email into bugme-daemon@kernel-bugs.osdl.org now, and the bug gets
> updated.
> 
> Alexander Nyberg <alexn@telia.com> wrote:
> >
> >
> > [acpi-devel has decided I'm a spammer so I can't post to acpi-devel, so
> > I'm throwing this out to some people I think are interested...]
> >
> > Can the acpi folks take a look at this? It appears that there is a
> > leak on his machine.
> 
> hm, what a shame slab-leak-detector doesn't generate backtraces.  Hard, I
> guess.
> 
> > Furthermore, can we kill acpi_os_allocate() or at least make it
> > a static inline in some header file cause it makes seeing caller
> > difficult in places like slab object owner tracing.
> 
> Send patch ;)


Comment 27 Andrew Morton 2005-07-29 13:16:02 UTC
"Moore, Robert" <robert.moore@intel.com> wrote:
>
> acpi_os_allocate exists because the ACPI CA code is integrated into at
> least 10 different operating systems.  The purpose of the OSL is to
> abstract the operating system services in order to simplify integration
> of ACPICA with no source code changes within the core code itself.

Yup.  But we can make it an inline function, yes?  That'll be faster, will
save stack space and will allow the slab-leak-detector to determine the real
callsite for acpi slab allocations.

Comment 28 Robert Moore 2005-07-29 13:36:21 UTC
Sure, it can be implemented in whatever way is best.


> -----Original Message-----
> From: Andrew Morton [mailto:akpm@osdl.org]
> Sent: Friday, July 29, 2005 1:15 PM
> To: Moore, Robert
> Cc: alexn@telia.com; bugme-daemon@kernel-bugs.osdl.org; Brown, Len; Yu, Luming; Li, Shaohua
> Subject: Re: [bugme-daemon@kernel-bugs.osdl.org: [Bug 4377] Severe memory leak issue]
> 
> "Moore, Robert" <robert.moore@intel.com> wrote:
> >
> > acpi_os_allocate exists because the ACPI CA code is integrated into at
> > least 10 different operating systems.  The purpose of the OSL is to
> > abstract the operating system services in order to simplify integration
> > of ACPICA with no source code changes within the core code itself.
> 
> Yup.  But we can make it an inline function, yes?  That'll be faster, will
> save stack space and will allow the slab-leak-detector to determine the real
> callsite for acpi slab allocations.


Comment 29 Andrew Morton 2005-07-29 14:12:50 UTC
"Moore, Robert" <robert.moore@intel.com> wrote:
>
> Sure, it can be implemented in whatever way is best.
> 

Alex, that's your cue ;)

Comment 30 Edward Swiftwood 2005-08-29 16:56:47 UTC
Well, after getting impatient, I upgraded to 2.6.13-rc5, and I noticed that the
leak seems to have been plugged, will keep Piyoko running and keep you all posted.
Comment 31 Edward Swiftwood 2005-08-30 05:28:43 UTC
My bad... it's rc6. -.-
Comment 32 Edward Swiftwood 2005-09-01 06:21:30 UTC
Upgraded to 2.6.13, and I think the leak is back. size-64 seems to be rising...
and rising... and rising...

I think I also noticed that the Toshiba interface (Like for fnfxd) was
inoperative in 2.6.13-rc6; It's working again in 2.6.13 but I think the leak is
back with it... It is early yet, so I'll keep you posted.
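
Whether size-64 is "rising and rising" can be put into numbers by comparing two saved copies of /proc/slabinfo taken some hours apart. A minimal sketch with made-up function names, again assuming the slabinfo 2.x layout where the second column is the active object count:

```shell
#!/bin/sh
# Hypothetical helpers: measure growth of one cache between two saved
# copies of /proc/slabinfo (second column = active objects in slabinfo 2.x).

# Print the active object count of cache $2 in slabinfo file $1.
slab_active() {
    awk -v c="$2" '$1 == c { print $2 }' "$1"
}

# Print how many active objects cache $3 gained between files $1 and $2.
slab_growth() {
    before=$(slab_active "$1" "${3:-size-64}")
    after=$(slab_active "$2" "${3:-size-64}")
    echo $((after - before))
}

# Usage: slab_growth OLDER_SLABINFO NEWER_SLABINFO size-64
```

A count that keeps growing across snapshots while the workload stays the same is the pattern a leak would produce; a count that rises and falls with load is not.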
Comment 33 Andrew Morton 2005-09-01 12:05:53 UTC
Are you thinking that the leak is associated with the toshiba
interface?   If so, that should be pretty easy to confirm
by just disabling that piece of kernel config, no?

Alex, where's that inline-acpi_os_allocate patch? ;)
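
To check how the Toshiba support is built before disabling it, something like the following could be used. It assumes the relevant option is CONFIG_ACPI_TOSHIBA ("Toshiba Laptop Extras" in the ACPI menu); the function name is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: report how the Toshiba ACPI driver is configured in a
# kernel .config. Assumes the option is named CONFIG_ACPI_TOSHIBA.
acpi_toshiba_state() {
    config=${1:-.config}
    if grep -q '^CONFIG_ACPI_TOSHIBA=y' "$config"; then
        echo built-in
    elif grep -q '^CONFIG_ACPI_TOSHIBA=m' "$config"; then
        echo module
    else
        echo disabled
    fi
}

# Usage: acpi_toshiba_state /path/to/linux/.config
```

If the driver is built as a module, leaving it unloaded for a few days and watching size-64 may be a quicker test than rebuilding the kernel with the option switched off.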
Comment 34 Edward Swiftwood 2005-09-01 16:38:15 UTC
Actually, I was just going to try that. Check out bug 4707 filed about two
months after this one, will report when I build the new one.
Comment 35 Edward Swiftwood 2005-09-01 17:04:54 UTC
Yeah, I think we can safely say that the memory leak is in the Toshiba ACPI
Interface. Bug 4707 is a duplicate of this one.
Comment 36 Alexander Nyberg 2005-09-03 13:27:43 UTC
(akpm: sorry for the delay. however, an inline patch of acpi_os_allocate
unfortunately wouldn't help much as they have more than one wrapper for memory
allocation (!!!))

Yes, this indeed looks like the same thing as 4707, which was clearly an acpi bug.
The toshiba machinery calls acpi_evaluate_object(), which allocates memory, so
that has to be the one responsible.

I'm afraid I'm gonna have to pass this one to someone who understands acpi, sorry.
Comment 37 Dave Jones 2006-01-04 23:11:25 UTC
4707 was fixed in 2.6.14, so this bug should be gone in 2.6.15 ?
Comment 38 Adrian Bunk 2006-04-22 09:39:55 UTC
I'm assuming this issue is already fixed in recent 2.6 kernels.

Please reopen this bug if it's still present in kernel 2.6.16.
Comment 39 Edward Swiftwood 2006-05-14 18:03:39 UTC
The problem has been gone for a while now, sorry about taking so long to
respond. I've closed it now.