Bug 31412

Summary: radeon memory leak
Product: Drivers
Reporter: Kevin (kjslag)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED INVALID
Severity: normal
CC: airlied, akpm, kjslag, zh.jesse
Priority: P1
Hardware: All
OS: Linux
Kernel Version: <= 2.6.38
Subsystem:
Regression: No
Bisected commit-id:
Attachments: debug info
slabinfo before echo 3 > /proc/sys/vm/drop_caches
slabinfo after echo 3 > /proc/sys/vm/drop_caches

Description Kevin 2011-03-18 20:27:13 UTC
problem:

I experience major memory leaks with my mobile radeon 5850 using the latest 2.6.38 kernel. The problem occurs with both the open source driver (I've only tried with KMS) and the catalyst driver. I have experienced this since I purchased my laptop in late December. For me, the leaked memory has reached 3GB of my 6GB many times after a few days of use (I work with large .pdfs).


how to reproduce:

The easiest way for me to quickly reproduce the leak is to open huge .pdf files with okular and scroll through the files. Here is an example:
http://developer.amd.com/gpu_assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf

After doing the above, I logged out to a terminal and killed all processes (except for init and kernel processes). Below is the free -m output after doing this. There are 3828-1949-128 = 1751 MB of memory still in use for no apparent reason. I have to reboot to get it back. I've gotten this number up to 3GB before; the only limit is my system's memory (6GB).

             total       used       free     shared    buffers     cached
Mem:          5920       3828       2092          0        128       1949
-/+ buffers/cache:       1749       4170
Swap:         7998          0       7998
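
The subtraction above (used minus buffers minus cached) can be scripted from the "Mem:" row. This is a minimal sketch, assuming the older procps `free -m` column order shown here (total, used, free, shared, buffers, cached); `used_minus_cache` is a hypothetical helper name, and newer versions of `free` use a different layout.

```shell
#!/bin/sh
# Hypothetical helper: read `free -m` style output on stdin and print
# used - buffers - cached (in MB) taken from the "Mem:" row.
# Assumes the older procps column order:
#   Mem: total used free shared buffers cached
used_minus_cache() {
    awk '$1 == "Mem:" { print $3 - $6 - $7 }'
}

# Example with the numbers from this report:
printf 'Mem: 5920 3828 2092 0 128 1949\n' | used_minus_cache
```

On a live system with the same `free` layout, this would be run as `free -m | used_minus_cache`.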



This is the only related bug that I found, but it's closed:
https://bugzilla.redhat.com/show_bug.cgi?id=642045


my setup:

latest Arch Linux software
KDE 4.6.1 with compositing (I haven't tested if this is related)
mobile radeon 5850

$ uname -a
Linux J 2.6.38-ARCH #1 SMP PREEMPT Tue Mar 15 09:36:10 CET 2011 x86_64 Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz GenuineIntel GNU/Linux


thanks
Comment 1 Andrew Morton 2011-03-18 20:31:35 UTC
It's a little unclear - is this a regression?  If so, is it a 2.6.37 -> 2.6.38 regression?

Thanks.
Comment 2 Kevin 2011-03-18 20:47:01 UTC
It's not a regression. I also experienced the problem with 2.6.37 and 2.6.36. I haven't tested any other kernel versions; I got my laptop in December.
Comment 3 Dave Airlie 2011-03-18 21:01:21 UTC
1df6a2ebd75067aefbdf07482bf8e3d0584e04ee

was the fix for the original problem, which was in 2.6.37.

Unless something we've done since is causing it or another race to happen.

So logging out of X doesn't get the memory back?

can you attach the contents of /sys/kernel/debug/dri/0/radeon_vram_mm and radeon_gtt_mm? (after mounting debugfs).
Comment 4 Kevin 2011-03-18 21:25:15 UTC
The problem occurred in 2.6.37 also.

Logging out of X doesn't get the memory back.

I'm also not sure if just scrolling a lot causes the problem. It seems to be easier for me to consume memory by opening many .pdf files.


Here is some output after logging out of X (I still had about 30MB in use by root processes):

$ free -m
             total       used       free     shared    buffers     cached
Mem:          5920       2668       3252          0         96       1088
-/+ buffers/cache:       1484       4436
Swap:         7998          0       7998

$ cat /sys/kernel/debug/dri/0/radeon_vram_mm
0x00000000-0x00000040: 0x00000040: used
0x00000040-0x00000140: 0x00000100: used
0x00000140-0x00000141: 0x00000001: used
0x00000141-0x0000092a: 0x000007e9: used
0x0000092a-0x00040000: 0x0003f6d6: free
total: 262144, used 2346 free 259798

$ cat /sys/kernel/debug/dri/0/radeon_gtt_mm 
0x00000000-0x00000001: 0x00000001: used
0x00000001-0x00000011: 0x00000010: used
0x00000011-0x00000111: 0x00000100: used
0x00000111-0x00000211: 0x00000100: used
0x00000211-0x00020000: 0x0001fdef: free
total: 131072, used 529 free 130543
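
The "total:" lines above are easier to read as sizes. This is a rough sketch that assumes the counts are in 4 KiB pages (an assumption, not stated in this thread); `mm_summary_mib` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: convert the "total: N, used N free N" summary line
# of radeon_vram_mm / radeon_gtt_mm into MiB.
# ASSUMPTION: the counts are 4 KiB pages.
mm_summary_mib() {
    awk '$1 == "total:" {
        gsub(",", "", $2)   # strip the trailing comma after the total
        printf "total %.1f MiB, used %.1f MiB, free %.1f MiB\n",
               $2 * 4 / 1024, $4 * 4 / 1024, $6 * 4 / 1024
    }'
}

# Example with the VRAM summary line from this comment:
printf 'total: 262144, used 2346 free 259798\n' | mm_summary_mib
```

Under that assumption, only a few MiB of VRAM and GTT are in use after logging out of X, far less than the missing system memory.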
Comment 5 Dave Airlie 2011-03-19 07:16:26 UTC
how about cat /sys/kernel/debug/dri/0/ttm_page_pool
Comment 6 Dave Airlie 2011-03-19 07:17:08 UTC
how about cat /sys/kernel/debug/dri/0/ttm_page_pool 

and /sys/class/drm/ttm/memory_accounting/*/*
Comment 7 Kevin 2011-03-19 09:15:54 UTC
Created attachment 51232 [details]
debug info
Comment 8 Kevin 2011-03-19 09:17:19 UTC
The attached "debug info" above is output after logging out of X (I still had about 30MB in use by root processes). It includes what Dave asked for. FYI, I haven't rebooted since my last post.

$ free -m
             total       used       free     shared    buffers     cached
Mem:          5920       4279       1641          0        212       2547
-/+ buffers/cache:       1519       4401
Swap:         7998          0       7998
Comment 9 Dave Airlie 2011-03-19 22:07:30 UTC
I'm having trouble reconciling this with the graphics driver.

a) You say it happens with fglrx as well, but the two drivers share nothing
(so it can't be a common gpu bug).

b) According to that debug info, nothing is leaking anywhere obvious.

One last test would be to unload the gpu stack and see what happens.

Go to runlevel 3, or anywhere X isn't running, then:
echo 0 > /sys/class/vtconsole/vtcon1/bind
rmmod radeon ttm drm_kms_helper drm

See if the memory is still gone.

Maybe you can watch /proc/slabinfo to see where the memory might be going. It might also just be page cache from the pdf loading.

Does echo 3 > /proc/sys/vm/drop_caches help?
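
One way to watch /proc/slabinfo is to rank the slab caches by approximate size. This is a minimal sketch, assuming the slabinfo version 2.1 layout (name, active_objs, num_objs, objsize, ...) and estimating each cache as num_objs * objsize; `top_slabs` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read slabinfo-format text on stdin and print the
# largest caches as "<approx KiB> <name>", biggest first.
# The size is an estimate: num_objs ($3) * objsize ($4), ignoring slab overhead.
top_slabs() {
    awk '
        $1 == "slabinfo" || $1 == "#" { next }   # skip the two header lines
        { printf "%d %s\n", $3 * $4 / 1024, $1 }
    ' | sort -rn | head -n "${1:-10}"
}

# Usage (reading /proc/slabinfo usually requires root):
#   top_slabs 15 < /proc/slabinfo
```

The optional argument sets how many caches to list; it defaults to 10.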
Comment 10 Kevin 2011-03-20 20:40:04 UTC
echo 3 > /proc/sys/vm/drop_caches
cleared all of the memory. Thank you! See output below (ran after exiting X).
I didn't try unloading the gpu stack. I can try it if you like.


I also saved /proc/slabinfo before and after I dropped the caches. If I'm interpreting the file correctly, I only notice large drops in ext4 filesystem stuff. Is there a way to tell which slabs are considered buffers or cache?


# free -m
             total       used       free     shared    buffers     cached
Mem:          5920       4318       1602          0        113       2465
-/+ buffers/cache:       1740       4180
Swap:         7998          0       7998
# echo 3 > /proc/sys/vm/drop_caches
# free -m
             total       used       free     shared    buffers     cached
Mem:          5920        225       5695          0          3         12
-/+ buffers/cache:        209       5711
Swap:         7998          0       7998
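
The before/after comparison of /proc/slabinfo described above can be automated. This is a minimal sketch, assuming two saved snapshots in the slabinfo version 2.1 layout and the same num_objs * objsize size estimate; `slab_diff` and the snapshot file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved /proc/slabinfo snapshots and print
# the caches that shrank as "<approx KiB freed> <name>", largest drop first.
# Size per cache is estimated as num_objs ($3) * objsize ($4).
slab_diff() {
    awk '
        $1 == "slabinfo" || $1 == "#" { next }          # skip header lines
        FNR == NR { before[$1] = $3 * $4; next }        # first file: record sizes
        ($1 in before) {
            freed = (before[$1] - $3 * $4) / 1024       # second file: compute drop
            if (freed > 0) printf "%d %s\n", freed, $1
        }
    ' "$1" "$2" | sort -rn
}

# Usage, with snapshots taken around the drop_caches write:
#   cat /proc/slabinfo > slab.before
#   echo 3 > /proc/sys/vm/drop_caches
#   cat /proc/slabinfo > slab.after
#   slab_diff slab.before slab.after | head
```

Caches that did not shrink are omitted, so the output only shows where memory was released.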
Comment 11 Kevin 2011-03-20 20:40:18 UTC
Created attachment 51392 [details]
slabinfo before echo 3 > /proc/sys/vm/drop_caches
Comment 12 Kevin 2011-03-20 20:40:48 UTC
Created attachment 51402 [details]
slabinfo after echo 3 > /proc/sys/vm/drop_caches
Comment 13 Kevin 2011-06-02 02:57:29 UTC
So is this something that should be fixed, or is everything working as intended? To me, it doesn't seem correct. Unused caches shouldn't be counted in my "used" memory.

Is there anything else that I should try to help?
Can anyone else reproduce this? I just open large pdfs and scroll through them.
Comment 14 Jesse Zhang 2011-06-02 05:56:22 UTC
There is no leak here. A quick Google search turns up http://sourcefrog.net/weblog/software/linux-kernel/free-mem.html.
Comment 15 Kevin 2011-06-02 17:24:35 UTC
I'm sorry. I wasn't clear. There is NOT a memory leak.

Below is some of my 'free -m' output from above. My question is, why does the 'used -/+ buffers/cache' decrease when I 'echo 3 > /proc/sys/vm/drop_caches'?


# free -m
             total       used       free     shared    buffers     cached
Mem:          5920       4318       1602          0        113       2465
-/+ buffers/cache:       1740       4180
Swap:         7998          0       7998
# echo 3 > /proc/sys/vm/drop_caches
# free -m
             total       used       free     shared    buffers     cached
Mem:          5920        225       5695          0          3         12
-/+ buffers/cache:        209       5711
Swap:         7998          0       7998
Comment 16 Kevin 2012-01-07 02:20:59 UTC
I think it's clear that no memory is "leaked", so I guess I'll mark this "bug" as invalid.

I still don't understand why 'echo 3 > /proc/sys/vm/drop_caches' caused the 'used -/+ buffers/cache' field to decrease by 1500MB (see my previous comment) since I thought this field doesn't include cached memory. But that's probably just a sign of my ignorance.
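
A plausible explanation, not confirmed by anyone in this thread: older procps `free` derives the 'used -/+ buffers/cache' figure as total - free - buffers - cached, so reclaimable slab memory (dentry and inode caches, reported as SReclaimable in /proc/meminfo) is still counted as used. Writing 3 to drop_caches frees reclaimable slab objects as well as page cache, which would lower that figure and is consistent with the ext4 slab drops noted in comment 10. A minimal sketch to check this, where `old_free_breakdown` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: from a meminfo-format file (default /proc/meminfo),
# print the old-style "used -/+ buffers/cache" value and the reclaimable
# slab figure, both in kB. Reclaimable slab is part of the first number
# because it is counted in neither Buffers nor Cached.
old_free_breakdown() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemFree:"      { memfree = $2 }
        $1 == "Buffers:"      { buffers = $2 }
        $1 == "Cached:"       { cached = $2 }
        $1 == "SReclaimable:" { sreclaim = $2 }
        END {
            printf "used_minus_cache_kB %d\n", total - memfree - buffers - cached
            printf "reclaimable_slab_kB %d\n", sreclaim
        }
    ' "${1:-/proc/meminfo}"
}

# Usage: run before and after `echo 3 > /proc/sys/vm/drop_caches`
#   old_free_breakdown
```

If the explanation holds, both numbers should fall by a similar amount after the caches are dropped.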