Bug 200387 - amdgpu uses unusually high memory
Summary: amdgpu uses unusually high memory
Status: RESOLVED OBSOLETE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-01 15:54 UTC by phoenix
Modified: 2018-07-08 12:57 UTC
CC List: 2 users

See Also:
Kernel Version: 4.16.18, 4.17.3, 4.18.0-rc2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Config file for 4.17.3. (142.00 KB, text/x-mpsub)
2018-07-01 15:54 UTC, phoenix
dmesg Kernel 4.17.3 (76.05 KB, text/plain)
2018-07-02 15:14 UTC, phoenix
dmesg on Kernel 4.4.0 (87.98 KB, text/plain)
2018-07-02 15:14 UTC, phoenix
Xorg Log on 4.17.3 (57.46 KB, text/plain)
2018-07-02 15:15 UTC, phoenix
Xorg log on 4.4.0 (57.47 KB, text/plain)
2018-07-02 15:15 UTC, phoenix
Free and stats of the two Kernels (1.26 KB, text/plain)
2018-07-02 15:19 UTC, phoenix
Output of top of the problematic process on the two Kernels (660 bytes, text/plain)
2018-07-02 15:19 UTC, phoenix
kmemleak output of Cities.x64 (55.54 KB, text/plain)
2018-07-02 20:40 UTC, phoenix
Memory usage measurements for different programs using Kernel 4.9.111 and 4.15.0-24 (5.20 KB, text/plain)
2018-07-03 19:16 UTC, phoenix

Description phoenix 2018-07-01 15:54:10 UTC
Created attachment 277107 [details]
Config file for 4.17.3.

Hi there,

I'm experiencing out-of-memory issues while running Cities: Skylines with the amdgpu driver. Starting a new game causes a complete system freeze on any Kernel that runs the amdgpu driver, but not on a rather old Kernel that uses the amdgpu-pro driver. The memory in question is the system's main memory, not the GPU memory.

System details:
I'm running Ubuntu Mate 16.04 with a custom-built 4.17.3 Kernel (config attached)
AMD FX-8350
32 GB RAM
Radeon RX470

Sample Main Memory usage.
Kernel 4.4 with amdgpu-pro driver    - RAM Usage after 1 Minute: 2.4 GB
Kernel 4.17.3 with amdgpu driver     - RAM Usage after 1 Minute: 13 GB
Kernel 4.16.18 with amdgpu driver    - RAM Usage after 1 Minute: 13 GB
Kernel 4.18.0-rc2 with amdgpu driver - RAM Usage after 1 Minute: 13 GB

I get similar results when running Stardew Valley (a factor-of-two difference, clearly measurable).

Find attached the config file for the 4.17.3 Kernel. The other Kernels were built using this config file and the default suggestions for any unconfigured parameter.

Greetings,
Felix
Comment 1 Alex Deucher 2018-07-02 03:34:27 UTC
Please attach your dmesg output and xorg log if using X.
Comment 2 Michel Dänzer 2018-07-02 10:13:43 UTC
Also, please attach the output of

 free

and of top after pressing shift-M, both captured while RAM usage is high.
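
For anyone who wants to log the system-wide figure over time instead of reading it off `free` by hand, here is a minimal Python sketch. It parses `/proc/meminfo`, the file `free` itself reads on Linux, and reports used memory as `MemTotal - MemAvailable`. The function names and the choice of that formula are assumptions made for illustration; nothing in this report prescribes them.

```python
from pathlib import Path
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Map each /proc/meminfo key to its integer value.

    Most values are in KiB (the trailing 'kB' is dropped); a few entries,
    such as HugePages_Total, are plain counts without a unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip anything that is not a "Key: value" line
        info[key.strip()] = int(rest.split()[0])
    return info


def used_gib(info: Dict[str, int]) -> float:
    """Used memory in GiB, taken as MemTotal - MemAvailable (an assumption)."""
    return (info["MemTotal"] - info["MemAvailable"]) / (1024 * 1024)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # only present on Linux
        print(f"used: {used_gib(parse_meminfo(meminfo.read_text())):.1f} GiB")
```

Running it once before starting the game and again while RAM usage is high gives two numbers that can be compared across kernels.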
Comment 3 phoenix 2018-07-02 10:55:50 UTC
Yeah, I'll post the mentioned things today after I get home.
Comment 4 phoenix 2018-07-02 15:14:25 UTC
Created attachment 277121 [details]
dmesg Kernel 4.17.3
Comment 5 phoenix 2018-07-02 15:14:41 UTC
Created attachment 277123 [details]
dmesg on Kernel 4.4.0
Comment 6 phoenix 2018-07-02 15:15:00 UTC
Created attachment 277125 [details]
Xorg Log on 4.17.3
Comment 7 phoenix 2018-07-02 15:15:16 UTC
Created attachment 277127 [details]
Xorg log on 4.4.0
Comment 8 phoenix 2018-07-02 15:19:01 UTC
Created attachment 277129 [details]
Free and stats of the two Kernels

Contains the output of free, /proc/$ID/stat and /proc/$ID/statm for the two Kernel versions
Comment 9 phoenix 2018-07-02 15:19:32 UTC
Created attachment 277131 [details]
Output of top of the problematic process on the two Kernels

Truncated output of top of the problematic process on the two kernels
Comment 10 phoenix 2018-07-02 15:21:42 UTC
I uploaded all the requested files. Interestingly, the output of top and statm for the process shows comparable values, except for the data + stack size (see the stats file).

Virtual, resident and shared memory are comparable.

If you need any further data don't hesitate to ask. Thank you
Comment 11 Christian König 2018-07-02 15:39:52 UTC
You could also try to compile your kernel with kmemleak enabled.
Comment 12 phoenix 2018-07-02 16:03:30 UTC
I'm rebuilding the kernel and checking a possible memory leak with kmemleak.
Comment 13 phoenix 2018-07-02 16:29:14 UTC
I'm having some problems setting up kmemleak at the moment. I'll test and check tomorrow.
Comment 14 Michel Dänzer 2018-07-02 16:50:05 UTC
Another possibility would be narrowing down where between 4.4 and 4.16 this started happening, and eventually bisecting.
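
The narrowing-down step can be pictured as a binary search over an ordered list of kernel releases. The Python sketch below is only an illustration under stated assumptions: `first_bad` and the `is_bad` callback are hypothetical names, the callback stands in for a manual "boot this kernel, run the game, check free" test, and the search assumes the oldest entry is good, the newest is bad, and the behaviour never flips back in between. On a kernel git tree, `git bisect` performs the same kind of search at commit granularity.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is true.

    Assumes versions[0] is known good, versions[-1] is known bad, and every
    version after the first bad one is bad as well.
    """
    if len(versions) < 2:
        raise ValueError("need at least one good and one bad version")
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return versions[bad]


if __name__ == "__main__":
    releases = [f"4.{minor}" for minor in range(4, 17)]  # 4.4 .. 4.16
    # Hypothetical outcome, for demonstration only: pretend 4.11 and later are bad.
    pretend_bad = lambda v: int(v.split(".")[1]) >= 11
    print(first_bad(releases, pretend_bad))
```

With 13 releases between 4.4 and 4.16 this needs at most four test boots, which is why narrowing down by release first is cheaper than bisecting commits from the start.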
Comment 15 phoenix 2018-07-02 20:40:29 UTC
Created attachment 277135 [details]
kmemleak output of Cities.x64

I was finally able to create a kmemleak output and cropped it to the relevant output coming from the affected program.

I hope this is helpful.
Comment 16 Christian König 2018-07-03 07:27:11 UTC
(In reply to phoenix from comment #15)
> I hope this is helpful.

Unfortunately not really, the only thing in there is a known issue with the IOVA cache.

Can you try to bisect as Michel suggested?
Comment 17 phoenix 2018-07-03 07:42:10 UTC
Sure, I'm going to work through the different kernel versions, but that is going to take me some time (I have to do this in my spare time).

I'll post my progress and findings, when available.
Comment 18 Michel Dänzer 2018-07-03 07:54:29 UTC
Does the memory usage go back down when you quit the game? Or when you restart X? Or never?
Comment 19 phoenix 2018-07-03 07:55:36 UTC
The memory usage goes down immediately once the game quits. No X restart necessary.
Comment 20 Michel Dänzer 2018-07-03 08:08:26 UTC
In that case, the output of running the game in

 valgrind --leak-check=full

might be interesting.
Comment 21 phoenix 2018-07-03 08:11:34 UTC
Yep, I'll have a look this evening. Maybe I can reproduce the issue with another program as well, to rule out a problem specific to a single userland program.
Comment 22 phoenix 2018-07-03 16:31:47 UTC
Apparently it's not that easy to attach valgrind to a Steam game, so I'm taking the suggested approach of trying different Kernel versions.

Interestingly, I could observe similar behaviour in Stardew Valley but not in Kerbal Space Program, as the following statm output shows:


## /proc/$ID/statm for Stardew Valley (Similar problem see the data segment)
# statm for 4917 on 4.17.3
978381 424915 23927 849 0 449695 0
# statm for 4370 on 4.4.0
979917 418188 23774 849 0 874146 0

## /proc/$ID/statm for Kerbal Space Program (Problem does not occur)
# statm of 5419 on 4.4.0
532753 381415 19974 7863 0 446822 0
# statm of on 4.17.3
529142 389210 19754 7863 0 441862 0

I'll investigate using different Kernel versions, and maybe I can write a simple OpenGL program that triggers the problem.
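
To make the statm lines above easier to compare, here is a minimal Python sketch that names the seven fields and converts them to MiB. The field order (size, resident, shared, text, lib, data, dt) follows the proc(5) man page, and the values are counted in pages. The 4096-byte page size is an assumption that holds for typical x86-64 systems; `parse_statm` is a hypothetical helper name.

```python
STATM_FIELDS = ("size", "resident", "shared", "text", "lib", "data", "dt")


def parse_statm(line: str, page_size: int = 4096) -> dict:
    """Convert one /proc/<pid>/statm line from pages to MiB per field.

    page_size defaults to 4096 bytes, which is an assumption; on a live
    system os.sysconf("SC_PAGE_SIZE") returns the real value.
    """
    pages = [int(value) for value in line.split()]
    if len(pages) != len(STATM_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(STATM_FIELDS), len(pages)))
    return {name: count * page_size / 2**20 for name, count in zip(STATM_FIELDS, pages)}


if __name__ == "__main__":
    # The two Stardew Valley samples quoted in this comment.
    on_4_17 = parse_statm("978381 424915 23927 849 0 449695 0")
    on_4_4 = parse_statm("979917 418188 23774 849 0 874146 0")
    for name in STATM_FIELDS:
        delta = on_4_4[name] - on_4_17[name]
        print(f"{name:>8}: 4.17.3 {on_4_17[name]:9.1f} MiB, 4.4.0 {on_4_4[name]:9.1f} MiB, delta {delta:+9.1f} MiB")
```

Because statm counts pages rather than kilobytes, the gap of 424451 pages in the data field of these two samples corresponds to roughly 1.6 GiB under the 4 KiB page assumption.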
Comment 23 Michel Dänzer 2018-07-03 16:45:06 UTC
(In reply to phoenix from comment #22)
> 
> ## /proc/$ID/statm for Stardew Valley (Similar problem see the data segment)
> # statm for 4917 on 4.17.3
> 978381 424915 23927 849 0 449695 0
> # statm for 4370 on 4.4.0
> 979917 418188 23774 849 0 874146 0

Did you swap these numbers? The only significant difference is the data size (second to last number), but the 4.4 number is bigger by ~400MB.
Comment 24 phoenix 2018-07-03 17:40:16 UTC
Hi Michel,

Weirdly not. I just double-checked them, and in Stardew Valley the 4.4 number really is the one that is 400 MB bigger. For now I'm going to try the different kernel versions first, so that we're not working on two things at the same time.
Comment 25 phoenix 2018-07-03 19:16:59 UTC
Created attachment 277153 [details]
Memory usage measurements for different programs using Kernel 4.9.111 and 4.15.0-24

Ok, I've tested the issue using Kernel 4.15.0-24-generic (Shipped with Ubuntu Mate) using the amdgpu driver and a 4.9.111 Kernel using the amdgpu-pro driver (17.40).

Sadly building the amdgpu-pro driver for Kernel linux-4.14.53 failed, so I couldn't test that one.

The issue also occurs with the 4.15.0-24-generic Kernel, while the 4.9.111 Kernel has significantly lower main memory usage with Cities: Skylines.

I also found that neither the statm output nor the other /proc data for the processes shows significant differences between the Kernel versions. So as of now, the only available metric for measuring the memory usage is the output of 'free'.


In addition, I could observe the same memory issue (but without a system freeze) in Civilization: Beyond Earth using the above-mentioned Kernel versions. That program is a more suitable test case than a rather low-resource program like Stardew Valley.

Find attached the text file MemUsage.txt with my current measurements.


Attaching Valgrind to a Steam game is kind of non-trivial. Do you still think it would give us meaningful insights? I can work that out, but I fear it will soon go beyond the time I have available. Still, I can give it a shot.
Comment 26 Michel Dänzer 2018-07-04 08:01:27 UTC
(In reply to phoenix from comment #25)
> Ok, I've tested the issue using Kernel 4.15.0-24-generic (Shipped with
> Ubuntu Mate) using the amdgpu driver and a 4.9.111 Kernel using the
> amdgpu-pro driver (17.40).

BTW, ideally you should only test with the kernel's own amdgpu driver, not with amdgpu-pro, because the latter uses its own copies of core DRM and even some core kernel code, and has other modifications compared to the stock driver.
Comment 27 Christian König 2018-07-04 08:40:25 UTC
(In reply to Michel Dänzer from comment #26)
> BTW, ideally you should only test with the kernel's own amdgpu driver, not
> with amdgpu-pro, because the latter uses its own copies of core DRM and even
> some core kernel code, and has other modifications compared to the stock
> driver.

To be even more precise, I'm not sure that this is actually a kernel problem; it might just be caused by some mix-up between the amdgpu-pro driver and the upstream driver.

So testing on a clean install could yield some more results.
Comment 28 phoenix 2018-07-04 10:00:57 UTC
Hi Michel, Hi Christian,

That makes sense, I'll test it in a clean environment. Sorry, I should have done that in the first place :-/
Comment 29 phoenix 2018-07-05 17:09:02 UTC
I'm a bit busy at the moment, hope that I will find time on the weekend to further investigate!
Comment 30 phoenix 2018-07-08 12:57:08 UTC
Finally had time to investigate.

The bug doesn't appear on a fresh install of Ubuntu 16.04 using the 4.17.3 Kernel with the configuration posted above. So apparently Christian was right, and it was a weird mix-up between the amdgpu-pro and the upstream driver.

I'm marking the bug as Resolved -> Obsolete, because it was indeed just a zombie, a relic of an ancient installation :-) I should have checked on a fresh install in the first place.

Anyway, thank you very much for the support and help. I wish you a pleasant rest of your Sunday (or a good start to the week).

Greetings,
Felix
