Bug 205137 - Memory leak while using opencl (F@H) with amdgpu in Kernel space
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-other
Reported: 2019-10-09 13:44 UTC by Ville Aakko
Modified: 2019-11-11 20:48 UTC

Kernel Version: 5.3.1 - 5.3.5
Regression: No

Attachments
journalctl --since '2 months ago' | sed -n -e '/cut\ here/,/end\ trace/ p' (227.76 KB, text/plain)
2019-10-09 16:27 UTC, Ville Aakko

Description Ville Aakko 2019-10-09 13:44:18 UTC
Hello,

Since upgrading to Linux kernel 5.3.1 (from 5.2.14), I get a memory leak whenever I use Folding@Home with GPU folding (OpenCL with amdgpu, on an RX Vega 64). The leak eventually (after hours, but in less than 24 hours) consumes all memory on a system with 16 GB of RAM.

The RAM remains reserved even if I stop the Folding@Home process. According to top, neither that process nor any other process on the system ever consumes an excessive amount of RAM, which means the problem must be in kernel space rather than user space.

If I never start the Folding@Home process, I don't experience a memory leak. I haven't had time to try longer gaming sessions; I only presume that graphics-intensive applications do not trigger it and that it is caused by OpenCL usage only.

I can see that kmalloc-1k (-> ~2979365) and kmalloc-8k (-> ~112008) gradually grow to huge numbers in /proc/slabinfo. However, I have no idea what I should be looking for or how to interpret /proc/slabinfo or slabtop.
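For anyone else unsure how to read /proc/slabinfo, here is a minimal Python sketch, not an official tool. It assumes the "version 2.1" layout, in which each data line starts with the cache name followed by active_objs, num_objs and objsize (in bytes). The names SlabCache, parse_slabinfo and largest_caches are made up for this example, and num_objs * objsize is only an approximation of the memory a cache holds, because it ignores per-slab padding.

```python
from typing import List, NamedTuple


class SlabCache(NamedTuple):
    """One data line of /proc/slabinfo (hypothetical helper type)."""
    name: str
    active_objs: int
    num_objs: int
    objsize: int  # bytes per object

    @property
    def approx_bytes(self) -> int:
        # Approximation: ignores padding and per-slab metadata.
        return self.num_objs * self.objsize


def parse_slabinfo(text: str) -> List[SlabCache]:
    """Parse the text of /proc/slabinfo, skipping the two header lines."""
    caches = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        caches.append(
            SlabCache(fields[0], int(fields[1]), int(fields[2]), int(fields[3]))
        )
    return caches


def largest_caches(text: str, top: int = 5) -> List[SlabCache]:
    """Return the caches holding the most memory, largest first."""
    caches = parse_slabinfo(text)
    return sorted(caches, key=lambda c: c.approx_bytes, reverse=True)[:top]
```

Reading /proc/slabinfo usually requires root, e.g. `largest_caches(open("/proc/slabinfo").read())`. If the counts quoted above are object counts, kmalloc-1k alone would hold roughly 2.8 GiB (2979365 x 1024 bytes) and kmalloc-8k roughly 0.85 GiB (112008 x 8192 bytes), which would account for a large share of the missing RAM.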

In /sys/kernel/debug/kmemleak I get thousands of entries like this:

unreferenced object 0xffffa06fa8805c00 (size 512):
  comm "FahCore_21", pid 27188, jiffies 4301717909 (age 33092.544s)
  hex dump (first 32 bytes):
    c0 fd d5 86 a4 ea ff ff 00 40 64 86 a4 ea ff ff  .........@d.....
    40 40 64 86 a4 ea ff ff 80 40 64 86 a4 ea ff ff  @@d......@d.....
  backtrace:
    [<000000001f5d4f11>] amdgpu_cs_ioctl+0xc10/0x1d00 [amdgpu]
    [<000000001c8a70b5>] drm_ioctl_kernel+0xb8/0x100 [drm]
    [<000000001135bf21>] drm_ioctl+0x23d/0x3d0 [drm]
    [<000000004c66c447>] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
    [<000000001e3e0278>] do_vfs_ioctl+0x43d/0x6c0
    [<000000008860e76c>] ksys_ioctl+0x5e/0x90
    [<000000002b120d39>] __x64_sys_ioctl+0x16/0x20
    [<000000002d155151>] do_syscall_64+0x5f/0x1c0
    [<00000000bf22908d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Though I must admit I've never used kmemleak and have no idea whether these are false positives.

I can reproduce this on the main kernel shipped with Arch Linux, and also on a custom kernel with CONFIG_DEBUG_KMEMLEAK enabled (the stock Arch configuration does not enable it). So I can gather more information via debugfs if needed (though I don't know how to use it).
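As a rough aid for judging whether kmemleak entries like the one quoted above are noise or a pattern, the following Python sketch groups a saved kmemleak report by the first backtrace frame of each entry. It assumes the report has the same layout as the entry quoted above ("unreferenced object ... (size N):" followed by a "backtrace:" section). The function name summarize_kmemleak is made up for this example. A report can be refreshed beforehand by writing `scan` to /sys/kernel/debug/kmemleak as root and then reading the same file.

```python
import re
from collections import Counter
from typing import Tuple

# Header of one kmemleak entry, e.g.
# "unreferenced object 0xffffa06fa8805c00 (size 512):"
OBJ_RE = re.compile(r"^unreferenced object 0x[0-9a-f]+ \(size (\d+)\):")
# One backtrace frame, e.g.
# "    [<000000001f5d4f11>] amdgpu_cs_ioctl+0xc10/0x1d00 [amdgpu]"
FRAME_RE = re.compile(r"^\s+\[<[0-9a-f]+>\]\s+(\S+?)\+0x")


def summarize_kmemleak(text: str) -> Tuple[Counter, Counter]:
    """Count entries and sum their sizes per first backtrace frame."""
    counts: Counter = Counter()
    sizes: Counter = Counter()
    pending_size = None  # size of the entry whose first frame we still await
    for line in text.splitlines():
        header = OBJ_RE.match(line)
        if header:
            pending_size = int(header.group(1))
            continue
        frame = FRAME_RE.match(line)
        if frame and pending_size is not None:
            counts[frame.group(1)] += 1
            sizes[frame.group(1)] += pending_size
            pending_size = None  # ignore the remaining frames of this entry
    return counts, sizes
```

If thousands of entries share the same first frame (here amdgpu_cs_ioctl) and their number keeps growing between scans, that points towards a real leak rather than a false positive, although it does not prove it.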

Possibly related: bug 205051? I don't get an oops with clinfo, though, and I haven't used Blender recently. Only the keyword "opencl", the kernel version and the use of OpenCL match.

Please understand that gathering information about this kind of kernel bug is a bit beyond my experience. I'm not sure what other information is relevant in this situation, so please advise!

*** Steps to reproduce:  ***
* Run Folding@Home with amdgpu (OpenCL) for some hours (seems that the memory leak might not manifest with all work units?).

*** Observed results: ***
* A memory leak. Many hundreds of megabytes of memory will be reserved in an hour.
* No process is consuming this memory in top. 
* Eventually, the OOM killer kicks in (and no RAM is left for cache/buffers).
* The memory stays reserved even after stopping the Folding@Home processes.
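
Memory that no process accounts for in top can be cross-checked against the slab counters in /proc/meminfo. The following is a small Python sketch under the assumption that the leak lands in kernel slab memory, as the kmalloc-* growth suggests. It compares two snapshots of /proc/meminfo and reports how much the Slab, SReclaimable and SUnreclaim fields grew. The function names are made up for this example.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def slab_growth_kb(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return the growth (in kB) of the slab-related fields between snapshots."""
    fields = ("Slab", "SReclaimable", "SUnreclaim")
    return {f: after[f] - before[f] for f in fields if f in before and f in after}
```

Taking one snapshot before starting a work unit and another an hour later should show whether SUnreclaim grows by the same hundreds of megabytes that go missing. Steady SUnreclaim growth that does not go away after the folding processes exit would be consistent with a kernel-side leak.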

*** Expected results: ***
* Folding should not cause excessive RAM usage. This was the situation in 5.2.14 (and prior). These processes use a few hundred MB of RAM.
* If there was a leak in user space, I should be able to see the culprit in top.
* After stopping the folding process, the RAM should be released instead of remaining in use (by the kernel/amdgpu?).
Comment 1 Ville Aakko 2019-10-09 16:27:10 UTC
Created attachment 285421
journalctl --since '2 months ago' | sed -n -e '/cut\ here/,/end\ trace/ p'

Also, I forgot to mention: I've had a few near-hangs, where the computer does not respond to any keyboard or mouse input. I can ssh into the computer and restart it, but the restart takes a very long time (15 minutes or more) after the journal has stopped. As the GPU cannot output anything (and the journal has stopped), I have no idea what is going on during this time. These have happened only ~3-5 times and I cannot reproduce them. I'm mentioning them because I didn't get them before 5.3.1.

Also, there are some kernel errors, often but not always with "kernel: Missing get_user_page_done". These errors seem to point to amdgpu. I didn't get these before 5.3.1 either.

More often than not, though, these kernel errors do not come with a hang (I just noticed there have been quite a few that I had missed). Moreover, these errors apparently do *not* coincide with the memory leak (I've looked at my memory usage when they happen).

But *all* symptoms started after upgrading to 5.3.1 (and subsequently to 5.3.5). 

Attaching some (actually, all) of said errors.

(Please note that although some entries are from runs of the -zen branch, I can reproduce this with the plain Arch kernel; -custom is the kernel I built just to enable KMEMLEAK.)
Comment 2 Ville Aakko 2019-11-11 20:48:59 UTC
Hi, 


This problem has fixed itself after upgrading to 5.3.8 (possibly somewhere in the interim; I haven't been folding extensively in the meantime for fear of leaks). I'm at ~48 hours of folding now without a single leak (this wasn't possible on a previous version).

In short, this bug report can be closed. I will mark it as UNREPRODUCIBLE, as that seems most appropriate to me. However, please feel free to change the resolution if you think another one fits better, as I'm not sure which one to use.
