Looking at MESOS-5836 (https://issues.apache.org/jira/browse/MESOS-5836) and patch 9184539 (https://patchwork.kernel.org/patch/9184539/), we're concerned about memory cgroup leakage in kernels 4.2, 4.4, and 4.5. This was first seen on CoreOS 835.6 (kernel 4.2), but we've also reproduced it on Ubuntu 16.04 (kernel 4.4) and CoreOS 1010 (kernel 4.5).

When a system allocates more than 65536 cgroups, we see the following in dmesg when de-allocating them:

idr_remove called for id=65536 which is not allocated.

After that point, the memory cgroup subsystem is effectively locked until the system's page caches are dropped using:

echo 1 > /proc/sys/vm/drop_caches

We're working to determine whether the patch mentioned above fixes this issue and will report back when we have more info.

Reproduction steps:
- Start a new instance using kernel 4.2, 4.4, or 4.5 (CoreOS 766-1010, Ubuntu 16.04)
- ssh to the machine
- {{cat /proc/cgroups}} to determine the number of memory cgroups
- Run several docker containers using the {{--memory}} or {{-m}} option to set a memory limit, either in parallel or in series
- Stop all containers
- {{cat /proc/cgroups}} to review the number of memory cgroups and compare it to the previous run
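
The reproduction steps above could be scripted roughly as follows. This is a minimal sketch, not the script used for this report: the function names memcg_count and memcg_repro are hypothetical, and the busybox image, the 32m limit, and the default of 100 runs are arbitrary assumptions. It runs the containers in series and reads the num_cgroups column for the memory controller from /proc/cgroups before and after.

```shell
#!/bin/sh

# Print the num_cgroups column (third field) for the memory controller.
# The optional argument (default: /proc/cgroups) is an assumption added so
# the function can also be pointed at a saved copy of the file.
memcg_count() {
    awk '$1 == "memory" { print $3 }' "${1:-/proc/cgroups}"
}

# Run short-lived containers in series, each with a memory limit, then
# print the memory cgroup count before and after.
# Assumptions: docker is installed; busybox and 32m are arbitrary choices.
memcg_repro() {
    runs="${1:-100}"
    before="$(memcg_count)"
    i=0
    while [ "$i" -lt "$runs" ]; do
        # --rm removes the container as soon as it exits.
        docker run --rm -m 32m busybox true >/dev/null
        i=$((i + 1))
    done
    after="$(memcg_count)"
    echo "memory cgroups: before=$before after=$after"
}
```

Sourcing the file and calling, for example, memcg_repro 500 prints both counts. On an affected kernel the second number would be expected to stay well above the first after all containers have exited.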
On Tue, Jul 12, 2016 at 11:29:05PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=124641
>
>             Bug ID: 124641
>            Summary: Memory cgroups are not garbage-collected after release
>                     to the system
>            Product: File System
>            Version: 2.5
>     Kernel Version: 4.2

4.2 is old, does this happen on 4.6?

Also, can you take this to email, copying me and Tejun and the cgroups developers and lkml?

thanks,

greg k-h
We've tested up to 4.5; I'll fetch a 4.6 kernel and try it out. Changing the reported version to 4.5 to reflect this.
Tejun adds valuable context on LKML:

> It's not that memcg doesn't gc the dead csses but that the memory lying around keeps pinning the memcg struct down. There's nothing wrong with it. As soon as there's memory pressure, the memory will get reclaimed and the memcg structs will be freed. The problem is caused by the memcg struct pinning the memcg id, which is a pretty limited resource. The above patch fixes the issue by decoupling the lifetime of the memcg id from that of the memcg struct.

I've tested this in 4.6 and was _not_ able to reproduce the result. Resolving now.