Bug 208613 - It is necessary to add shrinker support for the pid_namespace
Summary: It is necessary to add shrinker support for the pid_namespace
Status: NEW
Alias: None
Product: Process Management
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 high
Assignee: process_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-19 13:43 UTC by Wen Yang
Modified: 2020-11-26 12:09 UTC (History)
6 users (show)

See Also:
Kernel Version: 4.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Wen Yang 2020-07-19 13:43:09 UTC
On a server running kubernetes for a long time, we encountered this problem:

$sudo  unshare  --pid    -r --fork --propagation private bash
unshare: unshare failed: No space left on device

but max_pid_NAMESPACES is already quite large:
$cat  /proc/sys/user/max_pid_namespaces
500000


We checked slabInfo and found that PID_namespace does take up a lot of memory:
$sudo cat /proc/slabinfo  | grep  pid
pid_namespace     599956 599956   2232   14    8 : tunables    0    0    0 : slabdata  42854  42854      0
pid               1184543 1184896    128   32    1 : tunables    0    0    0 : slabdata  37028  37028      0


We had to release the memory manually:
$sudo echo  3 > /proc/sys/vm/drop_caches


$sudo cat /proc/slabinfo  | grep -i pid
pid_namespace       1694   2016   2232   14    8 : tunables    0    0    0 : slabdata    144    144      0
pid                15194  36224    128   32    1 : tunables    0    0    0 : slabdata   1132   1132      0


fnally, we were able to successfully execute the unshare command:

$sudo  unshare  --pid    -r --fork --propagation private bash
#exit
exit


Therefore, it is necessary to add shrinker support for the pid_namespace.
Comment 1 Eric W. Biederman 2020-07-19 18:01:58 UTC
bugzilla-daemon@bugzilla.kernel.org writes:

> https://bugzilla.kernel.org/show_bug.cgi?id=208613
>
> Wen Yang (wenyang@linux.alibaba.com) changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            Severity|normal                      |high

What does "echo > /proc/sys/vm/drop_caches" do that allows
the number of active pid namespaces to be decreased?

What kernel are you running on?

The only thing I can possibly imagine is that you have a whole bunch
of stale proc entries cached for some reason.

Especially after the last round of cleanups in this area I am at a loss
to imagine how you could have anything like that going on.  If you are
running an older kernel, without the most recent proc cleanups than
maybe there is some pathlogical way.

Is there some obvious mechanism I am missing that allows drop_caches
to decrease the number of active pid namespaces?

Eric
Comment 2 Wen Yang 2020-10-05 06:44:26 UTC
(In reply to Eric W. Biederman from comment #1)
> bugzilla-daemon@bugzilla.kernel.org writes:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=208613
> >
> > Wen Yang (wenyang@linux.alibaba.com) changed:
> >
> >            What    |Removed                     |Added
> >
> ----------------------------------------------------------------------------
> >            Severity|normal                      |high
> 
> What does "echo > /proc/sys/vm/drop_caches" do that allows
> the number of active pid namespaces to be decreased?
> 
> What kernel are you running on?
> 
> The only thing I can possibly imagine is that you have a whole bunch
> of stale proc entries cached for some reason.
> 
> Especially after the last round of cleanups in this area I am at a loss
> to imagine how you could have anything like that going on.  If you are
> running an older kernel, without the most recent proc cleanups than
> maybe there is some pathlogical way.
> 
> Is there some obvious mechanism I am missing that allows drop_caches
> to decrease the number of active pid namespaces?
> 
> Eric

Thanks for your comments.
Sorry for the delay, we have recently observed a similar problem on another server (4.9 kernel). The crash analysis is as follows:

1. Found a pid structure, its corresponding task has been released, but the pid still remains:

crash> pid 0xffff8812cdc28800
struct pid {
  count = {
    counter = 3 
  },
  level = 1,
  tasks = {{
      first = 0x0
    }, {
      first = 0x0
    }, {
      first = 0x0
    }},
  rcu = {
    next = 0xffff881b412d14c0,
    func = 0xffffffff810a9ed0 <delayed_put_pid>
  },
  numbers = {{
      nr = 94426,
      ns = 0xffffffff81c5b280 <init_pid_ns>,
      pid_chain = {
        next = 0xffff881b66076c40,
        pprev = 0xdead000000000200
      }
    }}
}

2. The reference count of this pid structure is 3. We searched the memory and found that 3 proc_inodes were referencing it.

The first proc_inode:

crash> proc_inode.pid,vfs_inode.i_sb,vfs_inode.i_fop,vfs_inode.i_dentry,vfs_inode.i_ino  ffff8812336887b0
  pid = 0xffff8812cdc28800
  vfs_inode.i_sb = 0xffff881fbe923800,
  vfs_inode.i_fop = 0xffffffff81846d40 <proc_tgid_base_operations>,
  vfs_inode.i_dentry = {
    first = 0xffff881233dfcb30
  },
  vfs_inode.i_ino = 1426511612,
crash>


Find the corresponding dentry and found it is /proc/94426

crash> eval 0xffff881233dfcb30 - 0xb0
hexadecimal: ffff881233dfca80

crash> dentry.d_name,d_parent,d_inode ffff881233dfca80
  d_name = {
    {
      {
        hash = 382192091,
        len = 5
      },
      hash_len = 21857028571
    },
    name = 0xffff881233dfcab8 "94426"
  }
  d_parent = 0xffff881fbe408240
  d_inode = 0xffff8812336887f8

crash>  dentry.d_name,d_parent,d_inode 0xffff881fbe408240
  d_name = {
    {
      {
        hash = 0,
        len = 1
      },
      hash_len = 4294967296
    },
    name = 0xffff881fbe408278 "/"
  }
  d_parent = 0xffff881fbe408240
  d_inode = 0xffff881fbe418048


The second proc_inode:

crash>  proc_inode.pid,vfs_inode.i_sb,vfs_inode.i_fop,vfs_inode.i_dentry,vfs_inode.i_ino  ffff881233d9ae20
  pid = 0xffff8812cdc28800
  vfs_inode.i_sb = 0xffff881fbe923800,
  vfs_inode.i_fop = 0xffffffff81841700 <no_open_fops.46098>,
  vfs_inode.i_dentry = {
    first = 0xffff881233dd12b0
  },
  vfs_inode.i_ino = 1426524178,
crash>

Find the corresponding dentry and found it is /proc/94426/ns/pid:

crash> eval 0xffff881233dd12b0 - 0xb0
hexadecimal: ffff881233dd1200

crash> dentry.d_name,d_parent,d_inode ffff881233dd1200
  d_name = {
    {
      {
        hash = 3686000610,
        len = 3
      },
      hash_len = 16570902498
    },
    name = 0xffff881233dd1238 "pid"
  }
  d_parent = 0xffff881233dd0540
  d_inode = 0xffff881233d9ae68

crash> dentry.d_name,d_parent,d_inode 0xffff881233dd0540
  d_name = {
    {
      {
        hash = 4061867969,
        len = 2
      },
      hash_len = 12651802561
    },
    name = 0xffff881233dd0578 "ns"
  }
  d_parent = 0xffff881233dfca80
  d_inode = 0xffff881233d9cd28

crash> dentry.d_name,d_parent,d_inode 0xffff881233dfca80
  d_name = {
    {
      {
        hash = 382192091,
        len = 5
      },
      hash_len = 21857028571
    },
    name = 0xffff881233dfcab8 "94426"
  }
  d_parent = 0xffff881fbe408240
  d_inode = 0xffff8812336887f8

crash> dentry.d_name,d_parent,d_inode 0xffff881fbe408240
  d_name = {
    {
      {
        hash = 0,
        len = 1
      },
      hash_len = 4294967296
    },
    name = 0xffff881fbe408278 "/"
  }
  d_parent = 0xffff881fbe408240
  d_inode = 0xffff881fbe418048


The third proc_pid:

crash> proc_inode.pid,vfs_inode.i_sb,vfs_inode.i_fop,vfs_inode.i_dentry,vfs_inode.i_ino ffff881233d9cce0
  pid = 0xffff8812cdc28800
  vfs_inode.i_sb = 0xffff881fbe923800,
  vfs_inode.i_fop = 0xffffffff8184acc0 <proc_ns_dir_operations>,
  vfs_inode.i_dentry = {
    first = 0xffff881233dd05f0
  },
  vfs_inode.i_ino = 1426524174,

Find the corresponding dentry and found it is /proc/94426/ns:

crash>  eval 0xffff881233dd05f0 - 0xb0
hexadecimal: ffff881233dd0540


crash>  dentry.d_name,d_parent,d_inode ffff881233dd0540
  d_name = {
    {
      {
        hash = 4061867969,
        len = 2
      },
      hash_len = 12651802561
    },
    name = 0xffff881233dd0578 "ns"
  }
  d_parent = 0xffff881233dfca80
  d_inode = 0xffff881233d9cd28

crash> dentry.d_name,d_parent,d_inode 0xffff881233dfca80
  d_name = {
    {
      {
        hash = 382192091,
        len = 5
      },
      hash_len = 21857028571
    },
    name = 0xffff881233dfcab8 "94426"
  }
  d_parent = 0xffff881fbe408240
  d_inode = 0xffff8812336887f8

crash> dentry.d_name,d_parent,d_inode 0xffff881fbe408240
  d_name = {
    {
      {
        hash = 0,
        len = 1
      },
      hash_len = 4294967296
    },
    name = 0xffff881fbe408278 "/"
  }
  d_parent = 0xffff881fbe408240
  d_inode = 0xffff881fbe418048
crash>

3, To sum up, after the process exits, the following three dentry still refer to the pid:
/proc/<pid>
/proc/<pid>/ns
/proc/<pid>/ns/pid


According to commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if pids cannot be released, it may result in the inability to create a new pid_ns.
Adding the following code to the 4.9 kernel may solve this problem:

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 13b6398..1bff16e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3207,6 +3207,22 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 	}

 	dput(dir);
+
+	name.name = "ns";
+	name.len = strlen(name.name);
+	dir = d_hash_and_lookup(leader, &name);
+	if (!dir)
+		goto out_put_leader;
+
+	name.name = "pid";
+	name.len = strlen(name.name);
+	dentry = d_hash_and_lookup(dir, &name);
+	if (dentry) {
+		d_invalidate(dentry);
+		dput(dentry);
+	}
+	dput(dir);
+
 out_put_leader:
 	dput(leader);
 out:
Comment 3 Wen Yang 2020-10-07 15:04:31 UTC
The following 3 patches may need to be ported to the stable branches:
7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
26dbc60f385f ("proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache")
f90f3cafe8d5 ("proc: Use d_invalidate in proc_prune_siblings_dcache")
Comment 4 Wen Yang 2020-11-26 10:09:16 UTC
In the kernel 4.9 stable branch:

void dput(struct dentry *dentry)
{
…
        if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
                if (dentry->d_op->d_delete(dentry))     ---> Assuming that the process is at the moment of exiting, there is a race
                        goto kill_it;
        }
…
        dentry->d_lockref.count--;     ---> The reference count is decreased by 1, but dentry will not be released (it can be released only when drop_caches is manually triggered)
        spin_unlock(&dentry->d_lock);  
        return;

kill_it:
        dentry = dentry_kill(dentry);
…
}

as well as:

void release_task(struct task_struct *p)
{
…

	proc_flush_task(p);     ---> Will only release dentry with a reference count of 0 

	write_lock_irq(&tasklist_lock);
	ptrace_release_task(p);
	__exit_signal(p);       ---> Calling __unhash_process to modify the data corresponding to PIDTYPE_PID.
…
}


So there may be such a race:

 release_task                         dput
-> proc_flush_task
                                              ->  dentry->d_op->d_delete(dentry)
-> __exit_signal
Comment 5 Wen Yang 2020-11-26 11:02:53 UTC
If we do not backport the above patches to the 4.9 stable branch, we may need to refer to commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ("proc: Use a list of inodes to flush from proc") to make the following simple modifications:

diff --git a/kernel/exit.c b/kernel/exit.c
index f9943ef..cfbbbe9 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -176,8 +176,6 @@ void release_task(struct task_struct *p)
        atomic_dec(&__task_cred(p)->user->processes);
        rcu_read_unlock();

-       proc_flush_task(p);
-
        write_lock_irq(&tasklist_lock);
        ptrace_release_task(p);
        __exit_signal(p);
@@ -202,6 +200,7 @@ void release_task(struct task_struct *p)
        }

        write_unlock_irq(&tasklist_lock);
+       proc_flush_task(p);
        release_thread(p);
        call_rcu(&p->rcu, delayed_put_task_struct);


Hello Eric, hello Greg K-H, would you like to give some suggestions?
Comment 6 gregkh 2020-11-26 12:09:51 UTC
On Thu, Nov 26, 2020 at 11:02:53AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> Hello Eric, hello Greg K-H, would you like to give some suggestions?

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

Note You need to log in before you can comment on or make changes to this bug.