Bug 208613
Summary: | It is necessary to add shrinker support for the pid_namespace | ||
---|---|---|---|
Product: | Process Management | Reporter: | Wen Yang (wenyang) |
Component: | Other | Assignee: | process_other |
Status: | NEW --- | ||
Severity: | high | CC: | avagin, christianvanbrauner, ebiederm, gregkh, viro, wenyang |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.9 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Wen Yang
2020-07-19 13:43:09 UTC
bugzilla-daemon@bugzilla.kernel.org writes: > https://bugzilla.kernel.org/show_bug.cgi?id=208613 > > Wen Yang (wenyang@linux.alibaba.com) changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Severity|normal |high What does "echo > /proc/sys/vm/drop_caches" do that allows the number of active pid namespaces to be decreased? What kernel are you running on? The only thing I can possibly imagine is that you have a whole bunch of stale proc entries cached for some reason. Especially after the last round of cleanups in this area I am at a loss to imagine how you could have anything like that going on. If you are running an older kernel, without the most recent proc cleanups than maybe there is some pathlogical way. Is there some obvious mechanism I am missing that allows drop_caches to decrease the number of active pid namespaces? Eric (In reply to Eric W. Biederman from comment #1) > bugzilla-daemon@bugzilla.kernel.org writes: > > > https://bugzilla.kernel.org/show_bug.cgi?id=208613 > > > > Wen Yang (wenyang@linux.alibaba.com) changed: > > > > What |Removed |Added > > > ---------------------------------------------------------------------------- > > Severity|normal |high > > What does "echo > /proc/sys/vm/drop_caches" do that allows > the number of active pid namespaces to be decreased? > > What kernel are you running on? > > The only thing I can possibly imagine is that you have a whole bunch > of stale proc entries cached for some reason. > > Especially after the last round of cleanups in this area I am at a loss > to imagine how you could have anything like that going on. If you are > running an older kernel, without the most recent proc cleanups than > maybe there is some pathlogical way. > > Is there some obvious mechanism I am missing that allows drop_caches > to decrease the number of active pid namespaces? > > Eric Thanks for your comments. Sorry for the delay, we have recently observed a similar problem on another server (4.9 kernel). The crash analysis is as follows: 1. Found a pid structure, its corresponding task has been released, but the pid still remains: crash> pid 0xffff8812cdc28800 struct pid { count = { counter = 3 }, level = 1, tasks = {{ first = 0x0 }, { first = 0x0 }, { first = 0x0 }}, rcu = { next = 0xffff881b412d14c0, func = 0xffffffff810a9ed0 <delayed_put_pid> }, numbers = {{ nr = 94426, ns = 0xffffffff81c5b280 <init_pid_ns>, pid_chain = { next = 0xffff881b66076c40, pprev = 0xdead000000000200 } }} } 2. The reference count of this pid structure is 3. We searched the memory and found that 3 proc_inodes were referencing it. The first proc_inode: crash> proc_inode.pid,vfs_inode.i_sb,vfs_inode.i_fop,vfs_inode.i_dentry,vfs_inode.i_ino ffff8812336887b0 pid = 0xffff8812cdc28800 vfs_inode.i_sb = 0xffff881fbe923800, vfs_inode.i_fop = 0xffffffff81846d40 <proc_tgid_base_operations>, vfs_inode.i_dentry = { first = 0xffff881233dfcb30 }, vfs_inode.i_ino = 1426511612, crash> Find the corresponding dentry and found it is /proc/94426 crash> eval 0xffff881233dfcb30 - 0xb0 hexadecimal: ffff881233dfca80 crash> dentry.d_name,d_parent,d_inode ffff881233dfca80 d_name = { { { hash = 382192091, len = 5 }, hash_len = 21857028571 }, name = 0xffff881233dfcab8 "94426" } d_parent = 0xffff881fbe408240 d_inode = 0xffff8812336887f8 crash> dentry.d_name,d_parent,d_inode 0xffff881fbe408240 d_name = { { { hash = 0, len = 1 }, hash_len = 4294967296 }, name = 0xffff881fbe408278 "/" } d_parent = 0xffff881fbe408240 d_inode = 0xffff881fbe418048 The second proc_inode: crash> proc_inode.pid,vfs_inode.i_sb,vfs_inode.i_fop,vfs_inode.i_dentry,vfs_inode.i_ino ffff881233d9ae20 pid = 0xffff8812cdc28800 vfs_inode.i_sb = 0xffff881fbe923800, vfs_inode.i_fop = 0xffffffff81841700 <no_open_fops.46098>, vfs_inode.i_dentry = { first = 0xffff881233dd12b0 }, vfs_inode.i_ino = 1426524178, crash> Find the corresponding dentry and found it is /proc/94426/ns/pid: crash> eval 0xffff881233dd12b0 - 0xb0 hexadecimal: ffff881233dd1200 crash> dentry.d_name,d_parent,d_inode ffff881233dd1200 d_name = { { { hash = 3686000610, len = 3 }, hash_len = 16570902498 }, name = 0xffff881233dd1238 "pid" } d_parent = 0xffff881233dd0540 d_inode = 0xffff881233d9ae68 crash> dentry.d_name,d_parent,d_inode 0xffff881233dd0540 d_name = { { { hash = 4061867969, len = 2 }, hash_len = 12651802561 }, name = 0xffff881233dd0578 "ns" } d_parent = 0xffff881233dfca80 d_inode = 0xffff881233d9cd28 crash> dentry.d_name,d_parent,d_inode 0xffff881233dfca80 d_name = { { { hash = 382192091, len = 5 }, hash_len = 21857028571 }, name = 0xffff881233dfcab8 "94426" } d_parent = 0xffff881fbe408240 d_inode = 0xffff8812336887f8 crash> dentry.d_name,d_parent,d_inode 0xffff881fbe408240 d_name = { { { hash = 0, len = 1 }, hash_len = 4294967296 }, name = 0xffff881fbe408278 "/" } d_parent = 0xffff881fbe408240 d_inode = 0xffff881fbe418048 The third proc_pid: crash> proc_inode.pid,vfs_inode.i_sb,vfs_inode.i_fop,vfs_inode.i_dentry,vfs_inode.i_ino ffff881233d9cce0 pid = 0xffff8812cdc28800 vfs_inode.i_sb = 0xffff881fbe923800, vfs_inode.i_fop = 0xffffffff8184acc0 <proc_ns_dir_operations>, vfs_inode.i_dentry = { first = 0xffff881233dd05f0 }, vfs_inode.i_ino = 1426524174, Find the corresponding dentry and found it is /proc/94426/ns: crash> eval 0xffff881233dd05f0 - 0xb0 hexadecimal: ffff881233dd0540 crash> dentry.d_name,d_parent,d_inode ffff881233dd0540 d_name = { { { hash = 4061867969, len = 2 }, hash_len = 12651802561 }, name = 0xffff881233dd0578 "ns" } d_parent = 0xffff881233dfca80 d_inode = 0xffff881233d9cd28 crash> dentry.d_name,d_parent,d_inode 0xffff881233dfca80 d_name = { { { hash = 382192091, len = 5 }, hash_len = 21857028571 }, name = 0xffff881233dfcab8 "94426" } d_parent = 0xffff881fbe408240 d_inode = 0xffff8812336887f8 crash> dentry.d_name,d_parent,d_inode 0xffff881fbe408240 d_name = { { { hash = 0, len = 1 }, hash_len = 4294967296 }, name = 0xffff881fbe408278 "/" } d_parent = 0xffff881fbe408240 d_inode = 0xffff881fbe418048 crash> 3, To sum up, after the process exits, the following three dentry still refer to the pid: /proc/<pid> /proc/<pid>/ns /proc/<pid>/ns/pid According to commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if pids cannot be released, it may result in the inability to create a new pid_ns. Adding the following code to the 4.9 kernel may solve this problem: diff --git a/fs/proc/base.c b/fs/proc/base.c index 13b6398..1bff16e 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3207,6 +3207,22 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid) } dput(dir); + + name.name = "ns"; + name.len = strlen(name.name); + dir = d_hash_and_lookup(leader, &name); + if (!dir) + goto out_put_leader; + + name.name = "pid"; + name.len = strlen(name.name); + dentry = d_hash_and_lookup(dir, &name); + if (dentry) { + d_invalidate(dentry); + dput(dentry); + } + dput(dir); + out_put_leader: dput(leader); out: The following 3 patches may need to be ported to the stable branches: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc") 26dbc60f385f ("proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache") f90f3cafe8d5 ("proc: Use d_invalidate in proc_prune_siblings_dcache") In the kernel 4.9 stable branch: void dput(struct dentry *dentry) { … if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) { if (dentry->d_op->d_delete(dentry)) ---> Assuming that the process is at the moment of exiting, there is a race goto kill_it; } … dentry->d_lockref.count--; ---> The reference count is decreased by 1, but dentry will not be released (it can be released only when drop_caches is manually triggered) spin_unlock(&dentry->d_lock); return; kill_it: dentry = dentry_kill(dentry); … } as well as: void release_task(struct task_struct *p) { … proc_flush_task(p); ---> Will only release dentry with a reference count of 0 write_lock_irq(&tasklist_lock); ptrace_release_task(p); __exit_signal(p); ---> Calling __unhash_process to modify the data corresponding to PIDTYPE_PID. … } So there may be such a race: release_task dput -> proc_flush_task -> dentry->d_op->d_delete(dentry) -> __exit_signal If we do not backport the above patches to the 4.9 stable branch, we may need to refer to commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ("proc: Use a list of inodes to flush from proc") to make the following simple modifications: diff --git a/kernel/exit.c b/kernel/exit.c index f9943ef..cfbbbe9 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -176,8 +176,6 @@ void release_task(struct task_struct *p) atomic_dec(&__task_cred(p)->user->processes); rcu_read_unlock(); - proc_flush_task(p); - write_lock_irq(&tasklist_lock); ptrace_release_task(p); __exit_signal(p); @@ -202,6 +200,7 @@ void release_task(struct task_struct *p) } write_unlock_irq(&tasklist_lock); + proc_flush_task(p); release_thread(p); call_rcu(&p->rcu, delayed_put_task_struct); Hello Eric, hello Greg K-H, would you like to give some suggestions? On Thu, Nov 26, 2020 at 11:02:53AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > Hello Eric, hello Greg K-H, would you like to give some suggestions? <formletter> This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for how to do this properly. </formletter> |