Bug 30432 - rmdir on cgroup can cause hang tasks
Summary: rmdir on cgroup can cause hang tasks
Status: CLOSED CODE_FIX
Alias: None
Product: Process Management
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Importance: P1 high
Assignee: process_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-03-04 04:27 UTC by Daniel Poelzleithner
Modified: 2012-06-13 15:01 UTC
CC List: 2 users

See Also:
Kernel Version: 2.6.37
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Daniel Poelzleithner 2011-03-04 04:27:22 UTC
I just got the following hang when removing an empty cgroup. A shell had still been in the cgroup that got emptied and removed. The shell, as well as the release_agent and the program managing the cgroup, hangs.

The directory structure looks like:
/sys/fs/cgroup/memory/usr_1000/psn_3234

ls on /sys/fs/cgroup/memory works, but ls on /sys/fs/cgroup/memory/usr_1000 hangs.


[ 5065.280666] SysRq : Changing Loglevel
[ 5065.282574] Loglevel set to 5
[ 5066.139879] SysRq : Show Blocked State
[ 5066.141848]   task                        PC stack   pid father
[ 5066.141925] zsh           D ffff880071520398     0  8719   3589 0x00000084
[ 5066.141937]  ffff880002059bd8 0000000000000086 ffff880002059bb8 ffffffff00000000
[ 5066.143971]  00000000000139c0 ffff880071520000 ffff880071520398 ffff880002059fd8
[ 5066.146049]  ffff8800715203a0 00000000000139c0 ffff880002058010 00000000000139c0
[ 5066.148183] Call Trace:
[ 5066.149853]  [<ffffffff8158ec97>] __mutex_lock_slowpath+0xf7/0x180
[ 5066.149853]  [<ffffffff812d74a6>] ? vsnprintf+0x416/0x5a0
[ 5066.149853]  [<ffffffff8158eb7b>] mutex_lock+0x2b/0x50
[ 5066.149853]  [<ffffffff81168252>] do_lookup+0x102/0x180
[ 5066.149853]  [<ffffffff81168dfd>] link_path_walk+0x4dd/0x9e0
[ 5066.149853]  [<ffffffff81169417>] path_walk+0x67/0xe0
[ 5066.149853]  [<ffffffff811695eb>] do_path_lookup+0x5b/0xa0
[ 5066.149853]  [<ffffffff8116a2f7>] user_path_at+0x57/0xa0
[ 5066.149853]  [<ffffffff815940e0>] ? do_page_fault+0x1f0/0x4f0
[ 5066.149853]  [<ffffffff81075e6c>] ? kill_pid_info+0x2c/0x60
[ 5066.149853]  [<ffffffff811604fc>] vfs_fstatat+0x3c/0x80
[ 5066.149853]  [<ffffffff8116061b>] vfs_stat+0x1b/0x20
[ 5066.149853]  [<ffffffff81160644>] sys_newstat+0x24/0x50
[ 5066.149853]  [<ffffffff810bf6ff>] ? audit_syscall_entry+0x1df/0x280
[ 5066.149853]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
[ 5066.149853] ulatencyd     D ffff88007a0bc7d8     0  9004   4809 0x00000084
[ 5066.149853]  ffff880070b55cd8 0000000000000082 0000000000000082 ffff88002c9d16c0
[ 5066.149853]  00000000000139c0 ffff88007a0bc440 ffff88007a0bc7d8 ffff880070b55fd8
[ 5066.149853]  ffff88007a0bc7e0 00000000000139c0 ffff880070b54010 00000000000139c0
[ 5066.149853] Call Trace:
[ 5066.149853]  [<ffffffff8158ec97>] __mutex_lock_slowpath+0xf7/0x180
[ 5066.149853]  [<ffffffff81166124>] ? exec_permission+0x44/0x90
[ 5066.149853]  [<ffffffff8158eb7b>] mutex_lock+0x2b/0x50
[ 5066.149853]  [<ffffffff81168418>] do_last+0x148/0x650
[ 5066.149853]  [<ffffffff8116a6d5>] do_filp_open+0x205/0x5f0
[ 5066.149853]  [<ffffffff81167281>] ? path_put+0x31/0x40
[ 5066.149853]  [<ffffffff8117593a>] ? alloc_fd+0x10a/0x150
[ 5066.149853]  [<ffffffff81159bb9>] do_sys_open+0x69/0x110
[ 5066.149853]  [<ffffffff81159ca0>] sys_open+0x20/0x30
[ 5066.149853]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
[ 5066.149853] lua           D ffff88002c91b118     0  9487      1 0x00000080
[ 5066.149853]  ffff880078f6db08 0000000000000086 ffff88002c91b118 ffff880000000000
[ 5066.149853]  00000000000139c0 ffff88002c91ad80 ffff88002c91b118 ffff880078f6dfd8
[ 5066.149853]  ffff88002c91b120 00000000000139c0 ffff880078f6c010 00000000000139c0
[ 5066.149853] Call Trace:
[ 5066.149853]  [<ffffffff8158e4d5>] schedule_timeout+0x215/0x2f0
[ 5066.149853]  [<ffffffff8104e4fd>] ? task_rq_lock+0x5d/0xa0
[ 5066.149853]  [<ffffffff81059c93>] ? try_to_wake_up+0xc3/0x410
[ 5066.149853]  [<ffffffff8158e0cb>] wait_for_common+0xdb/0x180
[ 5066.149853]  [<ffffffff81059fe0>] ? default_wake_function+0x0/0x20
[ 5066.244366]  [<ffffffff8158e24d>] wait_for_completion+0x1d/0x20
[ 5066.244366]  [<ffffffff810d44f5>] synchronize_sched+0x55/0x60
[ 5066.244366]  [<ffffffff81080b00>] ? wakeme_after_rcu+0x0/0x20
[ 5066.244366]  [<ffffffff811526a3>] mem_cgroup_start_move+0x93/0xa0
[ 5066.244366]  [<ffffffff8115739b>] mem_cgroup_force_empty+0xdb/0x640
[ 5066.244366]  [<ffffffff81157914>] mem_cgroup_pre_destroy+0x14/0x20
[ 5066.244366]  [<ffffffff810ae681>] cgroup_rmdir+0xc1/0x560
[ 5066.244366]  [<ffffffff81083d70>] ? autoremove_wake_function+0x0/0x40
[ 5066.244366]  [<ffffffff81167cc4>] vfs_rmdir+0xb4/0x110
[ 5066.244366]  [<ffffffff81169d13>] do_rmdir+0x133/0x140
[ 5066.244366]  [<ffffffff810d3c85>] ? call_rcu_sched+0x15/0x20
[ 5066.244366]  [<ffffffff810bf6ff>] ? audit_syscall_entry+0x1df/0x280
[ 5066.244366]  [<ffffffff81169d76>] sys_rmdir+0x16/0x20
[ 5066.244366]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
Comment 1 Andrew Morton 2011-03-04 08:04:54 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 4 Mar 2011 04:27:26 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=30432
> 
>            Summary: rmdir on cgroup can cause hang tasks
>            Product: Process Management
>            Version: 2.5
>     Kernel Version: 2.6.37
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: process_other@kernel-bugs.osdl.org
>         ReportedBy: bugzilla.kernel.org@poelzi.org
>         Regression: No
> 
> 
> [description and call traces identical to the report above; snipped]
Comment 2 KAMEZAWA Hiroyuki 2011-03-04 09:21:50 UTC
On Fri, 4 Mar 2011 17:28:15 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
This seems....
> ==
> static void mem_cgroup_start_move(struct mem_cgroup *mem)
> {
> .....
>       put_online_cpus();
> 
>         synchronize_rcu();   <---------(*)
> }
> ==
> 

But this may scan the LRU of the memcg forever, and SysRq+T just shows
the above stack.

I'll check a tree before THP and force_empty again
-Kame
Comment 3 KAMEZAWA Hiroyuki 2011-03-04 10:14:16 UTC
On Fri, 4 Mar 2011 00:03:55 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> > [ 5066.149853] Call Trace:
> > [ 5066.149853]  [<ffffffff8158e4d5>] schedule_timeout+0x215/0x2f0
> > [ 5066.149853]  [<ffffffff8104e4fd>] ? task_rq_lock+0x5d/0xa0
> > [ 5066.149853]  [<ffffffff81059c93>] ? try_to_wake_up+0xc3/0x410
> > [ 5066.149853]  [<ffffffff8158e0cb>] wait_for_common+0xdb/0x180
> > [ 5066.149853]  [<ffffffff81059fe0>] ? default_wake_function+0x0/0x20
> > [ 5066.244366]  [<ffffffff8158e24d>] wait_for_completion+0x1d/0x20
> > [ 5066.244366]  [<ffffffff810d44f5>] synchronize_sched+0x55/0x60
> > [ 5066.244366]  [<ffffffff81080b00>] ? wakeme_after_rcu+0x0/0x20
> > [ 5066.244366]  [<ffffffff811526a3>] mem_cgroup_start_move+0x93/0xa0
> > [ 5066.244366]  [<ffffffff8115739b>] mem_cgroup_force_empty+0xdb/0x640
> > [ 5066.244366]  [<ffffffff81157914>] mem_cgroup_pre_destroy+0x14/0x20
> > [ 5066.244366]  [<ffffffff810ae681>] cgroup_rmdir+0xc1/0x560
> > [ 5066.244366]  [<ffffffff81083d70>] ? autoremove_wake_function+0x0/0x40
> > [ 5066.244366]  [<ffffffff81167cc4>] vfs_rmdir+0xb4/0x110
> > [ 5066.244366]  [<ffffffff81169d13>] do_rmdir+0x133/0x140
> > [ 5066.244366]  [<ffffffff810d3c85>] ? call_rcu_sched+0x15/0x20
> > [ 5066.244366]  [<ffffffff810bf6ff>] ? audit_syscall_entry+0x1df/0x280
> > [ 5066.244366]  [<ffffffff81169d76>] sys_rmdir+0x16/0x20
> > [ 5066.244366]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
> 

This seems....
==
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
.....
	put_online_cpus();

        synchronize_rcu();   <---------(*)
}
==

Waiting on above synchronize_rcu().

Hmm...
-Kame
Comment 4 Anonymous Emailer 2011-03-04 11:15:45 UTC
Reply-To: nishimura@mxp.nes.nec.co.jp

Hi, thank you for your report.

I have some questions.

On Fri, 4 Mar 2011 00:03:55 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Fri, 4 Mar 2011 04:27:26 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=30432
> > 
> >            Summary: rmdir on cgroup can cause hang tasks
> >            Product: Process Management
> >            Version: 2.5
> >     Kernel Version: 2.6.37
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: high
> >           Priority: P1
> >          Component: Other
> >         AssignedTo: process_other@kernel-bugs.osdl.org
> >         ReportedBy: bugzilla.kernel.org@poelzi.org
> >         Regression: No
> > 
> > 
> > I just got following hang when removing an empty cgroup. I had still a
> shell in
> > the cgroup that got emptied and removed. The shell as well as the
> release_agent
> > and the program managing the cgroup hangs.
> > 
Did you try to remove the cgroup directory via the release_agent (i.e., you enabled
"notify_on_release" and had the release_agent remove the directory)?

And can you show your /proc/cgroups?

Thanks,
Daisuke Nishimura.

> > [remainder of the original report and call traces quoted verbatim; snipped]
Comment 5 KAMEZAWA Hiroyuki 2011-03-07 05:05:21 UTC
On Fri, 4 Mar 2011 18:01:57 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri, 4 Mar 2011 17:28:15 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> This seems....
> > ==
> > static void mem_cgroup_start_move(struct mem_cgroup *mem)
> > {
> > .....
> >     put_online_cpus();
> > 
> >         synchronize_rcu();   <---------(*)
> > }
> > ==
> > 
> 
> But this may scan LRU of memcg forever and SysRq+T just shows
> above stack.
> 
> I'll check a tree before THP and force_empty again

Hmm, one more concern: what kind of file system is used?

Can I see
 - /proc/mounts
and your .config?

If you use FUSE, could you try this ?

I'll prepare one for mmotm.

==

fs/fuse/dev.c::fuse_try_move_page() does

   (1) remove a page by ->steal()
   (2) re-add the page to page cache 
   (3) link the page to LRU if it was not on LRU at (1)

This implies the page is _on_ the LRU when it's added to the radix tree.
So, the page is added to the memory cgroup while it's on the LRU,
because the LRU is lazy and no one flushes it.

This is the same behavior as SwapCache and needs special care as
 - remove page from LRU before overwrite pc->mem_cgroup.
 - add page to LRU after overwrite pc->mem_cgroup.

So, reuse the SwapCache handling, renaming it.

Note: a page on pagevec(LRU).
If a page is not PageLRU(page) but on pagevec(LRU), it may be added to LRU
while we overwrite page->mapping. But in that case, PCG_USED bit of
the page_cgroup is not set and the page_cgroup will not be added to
wrong memcg's LRU. So, this patch's logic will work fine.
(It has been tested with SwapCache.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   45 +++++++++++++++++++++++++++------------------
 1 file changed, 27 insertions(+), 18 deletions(-)

Index: linux-2.6.37/mm/memcontrol.c
===================================================================
--- linux-2.6.37.orig/mm/memcontrol.c
+++ linux-2.6.37/mm/memcontrol.c
@@ -876,13 +876,12 @@ void mem_cgroup_add_lru_list(struct page
 }
 
 /*
- * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
- * lru because the page may.be reused after it's fully uncharged (because of
- * SwapCache behavior).To handle that, unlink page_cgroup from LRU when charge
- * it again. This function is only used to charge SwapCache. It's done under
- * lock_page and expected that zone->lru_lock is never held.
+ * At handling SwapCache and other FUSE stuff, pc->mem_cgroup may be changed
+ * while it's linked to lru because the page may be reused after it's fully
+ * uncharged. To handle that, unlink page_cgroup from LRU when charge it again.
+ * It's done under lock_page and expected that zone->lru_lock is never held.
  */
-static void mem_cgroup_lru_del_before_commit_swapcache(struct page *page)
+static void mem_cgroup_lru_del_before_commit(struct page *page)
 {
 	unsigned long flags;
 	struct zone *zone = page_zone(page);
@@ -898,7 +897,7 @@ static void mem_cgroup_lru_del_before_co
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
 }
 
-static void mem_cgroup_lru_add_after_commit_swapcache(struct page *page)
+static void mem_cgroup_lru_add_after_commit(struct page *page)
 {
 	unsigned long flags;
 	struct zone *zone = page_zone(page);
@@ -2299,7 +2298,7 @@ int mem_cgroup_newpage_charge(struct pag
 }
 
 static void
-__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
+__mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *ptr,
 					enum charge_type ctype);
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -2339,18 +2338,28 @@ int mem_cgroup_cache_charge(struct page 
 	if (unlikely(!mm))
 		mm = &init_mm;
 
-	if (page_is_file_cache(page))
-		return mem_cgroup_charge_common(page, mm, gfp_mask,
+	/*
+	 * FUSE has a logic to reuse existing page-cache before free().
+	 * It means the 'page' may be on some LRU. SwapCache has the
+	 * same kind of handling.
+	 */
+	if (page_is_file_cache(page) && !PageLRU(page)) {
+		ret = mem_cgroup_charge_common(page, mm, gfp_mask,
 				MEM_CGROUP_CHARGE_TYPE_CACHE);
+	} else if (page_is_file_cache(page)) {
+		struct mem_cgroup *mem = NULL;
 
-	/* shmem */
-	if (PageSwapCache(page)) {
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
+		if (!ret)
+			__mem_cgroup_commit_charge_lrucare(page, mem,
+				MEM_CGROUP_CHARGE_TYPE_CACHE);
+	} else if (PageSwapCache(page)) {
 		struct mem_cgroup *mem = NULL;
 
 		ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
 		if (!ret)
-			__mem_cgroup_commit_charge_swapin(page, mem,
-					MEM_CGROUP_CHARGE_TYPE_SHMEM);
+			__mem_cgroup_commit_charge_lrucare(page, mem,
+				MEM_CGROUP_CHARGE_TYPE_SHMEM);
 	} else
 		ret = mem_cgroup_charge_common(page, mm, gfp_mask,
 					MEM_CGROUP_CHARGE_TYPE_SHMEM);
@@ -2398,7 +2407,7 @@ charge_cur_mm:
 }
 
 static void
-__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
+__mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *ptr,
 					enum charge_type ctype)
 {
 	struct page_cgroup *pc;
@@ -2409,9 +2418,9 @@ __mem_cgroup_commit_charge_swapin(struct
 		return;
 	cgroup_exclude_rmdir(&ptr->css);
 	pc = lookup_page_cgroup(page);
-	mem_cgroup_lru_del_before_commit_swapcache(page);
+	mem_cgroup_lru_del_before_commit(page);
 	__mem_cgroup_commit_charge(ptr, pc, ctype);
-	mem_cgroup_lru_add_after_commit_swapcache(page);
+	mem_cgroup_lru_add_after_commit(page);
 	/*
 	 * Now swap is on-memory. This means this page may be
 	 * counted both as mem and swap....double count.
@@ -2449,7 +2458,7 @@ __mem_cgroup_commit_charge_swapin(struct
 
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
 {
-	__mem_cgroup_commit_charge_swapin(page, ptr,
+	__mem_cgroup_commit_charge_lrucare(page, ptr,
 					MEM_CGROUP_CHARGE_TYPE_MAPPED);
 }
Comment 6 KAMEZAWA Hiroyuki 2011-03-07 05:46:08 UTC
On Mon, 7 Mar 2011 13:58:03 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri, 4 Mar 2011 18:01:57 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > [earlier discussion of mem_cgroup_start_move()'s synchronize_rcu() quoted; snipped]
> 
> Hmm, one more concern: what kind of file system is used?
> 
> Can I see
>  - /proc/mounts
> and your .config?
> 
> If you use FUSE, could you try this ?
> 
> I'll prepare one for mmotm.
> 

Sorry, this version may contain a hang bug...
I'll start from fix for mmotm and backport it.

Thanks,
-Kame
Comment 7 Florian Mickler 2011-03-28 22:51:38 UTC
A patch referencing this bug report has been merged in v2.6.38-8569-g16c29da:

commit 5a6475a4e162200f43855e2d42bbf55bcca1a9f2
Author: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date:   Wed Mar 23 16:42:42 2011 -0700

    memcg: fix leak on wrong LRU with FUSE
