Created attachment 176051 [details]
.config

I have an application that makes heavy use of network namespaces, creating and destroying them on the fly during operation. With 100% reproducibility, the first invocation of "ip netns create" after any "ip netns del" hangs forever in D-state; only rebooting the machine clears the condition. (This used to be an intermittent fault, going back at least as far as 3.16, but 4.0 seems to have upgraded it to an every-single-time fault.)

The hung-task detector reports (lockdep is enabled, hence the held-lock dump):

May 5 22:48:07 laguz kernel: INFO: task ip:572431 blocked for more than 120 seconds.
May 5 22:48:07 laguz kernel: Not tainted 4.0.1-linode55-lockdep #2
May 5 22:48:07 laguz kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 5 22:48:07 laguz kernel: ip              D ffff88000e28bda8     0 572431 572428 0x00000000
May 5 22:48:07 laguz kernel:  ffff88000e28bda8 ffff880058b60000 ffff8801512c6540 ffff88000e28bda8
May 5 22:48:07 laguz kernel:  ffff88000e28bfd8 ffff8800a07e6500 ffffffff81ecee88 ffff88000e28be28
May 5 22:48:07 laguz kernel:  ffff8801512c6540 ffff88000e28bdc8 ffffffff81964ca9 0000000000000006
May 5 22:48:07 laguz kernel: Call Trace:
May 5 22:48:07 laguz kernel:  [<ffffffff81964ca9>] schedule+0x74/0x83
May 5 22:48:07 laguz kernel:  [<ffffffff81964cc6>] schedule_preempt_disabled+0xe/0x10
May 5 22:48:07 laguz kernel:  [<ffffffff81965b0f>] mutex_lock_nested+0x200/0x462
May 5 22:48:07 laguz kernel:  [<ffffffff8175a651>] ? copy_net_ns+0x79/0x103
May 5 22:48:07 laguz kernel:  [<ffffffff8175a61c>] ? copy_net_ns+0x44/0x103
May 5 22:48:07 laguz kernel:  [<ffffffff811139d5>] ? lockdep_trace_alloc+0xca/0xdc
May 5 22:48:07 laguz kernel:  [<ffffffff8175a651>] copy_net_ns+0x79/0x103
May 5 22:48:07 laguz kernel:  [<ffffffff810f4b40>] create_new_namespaces+0x103/0x17f
May 5 22:48:07 laguz kernel:  [<ffffffff810f4d8b>] unshare_nsproxy_namespaces+0x8b/0xb0
May 5 22:48:07 laguz kernel:  [<ffffffff810da7f0>] SyS_unshare+0x120/0x2aa
May 5 22:48:07 laguz kernel:  [<ffffffff8196a2c9>] system_call_fastpath+0x12/0x17
May 5 22:48:07 laguz kernel: 1 lock held by ip/572431:
May 5 22:48:07 laguz kernel:  #0:  (net_mutex){+.+.+.}, at: [<ffffffff8175a651>] copy_net_ns+0x79/0x103

When the bug manifests, there is invariably also a kworker process stuck in D-state, but I don't see anything related to that in this trace. I've restarted the machine so I can continue collecting data; the condition will recur in a few days, at which point I will provide any additional information you ask for between now and then.

.config attached. (This is Linode's latest stock 64-bit kernel plus activation of lockdep.)
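For context on the call trace: "ip netns create" reaches copy_net_ns through the unshare(CLONE_NEWNET) syscall (SyS_unshare in the trace). A minimal userspace illustration of that path is below; this is a hypothetical sketch, not the actual iproute2 code, which additionally bind-mounts /proc/self/ns/net under /var/run/netns.

	/* Hypothetical illustration only: the syscall path that hangs above.
	 * The bind-mount step that iproute2 performs is omitted. */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		/* This is the call that ends up in copy_net_ns and blocks on
		 * net_mutex while a previous namespace is still being torn down. */
		if (unshare(CLONE_NEWNET) < 0) {
			perror("unshare(CLONE_NEWNET)");
			return EXIT_FAILURE;
		}
		puts("new network namespace created");
		return EXIT_SUCCESS;
	}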
I am also seeing this with 4.0.4.
Here is the kernel stack trace for the stuck kworker:

# ps -e -o pid,stat,comm,wchan | grep D
   PID STAT COMMAND         WCHAN
257795 D    kworker/u12:0   msleep
290845 D+   ip              copy_net_ns

# cat /proc/257795/stack
[<ffffffff8112a9a7>] msleep+0x1e/0x22
[<ffffffff81763b13>] netdev_run_todo+0x17a/0x2a5
[<ffffffff81770284>] rtnl_unlock+0xe/0x10
[<ffffffff81764764>] default_device_exit_batch+0x12c/0x13b
[<ffffffff81759c25>] ops_exit_list+0x2a/0x55
[<ffffffff8175a532>] cleanup_net+0x14c/0x1f2
[<ffffffff810eecf1>] process_one_work+0x216/0x3fb
[<ffffffff810ef17d>] worker_thread+0x278/0x35c
[<ffffffff810f3ec5>] kthread+0x102/0x10a
[<ffffffff8196a218>] ret_from_fork+0x58/0x90

The top of the stack is the msleep(250) in netdev_wait_allrefs, which has been inlined into netdev_run_todo. Presumably hung task detection does not trigger because the kworker is never actually *blocked* for more than 250 ms at a time.

I'm not finding any obvious way to get the "N locks held by <process>" dump manually, but I am *guessing* that the kworker holds none and this is just fallout: whatever it's trying to do cannot proceed as long as copy_net_ns holds the net_mutex, and rather than actually *blocking* on that, it loops forever like this.

The critical piece of code in copy_net_ns reads:

	mutex_lock(&net_mutex);
	rv = setup_net(net, user_ns);
	if (rv == 0) {
		rtnl_lock();
		list_add_tail_rcu(&net->list, &net_namespace_list);
		rtnl_unlock();
	}
	mutex_unlock(&net_mutex);

The assembly around copy_net_ns+0x79 is:

   0xffffffff8175a643 <+107>:	xor    %esi,%esi
   0xffffffff8175a645 <+109>:	mov    $0xffffffff81ecee80,%rdi
   0xffffffff8175a64c <+116>:	callq  0xffffffff8196590f <mutex_lock_nested>
   0xffffffff8175a651 <+121>:	mov    %rbx,%rsi
   0xffffffff8175a654 <+124>:	mov    %r12,%rdi
   0xffffffff8175a657 <+127>:	callq  0xffffffff8175a0a5 <setup_net>

So the ip process is blocked on an attempt to acquire the net_mutex, which lockdep thinks it *already holds, having acquired it at that very point*. This doesn't make any sense to me. (I can imagine some sort of bug in the guts of the mutex implementation that would cause it to block forever *after* having acquired the mutex it was asked to acquire, but that strikes me as extraordinarily unlikely.)
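To make the kworker's behaviour concrete: as I read it, the netdev_wait_allrefs loop it is stuck in looks roughly like the sketch below (a simplified paraphrase from memory of net/core/dev.c, not a verbatim quote; the rebroadcast/warning timers are abbreviated). That is why the hung-task detector never fires: the task sleeps in 250 ms slices instead of blocking uninterruptibly for 120 seconds in one go.

	/* Simplified paraphrase of the netdev_wait_allrefs() loop. */
	static void netdev_wait_allrefs_sketch(struct net_device *dev)
	{
		while (netdev_refcnt_read(dev) != 0) {
			/* Roughly once a second: retake the rtnl lock and
			 * rebroadcast NETDEV_UNREGISTER, hoping whoever holds
			 * the last reference drops it. */
			rtnl_lock();
			call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
			__rtnl_unlock();

			msleep(250);	/* never blocked more than 250 ms at a time */

			/* Roughly every 10 seconds this is where the familiar
			 * "unregister_netdevice: waiting for %s to become free.
			 * Usage count = %d" warning is printed. */
		}
	}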
N.B. that the ip process never gets to setup_net and that rtnl_lock has _not_ been inlined, so this can't be a case of it actually trying to acquire the rtnl_mutex (which the kworker is continually locking and unlocking, so we'd have to presume colossal unfairness in the mutex implementation, but that would make more sense to me than what appears to have happened).

I might try moving the mutex_unlock(&net_mutex) up to right after the setup_net call anyway; I don't *think* anything done in the if (rv == 0) branch actually needs to hold that lock. See the sketch below.
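Concretely, the reordering I have in mind looks like this (untested sketch against the copy_net_ns excerpt quoted above; whether it is safe depends on whether anything in the rv == 0 branch really needs net_mutex):

	/* Untested sketch: release net_mutex as soon as setup_net() returns,
	 * so the rtnl_lock/list manipulation happens without holding it. */
	mutex_lock(&net_mutex);
	rv = setup_net(net, user_ns);
	mutex_unlock(&net_mutex);
	if (rv == 0) {
		rtnl_lock();
		list_add_tail_rcu(&net->list, &net_namespace_list);
		rtnl_unlock();
	}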
... Reproduction appears not to be as simple as I thought. A shell script that simply brings up a network namespace with the configuration I use, tears it down again, and then tries to create another one does _not_ trigger the bug. Given that this happens 100% of the time in production, I can only conclude that either you have to push a lot of data through the namespace, or the bug depends on something OpenVPN does to its tunnel devices. I am also wondering whether getting stuck on the loopback interface is a red herring.
Zack,

If it helps, the setup in bug 89370 also seems to trigger this bug.

Ed
https://bugzilla.kernel.org/show_bug.cgi?id=89370 gives me only an error message, "Bug #89370 does not exist." Did you mean some other bug number or some other tracker?
Zack,

Try: https://bugs.freedesktop.org/show_bug.cgi?id=89370

Ed
I was hitting this issue while developing tests [1] for OpenStack Neutron. It reproduced rather often when the tests were run (and namespaces manipulated) inside a VM with 3 or 4 virtual CPUs; the reproduction rate was rather low with a 2-CPU VM. Reproduced on two kernels: 4.6.4-301.fc24.x86_64 and 4.6.6-300.fc24.x86_64.

[1] https://review.openstack.org/#/c/355893/
I can reproduce this even on recent kernels (the latest I tried is 4.8.12). My use case is probably similar to Zack's: I'm using Docker (and therefore network namespaces) for regular testing of a rather large number of OpenVPN and IPsec gateways, spinning up a container with a VPN client for each test and tearing it down afterwards. I can't tell exactly how many such tests are needed to trigger it, but it takes only a couple of minutes at most, with about 30 tests per minute being run.
The D-state issue is resolved by this commit:

commit 7281a66590cbc284272e18cb4acbfb88ddaa4cd8
Author: Andrey Vagin <avagin@openvz.org>
Date:   Thu Oct 20 19:45:43 2016 -0700

    net: allow to kill a task which waits net_mutex in copy_new_ns

    net_mutex can be locked for a long time. It may be because many
    namespaces are being destroyed or many processes decide to create
    a network namespace. Both these operations are heavy, so it is
    better to have an ability to kill a process which is waiting
    net_mutex.

After that commit, `ip netns create` is no longer stuck in D-state and can be killed.

The "unregister_netdevice: waiting for lo to become free" message is a separate issue, which has been reported to netdev at least twice; I have no clue about the cause yet. The latest investigation suggests it is related to a dst refcount leak.
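For reference, the shape of that change is roughly the following (paraphrased from memory, not the exact diff; error-path cleanup is abbreviated): copy_net_ns waits for net_mutex with mutex_lock_killable() instead of mutex_lock(), so a fatal signal can take the task out of the wait.

	/* Paraphrase of the idea in commit 7281a66590cb: the wait for
	 * net_mutex becomes killable, so a stuck "ip netns create" can at
	 * least be terminated instead of sitting in D-state forever. */
	rv = mutex_lock_killable(&net_mutex);
	if (rv < 0) {
		/* Interrupted by a fatal signal: clean up (abbreviated here)
		 * and report the error instead of sleeping on. */
		net_free(net);
		return ERR_PTR(rv);
	}
	rv = setup_net(net, user_ns);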
Is this not supposed to have been fixed with the following commit?

commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Thu Jan 18 16:14:26 2018 -0500

    net: tcp: close sock if net namespace is exiting
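As I understand that commit (paraphrasing from memory, so the exact hunk placement may be off), the TCP timer path learns to notice that the socket's namespace is already being dismantled and closes the socket instead of keeping its references alive, roughly along these lines:

	/* Rough paraphrase of the idea in 4ee806d51176: if the socket's
	 * network namespace is exiting, give up and close the socket so it
	 * stops pinning dst/device references ("lo" can then become free). */
	if (!check_net(sock_net(sk))) {
		/* The namespace is going away; no point sending a reset. */
		tcp_done(sk);
		return 1;
	}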
4ee806d51176ba7b8ff1efd81f271d7252e03a1d does seem to mitigate the issue; see https://github.com/moby/moby/issues/5618#issuecomment-455800975