Bug 97811

Summary: "ip netns create" hangs forever, spamming console with "unregister_netdevice: waiting for lo to become free"
Product: Networking Reporter: Zack Weinberg (zackw)
Component: OtherAssignee: Stephen Hemminger (stephen)
Status: NEW ---    
Severity: normal CC: andre, anthonybest, edt, egor.guz, fweimer, glen.84, hmlnarik, jdelvare, me, pahan, szg0000, vit, xiyou.wangcong
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.0.1 Tree: Mainline
Regression: No
Attachments: .config

Description Zack Weinberg 2015-05-06 17:09:25 UTC
Created attachment 176051 [details]
.config

I have an application that makes heavy use of network namespaces, creating and destroying them on the fly during operation.  With 100% reproducibility, the first invocation of "ip netns create" after any "ip netns del" hangs forever in D-state; only rebooting the machine clears the condition.  (This used to be an intermittent fault, going back at least as far as 3.16, but 4.0 seems to have upgraded it to an every-single-time fault.)

lockdep hung task detection reports:

May  5 22:48:07 laguz kernel: INFO: task ip:572431 blocked for more than 120 seconds.
May  5 22:48:07 laguz kernel:      Not tainted 4.0.1-linode55-lockdep #2
May  5 22:48:07 laguz kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  5 22:48:07 laguz kernel: ip              D ffff88000e28bda8     0 572431 572428 0x00000000
May  5 22:48:07 laguz kernel: ffff88000e28bda8 ffff880058b60000 ffff8801512c6540 ffff88000e28bda8
May  5 22:48:07 laguz kernel: ffff88000e28bfd8 ffff8800a07e6500 ffffffff81ecee88 ffff88000e28be28
May  5 22:48:07 laguz kernel: ffff8801512c6540 ffff88000e28bdc8 ffffffff81964ca9 0000000000000006
May  5 22:48:07 laguz kernel: Call Trace:
May  5 22:48:07 laguz kernel: [<ffffffff81964ca9>] schedule+0x74/0x83
May  5 22:48:07 laguz kernel: [<ffffffff81964cc6>] schedule_preempt_disabled+0xe/0x10
May  5 22:48:07 laguz kernel: [<ffffffff81965b0f>] mutex_lock_nested+0x200/0x462
May  5 22:48:07 laguz kernel: [<ffffffff8175a651>] ? copy_net_ns+0x79/0x103
May  5 22:48:07 laguz kernel: [<ffffffff8175a61c>] ? copy_net_ns+0x44/0x103
May  5 22:48:07 laguz kernel: [<ffffffff811139d5>] ? lockdep_trace_alloc+0xca/0xdc
May  5 22:48:07 laguz kernel: [<ffffffff8175a651>] copy_net_ns+0x79/0x103
May  5 22:48:07 laguz kernel: [<ffffffff810f4b40>] create_new_namespaces+0x103/0x17f
May  5 22:48:07 laguz kernel: [<ffffffff810f4d8b>] unshare_nsproxy_namespaces+0x8b/0xb0
May  5 22:48:07 laguz kernel: [<ffffffff810da7f0>] SyS_unshare+0x120/0x2aa
May  5 22:48:07 laguz kernel: [<ffffffff8196a2c9>] system_call_fastpath+0x12/0x17
May  5 22:48:07 laguz kernel: 1 lock held by ip/572431:
May  5 22:48:07 laguz kernel: #0:  (net_mutex){+.+.+.}, at: [<ffffffff8175a651>] copy_net_ns+0x79/0x103

When the bug manifests, there is invariably also a kworker process stuck in D-state, but I don't see anything related to that in this trace.  I've restarted the machine so I can continue collecting data, but the condition will recur in a few days, at which point I will provide any additional information that you ask me for between now and then.

.config attached.  (This is Linode's latest stock 64-bit kernel + activation of lockdep.)
Comment 1 Ed Tomlinson 2015-05-24 16:58:53 UTC
I am also seeing this with 4.0.4
Comment 2 Zack Weinberg 2015-05-26 15:10:00 UTC
Here is the kernel stack trace for the stuck kworker:

# ps -e -o pid,stat,comm,wchan | grep D
   PID STAT COMMAND         WCHAN
257795 D    kworker/u12:0   msleep
290845 D+   ip              copy_net_ns

# cat /proc/257795/stack
[<ffffffff8112a9a7>] msleep+0x1e/0x22
[<ffffffff81763b13>] netdev_run_todo+0x17a/0x2a5
[<ffffffff81770284>] rtnl_unlock+0xe/0x10
[<ffffffff81764764>] default_device_exit_batch+0x12c/0x13b
[<ffffffff81759c25>] ops_exit_list+0x2a/0x55
[<ffffffff8175a532>] cleanup_net+0x14c/0x1f2
[<ffffffff810eecf1>] process_one_work+0x216/0x3fb
[<ffffffff810ef17d>] worker_thread+0x278/0x35c
[<ffffffff810f3ec5>] kthread+0x102/0x10a
[<ffffffff8196a218>] ret_from_fork+0x58/0x90

The top of the stack is is the msleep(250) in netdev_wait_allrefs, which has been inlined into netdev_run_todo. Presumably hung task detection does not trigger because it's never actually *blocked* for more than 250ms at a time. I'm not finding any obvious way to get the "N locks held by <process>" dump manually, but I am *guessing* that it doesn't hold any and this is just fallout -- whatever it's trying to do cannot proceed as long as copy_net_ns holds the net_mutex, and rather than actually *block* for that, it loops forever like this.

The critical piece of code in copy_net_ns reads:

        mutex_lock(&net_mutex);
        rv = setup_net(net, user_ns);
        if (rv == 0) {
                rtnl_lock();
                list_add_tail_rcu(&net->list, &net_namespace_list);
                rtnl_unlock();
        }
        mutex_unlock(&net_mutex);

The assembly around copy_net_ns+0x79 is

   0xffffffff8175a643 <+107>:   xor    %esi,%esi
   0xffffffff8175a645 <+109>:   mov    $0xffffffff81ecee80,%rdi
   0xffffffff8175a64c <+116>:   callq  0xffffffff8196590f <mutex_lock_nested>
   0xffffffff8175a651 <+121>:   mov    %rbx,%rsi
   0xffffffff8175a654 <+124>:   mov    %r12,%rdi
   0xffffffff8175a657 <+127>:   callq  0xffffffff8175a0a5 <setup_net>

so it's blocked on an attempt to acquire the net_mutex, which lockdep thinks it *already holds, having acquired it at that very point*.  This doesn't make any sense to me.  (I can imagine some sort of bug in the guts of the mutex implementation, that would cause it to block forever *after* having acquired the mutex it was asked to acquire, but that strikes me as extraordinarily unlikely.)
Comment 3 Zack Weinberg 2015-05-26 15:13:45 UTC
N.B. that it never gets to setup_net and that rtnl_lock has _not_ been inlined, so it can't be a case of its actually trying to acquire the rtnl_mutex (which the kworker is continually locking and unlocking, so we'd have to presume colossal unfairness in the mutex implementation, but that would make more sense to me than what appears to have happened).  I might try moving the mutex_unlock(&net_mutex) up to right after the setup_net call anyway, I don't *think* anything done in the if (rv == 0) branch actually needs to hold that lock.
Comment 4 Zack Weinberg 2015-05-26 18:41:45 UTC
... Reproduction appears not to be as simple as I thought.  A shell script that simply brings up a network namespace with the configuration I use, tears it down again, and then tries to create another one, does _not_ trigger the bug.  Given that this happens 100% in production, I can only conclude that either you have to push a lot of data within the namespace, or the bug depends on something OpenVPN does to its tunnel devices.  And I am wondering whether getting stuck on the loopback interface is a red herring.
Comment 5 Ed Tomlinson 2015-05-30 12:44:16 UTC
Zack,

If it helps the setup in 89370 also seems to trigger this bug.

Ed
Comment 6 Zack Weinberg 2015-05-30 14:12:10 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=89370 gives me only an error message, "Bug #89370 does not exist."  Did you mean some other bug number or some other tracker?
Comment 7 Ed Tomlinson 2015-05-31 23:45:43 UTC
Zack

Try: https://bugs.freedesktop.org/show_bug.cgi?id=89370

Ed
Comment 8 Hynek Mlnarik 2016-08-19 10:13:00 UTC
I was hitting this issue when developing tests [1] for OpenStack neutron. It has been reproduced rather often when using a 3- or 4- virtual cpu VM inside which the tests were run (and namespaces manipulated). The reproduction rate was rather low for 2- cpu VM.

Reproduced on two kernels: versions 4.6.4-301.fc24.x86_64, 4.6.6-300.fc24.x86_64.

[1] https://review.openstack.org/#/c/355893/
Comment 9 Vít Heřman 2016-12-15 14:42:31 UTC
I can reproduce this even on latest kernels (last I tried is 4.8.12). My use case is probably similar to Zack's - I'm using Docker (and therefore network namespaces) for regular testing of a rather large amount of OpenVPN and IPsec gateways, spinning up a container with VPN client for each test and tearing it down afterwards. I can't tell exactly how many such tests are needed to trigger it, but it only takes a couple of minutes tops with about 30 tests per minute being run.
Comment 10 Cong Wang 2017-04-04 21:26:32 UTC
The D-state issue is resolved by this commit:

commit 7281a66590cbc284272e18cb4acbfb88ddaa4cd8
Author: Andrey Vagin <avagin@openvz.org>
Date:   Thu Oct 20 19:45:43 2016 -0700

    net: allow to kill a task which waits net_mutex in copy_new_ns

    net_mutex can be locked for a long time. It may be because many
    namespaces are being destroyed or many processes decide to create
    a network namespace.

    Both these operations are heavy, so it is better to have an ability to
    kill a process which is waiting net_mutex.


So after that commit `ip netns create` will be no longer in D-state and can be killed.

The "unregister_netdevice: waiting for lo to become free" is another issue which has been reported to netdev at least twice, I have no clue about the cause yet. The last investigation shows it is related to a dst refcnt leak.
Comment 11 Jean Delvare 2018-12-04 15:32:49 UTC
Is this not supposed to have been fixed with:

commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Thu Jan 18 16:14:26 2018 -0500

    net: tcp: close sock if net namespace is exiting

?