Bug 47181 (CLONE_NEWNET)

Summary:	clone() hang when creating new network namespace
Product:	Networking	Reporter:	Bryan Wu (cooloney)
Component:	Other	Assignee:	Stephen Hemminger (stephen)
Status:	RESOLVED OBSOLETE
Severity:	high	CC:	alan, ebiederm, eric.dumazet, stefan.bader
Priority:	P1
Hardware:	All
OS:	Linux
URL:	https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1021471
Kernel Version:	3.5.2	Subsystem:
Regression:	No	Bisected commit-id:
Attachments:	Naive hack to avoid the situation When unregistering a net device, clean the route cache of it Fix fib_frontent to do the cache flush

Description Bryan Wu 2012-09-07 11:56:17 UTC

We can easily reproduce this issue with a test program [1]. And run as following:

- sudo ./reproducer
- ctrl+c
- sudo ./reproducer
wait for a while
- dmesg | grep unregister

dmesg will show "unregister_netdevice: waiting for lo to become free. Usage count = 2".

And for LXC testcase, we will see lxc-start is blocked too long so that dmesg shows oops about this.

We think Debian has the same issue [2].

[1]: http://paste.ubuntu.com/1182799/
[2]: http://lists.debian.org/debian-kernel/2012/05/msg00494.html

Comment 1 Eric W. Biederman 2012-09-08 01:31:51 UTC

Which kernel does this occur with and with what configuration.

v3.6-rc1 does not fail for me.

"unregistering_netdevice: waiting for lo to become free" is caused by a reference counting leak in the network stack.

Reference counting leaks in the network stack against netdevices are rare but
the do happen and an exiting network namespace happens to see them all.

I expect what was seen in 3.2 is different from what you see in 3.5.

And at least without more evidence it appears the specific problem reported
does not exist in 3.6-rc1.

Comment 2 Bryan Wu 2012-09-10 23:06:08 UTC

Thanks Eric, I can confirm 3.6-rc1 works fine but 3.5.3 and olders are not.

So this issue should be fixed during 3.6-rc1, do you have any hints about those reference counting leakage fixing patches? I'm going to backport to our Ubunt 3.5 kernel or probably we should backport those to -stable tree.

Comment 3 Eric W. Biederman 2012-09-11 01:42:19 UTC

As I recall the routing cache was removed between 3.5 and 3.6-rc1 so there are some significant changes to the fundamentals.

What to look for failure of dev_hold and dev_put to pair.

The kernel configuration may play a role.  I remember times when there was a small bug in ipv6 multicast routing with respect to this.  So a more minimal configuration may not reproduce the problem.

I would also assume that the different reproducers exercise different code paths 
so you are probably dealing with more than one bug, between the ubuntu and the debian bug trackers.

I hope those hints help.

Comment 4 Stefan Bader 2012-09-26 16:03:34 UTC

I added debugging to dev_hold and dev_put as Eric suggested and used the reproducer attached to this bug. What I saw was creation and destruction would be balanced. However on the connect call, there were another two dev_hold() calls that seem to be exactly those references not been returned.

system_call_fastpath+0x1a/0x1f
  sys_connect+0xeb/0x110
    inet_stream_connect+0x11c/0x310
      tcp_v4_connect+0x13c/0x510
        ip_route_connect+???/???
          __ip_route_output_key+0x39a/0xb10
            ip_route_output_slow
              __mkroute_output
                rt_dst_alloc+0x3e/0x40
                  dst_alloc+0xc5/0x1c0
                    +1 = 8
                  rt_set_nexthop.isra.45+0x131/0x2d0 ?
                    rt_intern_hash+0x133/0x670
                      rt_bind_neighbour+0x1d/0x40
                        ipv4_neigh_lookup+0xe7/0x120
                          neigh_create+0x1bd/0x5d0
                            +9

Unfortunately the stack traces miss the details about going into ip_route_connect, but with more printks I know that ip_route_output_flow() is the one failing with -EINVAL.
Comparing functions between 3.5 and current linux-HEAD I was not very successful in spotting the important difference. Anybody else having some insight or ideas?

Comment 5 Eric Dumazet 2012-09-26 16:20:13 UTC

I would try LOCKDEP enabled kernel to eventually catch a locking problem.

Comment 6 Eric Dumazet 2012-09-26 16:22:58 UTC

Also I would try to backport 

commit caacf05e5ad1abf0a2864863da4e33024bc68ec6

ipv4: Properly purge netdev references on uncached routes.

Comment 7 Stefan Bader 2012-09-27 10:15:30 UTC

I tried with the LOCKDEP enabled (and a bit of more tracking of function return codes I was interested). The lock dependency yielded nothing. And I guess I am not really surprised there as we sort of knew before it is a refcount problem. And there is not really any instrumentation that would allow to catch those.

The commit above sounds like a good general idea. A backport will not be that easy, though. There were a lot of changes before and especially rt_set_nexthop has no code yet to hook the check for an uncached route into. It may need to go in earlier as in my trace there was one refcount increment even before nexthop was called.

Just for the sake of additional information/documentation:

ip_route_connect(...)
{
        ...
        if (!dst || !src) {
                rt = __ip_route_output_key(net, fl4);

Either dst or src isn't set and the first call to __ip_route_output_key does succeed (I assume this is the one that creates the references).

                if (IS_ERR(rt))
                        return rt;
                ip_rt_put(rt);
                flowi4_update_output(fl4, oif, tos, fl4->daddr, fl4->saddr);
        }
        security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
        return ip_route_output_flow(net, fl4, sk);
}

The call to ip_route_output_flow fails when calling __ip_route_output_key (which returns -EINVAL) again. Without deeper knowledge of the network code, I naively would wonder whether the created rt could be freed here or whether it would be possible to detect the reason the second call to route output fails and not create anything first place.

Comment 8 Stefan Bader 2012-09-28 14:32:03 UTC

So I found that the direct call to __ip_route_output_key (while saddr and daddr are 0) will cause an entry in the rt_hash_table. And ip_rt_put would only clean up when NO_CACHE is set.

Later in ip_route_output_flow, the call to __ip_route_output_key will have saddr and daddr set to 127.0.0.1 therefor fail to find the hashed key and when doing the ip_route_output_key_slow it fails the __ip_dev_find when having saddr set but not the FLOWI_FLAG_ANYSRC.

This sounds to me like this is not a problem of having something *not* in the cache. Should the first __ip_route_output_key call really cause something to go into the cache?

Comment 9 Stefan Bader 2012-09-28 14:37:00 UTC

[  101.810602] lo(ffff88003c6bd000)[1]+= 7
[  101.810670] src or dst missing
[  101.810674] rt-key: hash=5e8a daddr=0 saddr=0 oif=0
[  101.810676] rt-key: mark=0 net=ffff88003cfa0000 key_tos=0
[  101.810677] calling ip_route_output_slow
> __mkroute_output
[  101.810679] ip_route_output_slow 0->0
[  101.810682] lo(ffff88003c6bd000)[1]+= 8
> rt_intern_hash
[  101.810686] bind cached output route
[  101.810691] lo(ffff88003c6bd000)[1]+= 9
[  101.810695] rt-key: hash=5e8a daddr=0 saddr=0 oif=0
[  101.810697] rt-key: mark=0 net=ffff88003cfa0000 key_tos=0
[  101.810699] rt-key: expired=0 is-output-route=1
[  101.810701] intern_hash returns orignal pointer

[  101.810704] rt-key: hash=48a8 daddr=100007f saddr=100007f oif=0
[  101.810706] rt-key: mark=0 net=ffff88003cfa0000 key_tos=0
[  101.810707] calling ip_route_output_slow
[  101.810709] ip_route_output_slow 100007f->100007f
[  101.810712] INVAL3
[  101.810715] __ip_route_output_key -> -22
[  101.810717] ip_route_connect -> -22

Comment 10 Stefan Bader 2012-09-28 15:25:08 UTC

This seems to be another subtle difference between the 3.5 code and now. In the case of no daddr and no saddr, both get set to loopback. But mkroute in 3.5 is called with the saved (0) values while now it actually uses the modified values (directly from the flow).

Comment 11 Stefan Bader 2012-10-01 12:40:34 UTC

Created attachment 81761 [details]
Naive hack to avoid the situation

I would hope there is a better solution to the problem than this. But I really would appreciate any comments from anybody with a better insight into how things are/were supposed to work.

Comment 12 Stefan Bader 2012-10-02 07:33:17 UTC

The naive approach is not useful for the real issue. In that case there seems to be a failing connect with some addresses. Unfortunately I have not received full logs, yet.
But to me it seems more and more like there needs to be a cleanup step of the route cache when ripping down a net namespace.

Comment 13 Stefan Bader 2012-10-02 11:39:31 UTC

Indeed, while I don't know how to make that faster, things get cleaned up after a long while...

[   44.099279] lo(ffff88003cc3c000)[0]+= 1
[   44.099337] lo(ffff88003cc3c000)[0]+= 2
[   44.099344] lo(ffff88003cc3c000)[0]+= 3
[   44.099358] lo(ffff88003cc3c000)[0]+= 4
[   44.099364] lo(ffff88003cc3c000)[0]+= 5
[   44.099416] lo(ffff88003cc3c000)[0]+= 6
[   44.099422] lo(ffff88003cc3c000)[0]+= 7
[   44.099580] lo(ffff88003cc3c000)[0]+= 8
[   44.099596] lo(ffff88003cc3c000)[0]+= 9
[   46.728441] lo(ffff88003cc3c000)[1] -= -1
[   46.728556] lo(ffff88003cc3c000)[1] -= -2
[   46.728565] lo(ffff88003cc3c000)[1] -= -3
[   46.729266] lo(ffff88003cc3c000)[1] -= -4
[   46.729313] lo(ffff88003cc3c000)[1] -= -5
[   46.729975] lo(ffff88003cc3c000)[1] -= -6
[   46.732279] lo(ffff88003cc3c000)[1] -= -7
[  338.896671] lo(ffff88003cc3c000)[0] -= 8
[  338.896677] lo(ffff88003cc3c000)[0] -= 7

Comment 14 Stefan Bader 2012-10-02 17:44:47 UTC

Created attachment 81851 [details]
When unregistering a net device, clean the route cache of it

As the route cache apparently gets cleared (though it can feel like an eternity) after a while, I wonder whether this would not make sense to be done at the same time the net device is brought down.

Eric, would that look sensible? It seems to work well enough with the test case and feel like little risk. Though maybe there is a better place...

Comment 15 Eric W. Biederman 2012-10-02 17:58:03 UTC

As I recall the mechanisms at play.

- For non-loopback devices we tend to shift there references to the loopback device.

- There is (or at least there was) a flush of the routing cache see UNREGISTER_NETDEV_BATCH.  I believe flush lived in net/ipv4/fib_frontend.c

Comment 16 Eric Dumazet 2012-10-02 18:02:51 UTC

Yes, take a look at commits :

0115e8e30d6fcdd4b8faa30d3ffd90859a591f51
(net: remove delay at device dismantle)

748e2d9396a18c3fd3d07d47c1b41320acf1fbf4
(net: reinstate rtnl in call_netdevice_notifiers())

6549dd43c04327071571edf7aea4465b16539422
(net: dev: fix the incorrect hold of net namespace's lo device)

Comment 17 Stefan Bader 2012-10-02 19:25:35 UTC

Thanks, I'll try with backports of those patches. Just one thing I noted when working through them. There is one place that wasn't set back by the second patch while other similar places were (below is that place when I squash all 3):

--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -991,7 +991,7 @@ static int fib_inetaddr_event(struct notifier_block *this, u
 static int fib_netdev_event(struct notifier_block *this, unsigned long event, v
 {
        struct net_device *dev = ptr;
-       struct in_device *in_dev = __in_dev_get_rtnl(dev);
+       struct in_device *in_dev;
        struct net *net = dev_net(dev);
 
        if (event == NETDEV_UNREGISTER) {
@@ -999,8 +999,7 @@ static int fib_netdev_event(struct notifier_block *this, uns
                return NOTIFY_DONE;
        }
 
-       if (!in_dev)
-               return NOTIFY_DONE;
+       in_dev = __in_dev_get_rtnl(dev);
 
        switch (event) {
        case NETDEV_UP:

Comment 18 Stefan Bader 2012-10-02 20:06:34 UTC

Just to add, with a quick and brainless (it is rather late here) backport of the three patches above, the testcase runs into the same 2 very slowly going away references to lo. Of course this is 3.5 and the route cache has not really gone away there.

I found the call to call_netdevice_notifiers(NETDEV_UNREGISTER_BATCH, dev) later in rollback_registered_many() which in theory should call rt_cache_flush_batch(NULL) which then should drop all the cache. Just that it does not seem to do so...

Comment 19 Stefan Bader 2012-10-04 11:00:06 UTC

I had a closer look at the NETDEV_UNREGISTER_BATCH notifier call. This is done (actually with the lo device but that is not important) but then it hits the following:

static int fib_netdev_event(...)
{
  struct in_device *in_dev = __in_dev_get_rtnl(dev);
  ...
  if (!in_dev)
    return NOTIFY_DONE;
  ...
}

So it never reaches the rt_cache_flush_batch(). And any calls to disable_ip use a delay of -1 which only invalidates the cache but does not flush it. Assuming all the cases we saw where because of this, it looks like that in_dev should at least not be used as an early exit for the batch call.

Comment 20 Eric Dumazet 2012-10-04 11:16:44 UTC

Nice catch. Does moving the test to the case NETDEV_UP: helps ?

Comment 21 Stefan Bader 2012-10-04 12:07:25 UTC

Created attachment 82101 [details]
Fix fib_frontent to do the cache flush

Based on the previous observations, and trying to have a rather minimalistic change for stable, the following change seems to solve the problem as well. Since the whole route cache in gone in 3.6 and the issues does not seem to be the same there (even without the three patches as they are in linux-next still, testing did not seem to see issues there), this patch would never go upstream.

Comment 22 Stefan Bader 2012-10-04 12:23:14 UTC

It does help for the testcase. I am pushing out a test kernel to other people which had been having issues with real usecases (LXC using net namespace).

Comment 23 Stefan Bader 2012-10-04 12:37:51 UTC

Oh, I just realize that actually this is the slight discrepancy I noticed in comment #17 (with the upstream changes). Though that one now never tests for in_dev == NULL, even in the NETDEV_UP case.

Either moving the batch even up or the test for in_dev into the up case will fix the problem.

Comment 24 Eric Dumazet 2012-10-04 12:59:30 UTC

Note that in_dev should not be NULL in the NETDEV_UP case,
because NETDEV_REGISTER allocates it in inetdev_event()

In case of allocation error, device cannot be registered.

Comment 25 Stefan Bader 2012-10-05 08:48:58 UTC

Though the NETDEV_UP case is the only place that directly uses in_dev. I would be assuming that checking it and exiting early is/was thought as an optimization to the other cases. That is also the reason I moved the NETDEV_UNREGISTER_BATCH code up in front of it instead of the test itself into the NETDEV_UP case. That way only the behaviour of this case is modified.

I think this on its own is one valid fix. However the reporters still seem to have an issue. But in that case the refcount is 1 and it does not seem to be resolved over time. Unfortunately it is also not that simple to reproduce (I have not yet seen it locally).

Comment 26 Stefan Bader 2012-10-08 10:11:22 UTC

Most recent comments on the Ubuntu bug say that the second issue only appears to happen when a proprietary wl (broadcom wireless) driver is involved. And it seems to happen even on 3.6. So this most likely should be handled separately.

Now for fixing current stable, the code in fib_frontent.c is now (after 3.6) fixed by:
- "net: remove delay at device dismantle" and
- "net: reinstate rtnl in call_netdevice_notifiers()"

And those depend in some way on dropping the route cache. So at least for before 3.6 the complete patches do not look very compelling to take into stable.
In this case I guess someone (one of the Erics?) should queue up a small fix with David for stable submission. Either the way I did it or by just removing the check on in_dev completely. If it is always set in the UP case, it sounds like there should be no harm in that. I just would be too cautious to make that
call.

Comment 27 Stefan Bader 2012-10-08 10:12:50 UTC

Just for reference (I don't think anybody has done, yet) the link to the Ubuntu bug report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1021471

Comment 28 Stefan Bader 2012-10-15 07:50:22 UTC

Ping, any recommendations how to proceed with this?

Comment 29 Eric W. Biederman 2012-10-16 07:48:21 UTC

Does the problem happen with any non-proprietary wireless drivers?  aka is it something in the wireless stack?

My personal concern goes as far as the current linux kernel, and open source code.

Comment 30 Stefan Bader 2012-10-16 08:41:54 UTC

Sorry, I guess the comments caused a bit of confusion. There seem to be two problems:

1. The one I would address with this bug report, which is that the route cache is not cleaned because the NETDEVICE_UNREGISTER_BATCH message is sent when in_dev is already gone. This is completely unrelated to any hw. The only problem there is, that it is fixed in latest code by two patches which would not qualify for stable. I would propose the variant of my latest patch attached here for stable, but you could prefer to fix it by removing the check for in_dev completely (just like it is upstream now).

2. Apparently there is another refcount issue (but I want to handle that in another bug report) which happens after a bit of usage. It turns out that the timing might be hw related (iow it seems to take different amounts of traffic to reproduce the problem on different hw). But until now I know of three sets of nics that were used in reproduction (and two of them have upstream drivers). But again, this should be separate.

So my question here is just how we should get the NETDEVICE_UNREGISTER_BATCH change into stable (and how far back if there is a limitation).