Bug 198189 - netdev_wait_allrefs endless loop caused by ipv6 driver
Summary: netdev_wait_allrefs endless loop caused by ipv6 driver
Status: RESOLVED INVALID
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV6
Hardware: All Linux
Importance: P1 high
Assignee: Hideaki YOSHIFUJI
Reported: 2017-12-18 14:24 UTC by Mathias Tillman
Modified: 2017-12-27 20:13 UTC

See Also:
Kernel Version: 4.4.103
Subsystem:
Regression: No
Bisected commit-id:


Description Mathias Tillman 2017-12-18 14:24:03 UTC
I've been trying to debug a recent problem on my Turris Omnia router running v3.9 with kernel 4.4.105 (which I downgraded to 4.4.103). It was fine on the previous version, 3.8.6, which ran kernel 4.4.96. When I tried to connect to the router over FTP (the router runs vsftpd), the process would hang indefinitely, and killing it was impossible.

My hunt to find the problem led me to this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.4.y&id=2417da3f4d6bc4fc6c77f613f0e2264090892aa5

When reverting that, it started working normally again.

It seems to be stuck in an endless loop in netdev_wait_allrefs, which is confirmed by a bunch of "netdev_wait_allrefs msleep" messages in the kernel log.

I did some further debugging. It enters the unregister_netdevice_many function, which calls rollback_registered_many, and that sends the NETDEV_UNREGISTER event to all of the registered net devices. But for some reason, when it later enters netdev_run_todo and finally netdev_wait_allrefs, the percpu refs are 0 for all interfaces but one: 'lo', the loopback interface.
This is a problem because when ipv6 (ip6_route_dev_notify in net/ipv6/route.c) receives the NETDEV_UNREGISTER event from netdev_wait_allrefs, it won't actually release its reference on the device: by then reg_state has already been set to NETREG_UNREGISTERED in netdev_run_todo, and the above commit added a check for that state.

I'm not sure what the solution here would be - I'm assuming the above commit fixes something critical? If not, can it be reversed?
The other option would be to find out why not all percpu refs are cleared on the lo interface.
Comment 1 Konstantin Khlebnikov 2017-12-23 13:36:38 UTC
The first time, NETDEV_UNREGISTER is sent from rollback_registered_many(), where reg_state is NETREG_UNREGISTERING.

After that, NETDEV_UNREGISTER might be fired multiple times, and without that commit ip6_route_dev_notify puts a device reference on "lo" each time. This hides any reference leaks in other places.
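
The effect described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: struct model_dev, notify_unregister, wait_allrefs and the once_only flag are hypothetical names made up for the illustration, and the bounded loop stands in for the real wait, which has no such bound. It assumes the notifier drops one reference per event unless it is guarded on the UNREGISTERED state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for NETREG_UNREGISTERING / NETREG_UNREGISTERED. */
enum model_state { REG_UNREGISTERING, REG_UNREGISTERED };

struct model_dev {
    enum model_state state;
    int refcnt;               /* outstanding references on the device */
};

/*
 * Model of a notifier reacting to an "unregister" event.
 * once_only == true models a handler that ignores the event once the
 * device is already in the UNREGISTERED state.
 */
static void notify_unregister(struct model_dev *dev, bool once_only)
{
    if (once_only && dev->state == REG_UNREGISTERED)
        return;               /* rebroadcast event: do nothing */
    if (dev->refcnt > 0)
        dev->refcnt--;        /* drop one reference */
}

/*
 * Model of the wait loop: the event is rebroadcast until the refcount
 * reaches zero. Returns the number of rebroadcasts needed, or -1 if the
 * refcount is still non-zero after max_rounds (the "endless loop" case).
 */
static int wait_allrefs(struct model_dev *dev, bool once_only, int max_rounds)
{
    assert(max_rounds >= 0);

    dev->state = REG_UNREGISTERING;
    notify_unregister(dev, once_only);   /* first event */
    dev->state = REG_UNREGISTERED;       /* state changes before waiting */

    for (int round = 0; round < max_rounds; round++) {
        if (dev->refcnt == 0)
            return round;
        notify_unregister(dev, once_only);   /* rebroadcast */
    }
    return dev->refcnt == 0 ? max_rounds : -1;
}
```

With refcnt starting at 2 (one reference the notifier legitimately holds plus one leaked elsewhere), wait_allrefs(&dev, false, 10) returns 1 because the second event silently compensates for the leak, while wait_allrefs(&dev, true, 10) returns -1 because the leaked reference is never dropped.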
Comment 2 Mathias Tillman 2017-12-27 20:13:17 UTC
Closing this as this is due to a patch in OpenWRT - not the kernel.
