Bug 198189

Summary: netdev_wait_allrefs endless loop caused by ipv6 driver
Product: Networking
Component: IPV6
Status: RESOLVED INVALID
Severity: high
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.4.103
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Mathias Tillman (master.homer)
Assignee: Hideaki YOSHIFUJI (yoshfuji)
CC: kernel, koct9i, michal

Description Mathias Tillman 2017-12-18 14:24:03 UTC
I've been trying to debug a problem that recently appeared on my Turris Omnia router running v3.9 with kernel 4.4.105 (which I downgraded to 4.4.103). The previous version, 3.8.6 with kernel 4.4.96, was fine. When I connect to the router over FTP (the router is running vsftpd), the process hangs indefinitely and cannot be killed.

My hunt to find the problem led me to this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.4.y&id=2417da3f4d6bc4fc6c77f613f0e2264090892aa5

When reverting that, it started working normally again.

It seems to be stuck in an endless loop in netdev_wait_allrefs, which is confirmed by a bunch of "netdev_wait_allrefs msleep" messages in the kernel log.

I did some further debugging, and I know it enters the unregister_netdevice_many function, which then calls rollback_registered_many, and that sends the NETDEV_UNREGISTER event to all of the registered net devices. But for some reason, when it later enters the netdev_run_todo function and finally netdev_wait_allrefs, the percpu ref count is 0 for every interface except one: 'lo', the loopback interface.
This is a problem because when ipv6 (ip6_route_dev_notify in net/ipv6/route.c) receives the NETDEV_UNREGISTER event from netdev_wait_allrefs, it won't actually unregister the device, because reg_state will have been set to NETREG_UNREGISTERED in netdev_run_todo, which the above commit added.
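
The stuck state described above can be sketched as a small user-space toy model. This is not kernel code: struct toy_dev, toy_ip6_notify_unregister and toy_wait_allrefs are hypothetical names, and the guard is assumed to be a plain check against NETREG_UNREGISTERED. It only illustrates why a leftover reference on 'lo' is never dropped once the handler ignores the repeated events.

```c
#include <stdbool.h>

/* Toy stand-ins for two of the kernel's registration states. */
enum toy_reg_state { NETREG_UNREGISTERING, NETREG_UNREGISTERED };

struct toy_dev {
    enum toy_reg_state reg_state;
    int refcnt;          /* stands in for the summed percpu refs */
    bool ipv6_holds_ref; /* hypothetical: IPv6 still holds its reference */
};

/* Assumed shape of the guarded handler: the reference is dropped only
 * while the device is not yet marked NETREG_UNREGISTERED. */
static void toy_ip6_notify_unregister(struct toy_dev *dev)
{
    if (dev->reg_state == NETREG_UNREGISTERED)
        return; /* repeated event from the wait loop: ignored */
    if (dev->ipv6_holds_ref) {
        dev->ipv6_holds_ref = false;
        dev->refcnt--;
    }
}

/* Toy wait loop: resend the event while references remain. Returns the
 * number of rounds needed, or -1 if the count never reached zero. */
static int toy_wait_allrefs(struct toy_dev *dev, int max_rounds)
{
    for (int round = 0; round < max_rounds; round++) {
        if (dev->refcnt == 0)
            return round;
        toy_ip6_notify_unregister(dev);
    }
    return dev->refcnt == 0 ? max_rounds : -1;
}
```

With one reference held by IPv6 and one leaked elsewhere, the first event (sent while the state is NETREG_UNREGISTERING) brings the count from 2 to 1. Every later event is ignored, so the toy wait loop gives up with -1, where the real loop would keep sleeping.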

I'm not sure what the solution here would be - I'm assuming the above commit fixes something critical? If not, can it be reverted?
The other option would be to find out why not all percpu refs are cleared on the lo interface.
Comment 1 Konstantin Khlebnikov 2017-12-23 13:36:38 UTC
The first time, NETDEV_UNREGISTER is called from rollback_registered_many(), where reg_state is NETREG_UNREGISTERING.

After that, NETDEV_UNREGISTER might be fired multiple times, and without that commit ip6_route_dev_notify would put a device reference to "lo" each time. This hides any reference leaks in other places.
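
A rough user-space illustration of that masking effect, as a sketch only: struct toy_lo, toy_unguarded_notify and toy_rebroadcasts_until_zero are hypothetical names, and the handler is assumed to drop exactly one reference per event.

```c
/* Toy counter standing in for the reference count on "lo". */
struct toy_lo {
    int refcnt;
};

/* Assumed pre-commit behaviour: one reference is dropped on every
 * NETDEV_UNREGISTER, whether or not the handler still holds one. */
static void toy_unguarded_notify(struct toy_lo *lo)
{
    lo->refcnt--;
}

/* Resend the event until the count reaches zero (or max_rounds is hit)
 * and return how many events were needed. */
static int toy_rebroadcasts_until_zero(struct toy_lo *lo, int max_rounds)
{
    int rounds = 0;

    while (lo->refcnt > 0 && rounds < max_rounds) {
        toy_unguarded_notify(lo);
        rounds++;
    }
    return rounds;
}
```

If three references were leaked elsewhere, three repeated events bring the toy count to zero and the wait ends, so the leak goes unnoticed.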
Comment 2 Mathias Tillman 2017-12-27 20:13:17 UTC
Closing this as this is due to a patch in OpenWRT - not the kernel.