Bug 15606

Summary: register_netdevice: Kobject changed too ealier
Product: Networking Reporter: linuxzsc
Component: OtherAssignee: Arnaldo Carvalho de Melo (acme)
Status: RESOLVED CODE_FIX    
Severity: normal CC: alan, johannes, kvalo, linville, martin.xu, mcgrof
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.31 Subsystem:
Regression: No Bisected commit-id:
Attachments: ath6kl: Use register_netdevice() and dev_alloc_name()
udev-ifname-race.c

Description linuxzsc 2010-03-22 09:31:54 UTC
Connman is using udev message to know a new net card created. But when the message comes, the call of ioctl(sk, SIOCGIFNAME, ifr) fails.

If I use the bellow patch, it will be ok :

diff --git a/net/core/dev.c b/net/core/dev.c
index 6a94475..385de09 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4783,11 +4783,6 @@ int register_netdevice(struct net_device *dev)
       if (dev->features & NETIF_F_SG)
               dev->features |= NETIF_F_GSO;

-       netdev_initialize_kobject(dev);
-       ret = netdev_register_kobject(dev);
-       if (ret)
-               goto err_uninit;
-       dev->reg_state = NETREG_REGISTERED;

       /*
        *      Default initial state at registry is that the
@@ -4800,6 +4795,13 @@ int register_netdevice(struct net_device *dev)
       dev_hold(dev);
       list_netdevice(dev);

+       netdev_initialize_kobject(dev);
+       ret = netdev_register_kobject(dev);
+       if (ret)
+               goto err_uninit;
+       dev->reg_state = NETREG_REGISTERED;
+
+
       /* Notify protocols, that a new device appeared. */
       ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);
       ret = notifier_to_errno(ret);
Comment 1 Luis Chamberlain 2011-03-25 03:37:29 UTC
I believe the driver you used was a cfg80211 driver, I am starting to think connman picked up a udev event for the kobject created for the device that gets used internally by cfg80211 through wiphy_register().

So for example, at least mac80211 later adds an actual networking device and registers it via register_netdevice(), and this will also create a udev event, this is the udev event that connman should look out for. Are we certain that the udev event that connman uses to process a networking device is the one from register_netdevice() and not from wiphy_register()?

For ath6kl we have gotten issues reported with similar issues and at first I thought the race was we were calling register_netdev() after wiphy_register() but this turns out to be OK, even though inverting this can actually help, but its the wrong fix. The next step I took was to drop using register_netdev() and instead contend on the rtnl_lock() and use dev_alloc_name() myself and later use register_netdevice(). This should expose the proper desired name for the networking device for ath6kl.

If this doesn't fix it I start wondering more about the udev events that connman is starting to process and if its really for a networking device.
Comment 2 Luis Chamberlain 2011-03-25 03:38:34 UTC
Created attachment 51952 [details]
ath6kl: Use register_netdevice() and dev_alloc_name()

This is the stuff I was talking about for ath6kl. This uses register_netdevice() and dev_alloc_name(), can you try this instead?
Comment 3 Luis Chamberlain 2011-03-25 03:45:19 UTC
Oh that patch will not apply, I have a boat load of other updates on top of staging-2.6.git on the staging-next branch for which that will apply.. hmm, so I can get you all my other patches.. but that's a lot.. I think you can read the patch though and get the idea of what was intended though.
Comment 4 Luis Chamberlain 2011-04-28 20:31:24 UTC
anyone?
Comment 5 Kalle Valo 2011-04-28 21:54:46 UTC
I can reproduce this easily with flimflam on x86 box with one cpu and ath6kl. I just added this extra delay:

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5428,6 +5428,9 @@ int register_netdevice(struct net_device *dev)
 	ret = netdev_register_kobject(dev);
 	if (ret)
 		goto err_uninit;
+
+	msleep(10000);
+
 	dev->reg_state = NETREG_REGISTERED;
 
 	netdev_update_features(dev);
Comment 6 Kalle Valo 2011-04-28 21:58:23 UTC
And here's a part of flimflam log when the race happens. Notice how SIOCGIFNAME fails.

Apr 29 00:21:35 roska flimflamd[2598]: src/rfkill.c:rfkill_event() idx 0 type 1 op 2 soft 0 hard 0
Apr 29 00:21:35 roska kernel: [  125.530250] ar6000_init() Got WMI @ 0xf4bb3d40.
Apr 29 00:21:35 roska kernel: [  125.541146]  Target Ready: credits: 11 credit size: 1664
Apr 29 00:21:35 roska kernel: [  125.541151] HIF-SCATTER Enabled: max scatter req : 4 entries: 16 
Apr 29 00:21:35 roska kernel: [  125.541174] AR6K: HIF layer supports scatter requests (max scatter items:16: maxlen:32768) 
Apr 29 00:21:35 roska kernel: [  125.541177] AR6K: max recv: 32768 max send: 12288 
Apr 29 00:21:35 roska kernel: [  125.549883] ar6000_init() WMI is ready
Apr 29 00:21:35 roska kernel: [  125.550168] wmi_control_rx() : Unknown id 0x101e
Apr 29 00:21:35 roska kernel: [  125.550673] AR6000 Reg Code = 0x40000060
Apr 29 00:21:35 roska flimflamd[2598]: === add ===
Apr 29 00:21:35 roska flimflamd[2598]: DEVPATH = /devices/pci0000:00/0000:00:1e.0/0000:03:02.1/mmc_host/mmc0/mmc0:0001/mmc0:0001:1/net/wlan0
Apr 29 00:21:35 roska flimflamd[2598]: INTERFACE = wlan0
Apr 29 00:21:35 roska flimflamd[2598]: IFINDEX = 4
Apr 29 00:21:35 roska flimflamd[2598]:     DEVPATH = /devices/pci0000:00/0000:00:1e.0/0000:03:02.1/mmc_host/mmc0/mmc0:0001/mmc0:0001:1
Apr 29 00:21:35 roska flimflamd[2598]: src/udev.c:add_net_device() 
Apr 29 00:21:35 roska flimflamd[2598]: connman_inet_ifname: SIOCGIFNAME(index 4): No such device
Apr 29 00:21:45 roska flimflamd[2598]: src/rtnl.c:rtnl_message() buf 0xbfefda3c len 1004
Apr 29 00:21:45 roska flimflamd[2598]: src/rtnl.c:rtnl_message() NEWLINK len 1004 type 16 flags 0x0000 seq 0
Apr 29 00:21:45 roska flimflamd[2598]: src/ipconfig.c:__connman_ipconfig_newlink() index 4 type 1 flags 0x1002 ifname wlan0
Apr 29 00:21:45 roska flimflamd[2598]: wlan0 {create} index 4 type 1 <ETHER>
Comment 7 Kalle Valo 2011-04-28 22:03:20 UTC
Jouni Malinen suggested another way to fix the issue: for some reason SIOCGIFNAME ioctl call doesn't take rtnl. Adding this fixes the race as well. 

Now the question is which way is the best way to do it?

Here's a patch:

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4917,8 +4917,12 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 		rtnl_unlock();
 		return ret;
 	}
-	if (cmd == SIOCGIFNAME)
-		return dev_ifname(net, (struct ifreq __user *)arg);
+	if (cmd == SIOCGIFNAME) {
+		rtnl_lock();
+		ret = dev_ifname(net, (struct ifreq __user *)arg);
+		rtnl_unlock();
+		return ret;
+	}
 
 	if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
 		return -EFAULT;
Comment 8 Kalle Valo 2011-05-03 21:59:42 UTC
Created attachment 56402 [details]
udev-ifname-race.c

I can now reproduce with a stock wireless-testing kernel 2.6.39-rc5-wl (rc5 plus wireless patches) on my x86 laptop, without flimflam or any other special software. Here's how to do it:

1. Run four burnP6 instances (I have four virtual cpus)
2. ./udev-ifname-race (attached)
3. while true; do modprobe e1000e; sleep 0.5; modprobe -r e1000e; echo -n .; done

Within few minutes udev-ifname-race will fail on my laptop (lenovo x201, i5, 64 bit).
Comment 10 Kalle Valo 2011-05-16 15:31:35 UTC
patchset v3:

http://marc.info/?l=linux-netdev&m=130555720227212&w=2