Bug 16187

Summary: Carrier detection failed in dhcpcd when link is up
Product: Networking Reporter: Christian Casteyde (casteyde.christian)
Component: OtherAssignee: Arnaldo Carvalho de Melo (acme)
Status: CLOSED INVALID    
Severity: normal CC: andreas.sturmlechner, bruce.w.allan, davem, maciej.rutecki, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.35-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 16055    
Attachments: lspci output for 2.6.34 on my computer
git bisect log

Description Christian Casteyde 2010-06-12 15:15:27 UTC
Created attachment 26742 [details]
lspci output for 2.6.34 on my computer

Kernel at least 2.6.35-rc2, 2.6.34 works fine
Athlon X2 3000 in 64bits mode
Slackware 13.1
"Ethernet controller: ALi Corporation ULi 1689,1573 integrated ethernet. (rev 40)" (from lspci)

Since 2.6.35-rc2 (didn't checked -rc1), dhcpcd hangs fails to detect carrier appearance at boot.

The Slackware network script uses dhcpcd to bring DHCP interfaces up. At boot, it seems my network device doesn't have the link up immediatly. dhcpcd tries to bring it up, and wait for the carrier. The carrier indeed goes up, but dhcpcd gets absolutly no notification of that, and therefore times out.

logs says that:
Jun 12 16:53:58 sirius logger: /etc/rc.d/rc.inet1:  /sbin/route add -net 127.0.0.0 netmask 255.0.0.0 lo
Jun 12 16:53:58 sirius logger: /etc/rc.d/rc.inet1:  /sbin/dhcpcd -t 10  eth0
Jun 12 16:53:58 sirius dhcpcd: version 5.2.2 starting
Jun 12 16:53:58 sirius kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Jun 12 16:53:58 sirius dhcpcd: eth0: waiting for carrier
Jun 12 16:54:01 sirius kernel: uli526x: eth0 NIC Link is Up 100 Mbps Full duplex
Jun 12 16:54:01 sirius kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
then dhcpcd times out a little later (10 seconds as -t 10 is specified).

If I use the -K option (no carrier detection), and does ifconfig eth0 up just before issueing the dhcpcd command, dhcpcd doesn't wait for the carrier and gets a lease correctly.

Reversely, with the 2.6.34 kernel, dhcpcd indeed gets the link up notification, and gets the lease immediatly. In this case, the log says:

Jun 12 17:04:21 sirius logger: /etc/rc.d/rc.inet1:  /sbin/route add -net 127.0.0.0 netmask 255.0.0.0 lo
Jun 12 17:04:21 sirius logger: /etc/rc.d/rc.inet1:  /sbin/dhcpcd -t 10  eth0
Jun 12 17:04:22 sirius dhcpcd: version 5.2.2 starting
Jun 12 17:04:22 sirius kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Jun 12 17:04:22 sirius dhcpcd: eth0: waiting for carrier
Jun 12 17:04:25 sirius dhcpcd: eth0: carrier acquired
Jun 12 17:04:25 sirius kernel: uli526x: eth0 NIC Link is Up 100 Mbps Full duplex
Jun 12 17:04:25 sirius kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jun 12 17:04:25 sirius dhcpcd: eth0: broadcasting for a lease
Jun 12 17:04:29 sirius dhcpcd: eth0: offered 192.168.1.3 from 192.168.1.1
Jun 12 17:04:29 sirius dhcpcd: eth0: acknowledged 192.168.1.3 from 192.168.1.1
Jun 12 17:04:29 sirius dhcpcd: eth0: checking for 192.168.1.3
Jun 12 17:04:34 sirius dhcpcd: eth0: leased 192.168.1.3 for 864000 seconds
Jun 12 17:04:34 sirius dhcpcd: forking to background

As you can see, dhcpcd asks for a lease as soon as the kernel tells it the link is up (message "carrier acquired").

Therefore, I think 2.6.35-rc* notification of carrier is broken, and at least it broke dhcpcd way of watching the link.
Comment 1 Andrew Morton 2010-06-15 21:24:59 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat, 12 Jun 2010 15:15:31 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=16187
> 
>            Summary: Carrier detection failed in dhcpcd when link is up
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.35-rc2
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: acme@ghostprotocols.net
>         ReportedBy: casteyde.christian@free.fr
>         Regression: Yes
> 
> 
> Created an attachment (id=26742)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=26742)
> lspci output for 2.6.34 on my computer
> 
> Kernel at least 2.6.35-rc2, 2.6.34 works fine

Seems to be post-2.6.34 breakage in the tulip driver.

> Athlon X2 3000 in 64bits mode
> Slackware 13.1
> "Ethernet controller: ALi Corporation ULi 1689,1573 integrated ethernet. (rev
> 40)" (from lspci)
> 
> Since 2.6.35-rc2 (didn't checked -rc1), dhcpcd hangs fails to detect carrier
> appearance at boot.
> 
> The Slackware network script uses dhcpcd to bring DHCP interfaces up. At
> boot,
> it seems my network device doesn't have the link up immediatly. dhcpcd tries
> to
> bring it up, and wait for the carrier. The carrier indeed goes up, but dhcpcd
> gets absolutly no notification of that, and therefore times out.
> 
> logs says that:
> Jun 12 16:53:58 sirius logger: /etc/rc.d/rc.inet1:  /sbin/route add -net
> 127.0.0.0 netmask 255.0.0.0 lo
> Jun 12 16:53:58 sirius logger: /etc/rc.d/rc.inet1:  /sbin/dhcpcd -t 10  eth0
> Jun 12 16:53:58 sirius dhcpcd: version 5.2.2 starting
> Jun 12 16:53:58 sirius kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
> Jun 12 16:53:58 sirius dhcpcd: eth0: waiting for carrier
> Jun 12 16:54:01 sirius kernel: uli526x: eth0 NIC Link is Up 100 Mbps Full
> duplex
> Jun 12 16:54:01 sirius kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes
> ready
> then dhcpcd times out a little later (10 seconds as -t 10 is specified).
> 
> If I use the -K option (no carrier detection), and does ifconfig eth0 up just
> before issueing the dhcpcd command, dhcpcd doesn't wait for the carrier and
> gets a lease correctly.
> 
> Reversely, with the 2.6.34 kernel, dhcpcd indeed gets the link up
> notification,
> and gets the lease immediatly. In this case, the log says:
> 
> Jun 12 17:04:21 sirius logger: /etc/rc.d/rc.inet1:  /sbin/route add -net
> 127.0.0.0 netmask 255.0.0.0 lo
> Jun 12 17:04:21 sirius logger: /etc/rc.d/rc.inet1:  /sbin/dhcpcd -t 10  eth0
> Jun 12 17:04:22 sirius dhcpcd: version 5.2.2 starting
> Jun 12 17:04:22 sirius kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
> Jun 12 17:04:22 sirius dhcpcd: eth0: waiting for carrier
> Jun 12 17:04:25 sirius dhcpcd: eth0: carrier acquired
> Jun 12 17:04:25 sirius kernel: uli526x: eth0 NIC Link is Up 100 Mbps Full
> duplex
> Jun 12 17:04:25 sirius kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes
> ready
> Jun 12 17:04:25 sirius dhcpcd: eth0: broadcasting for a lease
> Jun 12 17:04:29 sirius dhcpcd: eth0: offered 192.168.1.3 from 192.168.1.1
> Jun 12 17:04:29 sirius dhcpcd: eth0: acknowledged 192.168.1.3 from
> 192.168.1.1
> Jun 12 17:04:29 sirius dhcpcd: eth0: checking for 192.168.1.3
> Jun 12 17:04:34 sirius dhcpcd: eth0: leased 192.168.1.3 for 864000 seconds
> Jun 12 17:04:34 sirius dhcpcd: forking to background
> 
> As you can see, dhcpcd asks for a lease as soon as the kernel tells it the
> link
> is up (message "carrier acquired").
> 
> Therefore, I think 2.6.35-rc* notification of carrier is broken, and at least
> it broke dhcpcd way of watching the link.
>
Comment 2 Christian Casteyde 2010-06-16 06:21:54 UTC
Exact, my kernel config is:
-*- Generic Media Independent Interface device support
[*]   "Tulip" family network device support
<*>   ULi M526x controller support

Le mardi 15 juin 2010 23:24:18, Andrew Morton a écrit :
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Sat, 12 Jun 2010 15:15:31 GMT
> 
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=16187
> > 
> >            Summary: Carrier detection failed in dhcpcd when link is up
> >            Product: Networking
> >            Version: 2.5
> >     
> >     Kernel Version: 2.6.35-rc2
> >     
> >           Platform: All
> >         
> >         OS/Version: Linux
> >         
> >               Tree: Mainline
> >             
> >             Status: NEW
> >           
> >           Severity: normal
> >           Priority: P1
> >          
> >          Component: Other
> >         
> >         AssignedTo: acme@ghostprotocols.net
> >         ReportedBy: casteyde.christian@free.fr
> >         Regression: Yes
> > 
> > Created an attachment (id=26742)
> > 
> >  --> (https://bugzilla.kernel.org/attachment.cgi?id=26742)
> > 
> > lspci output for 2.6.34 on my computer
> > 
> > Kernel at least 2.6.35-rc2, 2.6.34 works fine
> 
> Seems to be post-2.6.34 breakage in the tulip driver.
> 
> > Athlon X2 3000 in 64bits mode
> > Slackware 13.1
> > "Ethernet controller: ALi Corporation ULi 1689,1573 integrated ethernet.
> > (rev 40)" (from lspci)
> > 
> > Since 2.6.35-rc2 (didn't checked -rc1), dhcpcd hangs fails to detect
> > carrier appearance at boot.
> > 
> > The Slackware network script uses dhcpcd to bring DHCP interfaces up. At
> > boot, it seems my network device doesn't have the link up immediatly.
> > dhcpcd tries to bring it up, and wait for the carrier. The carrier
> > indeed goes up, but dhcpcd gets absolutly no notification of that, and
> > therefore times out.
> > 
> > logs says that:
> > Jun 12 16:53:58 sirius logger: /etc/rc.d/rc.inet1:  /sbin/route add -net
> > 127.0.0.0 netmask 255.0.0.0 lo
> > Jun 12 16:53:58 sirius logger: /etc/rc.d/rc.inet1:  /sbin/dhcpcd -t 10 
> > eth0 Jun 12 16:53:58 sirius dhcpcd: version 5.2.2 starting
> > Jun 12 16:53:58 sirius kernel: ADDRCONF(NETDEV_UP): eth0: link is not
> > ready Jun 12 16:53:58 sirius dhcpcd: eth0: waiting for carrier
> > Jun 12 16:54:01 sirius kernel: uli526x: eth0 NIC Link is Up 100 Mbps Full
> > duplex
> > Jun 12 16:54:01 sirius kernel: ADDRCONF(NETDEV_CHANGE): eth0: link
> > becomes ready
> > then dhcpcd times out a little later (10 seconds as -t 10 is specified).
> > 
> > If I use the -K option (no carrier detection), and does ifconfig eth0 up
> > just before issueing the dhcpcd command, dhcpcd doesn't wait for the
> > carrier and gets a lease correctly.
> > 
> > Reversely, with the 2.6.34 kernel, dhcpcd indeed gets the link up
> > notification, and gets the lease immediatly. In this case, the log says:
> > 
> > Jun 12 17:04:21 sirius logger: /etc/rc.d/rc.inet1:  /sbin/route add -net
> > 127.0.0.0 netmask 255.0.0.0 lo
> > Jun 12 17:04:21 sirius logger: /etc/rc.d/rc.inet1:  /sbin/dhcpcd -t 10 
> > eth0 Jun 12 17:04:22 sirius dhcpcd: version 5.2.2 starting
> > Jun 12 17:04:22 sirius kernel: ADDRCONF(NETDEV_UP): eth0: link is not
> > ready Jun 12 17:04:22 sirius dhcpcd: eth0: waiting for carrier
> > Jun 12 17:04:25 sirius dhcpcd: eth0: carrier acquired
> > Jun 12 17:04:25 sirius kernel: uli526x: eth0 NIC Link is Up 100 Mbps Full
> > duplex
> > Jun 12 17:04:25 sirius kernel: ADDRCONF(NETDEV_CHANGE): eth0: link
> > becomes ready
> > Jun 12 17:04:25 sirius dhcpcd: eth0: broadcasting for a lease
> > Jun 12 17:04:29 sirius dhcpcd: eth0: offered 192.168.1.3 from 192.168.1.1
> > Jun 12 17:04:29 sirius dhcpcd: eth0: acknowledged 192.168.1.3 from
> > 192.168.1.1 Jun 12 17:04:29 sirius dhcpcd: eth0: checking for
> > 192.168.1.3
> > Jun 12 17:04:34 sirius dhcpcd: eth0: leased 192.168.1.3 for 864000
> > seconds Jun 12 17:04:34 sirius dhcpcd: forking to background
> > 
> > As you can see, dhcpcd asks for a lease as soon as the kernel tells it
> > the link is up (message "carrier acquired").
> > 
> > Therefore, I think 2.6.35-rc* notification of carrier is broken, and at
> > least it broke dhcpcd way of watching the link.
Comment 3 Grant Grundler 2010-06-18 06:53:47 UTC
On Tue, Jun 15, 2010 at 02:24:18PM -0700, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).

I've resync to linus' tree (2.6.35-rc3) and reviewed the output of:
    git diff v2.6.34 drivers/net/tulip/

I don't see anything that would affect how link state
changes get reported to user space.

I'm not inclined to believe this is a tulip "bug" unless
core netdev behavior changed and tulip is not longer
doing the right thing.

hth,
grant
Comment 4 Bruce Allan 2010-06-18 16:18:04 UTC
On Thursday, June 17, 2010 11:17 PM, Grant Grundler wrote:
> On Tue, Jun 15, 2010 at 02:24:18PM -0700, Andrew Morton wrote:
>> 
>> (switched to email.  Please respond via emailed reply-to-all, not
>> via the bugzilla web interface).
> 
> I've resync to linus' tree (2.6.35-rc3) and reviewed the output of:
>     git diff v2.6.34 drivers/net/tulip/
> 
> I don't see anything that would affect how link state
> changes get reported to user space.
> 
> I'm not inclined to believe this is a tulip "bug" unless
> core netdev behavior changed and tulip is not longer
> doing the right thing.
> 
> hth,
> grant

I don't believe this is a tulip specific bug - the same thing has been reported against e1000e and bnx2 (IIRC); I have not had the time to investigate further.

Bruce.
Comment 5 Andrew Morton 2010-06-21 21:22:20 UTC
On Fri, 18 Jun 2010 08:04:36 -0700
"Allan, Bruce W" <bruce.w.allan@intel.com> wrote:

> On Thursday, June 17, 2010 11:17 PM, Grant Grundler wrote:
> > On Tue, Jun 15, 2010 at 02:24:18PM -0700, Andrew Morton wrote:
> >> 
> >> (switched to email.  Please respond via emailed reply-to-all, not
> >> via the bugzilla web interface).
> > 
> > I've resync to linus' tree (2.6.35-rc3) and reviewed the output of:
> >     git diff v2.6.34 drivers/net/tulip/
> > 
> > I don't see anything that would affect how link state
> > changes get reported to user space.
> > 
> > I'm not inclined to believe this is a tulip "bug" unless
> > core netdev behavior changed and tulip is not longer
> > doing the right thing.
> > 
> > hth,
> > grant
> 
> I don't believe this is a tulip specific bug - the same thing has been
> reported against e1000e and bnx2 (IIRC); I have not had the time to
> investigate further.

So it's affecting three drivers.

One thing which changed in there recently is

: commit b2db756449f63f98049587f7ede4a8e85e0c79b1
: Author:     YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
: AuthorDate: Sat Mar 20 16:11:12 2010 -0700
: Commit:     David S. Miller <davem@davemloft.net>
: CommitDate: Sat Mar 20 16:11:12 2010 -0700
:
:    ipv6: Reduce timer events for addrconf_verify().

So perhaps someone could test the simple reversion patch, below?

I couldn't locate these e1000e and bnx2 bug reports so I couldn't cc
the reporters :(

I'm seeing several patches on netdev "use netif_carrier_off to prevent
tx timeout".  Is that related?


 net/ipv6/addrconf.c |   27 ++++-----------------------
 1 file changed, 4 insertions(+), 23 deletions(-)

diff -puN net/ipv6/addrconf.c~revert-2 net/ipv6/addrconf.c
--- a/net/ipv6/addrconf.c~revert-2
+++ a/net/ipv6/addrconf.c
@@ -100,10 +100,6 @@
 #define	INFINITY_LIFE_TIME	0xFFFFFFFF
 #define TIME_DELTA(a, b) ((unsigned long)((long)(a) - (long)(b)))
 
-#define ADDRCONF_TIMER_FUZZ_MINUS	(HZ > 50 ? HZ/50 : 1)
-#define ADDRCONF_TIMER_FUZZ		(HZ / 4)
-#define ADDRCONF_TIMER_FUZZ_MAX		(HZ)
-
 #ifdef CONFIG_SYSCTL
 static void addrconf_sysctl_register(struct inet6_dev *idev);
 static void addrconf_sysctl_unregister(struct inet6_dev *idev);
@@ -3159,15 +3155,15 @@ int ipv6_chk_home_addr(struct net *net, 
 
 static void addrconf_verify(unsigned long foo)
 {
-	unsigned long now, next, next_sec, next_sched;
 	struct inet6_ifaddr *ifp;
 	struct hlist_node *node;
+	unsigned long now, next;
 	int i;
 
 	rcu_read_lock_bh();
 	spin_lock(&addrconf_verify_lock);
 	now = jiffies;
-	next = round_jiffies_up(now + ADDR_CHECK_FREQUENCY);
+	next = now + ADDR_CHECK_FREQUENCY;
 
 	del_timer(&addr_chk_timer);
 
@@ -3181,8 +3177,7 @@ restart:
 				continue;
 
 			spin_lock(&ifp->lock);
-			/* We try to batch several events at once. */
-			age = (now - ifp->tstamp + ADDRCONF_TIMER_FUZZ_MINUS) / HZ;
+			age = (now - ifp->tstamp) / HZ;
 
 			if (ifp->valid_lft != INFINITY_LIFE_TIME &&
 			    age >= ifp->valid_lft) {
@@ -3252,21 +3247,7 @@ restart:
 		}
 	}
 
-	next_sec = round_jiffies_up(next);
-	next_sched = next;
-
-	/* If rounded timeout is accurate enough, accept it. */
-	if (time_before(next_sec, next + ADDRCONF_TIMER_FUZZ))
-		next_sched = next_sec;
-
-	/* And minimum interval is ADDRCONF_TIMER_FUZZ_MAX. */
-	if (time_before(next_sched, jiffies + ADDRCONF_TIMER_FUZZ_MAX))
-		next_sched = jiffies + ADDRCONF_TIMER_FUZZ_MAX;
-
-	ADBG((KERN_DEBUG "now = %lu, schedule = %lu, rounded schedule = %lu => %lu\n",
-	      now, next, next_sec, next_sched));
-
-	addr_chk_timer.expires = next_sched;
+	addr_chk_timer.expires = time_before(next, jiffies + HZ) ? jiffies + HZ : next;
 	add_timer(&addr_chk_timer);
 	spin_unlock(&addrconf_verify_lock);
 	rcu_read_unlock_bh();
_
Comment 6 Rafael J. Wysocki 2010-06-22 13:55:20 UTC
Handled-By : Andrew Morton <akpm@linux-foundation.org>
Comment 7 Christian Casteyde 2010-06-22 18:38:03 UTC
I can test since I can reproduce it
However, I didn't managed to revert the patch:

root@sirius:/usr/src/linux# patch -p1 -R < ../revert.patch 
patching file net/ipv6/addrconf.c
Hunk #1 succeeded at 104 with fuzz 2 (offset 4 lines).
Hunk #2 FAILED at 3163.
Hunk #3 FAILED at 3185.
Hunk #4 FAILED at 3256.
3 out of 4 hunks FAILED -- saving rejects to file net/ipv6/addrconf.c.rej

I applied it on -rc1, -rc2 and -rc3 and all failed
Comment 8 Christian Casteyde 2010-06-22 21:50:09 UTC
I bisected it:
10708f37ae729baba9b67bd134c3720709d4ae62 is the first bad commit
commit 10708f37ae729baba9b67bd134c3720709d4ae62
Author: Jan Engelhardt <jengelh@medozas.de>
Date:   Thu Mar 11 09:57:29 2010 +0000

    net: core: add IFLA_STATS64 support
    
    `ip -s link` shows interface counters truncated to 32 bit. This is
    because interface statistics are transported only in 32-bit quantity
    to userspace. This commit adds a new IFLA_STATS64 attribute that
    exports them in full 64 bit.
    
    References: http://lkml.indiana.edu/hypermail/linux/kernel/0307.3/0215.html
    Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 d3495c3a1f0c879f0635279e05ccffbcab6b6f1b e42d269b76c89d6d098b9f290b959e0320615756 M    include
:040000 040000 94c49d66f55109c55c7d832ed74205b0b40cb983 eab089628db95a473a1ced8d99c629b3b3b2ba91 M    net
Comment 9 Christian Casteyde 2010-06-22 21:52:59 UTC
Created attachment 26904 [details]
git bisect log
Comment 10 Rafael J. Wysocki 2010-06-22 21:57:19 UTC
Good job, thanks!

First-Bad-Commit : 10708f37ae729baba9b67bd134c3720709d4ae62
Comment 11 Christian Casteyde 2010-07-05 18:39:44 UTC
Update: Still present in 2.6.35-rc4.
Comment 12 David S. Miller 2010-07-09 17:32:39 UTC
Bug is in dhcpd, it allocates a fixed 256 byte buffer for netlink
messages.
Comment 13 Andreas Sturmlechner 2010-08-04 14:48:51 UTC
Bug is fixed in dhcpcd-5.2.7