Bug 29712 - Bonding Driver(version : 3.5.0) - Problem with ARP monitoring in active backup mode
Status: RESOLVED OBSOLETE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-23 10:41 UTC by Harsha H R
Modified: 2012-08-16 11:06 UTC

See Also:
Kernel Version: 2.6.32
Subsystem:
Regression: No
Bisected commit-id:


Description Harsha H R 2011-02-23 10:41:32 UTC
We are facing an issue with ARP monitoring in active-backup mode when
two network interfaces of two systems are connected back to back (point
to point, without a switch in between) and a bond is created on either
system with the point-to-point connected interfaces as slaves.

Steps to reproduce :

1. Initially the bond was created with two interfaces eth2 and eth3, having eth2 as primary

	# modprobe bonding primary=eth2 mode=1 arp_interval=500 arp_ip_target=192.168.4.61

	# ifconfig bond0 192.168.2.63 netmask 255.255.255.0

	# ifenslave bond0 eth2 eth3

	# ifconfig bond0 up

	# cat /proc/net/bonding/bond0

	Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

	Bonding Mode: fault-tolerance (active-backup)

	Primary Slave: eth2
	Currently Active Slave: eth2
	MII Status: up
	MII Polling Interval (ms): 0
	Up Delay (ms): 0
	Down Delay (ms): 0
	ARP Polling Interval (ms): 500
	ARP IP target/s (n.n.n.n form): 192.168.4.61

	Slave Interface: eth2
	MII Status: up
	Link Failure Count: 1
	Permanent HW addr: 00:26:55:27:88:52

	Slave Interface: eth3
	MII Status: down
	Link Failure Count: 1
	Permanent HW addr: 00:26:55:27:88:54

2. The primary interface was brought down, and failover happened

	# ifconfig eth2 down

	# cat /proc/net/bonding/bond0

	Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

	Bonding Mode: fault-tolerance (active-backup)
	Primary Slave: eth2
	Currently Active Slave: eth3 <-- As expected -->
	MII Status: up
	MII Polling Interval (ms): 0
	Up Delay (ms): 0
	Down Delay (ms): 0
	ARP Polling Interval (ms): 500
	ARP IP target/s (n.n.n.n form): 192.168.4.61

	Slave Interface: eth2
	MII Status: down
	Link Failure Count: 2
	Permanent HW addr: 00:26:55:27:88:52

	Slave Interface: eth3
	MII Status: up
	Link Failure Count: 1
	Permanent HW addr: 00:26:55:27:88:54

3. The primary interface was brought up again, but the bond did not fail back to the primary

	ned1g6:~# ifconfig eth2 up

	ned1g6:~# cat /proc/net/bonding/bond0

	Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

	Bonding Mode: fault-tolerance (active-backup)
	Primary Slave: eth2
	Currently Active Slave: eth3 <-- Ideally this should have been eth2 -->
	MII Status: up
	MII Polling Interval (ms): 0
	Up Delay (ms): 0
	Down Delay (ms): 0
	ARP Polling Interval (ms): 500
	ARP IP target/s (n.n.n.n form): 192.168.4.61

	Slave Interface: eth2
	MII Status: down
	Link Failure Count: 2
	Permanent HW addr: 00:26:55:27:88:52

	Slave Interface: eth3
	MII Status: up
	Link Failure Count: 1
	Permanent HW addr: 00:26:55:27:88:54

The problem is that when the primary_slave comes up from the down state
it won't get selected as the currently active slave for the bond.
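
For anyone re-checking the state shown in step 3, the comparison between
"Primary Slave" and "Currently Active Slave" can be scripted. This is only
a minimal sketch and not part of the original report: the function names
bond_field and bond_off_primary are made up for illustration, and it
assumes the status text keeps the "Label: value" layout shown above.

```shell
# Print the first word of the value for the given label, reading bonding
# status text (as in /proc/net/bonding/bond0) from stdin.
bond_field() {
    awk -F': ' -v key="$1" '
        { sub(/^[ \t]+/, "") }          # tolerate indented copies of the output
        $1 == key { split($2, w, " "); print w[1]; exit }
    '
}

# Succeed (exit status 0) when a primary is configured and the bond is
# currently running on a different slave, i.e. the state seen in step 3.
bond_off_primary() {
    status=$(cat)
    primary=$(printf '%s\n' "$status" | bond_field 'Primary Slave')
    active=$(printf '%s\n' "$status" | bond_field 'Currently Active Slave')
    [ -n "$primary" ] && [ "$primary" != "None" ] && [ "$primary" != "$active" ]
}
```

Typical use would be something like
`bond_off_primary < /proc/net/bonding/bond0 && echo "bond0 is not on its primary"`.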

Best Regards,
Harsha
Comment 1 Andrew Morton 2011-02-24 22:52:00 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 23 Feb 2011 10:41:34 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=29712
> 
>            Summary: Bonding Driver(version : 3.5.0) - Problem with ARP
>                     monitoring in active backup mode
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 2.6.32

That's a paleolithic kernel you have there.  This problem might have
been fixed already.  Can you test a more recent kernel?


>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Network
>         AssignedTo: drivers_network@kernel-bugs.osdl.org
>         ReportedBy: harsha.r02@mphasis.com
>         Regression: No
Comment 2 Anonymous Emailer 2011-02-25 03:42:58 UTC
Reply-To: brian.haley@hp.com

On 02/24/2011 05:51 PM, Andrew Morton wrote:
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Wed, 23 Feb 2011 10:41:34 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=29712
>>
>>            Summary: Bonding Driver(version : 3.5.0) - Problem with ARP
>>                     monitoring in active backup mode
>>            Product: Drivers
>>            Version: 2.5
>>     Kernel Version: 2.6.32
> 
> That's a paleolithic kernel you have there.  This problem might have
> been fixed already.  Can you test a more recent kernel?

I can add some more info since I originally looked at the problem.  This
happens on 2.6.38 as well, and on this 2.6.32 kernel with a backported
3.7.0 bonding driver (with the primary_reselect option).  Harsha has a
prototype patch that's being tested, but wanted to log the bug to see
if one of the bonding maintainers had a better solution.

I'll let him respond as I'm now out of the loop...

Thanks,

-Brian
Comment 3 Harsha H R 2011-02-25 13:57:23 UTC
Attached patch resolves the issue. Failover happened back to primary
when it was up again in both the point to point and switch
configuration.

Please let us know if this change can be included.

Thanks,

- Harsha


Comment 4 Harsha H R 2011-02-25 18:02:59 UTC
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 40fb5ee..0413917 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3020,11 +3020,16 @@ static void bond_ab_arp_probe(struct bonding *bond)
                       bond->curr_active_slave->dev->name);
        if (bond->curr_active_slave) {
+                if((bond->curr_active_slave != bond->primary_slave) &&
+                   (IS_UP(bond->primary_slave->dev)))
+                        goto failover;
+
                bond_arp_send_all(bond, bond->curr_active_slave);
                read_unlock(&bond->curr_slave_lock);
                return;
        }
+failover:
        read_unlock(&bond->curr_slave_lock);
        /* if we don't have a curr_active_slave, search for the next available


Comment 5 David S. Miller 2011-02-25 18:54:05 UTC
From: "Harsha R02" <Harsha.R02@mphasis.com>
Date: Fri, 25 Feb 2011 18:14:32 +0530

> Attached patch resolves the issue. Failover happened back to primary
> when it was up again in both the point to point and switch
> configuration.
> 
> Please let us know if this change can be included.

Please don't base64 encode your patches, that makes them harder
to read for some people.  It's just plain text.
Comment 6 Jay Vosburgh 2011-02-25 19:02:59 UTC
Harsha R02 <Harsha.R02@mphasis.com> wrote:

>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 40fb5ee..0413917 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3020,11 +3020,16 @@ static void bond_ab_arp_probe(struct bonding *bond)
>                       bond->curr_active_slave->dev->name);
>        if (bond->curr_active_slave) {
>+                if((bond->curr_active_slave != bond->primary_slave) &&
>+                   (IS_UP(bond->primary_slave->dev)))
>+                        goto failover;
>+
>                bond_arp_send_all(bond, bond->curr_active_slave);
>                read_unlock(&bond->curr_slave_lock);
>                return;
>        }
>+failover:
>        read_unlock(&bond->curr_slave_lock);
>        /* if we don't have a curr_active_slave, search for the next available

	I'm not sure this is the proper place to put the "failover:"
label, as it will go through the "search for any peer" logic that's
normally used when there are no available slaves.  That will likely take
longer than simply switching to the primary.

	It should be possible to simply call bond_change_active_slave
with the appropriate arguments; did you try this?

	-J



---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
Comment 7 Harsha H R 2011-03-03 06:32:28 UTC
Hi Jay,

We found that the patch presented here has some issues, so we cannot
go with this solution.

In the function "bond_ab_arp_probe", in addition to sending ARP probes
for the currently active slave, shouldn't we also be sending ARP probes
for the primary_slave if the link status of the primary slave is up?

I have made the changes below:

static void bond_ab_arp_probe(struct bonding *bond)
{
        struct slave *slave;
        int i;

        read_lock(&bond->curr_slave_lock);

        if (bond->current_arp_slave && bond->curr_active_slave)
                pr_info(DRV_NAME "PROBE: c_arp %s && cas %s BAD\n",
                       bond->current_arp_slave->dev->name,
                       bond->curr_active_slave->dev->name);

        if (bond->curr_active_slave) {
+                if((bond->curr_active_slave != bond->primary_slave) &&
+                   (IS_UP(bond->primary_slave->dev))) {
+                    bond_arp_send_all(bond, bond->primary_slave);
+                }
                bond_arp_send_all(bond, bond->curr_active_slave);
                read_unlock(&bond->curr_slave_lock);
                return;
        }


Please let us know whether this would help, or if you see any side
effects.

Thanks,
Harsha


Comment 8 Jay Vosburgh 2011-03-04 18:19:05 UTC
Harsha R02 <Harsha.R02@mphasis.com> wrote:

>We found that the patch presented here has some issues, so we cannot
>go with this solution.
>
>In the function "bond_ab_arp_probe", in addition to sending ARP probes
>for the currently active slave, shouldn't we also be sending ARP probes
>for the primary_slave if the link status of the primary slave is up?
>
>I have made the changes below:
>
>static void bond_ab_arp_probe(struct bonding *bond)
>{
>        struct slave *slave;
>        int i;
>
>        read_lock(&bond->curr_slave_lock);
>
>        if (bond->current_arp_slave && bond->curr_active_slave)
>                pr_info(DRV_NAME "PROBE: c_arp %s && cas %s BAD\n",
>                       bond->current_arp_slave->dev->name,
>                       bond->curr_active_slave->dev->name);
>
>        if (bond->curr_active_slave) {
>+                if((bond->curr_active_slave != bond->primary_slave) &&
>+                   (IS_UP(bond->primary_slave->dev))) {
>+                    bond_arp_send_all(bond, bond->primary_slave);
>+                }
>                bond_arp_send_all(bond, bond->curr_active_slave);
>                read_unlock(&bond->curr_slave_lock);

	No, we can't do this; if we send ARP probes out from an inactive
slave (which the primary would be at this point) it will confuse
switches that snoop traffic to determine the switch port's MAC addresses
(the switches will believe that the "primary" slave is the port to use
to reach the bond's MAC address).

	I think your problem is that your configuration (two systems,
back to back, no switch) is not a configuration the ARP monitor is
designed to work with.

	The ARP monitor determines the availability of backup slaves
based on traffic received by the backup slaves.  The usual source of
this traffic is the ARP broadcast requests being sent out the active
slave and then forwarded by the switch to all switch ports, including
the backup slave's port.  I'm guessing that your system isn't forwarding
these packets like a switch would, and so the primary slave isn't seeing
any incoming packets at all.

	If your primary slave (which is an inactive slave at the moment)
is not receiving traffic, bonding will never believe it is available.
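
One rough way to check this guess is to watch whether the inactive
primary's receive counter moves at all. The following is a minimal sketch,
assuming the standard /sys/class/net/<iface>/statistics/rx_packets counter
is available; rx_delta and the SYSFS_NET override are hypothetical names
introduced here for illustration, and the counter covers all received
frames, not only ARP.

```shell
# Print how many packets IFACE received during SECS seconds (default 5),
# based on the per-interface rx_packets counter in sysfs.
# SYSFS_NET is a hypothetical override so the function can be pointed at
# a fake directory tree instead of the real /sys/class/net.
rx_delta() {
    iface=$1
    secs=${2:-5}
    counter="${SYSFS_NET:-/sys/class/net}/$iface/statistics/rx_packets"
    before=$(cat "$counter") || return 1   # fail if the interface is unknown
    sleep "$secs"
    after=$(cat "$counter") || return 1
    echo $((after - before))
}
```

For example, `rx_delta eth2 5` printing 0 while eth3 is the active slave
would be consistent with the primary seeing no incoming packets, although
it would not by itself prove the cause.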

	I've never experimented with using the ARP monitor in a
back-to-back configuration; I'm thinking through how the ARP monitor
functions, and I'm not sure it can be reliable when set up like this.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
