Bug 36602

Summary: Bridge fails to work normally without net.ipv4.ip_forward=1
Product: Networking Reporter: Igor Novgorodov (igor)
Component: Other Assignee: Arnaldo Carvalho de Melo (acme)
Status: CLOSED CODE_FIX    
Severity: normal CC: alan
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38.7 Subsystem:
Regression: No Bisected commit-id:

Description Igor Novgorodov 2011-06-03 19:21:17 UTC
Yes, this seems strange, but it seems to be true.

My network scheme is quite simple:

(host1) <--- 10gbe ---> (bridge host) <--- 10gbe ---> (host2)

host1 & host2 are actually VMware ESXi hypervisors, but I think that's irrelevant in this case.

The network adapters are Intel 82599 10GbE cards on all hosts.

On the bridge host, I've created a VLAN on each interface and then bridged them:
# vconfig add eth0 102
# vconfig add eth1 102
# brctl addbr br0
# brctl addif br0 eth0.102
# brctl addif br0 eth1.102
# ip link set br0 mtu 9000 up
...etc...
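
For reference, the resulting bridge membership can be sanity-checked with:

# brctl show br0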

At this point the bridge seems to be working: I can ping between host1 & host2, even with jumbo frames without fragmentation.

BUT when I try to use iperf & friends to measure raw TCP speed between hosts 1/2, I get something weird like 7-10 MEGABITS per second, or even an iperf hang until Ctrl+C.
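
The measurement itself was nothing fancy, just plain TCP-mode iperf (the address here is a placeholder):

host2# iperf -s
host1# iperf -c <host2-address>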

If I attach an IP address to the bridge and measure between the hosts and the bridge, it works flawlessly, rendering 9.8 Gbit/s in both directions.
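
For reference, attaching the address looked roughly like this (the address itself is just an example):

# ip addr add 10.0.102.1/24 dev br0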

While trying to find a solution, after I ran out of options, I set net.ipv4.ip_forward to 1, and, SURPRISE, the bridge started working like a charm, at almost 10-gigabit speed.
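
That is, the one-line change that made the difference was:

# sysctl -w net.ipv4.ip_forward=1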

What makes it stranger is that in my kernel I've turned off all the routing code, iptables and other stuff, as this host serves primarily as an iSCSI target.

I have little knowledge of the kernel's deep internals, but I always thought that bridging & routing operate at different levels and couldn't affect each other (ebtables being the exception, but I don't have it :)).

Maybe I'm interpreting the results wrong, but I've ruled out everything else.

Currently I can't use this setup as a test ground; I'll try to replicate the scheme in a virtual environment to see whether other kernels are affected as well.

I'd be glad to hear any ideas on this.
Comment 1 Andrew Morton 2011-06-03 19:37:11 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 3 Jun 2011 19:21:20 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=36602
> 
>            Summary: Bridge fails to work normally without
>                     net.ipv4.ip_forward=1
>     Kernel Version: 2.6.38.7
> 
> [...]
>
Comment 2 Ben Hutchings 2011-06-03 23:13:54 UTC
On Fri, 2011-06-03 at 12:36 -0700, Andrew Morton wrote:
[...]
> > On the bridge host, I've created a VLAN on each interface and then
> > bridged them:
> > # vconfig add eth0 102
> > # vconfig add eth1 102
> > # brctl addbr br0
> > # brctl addif br0 eth0.102
> > # brctl addif br0 eth1.102
> > # ip link set br0 mtu 9000 up
> > ...etc...
> > 
> > At this point the bridge seems to be working: I can ping between host1 &
> > host2, even with jumbo frames without fragmentation.
> > 
> > BUT when I try to use iperf & friends to measure raw TCP speed between
> > hosts 1/2, I get something weird like 7-10 MEGABITS per second, or even
> > an iperf hang until Ctrl+C.

This sounds like a symptom of doing LRO on a bridged device.  Normally
we turn off LRO for bridge members automatically, but we haven't been
doing that when the bridge members are VLAN devices.
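
A quick way to confirm this by hand, and to work around it in the
meantime, is ethtool (assuming the driver exposes the generic LRO flag;
interface names as in your setup):

# ethtool -k eth0 | grep -i large
# ethtool -K eth0 lro off
# ethtool -K eth1 lro off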

> > If I attach an IP address to the bridge and measure between the hosts
> > and the bridge, it works flawlessly, rendering 9.8 Gbit/s in both
> > directions.
> > 
> > While trying to find a solution, after I ran out of options, I set
> > net.ipv4.ip_forward to 1, and, SURPRISE, the bridge started working
> > like a charm, at almost 10-gigabit speed.
[...]

Right, that should force LRO off for all devices with IPv4 set up.

This should be fixed by:

commit f11970e383acd6f505f492f1bc07fb1a4d884829
Author: Neil Horman <nhorman@tuxdriver.com>
Date:   Tue May 24 08:31:09 2011 +0000

    net: make dev_disable_lro use physical device if passed a vlan dev (v2)

which is in 3.0-rc1.
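
If you have a kernel git tree handy, this should confirm that the fix
first appeared in a v3.0-rc1-based tag:

# git describe --contains f11970e383acd6f505f492f1bc07fb1a4d884829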

Ben.
Comment 3 Igor Novgorodov 2011-06-04 03:50:26 UTC
Hmm...
By LRO, do you mean the in-kernel software CONFIG_INET_LRO, or the NIC driver's LRO?
I've read about all the bad things that can happen when LRO & bridging/routing are used together, so I built the ixgbe driver without LRO at all:

make CFLAGS_EXTRA="-DIXGBE_NO_LRO -DIXGBE_NO_LLI" KSP="/usr/src/linux" install

So I thought I'd be safe from that problem...