Bug 209767 - Bonding 802.3ad layer2+3 transmits on both slaves within single connection
Summary: Bonding 802.3ad layer2+3 transmits on both slaves within single connection
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-20 10:42 UTC by Onno Zweers
Modified: 2020-10-28 20:54 UTC
CC List: 1 user

See Also:
Kernel Version: 5.8.11-1.el8.elrepo.x86_64 and 3.10.0-1127.19.1.el7.x86_64
Subsystem:
Regression: No
Bisected commit-id:


Description Onno Zweers 2020-10-20 10:42:34 UTC
Dear people,

I'm seeing bonding behavior I don't understand and neither do several network experts I've consulted.

We have two servers, both with two 25 Gbit interfaces in a bond (802.3ad with layer2+3 hashing). We tuned the systems according to https://fasterdata.es.net/host-tuning/linux/. I run `iperf3 --server` on server 1 and connect to it with `iperf3 --client server1` from server 2. Sometimes the connection is good (24.7 Gbit/s, no retransmits) and sometimes there are many retransmits (more than 30,000 in a 10-second run on occasion), in which case the bandwidth may drop to 15 Gbit/s or even lower. The servers are idle except for the iperf3 runs. When we bring down one slave on server 1, the result is always perfect: no retransmits and good throughput.
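
(For reference, the test was roughly the following; the slave name in the last command is just an example:)

    # on server 1
    iperf3 --server
    # on server 2 (iperf3 runs for 10 seconds by default)
    iperf3 --client server1
    # for comparison, force everything onto a single slave on server 1
    ip link set dev ens2f0 down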

We have captured traffic with tcpdump on server 1 at the slave level (I'll try to add the pcap files). To our surprise, we see that the data channel ACK packets are sometimes sent over one slave and sometimes over the other. We think this causes packet misordering in the network switches, and thus retransmits and loss of bandwidth.
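
(The per-slave captures were made roughly like this; interface and file names here are illustrative:)

    tcpdump -i ens2f1 -w slave1.pcap host server2
    tcpdump -i ens2f0 -w slave2.pcap host server2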

Our understanding of layer2+3 hashing is that for a single connection, all traffic should go over the same slave. Therefore, we don't understand why server 1 sends ACK packets out over both slaves.
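
(Paraphrasing the layer2+3 description in bonding.txt: the hash XORs the source and destination MAC addresses and IP addresses, folds the result, and reduces it modulo the slave count, so a fixed address pair should always select the same slave. A toy illustration in shell, with made-up values and the packet type ID left out:)

    src_mac=0x1c34da491d81; dst_mac=0x443839ff0043   # example MAC addresses
    src_ip=0x0a000001;      dst_ip=0x0a000002        # e.g. 10.0.0.1 -> 10.0.0.2
    hash=$(( src_mac ^ dst_mac ^ src_ip ^ dst_ip ))
    hash=$(( hash ^ (hash >> 16) ))
    hash=$(( hash ^ (hash >> 8) ))
    echo "slave index: $(( hash % 2 ))"              # constant inputs, constant slave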

I've read the documentation at https://www.kernel.org/doc/Documentation/networking/bonding.txt but I couldn't find the observed behaviour explained there.

We have tested several CentOS 7 and CentOS 8 kernels, including recent elrepo kernels, but all show this behaviour. We have tried teaming instead of bonding, but it behaves the same. We have tried other hashing algorithms such as layer3+4, but they seem to have the same issue. It occurs with both IPv4 and IPv6.
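
(For illustration, the transmit hash policy can be changed on a live bond through sysfs, assuming the bond is named bond0:)

    echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy
    cat /sys/class/net/bond0/bonding/xmit_hash_policy   # verify the active policy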

Is this behaviour to be expected? If yes, is it documented anywhere? Will it degrade throughput with real-life traffic (multiple concurrent data streams)?
If the behaviour is not expected, are we doing something wrong, or might it be a bug?

Thanks,
Onno
Comment 1 Onno Zweers 2020-10-20 10:47:49 UTC
The mentioned pcap files can be downloaded from https://surfdrive.surf.nl/files/index.php/s/LseSNOdcHSrIs9l .
Comment 2 Jay Vosburgh 2020-10-20 18:44:14 UTC
Onno,

No, what you describe doesn't sound like the expected behavior.

In your configuration, from your description, it appears the two servers are not connected directly to one another, and there is at least one switch in between, correct?  If there is a switch, are the appropriate switch ports (those connected directly to the servers) configured for LACP mode?

Could you attach the contents of /proc/net/bonding/bond0 (replacing "bond0" with whatever the name of your actual bonding interface is if not "bond0") from both servers?  Please read the file as root (otherwise some information is omitted).
Comment 3 Onno Zweers 2020-10-21 07:19:04 UTC
Hi Jay,

You are correct, there are several (Cumulus) switches in between server 1 and server 2. The servers are in two different racks; each rack has two top-of-rack switches and each slave is connected to a different top-of-rack switch (to have redundancy on the network level). The top-of-rack switches are connected to end-of-row switches that are interlinked. The switches are configured for LACP.

Interestingly, between two servers within one rack we don't see the retransmits. We suspect that the top-of-rack switches are then capable of avoiding misordering of packets. But with a path TOR->EOR->EOR->TOR we think there is misordering of packets.


Here are the bonding properties (with the MLNX-OFED driver):

[root@penguin1 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 20000
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 1c:34:da:49:1d:81
Active Aggregator Info:
	Aggregator ID: 1
	Number of ports: 2
	Actor Key: 21
	Partner Key: 17
	Partner Mac Address: 44:38:39:ff:00:43

Slave Interface: ens2f1
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 1c:34:da:49:1d:81
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 1c:34:da:49:1d:81
    port key: 21
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 65535
    system mac address: 44:38:39:ff:00:43
    oper key: 17
    port priority: 255
    port number: 1
    port state: 63

Slave Interface: ens2f0
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 1c:34:da:49:1d:80
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 1c:34:da:49:1d:81
    port key: 21
    port priority: 255
    port number: 2
    port state: 63
details partner lacp pdu:
    system priority: 65535
    system mac address: 44:38:39:ff:00:43
    oper key: 17
    port priority: 255
    port number: 1
    port state: 63


Bonding properties with the stock kernel NIC driver:

[root@penguin2 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 20000
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 1c:34:da:6c:ed:fa
Active Aggregator Info:
	Aggregator ID: 1
	Number of ports: 2
	Actor Key: 21
	Partner Key: 17
	Partner Mac Address: 44:38:39:ff:00:43

Slave Interface: extern1
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 1c:34:da:6c:ed:fa
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 1c:34:da:6c:ed:fa
    port key: 21
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 65535
    system mac address: 44:38:39:ff:00:43
    oper key: 17
    port priority: 255
    port number: 1
    port state: 63

Slave Interface: extern2
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 1c:34:da:6c:ed:fb
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 1c:34:da:6c:ed:fa
    port key: 21
    port priority: 255
    port number: 2
    port state: 63
details partner lacp pdu:
    system priority: 65535
    system mac address: 44:38:39:ff:00:43
    oper key: 17
    port priority: 255
    port number: 1
    port state: 63
Comment 4 Jay Vosburgh 2020-10-28 20:54:29 UTC
(In reply to Onno Zweers from comment #3)
> Hi Jay,
> 
> You are correct, there are several (Cumulus) switches in between server 1
> and server 2. The servers are in two different racks; each rack has two
> top-of-rack switches and each slave is connected to a different top-of-rack
> switch (to have redundancy on the network level). The top-of-rack switches
> are connected to end-of-row switches that are interlinked. The switches are
> configured for LACP.

Ok, so I've had a chance to look at the packet captures, and I think I understand at least some of what you're seeing.

I think the effect you describe is most notable in the "intelX710" captures: what appears to be happening is that the switch -> bond traffic is hashing to one of the slaves, and the bond -> switch traffic is hashing to the other.  This is normal behavior, as the hashing occurs on egress, and the switch's egress hash need not map to the same port as the bond's egress hash (with two ports in the bond, it's a 50/50 chance).

Looking at the packet flow, the "slave1" capture contains all of the port 24000 -> 41464 direction traffic, and the "slave2" capture all of the port 41464 -> 24000 traffic.  I don't see any examples of traffic for a specific (unidirectional) flow appearing in both of the X710 captures, so flows do not appear to be flapping between the two ports of the bond.
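
(For reference, this is roughly how the directions can be checked against the captures, using the port numbers from these traces:)

    tcpdump -nn -r slave1.pcap 'tcp src port 24000 and tcp dst port 41464' | wc -l
    tcpdump -nn -r slave2.pcap 'tcp src port 41464 and tcp dst port 24000' | wc -l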

In the "intelXXV710" captures, it appears that everything hashed to the same device in both directions, and in the "mellanox" captures everything except for one direction of the control connection is in the "slave2" capture.

The TCP retransmissions and related anomalies in the captures appear to result from lost packets, not from packets arriving out of order because they were sent on the wrong interface.  I did not examine these exhaustively, but looked at a dozen or so.
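
(One way to separate the two cases, assuming tshark is available; I only spot-checked, so this is not an exhaustive analysis:)

    # segments the dissector flags as retransmitted vs. genuinely out of order
    tshark -r slave1.pcap -Y tcp.analysis.retransmission | wc -l
    tshark -r slave1.pcap -Y tcp.analysis.out_of_order | wc -l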

> Interestingly, between two servers within one rack we don't see the
> retransmits. We suspect that the top-of-rack switches are then capable of
> avoiding misordering of packets. But with a path TOR->EOR->EOR->TOR we think
> there is misordering of packets.

I think this may be the actual key, in that something in the longer switch path
is losing packets.  As I mentioned, the captures appear to show packet loss, not misordering.
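
(If so, the drop/discard counters on the server NICs and on the switch ports along that path would be worth checking; for example, on the servers, with interface names as examples:)

    ethtool -S ens2f0 | grep -iE 'drop|discard'
    ethtool -S ens2f1 | grep -iE 'drop|discard'
    ip -s link show dev bond0      # rx/tx errors and drops at the bond level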

> Here are the bonding properties (with the MLNX-OFED driver):
> 
> [root@penguin1 ~]# cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[...]

Looking through both of these status files, I don't see anything obviously
amiss.  Both show 1 aggregator with two active ports.

-J
