Bug 209767
| Summary: | Bonding 802.3ad layer2+3 transmits on both slaves within single connection | | |
|---|---|---|---|
| Product: | Networking | Reporter: | Onno Zweers (onno.zweers) |
| Component: | Other | Assignee: | Stephen Hemminger (stephen) |
| Status: | NEW | | |
| Severity: | normal | CC: | jay.vosburgh |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 5.8.11-1.el8.elrepo.x86_64 and 3.10.0-1127.19.1.el7.x86_64 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Onno Zweers
2020-10-20 10:42:34 UTC
The mentioned pcap files can be downloaded from https://surfdrive.surf.nl/files/index.php/s/LseSNOdcHSrIs9l .

Jay Vosburgh:

Onno,

No, what you describe doesn't sound like the expected behavior. In your configuration, from your description, it appears the two servers are not connected directly to one another, and there is at least one switch in between, correct? If there is a switch, are the appropriate switch ports (those connected directly to the servers) configured for LACP mode?

Could you attach the contents of /proc/net/bonding/bond0 (replacing "bond0" with the name of your actual bonding interface, if different) from both servers? Please read the file as root (otherwise some information is omitted).

Onno Zweers:

Hi Jay,

You are correct, there are several (Cumulus) switches in between server 1 and server 2. The servers are in two different racks; each rack has two top-of-rack switches, and each slave is connected to a different top-of-rack switch (to have redundancy at the network level). The top-of-rack switches are connected to end-of-row switches that are interlinked. The switches are configured for LACP.

Interestingly, between two servers within one rack we don't see the retransmits. We suspect that the top-of-row switches are then capable of avoiding misordering of packets. But with a path TOR->EOR->EOR->TOR we think there is misordering of packets.
Here's the bonding properties (with the MLNX-OFED driver):

```
[root@penguin1 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 20000
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 1c:34:da:49:1d:81
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 21
        Partner Key: 17
        Partner Mac Address: 44:38:39:ff:00:43

Slave Interface: ens2f1
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 1c:34:da:49:1d:81
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 1c:34:da:49:1d:81
    port key: 21
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 65535
    system mac address: 44:38:39:ff:00:43
    oper key: 17
    port priority: 255
    port number: 1
    port state: 63

Slave Interface: ens2f0
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 1c:34:da:49:1d:80
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 1c:34:da:49:1d:81
    port key: 21
    port priority: 255
    port number: 2
    port state: 63
details partner lacp pdu:
    system priority: 65535
    system mac address: 44:38:39:ff:00:43
    oper key: 17
    port priority: 255
    port number: 1
    port state: 63
```

Bonding properties with the stock kernel NIC driver:

```
[root@penguin2 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 20000
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 1c:34:da:6c:ed:fa
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 21
        Partner Key: 17
        Partner Mac Address: 44:38:39:ff:00:43

Slave Interface: extern1
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 1c:34:da:6c:ed:fa
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 1c:34:da:6c:ed:fa
    port key: 21
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 65535
    system mac address: 44:38:39:ff:00:43
    oper key: 17
    port priority: 255
    port number: 1
    port state: 63

Slave Interface: extern2
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 1c:34:da:6c:ed:fb
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 1c:34:da:6c:ed:fa
    port key: 21
    port priority: 255
    port number: 2
    port state: 63
details partner lacp pdu:
    system priority: 65535
    system mac address: 44:38:39:ff:00:43
    oper key: 17
    port priority: 255
    port number: 1
    port state: 63
```

Jay Vosburgh:

(In reply to Onno Zweers from comment #3)
> Hi Jay,
>
> You are correct, there are several (Cumulus) switches in between server 1
> and server 2. The servers are in two different racks; each rack has two
> top-of-rack switches and each slave is connected to a different top-of-rack
> switch (to have redundancy on the network level). The top-of-rack switches
> are connected to end-of-row switches that are interlinked. The switches are
> configured for LACP.
Ok, so I've had a chance to look at the packet captures, and I think I understand at least some of what you're seeing.

The effect you describe is, I think, most notable in the "intelX710" captures. What appears to be happening here is that the switch -> bond traffic is hashing to one of the slaves, and the bond -> switch traffic is hashing to the other. This is normal behavior: the hashing occurs on egress, and the switch's egress hash may not map exactly to the bond's egress hash (with two ports in the bond, it's a 50/50 chance).

Looking at the packet flow, the "slave1" capture contains all of the port 24000 -> 41464 direction traffic, and the "slave2" capture all of the port 41464 -> 24000 traffic. I don't see any examples of traffic for a specific (unidirectional) flow appearing in both of the X710 captures, so flows do not appear to be flapping between the two ports of the bond.

In the "intelXXV710" captures, it appears that everything hashed to the same device in both directions, and in the "mellanox" captures everything except for one direction of the control connection is in the "slave2" capture.

The TCP protocol adventures in the captures appear to result from lost packets, not from packets arriving out of order because they were sent on the wrong interface. I did not examine these exhaustively, but I looked at a dozen or so.

> Interestingly, between two servers within one rack we don't see the
> retransmits. We suspect that the top-of-row switches are then capable of
> avoiding misordering of packets. But with a path TOR->EOR->EOR->TOR we think
> there is misordering of packets.

I think this may be the actual key, in that something in the longer switch path is losing packets. As I mentioned, the captures appear to show packet loss, not misordering.

> Here's the bonding properties (with MLNX-OFED driver):
>
> [root@penguin1 ~]# cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[...]
Looking through both of these status files, I don't see anything obviously amiss. Both show 1 aggregator with two active ports.

-J
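For background on the behavior Jay describes, the layer2+3 policy shown in the status files above is specified in the kernel's bonding documentation (Documentation/networking/bonding.rst) as an XOR of MAC, protocol, and IP fields. The sketch below is an illustrative Python model of that documented formula, not the kernel code itself; the MAC addresses are the system/partner MACs from the status files, while the IP addresses and slave count are made-up example values. It illustrates why every packet of a given flow, in either direction, selects the same bond slave, even though a switch along the path runs its own independent egress hash and may pick the other link.

```python
# Illustrative model of the bonding driver's layer2+3 transmit hash, per
# the formula in the kernel bonding documentation (simplified; not the
# actual kernel implementation).
from ipaddress import IPv4Address

ETH_P_IP = 0x0800  # packet type ID for IPv4


def layer2_3_hash(src_mac: str, dst_mac: str,
                  src_ip: str, dst_ip: str, n_slaves: int) -> int:
    """Return the slave index chosen for a frame: XOR the last bytes of
    the MACs with the packet type ID, XOR in the IP addresses, fold the
    result down, and reduce modulo the slave count."""
    h = int(src_mac.split(":")[-1], 16) ^ int(dst_mac.split(":")[-1], 16)
    h ^= ETH_P_IP
    h ^= int(IPv4Address(src_ip)) ^ int(IPv4Address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return h % n_slaves


# Because XOR is symmetric in source and destination, both directions of
# a flow hash to the same slave on the bond -- but a switch on the path
# computes its *own* hash, which may or may not land on the same link.
# MACs are from the status files above; IPs are made-up examples.
fwd = layer2_3_hash("1c:34:da:49:1d:81", "44:38:39:ff:00:43",
                    "10.0.0.1", "10.0.0.2", 2)
rev = layer2_3_hash("44:38:39:ff:00:43", "1c:34:da:49:1d:81",
                    "10.0.0.2", "10.0.0.1", 2)
assert fwd == rev
```

Note that only the addresses feed the hash, so every packet of a TCP connection stays on one slave; seeing each *direction* of a connection on a different slave (as in the X710 captures) reflects the switch's hash choice, not flapping by the bond.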