Bug 212921 - ECMP not working for local sockets
Summary: ECMP not working for local sockets
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4
Hardware: All Linux
Importance: P1 blocking
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-05-02 07:10 UTC by Nitin Issac Joy
Modified: 2021-05-05 20:57 UTC
CC List: 2 users

See Also:
Kernel Version: 5.8.0-50-generic
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Nitin Issac Joy 2021-05-02 07:10:44 UTC
When creating local TCP sockets on a Linux machine, connections to the same destination IP are not load-balanced across multiple interfaces when an ECMP route is set. Even when net.ipv4.fib_multipath_hash_policy is set to the L4 hash, multiple interfaces are never used for the same destination.
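For reference, a minimal setup that should exercise ECMP path selection looks like the sketch below. The interface names, gateway, and addresses are placeholders, not values from this report:

```shell
# Hash on the L4 4-tuple (saddr, daddr, sport, dport) instead of the
# default L3 (saddr/daddr) hash
sysctl -w net.ipv4.fib_multipath_hash_policy=1

# One multipath route with two equal-weight nexthops (placeholder values)
ip route replace default \
    nexthop via 192.0.2.1 dev eth0 weight 1 \
    nexthop via 192.0.2.1 dev eth1 weight 1
```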

I tried working around the issue by setting two route table entries with the same metric using the `ip route append` command. In that case, connections are load-balanced across multiple interfaces for the first 5-10 seconds, after which all subsequent connections stick to one of the interfaces. I found no configuration that disables this behavior. I also tried disabling tcp_metrics_nosave.
Comment 1 Nitin Issac Joy 2021-05-02 07:19:08 UTC
tcp_metrics_nosave -> net.ipv4.tcp_no_metrics_save=1
Comment 2 Nitin Issac Joy 2021-05-02 10:32:05 UTC
The two route table entries are in the main route table and both have the same default gateway.
ip route add default via <gw> dev eth0 metric 100
ip route append default via <gw> dev eth1 metric 100

This load-balances outbound connections to the same destination <ip>:<port> for the first 5-10 seconds, after which all subsequent connections stick to one interface.
Comment 3 David Ahern 2021-05-04 02:20:54 UTC
For a socket not bound to an address, a FIB lookup is done with path selection based on saddr == 0; this is to set the source address for the socket. Later, another FIB lookup is done to decide how to route the packet. That lookup also picks a path, but this time saddr is not 0, resulting in a different hash value, which can select a different path.

You can see this with:
perf record -e fib:* -a -g -- <run a command to connect to a server>
<ctrl-c>
perf script
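The effect described above can be illustrated with a toy sketch. This is not the kernel's actual hash (fib_multipath_hash uses a jhash over the flow keys); cksum merely stands in for it. The point is that hashing the 4-tuple while saddr and sport are still zero can land on a different path than hashing it after the source address and port have been filled in:

```shell
# Toy stand-in for multipath selection: hash the 4-tuple, pick path 0 or 1.
# cksum is illustrative only; the kernel uses jhash internally.
pick_path() {
    # args: saddr daddr sport dport
    printf '%s %s %s %s' "$1" "$2" "$3" "$4" | cksum | awk '{ print $1 % 2 }'
}

# First lookup: source address/port not chosen yet, so they hash as zero
pick_path 0.0.0.0 198.51.100.7 0 443

# Second lookup: source address/port now set; the hash (and path) can differ
pick_path 192.0.2.10 198.51.100.7 49152 443
```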
Comment 4 Nitin Issac Joy 2021-05-04 07:39:39 UTC
Hi David, 

Thanks a lot for the response! I tried the command, and I can see fib_table_lookup being called twice.

I was trying to understand why the FIB trie selects source addresses randomly in the first 5-10 seconds but later sticks with one consistently. You mentioned the use of a hash table - I was betting on the route cache having been removed, so that the source address would be selected randomly among all available interfaces that can reach the default gateway. Reading through the code (function __mkroute_output), it looks like, for UNICAST, routes are cached after all (although only for local sockets). Would it be viable to add a sysctl to disable this cache?

We're building a stress-testing platform focused on generating outbound connections to the same destination <IP>:<port>. The success metric is whether we can generate 256k connections irrespective of the application, so that we can effectively utilize the available CPU/memory capacity of a large VM. Hence the need.
Comment 5 David Ahern 2021-05-05 14:16:36 UTC
I mentioned a hash, not a hash table. See fib_select_path, fib_multipath_hash, and fib_select_multipath.
Comment 6 Nitin Issac Joy 2021-05-05 20:57:03 UTC
(In reply to David Ahern from comment #5)
> I mentioned a hash, not a hashtable. See fib_select_path, fib_multipath_hash
> and fib_select_multipath

I see. The initial path lookup is done with saddr and sport as 0, a source address is selected, a source port is selected for the connection, and a second lookup is done based on the new source addr and port (L4 hash). After a certain threshold of ports being used, the hash stops changing for the second lookup - is that because the ephemeral ports selected are well distributed initially but, after 2000 connections or so, end up closer to each other, causing the hash to stick with one path? I don't think I increased the port range; I can try. Thanks for the info.
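For the port-range experiment mentioned above, widening the ephemeral port range is a standard sysctl tweak (the values below are just an example, not a recommendation from this thread):

```shell
# Show the current ephemeral (local) port range
sysctl net.ipv4.ip_local_port_range

# Widen it so connect() has more source ports to draw from (example values)
sysctl -w net.ipv4.ip_local_port_range="1025 65535"
```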
