When creating local TCP sockets on a Linux machine, connections to the same destination IP are not load-balanced across multiple interfaces when an ECMP route is configured. Even with net.ipv4.fib_multipath_hash_policy set to L4 hash, multiple interfaces are never used for the same destination. I tried working around the issue by adding two route table entries with the same metric using the `ip route append` command. In this case, connections are load-balanced across the interfaces for 5-10 seconds, after which all future connections choose one of the interfaces. I could not find any configuration that disables this behavior. I also tried disabling TCP metrics saving (tcp_metrics_nosave).
tcp_metrics_nosave -> net.ipv4.tcp_no_metrics_save=1
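For anyone reproducing this, it may help to confirm what the two settings named above are actually set to on the test machine. The following is a minimal sketch that reads them from /proc/sys; the helper name `read_sysctls` and its `root` parameter are illustrative assumptions, not part of any existing tool.

```python
from pathlib import Path


def read_sysctls(names, root="/proc/sys"):
    """Return {sysctl_name: value or None} by reading files under `root`.

    A dotted name such as net.ipv4.tcp_no_metrics_save maps to the path
    net/ipv4/tcp_no_metrics_save. This simple mapping is an assumption that
    holds for the names used here (no dots inside a path component).
    """
    values = {}
    for name in names:
        path = Path(root) / name.replace(".", "/")
        try:
            values[name] = path.read_text().strip()
        except OSError:
            # Not readable or not present on this kernel.
            values[name] = None
    return values


if __name__ == "__main__":
    wanted = [
        "net.ipv4.fib_multipath_hash_policy",  # 1 selects the L4 hash
        "net.ipv4.tcp_no_metrics_save",
    ]
    for name, value in read_sysctls(wanted).items():
        print(f"{name} = {value}")
```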
Both route table entries are in the main route table and have the same default gateway:

    ip route add default via <gw> dev eth0 metric 100
    ip route append default via <gw> dev eth1 metric 100

This load-balances outbound connections to the same destination <ip>:<port> for the first 5-10 seconds, after which all future connections stick to one interface.
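One way to observe the behavior described here is to open a batch of connections to the destination and count which local source address each one ends up with. This is only a sketch of such a measurement, not the tool used in this report; the function name `tally_source_addresses` and its parameters are hypothetical.

```python
import collections
import socket


def tally_source_addresses(dest_ip, dest_port, count, timeout=2.0):
    """Open `count` TCP connections one after another and count the local
    (source) IP address that the kernel selected for each of them."""
    tally = collections.Counter()
    for _ in range(count):
        # The socket is not bound explicitly, so the kernel picks the
        # source address and the ephemeral source port.
        with socket.create_connection((dest_ip, dest_port), timeout=timeout) as s:
            local_ip = s.getsockname()[0]
            tally[local_ip] += 1
    return tally


# Example call (destination address and port are placeholders):
#   print(tally_source_addresses("192.0.2.10", 8080, 1000))
```

Running it repeatedly over 20-30 seconds and comparing the counts per source address should show whether the distribution collapses onto one interface as reported.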
For a socket not bound to an address, a FIB lookup is done with path selection based on saddr == 0; this is to set the source address for the socket. Later, another FIB lookup is done to decide how to route the packet. That lookup also picks a path, but this time saddr is not 0, resulting in a different hash value, which could select a different path. You can see this with:

    perf record -e fib:* -a -g -- <run a command to connect to a server>
    <ctrl-c>
    perf script
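The two-lookup sequence can be illustrated with a toy model. The sketch below is not the kernel's code: the hash is a SHA-256 stand-in rather than the real flow hash, path selection is simplified to a modulo over the path list, and the interface names and addresses are made up. It also assumes the source port is 0 in the first lookup. Its only purpose is to show that the first lookup (saddr == 0) and the second lookup (real saddr) hash different inputs and can therefore land on different paths.

```python
import hashlib
import ipaddress


def toy_flow_hash(saddr, daddr, sport, dport, proto=6):
    """Stand-in for an L4 flow hash. NOT the kernel's fib_multipath_hash."""
    key = b"".join([
        ipaddress.IPv4Address(saddr).packed,
        ipaddress.IPv4Address(daddr).packed,
        sport.to_bytes(2, "big"),
        dport.to_bytes(2, "big"),
        bytes([proto]),
    ])
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def select_path(paths, saddr, daddr, sport, dport):
    """Pick one path from the hash (simplified to a modulo here)."""
    return paths[toy_flow_hash(saddr, daddr, sport, dport) % len(paths)]


def connect_lookups(paths, daddr, dport, sport):
    """Model the two lookups for an unbound socket.

    First lookup: saddr (and, by assumption, sport) are 0; its result only
    decides the source address. Second lookup: the chosen source address
    and port are hashed, and its result decides the path used for routing.
    """
    first = select_path(paths, "0.0.0.0", daddr, 0, dport)
    saddr = first["src"]
    second = select_path(paths, saddr, daddr, sport, dport)
    return first, second


if __name__ == "__main__":
    # Hypothetical interfaces and addresses, for illustration only.
    paths = [
        {"dev": "eth0", "src": "192.0.2.2"},
        {"dev": "eth1", "src": "192.0.2.130"},
    ]
    for sport in range(40000, 40008):
        first, second = connect_lookups(paths, "198.51.100.10", 443, sport)
        print(sport, "source from", first["dev"], "routed via", second["dev"])
```

In this model the first lookup always returns the same path for a given destination, because none of its inputs change between connections, while the second lookup varies with the source port.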
Hi David,

Thanks a lot for the response! I tried out the command and can see fib_table_lookup being called twice. I am trying to understand why the FIB trie selects source addresses randomly in the first 5-10 seconds but later sticks with one consistently. You mentioned the use of a hash table; I had assumed that, with the route cache removed, the source address would be selected randomly among all available interfaces that can reach the default gateway. Reading through the code for __mkroute_output, it looks like unicast routes are cached after all (although only for local sockets). Is it viable to add a sysctl to disable the cache?

We're building a stress-testing platform focused on generating outbound connections to the same destination <IP>:<port>, so the success metric is whether we can generate 256k connections irrespective of the application, so that we can effectively utilize the available CPU/memory capacity of a large VM. Hence the need.
I mentioned a hash, not a hashtable. See fib_select_path, fib_multipath_hash and fib_select_multipath
(In reply to David Ahern from comment #5)
> I mentioned a hash, not a hashtable. See fib_select_path, fib_multipath_hash
> and fib_select_multipath

I see. The initial path lookup is done with saddr and sport as 0, the source address is selected, the source port is selected for the connection, and the second lookup is done based on the new source address and port (L4 hash). After a certain number of ports are in use, the hash stops changing for the second lookup. Is that because the ephemeral ports selected are well distributed initially but, after 2000 connections or so, end up being closer to each other, which causes the hash to stick with one path? I don't think I increased the port range; I can try that. Thanks for the info.