Bug 213729 - PMTUD failure with ECMP.
Summary: PMTUD failure with ECMP.
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-07-14 13:43 UTC by Sam Kappen
Modified: 2022-04-22 13:50 UTC
CC: 6 users

See Also:
Kernel Version: 5.13.0-rc5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Ecmp pmtud test setup (78.76 KB, application/pdf)
2021-07-14 13:43 UTC, Sam Kappen

Description Sam Kappen 2021-07-14 13:43:51 UTC
Created attachment 297849
Ecmp pmtud test setup

PMTUD failure with ECMP.

We have observed failures when PMTUD and ECMP operate together: ping fails through either gateway1 or gateway2 when the packet size exceeds 1500 bytes, depending on which nexthop ECMP selects. The issue has been reproduced on CentOS 8 and on mainline kernels.


Kernel versions: 
[root@localhost ~]# uname -a
Linux localhost.localdomain 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@localhost skappen]# uname -a
Linux localhost.localdomain 5.13.0-rc5 #2 SMP Thu Jun 10 05:06:28 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux


Static routes with ECMP are configured like this:

[root@localhost skappen]# ip route
default proto static 
	nexthop via 192.168.0.11 dev enp0s3 weight 1 
	nexthop via 192.168.0.12 dev enp0s3 weight 1 
192.168.0.0/24 dev enp0s3 proto kernel scope link src 192.168.0.4 metric 100

So the host would pick the first or the second nexthop depending on ECMP's hashing algorithm.
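For reference, the route above can be recreated with standard iproute2 commands, and which nexthop gets picked is governed by the multipath hash policy sysctl (shown here for illustration; the sysctl value was not captured on this setup):

ip route add default \
    nexthop via 192.168.0.11 dev enp0s3 weight 1 \
    nexthop via 192.168.0.12 dev enp0s3 weight 1

# 0 = hash on src/dst address (L3), 1 = hash on the five-tuple (L4)
sysctl net.ipv4.fib_multipath_hash_policy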

When pinging the destination with packets larger than 1500 bytes, PMTUD works through the first gateway:

[root@localhost skappen]# ping -s1700 10.0.3.17
PING 10.0.3.17 (10.0.3.17) 1700(1728) bytes of data.
From 192.168.0.11 icmp_seq=1 Frag needed and DF set (mtu = 1500)
1708 bytes from 10.0.3.17: icmp_seq=2 ttl=63 time=0.880 ms
1708 bytes from 10.0.3.17: icmp_seq=3 ttl=63 time=1.26 ms
^C
--- 10.0.3.17 ping statistics ---
3 packets transmitted, 2 received, +1 errors, 33.3333% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.880/1.067/1.255/0.190 ms

The MTU also gets cached for this route, as per RFC 1191 (Path MTU Discovery):

[root@localhost skappen]# ip route get 10.0.3.17
10.0.3.17 via 192.168.0.11 dev enp0s3 src 192.168.0.4 uid 0 
    cache expires 540sec mtu 1500 
    
[root@localhost skappen]# tracepath -n 10.0.3.17
 1?: [LOCALHOST]                      pmtu 1500
 1:  192.168.0.11                                          1.475ms 
 1:  192.168.0.11                                          0.995ms 
 2:  192.168.0.11                                          1.075ms !H
     Resume: pmtu 1500         
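
(To retest from a clean state, the cached PMTU exception can be dropped first; this is the standard iproute2 command, nothing specific to this setup:)

ip route flush cache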

However, when the second nexthop is picked, PMTUD breaks. In this example I ping a second interface configured on the same destination
from the same host, using the same routes and gateways. Based on ECMP's hashing algorithm, this host picks the second nexthop (192.168.0.12):

[root@localhost skappen]# ping -s1700 10.0.3.18
PING 10.0.3.18 (10.0.3.18) 1700(1728) bytes of data.
From 192.168.0.12 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 192.168.0.12 icmp_seq=2 Frag needed and DF set (mtu = 1500)
From 192.168.0.12 icmp_seq=3 Frag needed and DF set (mtu = 1500)
^C
--- 10.0.3.18 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2062ms
[root@localhost skappen]# ip route get 10.0.3.18
10.0.3.18 via 192.168.0.12 dev enp0s3 src 192.168.0.4 uid 0 
    cache 

[root@localhost skappen]# tracepath -n 10.0.3.18
 1?: [LOCALHOST]                      pmtu 9000
 1:  192.168.0.12                                          3.147ms 
 1:  192.168.0.12                                          0.696ms 
 2:  192.168.0.12                                          0.648ms pmtu 1500
 2:  192.168.0.12                                          0.761ms !H
     Resume: pmtu 1500     

The ICMP Frag Needed message reaches the host, but in this case it is ignored, and the MTU for this route does not get cached either.

It looks like the MTU value for this nexthop is not properly updated for some reason.
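
(For completeness, the FIB entry being matched, as opposed to the resolved per-destination entry, can be inspected with the fibmatch flag; this is a standard iproute2 option, output omitted here:)

ip route get 10.0.3.18 fibmatch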


Test Case:
Create 2 networks: Internal, External
Create 4 virtual machines: Client, GW-1, GW-2, Destination

Client
configure 1 NIC on the internal network with MTU 9000
configure a static ECMP route via the internal addresses of GW-1 and GW-2

GW-1, GW-2
configure 2 NICs:
- one on the internal network with MTU 9000
- one on the external network with MTU 1500
enable packet forwarding (net.ipv4.ip_forward)

Destination
configure 1 NIC on the external network with MTU 1500
configure multiple IP addresses (say IP1, IP2, IP3, IP4) on the same interface, so ECMP's hashing algorithm picks different routes

Test
ping from the client to the destination with packets larger than 1500 bytes
ping the other addresses of the destination so that ECMP also uses the other route
(a sketch of the corresponding setup commands follows the results below)

Results observed:
Through GW-1, PMTUD works: after the first Frag Needed message, the MTU is lowered on the client side for this destination. Through GW-2, PMTUD does not work: all responses to the ping are ICMP Frag Needed messages, which the kernel does not obey.
In all failure cases, the MTU is not cached according to "ip route get".
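
A sketch of the setup commands, assuming placeholder device names (eth0/eth1) and the addresses used in this report:

# Client: internal NIC at MTU 9000 plus the ECMP default route
ip link set dev eth0 mtu 9000
ip route add default \
    nexthop via 192.168.0.11 dev eth0 weight 1 \
    nexthop via 192.168.0.12 dev eth0 weight 1

# GW-1 / GW-2: internal side at 9000, external side at 1500, forwarding on
ip link set dev eth0 mtu 9000
ip link set dev eth1 mtu 1500
sysctl -w net.ipv4.ip_forward=1

# Destination: external NIC at 1500 with several addresses on it
ip link set dev eth0 mtu 1500
ip addr add 10.0.3.17/24 dev eth0
ip addr add 10.0.3.18/24 dev eth0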
Comment 1 Vadim Fedorenko 2021-07-21 00:44:38 UTC
I can confirm that neither route picks up the PMTU value until the source IP is explicitly stated. Will try to investigate it
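
(For reference, the behavior with an explicit source address can be checked with standard iproute2/ping options; the addresses are the ones from this report:)

ip route get 10.0.3.18 from 192.168.0.4
ping -s1700 -I 192.168.0.4 10.0.3.18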
Comment 2 Sam Kappen 2021-07-29 15:15:23 UTC
(In reply to Vadim Fedorenko from comment #1)
> I can confirm that neither route picks up the PMTU value until the source
> IP is explicitly stated. Will try to investigate it

Thanks.
It is observed that in the successful scenarios, rt->pmtu in the function "ipv4_mtu" always returns the correct PMTU of the nexthop (here 1500).
In the failure cases, two different struct rtable instances are passed to "ipv4_mtu": one returns the host MTU (here 9000), and even though the second returns an MTU of 1500, fragmentation still does not happen and the ping fails.
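
(For anyone reproducing this: one way to watch what ipv4_mtu returns is a kretprobe through tracefs. This is generic ftrace usage, assuming CONFIG_KPROBE_EVENTS is enabled and ipv4_mtu is not inlined in the running build:)

cd /sys/kernel/tracing
echo 'r:ipv4_mtu_ret ipv4_mtu ret=$retval' >> kprobe_events
echo 1 > events/kprobes/ipv4_mtu_ret/enable
cat trace_pipe    # prints the MTU returned on every call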
Comment 3 Vadim Fedorenko 2021-07-29 17:00:45 UTC
At least for ICMP, the actual problem is that raw_sendmsg selects different routes for checking the MTU and for actually sending the skb. I'm not sure about other protocols; I have to investigate a bit more.
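(The per-packet route selection can also be observed from userspace via the fib:fib_table_lookup tracepoint; again, generic tracing usage, not verified on this particular setup:)

echo 1 > /sys/kernel/tracing/events/fib/fib_table_lookup/enable
cat /sys/kernel/tracing/trace_pipe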
Comment 4 Sam Kappen 2021-10-09 04:11:41 UTC
Hi,

Could you give an update if there are any further findings or thoughts on this issue?


Regards,
Sam
Comment 5 Kfir Itzhak 2021-12-01 00:03:38 UTC
Hi,

I am also experiencing this issue. I reported it about a year ago, and David Ahern (thank you) worked on it and a patch was merged; however, it didn't fix the issue.

I use kernel 5.4.78. This version supposedly includes the patch (commit 2fbc6e89b2f1403189e624cabaf73e189c5e50c6).
The patch: https://lore.kernel.org/all/20200925124724.448531559@linuxfoundation.org/
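
(To double-check that the backport is actually present in a given stable tree, matching by subject is more reliable than by SHA, since stable backports get new commit IDs; run inside a kernel git checkout:)

git log --oneline v5.4.78 -- net/ipv4/route.c | grep -i 'multipath routes'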

I would also like to get this fixed; it prevents me from using ECMP in some places, impacting our redundancy.
