When gretap is used to encapsulate Ethernet packets into IP packets, the encapsulated IP packets are larger than the original Ethernet packets, as expected.

Let's say you create a gre0 interface with a 1500-byte MTU (since this interface will later be inserted in a bridge interface, its MTU must be 1500). And let's say the GRE-encapsulated packet (now larger than 1500 bytes) is going to be routed over an IP interface with a 1500-byte MTU.

The expected behavior would be that the encapsulated packet is fragmented. The observed behavior is that any encapsulated packet over 1500 bytes is simply dropped and an ICMP "fragmentation needed" message is sent to ... who knows.

My feeling is that the DF bit is not playing nice here.
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Fri, 18 Dec 2009 23:10:01 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14837
>
> Summary: gretap does not fragment IP packets
> Product: Networking
> Version: 2.5
> Kernel Version: 2.6.32
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: IPV4
> AssignedTo: shemminger@linux-foundation.org
> ReportedBy: benoit.papillault@free.fr
> Regression: No
>
>
> When gretap is used to encapsulate Ethernet packets into IP packets, the
> encapsulated IP packets are larger than the original Ethernet packet, as
> expected.
>
> Let's say you create a gre0 interface with a 1500 bytes MTU (since this
> interface will latter be inserted in a bridge interface, its MTU must be
> 1500).
> And Let's say the GRE encapsulated packet (now larger than 1500 bytes) is
> going to be routed over an IP interface with a 1500 bytes MTU.
>
> The expected behavior would be that the encapsulated packet be fragmented.
> The observed behavior is that any encapsulated packets over 1500 bytes are
> simply dropped and an ICMP "fragmentation needed" message is sent to ...
> who knows.
>
> My feeling is that DF bit is not playing nice here.
Reply-To: shemminger@vyatta.com

On Fri, 18 Dec 2009 15:32:09 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

> > The expected behavior would be that the encapsulated packet be fragmented.
> > The observed behavior is that any encapsulated packets over 1500 bytes are
> > simply dropped and an ICMP "fragmentation needed" message is sent to ...
> > who knows.
> >
> > My feeling is that DF bit is not playing nice here.

TCP uses the DF bit to do path MTU discovery. If your firewall et al. doesn't do ICMP correctly, then this is the classic TCP path MTU discovery ICMP black hole problem.

http://www.ietf.org/rfc/rfc2923.txt
Reply-To: hadi@cyberus.ca

On Fri, 2009-12-18 at 15:32 -0800, Andrew Morton wrote:

> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Fri, 18 Dec 2009 23:10:01 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=14837
> >
> > Summary: gretap does not fragment IP packets
> > Product: Networking
> > Version: 2.5
> > Kernel Version: 2.6.32
> > Platform: All
> > OS/Version: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: normal
> > Priority: P1
> > Component: IPV4
> > AssignedTo: shemminger@linux-foundation.org
> > ReportedBy: benoit.papillault@free.fr
> > Regression: No
> >
> >
> > When gretap is used to encapsulate Ethernet packets into IP packets, the
> > encapsulated IP packets are larger than the original Ethernet packet, as
> > expected.
> >
> > Let's say you create a gre0 interface with a 1500 bytes MTU (since this
> > interface will latter be inserted in a bridge interface, its MTU must be
> > 1500).
> > And Let's say the GRE encapsulated packet (now larger than 1500 bytes) is
> > going to be routed over an IP interface with a 1500 bytes MTU.
> >
> > The expected behavior would be that the encapsulated packet be fragmented.
> > The observed behavior is that any encapsulated packets over 1500 bytes are
> > simply dropped and an ICMP "fragmentation needed" message is sent to ...
> > who knows.

Sending back an ICMP is good behavior. Sending it "who knows" is not ;->
Make sure it is sent to the originator of the packet. The originator of the packet should play nice and reduce the path mtu.

One work around is to reduce the gre device mtu to something less than 1500B.

cheers,
jamal
jamal wrote:
> Sending back an ICMP is good behavior. Sending it "who knows" is not ;->
> Make sure it is sent to the originator of the packet. The originator of
> the packet should play nice and reduce the path mtu.
>
> One work around is to reduce the gre device mtu to something less than
> 1500B.
>
> cheers,
> jamal

As I explained in my original message, the gre device MTU must be 1500 bytes (since it is used in an Ethernet bridge). To reproduce the problem, I did a very simple setup with two machines (A & B) connected with an Ethernet cable (so no router between them).

On machine A:
# ip link add gre0 type gretap local <A> remote <B>
# ifconfig gre0 mtu 1500
# ifconfig gre0 192.192.192.1 up

On machine B:
# ip link add gre0 type gretap local <B> remote <A>
# ifconfig gre0 mtu 1500
# ifconfig gre0 192.192.192.2 up

On machine A:
# ping 192.192.192.2
=> working
# ping -s 1434 192.192.192.2
=> working, matches a GRE packet of 1500 bytes
# ping -s 1435 192.192.192.2
=> not working, matches a GRE packet of 1501 bytes (1435+8+20+38)
# ping -s 1472 192.192.192.2
=> not working, matches an IP packet of 1500 bytes

Running tcpdump on the machine (e.g. tcpdump -pni any) shows that ICMP packets are simply dropped! Using tracepath 192.192.192.2, a tcpdump -pni lo shows:

IP 192.192.192.1 > 192.192.192.1: ICMP 192.192.192.2 unreachable - need to frag (mtu 1500), length 556

Regards,
Benoit
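The packet sizes in the ping tests above can be reproduced with a little shell arithmetic. This is only a sketch: the function name gretap_outer_len is made up for illustration, and the split of the 38 bytes of overhead into 14 (inner Ethernet header) + 4 (base GRE header) + 20 (outer IPv4 header) is an assumption that holds for IPv4 without options and GRE without key, checksum or sequence number.

```shell
# Outer IPv4 packet size for a ping sent through a gretap tunnel.
# Assumption: IPv4 without options, GRE without key/checksum/sequence.
gretap_outer_len() {
    # $1 = ping payload size (the value given to "ping -s")
    icmp_hdr=8        # ICMP echo header
    inner_ip_hdr=20   # inner IPv4 header
    eth_hdr=14        # inner Ethernet header carried inside GRE
    gre_hdr=4         # base GRE header
    outer_ip_hdr=20   # outer IPv4 header
    echo $(( $1 + icmp_hdr + inner_ip_hdr + eth_hdr + gre_hdr + outer_ip_hdr ))
}

gretap_outer_len 1434   # prints 1500: fits the underlying 1500-byte MTU
gretap_outer_len 1435   # prints 1501: one byte too large
gretap_outer_len 1472   # prints 1538: the inner IP packet alone is 1500 bytes
```

Under the same assumptions, the largest inner IP packet that fits without fragmenting the outer packet is 1500 - 38 = 1462 bytes.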
Reply-To: hadi@cyberus.ca

On Mon, 2009-12-21 at 02:17 +0100, Benoit PAPILLAULT wrote:

> As I explained in my original message, the gre device MTU must be 1500
> bytes (since it is used in an Ethernet bridge).

Ok, sorry i missed this bit. I didnt realize that the bridge device had such draconian enforcement. Bridge picks whatever the lowest common denominator is for MTU (I suspect so as to not keep track of all the interfaces; good policy, IMO, should allow a user to shoot themselves in the toe while defaulting to the min mtu).
From the looks of it, this enforcement could be changed with a one line patch - but not being privy to the reasoning it would be unfair of me to do so from my comfortable couch. Stephen?

> To reproduce the
> problem, I did a very simple setup with two machines (A & B) connected
> with an Ethernet cable (so no router between them).
>
> On machine A :
> # ip link add gre0 type gretap local <A> remote <B>
> # ifconfig gre0 mtu 1500

What i meant is try to ifconfig gre0 to something small like 1420B

cheers,
jamal
On Mon, 21 Dec 2009 19:09:12 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14837
>
> --- Comment #6 from Anonymous Emailer <anonymous@kernel-bugs.osdl.org>
> 2009-12-21 19:09:09 ---
> Reply-To: hadi@cyberus.ca
>
> On Mon, 2009-12-21 at 02:17 +0100, Benoit PAPILLAULT wrote:
>
> > As I explained in my original message, the gre device MTU must be 1500
> > bytes (since it is used in an Ethernet bridge).
>
> Ok, sorry i missed this bit. I didnt realize that the bridge device had
> such draconian enforcement. Bridge picks whatever the lowest common
> denominator is for MTU (I suspect so as to not keep track of all the
> interfaces; good policy, IMO, should allow a user to shoot themselves in
> the toe while defaulting to the min mtu).

No, it is part of the IEEE standard: a bridge has to drop overlength packets.
I narrowed the problem further down to http://lxr.linux.no/linux+v2.6.32/net/ipv4/ip_gre.c#L765. I created the tunnel with:

# ip link add mygre type gretap local <local IP> remote <remote IP> nopmtudisc ttl 64

Then, doing a tracepath over the mygre interface generates a UDP packet (thus an IP packet) with the DF bit set. Tracing through the code, we now have:
- skb->protocol = htons(ETH_P_IP) ... even if the skb contains an Ethernet header.
- old_iph->frag_off & htons(IP_DF) is true since the DF bit is set
- the mtu condition matches since tracepath is sending big packets on purpose

At this point, icmp_send is called, but it returns immediately since struct rtable *rt = skb_rtable(skb_in) is NULL. Not sure what it really means...

So, to sum up:
- any oversized IP packet with the DF bit set triggers an attempt to send an ICMP message, and the packet is dropped
- since the ICMP message is not sent, we are in a deadlock trying to send big packets in order to know the real MTU

Note:
- if we replace "gretap" by "gre", icmp_send works as expected.
- according to some Cisco whitepapers (http://www.cisco.com/en/US/tech/tk827/tk369/technologies_white_paper09186a00800d6979.shtml), I think we should just fragment the packet in this case instead of sending an ICMP message, because I set "nopmtudisc".
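The behaviour summarised above can be restated as a small decision table. The sketch below is a toy model in shell, not kernel code: the function name gre_xmit_outcome and its arguments are hypothetical, and the "usable MTU" argument stands in for whatever value the kernel compares the inner packet length against.

```shell
# Toy model of the check described in this comment; not kernel code.
# $1 = inner DF bit (0 or 1)
# $2 = inner IP packet length in bytes
# $3 = usable MTU for the inner packet (hypothetical stand-in)
# $4 = 1 if the skb has a route attached, 0 if not (0 in the gretap case)
gre_xmit_outcome() {
    if [ "$1" -eq 1 ] && [ "$2" -gt "$3" ]; then
        if [ "$4" -eq 1 ]; then
            echo "drop, ICMP frag-needed sent"
        else
            echo "drop, ICMP not sent"
        fi
    else
        echo "transmit"
    fi
}

gre_xmit_outcome 1 1500 1462 0   # gretap case: drop, ICMP not sent
gre_xmit_outcome 1 1500 1462 1   # gre case: drop, ICMP frag-needed sent
gre_xmit_outcome 0 1500 1462 0   # DF clear: transmit
```

The first call is the deadlock described above: the sender never learns the real MTU because the ICMP message that would tell it is never sent.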
I wrote a fix for:
- sending ICMP messages even if rt=NULL (I compute a valid rt value; this might need review)
- the DF bit in the GRE header. DF is set only if PMTU is requested (tiph->frag_off & htons(IP_DF)) AND we are encapsulating either IPv6, or IPv4 with the DF bit set

It has been tested with "gre" and "gretap" interfaces and with a router that has a lower MTU on the path between the two tunnel endpoints (I artificially lowered the MTU on the intermediate router to 1400 instead of 1500).
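The DF rule stated in this comment can be written out as a truth function. This is a sketch of the rule as worded here, not of the attached patch itself; the function name outer_df and its argument encoding are made up for illustration.

```shell
# DF bit for the outer packet, following the rule as worded in this comment.
# $1 = 1 if PMTU discovery is requested on the tunnel, 0 for "nopmtudisc"
# $2 = inner protocol: "ipv4" or "ipv6" (anything else is treated as neither)
# $3 = inner IPv4 DF bit (0 or 1; not consulted for ipv6)
outer_df() {
    if [ "$1" -ne 1 ]; then
        echo 0      # PMTU discovery not requested: never set DF
    elif [ "$2" = "ipv6" ]; then
        echo 1      # encapsulating IPv6
    elif [ "$2" = "ipv4" ] && [ "$3" -eq 1 ]; then
        echo 1      # encapsulating IPv4 with DF set
    else
        echo 0      # IPv4 without DF, or a non-IP frame
    fi
}

outer_df 1 ipv4 1   # prints 1
outer_df 1 ipv4 0   # prints 0
outer_df 0 ipv4 1   # prints 0
outer_df 1 ipv6     # prints 1
```

With this rule, a tunnel created with "nopmtudisc" never sets DF on the outer packet, so an oversized outer packet can be fragmented instead of dropped.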
Created attachment 24332: 0001-ip_gre-Fix-ICMP-message-and-DF-bit-settings.patch
First of all, my best wishes for year 2010! Any comments on the patch I sent? Should I send it on a mailing list for a broader audience? Regards, Benoit
Reply-To: hadi@cyberus.ca

Salut Benoit,

I didnt see any patch...

Also did you try changing the mtu per suggestion i made?

People tend to be busy and sometimes dont read the mailing list. To get proper answers, always CC the maintainers. In this case CC Stephen Hemminger - he maintains the bridging code. I am ccing Herbert Xu as well - he may have opinions on the gre side of things. My suggestion is you repost your issue along with your patch and describe why it solves your problem.

cheers,
jamal

On Thu, 2010-01-07 at 15:30 +0100, Benoit PAPILLAULT wrote:
> First of all, my best wishes for year 2010!
>
> Any comments on the patch I sent? Should I send it on a mailing list for
> a broader audience?
>
> Regards,
> Benoit
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Jamal,

jamal wrote:
> Salut Benoit,
>
> I didnt see any patch...
>
> Also did you try changing the mtu per suggestion i made?

I cannot change the MTU, by design, since I need the gretap interface to be part of an Ethernet bridge.

> People tend to be busy and sometimes dont read the mailing list. To
> get proper answers, always CC the maintainers. In this case CC Stephen
> Hemminger - he maintains the bridging code. I am ccing Herbert Xu as
> well - he may have opinions on the gre side of things. My suggestion is
> you repost your issue along with your patch and describe why it solves
> your problem.

I know people can be busy, I am just trying to push things forward. The original bug is here:
http://bugzilla.kernel.org/show_bug.cgi?id=14837

The patch is available here:
http://bugzilla.kernel.org/attachment.cgi?id=24332

This patch fixes my issue since either packets get fragmented or an ICMP error packet is sent back to the sender.

Regards,
Benoit

PS: Congrats to Herbert Xu, who fixed a bug in gretap before I had the time to report it (it was about the broadcast addr, which was incorrect).

> cheers,
> jamal
On Sun, Jan 10, 2010 at 06:43:56PM +0100, Benoit PAPILLAULT wrote:
>
> I know people can be busy, I am just trying to push things forward.
> Original bug is there :
> http://bugzilla.kernel.org/show_bug.cgi?id=14837
>
> Patch is available here :
> http://bugzilla.kernel.org/attachment.cgi?id=24332
>
> This patch fixes my issue since either packets get fragmented or an ICMP
> error packet is sent back to the sender.

I was actually working on this back in November before getting side-tracked by travelling. I'll look into this again.

Thanks,
i just got burned by this running ubuntu lucid 64bit.

i have ipsec between my sites, and i was trying to set up high-availability using a virtual ethernet bridge across my sites: 2 gateways at each site, with all gateways part of the same virtual bridge with stp.

got everything working, except any packets over ~1300 would get dropped / errored in my tap device. ifconfig showed errors.

i saw l2tpv3 was available in 2.6.35 (which might also solve my problem), but unfortunately i'm stuck at 2.6.32
Has anyone tested this patch on CentOS 6? I'm using 2.6.32-71.29.1.el6 and have the gretap device on a bridge with an incoming vlan-tagged interface. Inbound traffic which explicitly sets the MSS size is causing a kernel panic when this patch is applied.
I don't think this bug has been fixed. Should I reopen it with a more recent kernel if it still applies ?
This is still not fixed as of 4.4 and the kernel is incorrectly applying the DF setting from IP packets sent over the gretap interface to the outer IP (GRE) packets. The gretap interface is supposed to be a layer 2 tunnel, so the IP (GRE) packets must always be fragmented to maintain the original MTU of the interface. The gretap interface must have an MTU of 1500 if it is to match the other interfaces it is bridged to. The underlying MTU of the network that the tunnel goes over is irrelevant to the gretap interface.
This bug has been fixed. Tested on kernel 5.10.10.

You must create the gretap link with the following parameters: "ignore-df nopmtudisc", as documented in this bug: https://bugzilla.kernel.org/show_bug.cgi?id=211175

Cheers,
Marco
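For reference, a link using those two parameters might be created roughly as follows. This is a dry-run sketch: the helper gretap_cmds is hypothetical and only prints the commands instead of running them, the link name mygre is borrowed from earlier in this thread, and the addresses are placeholders from the 192.0.2.0/24 documentation range. The second command reflects the 1500-byte MTU requirement discussed throughout this bug; whether it is needed on a given setup is an assumption.

```shell
# Print (do not run) the commands for a gretap link whose outer packets
# may be fragmented. Link name and addresses are placeholders.
gretap_cmds() {
    # $1 = link name, $2 = local tunnel endpoint, $3 = remote tunnel endpoint
    echo "ip link add $1 type gretap local $2 remote $3 ignore-df nopmtudisc"
    echo "ip link set dev $1 mtu 1500 up"
}

gretap_cmds mygre 192.0.2.1 192.0.2.2
```

Running the printed commands requires root; review them before piping the output to a shell.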