Bug 14837

Summary: gretap does not fragment IP packets
Product: Networking Reporter: Benoit PAPILLAULT (benoit.papillault)
Component: IPV4 Assignee: Stephen Hemminger (stephen)
Status: CLOSED OBSOLETE    
Severity: normal CC: alan, anthrax, bugzilla, christian_durieux, kenyon, pupilla, rhuddusa, uothrawn
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32 Subsystem:
Regression: No Bisected commit-id:
Attachments: 0001-ip_gre-Fix-ICMP-message-and-DF-bit-settings.patch

Description Benoit PAPILLAULT 2009-12-18 23:10:00 UTC
When gretap is used to encapsulate Ethernet packets into IP packets, the encapsulated IP packets are larger than the original Ethernet packet, as expected.

Let's say you create a gre0 interface with a 1500-byte MTU (since this interface will later be inserted in a bridge interface, its MTU must be 1500). And let's say the GRE-encapsulated packet (now larger than 1500 bytes) is going to be routed over an IP interface with a 1500-byte MTU.

The expected behavior would be that the encapsulated packet be fragmented. The observed behavior is that any encapsulated packets over 1500 bytes are simply dropped and an ICMP "fragmentation needed" message is sent to ... who knows.

My feeling is that the DF bit is not playing nice here.
Comment 1 Andrew Morton 2009-12-18 23:32:42 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 18 Dec 2009 23:10:01 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14837
> 
>            Summary: gretap does not fragment IP packets
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.32
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: benoit.papillault@free.fr
>         Regression: No
> 
> 
> When gretap is used to encapsulate Ethernet packets into IP packets, the
> encapsulated IP packets are larger than the original Ethernet packet, as
> expected.
> 
> Let's say you create a gre0 interface with a 1500-byte MTU (since this
> interface will later be inserted in a bridge interface, its MTU must be
> 1500).
> And let's say the GRE-encapsulated packet (now larger than 1500 bytes) is
> going to be routed over an IP interface with a 1500-byte MTU.
> 
> The expected behavior would be that the encapsulated packet be fragmented.
> The observed behavior is that any encapsulated packets over 1500 bytes are
> simply dropped and an ICMP "fragmentation needed" message is sent to ...
> who knows.
> 
> My feeling is that DF bit is not playing nice here.
>
Comment 2 Anonymous Emailer 2009-12-18 23:47:23 UTC
Reply-To: shemminger@vyatta.com

On Fri, 18 Dec 2009 15:32:09 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> > 
> > The expected behavior would be that the encapsulated packet be fragmented.
> > The observed behavior is that any encapsulated packets over 1500 bytes are
> > simply dropped and an ICMP "fragmentation needed" message is sent to ...
> > who knows.
> > 
> > My feeling is that DF bit is not playing nice here.
> >   

TCP uses the DF bit to do path MTU discovery.  If your firewall et al. don't
handle ICMP correctly, then this is the classic TCP path MTU discovery ICMP
black hole problem.

http://www.ietf.org/rfc/rfc2923.txt
Comment 3 Anonymous Emailer 2009-12-19 12:55:09 UTC
Reply-To: hadi@cyberus.ca

On Fri, 2009-12-18 at 15:32 -0800, Andrew Morton wrote:
> > The expected behavior would be that the encapsulated packet be fragmented.
> > The observed behavior is that any encapsulated packets over 1500 bytes are
> > simply dropped and an ICMP "fragmentation needed" message is sent to ...
> > who knows.

Sending back an ICMP is good behavior. Sending it "who knows" is not ;->
Make sure it is sent to the originator of the packet. The originator of
the packet should play nice and reduce the path mtu.

One workaround is to reduce the gre device MTU to something less than
1500B.

cheers,
jamal
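
For reference, the largest tunnel MTU that avoids fragmentation can be estimated with a small sketch like the one below. It assumes a plain gretap tunnel over IPv4 with no GRE key, checksum, or sequence options, which gives 38 bytes of overhead (the figure quoted later in this thread); the breakdown into 20 + 4 + 14 and the helper name max_tunnel_mtu are illustrative assumptions, not part of the report.

```shell
#!/bin/sh
# Assumed per-packet overhead of a plain gretap tunnel over IPv4:
# outer IPv4 header (20) + GRE header without options (4)
# + encapsulated Ethernet header (14) = 38 bytes.
OVERHEAD=$(( 20 + 4 + 14 ))

# Hypothetical helper: largest gretap MTU that still fits in one
# unfragmented outer packet. $1 = MTU of the underlying interface.
max_tunnel_mtu() {
    echo $(( $1 - OVERHEAD ))
}

# Print (do not run) the command that would apply the workaround.
echo "ip link set dev gre0 mtu $(max_tunnel_mtu 1500)"
```

This only helps when the tunnel MTU is free to change; it does not apply when the device has to keep a 1500-byte MTU to sit in a bridge.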
Comment 4 Benoit PAPILLAULT 2009-12-21 01:17:56 UTC
jamal wrote:
> Sending back an ICMP is good behavior. Sending it "who knows" is not ;->
> Make sure it is sent to the originator of the packet. The originator of
> the packet should play nice and reduce the path mtu.
>
> One work around is to reduce the gre device mtu to something less than
> 1500B.
>
> cheers,
> jamal
>
>
>   
As I explained in my original message, the gre device MTU must be 1500 
bytes (since it is used in an Ethernet bridge). To reproduce the 
problem, I did a very simple setup with two machines (A & B) connected 
with an Ethernet cable (so no router between them).

On machine A :
# ip link add gre0 type gretap local <A> remote <B>
# ifconfig gre0 mtu 1500
# ifconfig gre0 192.192.192.1 up

On machine B:
# ip link add gre0 type gretap local <B> remote <A>
# ifconfig gre0 mtu 1500
# ifconfig gre0 192.192.192.2 up

On machine A:
# ping 192.192.192.2 => working
# ping -s 1434 192.192.192.2 => working, matches a GRE packet of 1500 bytes
# ping -s 1435 192.192.192.2 => not working, matches a GRE packet of 1501 bytes (1435+8+20+38)
# ping -s 1472 192.192.192.2 => not working, matches an inner IP packet of 1500 bytes


Doing a tcpdump on the machine (like tcpdump -pni any) shows that ICMP 
packets are simply dropped!

Using tracepath 192.192.192.2, a tcpdump -pni lo shows:
IP 192.192.192.1 > 192.192.192.1: ICMP 192.192.192.2 unreachable - need to frag (mtu 1500), length 556

Regards,
Benoit
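
The packet sizes quoted in the ping tests above can be reproduced with a short calculation. This is only a sketch: the 8, 20, and 38 byte terms come from the comment itself, while splitting the 38 bytes into outer IPv4 (20), GRE without options (4), and inner Ethernet (14) is an assumption, and gre_packet_size is a hypothetical helper name.

```shell
#!/bin/sh
# Header sizes in bytes. The 38-byte tunnel overhead from the comment is
# assumed to break down as OUTER_IP + GRE + INNER_ETH.
OUTER_IP=20   # outer IPv4 header
GRE=4         # GRE header, no key/checksum/sequence options
INNER_ETH=14  # encapsulated Ethernet header
INNER_IP=20   # inner IPv4 header
ICMP=8        # ICMP echo header

# Hypothetical helper: size of the outer IP packet for "ping -s $1".
gre_packet_size() {
    echo $(( $1 + ICMP + INNER_IP + INNER_ETH + GRE + OUTER_IP ))
}

for payload in 1434 1435 1472; do
    echo "ping -s $payload => outer IP packet of $(gre_packet_size "$payload") bytes"
done
```

With a 1500-byte MTU on the underlying interface, 1434 is the largest payload whose encapsulated packet still fits without fragmentation.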
Comment 6 Anonymous Emailer 2009-12-21 19:09:09 UTC
Reply-To: hadi@cyberus.ca

On Mon, 2009-12-21 at 02:17 +0100, Benoit PAPILLAULT wrote:

> >   
> As I explained in my original message, the gre device MTU must be 1500 
> bytes (since it is used in an Ethernet bridge). 

Ok, sorry i missed this bit. I didnt realize that the bridge device had
such draconian enforcement. Bridge picks whatever the lowest common
denominator is for MTU (I suspect so as to not keep track of all the
interfaces; good policy, IMO, should allow a user to shoot themselves in
the toe while defaulting to the min mtu).
From the looks of it, this enforcement could be changed with a one line
patch - but not being privy to the reasoning it would be unfair of me to
do so from my comfortable couch. Stephen?

> To reproduce the 
> problem, I did a very simple setup with two machines (A & B) connected 
> with an Ethernet cable (so no router between them).
> 
> On machine A :
> # ip link add gre0 type gretap local <A> remote <B>
> # ifconfig gre0 mtu 1500

What i meant is try to ifconfig gre0 to something small like 1420B

cheers,
jamal
Comment 7 Stephen Hemminger 2009-12-21 19:21:11 UTC
On Mon, 21 Dec 2009 19:09:12 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> Ok, sorry i missed this bit. I didnt realize that the bridge device had
> such draconian enforcement. Bridge picks whatever the lowest common
> denominator is for MTU (I suspect so as to not keep track of all the
> interfaces; good policy, IMO, should allow a user to shoot themselves in
> the toe while defaulting to the min mtu).

No it is part of the IEEE standard, bridge has to drop overlength
packets.
Comment 8 Benoit PAPILLAULT 2009-12-26 21:46:42 UTC
I narrowed the problem down further to http://lxr.linux.no/linux+v2.6.32/net/ipv4/ip_gre.c#L765.

I created the tunnel with :
# ip link add mygre type gretap local <local IP> remote <remote IP> nopmtudisc ttl 64

Then, doing a tracepath over the mygre interface generates a UDP packet (thus an IP packet) with the DF bit set. Tracing through the code, we now have:
- skb->protocol = htons(ETH_P_IP) ... even if the skb contains an Ethernet header.
- old_iph->frag_off & htons(IP_DF) is true since DF bit is set
- the mtu condition matches since tracepath is sending big packets on purpose

At this point, icmp_send is called, but it immediately stops since struct rtable *rt = skb_rtable(skb_in) is indeed NULL. Not sure what it really means...

So, to sum up :
- any IP packet with DF bit set will try to send an ICMP message and drop the packet
- since the ICMP message is not sent, we are in a deadlock trying to send big packets in order to know the real MTU

Note:
- if we replace "gretap" by "gre", the icmp_send works as expected.
- according to some Cisco whitepapers (http://www.cisco.com/en/US/tech/tk827/tk369/technologies_white_paper09186a00800d6979.shtml), I think we should just fragment the packet in this case instead of sending an ICMP message because I set "nopmtudisc".
Comment 9 Benoit PAPILLAULT 2009-12-28 21:00:18 UTC
I wrote a fix that:
- sends ICMP messages even if rt=NULL (I compute a valid rt value; this might need review)
- fixes the DF bit in the outer IP header of the GRE packet. DF is set only if PMTU discovery is requested (tiph->frag_off & htons(IP_DF)) AND we are encapsulating either IPv6, or IPv4 with the DF bit set

It has been tested with "gre" and "gretap" interfaces and with a router that has a lower MTU on the path between the two tunnel endpoints (I artificially lowered the MTU on the intermediate router to 1400 instead of 1500).
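
The DF rule described in this comment can be modelled as a small truth function. The sketch below is only an illustration of the stated rule, not the code of the attached patch; the function name outer_df and its argument encoding are hypothetical.

```shell
#!/bin/sh
# Hypothetical model of the rule: the outer DF bit is set only if PMTU
# discovery is requested on the tunnel AND the inner packet is either
# IPv6, or IPv4 with its own DF bit set.
#   $1 = pmtudisc requested on the tunnel (0 or 1)
#   $2 = inner protocol (ipv4, ipv6, or anything else)
#   $3 = DF bit of the inner IPv4 header (0 or 1)
outer_df() {
    if [ "$1" -ne 1 ]; then
        echo 0   # nopmtudisc: never set DF on the outer header
        return
    fi
    case "$2" in
        ipv6) echo 1 ;;      # IPv6 is never fragmented by routers
        ipv4) echo "$3" ;;   # copy the inner DF bit
        *)    echo 0 ;;      # other payloads: leave DF clear
    esac
}

outer_df 1 ipv4 1   # PMTU discovery on, inner DF set
outer_df 0 ipv4 1   # nopmtudisc, inner DF set
```

Under this model a tunnel created with nopmtudisc always emits outer packets that may be fragmented, whatever the inner packet asks for.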
Comment 10 Benoit PAPILLAULT 2009-12-28 21:05:04 UTC
Created attachment 24332 [details]
0001-ip_gre-Fix-ICMP-message-and-DF-bit-settings.patch
Comment 11 Benoit PAPILLAULT 2010-01-07 14:30:30 UTC
First of all, my best wishes for year 2010!

Any comments on the patch I sent? Should I send it on a mailing list for 
a broader audience?

Regards,
Benoit
Comment 13 Anonymous Emailer 2010-01-10 16:04:06 UTC
Reply-To: hadi@cyberus.ca

Salut Benoit,

I didnt see any patch...

Also did you try changing the mtu per suggestion i made?
People tend to be busy and sometimes dont read the mailing list. To
get proper answers, always CC the maintainers. In this case CC Stephen
Hemminger - he maintains the bridging code. I am ccing Herbert Xu as
well - he may have opinions on the gre side of things. My suggestion is
you repost your issue along with your patch and describe why it solves
your problem.

cheers,
jamal

On Thu, 2010-01-07 at 15:30 +0100, Benoit PAPILLAULT wrote:
> First of all, my best wishes for year 2010!
> 
> Any comments on the patch I sent? Should I send it on a mailing list for 
> a broader audience?
> 
> Regards,
> Benoit
Comment 14 Benoit PAPILLAULT 2010-01-10 17:44:11 UTC
Hello Jamal,

jamal wrote:
> Salut Benoit,
>
> I didnt see any patch...
>
> Also did you try changing the mtu per suggestion i made?
>   
I cannot change the MTU, by design, since I need the gretap 
interface to be part of an Ethernet bridge.
> People tend to be busy and sometimes dont read the mailing list. To
> get proper answers, always CC the maintainers. In this case CC Stephen
> Hemminger - he maintains the bridging code. I am ccing Herbert Xu as
> well - he may have opinions on the gre side of things. My suggestion is
> you repost your issue along with your patch and describe why it solves
> your problem.
>   
I know people can be busy, I am just trying to push things forward. 
Original bug is there :
http://bugzilla.kernel.org/show_bug.cgi?id=14837

Patch is available here :
http://bugzilla.kernel.org/attachment.cgi?id=24332

This patch fixes my issue since either packets get fragmented or an ICMP 
error packet is sent back to the sender.

Regards,
Benoit
PS: Congrats to Herbert Xu, who fixed a bug in gretap before I had the 
time to report it (it was about the broadcast address, which was incorrect).

> cheers,
> jamal
>
>
Comment 16 Herbert Xu 2010-01-10 22:00:03 UTC
On Sun, Jan 10, 2010 at 06:43:56PM +0100, Benoit PAPILLAULT wrote:
>
> I know people can be busy, I am just trying to push things forward.  
> Original bug is there :
> http://bugzilla.kernel.org/show_bug.cgi?id=14837
>
> Patch is available here :
> http://bugzilla.kernel.org/attachment.cgi?id=24332
>
> This patch fixes my issue since either packets get fragmented or an ICMP  
> error packet is sent back to the sender.

I was actually working on this back in November before getting
side-tracked by travelling.

I'll look into this again.

Thanks,
Comment 17 Richard Huddleston 2010-09-19 03:51:38 UTC
i just got burned by this running ubuntu lucid 64bit 

i have ipsec between my sites, and i was trying to set up high-availability using a virtual ethernet bridge across my sites.  2 gateways at each site with all gateways part of the same virtual bridge with stp.

got everything working, except any packets over ~1300 would get  dropped / errored in my tap device.  ifconfig showed errors 

i saw l2tpv3 was available in 2.6.35 (which might also solve my problem), but unfortunately i'm stuck at 2.6.32
Comment 18 g h 2011-08-29 19:27:12 UTC
Has anyone tested this patch on CentOS 6? I'm using 2.6.32-71.29.1.el6 and have the gretap device on a bridge with an incoming vlan-tagged interface. Inbound traffic which explicitly sets the MSS size is causing a kernel panic when this patch is applied.
Comment 19 Benoit PAPILLAULT 2012-06-18 17:50:12 UTC
I don't think this bug has been fixed. Should I reopen it against a more recent kernel if it still applies?
Comment 20 Simon Arlott 2020-07-18 15:51:10 UTC
This is still not fixed as of 4.4 and the kernel is incorrectly applying the DF setting from IP packets sent over the gretap interface to the outer IP (GRE) packets.

The gretap interface is supposed to be a layer 2 tunnel, so the IP (GRE) packets must always be fragmented to maintain the original MTU of the interface.

The gretap interface must have an MTU of 1500 if it is to match the other interfaces it is bridged to. The underlying MTU of the network that the tunnel goes over is irrelevant to the gretap interface.
Comment 21 Marco Berizzi 2021-01-29 18:18:13 UTC
This bug has been fixed. Tested on kernel 5.10.10.

You must create the gretap link with the following parameters:

"ignore-df nopmtudisc"

as documented in this bug https://bugzilla.kernel.org/show_bug.cgi?id=211175

Cheers,
Marco
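
As a rough sketch, a link created with those two parameters might look like the command printed below. The interface name mygre is borrowed from comment 8, the endpoint addresses are placeholder documentation addresses rather than values from this report, and gretap_cmd is a hypothetical helper that only prints the command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Placeholder tunnel endpoints (assumed, not from the report).
LOCAL_IP=${LOCAL_IP:-192.0.2.1}
REMOTE_IP=${REMOTE_IP:-192.0.2.2}

# Hypothetical helper: print the ip-link command for a gretap device that
# ignores the inner DF bit and does not do PMTU discovery on the tunnel.
#   $1 = local endpoint, $2 = remote endpoint
gretap_cmd() {
    echo "ip link add mygre type gretap local $1 remote $2 ignore-df nopmtudisc"
}

gretap_cmd "$LOCAL_IP" "$REMOTE_IP"
```

With both options set, oversized encapsulated packets are expected to be fragmented on the outer IP layer, so the gretap device can keep the 1500-byte MTU needed for bridging.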