Latest working kernel version: 2.6.26.2
Earliest failing kernel version: 2.6.27-rc2 (maybe earlier)
Distribution: Ubuntu
Hardware Environment: x86_64
Software Environment: 32-bit userspace/64-bit kernel

Problem Description: When using iptables to intercept addr:port and reroute
through an ssh tunnel, I see a huge performance hit on the 2.6.27-rc series
relative to 2.6.26 (34KB/s vs 1+MB/s).

Steps to reproduce:

Set up an ssh tunnel to one of the kernel.org servers using a system on your
local network:

ssh -L 8888:204.152.191.37:80 <local system>

Leave the ssh session running.  In a new terminal (on your local system),
verify the performance of direct access versus the tunnel:

wget -O /dev/null http://204.152.191.37/pub/linux/kernel/v2.6/linux-2.6.26.2.tar.bz2
wget -O /dev/null http://127.0.0.1:8888/pub/linux/kernel/v2.6/linux-2.6.26.2.tar.bz2

These should be roughly the same.  Now set up iptables so that when you try to
access 204.152.191.37:80 you'll automatically be redirected to the ssh tunnel:

sudo iptables -t nat -N bug
sudo iptables -t nat -I OUTPUT 1 -j bug
sudo iptables -t nat -A bug -d 204.152.191.37 -p tcp --dport 80 -j DNAT --to-destination 127.0.0.1:8888

Repeat the performance test:

wget -O /dev/null http://204.152.191.37/pub/linux/kernel/v2.6/linux-2.6.26.2.tar.bz2
wget -O /dev/null http://127.0.0.1:8888/pub/linux/kernel/v2.6/linux-2.6.26.2.tar.bz2

On 2.6.27-rc2+, my rate quickly drops to ~34KB/s using the iptables-NAT'd
wget (204.152.191.37), while the ssh tunnel still runs at 1+MB/s.  On 2.6.26
I get similar performance for both paths.
Reply-To: akpm@linux-foundation.org

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 12 Aug 2008 22:04:41 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11316
>
>            Summary: severe performance regression for iptables nat routing
>            Product: Networking
>            Version: 2.5
>      KernelVersion: 2.6.27-rc3
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Netfilter/Iptables
>         AssignedTo: networking_netfilter-iptables@kernel-bugs.osdl.org
>         ReportedBy: alex.williamson@hp.com
>
> [full report quoted above; snipped]
git bisect traced the problem back to this changeset:

commit e5a4a72d4f88f4389e9340d383ca67031d1b8536
Author: Lennert Buytenhek <buytenh@marvell.com>
Date:   Sun Aug 3 01:23:10 2008 -0700

    net: use software GSO for SG+CSUM capable netdevices

I've verified that I can toggle the slowness by reverting this patch on
top of 8d0968ab (current head).  The problem is readily reproducible
using Ubuntu Hardy in a KVM VM with an upstream, defconfig kernel.

On Tue, 2008-08-12 at 22:12 -0700, Andrew Morton wrote:
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> [full report quoted above; snipped]
From: Alex Williamson <alex.williamson@hp.com>
Date: Wed, 13 Aug 2008 20:08:20 -0600

> git bisect traced the problem back to this changeset:
>
> commit e5a4a72d4f88f4389e9340d383ca67031d1b8536
> Author: Lennert Buytenhek <buytenh@marvell.com>
> Date:   Sun Aug 3 01:23:10 2008 -0700
>
>     net: use software GSO for SG+CSUM capable netdevices
>
> I've verified that I can toggle the slowness by reverting this patch on
> top of 8d0968ab (current head).  The problem is readily reproducible
> using Ubuntu Hardy in a KVM VM with an upstream, defconfig kernel.

Patrick, I wonder if there is a case where iptables NAT will COW the
packet when it really doesn't need to.

It seems, if anything, using GSO should make things go a little bit
faster, not slower...  Hmmm...

Anyways, if we can't figure this one out soon we can easily revert.
David Miller wrote:
> From: Alex Williamson <alex.williamson@hp.com>
> Date: Wed, 13 Aug 2008 20:08:20 -0600
>
>> git bisect traced the problem back to this changeset:
>>
>> commit e5a4a72d4f88f4389e9340d383ca67031d1b8536
>> Author: Lennert Buytenhek <buytenh@marvell.com>
>> Date:   Sun Aug 3 01:23:10 2008 -0700
>>
>>     net: use software GSO for SG+CSUM capable netdevices
>>
>> I've verified that I can toggle the slowness by reverting this patch on
>> top of 8d0968ab (current head).  The problem is readily reproducible
>> using Ubuntu Hardy in a KVM VM with an upstream, defconfig kernel.
>
> Patrick, I wonder if there is a case where iptables NAT will COW the
> packet when it really doesn't need to.

I don't think so; it's using skb_make_writable everywhere, which checks
skb_clone_writable, which should usually avoid COWing local TCP packets.
It would also be unlikely to have that much of a performance impact
(1+MB/s -> 34KB/s).

> It seems, if anything, using GSO should make things go a little bit
> faster, not slower...  Hmmm...

Alex, could you post a tcpdump from both loopback and the outgoing
device on the machine you're doing NAT on?
On Thu, 2008-08-14 at 13:04 +0200, Patrick McHardy wrote:
> I don't think so; it's using skb_make_writable everywhere, which checks
> skb_clone_writable, which should usually avoid COWing local TCP packets.
> It would also be unlikely to have that much of a performance impact
> (1+MB/s -> 34KB/s).
>
> Alex, could you post a tcpdump from both loopback and the outgoing
> device on the machine you're doing NAT on?

Attached; let me know if you want more options, this is just -vv -n.
The NAT'ing system is at 10.0.2.15 and the ssh tunnel target is
192.168.1.60.  Thanks,

	Alex
From: Patrick McHardy <kaber@trash.net>
Date: Thu, 14 Aug 2008 13:04:25 +0200

> David Miller wrote:
>> Patrick, I wonder if there is a case where iptables NAT will COW the
>> packet when it really doesn't need to.
>
> I don't think so; it's using skb_make_writable everywhere, which checks
> skb_clone_writable, which should usually avoid COWing local TCP packets.
> It would also be unlikely to have that much of a performance impact
> (1+MB/s -> 34KB/s).

I think he is NAT'ing locally generated traffic; look at the bugzilla
entry.

He has two cases of the same wget transfer: one is direct, and another
uses a 127.0.0.1:XXXX URL that does the transfer over an SSH tunnel.
Normally they go at roughly the same rate.

Then he adds iptables NAT entries that redirect the first transfer over
the SSH tunnel addr/port, and it is this case that degrades in
performance with the GSO changeset.

So it is locally generated TCP traffic, NAT'd to another port and IP
address (specifically, redirected to 127.0.0.1:8888).

Perhaps the problem has something to do with the fact that, as far as
TCP is concerned, the destination device can do SG and CSUM and thus
GSO.  But then iptables NATs this traffic to loopback.  I think that is
what leads to some kind of slowpath.
David Miller <davem@davemloft.net> wrote:
>
> Patrick, I wonder if there is a case where iptables NAT will COW the
> packet when it really doesn't need to.

This doesn't make sense.  He's downloading from a remote host, so GSO
shouldn't even come into play.

Cheers,
Alex Williamson <alex.williamson@hp.com> wrote:
>
> Attached; let me know if you want more options, this is just -vv -n.
> The NAT'ing system is at 10.0.2.15 and the ssh tunnel target is
> 192.168.1.60.  Thanks,

Right, the underlying TCP connection is going well, but the NATed
connection is getting checksum errors.  Please send us the raw packet
dump on lo (tcpdump -s 1600 -w file) so we can see what's wrong.

Actually, I think I know what's going on, but a raw packet dump should
confirm whether we're getting a partial checksum.

Thanks,
On Fri, 2008-08-15 at 14:44 +1000, Herbert Xu wrote:
> Right, the underlying TCP connection is going well, but the NATed
> connection is getting checksum errors.  Please send us the raw packet
> dump on lo (tcpdump -s 1600 -w file) so we can see what's wrong.

Here it is.  Thanks,

	Alex
On Fri, Aug 15, 2008 at 02:44:26PM +1000, Herbert Xu wrote:
>
> Actually, I think I know what's going on, but a raw packet dump should
> confirm whether we're getting a partial checksum.

Never mind, I think I've found the problem.

loopback: Drop obsolete ip_summed setting

Now that the network stack can handle inbound packets with partial
checksums, we should no longer clobber the ip_summed field in the
loopback driver.  This is because CHECKSUM_UNNECESSARY implies that
the checksum field is actually valid, which is not true for loopback
packets, since it's only partial (and thus complemented).

This allows packets from lo to then be SNATed to an external source
while still preserving the checksum's validity.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 49f6bc0..810e292 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -137,9 +137,6 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 
 	skb->protocol = eth_type_trans(skb,dev);
-#ifndef LOOPBACK_MUST_CHECKSUM
-	skb->ip_summed = CHECKSUM_UNNECESSARY;
-#endif
 
 #ifdef LOOPBACK_TSO
 	if (skb_is_gso(skb)) {
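[Editor's note: to see why labeling a partial checksum as CHECKSUM_UNNECESSARY
is wrong, here is a userspace sketch (Python, not kernel code; the pseudo-header
bytes are invented) of the RFC 1071 one's-complement checksum.  A
CHECKSUM_PARTIAL packet carries only the pseudo-header sum, which a device is
expected to complete later, so it does not verify as a final checksum:]

```python
def csum16(data: bytes, init: int = 0) -> int:
    """RFC 1071 one's-complement sum over 16-bit big-endian words."""
    s = init
    if len(data) % 2:
        data = data + b"\x00"          # pad odd-length input
    for i in range(0, len(data), 2):
        s += (data[i] << 8) | data[i + 1]
        s = (s & 0xFFFF) + (s >> 16)   # end-around carry
    return s

# Invented pseudo-header (src/dst 127.0.0.1, proto 6, length 20) and payload.
pseudo_hdr = b"\x7f\x00\x00\x01\x7f\x00\x00\x01\x00\x06\x00\x14"
payload = b"hello, loopback!"

# CHECKSUM_PARTIAL: only the pseudo-header has been summed so far; the
# device (or a software fallback) must still fold in the payload.
partial = csum16(pseudo_hdr)

# A valid final transport checksum is the complement of the sum over
# pseudo-header plus payload.
final = ~csum16(payload, init=partial) & 0xFFFF

# A receiver verifies by summing everything including the checksum field:
# only the genuinely final value folds to 0xFFFF.
assert csum16(payload + final.to_bytes(2, "big"), init=partial) == 0xFFFF
# The partial value is NOT valid as-is, so marking such a packet
# CHECKSUM_UNNECESSARY ("already verified") mislabels it.
assert csum16(payload + partial.to_bytes(2, "big"), init=partial) != 0xFFFF
```

The bytes and constants here are illustrative only; the real decision lives in
the kernel's skb->ip_summed handling.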
On Fri, 2008-08-15 at 15:35 +1000, Herbert Xu wrote:
> Never mind, I think I've found the problem.
>
> loopback: Drop obsolete ip_summed setting
>
> [...]

Nope, that doesn't fix it.  NAT'd throughput remains about the same.
Thanks,

	Alex
Alex Williamson <alex.williamson@hp.com> wrote:
>
> Nope, that doesn't fix it.  NAT'd throughput remains about the same.

Please take the raw packet dump on lo then.

Thanks,
On Thu, Aug 14, 2008 at 11:30:37PM -0600, Alex Williamson wrote:
>
> Here it is.  Thanks,

Can you also post all your netfilter rules (filter + NAT), please?

Thanks,
On Fri, Aug 15, 2008 at 05:33:43PM +1000, Herbert Xu wrote:
> Can you also post all your netfilter rules (filter + NAT), please?

It's OK, I can reproduce it now.

Cheers,
On Fri, Aug 15, 2008 at 06:14:42PM +1000, Herbert Xu wrote:
>
> It's OK, I can reproduce it now.

This fixes it for me.

loopback: Enable TSO

This patch enables TSO since the loopback device is naturally
capable of handling packets of any size.  This also means that
we won't enable GSO on lo, which is good until GSO is fixed to
preserve netfilter state, as netfilter treats loopback packets
in a special way.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

I'll work on the netfilter state preservation next.

Cheers,
--
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 49f6bc0..c11e621 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -234,9 +231,7 @@ static void loopback_setup(struct net_device *dev)
 	dev->type		= ARPHRD_LOOPBACK;	/* 0x0001*/
 	dev->flags		= IFF_LOOPBACK;
 	dev->features		= NETIF_F_SG | NETIF_F_FRAGLIST
-#ifdef LOOPBACK_TSO
 				  | NETIF_F_TSO
-#endif
 				  | NETIF_F_NO_CSUM
 				  | NETIF_F_HIGHDMA
 				  | NETIF_F_LLTX
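[Editor's note: why does advertising TSO change anything?  A toy model
(Python; the constants and path names are invented, this is not the actual
kernel logic) of the transmit-path decision the bisected changeset altered:]

```python
# Invented feature bits standing in for netdev feature flags.
NETIF_F_SG, NETIF_F_CSUM, NETIF_F_TSO = 1, 2, 4

def xmit_path(features: int, gso_size: int) -> str:
    """Pick how a large TCP skb leaves a device (illustrative sketch)."""
    if gso_size == 0:
        return "plain"              # small packet, nothing to segment
    if features & NETIF_F_TSO:
        return "passthrough"        # device segments; skb stays whole
    if (features & NETIF_F_SG) and (features & NETIF_F_CSUM):
        return "software-gso"       # stack segments via skb_gso_segment
    return "early-segment"          # TCP must segment before transmit

# Before the fix: lo advertised SG+CSUM but not TSO, so large loopback
# skbs went through software GSO -- which, pre-__copy_skb_header,
# dropped netfilter state and broke the NATed connection.
assert xmit_path(NETIF_F_SG | NETIF_F_CSUM, gso_size=16384) == "software-gso"
# After the fix: lo also advertises TSO, so the skb passes through intact.
assert xmit_path(NETIF_F_SG | NETIF_F_CSUM | NETIF_F_TSO, gso_size=16384) == "passthrough"
```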
On Fri, Aug 15, 2008 at 08:32:35PM +1000, Herbert Xu wrote:
>
> I'll work on the netfilter state preservation next.

Here it is:

net: Preserve netfilter attributes in skb_gso_segment using __copy_skb_header

skb_gso_segment didn't preserve some attributes in the original skb,
such as the netfilter fields.  This was harmless until they were used,
which is the case for packets going through lo.

This patch makes it call __copy_skb_header, which also picks up some
other missing attributes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8464017..ca1ccdf 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2256,14 +2256,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 			segs = nskb;
 		tail = nskb;
 
-		nskb->dev = skb->dev;
-		skb_copy_queue_mapping(nskb, skb);
-		nskb->priority = skb->priority;
-		nskb->protocol = skb->protocol;
-		nskb->vlan_tci = skb->vlan_tci;
-		nskb->dst = dst_clone(skb->dst);
-		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
-		nskb->pkt_type = skb->pkt_type;
+		__copy_skb_header(nskb, skb);
 		nskb->mac_len = skb->mac_len;
 
 		skb_reserve(nskb, headroom);
@@ -2274,6 +2267,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 			skb_copy_from_linear_data(skb, skb_put(nskb, doffset),
 						  doffset);
 		if (!sg) {
+			nskb->ip_summed = CHECKSUM_NONE;
 			nskb->csum = skb_copy_and_csum_bits(skb, offset,
 							    skb_put(nskb, len),
 							    len, 0);
@@ -2283,8 +2277,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 		frag = skb_shinfo(nskb)->frags;
 		k = 0;
 
-		nskb->ip_summed = CHECKSUM_PARTIAL;
-		nskb->csum = skb->csum;
 		skb_copy_from_linear_data_offset(skb, offset,
 						 skb_put(nskb, hsize), hsize);

Cheers,
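[Editor's note: the effect of switching skb_segment over to __copy_skb_header
can be sketched in miniature.  Python, illustrative only; 'mark' and 'nfct'
are stand-ins for the netfilter fields, not real skb members:]

```python
MSS = 4  # toy segment size

def segment(pkt: dict, copy_header) -> list:
    """Split one oversized 'packet' into MSS-sized segments, using
    copy_header() to carry metadata onto each segment (analogous to
    skb_segment calling __copy_skb_header on every nskb)."""
    data = pkt["payload"]
    segs = []
    for off in range(0, len(data), MSS):
        seg = copy_header(pkt)
        seg["payload"] = data[off:off + MSS]
        segs.append(seg)
    return segs

pkt = {"proto": "tcp", "mark": 0x1, "nfct": "conn#42",
       "payload": b"abcdefghij"}

# Buggy copier: copies only a few fields (the pre-fix behaviour).
partial_copy = lambda p: {"proto": p["proto"]}
# Fixed copier: copies every header field (the __copy_skb_header fix).
full_copy = lambda p: {k: v for k, v in p.items() if k != "payload"}

broken = segment(pkt, partial_copy)
fixed = segment(pkt, full_copy)
assert "nfct" not in broken[0]           # conntrack state lost per segment
assert fixed[0]["nfct"] == "conn#42"     # state preserved per segment
assert len(fixed) == 3 and fixed[-1]["payload"] == b"ij"
```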
Handled-By : Herbert Xu <herbert@gondor.apana.org.au>
Patch      : http://bugzilla.kernel.org/show_bug.cgi?id=11316#c15
Patch      : http://bugzilla.kernel.org/show_bug.cgi?id=11316#c16
On Fri, 2008-08-15 at 20:53 +1000, Herbert Xu wrote:
> On Fri, Aug 15, 2008 at 08:32:35PM +1000, Herbert Xu wrote:
>>
>> I'll work on the netfilter state preservation next.
>
> Here it is:

Confirmed, these patches solve the problem.  Thanks Herbert.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 15 Aug 2008 20:32:35 +1000

> loopback: Enable TSO
>
> This patch enables TSO since the loopback device is naturally
> capable of handling packets of any size.  This also means that
> we won't enable GSO on lo, which is good until GSO is fixed to
> preserve netfilter state, as netfilter treats loopback packets
> in a special way.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

This, effectively, "enables" LRO on loopback.

And sure, it's pretty obscure to shape, NAT, and end up forwarding
loopback-received packets, but do you want to be the user trying to do
something like that and trying to find this particular patch which is
causing it to not work? :-)

I really don't know whether it's worth worrying about; I just wanted
to mention it.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 15 Aug 2008 20:32:35 +1000

> loopback: Enable TSO
>
> [...]
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Meanwhile, I applied this and took the liberty of applying the
following right afterwards:

loopback: Remove rest of LOOPBACK_TSO code.

It hasn't been enabled for a long time, and the generic GSO engine is
better documentation of what is expected of a device implementing TSO.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/loopback.c |   62 ------------------------------------------------
 1 files changed, 0 insertions(+), 62 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 46e87cc..489d53b 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -64,68 +64,6 @@ struct pcpu_lstats {
 	unsigned long bytes;
 };
 
-/* KISS: just allocate small chunks and copy bits.
- *
- * So, in fact, this is documentation, explaining what we expect
- * of largesending device modulo TCP checksum, which is ignored for loopback.
- */
-
-#ifdef LOOPBACK_TSO
-static void emulate_large_send_offload(struct sk_buff *skb)
-{
-	struct iphdr *iph = ip_hdr(skb);
-	struct tcphdr *th = (struct tcphdr *)(skb_network_header(skb) +
-					      (iph->ihl * 4));
-	unsigned int doffset = (iph->ihl + th->doff) * 4;
-	unsigned int mtu = skb_shinfo(skb)->gso_size + doffset;
-	unsigned int offset = 0;
-	u32 seq = ntohl(th->seq);
-	u16 id = ntohs(iph->id);
-
-	while (offset + doffset < skb->len) {
-		unsigned int frag_size = min(mtu, skb->len - offset) - doffset;
-		struct sk_buff *nskb = alloc_skb(mtu + 32, GFP_ATOMIC);
-
-		if (!nskb)
-			break;
-		skb_reserve(nskb, 32);
-		skb_set_mac_header(nskb, -ETH_HLEN);
-		skb_reset_network_header(nskb);
-		iph = ip_hdr(nskb);
-		skb_copy_to_linear_data(nskb, skb_network_header(skb),
-					doffset);
-		if (skb_copy_bits(skb,
-				  doffset + offset,
-				  nskb->data + doffset,
-				  frag_size))
-			BUG();
-		skb_put(nskb, doffset + frag_size);
-		nskb->ip_summed = CHECKSUM_UNNECESSARY;
-		nskb->dev = skb->dev;
-		nskb->priority = skb->priority;
-		nskb->protocol = skb->protocol;
-		nskb->dst = dst_clone(skb->dst);
-		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
-		nskb->pkt_type = skb->pkt_type;
-
-		th = (struct tcphdr *)(skb_network_header(nskb) + iph->ihl * 4);
-		iph->tot_len = htons(frag_size + doffset);
-		iph->id = htons(id);
-		iph->check = 0;
-		iph->check = ip_fast_csum((unsigned char *) iph, iph->ihl);
-		th->seq = htonl(seq);
-		if (offset + doffset + frag_size < skb->len)
-			th->fin = th->psh = 0;
-		netif_rx(nskb);
-		offset += frag_size;
-		seq += frag_size;
-		id++;
-	}
-
-	dev_kfree_skb(skb);
-}
-#endif /* LOOPBACK_TSO */
-
 /*
  * The higher levels take care of making this non-reentrant (it's
  * called with bh's disabled).
From: Alex Williamson <alex.williamson@hp.com>
Date: Fri, 15 Aug 2008 09:34:47 -0600

> On Fri, 2008-08-15 at 20:53 +1000, Herbert Xu wrote:
>> On Fri, Aug 15, 2008 at 08:32:35PM +1000, Herbert Xu wrote:
>>>
>>> I'll work on the netfilter state preservation next.
>>
>> Here it is:
>
> Confirmed, these patches solve the problem.  Thanks Herbert.

Thanks for your report and for testing the fix, Alex.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 15 Aug 2008 20:53:18 +1000

> net: Preserve netfilter attributes in skb_gso_segment using __copy_skb_header
>
> skb_gso_segment didn't preserve some attributes in the original skb,
> such as the netfilter fields.  This was harmless until they were used,
> which is the case for packets going through lo.
>
> This patch makes it call __copy_skb_header, which also picks up some
> other missing attributes.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks Herbert.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 15 Aug 2008 15:35:48 +1000

> loopback: Drop obsolete ip_summed setting
>
> Now that the network stack can handle inbound packets with partial
> checksums, we should no longer clobber the ip_summed field in the
> loopback driver.  This is because CHECKSUM_UNNECESSARY implies that
> the checksum field is actually valid, which is not true for loopback
> packets, since it's only partial (and thus complemented).
>
> This allows packets from lo to then be SNATed to an external source
> while still preserving the checksum's validity.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

I've applied this one too, let me know if I should not have :)
On Fri, Aug 15, 2008 at 01:58:51PM -0700, David Miller wrote:
>
> This, effectively, "enables" LRO on loopback.
>
> And sure, it's pretty obscure to shape, NAT, and end up forwarding
> loopback-received packets, but do you want to be the user trying to do
> something like that and trying to find this particular patch which is
> causing it to not work? :-)
>
> I really don't know whether it's worth worrying about; I just wanted
> to mention it.

Well, the same code path is also used by Xen and virtio (apart from the
netfilter bits, which caused this particular bug), so we should be
pretty safe here.

Cheers,
Verified fixed in 2.6.27-rc4