Bug 9758 - net_device refcnt bug when NFQUEUEing bridged packets
Summary: net_device refcnt bug when NFQUEUEing bridged packets
Status: CLOSED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: Netfilter/Iptables
Hardware: All Linux
Importance: P1 normal
Assignee: networking_netfilter-iptables@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-15 15:28 UTC by Jan C. Nordholz
Modified: 2008-06-02 20:46 UTC

See Also:
Kernel Version: 2.6.24-rc7
Subsystem:
Regression: ---
Bisected commit-id:


Description Jan C. Nordholz 2008-01-15 15:28:30 UTC
The bug has probably existed for as long as the bridge+NFQUEUE combination has been possible, and it does not depend on distro or environment:

Packets that are to be sent out over a bridge device are skb_clone()d in
br_loop() before traversing the appropriate (FORWARD/OUTPUT) NF chain.
The copies made by skb_clone() share their nf_bridge metadata with the
original, which is usually no problem.
If, however, one or more packets of a br_loop() run end up in an NFQUEUE,
their shared nf_bridge metadata causes trouble when they are about to be
reinjected: nf_reinject() decrements the net_device refcounts that were
incremented in __nf_queue() when the packet was queued, but because
skb->nf_bridge->physoutdev points to the same device for all these
packets, most (if not all) of them will decrement the wrong refcnt.

(I originally encountered the bug on a Xen host because the hypervisor
refused to shut down a virtual device with a non-zero refcount. It is
perfectly reproducible with a standard kernel too, although creating a
test scenario was more tedious and involved a couple of UMLs.)

I'd suggest making a real copy of the nf_bridge member in br_loop() if
CONFIG_BRIDGE_NETFILTER is defined, which would remove the entanglement. I'd go ahead and create a patch, but I'm unsure where that logic should be implemented.
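
To make the imbalance concrete, here is a small user-space sketch (not kernel code) of two clones that share one metadata struct. The struct and function names (net_device_model, forward_and_queue, reinject, flood_two_ports) are made up for illustration, and the ordering, with both clones queued before either is reinjected, is an assumption about a typical run.

```c
/* User-space model of the refcount imbalance; NOT kernel code.
 * All names here are hypothetical simplifications. */
#include <assert.h>
#include <stddef.h>

struct net_device_model { int refcnt; };

/* Stand-in for struct nf_bridge_info: only the field that matters here. */
struct nf_bridge_model { struct net_device_model *physoutdev; };

/* Stand-in for an skb clone: it points at (possibly shared) metadata. */
struct skb_model { struct nf_bridge_model *nf_bridge; };

/* Models the bridge hook recording the output port, followed by
 * __nf_queue() taking a reference on physoutdev. */
static void forward_and_queue(struct skb_model *skb,
			      struct net_device_model *out)
{
	assert(skb && skb->nf_bridge && out);
	skb->nf_bridge->physoutdev = out;
	skb->nf_bridge->physoutdev->refcnt++;
}

/* Models nf_reinject() dropping the reference through the same pointer. */
static void reinject(struct skb_model *skb)
{
	assert(skb && skb->nf_bridge && skb->nf_bridge->physoutdev);
	skb->nf_bridge->physoutdev->refcnt--;
}

/* Floods one packet to two ports, queues both clones, then reinjects both.
 * If shared is nonzero, the clones share one metadata struct. */
static void flood_two_ports(int shared, int *refcnt_a, int *refcnt_b)
{
	struct net_device_model port_a = { 0 }, port_b = { 0 };
	struct nf_bridge_model meta1 = { NULL }, meta2 = { NULL };
	struct skb_model clone1 = { &meta1 };
	struct skb_model clone2 = { shared ? &meta1 : &meta2 };

	forward_and_queue(&clone1, &port_a);
	forward_and_queue(&clone2, &port_b); /* overwrites physoutdev if shared */
	reinject(&clone1);
	reinject(&clone2);

	*refcnt_a = port_a.refcnt;
	*refcnt_b = port_b.refcnt;
}
```

Under these assumptions, flood_two_ports(1, &a, &b) leaves a at 1 (a leaked reference) and b at -1 (an underflow), while flood_two_ports(0, &a, &b) leaves both at 0.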
Comment 1 Anonymous Emailer 2008-01-15 15:57:31 UTC
Reply-To: akpm@linux-foundation.org

On Tue, 15 Jan 2008 15:28:31 -0800 (PST)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9758
> 
>            Summary: net_device refcnt bug when NFQUEUEing bridged packets
>            Product: Networking
>            Version: 2.5
>      KernelVersion: 2.6.24-rc7
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Netfilter/Iptables
>         AssignedTo: networking_netfilter-iptables@kernel-bugs.osdl.org
>         ReportedBy: jckn@gmx.net
Comment 2 Patrick McHardy 2008-01-15 20:54:39 UTC
Very nice catch, that explains quite a few bug reports about
refcnt leaks. Your patch looks correct and performs the copying
in the logically correct place; it would be nicer to keep this
crap limited to bridge netfilter, however.

What should work is to perform the copying in br_netfilter.c
at the spots where physoutdev is assigned. As an optimization,
we should be able to avoid the copying in most cases by
copying only when the bridge info has a refcount above 1.

Could you test whether this patch also fixes the problem?


diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 0e884fe..9759bd7 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -142,6 +142,22 @@ static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
 	return skb->nf_bridge;
 }
 
+static inline struct nf_bridge_info *nf_bridge_unshare(struct sk_buff *skb)
+{
+	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
+
+	if (atomic_read(&nf_bridge->use) > 1) {
+		struct nf_bridge_info *tmp = nf_bridge_alloc(skb);
+
+		if (tmp) {
+			memcpy(tmp, nf_bridge, sizeof(struct nf_bridge_info));
+			nf_bridge_put(nf_bridge);
+		}
+		nf_bridge = tmp;
+	}
+	return nf_bridge;
+}
+
 static inline void nf_bridge_push_encap_header(struct sk_buff *skb)
 {
 	unsigned int len = nf_bridge_encap_header_len(skb);
@@ -637,6 +653,11 @@ static unsigned int br_nf_forward_ip(unsigned int hook, struct sk_buff *skb,
 	if (!skb->nf_bridge)
 		return NF_ACCEPT;
 
+	/* Need exclusive nf_bridge_info since we might have multiple
+	 * different physoutdevs. */
+	if (!nf_bridge_unshare(skb))
+		return NF_DROP;
+
 	parent = bridge_parent(out);
 	if (!parent)
 		return NF_DROP;
@@ -718,6 +739,11 @@ static unsigned int br_nf_local_out(unsigned int hook, struct sk_buff *skb,
 	if (!skb->nf_bridge)
 		return NF_ACCEPT;
 
+	/* Need exclusive nf_bridge_info since we might have multiple
+	 * different physoutdevs. */
+	if (!nf_bridge_unshare(skb))
+		return NF_DROP;
+
 	nf_bridge = skb->nf_bridge;
 	if (!(nf_bridge->mask & BRNF_BRIDGED_DNAT))
 		return NF_ACCEPT;
Comment 3 Patrick McHardy 2008-01-15 20:59:44 UTC
Patrick McHardy wrote:
> Very nice catch, that explains quite a few bug reports about
> refcnt leaks. Your patch looks correct and performs the copying
> in the logically correct place; it would be nicer to keep this
> crap limited to bridge netfilter, however.
> 
> What should work is to perform the copying in br_netfilter.c
> at the spots where physoutdev is assigned. As an optimization,
> we should be able to avoid the copying in most cases by
> copying only when the bridge info has a refcount above 1.
> 
> Could you test whether this patch also fixes the problem?


That patch had a bug: we need to set the refcount of the
new bridge info to 1 after performing the copy.
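
The reason can be seen in a simplified user-space model, which is only a sketch and not the kernel code: memcpy() copies the whole struct, including the reference count, so the private copy starts out with the shared struct's count (at least 2) and its single owner could never free it. The names bridge_info_model, info_put and info_unshare are hypothetical, the reset_use flag exists only to compare the two patch versions, and the allocation-failure handling is my own simplification.

```c
/* User-space model of copy-on-write unsharing; NOT kernel code.
 * All names here are hypothetical simplifications. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct bridge_info_model {
	int use;           /* reference count, models nf_bridge_info.use */
	int physoutdev_id; /* stands in for the physoutdev pointer */
};

/* Drops one reference and frees the struct on the last one.
 * Returns 1 if the struct was freed, 0 otherwise. */
static int info_put(struct bridge_info_model *info)
{
	assert(info && info->use > 0);
	if (--info->use == 0) {
		free(info);
		return 1;
	}
	return 0;
}

/* Copy-on-write: returns *slot unchanged if it is exclusively owned,
 * otherwise replaces it with a private copy and drops the old reference.
 * reset_use selects whether the copy's count is reset to 1 (second patch)
 * or left as copied by memcpy() (first patch).
 * Returns NULL, leaving *slot untouched, if allocation fails. */
static struct bridge_info_model *
info_unshare(struct bridge_info_model **slot, int reset_use)
{
	struct bridge_info_model *old = *slot;
	struct bridge_info_model *copy;

	if (old->use <= 1)
		return old; /* common case: nothing to copy */

	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;

	memcpy(copy, old, sizeof(*copy)); /* also copies old->use */
	if (reset_use)
		copy->use = 1; /* the copy has exactly one owner */
	info_put(old);
	*slot = copy;
	return copy;
}
```

With reset_use set, a single info_put() on the copy frees it. Without it, the copy still carries the count it inherited from the shared struct, so the same info_put() leaves it allocated.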

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 0e884fe..141f069 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -142,6 +142,23 @@ static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
 	return skb->nf_bridge;
 }
 
+static inline struct nf_bridge_info *nf_bridge_unshare(struct sk_buff *skb)
+{
+	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
+
+	if (atomic_read(&nf_bridge->use) > 1) {
+		struct nf_bridge_info *tmp = nf_bridge_alloc(skb);
+
+		if (tmp) {
+			memcpy(tmp, nf_bridge, sizeof(struct nf_bridge_info));
+			atomic_set(&tmp->use, 1);
+			nf_bridge_put(nf_bridge);
+		}
+		nf_bridge = tmp;
+	}
+	return nf_bridge;
+}
+
 static inline void nf_bridge_push_encap_header(struct sk_buff *skb)
 {
 	unsigned int len = nf_bridge_encap_header_len(skb);
@@ -637,6 +654,11 @@ static unsigned int br_nf_forward_ip(unsigned int hook, struct sk_buff *skb,
 	if (!skb->nf_bridge)
 		return NF_ACCEPT;
 
+	/* Need exclusive nf_bridge_info since we might have multiple
+	 * different physoutdevs. */
+	if (!nf_bridge_unshare(skb))
+		return NF_DROP;
+
 	parent = bridge_parent(out);
 	if (!parent)
 		return NF_DROP;
@@ -718,6 +740,11 @@ static unsigned int br_nf_local_out(unsigned int hook, struct sk_buff *skb,
 	if (!skb->nf_bridge)
 		return NF_ACCEPT;
 
+	/* Need exclusive nf_bridge_info since we might have multiple
+	 * different physoutdevs. */
+	if (!nf_bridge_unshare(skb))
+		return NF_DROP;
+
 	nf_bridge = skb->nf_bridge;
 	if (!(nf_bridge->mask & BRNF_BRIDGED_DNAT))
 		return NF_ACCEPT;
Comment 4 Jan C. Nordholz 2008-01-17 09:08:40 UTC
> Could you test whether this patch also fixes the problem?

Yes, it does. I agree that br_netfilter.c is a better place for this.
Comment 5 Patrick McHardy 2008-01-20 05:57:08 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9758
> 
> ------- Comment #4 from jckn@gmx.net  2008-01-17 09:08 -------
>> Could you test whether this patch also fixes the problem?
> 
> yes, it does. I agree that br_netfilter.c is a better place for this.


Thanks for testing, I'll send it upstream ASAP.
Comment 6 Patrick McHardy 2008-02-02 03:42:06 UTC
Merged upstream and submitted to -stable. Please close.
Comment 7 Natalie Protasevich 2008-06-02 20:46:29 UTC
Thanks much, closing the bug.
