Bug 16572 - random panics in bridging on 2.6.34+
Summary: random panics in bridging on 2.6.34+
Status: CLOSED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 blocking
Assignee: Herbert Xu
URL:
Keywords:
Depends on:
Blocks: 15310
  Show dependency tree
 
Reported: 2010-08-12 16:54 UTC by Patrick McLean
Modified: 2011-03-31 14:45 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.6.34,2.6.35{,.1,.2-rc1}
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
screenshot of panic message (165.97 KB, image/jpeg)
2010-08-12 16:54 UTC, Patrick McLean
Details
.config from affected machine (58.22 KB, text/plain)
2010-08-12 17:08 UTC, Patrick McLean
Details
slightly different panic message on 2.6.36-rc3 (165.51 KB, image/jpeg)
2010-08-30 20:43 UTC, Patrick McLean
Details
panic with suggested patch (162.85 KB, image/jpeg)
2010-11-11 03:53 UTC, Patrick McLean
Details
bridge: Reset IPCB when entering IP stack on NF_FORWARD (467 bytes, patch)
2010-12-10 03:11 UTC, Herbert Xu
Details | Diff

Description Patrick McLean 2010-08-12 16:54:46 UTC
Created attachment 27417 [details]
screenshot of panic message

This did not occur with 2.6.33, the machine is uselessly unstable with 2.6.35. It only appears to happen when there are KVM VM's running. I have not found a reliable way to trigger the panics.

The VM's all use bridging, and from the panic message the problem appears to be in bridging. I haven't tested with 2.6.34 yet.

Will attach screenshot of panic and .config, please let me know if any more information is needed.
Comment 1 Patrick McLean 2010-08-12 17:08:05 UTC
Created attachment 27418 [details]
.config from affected machine
Comment 2 Patrick McLean 2010-08-12 20:20:23 UTC
2.6.34 is definitely affected, I am pretty sure that 2.6.33 isn't. I am going to try to do a bisect, but it's difficult with an intermittent problem like this one.
Comment 3 Patrick McLean 2010-08-13 16:40:22 UTC
Ok, I did a git bisect on this, and I am pretty sure that this is the commit that caused the problem:

68b7c895be336b19f4c38d7cb500132fabba0afd is the first bad commit
commit 68b7c895be336b19f4c38d7cb500132fabba0afd
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Sat Feb 27 19:41:40 2010 +0000

    bridge: Allow tail-call on br_pass_frame_up
    
    This patch allows tail-call on the call to br_pass_frame_up
    in br_handle_frame_finish.  This is now possible because of the
    previous patch to call br_pass_frame_up last.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 551a2d35821d44493b4516e20c1352c6ca0cc5c3 ac3568468f1caf3696079bdf553d873dccb4f416 M	net
Comment 4 Patrick McLean 2010-08-19 18:19:27 UTC
Verified that it was definitely the commit above that introduced the problem.
Comment 5 Patrick McLean 2010-08-30 20:43:03 UTC
Created attachment 28561 [details]
slightly different panic message on 2.6.36-rc3

Got a different backtrace when triggering the bug on 2.6.36-rc3. It can be triggered fairly easily by bringing up/down or using net on kvm hosts that use bridged interfaces to talk to the network.
Comment 6 Patrick McLean 2010-08-30 20:49:28 UTC
I should also add that I was originally hitting the bug with a 4-port bnx2 card, and testing with a 4-port igb card seems to trigger similar panics.
Comment 7 Herbert Xu 2010-09-03 08:34:06 UTC
David Miller has committed a patch to net-next-2.6 to fix this:

bridge: Clear INET control block of SKBs passed into ip_fragment().
Comment 8 Herbert Xu 2010-09-03 08:35:53 UTC
Hmm, actually your problem might be a different manifestation of this bug as you don't appear to be on the fragmentation path.  But it certainly does look like the same issue since reverting my patch helps and you're definitely crashing in the IP stack.

How exactly does one reproduce this crash?
Comment 9 Patrick McLean 2010-09-03 13:50:59 UTC
Usually I can reproduce it by starting/stopping VM's and using network to one of the guest OS'es. Unfortunately I don't have an easy way to test kernels on the machine at the moment as it's now a production server.

I am working on reproducing the crash on another machine.
Comment 10 Florian Mickler 2010-10-07 22:16:54 UTC
Where you able to test if that patch fixes the issue?
Comment 11 Patrick McLean 2010-11-11 03:53:19 UTC
Created attachment 37102 [details]
panic with suggested patch

I finally got a chance to test it again, it took _much_ longer to trigger the problem, but I am still getting this panic with 2.6.36 + patch.

Sorry about the delay in testing, this is a critical server.
Comment 12 Patrick McLean 2010-12-09 16:57:43 UTC
Are any other patches that I can test to try to resolve this problem? I have a server maintenance Window coming next week when I can test another patch.
Comment 13 Herbert Xu 2010-12-10 03:11:57 UTC
Created attachment 39682 [details]
bridge: Reset IPCB when entering IP stack on NF_FORWARD

Does this patch help?
Comment 14 Patrick McLean 2010-12-18 04:42:26 UTC
git as of yesterday (top commit in git log is b3444d164be8f977f4133ef0c6f4a18f2741373f) with the patch in #13 seems to be stable.
Comment 15 Patrick McLean 2010-12-24 20:03:08 UTC
It's been stable over a week now. The problem is definitely fixed.

Any chance of the fix getting into 2.6.37?
Comment 16 Florian Mickler 2010-12-27 09:02:52 UTC
Well, the maintainers decide about the merging of patches. Normally panic fixes get in pretty immediately (for some reasonable definition of immediate).

Patch: https://bugzilla.kernel.org/attachment.cgi?id=39682
Comment 17 Patrick McLean 2011-03-18 13:49:00 UTC
I am not seeing this in v2.6.38 or in Linus' git. Has some equivalent to this made it into mainline yet, or do I need to keep applying the patch before updating the kernel.
Comment 18 Stephen Hemminger 2011-03-18 15:11:23 UTC
On Fri, 18 Mar 2011 13:49:03 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=16572
> 
> 
> 
> 
> 
> --- Comment #17 from Patrick McLean <chutzpah@gentoo.org>  2011-03-18
> 13:49:00 ---
> I am not seeing this in v2.6.38 or in Linus' git. Has some equivalent to this
> made it into mainline yet, or do I need to keep applying the patch before
> updating the kernel.
> 

Herbert hasn't submitted the patch upstream.
Comment 19 Stephen Hemminger 2011-03-18 16:04:55 UTC
On Fri, 18 Mar 2011 10:27:28 -0500
Herbert Xu <herbert@gondor.hengli.com.au> wrote:

> On Fri, Mar 18, 2011 at 08:10:52AM -0700, Stephen Hemminger wrote:
> > On Fri, 18 Mar 2011 13:49:03 GMT
> > bugzilla-daemon@bugzilla.kernel.org wrote:
> > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=16572
> >
> > Herbert hasn't submitted the patch upstream.
> 
> Oops! Thanks for reminding me.
> 
> bridge: Reset IPCB when entering IP stack on NF_FORWARD
> 
> Whenever we enter the IP stack proper from bridge netfilter we
> need to ensure that the skb is in a form the IP stack expects
> it to be in.
> 
> The entry point on NF_FORWARD did not meet the requirements of
> the IP stack, therefore leading to potential crashes/panics.
> 
> This patch fixes the problem.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Comment 20 Anonymous Emailer 2011-03-18 16:23:56 UTC
Reply-To: herbert@gondor.hengli.com.au

On Fri, Mar 18, 2011 at 08:10:52AM -0700, Stephen Hemminger wrote:
> On Fri, 18 Mar 2011 13:49:03 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=16572
>
> Herbert hasn't submitted the patch upstream.

Oops! Thanks for reminding me.

bridge: Reset IPCB when entering IP stack on NF_FORWARD

Whenever we enter the IP stack proper from bridge netfilter we
need to ensure that the skb is in a form the IP stack expects
it to be in.

The entry point on NF_FORWARD did not meet the requirements of
the IP stack, therefore leading to potential crashes/panics.

This patch fixes the problem.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 865fd76..7e9b167 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -752,6 +752,9 @@ static unsigned int br_nf_forward_ip(unsigned int hook, struct sk_buff *skb,
 		nf_bridge->mask |= BRNF_PKT_TYPE;
 	}
 
+	if (br_parse_ip_options(skb))
+		return NF_DROP;
+
 	/* The physdev module checks on this */
 	nf_bridge->mask |= BRNF_BRIDGED;
 	nf_bridge->physoutdev = skb->dev;

Cheers,
Comment 21 David S. Miller 2011-03-18 22:16:54 UTC
From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Fri, 18 Mar 2011 09:03:38 -0700

> On Fri, 18 Mar 2011 10:27:28 -0500
> Herbert Xu <herbert@gondor.hengli.com.au> wrote:
> 
>> On Fri, Mar 18, 2011 at 08:10:52AM -0700, Stephen Hemminger wrote:
>> > On Fri, 18 Mar 2011 13:49:03 GMT
>> > bugzilla-daemon@bugzilla.kernel.org wrote:
>> > 
>> > > https://bugzilla.kernel.org/show_bug.cgi?id=16572
>> >
>> > Herbert hasn't submitted the patch upstream.
>> 
>> Oops! Thanks for reminding me.
>> 
>> bridge: Reset IPCB when entering IP stack on NF_FORWARD
>> 
>> Whenever we enter the IP stack proper from bridge netfilter we
>> need to ensure that the skb is in a form the IP stack expects
>> it to be in.
>> 
>> The entry point on NF_FORWARD did not meet the requirements of
>> the IP stack, therefore leading to potential crashes/panics.
>> 
>> This patch fixes the problem.
>> 
>> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> Acked-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.
Comment 22 Florian Mickler 2011-03-30 22:29:52 UTC
Does this need to go to stable?

merged in v2.6.39-rc1:

commit 6b1e960fdbd75dcd9bcc3ba5ff8898ff1ad30b6e
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Fri Mar 18 05:27:28 2011 +0000

    bridge: Reset IPCB when entering IP stack on NF_FORWARD
Comment 23 Patrick McLean 2011-03-31 14:45:48 UTC
It probably should go to stable, it does fix a panic that has been present since 2.6.34.

Note You need to log in before you can comment on or make changes to this bug.