Bug 16572
Summary: | random panics in bridging on 2.6.34+ | ||
---|---|---|---|
Product: | Networking | Reporter: | Patrick McLean (chutzpah) |
Component: | Other | Assignee: | Herbert Xu (herbert) |
Status: | CLOSED CODE_FIX | ||
Severity: | blocking | CC: | florian, herbert, maciej.rutecki, rjw, stephen |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.34,2.6.35{,.1,.2-rc1} | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 15310 | ||
Attachments: |
screenshot of panic message
.config from affected machine slightly different panic message on 2.6.36-rc3 panic with suggested patch bridge: Reset IPCB when entering IP stack on NF_FORWARD |
Created attachment 27418 [details]
.config from affected machine
2.6.34 is definitely affected, I am pretty sure that 2.6.33 isn't. I am going to try to do a bisect, but it's difficult with an intermittent problem like this one. Ok, I did a git bisect on this, and I am pretty sure that this is the commit that caused the problem: 68b7c895be336b19f4c38d7cb500132fabba0afd is the first bad commit commit 68b7c895be336b19f4c38d7cb500132fabba0afd Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Sat Feb 27 19:41:40 2010 +0000 bridge: Allow tail-call on br_pass_frame_up This patch allows tail-call on the call to br_pass_frame_up in br_handle_frame_finish. This is now possible because of the previous patch to call br_pass_frame_up last. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> :040000 040000 551a2d35821d44493b4516e20c1352c6ca0cc5c3 ac3568468f1caf3696079bdf553d873dccb4f416 M net Verified that it was definitely the commit above that introduced the problem. Created attachment 28561 [details]
slightly different panic message on 2.6.36-rc3
Got a different backtrace when triggering the bug on 2.6.36-rc3. It can be triggered fairly easily by bringing up/down or using net on kvm hosts that use bridged interfaces to talk to the network.
I should also add that I was originally hitting the bug with a 4-port bnx2 card, and testing with a 4-port igb card seems to trigger similar panics. David Miller has committed a patch to net-next-2.6 to fix this: bridge: Clear INET control block of SKBs passed into ip_fragment(). Hmm, actually your problem might be a different manifestation of this bug as you don't appear to be on the fragmentation path. But it certainly does look like the same issue since reverting my patch helps and you're definitely crashing in the IP stack. How exactly does one reproduce this crash? Usually I can reproduce it by starting/stopping VM's and using network to one of the guest OS'es. Unfortunately I don't have an easy way to test kernels on the machine at the moment as it's now a production server. I am working on reproducing the crash on another machine. Where you able to test if that patch fixes the issue? Created attachment 37102 [details]
panic with suggested patch
I finally got a chance to test it again, it took _much_ longer to trigger the problem, but I am still getting this panic with 2.6.36 + patch.
Sorry about the delay in testing, this is a critical server.
Are any other patches that I can test to try to resolve this problem? I have a server maintenance Window coming next week when I can test another patch. Created attachment 39682 [details]
bridge: Reset IPCB when entering IP stack on NF_FORWARD
Does this patch help?
git as of yesterday (top commit in git log is b3444d164be8f977f4133ef0c6f4a18f2741373f) with the patch in #13 seems to be stable. It's been stable over a week now. The problem is definitely fixed. Any chance of the fix getting into 2.6.37? Well, the maintainers decide about the merging of patches. Normally panic fixes get in pretty immediately (for some reasonable definition of immediate). Patch: https://bugzilla.kernel.org/attachment.cgi?id=39682 I am not seeing this in v2.6.38 or in Linus' git. Has some equivalent to this made it into mainline yet, or do I need to keep applying the patch before updating the kernel. On Fri, 18 Mar 2011 13:49:03 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=16572 > > > > > > --- Comment #17 from Patrick McLean <chutzpah@gentoo.org> 2011-03-18 > 13:49:00 --- > I am not seeing this in v2.6.38 or in Linus' git. Has some equivalent to this > made it into mainline yet, or do I need to keep applying the patch before > updating the kernel. > Herbert hasn't submitted the patch upstream. On Fri, 18 Mar 2011 10:27:28 -0500 Herbert Xu <herbert@gondor.hengli.com.au> wrote: > On Fri, Mar 18, 2011 at 08:10:52AM -0700, Stephen Hemminger wrote: > > On Fri, 18 Mar 2011 13:49:03 GMT > > bugzilla-daemon@bugzilla.kernel.org wrote: > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=16572 > > > > Herbert hasn't submitted the patch upstream. > > Oops! Thanks for reminding me. > > bridge: Reset IPCB when entering IP stack on NF_FORWARD > > Whenever we enter the IP stack proper from bridge netfilter we > need to ensure that the skb is in a form the IP stack expects > it to be in. > > The entry point on NF_FORWARD did not meet the requirements of > the IP stack, therefore leading to potential crashes/panics. > > This patch fixes the problem. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Reply-To: herbert@gondor.hengli.com.au On Fri, Mar 18, 2011 at 08:10:52AM -0700, Stephen Hemminger wrote: > On Fri, 18 Mar 2011 13:49:03 GMT > bugzilla-daemon@bugzilla.kernel.org wrote: > > > https://bugzilla.kernel.org/show_bug.cgi?id=16572 > > Herbert hasn't submitted the patch upstream. Oops! Thanks for reminding me. bridge: Reset IPCB when entering IP stack on NF_FORWARD Whenever we enter the IP stack proper from bridge netfilter we need to ensure that the skb is in a form the IP stack expects it to be in. The entry point on NF_FORWARD did not meet the requirements of the IP stack, therefore leading to potential crashes/panics. This patch fixes the problem. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c index 865fd76..7e9b167 100644 --- a/net/bridge/br_netfilter.c +++ b/net/bridge/br_netfilter.c @@ -752,6 +752,9 @@ static unsigned int br_nf_forward_ip(unsigned int hook, struct sk_buff *skb, nf_bridge->mask |= BRNF_PKT_TYPE; } + if (br_parse_ip_options(skb)) + return NF_DROP; + /* The physdev module checks on this */ nf_bridge->mask |= BRNF_BRIDGED; nf_bridge->physoutdev = skb->dev; Cheers, From: Stephen Hemminger <shemminger@linux-foundation.org> Date: Fri, 18 Mar 2011 09:03:38 -0700 > On Fri, 18 Mar 2011 10:27:28 -0500 > Herbert Xu <herbert@gondor.hengli.com.au> wrote: > >> On Fri, Mar 18, 2011 at 08:10:52AM -0700, Stephen Hemminger wrote: >> > On Fri, 18 Mar 2011 13:49:03 GMT >> > bugzilla-daemon@bugzilla.kernel.org wrote: >> > >> > > https://bugzilla.kernel.org/show_bug.cgi?id=16572 >> > >> > Herbert hasn't submitted the patch upstream. >> >> Oops! Thanks for reminding me. >> >> bridge: Reset IPCB when entering IP stack on NF_FORWARD >> >> Whenever we enter the IP stack proper from bridge netfilter we >> need to ensure that the skb is in a form the IP stack expects >> it to be in. >> >> The entry point on NF_FORWARD did not meet the requirements of >> the IP stack, therefore leading to potential crashes/panics. >> >> This patch fixes the problem. >> >> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > Acked-by: Stephen Hemminger <shemminger@vyatta.com> Applied. Does this need to go to stable? merged in v2.6.39-rc1: commit 6b1e960fdbd75dcd9bcc3ba5ff8898ff1ad30b6e Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Fri Mar 18 05:27:28 2011 +0000 bridge: Reset IPCB when entering IP stack on NF_FORWARD It probably should go to stable, it does fix a panic that has been present since 2.6.34. |
Created attachment 27417 [details] screenshot of panic message This did not occur with 2.6.33, the machine is uselessly unstable with 2.6.35. It only appears to happen when there are KVM VM's running. I have not found a reliable way to trigger the panics. The VM's all use bridging, and from the panic message the problem appears to be in bridging. I haven't tested with 2.6.34 yet. Will attach screenshot of panic and .config, please let me know if any more information is needed.