Most recent kernel where this bug did not occur: 2.6.13.2
Distribution: Debian Etch
Hardware Environment: Tyan Transport GX28 (B2882), Dual AMD Opteron 242, 4GB RAM
Software Environment: Kernel compiled with skas3-v8.2, running multiple UML instances, using bridging interfaces on all active ethernet ports
Problem Description:
We launch 10 UML instances using a bridging interface (no iptables). The systems run fine (6-8 days) until we attempt to shut down the system for maintenance. The host server "freezes" hard when shutting down the guest instances (no network, console, etc), usually after the 7th instance has been shut down and its network interface removed from the bridge. I had to power-cycle the system to recover. The issue disappeared when I reverted the host back to 2.6.13.2.

Initially, I thought the issue had disappeared once I upgraded the oldest guest UML kernel (2.6.7) to a more recent version. However, it reappeared this week with all newer guest instances.

Unfortunately, only one of the "freezes" dumped any information:

device bind1-0 left promiscuous mode
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
f8be14f4
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: autofs4 tun ipv6 bridge floppy pcspkr hw_random i2c_amd8111 generic amd74xx shpchp pci_hotplug ohci_hcd usbcore raid1 md_mod dm_mod rtc w83627hf eeprom lm85 hwmon_vid i2c_isa i2c_amd756 i2c_core tg3 e100 mii psmouse ide_generic ide_disk ide_cd cdrom ide_core unix
CPU: 1
EIP: 0060:[<f8be14f4>] Not tainted VLI
EFLAGS: 00010287 (2.6.14.2-skas3-v8.2)
EIP is at br_nf_forward_ip+0xa2/0x16a [bridge]
eax: 00000000 ebx: d7b59dc0 ecx: ea4a5380 edx: 00000080
esi: 00000002 edi: 00000002 ebp: f8bdbdb7 esp: e172bcc8
ds: 007b es: 007b ss: 0068
Process linux-2.6.7-02- (pid: 4049, threadinfo=e172a000 task=e99ed550)
Stack: 80000000 c03e7350 c02d44e7 00000002 e172bd58 f8be1334 80000000 ea4a5380
       e172bd40 80000000 c03e7350 c02d44e7 00000002 e172bd7c f2641000 efd7e800
       f8bdbdb7 00000002 e172bd7c c03e7350 f8bdbdb7 c02d4564 c03e7350 e172bd7c
Call Trace:
 [<c02d44e7>] nf_iterate+0x66/0x8a
 [<f8be1334>] br_nf_forward_finish+0x0/0x11e [bridge]
 [<c02d44e7>] nf_iterate+0x66/0x8a
 [<f8bdbdb7>] br_forward_finish+0x0/0x6b [bridge]
 [<f8bdbdb7>] br_forward_finish+0x0/0x6b [bridge]
 [<c02d4564>] nf_hook_slow+0x59/0x10e
 [<f8bdbdb7>] br_forward_finish+0x0/0x6b [bridge]
 [<f8bdbefa>] __br_forward+0x63/0x7c [bridge]
 [<f8bdbdb7>] br_forward_finish+0x0/0x6b [bridge]
 [<f8bdc11b>] br_flood_forward+0x27/0x2c [bridge]
 [<f8bdbe97>] __br_forward+0x0/0x7c [bridge]
 [<f8bdcb1b>] br_handle_frame_finish+0x11b/0x137 [bridge]
 [<f8be094c>] br_nf_pre_routing_finish+0x1a6/0x36b [bridge]
 [<f8bdca00>] br_handle_frame_finish+0x0/0x137 [bridge]
 [<f8be07a6>] br_nf_pre_routing_finish+0x0/0x36b [bridge]
 [<f8be07a6>] br_nf_pre_routing_finish+0x0/0x36b [bridge]
 [<c02d4564>] nf_hook_slow+0x59/0x10e
 [<f8be07a6>] br_nf_pre_routing_finish+0x0/0x36b [bridge]
 [<f8bdca00>] br_handle_frame_finish+0x0/0x137 [bridge]
 [<f8be11a6>] br_nf_pre_routing+0x332/0x457 [bridge]
 [<f8be07a6>] br_nf_pre_routing_finish+0x0/0x36b [bridge]
 [<c02d44e7>] nf_iterate+0x66/0x8a
 [<f8bdca00>] br_handle_frame_finish+0x0/0x137 [bridge]
 [<f8bdca00>] br_handle_frame_finish+0x0/0x137 [bridge]
 [<c02d4564>] nf_hook_slow+0x59/0x10e
 [<f8bdca00>] br_handle_frame_finish+0x0/0x137 [bridge]
 [<f8bdccec>] br_handle_frame+0x1b5/0x229 [bridge]
 [<f8bdca00>] br_handle_frame_finish+0x0/0x137 [bridge]
 [<c027e6e9>] netif_receive_skb+0x1af/0x334
 [<f8a74bf5>] e100_poll+0x1cf/0x70c [e100]
 [<c027ea36>] net_rx_action+0xc1/0x197
 [<c01206a2>] __do_softirq+0x72/0xdd
 [<c0120740>] do_softirq+0x33/0x35
 [<c0104e7e>] do_IRQ+0x1e/0x24
 [<c0103772>] common_interrupt+0x1a/0x20
Code: c1 e2 06 8d 82 90 71 3e c0 3b 82 90 71 3e c0 74 58 c7 44 24 18 00 00 00 80 c7 44 24 14 34 13 be f8 8b 44 24 3c 8b 80 ec 02 00 00 <8b> 00 8b 40 0c 89 44 24 10 8b 44 24 38 8b 80 ec 02 00 00 8b 00

Steps to reproduce:
Created attachment 6907 [details] output from lspci -vv
Created attachment 6908 [details] 2.6.14.2-skas3-v8.2 kernel config
Looks like br_netfilter went splat.

Begin forwarded message:

Date: Sat, 31 Dec 2005 09:23:40 -0800
From: bugme-daemon@bugzilla.kernel.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 5803] New: Bridge code Oops with 2.6.14.2

http://bugzilla.kernel.org/show_bug.cgi?id=5803

Summary: Bridge code Oops with 2.6.14.2
Kernel Version: 2.6.14.2
Status: NEW
Severity: high
Owner: shemminger@osdl.org
Submitter: brocka@sterlingcgi.com
On Fri, Jan 20, 2006 at 11:35:46AM +0000, Andrew Morton wrote:
>
> Looks like br_netfilter went splat.

It's not surprising that it went splat. What does puzzle me is how on earth did no one see this before. The bridge code is just broken when it comes to removing a live interface from a bridge.

Look, del_nbp can be called at any time when user space asks us to remove an interface from a bridge. The first thing it does is set dev->br_port to NULL. Now if dev is a live interface and receiving a packet at that point in time, then we can have someone sitting in br_nf_forward_ip and just about to dereference dev->br_port.

Stephen, you've got your work cut out :)

> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> f8be14f4
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> Modules linked in: autofs4 tun ipv6 bridge floppy pcspkr hw_random i2c_amd8111
> generic amd74xx shpchp pci_hotplug ohci_hcd usbcore raid1 md_mod dm_mod rtc
> w83627hf eeprom lm85 hwmon_vid i2c_isa i2c_amd756 i2c_core tg3 e100 mii
> psmouse ide_generic ide_disk ide_cd cdrom ide_core unix
> CPU: 1
> EIP: 0060:[<f8be14f4>] Not tainted VLI
> EFLAGS: 00010287 (2.6.14.2-skas3-v8.2)
> EIP is at br_nf_forward_ip+0xa2/0x16a [bridge]
> eax: 00000000 ebx: d7b59dc0 ecx: ea4a5380 edx: 00000080
> esi: 00000002 edi: 00000002 ebp: f8bdbdb7 esp: e172bcc8
> ds: 007b es: 007b ss: 0068
> Process linux-2.6.7-02- (pid: 4049, threadinfo=e172a000 task=e99ed550)
> Stack: 80000000 c03e7350 c02d44e7 00000002 e172bd58 f8be1334 80000000 ea4a5380
> e172bd40 80000000 c03e7350 c02d44e7 00000002 e172bd7c f2641000 efd7e800
> f8bdbdb7 00000002 e172bd7c c03e7350 f8bdbdb7 c02d4564 c03e7350 e172bd7c

Cheers,
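The failure mode described above can be modelled in user space. The following is only a rough sketch, not the kernel fix: the struct layouts are simplified stand-ins, and the helper names detach_port and bridge_name_or_null are hypothetical. It shows the defensive pattern of reading dev->br_port once into a local and testing that copy before dereferencing it.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the br_port field
 * name comes from the discussion above, the rest is hypothetical. */
struct net_bridge { const char *name; };
struct net_bridge_port { struct net_bridge *br; };
struct net_device { const char *name; struct net_bridge_port *br_port; };

/* Models the first step of del_nbp as described: detach the device
 * from its bridge port by clearing the pointer. */
static void detach_port(struct net_device *dev)
{
    dev->br_port = NULL;
}

/* Defensive pattern: read dev->br_port exactly once, then test the local
 * copy. Returns NULL when the device is no longer attached to a bridge,
 * so the caller can drop the packet instead of dereferencing NULL. */
static const char *bridge_name_or_null(const struct net_device *dev)
{
    const struct net_bridge_port *port = dev->br_port; /* single read */

    if (port == NULL)
        return NULL;
    return port->br->name;
}
```

In a real concurrent setting the NULL check alone is not sufficient, because the port could still be freed while a reader holds the local copy; the kernel would also need a lifetime guarantee such as RCU for that. The commits referenced later in this bug are the authoritative fix.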
This bug was fixed by a commit released in 2.6.15.4:
http://www.kernel.org/git/?p=linux/kernel/git/chrisw/linux-2.6.15.y.git;a=commitdiff;h=ec81e3178071f3747bd1522c959972105584514b;hp=09a17332563531806883b67bf8a9fe0ef0200262

That fix needs an additional patch to resolve an unresolved symbol, though:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3c791925da0e6108cda15e3c2c7bfaebcd9ab9cf
Created attachment 7338 [details]
[PATCH] bridge: fix undefined macro reference for has_bridge_parent

To fix, at runtime (after depmod -a):
WARNING: /lib/modules/2.6.15.4/kernel/net/bridge/bridge.ko needs unknown symbol has_bridge_parent

or at compile time:
net/bridge/br_netfilter.c: In function `br_nf_post_routing':
net/bridge/br_netfilter.c:808: warning: implicit declaration of function `has_bridge_parent'
I think that the following patch from Horms will resolve this problem.

[BRIDGE]: netfilter missing symbol has_bridge_parent

5dce971acf2ae20c80d5e9d1f6bbf17376870911 in Linus' tree, otherwise known as bridge-netfilter-races-on-device-removal.patch in 2.6.15.4, removed has_bridge_parent; however, this symbol is still called when NETFILTER_DEBUG is enabled.

This patch uses the already seeded realoutdev value to detect if a parent exists, and if so, the value of the parent.

Signed-Off-By: Horms <horms@verge.net.au>

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index b501816..6bb0c7e 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -805,8 +805,8 @@ static unsigned int br_nf_post_routing(u
 print_error:
 	if (skb->dev != NULL) {
 		printk("[%s]", skb->dev->name);
-		if (has_bridge_parent(skb->dev))
-			printk("[%s]", bridge_parent(skb->dev)->name);
+		if (realoutdev)
+			printk("[%s]", realoutdev->name);
 	}
 	printk(" head:%p, raw:%p, data:%p\n", skb->head, skb->mac.raw,
 					      skb->data);
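The idea behind the patch, reusing a parent pointer that was already resolved instead of looking it up again in the debug path, can be sketched in user space. This is an illustrative model only: the struct is reduced to a name, and format_dev_prefix is a hypothetical helper standing in for the printk calls under print_error.

```c
#include <stdio.h>
#include <stddef.h>

/* Reduced stand-in for the kernel's net_device; only the name is modelled. */
struct net_device { const char *name; };

/* Writes the "[dev][parent]" prefix of the debug message into buf.
 * realoutdev is the bridge parent the caller resolved earlier, or NULL
 * when the device has no bridge parent. Returns the number of characters
 * written (excluding the terminating NUL). */
static int format_dev_prefix(char *buf, size_t len,
                             const struct net_device *dev,
                             const struct net_device *realoutdev)
{
    int n = 0;

    if (len > 0)
        buf[0] = '\0';
    if (dev != NULL) {
        n = snprintf(buf, len, "[%s]", dev->name);
        /* Reuse the cached parent; no second lookup through the device. */
        if (realoutdev != NULL && n >= 0 && (size_t)n < len)
            n += snprintf(buf + n, len - (size_t)n, "[%s]", realoutdev->name);
    }
    return n;
}
```

Because the parent is passed in rather than re-derived, the debug path does not depend on a helper that may have been removed, and it cannot observe a different parent than the one the main path used.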
Agreed that Horms' patch (which just happens to be the same one I independently made and attached to this bug) is probably a better solution than the one already in Linus' tree (3c791925da0e6108cda15e3c2c7bfaebcd9ab9cf).

Whatever the fix, it needs to be committed to the stable tree so that users of kernels with versions newer than 2.6.15.4 can safely use the bridge.
Fix is in 2.6.16 and 2.6.15.6