Bug 15148
Summary: | kernel BUG at mm/slab.c | | |
---|---|---|---|
Product: | Networking | Reporter: | Krzysztof Mościcki (stivi) |
Component: | Netfilter/Iptables | Assignee: | networking_netfilter-iptables (networking_netfilter-iptables) |
Status: | RESOLVED OBSOLETE | | |
Severity: | high | CC: | alan, marek |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | 2.6.28.10 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Description
Krzysztof Mościcki
2010-01-27 10:29:06 UTC
These look like fairly random memory corruptions. Is the hardware using ECC memory, and has the memory been tested? Also, I see reiserfs is loaded. Is it heavily used by this system? (Just comparing it to another similar-looking report.)

There was nothing meaningful in the ipmi and mcelog logs, but the RAM hasn't been checked yet. If reiserfs could be the problem, maybe changing it to ext3 could fix it? The RAM is ECC.

It would be a useful test to run without reiserfs for a bit, if that is practicable. It would help cut down the possibilities. Have you had older kernels on this box that were stable?

This is a new box, not tested on older kernels.

All partitions with reiserfs have been converted to ext3, and today there was a new kernel panic:

[75864.834871] general protection fault: 0000 [#1] SMP
[75864.838837] last sysfs file: /sys/class/i2c-adapter/i2c-0/name
[75864.838837] CPU 0
[75864.838837] Modules linked in: xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_hashlimit xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler evdev snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 pcspkr i2c_core button ext3 jbd mbcache sd_mod 3w_9xxx igb scsi_mod dca thermal processor fan thermal_sys [last unloaded: reiserfs]
[75864.838837] Pid: 0, comm: swapper Not tainted 2.6.28.10-univ #1
[75864.838837] RIP: 0010:[<ffffffff803c6ae1>] [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74
[75864.838837] RSP: 0018:ffffffff8059cba0 EFLAGS: 00010282
[75864.838837] RAX: ffff880493afe6f0 RBX: ffff88016f756ce0 RCX: ffff88066d18cf28
[75864.838837] RDX: a56b6b6b6b6b6b6b RSI: ffff880493afe6d8 RDI: ffff88066e400500
[75864.838837] RBP: ffff88066e400500 R08: 0000000000000000 R09: 00000000e8644dd4
[75864.838837] R10: 000005ef805f8c10 R11: ffff880663856658 R12: ffff880493afe6d8
[75864.838837] R13: ffff880063c807b8 R14: ffff88016f756ce0 R15: ffff880063c807b8
[75864.838837] FS: 0000000000000000(0000) GS:ffffffff805a5040(0000) knlGS:0000000000000000
[75864.838837] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[75864.838837] CR2: 00007f43b5091000 CR3: 0000000000201000 CR4: 00000000000006e0
[75864.838837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[75864.838837] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[75864.838837] Process swapper (pid: 0, threadinfo ffffffff8053c000, task ffffffff804e0340)
[75864.838837] Stack:
[75864.838837] ffff880493afe6d8 ffff880663856658 ffff8801ab72a888 ffffffff803daaea
[75864.838837] 0000000000000000 ffff88016f756ce0 ffff880663856658 0000000000001000
[75864.838837] ffff8804bc820020 ffffffff803db5f0 ffff88066c1e8aa0 ffff88041a2b7028
[75864.838837] Call Trace:
[75864.838837] <IRQ> <0> [<ffffffff803daaea>] ? tcp_v4_syn_recv_sock+0x1bf/0x215
[75864.838837] [<ffffffff803db5f0>] ? tcp_check_req+0x207/0x3b7
[75864.838837] [<ffffffff803d9db4>] ? tcp_v4_do_rcv+0x267/0x37a
[75864.838837] [<ffffffffa01aa541>] ? nf_ct_deliver_cached_events+0x51/0x80 [nf_conntrack]
[75864.838837] [<ffffffffa01bb384>] ? ipv4_confirm+0xcb/0xd6 [nf_conntrack_ipv4]
[75864.838837] [<ffffffff803da39c>] ? tcp_v4_rcv+0x4d5/0x774
[75864.838837] [<ffffffff803ba538>] ? nf_hook_slow+0x62/0xc3
[75864.838837] [<ffffffff803c0204>] ? ip_local_deliver_finish+0x0/0x1ee
[75864.838837] [<ffffffff803c0320>] ? ip_local_deliver_finish+0x11c/0x1ee
[75864.838837] [<ffffffff803bff77>] ? ip_rcv_finish+0x30b/0x325
[75864.838837] [<ffffffff803c01c0>] ? ip_rcv+0x22f/0x273
[75864.838837] [<ffffffffa006384b>] ? igb_clean_rx_irq_adv+0x3bb/0x484 [igb]
[75864.838837] [<ffffffffa0063acc>] ? igb_clean_rx_ring_msix+0x4a/0x156 [igb]
[75864.838837] [<ffffffff8039fc86>] ? net_rx_action+0xa7/0x1cb
[75864.838837] [<ffffffff8023875b>] ? __do_softirq+0x7c/0x135
[75864.838837] [<ffffffff8020d03c>] ? call_softirq+0x1c/0x28
[75864.838837] [<ffffffff8020e53c>] ? do_softirq+0x2c/0x68
[75864.838837] [<ffffffff8023848f>] ? irq_exit+0x3f/0x85
[75864.838837] [<ffffffff8020e767>] ? do_IRQ+0xc5/0xe2
[75864.838837] [<ffffffff8020c2f6>] ? ret_from_intr+0x0/0xa
[75864.838837] <EOI> <0> [<ffffffffa0012428>] ? acpi_idle_enter_bm+0x2fb/0x37c [processor]
[75864.838837] [<ffffffffa001241e>] ? acpi_idle_enter_bm+0x2f1/0x37c [processor]
[75864.838837] [<ffffffff8026e710>] ? rcu_needs_cpu+0x35/0x44
[75864.838837] [<ffffffff8038f449>] ? cpuidle_idle_call+0x8b/0xca
[75864.838837] [<ffffffff8020b018>] ? cpu_idle+0x4a/0x8b
[75864.838837] Code: e8 48 c1 e5 04 48 03 6a 18 48 89 ef e8 74 b4 04 00 48 8b 8b e8 02 00 00 48 8b 51 20 49 89 54 24 18 48 85 d2 74 09 49 8d 44 24 18 <48> 89 42 08 49 8d 44 24 18 48 89 41 20 48 8d 41 20 49 89 8c 24
[75864.838837] RIP [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74
[75864.838837] RSP <ffffffff8059cba0>
[75869.296074] Kernel panic - not syncing: Fatal exception in interrupt

Hi, Krzysiek's co-worker here. The box is running two bonded igb interfaces in IntrMode=2 (MSI-X + RSS). Could multiqueue be somehow involved?

At this time, the igb module is loaded with the following parameters: "intmode=1, no rss, no msi-x", but the server crashed again tonight. In dmesg I found:

[ 2926.102332] Slab corruption: tw_sock_TCP start=ffff88059a441740, len=224
[ 2926.182688] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[ 2926.247388] Last user: [<ffffffff803c7780>](inet_twsk_put+0x59/0x69)
[ 2926.323866] 020: b8 ac 9b 62 05 88 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
[ 2926.398986] Prev obj: start=ffff88059a441648, len=224
[ 2926.459569] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[ 2926.524279] Last user: [<ffffffff803c7780>](inet_twsk_put+0x59/0x69)
[ 2926.600775] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[ 2926.676112] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[ 2926.751435] Next obj: start=ffff88059a441838, len=224
[ 2926.812027] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[ 2926.878857] Last user: [<ffffffff803c754b>](inet_twsk_alloc+0x26/0xe4)
[ 2926.957484] 000: 02 00 06 01 00 00 00 00 00 00 00 00 00 00 00 00
[ 2927.032812] 010: 88 e5 b7 6e 06 88 ff ff 60 d6 85 b5 05 88 ff ff
[ 3051.348547] Slab corruption: tw_sock_TCP start=ffff88059a441740, len=224
[ 3051.428853] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[ 3051.493524] Last user: [<ffffffff803c7780>](inet_twsk_put+0x59/0x69)
[ 3051.569970] 020: 18 ff 83 13 05 88 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
[ 3051.645226] Prev obj: start=ffff88059a441648, len=224
[ 3051.705809] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[ 3051.772632] Last user: [<ffffffff803c754b>](inet_twsk_alloc+0x26/0xe4)
[ 3051.851238] 000: 02 00 06 01 00 00 00 00 00 00 00 00 00 00 00 00
[ 3051.926754] 010: 78 44 b8 6e 06 88 ff ff e8 40 4c 89 05 88 ff ff
[ 3052.002177] Next obj: start=ffff88059a441838, len=224
[ 3052.062762] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[ 3052.129591] Last user: [<ffffffff803c754b>](inet_twsk_alloc+0x26/0xe4)
[ 3052.208196] 000: 02 00 06 01 00 00 00 00 00 00 00 00 00 00 00 00
[ 3052.283702] 010: 58 e0 b3 6e 06 88 ff ff 78 8c df 51 06 88 ff ff

Any suggestions on how to proceed or test the bug?

Slab corruptions in dmesg, no panic yet:

[78633.978592] Slab corruption: TCP start=ffff88040d9a2ce0, len=1520
[78634.051615] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[78634.116326] Last user: [<ffffffff803987e0>](sk_free+0xad/0xc2)
[78634.186566] 020: 40 fa d8 88 00 88 ff ff 6b 6b 6b 6b 6b 6b 6b 6b
[78634.261131] Prev obj: start=ffff88040d9a26d8, len=1520
[78634.322708] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[78634.389484] Last user: [<ffffffff8039867f>](sk_prot_alloc+0x1d/0x7e)
[78634.466002] 000: 02 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
[78634.540725] 010: 80 85 bb 6e 06 88 ff ff 00 00 00 00 00 00 00 00
[78634.615329] Next obj: start=ffff88040d9a32e8, len=1520
[78634.676904] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[78634.743682] Last user: [<ffffffff8039867f>](sk_free+0xad/0xc2)
[78634.813858] 000: 02 00 02 00 00 00 00 00 78 59 2f 72 02 88 ff ff
[78634.888422] 010: d0 69 af 6e 06 88 ff ff 00 00 00 00 00 00 00 00

Panic occurred:

[126205.439604] general protection fault: 0000 [#1] SMP
[126205.443583] last sysfs file: /sys/class/i2c-adapter/i2c-0/name
[126205.443583] CPU 3
[126205.443583] Modules linked in: xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_hashlimit xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler snd_pcm snd_timer snd soundcore snd_page_alloc evdev i2c_i801 pcspkr i2c_core button ext3 jbd mbcache sd_mod 3w_9xxx igb scsi_mod dca thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[126205.443583] Pid: 0, comm: swapper Not tainted 2.6.28.10-univ #1
[126205.443583] RIP: 0010:[<ffffffff803c6ae1>] [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74
[126205.443583] RSP: 0018:ffff88066f94fb60 EFLAGS: 00010282
[126205.443583] RAX: ffff8801075b1380 RBX: ffff8803fe200110 RCX: ffff88066c118428
[126205.443583] RDX: a56b6b6b6b6b6b6b RSI: ffff8801075b1368 RDI: ffff88066e40c3b0
[126205.443583] RBP: ffff88066e40c3b0 R08: 0000000000000000 R09: 000000003790af53
[126205.443583] R10: 000005ef805f8c10 R11: ffff8804bfdd6298 R12: ffff8801075b1368
[126205.443583] R13: ffff8802d4d19078 R14: ffff8803fe200110 R15: ffff8802d4d19078
[126205.443583] FS: 0000000000000000(0000) GS:ffff88066f8f8898(0000) knlGS:0000000000000000
[126205.443583] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[126205.443583] CR2: 00007f7a14ccc740 CR3: 0000000000201000 CR4: 00000000000006e0
[126205.443583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[126205.443583] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[126205.443583] Process swapper (pid: 0, threadinfo ffff88066f946000, task ffff88066f943320)
[126205.443583] Stack:
[126205.443583] ffff8801075b1368 ffff8804bfdd6298 ffff88008b56ae08 ffffffff803daaea
[126205.443583] 0000000000000000 ffff8803fe200110 ffff8804bfdd6298 0000000000001000
[126205.443583] ffff88027bbd0020 ffffffff803db5f0 ffff88066c956520 ffff8805063a3680
[126205.443583] Call Trace:
[126205.443583] <IRQ> <0> [<ffffffff803daaea>] ? tcp_v4_syn_recv_sock+0x1bf/0x215
[126205.443583] [<ffffffff803db5f0>] ? tcp_check_req+0x207/0x3b7
[126205.443583] [<ffffffff803d9db4>] ? tcp_v4_do_rcv+0x267/0x37a
[126205.443583] [<ffffffffa01b5541>] ? nf_ct_deliver_cached_events+0x51/0x80 [nf_conntrack]
[126205.443583] [<ffffffffa01c6384>] ? ipv4_confirm+0xcb/0xd6 [nf_conntrack_ipv4]
[126205.443583] [<ffffffff803da39c>] ? tcp_v4_rcv+0x4d5/0x774
[126205.443583] [<ffffffff803ba538>] ? nf_hook_slow+0x62/0xc3
[126205.443583] [<ffffffff803c0204>] ? ip_local_deliver_finish+0x0/0x1ee
[126205.443583] [<ffffffff803c0320>] ? ip_local_deliver_finish+0x11c/0x1ee
[126205.443583] [<ffffffff803bff77>] ? ip_rcv_finish+0x30b/0x325
[126205.443583] [<ffffffff803c01c0>] ? ip_rcv+0x22f/0x273
[126205.443583] [<ffffffffa00642ed>] ? igb_poll+0x52d/0xee0 [igb]
[126205.443583] [<ffffffff8022ed41>] ? rebalance_domains+0x166/0x461
[126205.443583] [<ffffffff802483ff>] ? ktime_get_ts+0x21/0x4a
[126205.443583] [<ffffffff8039fc86>] ? net_rx_action+0xa7/0x1cb
[126205.443583] [<ffffffff8023875b>] ? __do_softirq+0x7c/0x135
[126205.443583] [<ffffffffa00657e9>] ? igb_intr_msi+0xb9/0x100 [igb]
[126205.443583] [<ffffffff8020d03c>] ? call_softirq+0x1c/0x28
[126205.443583] [<ffffffff8020e53c>] ? do_softirq+0x2c/0x68
[126205.443583] [<ffffffff8023848f>] ? irq_exit+0x3f/0x85
[126205.443583] [<ffffffff8020e767>] ? do_IRQ+0xc5/0xe2
[126205.443583] [<ffffffff8020c2f6>] ? ret_from_intr+0x0/0xa
[126205.443583] <EOI> <0> [<ffffffffa0012428>] ? acpi_idle_enter_bm+0x2fb/0x37c [processor]
[126205.443583] [<ffffffffa001241e>] ? acpi_idle_enter_bm+0x2f1/0x37c [processor]
[126205.443583] [<ffffffff8038f449>] ? cpuidle_idle_call+0x8b/0xca
[126205.443583] [<ffffffff8020b018>] ? cpu_idle+0x4a/0x8b
[126205.443583] Code: e8 48 c1 e5 04 48 03 6a 18 48 89 ef e8 74 b4 04 00 48 8b 8b e8 02 00 00 48 8b 51 20 49 89 54 24 18 48 85 d2 74 09 49 8d 44 24 18 <48> 89 42 08 49 8d 44 24 18 48 89 41 20 48 8d 41 20 49 89 8c 24
[126205.443583] RIP [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74
[126205.443583] RSP <ffff88066f94fb60>
[126210.010431] Kernel panic - not syncing: Fatal exception in interrupt
[126210.091616] ------------[ cut here ]------------
[126210.091616] WARNING: at arch/x86/kernel/smp.c:118 try_to_wake_up+0x12d/0x183()
[126210.091616] Modules linked in: xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_hashlimit xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler snd_pcm snd_timer snd soundcore snd_page_alloc evdev i2c_i801 pcspkr i2c_core button ext3 jbd mbcache sd_mod 3w_9xxx igb scsi_mod dca thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[126210.091616] Pid: 0, comm: swapper Tainted: G D 2.6.28.10-univ #1
[126210.091616] Call Trace:
[126210.091616] <IRQ> [<ffffffff80233b89>] warn_on_slowpath+0x51/0x75
[126210.091616] [<ffffffff8027148e>] cpupri_set+0x10f/0x138
[126210.091616] [<ffffffff8022ce38>] enqueue_task_rt+0x13f/0x1f6
[126210.091616] [<ffffffff8022940a>] enqueue_task+0x59/0x64
[126210.091616] [<ffffffff802294f5>] activate_task+0x28/0x30
[126210.091616] [<ffffffff8022eb5a>] try_to_wake_up+0x12d/0x183
[126210.091616] [<ffffffff80251e7b>] smp_call_function_mask+0xbb/0x1d2
[126210.091616] [<ffffffff802297b2>] __wake_up_common+0x46/0x76
[126210.091616] [<ffffffff80229b81>] complete+0x38/0x4b
[126210.091616] [<ffffffffa014d465>] deliver_recv_msg+0x11/0x1a [ipmi_si]
[126210.091616] [<ffffffffa014d8f5>] smi_event_handler+0x335/0x40b [ipmi_si]
[126210.091616] [<ffffffffa014da32>] set_run_to_completion+0x29/0x30 [ipmi_si]
[126210.091616] [<ffffffffa013f163>] panic_event+0x3e/0x5d [ipmi_msghandler]
[126210.091616] [<ffffffff80248d71>] notifier_call_chain+0x29/0x4c
[126210.091616] [<ffffffff8040fd1a>] panic+0xaa/0x136
[126210.091616] [<ffffffff8020c2f6>] ret_from_intr+0x0/0xa
[126210.091616] [<ffffffff8020e82c>] oops_end+0x38/0x88
[126210.091616] [<ffffffff8020e86f>] oops_end+0x7b/0x88
[126210.091616] [<ffffffff80412179>] error_exit+0x0/0x51
[126210.091616] [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74
[126210.091616] [<ffffffff803daaea>] tcp_v4_syn_recv_sock+0x1bf/0x215
[126210.091616] [<ffffffff803db5f0>] tcp_check_req+0x207/0x3b7
[126210.091616] [<ffffffff803d9db4>] tcp_v4_do_rcv+0x267/0x37a
[126210.091616] [<ffffffffa01b5541>] nf_ct_deliver_cached_events+0x51/0x80 [nf_conntrack]
[126210.091616] [<ffffffffa01c6384>] ipv4_confirm+0xcb/0xd6 [nf_conntrack_ipv4]
[126210.091616] [<ffffffff803da39c>] tcp_v4_rcv+0x4d5/0x774
[126210.091616] [<ffffffff803ba538>] nf_hook_slow+0x62/0xc3
[126210.091616] [<ffffffff803c0204>] ip_local_deliver_finish+0x0/0x1ee
[126210.091616] [<ffffffff803c0320>] ip_local_deliver_finish+0x11c/0x1ee
[126210.091616] [<ffffffff803bff77>] ip_rcv_finish+0x30b/0x325
[126210.091616] [<ffffffff803c01c0>] ip_rcv+0x22f/0x273
[126210.091616] [<ffffffffa00642ed>] igb_poll+0x52d/0xee0 [igb]
[126210.091616] [<ffffffff8022ed41>] rebalance_domains+0x166/0x461
[126210.091616] [<ffffffff802483ff>] ktime_get_ts+0x21/0x4a
[126210.091616] [<ffffffff8039fc86>] net_rx_action+0xa7/0x1cb
[126210.091616] [<ffffffff8023875b>] __do_softirq+0x7c/0x135
[126210.091616] [<ffffffffa00657e9>] igb_intr_msi+0xb9/0x100 [igb]
[126210.091616] [<ffffffff8020d03c>] call_softirq+0x1c/0x28
[126210.091616] [<ffffffff8020e53c>] do_softirq+0x2c/0x68
[126210.091616] [<ffffffff8023848f>] irq_exit+0x3f/0x85
[126210.091616] [<ffffffff8020e767>] do_IRQ+0xc5/0xe2
[126210.091616] [<ffffffff8020c2f6>] ret_from_intr+0x0/0xa
[126210.091616] <EOI> [<ffffffffa0012428>] acpi_idle_enter_bm+0x2fb/0x37c [processor]
[126210.091616] [<ffffffffa001241e>] acpi_idle_enter_bm+0x2f1/0x37c [processor]
[126210.091616] [<ffffffff8038f449>] cpuidle_idle_call+0x8b/0xca
[126210.091616] [<ffffffff8020b018>] cpu_idle+0x4a/0x8b
[126210.091616] ---[ end trace 828328a8c2c41748 ]---

The I/O scheduler was switched from deadline to cfq. The kernel didn't panic for nearly 48h; tests continue.

Another panic. The distinctive things are:
- the box never panics under heavy load
- panics occur at load minimums (night time) and shortly after taking load off the box

Krzysiek was able to reproduce the panics by deleting iptables rules with the "-m socket" match (part of the tproxy suite).

After restructuring the iptables ruleset we were able to work around the socket match problem (by excluding some traffic). It seems that the slab corruptions were caused by traffic destined to local port 80 hitting the "-m socket" rule or the TPROXY target.

If no traffic destined to local port 80 hits the "-m socket" rule or the TPROXY target, there is no kernel panic after iptables rule deletion.
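
A note on reading the dumps above: the RDX value in both oopses (a56b6b6b6b6b6b6b) and the Redzone values in the slab corruption reports match the slab debugging constants defined in the kernel's include/linux/poison.h (POISON_FREE 0x6b, POISON_END 0xa5, RED_INACTIVE, RED_ACTIVE). That suggests __inet_inherit_port followed a pointer loaded from an already freed object, though this reading is an interpretation and was not confirmed in the report. The sketch below is only an illustrative helper for classifying such values. The function names classify_word and classify_redzone are made up for this example and are not part of any kernel tooling.

```python
# Slab debugging constants as defined in include/linux/poison.h.
RED_INACTIVE = 0x09F911029D74E35B  # redzone value while the object is free
RED_ACTIVE = 0xD84156C5635688C0    # redzone value while the object is allocated
POISON_INUSE = 0x5A                # fill byte for allocated, uninitialised memory
POISON_FREE = 0x6B                 # fill byte for freed memory
POISON_END = 0xA5                  # last byte of a poisoned object


def classify_redzone(value: int) -> str:
    """Classify a 'Redzone:' value from a slab corruption report."""
    if value == RED_ACTIVE:
        return "active"
    if value == RED_INACTIVE:
        return "inactive"
    return "corrupt"


def classify_word(value: int) -> str:
    """Classify a 64-bit register or memory word by its poison bytes."""
    data = value.to_bytes(8, "little")  # x86-64 stores words little-endian
    if all(b == POISON_FREE for b in data):
        return "freed-poison"
    if all(b == POISON_FREE for b in data[:-1]) and data[-1] == POISON_END:
        return "freed-poison-tail"
    if all(b == POISON_INUSE for b in data):
        return "uninitialised-poison"
    return "other"


if __name__ == "__main__":
    # RDX from both general protection faults in this report.
    print(classify_word(0xA56B6B6B6B6B6B6B))
    # Redzone values from the slab corruption reports.
    print(classify_redzone(0x9F911029D74E35B))
    print(classify_redzone(0xD84156C5635688C0))
```

A word classified as "freed-poison-tail" is the last eight bytes of a freed, poisoned object. Used as a pointer it is a non-canonical address, which would be consistent with the general protection fault rather than a page fault.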