Distribution:Debian Lenny Hardware Environment: INTEL Server Board S5520HC Intel(R) Xeon(R) CPU X5560 @ 2.80GHz 2x RAID bus controller 3ware Inc 9690SA-8I Software Environment: squid (multi instances) Problem Description: Sample message: [144108.757381] ------------[ cut here ]------------ [144108.761334] kernel BUG at mm/slab.c:602! [144108.860077] invalid opcode: 0000 [#1] SMP [144108.911818] last sysfs file: /sys/class/i2c-adapter/i2c-0/name [144108.960102] CPU 5 [144109.007160] Modules linked in: xt_hashlimit reiserfs xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core pcspkr evdev button ext3 jbd mbcache sd_mod 3w_9xxx igb dca scsi_mod thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] [144109.476315] Pid: 32, comm: events/5 Not tainted 2.6.28.10-univ #1 [144109.712325] RIP: 0010:[<ffffffff80294fdc>] [<ffffffff80294fdc>] free_block+0x59/0x119 [144109.712325] RSP: 0018:ffff88066fa71e00 EFLAGS: 00010046 [144109.860289] RAX: 8000000000000000 RBX: ffff88066fbee2c0 RCX: ffffe20000000000 [144109.976312] RDX: ffffe200167ccca0 RSI: ffffffff80516ac0 RDI: ffff88066cccc1c0 [144109.976312] RBP: ffff88066cccc1c0 R08: 0000000000000216 R09: ffff8800281026c0 [144109.976312] R10: 0000000000000002 R11: 0000000000000016 R12: 000000000000000e [144110.180290] R13: ffff88066f1c5088 R14: 0000000000000014 R15: 000000000000001e [144110.180290] FS: 0000000000000000(0000) GS:ffff88066f936a40(0000) knlGS:0000000000000000 [144110.180290] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [144110.180290] CR2: 00007f27ec765508 CR3: 0000000000201000 CR4: 00000000000006e0 [144110.180290] DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000 [144110.180290] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [144110.180290] Process events/5 (pid: 32, threadinfo ffff88066fa70000, task ffff88066fa66660) [144110.180290] Stack: [144110.180290] ffff88066c97c2e0 ffff88066f1c5018 000000000000001e ffff88066f1c5000 [144110.180290] ffff88066f152940 0000000000000000 ffff88066fbee2c0 ffffffff8029523e [144110.180290] 0000000580510d70 ffff88066f152940 ffff88066fbee2c0 ffff880028071630 [144110.180290] Call Trace: [144110.180290] Call Trace: [144110.180290] [<ffffffff8029523e>] ? drain_array+0x89/0xba [144110.180290] [<ffffffff802953cf>] ? cache_reap+0x83/0x105 [144110.180290] [<ffffffff8029534c>] ? cache_reap+0x0/0x105 [144110.180290] [<ffffffff80242332>] ? run_workqueue+0x79/0xfe [144110.180290] [<ffffffff8024248f>] ? worker_thread+0xd8/0xe7 [144110.180290] [<ffffffff802457b5>] ? autoremove_wake_function+0x0/0x2e [144110.180290] [<ffffffff802423b7>] ? worker_thread+0x0/0xe7 [144110.180290] [<ffffffff802454a6>] ? kthread+0x47/0x73 [144110.180290] [<ffffffff80231586>] ? schedule_tail+0x27/0x5f [144110.180290] [<ffffffff8020ccd9>] ? child_rip+0xa/0x11 [144110.180290] [<ffffffff8024545f>] ? kthread+0x0/0x73 [144110.180290] [<ffffffff8020cccf>] ? 
child_rip+0x0/0x11 [144110.180290] Code: bb e2 f8 ff 48 b9 00 00 00 00 00 e2 ff ff 48 c1 e8 0c 48 6b c0 38 48 8d 14 08 48 8b 02 f6 c4 40 74 04 48 8b 52 10 80 3a 00 78 04 <0f> 0b eb fe 48 8b 72 30 4a 8b 4c f3 08 48 8b 16 48 8b 46 08 48 [144110.180290] RIP [<ffffffff80294fdc>] free_block+0x59/0x119 [144110.180290] RSP <ffff88066fa71e00> [144110.180290] ---[ end trace b24fd99ca9f76a6b ]--- [144112.374657] igb 0000:01:00.1: Detected Tx Unit Hang [144112.374658] Tx Queue <0> [144112.374660] TDH <40a> [144112.374661] TDT <40a> [144112.374662] next_to_use <40a> [144112.374663] next_to_clean <468> [144112.374664] head (WB) <40a> [144112.374666] buffer_info[next_to_clean] [144112.374667] time_stamp <10224964b> [144112.374668] jiffies <1022499c5> [144112.374669] desc.status <0> [144112.920912] igb 0000:01:00.1: Detected Tx Unit Hang [144112.920912] Tx Queue <0> [144112.920913] TDH <452> [144112.920913] TDT <452> [144112.920914] next_to_use <452> [144112.920914] next_to_clean <4b0> [144112.920915] head (WB) <452> [144112.920915] buffer_info[next_to_clean] [144112.920916] time_stamp <10224964b> [144112.920916] jiffies <102249a4e> [144112.920917] desc.status <0> Message with another day: [106253.058465] ------------[ cut here ]------------ [106253.062037] kernel BUG at mm/slab.c:3000! 
[106253.062037] invalid opcode: 0000 [#1] SMP [106253.062037] last sysfs file: /sys/class/i2c-adapter/i2c-0/name [106253.062037] CPU 4 [106253.062037] Modules linked in: reiserfs xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core pcspkr button joydev evdev ext3 jbd mbcache sd_mod 3w_9xxx usbhid hid igb uhci_hcd ehci_hcd usbcore scsi_mod dca thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] [106253.062037] Pid: 4864, comm: squid Not tainted 2.6.28.10-univ #1 [106253.062037] RIP: 0010:[<ffffffff80295770>] [<ffffffff80295770>] cache_alloc_refill+0x10f/0x4a2 [106253.062037] RSP: 0018:ffff880641463d98 EFLAGS: 00010046 [106253.062037] RAX: 000000000000003b RBX: ffff88066f305940 RCX: 0000000000000029 [106253.062037] RDX: ffff88065e1b9000 RSI: ffff88066dd98000 RDI: ffff88066f305950 [106254.400801] RBP: ffff88066f317000 R08: ffff88066f305960 R09: 0000000000000002 [106254.464405] R10: ffff88066f4b1a40 R11: 0000000000000202 R12: 0000000000000013 [106254.528348] R13: ffff88066fbfa2c0 R14: 0000000000008000 R15: ffffffff80535400 [106254.640568] FS: 00007fae29c326e0(0000) GS:ffff88066f897240(0000) knlGS:0000000000000000 [106254.752802] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [106254.752802] CR2: 00007f8c13e9c650 CR3: 000000063694d000 CR4: 00000000000006e0 [106254.896623] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [106255.008775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [106255.008775] Process squid (pid: 4864, threadinfo ffff880641462000, task ffff88066c48c000) [106255.008775] Stack: [106255.008775] 0000002000000001 0000000000000000 
0000000000000020 ffff880600000040 [106255.008775] ffff880641463de8 ffff88066c48c278 ffff8803d758a3c0 0000000000000282 [106255.008775] ffff88066fbfa2c0 0000000000000020 ffff880470519280 0000000000008000 [106255.444381] Call Trace: [106255.444381] [<ffffffff80295638>] ? kmem_cache_alloc+0x3f/0x68 [106255.444381] [<ffffffff803c5b6d>] ? inet_bind_bucket_create+0x16/0x60 [106255.444381] [<ffffffff803c7626>] ? inet_csk_get_port+0x1fc/0x20a [106255.444381] [<ffffffff803e42aa>] ? inet_bind+0x103/0x1a3 [106255.444381] [<ffffffff803950e2>] ? sys_bind+0x61/0x91 [106255.444381] [<ffffffff803c4cf9>] ? ip_setsockopt+0x1c/0x78 [106255.928626] [<ffffffff8039402f>] ? sys_setsockopt+0x8b/0x9c [106255.956373] [<ffffffff8020bd7b>] ? system_call_fastpath+0x16/0x1b [106255.956373] Code: 00 e9 be 00 00 00 48 8b 33 48 39 de 75 14 48 8b 73 20 c7 43 60 01 00 00 00 4c 39 c6 0f 84 9b 00 00 00 8b 46 20 41 3b 45 58 72 2e <0f> 0b eb fe 8b 4d 00 41 8b 55 4c ff c0 0f af 56 24 89 46 20 48 [106255.956373] RIP [<ffffffff80295770>] cache_alloc_refill+0x10f/0x4a2 [106255.956373] RSP <ffff880641463d98> [106256.440649] Kernel panic - not syncing: Fatal exception in interrupt [106256.468330] ------------[ cut here ]------------ [106256.468330] WARNING: at kernel/smp.c:333 smp_call_function_mask+0x37/0x1d2() [106256.468330] Modules linked in: reiserfs xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core pcspkr button joydev evdev ext3 jbd mbcache sd_mod 3w_9xxx usbhid hid igb uhci_hcd ehci_hcd usbcore scsi_mod dca thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] [106256.980308] Pid: 4864, comm: squid Tainted: G D 2.6.28.10-univ #1 
[106256.980308] Call Trace: [106256.980308] [<ffffffff80233b89>] warn_on_slowpath+0x51/0x75 [106257.492352] [<ffffffff80248c30>] up+0xe/0x36 [106257.492352] [<ffffffff80234202>] release_console_sem+0x17b/0x1ad [106257.492352] [<ffffffff80248c30>] up+0xe/0x36 [106257.492352] [<ffffffff80251df7>] smp_call_function_mask+0x37/0x1d2 [106257.492352] [<ffffffff8040e8c9>] printk+0x4e/0x5d [106257.492352] [<ffffffff8025b31b>] crash_kexec+0xee/0xf7 [106257.492352] [<ffffffff80295700>] cache_alloc_refill+0x9f/0x4a2 [106257.492352] [<ffffffff8040e8c9>] printk+0x4e/0x5d [106258.004373] [<ffffffff8021a56d>] native_smp_send_stop+0x1a/0x26 [106258.004373] [<ffffffff8040e7da>] panic+0x95/0x136 [106258.004373] [<ffffffff80325a89>] vga_set_palette+0xe8/0x102 [106258.004373] [<ffffffff80325a89>] vga_set_palette+0xe8/0x102 [106258.004373] [<ffffffff80325ec6>] vgacon_set_cursor_size+0xe1/0x101 [106258.004373] [<ffffffff8020e86f>] oops_end+0x7b/0x88 [106258.004373] [<ffffffff8020daca>] do_invalid_op+0x85/0x8f [106258.488594] [<ffffffff80295770>] cache_alloc_refill+0x10f/0x4a2 [106258.572424] [<ffffffff80396846>] lock_sock_nested+0x9a/0xa5 [106258.572424] [<ffffffff80410aab>] _spin_lock_bh+0x9/0x1f [106258.572424] [<ffffffff80410c49>] error_exit+0x0/0x51 [106258.572424] [<ffffffff80295770>] cache_alloc_refill+0x10f/0x4a2 [106258.572424] [<ffffffff802956ea>] cache_alloc_refill+0x89/0x4a2 [106258.572424] [<ffffffff80295638>] kmem_cache_alloc+0x3f/0x68 [106258.572424] [<ffffffff803c5b6d>] inet_bind_bucket_create+0x16/0x60 [106259.000594] [<ffffffff803c7626>] inet_csk_get_port+0x1fc/0x20a [106259.084341] [<ffffffff803e42aa>] inet_bind+0x103/0x1a3 [106259.084341] [<ffffffff803950e2>] sys_bind+0x61/0x91 [106259.084341] [<ffffffff803c4cf9>] ip_setsockopt+0x1c/0x78 [106259.084341] [<ffffffff8039402f>] sys_setsockopt+0x8b/0x9c [106259.084341] [<ffffffff8020bd7b>] system_call_fastpath+0x16/0x1b [106259.084341] ---[ end trace a31fff95b699fbc3 ]--- [106259.512534] Rebooting in 3 seconds.. 
And crash during halt server: [15979.816114] ------------[ cut here ]------------ [15979.820092] kernel BUG at mm/slab.c:602! [15979.820092] invalid opcode: 0000 [#1] SMP [15979.820092] last sysfs file: /sys/class/i2c-adapter/i2c-0/name [15979.820092] CPU 0 [15979.820092] Modules linked in: reiserfs xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr i2c_i801 i2c_core button joydev evdev ext3 jbd mbcache sd_mod 3w_9xxx usbhid hid igb ehci_hcd uhci_hcd usbcore scsi_mod dca thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] [15979.820092] Pid: 27, comm: events/0 Not tainted 2.6.28.10-univ #1 [15979.820092] RIP: 0010:[<ffffffff80294fdc>] [<ffffffff80294fdc>] free_block+0x59/0x119 [15979.820092] RSP: 0018:ffff88066fa61e00 EFLAGS: 00010046 [15979.820092] RAX: 8000000000000000 RBX: ffff88066fbee2c0 RCX: ffffe20000000000 [15979.820092] RDX: ffffe200167e96e8 RSI: ffff88066f303018 RDI: ffff88066d4fb140 [15979.820092] RBP: ffff88066d4fb140 R08: 0000000000000000 R09: ffff88066d7b8418 [15979.820092] R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000000 [15979.820092] R13: ffff88066f303018 R14: 0000000000000014 R15: 0000000000000001 [15979.820092] FS: 0000000000000000(0000) GS:ffffffff805a3040(0000) knlGS:0000000000000000 [15979.820092] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [15979.820092] CR2: 00000000006c3d28 CR3: 0000000000201000 CR4: 00000000000006e0 [15979.820092] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [15979.820092] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [15979.820092] Process events/0 (pid: 27, threadinfo ffff88066fa60000, task 
ffff88066fa2e660) [15979.820092] Stack: [15979.820092] 0000000000000000 ffff88066f303018 0000000000000001 ffff88066f303000 [15979.820092] ffff88066f1b5940 0000000000000000 ffff88066fbee2c0 ffffffff8029523e [15979.820092] 0000000000000658 ffff88066f1b5940 ffff88066fbee2c0 ffff88002803a630 [15979.820092] Call Trace: [15979.820092] [<ffffffff8029523e>] ? drain_array+0x89/0xba [15979.820092] [<ffffffff8029539d>] ? cache_reap+0x51/0x105 [15979.820092] [<ffffffff8029534c>] ? cache_reap+0x0/0x105 [15979.820092] [<ffffffff80242332>] ? run_workqueue+0x79/0xfe [15979.820092] [<ffffffff8024248f>] ? worker_thread+0xd8/0xe7 [15979.820092] [<ffffffff802457b5>] ? autoremove_wake_function+0x0/0x2e [15979.820092] [<ffffffff802423b7>] ? worker_thread+0x0/0xe7 [15979.820092] [<ffffffff802454a6>] ? kthread+0x47/0x73 [15979.820092] [<ffffffff80231586>] ? schedule_tail+0x27/0x5f [15979.820092] [<ffffffff8020ccd9>] ? child_rip+0xa/0x11 [15979.820092] [<ffffffff8024545f>] ? kthread+0x0/0x73 [15979.820092] [<ffffffff8020cccf>] ? child_rip+0x0/0x11 [15979.820092] Code: bb e2 f8 ff 48 b9 00 00 00 00 00 e2 ff ff 48 c1 e8 0c 48 6b c0 38 48 8d 14 08 48 8b 02 f6 c4 40 74 04 48 8b 52 10 80 3a 00 78 04 <0f> 0b eb fe 48 8b 72 30 4a 8b 4c f3 08 48 8b 16 48 8b 46 08 48 [15979.820092] RIP [<ffffffff80294fdc>] free_block+0x59/0x119 [15979.820092] RSP <ffff88066fa61e00> [15979.820092] ---[ end trace 856285f8416c5c97 ]--- Steps to reproduce: Server crashes randomly about once per day.
These look like fairly random memory corruptions. Does the hardware use ECC memory, and has the memory been tested? Also, I see reiserfs is loaded. Is it heavily used on this system? (I'm just comparing this to another similar-looking report.)
There was nothing meaningful in the IPMI and mcelog logs, but the RAM hasn't been checked yet. If reiserfs could be the problem, would converting those partitions to ext3 fix it?
The RAM is ECC.
It would be a useful test to run without reiserfs for a bit, if that is practicable. It would help cut down the possibilities. Have you had older kernels on this box that were stable?
This is a new box, not tested on older kernels. All partitions with reiserfs been converted to ext3, and today was a new kernel panic: [75864.834871] general protection fault: 0000 [#1] SMP [75864.838837] last sysfs file: /sys/class/i2c-adapter/i2c-0/name [75864.838837] CPU 0 [75864.838837] Modules linked in: xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_hashlimit xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler evdev snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 pcspkr i2c_core button ext3 jbd mbcache sd_mod 3w_9xxx igb scsi_mod dca thermal processor fan thermal_sys [last unloaded: reiserfs] [75864.838837] Pid: 0, comm: swapper Not tainted 2.6.28.10-univ #1 [75864.838837] RIP: 0010:[<ffffffff803c6ae1>] [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74 [75864.838837] RSP: 0018:ffffffff8059cba0 EFLAGS: 00010282 [75864.838837] RAX: ffff880493afe6f0 RBX: ffff88016f756ce0 RCX: ffff88066d18cf28 [75864.838837] RDX: a56b6b6b6b6b6b6b RSI: ffff880493afe6d8 RDI: ffff88066e400500 [75864.838837] RBP: ffff88066e400500 R08: 0000000000000000 R09: 00000000e8644dd4 [75864.838837] R10: 000005ef805f8c10 R11: ffff880663856658 R12: ffff880493afe6d8 [75864.838837] R13: ffff880063c807b8 R14: ffff88016f756ce0 R15: ffff880063c807b8 [75864.838837] FS: 0000000000000000(0000) GS:ffffffff805a5040(0000) knlGS:0000000000000000 [75864.838837] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [75864.838837] CR2: 00007f43b5091000 CR3: 0000000000201000 CR4: 00000000000006e0 [75864.838837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [75864.838837] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [75864.838837] Process swapper (pid: 0, threadinfo ffffffff8053c000, task ffffffff804e0340) 
[75864.838837] Stack: [75864.838837] ffff880493afe6d8 ffff880663856658 ffff8801ab72a888 ffffffff803daaea [75864.838837] 0000000000000000 ffff88016f756ce0 ffff880663856658 0000000000001000 [75864.838837] ffff8804bc820020 ffffffff803db5f0 ffff88066c1e8aa0 ffff88041a2b7028 [75864.838837] Call Trace: [75864.838837] <IRQ> <0> [<ffffffff803daaea>] ? tcp_v4_syn_recv_sock+0x1bf/0x215 [75864.838837] [<ffffffff803db5f0>] ? tcp_check_req+0x207/0x3b7 [75864.838837] [<ffffffff803d9db4>] ? tcp_v4_do_rcv+0x267/0x37a [75864.838837] [<ffffffffa01aa541>] ? nf_ct_deliver_cached_events+0x51/0x80 [nf_conntrack] [75864.838837] [<ffffffffa01bb384>] ? ipv4_confirm+0xcb/0xd6 [nf_conntrack_ipv4] [75864.838837] [<ffffffff803da39c>] ? tcp_v4_rcv+0x4d5/0x774 [75864.838837] [<ffffffff803ba538>] ? nf_hook_slow+0x62/0xc3 [75864.838837] [<ffffffff803c0204>] ? ip_local_deliver_finish+0x0/0x1ee [75864.838837] [<ffffffff803c0320>] ? ip_local_deliver_finish+0x11c/0x1ee [75864.838837] [<ffffffff803bff77>] ? ip_rcv_finish+0x30b/0x325 [75864.838837] [<ffffffff803c01c0>] ? ip_rcv+0x22f/0x273 [75864.838837] [<ffffffffa006384b>] ? igb_clean_rx_irq_adv+0x3bb/0x484 [igb] [75864.838837] [<ffffffffa0063acc>] ? igb_clean_rx_ring_msix+0x4a/0x156 [igb] [75864.838837] [<ffffffff8039fc86>] ? net_rx_action+0xa7/0x1cb [75864.838837] [<ffffffff8023875b>] ? __do_softirq+0x7c/0x135 [75864.838837] [<ffffffff8020d03c>] ? call_softirq+0x1c/0x28 [75864.838837] [<ffffffff8020e53c>] ? do_softirq+0x2c/0x68 [75864.838837] [<ffffffff8023848f>] ? irq_exit+0x3f/0x85 [75864.838837] [<ffffffff8020e767>] ? do_IRQ+0xc5/0xe2 [75864.838837] [<ffffffff8020c2f6>] ? ret_from_intr+0x0/0xa [75864.838837] <EOI> <0> [<ffffffffa0012428>] ? acpi_idle_enter_bm+0x2fb/0x37c [processor] [75864.838837] [<ffffffffa001241e>] ? acpi_idle_enter_bm+0x2f1/0x37c [processor] [75864.838837] [<ffffffff8026e710>] ? rcu_needs_cpu+0x35/0x44 [75864.838837] [<ffffffff8038f449>] ? cpuidle_idle_call+0x8b/0xca [75864.838837] [<ffffffff8020b018>] ? 
cpu_idle+0x4a/0x8b [75864.838837] Code: e8 48 c1 e5 04 48 03 6a 18 48 89 ef e8 74 b4 04 00 48 8b 8b e8 02 00 00 48 8b 51 20 49 89 54 24 18 48 85 d2 74 09 49 8d 44 24 18 <48> 89 42 08 49 8d 44 24 18 48 89 41 20 48 8d 41 20 49 89 8c 24 [75864.838837] RIP [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74 [75864.838837] RSP <ffffffff8059cba0> [75869.296074] Kernel panic - not syncing: Fatal exception in interrupt
Hi, Krzysiek's co-worker here. Box is running two bonded igb interfaces in IntrMode=2 (MSI-X + RSS). Could multiqueue be somehow involved?
At this time, igb module is loaded with the following parameters: "intmode=1, no rss, no msi-x", but tonight was a server crash. In dmesg I found: [ 2926.102332] Slab corruption: tw_sock_TCP start=ffff88059a441740, len=224 [ 2926.182688] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. [ 2926.247388] Last user: [<ffffffff803c7780>](inet_twsk_put+0x59/0x69) [ 2926.323866] 020: b8 ac 9b 62 05 88 ff ff 6b 6b 6b 6b 6b 6b 6b 6b [ 2926.398986] Prev obj: start=ffff88059a441648, len=224 [ 2926.459569] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. [ 2926.524279] Last user: [<ffffffff803c7780>](inet_twsk_put+0x59/0x69) [ 2926.600775] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 2926.676112] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 2926.751435] Next obj: start=ffff88059a441838, len=224 [ 2926.812027] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0. [ 2926.878857] Last user: [<ffffffff803c754b>](inet_twsk_alloc+0x26/0xe4) [ 2926.957484] 000: 02 00 06 01 00 00 00 00 00 00 00 00 00 00 00 00 [ 2927.032812] 010: 88 e5 b7 6e 06 88 ff ff 60 d6 85 b5 05 88 ff ff [ 3051.348547] Slab corruption: tw_sock_TCP start=ffff88059a441740, len=224 [ 3051.428853] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. [ 3051.493524] Last user: [<ffffffff803c7780>](inet_twsk_put+0x59/0x69) [ 3051.569970] 020: 18 ff 83 13 05 88 ff ff 6b 6b 6b 6b 6b 6b 6b 6b [ 3051.645226] Prev obj: start=ffff88059a441648, len=224 [ 3051.705809] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0. [ 3051.772632] Last user: [<ffffffff803c754b>](inet_twsk_alloc+0x26/0xe4) [ 3051.851238] 000: 02 00 06 01 00 00 00 00 00 00 00 00 00 00 00 00 [ 3051.926754] 010: 78 44 b8 6e 06 88 ff ff e8 40 4c 89 05 88 ff ff [ 3052.002177] Next obj: start=ffff88059a441838, len=224 [ 3052.062762] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0. 
[ 3052.129591] Last user: [<ffffffff803c754b>](inet_twsk_alloc+0x26/0xe4) [ 3052.208196] 000: 02 00 06 01 00 00 00 00 00 00 00 00 00 00 00 00 [ 3052.283702] 010: 58 e0 b3 6e 06 88 ff ff 78 8c df 51 06 88 ff ff
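For anyone reading the "Slab corruption" dumps above, here is a minimal sketch of how the values can be decoded. It assumes the usual slab debug constants from the kernel's include/linux/poison.h (redzone 0x09F911029D74E35B for a freed object, 0xD84156C5635688C0 for an allocated one, fill byte 0x6b on free, final byte 0xa5). The helper names are made up for illustration and are not part of any kernel tooling.

```python
# Slab debug constants, as defined in the kernel's include/linux/poison.h.
RED_INACTIVE = 0x09F911029D74E35B  # redzone value around a freed object
RED_ACTIVE = 0xD84156C5635688C0    # redzone value around an allocated object
POISON_FREE = 0x6B                 # fill byte written into an object on free
POISON_END = 0xA5                  # last byte of a poisoned object


def redzone_state(text):
    """Classify a redzone value as printed in dmesg (leading zeros may be dropped)."""
    value = int(text, 16)
    if value == RED_INACTIVE:
        return "free"
    if value == RED_ACTIVE:
        return "allocated"
    return "corrupt"


def parse_dump_row(row):
    """Split a row such as '020: b8 ac ...' into (start offset, raw bytes)."""
    offset_text, _, data = row.partition(":")
    return int(offset_text, 16), bytes(int(tok, 16) for tok in data.split())


def overwritten(row):
    """Return (offset, byte) pairs in one row that no longer hold POISON_FREE.

    This row-level check does not know where the object ends, so it would
    also flag the legitimate POISON_END byte in an object's final row.
    """
    base, data = parse_dump_row(row)
    return [(base + i, b) for i, b in enumerate(data) if b != POISON_FREE]


def as_little_endian(pairs):
    """Interpret the overwritten bytes as one little-endian integer."""
    return int.from_bytes(bytes(b for _, b in pairs), "little")


# Row taken from the first tw_sock_TCP report above.
row = "020: b8 ac 9b 62 05 88 ff ff 6b 6b 6b 6b 6b 6b 6b 6b"
dirty = overwritten(row)
print([hex(off) for off, _ in dirty])
print(hex(as_little_endian(dirty)))
print(redzone_state("0x9f911029d74e35b"))
```

Read this way, the object was already free (inactive redzone, last user inet_twsk_put) when eight bytes at offset 0x20 were overwritten with what looks like a kernel pointer, which would be consistent with a write after free. The same constants explain RDX: a56b6b6b6b6b6b6b in the earlier general protection fault: seven 0x6b bytes followed by 0xa5, read as a little-endian word, is the tail of a freed, poisoned object.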
Any suggestions on how to proceed or how to test for this bug?
slab corruptions in dmesg, no panic yet: [78633.978592] Slab corruption: TCP start=ffff88040d9a2ce0, len=1520 [78634.051615] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. [78634.116326] Last user: [<ffffffff803987e0>](sk_free+0xad/0xc2) [78634.186566] 020: 40 fa d8 88 00 88 ff ff 6b 6b 6b 6b 6b 6b 6b 6b [78634.261131] Prev obj: start=ffff88040d9a26d8, len=1520 [78634.322708] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0. [78634.389484] Last user: [<ffffffff8039867f>](sk_prot_alloc+0x1d/0x7e) [78634.466002] 000: 02 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 [78634.540725] 010: 80 85 bb 6e 06 88 ff ff 00 00 00 00 00 00 00 00 [78634.615329] Next obj: start=ffff88040d9a32e8, len=1520 [78634.676904] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0. [78634.743682] Last user: [<ffffffff8039867f>](sk_free+0xad/0xc2) [78634.813858] 000: 02 00 02 00 00 00 00 00 78 59 2f 72 02 88 ff ff [78634.888422] 010: d0 69 af 6e 06 88 ff ff 00 00 00 00 00 00 00 00
panic occured: [126205.439604] general protection fault: 0000 [#1] SMP [126205.443583] last sysfs file: /sys/class/i2c-adapter/i2c-0/name [126205.443583] CPU 3 [126205.443583] Modules linked in: xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_hashlimit xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler snd_pcm snd_timer snd soundcore snd_page_alloc evdev i2c_i801 pcspkr i2c_core button ext3 jbd mbcache sd_mod 3w_9xxx igb scsi_mod dca thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] [126205.443583] Pid: 0, comm: swapper Not tainted 2.6.28.10-univ #1 [126205.443583] RIP: 0010:[<ffffffff803c6ae1>] [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74 [126205.443583] RSP: 0018:ffff88066f94fb60 EFLAGS: 00010282 [126205.443583] RAX: ffff8801075b1380 RBX: ffff8803fe200110 RCX: ffff88066c118428 [126205.443583] RDX: a56b6b6b6b6b6b6b RSI: ffff8801075b1368 RDI: ffff88066e40c3b0 [126205.443583] RBP: ffff88066e40c3b0 R08: 0000000000000000 R09: 000000003790af53 [126205.443583] R10: 000005ef805f8c10 R11: ffff8804bfdd6298 R12: ffff8801075b1368 [126205.443583] R13: ffff8802d4d19078 R14: ffff8803fe200110 R15: ffff8802d4d19078 [126205.443583] FS: 0000000000000000(0000) GS:ffff88066f8f8898(0000) knlGS:0000000000000000 [126205.443583] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [126205.443583] CR2: 00007f7a14ccc740 CR3: 0000000000201000 CR4: 00000000000006e0 [126205.443583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [126205.443583] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [126205.443583] Process swapper (pid: 0, threadinfo ffff88066f946000, task ffff88066f943320) [126205.443583] Stack: [126205.443583] ffff8801075b1368 ffff8804bfdd6298 ffff88008b56ae08 
ffffffff803daaea [126205.443583] 0000000000000000 ffff8803fe200110 ffff8804bfdd6298 0000000000001000 [126205.443583] ffff88027bbd0020 ffffffff803db5f0 ffff88066c956520 ffff8805063a3680 [126205.443583] Call Trace: [126205.443583] <IRQ> <0> [<ffffffff803daaea>] ? tcp_v4_syn_recv_sock+0x1bf/0x215 [126205.443583] [<ffffffff803db5f0>] ? tcp_check_req+0x207/0x3b7 [126205.443583] [<ffffffff803d9db4>] ? tcp_v4_do_rcv+0x267/0x37a [126205.443583] [<ffffffffa01b5541>] ? nf_ct_deliver_cached_events+0x51/0x80 [nf_conntrack] [126205.443583] [<ffffffffa01c6384>] ? ipv4_confirm+0xcb/0xd6 [nf_conntrack_ipv4] [126205.443583] [<ffffffff803da39c>] ? tcp_v4_rcv+0x4d5/0x774 [126205.443583] [<ffffffff803ba538>] ? nf_hook_slow+0x62/0xc3 [126205.443583] [<ffffffff803c0204>] ? ip_local_deliver_finish+0x0/0x1ee [126205.443583] [<ffffffff803c0320>] ? ip_local_deliver_finish+0x11c/0x1ee [126205.443583] [<ffffffff803bff77>] ? ip_rcv_finish+0x30b/0x325 [126205.443583] [<ffffffff803c01c0>] ? ip_rcv+0x22f/0x273 [126205.443583] [<ffffffffa00642ed>] ? igb_poll+0x52d/0xee0 [igb] [126205.443583] [<ffffffff8022ed41>] ? rebalance_domains+0x166/0x461 [126205.443583] [<ffffffff802483ff>] ? ktime_get_ts+0x21/0x4a [126205.443583] [<ffffffff8039fc86>] ? net_rx_action+0xa7/0x1cb [126205.443583] [<ffffffff8023875b>] ? __do_softirq+0x7c/0x135 [126205.443583] [<ffffffffa00657e9>] ? igb_intr_msi+0xb9/0x100 [igb] [126205.443583] [<ffffffff8020d03c>] ? call_softirq+0x1c/0x28 [126205.443583] [<ffffffff8020e53c>] ? do_softirq+0x2c/0x68 [126205.443583] [<ffffffff8023848f>] ? irq_exit+0x3f/0x85 [126205.443583] [<ffffffff8020e767>] ? do_IRQ+0xc5/0xe2 [126205.443583] [<ffffffff8020c2f6>] ? ret_from_intr+0x0/0xa [126205.443583] <EOI> <0> [<ffffffffa0012428>] ? acpi_idle_enter_bm+0x2fb/0x37c [processor] [126205.443583] [<ffffffffa001241e>] ? acpi_idle_enter_bm+0x2f1/0x37c [processor] [126205.443583] [<ffffffff8038f449>] ? cpuidle_idle_call+0x8b/0xca [126205.443583] [<ffffffff8020b018>] ? 
cpu_idle+0x4a/0x8b [126205.443583] Code: e8 48 c1 e5 04 48 03 6a 18 48 89 ef e8 74 b4 04 00 48 8b 8b e8 02 00 00 48 8b 51 20 49 89 54 24 18 48 85 d2 74 09 49 8d 44 24 18 <48> 89 42 08 49 8d 44 24 18 48 89 41 20 48 8d 41 20 49 89 8c 24 [126205.443583] RIP [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74 [126205.443583] RSP <ffff88066f94fb60> [126210.010431] Kernel panic - not syncing: Fatal exception in interrupt [126210.091616] ------------[ cut here ]------------ [126210.091616] WARNING: at arch/x86/kernel/smp.c:118 try_to_wake_up+0x12d/0x183() [126210.091616] Modules linked in: xt_DSCP xt_TPROXY xt_u32 ip_set_iphash xt_socket nf_tproxy_core xt_MARK ipt_NETMAP xt_hashlimit xt_multiport ipt_set xt_state xt_owner xt_dscp xt_tcpudp xt_statistic ip_set_nethash ip_set iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q bonding ipmi_devintf ipmi_watchdog ipmi_si ipmi_msghandler snd_pcm snd_timer snd soundcore snd_page_alloc evdev i2c_i801 pcspkr i2c_core button ext3 jbd mbcache sd_mod 3w_9xxx igb scsi_mod dca thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] [126210.091616] Pid: 0, comm: swapper Tainted: G D 2.6.28.10-univ #1 [126210.091616] Call Trace: [126210.091616] <IRQ> [<ffffffff80233b89>] warn_on_slowpath+0x51/0x75 [126210.091616] [<ffffffff8027148e>] cpupri_set+0x10f/0x138 [126210.091616] [<ffffffff8022ce38>] enqueue_task_rt+0x13f/0x1f6 [126210.091616] [<ffffffff8022940a>] enqueue_task+0x59/0x64 [126210.091616] [<ffffffff802294f5>] activate_task+0x28/0x30 [126210.091616] [<ffffffff8022eb5a>] try_to_wake_up+0x12d/0x183 [126210.091616] [<ffffffff80251e7b>] smp_call_function_mask+0xbb/0x1d2 [126210.091616] [<ffffffff802297b2>] __wake_up_common+0x46/0x76 [126210.091616] [<ffffffff80229b81>] complete+0x38/0x4b [126210.091616] [<ffffffffa014d465>] deliver_recv_msg+0x11/0x1a [ipmi_si] [126210.091616] [<ffffffffa014d8f5>] smi_event_handler+0x335/0x40b [ipmi_si] [126210.091616] 
[<ffffffffa014da32>] set_run_to_completion+0x29/0x30 [ipmi_si] [126210.091616] [<ffffffffa013f163>] panic_event+0x3e/0x5d [ipmi_msghandler] [126210.091616] [<ffffffff80248d71>] notifier_call_chain+0x29/0x4c [126210.091616] [<ffffffff8040fd1a>] panic+0xaa/0x136 [126210.091616] [<ffffffff8020c2f6>] ret_from_intr+0x0/0xa [126210.091616] [<ffffffff8020e82c>] oops_end+0x38/0x88 [126210.091616] [<ffffffff8020e86f>] oops_end+0x7b/0x88 [126210.091616] [<ffffffff80412179>] error_exit+0x0/0x51 [126210.091616] [<ffffffff803c6ae1>] __inet_inherit_port+0x4e/0x74 [126210.091616] [<ffffffff803daaea>] tcp_v4_syn_recv_sock+0x1bf/0x215 [126210.091616] [<ffffffff803db5f0>] tcp_check_req+0x207/0x3b7 [126210.091616] [<ffffffff803d9db4>] tcp_v4_do_rcv+0x267/0x37a [126210.091616] [<ffffffffa01b5541>] nf_ct_deliver_cached_events+0x51/0x80 [nf_conntrack] [126210.091616] [<ffffffffa01c6384>] ipv4_confirm+0xcb/0xd6 [nf_conntrack_ipv4] [126210.091616] [<ffffffff803da39c>] tcp_v4_rcv+0x4d5/0x774 [126210.091616] [<ffffffff803ba538>] nf_hook_slow+0x62/0xc3 [126210.091616] [<ffffffff803c0204>] ip_local_deliver_finish+0x0/0x1ee [126210.091616] [<ffffffff803c0320>] ip_local_deliver_finish+0x11c/0x1ee [126210.091616] [<ffffffff803bff77>] ip_rcv_finish+0x30b/0x325 [126210.091616] [<ffffffff803c01c0>] ip_rcv+0x22f/0x273 [126210.091616] [<ffffffffa00642ed>] igb_poll+0x52d/0xee0 [igb] [126210.091616] [<ffffffff8022ed41>] rebalance_domains+0x166/0x461 [126210.091616] [<ffffffff802483ff>] ktime_get_ts+0x21/0x4a [126210.091616] [<ffffffff8039fc86>] net_rx_action+0xa7/0x1cb [126210.091616] [<ffffffff8023875b>] __do_softirq+0x7c/0x135 [126210.091616] [<ffffffffa00657e9>] igb_intr_msi+0xb9/0x100 [igb] [126210.091616] [<ffffffff8020d03c>] call_softirq+0x1c/0x28 [126210.091616] [<ffffffff8020e53c>] do_softirq+0x2c/0x68 [126210.091616] [<ffffffff8023848f>] irq_exit+0x3f/0x85 [126210.091616] [<ffffffff8020e767>] do_IRQ+0xc5/0xe2 [126210.091616] [<ffffffff8020c2f6>] ret_from_intr+0x0/0xa [126210.091616] <EOI> 
[<ffffffffa0012428>] acpi_idle_enter_bm+0x2fb/0x37c [processor] [126210.091616] [<ffffffffa001241e>] acpi_idle_enter_bm+0x2f1/0x37c [processor] [126210.091616] [<ffffffff8038f449>] cpuidle_idle_call+0x8b/0xca [126210.091616] [<ffffffff8020b018>] cpu_idle+0x4a/0x8b [126210.091616] ---[ end trace 828328a8c2c41748 ]---
The I/O scheduler was switched from deadline to cfq, and the kernel didn't panic for nearly 48 hours. Tests continue.
Another panic. The distinctive pattern is:
- the box never panics under heavy load
- panics occur at load minimums (night time) and shortly after taking load off the box
Krzysiek was able to reproduce the panics by deleting iptables rules that use the "-m socket" match (part of the tproxy suite). After restructuring the iptables ruleset, we were able to work around the socket match problem by excluding some traffic. It seems that the slab corruptions were caused by traffic destined to local port 80 hitting a "-m socket" rule or the TPROXY target. As long as no traffic destined to local port 80 hits a "-m socket" rule or the TPROXY target, deleting iptables rules no longer triggers a kernel panic.