Systems hang unexpectedly. I have found no way to trigger the error, but all of the hangs seem to be related to /dev/sdd, which is an Intel SSD on each host.

The systems are dual 4-core Supermicro machines with 64GB RAM, running a 64-bit kernel with a 32-bit OS on top, using the kernel's 32-bit compatibility support.

Kernel dump 1:

------------[ cut here ]------------
kernel BUG at kernel/timer.c:1036!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/block/sdd/queue/scheduler
CPU 7
Modules linked in: netconsole fuse ip6t_LOG ipt_REJECT ipt_LOG xt_limit xt_state xt_mark ip6table_mangle iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter ip_tables

Pid: 0, comm: kworker/0:1 Tainted: G M 2.6.39.2 #7 Supermicro X8DTU/X8DTU
RIP: 0010:[<ffffffff81037cb2>] [<ffffffff81037cb2>] cascade+0x51/0x75
RSP: 0000:ffff88103fdc3e60 EFLAGS: 00010096
RAX: 1cdd7220449d9bfa RBX: ffff88103f958000 RCX: 000000010b384000
RDX: 0000000000000021 RSI: ffff8801f332dca0 RDI: ffff88103f958000
RBP: aadd71d766ad94ec R08: ffff88103fdd1460 R09: 0000000000000000
R10: ffffffff81467bb5 R11: 0000000000000001 R12: 0000000000000021
R13: ffff88103fdc3e60 R14: ffff88103f95ffd8 R15: ffff88103f959020
FS: 0000000000000000(0000) GS:ffff88103fdc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000009dc1a00 CR3: 0000000001609000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:1 (pid: 0, threadinfo ffff88103f95e000, task ffff88103f94e090)
Stack:
 ffff8801f332dca0 ffff88049425dbd0 0000000000000004 ffff88103f958000
 ffff88103f95ffd8 0000000000000000 ffff88103fdc3ec0 ffffffff81037d86
 0000000000000007 ffff88103f959420 ffff88103f959820 ffff88103f959c20
Call Trace:
 <IRQ>
 [<ffffffff81037d86>] ? run_timer_softirq+0xb0/0x1f1
 [<ffffffff81049574>] ? ktime_get+0x5f/0xbc
 [<ffffffff81032c6a>] ? __do_softirq+0x7f/0x106
 [<ffffffff8145cc0c>] ? call_softirq+0x1c/0x30
 [<ffffffff81003656>] ? do_softirq+0x31/0x67
 [<ffffffff810160f2>] ? smp_apic_timer_interrupt+0x87/0x97
 [<ffffffff8145c6d3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>
 [<ffffffff8122f3fa>] ? acpi_idle_enter_bm+0x224/0x258
 [<ffffffff8122f3f5>] ? acpi_idle_enter_bm+0x21f/0x258
 [<ffffffff81371a46>] ? cpuidle_idle_call+0x91/0xce
 [<ffffffff81000894>] ? cpu_idle+0x92/0xa6
 [<ffffffff81455ab0>] ? start_secondary+0x1e0/0x1e6
Code: 04 24 48 8b 46 08 48 89 44 24 08 48 89 20 48 89 36 48 89 76 08 48 8b 34 24 48 8b 2e eb 1e 48 8b 46 18 48 83 e0 fe 48 39 c3 74 02 <0f> 0b 48 89 df e8 ca fd ff ff 48 89 ee 48 8b 6d 00 4c 39 ee 75
RIP [<ffffffff81037cb2>] cascade+0x51/0x75
 RSP <ffff88103fdc3e60>
---[ end trace c83e94e1b65e4a66 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: kworker/0:1 Tainted: G M D 2.6.39.2 #7
Call Trace:
 <IRQ>
 [<ffffffff81458fc2>] ? panic+0x92/0x18a
 [<ffffffff810048b9>] ? oops_end+0x7e/0x8d
 [<ffffffff8100214d>] ? do_invalid_op+0x85/0x8f
 [<ffffffff81037cb2>] ? cascade+0x51/0x75
 [<ffffffff81043672>] ? autoremove_wake_function+0x9/0x2a
 [<ffffffff81021ce5>] ? __wake_up_common+0x41/0x78
 [<ffffffff81037f6a>] ? lock_timer_base.clone.23+0x25/0x4c
 [<ffffffff81038477>] ? mod_timer+0x155/0x16d
 [<ffffffff8145c995>] ? invalid_op+0x15/0x20
 [<ffffffff81037cb2>] ? cascade+0x51/0x75
 [<ffffffff81037d86>] ? run_timer_softirq+0xb0/0x1f1
 [<ffffffff81049574>] ? ktime_get+0x5f/0xbc
 [<ffffffff81032c6a>] ? __do_softirq+0x7f/0x106
 [<ffffffff8145cc0c>] ? call_softirq+0x1c/0x30
 [<ffffffff81003656>] ? do_softirq+0x31/0x67
 [<ffffffff810160f2>] ? smp_apic_timer_interrupt+0x87/0x97
 [<ffffffff8145c6d3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>
 [<ffffffff8122f3fa>] ? acpi_idle_enter_bm+0x224/0x258
 [<ffffffff8122f3f5>] ? acpi_idle_enter_bm+0x21f/0x258
 [<ffffffff81371a46>] ? cpuidle_idle_call+0x91/0xce
 [<ffffffff81000894>] ? cpu_idle+0x92/0xa6
 [<ffffffff81455ab0>] ? start_secondary+0x1e0/0x1e6

Another dump from another host.

Kernel dump 2:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff811e870d>] rb_insert_color+0x17/0xd9
PGD a8ecf3067 PUD 103c94067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/block/sdd/queue/scheduler
CPU 2
Modules linked in: netconsole fuse ip6t_LOG ipt_REJECT ipt_LOG xt_limit xt_state xt_mark ip6table_mangle iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables ip6table_filter ip6_tables [last unloaded: netconsole]

Pid: 8179, comm: zsh Tainted: G M 2.6.39.2 #7 Supermicro X8DTU/X8DTU
RIP: 0010:[<ffffffff811e870d>] [<ffffffff811e870d>] rb_insert_color+0x17/0xd9
RSP: 0000:ffff880ddc33bca8 EFLAGS: 00010246
RAX: ffff88000d8f9500 RBX: ffff88053a66d1c8 RCX: 0000000000000000
RDX: ffff88053a66d1d0 RSI: ffff880470912d40 RDI: ffff88000d8f9508
RBP: 0000000000000000 R08: 0000000043a5870e R09: 000000000e0e6473
R10: 000000006563746d R11: 000000006573656c R12: ffff88000d8f9508
R13: ffff880470912d40 R14: 0000000000000007 R15: ffff88000d974440
FS: 0000000000000000(0000) GS:ffff88103fc80000(0063) knlGS:00000000f75c56c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000010 CR3: 0000000bf5913000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process zsh (pid: 8179, threadinfo ffff880ddc33a000, task ffff880cbb5c7990)
Stack:
 ffff88053a66d1c8 ffff8806d0ab8084 ffff880470912d40 00000000c179a694
 0000000000000007 ffffffff8110d1f5 ffff8806d0ab8084 ffff881036438a50
 ffff8806d0ab808441 56 41 55 49 89 f5 41 54 49 89 fc 55 53 e9 9e 00 00 00 48 83 e5 fc 8b 45 10 48 39 c3 75 41 48 8b 45 08 48 85 c0 74 08 48 8b 10
RIP [<ffffffff811e870d>] rb_insert_color+0x17/0xd9
 RSP <ffff880ddc33bca8>
CR2: 0000000000000010
---[ end trace 24f924cfeeb1298a ]---

And another one from an identical host running 2.6.37.6.

Kernel dump 3:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff811e2f35>] rb_insert_color+0x17/0xd9
PGD 9b570b067 PUD 597522067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/block/sdd/queue/scheduler
CPU 7
Modules linked in: i2c_dev i2c_core fuse ip6t_LOG ipt_REJECT ipt_LOG xt_limit xt_state xt_mark ip6table_mangle iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables ip6table_filter ip6_tables

Pid: 12128, comm: zsh Tainted: G M 2.6.37.6 #2 Supermicro X8DTU/X8DTU
RIP: 0010:[<ffffffff811e2f35>] [<ffffffff811e2f35>] rb_insert_color+0x17/0xd9
RSP: 0000:ffff8808179f3ca8 EFLAGS: 00010246
RAX: ffff880ef08d01c0 RBX: ffff8809b2b947c8 RCX: 0000000000000000
RDX: ffff8809b2b947d0 RSI: ffff880eb486ce80 RDI: ffff880ef08d01c8
RBP: 0000000000000000 R08: 0000000043a5870e R09: 000000000e0e7368
R10: 000000007468732e R11: 000000006c656e67 R12: ffff880ef08d01c8
R13: ffff880eb486ce80 R14: 000000000000003b R15: ffff88103ba8c540
FS: 0000000000000000(0000) GS:ffff8800bf5c0000(0063) knlGS:00000000f761d6c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000010 CR3: 0000000c43672000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process zsh (pid: 12128, threadinfo ffff8808179f2000, task ffff880820177300)
Stack:
 ffff8809b2b947c8 ffff880c6c0be544 ffff880eb486ce80 0000000041b9c94d
 000000000000003b ffffffff81108d4d ffff880c6c0be544 ffff88103fbfca68
 ffff880c6c0be544 ffff880ed7317cf8 ffff8808179f3db8 ffffffff811108b4
Call Trace:
 [<ffffffff81108d4d>] ? ext3_htree_store_dirent+0xe0/0xef
 [<ffffffff811108b4>] ? htree_dirblock_to_tree+0xef/0x144
 [<ffffffff81090b23>] ? file_sb_list_add+0xd/0x42
 [<ffffffff81110983>] ? ext3_htree_fill_tree+0x7a/0x1e6
 [<ffffffff81084963>] ? page_add_new_anon_rmap+0x47/0x6c
 [<ffffffff8110887d>] ? ext3_readdir+0x174/0x536
 [<ffffffff810c1912>] ? compat_filldir64+0x0/0xd6
 [<ffffffff8101de7b>] ? do_page_fault+0x31b/0x354
 [<ffffffff810c1912>] ? compat_filldir64+0x0/0xd6
 [<ffffffff8109c8c5>] ? vfs_readdir+0x64/0x9c
 [<ffffffff810c3424>] ? compat_sys_getdents64+0x77/0xbf
 [<ffffffff814512df>] ? page_fault+0x1f/0x30
 [<ffffffff81021e43>] ? ia32_sysret+0x0/0x5
Code: 42 10 eb 03 48 89 06 48 8b 17 83 e2 03 48 09 c2 48 89 17 c3 41 56 41 55 49 89 f5 41 54 49 89 fc 55 53 e9 9e 00 00 00 48 83 e5 fc <48> 8b 45 10 48 39 c3 75 41 48 8b 45 08 48 85 c0 74 08 48 8b 10
RIP [<ffffffff811e2f35>] rb_insert_color+0x17/0xd9
 RSP <ffff8808179f3ca8>
CR2: 0000000000000010
---[ end trace 377318b2e1c551de ]---

cat /proc/scsi/scsi

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA Model: WDC WD2002FYPS-0 Rev: 04.0
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA Model: ST31000340NS Rev: SN06
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA Model: WDC WD20EARS-00M Rev: 51.0
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA Model: INTEL SSDSA2M080 Rev: 2CV1
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: MATSHITA Model: DVD-ROM UJDA780 Rev: 1.50
  Type: CD-ROM ANSI SCSI revision: 05

lspci

00:00.0 Host bridge: Intel Corp.: Unknown device 3406 (rev 13)
00:01.0 PCI bridge: Intel Corp.: Unknown device 3408 (rev 13)
00:03.0 PCI bridge: Intel Corp.: Unknown device 340a (rev 13)
00:05.0 PCI bridge: Intel Corp.: Unknown device 340c (rev 13)
00:06.0 PCI bridge: Intel Corp.: Unknown device 340d (rev 13)
00:07.0 PCI bridge: Intel Corp.: Unknown device 340e (rev 13)
00:09.0 PCI bridge: Intel Corp.: Unknown device 3410 (rev 13)
00:14.0 PIC: Intel Corp.: Unknown device 342e (rev 13)
00:14.1 PIC: Intel Corp.: Unknown device 3422 (rev 13)
00:14.2 PIC: Intel Corp.: Unknown device 3423 (rev 13)
00:14.3 PIC: Intel Corp.: Unknown device 3438 (rev 13)
00:16.0 System peripheral: Intel Corp.: Unknown device 3430 (rev 13)
00:16.1 System peripheral: Intel Corp.: Unknown device 3431 (rev 13)
00:16.2 System peripheral: Intel Corp.: Unknown device 3432 (rev 13)
00:16.3 System peripheral: Intel Corp.: Unknown device 3433 (rev 13)
00:16.4 System peripheral: Intel Corp.: Unknown device 3429 (rev 13)
00:16.5 System peripheral: Intel Corp.: Unknown device 342a (rev 13)
00:16.6 System peripheral: Intel Corp.: Unknown device 342b (rev 13)
00:16.7 System peripheral: Intel Corp.: Unknown device 342c (rev 13)
00:1a.0 USB Controller: Intel Corp.: Unknown device 3a37
00:1a.1 USB Controller: Intel Corp.: Unknown device 3a38
00:1a.2 USB Controller: Intel Corp.: Unknown device 3a39
00:1a.7 USB Controller: Intel Corp.: Unknown device 3a3c
00:1d.0 USB Controller: Intel Corp.: Unknown device 3a34
00:1d.1 USB Controller: Intel Corp.: Unknown device 3a35
00:1d.2 USB Controller: Intel Corp.: Unknown device 3a36
00:1d.7 USB Controller: Intel Corp.: Unknown device 3a3a
00:1e.0 PCI bridge: Intel Corp. 82801BA/CA/DB PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corp.: Unknown device 3a16
00:1f.2 Class 0106: Intel Corp.: Unknown device 3a22
00:1f.3 SMBus: Intel Corp.: Unknown device 3a30
01:00.0 Ethernet controller: Intel Corp.: Unknown device 10c9 (rev 01)
01:00.1 Ethernet controller: Intel Corp.: Unknown device 10c9 (rev 01)
07:01.0 VGA compatible controller: Matrox Graphics, Inc.: Unknown device 0532 (rev 0a)

Please let me know if more info is required.
Created attachment 66652 [details] Kernel config 2.6.39.2
Created attachment 66662 [details] Kernel config 2.6.37.6
*** Bug 38372 has been marked as a duplicate of this bug. ***
From the logs, you can see the taint flags are set: "Tainted: G M". The M flag means the system experienced a machine check exception, which points to a likely hardware issue.
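For anyone reading the flags in these dumps, here is a rough Python sketch of how the letters after "Tainted:" could be decoded. The function name `decode_taint` and the flag table are my own illustration, not a kernel or distribution tool, and the table is only a partial subset of the flags the kernel defines.

```python
import re

# Partial, illustrative subset of kernel taint letters.
# Letters missing from this table are reported as "unknown flag".
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "G": "only GPL-compatible modules loaded",
    "F": "module was force-loaded",
    "R": "module was force-unloaded",
    "M": "machine check exception occurred",
    "B": "bad page state seen",
    "D": "kernel has died recently (oops or BUG)",
    "W": "a warning was issued earlier",
}


def decode_taint(line):
    """Return (letter, meaning) pairs for the flags after 'Tainted:'.

    Assumes the flags are single uppercase letters separated by
    whitespace, as in the oops header lines quoted in this report.
    """
    match = re.search(r"Tainted:\s+((?:[A-Z]\b\s*)+)", line)
    if match is None:
        return []
    letters = match.group(1).split()
    return [(c, TAINT_FLAGS.get(c, "unknown flag")) for c in letters]


# Header line copied from kernel dump 1 (after the panic).
for letter, meaning in decode_taint(
    "Pid: 0, comm: kworker/0:1 Tainted: G M D 2.6.39.2 #7"
):
    print(letter, "-", meaning)
```

The M entry is the one that matters here; the D in the second header of dump 1 only records that the kernel had already oopsed.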
You might check your /var/log/messages or dmesg output to see if there are any indications of what hardware has been having issues.
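As a starting point for that check, a small Python sketch along these lines could filter a saved log for machine check and ATA messages. The function name `scan_log` and both regular expressions are assumptions of mine; the exact wording of the messages depends on kernel version and driver, so the patterns will likely need adjusting.

```python
import re

# Hypothetical patterns; real message wording varies by kernel and driver.
PATTERNS = {
    "machine_check": re.compile(r"\[Hardware Error\]|Machine check", re.IGNORECASE),
    "ata_error": re.compile(
        r"\bata\d+(?:\.\d+)?: .*(?:error|exception|failed)", re.IGNORECASE
    ),
}


def scan_log(lines):
    """Group log lines by the hardware-related pattern they match."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits
```

It could be fed the lines of /var/log/messages or saved dmesg output, for example `scan_log(open("/var/log/messages"))`, and the two lists compared against the times of the hangs.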
Hi John,

Thanks for looking into this.

Indeed, one of the systems has ATA errors prior to the crash in its log. Could that be the cause of kernel dump 2 above?

Furthermore, I noticed that the following messages are logged while booting:

Booting Node 0, Processors #1 #2 #3 #4
[Hardware Error]: No human readable MCE decoding support on this CPU type.
[Hardware Error]: Run the message through 'mcelog --ascii' to decode.
Disabling lock debugging due to kernel taint
 #5 #6 #7 Ok.
Brought up 8 CPUs
Total of 8 processors activated (36267.21 BogoMIPS).

What message should I run through mcelog?

Is this also a cause of the kernel being tainted? I ask because I have not seen any other hardware issues on the other systems.

I would be happy to supply more info if needed.
(In reply to comment #6)
> Hi John,
>
> Thanks for looking into this.
>
> Indeed, one of the systems has ATA errors prior to the crash in its log.
> Could that be the cause of kernel dump 2 above?

Well, when hardware acts up it can manifest in strange ways. The fact that /sys/block/sdd/queue/scheduler is in all the dumps also aligns with the (S?)ATA errors. However, it could be something else as well.

> What message should I run through mcelog?

Honestly, I'm not sure. I'm not very familiar with the mce framework. CC'ing Andi to see if he has any thoughts.

> Is this also a cause of the kernel being tainted, because I have not seen any
> other hardware issues on the other systems?

So, looking back over the original report, you're seeing this on 3 different systems? All dual-socket quad cores?

Since you're getting the [Hardware Error] message when initializing the cpus, right between sockets, I'm curious if your cpus are mis-matched? You are using identical processors in both sockets, right?
We experience these problems with four identical systems. All of them are dual quad-core Xeon E5520s with 6x8GB+4x4GB RAM.

The SATA errors are present on only one of the systems.

The cpus are identical, but the DIMMs might not be properly populated across the CPU channels. As these machines are in a remote datacenter, I cannot check this easily. Would it be possible for the machine check exception to be raised when the DIMMs are not balanced? e.g.:

P1_1A=4GB
P1_1B=8GB
P2_1A=8GB
P2_1B=4GB

Mysteriously, we also have the mcelog entry on a fifth system, identical to the other four but with only 8x2GB and no SSD drive, and that system has not shown any instability issues. It is under a different load but is still quite heavily used.
Machine checks are usually not software or kernel problems. You have to talk to whoever sold you the system. This bugzilla is likely the wrong place.
Per Andi's comment, I'm going to close this as invalid. Please re-open if there's any data pointing to a kernel issue instead of a hardware problem.