Latest working kernel version: 2.6.26.5
Earliest failing kernel version: none
Distribution: Gentoo, 32-bit
Hardware Environment: Attached.
Software Environment: tc and iptables only. It is a router.
Problem Description:
[143784.513166] CPU 2: Bank 0: 3200004000000800
[143784.513241] CPU 2: Bank 5: 3200121020080400
[143784.513241] Kernel panic - not syncing: CPU context corrupt
[143784.513282] Rebooting in 3 seconds..
This happens under high CPU load.
Steps to reproduce: add all tc rules, add all iptables rules, shape 500 Mbit/s of traffic, and start a compile or any other CPU load.
Thanks!
Created attachment 17929 [details] BIOS info
Created attachment 17930 [details] Config
Created attachment 17931 [details] Dmesg
Created attachment 17932 [details] dmidecode
Created attachment 17933 [details] Ip tables full rules
Created attachment 17934 [details] LSPCI full verbose
Created attachment 17935 [details] TC full rules
I will try to get more information (how to reproduce) later, when the system has more traffic and I can experiment.
can you post dmesg with latest mainline and tip/master? also please use plain text format... Thanks
> can you post dmesg with latest mainline and tip/master?

latest mainline and tip/master can be accessed via: http://people.redhat.com/mingo/tip.git/README
Created attachment 17936 [details] Dmesg not compressed
I do not understand "tip/master". What is it? I can't translate it into Russian. If you want a git bisect: I do not know of any kernel that works fine for us. Does the -tip git tree give more information for debugging?
> [143784.513166] CPU 2: Bank 0: 3200004000000800
> [143784.513241] CPU 2: Bank 5: 3200121020080400
> [143784.513241] Kernel panic - not syncing: CPU context corrupt

That's an unrecoverable machine check exception (hardware failure). The kernel detects it and there is not much it can do about it.
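For readers who want to see why these bank values indicate an unrecoverable error, here is a minimal sketch that splits a bank status value into its flag bits. It assumes the printed values follow the IA32_MCi_STATUS layout from the Intel SDM (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, MCA error code in bits 0-15). The names `STATUS_FLAGS` and `decode_mci_status` are made up for this illustration and are not kernel or mcelog APIs.

```python
# Illustrative decoder for a machine check bank status value.
# Assumption: bit positions follow the IA32_MCi_STATUS layout in the Intel SDM.
STATUS_FLAGS = {
    "VAL": 63,    # register contains valid information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled for this bank
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(hex_value):
    """Return the set flags and error code fields of a bank status value."""
    value = int(hex_value, 16)
    flags = sorted(
        name for name, bit in STATUS_FLAGS.items() if (value >> bit) & 1
    )
    return {
        "flags": flags,
        "mca_error_code": value & 0xFFFF,              # bits 0-15
        "model_specific_code": (value >> 16) & 0xFFFF,  # bits 16-31
    }


if __name__ == "__main__":
    for bank in ("3200004000000800", "3200121020080400"):
        print(bank, decode_mci_status(bank))
```

Under these assumptions both banks from the report decode with UC and PCC set, which lines up with the "CPU context corrupt" panic text.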
Thomas Gleixner, hmm... are you saying that we bought new, broken Intel servers? =)

Do messages like this also indicate it? (This one is from kernel 2.6.26.2.)
[ 116.333079] NETDEV WATCHDOG: eth1: transmit timed out
[ 116.333349] ------------[ cut here ]------------
[ 116.333516] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0xf1/0x110()
[ 116.333690] Modules linked in: netconsole i2c_i801 i2c_core e1000e e1000
[ 116.334199] Pid: 0, comm: swapper Not tainted 2.6.26-gentoo-r1-fw #2
[ 116.334371] [<c012506f>] warn_on_slowpath+0x5f/0x90
[ 116.334597] [<c011dd1a>] enqueue_task_fair+0x1a/0x30
[ 116.334823] [<c011b962>] enqueue_task+0x12/0x30
[ 116.335046] [<c011b9d3>] activate_task+0x23/0x40
[ 116.335268] [<c011e01a>] try_to_wake_up+0x6a/0x110
[ 116.335491] [<c0137c7b>] autoremove_wake_function+0x1b/0x50
[ 116.335718] [<c011be6b>] __wake_up_common+0x4b/0x80
[ 116.335941] [<c011cfde>] __wake_up+0x3e/0x60
[ 116.336161] [<c0134a2b>] insert_work+0x4b/0x70
[ 116.336384] [<c0134dd7>] __queue_work+0x27/0x40
[ 116.336610] [<c02d0651>] dev_watchdog+0xf1/0x110
[ 116.337333] [<c012e055>] run_timer_softirq+0x115/0x170
[ 116.337557] [<c0122c71>] scheduler_tick+0xa1/0xd0
[ 116.337780] [<c012a062>] __do_softirq+0x82/0x100
[ 116.338002] [<c012a117>] do_softirq+0x37/0x40
[ 116.338222] [<c0114027>] smp_apic_timer_interrupt+0x57/0x90
[ 116.338448] [<c0105660>] apic_timer_interrupt+0x28/0x30
[ 116.338672] [<c010a5e2>] mwait_idle+0x32/0x40
[ 116.338894] [<c010a5b0>] mwait_idle+0x0/0x40
[ 116.339115] [<c01036e8>] cpu_idle+0x48/0xc0
[ 116.339336] =======================
[ 116.339499] ---[ end trace e25a40b7dc59df07 ]---
[ 117.655918] CPU 1: Machine Check Exception: 0000000000000005
[ 117.656103] CPU 1: Bank 0: 3200004000000800
[ 117.656604] CPU 1: Bank 5: 3200220024080400
[ 117.656604] Kernel panic - not syncing: CPU context corrupt
[ 117.656624] Rebooting in 3 seconds..

I updated 10 servers to the 2.6.26.5 kernel. (All are identical, and all of them deadlocked/black-screened on the 2.6.26.2 kernel in spite of kernel.panic=3.)
If you think the server is broken, close the bug and I will reopen it if this happens on another server.
> Thomas Gleixner hmm... you say that we buy new broken intel server? =)
>
> messages like this also say it? (but in kernel 2.6.26.2).
> [ 116.333079] NETDEV WATCHDOG: eth1: transmit timed out
> [ 116.333349] ------------[ cut here ]------------
> [ 116.333516] WARNING: at net/sched/sch_generic.c:222
> dev_watchdog+0xf1/0x110()
> [ 116.333690] Modules linked in: netconsole i2c_i801 i2c_core e1000e e1000
> [ 116.334199] Pid: 0, comm: swapper Not tainted 2.6.26-gentoo-r1-fw #2
> [ 116.334371] [<c012506f>] warn_on_slowpath+0x5f/0x90
> [ 116.334597] [<c011dd1a>] enqueue_task_fair+0x1a/0x30
> [ 116.334823] [<c011b962>] enqueue_task+0x12/0x30
> [ 116.335046] [<c011b9d3>] activate_task+0x23/0x40
> [ 116.335268] [<c011e01a>] try_to_wake_up+0x6a/0x110
> [ 116.335491] [<c0137c7b>] autoremove_wake_function+0x1b/0x50
> [ 116.335718] [<c011be6b>] __wake_up_common+0x4b/0x80
> [ 116.335941] [<c011cfde>] __wake_up+0x3e/0x60
> [ 116.336161] [<c0134a2b>] insert_work+0x4b/0x70
> [ 116.336384] [<c0134dd7>] __queue_work+0x27/0x40
> [ 116.336610] [<c02d0651>] dev_watchdog+0xf1/0x110
> [ 116.337333] [<c012e055>] run_timer_softirq+0x115/0x170
> [ 116.337557] [<c0122c71>] scheduler_tick+0xa1/0xd0
> [ 116.337780] [<c012a062>] __do_softirq+0x82/0x100
> [ 116.338002] [<c012a117>] do_softirq+0x37/0x40
> [ 116.338222] [<c0114027>] smp_apic_timer_interrupt+0x57/0x90
> [ 116.338448] [<c0105660>] apic_timer_interrupt+0x28/0x30
> [ 116.338672] [<c010a5e2>] mwait_idle+0x32/0x40
> [ 116.338894] [<c010a5b0>] mwait_idle+0x0/0x40
> [ 116.339115] [<c01036e8>] cpu_idle+0x48/0xc0
> [ 116.339336] =======================
> [ 116.339499] ---[ end trace e25a40b7dc59df07 ]---

That's something different. And it is 1 second _before_ the machine check exception triggers. Can we please split the problems into separate issues ?
> [ 117.655918] CPU 1: Machine Check Exception: 0000000000000005
> [ 117.656103] CPU 1: Bank 0: 3200004000000800
> [ 117.656604] CPU 1: Bank 5: 3200220024080400
> [ 117.656604] Kernel panic - not syncing: CPU context corrupt
> [ 117.656624] Rebooting in 3 seconds..
>
> I update 10 servers to 2.6.26.5 kernel. (all is identical and all is
> deadlocked/blackscreened on 2.6.26.2 kernel in spite of kernel.panic=3).

Does that mean you have the same problem on all 10 servers ? Please specify which of the problems you have on which system.

Thanks,

tglx
OK... I can reproduce the bug on all 10 servers! This is what I got on the second server via netconsole:
[14935.780146] CPU 1: Machine Check Exception: 0000000000000005
[14935.780369] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[14935.780369] Tx Queue <0>
[14935.780369] TDH <4>
[14935.780369] TDT <4>
[14935.780369] next_to_use <4>
[14935.780369] next_to_clean <59>
[14935.780369] buffer_info[next_to_clean]
[14935.780369] time_stamp <42f78d>
[14935.780369] next_to_watch <59>
[14935.780369] jiffies <42fd38>
[14935.780369] next_to_watch.status <1>
[14935.782456] CPU 1: Bank 0: 3200004000000800
[14935.785751] CPU 1: Bank 5: 3200220024080400
[14935.785934] Kernel panic - not syncing: CPU context corrupt
[14935.785968] Rebooting in 3 seconds..

And on the third PC:
[13205.963400] CPU 3: Machine Check Exception: 0000000000000005
[13205.963400] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[13205.963400] Bank 0: b200004000000800
[13205.963400] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[13205.963400] Bank 5: b200120014040400
[13209.945802] eth1: Detected Tx Unit Hang:
[13209.945802] TDH <9a>
[13209.945802] TDT <9a>
[13209.945802] next_to_use <9a>
[13209.945802] next_to_clean <ef>
[13209.945802] buffer_info[next_to_clean]:
[13209.945802] time_stamp <3b0c8f>
[13209.945802] next_to_watch <ef>
[13209.945802] jiffies <3b123a>
[13209.945802] next_to_watch.status <1>
[13209.945802] CPU 3: Bank 5: 3200120020080400
[13209.945988] Kernel panic - not syncing: CPU context corrupt

And again:
[ 1172.520389] CPU 1: Machine Check Exception: 0000000000000005
[ 1172.523728] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[ 1172.523728] Tx Queue <0>
[ 1172.523728] TDH <e6>
[ 1172.523728] TDT <e6>
[ 1172.523728] next_to_use <e6>
[ 1172.523728] next_to_clean <3b>
[ 1172.523728] buffer_info[next_to_clean]
[ 1172.523728] time_stamp <3f6b8>
[ 1172.523728] next_to_watch <3b>
[ 1172.523728] jiffies <3fc61>
[ 1172.523728] next_to_watch.status <1>
[ 1172.525882] CPU 1: Bank 0: 3200004000000800
[ 1172.525882] CPU 1: Bank 5: 3200220024080400
[ 1172.525882] Kernel panic - not syncing: CPU context corrupt
[ 1172.525882] Rebooting in 3 seconds..

Reproducing it is simple: run "emerge portage" 1-3 times. Sometimes I get the error without the e1000 error, but I think they are related.

I suspect it is a HARDWARE problem, because a copy of the system works great on an Intel platform with 2x Xeon processors (631xESB/632xESB/3100 chipset). Maybe it is a problem specific to ICH9-R/Core 2 Quad?

Thanks!
> Jarek Poplawski:
> BTW: could you try to trigger this bug with one network card off?

Sure! I stopped eth1 and ran "/etc/init.d/bgpd stop" (this PC no longer receives routed traffic), then ran "emerge portage" 2 times and got:
[25492.187405] CPU 3: Machine Check Exception: 0000000000000005
[25492.187405] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[25492.187405] Bank 0: b200004000000800
[25492.187405] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[25492.187405] Bank 5: b200120014040400
[25497.124884] CPU 1: Machine Check Exception: 0000000000000004
[25497.124884] Kernel panic - not syncing: Unable to continue
[25497.124884] Rebooting in 3 seconds..

Thanks!
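The two "Machine Check Exception:" values in this log differ (0000000000000005 and 0000000000000004). As a rough sketch, and assuming that number is the IA32_MCG_STATUS register as laid out in the Intel SDM (RIPV=bit 0, EIPV=bit 1, MCIP=bit 2), the difference can be read off like this. `MCG_BITS` and `decode_mcg_status` are hypothetical names used only for this example.

```python
# Illustrative decoder for the value printed after "Machine Check Exception:".
# Assumption: the value is IA32_MCG_STATUS with the bit layout from the Intel SDM.
MCG_BITS = {
    "RIPV": 0,  # restart instruction pointer valid
    "EIPV": 1,  # error instruction pointer valid
    "MCIP": 2,  # machine check in progress
}


def decode_mcg_status(hex_value):
    """Return the names of the bits set in a global machine check status value."""
    value = int(hex_value, 16)
    return [name for name, bit in MCG_BITS.items() if (value >> bit) & 1]


if __name__ == "__main__":
    for status in ("0000000000000005", "0000000000000004"):
        print(status, decode_mcg_status(status))
```

If that assumption holds, the second value has RIPV clear, which would fit the "Unable to continue" panic instead of "CPU context corrupt".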
Added Intel folks on CC.
Some additional information. The system is compiled with:
CHOST="i686-pc-linux-gnu"
CFLAGS="-O2 -march=nocona -pipe"
I am attaching the output of emerge --info below.
Created attachment 17938 [details] Output of emerge --info
Any ideas/updates?
Hi again! Could anyone else who might be able to help be added to CC? I can and want to test any patches or git trees to fix this situation.

Also, could it be that a system compiled with CHOST="i686-pc-linux-gnu" CFLAGS="-O2 -march=nocona -pipe" cannot work on a Core 2 Quad? If I understand "man gcc" correctly, it should work. If I have made a mistake, please tell me where! I hope it is my mistake and not a hardware bug =(

You understand the hardware side of this question better. Please help :) We have 10 servers that do not work :( 5 servers on Intel Xeon with a copy of this system work great, but we need more machines for our work :(

Thanks!!
Please close the bug. Updating the BIOS fixed it for us. Thanks all!