Bug 11618 - Core 2 Quad and Intel S3210SH platform (ICH9-R chipset) panic
Summary: Core 2 Quad and Intel S3210SH platform (ICH9-R chipset) panic
Status: REJECTED INVALID
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: i386
Hardware: All Linux
Importance: P1 high
Assignee: platform_i386
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-09-22 02:21 UTC by Badalian Slava
Modified: 2008-10-02 03:46 UTC

See Also:
Kernel Version: 2.6.26.5
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
BIOS info (513 bytes, application/gzip), 2008-09-22 02:22 UTC, Badalian Slava
Config (8.90 KB, application/gzip), 2008-09-22 02:22 UTC, Badalian Slava
Dmesg (6.73 KB, application/gzip), 2008-09-22 02:22 UTC, Badalian Slava
dmidecode (3.20 KB, application/gzip), 2008-09-22 02:22 UTC, Badalian Slava
Ip tables full rules (82.71 KB, application/gzip), 2008-09-22 02:23 UTC, Badalian Slava
LSPCI full verbose (2.63 KB, application/gzip), 2008-09-22 02:23 UTC, Badalian Slava
TC full rules (382.12 KB, application/gzip), 2008-09-22 02:23 UTC, Badalian Slava
Dmesg not compressed (23.21 KB, text/plain), 2008-09-22 03:02 UTC, Badalian Slava
Output of emerge --info (3.16 KB, text/plain), 2008-09-22 06:16 UTC, Badalian Slava

Description Badalian Slava 2008-09-22 02:21:38 UTC
Latest working kernel version: 2.6.26.5

Earliest failing kernel version: none

Distribution: gentoo 32 bit

Hardware Environment:

Attached.

Software Environment:

tc and iptables only. It is a router.

Problem Description:

[143784.513166] CPU 2: Bank 0: 3200004000000800
[143784.513241] CPU 2: Bank 5: 3200121020080400
[143784.513241] Kernel panic - not syncing: CPU context corrupt
[143784.513282] Rebooting in 3 seconds..

This happens under high CPU load.

Steps to reproduce:

Add all tc rules, add all iptables rules, push 500 Mb/s of traffic through the shaper, then start a compile or any other CPU load.

Thanks!
Comment 1 Badalian Slava 2008-09-22 02:22:06 UTC
Created attachment 17929 [details]
BIOS info
Comment 2 Badalian Slava 2008-09-22 02:22:21 UTC
Created attachment 17930 [details]
Config
Comment 3 Badalian Slava 2008-09-22 02:22:38 UTC
Created attachment 17931 [details]
Dmesg
Comment 4 Badalian Slava 2008-09-22 02:22:53 UTC
Created attachment 17932 [details]
dmidecode
Comment 5 Badalian Slava 2008-09-22 02:23:12 UTC
Created attachment 17933 [details]
Ip tables full rules
Comment 6 Badalian Slava 2008-09-22 02:23:32 UTC
Created attachment 17934 [details]
LSPCI full verbose
Comment 7 Badalian Slava 2008-09-22 02:23:53 UTC
Created attachment 17935 [details]
TC full rules
Comment 8 Badalian Slava 2008-09-22 02:31:55 UTC
I will try to get more information (how to reproduce) later, when the system has more traffic and I can experiment.
Comment 9 Yinghai Lu 2008-09-22 02:54:48 UTC
can you post dmesg with latest mainline and tip/master?
also please use plain text format...

Thanks
Comment 10 Ingo Molnar 2008-09-22 03:01:06 UTC
> can you post dmesg with latest mainline and tip/master?

latest mainline and tip/master can be accessed via:

  http://people.redhat.com/mingo/tip.git/README
Comment 11 Badalian Slava 2008-09-22 03:02:48 UTC
Created attachment 17936 [details]
Dmesg not compressed
Comment 12 Badalian Slava 2008-09-22 03:08:39 UTC
I don't understand "tip/master"... what is it? I can't translate it into Russian. If you want a git bisect: I don't know of any kernel that works fine for us.
Does the -tip git tree give more information for debugging?
Comment 13 Thomas Gleixner 2008-09-22 03:10:20 UTC
> [143784.513166] CPU 2: Bank 0: 3200004000000800
> [143784.513241] CPU 2: Bank 5: 3200121020080400
> [143784.513241] Kernel panic - not syncing: CPU context corrupt

That's an unrecoverable machine check exception (hardware failure). The kernel 
detects it and there is not much it can do about it.
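
For readers who want to check this against the log, the bank status words quoted above can be unpacked by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL in bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, a model-specific code in bits 31:16 and the MCA error code in bits 15:0). The name `decode_mci_status` is made up for illustration and is not part of any kernel tool.

```python
# Flag bits of IA32_MCi_STATUS (architectural layout, assumed here).
FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled for this bank
    "MISCV": 59,  # IA32_MCi_MISC holds extra information
    "ADDRV": 58,  # IA32_MCi_ADDR holds an address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit bank status word into flag bits and error codes."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF
    decoded["mca_error_code"] = status & 0xFFFF
    return decoded


if __name__ == "__main__":
    # The two bank words from the panic in the bug description.
    for word in (0x3200004000000800, 0x3200121020080400):
        d = decode_mci_status(word)
        set_flags = [name for name in FLAGS if d[name]]
        print(f"{word:016x}: flags={set_flags} mca=0x{d['mca_error_code']:04x}")
```

Both words decode with UC and PCC set. PCC means the processor itself reports that its state can no longer be trusted, which appears to be the condition behind the "CPU context corrupt" panic text. VAL reads as clear in these panic lines; presumably the handler masks it before printing, but that is an assumption and not something the log confirms.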
Comment 14 Badalian Slava 2008-09-22 03:23:56 UTC
Thomas Gleixner, hmm... are you saying that we bought new, broken Intel servers? =)

Do messages like this say the same thing? (This one is from kernel 2.6.26.2.)

[  116.333079] NETDEV WATCHDOG: eth1: transmit timed out
[  116.333349] ------------[ cut here ]------------
[  116.333516] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0xf1/0x110()
[  116.333690] Modules linked in: netconsole i2c_i801 i2c_core e1000e e1000
[  116.334199] Pid: 0, comm: swapper Not tainted 2.6.26-gentoo-r1-fw #2
[  116.334371]  [<c012506f>] warn_on_slowpath+0x5f/0x90
[  116.334597]  [<c011dd1a>] enqueue_task_fair+0x1a/0x30
[  116.334823]  [<c011b962>] enqueue_task+0x12/0x30
[  116.335046]  [<c011b9d3>] activate_task+0x23/0x40
[  116.335268]  [<c011e01a>] try_to_wake_up+0x6a/0x110
[  116.335491]  [<c0137c7b>] autoremove_wake_function+0x1b/0x50
[  116.335718]  [<c011be6b>] __wake_up_common+0x4b/0x80
[  116.335941]  [<c011cfde>] __wake_up+0x3e/0x60
[  116.336161]  [<c0134a2b>] insert_work+0x4b/0x70
[  116.336384]  [<c0134dd7>] __queue_work+0x27/0x40
[  116.336610]  [<c02d0651>] dev_watchdog+0xf1/0x110
[  116.337333]  [<c012e055>] run_timer_softirq+0x115/0x170
[  116.337557]  [<c0122c71>] scheduler_tick+0xa1/0xd0
[  116.337780]  [<c012a062>] __do_softirq+0x82/0x100
[  116.338002]  [<c012a117>] do_softirq+0x37/0x40
[  116.338222]  [<c0114027>] smp_apic_timer_interrupt+0x57/0x90
[  116.338448]  [<c0105660>] apic_timer_interrupt+0x28/0x30
[  116.338672]  [<c010a5e2>] mwait_idle+0x32/0x40
[  116.338894]  [<c010a5b0>] mwait_idle+0x0/0x40
[  116.339115]  [<c01036e8>] cpu_idle+0x48/0xc0
[  116.339336]  =======================
[  116.339499] ---[ end trace e25a40b7dc59df07 ]---
[  117.655918] CPU 1: Machine Check Exception: 0000000000000005
[  117.656103] CPU 1: Bank 0: 3200004000000800
[  117.656604] CPU 1: Bank 5: 3200220024080400
[  117.656604] Kernel panic - not syncing: CPU context corrupt
[  117.656624] Rebooting in 3 seconds..

I updated 10 servers to the 2.6.26.5 kernel. (All are identical, and all deadlocked with a black screen on the 2.6.26.2 kernel in spite of kernel.panic=3.)
If you think the server is broken, close the bug and I will reopen it if this happens on another server.
Comment 15 Thomas Gleixner 2008-09-22 03:53:20 UTC
> Thomas Gleixner, hmm... are you saying that we bought new, broken Intel servers? =)
> 
> Do messages like this say the same thing? (This one is from kernel 2.6.26.2.)

> [  116.333079] NETDEV WATCHDOG: eth1: transmit timed out
> [  116.333349] ------------[ cut here ]------------
> [  116.333516] WARNING: at net/sched/sch_generic.c:222
> dev_watchdog+0xf1/0x110()
> [  116.333690] Modules linked in: netconsole i2c_i801 i2c_core e1000e e1000
> [  116.334199] Pid: 0, comm: swapper Not tainted 2.6.26-gentoo-r1-fw #2
> [  116.334371]  [<c012506f>] warn_on_slowpath+0x5f/0x90
> [  116.334597]  [<c011dd1a>] enqueue_task_fair+0x1a/0x30
> [  116.334823]  [<c011b962>] enqueue_task+0x12/0x30
> [  116.335046]  [<c011b9d3>] activate_task+0x23/0x40
> [  116.335268]  [<c011e01a>] try_to_wake_up+0x6a/0x110
> [  116.335491]  [<c0137c7b>] autoremove_wake_function+0x1b/0x50
> [  116.335718]  [<c011be6b>] __wake_up_common+0x4b/0x80
> [  116.335941]  [<c011cfde>] __wake_up+0x3e/0x60
> [  116.336161]  [<c0134a2b>] insert_work+0x4b/0x70
> [  116.336384]  [<c0134dd7>] __queue_work+0x27/0x40
> [  116.336610]  [<c02d0651>] dev_watchdog+0xf1/0x110
> [  116.337333]  [<c012e055>] run_timer_softirq+0x115/0x170
> [  116.337557]  [<c0122c71>] scheduler_tick+0xa1/0xd0
> [  116.337780]  [<c012a062>] __do_softirq+0x82/0x100
> [  116.338002]  [<c012a117>] do_softirq+0x37/0x40
> [  116.338222]  [<c0114027>] smp_apic_timer_interrupt+0x57/0x90
> [  116.338448]  [<c0105660>] apic_timer_interrupt+0x28/0x30
> [  116.338672]  [<c010a5e2>] mwait_idle+0x32/0x40
> [  116.338894]  [<c010a5b0>] mwait_idle+0x0/0x40
> [  116.339115]  [<c01036e8>] cpu_idle+0x48/0xc0
> [  116.339336]  =======================
> [  116.339499] ---[ end trace e25a40b7dc59df07 ]---

That's something different. And it is 1 second _before_ the machine
check exception triggers. Can we please split the problems into
separate issues ?

> [  117.655918] CPU 1: Machine Check Exception: 0000000000000005
> [  117.656103] CPU 1: Bank 0: 3200004000000800
> [  117.656604] CPU 1: Bank 5: 3200220024080400
> [  117.656604] Kernel panic - not syncing: CPU context corrupt
> [  117.656624] Rebooting in 3 seconds..
> 
> I updated 10 servers to the 2.6.26.5 kernel. (All are identical, and all
> deadlocked with a black screen on the 2.6.26.2 kernel in spite of kernel.panic=3.)

Does that mean you have the same problem on all 10 servers ? Please
specify which of the problems you have on which system.

Thanks,

	tglx

 
Comment 16 Badalian Slava 2008-09-22 05:33:07 UTC
OK... I can reproduce the bug on all 10 servers!

This is what I got on the second server via netconsole:
[14935.780146] CPU 1: Machine Check Exception: 0000000000000005
[14935.780369] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[14935.780369]   Tx Queue             <0>
[14935.780369]   TDH                  <4>
[14935.780369]   TDT                  <4>
[14935.780369]   next_to_use          <4>
[14935.780369]   next_to_clean        <59>
[14935.780369] buffer_info[next_to_clean]
[14935.780369]   time_stamp           <42f78d>
[14935.780369]   next_to_watch        <59>
[14935.780369]   jiffies              <42fd38>
[14935.780369]   next_to_watch.status <1>
[14935.782456] CPU 1: Bank 0: 3200004000000800
[14935.785751] CPU 1: Bank 5: 3200220024080400
[14935.785934] Kernel panic - not syncing: CPU context corrupt
[14935.785968] Rebooting in 3 seconds..

And on the third PC:
[13205.963400] CPU 3: Machine Check Exception: 0000000000000005
[13205.963400] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[13205.963400] Bank 0: b200004000000800
[13205.963400] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[13205.963400] Bank 5: b200120014040400
[13209.945802] eth1: Detected Tx Unit Hang:
[13209.945802]   TDH                  <9a>
[13209.945802]   TDT                  <9a>
[13209.945802]   next_to_use          <9a>
[13209.945802]   next_to_clean        <ef>
[13209.945802] buffer_info[next_to_clean]:
[13209.945802]   time_stamp           <3b0c8f>
[13209.945802]   next_to_watch        <ef>
[13209.945802]   jiffies              <3b123a>
[13209.945802]   next_to_watch.status <1>
[13209.945802] CPU 3: Bank 5: 3200120020080400
[13209.945988] Kernel panic - not syncing: CPU context corrupt

And again:

[ 1172.520389] CPU 1: Machine Check Exception: 0000000000000005
[ 1172.523728] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[ 1172.523728]   Tx Queue             <0>
[ 1172.523728]   TDH                  <e6>
[ 1172.523728]   TDT                  <e6>
[ 1172.523728]   next_to_use          <e6>
[ 1172.523728]   next_to_clean        <3b>
[ 1172.523728] buffer_info[next_to_clean]
[ 1172.523728]   time_stamp           <3f6b8>
[ 1172.523728]   next_to_watch        <3b>
[ 1172.523728]   jiffies              <3fc61>
[ 1172.523728]   next_to_watch.status <1>
[ 1172.525882] CPU 1: Bank 0: 3200004000000800
[ 1172.525882] CPU 1: Bank 5: 3200220024080400
[ 1172.525882] Kernel panic - not syncing: CPU context corrupt
[ 1172.525882] Rebooting in 3 seconds..


Reproducing it is simple: run "emerge portage" 1-3 times.
Sometimes I get the error without the e1000 error, but I think they are related.

I hope it is a HARDWARE problem, because a copy of the system works great on an Intel platform with 2x Xeon processors (631xESB/632xESB/3100 chipset)... maybe it is a problem specific to ICH9-R/Core 2 Quad?

Thanks!
Comment 17 Badalian Slava 2008-09-22 06:01:17 UTC
> Jarek Poplawski:
> BTW: could you try to trigger this bug with one network card off?

Sure! I stopped eth1 and ran "/etc/init.d/bgpd stop" (this PC no longer receives routed traffic).

I ran "emerge portage" twice and got:

[25492.187405] CPU 3: Machine Check Exception: 0000000000000005
[25492.187405] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[25492.187405] Bank 0: b200004000000800
[25492.187405] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[25492.187405] Bank 5: b200120014040400
[25497.124884] CPU 1: Machine Check Exception: 0000000000000004
[25497.124884] Kernel panic - not syncing: Unable to continue
[25497.124884] Rebooting in 3 seconds..


Thanks!
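
The value printed after "Machine Check Exception:" is the global status word, and the two values seen in this thread (...05 and ...04) differ in a single bit. A small sketch of how they could be read, assuming the architectural IA32_MCG_STATUS layout (RIPV in bit 0, EIPV in bit 1, MCIP in bit 2); `decode_mcg_status` is a hypothetical helper name, not an existing tool:

```python
def decode_mcg_status(value):
    """Decode the low three bits of IA32_MCG_STATUS (layout assumed)."""
    return {
        "RIPV": bool(value & 0x1),  # restart IP valid: interrupted code could be resumed
        "EIPV": bool(value & 0x2),  # error IP valid: saved IP is tied to the error
        "MCIP": bool(value & 0x4),  # a machine check is in progress
    }


if __name__ == "__main__":
    for value in (0x5, 0x4):
        print(hex(value), decode_mcg_status(value))
```

With 0x5, RIPV and MCIP are both set. With 0x4 only MCIP remains, so the processor did not offer a valid restart point, which would be consistent with the "Unable to continue" panic in the last log, although the thread itself does not confirm that reading.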
Comment 18 Thomas Gleixner 2008-09-22 06:10:27 UTC
Added Intel folks on CC.
Comment 19 Badalian Slava 2008-09-22 06:15:29 UTC
Also additional information. System compiled with:
CHOST="i686-pc-linux-gnu"
CFLAGS="-O2 -march=nocona -pipe"

I am adding the output of emerge --info below as an attachment.
Comment 20 Badalian Slava 2008-09-22 06:16:07 UTC
Created attachment 17938 [details]
Output of emerge --info
Comment 21 Badalian Slava 2008-09-22 22:22:49 UTC
Any ideas/updates?
Comment 22 Badalian Slava 2008-09-23 01:32:49 UTC
Hi again!

Could anyone else who might be able to help be added to CC?

I can and want to test any patches or git trees to fix this situation.
Also, since the system is compiled with CHOST="i686-pc-linux-gnu" CFLAGS="-O2 -march=nocona -pipe", could it be that it cannot work on a Core 2 Quad? As far as I understand "man gcc", it should work. If I have made a mistake, please tell me where! I hope it is my mistake and not a hardware bug =(

You understand the hardware side of this question better than I do. Please help :)
We have 10 servers that do not work :(
5 servers on Intel Xeon with a copy of this system work great, but we need more machines for our work :(

Thanks!!
Comment 23 Badalian Slava 2008-10-01 22:20:28 UTC
Please close the bug. Updating the BIOS fixed the problem for us.

Thanks all!