Bug 24942
Summary: Many NMIs, and freeze after about one month of operation
Product: Platform Specific/Hardware
Reporter: Nevenchannyy Alexander (nevenchannyy)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED OBSOLETE
Severity: high
CC: akpm, alan, paulmck
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.36.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  dmesg
  ps -ef
  new today stall
  today dmesg Linux node0 2.6.34-gentoo
  dmesg from Linux virtualbc 2.6.34-gentoo-r1
  Diagnostic patch to dump out hrtimer functions in effect during an RCU CPU stall warning message.
  Updated diagnostic patch for hrtimers.
  new traces with Paul's patch
Description
Nevenchannyy Alexander
2010-12-15 20:50:46 UTC
Also occurs on 2.6.34 and 2.6.35.

Geeze, that's a mess. Can you please re-add the trace as an attachment so it doesn't get all wrecked by wordwrapping?

Created attachment 40272 [details]
dmesg
Created attachment 40282 [details]
ps -ef
Thanks. I'm having trouble working out where CPU 27 got stuck. Maybe in rcu_check_callbacks().

I also have some dmesgs from two other servers running 2.6.34-gentoo and 2.6.32-gentoo-r20. If you are interested, I can send them.

Created attachment 40332 [details]
new today stall
I took a quick look at attachment id=40332 from comment #8 above, and here is what I found:

CPU 0: ???
CPU 1: idle
CPU 2: idle
CPU 3-22: ???
CPU 23: idle
CPU 24: ???
CPU 25: idle
CPU 26: idle
CPU 27: idle
CPU 28: Was running hrtimers, took a scheduler-tick interrupt, detected the CPU stall, and initiated the stack traces.
CPU 29: rt_worker_func() in IPv4 routing
CPU 30: idle
CPU 31: idle

Are CPUs 0 and 3-22 offline or something? CPU 28 is flagged as causing the stall. Is there an extremely heavy timer load on this system?

Attachment id=40272 shows the same thing: CPU 27 was running hrtimers, took a scheduler-tick interrupt, and detected the CPU stall. In both cases we get three copies of the stack backtrace; I am not sure why.

If you are willing to try out a diagnostic patch, one thing to try would be to store the value of the "fn" local variable in __run_hrtimer() in kernel/hrtimer.c into a global per-CPU variable just after the "fn = timer->function;" line, and NULL it out before __run_hrtimer() returns. Then in print_other_cpu_stall() in kernel/rcutree.c, just after the first printk(), print out the global per-CPU variables. The value of "fn" for the CPU flagged as causing the stall might provide some clues. (I can provide the patch if you would prefer, but the edit-debug-test cycle will be quite a bit faster if you do it.)

Also, does this stall happen only once? If the system is semi-alive and the CPU stall persists, you should see similar messages every 30 seconds. Or does the system hang?

All CPUs are online. The test was compiling a kernel with MAKEOPTS="-j65" while monitoring with htop; all CPUs work fine at ~100% load. I have three production servers running Gentoo Linux with different kernels, hosting KVM VMs. Two nodes with many VMs log 1-2 CPU stall messages per day; after the messages they stay semi-alive for about a month and then freeze (with nothing in /var/log/messages). The third server has worked fine for 49 days (with 8 WinXP guests), but it now also has one CPU stall message. These messages are from a fourth server, installed yesterday; at 910 seconds of uptime, without load, we got the first CPU stall.

About the patch: yes, of course I will compile a kernel with it on the third node, which has no business-critical VMs. This problem is very critical for me, because the hangs create many problems. I am also currently studying the RCU part of the kernel code, trying to understand what is happening on the servers. Any help is appreciated :)

P.S. Sorry for bad English.

Created attachment 40402 [details]
today dmesg Linux node0 2.6.34-gentoo
On another server we also see that the NMI was received by only 9 CPUs instead of 32.

betelgeuse ~ # cat ./trace2.log | grep 'NMI '
Dec 16 20:01:34 node0 kernel: [1215640.206461] sending NMI to all CPUs:
Dec 16 20:01:34 node0 kernel: [1215640.206502] NMI backtrace for cpu 1
Dec 16 20:01:34 node0 kernel: [1215640.206985] NMI backtrace for cpu 2
Dec 16 20:01:34 node0 kernel: [1215640.207055] NMI backtrace for cpu 3
Dec 16 20:01:34 node0 kernel: [1215640.207408] NMI backtrace for cpu 26
Dec 16 20:01:34 node0 kernel: [1215640.207469] NMI backtrace for cpu 30
Dec 16 20:01:34 node0 kernel: [1215640.206461] NMI backtrace for cpu 29
Dec 16 20:01:34 node0 kernel: [1215640.207425] NMI backtrace for cpu 27
Dec 16 20:01:34 node0 kernel: [1215640.208927] NMI backtrace for cpu 28
Dec 16 20:01:34 node0 kernel: [1215640.213487] NMI backtrace for cpu 31

Created attachment 40412 [details]
dmesg from Linux virtualbc 2.6.34-gentoo-r1
OK, so the patch should be against vanilla 2.6.34, correct?

On the other server, the NMI was received by only 13 CPUs instead of 32.

betelgeuse ~ # cat ./trace3.log | grep 'NMI '
Dec 14 02:51:30 virtualbc kernel: [4214243.442828] sending NMI to all CPUs:
Dec 14 02:51:30 virtualbc kernel: [4214243.442916] NMI backtrace for cpu 0
Dec 14 02:51:30 virtualbc kernel: [4214243.443857] NMI backtrace for cpu 4
Dec 14 02:51:30 virtualbc kernel: [4214243.444884] NMI backtrace for cpu 1
Dec 14 02:51:30 virtualbc kernel: [4214243.452132] NMI backtrace for cpu 29
Dec 14 02:51:30 virtualbc kernel: [4214243.452344] NMI backtrace for cpu 24
Dec 14 02:51:30 virtualbc kernel: [4214243.452394] NMI backtrace for cpu 23
Dec 14 02:51:30 virtualbc kernel: [4214243.390012] NMI backtrace for cpu 22
Dec 14 02:51:30 virtualbc kernel: [4214243.452594] NMI backtrace for cpu 25
Dec 14 02:51:30 virtualbc kernel: [4214243.452703] NMI backtrace for cpu 28
Dec 14 02:51:30 virtualbc kernel: [4214243.452909] NMI backtrace for cpu 26
Dec 14 02:51:30 virtualbc kernel: [4214243.453003] NMI backtrace for cpu 27
Dec 14 02:51:30 virtualbc kernel: [4214243.453014] NMI backtrace for cpu 30
Dec 14 02:51:30 virtualbc kernel: [4214243.453105] NMI backtrace for cpu 31

(In reply to comment #14)
> OK, so the patch should be against vanilla 2.6.34, correct?

No, those are production servers; at the moment I have two test servers with 2.6.36.2. But this is not critical: I have good knowledge of C, so I can port the patch to any kernel.

And I very much hope that the testing will not be on the machine that produced the dmesg in your comment #13 -- no symbol names for functions, just hexadecimal -- not very helpful... :-( OK, 2.6.36 is more convenient for me anyway.

The virtualbc server currently doesn't have debug info in its kernel :-( But I'm sure the symptoms are the same. These are identical servers from Sun/Oracle with Opteron CPUs, and the behavior is identical for all servers under Linux :(

Created attachment 40442 [details]
Diagnostic patch to dump out hrtimer functions in effect during an RCU CPU stall warning message.
Diagnostic patch -- compiles against 2.6.36, but is otherwise untested.
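For readers following the thread, here is a rough, untested sketch in the spirit of the diagnostic described above; it is not attachment 40442 itself. The per-CPU variable and helper names (current_hrtimer_fn, record_hrtimer_fn, and so on) are invented for illustration, and the actual hook points are only indicated in comments:

/* Untested sketch only -- not attachment 40442; names are illustrative. */

#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/hrtimer.h>

/*
 * One slot per CPU recording the hrtimer callback currently executing.
 * In a real patch this would live in kernel/hrtimer.c, with a
 * DECLARE_PER_CPU() in a header so kernel/rcutree.c can read it.
 */
DEFINE_PER_CPU(enum hrtimer_restart (*)(struct hrtimer *), current_hrtimer_fn);

/* Call from __run_hrtimer() just after the "fn = timer->function;" line. */
static inline void record_hrtimer_fn(enum hrtimer_restart (*fn)(struct hrtimer *))
{
	__get_cpu_var(current_hrtimer_fn) = fn;
}

/* Call just before __run_hrtimer() returns. */
static inline void clear_hrtimer_fn(void)
{
	__get_cpu_var(current_hrtimer_fn) = NULL;
}

/* Call from print_other_cpu_stall() just after its first printk(). */
static void print_hrtimer_fns(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		printk(KERN_ERR "CPU %d hrtimer fn: %pS\n",
		       cpu, per_cpu(current_hrtimer_fn, cpu));
}

In an actual patch against 2.6.36, the record/clear assignments would most likely be open-coded directly in __run_hrtimer() and the dump loop placed inline in print_other_cpu_stall(); the helpers above exist only to keep the sketch self-contained.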
Created attachment 40452 [details]
Updated diagnostic patch for hrtimers.
This one should compile with stall-warning enabled. :-/
Created attachment 40682 [details]
new traces with Paul 's patch
So the offending hrtimer entry was the scheduler tick itself, which indicates that the CPU was idle.

CPU 15 misses some: 0, 1, 4, 5, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 28, 29, 30.

Ten-second pause, then: CPUs 0 and 15 get them all: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31.

95-second pause, then: CPU 31: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31.

In all cases, the system is mostly idle. So I am wondering if the diagnostic is causing the long-term problem with all the NMIs. I will look for race conditions that could cause spurious stall warnings; in the meantime I suggest building a kernel with CONFIG_RCU_CPU_STALL_VERBOSE=n. Though it will take some months to be sure about the hang.

I compiled a kernel with CONFIG_RCU_CPU_STALL_VERBOSE=n. So we are waiting for the system to hang? But, as I wrote before, the system hangs without any logs in /var/log/messages :(

No, -you- are waiting for the system to hang. -I- am looking for why this might be happening. I might have another diagnostic patch or (even better) a fix, hopefully soon. But either way, I am afraid that at some point we will need to let at least one of your systems run for at least a month to see if the problem is fixed.