Bug 40092
Summary: | RCU stall in linux-3.0.0 | ||
---|---|---|---|
Product: | Process Management | Reporter: | Philip Armstrong (phil) |
Component: | Scheduler | Assignee: | Paul E. McKenney (paulmck) |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | akpm, alan, florian, john.stultz, maciej.rutecki, rjw, tglx |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.0.0 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 36912 | ||
Attachments: |
dmesg output
dmesg output with rcu_tree.rcu_cpu_stall_supress=1 RCU oops with rcu_tree.rcu_cpu_stall_suppress=1 Mutiple RCU oops with CONFIG_RCU_FAST_NO_HZ=y, CONFIG_FRAME_POINTER=y |
Description
Philip Armstrong
2011-07-25 21:44:33 UTC
*** Bug 40082 has been marked as a duplicate of this bug. *** > You can suppress the RCU CPU stall checking by adding > "rcu_tree.rcu_cpu_stall_suppress=1" to the kernel boot parameters. > If you do this, the stall would likely still be happening, but > RCU would refrain from reporting it. Still, this is useful in case > the problem is in the RCU stall checking itself. OK, I've tried this & received another oops at boot. (dmesg attached if it makes any difference. >From the look of the stack traces, I would guess that you have frame >pointers disabled. If so, could you please try reproducing this with >frame pointers enabled? >Could you please also try rebuilding the kernel with CONFIG_RCU_FAST_NO_HZ=n? Will do. Created attachment 66642 [details]
dmesg output with rcu_tree.rcu_cpu_stall_supress=1
On Tue, Jul 26, 2011 at 09:12:39AM +0100, Phil Armstrong wrote: > On 26/07/11 01:15, Paul E. McKenney wrote: > >Hello, Phil, > > > >Bugzilla seems less than responsive, so I am falling back on email. > > It seems to have recovered again now... I CCed it so that this message should show up automatically. > >You can suppress the RCU CPU stall checking by adding > >"rcu_tree.rcu_cpu_stall_suppress=1" to the kernel boot parameters. > >If you do this, the stall would likely still be happening, but > >RCU would refrain from reporting it. Still, this is useful in case > >the problem is in the RCU stall checking itself. > > Will try this. > > Looking at a boot log from this morning, I got the delayed boot but > without the RCU stall check oops, so perhaps there are really two > problems :( Or perhaps the stall is happening and delaying your boot whether or not RCU is complaining about it. Ah, and notice that CPU 3 is not responding to the NMI until a full minute after the stall was detected in https://bugzilla.kernel.org/attachment.cgi?id=66642: [ 3.976349] [<ffffffff8132f594>] ? start_secondary+0x1de/0x1e5 [ 67.293217] NMI backtrace for cpu 3 [ 67.293217] CPU 3 But wait... We also have the following, earlier in that same dmesg: [ 67.293213] [<ffffffff8100827e>] ? cpu_idle+0xa2/0xc4 [ 67.293213] [<ffffffff8132f594>] ? start_secondary+0x1de/0x1e5 [ 3.976363] NMI backtrace for cpu 0 [ 3.976363] CPU 0 Different CPUs seem to have rather different ideas about what time it is. Which could cause the system to wait a minute, if one CPU waited for an absolute time set by the other CPU. The different ideas about time can also provoke RCU CPU stalls. And you got a stall despite rcu_tree.rcu_cpu_stall_suppress=1. Not sure why that would be. Could you please double-check this? If worst comes to worst, I can send you a patch that hard-disables CPU stall checking. Thanx, Paul > > From the look of the stack traces, I would guess that you have frame > >pointers disabled. If so, could you please try reproducing this with > >frame pointers enabled? > >Could you please also try rebuilding the kernel with > CONFIG_RCU_FAST_NO_HZ=n? > > Will do. > > Phil Looks like I mis-spelt "suppress" in the kernel command line for the second dmesg, so scratch the second report. I'll try again tonight with the correct option. Sorry about that. Having left the machine in question downloading the full git clone I'll try a frame-pointer enabled build. Comment on attachment 66642 [details]
dmesg output with rcu_tree.rcu_cpu_stall_supress=1
Ignore this, the kernel command line is wrong.
OK. I think I got the kernel command line correct, but I'm still getting an oops from the RCU code. See the attached dmesg output: syslog-3.0.0-no_rcu_stall2.txt Created attachment 66762 [details]
RCU oops with rcu_tree.rcu_cpu_stall_suppress=1
Created attachment 66782 [details]
Mutiple RCU oops with CONFIG_RCU_FAST_NO_HZ=y, CONFIG_FRAME_POINTER=y
Setting CONFIG_RCU_FAST_NO_HZ=n results in a kernel that boots to [ 0.503236] Switched to NOHz mode on CPU#3 [ 0.503241] Switched to NOHz mode on CPU#1 [ 0.503247] Switched to NOHz mode on CPU#2 [ 0.503250] Switched to NOHz mode on CPU#0 and then hangs. After fifteen minutes I poked the keyboard & it promptly jumped ahead again, before hanging for good at: [ 92.899293] NET: Registered protocol family 1 I gave up after another half an hour. Note that the time on the log (92s) bears no relation to wall clock time at all. I'll try this kernel again & then give up on it! Does nohz=off change the system behavior? Also, does tapping frequently on the keyboard (ie: ctrl key) during boot make the issue go away? Tapping on the keyboard does seem to make the boot go quicker. I'll try a few times & see whether it kills the RCU oops: they don't appear every boot. nohz=off eliminates the delays for both CONFIG_RCU_FAST_NO_HZ=y and allows the CONFIG_RCU_FAST_NO_HZ=n kernel to boot successfully. Tapping on the keyboard doesn't seem to help with CONFIG_RCU_FAST_NO_HZ=y. I also have a kernel dmesg from that boot where the timestamps appear to go backwards as the oops is polled from the different CPUs. Odd. Ok. If nohz=off resolves it, it sounds like it then a problem with the oneshot clockevent mode. Possibly the device that triggers the timer irq is halting in a low-power mode. Usually keyboard interrupts force the system out of low-power, allowing the system to continue more normally. That said, it does seem configuration dependent so there may be a few things going on. Could you also try "nolapic_timer" ? nolapic_timer also fixes both kernels. It also eliminates some pauses (which could be curtailed by a keyboard press) in the shutdown process with showed up even with nohz=off. Philip: Thanks for the debugging and very quick feedback! Thomas: Any thoughts on how to address this? Thomas? On Monday, August 15, 2011, Paul E. McKenney wrote: > On Sun, Aug 14, 2011 at 09:02:27PM +0200, Rafael J. Wysocki wrote: ... > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=40092 > > Subject : RCU stall in linux-3.0.0 > > Submitter : Philip Armstrong <phil@kantaka.co.uk> > > Date : 2011-07-25 21:44 (21 days old) > > This one was due the the CPU's idea of the current time getting way > out of sync (as in more than a minute's worth of disagreement, which is > pretty impressive when you think about it). So what happened is that > one CPU decided that the current grace period was a full minute old > immediately, and thus started screaming piteously. John Stultz provided > a configuration workaround, and he and Thomas Gleixner would be working > out the long-term fix. |