Bug 71351 - "INFO: rcu_sched detected stalls on CPUs/tasks" on high server load
Summary: "INFO: rcu_sched detected stalls on CPUs/tasks" on high server load
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-01 18:11 UTC by Mirek Kratochvil
Modified: 2014-07-24 04:32 UTC

See Also:
Kernel Version: 3.10.22, 3.11, 3.13.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg on 3.10.25 kernel (240.33 KB, text/plain)
2014-03-01 18:11 UTC, Mirek Kratochvil
dmesg from 3.13.3 (97.22 KB, text/plain)
2014-03-01 18:12 UTC, Mirek Kratochvil
dmesg from 3.13.5 (245.83 KB, text/plain)
2014-03-01 18:12 UTC, Mirek Kratochvil

Description Mirek Kratochvil 2014-03-01 18:11:08 UTC
After upgrading the kernel on several of my machines from 3.6.9 to 3.13.3, I've seen the following problem occur at random, after the machine has been running for some time:

[ 5727.864173] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} (detected by 5, t=60002 jiffies, g=602758, c=602757, q=24880)
[ 5727.864179] sending NMI to all CPUs:
[ 5727.864183] NMI backtrace for cpu 5
[ 5727.864186] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 3.13.3 #4
[ 5727.864187] Hardware name: Supermicro X8SIE/X8SIE, BIOS 1.0c 05/27/2010
[ 5727.864189] task: ffff880236095210 ti: ffff8802360ba000 task.ti: ffff8802360ba000
[ 5727.864191] RIP: 0010:[<ffffffff812bd031>]  [<ffffffff812bd031>] __const_udelay+0x21/0x30

.....
(Full dmesg in attachment).
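
For anyone going through the attached logs, here is a minimal sketch that pulls the fields out of a stall header line like the one above. The regular expression is derived only from that single excerpt, so other kernel versions may format the line differently. The function name `parse_stall` and the default `hz=1000` are my own assumptions; the real tick rate depends on the kernel's CONFIG_HZ setting.

```python
import re

# Pattern modelled on the single stall line quoted in this report;
# other kernel versions may print the message in a different shape.
STALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: (?P<flavor>\w+) detected stalls on "
    r"CPUs/tasks: \{(?P<cpus>[^}]*)\}\s+\(detected by (?P<by>\d+), "
    r"t=(?P<t>\d+) jiffies, g=(?P<g>-?\d+), c=(?P<c>-?\d+), q=(?P<q>\d+)\)"
)


def parse_stall(line, hz=1000):
    """Parse one stall header line; return a dict, or None if it does not match.

    hz is an ASSUMED tick rate used only to convert jiffies to seconds.
    """
    m = STALL_RE.search(line)
    if m is None:
        return None
    jiffies = int(m.group("t"))
    return {
        "timestamp": float(m.group("ts")),
        "flavor": m.group("flavor"),
        # Keep only plain CPU numbers from the braces.
        "stalled_cpus": [int(c) for c in m.group("cpus").split() if c.isdigit()],
        "detected_by": int(m.group("by")),
        "jiffies": jiffies,
        "stall_seconds": jiffies / hz,
        "g": int(m.group("g")),
        "c": int(m.group("c")),
        "q": int(m.group("q")),
    }


SAMPLE = (
    "[ 5727.864173] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} "
    "(detected by 5, t=60002 jiffies, g=602758, c=602757, q=24880)"
)
info = parse_stall(SAMPLE)
```

If the 1000 Hz assumption holds, t=60002 jiffies corresponds to a stall of about 60 seconds on CPU 7, noticed by CPU 5.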

I don't know where to start looking. The machines do:

- HFSC traffic shaping of circa 500 Mbits of data (low CPU load, not many classes)
- e1000 and/or igb networking
- some (not very heavy) disk and CPU load from PostgreSQL
- irqbalance for (well) IRQ balancing
- BIRD routing daemon with OSPF.

When this problem happens, one of the following usually starts to go wrong (not every time, and seemingly at random):

- Network interrupts start to take away more CPU (from 2-3% on each core to around 50% on each core)
- HFSC stops working and it doesn't do anything at all
- HFSC fails and no packets run through.
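
To put numbers on the interrupt load in the first item, the per-CPU share of time spent in hard and soft interrupts can be computed from two samples of /proc/stat. This is a minimal sketch, not the tool I used: the function names are hypothetical, and the field positions (irq and softirq as the 6th and 7th counters) follow the standard /proc/stat layout.

```python
import time


def read_cpu_times(stat_text):
    """Return {cpu_name: [tick counters]} for the per-CPU lines of /proc/stat."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-CPU lines start with "cpu0", "cpu1", ...; skip the aggregate "cpu".
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            times[fields[0]] = [int(x) for x in fields[1:]]
    return times


def irq_share(before, after):
    """Percent of elapsed ticks each CPU spent in irq + softirq between samples."""
    shares = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        # Sum user..steal only; guest time is already included in user.
        total = sum(delta[:8])
        if total > 0:
            shares[cpu] = 100.0 * (delta[5] + delta[6]) / total
    return shares


def sample_proc_stat(interval=1.0):
    """Take two /proc/stat samples `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        before = read_cpu_times(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = read_cpu_times(f.read())
    return irq_share(before, after)
```

Calling `sample_proc_stat(5.0)` on an affected machine should show whether the per-core figure has moved from the usual 2-3% to around 50%.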

I haven't yet been able to reproduce this in a lab setup (it only happens on production servers), so I can't produce much useful debug output. If there's anything more useful I should attach here, tell me.

I'm currently trying to bisect to find the change that introduced this problem (it doesn't happen on 3.1.1 to 3.6.9 and it certainly happens from 3.10.22 to 3.13.5), but it's quite a slow process because each step means waiting several hours for the bug to occur.

dmesg outputs showing the error are attached.

So far I have tried to rule out the following things:

- e1000 or igb driver (happens on both)
- HFSC (seems to happen even without HFSC)
- GRO, TSO, ... etc for network stuff (no effect)
- C-state idle drivers (I've been told that some NICs don't play well when C-states go above 1, but limiting them didn't help much).

Thanks for any help on solving this.
-mk

PS. Because the machines are doing networking and this seems to be triggered by heavy network usage, I filed this under the "Networking" component, but I'm not sure whether it's really a networking issue. Please reassign if it looks otherwise.
Comment 1 Mirek Kratochvil 2014-03-01 18:11:48 UTC
Created attachment 127751 [details]
dmesg on 3.10.25 kernel
Comment 2 Mirek Kratochvil 2014-03-01 18:12:13 UTC
Created attachment 127761 [details]
dmesg from 3.13.3
Comment 3 Mirek Kratochvil 2014-03-01 18:12:39 UTC
Created attachment 127771 [details]
dmesg from 3.13.5
Comment 4 Mirek Kratochvil 2014-03-05 17:49:21 UTC
I tracked the issue down a bit: it happens on 3.7.0 but doesn't happen on 3.6.11.

There were some RCU changes merged for 3.7; I hope I'll be able to bisect down to the one that caused the problem.
Comment 5 Mirek Kratochvil 2014-03-05 17:53:14 UTC
More details:

- falling back to a kernel without NO_HZ set (i.e. a rigid 1000 Hz timer frequency) solves the issue, but CPU usage of the network cards gets around 20 times higher (which is unusable for this setup, and just "too much")

- preemption doesn't affect/cause this.
Comment 6 Andev 2014-07-24 04:32:13 UTC
Can you check if any of the recent kernels still have these issues?
