Hi, I upgraded my machine from 4.0.4 to 4.2-rc4 a day or so ago. Today, it hung with a message about rcu_sched detecting a hang; first network went away, then seemingly disk and everything. I managed to grab a video of the oops off the iKVM; sorry for the obscure way of reporting a stack trace, but it's really the best thing I could do! I hope this is the right component; it's hard for me to say what exactly causes this. I've only seen this once (so far). The video is at http://home.samfundet.no/~sesse/crash.avi . I'll type over the most interesting part here: INFO: rcu_sched detected stalls on CPUs/tasks: { 36} (detected by 10, t=60095 jiffies, g=5991784, c=5991783, q=5328) Task dump for CPU 36: swapper/36 R running task [skipping the rest of the task dump, too boring to type] Call Trace [skipping the addresses]: ? cpuidle_enter_state+0xd2/0x1bb ? cpuidle_enter+0x12/0x14 ? call_cupidle+0x54/0x56 ? cpuidle-select+0xe/0x10 ? cpu_startup_entry+0x1f5/0x2cc ? start_secondary+0x149/0x162 There's also complaints that the ixgbe transmit queue times out further down, but I suppose that somehow comes because RCU is hosed.
It happened again, with 4.2-rc5. Here's the new hang message, this time triggered by running mtr --report against an IPv6 host: Aug 6 17:26:56 pannekake kernel: [141625.537638] INFO: rcu_sched self-detected stall on CPU { 27} (t=60000 jiffies g=4159086 c=4159085 q=2139) Aug 6 17:26:56 pannekake kernel: [141625.547469] Task dump for CPU 27: Aug 6 17:26:56 pannekake kernel: [141625.550881] mtr R running task 0 22121 22081 0x00000008 Aug 6 17:26:56 pannekake kernel: [141625.558069] 0000000000000000 ffff88103f363d98 ffffffff8106e488 000000000000001b Aug 6 17:26:56 pannekake kernel: [141625.565641] ffffffff81684900 ffff88103f363db8 ffffffff810702b0 0000000008000000 Aug 6 17:26:56 pannekake kernel: [141625.573220] ffffffff81684900 ffff88103f363de8 ffffffff8108df9f ffff88103f375a00 Aug 6 17:26:56 pannekake kernel: [141625.580803] Call Trace: Aug 6 17:26:56 pannekake kernel: [141625.583345] <IRQ> [<ffffffff8106e488>] sched_show_task+0xc1/0xc6 Aug 6 17:26:56 pannekake kernel: [141625.589650] [<ffffffff810702b0>] dump_cpu_task+0x35/0x39 Aug 6 17:26:56 pannekake kernel: [141625.595144] [<ffffffff8108df9f>] rcu_dump_cpu_stacks+0x6a/0x8c Aug 6 17:26:56 pannekake kernel: [141625.601320] [<ffffffff81090606>] rcu_check_callbacks+0x1f6/0x5d4 Aug 6 17:26:56 pannekake kernel: [141625.607669] [<ffffffff810940c8>] update_process_times+0x2a/0x4f Aug 6 17:26:56 pannekake kernel: [141625.613925] [<ffffffff8109fbee>] tick_sched_handle+0x32/0x3e Aug 6 17:26:56 pannekake kernel: [141625.619923] [<ffffffff8109fc2f>] tick_sched_timer+0x35/0x5c Aug 6 17:26:56 pannekake kernel: [141625.625830] [<ffffffff81094a1f>] __hrtimer_run_queues+0x8f/0x18d Aug 6 17:26:56 pannekake kernel: [141625.632171] [<ffffffff81094c9e>] hrtimer_interrupt+0xa0/0x166 Aug 6 17:26:56 pannekake kernel: [141625.638258] [<ffffffff8102bf2a>] local_apic_timer_interrupt+0x4e/0x52 Aug 6 17:26:56 pannekake kernel: [141625.645036] [<ffffffff8102c36f>] smp_apic_timer_interrupt+0x39/0x4a Aug 6 17:26:56 pannekake kernel: [141625.651643] [<ffffffff8140b9e8>] apic_timer_interrupt+0x68/0x70 Aug 6 17:26:56 pannekake kernel: [141625.657895] <EOI> [<ffffffff81346ee8>] ? dst_destroy+0x7c/0xb5 Aug 6 17:26:56 pannekake kernel: [141625.664188] [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20 Aug 6 17:26:56 pannekake kernel: [141625.670272] [<ffffffff81082b45>] ? queue_write_lock_slowpath+0x60/0x6f Aug 6 17:26:56 pannekake kernel: [141625.677140] [<ffffffff8140aa33>] _raw_write_lock_bh+0x23/0x25 Aug 6 17:26:56 pannekake kernel: [141625.683218] [<ffffffff813d4553>] __fib6_clean_all+0x40/0x82 Aug 6 17:26:56 pannekake kernel: [141625.689124] [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20 Aug 6 17:26:56 pannekake kernel: [141625.695207] [<ffffffff813d6058>] fib6_clean_all+0xe/0x10 Aug 6 17:26:56 pannekake kernel: [141625.700854] [<ffffffff813d60d3>] fib6_run_gc+0x79/0xc8 Aug 6 17:26:56 pannekake kernel: [141625.706329] [<ffffffff813d0510>] ip6_dst_gc+0x85/0xf9 Aug 6 17:26:56 pannekake kernel: [141625.711718] [<ffffffff81346d68>] dst_alloc+0x55/0x159 Aug 6 17:26:56 pannekake kernel: [141625.717105] [<ffffffff813d09b5>] __ip6_dst_alloc.isra.32+0x19/0x63 Aug 6 17:26:56 pannekake kernel: [141625.723620] [<ffffffff813d1830>] ip6_pol_route+0x36a/0x3e8 Aug 6 17:26:56 pannekake kernel: [141625.729441] [<ffffffff813d18d6>] ip6_pol_route_output+0x11/0x13 Aug 6 17:26:56 pannekake kernel: [141625.735700] [<ffffffff813f02c8>] fib6_rule_action+0xa7/0x1bf Aug 6 17:26:56 pannekake kernel: [141625.741698] [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17 Aug 6 17:26:56 pannekake kernel: [141625.748043] [<ffffffff81357c48>] fib_rules_lookup+0xb5/0x12a Aug 6 17:26:56 pannekake kernel: [141625.754050] [<ffffffff81141628>] ? poll_select_copy_remaining+0xf9/0xf9 Aug 6 17:26:56 pannekake kernel: [141625.761002] [<ffffffff813f0535>] fib6_rule_lookup+0x37/0x5c Aug 6 17:26:56 pannekake kernel: [141625.766914] [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17 Aug 6 17:26:56 pannekake kernel: [141625.773260] [<ffffffff813d008c>] ip6_route_output+0x7a/0x82 Aug 6 17:26:56 pannekake kernel: [141625.779177] [<ffffffff813c44c8>] ip6_dst_lookup_tail+0x53/0x112 Aug 6 17:26:56 pannekake kernel: [141625.785437] [<ffffffff813c45c3>] ip6_dst_lookup_flow+0x2a/0x6b Aug 6 17:26:56 pannekake kernel: [141625.791604] [<ffffffff813ddaab>] rawv6_sendmsg+0x407/0x9b6 Aug 6 17:26:56 pannekake kernel: [141625.797423] [<ffffffff813d7914>] ? do_ipv6_setsockopt.isra.8+0xd87/0xde2 Aug 6 17:26:56 pannekake kernel: [141625.804464] [<ffffffff8139d4b4>] inet_sendmsg+0x57/0x8e Aug 6 17:26:56 pannekake kernel: [141625.810028] [<ffffffff81329ba3>] sock_sendmsg+0x2e/0x3c Aug 6 17:26:56 pannekake kernel: [141625.815588] [<ffffffff8132be57>] SyS_sendto+0xfe/0x143 Aug 6 17:26:56 pannekake kernel: [141625.821063] [<ffffffff813dd551>] ? rawv6_setsockopt+0x5e/0x67 Aug 6 17:26:56 pannekake kernel: [141625.827146] [<ffffffff8132c9f8>] ? sock_common_setsockopt+0xf/0x11 Aug 6 17:26:56 pannekake kernel: [141625.833660] [<ffffffff8132c08c>] ? SyS_setsockopt+0x81/0xa2 Aug 6 17:26:56 pannekake kernel: [141625.839565] [<ffffffff8140ac17>] entry_SYSCALL_64_fastpath+0x12/0x6a Maybe netdev should be added, since this looks network-related? It seems I cannot Cc them from Bugzilla, though.
Reassigning to networking, in the hopes that someone from netdev can look at it.
It died again tonight. Unfortunately there was nothing on the serial console or the console when I woke up, and nothing in the logs. But I have to assume it's the same issue.
The video and the last stacktrace does not really match. I first assumed an issue within the IPv6 stack but couldn't see anything that should bring down the machine at dst_destroy. It would be interesting if this same issue happens with 4.1, too. Thanks, Hannes
The video is from the first crash. The pasted stack trace is from the second crash. (I don't have a stacktrace from the third crash.) So it's not necessarily expected that they agree. I can try downgrading to 4.1, but since I've migrated from mq to smq (which is new in 4.2, and had a bug such that the easiest was to specify that explicitly :-) ), it might be slightly tricky to get dm-cache to agree to that.
Now on 4.1.4. No crash yet (although the last two days on 4.2-rc5 were also without crash, so it probably is slightly below once a day).
Could you try to see if the 4.2 kernel panics or hangs at the same place (in IPv6 code)? Maybe you can get some more stable backtraces, that would be extremely helpful. :)
Which 4.2 kernel? kernel.org lists only 4.2-rc6 (which I haven't tried either, but, yes.) 4.1.4 now has almost 48 hours uptime without dying on me.
Thanks for the report Steinar. Hi Hannes, What may be happening could be: 1. ip6_rt_pcpu_alloc() is called under read_lock_bh(&table->tb6_lock). 2. At the same time, ip6_rt_pcpu_alloc() may call dst_alloc() which may trigger ip6_dst_gc() which eventually takes a write_lock_bh(&table->tb6_lock)....oops. I will try to come up with a patch that drops the read_lock_bh() before calling dst_alloc() in ip6_rt_pcpu_alloc(). Thanks, --Martin
Closing this, I haven't seen it in a long time.