Bug 102291

Summary: RCU hang, kills entire system
Product: Networking
Reporter: Steinar H. Gunderson (steinar+kernel)
Component: IPV6
Assignee: Len Brown (lenb)
Status: RESOLVED OBSOLETE
Severity: normal
CC: hannes, kafai, tomi
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.2-rc4, 4.2-rc5
Subsystem:
Regression: Yes
Bisected commit-id:

Description Steinar H. Gunderson 2015-08-05 00:19:34 UTC
Hi,

I upgraded my machine from 4.0.4 to 4.2-rc4 a day or so ago. Today it hung with a message about rcu_sched detecting a stall; first the network went away, then seemingly disk and everything else. I managed to grab a video of the oops off the iKVM; sorry for the obscure way of reporting a stack trace, but it's really the best I could do!

I hope this is the right component; it's hard for me to say what exactly causes this. I've only seen this once (so far).

The video is at http://home.samfundet.no/~sesse/crash.avi . I'll type over the most interesting part here:

INFO: rcu_sched detected stalls on CPUs/tasks: { 36} (detected by 10, t=60095 jiffies, g=5991784, c=5991783, q=5328)
Task dump for CPU 36:
swapper/36      R  running task [skipping the rest of the task dump, too boring to type]
Call Trace [skipping the addresses]:
  ? cpuidle_enter_state+0xd2/0x1bb
  ? cpuidle_enter+0x12/0x14
  ? call_cpuidle+0x54/0x56
  ? cpuidle_select+0xe/0x10
  ? cpu_startup_entry+0x1f5/0x2cc
  ? start_secondary+0x149/0x162

There are also complaints further down that the ixgbe transmit queue times out, but I suppose that's a consequence of RCU being hosed.
Comment 1 Steinar H. Gunderson 2015-08-06 15:39:33 UTC
It happened again, with 4.2-rc5. Here's the new hang message, this time triggered by running mtr --report against an IPv6 host:

Aug  6 17:26:56 pannekake kernel: [141625.537638] INFO: rcu_sched self-detected stall on CPU { 27}  (t=60000 jiffies g=4159086 c=4159085 q=2139)
Aug  6 17:26:56 pannekake kernel: [141625.547469] Task dump for CPU 27:
Aug  6 17:26:56 pannekake kernel: [141625.550881] mtr             R  running task        0 22121  22081 0x00000008
Aug  6 17:26:56 pannekake kernel: [141625.558069]  0000000000000000 ffff88103f363d98 ffffffff8106e488 000000000000001b
Aug  6 17:26:56 pannekake kernel: [141625.565641]  ffffffff81684900 ffff88103f363db8 ffffffff810702b0 0000000008000000
Aug  6 17:26:56 pannekake kernel: [141625.573220]  ffffffff81684900 ffff88103f363de8 ffffffff8108df9f ffff88103f375a00
Aug  6 17:26:56 pannekake kernel: [141625.580803] Call Trace:
Aug  6 17:26:56 pannekake kernel: [141625.583345]  <IRQ>  [<ffffffff8106e488>] sched_show_task+0xc1/0xc6
Aug  6 17:26:56 pannekake kernel: [141625.589650]  [<ffffffff810702b0>] dump_cpu_task+0x35/0x39
Aug  6 17:26:56 pannekake kernel: [141625.595144]  [<ffffffff8108df9f>] rcu_dump_cpu_stacks+0x6a/0x8c
Aug  6 17:26:56 pannekake kernel: [141625.601320]  [<ffffffff81090606>] rcu_check_callbacks+0x1f6/0x5d4
Aug  6 17:26:56 pannekake kernel: [141625.607669]  [<ffffffff810940c8>] update_process_times+0x2a/0x4f
Aug  6 17:26:56 pannekake kernel: [141625.613925]  [<ffffffff8109fbee>] tick_sched_handle+0x32/0x3e
Aug  6 17:26:56 pannekake kernel: [141625.619923]  [<ffffffff8109fc2f>] tick_sched_timer+0x35/0x5c
Aug  6 17:26:56 pannekake kernel: [141625.625830]  [<ffffffff81094a1f>] __hrtimer_run_queues+0x8f/0x18d
Aug  6 17:26:56 pannekake kernel: [141625.632171]  [<ffffffff81094c9e>] hrtimer_interrupt+0xa0/0x166
Aug  6 17:26:56 pannekake kernel: [141625.638258]  [<ffffffff8102bf2a>] local_apic_timer_interrupt+0x4e/0x52
Aug  6 17:26:56 pannekake kernel: [141625.645036]  [<ffffffff8102c36f>] smp_apic_timer_interrupt+0x39/0x4a
Aug  6 17:26:56 pannekake kernel: [141625.651643]  [<ffffffff8140b9e8>] apic_timer_interrupt+0x68/0x70
Aug  6 17:26:56 pannekake kernel: [141625.657895]  <EOI>  [<ffffffff81346ee8>] ? dst_destroy+0x7c/0xb5
Aug  6 17:26:56 pannekake kernel: [141625.664188]  [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20
Aug  6 17:26:56 pannekake kernel: [141625.670272]  [<ffffffff81082b45>] ? queue_write_lock_slowpath+0x60/0x6f
Aug  6 17:26:56 pannekake kernel: [141625.677140]  [<ffffffff8140aa33>] _raw_write_lock_bh+0x23/0x25
Aug  6 17:26:56 pannekake kernel: [141625.683218]  [<ffffffff813d4553>] __fib6_clean_all+0x40/0x82
Aug  6 17:26:56 pannekake kernel: [141625.689124]  [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20
Aug  6 17:26:56 pannekake kernel: [141625.695207]  [<ffffffff813d6058>] fib6_clean_all+0xe/0x10
Aug  6 17:26:56 pannekake kernel: [141625.700854]  [<ffffffff813d60d3>] fib6_run_gc+0x79/0xc8
Aug  6 17:26:56 pannekake kernel: [141625.706329]  [<ffffffff813d0510>] ip6_dst_gc+0x85/0xf9
Aug  6 17:26:56 pannekake kernel: [141625.711718]  [<ffffffff81346d68>] dst_alloc+0x55/0x159
Aug  6 17:26:56 pannekake kernel: [141625.717105]  [<ffffffff813d09b5>] __ip6_dst_alloc.isra.32+0x19/0x63
Aug  6 17:26:56 pannekake kernel: [141625.723620]  [<ffffffff813d1830>] ip6_pol_route+0x36a/0x3e8
Aug  6 17:26:56 pannekake kernel: [141625.729441]  [<ffffffff813d18d6>] ip6_pol_route_output+0x11/0x13
Aug  6 17:26:56 pannekake kernel: [141625.735700]  [<ffffffff813f02c8>] fib6_rule_action+0xa7/0x1bf
Aug  6 17:26:56 pannekake kernel: [141625.741698]  [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17
Aug  6 17:26:56 pannekake kernel: [141625.748043]  [<ffffffff81357c48>] fib_rules_lookup+0xb5/0x12a
Aug  6 17:26:56 pannekake kernel: [141625.754050]  [<ffffffff81141628>] ? poll_select_copy_remaining+0xf9/0xf9
Aug  6 17:26:56 pannekake kernel: [141625.761002]  [<ffffffff813f0535>] fib6_rule_lookup+0x37/0x5c
Aug  6 17:26:56 pannekake kernel: [141625.766914]  [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17
Aug  6 17:26:56 pannekake kernel: [141625.773260]  [<ffffffff813d008c>] ip6_route_output+0x7a/0x82
Aug  6 17:26:56 pannekake kernel: [141625.779177]  [<ffffffff813c44c8>] ip6_dst_lookup_tail+0x53/0x112
Aug  6 17:26:56 pannekake kernel: [141625.785437]  [<ffffffff813c45c3>] ip6_dst_lookup_flow+0x2a/0x6b
Aug  6 17:26:56 pannekake kernel: [141625.791604]  [<ffffffff813ddaab>] rawv6_sendmsg+0x407/0x9b6
Aug  6 17:26:56 pannekake kernel: [141625.797423]  [<ffffffff813d7914>] ? do_ipv6_setsockopt.isra.8+0xd87/0xde2
Aug  6 17:26:56 pannekake kernel: [141625.804464]  [<ffffffff8139d4b4>] inet_sendmsg+0x57/0x8e
Aug  6 17:26:56 pannekake kernel: [141625.810028]  [<ffffffff81329ba3>] sock_sendmsg+0x2e/0x3c
Aug  6 17:26:56 pannekake kernel: [141625.815588]  [<ffffffff8132be57>] SyS_sendto+0xfe/0x143
Aug  6 17:26:56 pannekake kernel: [141625.821063]  [<ffffffff813dd551>] ? rawv6_setsockopt+0x5e/0x67
Aug  6 17:26:56 pannekake kernel: [141625.827146]  [<ffffffff8132c9f8>] ? sock_common_setsockopt+0xf/0x11
Aug  6 17:26:56 pannekake kernel: [141625.833660]  [<ffffffff8132c08c>] ? SyS_setsockopt+0x81/0xa2
Aug  6 17:26:56 pannekake kernel: [141625.839565]  [<ffffffff8140ac17>] entry_SYSCALL_64_fastpath+0x12/0x6a

Maybe netdev should be added, since this looks network-related? It seems I cannot Cc them from Bugzilla, though.
Comment 2 Steinar H. Gunderson 2015-08-08 09:36:03 UTC
Reassigning to networking, in the hopes that someone from netdev can look at it.
Comment 3 Steinar H. Gunderson 2015-08-08 09:37:31 UTC
It died again tonight. Unfortunately there was nothing on either the serial console or the local console when I woke up, and nothing in the logs. But I have to assume it's the same issue.
Comment 4 hannes 2015-08-10 10:08:27 UTC
The video and the last stack trace do not really match. I first assumed an issue within the IPv6 stack, but couldn't see anything that should bring down the machine at dst_destroy. It would be interesting to know whether the same issue happens with 4.1, too.

Thanks,
Hannes
Comment 5 Steinar H. Gunderson 2015-08-10 10:11:39 UTC
The video is from the first crash. The pasted stack trace is from the second crash. (I don't have a stacktrace from the third crash.) So it's not necessarily expected that they agree.

I can try downgrading to 4.1, but since I've migrated from mq to smq (which is new in 4.2, and had a bug such that the easiest workaround was to specify the policy explicitly :-) ), it might be slightly tricky to get dm-cache to agree to that.
Comment 6 Steinar H. Gunderson 2015-08-11 14:49:22 UTC
Now on 4.1.4. No crash yet (although the last two days on 4.2-rc5 were also crash-free, so it probably happens slightly less often than once a day).
Comment 7 hannes 2015-08-12 17:29:15 UTC
Could you try to see whether the 4.2 kernel panics or hangs at the same place (in IPv6 code)? If you can get some more consistent backtraces, that would be extremely helpful. :)
Comment 8 Steinar H. Gunderson 2015-08-12 17:31:33 UTC
Which 4.2 kernel? kernel.org lists only 4.2-rc6 (which I haven't tried either, but, yes.)

4.1.4 now has almost 48 hours uptime without dying on me.
Comment 9 Martin KaFai Lau 2015-08-12 17:59:05 UTC
Thanks for the report, Steinar.


Hi Hannes,

Here is what may be happening:

1. ip6_rt_pcpu_alloc() is called under read_lock_bh(&table->tb6_lock).

2. At the same time, ip6_rt_pcpu_alloc() may call dst_alloc(), which may trigger ip6_dst_gc(), which eventually takes write_lock_bh(&table->tb6_lock)... deadlock on the same lock.

I will try to come up with a patch that drops the read_lock_bh() before calling dst_alloc() in ip6_rt_pcpu_alloc(); a sketch of the suspected pattern is below.
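
To make the suspected recursion concrete, here is a minimal userspace sketch of the same pattern, with a plain pthread rwlock standing in for tb6_lock. The names (table_lock, alloc_route, run_gc) are illustrative only, not the real kernel code: a thread that already holds the lock for reading takes it again for writing and (with the usual glibc policy) blocks forever, much like CPU 27 spinning in _raw_write_lock_bh above.

/* Illustrative sketch of the suspected self-deadlock, not kernel source.
 * Build with: gcc -pthread sketch.c */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

static void run_gc(void)
{
        /* stands in for fib6_run_gc()/fib6_clean_all(): wants the lock exclusively */
        pthread_rwlock_wrlock(&table_lock);   /* never returns: we hold the read side */
        pthread_rwlock_unlock(&table_lock);
}

static void alloc_route(void)
{
        /* stands in for dst_alloc() hitting the gc threshold and calling ip6_dst_gc() */
        run_gc();
}

int main(void)
{
        pthread_rwlock_rdlock(&table_lock);   /* ip6_rt_pcpu_alloc() runs under the read lock */
        alloc_route();                        /* deadlocks inside run_gc() */
        pthread_rwlock_unlock(&table_lock);
        puts("never reached");
        return 0;
}

The fix direction is the same in both cases: release the read lock before the allocation that can trigger gc, and re-acquire it afterwards.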

Thanks,
--Martin
Comment 10 Steinar H. Gunderson 2016-08-06 14:49:31 UTC
Closing this; I haven't seen it in a long time.