103821 – Boot hangs at clocksource_done_booting on large configs

Bug 103821 - Boot hangs at clocksource_done_booting on large configs

Summary: Boot hangs at clocksource_done_booting on large configs

Status:	NEW

Alias:	None

Product:	Timers
Classification:	Unclassified
Component:	Other (show other bugs)
Hardware:	x86-64 Linux

Importance:	P1 high
Assignee:	john stultz

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-08-31 18:21 UTC by Alex Thorlton
Modified:	2016-05-16 23:13 UTC (History)
CC List:	5 users (show)

See Also:
Kernel Version:	4.2-rc1
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Alex Thorlton 2015-08-31 18:21:31 UTC

We've been hitting issues with machines with 4096 or more logical cores
hanging up at clocksource_done_booting, seemingly forever.  The problem
occurs somewhat intermittently, though it hits more consistently as you
increase the core count.  We've observed this problem on kernels
starting at 3.10, all the way up through 4.2.

We initially thought there was some issue with the stop_machine code,
but Thomas G. cleared that up for us with a thorough explanation of the
stop_machine mechanism.  After ruling out issues in stop_machine, we
started investigating possible hardware related scaling issues.

I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
out any scaling issues related to multiple concurrent reads to our
RTC's MMR.  We start getting hung task timeouts during
clocksource_done_booting:

8<----
 calling  clocksource_done_booting+0x0/0x42 @ 1
 INFO: task swapper/0:1 blocked for more than 480 seconds.
       Not tainted 4.2.0-rc1-nortc+ #384
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 swapper/0       D ffff881efa933b48     0     1      0 0x00000000
  ffff881efa933b48 ffff881ef45e0000 ffffb5de7a130000 000000fdfa933b38
  ffff881efa934000 7fffffffffffffff ffff881efa933ca0 ffffb5de7a130000
  0000000000000000 ffff881efa933b68 ffffffff8158eae7 ffff881efa933b68
 Call Trace:
  [<ffffffff8158eae7>] schedule+0x37/0x80
  [<ffffffff81591707>] schedule_timeout+0x1e7/0x270
  [<ffffffff812ee9b9>] ? _find_next_bit.part.0+0x19/0x70
  [<ffffffff8158fdb0>] wait_for_completion+0xa0/0x100
  [<ffffffff8108d490>] ? wake_up_q+0x70/0x70
  [<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
  [<ffffffff810fa2a0>] __stop_cpus+0x50/0x70
  [<ffffffff810b8f7d>] ? console_unlock+0x2dd/0x4a0
  [<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
  [<ffffffff810fa525>] stop_cpus+0x35/0x50
  [<ffffffff810d18f0>] ? tk_setup_internals.constprop.10+0x140/0x140
  [<ffffffff810fa652>] __stop_machine+0xb2/0xe0
  [<ffffffff810d18f0>] ? tk_setup_internals.constprop.10+0x140/0x140
  [<ffffffff810d18f0>] ? tk_setup_internals.constprop.10+0x140/0x140
  [<ffffffff810fa6ae>] stop_machine+0x2e/0x50
  [<ffffffff810d1fad>] timekeeping_notify+0x2d/0x50
  [<ffffffff810d3d4b>] __clocksource_select+0xcb/0x120
  [<ffffffff81b77cfa>] ? boot_override_clock+0x4d/0x4d
  [<ffffffff81b77d2c>] clocksource_done_booting+0x32/0x42
  [<ffffffff810003a9>] do_one_initcall+0x169/0x1d0
  [<ffffffff81b5613c>] kernel_init_freeable+0x1bb/0x248
  [<ffffffff81582560>] ? rest_init+0x80/0x80
  [<ffffffff8158256e>] kernel_init+0xe/0xe0
  [<ffffffff81592c5f>] ret_from_fork+0x3f/0x70
  [<ffffffff81582560>] ? rest_init+0x80/0x80
---->8

After this I NMId the machine to get backtraces for the other CPUs.
Most are stuck around here:

8<----
 CPU: 4867 PID: 19730 Comm: migration/4867 Tainted: G      D         4.2.0-rc1-nortc+ #384
 Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
 task: ffffad5e798cc500 ti: ffffad5e798d4000 task.ti: ffffad5e798d4000
 RIP: 0010:[<ffffffff8105e879>]  [<ffffffff8105e879>] uv_handle_nmi+0x19/0x790
 RSP: 0000:ffffc90020156da8  EFLAGS: 00010092
 RAX: 000002fe65b95384 RBX: 000000000000002d RCX: ffffc90020162a98
 RDX: 000002fe65b95384 RSI: ffffc90020156ef8 RDI: 0000000000000001
 RBP: ffffc90020156e18 R08: 0000000000000001 R09: 0000000000000000
 R10: 0000000000000003 R11: 0000000000000020 R12: 000002fe65b95384
 R13: 0000000000000000 R14: ffffffff81a38880 R15: ffffc90020156ef8
 FS:  0000000000000000(0000) GS:ffffc90020151000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000000da80 CR3: 0000000001a0b000 CR4: 00000000001406e0
 Stack:
  0000000000000000 0000000000000000 0000000000000000 0000000000000000
  0000000000000000 0000000000000000 0000000000000000 0000000000000000
  0000000000000000 000000000000002d 000002fe65b95384 0000000000000000
 Call Trace:
  <NMI>
  [<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
  [<ffffffff81007869>] nmi_handle+0x79/0x100
  [<ffffffff81007873>] ? nmi_handle+0x83/0x100
  [<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
  [<ffffffff81007c6c>] unknown_nmi_error+0x1c/0x90
  [<ffffffff81007e48>] default_do_nmi+0xd8/0xf0
  [<ffffffff81007ee3>] do_nmi+0x83/0xb0
  [<ffffffff815948bf>] end_repeat_nmi+0x1e/0x2e
  [<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
  [<ffffffff810486c6>] ? read_hpet+0x16/0x20
  [<ffffffff810486c6>] ? read_hpet+0x16/0x20
  [<ffffffff810486c6>] ? read_hpet+0x16/0x20
  <<EOE>>
  <IRQ>
  [<ffffffff810d1b97>] ktime_get+0x37/0xb0
  [<ffffffff810d819b>] clockevents_program_event+0x4b/0x130
  [<ffffffff810d8737>] tick_handle_periodic+0x57/0x70
  [<ffffffff8103b46c>] local_apic_timer_interrupt+0x3c/0x70
  [<ffffffff815952c1>] smp_apic_timer_interrupt+0x41/0x60
  [<ffffffff815936ab>] apic_timer_interrupt+0x6b/0x70
  <EOI>
  [<ffffffff81097b41>] ? update_blocked_averages+0x381/0x860
  [<ffffffff810f9f98>] ? multi_cpu_stop+0x98/0xf0
  [<ffffffff8100272a>] ? __switch_to+0x26a/0x590
  [<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
  [<ffffffff810fa1c8>] cpu_stopper_thread+0x88/0x110
  [<ffffffff8158e4e8>] ? __schedule+0x338/0x900
  [<ffffffff81085575>] smpboot_thread_fn+0x135/0x190
  [<ffffffff81085440>] ? sort_range+0x30/0x30
  [<ffffffff810820cb>] kthread+0xdb/0x100
  [<ffffffff81081ff0>] ? kthread_create_on_node+0x180/0x180
  [<ffffffff81592c5f>] ret_from_fork+0x3f/0x70
  [<ffffffff81081ff0>] ? kthread_create_on_node+0x180/0x180
 Code: 31 c0 e8 a0 c2 52 00 e9 39 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f7 53 48 83 ec 48 <48> 8b 1d 00 f2 fa 7e 65 44 8b 2d 70 b8 fa 7e 9c 58 0f 1f 44 00
---->8

The full log is a bit too big to attach here.  Here's a link to it:

http://oss.sgi.com/projects/athorlton/harp50-6144-nortc

(I'd wget it - probably don't wanna open a 15M file in your browser :)

Anyways, based on my limited understanding of what's going on here, I
tried disabling interrupts a bit further out in the cpu_stopper_thread
code, to try and avoid that apic_timer_interrupt during the clocksource
change.  change_clocksource already disables interrupts, but, from the
backtraces, it appears that we're taking an interrupt after the stopper
thread wakes up, but before change_clocksource actually gets called.

I used this patch to try and avoid taking a apic_timer_interrupt at any
point during the clocksource change:

8<---
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index fd643d8..8502521 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -417,8 +417,11 @@ static void cpu_stopper_thread(unsigned int cpu)
 {
        struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
        struct cpu_stop_work *work;
+       unsigned long flags;
        int ret;

+       local_irq_save(flags);
+
 repeat:
        work = NULL;
        spin_lock_irq(&stopper->lock);
@@ -452,6 +455,8 @@ repeat:
                cpu_stop_signal_done(done, true);
                goto repeat;
        }
+
+       local_irq_restore(flags);
 }

 extern void sched_set_stop_task(int cpu, struct task_struct *stop);
--->8

This was a bit of a shot in the dark, but it has worked on every kernel
and every machine we've tried it on so far.

I'm hoping to get some input from the experts in this area, first of
all, on whether the problem I'm seeing is actually what I think it is,
and, if so, if I've solved it in the correct way.

Any input is greatly appreciated!

- Alex

Comment 1 Alex Thorlton 2015-08-31 18:22:44 UTC

Pointer to the e-mail thread for the same bug:

https://lkml.org/lkml/2015/8/31/353

Comment 2 [account disabled by administrator] 2016-05-15 03:53:37 UTC

Seems Ok expect that you can do spin_lock_irqsave and merge the savingof irqs with the line(s) disabling and acquiring the spinlock used.

Comment 3 Alex Thorlton 2016-05-16 23:13:21 UTC

Peter Z. pointed out, quite a while back, that this patch doesn't actually achieve the desired effect.  In reality, adding in this irqsave/restore where it is only disables IRQs over this little chunk of code:

 repeat:
        work = NULL;

Which is most likely (more like definitely) *not* actually fixing the problem.  My guess is that the addition of the extra instructions fudged the timing here just enough that we didn't catch the bug during these boots.

This has been an incredibly elusive issue to debug.  It likes to pop up for a day or so, and then when you go back and boot the same kernel on the same machine a day later, it's gone.

The most recent theory (that I haven't been able to disprove yet) is that 6K possible writers along with a large amount of possible readers is too much for the timekeeper_seqlock to handle, but again, this is based strictly on observation of behavior, and not on any hard facts.

We're working on some UV fixes for the mainline kernel now.  Hopefully once those are pushed through we'll have some more time to dig into this!

Note You need to log in before you can comment on or make changes to this bug.