Bug 103821
Summary: | Boot hangs at clocksource_done_booting on large configs | | |
---|---|---|---|
Product: | Timers | Reporter: | Alex Thorlton (athorlton) |
Component: | Other | Assignee: | john stultz (john.stultz) |
Status: | NEW --- | | |
Severity: | high | CC: | athorlton, bastienphilbert, rja, sivanich, tglx |
Priority: | P1 | | |
Hardware: | x86-64 | | |
OS: | Linux | | |
Kernel Version: | 4.2-rc1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Description
Alex Thorlton
2015-08-31 18:21:31 UTC
Pointer to the e-mail thread for the same bug: https://lkml.org/lkml/2015/8/31/353

Seems OK, except that you can use spin_lock_irqsave and merge the saving of IRQs with the line(s) that disable them and acquire the spinlock.

Peter Z. pointed out, quite a while back, that this patch doesn't actually achieve the desired effect. In reality, adding the irqsave/restore where it is only disables IRQs over this little chunk of code:

    repeat:
        work = NULL;

That is most likely (more like definitely) *not* actually fixing the problem. My guess is that the extra instructions fudged the timing just enough that we didn't catch the bug during these boots.

This has been an incredibly elusive issue to debug. It likes to pop up for a day or so, and then when you go back and boot the same kernel on the same machine a day later, it's gone. The most recent theory (which I haven't been able to disprove yet) is that 6K possible writers along with a large number of possible readers is too much for the timekeeper_seqlock to handle, but again, this is based strictly on observed behavior, not on any hard facts.

We're working on some UV fixes for the mainline kernel now. Hopefully once those are pushed through we'll have some more time to dig into this!
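For anyone not familiar with the pattern behind the timekeeper_seqlock theory, here is a minimal userspace sketch of a sequence lock using C11 atomics. It is only an illustration of the read-retry idea, not the kernel's seqlock_t or the timekeeping code. The names (`demo_seqlock`, `demo_write_begin`, `demo_write_end`, `demo_write`, `demo_read_try`) are hypothetical, and it assumes writers are already serialized by some other lock. The point it shows is that a reader gets no usable result while a write is in progress, or if a write completed during the read, and has to try again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace sequence lock; NOT the kernel's seqlock_t. */
struct demo_seqlock {
    atomic_uint seq;        /* even: stable, odd: write in progress */
    _Atomic uint64_t a, b;  /* payload; a writer always leaves a == b */
};

/* Assumption: callers serialize writers themselves (one writer at a time). */
static void demo_write_begin(struct demo_seqlock *sl)
{
    unsigned s = atomic_load_explicit(&sl->seq, memory_order_relaxed);
    atomic_store_explicit(&sl->seq, s + 1, memory_order_relaxed); /* odd */
    /* Order the sequence bump before the payload stores that follow. */
    atomic_thread_fence(memory_order_release);
}

static void demo_write_end(struct demo_seqlock *sl)
{
    unsigned s = atomic_load_explicit(&sl->seq, memory_order_relaxed);
    assert(s & 1u); /* must pair with demo_write_begin() */
    /* Release: payload stores become visible before the even sequence. */
    atomic_store_explicit(&sl->seq, s + 1, memory_order_release);
}

static void demo_write(struct demo_seqlock *sl, uint64_t v)
{
    demo_write_begin(sl);
    atomic_store_explicit(&sl->a, v, memory_order_relaxed);
    atomic_store_explicit(&sl->b, v, memory_order_relaxed);
    demo_write_end(sl);
}

/*
 * One read attempt. Returns true only if no write was in progress at the
 * start and the sequence did not change while the payload was copied.
 * On false the outputs are unusable and the caller has to retry.
 */
static bool demo_read_try(struct demo_seqlock *sl, uint64_t *a, uint64_t *b)
{
    unsigned s0 = atomic_load_explicit(&sl->seq, memory_order_acquire);
    if (s0 & 1u)
        return false; /* writer active */
    *a = atomic_load_explicit(&sl->a, memory_order_relaxed);
    *b = atomic_load_explicit(&sl->b, memory_order_relaxed);
    /* Order the payload loads before re-checking the sequence. */
    atomic_thread_fence(memory_order_acquire);
    unsigned s1 = atomic_load_explicit(&sl->seq, memory_order_relaxed);
    return s0 == s1;
}
```

A reader would normally call `demo_read_try()` in a loop until it returns true. Nothing in that loop bounds the number of retries, so how long a reader spins depends entirely on how often and for how long writers hold the sequence odd. Whether that is what actually happens on these machines is still unproven.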