Bug 200957

Summary: boot stalls on several old dual core Intel CPUs
Product: Timers
Reporter: Viktor Jägersküpper (viktor_jaegerskuepper)
Component: Other
Assignee: john stultz (john.stultz)
Status: RESOLVED CODE_FIX
Severity: normal
CC: diego.viola, feng.tang, frame, hi-angel, peterz, stefan.jensen
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.18
Subsystem:
Regression: No
Bisected commit-id:
Attachments: /proc/version
             .config used for 4.19-rc1

Description Viktor Jägersküpper 2018-08-28 15:06:51 UTC
Several Arch Linux users (including me) reported that since kernel 4.18 the boot process stalls on setups that worked with earlier kernel versions. The stall happens very early, so nobody was able to capture dmesg output. Several users took photos instead, and the kernel output was the same in each case except for some numbers, which may be irrelevant for you.

If I boot without "quiet" and with "debug", I see this:

(These five lines are from the init process of my initramfs:)
:: running early hook [udev]
starting version 239
:: running early hook [lvm2]
:: running hook [udev]
:: Triggering uevents...

(debug output with device info)

(And finally after about 60 seconds:)

INFO: rcu_preempt detected stalls on CPUs/tasks:
o1-...!: (0 ticks this GP) idle=cf0/0/0 softirq=90/90 fqs=0 last_accelerate: e833/e840, nonlazy_posted: 9306, ..
o(detected by 0, t=18062 jiffies, g=-244, c=-245, q=5)
Sending NMI from CPU0 to CPUs 1:
NMI backtrace for cpu1 skipped: idling at acpi_idle_do_entry+0x15/0x40
rcu_preempt kthread starved for 18062 jiffies! g18446744073709551372 c18446744073709551371 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1
RCU grace-period kthread stack dump:
rcu_preempt	I	0	10	2 0x80000000
Call Trace:
 ? __schedule+0x29b/0x8b0
 schedule+0x32/0x90
 schedule_timeout+0x1d1/0x4a0
 ? collect_expired_timers+0xa0/0xa0
 rcu_gp_kthread+0x43e/0x950
 ? synchronize_rcu_expedited+0x30/0x30
 kthread+0x112/0x130
 ? kthread_flush_work_fn+0x10/0x10
 ret_from_fork+0x35/0x40

Two other users confirmed that they got the same call trace.
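
A note on reading the numbers in the trace above: the grace-period counters appear twice, once as g=-244, c=-245 and once as g18446744073709551372, c18446744073709551371. These look like the same 64-bit values printed signed and unsigned. The helper name as_signed64 below is made up for this illustration, it is not a kernel function; a minimal sketch of the conversion:

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed (two's complement) one."""
    value &= (1 << 64) - 1          # keep only the low 64 bits
    if value >= (1 << 63):          # top bit set means a negative number
        return value - (1 << 64)
    return value


# Counter values copied from the "kthread starved" line of the trace.
print(as_signed64(18446744073709551372))  # -244
print(as_signed64(18446744073709551371))  # -245
```

So both lines report the same grace-period state; the differing numbers between users' photos are likely just counters like these.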

I was able to boot successfully with "acpi=off", with "nosmp"/"maxcpus=0", or with "maxcpus=1". I used the last option several times and activated the second core after boot. Several other users could also boot with "nosmp" or with other parameters.
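
For anyone who wants to reproduce the "maxcpus=1" workaround: the second core can be brought online after boot through the CPU hotplug file in sysfs. The following is only a sketch, assuming the usual /sys/devices/system/cpu/cpuN/online interface is present and the script runs as root; the function name set_cpu_online and the sysfs_root parameter are chosen for this example.

```python
from pathlib import Path


def set_cpu_online(cpu: int, online: bool = True,
                   sysfs_root: str = "/sys/devices/system/cpu") -> bool:
    """Write 1 or 0 to cpuN/online and return the state read back afterwards."""
    control = Path(sysfs_root) / f"cpu{cpu}" / "online"
    control.write_text("1" if online else "0")
    return control.read_text().strip() == "1"


# Example (run as root on the affected machine):
#   set_cpu_online(1)   # bring the second core online
```

The sysfs_root parameter exists only so the function can be tried against a scratch directory before touching the real system.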

One other user and I used "git bisect" and we both got

7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread

as the first bad commit, which has been included since 4.18-rc1. All 4.18 releases up to 4.18.5 are affected, and I also tested 4.19-rc1 and found that it is affected as well.

Judging from what other users wrote, the issue may be restricted to old Intel Core 2 Duo CPUs and other CPUs of the same microarchitecture, including Pentium models. I chose "x86-64" as Hardware because it is the only architecture that Arch Linux officially supports, so I assume that all affected Arch Linux users have it.

Here is the link to the forum thread:
https://bbs.archlinux.org/viewtopic.php?id=239672
Comment 1 Siegfried Metz 2018-08-28 15:23:37 UTC
Same bug as https://bugzilla.kernel.org/show_bug.cgi?id=200959; this one was reported minutes before mine.
Comment 2 Viktor Jägersküpper 2018-08-28 16:04:32 UTC
Created attachment 278169 [details]
/proc/version
Comment 3 Viktor Jägersküpper 2018-08-28 16:07:14 UTC
Created attachment 278171 [details]
.config used for 4.19-rc1

produced with "make localmodconfig" while using the Arch Linux 4.18.5 kernel
Comment 4 john stultz 2018-08-29 18:43:37 UTC
Please send mail to lkml <linux-kernel@vger.kernel.org> reporting this issue (add a "REGRESSION:" prefix to the subject) and point out the bisected commit.

Also please cc (from the identified commit):
 Peter Zijlstra (Intel) <peterz@infradead.org>
 Thomas Gleixner <tglx@linutronix.de>
 Rafael J. Wysocki <rafael.j.wysocki@intel.com>
 len.brown@intel.com
 rjw@rjwysocki.net
 diego.viola@gmail.com
 rui.zhang@intel.com

thanks
Comment 5 john stultz 2018-08-29 18:44:42 UTC
*** Bug 200959 has been marked as a duplicate of this bug. ***
Comment 6 Viktor Jägersküpper 2018-09-21 19:25:34 UTC
This is fixed for the mainline kernel in commit e2c631ba75a7e727e8db0a9d30a06bfd434adb3a and for the 4.18.y kernel in commit 51d34e94c4701f125907c026272870790a37c4a1.