Bug 151671 - [REGRESSION][BISECTED] AMD 700: CPU stalls on boot unless a key is pressed
Status: RESOLVED DUPLICATE of bug 15289
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-08-06 23:09 UTC by Dainius Masiliūnas
Modified: 2016-08-07 22:08 UTC
CC List: 0 users

See Also:
Kernel Version: 4.7.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg, pressed keyboard key (52.50 KB, text/plain)
2016-08-06 23:09 UTC, Dainius Masiliūnas
dmesg, pressed power button (rcu stall) (62.30 KB, text/plain)
2016-08-06 23:13 UTC, Dainius Masiliūnas
git bisect log (7.81 KB, text/plain)
2016-08-07 21:14 UTC, Dainius Masiliūnas

Description Dainius Masiliūnas 2016-08-06 23:09:41 UTC
Created attachment 227841
dmesg, pressed keyboard key

A regression introduced somewhere between kernels 4.1.15 and 4.4.6 causes the kernel to stall during boot until some event occurs, such as plugging in a USB keyboard, pressing a key on it, or briefly pressing the power button. Without such an event, the kernel may stall indefinitely. Sometimes it also stalls before USB is initialised. Kernel 4.7.0 is affected as well.

The motherboard that this issue occurs on is a Gigabyte GA-MA770-UD3, AMD 700 chipset.

I will try to bisect this regression (it appears to be always reproducible). Two dmesg outputs are attached: one from a boot where I pressed a key after two minutes (the usual case), and one where the kernel stalled before initialising USB, so I pressed the power button after two minutes (the rare case).
Comment 1 Dainius Masiliūnas 2016-08-06 23:13:34 UTC
Created attachment 227851
dmesg, pressed power button (rcu stall)

When I pressed the power button, the kernel produced an RCU stall message and a stack trace. Part of it is pasted here for easier reference:

[  122.416580] 	3-...: (1 ticks this GP) idle=451/1/0 softirq=11/11 fqs=1 
[  122.416580] 	
[  122.416582] (detected by 1, t=36412 jiffies, g=-292, c=-293, q=34)
[  122.416583] Task dump for CPU 3:
[  122.416584] swapper/3       R
[  122.416584]   running task    
[  122.416585]     0     0      1 0x00000000
[  122.416586]  0000000000000003
[  122.416586]  0000000041f6643c
[  122.416587]  0000000000000000
[  122.416587]  0000000000000380

[  122.416588]  ffffffffffffffcf
[  122.416588]  ffffffff8100e7c0
[  122.416588]  0000000000000010
[  122.416588]  0000000000000282

[  122.416589]  ffff88012b15bed0
[  122.416589]  0000000000000018
[  122.416589]  0000000000000003
[  122.416590]  ffff88012b15c000

[  122.416590] Call Trace:
[  122.416594]  [<ffffffff8100e7c0>] ? default_idle+0x10/0xc0
[  122.416596]  [<ffffffff8100e8a0>] ? amd_e400_idle+0x30/0xc0
[  122.416598]  [<ffffffff8108ddb6>] ? cpu_startup_entry+0x2c6/0x330
[  122.416601]  [<ffffffff81032eda>] ? start_secondary+0x12a/0x130
[  122.416603] rcu_sched kthread starved for 36410 jiffies! g18446744073709551324 c18446744073709551323 f0x0 s3 ->state=0x0
[  122.416750] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x288b24a829f, max_idle_ns: 440795236536 ns
[  122.416768] ehci-pci 0000:00:13.2: USB 2.0 started, EHCI 1.00
[  122.417002] hub 2-0:1.0: USB hub found

[  122.417008] 	3-...: (1 GPs behind) idle=451/1/0 softirq=11/11 fqs=0 
[  122.417008] 	
[  122.417010]  (t=0 jiffies g=-291 c=-292 q=35)
[  122.417011] Task dump for CPU 0:
[  122.417013] kworker/0:1     R  running task        0    33      2 0x00000000
[  122.417019] Workqueue: events console_callback
[  122.417019] hub 2-0:1.0: 6 ports detected

[  122.417022]  ffffffff00000001 ffff8800cb620010 ffffffff00000002 ffff000100000000
[  122.417023]  ffff88012b1d8000 ffff000100000000 0000027000000000 ffff88012b030000
[  122.417024]  ffff88012b347400 ffff8800cb633552 ffffffff81406a00 0000000000000027
[  122.417025] Call Trace:
[  122.417028]  [<ffffffff81402d01>] ? set_con2fb_map+0x321/0x380
[  122.417031]  [<ffffffff81406450>] ? bit_clear+0xf0/0xf0
[  122.417033]  [<ffffffff813f0861>] ? vt_console_print+0x3b1/0x400
[  122.417036]  [<ffffffff810a0449>] ? call_console_drivers.constprop.28+0xf9/0x100
[  122.417037]  [<ffffffff810a0ea5>] ? console_unlock+0x2d5/0x4a0
[  122.417038]  [<ffffffff813f2e75>] ? console_callback+0xa5/0x160
[  122.417040]  [<ffffffff8106b2ea>] ? process_one_work+0x13a/0x3f0
[  122.417042]  [<ffffffff8106b895>] ? worker_thread+0x45/0x420
[  122.417043]  [<ffffffff8106b850>] ? rescuer_thread+0x2b0/0x2b0
[  122.417044]  [<ffffffff8106b850>] ? rescuer_thread+0x2b0/0x2b0
[  122.417045]  [<ffffffff8106ffd8>] ? kthread+0xb8/0xd0
[  122.417046]  [<ffffffff8106ff20>] ? kthread_park+0x50/0x50
[  122.417049]  [<ffffffff818652df>] ? ret_from_fork+0x3f/0x70
[  122.417050]  [<ffffffff8106ff20>] ? kthread_park+0x50/0x50
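
To make the first trace easier to read, here is a minimal Python sketch that pulls the symbol names out of call-trace lines in the format shown above. The names `FRAME_RE`, `parse_frames` and `SAMPLE` are made up for illustration, and the sample is just the four frames from the CPU 3 task dump. A leading `?` marks a frame the kernel's unwinder could not verify, so the fourth tuple field records that.

```python
import re

# Matches frames such as:
#   [  122.416596]  [<ffffffff8100e8a0>] ? amd_e400_idle+0x30/0xc0
# Groups: address, optional "?" marker, symbol, offset, function size.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{16})>\]\s+(\?\s+)?([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(dmesg_text):
    """Return (symbol, offset, size, unreliable) for every call-trace frame."""
    frames = []
    for line in dmesg_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append(
                (m.group(3), int(m.group(4), 16), int(m.group(5), 16),
                 m.group(2) is not None)
            )
    return frames


# The four frames from the CPU 3 task dump, copied from the trace above.
SAMPLE = """\
[  122.416594]  [<ffffffff8100e7c0>] ? default_idle+0x10/0xc0
[  122.416596]  [<ffffffff8100e8a0>] ? amd_e400_idle+0x30/0xc0
[  122.416598]  [<ffffffff8108ddb6>] ? cpu_startup_entry+0x2c6/0x330
[  122.416601]  [<ffffffff81032eda>] ? start_secondary+0x12a/0x130
"""

if __name__ == "__main__":
    for symbol, offset, size, unreliable in parse_frames(SAMPLE):
        print(symbol, hex(offset), hex(size), "(unreliable)" if unreliable else "")
```

The stalled CPU is sitting in `amd_e400_idle`, the idle routine the kernel uses on AMD CPUs affected by erratum 400.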
Comment 2 Dainius Masiliūnas 2016-08-07 21:14:36 UTC
Created attachment 227891
git bisect log

Bisect complete. Unfortunately, no single commit could be identified, because several of the candidates resulted in a kernel panic (or something similar: the kernel showed a never-ending flood of messages that seemed to contain a stack trace) and had to be skipped. The offending commit is thus one of:

# only skipped commits left to test
# possible first bad commit: [407a2c720556e8e340e06f6a7174f5d6d80cf9ea] Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# possible first bad commit: [3a95398f54cbd664c749fe9f1bfc7e7dbace92d0] Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# possible first bad commit: [43224b96af3154cedd7220f7b90094905f07ac78] Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# possible first bad commit: [d70b3ef54ceaf1c7c92209f5a662a670d04cbed9] Merge branch 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# possible first bad commit: [7ef3d7d58d9dc73ee3d4f8f56d0024c8cca8163f] Merge branches 'x86/apic', 'x86/asm', 'x86/mm' and 'x86/platform' into x86/core, to merge last updates
# possible first bad commit: [cb17b2a674f2059343f997599b4b001e64eec516] x86/hpet: Use proper hpet device number for MSI allocation
# possible first bad commit: [bafac298fb20e9ae1305c710d4fd8d20c5911afa] x86/hpet: Check for irq==0 when allocating hpet MSI interrupts
# possible first bad commit: [f6b1464f647424bbeb609ec832428e4079940701] genirq: Prevent crash in irq_move_irq()

Full git bisect log is attached.
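
In case it helps anyone narrowing this down further, here is a rough Python sketch that extracts the remaining candidates from a bisect log and separates the merge commits from the ordinary ones. The function names are made up for illustration, and the sample holds only five of the eight lines quoted above. It assumes the `# possible first bad commit: [sha] subject` line format shown in this comment.

```python
import re

# Line format as it appears in the bisect log quoted above.
CANDIDATE_RE = re.compile(r"^# possible first bad commit: \[([0-9a-f]{7,40})\] (.*)$")


def bisect_candidates(log_text):
    """Return (sha, subject) pairs for every remaining candidate commit."""
    found = []
    for line in log_text.splitlines():
        m = CANDIDATE_RE.match(line.strip())
        if m:
            found.append((m.group(1), m.group(2)))
    return found


def non_merge(candidates):
    """Drop merge commits, identified here only by their subject prefix."""
    return [(sha, subj) for sha, subj in candidates if not subj.startswith("Merge ")]


SAMPLE_LOG = """\
# only skipped commits left to test
# possible first bad commit: [407a2c720556e8e340e06f6a7174f5d6d80cf9ea] Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# possible first bad commit: [3a95398f54cbd664c749fe9f1bfc7e7dbace92d0] Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# possible first bad commit: [cb17b2a674f2059343f997599b4b001e64eec516] x86/hpet: Use proper hpet device number for MSI allocation
# possible first bad commit: [bafac298fb20e9ae1305c710d4fd8d20c5911afa] x86/hpet: Check for irq==0 when allocating hpet MSI interrupts
# possible first bad commit: [f6b1464f647424bbeb609ec832428e4079940701] genirq: Prevent crash in irq_move_irq()
"""

if __name__ == "__main__":
    for sha, subject in non_merge(bisect_candidates(SAMPLE_LOG)):
        print(sha[:12], subject)
```

Filtering by subject prefix is only a heuristic. Checking the parent count with git would be more reliable.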
Comment 3 Dainius Masiliūnas 2016-08-07 22:08:10 UTC
The true culprit is C1E. Setting it to disabled or auto in the BIOS makes the kernel boot successfully every time. As such, this is a duplicate of the (still unsolved) bug #15289.
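
For anyone who wants to check from a running system whether the firmware has left C1E enabled, the following Python sketch tests the same MSR bits that the kernel's AMD E400 idle code looks at. The constant names and values are the ones used in the Linux kernel headers. The helper names `c1e_active` and `read_msr` are made up for illustration. Reading the register is assumed to require root and the `msr` kernel module, and I have not tested this on the affected board.

```python
import os
import struct

# Constants as named in the Linux kernel's x86 MSR headers.
MSR_K8_INT_PENDING_MSG = 0xC0010055
K8_INTP_C1E_ACTIVE_MASK = 0x18000000  # bits 27 and 28


def c1e_active(msr_value):
    """True if either C1E-related bit is set in the Interrupt Pending MSR."""
    return bool(msr_value & K8_INTP_C1E_ACTIVE_MASK)


def read_msr(cpu, reg):
    """Read a 64-bit MSR via /dev/cpu/<cpu>/msr (assumes root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The register number is used as the file offset.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    value = read_msr(0, MSR_K8_INT_PENDING_MSG)
    print("C1E active on CPU 0:", c1e_active(value))
```

If this reports C1E as active with the BIOS option enabled and inactive with it disabled, that would match the behaviour described in this comment.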

*** This bug has been marked as a duplicate of bug 15289 ***
