[ 1.001112] ------------[ cut here ]------------
[ 1.001119] WARNING: CPU: 4 PID: 1 at kernel/smp.c:277 smp_call_function_single+0x115/0x130
[ 1.001123] Modules linked in:
[ 1.001127] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 4.9.127-gentoo #6
[ 1.001130] Hardware name: HP HP EliteBook 735 G5/83DA, BIOS Q81 Ver. 01.03.01 07/26/2018
[ 1.001135] 0000000000000000 ffffffff83f48893 0000000000000000 0000000000000000
[ 1.001140] ffffffff83c79624 0000000000000004 ffffffff83f77df0 ffffffff846773e0
[ 1.001145] 0000000000000001 00000000c0002003 000000000000fd40 ffffffff83cfde85
[ 1.001150] Call Trace:
[ 1.001154] <IRQ>
[ 1.001158] [<ffffffff83f48893>] ? dump_stack+0x5c/0x79
[ 1.001164] [<ffffffff83c79624>] ? __warn+0xd4/0xf0
[ 1.001168] [<ffffffff83f77df0>] ? __wrmsr_on_cpu+0x40/0x40
[ 1.001172] [<ffffffff83cfde85>] ? smp_call_function_single+0x115/0x130
[ 1.001175] [<ffffffff83f780db>] ? rdmsr_safe_on_cpu+0x5b/0x90
[ 1.001180] [<ffffffff83c40ae9>] ? get_block_address.isra.1+0x89/0x100
[ 1.001183] [<ffffffff83c40bf5>] ? amd_threshold_interrupt+0x95/0x140
[ 1.001188] [<ffffffff8422e48a>] ? smp_threshold_interrupt+0x2a/0x50
[ 1.001192] [<ffffffff8422d526>] ? threshold_interrupt+0x96/0xa0
[ 1.001195] [<ffffffff8422eef1>] ? __do_softirq+0x71/0x2be
[ 1.001199] [<ffffffff83c7fad4>] ? irq_exit+0xb4/0xc0
[ 1.001202] [<ffffffff8422ea0c>] ? smp_apic_timer_interrupt+0x4c/0x60
[ 1.001205] [<ffffffff8422d2a6>] ? apic_timer_interrupt+0x96/0xa0
[ 1.001208] <EOI>
[ 1.001211] [<ffffffff83de856e>] ? kmem_cache_alloc+0x8e/0x500
[ 1.001217] [<ffffffff83e27816>] ? alloc_inode+0x66/0x80
[ 1.001220] [<ffffffff83e297ec>] ? new_inode_pseudo+0xc/0x60
[ 1.001223] [<ffffffff83e29855>] ? new_inode+0x15/0x30
[ 1.001227] [<ffffffff83e9a042>] ? start_creating+0x62/0xf0
[ 1.001230] [<ffffffff83e99f3f>] ? tracefs_get_inode+0xf/0x50
[ 1.001233] [<ffffffff83e9a4b6>] ? tracefs_create_file+0x46/0x140
[ 1.001237] [<ffffffff83d41a6d>] ? trace_create_file+0xd/0x30
[ 1.001241] [<ffffffff83d4e04c>] ? event_create_dir+0xec/0x4b0
[ 1.001246] [<ffffffff8494085d>] ? set_debug_rodata+0xc/0xc
[ 1.001250] [<ffffffff8496a3fe>] ? event_trace_init+0x241/0x2b3
[ 1.001254] [<ffffffff8496a1bd>] ? setup_trace_event+0x28/0x28
[ 1.001258] [<ffffffff83c0218e>] ? do_one_initcall+0x4e/0x170
[ 1.001261] [<ffffffff849410c9>] ? kernel_init_freeable+0x174/0x1fb
[ 1.001265] [<ffffffff8421e0f0>] ? rest_init+0x80/0x80
[ 1.001268] [<ffffffff8421e0fa>] ? kernel_init+0xa/0x100
[ 1.001271] [<ffffffff8422b624>] ? ret_from_fork+0x44/0x70
[ 1.001288] ---[ end trace af9cce55a4efe612 ]---
Created attachment 278735 [details] dmesg
Confirmed that this only happens when AMDGPU support is built, either as a module or in-kernel.
How are you able to get any output? I have a 745 G5 (same BIOS version) and can't get any kernel newer than the 4.9.0-8-amd64 kernel shipped with Debian stretch to give me any text output. It just hangs after the GRUB messages about loading the kernel and initrd, with no further output. I've tried removing quiet and adding things like iommu=soft, nomodeset, noacpi, and noapic; nothing seems to work. Yet certain older kernels boot fine without any extra parameters. I suspect there's a segfault or hang in the ACPI parsing code, but without any output and with no serial port, I can't even begin to figure out how to debug it. Yet here you are posting output from a nearly identical machine. What sorcery is this? The internet claims previous BIOS versions worked fine with this laptop, but I can't downgrade my BIOS for some reason, and this is a new laptop that came with the new BIOS version pre-installed. I'm currently bisecting Linus's tree to see if I can get any insights.
By the way, I have finally been able to get a kernel (4.18.x) to work in its entirety. I'll share the kernel config; hopefully it works for you.
Created attachment 279001 [details]
Working Kernel Config

Please note that this only works with Bios Rev 1.04 and still produces the original error when using newer BIOS
In my case you have to add mce=off to the command line to get it to boot with the newer BIOS, along with iommu=soft or iommu=pt. I bisected the issue to a 4.10 commit about MCE and opened another bug here. I'm on mobile and can't look it up right now. However, I doubt it'll get any attention unless I post on LKML, and I'm too busy at the moment to do that.
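A minimal sketch for checking whether a kernel command line carries the workaround described above. It assumes the workaround means mce=off together with either iommu=soft or iommu=pt; the function names and the sample command line are hypothetical, and the simple whitespace split does not handle quoted values.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None.

    Assumption: parameters are separated by whitespace and contain no
    quoted values with embedded spaces.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def has_workaround(cmdline: str) -> bool:
    """True if mce=off is present along with iommu=soft or iommu=pt."""
    params = parse_cmdline(cmdline)
    return params.get("mce") == "off" and params.get("iommu") in ("soft", "pt")


# Hypothetical command line for illustration; on a running Linux system
# the real one can be read from /proc/cmdline.
sample = "root=/dev/sda2 ro mce=off iommu=pt"
print(has_workaround(sample))
```

On a booted system, passing the contents of /proc/cmdline to has_workaround() shows whether the parameters actually reached the kernel, which helps rule out a bootloader configuration mistake.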
Thanks, marking it as a duplicate bug.
*** This bug has been marked as a duplicate of bug 201291 ***
Hi Amit,
First of all, there is no segfault at all. What you see is a simple WARNING coming from smp_call_function_single().
Secondly, this is not a duplicate of bug 201291:
* This report is about a WARNING in smp_call_function_single()
* Bug 201291 is about a NULL pointer dereference in amd_threshold_interrupt()
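To make the distinction concrete, here is a rough sketch that tells a WARNING header apart from a NULL pointer dereference report in dmesg output. The regular expression is modelled on the WARNING line in the trace at the top of this report; the substring used for the NULL dereference case is an assumption about how kernels of this era word that message, and the function names are hypothetical.

```python
import re

# Modelled on the header in the trace from this report:
# "WARNING: CPU: 4 PID: 1 at kernel/smp.c:277 smp_call_function_single+0x115/0x130"
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<func>[\w.]+)\+"
)


def classify(line: str) -> str:
    """Return 'warning', 'null-deref', or 'other' for one dmesg line."""
    if WARNING_RE.search(line):
        return "warning"
    # Assumption: the oops text contains this phrase.
    if "NULL pointer dereference" in line:
        return "null-deref"
    return "other"


def warning_site(line: str):
    """Return (source file, line number, function) for a WARNING header, else None."""
    m = WARNING_RE.search(line)
    if m is None:
        return None
    return m["file"], int(m["line"]), m["func"]


header = ("[ 1.001119] WARNING: CPU: 4 PID: 1 at kernel/smp.c:277 "
          "smp_call_function_single+0x115/0x130")
print(classify(header), warning_site(header))
```

A WARNING leaves the kernel running, while a NULL pointer dereference is an oops, so sorting log lines this way is a quick first check before deciding whether two reports describe the same problem.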
Hi Rafal, You're absolutely correct in pointing this out. I conflated two issues (warning in smp_call_function_single w/o AMDGPU and segfault with AMDGPU without mce=off). Should I resolve this issue as INVALID?
I noticed this WARNING doesn't occur with recent kernels anymore. Of course, in order to test new kernels reliably I had to boot with MCE enabled plus the fix patch:
[PATCH] x86/MCE/AMD: Fix the thresholding machinery initialization order
I did some basic bisecting and testing and found out that the WARNING has been fixed by:
commit 17ef4af0ec0f97b369f304dc04d61722f3591c4b
Author: Yazen Ghannam <yazen.ghannam@amd.com>
Date: Tue Jun 13 18:28:29 2017 +0200
x86/mce/AMD: Use saved threshold block info in interrupt handler
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=17ef4af0ec0f97b369f304dc04d61722f3591c4b
which is present in kernel 4.13 and newer.
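As a rough heuristic based on the statement above, a kernel release string can be compared against 4.13 to guess whether the fix is included. This is only a sketch: the helper names are hypothetical, and a version check cannot tell whether a distribution or stable kernel older than 4.13 has backported the commit.

```python
import re

# First mainline release containing commit 17ef4af0ec0f, per the comment above.
FIX_VERSION = (4, 13)


def parse_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '4.9.127-gentoo'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def has_fix(release: str) -> bool:
    """Heuristic: True if the release is 4.13 or newer (ignores backports)."""
    return parse_release(release) >= FIX_VERSION


print(has_fix("4.9.127-gentoo"))  # False: the kernel from the original trace
print(has_fix("4.18.13"))         # True
```

On a running system the release string is what `uname -r` prints, or `platform.release()` in Python.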
(In reply to Amit Prakash Ambasta from comment #5)
> Created attachment 279001 [details]
> Working Kernel Config
>
> Please note that this only works with Bios Rev 1.04 and still produces the
> original error when using newer BIOS

Amit: do you by any chance have a dmesg from that 4.18.13 kernel? I compiled 4.18.13 with your config plus CONFIG_EFI_STUB=y and I still wasn't able to boot it (early MCE error + kernel crash). It may be because of my BIOS 1.03.01, but I'm still curious to compare our dmesg outputs.
Hi Rafal, Can you try CONFIG_AMDGPU=N and see if that works? I'll try to rebuild the 4.18.13 kernel and see if I can reproduce the issue.
This was the bug with CONFIG_AMDGPU=y, IIRC (where I couldn't get any output without early_printk and video=efifb): https://bugzilla.kernel.org/show_bug.cgi?id=201215
I've tried 4.18.13 with the discussed config and CONFIG_AMDGPU disabled, but I was still getting early CPU errors.