Bug 201213

Summary: WARNING in the smp_call_function_single() on Ryzen 2500U
Product: Platform Specific/Hardware
Reporter: Amit Prakash Ambasta (amit.prakash.ambasta)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED CODE_FIX
Severity: normal
CC: clemej, summercurrants, zajec5
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.9.127
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg, Working Kernel Config

Description Amit Prakash Ambasta 2018-09-24 08:05:10 UTC
[    1.001112] ------------[ cut here ]------------
[    1.001119] WARNING: CPU: 4 PID: 1 at kernel/smp.c:277 smp_call_function_single+0x115/0x130
[    1.001123] Modules linked in:
[    1.001127] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 4.9.127-gentoo #6
[    1.001130] Hardware name: HP HP EliteBook 735 G5/83DA, BIOS Q81 Ver. 01.03.01 07/26/2018
[    1.001135]  0000000000000000 ffffffff83f48893 0000000000000000 0000000000000000
[    1.001140]  ffffffff83c79624 0000000000000004 ffffffff83f77df0 ffffffff846773e0
[    1.001145]  0000000000000001 00000000c0002003 000000000000fd40 ffffffff83cfde85
[    1.001150] Call Trace:
[    1.001154]  <IRQ> 
[    1.001158]  [<ffffffff83f48893>] ? dump_stack+0x5c/0x79
[    1.001164]  [<ffffffff83c79624>] ? __warn+0xd4/0xf0
[    1.001168]  [<ffffffff83f77df0>] ? __wrmsr_on_cpu+0x40/0x40
[    1.001172]  [<ffffffff83cfde85>] ? smp_call_function_single+0x115/0x130
[    1.001175]  [<ffffffff83f780db>] ? rdmsr_safe_on_cpu+0x5b/0x90
[    1.001180]  [<ffffffff83c40ae9>] ? get_block_address.isra.1+0x89/0x100
[    1.001183]  [<ffffffff83c40bf5>] ? amd_threshold_interrupt+0x95/0x140
[    1.001188]  [<ffffffff8422e48a>] ? smp_threshold_interrupt+0x2a/0x50
[    1.001192]  [<ffffffff8422d526>] ? threshold_interrupt+0x96/0xa0
[    1.001195]  [<ffffffff8422eef1>] ? __do_softirq+0x71/0x2be
[    1.001199]  [<ffffffff83c7fad4>] ? irq_exit+0xb4/0xc0
[    1.001202]  [<ffffffff8422ea0c>] ? smp_apic_timer_interrupt+0x4c/0x60
[    1.001205]  [<ffffffff8422d2a6>] ? apic_timer_interrupt+0x96/0xa0
[    1.001208]  <EOI> 
[    1.001211]  [<ffffffff83de856e>] ? kmem_cache_alloc+0x8e/0x500
[    1.001217]  [<ffffffff83e27816>] ? alloc_inode+0x66/0x80
[    1.001220]  [<ffffffff83e297ec>] ? new_inode_pseudo+0xc/0x60
[    1.001223]  [<ffffffff83e29855>] ? new_inode+0x15/0x30
[    1.001227]  [<ffffffff83e9a042>] ? start_creating+0x62/0xf0
[    1.001230]  [<ffffffff83e99f3f>] ? tracefs_get_inode+0xf/0x50
[    1.001233]  [<ffffffff83e9a4b6>] ? tracefs_create_file+0x46/0x140
[    1.001237]  [<ffffffff83d41a6d>] ? trace_create_file+0xd/0x30
[    1.001241]  [<ffffffff83d4e04c>] ? event_create_dir+0xec/0x4b0
[    1.001246]  [<ffffffff8494085d>] ? set_debug_rodata+0xc/0xc
[    1.001250]  [<ffffffff8496a3fe>] ? event_trace_init+0x241/0x2b3
[    1.001254]  [<ffffffff8496a1bd>] ? setup_trace_event+0x28/0x28
[    1.001258]  [<ffffffff83c0218e>] ? do_one_initcall+0x4e/0x170
[    1.001261]  [<ffffffff849410c9>] ? kernel_init_freeable+0x174/0x1fb
[    1.001265]  [<ffffffff8421e0f0>] ? rest_init+0x80/0x80
[    1.001268]  [<ffffffff8421e0fa>] ? kernel_init+0xa/0x100
[    1.001271]  [<ffffffff8422b624>] ? ret_from_fork+0x44/0x70
[    1.001288] ---[ end trace af9cce55a4efe612 ]---
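The frames listed between the <IRQ> and <EOI> markers show that smp_call_function_single() was reached from the threshold interrupt handler. In mainline sources of that era, the check at this location is a WARN_ON_ONCE that fires when the function is called with interrupts disabled on an online CPU; that reading is an assumption, since the report does not quote the source. The sketch below pulls the interrupt-context frames out of a 4.9-style trace. The names irq_frames, FRAME_RE and SAMPLE are illustrative, not part of any kernel tooling.

```python
import re

# Matches 4.9-style trace lines such as
# "[    1.001172]  [<ffffffff83cfde85>] ? smp_call_function_single+0x115/0x130"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def irq_frames(log_text):
    """Return the function names listed between the <IRQ> and <EOI> markers."""
    frames = []
    in_irq = False
    for line in log_text.splitlines():
        if "<IRQ>" in line:
            in_irq = True
            continue
        if "<EOI>" in line:
            break
        if in_irq:
            match = FRAME_RE.search(line)
            if match:
                frames.append(match.group(1))
    return frames


# A shortened copy of the trace above, used only as sample input.
SAMPLE = """\
[    1.001154]  <IRQ>
[    1.001164]  [<ffffffff83c79624>] ? __warn+0xd4/0xf0
[    1.001172]  [<ffffffff83cfde85>] ? smp_call_function_single+0x115/0x130
[    1.001175]  [<ffffffff83f780db>] ? rdmsr_safe_on_cpu+0x5b/0x90
[    1.001180]  [<ffffffff83c40ae9>] ? get_block_address.isra.1+0x89/0x100
[    1.001183]  [<ffffffff83c40bf5>] ? amd_threshold_interrupt+0x95/0x140
[    1.001208]  <EOI>
[    1.001211]  [<ffffffff83de856e>] ? kmem_cache_alloc+0x8e/0x500
"""
```

Calling irq_frames(SAMPLE) returns the frames in the order the log prints them (innermost first), which makes the path from amd_threshold_interrupt through rdmsr_safe_on_cpu to smp_call_function_single easy to see.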
Comment 1 Amit Prakash Ambasta 2018-09-24 10:41:41 UTC
Created attachment 278735 [details]
dmesg
Comment 2 Amit Prakash Ambasta 2018-09-26 06:16:13 UTC
Confirmed that this only happens when AMDGPU support is built, either as a module or in-kernel.
Comment 3 clemej 2018-09-29 04:22:38 UTC
How are you able to get any output? I have a 745 G5 (same BIOS version) and can't get any kernel newer than the 4.9.0-8-amd64 kernel shipped with Debian stretch to give me any text output. It just hangs after the GRUB messages about loading the kernel and initrd, with no further output. I've tried removing quiet and adding things like iommu=soft, nomodeset, noacpi and noapic; nothing seems to work. Yet certain older kernels boot fine without any extra parameters.

I suspect there's a segv/hang in the ACPI parsing code, but without any output and with no serial port, I can't even begin to figure out how to debug it. Yet here you are posting output from a nearly identical machine. What sorcery is this?

The internet claims previous BIOS versions worked fine with this laptop, but I can't downgrade my BIOS for some reason, and this is a new laptop that came with the new BIOS version pre-installed.

I'm currently bisecting Linus's tree to see if I can get any insights.
Comment 4 Amit Prakash Ambasta 2018-09-29 09:51:47 UTC
By the way, I have finally been able to get a kernel (4.18.x) to work in its entirety. I will share the kernel config; hopefully it works for you.
Comment 5 Amit Prakash Ambasta 2018-10-12 07:43:20 UTC
Created attachment 279001 [details]
Working Kernel Config

Please note that this only works with Bios Rev 1.04 and still produces the original error when using newer BIOS
Comment 6 clemej 2018-10-12 12:27:31 UTC
In my case you have to add mce=off to the command line to get it to boot with the newer BIOS, plus iommu=soft or iommu=pt. I bisected the issue to a 4.10 commit about MCE and opened another bug here. I'm on mobile and can't look it up right now. However, I doubt it'll get any attention unless I post on LKML, and I'm too busy at the moment to do that.
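To check whether a booted system actually carries the workaround parameters mentioned in this comment, the kernel command line can be inspected. The following is a minimal sketch; parse_cmdline and has_workaround are hypothetical helper names, and the parser deliberately ignores quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (for example 'quiet') map to None.
    Quoted values containing spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def has_workaround(cmdline):
    """True if mce=off is set together with iommu=soft or iommu=pt."""
    params = parse_cmdline(cmdline)
    return params.get("mce") == "off" and params.get("iommu") in ("soft", "pt")
```

On a running Linux system the input would typically come from /proc/cmdline, for example has_workaround(open("/proc/cmdline").read()).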
Comment 7 Amit Prakash Ambasta 2018-10-12 12:38:28 UTC
Thanks, marking it as a duplicate bug.

*** This bug has been marked as a duplicate of bug 201291 ***
Comment 8 Rafał Miłecki 2018-11-28 09:40:51 UTC
Hi Amit,

First of all, there is no segfault at all. What you see is a simple WARNING coming from smp_call_function_single().

Secondly, this is not a duplicate of bug 201291:
* This report is about a WARNING in smp_call_function_single()
* Bug 201291 is about a NULL pointer dereference in amd_threshold_interrupt()
Comment 9 Amit Prakash Ambasta 2018-11-28 09:54:24 UTC
Hi Rafal,

You're absolutely correct in pointing this out. I conflated two issues (warning in smp_call_function_single w/o AMDGPU and segfault with AMDGPU without mce=off). Should I resolve this issue as INVALID?
Comment 10 Rafał Miłecki 2018-11-28 10:33:09 UTC
I noticed this WARNING doesn't occur with recent kernels anymore. Of course, I had to boot with MCE enabled and apply the fix patch:
[PATCH] x86/MCE/AMD: Fix the thresholding machinery initialization order
in order to test new kernels reliably.

I did some basic bisecting + testing and found out that the WARNING has been fixed by:

commit 17ef4af0ec0f97b369f304dc04d61722f3591c4b
Author: Yazen Ghannam <yazen.ghannam@amd.com>
Date:   Tue Jun 13 18:28:29 2017 +0200

    x86/mce/AMD: Use saved threshold block info in interrupt handler

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=17ef4af0ec0f97b369f304dc04d61722f3591c4b

which is present in kernel 4.13 and newer.
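A quick way to tell whether a given kernel release is at or past the version reported here as containing the fix is a simple version comparison. The sketch below assumes only the (major, minor) part of the release string matters; kernel_version, has_fix and FIX_VERSION are illustrative names. It cannot detect a backport of the commit into an older stable or distribution kernel.

```python
import re

# First mainline release reported in this comment to contain the fix.
FIX_VERSION = (4, 13)


def kernel_version(release):
    """Extract (major, minor) from a release string such as '4.9.127-gentoo'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_fix(release):
    """True if the release is 4.13 or newer (backports are not detected)."""
    return kernel_version(release) >= FIX_VERSION
```

For the running kernel, the release string can be obtained with platform.release() from the Python standard library.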
Comment 11 Rafał Miłecki 2018-12-08 23:57:40 UTC
(In reply to Amit Prakash Ambasta from comment #5)
> Created attachment 279001 [details]
> Working Kernel Config
> 
> Please note that this only works with Bios Rev 1.04 and still produces the
> original error when using newer BIOS

Amit: do you by any chance have a dmesg from that 4.18.13 kernel? I compiled 4.18.13 with your config plus:
CONFIG_EFI_STUB=y
and I still wasn't able to boot it (early MCE error + kernel crash). It may be because of my BIOS 1.03.01, but I'm still curious to compare our dmesg outputs.
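When two builds of the same kernel version behave differently, it can help to confirm exactly which options differ between the two .config files. The sketch below parses the standard .config line forms ("CONFIG_FOO=value" and "# CONFIG_FOO is not set") and reports the differences. read_config and config_diff are hypothetical helper names.

```python
def read_config(text):
    """Parse kernel .config text into {symbol: value}.

    Lines of the form '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def config_diff(a, b):
    """Return {symbol: (value_in_a, value_in_b)} for symbols that differ.

    A symbol missing from one side is reported with None on that side.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Typical use would be config_diff(read_config(open(path_a).read()), read_config(open(path_b).read())), where path_a and path_b are placeholders for the two config files being compared.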
Comment 12 Amit Prakash Ambasta 2018-12-10 07:15:06 UTC
Hi Rafal,

Can you try disabling AMDGPU (CONFIG_DRM_AMDGPU=n) and see if that works? I'll try to rebuild the 4.18.13 kernel and see if I can reproduce the issue.
Comment 13 Amit Prakash Ambasta 2018-12-10 07:16:35 UTC
IIRC, this was the bug with CONFIG_DRM_AMDGPU=y (where I couldn't get any output without earlyprintk and video=efifb): https://bugzilla.kernel.org/show_bug.cgi?id=201215
Comment 14 Rafał Miłecki 2018-12-16 21:58:45 UTC
I've tried 4.18.13 with the discussed config and CONFIG_DRM_AMDGPU disabled, but I was still getting early CPU errors.