Bug 201213
| Summary: | WARNING in the smp_call_function_single() on Ryzen 2500U | | |
|---|---|---|---|
| Product: | Platform Specific/Hardware | Reporter: | Amit Prakash Ambasta (amit.prakash.ambasta) |
| Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | clemej, summercurrants, zajec5 |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 4.9.127 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | dmesg, Working Kernel Config | | |
Description
Amit Prakash Ambasta
2018-09-24 08:05:10 UTC
Created attachment 278735 [details]
dmesg
Confirmed that this only happens when AMDGPU support is built as a module or in-kernel.

How are you able to get any output? I have a 745 G5 (same BIOS version) and can't get any kernel newer than the 4.9.0-8-amd64 kernel shipped with debian sarge to give me any text output. It just hangs after the GRUB messages saying loading kernel and initrd, with no further output. I'm removing quiet, adding things like iommu=soft, nomodeset, noacpi, noapic... nothing seems to work. Yet certain older kernels do boot fine without any extra parameters. I suspect there's a segv/hang in the ACPI parsing code, but without any output, and no serial port, I can't even begin to figure out how to debug. Yet here you are posting output from a nearly identical machine. What sorcery is this?

The internet claims previous BIOS versions worked fine with this laptop, but I can't downgrade my BIOS for some reason, and this is a new laptop that came with the new BIOS version pre-installed. I'm currently bisecting Linus's tree to see if I can get any insights.

I have finally been able to get a kernel to work in its entirety btw (4.18.x)... Will share kernel config and hopefully that works for you.

Created attachment 279001 [details]
Working Kernel Config
Please note that this only works with Bios Rev 1.04 and still produces the original error when using newer BIOS
In my case you have to add mce=off to the command line to get it to boot with the newer BIOS. Also iommu=soft or iommu=pt. I bisected the issue to a 4.10 commit about MCE and opened another bug here. I'm on mobile and can't look it up now. However, I doubt it'll get any attention unless I post on LKML, and I'm too busy at the moment to do that.

Thanks, marking it as a duplicate bug.

*** This bug has been marked as a duplicate of bug 201291 ***

Hi Amit,

First of all, there is no segfault at all. What you see is a simple WARNING coming from smp_call_function_single().

Secondly, this is not a duplicate of bug 201291:
* This report is about a WARNING in smp_call_function_single()
* Bug 201291 is about a NULL pointer dereference in amd_threshold_interrupt()

Hi Rafal,

You're absolutely correct in pointing this out. I conflated two issues (warning in smp_call_function_single w/o AMDGPU and segfault with AMDGPU without mce=off). Should I resolve this issue as INVALID?

I noticed this WARNING doesn't occur with recent kernels anymore. Of course I had to boot with MCE enabled + fix patch:
[PATCH] x86/MCE/AMD: Fix the thresholding machinery initialization order
in order to reliably test new kernels.

I did some basic bisecting + testing and found out that the WARNING has been fixed by:

commit 17ef4af0ec0f97b369f304dc04d61722f3591c4b
Author: Yazen Ghannam <yazen.ghannam@amd.com>
Date: Tue Jun 13 18:28:29 2017 +0200

    x86/mce/AMD: Use saved threshold block info in interrupt handler

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=17ef4af0ec0f97b369f304dc04d61722f3591c4b
which is present in kernel 4.13 and newer.

(In reply to Amit Prakash Ambasta from comment #5)
> Created attachment 279001 [details]
> Working Kernel Config
>
> Please note that this only works with Bios Rev 1.04 and still produces the
> original error when using newer BIOS

Amit: do you have by any chance a dmesg from that 4.18.13 kernel?
I compiled 4.18.13 with your config plus:
CONFIG_EFI_STUB=y
and I still wasn't able to boot it (early MCE error + kernel crash). It may be because of my BIOS 1.03.01, but I'm still curious to compare our dmesg-s.

Hi Rafal,

Can you try CONFIG_AMDGPU=N and see if that works? I'll try to rebuild the 4.18.13 kernel and see if I can reproduce the issue.

This was the bug with CONFIG_AMDGPU=y iirc (where I couldn't get any output without early_printk and video=efifb):
https://bugzilla.kernel.org/show_bug.cgi?id=201215

I've tried 4.18.13 + discussed config + disabled CONFIG_AMDGPU but I was still getting early CPU errors.
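
For anyone comparing setups against this thread, here is a minimal sketch of two checks discussed above: whether a kernel command line carries the workaround parameters (mce=off, plus iommu=soft or iommu=pt), and whether a kernel release string is 4.13 or newer, the first mainline release reported here to contain commit 17ef4af0ec0f. The function names `workaround_params` and `has_threshold_fix` are made up for illustration, and the version check is an assumption-laden shortcut: it ignores any stable or distribution backports, so a "False" result does not prove the fix is absent.

```python
import re

# Boot parameters named in the thread as needed to boot with the newer BIOS.
MCE_OFF = "mce=off"
IOMMU_WORKAROUNDS = ("iommu=soft", "iommu=pt")


def workaround_params(cmdline: str) -> dict:
    """Report which workaround parameters appear in a kernel command line.

    `cmdline` is the raw text, e.g. the contents of /proc/cmdline.
    """
    tokens = cmdline.split()
    return {
        "mce_off": MCE_OFF in tokens,
        "iommu_workaround": any(t in IOMMU_WORKAROUNDS for t in tokens),
    }


def has_threshold_fix(release: str) -> bool:
    """Return True if `release` (e.g. "4.18.13") is 4.13 or newer.

    Assumption: only the mainline version number is considered;
    backports to older stable or distribution kernels are not detected.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (4, 13)
```

On a running system the inputs would typically come from reading /proc/cmdline and from `platform.release()`.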