Created attachment 286135 [details]
dmesg from 4.4.191 x86 kernel running on HP Z400 x86-64 with 6GB ram

On my HP Z400 machine (x86-64, 4 cores, 6 GB RAM), x86 kernels starting with 4.4.191 produce a kernel oops like this:

[ 0.148791] microcode: CPU1 microcode updated early to revision 0x1d, date = 2018-05-11
[ 0.148792] Initializing CPU#1
[ 0.148795] ------------[ cut here ]------------
[ 0.148801] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/apic/apic.c:1303 setup_local_APIC+0xfb/0x520()
[ 0.148801] Modules linked in:
[ 0.148804] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.4.191-smp-mine #1
[ 0.148804] Hardware name: Hewlett-Packard HP Z400 Workstation/0B4Ch, BIOS 786G3 v03.61 03/05/2018
[ 0.148806] 00200086
[ 0.148806] 00200086
[ 0.148806] f2d35f0c
[ 0.148806] c15e5f96
[ 0.148807] 00000000
[ 0.148807] c1ea8463
[ 0.148807] f2d35f3c
[ 0.148808] c1055c87
[ 0.148808] c1ea9950
[ 0.148808] 00000001
[ 0.148809] 00000000
[ 0.148809] c1ea8463
[ 0.148809] 00000517
[ 0.148809] c103bdab
[ 0.148810] c103bdab
[ 0.148810] 00000002
[ 0.148810] 00000000
[ 0.148811] 00000001
[ 0.148811] f2d35f4c
[ 0.148811] c1055d62
[ 0.148811] 00000009
[ 0.148812] 00000000
[ 0.148812] f2d35f88
[ 0.148812] c103bdab
[ 0.148813] Call Trace:
[ 0.148817] [<c15e5f96>] dump_stack+0x47/0x61
[ 0.148819] [<c1055c87>] warn_slowpath_common+0x87/0xc0
[ 0.148820] [<c103bdab>] ? setup_local_APIC+0xfb/0x520
[ 0.148821] [<c103bdab>] ? setup_local_APIC+0xfb/0x520
[ 0.148822] [<c1055d62>] warn_slowpath_null+0x22/0x30
[ 0.148823] [<c103bdab>] setup_local_APIC+0xfb/0x520
[ 0.148825] [<c103c1dd>] apic_ap_setup+0xd/0x20
[ 0.148826] [<c10396d4>] start_secondary+0x44/0x1b0
[ 0.148831] ---[ end trace 0c04e759e0e80ff4 ]---
[ 0.148832] Leaving ESR disabled.
[ 0.150958] CPU 2 irqstacks, hard=f2da2000 soft=f2da4000
[ 0.151031] #2
[ 0.151670] microcode: CPU2 microcode updated early to revision 0x1d, date = 2018-05-11
[ 0.151672] Initializing CPU#2
[ 0.151675] ------------[ cut here ]------------

I find that I can fix the problem by reverting this patch to arch/x86/kernel/apic/bigsmp_32.c:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.3-rc7&id=bae3a8d3308ee69a7dbdf145911b18dfda8ade0d

The problem does not occur with 4.4.190 and earlier x86 kernels, and it does not happen when running x86-64 kernels.
Note that this is a regression in 4.4.191 from 4.4.190. It still happens in 4.4.205
Created attachment 286137 [details] dmesg from 4.4.190 x86 kernel kernel 4.4.190 x86 runs fine
Created attachment 286139 [details] dmesg from 4.4.205 with oops dmesg from 4.4.205 with kernel oops
Created attachment 286141 [details] dmesg from 4.4.205 x86 fixed with reversion of patch "x86/apic: Do not initialize LDR and DFR for bigsmp"
@Greg: you might wanna add a stable@ account to bugzilla too, exactly for such bug reports.

Hmm, that's:

	WARN_ON(i != BAD_APICID && i != logical_smp_processor_id());

but that looks rather fishy:

[ 0.035190] Overriding APIC driver with bigsmp

which shouldn't happen, IMHO, on such a small machine.

Can you try the latest upstream kernel - 5.4 - too pls? Because the patch is there too and we probably should revert it there also. Or if upstream is fine, only in stable@ because stable@ then is apparently missing something else from upstream.

Thx.
Created attachment 286145 [details] dmesg from 5.4.1-smp without the apic bug but has a memremap bug Attached is the dmesg from kernel 5.4.1-smp. It does not have the apic.c oops. it has a memremap oops which I can avoid by adding kernel parameter mem=4G
That looks like an EFI issue. For some reason the kernel thinks yours is an apple machine. But that code in map_properties() should run only on apple x86, strange. Reassigning. Independent from it all, I'd strongly suggest you move to a 64-bit kernel and userland as it is a lot more tested and scrutinized than 32-bit.
Please don't reassign this bug against 4.4-stable as an EFI bug just because some unrelated WARN() gets triggered when 5.4 is built and run on the same hardware. That doesn't make any sense whatsoever. If that WARN() is actually an EFI issue (which seems dubious to me given the unreliability of the backtrace), please create a separate bug against EFI.
(In reply to Ard Biesheuvel from comment #8)
> Please don't reassign this bug against 4.4-stable as an EFI bug just because
> some unrelated WARN() gets triggered when 5.4 is built and run on the same
> hardware. That doesn't make any sense whatsoever.
>
> If that WARN() is actually an EFI issue (which seems dubious to me given the
> unreliability of the backtrace), please create a separate bug against EFI.

Whatever Ard, I was just trying to be helpful. If you don't want to look at this, that's your call.

Richard, can you disable CONFIG_APPLE_PROPERTIES in your .config and see if you can trigger the issue again? Also, make sure you have CONFIG_UNWINDER_FRAME_POINTER - I believe this is the unwinder we can do on 32-bit for reliable stacktraces.

Thx.
(In reply to Borislav Petkov from comment #9)
> (In reply to Ard Biesheuvel from comment #8)
> > Please don't reassign this bug against 4.4-stable as an EFI bug just because
> > some unrelated WARN() gets triggered when 5.4 is built and run on the same
> > hardware. That doesn't make any sense whatsoever.
> >
> > If that WARN() is actually an EFI issue (which seems dubious to me given the
> > unreliability of the backtrace), please create a separate bug against EFI.
>
> Whatever Ard, I was just trying to be helpful. If you don't want to look at
> this, that's your call.
>

I am happy to look at the unrelated issue that has been identified, but categorizing a bug called 'Linux 4.4.191 x86 on HP Z400 kernel oops at arch/x86/kernel/apic/apic.c:1303 setup_local_APIC+0xfb/0x520()' as an EFI bug because some build of v5.4 running on the same hardware triggers an oops that looks vaguely EFI related is not really helpful. This is obviously an x86 problem, no?

> Richard, can you disable CONFIG_APPLE_PROPERTIES in your .config and see if
> you can trigger the issue again? Also, make sure you have
> CONFIG_UNWINDER_FRAME_POINTER - I believe this is the unwinder we can do on
> 32-bit for reliable stacktraces.
>

Perhaps I am missing something, and I am clueless about x86 internals, but the working v4.4.190 dmesg shows

[ 0.035165] Overriding APIC driver with bigsmp

as well, doesn't show any memremap OOPSes, and boots fine. v4.4.191 is broken because of the identified patch.
It also has

[ 0.000000] smpboot: Allowing 16 CPUs, 12 hotplug CPUs

which suggests that

void __init default_setup_apic_routing(void)
{
	int version = boot_cpu_apic_version;

	if (num_possible_cpus() > 8) {
		switch (boot_cpu_data.x86_vendor) {
		case X86_VENDOR_INTEL:
			if (!APIC_XAPIC(version)) {
				def_to_bigsmp = 0;
				break;
			}
			/* P4 and above */
			/* fall through */
		case X86_VENDOR_HYGON:
		case X86_VENDOR_AMD:
			def_to_bigsmp = 1;
		}
	}

results in def_to_bigsmp == 1, which causes

static int probe_bigsmp(void)
{
	if (def_to_bigsmp)
		dmi_bigsmp = 1;
	else
		dmi_check_system(bigsmp_dmi_table);
	return dmi_bigsmp;
}

to return true, and so switching to bigsmp mode on this hardware seems perfectly appropriate.

Again, I know very little about x86, but I fail to see how any of this is EFI related. The memremap OOPS may be EFI related if the stacktrace turns out to be accurate, but it is most likely a separate issue.

Thanks,
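The selection logic quoted above can be tried out in user space. The following is a minimal sketch, not kernel code: `pick_bigsmp`, the `vendor` enum, and `XAPIC_MIN_VERSION` are names invented for illustration, the 0x14 threshold is an assumption about what APIC_XAPIC() tests, and the model assumes def_to_bigsmp starts out as 0. Only the control flow follows the quoted function.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's vendor constants. */
enum vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_HYGON, VENDOR_OTHER };

/* Assumption: an APIC version of 0x14 or higher counts as xAPIC. */
#define XAPIC_MIN_VERSION 0x14

/*
 * Follows the control flow of the quoted default_setup_apic_routing():
 * returns true when bigsmp would become the default APIC driver,
 * assuming def_to_bigsmp was 0 on entry.
 */
static bool pick_bigsmp(unsigned int possible_cpus, enum vendor v,
			unsigned int apic_version)
{
	/* The switch is only reached with more than 8 possible CPUs. */
	if (possible_cpus <= 8)
		return false;

	switch (v) {
	case VENDOR_INTEL:
		/* Pre-xAPIC Intel parts do not default to bigsmp. */
		return apic_version >= XAPIC_MIN_VERSION;
	case VENDOR_HYGON:
	case VENDOR_AMD:
		return true;
	default:
		return false;
	}
}
```

With the "Allowing 16 CPUs" figure from the dmesg and an assumed xAPIC version such as 0x15, `pick_bigsmp(16, VENDOR_INTEL, 0x15)` returns true, whereas the same call with 4 possible CPUs returns false. That is consistent with the point that the hotplug CPU count, not the 4 physical cores, drives the switch to bigsmp.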
IIUC, that code shouldn't be called for bigsmp. Commit fe6f85ca121e9c74e7490fe66b0c5aae38e332c3 fixes it. It should land in stable eventually.

Author: Jan Beulich <jbeulich@suse.com>
Date:   Tue Oct 29 10:34:19 2019 +0100

    x86/apic/32: Avoid bogus LDR warnings

    The removal of the LDR initialization in the bigsmp_32 APIC code unearthed
    a problem in setup_local_APIC().

    The code checks unconditionally for a mismatch of the logical APIC id by
    comparing the early APIC id which was initialized in get_smp_config() with
    the actual LDR value in the APIC.

    Due to the removal of the bogus LDR initialization the check now can
    trigger on bigsmp_32 APIC systems emitting a warning for every booting
    CPU. This is of course a false positive because the APIC is not using
    logical destination mode.

    Restrict the check and the possibly resulting fixup to systems which are
    actually using the APIC in logical destination mode.

    [ tglx: Massaged changelog and added Cc stable ]

    Fixes: bae3a8d3308 ("x86/apic: Do not initialize LDR and DFR for bigsmp")
    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/666d8f91-b5a8-1afd-7add-821e72a35f03@suse.com
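The change this commit message describes can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct apic_model`, `ldr_check_warns_old`, `ldr_check_warns_new`, and the `BAD_ID` sentinel are invented for illustration and are not the kernel's identifiers. The real check lives in setup_local_APIC() and may differ in detail.

```c
#include <stdbool.h>

/* Hypothetical sentinel meaning "no logical APIC id was recorded early". */
#define BAD_ID 0xFFu

/* Hypothetical, simplified view of the state the check looks at. */
struct apic_model {
	bool dest_logical;	/* driver uses logical destination mode */
	unsigned int early_id;	/* logical id recorded during early setup */
	unsigned int ldr_id;	/* logical id currently read from the LDR */
};

/* Behaviour before the fix: the comparison runs unconditionally. */
static bool ldr_check_warns_old(const struct apic_model *a)
{
	return a->early_id != BAD_ID && a->early_id != a->ldr_id;
}

/* Behaviour described by the fix: compare only in logical destination mode. */
static bool ldr_check_warns_new(const struct apic_model *a)
{
	if (!a->dest_logical)
		return false;
	return a->early_id != BAD_ID && a->early_id != a->ldr_id;
}
```

For a bigsmp-style state such as `{ .dest_logical = false, .early_id = 2, .ldr_id = 0 }`, the old predicate reports a mismatch and the new one does not, which corresponds to the false positive described in the changelog. In logical destination mode both predicates behave the same.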
On Mon, Dec 02, 2019 at 09:57:12PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=205729
>
> Bandan Das (bsd@makefile.in) changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |bsd@makefile.in
>
> --- Comment #11 from Bandan Das (bsd@makefile.in) ---
> IIUC, that code shouldn't be called for bigsmp. Commit
> fe6f85ca121e9c74e7490fe66b0c5aae38e332c3 fixes it. It should land in stable
> eventually.

It is already in the following kernel releases:
	4.9.201 4.14.154 4.19.84 5.3.11 5.4
(In reply to Ard Biesheuvel from comment #10)
> I am happy to look at the unrelated issue that has been identified, but
> categorizing a bug called 'Linux 4.4.191 x86 on HP Z400 kernel oops at
> arch/x86/kernel/apic/apic.c:1303 setup_local_APIC+0xfb/0x520()' as an EFI
> bug because some build of v5.4 running on the same hardware triggers an oops
> that looks vaguely EFI related is not really helpful.

Ok, point taken, I won't reassign *possibly* EFI-related bugs but have people open a new one.

> Perhaps I am missing something, and I am clueless about x86 internals, but
> the working v4.4.190 dmesg shows

Well, you don't have to debug that - that's our job. But I appreciate you poking at it.

But before we go and waste any more time with this:

Richard, what do you wanna do? Do you want to switch to a 64-bit latest stable kernel, say 5.4.1, and see if that works fine on your box?
(In reply to Borislav Petkov from comment #13)
> Richard, what do you wanna do? Do you want to switch to a 64-bit latest
> stable kernel, say 5.4.1, and see if that works fine on your box?

As I said in my original comment 0 the bug does not happen with 64-bit kernels on my box. It runs 64-bit 4.4.205 and 5.4.1 well.

(In reply to gregkh from comment #12)
> On Mon, Dec 02, 2019 at 09:57:12PM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=205729
> >
> > Bandan Das (bsd@makefile.in) changed:
> >
> >            What    |Removed                     |Added
> > ----------------------------------------------------------------------------
> >                  CC|                            |bsd@makefile.in
> >
> > --- Comment #11 from Bandan Das (bsd@makefile.in) ---
> > IIUC, that code shouldn't be called for bigsmp. Commit
> > fe6f85ca121e9c74e7490fe66b0c5aae38e332c3 fixes it. It should land in stable
> > eventually.
>
> It is already in the following kernel releases:
>	4.9.201 4.14.154 4.19.84 5.3.11 5.4

Thank you. I will keep my fix for now with 4.4 and wait until a better fix filters down.

The 4.4 kernel is long term stable for Slackware at the moment. But Slackware current is running 5.4.1.
Created attachment 286163 [details] temporary patch to bigsmp_32 to fix the kernel oops on 4.4 kernels after 4.4.190
(In reply to Richard Narron from comment #14)
> As I said in my original comment 0 the bug does not happen with 64-bit
> kernels on my box. It runs 64-bit 4.4.205 and 5.4.1 well.

So you were simply reporting the 32-bit issue - you're not running a 32-bit kernel only and are fine with running a 64-bit kernel?
(In reply to Borislav Petkov from comment #16)
> (In reply to Richard Narron from comment #14)
> > As I said in my original comment 0 the bug does not happen with 64-bit
> > kernels on my box. It runs 64-bit 4.4.205 and 5.4.1 well.
>
> So you were simply reporting the 32-bit issue - you're not running a 32-bit
> kernel only and are fine with running a 64-bit kernel?

What does this have to do with the bug?

I am not fine running just one version of linux on the machine.

It seems to be a 32-bit issue but I will attach the dmesg from 64-bit 5.4.1 so it can be checked too...

It is always good not to jump to conclusions too quickly...
Created attachment 286169 [details] dmesg from 5.4.1 x86-64
(In reply to Richard Narron from comment #17)
> What does this have to do with the bug?

It has to do with prioritizing debugging time.

> I am not fine running just one version of linux on the machine.

What does "not fine" mean exactly? You have a special need to run a 32-bit kernel and you absolutely need it or you can do everything you wanna do on your machine, on a 64-bit kernel too?

> It seems to be a 32-bit issue but I will attach the dmesg from 64-bit
> 5.4.1 so it can be checked too...

Yes, that looks good.

> It is always good not to jump to conclusions too quickly...

Oh, interesting. There are conclusions and someone jumped to them?! Now *that* I wanna know - do tell.
Created attachment 286177 [details] kernel 4.4.205 patch: x86/apic/32: Avoid bogus LDR warnings Attached is my version of the "x86/apic/32: Avoid bogus LDR warnings" patch. It works for me, but it needs to be tested...
Status changed to verified. As Bandan Das says in comment 11: Commit fe6f85ca121e9c74e7490fe66b0c5aae38e332c3 fixes it. It should land in stable eventually.
Created attachment 286199 [details] [patch 4.4] x86/apic/32: Avoid bogus LDR warnings
The patch is included in kernel 4.4.207-rc1 and fixes the problem.