Bug 205729

Summary: Linux 4.4.191 x86 on HP Z400 kernel oops at arch/x86/kernel/apic/apic.c:1303 setup_local_APIC+0xfb/0x520()
Product: Platform Specific/Hardware
Reporter: Richard Narron (richard)
Component: i386
Assignee: platform_i386
Status: CLOSED CODE_FIX
Severity: normal
CC: ardb, bp, bsd, bsd, gregkh, JBeulich, richard, tglx
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.4.191 and newer
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: dmesg from 4.4.191 x86 kernel running on HP Z400 x86-64 with 6GB ram
dmesg from 4.4.190 x86 kernel
dmesg from 4.4.205 with oops
dmesg from 4.4.205 x86 fixed with reversion of patch "x86/apic: Do not initialize LDR and DFR for bigsmp"
dmesg from 5.4.1-smp without the apic bug but has a memremap bug
temporary patch to bigsmp_32 to fix the kernel oops on 4.4 kernels after 4.4.190
dmesg from 5.4.1 x86-64
kernel 4.4.205 patch: x86/apic/32: Avoid bogus LDR warnings
[patch 4.4] x86/apic/32: Avoid bogus LDR warnings

Description Richard Narron 2019-12-01 17:04:55 UTC
Created attachment 286135 [details]
dmesg from 4.4.191 x86 kernel running on HP Z400 x86-64 with 6GB ram

On my HP Z400 (x86-64, 4 cores, 6 GB RAM), running x86 (32-bit) kernels 4.4.191 and newer produces a kernel oops like this:

[    0.148791] microcode: CPU1 microcode updated early to revision 0x1d, date = 2018-05-11
[    0.148792] Initializing CPU#1
[    0.148795] ------------[ cut here ]------------
[    0.148801] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/apic/apic.c:1303 setup_local_APIC+0xfb/0x520()
[    0.148801] Modules linked in:

[    0.148804] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.4.191-smp-mine #1
[    0.148804] Hardware name: Hewlett-Packard HP Z400 Workstation/0B4Ch, BIOS 786G3 v03.61 03/05/2018
[    0.148806]  00200086 00200086 f2d35f0c c15e5f96 00000000 c1ea8463 f2d35f3c c1055c87
[    0.148808]  c1ea9950 00000001 00000000 c1ea8463 00000517 c103bdab c103bdab 00000002
[    0.148810]  00000000 00000001 f2d35f4c c1055d62 00000009 00000000 f2d35f88 c103bdab
[    0.148813] Call Trace:
[    0.148817]  [<c15e5f96>] dump_stack+0x47/0x61
[    0.148819]  [<c1055c87>] warn_slowpath_common+0x87/0xc0
[    0.148820]  [<c103bdab>] ? setup_local_APIC+0xfb/0x520
[    0.148821]  [<c103bdab>] ? setup_local_APIC+0xfb/0x520
[    0.148822]  [<c1055d62>] warn_slowpath_null+0x22/0x30
[    0.148823]  [<c103bdab>] setup_local_APIC+0xfb/0x520
[    0.148825]  [<c103c1dd>] apic_ap_setup+0xd/0x20
[    0.148826]  [<c10396d4>] start_secondary+0x44/0x1b0
[    0.148831] ---[ end trace 0c04e759e0e80ff4 ]---
[    0.148832] Leaving ESR disabled.
[    0.150958] CPU 2 irqstacks, hard=f2da2000 soft=f2da4000
[    0.151031]   #2
[    0.151670] microcode: CPU2 microcode updated early to revision 0x1d, date = 2018-05-11
[    0.151672] Initializing CPU#2
[    0.151675] ------------[ cut here ]------------

I find that I can fix the problem by reverting this patch to 
arch/x86/kernel/apic/bigsmp_32.c:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.3-rc7&id=bae3a8d3308ee69a7dbdf145911b18dfda8ade0d

The problem does not occur with 4.4.190 and earlier x86 kernels and does not happen when running x86-64 kernels.
Comment 1 Richard Narron 2019-12-01 17:06:38 UTC
Note that this is a regression in 4.4.191 from 4.4.190. It still happens in 4.4.205.
Comment 2 Richard Narron 2019-12-01 17:10:05 UTC
Created attachment 286137 [details]
dmesg from 4.4.190 x86 kernel

Kernel 4.4.190 x86 runs fine.
Comment 3 Richard Narron 2019-12-01 17:17:57 UTC
Created attachment 286139 [details]
dmesg from 4.4.205 with oops

dmesg from 4.4.205 with kernel oops
Comment 4 Richard Narron 2019-12-01 17:22:20 UTC
Created attachment 286141 [details]
dmesg from 4.4.205 x86 fixed with reversion of patch "x86/apic: Do not initialize LDR and DFR for bigsmp"

dmesg from 4.4.205 x86 fixed with reversion of patch "x86/apic: Do not initialize LDR and DFR for bigsmp"
Comment 5 Borislav Petkov 2019-12-01 17:40:19 UTC
@Greg: you might wanna add a stable@ account to bugzilla too, exactly for such 
bug reports.

Hmm, that's:

WARN_ON(i != BAD_APICID && i != logical_smp_processor_id());

but that looks rather fishy:

[    0.035190] Overriding APIC driver with bigsmp

which shouldn't happen, IMHO, on such a small machine.

Can you try the latest upstream kernel - 5.4 - too pls? Because the patch is there too and we probably should revert it there also. Or if upstream is fine, only in stable@ because stable@ then is apparently missing something else from upstream.

Thx.
Comment 6 Richard Narron 2019-12-02 00:36:22 UTC
Created attachment 286145 [details]
dmesg from 5.4.1-smp without the apic bug but has a memremap bug

Attached is the dmesg from kernel 5.4.1-smp. It does not have the apic.c oops.

It has a memremap oops, which I can avoid by adding the kernel parameter mem=4G.
Comment 7 Borislav Petkov 2019-12-02 16:13:39 UTC
That looks like an EFI issue. For some reason the kernel thinks yours is an Apple machine. But that code in map_properties() should run only on Apple x86, strange. Reassigning.

Independent from it all, I'd strongly suggest you move to a 64-bit kernel and userland as it is a lot more tested and scrutinized than 32-bit.
Comment 8 Ard Biesheuvel 2019-12-02 16:24:14 UTC
Please don't reassign this bug against 4.4-stable as an EFI bug just because some unrelated WARN() gets triggered when 5.4 is built and run on the same hardware. That doesn't make any sense whatsoever.

If that WARN() is actually an EFI issue (which seems dubious to me given the unreliability of the backtrace), please create a separate bug against EFI.
Comment 9 Borislav Petkov 2019-12-02 16:58:03 UTC
(In reply to Ard Biesheuvel from comment #8)
> Please don't reassign this bug against 4.4-stable as an EFI bug just because
> some unrelated WARN() gets triggered when 5.4 is built and run on the same
> hardware. That doesn't make any sense whatsoever.
> 
> If that WARN() is actually an EFI issue (which seems dubious to me given the
> unreliability of the backtrace), please create a separate bug against EFI.

Whatever Ard, I was just trying to be helpful. If you don't want to look at this, that's your call.

Richard, can you disable CONFIG_APPLE_PROPERTIES in your .config and see if you can trigger the issue again? Also, make sure you have CONFIG_UNWINDER_FRAME_POINTER - I believe this is the unwinder we can do on 32-bit for reliable stacktraces.

Thx.
Comment 10 Ard Biesheuvel 2019-12-02 17:23:01 UTC
(In reply to Borislav Petkov from comment #9)
> (In reply to Ard Biesheuvel from comment #8)
> > Please don't reassign this bug against 4.4-stable as an EFI bug just because
> > some unrelated WARN() gets triggered when 5.4 is built and run on the same
> > hardware. That doesn't make any sense whatsoever.
> > 
> > If that WARN() is actually an EFI issue (which seems dubious to me given the
> > unreliability of the backtrace), please create a separate bug against EFI.
> 
> Whatever Ard, I was just trying to be helpful. If you don't want to look at
> this, that's your call.
> 

I am happy to look at the unrelated issue that has been identified, but categorizing a bug called 'Linux 4.4.191 x86 on HP Z400 kernel oops at arch/x86/kernel/apic/apic.c:1303 setup_local_APIC+0xfb/0x520()' as an EFI bug because some build of v5.4 running on the same hardware triggers an oops that looks vaguely EFI related is not really helpful. This is obviously an x86 problem, no?


> Richard, can you disable CONFIG_APPLE_PROPERTIES in your .config and see if
> you can trigger the issue again? Also, make sure you have
> CONFIG_UNWINDER_FRAME_POINTER - I believe this is the unwinder we can do on
> 32-bit for reliable stacktraces.
> 

Perhaps I am missing something, and I am clueless about x86 internals, but the working v4.4.190 dmesg shows

[    0.035165] Overriding APIC driver with bigsmp

as well, doesn't show any memremap OOPSes, and boots fine. v4.4.191 is broken because of the identified patch.

It also has

[    0.000000] smpboot: Allowing 16 CPUs, 12 hotplug CPUs

which suggests that

void __init default_setup_apic_routing(void)
{
	int version = boot_cpu_apic_version;

	if (num_possible_cpus() > 8) {
		switch (boot_cpu_data.x86_vendor) {
		case X86_VENDOR_INTEL:
			if (!APIC_XAPIC(version)) {
				def_to_bigsmp = 0;
				break;
			}
			/* P4 and above */
			/* fall through */
		case X86_VENDOR_HYGON:
		case X86_VENDOR_AMD:
			def_to_bigsmp = 1;
		}
	}

results in def_to_bigsmp == 1, which causes

static int probe_bigsmp(void)
{
	if (def_to_bigsmp)
		dmi_bigsmp = 1;
	else
		dmi_check_system(bigsmp_dmi_table);

	return dmi_bigsmp;
}

to return true, and so switching to bigsmp mode on this hardware seems perfectly appropriate.
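
The reasoning above can be restated as a small user-space sketch. This is only a simplified model of the quoted kernel excerpt, not the kernel code itself: `enum cpu_vendor`, `defaults_to_bigsmp()` and its parameters are hypothetical names introduced for illustration, and the model ignores the DMI table fallback in probe_bigsmp() as well as any earlier value of def_to_bigsmp.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's X86_VENDOR_* constants. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_HYGON, VENDOR_OTHER };

/*
 * Simplified model of the quoted default_setup_apic_routing() excerpt:
 * returns true when a 32-bit kernel would default to the bigsmp driver.
 * possible_cpus counts present plus hotpluggable CPUs.
 */
static bool defaults_to_bigsmp(unsigned int possible_cpus,
                               enum cpu_vendor vendor, bool has_xapic)
{
    /* With 8 or fewer possible CPUs the excerpt never sets the flag. */
    if (possible_cpus <= 8)
        return false;

    switch (vendor) {
    case VENDOR_INTEL:
        /* Intel needs an xAPIC; otherwise the flag is cleared. */
        return has_xapic;
    case VENDOR_AMD:
    case VENDOR_HYGON:
        return true;
    default:
        return false;
    }
}
```

With the values from the dmesg line above (16 possible CPUs) and assuming an Intel CPU with an xAPIC, `defaults_to_bigsmp(16, VENDOR_INTEL, true)` returns true, which matches the "Overriding APIC driver with bigsmp" message on both 4.4.190 and 4.4.191.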

Again, I know very little about x86, but I fail to see how any of this is EFI related. The memremap OOPS may be EFI related if the stacktrace turns out to be accurate, but it is most likely a separate issue.

Thanks,
Comment 11 Bandan Das 2019-12-02 21:57:12 UTC
IIUC, that code shouldn't be called for bigsmp. Commit fe6f85ca121e9c74e7490fe66b0c5aae38e332c3 fixes it. It should land in stable eventually.

Author: Jan Beulich <jbeulich@suse.com>
Date:   Tue Oct 29 10:34:19 2019 +0100

    x86/apic/32: Avoid bogus LDR warnings
    
    The removal of the LDR initialization in the bigsmp_32 APIC code unearthed
    a problem in setup_local_APIC().
    
    The code checks unconditionally for a mismatch of the logical APIC id by
    comparing the early APIC id which was initialized in get_smp_config() with
    the actual LDR value in the APIC.
    
    Due to the removal of the bogus LDR initialization the check now can
    trigger on bigsmp_32 APIC systems emitting a warning for every booting
    CPU. This is of course a false positive because the APIC is not using
    logical destination mode.
    
    Restrict the check and the possibly resulting fixup to systems which are
    actually using the APIC in logical destination mode.
    
    [ tglx: Massaged changelog and added Cc stable ]
    
    Fixes: bae3a8d3308 ("x86/apic: Do not initialize LDR and DFR for bigsmp")
    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/666d8f91-b5a8-1afd-7add-821e72a35f03@suse.com
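
As a rough illustration of what the changelog describes, the sketch below contrasts the unconditional check with one that only runs in logical destination mode. It is a user-space model under stated assumptions, not the actual patch: `BAD_ID`, `ldr_check_warns_old()`, `ldr_check_warns_fixed()` and the `dest_logical` parameter are hypothetical names, and the ID values used with them are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker meaning "no early logical APIC ID was recorded". */
#define BAD_ID 0xFFu

/*
 * Model of the old behaviour: the early logical ID is always compared
 * with the LDR value, so a driver that never programs the LDR (bigsmp,
 * after the LDR/DFR initialization was removed) can trigger the warning.
 */
static bool ldr_check_warns_old(uint32_t early_id, uint32_t ldr_id)
{
    return early_id != BAD_ID && early_id != ldr_id;
}

/*
 * Model of the fixed behaviour: the comparison is skipped unless the
 * APIC driver actually uses logical destination mode.
 */
static bool ldr_check_warns_fixed(bool dest_logical, uint32_t early_id,
                                  uint32_t ldr_id)
{
    if (!dest_logical)
        return false; /* physical mode: the LDR value is irrelevant */

    return early_id != BAD_ID && early_id != ldr_id;
}
```

In this model, a bigsmp-style case with a recorded early ID of 2 and an unprogrammed LDR of 0 warns under the old check but not under the fixed one when `dest_logical` is false, while a genuine mismatch in logical mode still warns.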
Comment 12 gregkh 2019-12-02 22:07:32 UTC
On Mon, Dec 02, 2019 at 09:57:12PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=205729
> 
> Bandan Das (bsd@makefile.in) changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |bsd@makefile.in
> 
> --- Comment #11 from Bandan Das (bsd@makefile.in) ---
> IIUC, that code shouldn't be called for bigsmp. Commit
> fe6f85ca121e9c74e7490fe66b0c5aae38e332c3 fixes it. It should land in stable
> eventually.

It is already in the following kernel releases:
	4.9.201 4.14.154 4.19.84 5.3.11 5.4
Comment 13 Borislav Petkov 2019-12-03 09:28:04 UTC
(In reply to Ard Biesheuvel from comment #10)
> I am happy to look at the unrelated issue that has been identified, but
> categorizing a bug called 'Linux 4.4.191 x86 on HP Z400 kernel oops at
> arch/x86/kernel/apic/apic.c:1303 setup_local_APIC+0xfb/0x520()' as an EFI
> bug because some build of v5.4 running on the same hardware triggers an oops
> that looks vaguely EFI related is not really helpful.

Ok, point taken, I won't reassign *possibly* EFI-related bugs but have
people open a new one.

> Perhaps I am missing something, and I am clueless about x86 internals, but
> the working v4.4.190 dmesg shows

Well, you don't have to debug that - that's our job. But I appreciate
you poking at it.

But before we go and waste any more time with this:

Richard, what do you wanna do? Do you want to switch to a 64-bit latest
stable kernel, say 5.4.1, and see if that works fine on your box?
Comment 14 Richard Narron 2019-12-03 13:49:03 UTC
(In reply to Borislav Petkov from comment #13)

> Richard, what do you wanna do? Do you want to switch to a 64-bit latest
> stable kernel, say 5.4.1, and see if that works fine on your box?

As I said in my original comment 0 the bug does not happen with 64-bit kernels on my box.  It runs 64-bit 4.4.205 and 5.4.1 well.

(In reply to gregkh from comment #12)
> On Mon, Dec 02, 2019 at 09:57:12PM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=205729
> > 
> > Bandan Das (bsd@makefile.in) changed:
> > 
> >            What    |Removed                     |Added
> >
> ----------------------------------------------------------------------------
> >                  CC|                            |bsd@makefile.in
> > 
> > --- Comment #11 from Bandan Das (bsd@makefile.in) ---
> > IIUC, that code shouldn't be called for bigsmp. Commit
> > fe6f85ca121e9c74e7490fe66b0c5aae38e332c3 fixes it. It should land in stable
> > eventually.
> 
> It is already in the following kernel releases:
>       4.9.201 4.14.154 4.19.84 5.3.11 5.4

Thank you.  I will keep my fix for now with 4.4 and wait until a better fix filters down. 

The 4.4 kernel is long term stable for Slackware at the moment.  But Slackware current is running 5.4.1.
Comment 15 Richard Narron 2019-12-03 13:55:50 UTC
Created attachment 286163 [details]
temporary patch to bigsmp_32 to fix the kernel oops on 4.4 kernels after 4.4.190
Comment 16 Borislav Petkov 2019-12-03 17:25:12 UTC
(In reply to Richard Narron from comment #14)
> As I said in my original comment 0 the bug does not happen with 64-bit
> kernels on my box.  It runs 64-bit 4.4.205 and 5.4.1 well.

So you were simply reporting the 32-bit issue - you're not running a 32-bit kernel only and are fine with running a 64-bit kernel?
Comment 17 Richard Narron 2019-12-04 00:35:53 UTC
(In reply to Borislav Petkov from comment #16)
> (In reply to Richard Narron from comment #14)
> > As I said in my original comment 0 the bug does not happen with 64-bit
> > kernels on my box.  It runs 64-bit 4.4.205 and 5.4.1 well.
> 
> So you were simply reporting the 32-bit issue - you're not running a 32-bit
> kernel only and are fine with running a 64-bit kernel?

What does this have to do with the bug?  I am not fine running just one version of linux on the machine.

It seems to be a 32-bit issue but I will attach the dmesg from 64-bit 5.4.1 so it can be checked too...  It is always good not to jump to conclusions too quickly...
Comment 18 Richard Narron 2019-12-04 00:44:49 UTC
Created attachment 286169 [details]
dmesg from 5.4.1 x86-64
Comment 19 Borislav Petkov 2019-12-04 08:00:35 UTC
(In reply to Richard Narron from comment #17)
> What does this have to do with the bug?

It has to do with prioritizing debugging time.

> I am not fine running just one version of linux on the machine.

What does "not fine" mean exactly?

You have a special need to run a 32-bit kernel and you absolutely need
it or you can do everything you wanna do on your machine, on a 64-bit
kernel too?

> It seems to be a 32-bit issue but I will attach the dmesg from 64-bit
> 5.4.1 so it can be checked too...

Yes, that looks good.

> It is always good not to jump to conclusions too quickly...

Oh, interesting. There are conclusions and someone jumped to them?! Now
*that* I wanna know - do tell.
Comment 20 Richard Narron 2019-12-04 16:56:19 UTC
Created attachment 286177 [details]
kernel 4.4.205 patch: x86/apic/32: Avoid bogus LDR warnings

Attached is my version of the "x86/apic/32: Avoid bogus LDR warnings" patch.

It works for me, but it needs to be tested...
Comment 21 Richard Narron 2019-12-05 13:48:32 UTC
Status changed to verified.

As Bandan Das says in comment 11:

Commit fe6f85ca121e9c74e7490fe66b0c5aae38e332c3 fixes it. It should land in stable eventually.
Comment 22 Richard Narron 2019-12-06 11:23:08 UTC
Created attachment 286199 [details]
[patch 4.4] x86/apic/32: Avoid bogus LDR warnings
Comment 23 Richard Narron 2019-12-20 00:42:48 UTC
The patch is included in kernel 4.4.207-rc1 and fixes the problem.