Bug 194501

Summary: Wrong apicid read from MADT makes system run with a single core
Product: Platform Specific/Hardware
Reporter: Alexandru N. Barloiu (axl)
Component: x86-64
Assignee: Chen Yu (yu.c.chen)
Status: CLOSED INSUFFICIENT_DATA
Severity: normal
CC: mail, rui.zhang, yu.c.chen
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 4.9.8
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg log
acpidump
lspci
config
dmesg-4.4.49
acpidump-4.4.49
dmesg-4.9.10
acpidump-4.9.10
config 4.7.7
dmesg 4.7.7
cpuonline.log 4.7.7
config 4.10
dmesg 4.10
cpuonline.log 4.10

Description Alexandru N. Barloiu 2017-02-08 09:27:44 UTC
Created attachment 254551
dmesg log

Hello. The exact problem is visible at the beginning of the dmesg attached to this bug. I noticed it around the time the kernel moved from the 4.8 series to the 4.9 series. Before that, the system ran with 4 cores * 2 (hyperthreading) and /proc/cpuinfo showed 8 CPUs. Now /proc/cpuinfo shows 1.

If noacpi is passed at boot, the system boots with 4 cores (and hyperthreading seems to be off).

Will attach config, acpidump and lspci dump shortly. Let me know if I can provide further information.
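
For anyone comparing kernels on a similar machine, the CPU count that /proc/cpuinfo reports can be extracted with a small script. This is a minimal sketch, assuming the usual x86 layout in which every logical CPU has its own "processor : N" line; the function name and the sample text are illustrative and not taken from the attached logs.

```python
def count_logical_cpus(cpuinfo_text):
    """Count the 'processor : N' entries in /proc/cpuinfo-style text.

    Assumption: one such line per logical CPU that is online (x86 layout).
    """
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


# Illustrative sample only, not real output from the affected machine.
SAMPLE = "processor\t: 0\nmodel name\t: example\n\nprocessor\t: 1\nmodel name\t: example\n"
sample_count = count_logical_cpus(SAMPLE)
```

On a live system the same function can be fed the contents of /proc/cpuinfo, which makes it easy to record the count after each test boot.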
Comment 1 Alexandru N. Barloiu 2017-02-08 09:28:14 UTC
Created attachment 254561
acpidump
Comment 2 Alexandru N. Barloiu 2017-02-08 09:28:34 UTC
Created attachment 254571
lspci
Comment 3 Alexandru N. Barloiu 2017-02-08 09:28:52 UTC
Created attachment 254581
config
Comment 4 Alexandru N. Barloiu 2017-02-16 20:23:07 UTC
Hello. A little follow-up to this bug report. I've managed to try the new 4.9.10 kernel, as well as 4.4.49, and I will be posting the relevant files. Sorry for the missing lspci for 4.9.10; I had only 2-3 minutes between boots. The server is in a remote location, and I rarely have the opportunity to catch someone over the phone to do reboots for me.

So, as an update: with kernel 4.9.10 it started with 2 CPUs, 0 and 6 for some reason, I think. With 4.4.49 it started with the normal 8. Both kernels (and the earlier ones, 4.9.9 and 4.9.8, the latter being the one that generated the logs in the original bug report) were compiled with basically the same config. That didn't change.

Given that it's the same machine and the same config, and only the kernel version differs between the good and bad results, I'm going to assume it's not me, or Gentoo, but the kernel itself, this one time.

So logs are incoming, in the hope that machines like that poor DL380 G7 will be taken care of and get the love they deserve. And again, if I can provide further information...
Comment 5 Alexandru N. Barloiu 2017-02-16 20:27:08 UTC
Created attachment 254797
dmesg-4.4.49
Comment 6 Alexandru N. Barloiu 2017-02-16 20:27:50 UTC
Created attachment 254799
acpidump-4.4.49
Comment 7 Alexandru N. Barloiu 2017-02-16 20:28:12 UTC
Created attachment 254801
dmesg-4.9.10
Comment 8 Alexandru N. Barloiu 2017-02-16 20:28:38 UTC
Created attachment 254803
acpidump-4.9.10
Comment 9 Chen Yu 2017-03-06 02:22:09 UTC
Hi Alexandru, according to dmesg 4.4.49, is only one CPU brought up?
And per dmesg 4.9,
[    0.000000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:2065 
it seems that Linux got a wrong APIC ID from the MADT table.

But did this issue appear after the update from 4.8 to 4.9? Do you have time to run a git bisect for it?
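
To compare logs across kernels, the WARNING lines can be pulled out of a saved dmesg. The following is a rough sketch, assuming the warning header always has the shape quoted above (timestamp, "WARNING: CPU: N PID: N at file:line"); the regular expression and the function name are illustrative.

```python
import re

# Matches the header of a kernel warning as quoted in this report.
WARN_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+WARNING: CPU: (?P<cpu>\d+) "
    r"PID: (?P<pid>\d+) at (?P<file>\S+):(?P<line>\d+)"
)


def find_warnings(dmesg_text):
    """Return (timestamp, cpu, source file, source line) for each WARNING header."""
    hits = []
    for line in dmesg_text.splitlines():
        m = WARN_RE.match(line)
        if m:
            hits.append(
                (float(m.group("ts")), int(m.group("cpu")),
                 m.group("file"), int(m.group("line")))
            )
    return hits


# The line quoted in this comment, used as sample input.
QUOTED = "[    0.000000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:2065 "
quoted_hits = find_warnings(QUOTED)
```

Running it over the dmesg of a good and a bad kernel shows quickly whether the apic.c warning appears only on the bad one.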
Comment 10 Alexandru N. Barloiu 2017-03-06 10:23:18 UTC
Hello. It is only ONE physical CPU: 4 cores * 2 (hyperthreading), 8 logical CPUs in total.
With the 4.4.x series all of them are brought up. With the 4.9 and now the 4.10 series only one logical CPU is up. The kernel knows there are 8, but it seems it can't handle the order in which they are listed.

With the 4.8 series, again, everything worked for a while (at the start of the series). Sometime between 4.8 and 4.9 it stopped working.

It is not my machine, and I do not have remote access. The next time I am scheduled to go there and update will be sometime in April. I plan to try the latest 4.10 and the latest 4.4, and if neither of them works I plan to try all kernels from 4.8 up to the current one to find out at which release the bug appeared (which I assume is git bisecting). But I don't know if the owner will allow me that much time for testing. I can compile the kernels at home and bring them already prepared, but that damn machine takes 5 minutes to boot, so 10 kernels equal about 1 hour of downtime. I'll do my best to try as many of them as I can.
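
Since downtime is the limiting factor, it may help to test the prepared kernels in bisection order rather than one after another. Below is a minimal sketch of that search, assuming the candidates are ordered oldest to newest, the newest one is known bad, and the bug never disappears again once it has appeared; the candidate labels and the `is_bad` callback are hypothetical.

```python
def first_bad(versions, is_bad):
    """Binary search for the first bad entry in an oldest-to-newest list.

    Assumptions: versions[-1] is bad, and every version after the first
    bad one is also bad. Returns (first bad version, versions tested).
    """
    lo, hi = 0, len(versions) - 1  # the first bad version lies in [lo, hi]
    tested = []
    while lo < hi:
        mid = (lo + hi) // 2
        tested.append(versions[mid])
        if is_bad(versions[mid]):
            hi = mid       # first bad is at mid or earlier
        else:
            lo = mid + 1   # first bad is after mid
    return versions[lo], tested


# Hypothetical example: ten prepared kernels, the bug first appears in "k6".
candidates = ["k%d" % i for i in range(10)]
culprit, boots = first_bad(candidates, lambda v: int(v[1:]) >= 6)
```

With ten candidates this needs at most four test boots instead of ten, which at roughly 6 minutes per kernel (the estimate above) is under half an hour of downtime. `git bisect` applies the same idea to individual commits between two tags.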
Comment 11 Chen Yu 2017-03-27 04:01:42 UTC
Hi Alex,
To ease your load, let's first debug this without recompiling the kernel and see if there is any clue. Please boot both 4.9 and 4.4 (the bad and the good version), and test as follows:

1. Append the following parameters to the kernel command line in grub:
"ftrace=function_graph ftrace_graph_filter=_cpu_up trace_event=cpuhp:cpuhp_exit,cpuhp:cpuhp_multi_enter,cpuhp:cpuhp_enter"

2. boot up the system, and then provide:
cat /sys/kernel/debug/tracing/trace > ~/cpuonline.log
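
Because every test boot is expensive here, it may be worth confirming that the parameters actually reached the kernel before collecting the trace. A small sketch, assuming the command line is read from /proc/cmdline; the helper name is illustrative, and the parameter strings are the ones listed in step 1.

```python
# The three parameters requested in step 1.
REQUESTED = [
    "ftrace=function_graph",
    "ftrace_graph_filter=_cpu_up",
    "trace_event=cpuhp:cpuhp_exit,cpuhp:cpuhp_multi_enter,cpuhp:cpuhp_enter",
]


def missing_params(cmdline, requested=REQUESTED):
    """Return the requested parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [p for p in requested if p not in present]


# Illustrative command line in which the trace_event parameter was forgotten.
example = "root=/dev/sda1 ro ftrace=function_graph ftrace_graph_filter=_cpu_up"
example_missing = missing_params(example)
```

On the booted system, pass the contents of /proc/cmdline to `missing_params`; an empty list means all three parameters were applied.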
Comment 12 Chen Yu 2017-03-27 04:04:45 UTC
BTW, please also test with a minimal set of drivers loaded by appending "init=/bin/bash" to the kernel command line in grub.
Comment 13 Alexandru N. Barloiu 2017-04-02 05:59:42 UTC
Hello. I'll do my best to run with these parameters and return the logs, I expect within the next week or two. I'll probably try 4.10.8 and 4.4.59, or whatever is the latest at that moment.
Comment 14 Gennadiy Polovinkin 2017-04-11 12:45:14 UTC
Created attachment 255821
config 4.7.7
Comment 15 Gennadiy Polovinkin 2017-04-11 12:45:53 UTC
Created attachment 255823
dmesg 4.7.7
Comment 16 Gennadiy Polovinkin 2017-04-11 12:46:17 UTC
Created attachment 255825
cpuonline.log 4.7.7
Comment 17 Gennadiy Polovinkin 2017-04-11 12:46:47 UTC
Created attachment 255827
config 4.10
Comment 18 Gennadiy Polovinkin 2017-04-11 12:47:12 UTC
Created attachment 255829
dmesg 4.10
Comment 19 Gennadiy Polovinkin 2017-04-11 12:47:33 UTC
Created attachment 255831
cpuonline.log 4.10
Comment 20 Gennadiy Polovinkin 2017-04-11 12:47:51 UTC
HP ProLiant DL360 G6, BIOS P64 08/16/2015
1 CPU X5550
HT Disabled

With kernel 4.7.7 it started with 4 CPU cores.
With kernel 4.10.8 it started with 1 CPU core.

HP ProLiant DL360 G7, BIOS P68 08/16/2015
2 CPU X5650
HT Enabled

With kernel 4.7.7 it started with 24 logical CPUs.
With kernel 4.10.6 it started with 20 logical CPUs.
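
To see which logical CPUs failed to come up, rather than only how many, the sysfs files /sys/devices/system/cpu/present and /sys/devices/system/cpu/online can be compared. This is a minimal sketch, assuming both files use the usual range-list format such as "0-3,6"; the function names and the sample values are illustrative.

```python
def parse_cpulist(text):
    """Expand a CPU list such as '0-3,6' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def offline_cpus(present_text, online_text):
    """Return the CPUs that are present but were not brought online."""
    return sorted(parse_cpulist(present_text) - parse_cpulist(online_text))


# Illustrative values: 8 CPUs present, only 0 and 6 online
# (the situation described for 4.9.10 in Comment 4).
example_offline = offline_cpus("0-7\n", "0,6\n")
```

On an affected system, read both files and pass their contents to `offline_cpus` to list the CPUs that were left offline.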
Comment 21 Chen Yu 2017-06-21 14:49:18 UTC
I noticed a bug that might be related to this issue. It was introduced in 4.9 and fixed in 4.11-rc3.

the bug was introduced in
commit f7c28833c252031bc68a29e26a18a661797cf3a3
Author: Gu Zheng <guz.fnst@cn.fujitsu.com>
Date:   Thu Aug 25 16:35:15 2016 +0800

and was fixed in

commit 2b85b3d22920db7473e5fed5719e7955c0ec323e
Author: Dou Liyang <douly.fnst@cn.fujitsu.com>
Date:   Fri Mar 3 16:02:25 2017 +0800

Please test the latest upstream kernel (NOT the distribution one):
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
and compile this one.
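
As a quick triage aid, a kernel release string can be checked against the window given above (introduced in 4.9, fixed in 4.11-rc3). This is only a sketch: it assumes release strings in the style of `uname -r`, it does not know whether a given stable or distribution kernel carries a backport of the fix, and the function name is illustrative.

```python
import re


def in_suspect_range(release):
    """Return True if a release string falls between 4.9 and 4.11-rc3.

    Assumption: stable backports of the fix are ignored, so a True result
    only means the kernel is worth checking, not that it is affected.
    """
    m = re.match(r"(\d+)\.(\d+)(?:\.\d+)?(?:-rc(\d+))?", release)
    if not m:
        raise ValueError("unrecognized release string: %r" % release)
    version = (int(m.group(1)), int(m.group(2)))
    rc = int(m.group(3)) if m.group(3) else None
    if version == (4, 11) and rc is not None:
        return rc < 3  # 4.11-rc1 and 4.11-rc2 predate the fix
    return (4, 9) <= version < (4, 11)


# Kernel versions mentioned in this report.
reported = {v: in_suspect_range(v) for v in ["4.4.49", "4.7.7", "4.9.8", "4.9.10", "4.10.8"]}
```

The versions reported as working in this bug (4.4.49, 4.7.7) fall outside the window and the failing ones (4.9.8, 4.9.10, 4.10.8) fall inside it, which is consistent with the suspicion, though it does not prove that this commit is the cause.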
Comment 22 Chen Yu 2017-08-28 16:28:02 UTC
I'm closing this one as there has been no response for a while. Please feel free to reopen it if Comment 21 doesn't work for you.