Summary: Wrong apicid read from MADT makes system run with a single core
Product: Platform Specific/Hardware
Component: x86-64
Severity: normal
Reporter: Alexandru N. Barloiu (axl)
Assignee: Chen Yu (yu.c.chen)
CC: mail, rui.zhang, yu.c.chen
Description Alexandru N. Barloiu 2017-02-08 09:27:44 UTC
Comment 1 Alexandru N. Barloiu 2017-02-08 09:28:14 UTC
Created attachment 254561 [details] acpidump
Comment 2 Alexandru N. Barloiu 2017-02-08 09:28:34 UTC
Created attachment 254571 [details] lspci
Comment 3 Alexandru N. Barloiu 2017-02-08 09:28:52 UTC
Created attachment 254581 [details] config
Comment 4 Alexandru N. Barloiu 2017-02-16 20:23:07 UTC
Hello. A little follow-up to this bug report. I've managed to try the new 4.9.10 kernel, as well as 4.4.49, and I will be posting the relevant files. Sorry for the missing lspci for 4.9.10; I had only 2-3 minutes between boots. The server is in a remote location, and I rarely have the opportunity to catch someone over the phone to do reboots for me.

As an update: with kernel 4.9.10 it started with 2 CPUs, 0 and 6 for some reason, I think. With 4.4.49 it started with the normal 8. Both kernels (and the earlier 4.9.9 and 4.9.8 builds that generated the logs in the original bug report) were compiled with basically the same config; that didn't change. Given that it's the same machine and the same config, and only the kernel version differs, I'm going to assume it's not me, or Gentoo, but the kernel itself, this one time.

So logs are incoming, in the hope that machines like that poor DL380 G7 will be taken care of and get the love they deserve. And again, if I can provide further information...
Comment 5 Alexandru N. Barloiu 2017-02-16 20:27:08 UTC
Created attachment 254797 [details] dmesg-4.4.49
Comment 6 Alexandru N. Barloiu 2017-02-16 20:27:50 UTC
Created attachment 254799 [details] acpidump-4.4.49
Comment 7 Alexandru N. Barloiu 2017-02-16 20:28:12 UTC
Created attachment 254801 [details] dmesg-4.9.10
Comment 8 Alexandru N. Barloiu 2017-02-16 20:28:38 UTC
Created attachment 254803 [details] acpidump-4.9.10
Comment 9 Chen Yu 2017-03-06 02:22:09 UTC
Hi Alexandru, according to dmesg 4.4.49, is only one CPU brought up? And per dmesg 4.9, the warning
[ 0.000000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:2065
suggests that Linux got a wrong APIC ID from the MADT table. But did this issue appear after the update from 4.8 to 4.9? Do you have time to run a git bisect for it?
Comment 10 Alexandru N. Barloiu 2017-03-06 10:23:18 UTC
Hello. It is only ONE physical CPU: 4 cores * 2 hyperthreads, 8 logical CPUs in total. With the 4.4.x series all of them are brought up. With the 4.9 and now the 4.10 series only one logical core is up, although the kernel knows there are 8; it just can't handle the order, it seems. With the 4.8 series, again, everything worked for a while (at the start of the series). Sometime between 4.8 and 4.9 it stopped working.

It is not my machine, and I do not have remote access. The next time I am scheduled to go there and update will be sometime in April. I plan to try the latest 4.10 and the latest 4.4, and if none of them work I plan to try all kernels from 4.8 until the current day to find out at which release the bug appeared (which I assume is git bisecting). But I don't know if the owner will allow me so much time for testing. I can compile the kernels at home and go with them already prepared, but that damn machine takes 5 minutes to boot, so 10 kernels means about 1 hour of downtime. I'll do my best to try as many of them as I can.
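
The downtime estimate above can be made a bit more concrete. Testing releases one by one costs one boot per kernel, whereas a git bisect between two tags needs roughly log2(N) boots for N commits. The following is only a rough sketch of that arithmetic, assuming the 5-minute boot time mentioned above. The commit count of 16000 is an illustrative round number, not a measured value for v4.8..v4.9, and the function name bisect_steps is made up for this example.

```shell
#!/bin/sh
# A typical bisect session between the two releases discussed here
# (run inside a clone of the upstream kernel tree):
#   git bisect start
#   git bisect bad v4.9
#   git bisect good v4.8
#   ...build and boot each candidate, then mark it with
#   "git bisect good" or "git bisect bad" until git names one commit.

# Print the smallest number of halvings needed to narrow N candidates
# down to one, i.e. ceil(log2(N)).
bisect_steps() (
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
)

commits=16000     # ASSUMPTION: illustrative count, not measured
boot_minutes=5    # boot time reported for this machine
steps=$(bisect_steps "$commits")
echo "$steps boots, about $((steps * boot_minutes)) minutes of downtime"
```

Under those assumptions a full bisect would take somewhat longer than the 10-kernel plan, but it would point at a single commit rather than a release.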
Comment 11 Chen Yu 2017-03-27 04:01:42 UTC
Hi Alex, to ease your load, let's first debug this without recompiling the kernel and see if there is any clue. Please boot both 4.9 and 4.4 (the bad and the good version) and test as follows:
1. Append the following to the kernel command line in grub: "ftrace=function_graph ftrace_graph_filter=_cpu_up trace_event=cpuhp:cpuhp_exit,cpuhp:cpuhp_multi_enter,cpuhp:cpuhp_enter"
2. Boot up the system, and then provide the log produced by: cat /sys/kernel/debug/tracing/trace > ~/cpuonline.log
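
When saving the trace it may also help to record how many logical CPUs actually came online, so that the good and bad boots can be compared at a glance. This is a minimal sketch and not part of the requested procedure: it assumes the standard sysfs file /sys/devices/system/cpu/online, which is not mentioned in this thread, and the helper name count_cpus is made up for this example.

```shell
#!/bin/sh
# Count the logical CPUs named by a kernel "cpulist" string,
# for example "0-7" (eight CPUs) or "0,6" (two CPUs).
count_cpus() (
    total=0
    IFS=,                      # split the list on commas (subshell only)
    for item in $1; do
        case $item in
            *-*)               # a range such as "0-7"
                first=${item%-*}
                last=${item#*-}
                total=$((total + last - first + 1))
                ;;
            '') ;;             # ignore empty fields
            *)                 # a single CPU number
                total=$((total + 1))
                ;;
        esac
    done
    echo "$total"
)

# ASSUMPTION: the running kernel exposes this sysfs file.
online=/sys/devices/system/cpu/online
if [ -r "$online" ]; then
    list=$(cat "$online")
    echo "kernel $(uname -r): online CPUs $list ($(count_cpus "$list") total)"
fi
```

The printed line can simply be pasted next to the cpuonline.log for each kernel.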
Comment 12 Chen Yu 2017-03-27 04:04:45 UTC
BTW, please also test with a minimal set of drivers loaded by appending "init=/bin/bash" to the kernel command line in grub.
Comment 13 Alexandru N. Barloiu 2017-04-02 05:59:42 UTC
Hello. I'll do my best to run with these parameters and return the log, I expect within the next week or two. I'll probably try 4.10.8 and 4.4.59, or whatever is latest at that moment.
Comment 14 Gennadiy Polovinkin 2017-04-11 12:45:14 UTC
Created attachment 255821 [details] config 4.7.7
Comment 15 Gennadiy Polovinkin 2017-04-11 12:45:53 UTC
Created attachment 255823 [details] dmesg 4.7.7
Comment 16 Gennadiy Polovinkin 2017-04-11 12:46:17 UTC
Created attachment 255825 [details] cpuonline.log 4.7.7
Comment 17 Gennadiy Polovinkin 2017-04-11 12:46:47 UTC
Created attachment 255827 [details] config 4.10
Comment 18 Gennadiy Polovinkin 2017-04-11 12:47:12 UTC
Created attachment 255829 [details] dmesg 4.10
Comment 19 Gennadiy Polovinkin 2017-04-11 12:47:33 UTC
Created attachment 255831 [details] cpuonline.log 4.10
Comment 20 Gennadiy Polovinkin 2017-04-11 12:47:51 UTC
HP ProLiant DL360 G6, BIOS P64 08/16/2015
1 CPU X5550 HT Disabled
with kernel 4.7.7 it started with 4 cpu cores
with kernel 4.10.8 it started with 1 cpu cores

HP ProLiant DL360 G7, BIOS P68 08/16/2015
2 CPU X5650 HT Enabled
with kernel 4.7.7 it started with 24 cpu cores
with kernel 4.10.6 it started with 20 cpu cores
Comment 21 Chen Yu 2017-06-21 14:49:18 UTC
I noticed a bug that might be related to this issue. It was introduced in 4.9 and fixed in 4.11-rc3.

The bug was introduced in
commit f7c28833c252031bc68a29e26a18a661797cf3a3
Author: Gu Zheng <firstname.lastname@example.org>
Date: Thu Aug 25 16:35:15 2016 +0800

and was fixed in
commit 2b85b3d22920db7473e5fed5719e7955c0ec323e
Author: Dou Liyang <email@example.com>
Date: Fri Mar 3 16:02:25 2017 +0800

Please test the latest upstream kernel (NOT the distribution one):
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
and compile this one.
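
As a quick cross-check, the kernel versions reported in this thread can be compared against the suspected window. The sketch below is only a rough filter built on the statement above that the bug entered in 4.9 and was fixed in 4.11-rc3. It looks at the major.minor number only, so it treats every 4.11 build as fixed (including release candidates before rc3) and knows nothing about possible stable backports. The function name version_suspect is made up for this example.

```shell
#!/bin/sh
# In a kernel git tree, the precise check is whether the fix commit is an
# ancestor of the checked-out revision:
#   git merge-base --is-ancestor 2b85b3d22920db7473e5fed5719e7955c0ec323e HEAD

# Succeed (exit status 0) if the given version is 4.9.x or 4.10.x.
# ASSUMPTION: only major.minor is examined; see the caveats above.
version_suspect() (
    v=${1%%-*}             # drop a suffix such as "-rc3" or "-gentoo"
    major=${v%%.*}
    rest=${v#*.}
    minor=${rest%%.*}
    [ "$major" -eq 4 ] && [ "$minor" -ge 9 ] && [ "$minor" -le 10 ]
)

# Kernel versions mentioned by the reporters in this thread.
for ver in 4.4.49 4.7.7 4.9.10 4.10.6 4.10.8; do
    if version_suspect "$ver"; then
        echo "$ver: inside the suspected window"
    else
        echo "$ver: outside the suspected window"
    fi
done
```

This only helps decide which kernels are worth retesting; confirming the fix still requires booting an upstream build that contains the second commit.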
Comment 22 Chen Yu 2017-08-28 16:28:02 UTC
I'm closing this one as there has been no response for a while. Please feel free to reopen if Comment 21 doesn't work for you.