Bug 194501
Summary: | Wrong apicid read from MADT makes system run with a single core | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Alexandru N. Barloiu (axl) |
Component: | x86-64 | Assignee: | Chen Yu (yu.c.chen) |
Status: | CLOSED INSUFFICIENT_DATA | ||
Severity: | normal | CC: | mail, rui.zhang, yu.c.chen |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 4.9.8 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | dmesg log, acpidump, lspci, config, dmesg-4.4.49, acpidump-4.4.49, dmesg-4.9.10, acpidump-4.9.10, config 4.7.7, dmesg 4.7.7, cpuonline.log 4.7.7, config 4.10, dmesg 4.10, cpuonline.log 4.10 | ||
Created attachment 254561 [details]
acpidump
Created attachment 254571 [details]
lspci
Created attachment 254581 [details]
config
Hello. A little follow-up to this bug report. I've managed to try the new 4.9.10 kernel, as well as 4.4.49, and I will be posting the relevant files. Sorry for the missing lspci for 4.9.10: I had only 2-3 minutes between boots. The server is in a remote location, and I rarely have the opportunity to catch someone over the phone to do reboots for me.

So, as an update: with kernel 4.9.10 it started with 2 CPUs, 0 and 6 for some reason, I think. With 4.4.49 it started with the normal 8. Both kernels (and the ones before, 4.9.9 and 4.9.8, the one that generated the logs in the original bug report) were compiled with basically the same config. That didn't change.

Given that it's the same machine and the same config, and only the kernel version differs between the results, I'm going to assume it's not me, or Gentoo, but the kernel itself, this one time. So logs incoming, in the hope that machines like that poor DL380 G7 will be taken care of and get the love they deserve. And again, let me know if I can provide further information.

Created attachment 254797 [details]
dmesg-4.4.49
Created attachment 254799 [details]
acpidump-4.4.49
Created attachment 254801 [details]
dmesg-4.9.10
Created attachment 254803 [details]
acpidump-4.9.10
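The CPU counts quoted in the follow-up above (8 logical CPUs on 4.4.49, only CPUs 0 and 6 on 4.9.10) can be read back from sysfs rather than guessed from memory. A minimal sketch, assuming the standard `/sys/devices/system/cpu/present` and `/sys/devices/system/cpu/online` files, which hold lists such as `0-7` or `0,6`. The helper names `parse_cpu_list` and `read_cpu_list` are made up for this illustration.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel cpulist string such as "0-3,6" into a sorted list of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_cpu_list(name, base="/sys/devices/system/cpu"):
    """Read and expand one of the cpulist files, e.g. "present" or "online"."""
    return parse_cpu_list((Path(base) / name).read_text())
```

On an affected boot, `read_cpu_list("present")` and `read_cpu_list("online")` would be expected to differ; attaching both lists to a report removes the "I think" from statements like "0 and 6".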
Hi Alexandru, according to dmesg 4.4.49, there is only one CPU brought up? And per dmesg 4.9,

[ 0.000000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:2065

it seems that Linux got a wrong apicid from the MADT table. But this issue happened after the update from 4.8 to 4.9? Do you have time to run a git bisect for it?

Hello. It is only ONE physical CPU, 4 cores * 2 hyperthreading. 8 logical. Total. According to the 4.4.x series, all are brought up. According to the 4.9 and now 4.10 series, only one logical core is up. The kernel knows there are 8, it just can't handle the order, it seems. With the 4.8 series, again, all worked for a while (at the start of the series). Sometime between 4.8 and 4.9 it stopped working.

It is not my machine, and I do not have remote access. The next time I am scheduled to go there and update will be sometime in April. I plan to try the last 4.10 and the last 4.4, and if neither of them works I plan to try all kernels from 4.8 until the current day to find out at which release the bug appeared (which I assume is git bisecting). But I don't know if the owner will allow me so much time for testing. I can compile the kernels at home and go with them already prepared, but that damn machine takes 5 minutes to boot. So 10 kernels equals 1 hour of downtime. I'll do my best to try as many of them as I can.

Hi Alex, in order to ease your load, let's first debug it without recompiling the kernel, to see if there is any clue. Please boot with both 4.9 and 4.4 (bad and good version), and test as follows:

1. Append the following to the kernel command line in grub: "ftrace=function_graph ftrace_graph_filter=_cpu_up trace_event=cpuhp:cpuhp_exit,cpuhp:cpuhp_multi_enter,cpuhp:cpuhp_enter"
2. Boot up the system, and then provide the output of: cat /sys/kernel/debug/tracing/trace > ~/cpuonline.log

BTW, please also test with minimal drivers loaded by appending "init=/bin/bash" in grub.

Hello. I'll do my best to run these parameters and return the log. I expect to do it in the following week or the next. I'll probably try 4.10.8 and 4.4.59, or whatever is latest at that moment.

Created attachment 255821 [details]
config 4.7.7
Created attachment 255823 [details]
dmesg 4.7.7
Created attachment 255825 [details]
cpuonline.log 4.7.7
Created attachment 255827 [details]
config 4.10
Created attachment 255829 [details]
dmesg 4.10
Created attachment 255831 [details]
cpuonline.log 4.10
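On the plan mentioned earlier to boot every kernel between the last good and the first bad release: testing them one by one is a linear search, while git bisect performs a binary search, so the number of 5-minute boots grows with the logarithm of the number of candidates. The sketch below is a simplified model only; real git bisect works on commits, and the function name, the `k0`..`k11` labels and the `is_bad` callback are all hypothetical.

```python
def bisect_first_bad(versions, is_bad):
    """Binary-search an ordered list for the first bad entry.

    Assumes versions[0] is known good, versions[-1] is known bad, and every
    entry after the first bad one is also bad. Returns (first_bad, tests_run).
    """
    low, high = 0, len(versions) - 1  # low: known good, high: known bad
    tests = 0
    while high - low > 1:
        mid = (low + high) // 2
        tests += 1  # one boot of the machine per test
        if is_bad(versions[mid]):
            high = mid
        else:
            low = mid
    return versions[high], tests


# Hypothetical example: 10 untested kernels between a good k0 and a bad k11.
candidates = [f"k{i}" for i in range(12)]
first_bad, boots = bisect_first_bad(candidates, lambda v: int(v[1:]) >= 7)
print(first_bad, boots)
```

In this model, 10 untested kernels need at most 4 boots instead of 10, which matters when each reboot has to be arranged over the phone.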
HP ProLiant DL360 G6, BIOS P64 08/16/2015, 1 CPU X5550, HT Disabled
with kernel 4.7.7 it started with 4 CPU cores
with kernel 4.10.8 it started with 1 CPU core

HP ProLiant DL360 G7, BIOS P68 08/16/2015, 2 CPU X5650, HT Enabled
with kernel 4.7.7 it started with 24 CPU cores
with kernel 4.10.6 it started with 20 CPU cores

I noticed a suspicious point: there is a bug that might be related to this issue, which was introduced in 4.9 and fixed in 4.11-rc3.

The bug was introduced in
commit f7c28833c252031bc68a29e26a18a661797cf3a3
Author: Gu Zheng <guz.fnst@cn.fujitsu.com>
Date: Thu Aug 25 16:35:15 2016 +0800

and was fixed in
commit 2b85b3d22920db7473e5fed5719e7955c0ec323e
Author: Dou Liyang <douly.fnst@cn.fujitsu.com>
Date: Fri Mar 3 16:02:25 2017 +0800

Please test the latest upstream kernel (NOT the distribution one):
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
and compile this one.

I'm closing this one as there has been no response for a while. Please feel free to reopen if Comment 21 doesn't work for you.
Created attachment 254551 [details]
dmesg log

Hello. The exact problem is at the beginning of the dmesg attached to this bug. I noticed this problem right about when the kernel changed from the 4.8 series to the 4.9 series. Right before that, it worked with 4 cores * 2 (for hyperthreading). Right now, /proc/cpuinfo shows 1 core. Before, it showed 8. If noacpi is passed at boot, the system boots with 4 cores (and hyperthreading seems to be off). I will attach the config, acpidump and lspci dump shortly. Let me know if I can provide further information.
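Since the summary points at the APIC IDs listed in the MADT, it can help to look at the table entries directly instead of inferring them from dmesg. The following is a rough sketch, not the kernel's parser: it assumes the ACPI layout of a 44-byte MADT header followed by typed entries, and it only decodes type 0 (Processor Local APIC) entries, ignoring x2APIC and every other entry type. The function name is invented for this illustration. On Linux the raw table is usually readable as root from `/sys/firmware/acpi/tables/APIC`.

```python
import struct

MADT_HEADER_LEN = 44   # 36-byte ACPI header + local APIC address + flags
LOCAL_APIC_TYPE = 0    # "Processor Local APIC" entry, 8 bytes long


def local_apic_entries(madt):
    """Return (acpi_processor_uid, apic_id, enabled) for each type 0 entry."""
    if madt[:4] != b"APIC":
        raise ValueError("not a MADT: signature is not 'APIC'")
    (table_len,) = struct.unpack_from("<I", madt, 4)
    end = min(table_len, len(madt))
    entries = []
    offset = MADT_HEADER_LEN
    while offset + 2 <= end:
        entry_type, entry_len = struct.unpack_from("<BB", madt, offset)
        if entry_len < 2 or offset + entry_len > end:
            break  # malformed or truncated entry; stop instead of looping forever
        if entry_type == LOCAL_APIC_TYPE and entry_len >= 8:
            uid, apic_id, flags = struct.unpack_from("<BBI", madt, offset + 2)
            entries.append((uid, apic_id, bool(flags & 1)))  # bit 0 = enabled
        offset += entry_len
    return entries
```

Calling `local_apic_entries(open("/sys/firmware/acpi/tables/APIC", "rb").read())` on the good and the bad kernel should give the same list, since the firmware table does not change between boots; what differs is how each kernel version maps those entries to logical CPU numbers.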