Bug 218109
| Summary: | [AMD] [RYZEN] Binned down CPUs are reported as offline | | |
|---|---|---|---|
| Product: | Linux | Reporter: | Peter Jung (admin) |
| Component: | Kernel | Assignee: | Virtual assignee for kernel bugs (linux-kernel) |
| Status: | NEW | | |
| Severity: | normal | CC: | admin, mario.limonciello, void |
| Priority: | P3 | | |
| Hardware: | AMD | | |
| OS: | Linux | | |
| Kernel Version: | 6.6 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | acpidump | | |
Description
Peter Jung
2023-11-06 19:34:26 UTC
Would you mind sharing an acpidump for your system? I think this is actually a BIOS bug, not a kernel bug.

Created attachment 305373
acpidump

Hi Mario,

I have attached one. Since we know a bunch of people who have this issue on their Zen3 CPU, it could then also affect several BIOSes.

> [008h 0008 1] Revision : 04

It appears that the MADT in this BIOS conforms to FADT Major version 04, no minor version specified.
https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/05_ACPI_Software_Programming_Model/ACPI_Software_Programming_Model.html#fixed-acpi-description-table-fadt

> $ grep "Runtime Online Capable : 0" apic.dsl | wc -l
32

None of the processors are declared as runtime online capable (hotpluggable).

> $ grep "Enabled : 1" apic.dsl | wc -l
24
> $ grep "Enabled : 0" apic.dsl | wc -l
8

You can see that your /proc/cpuinfo matches the 12 cores, but the BIOS has entries for another 4 cores, which makes Linux advertise this incorrectly.

The way this stuff all works is that there is a function acpi_is_processor_usable() which looks at the FADT version to decide whether to trust the values in the MADT. This is because the bit for "Runtime Online Capable" wasn't introduced until FADT major 6 minor 3.
https://github.com/torvalds/linux/blob/d2f51b3516dade79269ff45eae2a7668ae711b25/arch/x86/kernel/acpi/boot.c#L200

Here is the ACPI spec location for it:
https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/05_ACPI_Software_Programming_Model/ACPI_Software_Programming_Model.html#processor-local-apic-structure

> I have attached one. Since we know a bunch of people who have this issue on their Zen3 CPU, it could then also affect several BIOSes.

Is this just a cosmetic reporting issue, or is there a functional Linux problem because of this BIOS bug?

> This issue was for a short time fixed in 6.2 or 6.3 (im not sure)

The behavior changed after kernel 6.3 because a regression was reported: requiring ACPI 6.3 or later broke hotplug CPUs. This is the commit that changed the behavior.
https://github.com/torvalds/linux/commit/fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c

Sorry I made one mistake.
This is the field that has FADT major/minor. You have 6.2, not 4.0.

[008h 0008 1] Revision : 06
[083h 0131 1] FADT Minor Revision : 02

Hi Mario,

Thank you very much for your explanation and going into the details.

I am not sure how much this could affect the general Linux system. The only regression I found that happened because of this issue was in sched-ext, but this is nothing in the official kernel. You can follow the issue here:
https://github.com/sched-ext/sched_ext/issues/69

Setting nr_cpus=24 fixed this issue, and hetjun has also implemented a fix for this in "scx_rusty".

But the question is how much this false reporting affects the RCU subsystem, watchdog and percpualloc. (There could be others as well; these are the ones I know of that report the threads as 32 instead of 24.)

> Thank you very much for your explanation and going into the details.

One more thing: you might want to read this original bug report to better understand how we got where we are.
https://lore.kernel.org/all/20230327191026.3454-1-eric.devolder@oracle.com/

> The only regression I found that happened because of this issue was in
> sched-ext, but this is nothing in the official kernel. You can follow the
> issue here:

I'll leave a comment.

> But the question is how much this false reporting affects the RCU
> subsystem, watchdog and percpualloc. (There could be others as well; these
> are the ones I know of that report the threads as 32 instead of 24.)

My expectation is that it shouldn't break other things and it's just cosmetic, but if you find otherwise please mention.

Hi Mario,

I don't think I'd say this is just cosmetic. The kernel looks at the possible cpumap in a bunch of places, and does all sorts of allocations, work, etc. based on what it thinks are CPUs that could someday come online. Wouldn't this affect any place where cpu_possible_mask is referenced?
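
For readers following the numbers in this thread, here is a rough Python model of the decision described above for acpi_is_processor_usable(). It is only a sketch of the behavior as explained in the comments, not the kernel's C code: the names `LapicEntry`, `online_capable_supported`, `is_processor_usable` and `count_possible` are hypothetical, and the rule "a disabled entry still counts as possible when the FADT is older than 6.3" is taken from the explanation in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LapicEntry:
    """One MADT Processor Local APIC entry (hypothetical model, two flags only)."""
    enabled: bool
    online_capable: bool


def online_capable_supported(fadt_major: int, fadt_minor: int) -> bool:
    # Per the thread: the "Runtime Online Capable" bit only exists from
    # FADT major 6 minor 3 onward.
    return (fadt_major, fadt_minor) >= (6, 3)


def is_processor_usable(entry: LapicEntry, fadt_major: int, fadt_minor: int) -> bool:
    # Enabled entries always count.
    if entry.enabled:
        return True
    # Assumption from the thread: with an older FADT the bit cannot be
    # trusted, so a disabled entry is treated as a CPU that may appear later.
    if not online_capable_supported(fadt_major, fadt_minor):
        return True
    # With FADT >= 6.3, a disabled entry counts only if it is online capable.
    return entry.online_capable


def count_possible(entries, fadt_major: int, fadt_minor: int) -> int:
    return sum(is_processor_usable(e, fadt_major, fadt_minor) for e in entries)


# The MADT from the attached acpidump, per the grep output above:
# 24 enabled entries, 8 disabled, none marked online capable.
entries = [LapicEntry(True, False)] * 24 + [LapicEntry(False, False)] * 8

print(count_possible(entries, 6, 2))  # FADT 6.2, as reported by this BIOS
print(count_possible(entries, 6, 3))  # hypothetical: same MADT with FADT 6.3
```

Under these assumptions the FADT 6.2 case yields 32 possible CPUs and the hypothetical FADT 6.3 case yields 24, which matches the 32 versus 24 discrepancy reported here. On a running system, the same gap should be visible by comparing /sys/devices/system/cpu/possible with /sys/devices/system/cpu/online.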