Phoronix.com discovered a severe performance regression on AMD EPYC with the
schedutil governor [see link 1], introduced by the following commits from
v5.11-rc1:

    commit 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
    commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")

The problem happens on CPU-bound workloads spanning a large number of cores.
In this case schedutil won't select the maximum P-State. Actually, it's likely
that it will select the minimum one.

[link 1] https://www.phoronix.com/scan.php?page=article&item=linux511-amd-schedutil&num=1

TEST        : Intel Open Image Denoise, www.openimagedenoise.org
INVOCATION  : ./denoise -hdr memorial.pfm -out out.pfm -bench 200 -threads $NTHREADS
CPU         :
  MODEL           : 2x AMD EPYC 7742
  FREQUENCY TABLE : P2: 1.50 GHz
                    P1: 2.00 GHz
                    P0: 2.25 GHz
  MAX BOOST       :     3.40 GHz

Results: threads, msecs (ratio). Lower is better.

threads      v5.10              v5.11-rc4          v5.11-rc4-patch
-------------------------------------------------------------------
      1      1069.85 (1.00)     1071.84 (1.00)     1070.42 (1.00)
      2       542.24 (1.00)      544.40 (1.00)      544.48 (1.00)
      4       278.00 (1.00)      278.44 (1.00)      277.72 (1.00)
      8       149.81 (1.00)      149.61 (1.00)      149.87 (1.00)
     16        79.01 (1.00)       79.31 (1.00)       78.94 (1.00)
     24        58.01 (1.00)       58.51 (1.01)       58.15 (1.00)
     32        46.58 (1.00)       48.30 (1.04)       46.66 (1.00)
     48        37.29 (1.00)       51.29 (1.38)       37.27 (1.00)
     64        34.01 (1.00)       49.59 (1.46)       33.71 (0.99)
     80        31.09 (1.00)       44.27 (1.42)       31.33 (1.01)
     96        28.56 (1.00)       40.82 (1.43)       28.47 (1.00)
    112        28.09 (1.00)       40.06 (1.43)       28.63 (1.02)
    120        28.73 (1.00)       39.78 (1.38)       28.14 (0.98)
    128        28.93 (1.00)       39.60 (1.37)       29.38 (1.02)

See how the 128 threads case is almost 40% worse than baseline in v5.11-rc4.

The column v5.11-rc4-patch corresponds to a patch I've just sent to LKML to
address this problem. I'm opening this bugzilla entry to attach a few plots
made during the study of this problem, for lack of a better place to share
them.
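For reference, the ratios in parentheses in the results table appear to be each kernel's runtime divided by the v5.10 runtime for the same thread count. A minimal sketch of that arithmetic, using the 128-thread row (the helper name `ratio` is only illustrative):

```python
# Timings in msecs, copied from the 128-thread row of the results table.
baseline_ms = 28.93    # v5.10
regressed_ms = 39.60   # v5.11-rc4
patched_ms = 29.38     # v5.11-rc4-patch


def ratio(ms, baseline):
    """Runtime relative to the baseline, rounded to two decimals (lower is better)."""
    return round(ms / baseline, 2)


if __name__ == "__main__":
    print("v5.11-rc4       :", ratio(regressed_ms, baseline_ms))
    print("v5.11-rc4-patch :", ratio(patched_ms, baseline_ms))
```

A ratio of 1.37 means the run took about 37% longer than on v5.10.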
Created attachment 294791 [details]
plot of mpstat activity data

Activity data of good and bad kernel. The plot shows that the test is CPU-bound.
Created attachment 294793 [details]
plot of frequency requests from the tracepoint power:cpu_frequency

The tracepoint shows that on the bad kernel schedutil requests the minimum P-State almost exclusively.
Created attachment 294795 [details]
plot of frequency data from hardware feedback (APERF, MPERF)

"cpupower monitor" shows that the bad kernel actually runs at the minimum P-State.
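As background on the hardware feedback mentioned here: the average frequency over an interval is commonly estimated as a reference frequency scaled by the ratio of the APERF and MPERF counter deltas. The sketch below illustrates that estimate only. The counter deltas are made-up numbers, and using P0 (2.25 GHz) as the reference frequency for this CPU is an assumption, not something taken from the plots.

```python
def effective_freq_ghz(delta_aperf, delta_mperf, ref_freq_ghz):
    """Estimate the average non-idle frequency over a sampling interval.

    delta_aperf: APERF increase over the interval (ticks at the actual clock)
    delta_mperf: MPERF increase over the interval (ticks at the reference clock)
    ref_freq_ghz: reference frequency MPERF counts at (assumed to be P0 here)
    """
    if delta_mperf == 0:
        # The CPU was idle for the whole interval; no estimate is possible.
        return 0.0
    return ref_freq_ghz * delta_aperf / delta_mperf


if __name__ == "__main__":
    # Hypothetical deltas chosen so that the result matches P2 (1.50 GHz).
    print(effective_freq_ghz(1_500_000, 2_250_000, 2.25))
```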
Created attachment 294797 [details]
plot of PELT root runqueues utilization

The PELT utilization for root runqueues on the bad kernel is roughly half of what it was on the good kernel (~450 vs ~825).
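To see why halving the utilization matters, here is a rough sketch of how a schedutil-style governor turns utilization into a frequency request. It assumes the target is about 1.25 * max_freq * util / max_capacity, that utilization is on a 0..1024 scale, that max_freq is P0 (2.25 GHz), and that the request is rounded up to the lowest available P-State at or above the target. All of these are assumptions made for illustration, not values read from the plots or the driver.

```python
# P-States of the EPYC 7742 from the first comment, lowest first (P2, P1, P0).
PSTATES_GHZ = [1.50, 2.00, 2.25]
MAX_CAPACITY = 1024  # assumed full-scale utilization value


def target_freq_ghz(util, max_freq_ghz=2.25):
    """Frequency request with 25% headroom over the utilization-proportional value."""
    return 1.25 * max_freq_ghz * util / MAX_CAPACITY


def pick_pstate_ghz(util):
    """Lowest P-State at or above the target, clamped to the highest one."""
    target = target_freq_ghz(util)
    for freq in PSTATES_GHZ:
        if freq >= target:
            return freq
    return PSTATES_GHZ[-1]


if __name__ == "__main__":
    for util in (450, 825):
        print(util, round(target_freq_ghz(util), 2), pick_pstate_ghz(util))
```

Under these assumptions a utilization of ~450 asks for about 1.24 GHz, which resolves to P2 (1.50 GHz), while ~825 asks for about 2.27 GHz, which resolves to P0 (2.25 GHz). That is consistent with the tracepoint plot showing mostly minimum P-State requests on the bad kernel.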
A candidate fix for this problem has been posted to LKML: https://lore.kernel.org/lkml/20210122204038.3238-1-ggherdovich@suse.cz
So, the replacement patch from Rafael causes Zen 3 frequency reporting to be ALL jacked up.

Before the patch, core frequencies in /proc/cpuinfo as well as in tools like nmon seemed accurate. After testing Rafael's patch, my core frequencies are all up around 6 GHz (!), and even external tools like Geekbench report my 5800X's BASE clock as 6.0 GHz (https://browser.geekbench.com/v5/cpu/6466982).

I'm sure this isn't intended behavior. The patch was merged like yesterday into the mainline kernel, so should I file an actual bug report?
On Fri, Feb 12, 2021 at 6:29 PM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=211305
>
> Matt McDonald (gardotd426@gmail.com) changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |gardotd426@gmail.com
>
> --- Comment #6 from Matt McDonald (gardotd426@gmail.com) ---
> So, the replacement patch from Rafael causes Zen 3 frequency reporting to be
> ALL jacked up.
>
> Before the patch, core frequencies in /proc/cpuinfo as well as using tools
> like nmon seemed accurate. After testing Rafael's patch, my core frequencies
> are all up around 6 GHz (!), and even external tools like Geekbench report my
> 5800X's BASE clock as 6.0 GHz (https://browser.geekbench.com/v5/cpu/6466982)
>
> I'm sure this isn't intended behavior.

If the reported frequencies are like that all the time, then it isn't.

What is there in scaling_cur_freq in sysfs if the system is idle?

> The patch was merged like yesterday into
> the mainline kernel, so should I file an actual bug report?

It doesn't particularly matter, because I have seen this comment from you.
Created attachment 295255 [details]
cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known

Attached is a tentative fix on top of commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies").

Please give it a go and report back.
Comment on attachment 295255 [details]
cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known

Yeah sure thing, I'm building now.
Okay so that's *way* worse.

Everything's limited and locked to 2.2GHz. And yes, it's actually running at 2.2GHz, it's not misreporting. My Geekbench score was less than a third of what it should be.

cat /proc/cpuinfo | grep MHz
cpu MHz : 2200.000
cpu MHz : 2200.088
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2199.982
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000
cpu MHz : 2200.000

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
2199680
2199981
2195932
2199979
2195634
2198726
2199437
2195587
2197662
2198924
2198856
2195535
2196402
2199234
2199880
2195064

analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 2.20 GHz - 6.00 GHz
  available frequency steps: 3.80 GHz, 2.80 GHz, 2.20 GHz
  available cpufreq governors: performance schedutil
  current policy: frequency should be within 2.20 GHz and 2.20 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: 2.20 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: no
    Boost States: 0
    Total States: 3
    Pstate-P0: 1000MHz
    Pstate-P1: 700MHz
    Pstate-P2: 500MHz

Both schedutil and performance governors had no effect.

But I do see in that cpupower output that it says the hardware limits happen to be 2.20GHz to 6.0GHz.
Oh, I can also add that the previous patch that was turned down and replaced with this patchset doesn't cause this issue, cpu frequency and frequency reporting work as expected with that patch, and I'm able to boost up to 4750MHz under full load and 5GHz under single-core load.
(In reply to Matt McDonald from comment #10)
> Okay so that's *way* worse.
>
> Everything's limited and locked to 2.2GHz. And yes, it's actually running at
> 2.2GHz, it's not misreporting. My Geekbench score was less than a third of
> what it should be
>
> cat /proc/cpuinfo | grep MHz
> cpu MHz : 2200.000
> cpu MHz : 2200.088
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2199.982
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2200.000
> cpu MHz : 2200.000

This actually doesn't mean that the CPUs are running at the given frequency.

>
> cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
> 2199680
> 2199981
> 2195932
> 2199979
> 2195634
> 2198726
> 2199437
> 2195587
> 2197662
> 2198924
> 2198856
> 2195535
> 2196402
> 2199234
> 2199880
> 2195064

And neither does this.

> analyzing CPU 0:
>   driver: acpi-cpufreq
>   CPUs which run at the same hardware frequency: 0
>   CPUs which need to have their frequency coordinated by software: 0
>   maximum transition latency: Cannot determine or is not supported.
>   hardware limits: 2.20 GHz - 6.00 GHz
>   available frequency steps: 3.80 GHz, 2.80 GHz, 2.20 GHz
>   available cpufreq governors: performance schedutil
>   current policy: frequency should be within 2.20 GHz and 2.20 GHz.
>                   The governor "schedutil" may decide which speed to use
>                   within this range.
>   current CPU frequency: 2.20 GHz (asserted by call to hardware)
>   boost state support:
>     Supported: yes
>     Active: no
>     Boost States: 0
>     Total States: 3
>     Pstate-P0: 1000MHz
>     Pstate-P1: 700MHz
>     Pstate-P2: 500MHz
>
>
> Both schedutil and performance governors had no effect.
>
> But I do see in that cpupower output that it says the hardware limits happen
> to be 2.20GHz to 6.0GHz.

That's as expected.
(In reply to Matt McDonald from comment #11)
> Oh, I can also add that the previous patch that was turned down and replaced
> with this patchset doesn't cause this issue, cpu frequency and frequency
> reporting work as expected with that patch, and I'm able to boost up to
> 4750MHz under full load and 5GHz under single-core load.

So what do you see in /proc/cpuinfo and scaling_cur_freq with commits 3c55e94c0ade and d11a1d08a082 reverted?
Also can you please enable dynamic debug in freq_table.c, unload acpi-cpufreq, load it again and attach the output of dmesg?
Created attachment 295295 [details]
cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known (v2)

I found a mistake in the previous version of the fix patch: it didn't initialize policy->max properly.

Please test this one instead. There is no need to provide the information requested in the previous comments (at least not ATM).

Thanks!
Haha I'd just typed my response and bugzilla stopped me from submitting. That's a cool feature. Yeah, I'll build and test now.
That does seem to have fixed it:

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
4854354
3823787
3647266
4016171
3576030
3974600
3816628
3590646
3919312
3626692
3618178
3597246
4367040
3599805
3837612
3874146

cat /proc/cpuinfo | grep MHz
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 4193.751
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000
cpu MHz : 3800.000

sudo cpupower frequency-info
[sudo] password for matt:
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 2.20 GHz - 6.00 GHz
  available frequency steps: 3.80 GHz, 2.80 GHz, 2.20 GHz
  available cpufreq governors: performance schedutil
  current policy: frequency should be within 2.20 GHz and 3.80 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: 3.80 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: no
    Boost States: 0
    Total States: 3
    Pstate-P0: 1000MHz
    Pstate-P1: 700MHz
    Pstate-P2: 500MHz

Everything is back to how it should be, only now with presumably better schedutil performance (I'll run some benchmarks later). No 6.0GHz reporting and no being stuck at 2.20GHz. CPU performance under the "performance" governor is back to where it should be, and I'm boosting up to 4.9-5.0 GHz in single core and 4.8 GHz all-core.
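In case it helps anyone else compare kernels, here is a small sketch that reads the same scaling_cur_freq files and reports the lowest and highest value in GHz. The sysfs path is the one used above and the values are in kHz. The function names are made up for this example, and `SAMPLE_KHZ` is simply the output pasted in this comment.

```python
import glob

SYSFS_PATTERN = "/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"

# scaling_cur_freq values (kHz) as pasted in this comment.
SAMPLE_KHZ = [
    4854354, 3823787, 3647266, 4016171, 3576030, 3974600, 3816628, 3590646,
    3919312, 3626692, 3618178, 3597246, 4367040, 3599805, 3837612, 3874146,
]


def read_cur_freqs_khz(pattern=SYSFS_PATTERN):
    """Read scaling_cur_freq for every CPU that exposes it (values are in kHz)."""
    freqs = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            freqs.append(int(f.read().strip()))
    return freqs


def min_max_ghz(freqs_khz):
    """Return (lowest, highest) frequency in GHz for a list of kHz values."""
    ghz = [khz / 1e6 for khz in freqs_khz]
    return min(ghz), max(ghz)


if __name__ == "__main__":
    # Fall back to the pasted sample when the sysfs files are not available.
    freqs = read_cur_freqs_khz() or SAMPLE_KHZ
    low, high = min_max_ghz(freqs)
    print(f"min {low:.2f} GHz, max {high:.2f} GHz")
```

A range that collapses to a single value (as with the 2.2 GHz readings earlier) or sits far above the rated boost clock would be a quick hint that something is off.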
OK, thanks for testing! Let me post the last patch for verification.