Bug 211305

Summary: schedutil selects low P-States on AMD EPYC with frequency invariance
Product: Power Management
Reporter: Giovanni Gherdovich (ggherdovich)
Component: cpufreq
Assignee: linux-pm (linux-pm)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: gardotd426, ggherdovich, rjw, rric
Priority: P1
Hardware: All
OS: Linux
Kernel Version: v5.11-rc1
Subsystem:
Regression: No
Bisected commit-id:
Attachments: plot of mpstat activity data
plot of frequency requests from the tracepoint power:cpu_frequency
plot of frequency data from hardware feedback (APERF, MPERF)
plot of PELT root runqueues utilization
cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known
cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known (v2)

Description Giovanni Gherdovich 2021-01-21 00:55:45 UTC
Phoronix.com discovered a severe performance regression on AMD EPYC
with schedutil [see link 1], introduced by the following commits from v5.11-rc1:

    commit 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
    commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")

The problem happens on CPU-bound workloads spanning a large number of cores.
In this case schedutil won't select the maximum P-State. Actually, it's
likely that it will select the minimum one.

[link 1] https://www.phoronix.com/scan.php?page=article&item=linux511-amd-schedutil&num=1

TEST        : Intel Open Image Denoise, www.openimagedenoise.org
INVOCATION  : ./denoise -hdr memorial.pfm -out out.pfm -bench 200 -threads $NTHREADS
CPU         : MODEL            : 2x AMD EPYC 7742
              FREQUENCY TABLE  : P2: 1.50 GHz
                                 P1: 2.00 GHz
                                 P0: 2.25 GHz
              MAX BOOST        :     3.40 GHz
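
For reference, the arithmetic implied by the title of commit 976df7e5730e ("Use midpoint of max_boost and max_P") can be sketched for this CPU as follows. This is only an illustration under stated assumptions: the function names are hypothetical, nominal frequencies are used in place of the APERF/MPERF counters the kernel reads, and the actual kernel code may differ in detail. The value 1024 is the scheduler's fixed-point capacity scale.

```python
# Frequencies from the CPU description above, in MHz.
MAX_P_MHZ = 2250      # P0, the highest non-boost P-State
MAX_BOOST_MHZ = 3400  # advertised maximum boost

SCHED_CAPACITY_SCALE = 1024  # fixed-point "100%" used by the scheduler


def midpoint_reference_mhz(max_p_mhz, max_boost_mhz):
    """Hypothetical helper: midpoint of max_P and max_boost."""
    return (max_p_mhz + max_boost_mhz) / 2


def freq_scale(current_mhz, reference_mhz):
    """Hypothetical helper: scale factor for a CPU running at current_mhz
    when reference_mhz is treated as full speed (simplified model)."""
    return SCHED_CAPACITY_SCALE * current_mhz / reference_mhz


ref = midpoint_reference_mhz(MAX_P_MHZ, MAX_BOOST_MHZ)
print(ref)                           # 2825.0
print(round(freq_scale(2250, ref)))  # 816 (running at P0)
print(round(freq_scale(1500, ref)))  # 544 (running at P2)
```

Under this simplified model, a fully busy CPU running at P0 would be scaled to roughly 80% of the 1024 scale, because the reference frequency lies above P0.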

Results: threads, msecs (ratio). Lower is better.

               v5.10          v5.11-rc4    v5.11-rc4-patch
    -------------------------------------------------------
      1   1069.85 (1.00)   1071.84 (1.00)   1070.42 (1.00)
      2    542.24 (1.00)    544.40 (1.00)    544.48 (1.00)
      4    278.00 (1.00)    278.44 (1.00)    277.72 (1.00)
      8    149.81 (1.00)    149.61 (1.00)    149.87 (1.00)
     16     79.01 (1.00)     79.31 (1.00)     78.94 (1.00)
     24     58.01 (1.00)     58.51 (1.01)     58.15 (1.00)
     32     46.58 (1.00)     48.30 (1.04)     46.66 (1.00)
     48     37.29 (1.00)     51.29 (1.38)     37.27 (1.00)
     64     34.01 (1.00)     49.59 (1.46)     33.71 (0.99)
     80     31.09 (1.00)     44.27 (1.42)     31.33 (1.01)
     96     28.56 (1.00)     40.82 (1.43)     28.47 (1.00)
    112     28.09 (1.00)     40.06 (1.43)     28.63 (1.02)
    120     28.73 (1.00)     39.78 (1.38)     28.14 (0.98)
    128     28.93 (1.00)     39.60 (1.37)     29.38 (1.02)

See how the 128 threads case is almost 40% worse than baseline in v5.11-rc4.
The column v5.11-rc4-patch corresponds to a patch I've just sent to LKML to
address this problem.
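
The ratios in parentheses are each kernel's time divided by the v5.10 time for the same thread count. A minimal sketch of that calculation (the helper name is hypothetical):

```python
def slowdown(test_msecs, baseline_msecs):
    """Ratio of a kernel's run time to the baseline run time, to 2 decimals."""
    return round(test_msecs / baseline_msecs, 2)


# 128 threads: v5.11-rc4 and v5.11-rc4-patch against v5.10
print(slowdown(39.60, 28.93))  # 1.37
print(slowdown(29.38, 28.93))  # 1.02
```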

I'm opening this bugzilla entry to attach a few plots made during the study
of this problem, for lack of a better place to share them.
Comment 1 Giovanni Gherdovich 2021-01-21 00:57:30 UTC
Created attachment 294791 [details]
plot of mpstat activity data

Activity data of good and bad kernel. The plot shows that the test is CPU-bound.
Comment 2 Giovanni Gherdovich 2021-01-21 00:59:07 UTC
Created attachment 294793 [details]
plot of frequency requests from the tracepoint power:cpu_frequency

The tracepoint shows that on the bad kernel schedutil requests almost exclusively the minimum P-State.
Comment 3 Giovanni Gherdovich 2021-01-21 01:00:52 UTC
Created attachment 294795 [details]
plot of frequency data from hardware feedback (APERF, MPERF)

"cpupower monitor" shows that the bad kernel actually runs at the minimum P-State.
Comment 4 Giovanni Gherdovich 2021-01-21 01:03:02 UTC
Created attachment 294797 [details]
plot of PELT root runqueues utilization

The PELT utilization for root runqueues on the bad kernel is about half of what it is on the good kernel (~450 vs ~825).
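
To see why halving the utilization matters, here is a rough sketch of how schedutil maps utilization to a frequency request, using its documented formula next_freq = 1.25 * max_freq * util / max. The assumptions are mine: that max_freq is P0 (2.25 GHz) on this kernel, that the request is resolved to the lowest table frequency at or above the target, and that the helper names are hypothetical. It is not the kernel implementation.

```python
import bisect

FREQ_TABLE_MHZ = [1500, 2000, 2250]  # P2, P1, P0 from the description


def raw_request_mhz(util, max_freq_mhz, max_util=1024):
    """schedutil-style target: 1.25 * max_freq * util / max."""
    return 1.25 * max_freq_mhz * util / max_util


def resolve(freq_table_mhz, target_mhz):
    """Lowest table frequency >= target, clamped to the table maximum."""
    table = sorted(freq_table_mhz)
    i = bisect.bisect_left(table, target_mhz)
    return table[min(i, len(table) - 1)]


for util in (450, 825):
    target = raw_request_mhz(util, max_freq_mhz=2250)
    print(util, round(target), resolve(FREQ_TABLE_MHZ, target))
# 450 1236 1500
# 825 2266 2250
```

Under these assumptions a utilization of ~450 resolves to the minimum P-State and ~825 resolves to P0, which is consistent with what the tracepoint plot in comment 2 shows.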
Comment 5 Giovanni Gherdovich 2021-01-27 21:18:16 UTC
A candidate fix for this problem has been posted to LKML:

https://lore.kernel.org/lkml/20210122204038.3238-1-ggherdovich@suse.cz
Comment 6 Matt McDonald 2021-02-12 17:28:39 UTC
So, the replacement patch from Rafael causes Zen 3 frequency reporting to be ALL jacked up. 

Before the patch, core frequencies in /proc/cpuinfo as well as using tools like nmon seemed accurate. After testing Rafael's patch, my core frequencies are all up around 6 GHz (!), and even external tools like Geekbench report my 5800X's BASE clock as 6.0 GHz (https://browser.geekbench.com/v5/cpu/6466982)

I'm sure this isn't intended behavior. The patch was merged like yesterday into the mainline kernel, so should I file an actual bug report?
Comment 7 rafael 2021-02-12 18:19:41 UTC
On Fri, Feb 12, 2021 at 6:29 PM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=211305
>
> Matt McDonald (gardotd426@gmail.com) changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |gardotd426@gmail.com
>
> --- Comment #6 from Matt McDonald (gardotd426@gmail.com) ---
> So, the replacement patch from Rafael causes Zen 3 frequency reporting to be
> ALL jacked up.
>
> Before the patch, core frequencies in /proc/cpuinfo as well as using tools
> like nmon seemed accurate. After testing Rafael's patch, my core frequencies
> are all up around 6 GHz (!), and even external tools like Geekbench report my
> 5800X's BASE clock as 6.0 GHz (https://browser.geekbench.com/v5/cpu/6466982)
>
> I'm sure this isn't intended behavior.

If the reported frequencies are like that all the time, then it isn't.

What is there in scaling_cur_freq in sysfs if the system is idle?

> The patch was merged like yesterday into
> the mainline kernel, so should I file an actual bug report?

It doesn't particularly matter, because I have seen this comment from you.
Comment 8 Rafael J. Wysocki 2021-02-12 19:12:37 UTC
Created attachment 295255 [details]
cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known

Attached is a tentative fix on top of commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies").

Please give it a go and report back.
Comment 9 Matt McDonald 2021-02-12 20:39:45 UTC
Comment on attachment 295255 [details]
cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known

Yeah sure thing, I'm building now.
Comment 10 Matt McDonald 2021-02-12 21:15:46 UTC
Okay so that's *way* worse. 

Everything's limited and locked to 2.2GHz. And yes, it's actually running at 2.2GHz, it's not misreporting. My Geekbench score was less than a third of what it should be.

cat /proc/cpuinfo | grep MHz
cpu MHz		: 2200.000
cpu MHz		: 2200.088
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2199.982
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2200.000
cpu MHz		: 2200.000


 cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
2199680
2199981
2195932
2199979
2195634
2198726
2199437
2195587
2197662
2198924
2198856
2195535
2196402
2199234
2199880
2195064


analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 2.20 GHz - 6.00 GHz
  available frequency steps:  3.80 GHz, 2.80 GHz, 2.20 GHz
  available cpufreq governors: performance schedutil
  current policy: frequency should be within 2.20 GHz and 2.20 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: 2.20 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: no
    Boost States: 0
    Total States: 3
    Pstate-P0:  1000MHz
    Pstate-P1:  700MHz
    Pstate-P2:  500MHz


Both schedutil and performance governors had no effect. 

But I do see in that cpupower output that it says the hardware limits happen to be 2.20GHz to 6.0GHz.
Comment 11 Matt McDonald 2021-02-12 22:00:00 UTC
Oh, I can also add that the previous patch that was turned down and replaced with this patchset doesn't cause this issue, cpu frequency and frequency reporting work as expected with that patch, and I'm able to boost up to 4750MHz under full load and 5GHz under single-core load.
Comment 12 Rafael J. Wysocki 2021-02-15 13:52:57 UTC
(In reply to Matt McDonald from comment #10)
> Okay so that's *way* worse. 
> 
> Everything's limited and locked to 2.2GHz. And yes, it's actually running at
> 2.2GHz, it's not misreporting. My Geekbench score was less than a third of
> what it should be
> 
> cat /proc/cpuinfo | grep MHz
> cpu MHz               : 2200.000
> cpu MHz               : 2200.088
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2199.982
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000
> cpu MHz               : 2200.000

This actually doesn't mean that the CPUs are running at the given frequency.

> 
>  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
> 2199680
> 2199981
> 2195932
> 2199979
> 2195634
> 2198726
> 2199437
> 2195587
> 2197662
> 2198924
> 2198856
> 2195535
> 2196402
> 2199234
> 2199880
> 2195064

The same goes for this.

> analyzing CPU 0:
>   driver: acpi-cpufreq
>   CPUs which run at the same hardware frequency: 0
>   CPUs which need to have their frequency coordinated by software: 0
>   maximum transition latency:  Cannot determine or is not supported.
>   hardware limits: 2.20 GHz - 6.00 GHz
>   available frequency steps:  3.80 GHz, 2.80 GHz, 2.20 GHz
>   available cpufreq governors: performance schedutil
>   current policy: frequency should be within 2.20 GHz and 2.20 GHz.
>                   The governor "schedutil" may decide which speed to use
>                   within this range.
>   current CPU frequency: 2.20 GHz (asserted by call to hardware)
>   boost state support:
>     Supported: yes
>     Active: no
>     Boost States: 0
>     Total States: 3
>     Pstate-P0:  1000MHz
>     Pstate-P1:  700MHz
>     Pstate-P2:  500MHz
> 
> 
> Both schedutil and performance governors had no effect. 
> 
> But I do see in that cpupower output that it says the hardware limits happen
> to be 2.20GHz to 6.0GHz.

That's as expected.
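
Since the reported values need not match the frequency the cores actually run at, hardware feedback of the kind used in comment 3 (APERF, MPERF) is one way to cross-check. A minimal sketch of the usual calculation, assuming MPERF counts at a fixed reference frequency while the core is not idle and APERF counts at the actual frequency; the helper name and the counter deltas below are made up for illustration:

```python
def effective_mhz(reference_mhz, delta_aperf, delta_mperf):
    """Average frequency over a sampling interval while the core was busy.

    reference_mhz: the fixed rate at which MPERF counts (assumed known).
    delta_aperf, delta_mperf: counter differences over the same interval.
    """
    if delta_mperf <= 0:
        raise ValueError("MPERF did not advance; the core may have been idle")
    return reference_mhz * delta_aperf / delta_mperf


# Hypothetical deltas: APERF advanced 2/3 as much as MPERF.
print(effective_mhz(2250, 2_000_000, 3_000_000))  # 1500.0
```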
Comment 13 Rafael J. Wysocki 2021-02-15 13:54:45 UTC
(In reply to Matt McDonald from comment #11)
> Oh, I can also add that the previous patch that was turned down and replaced
> with this patchset doesn't cause this issue, cpu frequency and frequency
> reporting work as expected with that patch, and I'm able to boost up to
> 4750MHz under full load and 5GHz under single-core load.

So what do you see in /proc/cpuinfo and scaling_cur_freq with commits 3c55e94c0ade and d11a1d08a082 reverted?
Comment 14 Rafael J. Wysocki 2021-02-15 14:04:26 UTC
Also can you please enable dynamic debug in freq_table.c, unload acpi-cpufreq, load it again and attach the output of dmesg?
Comment 15 Rafael J. Wysocki 2021-02-15 14:43:56 UTC
Created attachment 295295 [details]
cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known (v2)

I found a mistake in the previous version of the fix: it didn't initialize policy->max properly.

Please test this one instead. There is no need to provide the information requested in the previous comments (at least not ATM).

Thanks!
Comment 16 Matt McDonald 2021-02-15 14:47:06 UTC
Haha I'd just typed my response and bugzilla stopped me from submitting. That's a cool feature. 

Yeah, I'll build and test now.
Comment 17 Matt McDonald 2021-02-15 15:09:29 UTC
That does seem to have fixed it:

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
4854354
3823787
3647266
4016171
3576030
3974600
3816628
3590646
3919312
3626692
3618178
3597246
4367040
3599805
3837612
3874146

cat /proc/cpuinfo | grep MHz
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 4193.751
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000
cpu MHz		: 3800.000

sudo cpupower frequency-info
[sudo] password for matt:
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 2.20 GHz - 6.00 GHz
  available frequency steps:  3.80 GHz, 2.80 GHz, 2.20 GHz
  available cpufreq governors: performance schedutil
  current policy: frequency should be within 2.20 GHz and 3.80 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: 3.80 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: no
    Boost States: 0
    Total States: 3
    Pstate-P0:  1000MHz
    Pstate-P1:  700MHz
    Pstate-P2:  500MHz


Everything is back to how it should be, only now with presumably better schedutil performance (I'll run some benchmarks later). No 6.0GHz reporting and no being stuck at 2.20GHz. CPU performance under the "performance" governor is back to where it should be, and I'm boosting up to 4.9-5.0 GHz in single-core and 4.8 GHz all-core.
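
For anyone repeating this check, a small sketch that reads the same scaling_cur_freq files as the cat command above and reports the spread in GHz. The sysfs values are in kHz; the helper names are hypothetical.

```python
import glob

SYSFS_PATTERN = "/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"


def read_cur_freqs_khz(pattern=SYSFS_PATTERN):
    """Read every matching scaling_cur_freq file; values are in kHz."""
    values = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            values.append(int(f.read().strip()))
    return values


def summarize_ghz(khz_values):
    """Return (min, max) of the given kHz readings, converted to GHz."""
    ghz = [v / 1e6 for v in khz_values]
    return min(ghz), max(ghz)


if __name__ == "__main__":
    freqs = read_cur_freqs_khz()
    if freqs:  # empty on systems without cpufreq sysfs entries
        low, high = summarize_ghz(freqs)
        print(f"{len(freqs)} CPUs: {low:.2f} GHz - {high:.2f} GHz")
```

With the readings from comment 10 this would show every CPU at about 2.20 GHz, while the readings above range from about 3.58 GHz to 4.85 GHz.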
Comment 18 Rafael J. Wysocki 2021-02-15 15:30:32 UTC
OK, thanks for testing!

Let me post the last patch for verification.