Bug 219023 - Slow ethernet TX speed when intel_pstate is used on Jasper Lake
Status: NEW
Alias: None
Product: Power Management
Classification: Unclassified
Component: intel_pstate
Hardware: Intel Linux
Importance: P3 normal
Assignee: Kristen
Reported: 2024-07-10 04:33 UTC by Kai-Heng Feng
Modified: 2024-07-19 11:18 UTC

Regression: No

Attachments
dmesg with intel_pstate.dyndbg, hwp enabled (66.19 KB, text/plain)
2024-07-10 04:34 UTC, Kai-Heng Feng
dmesg with intel_pstate.dyndbg, no_hwp (64.61 KB, text/plain)
2024-07-10 04:34 UTC, Kai-Heng Feng

Description Kai-Heng Feng 2024-07-10 04:33:02 UTC
CPU: Intel(R) Celeron(R) N5105 @ 2.00GHz (Jasper Lake)
Ethernet: Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

The TX speed of iperf3 is ~750Mbps.

When the system is booted with "intel_pstate=no_hwp", the TX speed becomes ~950Mbps, which is expected.

The difference is introduced by dma_sync_single_for_{cpu,device}() in r8169's rtl_rx() routine. The time spent in the dma_sync helpers is much longer when HWP is used.
Comment 1 Kai-Heng Feng 2024-07-10 04:34:15 UTC
Created attachment 306555
dmesg with intel_pstate.dyndbg, hwp enabled
Comment 2 Kai-Heng Feng 2024-07-10 04:34:41 UTC
Created attachment 306556
dmesg with intel_pstate.dyndbg, no_hwp
Comment 3 Srinivas Pandruvada 2024-07-10 17:41:09 UTC
HWP is not implemented in the kernel, so the kernel cannot do anything about this. The only option is to change the default energy_performance_preference.
Try setting different energy_performance_preference values from cpufreq sysfs.
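
For illustration only, a minimal sketch of that suggestion: it reads and sets the EPP of every cpufreq policy through sysfs. It assumes the usual /sys/devices/system/cpu/cpufreq/policyN layout; the helper names read_epp and set_epp are made up for this example, and writing the real files needs root.

```python
from pathlib import Path

# Assumed default location of the cpufreq policies on a typical system.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu/cpufreq")
EPP_FILE = "energy_performance_preference"


def read_epp(root=CPUFREQ_ROOT):
    """Return {policy name: current EPP string} for every policy that exposes EPP."""
    result = {}
    for policy in sorted(Path(root).glob("policy[0-9]*")):
        epp_file = policy / EPP_FILE
        if epp_file.exists():
            result[policy.name] = epp_file.read_text().strip()
    return result


def set_epp(value, root=CPUFREQ_ROOT):
    """Write `value` as the EPP of every policy and return the policies changed."""
    changed = []
    for policy in sorted(Path(root).glob("policy[0-9]*")):
        epp_file = policy / EPP_FILE
        if epp_file.exists():
            epp_file.write_text(value)
            changed.append(policy.name)
    return changed
```

For example, set_epp("balance_power") followed by read_epp() should report the new value for each policy, provided the value is one of those listed in energy_performance_available_preferences.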
Comment 4 Kai-Heng Feng 2024-07-11 01:26:03 UTC
$ grep . policy*/energy_performance_preference
policy0/energy_performance_preference:performance
policy1/energy_performance_preference:performance
policy2/energy_performance_preference:performance

$ grep . policy*/scaling_governor 
policy0/scaling_governor:performance
policy1/scaling_governor:performance
policy2/scaling_governor:performance
policy3/scaling_governor:performance

$ iperf3 -c 192.168.2.1 -t 86400
Connecting to host 192.168.2.1, port 5201
[  5] local 192.168.2.205 port 48828 connected to 192.168.2.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.7 MBytes   761 Mbits/sec    0    273 KBytes       
[  5]   1.00-2.00   sec  90.0 MBytes   755 Mbits/sec    0    273 KBytes

So changing the EPP and governor doesn't make any difference.
Comment 5 Kai-Heng Feng 2024-07-15 03:47:03 UTC
The same issue can be reproduced on Elkhart Lake when HWP is enabled.
Comment 6 Srinivas Pandruvada 2024-07-16 05:23:58 UTC
I think the workaround here should be in the RTL driver. The HWP algorithms on these older platforms can't be modified.
Comment 7 Heiner Kallweit 2024-07-16 08:37:26 UTC
(In reply to Srinivas Pandruvada from comment #6)
> I think here the workaround should be in RTL driver. The HWP algorithms on
> these older platforms can't be modified.

See Kai-Heng's initial post:
Issue is with dma_sync_single_for_{cpu,device}(), not with r8169 driver code.
So every user of dma_sync_single_for_{cpu,device}() may be affected.
r8169 is just one example here.

And as you stated "the workaround": Which workaround do you propose?
Comment 8 Srinivas Pandruvada 2024-07-16 10:21:11 UTC
(In reply to Heiner Kallweit from comment #7)
> (In reply to Srinivas Pandruvada from comment #6)
> > I think here the workaround should be in RTL driver. The HWP algorithms on
> > these older platforms can't be modified.
> 
> See Kai-Heng's initial post:
> Issue is with dma_sync_single_for_{cpu,device}(), not with r8169 driver code.
> So every user of dma_sync_single_for_{cpu,device}() may be affected.
> r8169 is just one example here.
> 
> And as you stated "the workaround": Which workaround do you propose?

I suppose this is a regression from previous kernels, so something changed in the kernel. Based on the comment, I thought the RTL driver added the dma_sync_single_for_{cpu,device}() calls, which triggered the issue.

If changing to the performance governor didn't fix it, then running at max frequency all the time is not helping (so the speed is not limited by the CPU keeping frequencies low). If the system is power limited, try the default intel_pstate powersave governor with EPP balance_power.
Comment 9 Kai-Heng Feng 2024-07-17 01:05:22 UTC
> I suppose this is a regression from previous kernels.
Not really. The issue can be seen on all old kernels that can boot up.

> If changing to perf governor didn't fix, then max frequency all the time is
> not helping (So speed is not limited because CPU kept frequencies low). If
> power limited then try default intel_pstate  powersave governor with EPP
> balance_power.
Actually, max frequency all the time works by writing 0 to /dev/cpu_dma_latency.

I am not saying that the fix needs to be in intel_pstate or HWP, but we need to understand why and how HWP makes dma_sync_single much slower, so a proper fix can be implemented.
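
For readers unfamiliar with the /dev/cpu_dma_latency workaround mentioned above, a minimal sketch follows. It relies on the documented PM QoS behaviour that the request is a signed 32-bit value in microseconds and stays active only while the file descriptor is open. The name cpu_latency_limit is hypothetical, and opening the real device needs root.

```python
import os
import struct
from contextlib import contextmanager


@contextmanager
def cpu_latency_limit(max_latency_us=0, path="/dev/cpu_dma_latency"):
    """Hold a PM QoS CPU latency request for as long as the with-block runs."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # The request is written as a native-endian signed 32-bit integer.
        os.write(fd, struct.pack("=i", max_latency_us))
        yield fd
    finally:
        # Closing the descriptor drops the request again.
        os.close(fd)
```

Running the iperf3 test inside `with cpu_latency_limit(0):` would keep the 0 us request in place for the duration of the measurement and release it afterwards.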
Comment 10 Srinivas Pandruvada 2024-07-17 02:05:11 UTC

(In reply to Kai-Heng Feng from comment #9)
> > I suppose this is a regression from previous kernels.
> Not really. The issue can be seen on all old kernels that can boot up.
> 
> > If changing to perf governor didn't fix, then max frequency all the time is
> > not helping (So speed is not limited because CPU kept frequencies low). If
> > power limited then try default intel_pstate  powersave governor with EPP
> > balance_power.
> Actually, max frequency all the time works by writing 0 to
> /dev/cpu_dma_latency.
> 

You said the performance governor didn't fix it, so max frequency is not the fix. Entering C-states is the problem: wakeup latency is affecting the speed because interrupt processing is delayed.


> I am not saying that the fix needs to be in intel_pstate or HWP, but we need
> to understand why and how HWP makes dma_sync_single much slower, so a proper
> fix can be implemented.

Try to run turbostat with and without.
Comment 11 Kai-Heng Feng 2024-07-17 04:08:09 UTC
> You said performance governor didn't fix. So max frequency is not fixing.
> Going to C-states is a problem. So wakeup latency is affecting the speed as
> there will be delay in interrupt processing.

Not sure if it's related to C-states, at least intel_idle.max_cstate=0 doesn't help.

>Try to run turbostat with and without.

Same bad network performance with or without turbostat running.
Comment 12 Srinivas Pandruvada 2024-07-17 06:14:03 UTC
(In reply to Kai-Heng Feng from comment #11)
> > You said performance governor didn't fix. So max frequency is not fixing.
> > Going to C-states is a problem. So wakeup latency is affecting the speed as
> > there will be delay in interrupt processing.
> 
> Not sure if it's related to C-states, at least intel_idle.max_cstate=0
> doesn't help.
> 
It is probably using acpi_idle.

Check:
cat /sys/devices/system/cpu/cpuidle/current_driver 



> >Try to run turbostat with and without.
> 
> Same bad network performance with or without turbostat running.

You mean running turbostat causes bad network performance with or without HWP?
Comment 13 Kai-Heng Feng 2024-07-17 06:30:15 UTC
$ cat /sys/devices/system/cpu/cpuidle/current_driver 
intel_idle

If 'intel_idle.max_cstate=0' is used, it's disabled and acpi_idle is in use.


> You mean running turbostat causes bad network performance with or without
> HWP?
OK, I think I misunderstood what you meant. Do you mean to run turbostat with and without HWP? What value should be observed?
Comment 14 Srinivas Pandruvada 2024-07-17 11:44:03 UTC
(In reply to Kai-Heng Feng from comment #13)
> $ cat /sys/devices/system/cpu/cpuidle/current_driver 
> intel_idle
> 
> If 'intel_idle.max_cstate=0' is used, it's disabled and acpi_idle is in use.
> 
So your DMA latency program is preventing entry into C-states, which gives you a faster interrupt response time. At the same time, it will also run at a higher frequency because there is no idle time.

> 
> > You mean running turbostat causes bad network performance with or without
> > HWP?
> OK, I think I misunderstood what you meant. Do you mean to run turbostat
> with and without HWP? What value should be observed?

I think you will see less C-state residency without HWP.
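
If turbostat is not at hand, roughly the same comparison can be sketched from the cpuidle sysfs counters. The snippet below is only an illustration: it assumes the standard cpuN/cpuidle/stateM files (name, latency, time, usage), and the helper names read_cstates and residency_delta are invented for this example.

```python
from pathlib import Path


def read_cstates(cpu=0, root="/sys/devices/system/cpu"):
    """Return {state name: counters} for one CPU from the cpuidle sysfs tree."""
    states = {}
    base = Path(root) / f"cpu{cpu}" / "cpuidle"
    for state in sorted(base.glob("state[0-9]*")):
        name = (state / "name").read_text().strip()
        states[name] = {
            # Exit latency of the state, in microseconds.
            "latency_us": int((state / "latency").read_text()),
            # Total time spent in the state so far, in microseconds.
            "time_us": int((state / "time").read_text()),
            # Number of times the state has been entered.
            "usage": int((state / "usage").read_text()),
        }
    return states


def residency_delta(before, after):
    """Microseconds spent in each state between two read_cstates() snapshots."""
    return {
        name: after[name]["time_us"] - before[name]["time_us"]
        for name in after
        if name in before
    }
```

Taking one snapshot before and one after an iperf3 run, once with HWP and once with intel_pstate=no_hwp, gives per-state residency figures that can be compared between the two boots.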
Comment 15 Kai-Heng Feng 2024-07-19 11:18:19 UTC
(In reply to Srinivas Pandruvada from comment #14)
> (In reply to Kai-Heng Feng from comment #13)
> > $ cat /sys/devices/system/cpu/cpuidle/current_driver 
> > intel_idle
> > 
> > If 'intel_idle.max_cstate=0' is used, it's disabled and acpi_idle is in
> > use.
> > 
> So your dma latency program is preventing entry to C-states, so you have
> faster interrupt response time. At the same time it will also run at higher
> frequency because there is no idle time.
> 
> > 
> > > You mean running turbostat causes bad network performance with or without
> > > HWP?
> > OK, I think I misunderstood what you meant. Do you mean to run turbostat
> > with and without HWP? What value should be observed?
> 
> I will see you will see less C state residency without HWP.

You are right, it's C-state related. It seems the latency of C1E is causing the issue. The diff below can solve the issue:

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 9aab7abc2ae9..dac2fc1f26e3 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1537,6 +1537,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
        X86_MATCH_VFM(INTEL_XEON_PHI_KNM,       &idle_cpu_knl),
        X86_MATCH_VFM(INTEL_ATOM_GOLDMONT,      &idle_cpu_bxt),
        X86_MATCH_VFM(INTEL_ATOM_GOLDMONT_PLUS, &idle_cpu_bxt),
+       X86_MATCH_VFM(INTEL_ATOM_TREMONT_L,     &idle_cpu_bxt),
        X86_MATCH_VFM(INTEL_ATOM_GOLDMONT_D,    &idle_cpu_dnv),
        X86_MATCH_VFM(INTEL_ATOM_TREMONT_D,     &idle_cpu_snr),
        X86_MATCH_VFM(INTEL_ATOM_CRESTMONT,     &idle_cpu_grr),
@@ -1996,6 +1997,7 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
                break;
        case INTEL_ATOM_GOLDMONT:
        case INTEL_ATOM_GOLDMONT_PLUS:
+       case INTEL_ATOM_TREMONT_L:
                bxt_idle_state_table_update();
                break;
        case INTEL_SKYLAKE:
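
As a side note, the C1E hypothesis could probably also be checked without rebuilding the kernel by disabling that state through the cpuidle sysfs "disable" files. The sketch below is an assumption-laden illustration, not part of the patch: the state name "C1E" must match what the name files report on the machine, the helper set_cstate_disabled is hypothetical, and writing the real files needs root and does not persist across reboots.

```python
from pathlib import Path


def set_cstate_disabled(state_name, disabled=True, root="/sys/devices/system/cpu"):
    """Disable (or re-enable) the named idle state on every CPU.

    Returns the list of state directories whose 'disable' file was written.
    """
    touched = []
    for state in sorted(Path(root).glob("cpu[0-9]*/cpuidle/state[0-9]*")):
        if (state / "name").read_text().strip() == state_name:
            # Writing 1 stops the governor from selecting this state, 0 allows it again.
            (state / "disable").write_text("1" if disabled else "0")
            touched.append(str(state))
    return touched
```

Calling set_cstate_disabled("C1E") before the iperf3 run and set_cstate_disabled("C1E", disabled=False) afterwards would show whether avoiding C1E alone restores the ~950Mbps figure.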
