CPU: Intel(R) Celeron(R) N5105 @ 2.00GHz (Jasper Lake)
Ethernet: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

The iperf3 TX speed is ~750Mbps. When the system is booted with "intel_pstate=no_hwp", the TX speed becomes ~950Mbps, which is the expected rate.

The difference is introduced by dma_sync_single_for_{cpu,device}() in r8169's rtl_rx() routine. The time spent in the dma_sync helpers is much longer when HWP is used.
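For context, this is the standard streaming-DMA ownership handshake around an Rx buffer: sync the buffer to the CPU, read the frame, then hand the buffer back to the device. A minimal sketch of that pattern, assuming a generic Rx handler (this is not the actual rtl_rx() code; names and sizes are illustrative):

#include <linux/dma-mapping.h>

/* Illustrative only: the dma_sync bracket around CPU access to an Rx buffer. */
static void example_rx_one(struct device *dev, dma_addr_t mapping,
			   void *buf, unsigned int pkt_len)
{
	/* Transfer buffer ownership to the CPU before reading it. */
	dma_sync_single_for_cpu(dev, mapping, pkt_len, DMA_FROM_DEVICE);

	/* The CPU copies/parses the received frame here (e.g. into an skb). */

	/* Hand the buffer back to the device so it can be reused for Rx. */
	dma_sync_single_for_device(dev, mapping, pkt_len, DMA_FROM_DEVICE);
}

These two calls bracket every received frame, which is why extra per-call latency shows up directly in the TX/RX throughput numbers above.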
Created attachment 306555 [details] dmesg with intel_pstate.dyndbg, hwp enabled
Created attachment 306556 [details] dmesg with intel_pstate.dyndbg, no_hwp
HWP is handled by the hardware; nothing is implemented in the kernel that can do anything about this. The only option is to change the default energy_performance_preference. Try setting different energy_performance_preference values from cpufreq sysfs.
$ grep . policy*/energy_performance_preference
policy0/energy_performance_preference:performance
policy1/energy_performance_preference:performance
policy2/energy_performance_preference:performance

$ grep . policy*/scaling_governor
policy0/scaling_governor:performance
policy1/scaling_governor:performance
policy2/scaling_governor:performance
policy3/scaling_governor:performance

$ iperf3 -c 192.168.2.1 -t 86400
Connecting to host 192.168.2.1, port 5201
[  5] local 192.168.2.205 port 48828 connected to 192.168.2.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.7 MBytes   761 Mbits/sec    0    273 KBytes
[  5]   1.00-2.00   sec  90.0 MBytes   755 Mbits/sec    0    273 KBytes

So changing the EPP and governor doesn't make any difference.
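For reference, the EPP for a policy is changed by writing one of the accepted strings (default, performance, balance_performance, balance_power, power) to the energy_performance_preference file shown above. A minimal C sketch, assuming policy0 and root privileges (a plain shell redirect works equally well):

#include <stdio.h>

/* Write an energy_performance_preference value for one cpufreq policy. */
static int set_epp(const char *policy, const char *epp)
{
	char path[256];
	FILE *f;
	int ret = 0;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpufreq/%s/energy_performance_preference",
		 policy);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%s\n", epp) < 0)
		ret = -1;
	fclose(f);
	return ret;
}

int main(void)
{
	/* Illustrative value; a later comment in this thread suggests trying balance_power. */
	return set_epp("policy0", "balance_power") ? 1 : 0;
}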
The same issue can be reproduced on Elkhart Lake when HWP is enabled.
I think the workaround here should be in the RTL driver. The HWP algorithms on these older platforms can't be modified.
(In reply to Srinivas Pandruvada from comment #6)
> I think the workaround here should be in the RTL driver. The HWP algorithms
> on these older platforms can't be modified.

See Kai-Heng's initial post: the issue is with dma_sync_single_for_{cpu,device}(), not with the r8169 driver code. So every user of dma_sync_single_for_{cpu,device}() may be affected; r8169 is just one example here.

And as you stated "the workaround": which workaround do you propose?
(In reply to Heiner Kallweit from comment #7)
> (In reply to Srinivas Pandruvada from comment #6)
> > I think the workaround here should be in the RTL driver. The HWP
> > algorithms on these older platforms can't be modified.
>
> See Kai-Heng's initial post: the issue is with
> dma_sync_single_for_{cpu,device}(), not with the r8169 driver code. So
> every user of dma_sync_single_for_{cpu,device}() may be affected; r8169 is
> just one example here.
>
> And as you stated "the workaround": which workaround do you propose?

I suppose this is a regression from previous kernels, so something changed in the kernel. Based on the comment I thought the RTL driver added dma_sync_single.., which triggered the issue.

If changing to the performance governor didn't fix it, then max frequency all the time is not helping (so the speed is not limited because the CPU kept frequencies low). If it is power limited, then try the default intel_pstate powersave governor with EPP balance_power.
> I suppose this is a regression from previous kernels.
Not really. The issue can be seen on all old kernels that can boot up.

> If changing to the performance governor didn't fix it, then max frequency
> all the time is not helping (so the speed is not limited because the CPU
> kept frequencies low). If it is power limited, then try the default
> intel_pstate powersave governor with EPP balance_power.
Actually, max frequency all the time works by writing 0 to /dev/cpu_dma_latency.

I am not saying that the fix needs to be in intel_pstate or HWP, but we need to understand why and how HWP makes dma_sync_single much slower, so a proper fix can be implemented.
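For reference, /dev/cpu_dma_latency is the userspace CPU latency QoS interface: a process writes the allowed wakeup latency in microseconds and must keep the file descriptor open for the request to stay active; writing 0 effectively prevents C-state entry. A minimal sketch of the "write 0" request mentioned above, assuming it is run as root:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
	int32_t latency_us = 0;	/* 0 disallows deep C-states while the fd is held open */
	int fd = open("/dev/cpu_dma_latency", O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, &latency_us, sizeof(latency_us)) != sizeof(latency_us))
		return 1;

	pause();	/* the request is dropped when the fd is closed, so hold it open */
	close(fd);
	return 0;
}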
(In reply to Kai-Heng Feng from comment #9)
> > I suppose this is a regression from previous kernels.
> Not really. The issue can be seen on all old kernels that can boot up.
>
> > If changing to the performance governor didn't fix it, then max frequency
> > all the time is not helping (so the speed is not limited because the CPU
> > kept frequencies low). If it is power limited, then try the default
> > intel_pstate powersave governor with EPP balance_power.
> Actually, max frequency all the time works by writing 0 to
> /dev/cpu_dma_latency.
>
You said the performance governor didn't fix it, so max frequency is not fixing it. Going into C-states is the problem, so wakeup latency is affecting the speed, as there will be a delay in interrupt processing.

> I am not saying that the fix needs to be in intel_pstate or HWP, but we
> need to understand why and how HWP makes dma_sync_single much slower, so a
> proper fix can be implemented.
Try to run turbostat with and without.
> You said the performance governor didn't fix it, so max frequency is not
> fixing it. Going into C-states is the problem, so wakeup latency is
> affecting the speed, as there will be a delay in interrupt processing.
Not sure if it's related to C-states; at least intel_idle.max_cstate=0 doesn't help.

> Try to run turbostat with and without.
Same bad network performance with or without turbostat running.
(In reply to Kai-Heng Feng from comment #11)
> > You said the performance governor didn't fix it, so max frequency is not
> > fixing it. Going into C-states is the problem, so wakeup latency is
> > affecting the speed, as there will be a delay in interrupt processing.
>
> Not sure if it's related to C-states; at least intel_idle.max_cstate=0
> doesn't help.
>
It is probably using acpi_idle. Check:
cat /sys/devices/system/cpu/cpuidle/current_driver

> > Try to run turbostat with and without.
>
> Same bad network performance with or without turbostat running.
You mean running turbostat causes bad network performance with or without HWP?
$ cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

If 'intel_idle.max_cstate=0' is used, intel_idle is disabled and acpi_idle is in use.

> You mean running turbostat causes bad network performance with or without
> HWP?
OK, I think I misunderstood what you meant. Do you mean to run turbostat with and without HWP? What value should be observed?
(In reply to Kai-Heng Feng from comment #13)
> $ cat /sys/devices/system/cpu/cpuidle/current_driver
> intel_idle
>
> If 'intel_idle.max_cstate=0' is used, intel_idle is disabled and acpi_idle
> is in use.
>
So your dma latency program is preventing entry to C-states, so you have a faster interrupt response time. At the same time it will also run at a higher frequency because there is no idle time.

> > You mean running turbostat causes bad network performance with or without
> > HWP?
> OK, I think I misunderstood what you meant. Do you mean to run turbostat
> with and without HWP? What value should be observed?
I think you will see less C-state residency without HWP.
(In reply to Srinivas Pandruvada from comment #14)
> (In reply to Kai-Heng Feng from comment #13)
> > $ cat /sys/devices/system/cpu/cpuidle/current_driver
> > intel_idle
> >
> > If 'intel_idle.max_cstate=0' is used, intel_idle is disabled and
> > acpi_idle is in use.
> >
> So your dma latency program is preventing entry to C-states, so you have a
> faster interrupt response time. At the same time it will also run at a
> higher frequency because there is no idle time.
>
> > > You mean running turbostat causes bad network performance with or
> > > without HWP?
> > OK, I think I misunderstood what you meant. Do you mean to run turbostat
> > with and without HWP? What value should be observed?
>
> I think you will see less C-state residency without HWP.

You are right, it's C-state related. It seems like the latency of C1E is causing the issue. The diff below can solve the issue:

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 9aab7abc2ae9..dac2fc1f26e3 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1537,6 +1537,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
 	X86_MATCH_VFM(INTEL_XEON_PHI_KNM,	&idle_cpu_knl),
 	X86_MATCH_VFM(INTEL_ATOM_GOLDMONT,	&idle_cpu_bxt),
 	X86_MATCH_VFM(INTEL_ATOM_GOLDMONT_PLUS,	&idle_cpu_bxt),
+	X86_MATCH_VFM(INTEL_ATOM_TREMONT_L,	&idle_cpu_bxt),
 	X86_MATCH_VFM(INTEL_ATOM_GOLDMONT_D,	&idle_cpu_dnv),
 	X86_MATCH_VFM(INTEL_ATOM_TREMONT_D,	&idle_cpu_snr),
 	X86_MATCH_VFM(INTEL_ATOM_CRESTMONT,	&idle_cpu_grr),
@@ -1996,6 +1997,7 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 		break;
 	case INTEL_ATOM_GOLDMONT:
 	case INTEL_ATOM_GOLDMONT_PLUS:
+	case INTEL_ATOM_TREMONT_L:
 		bxt_idle_state_table_update();
 		break;
 	case INTEL_SKYLAKE:
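Side note on the driver-side workaround idea from comment #6: instead of userspace holding /dev/cpu_dma_latency open, a driver can place an equivalent in-kernel CPU latency QoS request while the interface is active. A hedged sketch using the helpers from linux/pm_qos.h (the latency value and the open/stop hooks are illustrative, not a proposed r8169 patch):

#include <linux/pm_qos.h>

static struct pm_qos_request example_lat_req;

/* On interface open: cap CPU wakeup latency so idle states with a larger
 * exit latency are skipped while the NIC is passing traffic.
 * The value is illustrative; it must be smaller than the exit latency of
 * the states to be avoided.
 */
static void example_open(void)
{
	cpu_latency_qos_add_request(&example_lat_req, 20);
}

/* On interface stop: drop the constraint so idle states are unrestricted again. */
static void example_stop(void)
{
	cpu_latency_qos_remove_request(&example_lat_req);
}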