Bug 187971

Summary: 18,000ms CPU online during S3 resume -- due to MTRR un-cached -- MacBook Pro 2015
Product: Power Management
Reporter: Chen Yu (yu.c.chen)
Component: Hibernation/Suspend
Assignee: Chen Yu (yu.c.chen)
Status: DEFERRED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: lenb, rjw, rui.zhang, tglx, todd.e.brandt
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.9.0-rc4
Subsystem:
Regression: No
Bisected commit-id:
Bug Depends on:
Bug Blocks: 178231
Attachments: ftrace event patch to dump the notifier chain
             Linux-5.3 18 SECOND resume delay
             issue.def
             issue.def
             A refined version to fix mtrr across S3

Description Chen Yu 2016-11-17 13:57:19 UTC
ftrace shows that the boot CPU takes nearly 100 ms in wait_for_completion, waiting for the APs to wake up. This latency is unexpected, but more investigation is needed to determine whether it is due to silicon design.
Comment 1 Len Brown 2016-11-21 15:04:13 UTC
Please ftrace just the CPU online process,
e.g. enable tracing and do: echo 1 > /sys/devices/system/cpu/cpu1/online
That should show where the time goes without a huge trace file.
Comment 2 Chen Yu 2016-12-05 06:00:51 UTC
A pure CPU online does not reproduce this problem.
According to ftrace during suspend/resume, it might be that kthread_unpark on the APs (such as CPU1, though strangely not on CPU2) takes too much time to finish.
I'll double-check.

# pwd
/sys/kernel/debug/tracing

# echo function_graph > current_tracer

# echo 'setup_secondary_APIC_clock bringup_cpu cpuhp_online_idle kthread_unpark cpu_set_state_online arch_cpu_idle_prepare cpu_startup_entry clockevents_register_device clockevents_config_and_register stop_machine_unpark wait_for_completion complete' > set_ftrace_filter

# echo 1 > options/funcgraph-abstime

# echo 10000 > buffer_size_kb
Comment 3 Chen Yu 2016-12-08 03:27:22 UTC
Created attachment 247131 [details]
ftrace event patch to dump the notifier chain

here's the recent update:
After further investigation, this issue appears to have multiple causes, and the scope has been narrowed down.
The first obvious cause is that when CPU0 tries to wake up CPU1, quite a few cpuhp callbacks take too much time, so we further enabled ftrace for them:
# echo 1 > events/cpuhp/cpuhp_enter/enable
# echo 1 > events/cpuhp/cpuhp_exit/enable
# echo 1 > events/cpuhp/cpuhp_multi_enter/enable

The result showed that most of the time was spent in 'notify_online', which
sends the CPU_ONLINE notification to the previously registered objects. I noticed many duplicated notifier callbacks for a single object, and executing that many callbacks is costly. To figure out which notifier it is, I prepared an ftrace events patch that dumps the function name of every callback (attached for future reference), and the result revealed that it is 'pvclock_gtod_notify'. So it seems this notifier has been registered on the CPU_ONLINE list many times; we need to figure out why that happened.

282.686939 /* cpuhp_enter: cpu: 0001 target: 151 step: 116 (notify_online) */

//Chen Yu: many pvclock_gtod_notify callbacks below:
282.689603 /* notifier_call: Notifier callback: [<ffffffffc07750b0>] pvclock_gtod_notify+0x0/0xa0 [kvm] with value: 0x[0] begin */
282.689788 /* notifier_call: Notifier callback: [<ffffffffc07750b0>] pvclock_gtod_notify+0x0/0xa0 [kvm] with value: 0x[0] end */

...
282.928749 /* cpuhp_exit:  cpu: 0001  state: 116 step: 116 ret: 0 */

So the total time for CPU1 above to execute the notify_online phase is
(282.928749 - 282.686939) = 242 ms, which is quite long.

And if we use the analyze_suspend tool to do the job, the wake-up time is even longer, rising to nearly 1 second. The next step is to figure out why pvclock_gtod_notify takes so much time, and then what the impact of using analyze_suspend is.
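
For future reference, a minimal sketch of the idea behind the attached instrumentation (illustrative only: the function name below is hypothetical, and the actual patch adds an ftrace event rather than a printk), modeled on notifier_call_chain() in kernel/notifier.c:

/*
 * Illustrative sketch: walk a notifier chain and print the symbol of
 * every callback, so that duplicated entries such as
 * pvclock_gtod_notify become visible in the log.
 */
static int notifier_call_chain_debug(struct notifier_block **nl,
				     unsigned long val, void *v)
{
	struct notifier_block *nb = rcu_dereference_raw(*nl);
	int ret = NOTIFY_DONE;

	while (nb) {
		/* %ps resolves the callback pointer to a symbol name */
		pr_info("notifier callback: %ps with value 0x%lx\n",
			nb->notifier_call, val);
		ret = nb->notifier_call(nb, val, v);
		if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK)
			break;
		nb = rcu_dereference_raw(nb->next);
	}
	return ret;
}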
Comment 4 Chen Yu 2016-12-12 02:33:07 UTC
OK, here I have two questions:
1. pvclock_gtod_notify should only be invoked when the clocksource has changed;
   however, according to
commit 16e8d74d2da9920f874b10a3d979fb25c01f518f
Author: Marcelo Tosatti <mtosatti@redhat.com>
Date:   Tue Nov 27 23:29:00 2012 -0200

    KVM: x86: notifier for clocksource changes
    
    Register a notifier for clocksource change event. In case
    the host switches to clock other than TSC, disable master
    clock usage.
this notifier is also invoked during the periodic update_wall_time(), so IMO there is something we can do to avoid this redundancy.
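
A minimal sketch of one way to avoid it (illustrative only: the 'last_seen' caching below is an assumption for this sketch, not the fix that was actually posted) is to remember the clocksource the notifier last saw and return early when update_wall_time() re-fires it without a real change:

/*
 * Illustrative sketch, not the posted fix: skip the work when the
 * periodic update_wall_time() path fires the notifier although the
 * clocksource did not actually change.  'last_seen' is hypothetical.
 */
static struct clocksource *last_seen;

static int pvclock_gtod_notify_filtered(struct notifier_block *nb,
					unsigned long unused, void *priv)
{
	struct timekeeper *tk = priv;

	if (tk->tkr_mono.clock == last_seen)
		return NOTIFY_DONE;	/* clocksource unchanged */

	last_seen = tk->tkr_mono.clock;
	/* ... the original pvclock_gtod_notify() work would go here ... */
	return NOTIFY_DONE;
}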

2. What is more interesting is the second issue: after resuming from S3, the processor seems to run instructions much more slowly than in the normal case. For example, here is the function-graph trace result for cache_add_dev during resume:
256.252328 |   1)               |  cache_add_dev() {
256.252509 |   1) + 29.301 us   |    get_cpu_device();
...
259.082635 |   1) $ 2830155 us  |  }
It shows that cache_add_dev consumed 2830155 us, while the simple
get_cpu_device took nearly 30 us.

Let's take the normal CPU online procedure during CPU hotplug for comparison:

103326.093490 |   1)               |  cache_add_dev() {
103326.093491 |   1)   0.059 us    |    get_cpu_device();
103326.094656 |   1) # 1165.198 us |  }

Here cache_add_dev takes only 1165 us, and get_cpu_device only 0.059 us.

Even after I changed the clocksource from tsc to hpet, and disabled no_console_suspend as well as ignore_loglevel, the result is the same. This suggests that the nonboot CPUs run at quite a low speed during resume, perhaps related to CPU frequency or other factors.
Comment 5 Chen Yu 2016-12-20 10:33:00 UTC
The second problem has been described, and a workaround posted, at:
https://www.spinics.net/lists/kernel/msg2407337.html
Comment 6 Chen Yu 2016-12-23 08:46:41 UTC
The first problem is being tracked at https://bugzilla.kernel.org/show_bug.cgi?id=190961
and a patch was sent at: https://patchwork.kernel.org/patch/9487257/
Comment 7 Chen Yu 2017-08-14 08:00:50 UTC
I'm going to review this patch and then maybe resend it.
Comment 8 Chen Yu 2017-10-11 06:39:45 UTC
Pasting the discussion here for the next step:


Here's the time consumed by enable_nonboot_cpus()
during resume on a Xeon SKL-D platform which has
32 cpus.

1. Original 4.14-rc2: the APs sync with the BSP only after
   all the APs have been brought up.

vmlinuz-4.14.0-rc2+:
[  136.236263] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  136.797652] CPU31 is up

136.797652 - 136.236263 = 0.561389s


2. With a patch applied to disable the defer flag, each AP
   syncs with the BSP as soon as it is brought up.

vmlinuz-4.14.0-rc2-disable-defer+:
[  219.100372] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  219.856779] CPU31 is up

219.856779 - 219.100372 = 0.756407s



3. On top of the patch in step 2, check whether the AP's MTRRs
   match the saved MTRR state; if they do, skip the synchronization
   for this AP, otherwise synchronize it.

vmlinuz-4.14.0-rc2-disable-defer-avoid-mtrr-sync+:
[  233.740309] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  234.301651] CPU31 is up

234.301651 - 233.740309 = 0.561342s


So it looks like avoiding the MTRR synchronization when it
is not needed saves some time.


Here are the two patches:

1. Sync the MTRRs for each AP as soon as it is brought up:

Index: mainline/arch/x86/kernel/smpboot.c
===================================================================
--- mainline.orig/arch/x86/kernel/smpboot.c
+++ mainline/arch/x86/kernel/smpboot.c
@@ -1366,12 +1366,10 @@ void __init native_smp_prepare_cpus(unsi
 
 void arch_enable_nonboot_cpus_begin(void)
 {
-	set_mtrr_aps_delayed_init();
 }
 
 void arch_enable_nonboot_cpus_end(void)
 {
-	mtrr_aps_init();
 }

2. Avoid the MTRR synchronization if it is not needed:

Index: mainline/arch/x86/kernel/cpu/mtrr/generic.c
===================================================================
--- mainline.orig/arch/x86/kernel/cpu/mtrr/generic.c
+++ mainline/arch/x86/kernel/cpu/mtrr/generic.c
@@ -894,6 +894,32 @@ static int generic_have_wrcomb(void)
 	return config & (1 << 10);
 }
 
+struct mtrr_state_type mtrr_state_tmp;
+
+static bool generic_need_sync(void)
+{
+	struct mtrr_var_range *vrs;
+	unsigned lo, dummy;
+	unsigned int i;
+
+	vrs = mtrr_state_tmp.var_ranges;
+
+	rdmsr(MSR_MTRRcap, lo, dummy);
+	mtrr_state_tmp.have_fixed = (lo >> 8) & 1;
+
+	for (i = 0; i < num_var_ranges; i++)
+		get_mtrr_var_range(i, &vrs[i]);
+	if (mtrr_state_tmp.have_fixed)
+		get_fixed_ranges(mtrr_state_tmp.fixed_ranges);
+
+	rdmsr(MSR_MTRRdefType, lo, dummy);
+	mtrr_state_tmp.def_type = (lo & 0xff);
+	mtrr_state_tmp.enabled = (lo & 0xc00) >> 10;
+
+	return (memcmp(&mtrr_state_tmp, &mtrr_state,
+		sizeof(struct mtrr_state_type)) ? true : false);
+}
+
 int positive_have_wrcomb(void)
 {
 	return 1;
@@ -910,4 +936,5 @@ const struct mtrr_ops generic_mtrr_ops =
 	.set			= generic_set_mtrr,
 	.validate_add_page	= generic_validate_add_page,
 	.have_wrcomb		= generic_have_wrcomb,
+	.need_sync		= generic_need_sync,
 };
Index: mainline/arch/x86/kernel/cpu/mtrr/mtrr.h
===================================================================
--- mainline.orig/arch/x86/kernel/cpu/mtrr/mtrr.h
+++ mainline/arch/x86/kernel/cpu/mtrr/mtrr.h
@@ -25,6 +25,7 @@ struct mtrr_ops {
 	int	(*validate_add_page)(unsigned long base, unsigned long size,
 				     unsigned int type);
 	int	(*have_wrcomb)(void);
+	bool	(*need_sync)(void);
 };
 
 extern int generic_get_free_region(unsigned long base, unsigned long size,
Index: mainline/arch/x86/kernel/cpu/mtrr/main.c
===================================================================
--- mainline.orig/arch/x86/kernel/cpu/mtrr/main.c
+++ mainline/arch/x86/kernel/cpu/mtrr/main.c
@@ -177,7 +177,8 @@ static int mtrr_rendezvous_handler(void
 	if (data->smp_reg != ~0U) {
 		mtrr_if->set(data->smp_reg, data->smp_base,
 			     data->smp_size, data->smp_type);
-	} else if (mtrr_aps_delayed_init || !cpu_online(smp_processor_id())) {
+	} else if ((mtrr_aps_delayed_init || !cpu_online(smp_processor_id())) &&
+			mtrr_if->need_sync()) {
 		mtrr_if->set_all();
 	}
 	return 0;
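
For context on patch 2: the rendezvous handler runs on each CPU taking part in the MTRR synchronization, and set_all() is expensive because it temporarily disables caching while it rewrites every MTRR. With the need_sync() check, that whole sequence is skipped whenever the live registers already match the saved mtrr_state.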
Comment 9 Chen Yu 2017-11-06 22:43:23 UTC
Latest patch sent to:
https://patchwork.kernel.org/patch/10034001/
https://patchwork.kernel.org/patch/10033995/
https://patchwork.kernel.org/patch/10033971/

Actually, in the MTRR write operation the value to be written is checked against the value already in the MTRR register; if they match, the write is skipped. So the patch set above only moves the synchronization earlier and optimizes away redundant synchronization during the AP online process.
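
That check is essentially the one in set_fixed_range() in arch/x86/kernel/cpu/mtrr/generic.c: read the MSR first, and only write it back when the contents differ, so an already-correct MTRR costs a rdmsr instead of a full wrmsr:

static void set_fixed_range(int msr, bool *changed, unsigned int *msrwords)
{
	unsigned int lo, hi;

	/* Read the current contents; skip the write if nothing changed. */
	rdmsr(msr, lo, hi);

	if (lo != msrwords[0] || hi != msrwords[1]) {
		mtrr_wrmsr(msr, msrwords[0], msrwords[1]);
		*changed = true;
	}
}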
Comment 10 Chen Yu 2018-01-15 02:58:13 UTC
An updated patch based on Rafael's suggestion has been sent to:
https://patchwork.kernel.org/patch/10150077/
Comment 11 Chen Yu 2018-03-19 04:18:58 UTC
Resent to
https://patchwork.kernel.org/patch/10291645/
Hopefully this time it attracts more attention (and hopefully the maintainers are not too busy dealing with 'Spectre' this time).
Comment 12 Chen Yu 2018-03-26 08:18:16 UTC
Add Thomas in. : )
Comment 13 Len Brown 2019-08-29 14:59:29 UTC
Created attachment 284697 [details]
Linux-5.3 18 SECOND resume delay

Linux 5.3-rc5 still has this problem.
Here is an example that takes 18 SECONDS!
Comment 14 Todd Brandt 2019-09-06 22:04:08 UTC
Created attachment 284871 [details]
issue.def

I set the issue definition to flag any CPU UP or DOWN call that takes longer than 20 ms on any hardware.
Comment 15 Todd Brandt 2019-09-19 18:26:45 UTC
Created attachment 285067 [details]
issue.def

Changed the minimum CPU UP/DOWN time from 20 ms to 10 ms.
Comment 16 Chen Yu 2020-01-03 08:47:37 UTC
Created attachment 286599 [details]
A refined version to fix mtrr across S3

This refreshed patch applies only to the S3 path and does not affect other code paths, so its impact is minimal.
Comment 17 Zhang Rui 2020-06-24 02:16:54 UTC
@Chen Yu, what is the status of the refined patch?
Comment 18 Chen Yu 2020-06-28 15:03:09 UTC
It seems we haven't received any feedback from the mailing list so far. I will re-send the patch, hoping the x86 maintainers finally pay attention to it.