ftrace shows that the boot CPU spends nearly 100ms in wait_for_completion waiting for the APs to wake up. This latency is not expected, but more investigation is needed to tell whether it is due to the silicon design.
Please ftrace just the CPU online process, e.g. enable tracing and do 'echo 1 > /sys/devices/system/cpu/cpu1/online'; that should show where the time goes without a huge trace file.
A pure CPU online does not reproduce this problem. According to ftrace during suspend/resume, it might be that kthread_unpark on the APs (such as CPU1, but strangely not on CPU2) takes too much time to finish. I'll double-check.

# pwd
/sys/kernel/debug/tracing
# echo function_graph > current_tracer
# echo 'setup_secondary_APIC_clock bringup_cpu cpuhp_online_idle kthread_unpark cpu_set_state_online arch_cpu_idle_prepare cpu_startup_entry clockevents_register_device clockevents_config_and_register stop_machine_unpark wait_for_completion complete' > set_ftrace_filter
# echo 1 > options/funcgraph-abstime
# echo 10000 > buffer_size_kb
Created attachment 247131 [details]
ftrace event patch to dump the notifier chain

Here's a recent update: after further investigation, this issue seems to have multiple causes, and the scope has been narrowed down. The first obvious cause is that when CPU0 tries to wake up CPU1, there are quite a few cpuhp callbacks that take too much time, so we further enabled ftrace for them:

# echo 1 > events/cpuhp/cpuhp_enter/enable
# echo 1 > events/cpuhp/cpuhp_exit/enable
# echo 1 > events/cpuhp/cpuhp_multi_enter/enable

The result showed that most of the time is spent in 'notify_online', which sends the CPU_ONLINE notification to every object registered on the notifier chain. I also noticed many duplicated notifier callbacks for one object, and executing so many callbacks is costly. To figure out which notifier it is, I prepared an ftrace events patch to dump the function name for every callback (attached for future reference), and the result revealed that it is 'pvclock_gtod_notify'. So it seems this notifier ends up on the CPU_ONLINE path many times, and we need to figure out why that happens.

282.686939 /* cpuhp_enter: cpu: 0001 target: 151 step: 116 (notify_online) */
//Chen Yu: many pvclock_gtod_notify callbacks below:
282.689603 /* notifier_call: Notifier callback: [<ffffffffc07750b0>] pvclock_gtod_notify+0x0/0xa0 [kvm] with value: 0x[0] begin */
282.689788 /* notifier_call: Notifier callback: [<ffffffffc07750b0>] pvclock_gtod_notify+0x0/0xa0 [kvm] with value: 0x[0] end */
...
282.928749 /* cpuhp_exit: cpu: 0001 state: 116 step: 116 ret: 0 */

So the total time for CPU1 above to execute the notify_online phase is (282.928749 - 282.686939) = 241ms, which is quite long. And if we use the analyze_suspend tool to do the job, the wake-up time is even longer, rising to nearly 1 second. The next step is to figure out why pvclock_gtod_notify takes so much time, and then what the impact of using analyze_suspend is.
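For reference, the effect of the attached patch can be approximated with something along these lines. This is only a hedged sketch, not the attachment itself (the real patch defines proper trace events rather than using trace_printk, and the wrapper name here is made up); it just shows the idea of instrumenting the spot in kernel/notifier.c where each callback on a chain is invoked, so every callback appears in the trace with its symbol name, like the notifier_call lines in the excerpt above:

#include <linux/kernel.h>
#include <linux/notifier.h>

/*
 * Hedged sketch: wrap the invocation of one notifier callback so that each
 * callback on the chain is logged to the ftrace buffer with its symbol name
 * before and after it runs.
 */
static int traced_notifier_call(struct notifier_block *nb,
				unsigned long val, void *v)
{
	int ret;

	trace_printk("Notifier callback: %ps with value: 0x%lx begin\n",
		     nb->notifier_call, val);
	ret = nb->notifier_call(nb, val, v);
	trace_printk("Notifier callback: %ps with value: 0x%lx end\n",
		     nb->notifier_call, val);
	return ret;
}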
OK, here I have two questions:

1. pvclock_gtod_notify is supposed to be invoked only when the clocksource has changed, according to

commit 16e8d74d2da9920f874b10a3d979fb25c01f518f
Author: Marcelo Tosatti <mtosatti@redhat.com>
Date:   Tue Nov 27 23:29:00 2012 -0200

    KVM: x86: notifier for clocksource changes

    Register a notifier for clocksource change event. In case
    the host switches to clock other than TSC, disable master
    clock usage.

However, this notifier is also invoked during the periodic update_wall_time, so IMO there is something we can do to avoid this redundancy (a rough sketch of the idea follows this comment).

2. What is more interesting is the second issue: after resuming from S3, the processor seems to run instructions much more slowly than in the normal case. For example, here is the function-graph trace result for cache_add_dev during resume:

256.252328 | 1)               | cache_add_dev() {
256.252509 | 1) + 29.301 us   | get_cpu_device();
...
259.082635 | 1) $ 2830155 us  | }

It shows that cache_add_dev consumed 2830155us, while the simple function get_cpu_device took nearly 30us. Compare this with the normal CPU online procedure during cpu-hotplug:

103326.093490 | 1)                | cache_add_dev() {
103326.093491 | 1)   0.059 us     | get_cpu_device();
103326.094656 | 1) # 1165.198 us  | }

Here cache_add_dev takes only 1165us, and get_cpu_device takes only 0.059us. Even after I changed the clocksource from tsc to hpet and disabled no_console_suspend as well as ignore_loglevel, the result is the same, which suggests that the nonboot CPUs during resume are running at quite a slow speed, maybe related to CPU frequency or other factors.
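Regarding question 1, the kind of redundancy avoidance meant above could look roughly like the following. This is only a hedged sketch under the assumption that the notifier receives the timekeeper as its data argument; the cached-pointer name and the early return are illustrative and this is not the patch that was actually posted:

#include <linux/notifier.h>
#include <linux/timekeeper_internal.h>

/*
 * Hedged sketch (not the posted patch): remember the clocksource seen on the
 * previous invocation and return early when it has not changed, so the
 * periodic update_wall_time calls become cheap and only a real clocksource
 * switch triggers the expensive master-clock update.
 */
static struct clocksource *last_seen_clocksource;	/* illustrative cache */

static int pvclock_gtod_notify_sketch(struct notifier_block *nb,
				      unsigned long unused, void *priv)
{
	struct timekeeper *tk = priv;

	if (tk->tkr_mono.clock == last_seen_clocksource)
		return NOTIFY_DONE;	/* periodic update, nothing relevant changed */

	last_seen_clocksource = tk->tkr_mono.clock;
	/* ... expensive work that only matters on a clocksource change ... */
	return NOTIFY_DONE;
}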
The second problem has been described, and a workaround has been posted at: https://www.spinics.net/lists/kernel/msg2407337.html
The first problem is being tracked at https://bugzilla.kernel.org/show_bug.cgi?id=190961 and a patch has been sent to: https://patchwork.kernel.org/patch/9487257/
I'm going to review this patch and then maybe resend it.
Pasting the discussion here for the next step:

Here's the time consumed by enable_nonboot_cpus() during resume on a Xeon SKL-D platform which has 32 CPUs.

1. The original 4.14-rc2: the APs sync with the BSP after all the APs have been brought up.

vmlinuz-4.14.0-rc2+:
[  136.236263] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  136.797652] CPU31 is up
136.797652 - 136.236263 = 0.561389s

2. After applying a patch to disable the defer flag, each AP syncs with the BSP as soon as it is brought up.

vmlinuz-4.14.0-rc2-disable-defer+:
[  219.100372] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  219.856779] CPU31 is up
219.856779 - 219.100372 = 0.756407s

3. Based on the patch in step 2, check whether the MTRR is the same as the saved MTRR; if it is, skip the synchronization for this AP, otherwise do a synchronization for the AP.

vmlinuz-4.14.0-rc2-disable-defer-avoid-mtrr-sync+:
[  233.740309] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  234.301651] CPU31 is up
234.301651 - 233.740309 = 0.561342s

So it looks like that if we can avoid the MTRR synchronization when it is not needed, it will save some time. Here are the two patches:

1. Sync the MTRR for each AP as soon as it is brought up:

Index: mainline/arch/x86/kernel/smpboot.c
===================================================================
--- mainline.orig/arch/x86/kernel/smpboot.c
+++ mainline/arch/x86/kernel/smpboot.c
@@ -1366,12 +1366,10 @@ void __init native_smp_prepare_cpus(unsi
 
 void arch_enable_nonboot_cpus_begin(void)
 {
-	set_mtrr_aps_delayed_init();
 }
 
 void arch_enable_nonboot_cpus_end(void)
 {
-	mtrr_aps_init();
 }

2. Avoid the MTRR synchronization if it is not needed:

Index: mainline/arch/x86/kernel/cpu/mtrr/generic.c
===================================================================
--- mainline.orig/arch/x86/kernel/cpu/mtrr/generic.c
+++ mainline/arch/x86/kernel/cpu/mtrr/generic.c
@@ -894,6 +894,32 @@ static int generic_have_wrcomb(void)
 	return config & (1 << 10);
 }
 
+struct mtrr_state_type mtrr_state_tmp;
+
+static bool generic_need_sync(void)
+{
+	struct mtrr_var_range *vrs;
+	unsigned lo, dummy;
+	unsigned int i;
+
+	vrs = mtrr_state_tmp.var_ranges;
+
+	rdmsr(MSR_MTRRcap, lo, dummy);
+	mtrr_state_tmp.have_fixed = (lo >> 8) & 1;
+
+	for (i = 0; i < num_var_ranges; i++)
+		get_mtrr_var_range(i, &vrs[i]);
+	if (mtrr_state_tmp.have_fixed)
+		get_fixed_ranges(mtrr_state_tmp.fixed_ranges);
+
+	rdmsr(MSR_MTRRdefType, lo, dummy);
+	mtrr_state_tmp.def_type = (lo & 0xff);
+	mtrr_state_tmp.enabled = (lo & 0xc00) >> 10;
+
+	return (memcmp(&mtrr_state_tmp, &mtrr_state,
+		sizeof(struct mtrr_state_type)) ? true : false);
+}
+
 int positive_have_wrcomb(void)
 {
 	return 1;
@@ -910,4 +936,5 @@ const struct mtrr_ops generic_mtrr_ops =
 	.set = generic_set_mtrr,
 	.validate_add_page = generic_validate_add_page,
 	.have_wrcomb = generic_have_wrcomb,
+	.need_sync = generic_need_sync,
 };
Index: mainline/arch/x86/kernel/cpu/mtrr/mtrr.h
===================================================================
--- mainline.orig/arch/x86/kernel/cpu/mtrr/mtrr.h
+++ mainline/arch/x86/kernel/cpu/mtrr/mtrr.h
@@ -25,6 +25,7 @@ struct mtrr_ops {
 	int (*validate_add_page)(unsigned long base, unsigned long size,
 				 unsigned int type);
 	int (*have_wrcomb)(void);
+	bool (*need_sync)(void);
 };
 
 extern int generic_get_free_region(unsigned long base, unsigned long size,
Index: mainline/arch/x86/kernel/cpu/mtrr/main.c
===================================================================
--- mainline.orig/arch/x86/kernel/cpu/mtrr/main.c
+++ mainline/arch/x86/kernel/cpu/mtrr/main.c
@@ -177,7 +177,8 @@ static int mtrr_rendezvous_handler(void
 	if (data->smp_reg != ~0U) {
 		mtrr_if->set(data->smp_reg, data->smp_base,
 			     data->smp_size, data->smp_type);
-	} else if (mtrr_aps_delayed_init || !cpu_online(smp_processor_id())) {
+	} else if ((mtrr_aps_delayed_init || !cpu_online(smp_processor_id())) &&
+		   mtrr_if->need_sync()) {
 		mtrr_if->set_all();
 	}
 	return 0;
Latest patches sent to:
https://patchwork.kernel.org/patch/10034001/
https://patchwork.kernel.org/patch/10033995/
https://patchwork.kernel.org/patch/10033971/

Actually, in the MTRR write operation the value to be written is checked against the value already in the MTRR register; if they match, the write is bypassed. So the above patch set only moves the synchronization earlier and does some optimization to avoid redundant synchronization during the APs' online process.
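For context, a minimal sketch of the "compare before write" behaviour described above might look like the following. This is illustrative only: the helper name is made up, the MSR index macros are defined locally for the sketch, and the real code in arch/x86/kernel/cpu/mtrr/generic.c handles more state; it just shows that a redundant synchronization reduces to a handful of rdmsr calls instead of costly, serializing wrmsr operations:

#include <linux/types.h>
#include <asm/msr.h>

/* Defined locally for the sketch; the kernel has equivalent macros. */
#define MTRRphysBase_MSR(reg)	(0x200 + 2 * (reg))
#define MTRRphysMask_MSR(reg)	(0x200 + 2 * (reg) + 1)

/*
 * Hedged sketch, not the actual kernel code: read the current contents of a
 * variable-range MTRR pair and only issue the wrmsr when the desired value
 * differs from what the register already holds.
 */
static void set_var_mtrr_if_changed(unsigned int index,
				    u32 base_lo, u32 base_hi,
				    u32 mask_lo, u32 mask_hi)
{
	u32 lo, hi;

	rdmsr(MTRRphysBase_MSR(index), lo, hi);
	if (lo != base_lo || hi != base_hi)
		wrmsr(MTRRphysBase_MSR(index), base_lo, base_hi);

	rdmsr(MTRRphysMask_MSR(index), lo, hi);
	if (lo != mask_lo || hi != mask_hi)
		wrmsr(MTRRphysMask_MSR(index), mask_lo, mask_hi);
}

The patch set above goes further by comparing the whole saved MTRR state via memcmp, as shown in the generic_need_sync() hunk in the previous comment, so an AP whose MTRRs already match can skip the synchronization entirely.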
An updated patch based on Rafael's suggestion has been sent to: https://patchwork.kernel.org/patch/10150077/
Resent to https://patchwork.kernel.org/patch/10291645/. Hopefully this time it will attract more attention (and hopefully the maintainers are not too busy dealing with 'Spectre' this time).
Adding Thomas. :)
Created attachment 284697 [details]
Linux-5.3 18 SECOND resume delay

Linux 5.3-rc5 still has this problem. Here is an example that takes 18 SECONDS!
Created attachment 284871 [details]
issue.def

I set the issue definition to be any CPU UP or DOWN call that takes longer than 20ms on any hardware.
Created attachment 285067 [details]
issue.def

Change the minimum CPU UP/DOWN time to 10ms from 20ms.
Created attachment 286599 [details]
A refined version to fix mtrr across S3

This refreshed patch only applies to S3 and does not touch other code paths, so its impact is minimal.
@Chen Yu, what is the status of the refined patch?
It seems we haven't received any feedback from the mailing list so far. I will re-send the patch, hoping the x86 maintainers finally pay attention to this.