Bug 197299

Summary: native_calibrate_tsc: time drift when spread spectrum clock is used
Product: Timers Reporter: Max Asbock (masbock)
Component: OtherAssignee: john stultz (john.stultz)
Status: RESOLVED CODE_FIX    
Severity: normal CC: bin.gao, bjascob, lcm, lcm, lenb
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: v4.14-rc5 Subsystem:
Regression: No Bisected commit-id:

Description Max Asbock 2017-10-17 21:13:53 UTC
We observe a 0.25% wall time drift on systems where a spread spectrum clock is used to generate CPU internal clocks, including the TSC. The CPUs in the system are of type INTEL_FAM6_SKYLAKE_X.

As for the cause of this drift it helps to look at native_calibrate_tsc:

First, an attempt is made to obtain a frequency through CPUID 15H call:

    eax_denominator = ebx_numerator = ecx_hz = edx = 0;

    /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
    cpuid(0x15, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);

In our case this call returns 0 for ecx_hz and therefore crystal_khz is also set 0 and we get to the following switch statement where are hardcoded predefined value us selected:
    crystal_khz = ecx_hz / 1000;
    
    if (crystal_khz == 0) {

        case INTEL_FAM6_SKYLAKE_X:
        case INTEL_FAM6_ATOM_DENVERTON:
            crystal_khz = 25000;    /* 25.0 MHz */
            break;

This frequency is then considered accurate and no further TSC calibration is done, as explained in this comment: 
   /*
     * TSC frequency determined by CPUID is a "hardware reported"
     * frequency and is the most accurate one so far we have. This
     * is considered a known frequency.
     */
    setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

When the spread spectrum clock is disabled this works fine. The frequency obtained here corresponds to the actual TSC frequency and the wall clock keeps correct time.
When the spread spectrum clock is enabled the clock sweeps between 0.5% below the nominal frequency and the nominal frequency. Therefore the average frequency is off by 0.25% which leads to the 0.25% wall time drift.

I am not sure how to address this problem, perhaps one way would be to fall back to TSC calibration in this case instead of relying on the predefined value.
Comment 1 john stultz 2017-10-17 21:28:14 UTC
Max! Its good to hear from you! Been a long time!

First pass, I'd try to see how the TSC calibration does compared to the predefined value. Depending on how long the freq sweep period is, you might just get random error depending on where you are in the sweep at bootup.

Can the kernel get the spread spectrum usage state somehow from the BIOS? 

It may be that this isn't something calibration can address, and that ntpd correction is the only realistic approach.
Comment 2 Max Asbock 2017-10-17 21:45:21 UTC
(In reply to john stultz from comment #1)
> Max! Its good to hear from you! Been a long time!
> 
> First pass, I'd try to see how the TSC calibration does compared to the
> predefined value. Depending on how long the freq sweep period is, you might
> just get random error depending on where you are in the sweep at bootup.

Hi John, good to hear from you too!

Actually we have data from running older kernels (as found in RHEL and SLES) where TSC calibration is done. TSC calibration gets the expected frequency (F * 99.75% fairly precisely.
 
> 
> Can the kernel get the spread spectrum usage state somehow from the BIOS? 

I don't know the answer to this question yet, I'll try to find out.

> It may be that this isn't something calibration can address, and that ntpd
> correction is the only realistic approach.

We hope there is a solution that doesn't require ntpd.
Comment 3 john stultz 2017-10-17 22:09:43 UTC
(In reply to Max Asbock from comment #2)
> (In reply to john stultz from comment #1)
> > Max! Its good to hear from you! Been a long time!
> > 
> > First pass, I'd try to see how the TSC calibration does compared to the
> > predefined value. Depending on how long the freq sweep period is, you might
> > just get random error depending on where you are in the sweep at bootup.
> 
> Hi John, good to hear from you too!
> 
> Actually we have data from running older kernels (as found in RHEL and SLES)
> where TSC calibration is done. TSC calibration gets the expected frequency
> (F * 99.75% fairly precisely.
>  
> > 
> > Can the kernel get the spread spectrum usage state somehow from the BIOS? 
> 
> I don't know the answer to this question yet, I'll try to find out.
>

Ok. Well, if you can figure that out or use something like DMIDECODE quirk matching on the affected hardware you can enable something like a force_calibrate flag. Might also allow for boot argument for force it.
Comment 4 Chris McDermott 2017-10-18 21:45:51 UTC
(In reply to john stultz from comment #3)
> (In reply to Max Asbock from comment #2)
> > (In reply to john stultz from comment #1)
> >  
> > > 
> > > Can the kernel get the spread spectrum usage state somehow from the BIOS? 
> > 
> > I don't know the answer to this question yet, I'll try to find out.

Hi John,
 
I was thinking about this too. If there was some way to programmatically determine whether SSC was enabled we could calibrate the TSC based on that information possibly using a quirk.

> >
> 
> Ok. Well, if you can figure that out or use something like DMIDECODE quirk
> matching on the affected hardware you can enable something like a
> force_calibrate flag. Might also allow for boot argument for force it.

I have a couple of questions. It would probably be good to have someone from Intel address at least the first question.

1) I wonder why the CPUID 0x15x function (Time Stamp Counter and Nominal Core Crystal Clock Information) is returning 0 in ECX (nominal frequency of the core crystal clock) on the Skylake processors. We have reproduced this on a number of different Intel Xeon Scalable Gold and Platinum CPUs. The CPUID spec seems to suggest that 0 could be expected ("0 is returned if the nominal core crystal clock is not enumerated"), but I don't know under what conditions this would occur.  We get a value of 0 for cases where SSC is disabled and enabled.

2) Given that 0 in ECX is not considered an error for CPUID 0x15, wouldn't it better to run the TSC calibration code under this condition rather than just using a default value (regardless of the CPU frequency)? It seems to me if the code was modified to do that, then the probability of time drift would be reduced.
Comment 5 john stultz 2017-10-18 21:56:33 UTC
(In reply to Chris McDermott from comment #4)
> (In reply to john stultz from comment #3)
> > (In reply to Max Asbock from comment #2)
> > > (In reply to john stultz from comment #1)
> > >  
> > > > 
> > > > Can the kernel get the spread spectrum usage state somehow from the
> BIOS? 
> > > 
> > > I don't know the answer to this question yet, I'll try to find out.
> 
> Hi John,
>  
> I was thinking about this too. If there was some way to programmatically
> determine whether SSC was enabled we could calibrate the TSC based on that
> information possibly using a quirk.
> 
> > >
> > 
> > Ok. Well, if you can figure that out or use something like DMIDECODE quirk
> > matching on the affected hardware you can enable something like a
> > force_calibrate flag. Might also allow for boot argument for force it.
> 
> I have a couple of questions. It would probably be good to have someone from
> Intel address at least the first question.

Yea. I'm not sure I have the expertise to answer these.

> 1) I wonder why the CPUID 0x15x function (Time Stamp Counter and Nominal
> Core Crystal Clock Information) is returning 0 in ECX (nominal frequency of
> the core crystal clock) on the Skylake processors. We have reproduced this
> on a number of different Intel Xeon Scalable Gold and Platinum CPUs. The
> CPUID spec seems to suggest that 0 could be expected ("0 is returned if the
> nominal core crystal clock is not enumerated"), but I don't know under what
> conditions this would occur.  We get a value of 0 for cases where SSC is
> disabled and enabled.
> 
> 2) Given that 0 in ECX is not considered an error for CPUID 0x15, wouldn't
> it better to run the TSC calibration code under this condition rather than
> just using a default value (regardless of the CPU frequency)? It seems to me
> if the code was modified to do that, then the probability of time drift
> would be reduced.

This seems reasonable if the 0 value you're getting indicated an error. Though that you always get one worries me.

The main reason for using a hardcoded value instead of calibrating, is that the calibration has its own error (and boot time cost), and using a known good value is preferred, as hardware used for calibration may have its own quirks.

Ideally the hardware would report the modified freq (or oscillate the freq so that the center is the specified freq). But as usual we have to deal with the actual hardware at hand. :)
Comment 6 john stultz 2017-10-18 21:57:12 UTC
I'd work out a patch and submit to lkml cc'ing folks at intel. Bugzilla doesn't get enough visibility to be useful with these sorts of design questions.
Comment 7 Max Asbock 2017-10-31 17:56:49 UTC
We talked more with our hardware engineers. They think spread spectrum clocks are not going away and are likely not confined to one particular hardware vendor. Therefore, it looks like this is going to be a problem. I am not ready to propose a patch yet. Maybe it would help to get input from the authors of the native_calibrate_tsc function as they are familiar with the hardware. 
I cc'ed Len Brown and Bin Gao. If you have any input, we'd greatly appreciate it.
Comment 8 bjascob 2017-12-22 05:16:50 UTC
*** Bug 198137 has been marked as a duplicate of this bug. ***
Comment 9 bjascob 2017-12-24 18:50:21 UTC
I'd like to point out another issue with the kernel code referenced at the top.  There are two bug reports on Bugzilla (mine at Bug#19837 and another at Bug#197842) where we are seeing a 4% clock drift.  It sees likely that this is due to the call to cpuid returning 0 for ecx_hz  which triggers the switch statement setting crystal_khz = 25000.  I suspect that in our cases, this should be 24000.

I have a i9-7940x which has a base clock of 3.1GHz and calls to cpuid returns a TSC/clock ratio of 258/2
24000 * 258 / 2 = 3.096 GHz (correct)
25000 * 258 / 2 = 3.225 GHz (wrong !)  <-- result of native_calibrate_tsc

Note that the (2.5-2.4)/2.4 ~= 4% which accounts for the drift we are seeing.

I'm not certain what the right answer is here, but it looks to me like the hard-coded values are incorrect for at least some SkylakeX cpus / motherboards.
Comment 10 Chris McDermott 2018-01-25 20:04:44 UTC
Len Brown's update removes the TSC calibration using the hardcoded 25.0MHz value in native_calibrate_tsc() with Skylake Xeon. We'll test this and report the results here.


commit b511203093489eb1829cb4de86e8214752205ac6
Author: Len Brown <len.brown@intel.com>
Date:   Fri Dec 22 00:27:55 2017 -0500

    x86/tsc: Fix erroneous TSC rate on Skylake Xeon
    
    The INTEL_FAM6_SKYLAKE_X hardcoded crystal_khz value of 25MHZ is
    problematic:
    
     - SKX workstations (with same model # as server variants) use a 24 MHz
       crystal.  This results in a -4.0% time drift rate on SKX workstations.
    
     - SKX servers subject the crystal to an EMI reduction circuit that reduces its
       actual frequency by (approximately) -0.25%.  This results in -1 second per
       10 minute time drift as compared to network time.
    
    This issue can also trigger a timer and power problem, on configurations
    that use the LAPIC timer (versus the TSC deadline timer).  Clock ticks
    scheduled with the LAPIC timer arrive a few usec before the time they are
    expected (according to the slow TSC).  This causes Linux to poll-idle, when
    it should be in an idle power saving state.  The idle and clock code do not
    graciously recover from this error, sometimes resulting in significant
    polling and measurable power impact.
    
    Stop using native_calibrate_tsc() for INTEL_FAM6_SKYLAKE_X.
    native_calibrate_tsc() will return 0, boot will run with tsc_khz = cpu_khz,
    and the TSC refined calibration will update tsc_khz to correct for the
    difference.
Comment 11 Max Asbock 2018-01-30 22:50:15 UTC
(In reply to Chris McDermott from comment #10)
> Len Brown's update removes the TSC calibration using the hardcoded 25.0MHz
> value in native_calibrate_tsc() with Skylake Xeon. We'll test this and
> report the results here.
> 

We tested a system for several days with the 4.15-rc9 kernel which contains the patch referred to in comment #10. We no longer see the large time drift. Therefore we can consider the problem as fixed in main-line.