We observe a 0.25% wall time drift on systems where a spread spectrum clock is used to generate CPU internal clocks, including the TSC. The CPUs in the system are of type INTEL_FAM6_SKYLAKE_X. As for the cause of this drift it helps to look at native_calibrate_tsc: First, an attempt is made to obtain a frequency through CPUID 15H call: eax_denominator = ebx_numerator = ecx_hz = edx = 0; /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */ cpuid(0x15, &eax_denominator, &ebx_numerator, &ecx_hz, &edx); In our case this call returns 0 for ecx_hz and therefore crystal_khz is also set 0 and we get to the following switch statement where are hardcoded predefined value us selected: crystal_khz = ecx_hz / 1000; if (crystal_khz == 0) { case INTEL_FAM6_SKYLAKE_X: case INTEL_FAM6_ATOM_DENVERTON: crystal_khz = 25000; /* 25.0 MHz */ break; This frequency is then considered accurate and no further TSC calibration is done, as explained in this comment: /* * TSC frequency determined by CPUID is a "hardware reported" * frequency and is the most accurate one so far we have. This * is considered a known frequency. */ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ); When the spread spectrum clock is disabled this works fine. The frequency obtained here corresponds to the actual TSC frequency and the wall clock keeps correct time. When the spread spectrum clock is enabled the clock sweeps between 0.5% below the nominal frequency and the nominal frequency. Therefore the average frequency is off by 0.25% which leads to the 0.25% wall time drift. I am not sure how to address this problem, perhaps one way would be to fall back to TSC calibration in this case instead of relying on the predefined value.
Max! Its good to hear from you! Been a long time! First pass, I'd try to see how the TSC calibration does compared to the predefined value. Depending on how long the freq sweep period is, you might just get random error depending on where you are in the sweep at bootup. Can the kernel get the spread spectrum usage state somehow from the BIOS? It may be that this isn't something calibration can address, and that ntpd correction is the only realistic approach.
(In reply to john stultz from comment #1) > Max! Its good to hear from you! Been a long time! > > First pass, I'd try to see how the TSC calibration does compared to the > predefined value. Depending on how long the freq sweep period is, you might > just get random error depending on where you are in the sweep at bootup. Hi John, good to hear from you too! Actually we have data from running older kernels (as found in RHEL and SLES) where TSC calibration is done. TSC calibration gets the expected frequency (F * 99.75% fairly precisely. > > Can the kernel get the spread spectrum usage state somehow from the BIOS? I don't know the answer to this question yet, I'll try to find out. > It may be that this isn't something calibration can address, and that ntpd > correction is the only realistic approach. We hope there is a solution that doesn't require ntpd.
(In reply to Max Asbock from comment #2) > (In reply to john stultz from comment #1) > > Max! Its good to hear from you! Been a long time! > > > > First pass, I'd try to see how the TSC calibration does compared to the > > predefined value. Depending on how long the freq sweep period is, you might > > just get random error depending on where you are in the sweep at bootup. > > Hi John, good to hear from you too! > > Actually we have data from running older kernels (as found in RHEL and SLES) > where TSC calibration is done. TSC calibration gets the expected frequency > (F * 99.75% fairly precisely. > > > > > Can the kernel get the spread spectrum usage state somehow from the BIOS? > > I don't know the answer to this question yet, I'll try to find out. > Ok. Well, if you can figure that out or use something like DMIDECODE quirk matching on the affected hardware you can enable something like a force_calibrate flag. Might also allow for boot argument for force it.
(In reply to john stultz from comment #3) > (In reply to Max Asbock from comment #2) > > (In reply to john stultz from comment #1) > > > > > > > > Can the kernel get the spread spectrum usage state somehow from the BIOS? > > > > I don't know the answer to this question yet, I'll try to find out. Hi John, I was thinking about this too. If there was some way to programmatically determine whether SSC was enabled we could calibrate the TSC based on that information possibly using a quirk. > > > > Ok. Well, if you can figure that out or use something like DMIDECODE quirk > matching on the affected hardware you can enable something like a > force_calibrate flag. Might also allow for boot argument for force it. I have a couple of questions. It would probably be good to have someone from Intel address at least the first question. 1) I wonder why the CPUID 0x15x function (Time Stamp Counter and Nominal Core Crystal Clock Information) is returning 0 in ECX (nominal frequency of the core crystal clock) on the Skylake processors. We have reproduced this on a number of different Intel Xeon Scalable Gold and Platinum CPUs. The CPUID spec seems to suggest that 0 could be expected ("0 is returned if the nominal core crystal clock is not enumerated"), but I don't know under what conditions this would occur. We get a value of 0 for cases where SSC is disabled and enabled. 2) Given that 0 in ECX is not considered an error for CPUID 0x15, wouldn't it better to run the TSC calibration code under this condition rather than just using a default value (regardless of the CPU frequency)? It seems to me if the code was modified to do that, then the probability of time drift would be reduced.
(In reply to Chris McDermott from comment #4) > (In reply to john stultz from comment #3) > > (In reply to Max Asbock from comment #2) > > > (In reply to john stultz from comment #1) > > > > > > > > > > > Can the kernel get the spread spectrum usage state somehow from the > BIOS? > > > > > > I don't know the answer to this question yet, I'll try to find out. > > Hi John, > > I was thinking about this too. If there was some way to programmatically > determine whether SSC was enabled we could calibrate the TSC based on that > information possibly using a quirk. > > > > > > > > Ok. Well, if you can figure that out or use something like DMIDECODE quirk > > matching on the affected hardware you can enable something like a > > force_calibrate flag. Might also allow for boot argument for force it. > > I have a couple of questions. It would probably be good to have someone from > Intel address at least the first question. Yea. I'm not sure I have the expertise to answer these. > 1) I wonder why the CPUID 0x15x function (Time Stamp Counter and Nominal > Core Crystal Clock Information) is returning 0 in ECX (nominal frequency of > the core crystal clock) on the Skylake processors. We have reproduced this > on a number of different Intel Xeon Scalable Gold and Platinum CPUs. The > CPUID spec seems to suggest that 0 could be expected ("0 is returned if the > nominal core crystal clock is not enumerated"), but I don't know under what > conditions this would occur. We get a value of 0 for cases where SSC is > disabled and enabled. > > 2) Given that 0 in ECX is not considered an error for CPUID 0x15, wouldn't > it better to run the TSC calibration code under this condition rather than > just using a default value (regardless of the CPU frequency)? It seems to me > if the code was modified to do that, then the probability of time drift > would be reduced. This seems reasonable if the 0 value you're getting indicated an error. Though that you always get one worries me. The main reason for using a hardcoded value instead of calibrating, is that the calibration has its own error (and boot time cost), and using a known good value is preferred, as hardware used for calibration may have its own quirks. Ideally the hardware would report the modified freq (or oscillate the freq so that the center is the specified freq). But as usual we have to deal with the actual hardware at hand. :)
I'd work out a patch and submit to lkml cc'ing folks at intel. Bugzilla doesn't get enough visibility to be useful with these sorts of design questions.
We talked more with our hardware engineers. They think spread spectrum clocks are not going away and are likely not confined to one particular hardware vendor. Therefore, it looks like this is going to be a problem. I am not ready to propose a patch yet. Maybe it would help to get input from the authors of the native_calibrate_tsc function as they are familiar with the hardware. I cc'ed Len Brown and Bin Gao. If you have any input, we'd greatly appreciate it.
*** Bug 198137 has been marked as a duplicate of this bug. ***
I'd like to point out another issue with the kernel code referenced at the top. There are two bug reports on Bugzilla (mine at Bug#19837 and another at Bug#197842) where we are seeing a 4% clock drift. It sees likely that this is due to the call to cpuid returning 0 for ecx_hz which triggers the switch statement setting crystal_khz = 25000. I suspect that in our cases, this should be 24000. I have a i9-7940x which has a base clock of 3.1GHz and calls to cpuid returns a TSC/clock ratio of 258/2 24000 * 258 / 2 = 3.096 GHz (correct) 25000 * 258 / 2 = 3.225 GHz (wrong !) <-- result of native_calibrate_tsc Note that the (2.5-2.4)/2.4 ~= 4% which accounts for the drift we are seeing. I'm not certain what the right answer is here, but it looks to me like the hard-coded values are incorrect for at least some SkylakeX cpus / motherboards.
Len Brown's update removes the TSC calibration using the hardcoded 25.0MHz value in native_calibrate_tsc() with Skylake Xeon. We'll test this and report the results here. commit b511203093489eb1829cb4de86e8214752205ac6 Author: Len Brown <len.brown@intel.com> Date: Fri Dec 22 00:27:55 2017 -0500 x86/tsc: Fix erroneous TSC rate on Skylake Xeon The INTEL_FAM6_SKYLAKE_X hardcoded crystal_khz value of 25MHZ is problematic: - SKX workstations (with same model # as server variants) use a 24 MHz crystal. This results in a -4.0% time drift rate on SKX workstations. - SKX servers subject the crystal to an EMI reduction circuit that reduces its actual frequency by (approximately) -0.25%. This results in -1 second per 10 minute time drift as compared to network time. This issue can also trigger a timer and power problem, on configurations that use the LAPIC timer (versus the TSC deadline timer). Clock ticks scheduled with the LAPIC timer arrive a few usec before the time they are expected (according to the slow TSC). This causes Linux to poll-idle, when it should be in an idle power saving state. The idle and clock code do not graciously recover from this error, sometimes resulting in significant polling and measurable power impact. Stop using native_calibrate_tsc() for INTEL_FAM6_SKYLAKE_X. native_calibrate_tsc() will return 0, boot will run with tsc_khz = cpu_khz, and the TSC refined calibration will update tsc_khz to correct for the difference.
(In reply to Chris McDermott from comment #10) > Len Brown's update removes the TSC calibration using the hardcoded 25.0MHz > value in native_calibrate_tsc() with Skylake Xeon. We'll test this and > report the results here. > We tested a system for several days with the 4.15-rc9 kernel which contains the patch referred to in comment #10. We no longer see the large time drift. Therefore we can consider the problem as fixed in main-line.