Bug 197843
| Summary: | native_calibrate_tsc(): possibly incorrect TSC frequency on newest Intel Skylake-X CPUs, i7-7820X in particular | | |
|---|---|---|---|
| Product: | Timers | Reporter: | Ivan (nekotekina) |
| Component: | Interval Timers | Assignee: | timers_interval-timers |
| Status: | NEW | | |
| Severity: | high | CC: | piecuch |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 4.10 ~ 4.13 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Ivan
2017-11-11 00:24:05 UTC
I have a very similar issue with an ASUS Prime X299-A motherboard and i9-7940x CPU. I'm seeing a 4% clock drift which causes audio output from VLC to be glitchy. Unlike you, however, my kern.log says tsc detected the processor speed correctly at 3.1 GHz. The clock drift doesn't seem to be causing any instability and I've left my clocksource at the default (which is tsc per the kern.log). To correct for the glitchy audio I'm using the adjtimex package / adjtimexconfig to apply an automatic correction to the RTC clock. This works, but it seems like a workaround, not a fix. I'm curious if you've found a more permanent solution.

Have you reviewed Bug#197299 native_calibrate_tsc? I'm wondering if this issue is related to that one or if this is something totally different.

As of 11/17 Intel pushed out a new microcode release. I'm wondering if you've tried this and seen any difference?

As far as I can see, the detected TSC frequency isn't written to kern.log by default. I don't think the CPU frequency is reported incorrectly for me either (it's always 3.6 GHz). But I do have a "vboxdrv: TSC mode is Invariant, tentative frequency 3600000111 Hz" message, which initially printed something close to 3750000000 Hz until I applied a hack, but this is vboxdrv. Unfortunately I haven't found any better workaround. Bug#197299 is interesting, but I can't judge whether it's a different issue or the same.

Yes, I tried this microcode release. No visible changes.

Can you please check if it's a regression introduced somewhere after the 4.4 kernel? It happens on kernels 4.15 and later (checked on 5.0.0 and 5.3.1) on Intel Xeon Gold 6146 and Intel Xeon Gold 6154. I'm forced to use HPET on those. Reverting back to 4.4.0 makes the clock stable on tsc.

FYI, in case this helps: my particular issue (see above) was fixed with commit https://github.com/torvalds/linux/commit/b511203093489eb1829cb4de86e8214752205ac6.
This was committed 2018/01/25 and shows up in kernel 4.14.15 and in the 4.15 kernel which I have in Ubuntu 18.04. The fix is in arch/x86/kernel/tsc.c::native_calibrate_tsc(void). Looking at this file in later kernels, it looks like there have been a lot of changes here since, and in the latest it uses CPUID to determine the TSC frequency instead of a hard-coded value. I'd suggest looking at native_calibrate_tsc in arch/x86/kernel/tsc.c for the various kernels you've tried; maybe you can see where the issue comes in.

I have managed to bisect the bug:

    aa297292d708e89773b3b2cdcaf33f01bfa095d8 is the first bad commit
    commit aa297292d708e89773b3b2cdcaf33f01bfa095d8
    Author: Len Brown <len.brown@intel.com>
    Date: Fri Jun 17 01:22:51 2016 -0400

        x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID

        Skylake CPU base-frequency and TSC frequency may differ by up to 2%.

        Enumerate CPU and TSC frequencies separately, allowing cpu_khz and tsc_khz to differ.

        The existing CPU frequency calibration mechanism is unchanged. However, CPUID extensions are preferred, when available.

        CPUID.0x16 is preferred over MSR and timer calibration for CPU frequency discovery.

        CPUID.0x15 takes precedence over CPU-frequency for TSC frequency discovery.

        Signed-off-by: Len Brown <len.brown@intel.com>
        Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
        Cc: Linus Torvalds <torvalds@linux-foundation.org>
        Cc: Peter Zijlstra <peterz@infradead.org>
        Link: http://lkml.kernel.org/r/b27ec289fd005833b27d694d9c2dbb716c5cdff7.1466138954.git.len.brown@intel.com
        Signed-off-by: Ingo Molnar <mingo@kernel.org>

    :040000 040000 4961fd66b14c79ad1e56f38f2d6e7468e420bc76 2ce45fca87b8444b3aa82ee60b6b739251249094 M arch

For the sake of completeness, here's the complete bisection log:

    git bisect start
    # bad: [694d0d0bb2030d2e36df73e2d23d5770511dbc8d] Linux 4.8-rc2
    git bisect bad 694d0d0bb2030d2e36df73e2d23d5770511dbc8d
    # good: [b3afc4525a507f21e98cc7571ea8c3f28484241c] Linux 4.7.10
    git bisect good b3afc4525a507f21e98cc7571ea8c3f28484241c
    # good: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
    git bisect good 523d939ef98fd712632d93a5a2b588e477a7565e
    # bad: [f0c98ebc57c2d5e535bc4f9167f35650d2ba3c90] Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
    git bisect bad f0c98ebc57c2d5e535bc4f9167f35650d2ba3c90
    # bad: [0e06f5c0deeef0332a5da2ecb8f1fcf3e024d958] Merge branch 'akpm' (patches from Andrew)
    git bisect bad 0e06f5c0deeef0332a5da2ecb8f1fcf3e024d958
    # bad: [e65805251f2db69c9f67ed8062ab82526be5a374] Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    git bisect bad e65805251f2db69c9f67ed8062ab82526be5a374
    # good: [dd9506954539dcedd0294a065ff0976e61386fc6] Merge tag 'hwmon-for-linus-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
    git bisect good dd9506954539dcedd0294a065ff0976e61386fc6
    # good: [7e4dc77b2869a683fc43c0394fca5441816390ba] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    git bisect good 7e4dc77b2869a683fc43c0394fca5441816390ba
    # bad: [c410614c902531d1ce2e46aec8ac91aa4dc89968] Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    git bisect bad c410614c902531d1ce2e46aec8ac91aa4dc89968
    # good: [0f657262d5f99ad86b9a63fb5dcd29036c2ed916] Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    git bisect good 0f657262d5f99ad86b9a63fb5dcd29036c2ed916
    # good: [2d724ffddd958f21e2711b7400c63bdfee287d75] Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    git bisect good 2d724ffddd958f21e2711b7400c63bdfee287d75
    # good: [e99a0745bdf8a5f7e3126a686846af4aeb852cc9] x86/pci, x86/platform/intel_mid_pci: Remove duplicate power off code
    git bisect good e99a0745bdf8a5f7e3126a686846af4aeb852cc9
    # bad: [c48ec42d6eae08f55685ab660f0743ed33b9f22a] x86/tsc: Remove the unused check_tsc_disabled()
    git bisect bad c48ec42d6eae08f55685ab660f0743ed33b9f22a
    # good: [05680e7fa8a4e700e031a5e72cd8c18265f0031a] x86/tsc_msr: Correct Silvermont reference clock values
    git bisect good 05680e7fa8a4e700e031a5e72cd8c18265f0031a
    # good: [05680e7fa8a4e700e031a5e72cd8c18265f0031a] x86/tsc_msr: Correct Silvermont reference clock values
    git bisect good 05680e7fa8a4e700e031a5e72cd8c18265f0031a
    # good: [02c0cd2dcf7fdc47d054b855b148ea8b82dbb7eb] x86/tsc_msr: Remove irqoff around MSR-based TSC enumeration
    git bisect good 02c0cd2dcf7fdc47d054b855b148ea8b82dbb7eb
    # bad: [ff4c86635ee12461fd3bd911d7d5253394da8f9d] x86/tsc: Enumerate BXT tsc_khz via CPUID
    git bisect bad ff4c86635ee12461fd3bd911d7d5253394da8f9d
    # bad: [aa297292d708e89773b3b2cdcaf33f01bfa095d8] x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID
    git bisect bad aa297292d708e89773b3b2cdcaf33f01bfa095d8
    # first bad commit: [aa297292d708e89773b3b2cdcaf33f01bfa095d8] x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID

I tested these commits on an Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz running on a Supermicro X11DPU-XLL v1.02 motherboard. If there's any more information I can provide, please feel free to ask.

I've managed to trace it down to Hyperspeed being turned on. [Supermicro claims][1] that Hyperspeed impacts the base clock frequency.

I see that the code in `arch/x86/kernel/tsc.c:614:unsigned long native_calibrate_tsc(void)` is not ideal: it uses CPUID leaf 0x15, which is fine, but if that doesn't work it falls back to leaf 0x16, which is documented as follows:

> * Data is returned from this interface in accordance with the processor's
> specification and does not reflect actual values. Suitable use of this data
> includes the display of processor information in like manner to the processor
> brand string and for determining the appropriate range to use when displaying
> processor information e.g. frequency history graphs. The returned information
> should not be used for any other purpose as the returned information does not
> accurately correlate to information / counters returned by other processor
> interfaces.

In case the 0x15 leaf doesn't work, we should fall back to the behaviour implemented for CPUs that have lower CPUID levels.

[1]: https://www.supermicro.com/support/faqs/faq.cfm?faq=21337
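
To make the arithmetic behind the frequencies discussed in this thread concrete, here is a minimal Python sketch of how a TSC frequency is derived from CPUID leaf 0x15 (EAX = ratio denominator, EBX = ratio numerator, ECX = crystal frequency in Hz), and how a wrong crystal assumption shows up as clock drift. The function names and the input values (a 24 MHz versus 25 MHz crystal and a 300/2 ratio) are illustrative assumptions, chosen only to reproduce the 3.6 GHz and 3.75 GHz figures quoted above. They were not read from any machine in this report, and this is not the kernel's code.

```python
from typing import Optional


def tsc_hz_from_cpuid15(eax_denominator: int,
                        ebx_numerator: int,
                        ecx_crystal_hz: int) -> Optional[int]:
    """TSC frequency in Hz from CPUID leaf 0x15 register values.

    Returns None when the leaf does not enumerate the ratio or the
    crystal frequency (any of the three fields is zero), which is the
    case where a caller has to fall back to some other method.
    """
    if eax_denominator == 0 or ebx_numerator == 0 or ecx_crystal_hz == 0:
        return None
    return ecx_crystal_hz * ebx_numerator // eax_denominator


def clock_error_percent(assumed_hz: int, actual_hz: int) -> float:
    """Relative error of a clock that counts TSC ticks at `actual_hz`
    but converts them to time using `assumed_hz`.

    Negative means the resulting clock runs slow.
    """
    return (actual_hz / assumed_hz - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical inputs: same 300/2 ratio, two different crystal guesses.
    actual = tsc_hz_from_cpuid15(2, 300, 24_000_000)
    assumed = tsc_hz_from_cpuid15(2, 300, 25_000_000)
    print("actual TSC Hz: ", actual)
    print("assumed TSC Hz:", assumed)
    print("clock error %: ", clock_error_percent(assumed, actual))
```

With these assumed inputs, treating the crystal as 25 MHz when it is really 24 MHz makes the resulting clock run about 4% slow, which is the same order of drift reported in the comments above.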