Created attachment 254793 [details] Patch to arch/x86/kernel/tsc.c from v4.8 I noticed that linux 4.8, built from the stable git release tag 'v4.8', does not think my CPU has ART (Intel's Always Running Timer), when in fact it does, as all CPUs of the 7th-gen Intel x86_64 architecture do; I found that in arch/x86/kernel/tsc.c, the detect_art() function thinks if the boot_cpu_data.cpuid_level is <= 0x13, then the CPU does not have ART ; but even in user-space my CPU DOES support CPUID.0x15 , and it has a boot_cpu_data.extended_cpuid_level of 0x80000008, fully supporting CPUID.0x80000007 (which yields the 'constant_tsc' bit 8 as 1), and the linux synthesized X86_FEATURE_NONSTOP_TSC is set , so how could it NOT have ART hardware ? Looking up the symbols 'art_to_tsc_denominator' and 'art_to_tsc_numerator' in the running kernel yielded 0 for both with linux v4.8, while in a test program, this code returns '832' as the numerator and 'ccc_freq_hz' and '7' as the denominator in user-space: <quote><code><pre> static inline __attribute__((always_inline)) void IA64_art_params( unsigned int *art_numerator, *art_denominator, *ccc_freq_hz ) { U32_t unu; asm volatile ( "mov $0, %%ecx\n\t" "mov $0x15, %%eax\n\t" "cpuid\n\t" "mov %%eax, %0\n\t" "mov %%ebx, %1\n\t" "mov %%ecx, %2\n\t" "mov %%edx, %3" : "=m" (*art_denominator) , "=m" (*art_numerator) , "=m" (*ccc_freq_hz) , "=m" (unu) : "%eax", "%ebx", "%ecx", "%edx" ); } </pre></code></quote> Contents of /proc/cpuinfo for CPU 0 : <quote><code><pre> processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz stepping : 3 microcode : 0x17 cpu MHz : 2900.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_t sc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept v pid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts bugs : bogomips : 5786.55 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: </pre></code><quote> Note that it has 'constant_tsc' and 'nonstop_tsc' . I have applied this patch to arch/x86/kernel/tsc.c, and it works fine, enabling X86_FEATURE_ART on the CPU, and enabling accurate real-time clock values to be obtained in user-space by use of the the 'rdtsc' and 'rdtscp' instructions. I think the CPU will always get a fault on the first rdmsr of MSR_IA32_TSC_ADJUST , because the offset has not yet been written . <quote><code><pre> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 78b9cb5..8eccb3e 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1024,6 +1024,7 @@ core_initcall(cpufreq_register_tsc_scaling); #endif /* CONFIG_CPU_FREQ */ #define ART_CPUID_LEAF (0x15) +#define MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART (0x80000008) #define ART_MIN_DENOMINATOR (1) @@ -1032,24 +1033,43 @@ core_initcall(cpufreq_register_tsc_scaling); */ static void detect_art(void) { - unsigned int unused[2]; - - if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF) - return; - - cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator, - &art_to_tsc_numerator, unused, unused+1); - + unsigned int v[2]; + + if(boot_cpu_data.cpuid_level < ART_CPUID_LEAF) + { + if(boot_cpu_data.extended_cpuid_level >= MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART) + { + pr_info("Would normally not use ART - cpuid_level:%d < %d - but extended_cpuid_level is: %x, so probing for ART support.", + boot_cpu_data.cpuid_level, ART_CPUID_LEAF, boot_cpu_data.extended_cpuid_level); + }else + return; + } + + cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator, + &art_to_tsc_numerator, v, v+1); + /* Don't enable ART in a VM, non-stop TSC required */ if (boot_cpu_has(X86_FEATURE_HYPERVISOR) || - !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) || - art_to_tsc_denominator < ART_MIN_DENOMINATOR) - return; - - if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset)) - return; - + !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) || + art_to_tsc_denominator < ART_MIN_DENOMINATOR) + { + pr_info("Not using Intel ART for TSC - HYPERVISOR:%d NO NONSTOP_TSC:%d bad TSC/Crystal ratio denominator: %d.", boot_cpu_has(X86_FEATURE_HYPERVISOR), !boot_cpu_has(X86_FEATURE_NONSTOP_TSC), art_to_tsc_denominator ); + return; + } + if ( (v[0]=rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))!=0) /* will get fault on first read if nothing written yet */ + { + if((v[1]=wrmsrl_safe(MSR_IA32_TSC_ADJUST, 0))!=0) + { + pr_info("Not using Intel ART for TSC - failed to initialize TSC_ADJUST: %d %d.", v[0], v[1] ); + return; + }else + { + art_to_tsc_offset = 0; /* perhaps initalize to -1 * current rdtsc value ? */ + pr_info("Using Intel ART for TSC - TSC_ADJUST initialized to 0."); + } + } /* Make this sticky over multiple CPU init calls */ + pr_info("Using Intel Always Running Timer (ART) feature for TSC on all CPUs."); setup_force_cpu_cap(X86_FEATURE_ART); } </pre></code></quote> patch attached also.
Created attachment 254795 [details] Patch to arch/x86/kernel/tsc.c : improved logging
So now, with the above patch applied, I see log messages at boot : <pre> Feb 16 17:47:07 jvdlux kernel: [ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns Feb 16 17:47:07 jvdlux kernel: [ 0.000000] hpet clockevent registered Feb 16 17:47:07 jvdlux kernel: [ 0.000000] tsc: Fast TSC calibration using PIT Feb 16 17:47:07 jvdlux kernel: [ 0.000000] tsc: Detected 2893.267 MHz processor Feb 16 17:47:07 jvdlux kernel: [ 0.000000] tsc: Would normally not use ART - cpuid_level:13 < 21 - but extended_cpuid_level is: 80000008, so probing for ART support. Feb 16 17:47:07 jvdlux kernel: [ 0.000000] tsc: Using Intel Always Running Timer (ART) feature 6a for TSC on all CPUs - TSC/CCC: 832/7 offset: 0. Feb 16 17:47:07 jvdlux kernel: [ 0.000237] Calibrating delay loop (skipped), value calculated using timer frequency.. 5786.53 BogoMIPS (lpj=2893267) Feb 16 17:47:07 jvdlux kernel: [ 0.000377] pid_max: default: 32768 minimum: 301 Feb 16 17:47:07 jvdlux kernel: [ 0.000449] ACPI: Core revision 20160422 </pre>
I wrote a little script to print out symbol values from running kernel: $ for n in art_to_tsc_numerator art_to_tsc_denominator tsc_khz; do /usr/bin/ksym $n 4; done ffffffff82bd9a04 art_to_tsc_numerator 00000340 832 ffffffff82bd9a00 art_to_tsc_denominator 00000007 7 ffffffff829f24b4 tsc_khz 002c25f3 2893299 Using these values, a program could derive a very accurate value for a 'real-time nano-seconds since start' value without entering kernel by issuing the 'rdtscp' instruction, and using the TSC_Value and 'tsc_khz' value: nanoseconds = (TSC_Value - TSC_Value_at_process_start) * ( 1000000000 / ( tsc_khz * 1000 ) ) and know it will be very near the value the kernel would return in response to a clock_gettime(CLOCK_MONOTONIC_RAW,&ts) system call if they were actually invoked at precisely the same time . I am writing a library that makes use of this , which needs to be sure that the kernel is using the TSC for clock_gettime(CLOCK_MONOTONIC_RAW,×tamp).
Oops, of course pseudo-code calculation in Comment #3 should be : nanoseconds = (TSC_Value - TSC_Value_at_process_start) * (1000000000 / tsc_khz); /* 1e9 / ( (1e-3 *tsc_khz) * 1e9 ) == (1e9 / ((tsc_khz*1-e6) * 1e6) ) == 1e9 * (1/tsc_khz) == (1e9 / tsc_khz) ( in this case!*/ ie if tsc_khz is 2893299 or 2893300, as it ALWAYS ends up being on my system, number of nanoseconds per TSC tick is @ : 1e9 / 293300 khz = 345.626 ns This corresponds well with observed minimum times between clock_gettime(CLOCK_MONOTONIC_RAW,×pec). But not when the above patch is not applied; then , some calculation involving the current CPU frequency occurs and there is a large disparity between minimum measurable times with clock_gettime(), plus the relatively large latency involved with syscall entry (> 20ns or so).
Aargh! - tsc_khz is KILO Hertz, so 2893300 KILO Hertz has period of 34.5626 ns or 1e9 / (2893.3 * 1000) or 1e9 / ( 2893300 / 100 ), so pseudo-code in Comment #3 and Comment #4 should be: nanoseconds = (TSC_Value - TSC_Value_at_process_start) * (1000000000 / (tsc_khz / 100) );
actually, the correct calculation is : 1e9 / ( 2893300 *10 ) ! ( 2.8933khz is 1000000000 / (2.89330000 * 1000000000) ns!) Sorry. so pseudo-code in Comment #3 and Comment #4 should be: nanoseconds = (TSC_Value - TSC_Value_at_process_start) * (1000000000 / (tsc_khz * 10 ));
2893300hz has period 34.5626ns ; 1 / 34.5626e-9 == 2.8933E+07 GHz.
So please apply above patch or something like it to a future linux version and let users derive accurate real-time durations from rdtsc & rdtscp instruction results!
Sorry for confusion. I think 28933300 Kilo-Hertz really is 1000 * 2893300 Giga-Hertz so the nano-second increment per TSC tick really is the tiny 3.456102e-10 value I started with in Comment #3 . So potentially one can get extremely accurate timings using TSC . I haven't wrote the library yet - now I know how linux is calculating clock_gettime(CLOCK_MONOTONIC_RAW) from TSC I know what to do .
probably best way of calculating number of nanoseconds per tick would be : static const unsigned long long n_ps = 1000000000000ULL / ( ((unsigned long long)tsc_khz) * 1000U ) ; /* number of peta(1e-12)-seconds per TSC tick ? */ unsigned long n_ns = ((((unsigned long long) ( TSC_Value_Now() - TSC_Value_At_Start() ) )*n_ps ) / 1000ULL ); /* 1e-9 seconds */ return n_ns; but you can only rely on clock_gettime(CLOCK_MONOTONIC_RAW,&ts) being near this value IFF the above patch is applied to tsc.c .
Created attachment 254825 [details] demonstration of issue & fix $ tar -xpf ttsc.tar $ cd ttsc $ CPPFLAGS='-DKSYM_LOCATION=\"'$(pwd)/ksym'\"' make $ ./ttsc1 has tsc: 1 constant: 1 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 Hooray! TSC is enabled with KHz: 2893300 ts2 - ts1: 212 ts3 - ts2: 170 ns1: 0.000000257 ns2: 0.000001682 ts3 - ts2: 178 ns1: 0.000000308 ts3 - ts2: 27 ns1: 0.000000272 ts3 - ts2: 15 ns1: 0.000000132 ts3 - ts2: 17 ns1: 0.000000244 ts3 - ts2: 18 ns1: 0.000000247 ts3 - ts2: 18 ns1: 0.000000263 ts3 - ts2: 18 ns1: 0.000000247 ts3 - ts2: 17 ns1: 0.000000261 ts3 - ts2: 19 ns1: 0.000000257 ts3 - ts2: 19 ns1: 0.000000282 t1 - t0: 66444 - ns2: 0.000067677
To demonstrate the relevance of this issue, here is a little test program , in attached 'ttsc.tar' file: $ tar -xpf ttsc.tar $ cd ttsc $ PWD="$(pwd)"; $ CPPFLAGS='-DKSYM_LOCATION=\"'${PWD}'\"' make ... $ ./ttsc1 has tsc: 1 constant: 1 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 Hooray! TSC is enabled with KHz: 2893300 ts2 - ts1: 138 ts3 - ts2: 81 ns1: 0.000000158 ns2: 0.000001816 ts3 - ts2: 24 ns1: 0.000000162 ts3 - ts2: 17 ns1: 0.000000143 ts3 - ts2: 17 ns1: 0.000000146 ts3 - ts2: 17 ns1: 0.000000149 ts3 - ts2: 17 ns1: 0.000000141 ts3 - ts2: 16 ns1: 0.000000142 ts3 - ts2: 17 ns1: 0.000000151 ts3 - ts2: 15 ns1: 0.000000149 ts3 - ts2: 17 ns1: 0.000000145 ts3 - ts2: 17 ns1: 0.000000140 t1 - t0: 28689 - ns2: 0.000029599 The 'ts3 - ts1' and 'ns1' values shown above are the number of nanoseconds between calls to my IA64_s_ns_since_start() function and the system clock_gettime() call respectively : ttsc1.c, @line 50: clock_gettime(CLOCK_MONOTONIC_RAW,&tsp1); clock_gettime(CLOCK_MONOTONIC_RAW,&tsp2); ts2 = IA64_s_ns_since_start(); ts3 = IA64_s_ns_since_start(); n_ns1=(((((U128_t)tsp2.tv_sec) * 1000000000ULL) + ((U128_t) tsp2.tv_nsec)) - ((((U128_t)tsp1.tv_sec) * 1000000000ULL) + ((U128_t) tsp1.tv_nsec))) ; So latencies of @ 140ns are seen between calls to clock_gettime, while latencies of @15-20ns are seen between calls to IA64_s_ns_since_start() , on my 2.9Ghz base -> 3.9Ghz "turbo" frequency 4/8-core i7-4910MQ CPU . This is an attempt to overcome the fundamental restriction on the minimum length of time that can be measured on Linux with use of any system call, because of the overhead of user -> kernel space transition, even through the VDSO . The IA64_s_ns_since_start() function uses NO system calls and does not need to enter the kernel. Yes, I know that no code in the Linux kernel except the e1000 driver's PTP implementation uses the actual ART hardware frequency values returned in CPUID:15H, which, IMHO, is a shame - it would be nice if Linux actually used the hardware crystal clock frequency and its exact ratio to TSC frequency to generate more accurate monotonic raw time stamps from the TSC . But with the attached patch to tsc.c, which has also been successfully applied to version 4.9.10 , at least user space code can know that the X86_FEATURE_ART is set in the boot_cpu_data CPUID flags by reading /dev/cpu/0/cpuid:bit((3*32)+10), and can then confidently use code such as that in IA64_rdtsc.h to use the 'rdtcsp' instruction to generate real-time time stamp values in user-space that will end up measuring very slightly less overall cumulative real-time than the clock_gettime() would , by a very small constant offset of < 500ns, because this is the cumulative overhead , but is nonetheless well correlated with clock_gettime(CLOCK_MONOTONIC_RAW,,,) values, and very accurate for instaneous duration measurements .
Oops, should have been: $ CPPFLAGS='-DKSYM_LOCATION=\"'${PWD}/ksym'\"' make in Comment #12 line 7 (actually, code would also look for ./ksym).
& "instaneous" == "instantaneous"
The point is, if you have a 7th-gen Intel x86_64 CPU that identifies as model name: i7-4910MQ , linux was thinking ( I believe incorrectly ) that it did not support CPUID:15H, because the boot_cpu_data.cpuid_level was 0x13 ; yet the extended cpuid level is 0x80000008 , and both 'tsc' and 'constant_tsc' capabilities are in CPUID:01, and the kernel sets its synthesized 'nonstop_tsc' bit , but did NOT set the synthesized X86_FEATURE_ART bit, and had no idea of the ART denominator & numerator, so could not use for instance e1000 PTP . So when running on i7-4910MQ CPUs WITHOUT the attachment #2 [details] ( patch : https://bugzilla.kernel.org/attachment.cgi?id=254795 ) applied, the test program 'ttsc1' will exit with an error message, because X86_FEATURE_ART bit is not set . So please apply the patch to future versions of arch/x86/kernel/tsc.c .
See http://lkml.kernel.org/r/alpine.DEB.2.20.1702201239330.4208@nanos