Bug 194609 - enable Intel Always Running Timer (ART) X86_FEATURE_ART on 7th-gen (i7) clevo kapok laptop CPU: "i7-4910MQ"
Summary: enable Intel Always Running Timer (ART) X86_FEATURE_ART on 7th-gen (i7) clevo...
Status: RESOLVED INVALID
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64 (show other bugs)
Hardware: x86-64 Linux
: P1 high
Assignee: platform_ia-64
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-16 16:46 UTC by Jason Vas Dias
Modified: 2017-02-20 23:22 UTC (History)
2 users (show)

See Also:
Kernel Version: 4.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Patch to arch/x86/kernel/tsc.c from v4.8 (2.78 KB, patch)
2017-02-16 16:46 UTC, Jason Vas Dias
Details | Diff
Patch to arch/x86/kernel/tsc.c : improved logging (2.92 KB, patch)
2017-02-16 17:57 UTC, Jason Vas Dias
Details | Diff
demonstration of issue & fix (30.00 KB, application/x-gtar)
2017-02-18 16:04 UTC, Jason Vas Dias
Details

Description Jason Vas Dias 2017-02-16 16:46:50 UTC
Created attachment 254793 [details]
Patch to arch/x86/kernel/tsc.c from v4.8

I noticed that linux 4.8, built from the stable git release tag 'v4.8', does
not think my CPU has ART (Intel's Always Running Timer), when in fact it does,
as all CPUs of the 7th-gen Intel x86_64 architecture do; I found that in 
arch/x86/kernel/tsc.c, the detect_art() function thinks if the 
boot_cpu_data.cpuid_level is <= 0x13, then the CPU does not have ART ;
but even in user-space my CPU DOES support CPUID.0x15 , and it has 
a boot_cpu_data.extended_cpuid_level of 0x80000008, fully supporting
CPUID.0x80000007 (which yields the 'constant_tsc' bit 8 as 1), and the
linux synthesized X86_FEATURE_NONSTOP_TSC is set , so how could it NOT
have ART hardware ? 

Looking up the symbols 'art_to_tsc_denominator' and 'art_to_tsc_numerator' in 
the running kernel yielded 0 for both with linux v4.8, while in a test program,
this code returns '832' as the numerator and 'ccc_freq_hz' and '7' as the denominator in user-space:

<quote><code><pre>
static inline __attribute__((always_inline))
void IA64_art_params( unsigned int *art_numerator, *art_denominator, *ccc_freq_hz )
{
  U32_t unu;
  asm volatile
  ( "mov $0,    %%ecx\n\t"
    "mov $0x15, %%eax\n\t"    
    "cpuid\n\t"
    "mov %%eax, %0\n\t"
    "mov %%ebx, %1\n\t"
    "mov %%ecx, %2\n\t"
    "mov %%edx, %3"    
  : "=m" (*art_denominator) ,
    "=m" (*art_numerator) ,
    "=m" (*ccc_freq_hz) ,
    "=m" (unu)
  : "%eax", "%ebx", "%ecx", "%edx"
  );
}
</pre></code></quote>

Contents of /proc/cpuinfo for CPU 0 :
<quote><code><pre>
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 60
model name	: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz
stepping	: 3
microcode	: 0x17
cpu MHz		: 2900.000
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_t
sc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept v
pid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts
bugs		:
bogomips	: 5786.55
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:
</pre></code><quote>


Note that it has 'constant_tsc' and 'nonstop_tsc' .


I have applied this patch to arch/x86/kernel/tsc.c, and it works fine, enabling
X86_FEATURE_ART on the CPU, and enabling accurate real-time clock values to be
obtained in user-space by use of the the 'rdtsc' and 'rdtscp' instructions. 

I think the CPU will always get a fault on the first rdmsr of MSR_IA32_TSC_ADJUST , because the offset has not yet been written .

<quote><code><pre>
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 78b9cb5..8eccb3e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1024,6 +1024,7 @@ core_initcall(cpufreq_register_tsc_scaling);
 #endif /* CONFIG_CPU_FREQ */

 #define ART_CPUID_LEAF (0x15)
+#define MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART (0x80000008)
 #define ART_MIN_DENOMINATOR (1)


@@ -1032,24 +1033,43 @@ core_initcall(cpufreq_register_tsc_scaling);
  */
 static void detect_art(void)
 {
-	unsigned int unused[2];
-
-	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
-		return;
-
-	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
-	      &art_to_tsc_numerator, unused, unused+1);
-
+	unsigned int v[2];
+
+	if(boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
+        {
+                if(boot_cpu_data.extended_cpuid_level >= MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART)
+                {
+                        pr_info("Would normally not use ART - cpuid_level:%d < %d - but extended_cpuid_level is: %x, so probing for ART support.",
+                        boot_cpu_data.cpuid_level, ART_CPUID_LEAF, boot_cpu_data.extended_cpuid_level);
+                }else
+                        return;
+        }
+
+        cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
+              &art_to_tsc_numerator, v, v+1);
+
 	/* Don't enable ART in a VM, non-stop TSC required */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
-	    art_to_tsc_denominator < ART_MIN_DENOMINATOR)
-		return;
-
-	if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))
-		return;
-
+           !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+	   art_to_tsc_denominator < ART_MIN_DENOMINATOR)
+        {
+                pr_info("Not using Intel ART for TSC - HYPERVISOR:%d  NO NONSTOP_TSC:%d  bad TSC/Crystal ratio denominator: %d.", boot_cpu_has(X86_FEATURE_HYPERVISOR), !boot_cpu_has(X86_FEATURE_NONSTOP_TSC), art_to_tsc_denominator );
+                return;
+        }
+	if (  (v[0]=rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))!=0) /* will get fault on first read if nothing written yet */
+        {
+                if((v[1]=wrmsrl_safe(MSR_IA32_TSC_ADJUST, 0))!=0)
+                {
+                        pr_info("Not using Intel ART for TSC - failed to initialize TSC_ADJUST: %d %d.", v[0], v[1] );
+                        return;
+                }else
+                {
+                        art_to_tsc_offset = 0; /* perhaps initalize to -1 * current rdtsc value ? */
+                        pr_info("Using Intel ART for TSC - TSC_ADJUST initialized to 0.");
+                }
+        }
 	/* Make this sticky over multiple CPU init calls */
+        pr_info("Using Intel Always Running Timer (ART) feature for TSC on all CPUs.");
 	setup_force_cpu_cap(X86_FEATURE_ART);
 }
</pre></code></quote>

patch attached also.
Comment 1 Jason Vas Dias 2017-02-16 17:57:04 UTC
Created attachment 254795 [details]
Patch to arch/x86/kernel/tsc.c : improved logging
Comment 2 Jason Vas Dias 2017-02-16 17:59:32 UTC
So now, with the above patch applied, I see log messages at boot :
<pre>
Feb 16 17:47:07 jvdlux kernel: [    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
Feb 16 17:47:07 jvdlux kernel: [    0.000000] hpet clockevent registered
Feb 16 17:47:07 jvdlux kernel: [    0.000000] tsc: Fast TSC calibration using PIT
Feb 16 17:47:07 jvdlux kernel: [    0.000000] tsc: Detected 2893.267 MHz processor
Feb 16 17:47:07 jvdlux kernel: [    0.000000] tsc: Would normally not use ART - cpuid_level:13 < 21 - but extended_cpuid_level is: 80000008, so probing for ART support.
Feb 16 17:47:07 jvdlux kernel: [    0.000000] tsc: Using Intel Always Running Timer (ART) feature 6a for TSC on all CPUs - TSC/CCC: 832/7 offset: 0.
Feb 16 17:47:07 jvdlux kernel: [    0.000237] Calibrating delay loop (skipped), value calculated using timer frequency.. 5786.53 BogoMIPS (lpj=2893267)
Feb 16 17:47:07 jvdlux kernel: [    0.000377] pid_max: default: 32768 minimum: 301
Feb 16 17:47:07 jvdlux kernel: [    0.000449] ACPI: Core revision 20160422
</pre>
Comment 3 Jason Vas Dias 2017-02-16 18:10:27 UTC
I wrote a little script to print out symbol values from running kernel:

$ for n in art_to_tsc_numerator art_to_tsc_denominator tsc_khz; do /usr/bin/ksym $n 4; done
ffffffff82bd9a04 art_to_tsc_numerator 00000340 832
ffffffff82bd9a00 art_to_tsc_denominator 00000007 7
ffffffff829f24b4 tsc_khz 002c25f3 2893299

Using these values, a program could derive a very accurate value for a
'real-time nano-seconds since start' value without entering kernel by issuing
the 'rdtscp' instruction, and using the TSC_Value and 'tsc_khz' value:

    nanoseconds = (TSC_Value - TSC_Value_at_process_start) * ( 1000000000 / ( tsc_khz * 1000 ) )   

and know it will be very near the value the kernel would return in response
to a clock_gettime(CLOCK_MONOTONIC_RAW,&ts) system call if they were actually
invoked at precisely the same time . I am writing a library that makes use of
this , which needs to be sure that the kernel is using the TSC for 
clock_gettime(CLOCK_MONOTONIC_RAW,&timestamp).
Comment 4 Jason Vas Dias 2017-02-16 19:21:54 UTC
Oops, of course pseudo-code calculation in Comment #3 should be :
  nanoseconds = (TSC_Value - TSC_Value_at_process_start) * 
                (1000000000 /  tsc_khz);  
           /*    1e9 / (  (1e-3 *tsc_khz) * 1e9 ) == 
                 (1e9 / ((tsc_khz*1-e6) * 1e6) ) ==
                 1e9 * (1/tsc_khz) ==
                 (1e9 / tsc_khz) ( in this case!*/   
ie if tsc_khz is 2893299 or 2893300, as it ALWAYS ends up being on my system, 
number of nanoseconds per TSC tick is @ :

         1e9 / 293300 khz = 345.626 ns

This corresponds well with observed minimum times between clock_gettime(CLOCK_MONOTONIC_RAW,&timespec).
But not when the above patch is not applied; then , some calculation
involving the current CPU frequency occurs and there is a large 
disparity between minimum measurable times with clock_gettime(), 
plus the relatively large latency involved with syscall entry (> 20ns or so).
Comment 5 Jason Vas Dias 2017-02-16 19:41:36 UTC
Aargh! - tsc_khz is KILO Hertz, so 2893300 KILO Hertz has period of 34.5626 ns or
1e9  / (2893.3 * 1000)
or
1e9 / ( 2893300 / 100 ), so pseudo-code in Comment #3 and Comment #4 should be:

  nanoseconds = (TSC_Value - TSC_Value_at_process_start) * 
                (1000000000 /  (tsc_khz / 100) );
Comment 6 Jason Vas Dias 2017-02-16 19:52:58 UTC
actually, the correct calculation is :
  1e9 / ( 2893300 *10 ) !
  ( 2.8933khz is 1000000000 / (2.89330000 * 1000000000) ns!) Sorry.
 
so pseudo-code in Comment #3 and Comment #4 should be:
  nanoseconds = (TSC_Value - TSC_Value_at_process_start) * 
                (1000000000 /  (tsc_khz * 10 ));
Comment 7 Jason Vas Dias 2017-02-16 19:56:30 UTC
2893300hz has period 34.5626ns ;  1 / 34.5626e-9 == 2.8933E+07 GHz.
Comment 8 Jason Vas Dias 2017-02-16 19:58:35 UTC
So please apply above patch or something like it to a future linux version
and let users derive accurate real-time durations from rdtsc & rdtscp 
instruction results!
Comment 9 Jason Vas Dias 2017-02-16 20:31:26 UTC
Sorry for confusion. I think 28933300 Kilo-Hertz really is 1000 * 2893300 Giga-Hertz so the nano-second increment per TSC tick really is the tiny
3.456102e-10 value I started with in Comment #3 .
So potentially one can get extremely accurate timings using TSC . 
I haven't wrote the library yet - now I know how linux is 
calculating clock_gettime(CLOCK_MONOTONIC_RAW) from TSC I know what to do .
Comment 10 Jason Vas Dias 2017-02-16 21:01:25 UTC
probably best way of calculating number of nanoseconds per tick would be :
   static const
   unsigned long long n_ps =  1000000000000ULL /  
                           ( ((unsigned long long)tsc_khz) * 1000U )
                           ; /* number of peta(1e-12)-seconds per TSC tick ? */ 
   unsigned long n_ns = ((((unsigned long long)
                           ( TSC_Value_Now() - TSC_Value_At_Start() )
                          )*n_ps ) / 1000ULL ); /* 1e-9 seconds */  
   return n_ns;

but you can only rely on clock_gettime(CLOCK_MONOTONIC_RAW,&ts) being
near this value IFF the above patch is applied to tsc.c .
Comment 11 Jason Vas Dias 2017-02-18 16:04:35 UTC
Created attachment 254825 [details]
demonstration of issue & fix

$ tar -xpf ttsc.tar
$ cd ttsc
$ CPPFLAGS='-DKSYM_LOCATION=\"'$(pwd)/ksym'\"' make 
$ ./ttsc1
has tsc: 1 constant: 1
832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
Hooray! TSC is enabled with KHz: 2893300
ts2 - ts1: 212 ts3 - ts2: 170 ns1: 0.000000257 ns2: 0.000001682
ts3 - ts2: 178 ns1: 0.000000308
ts3 - ts2: 27 ns1: 0.000000272
ts3 - ts2: 15 ns1: 0.000000132
ts3 - ts2: 17 ns1: 0.000000244
ts3 - ts2: 18 ns1: 0.000000247
ts3 - ts2: 18 ns1: 0.000000263
ts3 - ts2: 18 ns1: 0.000000247
ts3 - ts2: 17 ns1: 0.000000261
ts3 - ts2: 19 ns1: 0.000000257
ts3 - ts2: 19 ns1: 0.000000282
t1 - t0: 66444 - ns2: 0.000067677
Comment 12 Jason Vas Dias 2017-02-18 16:05:03 UTC
To demonstrate the relevance of this issue, here is a little test program ,
in attached 'ttsc.tar' file:

$ tar -xpf ttsc.tar
$ cd ttsc
$ PWD="$(pwd)";
$ CPPFLAGS='-DKSYM_LOCATION=\"'${PWD}'\"' make
...
$ ./ttsc1
has tsc: 1 constant: 1
832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
Hooray! TSC is enabled with KHz: 2893300
ts2 - ts1: 138 ts3 - ts2: 81 ns1: 0.000000158 ns2: 0.000001816
ts3 - ts2: 24 ns1: 0.000000162
ts3 - ts2: 17 ns1: 0.000000143
ts3 - ts2: 17 ns1: 0.000000146
ts3 - ts2: 17 ns1: 0.000000149
ts3 - ts2: 17 ns1: 0.000000141
ts3 - ts2: 16 ns1: 0.000000142
ts3 - ts2: 17 ns1: 0.000000151
ts3 - ts2: 15 ns1: 0.000000149
ts3 - ts2: 17 ns1: 0.000000145
ts3 - ts2: 17 ns1: 0.000000140
t1 - t0: 28689 - ns2: 0.000029599


The 'ts3 - ts1' and 'ns1' values shown above are the number of nanoseconds between calls to my IA64_s_ns_since_start() function and the system clock_gettime() call respectively : ttsc1.c, @line 50: 
        clock_gettime(CLOCK_MONOTONIC_RAW,&tsp1);
        clock_gettime(CLOCK_MONOTONIC_RAW,&tsp2);      
        ts2 = IA64_s_ns_since_start();      
        ts3 = IA64_s_ns_since_start();
        n_ns1=(((((U128_t)tsp2.tv_sec) * 1000000000ULL) + ((U128_t) tsp2.tv_nsec)) -      
               ((((U128_t)tsp1.tv_sec) * 1000000000ULL) + ((U128_t) tsp1.tv_nsec)))
              ;

So latencies of @ 140ns are seen between calls to clock_gettime, while latencies
of @15-20ns are seen between calls to IA64_s_ns_since_start() , on my 
2.9Ghz base -> 3.9Ghz "turbo" frequency 4/8-core i7-4910MQ CPU .

This is an attempt to overcome the fundamental restriction on the minimum length
of time that can be measured on Linux with use of any system call, because of 
the overhead of user -> kernel space transition, even through the VDSO . 
The IA64_s_ns_since_start() function uses NO system calls and does not 
need to enter the kernel.

Yes, I know that no code in the Linux kernel except the e1000 driver's PTP implementation uses the actual ART hardware frequency values returned in CPUID:15H, which, IMHO, is a shame - it would be nice if Linux actually 
used the hardware crystal clock frequency and its exact ratio to TSC frequency
to generate more accurate  monotonic raw time stamps from the TSC .

But with the attached patch to tsc.c, which has also been successfully applied to version 4.9.10 , at least user space code can know that the X86_FEATURE_ART
is set in the boot_cpu_data CPUID flags by reading 
/dev/cpu/0/cpuid:bit((3*32)+10), and can then
confidently use code such as that in IA64_rdtsc.h to use the 'rdtcsp' instruction to generate real-time time stamp values in user-space that will 
end up  measuring very slightly less overall cumulative real-time than the clock_gettime() would , by a very small constant offset of < 500ns, 
because this is the cumulative overhead , but is nonetheless well correlated 
with clock_gettime(CLOCK_MONOTONIC_RAW,,,) values, and very accurate for
instaneous duration measurements .
Comment 13 Jason Vas Dias 2017-02-18 21:23:54 UTC
Oops, should have been:
$ CPPFLAGS='-DKSYM_LOCATION=\"'${PWD}/ksym'\"' make
in Comment #12 line 7
(actually, code would also look for ./ksym).
Comment 14 Jason Vas Dias 2017-02-18 21:25:13 UTC
& "instaneous" == "instantaneous"
Comment 15 Jason Vas Dias 2017-02-18 21:39:38 UTC
The point is, if you have a 7th-gen Intel x86_64 CPU that identifies as 
model name: i7-4910MQ , linux was thinking ( I believe incorrectly ) that 
it did not support CPUID:15H,
because the boot_cpu_data.cpuid_level was 0x13 ; yet the extended cpuid level
is 0x80000008 , and both 'tsc' and 'constant_tsc' capabilities are in CPUID:01,
and the kernel sets its synthesized 'nonstop_tsc' bit , but did NOT  set the
synthesized X86_FEATURE_ART bit, and had no idea of the ART denominator & 
numerator, so could not use for instance e1000 PTP . So when running on i7-4910MQ
CPUs WITHOUT the attachment #2 [details] ( 
patch :  https://bugzilla.kernel.org/attachment.cgi?id=254795 ) applied, 
the test program 'ttsc1' will exit with an error message, because X86_FEATURE_ART
bit is not set .
So please apply the patch to future versions of arch/x86/kernel/tsc.c .

Note You need to log in before you can comment on or make changes to this bug.