Bug 198961
Created attachment 274493 [details]
program to demonstrate the problem
Created attachment 274495 [details]
patch to arch/x86/entry/vdso/vclock_gettime.c to implement fix
This patch implements support for CLOCK_MONOTONIC_RAW in the VDSO
by simply reading the TSC & performing mult & shift adjustments,
if TSC is the clock source.
Created attachment 274497 [details]
patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to implement fix
This patch implements support for CLOCK_MONOTONIC_RAW in the VDSO
by simply reading the TSC & performing mult & shift adjustments,
if TSC is the clock source.
Patch is against latest stable version v4.15.7 tree.
I can report that the patch builds and runs fine with kernel v4.15.7 - with the addition of a cast to (u64*) of the address of the remainder 'ns' (&ts->tv_nsec). But the results are unexpected - while CLOCK_MONOTONIC latencies are down to 16-24ns, CLOCK_MONOTONIC_RAW using the new do_monotonic_raw vDSO inline has a latency of 92-100ns - or maybe this is the "resolution" of the conversion calculation, @ 100ns? At least this is better than the 300-700ns for kernel fallback entry handling of clock_gettime(CLOCK_MONOTONIC_RAW).

The TSC frequency on this processor is about 2.3GHz, so sub-nanosecond precision / resolution should be available, but isn't under Linux - I am just trying to improve this situation somehow. I will post more detailed results when electricity is restored, and try to get those TSC conversion times down.

I think the long latency of do_monotonic_raw with respect to do_monotonic might also be due to the latter reusing the last seconds value and doing a simpler calculation most of the time - the calculation in the patch could do with some optimization. But I cannot understand why raw access to the TSC via CLOCK_MONOTONIC_RAW should be denied to user-space programs which also want to make use of the Linux TSC calibration & conversion mechanism so that timespec-compatible nanosecond-precision results are obtained - this has to be possible with a latency of < 10ns when the TSC ticks 2.3 times per ns.
Actually, it could be that the value returned by:

<quote><code>
u64 tsc = rdtsc_ordered();
tsc *= gtod->mult;
tsc >>= gtod->shift;
</code></quote>

might be better interpreted as two 32-bit integers, with the high 32 bits the seconds, and the low 32 the ns? :

<quote><code>
struct sns { u32 s, ns; } *snsp = (struct sns *)&tsc;
ts->tv_sec  = snsp->s;
ts->tv_nsec = snsp->ns;
</code></quote>

? I did try something like this before to interpret TSC values, and it gave sensible results. I think there should be a bit-field defining the layout of precisely which bits are seconds and which nanoseconds, rather than pretending that 64-bit second values are being sampled directly from the TSC. I believe the layout is 32:32, or 32:30. By something similar, a long division might be avoided and the latencies could come way down; also, as done by do_monotonic, use could be made of the previous return value. I think the previous calculation in the patch might shift the ns up by 7 bits (128), giving rise to @128ns resolution in averages. This requires further investigation which I cannot do right now.

Accurate measurement of elapsed time is key to accurate performance measurement of any software system. Sometimes very many very small differences do add up to a significant number.
So it is wrong to limit the minimum measurable time or to generate greater latency unnecessarily, and it is equally unnecessary and over-complicating for performance measurement code to have to deal with NTP & CPU frequency adjustments, which CLOCK_MONOTONIC is prone to, while the TSC values on modern Intel CPUs are not - and nor should CLOCK_MONOTONIC_RAW values be.

This affects modern Intel CPUs with 'nonstop-tsc' - I think AMD CPUs are also affected, but I cannot test at the moment.
For ARM, there should be a way of getting ring 0 / kernel code to read the sys_clkin2 timer register and put the unconverted value in an mmap-able / vDSO page without the full syscall entry / kernel <-> user overhead, which is also > @ 100ns on an A15 - I am working on this, but still awaiting electricity access.
Aha! Electricity back at last (Dublin Emma-Geddon). This is the method I used before to read the TSC (ignore the *enabled / *initialized calls); it illustrates my interpretation of the Intel Software Developer's Manual (SDM) sections about the TSC, which is that there is a high 32-bit register and a low 32-bit register:

<quote><code>
static inline __attribute__((always_inline)) U64_t IA64_tsc_now()
{
    if (!(_ia64_invariant_tsc_enabled ||
          ((!_ia64_tsc_info_initialized) &&
           IA64_invariant_tsc_is_enabled(NULL, NULL)))) {
        fprintf(stderr, __FILE__":%d:(%s): must be called with invariant TSC enabled.\n",
                __LINE__, __FUNCTION__);
        return 0;
    }
    U32_t tsc_hi, tsc_lo;
    register UL_t tsc;
    asm volatile
    ( "rdtscp\n\t"
      "mov %%edx, %0\n\t"
      "mov %%eax, %1\n\t"
      "mov %%ecx, %2\n\t"
      : "=m" (tsc_hi), "=m" (tsc_lo), "=m" (_ia64_tsc_user_cpu)
      :
      : "%rax", "%rcx", "%rdx"
    );
    tsc = (((UL_t)tsc_hi) << 32) | ((UL_t)tsc_lo);
    return tsc;
}
</code></quote>

Then Linux applies TSC calibration and does a multiply:

<quote><code>
tsc *= gtod->mult;
</code></quote>

and then a right shift:

<quote><code>
tsc >>= gtod->shift;
</code></quote>

The question is, I believe: are the low 32 bits of this resulting number the 0-999999999 number of nanoseconds, or MUST we do a long division of the whole number to get the number of seconds with a remainder number of nanoseconds?

Note the 2-bit gap in valid values - to represent a 9-digit nanosecond number, 1<<30 or 30 bits are needed, so I should have no problem just sampling the low 30 bits:

<quote><code>
ts->tv_sec  = tsc >> 32;
ts->tv_nsec = tsc & ((1U<<30U)-1);
</code></quote>

because the number of seconds is always guaranteed to begin at bit 32? OR is it:

<quote><code>
ts->tv_sec  = tsc >> 30;
ts->tv_nsec = tsc & ((1U<<30U)-1);
</code></quote>

?
OR MUST it be that the only valid interpretation is:

<quote><code>
ts->tv_sec = __iter_div_u64_rem(tsc, NSEC_PER_SEC, (u64*)&ts->tv_nsec);
</code></quote>

If this is the case, then of course the approach taken by do_monotonic should be used:

<quote><code>
do_monotonic_raw()
{
    ...
    tsc = rdtsc_ordered();
    static u64 last_tsc_value = 0, last_seconds = 0;
    if (last_tsc_value) {
        register u64 tsc_diff = (tsc - last_tsc_value);
        if (tsc_diff > 999999999UL) {
            ts->tv_sec = last_seconds =
                __iter_div_u64_rem(tsc, NSEC_PER_SEC, (u64*)&ts->tv_nsec);
        } else {
            ts->tv_sec  = last_seconds;
            ts->tv_nsec = tsc_diff;
        }
    } else {
        ts->tv_sec = last_seconds =
            __iter_div_u64_rem(tsc, NSEC_PER_SEC, (u64*)&ts->tv_nsec);
    }
    last_tsc_value = tsc;
    ...
}
</code></quote>

I guess the next step is to do a debugging version to find out. It would be simpler if the TSC seconds and nanoseconds values were somehow separate bit-fields in the value returned after multiplication and shift. There are still suspiciously few differences between consecutive values of less than 100ns, but I am testing the new code above now - I must rebuild the entire kernel & modules, since now a '#3' release will be built :-( .

Created attachment 274507 [details]
patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
Updated patch against 4.15.7 arch/x86/entry/vclock_gettime.c
to provide user-space access to TSC with CLOCK_MONOTONIC_RAW
clocks .
This version has latency of < 30ns, comparable to that of CLOCK_MONOTONIC .
Created attachment 274509 [details]
test of rdtscp asm stanza
$ gcc -std=gnu11 -o t_rdtscp t_rdtscp.c -lm
$ ./t_rdtscp
16
24
10
15
11
18
12
11
sum : 1.180467e+02 nanoseconds avg: 14
Created attachment 274513 [details]
better patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
Of course, the previous patch messes up the size of the gtod_data structure,
which causes all programs that rely on it (e.g. util-linux-ng, ntpd) to fail.
Changing the size of the gtod_data structure is probably a bad idea.
So this patch creates a separate vvar to hold the previous tsc and last second
values for do_monotonic_raw(), so the same algorithm can work without changing
gtod_data size.
Created attachment 274515 [details]
better patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
Of course, the previous patch messes up the size of the gtod_data structure,
which causes all programs that rely on it (e.g. util-linux-ng, ntpd) to fail.
Changing the size of the gtod_data structure is probably a bad idea.
So this patch creates a separate vvar to hold the previous tsc and last second
values for do_monotonic_raw(), so the same algorithm can work without changing
gtod_data size.
Comment on attachment 274515 [details]
better patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
The last patch should have similar performance to do_monotonic, without
breaking anything that depends on gtod_data size.
Of course, any attempt to create a writable non-automatic data object in the VDSO is impossible due to it having no writable data segment :-( . It looks like the best that can be done at the moment is to accept the 100ns minimum - the implementation of do_monotonic_raw in the patch is now:

<quote><code>
do_monotonic_raw(struct timespec *ts)
{
    volatile u32 tsc_lo = 0, tsc_hi = 0, tsc_cpu = 0;
    /* so the same instructions are generated for 64-bit as for 32-bit builds */
    u64 ns;
    register u64 tsc = 0;
    if (gtod->vclock_mode == VCLOCK_TSC) {
        asm volatile
        ( "rdtscp"
          : "=a" (tsc_lo), "=d" (tsc_hi), "=c" (tsc_cpu)
        ); /* eax, edx, ecx used - NOT rax, rdx, rcx */
        /* note that rdtsc_ordered() uses a barrier that is made
         * unnecessary by use of rdtscp */
        tsc  = ((((u64)tsc_hi) & 0xffffffffUL) << 32)
             | (((u64)tsc_lo) & 0xffffffffUL);
        tsc *= gtod->mult;
        tsc >>= gtod->shift;
        ts->tv_sec  = __iter_div_u64_rem(tsc, NSEC_PER_SEC, &ns);
        ts->tv_nsec = ns;
        return VCLOCK_TSC;
    }
    return VCLOCK_NONE;
}
</code></quote>

At least this method actually ALWAYS unequivocally reads a timestamp counter value, unlike other methods which can return whatever the kernel has written recently to vsyscall_gtod_data; but because it cannot store the last TSC value, it must always do the long division and cannot be as efficient as the other methods.

So the TSC conversion really needs to be implemented in a normal user-space library, such as I have written, and whose need I was trying to obviate by fixing this bug; such a library can store the previous TSC value and do optimizations to get the average number of nanoseconds used down to below 20ns - I achieved @ 8ns with a user-space implementation. But the VDSO does not export the necessary bits, ie. a pointer to the read-only 'shift' and 'mult' values which are the result of TSC calibration and normalize the TSC result to a number of nanoseconds.
So I propose that the VDSO should provide a method in vclock_gettime.c such as:

<quote><code>
struct clocksource_mult_shift { u32 mult, shift; };

const struct clocksource_mult_shift *
clocksource_mult_shift_address()
{
    return (const struct clocksource_mult_shift *) &gtod->mult;
}
</code></quote>

Then user-space code could make use of the read-only mult & shift values, which are kept updated by the kernel, to interpret TSC values correctly, and would be free to keep a record of the last such value, so could avoid making long divisions for each value returned.

Bugzilla is not where Linux kernel development happens. Please see Documentation/SubmittingPatches for the right way to do that.

Thanks Peter for your interest.

Created attachment 274527 [details]
patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts) and linux_tsc_calibration()
This patch makes clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
actually read the TSC, if TSC is the clock source .
Also provides :
struct linux_timestamp_conversion
{ u32 mult;
u32 shift;
};
extern
const struct linux_timestamp_conversion *
__vdso_linux_tsc_calibration(void);
which can be used by user-space code that wants to
issue rdtsc / rdtscp and perform the same conversion
on the value as the kernel / do_clockgettime_raw() does :
tsc *= tsc_cal->mult;
tsc >>= tsc_cal->shift;
Created attachment 274529 [details]
Program to demonstrate use of linux_tsc_calibration()
$ gcc -std=gnu11 -o t_vdso_tsc t_vdso_tsc.c
$ ./t_vdso_tsc
Got TSC calibration @ 0x7ffe3d353098: mult: 5798629 shift: 24
sum: 2233
Total time: 0.000004949S - Average Latency: 0.000000022S
This was on a CPU Family:Model 06:3C Haswell 2.3-3.9GHz 4/8 core machine.
So on same machine :
vendor_id: GenuineIntel
cpu family: 6
model: 60
model name: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz
tsc: Refined TSC clocksource calibration: 2893.300 MHz
The use of clock_gettime(CLOCK_MONOTONIC_RAW, &ts) could
not measure less than @ 300ns, and had a latency
of 300 - 700ns , because the VDSO used
vdso_fallback_gettime()
(entering the kernel via syscall) for these calls.
The use of clock_gettime(CLOCK_MONOTONIC, &ts) makes
use of previous (adjusted) values in the vsyscall_gtod_data
area , and so has a latency of 16 - 30ns .
But this value is prone to NTP adjustments, which we do not want.
With the last patch ( attachment #274527 [details] ), the use of clock_gettime(CLOCK_MONOTONIC_RAW, &ts) will have a latency
of 90 - 130 nanoseconds, because there is nowhere for it
to store any "previous values" .
Hence, to enable user-space rdtsc / rdtscp issuers to
interpret TSC values in the same way the kernel does ,
Linux should export its calibration values , with
a new function in the VDSO :
extern
const struct linux_timestamp_conversion { u32 mult, shift; }
* __vdso_linux_tsc_calibration(void);
So user-space functions can , with the aid of a 'vdso_sym' function
such as that provided by tools/testing/selftests/vDSO/parse_vdso.c ,
do something like:
const struct linux_timestamp_conversion { u32 mult, shift; }
* (*linux_tsc_calibration)(void) =
vdso_sym("LINUX_2.6","__vdso_linux_tsc_calibration"),
* tsc_cal;
tsc_cal = (*linux_tsc_calibration)();
U64_t tsc = my_rdtsc() ;
tsc *= tsc_cal->mult;
tsc >>= tsc_cal->shift;
Without something in the VDSO to divulge the address of the 'mult' (and implicitly
'shift') fields, user programs must parse the debug_info section of the
vdso{32,64}.so.dbg object to get the ABSOLUTE SYMBOL var_vsyscall_gtod_data value,
find where the VDSO is mapped, and then add the offset to vvar_page
to calculate the address of vsyscall_gtod_data - which is much more prone
to errors.
It is unacceptable on a modern 2.9-3.9GHz machine to be limited to
either measuring some NTP-adjusted value or measuring only time
difference values which are > 1000ns with any accuracy.
Hence this is a BUG, not a new feature or enhancement request.
It is also a BUG that the documentation was misleading about CLOCK_MONOTONIC_RAW.
*Please* take this to an appropriate forum.

I just did send the patch as an email, with
Subject: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
and To: x86@kernel.org, linux-kernel@vger.kernel.org, andi@firstfloor.org, tglx@linutronix.de
But I think it is more legible as an attachment, which can be viewed as a diff.

Created attachment 274539 [details]
same patch against latest RHEL-7 kernel (Scientific Linux distro)

I rebuilt the stock RHEL-7 kernel on my machine at work, from the RPM ( http://ftp.scientificlinux.org/linux/scientific/7x/SRPMS/kernel-3.10.0-693.17.1.el7.src.rpm ) with a spec file modified only to apply the patch, with nice results.

Before the patch was applied, the results of the test programs attached above were as follows:

<quote><pre>
$ uname -a
Linux jvdlnx 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 04:11:40 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
$ cpuinfo
...
vendor_id  : GenuineIntel
cpu family : 6
model      : 60
model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
$ grep -i 'refined.*tsc.*calibration' /var/log/messages
Refined TSC clocksource calibration: 3392.160 MHz
$ gcc -o timer timer2.c
$ ./timer
sum: 30633
Total time: 0.000061707S - Average Latency: 0.000000306S
$ ./timer
sum: 30594
Total time: 0.000061946S - Average Latency: 0.000000305S
$ ./timer
sum: 31066
Total time: 0.000062617S - Average Latency: 0.000000310S
$ ./timer
sum: 31257
Total time: 0.000075304S - Average Latency: 0.000000312S
$ ./timer
sum: 30865
Total time: 0.000062306S - Average Latency: 0.000000308S
$ ./t_rdtscp
25
17
18
10
10
17
11
11
sum : 1.180467e+02 nanoseconds avg: 14
</pre></quote>

Then, after the patch is applied, and showing the performance of the new 'clock_get_time_raw()' method shown in attachment #274529 [details]:

<quote><pre>
$ cat POST_kernel_patch.log
$ uname -a
Linux jvdlnx.faztech.ie 3.10.0-693.17.1.el7.jvd.x86_64 #1 SMP Mon Mar 5 13:46:54 GMT 2018 x86_64 x86_64 x86_64 GNU/Linux
    ^-- this means my VDSO patch is applied
$ cd ~/src
$ gcc -o timer timer2.c
$ ./timer
sum: 15319
Total time: 0.000030800S - Average Latency: 0.000000153S
$ ./timer
sum: 16289
Total time: 0.000033070S - Average Latency: 0.000000162S
$ ./timer
sum: 15562
Total time: 0.000031236S - Average Latency: 0.000000155S
$ ./timer
sum: 15477
Total time: 0.000031033S - Average Latency: 0.000000154S
$ ./timer
sum: 15506
Total time: 0.000031113S - Average Latency: 0.000000155S
$ ./timer
sum: 15519
Total time: 0.000031136S - Average Latency: 0.000000155S
$ ./timer
sum: 15498
Total time: 0.000031163S - Average Latency: 0.000000154S
# this runs its own TSC reading function instead of calling clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
# it calls clock_get_time_raw(), which uses the TSC calibration returned by the VDSO method
# __vdso_linux_tsc_calibration() to do the TSC value conversion itself, but because, unlike
# the vdso do_monotonic_raw(), clock_get_time_raw() does have a writable data section,
# it can optimize so that when the seconds value has not changed between calls it
# does not have to do long division, while do_monotonic_raw() does a long divide every time:
$ gcc -o t_vdso_tsc t_vdso_tsc.c
$ ./t_vdso_tsc
Got TSC calibration @ 0xffffffffff5ff0a0: mult: 4945799 shift: 24
sum: 1849
Total time: 0.000004073S - Average Latency: 0.000000018S
$ ./t_vdso_tsc
Got TSC calibration @ 0xffffffffff5ff0a0: mult: 4945799 shift: 24
sum: 2166
Total time: 0.000004633S - Average Latency: 0.000000021S
</pre></quote>

QED - reading the TSC in userspace and doing the conversion there rather than in the kernel is much, much faster, and opens up a new world of time-measurement possibilities for userspace code.

Created attachment 274541 [details]
better program to illustrate an efficient user-space TSC reader that uses new __vdso_linux_tsc_calibration() function
$ gcc -std=gnu11 -DPATH_TO_PARSE_VDSO_C=\"${KERNEL_SRC}/tools/testing/selftests/vDSO/parse_vdso.c\" -o test_vdso_tsc test_vdso_tsc.c
$ ./test_vdso_tsc
Got TSC calibration @ 0x7fff7f38d098: mult: 5798707 shift: 24
sum: 1893
Total time: 0.000004203S - Average Latency: 0.000000018S
Created attachment 274749 [details]
better version of POSIX timer latency measurement & verification program.
Now supports -r <repeat> option & performs checks
that each subsequent timespec is properly greater
than each previous timespec, in the way that
$KERNEL/tools/testing/selftests/timers/inconsistency_check.c
does.
Building:
$ gcc -o timer timer.c
Output under patched 4.15.9 kernel:
$ ./timer -?
Usage: timer_latency [
-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)
-d : dump timespec contents. N_SAMPLES: 100
-r <repeat count>
] Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.
$ ./timer -r 100
...
Total time: 0.000001846S - Average Latency: 0.000000018S N zero deltas: 0 N inconsistent deltas: 0
sum: 1844
Total time: 0.000001844S - Average Latency: 0.000000018S N zero deltas: 0 N inconsistent deltas: 0
sum: 1847
Total time: 0.000001847S - Average Latency: 0.000000018S N zero deltas: 0 N inconsistent deltas: 0
Average of 100 average latencies of 100 samples : 0.000000022S
Try running the program under any unpatched kernel
and the latencies will range between 150 and 1000ns
(average @ 600ns on my CPU) .
Created attachment 274751 [details]
Updated patch against latest RHEL7 kernel 3.10.0-693.21.1.el7
This patch applies against the Scientific Linux RHEL 7.4's
latest 3.10.0-693.21.1.el7 kernel, and produces good results -
$ inconsistency_check -c 4 -t 120
runs fine, and the CLOCK_MONOTONIC_RAW clock has a latency
of @ 16ns on a 3.4GHz Haswell.
Created attachment 274753 [details]
Patch against ti-linux-kernel ti2017.06 (4.9.65-rt23) for ARM

This patch applies against the latest ti-linux-kernel:
https://git.ti.com/ti-linux-kernel
tag: ti2017.06-rt
version: 4.9.65-rt23
and runs fine on my beagleboard-X15 (ARM A15). Again,
inconsistency_check -c 4 -t 120
runs fine, and the latency is exactly the arm_arch_timer period: 162ns, compared with 300-1000ns before the patch.

Created attachment 274841 [details]
Program to demonstrate latencies of lfence+rdtsc vs. rdtscp
This program demonstrates why I still think it is worthwhile
to make the kernel communicate the availability of 'rdtscp'
in the vsyscall_gtod_data, and for the vDSO to use this
field to choose to issue a
    rdtscp
instruction rather than an
    lfence
    rdtsc
sequence :
$ gcc -std=gnu11 -I. \
-D_WITH_TSC_CAL_=1 \
-DPARSE_VDSO_C=\"$BLD/linux-4.15.9/tools/testing/selftests/vDSO/parse_vdso.c\"\
-o tsc_latency tsc_latency.c
On my 2.8-3.9GHz 4/8-core i7-4710MQ laptop, under linux-4.15.9, the results are :
$ ./tsc_latency
Sum: 31508 Avg: 31.508
TSC calibration: khz:2893299 mult:5798646 shift:24 ticks_per_ns: 2.893299e+00
avg ns: 1.088999e+01
$ ./tsc_latency -m
Sum: 43920 Avg: 43.920
TSC calibration: khz:2893299 mult:5798646 shift:24 ticks_per_ns: 2.893299e+00
avg ns: 1.517990e+01
On my work 2-3.4GHz i7-4770 workstation, under RHEL-7 linux 3.10.0-693.21.1.el7 :
$ ./tsc_latency
Sum: 39140 Avg: 39.140
TSC calibration: khz:3392162 mult:4945877 shift:24 ticks_per_ns: 3.392162e+00
avg ns: 1.153836e+01
$ ./tsc_latency -r 10
repeat: 10
Sum: 39124 Avg: 39.124
TSC calibration: khz:3392162 mult:4945877 shift:24 ticks_per_ns: 3.392162e+00
avg ns: 1.153365e+01
Sum: 39028 Avg: 39.028
avg ns: 1.150535e+01
Sum: 39004 Avg: 39.004
avg ns: 1.149827e+01
Sum: 38996 Avg: 38.996
avg ns: 1.149591e+01
Sum: 38992 Avg: 38.992
avg ns: 1.149473e+01
Sum: 38996 Avg: 38.996
avg ns: 1.149591e+01
Sum: 39000 Avg: 39.000
avg ns: 1.149709e+01
Sum: 38992 Avg: 38.992
avg ns: 1.149473e+01
Sum: 39000 Avg: 39.000
avg ns: 1.149709e+01
Sum: 38996 Avg: 38.996
avg ns: 1.149591e+01
Average of 10 Averages Sum: 385 Avg: 38.005
avg avg ns: 1.134969e+01
$ ./tsc_latency -m -r 10
repeat: 10
Sum: 91764 Avg: 91.764
TSC calibration: khz:3392162 mult:4945877 shift:24 ticks_per_ns: 3.392162e+00
avg ns: 2.705177e+01
Sum: 91782 Avg: 91.782
avg ns: 2.705708e+01
Sum: 91773 Avg: 91.773
avg ns: 2.705443e+01
Sum: 91782 Avg: 91.782
avg ns: 2.705708e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Sum: 91696 Avg: 91.696
avg ns: 2.703173e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Average of 10 Averages Sum: 910 Avg: 91.000
avg avg ns: 2.682655e+01
So, that is why I think it makes sense for the vDSO to use rdtscp rather
than rdtsc.
Created attachment 274843 [details]
header used by tsc_latency.c
Created attachment 274845 [details]
Improved timer latency & inconsistency detection program
$ gcc -std=gnu11 -O3 -o timer timer.c
# using CLOCK_MONOTONIC_RAW:
$ ./timer
sum: 2118
Total time: 0.000002118S - Average Latency: 0.000000021S N zero deltas: 0 N inconsistent deltas: 0
# using CLOCK_MONOTONIC:
$ ./timer -m
sum: 3341
Total time: 0.000003341S - Average Latency: 0.000000033S N zero deltas: 0 N inconsistent deltas: 0
# Actually, maybe a sample size of 1000 is better:
$ gcc -std=gnu11 -O3 -DN_SAMPLES=1000 -o timer timer.c
$ ./timer
sum: 25991
Total time: 0.000025991S - Average Latency: 0.000000025S N zero deltas: 0 N inconsistent deltas: 0
$ ./timer -m
sum: 40416
Total time: 0.000040416S - Average Latency: 0.000000040S N zero deltas: 0 N inconsistent deltas: 0
but results are the same - 'rdtscp' is faster than 'lfence+rdtsc' .
Created attachment 274847 [details]
latest version of patch against v4.16-rc6, which compiles cleanly with or without -DRETPOLINE.
Created attachment 274849 [details]
latest version of patch against 3.10.0-693.21.1, that compiles with/without -DRETPOLINE
Created attachment 274851 [details]
Combined patch against 4.15.11, including rdtscp + tsc_calibration patches
$ scripts/checkpatch.pl /tmp/vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-v4.15.11.patch | less -X
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#355:
new file mode 100644
total: 0 errors, 1 warnings, 391 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
/tmp/vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-v4.15.11.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
ie. the only complaint is that I created a new file
( arch/x86/include/uapi/asm/vdso_linux_tsc_calibration.h ).
Created attachment 274869 [details]
Program to demonstrate latencies of lfence+rdtsc vs. rdtscp
Improved version, now compiles with '-std=gnu11 -Wall -Wextra -Werror'
with no errors. No change to result values.
Created attachment 274871 [details]
Improved timer latency & inconsistency detection program
Fixes compilation -Wall -Wextra - no change to results.
I think this has been fixed by https://lore.kernel.org/linux-arm-kernel/20190621095252.32307-1-vincenzo.frascino@arm.com/ which landed in the 5.3 kernel... |
Created attachment 274491 [details]
program to demonstrate the problem

Contrary to the manual-page documentation, Linux DOES NOT currently support the CLOCK_MONOTONIC_RAW clock for measuring small time differences with nanosecond resolution - ie. a monotonically increasing counter that is NOT affected by operator OR NTP adjustments - or rather does so in a way that renders its results useless for measuring small time differences.

This is because there is no support for CLOCK_MONOTONIC_RAW clocks in the VDSO - such calls end up using the vdso_fallback_gettime() method to enter the kernel, perform locking on the gtod structure, and get some adjusted value, when all we want to do for CLOCK_MONOTONIC_RAW is read the Time Stamp Counter (TSC) (issue the rdtsc / rdtscp instruction) and perform the same clocksource conversion as the kernel does, using the 'shift' and 'mult' values of the clocksource's (the TSC's) vsyscall_gtod_data structure.

One can issue the rdtsc / rdtscp instruction in user space and perform the same conversion in user space, using the same shift and mult values from the memory-mapped gtod_data structure, which is located in the VDSO and so can be mapped from user space - it looks like there is already a mechanism for doing so with the vvar_vdso_gtod_data VDSO variable, which holds the address of the gtod_data structure and is also readable from user space.
But what the code in arch/x86/entry/vclock_gettime.c does currently is (@ line 269):

<quote><code>
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
    switch (clock) {
    case CLOCK_REALTIME:
        if (do_realtime(ts) == VCLOCK_NONE)
            goto fallback;
        break;
    case CLOCK_MONOTONIC:
        if (do_monotonic(ts) == VCLOCK_NONE)
            goto fallback;
        break;
    case CLOCK_REALTIME_COARSE:
        do_realtime_coarse(ts);
        break;
    case CLOCK_MONOTONIC_COARSE:
        do_monotonic_coarse(ts);
        break;
    default:
        goto fallback;
    }
    return 0;
fallback:
    return vdso_fallback_gettime(clock, ts);
}
</code></quote>

Note that for CLOCK_MONOTONIC_RAW clocks it always enters the kernel, but for CLOCK_MONOTONIC clocks, which are subject to NTP but not to operator adjustments, it does access the mapped gtod_data.

I propose adding a new CLOCK_MONOTONIC_RAW case to this switch which would, if the current clocksource is the TSC, actually issue the 'rdtscp' instruction to read the actual TSC value, and perform the same conversion as the kernel's clocksource does, with something like:

<quote><code>
register U64_t cycles      = x86_64_rdtsc();
register U64_t nanoseconds = (cycles * clocksource->mult) >> clocksource->shift;
timespec->tv_sec  = nanoseconds / 1000000000;
timespec->tv_nsec = nanoseconds % 1000000000;
</code></quote>

where 'clocksource' is the vsyscall_gtod_data pointer, which is mapped to a region shared with the kernel, constructed by adding the vvar_vdso_gtod_data offset value (an absolute symbol in the VDSO) to the address where the first VDSO page is mapped.

This really is a serious issue IMHO - users get no warning from the documentation that clock_gettime(CLOCK_MONOTONIC_RAW, &ts) will enter the kernel while clock_gettime(CLOCK_MONOTONIC, &ts) will not, and if they are after a genuine 'cpu cycles' counter that is NOT subject to ANY adjustments, then CLOCK_MONOTONIC_RAW is the only choice they have.
All that the kernel code which currently handles CLOCK_MONOTONIC_RAW does is read the TSC (issue rdtscp) and perform the shift & mult conversion as shown above - but since it must do locking on the gtod_data structure and enter the kernel, putting the task in danger of being rescheduled, very bad latencies & kernel context-switch overhead can result.

On my x86_64 (a 2.3 - 3.9 GHz Haswell), using CLOCK_MONOTONIC_RAW clocks comes with a latency (minimum time that can be measured) of @ 300 - 600ns, whereas using CLOCK_MONOTONIC (which I cannot do for my application) takes @ 40ns. This is shown by the example test code:

$ gcc -std=gnu11 -o timer timer.c
$ ./timer
sum: 34240
Total time: 0.000069091S - Average Latency: 0.000000342S

This just issues 100 clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls in a row, measuring the difference between their values - so the average latency measured was 342ns. It tends to vary widely between 300-700ns on my machine - I guess sometimes the task gets rescheduled when entering the kernel via clock_gettime() calls - this is not desirable behaviour.

Adding the -m flag to the test program invocation makes it use CLOCK_MONOTONIC:

$ ./timer -m
sum: 2315
Total time: 0.000005179S - Average Latency: 0.000000023S

So here the latency was 23ns - this was under a 4.12.10 kernel.

Issuing a rdtscp instruction and performing a multiply and shift on the value takes @ 8ns on my CPU, reliably. Please, can we have a user-space TSC reader in the VDSO for CLOCK_MONOTONIC_RAW clock_gettime calls that does not enter the kernel! I am going to develop one for use under the x86_64 kernels I build, and will attach it to this bug when done.