Bug 198961
Created attachment 274493 [details]
program to demonstrate the problem
Created attachment 274495 [details]
patch to arch/x86/entry/vdso/vclock_gettime.c to implement fix
This patch implements support for CLOCK_MONOTONIC_RAW in the VDSO
by simply reading the TSC & performing mult & shift adjustments,
if TSC is the clock source.
Created attachment 274497 [details]
patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to implement fix
This patch implements support for CLOCK_MONOTONIC_RAW in the VDSO
by simply reading the TSC & performing mult & shift adjustments,
if TSC is the clock source.
Patch is against latest stable version v4.15.7 tree.
I can report that the patch builds and runs fine with kernel v4.15.7 - with the addition of a cast to (u64*) of the address of the remainder 'ns' (&ts->tv_nsec). But the results are unexpected - while CLOCK_MONOTONIC latencies are down to 16-24ns, CLOCK_MONOTONIC_RAW using the new do_monotonic_raw vDSO inline has a latency of 92-100ns - or maybe this is the "resolution" of the conversion calculation, @ 100ns? At least this is better than the 300-700ns for kernel fallback entry handling of clock_gettime(CLOCK_MONOTONIC_RAW).

The TSC frequency on this processor is about 2.3GHz, so sub-nanosecond precision / resolution should be available, but isn't under Linux - I am just trying to improve this situation somehow. I will post more detailed results when electricity is restored, and try to get those TSC conversion times down.

I think the long latency of do_monotonic_raw with respect to do_monotonic might also be due to the latter reusing the last seconds value and doing a simpler calculation most of the time - the calculation in the patch could do with some optimization. But I cannot understand why raw access to the TSC via CLOCK_MONOTONIC_RAW should be denied to user-space programs which also want to make use of the Linux TSC calibration & conversion mechanism so that timespec-compatible nanosecond-precision results are obtained - this has to be possible with a latency of < 10ns when the TSC ticks 2.3 times per ns.
Actually, it could be that the value returned by:

<quote><code>
u64 tsc = rdtsc_ordered();
tsc *= gtod->mult;
tsc >>= gtod->shift;
</code></quote>

might be better interpreted as two 32-bit integers, with the high 32 bits the seconds, and the low 32 the ns? :

<quote><code>
struct sns { u32 s, ns; } *snsp = (struct sns *)&tsc;
ts->tv_sec  = snsp->s;
ts->tv_nsec = snsp->ns;
</code></quote>

? I did try something like this before to interpret TSC values, and it gave sensible results. I think there should be a bit-field defining the layout of precisely which bits are seconds and which nanoseconds, rather than pretending that 64-bit second values are being sampled directly from the TSC. I believe the layout is 32:32, or 32:30. By something similar, a long division might be avoided and the latencies could come way down; also, as done by do_monotonic, use could be made of the previous return value. I think the previous calculation in the patch might shift the ns up by 7 bits (128), giving rise to @128ns resolution in averages. This requires further investigation which I cannot do right now.

Accurate measurement of elapsed time is key to accurate performance measurement of any software system. Sometimes very many very small differences do add up to a significant number.
So it is wrong to limit the minimum measurable time or to generate greater latency unnecessarily, and it is equally unnecessary and over-complicating for performance measurement code to have to deal with NTP & CPU frequency adjustments, which CLOCK_MONOTONIC is prone to, while the TSC values on modern Intel CPUs are not - and nor should CLOCK_MONOTONIC_RAW values be.

This affects modern Intel CPUs with 'nonstop-tsc' - I think AMD CPUs are also affected, but I cannot test at the moment.
For ARM, there should be a way of getting ring 0 / kernel code to read the sys_clkin2 timer register and put the unconverted value in an mmap-able / vDSO page without the full syscall entry / kernel <-> user overhead, which is also > @ 100ns on an A15 - I am working on this, but still awaiting electricity access.
Aha! Electricity back at last (Dublin Emma-Geddon). This is the method I used before to read the TSC (ignore the *enabled / *initialized calls); it illustrates my interpretation of the Intel Software Developer's Manual (SDM) sections about the TSC, which is that there is a high 32-bit register and a low 32-bit register:

<quote><code>
static inline __attribute__((always_inline)) U64_t IA64_tsc_now()
{
    if (!(_ia64_invariant_tsc_enabled ||
          ((!_ia64_tsc_info_initialized) &&
           IA64_invariant_tsc_is_enabled(NULL, NULL)))) {
        fprintf(stderr, __FILE__":%d:(%s): must be called with invariant TSC enabled.\n",
                __LINE__, __FUNCTION__);
        return 0;
    }
    U32_t tsc_hi, tsc_lo;
    register UL_t tsc;
    asm volatile
    ( "rdtscp\n\t"
      "mov %%edx, %0\n\t"
      "mov %%eax, %1\n\t"
      "mov %%ecx, %2\n\t"
      : "=m" (tsc_hi), "=m" (tsc_lo), "=m" (_ia64_tsc_user_cpu)
      :
      : "%rax", "%rcx", "%rdx"
    );
    tsc = (((UL_t)tsc_hi) << 32) | ((UL_t)tsc_lo);
    return tsc;
}
</code></quote>

Then Linux applies TSC calibration and does a multiply:

<quote><code>
tsc *= gtod->mult;
</code></quote>

and then a right shift:

<quote><code>
tsc >>= gtod->shift;
</code></quote>

The question is, I believe: are the low 32 bits of this resulting number the 0-999999999 number of nanoseconds, or MUST we do a long division of the whole number to get the number of seconds with a remainder number of nanoseconds?

Note the 2-bit gap in valid values - to represent a 9-digit nanosecond number, 1<<30 or 30 bits are needed, so I should have no problem just sampling the low 30 bits:

<quote><code>
ts->tv_sec  = tsc >> 32;
ts->tv_nsec = tsc & ((1U<<30U)-1);
</code></quote>

because the number of seconds is always guaranteed to begin at bit 32? OR is it:

<quote><code>
ts->tv_sec  = tsc >> 30;
ts->tv_nsec = tsc & ((1U<<30U)-1);
</code></quote>

?
OR MUST it be that the only valid interpretation is:

<quote><code>
ts->tv_sec = __iter_div_u64_rem(tsc, NSEC_PER_SEC, (u64*)&ts->tv_nsec);
</code></quote>

If this is the case, then of course the approach taken by do_monotonic should be used:

<quote><code>
do_monotonic_raw()
{
    ...
    tsc = rdtsc_ordered();
    static u64 last_tsc_value = 0, last_seconds = 0;
    if (last_tsc_value) {
        register u64 tsc_diff = (tsc - last_tsc_value);
        if (tsc_diff > 999999999UL) {
            ts->tv_sec = last_seconds =
                __iter_div_u64_rem(tsc, NSEC_PER_SEC, (u64*)&ts->tv_nsec);
        } else {
            ts->tv_sec  = last_seconds;
            ts->tv_nsec = tsc_diff;
        }
    } else {
        ts->tv_sec = last_seconds =
            __iter_div_u64_rem(tsc, NSEC_PER_SEC, (u64*)&ts->tv_nsec);
    }
    last_tsc_value = tsc;
    ...
}
</code></quote>

I guess the next step is to do a debugging version to find out. It would be simpler if the TSC seconds and nanoseconds values were somehow separate bit-fields in the value returned after multiplication and shift. There are still suspiciously few differences between consecutive values of less than 100ns, but I am testing the new code above now - I must rebuild the entire kernel & modules, since now a '#3' release will be built :-( .

Created attachment 274507 [details]
patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
Updated patch against 4.15.7 arch/x86/entry/vclock_gettime.c
to provide user-space access to TSC with CLOCK_MONOTONIC_RAW
clocks .
This version has latency of < 30ns, comparable to that of CLOCK_MONOTONIC .
Created attachment 274509 [details]
test of rdtscp asm stanza
$ gcc -std=gnu11 -o t_rdtscp t_rdtscp.c -lm
$ ./t_rdtscp
16
24
10
15
11
18
12
11
sum : 1.180467e+02 nanoseconds avg: 14
Created attachment 274513 [details]
better patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
Of course, the previous patch messes up the size of the gtod_data structure,
which causes all programs that rely on it (e.g. util-linux-ng, ntpd) to fail.
Changing the size of the gtod_data structure is probably a bad idea.
So this patch creates a separate vvar to hold the previous tsc and last second
values for do_monotonic_raw(), so the same algorithm can work without changing
gtod_data size.
Created attachment 274515 [details]
better patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
Of course, the previous patch messes up the size of the gtod_data structure,
which causes all programs that rely on it (e.g. util-linux-ng, ntpd) to fail.
Changing the size of the gtod_data structure is probably a bad idea.
So this patch creates a separate vvar to hold the previous tsc and last second
values for do_monotonic_raw(), so the same algorithm can work without changing
gtod_data size.
Comment on attachment 274515 [details]
better patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
The last patch should have similar performance to do_monotonic, without
breaking anything that depends on gtod_data size.
Of course, any attempt to create a writable non-automatic data object in the VDSO is impossible due to it having no writable data segment :-( . It looks like the best that can be done at the moment is to accept the 100ns minimum - the implementation of do_monotonic_raw in the patch is now:

<quote><code>
do_monotonic_raw(struct timespec *ts)
{
    volatile u32 tsc_lo = 0, tsc_hi = 0, tsc_cpu = 0;
    /* so the same instructions are generated for 64-bit as for 32-bit builds */
    u64 ns;
    register u64 tsc = 0;
    if (gtod->vclock_mode == VCLOCK_TSC) {
        asm volatile
        ( "rdtscp"
          : "=a" (tsc_lo), "=d" (tsc_hi), "=c" (tsc_cpu)
        ); /* eax, edx, ecx used - NOT rax, rdx, rcx */
        /* note that rdtsc_ordered() uses a barrier that is made
         * unnecessary by use of rdtscp */
        tsc  = ((((u64)tsc_hi) & 0xffffffffUL) << 32)
             | (((u64)tsc_lo) & 0xffffffffUL);
        tsc *= gtod->mult;
        tsc >>= gtod->shift;
        ts->tv_sec  = __iter_div_u64_rem(tsc, NSEC_PER_SEC, &ns);
        ts->tv_nsec = ns;
        return VCLOCK_TSC;
    }
    return VCLOCK_NONE;
}
</code></quote>

At least this method actually ALWAYS unequivocally reads a timestamp counter value, unlike other methods which can return whatever the kernel has written recently to vsyscall_gtod_data; but because it cannot store the last TSC value, it must always do the long division and cannot be as efficient as the other methods.

So the TSC conversion really needs to be implemented in a normal user-space library, such as I have written, and whose need I was trying to obviate by fixing this bug; such a library can store the previous TSC value and do optimizations to get the average number of nanoseconds used down to below 20ns - I achieved @ 8ns with a user-space implementation. But the VDSO does not export the necessary bits, ie. a pointer to the read-only 'shift' and 'mult' values which are the result of TSC calibration and normalize the TSC result to a number of nanoseconds.
So I propose that the VDSO should provide a method in vclock_gettime.c such as:

<quote><code>
struct clocksource_mult_shift { u32 mult, shift; };

const struct clocksource_mult_shift *
clocksource_mult_shift_address()
{
    return (const struct clocksource_mult_shift *) &gtod->mult;
}
</code></quote>

Then user-space code could make use of the read-only mult & shift values, which are kept updated by the kernel, to interpret TSC values correctly, and would be free to keep a record of the last such value, so could avoid making long divisions for each value returned.

Bugzilla is not where Linux kernel development happens. Please see Documentation/SubmittingPatches for the right way to do that.

Thanks Peter for your interest.

Created attachment 274527 [details]
patch to 4.15.7's arch/x86/entry/vdso/vclock_gettime.c to provide user-space access to TSC via clock_gettime(CLOCK_MONOTONIC_RAW, &ts) and linux_tsc_calibration()
This patch makes clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
actually read the TSC, if TSC is the clock source .
Also provides :
struct linux_timestamp_conversion
{ u32 mult;
u32 shift;
};
extern
const struct linux_timestamp_conversion *
__vdso_linux_tsc_calibration(void);
which can be used by user-space code that wants to
issue rdtsc / rdtscp and perform the same conversion
on the value as the kernel / do_clockgettime_raw() does :
tsc *= tsc_cal->mult;
tsc >>= tsc_cal->shift;
Created attachment 274529 [details]
Program to demonstrate use of linux_tsc_calibration()
$ gcc -std=gnu11 -o t_vdso_tsc t_vdso_tsc.c
$ ./t_vdso_tsc
Got TSC calibration @ 0x7ffe3d353098: mult: 5798629 shift: 24
sum: 2233
Total time: 0.000004949S - Average Latency: 0.000000022S
This was on a CPU Family:Model 06:3C Haswell 2.3-3.9GHz 4/8 core machine.
So on same machine :
vendor_id: GenuineIntel
cpu family: 6
model: 60
model name: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz
tsc: Refined TSC clocksource calibration: 2893.300 MHz
The use of clock_gettime(CLOCK_MONOTONIC_RAW, &ts) could
not measure less than @ 300ns, and had a latency
of 300 - 700ns , because the VDSO used
vdso_fallback_gettime()
(entering the kernel via syscall) for these calls.
The use of clock_gettime(CLOCK_MONOTONIC, &ts) makes
use of previous (adjusted) values in the vsyscall_gtod_data
area , and so has a latency of 16 - 30ns .
But this value is prone to NTP adjustments, which we do not want.
With the last patch ( attachment #274527 [details] ), the use of clock_gettime(CLOCK_MONOTONIC_RAW, &ts) will have a latency
of 90 - 130 nanoseconds, because there is nowhere for it
to store any "previous values" .
Hence, to enable user-space rdtsc / rdtscp issuers to
interpret TSC values in the same way the kernel does ,
Linux should export its calibration values , with
a new function in the VDSO :
extern
const struct linux_timestamp_conversion { u32 mult, shift; }
* __vdso_linux_tsc_calibration(void);
So user-space functions can , with the aid of a 'vdso_sym' function
such as that provided by tools/testing/selftests/vDSO/parse_vdso.c ,
do something like:
const struct linux_timestamp_conversion { u32 mult, shift; }
* (*linux_tsc_calibration)(void) =
vdso_sym("LINUX_2.6","__vdso_linux_tsc_calibration"),
* tsc_cal;
tsc_cal = (*linux_tsc_calibration)();
U64_t tsc = my_rdtsc() ;
tsc *= tsc_cal->mult;
tsc >>= tsc_cal->shift;
Without something in the VDSO to divulge the address of the 'mult' (and implicitly
'shift') fields, user programs must parse the debug_info section of the
vdso{32,64}.so.dbg object to get the ABSOLUTE SYMBOL var_vsyscall_gtod_data value,
find where the VDSO is mapped, and then add the offset to vvar_page
to calculate the address of vsyscall_gtod_data - which is much more prone
to errors.
It is unacceptable on a modern 2.9-3.9GHz machine to be limited to
either measuring some NTP-adjusted value or measuring only time
difference values which are > 1000ns with any accuracy.
Hence this is a BUG, not a new feature or enhancement request.
It is also a BUG that the documentation was misleading about CLOCK_MONOTONIC_RAW.
*Please* take this to an appropriate forum.

I just did send the patch as an email, with
Subject: [PATCH v4.15.7 1/1] on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
and To: x86@kernel.org, linux-kernel@vger.kernel.org, andi@firstfloor.org, tglx@linutronix.de
But I think it is more legible as an attachment, which can be viewed as a diff.

Created attachment 274539 [details]
same patch against latest RHEL-7 kernel (Scientific Linux distro)

I rebuilt the stock RHEL-7 kernel on my machine at work, from the RPM ( http://ftp.scientificlinux.org/linux/scientific/7x/SRPMS/kernel-3.10.0-693.17.1.el7.src.rpm ) with a spec file modified only to apply the patch, with nice results.

Before the patch was applied, the results of the test programs attached above were as follows:

<quote><pre>
$ uname -a
Linux jvdlnx 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 04:11:40 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
$ cpuinfo
...
vendor_id  : GenuineIntel
cpu family : 6
model      : 60
model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
$ grep -i 'refined.*tsc.*calibration' /var/log/messages
Refined TSC clocksource calibration: 3392.160 MHz
$ gcc -o timer timer2.c
$ ./timer
sum: 30633
Total time: 0.000061707S - Average Latency: 0.000000306S
$ ./timer
sum: 30594
Total time: 0.000061946S - Average Latency: 0.000000305S
$ ./timer
sum: 31066
Total time: 0.000062617S - Average Latency: 0.000000310S
$ ./timer
sum: 31257
Total time: 0.000075304S - Average Latency: 0.000000312S
$ ./timer
sum: 30865
Total time: 0.000062306S - Average Latency: 0.000000308S
$ ./t_rdtscp
25
17
18
10
10
17
11
11
sum : 1.180467e+02 nanoseconds avg: 14
</pre></quote>

Then, after the patch is applied, and showing the performance of the new 'clock_get_time_raw()' method shown in attachment #274529 [details]:

<quote><pre>
$ cat POST_kernel_patch.log
$ uname -a
Linux jvdlnx.faztech.ie 3.10.0-693.17.1.el7.jvd.x86_64 #1 SMP Mon Mar 5 13:46:54 GMT 2018 x86_64 x86_64 x86_64 GNU/Linux
    ^-- this means my VDSO patch is applied
$ cd ~/src
$ gcc -o timer timer2.c
$ ./timer
sum: 15319
Total time: 0.000030800S - Average Latency: 0.000000153S
$ ./timer
sum: 16289
Total time: 0.000033070S - Average Latency: 0.000000162S
$ ./timer
sum: 15562
Total time: 0.000031236S - Average Latency: 0.000000155S
$ ./timer
sum: 15477
Total time: 0.000031033S - Average Latency: 0.000000154S
$ ./timer
sum: 15506
Total time: 0.000031113S - Average Latency: 0.000000155S
$ ./timer
sum: 15519
Total time: 0.000031136S - Average Latency: 0.000000155S
$ ./timer
sum: 15498
Total time: 0.000031163S - Average Latency: 0.000000154S
# this runs its own TSC reading function instead of calling clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
# it calls clock_get_time_raw(), which uses the TSC calibration returned by the VDSO method
# __vdso_linux_tsc_calibration() to do the TSC value conversion itself, but because, unlike
# the vdso do_monotonic_raw(), clock_get_time_raw() does have a writable data section,
# it can optimize so that when the seconds value has not changed between calls it
# does not have to do long division, while do_monotonic_raw() does a long divide every time:
$ gcc -o t_vdso_tsc t_vdso_tsc.c
$ ./t_vdso_tsc
Got TSC calibration @ 0xffffffffff5ff0a0: mult: 4945799 shift: 24
sum: 1849
Total time: 0.000004073S - Average Latency: 0.000000018S
$ ./t_vdso_tsc
Got TSC calibration @ 0xffffffffff5ff0a0: mult: 4945799 shift: 24
sum: 2166
Total time: 0.000004633S - Average Latency: 0.000000021S
</pre></quote>

QED - reading the TSC in userspace and doing the conversion there rather than in the kernel is much, much faster, and opens up a new world of time-measurement possibilities for userspace code.

Created attachment 274541 [details]
better program to illustrate an efficient user-space TSC reader that uses new __vdso_linux_tsc_calibration() function
$ gcc -std=gnu11 -DPATH_TO_PARSE_VDSO_C=\"${KERNEL_SRC}/tools/testing/selftests/vDSO/parse_vdso.c\" -o test_vdso_tsc test_vdso_tsc.c
$ ./test_vdso_tsc
Got TSC calibration @ 0x7fff7f38d098: mult: 5798707 shift: 24
sum: 1893
Total time: 0.000004203S - Average Latency: 0.000000018S
Created attachment 274749 [details]
better version of POSIX timer latency measurement & verification program.
Now supports -r <repeat> option & performs checks
that each subsequent timespec is properly greater
than each previous timespec, in the way that
$KERNEL/tools/testing/selftests/timers/inconsistency_check.c
does.
Building:
$ gcc -o timer timer.c
Output under patched 4.15.9 kernel:
$ ./timer -?
Usage: timer_latency [
-m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW)
-d : dump timespec contents. N_SAMPLES: 100
-r <repeat count>
] Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.
$ ./timer -r 100
...
Total time: 0.000001846S - Average Latency: 0.000000018S N zero deltas: 0 N inconsistent deltas: 0
sum: 1844
Total time: 0.000001844S - Average Latency: 0.000000018S N zero deltas: 0 N inconsistent deltas: 0
sum: 1847
Total time: 0.000001847S - Average Latency: 0.000000018S N zero deltas: 0 N inconsistent deltas: 0
Average of 100 average latencies of 100 samples : 0.000000022S
Try running the program under any unpatched kernel
and the latencies will range between 150 and 1000ns
(average @ 600ns on my CPU) .
Created attachment 274751 [details]
Updated patch against latest RHEL7 kernel 3.10.0-693.21.1.el7
This patch applies against the Scientific Linux RHEL 7.4's
latest 3.10.0-693.21.1.el7 kernel, and produces good results -
$ inconsistency_check -c 4 -t 120
runs fine, and the CLOCK_MONOTONIC_RAW clock has a latency
of @ 16ns on a 3.4GHz Haswell.
Created attachment 274753 [details]
Patch against ti-linux-kernel ti2017.06 (4.9.65-rt23) for ARM

This patch applies against the latest ti-linux-kernel:
https://git.ti.com/ti-linux-kernel
tag: ti2017.06-rt
version: 4.9.65-rt23
and runs fine on my beagleboard-X15 (ARM A15). Again,
inconsistency_check -c 4 -t 120
runs fine, and the latency is exactly the arm_arch_timer period: 162ns, compared with 300-1000ns before the patch.

Created attachment 274841 [details]
Program to demonstrate latencies of lfence+rdtsc vs. rdtscp
This program demonstrates why I still think it is worthwhile
to make the kernel communicate the availability of 'rdtscp'
in the vsyscall_gtod_data, and for the vDSO to use this
field to choose to issue a
    rdtscp
instruction rather than an
    lfence
    rdtsc
sequence :
$ gcc -std=gnu11 -I. \
-D_WITH_TSC_CAL_=1 \
-DPARSE_VDSO_C=\"$BLD/linux-4.15.9/tools/testing/selftests/vDSO/parse_vdso.c\"\
-o tsc_latency tsc_latency.c
On my 2.8-3.9GHz 4/8-core i7-4710MQ laptop, under linux-4.15.9, the results are :
$ ./tsc_latency
Sum: 31508 Avg: 31.508
TSC calibration: khz:2893299 mult:5798646 shift:24 ticks_per_ns: 2.893299e+00
avg ns: 1.088999e+01
$ ./tsc_latency -m
Sum: 43920 Avg: 43.920
TSC calibration: khz:2893299 mult:5798646 shift:24 ticks_per_ns: 2.893299e+00
avg ns: 1.517990e+01
On my work 2-3.4GHz i7-4770 workstation, under RHEL-7 linux 3.10.0-693.21.1.el7 :
$ ./tsc_latency
Sum: 39140 Avg: 39.140
TSC calibration: khz:3392162 mult:4945877 shift:24 ticks_per_ns: 3.392162e+00
avg ns: 1.153836e+01
$ ./tsc_latency -r 10
repeat: 10
Sum: 39124 Avg: 39.124
TSC calibration: khz:3392162 mult:4945877 shift:24 ticks_per_ns: 3.392162e+00
avg ns: 1.153365e+01
Sum: 39028 Avg: 39.028
avg ns: 1.150535e+01
Sum: 39004 Avg: 39.004
avg ns: 1.149827e+01
Sum: 38996 Avg: 38.996
avg ns: 1.149591e+01
Sum: 38992 Avg: 38.992
avg ns: 1.149473e+01
Sum: 38996 Avg: 38.996
avg ns: 1.149591e+01
Sum: 39000 Avg: 39.000
avg ns: 1.149709e+01
Sum: 38992 Avg: 38.992
avg ns: 1.149473e+01
Sum: 39000 Avg: 39.000
avg ns: 1.149709e+01
Sum: 38996 Avg: 38.996
avg ns: 1.149591e+01
Average of 10 Averages Sum: 385 Avg: 38.005
avg avg ns: 1.134969e+01
$ ./tsc_latency -m -r 10
repeat: 10
Sum: 91764 Avg: 91.764
TSC calibration: khz:3392162 mult:4945877 shift:24 ticks_per_ns: 3.392162e+00
avg ns: 2.705177e+01
Sum: 91782 Avg: 91.782
avg ns: 2.705708e+01
Sum: 91773 Avg: 91.773
avg ns: 2.705443e+01
Sum: 91782 Avg: 91.782
avg ns: 2.705708e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Sum: 91696 Avg: 91.696
avg ns: 2.703173e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Sum: 91682 Avg: 91.682
avg ns: 2.702760e+01
Average of 10 Averages Sum: 910 Avg: 91.000
avg avg ns: 2.682655e+01
So, that is why I think it makes sense for the vDSO to use rdtscp rather
than rdtsc.
Created attachment 274843 [details]
header used by tsc_latency.c
Created attachment 274845 [details]
Improved timer latency & inconsistency detection program
$ gcc -std=gnu11 -O3 -o timer timer.c
# using CLOCK_MONOTONIC_RAW:
$ ./timer
sum: 2118
Total time: 0.000002118S - Average Latency: 0.000000021S N zero deltas: 0 N inconsistent deltas: 0
# using CLOCK_MONOTONIC:
$ ./timer -m
sum: 3341
Total time: 0.000003341S - Average Latency: 0.000000033S N zero deltas: 0 N inconsistent deltas: 0
# Actually, maybe a sample size of 1000 is better:
$ gcc -std=gnu11 -O3 -DN_SAMPLES=1000 -o timer timer.c
$ ./timer
sum: 25991
Total time: 0.000025991S - Average Latency: 0.000000025S N zero deltas: 0 N inconsistent deltas: 0
$ ./timer -m
sum: 40416
Total time: 0.000040416S - Average Latency: 0.000000040S N zero deltas: 0 N inconsistent deltas: 0
but results are the same - 'rdtscp' is faster than 'lfence+rdtsc' .
Created attachment 274847 [details]
latest version of patch against v4.16-rc6, which compiles cleanly with or without -DRETPOLINE.
Created attachment 274849 [details]
latest version of patch against 3.10.0-693.21.1, that compiles with/without -DRETPOLINE
Created attachment 274851 [details]
Combined patch against 4.15.11, including rdtscp + tsc_calibration patches
$ scripts/checkpatch.pl /tmp/vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-v4.15.11.patch | less -X
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#355:
new file mode 100644
total: 0 errors, 1 warnings, 391 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
/tmp/vdso_vclock_gettime_CLOCK_MONOTONIC_RAW-v4.15.11.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
ie. the only complaint is that I created a new file
( arch/x86/include/uapi/asm/vdso_linux_tsc_calibration.h ).
Created attachment 274869 [details]
Program to demonstrate latencies of lfence+rdtsc vs. rdtscp
Improved version, now compiles with '-std=gnu11 -Wall -Wextra -Werror'
with no errors. No change to result values.
Created attachment 274871 [details]
Improved timer latency & inconsistency detection program
Fixes compilation -Wall -Wextra - no change to results.
I think this has been fixed by https://lore.kernel.org/linux-arm-kernel/20190621095252.32307-1-vincenzo.frascino@arm.com/ which landed in the 5.3 kernel... |
Created attachment 274491 [details]
program to demonstrate the problem

Contrary to the manual-page documentation, Linux DOES NOT currently support the CLOCK_MONOTONIC_RAW clock for measuring small time differences with nanosecond resolution - ie. a monotonically increasing counter that is NOT affected by operator OR NTP adjustments - or rather does so in a way that renders its results useless for measuring small time differences.

This is because there is no support for CLOCK_MONOTONIC_RAW clocks in the VDSO - such calls end up using the vdso_fallback_gettime() method to enter the kernel, perform locking on the gtod structure, and get some adjusted value, when all we want to do for CLOCK_MONOTONIC_RAW is read the Time Stamp Counter (TSC) (issue the rdtsc / rdtscp instruction) and perform the same clocksource conversion as the kernel does, using the 'shift' and 'mult' values of the clocksource's (the TSC's) vsyscall_gtod_data structure.

One can issue the rdtsc / rdtscp instruction in user space and perform the same conversion in user space, using the same shift and mult values from the memory-mapped gtod_data structure, which is located in the VDSO and so can be mapped from user space - it looks like there is already a mechanism for doing so with the vvar_vdso_gtod_data VDSO variable, which holds the address of the gtod_data structure and is also readable from user space.
But what the code in arch/x86/entry/vclock_gettime.c does currently is (@ line 269):

<quote><code>
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
    switch (clock) {
    case CLOCK_REALTIME:
        if (do_realtime(ts) == VCLOCK_NONE)
            goto fallback;
        break;
    case CLOCK_MONOTONIC:
        if (do_monotonic(ts) == VCLOCK_NONE)
            goto fallback;
        break;
    case CLOCK_REALTIME_COARSE:
        do_realtime_coarse(ts);
        break;
    case CLOCK_MONOTONIC_COARSE:
        do_monotonic_coarse(ts);
        break;
    default:
        goto fallback;
    }
    return 0;
fallback:
    return vdso_fallback_gettime(clock, ts);
}
</code></quote>

Note that for CLOCK_MONOTONIC_RAW clocks it always enters the kernel, but for CLOCK_MONOTONIC clocks, which are subject to NTP but not to operator adjustments, it does access the mapped gtod_data.

I propose adding a new CLOCK_MONOTONIC_RAW case to this switch which would, if the current clocksource is the TSC, actually issue the 'rdtscp' instruction to read the actual TSC value, and perform the same conversion as the kernel's clocksource does, with something like:

<quote><code>
register U64_t cycles      = x86_64_rdtsc();
register U64_t nanoseconds = (cycles * clocksource->mult) >> clocksource->shift;
timespec->tv_sec  = nanoseconds / 1000000000;
timespec->tv_nsec = nanoseconds % 1000000000;
</code></quote>

where 'clocksource' is the vsyscall_gtod_data pointer, which is mapped to a region shared with the kernel, constructed by adding the vvar_vdso_gtod_data offset value (an absolute symbol in the VDSO) to the address where the first VDSO page is mapped.

This really is a serious issue IMHO - users get no warning from the documentation that clock_gettime(CLOCK_MONOTONIC_RAW, &ts) will enter the kernel while clock_gettime(CLOCK_MONOTONIC, &ts) will not, and if they are after a genuine 'cpu cycles' counter that is NOT subject to ANY adjustments, then CLOCK_MONOTONIC_RAW is the only choice they have.
All that the kernel code which currently handles CLOCK_MONOTONIC_RAW does is read the TSC (issue rdtscp) and perform the shift & mult conversion as shown above - but since it must do locking on the gtod_data structure and enter the kernel, putting the task in danger of being rescheduled, very bad latencies & kernel context-switch overhead can result.

On my x86_64 (a 2.3 - 3.9 GHz Haswell), using CLOCK_MONOTONIC_RAW clocks comes with a latency (minimum time that can be measured) of @ 300 - 600ns, whereas using CLOCK_MONOTONIC (which I cannot do for my application) takes @ 40ns. This is shown by the example test code:

$ gcc -std=gnu11 -o timer timer.c
$ ./timer
sum: 34240
Total time: 0.000069091S - Average Latency: 0.000000342S

This just issues 100 clock_gettime(CLOCK_MONOTONIC_RAW, &ts) calls in a row, measuring the difference between their values - so the average latency measured was 342ns. It tends to vary widely between 300-700ns on my machine - I guess sometimes the task gets rescheduled when entering the kernel via clock_gettime() calls - this is not desirable behaviour.

Adding the -m flag to the test program invocation makes it use CLOCK_MONOTONIC:

$ ./timer -m
sum: 2315
Total time: 0.000005179S - Average Latency: 0.000000023S

So here the latency was 23ns - this was under a 4.12.10 kernel.

Issuing a rdtscp instruction and performing a multiply and shift on the value takes @ 8ns on my CPU, reliably. Please, can we have a user-space TSC reader in the VDSO for CLOCK_MONOTONIC_RAW clock_gettime calls that does not enter the kernel! I am going to develop one for use under the x86_64 kernels I build, and will attach it to this bug when done.