Very often, the timing column looks misaligned with the instructions. The following is a very typical case:

       │    Disassembly of section .text:
       │
       │    0000000000f48a60 <ff_restore_rgb_planes_sse2>:
       │    ff_restore_rgb_planes_sse2():
       │      mov    0x8(%rsp),%rax
       │      mov    0x10(%rsp),%r10
       │      movslq %eax,%rax
       │      add    %rax,%rdi
       │      add    %rax,%rsi
       │      add    %rax,%rdx
       │      neg    %rax
       │      movdqa 0x1392b30,%xmm3
  0.06 │22:   mov    %rax,%r11
  2.07 │25:   movdqa (%rdi,%r11,1),%xmm0
 21.39 │      movdqa (%rsi,%r11,1),%xmm1
 28.50 │      movdqa (%rdx,%r11,1),%xmm2
 28.97 │      psubb  %xmm3,%xmm1
  3.09 │      paddb  %xmm1,%xmm0
  3.93 │      paddb  %xmm1,%xmm2
  4.02 │      movdqa %xmm0,(%rdi,%r11,1)
  4.37 │      movdqa %xmm2,(%rdx,%r11,1)
  3.26 │      add    $0x10,%r11
  0.03 │    ↑ jl     25
  0.06 │      add    %rcx,%rdi
  0.15 │      add    %r8,%rsi
       │      add    %r9,%rdx
  0.09 │      sub    $0x1,%r10d
       │    ↑ jg     22
       │      repz   retq

In this example, the three expensive instructions are obviously the three movdqa loads, but for some reason there is an "off-by-one" in the timing column: each percentage is printed next to the instruction that follows the one it seems to belong to.

This kind of misalignment appears all the time and is not specific to x86 (I have the same problem with arm and arm64).

The above example was obtained with the following (ffmpeg.git/master @ 4ed7c2bbc3):

  $ ./ffmpeg -lavfi testsrc2=hd1080:d=60 -c:v utvideo -pix_fmt rgb24 -y /tmp/testsrc2.avi
  $ perf record ./ffmpeg_g -i /tmp/testsrc2.avi -f null -
  $ perf report
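
The "off-by-one" reading of the listing above can be made concrete with a small sketch. It assumes the inner loop has been copied by hand as (percent, instruction) pairs, and it moves each percentage up to the preceding instruction. The helper name `shift_samples_up` and the data layout are hypothetical; this is only an illustration of the observation, not something perf does.

```python
# Hypothetical helper: re-attribute each sample percentage to the previous
# instruction, mirroring the "off-by-one" reading of the listing.

def shift_samples_up(rows):
    """rows: list of (percent, instruction) pairs in listing order.

    Returns a new list in which each instruction carries the percentage
    that the listing printed next to the instruction following it.
    """
    percents = [p for p, _ in rows]
    insns = [i for _, i in rows]
    # Simplification: the last row of the excerpt has no following row,
    # so it is given 0.0 here.
    shifted = percents[1:] + [0.0]
    return list(zip(shifted, insns))


# Inner loop copied from the listing as (percent, instruction).
inner_loop = [
    (2.07, "movdqa (%rdi,%r11,1),%xmm0"),
    (21.39, "movdqa (%rsi,%r11,1),%xmm1"),
    (28.50, "movdqa (%rdx,%r11,1),%xmm2"),
    (28.97, "psubb  %xmm3,%xmm1"),
    (3.09, "paddb  %xmm1,%xmm0"),
    (3.93, "paddb  %xmm1,%xmm2"),
    (4.02, "movdqa %xmm0,(%rdi,%r11,1)"),
    (4.37, "movdqa %xmm2,(%rdx,%r11,1)"),
    (3.26, "add    $0x10,%r11"),
    (0.03, "jl     25"),
]

for percent, insn in shift_samples_up(inner_loop):
    print(f"{percent:6.2f}  {insn}")
```

After the shift, the three loads carry 21.39, 28.50 and 28.97, which is where the time would be expected to go in this loop.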