Bug 207323

Summary: [RFE] Latency profiling capability
Product: Tracing/Profiling Reporter: Konstantin Kharlamov (hi-angel)
Component: Perf toolAssignee: Arnaldo Carvalho de Melo (acme)
Status: NEW ---    
Severity: normal CC: irogers, jolsa
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.6.2 Subsystem:
Regression: No Bisected commit-id:

Description Konstantin Kharlamov 2020-04-17 14:38:38 UTC
While knowing hotspots in code is important, just as important is knowing the opposite, the underload spots. E.g. whether code spends too much time locked, sleeping, yield()'s too often, etc. Ultimately, places where code gaves up its execution time back to the scheduler.

Currently it is either impossible to know that with perf, or it is undocumented. I'm particularly should mention that kernel Wiki has a paragraph "profiling sleep times", which makes it look as if there is such capability. It is not entirely correct, and resulted in #207301 report.
Comment 1 Konstantin Kharlamov 2020-04-21 19:55:56 UTC
Okay, so I want to share a workaround. I'd even say a solution but one that does not involve perf. I hope it won't be considered offtopic though, because it uses ebpf which same as perf is a kernel project. I think we need more docs on how to profile sleeping/locked/blocked times on Linux because ATM situation with docs on that is dire.

-----

The ebpf scripts for profiling are part of `bcc`/`bcc-tools` package. The simplest one to profile latency is called `offcputime`. But there's a better one, called `offwaketime`, which combines stack of whoever woke the thread up and a stack of the thread that was blocked/locked/slept, separated by `--` symbols. The time is measured in μs.

Example session:

    $ cat test.cpp
    #include <unistd.h>

    int main() {
        for (size_t i = 0; i < 5; ++i) {
            sleep(1);
        }
    }
    $ g++ test.cpp -o test -O0 -g3
    ./test &; sudo /usr/share/bcc/tools/offwaketime -p $(pgrep test) > out.txt
    [1] 124301
    [1]  + 124301 done       ./test
    ^C
    $ cat out.txt
    Tracing blocked time (us) by user + kernel off-CPU and waker stack... Hit Ctrl-C to end.

        waker:           swapper/3 0
        b'secondary_startup_64'
        b'start_secondary'
        b'cpu_startup_entry'
        b'do_idle'
        b'cpuidle_enter'
        b'cpuidle_enter_state'
        b'apic_timer_interrupt'
        b'smp_apic_timer_interrupt'
        b'hrtimer_interrupt'
        b'__hrtimer_run_queues'
        b'hrtimer_wakeup'
        --               --
        b'finish_task_switch'
        b'__schedule'
        b'schedule'
        b'do_nanosleep'
        b'hrtimer_nanosleep'
        b'common_nsleep'
        b'__x64_sys_clock_nanosleep'
        b'do_syscall_64'
        b'entry_SYSCALL_64_after_hwframe'
        b'[unknown]'
        target:          test 124301
            4000732

The output can also be fed to Flamegraph if you add `-f` option. As a bonus, Flamegraph would nicely collapse similar sleep-stacks that were woken up from different places. But one should manually edit first the `offwaketime` output to remove "idle thread" lines at the end, otherwise you may not find anything else on the flamegraph because their values are often far bigger than everything else.
Comment 2 Ian Rogers 2022-01-21 06:55:58 UTC
There was recently a latency subcommand for perf ftrace added:
https://lore.kernel.org/lkml/20211129231830.1117781-1-namhyung@kernel.org/
I believe this can be marked as fixed.
Comment 3 Konstantin Kharlamov 2023-03-15 17:17:49 UTC
(In reply to Ian Rogers from comment #2)
> There was recently a latency subcommand for perf ftrace added:
> https://lore.kernel.org/lkml/20211129231830.1117781-1-namhyung@kernel.org/
> I believe this can be marked as fixed.

Sorry, but `ftrace latency` isn't useful, at least from what I can see.

Most importantly, you can't combine it with `perf record` since it's a separate command, and as a result you can't generate a flamegraph out of it to see where problems are.

But even on its own it is very limited, as apparently it can't trace userspace functions, so e.g. executing a `… -T main …` will result in `failed to set tracing filters`. The patchset mentions that as a temporary limitation, but it's still present an year later as of 6.2. Not that it matters though, since you can't make a flamegraph out of it anyway.
Comment 4 Konstantin Kharlamov 2023-03-15 17:32:33 UTC
(In reply to Konstantin Kharlamov from comment #3)
> Most importantly, you can't combine it with `perf record` since it's a
> separate command, and as a result you can't generate a flamegraph out of it
> to see where problems are.

Well, I'm using Flamegraph as just an example. Really any kind of output that would "show where problems are". The `ftrace latency` doesn't give any kind of stacktraces that describe where slowness is, it just shows a table how much the `foo` in `-T foo` has taken after being ran many times. So you can not really use `ftrace latency` unless you has at least a guess where the slowness is coming from.
Comment 5 Ian Rogers 2023-03-15 17:58:36 UTC
There is lock analysis landing and also kwork:
https://lore.kernel.org/lkml/20220727090330.107760-1-yangjihong1@huawei.com/
Wrt merged traced there was discussion on the mailing list to similar problems.
Comment 6 Ian Rogers 2024-01-19 18:28:05 UTC
As we have offcpu support in perf record:
https://lore.kernel.org/all/20220422053401.208207-1-namhyung@kernel.org/
I'm not sure what this bug is about. Is it that combining multiple kinds of event samples should be supported in perf report? The bug title confuses me.
Comment 7 Konstantin Kharlamov 2024-01-19 19:09:53 UTC
(In reply to Ian Rogers from comment #6)
> As we have offcpu support in perf record:
> https://lore.kernel.org/all/20220422053401.208207-1-namhyung@kernel.org/
> I'm not sure what this bug is about. Is it that combining multiple kinds of
> event samples should be supported in perf report? The bug title confuses me.

Hello, thank you for mentioning the new feature!

So, I'm trying it out, but I can't seem to get any information about the slowness from a testcase. I tried different combinations, with and without `-a` and other options.

So, given a testcase:

    #include <unistd.h>

    void my_slow_f() { sleep(1); }

    int main() {
        for (size_t i = 0; i < 5; ++i) {
            my_slow_f();
        }
    }

I compile it with debugging info and no optimization:

    gcc test.c -o a -O0 -g3

I do recording for example as:

    sudo perf record -g --off-cpu -- ./a

Then I read the report:

    sudo perf report -g

This report shows two entries "cycles:P" and "offcpu-time". But neither contains `main` or `my_slow_f`. As a matter of fact doing a

    sudo grep slow perf.data

returns with nothing.

Do I do something wrong? Both kernel and perf are of version 6.7.
Comment 8 Konstantin Kharlamov 2024-01-19 19:28:41 UTC
Just to point out I'm not the only one interested (well, it's probably obvious, but anyway), there's a fun blog post from a Mesa developer, who's writing:¹

> Naturally I went to my standby for profiling, perf. I found some stuff, and I
> fixed it all, and surprise surprise the numbers didn’t change. Clearly there
> was some blocking going on that was affecting the frames. Could the
> flamegraph tools for perf highlight this issue for me?
>
> No.
>
> An average Monday in the life of a zink developer.
>
> New tools would be needed to figure out what’s going on here.
>
> Do such tools already exist? Maybe. Post in the comments if you have tools
> you use for this type of thing.

As I'm writing this, no comment has been posted on that.

So, yeah, if there's a way to profile the time CPU wasn't doing anything (as in, sleeping, being locked, waiting for memory to get fetched, etc), whether with perf or without, it's not currently known.

1: https://www.supergoodcode.com/presenting/
Comment 9 Ian Rogers 2024-01-19 19:44:15 UTC
I'd have thought 6.7 is okay. We have a test of the functionality here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell/record_offcpu.sh?h=perf-tools-next#n39

The offcpu code needs BPF support and for that to have been compiled into the tool. If you run 'perf version --build-options' you should see:
...
         bpf_skeletons: [ on  ]  # HAVE_BPF_SKEL
Comment 10 Konstantin Kharlamov 2024-01-19 19:52:18 UTC
(In reply to Ian Rogers from comment #9)
> I'd have thought 6.7 is okay. We have a test of the functionality here:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/
> tree/tools/perf/tests/shell/record_offcpu.sh?h=perf-tools-next#n39
> 
> The offcpu code needs BPF support and for that to have been compiled into
> the tool. If you run 'perf version --build-options' you should see:
> ...
>          bpf_skeletons: [ on  ]  # HAVE_BPF_SKEL

Yes, it's on. I also use sometimes basic `bpftrace`, so BPF is certainly working.

I'd presume if my system was missing something for perf `off-cpu` support to work, it would just error out…
Comment 11 Ian Rogers 2024-01-19 19:57:40 UTC
It should error out. The BPF program for off CPU writes into maps and then the maps get dumped as samples at the end of perf record. This leads to the samples being after the regular samples. I think we need to improve this in the perf tool by having the BPF program just dump samples directly. You can write bpf output into a perf ring buffer from BPF, but that's not quite a sample so some finessing is necessary. If you use a command like 'perf script -D' you should be able to see the samples in the buffer, that may indicate what the problem is.
Comment 12 Konstantin Kharlamov 2024-01-19 20:16:12 UTC
While on it, I should mention that `--off-cpu` option doesn't work without root permissions, which might limit its applicability a bit (but still if it would work that would be amazing). When running without `sudo` it complains `libbpf: Failed to bump RLIMIT_MEMLOCK (err = -1), you might need to do it explicitly`.