Bug 207323
Summary: | [RFE] Latency profiling capability | ||
---|---|---|---|
Product: | Tracing/Profiling | Reporter: | Konstantin Kharlamov (hi-angel) |
Component: | Perf tool | Assignee: | Arnaldo Carvalho de Melo (acme) |
Status: | NEW --- | ||
Severity: | normal | CC: | irogers, jolsa |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.6.2 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Konstantin Kharlamov
2020-04-17 14:38:38 UTC
Okay, so I want to share a workaround. I'd even say a solution but one that does not involve perf. I hope it won't be considered offtopic though, because it uses ebpf which same as perf is a kernel project. I think we need more docs on how to profile sleeping/locked/blocked times on Linux because ATM situation with docs on that is dire. ----- The ebpf scripts for profiling are part of `bcc`/`bcc-tools` package. The simplest one to profile latency is called `offcputime`. But there's a better one, called `offwaketime`, which combines stack of whoever woke the thread up and a stack of the thread that was blocked/locked/slept, separated by `--` symbols. The time is measured in μs. Example session: $ cat test.cpp #include <unistd.h> int main() { for (size_t i = 0; i < 5; ++i) { sleep(1); } } $ g++ test.cpp -o test -O0 -g3 ./test &; sudo /usr/share/bcc/tools/offwaketime -p $(pgrep test) > out.txt [1] 124301 [1] + 124301 done ./test ^C $ cat out.txt Tracing blocked time (us) by user + kernel off-CPU and waker stack... Hit Ctrl-C to end. waker: swapper/3 0 b'secondary_startup_64' b'start_secondary' b'cpu_startup_entry' b'do_idle' b'cpuidle_enter' b'cpuidle_enter_state' b'apic_timer_interrupt' b'smp_apic_timer_interrupt' b'hrtimer_interrupt' b'__hrtimer_run_queues' b'hrtimer_wakeup' -- -- b'finish_task_switch' b'__schedule' b'schedule' b'do_nanosleep' b'hrtimer_nanosleep' b'common_nsleep' b'__x64_sys_clock_nanosleep' b'do_syscall_64' b'entry_SYSCALL_64_after_hwframe' b'[unknown]' target: test 124301 4000732 The output can also be fed to Flamegraph if you add `-f` option. As a bonus, Flamegraph would nicely collapse similar sleep-stacks that were woken up from different places. But one should manually edit first the `offwaketime` output to remove "idle thread" lines at the end, otherwise you may not find anything else on the flamegraph because their values are often far bigger than everything else. There was recently a latency subcommand for perf ftrace added: https://lore.kernel.org/lkml/20211129231830.1117781-1-namhyung@kernel.org/ I believe this can be marked as fixed. (In reply to Ian Rogers from comment #2) > There was recently a latency subcommand for perf ftrace added: > https://lore.kernel.org/lkml/20211129231830.1117781-1-namhyung@kernel.org/ > I believe this can be marked as fixed. Sorry, but `ftrace latency` isn't useful, at least from what I can see. Most importantly, you can't combine it with `perf record` since it's a separate command, and as a result you can't generate a flamegraph out of it to see where problems are. But even on its own it is very limited, as apparently it can't trace userspace functions, so e.g. executing a `… -T main …` will result in `failed to set tracing filters`. The patchset mentions that as a temporary limitation, but it's still present an year later as of 6.2. Not that it matters though, since you can't make a flamegraph out of it anyway. (In reply to Konstantin Kharlamov from comment #3) > Most importantly, you can't combine it with `perf record` since it's a > separate command, and as a result you can't generate a flamegraph out of it > to see where problems are. Well, I'm using Flamegraph as just an example. Really any kind of output that would "show where problems are". The `ftrace latency` doesn't give any kind of stacktraces that describe where slowness is, it just shows a table how much the `foo` in `-T foo` has taken after being ran many times. So you can not really use `ftrace latency` unless you has at least a guess where the slowness is coming from. There is lock analysis landing and also kwork: https://lore.kernel.org/lkml/20220727090330.107760-1-yangjihong1@huawei.com/ Wrt merged traced there was discussion on the mailing list to similar problems. As we have offcpu support in perf record: https://lore.kernel.org/all/20220422053401.208207-1-namhyung@kernel.org/ I'm not sure what this bug is about. Is it that combining multiple kinds of event samples should be supported in perf report? The bug title confuses me. (In reply to Ian Rogers from comment #6) > As we have offcpu support in perf record: > https://lore.kernel.org/all/20220422053401.208207-1-namhyung@kernel.org/ > I'm not sure what this bug is about. Is it that combining multiple kinds of > event samples should be supported in perf report? The bug title confuses me. Hello, thank you for mentioning the new feature! So, I'm trying it out, but I can't seem to get any information about the slowness from a testcase. I tried different combinations, with and without `-a` and other options. So, given a testcase: #include <unistd.h> void my_slow_f() { sleep(1); } int main() { for (size_t i = 0; i < 5; ++i) { my_slow_f(); } } I compile it with debugging info and no optimization: gcc test.c -o a -O0 -g3 I do recording for example as: sudo perf record -g --off-cpu -- ./a Then I read the report: sudo perf report -g This report shows two entries "cycles:P" and "offcpu-time". But neither contains `main` or `my_slow_f`. As a matter of fact doing a sudo grep slow perf.data returns with nothing. Do I do something wrong? Both kernel and perf are of version 6.7. Just to point out I'm not the only one interested (well, it's probably obvious, but anyway), there's a fun blog post from a Mesa developer, who's writing:¹ > Naturally I went to my standby for profiling, perf. I found some stuff, and I > fixed it all, and surprise surprise the numbers didn’t change. Clearly there > was some blocking going on that was affecting the frames. Could the > flamegraph tools for perf highlight this issue for me? > > No. > > An average Monday in the life of a zink developer. > > New tools would be needed to figure out what’s going on here. > > Do such tools already exist? Maybe. Post in the comments if you have tools > you use for this type of thing. As I'm writing this, no comment has been posted on that. So, yeah, if there's a way to profile the time CPU wasn't doing anything (as in, sleeping, being locked, waiting for memory to get fetched, etc), whether with perf or without, it's not currently known. 1: https://www.supergoodcode.com/presenting/ I'd have thought 6.7 is okay. We have a test of the functionality here: https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell/record_offcpu.sh?h=perf-tools-next#n39 The offcpu code needs BPF support and for that to have been compiled into the tool. If you run 'perf version --build-options' you should see: ... bpf_skeletons: [ on ] # HAVE_BPF_SKEL (In reply to Ian Rogers from comment #9) > I'd have thought 6.7 is okay. We have a test of the functionality here: > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/ > tree/tools/perf/tests/shell/record_offcpu.sh?h=perf-tools-next#n39 > > The offcpu code needs BPF support and for that to have been compiled into > the tool. If you run 'perf version --build-options' you should see: > ... > bpf_skeletons: [ on ] # HAVE_BPF_SKEL Yes, it's on. I also use sometimes basic `bpftrace`, so BPF is certainly working. I'd presume if my system was missing something for perf `off-cpu` support to work, it would just error out… It should error out. The BPF program for off CPU writes into maps and then the maps get dumped as samples at the end of perf record. This leads to the samples being after the regular samples. I think we need to improve this in the perf tool by having the BPF program just dump samples directly. You can write bpf output into a perf ring buffer from BPF, but that's not quite a sample so some finessing is necessary. If you use a command like 'perf script -D' you should be able to see the samples in the buffer, that may indicate what the problem is. While on it, I should mention that `--off-cpu` option doesn't work without root permissions, which might limit its applicability a bit (but still if it would work that would be amazing). When running without `sudo` it complains `libbpf: Failed to bump RLIMIT_MEMLOCK (err = -1), you might need to do it explicitly`. |