Overview: When running `perf script` on a profile recorded using dwarf call-graph containing library with split debuginfo, perf crashes with: Aborted perf: util/unwind-libdw.c:30: __find_debuginfo: Assertion `dso' failed. Details: The issue seems to be related to how perf / libdw attempts to find the dwarf info for a library[1] which has separate debuginfo, possibly related to the recent commit "perf unwind: Fix separate debug info files when using elfutils' libdw… " (bf53fc6b5f415cddc7118091cb8fd6a211b2320d). If I attach GDB to perf I see the following backtrace: $ gdb --args perf script --no-inline ... (gdb) bt #0 0x00007ffff58d93d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55 #1 0x00007ffff58daac8 in __GI_abort () at abort.c:90 #2 0x00007ffff58d21a6 in __assert_fail_base (fmt=0x7ffff5a2de00 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x6fca0f "dso", file=file@entry=0x706aa7 "util/unwind-libdw.c", line=line@entry=30, function=function@entry=0x6f0860 <__PRETTY_FUNCTION__.12310> "__find_debuginfo") at assert.c:92 #3 0x00007ffff58d2252 in __GI___assert_fail (assertion=0x6fca0f "dso", file=0x706aa7 "util/unwind-libdw.c", line=30, function=0x6f0860 <__PRETTY_FUNCTION__.12310> "__find_debuginfo") at assert.c:101 #4 0x00000000005a60d6 in __find_debuginfo () #5 0x00007ffff7052e8f in find_debuginfo (mod=mod@entry=0x1253ac0) at dwfl_module_getdwarf.c:539 #6 0x00007ffff7073a09 in find_debuginfo (mod=0x1253ac0) at dwfl_module_getdwarf.c:1423 #7 find_dw (mod=0x1253ac0) at dwfl_module_getdwarf.c:1413 #8 dwfl_module_getdwarf (mod=0x1253ac0, bias=0x7fffffffb260) at dwfl_module_getdwarf.c:1447 #9 0x00007ffff7079b95 in dwfl_module_dwarf_cfi (mod=mod@entry=0x1253ac0, bias=bias@entry=0x7fffffffb260) at dwfl_module_dwarf_cfi.c:69 #10 0x00007ffff7083722 in __libdwfl_frame_unwind (state=state@entry=0x1368b20) at frame_unwind.c:740 #11 0x00007ffff7083928 in dwfl_frame_pc (state=0x1368b20, pc=<optimized out>, isactivation=0x7fffffffb2b7) at dwfl_frame_pc.c:54 #12 0x00000000005a5d65 in frame_callback () #13 0x00007ffff7081a71 in dwfl_thread_getframes (thread=0x7fffffffb420, callback=0x5a5d10 <frame_callback>, arg=0x11c0550) at dwfl_frame.c:450 #14 0x00007ffff708173a in get_one_thread_cb (thread=0x7fffffffb420, arg=0x7fffffffb4a0) at dwfl_frame.c:322 #15 0x00007ffff70817d7 in dwfl_getthreads (dwfl=0x1223b40, callback=0x7ffff7081700 <get_one_thread_cb>, arg=0x7fffffffb4a0) at dwfl_frame.c:294 #16 0x00007ffff7081db3 in getthread (callback=0x7ffff7081c90 <get_one_thread_frames_cb>, arg=0x7fffffffb490, tid=<optimized out>, dwfl=<optimized out>) at dwfl_frame.c:371 #17 dwfl_getthread_frames (dwfl=<optimized out>, tid=1970, callback=<optimized out>, arg=<optimized out>) at dwfl_frame.c:405 #18 0x00000000005a6260 in unwind__get_entries () #19 0x00000000004633c3 in process_sample_event () #20 0x000000000051bee6 in machines__deliver_event () #21 0x000000000051c5e7 in perf_session__deliver_event () #22 0x000000000052078b in __ordered_events__flush.part.0 () #23 0x000000000051cfa4 in perf_session__process_user_event () #24 0x000000000051d51e in process_simple () #25 0x000000000051effc in perf_session__process_events () #26 0x00000000004676a0 in cmd_script () #27 0x00000000004bd25b in run_builtin () #28 0x00000000004330e7 in main () Envionment: CentOS 7 Linux 3.10.0-1160.15.2.el7.x86_64 Relevant RPMs: * elfutils-libs-0.176-5.el7.x86_64 * perf-5.11.0-1.el7.elrepo.x86_64 (https://elrepo.org/linux/kernel/el7/x86_64/RPMS/perf-5.10.16-1.el7.elrepo.x86_64.rpm ) [1]: Couchbase Server, `/opt/couchbase/lib/libcouchstore.so` and `/usr/lib/debug/opt/couchbase/lib/libcouchstore.so.debug`. There's nothing particulary special about this, C++ shared library packaged into split rpm / debuginfo; linked via a .gnu_debuglink section in the ELF: readelf --string-dump=.gnu_debuglink /opt/couchbase/lib/libcouchstore.so String dump of section '.gnu_debuglink': [ 0] libcouchstore.so.debug [ 18] :l]?
Note this is a "semi"-regression - with perf-5.10.16-1.el7.elrepo I don't see the crash, however it seems like perf has failed to locate the DWARF info for the given library, as any backtraces including it are not decoded correctly: # perf --version perf version 5.10.16-1.el7.elrepo.x86_64 perf script --no-inline |grep -B5 -A5 libcouchstore.so <cut> WriterPool3 1970 337490.608958: 20408163 cpu-clock:pppH: 7f40109ee93f compare_records+0xf (/opt/couchbase/lib/libcouchstore.so) <cut> Note no frames before or after `compare_records ` are correctly decoded, _but_ others are fine and `perf script` exits with status zero.
Could you please get output from: # perf script -vv it might show what went wrong
Created attachment 295333 [details] perf script -v -v --no-inline Attached result of: perf script -v -v --no-inline >/dev/null 2> perf_script_vv.txt
Just ran it in gdb to check I was still seeing the same exception; I believe the problem is that mod->userdata is null (which is the assert which is firing), but it wasn't immediately obvious to me where in the various callbacks that is supposed to be set. #5 0x00007ffff7052e8f in find_debuginfo (mod=mod@entry=0x15c8470) at dwfl_module_getdwarf.c:539 539 mod->debug.fd = (*mod->dwfl->callbacks->find_debuginfo) (MODCB_ARGS (mod), (gdb) p *mod $2 = {dwfl = 0x15cea20, next = 0x0, userdata = 0x0, name = 0xd2dfb0 "libcouchstore.so", low_addr = 139876824051712, high_addr = 139876826574744, main = {name = 0xdb9ae0 "/opt/couchbase/lib/libcouchstore.so", fd = 20, valid = false, relocated = false, elf = 0x15be9c0, vaddr = 0, address_sync = 350598}, debug = {name = 0x0, fd = 0, valid = false, relocated = false, elf = 0x0, vaddr = 0, address_sync = 0}, aux_sym = {name = 0x0, fd = 0, valid = false, relocated = false, elf = 0x0, vaddr = 0, address_sync = 0}, main_bias = 139876824051712, ebl = 0x15dcb20, e_type = 3, elferr = DWFL_E_NOERROR, reloc_info = 0x0, symfile = 0x0, symdata = 0x0, aux_symdata = 0x0, syments = 0, aux_syments = 0, first_global = 0, aux_first_global = 0, symstrdata = 0x0, aux_symstrdata = 0x0, symxndxdata = 0x0, aux_symxndxdata = 0x0, elfdir = 0x0, dw = 0x0, alt = 0x0, alt_fd = 0, alt_elf = 0x0, symerr = DWFL_E_NOERROR, dwerr = DWFL_E_NO_DWARF, first_cu = 0x0, cu = 0x0, lazy_cu_root = 0x0, aranges = 0x0, build_id_bits = 0x0, build_id_vaddr = 0, build_id_len = 0, ncu = 0, lazycu = 0, naranges = 0, dwarf_cfi = 0x0, eh_cfi = 0x15c82d0, segment = 2, gc = false, is_executable = false}
Created attachment 295341 [details] Proposed fix v1 I suspect the problem might be due to the fact that userdata is not setup if __report_module fails to find mod via dwfl_addrmodule in the previously-mentioned patch[1]: @@ -46,16 +60,24 @@ static int __report_module(struct addr_location *al, u64 ip, mod = dwfl_addrmodule(ui->dwfl, ip); if (mod) { Dwarf_Addr s; + void **userdatap; - dwfl_module_info(mod, NULL, &s, NULL, NULL, NULL, NULL, NULL); + dwfl_module_info(mod, &userdatap, &s, NULL, NULL, NULL, NULL, NULL); + *userdatap = dso; if (s != al->map->start - al->map->pgoff) mod = 0; } if (!mod) - mod = dwfl_report_elf(ui->dwfl, dso->short_name, - (dso->symsrc_filename ? dso->symsrc_filename : dso->long_name), -1, al->map->start - al->map->pgoff, - false); + mod = dwfl_report_elf(ui->dwfl, dso->short_name, dso->long_name, -1, + al->map->start - al->map->pgoff, false); + if (!mod) { + char filename[PATH_MAX]; + + if (dso__build_id_filename(dso, filename, sizeof(filename), false)) + mod = dwfl_report_elf(ui->dwfl, dso->short_name, filename, -1, + al->map->start - al->map->pgoff, false); + } return mod && dwfl_addrmodule(ui->dwfl, ip) == mod ? 0 : -1; } Note how `*userdatap` is only initialised with dso if the first call to dwfl_addrmodule returns non-null. However, if one of the two other attempts to lookup `mod` succeed, `*userdatap` is never initialised. With the attached patch I no longer see the assert and backtrace decode seems to work as well as with libunwind. Note I'm flying pretty blind here, I'm not at all familiar with the subtleties of ELF / DWARF module manipulation; this is just a dumb "set a field which looks like it should have been set" :) I also don't know enough about the different lookup methods here to understand exactly why dwfl_addrmodule failed for my module... [1]: "perf unwind: Fix separate debug info files when using elfutils' libdw… " - https://github.com/torvalds/linux/commit/bf53fc6b5f415cddc7118091cb8fd6a211b2320d
yes, looks like userdatap might not be initialized, I asked Jan to look at it Could you please send the patch for review and make sure Jan is CC-ed? thanks
Sure - can you direct me to where I should post the patch (I'm not a regular kernel contributor...).
please send it to: linux-perf-users@vger.kernel.org Arnaldo Carvalho de Melo <acme@kernel.org> Jan Kratochvil <jan.kratochvil@redhat.com> Jiri Olsa <jolsa@redhat.com> no worries about the patch/changelog shape (just should be apply-able) let's start the discussion and we will tune the rest
I agree with the patch. I wrote exactly the same patch before I found the one of yours here. I just do not have the problem reproducible but if it fixes the problem for you that's all we need.
I'll collect the Acked-by: Jan and Jiri, ok?
Thanks. This is the first patch I've submitted to Linux so any assistance appreciated :)
Found the same issue in Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1930640
I tested your patch (https://www.spinics.net/lists/linux-perf-users/msg12625.html) and it fixes the problem for me.