Bug 211801 - perf: util/unwind-libdw.c:30: __find_debuginfo: Assertion `dso' failed.
Summary: perf: util/unwind-libdw.c:30: __find_debuginfo: Assertion `dso' failed.
Status: NEW
Alias: None
Product: Tracing/Profiling
Classification: Unclassified
Component: Perf tool (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Arnaldo Carvalho de Melo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-02-16 14:24 UTC by Dave Rigby
Modified: 2021-02-19 11:05 UTC (History)
4 users (show)

See Also:
Kernel Version: perf version 5.11.0-1.el7.elrepo.x86_64
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
perf script -v -v --no-inline (285.17 KB, text/plain)
2021-02-17 20:26 UTC, Dave Rigby
Details
Proposed fix v1 (1.14 KB, patch)
2021-02-18 10:18 UTC, Dave Rigby
Details | Diff

Description Dave Rigby 2021-02-16 14:24:46 UTC
Overview:

When running `perf script` on a profile recorded using dwarf call-graph containing library with split debuginfo, perf crashes with:

 Aborted
 perf: util/unwind-libdw.c:30: __find_debuginfo: Assertion `dso' failed.

Details:

The issue seems to be related to how perf / libdw attempts to find the dwarf info for a library[1] which has separate debuginfo, possibly related to the recent commit "perf unwind: Fix separate debug info files when using elfutils' libdw…
" (bf53fc6b5f415cddc7118091cb8fd6a211b2320d).

If I attach GDB to perf I see the following backtrace:

$ gdb --args perf script --no-inline
...
(gdb) bt
#0  0x00007ffff58d93d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007ffff58daac8 in __GI_abort () at abort.c:90
#2  0x00007ffff58d21a6 in __assert_fail_base (fmt=0x7ffff5a2de00 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x6fca0f "dso", file=file@entry=0x706aa7 "util/unwind-libdw.c", line=line@entry=30, function=function@entry=0x6f0860 <__PRETTY_FUNCTION__.12310> "__find_debuginfo") at assert.c:92
#3  0x00007ffff58d2252 in __GI___assert_fail (assertion=0x6fca0f "dso", file=0x706aa7 "util/unwind-libdw.c", line=30, function=0x6f0860 <__PRETTY_FUNCTION__.12310> "__find_debuginfo") at assert.c:101
#4  0x00000000005a60d6 in __find_debuginfo ()
#5  0x00007ffff7052e8f in find_debuginfo (mod=mod@entry=0x1253ac0) at dwfl_module_getdwarf.c:539
#6  0x00007ffff7073a09 in find_debuginfo (mod=0x1253ac0) at dwfl_module_getdwarf.c:1423
#7  find_dw (mod=0x1253ac0) at dwfl_module_getdwarf.c:1413
#8  dwfl_module_getdwarf (mod=0x1253ac0, bias=0x7fffffffb260) at dwfl_module_getdwarf.c:1447
#9  0x00007ffff7079b95 in dwfl_module_dwarf_cfi (mod=mod@entry=0x1253ac0, bias=bias@entry=0x7fffffffb260) at dwfl_module_dwarf_cfi.c:69
#10 0x00007ffff7083722 in __libdwfl_frame_unwind (state=state@entry=0x1368b20) at frame_unwind.c:740
#11 0x00007ffff7083928 in dwfl_frame_pc (state=0x1368b20, pc=<optimized out>, isactivation=0x7fffffffb2b7) at dwfl_frame_pc.c:54
#12 0x00000000005a5d65 in frame_callback ()
#13 0x00007ffff7081a71 in dwfl_thread_getframes (thread=0x7fffffffb420, callback=0x5a5d10 <frame_callback>, arg=0x11c0550) at dwfl_frame.c:450
#14 0x00007ffff708173a in get_one_thread_cb (thread=0x7fffffffb420, arg=0x7fffffffb4a0) at dwfl_frame.c:322
#15 0x00007ffff70817d7 in dwfl_getthreads (dwfl=0x1223b40, callback=0x7ffff7081700 <get_one_thread_cb>, arg=0x7fffffffb4a0) at dwfl_frame.c:294
#16 0x00007ffff7081db3 in getthread (callback=0x7ffff7081c90 <get_one_thread_frames_cb>, arg=0x7fffffffb490, tid=<optimized out>, dwfl=<optimized out>) at dwfl_frame.c:371
#17 dwfl_getthread_frames (dwfl=<optimized out>, tid=1970, callback=<optimized out>, arg=<optimized out>) at dwfl_frame.c:405
#18 0x00000000005a6260 in unwind__get_entries ()
#19 0x00000000004633c3 in process_sample_event ()
#20 0x000000000051bee6 in machines__deliver_event ()
#21 0x000000000051c5e7 in perf_session__deliver_event ()
#22 0x000000000052078b in __ordered_events__flush.part.0 ()
#23 0x000000000051cfa4 in perf_session__process_user_event ()
#24 0x000000000051d51e in process_simple ()
#25 0x000000000051effc in perf_session__process_events ()
#26 0x00000000004676a0 in cmd_script ()
#27 0x00000000004bd25b in run_builtin ()
#28 0x00000000004330e7 in main ()

Envionment:

CentOS 7
Linux 3.10.0-1160.15.2.el7.x86_64
Relevant RPMs:
* elfutils-libs-0.176-5.el7.x86_64
* perf-5.11.0-1.el7.elrepo.x86_64 (https://elrepo.org/linux/kernel/el7/x86_64/RPMS/perf-5.10.16-1.el7.elrepo.x86_64.rpm
)

[1]: Couchbase Server, `/opt/couchbase/lib/libcouchstore.so` and `/usr/lib/debug/opt/couchbase/lib/libcouchstore.so.debug`. There's nothing particulary special about this, C++ shared library packaged into split rpm / debuginfo; linked via a .gnu_debuglink section in the ELF:

 readelf --string-dump=.gnu_debuglink /opt/couchbase/lib/libcouchstore.so 

 String dump of section '.gnu_debuglink':
   [     0]  libcouchstore.so.debug
   [    18]  :l]?
Comment 1 Dave Rigby 2021-02-16 14:30:34 UTC
Note this is a "semi"-regression - with perf-5.10.16-1.el7.elrepo I don't see the crash, however it seems like perf has failed to locate the DWARF info for the given library, as any backtraces including it are not decoded correctly:

 # perf --version
 perf version 5.10.16-1.el7.elrepo.x86_64


perf script --no-inline |grep -B5 -A5 libcouchstore.so
<cut>

 WriterPool3  1970 337490.608958:   20408163 cpu-clock:pppH: 
 	    7f40109ee93f compare_records+0xf (/opt/couchbase/lib/libcouchstore.so)
<cut>

Note no frames before or after `compare_records ` are correctly decoded, _but_ others are fine and `perf script` exits with status zero.
Comment 2 Jiri Olsa 2021-02-17 20:06:40 UTC
Could you please get output from:

  # perf script -vv 

it might show what went wrong
Comment 3 Dave Rigby 2021-02-17 20:26:57 UTC
Created attachment 295333 [details]
perf script -v -v --no-inline

Attached result of:

 perf script -v -v --no-inline >/dev/null 2> perf_script_vv.txt
Comment 4 Dave Rigby 2021-02-17 20:38:33 UTC
Just ran it in gdb to check I was still seeing the same exception; I believe the problem is that mod->userdata is null (which is the assert which is firing), but it wasn't immediately obvious to me where in the various callbacks that is supposed to be set.

#5  0x00007ffff7052e8f in find_debuginfo (mod=mod@entry=0x15c8470) at dwfl_module_getdwarf.c:539
539	  mod->debug.fd = (*mod->dwfl->callbacks->find_debuginfo) (MODCB_ARGS (mod),

(gdb) p *mod
$2 = {dwfl = 0x15cea20, next = 0x0, userdata = 0x0, name = 0xd2dfb0 "libcouchstore.so", low_addr = 139876824051712, high_addr = 139876826574744, main = {name = 0xdb9ae0 "/opt/couchbase/lib/libcouchstore.so", fd = 20, valid = false, relocated = false, elf = 0x15be9c0, vaddr = 0, 
    address_sync = 350598}, debug = {name = 0x0, fd = 0, valid = false, relocated = false, elf = 0x0, vaddr = 0, address_sync = 0}, aux_sym = {name = 0x0, fd = 0, valid = false, relocated = false, elf = 0x0, vaddr = 0, address_sync = 0}, main_bias = 139876824051712, ebl = 0x15dcb20, e_type = 3, 
  elferr = DWFL_E_NOERROR, reloc_info = 0x0, symfile = 0x0, symdata = 0x0, aux_symdata = 0x0, syments = 0, aux_syments = 0, first_global = 0, aux_first_global = 0, symstrdata = 0x0, aux_symstrdata = 0x0, symxndxdata = 0x0, aux_symxndxdata = 0x0, elfdir = 0x0, dw = 0x0, alt = 0x0, alt_fd = 0, 
  alt_elf = 0x0, symerr = DWFL_E_NOERROR, dwerr = DWFL_E_NO_DWARF, first_cu = 0x0, cu = 0x0, lazy_cu_root = 0x0, aranges = 0x0, build_id_bits = 0x0, build_id_vaddr = 0, build_id_len = 0, ncu = 0, lazycu = 0, naranges = 0, dwarf_cfi = 0x0, eh_cfi = 0x15c82d0, segment = 2, gc = false, 
  is_executable = false}
Comment 5 Dave Rigby 2021-02-18 10:18:51 UTC
Created attachment 295341 [details]
Proposed fix v1

I suspect the problem might be due to the fact that userdata is not setup if __report_module fails to find mod via dwfl_addrmodule in the previously-mentioned patch[1]:

@@ -46,16 +60,24 @@ static int __report_module(struct addr_location *al, u64 ip,
        mod = dwfl_addrmodule(ui->dwfl, ip);
        if (mod) {
                Dwarf_Addr s;
+               void **userdatap;
 
-               dwfl_module_info(mod, NULL, &s, NULL, NULL, NULL, NULL, NULL);
+               dwfl_module_info(mod, &userdatap, &s, NULL, NULL, NULL, NULL, NULL);
+               *userdatap = dso;
                if (s != al->map->start - al->map->pgoff)
                        mod = 0;
        }
 
        if (!mod)
-               mod = dwfl_report_elf(ui->dwfl, dso->short_name,
-                                     (dso->symsrc_filename ? dso->symsrc_filename : dso->long_name), -1, al->map->start - al->map->pgoff,
-                                     false);
+               mod = dwfl_report_elf(ui->dwfl, dso->short_name, dso->long_name, -1,
+                                     al->map->start - al->map->pgoff, false);
+       if (!mod) {
+               char filename[PATH_MAX];
+
+               if (dso__build_id_filename(dso, filename, sizeof(filename), false))
+                       mod = dwfl_report_elf(ui->dwfl, dso->short_name, filename, -1,
+                                             al->map->start - al->map->pgoff, false);
+       }
 
        return mod && dwfl_addrmodule(ui->dwfl, ip) == mod ? 0 : -1;
 }


Note how `*userdatap` is only initialised with dso if the first call to dwfl_addrmodule returns non-null. However, if one of the two other attempts to lookup `mod` succeed, `*userdatap` is never initialised.

With the attached patch I no longer see the assert and backtrace decode seems to work as well as with libunwind. Note I'm flying pretty blind here, I'm not at all familiar with the subtleties of ELF / DWARF module manipulation; this is just a dumb "set a field which looks like it should have been set" :)

I also don't know enough about the different lookup methods here to understand exactly why dwfl_addrmodule failed for my module...


[1]: "perf unwind: Fix separate debug info files when using elfutils' libdw…
" - https://github.com/torvalds/linux/commit/bf53fc6b5f415cddc7118091cb8fd6a211b2320d
Comment 6 Jiri Olsa 2021-02-18 10:32:45 UTC
yes, looks like userdatap might not be initialized,
I asked Jan to look at it

Could you please send the patch for review and make
sure Jan is CC-ed?

thanks
Comment 7 Dave Rigby 2021-02-18 10:54:47 UTC
Sure - can you direct me to where I should post the patch (I'm not a regular kernel contributor...).
Comment 8 Jiri Olsa 2021-02-18 15:13:52 UTC
please send it to:

  linux-perf-users@vger.kernel.org
  Arnaldo Carvalho de Melo <acme@kernel.org>
  Jan Kratochvil <jan.kratochvil@redhat.com>
  Jiri Olsa <jolsa@redhat.com>

no worries about the patch/changelog shape (just should be apply-able)
let's start the discussion and we will tune the rest
Comment 9 Jan Kratochvil 2021-02-18 16:30:19 UTC
I agree with the patch. I wrote exactly the same patch before I found the one of yours here.
I just do not have the problem reproducible but if it fixes the problem for you that's all we need.
Comment 10 Arnaldo Carvalho de Melo 2021-02-18 17:16:51 UTC
I'll collect the Acked-by: Jan and Jiri, ok?
Comment 11 Dave Rigby 2021-02-18 17:18:14 UTC
Thanks. This is the first patch I've submitted to Linux so any assistance appreciated :)
Comment 12 Richard W.M. Jones 2021-02-19 10:42:06 UTC
Found the same issue in Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1930640
Comment 13 Richard W.M. Jones 2021-02-19 11:05:42 UTC
I tested your patch (https://www.spinics.net/lists/linux-perf-users/msg12625.html)
and it fixes the problem for me.

Note You need to log in before you can comment on or make changes to this bug.