Bug 219880 - segfault executing make -C tests run_b219174
Summary: segfault executing make -C tests run_b219174
Status: RESOLVED CODE_FIX
Alias: None
Product: Tools
Classification: Unclassified
Component: libcap
Hardware: All Linux
Importance: P3 normal
Assignee: Andrew G. Morgan
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-03-15 01:53 UTC by Andrew G. Morgan
Modified: 2025-03-22 19:40 UTC (History)
CC List: 1 user

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Docker file (313 bytes, text/plain)
2025-03-15 02:30 UTC, Andrew G. Morgan
Docker file for glibc-2.40 reproducer (1.68 KB, text/plain)
2025-03-15 10:10 UTC, Christian
Draft patch fixing I/O on mips64el (3.92 KB, patch)
2025-03-22 13:38 UTC, Christian

Description Andrew G. Morgan 2025-03-15 01:53:54 UTC
Christian (Debian libcap maintainer) says that glibc-2.40 seems to be associated with this test segfaulting on powerpc64.

I have tested under QEMU this vanilla version of Debian:

  morgan@deb-ppc64:~/libcap$ cat /etc/debian_version 
  12.9

and this one:

  morgan@deb-mipsel:~/libcap/tests$ cat /etc/debian_version 
  12.9
  morgan@deb-mipsel:~/libcap/tests$ /lib/mipsel-linux-gnu/libc.so.6
  GNU C Library (Debian GLIBC 2.36-9+deb12u9) stable release version 2.36.

[so the prior glibc]

Christian provided some handy pointers for reproducing:

[1]: https://buildd.debian.org/status/fetch.php?pkg=libcap2&arch=mips64el&ver=1%3A2.75-1&stamp=1741363968&raw=0

[2]: https://web.git.kernel.org/pub/scm/libs/libcap/libcap.git/tree/libcap/execable.c#n7

[3]: I've attached a Dockerfile for such a container, which I built and
     ran with
      $ podman build --platform=linux/mips64le -t mips64el/debian:unstable .
      $ podman run --rm -it mips64el/debian:unstable

[4]: https://buildd.debian.org/status/fetch.php?pkg=libcap2&arch=mips64el&ver=1%3A2.75-2&stamp=1741896514&raw=0
Comment 1 Andrew G. Morgan 2025-03-15 02:30:11 UTC
Created attachment 307824 [details]
Docker file
Comment 2 Andrew G. Morgan 2025-03-15 03:53:43 UTC
Using the Docker [3] method, I can reproduce the failure:

root@f06fe0f0f6fe:/sources/libcap2-2.75/libcap# /lib/mips64el-linux-gnuabi64/libc.so.6      
GNU C Library (Debian GLIBC 2.41-4) stable release version 2.41.
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 14.2.0.
libc ABIs: MIPS_PLT UNIQUE MIPS_O32_FP64 ABSOLUTE MIPS_XHASH
Minimum supported kernel: 3.2.0
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.
root@f06fe0f0f6fe:/sources/libcap2-2.75/libcap# ./libcap.so.2.75 --help
./libcap.so.2.75 is the shared library version: libcap-2.75.
See the License file for distribution information.
More information on this library is available from:

    https://sites.google.com/site/fullycapable/

usage: libcap.so [--help|--usage|--summary]qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault (core dumped)
Comment 3 Christian 2025-03-15 10:10:19 UTC
Just to add some data:

This failure was first observed on Debian's mips64el and powerpc architectures. powerpc is not an official architecture and I don't know much about it (especially how it relates to the ppc64el architecture). mips64el, however, is an officially supported architecture.

Though the reproducers here are in a container, I also managed to reproduce the issue on bare metal on a mips64el porterbox that Debian provides.

I suspected glibc-2.40 to be a factor because I don't see this issue with 2.39.

I've attached another Dockerfile to reproduce this.
  * It creates a bookworm installation, and then upgrades glibc to 2.39
  * A clone of upstream libcap.git can be found in /sources/libcap. Tests run fine after a build
  * Run /upgrade-glibc-2.40.sh to install glibc-2.40, and observe the segfault appearing
  * Run /upgrade-glibc-mostrecent.sh to install the most recent version in unstable (currently 2.41 + patches)
Comment 4 Christian 2025-03-15 10:10:55 UTC
Created attachment 307826 [details]
Docker file for glibc-2.40 reproducer
Comment 5 Andrew G. Morgan 2025-03-15 14:56:39 UTC
What is curious in the libcap.so case is that --summary works, but --help segfaults. gdb does not work in the Docker image so I'll have to figure out how to retarget my QEMU system image to look further. Compiling with -g preserves the segfault behavior, so a backtrace from a segfault might be useful.

gdb ./libcap.so
 run --help
 bt
Comment 6 Christian 2025-03-16 21:20:11 UTC
Interestingly, even with -g, I can't get a useful backtrace:

[...]
Program received signal SIGSEGV, Segmentation fault.
0x000000fff7f49fc8 in ?? ()
(gdb) bt
warning: GDB can't find the start of the function at 0xfff7f49fc8.
[...]
Comment 7 Christian 2025-03-16 21:22:51 UTC
> gdb does not work in the Docker image so I'll have to figure out how to
> retarget my QEMU system image to look further.

The filesystem layout built by the Dockerfiles above can be converted to qcow2 and booted with QEMU.


The following must be appended to the Dockerfile:

RUN apt install -y linux-image-5kc-malta systemd-sysv openssh-server iproute2
RUN echo "[Match]\nName=enp*\n\n[Network]\nDHCP=yes" > /etc/systemd/network/guest.network
RUN systemctl enable systemd-networkd.service
RUN passwd -d root
RUN echo "PermitRootLogin yes" > /etc/ssh/sshd_config.d/guest.conf
RUN echo "PermitEmptyPasswords yes" >> /etc/ssh/sshd_config.d/guest.conf


Using docker or podman, a flattened filesystem layout can be extracted to a tarball:

$ docker/podman create --name temp_mips mips64el/debian:unstable 
$ docker/podman container export temp_mips > /tmp/unstable-mips64el.tar
$ docker/podman rm temp_mips


Using libguestfs-tools, this tarball can be converted to qcow2, and the initrd and kernel images extracted:

$ virt-make-fs --partition=mbr --format=qcow2 --size=10G --type=ext4 /tmp/unstable-mips64el.tar /tmp/unstable-mips64el.img
$ virt-copy-out -a /tmp/unstable-mips64el.img /boot/$( virt-ls -a /tmp/unstable-mips64el.img /boot | grep vmlinuz) /tmp
$ virt-copy-out -a /tmp/unstable-mips64el.img /boot/$( virt-ls -a /tmp/unstable-mips64el.img /boot | grep initrd) /tmp


I could boot this with

$ SSHPORT=20022
$ qemu-system-mips64el -machine malta -m 2048 -cpu 5KEc -display none -vga none -nographic -kernel /tmp/vmlinuz* -initrd /tmp/initrd.img-* -hda /tmp/unstable-mips64el.img -append 'root=/dev/sda1 rw' -nic user,model=virtio,hostfwd=tcp:127.0.0.1:$SSHPORT-:22

localhost:$SSHPORT is forwarded to guest port 22, where an instance of OpenSSH will be listening. The root account has an empty password.
Comment 8 Andrew G. Morgan 2025-03-18 04:08:14 UTC
No idea if this is important, but logging these here in case I come back to them:

https://lore.kernel.org/lkml/20250225-nolibc-mips-n32-v2-2-664b47d87fa0@weissschuh.net/
https://lore.kernel.org/lkml/20250316-nolibc-sp-align-v1-1-1e1fb073ca1e@weissschuh.net/
Comment 9 Andrew G. Morgan 2025-03-18 06:25:41 UTC
I upgraded my existing QEMU system image to "sid". This has glibc 2.41-6

I doubt the issue is the stack pointer. Placing a breakpoint on __so_start, I find the following registers (sp is 16 byte aligned):

(gdb) info reg
                  zero               at               v0               v1
 R0   0000000000000000 fffffffffffffff0 000000fff7fc0a50 0000000000000000 
                    a0               a1               a2               a3
 R4   000000fff7ff7000 000000fff7ff7768 000000ffffffcb48 000000ffffffcb60 
                    a4               a5               a6               a7
 R8   000000fff7fa14d0 000000fff7fa97e0 000000ffffffcd90 900000000a2cfe00 
                    t0               t1               t2               t3
 R12  0000000000000000 ffffffff84080018 ffffffff81bd1d40 0000000000000001 
                    s0               s1               s2               s3
 R16  000000fff8003020 000000aaaaaa9760 000000fff8003020 000000aaaac387a0 
                    s4               s5               s6               s7
 R20  000000aaaac363b0 000000fff7ffae50 0000000000000000 000000aaaac1cc04 
                    t8               t9               k0               k1
 R24  0000000000000000 000000aaaaaa9760 0000000000000000 0000000000000000 
                    gp               sp               s8               ra
 R28  000000aaaaac2840 000000ffffffcac0 0000000000000000 000000fff7fd7b0c 
                status               lo               hi         badvaddr
      000000000000a4f3 000000000000003a 000000000000001f 000000fff7e75560 
                 cause               pc
      0000000010800024 000000aaaaaa9774 
                  fcsr              fir          restart
              00000000         00f30000 0000000000000000
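
The alignment remark can be checked directly against the dump. A minimal sketch in Python, assuming only the sp value shown above; the helper name is_aligned is made up for illustration:

```python
def is_aligned(address: int, boundary: int = 16) -> bool:
    """Return True when address is a multiple of boundary (a power of two)."""
    return address & (boundary - 1) == 0


# sp as reported by "info reg" at the __so_start breakpoint.
SP_AT_SO_START = 0x000000FFFFFFCAC0

if __name__ == "__main__":
    print(hex(SP_AT_SO_START), "16-byte aligned:", is_aligned(SP_AT_SO_START))
```

The low four bits of sp are zero, which is consistent with the stack pointer not being the culprit.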
Comment 10 Andrew G. Morgan 2025-03-18 13:19:39 UTC
Just after the segfault:

(gdb) info reg
                  zero               at               v0               v1
 R0   0000000000000000 ffffffff81e970dc 0000000000000001 0000000000000000 
                    a0               a1               a2               a3
 R4   000000fff7fa1748 0000000000000000 0000000000000000 fffffffffbad2884 
                    a4               a5               a6               a7
 R8   0000000000000000 0000000000000000 2d2d7c706c65682d 2d2d7c6567617375 
                    t0               t1               t2               t3
 R12  0000000000000000 ffffffff84080018 ffffffff81e974f8 0000000000000003 
                    s0               s1               s2               s3
 R16  000000fff7fa1748 000000000000000a ffffffffffffffff 000000fff7fa16a0 
                    s4               s5               s6               s7
 R20  000000fff7f9fce0 0000000000000001 0000000000000000 000000aaaaab0000 
                    t8               t9               k0               k1
 R24  0000000000000000 000000fff7f49a98 0000000000000000 0000000000000000 
                    gp               sp               s8               ra
 R28  000000fff7fa97e0 000000ffffffcb70 0000000000000000 000000fff7f4a0d0 
                status               lo               hi         badvaddr
      000000000000a4f3 0000000000000003 0000000000000000 0000000000000000 
                 cause               pc
      000000001080000c 000000fff7f49fc8 
                  fcsr              fir          restart
              00000000         00f30000 0000000000000000
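
One way to read this dump, offered as an observation rather than a diagnosis: a6 and a7 look like ASCII text. Since mips64el is little-endian, each 64-bit register can be decoded as eight bytes, lowest byte first. A small Python sketch; decode_le_word is a hypothetical helper name and the two values are copied from the dump above:

```python
def decode_le_word(value: int, size: int = 8) -> str:
    """Interpret a register value as `size` little-endian ASCII bytes."""
    raw = value.to_bytes(size, byteorder="little")
    # Non-ASCII bytes are shown as replacement characters instead of raising.
    return raw.decode("ascii", errors="replace")


# a6 and a7 from the "info reg" output just after the segfault.
A6 = 0x2D2D7C706C65682D
A7 = 0x2D2D7C6567617375

if __name__ == "__main__":
    print(repr(decode_le_word(A6) + decode_le_word(A7)))
```

The two registers decode to "-help|--" and "usage|--", which is a fragment of the usage line printed in comment 2. That suggests (but does not prove) that the fault happens while libc is handling that string.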
Comment 11 Andrew G. Morgan 2025-03-19 02:16:00 UTC
Exploring some more. A separate invocation. Whatever the problem code is, it is in libc:

(gdb) info reg
                  zero               at               v0               v1
 R0   0000000000000000 ffffffff81e970dc 0000000000000001 0000000000000000 
                    a0               a1               a2               a3
 R4   000000fff7fa1748 0000000000000000 0000000000000000 fffffffffbad2884 
                    a4               a5               a6               a7
 R8   0000000000000000 0000000000000000 6173752d2d7c706c 6d75732d2d7c6567 
                    t0               t1               t2               t3
 R12  0000000000000000 ffffffff84080018 ffffffff81e974f8 0000000000000003 
                    s0               s1               s2               s3
 R16  000000fff7fa1748 000000000000000a ffffffffffffffff 000000fff7fa16a0 
                    s4               s5               s6               s7
 R20  000000fff7f9fce0 0000000000000001 0000000000000000 000000aaaaab0000 
                    t8               t9               k0               k1
 R24  0000000000000005 000000fff7f49a98 0000000000000000 0000000000000000 
                    gp               sp               s8               ra
 R28  000000fff7fa97e0 000000ffffffcb70 0000000000000000 000000fff7f4a0d0 
                status               lo               hi         badvaddr
      000000000000a4f3 0000000000000003 0000000000000000 0000000000000000 
                 cause               pc
      000000001080000c 000000fff7f49fc8 
                  fcsr              fir          restart
              00000000         00f30000 0000000000000000 

morgan@deb-mips64el-hack:~/libcap/libcap$ cat /proc/26633/maps
aaaaaa0000-aaaaaab000 r-xp 00000000 fe:01 130763                         /home/morgan/libcap/libcap/libcap.so.2.75
aaaaabb000-aaaaabc000 rw-p 0000b000 fe:01 130763                         /home/morgan/libcap/libcap/libcap.so.2.75
aaaaabc000-aaaaadd000 rw-p 00000000 00:00 0                              [heap]
fff7d90000-fff7f8b000 r-xp 00000000 fe:01 530231                         /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7f8b000-fff7f9a000 ---p 001fb000 fe:01 530231                         /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7f9a000-fff7fa0000 r--p 0020a000 fe:01 530231                         /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7fa0000-fff7fa5000 rw-p 00210000 fe:01 530231                         /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7fa5000-fff7fb2000 rw-p 00000000 00:00 0 
fff7fbb000-fff7fe9000 r-xp 00000000 fe:01 530228                         /usr/lib/mips64el-linux-gnuabi64/ld.so.1
fff7fef000-fff7ff1000 rw-p 00000000 00:00 0 
fff7ff7000-fff7ff9000 rw-p 00000000 00:00 0 
fff7ff9000-fff7ffb000 r--p 0002e000 fe:01 530228                         /usr/lib/mips64el-linux-gnuabi64/ld.so.1
fff7ffb000-fff7ffd000 rw-p 00030000 fe:01 530228                         /usr/lib/mips64el-linux-gnuabi64/ld.so.1
fffffdc000-ffffffd000 rwxp 00000000 00:00 0                              [stack]
ffffffd000-ffffffe000 r-xp 00000000 00:00 0 
ffffffe000-fffffff000 r--p 00000000 00:00 0                              [vvar]
fffffff000-10000000000 r-xp 00000000 00:00 0                             [vdso]
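
To back up the "it is in libc" reading, the faulting pc can be matched against the maps listing and turned into a file-relative offset. A rough Python sketch; the locate helper is hypothetical, and the excerpt keeps only the two executable mappings from the listing above:

```python
from typing import Optional, Tuple

# Executable mappings copied from the /proc/<pid>/maps output above.
MAPS_EXCERPT = """\
fff7d90000-fff7f8b000 r-xp 00000000 fe:01 530231 /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7fbb000-fff7fe9000 r-xp 00000000 fe:01 530228 /usr/lib/mips64el-linux-gnuabi64/ld.so.1
"""


def locate(address: int, maps_text: str) -> Optional[Tuple[str, int]]:
    """Return (path, file offset) of the mapping containing address, if any."""
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= address < end:
            path = fields[5] if len(fields) > 5 else ""
            # File offset = offset of the mapping + distance into the mapping.
            return path, address - start + int(fields[2], 16)
    return None


if __name__ == "__main__":
    for name, value in (("pc", 0xFFF7F49FC8), ("ra", 0xFFF7F4A0D0)):
        path, offset = locate(value, MAPS_EXCERPT)
        print(f"{name}: {path} + {offset:#x}")
```

Both pc and ra fall inside the r-xp mapping of libc.so.6, at file offsets 0x1b9fc8 and 0x1ba0d0 respectively. Those offsets could then be looked up in a disassembly of that exact libc.so.6 build.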
Comment 12 Andrew G. Morgan 2025-03-19 02:27:04 UTC
So, I think one problem with setting breakpoints in libc is that the breakpoint lands on an ephemeral trampoline of some sort that gets rewritten the first time the function is called. For example, I put a breakpoint on printf, and it hits once, which then redirects to __dl_runtime_resolve(), but that breakpoint is not hit again.
Comment 13 Andrew G. Morgan 2025-03-19 13:23:48 UTC
Putting a breakpoint on _dl_lookup_symbol_x() shows each symbol getting resolved. The last one resolved before the segfault is "puts". Curiously, the first printf() completes without resolving it, but the second invocation of printf() calls puts().

Somewhere before the segfault:

(gdb) bt
#0  0x000000fff7fc6684 in do_lookup_x (undef_name=0xaaaaaa1d00 "puts", 
    new_hash=<optimized out>, old_hash=0xffffffca98, ref=0xaaaaaa1448, 
    result=0xffffffca88, scope=<optimized out>, i=0, version=0xfff7ff87f8, 
    flags=0, skip=<optimized out>, type_class=1, undef_map=<optimized out>)
    at dl-lookup.c:418
#1  0x000000fff7fc7110 in _dl_lookup_symbol_x (undef_name=0xaaaaaa1d00 "puts", 
    undef_map=<optimized out>, ref=0xffffffcb60, symbol_scope=<optimized out>, 
    version=0xfff7ff87f8, type_class=<optimized out>, flags=0, skip_map=0x0)
    at dl-lookup.c:779
#2  0x000000fff7fd15f8 in __dl_runtime_resolve (sym_index=<optimized out>, 
    return_address=<optimized out>, old_gpreg=<optimized out>, 
    stub_pc=<optimized out>) at ../sysdeps/mips/dl-trampoline.c:182
#3  0x000000fff7fd136c in _dl_runtime_resolve () from /lib64/ld.so.1
Comment 14 Christian 2025-03-20 15:58:40 UTC
I bisected GLIBC and this appears to be the offending commit:

    https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=2a99e2398d9d717c034e915f7846a49e623f5450;hp=a81cdde1cb9d514fc8f014ddf21771c96ff2c182

I did not investigate further.


This process was quite straightforward: I just installed the build dependencies, and ran tests against the build result with this recipe:

    https://sourceware.org/glibc/wiki/Testing/Builds#Compile_normally.2C_run_under_new_glibc


Just for reference, this is the bug that was fixed by this commit:

    https://sourceware.org/bugzilla/show_bug.cgi?id=27777
Comment 15 Christian 2025-03-20 21:17:47 UTC
A crude attempt at reverting the previously identified breaking commit in GLIBC HEAD resolves the issue.

In libio/bits/types/struct_FILE.h, a struct is padded and there is an ominous "Make sure we don't get into trouble again" comment attached to it. I traced this back to the following change, which is the only documentation I could find:

https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=libio/libioP.h;h=97c3206c4de757a1e0b2a5030342dc2bbced61dd;hp=5fe9598c0d09cb2678028efeddd145e237c545d3;hb=1ea89a402d892b68b193e2e4390d8eb33ed686e7;hpb=dfd2257ad98eb0f6eab167e5fe5ff68ca87172e3

Either way, I think we have enough indications to support the argument that this is not a problem with libcap, but with GLIBC's internal I/O stuff, which the libcap.so test triggers with the printf() call.

I'll re-assign the Debian bug currently filed against the libcap package to the glibc package, and inform the maintainers (one of whom is a GLIBC upstream developer). We don't have a short reproducer, but the breaking commit is simple enough that I hope a trained eye might spot the problem immediately.
Comment 16 Andrew G. Morgan 2025-03-21 01:51:52 UTC
Thanks Christian. It looks like this is the libc bug:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1089636

Anyone experiencing this issue can follow the progress of that bug.
Comment 17 Christian 2025-03-22 13:37:14 UTC
Good news: Aurelien Jarno (Debian glibc maintainer) tracked down the issue.

It appears that this is a libcap bug after all, specifically in relation to the executable library feature.

Aurelien was kind enough to provide a patch. I prepared a commit against HEAD, using Aurelien's analysis as the commit message. The fix was slightly extended by me to the tests/ and pam_cap directories.

I've attached the commit, but I would only consider this a draft for now. On bare metal mips64el, everything works fine, but within a mips64el container or with unshare, psx_test/libcap_psx_test seem to loop endlessly. The latter is a problem as Debian build daemons recently switched to unshare isolation. I'll see if I can take another look tomorrow.
Comment 18 Christian 2025-03-22 13:38:27 UTC
Created attachment 307881 [details]
Draft patch fixing I/O on mips64el
Comment 19 Andrew G. Morgan 2025-03-22 17:41:18 UTC
Thanks for sharing that development. After a little investigation I have committed the following change to execable.h (it works when applied to the comment 2 test case, and it also works under QEMU):

https://web.git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=04b285680bfb45117af685eabf1675917118bdb5

BLUF: glibc has a tricky external dependency that it provides no weak default value for. I guess this is one of those subtle things that arises when trying to maintain backward compatibility alongside feature development without using weak symbol definitions to achieve it. My concern with the suggested approach is that musl isn't glibc, and we need to support that too without cross-contamination.

Related content:

  https://lists.gnu.org/archive/html/bug-glibc/2001-12/msg00203.html
  https://stackoverflow.com/a/68339111/14760867
  https://stackoverflow.com/a/75339879/14760867
Comment 20 Andrew G. Morgan 2025-03-22 17:52:21 UTC
WRT the infinite looping in containers vs HW, I'm confident this isn't exactly the same bug as this present one. We should start a new bug to track that.
Comment 21 Andrew G. Morgan 2025-03-22 19:40:15 UTC
For that issue, I've filed:

https://bugzilla.kernel.org/show_bug.cgi?id=219912
