Christian (Debian libcap maintainer) says that glibc-2.40 seems to be associated with this test segfaulting on powerpc64.

I have tested this vanilla version of Debian under QEMU:

morgan@deb-ppc64:~/libcap$ cat /etc/debian_version
12.9

and this one:

morgan@deb-mipsel:~/libcap/tests$ cat /etc/debian_version
12.9
morgan@deb-mipsel:~/libcap/tests$ /lib/mipsel-linux-gnu/libc.so.6
GNU C Library (Debian GLIBC 2.36-9+deb12u9) stable release version 2.36.

[so the prior glibc]

Christian provided some handy pointers for reproducing:

[1]: https://buildd.debian.org/status/fetch.php?pkg=libcap2&arch=mips64el&ver=1%3A2.75-1&stamp=1741363968&raw=0
[2]: https://web.git.kernel.org/pub/scm/libs/libcap/libcap.git/tree/libcap/execable.c#n7
[3]: I've attached a Dockerfile for such a container, which I built and ran with
     $ podman build --platform=linux/mips64le -t mips64el/debian:unstable .
     $ podman run --rm -it mips64el/debian:unstable
[4]: https://buildd.debian.org/status/fetch.php?pkg=libcap2&arch=mips64el&ver=1%3A2.75-2&stamp=1741896514&raw=0
Created attachment 307824: Docker file
Using the Docker [3] method, I can reproduce the failure:

root@f06fe0f0f6fe:/sources/libcap2-2.75/libcap# /lib/mips64el-linux-gnuabi64/libc.so.6
GNU C Library (Debian GLIBC 2.41-4) stable release version 2.41.
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 14.2.0.
libc ABIs: MIPS_PLT UNIQUE MIPS_O32_FP64 ABSOLUTE MIPS_XHASH
Minimum supported kernel: 3.2.0
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.
root@f06fe0f0f6fe:/sources/libcap2-2.75/libcap# ./libcap.so.2.75 --help
./libcap.so.2.75 is the shared library version: libcap-2.75.
See the License file for distribution information.
More information on this library is available from:
https://sites.google.com/site/fullycapable/
usage: libcap.so [--help|--usage|--summary]qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault (core dumped)
Just to add some data:

This failure was first observed on Debian's mips64el and powerpc architectures. powerpc is not an official architecture and I don't know much about it (especially in relation to the ppc64el architecture). mips64el, however, is an officially supported architecture.

Though the reproducers here are in a container, I also managed to reproduce the issue on bare metal on a mips64el porterbox that Debian provides.

I suspected glibc-2.40 to be a factor because I don't see this issue with 2.39. I've attached another Dockerfile to reproduce this.

* It creates a bookworm installation, and then upgrades glibc to 2.39
* A clone of upstream libcap.git can be found in /sources/libcap. Tests run fine after a build
* Run /upgrade-glibc-2.40.sh to install glibc-2.40, and observe the segfault appearing
* Run /upgrade-glibc-mostrecent.sh to install the most recent version in unstable (currently 2.41 + patches)
Created attachment 307826: Docker file for glibc-2.40 reproducer
What is curious in the libcap.so case is that --summary works, but --help segfaults.

gdb does not work in the Docker image so I'll have to figure out how to retarget my QEMU system image to look further.

Compiling with -g preserves the segfault behavior, so a backtrace from a segfault might be useful.

gdb ./libcap.so
run --help
bt
Interestingly, even with -g, I can't get a useful backtrace:

[...]
Program received signal SIGSEGV, Segmentation fault.
0x000000fff7f49fc8 in ?? ()
(gdb) bt
warning: GDB can't find the start of the function at 0xfff7f49fc8.
[...]
> gdb does not work in the Docker image so I'll have to figure out how to
> retarget my QEMU system image to look further.

The filesystem layout built by the Dockerfiles above can be converted to qcow2 and booted with QEMU.

The following must be appended to the Dockerfile:

RUN apt install -y linux-image-5kc-malta systemd-sysv openssh-server iproute2
RUN echo "[Match]\nName=enp*\n\n[Network]\nDHCP=yes" > /etc/systemd/network/guest.network
RUN systemctl enable systemd-networkd.service
RUN passwd -d root
RUN echo "PermitRootLogin yes" > /etc/ssh/sshd_config.d/guest.conf
RUN echo "PermitEmptyPasswords yes" >> /etc/ssh/sshd_config.d/guest.conf

Using docker or podman, a flattened filesystem layout can be extracted to a tarball:

$ docker/podman create --name temp_mips mips64el/debian:unstable
$ docker/podman container export temp_mips > /tmp/unstable-mips64el.tar
$ docker/podman rm temp_mips

Using libguestfs-tools, this tarball can be converted to qcow2, and the initrd and kernel images extracted:

$ virt-make-fs --partition=mbr --format=qcow2 --size=10G --type=ext4 /tmp/unstable-mips64el.tar /tmp/unstable-mips64el.img
$ virt-copy-out -a /tmp/unstable-mips64el.img /boot/$( virt-ls -a /tmp/unstable-mips64el.img /boot | grep vmlinuz) /tmp
$ virt-copy-out -a /tmp/unstable-mips64el.img /boot/$( virt-ls -a /tmp/unstable-mips64el.img /boot | grep initrd) /tmp

I could boot this with

$ SSHPORT=20022
$ qemu-system-mips64el -machine malta -m 2048 -cpu 5KEc -display none -vga none -nographic -kernel /tmp/vmlinuz* -initrd /tmp/initrd.img-* -hda /tmp/unstable-mips64el.img -append 'root=/dev/sda1 rw' -nic user,model=virtio,hostfwd=tcp:127.0.0.1:$SSHPORT-:22

localhost:$SSHPORT is forwarded to guest port :22, where an instance of OpenSSH will be listening. The root account has an empty password.
No idea if this is important, but logging these here in case I come back to them:

https://lore.kernel.org/lkml/20250225-nolibc-mips-n32-v2-2-664b47d87fa0@weissschuh.net/
https://lore.kernel.org/lkml/20250316-nolibc-sp-align-v1-1-1e1fb073ca1e@weissschuh.net/
I upgraded my existing QEMU system image to "sid". This has glibc 2.41-6.

I doubt the issue is the stack pointer. Placing a breakpoint on __so_start, I find the following registers (sp is 16 byte aligned):

(gdb) info reg
      zero             at               v0               v1
 R0   0000000000000000 fffffffffffffff0 000000fff7fc0a50 0000000000000000
      a0               a1               a2               a3
 R4   000000fff7ff7000 000000fff7ff7768 000000ffffffcb48 000000ffffffcb60
      a4               a5               a6               a7
 R8   000000fff7fa14d0 000000fff7fa97e0 000000ffffffcd90 900000000a2cfe00
      t0               t1               t2               t3
 R12  0000000000000000 ffffffff84080018 ffffffff81bd1d40 0000000000000001
      s0               s1               s2               s3
 R16  000000fff8003020 000000aaaaaa9760 000000fff8003020 000000aaaac387a0
      s4               s5               s6               s7
 R20  000000aaaac363b0 000000fff7ffae50 0000000000000000 000000aaaac1cc04
      t8               t9               k0               k1
 R24  0000000000000000 000000aaaaaa9760 0000000000000000 0000000000000000
      gp               sp               s8               ra
 R28  000000aaaaac2840 000000ffffffcac0 0000000000000000 000000fff7fd7b0c
      status           lo               hi               badvaddr
      000000000000a4f3 000000000000003a 000000000000001f 000000fff7e75560
      cause            pc
      0000000010800024 000000aaaaaa9774
      fcsr             fir              restart
      00000000         00f30000         0000000000000000
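The alignment claim can be checked directly against the sp value in the dump above. A minimal sketch in Python; the helper name is_aligned is made up for illustration and is not part of libcap or glibc:

```python
# sp as reported by gdb at the __so_start breakpoint (copied from the dump).
SP_AT_SO_START = 0x000000FFFFFFCAC0


def is_aligned(address: int, alignment: int = 16) -> bool:
    """Return True if address is a multiple of alignment (a power of two)."""
    return address & (alignment - 1) == 0


print(hex(SP_AT_SO_START), "16-byte aligned:", is_aligned(SP_AT_SO_START))
```

The low four bits of 0x...cac0 are zero, which is what 16 byte alignment means here.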
Just after the segfault:

(gdb) info reg
      zero             at               v0               v1
 R0   0000000000000000 ffffffff81e970dc 0000000000000001 0000000000000000
      a0               a1               a2               a3
 R4   000000fff7fa1748 0000000000000000 0000000000000000 fffffffffbad2884
      a4               a5               a6               a7
 R8   0000000000000000 0000000000000000 2d2d7c706c65682d 2d2d7c6567617375
      t0               t1               t2               t3
 R12  0000000000000000 ffffffff84080018 ffffffff81e974f8 0000000000000003
      s0               s1               s2               s3
 R16  000000fff7fa1748 000000000000000a ffffffffffffffff 000000fff7fa16a0
      s4               s5               s6               s7
 R20  000000fff7f9fce0 0000000000000001 0000000000000000 000000aaaaab0000
      t8               t9               k0               k1
 R24  0000000000000000 000000fff7f49a98 0000000000000000 0000000000000000
      gp               sp               s8               ra
 R28  000000fff7fa97e0 000000ffffffcb70 0000000000000000 000000fff7f4a0d0
      status           lo               hi               badvaddr
      000000000000a4f3 0000000000000003 0000000000000000 0000000000000000
      cause            pc
      000000001080000c 000000fff7f49fc8
      fcsr             fir              restart
      00000000         00f30000         0000000000000000
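As an aside, the a6 and a7 values in this dump look like text rather than pointers. A small Python sketch that reads them as little-endian bytes (mips64el is little-endian); the function name register_to_ascii is invented for this note, and treating these registers as string data is my interpretation, not something gdb reports:

```python
# a6 and a7 as shown in the register dump taken just after the segfault.
A6 = 0x2D2D7C706C65682D
A7 = 0x2D2D7C6567617375


def register_to_ascii(value: int, width: int = 8) -> str:
    """Render a register value as the bytes it would occupy in little-endian memory."""
    raw = value.to_bytes(width, byteorder="little")
    # Replace non-printable bytes with '.' so the output stays readable.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


print(register_to_ascii(A6) + register_to_ascii(A7))
```

The decoded bytes match a fragment of the "[--help|--usage|--summary]" usage text, which is consistent with the crash happening while libc is handling that string.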
Exploring some more. A separate invocation. Whatever the problem code is, it is in libc:

(gdb) info reg
      zero             at               v0               v1
 R0   0000000000000000 ffffffff81e970dc 0000000000000001 0000000000000000
      a0               a1               a2               a3
 R4   000000fff7fa1748 0000000000000000 0000000000000000 fffffffffbad2884
      a4               a5               a6               a7
 R8   0000000000000000 0000000000000000 6173752d2d7c706c 6d75732d2d7c6567
      t0               t1               t2               t3
 R12  0000000000000000 ffffffff84080018 ffffffff81e974f8 0000000000000003
      s0               s1               s2               s3
 R16  000000fff7fa1748 000000000000000a ffffffffffffffff 000000fff7fa16a0
      s4               s5               s6               s7
 R20  000000fff7f9fce0 0000000000000001 0000000000000000 000000aaaaab0000
      t8               t9               k0               k1
 R24  0000000000000005 000000fff7f49a98 0000000000000000 0000000000000000
      gp               sp               s8               ra
 R28  000000fff7fa97e0 000000ffffffcb70 0000000000000000 000000fff7f4a0d0
      status           lo               hi               badvaddr
      000000000000a4f3 0000000000000003 0000000000000000 0000000000000000
      cause            pc
      000000001080000c 000000fff7f49fc8
      fcsr             fir              restart
      00000000         00f30000         0000000000000000

morgan@deb-mips64el-hack:~/libcap/libcap$ cat /proc/26633/maps
aaaaaa0000-aaaaaab000 r-xp 00000000 fe:01 130763 /home/morgan/libcap/libcap/libcap.so.2.75
aaaaabb000-aaaaabc000 rw-p 0000b000 fe:01 130763 /home/morgan/libcap/libcap/libcap.so.2.75
aaaaabc000-aaaaadd000 rw-p 00000000 00:00 0 [heap]
fff7d90000-fff7f8b000 r-xp 00000000 fe:01 530231 /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7f8b000-fff7f9a000 ---p 001fb000 fe:01 530231 /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7f9a000-fff7fa0000 r--p 0020a000 fe:01 530231 /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7fa0000-fff7fa5000 rw-p 00210000 fe:01 530231 /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7fa5000-fff7fb2000 rw-p 00000000 00:00 0
fff7fbb000-fff7fe9000 r-xp 00000000 fe:01 530228 /usr/lib/mips64el-linux-gnuabi64/ld.so.1
fff7fef000-fff7ff1000 rw-p 00000000 00:00 0
fff7ff7000-fff7ff9000 rw-p 00000000 00:00 0
fff7ff9000-fff7ffb000 r--p 0002e000 fe:01 530228 /usr/lib/mips64el-linux-gnuabi64/ld.so.1
fff7ffb000-fff7ffd000 rw-p 00030000 fe:01 530228 /usr/lib/mips64el-linux-gnuabi64/ld.so.1
fffffdc000-ffffffd000 rwxp 00000000 00:00 0 [stack]
ffffffd000-ffffffe000 r-xp 00000000 00:00 0
ffffffe000-fffffff000 r--p 00000000 00:00 0 [vvar]
fffffff000-10000000000 r-xp 00000000 00:00 0 [vdso]
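The "it is in libc" conclusion follows from looking up the faulting pc in the maps listing. A rough Python sketch of that lookup, using only the three executable file mappings from the listing above; the function find_mapping is a made-up helper, not an existing tool:

```python
# Executable file mappings copied from the /proc/<pid>/maps listing above.
MAPS = """\
aaaaaa0000-aaaaaab000 r-xp 00000000 fe:01 130763 /home/morgan/libcap/libcap/libcap.so.2.75
fff7d90000-fff7f8b000 r-xp 00000000 fe:01 530231 /usr/lib/mips64el-linux-gnuabi64/libc.so.6
fff7fbb000-fff7fe9000 r-xp 00000000 fe:01 530228 /usr/lib/mips64el-linux-gnuabi64/ld.so.1
"""

PC_AT_SEGFAULT = 0xFFF7F49FC8  # pc from the register dump
RA_AT_SEGFAULT = 0xFFF7F4A0D0  # ra from the same dump


def find_mapping(maps_text, address):
    """Return (path, file_offset) for the mapping containing address, or None."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a well-formed maps line
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= address < end:
            path = fields[5] if len(fields) > 5 else ""
            # Mapping's file offset plus the distance into the mapping.
            return path, int(fields[2], 16) + (address - start)
    return None


for name, addr in (("pc", PC_AT_SEGFAULT), ("ra", RA_AT_SEGFAULT)):
    path, offset = find_mapping(MAPS, addr)
    print(f"{name}: {path} + {offset:#x}")
```

Both pc and ra fall inside the r-xp mapping of libc.so.6, and the resulting offsets could in principle be fed to a disassembler to see which libc function is involved.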
So, I think a problem with breakpoints in libc is that the breakpoint lands on an ephemeral trampoline of some sort that gets rewritten the first time the function is called. For example, I put a breakpoint on printf, and it hits once, which then redirects to __dl_runtime_resolve(), but that breakpoint is not hit again.
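To illustrate the hypothesis (and only the hypothesis), here is a toy Python model of a lazily bound call slot: the first call goes through a resolver that rewrites the slot, and later calls bypass it. The class LazySlot and its fields are invented for this sketch; it does not model the real MIPS stubs, the GOT, or how gdb places breakpoints.

```python
class LazySlot:
    """Toy stand-in for one lazily bound call slot (not real loader code)."""

    def __init__(self, name, symbols):
        self.name = name
        self.symbols = symbols          # plays the role of the symbol lookup scope
        self.resolver_calls = 0
        self.target = self._resolve_then_call  # slot initially points at the resolver

    def _resolve_then_call(self, *args):
        # First call only: look the symbol up and overwrite the slot.
        self.resolver_calls += 1
        self.target = self.symbols[self.name]
        return self.target(*args)

    def __call__(self, *args):
        return self.target(*args)


slot = LazySlot("puts", {"puts": lambda text: len(text)})
slot("first call")
slot("second call")
print("resolver ran", slot.resolver_calls, "time(s)")
```

In this model anything attached to the initial slot contents is only ever reached once, which matches the observed behavior of the breakpoint.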
Putting a breakpoint on _dl_lookup_symbol_x() shows each symbol getting resolved. The last one resolved before the segfault is "puts". Curiously, a printf completes without resolving that, but in the 2nd invocation of printf(), puts() is called.

Somewhere before the segfault:

(gdb) bt
#0 0x000000fff7fc6684 in do_lookup_x (undef_name=0xaaaaaa1d00 "puts", new_hash=<optimized out>, old_hash=0xffffffca98, ref=0xaaaaaa1448, result=0xffffffca88, scope=<optimized out>, i=0, version=0xfff7ff87f8, flags=0, skip=<optimized out>, type_class=1, undef_map=<optimized out>) at dl-lookup.c:418
#1 0x000000fff7fc7110 in _dl_lookup_symbol_x (undef_name=0xaaaaaa1d00 "puts", undef_map=<optimized out>, ref=0xffffffcb60, symbol_scope=<optimized out>, version=0xfff7ff87f8, type_class=<optimized out>, flags=0, skip_map=0x0) at dl-lookup.c:779
#2 0x000000fff7fd15f8 in __dl_runtime_resolve (sym_index=<optimized out>, return_address=<optimized out>, old_gpreg=<optimized out>, stub_pc=<optimized out>) at ../sysdeps/mips/dl-trampoline.c:182
#3 0x000000fff7fd136c in _dl_runtime_resolve () from /lib64/ld.so.1
I bisected GLIBC and this appears to be the offending commit:

https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=2a99e2398d9d717c034e915f7846a49e623f5450;hp=a81cdde1cb9d514fc8f014ddf21771c96ff2c182

I did not investigate further.

This process was quite straightforward: I just installed the build dependencies, and ran tests against the build result with this recipe:

https://sourceware.org/glibc/wiki/Testing/Builds#Compile_normally.2C_run_under_new_glibc

Just for reference, this is the bug that was fixed by this commit:

https://sourceware.org/bugzilla/show_bug.cgi?id=27777
A crude attempt at reverting the previously identified breaking commit in GLIBC HEAD resolves the issue.

In libio/bits/types/struct_FILE.h, a struct is padded and there is an ominous "Make sure we don't get into trouble again" comment attached to it. I traced this back to the following change, which is the only documentation I could find:

https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=libio/libioP.h;h=97c3206c4de757a1e0b2a5030342dc2bbced61dd;hp=5fe9598c0d09cb2678028efeddd145e237c545d3;hb=1ea89a402d892b68b193e2e4390d8eb33ed686e7;hpb=dfd2257ad98eb0f6eab167e5fe5ff68ca87172e3

Either way, I think we have enough indications to support the argument that this is not a problem with libcap, but with GLIBC's internal I/O code, which the libcap.so test triggers with the printf() call.

I'll re-assign the Debian bug currently filed against the libcap package to the glibc package, and inform the maintainers (one of whom is a GLIBC upstream developer). We don't have a short reproducer, but the breaking commit is simple enough that I hope a trained eye might spot the problem immediately.
Thanks Christian. It looks like this is the libc bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1089636 Anyone experiencing this issue can follow the progress of that bug.
Good news: Aurelien Jarno (Debian glibc maintainer) tracked down the issue. It appears that this is a libcap bug after all, specifically in relation to the executable library feature.

Aurelien was kind enough to provide a patch. I prepared a commit against HEAD, using Aurelien's analysis as the commit message. I slightly extended the fix to cover the tests/ and pam_cap directories.

I've attached the commit but I would only consider this a draft for now. On bare metal mips64el, everything works fine, but within a mips64el container or with unshare, psx_test/libcap_psx_test seem to loop endlessly. The latter is a problem as Debian build daemons recently switched to unshare isolation.

I'll see if I can take another look tomorrow.
Created attachment 307881: Draft patch fixing I/O on mips64el
Thanks for sharing that development. After a little investigation I have committed the following change to execable.h (it works when applied to the comment2 test case, and it also works under QEMU):

https://web.git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=04b285680bfb45117af685eabf1675917118bdb5

BLUF: glibc has a tricky external dependency that it provides no weak default value for. I guess this is one of those subtle things that arises when trying to maintain backward compatibility alongside feature development without using weak symbol definitions to achieve it.

As for the suggested approach, I'm concerned that musl isn't glibc and we need to support that too without cross contamination.

Related content:

https://lists.gnu.org/archive/html/bug-glibc/2001-12/msg00203.html
https://stackoverflow.com/a/68339111/14760867
https://stackoverflow.com/a/75339879/14760867
WRT the infinite looping in containers vs HW, I'm confident this isn't exactly the same bug as this present one. We should start a new bug to track that.
For that issue, I've filed: https://bugzilla.kernel.org/show_bug.cgi?id=219912