Created attachment 307480 [details]
check log in clean chroot

Hi! When trying to package 2.73 on Arch Linux, I am running into issues with the test suite. More specifically, `TestThreadChurn` never returns (I have now tried both with a single job and with the default detected number of jobs for the make call).

```
=== RUN TestThreadChurn
psx_churn_test.go:20: [0] testing kill=false, sysc=false
psx_churn_test.go:33: [0] PASSED kill=false, sysc=false
psx_churn_test.go:20: [1] testing kill=true, sysc=false
psx_churn_test.go:33: [1] PASSED kill=true, sysc=false
psx_churn_test.go:20: [2] testing kill=false, sysc=true
```

Eventually the test is killed (see the full log in the attachment).
Thanks for the bug report. I don't see this failure on any of my systems, so I'm wondering if we can distill this down a bit to something smaller? This is what I see:

```
$ make distclean
$ cd go
$ go version
go version go1.21.3 linux/amd64
$ gcc --version
gcc (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ make PSXGOPACKAGE
[...]
$ CC="gcc" CGO_ENABLED="1" go test -c -mod=vendor kernel.org/pub/linux/libs/security/libcap/psx
$ ldd psx.test
        linux-vdso.so.1 (0x00007ffc78fb5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2d4ac57000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2d4ae43000)
$ ./psx.test -test.v
=== RUN TestErrno
--- PASS: TestErrno (0.01s)
=== RUN TestThreadChurn
psx_churn_test.go:20: [0] testing kill=false, sysc=false
psx_churn_test.go:33: [0] PASSED kill=false, sysc=false
psx_churn_test.go:20: [1] testing kill=true, sysc=false
psx_churn_test.go:33: [1] PASSED kill=true, sysc=false
psx_churn_test.go:20: [2] testing kill=false, sysc=true
psx_churn_test.go:33: [2] PASSED kill=false, sysc=true
psx_churn_test.go:20: [3] testing kill=true, sysc=true
psx_churn_test.go:33: [3] PASSED kill=true, sysc=true
--- PASS: TestThreadChurn (3.05s)
=== RUN TestSyscall3
--- PASS: TestSyscall3 (0.02s)
=== RUN TestSyscall6
--- PASS: TestSyscall6 (0.01s)
=== RUN TestShared
--- PASS: TestShared (4.44s)
PASS
ok      kernel.org/pub/linux/libs/security/libcap/psx 7.534s
```
Scratch that. I have just found a system that reproduces the hang. So far, the hang has occurred after varying numbers of `psx_churn_test.go: ...` output lines, which suggests some sort of race condition.
Mmm. It looks as if something in the Go runtime is blocking SIGSYS handling, so while the tkill has been issued, the signal remains pending (the "4" in SigPnd) for that thread. The thread in question is in the `runtime.futex` code. I need to dig in some more and see whether this is a regression or confusion on my part.

```
(gdb) info th
  Id   Target Id                                    Frame
  1    Thread 0x7ffff7da6740 (LWP 40173) "psx.test" 0x00007ffff7e90095 in __libc_open64 (file=0x69fa00 "/proc/40173/task", oflag=oflag@entry=65536) at ../sysdeps/unix/sysv/linux/open64.c:41
  2    Thread 0x7fffb12ff6c0 (LWP 40176) "psx.test" 0x00007ffff7e8490b in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
  4    Thread 0x7fffabfff6c0 (LWP 40178) "psx.test" 0x00007ffff7e8490b in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
* 5    Thread 0x7fffab7fe6c0 (LWP 40179) "psx.test" runtime.futex () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:558
  6    Thread 0x7fffaaffd6c0 (LWP 40180) "psx.test" 0x00007ffff7e8490b in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
  7    Thread 0x7fffaa7fc6c0 (LWP 40183) "psx.test" 0x00007ffff7e8490b in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
(gdb)
[1]+  Stopped                 gdb ./psx.test
$ grep Sig /proc/40179/status
SigQ:   1/255107
SigPnd: 0000000040000000
SigBlk: fffffffc7ffbfeff
SigIgn: 0000000000000000
SigCgt: fffffffd7fc1feff
```
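As an aside on reading those masks: procfs prints each signal set as a hex bitmask in which bit N-1 represents signal N, so the "4" above is bit 30, i.e. SIGSYS (signal 31). The following is a small illustrative C sketch (my own, not part of libcap) that decodes the two values from the session above; it should report SIGSYS as both pending and blocked for that thread.

```c
/* Illustrative only: decode /proc/<tid>/status style signal masks. */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* procfs convention: bit (sig - 1) of the hex mask corresponds to signal sig. */
static int sig_in_mask(const char *hexmask, int sig) {
    uint64_t mask = strtoull(hexmask, NULL, 16);
    return (int)((mask >> (sig - 1)) & 1);
}

int main(void) {
    /* Values copied from the gdb session above (LWP 40179). */
    printf("SIGSYS pending: %d\n", sig_in_mask("0000000040000000", SIGSYS));
    printf("SIGSYS blocked: %d\n", sig_in_mask("fffffffc7ffbfeff", SIGSYS));
    return 0;
}
```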
deja vu: https://github.com/golang/go/issues/42494
I've been experimenting. It looks as if I'm going to have to invest in some per-architecture assembly. The issue in this bug is that there is no opportunity to unblock SIGSYS on all the threads of the Go runtime, so when individual threads are exiting and Go blocks signals on them, the PSX mechanism can't complete on those threads and we end up in a deadlock. This is the nature of the bug that I found and worked around before. Back when libpsx operated exclusively on pthreads (libcap-2.71 and earlier), there were thread creation/deletion hooks to override this blocking, so the PSX mechanism could reach all running threads. Since libcap-2.72 we've changed the thread abstraction to LWP threads, which do not support those hooks. That change enabled support for C++ threads and for threads created from .so files loaded via dlopen: two previously reported bugs. However, I dropped the ball and forgot about the above issue with signal blocking from Go and that old bug. I now believe I can migrate to reusing the signal 32 that glibc "hides" for its setxid() fixup mechanism - Go doesn't block that one (that was the Go fix for the above bug). However, getting that to work likely involves some pretty gnarly per-architecture signal handling assembly code.
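To illustrate the shape of that deadlock, here is a hand-written C sketch (not libpsx code; the use of SIGSYS and a simple acknowledgment counter are stand-ins for the real PSX protocol): a coordinator directs a signal at a specific thread with pthread_kill() and waits for the handler to acknowledge it, but because that thread has the signal blocked, the signal stays pending and the wait never completes.

```c
/* Sketch of the deadlock mechanism described above; not libpsx code. */
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int acks;

static void handler(int sig) {
    (void)sig;
    atomic_fetch_add(&acks, 1);   /* acknowledge receipt of the signal */
}

static void *blocked_thread(void *arg) {
    (void)arg;
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGSYS);
    /* Like a Go-runtime thread on its way out, block the coordination signal. */
    pthread_sigmask(SIG_BLOCK, &set, NULL);
    pause();                      /* never returns in this sketch */
    return NULL;
}

int main(void) {
    struct sigaction sa = { .sa_handler = handler };
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSYS, &sa, NULL);

    pthread_t tid;
    pthread_create(&tid, NULL, blocked_thread, NULL);
    sleep(1);                     /* crude: give the thread time to block SIGSYS */

    pthread_kill(tid, SIGSYS);    /* the signal is now pending but never delivered */
    while (atomic_load(&acks) == 0) {
        fprintf(stderr, "still waiting for an acknowledgment...\n");
        sleep(1);                 /* a broadcast-and-wait protocol hangs here forever */
    }
    return 0;
}
```

Build with `cc -pthread`; it loops forever in the final wait, which mirrors the hang TestThreadChurn runs into.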
Just to set some expectations. To fix this, I'm going to have to restructure the PSX code to make it compile against kernel headers on different architectures and then test each of those architectures. I've been looking into using QEMU for this, but this is not my day job, so I'm not sure how quickly I will complete this. In the meantime, my recommendation for libcap/libpsx is to use libcap-2.71 if this issue affects you.
(In reply to Andrew G. Morgan from comment #6)
> Just to set some expectations. To fix this, I'm going to have to restructure
> the PSX code to make it compile against kernel headers on different
> architectures and then test each of those architectures. I've been looking
> into using QEMU for this, but this is not my day job, so I'm not sure how
> quickly I will complete this.

Thanks for the feedback and the work on this!

> In the meantime, my recommendation for libcap/libpsx is to use libcap-2.71
> if this issue affects you.

I'll open a tracking issue for this and stay on 2.71 until this is resolved.
I've been browsing for some multi-architecture details. It looks like musl has done the heavy lifting and organized things nicely (much less messy than glibc). musl provides a lot of the SA_RESTORER details in its "src/signal/.../restore.s" files.
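For a sense of what those restorer trampolines involve, here is a rough, x86_64-only C sketch (my own illustration, not libpsx or musl code): it registers a handler through the raw rt_sigaction syscall with an explicit SA_RESTORER stub that simply invokes rt_sigreturn. The struct layout, flag value, and syscall numbers are x86_64-specific; each architecture needs its own variant, which is what musl's restore.s files provide.

```c
/* x86_64-only illustration of an explicit SA_RESTORER trampoline. */
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SA_RESTORER_FLAG 0x04000000  /* kernel's SA_RESTORER; not exposed by glibc */

/* The restorer stub: just invoke rt_sigreturn (syscall 15 on x86_64). */
__asm__(
    ".pushsection .text\n"
    ".globl sketch_restorer\n"
    "sketch_restorer:\n"
    "    movq $15, %rax\n"
    "    syscall\n"
    ".popsection\n"
);
extern void sketch_restorer(void);

/* The kernel's x86_64 sigaction layout (differs from glibc's struct sigaction). */
struct ksigaction {
    void (*handler)(int);
    unsigned long flags;
    void (*restorer)(void);
    unsigned long mask;
};

static void on_sig(int sig) {
    (void)sig;
    static const char msg[] = "handled\n";
    write(2, msg, sizeof(msg) - 1);
}

int main(void) {
    struct ksigaction ksa = {
        .handler = on_sig,
        .flags = SA_RESTORER_FLAG,
        .restorer = sketch_restorer,
        .mask = 0,
    };
    /* The kernel's sigset size is 8 bytes on x86_64. */
    if (syscall(SYS_rt_sigaction, SIGUSR1, &ksa, NULL, 8) != 0) {
        perror("rt_sigaction");
        return 1;
    }
    raise(SIGUSR1);  /* handler runs, then returns via sketch_restorer */
    return 0;
}
```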
So, curiously, there is a kernel capability difference in behavior between:

Linux deb-armfh 6.1.0-30-armmp-lpae #1 SMP Debian 6.1.124-1 (2025-01-12) armv7l GNU/Linux

and my native dev system:

Linux fed-x86-64 6.12.10-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jan 17 18:05:24 UTC 2025 x86_64 GNU/Linux

Namely:

```
$ cd progs
$ cp ../go/compare-cap
$ sudo setcap =ep ./compare-cap
$ sudo ./compare-cap
```

On the arm kernel:

```
morgan@deb-armfh:~/libcap/progs$ sudo setcap all=ep ./compare-cap
morgan@deb-armfh:~/libcap/progs$ sudo ./compare-cap
2025/01/26 19:11:33 failed to be blocked from removing filecaps ("=p"): <nil>
```

On my x86-64 native system:

```
$ sudo setcap all=ep ./compare-cap
$ sudo ./compare-cap
2025/01/26 19:13:03 compare-cap success!
```

I've had to make some changes to display the "=p" in the log line, but removing a file capability is supposed not to work when the effective flag is lowered. (This is one of the progs/quicktest.sh tests.)

I guess this could be a QEMU thing. I need to gather more data, but I wonder if there is some kernel issue? (Perhaps more modern kernels have fixed it?)
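For context, the check compare-cap is performing here is, in spirit, the following C approximation (written for this comment, not the actual Go compare-cap code; the target path is a placeholder, it needs to be linked against libcap with -lcap, and it must be run with privilege, e.g. under sudo as in the transcripts): with CAP_SETFCAP removed from the effective set, the kernel is expected to refuse removal of a file capability.

```c
/* Approximation of the compare-cap check: with the effective flag lowered,
 * removing a file capability (the security.capability xattr) should fail. */
#include <errno.h>
#include <stdio.h>
#include <sys/capability.h>   /* libcap; link with -lcap */
#include <sys/xattr.h>

int main(void) {
    cap_t caps = cap_get_proc();
    cap_value_t v = CAP_SETFCAP;

    /* Drop CAP_SETFCAP from the effective set (it stays permitted). */
    cap_set_flag(caps, CAP_EFFECTIVE, 1, &v, CAP_CLEAR);
    if (cap_set_proc(caps) != 0) {
        perror("cap_set_proc");
        return 1;
    }
    cap_free(caps);

    /* On a correctly behaving kernel this removal is refused with EPERM. */
    if (removexattr("./compare-cap", "security.capability") == 0) {
        fprintf(stderr, "unexpected: file capability removal succeeded\n");
        return 1;
    }
    printf("blocked from removing filecaps (errno=%d), as expected\n", errno);
    return 0;
}
```

The QEMU arm observation above amounts to the "unexpected" branch being taken.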
The same observation is true of the i386 port:

Linux deb-i386 6.1.0-30-686-pae #1 SMP PREEMPT_DYNAMIC Debian 6.1.124-1 (2025-01-12) i686 GNU/Linux
Running the __aarch64__ support natively on an RPi5 with this kernel:

Linux arm64 6.6.62+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux

the ./compare-cap binary works as expected. I'll try the same OS on QEMU and see how that compares.
So, this is a fix supporting all four combinations of {x86, arm} x {32-bit, 64-bit}. All other Linux architectures are untested, and unlikely to "just work", but I plan to explore them if I can find a way to run them under QEMU.

https://git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=025f28ca4fe085fbcbf7933d53a42d335744e553

I've tagged this with a release candidate tag: psx/v1.2.74rc1. If anyone finding this bug wants to help with the testing, that would be great.
Just corrected the tag to psx/v1.2.74-rc1
I've just filed this bug:

https://bugzilla.kernel.org/show_bug.cgi?id=219752

It seems to be rare (I've seen it occur only once). I can't tell whether it is related to the changes tracked in the present bug, but I don't want to forget about this specific test issue.
Note in passing: trying to validate the change for mipsle under QEMU with the Debian image, I ran into this bug:

https://github.com/golang/go/issues/56426

I'm attempting to work around it by manually installing go1.24.0, since the issue appears to have been resolved in go1.20 and not in any of the go1.19 releases. The default install on the latest Debian ISO image is:

```
$ go version
go version go1.19.8 linux/mipsle
```
mipsle on QEMU shares the ./compare-cap problem.
PowerPC on QEMU shares the ./compare-cap problem.
s390x on QEMU does not appear to share the ./compare-cap problem that the other QEMU emulations do. The setup I'm using is:

```
morgan@deb-s390x:~/libcap$ uname -a
Linux deb-s390x 6.1.0-31-s390x #1 SMP Debian 6.1.128-1 (2025-02-07) s390x GNU/Linux
```
mipsle support:
https://git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=5a9f9dde6c4782dd3b4a8ac04095abb5d6f85388

ppc64 support:
https://git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=9c46e11a46bd1b28e724655a0d57355dd7d5248e

s390x support:
https://git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=c32a4d372f9a2f4e1b6561dc61fa87eb42c1dcb0
I'm looking to support riscv next, but I haven't yet found a QEMU system image to run on.
I found some Fedora-specific info for QEMU and riscv:

https://fedoraproject.org/wiki/Architectures/RISC-V/QEMU

Using this, I confirmed riscv support with:

https://git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=dfb0fc263bbc215e3bd86a412ab85effcf2c857a

I've tagged this with {cap,psx}/v1.2.74-rc5. My plan is to let this soak for a week and, if no one reports an issue, I'll cut 2.74 proper.
I've just found that the musl-gcc build (x86_64 test only) fails, so I need to do a bit more work to resolve this.
Noting this is fixed with: 20c22e64bf8d5a90ed0e884753808133bcc5798d
psx/v1.2.74 turned out to be a dud, because testing with "vendor"ed module support is subtly different from building it as a standalone package. This bug report was filed, and fixed, within a day of releasing libcap-2.74:

https://bugzilla.kernel.org/show_bug.cgi?id=219838

Long story short: libcap-2.75 is the version I consider stable.
Note: https://github.com/aquasecurity/tracee/issues/4678 reported an issue that should have been fixed by this change. However, it tripped over a different regression:

https://github.com/aquasecurity/tracee/pull/4688

so I've been looking for the cause of that regression. I found one (related to the kernel API and arm64) and addressed it with:

https://web.git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=07d8ce731d5fe9063abfef4a77306e273b18b5f3

This will be included in libcap-2.76.