Bug 219687 - libcap 2.73: TestThreadChurn never returns
Status: RESOLVED CODE_FIX
Alias: None
Product: Tools
Classification: Unclassified
Component: libcap
Hardware: All Linux
Importance: P3 normal
Assignee: Andrew G. Morgan
 
Reported: 2025-01-12 14:59 UTC by David Runge
Modified: 2025-03-29 21:54 UTC

Regression: No

Attachments
check log in clean chroot (43.26 KB, text/plain)
2025-01-12 14:59 UTC, David Runge

Description David Runge 2025-01-12 14:59:59 UTC
Created attachment 307480
check log in clean chroot

Hi! When trying to package 2.73 on Arch Linux I am running into issues with the test suite.

More specifically, `TestThreadChurn` never returns (I have now tried both with a single job and with the default detected number of jobs for the make call).

```
=== RUN   TestThreadChurn
    psx_churn_test.go:20: [0] testing kill=false, sysc=false
    psx_churn_test.go:33: [0] PASSED kill=false, sysc=false
    psx_churn_test.go:20: [1] testing kill=true, sysc=false
    psx_churn_test.go:33: [1] PASSED kill=true, sysc=false
    psx_churn_test.go:20: [2] testing kill=false, sysc=true
```

Eventually the test is killed (see full logs in attachment).
Comment 1 Andrew G. Morgan 2025-01-12 16:18:38 UTC
Thanks for the bug report. I don't see this failure on any of my systems, so I'm wondering if we can distill this down a bit to something smaller?

This is what I see:

```
$ make distclean
$ cd go
$ go version
go version go1.21.3 linux/amd64
$ gcc --version
gcc (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ make PSXGOPACKAGE
[...]
$ CC="gcc" CGO_ENABLED="1" go test -c -mod=vendor kernel.org/pub/linux/libs/security/libcap/psx
$ ldd psx.test
        linux-vdso.so.1 (0x00007ffc78fb5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2d4ac57000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2d4ae43000)
$ ./psx.test -test.v
=== RUN   TestErrno
--- PASS: TestErrno (0.01s)
=== RUN   TestThreadChurn
    psx_churn_test.go:20: [0] testing kill=false, sysc=false
    psx_churn_test.go:33: [0] PASSED kill=false, sysc=false
    psx_churn_test.go:20: [1] testing kill=true, sysc=false
    psx_churn_test.go:33: [1] PASSED kill=true, sysc=false
    psx_churn_test.go:20: [2] testing kill=false, sysc=true
    psx_churn_test.go:33: [2] PASSED kill=false, sysc=true
    psx_churn_test.go:20: [3] testing kill=true, sysc=true
    psx_churn_test.go:33: [3] PASSED kill=true, sysc=true
--- PASS: TestThreadChurn (3.05s)
=== RUN   TestSyscall3
--- PASS: TestSyscall3 (0.02s)
=== RUN   TestSyscall6
--- PASS: TestSyscall6 (0.01s)
=== RUN   TestShared
--- PASS: TestShared (4.44s)
PASS
ok      kernel.org/pub/linux/libs/security/libcap/psx   7.534s
```
Comment 2 Andrew G. Morgan 2025-01-12 16:43:40 UTC
Scratch that. I have just found a system that reproduces a hang. So far, I've seen different numbers of `psx_churn_test.go: ...` lines of output. So, this suggests some sort of race condition.
Comment 3 Andrew G. Morgan 2025-01-14 06:18:16 UTC
Mmm. It looks as if something in the Go runtime is blocking the SIGSYS handling, so while the tkill has been issued, the signal remains pending for that thread (the "4" in SigPnd). The thread in question is in the `runtime.futex` code. I need to dig in some more and see if this is a regression, or confusion on my part.

```
(gdb) info th
  Id   Target Id                                    Frame 
  1    Thread 0x7ffff7da6740 (LWP 40173) "psx.test" 0x00007ffff7e90095 in __libc_open64 (file=0x69fa00 "/proc/40173/task", 
    oflag=oflag@entry=65536) at ../sysdeps/unix/sysv/linux/open64.c:41
  2    Thread 0x7fffb12ff6c0 (LWP 40176) "psx.test" 0x00007ffff7e8490b in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
  4    Thread 0x7fffabfff6c0 (LWP 40178) "psx.test" 0x00007ffff7e8490b in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
* 5    Thread 0x7fffab7fe6c0 (LWP 40179) "psx.test" runtime.futex () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:558
  6    Thread 0x7fffaaffd6c0 (LWP 40180) "psx.test" 0x00007ffff7e8490b in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
  7    Thread 0x7fffaa7fc6c0 (LWP 40183) "psx.test" 0x00007ffff7e8490b in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
(gdb) 
[1]+  Stopped                 gdb ./psx.test
$ grep Sig /proc/40179/status 
SigQ:   1/255107
SigPnd: 0000000040000000
SigBlk: fffffffc7ffbfeff
SigIgn: 0000000000000000
SigCgt: fffffffd7fc1feff
```
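
For readers who want to check the masks above, here is a minimal Go sketch of how the `SigPnd` and `SigBlk` values can be decoded. It assumes the usual /proc convention that bit n-1 of the mask (counting from the least significant bit) stands for signal n, and that SIGSYS is signal 31, as it is on x86-64 Linux. The helper names `signalsInMask` and `contains` are made up for this illustration and are not part of libcap or libpsx.

```go
package main

import (
	"fmt"
	"strconv"
)

// signalsInMask lists the signal numbers set in a hexadecimal mask as
// printed on the SigPnd/SigBlk lines of /proc/<pid>/status.
// Bit n-1 (counting from the least significant bit) stands for signal n.
// (Hypothetical helper, for illustration only.)
func signalsInMask(hexMask string) ([]int, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return nil, err
	}
	var sigs []int
	for sig := 1; sig <= 64; sig++ {
		if mask&(uint64(1)<<uint(sig-1)) != 0 {
			sigs = append(sigs, sig)
		}
	}
	return sigs, nil
}

// contains reports whether sig appears in sigs.
func contains(sigs []int, sig int) bool {
	for _, s := range sigs {
		if s == sig {
			return true
		}
	}
	return false
}

func main() {
	// Values copied from the /proc/40179/status output above.
	pending, err := signalsInMask("0000000040000000")
	if err != nil {
		panic(err)
	}
	blocked, err := signalsInMask("fffffffc7ffbfeff")
	if err != nil {
		panic(err)
	}
	fmt.Println("pending:", pending)
	fmt.Println("signal 31 blocked:", contains(blocked, 31))
	fmt.Println("signal 32 blocked:", contains(blocked, 32))
}
```

With the values from this trace, the sketch reports signal 31 as the only pending signal, and shows that signal 31 is blocked on the thread while signal 32 is not.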
Comment 4 Andrew G. Morgan 2025-01-14 06:28:21 UTC
deja vu:

  https://github.com/golang/go/issues/42494
Comment 5 Andrew G. Morgan 2025-01-18 06:47:52 UTC
I've been experimenting. It looks as if I'm going to have to invest in some per-architecture assembly.

The issue in this bug is that there is no opportunity to unblock the SIGSYS signal on all the threads of the Go runtime, so when individual threads are exiting and Go blocks signals on them, the PSX mechanism can't complete on those threads and we get into a deadlock. This is the nature of the bug that I found and worked around before.

Back when libpsx operated exclusively on pthreads (libcap-2.71 and earlier), there were some creation/deletion hooks to override this blocking, so the PSX mechanism could reach all running threads. Since libcap-2.72 we've changed the thread abstraction to be LWP threads, which do not support those hooks. This enabled support for C++ threads and also for threads created from .so files loaded via dlopen: two reported bugs.

However, I dropped the ball and forgot about the above issue with thread blocking from Go and that old bug. I now believe I can migrate to reusing the signal 32 that glibc "hides" for its setxid() fixup mechanism - Go doesn't block that one (the Go fix for the above bug). However, getting that to work likely involves some pretty gnarly per-architecture signal handling assembly code.
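
One way to observe the condition described above on a live process is to walk /proc/<pid>/task and check each thread's `SigBlk` line. The following is a diagnostic sketch only, not how libpsx itself works. It assumes Linux with /proc mounted, Go 1.16 or later (for `os.ReadFile`), and signal 31 for SIGSYS as on x86-64. The function name `sigBlocked` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// sigBlocked reports whether signal sig is set in the SigBlk line of the
// text of a /proc/<pid>/task/<tid>/status file.
// (Hypothetical helper, for illustration only.)
func sigBlocked(status string, sig int) (bool, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "SigBlk:" {
			mask, err := strconv.ParseUint(fields[1], 16, 64)
			if err != nil {
				return false, err
			}
			return mask&(uint64(1)<<uint(sig-1)) != 0, nil
		}
	}
	return false, fmt.Errorf("no SigBlk line found")
}

func main() {
	const sig = 31 // SIGSYS on x86-64 Linux (assumption; differs on some architectures)
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1] // optional: PID of the process to inspect
	}
	paths, err := filepath.Glob(filepath.Join("/proc", pid, "task", "*", "status"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the thread may have exited since the glob
		}
		blocked, err := sigBlocked(string(data), sig)
		if err != nil {
			continue
		}
		tid := filepath.Base(filepath.Dir(p))
		fmt.Printf("tid %s: signal %d blocked=%v\n", tid, sig, blocked)
	}
}
```

Run with the PID of a hung `psx.test` as the only argument, this lists every thread and whether it currently blocks the signal. A thread that blocks the signal and never unblocks it cannot run the handler, which is the deadlock described in this comment.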
Comment 6 Andrew G. Morgan 2025-01-19 18:58:31 UTC
Just to set some expectations. To fix this, I'm going to have to restructure the PSX code to make it compile against kernel headers on different architectures and then test each of those architectures. I've been looking into using QEMU for this, but this is not my day job, so I'm not sure how quickly I will complete this.

In the meantime, my recommendation for libcap/libpsx is to use libcap-2.71 if this issue affects you.
Comment 7 David Runge 2025-01-21 18:55:58 UTC
(In reply to Andrew G. Morgan from comment #6)
> Just to set some expectations. To fix this, I'm going to have to restructure
> the PSX code to make it compile against kernel headers on different
> architectures and then test each of those architectures. I've been looking
> into using QEMU for this, but this is not my day job, so I'm not sure how
> quickly I will complete this.

Thanks for the feedback and the work on this!

> In the meantime, my recommendation for libcap/libpsx is to use libcap-2.71
> if this issue affects you.

I'll open a tracking issue for this and wait on 2.71 until this is resolved.
Comment 8 Andrew G. Morgan 2025-01-22 05:54:47 UTC
Browsing for some multi-architecture details. It looks like musl has done the
heavy lifting and organized things nicely (much less messy than glibc).
Musl provides a lot of SA_RESTORER details in its "src/signal/.../restore.s"
files.
Comment 9 Andrew G. Morgan 2025-01-27 03:16:25 UTC
So, curiously, there is a kernel capability difference in behavior between:

Linux deb-armfh 6.1.0-30-armmp-lpae #1 SMP Debian 6.1.124-1 (2025-01-12) armv7l GNU/Linux

And my native dev system:

Linux fed-x86-64 6.12.10-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jan 17 18:05:24 UTC 2025 x86_64 GNU/Linux

Namely:

```
$ cd progs
$ cp ../go/compare-cap .
$ sudo setcap =ep ./compare-cap
$ sudo ./compare-cap
```

On the arm kernel:

```
morgan@deb-armfh:~/libcap/progs$ sudo setcap all=ep ./compare-cap 
morgan@deb-armfh:~/libcap/progs$ sudo ./compare-cap 
2025/01/26 19:11:33 failed to be blocked from removing filecaps ("=p"): <nil>
```

On my x86-64 native system:

```
$ sudo setcap all=ep ./compare-cap
$ sudo ./compare-cap
2025/01/26 19:13:03 compare-cap success!
```

I've had to make some changes to display the "=p" in the log line, but this removal of a file capability is supposed not to work if the effective flag is lowered. (This is one of the progs/quicktest.sh tests.) I guess this could be a QEMU thing. I need to gather more data, but I wonder if there is some kernel issue? (Perhaps more modern kernels have fixed it?)
Comment 10 Andrew G. Morgan 2025-01-27 05:01:01 UTC
Same observation true of the i386 port:

Linux deb-i386 6.1.0-30-686-pae #1 SMP PREEMPT_DYNAMIC Debian 6.1.124-1 (2025-01-12) i686 GNU/Linux
Comment 11 Andrew G. Morgan 2025-02-01 16:54:40 UTC
Running __aarch64__ support natively on an RPi5 with this kernel:

Linux arm64 6.6.62+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux

the ./compare-cap binary works as expected. I'll try the same OS on QEMU and see how that compares.
Comment 12 Andrew G. Morgan 2025-02-01 17:06:38 UTC
So this is a fix supporting all four combinations of {x86, arm} x {32bit, 64bit}. All other Linux architectures are untested, and unlikely to "just work", but I plan to explore the others if I can find a QEMU way to run them.

https://git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=025f28ca4fe085fbcbf7933d53a42d335744e553

I've tagged this with a release candidate tag: psx/v1.2.74rc1 . If anyone finding this bug wants to help with the testing, that would be great.
Comment 13 Andrew G. Morgan 2025-02-05 05:41:35 UTC
Just corrected the tag to psx/v1.2.74-rc1
Comment 14 Andrew G. Morgan 2025-02-06 06:02:28 UTC
I've just filed this bug: https://bugzilla.kernel.org/show_bug.cgi?id=219752 . It seems to be rare (I've seen it occur only once). I can't tell if it is related to the changes tracked in this present bug or not, but I don't want to forget about this specific test issue.
Comment 15 Andrew G. Morgan 2025-02-15 17:50:29 UTC
Note in passing: while trying to validate the change for mipsle under QEMU, with the Debian image, I ran into this bug:

   https://github.com/golang/go/issues/56426

The default install on the latest Debian ISO image is
```
$ go version
go version go1.19.8 linux/mipsle
```
That bug appears to have been resolved in go1.20 and not in any of the go1.19 releases, so I'm attempting to work around it by manually installing go1.24.0.
Comment 16 Andrew G. Morgan 2025-02-15 22:21:21 UTC
mipsle on QEMU shares the ./compare-cap problem.
Comment 17 Andrew G. Morgan 2025-02-17 00:15:19 UTC
PowerPC on QEMU shares the ./compare-cap problem.
Comment 18 Andrew G. Morgan 2025-02-18 06:10:22 UTC
s390x on QEMU does not appear to share the ./compare-cap problem that the other QEMU emulations do.

The setup I'm using is:

morgan@deb-s390x:~/libcap$ uname -a
Linux deb-s390x 6.1.0-31-s390x #1 SMP Debian 6.1.128-1 (2025-02-07) s390x GNU/Linux
Comment 20 Andrew G. Morgan 2025-02-18 06:19:42 UTC
I'm looking to support riscv next, but I haven't yet found a QEMU system image to run on.
Comment 21 Andrew G. Morgan 2025-02-22 20:08:29 UTC
I found some Fedora-specific info for QEMU and riscv:

   https://fedoraproject.org/wiki/Architectures/RISC-V/QEMU

using this, I confirmed riscv support with:

https://git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=dfb0fc263bbc215e3bd86a412ab85effcf2c857a

I've tagged this with {cap,psx}/v1.2.74-rc5 . My plan is to let this soak for a week and if no one reports an issue, I'll cut 2.74 proper.
Comment 22 Andrew G. Morgan 2025-03-02 17:09:49 UTC
I've just found that the musl-gcc (x86_64 test only) build fails, so I need to do a bit more work to resolve this.
Comment 23 Andrew G. Morgan 2025-03-03 00:26:19 UTC
Noting this is fixed with: 20c22e64bf8d5a90ed0e884753808133bcc5798d
Comment 24 Andrew G. Morgan 2025-03-05 15:04:21 UTC
psx/v1.2.74 turned out to be a dud, because testing with "vendor"ed module support is subtly different from building it as a standalone package. This bug report was filed and fixed within a day of releasing libcap-2.74:

  https://bugzilla.kernel.org/show_bug.cgi?id=219838

Long story short, libcap-2.75 is the version I consider stable.
Comment 25 Andrew G. Morgan 2025-03-29 21:54:25 UTC
Note:

  https://github.com/aquasecurity/tracee/issues/4678

reported an issue that should have been fixed by this change. However, it tripped up a different regression:

  https://github.com/aquasecurity/tracee/pull/4688

so I've been looking for a reason for that regression. I found one (related to the kernel API and arm64) and addressed that with:

  https://web.git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=07d8ce731d5fe9063abfef4a77306e273b18b5f3

This will be included in libcap-2.76.
