Bug 208477 - quicktest.sh fail on musl libc, Alpine Linux distribution
Summary: quicktest.sh fail on musl libc, Alpine Linux distribution
Status: RESOLVED CODE_FIX
Alias: None
Product: Tools
Classification: Unclassified
Component: libcap
Hardware: All Linux
Importance: P2 normal
Assignee: Andrew G. Morgan
Reported: 2020-07-06 16:15 UTC by Milan P. Stanić
Modified: 2020-07-12 15:41 UTC

Kernel Version: 5.4.34
Regression: No

Attachments
build log libcap 2.37 (14.48 KB, text/plain)
2020-07-06 16:15 UTC, Milan P. Stanić

Description Milan P. Stanić 2020-07-06 16:15:48 UTC
Created attachment 290147
build log libcap 2.37

I'm trying to build libcap 2.37 on Alpine Linux, which is based on musl libc.

Running 'cd progs && LD_LIBRARY_PATH=../libcap ./quicktest.sh' in the APKBUILD (Alpine build script) fails with the following message:

EXPECT SUCCESS: TEST: ./capsh --inh=cap_chown --mode=PURE1E --print --inmode=PURE1E

The build log is attached.

TIA
Comment 1 Andrew G. Morgan 2020-07-07 01:19:44 UTC
Can you do this manually and post the output?

sudo ./capsh --print --inh=cap_chown --print --mode=PURE1E --print --inmode=PURE1E

I suspect that this command actually works fine when run as root.

I suspect something about fakeroot is not letting the test work correctly. I've not used fakeroot much, but my recollection is that it likes to say it has done things, to let things progress, when it is not actually doing them. So, when a test like this checks the details, it calls fakeroot's bluff.

The quicktest.sh is very aggressive about exploring lots of subtle privilege things under Linux. This particular one is validating that some prctl calls did exactly what was intended.
Comment 2 Andrew G. Morgan 2020-07-07 01:44:19 UTC
I've added a link to https://bugzilla.kernel.org/show_bug.cgi?id=206539. In that case, I think fakeroot was being used to build things, but I don't recall that they were trying to get quicktest.sh to work.

I think you should be able to run make all and make test, but not make sudotest in this environment.
Comment 3 Milan P. Stanić 2020-07-07 19:14:32 UTC
(In reply to Andrew G. Morgan from comment #1)
> Can you do this manually and post the output?
> 
> sudo ./capsh --print --inh=cap_chown --print --mode=PURE1E --print
> --inmode=PURE1E


With sudo it doesn't work, but running it as root without sudo works.

I ran it with:
LD_LIBRARY_PATH=../libcap ./capsh --print --inh=cap_chown --print --mode=PURE1E --print --inmode=PURE1E > libcap-capsh.log 2>&1

Sorry, I don't know how to attach a log file to a comment here (this is my first bug report on bugzilla.kernel.org), but it succeeded. Here are the last few lines.
```
Securebits: 0357/0xef/8'b11101111
 secure-noroot: yes (locked)
 secure-no-suid-fixup: yes (locked)
 secure-keep-caps: no (locked)
 secure-no-ambient-raise: yes (locked)
uid=0(root) euid=0(root)
gid=0(root)
groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
Guessed mode: PURE1E (3)
```

> I'm suspicious that this actual command works fine when run as root.

You are right, it works when run as root.

> I suspect something about fakeroot is not letting the test work correctly.
> I've not used fakeroot much, but my recollection is it likes to say it has
> done things to make things progress when it is not actually doing them. So,
> when a test like this checks the details, it calls fakeroot's bluff.

Yes, Alpine builders use fakeroot, sorry I forgot to mention that.

> The quicktest.sh is very aggressive about exploring lots of subtle privilege
> things under linux. This particular one is validating some prctl calls did
> exactly what was intended.
Comment 4 Milan P. Stanić 2020-07-07 19:26:25 UTC
(In reply to Andrew G. Morgan from comment #2)
> I've added a link to https://bugzilla.kernel.org/show_bug.cgi?id=206539. In
> that case, I think fakeroot was being used to build things, but I don't
> recall that they were trying to get quicktest.sh to work.
> 
> I think you should be able to run make all and make test, but not make
> sudotest in this environment.

With `make ; make test` I get a 'Segmentation fault', whether it is run as root or from the Alpine abuild script:

hello [24787], main<2> f7ddc558 (keepcaps=1 vs. want=1)
hello [24787], thread<0> f7d1bd7c (keepcaps=1 vs. want=1)
hello [24787], thread<1> f7d3ed7c (keepcaps=1 vs. want=1)
[24787] started=2 vs 3
[24787] started=3 vs 3
iteration [24787]: 3
hello [24787], main<3> f7ddc558 (keepcaps=0 vs. want=0)
hello [24787], thread<0> f7cf8d7c (keepcaps=0 vs. want=0)
hello [24787], thread<2> f7d3ed7c (keepcaps=0 vs. want=0)
hello [24787], thread<1> f7d1bd7c (keepcaps=0 vs. want=0)
iteration [24787]: 4
make[1]: *** [Makefile:22: run_psx_test] Segmentation fault
make[1]: Leaving directory '/home/mps/aports/main/libcap/src/libcap-2.36/tests'
make: *** [Makefile:39: test] Error 2


Here is the merge request I created on the Alpine Linux GitLab: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/10031

Here is one of the failed jobs:

https://gitlab.alpinelinux.org/mps/aports/-/jobs/159769

And here is the APKBUILD file (the Alpine build 'script'):

https://gitlab.alpinelinux.org/alpine/aports/-/blob/master/main/libcap/APKBUILD
Comment 5 Andrew G. Morgan 2020-07-07 20:14:58 UTC
FWIW the sudo issue is likely that the environment variable is not being honored. Try it this way:

sudo bash -c "LD_LIBRARY_PATH=../libcap ./capsh --print --inh=cap_chown --print --mode=PURE1E --print --inmode=PURE1E" > results.log
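
A quick way to confirm whether a variable such as LD_LIBRARY_PATH actually reaches a process started through sudo is to have that process report it. The following is only an illustrative sketch in C; the helper names `env_is_set` and `report_env` are made up for this example and are not part of libcap or capsh.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if the named variable is present and
 * non-empty in this process's environment, 0 otherwise. */
static int env_is_set(const char *name)
{
    const char *value = getenv(name);
    return value != NULL && value[0] != '\0';
}

/* Hypothetical helper: prints the variable as this process sees it. */
static void report_env(const char *name)
{
    const char *value = getenv(name);
    printf("%s=%s\n", name, value != NULL ? value : "(unset)");
}
```

Built into a small program whose main() calls report_env("LD_LIBRARY_PATH"), this can be run once directly and once through sudo to compare what each invocation sees.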

For the segfault, can you run the test under gdb and when it crashes do:

(gdb) bt

to generate a backtrace?
Comment 6 Milan P. Stanić 2020-07-07 20:38:39 UTC
(In reply to Andrew G. Morgan from comment #5)
> FWIW the sudo issue is likely that the environment variable is not being
> honored. Try it this way:
> 
> sudo bash -c "LD_LIBRARY_PATH=../libcap ./capsh --print --inh=cap_chown
> --print --mode=PURE1E --print --inmode=PURE1E" > results.log
> 
> For the segfault, can you run the test under gdb and when it crashes do:
> 
> (gdb) bt
> 
> to generate a backtrace?

Sorry, I don't know how to run make with Makefiles under gdb, but I ran psx_test (which looks like where the segfault happened) with 'gdb ./psx_test', and the result is:

```
GNU gdb (GDB) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv7-alpine-linux-musleabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./psx_test...
(gdb) run
Starting program: /home/mps/aports/main/libcap/src/libcap-2.36/tests/psx_test
iteration [25782]: 0
hello [25782], main<0> f7fee558 (keepcaps=1 vs. want=1)
[Detaching after fork from child process 25789]
pid=25782 forked -> 25789
[New LWP 25790]
[25782] started=1 vs 1
iteration [25782]: 1

Thread 2 "psx_test" received signal SIG64, Real-time event 64.
[Switching to LWP 25790]
0xf7fa9e44 in ?? () from /lib/ld-musl-armhf.so.1
(gdb) child 25789 exiting
bt
#0  0xf7fa9e44 in ?? () from /lib/ld-musl-armhf.so.1
#1  0xf7faa18c in ?? () from /lib/ld-musl-armhf.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
```
Comment 7 Andrew G. Morgan 2020-07-08 05:36:39 UTC
Sorry, I forgot an important detail! The "received signal SIG64" is actually a normal part of the way the psx code runs. When you see that, you should just say 'c<RETURN>' for continue. Eventually, you will get to a point where you see SEGFAULT and it is that stack trace I'm interested in.

If there is some recipe for recreating your runtime environment (I see you are using docker in some way), I could also use that to try to debug this myself.
Comment 8 Milan P. Stanić 2020-07-08 14:52:37 UTC
(In reply to Andrew G. Morgan from comment #7)
> Sorry, I forgot an important detail! The "received signal SIG64" is actually
> a normal part of the way the psx code runs. When you see that, you should
> just say 'c<RETURN>' for continue. Eventually, you will get to a point where
> you see SEGFAULT and it is that stack trace I'm interested in.

Here is the result of repeating 'c' until it segfaults:
```
Thread 3 "psx_test" received signal SIG64, Real-time event 64.
0xf7fa9e44 in ?? () from /lib/ld-musl-armhf.so.1
(gdb) c
Continuing.

Thread 2 "psx_test" received signal SIG64, Real-time event 64.
[Switching to LWP 26866]
0xf7fa9e44 in ?? () from /lib/ld-musl-armhf.so.1
(gdb) c
Continuing.

Thread 4 "psx_test" received signal SIG64, Real-time event 64.
[Switching to LWP 26877]
0xf7fa9e44 in ?? () from /lib/ld-musl-armhf.so.1
(gdb) c
Continuing.
hello [26858], main<3> f7fee558 (keepcaps=0 vs. want=0)
hello [26858], thread<0> f7f21d7c (keepcaps=0 vs. want=0)
hello [26858], thread<2> f7f67d7c (keepcaps=0 vs. want=0)
hello [26858], thread<1> f7f44d7c (keepcaps=0 vs. want=0)
iteration [26858]: 4
[LWP 26866 exited]

Thread 4 "psx_test" received signal SIG64, Real-time event 64.
0xf7fa9e44 in ?? () from /lib/ld-musl-armhf.so.1
(gdb) c
Continuing.

Thread 1 "psx_test" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 26858]
0xf7f96654 in ?? () from /lib/ld-musl-armhf.so.1
(gdb) bt
#0  0xf7f96654 in ?? () from /lib/ld-musl-armhf.so.1
#1  0xf7faa6e2 in pthread_kill () from /lib/ld-musl-armhf.so.1
#2  0xf7faa6e2 in pthread_kill () from /lib/ld-musl-armhf.so.1
#3  0x004015ee in __psx_syscall (syscall_nr=172) at psx.c:502
#4  0x00400c18 in main (argc=<optimized out>, argv=0xfffefb94) at psx_test.c:84
```

> If there is some recipe for recreating your runtime environment, I see you
> are using docker in some way, I could also use that to try to debug this
> myself.

I don't use Docker; the CI jobs on Alpine infrastructure are set up by the infra team. I mostly use lxc for development.

I used the URL below as a guide when building the svxlink software; maybe it can help:

https://github.com/sm0svx/svxlink/blob/master/docker/alpine/Dockerfile.build


Though something as simple as
```
FROM alpine:edge
RUN wget <tarball> && tar xf <tarball> && cd <dir> && ./configure && make && make test
```
could be useful for those experienced with Docker.
Comment 9 Andrew G. Morgan 2020-07-08 15:15:54 UTC
On the face of it, from that stack trace, this looks like the segfault is happening inside the musl library. I'm probably not the best person to debug an issue there.

=> #0  0xf7f96654 in ?? () from /lib/ld-musl-armhf.so.1
   #1  0xf7faa6e2 in pthread_kill () from /lib/ld-musl-armhf.so.1
   #2  0xf7faa6e2 in pthread_kill () from /lib/ld-musl-armhf.so.1
   #3  0x004015ee in __psx_syscall (syscall_nr=172) at psx.c:502
   #4  0x00400c18 in main (argc=<optimized out>, argv=0xfffefb94) at psx_test.c:84

There is no pthread_kill() call in psx_test.c but the libpsx/psx.c source code uses it to determine if a thread is still alive or not. The closest one to the line indicated in the stack trace is this one:

https://git.kernel.org/pub/scm/libs/libcap/libcap.git/tree/psx/psx.c#n506

I suspect this is an implicit side effect of psx_test.c exiting a pthread somewhere in the vicinity of where the main thread invokes pthread_join().

Since this does not segfault with glibc, there must be something subtle occurring.
Comment 10 Andrew G. Morgan 2020-07-08 19:35:41 UTC
This seems relevant:

https://stackoverflow.com/questions/8482314/segmentation-fault-caused-by-pthread-kill
Comment 11 Rich Felker 2020-07-08 20:55:58 UTC
This is almost surely just UB due to use of a pthread_t value after its lifetime ends (after it exits detached, or after it's joined if joinable). If you're wrapping pthread_create, then the wrapped start function needs to install a cancellation handler so that, on exit, the thread can report itself as exiting. You can't just use pthread_kill after it's already exited to probe whether it still exists.
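
One way to follow this suggestion is sketched below in C: the wrapped start function installs a cleanup handler with pthread_cleanup_push(), the handler clears an `alive` flag under a mutex, and pthread_kill() is called only while that flag is still set. This is an illustrative sketch, not the code from the libcap fix; the names `tracked_thread`, `tracked_create`, `tracked_kill` and `tracked_demo` are hypothetical, and it assumes the thread stays joinable and is joined only by its owner.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread record; not the structure libpsx uses. */
struct tracked_thread {
    pthread_t id;
    pthread_mutex_t lock;
    bool alive;               /* true until the thread's cleanup handler runs */
    void *(*fn)(void *);
    void *arg;
};

/* Cleanup handler: the exiting thread reports its own exit. */
static void mark_exited(void *p)
{
    struct tracked_thread *t = p;
    pthread_mutex_lock(&t->lock);
    t->alive = false;
    pthread_mutex_unlock(&t->lock);
}

/* Wrapped start function: runs the real function with the handler installed. */
static void *tracked_start(void *p)
{
    struct tracked_thread *t = p;
    void *ret;
    pthread_cleanup_push(mark_exited, t);
    ret = t->fn(t->arg);
    pthread_cleanup_pop(1);   /* run the handler on a normal return as well */
    return ret;
}

static int tracked_create(struct tracked_thread *t, void *(*fn)(void *), void *arg)
{
    assert(t != NULL && fn != NULL);
    t->fn = fn;
    t->arg = arg;
    t->alive = true;
    pthread_mutex_init(&t->lock, NULL);
    int rc = pthread_create(&t->id, NULL, tracked_start, t);
    if (rc != 0)
        t->alive = false;     /* no thread exists, so never signal it */
    return rc;
}

/* Returns 1 if pthread_kill() succeeded, 0 if the thread had already
 * reported its exit (no call is made), -1 if pthread_kill() failed. */
static int tracked_kill(struct tracked_thread *t, int sig)
{
    int rc = 0;
    pthread_mutex_lock(&t->lock);
    if (t->alive)
        rc = (pthread_kill(t->id, sig) == 0) ? 1 : -1;
    pthread_mutex_unlock(&t->lock);
    return rc;
}

static void *worker(void *arg)
{
    return arg;
}

/* Returns 0 if the worker ran and no probe was attempted after the join. */
static int tracked_demo(void)
{
    struct tracked_thread t;
    int value = 42;
    void *ret = NULL;

    if (tracked_create(&t, worker, &value) != 0)
        return -1;
    (void)tracked_kill(&t, 0);          /* safe: made only if still alive */
    if (pthread_join(t.id, &ret) != 0)
        return -1;
    int sent = tracked_kill(&t, 0);     /* thread reported exit: no call made */
    pthread_mutex_destroy(&t.lock);
    return (ret == &value && sent == 0) ? 0 : -1;
}
```

Because the flag is checked and the signal is sent under the same lock that the exiting thread takes in its handler, the pthread_t is never passed to pthread_kill() after the thread has reported its exit.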
Comment 12 Andrew G. Morgan 2020-07-09 04:14:37 UTC
It seems that this behavior isn't described in the pthread_kill() documentation:

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_kill.html

Yet, this unresolved bug report seems relevant too:

  https://sourceware.org/bugzilla/show_bug.cgi?id=12889

Let me think some more about this. I am now able to reproduce a musl build of libpsx, so I can explore the issue in more detail.
Comment 13 Andrew G. Morgan 2020-07-10 05:46:54 UTC
I really can't help with the fakeroot thing not supporting quicktest.sh.

However, I believe the segfault issue is now resolved with:

https://git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=dca9b22261f4837b0c81640ca3aa5133b95e0999

The following now works for me based on the state of the GitHub musl sources:

  make CC=/usr/local/musl/bin/musl-gcc clean all test sudotest

I've added a validation check for me when I make releases too.

Please reopen if you have an unresolved issue.
Comment 14 Rich Felker 2020-07-10 05:51:06 UTC
The fix looks ok at first glance, but contrary to the comments it's not musl-specific. The old code had a UAF bug that was equally present on glibc and just masked by dead thread ids being cached for reuse; if one actually happened to be reused before the pthread_kill call, very bad things would happen.
Comment 15 Milan P. Stanić 2020-07-10 19:50:02 UTC
(In reply to Andrew G. Morgan from comment #13)
> I really can't help with the fakeroot thing not supporting quicktest.sh.
> 
> However, I believe the segfault issue is now resolved with:
> 
> https://git.kernel.org/pub/scm/libs/libcap/libcap.git/commit/?id=dca9b22261f4837b0c81640ca3aa5133b95e0999
> 
> The following now works for me based on the state of the GitHub musl sources:
> 
>   make CC=/usr/local/musl/bin/musl-gcc clean all test sudotest
> 
> I've added a validation check for me when I make releases too.

I picked the patch from above and removed the first hunk because it doesn't apply to releases 2.37 and 2.37, added a small patch to remove 'psx_test_wrap', and ran 'make -j1 test' from the Alpine APKBUILD. It passed on all our CIs, i.e. x86, x86_64, armv7, aarch64, s390x and ppc64le.

> Please reopen if you have an unresolved issue.

I hope that with the new release of libcap these patches will not be needed.

Thank you for your help.
Comment 16 Andrew G. Morgan 2020-07-12 00:51:42 UTC
FYI I have just released libcap-2.39. This patch is included.
Comment 17 Milan P. Stanić 2020-07-12 15:41:09 UTC
(In reply to Andrew G. Morgan from comment #16)
> FYI I have just released libcap-2.39. This patch is included.

This release builds fine and passes 'make -j1 test' without any issue and without patches.

Thank you
