There is a race condition on the fs->users counter and the fs->in_exec flag that impacts exec() of a suid program when multiple threads are being created and destroyed by the parent process.

The in_exec flag was introduced way back in 2009 with commit:

commit 498052bba55ecaff58db6a1436b0e25bfd75a7ff
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Mar 30 07:20:30 2009 -0400

When performing an exec() of a suid program, check_unsafe_exec() checks whether p->fs->users is greater than the number of child threads that share the same fs as the parent. However, there are two race conditions that this seems to fail on:

1. The parent may be creating a new thread, and copy_fs() in kernel/fork.c bumps the fs->users count before the thread is attached to the parent. This causes the check p->fs->users > n_fs in check_unsafe_exec() to be true, so the exec of the suid program by the parent is marked as unsafe and it executes as a non-suid executable. The crux of the matter is that the fs->users count is temporarily ahead by 1 of the number of threads linked to the process, and we don't have any locking mechanism to protect us from this in the exec phase.

2. A thread may be dying, and its fs pointer is nullified before the thread is removed from the parent; this also causes issues in the check_unsafe_exec() fs sanity check. I believe this can be fixed by checking for t->fs being NULL, e.g.:

diff --git a/fs/exec.c b/fs/exec.c
index 72934df68471..ebfd9b76b69f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1447,7 +1447,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	spin_lock(&p->fs->lock);
 	rcu_read_lock();
 	while_each_thread(p, t) {
-		if (t->fs == p->fs)
+		if (t->fs == p->fs || !t->fs)
 			n_fs++;
 	}
 	rcu_read_unlock();

The reproducer is quite simple and the issue is always easy to reproduce:

$ cat Makefile
ALL=a b

all: $(ALL)

a: LDFLAGS=-pthread

b: b.c
	$(CC) b.c -o b
	sudo chown root:root b
	sudo chmod u+s b

test:
	for I in $$(seq 1000); do echo $$I; ./a ; done

clean:
	rm -vf $(ALL)

$ cat a.c
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

void *nothing(void *p)
{
	return NULL;
}

/* churn: keep creating and joining short-lived threads */
void *target(void *p)
{
	for (;;) {
		pthread_t t;

		if (pthread_create(&t, NULL, nothing, NULL) == 0)
			pthread_join(t, NULL);
	}
	return NULL;
}

int main(void)
{
	struct timespec tv;
	int i;

	for (i = 0; i < 10; i++) {
		pthread_t t;

		pthread_create(&t, NULL, target, NULL);
	}

	/* let the threads start churning, then exec the suid helper */
	tv.tv_sec = 0;
	tv.tv_nsec = 100000;
	nanosleep(&tv, NULL);

	if (execl("./b", "./b", NULL) < 0)
		perror("execl");

	return 0;
}

$ cat b.c
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>

/* suid helper: report whether we actually gained root */
int main(void)
{
	const uid_t euid = geteuid();

	if (euid != 0) {
		printf("Failed, got euid %d (expecting 0)\n", euid);
		return 1;
	}
	return 0;
}

To reproduce:

make
make test

You will see "Failed, got euid 1000 (expecting 0)" errors whenever the suid program being exec'd fails to run as a suid process because of the race and the resulting failure in check_unsafe_exec().
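For reference on race #1, here is a simplified sketch of copy_fs() (adapted from kernel/fork.c; not verbatim, and details vary between kernel versions) showing how fs->users is bumped early in copy_process(), well before the new thread is linked into the thread list that while_each_thread() walks. That gap is the window for the miscount:

/*
 * Simplified sketch of copy_fs(), adapted from kernel/fork.c
 * (not verbatim; details vary between kernel versions).
 */
static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
{
	struct fs_struct *fs = current->fs;

	if (clone_flags & CLONE_FS) {
		spin_lock(&fs->lock);
		if (fs->in_exec) {
			/* an exec has claimed this fs; fail the clone */
			spin_unlock(&fs->lock);
			return -EAGAIN;
		}
		/* counted here, long before the new thread becomes
		 * visible to while_each_thread() in check_unsafe_exec() */
		fs->users++;
		spin_unlock(&fs->lock);
		return 0;
	}
	tsk->fs = copy_fs_struct(fs);
	if (!tsk->fs)
		return -ENOMEM;
	return 0;
}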
I don't think the second condition you gave is relevant, because once the task_struct->fs pointer is nullified, that thread is no longer accounted for in fs->users. Since such a thread counts toward neither fs->users nor n_fs, it is okay. The first condition is indeed a problem. I'm not sure what the right fix is (probably some kind of locking in copy_process() that while_each_thread() could use too), but as it stands this race is bad.
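To back that up with the exit path: a simplified sketch of exit_fs() (adapted from fs/fs_struct.c; not verbatim) shows that the fs pointer is nullified and fs->users decremented together, in one critical section under fs->lock, so check_unsafe_exec() should never see a NULL-fs thread still counted in fs->users:

/*
 * Simplified sketch of exit_fs(), adapted from fs/fs_struct.c
 * (not verbatim). Clearing tsk->fs and dropping fs->users happen
 * under the same fs->lock, so a thread whose t->fs is already
 * NULL has also already left fs->users.
 */
void exit_fs(struct task_struct *tsk)
{
	struct fs_struct *fs = tsk->fs;

	if (fs) {
		int kill;

		task_lock(tsk);
		spin_lock(&fs->lock);
		tsk->fs = NULL;
		kill = !--fs->users;
		spin_unlock(&fs->lock);
		task_unlock(tsk);
		if (kill)
			free_fs_struct(fs);
	}
}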
Thanks for looking at this. Locking in copy_process() concerns me, as I didn't want to introduce a fix that causes a serious performance regression on the clone and exec paths.
Any updates on this?
Created attachment 256351: a fix that solves the issue with a simple lightweight retry, without the overhead of extra per-thread locking.

This has been extensively tested on a multi-processor Xeon server with the reproducer and fixes the issue. I've checked it under high pthread contention and the retry loop is taken very rarely, so the overhead of the retry is very small indeed.
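The attachment itself is not inlined here, but purely as an illustration of the retry idea described above (a hypothetical sketch, not the attached patch; the retry bound of 32 and the cond_resched() call are my assumptions), the counting loop in check_unsafe_exec() could be re-run whenever fs->users is ahead of the visible thread count, on the assumption that a clone is momentarily in flight:

/*
 * Hypothetical sketch of the lightweight-retry idea (illustration
 * only, not the actual attachment). If fs->users is ahead of the
 * visible thread count, a clone may be mid-flight, so re-count a
 * bounded number of times before concluding that the fs really is
 * shared outside the thread group (e.g. CLONE_FS without
 * CLONE_THREAD) and marking the exec unsafe.
 */
static void check_unsafe_exec(struct linux_binprm *bprm)
{
	struct task_struct *p = current, *t;
	unsigned int n_fs, retries = 0;

	/* ... other bprm->unsafe checks elided ... */
retry:
	n_fs = 1;
	spin_lock(&p->fs->lock);
	rcu_read_lock();
	while_each_thread(p, t) {
		if (t->fs == p->fs)
			n_fs++;
	}
	rcu_read_unlock();

	if (p->fs->users > n_fs && retries++ < 32) {
		spin_unlock(&p->fs->lock);
		cond_resched();	/* let the racing clone finish */
		goto retry;
	}

	if (p->fs->users > n_fs)
		bprm->unsafe |= LSM_UNSAFE_SHARE; /* genuinely shared fs */
	else
		p->fs->in_exec = 1;
	spin_unlock(&p->fs->lock);
}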