There is a race condition on the fs->users counter and the fs->in_exec flag that impacts exec() of a suid program while multiple threads are being created and destroyed by the parent process.

The in_exec flag was introduced back in 2009 with:

commit 498052bba55ecaff58db6a1436b0e25bfd75a7ff
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Mar 30 07:20:30 2009 -0400

When performing an exec() of a suid program, check_unsafe_exec() checks whether p->fs->users is larger than the number of threads that share the same fs with the parent. However, there are two races that this check falls foul of:

1. The parent may be creating a new thread, and copy_fs() in kernel/fork.c bumps the fs->users count before the new thread is attached to the parent. This makes the p->fs->users > n_fs check in check_unsafe_exec() true, so the exec of the suid program by the parent is wrongly marked as unsafe and it executes as a non-suid executable. The crux of the matter is that fs->users is temporarily ahead by 1 of the number of threads linked to the process, and there is no locking mechanism to protect against this on the exec path.

2. A thread may be dying: its fs pointer is set to NULL before the thread is removed from the parent's thread list, and this also upsets the fs sanity check in check_unsafe_exec(). I believe this can be fixed by also checking for t->fs being NULL, e.g.:

diff --git a/fs/exec.c b/fs/exec.c
index 72934df68471..ebfd9b76b69f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1447,7 +1447,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	spin_lock(&p->fs->lock);
 	rcu_read_lock();
 	while_each_thread(p, t) {
-		if (t->fs == p->fs)
+		if (t->fs == p->fs || !t->fs)
 			n_fs++;
 	}
 	rcu_read_unlock();

The reproducer is quite simple and the failure is easy to hit:

$ cat Makefile
ALL=a b

all: $(ALL)

a: LDFLAGS=-pthread

b: b.c
	$(CC) b.c -o b
	sudo chown root:root b
	sudo chmod u+s b

test:
	for I in $$(seq 1000); do echo $$I; ./a ; done

clean:
	rm -vf $(ALL)

$ cat a.c
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

void *nothing(void *p)
{
	return NULL;
}

void *target(void *p)
{
	for (;;) {
		pthread_t t;

		if (pthread_create(&t, NULL, nothing, NULL) == 0)
			pthread_join(t, NULL);
	}
	return NULL;
}

int main(void)
{
	struct timespec tv;
	int i;

	for (i = 0; i < 10; i++) {
		pthread_t t;

		pthread_create(&t, NULL, target, NULL);
	}

	tv.tv_sec = 0;
	tv.tv_nsec = 100000;
	nanosleep(&tv, NULL);

	if (execl("./b", "./b", NULL) < 0)
		perror("execl");
	return 0;
}

$ cat b.c
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
	const uid_t euid = geteuid();

	if (euid != 0) {
		printf("Failed, got euid %d (expecting 0)\n", euid);
		return 1;
	}
	return 0;
}

To reproduce:

  make
  make test

You will see "Failed, got euid 1000 (expecting 0)" errors whenever the suid program being exec'd fails to run with suid privileges because of the race and the resulting misjudgement in check_unsafe_exec().
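For reference on condition 1, the window comes from the ordering in copy_fs(). The snippet below is paraphrased from kernel/fork.c (the exact code differs between kernel versions, so treat it as a sketch): fs->users is bumped under fs->lock, but the new thread is only linked into the thread list that while_each_thread() walks later on in copy_process(), so check_unsafe_exec() can observe the incremented count before it can see the extra thread.

static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
{
	struct fs_struct *fs = current->fs;

	if (clone_flags & CLONE_FS) {
		/* tsk->fs is already what we want */
		spin_lock(&fs->lock);
		if (fs->in_exec) {
			spin_unlock(&fs->lock);
			return -EAGAIN;
		}
		fs->users++;	/* visible to check_unsafe_exec() from here on... */
		spin_unlock(&fs->lock);
		return 0;	/* ...but tsk is not yet on the thread list */
	}
	tsk->fs = copy_fs_struct(fs);
	if (!tsk->fs)
		return -ENOMEM;
	return 0;
}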
I don't think the second condition you describe is relevant: once the task_struct->fs pointer has been nullified, that thread is no longer accounted for in fs->users. Since such a thread counts toward neither fs->users nor n_fs, it is okay. The first condition is indeed a problem. I'm not sure what the right fix is (probably some kind of locking in copy_process() that the while_each_thread() loop could take as well), but as it stands this race is bad.
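For what it's worth, that matches the teardown side: in exit_fs() (paraphrased below from fs/fs_struct.c; details vary by version) the ->fs pointer is cleared and the fs->users reference is dropped together under fs->lock, so a dying thread disappears from both sides of the users-versus-n_fs comparison at the same time.

void exit_fs(struct task_struct *tsk)
{
	struct fs_struct *fs = tsk->fs;

	if (fs) {
		int kill;

		task_lock(tsk);
		spin_lock(&fs->lock);
		tsk->fs = NULL;		/* pointer cleared ...            */
		kill = !--fs->users;	/* ... and reference dropped here */
		spin_unlock(&fs->lock);
		task_unlock(tsk);
		if (kill)
			free_fs_struct(fs);
	}
}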
Thanks for looking at this. Adding locking to copy_process() concerns me, as I don't want to introduce a locking fix that causes a serious performance regression on the copy and exec paths.
Any updates on this?
Created attachment 256351 [details]
A fix that solves the issue with a simple, lightweight retry and without the overhead of extra per-thread locking.

This has been extensively tested on a multi-processor Xeon server with the reproducer and it fixes the issue. I've exercised it under heavy pthread contention and the retry loop is entered very rarely, so the overhead of the retry is very small indeed.
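The attachment itself is not inlined here. Purely as an illustration of what a recount-and-retry in check_unsafe_exec() could look like, here is a rough sketch; it is a guess at the shape of the approach, not the actual patch, and the retry bound MAX_EXEC_COUNT_RETRIES is a made-up name for the sketch:

	struct task_struct *p = current, *t;
	unsigned int n_fs;
	int retries = 0;

retry:
	t = p;
	n_fs = 1;
	spin_lock(&p->fs->lock);
	rcu_read_lock();
	while_each_thread(p, t) {
		if (t->fs == p->fs)
			n_fs++;
	}
	rcu_read_unlock();

	if (p->fs->users > n_fs) {
		/*
		 * A concurrent clone may have bumped fs->users before the
		 * new thread was linked into the thread list.  Drop the
		 * lock and recount a few times before concluding that the
		 * fs really is shared outside the thread group.
		 */
		if (retries++ < MAX_EXEC_COUNT_RETRIES) {
			spin_unlock(&p->fs->lock);
			cond_resched();
			goto retry;
		}
		bprm->unsafe |= LSM_UNSAFE_SHARE;
	} else {
		p->fs->in_exec = 1;
	}
	spin_unlock(&p->fs->lock);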
Did you ever submit the fix to the mailing list?
Hi Colin,

May I ask what the latest status of your fix is? This race condition can still be reproduced on current mainline, v5.19.0-rc6. I found that Ubuntu picked the patch up as https://lists.ubuntu.com/archives/kernel-team/2017-May/084102.html, but a soft lockup was also reported against the 4.4 kernel in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1876856, so I'm wondering whether you have an updated version of the patch and whether you plan to submit it upstream. I would really appreciate it if you're still tracking this issue!