Bug 195453

Summary: race on fs/exec with fs_struct in_exec flag, introduced in commit 498052bba55ecaff58db6a1436b0e25bfd75a7ff
Product: File System
Reporter: Colin Ian King (colin.king)
Component: Other
Assignee: fs_other
Status: NEW
Severity: high
CC: benh, dev.jongy, domnulnopcea, shaoyi
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.29+ onwards
Subsystem:
Regression: No
Bisected commit-id:
Attachments: this fix solves the issue without any overhead of extra per thread locking and a simple lightweight retry

Description Colin Ian King 2017-04-18 13:22:33 UTC
There is a race condition on the fs->users counter and fs->in_exec flag that impacts exec() on a suid program when there are multiple threads being created and destroyed by a parent process.  This in_exec flag was introduced way back in 2009 with commit:

commit 498052bba55ecaff58db6a1436b0e25bfd75a7ff
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Mar 30 07:20:30 2009 -0400


When performing an exec() of a suid program, check_unsafe_exec() checks whether p->fs->users is greater than the count of child threads that share the same fs with the parent.

However, there are two race conditions that this seems to fail on:

1. The parent may be creating a new thread, and copy_fs() in kernel/fork.c bumps the fs->users count before the thread is attached to the parent. This makes the check p->fs->users > n_fs in check_unsafe_exec() spuriously true, so the exec of the suid program by the parent is marked as unsafe and it executes as a non-suid executable.  The crux of the matter is that the fs->users count is temporarily ahead by 1 of the number of threads linked to the process, and we don't have any locking mechanism to protect us from this in the exec phase.

2. A thread may be dying, and the associated fs pointer is nullified before the thread is removed from the parent; this also trips the fs sanity check in check_unsafe_exec(). I believe this can be fixed by checking for t->fs being NULL, e.g.:

diff --git a/fs/exec.c b/fs/exec.c
index 72934df68471..ebfd9b76b69f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1447,7 +1447,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
        spin_lock(&p->fs->lock);
        rcu_read_lock();
        while_each_thread(p, t) {
-               if (t->fs == p->fs)
+               if (t->fs == p->fs || !t->fs)
                        n_fs++;
        }
        rcu_read_unlock();


The reproducer is quite simple and the race is easy to trigger:

$ cat Makefile 
ALL=a b
all: $(ALL)

a: LDFLAGS=-pthread

b: b.c
	$(CC) b.c -o b
	sudo chown root:root b
	sudo chmod u+s b

test:
	for I in $$(seq 1000); do echo $$I; ./a ; done

clean:
	rm -vf $(ALL)


$ cat a.c
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

void *nothing(void *p)
{
	return NULL;
}

void *target(void *p) {
	for (;;) {
		pthread_t t;
		if (pthread_create(&t, NULL, nothing, NULL) == 0)
			pthread_join(t, NULL);
	}
	return NULL;
}

int main(void)
{
	struct timespec tv;
	int i;

	for (i = 0; i < 10; i++) {
		pthread_t t;
		pthread_create(&t, NULL, target, NULL);
	}
	tv.tv_sec = 0;
	tv.tv_nsec = 100000;
	nanosleep(&tv, NULL);
	if (execl("./b", "./b", NULL) < 0)
		perror("execl");
	return 0;
}

$ cat b.c
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
	const uid_t euid = geteuid();
	if (euid != 0) {
		printf("Failed, got euid %d (expecting 0)\n", euid);
		return 1;
	}
	return 0;
}



To reproduce:

make
make test

You will see "Failed, got euid 1000 (expecting 0)" errors whenever the exec'd suid program fails to run with suid privileges because of the race and the resulting failure in check_unsafe_exec().
Comment 1 Yonatan Goldschmidt 2017-04-21 00:36:08 UTC
I don't think the second condition you gave is relevant: once the task_struct->fs pointer is nullified, that thread is no longer accounted for in fs->users. Since the thread counts toward neither fs->users nor n_fs, it is okay.

The first condition is indeed a problem. I'm not sure what the right fix is (probably some kind of locking in copy_process() that while_each_thread() could use too), but as it stands this race is bad.
Comment 2 Colin Ian King 2017-04-21 09:47:07 UTC
Thanks for looking at this.  The locking on copy_process() concerns me, as I didn't want to introduce a locking fix that caused a serious performance regression on the copy and exec paths.
Comment 3 Colin Ian King 2017-05-03 10:55:47 UTC
Any updates on this?
Comment 4 Colin Ian King 2017-05-10 11:09:09 UTC
Created attachment 256351 [details]
this fix solves the issue without any overhead of extra per thread locking and a simple lightweight retry

This has been extensively tested on a multi-processor Xeon server with the reproducers and it fixes the issue.  I've also checked it under heavy pthread contention: the retry loop triggers very rarely, so the overhead of the retry is very small indeed.
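The attachment itself is not inlined in this report; for reference, a lock-free re-count in check_unsafe_exec() could plausibly take the shape of the following kernel-style pseudocode sketch. This is a reconstruction of the idea from the description above, not necessarily the attached patch, and MAX_EXEC_FS_RETRIES is a hypothetical bound:

```c
/* Pseudocode sketch (assumption, not the attached patch): re-count a
 * bounded number of times when fs->users runs ahead of the thread
 * count, since the copy_fs()/thread-link window in copy_process() is
 * short-lived.  An unbounded retry would risk a soft lockup. */
	int retries = 0;
again:
	n_fs = 1;
	spin_lock(&p->fs->lock);
	rcu_read_lock();
	while_each_thread(p, t) {
		if (t->fs == p->fs)
			n_fs++;
	}
	rcu_read_unlock();

	if (p->fs->users > n_fs && retries++ < MAX_EXEC_FS_RETRIES) {
		spin_unlock(&p->fs->lock);
		cpu_relax();	/* let copy_process() finish linking */
		goto again;
	}
```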
Comment 5 Benjamin Herrenschmidt 2021-04-16 01:24:27 UTC
Did you ever submit the fix to the mailing list ?
Comment 6 shaoyi 2022-07-13 01:39:01 UTC
Hi Colin,

May I please ask about the most recent update about your fix? This race condition can still be reproduced on the current linux mainline v5.19.0-rc6. I found ubuntu had picked this patch up as https://lists.ubuntu.com/archives/kernel-team/2017-May/084102.html but also reported the soft lockup issue in kernel 4.4 as https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1876856 so I'm wondering do you have an updated version of the patch and the plan to submit it to the upstream? Would really appreciate if you're still tracking this issue!