Bug 200467

Summary: The syscall futex with operation FUTEX_LOCK_PI is not allowed to return ESRCH for robust mutexes.
Product: Other
Reporter: stli
Component: Other
Assignee: Thomas Gleixner (tglx)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: fweimer, stli, tglx
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.17
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Reduced testcase which runs without the glibc-testsuite

Description stli 2018-07-10 15:50:03 UTC
Created attachment 277313
Reduced testcase which runs without the glibc-testsuite

See also glibc Bug 23183, "tst-robustpi4 test failure" (https://sourceware.org/bugzilla/show_bug.cgi?id=23183).

The glibc test <glibc-src>/nptl/tst-robustpi4 sometimes fails with:
tst-robustpi4: ../nptl/pthread_mutex_lock.c:425: __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust' failed.

The test creates the tf-thread, which locks the mutex (PTHREAD_MUTEX_ROBUST_NP with the PTHREAD_PRIO_INHERIT protocol) and is then canceled by the main-thread. The main-thread then locks the mutex via pthread_mutex_lock and expects EOWNERDEAD, because the tf-thread exited while holding the mutex. This call sometimes triggers the assertion. You can also run the attached testcase instead of tst-robustpi4.

In the assertion case, pthread_mutex_lock fails to acquire the mutex in userspace because it is still owned by the tf-thread, and then performs the futex syscall with the FUTEX_LOCK_PI operation. This futex syscall sometimes returns ESRCH, and the futex value is FUTEX_OWNER_DIED | FUTEX_WAITERS | 0. Then either the assertion is raised or the code loops indefinitely. This happens if the futex syscall (main-thread) and the exit syscall (tf-thread) run at the same time.
In the correct case, this futex syscall returns without an error and the futex value is FUTEX_OWNER_DIED | FUTEX_WAITERS | <tid of main-thread>. Thus we know that we are the current owner of the mutex and can handle FUTEX_OWNER_DIED.
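
To make the two futex values above easier to compare, here is a minimal sketch that splits a PI futex word into its fields. The bit values match the kernel UAPI header <linux/futex.h>; the struct and the helpers decode_pi_word() and owned_by() are hypothetical names used only for illustration and are not part of glibc or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of a PI futex word, as in <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* Hypothetical helper type: the three fields of the futex word. */
struct pi_word {
    bool     waiters;     /* FUTEX_WAITERS is set */
    bool     owner_died;  /* FUTEX_OWNER_DIED is set */
    uint32_t owner_tid;   /* 0 means "no owner recorded" */
};

/* Split a raw futex value into its fields. */
struct pi_word decode_pi_word(uint32_t val)
{
    struct pi_word w;
    w.waiters    = (val & FUTEX_WAITERS) != 0;
    w.owner_died = (val & FUTEX_OWNER_DIED) != 0;
    w.owner_tid  = val & FUTEX_TID_MASK;
    return w;
}

/* True if the word records thread `tid` as the owner. */
bool owned_by(uint32_t val, uint32_t tid)
{
    assert((tid & ~FUTEX_TID_MASK) == 0); /* TIDs fit in the low 30 bits */
    return tid != 0 && (val & FUTEX_TID_MASK) == tid;
}
```

With these helpers, the failing value 0xc0000000 decodes to owner_died and waiters set with owner_tid 0, so owned_by() is false for every thread. In the correct case the TID field holds the main-thread's TID and owned_by() is true for it.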

I've seen this failure on s390x and x86_64. On s390x, I've seen the assertion more often in a zVM-guest than when running directly on an LPAR! On x86_64, I've used a kvm-guest with Fedora 28. If I only run the test, I don't see this assertion. But if I, e.g., build glibc while running the testcase, this assertion is also triggered on x86_64!

I've also added some kprobes to trace the exit and futex syscalls in the assertion case.
Assumption: The dying thread (tid=0xa3c) owns the mutex and thus the futex-value is 0xa3c.

The main thread tries to lock the mutex via the futex syscall, which sets the FUTEX_WAITERS bit (the futex value becomes 0x80000a3c) in:
futex_lock_pi_atomic()
{
...
	/*
	 * First waiter. Set the waiters bit before attaching ourself to
	 * the owner. If owner tries to unlock, it will be forced into
	 * the kernel and blocked on hb->lock.
	 */
	newval = uval | FUTEX_WAITERS;
	ret = lock_pi_update_atomic(uaddr, uval, newval);
}


Afterwards attach_to_pi_owner() is called. If the exit syscall is processed at the same time, the PF_EXITING flag is already set for the exiting thread. attach_to_pi_owner() can then return EAGAIN, and the futex syscall retries locking the futex. In the meantime, the exit syscall processes the mutex and handle_futex_death() sets the futex value to 0xc0000000:
handle_futex_death(...)
{
...
		mval = (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
		if (cmpxchg_futex_value_locked(&nval, uaddr, uval, mval)) ...
...
}
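
The value changes described so far can be replayed as plain arithmetic. The following is a userspace sketch only: set_waiters() and mark_owner_died() are hypothetical names that mirror the two expressions quoted from futex_lock_pi_atomic() and handle_futex_death(), without any of the kernel's locking or cmpxchg handling.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of a PI futex word, as in <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* Locker side: mirrors "newval = uval | FUTEX_WAITERS". */
uint32_t set_waiters(uint32_t uval)
{
    /* Assumption of this model: a waiter only attaches to an owned lock. */
    assert((uval & FUTEX_TID_MASK) != 0);
    return uval | FUTEX_WAITERS;
}

/*
 * Exit side: mirrors "mval = (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED".
 * The TID field is dropped, so the resulting word records no owner.
 */
uint32_t mark_owner_died(uint32_t uval)
{
    return (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
}
```

Starting from 0xa3c, set_waiters() yields 0x80000a3c and mark_owner_died() then yields 0xc0000000, which is the sequence seen in the trace.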

If the futex syscall now calls attach_to_pi_owner() again and ...
... the exiting thread has already exited, then ESRCH is returned due to:
	if (!pid)
		return -ESRCH;
	p = futex_find_get_task(pid);
	if (!p)
		return -ESRCH;

... the exiting thread has not yet finished, then ESRCH is returned due to:
	if (unlikely(p->flags & PF_EXITING)) {
		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
		...
		return ret;
	}
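
The return paths quoted above can be summarised in a small decision function. This is a simplified userspace model, not the kernel code: model_attach_result() and its parameters are hypothetical, the task lookup is reduced to a boolean, and the PF_* values are assumed from the 4.17-era include/linux/sched.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define FUTEX_TID_MASK 0x3fffffffu  /* as in <linux/futex.h> */
#define PF_EXITING     0x00000004u  /* assumed 4.17-era value */
#define PF_EXITPIDONE  0x00000008u  /* assumed 4.17-era value */

/*
 * Hypothetical model of the result of attaching to the PI owner.
 *   uval        - current futex value
 *   task_exists - whether a task lookup for the owner TID would succeed
 *   task_flags  - the owner's flags (only PF_EXITING/PF_EXITPIDONE matter)
 */
int model_attach_result(uint32_t uval, bool task_exists, uint32_t task_flags)
{
    uint32_t pid = uval & FUTEX_TID_MASK;

    /* Assumption of this model: PF_EXITPIDONE implies PF_EXITING. */
    assert(!(task_flags & PF_EXITPIDONE) || (task_flags & PF_EXITING));

    if (pid == 0)
        return -ESRCH;  /* no owner TID in the futex word */
    if (!task_exists)
        return -ESRCH;  /* owner task is already gone */
    if (task_flags & PF_EXITING)
        return (task_flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
    return 0;           /* owner is alive, attaching can proceed */
}
```

In this model, a futex value of 0xc0000000 gives -ESRCH regardless of the task state because the TID field is 0, while 0x80000a3c with only PF_EXITING set gives -EAGAIN, matching the retry described earlier.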

This ESRCH is returned by the futex-syscall.

If the futex syscall happens slightly earlier than the exit syscall (in which case glibc does not trigger the assertion), the futex syscall sets the FUTEX_WAITERS bit and attach_to_pi_owner() returns zero. During the exit syscall, handle_futex_death() sets the futex value to 0xc0000000. Afterwards, the futex syscall calls fixup_owner()/fixup_pi_state_owner(), which sets the futex value to 0xc0000000|<tid-of-main-thread>. In this case, the futex syscall does not return an error.


The kernel must not return ESRCH here. Instead, it should acquire the mutex and return 0. glibc then knows that it has acquired the mutex and can go on to check the FUTEX_OWNER_DIED bit.
Comment 1 Thomas Gleixner 2018-12-10 13:19:54 UTC
Taking the bug. Will reply by mail
Comment 2 stli 2019-01-23 15:55:48 UTC
I've tested the kernel commit https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=da791a667536bf8322042e38ca85d55a78d3c273 "futex: Cure exit race" from Thomas Gleixner on s390x (inside a zVM-guest) and x86_64 (inside a kvm-guest).

Running on a kernel with this commit, the attached "reduced testcase" runs (with rounds_max = 100000000) without failures. I've also successfully run the original glibc testcase nptl/tst-robustpi4 in a loop.

Running on an older kernel without this commit, the attached "reduced testcase" always failed within ~268000 rounds.
On s390x (inside a zVM-guest) the original glibc testcase failed ~1900 times while running it 1000000 times.

Thus I'm closing this bugzilla.
Thanks.