Bug 105651

Summary: Concurrency issue in sem_lock
Product: Process Management
Component: Other
Assignee: process_other
Reporter: Felix Hübner (felixh)
Status: NEW
Severity: high
Priority: P1
CC: felixh, manfred
Hardware: All
OS: Linux
Kernel Version: 4.3.0-rc
Regression: No
Attachments: Draft for Fix, Updated patch, Updated patch

Description Felix Hübner 2015-10-08 08:06:41 UTC
In line 376 (file ipc/sem.c), sem_wait_array is called with the precondition sma->complex_count != 0, resulting in an immediate return.

The following scenario is possible, allowing P0 and P1 to be in the critical section (the section between sem_lock and sem_unlock) at the same time:

P0 is a process performing an up operation with semtimedop on the single semaphore 1.
P1 is a process performing an up operation with semtimedop on the single semaphore 1.
P2 is a process performing a down operation with semtimedop on semaphores 1 and 2.

P0 runs sem_lock up to line 332: if(sma->complex_count==0) evaluates to true
P1 runs sem_lock up to line 329.
P2 runs sem_lock up to and including line 310.

P0 does spin_lock(&sem->lock); in line 336.
P2 performs the rest of semtimedop, increments complex_count, ends up in line 1961 and starts to sleep.
P1 evaluates if(sma->complex_count==0) in line 331 and ends up in line 361.

P0 evaluates if(!spin_is_locked(&sma->sem_perm.lock)) in line 339 and ends up after line 346.
P1 performs ipc_lock_object(&sma->sem_perm); line 362; it then evaluates if(sma->complex_count==0) and executes the else statement, calls sem_wait_array in line 376, which returns erroneously.
P1 runs the rest of semtimedop and wakes P2 up in do_smart_update (line 1911), via update_queue and unlink_queue; this decrements complex_count.

P0 now evaluates if(sma->complex_count==0) in line 353 and is able to enter the critical section, while P1 is still in do_smart_update().


The bug was first discovered in kernel 3.10.63, though it may be present in earlier versions. The description above applies to kernel version 4.3.0-rc, but except for some slight changes in the line numbers it should apply to all versions from 3.10.63 up to 4.3.0-rc.
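
For illustration, the three operations can be set up from userspace roughly as follows. This is only a minimal sketch, not a reliable reproducer: hitting the race depends on kernel-internal timing, so in practice the calls would have to be repeated in a tight loop across many processes. Semaphore 2 is pre-set to 1 so that P2 blocks only on semaphore 1.

	#define _GNU_SOURCE	/* for semtimedop() */
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/wait.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>

	union semun { int val; };	/* as required by semctl(2) */

	int main(void)
	{
		/* set with 3 semaphores; the scenario uses #1 and #2 */
		int semid = semget(IPC_PRIVATE, 3, 0600);
		union semun arg = { .val = 1 };
		struct sembuf up1 = { .sem_num = 1, .sem_op = 1, .sem_flg = 0 };
		struct sembuf down12[2] = {
			{ .sem_num = 1, .sem_op = -1, .sem_flg = 0 },
			{ .sem_num = 2, .sem_op = -1, .sem_flg = 0 },
		};

		if (semid < 0) {
			perror("semget");
			return 1;
		}
		semctl(semid, 2, SETVAL, arg);	/* P2 will block on #1 only */

		if (fork() == 0) {	/* P2: complex two-sop down, sleeps */
			semtimedop(semid, down12, 2, NULL);
			_exit(0);
		}
		sleep(1);		/* let P2 reach the sleep in semtimedop */

		if (fork() == 0) {	/* P1: single-sop up on #1 */
			semtimedop(semid, &up1, 1, NULL);
			_exit(0);
		}
		/* P0: concurrent single-sop up on #1, racing with P1 */
		semtimedop(semid, &up1, 1, NULL);

		while (wait(NULL) > 0)
			;
		semctl(semid, 0, IPC_RMID);
		return 0;
	}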
Comment 1 Felix Hübner 2015-10-09 08:14:48 UTC
A more detailed problem description including all relevant code parts

In line 376 (file ipc/sem.c), sem_wait_array is called with the precondition
sma->complex_count != 0, resulting in an immediate return.

The following scenario is possible, allowing P0 and P1 to be in the
critical section (the section between sem_lock and sem_unlock) at the same time:

P0 is a process performing an up operation with semtimedop on the single semaphore 1.
P1 is a process performing an up operation with semtimedop on the single semaphore 1.
P2 is a process performing a down operation with semtimedop on semaphores 1 and 2.

start condition:

sma->complex_count=0

All line references apply to the file ipc/sem.c.

# P0 runs sem_lock up to line 332: if(sma->complex_count==0) evaluates
to true

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
			      int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		...
	}
	sem = sma->sem_base + sops->sem_num;

	if (sma->complex_count == 0) {

# P1 runs sem_lock up to line 329.

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
		      int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		...
	}

	sem = sma->sem_base + sops->sem_num;

# P2 runs sem_lock up to and including line 310.

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
			      int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		ipc_lock_object(&sma->sem_perm);

		sem_wait_array(sma);

# P0 does spin_lock(&sem->lock); in line 336.

		spin_lock(&sem->lock);

# P2 performs the rest of semtimedop, increments complex_count, ends up
in line 1961 and starts to sleep.

		return -1;
	}

  ##semtimedop:
	locknum = sem_lock(sma, sops, nsops);

	...

	error = perform_atomic_semop(sma, &queue);
	if (error == 0) {
		...
	}
	if (error <= 0)
		...

	if (nsops == 1) {
		...
	} else {
		if (!sma->complex_count)
			merge_queues(sma);

		if (alter)
			list_add_tail(&queue.list, &sma->pending_alter);
		else
			list_add_tail(&queue.list, &sma->pending_const);

		sma->complex_count++;
	}

	queue.status = -EINTR;
	queue.sleeper = current;

sleep_again:
	__set_current_state(TASK_INTERRUPTIBLE);
	sem_unlock(sma, locknum);
	rcu_read_unlock();

	if (timeout)
		jiffies_left = schedule_timeout(jiffies_left);
	else
		schedule();

# P1 evaluates if(sma->complex_count==0) in line 331 and ends up in line
361.

	if (sma->complex_count == 0) {
		...
	}

# P0 evaluates if(!spin_is_locked(&sma->sem_perm.lock)) in line 339 and
ends up after line 346.

	if (!spin_is_locked(&sma->sem_perm.lock)) {
		ipc_smp_acquire__after_spin_is_unlocked();


# P1 performs ipc_lock_object(&sma->sem_perm); line 362; it then
evaluates if(sma->complex_count==0) and executes the else statement,
calls sem_wait_array in line 376, which returns erroneously.

	ipc_lock_object(&sma->sem_perm);

	if (sma->complex_count == 0) {
		...
	} else {
		sem_wait_array(sma);
			
			static void sem_wait_array(struct sem_array *sma)
			{
				int i;
				struct sem *sem;

				if (sma->complex_count) {
					/* assumes the task that raised
					 * complex_count has already waited
					 * on all per-semaphore locks;
					 * exactly this assumption is
					 * violated in the scenario above
					 */
					return;
				}

		return -1;
	}

# P1 runs the rest of semtimedop and wakes P2 up in do_smart_update
(line 1911), via update_queue and unlink_queue; this decrements
complex_count.

	locknum = sem_lock(sma, sops, nsops);
	...
	...
	error = perform_atomic_semop(sma, &queue);
	if (error == 0) {
		if (alter)
			do_smart_update(sma, sops, nsops, 1, &tasks);

# P0 now evaluates if(sma->complex_count==0) in line 353 and is able to
enter the critical section, while P1 is still in do_smart_update().

		if (sma->complex_count == 0) {
			/* fast path successful! */
			return sops->sem_num;
		}	


Thanks to Manfred Spraul for his comments on the problem description.
Comment 2 Manfred Spraul 2015-12-28 13:44:56 UTC
Created attachment 198411 [details]
Draft for Fix

draft patch - still needs testing
Comment 3 Manfred Spraul 2015-12-31 19:29:18 UTC
Test case:
# ./sem-scalebench -c 12 -p 2 -i 1 -o 2 -h 4 -x 9973 -t 60
(replace "-h 4" with "-h <number of cores>" and "-c 12" with "-c <3 * number of cores>")

No regressions for me (CONFIG_PREEMPT+CONFIG_DEBUG_LIST)

https://github.com/manfred-colorfu/ipcscale
Comment 4 Manfred Spraul 2015-12-31 22:06:01 UTC
Created attachment 198551 [details]
Updated patch

Updated patch:
The initial patch left a race because the order was wrong: sem_wait_array() must be called first, and only then can complex_count be decreased.

Passed some stress testing.
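
For illustration, the corrected ordering looks roughly like this (a sketch only; the attached patch is authoritative):

	/* first wait until all simple ops holding a per-semaphore
	 * lock have finished, only then publish the decrement
	 */
	sem_wait_array(sma);
	sma->complex_count--;

If complex_count is decreased first, a task on the single-sop fast path can observe complex_count == 0 and enter its critical section while the complex path is still running.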
Comment 5 Manfred Spraul 2016-01-02 09:51:33 UTC
Created attachment 198611 [details]
Updated patch

New patch, different approach:

Instead of trying to fix the current logic, create a new variable that tracks all complex ops (both ongoing operations and sleeping complex ops).

This simplifies the fast path and is easier to understand.
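
Roughly, the fast path then needs only a single check (a sketch with a hypothetical field name; see the attached patch for the real implementation):

	/* complex_ops (hypothetical name) counts both ongoing and
	 * sleeping complex operations, so there is no window where
	 * a complex op is in flight but not yet counted
	 */
	if (sma->complex_ops == 0) {
		/* single-sop fast path: the per-semaphore
		 * lock is sufficient
		 */
		spin_lock(&sem->lock);
		...
	}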