Bug 105651 - Concurrency issue in sem_lock
Summary: Concurrency issue in sem_lock
Status: NEW
Alias: None
Product: Process Management
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Importance: P1 high
Assignee: process_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-10-08 08:06 UTC by Felix Hübner
Modified: 2016-01-02 09:51 UTC
CC List: 2 users

See Also:
Kernel Version: 4.3.0-rc
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Draft for Fix (2.16 KB, patch)
2015-12-28 13:44 UTC, Manfred Spraul
Updated patch (3.97 KB, patch)
2015-12-31 22:06 UTC, Manfred Spraul
Updated patch (9.14 KB, patch)
2016-01-02 09:51 UTC, Manfred Spraul

Description Felix Hübner 2015-10-08 08:06:41 UTC
In line 376 of file ipc/sem.c, sem_wait_array() is called with the precondition sma->complex_count != 0, resulting in an immediate return.

The following scenario is possible, allowing P0 and P1 to be in the critical section (the section between sem_lock and sem_unlock) at the same time:

P0 is a process performing an up with semtimedop on the single semaphore 1.
P1 is a process performing an up with semtimedop on the single semaphore 1.
P2 is a process performing a down with semtimedop on semaphores 1 and 2 (see the sketch below).
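
For concreteness, the three operations could be issued from userspace roughly as follows (an illustrative sketch, not part of the report; the key, permissions and flags are assumptions, "semaphore 1"/"semaphore 2" map to sem_num 0/1, and each process would of course perform only its own call):

	#define _GNU_SOURCE
	#include <stddef.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>

	int main(void)
	{
		/* semaphore set with two semaphores (indices 0 and 1) */
		int semid = semget(IPC_PRIVATE, 2, 0600);

		/* P0 and P1: "up" on the single semaphore */
		struct sembuf up = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
		semtimedop(semid, &up, 1, NULL);   /* nsops == 1: per-semaphore fast path */

		/* P2: "down" on both semaphores */
		struct sembuf down[2] = {
			{ .sem_num = 0, .sem_op = -1, .sem_flg = 0 },
			{ .sem_num = 1, .sem_op = -1, .sem_flg = 0 },
		};
		semtimedop(semid, down, 2, NULL);  /* nsops == 2: complex slow path */

		return 0;
	}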

P0 runs sem_lock up to line 332: if (sma->complex_count == 0) evaluates to true.
P1 runs sem_lock up to line 329.
P2 runs sem_lock up to (and including) line 310.

P0 executes spin_lock(&sem->lock); in line 336.
P2 performs the rest of semtimedop, increments complex_count, ends up at line 1961 and starts to sleep.
P1 evaluates if (sma->complex_count == 0) in line 331 and ends up at line 361.

P0 evaluates if (!spin_is_locked(&sma->sem_perm.lock)) in line 339 and ends up after line 346.
P1 performs ipc_lock_object(&sma->sem_perm); in line 362; it then evaluates if (sma->complex_count == 0), executes the else branch and calls sem_wait_array in line 376, which returns erroneously.
P1 runs the rest of semtimedop and wakes P2 up in do_smart_update (line 1911), update_queue and unlink_queue respectively. This decrements complex_count.

P0 now evaluates if (sma->complex_count == 0) in line 353 and is able to enter the critical section, while P1 is still in do_smart_update().


The bug was first discovered in kernel 3.10.63, though it may be present in earlier versions. The description above applies to kernel version 4.3.0-rc, but except for slight changes in line numbers it should apply to all versions from 3.10.63 up to 4.3.0-rc.
Comment 1 Felix Hübner 2015-10-09 08:14:48 UTC
A more detailed problem description including all relevant code parts

In line 376 of file ipc/sem.c, sem_wait_array() is called with the precondition sma->complex_count != 0, resulting in an immediate return.

The following scenario is possible, allowing P0 and P1 to be in the critical section (the section between sem_lock and sem_unlock) at the same time:

P0 is a process performing an up with semtimedop on the single semaphore 1.
P1 is a process performing an up with semtimedop on the single semaphore 1.
P2 is a process performing a down with semtimedop on semaphores 1 and 2.

start condition:

sma->complex_count=0

All line references apply to the file ipc/sem.c.

# P0 runs sem_lock up to line 332: if (sma->complex_count == 0) evaluates to true.

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
			      int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		...
	}
	sem = sma->sem_base + sops->sem_num;

	if (sma->complex_count == 0) {

# P1 runs sem_lock up to line 329.

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
		      int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		...
	}

	sem = sma->sem_base + sops->sem_num;

# P2 runs sem_lock up to (and including) line 310.

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
			      int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		ipc_lock_object(&sma->sem_perm);

		sem_wait_array(sma);

# P0 does spin_lock(&sem->lock); in line 336.

		spin_lock(&sem->lock);

# P2 performs the rest of semtimedop, increments complex_count, ends up at line 1961 and starts to sleep.

		return -1;
	}

  ##semtimedop:
	locknum = sem_lock(sma, sops, nsops);

	...

	error = perform_atomic_semop(sma, &queue);
	if (error == 0) {
		...
	}
	if (error <= 0)
		...

	if (nsops == 1) {
		...
	} else {
		if (!sma->complex_count)
			merge_queues(sma);

		if (alter)
			list_add_tail(&queue.list, &sma->pending_alter);
		else
			list_add_tail(&queue.list, &sma->pending_const);

		sma->complex_count++;
	}

	queue.status = -EINTR;
	queue.sleeper = current;

sleep_again:
	__set_current_state(TASK_INTERRUPTIBLE);
	sem_unlock(sma, locknum);
	rcu_read_unlock();

	if (timeout)
		jiffies_left = schedule_timeout(jiffies_left);
	else
		schedule();

# P1 evaluates if (sma->complex_count == 0) in line 331 and ends up at line 361.

	if (sma->complex_count == 0) {
		...
	}

# P0 evaluates if (!spin_is_locked(&sma->sem_perm.lock)) in line 339 and ends up after line 346.

	if (!spin_is_locked(&sma->sem_perm.lock)) {				
		ipc_smp_acquire__after_spin_is_unlocked();


# P1 performs ipc_lock_object(&sma->sem_perm); in line 362; it then evaluates if (sma->complex_count == 0) and executes the else branch, calling sem_wait_array in line 376, which returns erroneously.

	ipc_lock_object(&sma->sem_perm);

	if (sma->complex_count == 0) {
		...
	} else {
		sem_wait_array(sma);
			
			static void sem_wait_array(struct sem_array *sma)
			{
				int i;
				struct sem *sem;

				if (sma->complex_count) {
					return;
				}

		return -1;
	}

# P1 runs the rest of semtimedop and wakes P2 up in do_smart_update (line 1911), update_queue and unlink_queue respectively. This decrements complex_count.

	locknum = sem_lock(sma, sops, nsops);
	...
	...
	error = perform_atomic_semop(sma, &queue);
	if (error == 0) {
		if (alter)
			do_smart_update(sma, sops, nsops, 1, &tasks);

# P0 now evaluates if (sma->complex_count == 0) in line 353 and is able to enter the critical section, while P1 is still in do_smart_update().

		if (sma->complex_count == 0) {
			/* fast path successful! */
			return sops->sem_num;
		}	


Thanks to Manfred Spraul for his comments on the problem description.
Comment 2 Manfred Spraul 2015-12-28 13:44:56 UTC
Created attachment 198411
Draft for Fix

draft patch - still needs testing
Comment 3 Manfred Spraul 2015-12-31 19:29:18 UTC
Test case:
# ./sem-scalebench -c 12 -p 2 -i 1 -o 2 -h 4 -x 9973 -t 60
(replace "-h 4" with "-h <number of cores" and "-c 12" with "-c 3*no_cores")

No regressions for me (CONFIG_PREEMPT+CONFIG_DEBUG_LIST)

https://github.com/manfred-colorfu/ipcscale
Comment 4 Manfred Spraul 2015-12-31 22:06:01 UTC
Created attachment 198551
Updated patch

Updated patch:
The initial patch left a race because the order was wrong: first sem_wait_array() must be called, and only then can complex_count be decreased.
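
For illustration, the constraint is roughly this (a sketch of the idea only, not the attached patch; it assumes sem_wait_array() no longer returns early while complex_count is non-zero):

	/* correct order when a complex op completes: */
	sem_wait_array(sma);	/* 1) wait until no simple op still holds its sem->lock */
	sma->complex_count--;	/* 2) only now may the single-sop fast path run again */

	/* the reverse order is racy: between the decrement and the call to
	 * sem_wait_array(), a simple op can observe complex_count == 0 and
	 * enter its critical section via the per-semaphore fast path */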

Passed some stress testing.
Comment 5 Manfred Spraul 2016-01-02 09:51:33 UTC
Created attachment 198611
Updated patch

New patch, different approach:

Instead of trying to fix the current logic, create a new variable that tracks all complex ops (both ongoing operations and sleeping complex ops).

This simplifies the fast path and is easier to understand.
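
As a rough illustration of the approach (the field name and the fragment below are invented for this description, not taken from the patch):

	/* hypothetical new field in struct sem_array: counts *all* complex
	 * ops, i.e. both those currently executing and those sleeping on
	 * the pending queues; complex_count only covers the sleepers */
	int complex_ops;

	/* single-sop fast path: permitted only while no complex op exists
	 * in any form, so one check suffices (the re-checks done under the
	 * per-semaphore lock are omitted here) */
	if (sma->complex_ops == 0)
		spin_lock(&sem->lock);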
