In line 376 (file ipc/sem.c) sem_wait_array is called with the precondition sma->complex_count != 0, resulting in an immediate return. The following scenario is therefore possible, allowing P0 and P1 to be in the critical section (the section between sem_lock and sem_unlock) at the same time:

P0 is a process performing an up with semtimedop on the single semaphore 1.
P1 is a process performing an up with semtimedop on the single semaphore 1.
P2 is a process performing a down with semtimedop on semaphores 1 and 2.

1. P0 runs sem_lock up to line 332: if (sma->complex_count == 0) evaluates to true.
2. P1 runs sem_lock up to line 329.
3. P2 runs sem_lock to (and including) line 310.
4. P0 does spin_lock(&sem->lock); in line 336.
5. P2 performs the rest of semtimedop, increments complex_count, ends up in line 1961 and starts to sleep.
6. P1 evaluates if (sma->complex_count == 0) in line 331 and ends up in line 361.
7. P0 evaluates if (!spin_is_locked(&sma->sem_perm.lock)) in line 339 and ends up after line 346.
8. P1 performs ipc_lock_object(&sma->sem_perm); in line 362; it then evaluates if (sma->complex_count == 0), executes the else branch and calls sem_wait_array in line 376, which returns erroneously.
9. P1 runs the rest of semtimedop and wakes P2 up in do_smart_update (line 1911), update_queue and unlink_queue respectively. This decrements complex_count.
10. P0 now evaluates if (sma->complex_count == 0) in line 353 and is able to enter the critical section, while P1 is still in do_smart_update().

The bug was first discovered in kernel 3.10.63, though it may be present in earlier versions. The description above applies to kernel version 4.3.0-rc; except for some slight changes in the line numbers it should apply to all versions from 3.10.63 up to 4.3.0-rc.
A more detailed problem description including all relevant code parts.

In line 376 (file ipc/sem.c) sem_wait_array is called with the precondition sma->complex_count != 0, resulting in an immediate return. The following scenario is possible, allowing P0 and P1 to be in the critical section (the section between sem_lock and sem_unlock) at the same time:

P0 is a process performing an up with semtimedop on the single semaphore 1.
P1 is a process performing an up with semtimedop on the single semaphore 1.
P2 is a process performing a down with semtimedop on semaphores 1 and 2.

Start condition: sma->complex_count = 0. All line references apply to the file ipc/sem.c.

# P0 runs sem_lock up to line 332: if (sma->complex_count == 0) evaluates to true

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
			   int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		...
	}

	sem = sma->sem_base + sops->sem_num;

	if (sma->complex_count == 0) {

# P1 runs sem_lock up to line 329.

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
			   int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		...
	}

	sem = sma->sem_base + sops->sem_num;

# P2 runs sem_lock to (including) line 310.

static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
			   int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		ipc_lock_object(&sma->sem_perm);
		sem_wait_array(sma);

# P0 does spin_lock(&sem->lock); in line 336.

		spin_lock(&sem->lock);

# P2 performs the rest of semtimedop, increments complex_count, ends up in line 1961 and starts to sleep.

		return -1;
	}

## semtimedop:

	locknum = sem_lock(sma, sops, nsops);
	...
	error = perform_atomic_semop(sma, &queue);
	if (error == 0) {
		...
	}
	if (error <= 0)
		...
	if (nsops == 1) {
		...
	} else {
		if (!sma->complex_count)
			merge_queues(sma);
		if (alter)
			list_add_tail(&queue.list, &sma->pending_alter);
		else
			list_add_tail(&queue.list, &sma->pending_const);
		sma->complex_count++;
	}
	queue.status = -EINTR;
	queue.sleeper = current;

sleep_again:
	__set_current_state(TASK_INTERRUPTIBLE);
	sem_unlock(sma, locknum);
	rcu_read_unlock();

	if (timeout)
		jiffies_left = schedule_timeout(jiffies_left);
	else
		schedule();

# P1 evaluates if (sma->complex_count == 0) in line 331 and ends up in line 361.

	if (sma->complex_count == 0) {
		...
	}

# P0 evaluates if (!spin_is_locked(&sma->sem_perm.lock)) in line 339 and ends up after line 346.

		if (!spin_is_locked(&sma->sem_perm.lock)) {
			ipc_smp_acquire__after_spin_is_unlocked();

# P1 performs ipc_lock_object(&sma->sem_perm); in line 362; it then evaluates if (sma->complex_count == 0), executes the else branch and calls sem_wait_array in line 376, which returns erroneously.

	ipc_lock_object(&sma->sem_perm);

	if (sma->complex_count == 0) {
		...
	} else {
		sem_wait_array(sma);

static void sem_wait_array(struct sem_array *sma)
{
	int i;
	struct sem *sem;

	if (sma->complex_count) {
		return;
	}

		return -1;
	}

# P1 runs the rest of semtimedop and wakes P2 up in do_smart_update (line 1911), update_queue and unlink_queue respectively. This decrements complex_count.

	locknum = sem_lock(sma, sops, nsops);
	...
	error = perform_atomic_semop(sma, &queue);
	if (error == 0) {
		if (alter)
			do_smart_update(sma, sops, nsops, 1, &tasks);

# P0 now evaluates if (sma->complex_count == 0) in line 353 and is able to enter the critical section, while P1 is still in do_smart_update().

		if (sma->complex_count == 0) {
			/* fast path successful! */
			return sops->sem_num;
		}

Thanks to Manfred Spraul for his comments on the problem description.
Created attachment 198411 [details]
Draft for Fix

Draft patch - still needs testing.
Test case:

# ./sem-scalebench -c 12 -p 2 -i 1 -o 2 -h 4 -x 9973 -t 60

(replace "-h 4" with "-h <number of cores>" and "-c 12" with "-c <3 * number of cores>")

No regressions for me (CONFIG_PREEMPT + CONFIG_DEBUG_LIST).

https://github.com/manfred-colorfu/ipcscale
Created attachment 198551 [details]
Updated patch

Updated patch: the initial patch left a race because the order was wrong: sem_wait_array() must be called first, and only then can complex_count be decreased. Passed some stress testing.
Created attachment 198611 [details]
Updated patch

New patch, different approach: instead of trying to fix the current logic, introduce a new variable that tracks all complex ops (both ongoing operations and sleeping complex ops). This simplifies the fast path and is easier to understand.