Andreas Schneider reported that he was seeing tdbtorture runs fail on v5.0 kernels:

$ ./bin/tdbtorture
Testing with 3 processes, 5000 loops, 2 hash_size, seed=1553097402
tdb_brlock failed (fd=3) at offset 168 rw_type=1 flags=1 len=1
tdb_lock failed on list 0 ltype=1 (Resource deadlock avoided)
tdb_store failed: Resource deadlock avoided
child 1206 exited with status 2
child 1204 exited with signal 15
child 1205 exited with signal 15

These runs always pass on v4.x kernels, so this looks like a regression in the v5.0 file locking code.
Created attachment 281927 [details]
debug patch

This patch adds some debugging. I was having trouble reproducing the failure, but then I hit the new WARN_ON in this code. So it found a (supposed) deadlock on one pass and then failed to find it on the second search. That may mean that Bruce's theory was right here and that we are racing against changes to the chain of fl_blockers.
Right, I think there may be a TOCTOU race between posix_locks_deadlock and locks_insert_lock_ctx. But that would cause deadlock-prone code to actually deadlock, rather than cause EDEADLK to be returned when it shouldn't be. I'm having a lot more trouble seeing how posix_locks_deadlock could detect a cycle when there isn't one.

I really think we need to understand how tdbtorture guarantees that it won't create cycles. I understand that this didn't happen before, but we've changed the performance characteristics of the locking code quite a bit, and that could make it much easier to hit cases that tdbtorture didn't hit before.
Created attachment 281931 [details]
fixed debug patch
So with this debug patch, I see this pop up in the ring buffer when a deadlock is detected:

[   77.226853] fl=00000000f306701c fl_owner=0000000067ab71be fl_blocker= (null) fl_flags=0x81 fl_type=0x1 fl_pid=1149 fl_start=172 fl_end=172

I'll attach the trace output as well, but when the -35 shows up in there, the lock sees itself in the blocked hash:

tdbtorture-1149 [000] .N.. 77.326856: posix_lock_inode: fl=0x00000000f306701c dev=0xfd:0x0 ino=0x1221 fl_blocker=0x (null) fl_owner=0x0000000067ab71be fl_pid=1149 fl_flags=FL_POSIX|FL_SLEEP fl_type=F_WRLCK fl_start=172 fl_end=172 ret=-35

A little prior to that, I see this:

tdbtorture-1149 [000] .... 77.324645: posix_lock_inode: fl=0x00000000f306701c dev=0xfd:0x0 ino=0x1221 fl_blocker=0x000000006e8b59f8 fl_owner=0x0000000067ab71be fl_pid=1149 fl_flags=FL_POSIX|FL_SLEEP fl_type=F_WRLCK fl_start=172 fl_end=172 ret=1

...so the task had attempted to acquire this lock before and blocked (ret=1 is FILE_LOCK_DEFERRED). Do we have a situation where we aren't removing entries from the blocked_hash prior to retrying the request? Maybe we can sprinkle in some assertions to catch that potential problem.
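For illustration, an assertion along those lines might look something like the sketch below. This is only a sketch, not a posted patch; the locks_insert_global_blocked() / blocked_hash / fl_link names are assumptions based on my reading of the v5.0-era fs/locks.c.

/* Hypothetical debug-only check, not a posted patch: warn if a request is
 * (re)inserted into the blocked hash while it still appears to be hashed,
 * i.e. if we forgot to remove it before retrying the lock request. */
static void locks_insert_global_blocked(struct file_lock *waiter)
{
	lockdep_assert_held(&blocked_lock_lock);

	/* a blocked request should have been unhashed before being re-queued */
	WARN_ON_ONCE(!hlist_unhashed(&waiter->fl_link));

	hash_add(blocked_hash, &waiter->fl_link, posix_owner_key(waiter));
}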
Created attachment 281933 [details]
possible fix

Draft patch. This seems to make the test pass when run in a loop.
The "a little prior to that" log message says that fl_blocker is not NULL. i.e. a lock returned by what_owner_is_waiting_for() has a non-NULL fl_blocker. Is that right? The code in what_owner_is_waiting_for() will never return a lock with fl_blocker non-NULL. So this suggests that something is setting ->fl_blocker = NULL without holding blocked_lock_lock. That can only be __locks_delete_block(), which is only ever called with blocked_lock_lock held. What am I missing here?
Uhmm.. I confused myself. I need to look for what sets fl_blocker to non-NULL, which is locks_move_blocks() and __locks_insert_block(). But still... locks_move_blocks() takes the lock, and __locks_insert_block() is only called with the lock held. How did we print a message with fl_blocker not NULL??
Created attachment 281935 [details]
Tested patch with better description

This seems to work correctly, but please do sanity check me here and make sure I haven't completely broken deadlock detection (again).
Ahh ... never mind. That message with fl_blocker not NULL was a tracepoint message, wasn't it - not the message added by the patch.

So..... I enhanced the tracing a bit.

Process A is trying to get a write lock on a byte, but process B holds a read lock. Process B is also trying to get a write lock on the same byte, but it is blocked by the attempt from process A. The reason this shouldn't be a deadlock is, presumably, that process B's attempt doesn't actually conflict with any currently held lock.

Presumably process C held a read lock as well. Process A blocked on process C, and then process B blocked on process A. Process C dropped its lock and process A tried again and now blocks on process B. Process B wasn't woken up (that was the whole point of my recent patches), so it is still recorded as waiting on process A, and that now looks like a deadlock to the detection code.

This is a bit like that other problem Bruce found .... but that I only vaguely remember at the moment - "blocks" is not a transitive relation.

Hopefully there is a fix that doesn't require reverting my whole patch.

BTW, I don't think your patch is correct, Jeff. I think it addresses a symptom, not the bug.
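To make that sequence concrete, here is a rough userspace sketch of the A/B/C interleaving. It is not the tdbtorture code, the ordering is forced only with sleeps so it will not reproduce deterministically, and the file path, byte offset and delays are all made-up assumptions.

/*
 * Illustrative sketch only, not the tdbtorture code and not a deterministic
 * reproducer: three processes act out the A/B/C interleaving described above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

static int byte_lock(int fd, short type)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = type;		/* F_RDLCK, F_WRLCK or F_UNLCK */
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;			/* one byte is enough to show the pattern */
	fl.l_len = 1;
	return fcntl(fd, F_SETLKW, &fl);	/* blocking request */
}

static void run_a(int fd)			/* process A: write lock only */
{
	usleep(100000);				/* let B and C take read locks first */
	if (byte_lock(fd, F_WRLCK) == -1)
		perror("A: F_WRLCK");		/* EDEADLK here would be spurious */
	byte_lock(fd, F_UNLCK);
}

static void run_b(int fd)			/* process B: read lock, then upgrade */
{
	byte_lock(fd, F_RDLCK);
	usleep(200000);				/* let A's write request block first */
	if (byte_lock(fd, F_WRLCK) == -1)	/* may queue behind A's pending request */
		perror("B: F_WRLCK");
	byte_lock(fd, F_UNLCK);
}

static void run_c(int fd)			/* process C: read lock, then drop it */
{
	byte_lock(fd, F_RDLCK);
	usleep(300000);				/* unlock after B has queued behind A */
	byte_lock(fd, F_UNLCK);			/* A retries and may now block on B */
}

int main(void)
{
	const char *path = "/tmp/edeadlk-repro";	/* scratch file, name assumed */
	void (*role[3])(int) = { run_a, run_b, run_c };
	int i;

	close(open(path, O_RDWR | O_CREAT, 0600));
	for (i = 0; i < 3; i++) {
		if (fork() == 0) {
			role[i](open(path, O_RDWR));
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;				/* reap all three children */
	unlink(path);
	return 0;
}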
Having pondered a bit more, I've changed some of my mind. I think the code in your patch, Jeff, is correct. I don't think that the description is quite right, though.

The problem isn't directly related to the hash. The lock request in question *isn't* in the hash at this point, but something that is blocked by it is. If posix_locks_deadlock() fails to find a deadlock, the caller_fl will be passed to __locks_insert_block(), and this wakes up all locks that are blocked on caller_fl, clearing the fl_blocker link. So if posix_locks_deadlock() finds caller_fl while searching for a deadlock, it can be sure that link in the cycle is about to be broken and it need not treat it as the cause of a deadlock.

This bug is part of the bug Bruce found last year (and got an lwn.net mention for!), but at the time we didn't consider the deadlock-detection side.

Fixes: 5946c4319ebb ("fs/locks: allow a lock request to block other requests.")
Thanks Neil. I've incorporated your description and added your SoB line (hope that's ok -- let me know if it isn't). I'm sending this to the list now, and will put it in linux-next for a few days before sending it on to Linus.
On third thought, I think I like the idea of waking up all of the waiters blocked on the request prior to doing the deadlock check. We're waking them up anyway when we go to reinsert the request in the hash, so we might as well wake them early and keep deadlock detection simpler. A v3 patch was sent to the list today; it also seems to fix this.
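For reference, the v3 approach amounts to something like the fragment below in posix_lock_inode(). This is reconstructed from the discussion rather than quoted from the merged patch; the helper names are taken from the v5.0-era fs/locks.c.

	/*
	 * Sketch of the v3 idea (reconstructed, not the literal patch):
	 * before searching for a deadlock, wake everything currently blocked
	 * on this request.  Re-inserting the request below would wake them
	 * anyway, so any "cycle" running through those waiters is about to
	 * be broken and must not produce a spurious EDEADLK.
	 */
	error = -EDEADLK;
	spin_lock(&blocked_lock_lock);
	__locks_wake_up_blocks(request);
	if (likely(!posix_locks_deadlock(request, fl))) {
		error = FILE_LOCK_DEFERRED;
		__locks_insert_block(fl, request, posix_locks_conflict);
	}
	spin_unlock(&blocked_lock_lock);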
(In reply to bfields from comment #2)
> Right, I think there may be a TOCTOU race between posix_locks_deadlock and
> locks_insert_lock_ctx.

By the way, I was totally confusing inserting a block with applying a lock in the uncontended case. The former is done in a single critical section under the blocked_lock_lock. The latter doesn't add an edge to the graph of blocked tasks. Apologies for the noise.
Merged into mainline and should make v5.1-rc3. Also marked for stable, so should get picked up for v5.0 stable series kernels as well.