Created attachment 73323 [details]
Test program to reproduce the lock up
Calling ftruncate shortly after submitting a lot of direct IO requests on a file located on ext4 filesystem causes the ftruncate syscall to lock up. Using other filesystems, e.g. xfs, does not exhibit this behavior.
This problem can be reproduced on all kernel versions above 3.1-rc3 (3.1-rc3 itself is fine), e.g. on v3.2.14 -- the kernel that is used in the latest Ubuntu LTS release.
The attached program can be used to reproduce the problem. It is possible to reproduce the problem by running the program on a temporary ext4 filesystem inside UML, it is also advisable to do so since other syscalls accessing the file system may lock up as well after starting the program.
I was able to bisect the problem to this commit: 8c0bec2151a47906bf779c6715a10ce04453ab77.
If you plan to be building the user-mode linux kernel for this range of kernel commits, you may need to apply the changes from commit e5f0bdc7840bdb791247cb98dfc1dab6ea6c7da4 which fix the building problem for ARCH=um.
Keywords: ext4, ftruncate, direct IO, dio
Architecture: amd64 (but likely is architecture-independent)
Author: Jiaying Zhang <firstname.lastname@example.org>
Date: Wed Aug 31 11:50:51 2011 -0400
ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining
The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
in ext4_evict_inode() causes lockdep complaining about potential
deadlock in several places. In most/all of these LOCKDEP complaints
it looks like it's a false positive, since many of the potential
circular locking cases can't take place by the time the
ext4_evict_inode() is called; but since at the very least it may mask
real problems, we need to address this.
This change removes the flush_completed_IO() and i_mutex lock in
ext4_evict_inode(). Instead, we take a different approach to resolve
the software lockup that commit 2581fdc810 intends to fix. Rather
than having ext4-dio-unwritten thread wait for grabing the i_mutex
lock of an inode, we use mutex_trylock() instead, and simply requeue
the work item if we fail to grab the inode's i_mutex lock.
This should speed up work queue processing in general and also
prevents the following deadlock scenario: During page fault,
shrink_icache_memory is called that in turn evicts another inode B.
Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.
Signed-off-by: Jiaying Zhang <email@example.com>
Signed-off-by: "Theodore Ts'o" <firstname.lastname@example.org>
I tested on the 3.2.10 fedora kernel, though, and it didn't lock up (tried many times):
# rm -f lockfile; ./ftruncate-test lockfile
This might lock up..
It didn't lock up.
Can you lock it up, and do a sysrq-w and maybe sysrq-d and attach the resulting dmesg?
Created attachment 73345 [details]
Output of SysRq-w at lock up
You are correct that 3.2.10 does not exhibit the problem (both on RedHat-patched and vanilla kernel versions). That means that the bug was fixed between 8c0bec21 and v3.2.10 and then reappeared again between the v3.2.10 and v3.2.14.
I repeated the bisect, this time between v3.2.10 and v3.2.14, and found this commit which exhibited the problem again:
Author: Jeff Moyer <email@example.com>
Date: Mon Feb 20 17:59:24 2012 -0500
ext4: fix race between unwritten extent conversion and truncate
commit 266991b13890049ee1a6bb95b9817f06339ee3d7 upstream.
The following comment in ext4_end_io_dio caught my attention:
/* XXX: probably should move into the real I/O completion handler */
The truncate code takes i_mutex, then calls inode_dio_wait. Because the
ext4 code path above will end up dropping the mutex before it is
reacquired by the worker thread that does the extent conversion, it
seems to me that the truncate can happen out of order. Jan Kara
mentioned that this might result in error messages in the system logs,
but that should be the extent of the "damage."
The fix is pretty straight-forward: don't call inode_dio_done until the
extent conversion is complete.
Reviewed-by: Jan Kara <firstname.lastname@example.org>
Signed-off-by: Jeff Moyer <email@example.com>
Signed-off-by: "Theodore Ts'o" <firstname.lastname@example.org>
Signed-off-by: Greg Kroah-Hartman <email@example.com>
The output of sysrq-w during the lock up on this commit is attached.
Hum, I see. So we are holding i_mutex and waiting for dio to finish and worker thread cannot take i_mutex to finish the extent conversion and call inode_dio_done(). Slightly subtle is that the worker tries to be clever and if it cannot acquire i_mutex, it requeues the work item so it does not really deadlock the system, it just eats up CPU cycling the work over and over...
I'm uncertain how to best fix this. If we just revert 266991b13, we reintroduce the problem of AIO DIO completion racing with truncate (so extent conversion would happen on already freed blocks). But I'm thinking that with DIO vs truncate protection in VFS, we probably don't need i_mutex for extent conversion: The only think in ext4_end_io_nolock() that can possibly need i_mutex is ext4_convert_unwritten_extents(). That function just starts a transaction (i_mutex not needed) and calls ext4_map_blocks() which takes i_data_sem for protection.
I'll let this brew in my head for a while and if I (or anyone else - hint, hint) does not find a problem with this, I'll write a patch to remove the lock...
This bug relates to a very old kernel. Closing as obsolete.