Bug 43260

Summary:	ftruncate locks up when used with direct IO on ext4
Product:	File System	Reporter:	Ivan Tarasov (ivan)
Component:	ext4	Assignee:	fs_ext4 (fs_ext4)
Status:	RESOLVED OBSOLETE
Severity:	high	CC:	alan, jack, sandeen, tytso
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	all above 3.1-rc3	Subsystem:
Regression:	No	Bisected commit-id:
Attachments:	Test program to reproduce the lock up Output of SysRq-w at lock up

Description Ivan Tarasov 2012-05-17 23:31:10 UTC

Created attachment 73323 [details]
Test program to reproduce the lock up

Calling ftruncate shortly after submitting a lot of direct IO requests on a file located on ext4 filesystem causes the ftruncate syscall to lock up. Using other filesystems, e.g. xfs, does not exhibit this behavior.

This problem can be reproduced on all kernel versions above 3.1-rc3 (3.1-rc3 itself is fine), e.g. on v3.2.14 -- the kernel that is used in the latest Ubuntu LTS release.

The attached program can be used to reproduce the problem. It is possible to reproduce the problem by running the program on a temporary ext4 filesystem inside UML, it is also advisable to do so since other syscalls accessing the file system may lock up as well after starting the program.

I was able to bisect the problem to this commit: 8c0bec2151a47906bf779c6715a10ce04453ab77.

If you plan to be building the user-mode linux kernel for this range of kernel commits, you may need to apply the changes from commit e5f0bdc7840bdb791247cb98dfc1dab6ea6c7da4 which fix the building problem for ARCH=um.

Keywords: ext4, ftruncate, direct IO, dio
Architecture: amd64 (but likely is architecture-independent)

Comment 1 Eric Sandeen 2012-05-18 02:06:41 UTC

commit details:

commit 8c0bec2151a47906bf779c6715a10ce04453ab77
Author: Jiaying Zhang <jiayingz@google.com>
Date:   Wed Aug 31 11:50:51 2011 -0400

    ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining
    
    The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
    in ext4_evict_inode() causes lockdep complaining about potential
    deadlock in several places.  In most/all of these LOCKDEP complaints
    it looks like it's a false positive, since many of the potential
    circular locking cases can't take place by the time the
    ext4_evict_inode() is called; but since at the very least it may mask
    real problems, we need to address this.
    
    This change removes the flush_completed_IO() and i_mutex lock in
    ext4_evict_inode().  Instead, we take a different approach to resolve
    the software lockup that commit 2581fdc810 intends to fix.  Rather
    than having ext4-dio-unwritten thread wait for grabing the i_mutex
    lock of an inode, we use mutex_trylock() instead, and simply requeue
    the work item if we fail to grab the inode's i_mutex lock.
    
    This should speed up work queue processing in general and also
    prevents the following deadlock scenario: During page fault,
    shrink_icache_memory is called that in turn evicts another inode B.
    Inode B has some pending io_end work so it calls ext4_ioend_wait()
    that waits for inode B's i_ioend_count to become zero.  However, inode
    B's ioend work was queued behind some of inode A's ioend work on the
    same cpu's ext4-dio-unwritten workqueue.  As the ext4-dio-unwritten
    thread on that cpu is processing inode A's ioend work, it tries to
    grab inode A's i_mutex lock.  Since the i_mutex lock of inode A is
    still hold before the page fault happened, we enter a deadlock.
    
    Signed-off-by: Jiaying Zhang <jiayingz@google.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

I tested on the 3.2.10 fedora kernel, though, and it didn't lock up (tried many times):

# rm -f lockfile; ./ftruncate-test lockfile
This might lock up..
It didn't lock up.


Can you lock it up, and do a sysrq-w and maybe sysrq-d and attach the resulting dmesg?

Comment 2 Ivan Tarasov 2012-05-21 23:08:04 UTC

Created attachment 73345 [details]
Output of SysRq-w at lock up

Eric,

You are correct that 3.2.10 does not exhibit the problem (both on RedHat-patched and vanilla kernel versions). That means that the bug was fixed between 8c0bec21 and v3.2.10 and then reappeared again between the v3.2.10 and v3.2.14.

I repeated the bisect, this time between v3.2.10 and v3.2.14, and found this commit which exhibited the problem again:

commit 8608fb78b2cbf9eb8794e592bf43a8b1884c5a85
Author: Jeff Moyer <jmoyer@redhat.com>
Date:   Mon Feb 20 17:59:24 2012 -0500

    ext4: fix race between unwritten extent conversion and truncate
    
    commit 266991b13890049ee1a6bb95b9817f06339ee3d7 upstream.
    
    The following comment in ext4_end_io_dio caught my attention:
    
        /* XXX: probably should move into the real I/O completion handler */
            inode_dio_done(inode);
    
    The truncate code takes i_mutex, then calls inode_dio_wait.  Because the
    ext4 code path above will end up dropping the mutex before it is
    reacquired by the worker thread that does the extent conversion, it
    seems to me that the truncate can happen out of order.  Jan Kara
    mentioned that this might result in error messages in the system logs,
    but that should be the extent of the "damage."
    
    The fix is pretty straight-forward: don't call inode_dio_done until the
    extent conversion is complete.
    
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

The output of sysrq-w during the lock up on this commit is attached.

Comment 3 Jan Kara 2012-05-24 10:16:57 UTC

Hum, I see. So we are holding i_mutex and waiting for dio to finish and worker thread cannot take i_mutex to finish the extent conversion and call inode_dio_done(). Slightly subtle is that the worker tries to be clever and if it cannot acquire i_mutex, it requeues the work item so it does not really deadlock the system, it just eats up CPU cycling the work over and over...

I'm uncertain how to best fix this. If we just revert 266991b13, we reintroduce the problem of AIO DIO completion racing with truncate (so extent conversion would happen on already freed blocks). But I'm thinking that with DIO vs truncate protection in VFS, we probably don't need i_mutex for extent conversion: The only think in ext4_end_io_nolock() that can possibly need i_mutex is ext4_convert_unwritten_extents(). That function just starts a transaction (i_mutex not needed) and calls ext4_map_blocks() which takes i_data_sem for protection.

I'll let this brew in my head for a while and if I (or anyone else - hint, hint) does not find a problem with this, I'll write a patch to remove the lock...

Comment 4 Alan 2015-02-19 17:26:46 UTC

This bug relates to a very old kernel. Closing as obsolete.