Bug 15018 - ext4 backtraces out of nowhere
ext4 backtraces out of nowhere
Status: RESOLVED CODE_FIX
Product: File System
Classification: Unclassified
Component: ext4
All Linux
: P1 normal
Assigned To: Jan Kara
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-01-09 19:01 UTC by Steinar H. Gunderson
Modified: 2010-03-18 11:12 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.33-rc3
Tree: Mainline
Regression: No


Attachments
Patch fixing deadlock in ext4_da_reserve_space (3.90 KB, patch)
2010-01-26 23:20 UTC, Jan Kara
Details | Diff

Description Steinar H. Gunderson 2010-01-09 19:01:07 UTC
Hi,

While doing an rsync to an ext4 filesystem, suddenly the above showed up in my dmesg:

Jan  9 15:55:20 pannekake kernel: [61568.551238] Pid: 28573, comm: rsync Not tainted 2.6.33-rc3 #2
Jan  9 15:55:20 pannekake kernel: [61568.557085] Call Trace:
Jan  9 15:55:20 pannekake kernel: [61568.559623]  [<ffffffffa0253aa1>] ext4_write_inode+0x4d/0xc7 [ext4]
Jan  9 15:55:20 pannekake kernel: [61568.566074]  [<ffffffff810dabe7>] writeback_single_inode+0x125/0x2df
Jan  9 15:55:20 pannekake kernel: [61568.572563]  [<ffffffff810dae48>] write_inode_now+0x78/0xba
Jan  9 15:55:20 pannekake kernel: [61568.578287]  [<ffffffffa0255b2d>] ext4_da_get_block_prep+0x28b/0x36e [ext4]
Jan  9 15:55:20 pannekake kernel: [61568.585367]  [<ffffffff810e1b69>] __block_prepare_write+0x1f3/0x3c0
Jan  9 15:55:20 pannekake kernel: [61568.591792]  [<ffffffffa02558a2>] ? ext4_da_get_block_prep+0x0/0x36e [ext4]
Jan  9 15:55:20 pannekake kernel: [61568.598903]  [<ffffffff810e1eb1>] block_write_begin+0x80/0xcf
Jan  9 15:55:20 pannekake kernel: [61568.604840]  [<ffffffffa02587be>] ext4_da_write_begin+0x196/0x218 [ext4]
Jan  9 15:55:20 pannekake kernel: [61568.611683]  [<ffffffffa02558a2>] ? ext4_da_get_block_prep+0x0/0x36e [ext4]
Jan  9 15:55:20 pannekake kernel: [61568.618803]  [<ffffffff810911c1>] ? iov_iter_copy_from_user_atomic+0x84/0xf6
Jan  9 15:55:20 pannekake kernel: [61568.626052]  [<ffffffff8109133f>] generic_file_buffered_write+0x10c/0x230
Jan  9 15:55:20 pannekake kernel: [61568.632961]  [<ffffffffa025375b>] ? ext4_dirty_inode+0x40/0x45 [ext4]
Jan  9 15:55:20 pannekake kernel: [61568.639536]  [<ffffffff81092b9c>] __generic_file_aio_write+0x347/0x37b
Jan  9 15:55:20 pannekake kernel: [61568.646245]  [<ffffffff81302709>] ? __mutex_lock_slowpath+0x22d/0x23c
Jan  9 15:55:20 pannekake kernel: [61568.652819]  [<ffffffff81092c2b>] generic_file_aio_write+0x5b/0xa2
Jan  9 15:55:20 pannekake kernel: [61568.659194]  [<ffffffffa024fa93>] ext4_file_write+0x8e/0x97 [ext4]
Jan  9 15:55:20 pannekake kernel: [61568.665577]  [<ffffffff810bfbc8>] do_sync_write+0xc0/0x106
Jan  9 15:55:20 pannekake kernel: [61568.671195]  [<ffffffff810cdb68>] ? poll_select_copy_remaining+0xc5/0xe9
Jan  9 15:55:20 pannekake kernel: [61568.678015]  [<ffffffff81125b50>] ? security_file_permission+0x11/0x13
Jan  9 15:55:20 pannekake kernel: [61568.684678]  [<ffffffff810c04df>] vfs_write+0xae/0x14a
Jan  9 15:55:20 pannekake kernel: [61568.689954]  [<ffffffff810c063f>] sys_write+0x47/0x6e
Jan  9 15:55:20 pannekake kernel: [61568.695111]  [<ffffffff81001feb>] system_call_fastpath+0x16/0x1b

It showed up again ~20 minutes later, and since then I haven't seen it. I'm a bit puzzled since I don't actually see any error messages (just the backtrace), and I'm not able to find any ill effects except for the traceback, but I guess it still shouldn't happen.

FWIW, the ext4 is on an LVM whose one of the component RAID-5s was resyncing at the time. I don't know if that is relevant or not.
Comment 1 Anonymous Emailer 2010-01-10 10:46:03 UTC
Reply-To: adilger@sun.com

On 2010-01-09, at 12:01, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15018
>
> While doing an rsync to an ext4 filesystem, suddenly the above  
> showed up in my dmesg:
>
> Jan  9 15:55:20 pannekake kernel: [61568.551238] Pid: 28573, comm:  
> rsync Not
> tainted 2.6.33-rc3 #2
> Jan  9 15:55:20 pannekake kernel: [61568.557085] Call Trace:
> Jan  9 15:55:20 pannekake kernel: [61568.559623]  [<ffffffffa0253aa1>]
> ext4_write_inode+0x4d/0xc7 [ext4]
>
> It showed up again ~20 minutes later, and since then I haven't seen  
> it. I'm a bit puzzled since I don't actually see any error messages  
> (just the backtrace), and I'm not able to find any ill effects  
> except for the traceback, but I guess it still shouldn't happen.
>
> FWIW, the ext4 is on an LVM whose one of the component RAID-5s was  
> resyncing at the time. I don't know if that is relevant or not.


In ext4_write_inode it has a dump_stack() in the case where  
ext4_journal_current_handle() is non-NULL.

This is a bug, because later on it calls ext4_force_commit(), which  
would end up blocking on the journal commit to finish, which is  
blocked waiting on the journal handle held by the selfsame thread that  
started the journal commit in the first place.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Comment 2 Jan Kara 2010-01-18 23:06:07 UTC
Ah, inlining and such stuff made reading this stack trace quite difficult. Anyway, the problem is that we just should not call write_inode_now() in ext4_da_reserve_space() when we have transaction already started. And as Andreas correctly points out, the patch even introduces a deadlock because it can call write_inode_now with wait==1. Moreover we hold page lock when holding write_inode_now which introduces a plenty of nasty locking issues I believe.
The bug has been introduced by commit 0637c6f4135f592f094207c7c21e7c0fc5557834.

Ted, I think that if we are failing to do the write because of ENOSPC, the only reasonable non-deadlocky thing to do is to restart the write completely - i.e., bail out up to ext4_da_write_begin, stop the transaction, drop the page, *do some magic*, and start again...
Comment 3 Aneesh Kumar K.V 2010-01-20 17:10:31 UTC
On Mon, 18 Jan 2010 23:06:08 GMT, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15018
> 
> 
> Jan Kara <jack@suse.cz> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jack@suse.cz, tytso@mit.edu
> 
> 
> 
> 
> --- Comment #2 from Jan Kara <jack@suse.cz>  2010-01-18 23:06:07 ---
> Ah, inlining and such stuff made reading this stack trace quite difficult.
> Anyway, the problem is that we just should not call write_inode_now() in
> ext4_da_reserve_space() when we have transaction already started. And as
> Andreas correctly points out, the patch even introduces a deadlock because it
> can call write_inode_now with wait==1. Moreover we hold page lock when holding
> write_inode_now which introduces a plenty of nasty locking issues I believe.
> The bug has been introduced by commit 0637c6f4135f592f094207c7c21e7c0fc5557834.
> 
> Ted, I think that if we are failing to do the write because of ENOSPC, the only
> reasonable non-deadlocky thing to do is to restart the write completely - i.e.,
> bail out up to ext4_da_write_begin, stop the transaction, drop the page, *do
> some magic*, and start again...
> 

How about the below. I am yet to test the change

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c818972..5af483a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1835,24 +1835,12 @@ repeat:
 	 * later. Real quota accounting is done at pages writeout
 	 * time.
 	 */
-	if (vfs_dq_reserve_block(inode, md_needed + 1)) {
-		/* 
-		 * We tend to badly over-estimate the amount of
-		 * metadata blocks which are needed, so if we have
-		 * reserved any metadata blocks, try to force out the
-		 * inode and see if we have any better luck.
-		 */
-		if (md_reserved && retries++ <= 3)
-			goto retry;
+	if (vfs_dq_reserve_block(inode, md_needed + 1))
 		return -EDQUOT;
-	}
 
 	if (ext4_claim_free_blocks(sbi, md_needed + 1)) {
 		vfs_dq_release_reservation_block(inode, md_needed + 1);
 		if (ext4_should_retry_alloc(inode->i_sb, &retries)) {
-		retry:
-			if (md_reserved)
-				write_inode_now(inode, (retries == 3));
 			yield();
 			goto repeat;
 		}
@@ -3032,7 +3020,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 			       loff_t pos, unsigned len, unsigned flags,
 			       struct page **pagep, void **fsdata)
 {
-	int ret, retries = 0;
+	int ret, retries = 0, quota_retries = 0;
 	struct page *page;
 	pgoff_t index;
 	unsigned from, to;
@@ -3091,6 +3079,21 @@ retry:
 
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
+
+	if ((ret == -EDQUOT) && (quota_retries <= 3)) {
+		/*
+		 * Since we do aggressive quota reservation may get no quota error
+		 * even though we may be having quota. Forcing the inode write will
+		 * free up the extra quota reserved. We do it only if we have
+		 * meta data blocks reserved because we do aggressive reservation only
+		 * for meta data blocks.
+		 */
+		if (EXT4_I(inode)->i_reserved_meta_blocks) {
+			write_inode_now(inode, (quota_retries == 3));
+			quota_retries++;
+			goto retry;
+		}
+	}
 out:
 	return ret;
 }
Comment 4 Jan Kara 2010-01-26 23:19:18 UTC
Hmm, I don't like the code in ext4_da_reserve_space even after your patch - the retries are duplicated with the ones from ext4_da_write_begin and calling write_inode_now from ext4_da_write_begin doesn't make too much sense either - why not write all inodes with delayed allocation if we want to write something?

Actually, instead of retrying delayed allocation, I think it makes more sense to just fall back to normal allocation. Attached patch does that and it also allows me to completely fill user's quota.
Comment 5 Jan Kara 2010-01-26 23:20:19 UTC
Created attachment 24739 [details]
Patch fixing deadlock in ext4_da_reserve_space
Comment 6 Aneesh Kumar K.V 2010-01-27 09:46:39 UTC
On Tue, 26 Jan 2010 23:19:23 GMT, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15018
> 
> 
> 
> 
> 
> --- Comment #4 from Jan Kara <jack@suse.cz>  2010-01-26 23:19:18 ---
> Hmm, I don't like the code in ext4_da_reserve_space even after your patch - the
> retries are duplicated with the ones from ext4_da_write_begin and calling
> write_inode_now from ext4_da_write_begin doesn't make too much sense either -
> why not write all inodes with delayed allocation if we want to write something?
> 
> Actually, instead of retrying delayed allocation, I think it makes more sense
> to just fall back to normal allocation. Attached patch does that and it also
> allows me to completely fill user's quota.
> 

Switching to non delayed allocation may not guarantee that we succeed in block
allocation. For quota we really need to flush this inode. Even if we
flush other inodes of other user we may still get a quota error

-aneesh
Comment 7 Jan Kara 2010-01-27 11:51:47 UTC
Aneesh, I agree that switching to non-delayed allocation does not guarantee a success in block allocation. But flushing inode does not guarantee that either (and IMHO does not bring us that much). The user might have other inodes which have delayed-allocated writes still pending and overestimation of reserved space on these inodes causes EDQUOT. So the ultimately correct way would be to flush all user's inodes but I'm not sure we want to do that from ext4_da_write_begin and it wouldn't be an easy thing to do. A middle ground might be to call writeback_inodes_sb_if_idle (similarly as we do that from ext4_nonda_switch) when we hit EDQUOT but I'm not too satisfied with that either...
Comment 8 Jan Kara 2010-03-18 11:11:48 UTC
This bug should be fixed in 2.6.33 (commit 1db913823c0f8360fccbd24ca67eb073966a5ffd). Please reopen this bug if you can still see the problem.

Note You need to log in before you can comment on or make changes to this bug.