Hi, While doing an rsync to an ext4 filesystem, suddenly the above showed up in my dmesg: Jan 9 15:55:20 pannekake kernel: [61568.551238] Pid: 28573, comm: rsync Not tainted 2.6.33-rc3 #2 Jan 9 15:55:20 pannekake kernel: [61568.557085] Call Trace: Jan 9 15:55:20 pannekake kernel: [61568.559623] [<ffffffffa0253aa1>] ext4_write_inode+0x4d/0xc7 [ext4] Jan 9 15:55:20 pannekake kernel: [61568.566074] [<ffffffff810dabe7>] writeback_single_inode+0x125/0x2df Jan 9 15:55:20 pannekake kernel: [61568.572563] [<ffffffff810dae48>] write_inode_now+0x78/0xba Jan 9 15:55:20 pannekake kernel: [61568.578287] [<ffffffffa0255b2d>] ext4_da_get_block_prep+0x28b/0x36e [ext4] Jan 9 15:55:20 pannekake kernel: [61568.585367] [<ffffffff810e1b69>] __block_prepare_write+0x1f3/0x3c0 Jan 9 15:55:20 pannekake kernel: [61568.591792] [<ffffffffa02558a2>] ? ext4_da_get_block_prep+0x0/0x36e [ext4] Jan 9 15:55:20 pannekake kernel: [61568.598903] [<ffffffff810e1eb1>] block_write_begin+0x80/0xcf Jan 9 15:55:20 pannekake kernel: [61568.604840] [<ffffffffa02587be>] ext4_da_write_begin+0x196/0x218 [ext4] Jan 9 15:55:20 pannekake kernel: [61568.611683] [<ffffffffa02558a2>] ? ext4_da_get_block_prep+0x0/0x36e [ext4] Jan 9 15:55:20 pannekake kernel: [61568.618803] [<ffffffff810911c1>] ? iov_iter_copy_from_user_atomic+0x84/0xf6 Jan 9 15:55:20 pannekake kernel: [61568.626052] [<ffffffff8109133f>] generic_file_buffered_write+0x10c/0x230 Jan 9 15:55:20 pannekake kernel: [61568.632961] [<ffffffffa025375b>] ? ext4_dirty_inode+0x40/0x45 [ext4] Jan 9 15:55:20 pannekake kernel: [61568.639536] [<ffffffff81092b9c>] __generic_file_aio_write+0x347/0x37b Jan 9 15:55:20 pannekake kernel: [61568.646245] [<ffffffff81302709>] ? __mutex_lock_slowpath+0x22d/0x23c Jan 9 15:55:20 pannekake kernel: [61568.652819] [<ffffffff81092c2b>] generic_file_aio_write+0x5b/0xa2 Jan 9 15:55:20 pannekake kernel: [61568.659194] [<ffffffffa024fa93>] ext4_file_write+0x8e/0x97 [ext4] Jan 9 15:55:20 pannekake kernel: [61568.665577] [<ffffffff810bfbc8>] do_sync_write+0xc0/0x106 Jan 9 15:55:20 pannekake kernel: [61568.671195] [<ffffffff810cdb68>] ? poll_select_copy_remaining+0xc5/0xe9 Jan 9 15:55:20 pannekake kernel: [61568.678015] [<ffffffff81125b50>] ? security_file_permission+0x11/0x13 Jan 9 15:55:20 pannekake kernel: [61568.684678] [<ffffffff810c04df>] vfs_write+0xae/0x14a Jan 9 15:55:20 pannekake kernel: [61568.689954] [<ffffffff810c063f>] sys_write+0x47/0x6e Jan 9 15:55:20 pannekake kernel: [61568.695111] [<ffffffff81001feb>] system_call_fastpath+0x16/0x1b It showed up again ~20 minutes later, and since then I haven't seen it. I'm a bit puzzled since I don't actually see any error messages (just the backtrace), and I'm not able to find any ill effects except for the traceback, but I guess it still shouldn't happen. FWIW, the ext4 is on an LVM whose one of the component RAID-5s was resyncing at the time. I don't know if that is relevant or not.
Reply-To: adilger@sun.com On 2010-01-09, at 12:01, bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=15018 > > While doing an rsync to an ext4 filesystem, suddenly the above > showed up in my dmesg: > > Jan 9 15:55:20 pannekake kernel: [61568.551238] Pid: 28573, comm: > rsync Not > tainted 2.6.33-rc3 #2 > Jan 9 15:55:20 pannekake kernel: [61568.557085] Call Trace: > Jan 9 15:55:20 pannekake kernel: [61568.559623] [<ffffffffa0253aa1>] > ext4_write_inode+0x4d/0xc7 [ext4] > > It showed up again ~20 minutes later, and since then I haven't seen > it. I'm a bit puzzled since I don't actually see any error messages > (just the backtrace), and I'm not able to find any ill effects > except for the traceback, but I guess it still shouldn't happen. > > FWIW, the ext4 is on an LVM whose one of the component RAID-5s was > resyncing at the time. I don't know if that is relevant or not. In ext4_write_inode it has a dump_stack() in the case where ext4_journal_current_handle() is non-NULL. This is a bug, because later on it calls ext4_force_commit(), which would end up blocking on the journal commit to finish, which is blocked waiting on the journal handle held by the selfsame thread that started the journal commit in the first place. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
Ah, inlining and such stuff made reading this stack trace quite difficult. Anyway, the problem is that we just should not call write_inode_now() in ext4_da_reserve_space() when we have transaction already started. And as Andreas correctly points out, the patch even introduces a deadlock because it can call write_inode_now with wait==1. Moreover we hold page lock when holding write_inode_now which introduces a plenty of nasty locking issues I believe. The bug has been introduced by commit 0637c6f4135f592f094207c7c21e7c0fc5557834. Ted, I think that if we are failing to do the write because of ENOSPC, the only reasonable non-deadlocky thing to do is to restart the write completely - i.e., bail out up to ext4_da_write_begin, stop the transaction, drop the page, *do some magic*, and start again...
On Mon, 18 Jan 2010 23:06:08 GMT, bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=15018 > > > Jan Kara <jack@suse.cz> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |jack@suse.cz, tytso@mit.edu > > > > > --- Comment #2 from Jan Kara <jack@suse.cz> 2010-01-18 23:06:07 --- > Ah, inlining and such stuff made reading this stack trace quite difficult. > Anyway, the problem is that we just should not call write_inode_now() in > ext4_da_reserve_space() when we have transaction already started. And as > Andreas correctly points out, the patch even introduces a deadlock because it > can call write_inode_now with wait==1. Moreover we hold page lock when > holding > write_inode_now which introduces a plenty of nasty locking issues I believe. > The bug has been introduced by commit > 0637c6f4135f592f094207c7c21e7c0fc5557834. > > Ted, I think that if we are failing to do the write because of ENOSPC, the > only > reasonable non-deadlocky thing to do is to restart the write completely - > i.e., > bail out up to ext4_da_write_begin, stop the transaction, drop the page, *do > some magic*, and start again... > How about the below. I am yet to test the change diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c818972..5af483a 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1835,24 +1835,12 @@ repeat: * later. Real quota accounting is done at pages writeout * time. */ - if (vfs_dq_reserve_block(inode, md_needed + 1)) { - /* - * We tend to badly over-estimate the amount of - * metadata blocks which are needed, so if we have - * reserved any metadata blocks, try to force out the - * inode and see if we have any better luck. - */ - if (md_reserved && retries++ <= 3) - goto retry; + if (vfs_dq_reserve_block(inode, md_needed + 1)) return -EDQUOT; - } if (ext4_claim_free_blocks(sbi, md_needed + 1)) { vfs_dq_release_reservation_block(inode, md_needed + 1); if (ext4_should_retry_alloc(inode->i_sb, &retries)) { - retry: - if (md_reserved) - write_inode_now(inode, (retries == 3)); yield(); goto repeat; } @@ -3032,7 +3020,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata) { - int ret, retries = 0; + int ret, retries = 0, quota_retries = 0; struct page *page; pgoff_t index; unsigned from, to; @@ -3091,6 +3079,21 @@ retry: if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) goto retry; + + if ((ret == -EDQUOT) && (quota_retries <= 3)) { + /* + * Since we do aggressive quota reservation may get no quota error + * even though we may be having quota. Forcing the inode write will + * free up the extra quota reserved. We do it only if we have + * meta data blocks reserved because we do aggressive reservation only + * for meta data blocks. + */ + if (EXT4_I(inode)->i_reserved_meta_blocks) { + write_inode_now(inode, (quota_retries == 3)); + quota_retries++; + goto retry; + } + } out: return ret; }
Hmm, I don't like the code in ext4_da_reserve_space even after your patch - the retries are duplicated with the ones from ext4_da_write_begin and calling write_inode_now from ext4_da_write_begin doesn't make too much sense either - why not write all inodes with delayed allocation if we want to write something? Actually, instead of retrying delayed allocation, I think it makes more sense to just fall back to normal allocation. Attached patch does that and it also allows me to completely fill user's quota.
Created attachment 24739 [details] Patch fixing deadlock in ext4_da_reserve_space
On Tue, 26 Jan 2010 23:19:23 GMT, bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=15018 > > > > > > --- Comment #4 from Jan Kara <jack@suse.cz> 2010-01-26 23:19:18 --- > Hmm, I don't like the code in ext4_da_reserve_space even after your patch - > the > retries are duplicated with the ones from ext4_da_write_begin and calling > write_inode_now from ext4_da_write_begin doesn't make too much sense either - > why not write all inodes with delayed allocation if we want to write > something? > > Actually, instead of retrying delayed allocation, I think it makes more sense > to just fall back to normal allocation. Attached patch does that and it also > allows me to completely fill user's quota. > Switching to non delayed allocation may not guarantee that we succeed in block allocation. For quota we really need to flush this inode. Even if we flush other inodes of other user we may still get a quota error -aneesh
Aneesh, I agree that switching to non-delayed allocation does not guarantee a success in block allocation. But flushing inode does not guarantee that either (and IMHO does not bring us that much). The user might have other inodes which have delayed-allocated writes still pending and overestimation of reserved space on these inodes causes EDQUOT. So the ultimately correct way would be to flush all user's inodes but I'm not sure we want to do that from ext4_da_write_begin and it wouldn't be an easy thing to do. A middle ground might be to call writeback_inodes_sb_if_idle (similarly as we do that from ext4_nonda_switch) when we hit EDQUOT but I'm not too satisfied with that either...
This bug should be fixed in 2.6.33 (commit 1db913823c0f8360fccbd24ca67eb073966a5ffd). Please reopen this bug if you can still see the problem.