Bug 15906

Summary: performance regression in "umount" of filesystems using barriers
Product: File System Reporter: Kees Cook (kees)
Component: ext4    Assignee: fs_ext4 (fs_ext4)
Status: RESOLVED CODE_FIX    
Severity: normal CC: akpm, axboe, brian.bloniarz, cebbert, dmonakhov, greg, kmannth, sandeen, thierry.vignaud, tytso
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.33 Subsystem:
Regression: Yes Bisected commit-id:
Attachments:
  Patch attached fix original regression
  Explicitly pass in whether sb is pinned or not
  Fix async inodes writeback for FS with delalloc

Description Kees Cook 2010-05-04 20:36:58 UTC
As root, if /tmp is on an LV (or possibly even without this):

cd /tmp
dd if=/dev/zero of=test.ext4 bs=1 count=1 seek=1G
mkfs.ext4 -F test.ext4
mkdir -p /mnt/test
mount -o loop test.ext4 /mnt/test
echo $(seq 65536) | (cd /mnt/test; xargs touch)
time umount /mnt/test

Prior to 2.6.32, this took a second or so.  Now it takes over a minute.

Reproduced in both Ubuntu 10.04 (2.6.32) and Fedora 12 (2.6.33).
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/543617

The current Ubuntu work-around is only useful on an otherwise idle system.
Comment 1 Kees Cook 2010-05-04 20:37:32 UTC
Sorry, I meant Fedora 13 (Beta).
Comment 3 Kees Cook 2010-05-04 20:52:46 UTC
As an interesting oddity, creating an ext3 filesystem, but then attempting to mount it as ext4 shows the same pathological behavior.  Mounting as ext3, it's fine again.
Comment 4 Eric Sandeen 2010-05-04 21:05:24 UTC
As I mentioned on the RH bz, it seems barrier-related.

I thought most barrier enhancements/changes were before 2.6.31, though.  Was this really fast for you on .31?  Lots of lvm barrier changes went into .30.

mounting either the host fs, the loopback file, or both with "nobarrier" avoids all the cache flushes from the barriers, and it's speedy again.

-Eric
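Eric's nobarrier check can be scripted. A minimal sketch (the `barrier_state` helper name is mine, not something from the thread) that classifies an ext4 mount-option string from /proc/mounts the way the driver treats `nobarrier`/`barrier=0`:

```shell
#!/bin/sh
# Classify an ext4 mount-option string as running with barriers on or off.
# "nobarrier" and "barrier=0" both disable barriers; anything else leaves
# the driver default (on, for ext4 at the time of this bug).
barrier_state() {
    case ",$1," in
        *,nobarrier,*|*,barrier=0,*) echo off ;;
        *)                           echo on  ;;
    esac
}

# Pull the live options for a mount point out of /proc/mounts, e.g.:
#   barrier_state "$(awk '$2 == "/mnt/test" { print $4 }' /proc/mounts)"
barrier_state "rw,relatime,nobarrier,data=ordered"   # off
barrier_state "rw,relatime,data=ordered"             # on
```

When the helper reports `off` for both the host filesystem and the loop mount, the umount should be fast again, per Eric's observation.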
Comment 5 Kees Cook 2010-05-04 21:11:03 UTC
Hm, this was really a reduced test-case; the initial situations where I noticed this were umounting LVM snapshots, and later umounting aufs unions (in an effort to get away from the 3x slow-down of snapshot IO).

Doing this is just as slow as loopback for me:

VG=systemvg
lvcreate -n testlv -L1G $VG
mkfs.ext4 -F /dev/$VG/testlv
mkdir -p /mnt/test
mount /dev/$VG/testlv /mnt/test
seq 65536 | (cd /mnt/test; xargs touch)
time umount /mnt/test


Should 68db1961bbf4e16c220ccec4a780e966bc1fece3 be expected to fix it too?
Comment 6 Kees Cook 2010-05-04 21:12:51 UTC
Hm, it seemed to be progressively worse since 2.6.28 (when it was okay).  It was not good in 2.6.31, and bad in 2.6.32.  So I don't have data for 2.6.29 or .30.
Comment 7 keith mannthey 2010-05-04 21:34:38 UTC
Eric is spot on. 

As of 2.6.30, dm/lvm started supporting barriers; this can cause serious performance issues depending on your configuration.

Try mounting with the " -o nobarrier " option to see if your performance is restored.
Comment 8 Eric Sandeen 2010-05-04 21:48:27 UTC
Kees, no, 68db1961bbf4e16c220ccec4a780e966bc1fece3 made the test on loopback slower, because now a barrier from a filesystem on a loopback device issues an fsync on the underlying filesystem which issues a cache flush on the underlying block device .... you get the picture.

Unless we really have gratuitous barrier issues here, there's probably not a whole lot to be done, though maybe some batching is possible.

As for the business about ext3, there is nothing mysterious or odd there either, ext3 mounts with barriers off by default, despite my best efforts, and mounting it with the ext4 driver does turn them on.
Comment 9 Kees Cook 2010-05-05 00:28:42 UTC
Well, it seems to me that there is something wrong somewhere in here since "sync; umount" takes seconds and "umount" alone takes minutes.  I can, however, confirm that adding "-o nobarrier" causes the symptom to vanish.
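To quantify the sync-vs-umount gap Kees describes, a small timing helper can be used (the `elapsed` name is mine; actually running the two umount variants requires root and the loopback reproducer mounted at /mnt/test):

```shell
#!/bin/sh
# Print how long a command takes, in whole seconds.
elapsed() {
    start=$(date +%s)
    "$@"
    echo "elapsed: $(( $(date +%s) - start ))s"
}

# On an affected kernel, with the loopback reproducer populated (as root):
#   elapsed sh -c 'sync; umount /mnt/test'   # seconds: sync batches the writeback first
#   elapsed umount /mnt/test                 # minutes on affected kernels
elapsed sleep 1
```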
Comment 10 Theodore Tso 2010-05-05 03:18:02 UTC
To be fair, there does seem to be a real problem here.  See my comment here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/543617/comments/12

It looks like if there are dirty inodes, whatever writeback path is used by umount(8) is much more inefficient than that used by sync(8).  Specifically it seems to be doing a journal commit between each dirty inode which is flushed out.
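Rough arithmetic behind that: with one journal commit, and hence one barrier cache flush, per dirty inode, even a modest per-flush cost dominates. The 1 ms flush time below is my illustrative assumption, not a measurement from the thread:

```shell
#!/bin/sh
# 65536 touched inodes (from the reproducer) x one assumed ~1 ms cache
# flush per journal commit.
awk -v inodes=65536 -v flush_ms=1 \
    'BEGIN { printf "estimated umount time: %.1f s\n", inodes * flush_ms / 1000 }'
# -> estimated umount time: 65.5 s (same order as the reported "over a minute")
```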
Comment 11 Eric Sandeen 2010-05-05 04:00:04 UTC
Yep, fair enough, and on closer inspection it's doing way more IO than expected, wrapping the journal 4 times.  :)
Comment 12 Eric Sandeen 2010-05-05 04:46:16 UTC
so basically:

sync_inodes_sb
  bdi_sync_writeback (sync_mode = WB_SYNC_ALL)
    ...

writeback_inodes_wb
  writeback_single_inode
    ext4_write_inode
      if (WB_SYNC_ALL)
        ext4_force_commit <--- ouch.
Comment 13 Dmitry Monakhov 2010-05-05 07:28:10 UTC
Yep, I already know about that issue. In fact it was broken by the following commit:

From 03ba3782e8dcc5b0e1efe440d33084f066e38cae Mon Sep 17 00:00:00 2001          
From: Jens Axboe <jens.axboe@oracle.com>                                        
Date: Wed, 9 Sep 2009 09:08:54 +0200                                            
Subject: [PATCH] writeback: switch to per-bdi threads for flushing data

The problem is that __sync_filesystem(0) no longer works on umount,
because the sb cannot be pinned: the s_umount sem is downed for write and s_root is NULL.

And in fact ext3 is also broken in the case of "-obarrier=1".
The patch attached fixes the original regression, but there is one more issue left:

The delalloc option. A dirty inode is still dirty even after the first
call of writeback_single_inode (which is called from __sync_filesystem(0)),
because delalloc block allocation happens during inode writeout. So it takes a second
__sync_filesystem call to clear the dirty flags. I'm currently working on that
issue and hope to post a solution today.
Comment 14 Dmitry Monakhov 2010-05-05 07:30:04 UTC
Created attachment 26224 [details]
Patch attached fix original regression
Comment 15 Theodore Tso 2010-05-05 19:44:04 UTC
The summary for this patch should probably be changed, yes?   It should happen with filesystems other than ext4 that use barriers (i.e., btrfs, XFS), and it should happen on a single disk just as easily as a LVM.  The latter I can confirm --- you can see the performance degradation quite easily on a single disk, so the mention of LVM in the summary is probably misleading.
Comment 16 Greg.Freemyer 2010-05-05 21:36:46 UTC

On 5/5/10, bugzilla-daemon@bugzilla.kernel.org
<bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=15906
>
>
> Kees Cook <kees@outflux.net> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>             Summary|serious performance         |performance regression in
>                    |regression in "umount" on   |"umount" of filesystems
>                    |ext4 over LVM               |using boundaries
>             Summary|serious performance         |performance regression in
>                    |regression in "umount" on   |"umount" of filesystems
>                    |ext4 over LVM               |using barriers
>
Comment 17 Dmitry Monakhov 2010-05-06 03:48:58 UTC
(In reply to comment #15)
> The summary for this patch should probably be changed, yes?   It should
> happen
> with filesystems other than ext4 that use barriers (i.e., btrfs, XFS), and it
> should happen on a single disk just as easily as a LVM.  The latter I can
> confirm --- you can see the performance degradation quite easily on a single
> disk, so the mention of LVM in the summary is probably misleading.

Just a note: the performance degradation is not that huge for XFS. As far as I
understand, this is because it batches several inode writes into one 'log' commit, so
fewer barriers are required.
But nonetheless this is a generic writeback issue.
Comment 18 Jens Axboe 2010-05-06 07:08:55 UTC
Created attachment 26255 [details]
Explicitly pass in whether sb is pinned or not

Revision two of this patch. I got rid of WB_SYNC_NONE_PIN, since we should not further muddy the sync type with whether or not the sb is pinned. Instead let's add a specific flag for this.

Untested, but it compiles.
Comment 19 Jens Axboe 2010-05-06 07:10:30 UTC
BTW, I believe that Chris has observed this problem on btrfs as well. If you have a lot of dirty inodes and do the umount, all the writeback will be done with WB_SYNC_ALL which will cause a log commit for each on btrfs. End result is the same, umount runs a lot slower than it otherwise should have.
Comment 20 Dmitry Monakhov 2010-05-07 09:45:32 UTC
I can confirm that Jens's patch fixes the original regression for me,
so IMHO the first part is well done.

About the second one:
(In reply to comment #13)
> A delalloc option. In fact dirty inode is still dirty even after first
> call of writeback_single_inode which is called from __sync_filesystem(0)
> due to delalloc allocation happen during inode write. So it takes second
> __sync_filesystem call to clear dirty flags. Currently i'm working on that
> issue. I hope i'll post a solution today.

The proposed patch was posted and may be found here:
http://marc.info/?l=linux-fsdevel&m=127322500915287&w=2
Comment 21 Dmitry Monakhov 2010-05-07 10:49:00 UTC
#########################################
# Testcase
#Prep stage
mkdir -p /tmp/FILES
for ((i=0;i<1000;i++)) ;do echo test > /tmp/FILES/file-$i ;done
tar cf /tmp/files.tar /tmp/FILES
#Actual measurements  
FSTYPE=ext4
OPT='-obarrier=1'
mkfs.$FSTYPE /dev/sdb1 
mount /dev/sdb1 /mnt  $OPT
time tar xf /tmp/files.tar  -C /mnt/
time umount /mnt
#########################################

w/o update dflags patch:
  tar xf:  real 0m0.142s  user 0m0.005s  sys 0m0.093s
  umount:  real 0m8.506s  user 0m0.000s  sys 0m0.016s

with update dflags patch:
  tar xf:  real 0m0.105s  user 0m0.006s  sys 0m0.086s
  umount:  real 0m0.148s  user 0m0.002s  sys 0m0.007s

There seems to be no visible difference for btrfs and xfs.
Comment 22 Dmitry Monakhov 2010-05-17 03:28:50 UTC
Comment on attachment 26224 [details]
Patch attached fix original regression 

Superseded by Jens's fix.
Comment 23 Dmitry Monakhov 2010-05-17 03:44:19 UTC
Created attachment 26403 [details]
Fix async inodes writeback for FS with delalloc

Attaching the second patch, after it has passed a one-week review cycle on the list.

Seems that we have agreement on both issues. IMHO they are 99.999% candidates for 2.6.35-rc1 (and then for stable).
Who will take care of the patches? Jens, will you?
Comment 24 Jens Axboe 2010-05-17 10:52:54 UTC
I'll add both patches to for-2.6.35 and mark them with a stable tag.
Comment 25 Kees Cook 2010-05-19 19:12:08 UTC
Thanks again to everyone who dug into this.  I'd been pulling my hair out for months on it.  :)
Comment 26 Chuck Ebbert 2010-05-21 01:04:55 UTC
(In reply to comment #18)
> Created an attachment (id=26255) [details]
> Explicitly pass in whether sb is pinned or not
> 

This patch is triggering a warning:

[ http://www.kerneloops.org/oops.php?number=3142224 ]

WARNING: at fs/fs-writeback.c:597 writeback_inodes_wb+0x229/0x380()
Hardware name: MS-7250
Modules linked in: fuse sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 ext2 dm_multipath kvm_amd kvm uinput snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device rndis_wlan snd_pcm cfg80211 snd_timer snd rfkill soundcore snd_page_alloc rndis_host edac_core ppdev cdc_ether i2c_nforce2 usbnet k8temp edac_mce_amd parport_pc forcedeth mii parport serio_raw sha256_generic aes_x86_64 aes_generic cbc dm_crypt firewire_ohci pata_acpi firewire_core crc_itu_t floppy ata_generic pata_amd sata_nv radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Pid: 1035, comm: flush-253:0 Not tainted 2.6.33.4-95.fc13.x86_64 #1
Call Trace:
[] warn_slowpath_common+0x77/0x8f
[] warn_slowpath_null+0xf/0x11
[] writeback_inodes_wb+0x229/0x380
[] wb_writeback+0x13d/0x1bc
[] ? call_rcu_sched+0x10/0x12
[] ? call_rcu+0x9/0xb
[] wb_do_writeback+0x69/0x14d
[] bdi_writeback_task+0x3a/0xaa
[] ? bdi_start_fn+0x0/0xd4
[] bdi_start_fn+0x6c/0xd4
[] ? bdi_start_fn+0x0/0xd4
[] kthread+0x7a/0x82
[] kernel_thread_helper+0x4/0x10
[] ? kthread+0x0/0x82
[] ? kernel_thread_helper+0x0/0x10
Comment 27 Eric Sandeen 2010-05-21 02:07:01 UTC
static int pin_sb_for_writeback(struct writeback_control *wbc,
                                struct inode *inode, struct super_block **psb)
{
        struct super_block *sb = inode->i_sb;

        /*
         * If this sb is already pinned, nothing more to do. If not and
         * *psb is non-NULL, unpin the old one first
         */
        if (sb == *psb)
                return 0;
        else if (*psb)
                unpin_sb_for_writeback(psb);

        /*
         * Caller must already hold the ref for this
         */
        if (wbc->sync_mode == WB_SYNC_ALL || wbc->sb_pinned) {
                WARN_ON(!rwsem_is_locked(&sb->s_umount));  <<----- here
                return 0;
        }    

http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=e913fc825dc685a444cb4c1d0f9d32f372f5986

is what Jens had upstream, and I think 

http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=30fd1e551a730d942e91109762c942786be0ef7c

is the fix for the issue you see here...

> Even if the writeout itself isn't a data integrity operation, we need
> to ensure that the caller doesn't drop the sb umount sem before we
> have actually done the writeback.
Comment 28 Jens Axboe 2010-05-21 06:08:09 UTC
Eric is spot on, that latter commit does indeed fix up that issue. The problem is that the caller holds the umount sem when initiating the WB_SYNC_NONE writeback, and thus it passes down ->sb_pinned == 1. But since we clear the work state for WB_SYNC_NONE early, the caller can then drop the umount sem before we are actually done writing out inodes.

So the fix is queued up with the original patch, when it goes upstream there should be no warnings.
Comment 29 Eric Sandeen 2010-05-22 04:38:10 UTC
(Fedora has the updated patches now, btw - 2.6.33.4-102.fc13.x86_64 I think)
Comment 30 Brian Bloniarz 2010-05-26 16:44:37 UTC
I'm trying to get these fixes merged into the ubuntu release, because ubuntu's workaround has issues. The downstream bug for this is https://bugs.launchpad.net/ubuntu/+source/linux/+bug/585092

I backported to ubuntu's kernel, which took a little fixup. You'll find the patches attached on that bug.
0001-writeback-fix-WB_SYNC_NONE-writeback-from-umount.patch
0002-writeback-Update-dirty-flags-in-two-steps.patch
0003-writeback-ensure-that-WB_SYNC_NONE-writeback-with-sb.patch

If anyone who's familiar with these code paths could review my backports, I'd appreciate it.  They apply cleanly to mainline v2.6.32.11, which is the base for ubuntu's kernel.
Comment 31 Brian Bloniarz 2010-06-28 14:19:26 UTC
The patches mentioned in this discussion have been reverted due to issues; please do not use them.

The latest info on progress resolving this bug (as of 06/28) can be found by reading through: http://lkml.org/lkml/2010/6/10/175