Bug 15906
Summary: | performance regression in "umount" of filesystems using barriers | ||
---|---|---|---|
Product: | File System | Reporter: | Kees Cook (kees) |
Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | akpm, axboe, brian.bloniarz, cebbert, dmonakhov, greg, kmannth, sandeen, thierry.vignaud, tytso |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.33 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Patch attached fix original regression
Explicitly pass in whether sb is pinned or not Fix async inodes writeback for FS with delalloc |
Description
Kees Cook
2010-05-04 20:36:58 UTC
Sorry, I meant Fedora 13 (Beta). As an interesting oddity, creating an ext3 filesystem, but then attempting to mount it as ext4 shows the same pathological behavior. Mounting as ext3, it's fine again. As I mentioned on the RH bz, it seems barrier-related. I thought most barrier enhancements/changes were before 2.6.31, though. Was this really fast for you on .31? Lots of lvm barrier changes went into .30. mounting either the host fs, the loopback file, or both with "nobarrier" avoids all the cache flushes from the barriers, and it's speedy again. -Eric Hm, this was really a reduced test-case, as my initial situations where I noticed this was umounting LVM snapshots, then later umounting aufs unions (in an effort to get away from 3x slow-down of snapshot IO). Doing this is just as slow as loopback for me: VG=systemvg lvcreate -n testlv -L1G $VG mkfs.ext4 -F /dev/$VG/testlv mkdir -p /mnt/test mount -o loop /dev/$VG/testlv /mnt/test seq 65536 | (cd /mnt/test; xargs touch) time umount /mnt/test Should 68db1961bbf4e16c220ccec4a780e966bc1fece3 be expected to fix it too? Hm, it seemed to be progressively worse since 2.6.28 (when it was okay). It was not good in 2.6.31, and bad in 2.6.32. So I don't have data for 2.6.29 or .30. Eric is spot on. At 2.6.30 and dm/lvm started supporting barriers, this can cause serious performance issues depending on your configuration. Try mounting with the " -o nobarrier " option to see if your performance is restored. Kees, no, 68db1961bbf4e16c220ccec4a780e966bc1fece3 made the test on loopback slower, because now a barrier from a filesystem on a loopback device issues an fsync on the underlying filesystem which issues a cache flush on the underlying block device .... you get the picture. Unless we really have gratuitous barrier issues here, there's probably not a whole lot to be done, though maybe some batching is possible. As for the business about ext3, there is nothing mysterious or odd there either, ext3 mounts with barriers off by default, despite my best efforts, and mounting it with the ext4 driver does turn them on. Well, it seems to me that there is something wrong somewhere in here since "sync; umount" takes seconds and "umount" alone takes minutes. I can, however, confirm that adding "-o nobarrier" causes the symptom to vanish. To be fair, there does seem to be a real problem here. See my comment here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/543617/comments/12 It looks like if there are dirty inodes, whatever writeback path is used by umount(8) is much more inefficient than that used by sync(8). Specifically it seems to be doing a journal commit between each dirty inode which is flushed out. Yep, fair enough, and on closer inspection its doing way more IO than expected, wrapping the journal 4 times. :) so basically: sync_inodes_sb bdi_sync_writeback (sync_mode = WB_SYNC_ALL) ... writeback_inodes_wb writeback_single_inode ext4_write_inode if (WB_SYNC_ALL) ext4_force_commit <--- ouch. Yep. i've already know that issue. In fact it was broken by followng commit From 03ba3782e8dcc5b0e1efe440d33084f066e38cae Mon Sep 17 00:00:00 2001 From: Jens Axboe <jens.axboe@oracle.com> Date: Wed, 9 Sep 2009 09:08:54 +0200 Subject: [PATCH] writeback: switch to per-bdi threads for flushing data The problem with __sync_filesystem(0) is no longer works on umount because sb can not be pined s_mount sem is downed for write and s_root is NULL. And in fact ext3 is also broken in case of "-obarrier=1" The patch attached fix the original regression, but there is one more issue left A delalloc option. In fact dirty inode is still dirty even after first call of writeback_single_inode which is called from __sync_filesystem(0) due to delalloc allocation happen during inode write. So it takes second __sync_filesystem call to clear dirty flags. Currently i'm working on that issue. I hope i'll post a solution today. Created attachment 26224 [details]
Patch attached fix original regression
The summary for this patch should probably be changed, yes? It should happen with filesystems other than ext4 that use barriers (i.e., btrfs, XFS), and it should happen on a single disk just as easily as a LVM. The latter I can confirm --- you can see the performance degradation quite easily on a single disk, so the mention of LVM in the summary is probably misleading. QAqPqqaAAaQw On 5/5/10, bugzilla-daemon@bugzilla.kernel.org <bugzilla-daemon@bugzilla.kernel.org> wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=15906 > > > Kees Cook <kees@outflux.net> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Summary|serious performance |performance regression in > |regression in "umount" on |"umount" of filesystems > |ext4 over LVM |using boundaries > Summary|serious performance |performance regression in > |regression in "umount" on |"umount" of filesystems > |ext4 over LVM |using barriers > > > > > -- > Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are watching the assignee of the bug. > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > (In reply to comment #15) > The summary for this patch should probably be changed, yes? It should > happen > with filesystems other than ext4 that use barriers (i.e., btrfs, XFS), and it > should happen on a single disk just as easily as a LVM. The latter I can > confirm --- you can see the performance degradation quite easily on a single > disk, so the mention of LVM in the summary is probably misleading. Just a note, performance degradation is not that huge for XFS, as soon as i understand this is because it batch several inode writes in to one 'log' so less barriers required But nor than less this is generic writeback issue. Created attachment 26255 [details]
Explicitly pass in whether sb is pinned or not
Revision two of this patch. I got rid of WB_SYNC_NONE_PIN, since we should not further muddy the sync type and whether or not the sb is pinned. Instead lets add a specific flag for this.
Untested, but it compiles.
BTW, I believe that Chris has observed this problem on btrfs as well. If you have a lot of dirty inodes and do the umount, all the writeback will be done with WB_SYNC_ALL which will cause a log commit for each on btrfs. End result is the same, umount runs a lot slower than it otherwise should have. I can confirm that Jens's fix the original regression for me, so IMHO first part well done. About second one. (In reply to comment #13) > A delalloc option. In fact dirty inode is still dirty even after first > call of writeback_single_inode which is called from __sync_filesystem(0) > due to delalloc allocation happen during inode write. So it takes second > __sync_filesystem call to clear dirty flags. Currently i'm working on that > issue. I hope i'll post a solution today. Proposed patch was posted may be found here: http://marc.info/?l=linux-fsdevel&m=127322500915287&w=2 ######################################### # Testcase #Prep stage for ((i=0;i<1000;i++)) ;do echo test > /tmp/FILES/file-$i ;done tar cf /tmp/files.tar /tmp/FILES #Actual measurements FSTYPE=ext4 OPT='-obarrier=1' mkfs.$FSTYPE /dev/sdb1 mount /dev/sdb1 /mnt $OPT time tar xf /tmp/files.tar -C /mnt/ time umount /mnt ######################################### w/o update dflags patch real 0m0.142s user 0m0.005s sys 0m0.093s real 0m8.506s user 0m0.000s sys 0m0.016s with update dflags patch real 0m0.105s user 0m0.006s sys 0m0.086s real 0m0.148s user 0m0.002s sys 0m0.007s Seems where are no visible diffidence for btrfs and xfs. Comment on attachment 26224 [details]
Patch attached fix original regression
Superseded by Jens fix.
Created attachment 26403 [details]
Fix async inodes writeback for FS with delalloc
Attach second patch, after it has passed 1 week review cycle in a list.
Seems that we have an agreement for both issues. IMHO They are 99.999% percent candidates for 2.6.35-rc1(and then for stable).
Who will take care about patches? Jens, will you?
I'll add both patches to for-2.6.35 and mark them with a stable tag. Thanks again to everyone who dug into this. I'd been pulling my hair out out for months on it. :) (In reply to comment #18) > Created an attachment (id=26255) [details] > Explicitly pass in whether sb is pinned or not > This patch is triggering a warning: [ http://www.kerneloops.org/oops.php?number=3142224 ] WARNING: at fs/fs-writeback.c:597 writeback_inodes_wb+0x229/0x380() Hardware name: MS-7250 Modules linked in: fuse sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 ext2 dm_multipath kvm_amd kvm uinput snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device rndis_wlan snd_pcm cfg80211 snd_timer snd rfkill soundcore snd_page_alloc rndis_host edac_core ppdev cdc_ether i2c_nforce2 usbnet k8temp edac_mce_amd parport_pc forcedeth mii parport serio_raw sha256_generic aes_x86_64 aes_generic cbc dm_crypt firewire_ohci pata_acpi firewire_core crc_itu_t floppy ata_generic pata_amd sata_nv radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan] Pid: 1035, comm: flush-253:0 Not tainted 2.6.33.4-95.fc13.x86_64 #1 Call Trace: [] warn_slowpath_common+0x77/0x8f [] warn_slowpath_null+0xf/0x11 [] writeback_inodes_wb+0x229/0x380 [] wb_writeback+0x13d/0x1bc [] ? call_rcu_sched+0x10/0x12 [] ? call_rcu+0x9/0xb [] wb_do_writeback+0x69/0x14d [] bdi_writeback_task+0x3a/0xaa [] ? bdi_start_fn+0x0/0xd4 [] bdi_start_fn+0x6c/0xd4 [] ? bdi_start_fn+0x0/0xd4 [] kthread+0x7a/0x82 [] kernel_thread_helper+0x4/0x10 [] ? kthread+0x0/0x82 [] ? kernel_thread_helper+0x0/0x10 static int pin_sb_for_writeback(struct writeback_control *wbc, struct inode *inode, struct super_block **psb) { struct super_block *sb = inode->i_sb; /* * If this sb is already pinned, nothing more to do. If not and * *psb is non-NULL, unpin the old one first */ if (sb == *psb) return 0; else if (*psb) unpin_sb_for_writeback(psb); /* * Caller must already hold the ref for this */ if (wbc->sync_mode == WB_SYNC_ALL || wbc->sb_pinned) { WARN_ON(!rwsem_is_locked(&sb->s_umount)); <<----- here return 0; } http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=e913fc825dc685a444cb4c1d0f9d32f372f5986 is what Jens had upstream, and I think http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=30fd1e551a730d942e91109762c942786be0ef7c is the fix for the issue you see here... > Even if the writeout itself isn't a data integrity operation, we need > to ensure that the caller doesn't drop the sb umount sem before we > have actually done the writeback. Eric is spot on, that latter commit does indeed fix up that issue. The problem is that the caller holds the umount sem when initiating the WB_SYNC_NONE writeback, and thus it passes down ->sb_pinned == 1. But since we clear the work state for WB_SYNC_NONE early, the caller can then drop the umount sem before we are actually done writing out inodes. So the fix is queued up with the original patch, when it goes upstream there should be no warnings. (Fedora has the updated patches now, btw - 2.6.33.4-102.fc13.x86_64 I think) I'm trying to get these fixes merged into the ubuntu release, because ubuntu's workaround has issues. The downstream bug for this is https://bugs.launchpad.net/ubuntu/+source/linux/+bug/585092 I backported to ubuntu's kernel, which took a little fixup. You'll find the patches attached on that bug. 0001-writeback-fix-WB_SYNC_NONE-writeback-from-umount.patch 0002-writeback-Update-dirty-flags-in-two-steps.patch 0003-writeback-ensure-that-WB_SYNC_NONE-writeback-with-sb.patch If anyone who's familiar with these code paths could review my backports, I'd appreciate it. They apply cleanly to mainline v2.6.32.11, which is the base for ubuntu's kernel. The patches mentioned in this discussion have been reverted due to issues, please do not use them. The latest info on progress resolving this bug (as of 06/28) can be found by reading through: http://lkml.org/lkml/2010/6/10/175 |