Kernel Bug Tracker – Bug 20902
High IO wait when writing to ext4
Last modified: 2013-12-10 22:21:20 UTC
I'm running an ext4 filesystem on a LUKS device that sits ontop of an LVM device. The filesystem is created with no special options and is used for storing filessizes ranging from 1KB to 20GB.
I experience extremely high IO wait times when trying to write files to the filesystem, bot locally and through Samba.
The problem can be reproduced by doing a reboot, unlocking with cryptsetup, mounting the filesystem and then writing a file to any directory on the filesystem. The 'cp' process stalls for almost 15 minutes and then returns to normal and completes the copy eventually. If I then 'cp' the same file again, there's no problem at all and the copy goes through like normal without stalling for 15 minutes. Rebooting and doing it again, reproduces the problem immediately.
I've attached anything that might be of interest to anyone who'd want to look into this.
Created attachment 34362 [details]
dmesg output during problem
Created attachment 34372 [details]
Created attachment 34382 [details]
tune2fs -l of filesystem
Created attachment 34392 [details]
VG backup file
Created attachment 34432 [details]
Created attachment 34442 [details]
vmstat when problem starts
Created attachment 34452 [details]
dumpe2fs -h from LVMcrypto device
Any chance you can test w/o LUKS to see how that affects things?
If I had to guess I'd say things are stuck behind here:
[ 554.110652] flush-252:1 D ffff880001e14cc0 0 2186 2 0x00000000
[ 554.110679] Call Trace:
[ 554.110686] [<ffffffff81586503>] io_schedule+0x73/0xc0
[ 554.110692] [<ffffffff81180ab5>] sync_buffer+0x45/0x50
[ 554.110714] [<ffffffff81586e8d>] out_of_line_wait_on_bit+0x7d/0x90
[ 554.110725] [<ffffffff811809f8>] __wait_on_buffer+0x28/0x30
[ 554.110732] [<ffffffff81217199>] ext4_mb_init_cache+0x389/0x900
I guess ext4_mb_init_cache() is in:
/* wait for I/O completion */
for (i = 0; i < groups_per_page && bh[i]; i++)
I just ran into something similar a few days ago. It was a 4TB filesystem with about 50,000 files in a mixed data set. I ran a test script that unmounted, remounted, and then started writing 64MB files in a loop using dd. With delalloc on, the third 64MB write after umount/mount always took about 150 seconds. With delalloc off, the first write always took ~150 seconds. Writes after that were a fraction of a seconds.
Unfortunately I cannot do a test on the same filesystem without LUKS, since the problem only shows itself when writing, and I'd prefer keeping my files.
Yes, after the initial first write that takes ages, it seems the problem is gone more or less. On the odd occasion it can do it again, but its rare.
Did things work OK with an earlier kernel version? Is this something that has always troubled you, or is it something which recently started when you upgraded to the 2.6.36 kernel?
Once things are OK after "the first write", are they OK forever after that? If the problem re-occurs after the first write, can you say something about when that might be?
How about replicating it on a USB attached disk? Can you try a couple of different scenarios with a USB attached disk? Say,
1) With a USB attached disk using a MBR-partitioned disk?
2) With a USB attached disk set up using LVM with a single LVM volume?
3) With a USB attached disk set up using LVM and a LUKS-encrypted volume?
And can you specified what mount options, if any, that you were using?
I remember first noticing it as far back as when 8.04 came out, and has happened ever since. However, around that time, my LVM grew substantially aswell for the next year. So if it's 8.04 and up, or if it's the size of the LVM as it gets larger, I don't know.
By "first write" I mean the first write after a reboot. As soon as the system wants to flush data to the drives, it starts this cycle of slow and cumbersome I/O waiting.
I'm using the mount options as per the tune2fs attachment further up.
Have any of you with this problem tried other IO schedulers just to rule that in or out?
noop and deadline produce the exact same result for me.
Could you capture a blktrace when this happens?
# mount -t debugfs none /sys/kernel/debug
# blktrace -d /dev/blah -o mytrace &
# do your test
# killall blktrace
# tar cvjf mytrace.tar.bz2 mytrace*
and attach it. Might offer clues.
Created attachment 38002 [details]
blktrace showing initial write slowness
Here's mine. My test case is mount, sleep 5, then do 10 x 128MB writes using dd to the just-mounted filesystem. The first 128MB write took over 20 seconds. Unfortunately I don't have access any more to the box where it took 150 seconds.
Created attachment 38012 [details]
graph of IO
Here's a graph of the IO.
Does the trace cover the entire test starting at mount? Looks like it.
Were the 10x128MB writes in series or parallel?
Looks like it spends about 20s seekily sucking up 28MB of metadata off the disk (mostly done in the [flush-253:0] process) and then doing 1.3G of horrible writeback ;)
What exact commands did you use for the test; I assume it was buffered IO, all in the same subdir? Sync afterwards or not?
It might be interesting to see the tune2fs -l output to correlate the initial reads to disk metadata but I suppose we can have a pretty good guess at that.
Anyway looks like the first issue is the giant seeky read of disk metadata before any writing even gets underway.
Created attachment 38102 [details]
tune2fs -l output
Yep, the trace covers the entire test, starting at mount.
The writes are done in series, buffered IO, same subdir, without a sync. Here's all I'm doing:
for i in $(seq 1 10); do
dd if=/dev/zero of=$testdir/test.$i bs=1M count=128
(In reply to comment #16)
> Here's mine. My test case is mount, sleep 5, then do 10 x 128MB writes using
> dd to the just-mounted filesystem. The first 128MB write took over 20 seconds.
> Unfortunately I don't have access any more to the box where it took 150
We've seen this problem with Lustre as well. The root of the problem is that the initial write to a filesystem that is fairly full causes mballoc to scan all of the block groups looking for groups with enough space for preallocation of an 8MB chunk. On an 8TB filesystem with 64k groups @ 100 seeks/second this could take up to 10 minutes to complete.
The patch from Curt committed in 8a57d9d61a6e361c7bb159dda797672c1df1a691 fixed this for small writes at mount time, but does not help for large writes.
We are starting to look at other solutions to this problem in our bugzilla:
with a patch (currently untested) in:
Increasing the flex_bg size is likely going to reduce the severity of this problem, by reducing the number of seeks needed to load the block bitmaps proportional to the flex_bg factor (32 by default today). That would change the 8TB bitmap scan time from 10 minutes to about 20s.
Other possibilities include starting the bitmap scan at some random group instead of always starting at group 0, storing some free extent information for each group in the group descriptor table, or storing some information in the superblock about which group to start allocations at.
Curtw's patch (at commit id: 8a57d9d61) also fixes the problem in the steady state, once all of the block bitmap's statistics are loaded into memory. So one thing we could do is to simply force the block bitmap scan at mount time. It doesn't so much solve the problem as it moves it to a time when it might be less objectionable. For 8TB file systems if it really does take 10 minutes then we would have to mitigate this by (a) having mke2fs use a larger flex_bg size automatically, and/or (b) loading up the block bitmap statistics in parallel (which will help on RAID systems, but not when we have 8TB on a single spindle; given that 3 and 4 TB disks are within the horizon in the next couple of years, 8TB/spindle aren't that far out of reach).
Storing the largest contiguous free extent in a block group in the block group might be another way of solving this problem. The reason why I don't like this approach, though, is that it forces the implementation details of the buddy bitmap implementation into the file system format. It's possible that we might have N blocks free, where N might be say (for the sake of argument) 256 blocks. But if those N blocks aren't aligned on a buddy bitmap allocation boundary, mballoc won't find that free extent. It might see it as a free regions that are 31 blocks, 32 blocks, 128 blocks, 64 blocks, and 1 block free.