Bug 45741 - ext4 scans the whole disk when calling fallocate after mount on a 99% full volume.
Summary: ext4 scans the whole disk when calling fallocate after mount on a 99% full volume.
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 high
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-08-08 16:42 UTC by Mirek Rusin
Modified: 2012-11-08 14:21 UTC
CC List: 3 users

See Also:
Kernel Version: 3.2.0-23-generic
Subsystem:
Regression: No
Bisected commit-id:


Attachments
block io graph (62.10 KB, image/png)
2012-08-08 16:42 UTC, Mirek Rusin

Description Mirek Rusin 2012-08-08 16:42:46 UTC
Created attachment 77131 [details]
block io graph

It seems I can reproduce this problem every time.

After filling a 55 TB ext4 volume to 99% full (with fallocate-only files of 0-50 MB each; 10% of them were deleted along the way to fragment the free space further), I've run into a problem where the whole system freezes for ~5 minutes. To reproduce:

1) unmount filesystem
2) mount filesystem
3) fallocate a file

It seems that every time the system freezes for about 5 minutes.
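
For clarity, here is a minimal sketch of step 3 as a direct fallocate(2) call; the mount point, file name and size below are illustrative, not the exact ones from my setup:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Assumes the 99% full ext4 volume is mounted at /mnt/big (hypothetical path). */
	int fd = open("/mnt/big/testfile", O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* This is the call that stalls for ~5 minutes right after a fresh mount. */
	if (fallocate(fd, 0, 0, 50L * 1024 * 1024) == -1) {
		perror("fallocate");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}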

Initially I thought the disk was doing nothing, but in fact the OS seems to scan the whole disk before continuing (graph attached) - it looks like it's reading every single inode before proceeding with the fallocate?

Kernel logs the same thing every time:

Aug  8 17:05:09 XXX kernel: [189400.847170] INFO: task jbd2/sdc1-8:18852 blocked for more than 120 seconds.
Aug  8 17:05:09 XXX kernel: [189400.847561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  8 17:05:09 XXX kernel: [189400.868909] jbd2/sdc1-8     D ffffffff81806240     0 18852      2 0x00000000
Aug  8 17:05:09 XXX kernel: [189400.868915]  ffff8801a1e33ce0 0000000000000046 ffff8801a1e33c80 ffffffff811a86ce
Aug  8 17:05:09 XXX kernel: [189400.868920]  ffff8801a1e33fd8 ffff8801a1e33fd8 ffff8801a1e33fd8 0000000000013780
Aug  8 17:05:09 XXX kernel: [189400.868925]  ffffffff81c0d020 ffff8802320ec4d0 ffff8801a1e33cf0 ffff8801a1e33df8
Aug  8 17:05:09 XXX kernel: [189400.868929] Call Trace:
Aug  8 17:05:09 XXX kernel: [189400.868940]  [<ffffffff811a86ce>] ? __wait_on_buffer+0x2e/0x30
Aug  8 17:05:09 XXX kernel: [189400.868947]  [<ffffffff8165a55f>] schedule+0x3f/0x60
Aug  8 17:05:09 XXX kernel: [189400.868955]  [<ffffffff8126052a>] jbd2_journal_commit_transaction+0x18a/0x1240
Aug  8 17:05:09 XXX kernel: [189400.868962]  [<ffffffff8165c6fe>] ? _raw_spin_lock_irqsave+0x2e/0x40
Aug  8 17:05:09 XXX kernel: [189400.868970]  [<ffffffff81077198>] ? lock_timer_base.isra.29+0x38/0x70
Aug  8 17:05:09 XXX kernel: [189400.868976]  [<ffffffff8108aec0>] ? add_wait_queue+0x60/0x60
Aug  8 17:05:09 XXX kernel: [189400.868982]  [<ffffffff812652ab>] kjournald2+0xbb/0x220
Aug  8 17:05:09 XXX kernel: [189400.868988]  [<ffffffff8108aec0>] ? add_wait_queue+0x60/0x60
Aug  8 17:05:09 XXX kernel: [189400.868993]  [<ffffffff812651f0>] ? commit_timeout+0x10/0x10
Aug  8 17:05:09 XXX kernel: [189400.868999]  [<ffffffff8108a42c>] kthread+0x8c/0xa0
Aug  8 17:05:09 XXX kernel: [189400.869005]  [<ffffffff81666bf4>] kernel_thread_helper+0x4/0x10
Aug  8 17:05:09 XXX kernel: [189400.869011]  [<ffffffff8108a3a0>] ? flush_kthread_worker+0xa0/0xa0
Aug  8 17:05:09 XXX kernel: [189400.869016]  [<ffffffff81666bf0>] ? gs_change+0x13/0x13

Is this normal?
Comment 1 Theodore Tso 2012-08-09 18:10:59 UTC
It's not scanning every single inode (that would take a lot longer!), but it is scanning every single block allocation bitmap.  The problem is that we know how many free blocks are in a block group, but we don't know the distribution of the free blocks.  The distribution (there are X blocks of size 2**3, Y blocks of size 2**4, etc.) is cached in memory, but after you unmount and remount the file system that cache is cold, so the first time we look at a block group we need to read in its block bitmap.

Normally, we only do this until we find a suitable group, but when the file system is completely full, we might need to scan the entire disk.

I've looked at mballoc, and there are some things we can fix on our side.  We're reading in the block bitmap without first checking to see if the block group is completely filled.  So that's an easy fix on our side, which will help at least somewhat.  So thanks for reporting this.
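
To sketch the idea (a simplified illustration, not the actual patch; the helper name below is made up, though ext4_get_group_desc() and ext4_free_group_clusters() are real ext4 internals): the free-cluster count is already available in the in-memory group descriptor, so a group that can't possibly satisfy the request can be skipped before any bitmap I/O happens.

/*
 * Simplified illustration only: skip block groups whose summary
 * counters already show too little free space, so the allocator
 * never has to read their on-disk block bitmaps at all.
 */
#include "ext4.h"

static int mb_group_worth_scanning(struct super_block *sb,
				   ext4_group_t group,
				   unsigned int needed)
{
	struct ext4_group_desc *desc = ext4_get_group_desc(sb, group, NULL);

	if (!desc)
		return 0;
	/*
	 * The free-cluster count lives in the group descriptor, which is
	 * already in memory after mount; checking it needs no bitmap I/O.
	 */
	return ext4_free_group_clusters(sb, desc) >= needed;
}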

That being said, it's a really bad idea to try to run a file system at 99% full.  Above 80% full, file system performance definitely starts to fall off, and by the time you get up to 95%, performance is going to be really awful.  There are definitely things we can do to improve matters, but ultimately, this is something you should plan for.

You could also try increasing the flex_bg size, which is a configuration knob set when the file system is formatted.  This collects the allocation bitmaps for adjacent block groups together.  The default is 16, but you could try bumping that up to 64 or even 128.  It will reduce the time needed to scan all of the allocation bitmaps in the cold-cache case, but it may also decrease performance after that, when you need to allocate and deallocate inodes and blocks, by increasing the distance from the data blocks to the inode table.  How well this tradeoff works out is going to be very dependent on the details of your workload.
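
For reference, the flex_bg size is set at format time with mke2fs's -G option; the device name below is just an example, and reformatting of course destroys the existing data:

mke2fs -t ext4 -G 64 /dev/sdc1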
Comment 2 Florian Mickler 2012-10-15 21:24:57 UTC
A patch referencing this bug report has been merged in Linux v3.7-rc1:

commit 01fc48e8929e45e67527200017cff4e74e4ba054
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Fri Aug 17 09:46:17 2012 -0400

    ext4: don't load the block bitmap for block groups which have no space
