Bug 45741
| Summary: | ext4 scans the whole disk when calling fallocate after mount on a 99% full volume | | |
|---|---|---|---|
| Product: | File System | Reporter: | Mirek Rusin (mirek) |
| Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | high | CC: | alan, florian, tytso |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.2.0-23-generic | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | block io graph | | |
It's not scanning every single inode (that would take a lot longer!), but it is scanning every single block allocation bitmap. The problem is that we know how many free blocks are in a block group, but we don't know the distribution of the free blocks. The distribution (there are X blocks of size 2**3, Y blocks of size 2**4, etc.) is cached in memory, but after you unmount and mount the file system that cache is cold, so the first time we look at a block group we need to read in its block bitmap. Normally, we only do this until we find a suitable group, but when the file system is nearly full, we might need to scan the entire disk.

I've looked at mballoc, and there are some things we can fix on our side. We're reading in the block bitmap without first checking to see if the block group is completely filled. So that's an easy fix on our side, which will help at least somewhat. So thanks for reporting this.

That being said, it's a really bad idea to try to run a file system at 99% full. Above 80%, file system performance definitely starts to fall off, and by the time you get up to 95%, performance is going to be really awful. There are definitely things we can do to improve matters, but ultimately, it's something that you should plan for.

You could also try increasing the flex_bg size, which is a configuration knob set when the file system is formatted. This collects the allocation bitmaps of adjacent block groups together. The default is 16, but you could try bumping that up to 64 or even 128. It will improve the time needed to scan all of the allocation bitmaps in the cold-cache case, but it may also decrease performance after that, when you need to allocate and deallocate inodes and blocks, by increasing the distance from data blocks to the inode table. How well this tradeoff works is going to be very dependent on the details of your workload.

A patch referencing this bug report has been merged in Linux v3.7-rc1:

commit 01fc48e8929e45e67527200017cff4e74e4ba054
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Fri Aug 17 09:46:17 2012 -0400

    ext4: don't load the block bitmap for block groups which have no space
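To make the idea concrete, here is a small, self-contained C model of the group scan described above. This is illustrative only, not the actual ext4/mballoc code; `group_desc`, `load_bitmap()` and `find_group()` are made-up names for this sketch. The point is that the per-group free-block counter is already in memory after mount, so a group with no usable space can be skipped without ever reading its on-disk bitmap.

```c
/*
 * Illustrative model only -- NOT the real ext4/mballoc code.  All names
 * (group_desc, load_bitmap, find_group) are made up for this sketch.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define GROUP_BLOCKS 32768              /* blocks per group, as with ext4 defaults */

struct group_desc {
	uint32_t free_blocks;           /* summary counter, in memory after mount */
	bool     bitmap_cached;         /* has the on-disk bitmap been read yet?  */
	uint8_t *bitmap;                /* stand-in for the cached bitmap data    */
};

/* Stand-in for the expensive part: reading an allocation bitmap from disk. */
static void load_bitmap(struct group_desc *gd, size_t g)
{
	printf("reading bitmap of group %zu from disk\n", g);
	gd->bitmap = calloc(GROUP_BLOCKS / 8, 1);
	gd->bitmap_cached = true;
}

/*
 * Find a block group that can hold 'len' blocks.  The key idea of the fix:
 * consult the in-memory free_blocks summary *before* touching the bitmap,
 * so completely full groups never trigger a disk read.
 */
static long find_group(struct group_desc *groups, size_t ngroups, uint32_t len)
{
	for (size_t g = 0; g < ngroups; g++) {
		if (groups[g].free_blocks == 0)     /* completely full: skip, no I/O */
			continue;
		if (groups[g].free_blocks < len)    /* cannot satisfy the request    */
			continue;
		if (!groups[g].bitmap_cached)       /* cold cache: one disk read,    */
			load_bitmap(&groups[g], g); /* only for a promising group    */
		return (long)g;                     /* real code would now search the
		                                       cached bitmap for an extent  */
	}
	return -1;
}

int main(void)
{
	/* Two completely full groups and one with a little space left. */
	struct group_desc groups[3] = {
		{ .free_blocks = 0 }, { .free_blocks = 0 }, { .free_blocks = 1024 },
	};
	printf("chosen group: %ld\n", find_group(groups, 3, 256));
	return 0;
}
```

Before a check like this, the cold-cache scan performed the `load_bitmap()` step for every group it visited, even the completely full ones, which on a 99%-full 55 TB volume means reading a very large number of bitmaps back to back.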
Created attachment 77131 [details]
block io graph

It seems I can reproduce this problem every time. After filling up a 55 TB ext4 volume to 99% full (0-50 MB fallocated-only files; 10% of them were being deleted to fragment the space further), I've run into a problem where the whole system freezes for ~5 minutes. To reproduce:

1) unmount the filesystem
2) mount the filesystem
3) fallocate a file

It seems that the system freezes for about 5 minutes every time. Initially I thought the disk was doing nothing, but in fact the OS seems to scan the whole disk before continuing (graph attached) - it looks like it's reading every single inode before proceeding with the fallocate?

The kernel logs the same thing every time:

Aug 8 17:05:09 XXX kernel: [189400.847170] INFO: task jbd2/sdc1-8:18852 blocked for more than 120 seconds.
Aug 8 17:05:09 XXX kernel: [189400.847561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 8 17:05:09 XXX kernel: [189400.868909] jbd2/sdc1-8 D ffffffff81806240 0 18852 2 0x00000000
Aug 8 17:05:09 XXX kernel: [189400.868915] ffff8801a1e33ce0 0000000000000046 ffff8801a1e33c80 ffffffff811a86ce
Aug 8 17:05:09 XXX kernel: [189400.868920] ffff8801a1e33fd8 ffff8801a1e33fd8 ffff8801a1e33fd8 0000000000013780
Aug 8 17:05:09 XXX kernel: [189400.868925] ffffffff81c0d020 ffff8802320ec4d0 ffff8801a1e33cf0 ffff8801a1e33df8
Aug 8 17:05:09 XXX kernel: [189400.868929] Call Trace:
Aug 8 17:05:09 XXX kernel: [189400.868940] [<ffffffff811a86ce>] ? __wait_on_buffer+0x2e/0x30
Aug 8 17:05:09 XXX kernel: [189400.868947] [<ffffffff8165a55f>] schedule+0x3f/0x60
Aug 8 17:05:09 XXX kernel: [189400.868955] [<ffffffff8126052a>] jbd2_journal_commit_transaction+0x18a/0x1240
Aug 8 17:05:09 XXX kernel: [189400.868962] [<ffffffff8165c6fe>] ? _raw_spin_lock_irqsave+0x2e/0x40
Aug 8 17:05:09 XXX kernel: [189400.868970] [<ffffffff81077198>] ? lock_timer_base.isra.29+0x38/0x70
Aug 8 17:05:09 XXX kernel: [189400.868976] [<ffffffff8108aec0>] ? add_wait_queue+0x60/0x60
Aug 8 17:05:09 XXX kernel: [189400.868982] [<ffffffff812652ab>] kjournald2+0xbb/0x220
Aug 8 17:05:09 XXX kernel: [189400.868988] [<ffffffff8108aec0>] ? add_wait_queue+0x60/0x60
Aug 8 17:05:09 XXX kernel: [189400.868993] [<ffffffff812651f0>] ? commit_timeout+0x10/0x10
Aug 8 17:05:09 XXX kernel: [189400.868999] [<ffffffff8108a42c>] kthread+0x8c/0xa0
Aug 8 17:05:09 XXX kernel: [189400.869005] [<ffffffff81666bf4>] kernel_thread_helper+0x4/0x10
Aug 8 17:05:09 XXX kernel: [189400.869011] [<ffffffff8108a3a0>] ? flush_kthread_worker+0xa0/0xa0
Aug 8 17:05:09 XXX kernel: [189400.869016] [<ffffffff81666bf0>] ? gs_change+0x13/0x13

Is this normal?
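For reference, step 3 above can be driven by a short C program such as the sketch below; the mount point path and the 50 MB size are placeholders and should be adjusted to the actual volume. Running it immediately after remounting the nearly full volume should trigger the long cold-cache bitmap scan described in this report.

```c
/* Minimal reproducer sketch for step 3; the path and size are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Assumes the 99%-full ext4 volume is mounted at /mnt/bigvol (placeholder). */
	int fd = open("/mnt/bigvol/testfile", O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Preallocate 50 MB, like the files used to fill the volume. */
	if (fallocate(fd, 0, 0, 50L * 1024 * 1024) != 0) {
		perror("fallocate");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}
```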