Bug 13930

| Summary: | non-contiguous files (64.9%) on a ext4 fs | | |
|---|---|---|---|
| Product: | File System | Reporter: | Cédric M (zelogik+bugzilla) |
| Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
| Status: | CLOSED OBSOLETE | | |
| Severity: | normal | CC: | alan, sandeen, tytso |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.31-5 (deb) | Subsystem: | |
| Regression: | No | Bisected commit-id: | |

Description
Cédric M
2009-08-07 09:58:46 UTC
It might be interesting to poke around with filefrag and see which files are fragmented, and compare that to how big they are. Summary stats like this can sometimes be misleading. Or to put it another way: if half your files are < 100M and contiguous, and the other half are > 100M and all of those have two 50M extents each, I think e2fsck would say "50.0% non-contiguous" -- but this isn't really indicative of a problem. If you can demonstrate that simply copying a large file (or files) from cifs leads to bad fragmentation of that file (or files), then we probably have something to work on.

I'm pretty sure what's going on here is the problem I've reported before: if you have a large number of large files being written at the same time, the Linux page cleaner round-robins between the different dirty inodes to avoid starving some inode from ever getting its dirty pages written out. This then combines with ext4's multi-block allocator limiting its search to 8MB chunks of free extents, so we only expand a dirty page writeback request into 2048 blocks. See the discussion here: http://thread.gmane.org/gmane.comp.file-systems.ext4/13107

The reason why you're seeing this so much is that this filesystem has relatively few inodes (just under 16,000) and a very large average file size (about 54 megabytes), so a very large number of the files are "non-contiguous". But if you look at this statistic from e2fsck:

    Extent depth histogram: 14555/1388

14,555, or 91% of the files, have fewer than 4 extents, so that all of the extents fit in the inode. (Note that an extent addresses at most 128 megs, so by definition a 512 meg file will have at least 4 extents.) That means it's highly likely that if you look at a particularly large file using "filefrag -v", you will see something like this:

     ext logical physical expected length flags
       0       0  2165248            512
       1     512  2214400  2165759  1536
       2    2048  2244608  2215935  2048
       3    4096  2250752  2246655  2048
       4    6144  2254848  2252799 32768
       5   38912  2287616           8192
       6   47104  2299904  2295807  2048
       7   49152  2306048  2301951  2048 eof

Note that extent #5 is really located contiguously after extent #4; the reason why a new extent was created is that the maximum length that can be encoded in the on-disk extent data structure is 32,768 blocks. (Which, if you are using 4k blocks, means a maximum extent size of 128 megs.) So this kind of "non-contiguous" file is non-optimal, and we really should fix the block allocator to do better. On the other hand, it's not as disastrously fragmented as, say, the following file from an ext3 filesystem:

     ext logical physical length
       0       0  5228587     12
       1      12  5228600    110
       2     122  5228768    145
       3     267  5228915      1
       4     268  5228918      9
       5     277  5228936     69
       6     346  5229392    165
       7     511  5230282    124
       8     635  5230496     42
       9     677  5231614     10
      10     687  5231856     20
      11     707  5231877     46
      12     753  5231975      1
      13     754  5232033     14
      14     768  5232205      2
      15     770  5233913      4
      16     774  5233992    262
      17    1036  5234256    191

Part of the problem is that "non-contiguous" or "fragmented" doesn't really describe whether the file is like the first, ext4 file (which is indeed non-contiguous, and while it could be better allocated on disk, the time to read the file sequentially won't be _that_ much worse than for a file that is 100% contiguous), or like the second, ext3 file, where the performance degradation is much worse. I suppose we could do something where we define "fragmented" as a file which has extents smaller than N blocks, or where the average extent size is smaller than M blocks.
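As an illustration of that last suggestion (this sketch is mine, not something attached to the bug): a small user-space program can use the FIEMAP ioctl -- the same interface filefrag uses -- to print each file's size, extent count, and average extent size, which is roughly the metric proposed above. The file name frag-report.c and the output format are made up, and error handling is minimal.

```c
/* frag-report.c - rough per-file fragmentation report via FIEMAP.
 * Illustrative sketch only, minimal error handling.
 * Build: gcc -O2 -o frag-report frag-report.c
 * Usage: ./frag-report FILE...
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

static int count_extents(int fd, unsigned int *extents)
{
    struct fiemap *fm;

    /* With fm_extent_count == 0 the kernel only reports how many extents
     * would be needed to describe the file, in fm_mapped_extents. */
    fm = calloc(1, sizeof(*fm));
    if (!fm)
        return -1;
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;            /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;  /* flush delalloc so extents are real */
    fm->fm_extent_count = 0;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
        free(fm);
        return -1;
    }
    *extents = fm->fm_mapped_extents;
    free(fm);
    return 0;
}

int main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++) {
        struct stat st;
        unsigned int extents = 0;
        int fd = open(argv[i], O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0 || count_extents(fd, &extents) < 0) {
            perror(argv[i]);
            if (fd >= 0)
                close(fd);
            continue;
        }
        printf("%s: %lld MB, %u extents, avg extent %lld KB\n",
               argv[i],
               (long long)st.st_size >> 20,
               extents,
               extents ? ((long long)st.st_size >> 10) / extents : 0);
        close(fd);
    }
    return 0;
}
```

Running something like this over the directory in question would show directly whether the "non-contiguous" files are of the mostly benign 8MB-chunk variety or genuinely shredded.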
My original way of dealing with this number was to simply use the phrase "non-contiguous" instead of "fragmented", which is technically accurate, but it causes people to get overly concerned when they see something like "64.9% non-contiguous files". Unfortunately, at the moment what this means is something like "approximately 65% of your files are greater than 8 megabytes".

I have done some testing on a 242MB file:

    du -h file
    242M

And after a filefrag -v:

    Filesystem type is: ef53
    File size of ****** is 253480960 (61885 blocks, blocksize 4096)
     ext logical  physical  expected  length flags
       0       0 215814144            2048
       1    2048 215820288 215816191  2048
       2    4096 215824384 215822335  2048
       3    6144 215828480 215826431  2048
       4    8192 215832576 215830527  2048
       5   10240 215836672 215834623  2048
       6   12288 215840768 215838719  2048
       7   14336 215844864 215842815  2048
       8   16384 215848960 215846911  2048
       9   18432 215853056 215851007 26624
      10   45056 215881728 215879679  2048
      11   47104 215885824 215883775  2048
      12   49152 215891968 215887871  2048
      13   51200 215896064 215894015  2048
      14   53248 215900160 215898111  2048
      15   55296 215904256 215902207  2048
      16   57344 215908352 215906303  2048
      17   59392 215912448 215910399  2048
      18   61440 215918592 215914495   445 eof

So why so many extents? (All of the files were written one by one onto the disk, with something like "cp -R /cifs /ext4" -- not with torrents, slow download tools, or anything like that.)

With a full directory (a shell for loop over du -h and filefrag, trimmed with sed):

    180M : 12 extents found
    181M : 20 extents found
    281M : 22 extents found
    275M : 22 extents found
    275M : 22 extents found
    181M : 20 extents found
    281M : 22 extents found
    281M : 21 extents found
    277M : 21 extents found
    280M : 21 extents found
    180M : 21 extents found
    285M : 13 extents found
    180M : 6 extents found

There are so many extents because, as discussed in the linux-ext4 mailing list thread I referenced above, when you write multiple large files close together in time, the files get interleaved with each other. "cp -R /cifs /ext4" doesn't call fsync() between writing each file, so the dirty pages for multiple files are left dirty in the page cache during the copy. The VM page flush daemon doesn't write one file out completely and then another, but instead round-robins between the different inodes. The ext4 delayed allocation code tries to work around this by looking for adjacent dirty pages and then attempting a large block allocation; but currently the ext4 multiblock allocator only tries to grab up to 8 megabytes at a time, to avoid spending too much CPU time in what might be a fruitless attempt to find that many contiguous free blocks. It's a known bug, but fixing it is a bit complicated. It's on our TODO list.
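For anyone who wants to reproduce the interleaving described above, a minimal sketch (mine, not from the report) is to write a few large files concurrently without fsync() and then look at them with filefrag. The file names, sizes and file count below are arbitrary.

```c
/* interleave-test.c - write several large files at the same time, with no
 * fsync(), then inspect the result with "filefrag -v testfile.*".
 * Illustrative sketch only; run it on an ext4 filesystem with enough space.
 * Build: gcc -O2 -o interleave-test interleave-test.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

#define NFILES   4
#define FILE_MB  256
#define CHUNK    (1024 * 1024)

static void write_one(int idx)
{
    char name[64], *buf;
    int fd, i;

    snprintf(name, sizeof(name), "testfile.%d", idx);
    fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror(name);
        _exit(1);
    }
    buf = malloc(CHUNK);
    if (!buf)
        _exit(1);
    memset(buf, 'x', CHUNK);
    /* Dirty pages for all NFILES inodes pile up in the page cache, and the
     * page flush daemon round-robins between them during writeback. */
    for (i = 0; i < FILE_MB; i++) {
        if (write(fd, buf, CHUNK) != CHUNK) {
            perror("write");
            _exit(1);
        }
    }
    close(fd);   /* deliberately no fsync(), just like cp -R */
    _exit(0);
}

int main(void)
{
    int i;

    for (i = 0; i < NFILES; i++)
        if (fork() == 0)
            write_one(i);
    for (i = 0; i < NFILES; i++)
        wait(NULL);
    printf("now run: filefrag -v testfile.*\n");
    return 0;
}
```

On an otherwise idle filesystem, the resulting files should show the same pattern of modest-sized allocation chunks that the filefrag output above does.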
I did some more looking at this issue. The root cause is pdflush, which is the daemon that starts forcing background writes when 10% of the available page cache is dirty. It will write out a maximum of 1024 pages at a time, because of a hard-coded limit in mm/page-writeback.c:

    /*
     * The maximum number of pages to writeout in a single bdflush/kupdate
     * operation.  We do this so we don't hold I_SYNC against an inode for
     * enormous amounts of time, which would block a userspace task which has
     * been forced to throttle against that inode.  Also, the code reevaluates
     * the dirty each time it has written this many pages.
     */
    #define MAX_WRITEBACK_PAGES 1024

This means that background_writeout() in mm/page-writeback.c only calls ext4_da_writepages requesting a writeout of 1024 pages, which we can see if we put a trace on ext4_da_writepages after writing a very large file:

    pdflush-398 [000] 5743.853396: ext4_da_writepages: dev sdc1 ino 12 nr_to_write 1024 pages_skipped 0 range_start 0 range_end 0 nonblocking 1 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
    pdflush-398 [000] 5743.858988: ext4_da_writepages_result: dev sdc1 ino 12 ret 0 pages_written 1024 pages_skipped 0 congestion 0 more_io 0 no_nrwrite_index_update 0
    pdflush-398 [000] 5743.923578: ext4_da_writepages: dev sdc1 ino 12 nr_to_write 1024 pages_skipped 0 range_start 0 range_end 0 nonblocking 1 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
    pdflush-398 [000] 5743.927562: ext4_da_writepages_result: dev sdc1 ino 12 ret 0 pages_written 1024 pages_skipped 0 congestion 0 more_io 0 no_nrwrite_index_update 0

The ext4_da_writepages() function is therefore allocating 1024 blocks at a time, which the ext4 multiblock allocator increases to 2048 blocks (sometimes 1024 blocks are allocated, and sometimes 2048), as we can see from /proc/fs/ext4/<dev>/mb_history:

    1982 12 1/14336/1024@12288 1/14336/2048@12288 1/14336/2048@12288 1 0 0 0x0e20 M 0 0
    1982 12 1/15360/1024@13312 1/15360/1024@13312
    1982 12 1/16384/1024@14336 1/16384/2048@14336 1/16384/2048@14336 1 0 0 0x0e20 M 2048 8192
    1982 12 1/17408/1024@15360 1/17408/1024@15360

If there are multiple large dirty files in the page cache, pdflush will round-robin trying to write out the inodes, with the result that large files get interleaved in chunks of 4M (1024 pages) to 8M (2048 pages), and larger chunks happen only when there are only pages from one inode left in memory. Potential solutions in the next comment...

There are a number of ways that we can increase the size of the block allocation request made by ext4_da_writepages:

1) Increase MAX_WRITEBACK_PAGES, possibly on a per-filesystem basis. The comment around MAX_WRITEBACK_PAGES indicates the problem is around blocking tasks that wait on I_SYNC, but it's not clear this is really a problem. Before I_SYNC was separated out from I_LOCK this was clearly much more of an issue, but now the only time a process waits for I_SYNC, as near as I can tell, is when it is calling fsync() or otherwise forcing out the inode. So I don't think it's going to be that big of a deal.

2) We can change ext4_da_writepages() to check whether there are more dirty pages in the page cache beyond what had been requested to be written, and if so, pass a hint to mballoc via an extension to the allocation_request structure so that additional blocks are allocated and reserved in the inode's preallocation structure.

3) Jens Axboe is working on a set of patches which create a separate pdflush thread for each block device (the per-bdi patches). I don't think there is a risk in increasing MAX_WRITEBACK_PAGES, but if there is still a concern, perhaps the per-bdi patches could be changed to prefer dirty inodes which have been closed, and to write such inodes out completely, one at a time, instead of stopping after MAX_WRITEBACK_PAGES.

These changes should allow us to improve ext4's large file writeback to the point where it is allocating up to 32768 blocks at a time, instead of 1024 blocks at a time.
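To make the effect of the writeback chunk size concrete, here is a toy user-space model (my own simplification, not kernel code; it hands out blocks linearly and ignores mballoc's doubling from 1024 to 2048 blocks). It round-robins over several equally sized dirty files, writing at most a fixed number of pages per inode per pass, and counts the extents that result. With the default chunk of 1024 pages each 256 MB file in the model ends up with 64 extents; with a 32768-page chunk it ends up with 2, which is the sort of behaviour the changes above are aiming for.

```c
/* writeback-model.c - toy model of pdflush round-robin writeback.
 * Not kernel code: it only mimics "write at most `chunk` pages per inode,
 * then move to the next dirty inode" and counts the extents that a simple
 * linear block allocator would produce.
 * Build: gcc -O2 -o writeback-model writeback-model.c
 * Usage: ./writeback-model [chunk_pages]    (try 1024, then 32768)
 */
#include <stdio.h>
#include <stdlib.h>

#define NFILES     4
#define FILE_PAGES 65536     /* 256 MB of dirty 4 KiB pages per file */
#define MAX_EXTENT 32768     /* longest extent ext4 can encode on disk */

int main(int argc, char **argv)
{
    long chunk = argc > 1 ? atol(argv[1]) : 1024;   /* MAX_WRITEBACK_PAGES */
    long remaining[NFILES], last_end[NFILES], cur_len[NFILES], extents[NFILES];
    long disk = 0;           /* next free block; blocks are handed out linearly */
    int i, busy = NFILES;

    if (chunk <= 0)
        chunk = 1024;
    for (i = 0; i < NFILES; i++) {
        remaining[i] = FILE_PAGES;
        last_end[i] = -1;
        cur_len[i] = 0;
        extents[i] = 0;
    }

    while (busy) {
        /* One round of the page flush daemon: visit each dirty inode in turn. */
        for (i = 0; i < NFILES; i++) {
            long n;

            if (!remaining[i])
                continue;
            n = remaining[i] < chunk ? remaining[i] : chunk;
            if (disk == last_end[i] && cur_len[i] + n <= MAX_EXTENT) {
                cur_len[i] += n;     /* contiguous: extends the previous extent */
            } else {
                extents[i]++;        /* interleaved: starts a new extent */
                cur_len[i] = n;
            }
            disk += n;
            last_end[i] = disk;
            remaining[i] -= n;
            if (!remaining[i])
                busy--;
        }
    }

    for (i = 0; i < NFILES; i++)
        printf("file %d: %ld extents with a %ld-page writeback chunk\n",
               i, extents[i], chunk);
    return 0;
}
```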
At the moment the mballoc code isn't capable of allocating more than a block group's worth of blocks at a time, since it was written assuming that there was per-block-group metadata at the beginning of each block group, which prevented allocations from spanning block groups. So long term we may need to make further improvements to help assure sane allocations for really large files (> 128 megs) --- although solution #3 might help this situation even without mballoc changes, since there would only be a single pdflush thread per bdi writing out large files.
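For reference, a quick recap of the arithmetic behind the 128 meg limits in this discussion (my own summary, assuming the 4 KiB block size shown in the filefrag output): a block group's bitmap is a single block, so a group covers 8 x 4096 = 32768 blocks = 128 MB, and an on-disk extent likewise tops out at 32768 blocks, which is why even a perfectly allocated 512 MB file needs at least 4 extents.

```c
/* group-math.c - recap of the 128 MB limits mentioned above (4 KiB blocks). */
#include <stdio.h>

int main(void)
{
    const long block_size = 4096;                 /* bytes, from the filefrag output */
    const long blocks_per_group = 8 * block_size; /* one bitmap block = 32768 bits */
    const long max_extent_blocks = 32768;         /* max length of an on-disk extent */

    printf("block group size: %ld MB\n", blocks_per_group * block_size >> 20);
    printf("max extent size : %ld MB\n", max_extent_blocks * block_size >> 20);
    printf("min extents for a 512 MB file: %ld\n",
           (512L << 20) / (max_extent_blocks * block_size));
    return 0;
}
```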