Bug 13930

Summary: non-contiguous files (64.9%) on an ext4 fs
Product: File System    Reporter: Cédric M (zelogik+bugzilla)
Component: ext4    Assignee: fs_ext4 (fs_ext4)
Status: CLOSED OBSOLETE    
Severity: normal CC: alan, sandeen, tytso
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.31-5(deb) Subsystem:
Regression: No Bisected commit-id:

Description Cédric M 2009-08-07 09:58:46 UTC
Normally ext4 is supposed to avoid non-contiguous files ... so why, when I run:

sudo e2fsck -v -f -p /dev/sdc1

   15953 inodes used (0.03%)
   10361 non-contiguous files (64.9%)
       7 non-contiguous directories (0.0%)
         # of inodes with ind/dind/tind blocks: 0/0/0
         Extent depth histogram: 14555/1388
234324740 blocks used (95.96%)
       0 bad blocks
       1 large file

   14728 regular files
    1216 directories
       0 character device files
       0 block device files
       0 fifos
       0 links
       0 symbolic links (0 fast symbolic links)
       0 sockets
--------
   15944 files



We have 64.9% non-contiguous files on an ext4 fs.

Files have been transferred from a CIFS mount (is that useful?), and file sizes are in the [2-10] MB or [600-800] MB range.


Some references:
LaCie 1 TB on USB, drive usb-Hitachi_HDS721010KLA330

Want more info? Just ask, because I don't know what to send to help you now :)
Comment 1 Eric Sandeen 2009-08-07 14:40:11 UTC
It might be interesting to poke around with filefrag and see which files are fragmented, and compare that to how big they are.  Summary stats like this can sometimes be misleading.

Or to put it another way: if half your files are < 100M and contiguous, and the other half are > 100M with two 50M extents each, I think e2fsck would say "50.0% non-contiguous" - but that isn't really indicative of a problem.

If you can demonstrate that simply copying a large file (or files) from cifs leads to bad fragmentation of that file (or files), then we probably have something to work on.
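
For example, a rough way to gather that (the mount point below is just a placeholder; plain filefrag prints one line per file, "<file>: N extents found"):

  for f in /mnt/ext4/somedir/*; do
      du -h "$f"        # size of the file
      filefrag "$f"     # number of extents ext4 is using for it
  done

That makes it easy to eyeball whether only the large files are picking up extra extents.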
Comment 2 Theodore Tso 2009-08-07 22:20:51 UTC
I'm pretty sure what's going on here is the problem I've reported before: if you have a large number of large files being written at the same time, the Linux page cleaner round-robins between the different dirty inodes so that no single inode is starved of writeback.  This then combines with ext4's multi-block allocator limiting its search to 8MB worth of free extent chunks, so a dirty page writeback request only gets expanded into 2048 blocks.  See the discussion here:

    http://thread.gmane.org/gmane.comp.file-systems.ext4/13107

The reason why you're seeing this so much is that this filesystem has relatively few inodes (just under 16,000) with a very large average file size (about 54 megabytes), so a very large number of the files end up counted as "non-contiguous".  But if you look at this statistic from e2fsck:

         Extent depth histogram: 14555/1388

14,555, or 91% of the files, have no more than 4 extents, so all of their extents fit in the inode itself.  (Note that an extent addresses at most 128 megs, so by definition a 512 meg file will have at least 4 extents.)  That means it's highly likely that if you look at a particularly large file using "filefrag -v", you will see something like this:

 ext logical physical expected length flags
   0       0  2165248             512 
   1     512  2214400  2165759   1536 
   2    2048  2244608  2215935   2048 
   3    4096  2250752  2246655   2048 
   4    6144  2254848  2252799  32768 
   5   38912  2287616            8192 
   6   47104  2299904  2295807   2048 
   7   49152  2306048  2301951   2048 eof

Note that extent #5 is actually located contiguously after extent #4; the reason a new extent was created is that the maximum length that can be encoded in the on-disk extent data structure is 32,768 blocks.  (With 4k blocks, that means a maximum extent size of 128 megs.)

So this kind of "non-contiguous" file is non-optimal, and we really should fix the block allocator to do better.  On the other hand, it's not as disastrously fragmented as, say, the following file from an ext3 filesystem:

 ext logical physical length
   0       0  5228587    12 
   1	  12  5228600   110 
   2     122  5228768   145
   3     267  5228915     1
   4     268  5228918     9
   5     277  5228936    69
   6     346  5229392   165
   7     511  5230282   124
   8     635  5230496    42
   9     677  5231614    10
  10     687  5231856    20
  11     707  5231877    46
  12     753  5231975     1
  13     754  5232033    14
  14     768  5232205     2
  15     770  5233913     4
  16     774  5233992   262
  17    1036  5234256   191

Part of the problem is that "non-contiguous" or "fragmented" doesn't distinguish between a file like the first ext4 file above (which is indeed non-contiguous, and while it could be allocated better on disk, the time to read it sequentially won't be _that_ much worse than for a file that is 100% contiguous) and, say, a file like the second ext3 file, where the performance degradation is much worse.

I suppose we could do something where we define "fragmented" as a file that has extents smaller than N blocks, or whose average extent size is smaller than M blocks.  My original way of dealing with this was to simply use the phrase "non-contiguous" instead of "fragmented", which is technically accurate, but it causes people to get overly concerned when they see something like "64.9% non-contiguous files".  Unfortunately, at the moment that statistic means little more than "approximately 65% of your files are greater than 8 megabytes".
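
(As a rough sketch of such a heuristic: compute the average extent size per file from filefrag's summary line, and flag the file only if that average is small.  The 32 MiB threshold and the path below are arbitrary placeholders.)

  f=/path/to/some/file
  extents=$(filefrag "$f" | awk '{print $(NF-2)}')   # parses "<file>: N extents found"
  size=$(stat -c %s "$f")
  avg=$((size / extents))                            # average bytes per extent
  [ "$avg" -lt $((32 * 1024 * 1024)) ] && echo "$f: fragmented (avg extent ${avg} bytes)"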
Comment 3 Cédric M 2009-08-07 22:37:59 UTC
I did some testing on a 242MB file:

du -h file
242M

And here is the filefrag -v output:
Filesystem type is: ef53
File size of ****** is 253480960 (61885 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0 215814144            2048 
   1    2048 215820288 215816191   2048 
   2    4096 215824384 215822335   2048 
   3    6144 215828480 215826431   2048 
   4    8192 215832576 215830527   2048 
   5   10240 215836672 215834623   2048 
   6   12288 215840768 215838719   2048 
   7   14336 215844864 215842815   2048 
   8   16384 215848960 215846911   2048 
   9   18432 215853056 215851007  26624 
  10   45056 215881728 215879679   2048 
  11   47104 215885824 215883775   2048 
  12   49152 215891968 215887871   2048 
  13   51200 215896064 215894015   2048 
  14   53248 215900160 215898111   2048 
  15   55296 215904256 215902207   2048 
  16   57344 215908352 215906303   2048 
  17   59392 215912448 215910399   2048 
  18   61440 215918592 215914495    445 eof


So why so many extents?

(i.e.: ALL the files were written one by one onto the whole disk, like a "cp -R /cifs /ext4"; not with torrent/crappy_speed_download_tools/etc.)




With a full dir (a for loop running du -h and filefrag on each file, filenames removed with sed):
180M  : 12 extents found
181M  : 20 extents found
281M  : 22 extents found
275M  : 22 extents found
275M  : 22 extents found
181M  : 20 extents found
281M  : 22 extents found
281M  : 21 extents found
277M  : 21 extents found
280M  : 21 extents found
180M  : 21 extents found
285M  : 13 extents found
180M  : 6 extents found
Comment 4 Theodore Tso 2009-08-08 01:30:24 UTC
There are so many extents because, as discussed in the linux-ext4 mailing list thread I referenced above, when you write multiple large files close together in time, the files get interleaved with each other on disk.

"cp -R /cifs /ext4" doesn't call fsync() between writing each file, so the dirty pages for multiple files are left dirty in the page cache during the copy.   The VM page flush daemon doesn't write one file out completely, and then another, but instead round-robins between different inodes.   The ext4 delayed allocation code tries to work around this by trying to find adjacent dirty pages and then trying to do a large block allocation; but currently the ext4 multiblock allocator only tries to grab up to 8 megabytes at a time, to avoid spending too much CPU time in what might be a fruitless attempt to find that many contiguous free blocks.

It's a known bug, but fixing it is a bit complicated.  It's on our TODO list.
Comment 5 Theodore Tso 2009-08-10 12:04:13 UTC
I did some more looking at this issue.  The root cause is pdflush, the daemon that starts forcing background writes when 10% of the available page cache is dirty.  It writes out a maximum of 1024 pages per pass, because of a hard-coded limit in mm/page-writeback.c:

/*
 * The maximum number of pages to writeout in a single bdflush/kupdate
 * operation.  We do this so we don't hold I_SYNC against an inode for
 * enormous amounts of time, which would block a userspace task which has
 * been forced to throttle against that inode.  Also, the code reevaluates
 * the dirty each time it has written this many pages.
 */
#define MAX_WRITEBACK_PAGES	1024

This means that background_writeout() in mm/page-writeback.c only calls ext4_da_writepages requesting a writeout of 1024 pages, which we can see if we put a trace on ext4_da_writepages after writing a very large file:

         pdflush-398   [000]  5743.853396: ext4_da_writepages: dev sdc1 ino 12 nr_t_write 1024 pages_skipped 0 range_start 0 range_end 0 nonblocking 1 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
         pdflush-398   [000]  5743.858988: ext4_da_writepages_result: dev sdc1 ino 12 ret 0 pages_written 1024 pages_skipped 0 congestion 0 more_io 0 no_nrwrite_index_update 0
         pdflush-398   [000]  5743.923578: ext4_da_writepages: dev sdc1 ino 12 nr_t_write 1024 pages_skipped 0 range_start 0 range_end 0 nonblocking 1 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
         pdflush-398   [000]  5743.927562: ext4_da_writepages_result: dev sdc1 ino 12 ret 0 pages_written 1024 pages_skipped 0 congestion 0 more_io 0 no_nrwrite_index_update 0
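
(For reference, a trace like the one above can be captured with ftrace, assuming debugfs is mounted at /sys/kernel/debug and the kernel was built with the ext4 tracepoints:)

  echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_da_writepages/enable
  echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_da_writepages_result/enable
  cat /sys/kernel/debug/tracing/trace_pipe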

The ext4_da_writepages() function is therefore allocating 1024 blocks at a time, which the ext4 multiblock allocator sometimes increases to 2048 blocks (so sometimes 1024 blocks get allocated and sometimes 2048), as we can see from /proc/fs/ext4/<dev>/mb_history:

1982  12       1/14336/1024@12288      1/14336/2048@12288      1/14336/2048@12288      1     0     0  0x0e20 M     0     0     
1982  12       1/15360/1024@13312                              1/15360/1024@13312     
1982  12       1/16384/1024@14336      1/16384/2048@14336      1/16384/2048@14336      1     0     0  0x0e20 M     2048  8192  
1982  12       1/17408/1024@15360                              1/17408/1024@15360     

If there are multiple large dirty files in the page cache, pdflush will round-robin trying to write out the inodes, with the result that large files get interleaved in chunks of 4M (1024 pages) to 8M (2048 pages); larger chunks happen only when pages from just one inode are left in memory.

Potential solutions in the next comment...
Comment 6 Theodore Tso 2009-08-10 13:11:32 UTC
There are a number of ways we could increase the size of the block allocation requests made by ext4_da_writepages:

1)  Increase MAX_WRITEBACK_PAGES, possibly on a per-filesystem basis.

The comment around MAX_WRITEBACK_PAGES indicates the concern is about blocking tasks that wait on I_SYNC, but it's not clear this is really a problem.  Before I_SYNC was separated out from I_LOCK, this was clearly much more of an issue, but now the only time a process waits on I_SYNC, as near as I can tell, is when it is calling fsync() or otherwise forcing out the inode.  So I don't think it's going to be that big of a deal.

2) We can change ext4_da_writepages() to check whether there are more dirty pages in the page cache beyond what it has been asked to write, and if so, pass a hint to mballoc via an extension to the allocation_request structure so that additional blocks are allocated and reserved in the inode's preallocation structure.

3) Jens Axboe is working on a set of patches which create a separate pdflush thread for each block device (the per-bdi patches).  I don't think there is a risk in increasing MAX_WRITEBACK_PAGES, but if there is still a concern, perhaps the per-bdi patches could be changed to prefer dirty inodes which have been closed, writing out such inodes completely, one at a time, instead of stopping after MAX_WRITEBACK_PAGES.

These changes should allow us to improve ext4's large file writeback to the point where it is allocating up to 32768 blocks at a time, instead of 1024 blocks at a time.  At the moment the mballoc code isn't capable of allocating more than a block group's worth of blocks at a time, since it was written assuming there was per-block-group metadata at the beginning of each block group which prevented allocations from spanning block groups.  So long term we may need to make further improvements to assure sane allocations for really large files (> 128 megs) --- although solution #3 might help this situation even without mballoc changes, since there would only be a single pdflush thread per bdi writing out large files.