Bug 15420

Summary: EXT4_USE_FOR_EXT23 causes wrong free space calculation on ext2 and ext3
Product: File System Reporter: Thomas Bächler (thomas)
Component: ext4Assignee: fs_ext4 (fs_ext4)
Status: RESOLVED CODE_FIX    
Severity: normal CC: alan, bluewind, jack, sandeen, tytso
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.33 Subsystem:
Regression: No Bisected commit-id:
Attachments: .config for my personal kernel
Possible patch from Jan Kara
Fix logic in ext4_indirect_calc_metadata_amount

Description Thomas Bächler 2010-03-01 18:18:24 UTC
When enabling the EXT4_USE_FOR_EXT23 option in 2.6.33. The free space calculation on ext2 and ext3 becomes incorrect. Steps to reproduce:

1) Create ext2 or ext3 file system with mkfs
2) Mount it on a kernel with EXT4_USE_FOR_EXT23 enabled
3) Launch "watch -n1 df -h /path/to/mountpoint" in another terminal
4) Download a file with wget

The "used space" on the ext2/3 will now increase by about 500MB per second. Once you stop the download and type "sync" (or wait for a while), the values shown for free/used space return to expected values. This is always reproducible.

We can reproduce this on x86_32 and x86_64 with the Arch Linux distribution kernel, as well as my own personal Core2-optimized 64 Bit kernel.

.config x86_32: http://repos.archlinux.org/wsvn/packages/kernel26/trunk/config?rev=70484
.config x86_64: http://repos.archlinux.org/wsvn/packages/kernel26/trunk/config.x86_64?rev=70484
Comment 1 Thomas Bächler 2010-03-01 18:19:33 UTC
Created attachment 25293 [details]
.config for my personal kernel
Comment 2 Thomas Bächler 2010-03-09 21:27:19 UTC
Ping? Anyone actually reading this?
Comment 3 Eric Sandeen 2010-03-09 21:35:05 UTC
Yes, but we are reading a lot of things ;)

Is it du, or df, or both, which shows the "wrong" amount until sync?
Comment 4 Thomas Bächler 2010-03-09 22:07:23 UTC
It is definitely df.

I didn't test du, and have no test case right now (and the other guys who had the problem are not reachable right now). I'll make another test with du as soon as I get to it.
Comment 5 Theodore Tso 2010-03-10 03:12:30 UTC
This sounds like the block reservation code for delayed allocation is over-aggressive about estimating how many blocks are need for the indirect blocks in the case of indirect-mapped inodes (which would be what you get when mounting an ext2 filesystem using the ext4 file system driver).

This should be an aesthetic issue only, but it would be nice to have this be fixed.
Comment 6 Jan Kara 2010-03-10 12:25:21 UTC
> http://bugzilla.kernel.org/show_bug.cgi?id=15420
> 
> 
> Theodore Tso <tytso@mit.edu> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |tytso@mit.edu
> 
> 
> 
> 
> --- Comment #5 from Theodore Tso <tytso@mit.edu>  2010-03-10 03:12:30 ---
> This sounds like the block reservation code for delayed allocation is
> over-aggressive about estimating how many blocks are need for the indirect
> blocks in the case of indirect-mapped inodes (which would be what you get
> when
> mounting an ext2 filesystem using the ext4 file system driver).
> 
> This should be an aesthetic issue only, but it would be nice to have this be
> fixed.
  Since bugzilla seems to be down now, I'll reply by email:
Won't it make sence to use 'nodelalloc' by default when ext4 driver is used
for ext2 or ext3 filesystem? Because delayed allocation has some implications
people need not expect - like the overestimation of needed blocks, or changed
semantics of ordered data mode etc...
  So something like the patch below?

									Honza
Comment 7 Thomas Bächler 2010-03-10 12:36:12 UTC
(In reply to comment #5)
> This sounds like the block reservation code for delayed allocation is
> over-aggressive about estimating how many blocks are need for the indirect
> blocks in the case of indirect-mapped inodes (which would be what you get
> when
> mounting an ext2 filesystem using the ext4 file system driver).
> 
> This should be an aesthetic issue only, but it would be nice to have this be
> fixed.

It is not purely aesthetic, lots of applications abort their operation if space is supposedly running low.

However, as this doesn't seem to be high priority for anyone here, we should probably consider disabling EXT4_USE_FOR_EXT23 in our distribution kernel for the time being.

(In reply to comment #6)
> Won't it make sence to use 'nodelalloc' by default when ext4 driver is used
> for ext2 or ext3 filesystem? Because delayed allocation has some implications
> people need not expect - like the overestimation of needed blocks, or changed
> semantics of ordered data mode etc...

Don't see a patch here.
Comment 8 Theodore Tso 2010-03-10 14:16:46 UTC
What applications do you actually call statfs() and check the amount of space that you have?  I'm not aware of any.

Does the problem go away if you add the mount option "nodelalloc"?   If it does, that's Jan's patch is probably the right workaround.  It looks like Bugzilla stripped it off (it's in the e-mail version of his reply).  I'll add it to this bug, and you can try it out as well.
Comment 9 Theodore Tso 2010-03-10 14:18:25 UTC
Created attachment 25448 [details]
Possible patch from Jan Kara
Comment 10 Thomas Bächler 2010-03-10 14:56:42 UTC
I don't know the exact application that was used, but it one guy said he was unable to work because many applications would constantly complain that there was not enough space and thus abort. I'll ask again.

I'll test this later and report back.
Comment 11 Eric Sandeen 2010-03-10 15:11:28 UTC
If delalloc doesn't work with indirect files on ext3 filesystems, then we need to stop telling people they can migrate ext3 to ext4 simply by turning on extents etc, and mounting as ext4.
Comment 12 Theodore Tso 2010-03-10 15:28:12 UTC
Well, let's figure out what actually broke, and why.   The vast majority of files are not appended to after they are first written, I suspect the migration story still works.   And this issue with over-aggressive estimation for space needed for indirect blocks is probably something recent in 2.6.33, and I think we can fix it --- and even if it isn't, I'd really like to know what applications are checking statfs and aborting as a result, and under what situations this is happening.
Comment 13 Theodore Tso 2010-03-10 15:44:46 UTC
OK, I can confirm that the problem is with delayed allocation and indirect files.   While downloading a 70 meg kernel source tar ball, the amount of space shown as being used by df goes as high as 1 gigabyte (50% of the space on my disk, about 25% of my available free memory) before dropping back down. as the blocks get allocated.   Mounting with nodelalloc makes the problem go away.
Using extents, the df result will occasionally be as much as 1-2 megs but it is much smaller in practice.

Using nodelalloc for ext2/3 mounts may make sense not just as a workaround, but to help protect users using crappy desktop applications that don't know how to use fsync().  (They'll still get screwed by btrfs, but the lazy application writers and Phoronix they can complain to the btrfs developers about how they are incompetent, and it's no longer my problem at that point.  :-)

We should try to improve the estimation logic for indirect blocks (I remember being in a hurry to write the code to work around the regression caused by the quota bugfix that has caused us so much heartache), but I don't think this is a problem for people who are migrating from ext3 to ext4, since it's rare that they will be extending an already existing indirect-mapped file.  The problem with EXT4_USE_FOR_EXT23 is that they are _not_ migrating, and we're seeing a weakness in delalloc with indirect block files; a weakness that was only introduced recently, and one that I hope we can fix.   But for other reasons mentioned above, we probably want to turn on nodelalloc anyway...
Comment 14 Jan Kara 2010-03-11 16:00:10 UTC
Indeed, the logic in ext4_indirect_calc_metadata_amount seems to be broken. Attached patch fixes it for me...
Comment 15 Jan Kara 2010-03-11 16:01:19 UTC
Created attachment 25477 [details]
Fix logic in ext4_indirect_calc_metadata_amount
Comment 16 Jan Kara 2010-03-11 16:10:46 UTC
As for comment #13 - I fully agree with Ted. When admin does "mount -t ext3 ...", I think we should give him as close behavior to ext3 as possible by the principle of the least surprise - and that means nodelalloc. If user wishes better performance etc, he can always mount the filesystem as ext4 explicitely. Also I remember some discussions with KDE guys when they were at some point considering detecting filesystem type and avoid doing fsync if it was ext3 - hopefully, I convinced them not to do it but I bet someone else had the same idea and actually implemented it. Not that I'd have too much pity with broken applications but I prefer to avoid breaking them silently like this...
Comment 17 Eric Sandeen 2010-03-11 16:16:46 UTC
(In reply to comment #16)
> As for comment #13 - I fully agree with Ted. When admin does "mount -t ext3
> ...", I think we should give him as close behavior to ext3 as possible by the
> principle of the least surprise - and that means nodelalloc. If user wishes
> better performance etc, he can always mount the filesystem as ext4
> explicitely.

Under what circumstances, then, do we recommend mounting ext3 with the ext4 driver?

The user will get different allocation behavior as well; will that also be a surprise?

I just want to be clear in our recommendations about this stuff...

> Also I remember some discussions with KDE guys when they were at some point
> considering detecting filesystem type and avoid doing fsync if it was ext3 -
> hopefully, I convinced them not to do it but I bet someone else had the same
> idea and actually implemented it. Not that I'd have too much pity with broken
> applications but I prefer to avoid breaking them silently like this...

I think by having this config option at -all- we have really muddied the waters, to be honest ... but I suppose this is all better hashed outon the list, not in this bug.  :)
Comment 18 Theodore Tso 2010-03-11 17:42:06 UTC
On Thu, Mar 11, 2010 at 04:00:27PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> Indeed, the logic in ext4_indirect_calc_metadata_amount seems to be broken.
> Attached patch fixes it for me...

Nice catch.  Yeah, I really screwed that up!

I'll get that into the ext4 patch queue, quickly, and get that pushed
to Linus.

					- Ted