Bug 15910 - zero-length files and performance degradation
Status: RESOLVED INVALID
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All
OS: Linux
Importance: P1 normal
Assigned To: fs_ext4@kernel-bugs.osdl.org
Depends on:
Blocks:
Reported: 2010-05-05 13:48 UTC by jeanbaptiste.lallement@gmail.com
Modified: 2011-03-09 19:09 UTC
CC: 5 users

See Also:
Kernel Version: 2.6.32-21-generic
Tree: Mainline
Regression: No



Description jeanbaptiste.lallement@gmail.com 2010-05-05 13:48:47 UTC
Hi,

I'm raising this topic again due to the large number of users experiencing the zero-length file issue on ext4 filesystems after a system crash or power failure. We have collected hundreds of reports from users who can no longer update their system after a crash during, or shortly after, package operations, due to zero-length control scripts (see the references in [1]).

To reproduce it:
* install a fresh Ubuntu Lucid system on an ext4 filesystem, or Debian with dpkg < 1.15.6, or Ubuntu Karmic
* install a package, wait a few seconds, and simulate a crash:
$ sudo apt-get install some-package; sleep 5; echo b | sudo tee /proc/sysrq-trigger
* reboot
$ ls -l /var/lib/dpkg/info/some-package.* will list the maintainer scripts as empty.
$ ls -l /var/cache/apt/archive/some-package.* will show that the archive you've just downloaded is empty.
At this stage, the package manager is unusable and an ordinary user cannot update anything anymore.

This behavior is observed with data=ordered and with or without the mount option auto_da_alloc.
The problem is caused by:
1) rename(), which should act as a barrier with data=ordered but doesn't; auto_da_alloc does not detect the replace-via-rename pattern (at least not in the dpkg case).
2) file creation followed by a crash, resulting in an empty file.

To work around and mitigate this issue, the 'dpkg' package manager in Debian and Ubuntu has been patched to fsync() extracted files (Debian dpkg 1.15.6 and Ubuntu 1.15.5.6ubuntu2).

We first added an fsync() call for each extracted file, but the scattered fsyncs resulted in a massive performance degradation during package installation (a factor of 10 or more; some reported that it took over an hour to unpack a linux-headers-* package!).
To reduce the I/O performance degradation, the fsync() calls were deferred so that the writes and fsyncs are serialized in batches. The performance loss is now a factor of 2 to 5, depending on the package.

So we currently have the choice between filesystem corruption and a major performance loss. Neither is satisfactory.

What is expected is simply that a file is either there or not, but never something in between.

[1] references:
http://bugs.debian.org/430958
http://bugs.debian.org/567089
http://bugs.debian.org/578635
https://bugs.launchpad.net/ubuntu/+bug/512096
https://bugs.launchpad.net/ubuntu/+bug/537241
https://bugs.launchpad.net/ubuntu/+bug/559915
https://bugs.launchpad.net/ubuntu/+bug/570805

-- 
JB
Comment 1 Theodore Tso 2010-05-05 18:54:23 UTC
Why can't you, #1, just fsync() after writing the control file, if that's the primary problem?

Or, #2, make dpkg recover more gracefully if it finds that the control file has been truncated down to zero?

The reality is that all of the newer file systems are going to have this property.  XFS has always behaved this way.  Btrfs will as well.  We are _all_ using the same heuristic to force-sync a file which is replaced via a rename() system call, but that's really considered a workaround for buggy application programs that don't call fsync(), because there are more stupid application programmers than there are of us file system developers.

As far as the rest of the files are concerned, what I would suggest doing is setting a sentinel value which indicates that a package is being installed; if the system crashes, then either in the init scripts or the next time dpkg runs, it should reinstall that package.   That way you're not fsync()'ing every single file in the package, and you're also not optimizing for the exceptional condition.   You just have appropriate application-level retries in case of a crash.

So Debian and Ubuntu have a choice.  You can just stick with ext3 and not upgrade, but this is one place where you can't blackmail file system developers by saying, "if you don't do this, I'll go use some other file system" --- because we are *all* doing delayed allocation.   It's allowed by POSIX, and it's the only way to get much better file system performance --- and there are intelligent ways you can design your applications so the right thing happens on a power failure.   Programmers used to be familiar with these in the days before ext3, because that's how the world had always worked in Unix.

Ext3 has lousy performance precisely because it guaranteed stronger semantics than what was promised by POSIX, and unfortunately, people have gotten flabby (think: the humans in the movie Wall-E) and lazy about writing programs that write to the file system defensively.   So if people are upset about the performance of ext3, great, upgrade to newer file systems.   But then you will need to be careful about how you code applications like dpkg.

In retrospect, I really wish we hadn't given programmers the data=ordered guarantees in ext3, because they both trashed ext3's performance and caused application programmers to get the wrong idea about how the world worked.  Unfortunately, the damage has been done....
Comment 2 Dmitry Monakhov 2010-05-06 04:06:07 UTC
Maybe this is not the best place to ask, but still.

Most script/app developers have been addicted to ordered mode for too long,
so they no longer call fsync() before rename() in the usual
create-a-new-copy-and-rename scenario for configs/init scripts. Most
developers do not even know that it is necessary (mandatory).
And in fact the consequences are usually fatal, because the files are
usually important and the old version has already been unlinked.
This affects both file systems, because ext3 now uses writeback by default,
and ext4 uses writeback+delalloc.

Maybe it would be useful to introduce a compat mount option which forces an
fsync() internally inside rename(). Rename is not that frequent an operation,
so it has a much smaller performance penalty than real ordered mode.
Comment 3 Eric Sandeen 2010-05-06 04:18:57 UTC
(In reply to comment #2)

> Maybe it would be useful to introduce a compat mount option which forces an
> fsync() internally inside rename(). Rename is not that frequent an operation,
> so it has a much smaller performance penalty than real ordered mode.

ext4 does already have an allocate-on-rename heuristic, though it's not exactly an fsync():

        if (retval == 0 && force_da_alloc)
                ext4_alloc_da_blocks(old_inode);
from

commit 8750c6d5fcbd3342b3d908d157f81d345c5325a7
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Mon Feb 23 23:05:27 2009 -0500

    ext4: Automatically allocate delay allocated blocks on rename

Still, more mount options don't seem to solve the problem to me; in the end, applications can't rely on them...
Comment 4 Guillem Jover 2010-05-09 18:19:07 UTC
Hi!

(In reply to comment #1)
> Why can't you #1, just fsync after writing the control file, if that's the
> primary problem?
> 
> Or #2, make the dpkg recover more gracefully if it finds that the control file
> has been truncated down to zero?

dpkg is now fsync()ing after all internal db changes, after control file extractions, *and* after extracting the to-be-installed files from the deb package. It's also fsync()ing directories, at least for all db directory changes.

As background info, dpkg used to fsync() all db files except for the newly extracted control files.

> The reality is that all of the newer file systems are going to have this
> property.  XFS has always behaved this way.  Btrfs will as well.  We are _all_
> using the same heuristic to force-sync a file which is replaced via a rename()
> system call, but that's really considered a workaround for buggy application
> programs that don't call fsync(), because there are more stupid application
> programmers than there are of us file system developers.

I don't have any problem with that, and I personally consider previous dpkg behaviour buggy. And as you say it's bound to cause problems on other file systems eventually.

> As far as the rest of the files are concerned, what I would suggest doing is
> set a sentinel value which is used to indicate that package is being installed,
> and if the system crashes, either in the init scripts or the next time dpkg
> runs, it should reinstall that package.   That way you're not fsync()'ing every
> single file in the package, and you're also not optimizing for the exception
> condition.   You just have appropriate application-level retries in case of a
> crash.

dpkg already marks packages which failed to unpack as such, noting that they need to be reinstalled; it can also recover from such situations by rolling back to the previous files, which it keeps as backups until it has finished the current package operation.

The problem is that dpkg needs to guarantee the system is always usable. When a crash occurs, say while it's unpacking libc, it's not acceptable for dpkg not to fsync() before rename(), as it might end up with an empty libc.so file, even though it might have marked the package as correctly unpacked (wrongly but unknowingly, as there are no guarantees); the package isn't really unpacked until the changes have been fully committed to the file system.

If any file from the many packages required for the system to boot properly, or for dpkg itself to operate correctly, ends up with zero length, then neither the user nor the system will be able to recover from the situation. Worse, this might require recovering from different media, for example, which end-users should not be required to do, or might simply not know how to.

I guess in this regard dpkg is special and cannot be compared to something like Firefox fsync()ing too much: if dpkg fails to operate properly, your entire system might get hosed.

> So Debian and Ubuntu have a choice.  You can just stick with the ext3, and not
> upgrade, but this is one place where you can't blackmail file system developers
> by saying, "if you don't do this, I'll go use some other file system" ---
> because we are *all* doing delayed allocation.   It's allowed by POSIX, and
> it's the only way to get much better file system performance --- and there are
> intelligent ways you can design your applications so the right thing happens on
> a power failure.   Programmers used to be familiar with these in the days
> before ext3, because that's how the world has always worked in Unix.  
> 
> Ext3 has lousy performance precisely because it guaranteed stronger semantics
> than what was promised by POSIX, and unfortunately, people have gotten flabby
> (think: the humans in the movie Wall-E) and lazy about how to write programs
> that write to the file system defensively.   So if people are upset about the
> performance of ext3, great, upgrade to newer file systems.   But then you will
> need to be careful about how you code applications like dpkg.

The main problem is that doing the right thing (fsync() + rename()) does not really penalize ext3 users, but it does penalize ext4, which is the file system that really needs it. So we end up with lots of users (mostly from Ubuntu, though, as it has already switched to ext4 as the default) complaining that the slowdown is unacceptable, and I don't see many options besides adding a --force-unsafe-io option or similar, which those users would add to their dpkg.cfg file with the acknowledgment that they might lose data in case of an abrupt halt.

Something in between that we have talked about is doing fsync() on extracted files only for a subset of the packages, say only for priority important or higher, which, besides being the wrong solution, does not cover packages as important as the kernel or boot loaders. That's obviously better than no fsync() at all but still not right; it could be added as --force-unsafe-io and the previous one as --force-unsafer-io, but still.
Comment 5 Theodore Tso 2010-05-10 03:49:25 UTC
> The problem is, dpkg needs to guarantee the system is always usable,
> and when a crash occurs, say when it's unpacking libc, it's not
> acceptable for dpkg not to fsync() before rename() as it might end
> up with an empty libc.so file, even if it might have marked the
> package as correctly unpacked (wrongly but unknowingly as there's no
> guarantees), which is not true until the changes have been fully
> committed to the file system.

Why not unpack all of the files as "foo.XXXXXX" (where XXXXXX is a
mkstemp filename template), do a sync call (which in Linux is
synchronous and won't return until all the files have been written),
and only then rename the files?  That's going to be the fastest
and most efficient way to guarantee safety under Linux; the downside
is that you need to have enough free space to store the old and the
new files in the package simultaneously.  But this is also a win,
because it means you don't actually start overwriting files in a
package until you know that the package installation is most likely
going to succeed.  (Well, it could fail in the postinstall script, but
at least you don't have to worry about disk-full errors.)

   	     	   	   	       	    - Ted
Comment 6 Peng Tao 2010-05-10 14:24:44 UTC
On Mon, May 10, 2010 at 10:56 AM,  <tytso@mit.edu> wrote:
>> The problem is, dpkg needs to guarantee the system is always usable,
>> and when a crash occurs, say when it's unpacking libc, it's not
>> acceptable for dpkg not to fsync() before rename() as it might end
>> up with an empty libc.so file, even if it might have marked the
>> package as correctly unpacked (wrongly but unknowingly as there's no
>> guarantees), which is not true until the changes have been fully
>> committed to the file system.
>
> Why not unpack all of the files as "foo.XXXXXX" (where XXXXXX is a
> mkstemp filename template) do a sync call (which in Linux is
> synchronous and won't return until all the files have been written),
> and only then rename the files?  That's going to be the fastest
> and most efficient way to guarantee safety under Linux; the downside
> is that you need to have enough free space to store the old and the
> new files in the package simultaneously.  But this also is a win,
> because it means you don't actually start overwriting files in a
> package until you know that the package installation is most likely
> going to succeed.  (Well, it could fail in the postinstall script, but
> at least you don't have to worry about disk full errors.)
What about letting fsync() on a directory recursively fsync() all
files/sub-dirs in that directory?
Then apps could unpack the package in a temp dir, fsync(), and rename.
>
>                                            - Ted



-- 
Thanks,
-Bergwolf
Comment 7 Theodore Tso 2010-05-10 14:42:13 UTC
On Mon, May 10, 2010 at 10:22:47PM +0800, Peng Tao wrote:
> What about letting fsync() on dir recursively fsync() all
> files/sub-dirs in the dir?
> Then apps can unpack package in a temp dir, fsync(), and rename.

There are programs that execute fsync() on a directory, and they
do not expect a recursive fsync() of all files/subdirectories in that
directory.

At least for Linux, sync() is synchronous and will do what you want.
There is unfortunately no portable way to do what you want short of
fsync()'ing all of the files after they are written.  This case is
mostly optimized under ext3/4 (we could do a bit better for ext4, but
the performance shouldn't be disastrous --- certainly much better than
write a file, fsync, rename the file, repeat).

					- Ted
Comment 8 Guillem Jover 2010-05-10 17:23:31 UTC
(In reply to comment #5)
> Why not unpack all of the files as "foo.XXXXXX" (where XXXXXX is a
> mkstemp filename template) do a sync call (which in Linux is
> synchronous and won't return until all the files have been written),
> and only then, rename the files? That's going to be the most fastest
> and most efficient way to guarantee safety under Linux; the downside
> is that you need to have enough free space to store the old and the
> new files in the package simultaneously. But this also is a win,
> because it means you don't actually start overwriting files in a
> package until you know that the package installation is most likely
> going to succeed.  (Well, it could fail in the postinstall script, but
> at least you don't have to worry about disk full errors.)

Ah, I forgot to mention that we also discussed using sync(), but the
problem, as you say, is that sync() is not portable, so we need the
deferred fsync() and rename() code anyway for unpacked files on
non-Linux systems. Another possible issue is that if there's been lots
of I/O in parallel or just before running dpkg, the sync() might take
much longer than expected; but depending on the implementation, fsync()
might show similar slowdowns anyway (not, though, if the parallel I/O
was on a different "disk" and file system).

Regarding the downsides and wins you mention, they already apply to the
current implementation. As I mentioned before, dpkg has always supported
rolling back: it makes a hardlinked backup of the old file as .dpkg-tmp,
extracts the new file as .dpkg-new, and then does an atomic rename() over
the current file; in case of error (from dpkg itself or the appropriate
maintainer script) it restores all the old file backups for the package
(either in the current run or in a subsequent dpkg run). Only once the
unpack stage has been successful does it remove the backups in one pass.
So the need for rollback already makes dpkg take (approximately) twice the
space per package, and thus there are no unsafe overwrites that cannot be
reverted (except for the zero-length ones).

I've now added conditional code for Linux to do the sync() and then
rename() all files in one pass; it's just a few lines of code (thanks to
the deferred fsync() changes which are now in place). I'll request some
testing from ext4 users, and if it improves things and does not make
matters worse on ext3 and other file systems, then I guess we might use
that on Linux. It still looks like a workaround to me.

As a side remark, I don't think it's fair, though, that you complain about
application developers not doing the right thing when at the same time you
expect them not to use the proper portable tool for the job, and you seem
to see no problem in the fact that using it implies a performance penalty
on the very file system that needs it. That there are lots of users willing
to sacrifice safety for performance tells me the penalty is significant
enough. Isn't there anything that could be improved to make the correct
fsync()+rename() case a bit faster? In this particular case the fsyncs are
already batched after all writes have been performed.
Comment 9 Theodore Tso 2010-05-10 22:33:09 UTC
I'll grant that using sync(2) is non-portable, but relying on _not_ needing an fsync(2) at all was also just as non-portable, if not worse (it only really worked on ext3, and no other file system, and of course only on Linux).

Trying to make (fsync, rename)**N --- that is, alternating fsync and rename calls --- fast is always going to be difficult for nearly all file systems.  The fundamental problem is that file systems are optimized for throughput when you're _not_ calling fsync all the time; that's a very different sort of thing from what databases need to do --- and databases generally solve the problem by having _two_ logs, a redo and an undo log.  I don't know of any filesystem which has that kind of complexity, so pretty much any filesystem where you have a series of interleaved fsync() and rename() calls is going to run you into pain.   Some filesystems will be better at it than others, but it's always going to be faster to write all the files, do a single sync, and then do all of the renames.   Yeah, that's non-portable; the problem is that the only synchronization primitive which POSIX gives us is fsync(), and so we just don't have a lot of options in terms of what we can communicate between userspace and the kernel.

One of the things I wonder about is why users' systems are crashing so often that this is a problem.  I can't remember the last time I've had a system crash while doing an "apt-get dist-upgrade" or "apt-get upgrade".  Is this a common problem or an uncommon one?   And if it's not that common (and I hope it isn't, but maybe Ubuntu is shipping too many unstable, crappy binary device drivers), maybe the right answer is to have rescue CDs or rescue partitions which will automatically repair a damaged libc package if the system just happened to crash while upgrading glibc.   Again, let's optimize for the common case; I hope we haven't entered the Windows world where blue screens of death are so common that crashing is the case we have to optimize for....
Comment 10 Phillip Susi 2011-03-07 00:30:36 UTC
Since it was decided that this is not a bug in the kernel, shouldn't this report be closed?
Comment 11 Theodore Tso 2011-03-09 19:09:22 UTC
Agreed, closing.
