Hi, I'm raising this topic again due to the large number of users experiencing
the zero-length file issue on ext4 after a system crash or power failure. We
have collected hundreds of reports from users who can no longer update their
system after a crash during or shortly after package operations, due to
zero-length control scripts (see the [1] references).

To reproduce it:

* install a fresh Ubuntu Lucid system on an ext4 filesystem, or Debian with
  dpkg < 1.15.6, or Ubuntu Karmic
* install a package, wait a few seconds and simulate a crash:

    $ sudo apt-get install some-package; sleep 5; echo b | sudo tee /proc/sysrq-trigger

* reboot

$ ls -l /var/lib/dpkg/info/some-package.* will list empty maintainer scripts.
$ ls -l /var/cache/apt/archives/some-package.* will show the empty archive you've
just downloaded.

At this stage, the package manager is unusable and an ordinary user cannot
update anything anymore.

This behavior is observed with data=ordered, with or without the auto_da_alloc
mount option. The problem is caused by:

1) rename(), which should act as a barrier with data=ordered but doesn't;
   auto_da_alloc does not detect the replace-via-rename (at least not in
   dpkg's case).
2) file creation followed by a crash, resulting in an empty file.

To work around and mitigate this issue, in Debian and Ubuntu the 'dpkg' package
manager has been patched to fsync() extracted files (Debian dpkg 1.15.6 and
Ubuntu 1.15.5.6ubuntu2).

We first added an fsync() call for each extracted file, but the scattered
fsyncs resulted in a massive performance degradation during package
installation (a factor of 10 or more; some reported that it took over an hour
to unpack a linux-headers-* package!). To reduce the I/O performance
degradation, the fsync() calls were deferred so that the writes and fsyncs are
serialized. The performance loss is now a factor of 2 to 5, depending on the
package.

So we currently have the choice between filesystem corruption and a major
performance loss. Neither is satisfactory. What is expected is simply that a
file is either there or not, never something in between.

[1] references:
http://bugs.debian.org/430958
http://bugs.debian.org/567089
http://bugs.debian.org/578635
https://bugs.launchpad.net/ubuntu/+bug/512096
https://bugs.launchpad.net/ubuntu/+bug/537241
https://bugs.launchpad.net/ubuntu/+bug/559915
https://bugs.launchpad.net/ubuntu/+bug/570805

--
: JB
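To make the failure mode concrete: the pattern that produces the zero-length
files is replace-via-rename without an fsync(), as in this minimal C sketch
(illustrative only; the file name and contents are made up, most error
handling omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Extract the new control script to a temporary name... */
        int fd = open("postinst.dpkg-new", O_WRONLY | O_CREAT | O_TRUNC, 0755);
        if (fd < 0)
            return 1;
        write(fd, "#!/bin/sh\nexit 0\n", 17);
        close(fd);

        /* ...and replace the old one.  With delayed allocation, no fsync(fd)
         * has forced the data out yet, so a crash shortly after this
         * rename() can leave a zero-length "postinst" on disk. */
        rename("postinst.dpkg-new", "postinst");
        return 0;
    }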
Why can't you #1, just fsync after writing the control file, if that's the
primary problem?

Or #2, make dpkg recover more gracefully if it finds that the control file has
been truncated down to zero?

The reality is that all of the newer file systems are going to have this
property. XFS has always behaved this way. Btrfs will as well. We are _all_
using the same heuristic to force-sync a file which is replaced via a rename()
system call, but that's really considered a workaround for buggy application
programs that don't call fsync(), because there are more stupid application
programmers than there are of us file system developers.

As far as the rest of the files are concerned, what I would suggest doing is
to set a sentinel value which indicates that a package is being installed, and
if the system crashes, either the init scripts or the next run of dpkg should
reinstall that package. That way you're not fsync()'ing every single file in
the package, and you're also not optimizing for the exception condition. You
just have appropriate application-level retries in case of a crash.

So Debian and Ubuntu have a choice. You can just stick with ext3 and not
upgrade, but this is one place where you can't blackmail file system
developers by saying, "if you don't do this, I'll go use some other file
system" --- because we are *all* doing delayed allocation. It's allowed by
POSIX, and it's the only way to get much better file system performance ---
and there are intelligent ways you can design your applications so the right
thing happens on a power failure. Programmers used to be familiar with these
in the days before ext3, because that's how the world has always worked in
Unix.

Ext3 has lousy performance precisely because it guaranteed more semantics than
what was promised by POSIX, and unfortunately people have gotten flabby
(think: the humans in the movie Wall-E) and lazy about writing programs that
write to the file system defensively. So if people are upset about the
performance of ext3, great, upgrade to a newer file system. But then you will
need to be careful about how you code applications like dpkg.

In retrospect, I really wish we hadn't given programmers the data=ordered
guarantees in ext3, because they both trashed ext3's performance and gave
application programmers the wrong idea about how the world worked.
Unfortunately, the damage has been done....
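(For illustration only, the sentinel idea could be sketched roughly as below;
the marker path and helper names are invented, not anything dpkg provides,
and a robust version would also fsync() the marker's directory:)

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Before unpacking a package, persist a marker naming it.  If the marker
     * is still present on the next boot or dpkg run, reinstall the package. */
    static int mark_install_started(const char *pkg)
    {
        int fd = open("/var/lib/dpkg/installing",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        write(fd, pkg, strlen(pkg));
        fsync(fd);              /* the only per-package fsync() needed */
        close(fd);
        return 0;
    }

    /* After the package has been fully unpacked (and eventually synced). */
    static void mark_install_done(void)
    {
        unlink("/var/lib/dpkg/installing");
    }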
Maybe this is not the best place to ask, but still. Most script/app developers
have been addicted to ordered mode for too long, so they no longer call
fsync() before rename() in the usual create-new-copy-and-rename scenario for
configs/init scripts. Most developers don't even know that it is necessary
(mandatory). And in fact the consequences are usually fatal, because the files
involved are usually important but the old version has already been unlinked.
This affects both filesystems, because ext3 now uses writeback by default and
ext4 uses writeback+delalloc. Maybe it would be useful to introduce a compat
mount option which forces an fsync() internally inside rename(). Renames are
not that frequent an operation, so this would have a much smaller performance
penalty than real ordered mode.
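Spelled out, the create-new-copy-and-rename sequence with the missing fsync()
added looks like this (a minimal sketch, not taken from any particular
application; error handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Replace `path' so that after a crash it contains either the old or the
     * new content, never a zero-length file. */
    int replace_file(const char *path, const char *tmp,
                     const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        if (close(fd) != 0)
            return -1;
        return rename(tmp, path);   /* data is already on disk at this point */
    }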
(In reply to comment #2)
> Maybe it would be useful to introduce a compat mount option which forces an
> fsync() internally inside rename(). Renames are not that frequent an
> operation, so this would have a much smaller performance penalty than real
> ordered mode.

ext4 already has an allocate-on-rename heuristic, though it is not exactly an
fsync():

    if (retval == 0 && force_da_alloc)
        ext4_alloc_da_blocks(old_inode);

from

commit 8750c6d5fcbd3342b3d908d157f81d345c5325a7
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Mon Feb 23 23:05:27 2009 -0500

    ext4: Automatically allocate delay allocated blocks on rename

Still, adding more mount options doesn't seem to solve the problem to me; in
the end, applications can't rely on them...
Hi!

(In reply to comment #1)
> Why can't you #1, just fsync after writing the control file, if that's the
> primary problem?
>
> Or #2, make dpkg recover more gracefully if it finds that the control file
> has been truncated down to zero?

dpkg is now fsync()ing after all internal db changes, control file
extractions, *and* the to-be-installed files extracted from the .deb package.
It's also fsync()ing directories for at least all db directory changes. As
background info, dpkg used to fsync() all db files except for the newly
extracted control files.

> The reality is that all of the newer file systems are going to have this
> property. XFS has always behaved this way. Btrfs will as well. We are _all_
> using the same heuristic to force-sync a file which is replaced via a
> rename() system call, but that's really considered a workaround for buggy
> application programs that don't call fsync(), because there are more stupid
> application programmers than there are of us file system developers.

I don't have any problem with that, and I personally consider the previous
dpkg behaviour buggy. And as you say, it's bound to cause problems on other
file systems eventually.

> As far as the rest of the files are concerned, what I would suggest doing is
> to set a sentinel value which indicates that a package is being installed,
> and if the system crashes, either the init scripts or the next run of dpkg
> should reinstall that package. That way you're not fsync()'ing every single
> file in the package, and you're also not optimizing for the exception
> condition. You just have appropriate application-level retries in case of a
> crash.

dpkg already marks packages which failed to unpack as such and as needing to
be reinstalled; it can also recover from such situations by rolling back to
the previous files, which it keeps as backups until it has finished the
current package operation.

The problem is that dpkg needs to guarantee the system is always usable, and
when a crash occurs, say while it's unpacking libc, it's not acceptable for
dpkg not to fsync() before rename(), as it might end up with an empty libc.so
file, even though it might have marked the package as correctly unpacked
(wrongly but unknowingly, as there are no guarantees), which is not true until
the changes have been fully committed to the file system. If any file of the
many packages which are required for the system to boot properly, or for dpkg
itself to operate correctly, ends up zero-length, then neither the user nor
the system will be able to recover from the situation. Worse, this might
require recovering from different media, for example, which end users should
not be required to do, or might simply not know how to do. I guess in this
regard dpkg is special, and it cannot be compared to something like firefox
fsync()ing too much; if dpkg fails to operate properly, your entire system
might get hosed.

> So Debian and Ubuntu have a choice. You can just stick with ext3 and not
> upgrade, but this is one place where you can't blackmail file system
> developers by saying, "if you don't do this, I'll go use some other file
> system" --- because we are *all* doing delayed allocation. It's allowed by
> POSIX, and it's the only way to get much better file system performance ---
> and there are intelligent ways you can design your applications so the right
> thing happens on a power failure.
> Programmers used to be familiar with these in the days before ext3, because
> that's how the world has always worked in Unix.
>
> Ext3 has lousy performance precisely because it guaranteed more semantics
> than what was promised by POSIX, and unfortunately people have gotten flabby
> (think: the humans in the movie Wall-E) and lazy about writing programs that
> write to the file system defensively. So if people are upset about the
> performance of ext3, great, upgrade to a newer file system. But then you
> will need to be careful about how you code applications like dpkg.

The main problem is that doing the right thing (fsync() + rename()) does not
really penalize ext3 users, but it does penalize ext4 users, which is the file
system that really needs it. So we end up with lots of users (mostly from
Ubuntu, as the distribution which has already switched to ext4 as its default)
complaining that the slowdown is unacceptable, and I don't see many options
besides adding a --force-unsafe-io option or similar, which those users would
add to their dpkg.cfg file with the acknowledgment that they might lose data
in case of an abrupt halt.

Something in between that we have talked about is doing fsync() on extracted
files only for a subset of the packages, say only for priority important or
higher, which, besides being the wrong solution, does not cover packages as
important as the kernel or boot loaders. Obviously better than no fsync() at
all but still not right; this could be added as --force-unsafe-io and the
previous option as --force-unsafer-io, but still.
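As a footnote to the directory fsync()s mentioned above: making a rename()
itself durable also means fsync()ing the directory that contains the entry,
roughly like this (a minimal sketch; the helper name is made up):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Make a just-renamed directory entry durable by fsync()ing the directory
     * that contains it; fsync() on the file alone only covers its data. */
    static int fsync_parent_dir(const char *dirpath)
    {
        int ret, dfd = open(dirpath, O_RDONLY | O_DIRECTORY);

        if (dfd < 0)
            return -1;
        ret = fsync(dfd);
        close(dfd);
        return ret;
    }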
> The problem is that dpkg needs to guarantee the system is always usable,
> and when a crash occurs, say while it's unpacking libc, it's not
> acceptable for dpkg not to fsync() before rename(), as it might end
> up with an empty libc.so file, even though it might have marked the
> package as correctly unpacked (wrongly but unknowingly, as there are no
> guarantees), which is not true until the changes have been fully
> committed to the file system.

Why not unpack all of the files as "foo.XXXXXX" (where XXXXXX is a mkstemp
filename template), do a sync() call (which in Linux is synchronous and won't
return until all the files have been written), and only then rename the files?
That's going to be the fastest and most efficient way to guarantee safety
under Linux; the downside is that you need to have enough free space to store
the old and the new files of the package simultaneously. But this is also a
win, because it means you don't actually start overwriting files in a package
until you know that the package installation is most likely going to succeed.
(Well, it could fail in the postinstall script, but at least you don't have to
worry about disk full errors.)

						- Ted
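In code, the suggested sequence would look something like this (a sketch under
the stated assumptions: the temporary files have already been written via
mkstemp() next to their targets, and it is Linux-only because it relies on
sync() being synchronous):

    #include <stdio.h>
    #include <unistd.h>

    /* tmp[i] was extracted via mkstemp("<target>.XXXXXX") in the same
     * directory as dst[i]; no per-file fsync() was issued while extracting. */
    int commit_package_files(char *const tmp[], const char *const dst[], int n)
    {
        int i;

        sync();     /* Linux: does not return until the dirty data is written */

        for (i = 0; i < n; i++)
            if (rename(tmp[i], dst[i]) != 0)
                return -1;          /* old files are still intact */
        return 0;
    }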
On Mon, May 10, 2010 at 10:56 AM, <tytso@mit.edu> wrote:
>> The problem is that dpkg needs to guarantee the system is always usable,
>> and when a crash occurs, say while it's unpacking libc, it's not
>> acceptable for dpkg not to fsync() before rename(), as it might end
>> up with an empty libc.so file, even though it might have marked the
>> package as correctly unpacked (wrongly but unknowingly, as there are no
>> guarantees), which is not true until the changes have been fully
>> committed to the file system.
>
> Why not unpack all of the files as "foo.XXXXXX" (where XXXXXX is a
> mkstemp filename template), do a sync() call (which in Linux is
> synchronous and won't return until all the files have been written),
> and only then rename the files? That's going to be the fastest and
> most efficient way to guarantee safety under Linux; the downside is
> that you need to have enough free space to store the old and the new
> files of the package simultaneously. But this is also a win, because
> it means you don't actually start overwriting files in a package until
> you know that the package installation is most likely going to
> succeed. (Well, it could fail in the postinstall script, but at least
> you don't have to worry about disk full errors.)

What about letting fsync() on a directory recursively fsync() all
files/sub-dirs in that directory? Then apps could unpack the package in a temp
dir, fsync(), and rename.

--
Thanks,
-Bergwolf
On Mon, May 10, 2010 at 10:22:47PM +0800, Peng Tao wrote:
> What about letting fsync() on a directory recursively fsync() all
> files/sub-dirs in that directory? Then apps could unpack the package in a
> temp dir, fsync(), and rename.

There are programs that execute fsync() on a directory, and they do not expect
a recursive fsync() of all files/subdirectories in that directory.

At least on Linux, sync() is synchronous and will do what you want. There is
unfortunately no portable way to do what you want, short of fsync()'ing all of
the files after they are written. This case is mostly optimized under ext3/4
(we could do a bit better for ext4, but the performance shouldn't be
disastrous --- certainly much better than write a file, fsync, rename the
file, repeat).

						- Ted
(In reply to comment #5)
> Why not unpack all of the files as "foo.XXXXXX" (where XXXXXX is a
> mkstemp filename template), do a sync() call (which in Linux is
> synchronous and won't return until all the files have been written),
> and only then rename the files? That's going to be the fastest and
> most efficient way to guarantee safety under Linux; the downside is
> that you need to have enough free space to store the old and the new
> files of the package simultaneously. But this is also a win, because
> it means you don't actually start overwriting files in a package until
> you know that the package installation is most likely going to
> succeed. (Well, it could fail in the postinstall script, but at least
> you don't have to worry about disk full errors.)

Ah, I forgot to mention that we also discussed using sync(), but the problem,
as you say, is that sync() is not portable, so we need the deferred fsync()
and rename() code anyway for unpacked files on non-Linux systems. Another
possible issue is that if there has been lots of I/O in parallel, or just
before running dpkg, the sync() might take much longer than expected; but
depending on the implementation, fsync() might show similar slowdowns anyway
(not, though, if that other I/O was on a different "disk" and file system).

Regarding the downsides and wins you mention, they already apply to the
current implementation. As I mentioned before, dpkg has always supported
rolling back, by making a hardlinked backup of the old file as .dpkg-tmp,
extracting the new file as .dpkg-new and then doing an atomic rename() over
the current file, and in case of error (from dpkg itself or the appropriate
maintainer script) restoring all the old file backups for the package (either
in the current run or in a subsequent dpkg run). Only once the unpack stage
has been successful does it remove the backups in one pass. So the need for
rollback already makes dpkg take (approximately) twice the space per package,
and thus there are no unsafe overwrites that cannot be reverted (except for
the zero-length ones).

I've now added the conditional code for Linux to do the sync() and then
rename() all files in one pass, and it's just a few lines of code (thanks to
the deferred fsync() changes which are now in place). I'll request some
testing from ext4 users, and if it improves something and does not make
matters worse on ext3 and other file systems, then I guess we might use that
on Linux. It still looks like a workaround to me.

As a side remark, I don't think it's fair, though, that you complain about
application developers not doing the right thing, when at the same time you
expect them not to use the proper portable tool for the job, and when you seem
not to see a problem in the fact that using it implies a performance penalty
on the one file system that really needs it. That there are lots of users
willing to sacrifice safety for performance tells me the penalty is
significant enough. Isn't there anything that could be improved to make the
correct fsync()+rename() case a bit faster? In this particular case those
calls are already batched after all writes have been performed.
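For readers following along, the backup/rollback sequence described above is
roughly the following (a sketch of the idea only, not dpkg's actual code; the
parameter names simply mirror the .dpkg-new/.dpkg-tmp suffixes):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Unpack step for one file: keep a hard link to the current version so
     * it can be restored if the rest of the package operation fails. */
    int install_one(const char *path, const char *dpkg_new,
                    const char *dpkg_tmp)
    {
        unlink(dpkg_tmp);
        if (link(path, dpkg_tmp) != 0 && errno != ENOENT)
            return -1;              /* ENOENT: new file, no backup needed */
        return rename(dpkg_new, path);  /* atomic switch to the new file */
    }

    /* Error path (current or subsequent dpkg run): put the old version back. */
    int rollback_one(const char *path, const char *dpkg_tmp)
    {
        return rename(dpkg_tmp, path);
    }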
I'll grant that using sync(2) is non-portable, but relying on _not_ needing an
fsync(2) at all was also just as non-portable, if not worse (it only really
worked on ext3, and no other file system, and of course only on Linux).

Trying to make (fsync, rename)**N --- that is, alternating fsync and rename
calls --- fast is always going to be difficult for nearly all file systems.
The fundamental problem is that file systems are optimized for throughput when
you're _not_ calling fsync all the time; that's a very different sort of thing
from what databases need to do --- and databases generally solve the problem
by having _two_ logs, a redo and an undo log. I don't know of any filesystem
which has that kind of complexity, so pretty much any filesystem where you
have a series of fsync() and rename() calls interleaved is going to run you
into pain. Some filesystems will be better at it than others, but it's always
going to be faster to write all the files, do a single sync, and then do all
of the renames. Yeah, that's non-portable; the problem is that the only
synchronization primitive which POSIX gives us is fsync(), so we just don't
have a lot of options in terms of what we can communicate between userspace
and the kernel.

One of the things I wonder about is why users' systems are crashing so often
that this is a problem. I can't remember the last time I've had a system crash
while I've been doing an "apt-get dist-upgrade" or "apt-get upgrade". Is this
a common problem or an uncommon problem? And if it's not that common (and I
hope it isn't, but maybe Ubuntu is shipping too many unstable crappy binary
device drivers), maybe the right answer is to have rescue CDs or rescue
partitions which will automatically repair a damaged libc package if the
system just happened to crash while upgrading glibc. Again, let's optimize for
the common case, and I hope we haven't entered the Windows world where blue
screens of death are so common that this is the case we have to optimize
for....
Since it was decided that this is not a bug in the kernel, shouldn't this report be closed?
Agreed, closing.