Bug 70121 - Increasing efficiency of full data journaling
Summary: Increasing efficiency of full data journaling
Status: RESOLVED INVALID
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All
OS: Linux
Importance: P1 enhancement
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-06 10:38 UTC by sworddragon2
Modified: 2014-03-07 20:41 UTC (History)
CC List: 1 user

See Also:
Kernel Version: 3.13.1
Subsystem:
Regression: No
Bisected commit-id:


Description sworddragon2 2014-02-06 10:38:04 UTC
Full data journaling guarantees that a file will never be visible to the user in a damaged state (unless a hardware defect appears afterwards). But this has the disadvantage that write throughput is roughly halved, as all data is written twice.

Here is the idea: from a logical point of view, achieving this safety does not require writing the file twice. A simple commit mechanism should achieve the same level of safety. For example, the filesystem could store a value for each file reflecting its state. It is initialized to an empty value, indicating the file has not yet been successfully written; as soon as the file has been written completely, it is set to 1. This would avoid writing the file twice and still guarantee that the file is never visible to the user in a damaged state after a crash, since the filesystem check would see that the file's state is not 1 and correct the problem.
Comment 1 Alan 2014-02-07 11:28:05 UTC
Theoretical discussions belong on the mailing lists, not here. It's not a bug.
Comment 2 sworddragon2 2014-02-07 14:49:29 UTC
> Theoretical discussions belong on the mailing lists not here.

As I have never used a mailing list before, and looking around at https://www.kernel.org this looks really complicated, I'm wondering why a simpler system is not used. Maybe I will figure out how to post something on this "forum-like" system to forward this request.


> It's not a bug

That is why I set the importance to enhancement.
Comment 3 Phillip Susi 2014-03-05 21:48:52 UTC

On 2/6/2014 5:38 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> Full data journaling guarantees that a file will never be visible to
> the user in a damaged state (unless a hardware defect appears
> afterwards). But this has the disadvantage that write throughput is
> roughly halved, as all data is written twice.

It does no such thing.  To do that, a new kernel API is needed to
allow the application to specify where transactions start/stop.  Data
journaling only makes sense for an external journal on an SSD or
similar: then fsync() and friends can return as soon as the data hits
the SSD and rely on it being safely migrated to the HDD in the future.


Comment 4 sworddragon2 2014-03-06 05:28:19 UTC
> It does no such thing.

What does it not do?


> Data
> journaling only makes sense for an external journal on an ssd or
> similar

About 2 years ago I disabled full data journaling for a short time, and during that period an application crashed while it was writing a lot of files. The result was that many files got damaged, which convinced me never to disable full data journaling again. Since then I have seen only 2 possible states of a file: either it is only registered in the filesystem with a size of 0 bytes, or it is completely written. I was never able to reproduce a half-written file with full data journaling enabled.
Comment 5 Theodore Tso 2014-03-06 15:34:46 UTC
On Thu, Feb 06, 2014 at 10:38:04AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> Here is the idea: from a logical point of view, achieving this
> safety does not require writing the file twice. A simple commit
> mechanism should achieve the same level of safety. For example, the
> filesystem could store a value for each file reflecting its state.
> It is initialized to an empty value, indicating the file has not
> yet been successfully written; as soon as the file has been written
> completely, it is set to 1. This would avoid writing the file twice
> and still guarantee that the file is never visible to the user in a
> damaged state after a crash, since the filesystem check would see
> that the file's state is not 1 and correct the problem.

How does the file system know that the file has "successfully been
written"?  Secondly, even if we did know, in order to guarantee the
transaction semantics, we *always* update the journal first.  Only
after the journal is updated do we write back to the final location
on disk.  So what you are suggesting simply wouldn't work.
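The write-ahead ordering Ted describes can be sketched as a toy model (the names `journal` and `disk` are illustrative stand-ins, not ext4's actual code): every data block is committed to the journal first and only afterwards checkpointed to its final location, which is exactly why data=journal writes everything twice.

```python
# Toy model of data=journal write ordering (illustrative only, not ext4 code).
journal = []   # stand-in for the on-disk journal area
disk = {}      # stand-in for the final on-disk block locations

def journaled_write(block_no, data):
    # 1. The data block goes into the journal first.
    journal.append((block_no, data))
    # 2. A commit record makes the transaction durable.
    journal.append(("COMMIT", block_no))
    # 3. Only after the commit is the block checkpointed to its home.
    disk[block_no] = data

journaled_write(7, b"hello")
# The block has now been written twice: once to the journal, once to disk.
```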

> About 2 years ago I disabled full data journaling for a short time,
> and during that period an application crashed while it was writing a
> lot of files. The result was that many files got damaged, which
> convinced me never to disable full data journaling again. Since then
> I have seen only 2 possible states of a file: either it is only
> registered in the filesystem with a size of 0 bytes, or it is
> completely written. I was never able to reproduce a half-written
> file with full data journaling enabled.

You have a buggy application which isn't using fsync() where it
should.  If you can't fix the application, one thing you can do is to
use the nodelalloc mount option.  Although disabling delayed
allocation will involve a performance hit, it's much less of a
performance hit compared to data journalling, and it will avoid the
double write problem.

One of the reasons why I'm not particularly fond of this solution is
that, beyond not guaranteeing data integrity after a crash (it just
makes data loss less likely, but if you crash at the wrong moment,
you can still lose data --- this is true with data journalling too,
btw; if you haven't seen it, you've just gotten lucky), and beyond
the fact that it imposes a generic performance cost, it imposes a
specific performance penalty on applications which actually do the
correct thing and use fsync().

One of the unfortunate features of ext3, which also didn't have
delayed allocation (ext4 with nodelalloc basically reverts this
aspect of file system behaviour to ext3 levels), is that it
encouraged applications not to use fsync(), an approach that works
most of the time until it doesn't --- which is probably _why_ you
have the buggy application or applications.  But in the long run,
it's better to fix the buggy applications than to rely on nodelalloc.

Cheers,

						- Ted
Comment 6 sworddragon2 2014-03-07 07:16:40 UTC
> How does the file system know that the file has "successfully been
> written"?  Secondly, even if we did know, in order to guarantee the
> transaction semantics, we *always* update the journal first.  Only
> after the journal is updated do we write back to the final location
> on disk.  So what you are suggesting simply wouldn't work.

It seems it is just too major a change. Maybe it is something that could be considered in ext5.


> it just
> makes it more likely, but if you crash at the wrong moment, you can
> still lose data


I have never seen a damaged file with full data journaling enabled. Can you show me a race condition so that I can reproduce it? Hm, maybe it would be possible if the journal is smaller than the file (I'm wondering what would happen in such a case).
Comment 7 Theodore Tso 2014-03-07 13:48:49 UTC
On Fri, Mar 07, 2014 at 07:16:40AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > How does the file system know that the file has "successfully been
> > written"?  Secondly, even if we did know, in order to guarantee the
> > transaction semantics, we *always* update the journal first.  Only
> after the journal is updated do we write back to the final location
> on disk.  So what you are suggesting simply wouldn't work.
> 
> It seems it is just too major a change. Maybe it is something that
> could be considered in ext5.

If you think it can be done, please submit patches.  :-)

> > it just
> > makes it more likely, but if you crash at the wrong moment, you can
> > still lose data
> 
> 
> I have never seen a damaged file with full data journaling enabled.
> Can you show me a race condition so that I can reproduce it? Hm,
> maybe it would be possible if the journal is smaller than the file
> (I'm wondering what would happen in such a case).

If the application is in the middle of writing the file when the
journal is committing, the file can be half written at the point where
the system is rebooted.  If you are thinking about the case where the
application writes to a temp file, and then renames the temp file on
top of the pre-existing file (without first fsync'ing the temp file,
which is the application bug), then yes, data journalling will save
you from that one particular use case, but it will come at a cost.
(And if you crash while you are writing the temp file, of course you
do lose your pending changes.)  You can get the same level of
protection by using mount -o nodelalloc instead of mount -o
data=journal.

As I said before, you'll give up some performance, but it won't be as
bad as using data=journal --- and you should still file bug reports
with your applications so they use fsync() properly.  I'll note that
if they don't, you'll have problems with all other file systems,
whether they be xfs, btrfs, etc.  Fsync(2) is the *only* way you can
guarantee that the contents of a file which has been written is safely
on stable store.

							- Ted
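The safe pattern Ted describes (write a temp file, fsync it, then rename it over the original) can be sketched in Python; the helper name and file paths below are illustrative, not any kernel or libc API:

```python
import os

def atomic_replace(path, data):
    """Durably replace path with data: temp file -> fsync -> rename."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # the new contents reach stable storage first...
    finally:
        os.close(fd)
    os.replace(tmp, path)     # ...then the atomic rename publishes them:
                              # readers see the old file or the new one,
                              # never a half-written mix
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)       # make the rename itself durable
    finally:
        os.close(dirfd)

atomic_replace("settings.conf", b"color=blue\n")
```

Without the fsync() before the rename, this degenerates into exactly the application bug discussed above: on a crash, the rename may survive while the temp file's data does not.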
Comment 8 sworddragon2 2014-03-07 15:42:13 UTC
Thanks for the information :)

In this case full data journaling does not seem to do everything I thought it does. Maybe my last tests were just too long ago. But this also means that I can still lose data if the system crashes while writing to the disk. It would be really nice if the filesystem provided such a mechanism in the future.


> How does the file system know that the file has "successfully been
> written"?

Does the filesystem really not know whether a file is complete? If that could be achieved, transaction safety should not be the problem here.


> If you think it can be done, please submit patches.  :-)

Sure, I would if:

- my experience with C were a little higher,
- or if I at least had the time to take a deeper look into the source code and the coding rules for the kernel.


Over the last few years I have posted many hundreds of bugs/feature requests across ~100 different projects, and I am still doing this many times a week. But don't worry, those projects don't get as many bogus reports as I'm filing here :)
The kernel is a really difficult place if you are not directly involved with it but want to improve something, though some of my tickets here have already been successful. I'm also programming on my own projects, and all of this together consumes a lot of time.
Comment 9 Phillip Susi 2014-03-07 20:41:31 UTC

On 3/6/2014 12:28 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
>> It does no such thing.
> 
> What does it not do?

write("foo");
              <-- crash
write("bar");

Your file now has "foo" and not "bar".  The fact that "foo" hit the
journal first, and may or may not have hit the final location before
the crash, doesn't change the fact that you crashed between the two
write calls: "bar" is certainly lost, and if "foo" did hit the disk,
you have a partially written, and thus corrupt, file.
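This point can be made concrete with a small simulation (the file name is hypothetical, and an early return stands in for the crash): journaling orders the writes that did happen, but it cannot recover a write that never happened.

```python
def update_file(path, crash_between_writes=False):
    # Two separate writes, as in the example above. The early return
    # stands in for a crash between the two write() calls.
    with open(path, "w") as f:
        f.write("foo")
        f.flush()
        if crash_between_writes:
            return   # "crash": "bar" is never written
        f.write("bar")

update_file("example.txt", crash_between_writes=True)
print(open("example.txt").read())  # foo -- partially written; "bar" is lost
```

No journaling mode can help here, because from the filesystem's point of view the first write completed successfully; only an application-level transaction boundary (e.g. fsync plus atomic rename) defines what "complete" means.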


