Bug 218932 - Serious problem with ext4 with all kernels, auto-commits do not settle to block device
Status: RESOLVED INVALID
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P3 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
Reported: 2024-06-03 11:40 UTC by Serious
Modified: 2024-07-19 20:07 UTC

Regression: No

Description Serious 2024-06-03 11:40:27 UTC
Tested on kernels 5.11, 6.1, and 6.8.
Some writes to ext4 (default commit=5) are not committed to the block device for more than 25 seconds.

How to reproduce:
apt install php-cli -y
Instead of iostat, create an I/O watcher that shows the actual writes to the block device, in MiB, every second:
show_writes.php
=================================
#!/usr/bin/php
<?php
// Print the amount of data written to a block device, in MiB, once per second.
if(!isset($argv[1])) die("provide device\n");
if(!($device = realpath($argv[1]))) die("can't realpath device\n");
// Keep only the device name, e.g. /dev/vdb -> vdb
$device = basename($device);
$prev = null;
while(1) {
        $stat = @file_get_contents("/sys/block/$device/stat");
        if($stat === false) die("can't get device info\n");
        $data = preg_split("/\s+/", trim($stat));
        // Index 6 is the number of 512-byte sectors written so far
        if($prev !== null) echo date("H:i:s")." ".number_format(($data[6]-$prev)*512/1024/1024,2)." M\n";
        $prev = $data[6];
        sleep(1);
}
=================================
Add a free disk or partition of at least 20 MB (/dev/vdb).
In first console run:
php show_writes.php /dev/vdb

In second console run:
mkfs.ext4 -E lazy_itable_init=0 -E lazy_journal_init=0 /dev/vdb
mount /dev/vdb /mnt/

rm -f /mnt/test; sync; sleep 1; date && dd if=/dev/urandom of=/mnt/test bs=1M count=10
# observe that the 10 MB write settles within 5 seconds: good
rm -f /mnt/test; sync; sleep 1; date && dd if=/dev/urandom of=/mnt/test bs=1M count=10
# observe that the second 10 MB write is not committed to the block device for more than 25 seconds, although it should be within 5 seconds; repeat if it was.
Comment 1 Theodore Tso 2024-06-05 18:32:12 UTC
This is not a bug. What you are observing is the dirty writeback for buffered I/O. This is configurable; see [1], and in particular the documentation for dirty_expire_centisecs, which you can query by looking at the contents of /proc/sys/vm/dirty_expire_centisecs, and which you can configure by writing to that file (e.g., "cat 500 > /proc/sys/vm/dirty_expire_centisecs"). Note that changing dirty_expire_centisecs from 3000 (30 seconds) to 500 (5 seconds) will have performance implications; there are Very Good Reasons why the default is set to 30 seconds (as well as it being the historic default used by Unix systems for decades).

[1] https://docs.kernel.org/admin-guide/sysctl/vm.html

Note that if you want to make sure something is written to disk, it's best to be explicit about it, using the fsync(2) system call.
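
For the dd-based reproduction in the description, one way to be explicit is to ask dd itself to call fsync before it exits. The following is a minimal sketch, not part of the original comment; the helper name write_durably is hypothetical, and it assumes a dd that supports conv=fsync (GNU coreutils does).

```shell
#!/bin/sh
# write_durably FILE COUNT (hypothetical helper): write COUNT MiB of random
# data to FILE and fsync the output file before dd exits, instead of waiting
# for periodic writeback to pick the data up.
write_durably() {
    dd if=/dev/urandom of="$1" bs=1M count="$2" conv=fsync 2>/dev/null
}

# Example (same file and size as in the description):
#   write_durably /mnt/test 10
```

With this variant the watcher should show the write reaching the block device as soon as dd returns, independent of the writeback timers.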
Comment 2 Theodore Tso 2024-06-05 18:39:21 UTC
Oops, replace "cat 500 > /proc/sys/vm/dirty_expire_centisecs" with "echo 500 > /proc/sys/vm/dirty_expire_centisecs" in the previous message.
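
As an illustration only (not part of the original comment), the two writeback knobs discussed in this thread can be read and shown in seconds with a short script. The helper name centisecs_to_seconds is hypothetical; the /proc paths are the ones named above.

```shell
#!/bin/sh
# Convert a value in centiseconds (the unit of the vm.dirty_*_centisecs
# knobs) to seconds, with two decimal places.
centisecs_to_seconds() {
    awk -v cs="$1" 'BEGIN { printf "%.2f\n", cs / 100 }'
}

# Print each writeback knob with its value in seconds, if it is readable.
for knob in dirty_expire_centisecs dirty_writeback_centisecs; do
    f="/proc/sys/vm/$knob"
    if [ -r "$f" ]; then
        echo "$knob: $(centisecs_to_seconds "$(cat "$f")") s"
    fi
done
```

Changing a value still requires root and a write to the same file, as in the echo command above.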
Comment 3 Serious 2024-06-05 20:16:22 UTC
I disagree. I've checked the value of /proc/sys/vm/dirty_writeback_centisecs on all tested systems, and all of them have 500, not 3000.
Comment 4 Serious 2024-06-05 20:21:46 UTC
Oh, you are right: /proc/sys/vm/dirty_expire_centisecs is 3000.
Comment 5 Serious 2024-06-05 20:47:09 UTC
/proc/sys/vm/dirty_expire_centisecs 3000
This contradicts information from https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
commit=nrsec	(*)	Ext4 can be told to sync all its data and metadata
			every 'nrsec' seconds. The default value is 5 seconds.
			This means that if you lose your power, you will lose
			as much as the latest 5 seconds of work (your
			filesystem will not be damaged though, thanks to the
			journaling).  This default value (or any low value)
			will hurt performance, but it's good for data-safety.
			Setting it to 0 will have the same effect as leaving
			it at the default (5 seconds).
			Setting it to very large values will improve
			performance.

So commit=5 does NOT actually guarantee that, if you lose power, you will lose at most the latest 5 seconds of work. You can lose 30 seconds of work.
Comment 6 Serious 2024-06-05 21:01:39 UTC
A short investigation showed that this question was already asked in 2019 and nobody has fixed the docs since then. https://www.spinics.net/lists/linux-ext4/msg68987.html
This wrong information has been duplicated all over the internet (Arch Wiki, Super User, etc.), confusing filesystem users.
Comment 7 Jan Kara 2024-07-19 12:33:25 UTC
Not sure where you got the URL in comment 5 from, but it is an ancient version of the documentation. This text was fixed in 2018. The current version of the documentation is at https://www.kernel.org/doc/html/latest/admin-guide/ext4.html and has:

commit=nrsec (*)

    This setting limits the maximum age of the running transaction to ‘nrsec’ seconds. The default value is 5 seconds. This means that if you lose your power, you will lose as much as the latest 5 seconds of metadata changes (your filesystem will not be damaged though, thanks to the journaling). This default value (or any low value) will hurt performance, but it’s good for data-safety. Setting it to 0 will have the same effect as leaving it at the default (5 seconds). Setting it to very large values will improve performance. Note that due to delayed allocation even older data can be lost on power failure since writeback of those data begins only after time set in /proc/sys/vm/dirty_expire_centisecs.
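
A rough back-of-the-envelope sketch of what this means for the numbers in this report (an illustration, not part of the documentation quoted above): dirty data becomes eligible for writeback once it is older than dirty_expire_centisecs, and it is then picked up the next time a flusher thread wakes up, which happens every dirty_writeback_centisecs. The helper name worst_case_age is hypothetical, and the estimate assumes no memory pressure and no explicit sync or fsync, either of which starts writeback sooner.

```shell
#!/bin/sh
# worst_case_age EXPIRE_CS WRITEBACK_CS (hypothetical helper): approximate
# upper bound, in whole seconds, on how old buffered data can get before
# periodic writeback starts writing it out.
worst_case_age() {
    echo $(( ($1 + $2) / 100 ))
}

# Values reported in this thread: expire = 3000, writeback = 500.
worst_case_age 3000 500   # prints 35
```

This is in line with the delay of more than 25 seconds seen in the description, even though commit=5 bounds the age of the journal transaction.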
Comment 8 Serious 2024-07-19 20:07:33 UTC
(In reply to Jan Kara from comment #7)
The new one is better.
