Tested on kernels 5.11, 6.1 and 6.8.

Some writes to ext4 (default commit=5) are not committed to the block device for more than 25 seconds.

How to reproduce:

apt install php-cli -y

Instead of iostat, create an I/O watcher that shows the actual writes to the block device, in MB, every second:

show_writes.php
=================================
#!/usr/bin/php
<?php
if(!isset($argv[1])) die("provide device\n");
if(!($device = realpath($argv[1]))) die("can't realpath device\n");
// keep only the last path component, e.g. /dev/vdb -> vdb
if(!preg_match("@[^/]+$@", $device, $m)) die();
$device = $m[0];
$prev = 0;
while(1)
{
    // whitespace-separated counters; index 6 is the number of sectors written
    $data = preg_split("/\s+/", trim(file_get_contents("/sys/block/$device/stat")));
    if(empty($data)) die("can't get device info");
    // sectors are 512 bytes; print the per-second delta in MB
    if($prev) echo date("h:i:s")." ".number_format(($data[6]-$prev)*512/1024/1024,2)." M\n";
    $prev = $data[6];
    sleep(1);
}
=================================

Add a free HDD/partition of 20 MB or more (/dev/vdb).

In the first console run:

php show_writes.php /dev/vdb

In the second console run:

mkfs.ext4 -E lazy_itable_init=0 -E lazy_journal_init=0 /dev/vdb
mount /dev/vdb /mnt/
rm -f /mnt/test; sync; sleep 1; date && dd if=/dev/urandom of=/mnt/test bs=1M count=10
# watch that the 10 MB write settles within 5 seconds, good
rm -f /mnt/test; sync; sleep 1; date && dd if=/dev/urandom of=/mnt/test bs=1M count=10
# watch that the second 10 MB write is not committed to the block device for more than 25 seconds, although it should be within 5 seconds; repeat if it was committed in time.
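If installing php-cli is inconvenient, the same measurement could probably be done in plain shell. This is only a sketch: the helper names written_sectors and sectors_to_mib are made up for illustration, and it assumes that field 7 of /sys/block/<dev>/stat is the count of sectors written in 512-byte units, the same field show_writes.php reads as $data[6].

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original report.

# Print the "sectors written" counter (7th field) for a device name such as vdb.
written_sectors() {
    awk '{ print $7 }' "/sys/block/$1/stat"
}

# Convert the difference between two sector counts to MiB (512-byte sectors).
sectors_to_mib() {
    awk -v prev="$1" -v cur="$2" \
        'BEGIN { printf "%.2f\n", (cur - prev) * 512 / 1024 / 1024 }'
}

# Usage (runs until interrupted):
#   prev=$(written_sectors vdb)
#   while sleep 1; do
#       cur=$(written_sectors vdb)
#       echo "$(date +%T) $(sectors_to_mib "$prev" "$cur") M"
#       prev=$cur
#   done
```

The loop is left as a comment so that sourcing the file only defines the helpers.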
This is not a bug. What you are observing is the dirty writeback for buffered I/O. This is configurable; see [1], and in particular the documentation for dirty_expire_centisecs, which you can query by looking at the contents of /proc/sys/vm/dirty_expire_centisecs, and which you can configure by writing to that file (e.g., "cat 500 > /proc/sys/vm/dirty_expire_centisecs").

Note that changing dirty_expire_centisecs from 3000 (30 seconds) to 500 (5 seconds) will have performance implications; there are Very Good Reasons why the default is set to 30 seconds (as well as it being the historic default used by Unix systems for decades).

[1] https://docs.kernel.org/admin-guide/sysctl/vm.html

Note that if you want to make sure something is written to disk, it's best to be explicit about it, using the fsync(2) system call.
Oops, replace "cat 500 > /proc/sys/vm/dirty_expire_centisecs" with "echo 500 > /proc/sys/vm/dirty_expire_centisecs" in the previous message.
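For the dd test in the report, the fsync(2) advice could be tried without writing any C. A minimal sketch, assuming GNU dd (which supports conv=fsync and iflag=fullblock); write_durably is a made-up helper name, not something from this thread:

```shell
#!/bin/sh
# write_durably PATH [COUNT_MIB]: hypothetical helper, not from the thread.
write_durably() {
    # conv=fsync asks dd to fsync(2) the output file before it exits;
    # iflag=fullblock avoids short reads from /dev/urandom.
    dd if=/dev/urandom of="$1" bs=1M count="${2:-10}" \
       iflag=fullblock conv=fsync 2>/dev/null
}

# Usage, with the path from the report:
#   write_durably /mnt/test 10
```

With this, the watcher in the first console would be expected to show the write around the time dd returns, instead of after the writeback timers fire.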
I disagree. I've checked the value of /proc/sys/vm/dirty_writeback_centisecs on all tested systems and all of them have 500, not 30000.
Oh you are right, /proc/sys/vm/dirty_writeback_centisecs, 3000.
/proc/sys/vm/dirty_expire_centisecs
3000

This contradicts the information from https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

commit=nrsec (*)
Ext4 can be told to sync all its data and metadata every 'nrsec' seconds. The default value is 5 seconds. This means that if you lose your power, you will lose as much as the latest 5 seconds of work (your filesystem will not be damaged though, thanks to the journaling). This default value (or any low value) will hurt performance, but it's good for data-safety. Setting it to 0 will have the same effect as leaving it at the default (5 seconds). Setting it to very large values will improve performance.

So commit=5 actually does NOT guarantee that if you lose your power, you will lose at most the latest 5 seconds of work. You will lose 30 seconds of work.
A short investigation showed that this question was already asked in 2019 and nobody has fixed the docs since then.

https://www.spinics.net/lists/linux-ext4/msg68987.html

This wrong information has been duplicated all over the internet (Arch wiki, Super User, etc.), confusing filesystem users.
Not sure where you got the URL in comment 5 from, but it is an ancient version of the documentation. This text was fixed in 2018. The current version of the documentation is at https://www.kernel.org/doc/html/latest/admin-guide/ext4.html and has:

commit=nrsec (*)
This setting limits the maximum age of the running transaction to ‘nrsec’ seconds. The default value is 5 seconds. This means that if you lose your power, you will lose as much as the latest 5 seconds of metadata changes (your filesystem will not be damaged though, thanks to the journaling). This default value (or any low value) will hurt performance, but it’s good for data-safety. Setting it to 0 will have the same effect as leaving it at the default (5 seconds). Setting it to very large values will improve performance. Note that due to delayed allocation even older data can be lost on power failure since writeback of those data begins only after time set in /proc/sys/vm/dirty_expire_centisecs.
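To see the timers involved on a given machine, something like the following could be used. It is a sketch only: centisecs_to_seconds is a made-up helper, and the "rough upper bound" it prints (expire time plus one flusher wakeup interval) is an approximation based on the vm sysctl documentation, not a figure stated in this thread.

```shell
#!/bin/sh
# Hypothetical helper: convert a centisecond sysctl value to whole seconds.
centisecs_to_seconds() {
    echo $(( $1 / 100 ))
}

expire_file=/proc/sys/vm/dirty_expire_centisecs
writeback_file=/proc/sys/vm/dirty_writeback_centisecs

# Only report when both sysctl files are readable (i.e. on Linux).
if [ -r "$expire_file" ] && [ -r "$writeback_file" ]; then
    expire=$(centisecs_to_seconds "$(cat "$expire_file")")
    writeback=$(centisecs_to_seconds "$(cat "$writeback_file")")
    echo "dirty data becomes eligible for writeback after ${expire}s"
    echo "flusher threads wake up every ${writeback}s"
    # Assumption: expired data is written at the next flusher wakeup.
    echo "rough upper bound before data writeback starts: $((expire + writeback))s"
fi
```

With the values quoted in this thread (3000 and 500), the estimate comes out at 35 seconds, which is in line with the delay of more than 25 seconds seen in the report.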
(In reply to Jan Kara from comment #7) new one is better