Bug 151491

Summary: free space lossage on busy system with bigalloc enabled and 128KB cluster
Product: File System
Reporter: Matthew L. Martin (mlmartin)
Component: ext4
Assignee: fs_ext4 (fs_ext4)
Status: NEW
Severity: high
CC: betacentauri, enwlinux, mfe555, tytso
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.1.8
Regression: No
Attachments:
    details of fault with scripts for reproduction
    Test script
    Test script output

Description Matthew L. Martin 2016-08-04 19:47:41 UTC
Created attachment 227581 [details]
details of fault with scripts for reproduction

A file system with bigalloc enabled and a 128KB cluster size, with a large number of 2MB files being created/overwritten/deleted, loses usable space.
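For reference, the affected configuration can be recreated with something like the following (a hypothetical sketch; the exact mkfs invocation and device names are in the attached document):

    mke2fs -t ext4 -O bigalloc -C 131072 /dev/sdg    # bigalloc with 128KB clusters
    mount /dev/sdg /mnt/hdd_sdg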

Running du and df gives wildly different usage, with df showing much more usage than du. lsof shows no phantom open files. Using dd to fill the file system shows that df's version of free space is operative, but unmounting and remounting the file system returns the free space. There is no difference between df and du usage after remount.
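The comparison described above amounts to the following (hypothetical mount point; on an affected kernel, dd hits ENOSPC at df's notion of free space, well before du's):

    df -h /mnt/hdd_sdg                   # allocator's view of used/free space
    du -hs /mnt/hdd_sdg                  # total of the files actually present
    dd if=/dev/zero of=/mnt/hdd_sdg/fill bs=1M    # runs until ENOSPC
    rm /mnt/hdd_sdg/fill
    umount /mnt/hdd_sdg && mount /dev/sdg /mnt/hdd_sdg   # remount recovers the space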

The fault does not seem to be present in the 4.7 kernel (or it takes a lot more activity for it to show up).

I will build 4.4.16 and retest to see if it is present there.

We do have a(n obnoxious) workaround of periodically unmounting/remounting the file systems.

Details of the configurations and tests are in the attached document.
Comment 1 Matthew L. Martin 2016-08-06 12:54:07 UTC
The fault is present in Linux 4.4.16. After running the populate script for a few hours, du and df disagree a great deal:

# df -h /mnt/hdd_sd[gh]; du -hs  /mnt/hdd_sd[gh]
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdg        5.5T  129G  5.3T   3% /mnt/hdd_sdg
/dev/sdh        5.5T   20G  5.4T   1% /mnt/hdd_sdh
20G	/mnt/hdd_sdg
20G	/mnt/hdd_sdh

# lsof | grep -e sdg -e sdh
jbd2/sdg- 32609           root  cwd       DIR              9,127      4096        192 /
jbd2/sdg- 32609           root  rtd       DIR              9,127      4096        192 /
jbd2/sdg- 32609           root  txt   unknown                                         /proc/32609/exe
jbd2/sdh- 32614           root  cwd       DIR              9,127      4096        192 /
jbd2/sdh- 32614           root  rtd       DIR              9,127      4096        192 /
jbd2/sdh- 32614           root  txt   unknown                                         /proc/32614/exe
Comment 2 Matthew L. Martin 2016-08-08 20:30:00 UTC
I believe that I have confirmed that this fault is not present in Linux 4.7. After running the reproduction script for over three hours, I have not seen a difference between the usage reported by du and df:

# df -h /mnt/hdd_sd[gh]; du -hs  /mnt/hdd_sd[gh]
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdg        5.5T   20G  5.4T   1% /mnt/hdd_sdg
/dev/sdh        5.5T   20G  5.4T   1% /mnt/hdd_sdh
20G	/mnt/hdd_sdg
20G	/mnt/hdd_sdh
Comment 3 Matthew L. Martin 2016-08-09 18:28:31 UTC
Unfortunately, after an extended run of the test scripts the fault presented itself in the 4.7 kernel:

[root@d-ceph01 ~]# df -h /mnt/hdd_sd[gh]; du -hs  /mnt/hdd_sd[gh]
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdg        5.5T  669G  4.8T  13% /mnt/hdd_sdg
/dev/sdh        5.5T   20G  5.4T   1% /mnt/hdd_sdh
20G	/mnt/hdd_sdg
20G	/mnt/hdd_sdh
Comment 4 Fischreiher 2017-11-11 11:04:42 UTC
I have the same or a similar problem on a Linux-based Enigma2 set-top box:

- ext4
- kernel 4.8.3
- bigalloc enabled
- cluster size of 262144 bytes (256KB)

In normal use of the set-top box, the free space lossage is tens of gigabytes per day. The problem can be easily reproduced:

When creating a fresh file, there is a significant difference between file size (ls -la) and disk usage (du). When making two copies of the file ...

gbquad:/hdd/test# cp file file.copy1
gbquad:/hdd/test# cp file file.copy2
gbquad:/hdd/test# ls -la
-rw-------    1 root     root     581821460 Nov  1 18:52 file
-rw-------    1 root     root     581821460 Nov  1 18:56 file.copy1
-rw-------    1 root     root     581821460 Nov  1 18:57 file.copy2
gbquad:/hdd/test# du *
607232  file
658176  file.copy1
644864  file.copy2

... all three files show an overhead in the ~10% range, and the overhead differs between the files even though their md5sums are equal.

When deleting a file (rm), the overhead remains occupied on the disk. For example, after deleting "file", "df" reports approx. 581821460 more bytes free, not 607232 KB more free space. The overhead (607232 KB - 581821460 B = approx. 39 MB) remains blocked.

When unmounting and mounting again, the blocked space becomes free again, and in addition the overhead of those files that were not deleted also disappears, so that after a re-mount the 'file size' and 'disk usage' match for all files (except for rounding up to some block size).
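For illustration, the sequence just described (paths from the transcript above; assumes an fstab entry so 'mount /hdd' works):

    rm /hdd/test/file            # df frees approx. the ls -la size, not the du size
    df /hdd                      # the ~39 MB overhead still shows as used
    umount /hdd && mount /hdd    # after remount, du and ls -la agree for remaining files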

I found that
    echo 3 > /proc/sys/vm/drop_caches
seems to detach the blocked disk space from the files (so that 'du file' no longer includes the overhead), but it does not free the space; 'df' still shows all file overheads as used disk space.
Comment 5 Theodore Tso 2017-11-11 19:20:17 UTC
Can you try replicating this on an upstream kernel, running in a controlled environment (e.g., using kvm), and then give us reliable reproduction instructions (e.g., simple shell scripts) that don't depend on the vagaries of the set-top box software and random versions of du, df, etc.?

For bonus points, get a copy of kvm-xfstests[1][2], and run the test using scripts cut and pasted into "kvm-xfstests shell".   That way we will be able to reproduce *exactly* what you are doing.

[1] https://github.com/tytso/xfstests-bld
[2] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md

Thanks!!
Comment 6 Betacentauri 2017-11-13 16:27:31 UTC
First result:
When I revert this commit https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/ext4?h=v4.1&id=9d21c9fa2cc24e2a195a79c27b6550e1a96051a4 in a 4.1 ARM kernel, the problem no longer occurs.
Scripts for replicating in the kvm-xfstests environment will follow later.


By the way: the set-top boxes Fischreiher and I are talking about use only a slightly adapted upstream kernel to support Broadcom hardware. The fs directory in the kernel sources is unchanged.

Frank
Comment 7 Betacentauri 2017-11-13 17:27:41 UTC
Created attachment 260631 [details]
Test script
Comment 8 Betacentauri 2017-11-13 17:30:11 UTC
Created attachment 260633 [details]
Test script output

I used a 4.9 32-bit kernel (config from the kernel-configs folder).

The output shows how the used space reported by df increases. Also, files generated by dd and cp show different allocated sizes in the ls output.
Comment 9 Betacentauri 2017-11-13 17:39:41 UTC
I forgot to say that the script is for the kvm-xfstests environment, and the attachment already shows the output from that environment.
Comment 10 Eric Whitney 2017-11-16 23:54:44 UTC
I've been able to reproduce the reported problem on my test system running a 4.14 x86-64 kernel with the supplied test script.  Thanks for supplying it!

The block reporting errors from du and df are likely caused by delayed allocation accounting bugs.  Experiments with an instrumented kernel show that the number of delayed allocated blocks is occasionally overcounted as the test files are physically allocated, leaving a residual value behind once allocation is complete.  This residual value remains once a file has been fully written out or deleted, and distorts the results reported by du or df.  Interestingly, the overcounting isn't deterministic and varies from run to run.

Part of the overcounting appears due to code in ext4_ext_map_blocks() that increases i_reserved_data_blocks when new clusters are allocated.  This code has been previously implicated in other observed failures and in this case appears to contribute some but not always all of the overcounted clusters seen when running the test script.  Kernel traces indicate that there is usually another as yet unknown contributor to the overcount.

Ted has suggested a temporary workaround which can be used to avoid the reported problems, though it may have a significant workload-dependent performance impact.  Delayed allocation can simply be disabled by using the nodelalloc mount option.  I've tested this with repeated runs of the supplied test script, and it avoids the reported problems as expected.
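For reference, the workaround is the standard ext4 mount option (device and mount point here are placeholders):

    mount -o nodelalloc /dev/sdg /mnt/hdd_sdg
    # or persistently, via an /etc/fstab entry:
    # /dev/sdg  /mnt/hdd_sdg  ext4  nodelalloc  0  2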

Reverting "ext4: don't release reserved space for previously allocated cluster" (9d21c9fs2cc2) isn't an attractive option because doing so would expose users to potential data loss.  The purpose of the patch was to fix cases where the number of outstanding delayed allocation blocks were undercounted.  Undercounting can lead to unexpected free space exhaustion at writeback time, among other things.

I'll see what more I can learn from some additional experimentation.
Comment 11 Betacentauri 2017-11-17 16:00:14 UTC
The nodelalloc mount option works around the problem in the test environment. But I also checked on a real ARM system with a 4.1.37 kernel:

root@sf4008:/media# mount 
...
/dev/sda on /media/sda type ext4 (rw,relatime,nodelalloc,data=ordered)
root@sf4008:/media# ls -las sda/testfiles/
 10240 -rw-r--r--    1 root     root      10485760 Nov 17 16:47 test
 10304 -rw-r--r--    1 root     root      10485760 Nov 17 16:47 test1

The test is a little different: only 10 MB files are generated. In most cases the allocated size (first column) is equal, but in some cases it still differs, as in the example above. It's not deterministic when it happens, and I only see two sizes, 10240 or 10304. With delalloc the allocated size was much more random.
Comment 12 Eric Whitney 2017-11-20 15:50:09 UTC
After many attempts, I'm unable to reproduce the behavior newly reported for the nodelalloc workaround in comment 11 on my x86-64 test system (which is not an xfstests-bld test appliance) running either a current 4.14 kernel or an older Debian Jessie 4.8 kernel.  I consistently get a reported value of 10240 1k units, which is correct for the reported size.

However, in the process of running my trials I arrived at a simpler reproducer that should be helpful in identifying the source of the original space reporting problem.  There's no need to copy the first test file if the test system's free memory is sufficiently limited relative to the size of the test file - a simple sequential write of a single test file suffices.  In fact, the tighter the free memory, the more likely the problem is to occur, and the larger the reporting errors tend to be.  A test system with ample free memory won't exhibit the problem at all.
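A sketch of that simpler reproducer (hypothetical mount point and file size; the file must be large relative to free memory so that writeback runs while the file is still being written):

    dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=2048 2> /dev/null
    sync
    df /mnt/test; du -s /mnt/test    # on an affected kernel, df exceeds du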

I'm getting workable kernel traces with the simpler reproducer, and the free memory-related behavior suggests a direction, so I'll see where that takes me.
Comment 13 Betacentauri 2017-11-20 17:59:39 UTC
The ARM machine I use has very little free memory, so that fits your analysis.
With that information I could also reproduce it in the xfstests environment with nodelalloc, but only twice. I set the virtual machine's memory to 256MB (in config.kvm), then mounted the file system with nodelalloc and executed this little script several times:

#!/bin/bash

i=0
while [ $i -lt 10 ]; do
  # dd prints its status on stderr, so redirect stderr to silence it
  dd if=/dev/zero of=./test$i bs=1M count=200 2> /dev/null
  cp test$i testx_$i
  sync
  let i=i+1
  echo $i
done

Result was this:
root@kvm-xfstests:/media/test# ls -las
total 4097028
   256 drwxr-xr-x 3 root root      4096 Nov 20 17:43 .
     4 drwxr-xr-x 3 root root      4096 Nov 20 17:33 ..
   256 -rwxr-xr-x 1 root root       155 Nov 20 17:48 h.sh
   256 drwx------ 2 root root     16384 Nov 20 17:26 lost+found
204800 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test0
205056 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test1
204800 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test2
204800 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test3
204800 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test4
...

One file has an allocated size of 205056, but I cannot reliably reproduce it.

So it's better to focus on fixing the bug than on trying to reproduce it with nodelalloc. If I find a way to reproduce it, I'll let you know ;-)
Comment 14 Eric Whitney 2017-11-30 16:08:41 UTC
I've been able to identify the unknown contributor to the delalloc accounting error as discussed previously (in comment 10).

When ext4 is writing back a previously delalloc'ed extent that contains just a portion of a cluster, and then delallocs an extent that contains another disjoint portion of that same cluster, the count of delalloc'ed clusters (i_reserved_data_blocks) is incorrectly incremented.  The cluster has been physically allocated during writeback, but the subsequent delalloc write does not discover that allocation.  This is because the code in ext4_da_map_blocks() checks for a previously physically allocated block at the point of allocation rather than a previously physically allocated cluster spanning the point of allocation.

The effect is to bump the delalloc'ed cluster count for clusters that will never be allocated (since they've already been allocated), and the overcount will therefore never be reduced.

It's more likely this problem would occur when writing files sequentially if the test system was under memory pressure, resulting in writeback activity in parallel with delalloc writes.  The magnitude of the overcount is also likely to be larger in this situation.  This correlates well with the observation that the reproducer for the accounting errors is more likely to reproduce the problem on a test system with little free memory.
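A minimal sketch of that sequence (hypothetical path and offsets; not a deterministic reproducer, since in practice the interleaving of writeback and delalloc writes depends on memory pressure - the explicit sync here stands in for that writeback):

    # assumes a bigalloc file system with 128KB clusters mounted at /mnt/test
    dd if=/dev/zero of=/mnt/test/f bs=4k count=1 conv=notrunc 2> /dev/null
    sync    # writeback physically allocates the cluster spanning block 0
    # delalloc-write a disjoint block (64KB in) within the same cluster:
    dd if=/dev/zero of=/mnt/test/f bs=4k count=1 seek=16 conv=notrunc 2> /dev/null
    # ext4_da_map_blocks() sees no mapping at that block, so it reserves a
    # new cluster even though the cluster is already allocated; the extra
    # reservation is never consumed, and df over-reports until remount.
    sync
    df /mnt/test; du -s /mnt/test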

I've been testing a prototype patch that appears to fix this problem.  However, I've also identified at least two other unrelated delalloc accounting problems for bigalloc file systems whose effects are masked by the other contributor to overcounting in ext4_ext_map_blocks().  Fixing it results in failures caused by these other problems when running xfstests-bld on bigalloc.  So, there's a lot of work yet to be done before it's time to post patches.
Comment 15 Fischreiher 2017-12-02 11:35:37 UTC
Thanks a lot for your investigation, Eric, and for describing these details. It is great to know that this issue is in competent hands.
Comment 16 Fischreiher 2018-01-27 09:25:38 UTC
Hi Eric, I got the message that this is neither an easy nor a quick fix, but is it still on your list?
Comment 17 Theodore Tso 2018-01-27 14:20:32 UTC
Eric has still been working on it (he has been reporting on it on our weekly ext4 concalls, and we've been discussing it).   He's identified an approach and has patches which he is refining, perfecting, and testing.   Hopefully there will be something that can be released for users to test in the near future.
Comment 18 Fischreiher 2018-01-27 19:28:54 UTC
Great, thank you, I'll be patient.
Comment 19 Betacentauri 2018-12-03 17:10:26 UTC
Any news regarding this ticket? Has the problem been fixed in the meantime?
Comment 20 Eric Whitney 2018-12-11 16:44:56 UTC
4.20 contains patches that correct delalloc cluster accounting for bigalloc file systems.  They were merged at the beginning of the release cycle. See "ext4: generalize extents status tree search functions" (ad431025aecd) and the following five patches.  They should address all the problems described in this bugzilla.

Any independent testing would be appreciated.  I'd recommend working with the latest mainline rc available at this time, which is 4.20-rc6.
Comment 21 Betacentauri 2018-12-11 18:57:58 UTC
Thanks for the patches!

In kvm-xfstests environment with a 4.20-rc6 kernel I cannot reproduce the bug anymore. I also created a tmpfs with several big files to reduce free memory, but still there are no problems :-)
Comment 22 Eric Whitney 2018-12-13 16:12:21 UTC
Very good - thanks for the testing!  I'll close this bug out at the end of the 4.20 release cycle if no negative test results are reported by then.