Bug 151491
| Summary: | free space lossage on busy system with bigalloc enabled and 128KB cluster | | |
|---|---|---|---|
| Product: | File System | Reporter: | Matthew L. Martin (mlmartin) |
| Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
| Status: | NEW | | |
| Severity: | high | CC: | betacentauri, enwlinux, mfe555, tytso |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 4.1.8 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | details of fault with scripts for reproduction; Test script; Test script output | | |
The fault is in linux 4.4.16. After running the populate script for a few hours, du and df disagree a great deal:

    # df -h /mnt/hdd_sd[gh]; du -hs /mnt/hdd_sd[gh]
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdg        5.5T  129G  5.3T   3% /mnt/hdd_sdg
    /dev/sdh        5.5T   20G  5.4T   1% /mnt/hdd_sdh
    20G     /mnt/hdd_sdg
    20G     /mnt/hdd_sdh
    # lsof | grep -e sdg -e sdh
    jbd2/sdg- 32609 root cwd DIR 9,127 4096 192 /
    jbd2/sdg- 32609 root rtd DIR 9,127 4096 192 /
    jbd2/sdg- 32609 root txt unknown /proc/32609/exe
    jbd2/sdh- 32614 root cwd DIR 9,127 4096 192 /
    jbd2/sdh- 32614 root rtd DIR 9,127 4096 192 /
    jbd2/sdh- 32614 root txt unknown /proc/32614/exe

I believe I have confirmed that this fault is not present in linux 4.7. After running the reproduction script for over three hours, I have not seen a difference between the usage reported by du and df:

    # df -h /mnt/hdd_sd[gh]; du -hs /mnt/hdd_sd[gh]
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdg        5.5T   20G  5.4T   1% /mnt/hdd_sdg
    /dev/sdh        5.5T   20G  5.4T   1% /mnt/hdd_sdh
    20G     /mnt/hdd_sdg
    20G     /mnt/hdd_sdh

Unfortunately, after an extended run of the test scripts the fault presented itself in the 4.7 kernel as well:

    [root@d-ceph01 ~]# df -h /mnt/hdd_sd[gh]; du -hs /mnt/hdd_sd[gh]
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdg        5.5T  669G  4.8T  13% /mnt/hdd_sdg
    /dev/sdh        5.5T   20G  5.4T   1% /mnt/hdd_sdh
    20G     /mnt/hdd_sdg
    20G     /mnt/hdd_sdh

I have a similar or the same problem on a Linux-based Enigma2 set-top box:

- ext4
- kernel 4.8.3
- bigalloc enabled
- cluster size of 262144

In normal use of the set-top box, the free space lossage is tens of gigabytes per day. The problem can be easily reproduced: when creating a fresh file, there is a significant difference between file size (ls -la) and disk usage (du). When making two copies of the file ..
    gbquad:/hdd/test# cp file file.copy1
    gbquad:/hdd/test# cp file file.copy2
    gbquad:/hdd/test# ls -la
    -rw------- 1 root root 581821460 Nov  1 18:52 file
    -rw------- 1 root root 581821460 Nov  1 18:56 file.copy1
    -rw------- 1 root root 581821460 Nov  1 18:57 file.copy2
    gbquad:/hdd/test# du *
    607232  file
    658176  file.copy1
    644864  file.copy2

... all three files show an overhead in the ~10% range, and the overhead differs between the files although their md5sums are equal.

When deleting a file (rm), the overhead remains occupied on the disk. For example, after deleting "file", df reports approximately 581821460 more bytes free, not 607232 kbytes more free space. The overhead (607232 kB - 581821460 B = approx. 39 MB) remains blocked.

When unmounting and mounting again, the blocked space becomes free, and in addition the overhead of the files that were not deleted also disappears, so that after a remount the file size and disk usage match for all files (except for rounding up to some block size).

I found that echo 3 > /proc/sys/vm/drop_caches seems to detach the blocked disk space from the files (so that 'du file' no longer includes the overhead), but it does not free the space; 'df' still shows all file overheads as used disk space.

Can you try replicating this on an upstream kernel, running in a controlled environment (e.g., using kvm), and then give us reliable reproduction instructions --- e.g., simple shell scripts that don't depend on the vagaries of the set-top box software and on random versions of du, df, etc.? For bonus points, get a copy of kvm-xfstests[1][2] and run the test using scripts cut and pasted into "kvm-xfstests shell". That way we will be able to reproduce *exactly* what you are doing.

[1] https://github.com/tytso/xfstests-bld
[2] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md

Thanks!!
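For scale: the overhead expected from cluster rounding alone is far smaller than the ~39 MB observed above. A quick shell sketch of the arithmetic, using the file size and the 262144-byte cluster size from the comments above (the variable names are illustrative):

```shell
# Worst-case space a file can "legitimately" lose to cluster rounding:
# the file occupies whole clusters, so at most one cluster minus one
# byte is wasted. Numbers taken from the report above.
file_size=581821460        # bytes, from ls -la
cluster=262144             # 256 KiB bigalloc cluster size
clusters=$(( (file_size + cluster - 1) / cluster ))
alloc=$(( clusters * cluster ))
echo "clusters=$clusters rounding_overhead=$(( alloc - file_size )) bytes"
```

This comes out to roughly 135 KiB of rounding overhead - nothing like the ~39 MB that remains blocked, which is why the residue points to an accounting bug rather than expected cluster rounding.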
First result: when I revert this commit

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/ext4?h=v4.1&id=9d21c9fa2cc24e2a195a79c27b6550e1a96051a4

in a 4.1 ARM kernel, the problem no longer occurs. Scripts for replicating this in the kvm-xfstests environment will follow later.

By the way: the set-top boxes fischreiher and I are talking about use an only slightly adapted upstream kernel to support the Broadcom hardware. The fs directory in the kernel sources is unchanged.

Frank

Created attachment 260631 [details]
Test script
Created attachment 260633 [details]
Test script output
I have used a 4.9 32-bit kernel (config from the kernel-configs folder).
The output shows how the used space reported by df increases. Files generated by dd and cp also show different file sizes in the ls output.
Forgot to say that the script is for the kvm-xfstests environment, and the output already shows the output of that environment.

I've been able to reproduce the reported problem on my test system running a 4.14 x86-64 kernel with the supplied test script. Thanks for supplying it!

The block reporting errors from du and df are likely caused by delayed allocation accounting bugs. Experiments with an instrumented kernel show that the number of delayed allocated blocks is occasionally overcounted as the test files are physically allocated, leaving a residual value behind once allocation is complete. This residual value remains once a file has been fully written out or deleted, and distorts the results reported by du or df. Interestingly, the overcounting isn't deterministic and varies from run to run.

Part of the overcounting appears due to code in ext4_ext_map_blocks() that increases i_reserved_data_blocks when new clusters are allocated. This code has been previously implicated in other observed failures and in this case appears to contribute some, but not always all, of the overcounted clusters seen when running the test script. Kernel traces indicate that there is usually another, as yet unknown, contributor to the overcount.

Ted has suggested a temporary workaround which can be used to avoid the reported problems, though it may have a significant workload-dependent performance impact: delayed allocation can simply be disabled with the nodelalloc mount option. I've tested this with repeated runs of the supplied test script, and it avoids the reported problems as expected.

Reverting "ext4: don't release reserved space for previously allocated cluster" (9d21c9fa2cc2) isn't an attractive option, because doing so would expose users to potential data loss. The purpose of that patch was to fix cases where the number of outstanding delayed allocation blocks was undercounted. Undercounting can lead to unexpected free space exhaustion at writeback time, among other things.
I'll see what more I can learn from some additional experimentation.

The nodelalloc mount option works around the problem in the test environment. But I also checked a real ARM system with a 4.1.37 kernel:

    root@sf4008:/media# mount
    ...
    /dev/sda on /media/sda type ext4 (rw,relatime,nodelalloc,data=ordered)
    root@sf4008:/media# ls -las sda/testfiles/
    10240 -rw-r--r-- 1 root root 10485760 Nov 17 16:47 test
    10304 -rw-r--r-- 1 root root 10485760 Nov 17 16:47 test1

The test is a little bit different here; only 10 MB files are generated. In most cases the disk usage (first column) is equal, but in some cases it still differs, as in the example above. It's not deterministic for me when it happens, but I only see two values, 10240 or 10304. With delalloc the reported usage was much more random.

After many attempts, I'm unable to reproduce the newly reported behavior for the nodelalloc workaround in comment 11 on my x86-64 test system (which is not an xfstests-bld test appliance) running either a current 4.14 kernel or an older Debian Jessie 4.8 kernel. I consistently get a reported value of 10240 1k units, which is correct for the reported size.

However, in the process of running my trials I arrived at a simpler reproducer that should be helpful in identifying the source of the original space reporting problem. There's no need to copy the first test file if the test system's free memory is sufficiently limited relative to the size of the test file - a simple sequential write of a single test file suffices. In fact, the tighter the free memory, the more likely the problem occurs, and the likelihood of larger reporting errors increases. A test system with ample free memory won't exhibit the problem at all. I'm getting workable kernel traces with the simpler reproducer, and the free memory-related behavior suggests a direction, so I'll see where that takes me.

The ARM machine I use has very little free memory, so that fits your analysis.
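The simpler reproducer described above - a single sequential write, no copy - can be sketched roughly like this. This is a hedged sketch, not the exact script used; the function name, mount point, and sizes are illustrative:

```shell
# Hedged sketch of the simpler reproducer: sequentially write one large
# file, then compare per-file usage (du) with filesystem usage (df).
# On an affected kernel under memory pressure, df drifts upward while
# du stays correct. Function name and sizes are illustrative.
write_and_compare() {
    dir="$1"; mb="$2"
    dd if=/dev/zero of="$dir/big" bs=1M count="$mb" status=none
    sync
    du -sk "$dir"          # per-file view
    df -k "$dir"           # filesystem view
}
```

Called repeatedly against a bigalloc mount (e.g. `write_and_compare /mnt/hdd_sdg 2048`) on a memory-constrained system, the df "Used" figure would grow past what du accounts for.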
With your information I could also reproduce it in the xfstests environment with nodelalloc, but only twice. I set the memory in the virtual machine to 256MB (in config.kvm). Then I mounted the filesystem with nodelalloc and executed this little script several times:

    #!/bin/bash
    i=0
    while [ $i -lt 10 ]; do
        dd if=/dev/zero of=./test$i bs=1M count=200 > /dev/null
        cp test$i testx_$i
        sync
        let i=i+1
        echo $i
    done

The result was this:

    root@kvm-xfstests:/media/test# ls -las
    total 4097028
       256 drwxr-xr-x 3 root root      4096 Nov 20 17:43 .
         4 drwxr-xr-x 3 root root      4096 Nov 20 17:33 ..
       256 -rwxr-xr-x 1 root root       155 Nov 20 17:48 h.sh
       256 drwx------ 2 root root     16384 Nov 20 17:26 lost+found
    204800 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test0
    205056 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test1
    204800 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test2
    204800 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test3
    204800 -rw-r--r-- 1 root root 209715200 Nov 20 17:48 test4
    ...

One file shows a usage of 205056, but I cannot reliably reproduce it. So better to focus on fixing the bug than on trying to reproduce it with nodelalloc. If I find a way to reproduce it, I'll let you know ;-)

I've been able to identify the unknown contributor to the delalloc accounting error discussed previously (in comment 10). When ext4 is writing back a previously delalloc'ed extent that contains just a portion of a cluster, and then delallocs an extent that contains another disjoint portion of that same cluster, the count of delalloc'ed clusters (i_reserved_data_blocks) is incorrectly incremented. The cluster has been physically allocated during writeback, but the subsequent delalloc write does not discover that allocation. This is because the code in ext4_da_map_blocks() checks for a previously physically allocated block at the point of allocation, rather than for a previously physically allocated cluster spanning the point of allocation.
The effect is to bump the delalloc'ed cluster count for clusters that will never be allocated (since they've already been allocated), so the overcount will never be reduced.

This problem is more likely to occur when writing files sequentially if the test system is under memory pressure, resulting in writeback activity in parallel with delalloc writes. The magnitude of the overcount is also likely to be larger in this situation. This correlates well with the observation that the reproducer for the accounting errors is more likely to trigger the problem on a test system with little free memory.

I've been testing a prototype patch that appears to fix this problem. However, I've also identified at least two other unrelated delalloc accounting problems for bigalloc file systems whose effects are masked by the other contributor to overcounting in ext4_ext_map_blocks(). Fixing it results in failures caused by these other problems when running xfstests-bld on bigalloc. So there's a lot of work yet to be done before it's time to post patches.

Thanks a lot for your investigation, Eric, and for describing these details. It is great to know that this issue is in competent hands.

Hi Eric, I got the message that this is neither an easy nor a quick fix, but is it still on your list?

Eric has still been working on it (he has been reporting on it on our weekly ext4 concalls, and we've been discussing it). He's identified an approach and has patches which he is refining, perfecting, and testing. Hopefully there will be something that can be released for users to test in the near future.

Great, thank you, I'll be patient.

Any news regarding this ticket? Has the problem been fixed in the meantime?

4.20 contains patches that correct delalloc cluster accounting for bigalloc file systems. They were merged at the beginning of the release cycle. See "ext4: generalize extents status tree search functions" (ad431025aecd) and the following five patches.
They should address all the problems described in this bugzilla. Any independent testing would be appreciated. I'd recommend working with the latest mainline rc available at this time, which is 4.20-rc6.

Thanks for the patches! In the kvm-xfstests environment with a 4.20-rc6 kernel I cannot reproduce the bug anymore. I also created a tmpfs with several big files to reduce free memory, but there are still no problems :-)

Very good - thanks for the testing! I'll close this bug out at the end of the 4.20 release cycle if no negative test results are reported by then.
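For reference, the disjoint-portion write pattern Eric identified earlier (writeback of part of a cluster followed by a delalloc write to another part of the same cluster) can be sketched in a few lines of shell. This is a hypothetical illustration: the function name is mine, and the offsets assume a 128 KiB cluster:

```shell
# Illustrative trigger for the pattern described in the comments above:
# write 4 KiB at the start of cluster 0, force writeback (physically
# allocating the whole cluster), then write a disjoint 4 KiB at offset
# 32 KiB, still inside the same 128 KiB cluster. On a pre-fix bigalloc
# kernel the second write would bump i_reserved_data_blocks for an
# already-allocated cluster. Offsets and function name are illustrative.
partial_cluster_write() {
    f="$1"
    dd if=/dev/zero of="$f" bs=4096 count=1 conv=notrunc status=none
    sync                        # writeback allocates the containing cluster
    dd if=/dev/zero of="$f" bs=4096 seek=8 count=1 conv=notrunc status=none
}
```

On an unaffected filesystem this simply produces a 36 KiB file; the point is the ordering of writeback relative to the second delalloc write, which the fixed ext4_da_map_blocks() now handles by checking the whole cluster.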
Created attachment 227581 [details]
details of fault with scripts for reproduction

A file system with bigalloc enabled and a 128KB cluster size, with a large number of 2MB files being created/overwritten/deleted, loses usable space. Running du and df gives wildly different usage, with df showing much more usage than du. lsof shows no phantom open files. Using dd to fill the file system shows that df's version of free space is operative, but unmounting and remounting the file system returns the free space. There is no difference between df and du usage after a remount.

The fault does not seem to be present in the 4.7 kernel (or it takes a lot more activity for it to show up). I will build 4.4.16 and retest to see if it is present there. We do have a(n obnoxious) workaround of periodically unmounting/remounting the file systems.

Details of the configurations and tests are in the attached document.
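For anyone wanting to recreate the reported configuration, a bigalloc filesystem with a 128KB cluster size can be created like this. This is a sketch using a small loopback image rather than the reporter's multi-terabyte disks; the image path and sizes are illustrative:

```shell
# Create a small ext4 image with bigalloc and 128 KiB clusters.
# Loopback image instead of a real device; path and size illustrative.
img=/tmp/bigalloc-demo.img
dd if=/dev/zero of="$img" bs=1M count=64 status=none
mkfs.ext4 -F -q -b 4096 -O bigalloc -C 131072 "$img"
# Confirm the feature and cluster size took effect:
dumpe2fs -h "$img" 2>/dev/null | grep -Ei 'bigalloc|cluster size'
```

Mounting it (e.g. `mount -o loop "$img" /mnt/test`) requires root; from there the test scripts attached to this bug can be pointed at the mount.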