Created attachment 254231 [details]
Example C program

After calling fallocate() on a file that is mapped with a shared mmap() and writing data into the newly allocated region, some of the data is occasionally replaced by zeros (first observed after running for ~1 week). Neither the address nor the size of the corrupted data is reproducible.

The initial failure was debugged and reduced to a C++ program that failed when built with both gcc and clang, and later to the attached C program. The amount allocated per iteration was reduced to 1 byte because that triggered the failure faster; the failure was not reproducible with larger power-of-2 sizes.

Is this a bug or user error?

OS: Ubuntu 16.04.1 LTS
Kernel versions: 4.4.0-38-generic, 4.9.7-040907-generic
Block device: observed on both /dev/ram0 and a local SSD
ext4 mount options: (rw,relatime,data=ordered)

The failure could not be reproduced when using the FALLOC_FL_ZERO_RANGE flag, or on a tmpfs RAM disk.

Reproduction steps:

sudo mkdir /mnt/ram0
sudo mkfs.ext4 /dev/ram0
sudo mount /dev/ram0 /mnt/ram0/
gcc -O2 tests_mmap_fallocate.c -o tests_mmap_fallocate_gcc
while sudo rm -f /mnt/ram0/tests_mmap_fallocate && sudo ./tests_mmap_fallocate_gcc; do date && sleep 1; done
...
...
...
Value has been modified

(Nothing relevant was found in /var/log/kern.log.)

On a development machine the failure occurs only after several days of running in a loop, but it occurs within minutes on a virtualized Linux machine on a server.
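For readers who do not have the attachment at hand, the access pattern described above can be sketched roughly as follows. This is not the attached program, only a minimal illustration under assumptions: the helper name grow_and_check(), its parameters, and the byte pattern are hypothetical. It grows a file one byte at a time with fallocate(), writes each new byte through a MAP_SHARED mapping, and re-reads everything written so far. A single short run like this is not expected to trigger the corruption, which in the report above needed minutes to days of looping.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not taken from the attachment).
 * Grows `path` to `total` bytes, one byte per fallocate() call, writing each
 * new byte through a shared mapping and then re-checking all earlier bytes.
 * `mode` is passed straight to fallocate(); 0 means plain allocation.
 * Returns 0 if every byte matched, 1 on a mismatch, -1 on a setup error
 * (including total == 0, for which mmap() fails). */
int grow_and_check(const char *path, size_t total, int mode)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* Map the whole target range up front. Bytes beyond the current end of
     * file are only touched after fallocate() has extended the file. */
    unsigned char *map = mmap(NULL, total, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return -1;
    }

    int rc = 0;
    for (size_t i = 0; i < total && rc == 0; i++) {
        /* Allocate one more byte; without FALLOC_FL_KEEP_SIZE this also
         * extends the file size to i + 1. */
        if (fallocate(fd, mode, (off_t)i, 1) != 0) {
            perror("fallocate");
            rc = -1;
            break;
        }

        /* Write a non-zero pattern so that zeroed data is detectable. */
        map[i] = (unsigned char)(i % 251 + 1);

        /* Verify everything written so far. */
        for (size_t j = 0; j <= i; j++) {
            if (map[j] != (unsigned char)(j % 251 + 1)) {
                fprintf(stderr, "Value has been modified at offset %zu\n", j);
                rc = 1;
                break;
            }
        }
    }

    munmap(map, total);
    close(fd);
    return rc;
}
```

A caller would invoke it as, for example, grow_and_check("/mnt/ram0/tests_mmap_fallocate", 4096, 0) inside a loop like the shell one above. Passing FALLOC_FL_ZERO_RANGE as the mode corresponds to the variant that could not be made to fail, on filesystems that support that flag.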
Has anyone investigated or been able to reproduce this failure?
Looks like a bug in ext4... I'm investigating...
Created attachment 256719 [details]
[PATCH] ext4: Fix data corruption for mmap writes

This patch fixes the issue for me.
BTW, can I base a testcase for fstests on your example program?
Thanks for investigating and making the patch. Sorry that I missed your last comment. Feel free to base a testcase on the example program.