Bug 194071
Summary: | data loss using fallocate and mmap | ||
---|---|---|---|
Product: | File System | Reporter: | Michael Zimmer (michael) |
Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
Status: | NEW --- | ||
Severity: | high | CC: | david, jack |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 4.4.0+ | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Example C program; [PATCH] ext4: Fix data corruption for mmap writes | ||
Has anyone investigated or been able to reproduce this failure?

Looks like a bug in ext4... I'm investigating...

Created attachment 256719
[PATCH] ext4: Fix data corruption for mmap writes

This patch fixes the issue for me.
BTW, can I base a testcase for fstests on your example program?

Thanks for investigating and making the patch. Sorry that I missed your last comment, feel free to base a testcase on the example program.
Created attachment 254231
Example C program

After calling fallocate() on a shared mmap'ed file and writing data into the newly allocated region, some data is occasionally replaced by 0s (first observed after running for ~1 week). The address and size of the corrupted data are not reproducible.

The initial failure was debugged and reduced to a C++ program that failed with both gcc and clang, and later to the attached C program. The amount allocated every iteration was reduced to 1 byte because that caused faster failures; the failure wasn't reproducible with larger power-of-2 sizes.

Is this a bug or user error?

OS: Ubuntu 16.04.1 LTS
kernel versions: 4.4.0-38-generic, 4.9.7-040907-generic
block device: Observed on both /dev/ram0 and local SSD
ext4 mount options: (rw,relatime,data=ordered)

Unable to reproduce when using the "FALLOC_FL_ZERO_RANGE" flag, or on a tmpfs ram disk.

Reproduction steps:
sudo mkdir /mnt/ram0
sudo mkfs.ext4 /dev/ram0
sudo mount /dev/ram0 /mnt/ram0/
gcc -O2 tests_mmap_fallocate.c -o tests_mmap_fallocate_gcc
while sudo rm -f /mnt/ram0/tests_mmap_fallocate && sudo ./tests_mmap_fallocate_gcc; do date && sleep 1; done
...
...
...
Value has been modified

(Also nothing found in /var/log/kern.log)

On a development machine the failure only occurs after several days of running in a loop, but it occurs within minutes on a virtualized Linux machine on a server.
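
For readers without access to the attachment, the access pattern described above can be sketched roughly as follows. This is not the attached program, only a minimal illustration of the description: the function name grow_and_check, the MAP_LEN constant, the byte pattern, and the return codes are all hypothetical. It maps the file MAP_SHARED once, extends it by 1 byte per iteration with fallocate() in default mode, writes into the new byte through the mapping, and then rechecks everything written so far.

```c
#define _GNU_SOURCE           /* needed for fallocate() in glibc */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN (1 << 20)     /* hypothetical upper bound on the file size */

/*
 * Grow `path` one byte at a time with fallocate(), write each new byte
 * through a shared mapping, and verify all bytes written so far.
 * Returns 0 if everything verifies, 1 if a value changed, -1 on error.
 */
int grow_and_check(const char *path, size_t iterations)
{
    if (iterations > MAP_LEN)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return -1; }

    /* Map more than the current file size; only pages up to the one
     * containing EOF may be touched, so each write follows a fallocate(). */
    unsigned char *map = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    int rc = 0;
    for (size_t size = 0; size < iterations && rc == 0; size++) {
        /* Mode 0: allocate 1 byte at the current EOF and extend the file. */
        if (fallocate(fd, 0, (off_t)size, 1) != 0) {
            perror("fallocate");
            rc = -1;
            break;
        }
        /* Non-zero pattern, so a byte replaced by 0 is detectable. */
        map[size] = (unsigned char)(size % 255 + 1);

        for (size_t j = 0; j <= size; j++) {
            if (map[j] != (unsigned char)(j % 255 + 1)) {
                fprintf(stderr, "Value has been modified at offset %zu\n", j);
                rc = 1;
                break;
            }
        }
    }

    munmap(map, MAP_LEN);
    close(fd);
    unlink(path);
    return rc;
}
```

Calling grow_and_check("/mnt/ram0/tests_mmap_fallocate", 8192) from a small main() and running the binary in a loop like the one above would exercise the same sequence of calls. Since the report says the failure can take minutes to days to appear, a single clean run of a sketch like this says little about whether a given kernel is affected.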