Bug 194071 - data loss using fallocate and mmap
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: x86-64 Linux
Importance: P1 high
Assignee: fs_ext4@kernel-bugs.osdl.org
Reported: 2017-02-06 10:59 UTC by Michael Zimmer
Modified: 2017-09-05 10:02 UTC

Kernel Version: 4.4.0+
Regression: No

Attachments
Example C program (1.23 KB, text/plain)
2017-02-06 10:59 UTC, Michael Zimmer
[PATCH] ext4: Fix data corruption for mmap writes (2.13 KB, patch)
2017-05-25 11:29 UTC, Jan Kara

Description Michael Zimmer 2017-02-06 10:59:24 UTC
Created attachment 254231
Example C program

After calling fallocate() on a file that is mapped with a shared mmap() and writing data into the newly allocated region, some of the written data is occasionally replaced by zeros (first observed after running for about a week). The address and size of the corrupted data are not reproducible.

The initial failure was debugged and reduced to a C++ program that failed when built with both gcc and clang, and later to the attached C program. The amount allocated per iteration was reduced to 1 byte because that made the failure occur sooner; the failure was not reproducible with larger power-of-two sizes.
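
For readers without access to the attachment, the access pattern described above can be sketched roughly as follows. This is not the attached program (attachment 254231); it is a minimal, single-process illustration written for this summary. The helper name `check_grow_and_write`, the marker byte 0xAA, and the return codes are assumptions made for the example. Given how rarely the failure occurred, a short run of this sketch is unlikely to trigger it.

```c
#define _GNU_SOURCE /* fallocate() is a Linux-specific extension */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the attachment).
 * Grows the file at `path` by 1 byte per iteration with fallocate(),
 * writes a marker byte through a shared mapping, and re-checks every
 * byte written so far.
 * Returns 0 if all bytes survive, 1 if a written byte changed,
 * -1 on a setup or system call error. `total` must be greater than 0.
 */
int check_grow_and_write(const char *path, size_t total)
{
    const unsigned char marker = 0xAA; /* assumed non-zero test value */
    int result = 0;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* Map the whole target range once. Pages entirely beyond EOF must
     * not be touched until fallocate() has extended the file. */
    unsigned char *map = mmap(NULL, total, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        unlink(path);
        return -1;
    }

    for (size_t i = 0; i < total && result == 0; i++) {
        /* Mode 0 allocates the range and extends the file size to i + 1. */
        if (fallocate(fd, 0, (off_t)i, (off_t)1) != 0) {
            perror("fallocate");
            result = -1;
            break;
        }

        /* Write into the newly allocated byte via the shared mapping. */
        map[i] = marker;

        /* Verify that nothing written so far has been replaced. */
        for (size_t j = 0; j <= i; j++) {
            if (map[j] != marker) {
                fprintf(stderr, "Value has been modified at offset %zu\n", j);
                result = 1;
                break;
            }
        }
    }

    munmap(map, total);
    close(fd);
    unlink(path);
    return result;
}
```

Calling `check_grow_and_write()` from a small `main()` with a path on the ext4 mount under test and a `total` of a few pages gives a loop body comparable to the one described in this report. The full re-check on every iteration is quadratic in `total`, so this form only suits small sizes.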

Is this a bug or user error?

OS: Ubuntu 16.04.1 LTS
kernel versions: 4.4.0-38-generic, 4.9.7-040907-generic
block device: Observed on both /dev/ram0 and local SSD
ext4 mount options: (rw,relatime,data=ordered)

The failure could not be reproduced when using the "FALLOC_FL_ZERO_RANGE" flag, nor on a tmpfs RAM disk.

Reproduction steps:
sudo mkdir /mnt/ram0
sudo mkfs.ext4 /dev/ram0
sudo mount /dev/ram0 /mnt/ram0/
gcc -O2 tests_mmap_fallocate.c -o tests_mmap_fallocate_gcc
while sudo rm -f /mnt/ram0/tests_mmap_fallocate && sudo ./tests_mmap_fallocate_gcc; do date && sleep 1; done
...
...
...
Value has been modified
(Also nothing found in /var/log/kern.log)

On a development machine the failure occurs only after several days of running in a loop, but on a virtualized Linux machine on a server it occurs within minutes.
Comment 1 Michael Zimmer 2017-04-26 10:46:17 UTC
Has anyone investigated or been able to reproduce this failure?
Comment 2 Jan Kara 2017-05-25 08:59:38 UTC
Looks like a bug in ext4... I'm investigating...
Comment 3 Jan Kara 2017-05-25 11:29:21 UTC
Created attachment 256719
[PATCH] ext4: Fix data corruption for mmap writes

This patch fixes the issue for me.
Comment 4 Jan Kara 2017-05-25 11:55:53 UTC
BTW, can I base a testcase for fstests on your example program?
Comment 5 Michael Zimmer 2017-09-05 10:02:27 UTC
Thanks for investigating and making the patch. Sorry that I missed your last comment; feel free to base a testcase on the example program.
