Bug 99171 - MD RAID or DRBD can be broken from userspace when using O_DIRECT
Summary: MD RAID or DRBD can be broken from userspace when using O_DIRECT
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Jens Axboe
Depends on:
Reported: 2015-05-29 09:56 UTC by Stanislav German-Evtushenko
Modified: 2018-03-26 18:56 UTC (History)
7 users (show)

See Also:
Kernel Version: any
Tree: Mainline
Regression: No

drbd_oos_test.c (1.17 KB, text/x-csrc)
2015-05-29 09:56 UTC, Stanislav German-Evtushenko
drbd copy of write bio (1007 bytes, patch)
2018-03-26 10:25 UTC, bcs
Details | Diff

Description Stanislav German-Evtushenko 2015-05-29 09:56:10 UTC
Created attachment 178311 [details]


MD RAID, DRBD and may be other software raid-like block devices can become inconsistent (silently) if program in userspace is doing something wrong.

*** How to reproduce ***

1. Prepare

gcc -pthread drbd_oos_test.c
dd if=/dev/zero of=/tmp/mdadm1 bs=1M count=100
dd if=/dev/zero of=/tmp/mdadm2 bs=1M count=100
losetup /dev/loop1 /tmp/mdadm1
losetup /dev/loop2 /tmp/mdadm2
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop{1,2}

2. Write data with O_DIRECT

./a.out /dev/md0

3. Check consistency with vbindiff

vbindiff /tmp/mdadm{1,2}      #press enter multiple times to skip metadata

*** Variant: EXT3 or EXT4 on top of md0 ***

The step 2 can be extended by creating file system:

mkfs.ext3 /dev/md0
mkdir /tmp/ext3
mount /dev/md0 /tmp/ext3
./a.out /tmp/ext3/testfile1
vbindiff /tmp/mdadm{1,2}      #press enter multiple times to skip metadata

In both cases data on /tmp/mdadm1 and /tmp/mdadm2 will differ. We get the same result when we use DRBD instead of MD RAID.

Best regards,
Comment 1 Phil Turmel 2017-06-30 14:40:20 UTC
I'm not convinced this is a meaningful testcase.  Any userspace application that modifies a data buffer in one thread while another thread is writing that buffer to disk is certain to not get predicable data back when reading it later.  Whether this situation results in a mismatch among raid mirrors is not terribly meaningful.
Comment 2 Wolfgang Bumiller 2017-06-30 17:30:13 UTC
This is not at all about the contents of the data. It is expected that garbage is written to the disks, but each disk making up the raid will contain different garbage, which means the disks are out of sync, iow. the raid is "broken". This in turn means the user space can "break" the raid.
The problem is that with O_DIRECT the the user space pointer is passed to the block drivers for the underlying layers making up the raid, and they all read from it independently. Any user who can run a program where they can use O_DIRECT on a file on a raid can break the raid.

It is expected that garbage is written to the disk, but the whole point of a raid is that each disk should contain the *same* garbage. Keep the garbage consistent... or something.
Comment 3 John Brooks 2017-06-30 18:43:49 UTC
If any data, garbage or otherwise, is written to the RAID, should not the array be consistent afterwards? Any action by a userspace program (short of bypassing the RAID and directly writing to the constituent block devices) that results in the array becoming out of sync sounds like a bug to me.
Comment 4 Phil Turmel 2017-06-30 18:52:47 UTC
I'd be nice for it to be consistent, but giving up the performance of zero-copy operations to avoid what can only be garbage doesn't seem like a great tradeoff to me.  And it is long-known behaviour thanks to direct access by the kernel on mirrored swap devices.
Comment 5 Wolfgang Bumiller 2017-10-30 15:22:00 UTC
Since this comes up every once in a while I thought I'd also share a "legitimate" case where this can happen.
Legitimate in the sense that the data being written is legitimately also being modified (keep reading), and _somewhat_ common because the setup _seems_ to make sense (initially):

Take a virtual machine, give it a disk - put the image on a software raid and tell qemu to disable caching (iow. use O_DIRECT, because the guest already does caching anyway).
Run linux in the VM, add part of the/a disk on the raid as swap, and cause the guest to start swapping a lot.

What *seems* to be happening is this: kernel decides to swap out part of some memory. At the same time the process it belongs to exits and the kernel marks the pages as unused - the swap write is still in flight. The kernel now knows that this area is unused and thus there is no reason to ever re-read it from the swap device. Someone else needs memory, the kernel gives 'em the affected pages. The swap write is still in flight. The new process starts using the memory, at this point we really don't care what kind of garbage data ends up being written to the disk, simply because we won't ever need it.
The swap writes finish. Now the raid is degraded.

The lesson: if you use software raid you kinda need to know the possible pitfalls you can run into...
Comment 6 bcs 2018-03-26 10:25:06 UTC
Created attachment 274945 [details]
drbd copy of write bio

We can confirm the testcase and the problem.

Therefore we developed a solution.

With this patch we solved the problem without impact of performance.

Feel free to participate on the solution without data corruption.

Note You need to log in before you can comment on or make changes to this bug.