Bug 99171

Summary:        MD RAID or DRBD can be broken from userspace when using O_DIRECT
Product:        IO/Storage       Reporter: Stanislav German-Evtushenko (ginermail)
Component:      Block Layer      Assignee: Jens Axboe (axboe)
Status:         NEW
Severity:       high             CC: bug-kernel-20190616, c.burkhardt, devzero, ginermail, john, melroy, philip, sascha_lucas, szg00000, wry+bzkernel
Priority:       P1
Hardware:       All
OS:             Linux
Kernel Version: any              Subsystem:
Regression:     No               Bisected commit-id:
Attachments:    drbd_oos_test.c
                drbd copy of write bio

Description Stanislav German-Evtushenko 2015-05-29 09:56:10 UTC
Created attachment 178311 [details]
drbd_oos_test.c

Hello,

MD RAID, DRBD, and possibly other software RAID-like block devices can silently become inconsistent if a userspace program does something wrong.
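
For reference, here is a minimal sketch of the kind of racy O_DIRECT writer that drbd_oos_test.c appears to implement (an illustration written for this report, not the actual attachment): one thread keeps rewriting a buffer while another thread writes that same buffer to the target with O_DIRECT, so each mirror leg may pick up a different snapshot of it. Build with gcc -pthread, same as the attachment.

/* Hypothetical sketch, not the attached drbd_oos_test.c: writes a buffer
 * with O_DIRECT while a second thread keeps modifying that buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096

static char *buf;
static volatile int done;

static void *scribbler(void *arg)
{
    unsigned char x = 0;
    (void)arg;
    while (!done)
        memset(buf, x++, BUF_SIZE);        /* keep changing the payload */
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t t;
    int i, fd;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;
    if (posix_memalign((void **)&buf, 4096, BUF_SIZE))  /* O_DIRECT needs alignment */
        return 1;

    pthread_create(&t, NULL, scribbler, NULL);
    for (i = 0; i < 100000; i++)
        pwrite(fd, buf, BUF_SIZE, 0);      /* the device reads buf while it changes */
    done = 1;
    pthread_join(t, NULL);
    close(fd);
    return 0;
}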


*** How to reproduce ***

1. Prepare

gcc -pthread drbd_oos_test.c
dd if=/dev/zero of=/tmp/mdadm1 bs=1M count=100
dd if=/dev/zero of=/tmp/mdadm2 bs=1M count=100
losetup /dev/loop1 /tmp/mdadm1
losetup /dev/loop2 /tmp/mdadm2
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop{1,2}

2. Write data with O_DIRECT

./a.out /dev/md0

3. Check consistency with vbindiff

vbindiff /tmp/mdadm{1,2}      #press enter multiple times to skip metadata



*** Variant: EXT3 or EXT4 on top of md0 ***

Step 2 can be extended by creating a file system on top of md0:

mkfs.ext3 /dev/md0
mkdir /tmp/ext3
mount /dev/md0 /tmp/ext3
./a.out /tmp/ext3/testfile1
vbindiff /tmp/mdadm{1,2}      #press enter multiple times to skip metadata


In both cases the data on /tmp/mdadm1 and /tmp/mdadm2 will differ. We get the same result when using DRBD instead of MD RAID.

Best regards,
Stanislav
Comment 1 Phil Turmel 2017-06-30 14:40:20 UTC
I'm not convinced this is a meaningful test case.  Any userspace application that modifies a data buffer in one thread while another thread is writing that buffer to disk is certain not to get predictable data back when reading it later.  Whether this situation results in a mismatch among RAID mirrors is not terribly meaningful.
Comment 2 Wolfgang Bumiller 2017-06-30 17:30:13 UTC
This is not at all about the contents of the data. It is expected that garbage is written to the disks, but each disk making up the RAID will contain different garbage, which means the disks are out of sync; in other words, the RAID is "broken". This in turn means userspace can "break" the RAID.
The problem is that with O_DIRECT the userspace pointer is passed to the block drivers for the underlying layers making up the RAID, and they all read from it independently. Any user who can run a program that uses O_DIRECT on a file on a RAID can break the RAID.

It is expected that garbage is written to the disk, but the whole point of a raid is that each disk should contain the *same* garbage. Keep the garbage consistent... or something.
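
To make the mechanism concrete, here is a small userspace analogy (an illustration only, not how the kernel code literally looks): two independent consumers each take their own copy of a shared buffer while a writer keeps modifying it, and the two copies end up different, just as two raid1 legs reading the same O_DIRECT pages at slightly different moments end up out of sync. Build with gcc -pthread; it usually reports a divergence within a few iterations.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define SZ 4096

static char shared[SZ];
static volatile int stop;

static void *scribbler(void *arg)
{
    unsigned char x = 0;
    (void)arg;
    while (!stop)
        memset(shared, x++, SZ);          /* userspace keeps rewriting the buffer */
    return NULL;
}

int main(void)
{
    char leg1[SZ], leg2[SZ];
    pthread_t t;
    int i, diverged = 0;

    pthread_create(&t, NULL, scribbler, NULL);
    for (i = 0; i < 1000 && !diverged; i++) {
        memcpy(leg1, shared, SZ);         /* "disk 1" reads the shared pages ... */
        memcpy(leg2, shared, SZ);         /* ... "disk 2" reads them a bit later */
        diverged = memcmp(leg1, leg2, SZ) != 0;
    }
    stop = 1;
    pthread_join(t, NULL);

    printf("mirror legs %s\n", diverged ? "differ" : "match");
    return 0;
}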
Comment 3 John Brooks 2017-06-30 18:43:49 UTC
If any data, garbage or otherwise, is written to the RAID, should not the array be consistent afterwards? Any action by a userspace program (short of bypassing the RAID and directly writing to the constituent block devices) that results in the array becoming out of sync sounds like a bug to me.
Comment 4 Phil Turmel 2017-06-30 18:52:47 UTC
It'd be nice for it to be consistent, but giving up the performance of zero-copy operations to avoid what can only be garbage doesn't seem like a great tradeoff to me.  And it is long-known behaviour, thanks to direct access by the kernel on mirrored swap devices.
Comment 5 Wolfgang Bumiller 2017-10-30 15:22:00 UTC
Since this comes up every once in a while I thought I'd also share a "legitimate" case where this can happen.
Legitimate in the sense that the data being written is legitimately also being modified (keep reading), and _somewhat_ common because the setup _seems_ to make sense (initially):

Take a virtual machine, give it a disk - put the image on a software RAID and tell qemu to disable caching (i.e. use O_DIRECT, because the guest already does its own caching).
Run Linux in the VM, use part of a disk that lives on the RAID as swap, and make the guest swap heavily.

What *seems* to be happening is this: the kernel decides to swap out some memory. At the same time the process it belongs to exits and the kernel marks the pages as unused - the swap write is still in flight. Since the area is now unused, there is no reason to ever re-read it from the swap device. Someone else needs memory, so the kernel hands them the affected pages. The swap write is still in flight. The new process starts using the memory; at this point we really don't care what kind of garbage ends up being written to disk, because we will never need it again.
The swap writes finish. Now the RAID legs are out of sync.

The lesson: if you use software raid you kinda need to know the possible pitfalls you can run into...
Comment 6 bcs 2018-03-26 10:25:06 UTC
Created attachment 274945 [details]
drbd copy of write bio

We can confirm the test case and the problem.

We have therefore developed a solution.

With this patch we solved the problem without a performance impact.

Feel free to participate in the solution to this data corruption.
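
For readers who do not have the attachment handy, the general idea behind a "copy of write bio" approach can be sketched in userspace terms. The following is a hypothetical illustration of the principle only, not the patch itself: take a stable snapshot of the data before it is handed to the replicas, so a later modification of the original buffer cannot make them diverge. The patch presumably applies the same idea at the bio level inside DRBD; the obvious cost of the naive version shown here is the extra copy.

/* Hypothetical illustration of the "copy the write payload first" idea:
 * snapshot the caller's buffer into a private bounce buffer and write
 * the snapshot, so concurrent modification of the original buffer can
 * no longer make the replicas diverge. */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t stable_pwrite(int fd, const void *ubuf, size_t len, off_t off)
{
    void *bounce;
    ssize_t ret;

    if (posix_memalign(&bounce, 4096, len))   /* keep O_DIRECT alignment */
        return -1;
    memcpy(bounce, ubuf, len);                /* take a stable snapshot  */
    ret = pwrite(fd, bounce, len, off);
    free(bounce);
    return ret;
}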
Comment 7 Melroy 2023-03-25 23:28:25 UTC
Has this patch been delivered upstream yet!? Or is this still not solved by default?
Comment 8 Roland Kletzing 2023-05-07 10:54:36 UTC
Hello, I'd also be interested in the status of this bug!?

I'm really curious why this has existed for so long while getting so little notice.

I bet there are a LOT of people out there running virtual machines on top of mdraid, and if this is broken, it should either be fixed or at least be known more widely.

also see:
https://bugzilla.kernel.org/show_bug.cgi?id=99171
Comment 9 Sascha Lucas 2023-05-08 19:37:19 UTC
(In reply to Roland Kletzing from comment #8)
> Hello, I'd also be interested in the status of this bug!?
> 
> I'm really curious why this has existed for so long while getting so little notice.

I assume the behavior described here is not considered a bug. In the case of DRBD the problem is mentioned in man drbd.conf [1], where it is called "false positives" and "not necessarily poses a problem for the integrity of the data" ... whatever that is supposed to mean.

[1] https://github.com/LINBIT/drbd-utils/blob/0870121c730ea1ebde511380ab9d06b045cca75b/documentation/v84/drbd.conf.xml#L2061-L2082
Comment 10 Roland Kletzing 2023-05-08 20:53:51 UTC
There is no mention of O_DIRECT on that page.

Anyhow:

https://lkml.org/lkml/2007/1/10/235

"So O_DIRECT not only is a total disaster from a design standpoint (just 
look at all the crap it results in)"


https://lkml.org/lkml/2007/1/11/121

"Yes. O_DIRECT is really fundamentally broken. There's just no way to fix 
it sanely. Except by teaching people not to use it, and making the normal 
paths fast enough "


Hmm, apparently this still seems to be true!?