Bug 99171 - MD RAID or DRBD can be broken from userspace when using O_DIRECT
Summary: MD RAID or DRBD can be broken from userspace when using O_DIRECT
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
Hardware: All
OS: Linux
Importance: P1 high
Assignee: Jens Axboe
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-05-29 09:56 UTC by Stanislav German-Evtushenko
Modified: 2024-10-16 21:19 UTC
CC List: 9 users

See Also:
Kernel Version: any
Subsystem:
Regression: No
Bisected commit-id:


Attachments
drbd_oos_test.c (1.17 KB, text/x-csrc)
2015-05-29 09:56 UTC, Stanislav German-Evtushenko
drbd copy of write bio (1007 bytes, patch)
2018-03-26 10:25 UTC, bcs

Description Stanislav German-Evtushenko 2015-05-29 09:56:10 UTC
Created attachment 178311 [details]
drbd_oos_test.c

Hello,

MD RAID, DRBD, and possibly other software RAID-like block devices can silently become inconsistent if a userspace program is doing something wrong.


*** How to reproduce ***

1. Prepare

gcc -pthread drbd_oos_test.c
dd if=/dev/zero of=/tmp/mdadm1 bs=1M count=100
dd if=/dev/zero of=/tmp/mdadm2 bs=1M count=100
losetup /dev/loop1 /tmp/mdadm1
losetup /dev/loop2 /tmp/mdadm2
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop{1,2}

2. Write data with O_DIRECT

./a.out /dev/md0

3. Check consistency with vbindiff

vbindiff /tmp/mdadm{1,2}      #press enter multiple times to skip metadata
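
The idea of the test program, in simplified sketch form (the actual attachment may differ in details such as buffer size, iteration count, number of threads and open flags): one thread writes a buffer to the target with O_DIRECT while a second thread keeps modifying that same buffer.

/* gcc -pthread oos_sketch.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096

static char *buf;
static volatile int done;

/* Writes the shared buffer to offset 0 over and over with O_DIRECT. */
static void *writer(void *arg)
{
    int fd = *(int *)arg;

    for (int i = 0; i < 100000; i++) {
        if (pwrite(fd, buf, BUF_SIZE, 0) != BUF_SIZE) {
            perror("pwrite");
            break;
        }
    }
    done = 1;
    return NULL;
}

/* Keeps changing the buffer contents while the writes are in flight. */
static void *modifier(void *arg)
{
    (void)arg;
    while (!done)
        memset(buf, rand() & 0xff, BUF_SIZE);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tw, tm;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* O_DIRECT requires an aligned buffer. */
    if (posix_memalign((void **)&buf, 4096, BUF_SIZE)) {
        perror("posix_memalign");
        return 1;
    }

    pthread_create(&tw, NULL, writer, &fd);
    pthread_create(&tm, NULL, modifier, NULL);
    pthread_join(tw, NULL);
    pthread_join(tm, NULL);
    close(fd);
    return 0;
}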



*** Variant: EXT3 or EXT4 on top of md0 ***

Step 2 can be extended by creating a file system first:

mkfs.ext3 /dev/md0
mkdir /tmp/ext3
mount /dev/md0 /tmp/ext3
./a.out /tmp/ext3/testfile1
vbindiff /tmp/mdadm{1,2}      #press enter multiple times to skip metadata


In both cases data on /tmp/mdadm1 and /tmp/mdadm2 will differ. We get the same result when we use DRBD instead of MD RAID.

Best regards,
Stanislav
Comment 1 Phil Turmel 2017-06-30 14:40:20 UTC
I'm not convinced this is a meaningful testcase.  Any userspace application that modifies a data buffer in one thread while another thread is writing that buffer to disk is certain not to get predictable data back when reading it later.  Whether this situation results in a mismatch among RAID mirrors is not terribly meaningful.
Comment 2 Wolfgang Bumiller 2017-06-30 17:30:13 UTC
This is not at all about the contents of the data. It is expected that garbage is written to the disks, but each disk making up the RAID will contain different garbage, which means the disks are out of sync, i.e. the RAID is "broken". This in turn means userspace can "break" the RAID.
The problem is that with O_DIRECT the userspace pointer is passed to the block drivers for the underlying layers making up the RAID, and they all read from it independently. Any user who can run a program that uses O_DIRECT on a file on a RAID can break the RAID.

It is expected that garbage is written to the disk, but the whole point of a raid is that each disk should contain the *same* garbage. Keep the garbage consistent... or something.
Comment 3 John Brooks 2017-06-30 18:43:49 UTC
If any data, garbage or otherwise, is written to the RAID, should not the array be consistent afterwards? Any action by a userspace program (short of bypassing the RAID and directly writing to the constituent block devices) that results in the array becoming out of sync sounds like a bug to me.
Comment 4 Phil Turmel 2017-06-30 18:52:47 UTC
It'd be nice for it to be consistent, but giving up the performance of zero-copy operations to avoid what can only be garbage doesn't seem like a great tradeoff to me.  And it is long-known behaviour, thanks to direct access by the kernel on mirrored swap devices.
Comment 5 Wolfgang Bumiller 2017-10-30 15:22:00 UTC
Since this comes up every once in a while I thought I'd also share a "legitimate" case where this can happen.
Legitimate in the sense that the data being written is legitimately also being modified (keep reading), and _somewhat_ common because the setup _seems_ to make sense (initially):

Take a virtual machine, give it a disk - put the image on a software raid and tell qemu to disable caching (iow. use O_DIRECT, because the guest already does caching anyway).
Run Linux in the VM, add part of that disk (which lives on the RAID) as swap, and cause the guest to start swapping a lot.
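
A hypothetical qemu invocation matching that setup (image path, memory size and interface are illustrative assumptions); cache=none makes qemu open the image with O_DIRECT, bypassing the host page cache:

# illustrative only: raw VM disk image stored on the software RAID
qemu-system-x86_64 \
    -m 2048 \
    -drive file=/mnt/md0/vm-disk.raw,format=raw,if=virtio,cache=none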

What *seems* to be happening is this: kernel decides to swap out part of some memory. At the same time the process it belongs to exits and the kernel marks the pages as unused - the swap write is still in flight. The kernel now knows that this area is unused and thus there is no reason to ever re-read it from the swap device. Someone else needs memory, the kernel gives 'em the affected pages. The swap write is still in flight. The new process starts using the memory, at this point we really don't care what kind of garbage data ends up being written to the disk, simply because we won't ever need it.
The swap writes finish. Now the raid is degraded.

The lesson: if you use software raid you kinda need to know the possible pitfalls you can run into...
Comment 6 bcs 2018-03-26 10:25:06 UTC
Created attachment 274945 [details]
drbd copy of write bio

We can confirm the test case and the problem.

We have therefore developed a solution.

With this patch we solved the problem without a performance impact.

Feel free to participate in getting this solved without data corruption.
Comment 7 Melroy 2023-03-25 23:28:25 UTC
Has this patch been delivered upstream yet? Or is this still not solved by default?
Comment 8 Roland Kletzing 2023-05-07 10:54:36 UTC
Hello, I'd also be interested in the status of this bug.

I'm really curious why it has existed for so long while getting so little notice.

I bet there are a LOT of people out there running virtual machines on top of mdraid, and if this is broken, it should either be fixed or at least be more widely known.

also see:
https://bugzilla.kernel.org/show_bug.cgi?id=99171
Comment 9 Sascha Lucas 2023-05-08 19:37:19 UTC
(In reply to Roland Kletzing from comment #8)
> hello, i'd also be interested what's the status of this bug!?
> 
> i' really curious why this exists for so long and getting so few notice

I assume the behavior described here is not considered a bug. In the case of DRBD the problem is mentioned in man drbd.conf[1], where it is called "false positives" and "not necessarily poses a problem for the integrity of the data" ... whatever that may mean.

[1] https://github.com/LINBIT/drbd-utils/blob/0870121c730ea1ebde511380ab9d06b045cca75b/documentation/v84/drbd.conf.xml#L2061-L2082
Comment 10 Roland Kletzing 2023-05-08 20:53:51 UTC
There is no mention of O_DIRECT on that page.

Anyhow:

https://lkml.org/lkml/2007/1/10/235

"So O_DIRECT not only is a total disaster from a design standpoint (just 
look at all the crap it results in)"


https://lkml.org/lkml/2007/1/11/121

"Yes. O_DIRECT is really fundamentally broken. There's just no way to fix 
it sanely. Except by teaching people not to use it, and making the normal 
paths fast enough "


Hmm, apparently this still seems to be true?
Comment 11 Roland Kletzing 2024-10-12 09:28:56 UTC
@stanislav:

https://marc.info/?l=linux-raid&m=172854310516409&w=2
"Which means that the test case is actually invalid; you either would 
need drop O_DIRECT or modify the buffer after write() to arrive with
a valid example.
"


@wolfgang:

https://bugzilla.proxmox.com/show_bug.cgi?id=5235#c14

One more person (besides me) ran the test there, and yet was unable to reproduce the issue.

https://marc.info/?l=linux-raid&m=172855001521105&w=2

"And then ending up with data corruption on MD. Which I really would love
to see reproduced, especially with recent kernels, as there is a lot of
vagueness around it (add part of the disk on the raid as swap? How?
In the host? On the guest?)."
Comment 12 Stanislav German-Evtushenko 2024-10-12 13:34:11 UTC
Roland Kletzing,

The test case from the description (https://bugzilla.kernel.org/show_bug.cgi?id=99171#c0) is still reproducible on Ubuntu 24.04, 6.8.0-45-generic. The original reason I started investigating this issue was that virtual machines running on top of DRBD with cache=none were sometimes hanging during live migration. I found out that this was caused by inconsistencies in the underlying DRBD storage, caused by VMs with cache=none writing to their swap. I can't think of examples other than swapping where something modifies buffers while a write is in flight. The artificial example was crafted to reproduce the case reliably, since running a VM and waiting until swapping inside it triggers the issue can take a long time.
Comment 13 Roland Kletzing 2024-10-12 17:10:48 UTC
@stanislav, I can reproduce the problem with your tool; vbindiff shows the differences.

But a consistency check via "echo check >/sys/block/md0/md/sync_action" does NOT show those RAID inconsistencies.

Any clue why?
Comment 14 Stanislav German-Evtushenko 2024-10-15 01:04:33 UTC
You need to make sure you have write permission on /dev/md0 when running ./a.out /dev/md0 (the program won't tell you if you don't).

> but consistency check via "echo check >/sys/block/md0/md/sync_action" does
> NOT show those raid inconsistencies.

It shows them in my experiments:

echo check | sudo tee /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt
640
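
One possible pitfall is reading the counter before the check has finished; an illustrative sequence that waits for the check to complete first (the polling interval is arbitrary):

echo check | sudo tee /sys/block/md0/md/sync_action
# sync_action reads back "check" while the check is running and "idle" when done
while grep -q check /sys/block/md0/md/sync_action; do sleep 1; done
cat /sys/block/md0/md/mismatch_cnt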
Comment 15 Roland Kletzing 2024-10-15 20:23:34 UTC
> But a consistency check via "echo check >/sys/block/md0/md/sync_action" does
> NOT show those RAID inconsistencies.

I did it wrong; I can see it now.

I have tested a little bit further, and apparently with drbd_oos_test.c from above it even seems to be possible to degrade an mdraid array from within a virtual machine.

This is what I did (a command-level sketch follows below):

- place a virtual disk image on mdraid and add that disk to a virtual machine
- format it with ext4 inside the virtual machine
- mount it inside the virtual machine
- make the mountpoint writeable for a non-root user
- run drbd_oos_test on that mount as the non-root user

Even with this, you can make the RAID become degraded.
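
A hypothetical command-level version of those guest-side steps (device name, mountpoint and username are illustrative assumptions, not my exact commands):

# inside the guest, as root
mkfs.ext4 /dev/vdb
mkdir /mnt/test
mount /dev/vdb /mnt/test
chown someuser /mnt/test

# then, as the unprivileged user
./a.out /mnt/test/testfile1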

That means whoever runs a virtual machine with "cache=none" (= O_DIRECT, which is the default for Proxmox but probably not for other hypervisors; libvirt uses writeback, for example) accepts the risk that any malicious user inside a VM can degrade the RAID array OUTSIDE the VM.

That means any customer can give a hoster's sysadmin/storage admin sleepless nights.

With this, I would consider the test case pretty valid, even if it's not the correct way to submit/handle data via O_DIRECT.

Sorry, but I would really consider a RAID that can be broken from inside a VM (from userspace, by a non-root user) fundamentally broken.
Comment 16 Roland Kletzing 2024-10-16 21:03:49 UTC
BTW, btrfs suffers from the same issue, so you may want to expand the ticket subject to include btrfs, @stanislav.

In this mail there is another btrfs-specific test tool for doing "O_DIRECT writes the wrong way": https://lore.kernel.org/linux-btrfs/cf8a733f-2c9d-7ffe-e865-4c13d99dfb60@libero.it/

With that I can (like above with mdraid) corrupt the btrfs RAID on the host from inside a VM, as an ordinary non-root user:

$ ./a.out test.dat
main:  data = 0x72a57eee6000
write_thread pid = 12488
read_thread pid = 12489
update_thread pid = 12490
read_thread:  data = 0x72a57eee2000
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error

pve-host:

# btrfs device stats -c /btrfs/
[/dev/sdf1].write_io_errs    0
[/dev/sdf1].read_io_errs     0
[/dev/sdf1].flush_io_errs    0
[/dev/sdf1].corruption_errs  2340
[/dev/sdf1].generation_errs  0
[/dev/sdh1].write_io_errs    0
[/dev/sdh1].read_io_errs     0
[/dev/sdh1].flush_io_errs    0
[/dev/sdh1].corruption_errs  2343
[/dev/sdh1].generation_errs  0
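
For what it's worth, a scrub on the host should also surface such checksum mismatches (commands illustrative):

btrfs scrub start -B /btrfs/
btrfs scrub status /btrfs/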


What's a little more problematic for me this time is that Proxmox is adding btrfs support to their Proxmox Virtual Environment product (experimental status), but they take a hidden/invisible/non-transparent countermeasure to avoid O_DIRECT on btrfs (i.e. cache=none):

https://forum.proxmox.com/threads/important-information-on-btrfs-getting-lost-in-wiki-wrong-vm-disk-defaults-with-btrfs-storage.143413/
https://forum.proxmox.com/threads/virtual-disk-default-no-cache-settings-weirdness.143430/
https://bugzilla.proxmox.com/show_bug.cgi?id=5320

I just found out by chance that this quirk to avoid O_DIRECT only seems to apply when you freshly start a VM on btrfs storage, but not if you live-migrate a running VM from another storage to btrfs storage. So the quirk is incomplete.

So any malicious user in a VM can still break the mirror and introduce inconsistency into the host's RAID.

The problem with O_DIRECT is at least documented in the Proxmox wiki, but with the same argument (problems with O_DIRECT) the Proxmox team doesn't like to support mdraid below virtual machines, which is a little inconsistent, IMHO, especially considering that mdraid is technology that has existed since at least 1997, i.e. mdraid is at least 10 years older.

At least it's good to know that quirks are applied (as btrfs in Proxmox is its own storage class, where Proxmox creates btrfs subvolumes/images for each virtual disk), i.e. the hypervisor "knows" that you configured a VM with a virtual disk on btrfs.

For mdraid, you would need to use storage class/type "dir" on top of some filesystem/volume manager, for which proper detection and adding a quirk would be difficult.

Anyhow, the whole O_DIRECT stuff looks like one big mess to me.

IMHO, for every storage technology/driver supporting it, we would perhaps be better off and more secure having it disabled by default at the kernel/driver level, requiring it to be explicitly enabled via a boot-time or module parameter, for those people who know what they are doing, why they need it, and which issues it comes with.
Comment 17 Roland Kletzing 2024-10-16 21:19:42 UTC
> IMHO, for every storage technology/driver supporting it, we would perhaps be
> better off and more secure having it disabled by default at the kernel/driver
> level, requiring it to be explicitly enabled via a boot-time or module
> parameter, for those people who know what they are doing, why they need it,
> and which issues it comes with.

I don't mean a global switch here, but a switch for every filesystem driver / volume manager where unresolved O_DIRECT issues exist.
