Created attachment 178311 [details]
drbd_oos_test.c

Hello,

MD RAID, DRBD and maybe other software RAID-like block devices can become inconsistent (silently) if a program in userspace is doing something wrong.

*** How to reproduce ***

1. Prepare

gcc -pthread drbd_oos_test.c
dd if=/dev/zero of=/tmp/mdadm1 bs=1M count=100
dd if=/dev/zero of=/tmp/mdadm2 bs=1M count=100
losetup /dev/loop1 /tmp/mdadm1
losetup /dev/loop2 /tmp/mdadm2
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop{1,2}

2. Write data with O_DIRECT

./a.out /dev/md0

3. Check consistency with vbindiff

vbindiff /tmp/mdadm{1,2}   # press Enter multiple times to skip metadata

*** Variant: EXT3 or EXT4 on top of md0 ***

Step 2 can be extended by creating a file system:

mkfs.ext3 /dev/md0
mkdir /tmp/ext3
mount /dev/md0 /tmp/ext3
./a.out /tmp/ext3/testfile1
vbindiff /tmp/mdadm{1,2}   # press Enter multiple times to skip metadata

In both cases the data on /tmp/mdadm1 and /tmp/mdadm2 will differ. We get the same result when we use DRBD instead of MD RAID.

Best regards,
Stanislav
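P.S. For readers who don't want to download the attachment: the program is essentially a "modify the buffer while an O_DIRECT write of it is in flight" race. Below is a minimal sketch of that idea, not the exact attachment - the buffer size, loop count and helper names here are made up for illustration.

/* Sketch only (assumed shape of drbd_oos_test.c, not the real attachment):
 * one thread keeps scribbling into an aligned buffer while the main thread
 * submits that same buffer with O_DIRECT, so each RAID leg may DMA a
 * different snapshot of the page. Build: gcc -pthread sketch.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096            /* one page, meets typical O_DIRECT alignment */

static char *buf;
static volatile int stop;

static void *scribbler(void *arg)    /* keeps modifying the in-flight buffer */
{
    unsigned char v = 0;
    (void)arg;
    while (!stop)
        memset(buf, v++, BUF_SIZE);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (posix_memalign((void **)&buf, BUF_SIZE, BUF_SIZE)) return 1;

    pthread_t t;
    pthread_create(&t, NULL, scribbler, NULL);

    for (int i = 0; i < 10000; i++) {          /* many racy O_DIRECT writes */
        if (pwrite(fd, buf, BUF_SIZE, (off_t)i * BUF_SIZE) != BUF_SIZE) {
            perror("pwrite");
            break;
        }
    }

    stop = 1;
    pthread_join(t, NULL);
    close(fd);
    return 0;
}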
I'm not convinced this is a meaningful testcase. Any userspace application that modifies a data buffer in one thread while another thread is writing that buffer to disk is certain to not get predictable data back when reading it later. Whether this situation results in a mismatch among raid mirrors is not terribly meaningful.
This is not at all about the contents of the data. It is expected that garbage is written to the disks, but each disk making up the raid will contain different garbage, which means the disks are out of sync, iow. the raid is "broken". This in turn means that user space can "break" the raid.

The problem is that with O_DIRECT the user space pointer is passed to the block drivers for the underlying layers making up the raid, and they all read from it independently. Any user who can run a program that uses O_DIRECT on a file on a raid can break the raid.

Yes, it is expected that garbage is written to the disk, but the whole point of a raid is that each disk should contain the *same* garbage. Keep the garbage consistent... or something.
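To illustrate the point from plain user space (an analogy only - not kernel code and not the attached reproducer): two independent readers that each take their own copy of a buffer while a third thread mutates it routinely end up with different copies, which is exactly what happens when each mirror leg reads the same user page at a slightly different time.

/* User-space analogy: two "mirror legs" each copy the same buffer while a
 * third thread mutates it; the two copies frequently differ.
 * Build: gcc -pthread analogy.c */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define SZ 4096

static char shared[SZ];
static volatile int stop;

static void *mutator(void *arg)        /* keeps rewriting the shared buffer */
{
    unsigned char v = 0;
    (void)arg;
    while (!stop)
        memset(shared, v++, SZ);
    return NULL;
}

int main(void)
{
    char leg1[SZ], leg2[SZ];
    pthread_t t;
    int mismatches = 0;

    pthread_create(&t, NULL, mutator, NULL);

    for (int i = 0; i < 100000; i++) {
        memcpy(leg1, shared, SZ);      /* "first disk" reads the buffer  */
        memcpy(leg2, shared, SZ);      /* "second disk" reads it again   */
        if (memcmp(leg1, leg2, SZ) != 0)
            mismatches++;
    }

    stop = 1;
    pthread_join(t, NULL);
    printf("mismatching copies: %d of 100000\n", mismatches);
    return 0;
}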
If any data, garbage or otherwise, is written to the RAID, should not the array be consistent afterwards? Any action by a userspace program (short of bypassing the RAID and directly writing to the constituent block devices) that results in the array becoming out of sync sounds like a bug to me.
It'd be nice for it to be consistent, but giving up the performance of zero-copy operations to avoid what can only be garbage doesn't seem like a great tradeoff to me. And it is long-known behaviour thanks to direct access by the kernel on mirrored swap devices.
Since this comes up every once in a while I thought I'd also share a "legitimate" case where this can happen. Legitimate in the sense that the data being written is legitimately also being modified (keep reading), and _somewhat_ common because the setup _seems_ to make sense (initially):

Take a virtual machine, give it a disk - put the image on a software raid and tell qemu to disable caching (iow. use O_DIRECT, because the guest already does caching anyway). Run linux in the VM, add part of the/a disk on the raid as swap, and cause the guest to start swapping a lot.

What *seems* to be happening is this: the kernel decides to swap out part of some memory. At the same time the process it belongs to exits and the kernel marks the pages as unused - the swap write is still in flight. The kernel now knows that this area is unused and thus there is no reason to ever re-read it from the swap device. Someone else needs memory, the kernel gives 'em the affected pages. The swap write is still in flight. The new process starts using the memory; at this point we really don't care what kind of garbage data ends up being written to the disk, simply because we won't ever need it. The swap writes finish. Now the raid mirrors are out of sync.

The lesson: if you use software raid you kinda need to know the possible pitfalls you can run into...
Created attachment 274945 [details]
drbd copy of write bio

We can confirm the testcase and the problem, so we developed a solution. With this patch we solved the problem without a performance impact. Feel free to participate in a solution that avoids this data corruption.
Has this patch been submitted upstream or anything!? Or is this still not solved by default?
hello, i'd also be interested in the status of this bug!?

i'm really curious why this has existed for so long and gotten so little notice. i bet there are a LOT of people out there using virtual machines on top of mdraid, and if this is broken, it should either be fixed or at least be known more widely.

also see: https://bugzilla.kernel.org/show_bug.cgi?id=99171
(In reply to Roland Kletzing from comment #8)
> hello, i'd also be interested in the status of this bug!?
>
> i'm really curious why this has existed for so long and gotten so little
> notice

I assume the behavior described here is not considered a bug. In the case of DRBD the problem is mentioned in man drbd.conf[1], where it is called "false positives" and "not necessarily poses a problem for the integrity of the data" ... whatever that could mean.

[1] https://github.com/LINBIT/drbd-utils/blob/0870121c730ea1ebde511380ab9d06b045cca75b/documentation/v84/drbd.conf.xml#L2061-L2082
there is no mention of O_DIRECT on that page, anyhow:

https://lkml.org/lkml/2007/1/10/235
"So O_DIRECT not only is a total disaster from a design standpoint (just look at all the crap it results in)"

https://lkml.org/lkml/2007/1/11/121
"Yes. O_DIRECT is really fundamentally broken. There's just no way to fix it sanely. Except by teaching people not to use it, and making the normal paths fast enough"

mhhh, apparently this still seems to be true!?
@stanislav:

https://marc.info/?l=linux-raid&m=172854310516409&w=2
"Which means that the test case is actually invalid; you either would need drop O_DIRECT or modify the buffer after write() to arrive with a valid example."

@wolfgang:

https://bugzilla.proxmox.com/show_bug.cgi?id=5235#c14
one more person (besides me) ran the test there and was also unable to reproduce the issue.

https://marc.info/?l=linux-raid&m=172855001521105&w=2
"And then ending up with data corruption on MD. Which I really would love to see reproduced, especially with recent kernels, as there is a lot of vagueness around it (add part of the disk on the raid as swap? How? In the host? On the guest?)."
Roland Kletzing,

The test case from the description (https://bugzilla.kernel.org/show_bug.cgi?id=99171#c0) is still reproducible on Ubuntu 24.04, 6.8.0-45-generic.

The original reason I started investigating this issue was that virtual machines running on top of DRBD with cache=none were sometimes hanging during live migration. I found out that this was caused by inconsistencies in the underlying DRBD storage, caused by VMs with cache=none writing to their swap.

I can't think of examples other than swapping where something modifies buffers while they are in flight. The artificial example was crafted to reproduce the case reliably, as running a VM and waiting until swapping inside that VM causes the issue may take a long time.
@stanislav, i can reproduce the problem with your tool, vbindiff showing a difference.

but a consistency check via "echo check >/sys/block/md0/md/sync_action" does NOT show those raid inconsistencies.

any clue why?
You need to make sure you have permissions to /dev/md0 when running ./a.out /dev/md0 (it won't tell you if you don't).

> but consistency check via "echo check >/sys/block/md0/md/sync_action" does
> NOT show those raid inconsistencies.

It shows them in my experiments:

echo check | sudo tee /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt
640
> but consistency check via "echo check >/sys/block/md0/md/sync_action" does
> NOT show those raid inconsistencies.

i did it wrong, i can see it now.

i have tested a little bit further, and apparently with drbd_oos_test.c from above it even seems to be possible to make the mdraid array inconsistent from within a virtual machine.

this is what i did:

- place a virtual disk on mdraid and add that disk to a virtual machine
- ext4 format inside the virtual machine
- mount inside the virtual machine
- make the mountpoint writeable for a non-root user
- run drbd_oos_test on that mount as non-root user

even with this, you can make the raid go inconsistent.

that means: whoever runs a virtual machine with "cache=none" (= O_DIRECT, which is the default for proxmox but probably not for other hypervisors - libvirt uses writeback for example) runs the risk that any malicious user inside a VM can make the raid array OUTSIDE the vm inconsistent. that means any customer can provide sleepless nights to a hoster's sysadmin/storage admin.

with this, i would consider the test case pretty valid, even if it's not the correct way to submit/handle data via O_DIRECT.

sorry, but i would really consider a raid which can be broken from inside a VM (from userspace and by a non-root user) fundamentally broken.
btw - btrfs suffers from the same issue, so you may want to expand the ticket subject to include btrfs, @stanislav.

in this mail there is another btrfs-specific testing tool for doing "O_DIRECT write the wrong way":

https://lore.kernel.org/linux-btrfs/cf8a733f-2c9d-7ffe-e865-4c13d99dfb60@libero.it/

with that i can (like above with mdraid) corrupt the btrfs raid on the host from inside a VM, as an ordinary non-root user:

$ ./a.out test.dat
main: data = 0x72a57eee6000
write_thread pid = 12488
read_thread pid = 12489
update_thread pid = 12490
read_thread: data = 0x72a57eee2000
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error

pve-host:

# btrfs device stats -c /btrfs/
[/dev/sdf1].write_io_errs    0
[/dev/sdf1].read_io_errs     0
[/dev/sdf1].flush_io_errs    0
[/dev/sdf1].corruption_errs  2340
[/dev/sdf1].generation_errs  0
[/dev/sdh1].write_io_errs    0
[/dev/sdh1].read_io_errs     0
[/dev/sdh1].flush_io_errs    0
[/dev/sdh1].corruption_errs  2343
[/dev/sdh1].generation_errs  0

what's a little bit more problematic for me this time is that proxmox is adding btrfs support to their proxmox virtual environment product (experimental status), but they take hidden/invisible/intransparent countermeasures to avoid O_DIRECT on btrfs (i.e. cache=none):

https://forum.proxmox.com/threads/important-information-on-btrfs-getting-lost-in-wiki-wrong-vm-disk-defaults-with-btrfs-storage.143413/
https://forum.proxmox.com/threads/virtual-disk-default-no-cache-settings-weirdness.143430/
https://bugzilla.proxmox.com/show_bug.cgi?id=5320

i just found out by chance that this quirk to avoid O_DIRECT only seems to apply when you freshly start a VM on btrfs storage - but not if you live migrate a running VM from another storage to btrfs storage. so the quirk is incomplete, and any malicious user in a vm can still break the mirror and introduce inconsistency into the host's raid.

the problem with O_DIRECT is at least documented in the proxmox wiki - but with the same argument (problems with O_DIRECT), the proxmox team doesn't like to support mdraid below virtual machines - which is a little bit inconsistent, imho, especially considering that mdraid is technology which has existed since at least 1997, i.e. mdraid is at least 10 years older.

at least it's good to know that there are quirks applied (as btrfs in proxmox is its own storage class, where proxmox creates btrfs subvolumes/images for each virtual disk), i.e. the hypervisor "knows" that you configured a VM with a virtual disk on btrfs. for mdraid, you would need to use storage class/type "dir" on top of any filesystem/volume manager - for which proper detection and adding a quirk would be difficult.

anyhow - the whole O_DIRECT stuff looks like one big mess to me. imho, for every storage technology/driver supporting it, we would perhaps be better off and more secure to have it disabled by default at the kernel/driver level, needing to be forcefully enabled as a boot-time or module param - for those people "who know what they are doing and why they need that (including all the issues)".
> imho, for every storage technology/driver supporting it, we would perhaps be
> better off and more secure to have it disabled by default at the kernel/driver
> level, needing to be forcefully enabled as a boot-time or module param - for
> those people "who know what they are doing and why they need that (including
> all the issues)".

to be clear: i don't mean a global switch here, but a switch for every filesystem driver / volume manager where unresolved O_DIRECT issues exist.