Bug 99171
Summary: MD RAID or DRBD can be broken from userspace when using O_DIRECT
Product: IO/Storage
Component: Block Layer
Status: NEW
Severity: high
Priority: P1
Hardware: All
OS: Linux
Kernel Version: any
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Stanislav German-Evtushenko (ginermail)
Assignee: Jens Axboe (axboe)
CC: bug-kernel-20190616, c.burkhardt, devzero, ginermail, john, melroy, philip, sascha_lucas, szg00000, wry+bzkernel
Attachments: drbd_oos_test.c, drbd copy of write bio

I'm not convinced this is a meaningful testcase. Any userspace application that modifies a data buffer in one thread while another thread is writing that buffer to disk is certain not to get predictable data back when reading it later. Whether this situation results in a mismatch among raid mirrors is not terribly meaningful.

This is not at all about the contents of the data. It is expected that garbage is written to the disks, but each disk making up the raid will contain different garbage, which means the disks are out of sync, in other words the raid is "broken". This in turn means that user space can "break" the raid. The problem is that with O_DIRECT the user space pointer is passed to the block drivers for the underlying layers making up the raid, and they all read from it independently. Any user who can run a program that uses O_DIRECT on a file on a raid can break the raid.

It is expected that garbage is written to the disk, but the whole point of a raid is that each disk should contain the *same* garbage. Keep the garbage consistent... or something.

If any data, garbage or otherwise, is written to the RAID, should not the array be consistent afterwards? Any action by a userspace program (short of bypassing the RAID and directly writing to the constituent block devices) that results in the array becoming out of sync sounds like a bug to me.

It'd be nice for it to be consistent, but giving up the performance of zero-copy operations to avoid what can only be garbage doesn't seem like a great tradeoff to me. And it is long-known behaviour thanks to direct access by the kernel on mirrored swap devices.

Since this comes up every once in a while I thought I'd also share a "legitimate" case where this can happen. Legitimate in the sense that the data being written is legitimately also being modified (keep reading), and _somewhat_ common because the setup _seems_ to make sense (initially): Take a virtual machine, give it a disk, put the image on a software raid and tell qemu to disable caching (in other words, use O_DIRECT, because the guest already does caching anyway). Run Linux in the VM, add part of the/a disk on the raid as swap, and cause the guest to start swapping a lot.

What *seems* to be happening is this: the kernel decides to swap out part of some memory. At the same time the process it belongs to exits and the kernel marks the pages as unused while the swap write is still in flight. The kernel now knows that this area is unused and thus there is no reason to ever re-read it from the swap device. Someone else needs memory, and the kernel gives them the affected pages. The swap write is still in flight. The new process starts using the memory; at this point we really don't care what kind of garbage data ends up being written to the disk, simply because we won't ever need it. The swap writes finish. Now the raid is degraded.

The lesson: if you use software raid you kinda need to know the possible pitfalls you can run into...

Created attachment 274945 [details]
drbd copy of write bio
We can confirm the testcase and the problem, so we developed a solution.
With this patch we solved the problem without a performance impact.
Feel free to help work on a solution that avoids data corruption.
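
The patch itself is not quoted in this thread; its title only suggests that the write bio is copied before it goes to the mirror legs. As a rough illustration of why a single copy would help, here is a toy userspace model, not DRBD or MD code. Everything in it (struct leg, mirror_write_zero_copy, mirror_write_with_copy, scribble) is a hypothetical name made up for this sketch, and the "concurrent" modification is modelled as a callback that runs between the two leg writes.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 8                      /* toy block size, not a real sector size */

/* Hypothetical stand-in for one mirror leg: just a block of bytes. */
struct leg {
    unsigned char data[BLOCK];
};

/* Models a userspace thread changing the buffer while the write is in flight. */
typedef void (*mutate_fn)(unsigned char *buf);

/* Zero-copy style: each leg reads the caller's live buffer on its own. */
void mirror_write_zero_copy(struct leg *a, struct leg *b,
                            unsigned char *buf, mutate_fn mutate)
{
    memcpy(a->data, buf, BLOCK);     /* leg A sees the buffer now */
    if (mutate)
        mutate(buf);                 /* buffer changes mid-write */
    memcpy(b->data, buf, BLOCK);     /* leg B sees the changed buffer */
}

/* Copy style: snapshot the buffer once, then feed both legs from the snapshot. */
void mirror_write_with_copy(struct leg *a, struct leg *b,
                            unsigned char *buf, mutate_fn mutate)
{
    unsigned char snap[BLOCK];

    memcpy(snap, buf, BLOCK);        /* the one extra copy */
    memcpy(a->data, snap, BLOCK);
    if (mutate)
        mutate(buf);                 /* too late to affect either leg */
    memcpy(b->data, snap, BLOCK);
}

/* Example mutation: overwrite the whole buffer. */
void scribble(unsigned char *buf)
{
    memset(buf, 0xFF, BLOCK);
}

int legs_match(const struct leg *a, const struct leg *b)
{
    return memcmp(a->data, b->data, BLOCK) == 0;
}

/* Runs both variants against the same mutation and checks the outcome. */
void demo(void)
{
    struct leg a, b;
    unsigned char buf[BLOCK];

    memset(buf, 0x11, BLOCK);
    mirror_write_zero_copy(&a, &b, buf, scribble);
    assert(!legs_match(&a, &b));     /* legs hold different data */

    memset(buf, 0x11, BLOCK);
    mirror_write_with_copy(&a, &b, buf, scribble);
    assert(legs_match(&a, &b));      /* legs hold the same snapshot */
}
```

In this model the copy variant still writes whatever happened to be in the buffer at snapshot time, so the data may be "garbage" from the application's point of view, but both legs hold the same garbage. The model says nothing about the cost of the extra copy in a real block driver.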
Is this patch already delivered upstream or anything!? Or is this still not solved by default?

hello, i'd also be interested what's the status of this bug!?

i'm really curious why this exists for so long and getting so little notice

i bet there are a LOT of people out there using virtual machines on top of mdraid

and if this is broken, this should be either fixed or at least this should be known more widely

also see: https://bugzilla.kernel.org/show_bug.cgi?id=99171

(In reply to Roland Kletzing from comment #8)
> hello, i'd also be interested what's the status of this bug!?
>
> i'm really curious why this exists for so long and getting so little notice

I assume the behavior described here is not considered a bug. In the case of DRBD the problem is mentioned in man drbd.conf [1], where it is called "false positives" and "not necessarily poses a problem for the integrity of the data" ... whatever this could mean.

[1] https://github.com/LINBIT/drbd-utils/blob/0870121c730ea1ebde511380ab9d06b045cca75b/documentation/v84/drbd.conf.xml#L2061-L2082

there is no mention of O_DIRECT on that page

anyhow:

https://lkml.org/lkml/2007/1/10/235
"So O_DIRECT not only is a total disaster from a design standpoint (just look at all the crap it results in)"

https://lkml.org/lkml/2007/1/11/121
"Yes. O_DIRECT is really fundamentally broken. There's just no way to fix it sanely. Except by teaching people not to use it, and making the normal paths fast enough "

mhhh, apparently this still seems to be true!?

Created attachment 178311 [details]
drbd_oos_test.c

Hello,

MD RAID, DRBD and maybe other software raid-like block devices can become inconsistent (silently) if a program in userspace is doing something wrong.

*** How to reproduce ***

1. Prepare

    gcc -pthread drbd_oos_test.c
    dd if=/dev/zero of=/tmp/mdadm1 bs=1M count=100
    dd if=/dev/zero of=/tmp/mdadm2 bs=1M count=100
    losetup /dev/loop1 /tmp/mdadm1
    losetup /dev/loop2 /tmp/mdadm2
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop{1,2}

2. Write data with O_DIRECT

    ./a.out /dev/md0

3. Check consistency with vbindiff

    vbindiff /tmp/mdadm{1,2} #press enter multiple times to skip metadata

*** Variant: EXT3 or EXT4 on top of md0 ***

Step 2 can be extended by creating a file system:

    mkfs.ext3 /dev/md0
    mkdir /tmp/ext3
    mount /dev/md0 /tmp/ext3
    ./a.out /tmp/ext3/testfile1
    vbindiff /tmp/mdadm{1,2} #press enter multiple times to skip metadata

In both cases data on /tmp/mdadm1 and /tmp/mdadm2 will differ. We get the same result when we use DRBD instead of MD RAID.

Best regards,
Stanislav
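
The attached drbd_oos_test.c is not reproduced in this thread. Purely as an assumption about what such a test might look like, the sketch below writes a buffer with pwrite() while a second thread keeps overwriting that same buffer. The names scribble_loop, racy_write and open_direct, and the 4096-byte block size, are made up for this sketch and are not taken from the attachment.

```c
#define _GNU_SOURCE                  /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096                /* assumed to satisfy O_DIRECT alignment rules */

struct scribbler {
    unsigned char *buf;              /* buffer shared with the writing thread */
    atomic_int stop;                 /* set to 1 to end the loop */
};

/* Keeps overwriting the shared buffer with a changing byte until told to stop. */
void *scribble_loop(void *arg)
{
    struct scribbler *s = arg;
    unsigned char fill = 0;

    while (!atomic_load(&s->stop))
        memset(s->buf, fill++, BUF_SIZE);
    return NULL;
}

/*
 * Writes `count` blocks to fd while a second thread modifies the same
 * buffer. Returns 0 on success, -1 on any error.
 */
int racy_write(int fd, int count)
{
    struct scribbler s;
    pthread_t tid;
    void *mem;
    int rc = 0;

    assert(count >= 0);
    if (posix_memalign(&mem, BUF_SIZE, BUF_SIZE) != 0)
        return -1;
    s.buf = mem;
    memset(s.buf, 0, BUF_SIZE);
    atomic_init(&s.stop, 0);

    if (pthread_create(&tid, NULL, scribble_loop, &s) != 0) {
        free(mem);
        return -1;
    }

    /* The buffer is deliberately not protected against the other thread. */
    for (int i = 0; i < count && rc == 0; i++)
        if (pwrite(fd, s.buf, BUF_SIZE, (off_t)i * BUF_SIZE) != BUF_SIZE)
            rc = -1;

    atomic_store(&s.stop, 1);
    pthread_join(tid, NULL);
    free(mem);
    return rc;
}

/* Opens the target with O_DIRECT, as described in the report. */
int open_direct(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
}
```

A main() for this sketch would call open_direct(argv[1]) and pass the descriptor to racy_write(), then compare the mirror legs as in step 3. Whether a given run actually produces a mismatch depends on timing, and O_DIRECT can fail with EINVAL on file systems that do not support it.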