Bug 42800 - (mdadm) md raid 1 volume with ram disk and loop/physical device fails during fsyncs
Summary: (mdadm) md raid 1 volume with ram disk and loop/physical device fails during fsyncs
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: All Linux
Importance: P1 normal
Assignee: io_md
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-02-20 19:26 UTC by Petros Koutoupis
Modified: 2012-08-13 00:18 UTC
1 user

See Also:
Kernel Version: 3.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Petros Koutoupis 2012-02-20 19:26:04 UTC
I have observed this on kernels 3.0 and 2.6.35: when I create an md RAID 1 mirror array with a Linux RAM disk (i.e. /dev/ram0) and a local loop or physical device, the RAM disk fails every time an fsync is sent to the md volume. You can easily reproduce this with the following sequence:

# dd if=/dev/zero of=/dev/ram0
# dd in=/dev/ram0 od=/mnt/ram0.dat
# losetup /dev/loop0 /mnt/ram0.dat
# mdadm --create /dev/md1 --level 1 --raid-devices=2 /dev/ram0 /dev/loop0
# mke2fs -F /dev/md1

There was a time when this worked; the last known working kernel on which I used to do this was 2.6.18 (Red Hat/CentOS 5.x).

Note that this was seen only with RAID 1. I was not able to reproduce it with RAID 0 or RAID 5 (1x RAM disk with 2x loop devices).

The only kernel messages printed to the log are the following (note that the mkfs is issued after the resync is done):
---------------------------
Feb 20 13:24:42 vbox-fedora14 kernel: [18807.967835] md: md1: resync done.
Feb 20 13:25:03 vbox-fedora14 kernel: [18828.930879] md/raid1:md1: Disk failure on ram0, disabling device.
Feb 20 13:25:03 vbox-fedora14 kernel: [18828.930879] <1>md/raid1:md1: Operation continuing on 1 devices.
---------------------------
Comment 1 Petros Koutoupis 2012-08-02 20:41:06 UTC
Oops. A typo was just brought to my attention. The second command reads:
# dd in=/dev/ram0 od=/mnt/ram0.dat

And should read:
# dd if=/dev/ram0 of=/mnt/ram0.dat
Comment 2 Petros Koutoupis 2012-08-10 15:03:50 UTC
The core problem occurs when a RAID 1 configured MD array sends a 0-byte I/O to the RAM disk. It can be seen in drivers/block/brd.c, in the function brd_make_request():

---------------------
int err = -EIO;
[...]
bio_for_each_segment(bvec, bio, i) {
    unsigned int len = bvec->bv_len;
    err = brd_do_bvec(brd, bvec->bv_page, len,
                      bvec->bv_offset, rw, sector);
    if (err)
        break;
    sector += len >> SECTOR_SHIFT;
}
[...]
bio_endio(bio, err);
---------------------
err is initialized to -EIO, and when an I/O transfer of 0 bytes is sent, the bio_for_each_segment() loop body never executes, so the bio falls through with err still set to -EIO. Here is a GDB dump taken right after the array has failed, showing the last bio:

---------------------
((bio)->bi_idx < (bio)->bi_vcnt) == (0xcd5d < 0x5000) == Empty loop!!!!

(gdb) p /x *bio
$3 = {
   bi_sector = 0x2,
   bi_next = 0x0,
   bi_bdev = 0x0,
   bi_flags = 0x0,
   bi_rw = 0x10,
   bi_vcnt = 0x5000,
   bi_idx = 0xcd5d,
   bi_phys_segments = 0xccf8d240,
   bi_size = 0x0,
   bi_seg_front_size = 0x0,
   bi_seg_back_size = 0x0,
   bi_max_vecs = 0x0,
   bi_comp_cpu = 0xccf8d7c0,
   bi_cnt = {
     counter = 0x0
   },
   bi_io_vec = 0x0,
   bi_end_io = 0x0,
   bi_private = 0xccdc3c40,
   bi_fs_private = 0x0,
   bi_destructor = 0x0,
   bi_inline_vecs = 0xcb965d4c
}
---------------------

The question is: is it by design that MD sends zero-byte commands to the underlying block device? If so, then this may need to be addressed in brd.c and not in MD.
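
If zero-byte (flush-only) bios can legitimately reach brd, one conceivable way to address it in brd.c would be to complete them successfully before the segment loop. The following is only an illustrative sketch against the 3.0-era brd_make_request() quoted above, not a reviewed or accepted fix:

---------------------
/* Hypothetical guard (illustration only), placed before the
 * bio_for_each_segment() loop in the 3.0-era brd_make_request() quoted
 * above: complete an empty bio -- e.g. a stand-alone flush -- successfully
 * instead of letting it fall through with the initial err == -EIO. */
if (unlikely(bio->bi_size == 0)) {
	bio_endio(bio, 0);	/* report success for the zero-length request */
	return 0;		/* the 3.0-era make_request_fn returns int */
}
---------------------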
Comment 3 Neil Brown 2012-08-13 00:18:00 UTC
I just tried your recipe on the current mainline kernel and it works smoothly - no failure.

The bio that you have displayed above looks to be corrupted:
 - bi_end_io is NULL, so the bio_endio() call will crash
 - bi_idx == 0xcd5d.  It should nearly always be '0'. Occasionally 1 or 2.
 - bi_vcnt == 0x5000 - it should never exceed 256 (BIO_MAX_PAGES).
 - bi_bdev == NULL - this is not possible.  It *must* point to the brd device
           or else brd_make_request could not be called.

so I don't really trust it.

The only time that md should send a zero-length request down is when REQ_FLUSH is set.  In this case generic_make_request should notice that q->flush_flags is zero and so will complete the request early without passing it down to brd.
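
For reference, the filtering described above corresponds roughly to the following check in 3.0-era block/blk-core.c; this is a paraphrased sketch, not an exact quote of generic_make_request():

---------------------
/* Paraphrased sketch of the 3.0-era flush filtering in
 * generic_make_request(): a REQ_FLUSH/REQ_FUA bio aimed at a queue that
 * advertises no flush support has those flags stripped, and if nothing
 * else remains (a zero-length flush) it is completed early with success
 * instead of being passed down to the driver. */
if ((bio->bi_rw & (REQ_FLUSH | REQ_FUA)) && !q->flush_flags) {
	bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
	if (!nr_sectors) {
		err = 0;
		goto end_io;	/* complete the bio without calling brd */
	}
}
---------------------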

So something is clearly wrong but I cannot see what.
I would suggest modifying the code in brd_make_request to print out bi_flags
whenever bi_size is zero.  Maybe add a WARN() too so we can see the stack trace.
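
A minimal sketch of that instrumentation, assuming the 3.0-era brd_make_request() quoted in comment 2 (format string and exact placement are illustrative):

---------------------
/* Hypothetical debug instrumentation along the lines suggested above:
 * whenever a zero-length bio reaches brd, print its flags and dump a
 * stack trace so we can see who submitted it. */
if (bio->bi_size == 0)
	WARN(1, "brd: zero-length bio: bi_rw=0x%lx bi_flags=0x%lx\n",
	     bio->bi_rw, bio->bi_flags);
---------------------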
