I have observed this on kernels 3.0 and 2.6.35: when I create an md RAID 1 mirror array from a Linux RAM disk (i.e. /dev/ram0) and a local loop or physical device, every time an fsync is sent to the md volume, the RAM disk fails. You can easily reproduce this with the following sequence:

# dd if=/dev/zero of=/dev/ram0
# dd in=/dev/ram0 od=/mnt/ram0.dat
# losetup /dev/loop0 /mnt/ram0.dat
# mdadm --create /dev/md1 --level 1 --raid-devices=2 /dev/ram0 /dev/loop0
# mke2fs -F /dev/md1

There was a time when this worked; the last known working kernel I used to do this with was 2.6.18 (Red Hat / CentOS 5.x). Note that this was seen only with RAID 1. I was not able to reproduce it with RAID 0 or RAID 5 (1x RAM disk with 2x loop devices).

The only kernel messages printed to the log are (note that mkfs is run after the resync is done):

---------------------------
Feb 20 13:24:42 vbox-fedora14 kernel: [18807.967835] md: md1: resync done.
Feb 20 13:25:03 vbox-fedora14 kernel: [18828.930879] md/raid1:md1: Disk failure on ram0, disabling device.
Feb 20 13:25:03 vbox-fedora14 kernel: [18828.930879] <1>md/raid1:md1: Operation continuing on 1 devices.
---------------------------
Oops, a typo was just brought to my attention. The second command reads:

# dd in=/dev/ram0 od=/mnt/ram0.dat

and should read:

# dd if=/dev/ram0 of=/mnt/ram0.dat
The core problem occurs when a RAID 1 configured md array sends a 0-byte I/O to the RAM disk. It can be seen in drivers/block/brd.c, in brd_make_request():

---------------------
	int err = -EIO;
	[...]
	bio_for_each_segment(bvec, bio, i) {
		unsigned int len = bvec->bv_len;
		err = brd_do_bvec(brd, bvec->bv_page, len,
				  bvec->bv_offset, rw, sector);
		if (err)
			break;
		sector += len >> SECTOR_SHIFT;
	}
	[...]
	bio_endio(bio, err);
---------------------

err is initialized to -EIO, and when an I/O transfer of 0 bytes is sent, it falls through bio_for_each_segment() with err still set to -EIO.

Here is a GDB dump taken right after the array has failed, on the last bio:

---------------------
((bio)->bi_idx < (bio)->bi_vcnt) == (0xcd5d < 0x5000) == Empty loop!!!!

(gdb) p /x *bio
$3 = {
  bi_sector = 0x2,
  bi_next = 0x0,
  bi_bdev = 0x0,
  bi_flags = 0x0,
  bi_rw = 0x10,
  bi_vcnt = 0x5000,
  bi_idx = 0xcd5d,
  bi_phys_segments = 0xccf8d240,
  bi_size = 0x0,
  bi_seg_front_size = 0x0,
  bi_seg_back_size = 0x0,
  bi_max_vecs = 0x0,
  bi_comp_cpu = 0xccf8d7c0,
  bi_cnt = {
    counter = 0x0
  },
  bi_io_vec = 0x0,
  bi_end_io = 0x0,
  bi_private = 0xccdc3c40,
  bi_fs_private = 0x0,
  bi_destructor = 0x0,
  bi_inline_vecs = 0xcb965d4c
}
---------------------

The question is: is it by design that md sends 0-byte commands to the underlying block device? If so, then this may need to be addressed in brd.c and not in md.
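If it turns out that md is indeed allowed to send zero-length bios here, one possible fix would be on the brd side. The following is only an untested sketch against the 2.6.35/3.0-era brd_make_request() code quoted above, treating an empty transfer as a successful no-op instead of letting it fall through with the -EIO default:

---------------------
	int err = -EIO;
	[...]
	/* Sketch: an empty transfer has nothing to copy, so report
	 * success instead of leaving err at its -EIO default. */
	if (bio->bi_size == 0)
		err = 0;

	bio_for_each_segment(bvec, bio, i) {
		unsigned int len = bvec->bv_len;
		err = brd_do_bvec(brd, bvec->bv_page, len,
				  bvec->bv_offset, rw, sector);
		if (err)
			break;
		sector += len >> SECTOR_SHIFT;
	}
	[...]
	bio_endio(bio, err);
---------------------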
I just tried your recipe on the current mainline kernel and it works smoothly - no failure.

The bio that you have displayed above looks to be corrupted:
- bi_end_io is NULL, so the bio_endio() call will crash.
- bi_idx == 0xcd5d. It should nearly always be 0, occasionally 1 or 2.
- bi_vcnt == 0x5000 - it should never exceed 256 (BIO_MAX_PAGES).
- bi_bdev == NULL - this is not possible. It *must* point to the brd device, or else brd_make_request could not have been called.

So I don't really trust it.

The only time that md should send a zero-length request down is when REQ_FLUSH is set. In that case generic_make_request should notice that q->flush_flags is zero and so will complete the request early without passing it down to brd.

So something is clearly wrong, but I cannot see what.

I would suggest modifying the code in brd_make_request to print out bi_flags whenever bi_size is zero. Maybe add a WARN() too so we can see the stack trace.
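Something along these lines near the top of brd_make_request() should do it (an untested sketch against the same 2.6.35/3.0-era code; adjust the message to taste):

---------------------
	/* Debug instrumentation (sketch): flag any zero-length bio that
	 * reaches brd, dump its rw/flags bits, and print a stack trace. */
	if (bio->bi_size == 0)
		WARN(1, "brd: zero-length bio: bi_rw=%lx bi_flags=%lx\n",
		     bio->bi_rw, bio->bi_flags);
---------------------

That way the log should show whether the offending bio is the REQ_FLUSH one and which code path submitted it.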