Bug 194551

Summary: RAID10 - writemostly FEATURE REQUEST
Product: IO/Storage Reporter: Reindl Harald (harry)
Component: MD Assignee: io_md
Status: REOPENED ---    
Severity: normal CC: neilb
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.9.9 Subsystem:
Regression: No Bisected commit-id:

Description Reindl Harald 2017-02-11 18:22:51 UTC
/dev/sda: Samsung SSD 850 EVO 2TB
/dev/sdb: Samsung SSD 850 EVO 2TB
/dev/sdc: ST2000DX002-2DV164
/dev/sdd: ST2000DX002-2DV164

* each stripe is on one of the SSDs, with a mirror on an HDD
* both HDDs are added to the array as "writemostly"

Normally I would expect that under heavy read IO the two HDDs would stay idle and the two SSDs would work more or less like a RAID0. But when, for example, running a btrfs scrub inside a virtual machine on a 1.5 TB vdisk, you can clearly hear the rotating disks, and there are also unexpected lags. Disk IO of around 700 MB/sec shows that it is basically working, but in some cases the performance when opening applications is as bad as it was before with 4x2 TB HDDs.

I am pretty sure the lags come from read access also being spread to the HDDs. Why is that, instead of only handing writes over to them, which are expected to stay as slow as they would be without any SSD in the array?

/dev/md2:
        Version : 1.1
  Creation Time : Wed Jun  8 13:10:56 2011
     Raid Level : raid10
     Array Size : 3875222528 (3695.70 GiB 3968.23 GB)
  Used Dev Size : 1937611264 (1847.85 GiB 1984.11 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat Feb 11 15:10:31 2017
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : localhost.localdomain:2  (local to host localhost.localdomain)
           UUID : ea253255:cb915401:f32794ad:ce0fe396
         Events : 1818029

    Number   Major   Minor   RaidDevice State
       4       8       35        0      active sync set-A writemostly /dev/sdc3
       5       8       19        1      active sync set-B   /dev/sdb3
       7       8       51        2      active sync set-A writemostly /dev/sdd3
       6       8        3        3      active sync set-B   /dev/sda3
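
To make the layout described above easier to follow, here is a minimal Python sketch of how a "near=2" RAID10 over four devices places the two copies of each chunk. It assumes the usual description of the md near layout (copies of a chunk sit on consecutive raid devices); the function names are hypothetical and this is a simplified model, not kernel code. The device table is copied from the mdadm output above.

```python
# Simplified model of the md RAID10 "near" layout (not kernel code).
# RaidDevice number -> (device, write-mostly flag), from "mdadm --detail".
DEVICES = {
    0: ("/dev/sdc3", True),   # HDD, writemostly
    1: ("/dev/sdb3", False),  # SSD
    2: ("/dev/sdd3", True),   # HDD, writemostly
    3: ("/dev/sda3", False),  # SSD
}
NEAR_COPIES = 2


def copies_of_chunk(chunk, raid_devices=len(DEVICES), near=NEAR_COPIES):
    """Return the RaidDevice numbers that hold the copies of a logical chunk."""
    first = chunk * near
    return [(first + j) % raid_devices for j in range(near)]


def describe(chunk):
    """Return (device, write_mostly) for every copy of the chunk."""
    return [DEVICES[slot] for slot in copies_of_chunk(chunk)]


if __name__ == "__main__":
    for chunk in range(4):
        print(chunk, describe(chunk))
```

Under this model every chunk has exactly one copy on an SSD and one on a write-mostly HDD, which is why a read policy that honoured the flag could serve all reads from the SSDs.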
Comment 1 Neil Brown 2017-02-12 22:24:54 UTC
Quoting from the mdadm man page, the "--write-mostly" section:

"This is valid for RAID1 only and means that...."

As RAID10 is not RAID1, it is expected that writemostly doesn't work.
Comment 2 Reindl Harald 2017-02-12 23:02:22 UTC
It does work, but not as effectively as it should; otherwise the performance difference between these identical setups (cloned by moving 2 disks from one machine to the other and rebuilding the RAIDs with 2 new disks in both) would not be possible.

[root@srv-rhsoft:~]$ hdparm -Tt /dev/md2
/dev/md2:
 Timing cached reads: 23974 MB in 1.99 seconds = 12038.71 MB/sec
 Timing buffered disk reads: 2400 MB in 3.00 seconds = 798.92 MB/sec

[root@rh:~]$ hdparm -Tt /dev/md2
/dev/md2:
 Timing cached reads: 22224 MB in 1.99 seconds = 11157.92 MB/sec
 Timing buffered disk reads: 1084 MB in 3.00 seconds = 361.10 MB/sec 

RAID10 is more or less RAID0 + RAID1
Comment 3 Neil Brown 2017-02-13 03:43:03 UTC
> it does work but not as effective as it should

It does not work *at*all*.  The raid10 code does not test the write-mostly flag at all.  There must be some other explanation for the performance difference.

> RAID10 is more or less RAID0 + RAID1

In an abstract sense, this is true.
However, the code in the md/raid10 module is quite different from the code in the md/raid1 and md/raid0 modules.
md/raid1 supports write-mostly and write-behind, which md/raid10 code does not.
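
As a rough illustration of what honouring write-mostly means for reads, the sketch below picks a device for a read and only falls back to a write-mostly device when no other in-sync mirror is available. It follows the man page wording ("avoid reading from these devices if at all possible"). The `Mirror` class and `pick_read_device` function are hypothetical names, and the real md/raid1 read balancing considers more factors than this.

```python
from dataclasses import dataclass


@dataclass
class Mirror:
    name: str
    in_sync: bool = True
    write_mostly: bool = False


def pick_read_device(mirrors):
    """Pick a mirror for a read, avoiding write-mostly devices if possible."""
    usable = [m for m in mirrors if m.in_sync]
    if not usable:
        return None  # no device can serve the read
    preferred = [m for m in usable if not m.write_mostly]
    # Fall back to a write-mostly device only when nothing else is usable.
    return (preferred or usable)[0]


if __name__ == "__main__":
    pair = [Mirror("/dev/sdc3", write_mostly=True), Mirror("/dev/sdb3")]
    print(pick_read_device(pair).name)
```

A RAID10 implementation that does not test the flag would simply skip the `preferred` filter, so reads could land on either device of a pair.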
Comment 4 Reindl Harald 2017-02-13 12:39:34 UTC
Well, then please change it to a feature request.

RAID10 with "writemostly" makes a lot of sense for large storage, to make it fast *and* reliable without making it extremely expensive.

* you don't want RAID5/RAID6 rebuild over many TB
* very large SSDs for RAID1 are much more expensive than smaller ones

So with 4x2 TB disks you get 4 TB of usable storage, and with "writemostly" (which in the best case would be "writeonly") you have a lightning-fast RAID0 with good redundancy; most workloads are read-intensive with few writes.

Another benefit: different technologies. It is very unlikely that both disks of a stripe fail at the same time or during a rebuild when one half is an SSD and the other an HDD.
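
The capacity figure quoted above (4x2 TB giving 4 TB) can be checked against the mdadm output from the description with a short calculation. This is only an arithmetic sketch; `raid10_usable` is a hypothetical helper that assumes every block is stored `copies` times and that all devices contribute the same size.

```python
def raid10_usable(devices, per_device, copies=2):
    """Usable capacity when every block is stored `copies` times."""
    return devices * per_device // copies


if __name__ == "__main__":
    # "Used Dev Size" and "Raid Devices" from the mdadm --detail output (KiB).
    print(raid10_usable(4, 1937611264))
    # Nominal figures from the comment: four 2 TB disks.
    print(raid10_usable(4, 2))
```

With the reported "Used Dev Size" of 1937611264 KiB per device, the result equals the reported "Array Size" of 3875222528 KiB.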
Comment 5 Neil Brown 2017-02-14 02:49:38 UTC
A feature request makes some sense.

However, a feature request without code doesn't get a very high priority.
I suggest your best bet would be to send email to linux-raid@vger.kernel.org, telling the list that you would really like write-behind for RAID10, and why using RAID0 over write-behind-raid1 doesn't provide a sufficient solution.
Maybe others will agree.  Maybe someone will get enthusiastic.
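
For readers wondering what "RAID0 over write-behind RAID1" would look like in practice, here is a hedged Python sketch that only builds and prints the mdadm command lines; it does not run them. The array names (`/dev/md/pair0`, `/dev/md/pair1`, `/dev/md/stripe`) are hypothetical, the pairing of partitions is an assumption based on the layout in the description, and the write-behind value is purely illustrative. Creating arrays on devices that are in use destroys their contents, so this is a sketch of the idea rather than a migration procedure.

```python
def raid1_create_cmd(md_name, fast_dev, slow_dev, write_behind=None):
    """Build an mdadm command for a RAID1 pair with a write-mostly member."""
    cmd = ["mdadm", "--create", md_name, "--level=1", "--raid-devices=2"]
    if write_behind is not None:
        # Write-behind requires a write-intent bitmap.
        cmd += ["--bitmap=internal", f"--write-behind={write_behind}"]
    # --write-mostly flags the devices listed after it.
    cmd += [fast_dev, "--write-mostly", slow_dev]
    return cmd


def raid0_create_cmd(md_name, members):
    """Build an mdadm command for a RAID0 stripe over the given members."""
    return ["mdadm", "--create", md_name, "--level=0",
            f"--raid-devices={len(members)}", *members]


def plan():
    # Hypothetical pairing: one SSD partition plus one HDD partition each.
    pairs = [
        ("/dev/md/pair0", "/dev/sdb3", "/dev/sdc3"),
        ("/dev/md/pair1", "/dev/sda3", "/dev/sdd3"),
    ]
    cmds = [raid1_create_cmd(n, ssd, hdd, write_behind=256)
            for n, ssd, hdd in pairs]
    cmds.append(raid0_create_cmd("/dev/md/stripe", [n for n, _, _ in pairs]))
    return cmds


if __name__ == "__main__":
    for cmd in plan():
        print(" ".join(cmd))
```

Because each pair is an md/raid1 array, the write-mostly flag on the HDD member would be honoured there, while the RAID0 on top provides the striping.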

I'll re-open this bug and label it as a "feature request", but I don't know that doing so will serve much of a useful purpose.