Bug 191881 - Latency on raid10 arrays significantly worse on kernel 4.4 than on kernel 4.1
Summary: Latency on raid10 arrays significantly worse on kernel 4.4 than on kernel 4.1
Status: RESOLVED WILL_NOT_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: Intel Linux
Importance: P1 normal
Assignee: io_md
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-03 22:55 UTC by Ralph
Modified: 2017-10-05 08:38 UTC
CC List: 2 users

See Also:
Kernel Version: 4.4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Comparison of consecutive FIO runs on kernel 4.1 vs. kernel 4.4 (24.94 KB, text/plain) - 2017-01-05 16:42 UTC, Ralph
blktrace data for fio test on kernel 4.1 (3.33 MB, application/zip) - 2017-01-06 13:26 UTC, Ralph
FIO output data for blktrace test run on kernel 4.1 (8.40 KB, text/plain) - 2017-01-06 13:27 UTC, Ralph
blktrace data for fio test on kernel 4.4 (3.93 MB, application/zip) - 2017-01-06 13:30 UTC, Ralph
FIO output data for blktrace test run on kernel 4.4 (18.23 KB, text/plain) - 2017-01-06 13:33 UTC, Ralph
output of hwinfo on kernel 4.4 system (Opensuse Leap 42.2) (1.57 MB, text/plain) - 2017-01-11 15:49 UTC, Ralph
output of lsmod on the kernel 4.4 system (Opensuse Leap 42.2) (7.40 KB, text/plain) - 2017-01-11 15:50 UTC, Ralph

Description Ralph 2017-01-03 22:55:23 UTC
I used fio to test the random write performance of an mdadm Raid 10 array over 4 SSDs (n2 layout, chunk size = 32 KiB). 

I did this on Opensuse 42.1 with kernel 4.1 and Opensuse 42.2 with kernel 4.4. I get very different results for one and the same raid array: 

On kernel 4.1 the latency is significantly lower - resulting in significantly higher performance. During consecutive fio test runs on kernel 4.1 the write performance shows a relatively small spread around 635 MB/sec [610 - 660 MB/sec] and stays well above 600 MB/sec. With a growing block size of the fio random writes (64 KiB, 128 KiB, 256 KiB, ...) the performance grows systematically until approximately twice the rate of a single SSD is reached.

On kernel 4.4, however, I get on average a much lower write rate around 450 MB/sec - with a huge spread between 330 MB/sec and 610 MB/sec over a sequence of runs. Typically, the average latency for write access shows much higher values than on kernel 4.1. The expected increase in performance with the block size of the randomly written data is much smaller and less systematic than on kernel 4.1.

I used the very same raid array in both cases (the tests were done on one and the same machine). The same test parameters and the same I/O engine (sync) were used. Typical example results are shown below. The results do not vary much whether fstrim is issued ahead of a test or not.

I admit that the fio version changed (I shall compile the same version if required); however, the relatively bad latency values for the Raid 10 array on kernel 4.4 also show up in general - e.g. in database tests where single processes write small pieces of data to different tables.

The differences also show up when the total size of written fio test data is increased from 500 MB to 20 GB.

I do not know whether this is a bug or not. But I thought I should report this finding. 

Is it possible that something changed for the worse with mdadm and the raid10 module between kernel 4.1 and kernel 4.4?
   
Please, let me know if and what kind of test data or other information you need.  


-------------
Fio test run:
---------------
[global]
size=500m
directory=/mnt
direct=1
bs=64k,64k,64k
ioengine=sync

[read]
rw=randread

[write]
stonewall
rw=randwrite
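
For reference, a job file like this is run simply with the fio command against the directory given in the [global] section (a minimal sketch; the job file name is just an example):

fio raid10_test.fio

The "stonewall" option in the [write] section makes the random write job start only after the random read job has finished, so the two phases (group 0 and group 1 in the results below) do not overlap.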

---------
Results on kernel 4.1 
***********************
read: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
write: (g=1): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
fio-2.2.10
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [75.0% done] [164.6MB/181.1MB/0KB /s] [2632/2911/0 iops] [eta 00m:01s]
read: (groupid=0, jobs=1): err= 0: pid=4257: Tue Jan  3 20:35:38 2017
  read : io=512000KB, bw=384673KB/s, iops=6010, runt=  1331msec
    clat (usec): min=146, max=3099, avg=165.68, stdev=38.12
     lat (usec): min=146, max=3100, avg=165.71, stdev=38.13
    clat percentiles (usec):
     |  1.00th=[  151],  5.00th=[  155], 10.00th=[  159], 20.00th=[  161],
     | 30.00th=[  163], 40.00th=[  163], 50.00th=[  163], 60.00th=[  165],
     | 70.00th=[  165], 80.00th=[  167], 90.00th=[  171], 95.00th=[  187],
     | 99.00th=[  205], 99.50th=[  217], 99.90th=[  245], 99.95th=[  249],
     | 99.99th=[ 3088]
    bw (KB  /s): min=381312, max=387840, per=99.97%, avg=384576.00, stdev=4615.99
    lat (usec) : 250=99.98%
    lat (msec) : 2=0.01%, 4=0.01%
  cpu          : usr=0.90%, sys=4.51%, ctx=8013, majf=0, minf=24
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=8000/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
write: (groupid=1, jobs=1): err= 0: pid=4258: Tue Jan  3 20:35:38 2017
  write: io=512000KB, bw=639201KB/s, iops=9987, runt=   801msec
    clat (usec): min=92, max=2557, avg=98.19, stdev=28.54
     lat (usec): min=93, max=2558, avg=99.50, stdev=28.54
    clat percentiles (usec):
     |  1.00th=[   92],  5.00th=[   93], 10.00th=[   93], 20.00th=[   95],
     | 30.00th=[   95], 40.00th=[   96], 50.00th=[   97], 60.00th=[   97],
     | 70.00th=[   98], 80.00th=[   98], 90.00th=[   99], 95.00th=[  103],
     | 99.00th=[  131], 99.50th=[  135], 99.90th=[  163], 99.95th=[  165],
     | 99.99th=[ 2544]
    bw (KB  /s): min=633088, max=633088, per=99.04%, avg=633088.00, stdev= 0.00
    lat (usec) : 100=90.04%, 250=9.94%, 500=0.01%
    lat (msec) : 4=0.01%
  cpu          : usr=1.50%, sys=7.00%, ctx=8016, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=8000/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=512000KB, aggrb=384673KB/s, minb=384673KB/s, maxb=384673KB/s, mint=1331msec, maxt=1331msec

Run status group 1 (all jobs):
  WRITE: io=512000KB, aggrb=639200KB/s, minb=639200KB/s, maxb=639200KB/s, mint=801msec, maxt=801msec

Disk stats (read/write):
    dm-4: ios=16000/15868, merge=0/0, ticks=2408/1280, in_queue=3692, util=75.70%, aggrios=16000/16000, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    md125: ios=16000/16000, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4000/8001, aggrmerge=0/0, aggrticks=603/622, aggrin_queue=1219, aggrutil=72.71%
  sda: ios=8000/8001, merge=0/0, ticks=1200/636, in_queue=1828, util=72.71%
  sdb: ios=0/8001, merge=0/0, ticks=0/620, in_queue=620, util=24.66%
  sdc: ios=8000/8001, merge=0/0, ticks=1212/620, in_queue=1828, util=72.71%
  sdd: ios=0/8001, merge=0/0, ticks=0/612, in_queue=600, util=23.87%



Note the relatively small "lat" values and the small stdev for the write process!


---------------------------
Results on kernel 4.4 
************************
read: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
write: (g=1): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
fio-2.15
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [100.0% done] [0KB/390.6MB/0KB /s] [0/6248/0 iops] [eta 00m:00s]         
read: (groupid=0, jobs=1): err= 0: pid=5127: Tue Jan  3 22:58:53 2017
  read : io=512000KB, bw=375642KB/s, iops=5869, runt=  1363msec
    clat (usec): min=146, max=2716, avg=169.48, stdev=39.99
     lat (usec): min=146, max=2717, avg=169.52, stdev=40.02
    clat percentiles (usec):
     |  1.00th=[  151],  5.00th=[  155], 10.00th=[  159], 20.00th=[  159],
     | 30.00th=[  161], 40.00th=[  161], 50.00th=[  163], 60.00th=[  163],
     | 70.00th=[  165], 80.00th=[  175], 90.00th=[  201], 95.00th=[  213],
     | 99.00th=[  225], 99.50th=[  229], 99.90th=[  282], 99.95th=[  290],
     | 99.99th=[ 2704]
    lat (usec) : 250=99.74%, 500=0.24%
    lat (msec) : 4=0.03%
  cpu          : usr=0.66%, sys=5.80%, ctx=8001, majf=0, minf=22
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=8000/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
write: (groupid=1, jobs=1): err= 0: pid=5128: Tue Jan  3 22:58:53 2017
  write: io=512000KB, bw=387292KB/s, iops=6051, runt=  1322msec
    clat (usec): min=90, max=8898, avg=159.50, stdev=103.85
     lat (usec): min=91, max=8902, avg=162.97, stdev=104.10
    clat percentiles (usec):
     |  1.00th=[   91],  5.00th=[   92], 10.00th=[   96], 20.00th=[  147],
     | 30.00th=[  151], 40.00th=[  153], 50.00th=[  155], 60.00th=[  171],
     | 70.00th=[  175], 80.00th=[  181], 90.00th=[  199], 95.00th=[  205],
     | 99.00th=[  231], 99.50th=[  243], 99.90th=[  286], 99.95th=[  302],
     | 99.99th=[ 8896]
    lat (usec) : 100=11.00%, 250=88.69%, 500=0.27%, 1000=0.01%
    lat (msec) : 2=0.01%, 10=0.01%
  cpu          : usr=4.39%, sys=12.72%, ctx=8002, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=8000/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=512000KB, aggrb=375641KB/s, minb=375641KB/s, maxb=375641KB/s, mint=1363msec, maxt=1363msec

Run status group 1 (all jobs):
  WRITE: io=512000KB, aggrb=387291KB/s, minb=387291KB/s, maxb=387291KB/s, mint=1322msec, maxt=1322msec


Note the increased "lat" values and the larger stdev for the write process!
Comment 1 Ralph 2017-01-04 09:48:40 UTC
I should add that I used the deadline scheduler in both cases. Storage is organized as 

SSDs => Raid Partitions => mdadm Raid 10 Array => 1 LVM2-Group => LVM2-Volume with ext4
Comment 2 Shaohua Li 2017-01-04 21:59:14 UTC
I didn't find significant changes on the raid10 side between 4.1 and 4.4. Can you try a raw raid10 array and check if the write performance changes? I'd like to narrow it down to make sure it's not a problem of LVM or ext4.
Comment 3 Ralph 2017-01-05 14:04:51 UTC
Good suggestion. I assume that you mean that I should do the following things:

1) Prepare a raid 10 array based on unformatted partitions of type 0xFD.
2) Write to the raid array with the help of "dd" or FIO, both under kernel 4.1 and 4.4. But with FIO should I still use the "sync" I/O engine? 

A different and additional step would be to omit the LVM layer and test write performance on a raid array formatted directly with ext4 OR to combine 4 ext4 formatted partitions to a raid array and write to it.   

Please correct me if I misunderstood you. I cannot perform tests right now because I am too busy with other things today and tomorrow, but I shall perform the tests in the next few days.
Comment 4 Ralph 2017-01-05 16:26:12 UTC
(In reply to Shaohua Li from comment #2)
> I didn't find significant changes on the raid10 side between 4.1 and 4.4. Can
> you try a raw raid10 array and check if the write performance changes? I'd
> like to narrow it down to make sure it's not a problem of LVM or ext4.


I meanwhile performed FIO runs on one and the same raw raid array - both on an Opensuse Leap 42.2 (kernel 4.4) installation and on an Opensuse Leap 42.1 (kernel 4.1) installation. The results (see below and the attachment) confirm the findings described previously. 

Conclusion: 
-------------
The sluggishness under Kernel 4.4 does NOT depend on LVM or ext4. Neither does it depend on different FIO versions. 


What to try next? What more information do you need? 


********************************************
Description of what I did and typical results 
 
I did the following up to now: 

1) I provided a 0xFD partition of 40 GiB on each of 4 SSDs  
2) I created a Raid 10 array with
 
mdadm --create --verbose /dev/md/d02 --level=raid10 --bitmap=none --chunk=32 
--layout=n2 --raid-devices=4 /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5

3) After the initial resync, "cat /proc/mdstat" on the kernel 4.4 system shows 
Personalities : [raid10] [raid6] [raid5] [raid4] 
md_d2 : active raid10 sdd5[3] sdc5[2] sdb5[1] sda5[0]
      83826688 blocks super 1.2 32K chunks 2 near-copies [4/4] [UUUU]

4) The very same raid array appears on the kernel 4.1 system as "md126": 
cat /proc/mdstat
Personalities : [raid10] [raid6] [raid5] [raid4] 
     
md126 : active raid10 sdb5[1] sdd5[3] sda5[0] sdc5[2]
      83826688 blocks super 1.2 32K chunks 2 near-copies [4/4] [UUUU]
   
5) I did not format the raid array, nor did I create an LVM group/volume 

6) Then I executed the following fio job "rand_write_mnt" with FIO version 2.15 

[global]
size=500m
direct=1
bs=64k
ioengine=sync

[write]
bs=64k
numjobs=1
rw=randwrite

by the command 
mytux:~/fio # fio rand_write_mnt.fio --filename=/dev/md_d2


7) I did this on Opensuse Leap 42.1 with kernel 4.1 and Opensuse Leap 42.2 with kernel 4.4 - on both sides with FIO version 2.15 (to exclude fio differences)    


Typical result on Opensuse Leap 42.1 / kernel 4.1: 
----------------------------------------------------
mytux:~/fio # fio rand_write_mnt.fio --filename=/dev/md126
write: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
fio-2.15
Starting 1 process

write: (groupid=0, jobs=1): err= 0: pid=26116: Thu Jan  5 16:51:24 2017
  write: io=512000KB, bw=664935KB/s, iops=10389, runt=   770msec
    clat (usec): min=90, max=2551, avg=94.58, stdev=28.25
     lat (usec): min=91, max=2553, avg=95.71, stdev=28.27
    clat percentiles (usec):
     |  1.00th=[   90],  5.00th=[   91], 10.00th=[   91], 20.00th=[   92],
     | 30.00th=[   92], 40.00th=[   92], 50.00th=[   93], 60.00th=[   94],
     | 70.00th=[   94], 80.00th=[   94], 90.00th=[   95], 95.00th=[  103],
     | 99.00th=[  125], 99.50th=[  129], 99.90th=[  159], 99.95th=[  161],
     | 99.99th=[ 2544]
    lat (usec) : 100=94.09%, 250=5.90%
    lat (msec) : 4=0.01%
  cpu          : usr=2.08%, sys=4.68%, ctx=8010, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=8000/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=512000KB, aggrb=664935KB/s, minb=664935KB/s, maxb=664935KB/s, mint=770msec, maxt=770msec

Disk stats (read/write):
    md126: ios=62/13310, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=30/8001, aggrmerge=1/0, aggrticks=3/652, aggrin_queue=651, aggrutil=74.29%
  sda: ios=64/8001, merge=4/0, ticks=0/664, in_queue=656, util=74.29%
  sdb: ios=2/8001, merge=0/0, ticks=0/648, in_queue=648, util=73.39%
  sdc: ios=55/8001, merge=1/0, ticks=8/648, in_queue=648, util=73.39%
  sdd: ios=2/8001, merge=2/0, ticks=4/648, in_queue=652, util=73.84%



Typical result on Opensuse Leap 42.2 with kernel 4.4: 
--------------------------------------------------------
mytux:~/fio # fio rand_write_mnt.fio --filename=/dev/md_d2
write: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
fio-2.15
Starting 1 process
Jobs: 1 (f=1)
write: (groupid=0, jobs=1): err= 0: pid=8886: Thu Jan  5 15:31:26 2017
  write: io=512000KB, bw=391437KB/s, iops=6116, runt=  1308msec
    clat (usec): min=97, max=2629, avg=157.47, stdev=43.69
     lat (usec): min=98, max=2632, avg=161.08, stdev=44.15
    clat percentiles (usec):
     |  1.00th=[   99],  5.00th=[  101], 10.00th=[  108], 20.00th=[  143],
     | 30.00th=[  147], 40.00th=[  149], 50.00th=[  151], 60.00th=[  159],
     | 70.00th=[  171], 80.00th=[  179], 90.00th=[  199], 95.00th=[  209],
     | 99.00th=[  241], 99.50th=[  258], 99.90th=[  434], 99.95th=[  486],
     | 99.99th=[ 2640]
    lat (usec) : 100=3.71%, 250=95.58%, 500=0.66%, 750=0.03%, 1000=0.01%
    lat (msec) : 4=0.01%
  cpu          : usr=2.14%, sys=14.31%, ctx=8001, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=8000/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=512000KB, aggrb=391437KB/s, minb=391437KB/s, maxb=391437KB/s, mint=1308msec, maxt=1308msec

Disk stats (read/write):
    md_d2: ios=20/6852, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=14/8001, aggrmerge=2/0, aggrticks=1/962, aggrin_queue=960, aggrutil=69.58%
  sda: ios=28/8001, merge=6/0, ticks=0/960, in_queue=956, util=67.32%
  sdb: ios=4/8001, merge=0/0, ticks=4/940, in_queue=940, util=66.20%
  sdc: ios=22/8001, merge=4/0, ticks=0/992, in_queue=988, util=69.58%
  sdd: ios=4/8001, merge=0/0, ticks=0/956, in_queue=956, util=67.37%


---------------------------------
In addition I did consecutive runs. As stated before, the spread of the write performance is huge under kernel 4.4, but it is well defined and always above 620 MB/sec under kernel 4.1. See the upcoming attachment. 

-------------------------------------------------------
I included "invalidate=1" in the FIO job. No difference.
Comment 5 Ralph 2017-01-05 16:42:23 UTC
Created attachment 250401 [details]
Comparison of consecutive FIO runs on kernel 4.1 vs. kernel 4.4

See the spread of the Random Write performance on kernel 4.4 between 390 MB/sec and 640 MB/sec, compared to a well-defined, higher RANDOM WRITE performance on kernel 4.1 around 650 MB/sec +/- 10 MB/sec.
Comment 6 Shaohua Li 2017-01-05 20:13:02 UTC
if you run the fio script against a single disk, for example /dev/sda5, does the performance change between 4.1 and 4.4?

If no, please attach the blktrace when you run the fio script against the raid array for both 4.1 and 4.4. That will give us more hints.
Comment 7 Ralph 2017-01-06 13:22:42 UTC
(In reply to Shaohua Li from comment #6)
> if you run the fio script against a single disk, for example /dev/sda5, does
> the performance change between 4.1 and 4.4?
> 
> If no, please attach the blktrace when you run the fio script against the
> raid array for both 4.1 and 4.4. That will give us more hints.

Meanwhile I did a test with blktrace for a single disk partition /dev/sdb6 (40 GiB). I shall provide the output as attachments later on. 

Note that I did not delete the md raid array configurations. So other partitions such as sdb5 were still members of md raid arrays during the FIO tests, and the md, raid10 and raid5 modules were not unloaded.  
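
For reference, a capture of this kind can be produced roughly as follows (a sketch; the exact invocation used here is not shown, and the fio job file name is only an example):

blktrace -d /dev/sdb6 -o sdb6_trace &       # record block-layer events for the device in the background
fio rand_write_sdb6.fio --filename=/dev/sdb6
kill %1                                     # stop blktrace; on exit it prints the per-CPU event summary quoted below
blkparse -i sdb6_trace > sdb6_trace.txt     # merge the per-CPU trace files into a readable log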

Several things are remarkable: 

1) Again the spread of random write performance is bigger for kernel 4.4 than for kernel 4.1. On average the performance was lower on kernel 4.4. 
Result values of some consecutive FIO runs without blktrace are: 

Kernel 4.1 :  
459605KB/s, 462511KB/s, 462093KB/s, 460846KB/s, 462929KB/s, 462511KB/s

Kernel 4.4 :  
425249KB/s, 397207KB/s, 456327KB/s, 429530KB/s, 442906KB/s, 437233KB/s, 386123KB/s, 436487KB/s

This may be an indication that the single-disk access delays are simply larger and are reflected in the raid array behavior (??).   

2) When running blktrace, the per-CPU event distribution summarized at the end of the blktrace run shows a clear concentration on exactly one CPU core on the kernel 4.1 installation. With kernel 4.4 the events are spread much more widely over the CPU cores.   

kernel 4.1 - end of blktrace: 
=== sdb6 ===
  CPU  0:                 7723 events,      363 KiB data
  CPU  1:                 2605 events,      123 KiB data
  CPU  2:                  520 events,       25 KiB data
  CPU  3:               399986 events,    18750 KiB data
  CPU  4:                   26 events,        2 KiB data
  CPU  5:                   17 events,        1 KiB data
  CPU  6:                   17 events,        1 KiB data
  CPU  7:                  100 events,        5 KiB data
  Total:                410994 events (dropped 0),    19266 KiB data

kernel 4.4 - end of blktrace: 
=== sdb6 ===
  CPU  0:               207564 events,     9730 KiB data
  CPU  1:               120663 events,     5657 KiB data
  CPU  2:                80507 events,     3774 KiB data
  CPU  3:                52699 events,     2471 KiB data
  CPU  4:                12912 events,      606 KiB data
  CPU  5:                   33 events,        2 KiB data
  CPU  6:                  724 events,       34 KiB data
  CPU  7:                 2032 events,       96 KiB data
  Total:                477134 events (dropped 0),    22366 KiB data

This can be reproduced, and it is really astonishing. Does this indicate that kernel 4.4 in general distributes the load differently, with more frequent switching between CPU cores? Could this explain the latency seen in the FIO tests?   

I shall provide the output data for the FIO tests on /dev/sdb6 in a minute. 
For each test there will be 2 files: 

1) blktrace_sdb6_kernel4x_fio_output_test.txt : FIO output during test
2) blktrace_sdb6_kernel41_for_fio_test.zip : zip file containing the blktrace data files

Could you please verify that these files provide the information you need, before I do the blktrace tests on the whole md raid10 array? Thanks in advance.
Comment 8 Ralph 2017-01-06 13:26:01 UTC
Created attachment 250521 [details]
blktrace data for fio test on kernel 4.1

Blktrace for kernel 4.1 :

The zip file contains the blktrace result data files, which were recorded during a FIO test on an SSD partition /dev/sdb6, which is NOT a member of a defined raid array.
Comment 9 Ralph 2017-01-06 13:27:30 UTC
Created attachment 250531 [details]
FIO output data for blktrace test run on kernel 4.1

FIO output data created during FIO test run with blktrace on kernel 4.1
Comment 10 Ralph 2017-01-06 13:30:57 UTC
Created attachment 250541 [details]
blktrace data for fio test on kernel 4.4

Blktrace data of FIO test on kernel 4.4 

The zip file contains the blktrace data files, which were created during a FIO test on device /dev/sdb6. This partition was NOT part of a md raid array. 
However, /dev/sdb5 and other partitions on /dev/sdb were members of md raid arrays during test.
Comment 11 Ralph 2017-01-06 13:33:05 UTC
Created attachment 250551 [details]
FIO output data for blktrace test run on kernel 4.4

The attached file contains FIO output data produced on kernel 4.4 for FIO test on /dev/sdb6. These data correspond to the data of the blktrace files for kernel 4.4.
Comment 12 Shaohua Li 2017-01-06 17:42:51 UTC
No, we don't need blktrace for the raid array. It's clear that the single disk performance drops in 4.4. If you use the noop I/O scheduler, does anything change? I bet not, but I want to confirm.
Since the test uses a 64k bs, the IOPS aren't very high, 6400/s; I can't imagine the CPU side (e.g. the switching between CPUs) can affect the performance at such a low IOPS, but I could be wrong. What kind of driver/HBA are you using? I'd suggest involving people from the driver side, since this sounds like a driver bug to me. In the meantime, I'd suggest you try a newer kernel and check if the single disk performance changes.
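
For reference, a per-device I/O scheduler switch of this kind is typically done at runtime via sysfs (a minimal sketch, with /dev/sdb as an example device):

cat /sys/block/sdb/queue/scheduler           # lists the available schedulers, the active one in brackets
echo noop > /sys/block/sdb/queue/scheduler   # switch this device to noop; does not persist across reboots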
Comment 13 Ralph 2017-01-07 09:29:55 UTC
(In reply to Shaohua Li from comment #12)
> No, we don't need blktrace for the raid array. It's clear that the single
> disk performance drops in 4.4. If you use the noop I/O scheduler, does
> anything change? I bet not, but I want to confirm.
> Since the test uses a 64k bs, the IOPS aren't very high, 6400/s; I can't
> imagine the CPU side (e.g. the switching between CPUs) can affect the
> performance at such a low IOPS, but I could be wrong. What kind of driver/HBA
> are you using? I'd suggest involving people from the driver side, since this
> sounds like a driver bug to me. In the meantime, I'd suggest you try a newer
> kernel and check if the single disk performance changes.

The NOOP scheduler does not change the general picture. I tried this already before; small differences, but no change regarding the performance reduction. 

Regarding HW and drivers: 

The board of the test system is an ASRock Z170 Extreme 7+. The SATA controller - to which the SSDs of the Raid array (4 Samsung EVO 850) AND a separate system SSD (with the OS installations on it) are attached - is an onboard Intel Sunrise Point controller in AHCI mode.   

 P: /devices/pci0000:00/0000:00:17.0
  E: DEVPATH=/devices/pci0000:00/0000:00:17.0
  E: DRIVER=ahci
  E: ID_MODEL_FROM_DATABASE=Sunrise Point-H SATA controller [AHCI mode]
  E: ID_PCI_CLASS_FROM_DATABASE=Mass storage controller
  E: ID_PCI_INTERFACE_FROM_DATABASE=AHCI 1.0
  E: ID_PCI_SUBCLASS_FROM_DATABASE=SATA controller
  E: ID_VENDOR_FROM_DATABASE=Intel Corporation
  E: MODALIAS=pci:v00008086d0000A102sv00001849sd0000A102bc01sc06i01
  E: PCI_CLASS=10601
  E: PCI_ID=8086:A102
  E: PCI_SLOT_NAME=0000:00:17.0
  E: PCI_SUBSYS_ID=1849:A102
  E: SUBSYSTEM=pci
  E: USEC_INITIALIZED=5776122
  
  P: /devices/pci0000:00/0000:00:17.0/ata1/ata_port/ata1
  E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/ata_port/ata1
  E: SUBSYSTEM=ata_port
  
  P: /devices/pci0000:00/0000:00:17.0/ata1/host1
  E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/host1
  E: DEVTYPE=scsi_host
  E: SUBSYSTEM=scsi
  
  P: /devices/pci0000:00/0000:00:17.0/ata1/host1/scsi_host/host1
  E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/host1/scsi_host/host1
  E: SUBSYSTEM=scsi_host
  
  P: /devices/pci0000:00/0000:00:17.0/ata1/host1/target1:0:0
  E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/host1/target1:0:0
  E: DEVTYPE=scsi_target
  E: SUBSYSTEM=scsi
  
  P: /devices/pci0000:00/0000:00:17.0/ata1/host1/target1:0:0/1:0:0:0
  E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/host1/target1:0:0/1:0:0:0
  E: DEVTYPE=scsi_device
  E: DRIVER=sd
  E: MODALIAS=scsi:t-0x00
  E: SUBSYSTEM=scsi

There is an additional ASM1062 SATA controller which controls DVD drives. There is also a 3ware controller with HDDs attached. Neither should have a major impact on the SSD performance. 

  
Possibly relevant storage-related modules - as far as I understand it; note, however, that the SSDs are not encrypted:  

intel_rapl             24576  0 
intel_powerclamp       16384  0 
crc32c_intel           24576  2 
aesni_intel           167936  0 
aes_x86_64             20480  1 aesni_intel
lrw                    16384  1 aesni_intel
glue_helper            16384  1 aesni_intel
ablk_helper            16384  1 aesni_intel
cryptd                 20480  2 aesni_intel,ablk_helper
pinctrl_intel          20480  1 pinctrl_sunrisepoint
intel_lpss_acpi        16384  0 
intel_lpss             16384  1 intel_lpss_acpi
mfd_core               16384  1 intel_lpss
ahci                   36864  18 
libahci                36864  1 ahci
libata                270336  2 ahci,libahci
async_raid6_recov      20480  1 raid456
async_memcpy           16384  2 raid456,async_raid6_recov
libcrc32c              16384  1 raid456
async_pq               16384  2 raid456,async_raid6_recov
async_xor              16384  3 async_pq,raid456,async_raid6_recov
async_tx               16384  5 async_pq,raid456,async_xor,async_memcpy,async_raid6_recov
raid6_pq              106496  4 async_pq,raid456,btrfs,async_raid6_recov
raid10                 53248  2 
md_mod                147456  4 raid456,raid10
dm_mod                122880  20
sg                     40960  0
scsi_mod              262144  6 sg,st,3w_9xxx,libata,sd_mod,sr_mod
3w_9xxx                45056  6 


I shall provide a complete HW component list created by hwinfo plus an lsmod list for both Opensuse Leap 42.2 and Leap 42.1 as soon as I find the time. The same goes for a later kernel such as 4.7 - I think there is a Tumbleweed installation on one of the disks, so I would not even have to touch the Leap test installations.
Comment 14 Ralph 2017-01-07 13:46:47 UTC
(In reply to Shaohua Li from comment #12)
> No, we don't need blktrace for the raid array. It's clear that the single
> disk performance drops in 4.4. If you use the noop I/O scheduler, does
> anything change? I bet not, but I want to confirm.
> Since the test uses a 64k bs, the IOPS aren't very high, 6400/s; I can't
> imagine the CPU side (e.g. the switching between CPUs) can affect the
> performance at such a low IOPS, but I could be wrong. What kind of driver/HBA
> are you using? I'd suggest involving people from the driver side, since this
> sounds like a driver bug to me. In the meantime, I'd suggest you try a newer
> kernel and check if the single disk performance changes.

Meanwhile I did a clean and minimal install of Opensuse Tumbleweed with kernel 4.7. All partitions on the SSDs for the Raid arrays, and the Raid arrays themselves, were discovered correctly. udev behaved as expected. 

Unfortunately, things did not get better at all, but worse. 

The random write performance is now consistently, and without spread, at the lower end of what I got for kernel 4.4. 

Results of 6 consecutive fio (version 2.15) test runs on /dev/sdb6 : 
----------------------------------------------------------------
382660KB/s, 381520KB/s, 382374KB/s, 382089KB/s, 381520KB/s, 383520KB/s

So this marks a drop in single SSD performance of about 80 MB/sec on kernel 4.7 compared to kernel 4.1 - which gave me 460 MB/sec. On Raid Arrays the relative loss in performance is even bigger. I would describe this as significant. 

Either FIO delivers wrong results or something major happened between kernels 4.1, 4.4 and 4.7 - e.g. regarding the drivers of the Intel SATA controller. Or could it be something specific to Opensuse's kernel configuration?

What still worries me a bit is the fact that other partitions (sdb2, sdb3, sdb5) on /dev/sdb still are part of md arrays. Could this have an impact on the performance measured for the raw partition /dev/sdb6 ?
Comment 15 Shaohua Li 2017-01-09 17:45:02 UTC
Does the raid array run a resync? If not, I have no idea what could cause the slowness. Why is read slower than write on the disk?
Comment 16 Ralph 2017-01-11 15:42:02 UTC
(In reply to Shaohua Li from comment #15)
> Does the raid array run a resync? If not, I have no idea what could cause
> the slowness. Why is read slower than write on the disk?

Sorry for my late answer. Just not to mix up different topics and problems:

1) The last tests were done for random writing, only
---------------------------------------------------- 
All the last fio tests did random writing - for random writing the performance loss is most significant. Also the last data for /dev/sdb6 (single-disk raw partition access) were for random writes. I wonder a bit why you ask about the difference in read/write performance. 


2) Nevertheless, regarding your question about read vs. write performance:  
-----------------------------------------------------------------------
If you look at the following web site 
http://www.anandtech.com/show/6935/seagate-600-ssd-review/5

you will find that a generally and significantly lower random read rate compared to the write rate is very common for several types of SSDs - very much in contrast to sequential I/O. It also depends strongly on the block size of the data written.  

But I am no specialist on this question - I did find a lower random read performance for small block sizes (4 to 64 KB) on different Samsung SSDs (PROs and EVOs) on very different Linux systems.   


3) Dependency on resyncing?
-----------------------------  
I double-checked: No, the raid array tests were NOT done during a resync. Also, during the tests on the single partition /dev/sdb6 the other partitions (members of arrays) were not resyncing.   

4) Dependency on kernel 4.1 (Opensuse Leap 42.1), kernel 4.4 (Opensuse Leap 42.2), kernel 4.7 (Opensuse Tumbleweed): 

I can only state here what I see from fio tests (with the same fio version and 64 kB block size): 

a) Random write performance on a single partition /dev/sdb6 is around 460 MB/sec for kernel 4.1.  

b) Random write performance on a single partition /dev/sdb6 varies between 380 MB/sec and at most 460 MB/sec for kernel 4.4, with the average clearly below 460 MB/sec.   

c) Random write performance is consistently close to 380 MB/sec for kernel 4.7.  

All for Samsung SSD 850 EVO. 

Regarding Raid 10 Arrays: 
--------------------------- 
There is a huge discrepancy in random write performance between kernel 4.1 and 4.4: 

a) Kernel 4.1: Around 640 MB/sec for a raid array chunk size of 32 KiB and a fio test block size of 64 KiB. That this rate should be much larger than the single-disk performance was to be expected: with the near-2 layout over 4 devices, each write goes to two disks, so roughly twice the single-SSD rate is the natural ceiling. 

b) Kernel 4.4: Performance varies over a large range - on average around 450 MB/sec, with a lower end at approx. 380 MB/sec.  

I have not tested arrays yet for kernel 4.7.
Comment 17 Ralph 2017-01-11 15:49:17 UTC
Created attachment 251211 [details]
output of hwinfo on kernel 4.4 system (opensuse leap 42.2)

The attachment contains the output of hwinfo on the Opensuse Leap 42.2 system with kernel 4.4
Comment 18 Ralph 2017-01-11 15:50:30 UTC
Created attachment 251221 [details]
output of lsmod on the kernel 4.4 system (Opensuse leap 42.2)

The attachment contains the output of lsmod on the Opensuse Leap 42.2 system with kernel 4.4
Comment 19 Shaohua Li 2017-01-11 16:43:13 UTC
CC'ing Tejun; he is the maintainer of ata, so probably he can find something on the driver side.
Comment 20 Tejun Heo 2017-01-11 22:50:43 UTC
Hmm... I can't think of much which could have affected that. We switched away from threaded interrupt handling in ahci but that's 4.5. Just in case, can you please test how a more recent kernel behaves? Thanks.
Comment 21 Ralph 2017-01-12 15:20:52 UTC
(In reply to Tejun Heo from comment #20)
> Hmm... I can't think of much which could have affected that. We switched
> away from threaded interrupt handling in ahci but that's 4.5. Just in case,
> can you please test how a more recent kernel behaves? Thanks.

Hi, thank you for your interest in my problem. I had a look at kernel 4.9 on Opensuse Tumbleweed and kernel 4.1 (Opensuse Leap 42.1) on the same HW platform. For comparison I did 2 tests: 

 - one on a raw Raid10 array device - /dev/md129 based on raw Raid partitions /dev/sd[a-d]5 (type: 0xFD, no LVM, no ext4) on 4 Samsung EVO 850. 
 
- one on a separate raw Linux test partition /dev/sde2 (type: 0x83) on the same SSD that holds the OS (Samsung 840 PRO; this SSD is not part of any md array)  

I enriched the fio test a bit: the random write job now follows a separate random read job (one job at a time, bs=64k = twice the raid chunk size; see the sketch below). For both RandRead and RandWrite only one job was requesting access to the partitions or raid arrays at a time.     
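
The exact job file is not quoted here; a sketch consistent with these parameters, modeled on the job file from the original report and run as "fio <jobfile> --filename=<device>", might look like this:

[global]
size=500m
direct=1
bs=64k
ioengine=sync
numjobs=1

[read]
rw=randread

[write]
stonewall
rw=randwrite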

I waited 10 minutes after booting to avoid the impact of any startup jobs. No X11 or graphical interface was started.  


Results for the Raid 10 array /dev/md129 
-------------------------------------------
(all in MB/sec)

Kernel 4.9 - Random Read         
357, 367, 350, 353, 343, 363, 372, 384, 385, 378, 371, 402, 352, 336, 376, 360

Kernel 4.1 - Random Read         
407, 399, 397, 401, 397, 404, 399, 399, 407, 395


Kernel 4.9 - Random Write         
611, 635, 400, 414, 373, 649, 369, 375, 484, 363, 645, 668, 364, 366, 645, 370

Kernel 4.1 - Random Write        
654, 664, 667, 660, 657, 664, 667, 675, 653, 664


Results for raw partition /dev/sde2 
-------------------------------------

Kernel 4.9 - Random Read :
211, 206, 214, 206, 211, 208, 213, 211, 217, 210, 217, 209, 203, 212

Kernel 4.1 - Random Read:
245, 237, 238, 231, 231, 233, 232, 230, 230, 234, 232, 233, 233


Kernel 4.9 - Random Write :
288, 351, 292, 352, 352, 335, 295, 317, 252, 351, 350, 299, 295, 328

Kernel 4.1 - Random Write :
348, 343, 342, 353, 349, 351, 351, 351, 347, 354, 349, 351, 353  




In addition I looked at the access to /dev/sdb6 - /dev/sdb has several other partitions belonging to Raid 10 and Raid 5 md arrays


Results for raw partition /dev/sdb6 
-------------------------------------

Kernel 4.9 - Random Read :
229, 248, 239, 246, 247, 234, 234, 242, 241, 235, 246, 230, 222, 232, 251, 235

Kernel 4.1 - Random Read :
278, 295, 286, 286, 295, 284, 283, 295, 287

Kernel 4.9 - Random Write :
455, 366, 459, 391, 459, 459, 459, 444, 451, 456, 459, 455, 460, 458, 396, 460

Kernel 4.1 - Random Write:
446, 449, 460, 451, 459, 459, 457, 455, 459


My conclusion:
---------------
Random Read is consistently worse for kernel 4.9 than for kernel 4.1 for the chosen FIO test setup (a single read job with a relatively small block size) - independent of whether an array or a single disk is accessed. In addition, this finding seems to be independent of the SSD type.  

Random write on the md raid10 array shows the same large variation as also found on kernel 4.4. On average the random write performance on md raid 10 arrays is lower for kernel 4.9 than for kernel 4.1. 

Random write for single raw partition access shows some more variation on kernel 4.9 than for kernel 4.1. However, the spread was less pronounced for kernel 4.9 than for kernel 4.4. 

Note, however, that the random write performance on /dev/sdb6 (though varying) is higher than on kernel 4.7 (where it stayed at 380 MB/sec; see comment 16).   

Additional remarks
-----------------------
I admit that the test setup is a bit special (only one job requesting access to the tested array or partition at a time, plus relatively small "bs" values). The overall performance values would of course be different for several parallel read/write jobs requesting access to the Raid array. But preliminary tests also showed a faster asymptotic approach to the sequential limits for kernel 4.1 than for later kernels.

Still, the question remains why kernel 4.1 gives me higher values on average, with a significantly lower standard deviation, especially for Random Writes on the Raid 10 Array.

In addition I should mention that I can reproduce the better results on the kernel 4.1 system even with a much higher number of jobs running in the background, samba plus apache server running, full KDE with akonadi running, ODT files open, browsers running, plus ksysguard and gkrellm active and amarok running.  The very good raid performance on the kernel 4.1 system is almost indifferent to background load up to 20% on some processor cores.         

The kernel 4.9 results instead were measured on a minimal installation without KDE running.
Comment 22 Tejun Heo 2017-01-18 22:08:10 UTC
Thanks for the comprehensive testing, but as all the tests seem to give about the same results, let's narrow it down to a random read test on a single device.

I frankly have no idea. Given that there is a reliable reproducer and comparison points, bisecting seems like the best course of action. We can try to find out where the latency is actually coming from - it might not even be the hardware or driver but somewhere higher up in the stack. Hmm... but yeah, would it be difficult for you to bisect the problem?

Thanks.
Comment 23 Ralph 2017-06-19 10:11:00 UTC
Hi, sorry for the long delay in answering the last mail. Some days ago I found the time to look at SSD raid arrays again. By chance I stumbled over some articles on CPU governors while testing. On my system Intel's pstate driver (intel_pstate) is relevant (i7 4700k CPU). The default governor on Opensuse Leap 42.2 is "powersave". (There are only 2 choices: "powersave" and "performance".) 

Reading about some problems people had with "intel_pstate" on Red Hat 7 led me to test the Raid performance again - this time with the CPU governor set to "performance" for all CPU cores/threads.    
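
For reference, checking and switching the governor is done via the cpufreq sysfs interface (or the cpupower tool); a minimal sketch:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver               # intel_pstate on this system
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors  # with intel_pstate: powersave performance
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"                                           # set every core/thread to "performance"
done
cpupower frequency-set -g performance                                 # alternative, if cpupower is installed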

And guess what: 
The measured performance values on Kernel 4.4 (on Leap 42.2) were on average much, much closer (if not identical) to the values which I found on Opensuse Leap 42.1. 

Preliminary test results: 
*************************
md-raid-10, chunk size: 32k, bitmap: none 

Case 1: 
-------
fio bs=512k, numjobs 1, 500 MB data, Random Read [RR] / Random Write [RW] 

Leap 42.1, Kernel 4.1: 
RR 780 MByte/sec - RW: 754 MByte/sec   RR Spread around 35 Mbyte/sec    

Leap 42.2, Kernel 4.4, governor: powersave :    
RR 669 MByte/sec - RW: 606 MByte/sec   RR Spread around 50 Mbyte/sec    

Leap 42.2, Kernel 4.4, governor: performance :  
RR 780 MByte/sec - RW: 750 MByte/sec   RR Spread around 35 Mbyte/sec    


Case 2:
-------
fio bs=1024k, numjobs 1, 500 MB data, Random Read [RR] / Random Write [RW] 

Leap 42.1, Kernel 4.1: 
RR 860 MByte/sec - RW: 800 MByte/sec   RR Spread around 30 Mbyte/sec    

Leap 42.2, Kernel 4.4, governor: powersave :   
RR 735 MByte/sec - RW: 660 MByte/sec   RR Spread > 50 Mbyte/sec    

Leap 42.2, Kernel 4.4, governor: performance :   
RR 877 MByte/sec - RW: 792 MByte/sec   RR Spread around 25 Mbyte/sec    

--------------

Although not systematically tested: this seems to be significant. 

I now regard it as quite probable that the source of the significant reduction I saw for md raid 10 arrays with SSDs when comparing kernel 4.4 with 4.1 may be a change (???) in the CPU governor settings or some general change in the intel_pstate driver. 

This change affects at least Opensuse Leap 42.2 with Kernel 4.4 in contrast to Leap 42.1 with Kernel 4.1.  

It seems to me that the "powersave" governor does NOT react to fio tests and a rise in I/O load over a certain period of time at all - or it does not react fast enough. The frequency of the CPU cores does not go up to top values during fio runs of the above type, for whatever reason: CPU load below thresholds? Poor time resolution of the load analysis? No reaction to I/O spikes?            

Anyway: The effect is dramatic! A relatively high CPU frequency seems to be necessary to get optimum performance out of md raid 10 arrays - at least for spike-like loads. 

Which makes me wonder whether Intel's "powersave" governor (based on the intel_pstate driver) is in general a poor choice when one wants to use md raid arrays. 

Any ideas or comments?    

Anything I could do? What parameters (if there are any) of the intel_pstate governors could help to increase the performance of md raid arrays?
Comment 24 Ralph 2017-06-19 10:31:30 UTC
Sorry, the processor I use on the test system is not an "i7 4700k" but an "i7 6700k".
Comment 25 Ralph 2017-06-19 15:55:47 UTC
I checked for the CPU governor on an Opensuse Leap 42.1 system: 

In fact, on Opensuse 42.1 the good old (ACPI-based) "ondemand" governor does its work - and not the "new" intel_pstate based "powersave" governor for HWP.

So, there is a major difference between Leap 42.1 and 42.2 (besides the kernel version) and this difference may explain the observations described in comment #23 (and previous findings). 

This in turn raises the question of whether the intel_pstate "powersave" governor should be avoided on systems with md raid arrays of SSDs.
Comment 26 Shaohua Li 2017-10-03 16:31:58 UTC
It's a known issue that CPU power-saving mode can hurt performance for fast devices, SSDs for example.
Comment 27 Ralph 2017-10-05 08:38:42 UTC
(In reply to Shaohua Li from comment #26)
> It's a known issue that CPU power-saving mode can hurt performance for fast
> devices, SSDs for example.

And I had to learn it the hard way ... :-)

For me it was surprising that the impact was so significant. And I had expected that the guys from Intel would have tested the impact of their governor on their own onboard Raid controllers a bit more thoroughly. 

I switched back to the "ondemand" governor on my systems - and the performance now is perfect again.
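
A note on how such a switch is typically achieved: with intel_pstate active only "powersave" and "performance" are offered, so getting "ondemand" back usually means falling back to the acpi-cpufreq driver. A sketch, assuming the boot-parameter route:

add "intel_pstate=disable" to the kernel command line and reboot, then:
cpupower frequency-set -g ondemand           # or write "ondemand" to each CPU's scaling_governor file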

The "bug" was no "bug" at all on the kernel or driver side. Nevertheless, thank you very much for the support and suggestions!
