I used fio to test random write performance on an mdadm RAID 10 array over 4 SSDs (n2 layout, chunk size = 32 KiB). I did this on openSUSE 42.1 with kernel 4.1 and on openSUSE 42.2 with kernel 4.4, and I get very different results for one and the same RAID array: on kernel 4.1 the latency is significantly lower, resulting in significantly higher performance.

During consecutive fio test runs on kernel 4.1 the write performance has a relatively small spread around 635 MB/sec [610 - 660 MB/sec] and stays well above 600 MB/sec. With a growing block size of the random writes in the fio test (64 KiB, 128 KiB, 256 KiB, ...) the performance grows systematically until approximately twice the rate of a single SSD is reached.

On kernel 4.4, however, I get on average a much smaller write rate around 450 MB/sec - with a huge spread between 330 MB/sec and 610 MB/sec during a sequence of runs. Typically, the average write latency shows much higher values than on kernel 4.1. The expected increase in performance with the block size of the randomly written data is much smaller and less systematic than on kernel 4.1.

I used the very same RAID array in both cases (the tests were done on one and the same machine), with the same test run parameters and the same I/O engine (sync). You find typical example test results below. The results do not vary much whether fstrim is issued ahead of a test or not.

I admit that the fio version changed (I shall compile the same version if required); however, the relatively bad latency values for the RAID 10 array on kernel 4.4 also show up in general - e.g. in database tests where single processes write small pieces of data to different tables. The differences also show up when the total size of written fio test data is increased from 500 MB to 20 GB.

I do not know whether this is a bug or not, but I thought I should report this finding.
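As a rough plausibility check (editor's sketch, not part of the original report): a RAID10 "n2" array keeps 2 near-copies of every chunk, so the aggregate random-write ceiling over N devices is roughly (N / copies) times the per-device write rate. The per-SSD figure of 330 MB/s below is an assumed, illustrative number.

```python
# RAID10 "n2" stores 2 copies of each chunk, so with N devices the
# aggregate random-write ceiling is about (N / copies) * per-device rate.
def raid10_n2_write_limit(n_devices: int, per_device_mb_s: float, copies: int = 2) -> float:
    """Rough upper bound on array write bandwidth in MB/s."""
    return n_devices / copies * per_device_mb_s

# With an assumed per-SSD random-write rate of ~330 MB/s, a 4-disk n2
# array would top out near 660 MB/s - the same order of magnitude as
# the ~635 MB/sec measured on kernel 4.1.
print(raid10_n2_write_limit(4, 330.0))
```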
Is it possible that something changed for the worse with mdadm and the raid10 module between kernel 4.1 and kernel 4.4? Please let me know if and what kind of test data or other information you need.

------------- Fio test run: ---------------

[global]
size=500m
directory=/mnt
direct=1
bs=64k,64k,64k
ioengine=sync

[read]
rw=randread

[write]
stonewall
rw=randwrite

--------- Results on kernel 4.1 ***********************

read: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
write: (g=1): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
fio-2.2.10
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [75.0% done] [164.6MB/181.1MB/0KB /s] [2632/2911/0 iops] [eta 00m:01s]
read: (groupid=0, jobs=1): err= 0: pid=4257: Tue Jan 3 20:35:38 2017
  read : io=512000KB, bw=384673KB/s, iops=6010, runt=  1331msec
    clat (usec): min=146, max=3099, avg=165.68, stdev=38.12
     lat (usec): min=146, max=3100, avg=165.71, stdev=38.13
    clat percentiles (usec):
     |  1.00th=[  151],  5.00th=[  155], 10.00th=[  159], 20.00th=[  161],
     | 30.00th=[  163], 40.00th=[  163], 50.00th=[  163], 60.00th=[  165],
     | 70.00th=[  165], 80.00th=[  167], 90.00th=[  171], 95.00th=[  187],
     | 99.00th=[  205], 99.50th=[  217], 99.90th=[  245], 99.95th=[  249],
     | 99.99th=[ 3088]
    bw (KB  /s): min=381312, max=387840, per=99.97%, avg=384576.00, stdev=4615.99
    lat (usec) : 250=99.98%
    lat (msec) : 2=0.01%, 4=0.01%
  cpu          : usr=0.90%, sys=4.51%, ctx=8013, majf=0, minf=24
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=8000/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
write: (groupid=1, jobs=1): err= 0: pid=4258: Tue Jan 3 20:35:38 2017
  write: io=512000KB, bw=639201KB/s, iops=9987, runt=   801msec
    clat (usec): min=92, max=2557, avg=98.19, stdev=28.54
     lat (usec): min=93, max=2558, avg=99.50, stdev=28.54
    clat percentiles (usec):
     |  1.00th=[   92],  5.00th=[   93], 10.00th=[   93], 20.00th=[   95],
     | 30.00th=[   95], 40.00th=[   96], 50.00th=[   97], 60.00th=[   97],
     | 70.00th=[   98], 80.00th=[   98], 90.00th=[   99], 95.00th=[  103],
     | 99.00th=[  131], 99.50th=[  135], 99.90th=[  163], 99.95th=[  165],
     | 99.99th=[ 2544]
    bw (KB  /s): min=633088, max=633088, per=99.04%, avg=633088.00, stdev= 0.00
    lat (usec) : 100=90.04%, 250=9.94%, 500=0.01%
    lat (msec) : 4=0.01%
  cpu          : usr=1.50%, sys=7.00%, ctx=8016, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=8000/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=512000KB, aggrb=384673KB/s, minb=384673KB/s, maxb=384673KB/s, mint=1331msec, maxt=1331msec

Run status group 1 (all jobs):
  WRITE: io=512000KB, aggrb=639200KB/s, minb=639200KB/s, maxb=639200KB/s, mint=801msec, maxt=801msec

Disk stats (read/write):
    dm-4: ios=16000/15868, merge=0/0, ticks=2408/1280, in_queue=3692, util=75.70%, aggrios=16000/16000, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  md125: ios=16000/16000, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4000/8001, aggrmerge=0/0, aggrticks=603/622, aggrin_queue=1219, aggrutil=72.71%
    sda: ios=8000/8001, merge=0/0, ticks=1200/636, in_queue=1828, util=72.71%
    sdb: ios=0/8001, merge=0/0, ticks=0/620, in_queue=620, util=24.66%
    sdc: ios=8000/8001, merge=0/0, ticks=1212/620, in_queue=1828, util=72.71%
    sdd: ios=0/8001, merge=0/0, ticks=0/612, in_queue=600, util=23.87%

Note the relatively small "lat" values with a small stdev for the write process!
--------------------------- Results on kernel 4.4 ************************

read: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
write: (g=1): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
fio-2.15
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [100.0% done] [0KB/390.6MB/0KB /s] [0/6248/0 iops] [eta 00m:00s]
read: (groupid=0, jobs=1): err= 0: pid=5127: Tue Jan 3 22:58:53 2017
  read : io=512000KB, bw=375642KB/s, iops=5869, runt=  1363msec
    clat (usec): min=146, max=2716, avg=169.48, stdev=39.99
     lat (usec): min=146, max=2717, avg=169.52, stdev=40.02
    clat percentiles (usec):
     |  1.00th=[  151],  5.00th=[  155], 10.00th=[  159], 20.00th=[  159],
     | 30.00th=[  161], 40.00th=[  161], 50.00th=[  163], 60.00th=[  163],
     | 70.00th=[  165], 80.00th=[  175], 90.00th=[  201], 95.00th=[  213],
     | 99.00th=[  225], 99.50th=[  229], 99.90th=[  282], 99.95th=[  290],
     | 99.99th=[ 2704]
    lat (usec) : 250=99.74%, 500=0.24%
    lat (msec) : 4=0.03%
  cpu          : usr=0.66%, sys=5.80%, ctx=8001, majf=0, minf=22
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=8000/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
write: (groupid=1, jobs=1): err= 0: pid=5128: Tue Jan 3 22:58:53 2017
  write: io=512000KB, bw=387292KB/s, iops=6051, runt=  1322msec
    clat (usec): min=90, max=8898, avg=159.50, stdev=103.85
     lat (usec): min=91, max=8902, avg=162.97, stdev=104.10
    clat percentiles (usec):
     |  1.00th=[   91],  5.00th=[   92], 10.00th=[   96], 20.00th=[  147],
     | 30.00th=[  151], 40.00th=[  153], 50.00th=[  155], 60.00th=[  171],
     | 70.00th=[  175], 80.00th=[  181], 90.00th=[  199], 95.00th=[  205],
     | 99.00th=[  231], 99.50th=[  243], 99.90th=[  286], 99.95th=[  302],
     | 99.99th=[ 8896]
    lat (usec) : 100=11.00%, 250=88.69%, 500=0.27%, 1000=0.01%
    lat (msec) : 2=0.01%, 10=0.01%
  cpu          : usr=4.39%, sys=12.72%, ctx=8002, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=8000/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=512000KB, aggrb=375641KB/s, minb=375641KB/s, maxb=375641KB/s, mint=1363msec, maxt=1363msec

Run status group 1 (all jobs):
  WRITE: io=512000KB, aggrb=387291KB/s, minb=387291KB/s, maxb=387291KB/s, mint=1322msec, maxt=1322msec

Note the increased "lat" values and the bigger stdev for the write process!
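Since both write jobs use ioengine=sync with iodepth=1, the bandwidth is essentially the block size divided by the average per-I/O latency. A quick check (editor's sketch) shows that the avg "lat" values quoted above reproduce the reported bandwidths almost exactly, i.e. the slowdown on kernel 4.4 is fully explained by the higher per-write latency:

```python
# With one synchronous job at iodepth=1, throughput ~= bs / avg latency.
BS = 64 * 1024  # fio block size in bytes (bs=64k)

def est_bw_kib_s(avg_lat_us: float) -> float:
    """Estimated bandwidth in KiB/s from the average per-I/O latency."""
    return BS / (avg_lat_us * 1e-6) / 1024

bw_41 = est_bw_kib_s(99.5)   # kernel 4.1 write: avg lat 99.50 usec
bw_44 = est_bw_kib_s(163.0)  # kernel 4.4 write: avg lat 162.97 usec (rounded)

# Close to the reported bw=639201KB/s (4.1) and bw=387292KB/s (4.4).
print(round(bw_41), round(bw_44))
```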
I should add that I used the deadline scheduler in both cases. Storage is organized as: SSDs => RAID partitions => mdadm RAID 10 array => LVM2 volume group => LVM2 volume with ext4.
I didn't find any significant changes on the raid10 side between 4.1 and 4.4. Can you try a raw raid10 array and check whether the write performance changes? I'd like to narrow it down to make sure it's not a problem of LVM or ext4.
Good suggestion. I assume you mean that I should do the following:

1) Prepare a raid 10 array based on unformatted partitions of type 0xFD.
2) Write to the raid array with the help of "dd" or fio, both under kernel 4.1 and 4.4. With fio I would still use the "sync" I/O engine?

A different and additional step would be to omit the LVM layer and test write performance on a raid array formatted directly with ext4, OR to combine 4 ext4-formatted partitions into a raid array and write to it. Please correct me if I misunderstood you.

I cannot perform tests right now because I am too busy today and tomorrow with other things. But I shall perform the tests in the next days.
(In reply to Shaohua Li from comment #2)
> Didn't find significant changes in raid10 side between 4.1 to 4.4. Can you
> try a raw raid10 array and check if write performance changes? I'd like to
> narrow down to make sure it's not problem of lvm or ext4.

I meanwhile performed fio runs on one and the same raw raid array - both on an openSUSE Leap 42.2 (kernel 4.4) installation and on an openSUSE Leap 42.1 (kernel 4.1) installation. The results (see below and the attachment) confirm the findings described previously.

Conclusion:
-------------
The sluggishness under kernel 4.4 does NOT depend on LVM or ext4. Neither does it depend on different fio versions.

What to try next? What more information do you need?

********************************************
Description of what I did and typical results

I did the following up to now:

1) I provided a 0xFD partition of 40 GiB on each of 4 SSDs.
2) I created a Raid 10 array with
   mdadm --create --verbose /dev/md/d02 --level=raid10 --bitmap=none --chunk=32 --layout=n2 --raid-devices=4 /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5
3) After realignment, "cat /proc/mdstat" shows on the kernel 4.4 system:
   Personalities : [raid10] [raid6] [raid5] [raid4]
   md_d2 : active raid10 sdd5[3] sdc5[2] sdb5[1] sda5[0]
         83826688 blocks super 1.2 32K chunks 2 near-copies [4/4] [UUUU]
4) The very same raid array appears on the kernel 4.1 system as "md126":
   cat /proc/mdstat
   Personalities : [raid10] [raid6] [raid5] [raid4]
   md126 : active raid10 sdb5[1] sdd5[3] sda5[0] sdc5[2]
         83826688 blocks super 1.2 32K chunks 2 near-copies [4/4] [UUUU]
5) I did not format the raid array, nor did I create an LVM group/volume.
6) Then I executed the following fio job "rand_write_mnt" with fio version 2.15

   [global]
   size=500m
   direct=1
   bs=64k
   ioengine=sync

   [write]
   bs=64k
   numjobs=1
   rw=randwrite

   by the command
   mytux:~/fio # fio rand_write_mnt.fio --filename=/dev/md_d2
7) I did this on openSUSE Leap 42.1 with kernel 4.1 and openSUSE Leap 42.2 with kernel 4.4 - on both sides with fio version 2.15 (to exclude fio differences).

Typical result on openSUSE Leap 42.1 / kernel 4.1:
----------------------------------------------------
mytux:~/fio # fio rand_write_mnt.fio --filename=/dev/md126
write: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
fio-2.15
Starting 1 process
write: (groupid=0, jobs=1): err= 0: pid=26116: Thu Jan 5 16:51:24 2017
  write: io=512000KB, bw=664935KB/s, iops=10389, runt=   770msec
    clat (usec): min=90, max=2551, avg=94.58, stdev=28.25
     lat (usec): min=91, max=2553, avg=95.71, stdev=28.27
    clat percentiles (usec):
     |  1.00th=[   90],  5.00th=[   91], 10.00th=[   91], 20.00th=[   92],
     | 30.00th=[   92], 40.00th=[   92], 50.00th=[   93], 60.00th=[   94],
     | 70.00th=[   94], 80.00th=[   94], 90.00th=[   95], 95.00th=[  103],
     | 99.00th=[  125], 99.50th=[  129], 99.90th=[  159], 99.95th=[  161],
     | 99.99th=[ 2544]
    lat (usec) : 100=94.09%, 250=5.90%
    lat (msec) : 4=0.01%
  cpu          : usr=2.08%, sys=4.68%, ctx=8010, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=8000/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=512000KB, aggrb=664935KB/s, minb=664935KB/s, maxb=664935KB/s, mint=770msec, maxt=770msec

Disk stats (read/write):
  md126: ios=62/13310, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=30/8001, aggrmerge=1/0, aggrticks=3/652, aggrin_queue=651, aggrutil=74.29%
    sda: ios=64/8001, merge=4/0, ticks=0/664, in_queue=656, util=74.29%
    sdb: ios=2/8001, merge=0/0, ticks=0/648, in_queue=648, util=73.39%
    sdc: ios=55/8001, merge=1/0, ticks=8/648, in_queue=648, util=73.39%
    sdd: ios=2/8001, merge=2/0, ticks=4/648, in_queue=652, util=73.84%

Typical result on openSUSE Leap 42.2 with kernel 4.4:
--------------------------------------------------------
mytux:~/fio # fio rand_write_mnt.fio --filename=/dev/md_d2
write: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=sync, iodepth=1
fio-2.15
Starting 1 process
Jobs: 1 (f=1)
write: (groupid=0, jobs=1): err= 0: pid=8886: Thu Jan 5 15:31:26 2017
  write: io=512000KB, bw=391437KB/s, iops=6116, runt=  1308msec
    clat (usec): min=97, max=2629, avg=157.47, stdev=43.69
     lat (usec): min=98, max=2632, avg=161.08, stdev=44.15
    clat percentiles (usec):
     |  1.00th=[   99],  5.00th=[  101], 10.00th=[  108], 20.00th=[  143],
     | 30.00th=[  147], 40.00th=[  149], 50.00th=[  151], 60.00th=[  159],
     | 70.00th=[  171], 80.00th=[  179], 90.00th=[  199], 95.00th=[  209],
     | 99.00th=[  241], 99.50th=[  258], 99.90th=[  434], 99.95th=[  486],
     | 99.99th=[ 2640]
    lat (usec) : 100=3.71%, 250=95.58%, 500=0.66%, 750=0.03%, 1000=0.01%
    lat (msec) : 4=0.01%
  cpu          : usr=2.14%, sys=14.31%, ctx=8001, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=8000/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=512000KB, aggrb=391437KB/s, minb=391437KB/s, maxb=391437KB/s, mint=1308msec, maxt=1308msec

Disk stats (read/write):
  md_d2: ios=20/6852, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=14/8001, aggrmerge=2/0, aggrticks=1/962, aggrin_queue=960, aggrutil=69.58%
    sda: ios=28/8001, merge=6/0, ticks=0/960, in_queue=956, util=67.32%
    sdb: ios=4/8001, merge=0/0, ticks=4/940, in_queue=940, util=66.20%
    sdc: ios=22/8001, merge=4/0, ticks=0/992, in_queue=988, util=69.58%
    sdd: ios=4/8001, merge=0/0, ticks=0/956, in_queue=956, util=67.37%

---------------------------------
In addition I did consecutive runs. As stated before, the spread of the write performance is huge under kernel 4.4, but it is well defined and always above 620 MB/sec under kernel 4.1.
See the upcoming attachment.

I included "invalidate=1" in the fio job. No difference.
Created attachment 250401 [details]
Comparison of consecutive FIO runs on kernel 4.1 vs. kernel 4.4

See the spread of the random write performance on kernel 4.4 between 390 MB/sec and 640 MB/sec, compared to a well defined, higher random write performance on kernel 4.1 around 650 MB/sec ± 10 MB/sec.
If you run the fio script against a single disk, for example /dev/sda5, does the performance change between 4.1 and 4.4?

If not, please attach the blktrace output from running the fio script against the raid array for both 4.1 and 4.4. That will give us more hints.
(In reply to Shaohua Li from comment #6)
> if you run the fio script against a single disk, for example /dev/sda5, does
> the performance change between 4.1 and 4.4?
>
> If no, please attach the blktrace when you run the fio script against the
> raid array for both 4.1 and 4.4. That will give us more hints.

Meanwhile I did a test with blktrace for a single disk partition /dev/sdb6 (40 GiB). I shall provide the output as attachments later on. Note that I did not delete the md raid array configurations. So other partitions such as sdb5 were still members of md raid arrays during the fio tests, and the md, raid10 and raid5 modules were not unloaded.

Several things are remarkable:

1) Again the spread of random write performance is bigger for kernel 4.4 than for kernel 4.1. On average the performance was lower on kernel 4.4. Result values of some consecutive fio runs without blktrace are:

Kernel 4.1 : 459605KB/s, 462511KB/s, 462093KB/s, 460846KB/s, 462929KB/s, 462511KB/s
Kernel 4.4 : 425249KB/s, 397207KB/s, 456327KB/s, 429530KB/s, 442906KB/s, 437233KB/s, 386123KB/s, 436487KB/s

This may be an indication that single disk access delays are simply enhanced and reflected in the raid array behavior (?).

2) When performing blktrace, the CPU event distribution summarized at the end of the blktrace output shows a clear concentration on exactly 1 CPU core on the kernel 4.1 installation. With kernel 4.4 the CPU event distribution shows a much larger spread over the CPU cores.
kernel 4.1 - end of blktrace:

=== sdb6 ===
  CPU  0:      7723 events,      363 KiB data
  CPU  1:      2605 events,      123 KiB data
  CPU  2:       520 events,       25 KiB data
  CPU  3:    399986 events,    18750 KiB data
  CPU  4:        26 events,        2 KiB data
  CPU  5:        17 events,        1 KiB data
  CPU  6:        17 events,        1 KiB data
  CPU  7:       100 events,        5 KiB data
  Total:     410994 events (dropped 0),    19266 KiB data

kernel 4.4 - end of blktrace:

=== sdb6 ===
  CPU  0:    207564 events,     9730 KiB data
  CPU  1:    120663 events,     5657 KiB data
  CPU  2:     80507 events,     3774 KiB data
  CPU  3:     52699 events,     2471 KiB data
  CPU  4:     12912 events,      606 KiB data
  CPU  5:        33 events,        2 KiB data
  CPU  6:       724 events,       34 KiB data
  CPU  7:      2032 events,       96 KiB data
  Total:     477134 events (dropped 0),    22366 KiB data

This can be reproduced, and it is really astonishing. Does this indicate that kernel 4.4 in general performs a different CPU load distribution with more frequent switching between the cores? Could this explain the latency shown in the fio tests?

I shall provide the output data for the fio tests on /dev/sdb6 in a minute. For each test there will be 2 files:
1) blktrace_sdb6_kernel4x_fio_output_test.txt : fio output during the test
2) blktrace_sdb6_kernel41_for_fio_test.zip : zip file containing the blktrace data files

Could you please verify that these files provide the information you need before I do the blktrace tests on the whole md raid10 array? Thanks in advance.
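The concentration can be quantified directly from the per-CPU event counts above. A small sketch (editor's addition) computing the share of events handled by the busiest CPU on each kernel:

```python
# Per-CPU blktrace event counts, copied from the summaries above.
events_41 = [7723, 2605, 520, 399986, 26, 17, 17, 100]
events_44 = [207564, 120663, 80507, 52699, 12912, 33, 724, 2032]

def busiest_share(events):
    """Fraction of all recorded events that landed on the busiest CPU."""
    return max(events) / sum(events)

share_41 = busiest_share(events_41)  # ~0.97: almost everything on CPU 3
share_44 = busiest_share(events_44)  # ~0.44: spread over several CPUs
print(round(share_41, 2), round(share_44, 2))
```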
Created attachment 250521 [details]
blktrace data for fio test on kernel 4.1

Blktrace for kernel 4.1: the zip file contains the blktrace result data files, which were recorded during a fio test on the SSD partition /dev/sdb6, which is NOT a member of a defined raid array.
Created attachment 250531 [details] FIO output data for blktrace test run on kernel 4.1 FIO output data created during FIO test run with blktrace on kernel 4.1
Created attachment 250541 [details]
blktrace data for fio test on kernel 4.4

Blktrace data of the fio test on kernel 4.4: the zip file contains the blktrace data files, which were created during a fio test on device /dev/sdb6. This partition was NOT part of an md raid array. However, /dev/sdb5 and other partitions on /dev/sdb were members of md raid arrays during the test.
Created attachment 250551 [details] FIO output data for blktrace test run on kernel 4.4 The attached file contains FIO output data produced on kernel 4.4 for FIO test on /dev/sdb6. These data correspond to the data of the blktrace files for kernel 4.4.
No, we don't need blktrace for the raid array. It's clear the single disk performance drops in 4.4. If you use the noop I/O scheduler, does anything change? I bet not, but I want to confirm.

Since the test uses a 64k block size, the IOPS isn't very high (6400/s), and I can't imagine the CPU side affecting performance at such low IOPS (e.g. the switching between CPUs), but I could be wrong. What kind of driver/HBA are you using? I'd suggest involving people from the driver side, since this sounds like a driver bug to me. In the meantime, I'd suggest you try a newer kernel and check if the single disk performance changes.
(In reply to Shaohua Li from comment #12)
> No, don't need blktrace for the raid array. It's clear the single disk
> performance drops in 4.4. if you use noop ioscheduer, anything changes? I
> bet not, but want to confirm.
> Since the test is 64k bs, the iops isn't very high, 6400/s, I can't imagine
> CPU side can affect the performance for such low IOPS (eg, the switching
> between the CPUs), but I could be wrong. What kind of driver/HBA are you
> using? I'd suggest involving people from the driver side, since this sounds
> like a driver bug to me. In the meaning time, I'd suggest you try a newer
> kernel and check if the single disk performance changes.

The noop scheduler does not change the general picture. I had already tried this before; there are small differences, but no change regarding the performance reduction.

Regarding HW and drivers: the board of the test system is an ASRock Z170 Extreme7+. The SATA controller - to which the SSDs of the raid array (4 Samsung EVO 850) AND a separate system SSD (with the OS installations on it) are attached - is an onboard Intel Sunrise Point controller in AHCI mode.
P: /devices/pci0000:00/0000:00:17.0
E: DEVPATH=/devices/pci0000:00/0000:00:17.0
E: DRIVER=ahci
E: ID_MODEL_FROM_DATABASE=Sunrise Point-H SATA controller [AHCI mode]
E: ID_PCI_CLASS_FROM_DATABASE=Mass storage controller
E: ID_PCI_INTERFACE_FROM_DATABASE=AHCI 1.0
E: ID_PCI_SUBCLASS_FROM_DATABASE=SATA controller
E: ID_VENDOR_FROM_DATABASE=Intel Corporation
E: MODALIAS=pci:v00008086d0000A102sv00001849sd0000A102bc01sc06i01
E: PCI_CLASS=10601
E: PCI_ID=8086:A102
E: PCI_SLOT_NAME=0000:00:17.0
E: PCI_SUBSYS_ID=1849:A102
E: SUBSYSTEM=pci
E: USEC_INITIALIZED=5776122

P: /devices/pci0000:00/0000:00:17.0/ata1/ata_port/ata1
E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/ata_port/ata1
E: SUBSYSTEM=ata_port

P: /devices/pci0000:00/0000:00:17.0/ata1/host1
E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/host1
E: DEVTYPE=scsi_host
E: SUBSYSTEM=scsi

P: /devices/pci0000:00/0000:00:17.0/ata1/host1/scsi_host/host1
E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/host1/scsi_host/host1
E: SUBSYSTEM=scsi_host

P: /devices/pci0000:00/0000:00:17.0/ata1/host1/target1:0:0
E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/host1/target1:0:0
E: DEVTYPE=scsi_target
E: SUBSYSTEM=scsi

P: /devices/pci0000:00/0000:00:17.0/ata1/host1/target1:0:0/1:0:0:0
E: DEVPATH=/devices/pci0000:00/0000:00:17.0/ata1/host1/target1:0:0/1:0:0:0
E: DEVTYPE=scsi_device
E: DRIVER=sd
E: MODALIAS=scsi:t-0x00
E: SUBSYSTEM=scsi

There is an additional ASM1062 SATA controller which controls the DVD drives. There is also a 3ware controller with HDDs attached. Both should not have a major impact on the SSD performance.
Possibly storage-related modules - as far as I understand it; note, however, that the SSD disks are not encrypted:

intel_rapl             24576  0
intel_powerclamp       16384  0
crc32c_intel           24576  2
aesni_intel           167936  0
aes_x86_64             20480  1 aesni_intel
lrw                    16384  1 aesni_intel
glue_helper            16384  1 aesni_intel
ablk_helper            16384  1 aesni_intel
cryptd                 20480  2 aesni_intel,ablk_helper
pinctrl_intel          20480  1 pinctrl_sunrisepoint
intel_lpss_acpi        16384  0
intel_lpss             16384  1 intel_lpss_acpi
mfd_core               16384  1 intel_lpss
ahci                   36864  18
libahci                36864  1 ahci
libata                270336  2 ahci,libahci
async_raid6_recov      20480  1 raid456
async_memcpy           16384  2 raid456,async_raid6_recov
libcrc32c              16384  1 raid456
async_pq               16384  2 raid456,async_raid6_recov
async_xor              16384  3 async_pq,raid456,async_raid6_recov
async_tx               16384  5 async_pq,raid456,async_xor,async_memcpy,async_raid6_recov
raid6_pq              106496  4 async_pq,raid456,btrfs,async_raid6_recov
raid10                 53248  2
md_mod                147456  4 raid456,raid10
dm_mod                122880  20
sg                     40960  0
scsi_mod              262144  6 sg,st,3w_9xxx,libata,sd_mod,sr_mod
3w_9xxx                45056  6

I shall provide a complete HW component list created by hwinfo plus an lsmod list for both openSUSE Leap 42.2 and openSUSE Leap 42.1 as soon as I find the time. The same goes for a different, later kernel such as 4.7 - I think there is a Tumbleweed installation on one of the disks, so I would not even have to touch the Leap test installations.
(In reply to Shaohua Li from comment #12)
> No, don't need blktrace for the raid array. It's clear the single disk
> performance drops in 4.4. if you use noop ioscheduer, anything changes? I
> bet not, but want to confirm.
> Since the test is 64k bs, the iops isn't very high, 6400/s, I can't imagine
> CPU side can affect the performance for such low IOPS (eg, the switching
> between the CPUs), but I could be wrong. What kind of driver/HBA are you
> using? I'd suggest involving people from the driver side, since this sounds
> like a driver bug to me. In the meaning time, I'd suggest you try a newer
> kernel and check if the single disk performance changes.

Meanwhile I did a clean and minimal install of openSUSE Tumbleweed with kernel 4.7. All partitions on the SSDs for the raid arrays, and the raid arrays themselves, were discovered correctly; udev treated them as expected.

Unfortunately, things did not get better but worse. The random write performance is now consistently, and without spread, at the lower end of what I got for kernel 4.4.

Results of 6 consecutive fio (version 2.15) test runs on /dev/sdb6:
----------------------------------------------------------------
382660KB/s, 381520KB/s, 382374KB/s, 382089KB/s, 381520KB/s, 383520KB/s

So this marks a drop in single SSD performance of about 80 MB/sec on kernel 4.7 compared to kernel 4.1 - which gave me 460 MB/sec. On raid arrays the relative loss in performance is even bigger. I would describe this as significant. Either fio delivers wrong results, or something major happened between kernels 4.1, 4.4 and 4.7 - e.g. regarding the drivers of the Intel SATA controller. Or could it be something specific to openSUSE's kernel configuration?

What still worries me a bit is the fact that other partitions (sdb2, sdb3, sdb5) on /dev/sdb are still part of md arrays. Could this have an impact on the performance measured for the raw partition /dev/sdb6?
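The "about 80 MB/sec" drop quoted above can be checked against the raw run results; a quick sketch (editor's addition, using the kernel 4.7 numbers from this comment and the ~460 MB/s kernel 4.1 figure quoted earlier in the thread):

```python
from statistics import mean

# fio random-write results (KB/s) for /dev/sdb6 on kernel 4.7, from above.
runs_47 = [382660, 381520, 382374, 382089, 381520, 383520]

avg_47 = mean(runs_47)                   # ~382000 KB/s, very consistent
spread_47 = max(runs_47) - min(runs_47)  # only 2000 KB/s of spread

# Kernel 4.1 gave ~460 MB/s (460000 KB/s) on the same partition, so:
drop_mb_s = (460000 - avg_47) / 1000     # ~78 MB/s - "about 80 MB/sec"
print(round(avg_47), spread_47, round(drop_mb_s))
```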
Is the raid array running a resync? If not, I have no idea what could cause the slowness. Also, why is read slower than write on the disk?
(In reply to Shaohua Li from comment #15)
> does the raid array run for resync? if not, I have no idea what could cause
> the slowness. why read is slower than write in the disk?

Sorry for my late answer. Just not to mix up different topics and problems:

1) The last tests were done for random writing only
----------------------------------------------------
All the last fio tests did random writing - for random writing the performance loss is most significant. Also the last data given for /dev/sdb6 (single disk raw partition access) were for random writes. I wonder a bit why you ask about the difference in read/write performance.

2) Nevertheless, regarding your question about read vs. write performance:
-----------------------------------------------------------------------
If you look at the following web site
http://www.anandtech.com/show/6935/seagate-600-ssd-review/5
you will find that a generally and significantly lower random read rate in comparison to the write rate is very common for several types of SSDs - very much in contrast to the sequential region. It also depends strongly on the chosen block size of the data written. But I am no specialist on this question - I did find a lower random read performance for small block sizes (4 to 64 KB) on different Samsung SSDs (PROs and EVOs) on very different Linux systems.

3) Dependency on resyncing?
-----------------------------
I double checked: no, the raid array tests were NOT done in a phase of resyncing. Also, during the tests for the single partition /dev/sdb6, the other partitions (parts of arrays) were not resyncing.

4) Dependency on kernel 4.1 (openSUSE Leap 42.1), kernel 4.4 (openSUSE Leap 42.2), kernel 4.7 (openSUSE Tumbleweed):
I can only state here what I see in fio tests (with the same fio version and 64 KB block size):

a) Random write performance on a single partition /dev/sdb6 is around 460 MB/sec for kernel 4.1.
b) Random write performance on a single partition /dev/sdb6 varies between 380 MB/sec and at most 460 MB/sec for kernel 4.4 - with most results clearly below 460 MB/sec.
c) Random write performance is consistently close to 380 MB/sec for kernel 4.7.

All for a Samsung SSD 850 EVO.

Regarding Raid 10 arrays:
---------------------------
There is a huge discrepancy in random write performance between kernel 4.1 and 4.4:

a) Kernel 4.1: around 640 MB/sec for a raid array chunk size of 32 KB and a fio test block size of 64 KB. That this rate is much larger than the single disk performance was to be expected.
b) Kernel 4.4: performance varies over a large scale - on average more around 450 MB/sec, with a lower end at approx. 380 MB/sec.

I have not yet tested arrays on kernel 4.7.
Created attachment 251211 [details]
output of hwinfo on kernel 4.4 system (openSUSE Leap 42.2)

The attachment contains the output of hwinfo on the openSUSE Leap 42.2 system with kernel 4.4.
Created attachment 251221 [details]
output of lsmod on the kernel 4.4 system (openSUSE Leap 42.2)

The attachment contains the output of lsmod on the openSUSE Leap 42.2 system with kernel 4.4.
cc Tejun, he is the maintainer of ata; probably he can find something on the driver side.
Hmm... I can't think of much which could have affected that. We switched away from threaded interrupt handling in ahci but that's 4.5. Just in case, can you please test how a more recent kernel behaves? Thanks.
(In reply to Tejun Heo from comment #20)
> Hmm... I can't think of much which could have affected that. We switched
> away from threaded interrupt handling in ahci but that's 4.5. Just in case,
> can you please test how a more recent kernel behaves? Thanks.

Hi, thank you for your interest in my problem. I had a look at kernel 4.9 on openSUSE Tumbleweed and kernel 4.1 (openSUSE Leap 42.1) on the same HW platform. For comparison I did 2 tests:

- one on a raw Raid10 array device - /dev/md129, based on raw raid partitions /dev/sd[a-d]5 (type 0xFD, no LVM, no ext4) on 4 Samsung EVO 850;
- one on a separate raw Linux test partition /dev/sde2 (type 0x83) on the same SSD that holds the OS (Samsung 840 PRO; this SSD is not part of any md array).

I enriched the fio test a bit: the random write job now follows after a separate random read job (1 job at a time, bs=64k = twice the raid chunk size). For both randread and randwrite, only 1 job was requesting access to the partitions or raid arrays at a time. I waited 10 minutes to avoid the impact of any kind of jobs after booting. No X11 or graphical interface was started.
Results for the Raid 10 array /dev/md129
-------------------------------------------
(all in MB/sec)

Kernel 4.9 - Random Read:
357, 367, 350, 353, 343, 363, 372, 384, 385, 378, 371, 402, 352, 336, 376, 360
Kernel 4.1 - Random Read:
407, 399, 397, 401, 397, 404, 399, 399, 407, 395
Kernel 4.9 - Random Write:
611, 635, 400, 414, 373, 649, 369, 375, 484, 363, 645, 668, 364, 366, 645, 370
Kernel 4.1 - Random Write:
654, 664, 667, 660, 657, 664, 667, 675, 653, 664

Results for raw partition /dev/sde2
-------------------------------------
Kernel 4.9 - Random Read:
211, 206, 214, 206, 211, 208, 213, 211, 217, 210, 217, 209, 203, 212
Kernel 4.1 - Random Read:
245, 237, 238, 231, 231, 233, 232, 230, 230, 234, 232, 233, 233
Kernel 4.9 - Random Write:
288, 351, 292, 352, 352, 335, 295, 317, 252, 351, 350, 299, 295, 328
Kernel 4.1 - Random Write:
348, 343, 342, 353, 349, 351, 351, 351, 347, 354, 349, 351, 353

In addition I looked at the access to /dev/sdb6 - /dev/sdb has several other partitions belonging to Raid 10 and Raid 5 md arrays.

Results for raw partition /dev/sdb6
-------------------------------------
Kernel 4.9 - Random Read:
229, 248, 239, 246, 247, 234, 234, 242, 241, 235, 246, 230, 222, 232, 251, 235
Kernel 4.1 - Random Read:
278, 295, 286, 286, 295, 284, 283, 295, 287
Kernel 4.9 - Random Write:
455, 366, 459, 391, 459, 459, 459, 444, 451, 456, 459, 455, 460, 458, 396, 460
Kernel 4.1 - Random Write:
446, 449, 460, 451, 459, 459, 457, 455, 459

My conclusion:
---------------
Random read is consistently worse for kernel 4.9 than for kernel 4.1 in the chosen fio test setup (1 single read job with a relatively small block size) - independent of whether arrays or single disks are accessed. This finding also seems to be independent of the SSD type. Random write on the md raid10 array shows the same large variation as found on kernel 4.4. On average, the random write performance on md raid 10 arrays is lower for kernel 4.9 than for kernel 4.1.
Random Write for single raw partition access shows somewhat more variation on kernel 4.9 than on kernel 4.1; however, the spread is less pronounced on kernel 4.9 than on kernel 4.4. Note, though, that the random write performance on /dev/sdb6 (while varying) is higher than on kernel 4.7 (where it stayed at 380 MB/sec; see comment 16).

Additional remarks
------------------
I admit that the test setup is a bit special (only one job requesting access to the tested array or partition at a time, plus relatively small "bs" values). The overall performance values would of course be different with several parallel read/write jobs requesting access to the Raid array. But preliminary tests also showed a faster asymptotic approach to the sequential limits on kernel 4.1 than on later kernels. Still, the question remains why kernel 4.1 gives me higher values on average, with a significantly lower standard deviation, especially for Random Writes on the Raid 10 array.

In addition I should mention that I can reproduce the better results on the kernel 4.1 system even with a much higher number of jobs running in the background: samba plus apache server running, full KDE with akonadi, ODT files open, browsers running, plus ksysguard, gkrellm and amarok active. The very good raid performance on the kernel 4.1 system is almost indifferent to background load of up to 20% on some processor cores. The kernel 4.9 results, in contrast, were measured on a minimal installation without KDE running.
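To make the "spread" claim concrete, the per-run MB/sec figures quoted above can be reduced to a mean and a (population) standard deviation with a small awk helper; the two example calls use the Random Write runs on /dev/md129 from the table above.

```shell
#!/bin/sh
# Mean and population standard deviation of per-run throughput figures
# (MB/sec), to quantify the run-to-run "spread" discussed above.
stats() {
  printf '%s\n' "$@" | awk '
    { s += $1; ss += $1 * $1; n++ }
    END { m = s / n; printf "mean=%.1f stddev=%.1f\n", m, sqrt(ss / n - m * m) }'
}

# kernel 4.1 random-write runs on /dev/md129 (tight spread)
stats 654 664 667 660 657 664 667 675 653 664
# kernel 4.9 random-write runs on /dev/md129 (far larger spread)
stats 611 635 400 414 373 649 369 375 484 363 645 668 364 366 645 370
```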
Thanks for the comprehensive testing, but as all the tests seem to give about the same results, let's narrow it down to a random read test on a single device. I frankly have no idea. Given that there is a reliable reproducer and comparison points, bisecting seems like the best course of action. We can try to find out where the latency is actually coming from - it might not even be the hardware or driver but somewhere higher up in the stack. Hmm... but yeah, would it be difficult for you to bisect the problem? Thanks.
Hi, sorry for the long delay in answering the last mail. Some days ago I found the time to look at SSD Raid arrays again. By chance I stumbled over some articles on CPU governors while testing. On my system Intel's pstate driver (intel_pstate) is relevant (i7 4700k CPU). The default governor on Opensuse Leap 42.2 is "powersave". (There are only 2 choices: "powersave" and "performance".)

Reading about problems people had with intel_pstate on Red Hat 7 led me to test the Raid performance again - this time with the CPU governor set to "performance" for all CPU cores/threads. And guess what: the measured performance values on kernel 4.4 (Leap 42.2) were on average much, much closer (if not identical) to the values I found on Opensuse Leap 42.1.

Preliminary test results:
*************************
md-raid-10, chunk size: 32k, bitmap: none

Case 1:
-------
fio bs=512k, numjobs=1, 500 MB data, Random Read [RR] / Random Write [RW]
Leap 42.1, Kernel 4.1:                        RR 780 MByte/sec - RW 754 MByte/sec, RR spread around 35 MByte/sec
Leap 42.2, Kernel 4.4, governor: powersave:   RR 669 MByte/sec - RW 606 MByte/sec, RR spread around 50 MByte/sec
Leap 42.2, Kernel 4.4, governor: performance: RR 780 MByte/sec - RW 750 MByte/sec, RR spread around 35 MByte/sec

Case 2:
-------
fio bs=1024k, numjobs=1, 500 MB data, Random Read [RR] / Random Write [RW]
Leap 42.1, Kernel 4.1:                        RR 860 MByte/sec - RW 800 MByte/sec, RR spread around 30 MByte/sec
Leap 42.2, Kernel 4.4, governor: powersave:   RR 735 MByte/sec - RW 660 MByte/sec, RR spread > 50 MByte/sec
Leap 42.2, Kernel 4.4, governor: performance: RR 877 MByte/sec - RW 792 MByte/sec, RR spread around 25 MByte/sec

Although not systematically tested, this seems to be significant. I now regard it as quite probable that the source of the significant performance reduction I saw for md-Raid-10 arrays with SSDs when comparing kernel 4.4 with 4.1 is a change in the CPU governor settings or some general change in the intel_pstate driver. This change affects at least Opensuse Leap 42.2 with kernel 4.4, in contrast to Leap 42.1 with kernel 4.1.

It seems to me that the "powersave" governor does NOT react to fio tests and a rise in I/O load over a certain period of time at all - or it does not react fast enough. The frequency of the CPU cores does not go up to top values during fio runs of the above type. Due to whatever reasons: CPU load below thresholds? Poor time resolution of the load analysis? No reaction to I/O spikes? Anyway, the effect is dramatic! A relatively high CPU frequency seems to be necessary to get optimum performance out of md-raid-10 arrays, at least for spike-like loads. Which makes me wonder whether Intel's "powersave" governor (based on the intel_pstate driver) is in general a poor choice when one wants to use md raid arrays.

Any ideas or comments? Anything I could do? What parameters (if there are any) of the intel_pstate governors could help to increase the performance of md-Raid arrays?
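The governor switch described above can be done through the standard cpufreq sysfs interface. A minimal sketch, with the sysfs root passed as a parameter so the functions can be exercised against a fake tree; on a real system one would call `set_all_governors /sys/devices/system/cpu performance` as root (and switch back to "powersave" afterwards).

```shell
#!/bin/sh
# Sketch: inspect and switch the cpufreq scaling governor via sysfs.
# The sysfs root is a parameter; on a real system use
# /sys/devices/system/cpu (requires root for writing).

show_governors() {
  root=$1
  # print the active driver and governor for each logical CPU
  for d in "$root"/cpu[0-9]*/cpufreq; do
    [ -d "$d" ] || continue
    echo "$(basename "$(dirname "$d")"): driver=$(cat "$d"/scaling_driver) governor=$(cat "$d"/scaling_governor)"
  done
}

set_all_governors() {
  root=$1; gov=$2
  # write the requested governor for every CPU that exposes cpufreq
  for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -f "$f" ] || continue
    echo "$gov" > "$f"
  done
}
```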
Sorry, the processor I use on the test system is not an "i7 4700k" but an "i7 6700k".
I checked the CPU governor on an Opensuse Leap 42.1 system: in fact, on Leap 42.1 the good old (ACPI-based) "ondemand" governor does the work - and not the "new" intel_pstate-based "powersave" governor for HWP. So there is a major difference between Leap 42.1 and 42.2 (besides the kernel version), and this difference may explain the observations described in comment #23 (and the previous findings). Which in turn raises the question whether the intel_pstate "powersave" governor should be avoided on systems with md raid arrays of SSDs.
It's a known issue: CPU power-saving modes can hurt performance for fast devices, SSDs for example.
(In reply to Shaohua Li from comment #26)
> it's a known issue cpu power saving mode can hurt performance for fast
> devices, ssd for example.

And I had to learn it the hard way ... :-) For me it was surprising that the impact was so significant, and I had expected that the folks at Intel would have tested the impact of their governor on their own onboard Raid controllers a bit more thoroughly. I switched back to the "ondemand" governor on my systems - and the performance is now perfect again. The "bug" was no bug at all on the kernel or driver side. Nevertheless, thank you very much for the support and suggestions!