Bug 25272 - md bad speed on raid1 (feature request) new read algorithm
Summary: md bad speed on raid1 (feature request) new read algorithm
Status: RESOLVED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: All Linux
Importance: P1 normal
Assignee: io_md
URL:
Keywords:
Duplicates: 17401
Depends on:
Blocks:
 
Reported: 2010-12-20 01:48 UTC by rspadim
Modified: 2012-08-14 15:00 UTC
CC: 3 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description rspadim 2010-12-20 01:48:05 UTC
Hello, I created a RAID1 array and the read speed seems slow.
With one reader on /dev/md3 I get a lower speed than reading /dev/sdb (a member device) directly; that's OK, some overhead from the kernel md module is expected.
But when I have 2 readers the total speed stays the same. I was expecting one device to serve one process and the other device to serve the other process, but that is not what the kernel does.
Could we make the read algorithm selectable?
The kernel default could stay as it is, and we could add more algorithms, e.g. a round robin that divides the md device across the member devices (for example, 100 MB across 3 devices: 33 MB from the first, 33 MB from the second, 33 MB from the third, so concurrent reads get roughly the aggregate speed of three devices), maybe something like the noop, CFQ and other I/O elevators. Nice? Thanks guys. I don't know if this is the right place for a to-do; I know it's not a bug, more of a feature request.
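For illustration only, here is a minimal userspace C sketch of that round-robin split; the names NR_MIRRORS, CHUNK_BYTES and pick_mirror are made up and are not md code. Consecutive chunks of one large read rotate across the mirror members:

/*
 * Hypothetical sketch: rotate consecutive chunks of one large
 * sequential read across the mirror members of a raid1 array.
 */
#include <stdio.h>

#define NR_MIRRORS   3
#define CHUNK_BYTES  (32 * 1024 * 1024)   /* 32 MB per chunk, arbitrary */

/* Choose a mirror for the chunk that starts at byte 'offset'. */
static int pick_mirror(unsigned long long offset)
{
    return (int)((offset / CHUNK_BYTES) % NR_MIRRORS);
}

int main(void)
{
    unsigned long long total = 100ULL * 1024 * 1024;   /* a 100 MB read */
    unsigned long long off;

    for (off = 0; off < total; off += CHUNK_BYTES)
        printf("chunk at %llu MB -> mirror %d\n", off >> 20, pick_mirror(off));
    return 0;
}

With three mirrors this would send roughly a third of the data to each device, which is the aggregate-bandwidth behaviour asked for above.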

raid:
cat /proc/mdstat:
md3 : active raid1 sdd4[3] sdc4[2] sdb4[1]
      193654446 blocks super 1.2 [3/3] [UUU]


dd if=/dev/md3 of=/dev/null
iotop results:

Total DISK READ: 40.32 M/s | Total DISK WRITE: 9.10 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 4251 be/4 root       40.42 M/s    0.00 B/s  0.00 % 33.40 % dd if=/de~=/dev/null


sdb:
dd if=/dev/sdb of=/dev/null
iotop results:
Total DISK READ: 83.96 M/s | Total DISK WRITE: 10.07 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 4299 be/4 root       83.96 M/s    0.00 B/s  0.00 % 70.45 % dd if=/de~=/dev/null


all disks dd:
dd if=/dev/sdb of=/dev/null
dd if=/dev/sdc of=/dev/null
dd if=/dev/sdd of=/dev/null


iotop results:
Total DISK READ: 208.11 M/s | Total DISK WRITE: 9.29 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 4301 be/4 root       66.78 M/s    0.00 B/s  0.00 % 73.68 % dd if=/de~=/dev/null
 4302 be/4 root       72.24 M/s    0.00 B/s  0.00 % 73.06 % dd if=/de~=/dev/null
 4303 be/4 root       69.22 M/s    0.00 B/s  0.00 % 70.32 % dd if=/de~=/dev/null
Comment 1 Neil Brown 2010-12-20 02:00:42 UTC
If you can provide code - with measurements that show it to be better, then I'm very happy to consider it.

However, I have way too many other things to work on at the moment to consider this.

The 'read_balance' code is quite easy to find in drivers/md/raid1.c.
Try an experiment and see if you can make it faster!
Comment 2 rspadim 2010-12-20 03:36:35 UTC
Thanks guys, I will try.
First I will read the code :)
Thanks
Comment 3 rspadim 2010-12-20 04:06:45 UTC
Hmm... I found the problem. I have this:

[root@agra md]# fdisk /dev/sda -l

Disk /dev/sda: 320.1 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders, total 625142448 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x69190750

Device Boot           Start         End      Blocks   Id
/dev/sda1   *          63      224909      112423+  fd  
/dev/sda2          224910    33013574    16394332+  82  
/dev/sda3        33013575   237826259   102406342+  fd  
/dev/sda4       237826260   625137344   193655542+  fd  


And I have 3 RAID arrays on the same disk (sda1, sda3, sda4), so the head position md tracks is per array member (partition), not per physical device...


Could we optimize on a per-mirror basis? Maybe one disk for each mirror, or something like that?
Comment 4 rspadim 2010-12-20 04:44:27 UTC
I was thinking about device/mirror speed...

We have 2 speeds...

Head positioning speed (moving from any position to the requested byte position, in seconds per byte of distance or seconds per head distance); I will call it ALIGN_TIME.

Read speed for x bytes or 1 page (time per byte); I will call it READ_TIME.


Let's think... we want the closest head (OK, min(ALIGN_TIME))... but we want the highest read speed too (min(READ_TIME))... and with nbd (network block device) we also want the smallest ping time (NETWORK_TIME).

Let's think about an array with different devices: hdd (disk), ssd (flash memory), nbd (disk), nbd (flash memory).

What we need: min(TIME)

ALIGN_TIME:
hdd - head positioning, from the current position to the requested read position (maybe distance * (head movement time / distance); distance could be in bytes or in head positions? what information can we read about the hdd?)
ssd - none? maybe a small setup time? I think it is very fast...

NETWORK_TIME:
nbd - network speed (ping time?) + collision rate? any other information about the client-server link, maybe just an estimated number... bytes/second, packet size (bytes/packet), packets/second?
Maybe byte/packet, packet/second and... packet time + byte time?
Maybe three pieces of information: time per byte + bytes per packet + time per packet.

With a packet size of 500 bytes we would have:
1 byte = 1 packet * (time per packet) + 1 byte * (time per byte)
1000 bytes = 2 packets * (time per packet) + 1000 bytes * (time per byte)

variables: NETWORK_BYTE_SECOND (time per byte), NETWORK_PACKET_SECOND (time per packet), NETWORK_BYTE_PACKET (bytes per packet)


READ_TIME:
If we know how many bytes to read -> time = bytes * (time per byte)
If we don't know how many bytes to read, maybe use a raid1 "page" size: time = page_size * (time per byte)


what we could do...
/proc/sys/dev/raid/device_speed/xxxxx/ALIGN_TIME   (default = 1)
/proc/sys/dev/raid/device_speed/xxxxx/READ_TIME    (default = 1)
/proc/sys/dev/raid/device_speed/xxxxx/NETWORK_BYTE_SECOND (default = 1)
/proc/sys/dev/raid/device_speed/xxxxx/NETWORK_PACKET_SECOND (default = 1)
/proc/sys/dev/raid/device_speed/xxxxx/NETWORK_BYTE_PACKET (default = 1500)
/proc/sys/dev/raid/device_speed/default_page_size  (default = 4096 ?)

xxxxx = sda1, sda2, sdb1, nbd1, nbd2 ... the device name (the same as in /proc/mdstat)

The best device is the one with:
min( distance * ALIGN_TIME + bytes * READ_TIME + bytes * NETWORK_BYTE_SECOND + packets * NETWORK_PACKET_SECOND )
If we don't know the byte count -> bytes = default_page_size
packets = ceil(bytes / NETWORK_BYTE_PACKET); if (NETWORK_BYTE_PACKET == 0), packets = 1
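Purely as a sketch of that formula in userspace C (the struct fields mirror the proposed tunables; all values and the helper names dev_cost/dev_speed are hypothetical; compile with -lm):

#include <math.h>
#include <stdio.h>

struct dev_speed {
    const char *name;
    double align_time;            /* seconds per unit of head distance   */
    double read_time;             /* seconds per byte read               */
    double network_byte_second;   /* seconds per byte on the wire        */
    double network_packet_second; /* seconds per packet                  */
    double network_byte_packet;   /* bytes per packet (0 = local device) */
    double head_pos;              /* last known head position            */
};

/* Estimated time to serve 'bytes' at 'target' from this device. */
static double dev_cost(const struct dev_speed *d, double target, double bytes)
{
    double distance = fabs(target - d->head_pos);
    double packets  = d->network_byte_packet > 0
                      ? ceil(bytes / d->network_byte_packet) : 1.0;

    return distance * d->align_time
         + bytes    * d->read_time
         + bytes    * d->network_byte_second
         + packets  * d->network_packet_second;
}

int main(void)
{
    struct dev_speed devs[] = {
        { "sdb4", 1e-9, 1e-8, 0,    0,    0,    1000000 },
        { "sdc4", 1e-9, 1e-8, 0,    0,    0,    5000000 },
        { "nbd0", 0,    5e-9, 1e-8, 2e-4, 1500, 0       },
    };
    double target = 1200000, bytes = 4096;  /* bytes falls back to the page size */
    int i, best = 0;

    for (i = 1; i < 3; i++)
        if (dev_cost(&devs[i], target, bytes) < dev_cost(&devs[best], target, bytes))
            best = i;
    printf("best device: %s\n", devs[best].name);
    return 0;
}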


What do you think?
With a very poor TCP/IP connection (wifi, 3G, or a modem), nbd would almost never be used (ALIGN_TIME and READ_TIME may be low, but the NETWORK_* times would be very high).

Is it a good idea?
A second problem...
If one device is always used (because it is very fast), it will wear out faster (mean time between failures). MTBF is another useful piece of information, maybe expressed as total seeks or something like that. I don't know; that is a wear optimization rather than a speed optimization... let's just think about read optimization for now.

Nice idea?
Comment 5 rspadim 2010-12-20 04:46:05 UTC
Sorry, we have many speeds, not just 2 :) hehehe
Comment 6 rspadim 2010-12-20 04:55:16 UTC
How about some calibration code at RAID startup?

Let's think...
First...

FOR DISKS -> HDD/SSD
lock the disk (no one else can read from this device)

read one byte at position 1
(the disk head is now at position 1)

set initial_time = now()
read one byte at the last disk position
(the disk head is now at position MAX)
set end_time = now()
unlock the disk/device
ALIGN_TIME = (end_time - initial_time) / MAX


Second...

lock the disk (no one else can read from this device)
read one byte at position 1
(the disk head is now at position 1)

set initial_time = now()
{
  read default_page_size bytes
} repeat this N times
set end_time = now()
unlock the disk/device

READ_TIME = (end_time - initial_time) / (default_page_size * N)
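A userspace sketch of this calibration, assuming it is run on an otherwise idle block device (the device path and N are just example values; page-cache effects and O_DIRECT alignment are ignored, so this only illustrates the shape of the measurement):

#define _FILE_OFFSET_BITS 64
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKGETSIZE64 */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdb";
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror(dev); return 1; }

    unsigned long long size = 0;
    ioctl(fd, BLKGETSIZE64, &size);

    char buf[4096];
    double t0, t1;

    /* ALIGN_TIME: read at the start, then at the end of the device. */
    pread(fd, buf, 512, 0);
    t0 = now();
    pread(fd, buf, 512, size - 512);
    t1 = now();
    double align_time = (t1 - t0) / (double)size;       /* seconds per byte of distance */

    /* READ_TIME: N sequential page-sized reads from the start. */
    int n = 1024;
    pread(fd, buf, 512, 0);                             /* reposition the head */
    t0 = now();
    for (int i = 0; i < n; i++)
        pread(fd, buf, sizeof(buf), (long long)i * sizeof(buf));
    t1 = now();
    double read_time = (t1 - t0) / ((double)n * sizeof(buf));   /* seconds per byte read */

    printf("%s: ALIGN_TIME=%.3e s/byte  READ_TIME=%.3e s/byte\n",
           dev, align_time, read_time);
    close(fd);
    return 0;
}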


------
network:
 NETWORK_BYTE_SECOND, NETWORK_PACKET_SECOND, 

What is NETWORK_BYTE_PACKET? Maybe the MTU? (1500)

NETWORK_PACKET_SECOND = "ping" time of a 1-byte packet
NETWORK_BYTE_SECOND = (("ping" time of one packet of NETWORK_BYTE_PACKET bytes) - NETWORK_PACKET_SECOND) / NETWORK_BYTE_PACKET

Nice?
I think it's good enough.
Maybe an online or cron-based update could make the network estimates quite good... the disks too, maybe with iotop or another userspace program (not a kernel module or kernel code...).


thanks guys
Comment 7 rspadim 2010-12-20 04:57:41 UTC
For nbd, we should measure READ_TIME and ALIGN_TIME on the daemon (server) side... measuring on the client side would mix them up with the NETWORK_* times.
Comment 8 rspadim 2010-12-20 05:11:15 UTC
I don't know how to write kernel code (I don't know the data structures, or which variables and functions to use). Any source to read and understand?

I was thinking about a raid1_read_algorithm setting (for each md device):

1. minimal_time (maximum read speed, what I proposed)
2. closest_head (today's algorithm); I don't know if it is good for SSDs...
3. disk_use / round_robin (choose the disk by usage: X seconds per disk, or Y bytes per disk)

I don't know any other algorithm... any other ideas? (A minimal sketch of such a policy switch follows below.)
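Here is that sketch, in plain userspace C; the enum, struct and helper names are invented for illustration and are not existing md code:

#include <stdio.h>
#include <stdlib.h>

enum read_policy { POLICY_MINIMAL_TIME, POLICY_CLOSEST_HEAD, POLICY_ROUND_ROBIN };

struct mirror {
    long long head_pos;       /* last known head position              */
    double    est_cost;       /* estimated service time (minimal_time) */
    long long bytes_served;   /* running byte total (round_robin)      */
};

/* Pick a mirror index for a read at 'sector' under the chosen policy. */
static int pick_read_disk(enum read_policy policy,
                          const struct mirror *m, int nr, long long sector)
{
    int i, best = 0;

    for (i = 1; i < nr; i++) {
        switch (policy) {
        case POLICY_MINIMAL_TIME:
            if (m[i].est_cost < m[best].est_cost)
                best = i;
            break;
        case POLICY_CLOSEST_HEAD:
            if (llabs(m[i].head_pos - sector) < llabs(m[best].head_pos - sector))
                best = i;
            break;
        case POLICY_ROUND_ROBIN:
            if (m[i].bytes_served < m[best].bytes_served)
                best = i;
            break;
        }
    }
    return best;
}

int main(void)
{
    struct mirror m[3] = {
        { 100, 0.004, 1000 },
        { 900, 0.001, 5000 },
        {  50, 0.002,  200 },
    };
    /* The same mirror state, three different policies. */
    printf("minimal_time -> disk %d\n", pick_read_disk(POLICY_MINIMAL_TIME, m, 3, 500));
    printf("closest_head -> disk %d\n", pick_read_disk(POLICY_CLOSEST_HEAD, m, 3, 500));
    printf("round_robin  -> disk %d\n", pick_read_disk(POLICY_ROUND_ROBIN,  m, 3, 500));
    return 0;
}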
Comment 9 rspadim 2010-12-20 05:26:16 UTC
I was thinking only about reads... with mixed read/write the picture (speed) could change... (the best speed could still be closest_head; only a real test will tell...)
Comment 10 rspadim 2010-12-20 13:12:55 UTC
There is some information at:

/sys/block/xxxxx/

for example:
/sys/block/sda/queue/logical_block_size  (default_page_size?)
/sys/block/sda/queue/minimum_io_size     (default_page_size?)
/sys/block/sda/queue/rotational          (ssd?)
/sys/block/sda/queue/iosched             
/sys/block/sda/queue/scheduler           (is that what we want? an md I/O scheduler? I think we just need a "chooser" for which disk to read from)

any other idea?
Comment 11 rspadim 2010-12-20 13:54:42 UTC
I was using iostat together with dd...
Using only one dd if=/dev/md2 of=/dev/zero
I get about 100 MB/s on sda and 0 MB/s on sdb, sdc, sdd.
Very nice...

With two dd processes
it doesn't settle down (it does not stick to only 2 disks); I see some reads on sda, sdb, sdc, sdd, but the total stays at 100 MB/s.

Maybe we could implement another scheme...

Instead of:
read from md2
select a disk
read the data
return the data to the kernel

we could do:
read from md2
select a disk (one that is not reading/writing, or the one with the minimal time until its current I/O ends)
record when the I/O should end (a time)
read the data
clear the pending-I/O end time
return the data to the kernel

So, a time-based read optimization again... if there is a read on sda that will only finish after 1 second, and no operation on sdb, why should we use sda just because of the closest head position? Two disks working is faster here. (A minimal sketch of this pending-I/O-aware selection follows below.)
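Sketching only the selection step of that idea (busy_until and the helper name are invented): each member remembers when its outstanding I/O is expected to finish, and a new read goes to the member that can start soonest.

#include <stdio.h>

struct rdev_state {
    const char *name;
    double busy_until;    /* time when the current I/O should end; 0 = idle */
};

/* Pick the device that can start this read the soonest. */
static int pick_soonest_free(const struct rdev_state *d, int nr, double now)
{
    int i, best = 0;

    for (i = 1; i < nr; i++) {
        double start_best = d[best].busy_until > now ? d[best].busy_until : now;
        double start_i    = d[i].busy_until    > now ? d[i].busy_until    : now;
        if (start_i < start_best)
            best = i;
    }
    return best;
}

int main(void)
{
    /* sda is busy for another second, sdb is idle: the read goes to sdb. */
    struct rdev_state devs[] = { { "sda", 11.0 }, { "sdb", 0.0 } };
    double now = 10.0;

    printf("read goes to %s\n", devs[pick_soonest_free(devs, 2, now)].name);
    return 0;
}

The closest-head rule could still be used as a tie-break among members that are equally free.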
Comment 13 rspadim 2010-12-20 21:19:01 UTC
Some more study...


Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               2,00         0,00        24,00          0         24
sdb               2,00         0,00        24,00          0         24
sdc               0,00         0,00         0,00          0          0
sdd             932,00    111100,00         0,00     111100          0
md0               0,00         0,00         0,00          0          0
md1               0,00         0,00         0,00          0          0
md2               1,00         0,00         8,00          0          8
md3           111104,00    111104,00         0,00     111104          0
nb0               0,00         0,00         0,00          0          0




A lot of tps on md3, maybe because of the I/O queue on sdd.



Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0,00         0,00         0,00          0          0
sdb               0,00         0,00         0,00          0          0
sdc             467,00     59032,00         0,00      59032          0
sdd             472,00     59528,00         0,00      59528          0
md0               0,00         0,00         0,00          0          0
md1               0,00         0,00         0,00          0          0
md2               0,00         0,00         0,00          0          0
md3           118560,00    118560,00         0,00     118560          0
nb0               0,00         0,00         0,00          0          0
With a second dd thread, the tps is split between sdc and sdd (maybe by the sdc/sdd scheduler).

------------------------------------------------------------------------------------------------------------------
changed to noop:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0,00         0,00         0,00          0          0
sdb               0,00         0,00         0,00          0          0
sdc             887,00    112896,00         0,00     112896          0
sdd               0,00         0,00         0,00          0          0
md0               0,00         0,00         0,00          0          0
md1               0,00         0,00         0,00          0          0
md2               0,00         0,00         0,00          0          0
md3           112896,00    112896,00         0,00     112896          0
nb0               0,00         0,00         0,00          0          0


with two dd:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0,00         0,00         0,00          0          0
sdb               0,00         0,00         0,00          0          0
sdc             460,00     58880,00         0,00      58880          0
sdd             460,00     58880,00         0,00      58880          0
md0               0,00         0,00         0,00          0          0
md1               0,00         0,00         0,00          0          0
md2               0,00         0,00         0,00          0          0
md3           117504,00    117504,00         0,00     117504          0
nb0               0,00         0,00         0,00          0          0





------------------------------------------------------------------------------------

What did I get? Maybe we have a problem with the maximum tps on md3... I will try to change the md read size and report back here.
Comment 14 Neil Brown 2010-12-20 21:34:22 UTC
If you want to discuss this as a possible enhancement to md/raid1, I would suggest posting to linux-raid@vger.kernel.org.  You are more likely to get a response there.
Comment 15 rspadim 2010-12-20 22:01:37 UTC
How do I send mail to the mailing list? Do I need to be subscribed? Any site that explains it? Thanks, Neil.
Comment 16 Alan 2012-08-13 16:01:05 UTC
*** Bug 17401 has been marked as a duplicate of this bug. ***
