hello, i created a raid1 array and the read speed seems to be slow... with one reader on /dev/md3 i get a slower speed than on /dev/sdb (the underlying device); that's ok, the kernel md module adds some processing. but when i have 2 readers, the total speed stays the same (i was thinking one device could serve one reading process and the other device could serve the other, but that's not what the kernel does...). could we change which read algorithm is used? the default kernel behavior could be extended (for example round robin: divide the md device across the member devices, say 100mb across 3 devices, 33mb from each, so 3 parallel reads reach roughly the total speed of 3 devices; we could implement more algorithms... nice? maybe something like the noop, cfq and other io elevators...). thanks guys, i don't know if this is the right place for a todo or something like it (it's not a bug, i know... a feature request maybe...)

raid:

cat /proc/mdstat:

md3 : active raid1 sdd4[3] sdc4[2] sdb4[1]
      193654446 blocks super 1.2 [3/3] [UUU]

dd if=/dev/md3 of=/dev/null, iotop results:

Total DISK READ: 40.32 M/s | Total DISK WRITE: 9.10 K/s
  TID  PRIO  USER   DISK READ   DISK WRITE  SWAPIN    IO>     COMMAND
 4251  be/4  root   40.42 M/s   0.00 B/s    0.00 %   33.40 %  dd if=/de~=/dev/null

sdb: dd if=/dev/sdb of=/dev/null, iotop results:

Total DISK READ: 83.96 M/s | Total DISK WRITE: 10.07 K/s
  TID  PRIO  USER   DISK READ   DISK WRITE  SWAPIN    IO>     COMMAND
 4299  be/4  root   83.96 M/s   0.00 B/s    0.00 %   70.45 %  dd if=/de~=/dev/null

all disks with dd:

dd if=/dev/sdb of=/dev/null
dd if=/dev/sdc of=/dev/null
dd if=/dev/sdd of=/dev/null

iotop results:

Total DISK READ: 208.11 M/s | Total DISK WRITE: 9.29 K/s
  TID  PRIO  USER   DISK READ   DISK WRITE  SWAPIN    IO>     COMMAND
 4301  be/4  root   66.78 M/s   0.00 B/s    0.00 %   73.68 %  dd if=/de~=/dev/null
 4302  be/4  root   72.24 M/s   0.00 B/s    0.00 %   73.06 %  dd if=/de~=/dev/null
 4303  be/4  root   69.22 M/s   0.00 B/s    0.00 %   70.32 %  dd if=/de~=/dev/null
If you can provide code - with measurements that show it to be better - then I'm very happy to consider it. However I have way too many other things to work on at the moment to consider this myself. The 'read_balance' code is quite easy to find in drivers/md/raid1.c. Try an experiment and see if you can make it faster!
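[for reference, a minimal user-space sketch of the kind of closest-head selection read_balance performs; the struct and field names below are illustrative and are NOT the kernel's actual raid1.c types]

/* Illustrative sketch of a closest-head read-balance heuristic,
 * loosely modeled on what drivers/md/raid1.c does. Types and names
 * are made up for demonstration. */
#include <stdio.h>
#include <stdlib.h>

struct mirror {
    const char *name;
    long long head_pos;   /* sector of the last completed read */
    int in_sync;          /* 0 = degraded, skip it */
};

/* Pick the in-sync mirror whose head is closest to the requested sector. */
static int read_balance(struct mirror *m, int nmirrors, long long sector)
{
    int best = -1;
    long long best_dist = -1;

    for (int i = 0; i < nmirrors; i++) {
        long long dist;

        if (!m[i].in_sync)
            continue;
        dist = llabs(m[i].head_pos - sector);
        if (best < 0 || dist < best_dist) {
            best = i;
            best_dist = dist;
        }
    }
    return best;
}

int main(void)
{
    struct mirror mirrors[] = {
        { "sdb4", 1000,   1 },
        { "sdc4", 500000, 1 },
        { "sdd4", 250000, 1 },
    };
    long long sector = 251000;
    int idx = read_balance(mirrors, 3, sector);

    printf("read sector %lld from %s\n", sector, mirrors[idx].name);
    return 0;
}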
thanks guys, i will try. first i will read the code :) thanks
hum... i found the problem, i have this:

[root@agra md]# fdisk /dev/sda -l

Disk /dev/sda: 320.1 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders, total 625142448 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x69190750

   Device Boot       Start         End      Blocks   Id
/dev/sda1   *           63      224909      112423+  fd
/dev/sda2           224910    33013574    16394332+  82
/dev/sda3         33013575   237826259   102406342+  fd
/dev/sda4        237826260   625137344   193655542+  fd

and i have 3 raid members on the same disk... sda1, sda3, sda4. so... the head position tracking is per mirror (per array member), not per physical device... could we optimize for the device? maybe dedicate one disk to each mirror? or something like that?
i was thinking about device/mirror speed... we have 2 speeds:

- head positioning speed (moving from any position to the requested byte position, seconds per byte of distance or seconds per head distance), i will call it ALIGN_TIME
- read speed for x bytes or 1 page (seconds per byte), i will call it READ_TIME

let's think... we want the closest head (ok, min(ALIGN_TIME))... but we also want the highest read speed (min(READ_TIME))... and with nbd (network block device) we also want the smallest ping time (NETWORK_TIME).

let's think about an array with different devices: hdd (disk), ssd (flash memory), nbd(disk), nbd(flash memory). what we need: min(TIME).

ALIGN_TIME:
- hdd - head positioning, from the current position to the requested read position (maybe distance * (head movement time / distance); distance could be measured in bytes or head positions - what information can we read about the hdd?)
- ssd - none? maybe a spin-up time? i think it's very fast...

NETWORK_TIME:
- nbd - network speed (ping time?) + collision rate? any other information about the client-server communication, maybe just an estimated number... we probably need three values: seconds per byte, packet size (bytes per packet), and seconds per packet. with a packet size of 500 bytes we would have:
  1 byte     = 1 packet * (seconds/packet) + 1 byte     * (seconds/byte)
  1000 bytes = 2 packets * (seconds/packet) + 1000 bytes * (seconds/byte)
  variables: NETWORK_BYTE_SECOND, NETWORK_PACKET_SECOND, NETWORK_BYTE_PACKET

READ_TIME:
- how many bytes to read? -> time = bytes * (seconds/byte). if we don't know how many bytes will be read, maybe use a raid1 "page" size: time = page_size * (seconds/byte)

what we could do...

/proc/sys/dev/raid/device_speed/xxxxx/ALIGN_TIME (default = 1)
/proc/sys/dev/raid/device_speed/xxxxx/READ_TIME (default = 1)
/proc/sys/dev/raid/device_speed/xxxxx/NETWORK_BYTE_SECOND (default = 1)
/proc/sys/dev/raid/device_speed/xxxxx/NETWORK_PACKET_SECOND (default = 1)
/proc/sys/dev/raid/device_speed/xxxxx/NETWORK_BYTE_PACKET (default = 1500)
/proc/sys/dev/raid/device_speed/default_page_size (default = 4096?)

xxxxx = sda1, sda2, sdb1, nbd1, nbd2... the device name (the same as in /proc/mdstat)

the best device:

min(distance * ALIGN_TIME + bytes * READ_TIME + bytes * NETWORK_BYTE_SECOND + packets * NETWORK_PACKET_SECOND)

if we don't know the byte count -> bytes = default_page_size
packets = ceil(bytes / NETWORK_BYTE_PACKET); if (NETWORK_BYTE_PACKET == 0), packets = 1

what do you think? if we had a very poor tcp/ip connection (wifi, or 3g, or modem) nbd would never be chosen... (ALIGN_TIME maybe low, READ_TIME maybe low, NETWORK_* very high). is it a good idea? (a sketch of this cost function is below.)

a second problem... if one device is always used (because it's very fast) it will wear out faster (mean time between failures). another useful piece of information is MTBF, maybe in total seeks? or anything like that? i don't know, that's not a speed optimization, just a wear optimization... let's just think about read optimization for now... nice idea?
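[to make the arithmetic concrete, a user-space sketch of the proposed cost function; the field names mirror the /proc tunables above, and the example numbers are invented, not measured]

/* Sketch of the proposed per-device read cost model. The numbers in
 * main() are made-up examples, not measurements. */
#include <stdio.h>
#include <math.h>

struct dev_speed {
    const char *name;
    double align_time;          /* seconds per unit of head distance */
    double read_time;           /* seconds per byte */
    double net_byte_second;     /* seconds per byte on the wire */
    double net_packet_second;   /* seconds per packet on the wire */
    double net_byte_packet;     /* payload bytes per packet (MTU-ish) */
};

static double read_cost(const struct dev_speed *d, double distance, double bytes)
{
    double packets = (d->net_byte_packet > 0)
                   ? ceil(bytes / d->net_byte_packet)
                   : 1.0;

    return distance * d->align_time
         + bytes * d->read_time
         + bytes * d->net_byte_second
         + packets * d->net_packet_second;
}

int main(void)
{
    struct dev_speed devs[] = {
        /* local hdd: seek cost dominates, no network cost */
        { "sda4", 1e-8,  1e-8, 0,    0,    1500 },
        /* local ssd: almost no align cost */
        { "sdb4", 1e-12, 5e-9, 0,    0,    1500 },
        /* nbd over a slow link: network cost dominates */
        { "nbd0", 1e-12, 5e-9, 1e-7, 5e-4, 1500 },
    };
    double distance = 1000000, bytes = 4096; /* default_page_size */

    for (int i = 0; i < 3; i++)
        printf("%s: cost = %g s\n", devs[i].name,
               read_cost(&devs[i], distance, bytes));
    return 0;
}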
sorry, we have many speeds, not just 2 :) hehehe
a raid startup calibration code? let's think...

first, for disks (HDD/SSD):

  lock the disk (no one can read from this device)
  read a byte at position 1 (the disk head is now at position 1)
  set initial_time = now()
  read a byte at the last disk position (the head is now at position MAX)
  set end_time = now()
  unlock the disk/device

  ALIGN_TIME = (end_time - initial_time) / MAX

second:

  lock the disk (no one can read from this device)
  read a byte at position 1 (the disk head is at position 1)
  set initial_time = now()
  repeat N times: { read default_page_size bytes }
  set end_time = now()
  unlock the disk/device

  READ_TIME = (end_time - initial_time) / (default_page_size * N)

(both are stored as time per unit, so they plug straight into the distance * ALIGN_TIME + bytes * READ_TIME cost above)

------

network: NETWORK_BYTE_SECOND, NETWORK_PACKET_SECOND... and what's NETWORK_BYTE_PACKET? maybe the MTU? (1500)

  NETWORK_PACKET_SECOND = "ping" time of a 1-byte packet
  NETWORK_BYTE_SECOND = (("ping" time of one packet of NETWORK_BYTE_PACKET bytes) - NETWORK_PACKET_SECOND) / NETWORK_BYTE_PACKET

nice? i think it's good enough. maybe an online or cron-based update could keep the network numbers accurate... disks too, maybe with iotop or another user-space program (not a kernel module or kernel code...). a sketch of the READ_TIME measurement follows. thanks guys
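[a user-space sketch of the READ_TIME half of this calibration; the ALIGN_TIME half would seek between the first and last sectors the same way. the device path and N are placeholders]

/* Time N sequential page-sized reads from a device and derive
 * seconds-per-byte. O_DIRECT keeps the page cache from faking the
 * numbers. Run against an idle disk ("locked" in the sense above)
 * for meaningful results. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define N 1024

int main(void)
{
    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires an aligned buffer */
    void *buf;
    if (posix_memalign(&buf, PAGE_SIZE, PAGE_SIZE)) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        if (read(fd, buf, PAGE_SIZE) != PAGE_SIZE) { perror("read"); return 1; }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("READ_TIME = %g s/byte (%.2f MB/s)\n",
           elapsed / ((double)PAGE_SIZE * N),
           (PAGE_SIZE * (double)N) / elapsed / 1e6);

    close(fd);
    free(buf);
    return 0;
}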
for nbd, we should run the READ_TIME and ALIGN_TIME measurements on the server (daemon) side... measuring on the client side would mix them up with the NETWORK times...
i don't know how to write kernel code (i don't know the data structures, which variables to use or which functions to call...). any source to read and understand? i was thinking about a raid1_read_algorithm setting (one per md device):

1. minimal_time (max read speed, what i proposed)
2. closest_head (today's algorithm); i don't know if it's good for SSDs...
3. disk_use/round_robin (choose the disk by usage, X seconds per disk, or Y bytes per disk)

i don't know any other algorithm... any other ideas? (a sketch of such a selectable policy is below)
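[an illustrative user-space sketch of a per-array selectable read policy; the enum values match the three ideas above, and none of this is the md driver's real interface]

/* Selectable read policy sketch. Everything here is illustrative. */
#include <stdio.h>
#include <stdlib.h>

enum read_policy { POLICY_MINIMAL_TIME, POLICY_CLOSEST_HEAD, POLICY_ROUND_ROBIN };

struct mirror {
    long long head_pos;    /* last read position */
    double cost;           /* precomputed cost-model estimate */
    long long bytes_done;  /* total bytes served, for round robin */
};

static int pick_mirror(enum read_policy p, struct mirror *m, int n, long long sector)
{
    int best = 0;
    for (int i = 1; i < n; i++) {
        switch (p) {
        case POLICY_MINIMAL_TIME:   /* smallest modeled time wins */
            if (m[i].cost < m[best].cost) best = i;
            break;
        case POLICY_CLOSEST_HEAD:   /* smallest seek distance wins */
            if (llabs(m[i].head_pos - sector) < llabs(m[best].head_pos - sector))
                best = i;
            break;
        case POLICY_ROUND_ROBIN:    /* least-used-by-bytes wins */
            if (m[i].bytes_done < m[best].bytes_done) best = i;
            break;
        }
    }
    return best;
}

int main(void)
{
    struct mirror m[3] = {
        { 100,    0.004, 900000 },
        { 900000, 0.002, 100000 },
        { 500000, 0.003, 500000 },
    };
    printf("minimal_time -> mirror %d\n", pick_mirror(POLICY_MINIMAL_TIME, m, 3, 1000));
    printf("closest_head -> mirror %d\n", pick_mirror(POLICY_CLOSEST_HEAD, m, 3, 1000));
    printf("round_robin  -> mirror %d\n", pick_mirror(POLICY_ROUND_ROBIN, m, 3, 1000));
    return 0;
}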
i was thinking just about reads... maybe with mixed read/write the picture changes (the speeds)... (the best might still be closest_head; only a real test will tell...)
there's some information at /sys/block/xxxxx/, for example:

/sys/block/sda/queue/logical_block_size (default_page_size?)
/sys/block/sda/queue/minimum_io_size (default_page_size?)
/sys/block/sda/queue/rotational (ssd?)
/sys/block/sda/queue/iosched
/sys/block/sda/queue/scheduler

(what do we want? an md io scheduler...? i think just a "chooser" of which disk to read from?) any other idea? (a small reader for these attributes is sketched below)
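[a tiny sketch that reads the sysfs attributes listed above to seed the cost model; the device name is hard-coded just for the example]

/* rotational=0 suggests an SSD, so ALIGN_TIME could default to ~0. */
#include <stdio.h>

static long read_sysfs_long(const char *path)
{
    long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    long rotational = read_sysfs_long("/sys/block/sda/queue/rotational");
    long block_size = read_sysfs_long("/sys/block/sda/queue/logical_block_size");

    printf("sda: rotational=%ld logical_block_size=%ld -> %s\n",
           rotational, block_size,
           rotational == 0 ? "treat as ssd (align cost ~0)" : "treat as hdd");
    return 0;
}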
i was using iostat with the dd tool... using only one

dd if=/dev/md2 of=/dev/zero

i get about 100mb/s on sda and 0 mb/s on sdb, sdc, sdd. very nice... with two dd it didn't stabilize (it doesn't settle on just 2 disks): i get some reads on sda, sdb, sdc and sdd, but the total stays at 100mb/s.

maybe we could implement another flow... instead of:

  read from md2
  select disk
  read data
  return the data to the kernel

we could do:

  read from md2
  select disk (a disk that's not reading/writing, or the one with the minimal time until its io ends)
  mark when the io should end (time)
  read data
  mark no io pending
  return the data to the kernel

what? a time-based read optimization again... if there's a read operation on sda that will finish in 1 second, and no operation on sdb, why should we use sda? just because of the closest head position? 2 disks working are faster here (see the sketch below)
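[a sketch of the "pick the disk whose pending io ends soonest" idea; track an estimated completion time per mirror and prefer an idle one even if its head is farther away. all names and numbers are illustrative]

#include <stdio.h>

struct mirror {
    const char *name;
    double io_ends_at;   /* estimated completion time of queued io; 0 = idle */
    long long head_dist; /* seek distance to the requested sector */
};

static int pick(struct mirror *m, int n, double now)
{
    int best = 0;
    for (int i = 1; i < n; i++) {
        double wait_best = m[best].io_ends_at > now ? m[best].io_ends_at - now : 0;
        double wait_i = m[i].io_ends_at > now ? m[i].io_ends_at - now : 0;

        /* prefer the shorter wait; break ties by head distance */
        if (wait_i < wait_best ||
            (wait_i == wait_best && m[i].head_dist < m[best].head_dist))
            best = i;
    }
    return best;
}

int main(void)
{
    double now = 10.0;
    struct mirror m[] = {
        { "sda", 11.0, 10 },     /* busy for 1 more second, head is close */
        { "sdb", 0.0, 500000 },  /* idle, head is far */
    };
    printf("read from %s\n", m[pick(m, 2, now)].name); /* -> sdb */
    return 0;
}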
freenas (freebsd) has a set of read balance algorithms, could we use them?

http://freenas.org/
http://sourceforge.net/apps/phpbb/freenas/viewtopic.php?f=12&t=2030
http://sourceforge.net/apps/phpbb/freenas/download/file.php?id=317&sid=b2b03a6144f2fb6faa7159543f568084&mode=view
http://www.freebsd.org/cgi/man.cgi?query=gmirror&sektion=8 (the "balance" option in the gmirror(8) man page, freebsd)
http://fxr.watson.org/fxr/source/geom/mirror/g_mirror.c (gmirror source code, line 1365)
a bit more study...

Device:            tps    Blk_read/s    Blk_wrtn/s    Blk_read    Blk_wrtn
sda               2,00          0,00         24,00           0          24
sdb               2,00          0,00         24,00           0          24
sdc               0,00          0,00          0,00           0           0
sdd             932,00     111100,00          0,00      111100           0
md0               0,00          0,00          0,00           0           0
md1               0,00          0,00          0,00           0           0
md2               1,00          0,00          8,00           0           8
md3          111104,00     111104,00          0,00      111104           0
nb0               0,00          0,00          0,00           0           0

a lot of tps on md3, maybe because of the io queue on sdd

Device:            tps    Blk_read/s    Blk_wrtn/s    Blk_read    Blk_wrtn
sda               0,00          0,00          0,00           0           0
sdb               0,00          0,00          0,00           0           0
sdc             467,00      59032,00          0,00       59032           0
sdd             472,00      59528,00          0,00       59528           0
md0               0,00          0,00          0,00           0           0
md1               0,00          0,00          0,00           0           0
md2               0,00          0,00          0,00           0           0
md3          118560,00     118560,00          0,00      118560           0
nb0               0,00          0,00          0,00           0           0

with a second dd thread the tps is divided between sdc and sdd (maybe the sdc/sdd scheduler)

------------------------------------------------------------------------

changed to noop:

Device:            tps    Blk_read/s    Blk_wrtn/s    Blk_read    Blk_wrtn
sda               0,00          0,00          0,00           0           0
sdb               0,00          0,00          0,00           0           0
sdc             887,00     112896,00          0,00      112896           0
sdd               0,00          0,00          0,00           0           0
md0               0,00          0,00          0,00           0           0
md1               0,00          0,00          0,00           0           0
md2               0,00          0,00          0,00           0           0
md3          112896,00     112896,00          0,00      112896           0
nb0               0,00          0,00          0,00           0           0

with two dd:

Device:            tps    Blk_read/s    Blk_wrtn/s    Blk_read    Blk_wrtn
sda               0,00          0,00          0,00           0           0
sdb               0,00          0,00          0,00           0           0
sdc             460,00      58880,00          0,00       58880           0
sdd             460,00      58880,00          0,00       58880           0
md0               0,00          0,00          0,00           0           0
md1               0,00          0,00          0,00           0           0
md2               0,00          0,00          0,00           0           0
md3          117504,00     117504,00          0,00      117504           0
nb0               0,00          0,00          0,00           0           0

------------------------------------------------------------------------

what did i get? maybe we have a problem with the max tps on md3... i will try to change the read size of md and report back here.
If you want to discuss this as a possible enhancement to md/raid1, I would suggest posting to linux-raid@vger.kernel.org. You are more likely to get a response there.
how do i send mail to the mailing list? do i need to be subscribed? any site explaining how it works? thanks neil
*** Bug 17401 has been marked as a duplicate of this bug. ***