Distribution: Debian unstable
Hardware Environment: Core 2 Quad Q6600, RAID-0 of 3 modern SATA disks
Software Environment: Otherwise fairly idle

Problem Description:

I have a RAID array which can deliver over 200 MB/sec, but with dm-crypt I get only about 80 MB/sec:

------------------------------------------------------------
# hdparm -t /dev/mapper/myvg-root /dev/mapper/myvg-root_crypt

/dev/mapper/myvg-root:
 Timing buffered disk reads:  624 MB in 3.00 seconds = 207.68 MB/sec

/dev/mapper/myvg-root_crypt:
 Timing buffered disk reads:  236 MB in 3.02 seconds = 78.17 MB/sec
------------------------------------------------------------

I have a modern Core 2 Quad (Q6600) processor. The kcryptd process seems to be using 99% of a single CPU (i.e. of the 400% available), spread evenly across the CPUs (so it was probably a single thread running at full steam).

Using the crypto test module tcrypt, AES-256-CBC benchmarks at about 116 MB/sec for a single thread (still almost 50% more than the performance I get with dm-crypt):

------------------------------------------------------------
# modprobe tcrypt mode=200 sec=1

testing speed of cbc(aes) encryption
[...]
test 10 (256 bit key,   16 byte blocks): 3772812 operations in 1 seconds (60364992 bytes)
test 11 (256 bit key,   64 byte blocks): 1489236 operations in 1 seconds (95311104 bytes)
test 12 (256 bit key,  256 byte blocks):  435686 operations in 1 seconds (111535616 bytes)
test 13 (256 bit key, 1024 byte blocks):  113700 operations in 1 seconds (116428800 bytes)
------------------------------------------------------------

I did some oprofiling and other digging and found that when reading an encrypted volume (echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/mapper/myvg-root_crypt of=/dev/null bs=1M), I get around 30-40 thousand context switches per second (as shown by vmstat) on an otherwise idle system, which seems like quite a lot to me. Reading the underlying LVM volume (i.e.
not decrypting it) causes about 7-8 thousand context switches per second, while reading the md device on which the LVM volume resides gives around 3300 cs/sec. When idle I get 200 cs/sec. I suspect that 30-40 thousand context switches per second is both excessive and expensive. :)

I noticed that there was a recent fix in dm-crypt.c that apparently disables broken parallel processing for *writes* (commit 3f1e9070f63b0eecadfa059959bf7c9dbe835962). Should parallel processing for reads still be possible?

OpenSSL shows that the 4 cores in parallel are capable of 400 MB/sec AES-256-CBC performance:

------------------------------------------------------------
### single thread
$ openssl speed aes-256-cbc
[...]
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc      93397.93k    98932.82k   100337.70k  100455.77k    99770.37k

### 4 threads
$ openssl speed aes-256-cbc -multi 4
[...]
aes-256 cbc     371844.67k   388090.15k   398292.65k  400831.15k   400528.73k
------------------------------------------------------------

I have a setup with three 500 GB disks, sd[abc]:

------------------------------------------------------------
/dev/sda:
 Timing buffered disk reads:  220 MB in 3.02 seconds = 72.94 MB/sec

/dev/sdb:
 Timing buffered disk reads:  232 MB in 3.01 seconds = 77.06 MB/sec

/dev/sdc:
 Timing buffered disk reads:  222 MB in 3.02 seconds = 73.54 MB/sec
------------------------------------------------------------

Each of them has an identical ~495 GB partition, sd[abc]2; these are combined in a RAID 0 array, md0:

------------------------------------------------------------
# cat /proc/mdstat
Personalities : [raid0]
md0 : active raid0 sda2[0] sdc2[2] sdb2[1]
      1450500480 blocks 64k chunks

unused devices: <none>

# mdadm --detail --scan
ARRAY /dev/md0 level=raid0 num-devices=3 UUID=a8a91091:07daccab:3847dbcc:3c6a600b
------------------------------------------------------------

md0 is an LVM Physical Volume (pv), the only one in the Volume Group
(vg) named "myvg":

------------------------------------------------------------
# pvscan
[...]
  PV /dev/md0   VG myvg   lvm2 [1.35 TB / 979.30 GB free]

# vgdisplay myvg
  --- Volume group ---
  VG Name               myvg
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.35 TB
  PE Size               4.00 MB
  Total PE              354126
  Alloc PE / Size       103424 / 404.00 GB
  Free  PE / Size       250702 / 979.30 GB
  VG UUID               44CeDr-O3HU-UwkP-XEMt-rGhT-IGCE-zCC0mx
------------------------------------------------------------

myvg contains a 400 GB Logical Volume (lv) named "root", which is the raw device for which I get good performance:

------------------------------------------------------------
# lvdisplay
  --- Logical volume ---
  LV Name                /dev/myvg/root
  VG Name                myvg
  LV UUID                csDRSR-kh5E-SIFm-UgzH-5CJD-lEc3-9Nc45M
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                400.00 GB
  Current LE             102400
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
------------------------------------------------------------

root is a LUKS volume, encrypted with aes-cbc-essiv:sha256:

------------------------------------------------------------
# dmsetup table
myvg-swap: 0 8388608 linear 9:0 838861184
myvg-root: 0 838860800 linear 9:0 384
myvg-swap_crypt: 0 8386552 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 253:1 2056
myvg-root_crypt: 0 838858744 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 253:0 2056
------------------------------------------------------------

myvg-root_crypt is the crypt device for which I get bad performance.
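As a rough sanity check, the tcrypt figures quoted earlier can be converted to MB/sec with a tiny shell helper; the name `tcrypt_mbps` is my own, not part of tcrypt, which only prints the raw operation and byte counts:

```shell
#!/bin/sh
# Convert a tcrypt result (operations per second at a given block size)
# into MB/sec, reproducing the ~116 MB/sec single-thread figure above.
tcrypt_mbps() {
    # $1 = operations per second, $2 = block size in bytes
    awk -v ops="$1" -v bs="$2" 'BEGIN { printf "%.1f\n", ops * bs / 1e6 }'
}

# test 13 above: 113700 ops/sec at 1024-byte blocks
tcrypt_mbps 113700 1024   # prints 116.4
```

At 1024-byte blocks this matches the ~116 MB/sec single-thread figure, i.e. still well above the 78 MB/sec seen through dm-crypt.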
> I have a modern Core 2 Quad (Q6600) processor. The kcryptd process seems to
> be using 99% of a single CPU (i.e. of 400% available), spread evenly on the
> CPUs (so it probably was a single thread in full steam).

Is it moving from one CPU to another? Hm, that is not good, so we need to set processor affinity. But that it uses only one core is expected. I have some ideas on how to optimize dm-crypt on multiple CPUs, but they are not implemented yet.

> I did some oprofiling and other digging and found out that when reading a
> crypted volume (echo 3 >/proc/sys/vm/drop_caches; dd
> if=/dev/mapper/myvg-root_crypt of=/dev/null bs=1M), I get around 30-40
> thousand context switches per second (as shown by the vmstat command) on an
> otherwise idle system, which seems quite a lot to me. Reading the underlying
> LVM volume (i.e. not decrypting it) causes about 7-8 thousand context
> switches per second, while when reading the md device on which the LVM
> volume resides I get around 3300 cs/sec. When idle I get 200 cs/sec.

Hrm. Could you try whether direct dm-crypt (without md: over LVM, or better, directly over a normal partition) does better?

For the dd "test": if you increase the readahead for the crypt device, is it better or the same? (I mean blockdev --setra <num> /dev/mapper/...)

There are some patches which are still not in the upstream kernel that should help, but this really does seem to be an excessive context-switch count...

> I noticed that there was a recent fix in dm-crypt.c that apparently disables
> broken parallel processing for *writes* (commit
> 3f1e9070f63b0eecadfa059959bf7c9dbe835962). Should parallel processing for
> reads still be possible?

I think that even writes are possible with a proper implementation. It just needs some configuration (sometimes it is not optimal to use all cores/CPUs for encryption).
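The two tunings suggested above (pinning kcryptd and raising readahead) can be sketched as a shell fragment; the CPU number, the readahead value, and the `pgrep` pattern are illustrative assumptions only, since kcryptd thread naming varies between kernel versions:

```shell
#!/bin/sh
# Sketch: pin the kcryptd worker to one CPU and raise the crypt
# device's readahead. Not a recommendation, just the mechanics.
DEV=/dev/mapper/myvg-root_crypt

# blockdev --setra counts 512-byte sectors; helper to show the
# resulting readahead in KB (2 sectors = 1 KB).
ra_kb() { echo $(( $1 / 2 )); }

if [ -b "$DEV" ]; then
    # Stop kcryptd migrating between cores (the pattern may need
    # adjusting on other kernels).
    pid=$(pgrep -f kcryptd | head -n 1)
    [ -n "$pid" ] && taskset -pc 0 "$pid"

    # Raise readahead to 16384 sectors, i.e. 8192 KB.
    blockdev --setra 16384 "$DEV"
    echo "readahead now $(ra_kb "$(blockdev --getra "$DEV")") KB"
fi
```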
On Thu, May 22, 2008 at 11:50:56AM -0700, bugme-daemon@bugzilla.kernel.org wrote:
> Is it moving from one CPU to another? Hm, that is not good, so we need to
> set processor affinity. But that it uses only one core is expected. I have
> some ideas on how to optimize dm-crypt on multiple CPUs, but they are not
> implemented yet.

Yes, kcryptd jumps from one CPU to another according to top.

> > I did some oprofiling and other digging and found out that when reading a
> > crypted volume (echo 3 >/proc/sys/vm/drop_caches; dd
> > if=/dev/mapper/myvg-root_crypt of=/dev/null bs=1M), I get around 30-40
> > thousand context switches per second (as shown by the vmstat command) on
> > an otherwise idle system, which seems quite a lot to me. Reading the
> > underlying LVM volume (i.e. not decrypting it) causes about 7-8 thousand
> > context switches per second, while when reading the md device on which
> > the LVM volume resides I get around 3300 cs/sec. When idle I get 200
> > cs/sec.
>
> Hrm. Could you try whether direct dm-crypt (without md: over LVM, or
> better, directly over a normal partition) does better?

More weirdness.

------------------------------------------------------------
# cryptsetup create ctest /dev/sdb1 --readonly --cipher aes-cbc-plain
# echo 3 >/proc/sys/vm/drop_caches
# dd if=/dev/mapper/ctest of=/dev/null bs=1M &
# vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free    buff  cache   si   so    bi  bo   in     cs us sy id wa
[...]
 3  0    104 6153688  956628 109916    0    0 53209   2 2046  99368  0 34 42 24
 4  0    104 5394392 1576504 110400    0    0 62026  14 2206  97842  0 36 45 19
 1  1    104 4624468 2205336 110984    0    0 62933   4 2472 112581  0 36 46 17
------------------------------------------------------------

So I get 100k context switches per second. But this is quite strange (well, to me at least): `dd if=/dev/sdb1 of=/dev/null bs=1M' (i.e. on the normal partition, without dm-crypt) gives me 20000 context switches per second.
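For runs like the one above, the cs column can be averaged instead of eyeballed; the `avg_cs` helper below is my own naming and assumes vmstat's default layout as shown, where cs is the 12th field:

```shell
#!/bin/sh
# Average the context-switch (cs) column of `vmstat <interval> <count>`
# output read from stdin, skipping the two header lines.
avg_cs() {
    awk 'NR > 2 { sum += $12; n++ } END { if (n) printf "%d\n", sum / n }'
}

# Typical use while a dd test runs in another terminal; guarded so it
# is a no-op on systems without vmstat.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 3 | avg_cs
fi
```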
If I do `mount /dev/sdb1 /media/spare1' in another terminal, the rate of context switches immediately drops to a tenth, about 2000/second. Mounting the device increases its block size from 512 to 4096, but setting that manually with blockdev --setbsz 4096 doesn't cause the drop in context switches, indicating that the block size is probably not the culprit. Unmounting the partition while dd is running doesn't cause an increase in context switches, but killing the dd (after unmounting) and starting it again does.

> For the dd "test": if you increase the readahead for the crypt device, is
> it better or the same? (I mean blockdev --setra <num> /dev/mapper/...)

I tried changing the readahead for /dev/mapper/myvg-root_crypt from 256 to 1024 and then 16384. It seemed to have a slight effect, so I tried 1048576 too. With readahead set to 1048576, the rate of context switches drops roughly 40%, to about 19000/second. Higher readahead also seems to correlate with somewhat higher throughput for dd, up to something like 93 MB/sec.

Sami
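The readahead experiment above can be repeated as a small sweep; the values mirror the ones tried here, and `pct_drop` is just a hypothetical helper for the "roughly 40%" arithmetic (the sample inputs are illustrative, not measured):

```shell
#!/bin/sh
# Sweep the readahead values tried above on the crypt device and
# re-run the cold-cache dd read for each. The device path is the one
# from this report; this is a sketch and needs root on a real system.
DEV=/dev/mapper/myvg-root_crypt

# Percentage drop from $1 to $2 (e.g. context switches per second).
pct_drop() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a - b) / a * 100 }'
}

# e.g. ~31500 cs/sec at the default readahead vs ~19000 at 1048576
# sectors is the "roughly 40%" drop reported above:
pct_drop 31500 19000    # prints 40

if [ -b "$DEV" ]; then
    for ra in 256 1024 16384 1048576; do
        blockdev --setra "$ra" "$DEV"
        echo 3 > /proc/sys/vm/drop_caches
        echo "readahead=$ra sectors:"
        dd if="$DEV" of=/dev/null bs=1M count=1024 2>&1 | tail -n 1
    done
fi
```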
If this is still seen with a modern kernel, please update.