Bug 10776
| Summary: | dm-crypt: performance problems, uses only single cpu?, 30-40k context switches per sec | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Sami Liedes (sami.liedes) |
| Component: | LVM2/DM | Assignee: | Milan Broz (gmazyland) |
| Status: | CLOSED OBSOLETE | | |
| Severity: | normal | CC: | agk, alan |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.25.4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Sami Liedes
2008-05-22 11:12:18 UTC
> I have a modern Core 2 Quad (Q6600) processor. The kcryptd process seems to
> be using 99% of a single CPU (i.e. of 400% available), spread evenly on the
> CPUs (so it probably was a single thread at full steam).

It is moving from cpu to another cpu? hm, it is not good, so we need to set processor affinity. But that it uses only one core is expected. I have some ideas how to optimize dm-crypt on multiple cpus, but it is not implemented yet.

> I did some oprofiling and other digging and found out that when reading a
> crypted volume (echo 3 >/proc/sys/vm/drop_caches; dd
> if=/dev/mapper/myvg-root_crypt of=/dev/null bs=1M), I get around 30-40
> thousand context switches per second (as shown by the vmstat command) on an
> otherwise idle system, which seems quite a lot to me. Reading the underlying
> LVM volume (i.e. not decrypting it) causes about 7-8 thousand context
> switches per second, while when reading the md device on which the LVM
> volume resides I get around 3300 cs/sec. When idle I get 200 cs/sec.

hrm. Could you try whether direct dm-crypt (without md: over LVM, or better, directly over a normal partition) is better?

For the dd "test": if you increase readahead for the crypt device, is it better or the same? (I mean blockdev --setra <num> /dev/mapper/...)

There are some patches which are still not in the upstream kernel which should help, but this seems to be a really excessive context switch count...

> I noticed that there was a recent fix in dm-crypt.c that apparently disables
> broken parallel processing for *writes* (commit
> 3f1e9070f63b0eecadfa059959bf7c9dbe835962). Should parallel processing for
> reads still be possible?

I think that even writes, with a proper implementation, are possible. It just needs some configuration (sometimes it is not optimal to use all cores/CPUs for encryption).

On Thu, May 22, 2008 at 11:50:56AM -0700, bugme-daemon@bugzilla.kernel.org wrote:
> It is moving from cpu to another cpu?
> hm, it is not good, so we need set processor affinity.
> But that it use only one core, it is expected. I have some ideas how to
> optimize dm-crypt on multiple cpus, but it is not yet implemented.

Yes, kcryptd jumps from one cpu to another according to top.

> > I did some oprofiling and other digging and found out that when reading a
> > crypted volume (echo 3 >/proc/sys/vm/drop_caches; dd
> > if=/dev/mapper/myvg-root_crypt of=/dev/null bs=1M), I get around 30-40
> > thousand context switches per second (as shown by the vmstat command) on
> > an otherwise idle system, which seems quite a lot to me. Reading the
> > underlying LVM volume (i.e. not decrypting it) causes about 7-8 thousand
> > context switches per second, while when reading the md device on which
> > the LVM volume resides I get around 3300 cs/sec. When idle I get 200 cs/sec.
>
> hrm. could you try if direct dm-crypt (without md - over LVM or better
> directly over normal partition) is better?

More weirdness.

------------------------------------------------------------
# cryptsetup create ctest /dev/sdb1 --readonly --cipher aes-cbc-plain
# echo 3 >/proc/sys/vm/drop_caches
# dd if=/dev/mapper/ctest of=/dev/null bs=1M &
# vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free    buff  cache  si so    bi bo   in     cs us sy id wa
[...]
 3  0    104 6153688  956628 109916   0  0 53209  2 2046  99368  0 34 42 24
 4  0    104 5394392 1576504 110400   0  0 62026 14 2206  97842  0 36 45 19
 1  1    104 4624468 2205336 110984   0  0 62933  4 2472 112581  0 36 46 17
------------------------------------------------------------

So I get 100k context switches per second.

But this is quite strange (well, to me at least): `dd if=/dev/sdb1 of=/dev/null bs=1M' (i.e. on the normal partition, without dm-crypt) gives me 20000 context switches per second. If I do `mount /dev/sdb1 /media/spare1' on another terminal, the rate of context switches immediately drops to 1/10th, or about 2000/second.
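The context-switch numbers throughout this thread are read from the `cs` column of `vmstat`, which reports a per-second average over each interval. As a sketch, the average across the rows captured above can be pulled out with awk (the variable simply replays the captured rows; on a live system you would pipe `vmstat 10` in directly):

```shell
# Replay of the vmstat 10 rows captured above; on a live system,
# pipe `vmstat 10` into awk directly instead of using a variable.
vmstat_sample=' r  b   swpd    free    buff  cache  si so    bi bo   in     cs us sy id wa
 3  0    104 6153688  956628 109916   0  0 53209  2 2046  99368  0 34 42 24
 4  0    104 5394392 1576504 110400   0  0 62026 14 2206  97842  0 36 45 19
 1  1    104 4624468 2205336 110984   0  0 62933  4 2472 112581  0 36 46 17'

# Field 12 is cs (context switches/sec); header rows are skipped
# because their first field is not numeric.
printf '%s\n' "$vmstat_sample" \
  | awk 'NF == 16 && $1 ~ /^[0-9]+$/ { sum += $12; n++ }
         END { if (n) printf "avg cs/sec: %d\n", sum / n }'
```

With the three rows above this averages out to roughly 103k switches per second, matching the "100k context switches per second" reading of the table.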
Mounting the device increases its block size from 512 to 4096, but setting that manually with blockdev --setbsz 4096 doesn't cause the drop in context switches, indicating that the block size is probably not the culprit. Unmounting the partition while dd is running doesn't cause an increase in context switches, but killing the dd (after unmounting) and starting it again does.

> for dd "test": if you increase readahead for the crypt device, is it
> better or the same? (I mean blockdev --setra <num> /dev/mapper/... )

I tried changing the readahead for /dev/mapper/myvg-root_crypt from 256 to 1024 and to 16384. It seemed to have a slight effect, so I tried 1048576 too. With readahead set to 1048576, the rate of context switches drops roughly 40%, to about 19000/second. Higher readahead also seems to correlate with somewhat higher throughput for dd, up to something like 93 MB/sec.

Sami

If this is still seen with a modern kernel please update.
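A note on the units in this experiment: blockdev --setra takes a count of 512-byte sectors, so the values tried above correspond to quite different readahead window sizes. A quick sketch to convert them:

```shell
# blockdev --setra counts 512-byte sectors; convert the values tried
# above into KiB to see how large the readahead windows actually were.
for ra in 256 1024 16384 1048576; do
    printf 'setra %-8d = %d KiB readahead\n' "$ra" $(( ra * 512 / 1024 ))
done
```

So the default 256 is a 128 KiB window, while the final setting of 1048576 requests a 512 MiB window.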