Bug 10781 - unresponsive system (unfair io scheduling) when using dm-crypt
Status: CLOSED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM
Hardware: All Linux
Importance: P1 normal
Assignee: Milan Broz
Reported: 2008-05-23 08:29 UTC by Christian Jaeger
Modified: 2012-05-19 01:21 UTC
Kernel Version: 2.6.22.19 (kernel.org) [, 2.6.24-1-amd64 (Debian)]
Regression: No

Description Christian Jaeger 2008-05-23 08:29:10 UTC
Latest working kernel version: -
Earliest failing kernel version: -
Distribution: Debian testing
Hardware Environment: Lenovo Thinkpad T61, 2.5 GHz Core2 Duo T9300,
    Intel chipset, SATA disk, 2 GB RAM, NVIDIA video
Software Environment: Gnome / console
Problem Description:
    All tasks accessing dm-crypt'ed disk space become unresponsive for
    long periods of time when one i/o intensive (linear access) task is
    running on dm-crypt.

Steps to reproduce:

    (obviously replace sda9 with a partition where you don't have any
    valuable data)

     # cryptsetup create sda9_crypt /dev/sda9
     # time nice nice cat /dev/zero >/dev/mapper/sda9_crypt

    or:

     # cryptsetup remove sda9_crypt  # if necessary
     # cryptsetup luksFormat /dev/sda9
     # cryptsetup luksOpen /dev/sda9 sda9_crypt
     # time nice nice cat /dev/zero >/dev/mapper/sda9_crypt

    Then wait 10 seconds or so (until most binaries have been dropped
    from the disk caches) and try to start any program. Even
    "killall -STOP cat" will take 3-6 minutes to complete.
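
To put a number on the delay rather than estimating it by hand, the check can be wrapped in a small timing helper. This is only a sketch: elapsed_seconds is a hypothetical helper name, not part of any tool mentioned in this report, and it has one-second resolution at best.

```shell
# Sketch: print how many whole seconds a command takes to complete.
# elapsed_seconds is a hypothetical helper, not an existing utility.
elapsed_seconds() {
    start=$(date +%s)
    # Discard the command's own output so only the duration is printed.
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example (commented out): time a command that has to read from disk.
# elapsed_seconds killall -STOP cat
```

Run it once on an idle system and once while the cat writer from the steps above is running, then compare the two numbers.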


More complete/wordy description follows:

I installed the system with the root fs (reiserfs) and swap on LVM on
top of dm-crypt, using the Debian installer's support for this setup.

Quite soon I discovered that xorg (then using the open "nv" driver)
almost froze whenever I ran a make -j3 build of software that needs
hundreds of MB of RAM per gcc instance and therefore touches swap
during the compilation. The vesa driver behaved far better, so I
reported this as a bug against the nv driver (but soon found out that
the closed-source nvidia driver behaved the same) here:

 http://bugs.freedesktop.org/show_bug.cgi?id=15716

But the longer I use this machine, the more I suspect that something
is going wrong in the I/O layer (probably related to dm-crypt) and
that it is not the fault of xorg at all. One thing I noticed some time
ago is that ionice -c 3 does nothing to reduce the impact of a
"cat /dev/sdaX" run on the responsiveness of the machine
(experimenting with the different io schedulers didn't seem to help
either). I've also felt the need to set up resource limits with
ulimit -v to prevent casual runaway processes (I'm a user-space
developer) from swapping and costing me minutes each time to get back
control.
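
For reference, a memory cap of the kind mentioned above can be applied per command so that it does not stick to the login shell. A minimal sketch, assuming a shell whose ulimit builtin supports -v (bash and dash do; POSIX does not require it); run_limited is a hypothetical helper name:

```shell
# Sketch: run a command with a cap on its virtual memory, given in KiB.
# The subshell keeps the limit from affecting the calling shell.
# run_limited is a hypothetical helper name.
run_limited() {
    limit_kib=$1
    shift
    ( ulimit -v "$limit_kib" && exec "$@" )
}

# Example (commented out): cap a build at roughly 1.5 GB per process.
# run_limited 1500000 make -j3
```

A process that exceeds the cap gets allocation failures instead of pushing the machine into swap, which is the point of the workaround; it does nothing about the I/O scheduling problem itself.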

Currently I'm running a pristine 2.6.22.19 from kernel.org (in 64bit
mode; I haven't tried 32bit kernels so far).

Today I noticed that when running the above tests (writing zeroes to
sda9_crypt), cat runs merrily along. According to the "System Monitor"
gnome panel applet it uses about half the cpu power of one core (shown
in bright blue), with the rest displayed as wait time (dark blue). I
think that is expected: read benchmarks with cat from the root
partition device also showed about 50% usage of one core at about
40 MByte/sec throughput, which is the native disk throughput.

*But* if I try to open, for example, a new gnome-terminal, or even
just run "killall -STOP cat" (even at the console, after ctl-alt-f1 to
an existing root login), this takes ages, more precisely about 3-6
minutes. If I just hit ctl-z in the gnome-terminal where I started
the above cat instance, it stops more or less instantly (which is
expected, as the shell shouldn't have to access the disk for that)
and all the other pending actions then run immediately.

So my impression is that any 'fairness' in io scheduling is completely
broken when using dm-crypt. I suspect that there is a problem when
multiple I/O jobs are going on at once *all using dm-crypt*, as if
dm-crypt had its own purely FIFO-ordered scheduler (with a huge
backlog) or something similar. This is consistent with people whose
root partition is not on dm-crypt telling me that they don't see the
problem when trying the above cat tests.

I've tried renicing the kcryptd processes to priorities 0, 10 and 19
(default is -5), but only priority 10 seemed to make it any better, if
at all. Switching off the second core didn't help in this case
either.

(I'm now considering moving everything off dm-crypt to get decent
system behaviour.)

Thanks,
Christian.



Some further data:

- Someone asked whether I have DMA enabled, and whether my SATA disk is in AHCI or compat mode. I don't know how to enable DMA on SATA disks and assumed there is no need to do this manually. The kernel logs say:

 May 10 04:50:09 novo kernel: scsi0 : ahci
 May 10 04:50:09 novo kernel: ata1: SATA max UDMA/133 cmd 0xffffc20000068100 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 313

and

 May 10 04:50:09 novo kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
 May 10 04:50:09 novo kernel: ata1.00: ATA-8: FUJITSU MHY2250BH, 0084000D, max UDMA/100
 May 10 04:50:09 novo kernel: ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (depth 31/32)
 May 10 04:50:09 novo kernel: ata1.00: configured for UDMA/100
 May 10 04:50:09 novo kernel: scsi 0:0:0:0: Direct-Access     ATA      FUJITSU MHY2250B 0084 PQ: 0 ANSI: 5

- also I've been asked for vmstat data:

$ vmstat 1

 1  0 643308 744080   5756  95944    0    0     0     0  404  227  0  1 99  0
 1  4 643308 350156 202312  95968    0    0    64 118212  797 86879  0 44 39 17
 0  3 643308 256780 229636  95936    0    0     0 105692 1098  316  0 42  0 58
 1  3 643308 205284 266660  95964    0    0     0 36976  891  259  0 35  0 65
 0  4 643308 167772 301256  95964    0    0     0 34596  948  318  0 34  0 66

The second row is the first sample taken after I started the cat.
Later:

 0  3 643308 134736 346312  96008    0    0     0 12288 1018  359  0 31  0 69
 1  2 643308 148028 346312  96008    0    0     0     0  928  251  0 30  0 70
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  2 643308 182980 346312  96008    0    0     0     0 1049  301  1 32  0 68
 0  3 643308 200612 346312  96008    0    0     0     0  948  272  1 32  0 67
 0  0 643308 330444 346312  96008    0    0     0    64  895  301  1 16 40 44
 0  0 643308 330476 346312  96008    0    0     0     0  398  195  1  1 97  0

The second row there is the first sample taken after I stopped it again.
This is without triggering the load of another program
(which would again make me wait minutes to get back control).


- I've run

# smartctl -t short /dev/sda
# smartctl -a /dev/sda 2>&1|less
..
Device Model:     FUJITSU MHY2250BH
..
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       725         -
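
The vmstat samples earlier in this description can be condensed into averages for easier comparison between runs. A rough sketch, assuming the classic vmstat column layout shown above, where "bo" is field 10 and "wa" is field 16; vmstat_summary is a hypothetical helper name:

```shell
# Sketch: average the "bo" (blocks written out per second) and "wa"
# (percent of time waiting for I/O) columns of `vmstat 1` output on stdin.
# Header lines are skipped because their first field is not a number.
# vmstat_summary is a hypothetical helper name.
vmstat_summary() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 16 { bo += $10; wa += $16; n++ }
         END { if (n) printf "rows=%d avg_bo=%d avg_wa=%d\n", n, bo / n, wa / n }'
}

# Example (commented out): summarise ten one-second samples.
# vmstat 1 10 | vmstat_summary
```

Note that the first row vmstat prints reports averages since boot, so it is best excluded when comparing short runs.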
Comment 1 Christian Jaeger 2008-06-07 15:51:16 UTC
Could this be the same problem as the one in
http://bugzilla.kernel.org/show_bug.cgi?id=10378
and/or the one being discussed in the thread starting in
http://lkml.org/lkml/2008/2/28/150
?

I can't test right now, but will asap if that makes sense.

Christian.
Comment 2 Christian Jaeger 2008-12-03 19:27:51 UTC
This might be a problem [in combination] with CONFIG_USER_SCHED.

I'm running a new kernel with this change now:

@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
 # Linux kernel version: 2.6.27.7
-# Wed Dec  3 19:24:49 2008
+# Wed Dec  3 19:29:40 2008
 #
 CONFIG_64BIT=y
 # CONFIG_X86_32 is not set
@@ -81,10 +81,8 @@ CONFIG_IKCONFIG_PROC=y
 CONFIG_LOG_BUF_SHIFT=16
 # CONFIG_CGROUPS is not set
 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
-CONFIG_GROUP_SCHED=y
-CONFIG_FAIR_GROUP_SCHED=y
-# CONFIG_RT_GROUP_SCHED is not set
-CONFIG_USER_SCHED=y
+# CONFIG_GROUP_SCHED is not set
+# CONFIG_USER_SCHED is not set
 # CONFIG_CGROUP_SCHED is not set
 CONFIG_SYSFS_DEPRECATED=y
 CONFIG_SYSFS_DEPRECATED_V2=y

and things seem to be much better; I haven't run the above tests again yet, though, since I've used up all disk partitions atm and it's late at night.
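
To confirm which of these scheduler options a given kernel was built with, the saved config file can be checked directly. A minimal sketch, assuming the config is installed as /boot/config-<version>; config_enabled is a hypothetical helper name:

```shell
# Sketch: succeed if a kernel option is set to "y" in a config file.
# config_enabled is a hypothetical helper name.
config_enabled() {
    option=$1
    config_file=$2
    grep -q "^${option}=y\$" "$config_file"
}

# Example (commented out): check the currently running kernel.
# config_enabled CONFIG_USER_SCHED "/boot/config-$(uname -r)" \
#     && echo "CONFIG_USER_SCHED is enabled"
```

Lines of the form "# CONFIG_FOO is not set" do not match, so a disabled option and a missing option are both reported as not enabled.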
Comment 3 Christian Jaeger 2008-12-03 19:48:09 UTC
Although the kernel I was running then didn't seem to have CONFIG_USER_SCHED
enabled: see https://bugs.freedesktop.org/show_bug.cgi?id=15716
Could it be that config-2.6.22.19 was using USER_SCHED without configuring it? Or did something else change? Or are multiple problems affecting the whole thing?
Comment 4 Milan Broz 2008-12-04 00:46:30 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10781
> ------- Comment #3 from christian-bko@jaeger.mine.nu  2008-12-03 19:48
> -------
> Although the kernel I was running then didn't seem to have CONFIG_USER_SCHED
> enabled: see https://bugs.freedesktop.org/show_bug.cgi?id=15716
> Could it be that config-2.6.22.19 was using USER_SCHED without configuring
> it?
> Or something else changed? Or it's multiple problems effecting the whole
> thing.

Too many things have changed in the kernel since 2.6.22; please use a more
recent kernel if possible.

If you want to play with it, start with this one-line patch:

http://www2.kernel.org/pub/linux/kernel/people/agk/patches/2.6/2.6.25/dm-crypt-add-cond_resched.patch

You will probably need to modify it for < 2.6.25 to apply correctly:

Index: home/data/linux-2.6.24.y/drivers/md/dm-crypt.c
===================================================================
--- home.orig/data/linux-2.6.24.y/drivers/md/dm-crypt.c
+++ home/data/linux-2.6.24.y/drivers/md/dm-crypt.c
@@ -374,6 +374,7 @@ static int crypt_convert(struct crypt_co
 			break;
 
 		ctx->sector++;
+		cond_resched();
 	}
 
 	return r;
Comment 5 Christian Jaeger 2008-12-04 05:51:15 UTC
> please use more recent kernel if possible.

As you can see from the diff file, I'm using "Linux kernel version: 2.6.27.7" now. As shown at the above-mentioned URL https://bugs.freedesktop.org/show_bug.cgi?id=15716, I upgraded to a newer kernel long ago. I've been using various kernels since:

chris@novo:/boot$ l config*
-rw-r--r-- 1 root root 63490 2008-04-27 22:10 config-2.6.22.19
-rw-r--r-- 1 root root 72175 2008-06-11 06:50 config-2.6.25.6
-rw-r--r-- 1 root root 73063 2008-06-22 21:44 config-2.6.25.8
-rw-r--r-- 1 root root 73064 2008-07-03 22:17 config-2.6.25.10
-rw-r--r-- 1 root root 75183 2008-07-19 20:32 config-2.6.26
-rw-r--r-- 1 root root 75352 2008-09-11 23:03 config-2.6.26.3
-rw-r--r-- 1 root root 75352 2008-09-12 00:23 config-2.6.26.5
-rw-r--r-- 1 root root 77281 2008-10-16 16:47 config-2.6.27.1
-rw-r--r-- 1 root root 75352 2008-10-30 13:33 config-2.6.26.7.old
-rw-r--r-- 1 root root 75352 2008-10-30 14:04 config-2.6.26.6
-rw-r--r-- 1 root root 75352 2008-11-08 11:49 config-2.6.26.7
-rw-r--r-- 1 root root 77293 2008-11-08 12:55 config-2.6.27.5.old
-rw-r--r-- 1 root root 77290 2008-11-08 13:12 config.old
-rw-r--r-- 1 root root 77290 2008-11-08 13:12 config-2.6.27.5
-rw-r--r-- 1 root root 77216 2008-12-03 20:39 config-2.6.27.7
-rw-r--r-- 1 root root 77216 2008-12-03 20:39 config

I've never seen much improvement from going to a newer kernel, only a little maybe. I did move my swap from the logical volume on dm-crypt to a logical volume in a volume group backed by unencrypted storage, and have moved most of my root filesystem to an unencrypted logical volume too; together these seem to have mitigated the issue a little bit.

But the most noticeable improvement, aside from switching off one core, seems to come from switching off CONFIG_USER_SCHED in the newest kernel. Though again, forgive me that I can't run the above test case right now. I'll do so, and if the problem persists I'll also try your patch. Thanks.
Comment 6 Christian Jaeger 2008-12-04 06:15:21 UTC
BTW one thing I also tested recently was whether I could get a faster disk or swap by using 3 USB sticks in a raid-0 setup (with 16k chunks). This raid device is generally quite a bit faster than my internal (crappy laptop) disk: linear reading is about twice as fast (~60MB/sec; my laptop has 3 USB ports, but two of them turn out to be on the same USB bus), linear writing is about the same (~30MB/sec), and random reading (find -type f on reiserfs with cold cache) is about 4-5 times faster.

Using dm-crypt on top of that raid device didn't slow those speeds down, and I actually noticed that writing to and reading from that device didn't slow my desktop down at all! I was already starting to suspect that my internal laptop disk is somehow just broken.

So I made a clear-text raid-0 setup again, created reiserfs on it, created a non-sparse 4G file on that, losetup'ed that file to a loop device which I cryptsetup luksFormat'ed and luksOpen'ed, then ran mkswap and swapon on it (and swapoff on my old swap). But that swap was very bad; it brought my desktop right back to >20 seconds reaction times. Of course I don't know which point in the chain is the exact culprit. And this was with 2.6.27.5 with CONFIG_USER_SCHED=y; I may try again now with the 2.6.27.7 kernel without USER_SCHED.
