Most recent kernel where this bug did *NOT* occur:
Distribution: Ubuntu 7.04 (Feisty)
Hardware Environment: i386 (AMD64 in 32-bit mode)
Software Environment: vanilla 2.6.20 kernel

Description:
Writing larger files onto a partition encrypted with DM-Crypt/LUKS (filesystem: Reiserfs v. 3) freezes the system periodically every few seconds.

The recently purchased computer has an AMD "Sempron 3400+" CPU (AM2 socket) installed on a mainboard with an Nvidia nForce 430 chipset. The hard drive is a new 400GB SATA disk connected to the onboard SATA port (kernel module: sata_nv). In addition to three smaller, unencrypted partitions (system, swap and home; still unencrypted because it's a testing setup), I created a 322GB partition for encrypted data storage. This partition is LUKS-formatted via cryptsetup/DM-Crypt and has a Reiserfs file system.

The problem is the following: whenever I copy a slightly bigger file (e.g. 500MB) /to/ the dm-crypt partition, the system is almost blocked for the duration of the copy. Not only does system responsiveness get rather low, but approximately once every two seconds the system is completely stalled for a second. Even the mouse pointer and Amarok sound playback periodically stop completely. It doesn't matter whether the source is one of the unencrypted partitions on the same hard disk, a CD-ROM or even an NFS mount from a remote server that is relatively slow compared to local copying.

This is particularly undesirable if, e.g., a database system for sensitive customer data is using the partition, or if the recording of live video streams from surveillance cameras blocks the rest of the system. It is all the more surprising because I thought the new default CFQ I/O scheduler would prevent such problems.
By observing the write performance of my system, I found an anomaly which also seems to be connected with the lock-ups described above. Data transfer rates were obtained via "time cp 'sourcelocation/testfile' 'destination'" with a test file of 1GByte of random data.

Write performance (copy):
  from one unencrypted filesystem to another unencrypted filesystem on the same hard drive: 38MByte/sec
  from unencrypted filesystem to encrypted filesystem on the same hard drive: 16MByte/sec
  from 100-MBit/sec LAN NFS mount to unencrypted filesystem on local machine: 11.2MByte/sec (maximum for 100MBit LAN)
  from 100-MBit/sec LAN NFS mount to encrypted filesystem on local machine: 6.5MByte/sec (even though 16MB/sec was possible from the local source before, and the LAN is also capable of 11.2MB/sec [!] This is the anomaly I meant above.)

Read performance from the encrypted partition ("time cp 1-GB-testfile.dat /dev/null" or even "time cp 1-GB-testfile.dat /tmp" (no RAM tmpfs)) is perfect with 28 and 26 MByte/sec respectively; no system lock-ups.

To reproduce this problem, I downloaded, configured and compiled a vanilla 2.6.20 kernel and ran the computer with it instead of the distribution's standard kernel (which at the time is a 2.6.20-rc snapshot). I didn't load any third-party or binary-only kernel modules (i.e. I disabled the nvidia graphics driver).

Here I have uploaded some information about my computer:
http://datenparkplatz.de/DiesUndDas/lspci.output.2.6.20.vanilla.txt (output of "lspci")
http://datenparkplatz.de/DiesUndDas/dmesg.output.2.6.20.vanilla.txt (output of "dmesg")
http://datenparkplatz.de/DiesUndDas/proc.version.2.6.20.vanilla.txt (output of "cat /proc/version")
http://datenparkplatz.de/DiesUndDas/kernelconfig-2.6.20-vanilla.txt (my kernel's ".config")

HTH,
Ulrich
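The rates above come from timing cp by hand. A minimal sketch of the same measurement in Python follows. The helper names are hypothetical, the paths passed in are up to the caller, and, unlike plain "time cp", the sketch assumes a final fsync is wanted so that data still sitting in the page cache is not counted as written.

```python
import os
import shutil
import time


def mbyte_per_sec(num_bytes, seconds):
    """Convert a byte count and a duration into MByte/sec (1 MByte = 10**6 bytes)."""
    return num_bytes / seconds / 1e6


def timed_copy(src, dst):
    """Copy src to dst, flush dst to disk, and return the rate in MByte/sec."""
    start = time.monotonic()
    shutil.copyfile(src, dst)
    # Assumption: include the flush in the timing so the real write is measured.
    fd = os.open(dst, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    # Guard against a zero duration on clocks with coarse resolution.
    elapsed = max(time.monotonic() - start, 1e-9)
    return mbyte_per_sec(os.path.getsize(dst), elapsed)
```

Calling timed_copy() once with an unencrypted destination and once with the dm-crypt mount would give a pair of numbers comparable to the 38 vs. 16 MByte/sec figures, though the fsync makes the results somewhat more pessimistic than the cp timings.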
I have the same problem, or rather an even worse one: I have a dm-crypt encrypted /home partition (LUKS). In my /home dir I then created a large file (4.6 GB) encrypted with dm-crypt (also LUKS, even with the same password and encryption method). I mounted this file via /dev/loop0, created an ext3 filesystem on it and mounted it under /mnt.

Now, when I copy large files from my Windows partition to this /mnt, after some time (200 MB copied?) the disk I/O system stalls. I cannot read from or write to the disk. The computer is not locked up, but since it cannot do any disk operations, it's unusable.

Off topic: another interesting thing is that even though I mounted this NTFS partition READ ONLY, after resetting the computer (because of the dm-crypt + copy problem), the next mount of that NTFS partition reports that it is corrupted! (unknown flag 0x40000 or something like that).

I have been using a dm-crypt encrypted home for a long time and have had no problems so far (except those short locks during encrypted disk I/O as described by bug reporter Alasdair G Kergon). This is just a guess, but it seems as if the dm-crypt module requires exclusive use of the computer, and since it is called recursively(?) during such a copy (writes into an encrypted file on an encrypted partition), it causes a deadlock.

I attached dmesg, lspci and lsmod output (the disk I mentioned is the one attached to the onboard VIA PATA interface).
Created attachment 11099 dmesg output
Created attachment 11100 lsmod output
Created attachment 11102 lspci output
So... I created a LUKS partition on a DVD-RAM (4.6 GB). When I copy files with Midnight Commander, I can see that the computer responds OK while files are read from the encrypted /home into MC's cache, but as soon as it starts writing to the DVD-RAM, any other disk I/O is halted for SEVERAL SECONDS. So it seems to me that dm-crypt encrypts data and waits for this data to pass through disk I/O while blocking any other disk usage. From this point of view it seems to be not a bug but a bad design... Does dm-crypt not support asynchronous operation?
I don't quite know whether this bug is related to the kernel "crashes" I've been getting with LUKS/DM-Crypt recently.

My HD: several partitions, Linux running on one of the larger LUKS ones, using LVM2 (inside the LUKS, not LUKS inside the LVM2).
Kernel 2.6.20-hardened-r1 (gentoo patchset), fuse support built as a module.

The partition in question is one using the ntfs-3g driver, inside a LUKS mapping. It's got some music on it, and Amarok is the program accessing it. After some time of listening, the computer hangs. The X "screenshot" just sits there: no mouse input, no keyboard, no sshd. The three times this has happened I was listening to music and compiling programs (using portage), the latter being a very IO-heavy activity. The last time the crash happened I was present at the computer, and I could move the mouse around very slowly for a few seconds before the whole system hung. (More detail: first Kicker went, Firefox was still reacting to the mouse, then I tried switching windows, and there it hung.)

The kernel log doesn't have any error messages, except these that I got a few days ago:

Apr 8 21:23:19 gentoo VFS: Can't find ext4 filesystem on dev dm-10.
Apr 8 21:23:19 gentoo SQUASHFS error: Can't find a SQUASHFS superblock on dm-10
Apr 8 21:23:19 gentoo FAT: bogus logical sector size 15978
Apr 8 21:23:19 gentoo VFS: Can't find a valid FAT filesystem on dev dm-10.

These were the last errors I got before the system crashed. The subsequent crashes did not show them.

I'm using the CFQ IO scheduler and the Via SATA driver. Reiserfs partitions exist for /var and /tmp, as in the ubuntu bug https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82528

I am going to try removing the FUSE/NTFS combination because I have a feeling it could be causing the lockups. I'll look into newer kernels (and compile in more debug info, hopefully).
The only other thing that could be causing the problem is in the initramfs image loaded at startup: when "exec switch_root" is executed, the cleanup "echo > /proc/sys/kernel/hotplug" is not done (so "/sbin/mdev" is still set there). But since init runs udev, that shouldn't matter?

Apologies if I've been too verbose or imprecise. I will be more than happy to provide more information.
Several fixes for dm-crypt by Olaf Kirch went into 2.6.22-rc5. Please test with this release and verify whether the problem has been fixed. Thanks.
Hi, sorry to say this, but for me, the problem still persists with linux 2.6.22.1.
I have a similar problem, running kernel 2.6.22-gentoo-r6 (2.6.22.6 with gentoo patchset) on an Athlon64 X2 4200+ with 2 GB RAM. I also tried vanilla 2.6.23-rc7, but the problem remained the same.

My root filesystem is ext3 with data=journal. My home directory is encrypted using pam_mount (it's a file on the root filesystem, mounted via a loop device and dm-crypt), formatted as ext2. When copying a large file to that encrypted loop device, about 100-150 MB are written without problems. Then the load of both processor cores suddenly goes up (100%wa for some seconds); after that the load of one core drops back to a normal level, while the load of the other core stays at 100%wa and nothing is written to disk for some minutes. Every process that tries to access a file or directory in that encrypted filesystem blocks. After some minutes the load suddenly drops, another 100-200 MB are written to disk quite fast, then the system blocks again for some minutes...

Any hint on how to solve this problem?
Does it make sense to start a thread on the kernel mailing list? The situation with filesystem encryption under Linux is a bit annoying at the moment, IMO. On a laptop, I think it's an essential feature.
Yes, it always makes sense to raise concerns on the mailing list.
> ------- Comment #11 from protasnb@gmail.com 2007-09-26 22:51 -------
> Yes, it always makes sense to raise concerns on the mailing list.

Maybe :-) but the dm-crypt mailing list is better for discussing dm-crypt issues...

Well, currently I see two main problems in this bugzilla:

1) The per-CPU crypt queue in dm-crypt was not a good idea; it causes more problems and gives no real performance increase. In current -mm (for 2.6.24) dm-crypt has already switched to a single-threaded queue, now with a separate queue for processing io. So this is already fixed.

2) There is a problem with missing io congestion control when dm-crypt submits work to the crypt thread in the device map function. This is probably the root cause of your problems. I have some patches which try to solve it, but they are more of a hack currently, so we need to fix it properly. I will paste a patch here for testing when ready (I hope it will be soon).

Milan
Could you please try the attached workaround and report whether it helps?

Thanks,
Milan

-----
Add cond_resched() as a workaround to crypt processing queue to prevent some reported system stucks.
This issue will be solved properly later...

Signed-off-by: Milan Broz <mbroz@redhat.com>
---
 drivers/md/dm-crypt.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6.23/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.23.orig/drivers/md/dm-crypt.c	2007-09-27 18:35:26.000000000 +0200
+++ linux-2.6.23/drivers/md/dm-crypt.c	2007-09-27 18:39:29.000000000 +0200
@@ -665,6 +665,8 @@ static void kcryptd_do_work(struct work_
 		process_read(io);
 	else
 		process_write(io);
+
+	cond_resched();
 }
 
 /*
Does it make sense to apply this patch to a 2.6.22.6 kernel, or is it 2.6.23-only?
> ------- Comment #14 from michael.schachtebeck@stud.uni-goettingen.de 2007-09-27 10:19 -------
> Does it make sense to apply this patch to a 2.6.22.6 kernel, or is it
> 2.6.23-only?

Yes, you can try 2.6.22; the code should be the same there.
I applied it to 2.6.22-gentoo-r8 (2.6.22.9 with gentoo patchset) - sorry, no effect at all. :-(

Some more observations, maybe trivial:
* even a kill -9 does not kill blocked processes which try to access the encrypted filesystem
* writing to the root filesystem which contains the encrypted filesystem is *not* affected by the blocking, so it's probably not an issue with the SATA driver, but related to dm-crypt or the loop device code.
Does not work with vanilla 2.6.23-rc8 either.
I see a similar problem when using loop devices (even without dm-crypt), as mentioned in comment #9; it is caused by stalling in balance_dirty_pages (but note this is only the first part of the problem). This was recently fixed in the -mm tree by the BDI dirty limit patchset.

Could you please attach the process state output from syslog to confirm this? Run echo t >/proc/sysrq-trigger while the system is stalled (100% waiting). (See also http://lkml.org/lkml/2007/9/29/57 - if this patch helps, it is the dirty page balance problem.)

Anyway, the second part of the problem, the dm-crypt congestion patch, is needed too. I will prepare it when 2.6.24-rc is ready (too many related changes there).
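As a complement to the sysrq dump, one could also watch how much dirty and in-flight data the kernel is holding while the copy runs. The following is only an illustrative sketch, not something used in this thread; it assumes the standard "Dirty:" and "Writeback:" fields of /proc/meminfo, and the function names are hypothetical.

```python
def dirty_writeback_kb(meminfo_text):
    """Extract the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        # Exact key match, so related fields such as WritebackTmp are ignored.
        if key in ("Dirty", "Writeback"):
            values[key] = int(rest.split()[0])
    return values


def read_meminfo(path="/proc/meminfo"):
    """Read the live counters; only meaningful on a Linux system."""
    with open(path) as f:
        return dirty_writeback_kb(f.read())
```

Polling read_meminfo() once per second during a large copy would show whether the Dirty value climbs to a plateau at the moment the writer stalls, which is what a stall in balance_dirty_pages would suggest.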
ok, here is the output of echo t >/proc/sysrq-trigger with 2.6.23 with gentoo patchset. I will try the patch you mentioned in some days, lot to do at the moment...
Created attachment 13104 output
ok, I had some time to test. I applied the patch from http://lkml.org/lkml/2007/9/29/57 - and it works well so far. I wrote a 1 GB file without blocking. Does this patch have an impact on performance? I noticed that the write rate (according to dd) was 23.9 MB/s, which does not seem like much on an Athlon64 X2 4200+ (even allowing for the encryption).
bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #20 from michael.schachtebeck@stud.uni-goettingen.de 2007-10-10 14:31 -------
> Created an attachment (id=13104)
> --> (http://bugzilla.kernel.org/attachment.cgi?id=13104&action=view)
> output

Ok, so it *is* related to the dirty page balance stall, as expected.

Oct 10 23:19:13 [kernel] dd D f6486cb8 0 6008 5760
Oct 10 23:19:13 [kernel] f6486ccc 00200086 00000002 f6486cb8 f6486cb0 00000000 f6423a40 c043cda0
Oct 10 23:19:13 [kernel] c043fa40 f6486cbc f6423b7c c201aa40 00000001 c2122a80 00200202 c2126000
Oct 10 23:19:13 [kernel] c2126000 c0133d67 f7bdae00 0002cda1 00000000 00000003 00000000 00000000
Oct 10 23:19:13 [kernel] Call Trace:
Oct 10 23:19:13 [kernel] [<c0133d67>] lock_timer_base+0x27/0x60
Oct 10 23:19:13 [kernel] [<c034496a>] schedule_timeout+0x4a/0xc0
Oct 10 23:19:13 [kernel] [<c01cd2b2>] ext2_discard_prealloc+0x22/0x80
Oct 10 23:19:13 [kernel] [<c0133a20>] process_timeout+0x0/0x10
Oct 10 23:19:13 [kernel] [<c034433e>] io_schedule_timeout+0x1e/0x30
Oct 10 23:19:13 [kernel] [<c015db36>] congestion_wait+0x56/0x80
Oct 10 23:19:13 [kernel] [<c013e400>] autoremove_wake_function+0x0/0x40
Oct 10 23:19:13 [kernel] [<c0158961>] balance_dirty_pages_ratelimited_nr+0x141/0x230
Oct 10 23:19:13 [kernel] [<c0154062>] generic_file_buffered_write+0x372/0x6b0
...
Oct 10 23:19:13 [kernel] [<c0173765>] do_sync_write+0xd5/0x120
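Picking the relevant task out of a full sysrq-t dump by eye is tedious. A rough sketch of automating that search follows; it assumes the syslog line layout seen in the excerpt above (task header "name state address ...", frames as "[<address>] symbol+0x..."), which may differ on other kernels or syslog setups. The function name and regular expressions are illustrative only.

```python
import re

# Task header as in the excerpt: "[kernel] <name> <state> <8 hex digits> ..."
TASK_RE = re.compile(r"\[kernel\]\s+(\S+)\s+([RSDTZX])\s+[0-9a-f]{8}\s")
# Call trace entry: "[<address>] symbol+0xoffset/0xlength"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x")


def blocked_tasks(lines, symbol_prefix="balance_dirty_pages"):
    """Return names of D-state tasks whose call trace contains symbol_prefix."""
    found = []
    current = None  # (name, state) of the task whose trace is being read
    for line in lines:
        task = TASK_RE.search(line)
        if task:
            current = (task.group(1), task.group(2))
            continue
        frame = FRAME_RE.search(line)
        if frame and current and current[1] == "D":
            if frame.group(1).startswith(symbol_prefix) and current[0] not in found:
                found.append(current[0])
    return found
```

Fed the lines of the attached dump, this should list dd among the tasks sleeping uninterruptibly under balance_dirty_pages_ratelimited_nr; tasks are reported by name only, so several processes with the same name collapse into one entry.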
> ok, I had some time to test. I applied the patch from
> http://lkml.org/lkml/2007/9/29/57 - and it works well so far. I wrote a 1 GB
> file without blocking. Does this patch has an impact on the performance? I
> noticed that the write rate (according to dd) was 23,9 MB/s which seems not so
> much on an Athlon64 X2 4200+ (despite of the encryption).

good :)

The patch mentioned above is not a real fix; it will break dirty page balance... (but it is a simple way to prove that the problem is here). You should wait for 2.6.24, or see whether some workaround gets backported (see the following thread on LKML too).

Crypt performance: switching to two single-threaded queues should increase performance. When ready, please retest this with 2.6.24-rc (or current -mm); the patches are already in.

Thanks for testing,
Milan
I did the same tests with 2.6.23-rc8-mm2 - it worked well (without patching), and the performance was significantly higher than with a patched 2.6.23 (about 32 MB/s when writing a 1 GB file instead of 24 MB/s with a patched 2.6.23). Do you know if the patch from 2.6.23-rc8-mm2 that fixes the bug can be applied to vanilla 2.6.23, or if there will soon be a port to a 2.6.23.x release?
bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #24 from michael.schachtebeck@stud.uni-goettingen.de 2007-10-11 09:04 -------
> Do you know if the patch from 2.6.23-rc8-mm2 that fixes the bug can be applied
> to vanilla 2.6.23, or if there will soon be a port to a 2.6.23.x release?

dirty_page: it is part of a complex patchset (I have patched my 2.6.23 tree and it is 18 patches from mm!).

For the record, I am applying these patches to 2.6.23, though some cosmetic changes may be needed:

mm/nfs-remove-congestion_end.patch
mm/lib-percpu_counter_add.patch
mm/lib-percpu_counter_sub.patch
mm/lib-percpu_counter-variable-batch.patch
mm/lib-make-percpu_counter_add-take-s64.patch
mm/lib-percpu_counter_set.patch
mm/lib-percpu_counter_sum_positive.patch
mm/lib-percpu_count_sum.patch
mm/lib-percpu_counter_init-error-handling.patch
mm/lib-percpu_counter_init_irq.patch
mm/mm-bdi-init-hooks.patch
mm/mm-scalable-bdi-statistics-counters.patch
mm/mm-count-reclaimable-pages-per-bdi.patch
mm/mm-count-writeback-pages-per-bdi.patch
mm/lib-floating-proportions.patch
mm/mm-per-device-dirty-threshold.patch
mm/mm-per-device-dirty-threshold-warning-fix.patch
mm/mm-per-device-dirty-threshold-fix.patch
+ all dm fixes from mm tree

No idea if this will be backported or "workarounded"...

As for the dm-crypt specific queue patches: no, these will not be backported.

Milan
ok, so I hope that the fixes will be in 2.6.24 and use the patch from http://lkml.org/lkml/2007/9/29/57 for the time being... Thank you for your assistance.
Is this issue fixed in 2.6.24? I found this http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f commit, does it contain the fixes you mentioned above?
bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #27 from michael.schachtebeck@stud.uni-goettingen.de 2008-01-25 00:06 -------
> Is this issue fixed in 2.6.24? I found this
>
> http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f
>
> commit, does it contain the fixes you mentioned above?

Yes, 2.6.24 should contain these fixes.
The bug can be closed then I suppose, thanks.
[root@mail mail]# uname -a
Linux mail.***** 2.6.24.4-64.fc8 #1 SMP Sat Mar 29 09:15:49 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

Writing to a dm-crypted/reiserfs partition runs at about 4M/s, while an unencrypted partition gives almost 100M/s.

So this bug is not fixed!
bugme-daemon@bugzilla.kernel.org wrote:
> writing to dm-crypted/reiserfs partition is ~bout 4M/s, while unencrypted
> partition gives almost 100M/s.
>
> So this bug is not fixed!

If it is about performance, please report it as a separate bug and add info about your system.

There are probably several problems mixed together in this bug. This bug report mainly covers problems with crypt over a loop device, and this should be fixed in recent kernels.
As you advised: http://bugzilla.kernel.org/show_bug.cgi?id=10502