Bug 135481 - Page Allocation Failures/OOM with dm-crypt on md RAID10 with in-progress check/repair/sync
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Alasdair G Kergon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-18 07:11 UTC by Matthias Dahl
Modified: 2016-11-20 22:55 UTC
CC List: 3 users

See Also:
Kernel Version: tested on 4.4.8, 4.5.5, 4.6.3, 4.7.0rc6/7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel log for a test on 4.7.0rc6 (252.88 KB, text/plain)
2016-07-18 07:11 UTC, Matthias Dahl
slabinfo progress log for 4.7.0rc6 (30.24 KB, application/gzip)
2016-07-18 07:14 UTC, Matthias Dahl

Description Matthias Dahl 2016-07-18 07:11:49 UTC
Created attachment 224171
kernel log for a test on 4.7.0rc6

SHORT SUMMARY
=============

Given an md RAID10 (here: imsm) with a plain dm-crypt (aes-xts-plain64) mapping on top, issuing a simple...

 #> dd if=/dev/zero of=/dev/mapper/crypted-device bs=512K status=progress

... will very quickly (as in: under 5 seconds) eat up all memory and force the OOM killer into action **if** a check/repair/resync on the RAID is actually in progress, meaning both the minimum and maximum RAID speed have to be set to a value greater than 0.
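
For reference, a rough sketch of how the triggering condition is set up here; the md device name (md126) and the speed values are examples only and differ per system:

 #> echo 1000   > /proc/sys/dev/raid/speed_limit_min    # minimum speed > 0 so the resync really runs
 #> echo 200000 > /proc/sys/dev/raid/speed_limit_max    # maximum speed > 0
 #> echo check  > /sys/block/md126/md/sync_action       # or repair
 #> dd if=/dev/zero of=/dev/mapper/crypted-device bs=512K status=progress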

If no such action is in progress, memory consumption will still spike to almost the total RAM but the system stays stable and big memory consumers can be started without any trouble.

The issue will also very likely trigger with mkfs.ext4 on the crypted device, but it is not as reliable a trigger as the dd command above.


DETAILS
=======

Using direct I/O with dd/mkfs.ext4 prevents this from happening completely; the excessive memory usage is also gone entirely.
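
For example, the same dd with direct I/O does not show the problem:

 #> dd if=/dev/zero of=/dev/mapper/crypted-device bs=512K oflag=direct status=progress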

Adjusting vm.dirty_background_ratio and vm.dirty_ratio does in fact "hide" the problem, but the proper values vary with system memory: whereas dirty_background_ratio=5 and dirty_ratio=10 work mostly fine with 32 GiB of RAM, with 8 GiB of RAM you have to more than halve those values again, at the very least. Setting both to 0 is in fact the same as using direct I/O and shows neither the excessive memory usage nor any of the problems.
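
For reference, the values that work mostly fine on the 32 GiB machine are set like this (runtime only, not persistent):

 #> sysctl -w vm.dirty_background_ratio=5
 #> sysctl -w vm.dirty_ratio=10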

While dd is running, slabinfo clearly highlights bio-3 and kmalloc-256 as the biggest consumers. Here is a ranking made by Michal Hocko:

$ zcat slabinfo.txt.gz | awk '{printf "%s %d\n" , $1, $6*$15}' | head_and_tail.sh 133 | paste-with-diff.sh | sort -n -k3

                    initial diff [#pages]
radix_tree_node     3444    2592
debug_objects_cache 388     46159
file_lock_ctx       114     138570
buffer_head         5616    238704
kmalloc-256         328     573164
bio-3               24      1118984

Since I am setting up a new machine, I have to do all tests in a live environment booted through USB -- mainly OpenSuSE Tumbleweed and Fedora Rawhide. Yet I have not been able to reproduce this exact problem in any kind of VM I tried with the same images. It is only 100% reproducible on the real hardware, which complicates matters since, after all my testing is done, I have a nice long resync ahead of me.

Monitoring the memory usage with free shows that both "used" and "buffer/cache" rise almost equally until memory is exhausted (with some variation, but in general it holds true). The memory is clearly spent in-kernel, as no single process is eating up all that memory.

All my tests are done on an IMSM-based RAID10. I cannot say for certain whether this is a contributing factor; unfortunately I can only test with this particular array. I did several tests with an external disk connected through USB3 and a RAID10 on top of it (spread over several partitions), but I was unable to reproduce the problem -- probably because of differences in the underlying interface/subsystem (SATA3 vs. USB3).

While the dd command is running, one can slowly increase the resync speed through the sysctl knobs. There is a certain threshold; once you cross it, you run into OOM. In my case it is somewhere north of 8 MiB/s; below that, it is still fine. If a sync_action is in progress but the speed is capped at zero, it behaves the same as if no action were in progress: everything works, and even though memory usage is still excessive, big memory consumers can be started and the system can be used without the OOM killer getting in the way.
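
The sysctl knobs meant here are presumably the global md speed limits (values in KB/s); stepping the minimum up while dd is running looks roughly like this, with the numbers being examples only:

 #> echo 4000  > /proc/sys/dev/raid/speed_limit_min    # below the threshold: still fine
 #> echo 10000 > /proc/sys/dev/raid/speed_limit_min    # north of ~8 MiB/s: OOM follows shortly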


TEST SYSTEM
===========

RAID10 (imsm) over 4 disks (SATA3) w/ 64 KiB chunk size on a Z170-based system with 32 GiB of RAM (CPU: i7 6700k). Tests were all done on either an OpenSuSE Tumbleweed or a Fedora Rawhide live system booted through USB.
Comment 1 Matthias Dahl 2016-07-18 07:14:57 UTC
Created attachment 224181
slabinfo progress log for 4.7.0rc6

This is a progression of slabinfo over time while dd was running, up until the OOM killer kicked in.
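
(The exact capture method is not recorded here; something along these lines, sampling /proc/slabinfo once per second as root, produces an equivalent log:)

 #> while true; do date; cat /proc/slabinfo; sleep 1; done > slabinfo.txt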
Comment 2 Matthias Dahl 2016-07-18 07:19:01 UTC
Since I could only select one component for this bug, I would kindly ask to cc the md (and mm?) folks as well, as it is still unclear where the real problem lies and I guess it falls into all of those areas.
Comment 3 Bryan Apperson 2016-11-20 22:08:28 UTC
I am able to reproduce this on:

Linux localhost 4.8.8-2-ARCH #1 SMP PREEMPT Thu Nov 17 14:51:03 CET 2016 x86_64 GNU/Linux

# cryptsetup -v --cipher aes-xts-plain64 --key-size 512 --hash sha512 --iter-time 5000 --use-urandom luksFormat /dev/sdc1
# cryptsetup open /dev/sdc1 encryption_test
# dd if=/dev/zero of=/dev/mapper/encryption_test status=progress bs=1M

Without the following settings:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

... the dd above will slow the machine to a halt; the load climbs to 10+ and the system becomes unresponsive. I tested the same drive without LUKS/dm-crypt and was able to run dd without issue.

The problem does subside completely with direct-I/O-like behaviour, i.e. when setting:

vm.dirty_background_ratio = 0
vm.dirty_ratio = 0

Tested using an external USB hard drive.
Comment 4 Bryan Apperson 2016-11-20 22:28:16 UTC
The title may also not be the best fit for this bug, as it is reproducible on multiple types of disk, not just a RAID array undergoing repair.
Comment 5 Bryan Apperson 2016-11-20 22:55:26 UTC
I would like to amend my earlier comment: I actually cannot reproduce this on my internal RAID controller. I am using a Z170 chipset; is it possible this is related to some interrupt congestion?
