Bug 100681

Summary: writeback changes merged in 4.2 cause closing of luks crypt devices to hang
Product: IO/Storage Reporter: Jon Christopherson (jon)
Component: LVM2/DMAssignee: Tejun Heo (tj)
Status: CLOSED CODE_FIX    
Severity: high CC: agk, jon, snitzer
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.2.0-9-generic #201506290030 SMP Mon Jun 29 00:33:15 CDT 2015 x86_64 x86_64 x86_64 GNU/Linux Subsystem:
Regression: Yes Bisected commit-id:
Attachments: kernel cmdline
loaded modules
CPU info
dmesg
blocked tasks and trace
hung task timeout output
sysrq show blocked state after revert
latest hung task notification

Description Jon Christopherson 2015-06-29 15:56:04 UTC
After the recent DM changes My luks crypt devices hang on closing. I am running the latest mainline kernel as of 6/29/2015 compiled from a cloned git repo. If I revert to 4.1.0 the devices close correctly.

This command triggers the bug:

enigma jon # /opt/cryptsetup/sbin/cryptsetup --debug luksClose data
# cryptsetup 1.7.0-git processing "/opt/cryptsetup/sbin/cryptsetup --debug luksClose data"
# Running command close.
# Locking memory.
# Installing SIGINT/SIGTERM handler.
# Unblocking interruption on signal.
# Allocating crypt device context by device data.
# Initialising device-mapper backend library.
# dm version   OF   [16384] (*1)
# dm versions   OF   [16384] (*1)
# Detected dm-crypt version 1.14.0, dm-ioctl version 4.32.0.
# Device-mapper backend running with UDEV support enabled.
# dm status data  OF   [16384] (*1)
# Releasing device-mapper backend.
# Trying to open and read device /dev/sdc2 with direct-io.
# Allocating crypt device /dev/sdc2 context.
# Trying to open and read device /dev/sdc2 with direct-io.
# Initialising device-mapper backend library.
# dm table data  OFW    [16384] (*1)
# Trying to open and read device /dev/sdc2 with direct-io.
# Crypto backend (gcrypt 1.5.3, flawed whirlpool) initialized.
# Detected kernel Linux 4.2.0-9-generic x86_64.
# Reading LUKS header of size 1024 from device /dev/sdc2
# Key length 64, device size 3873472512 sectors, header size 4036 sectors.
# Deactivating volume data.
# dm status data  OF   [16384] (*1)
# Udev cookie 0xd4d5533 (semid 720896) created
# Udev cookie 0xd4d5533 (semid 720896) incremented to 1
# Udev cookie 0xd4d5533 (semid 720896) incremented to 2
# Udev cookie 0xd4d5533 (semid 720896) assigned to REMOVE task(2) with flags (0x0)
# dm remove data  OFT    [16384] (*1)

It seems to remove/close/relock the device, but never returns. I do not see any kernel panics, the task will just sit there forever causing shutdown/restarts to fail since most scripts try and close the devices on reboot/shutdown.
Comment 1 Jon Christopherson 2015-06-29 16:00:10 UTC
Created attachment 181331 [details]
kernel cmdline
Comment 2 Jon Christopherson 2015-06-29 16:00:54 UTC
Created attachment 181341 [details]
loaded modules
Comment 3 Jon Christopherson 2015-06-29 16:01:22 UTC
Created attachment 181351 [details]
CPU info
Comment 4 Jon Christopherson 2015-06-29 17:09:31 UTC
Created attachment 181361 [details]
dmesg
Comment 5 Jon Christopherson 2015-06-29 17:21:38 UTC
Created attachment 181371 [details]
blocked tasks and trace
Comment 6 Jon Christopherson 2015-06-29 17:40:11 UTC
Created attachment 181391 [details]
hung task timeout output
Comment 7 Mike Snitzer 2015-06-29 17:55:23 UTC
Hmm, can you try reverting commit 0f20972f7 ("dm: factor out a common cleanup_mapped_device()")?

I think I need to stop putting "No functional change" in commit headers that aren't intended to have a functional change... ;)
Comment 8 Jon Christopherson 2015-06-29 18:28:20 UTC
I am compiling the kernel now with head at 1bc5e157ed2b4f5b206155fc772d860158acd201 and the above commit reverted. Will post results as soon as I can reboot into it. The issue doesn't exist in the 4.1.0 release and as I recall for a couple days into the 4.2 merge window it wasn't there as well.
Comment 9 Jon Christopherson 2015-06-29 18:46:38 UTC
I am seeing the same behavior with the above mentioned commit reverted. I have attached a new call trace.
Comment 10 Jon Christopherson 2015-06-29 18:47:21 UTC
Created attachment 181401 [details]
sysrq show blocked state after revert
Comment 11 Jon Christopherson 2015-06-29 18:53:54 UTC
Created attachment 181411 [details]
latest hung task notification
Comment 12 Jon Christopherson 2015-06-29 20:44:26 UTC
I have been testing the kernel before all the dm changes on the 25th and do not get the error before the 25th. after the changes on the 25th , the error occurs. there were a lot of commits that day so not sure which one is causing it yet.
Comment 13 Jon Christopherson 2015-06-30 06:11:11 UTC
I have stepped through every commit that touches the DM code and found the commit that breaks things:

	e4bc13adfd ("Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block")

Until this commit things work as expected. Afterwards things no longer work. Time to see if disabling cgroup writeback support in the kernel config restores functionality.
Comment 14 Jon Christopherson 2015-06-30 07:28:40 UTC
disabling the config option CGROUP_WRITEBACK restores normal functionality. Something in that code is interfering with standard luks operations.
Comment 15 Mike Snitzer 2015-06-30 12:51:52 UTC
Reassigning to Tejun.
Tejun, I can assist with this but it is clearly your area of expertise.
Comment 16 Tejun Heo 2015-06-30 13:22:02 UTC
Heh, more like my area of failure. Yeah, definitely looks like a bug in cgroup writeback code. Looking into it.
Comment 17 Jon Christopherson 2015-06-30 16:01:51 UTC
Thanks for looking into it guys. I used to do a lot of coding for ircd-hybrid ages ago so have a lot of respect for you kernel developers.
Comment 18 Tejun Heo 2015-07-02 00:57:05 UTC
Patches to fix the bug posted.

 http://lkml.kernel.org/g/20150702005253.GA26440@mtj.duckdns.org
 http://lkml.kernel.org/g/20150702005337.GB26440@mtj.duckdns.org

Jon, it'd be great if you can confirm on the mailing list that the above two patches fix the reported problem.

Thanks!
Comment 19 Jon Christopherson 2015-07-02 03:08:47 UTC
Hello Tejun,

Thanks for your help! The expected behavior has returned and all is well.

-Jon