Bug 107641 - Busy luks disk not closable forever if medium is removed
Status: CLOSED DOCUMENTED
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM
Hardware: All Linux
Importance: P5 low
Assignee: Alasdair G Kergon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-11-10 02:28 UTC by Andrey Utkin
Modified: 2016-09-15 17:44 UTC

See Also:
Kernel Version: 4.3.0
Subsystem:
Regression: No
Bisected commit-id:


Description Andrey Utkin 2015-11-10 02:28:50 UTC
Setup:
A USB stick with a LUKS+EXT4 partition. It is meant to be inserted and removed without manual unmounting beforehand, since the important data on it is read-only most of the time. It is mounted and unmounted by udev rules:

 # cat /etc/udev/rules.d/99-sensitive_mount.rules 
ACTION=="add", ENV{.ID_FS_TYPE_NEW}=="crypto_LUKS", ENV{ID_SERIAL}=="JetFlash_Transcend_8GB_082AA696KZPJGI1C-0:0", RUN+="/usr/local/sbin/sensitive_mount.sh"
ACTION=="remove", ENV{.ID_FS_TYPE_NEW}=="crypto_LUKS", ENV{ID_SERIAL}=="JetFlash_Transcend_8GB_082AA696KZPJGI1C-0:0", RUN+="/usr/local/sbin/sensitive_umount.sh"


 # cat /usr/local/sbin/sensitive_mount.sh
#!/bin/bash
/usr/local/sbin/sensitive_umount.sh
cryptsetup luksOpen --key-file /etc/sensitive.key /dev/disk/by-uuid/840dfa3f-e449-4124-8ebb-dea5785214e8 sensitive && mount -o noatime,sync /dev/mapper/sensitive /j/sensitive


 # cat /usr/local/sbin/sensitive_umount.sh
#!/bin/bash
umount -l /j/sensitive
cryptsetup close sensitive


It worked fine until I added GPG data to it. When gpg-agent is running, it holds this UNIX socket open:

 # lsof | grep gnupg
gpg-agent 3744               j    3u     unix 0xffff88014453fa80      0t0      13557 /j/.gnupg/S.gpg-agent type=STREAM

As a result, "cryptsetup close" fails from then on, reporting that the device is busy, and there is no way to recover.


dmesg output for removal of the USB stick when gpg-agent is not running:

 $ cat luks_umount_fail.dmesg 
[ 3671.056097] usb 2-1.4: USB disconnect, device number 3
[ 3671.080108] Buffer I/O error on dev dm-0, logical block 0, lost async page write
[ 3671.080122] Buffer I/O error on dev dm-0, logical block 1, lost async page write
[ 3671.080128] Buffer I/O error on dev dm-0, logical block 465, lost async page write
[ 3671.080133] Buffer I/O error on dev dm-0, logical block 481, lost async page write
[ 3671.080137] Buffer I/O error on dev dm-0, logical block 483, lost async page write
[ 3671.080143] Buffer I/O error on dev dm-0, logical block 8679, lost async page write
[ 3671.080492] Buffer I/O error on dev dm-0, logical block 557056, lost sync page write
[ 3671.080505] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
[ 3671.080509] Aborting journal on device dm-0-8.
[ 3671.080578] Buffer I/O error on dev dm-0, logical block 557056, lost sync page write
[ 3671.080586] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
[ 3671.080610] EXT4-fs (dm-0): previous I/O error to superblock detected
[ 3671.080898] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[ 3671.081308] EXT4-fs error (device dm-0): ext4_put_super:800: Couldn't clean up the journal
[ 3671.081313] EXT4-fs (dm-0): Remounting filesystem read-only
[ 3671.081316] EXT4-fs (dm-0): previous I/O error to superblock detected
[ 3671.081380] Buffer I/O error on dev dm-0, logical block 0, lost sync page write


Attaching it back:


[ 3707.263232] usb 2-1.4: new high-speed USB device number 4 using ehci-pci
[ 3707.350263] usb 2-1.4: New USB device found, idVendor=8564, idProduct=1000
[ 3707.350270] usb 2-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 3707.350273] usb 2-1.4: Product: Mass Storage Device
[ 3707.350276] usb 2-1.4: Manufacturer: JetFlash
[ 3707.350279] usb 2-1.4: SerialNumber: 082AA696KZPJGI1C
[ 3707.350657] usb-storage 2-1.4:1.0: USB Mass Storage device detected
[ 3707.351751] scsi host5: usb-storage 2-1.4:1.0
[ 3708.524492] scsi 5:0:0:0: Direct-Access     JetFlash Transcend 8GB    1100 PQ: 0 ANSI: 4
[ 3708.525690] sd 5:0:0:0: Attached scsi generic sg1 type 0
[ 3708.526455] sd 5:0:0:0: [sdb] 15433728 512-byte logical blocks: (7.90 GB/7.35 GiB)
[ 3708.527197] sd 5:0:0:0: [sdb] Write Protect is off
[ 3708.527203] sd 5:0:0:0: [sdb] Mode Sense: 43 00 00 00
[ 3708.527944] sd 5:0:0:0: [sdb] No Caching mode page found
[ 3708.527948] sd 5:0:0:0: [sdb] Assuming drive cache: write through
[ 3708.532079]  sdb: sdb1 sdb2
[ 3708.535450] sd 5:0:0:0: [sdb] Attached SCSI removable disk
[ 3711.853602] EXT4-fs (dm-0): recovery complete
[ 3711.857413] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)


Now it is removed with gpg-agent running:


[ 3768.584489] usb 2-1.4: USB disconnect, device number 4
[ 3773.421215] systemd-udevd[4409]: Process '/usr/local/sbin/sensitive_umount.sh' failed with exit code 5.
[ 3778.857712] Buffer I/O error on dev dm-0, logical block 0, lost async page write
[ 3778.857725] Buffer I/O error on dev dm-0, logical block 1, lost async page write
[ 3778.857730] Buffer I/O error on dev dm-0, logical block 450, lost async page write
[ 3778.857735] Buffer I/O error on dev dm-0, logical block 465, lost async page write
[ 3778.857739] Buffer I/O error on dev dm-0, logical block 481, lost async page write
[ 3778.857743] Buffer I/O error on dev dm-0, logical block 482, lost async page write
[ 3778.857748] Buffer I/O error on dev dm-0, logical block 8679, lost async page write
[ 3813.807257] Buffer I/O error on dev dm-0, logical block 557056, lost sync page write
[ 3813.807292] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
[ 3813.807297] Aborting journal on device dm-0-8.
[ 3813.808554] Buffer I/O error on dev dm-0, logical block 557056, lost sync page write
[ 3813.808563] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
[ 3813.808588] EXT4-fs (dm-0): previous I/O error to superblock detected
[ 3813.808654] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[ 3813.808660] EXT4-fs error (device dm-0): ext4_put_super:800: Couldn't clean up the journal
[ 3813.808664] EXT4-fs (dm-0): Remounting filesystem read-only
[ 3813.808666] EXT4-fs (dm-0): previous I/O error to superblock detected
[ 3813.808718] Buffer I/O error on dev dm-0, logical block 0, lost sync page write


The important outcome of this bug is that the user cannot map the disk under the same device-mapper name again until the machine is rebooted.

Any help? I am ready to build and test a patched kernel, and to tinker with the code. Thanks for any help.
Comment 1 Alasdair G Kergon 2015-11-10 02:38:05 UTC
Well, firstly you need to work out what is still holding the device and why it cannot be removed. Then decide whether there is a controlled way to release it, or whether this could be considered a bug.

Secondly, see if there is a workaround that at least moves it out of the way so the name can be reused.  At the dm layer, there are tricks like 'dmsetup rename' and 'dmsetup remove', the latter of which can also take --force and --deferred.
Comment 2 Alasdair G Kergon 2015-11-10 02:40:20 UTC
If a running gpg-agent is the issue, do you need a mechanism for the udev rule to signal to that process that it must release the device first?
Comment 3 Alasdair G Kergon 2015-11-10 02:43:48 UTC
Try having the udev rule first kill the gpg-agent process.  That could be generalised to kill all processes using the device being removed.  Tools like 'lsof' show you what is using it.
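The suggestion above could be sketched as a variant of sensitive_umount.sh along the following lines. This is only an illustration under assumptions, not a tested fix: it assumes 'fuser' from psmisc is installed, reuses the mount point and mapping name from the report, and the function name release_and_close is made up for this example. Whether 'fuser -m' sees every kind of holder (for instance a process that only has a bound UNIX socket on the filesystem) has not been verified here, and the killed processes may take a moment to exit, so the close can still fail.

```shell
#!/bin/bash
# Sketch only: kill users of the filesystem before unmounting and closing.
# MOUNTPOINT and MAPPING default to the values used in this report.
MOUNTPOINT=${MOUNTPOINT:-/j/sensitive}
MAPPING=${MAPPING:-sensitive}

# release_and_close is a hypothetical helper name, not part of any tool.
release_and_close() {
    # -m selects every process using the filesystem mounted at MOUNTPOINT,
    # -k sends each of them a signal (SIGKILL by default).
    fuser -k -m "$MOUNTPOINT"

    # Prefer a normal unmount; fall back to a lazy one if it is still busy.
    umount "$MOUNTPOINT" || umount -l "$MOUNTPOINT"

    # Finally drop the dm-crypt mapping so the name can be reused.
    cryptsetup close "$MAPPING"
}

# In the real udev-triggered script one would end with:
#   release_and_close
```

Killing processes before the unmount matters because after 'umount -l' the mount point no longer identifies the filesystem, so a later 'fuser -m' on it would not find the holders.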
Comment 5 Andrey Utkin 2015-11-10 02:51:21 UTC
Thanks for quick reply.

> If a running gpg-agent is the issue, do you need a mechanism for the udev
> rule to signal to that process that it must release the device first?

I believe there would be a race condition anyway. Maybe not with this application, but with another one. I think I can work around my personal issue by symlinking that UNIX socket to a location outside my encrypted stick.


> Well, firstly you need to try to work out what is still holding the device
> and why it cannot be removed

I think I'll increase kernel log verbosity level and see what it gives. Otherwise, I have no idea how to debug that.
Comment 6 Alasdair G Kergon 2015-11-10 03:05:23 UTC
Yes, or perhaps something like --no-use-standard-socket

              By enabling this option gpg-agent will listen on the socket
              named ‘S.gpg-agent’, located in the home directory, and not
              create a random socket below a temporary directory.  Tools
              connecting to gpg-agent should first try to connect to the socket
              given in environment variable GPG_AGENT_INFO and then fall back to
              this socket.  This option may not be used if the home directory
              is mounted as a remote file system.  Note, that
              --use-standard-socket is the default on Windows systems.
Comment 7 Alasdair G Kergon 2015-11-10 03:07:15 UTC
Or probably the opposite, --use-standard-socket
Comment 8 Andrey Utkin 2015-11-10 03:11:47 UTC
I have GnuPG 2.1.9, where these options no longer have any effect, and there is no way to set the socket location.

       --use-standard-socket

       --no-use-standard-socket

       --use-standard-socket-p
              Since GnuPG 2.1 the standard socket is always used.  These options have no more effect.  The command
              gpg-agent --use-standard-socket-p will thus always return success.
Comment 9 Andrey Utkin 2015-11-10 03:15:11 UTC
OK, killing gpg-agent works for me, but this still smells like a device-mapper bug.
Comment 10 Alasdair G Kergon 2015-11-10 12:15:20 UTC
The kernel quite correctly won't let you remove a filesystem or device in a safe and controlled way while it is still in use - it does look like your problem can be resolved entirely in userspace.  If you can't find a way to control where the file is being placed, try the symlink idea, or speak to the gpg-agent developers.
Comment 11 Andrey Utkin 2015-11-10 13:05:08 UTC
So you consider it perfectly correct that the /dev/mapper name cannot be reused after that, and that anyone who wants to reuse it has to reboot the physical machine, because even "dmsetup remove" and "dmsetup rename" don't work?

 # dmsetup remove sensitive
device-mapper: remove ioctl on sensitive failed: Device or resource busy
Command failed
[ERR]
15:01root@acer ~
 # dmsetup remove --force sensitive
device-mapper: resume ioctl on sensitive failed: Invalid argument
device-mapper: remove ioctl on sensitive failed: Device or resource busy
Command failed
[ERR]
15:02root@acer ~
 # dmsetup remove --force sensitive
device-mapper: resume ioctl on sensitive failed: Invalid argument
device-mapper: remove ioctl on sensitive failed: Device or resource busy
Command failed
[ERR]
15:02root@acer ~
 # dmsetup remove --force --deferred sensitive
device-mapper: resume ioctl on sensitive failed: Invalid argument
[OK]
15:02root@acer ~
 # ls /dev/mapper/
control  sensitive
[OK]
15:02root@acer ~
 # dmsetup remove --force --deferred sensitive
device-mapper: resume ioctl on sensitive failed: Invalid argument


Renaming:
 # dmsetup rename sensitive sensitive_garbage
device-mapper: rename ioctl on sensitive failed: No such device or address
Command failed
Comment 12 Alasdair G Kergon 2015-11-11 00:48:31 UTC
Are there still any users of the device?  What does the 'Open count' field in the 'dmsetup info' output say?  If it is non-zero, you need to eliminate those users first - kill the process, unmount the filesystem etc.

The deferred remove OR the rename (but not both!) might have worked, but you need to look at the state with 'dmsetup info' after each command to find out.  ('dmsetup info -c' gives a handy summary.  Also 'dmsetup table' can show if a forced remove was attempted.)
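As a rough illustration of checking the state after each command, the helpers below pull the "Open count" field out of 'dmsetup info' output. They are a sketch under assumptions: the function names dm_open_count and dm_is_busy are made up for this example, and the parsing assumes the usual "Field: value" layout of 'dmsetup info'. They read the output on stdin, so they can also be run on saved output.

```shell
#!/bin/bash
# Sketch only: print the "Open count" value from `dmsetup info` output on stdin.
dm_open_count() {
    awk -F': *' '$1 == "Open count" { print $2; exit }'
}

# Succeeds (exit status 0) if the output on stdin reports an open count above
# zero, i.e. something still holds the device. Fails if the count is zero or
# the field is missing.
dm_is_busy() {
    local count
    count=$(dm_open_count)
    [ -n "$count" ] && [ "$count" -gt 0 ]
}

# Example use as root, with the mapping name from this report:
#   dmsetup info sensitive | dm_open_count
#   dmsetup info sensitive | dm_is_busy && echo "still in use"
```

Running the check between a 'dmsetup remove --deferred' and any further attempt shows whether the remaining holder has gone away, rather than guessing from the error messages.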
Comment 13 Andrey Utkin 2015-11-12 15:28:03 UTC
Sorry for the delay, I will check in the next few days.
The idea that some processes may still have files open is reasonable, and I will check that. Maybe it really is the reason why closing doesn't work.
