Bug 204757 - Mount/df/PAM login hangs during rsync to btrfs subvolume, or maybe doing btrfs subvolume snapshot
Summary: Mount/df/PAM login hangs during rsync to btrfs subvolume, or maybe doing btrf...
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs (show other bugs)
Hardware: All Linux
: P1 blocking
Assignee: BTRFS virtual assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-03 09:46 UTC by David Newall
Modified: 2019-09-03 09:46 UTC (History)
0 users

See Also:
Kernel Version: 4.4.0-159
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description David Newall 2019-09-03 09:46:11 UTC
Hi All,

I've been trying to track down a problem, intermittently, for a long time, and now need to reach out for advice.  I apologise in advance for the quality of this report, which I feel includes more detail than needed, yet may be missing what's important.  I'm trying my best.

The brief summary is that my system hangs during SSH login while a backup is in progress.  Sshd uses PAM authentication.  The problem seems to be related to mounts as df and mount also hang.

The longer details are:  I'm running Ubuntu 16.04.5 on a 64-bit VM under kvm.  I backup data using the following steps:

1. Take an LVM2 snapshot of the (non-root) ext2 file-system mounted as /data;
2. Mount a btrfs file system as /backup;
2. Mount the snapshot over an empty directory (may be subvolume; does it make a difference?) on /backup/snapshot;
3. Rsync the snapshot (with --archive --one-file-system --hard-links --inplace --numeric-ids --delete) to a subvolume /backup/data (thus it always contains /data as at last backup);
4. Take btrfs subvolume snapshot of /backup/data;
5. Unmount /backup/snapshot and /backup.

By the time I get called, SSH logins via PAM hang (but complete "immediately" if I re-configure sshd for UsePAM no).  Sessions which are still logged in seem unaffected, except df and mount both hang.  I don't know what else hangs.

During all of these steps, the /data is almost static, maybe even be completely static.

I've queried my user, carefully, to determine the exact step where it starts to hang, and am 90% confident in her answer, which indicates that the hang-condition starts during rsync.

Processes that were hanging complete normally when subvolume snapshot finishes.

There's a chance that processes complete when the snapshot or btrfs file-system is unmounted, but I think it's before then because I've tried running each step by hand, was unable to reproduce the problem, probably because the amount of data to rsync in real-use is much larger than I tried writing during that test.  At any rate, during that test I could log in between and during each step of the procedure.

The only messages in dmesg are "mounting ext2 file system using the ext4 subsystem" and "mounted filesystem without journal. Opts: (null)", which sounds right as I use "mount" instead of "mount -text2".

When I tried running df under strace, strace's output was:

  open("/proc/self/mountinfo", O_RDONLY)  = 3
  fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
  read(3, "18 24 0:17 / /sys rw,nosuid,node"..., 1024) = 1024
  read(3, "ystemd/systemd-cgroups-agent,nam"..., 1024) = 1024
  read(3, "t rw,nosuid,nodev,noexec,relatim"..., 1024) = 1024
  read(3, 

After the subvolume snapshot completed, strace continued producing output:

  "fs lxcfs rw,user_id=0,group_id=0"..., 1024) = 624
  --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=28055, si_uid=1000} ---
  read(3, "", 1024)                       = 0
  lseek(3, 0, SEEK_CUR)                   = 3696
  close(3)                                = 0

I think the SIGCONT was because I suspended the parent, strace, using Ctrl-Z.

I could just leave sshd doing non-PAM authentication but I think that's the wrong approach.  How do I zero in on this problem?

Thanks,

David

Note You need to log in before you can comment on or make changes to this bug.