Bug 204757

Summary: Mount/df/PAM login hangs during rsync to btrfs subvolume, or maybe doing btrfs subvolume snapshot
Product: File System Reporter: David Newall (btrfs)
Component: btrfsAssignee: BTRFS virtual assignee (fs_btrfs)
Status: NEW ---    
Severity: blocking    
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.4.0-159 Subsystem:
Regression: No Bisected commit-id:

Description David Newall 2019-09-03 09:46:11 UTC
Hi All,

I've been trying to track down a problem, intermittently, for a long time, and now need to reach out for advice.  I apologise in advance for the quality of this report, which I feel includes more detail than needed, yet may be missing what's important.  I'm trying my best.

The brief summary is that my system hangs during SSH login while a backup is in progress.  Sshd uses PAM authentication.  The problem seems to be related to mounts as df and mount also hang.

The longer details are:  I'm running Ubuntu 16.04.5 on a 64-bit VM under kvm.  I backup data using the following steps:

1. Take an LVM2 snapshot of the (non-root) ext2 file-system mounted as /data;
2. Mount a btrfs file system as /backup;
2. Mount the snapshot over an empty directory (may be subvolume; does it make a difference?) on /backup/snapshot;
3. Rsync the snapshot (with --archive --one-file-system --hard-links --inplace --numeric-ids --delete) to a subvolume /backup/data (thus it always contains /data as at last backup);
4. Take btrfs subvolume snapshot of /backup/data;
5. Unmount /backup/snapshot and /backup.

By the time I get called, SSH logins via PAM hang (but complete "immediately" if I re-configure sshd for UsePAM no).  Sessions which are still logged in seem unaffected, except df and mount both hang.  I don't know what else hangs.

During all of these steps, the /data is almost static, maybe even be completely static.

I've queried my user, carefully, to determine the exact step where it starts to hang, and am 90% confident in her answer, which indicates that the hang-condition starts during rsync.

Processes that were hanging complete normally when subvolume snapshot finishes.

There's a chance that processes complete when the snapshot or btrfs file-system is unmounted, but I think it's before then because I've tried running each step by hand, was unable to reproduce the problem, probably because the amount of data to rsync in real-use is much larger than I tried writing during that test.  At any rate, during that test I could log in between and during each step of the procedure.

The only messages in dmesg are "mounting ext2 file system using the ext4 subsystem" and "mounted filesystem without journal. Opts: (null)", which sounds right as I use "mount" instead of "mount -text2".

When I tried running df under strace, strace's output was:

  open("/proc/self/mountinfo", O_RDONLY)  = 3
  fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
  read(3, "18 24 0:17 / /sys rw,nosuid,node"..., 1024) = 1024
  read(3, "ystemd/systemd-cgroups-agent,nam"..., 1024) = 1024
  read(3, "t rw,nosuid,nodev,noexec,relatim"..., 1024) = 1024
  read(3, 

After the subvolume snapshot completed, strace continued producing output:

  "fs lxcfs rw,user_id=0,group_id=0"..., 1024) = 624
  --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=28055, si_uid=1000} ---
  read(3, "", 1024)                       = 0
  lseek(3, 0, SEEK_CUR)                   = 3696
  close(3)                                = 0

I think the SIGCONT was because I suspended the parent, strace, using Ctrl-Z.

I could just leave sshd doing non-PAM authentication but I think that's the wrong approach.  How do I zero in on this problem?

Thanks,

David