Bug 173001 - file create shortly after boot gets EBUSY - inode_unhashed() false in insert_inode_locked4()
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs
Hardware: All Linux
Importance: P1 normal
Assignee: Josef Bacik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-09-27 21:12 UTC by Dave Olson
Modified: 2022-10-06 17:32 UTC
CC List: 1 user

See Also:
Kernel Version: 4.1.25
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Dave Olson 2016-09-27 21:12:52 UTC
This is seen on a modified 4.1.25 kernel with local networking modifications, but no changes to the generic fs or btrfs code, except that commit 033ad030df0ea932a21499582fea59e1df95769b (fsync() changes) was backed out because of performance problems.   This is on top of a Debian jessie x86_64 distribution.

These are the BTRFS config options I have (the debug options were set while looking into this problem):

CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_BTRFS_FS_CHECK_INTEGRITY=y
CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y
CONFIG_BTRFS_DEBUG=y
# CONFIG_BTRFS_ASSERT is not set

The symptom is that when the system is power-cycled a minute or two after all services are up and running, without a sync, shutdown, etc., then on the following boot logrotate gets an error when creating the file bootlog.1.gz in preparation for gzipping bootlog.1.   This is a standard logrotate run, and bootlog.1.gz had previously existed and been rename()'d to bootlog.2.gz.

This happens with other files created during boot as well, but the logrotate case seems to be the most common failure.  The problem has been seen on multiple x86_64 (Rangeley) systems with different kinds of SSD.

I've used both jessie's btrfs-tools 3.17-1.1 and the btrfs 4.7 tools that I built myself to verify that there are no filesystem problems before or after the EBUSY failure.

The mount options are
  rw,relatime,ssd,flushoncommit,space_cache
I've tried with and without ssd, flushoncommit, and space_cache options, with no difference in behavior.

logrotate tries to create the file that does not exist with this system call (as shown by strace):

  open("/var/log/boot.log.1.gz", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW, 0600) = -1 EBUSY (Device or resource busy)

My preliminary printk debugging shows that the failure is a non-zero return from inode_unhashed() here, in fs/inode.c:

	if (likely(!old)) {
		spin_lock(&inode->i_lock);
		inode->i_state |= I_NEW;
		hlist_add_head(&inode->i_hash, head);
		spin_unlock(&inode->i_lock);
		spin_unlock(&inode_hash_lock);
		return 0;
	}
	__iget(old);
	spin_unlock(&old->i_lock);
	spin_unlock(&inode_hash_lock);
	wait_on_inode(old);
---->>>	if (unlikely(!inode_unhashed(old))) {
		iput(old);
		printk(KERN_INFO "OLSON EBUSY at %s", __func__);
		return -EBUSY;
	}
	iput(old);

This failure occurs once every 2-10 power cycles (I happened to be doing some powercycle testing for an unrelated issue).

I'm still debugging this, but thought I'd file this bug to see if anybody has any ideas about what might be happening.

If I stop in initramfs, 'btrfs fi check /dev/sda4' shows no problems.
Comment 1 Dave Olson 2016-09-27 21:21:30 UTC
I forgot to add that the logrotate file is in /var/log, on the root filesystem.  Subvolumes are in use, but the files that get EBUSY are not on subvolumes.  The other files for which we have seen this problem are also on the root filesystem, created during service startup.
Comment 2 Dave Olson 2016-10-04 06:37:27 UTC
This is also reproducible on kernel.org 4.8.0 kernel (built checked out from tag 4.8 today), with no modifications.
Comment 3 Dave Olson 2016-10-06 16:35:53 UTC
I've tracked down the problem (on the unmodified 4.8 kernel), and have a hack to get around it (the hack likely leaks all kinds of things, it was just to confirm that I had found the cause).

When a btrfs filesystem is mounted, it appears to go through the journal and determine highest_objectid.   That value is incremented for every new inode allocated and used as i_ino.

For the failure case, that inode number is always stale and already in use by a file written prior to the power cycle.   My test is "lucky" in that logrotate renames the file with that inode number, so fs/inode.c sees it as still referenced; when a new inode is allocated with the same i_ino, the collision is detected and EBUSY is returned.

It appears to me that even with flushoncommit set in the mount options, the portion of the journal that holds the new highest_objectid is not flushed to disk before the open (create) system call returns, and that is what leads to the double allocation.

In all of the cases (38 in an overnight run), the objectid (i_ino) that is allocated and gets EBUSY from btrfs_create() is one greater than the highest_objectid value at filesystem mount.  Retrying btrfs_find_free_ino() and the rest of btrfs_create() "always works".

Here's the hacky patch I used to confirm the problem:

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6811c4..5b55c48 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6499,6 +6529,7 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
        if (IS_ERR(trans))
                return PTR_ERR(trans);
 
+OLSON:
        err = btrfs_find_free_ino(root, &objectid);
        if (err)
                goto out_unlock;
@@ -6508,6 +6539,12 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
                                mode, &index);
        if (IS_ERR(inode)) {
                err = PTR_ERR(inode);
+		if (err == -EBUSY) { /* OLSON */
+			printk(KERN_WARNING "OLSON:EBUSY in %s file=(%s) objectid=%lu, retry\n",
+			       __func__, dentry->d_name.name ? dentry->d_name.name : "NULL",
+			       (unsigned long)objectid);
+			goto OLSON;
+		}
                goto out_unlock;
        }
        drop_inode_on_err = 1;
Comment 4 Dave Olson 2016-10-06 16:42:55 UTC
Oh, and btrfs check (neither 3.17 nor 4.7.3) detects the stale highest_objectid problem in my tests.
Comment 5 David Sterba 2022-10-06 17:32:01 UTC
This is a semi-automated bugzilla cleanup, report is against an old kernel version. If the problem still happens, please open a new bug. Thanks.
