Bug 203993

Summary: BTRFS hang after incorrect patch
Product: File System Reporter: Olivier Mazouffre (olivier.mazouffre)
Component: btrfsAssignee: BTRFS virtual assignee (fs_btrfs)
Status: CLOSED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: stf_xl
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.14.125 Subsystem:
Regression: No Bisected commit-id:

Description Olivier Mazouffre 2019-06-26 08:42:20 UTC
With lastest 4.14.x kernel, BTRFS hangs randomly with this type of error:

[Wed Jun 19 09:27:44 2019] INFO: task btrfs-cleaner:733 blocked for more than 1200 seconds.
[Wed Jun 19 09:27:44 2019]       Tainted: G           O    4.14.125 #1
[Wed Jun 19 09:27:44 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jun 19 09:27:44 2019] btrfs-cleaner   D    0   733      2 0x80000000
[Wed Jun 19 09:27:44 2019] Call Trace:
[Wed Jun 19 09:27:44 2019]  ? __schedule+0x292/0x8b0
[Wed Jun 19 09:27:44 2019]  schedule+0x39/0x90
[Wed Jun 19 09:27:44 2019]  btrfs_tree_lock+0xc3/0x1c0 [btrfs]
[Wed Jun 19 09:27:44 2019]  ? wait_woken+0x80/0x80
[Wed Jun 19 09:27:44 2019]  do_walk_down+0xe7/0x580 [btrfs]
[Wed Jun 19 09:27:44 2019]  walk_down_tree+0xac/0xd0 [btrfs]
[Wed Jun 19 09:27:44 2019]  btrfs_drop_snapshot+0x393/0x880 [btrfs]
[Wed Jun 19 09:27:44 2019]  ? __schedule+0x29a/0x8b0
[Wed Jun 19 09:27:44 2019]  ? btrfs_destroy_inode+0x1a9/0x290 [btrfs]
[Wed Jun 19 09:27:44 2019]  btrfs_clean_one_deleted_snapshot+0xaf/0x100 [btrfs]
[Wed Jun 19 09:27:44 2019]  cleaner_kthread+0x10d/0x130 [btrfs]
[Wed Jun 19 09:27:44 2019]  kthread+0xff/0x140
[Wed Jun 19 09:27:44 2019]  ? btree_invalidatepage+0x90/0x90 [btrfs]
[Wed Jun 19 09:27:44 2019]  ? kthread_create_on_node+0x40/0x40
[Wed Jun 19 09:27:44 2019]  ret_from_fork+0x1f/0x40
[Wed Jun 19 09:27:44 2019] INFO: task btrfs-transacti:734 blocked for more than 1200 seconds.
[Wed Jun 19 09:27:44 2019]       Tainted: G           O    4.14.125 #1
[Wed Jun 19 09:27:44 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jun 19 09:27:44 2019] btrfs-transacti D    0   734      2 0x80000000
[Wed Jun 19 09:27:44 2019] Call Trace:
[Wed Jun 19 09:27:44 2019]  ? __schedule+0x292/0x8b0
[Wed Jun 19 09:27:44 2019]  schedule+0x39/0x90
[Wed Jun 19 09:27:44 2019]  wait_current_trans+0x8c/0xd0 [btrfs]
[Wed Jun 19 09:27:44 2019]  ? wait_woken+0x80/0x80
[Wed Jun 19 09:27:44 2019]  start_transaction+0x200/0x430 [btrfs]
[Wed Jun 19 09:27:44 2019]  transaction_kthread+0x142/0x1b0 [btrfs]
[Wed Jun 19 09:27:44 2019]  kthread+0xff/0x140
[Wed Jun 19 09:27:44 2019]  ? btrfs_cleanup_transaction+0x540/0x540 [btrfs]
[Wed Jun 19 09:27:44 2019]  ? kthread_create_on_node+0x40/0x40
[Wed Jun 19 09:27:44 2019]  ret_from_fork+0x1f/0x40

After debugging, I think that the bug is caused by this commit (kernel 4.14.123):

commit d819d97ea025f8c32c12adef0ff55b2c7bf5c853
Author: Josef Bacik <josef@toxicpanda.com>
Date:   Wed Jan 16 11:00:57 2019 -0500

    btrfs: honor path->skip_locking in backref code
    
    commit 38e3eebff643db725633657d1d87a3be019d1018 upstream.

The previous calls to btrfs_tree_read_lock and btrfs_set_lock_blocking_rw were not removed by the patch :

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index b517ef1477ea..d826fbaf7d50 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1281,9 +1292,14 @@ again:
                 }
                 btrfs_tree_read_lock(eb);
                 btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+                if (!path->skip_locking) {
+                    btrfs_tree_read_lock(eb);
+                    btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+                }
                 ret = find_extent_in_eb(eb, bytenr,
                             *extent_item_pos, &eie);
-                btrfs_tree_read_unlock_blocking(eb);
+                if (!path->skip_locking)
+                    btrfs_tree_read_unlock_blocking(eb);
                 free_extent_buffer(eb);
                 if (ret < 0)
                     goto out; 

I currently testing a the kernel 4.14.130 with the two lines in excess removed and for now the bug seems gone.
Comment 1 Stanislaw Gruszka 2019-07-08 12:06:18 UTC
Posted fix for this:
https://lore.kernel.org/stable/20190708120130.GA25587@redhat.com/T/#u