Bug 199753
Summary: | btrfs deadlock running rsync | ||
---|---|---|---|
Product: | File System | Reporter: | Eli V (eliventer) |
Component: | btrfs | Assignee: | BTRFS virtual assignee (fs_btrfs) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | dsterba, lakshmipathi.g |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 4.16.8, 4.17, 4.18.8 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Eli V
2018-05-17 12:57:29 UTC
Is it failing consistently? Can you share reproducible steps? thanks It fails every few days in the middle of an rsync, most recently on 6/2 with kernel 4.16.11. Just rebooted into 4.16.13 this morning. The system is dedicated to running backups, so all it does is run rsync from NFS filesytems to a big local btrfs, and then snapshot when the rsync completes. It runs just a single rsync at a time. Of note the fs is using a 64k leaf size for much better performance on the underlying hardware RAID which uses a 64K stripe size. Mount options from /proc/mounts: rw,noatime,compress=zstd,ssd_spread,space_cache=v2,enospc_debug,subvolid=5,subvol=/ btrfs deadlocked again under 4.6.13 on 6/12. I'm guessing it's the locking around the wait_woken callback into the btrfs code. I've rebooted into 4.17 will report back if it hangs in the next week or two. [673874.641848] INFO: task btrfs-transacti:2699 blocked for more than 120 seconds. [673874.642834] Tainted: G W I 4.16.13 #1 [673874.643763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [673874.644713] btrfs-transacti D 0 2699 2 0x80000000 [673874.645663] Call Trace: [673874.646623] ? __schedule+0x2b1/0x770 [673874.647557] schedule+0x2d/0x80 [673874.648503] btrfs_tree_lock+0x111/0x1d0 [btrfs] [673874.649432] ? wait_woken+0x80/0x80 [673874.650376] btrfs_search_slot+0x761/0xa60 [btrfs] [673874.651317] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [673874.652241] btrfs_insert_item+0x5b/0xc0 [btrfs] [673874.653152] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs] [673874.654070] do_chunk_alloc+0x1e5/0x2b0 [btrfs] [673874.654960] find_free_extent+0xcd0/0xf60 [btrfs] [673874.655850] btrfs_reserve_extent+0x96/0x1e0 [btrfs] [673874.656736] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs] [673874.657624] __btrfs_cow_block+0x127/0x550 [btrfs] [673874.658527] btrfs_cow_block+0xee/0x190 [btrfs] [673874.659423] btrfs_search_slot+0x227/0xa60 [btrfs] [673874.660321] ? btrfs_update_inode_item+0x59/0x100 [btrfs] [673874.661210] ? iput+0x72/0x1e0 [673874.662109] write_one_cache_group.isra.49+0x20/0x90 [btrfs] [673874.663025] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs] [673874.663941] btrfs_commit_transaction+0x11b/0x880 [btrfs] [673874.664852] ? start_transaction+0x8e/0x410 [btrfs] [673874.665772] transaction_kthread+0x184/0x1a0 [btrfs] [673874.666671] kthread+0xf0/0x130 [673874.667568] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] [673874.668467] ? kthread_create_worker_on_cpu+0x40/0x40 [673874.669369] ret_from_fork+0x1f/0x30 [673874.670293] INFO: task kworker/u34:10:30138 blocked for more than 120 seconds. [673874.671224] Tainted: G W I 4.16.13 #1 [673874.672154] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [673874.673112] kworker/u34:10 D 0 30138 2 0x80000000 [673874.674086] Workqueue: writeback wb_workfn (flush-btrfs-1) [673874.675057] Call Trace: [673874.676015] ? __schedule+0x2b1/0x770 [673874.676979] schedule+0x2d/0x80 [673874.677966] btrfs_tree_lock+0x111/0x1d0 [btrfs] [673874.678941] ? wait_woken+0x80/0x80 [673874.679918] btrfs_search_slot+0x761/0xa60 [btrfs] [673874.680905] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [673874.681908] btrfs_insert_item+0x5b/0xc0 [btrfs] [673874.682907] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs] [673874.683915] __btrfs_end_transaction+0x8f/0x2e0 [btrfs] [673874.684924] find_free_extent+0xe23/0xf60 [btrfs] [673874.685945] btrfs_reserve_extent+0x96/0x1e0 [btrfs] [673874.686945] cow_file_range.isra.65+0x11e/0x480 [btrfs] [673874.687939] run_delalloc_range+0xed/0x340 [btrfs] [673874.688922] writepage_delalloc+0xf9/0x160 [btrfs] [673874.689894] __extent_writepage+0xd0/0x2f0 [btrfs] [673874.690853] extent_write_cache_pages+0x190/0x330 [btrfs] [673874.691822] extent_writepages+0x3b/0x50 [btrfs] [673874.692773] do_writepages+0x12/0x60 [673874.693728] __writeback_single_inode+0x38/0x300 [673874.694679] writeback_sb_inodes+0x1c6/0x430 [673874.695625] __writeback_inodes_wb+0x82/0xb0 [673874.696568] wb_writeback+0x23e/0x2d0 [673874.697510] ? wb_workfn+0x188/0x3c0 [673874.698452] wb_workfn+0x188/0x3c0 [673874.699378] process_one_work+0x1c0/0x390 [673874.700302] worker_thread+0x42/0x3e0 [673874.701220] kthread+0xf0/0x130 [673874.702140] ? create_worker+0x170/0x170 [673874.703052] ? kthread_create_worker_on_cpu+0x40/0x40 [673874.703967] ? SyS_exit_group+0xb/0x10 [673874.704877] ret_from_fork+0x1f/0x30 4.17 deadlock still present. rsync stack: [<0>] balance_dirty_pages_ratelimited+0x270/0xca0 [<0>] __btrfs_buffered_write+0x3c2/0x760 [btrfs] [<0>] btrfs_file_write_iter+0x36e/0x510 [btrfs] [<0>] __vfs_write+0xcf/0x130 [<0>] vfs_write+0xab/0x190 [<0>] ksys_write+0x3d/0x90 [<0>] do_syscall_64+0x39/0xe0 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [<0>] 0xffffffffffffffff kernel dmesg, looks like the wait_woken is consistently present whenever there is a problem: [301835.925267] btrfs-transacti D 0 2892 2 0x80000000 [301835.926291] Call Trace: [301835.927244] ? __schedule+0x2b1/0x770 [301835.928189] schedule+0x2d/0x80 [301835.929144] btrfs_tree_lock+0x111/0x1d0 [btrfs] [301835.930086] ? wait_woken+0x80/0x80 [301835.931041] btrfs_search_slot+0x761/0xa60 [btrfs] [301835.931972] ? _cond_resched+0x11/0x40 [301835.932892] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [301835.933805] btrfs_insert_item+0x5b/0xc0 [btrfs] [301835.934712] btrfs_create_pending_block_groups+0xff/0x1e0 [btrfs] [301835.935618] do_chunk_alloc+0x1d5/0x270 [btrfs] [301835.936518] find_free_extent+0x651/0x1050 [btrfs] [301835.937426] btrfs_reserve_extent+0x96/0x1d0 [btrfs] [301835.938342] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs] [301835.939239] ? pagevec_lru_move_fn+0xad/0xd0 [301835.940141] __btrfs_cow_block+0x124/0x540 [btrfs] [301835.941048] btrfs_cow_block+0xee/0x190 [btrfs] [301835.941954] btrfs_search_slot+0x227/0xa60 [btrfs] [301835.942873] ? btrfs_update_inode_item+0x59/0x100 [btrfs] [301835.943772] ? iput+0x66/0x1d0 [301835.944674] write_one_cache_group.isra.52+0x20/0x90 [btrfs] [301835.945591] btrfs_start_dirty_block_groups+0x100/0x420 [btrfs] [301835.946519] btrfs_commit_transaction+0xb6/0x840 [btrfs] [301835.947438] ? start_transaction+0x8c/0x3e0 [btrfs] [301835.948353] transaction_kthread+0x142/0x170 [btrfs] [301835.949254] kthread+0xf0/0x130 [301835.950158] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] [301835.951082] ? kthread_create_worker_on_cpu+0x40/0x40 [301835.952000] ret_from_fork+0x1f/0x30 [301835.952916] INFO: task kworker/u33:7:24296 blocked for more than 120 seconds. [301835.953851] Tainted: G W I 4.17.0 #1 [301835.954797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [301835.955763] kworker/u33:7 D 0 24296 2 0x80000000 [301835.956740] Workqueue: writeback wb_workfn (flush-btrfs-1) [301835.957717] Call Trace: [301835.958695] ? __schedule+0x2b1/0x770 [301835.959682] ? btrfs_buffer_uptodate+0x2f/0x40 [btrfs] [301835.960670] schedule+0x2d/0x80 [301835.961669] btrfs_tree_lock+0x111/0x1d0 [btrfs] [301835.962670] ? wait_woken+0x80/0x80 [301835.963676] btrfs_search_slot+0x761/0xa60 [btrfs] [301835.964678] ? _cond_resched+0x11/0x40 [301835.965686] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [301835.966710] btrfs_insert_item+0x5b/0xc0 [btrfs] [301835.967709] btrfs_create_pending_block_groups+0xff/0x1e0 [btrfs] [301835.968711] __btrfs_end_transaction+0x88/0x2d0 [btrfs] [301835.969699] find_free_extent+0xe7a/0x1050 [btrfs] [301835.970682] btrfs_reserve_extent+0x96/0x1d0 [btrfs] [301835.971661] cow_file_range.isra.67+0x109/0x460 [btrfs] [301835.972644] run_delalloc_range+0xed/0x340 [btrfs] [301835.973634] writepage_delalloc+0xf9/0x160 [btrfs] [301835.974629] __extent_writepage+0xd0/0x2f0 [btrfs] [301835.975610] extent_write_cache_pages+0x190/0x330 [btrfs] [301835.976596] extent_writepages+0x3b/0x50 [btrfs] [301835.977563] do_writepages+0x12/0x60 [301835.978532] __writeback_single_inode+0x38/0x300 [301835.979499] writeback_sb_inodes+0x1c6/0x430 [301835.980459] __writeback_inodes_wb+0x82/0xb0 [301835.981410] wb_writeback+0x23e/0x2d0 [301835.982366] ? wb_workfn+0x188/0x3c0 [301835.983308] wb_workfn+0x188/0x3c0 [301835.984244] process_one_work+0x1c0/0x390 [301835.985177] worker_thread+0x42/0x3e0 [301835.986100] kthread+0xf0/0x130 [301835.987029] ? create_worker+0x170/0x170 [301835.987948] ? kthread_create_worker_on_cpu+0x40/0x40 [301835.988869] ret_from_fork+0x1f/0x30 Just hit this on 4.18.8. rsync process is sitting doing nothing all weekend because of the deadlock on the btrfs-trsaction process: [361526.415801] INFO: task btrfs-transacti:2895 blocked for more than 120 seconds. [361526.416893] Tainted: G W I 4.18.8 #1 [361526.417826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [361526.418778] btrfs-transacti D 0 2895 2 0x80000000 [361526.419747] Call Trace: [361526.420695] ? __schedule+0x2cc/0x780 [361526.421635] schedule+0x2d/0x80 [361526.422585] btrfs_tree_lock+0x111/0x1d0 [btrfs] [361526.423519] ? wait_woken+0x80/0x80 [361526.424466] btrfs_search_slot+0x7c9/0xa80 [btrfs] [361526.425390] ? _cond_resched+0x11/0x40 [361526.426301] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [361526.427207] btrfs_insert_item+0x5b/0xc0 [btrfs] [361526.428107] btrfs_create_pending_block_groups+0xff/0x1e0 [btrfs] [361526.429005] do_chunk_alloc+0x1d4/0x260 [btrfs] [361526.429895] find_free_extent+0x651/0x1040 [btrfs] [361526.430794] btrfs_reserve_extent+0x96/0x1d0 [btrfs] [361526.431703] btrfs_alloc_tree_block+0xfa/0x520 [btrfs] [361526.432602] __btrfs_cow_block+0x124/0x540 [btrfs] [361526.433499] btrfs_cow_block+0xee/0x190 [btrfs] [361526.434395] btrfs_search_slot+0x238/0xa80 [btrfs] [361526.435295] ? btrfs_update_inode_item+0x59/0x100 [btrfs] [361526.436194] ? iput+0x66/0x1d0 [361526.437083] write_one_cache_group.isra.53+0x20/0x90 [btrfs] [361526.437995] btrfs_start_dirty_block_groups+0x100/0x420 [btrfs] [361526.438909] btrfs_commit_transaction+0xb6/0x840 [btrfs] [361526.439830] ? start_transaction+0x8c/0x3e0 [btrfs] [361526.440738] transaction_kthread+0x142/0x170 [btrfs] [361526.441631] kthread+0xf0/0x130 [361526.442527] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] [361526.443423] ? kthread_create_worker_on_cpu+0x40/0x40 [361526.444333] ret_from_fork+0x1f/0x30 [361526.445244] INFO: task kworker/u33:21:12240 blocked for more than 120 seconds. [361526.446167] Tainted: G W I 4.18.8 #1 [361526.447090] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [361526.448050] kworker/u33:21 D 0 12240 2 0x80000000 [361526.449009] Workqueue: writeback wb_workfn (flush-btrfs-1) [361526.449969] Call Trace: [361526.450926] ? __schedule+0x2cc/0x780 [361526.451905] ? btrfs_buffer_uptodate+0x2f/0x40 [btrfs] [361526.452875] schedule+0x2d/0x80 [361526.453857] btrfs_tree_lock+0x111/0x1d0 [btrfs] [361526.454832] ? wait_woken+0x80/0x80 [361526.455829] btrfs_search_slot+0x7c9/0xa80 [btrfs] [361526.456815] ? _cond_resched+0x11/0x40 [361526.457812] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [361526.458816] btrfs_insert_item+0x5b/0xc0 [btrfs] [361526.459821] btrfs_create_pending_block_groups+0xff/0x1e0 [btrfs] [361526.460820] __btrfs_end_transaction+0x88/0x2d0 [btrfs] [361526.461806] find_free_extent+0xe68/0x1040 [btrfs] [361526.462778] btrfs_reserve_extent+0x96/0x1d0 [btrfs] [361526.463747] cow_file_range.isra.69+0x109/0x460 [btrfs] [361526.464714] run_delalloc_range+0xed/0x340 [btrfs] [361526.465686] writepage_delalloc+0xf9/0x160 [btrfs] [361526.466655] __extent_writepage+0xd0/0x2f0 [btrfs] [361526.467621] extent_write_cache_pages+0x190/0x330 [btrfs] [361526.468602] extent_writepages+0x3b/0x50 [btrfs] [361526.469556] do_writepages+0x12/0x60 [361526.470501] __writeback_single_inode+0x38/0x300 [361526.471449] writeback_sb_inodes+0x18c/0x440 [361526.472405] __writeback_inodes_wb+0x82/0xb0 [361526.473346] wb_writeback+0x23e/0x2d0 [361526.474276] ? wb_workfn+0x2c4/0x3c0 [361526.475200] wb_workfn+0x2c4/0x3c0 [361526.476129] process_one_work+0x1da/0x3b0 [361526.477049] worker_thread+0x44/0x3d0 [361526.477960] kthread+0xf0/0x130 [361526.478867] ? create_worker+0x190/0x190 [361526.479783] ? kthread_create_worker_on_cpu+0x40/0x40 [361526.480696] ret_from_fork+0x1f/0x30 Thanks for the report. Fixed by 5ce555578e0919237f "Btrfs: fix deadlock when writing out free space caches", in 4.20. |