Bug 75271 - btrfs device add hangs, result in blocked task messages
Summary: btrfs device add hangs, result in blocked task messages
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Josef Bacik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-02 04:12 UTC by Chris Murphy
Modified: 2016-03-20 11:28 UTC (History)
3 users (show)

See Also:
Kernel Version: 3.15.0-0.rc3.git3.1.fc21.x86_64
Tree: Fedora
Regression: No


Attachments
dmesg with sysrq-w (173.57 KB, text/plain)
2014-05-02 04:12 UTC, Chris Murphy
Details
dmesg selinux=0 (88.06 KB, text/plain)
2014-05-04 01:00 UTC, Chris Murphy
Details
console messages, hang/failure to power off (26.20 KB, image/png)
2014-05-04 01:12 UTC, Chris Murphy
Details
another dmesg with sysrq-w (110.53 KB, text/plain)
2014-10-25 10:18 UTC, John Leach
Details

Description Chris Murphy 2014-05-02 04:12:37 UTC
Created attachment 134651 [details]
dmesg with sysrq-w

0.
# mkfs.btrfs -d raid1 -m raid1 /dev/sd[bcd]
# mount /dev/sdb /mnt   ## default mount options

1. Populate with data until one of three devices is nearly full, but no enospc has occurred.
2. Poweroff, remove /dev/sdb (devid 1), replace with new device
3. Poweron
4.
# mount /dev/sdc -o degraded
# btrfs device add /dev/sdb /mnt

Actual result:
Performing full device TRIM (6.00GiB) …

...hang... does not recover after 120 minutes

From another shell:

# btrfs fi show
Label: none  uuid: 7387b661-4e72-45e4-a0a1-e6c4ad460164
	Total devices 4 FS bytes used 7.90GiB
	devid    2 size 6.00GiB used 5.99GiB path /dev/sdc
	devid    3 size 6.00GiB used 5.99GiB path /dev/sdd
	devid    4 size 6.00GiB used 0.00 path /dev/sdb
	*** Some devices missing

Shows the add has occurred, but inspection of /dev/sdb shows no Btrfs metadata has been written. So this status is not yet committed to disk.

Issued sysrq-w and collected dmesg, will attach full output. Partial output of blocked task:

[  360.370589] INFO: task btrfs-transacti:1162 blocked for more than 120 seconds.
[  360.374966]       Not tainted 3.15.0-0.rc3.git3.1.fc21.x86_64 #1
[  360.375911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.377151] btrfs-transacti D ffff88008571cd40  6456  1162      2 0x00000080
[  360.379152]  ffff880085769d20 0000000000000046 ffff88008571cd40 00000000001d5dc0
[  360.379915]  ffff880085769fd8 00000000001d5dc0 ffff880099194d40 ffff8800966ea210
[  360.380839]  ffff8800854331e8 ffff880085433000 ffff8800854331e8 0000000000000000
[  360.381747] Call Trace:
[  360.382674]  [<ffffffff817ec479>] schedule+0x29/0x70
[  360.383594]  [<ffffffffa00ca01f>] wait_current_trans.isra.19+0xcf/0x120 [btrfs]
[  360.386590]  [<ffffffff810eed10>] ? abort_exclusive_wait+0xb0/0xb0
[  360.390772]  [<ffffffffa00cb988>] start_transaction+0x3f8/0x650 [btrfs]
[  360.391981]  [<ffffffffa00cbc97>] btrfs_attach_transaction+0x17/0x20 [btrfs]
[  360.395811]  [<ffffffffa00c6b5e>] transaction_kthread+0x1de/0x290 [btrfs]
[  360.396996]  [<ffffffffa00c6980>] ? btrfs_cleanup_transaction+0x630/0x630 [btrfs]
[  360.398178]  [<ffffffff810c3348>] kthread+0x108/0x120
[  360.398880]  [<ffffffff810dbaa8>] ? sched_clock_cpu+0x98/0xc0
[  360.399741]  [<ffffffff810f8f6d>] ? trace_hardirqs_on_caller+0x15d/0x200
[  360.400605]  [<ffffffff810c3240>] ? insert_kthread_work+0x80/0x80
[  360.401523]  [<ffffffff817fc97c>] ret_from_fork+0x7c/0xb0
[  360.402495]  [<ffffffff810c3240>] ? insert_kthread_work+0x80/0x80
[  360.403445] 1 lock held by btrfs-transacti/1162:
[  360.407557]  #0:  (&fs_info->transaction_kthread_mutex){+.+...}, at: [<ffffffffa00c6a08>] transaction_kthread+0x88/0x290 [btrfs]
[  360.409707] INFO: task btrfs:1165 blocked for more than 120 seconds.
[  360.410962]       Not tainted 3.15.0-0.rc3.git3.1.fc21.x86_64 #1
[  360.411851] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.412709] btrfs           D ffff880089053380  2624  1165    997 0x00000080
[  360.413599]  ffff8800850b57f8 0000000000000046 ffff880089053380 00000000001d5dc0
[  360.414517]  ffff8800850b5fd8 00000000001d5dc0 ffff880099194d40 ffff88009d9d6758
[  360.415449]  ffff88009e3e34a8 0000000000000002 ffffffff811b0850 ffff8800850b5870
[  360.416307] Call Trace:
[  360.416983]  [<ffffffff811b0850>] ? wait_on_page_read+0x60/0x60
[  360.418657]  [<ffffffff817ec80f>] io_schedule+0xaf/0x150
[  360.419515]  [<ffffffff811b085e>] sleep_on_page+0xe/0x20
[  360.420360]  [<ffffffff817ecc43>] __wait_on_bit+0x83/0xa0
[  360.421205]  [<ffffffff811b1ab2>] ? find_get_pages_tag+0x2d2/0x370
[  360.421857]  [<ffffffff811b05e9>] wait_on_page_bit+0x89/0xa0
[  360.422667]  [<ffffffff810eed50>] ? autoremove_wake_function+0x40/0x40
[  360.423466]  [<ffffffff811b06f9>] filemap_fdatawait_range+0xf9/0x1c0
[  360.424707]  [<ffffffffa00e6a3c>] btrfs_wait_ordered_range+0x6c/0x150 [btrfs]
[  360.429249]  [<ffffffffa01108a0>] __btrfs_write_out_cache+0x640/0x860 [btrfs]
[  360.429890]  [<ffffffffa011144d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs]
[  360.430742]  [<ffffffffa00ba1f1>] btrfs_write_dirty_block_groups+0x691/0x730 [btrfs]
[  360.431650]  [<ffffffffa0142539>] commit_cowonly_roots+0x17a/0x234 [btrfs]
[  360.432456]  [<ffffffffa00caf5b>] btrfs_commit_transaction+0x44b/0xa80 [btrfs]
[  360.433271]  [<ffffffffa00fa734>] btrfs_init_new_device+0x884/0xeb0 [btrfs]
[  360.433903]  [<ffffffff811dfcae>] ? might_fault+0x5e/0xc0
[  360.434683]  [<ffffffff811dfd09>] ? might_fault+0xb9/0xc0
[  360.435545]  [<ffffffffa010392c>] btrfs_ioctl+0xc4c/0x2aa0 [btrfs]
[  360.436334]  [<ffffffff8135e8f2>] ? avc_has_perm+0x142/0x350
[  360.436919]  [<ffffffff810dbaa8>] ? sched_clock_cpu+0x98/0xc0
[  360.437688]  [<ffffffff810dbb26>] ? local_clock+0x16/0x30
[  360.438589]  [<ffffffff810f5a7f>] ? lock_release_holdtime.part.28+0xf/0x200
[  360.439359]  [<ffffffff8135e90a>] ? avc_has_perm+0x15a/0x350
[  360.439930]  [<ffffffff8135e7e4>] ? avc_has_perm+0x34/0x350
[  360.440638]  [<ffffffff81360f01>] ? inode_has_perm.isra.47+0x51/0x90
[  360.441367]  [<ffffffff812513e0>] do_vfs_ioctl+0x2f0/0x520
[  360.441926]  [<ffffffff81251691>] SyS_ioctl+0x81/0xa0
[  360.443840]  [<ffffffff81155eac>] ? __audit_syscall_entry+0x9c/0xf0
[  360.447746]  [<ffffffff817fca29>] system_call_fastpath+0x16/0x1b
[  360.449563] 4 locks held by btrfs/1165:
[  360.450279]  #0:  (&fs_info->volume_mutex){+.+.+.}, at: [<ffffffffa01038fe>] btrfs_ioctl+0xc1e/0x2aa0 [btrfs]
[  360.450874]  #1:  (sb_internal){.+.+..}, at: [<ffffffffa00cb9ed>] start_transaction+0x45d/0x650 [btrfs]
[  360.451686]  #2:  (&fs_info->reloc_mutex){+.+...}, at: [<ffffffffa00cae8d>] btrfs_commit_transaction+0x37d/0xa80 [btrfs]
[  360.452462]  #3:  (&fs_info->tree_log_mutex){+.+...}, at: [<ffffffffa00caf01>] btrfs_commit_transaction+0x3f1/0xa80 [btrfs]



Expected result:
Device should be added, or I should get an error message.


Regression:

This command works:
# btrfs replace start 1 /dev/sdb /mnt

The missing device is removed from fi show, and the replacement is rebuilt correctly. No hang. Volume works as expected.
Comment 1 Chris Murphy 2014-05-02 04:15:07 UTC
Originally reported on Btrfs list with kernel 3.14.1-1-ARCH #1 SMP PREEMPT Mon Apr 14 20:40:47 CEST 2014 x86_64 GNU/Linux; and btrfs-progs 3.14.

http://www.spinics.net/lists/linux-btrfs/msg33677.html
Comment 2 Chris Murphy 2014-05-02 04:27:20 UTC
This is reproduced on a VirtualBox 4.3.10 VM, so it's strange to see the device trim message on add. Although there is also a device trim message when doing the mkfs.

No change with this command, it still hangs.
btrfs device add -Kf /dev/sdb /mnt
Comment 3 Chris Murphy 2014-05-04 01:00:16 UTC
Created attachment 135041 [details]
dmesg selinux=0

I'm seeing what might be a similar problem (similar blocked task messages anyway) with mdadm trying to stop an array. Bug 75451.

The similarity made me wonder if this is an selinux problem, so I changed boot param from enforcing=0 to selinux=0. Problem is still reproducible.

This dmesg is with kernel-3.15.0-0.rc3.git0.1.fc21.x86_64 (non-debug), selinux=0, and includes sysrq-w.
Comment 4 Chris Murphy 2014-05-04 01:12:35 UTC
Created attachment 135051 [details]
console messages, hang/failure to power off

Bumping importance because clean unmount of other volumes isn't possible, hang at reboot and poweroff. Some subvolumes are failing to unmount.
Comment 5 John Leach 2014-10-25 10:18:45 UTC
Created attachment 154921 [details]
another dmesg with sysrq-w

I experienced the same with 3.14.9-1 on Debian. Accidentally wiped a disk so was missing from the raid1 pair on boot, re-added it and the add command hangs, doing no disk io.

I power cycled and reproduce this several times. Attached sysrq-w output.


But I upgraded to 3.16.3-2 (userspace tools too) and it's fixed. I added the disk succesfully and was able to remove the missing one.

Note You need to log in before you can comment on or make changes to this bug.