Bug 114541 - balance results in warning btrfs/extent-tree.c:7964 btrfs_alloc_tree_block+0x417/0x500
Summary: balance results in warning btrfs/extent-tree.c:7964 btrfs_alloc_tree_block+0x...
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Josef Bacik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-13 21:59 UTC by Chris Murphy
Modified: 2016-03-20 09:50 UTC (History)
3 users (show)

See Also:
Kernel Version: 4.5.0-0.rc7
Tree: Mainline
Regression: No


Attachments
dmesg (177.64 KB, text/plain)
2016-03-13 21:59 UTC, Chris Murphy
Details
dmesg 4.4.5 (94.36 KB, text/plain)
2016-03-13 22:12 UTC, Chris Murphy
Details
config-4.5.0-0.rc7.git0.1.fc24.x86_64 (168.73 KB, application/octet-stream)
2016-03-14 00:29 UTC, Chris Murphy
Details
console full output (188.94 KB, text/plain)
2016-03-14 01:22 UTC, Chris Murphy
Details
Patch for Testing (926 bytes, patch)
2016-03-14 01:53 UTC, nickkrause
Details | Diff
journal with patch 208991 (597.11 KB, text/plain)
2016-03-14 03:51 UTC, Chris Murphy
Details
Test Patch (1.52 KB, patch)
2016-03-14 04:47 UTC, nickkrause
Details | Diff
Test Patch (765 bytes, patch)
2016-03-14 05:44 UTC, nickkrause
Details | Diff

Description Chris Murphy 2016-03-13 21:59:32 UTC
Created attachment 208941 [details]
dmesg

Today I saw two warnings during a filtered balance, -dusage=50 -musage=50.

[61425.186172]  [<ffffffff813d497f>] dump_stack+0x63/0x84
[61425.186176]  [<ffffffff810a6142>] warn_slowpath_common+0x82/0xc0
[61425.186178]  [<ffffffff810a61dc>] warn_slowpath_fmt+0x5c/0x80
[61425.186183]  [<ffffffff813da662>] ? ___ratelimit+0xb2/0x100
[61425.186198]  [<ffffffffa05105f7>] btrfs_alloc_tree_block+0x417/0x500 [btrfs]
[61425.186201]  [<ffffffff810e1db7>] ? update_sd_lb_stats+0x107/0x500
[61425.186214]  [<ffffffffa04f7022>] btrfs_copy_root+0x102/0x2b0 [btrfs]
[61425.186232]  [<ffffffffa05709cd>] create_reloc_root+0x19d/0x220 [btrfs]
[61425.186249]  [<ffffffffa0576b66>] btrfs_init_reloc_root+0x96/0xb0 [btrfs]
[61425.186264]  [<ffffffffa051fee4>] record_root_in_trans+0xa4/0x100 [btrfs]
[61425.186279]  [<ffffffffa05216d4>]
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
[61425.186293]  [<ffffffffa0522906>] start_transaction+0x96/0x4a0 [btrfs]
[61425.186308]  [<ffffffffa0522d67>] btrfs_join_transaction+0x17/0x20 [btrfs]
[61425.186322]  [<ffffffffa0528cec>] btrfs_dirty_inode+0x3c/0xd0 [btrfs]
[61425.186336]  [<ffffffffa0528dfd>] btrfs_update_time+0x7d/0xb0 [btrfs]
[61425.186340]  [<ffffffff8125dfdd>] file_update_time+0xbd/0x110
[61425.186355]  [<ffffffffa05330e9>] btrfs_page_mkwrite+0xc9/0x3f0 [btrfs]
[61425.186357]  [<ffffffff810de475>] ? put_prev_entity+0x35/0x7f0
[61425.186361]  [<ffffffff811e63da>] do_page_mkwrite+0x7a/0x120
[61425.186364]  [<ffffffff811e89f9>] do_wp_page+0x1a9/0x890
[61425.186367]  [<ffffffff811eb848>] handle_mm_fault+0x818/0x1bd0
[61425.186369]  [<ffffffff810696c5>] __do_page_fault+0x195/0x400
[61425.186372]  [<ffffffff81069960>] do_page_fault+0x30/0x80
[61425.186374]  [<ffffffff81003d19>] ? syscall_return_slowpath+0x79/0xb0
[61425.186378]  [<ffffffff817cfec8>] page_fault+0x28/0x30
[61425.186399] ---[ end trace 2c5c3c76646758b9 ]---

There is, since kernel 4.4.x this dump_pagetables message that happens
every boot, that taints the kernel.
WARNING: CPU: 0 PID: 1 at arch/x86/mm/dump_pagetables.c:225
note_page+0x635/0x7e0()

Including entire dmesg as attachment. So far no apparent negative
consequence, the balance completes normally.
Comment 1 Chris Murphy 2016-03-13 22:12:40 UTC
Created attachment 208951 [details]
dmesg 4.4.5

Reproducible with Not tainted 4.4.5-300.fc23.x86_64. Attached dmesg.
Comment 2 Chris Murphy 2016-03-13 22:16:29 UTC
[root@f23m ~]# btrfs fi show /
Label: 'Fedora'  uuid: dbf2e938-1f28-4e93-aa6c-1e193004931b
	Total devices 1 FS bytes used 19.72GiB
	devid    1 size 104.00GiB used 24.04GiB path /dev/sda5

[root@f23m ~]# btrfs fi df /
Data, single: total=23.01GiB, used=19.29GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=1.00GiB, used=448.94MiB
GlobalReserve, single: total=160.00MiB, used=1.97MiB

[root@f23m ~]# mount | grep btrfs
/dev/sda5 on / type btrfs (rw,relatime,seclabel,ssd,space_cache,enospc_debug,subvolid=288,subvol=/f23w-root)
/dev/sda5 on /home type btrfs (rw,noatime,seclabel,ssd,space_cache,enospc_debug,subvolid=289,subvol=/home)
Comment 3 nickkrause 2016-03-13 22:18:10 UTC
Seems you hitting according to your output in the dmesg:
WARNING: CPU: 1 PID: 542 at fs/btrfs/extent-tree.c:7964 btrfs_alloc_tree_block+0x417/0x500 [btrfs]()
This seems to be due to your kernel config having these lines enabled probably:
#ifdef CONFIG_BTRFS_DEBUG
543         if (btrfs_should_fragment_free_space(extent_root, block_group)) {
544                 u64 bytes_used;
545 
546                 spin_lock(&block_group->space_info->lock);
547                 spin_lock(&block_group->lock);
548                 bytes_used = block_group->key.offset -
549                         btrfs_block_group_used(&block_group->item);
550                 block_group->space_info->bytes_used += bytes_used >> 1;
551                 spin_unlock(&block_group->lock);
552                 spin_unlock(&block_group->space_info->lock);
553                 fragment_free_space(extent_root, block_group);
554         }
555 #endif
Would you mind sending attaching your .config, it's located in the root kernel directory as .config as I have a feeling you enabled this feature by accident.
Comment 4 nickkrause 2016-03-13 22:19:07 UTC
Seems you hitting according to your output in the dmesg:
WARNING: CPU: 1 PID: 542 at fs/btrfs/extent-tree.c:7964 btrfs_alloc_tree_block+0x417/0x500 [btrfs]()
This seems to be due to your kernel config having these lines enabled probably:
#ifdef CONFIG_BTRFS_DEBUG
543         if (btrfs_should_fragment_free_space(extent_root, block_group)) {
544                 u64 bytes_used;
545 
546                 spin_lock(&block_group->space_info->lock);
547                 spin_lock(&block_group->lock);
548                 bytes_used = block_group->key.offset -
549                         btrfs_block_group_used(&block_group->item);
550                 block_group->space_info->bytes_used += bytes_used >> 1;
551                 spin_unlock(&block_group->lock);
552                 spin_unlock(&block_group->space_info->lock);
553                 fragment_free_space(extent_root, block_group);
554         }
555 #endif
Would you mind sending attaching your .config, it's located in the root kernel directory as .config as I have a feeling you enabled this feature by accident.
Comment 5 Chris Murphy 2016-03-13 22:21:34 UTC
This only happens with mount option enospc_debug.
Comment 6 Chris Murphy 2016-03-13 22:23:04 UTC
[root@f23m ~]# grep -i btrfs /boot/config-4.4.5-300.fc23.x86_64 
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set
Comment 7 nickkrause 2016-03-13 22:38:11 UTC
OK I was based my lines on the mainline kernel but that's still a mount option for debugging so give me a hour or so I can write up a patch on a mail line
tree to make sure we are actually hitting these lines. In the mean time see
if you can get a main line kernel tree so we test this on it.
Comment 8 nickkrause 2016-03-13 22:44:50 UTC
OK I was based my lines on the mainline kernel but that's still a mount option for debugging so give me a hour or so I can write up a patch on a mail line
tree to make sure we are actually hitting these lines. In the mean time see
if you can get a main line kernel tree so we test this on it.
Comment 9 Chris Murphy 2016-03-13 22:46:21 UTC
[root@f23m /]# grep -i btrfs /boot/config-4.5.0-0.rc7.git0.1.fc24.x86_64 
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set

I only get the call trace with enospc_debug as a mount option on this kernel too. I am use -o remount,enospc_debug to enable it, if that matters.
Comment 10 nickkrause 2016-03-13 23:00:51 UTC
Actually it would be helpful if you can test the mainline kernel to see if this still happens plus that way I can line up your time numbers with kernel tools for the warning you are getting easier.
Comment 11 Chris Murphy 2016-03-13 23:22:06 UTC
Really? Fedora has almost no patches from mainline. There's not a one in these two pages having anything to do with btrfs. http://koji.fedoraproject.org/koji/rpminfo?rpmID=7476227 I haven't built a kernel in many months so it's going to take a while to do that refresher, and then it takes over an hour to build one on this hardware.
Comment 12 nickkrause 2016-03-13 23:47:29 UTC
See the issue is even if we use the fedora kernel at some point I am going to need to test my patch to verify it works which means compiling a kernel. It may be better to try it now. If you need help with it I can walk you through it I use Ubuntu but it's very similar so I feel it's worth the time. Otherwise there is no point in going forward as I will need to test it on your kernel.
Comment 13 Chris Murphy 2016-03-13 23:51:00 UTC
I understand testing a patch. What I don't understand is spending the time to build a kernel that's going to be the same as Fedora's. I've looked through their small list of patches for 4.5.0rc7 and there's not a single one related to Btrfs, so I'm 99.9% confident I'm going to get the same results with a kernel.org kernel.
Comment 14 Chris Murphy 2016-03-13 23:57:24 UTC
What .config do you want me to use for this?
Comment 15 nickkrause 2016-03-14 00:14:00 UTC
Your same config and I agree the Fedora kernel is very similar it's just for line numbers and other debugging features plus my patches are being based against the mainline so we will not have to worry about porting them to fedora's tree into after this is fixed. Also try it without mount enospc_debug to see if that's what 
is triggering it, you assume it is. This is due to perhaps you hitting a case
for enospc debugging that is enabled by this option that you would not be hitting
without debugging enabled.
Comment 16 Chris Murphy 2016-03-14 00:29:06 UTC
Created attachment 208971 [details]
config-4.5.0-0.rc7.git0.1.fc24.x86_64

Using config-4.5.0-0.rc7.git0.1.fc24.x86_64 as .config.
Comment 17 nickkrause 2016-03-14 00:41:53 UTC
OK did you test it or are you just attaching the config?
Comment 18 Chris Murphy 2016-03-14 01:22:35 UTC
Created attachment 208981 [details]
console full output

Mainline 4.5.0.rc7 build with previously mentioned config
Line 3, ran balance without enospc_debug, no warning in dmesg (through line 1265).
Line 1266: remount with enospc_debug, show mount options, balance again, and dmesg shows the call trace.
Comment 19 nickkrause 2016-03-14 01:53:33 UTC
Created attachment 208991 [details]
Patch for Testing
Comment 20 nickkrause 2016-03-14 01:55:45 UTC
The Patch I added is above,same as above just test to see if it causes any differences. Just paste the changes or a new dmesg whatever is easier.
Comment 21 Chris Murphy 2016-03-14 03:51:04 UTC
Created attachment 209001 [details]
journal with patch 208991
Comment 22 Chris Murphy 2016-03-14 03:53:37 UTC
System hangs during startup, I never get to gdm. Switching to tty2 I see the same message repeated:
num bytes is 16384 and root-> sectorsize is 4096
System has to be forced poweroff.
Comment 23 Chris Murphy 2016-03-14 04:25:26 UTC
Let it run for 6 minutes, no change, it does not recover, and I can't switch to any other console. But after the forced power off and booting another kernel, journalctl -b-1 shows the messages from the previous boot. Not every line is the same. A handful have num bytes other than 16384.

[    8.008564] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[    8.013349] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[    8.035790] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[   68.172641] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[   98.209371] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[   98.213827] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[  158.278762] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[  163.261749] f23m.localdomain kernel: num bytes is 4096 and root-> sectorsize is 4096
[  163.263251] f23m.localdomain kernel: num bytes is 8192 and root-> sectorsize is 4096
[  188.315718] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[  248.381066] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[  248.387157] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[  278.445450] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096
[  348.567243] f23m.localdomain kernel: num bytes is 8192 and root-> sectorsize is 4096
Comment 24 nickkrause 2016-03-14 04:47:09 UTC
Created attachment 209011 [details]
Test Patch
Comment 25 nickkrause 2016-03-14 04:49:10 UTC
Ok so that makes my assumption correct about where your WARN_ON is being triggered gets see with the see patch if this happens on the first call or does the goto loop trigger it.
Comment 26 Chris Murphy 2016-03-14 05:26:18 UTC
[    6.495740] num bytes is 0 and root-> sectorsize is 4096
[    6.495765] passed in incorrect value for empty size
[    6.495783] num bytes is 16384 and root-> sectorsize is 4096
[    6.495802] num bytes is 16384 and root-> sectorsize is 4096
Comment 27 nickkrause 2016-03-14 05:44:02 UTC
Created attachment 209031 [details]
Test Patch
Comment 28 nickkrause 2016-03-14 05:44:31 UTC
Test the above patch to see if it fixes your issue as it may do the trick :).
Comment 29 David Sterba 2016-03-14 10:26:19 UTC
Chris, I don't think you'll get any real help from Nick, he has some history in lkml & co and please be cautious about his patches.
Comment 30 nickkrause 2016-03-14 12:19:19 UTC
Chris,
would you mind posting the results of my last patch to see if it fixes the issue.

Note You need to log in before you can comment on or make changes to this bug.