Created attachment 208941 [details] dmesg Today I saw two warnings during a filtered balance, -dusage=50 -musage=50. [61425.186172] [<ffffffff813d497f>] dump_stack+0x63/0x84 [61425.186176] [<ffffffff810a6142>] warn_slowpath_common+0x82/0xc0 [61425.186178] [<ffffffff810a61dc>] warn_slowpath_fmt+0x5c/0x80 [61425.186183] [<ffffffff813da662>] ? ___ratelimit+0xb2/0x100 [61425.186198] [<ffffffffa05105f7>] btrfs_alloc_tree_block+0x417/0x500 [btrfs] [61425.186201] [<ffffffff810e1db7>] ? update_sd_lb_stats+0x107/0x500 [61425.186214] [<ffffffffa04f7022>] btrfs_copy_root+0x102/0x2b0 [btrfs] [61425.186232] [<ffffffffa05709cd>] create_reloc_root+0x19d/0x220 [btrfs] [61425.186249] [<ffffffffa0576b66>] btrfs_init_reloc_root+0x96/0xb0 [btrfs] [61425.186264] [<ffffffffa051fee4>] record_root_in_trans+0xa4/0x100 [btrfs] [61425.186279] [<ffffffffa05216d4>] btrfs_record_root_in_trans+0x44/0x70 [btrfs] [61425.186293] [<ffffffffa0522906>] start_transaction+0x96/0x4a0 [btrfs] [61425.186308] [<ffffffffa0522d67>] btrfs_join_transaction+0x17/0x20 [btrfs] [61425.186322] [<ffffffffa0528cec>] btrfs_dirty_inode+0x3c/0xd0 [btrfs] [61425.186336] [<ffffffffa0528dfd>] btrfs_update_time+0x7d/0xb0 [btrfs] [61425.186340] [<ffffffff8125dfdd>] file_update_time+0xbd/0x110 [61425.186355] [<ffffffffa05330e9>] btrfs_page_mkwrite+0xc9/0x3f0 [btrfs] [61425.186357] [<ffffffff810de475>] ? put_prev_entity+0x35/0x7f0 [61425.186361] [<ffffffff811e63da>] do_page_mkwrite+0x7a/0x120 [61425.186364] [<ffffffff811e89f9>] do_wp_page+0x1a9/0x890 [61425.186367] [<ffffffff811eb848>] handle_mm_fault+0x818/0x1bd0 [61425.186369] [<ffffffff810696c5>] __do_page_fault+0x195/0x400 [61425.186372] [<ffffffff81069960>] do_page_fault+0x30/0x80 [61425.186374] [<ffffffff81003d19>] ? syscall_return_slowpath+0x79/0xb0 [61425.186378] [<ffffffff817cfec8>] page_fault+0x28/0x30 [61425.186399] ---[ end trace 2c5c3c76646758b9 ]--- There is, since kernel 4.4.x this dump_pagetables message that happens every boot, that taints the kernel. WARNING: CPU: 0 PID: 1 at arch/x86/mm/dump_pagetables.c:225 note_page+0x635/0x7e0() Including entire dmesg as attachment. So far no apparent negative consequence, the balance completes normally.
Created attachment 208951 [details] dmesg 4.4.5 Reproducible with Not tainted 4.4.5-300.fc23.x86_64. Attached dmesg.
[root@f23m ~]# btrfs fi show / Label: 'Fedora' uuid: dbf2e938-1f28-4e93-aa6c-1e193004931b Total devices 1 FS bytes used 19.72GiB devid 1 size 104.00GiB used 24.04GiB path /dev/sda5 [root@f23m ~]# btrfs fi df / Data, single: total=23.01GiB, used=19.29GiB System, single: total=32.00MiB, used=16.00KiB Metadata, single: total=1.00GiB, used=448.94MiB GlobalReserve, single: total=160.00MiB, used=1.97MiB [root@f23m ~]# mount | grep btrfs /dev/sda5 on / type btrfs (rw,relatime,seclabel,ssd,space_cache,enospc_debug,subvolid=288,subvol=/f23w-root) /dev/sda5 on /home type btrfs (rw,noatime,seclabel,ssd,space_cache,enospc_debug,subvolid=289,subvol=/home)
Seems you hitting according to your output in the dmesg: WARNING: CPU: 1 PID: 542 at fs/btrfs/extent-tree.c:7964 btrfs_alloc_tree_block+0x417/0x500 [btrfs]() This seems to be due to your kernel config having these lines enabled probably: #ifdef CONFIG_BTRFS_DEBUG 543 if (btrfs_should_fragment_free_space(extent_root, block_group)) { 544 u64 bytes_used; 545 546 spin_lock(&block_group->space_info->lock); 547 spin_lock(&block_group->lock); 548 bytes_used = block_group->key.offset - 549 btrfs_block_group_used(&block_group->item); 550 block_group->space_info->bytes_used += bytes_used >> 1; 551 spin_unlock(&block_group->lock); 552 spin_unlock(&block_group->space_info->lock); 553 fragment_free_space(extent_root, block_group); 554 } 555 #endif Would you mind sending attaching your .config, it's located in the root kernel directory as .config as I have a feeling you enabled this feature by accident.
This only happens with mount option enospc_debug.
[root@f23m ~]# grep -i btrfs /boot/config-4.4.5-300.fc23.x86_64 CONFIG_BTRFS_FS=m CONFIG_BTRFS_FS_POSIX_ACL=y # CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set # CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set # CONFIG_BTRFS_DEBUG is not set # CONFIG_BTRFS_ASSERT is not set
OK I was based my lines on the mainline kernel but that's still a mount option for debugging so give me a hour or so I can write up a patch on a mail line tree to make sure we are actually hitting these lines. In the mean time see if you can get a main line kernel tree so we test this on it.
[root@f23m /]# grep -i btrfs /boot/config-4.5.0-0.rc7.git0.1.fc24.x86_64 CONFIG_BTRFS_FS=m CONFIG_BTRFS_FS_POSIX_ACL=y # CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set # CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set # CONFIG_BTRFS_DEBUG is not set # CONFIG_BTRFS_ASSERT is not set I only get the call trace with enospc_debug as a mount option on this kernel too. I am use -o remount,enospc_debug to enable it, if that matters.
Actually it would be helpful if you can test the mainline kernel to see if this still happens plus that way I can line up your time numbers with kernel tools for the warning you are getting easier.
Really? Fedora has almost no patches from mainline. There's not a one in these two pages having anything to do with btrfs. http://koji.fedoraproject.org/koji/rpminfo?rpmID=7476227 I haven't built a kernel in many months so it's going to take a while to do that refresher, and then it takes over an hour to build one on this hardware.
See the issue is even if we use the fedora kernel at some point I am going to need to test my patch to verify it works which means compiling a kernel. It may be better to try it now. If you need help with it I can walk you through it I use Ubuntu but it's very similar so I feel it's worth the time. Otherwise there is no point in going forward as I will need to test it on your kernel.
I understand testing a patch. What I don't understand is spending the time to build a kernel that's going to be the same as Fedora's. I've looked through their small list of patches for 4.5.0rc7 and there's not a single one related to Btrfs, so I'm 99.9% confident I'm going to get the same results with a kernel.org kernel.
What .config do you want me to use for this?
Your same config and I agree the Fedora kernel is very similar it's just for line numbers and other debugging features plus my patches are being based against the mainline so we will not have to worry about porting them to fedora's tree into after this is fixed. Also try it without mount enospc_debug to see if that's what is triggering it, you assume it is. This is due to perhaps you hitting a case for enospc debugging that is enabled by this option that you would not be hitting without debugging enabled.
Created attachment 208971 [details] config-4.5.0-0.rc7.git0.1.fc24.x86_64 Using config-4.5.0-0.rc7.git0.1.fc24.x86_64 as .config.
OK did you test it or are you just attaching the config?
Created attachment 208981 [details] console full output Mainline 4.5.0.rc7 build with previously mentioned config Line 3, ran balance without enospc_debug, no warning in dmesg (through line 1265). Line 1266: remount with enospc_debug, show mount options, balance again, and dmesg shows the call trace.
Created attachment 208991 [details] Patch for Testing
The Patch I added is above,same as above just test to see if it causes any differences. Just paste the changes or a new dmesg whatever is easier.
Created attachment 209001 [details] journal with patch 208991
System hangs during startup, I never get to gdm. Switching to tty2 I see the same message repeated: num bytes is 16384 and root-> sectorsize is 4096 System has to be forced poweroff.
Let it run for 6 minutes, no change, it does not recover, and I can't switch to any other console. But after the forced power off and booting another kernel, journalctl -b-1 shows the messages from the previous boot. Not every line is the same. A handful have num bytes other than 16384. [ 8.008564] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 8.013349] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 8.035790] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 68.172641] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 98.209371] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 98.213827] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 158.278762] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 163.261749] f23m.localdomain kernel: num bytes is 4096 and root-> sectorsize is 4096 [ 163.263251] f23m.localdomain kernel: num bytes is 8192 and root-> sectorsize is 4096 [ 188.315718] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 248.381066] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 248.387157] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 278.445450] f23m.localdomain kernel: num bytes is 262144 and root-> sectorsize is 4096 [ 348.567243] f23m.localdomain kernel: num bytes is 8192 and root-> sectorsize is 4096
Created attachment 209011 [details] Test Patch
Ok so that makes my assumption correct about where your WARN_ON is being triggered gets see with the see patch if this happens on the first call or does the goto loop trigger it.
[ 6.495740] num bytes is 0 and root-> sectorsize is 4096 [ 6.495765] passed in incorrect value for empty size [ 6.495783] num bytes is 16384 and root-> sectorsize is 4096 [ 6.495802] num bytes is 16384 and root-> sectorsize is 4096
Created attachment 209031 [details] Test Patch
Test the above patch to see if it fixes your issue as it may do the trick :).
Chris, I don't think you'll get any real help from Nick, he has some history in lkml & co and please be cautious about his patches.
Chris, would you mind posting the results of my last patch to see if it fixes the issue.