btrfs-progs-4.7.1-1.fc26.x86_64 Two Btrfs volumes have no problems mounting or scrubbing or with btrfs check with btrfs-progs 4.6.1. However both btrfs-progs 4.7 and 4.7.1 report numerous, various, incorrect backref problems, and --repair claims to fix them, but a subsequent check shows most are not fixed.
Created attachment 231741 [details] btrfs check, before repair --repair not run on this volume
Created attachment 231751 [details] btrfs check --repair This is the entire output from --repair.
Created attachment 231761 [details] btrfs check, after repair This is actually after --repair, followed by a mount (which results in no kernel messages at mount time but the file system has not really been exercised since repair).
Created attachment 231771 [details] btrfs show super, before repair btrfs-show-super -fa, before doing --repair
btrfs-debug-tree output to a file then gzipped, is 157MiB. https://drive.google.com/open?id=0B_2Asp8DGjJ9ekJDeGxoTE5sbFk
OK disregard comment 5! Sorry, that's a totally different debug tree. btrfs-debug tree output to a file, gzipped, is 157MiB https://drive.google.com/open?id=0B_2Asp8DGjJ9S2JIQzB6c1d3R3c btrfs-image -c9 -t4 -s, is 445MiB https://drive.google.com/open?id=0B_2Asp8DGjJ9WFJFVzh3Nkx1c3M Note that both of these files are taken before --repair was run.
linux-btrfs@ thread http://www.spinics.net/lists/linux-btrfs/msg58142.html One user has reported after --repair trying to fix this problem, subsequent check crashes with assertion. Uncertain if the file system is broken.
Hi, this looks like a regression in btrfsck. There's a similar issue filed in the duperemove tracker: https://github.com/markfasheh/duperemove/issues/141
Mark, https://github.com/markfasheh/btrfs-progs-patches/commit/0a9093c17e26430d4ba1da0d131fa23d95d81f1f looks good, the old code did not compare found_ref. Can you please submit the patch if it's the final version? I'm going to release 4.7.2 afterwards.
(In reply to David Sterba from comment #9) > Mark, > https://github.com/markfasheh/btrfs-progs-patches/commit/ > 0a9093c17e26430d4ba1da0d131fa23d95d81f1f looks good, the old code did not > compare found_ref. Can you please submit the patch if it's the final > version? I'm going to release 4.7.2 afterwards. This is a more complete fixup of that portion of the code: https://github.com/markfasheh/btrfs-progs-patches/commit/6df7f4e005c984f06a0f44d4407cbab5d40999cb That's not a complete fix though we can build upon that patch and I strongly feel that some of the changes are necessary - we shouldn't be ignoring errors from rb_insert() for example. So if you like it, I'll gladly submit but I wouldn't base 4.7.2 on only that change. Most of my testing by the way is to just run the duperemove smoke test: https://github.com/markfasheh/duperemove-tests/blob/master/tests/smoke_test.sh That'll reproduce it every time (even with my fix, though with fewer errors)
(In reply to Mark Fasheh from comment #10) > Most of my testing by the way is to just run the duperemove smoke test: > > https://github.com/markfasheh/duperemove-tests/blob/master/tests/smoke_test. > sh > > That'll reproduce it every time (even with my fix, though with fewer errors) To be clear, I mean you reproduce by running the test then unmounting the filesystem and btrfscking it. This is not an actual disk corruption however that test seems to get our extents nicely set up to make btrfsck unhappy.
(In reply to Mark Fasheh from comment #10) > (In reply to David Sterba from comment #9) > > Mark, > > https://github.com/markfasheh/btrfs-progs-patches/commit/ > > 0a9093c17e26430d4ba1da0d131fa23d95d81f1f looks good, the old code did not > > compare found_ref. Can you please submit the patch if it's the final > > version? I'm going to release 4.7.2 afterwards. > > This is a more complete fixup of that portion of the code: > > https://github.com/markfasheh/btrfs-progs-patches/commit/ > 6df7f4e005c984f06a0f44d4407cbab5d40999cb The BUG_ON seems quite dangerous for me. Maybe some fuzzed images can trigger the BUG_ON about disk_bytenr comparison. IMHO We should not introduce new BUG_ON, as fuzz test has already beaten us quite hard now.
I can reproduce the false warnings on one of my testing images (6df7f4e005c984f06a0f44d4407cbab5d40999cb). And I do not like to see newly added BUG_ON. So, I'm going to revert the offending patches and release 4.7.2.
Created attachment 232251 [details] "Incorrect local backref" with 4.7.2 check --repair Dunno if it's the same bug since my log's a bit different, but I also keep getting a myriad of "Incorrect local backref" no matter how often I run btrfs check --repair The attached log is from 4.7.2, after the same command was already run a few times. It's a root fs, and both kernel 4.7.2 and 4.8-rc5 have over the past days started to throw an error after a while which turns the fs to read-only.
Here's the dmesg part from 4.8-rc5 kernel raising the error: [ 2088.230765] BTRFS error (device dm-0): trying to drop 3 refs but we only have 2 for bytenr 12865773568 [ 2088.230772] ------------[ cut here ]------------ [ 2088.230780] WARNING: CPU: 1 PID: 3258 at fs/btrfs/extent-tree.c:7012 __btrfs_free_extent.isra.33+0xb4d/0xca0 [ 2088.230781] BTRFS: Transaction aborted (error -22) [ 2088.230782] Modules linked in: [ 2088.230805] CPU: 1 PID: 3258 Comm: kworker/u8:9 Tainted: G U IO 4.8.0-1-ck #1 [ 2088.230807] Hardware name: [ 2088.230813] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [ 2088.230816] 0000000000000286 000000003f90215d ffff880173b4fa88 ffffffff8144c970 [ 2088.230821] ffff880173b4fad8 0000000000000000 ffff880173b4fac8 ffffffff810fa6fb [ 2088.230825] 00001b6473b4faf0 00000002fedc2000 0000000000000000 ffff880210950d20 [ 2088.230830] Call Trace: [ 2088.230837] [<ffffffff8144c970>] dump_stack+0x63/0x83 [ 2088.230842] [<ffffffff810fa6fb>] __warn+0xcb/0xf0 [ 2088.230846] [<ffffffff810fa77f>] warn_slowpath_fmt+0x5f/0x80 [ 2088.230850] [<ffffffff8133856d>] __btrfs_free_extent.isra.33+0xb4d/0xca0 [ 2088.230856] [<ffffffff8133c620>] __btrfs_run_delayed_refs+0x5b0/0x12e0 [ 2088.230861] [<ffffffff813403d1>] btrfs_run_delayed_refs+0xa1/0x2a0 [ 2088.230865] [<ffffffff81340664>] delayed_ref_async_start+0x94/0xb0 [ 2088.230868] [<ffffffff8138b8fd>] btrfs_scrubparity_helper+0x7d/0x340 [ 2088.230872] [<ffffffff8138bbfe>] btrfs_extent_refs_helper+0xe/0x10 [ 2088.230877] [<ffffffff81113d58>] process_one_work+0x1e8/0x480 [ 2088.230880] [<ffffffff81114038>] worker_thread+0x48/0x4e0 [ 2088.230883] [<ffffffff81113ff0>] ? process_one_work+0x480/0x480 [ 2088.230886] [<ffffffff81113ff0>] ? process_one_work+0x480/0x480 [ 2088.230888] [<ffffffff811199b8>] kthread+0xd8/0xf0 [ 2088.230892] [<ffffffff81087772>] ? __switch_to+0x2d2/0x630 [ 2088.230896] [<ffffffff81a9fc7f>] ret_from_fork+0x1f/0x40 [ 2088.230899] [<ffffffff811198e0>] ? kthread_worker_fn+0x170/0x170 [ 2088.230900] ---[ end trace dbcac85c9567c22f ]--- [ 2088.230903] BTRFS: error (device dm-0) in __btrfs_free_extent:7012: errno=-22 unknown [ 2088.230908] BTRFS info (device dm-0): forced readonly [ 2088.230910] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2960: errno=-22 unknown [ 2088.232663] pending csums is 17444864
So am I just lucky enough to have a irreparably broken partition or …?
Still a problem with btrfs-progs-4.7.2-1.fc26.x86_64 although this may be an artifact of the (bad?) repair from 4.7.1? [chris@f24s ~]$ sudo btrfs check /dev/mapper/brick1 Checking filesystem on /dev/mapper/brick1 UUID: 30f4724a-9a9f-4a5d-a37d-3e53876bf2e1 checking extents ref mismatch on [284858793984 134217728] extent item 8, found 9 Incorrect local backref count on 284858793984 root 659 owner 188824 offset 422068224 found 3 wanted 2 back 0x55927e869750 backpointer mismatch on [284858793984 134217728] ref mismatch on [285446881280 134217728] extent item 8, found 9 Incorrect local backref count on 285446881280 root 659 owner 188824 offset 1010155520 found 3 wanted 2 back 0x55927e86a710 backpointer mismatch on [285446881280 134217728] ref mismatch on [285913120768 134217728] extent item 9, found 11 Incorrect local backref count on 285913120768 root 659 owner 188824 offset 1462910976 found 4 wanted 2 back 0x55927e86c1c0 backpointer mismatch on [285913120768 134217728] ref mismatch on [286852644864 134217728] extent item 8, found 9 Incorrect local backref count on 286852644864 root 659 owner 188824 offset 2357133312 found 3 wanted 2 back 0x55927f2263a0 backpointer mismatch on [286852644864 134217728] ref mismatch on [287165235200 134217728] extent item 14, found 15 Incorrect local backref count on 287165235200 root 659 owner 188824 offset 2669723648 found 3 wanted 2 back 0x55927f226ff0 backpointer mismatch on [287165235200 134217728] ref mismatch on [287433670656 134217728] extent item 9, found 11 Incorrect local backref count on 287433670656 root 659 owner 188824 offset 2938159104 found 4 wanted 2 back 0x55927f2275b0 backpointer mismatch on [287433670656 134217728] ref mismatch on [287792168960 134217728] extent item 10, found 11 Incorrect local backref count on 287792168960 root 659 owner 188824 offset 3212296192 found 3 wanted 2 back 0x55927f2287f0 backpointer mismatch on [287792168960 134217728] ref mismatch on [287926386688 134217728] extent item 7, found 8 Incorrect local backref count on 287926386688 root 659 owner 188824 offset 3346513920 found 3 wanted 2 back 0x55927f228bd0 backpointer mismatch on [287926386688 134217728] ref mismatch on [454773628928 8388608] extent item 6, found 8 Incorrect local backref count on 454773628928 root 659 owner 204210 offset 0 found 4 wanted 2 back 0x559280de9ac0 backpointer mismatch on [454773628928 8388608] ref mismatch on [455505809408 8388608] extent item 79, found 165 Incorrect local backref count on 455505809408 root 659 owner 204212 offset 0 found 88 wanted 2 back 0x559280e1a190 backpointer mismatch on [455505809408 8388608] ref mismatch on [455564529664 8388608] extent item 6, found 8 Incorrect local backref count on 455564529664 root 659 owner 204222 offset 0 found 4 wanted 2 back 0x559280e1ace0 backpointer mismatch on [455564529664 8388608] checking free space cache checking fs roots checking csums checking root refs found 445291180032 bytes used err is 0 total csum bytes: 434113644 total tree bytes: 649756672 total fs tree bytes: 161316864 total extent tree bytes: 18366464 btree space waste bytes: 69497799 file data blocks allocated: 1054615560192 referenced 1029668896768 I haven't tried --repair.
Created attachment 233341 [details] btrfs check repair 4.7.2 I have a btrfs-image of the unmodified file system in comment 17, let me know if this is desired for the image stash. I did a repair, followed up with check which comes up clean (both attached), and a subsequent mount comes up with no errors.
This hardly matters at this point, but that bytenr BUG_ON is exactly the type of hard logic check that btrfsck is missing. Without fixing the underlying cause, which is poor error handling, we are inviting situations where an otherwise well meaning patch might accidentally cause corruption of users file systems when it does things that 'shouldn't happen' but we never check for, like inserting identical items into a tree. In particular, the bytenr check is in the rbtree compare function which doesn't even get the opportunity to report errors. So at that point we're pretty well boned if we just silently continue. I would venture to guess that most users would prefer btrfsck to crash than silently corrupt if it encounters a situation such as I have described.
Btw, I should add that I understated the capability of btrfsck's error handling, it's generally pretty good though the patch (and area of code) in question is lacking in that area. That said, my point still stands - we can't let the dogma of 'No BUG_ONs' get in the way of common-sense logic checks to avoid situations that would be very bad for the user.
Confirming, that the newly released version no longer triggers the false positives on my fs. thx! :)
Sorry to be that guy *again*, but… do I understand this right, my btrfs FS is trashed and I need to reinstall, rather than wait further for a fix to btrfsck?
(In reply to Andreas Reis from comment #22) > Sorry to be that guy *again*, but… do I understand this right, my btrfs FS > is trashed and I need to reinstall, rather than wait further for a fix to > btrfsck? Not yet. Use '--init-extent-tree' option to execute btrfsck (of course with version 4.7.2) should fix it. As only extent tree is transhed. Thanks, Qu
Thanks, it finally stopped (and I just booted from it). After 10+ hours for a 128GB SSD (840 Evo) partition, with no indication that it would actually finish. At the end I did figure that the read-only check log which I posted above – and which completed within a minute – contained a max 'Backref 143118163968'. The repair run's actual numbers (exponentially more so in total) flew by way too fast, but they did give some hint – without knowing if it was a genuine one – to the ETA (actually it was around 149.5b). Still, given that the FS was still readable: Next time I'll just backup and reformat. Vastly more economical in every possible aspect. Two more notes: 1. If the printed messages from btrfs check were aligned to some fixed line number count, one could at least read them. 2. I seem to get "cache and super generation don't match" no matter what I do, but never mind.
(In reply to Mark Fasheh from comment #20) > Btw, I should add that I understated the capability of btrfsck's error > handling, it's generally pretty good though the patch (and area of code) in > question is lacking in that area. That said, my point still stands - we > can't let the dogma of 'No BUG_ONs' get in the way of common-sense logic > checks to avoid situations that would be very bad for the user. If you want to call it dogma, whatever. Throwing BUG_ONs everywhere is not the best way to handle errors, IMO. If the condition depends on external input, it must be handled in a better way, eg. print a message describing the condition, cleanup/rollback changes and then return with an error code that gets propagated upwards.