Bug 202717 - clear-space-cache v1 crashed, file system is corrupt
Summary: clear-space-cache v1 crashed, file system is corrupt
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: BTRFS virtual assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-02 00:14 UTC by Chris Murphy
Modified: 2019-08-12 04:54 UTC (History)
1 user (show)

See Also:
Kernel Version: 4.20.12
Subsystem:
Regression: No
Bisected commit-id:


Attachments
coredumpctl gdb, thread apply all bt full (3.94 KB, text/plain)
2019-03-02 00:14 UTC, Chris Murphy
Details

Description Chris Murphy 2019-03-02 00:14:01 UTC
Versions
4.20.12-200.fc29.x86_64
btrfs-progs-4.19.1-1.fc30.x86_64

Hardware:
2x USB, no partition, entire block device is LUKS, and the dmcrypt virtual device is formatted btrfs, -mraid1 -draid1. At least three years old. 

x86_64

Summary:
Today the volume was scrubbed, no errors. Unmounted clean. Btrfs check reported no errors. Decided today was a good day to switch to space cache v2. Ran --clear-space-cache v1, which crashed. Compiled from source btrfsprogs 4.20.2 which reports extent or chunk tree errors. No further action taken.


Steps and actual results:

1.
Scrub. No errors. Clean umount. File system wasn't being used (it's used only as an smb share and smb wasn't running at the time)
2.
sudo btrfs check /dev/mapper/sdd
Opening filesystem to check...
Checking filesystem on /dev/mapper/sdd
UUID: f5adc913-bbea-4340-8b5f-3411e2cda642
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 735441764352 bytes used, no error found
total csum bytes: 714734272
total tree bytes: 3373776896
total fs tree bytes: 2373287936
total extent tree bytes: 202719232
btree space waste bytes: 551807565
file data blocks allocated: 2726177849344
 referenced 2205643223040
3.
sudo btrfs check --clear-space-cache v1 /dev/mapper/sdd
Opening filesystem to check...
Checking filesystem on /dev/mapper/sdd
UUID: f5adc913-bbea-4340-8b5f-3411e2cda642
Failed to find [713226895360, 168, 16384]
btrfs unable to find ref byte nr 713226911744 parent 0 root 2  owner 1 offset 0
transaction.c:168: btrfs_commit_transaction: BUG_ON `ret` triggered, value -5
btrfs(+0x480c9)[0x556777d240c9]
btrfs(btrfs_commit_transaction+0x19b)[0x556777d246bb]
btrfs(btrfs_clear_free_space_cache+0x353)[0x556777d17123]
btrfs(+0x5cf1b)[0x556777d38f1b]
btrfs(cmd_check+0xaee)[0x556777d4697e]
btrfs(main+0x92)[0x556777cf0ad2]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7ff9379f7413]
btrfs(_start+0x2e)[0x556777cf0cce]
Aborted                      

4a. Compile btrfs-progs 4.20.2
4b. btrfs check

$ sudo ./btrfs check /dev/mapper/sdd                                                                               
Opening filesystem to check...
Checking filesystem on /dev/mapper/sdd
UUID: f5adc913-bbea-4340-8b5f-3411e2cda642
[1/7] checking root items
[2/7] checking extents
ref mismatch on [713226256384 16384] extent item 1, found 0
backref 713226256384 root 2 not referenced back 0xe0e31e0
incorrect global backref count on 713226256384 found 1 wanted 0
backpointer mismatch on [713226256384 16384]
owner ref check failed [713226256384 16384]
ref mismatch on [713226272768 16384] extent item 1, found 0
backref 713226272768 root 2 not referenced back 0xe3b5f60
incorrect global backref count on 713226272768 found 1 wanted 0
backpointer mismatch on [713226272768 16384]
owner ref check failed [713226272768 16384]
ref mismatch on [713226911744 16384] extent item 0, found 1
tree backref 713226911744 parent 2 root 2 not found in extent tree
backpointer mismatch on [713226911744 16384]
ref mismatch on [713226928128 16384] extent item 0, found 1
tree backref 713226928128 parent 2 root 2 not found in extent tree
backpointer mismatch on [713226928128 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 735327158272 bytes used, error(s) found
total csum bytes: 714734272
total tree bytes: 3373694976
total fs tree bytes: 2373287936
total extent tree bytes: 202784768
btree space waste bytes: 551904464
file data blocks allocated: 2726063292416
 referenced 2205528666112
$
Comment 1 Chris Murphy 2019-03-02 00:14:24 UTC
Created attachment 281451 [details]
coredumpctl gdb, thread apply all bt full
Comment 2 Chris Murphy 2019-03-02 01:22:49 UTC
btrfsprogs 4.20.2

$ sudo /srv/scratch/gits/btrfs-progs/btrfs check --mode=lowmem /dev/mapper/sdd
[sudo] password for chris: 
Opening filesystem to check...
Checking filesystem on /dev/mapper/sdd
UUID: f5adc913-bbea-4340-8b5f-3411e2cda642
[1/7] checking root items
[2/7] checking extents
ERROR: extent [713226256384 16384] referencer bytenr mismatch, wanted: 713226256384, have: 713226911744
ERROR: extent [713226272768 16384] referencer bytenr mismatch, wanted: 713226272768, have: 713226928128
ERROR: extent[713226911744 16384] backref lost (owner: 2, level: 1) root 2
ERROR: extent[713226928128 16384] backref lost (owner: 2, level: 0) root 2
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs done with fs roots in lowmem mode, skipping
[7/7] checking quota groups skipped (not enabled on this FS)
found 735327125504 bytes used, error(s) found
total csum bytes: 714734272
total tree bytes: 3373662208
total fs tree bytes: 2373287936
total extent tree bytes: 202752000
btree space waste bytes: 551896887
file data blocks allocated: 2726063292416
 referenced 2205528666112
$
Comment 3 Ryo Onodera 2019-07-27 14:13:03 UTC
Well I had a similar issue....

Please pardon me relieving by myself here for a bit... First, let me clear this: the recovery ability of btrfs impressed me. It can really resist against silent data corruption and other unstable hardware issues. And its copy-on-write nature saves storage considerably. I used my btrfs to store several recent generations of dd images of other systems as plain image files ready for quick loop-mount to restore some files if needed, while de-duplicating most of unchanged parts.

This setup worked excellently for years. I was planning to use the btrfs as my root fs after some years of operation.

It's my bad not to take dd copy of it or prepare adhoc dm snapshot target of the underlying luks block device before doing the --clear-pace-cache v1.
But, it said it's a safe operation and being named as `clearing of caches', accompanied with the successful btrfs check, I had a misconception, it seems...
This was like a body blow, too heavy to get over now.... Not because I've lost data, but I've partially lost confidence I've accumulated over btrfs.

Sorry. Let's get back to the bug report part of mine.



Surprisingly, my report is similar to this. So, I'm commenting on this, rather than submitting separate report.

I ran --clear-space-cache after successful btrfs check, then the btrfs got unmountable.
My btrfs was set up with 2x SATA 8TB HDD with -m raid1 -d raid1 around three years old on amd64.

Then, it's an odd version combo but my kernel version is 4.15.0-13-generic (Ubuntu) and btrfs-progs is 5.1-1.
Because of other issue of the hardware, I was stuck at the kernel version.

As a background, the following cause of actions is taken because I've found a faulty cabling issue has been happening over the night.
So, kernel had complained huge amount of SATA-layer IO errors accompanied with upper layer BTRFS errors.
 
1. aborted the running duperemove
2. btrfs scrub yielded errors but all were fixed correctly. [1]
   Sadly after scrub finished, due to power outage, the system shutdown abruptly while the btrfs is mounted however no IO activity.
3. btrfs check showed no errors but with warnings about space cache generation, motivating clear-space-cache [2]
4. btrfs check --clear-space-cache v1
   error messages printed and aborted with exit code 1
   very sorry for lacking the crucial info, but I've lost accidentally most of log...
     part of it contained like this around 100 times:
 
       extent buffer leak: start 2744636555264 len 16384
       WARNING: dirty eb leak (aborted trans): start 2744636555264 len 16384

   ran twice, the second time printed even C backtraces maybe some assertion error triggered.

5. mount it causes error to dmesg and failed to do [3]



Any hope for possible recovery measure or better practice for future occurrence of similar incidents?


It was a bad action of mine by immediately doing --clear-space-cache after unclean shutdown?
I didn't tested but should mounting the fs without clearing space cache have worked at the time? (this causes great regret...)
Or could prolonged dedupe operation under the faulty hardware configuration caused some inconsistency already first of all?
Well, I was surprised duperemove continued its operation under such intermittent cable-related IO errors.
Or kernel and btrfs-progs version discrepancy was bad?


This is not ideal report, but hope this will help make btrfs better in terms of ease of usage and reliability as one of its user!!
If more data is needed, I try to do the best.

[1]

ryoqun@miyako:~$ sudo cat /var/lib/btrfs/scrub.status.825d0e61-dc18-40dc-b5d9-9fac5c5f86f5 
scrub status:1
825d0e61-dc18-40dc-b5d9-9fac5c5f86f5:1|data_extents_scrubbed:117070833|tree_extents_scrubbed:697806|data_bytes_scrubbed:7374207033344|tree_bytes_scrubbed:11432853504|read_errors:0|csum_errors:209|verify_errors:19212|no_csum:469664|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:19421|last_physical:8001561100288|t_start:1564068586|t_resumed:0|duration:100995|canceled:0|finished:1
825d0e61-dc18-40dc-b5d9-9fac5c5f86f5:2|data_extents_scrubbed:117070865|tree_extents_scrubbed:698047|data_bytes_scrubbed:7374209130496|tree_bytes_scrubbed:11436802048|read_errors:0|csum_errors:0|verify_errors:0|no_csum:470176|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:8001561100288|t_start:1564068586|t_resumed:0|duration:100303|canceled:0|finished:1


[2]

ryoqun@miyako:~$ sudo env time -v btrfs check --progress --readonly /dev/mapper/sdc_crypt
[sudo] password for ryoqun:
Opening filesystem to check...
Checking filesystem on /dev/mapper/sdc_crypt
UUID: 825d0e61-dc18-40dc-b5d9-9fac5c5f86f5
[1/7] checking root items                      (0:02:58 elapsed, 8667833 items checked)
[2/7] checking extents                         (0:08:04 elapsed, 698624 items checked)
btrfs: space cache generation (85978) does not match inode (88410)95 items checked)
failed to load free space cache for block group 817247551488
btrfs: space cache generation (80342) does not match inode (88420)30 items checked)
failed to load free space cache for block group 874155868160
btrfs: space cache generation (88284) does not match inode (88292)10 items checked)
failed to load free space cache for block group 952539021312
btrfs: space cache generation (88174) does not match inode (88406)37 items checked)
failed to load free space cache for block group 1140443840512
btrfs: space cache generation (88399) does not match inode (88403)
failed to load free space cache for block group 1167287386112
btrfs: space cache generation (86912) does not match inode (88300)949 items checked)
failed to load free space cache for block group 3236387880960
btrfs: space cache generation (86271) does not match inode (88403)062 items checked)
failed to load free space cache for block group 3371679350784
btrfs: space cache generation (79419) does not match inode (88392)099 items checked)
failed to load free space cache for block group 3393154187264
btrfs: space cache generation (80392) does not match inode (88447)660 items checked)
failed to load free space cache for block group 5069265174528
[3/7] checking free space cache                (0:02:52 elapsed, 7357 items checked)
[4/7] checking fs roots                        (1:12:11 elapsed, 180395 items checked)
[5/7] checking csums (without verifying data)  (0:12:23 elapsed, 3085356 items checked)
[6/7] checking root refs                       (0:00:00 elapsed, 3 items checked)
[7/7] checking quota groups skipped (not enabled on this FS)
found 7385314152448 bytes used, no error found
total csum bytes: 7199495400
total tree bytes: 11446124544
total fs tree bytes: 2963701760
total extent tree bytes: 676593664
btree space waste bytes: 907685438
file data blocks allocated: 2855398783815680
 referenced 10412345200640
        Command being timed: "btrfs check --progress --readonly /dev/mapper/sdc_crypt"
        User time (seconds): 2917.23
        System time (seconds): 99.05
        Percent of CPU this job got: 50%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:38:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3730184
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 18
        Minor (reclaiming a frame) page faults: 933156
        Voluntary context switches: 707570
        Involuntary context switches: 351171
        Swaps: 0
        File system inputs: 89724464
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
ryoqun@miyako:~$ sudo env time -v btrfs check --progress --readonly /dev/mapper/sdc_crypt
[sudo] password for ryoqun:
Opening filesystem to check...
Checking filesystem on /dev/mapper/sdc_crypt
UUID: 825d0e61-dc18-40dc-b5d9-9fac5c5f86f5
[1/7] checking root items                      (0:03:05 elapsed, 8667833 items checked)
[2/7] checking extents                         (0:08:05 elapsed, 698624 items checked)
btrfs: space cache generation (85978) does not match inode (88410)90 items checked)
failed to load free space cache for block group 817247551488
btrfs: space cache generation (80342) does not match inode (88420)32 items checked)
failed to load free space cache for block group 874155868160
btrfs: space cache generation (88284) does not match inode (88292)97 items checked)
failed to load free space cache for block group 952539021312
btrfs: space cache generation (88174) does not match inode (88406)32 items checked)
failed to load free space cache for block group 1140443840512
btrfs: space cache generation (88399) does not match inode (88403)
failed to load free space cache for block group 1167287386112
btrfs: space cache generation (86912) does not match inode (88300)957 items checked)
failed to load free space cache for block group 3236387880960
btrfs: space cache generation (86271) does not match inode (88403)063 items checked)
failed to load free space cache for block group 3371679350784
btrfs: space cache generation (79419) does not match inode (88392)098 items checked)
failed to load free space cache for block group 3393154187264
btrfs: space cache generation (80392) does not match inode (88447)667 items checked)
failed to load free space cache for block group 5069265174528
[3/7] checking free space cache                (0:02:59 elapsed, 7357 items checked)
[4/7] checking fs roots                        (1:15:27 elapsed, 180395 items checked)
[5/7] checking csums (without verifying data)  (0:12:30 elapsed, 3085356 items checked)
[6/7] checking root refs                       (0:00:00 elapsed, 3 items checked)
[7/7] checking quota groups skipped (not enabled on this FS)
found 7385314152448 bytes used, no error found
total csum bytes: 7199495400
total tree bytes: 11446124544
total fs tree bytes: 2963701760
total extent tree bytes: 676593664
btree space waste bytes: 907685438
file data blocks allocated: 2855398783815680
 referenced 10412345200640
        Command being timed: "btrfs check --progress --readonly /dev/mapper/sdc_crypt"
        User time (seconds): 3074.56
        System time (seconds): 98.18
        Percent of CPU this job got: 51%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:42:31
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3730184
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 22
        Minor (reclaiming a frame) page faults: 933147
        Voluntary context switches: 704106
        Involuntary context switches: 363997
        Swaps: 0
        File system inputs: 89543200
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


[3]

Jul 27 09:44:17 miyako kernel: [  295.890293] BTRFS: device fsid 825d0e61-dc18-40dc-b5d9-9fac5c5f86f5 devid 2 transid 91085 /dev/dm-3
Jul 27 09:44:27 miyako kernel: [  306.426036] BTRFS: device fsid 825d0e61-dc18-40dc-b5d9-9fac5c5f86f5 devid 1 transid 91085 /dev/dm-4
Jul 27 13:50:46 miyako kernel: [15085.112468] BTRFS info (device dm-4): disk space caching is enabled
Jul 27 13:50:46 miyako kernel: [15085.112472] BTRFS info (device dm-4): has skinny extents
Jul 27 13:50:46 miyako kernel: [15085.272434] BTRFS info (device dm-4): bdev /dev/mapper/sdb_crypt errs: wr 113375, rd 1447, flush 0, corrupt 209, gen 19212
Jul 27 13:50:46 miyako kernel: [15085.302966] BTRFS error (device dm-4): parent transid verify failed on 2744636407808 wanted 91158 found 91162
Jul 27 13:50:46 miyako kernel: [15085.314377] BTRFS error (device dm-4): parent transid verify failed on 2744636407808 wanted 91158 found 91162
Jul 27 13:50:46 miyako kernel: [15085.314393] BTRFS error (device dm-4): failed to read block groups: -5
Jul 27 13:50:46 miyako kernel: [15085.350834] BTRFS error (device dm-4): open_ctree failed
Comment 4 Chris Murphy 2019-08-12 04:54:50 UTC
This file system cannot successfully be ro scrubbed, see bug 204557.

Note You need to log in before you can comment on or make changes to this bug.