Bug 209143

Summary: kernel BUG at fs/btrfs/relocation.c:437!
Product: File System Reporter: Johannes Rohr (jorohr)
Component: btrfs Assignee: BTRFS virtual assignee (fs_btrfs)
Status: NEW ---    
Severity: normal CC: dsterba, t.d.richmond
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.4.0-45-generic Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg output

Description Johannes Rohr 2020-09-03 08:00:03 UTC
Created attachment 292315
dmesg output

This is about a system running Ubuntu 20.04, uname -a says:

Linux ida.rooot.de 5.4.0-45-generic #49-Ubuntu SMP Wed Aug 26 13:38:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

That's a kernel package that was released 31 Aug.

I can reproduce the bug by running btrfs balance start /home -musage=70

After a few seconds, the command terminates with a segfault; the dmesg output is attached.

The core lines in dmesg seem to be:

[32387.616249] kernel BUG at fs/btrfs/relocation.c:437!
[32387.616271] invalid opcode: 0000 [#1] SMP PTI
[32387.616180] BTRFS error (device sdc2): couldn't find block (4853431877632) (level 1) in tree (19918) with key (986 96 124)

There is nothing about an IO error in dmesg, so to me as a layman this doesn't look like physical damage.

Yet when I repeat the command, it always names the same block.
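
To confirm that repeated runs really name the same block, the fields of the error line can be pulled out and compared. The following is a minimal sketch, assuming the message always has the shape quoted above; the pattern and the helper names (`parse_block_error`, `distinct_blocks`) are made up for illustration and are not part of btrfs-progs or the kernel.

```python
import re

# Pattern for the "couldn't find block" message as quoted in this report.
# Assumption: the message format is stable across runs on this kernel.
BLOCK_ERR = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\): couldn't find block "
    r"\((?P<bytenr>\d+)\) \(level (?P<level>\d+)\) in tree \((?P<tree>\d+)\) "
    r"with key \((?P<objectid>\d+) (?P<type>\d+) (?P<offset>\d+)\)"
)

SAMPLE = (
    "[32387.616180] BTRFS error (device sdc2): couldn't find block "
    "(4853431877632) (level 1) in tree (19918) with key (986 96 124)"
)


def parse_block_error(line):
    """Return the fields of a 'couldn't find block' line, or None if no match."""
    match = BLOCK_ERR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    device = fields.pop("device")
    result = {name: int(value) for name, value in fields.items()}
    result["device"] = device
    return result


def distinct_blocks(lines):
    """Collect the distinct (device, bytenr) pairs named in the given log lines."""
    blocks = set()
    for line in lines:
        parsed = parse_block_error(line)
        if parsed is not None:
            blocks.add((parsed["device"], parsed["bytenr"]))
    return blocks


if __name__ == "__main__":
    print(parse_block_error(SAMPLE))
    # Feeding the dmesg output of several balance attempts into distinct_blocks()
    # should yield a single entry if the same block is reported every time.
    print(distinct_blocks([SAMPLE, SAMPLE]))
```

If `distinct_blocks` returns exactly one pair across several attempts, that supports the observation that the failure is tied to one specific tree block rather than to a transient condition.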
Comment 1 Johannes Rohr 2020-09-03 08:20:03 UTC
Here are the filesystem stats:

root@ida:~# btrfs fi show /
Label: 'BTRFS_RAID'  uuid: b80344d6-ae49-4557-bd4d-641d0afcda3e
        Total devices 4 FS bytes used 410.91GiB
        devid    2 size 460.94GiB used 214.03GiB path /dev/sdc2
        devid    3 size 460.94GiB used 214.00GiB path /dev/sdd2
        devid    5 size 460.90GiB used 214.03GiB path /dev/sdb2
        devid    6 size 460.94GiB used 202.06GiB path /dev/sda2

btrfs-progs package version is 5.4.1-2
Comment 2 Johannes Rohr 2020-09-03 09:30:34 UTC
btrfs fi usage / 

Overall:
    Device size:                   1.80TiB
    Device allocated:            844.12GiB
    Device unallocated:          999.59GiB
    Device missing:                  0.00B
    Used:                        821.95GiB
    Free (estimated):            508.97GiB      (min: 508.97GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:415.00GiB, Used:405.82GiB (97.79%)
   /dev/sdc2     212.00GiB
   /dev/sdd2     212.00GiB
   /dev/sdb2     209.00GiB
   /dev/sda2     197.00GiB

Metadata,RAID1: Size:7.00GiB, Used:5.15GiB (73.62%)
   /dev/sdc2       2.00GiB
   /dev/sdd2       2.00GiB
   /dev/sdb2       5.00GiB
   /dev/sda2       5.00GiB

System,RAID1: Size:64.00MiB, Used:96.00KiB (0.15%)
   /dev/sdc2      32.00MiB
   /dev/sdb2      32.00MiB
   /dev/sda2      64.00MiB

Unallocated:
   /dev/sdc2     246.91GiB
   /dev/sdd2     246.94GiB
   /dev/sdb2     246.87GiB
   /dev/sda2     258.88GiB
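
As a sanity check, the "Overall" figures above appear to follow from the per-profile numbers and the RAID1 ratio of 2.00. The sketch below redoes that arithmetic with values copied from the output. The formulas are an assumption about how the summary is derived (raw usage = ratio × logical usage; estimated free = unused data chunk space + unallocated / ratio), not something taken from the btrfs-progs source.

```python
# Figures copied from the `btrfs fi usage /` output above, in GiB.
DATA_SIZE, DATA_USED = 415.00, 405.82
META_SIZE, META_USED = 7.00, 5.15
SYS_SIZE, SYS_USED = 64 / 1024, 96 / 1024**2  # 64 MiB and 96 KiB expressed in GiB
UNALLOCATED = 999.59
RATIO = 2.0  # RAID1 keeps two copies of every chunk

# Assumed relationships (see the note above the block).
allocated = RATIO * (DATA_SIZE + META_SIZE + SYS_SIZE)
used = RATIO * (DATA_USED + META_USED + SYS_USED)
free_estimated = (DATA_SIZE - DATA_USED) + UNALLOCATED / RATIO

if __name__ == "__main__":
    print(f"allocated      ~ {allocated:.2f} GiB (reported 844.12)")
    print(f"used           ~ {used:.2f} GiB (reported 821.95)")
    print(f"free estimated ~ {free_estimated:.2f} GiB (reported 508.97)")
```

Each computed value lands within roughly 0.01 GiB of the reported one, which is what one would expect from rounded inputs. Under these assumptions the space accounting looks internally consistent, so the balance failure does not seem to be a simple out-of-space condition.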
Comment 3 t.d.richmond 2020-11-19 01:34:54 UTC
Just wanted to say I've run into the exact same issue.

[1133936.776355] BTRFS error (device sdh): couldn't find block (235170781286400) (level 1) in tree (7) with key (786999 12 786708)
[1133936.776549] ------------[ cut here ]------------
[1133936.776552] kernel BUG at fs/btrfs/relocation.c:437!
[1133936.776612] invalid opcode: 0000 [#1] SMP PTI
[1133936.776657] CPU: 2 PID: 1259511 Comm: btrfs Tainted: G        W         5.4.0-52-generic #57-Ubuntu
[1133936.776716] Hardware name: LENOVO ThinkServer TS440/ThinkServer TS440, BIOS FBKTD0AUS 06/12/2018
[1133936.776818] RIP: 0010:remove_backref_node+0x139/0x140 [btrfs]
[1133936.776859] Code: 41 5f 5d c3 49 8b 96 d0 00 00 00 48 8b 75 c8 49 89 86 d0 00 00 00 48 89 73 50 48 89 53 58 48 89 02 80 4b 71 02 e9 38 ff ff ff <0f> 0b c3 0f 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54
[1133936.776974] RSP: 0018:ffffaab2411df928 EFLAGS: 00010206
[1133936.777011] RAX: ffff946499a3a100 RBX: ffff9467c6425680 RCX: ffff9467c64253c0
[1133936.777057] RDX: 0000000000a11c29 RSI: cc071d12e7391895 RDI: 000000000002f0a0
[1133936.777103] RBP: ffffaab2411df960 R08: ffff94687ea978c8 R09: 0000000000010101
[1133936.777149] R10: ffff94686fa5e830 R11: 0000000000000001 R12: ffff9467c6425380
[1133936.777196] R13: dead000000000100 R14: ffff9463f21fb020 R15: dead000000000122
[1133936.777243] FS:  00007f3f603818c0(0000) GS:ffff94687ea80000(0000) knlGS:0000000000000000
[1133936.777296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1133936.777334] CR2: 00007f4ff9481fb8 CR3: 0000000186acc001 CR4: 00000000001606e0
[1133936.777381] Call Trace:
[1133936.777436]  build_backref_tree+0x234/0x10e0 [btrfs]
[1133936.777502]  relocate_tree_blocks+0x121/0x630 [btrfs]
[1133936.777540]  ? _cond_resched+0x19/0x30
[1133936.777596]  ? tree_insert+0x55/0x60 [btrfs]
[1133936.777654]  relocate_block_group+0x224/0x5f0 [btrfs]
[1133936.777723]  btrfs_relocate_block_group+0x15e/0x300 [btrfs]
[1133936.777808]  btrfs_relocate_chunk+0x2a/0x90 [btrfs]
[1133936.777872]  __btrfs_balance+0x409/0xa50 [btrfs]
[1133936.777932]  btrfs_balance+0x386/0x550 [btrfs]
[1133936.777992]  btrfs_ioctl_balance+0x2c1/0x380 [btrfs]
[1133936.778056]  btrfs_ioctl+0x836/0x20d0 [btrfs]
[1133936.778090]  ? do_anonymous_page+0x2e6/0x650
[1133936.778122]  ? __handle_mm_fault+0x760/0x7a0
[1133936.778155]  do_vfs_ioctl+0x407/0x670
[1133936.780465]  ? do_vfs_ioctl+0x407/0x670
[1133936.782753]  ? do_user_addr_fault+0x216/0x450
[1133936.785046]  ksys_ioctl+0x67/0x90
[1133936.787319]  __x64_sys_ioctl+0x1a/0x20
[1133936.789533]  do_syscall_64+0x57/0x190
[1133936.792137]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1133936.794978] RIP: 0033:0x7f3f6049b50b
[1133936.797794] Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
[1133936.803711] RSP: 002b:00007ffcabbef918 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[1133936.807004] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f3f6049b50b
[1133936.810523] RDX: 00007ffcabbef9a0 RSI: 00000000c4009420 RDI: 0000000000000003
[1133936.814016] RBP: 0000000000000003 R08: 0000558a165ce6b0 R09: 000000000000007c
[1133936.817566] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffcabbf18e1
[1133936.821145] R13: 00007ffcabbef9a0 R14: 0000000000000000 R15: 0000000000000000
[1133936.824599] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs cpuid wireguard ip6_udp_tunnel udp_tunnel veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc nfsv3 aufs rpcsec_gss_krb5 nfsv4 nfs fscache overlay nls_iso8859_1 mei_hdcp intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel drm_kms_helper crypto_simd i2c_algo_bit fb_sys_fops cryptd glue_helper syscopyarea rapl sysfillrect mei_me wmi_bmof intel_cstate input_leds sysimgblt mei ie31200_edac mac_hid sch_fq_codel nfsd auth_rpcgss lp nfs_acl lockd parport grace drm sunrpc ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq libcrc32c hid_generic usbhid hid ahci crc32_pclmul libahci i2c_i801 e1000e lpc_ich megaraid_sas wmi video
[1133936.851100] ---[ end trace 44e625a057f42c07 ]---
[1133937.009968] RIP: 0010:remove_backref_node+0x139/0x140 [btrfs]
[1133937.009971] Code: 41 5f 5d c3 49 8b 96 d0 00 00 00 48 8b 75 c8 49 89 86 d0 00 00 00 48 89 73 50 48 89 53 58 48 89 02 80 4b 71 02 e9 38 ff ff ff <0f> 0b c3 0f 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54
[1133937.013183] RSP: 0018:ffffaab2411df928 EFLAGS: 00010206
[1133937.014160] RAX: ffff946499a3a100 RBX: ffff9467c6425680 RCX: ffff9467c64253c0
[1133937.015184] RDX: 0000000000a11c29 RSI: cc071d12e7391895 RDI: 000000000002f0a0
[1133937.016247] RBP: ffffaab2411df960 R08: ffff94687ea978c8 R09: 0000000000010101
[1133937.017262] R10: ffff94686fa5e830 R11: 0000000000000001 R12: ffff9467c6425380
[1133937.018248] R13: dead000000000100 R14: ffff9463f21fb020 R15: dead000000000122
[1133937.019254] FS:  00007f3f603818c0(0000) GS:ffff94687ea80000(0000) knlGS:0000000000000000
[1133937.020344] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1133937.021293] CR2: 00007f4ff9481fb8 CR3: 0000000186acc001 CR4: 00000000001606e0
Comment 4 Johannes Rohr 2020-11-19 21:44:15 UTC
(In reply to t.d.richmond from comment #3)
> Just wanted to say I've run into the exact same issue.
 
If it is really the same issue (I haven't checked), I can tell you that my way of solving it was fairly easy: an offline btrfs check --repair with the newest available version of btrfs-progs.

Obviously, btrfs check can detect and fix issues that btrfs scrub can't. But again, you need to do an offline check, with the filesystem unmounted.

And I guess that, before running btrfs check with the --repair option, it makes sense to run a read-only check and let the btrfs developers on the mailing list or in the #btrfs IRC channel have a look at the output.
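
The order described above (make sure the filesystem is not mounted, run a read-only check, and only then consider a repair) can be sketched as follows. This is an illustration under assumptions, not a recommended tool: the helper names are hypothetical, the mount test only looks at `/proc/mounts`-style text, and for a multi-device filesystem that file typically lists just one member device, so an empty result for a given device does not prove the filesystem is unmounted. The sketch only builds the command lines and does not execute anything.

```python
def btrfs_mounted_devices(mounts_text):
    """Return the device paths of btrfs entries in /proc/mounts-style text."""
    devices = set()
    for line in mounts_text.splitlines():
        parts = line.split()
        # Fields are: device, mount point, filesystem type, options, ...
        if len(parts) >= 3 and parts[2] == "btrfs":
            devices.add(parts[0])
    return devices


def plan_offline_check(device, mounts_text):
    """Return the check commands in the suggested order, or refuse if mounted.

    Caveat (assumption spelled out above): for a multi-device filesystem,
    other member devices are not covered by this test.
    """
    if device in btrfs_mounted_devices(mounts_text):
        raise ValueError(f"{device} is mounted; unmount it before running btrfs check")
    return [
        # Step 1: read-only pass; share its output with the developers first.
        ["btrfs", "check", "--readonly", device],
        # Step 2: repair pass, only after the read-only output has been reviewed.
        ["btrfs", "check", "--repair", device],
    ]


if __name__ == "__main__":
    # Hypothetical device name and mounts text, for illustration only.
    example_mounts = "/dev/sdx1 /mnt/data btrfs rw,relatime 0 0\n"
    for command in plan_offline_check("/dev/sdy1", example_mounts):
        print(" ".join(command))
```

Keeping the repair command as a separate, second step mirrors the advice to have the read-only output reviewed before anything is written to the filesystem.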
Comment 5 t.d.richmond 2020-11-20 05:06:33 UTC
Unfortunately I had just finished doing an offline repair at the advice of the mailing group. I do think it fixed a lot of the issues I was having, but I'm still getting these segfaults while trying to rebalance.
Comment 6 Johannes Rohr 2020-11-20 15:44:01 UTC
(In reply to t.d.richmond from comment #5)
> Unfortunately I had just finished doing an offline repair at the advice of
> the mailing group. I do think it fixed a lot of the issues I was having, but
> I'm still getting these segfaults while trying to rebalance.

Which version of btrfs-progs did you use for the offline repair? I was told in no uncertain terms that I must use the latest one. What kernel version are you using?
Comment 7 t.d.richmond 2020-11-20 17:32:05 UTC
btrfs-progs 5.9
Kernel 5.4.0-52-generic