Bug 203413

Summary: btrfs forced read-only in fsync()
Product: File System Reporter: Drazen Kacar (drazen.kacar)
Component: btrfsAssignee: BTRFS virtual assignee (fs_btrfs)
Status: RESOLVED CODE_FIX    
Severity: normal CC: dsterba
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.0.9 Subsystem:
Regression: No Bisected commit-id:

Description Drazen Kacar 2019-04-24 15:28:58 UTC
This happened on the test virtual machine which wasn't doing much work. There is just one btrfs file system, no snapshots, no quota, no any kind of advanced configuration. Mounted with compress-force=zstd.

# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 

# uname -a
Linux stg-db-c02-snp 5.0.9-1.el7.elrepo.x86_64 #1 SMP Sat Apr 20 09:03:57 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
btrfs-progs v4.9.1

# rpm -qi kernel-ml
Name        : kernel-ml
Version     : 5.0.9
Release     : 1.el7.elrepo
Architecture: x86_64
Install Date: Tue 23 Apr 2019 04:49:44 PM CEST

It's a CentOS 7 with 5.0.9 kernel from elrepo. I deleted all files on the file system and then used pg_basebackup to copy production Paostgres database to this file system. That finished successfully. After that I started Postgres server (which was supposed to become a slave database) and the file system became read only.

I have this in dmesg:

[    7.509820] Btrfs loaded, crc32c=crc32c-intel
[    7.510384] BTRFS: device fsid a1f4444d-f874-458d-bf91-3656de6b1ac1 devid 3 transid 231 /dev/sde
[    7.510575] BTRFS: device fsid a1f4444d-f874-458d-bf91-3656de6b1ac1 devid 2 transid 231 /dev/sdc
[    7.510846] BTRFS: device fsid a1f4444d-f874-458d-bf91-3656de6b1ac1 devid 1 transid 231 /dev/sdb
[    7.510984] BTRFS: device fsid a1f4444d-f874-458d-bf91-3656de6b1ac1 devid 4 transid 231 /dev/sdd

...

[   82.509311] BTRFS info (device sdb): force zstd compression, level 0
[   82.509315] BTRFS info (device sdb): disk space caching is enabled
[   82.509317] BTRFS info (device sdb): has skinny extents
[59659.457445] BTRFS info (device sdb): relocating block group 53817114624 flags metadata|raid1
[59675.780813] BTRFS info (device sdb): found 7903 extents
[59675.816211] BTRFS info (device sdb): relocating block group 53783560192 flags system|raid1
[59675.859733] BTRFS info (device sdb): found 1 extents
[59675.899829] BTRFS info (device sdb): relocating block group 49488592896 flags data|raid0
[59693.573856] BTRFS info (device sdb): found 217229 extents
[59699.826960] BTRFS info (device sdb): found 217229 extents
[59701.835056] BTRFS info (device sdb): relocating block group 45193625600 flags data|raid0
[59735.398981] BTRFS info (device sdb): found 231965 extents
[59742.530866] BTRFS info (device sdb): found 231965 extents
[59744.700177] BTRFS info (device sdb): relocating block group 40898658304 flags data|raid0
[59765.646261] BTRFS info (device sdb): found 236629 extents
[59774.506785] BTRFS info (device sdb): found 236629 extents
[59777.137646] BTRFS info (device sdb): device deleted: /dev/sde
[59777.138514] BTRFS info (device sdb): relocating block group 59219378176 flags data|raid0
[59808.099470] BTRFS info (device sdb): found 175718 extents
[59813.751064] BTRFS info (device sdb): found 175718 extents
[59815.727759] BTRFS info (device sdb): relocating block group 55964598272 flags system|raid1
[59815.866099] BTRFS info (device sdb): found 1 extents
[59815.895655] BTRFS info (device sdb): relocating block group 55998152704 flags data|raid0
[59833.664676] BTRFS info (device sdb): found 176099 extents
[59838.978334] BTRFS info (device sdb): found 176099 extents
[59840.750431] BTRFS info (device sdb): relocating block group 62440603648 flags data|raid0
[59853.465967] BTRFS info (device sdb): found 177650 extents
[59860.094573] BTRFS info (device sdb): found 177650 extents
[59861.943680] BTRFS info (device sdb): relocating block group 65661829120 flags data|raid0
[59879.727352] BTRFS info (device sdb): found 174738 extents
[59884.813738] BTRFS info (device sdb): found 174738 extents
[59886.651099] BTRFS info (device sdb): device deleted: /dev/sdd
[72536.431822] BTRFS info (device sdb): force zstd compression, level 0
[72536.431825] BTRFS info (device sdb): disk space caching is enabled
[72536.431826] BTRFS info (device sdb): has skinny extents
[83550.487486] ------------[ cut here ]------------
[83550.487508] BTRFS: Transaction aborted (error -28)
[83550.487667] WARNING: CPU: 0 PID: 82773 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x20e/0x3e0 [btrfs]
[83550.487669] Modules linked in: vmw_vsock_vmci_transport vsock ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vmw_balloon aesni_intel crypto_simd cryptd glue_helper intel_rapl_perf input_leds joydev pcspkr btrfs xor zstd_compress raid6_pq libcrc32c zstd_decompress sg i2c_piix4 vmw_vmci binfmt_misc ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod crc32c_intel serio_raw ata_generic pata_acpi vmxnet3 vmw_pvscsi vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ata_piix ttm drm libata dm_mirror dm_region_hash dm_log dm_mod
[83550.487705] CPU: 0 PID: 82773 Comm: postmaster Not tainted 5.0.9-1.el7.elrepo.x86_64 #1
[83550.487706] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[83550.487720] RIP: 0010:btrfs_update_root+0x20e/0x3e0 [btrfs]
[83550.487722] Code: 8b 40 50 f0 48 0f ba a8 40 ce 00 00 02 72 1d 41 83 fc fb 0f 84 b2 01 00 00 44 89 e6 48 c7 c7 98 9e 45 a0 31 c0 e8 f2 c5 cd e0 <0f> 0b 48 8b 7d c0 44 89 e1 ba 88 00 00 00 48 c7 c6 00 1e 45 a0 e8
[83550.487723] RSP: 0018:ffffc90006ba7cd8 EFLAGS: 00010286
[83550.487724] RAX: 0000000000000000 RBX: ffff88829d50bcb0 RCX: 0000000000000006
[83550.487725] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff8882b3c169c0
[83550.487726] RBP: ffffc90006ba7d30 R08: 0000000000000038 R09: 0000000000000673
[83550.487727] R10: ffff8882bfec8280 R11: 0000000000000000 R12: 00000000ffffffe4
[83550.487728] R13: ffff8882ad680000 R14: ffff8882ab248000 R15: ffff8882ab24d028
[83550.487729] FS:  00007eff825f88c0(0000) GS:ffff8882b3c00000(0000) knlGS:0000000000000000
[83550.487730] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[83550.487731] CR2: 00000000014ee798 CR3: 00000000037e8003 CR4: 00000000003606f0
[83550.487785] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[83550.487786] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[83550.487787] Call Trace:
[83550.487804]  ? btrfs_write_marked_extents+0xa0/0x150 [btrfs]
[83550.487820]  btrfs_sync_log+0x2cb/0xae0 [btrfs]
[83550.487826]  ? dput+0xae/0xd0
[83550.487840]  btrfs_sync_file+0x328/0x3a0 [btrfs]
[83550.487844]  vfs_fsync_range+0x3f/0x80
[83550.487846]  do_fsync+0x3d/0x70
[83550.487847]  __x64_sys_fsync+0x14/0x20
[83550.487853]  do_syscall_64+0x60/0x190
[83550.487857]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[83550.487858] RIP: 0033:0x7eff821ecd00
[83550.487860] Code: 48 3d 01 f0 ff ff 73 01 c3 48 8b 0d a2 82 20 00 f7 d8 64 89 01 48 83 c8 ff c3 83 3d e9 c6 20 00 00 75 10 b8 4a 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ce f6 ff ff 48 89 04 24
[83550.487861] RSP: 002b:00007fff68b277b8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[83550.487862] RAX: ffffffffffffffda RBX: 00000000000026e3 RCX: 00007eff821ecd00
[83550.487863] RDX: 00007eff6f323630 RSI: 0000000000002000 RDI: 0000000000000180
[83550.487864] RBP: 000000000a00000e R08: 0000000000ffb95a R09: 00007fff68bdb0b0
[83550.487865] R10: 000000000001465d R11: 0000000000000246 R12: 0000000000000004
[83550.487866] R13: 00000000014ac418 R14: 00007eff0b145d08 R15: 00007eff0b145d08
[83550.487867] ---[ end trace 9bac5a7e37b033ac ]---
[83550.487889] BTRFS: error (device sdb) in btrfs_update_root:136: errno=-28 No space left
[83550.487958] BTRFS info (device sdb): forced readonly


# btrfs filesystem usage /data/pg_data
Overall:
    Device size:                 200.00GiB
    Device allocated:             32.06GiB
    Device unallocated:          167.94GiB
    Device missing:                  0.00B
    Used:                         25.98GiB
    Free (estimated):            169.64GiB      (min: 85.67GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              143.69MiB      (used: 0.00B)

Data,RAID0: Size:26.00GiB, Used:24.30GiB
   /dev/sdb       13.00GiB
   /dev/sdc       13.00GiB

Metadata,RAID1: Size:3.00GiB, Used:857.81MiB
   /dev/sdb        3.00GiB
   /dev/sdc        3.00GiB

System,RAID1: Size:32.00MiB, Used:64.00KiB
   /dev/sdb       32.00MiB
   /dev/sdc       32.00MiB

Unallocated:
   /dev/sdb       83.97GiB
   /dev/sdc       83.97GiB


It seems there's plenty of unallocated space.

The only unusual thing which might be related to this problem is rebalancing. Yesterday I was testing compression ratios and a lot of untarring went without problems. That was on fs with two disks, each 100GB, usage was 35 GB at most.
Then I added one extra 100GB disk and asked btrfs to rebalance (btrfs balance start -v /mountpoint). That ended with no space on device, although it managed to rebalance disk usage (at least according to coarse btrfs fs usage output). There was plenty of unallocated space on each disk. I tried again and got the same error.

Than I added the 4th disk (100GB) and then rebalance worked fine.

This morning I removed those two extra disks (/dev/sdd and /dev/sde) which worked fine, deleted all files on the file system and started pg_basebackup. And eventually ended with the RO file system, as described above.

So maybe the error has something to do with rebalancing code which didn't leave structures in the correct state.

I'd like to know if the developers think that this might be related to rebalancing. I suppose I could operate the thing in production without rebalancing, so that's important information for me.
Comment 1 David Sterba 2019-05-21 12:09:41 UTC
Thanks for the report. Fixed by 72bd2323ec87722c115a5906bc6a1b31d11e8f54 "Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path", now in 5.2-rc queue.