My UPSes haven't been very good at keeping my ceph-on-btrfs nodes up, and I've run into filesystem corruption reasonably often as a result: I bring nodes back up only to have the filesystems remount read-only after reporting that an extent to be cleaned up is not there. This appears to be much more frequent when cleanup processing is lagging behind. For example, the filesystem I've just recovered had almost 1000 subvols waiting to be cleaned up, from a long sequence of ceph-osd snapshotting current to snap_<#>, then snap_<#+x>, then snap_<#+x+y>, then removing snap_<#>, and so on, plus the occasional osd restart due to timeouts, which causes current to be removed and re-created as a snapshot of the most recent snap_<#>.

The symptom is the "unable to find ref byte nr %llu parent %llu root %llu owner %llu offset %llu" message in __btrfs_free_extent. I've run into it quite often over the last few years, and I've learned to fear removing old snapshots (saved as snapshots of snap_<#>) from filesystems with active ceph-osds. If the server was running slow because of the long list of cleanup actions and the large, often-updated extent tree that goes with it (causing syncs to take several minutes, which in turn causes further ceph-osd restarts), and the host restarted before cleanups were done, the filesystem would occasionally (not always, not often, but often enough for restarts to be scary ;-) run into this error. btrfsck could usually fix it, re-adding the backrefs to the extent tree so that the fs could be remounted rw again and finish the cleanup.

More recently, running 4.3.* btrfs (kernel and userland progs), I've run into a few cases in which btrfsck completed successfully, without finding or fixing any errors, yet the filesystem would report the "unable to find..." error shortly after a rw mount. Oddly, when I examined the disk, the metadata block logged by the kernel contained unrelated contents. I'd have expected a leaf id not to be reused before one or two full commits, but what do I know? This made it difficult to fix the error manually, but after some investigation I managed to locate the metadata for the extent on disk and manually add a backref to the corresponding extent tree block, only to have the cleanup barf at a subsequent backref missing from the extent tree. In the end, I patched my copy of btrfs not to abort the transaction on that error; it reported just a couple more missing refs right away, then completed the cleanup without further errors, and I could switch back to an unpatched btrfs.

One thing I noticed while trying to figure out what was wrong and how to fix it is that the extents reported as missing have lots of entries in the extent tree, say, a handful of skinny TREE_BLOCK_REFs and tens to hundreds of skinny SHARED_BLOCK_REFs. This seems to indicate that the referenced extents are under intense activity, and it gave me a hunch as to why a crash corrupts the filesystem and btrfsck won't fix it: I suspect a transaction is committed due to a snapshot taken by ceph-osd, causing the extent tree being processed by the cleanup thread to be committed as a side effect, already without backrefs corresponding to some entries in the delayed-refs list, but before those entries are removed from the delayed-refs list (because, I suppose, nothing forced those changes to be committed along with the extent tree changes, or nothing prevented the extent tree changes from being committed before the updated delayed-refs list, or even before the delayed-refs list was updated).
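To make the ordering I have in mind more concrete, here's a toy userspace sketch. It is not btrfs code, and every structure and name in it is made up; the "disk" below stands for whatever committed state the cleanup resumes from after a remount, and the pending drops stand for the delayed-refs list. The point is only that if the modified extent tree can reach disk while the committed record of what remains to be dropped does not advance with it, the resumed cleanup will try to drop a ref the extent tree no longer has, which is the "unable to find ref" situation:

/*
 * Toy userspace model of the ordering I suspect.  NOT btrfs code: all
 * names here are made up for illustration.
 */
#include <stdio.h>
#include <stdbool.h>

#define NREFS 4

struct disk_state {
	bool backref[NREFS];	/* backrefs present in the committed extent tree */
	int drops_done;		/* committed record: refs below this were dropped */
};

/* Resume the cleanup from what the committed state says is left to do. */
static int resume_cleanup(struct disk_state *d)
{
	for (int i = d->drops_done; i < NREFS; i++) {
		if (!d->backref[i]) {
			/* the analogue of __btrfs_free_extent's complaint */
			printf("unable to find ref %d\n", i);
			return -1;	/* abort; filesystem goes read-only */
		}
		d->backref[i] = false;
		d->drops_done = i + 1;
	}
	printf("cleanup finished cleanly\n");
	return 0;
}

int main(void)
{
	struct disk_state d = { { true, true, true, true }, 0 };

	/*
	 * The cleanup thread drops backrefs 0 and 1; a snapshot forces a
	 * commit, and the modified extent tree reaches disk -- but the
	 * record of which drops were done does not advance with it.
	 */
	d.backref[0] = false;
	d.backref[1] = false;
	/* d.drops_done is deliberately left at 0 */

	/* Power loss here; on the next mount, cleanup resumes from disk. */
	return resume_cleanup(&d) ? 1 : 0;
}

As written it prints the "unable to find ref" line; make drops_done advance together with the backref drops and it completes cleanly instead.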
I haven't checked the code to tell whether this even makes sense, but at least to me the scenario appears consistent with the observed behavior. As for why btrfsck won't fix it, it could be because the entries are already in the delayed-refs list, which btrfsck perhaps won't look at, or it might have to do with shared block backrefs, if btrfsck doesn't inspect those as closely as extent removal has to.

I hope this helps,