Bug 60594 - Raid1 fs with one missing device refuses to mount read/write
Summary: Raid1 fs with one missing device refuses to mount read/write
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs
Hardware: All Linux
Importance: P1 normal
Assignee: Josef Bacik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-07-20 21:55 UTC by Xavier Bassery
Modified: 2022-09-30 15:15 UTC
CC List: 6 users

See Also:
Kernel Version: 3.10
Subsystem:
Regression: No
Bisected commit-id:



Description Xavier Bassery 2013-07-20 21:55:21 UTC
I have a btrfs filesystem that was in raid1 on 2 disk partitions. Since then, the second disk has been wiped.
Now when I try to mount the single device left (in degraded mode), I get the following:
mount: wrong fs type, bad option, bad superblock on /dev/sda2,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
And in dmesg, it reads:
[ 4462.021667] device label / devid 1 transid 124757 /dev/sda2
[ 4462.022443] btrfs: allowing degraded mounts
[ 4462.022446] btrfs: use lzo compression
[ 4462.022447] btrfs: disk space caching is enabled
[ 4462.026556] btrfs: mismatching generation and generation_v2 found in root item. This root was probably mounted with an older kernel. Resetting all new fields.
[ 4462.039630] btrfs: mismatching generation and generation_v2 found in root item. This root was probably mounted with an older kernel. Resetting all new fields.
[ 4465.158797] Btrfs: too many missing devices, writeable mount is not allowed
[ 4465.242523] btrfs: open_ctree failed

My only option is to mount it read-only, but then I am not able to add a new device or rebalance the fs to convert it to single.
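
Roughly the commands that are blocked here (a sketch; /dev/sdb2 below is
only an illustrative replacement device, not part of my actual setup):
# mount -o degraded /dev/sda2 /mnt
-> refused, dmesg reports "too many missing devices, writeable mount is not allowed"
# mount -o degraded,ro /dev/sda2 /mnt
-> works, but read-only
# btrfs device add /dev/sdb2 /mnt
-> refused because the filesystem is mounted read-only
# btrfs balance start -dconvert=single -mconvert=single /mnt
-> refused because the filesystem is mounted read-only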

I have made an image of the filesystem if needed.

Here are some outputs:
# ./btrfs fi df /mnt/
Data, RAID1: total=430.00GB, used=426.02GB
System: total=32.00MB, used=68.00KB
Metadata, RAID1: total=4.00GB, used=2.92GB

# ./btrfs fi show /
Label: '/'  uuid: 355d8e01-306b-4610-bf10-0cde5c6f9c3a
        Total devices 2 FS bytes used 428.94GB
        devid    1 size 929.62GB used 434.03GB path /dev/sda2
        *** Some devices missing
Comment 1 Ilya Dryomov 2013-08-15 18:10:47 UTC
After much back and forth with Xavier, which included disabling most of
the num_tolerated_disk_barrier_failures checks and doing dev del missing
/ dev add, we got his FS back.  I *think* this could have been fixed
with dev-replace, but I chose the dev del / dev add way because Xavier was
running 3.10, and IIRC dev-replace was broken in 3.10 by a "Make scrub
loop better" patch.  If dev-replace is indeed a way to fix such
situations, this can probably be closed.  Stefan?
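
For the record, the dev del / dev add route mentioned above has roughly
the following shape once a writable degraded mount is possible (a sketch
with illustrative device names, not a log of the exact commands we ran):
# mount -o degraded /dev/sda2 /mnt
-> only possible here with the writeable-mount check disabled
# btrfs device add /dev/sdb2 /mnt
# btrfs device delete missing /mnt
# btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
-> optional, re-mirrors any chunks that may have been written as "single"
   while degraded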
Comment 2 Ilya Dryomov 2013-08-15 18:15:53 UTC
added Stefan to the CC
Comment 3 Stefan Behrens 2013-08-16 09:05:28 UTC
# mkfs.btrfs -d raid1 -m raid1 -f /dev/sdd /dev/sde
# mount /dev/sdd /mnt
# (cd ~/git/; tar cf - btrfs) | (cd /mnt; tar xf -)
# dd if=/dev/zero of=/mnt/0 bs=16M count=400
# umount /mnt
# dd if=/dev/zero of=/dev/sde bs=4096 count=1 seek=16
# btrfs dev scan
# btrfs-show-super /dev/sde
-> the superblock is all-zero
# mount /dev/sdd /mnt
-> fails
# mount /dev/sdd /mnt -o degraded
-> succeeds with a writable filesystem
# cat /proc/mounts | grep /mnt
-> /dev/sdd /mnt btrfs rw,relatime,degraded,space_cache 0 0
# btrfs fi df /mnt
Data, RAID1: total=12.00GiB, used=11.27GiB
Data: total=8.00MiB, used=0.00
System, RAID1: total=8.00MiB, used=4.00KiB
System: total=4.00MiB, used=0.00
Metadata, RAID1: total=1.00GiB, used=101.19MiB
Metadata: total=8.00MiB, used=0.00

fs/btrfs/disk-io.c,
btrfs_calc_num_tolerated_disk_barrier_failures()
        /* block group types with nothing allocated or nothing used
         * do not restrict the tolerated number of disk failures */
        if (space.total_bytes == 0 ||
            space.used_bytes == 0)
                continue;

Right, that system chunk in "single" mode had used=0.00, makes sense to ignore it.

23:29 <balthus> # ./btrfs fi df /mnt/
23:29 <balthus> Data, RAID1: total=430.00GB, used=426.02GB
23:29 <balthus> System: total=32.00MB, used=68.00KB
23:29 <balthus> Metadata, RAID1: total=4.00GB, used=2.92GB

Ok, got it. You created the filesystem with "single" (not RAID1) chunks and afterwards converted everything to RAID1, but the system chunk was not converted.

# mkfs.btrfs -d single -m single -f /dev/sdd /dev/sde
# mount /dev/sdd /mnt
# (cd ~/git/; tar cf - btrfs/fs) | (cd /mnt; tar xf -)
# dd if=/dev/zero of=/mnt/0 bs=16M count=400
# sync; sync
# btrfs fi df /mnt
Data: total=8.01GiB, used=6.76GiB
System: total=4.00MiB, used=4.00KiB
Metadata: total=1.01GiB, used=11.58MiB
# btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
Done, had to relocate 11 out of 11 chunks
# sync; sync
# btrfs fi df /mnt
Data, RAID1: total=10.00GiB, used=6.76GiB
System: total=4.00MiB, used=4.00KiB
Metadata, RAID1: total=2.00GiB, used=10.59MiB
# btrfs balance start -sconvert=raid1 -f /mnt
Done, had to relocate 0 out of 12 chunks
# sync; sync
# btrfs fi df /mnt
Data, RAID1: total=10.00GiB, used=6.76GiB
System: total=4.00MiB, used=4.00KiB
Metadata, RAID1: total=2.00GiB, used=10.59MiB
# umount /mnt
# dd if=/dev/zero of=/dev/sde bs=4096 count=1 seek=16
# btrfs dev scan
# btrfs-show-super /dev/sde
magic                   ........ [DON'T MATCH]
# mount /dev/sdd /mnt
-> fails
# mount /dev/sdd /mnt -o degraded
-> fails

I tried several balance options but never succeeded in getting rid of this system chunk in "single" mode. And that's the problem.
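
In short, the check and the (failing) fix attempt boil down to this (the
same commands as in the transcript above, repeated only as a compact summary):
# btrfs fi df /mnt
-> a remaining plain "System: ..." line (no RAID1) is the unconverted chunk
# btrfs balance start -sconvert=raid1 -f /mnt
-> ought to convert it, but as shown above it relocates 0 chunks and the
   "single" system chunk stays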


btrfs_read_chunk_tree() uses a different approach than btrfs_calc_num_tolerated_disk_barrier_failures() to check whether all required devices are present. The drawback is that btrfs_read_chunk_tree() can only handle the past and the present; it does not care about new chunks that will be created in the future. btrfs_calc_num_tolerated_disk_barrier_failures() is pessimistic.

The num_tolerated_disk_barrier_failures mechanism is a very strict mechanism to avoid filesystem integrity issues. It deals with the worst case. The goal is to switch the filesystem into read-only mode if it is not possible to write the super block to the required number of disks. One way to decide this could be to say: if it has been written at least once, we are fine. But that does not cover the case where somebody physically removes that disk at a later time ("it's RAID6, I'm able to remove up to 2 disks"). The result of this worst case is a corrupt filesystem: the old superblock points to data that is potentially not valid anymore, because if the failed superblock write is not handled, those blocks might be freed and reused.

The pessimistic approach is to calculate the number of tolerable failures in advance and to switch to read-only mode immediately once more devices have failed than that number allows. Consequently, this check is also performed when the filesystem is mounted in writable mode.

IMO this num_tolerated_disk_barrier_failures thing is correct.
IMHO it is not correct that the "single" system chunk is not converted to RAID1 with the balance procedure.
Comment 4 Ilya Dryomov 2013-08-16 15:09:06 UTC
That's true, and I will send a patch that kills the check that prevented
the conversion of that "single" system chunk.  (Xavier and I had to work
around it, and it probably is the root cause.)  However, there still are
legit ways to get into a similar situation.  It is perfectly possible to
have, say, a two-drive FS with raid1 metadata and single data, where all
the data is on one of the drives.  If the other drive goes bad, both data
and metadata are OK, but the user can't mount rw to do the dev add / dev
del dance.  What is the way out?  Currently we essentially declare that
FS borked, and that IMHO is not acceptable.
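
A rough way to reproduce that scenario (a sketch only; the device names
are illustrative and the exact data placement depends on the allocator):
# mkfs.btrfs -f -d single -m raid1 /dev/sdd /dev/sde
# mount /dev/sdd /mnt
# (write data; with "single" data, whole chunks can end up on one drive only)
# umount /mnt
# dd if=/dev/zero of=/dev/sde bs=4096 count=1 seek=16
-> clobber sde's superblock to simulate the failed drive
# mount /dev/sdd /mnt -o degraded
-> expected to be refused writable, because "single" chunks tolerate zero
   missing devices in the num_tolerated_disk_barrier_failures calculation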

(P.S. I was wrong about dev-replace, it could not have helped, it
naturally needs an rw mount.)
Comment 5 Stefan Behrens 2013-08-23 13:42:16 UTC
The way out is to mount read-only, copy the data aside and be happy that no data was lost.
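
In command form that escape route is roughly the following (a sketch; the
target path and the use of rsync are only examples):
# mount -o degraded,ro /dev/sda2 /mnt
# rsync -aHAX /mnt/ /path/to/other/storage/
# umount /mnt
-> then recreate the filesystem and restore from the copy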

The #1 goal (IMO) is to avoid data loss. Therefore the filesystem goes read-only if fewer devices are functional for writing than the selected RAID levels require. And in order to avoid the surprise of a filesystem going read-only 30 seconds after mounting it, this is also enforced at mount time.

Since you and I don't agree, I guess we should ask additional people to state their opinion on this topic.

We could also leave this as an option to the user "mount -o degraded-and-I-want-to-lose-my-data", but in my opinion the use case is very, very exceptional.
Comment 6 Ilya Dryomov 2013-08-23 15:35:44 UTC
It's not that I don't agree; we definitely should not over-engineer, we
have enough of that already.  The use case is indeed pretty rare, but
it's something that used to be possible in the past and is forbidden
now, so I just wanted to make sure this was a conscious decision and to
see whether there is anything else one can do other than mount ro and copy.
Comment 7 Chris Murphy 2015-02-04 18:43:18 UTC
A similar problem still happens on kernel 3.19 with a filesystem in the middle of a raid1 -> single conversion that is paused and then normally/cleanly unmounted. On the next mount it will not mount rw. Filed bug 92641.
Comment 8 Hans Deragon 2017-01-02 12:58:56 UTC
(In reply to Ilya Dryomov from comment #6)
> The use case is indeed pretty rare, but it's something that used
> to be possible in the past and is forbidden now.

I suffer from this use case with kernel 4.4.0-57-generic.  Had a single non-RAID btrfs filesystem, converted it to raid1, had a disk completely fail on me (the OS does not recognize it), and now I can only mount the FS as RO.

This use case is not so rare for people using btrfs raid1 at home.  Sometimes you buy a disk, install the filesystem, use it for a while, and wake up one day realizing that raid would be better.  Then you read on the net how easy it is to convert the single instance to raid, but alas, it's actually broken.  When a disk fails, you can recover your data, but you cannot recreate your raid1, and you suffer downtime until you resolve the issue and have a newly working RW filesystem to replace the old one.
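
For context, the kind of conversion described on the net looks roughly
like this (a sketch with illustrative device names; whether the System
chunk is converted together with metadata depends on the kernel and
btrfs-progs version, which is exactly the trap discussed in comment 3):
# btrfs device add /dev/sdb1 /mnt
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
# btrfs fi df /mnt
-> verify that Data, Metadata *and* System all report RAID1
# btrfs balance start -sconvert=raid1 -f /mnt
-> only needed if a plain "System: ..." line (no RAID1) is still left over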

So let's be clear here: if anybody attempts to convert a single btrfs to raid1 (or any other raid profile), does an error message now show up, preventing this mistake?
Comment 9 David Sterba 2022-09-30 15:15:39 UTC
technically this could be a regression fix that would go to any rc but because the merge
