Bug 60594
| Summary: | Raid1 fs with one missing device refuses to mount read/write | | |
|---|---|---|---|
| Product: | File System | Reporter: | Xavier Bassery (xavier) |
| Component: | btrfs | Assignee: | Josef Bacik (josef) |
| Status: | RESOLVED OBSOLETE | | |
| Severity: | normal | CC: | bugzilla, dan.mulholland, dsterba, hans, idryomov, sbehrens |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.10 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Xavier Bassery
2013-07-20 21:55:21 UTC
After much back and forth with Xavier, which included disabling most of the num_tolerated_disk_barrier_failures checks and doing dev del missing / dev add, we got his FS back. I *think* this could have been fixed with dev-replace, but I chose the dev del / dev add route because Xavier was running 3.10, and IIRC dev-replace was broken in 3.10 by a "Make scrub loop better" patch. If dev-replace is indeed a way to fix such situations, this can probably be closed. Stefan? (Added Stefan to the CC.)

Reproduction with RAID1 data and metadata from the start:

```
# mkfs.btrfs -d raid1 -m raid1 -f /dev/sdd /dev/sde
# mount /dev/sdd /mnt
# (cd ~/git/; tar cf - btrfs) | (cd /mnt; tar xf -)
# dd if=/dev/zero of=/mnt/0 bs=16M count=400
# umount /mnt
# dd if=/dev/zero of=/dev/sde bs=4096 count=1 seek=16
# btrfs dev scan
# btrfs-show-super /dev/sde
   -> the superblock is all-zero
# mount /dev/sdd /mnt
   -> fails
# mount /dev/sdd /mnt -o degraded
   -> succeeds with a writable filesystem
# cat /proc/mounts | grep /mnt
   -> /dev/sdd /mnt btrfs rw,relatime,degraded,space_cache 0 0
# btrfs fi df /mnt
Data, RAID1: total=12.00GiB, used=11.27GiB
Data: total=8.00MiB, used=0.00
System, RAID1: total=8.00MiB, used=4.00KiB
System: total=4.00MiB, used=0.00
Metadata, RAID1: total=1.00GiB, used=101.19MiB
Metadata: total=8.00MiB, used=0.00
```

From fs/btrfs/disk-io.c, btrfs_calc_num_tolerated_disk_barrier_failures():

```c
		if (space.total_bytes == 0 ||
		    space.used_bytes == 0)
			continue;
```

Right, that system chunk in "single" mode had used=0.00, so it makes sense to ignore it.

```
23:29 <balthus> # ./btrfs fi df /mnt/
23:29 <balthus> Data, RAID1: total=430.00GB, used=426.02GB
23:29 <balthus> System: total=32.00MB, used=68.00KB
23:29 <balthus> Metadata, RAID1: total=4.00GB, used=2.92GB
```

OK, got it. You created the filesystem with "single" (not RAID1) chunks, converted everything to RAID1 afterwards, and the system chunk was not converted.

Reproduction starting from "single" chunks and converting to RAID1:

```
# mkfs.btrfs -d single -m single -f /dev/sdd /dev/sde
# mount /dev/sdd /mnt
# (cd ~/git/; tar cf - btrfs/fs) | (cd /mnt; tar xf -)
# dd if=/dev/zero of=/mnt/0 bs=16M count=400
# sync; sync
# btrfs fi df /mnt
Data: total=8.01GiB, used=6.76GiB
System: total=4.00MiB, used=4.00KiB
Metadata: total=1.01GiB, used=11.58MiB
# btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
Done, had to relocate 11 out of 11 chunks
# sync; sync
# btrfs fi df /mnt
Data, RAID1: total=10.00GiB, used=6.76GiB
System: total=4.00MiB, used=4.00KiB
Metadata, RAID1: total=2.00GiB, used=10.59MiB
# btrfs balance start -sconvert=raid1 -f /mnt
Done, had to relocate 0 out of 12 chunks
# sync; sync
# btrfs fi df /mnt
Data, RAID1: total=10.00GiB, used=6.76GiB
System: total=4.00MiB, used=4.00KiB
Metadata, RAID1: total=2.00GiB, used=10.59MiB
# umount /mnt
# dd if=/dev/zero of=/dev/sde bs=4096 count=1 seek=16
# btrfs dev scan
# btrfs-show-super /dev/sde
magic ........ [DON'T MATCH]
# mount /dev/sdd /mnt
   -> fails
# mount /dev/sdd /mnt -o degraded
   -> fails
```

I tried several balance options but never succeeded in getting rid of this system chunk in "single" mode. And that's the problem.
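For anyone trying to confirm they are in this situation, a quick check is to look at the profile labels in the `btrfs fi df` output quoted above. A minimal sketch, assuming the output format shown in this thread and the mount point from the reproduction:

```sh
# Any block group line without a RAID1 label is still unconverted; with the
# output shown above, a leftover system chunk shows up as a bare "System:" line.
# (Newer btrfs-progs also print a GlobalReserve line, which is always "single".)
btrfs fi df /mnt | grep -v RAID1

# The explicit conversion attempt used in the reproduction above; on this
# 3.10/3.11-era kernel it relocated 0 chunks, i.e. the "single" system chunk
# was left in place.
btrfs balance start -sconvert=raid1 -f /mnt
```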
btrfs_read_chunk_tree() contains a different approach to checking whether all required devices are present than btrfs_calc_num_tolerated_disk_barrier_failures() does. The drawback is that btrfs_read_chunk_tree() can only handle the past and the present; it does not care about new chunks that will be created in the future. btrfs_calc_num_tolerated_disk_barrier_failures() is pessimistic. The num_tolerated_disk_barrier_failures mechanism is a very strict way to avoid filesystem integrity issues: it deals with the worst case.

The goal is to switch the filesystem into read-only mode if it is not possible to write the superblock to the required number of disks. One way to decide this could be to say that if it is written at least once, we are fine. But that does not deal with the case where somebody physically removes a disk at a later time ("it's RAID6, I'm able to remove up to 2 disks"). The result of this worst case is a corrupt filesystem: the old superblock points to data that is potentially no longer valid, because if the failed superblock write is not handled, those blocks might be freed and reused. The pessimistic approach is to calculate the number of allowable failures and to switch to read-only mode immediately when more devices fail than that number allows. Consequently, this check is also performed when the filesystem is mounted in writable mode. IMO this num_tolerated_disk_barrier_failures mechanism is correct. IMHO what is not correct is that the "single" system chunk is not converted to RAID1 by the balance procedure.

That's true, and I will send a patch that kills the check that prevented the conversion of that "single" system chunk. (Xavier and I had to work around it, and it probably is the root cause.) However, there are still legitimate ways to get into a similar situation. It is perfectly possible to have, say, a two-drive FS with raid1 metadata and single data, where all the data is on one of the drives. If the other drive goes bad, both data and metadata are OK, but the user can't mount rw to do the dev add / dev del dance. What is the way out? Currently we essentially declare that FS borked, and that IMHO is not acceptable. (P.S. I was wrong about dev-replace; it could not have helped, since it naturally needs an rw mount.)

The way out is to mount read-only, copy the data aside and be happy that no data was lost. The #1 goal (IMO) is to avoid data loss. Therefore the filesystem goes read-only if fewer devices are functional for writing than required by the selected RAID levels. And in order to avoid the surprise of a filesystem going read-only 30 seconds after mounting it, this is also enforced at mount time. Since you and I don't agree, I guess we should ask additional people to state their opinion on this topic. We could also leave this as an option to the user ("mount -o degraded-and-I-want-to-lose-my-data"), but in my opinion the use case is very, very exceptional.

It's not that I don't agree; we definitely should not over-engineer, we have enough of that already. The use case is indeed pretty rare, but it's something that used to be possible in the past and is forbidden now, so I just wanted to make sure this was a conscious decision and see if there is anything else one can do other than mount ro and copy.

A similar problem still happens on kernel 3.19 with a filesystem in the middle of a raid1 -> single conversion which is paused and then normally/cleanly unmounted. On the next mount it will not mount rw. Filed bug 92641.
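For reference, the two workaround paths discussed above look roughly like this. This is only a sketch reusing the device names from the reproductions; /dev/sdX stands for a hypothetical replacement disk and the backup destination is a placeholder:

```sh
# If a degraded read-write mount still works (as in the first reproduction),
# the dev add / dev del missing dance mentioned at the top of this bug is:
mount -o degraded /dev/sdd /mnt
btrfs device add /dev/sdX /mnt      # hypothetical replacement disk
btrfs device delete missing /mnt

# If the read-write mount is refused (the case this bug is about), the way out
# described above is a read-only degraded mount plus copying the data aside:
mount -o ro,degraded /dev/sdd /mnt
cp -a /mnt/. /path/to/backup/       # placeholder destination
umount /mnt
```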
(In reply to Ilya Dryomov from comment #6)
> The use case is indeed pretty rare, but it's something that used
> to be possible in the past and is forbidden now.

I suffer from this use case with kernel 4.4.0-57-generic. I had a single, non-RAID btrfs filesystem, converted it to raid1, then had a disk fail completely (the OS does not recognize it), and now I can only mount the FS as RO. This use case is not so rare for people using btrfs raid1 at home. Sometimes you buy a disk, install the filesystem, use it for a while, and then wake up realizing that raid would be better. Then you read on the net how easy it is to convert the single instance to raid, but alas, it's actually broken. When a disk fails, you can recover your data, but you cannot recreate your raid1, and you suffer downtime until you resolve the issue and have a newly working RW filesystem up to replace the old one. So let's be clear here: if anybody attempts to convert a single btrfs to raid1 (or any other raid), does an error message now show up, preventing this mistake?

Technically this could be a regression fix that would go to any rc but because the merge