Bug 92921 - device removal from a degraded mounted volume results in oops
Summary: device removal from a degraded mounted volume results in oops
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs
Hardware: All Linux
Importance: P1 normal
Assignee: Josef Bacik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-09 01:25 UTC by Chris Murphy
Modified: 2016-03-20 10:18 UTC
CC List: 1 user

See Also:
Kernel Version: 3.19.0-0.rc7.git2.1.fc22.i686
Tree: Fedora
Regression: No


Attachments
photo of screen, kernel panic (233.46 KB, image/jpeg)
2015-02-09 01:25 UTC, Chris Murphy
journal with crash stack trace (222.19 KB, text/plain)
2015-02-09 01:30 UTC, Chris Murphy
dmesg 3.19.0 (149.13 KB, text/plain)
2015-02-10 07:01 UTC, Chris Murphy

Description Chris Murphy 2015-02-09 01:25:57 UTC
Created attachment 166101
photo of screen, kernel panic

# fallocate -l 250M b1
# fallocate -l 250M b2
# fallocate -l 250M b3
# fallocate -l 250M b4
# fallocate -l 250M b5
# mkfs.btrfs -M -draid1 -mraid1 b1 b2 b3 b4 b5
# losetup -f b[12345]
# mount /dev/loop0 /mnt/btr
# cp <somefiles> /mnt/btr

# btrfs fi show
Label: none  uuid: be5b794d-a603-4eee-adfa-74711c010cfa
	Total devices 5 FS bytes used 366.42MiB
	devid    1 size 250.00MiB used 250.00MiB path /dev/loop0
	devid    2 size 250.00MiB used 177.00MiB path /dev/loop1
	devid    3 size 250.00MiB used 192.00MiB path /dev/loop2
	devid    4 size 250.00MiB used 246.00MiB path /dev/loop3
	devid    5 size 250.00MiB used 249.00MiB path /dev/loop4

Btrfs v3.18.2

# btrfs scrub start -Bd /mnt/btr
scrub device /dev/loop0 (id 1) done
	scrub started at Sun Feb  8 17:59:30 2015 and finished after 3 seconds
	total bytes scrubbed: 137.52MiB with 0 errors
scrub device /dev/loop1 (id 2) done
	scrub started at Sun Feb  8 17:59:30 2015 and finished after 3 seconds
	total bytes scrubbed: 176.32MiB with 0 errors
scrub device /dev/loop2 (id 3) done
	scrub started at Sun Feb  8 17:59:30 2015 and finished after 3 seconds
	total bytes scrubbed: 191.33MiB with 0 errors
scrub device /dev/loop3 (id 4) done
	scrub started at Sun Feb  8 17:59:30 2015 and finished after 3 seconds
	total bytes scrubbed: 137.70MiB with 0 errors
scrub device /dev/loop4 (id 5) done
	scrub started at Sun Feb  8 17:59:30 2015 and finished after 4 seconds
	total bytes scrubbed: 240.29MiB with 0 errors

# umount /mnt/btr
# losetup -d /dev/loop3
# rm -f b4
# mount /dev/loop0 /mnt/btr
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
# mount /dev/loop0 /mnt/btr -o degraded
# btrfs device delete /dev/loop4 /mnt/btr
ERROR: error removing the device '/dev/loop4' - No space left on device
# rm -rf <somefiles>
# btrfs device delete /dev/loop4 /mnt/btr

There were no errors for about one minute, and then the kernel panicked (see the attached photo). The SSH session froze, and there was no console access.
Comment 1 Chris Murphy 2015-02-09 01:30:01 UTC
Created attachment 166111
journal with crash stack trace

It looks like the disk was deleted OK, but 32 seconds later there is a BUG message followed by the oops. Attached is the full journalctl -b-1 -l -o short-monotonic output.

[29624.382790] f22s.localdomain kernel: BTRFS info (device loop4): disk deleted /dev/loop4
[29656.902166] f22s.localdomain kernel: BUG: unable to handle kernel NULL pointer dereference at 0000005c

The last timestamp in the journal is [29656.903019], while the first one in the photo is 29663, so roughly 6 to 7 seconds of log output were lost between this journal and the photo.
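The intervals above can be checked directly from the log lines. This is a minimal sketch, assuming the kernel's bracketed [seconds.microseconds] monotonic prefix as it appears in the journal and dmesg excerpts in this report; log_gap is a hypothetical helper name, not a tool used by the reporter.

```shell
#!/bin/sh
# log_gap PATTERN_A PATTERN_B  (hypothetical helper)
# Reads kernel log lines on stdin and prints the seconds between the first
# line containing PATTERN_A and the first later line containing PATTERN_B,
# using the bracketed [seconds.micros] timestamp on each line.
log_gap() {
    awk -v a="$1" -v b="$2" '
        match($0, /\[ *[0-9]+\.[0-9]+]/) {
            # Drop the brackets; awk ignores the padding spaces on conversion.
            t = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (!found && index($0, a)) {
                start = t
                found = 1
            } else if (found && index($0, b)) {
                printf "%.6f\n", t - start
                exit
            }
        }
    '
}

# The two journal lines quoted in Comment 1:
printf '%s\n' \
    '[29624.382790] f22s.localdomain kernel: BTRFS info (device loop4): disk deleted /dev/loop4' \
    '[29656.902166] f22s.localdomain kernel: BUG: unable to handle kernel NULL pointer dereference at 0000005c' |
    log_gap 'disk deleted' 'BUG:'
```

For Comment 1's lines this prints 32.519376; fed Comment 3's dmesg excerpt it gives 98.832262. The photo's timestamp is only given in whole seconds, so if it reads 29663.x the lost stretch lies between 29663 - 29656.903019 (about 6.1 s) and about 7.1 s.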
Comment 2 Chris Murphy 2015-02-09 01:31:52 UTC
Is it even a good idea for Btrfs to permit device removal from a degraded volume? It seems like this shouldn't be permitted, or at most permitted only with a --force flag. Best practice is probably to add a device, remount back to normal mode, and then do the device delete. I don't know that anything good comes from removing a device from a degraded volume.
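Until something like that exists in btrfs itself, the guard proposed above can be approximated in user space. This is a minimal sketch, assuming that a degraded volume shows up in btrfs filesystem show output either as fewer devid lines than "Total devices" or as a line mentioning a missing device; the exact wording varies between btrfs-progs versions, so treat the check as a heuristic. guarded_dev_delete and fs_show_is_degraded are hypothetical names, and the --force flag here belongs to the wrapper, not to btrfs device delete.

```shell
#!/bin/sh
# fs_show_is_degraded (hypothetical): reads `btrfs filesystem show` text on
# stdin and succeeds (exit 0) if fewer devid lines are listed than
# "Total devices", or if any line mentions a missing device.
fs_show_is_degraded() {
    awk '
        /Total devices/ {
            for (i = 1; i < NF; i++)
                if ($i == "devices") total = $(i + 1) + 0
        }
        $1 == "devid" { seen++ }
        tolower($0) ~ /missing/ { missing = 1 }
        END { exit !(missing || seen < total) }
    '
}

# guarded_dev_delete [--force] DEVICE MOUNTPOINT  (hypothetical wrapper)
# Refuses `btrfs device delete` on a volume that looks degraded unless
# --force is given, and also refuses if the volume cannot be inspected.
guarded_dev_delete() {
    force=0
    if [ "${1:-}" = "--force" ]; then
        force=1
        shift
    fi
    if [ "$#" -ne 2 ]; then
        echo "usage: guarded_dev_delete [--force] DEVICE MOUNTPOINT" >&2
        return 2
    fi
    dev=$1
    mnt=$2
    if [ "$force" -eq 0 ]; then
        # Fail safe: no inspection output means no delete.
        if ! show=$(btrfs filesystem show "$mnt"); then
            echo "cannot inspect $mnt; not deleting $dev" >&2
            return 1
        fi
        if printf '%s\n' "$show" | fs_show_is_degraded; then
            echo "refusing to delete $dev: $mnt looks degraded (use --force)" >&2
            return 1
        fi
    fi
    btrfs device delete "$dev" "$mnt"
}
```

With the fi show output from the description (five devid lines for five devices) the check passes; once /dev/loop3 is gone, the wrapper would refuse the delete unless --force is given.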
Comment 3 Chris Murphy 2015-02-10 07:01:10 UTC
Created attachment 166261
dmesg 3.19.0

This is reproducible in a simplified form that doesn't overfill the volume first (thus avoiding ENOSPC):

# mount -o degraded /dev/loop0 /mnt/btr
# btrfs dev delete /dev/loop3 /mnt/btr
## wait 100s
# umount /mnt/btr

[  624.764677] BTRFS info (device loop3): disk deleted /dev/loop3
[  723.596939] BUG: unable to handle kernel NULL pointer dereference at 0000005c
[  723.597019] IP: [<c06b2de8>] bio_get_nr_vecs+0x8/0x40
[  723.597019] *pde = 00000000 
[  723.597019] Oops: 0000 [#1] SMP 

The full dmesg could be captured this time; that is what's attached.
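The reproduction steps from the description, minus the overfill/ENOSPC detour as in this comment, can be collected into one script. This is a sketch only: repro and run are hypothetical helper names, DRY_RUN defaults to printing the commands instead of running them, and the script assumes losetup assigns /dev/loop0 through /dev/loop4 to b1 through b5 in order, as it did in the report. This comment deletes /dev/loop3 but does not say which device was detached that time, so the detached loop device, its backing file, and the device to delete are all arguments.

```shell
#!/bin/sh
# run (hypothetical): print each command; execute it only when DRY_RUN=0.
# A failing command aborts the whole script.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

# repro DETACH_LOOP DETACH_FILE DELETE_DEV  (hypothetical)
repro() {
    if [ "$#" -ne 3 ]; then
        echo "usage: repro DETACH_LOOP DETACH_FILE DELETE_DEV" >&2
        return 2
    fi
    detach_loop=$1
    detach_file=$2
    delete_dev=$3

    # Five 250 MiB backing files, mixed-mode raid1, one loop device per file.
    for i in 1 2 3 4 5; do
        run fallocate -l 250M "b$i"
    done
    run mkfs.btrfs -M -draid1 -mraid1 b1 b2 b3 b4 b5
    for i in 1 2 3 4 5; do
        run losetup -f "b$i"
    done
    run mount /dev/loop0 /mnt/btr
    # Copying data in (the report's cp step) is left to the caller.
    run umount /mnt/btr

    # Make one device disappear so the volume only mounts with -o degraded.
    run losetup -d "$detach_loop"
    run rm -f "$detach_file"
    run mount -o degraded /dev/loop0 /mnt/btr

    # Remove another device from the degraded volume, wait, then unmount.
    run btrfs device delete "$delete_dev" /mnt/btr
    run sleep 100
    run umount /mnt/btr
}
```

For example, DRY_RUN=1 repro /dev/loop3 b4 /dev/loop4 prints the description's sequence (detach loop3 and remove b4, then delete loop4). Set DRY_RUN=0 only on a disposable test machine, since the expected result is an oops.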
