Hello, commit | commit de1414a654e66b81b5348dbc5259ecf2fb61655e | Author: Christoph Hellwig <hch@lst.de> | Date: Wed Jan 14 10:42:36 2015 +0100 | | fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info | | Now that we got rid of the bdi abuse on character devices we can always use | sb->s_bdi to get at the backing_dev_info for a file, except for the block | device special case. Export inode_to_bdi and replace uses of | mapping->backing_dev_info with it to prepare for the removal of | mapping->backing_dev_info. introduced a critical regression: Deleting a certain btrfs snapshot results in a BUG as below: | BUG: unable to handle kernel NULL pointer dereference at 0000000000000250 | IP: [<ffffffff812afcdb>] blk_get_backing_dev_info+0xb/0x20 | PGD 28e3f067 PUD 23cd9067 PMD 0 | Oops: 0000 [#1] SMP | CPU: 0 PID: 5053 Comm: btrfs Not tainted 4.1.0+ #20 | Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 | task: ffff8800003e0e80 ti: ffff880023d28000 task.ti: ffff880023d28000 | RIP: 0010:[<ffffffff812afcdb>] [<ffffffff812afcdb>] blk_get_backing_dev_info+0xb/0x20 | RSP: 0018:ffff880023d2b9e8 EFLAGS: 00010246 | RAX: 0000000000000000 RBX: ffff880023d2ba08 RCX: ffffffff8182a420 | RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88003e071780 | RBP: ffff880023d2b9e8 R08: 0000000000000000 R09: 0000000000000001 | R10: 0000000000000000 R11: 0000000000000001 R12: ffff88003e0719d8 | R13: 7fffffffffffffff R14: ffff880023d2c000 R15: 7fffffffffffffff | FS: 00007f71019b38c0(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000 | CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b | CR2: 0000000000000250 CR3: 0000000023cda000 CR4: 00000000000006b0 | Stack: | ffff880023d2ba88 ffffffff810d1041 ffff880023d2ba30 ffff88002fc3c700 | 7fffffffffffffff 0000000000000000 0000000000000000 7fffffffffffffff | 0000000000000001 0000000000000000 0000000000000000 0000000000000000 | Call Trace: | [<ffffffff810d1041>] __filemap_fdatawrite_range+0xe1/0x100 | [<ffffffff810d108e>] filemap_fdatawrite_range+0xe/0x10 | [<ffffffff81219ea6>] btrfs_fdatawrite_range+0x26/0x70 | [<ffffffff8121ee37>] btrfs_wait_ordered_range+0x47/0x110 | [<ffffffff81217431>] ? btrfs_drop_extent_cache+0x341/0x3f0 | [<ffffffff81210791>] btrfs_evict_inode+0x1f1/0x4e0 | [<ffffffff811419f6>] ? __inode_wait_for_writeback+0x66/0xb0 | [<ffffffff811351be>] evict+0xae/0x170 | [<ffffffff81135fc1>] iput+0x131/0x190 | [<ffffffff81210bf4>] btrfs_invalidate_inodes+0x174/0x1c0 | [<ffffffff81237e7e>] btrfs_ioctl_snap_destroy+0x54e/0x710 | [<ffffffff810d8452>] ? __alloc_pages_nodemask+0x142/0x840 | [<ffffffff8123abc1>] btrfs_ioctl+0x1291/0x29c0 | [<ffffffff810ddbb6>] ? lru_cache_add_active_or_unevictable+0x26/0x90 | [<ffffffff810f6c2e>] ? handle_mm_fault+0xbee/0x1380 | [<ffffffff8112a762>] ? do_filp_open+0x92/0xe0 | [<ffffffff8112e2fe>] do_vfs_ioctl+0x7e/0x4f0 | [<ffffffff8103b53b>] ? __do_page_fault+0x17b/0x3b0 | [<ffffffff8112e7f6>] SyS_ioctl+0x86/0xa0 | [<ffffffff8145a7ae>] entry_SYSCALL_64_fastpath+0x12/0x71 | Code: 01 00 00 00 8d 41 ff 85 c0 0f 4e c2 89 87 bc 05 00 00 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 87 98 00 00 00 55 48 89 e5 <48> 8b 80 50 02 00 00 5d 48 05 78 01 00 | RIP [<ffffffff812afcdb>] blk_get_backing_dev_info+0xb/0x20 | RSP <ffff880023d2b9e8> | CR2: 0000000000000250 | ---[ end trace 46e94aee3470cdc0 ]--- If I read the disassembly correctly: | 00000000002afcd0 <blk_get_backing_dev_info>: | extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t, | unsigned int, void __user *); | extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t, | unsigned int, void __user *); | extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t, | struct scsi_ioctl_command __user *); | 2afcd0: 48 8b 87 98 00 00 00 mov 0x98(%rdi),%rax | rq->tag = -1; | rq->start_time = jiffies; | set_start_time_ns(rq); | rq->part = NULL; | } | EXPORT_SYMBOL(blk_rq_init); | 2afcd7: 55 push %rbp | 2afcd8: 48 89 e5 mov %rsp,%rbp | | static void req_bio_endio(struct request *rq, struct bio *bio, | unsigned int nbytes, int error) + 2afcdb: 48 8b 80 50 02 00 00 mov 0x250(%rax),%rax then bdev->bd_disk in blkdev.h:bdev_get_queue is NULL but must not be. Why this happens after the mentioned commit is somewhat beyond my skills. This does not happen with every btrfs snapshot, although in a certain but rather complex workflow I can reproduce this all the time. Hence I could do bisecting which led to the given commit. This BUG still exists in current HEAD (tested on 1bc5e15). If this is sufficient information to fix it, I'll be glad. If you need a simple reproducer, drop me a line and I'll start working on it which might take some time. See also the btrfs mailing list for my initial report http://www.spinics.net/lists/linux-btrfs/msg44883.html Christoph
After some investigation I managed to create a reproducer that still isn't minimal but at least worth sharing. Short version: schroot --chroot jessie-amd64 -- apt-get install lvm2 Longer version: - Debian jessie userland required, or anything else that runs schroot - Create a type=btrfs-snapshot schroot named jessie-amd64, containing a minimal jessie amd64 userland - Start a non-persistent session and install the lvm2 package, then end the session - BUG is triggered when schroot cleans up the chroot, i.e. deletes the snapshot Presumably the lvm2 postinst does things that later cause the problem. Just installing lvm2's dependencies works as expected. Neither sleep nor sync before the snapshot deletion are a workaround. An strace of that installation has some 180 Mbyte size, still searching the for the root cause that triggers the bug. After that it should be possible to have a minimal reproducer of "btrfs subvolume snapshot", (some command), "btrfs subvolume delete".
The culprit is vgcfgbackup called from lvm2 postinst. So the following reproducer should do the trick: * Have a btrfs at /mnt/schroot/ * Create a subvolume /mnt/schroot/jessie-amd64, and populate it with a userland, including the lvm2 package. * Create a snapshot: btrfs subvolume snapshot /mnt/schroot/jessie-amd64/ /mnt/schroot/snap * Bind-mount /proc mount --bind /proc /mnt/schroot/snap/proc * Exec vgcfgbackup in the snapshot chroot /mnt/schroot/snap /sbin/vgcfgbackup * Umount umount /mnt/schroot/snap/proc * Delete snapshot btrfs subvolume delete /mnt/schroot/snap Final step was to reduce vgcfgbackup to the actual ioctl (wild guessing) that causes the trouble. The strace output looks harmless.
Created attachment 187441 [details] btrfs: skip waiting on ordered range for special files In btrfs_evict_inode, we properly truncate the page cache for evicted inodes but then we call btrfs_wait_ordered_range for every inode as well. It's the right thing to do for regular files but results in incorrect behavior for device inodes for block devices. filemap_fdatawrite_range gets called with inode->i_mapping which gets resolved to the block device inode before getting passed to wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens next depends on whether there's an open file handle associated with the inode. If there is, we write to the block device, which is unexpected behavior. If there isn't, we through normally and inode->i_data is used. We can also end up racing against open/close which can result in crashes when i_mapping points to a block device inode that has been closed. Since there can't be any page cache associated with special file inodes, it's safe to skip the btrfs_wait_ordered_range call entirely and avoid the problem. [Still undergoing xfstests run but should be ok.]
Thanks, looks very good. Not a single crash in all scenarios where I've encountered them every time. Consider this a Tested-by:
Thanks, closing.