Bug 100911

Summary: NULL pointer dereference during btrfs snapshot removal
Product: File System Reporter: Christoph Biedl (bugzilla.kernel.bpeb)
Component: btrfsAssignee: Josef Bacik (josef)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: high CC: dsterba, hch, jeffm
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.1 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: btrfs: skip waiting on ordered range for special files

Description Christoph Biedl 2015-07-04 11:17:27 UTC
Hello,

commit

| commit de1414a654e66b81b5348dbc5259ecf2fb61655e
| Author: Christoph Hellwig <hch@lst.de>
| Date:   Wed Jan 14 10:42:36 2015 +0100
|
|     fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
|
|     Now that we got rid of the bdi abuse on character devices we can always use
|     sb->s_bdi to get at the backing_dev_info for a file, except for the block
|     device special case.  Export inode_to_bdi and replace uses of
|     mapping->backing_dev_info with it to prepare for the removal of
|     mapping->backing_dev_info.

introduced a critical regression: Deleting a certain btrfs snapshot
results in a BUG as below:

| BUG: unable to handle kernel NULL pointer dereference at 0000000000000250
| IP: [<ffffffff812afcdb>] blk_get_backing_dev_info+0xb/0x20
| PGD 28e3f067 PUD 23cd9067 PMD 0
| Oops: 0000 [#1] SMP
| CPU: 0 PID: 5053 Comm: btrfs Not tainted 4.1.0+ #20
| Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
| task: ffff8800003e0e80 ti: ffff880023d28000 task.ti: ffff880023d28000
| RIP: 0010:[<ffffffff812afcdb>]  [<ffffffff812afcdb>] blk_get_backing_dev_info+0xb/0x20
| RSP: 0018:ffff880023d2b9e8  EFLAGS: 00010246
| RAX: 0000000000000000 RBX: ffff880023d2ba08 RCX: ffffffff8182a420
| RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88003e071780
| RBP: ffff880023d2b9e8 R08: 0000000000000000 R09: 0000000000000001
| R10: 0000000000000000 R11: 0000000000000001 R12: ffff88003e0719d8
| R13: 7fffffffffffffff R14: ffff880023d2c000 R15: 7fffffffffffffff
| FS:  00007f71019b38c0(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
| CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
| CR2: 0000000000000250 CR3: 0000000023cda000 CR4: 00000000000006b0
| Stack:
|  ffff880023d2ba88 ffffffff810d1041 ffff880023d2ba30 ffff88002fc3c700
|  7fffffffffffffff 0000000000000000 0000000000000000 7fffffffffffffff
|  0000000000000001 0000000000000000 0000000000000000 0000000000000000
| Call Trace:
|  [<ffffffff810d1041>] __filemap_fdatawrite_range+0xe1/0x100
|  [<ffffffff810d108e>] filemap_fdatawrite_range+0xe/0x10
|  [<ffffffff81219ea6>] btrfs_fdatawrite_range+0x26/0x70
|  [<ffffffff8121ee37>] btrfs_wait_ordered_range+0x47/0x110
|  [<ffffffff81217431>] ? btrfs_drop_extent_cache+0x341/0x3f0
|  [<ffffffff81210791>] btrfs_evict_inode+0x1f1/0x4e0
|  [<ffffffff811419f6>] ? __inode_wait_for_writeback+0x66/0xb0
|  [<ffffffff811351be>] evict+0xae/0x170
|  [<ffffffff81135fc1>] iput+0x131/0x190
|  [<ffffffff81210bf4>] btrfs_invalidate_inodes+0x174/0x1c0
|  [<ffffffff81237e7e>] btrfs_ioctl_snap_destroy+0x54e/0x710
|  [<ffffffff810d8452>] ? __alloc_pages_nodemask+0x142/0x840
|  [<ffffffff8123abc1>] btrfs_ioctl+0x1291/0x29c0
|  [<ffffffff810ddbb6>] ? lru_cache_add_active_or_unevictable+0x26/0x90
|  [<ffffffff810f6c2e>] ? handle_mm_fault+0xbee/0x1380
|  [<ffffffff8112a762>] ? do_filp_open+0x92/0xe0
|  [<ffffffff8112e2fe>] do_vfs_ioctl+0x7e/0x4f0
|  [<ffffffff8103b53b>] ? __do_page_fault+0x17b/0x3b0
|  [<ffffffff8112e7f6>] SyS_ioctl+0x86/0xa0
|  [<ffffffff8145a7ae>] entry_SYSCALL_64_fastpath+0x12/0x71
| Code: 01 00 00 00 8d 41 ff 85 c0 0f 4e c2 89 87 bc 05 00 00 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 87 98 00 00 00 55 48 89 e5 <48> 8b 80 50 02 00 00 5d 48 05 78 01 00
| RIP  [<ffffffff812afcdb>] blk_get_backing_dev_info+0xb/0x20
|  RSP <ffff880023d2b9e8>
| CR2: 0000000000000250
| ---[ end trace 46e94aee3470cdc0 ]---

If I read the disassembly correctly:

| 00000000002afcd0 <blk_get_backing_dev_info>:
| extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
|                               unsigned int, void __user *);
| extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|                           unsigned int, void __user *);
| extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|                          struct scsi_ioctl_command __user *);
|   2afcd0:       48 8b 87 98 00 00 00    mov    0x98(%rdi),%rax
|         rq->tag = -1;
|         rq->start_time = jiffies;
|         set_start_time_ns(rq);
|         rq->part = NULL;
| }
| EXPORT_SYMBOL(blk_rq_init);
|   2afcd7:       55                      push   %rbp
|   2afcd8:       48 89 e5                mov    %rsp,%rbp
|
| static void req_bio_endio(struct request *rq, struct bio *bio,
|                           unsigned int nbytes, int error)
+   2afcdb:       48 8b 80 50 02 00 00    mov    0x250(%rax),%rax

then bdev->bd_disk in blkdev.h:bdev_get_queue is NULL but must not be.
Why this happens after the mentioned commit is somewhat beyond my
skills.

This does not happen with every btrfs snapshot, although in a certain
but rather complex workflow I can reproduce this all the time. Hence I
could do bisecting which led to the given commit. This BUG still
exists in current HEAD (tested on 1bc5e15).

If this is sufficient information to fix it, I'll be glad. If you need
a simple reproducer, drop me a line and I'll start working on it which
might take some time.

See also the btrfs mailing list for my initial report
http://www.spinics.net/lists/linux-btrfs/msg44883.html

    Christoph
Comment 1 Christoph Biedl 2015-07-05 13:59:03 UTC
After some investigation I managed to create a reproducer that still
isn't minimal but at least worth sharing.

Short version: schroot --chroot jessie-amd64 -- apt-get install lvm2

Longer version:

- Debian jessie userland required, or anything else that runs schroot
- Create a type=btrfs-snapshot schroot named jessie-amd64, containing
  a minimal jessie amd64 userland
- Start a non-persistent session and install the lvm2 package, then
  end the session
- BUG is triggered when schroot cleans up the chroot, i.e. deletes the
  snapshot

Presumably the lvm2 postinst does things that later cause the problem.
Just installing lvm2's dependencies works as expected.

Neither sleep nor sync before the snapshot deletion are a workaround.

An strace of that installation has some 180 Mbyte size, still searching
the for the root cause that triggers the bug. After that it should be
possible to have a minimal reproducer of "btrfs subvolume snapshot",
(some command), "btrfs subvolume delete".
Comment 2 Christoph Biedl 2015-07-07 21:29:40 UTC
The culprit is vgcfgbackup called from lvm2 postinst.

So the following reproducer should do the trick:

* Have a btrfs at /mnt/schroot/
* Create a subvolume /mnt/schroot/jessie-amd64, and populate it with a
  userland, including the lvm2 package.
* Create a snapshot: 
    btrfs subvolume snapshot /mnt/schroot/jessie-amd64/ /mnt/schroot/snap
* Bind-mount /proc
    mount --bind /proc /mnt/schroot/snap/proc
* Exec vgcfgbackup in the snapshot
    chroot /mnt/schroot/snap /sbin/vgcfgbackup
* Umount
    umount /mnt/schroot/snap/proc
* Delete snapshot
    btrfs subvolume delete /mnt/schroot/snap

Final step was to reduce vgcfgbackup to the actual ioctl (wild
guessing) that causes the trouble. The strace output looks harmless.
Comment 3 Jeff Mahoney 2015-09-11 20:12:26 UTC
Created attachment 187441 [details]
btrfs: skip waiting on ordered range for special files

In btrfs_evict_inode, we properly truncate the page cache for evicted
inodes but then we call btrfs_wait_ordered_range for every inode as well.
It's the right thing to do for regular files but results in incorrect
behavior for device inodes for block devices.

filemap_fdatawrite_range gets called with inode->i_mapping which gets
resolved to the block device inode before getting passed to
wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
next depends on whether there's an open file handle associated with the
inode.  If there is, we write to the block device, which is unexpected
behavior.  If there isn't, we through normally and inode->i_data is used.
We can also end up racing against open/close which can result in crashes
when i_mapping points to a block device inode that has been closed.

Since there can't be any page cache associated with special file inodes,
it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
the problem.


[Still undergoing xfstests run but should be ok.]
Comment 4 Christoph Biedl 2015-09-12 00:22:42 UTC
Thanks, looks very good. Not a single crash in all scenarios where I've encountered them every time.

Consider this a Tested-by:
Comment 5 David Sterba 2015-09-29 14:15:21 UTC
Thanks, closing.