Bug 60608 - general protection fault: 0000 on btrfs_clean_one_deleted_snapshot+0x46/0xe3
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs
Hardware: All Linux
Importance: P1 normal
Assignee: Josef Bacik
Reported: 2013-07-23 12:23 UTC by Emil Karlson
Modified: 2013-07-26 13:10 UTC
CC List: 1 user

Kernel Version: linux-3.11-rc{1,2}
Tree: Mainline
Regression: No


Attachments
Kernel log (88.37 KB, text/plain)
2013-07-23 12:23 UTC, Emil Karlson

Description Emil Karlson 2013-07-23 12:23:23 UTC
Created attachment 106996
Kernel log

System gets permanently stuck about once a day (perhaps about once every 32 snapshot deletions).

I have a kdump, so I can fetch further data as needed.

crash> gdb list *(btrfs_clean_one_deleted_snapshot+0x46)
0xffffffff811851c3 is in btrfs_clean_one_deleted_snapshot (include/linux/list.h:88).
83       * This is only for internal list manipulation where we know
84       * the prev/next entries already!
85       */
86      static inline void __list_del(struct list_head * prev, struct list_head * next)
87      {
88              next->prev = prev;
89              prev->next = next;
90      }
91      
92      /**
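
For readers following the listing above: the faulting line is the first store in __list_del(). A minimal userspace sketch of what list_del() leaves behind may help, assuming the same two-pointer list layout as include/linux/list.h. POISON1/POISON2 and the function node_is_poisoned_after_del() are stand-ins invented for this sketch (the kernel uses LIST_POISON1/LIST_POISON2). Whether a stale, already-deleted node is the actual path to the fault is not established at this point in the report.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

/* Stand-in poison values for this sketch only (not the kernel's constants). */
#define POISON1 ((struct list_head *)0x100)
#define POISON2 ((struct list_head *)0x200)

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    struct list_head *prev = head->prev;

    head->prev = n;
    n->next = head;
    n->prev = prev;
    prev->next = n;
}

/* Same two stores as __list_del() in the listing, then poison the node. */
static void list_del(struct list_head *e)
{
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = POISON1;
    e->prev = POISON2;
}

/*
 * Returns 1 if, after list_del(), the node holds the poison values,
 * does NOT look empty to list_empty(), and the list head is empty again.
 */
int node_is_poisoned_after_del(void)
{
    struct list_head head, node;

    init_list_head(&head);
    init_list_head(&node);
    list_add_tail(&node, &head);
    list_del(&node);

    return node.next == POISON1 && node.prev == POISON2 &&
           !list_empty(&node) && list_empty(&head);
}
```

In this model a second list_del() on the same node would dereference the poison pointers, which is the kind of access that faults in the kernel.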
Comment 1 Emil Karlson 2013-07-23 12:36:33 UTC
The last well-tested working version was 3.8.13. There was one unaccounted-for crash on 3.10.1 (with a couple of defrag patches from 3.11), so this may be a regression in 3.9 or 3.10.
Comment 2 Josef Bacik 2013-07-23 15:56:00 UTC
Can you tell me where

btrfs_clean_one_deleted_snapshot+0x46

is?

You can do this by doing gdb btrfs.ko and then

list *(btrfs_clean_one_deleted_snapshot+0x46)
Comment 3 Emil Karlson 2013-07-24 09:05:11 UTC
Changed list_del to list_del_init in btrfs_clean_one_deleted_snapshot and added
WARN_ON(!list_empty(&root->root_list));
to btrfs_add_dead_root() right before the list_add_tail, as suggested by Josef, and got this:

[26232.363116] ------------[ cut here ]------------
[26232.363127] WARNING: CPU: 0 PID: 5192 at fs/btrfs/transaction.c:989 btrfs_add_dead_root+0x3b/0x7c()
[26232.363129] Modules linked in: xts gf128mul [last unloaded: microcode]
[26232.363136] CPU: 0 PID: 5192 Comm: btrfs-endio-wri Not tainted 3.11.0-rc2 #3
[26232.363139] Hardware name: LENOVO 7450FVG/7450FVG, BIOS 7WET35WW (2.06 ) 04/03/2009
[26232.363141]  0000000000000000 0000000000000009 ffffffff8152edc8 0000000000000000
[26232.363145]  ffffffff8105c921 ffff88012cdedcf0 ffffffff81183e91 0000000000000000
[26232.363148]  ffff88010c0c9800 ffff88010c0c9c50 ffff8801340da4a8 ffff88010c0c9ca0
[26232.363152] Call Trace:
[26232.363159]  [<ffffffff8152edc8>] ? dump_stack+0x41/0x51
[26232.363164]  [<ffffffff8105c921>] ? warn_slowpath_common+0x73/0x8b
[26232.363168]  [<ffffffff81183e91>] ? btrfs_add_dead_root+0x3b/0x7c
[26232.363171]  [<ffffffff81183e91>] ? btrfs_add_dead_root+0x3b/0x7c
[26232.363176]  [<ffffffff811913fe>] ? btrfs_destroy_inode+0x22b/0x255
[26232.363179]  [<ffffffff8118da21>] ? relink_extent_backref+0x6b5/0x6e4
[26232.363183]  [<ffffffff8118e186>] ? btrfs_finish_ordered_io+0x736/0x82e
[26232.363192]  [<ffffffff811a8055>] ? worker_loop+0x169/0x48c
[26232.363196]  [<ffffffff811a7eec>] ? btrfs_queue_worker+0x255/0x255
[26232.363200]  [<ffffffff81075354>] ? kthread+0xad/0xb5
[26232.363204]  [<ffffffff81070000>] ? work_on_cpu+0x6e/0x6e
[26232.363208]  [<ffffffff810752a7>] ? kthread_freezable_should_stop+0x3b/0x3b
[26232.363212]  [<ffffffff81539c6c>] ? ret_from_fork+0x7c/0xb0
[26232.363216]  [<ffffffff810752a7>] ? kthread_freezable_should_stop+0x3b/0x3b
[26232.363218] ---[ end trace 07e1544cf9b06f72 ]---
[26234.588309] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
[26234.588455] IP: [<ffffffff81185211>] btrfs_clean_one_deleted_snapshot+0x76/0xcd
[26234.588582] PGD 0 
[26234.588623] Oops: 0000 [#1] SMP 
[26234.588687] Modules linked in: xts gf128mul [last unloaded: microcode]
[26234.588817] CPU: 0 PID: 2867 Comm: btrfs-cleaner Tainted: G        W    3.11.0-rc2 #3
[26234.588936] Hardware name: LENOVO 7450FVG/7450FVG, BIOS 7WET35WW (2.06 ) 04/03/2009
[26234.589054] task: ffff880133287500 ti: ffff88012dcc8000 task.ti: ffff88012dcc8000
[26234.589168] RIP: 0010:[<ffffffff81185211>]  [<ffffffff81185211>] btrfs_clean_one_deleted_snapshot+0x76/0xcd
[26234.589325] RSP: 0018:ffff88012dcc9e78  EFLAGS: 00010246
[26234.589410] RAX: 0000160000000000 RBX: ffff88010c0c9c50 RCX: 0000000000000000
[26234.589520] RDX: 0000000000000000 RSI: ffff88012dcc9dc8 RDI: ffff88010c0c9cb0
[26234.589630] RBP: ffff88012d632000 R08: 0000000000000000 R09: ffffea0004db7540
[26234.589738] R10: ffffea0004b37b40 R11: 0000000000000000 R12: ffff88010c0c9800
[26234.589847] R13: ffff880133287500 R14: ffff880133287500 R15: 0000000000000000
[26234.589958] FS:  0000000000000000(0000) GS:ffff88013bc00000(0000) knlGS:0000000000000000
[26234.590081] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[26234.590135] CR2: 00000000000000c0 CR3: 000000000180b000 CR4: 00000000000407f0
[26234.590135] Stack:
[26234.590135]  ffff880133204000 0000000000000000 ffff880133287500 ffffffff8117e25f
[26234.590135]  0000000000000000 0000000000000000 ffff880130ce5ba8 ffff880133204000
[26234.590135]  ffffffff8117e13a 0000000000000000 0000000000000000 ffffffff81075354
[26234.590135] Call Trace:
[26234.590135]  [<ffffffff8117e25f>] ? cleaner_kthread+0x125/0x15a
[26234.590135]  [<ffffffff8117e13a>] ? transaction_kthread+0x18f/0x18f
[26234.590135]  [<ffffffff81075354>] ? kthread+0xad/0xb5
[26234.590135]  [<ffffffff81070000>] ? work_on_cpu+0x6e/0x6e
[26234.590135]  [<ffffffff810752a7>] ? kthread_freezable_should_stop+0x3b/0x3b
[26234.590135]  [<ffffffff81539c6c>] ? ret_from_fork+0x7c/0xb0
[26234.590135]  [<ffffffff810752a7>] ? kthread_freezable_should_stop+0x3b/0x3b
[26234.590135] Code: 89 10 48 89 1b 48 89 5b 08 80 85 a0 07 00 00 01 4c 89 e7 e8 84 e8 03 00 48 8b 93 b0 fb ff ff 31 c9 48 b8 00 00 00 00 00 16 00 00 <48> 03 82 c0 00 00 00 48 ba 00 00 00 00 00 88 ff ff 48 c1 f8 06 
[26234.590135] RIP  [<ffffffff81185211>] btrfs_clean_one_deleted_snapshot+0x76/0xcd
[26234.590135]  RSP <ffff88012dcc9e78>
[26234.590135] CR2: 00000000000000c0
[26234.611796] ---[ end trace 07e1544cf9b06f73 ]---
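
The warning in the log above fires when root_list is not empty at the time btrfs_add_dead_root() runs, i.e. the node is still linked somewhere. A hedged userspace sketch of why a second list_add_tail() of an already-linked node is harmful follows. It assumes the same list layout as include/linux/list.h; the nodes a and b and the functions count_forward() and forward_count_after_double_add() are hypothetical names for illustration, not btrfs code.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    struct list_head *prev = head->prev;

    head->prev = n;
    n->next = head;
    n->prev = prev;
    prev->next = n;
}

/* Count entries reachable by walking ->next, bounded to avoid endless loops. */
static int count_forward(const struct list_head *head, int limit)
{
    const struct list_head *p;
    int n = 0;

    for (p = head->next; p != head && n < limit; p = p->next)
        n++;
    return n;
}

/* Add a, add b, then add a again; return how many entries a forward walk sees. */
int forward_count_after_double_add(void)
{
    struct list_head head, a, b;

    init_list_head(&head);
    init_list_head(&a);
    init_list_head(&b);

    list_add_tail(&a, &head);
    list_add_tail(&b, &head);

    /* This is the condition the added WARN_ON checks: a is still linked. */
    assert(!list_empty(&a));

    list_add_tail(&a, &head);   /* second add of an already-linked node */

    return count_forward(&head, 16);
}
```

In this model the forward walk sees only one entry afterwards: b is no longer reachable through ->next, while the ->prev chain cycles between a and b.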
Comment 4 Stefan Behrens 2013-07-24 12:21:46 UTC
Can be reproduced quickly like this:

mount /dev/sdd1 /mnt -o autodefrag
test -e /mnt/sub2 || mkdir /mnt/sub2 || sleep 666
(for i in `seq 1 9999999`; do btrfs subvol create /mnt/sub2/subv.${i} && \
 dd if=/dev/urandom of=/mnt/sub2/subv.${i}/file bs=32769 count=1 2>/dev/null; \
 done) &
sleep 10
(for i in `seq 1 2 9999999`; do btrfs subvol snapshot -r /mnt/sub2/subv.${i} \
 /mnt/sub2/snap.${i}; sleep 1; done) &
sleep 60
(for i in `seq 1 9999999`; do \
 dd if=/dev/urandom of=/mnt/sub2/subv.${i}/file bs=3 count=1 seek=20000 \
 2>/dev/null; sleep 1; done) &
sleep 120
(for i in `seq 1 9999999`; do btrfs subvol del /mnt/sub2/subv.${i}; sleep 1; \
 done) &

The warning that the root is already part of the list has this call trace (3.11.0-rc2):
[<ffffffffa00aa315>] btrfs_add_dead_root+0x75/0x80 [btrfs]
[<ffffffffa00ba81a>] btrfs_destroy_inode+0x24a/0x2d0 [btrfs]
[<ffffffff811c5ff7>] destroy_inode+0x37/0x60
[<ffffffff811c612d>] evict+0x10d/0x1a0
[<ffffffff811c68d5>] iput+0x105/0x190
[<ffffffffa00bc8e8>] btrfs_run_defrag_inodes+0x2d8/0x3e0 [btrfs]
[<ffffffffa00bc716>] ? btrfs_run_defrag_inodes+0x106/0x3e0 [btrfs]
[<ffffffffa00a1dca>] cleaner_kthread+0x15a/0x190 [btrfs]
Comment 5 Josef Bacik 2013-07-25 21:02:03 UTC
I don't know if I'm special or what, but I can't reproduce it. I've posted a patch; could you please verify that it fixes the problem?
Comment 6 Stefan Behrens 2013-07-26 07:00:28 UTC
Just noticed that this part of the reproducer, which I didn't paste, was relevant:

(while true; do sleep 300; find /mnt -type f -print | xargs cat > /dev/null; \
 done) &
sleep 180
(for i in `seq 1 5000`; do btrfs subvol del /mnt/sub2/subv.${i}; done)
sleep 180
sleep 180
(for i in `seq 5000 10000`; do btrfs subvol del /mnt/sub2/subv.${i}; done)


Yes, the patch works (for obvious reasons). The question remains whether the issue indicates an underlying problem that this patch merely hides.

I've applied the patch like this, so that I can still see whether the issue reproduces:

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d58cce7..e94bc69 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -986,7 +986,10 @@ static noinline int commit_cowonly_roots(struct btrfs_trans
 int btrfs_add_dead_root(struct btrfs_root *root)
 {
        spin_lock(&root->fs_info->trans_lock);
-       list_add_tail(&root->root_list, &root->fs_info->dead_roots);
+       if (list_empty(&root->root_list))
+               list_add_tail(&root->root_list, &root->fs_info->dead_roots);
+       else
+               WARN_ON_ONCE(1);
        spin_unlock(&root->fs_info->trans_lock);
        return 0;
 }
@@ -1925,7 +1928,7 @@ int btrfs_clean_one_deleted_snapshot(struct btrfs_root *ro
        }
        root = list_first_entry(&fs_info->dead_roots,
                        struct btrfs_root, root_list);
-       list_del(&root->root_list);
+       list_del_init(&root->root_list);
        spin_unlock(&fs_info->trans_lock);
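
The two hunks above work as a pair: list_del_init() leaves the removed node pointing at itself, so the list_empty() check in btrfs_add_dead_root() can tell "not queued" from "already queued". A minimal userspace sketch of that interaction, under stated assumptions: add_dead_root(), clean_one() and patched_sequence_ok() are hypothetical stand-ins operating on bare list heads, the trans_lock locking is omitted, and the WARN_ON_ONCE is reduced to a return value.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    struct list_head *prev = head->prev;

    head->prev = n;
    n->next = head;
    n->prev = prev;
    prev->next = n;
}

/* Unlink the entry and reinitialize it so list_empty(entry) is true again. */
static void list_del_init(struct list_head *e)
{
    e->next->prev = e->prev;
    e->prev->next = e->next;
    init_list_head(e);
}

/* Guarded add, modeled on the first hunk: returns 1 if queued, 0 if skipped. */
static int add_dead_root(struct list_head *root_list, struct list_head *dead_roots)
{
    if (!list_empty(root_list))
        return 0;               /* already queued; the patch warns here */
    list_add_tail(root_list, dead_roots);
    return 1;
}

/* Pop the first queued entry, modeled on the second hunk; NULL if none. */
static struct list_head *clean_one(struct list_head *dead_roots)
{
    struct list_head *first;

    if (list_empty(dead_roots))
        return NULL;
    first = dead_roots->next;
    list_del_init(first);
    return first;
}

/* Returns 1 if the add / duplicate add / clean / re-add sequence behaves. */
int patched_sequence_ok(void)
{
    struct list_head dead_roots, root;

    init_list_head(&dead_roots);
    init_list_head(&root);

    return add_dead_root(&root, &dead_roots) == 1 &&  /* first add queues it */
           add_dead_root(&root, &dead_roots) == 0 &&  /* duplicate is skipped */
           clean_one(&dead_roots) == &root &&         /* cleaner takes it off */
           list_empty(&dead_roots) &&
           add_dead_root(&root, &dead_roots) == 1;    /* can be queued again */
}
```

With plain list_del() instead of list_del_init(), the removed node would not look empty, so the guard would reject every later add of that node.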
Comment 7 Josef Bacik 2013-07-26 13:10:06 UTC
Posted an explanation on the mailing list as to why this is the real fix; closing the bz.