Bug 29952

Summary: btrfs as root filesystem hangs on mount after d863b50ab01333659314c2034890cb76d9fdc3c7
Product: File System
Reporter: Asbjørn Sannes (ace)
Component: btrfs
Assignee: fs_btrfs (fs_btrfs)
Severity: normal
CC: akpm, florian, maciej.rutecki, paulmck, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.38-rc6
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 27352
Attachments:
  sysrq-t output when waiting for mount
  Two boots with CONFIG_PROVE_RCU=y
  Hung boot that continues after sysrq-t.

Description Asbjørn Sannes 2011-02-26 15:59:38 UTC
I use btrfs as the root filesystem on one of my installs, and it hangs at boot when trying to mount the root filesystem. If I boot from an ext3 volume, it works.

After commit d863b50ab01333659314c2034890cb76d9fdc3c7 (vfs: call rcu_barrier after ->kill_sb()) the boot hangs, and pressing sysrq-t does not produce a trace.

The volume is located on an lvm volume.

Reverting the commit on top of v2.6.38-rc6-git6 makes it boot.
Comment 1 Andrew Morton 2011-02-26 18:51:37 UTC
We do need to see that sysrq-t output, please.  (Unless there's a
better way of finding out where rcu_barrier() got stuck?)

Please add ignore_loglevel to the kernel boot command line and try again.
Comment 2 Asbjørn Sannes 2011-02-26 19:59:51 UTC
Ah, added ignore_loglevel :) Did not know about that one.

I think this is the relevant part:
[   75.534655] mount           D ffff8802278d58d8  5824  1896      1 0x00000000
[   75.534655]  ffff8802254e5c28 0000000000000086 ffff8802254e5b88 ffff88022786dcc0
[   75.534655]  ffff880200000000 ffff8802254e5fd8 ffff8802278d5620 ffff8802254e5fd8
[   75.534655]  ffff8802278d58e0 ffff8802278d58d8 ffff8802254e4000 ffff8802254e4000
[   75.534655] Call Trace:
[   75.534655]  [<ffffffff8159fb68>] schedule_timeout+0x22/0xbb
[   75.534655]  [<ffffffff8104c251>] ? flat_send_IPI_allbutself+0x51/0x54
[   75.534655]  [<ffffffff81048433>] ? native_send_call_func_ipi+0x4a/0x60
[   75.534655]  [<ffffffff815a0eb9>] ? _raw_spin_unlock_irqrestore+0x20/0x2b
[   75.534655]  [<ffffffff810970af>] ? smp_call_function_many+0x1ae/0x1cf
[   75.534655]  [<ffffffff8159f172>] wait_for_common+0x9e/0x10b
[   75.534655]  [<ffffffff81068f99>] ? default_wake_function+0x0/0xf
[   75.534655]  [<ffffffff810bb31a>] ? call_rcu+0x10/0x12
[   75.534655]  [<ffffffff810bb30a>] ? call_rcu+0x0/0x12
[   75.534655]  [<ffffffff8159f279>] wait_for_completion+0x18/0x1a
[   75.534655]  [<ffffffff810ba97a>] _rcu_barrier+0x94/0xa4
[   75.534655]  [<ffffffff810ba9a1>] rcu_barrier+0x17/0x19
[   75.534655]  [<ffffffff8111cd77>] deactivate_locked_super+0x26/0x46
[   75.534655]  [<ffffffff8111d7e7>] mount_bdev+0x148/0x182
[   75.534655]  [<ffffffff811a6350>] ? ext4_fill_super+0x0/0x226c
[   75.534655]  [<ffffffff811a2020>] ext4_mount+0x10/0x12
[   75.534655]  [<ffffffff8111cfc9>] vfs_kern_mount+0xb8/0x1d1
[   75.534655]  [<ffffffff8111d140>] do_kern_mount+0x48/0xd8
[   75.534655]  [<ffffffff81133ecd>] do_mount+0x729/0x791
[   75.534655]  [<ffffffff810f0028>] ? memdup_user+0x43/0x63
[   75.534655]  [<ffffffff810f0081>] ? strndup_user+0x39/0x4f
[   75.534655]  [<ffffffff81133fb8>] sys_mount+0x83/0xbf
[   75.534655]  [<ffffffff810309fb>] system_call_fastpath+0x16/0x1b

Anything else I can do?

Oh, it seems this is not exactly btrfs code, so I was completely off..

I guess it tries ext4 before it tries btrfs, and that it tries ext3 before it tries ext4? That would make sense to me..
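
For anyone reading a trace like the one above: entries prefixed with "?" are addresses found on the stack that the unwinder could not confirm as part of the call chain. A minimal sketch, assuming the trace format shown in this comment, that keeps only the unmarked frames so the actual path into rcu_barrier() is easier to see. The helper name reliable_frames and the regular expression are illustrative, not part of any kernel tooling.

```python
import re

# One frame line of a kernel call trace, for example:
# "[   75.534655]  [<ffffffff810ba9a1>] rcu_barrier+0x17/0x19"
# The optional "? " marks a frame the unwinder could not verify.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace_text):
    """Return symbol names of frames not marked with '?', innermost first."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match and not match.group("unreliable"):
            frames.append(match.group("symbol"))
    return frames


# Excerpt copied from the trace in this comment.
EXCERPT = """\
[   75.534655]  [<ffffffff8159f279>] wait_for_completion+0x18/0x1a
[   75.534655]  [<ffffffff810ba97a>] _rcu_barrier+0x94/0xa4
[   75.534655]  [<ffffffff810ba9a1>] rcu_barrier+0x17/0x19
[   75.534655]  [<ffffffff8111cd77>] deactivate_locked_super+0x26/0x46
[   75.534655]  [<ffffffff8111d7e7>] mount_bdev+0x148/0x182
[   75.534655]  [<ffffffff811a6350>] ? ext4_fill_super+0x0/0x226c
[   75.534655]  [<ffffffff811a2020>] ext4_mount+0x10/0x12
"""

if __name__ == "__main__":
    # Print the call chain from the innermost frame outwards.
    print(" <- ".join(reliable_frames(EXCERPT)))
```

On the excerpt this leaves the chain from ext4_mount through mount_bdev and deactivate_locked_super into rcu_barrier, which matches the reading that the hang is in the generic mount path rather than in btrfs code.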
Comment 3 Andrew Morton 2011-02-26 20:50:50 UTC
That's the waiting thread.  We need to work out what thread it's waiting on.

Please attach the full sysrq-t output.  (attach, don't paste: the wordwrapping makes it unreadable).

Comment 4 Asbjørn Sannes 2011-02-26 21:02:20 UTC
Created attachment 49332 [details]
sysrq-t output when waiting for mount
Comment 5 Rafael J. Wysocki 2011-02-26 21:08:25 UTC
The first bad commit is:

commit d863b50ab01333659314c2034890cb76d9fdc3c7
Author: Boaz Harrosh <bharrosh@panasas.com>
Date:   Thu Feb 10 15:01:20 2011 -0800

    vfs: call rcu_barrier after ->kill_sb()

    Tested-by: Tao Ma <boyu.mt@taobao.com>
    Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    Cc: Nick Piggin <npiggin@kernel.dk>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Chris Mason <chris.mason@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

First-Bad-Commit : d863b50ab01333659314c2034890cb76d9fdc3c7
Comment 6 Andrew Morton 2011-02-26 21:24:09 UTC
Beats me.  There are of course 1000 different versions of the rcu code but it seems that you're running kernel/rcutree.c and its _rcu_barrier() appears to be stuck in wait_for_completion(&rcu_barrier_completion).

Paul, is there any debugging we can turn on to identify the cause of this?

Comment 7 Paul E. McKenney 2011-02-26 23:50:52 UTC
CONFIG_PROVE_RCU might help.  I wonder if that rcu_barrier() is in an RCU read-side critical section?  No good can possibly come of that.
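
To illustrate the hypothesis in this comment only (it is not a diagnosis of this bug): rcu_barrier() waits for queued RCU callbacks to run, callbacks run only after a grace period, and a grace period cannot end while a reader that started before it is still inside its critical section. A caller that is itself such a reader therefore waits on itself. The sketch below is a toy, single-threaded model under those assumptions; the ToyRCU class and its methods are hypothetical and bear no relation to the kernel's rcutree.c implementation.

```python
class ToyRCU:
    """Toy model of RCU ordering rules; NOT the kernel implementation."""

    def __init__(self):
        self.readers = 0    # number of active read-side critical sections
        self.pending = []   # callbacks queued and waiting for a grace period

    def read_lock(self):
        self.readers += 1

    def read_unlock(self):
        self.readers -= 1

    def call_rcu(self, callback):
        # Defer the callback until after a grace period.
        self.pending.append(callback)

    def barrier(self):
        """Model of a barrier call.

        Returns False where the real call would block: a grace period
        cannot end while any reader is still active. Returns True after
        running every pending callback otherwise.
        """
        if self.readers > 0:
            return False
        for callback in self.pending:
            callback()
        self.pending.clear()
        return True


if __name__ == "__main__":
    rcu = ToyRCU()
    freed = []
    rcu.call_rcu(lambda: freed.append("inode"))

    rcu.read_lock()
    print("barrier inside read-side section completes:", rcu.barrier())
    rcu.read_unlock()
    print("barrier outside read-side section completes:", rcu.barrier())
```

In the model the first barrier call can never complete because the caller holds the read-side section open, which is the situation CONFIG_PROVE_RCU is meant to help detect.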
Comment 8 Asbjørn Sannes 2011-02-27 09:25:37 UTC
Created attachment 49462 [details]
Two boots with CONFIG_PROVE_RCU=y

Interestingly enough, it continues after sysrq-t to boot up.
Comment 9 Paul E. McKenney 2011-02-28 05:29:28 UTC
Interesting indeed!  I take it that without CONFIG_PROVE_RCU=y, it did not boot all the way up?
Comment 10 Asbjørn Sannes 2011-03-02 05:54:01 UTC
Created attachment 49852 [details]
Hung boot that continues after sysrq-t.

Yes, after I do sysrq-t it continues.
I'm attaching a full dmesg from such a boot.
Comment 11 Asbjørn Sannes 2011-03-02 06:10:27 UTC
Could this be affected by bad BIOS or the like?
I played around with the BIOS a little bit. If I don't enable AHCI in the BIOS (which results in an AMD AHCI BIOS being run) and use "native IDE" instead, everything seems to work fine..

It is getting a bit ridiculous how many changes to the BIOS and/or kernel parameters I have to make these days to get Linux to boot :P

Maybe this is another one of those bugs that only bites a few and really is not a regression for the rest ..
Comment 12 Florian Mickler 2011-03-05 01:46:25 UTC
Maybe bug#27842 is related?
Comment 13 Asbjørn Sannes 2011-03-15 06:52:54 UTC
Hm, turning off C1E support in the BIOS makes it work. So I suppose this is more of the #15289 bug.

Earlier I did not need to turn it off in the BIOS; it was enough to add acpi_skip_timer_override to the boot parameters (bug#15289).

Tested 2.6.36 + commit (d863b50ab01333659314c2034890cb76d9fdc3c7)
 and 2.6.37 + commit, and it hung in both of those cases as well.

Maybe it is the same bug.
Comment 14 Rafael J. Wysocki 2011-03-27 20:00:25 UTC

*** This bug has been marked as a duplicate of bug 15289 ***