Bug 6230

Summary: The init process gets stuck in "D" state during boot.
Product: IO/Storage Reporter: Alex Outhred (aouthred)
Component: MDAssignee: Neil Brown (neilb)
Status: CLOSED CODE_FIX    
Severity: high    
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 8fce4d8e3b9e3cf47cc8afeb6077e22ab795d989 Subsystem:
Regression: --- Bisected commit-id:
Attachments: netconsole log - dmesg, SysRq "Show Regs" and SysRq "Show State"
.config
git bisect log
Proposed patch
dmesg and stack trace from patched kernel

Description Alex Outhred 2006-03-15 03:59:10 UTC
Most recent kernel where this bug did not occur:
0d514f040ac6629311974889d5b96bcf21c6461a

Note - I have one patch on top of Linus' git tree, which hopefully is unrelated
to this bug report. It lets CPU frequency scaling continue to function on my
machine (despite an errata on this CPU):

>diff-tree 378f1d5... (from 53ea68e...)
>Author: Alex <alex@nautical.ath.cx>
>Date:   Sun Feb 5 13:45:21 2006 +1100
>
>    Revert "[CPUFREQ] p4-clockmod: Workaround for CPU's with N60 errata"
>    
>    This reverts c70ca00f77eb1bb69c5a2d029b3a911bc88eac1b commit.

Distribution:
Fedora development (rawhide) - recently synced

Hardware Environment:
pentium 4

Software Environment:
/boot on ext3 on SCSI
Root on reiserfs on lvm on md1 on SCSI
gcc version 4.1.0 20060304 (Red Hat 4.1.0-3)

Problem Description:
The init process gets stuck in "D" state during boot, after md aurorun, and just
after reiserfs says:
dm-0: checking transaction log (dm-0)

Boot does not proceed any further. Magic SysRq produces a call trace, which I'll
attach.

I'm hoping someone can tell what the problem is just by looking at my dmesg and
the call trace for init, because I don't have time to do a git-bisect at the moment.

I did wonder if commit 04b857f had uncovered a bug somewhere else - maybe in
reiserfs, lvm or cfq? But I'm just guessing, because that commit touches
md-raid1 and I use raid1.

Steps to reproduce:
100% reproducible on my machine. Persists with more recent git trees - most
recent I've tried is 3759fa9c55923f719ae944a3f8fbb029b36f759d.

I thought it might be a mkinitrd bug, so I substituted a known working initramfs
"initrd", but init still got stuck in "D" state.
Comment 1 Alex Outhred 2006-03-15 04:01:40 UTC
Created attachment 7578 [details]
netconsole log - dmesg, SysRq "Show Regs" and SysRq "Show State"

netconsole log - dmesg, SysRq "Show Regs" and SysRq "Show State"
Comment 2 Alex Outhred 2006-03-15 04:03:07 UTC
Created attachment 7579 [details]
.config
Comment 3 Neil Brown 2006-03-15 14:17:54 UTC
I'm afraid there isn't enough info in the trace for me to figure out what is wrong.

It looks like reiserfs has submitted a request to read a block from the journal,
and that request just hasn't returned.  It is not possible to tell why it hasn't
returned. 

There is some evidence in the trace to suggest you are using crypto within dm. 
Are you ?


It is hard to see how the patch that you identified could be causing this
problem, though you could always try reverting it and see....

A git bisect is really what we need here.  It is a pity you don't have time as
aI cannot think of any other path to follow.
Comment 4 Alex Outhred 2006-03-15 17:18:11 UTC
Sorry to point the finger at you, Neil - I did suggest that your patch had
merely uncovered a bug somewhere else, though!

I do use dm-crypt, but only for my swap partitions. The dm devices haven't been
created with cryptsetup or mounted at that stage in the boot process. I don't
really know why aes_enc_blk is called - there's no dm-crypt involved in md0 or
my LVM root device.

I thought that the pathological bit of the call trace started at
_update_journal_header_block. But perhaps there is no pathological bit - the
request just disappears into the lower levels of the block layer or the driver
(SYM53C8XX_2) without any clues about what has gone wrong?

Hopefully I'll be able to investigate this some more on the weekend -
git-bisect, etc. There aren't that many commits between known-good and known-bad.

Perhaps tonight I'll have time to play with the kernel command line. I'll try
changing the IO scheduler - could that make any difference, do you think?
Comment 5 Alex Outhred 2006-03-16 03:43:03 UTC
Created attachment 7584 [details]
git bisect log

OK, I did a git bisect, just for you. Unfortunately, it was that raid1 patch:

commit 04b857f74cec5efc7730e9db47e291310f4708a4
Author: NeilBrown <neilb@suse.de>
Date:	Thu Mar 9 17:33:46 2006 -0800

    [PATCH] md: Fix several raid1 bugs which cause a memory leak
    
    - wrong test for 'is this a BARRIER bio'
    - not freeing on all possible paths.
    - using r1_bio after freeing it.

See the attached git-bisect log.

Just for your information, switching IO schedulers from cfq to deadline made no
difference.

Is there anything I can do to help work out why this patch causes this side
effect? Any debugging options I should turn on? I'm not much of a programmer,
but I can apply patches and edit files with nano if you want me to add printks
anywhere.
Comment 6 Neil Brown 2006-03-19 16:39:52 UTC
Thanks for doing the "git bisect".

I've looked harded and think I have found the bug that this patch revealed.  
It only triggers when the first write to a raid1 is a BIO_RW_BARRIER write, which
could will be the case when replaying an empty journal.

I'll attach a patch that should fix it.  However I'm still confused.  The bug
should cause an atomic_t to go negative under and atomic_dec_and_test, and that
should cause a BUG message to be printed.  But we are not seeing one..

Anyway, if you could try the patch an report back, I would appreciate it.
Thanks
NeilBrown
Comment 7 Neil Brown 2006-03-19 16:40:31 UTC
Created attachment 7608 [details]
Proposed patch
Comment 8 Alex Outhred 2006-03-20 00:18:31 UTC
Created attachment 7616 [details]
dmesg and stack trace from patched kernel

Thanks Neil. I tried your patch, but unfortunately it doesn't seem to make any
difference to this particular bug. I've attached the new dmesg. This time I've
included the trace for all the kernel threads as well as init, in case there's
anything interesting there.

The init call trace seems quite long to my ignorant eyes, but a stack overflow
wouldn't manifest like this, would it? I've got 4k/4k stacks, plus
CONFIG_DEBUG_STACKOVERFLOW=y.
Comment 9 Neil Brown 2006-03-22 19:30:11 UTC
Ok... I think it is brown-paper-bag time.

Rather than a patch, can I just tell you what to change?


in include/linux/raid/raid1.h, right down at the bottom, change

#define	R1BIO_Returned 4

to

#define	R1BIO_Returned 6

'4' is in-use by R1BIO_Barrier :-(


please test and report back.
Comment 10 Alex Outhred 2006-03-22 21:15:57 UTC
Thanks Neil.

I'll test this as soon as I get home.

So it looks like this patch is the underlying cause of the problem:

[PATCH] md: support BIO_RW_BARRIER for md/raid1

author	        NeilBrown <neilb@suse.de>
                Wed, 9 Nov 2005 05:39:34 +0000 (21:39 -0800)
committer	Linus Torvalds <torvalds@g5.osdl.org>
	        Wed, 9 Nov 2005 15:56:38 +0000 (07:56 -0800)
commit	        a9701a30470856408d08657eb1bd7ae29a146190
tree	        eb6ea8c82fdc1b50bf56abadeee63a935034cf27
parent	        bd926c63b7a6843d3ce2728396c0891e54fce5c4

In terms of the underlying reason for the error, it looks like the 7-line block
of comment starting with:

/* For write-behind requests, we call bi_end_io when
 * the last non-write-behind device completes, providing
 * any write was successful. Otherwise we call when

pushed an important piece of context (the pre-existing #define R1BIO_Returned 4)
out of the diff.

My first question is, what stopped this bug from manifesting before now? Is it
the last bit of commit 04b857f, so that do_barriers now leads to set_bit
R1BIO_Barrier, which is misinterpreted somewhere else as R1BIO_Returned?

On the bright side, no-one else seems to have hit this bug. Which leads to my
second question - why does it only affect me so far? Something special about my
set-up? Surely plenty of people use barriers on md-raid1. Or do you have to have
 root on reiserfs as well, for the bug to trigger?

What a pity this fix didn't make it into Linus' tree before 2.6.16! I have to
say, it is mostly my fault for not reporting the bug sooner - I kept it to
myself for several days, thinking someone else would notice and fix it quickly,
as has happened with every other kernel bug I've found.
Comment 11 Neil Brown 2006-03-22 21:25:25 UTC
> 
> ------- Additional Comments From aouthred@gmail.com  2006-03-22 21:15 -------
> Thanks Neil.
> 
> I'll test this as soon as I get home.
> 
> So it looks like this patch is the underlying cause of the problem:

Yes.

> 
> In terms of the underlying reason for the error, it looks like the 7-line block
> of comment starting with:
> 
> /* For write-behind requests, we call bi_end_io when
>  * the last non-write-behind device completes, providing
>  * any write was successful. Otherwise we call when
> 
> pushed an important piece of context (the pre-existing #define R1BIO_Returned 4)
> out of the diff.

Yes. 

> 
> My first question is, what stopped this bug from manifesting before now? Is it
> the last bit of commit 04b857f, so that do_barriers now leads to set_bit
> R1BIO_Barrier, which is misinterpreted somewhere else as
> R1BIO_Returned?

Yes.

> 
> On the bright side, no-one else seems to have hit this bug. Which leads to my
> second question - why does it only affect me so far? Something special about my
> set-up? Surely plenty of people use barriers on md-raid1. Or do you have to have
>  root on reiserfs as well, for the bug to trigger?

I'm not sure it is "no-one". Novell's bugzilla has about 5 hits.

> 
> What a pity this fix didn't make it into Linus' tree before 2.6.16! I have to
> say, it is mostly my fault for not reporting the bug sooner - I kept it to
> myself for several days, thinking someone else would notice and fix it quickly,
> as has happened with every other kernel bug I've found.

You didn't report it soon enough, I didn't fix it soon enough, and
Linus released 2.6.16 too soon :-)
I'm just glad we will have a 2.6.16.1 soon!

NeilBrown

Comment 12 Alex Outhred 2006-03-22 23:55:33 UTC
That's fixed it.

Changing:
#define	R1BIO_Returned 4

to:
#define	R1BIO_Returned 6

enables normal boot.

Just for future reference, there is a duplicate entry for this bug in Novell's
bugzilla:

https://bugzilla.novell.com/show_bug.cgi?id=159435

I can't find this bug on Redhat bugzilla, maybe because they don't have many
users with root on reiserfs (on md-raid1).

Thank you, Neil.