Most recent kernel where this bug did not occur: 0d514f040ac6629311974889d5b96bcf21c6461a Note - I have one patch on top of Linus' git tree, which hopefully is unrelated to this bug report. It lets CPU frequency scaling continue to function on my machine (despite an errata on this CPU): >diff-tree 378f1d5... (from 53ea68e...) >Author: Alex <alex@nautical.ath.cx> >Date: Sun Feb 5 13:45:21 2006 +1100 > > Revert "[CPUFREQ] p4-clockmod: Workaround for CPU's with N60 errata" > > This reverts c70ca00f77eb1bb69c5a2d029b3a911bc88eac1b commit. Distribution: Fedora development (rawhide) - recently synced Hardware Environment: pentium 4 Software Environment: /boot on ext3 on SCSI Root on reiserfs on lvm on md1 on SCSI gcc version 4.1.0 20060304 (Red Hat 4.1.0-3) Problem Description: The init process gets stuck in "D" state during boot, after md aurorun, and just after reiserfs says: dm-0: checking transaction log (dm-0) Boot does not proceed any further. Magic SysRq produces a call trace, which I'll attach. I'm hoping someone can tell what the problem is just by looking at my dmesg and the call trace for init, because I don't have time to do a git-bisect at the moment. I did wonder if commit 04b857f had uncovered a bug somewhere else - maybe in reiserfs, lvm or cfq? But I'm just guessing, because that commit touches md-raid1 and I use raid1. Steps to reproduce: 100% reproducible on my machine. Persists with more recent git trees - most recent I've tried is 3759fa9c55923f719ae944a3f8fbb029b36f759d. I thought it might be a mkinitrd bug, so I substituted a known working initramfs "initrd", but init still got stuck in "D" state.
Created attachment 7578 [details] netconsole log - dmesg, SysRq "Show Regs" and SysRq "Show State" netconsole log - dmesg, SysRq "Show Regs" and SysRq "Show State"
Created attachment 7579 [details] .config
I'm afraid there isn't enough info in the trace for me to figure out what is wrong. It looks like reiserfs has submitted a request to read a block from the journal, and that request just hasn't returned. It is not possible to tell why it hasn't returned. There is some evidence in the trace to suggest you are using crypto within dm. Are you ? It is hard to see how the patch that you identified could be causing this problem, though you could always try reverting it and see.... A git bisect is really what we need here. It is a pity you don't have time as aI cannot think of any other path to follow.
Sorry to point the finger at you, Neil - I did suggest that your patch had merely uncovered a bug somewhere else, though! I do use dm-crypt, but only for my swap partitions. The dm devices haven't been created with cryptsetup or mounted at that stage in the boot process. I don't really know why aes_enc_blk is called - there's no dm-crypt involved in md0 or my LVM root device. I thought that the pathological bit of the call trace started at _update_journal_header_block. But perhaps there is no pathological bit - the request just disappears into the lower levels of the block layer or the driver (SYM53C8XX_2) without any clues about what has gone wrong? Hopefully I'll be able to investigate this some more on the weekend - git-bisect, etc. There aren't that many commits between known-good and known-bad. Perhaps tonight I'll have time to play with the kernel command line. I'll try changing the IO scheduler - could that make any difference, do you think?
Created attachment 7584 [details] git bisect log OK, I did a git bisect, just for you. Unfortunately, it was that raid1 patch: commit 04b857f74cec5efc7730e9db47e291310f4708a4 Author: NeilBrown <neilb@suse.de> Date: Thu Mar 9 17:33:46 2006 -0800 [PATCH] md: Fix several raid1 bugs which cause a memory leak - wrong test for 'is this a BARRIER bio' - not freeing on all possible paths. - using r1_bio after freeing it. See the attached git-bisect log. Just for your information, switching IO schedulers from cfq to deadline made no difference. Is there anything I can do to help work out why this patch causes this side effect? Any debugging options I should turn on? I'm not much of a programmer, but I can apply patches and edit files with nano if you want me to add printks anywhere.
Thanks for doing the "git bisect". I've looked harded and think I have found the bug that this patch revealed. It only triggers when the first write to a raid1 is a BIO_RW_BARRIER write, which could will be the case when replaying an empty journal. I'll attach a patch that should fix it. However I'm still confused. The bug should cause an atomic_t to go negative under and atomic_dec_and_test, and that should cause a BUG message to be printed. But we are not seeing one.. Anyway, if you could try the patch an report back, I would appreciate it. Thanks NeilBrown
Created attachment 7608 [details] Proposed patch
Created attachment 7616 [details] dmesg and stack trace from patched kernel Thanks Neil. I tried your patch, but unfortunately it doesn't seem to make any difference to this particular bug. I've attached the new dmesg. This time I've included the trace for all the kernel threads as well as init, in case there's anything interesting there. The init call trace seems quite long to my ignorant eyes, but a stack overflow wouldn't manifest like this, would it? I've got 4k/4k stacks, plus CONFIG_DEBUG_STACKOVERFLOW=y.
Ok... I think it is brown-paper-bag time. Rather than a patch, can I just tell you what to change? in include/linux/raid/raid1.h, right down at the bottom, change #define R1BIO_Returned 4 to #define R1BIO_Returned 6 '4' is in-use by R1BIO_Barrier :-( please test and report back.
Thanks Neil. I'll test this as soon as I get home. So it looks like this patch is the underlying cause of the problem: [PATCH] md: support BIO_RW_BARRIER for md/raid1 author NeilBrown <neilb@suse.de> Wed, 9 Nov 2005 05:39:34 +0000 (21:39 -0800) committer Linus Torvalds <torvalds@g5.osdl.org> Wed, 9 Nov 2005 15:56:38 +0000 (07:56 -0800) commit a9701a30470856408d08657eb1bd7ae29a146190 tree eb6ea8c82fdc1b50bf56abadeee63a935034cf27 parent bd926c63b7a6843d3ce2728396c0891e54fce5c4 In terms of the underlying reason for the error, it looks like the 7-line block of comment starting with: /* For write-behind requests, we call bi_end_io when * the last non-write-behind device completes, providing * any write was successful. Otherwise we call when pushed an important piece of context (the pre-existing #define R1BIO_Returned 4) out of the diff. My first question is, what stopped this bug from manifesting before now? Is it the last bit of commit 04b857f, so that do_barriers now leads to set_bit R1BIO_Barrier, which is misinterpreted somewhere else as R1BIO_Returned? On the bright side, no-one else seems to have hit this bug. Which leads to my second question - why does it only affect me so far? Something special about my set-up? Surely plenty of people use barriers on md-raid1. Or do you have to have root on reiserfs as well, for the bug to trigger? What a pity this fix didn't make it into Linus' tree before 2.6.16! I have to say, it is mostly my fault for not reporting the bug sooner - I kept it to myself for several days, thinking someone else would notice and fix it quickly, as has happened with every other kernel bug I've found.
> > ------- Additional Comments From aouthred@gmail.com 2006-03-22 21:15 ------- > Thanks Neil. > > I'll test this as soon as I get home. > > So it looks like this patch is the underlying cause of the problem: Yes. > > In terms of the underlying reason for the error, it looks like the 7-line block > of comment starting with: > > /* For write-behind requests, we call bi_end_io when > * the last non-write-behind device completes, providing > * any write was successful. Otherwise we call when > > pushed an important piece of context (the pre-existing #define R1BIO_Returned 4) > out of the diff. Yes. > > My first question is, what stopped this bug from manifesting before now? Is it > the last bit of commit 04b857f, so that do_barriers now leads to set_bit > R1BIO_Barrier, which is misinterpreted somewhere else as > R1BIO_Returned? Yes. > > On the bright side, no-one else seems to have hit this bug. Which leads to my > second question - why does it only affect me so far? Something special about my > set-up? Surely plenty of people use barriers on md-raid1. Or do you have to have > root on reiserfs as well, for the bug to trigger? I'm not sure it is "no-one". Novell's bugzilla has about 5 hits. > > What a pity this fix didn't make it into Linus' tree before 2.6.16! I have to > say, it is mostly my fault for not reporting the bug sooner - I kept it to > myself for several days, thinking someone else would notice and fix it quickly, > as has happened with every other kernel bug I've found. You didn't report it soon enough, I didn't fix it soon enough, and Linus released 2.6.16 too soon :-) I'm just glad we will have a 2.6.16.1 soon! NeilBrown
That's fixed it. Changing: #define R1BIO_Returned 4 to: #define R1BIO_Returned 6 enables normal boot. Just for future reference, there is a duplicate entry for this bug in Novell's bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=159435 I can't find this bug on Redhat bugzilla, maybe because they don't have many users with root on reiserfs (on md-raid1). Thank you, Neil.