Bug 11873

Summary: unable to mount ext3 root filesystem due to htree_dirblock_to_tree
Product: File System    Reporter: Jimmy.Jazz
Component: ext3         Assignee: Tejun Heo (tj)
Status: CLOSED CODE_FIX    
Severity: normal CC: Jimmy.Jazz, neilb, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.28 rcx git and vanilla trees Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 11808    
Attachments: dmesg log when fsck failed
bisect log related to the fsck failure
md-ext-devt-fix.patch
dmesg.txt from 2.6.28 rc3 git2
.config from 2.6.28 rc3 git2
.config used during bisect
fix-__blkdev_get.patch

Description Jimmy.Jazz 2008-10-28 05:09:25 UTC
Latest working kernel version:2.6.27.3
Earliest failing kernel version:2.6.28 rc1 git1
Distribution: gentoo
Hardware Environment: x86_64 amd
Software Environment: kernel + initramfs
Problem Description:

Since 2.6.28, the mount command (busybox) has been unable to mount the root file system. The error returned is:

EXT3-fs error (device md_d0p2): htree_dirblock_to_tree: bad entry in directory #2: rec_len % 4 != 0 - offset=0, inode=2109536318, rec_len=16058, name_len=111

(The figures could change)


To make it work, I first need to mount the file system as ext2, unmount it, and rerun the mount command with the -t ext3 option.

step 1:
mount -o ro /dev/md/d0p2 /new-root
(mount ok with flags ro,errors=continue)
ls /new-root
(ls without error)
umount /new-root

and then

step 2:
mount -t ext3 -o ro /dev/md/d0p2 /new-root
(mount ok with flags ro,errors=continue,data=ordered)
ls /new-root
(ls without error)

ext2,jbd,ext3,mbcache,crc32 are modules and are loaded from an initramfs file system.

IMHO, it is a problem in mbcache or jbd.

Also, there is another problem with that kernel. After creating the /dev node and loading md_mod and raid1, mdassemble starts the md device as usual, but the mount command doesn't recognize the /dev node as valid and aborts.
I need to stop the md device that was created earlier and rerun mdassemble or mdadm --assemble --scan from the initramfs file system. Afterwards the mount command is able to mount the file system but... as an ext2 file system. I'm back at step 2 again.


Steps to reproduce: reboot the kernel, load the initramfs file system and open a shell just after the mount command has aborted.
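For context, the boot path described above amounts to an initramfs init script roughly like the following. This is a reconstruction from this report, not the reporter's actual script; the module list and the major/minor numbers are taken from later comments in this thread.

```sh
#!/bin/sh
# Sketch of the reported initramfs boot sequence (reconstruction).
# ext2/jbd/ext3/mbcache/crc32 and md_mod/raid1 are all modules here.
modprobe crc32; modprobe mbcache
modprobe jbd; modprobe ext2; modprobe ext3
modprobe md_mod; modprobe raid1

# Block nodes for the partitioned md device (major 254, as reported
# later in this thread).
mknod -m 0660 /dev/md/d0   b 254 0
mknod -m 0660 /dev/md/d0p2 b 254 2

mdassemble                                   # assemble md_d0 from sda/sdb
mount -t ext3 -o ro /dev/md/d0p2 /new-root || exec sh   # shell on failure
```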
Comment 1 Anonymous Emailer 2008-10-28 10:31:56 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 28 Oct 2008 05:09:25 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11873
> 
>            Summary: unable to mount ext3 root filesystem due to
>                     htree_dirblock_to_tree
>            Product: File System
>            Version: 2.5
>      KernelVersion: 2.6.28 rcx git and vanilla trees
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: ext3
>         AssignedTo: akpm@osdl.org
>         ReportedBy: Jimmy.Jazz@gmx.net
>                 CC: Jimmy.Jazz@gmx.net
> 
> 
> Latest working kernel version:2.6.27.3
> Earliest failing kernel version:2.6.28 rc1 git1
> Distribution: gentoo
> Hardware Environment: x86_64 amd
> Software Environment: kernel + initramfs
> Problem Description:
> 
> Since 2.6.28 mount command (busybox) is unable to mount the root file system.
> The error returned is,
> 
> EXT3-fs error (device md_d0p2): htree_dirblock_to_tree: bad entry in
> directory
> #2: rec_len % 4 != 0 - offset=0, inode=2109536318, rec_len=16058,
> name_len=111
> 
> (The figures could change)
> 
> 
> To make it work, I need first to mount the file system as an ext2 file
> system,
> unmount it and recall mount command with the -t ext3 option.
> 
> step 1:
> mount -o ro /dev/md/d0p2 /new-root
> (mount ok with flags ro,errors=continue)
> ls /new-root
> (ls without error)
> umount /new-root
> 
> and then
> 
> step 2:
> mount -t ext3 -o ro /dev/md/d0p2 /new-root
> (mount ok with flags ro,errors=continue,data=ordered)
> ls /new-root
> (ls without error)
> 
> ext2,jbd,ext3,mbcache,crc32 are modules and are loaded from an initramfs file
> system.
> 
> IMHO, it is a problem in mbcache or jbd.
> 
> Also, there is an other problem with that kernel. After creating the /dev
> node,
> loading md_mod and raid1, mdassemble start as usual the md device, but the
> mount command doesn't recognize the /dev node as valid and abort. 
> I need to stop the md device that has been created earlier and reapply
> mdassemble or mdadm --assemble --scan from initramfs file system. Afterwards
> mount command is able to mount the file system but... like an ext2 file
> system.
> I'm on step 2 again.
> 
> 
> Steps to reproduce: reboot the kernel, load the initramfs file system and
> open
> a shell just after the mount command has aborted.
> 
Comment 2 Theodore Tso 2008-10-28 18:53:46 UTC
On Tue, Oct 28, 2008 at 10:31:22AM -0700, Andrew Morton wrote:
> > 
> > Since 2.6.28 mount command (busybox) is unable to mount the root file
> system.
> > The error returned is,
> > 
> > EXT3-fs error (device md_d0p2): htree_dirblock_to_tree: bad entry in
> directory
> > #2: rec_len % 4 != 0 - offset=0, inode=2109536318, rec_len=16058,
> name_len=111
> > 

That error means the ext3 filesystem detected an error when responding
to a readdir() request from user space.  So it seems rather strange
that this came from a mount command.  It looks like the mount
succeeded, and then something in userspace did the equivalent of "ls
/*" or "echo /*", and the first part of the directory was totally
garbled.  (It should have been inode=2, rec_len=12, name_len=1 for
the '.' entry in the root directory.)
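The validity test that produced the error message can be sketched mechanically. The following is a rough shell equivalent of the checks ext3 applies to each directory entry (modeled loosely on ext3_check_dir_entry in fs/ext3/dir.c; the check order and the minimum rec_len of 12 are from that function), fed with the values from the error message above:

```sh
#!/bin/sh
# Values from the error message in this report; a valid '.' entry in the
# root directory would be inode=2, rec_len=12, name_len=1.
inode=2109536318 rec_len=16058 name_len=111 offset=0 blocksize=4096

# Rough equivalent of the per-entry sanity checks (a sketch, not the
# kernel code itself).
check_dirent() {
    if [ $((rec_len % 4)) -ne 0 ]; then
        echo "bad entry: rec_len % 4 != 0"
    elif [ "$rec_len" -lt 12 ]; then
        echo "bad entry: rec_len is too small"
    elif [ $((offset + rec_len)) -gt "$blocksize" ]; then
        echo "bad entry: directory entry across blocks"
    else
        echo "ok"
    fi
}

check_dirent   # the reported entry fails the very first check
```

With the reported values, 16058 % 4 = 2, so the first check fires, matching the "rec_len % 4 != 0" wording in the log.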

> > IMHO, it is a problem in mbcache or jbd.

Why do you think it is a problem with mbcache or jbd?  mbcache is only
used for extended attributes, and jbd would only get invoked once the
filesystem was getting modified.  If there was corruption in the
journal that was corrupting the filesystem when journal was recovered,
the directory corruption would be permanent, and not something that
would magically disappear when you mount the filesystem as ext2
read-only, and then remount as ext3.

This smells like ext3 is getting garbage when first reading in the
filesystem from the md device.  And then when you report that you're
also having mysterious md failures:

> > Also, there is an other problem with that kernel. After creating
> > the /dev node, loading md_mod and raid1, mdassemble start as usual
> > the md device, but the mount command doesn't recognize the /dev
> > node as valid and abort.  I need to stop the md device that has
> > been created earlier and reapply mdassemble or mdadm --assemble
> > --scan from initramfs file system. Afterwards mount command is
> > able to mount the file system but... like an ext2 file system.
> > I'm on step 2 again.
> > 

That smells like a problem with the md device, or the md setup code.

I suspect if you were to arrange to include dumpe2fs or debugfs in
your initrd, you would find that you are reading garbage from the md
device, and dumpe2fs and debugfs would be reporting, "huh?  there's no
ext3 filesystem image here".  I don't think this is a ext3 problem at
all.

							- Ted
Comment 3 Neil Brown 2008-10-29 02:51:00 UTC
Could be an md problem.
You seem to have presented the sequence out of order, or maybe I misunderstand you.

- machine fails to boot and you get a shell
  Is the md/d0p2 assembled at this point??
- You assemble the md array, try to mount but that fails.
- You stop the array, assemble it again and try to mount.
    This time the mount seems to work but you get the EXT3-fs error
- You unmount, mount as ext2.  That works
- umount, mount as ext3.  That works.

Is this the correct sequence of events?

What happens if you repeatedly try to mount (ext3) without stopping
and re-assembling the array?
What if you repeat ext3 without trying ext2 even after re-assembling the array?
If consecutive mounts don't work, what happens if you
  echo 3 >  /proc/sys/vm/drop_caches 
between the mount attempts?

What is the configuration of the array (raid1? 2 drives?  partitions or whole disk?)
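The repeat-mount experiment Neil asks for can be scripted from the initramfs shell, for example as below. This is an untested sketch: repeat_mount is a hypothetical helper (not from this thread), using the device path and mount point from this report, and it must run as root on the affected machine.

```sh
# Try the same ext3 mount several times without stopping/re-assembling
# the array, dropping caches between attempts (hypothetical helper).
repeat_mount() {    # usage: repeat_mount /dev/md/d0p2 /new-root
    for i in 1 2 3; do
        if mount -t ext3 -o ro "$1" "$2"; then
            echo "attempt $i: mounted"
            umount "$2"
        else
            echo "attempt $i: failed"
        fi
        sync
        echo 3 > /proc/sys/vm/drop_caches   # drop page/dentry/inode caches
    done
}
```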
Comment 4 Jimmy.Jazz 2008-10-29 04:46:23 UTC
(In reply to comment #2)

> 
> That error means the ext3 filesystem detected an error when responding
> to a readdir() request from user space.  So it seems rather strange
> that this came from a mount command.  It looks like the mount
> succeeded, and then something in userspace did the equivalent of "ls
> /*" or "echo /*", and the first part of the directory was totally
> garbaled.  (It should have been inode=2, rec_len=12, name_len=1 for
> the '.' entry in the root directory.)
> 
> > > IMHO, it is a problem in mbcache or jbd.
> 
> Why do you think it is a problem with mbcache or jbd? 

That was just a deduction. Remember, the filesystem could be mounted as ext2.

 mbcache is only
> used for extended attributes, and jbd would only get invoked once the
> filesystem was getting modified. 

Anyway, something has changed since 2.6.27.3. I will recheck everything to be sure that only the kernel has changed, i.e. that md, busybox, etc. are still the same code.


 If there was corruption in the
> journal that was corrupting the filesystem when journal was recovered,
> the directory corruption would be permanent, and not something that
> would magically disappear when you mount the filesystem as ext2
> read-only, and then remount as ext3.

I agree, but I'm describing what I observe. Mounting the filesystem as ext3 (when it succeeds) makes ls report filesystem corruption. After remounting it as ext2, ls is able to read the directory again. I didn't have any problems with any previous kernel. Moreover, the issue affects both of my computers.

> 
> This smells like ext3 is getting garbage when first reading in the
> filesystem from the md device.  And then when you report that you're
> also having mysterious md failures:

It could also be some random misalignment. mdp is quite solid and I've been using it for a while, so it is hard to believe this is just an mdp failure. The userspace code is still the same; only the kernel has changed. And if it is an mdp issue, something in the kernel has exposed it.

> 
> > > Also, there is an other problem with that kernel. After creating
> > > the /dev node, loading md_mod and raid1, mdassemble start as usual
> > > the md device, but the mount command doesn't recognize the /dev
> > > node as valid and abort.  I need to stop the md device that has
> > > been created earlier and reapply mdassemble or mdadm --assemble
> > > --scan from initramfs file system. Afterwards mount command is
> > > able to mount the file system but... like an ext2 file system.
> > > I'm on step 2 again.
> > > 
> 
> That smells like a problem with the md device, or the md setup code.

Or md_mod.

> 
> I suspect if you were to arrange to include dumpe2fs or debugfs in
> your initrd, you would find that you are reading garbage from the md
> device, and dumpe2fs and debugfs would be reporting, "huh?  there's no
> ext3 filesystem image here".  I don't think this is a ext3 problem at
> all.
> 

Please give me any clues you can; understand that I'm not in the habit of doing this. I will give debugfs a try, since dumpe2fs uses shared libraries.


>                                                         - Ted
> 

Thanks for your help.
Comment 5 Jimmy.Jazz 2008-10-29 05:15:38 UTC
(In reply to comment #3)
> Could be an md problem.
> You seem to have presented the sequence out of order, or maybe I
> misunderstand
> you.
> 

> - machine fails to boot and you get a shell

The kernel is still in the initramfs when I get a shell (busybox ash). The shell and the kernel (everything except fbcon/vesafb is a module) are minimal at that stage. Of course, the needed modules are loaded before init calls the mount (busybox) command.

>   Is the md/d0p2 assembled at this point??
Yes mdp is assembled.

You get,
cat /proc/mdstat 
Personalities : [raid1] 
md_d0 : active raid1 sda[0] sdb[1]
      244198448 blocks super 1.0 [2/2] [UU]
      bitmap: 2/8 pages [8KB], 16384KB chunk

unused devices: <none>

> - You assemble the md array, try to mount but that fails.
Yes, but sometimes the mount succeeds and the filesystem is reported corrupt when you try to access it (with ls, for instance).

> - You stop the array, assemble it again and try to mount.
Exactly. Note that the init script forces the -t ext3 -o ro options.

>     This time the mount seems to work but you get the EXT3-fs error
Yes
> - You unmount, mount as ext2.  That works
yes
> - umount, mount as ext3.  That works.
yes
> 
> Is the a correct sequence of events?
The above is the sequence. Yes.

> 
> What happens if you repeatedly try to mount (ext3) without stopping
> and re-assembling the array?
No go; it complains that the filesystem is not found. I think it cannot do anything with the /dev node that is created with mknod -m 0660 /dev/md/d0p2 b 254 2 before mdassemble is called.

> What if you repeat ext3 without trying ext2 even after re-assembling the
> array?
It doesn't mount, and in the rare cases where it succeeds, the filesystem appears corrupt when accessed.

> If consecutive mounts don't work, what happens if you
>   echo 3 >  /proc/sys/vm/drop_caches 
> between the mount attempts?

I will try it right now. I'm sure that is the solution.
> 
> What is the configuration of the array (raid1? 2 drives?  partitions or whole
> disk?)
> 
One raid1 array with 2 drives, using the whole disks; the array carries 4 partitions.
Comment 6 Jimmy.Jazz 2008-10-29 15:08:49 UTC
I ran a more careful test. Everything is the same except the kernel and its modules.

md_mod and raid1 are loaded.

The init script creates 2 nodes in /dev/md/, d0 and d0p2, to help mdadm (mdassemble) assemble the device.
(I won't rely on busybox v1.10.3 mdev hotplug because it creates character devices instead of block md devices.)

mknod -m 0660 /dev/md/d0   b 254 0
mknod -m 0660 /dev/md/d0p2 b 254 2 

d0p2 is the partition that contains the ext3 root filesystem I want to mount.

mdassemble is now called and assembles d0. It has recognized 4 partitions p1 p2 p3 p4.

mdadm --misc --test /dev/md/d0 returns zero, which confirms that everything is okay.

I added the line echo 3 > /proc/sys/vm/drop_caches 

The script calls mount -t ext3 -o ro /dev/md/d0 /new-root. The mount call fails. The script then opens a shell.

Whatever I do, the mount command is not able to mount the partition. I need to stop and re-assemble the mdp device again.

I manually assemble the mdp device. The first "mount -t ext3 -o ro /dev/md/d0 /new-root" call fails. The second succeeds and the partition finally becomes readable.

In fact, I don't really need to force the -t ext2 option.

I have also activated debugfs and set jbd-debug to 1. I get some messages in the dmesg log about jbd errors, like:

JBD: no valid journal superblock found.
EXT3-fs: error loading journal

Otherwise, I have the following (just some notes I thought relevant):

(fs/jbd/journal.c, 773): journal_init_inode: journal xxxx: inode md_d0p2/8 size 16777216, bits 12, blksize 4096
(fs/jbd/recovery.c, 242: journal_recover:No recovery required, last transaction 127173
(fs/jbd/journal.c, 947): journal_update_superblock:JBD:Skipping superblock update on recovered sb (start 0, seq 127174, errno 0)
(fs/jbd/journal.c, 947): journal_update_superblock:JBD:Skipping superblock update on recovered sb (start 1, seq 127174, errno 0)
EXT3-fs: mounted filesystem with ordered data mode
kjournald starting. Commit interval 5 seconds
Comment 7 Neil Brown 2008-10-29 15:15:25 UTC
You write:

> The script calls mount -t ext3 -o ro /dev/md/d0 /new-root. Mount call fails.

That should be     mount -t ext3 -o ro /dev/md/d0p2 /new-root
i.e. with a 'p2'.
Please confirm whether this was a typo or a genuine error in the script.

Thanks.
Comment 8 Jimmy.Jazz 2008-10-29 16:18:56 UTC
(In reply to comment #7)
> You write:
> 
> > The script calls mount -t ext3 -o ro /dev/md/d0 /new-root. Mount call
> fails.
> 
> That should be     mount -t ext3 -o ro /dev/md/d0p2 /new-root
> i.e. with a 'p2'.
> Please confirm whether this was a typo or a genuine error in the script.
>
 
Sorry, that was a typo. mount -t ext3 -o ro /dev/md/d0p2 /new-root is the correct command.

Thanks for pointing it out.
Comment 9 Neil Brown 2008-10-29 18:35:04 UTC
Thanks.

So it looks like when the filesystem reads from the md array, it sometimes gets the wrong data.  This presumably only happens in very specific circumstances otherwise you would expect to get some sort of random error every time you mount.

There isn't really enough information to figure out when the wrong data is read,
or where is read from, so we can't find the culprit based on that sort of info.

I suspect the best thing to do in this case is a 'git bisect'.  i.e. test a
number of different kernels until you find the patch that caused the problem.
If you haven't done a git-bisect before, 
   http://www.kernel.org/doc/local/git-quick.html
has a section that might help.
Comment 10 Jimmy.Jazz 2008-10-30 08:41:43 UTC
(In reply to comment #9)

> There isn't really enough information to figure out when the wrong data is
> read,
> or where is read from, so we can't find the culprit based on that sort of
> info.
> 
An easier way would be to tell me which options to enable when building the kernel from source. There are many debugging options that can be activated to make the kernel more verbose when mknod, md_mod, raid1, mdadm or mount are involved. I could then transfer the log file to a memory stick.

> I suspect the best thing to do in this case is a 'git bisect'.  i.e. test a
> number of different kernels until you find the patch that caused the problem.
> If you haven't done a git-bisect before, 
>    http://www.kernel.org/doc/local/git-quick.html
> has a section that might help.
> 

The stable git obviously doesn't contain 2.6.28. It stops at 2.6.27.4.
How can I do a bisect that way? Where are the patches you are speaking about? Should I use md.git for that? Please give me more clues.
Comment 11 Neil Brown 2008-10-30 17:13:01 UTC
Use the main upstream git tree

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

Don't worry about the stable branch.
Test "HEAD".  That will probably be bad, so
  git bisect start
  git bisect bad
then 
  git checkout v2.6.27
  make
  # install, test.

That will presumably work, so
  git bisect good

then just repeat the  make/install/test/"git bisect good-or-bad"
sequence until "git bisect" tells you it has finished.  Then
report what "git bisect log" contains.
Comment 12 Jimmy.Jazz 2008-11-01 13:14:56 UTC
(In reply to comment #11)
> Use the main upstream git tree

> then just repeat the  make/install/test/"git bisect good-or-bad"
> sequence until "git bisect" tell you it has finished.  Then
> report what "git bisect log" contains.
> 

Thanks.

I haven't finished the bisection yet. I also hit a major failure with fsck.ext{2,3,4}. I don't know if that is relevant, but here is the result. I still have 2 revisions left to test.
Comment 13 Jimmy.Jazz 2008-11-01 13:16:46 UTC
Created attachment 18583 [details]
dmesg log when fsck failed
Comment 14 Jimmy.Jazz 2008-11-01 13:17:54 UTC
Created attachment 18584 [details]
bisect log related to the fsck failure
Comment 15 Jimmy.Jazz 2008-11-01 15:11:01 UTC
Test ended with a git bisect bad...

# git bisect bad
4c46501d1659475dc6c89554af6ce7fe6ecf615c is first bad commit
commit 4c46501d1659475dc6c89554af6ce7fe6ecf615c
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Aug 25 19:56:11 2008 +0900

    block: move holder_dir from disk to part0
    
    Move disk->holder_dir to part0->holder_dir.  Kill now mostly
    superflous bdev_get_holder().
    
    While at it, kill superflous kobject_get/put() around holder_dir,
    slave_dir and cmd_filter creation and collapse
    disk_sysfs_add_subdirs() into register_disk().  These serve no purpose
    but obfuscating the code.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

:040000 040000 ae974a3c7e77b3b16600cfe42193b95904c8e344 47c3228991a181de2ae541f1dd7411d718968dec M	block
:040000 040000 6e123c247eaf445189f7c55e4ff0222bc503c647 460ba5c33120745c6b80698112cbae2f283c602f M	fs
:040000 040000 bed39dfa38fe347b04501f560a7163954a8f4317 5e7a376b3c0682f1a35e2dff33f2b09297c3587a M	include


and bisect.log

git bisect start
# bad: [721d5dfe7e516954c501d5e9d0dfab379cf4241a] Merge branch 'for-linus' of git://neil.brown.name/md
git bisect bad 721d5dfe7e516954c501d5e9d0dfab379cf4241a
# good: [3fa8749e584b55f1180411ab1b51117190bac1e5] Linux 2.6.27
git bisect good 3fa8749e584b55f1180411ab1b51117190bac1e5
# bad: [dbc2bc251e06c83efcc8d39f1e7de12c2b1ff591] hwmon: (dme1737) Be less i2c-centric
git bisect bad dbc2bc251e06c83efcc8d39f1e7de12c2b1ff591
# bad: [4dd9ec4946b4651a295d3bc8df9c15ac692a8f4e] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
git bisect bad 4dd9ec4946b4651a295d3bc8df9c15ac692a8f4e
# good: [4dd565134ece7e5d528d4c5288879310c54419e9] Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
git bisect good 4dd565134ece7e5d528d4c5288879310c54419e9
# bad: [c54dcd8ec9f05c8951d1e622e90904aef95379f9] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
git bisect bad c54dcd8ec9f05c8951d1e622e90904aef95379f9
# good: [d403a6484f0341bf0624d17ece46f24f741b6a92] Merge phase #1 of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
git bisect good d403a6484f0341bf0624d17ece46f24f741b6a92
# bad: [ef5bef357cdf49f3a386c7102dbf3be5f7e5c913] Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
git bisect bad ef5bef357cdf49f3a386c7102dbf3be5f7e5c913
# bad: [11914a53d2ec2974a565311af327b8983d8c820d] block: Add interface to abort queued requests
git bisect bad 11914a53d2ec2974a565311af327b8983d8c820d
# good: [ed9e1982347b36573cd622ee5f4e2a7ccd79b3fd] block: implement and use {disk|part}_to_dev()
git bisect good ed9e1982347b36573cd622ee5f4e2a7ccd79b3fd
# bad: [152e283fdfea0cd11e297d982378b55937842dde] block: introduce struct rq_map_data to use reserved pages
git bisect bad 152e283fdfea0cd11e297d982378b55937842dde
# bad: [540eed5637b766bb1e881ef744c42617760b4815] block: make partition array dynamic
git bisect bad 540eed5637b766bb1e881ef744c42617760b4815
# good: [b7db9956e57c8151b930d5e5fe5c766e6aad3ff7] block: move policy from disk to part0
git bisect good b7db9956e57c8151b930d5e5fe5c766e6aad3ff7
# bad: [0762b8bde9729f10f8e6249809660ff2ec3ad735] block: always set bdev->bd_part
git bisect bad 0762b8bde9729f10f8e6249809660ff2ec3ad735
# bad: [4c46501d1659475dc6c89554af6ce7fe6ecf615c] block: move holder_dir from disk to part0
git bisect bad 4c46501d1659475dc6c89554af6ce7fe6ecf615c


Also, there seem to be two different issues.

The first affects the /dev/sdx devices themselves. The partition table is always valid when the md array has a version 1.0 superblock, and it can be read with fdisk -l. There is one restriction: it is not msdos compatible (cylinders exceed 1024). When the mount command aborted, I was no longer able to mount /dev/sda2 before mdadm had assembled the devices. That used to be a great feature for restoring a broken device; it worked pretty well, especially when mdadm was unable to run/assemble the device for whatever reason.
Whenever mount is unable to mount the md device, it is also unable to mount the underlying sda2 device after mdadm has stopped the md device, or before it has assembled it.

The second issue affects the way the md device is mounted. To mount the md device successfully, the mount command needs to be called twice. As far as I understand, git bisect couldn't pin this down because we first have to deal with the first issue.

Please, tell me if I'm wrong. 
Comment 16 Tejun Heo 2008-11-05 19:41:55 UTC
I've been looking at this for a few days but still can't figure out what's going on.  The bisected commit only changes sysfs-related things and the tree looks exactly the same before and after, so I have no idea how it can cause such a problem.  I'm trying to reproduce the problem here but haven't been successful yet.  It could be that partitions are shifted somehow.  I'll post when I know more.  Thanks.
Comment 17 Tejun Heo 2008-11-05 21:24:15 UTC
Can you please post .config from the failing kernel?
Comment 18 Neil Brown 2008-11-05 21:31:51 UTC
Hi Jimmy.

The bug seems to be a bit indeterminate:  The filesystem doesn't mount at
first, but after some amount of 'random' trying different things, the mount
works.  It could be that sometimes the mount will work first time, even though
the bug is still present, i.e. sometimes you might have told 'git bisect' that
a kernel was 'good' when in fact it wasn't.  So the bad commit could be 
earlier.

Could you please try again over a smaller range, and repeat the test on each
kernel 3 times to make sure it works consistently?
Maybe discard the last 4 'good' lines from the log file (leave the 'bad' ones
there) and pass that to 'git bisect replay', and take it from there?
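Neil's log-trimming suggestion can be done mechanically, for example with the helper below. This is a sketch: trim_goods is a hypothetical name, bisect.log is whatever file you saved the output of "git bisect log" to, and it relies on GNU tac being available.

```sh
# Drop the last four 'git bisect good' entries from a saved bisect log,
# keeping the 'bad' entries (hypothetical helper; comment lines starting
# with '#' are passed through, git bisect replay ignores them anyway).
trim_goods() {    # usage: trim_goods < bisect.log > bisect.trimmed
    tac | awk '!/^git bisect good/ || ++g > 4' | tac
}
# Then: git bisect replay bisect.trimmed
```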

Also, in response to the end of comment #15, I'm not sure I understand exactly
what you are saying.  I think you are saying that mounting /dev/sda2 used to
work, but when the bug is present, it fails just the same as /dev/md/d0p2
fails.  This simply suggests that the problem is below the md layer as it
affects the sd device directly.

You might also be saying that after a successful mount, a subsequent mount attempt will sometimes fail.  This suggests that at each mount attempt it is
essentially random whether it succeeds or not.  That emphasises the fact
that it would be good to test each kernel several times to make sure that
every mount is successful, not just the first one.


Thanks.
Comment 19 Tejun Heo 2008-11-05 22:35:25 UTC
Created attachment 18698 [details]
md-ext-devt-fix.patch

Does this patch fix the problem?
Comment 20 Jimmy.Jazz 2008-11-06 15:38:26 UTC
(In reply to comment #19)

> Does this patch fix the problem?
> 

I will run the test without the patch again to see whether it improved anything. In fact, I didn't see the htree_dirblock_to_tree message any more during the new test, so part of the issue has gone, but the device still doesn't mount correctly.

I will try to reply to both comment #16 and comment #19, but please take the steps below into account to figure out what goes wrong.

The problem happens with the 2.6.28 kernel tree and never with the 2.6.27 tree (the last one tested was 2.6.27.4, and it works).

The test I describe here was done with 2.6.28-rc3-git2 and is executed before init switches to the root filesystem /dev/md/d0p2. See the attached dmesg.txt.

The key test is the fdisk interaction with the kernel (I use fdisk only for the test; it is not part of the init script).

Init is just a script, so I can interrupt it before mdadm/mdassemble is called.

The steps.

The script reads /proc/partitions and creates the /dev/sdxy nodes with the major and minor numbers described there. As expected, the md device is not yet represented in /proc/partitions.
sda2 and its "equivalent" sdb2 correspond to the root device and can be mounted to rescue a corrupted file system quite easily.
I can mount /dev/sda2 with mount -t ext3 -o ro /dev/sda2 /new-root without needing a second try. That proves the filesystem and the partition are at least sane.
 
The file system is accessible, but for obvious reasons I won't write to it: that would corrupt the meta device /dev/md/d0p2 it belongs to. So, to save dmesg.txt I have to mount /dev/md/d0p2.

I unmount /dev/sda2, create the /dev/md/d0 node and assemble the raid device with mdadm. The previously loaded md_mod and raid1 modules set md_d0 active with 2 out of 2 mirrors and bind sda and sdb.

dmesg reports that sda{1,2,3,4} and sdb{1,2,3,4} are removed, which also prevents the underlying disk partitions from being accessed. Nothing more is displayed, and in particular nothing about the md partitions. Here is the problem.

/proc/partitions reports sda, sdb and only ... md_d0. Only 254:0 is available; nothing about the 254:{1,2,3,4} partitions it contains. I manually create the 254:2 node and name it /dev/md/d0p2. I try to mount /dev/md/d0p2, but the mount command fails with a "No such device or address" error. That is quite reasonable, as neither /proc/partitions nor /sys/block/md_d0/ reports any md_d0p{1,2,3,4} devices.

That has nothing to do with the previous issue about the need to call the mount command twice. The patch has certainly corrected that bug.

But it remains that I have to stop and reassemble the md device in order to mount its partitions. Either the partition nodes are not initialized correctly, or the method to use has changed between the 2.6.27 and 2.6.28 kernels.
Either way, something goes wrong between the versions and should be corrected.

To exonerate mdadm, I found that fdisk is able to tell the kernel about the existence of the partitions. Every time fdisk -l /dev/md/d0 is called, the four partitions are displayed on the console. Afterwards, /proc/partitions and /sys/block/md_d0/ (sorry, I didn't check the latter) show the md_d0p{1,2,3,4} partitions as well.
I just have to retry the mount command, and /dev/md/d0p2 is now mounted.
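As an aside, the usual explicit way to ask the kernel to re-read a partition table is blockdev from util-linux. Whether it would have triggered the same rescan as fdisk -l in this situation is untested, and blockdev may not be present in a busybox initramfs:

```sh
# Explicitly request a partition-table re-read on the md device, then
# check whether the md_d0p* partitions appeared (untested against this
# particular bug).
blockdev --rereadpt /dev/md/d0
grep md_d0 /proc/partitions
```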

To summarize:

~ # mdadm --assemble --scan
md: md_d0 stopped.
md: bind<sdb>
md: bind<sda>
raid1: raid set md_d0 active with 2 out of 2 mirrors
md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits
created bitmap (8 pages) for device md_d0
mdadm: /dev/md/d0 has been started with 2 drives.
~ # mount -t ext3 /dev/md/d0p2 /new-root
mount: mounting /dev/md/d0p2 on /new-root failed: No such device or address
~ # fdisk -l /dev/md/d0
 md_d0: p1 p2 p3 p4

 Disk /dev/md/d0: 250.0 GB, 250059210752 bytes
 2 heads, 4 sectors/track, 61049612 cylinders
 Units = cylinders of 8 * 512 = 4096 bytes

       Device Boot      Start         End      Blocks  Id System
       /dev/md/d0p1   *           1       17921       71682  83 Linux
       /dev/md/d0p2           17922      148994      524292  83 Linux
       /dev/md/d0p3          148995      542211     1572868  82 Linux swap

~ # cat /proc/partitions
major minor  #blocks  name

   8        0  244198584 sda
   8       16  244198584 sdb
 254        0  244198448 md_d0
 254        1      71682 md_d0p1
 254        2     524292 md_d0p2
 254        3    1572868 md_d0p3

~ # mount -t ext3 /dev/md/d0p2 /new-root
kjournald starting.  Commit interval 5 seconds
EXT3 FS on md_d0p2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.

Please tell me whether it is still necessary to redo the bisection. If so, should all the modules be activated in the kernel, or just a limited number directly related to the issue, to ease the process?


IMHO, it is not an indeterminate bug. It happens every time I use a 2.6.28 kernel and never with the 2.6.27.x stable branch. Of course, you need to activate some md devices at boot time, not just access the disks directly; otherwise you will miss the issue.

Yes, it affects /dev/sda2 as well, but only after mdadm -S has been called. As I see it, that has caused some confusion, even for me: /proc/partitions does not report the sdx{1,2,3,4} partitions. I missed that test the first time. Testing in a tiny rescue environment, even with useful tools, is quite awkward, especially if you don't want to break things. I also understand that if a new bisect is needed, I should perform more than one test per kernel.



Also, I used to say I never got this bug before. That is not correct. I cannot remember the release, perhaps 2.6.24 or later, but it was immediately corrected, so I didn't pay more attention to it and skipped the faulty release without further consideration. It was quite similar anyway.



Thank you for all the work you have put into this.


 
Comment 21 Jimmy.Jazz 2008-11-06 15:42:22 UTC
Created attachment 18709 [details]
dmesg.txt from 2.6.28 rc3 git2
Comment 22 Jimmy.Jazz 2008-11-06 15:46:12 UTC
Created attachment 18710 [details]
.config from 2.6.28 rc3 git2
Comment 23 Jimmy.Jazz 2008-11-06 15:47:53 UTC
Created attachment 18711 [details]
.config used during bisect
Comment 24 Jimmy.Jazz 2008-11-06 15:56:34 UTC
Please, read "2.6.24 or older" instead of "2.6.24 or later" in the third sentence of the last paragraph of comment #20.
Comment 25 Tejun Heo 2008-11-06 17:45:08 UTC
Created attachment 18713 [details]
fix-__blkdev_get.patch

Please apply the attached along with md-ext-devt-fix.patch and test.  Thanks.
Comment 26 Jimmy.Jazz 2008-11-07 13:06:51 UTC
(In reply to comment #25)
With these patches the md device /dev/md/d0p2 can be mounted. Great job, thank you.

There is still a problem that could be cumbersome, especially if one needs to access the underlying physical partition devices.

After mdadm has stopped and released the md devices, the physical disk partitions are no longer accessible; it is not possible to mount them. I still get this message when I run mount -t ext3 -o ro /dev/sda2 /new-root:

mount: mounting /dev/sda2 on /new-root failed: No such device or address.

Also, /proc/partitions does not show the partitions of the physical devices after the md_mod and raid1 modules have been unloaded.
Comment 27 Neil Brown 2008-11-07 17:05:42 UTC
This is a by-product of how mdadm works.  It removes the partitions from
the device when the device is included in a raid array, to avoid confusion.
If you really want the partitions of the devices back (which only makes
any sense for raid1, and even then should only be done with care) you
can get them back with e.g.

  blockdev --rereadpt /dev/sda

This will cause the partitions of sda to re-appear.
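Under the hood, `blockdev --rereadpt` simply issues the BLKRRPART ioctl on the whole-disk node. A minimal Python sketch of the same operation (the device path is only an example; actually running `reread_partitions()` requires root and a disk not in use):

```python
import fcntl
import os

# BLKRRPART = _IO(0x12, 95): ask the kernel to re-read the partition table.
BLKRRPART = (0x12 << 8) | 95  # 0x125f

def reread_partitions(disk):
    """Equivalent of `blockdev --rereadpt <disk>` (needs root)."""
    fd = os.open(disk, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKRRPART)
    finally:
        os.close(fd)

# Example (requires root, disk must not be held open by md):
# reread_partitions("/dev/sda")
```

After a successful ioctl, the sdaN entries reappear in /proc/partitions, just as with the blockdev command above.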
Comment 28 Tejun Heo 2008-11-08 00:53:29 UTC
Hmmm... now the question is why md successfully assembled a faulty device instead of simply failing.  Can you please try fix-__blkdev_get.patch only (w/o md-ext-devt-fix.patch) and see what happens?
Comment 29 Rafael J. Wysocki 2008-11-09 09:43:02 UTC
Handled-By : Tejun Heo <tj@kernel.org>
Handled-By : Neil Brown <neilb@suse.de>
Comment 30 Rafael J. Wysocki 2008-11-09 11:11:49 UTC
First-Bad-Commit : 4c46501d1659475dc6c89554af6ce7fe6ecf615c
Comment 31 Jimmy.Jazz 2008-11-09 12:25:07 UTC
(In reply to comment #28)
> Hmmm... now the question is why md successfully assembled a faulty device
> instead of simply failing.  Can you please try fix-__blkdev_get.patch only
> (w/o
> md-ext-devt-fix.patch) and see what happens?
> 
@ tj and neilb,

I made the test with the 2.6.28 rc3 git5 kernel revision, because it already contains the fix-__blkdev_get.patch. It works great: the md device can be mounted.

Sorry, I could not get access to the same computer where I did the tests the first time, but this other one had the same issue. If you really need me to test on the computer used before, I would be glad to proceed asap.

Also, the devices were never faulty. I cannot explain why mount complained about the bad directory entry; I no longer see that message.

In the above post, I tried to underline the fact that the sdx partitions are not automatically visible in /proc/partitions after mdadm has stopped the md device built on them. Yet they were visible before mdadm was called.

Also, if the kernel is able to detect the partition tables on the sdx devices while booting, why does it ignore them, and why does it need blockdev --rereadpt to reveal them after the disks have been freed by mdadm?

Could we avoid using blockdev in the initramfs, since it just makes the image bigger? Avoiding blockdev makes sense when you need to rescue a system or change the disk layout on the fly, or when you cannot boot from a livecd, for example.

Thank you
Comment 32 Neil Brown 2008-11-09 16:19:46 UTC
mdadm explicitly tells the kernel to forget about the sdxN partition
devices when it creates an md array over sdx.  This is because in most
cases the partitions that appear to be on sdxN are not meaningful at all.
E.g. if you create a raid5 over 5 devices sda..sde, then create a partition
table, that table will appear at the start of sda as well as at the start of
the array.  The partitions will quite possibly be bigger than the sda device
itself, so they will be meaningless.  This can cause confusion, so mdadm
explicitly tells the kernel to forget about those partitions.

raid1 is clearly an exception here.  The partitions do still mean something.
However, it is generally not a good idea to mount just one half of a raid1,
as that can cause the two halves to be out of sync.  Obviously it is
sometimes useful to do that, and at those times blockdev is your friend.

The initramfs should never need to care about this.  Either it assembles
the array and uses the partitions of the array, or it doesn't and the sdx
devices still have their own partitions which can be used.

mdadm could possibly do the equivalent of "blockdev --rereadpt /dev/sdx"
when stopping an array containing sdx, but I'm not convinced that it is
a good idea.
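Neil's raid5 example can be made concrete with a small bounds check (all sizes hypothetical, not from this report): a partition table written at the start of the array also appears at the start of sda, but its entries can describe regions far larger than sda itself.

```python
def partition_fits(dev_sectors, part_start, part_sectors):
    """True if the partition lies entirely within the device."""
    return part_start + part_sectors <= dev_sectors

# Hypothetical raid5 of five 100 GiB disks: ~400 GiB of usable space.
disk_sectors = 100 * 2 ** 21      # 100 GiB in 512-byte sectors
array_sectors = 4 * disk_sectors  # usable size of the raid5 array

# A 300 GiB partition created on the array fits in the array, but the
# same table entry, read from the start of sda, cannot possibly fit:
part_sectors = 300 * 2 ** 21
fits_array = partition_fits(array_sectors, 2048, part_sectors)
fits_disk = partition_fits(disk_sectors, 2048, part_sectors)
```

This is why such phantom partitions are meaningless on the component device, and why mdadm tells the kernel to drop them.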
Comment 33 Tejun Heo 2008-11-09 18:01:03 UTC
Jimmy, yes, we still need to track down why md presented corrupt data instead of simply failing.  fix-__blkdev_get fixed the problem, but that patch only fixes a missing partition rescan.  I.e. before the patch, even if the device was invalidated or had not been scanned yet, trying to open a partition (instead of the whole device) would not trigger a partition rescan, and would thus result in an open failure, or in accessing stale partition information if it had changed underneath the block layer (partition manipulation tools issue an explicit rescan, so they are in the clear, but e.g. a media change on certain drivers can cause this).

Neil, can you think of something which could cause md to return garbage data with a bug like the above?  The worst I can think of is failing to find the component device, and even that shouldn't happen with this bug, as with sd the partitions are always scanned during probe.

Jimmy, can you please try to reproduce the problem with minimal setup?

Thanks.
Comment 34 Jimmy.Jazz 2008-11-11 08:57:16 UTC
(In reply to comment #33)

> Jimmy, can you please try to reproduce the problem with minimal setup?

Also, I have since tried a minimal setup and ... I'm now fighting an old known problem, as described in http://lkml.indiana.edu/hypermail/linux/kernel/0709.1/1645.html. If the .config could be of any help, I will send it to you.
For now, I will continue with my long current .config file. I have done a git bisect replay as suggested.
By the way, how do I apply the patch without compromising the git bisect?

As one of the mdp partitions contains the root filesystem, mdadm -S /dev/md/d0 fails whenever I shut down the computer. The root partition is then mounted read-only. Could that failure be a problem? Is it possible to avoid it?
Comment 35 Jimmy.Jazz 2008-11-11 16:18:19 UTC
I discarded the last 'good' lines from the log file and replayed all the compilations and reboots. Now d0p2 could be mounted successfully. Perhaps I misunderstood how git works. Also, I had to delete the old builds on the home server to make more room.

The result is,

# git bisect good
4c46501d1659475dc6c89554af6ce7fe6ecf615c is first bad commit
commit 4c46501d1659475dc6c89554af6ce7fe6ecf615c
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Aug 25 19:56:11 2008 +0900

    block: move holder_dir from disk to part0
    
    Move disk->holder_dir to part0->holder_dir.  Kill now mostly
    superflous bdev_get_holder().
    
    While at it, kill superflous kobject_get/put() around holder_dir,
    slave_dir and cmd_filter creation and collapse
    disk_sysfs_add_subdirs() into register_disk().  These serve no purpose
    but obfuscating the code.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

:040000 040000 ae974a3c7e77b3b16600cfe42193b95904c8e344 47c3228991a181de2ae541f1dd7411d718968dec M	block
:040000 040000 6e123c247eaf445189f7c55e4ff0222bc503c647 460ba5c33120745c6b80698112cbae2f283c602f M	fs
:040000 040000 bed39dfa38fe347b04501f560a7163954a8f4317 5e7a376b3c0682f1a35e2dff33f2b09297c3587a M	include

and

# git bisect log
git bisect start
# bad: [721d5dfe7e516954c501d5e9d0dfab379cf4241a] Merge branch 'for-linus' of git://neil.brown.name/md
git bisect bad 721d5dfe7e516954c501d5e9d0dfab379cf4241a
# good: [3fa8749e584b55f1180411ab1b51117190bac1e5] Linux 2.6.27
git bisect good 3fa8749e584b55f1180411ab1b51117190bac1e5
# bad: [dbc2bc251e06c83efcc8d39f1e7de12c2b1ff591] hwmon: (dme1737) Be less i2c-centric
git bisect bad dbc2bc251e06c83efcc8d39f1e7de12c2b1ff591
# bad: [4dd9ec4946b4651a295d3bc8df9c15ac692a8f4e] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
git bisect bad 4dd9ec4946b4651a295d3bc8df9c15ac692a8f4e
# bad: [c54dcd8ec9f05c8951d1e622e90904aef95379f9] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
git bisect bad c54dcd8ec9f05c8951d1e622e90904aef95379f9
# bad: [ef5bef357cdf49f3a386c7102dbf3be5f7e5c913] Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
git bisect bad ef5bef357cdf49f3a386c7102dbf3be5f7e5c913
# bad: [11914a53d2ec2974a565311af327b8983d8c820d] block: Add interface to abort queued requests
git bisect bad 11914a53d2ec2974a565311af327b8983d8c820d
# bad: [152e283fdfea0cd11e297d982378b55937842dde] block: introduce struct rq_map_data to use reserved pages
git bisect bad 152e283fdfea0cd11e297d982378b55937842dde
# bad: [540eed5637b766bb1e881ef744c42617760b4815] block: make partition array dynamic
git bisect bad 540eed5637b766bb1e881ef744c42617760b4815
# bad: [0762b8bde9729f10f8e6249809660ff2ec3ad735] block: always set bdev->bd_part
git bisect bad 0762b8bde9729f10f8e6249809660ff2ec3ad735
# bad: [4c46501d1659475dc6c89554af6ce7fe6ecf615c] block: move holder_dir from disk to part0
git bisect bad 4c46501d1659475dc6c89554af6ce7fe6ecf615c
# good: [5b99c2ffa980528a197f26c7d876cceeccce8dd5] block: make bi_phys_segments an unsigned int instead of short
git bisect good 5b99c2ffa980528a197f26c7d876cceeccce8dd5
# good: [e71bf0d0ee89e51b92776391c5634938236977d5] block: fix disk->part[] dereferencing race
git bisect good e71bf0d0ee89e51b92776391c5634938236977d5
# good: [ed9e1982347b36573cd622ee5f4e2a7ccd79b3fd] block: implement and use {disk|part}_to_dev()
git bisect good ed9e1982347b36573cd622ee5f4e2a7ccd79b3fd
# good: [548b10eb2959c96cef6fc29fc96e0931eeb53bc5] block: move __dev from disk to part0
git bisect good 548b10eb2959c96cef6fc29fc96e0931eeb53bc5
# good: [b7db9956e57c8151b930d5e5fe5c766e6aad3ff7] block: move policy from disk to part0
git bisect good b7db9956e57c8151b930d5e5fe5c766e6aad3ff7

It looks like the above test has not been a success.

Also, I tried to begin a fresh bisect with a small config file, but I hit a blocking qc timeout when sata (nforce4) was probed (I built the kernel out-of-tree, so the module path name did not change every time).

If I remember correctly, only mdadm is able to write to the md superblock; mdassemble doesn't, it only assembles the devices. The real problem encountered was probably due to the missing rescan that has now been corrected by your patch fix-__blkdev_get.

In fact, before I opened the bug report, I ran fdisk and mdadm many times on the devices. This could have led to the htree_dirblock_to_tree error whenever access to the file system was possible. The issue could be just a false positive.
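For context, the htree_dirblock_to_tree message in the original report comes from ext3's per-entry directory sanity check. A simplified Python model of those checks (loosely mirroring the kernel's ext3_check_dir_entry(); the kernel does more, e.g. inode-range validation):

```python
def dir_rec_len(name_len):
    # EXT3_DIR_REC_LEN(): 8-byte entry header plus the name,
    # rounded up to a multiple of 4 bytes.
    return (name_len + 8 + 3) & ~3

def check_dir_entry(rec_len, name_len, offset, block_size=4096):
    """Return an error string like the kernel's, or None if sane."""
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset + rec_len > block_size:
        return "directory entry across blocks"
    return None

# Values from the original report: rec_len=16058, name_len=111, offset=0.
error = check_dir_entry(16058, 111, 0)
```

Feeding in the reported values trips the `rec_len % 4 != 0` check (16058 is not a multiple of 4), which is consistent with the entry being garbage data rather than a genuinely corrupt filesystem.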
Comment 36 Tejun Heo 2008-11-11 16:39:27 UTC
Thanks Jimmy.  Yes, that is the exact commit which is fixed by fix-__blkdev_get.patch.  I still want to know what's going on with the faulty assembly.  Neil, can you please help here?  What should we be looking at?
Comment 37 Rafael J. Wysocki 2008-11-16 09:11:47 UTC
We seem to be tracking two separate issues here.  I'm closing this one to avoid confusion in the regressions report.

Care to open a new bug entry for the "faulty assembly" issue?
Comment 38 Neil Brown 2008-11-16 14:55:23 UTC
I don't think there is any evidence of "faulty assembly".
I think the array is assembled correctly, but the partition table
is confused, so when partitions are accessed we get the wrong data.

Is there something I have missed which is not explained by the above?
Comment 39 Tejun Heo 2008-11-16 17:52:41 UTC
With the above __blkdev_get() fix, the only problem that should occur is __blkdev_get() failing for existing partitions.  That won't happen for /dev/sd*, as they force a scan during attach, but md* does seem to trigger the problem, as it only sets the invalidated flag.  At any rate, the only things that can occur are either md failing assembly because it cannot find a component partition, or partitions of /dev/mdp* being inaccessible due to open failure.  However, Jimmy saw garbage data coming from md after a successful assembly, which is quite worrying, as __blkdev_get() can fail for various reasons.
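The behavior described above can be illustrated with a toy model (names and structure hypothetical, not the actual kernel code): before the fix, opening a partition did not trigger a rescan even when the disk's partition information had been invalidated, yielding the "No such device or address" failures from comment #26.

```python
class Disk:
    def __init__(self):
        self.invalidated = True   # e.g. md invalidated it, or it was never scanned
        self.partitions = {}      # stale/empty partition table

    def rescan_partitions(self):
        # Stand-in for the kernel re-reading the partition table.
        self.invalidated = False
        self.partitions = {1: "part1", 2: "part2"}

def open_partition(disk, partno, fixed=True):
    """Toy model of __blkdev_get() opening a partition.
    fixed=False models the pre-patch behavior (no rescan on open)."""
    if fixed and disk.invalidated:
        disk.rescan_partitions()
    part = disk.partitions.get(partno)
    if part is None:
        raise OSError("No such device or address")  # ENXIO, as in the report
    return part
```

Pre-patch (`fixed=False`) the open fails, or could hand back stale partition data if an old table was still cached; post-patch the open rescans first and succeeds.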