Latest working kernel version:2.6.27.3 Earliest failing kernel version:2.6.28 rc1 git1 Distribution: gentoo Hardware Environment: x86_64 amd Software Environment: kernel + initramfs Problem Description: Since 2.6.28 mount command (busybox) is unable to mount the root file system. The error returned is, EXT3-fs error (device md_d0p2): htree_dirblock_to_tree: bad entry in directory #2: rec_len % 4 != 0 - offset=0, inode=2109536318, rec_len=16058, name_len=111 (The figures could change) To make it work, I need first to mount the file system as an ext2 file system, unmount it and recall mount command with the -t ext3 option. step 1: mount -o ro /dev/md/d0p2 /new-root (mount ok with flags ro,errors=continue) ls /new-root (ls without error) umount /new-root and then step 2: mount -t ext3 -o ro /dev/md/d0p2 /new-root (mount ok with flags ro,errors=continue,data=ordered) ls /new-root (ls without error) ext2,jbd,ext3,mbcache,crc32 are modules and are loaded from an initramfs file system. IMHO, it is a problem in mbcache or jbd. Also, there is an other problem with that kernel. After creating the /dev node, loading md_mod and raid1, mdassemble start as usual the md device, but the mount command doesn't recognize the /dev node as valid and abort. I need to stop the md device that has been created earlier and reapply mdassemble or mdadm --assemble --scan from initramfs file system. Afterwards mount command is able to mount the file system but... like an ext2 file system. I'm on step 2 again. Steps to reproduce: reboot the kernel, load the initramfs file system and open a shell just after the mount command has aborted.
Reply-To: akpm@linux-foundation.org (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Tue, 28 Oct 2008 05:09:25 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=11873 > > Summary: unable to mount ext3 root filesystem due to > htree_dirblock_to_tree > Product: File System > Version: 2.5 > KernelVersion: 2.6.28 rcx git and vanilla trees > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: ext3 > AssignedTo: akpm@osdl.org > ReportedBy: Jimmy.Jazz@gmx.net > CC: Jimmy.Jazz@gmx.net > > > Latest working kernel version:2.6.27.3 > Earliest failing kernel version:2.6.28 rc1 git1 > Distribution: gentoo > Hardware Environment: x86_64 amd > Software Environment: kernel + initramfs > Problem Description: > > Since 2.6.28 mount command (busybox) is unable to mount the root file system. > The error returned is, > > EXT3-fs error (device md_d0p2): htree_dirblock_to_tree: bad entry in > directory > #2: rec_len % 4 != 0 - offset=0, inode=2109536318, rec_len=16058, > name_len=111 > > (The figures could change) > > > To make it work, I need first to mount the file system as an ext2 file > system, > unmount it and recall mount command with the -t ext3 option. > > step 1: > mount -o ro /dev/md/d0p2 /new-root > (mount ok with flags ro,errors=continue) > ls /new-root > (ls without error) > umount /new-root > > and then > > step 2: > mount -t ext3 -o ro /dev/md/d0p2 /new-root > (mount ok with flags ro,errors=continue,data=ordered) > ls /new-root > (ls without error) > > ext2,jbd,ext3,mbcache,crc32 are modules and are loaded from an initramfs file > system. > > IMHO, it is a problem in mbcache or jbd. > > Also, there is an other problem with that kernel. After creating the /dev > node, > loading md_mod and raid1, mdassemble start as usual the md device, but the > mount command doesn't recognize the /dev node as valid and abort. > I need to stop the md device that has been created earlier and reapply > mdassemble or mdadm --assemble --scan from initramfs file system. Afterwards > mount command is able to mount the file system but... like an ext2 file > system. > I'm on step 2 again. > > > Steps to reproduce: reboot the kernel, load the initramfs file system and > open > a shell just after the mount command has aborted. >
On Tue, Oct 28, 2008 at 10:31:22AM -0700, Andrew Morton wrote: > > > > Since 2.6.28 mount command (busybox) is unable to mount the root file > system. > > The error returned is, > > > > EXT3-fs error (device md_d0p2): htree_dirblock_to_tree: bad entry in > directory > > #2: rec_len % 4 != 0 - offset=0, inode=2109536318, rec_len=16058, > name_len=111 > > That error means the ext3 filesystem detected an error when responding to a readdir() request from user space. So it seems rather strange that this came from a mount command. It looks like the mount succeeded, and then something in userspace did the equivalent of "ls /*" or "echo /*", and the first part of the directory was totally garbaled. (It should have been inode=2, rec_len=12, name_len=1 for the '.' entry in the root directory.) > > IMHO, it is a problem in mbcache or jbd. Why do you think it is a problem with mbcache or jbd? mbcache is only used for extended attributes, and jbd would only get invoked once the filesystem was getting modified. If there was corruption in the journal that was corrupting the filesystem when journal was recovered, the directory corruption would be permanent, and not something that would magically disappear when you mount the filesystem as ext2 read-only, and then remount as ext3. This smells like ext3 is getting garbage when first reading in the filesystem from the md device. And then when you report that you're also having mysterious md failures: > > Also, there is an other problem with that kernel. After creating > > the /dev node, loading md_mod and raid1, mdassemble start as usual > > the md device, but the mount command doesn't recognize the /dev > > node as valid and abort. I need to stop the md device that has > > been created earlier and reapply mdassemble or mdadm --assemble > > --scan from initramfs file system. Afterwards mount command is > > able to mount the file system but... like an ext2 file system. > > I'm on step 2 again. > > That smells like a problem with the md device, or the md setup code. I suspect if you were to arrange to include dumpe2fs or debugfs in your initrd, you would find that you are reading garbage from the md device, and dumpe2fs and debugfs would be reporting, "huh? there's no ext3 filesystem image here". I don't think this is a ext3 problem at all. - Ted
Could be an md problem. You seem to have presented the sequence out of order, or maybe I misunderstand you. - machine fails to boot and you get a shell Is the md/d0p2 assembled at this point?? - You assemble the md array, try to mount but that fails. - You stop the array, assemble it again and try to mount. This time the mount seems to work but you get the EXT3-fs error - You unmount, mount as ext2. That works - umount, mount as ext3. That works. Is the a correct sequence of events? What happens if you repeatedly try to mount (ext3) without stopping and re-assembling the array? What if you repeat ext3 without trying ext2 even after re-assembling the array? If consecutive mounts don't work, what happens if you echo 3 > /proc/sys/vm/drop_caches between the mount attempts? What is the configuration of the array (raid1? 2 drives? partitions or whole disk?)
(In reply to comment #2) > > That error means the ext3 filesystem detected an error when responding > to a readdir() request from user space. So it seems rather strange > that this came from a mount command. It looks like the mount > succeeded, and then something in userspace did the equivalent of "ls > /*" or "echo /*", and the first part of the directory was totally > garbaled. (It should have been inode=2, rec_len=12, name_len=1 for > the '.' entry in the root directory.) > > > > IMHO, it is a problem in mbcache or jbd. > > Why do you think it is a problem with mbcache or jbd? Also that was a simple deduction. Remember, the filesystem was able to mount as an ext2 fs. mbcache is only > used for extended attributes, and jbd would only get invoked once the > filesystem was getting modified. Anyway something has changed since 2.6.27.3. I will recheck the lot to be sure that only the kernel has changed. I mean that md, busybox, etc are still the same code. If there was corruption in the > journal that was corrupting the filesystem when journal was recovered, > the directory corruption would be permanent, and not something that > would magically disappear when you mount the filesystem as ext2 > read-only, and then remount as ext3. I agree but I'm speaking about facts. Mounting the filesystem as an ext3 (when it succeed) makes ls report some filesystem corruptions. Remounting it as ext2 and ls is able to read the directory again. I don't have any problems with any of the previous kernels. Moreover the issue affects both of my computers. > > This smells like ext3 is getting garbage when first reading in the > filesystem from the md device. And then when you report that you're > also having mysterious md failures: It could be some hazardous misalignments too. mdp is quite solid and I'm using it for a while. It is hard to believe it is just an mdp failure. The code is still the same. Only the kernel has evolved. Also if that is an mdp issue something has brought it alive. > > > > Also, there is an other problem with that kernel. After creating > > > the /dev node, loading md_mod and raid1, mdassemble start as usual > > > the md device, but the mount command doesn't recognize the /dev > > > node as valid and abort. I need to stop the md device that has > > > been created earlier and reapply mdassemble or mdadm --assemble > > > --scan from initramfs file system. Afterwards mount command is > > > able to mount the file system but... like an ext2 file system. > > > I'm on step 2 again. > > > > > That smells like a problem with the md device, or the md setup code. Or md_mod. > > I suspect if you were to arrange to include dumpe2fs or debugfs in > your initrd, you would find that you are reading garbage from the md > device, and dumpe2fs and debugfs would be reporting, "huh? there's no > ext3 filesystem image here". I don't think this is a ext3 problem at > all. > If you could give me any clues. Please, understand, I'm not in the habit to doing that. Also I will give a try with debugfs as dumpe2fs uses shared libs. > - Ted > Thanks for your help.
(In reply to comment #3) > Could be an md problem. > You seem to have presented the sequence out of order, or maybe I > misunderstand > you. > > - machine fails to boot and you get a shell kernel is still in initramfs when I get a shell (busybox ash). The shell and the kernel (everything expect fbcon/vesafb is a module) are minimal at that stage. Of course, the needed modules are loaded before init calls the mount (busybox) command. > Is the md/d0p2 assembled at this point?? Yes mdp is assembled. You get, cat /proc/mdstat Personalities : [raid1] md_d0 : active raid1 sda[0] sdb[1] 244198448 blocks super 1.0 [2/2] [UU] bitmap: 2/8 pages [8KB], 16384KB chunk unused devices: <none> > - You assemble the md array, try to mount but that fails. yes but sometimes the mount succeed and is reported corrupt when you try to access it (with ls for instance). > - You stop the array, assemble it again and try to mount. exactly. Also the init script is forcing the -t ext3 -o ro options. > This time the mount seems to work but you get the EXT3-fs error Yes > - You unmount, mount as ext2. That works yes > - umount, mount as ext3. That works. yes > > Is the a correct sequence of events? The above is the sequence. Yes. > > What happens if you repeatedly try to mount (ext3) without stopping > and re-assembling the array? No go it complains about a not found filesystem. I think it cannot do anything with the /dev node file that is created with mknod -m 0660 /dev/md/d0p2 b 254 2 before mdassemble is called . > What if you repeat ext3 without trying ext2 even after re-assembling the > array? Doesn't mount and in rare cases if it succeed it is corrupt when accessed. > If consecutive mounts don't work, what happens if you > echo 3 > /proc/sys/vm/drop_caches > between the mount attempts? I will try it right now. I'm sure that is the solution. > > What is the configuration of the array (raid1? 2 drives? partitions or whole > disk?) > one raid1 partition array with 2 drives. whole disk with 4 partitions.
I made a more accurate test. Everything is the same except the kernel and its modules. md_mod and raid1 are loaded. The init script creates 2 nodes in /dev/md/, d0 and d0p2 to help mdadm (mdassemble) to assemble the device (I won't rely on busybox v1.10.3 mdev hotplug because it creates character devices instead of block mpd devices) mknod -m 0660 /dev/md/d0 b 254 0 mknod -m 0660 /dev/md/d0p2 b 254 2 d0p2 is the partition that contains the ext3 root filesystem I want to mount. mdassemble is now called and assembles d0. It has recognized 4 partitions p1 p2 p3 p4. mdadm --misc --test /dev/md/d0 returns zero. that confirms that everything is okay. I added the line echo 3 > /proc/sys/vm/drop_caches The script calls mount -t ext3 -o ro /dev/md/d0 /new-root. Mount call fails. The script opens then a shell. Whatever is done, the mount command is not able to mount the partition. I need to stop and assemble the mdp device again. I manually assemble the mdp device. The first "mount -t ext3 -o ro /dev/md/d0 /new-root" call fails. The second succeeds and the partition finally becomes readable. In fact, I don't really need to force the -t ext2 option. I also have activated debugfs and set jbd-debug to 1. I get some messages in dmesg log about jbd errors like: JBD: no valid journal superblock found. EXT3-fs: error loading journal otherwise, I have the following (I just took some notes that I thought to be relevant): (fs/jbd/journal.c, 773): journal_init_inode: journal xxxx: inode md_d0p2/8 size 16777216, bits 12, blksize 4096 (fs/jbd/recovery.c, 242: journal_recover:No recovery required, last transaction 127173 (fs/jbd/journal.c, 947): journal_update_superblock:JBD:Skipping superblock update on recovered sb (start 0, seq 127174, errno 0) (fs/jbd/journal.c, 947): journal_update_superblock:JBD:Skipping superblock update on recovered sb (start 1, seq 127174, errno 0) EXT3-fs: mounted filesystem with ordered data mode kjournald starting. Commit interval 5 seconds
You write: > The script calls mount -t ext3 -o ro /dev/md/d0 /new-root. Mount call fails. That should be mount -t ext3 -o ro /dev/md/d0p2 /new-root i.e. with a 'p2'. Please confirm whether this was a typo or a genuine error in the script. Thanks.
(In reply to comment #7) > You write: > > > The script calls mount -t ext3 -o ro /dev/md/d0 /new-root. Mount call > fails. > > That should be mount -t ext3 -o ro /dev/md/d0p2 /new-root > i.e. with a 'p2'. > Please confirm whether this was a typo or a genuine error in the script. > Sorry that was a typo. mount -t ext3 -o ro /dev/md/d0p2 /new-root is the correct expression. Thanks to point it at.
Thanks. So it looks like when the filesystem reads from the md array, it sometimes gets the wrong data. This presumably only happens in very specific circumstances otherwise you would expect to get some sort of random error every time you mount. There isn't really enough information to figure out when the wrong data is read, or where is read from, so we can't find the culprit based on that sort of info. I suspect the best thing to do in this case is a 'git bisect'. i.e. test a number of different kernels until you find the patch that caused the problem. If you haven't done a git-bisect before, http://www.kernel.org/doc/local/git-quick.html has a section that might help.
(In reply to comment #9) > There isn't really enough information to figure out when the wrong data is > read, > or where is read from, so we can't find the culprit based on that sort of > info. > An easy way would be to tell me how and what kind of options I have to activate in the kernel when building it from source. There is a lot of hack that can be activate to help the kernel to be more verbose when mknod, md_mod, raid1, mdadm or mount are called. I could then transfer the log file on a memory stick. > I suspect the best thing to do in this case is a 'git bisect'. i.e. test a > number of different kernels until you find the patch that caused the problem. > If you haven't done a git-bisect before, > http://www.kernel.org/doc/local/git-quick.html > has a section that might help. > The stable git doesn't contain 2.6.28 obviously. It stops at 2.6.27.4 How can I do a bisect that way ? Where are the patches you are speaking about. Should I use the md.git for that ? Please give me more clue.
Use the main upstream git tree git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 Don't worry about the stable branch. Test "HEAD". That will probably git bad so git bisect start git bisect bad then git checkout v2.6.27 make # install, test. That will presumably work, so git bisect good then just repeat the make/install/test/"git bisect good-or-bad" sequence until "git bisect" tell you it has finished. Then report what "git bisect log" contains.
(In reply to comment #11) > Use the main upstream git tree > then just repeat the make/install/test/"git bisect good-or-bad" > sequence until "git bisect" tell you it has finished. Then > report what "git bisect log" contains. > Thanks. I don't have finished all the bisections right now. Also I get a main failure with fsck.ext{2,3,4}. I don't know if that is relevant but here is the result. I still have 2 revisions left to test.
Created attachment 18583 [details] dmesg log when fsck failed
Created attachment 18584 [details] bisect log related to the fsck failure
Test ended with a git bisect bad... # git bisect bad 4c46501d1659475dc6c89554af6ce7fe6ecf615c is first bad commit commit 4c46501d1659475dc6c89554af6ce7fe6ecf615c Author: Tejun Heo <tj@kernel.org> Date: Mon Aug 25 19:56:11 2008 +0900 block: move holder_dir from disk to part0 Move disk->holder_dir to part0->holder_dir. Kill now mostly superflous bdev_get_holder(). While at it, kill superflous kobject_get/put() around holder_dir, slave_dir and cmd_filter creation and collapse disk_sysfs_add_subdirs() into register_disk(). These serve no purpose but obfuscating the code. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> :040000 040000 ae974a3c7e77b3b16600cfe42193b95904c8e344 47c3228991a181de2ae541f1dd7411d718968dec M block :040000 040000 6e123c247eaf445189f7c55e4ff0222bc503c647 460ba5c33120745c6b80698112cbae2f283c602f M fs :040000 040000 bed39dfa38fe347b04501f560a7163954a8f4317 5e7a376b3c0682f1a35e2dff33f2b09297c3587a M include and bisect.log git bisect start # bad: [721d5dfe7e516954c501d5e9d0dfab379cf4241a] Merge branch 'for-linus' of git://neil.brown.name/md git bisect bad 721d5dfe7e516954c501d5e9d0dfab379cf4241a # good: [3fa8749e584b55f1180411ab1b51117190bac1e5] Linux 2.6.27 git bisect good 3fa8749e584b55f1180411ab1b51117190bac1e5 # bad: [dbc2bc251e06c83efcc8d39f1e7de12c2b1ff591] hwmon: (dme1737) Be less i2c-centric git bisect bad dbc2bc251e06c83efcc8d39f1e7de12c2b1ff591 # bad: [4dd9ec4946b4651a295d3bc8df9c15ac692a8f4e] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 git bisect bad 4dd9ec4946b4651a295d3bc8df9c15ac692a8f4e # good: [4dd565134ece7e5d528d4c5288879310c54419e9] Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 git bisect good 4dd565134ece7e5d528d4c5288879310c54419e9 # bad: [c54dcd8ec9f05c8951d1e622e90904aef95379f9] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 git bisect bad c54dcd8ec9f05c8951d1e622e90904aef95379f9 # good: [d403a6484f0341bf0624d17ece46f24f741b6a92] Merge phase #1 of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip git bisect good d403a6484f0341bf0624d17ece46f24f741b6a92 # bad: [ef5bef357cdf49f3a386c7102dbf3be5f7e5c913] Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 git bisect bad ef5bef357cdf49f3a386c7102dbf3be5f7e5c913 # bad: [11914a53d2ec2974a565311af327b8983d8c820d] block: Add interface to abort queued requests git bisect bad 11914a53d2ec2974a565311af327b8983d8c820d # good: [ed9e1982347b36573cd622ee5f4e2a7ccd79b3fd] block: implement and use {disk|part}_to_dev() git bisect good ed9e1982347b36573cd622ee5f4e2a7ccd79b3fd # bad: [152e283fdfea0cd11e297d982378b55937842dde] block: introduce struct rq_map_data to use reserved pages git bisect bad 152e283fdfea0cd11e297d982378b55937842dde # bad: [540eed5637b766bb1e881ef744c42617760b4815] block: make partition array dynamic git bisect bad 540eed5637b766bb1e881ef744c42617760b4815 # good: [b7db9956e57c8151b930d5e5fe5c766e6aad3ff7] block: move policy from disk to part0 git bisect good b7db9956e57c8151b930d5e5fe5c766e6aad3ff7 # bad: [0762b8bde9729f10f8e6249809660ff2ec3ad735] block: always set bdev->bd_part git bisect bad 0762b8bde9729f10f8e6249809660ff2ec3ad735 # bad: [4c46501d1659475dc6c89554af6ce7fe6ecf615c] block: move holder_dir from disk to part0 git bisect bad 4c46501d1659475dc6c89554af6ce7fe6ecf615c Also, it seems to be two different issues. The first affects the /dev/sdx device itself. The partition table is always valid when the md has a version 1.0 superblock and can be read with fdisk -l. There is one restriction: it is not msdos compatible (cylinders exceed 1024). When mount command has aborted, I still was not able to mount /dev/sda2 before mdadm has assembled the devices. That was a great feature that helped to restore a broken device. It has worked pretty well, especially when mdadm was unable to run/assemble the device for whatever reason. Whenever mount is unable to mount the md device, mount is also unable to mount the underlying sda2 device after or before mdadm has stopped or assembled the md device, respectively. The second issue affects the way how the md device is mounted. To mount the md device successfully, mount command needs to be called twice. As far as I understand, git bisect couldn't underline this because we have first to deal with the first issue. Please, tell me if I'm wrong.
I've been looking at this for a few days but still can't figure out what's going on. The bisected commit only changes sysfs related things and the tree looks exactly the same before & after, so I have no idea how it can cause such a problem. I'm trying to reproduce the problem here but haven't been successful yet. It could be that partitions are shifted somehow. I'll post when I know more. Thanks.
Can you please post .config from the failing kernel?
Hi Jimmy. The bug seems to be a bit indeterminate: The filesystem doesn't mount at first, but after some amount of 'random' trying different things, the mount works. It could be that sometimes the mount will work first time, even though the bug is still present. i.e. sometime you might have told 'git bisect' that a kernel was 'good' when in fact it wasn't. So the bad commit could be earlier. Could you please try again over a smaller range, and repeat the test on each kernel 3 times to make sure it works consistently? Maybe discard the last 4 'good' lines from the log file (leave the 'bad' ones there) and pass that to 'git bisect replay', and take it from there? Also, in response to the end of comment #15, I'm not sure I understand exactly what you are saying. I think you are saying that mounting /dev/sda2 used to work, but when the bug is present, it fails just the same as /dev/md/d0p2 fails. This simply suggests that the problem is below the md layer as it affects the sd device directly. You might also be saying that after a successful mount, a subsequent mount attempt will sometimes fail. This suggests that at each mount attempt it is essentially random whether it succeeds or not. That emphasises the fact that it would be good to test each kernel several times to make sure that every mount is successful, not just the first one. Thanks.
Created attachment 18698 [details] md-ext-devt-fix.patch Does this patch fix the problem?
(In reply to comment #19) > Does this patch fix the problem? > I will do the test without the patch again and see if it had improved anything. In fact, I didn't see the htree_dirblock_to_tree message any more during the new test. So, a part of the issue has gone but the device still doesn't mount correctly. I will try to reply to both comment #16 and comment #19 but please, take into account the steps below to figure out what goes wrong. The problem appends with 2.6.28 kernel tree and never with the 2.6.27 tree (the last one was 2.6.27.4 and it works). The test I describe here was done with the 2.6.28 rc3 git2 and is executed before init switched to the root filesystem /dev/md/d0p2. See the dmesg.txt in attachment. Also the key test is the fdisk interaction with the kernel (I use it only for the test and is not a part of the init script). The init is just a script. So I can interrupt it before mdadm/mdassemble is called. The steps. The script reads /proc/partitions and creates the /dev/sdxy with the right major and minor as it is described in /proc/partitions. As expected md device is still not represented in /proc/partitions. sda2 and is "equivalent" sdb2 correspond to the root device and can be mounted to rescue a corrupted file system quite easily. I can mount /dev/sda2 with the mount -t ext3 -o ro /dev/sda2 /new-root command without to need a second try. It proves the filesystem and the partition are at least sane. The file system is accessible but for obvious reason I won't write on it. It will corrupt the meta device /dev/md/d0p2 it depends on. So, to save dmesg.txt I have to mount /dev/md/d0p2. I dismount /dev/sda2, create the /dev/md/d0 node and assemble the raid device with mdadm. The md_mod and raid1 modules previously loaded sets md_d0 active with 2 out of 2 mirrors and it binds sda and sdb. dmesg reports that sda{1,2,3,4} and sdb{1,2,3,4} are removed and also avoid underlying disk partitions to be accessed. No more message and nothing about the meta partitions are displayed. Here is the problem. /proc/partitions reports sda, sdb and only ... md_d0. Only 254:0 is available. Nothing about the 254:{1,2,3,4} partitions it contains. I create manually the 254:2 node and named it /dev/md/d0p2. I try to mount /dev/md/d0p2 but the mount command fails with a "No such device or address" error. That is quite reasonable as neither /proc/partitions nor /sys/block/md_d0/ have reported such md_d0p{1,2,3,4} devices. That has nothing to do with the previous issue about the need to call the mount command twice. The patch has certainly corrected that bug. But it remains that I have to stop and reassemble the md device in order to mount its partitions. The node partitions are not initialized correctly or at least the method to use has changed between the 2.6.27 kernel and the 2.6.28 kernel. Anyway something goes wrong between the versions and should be corrected. To exonerate mdadm I found that fdisk is able to tell the kernel about the existence of the partitions. Every time fdisk (-l) /dev/md/d0 is called the four partitions are displayed on the console. Afterwards, /proc/partitions and /sys/block/md_d0/ (sorry I didn't checked the last one) will show the md_d0p{1,2,3,4} partitions as well. I just have to retry the mount command and see that /dev/md/d0p2 is now mounted. To resume, ~ # mdadm --assemble --scan md: md_d0 stopped. md: bind<sdb> md: bind<sda> raid1: raid set md_d0 active with 2 out of 2 mirrors md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits created bitmap (8 pages) for device md_d0 mdadm: /dev/md/d0 has been started with 2 drives. ~ # mount -t ext3 /dev/md/d0p2 /new-root mount: mounting /dev/md/d0p2 on /new-root failed: No such device or address ~ # fdisk -l /dev/md/d0 md_d0: p1 p2 p3 p4 Disk /dev/md/d0: 250.0 GB, 250059210752 bytes 2 heads, 4 sectors/track, 61049612 cylinders Units = cylinders of 8 * 512 = 4096 bytes Device Boot Start End Blocks Id System /dev/md/d0p1 * 1 17921 71682 83 Linux /dev/md/d0p2 17922 148994 524292 83 Linux /dev/md/d0p3 148995 542211 1572868 82 Linux swap ~ # cat /proc/partitions major minor #blocks name 8 0 244198584 sda 8 16 244198584 sdb 254 0 244198448 md_d0 254 1 71682 md_d0p1 254 2 524292 md_d0p2 254 3 1572868 md_d0p3 ~ # mount -t ext3 /dev/md/d0p2 /new-root kjournald starting. Commit interval 5 seconds EXT3 FS on md_d0p2, internal journal EXT3-fs: mounted filesystem with ordered data mode. Please, tell me if it is still necessary to proceed to the bisection again. If it is required, is it necessary to have all the modules activated in the kernel or just a limited number of them directly related to the issue to ease the process ? IMHO, it is not an indeterminate bug. It appends every time I'm using a 2.6.28 kernel and never with one of the 2.6.27.x stable branch. Of course, you need to activate some md devices at boot time not just access disks directly. In that case you will miss the issue. Yes, it affects /dev/sda2 as well but only after mdadm -S has been called. As I see, that has brought some confusions even for me. /proc/partitions does not report the sdx{1,2,3,4} partitions. I missed that test the first time. Testing in a tiny rescue environment even with useful tools is quite uneasy, especially if you don't want to break things. Also, I understand well, if a new bisect is needed I will perform more than one test. Also I used to say I never get this bug before. It is not correct. I cannot remember the release perhaps 2.6.24 or later but it was fast immediately corrected so I didn't pay more attention to it and I jumped the faulty release without more consideration. It was quite similar anyway. Thank you for all the work you put in it.
Created attachment 18709 [details] dmesg.txt from 2.6.28 rc3 git2
Created attachment 18710 [details] .config from 2.6.28 rc3 git2
Created attachment 18711 [details] .config used during bisect
Please, read "2.6.24 or older" instead of "2.6.24 or later" in the third sentence of the last paragraph of comment #20.
Created attachment 18713 [details] fix-__blkdev_get.patch Please apply the attached along with md-ext-devt-fix.patch and test. Thanks.
(In reply to comment #25) With these patches the md device /dev/md/d0p2 can be mounted. Great job, thank you. There is still a problem that could be cumbersome especially if one needs to access the underlying physical partition devices. After mdadm has stopped and released the md devices, the physical partition disks won't be accessible. It is not possible to mount them. I still get the message when I run the command: mount -t ext3 -o ro /dev/sda2 /new-root, mount: mounting /dev/sda2 on /new-root failed: No such device or address. Also /proc/partitions won't show up the partitions of the physical devices after md_mod and raid1 modules have been unloaded.
This is a by product of how mdadm works. It removes the partitions from the device when the device is included in a raid to avoid confusion. If you really want the partitions of the devices back (which only makes any sense for raid1, and even then should only be done with care) you can get them back with e.g. blockdev --rereadpt /dev/sda This will cause the partitions of sda to re-appear.
Hmmm... now the question is why md successfully assembled a faulty device instead of simply failing. Can you please try fix-__blkdev_get.patch only (w/o md-ext-devt-fix.patch) and see what happens?
Handled-By : Tejun Heo <tj@kernel.org> Handled-By : Neil Brown <neilb@suse.de>
First-Bad-Commit : 4c46501d1659475dc6c89554af6ce7fe6ecf615c
(In reply to comment #28) > Hmmm... now the question is why md successfully assembled a faulty device > instead of simply failing. Can you please try fix-__blkdev_get.patch only > (w/o > md-ext-devt-fix.patch) and see what happens? > @ tj and neilb, I made the test with the 2.6.28 rc3 git5 kernel revision because it already contains the patch fix-__blkdev_get.patch. It works great. The md device is able to mount. Sorry, I couldn't have access to the same computer where I have done the tests the first time. But this other one had the same issue. If you really need that i test on the test computer used before, I would be glad to proceed asap. Also, the devices were never faulty. I cannot explain why mount had complained about the bad entry in directory. I don't have that message any longer. In the above post, I tried to underline the fact that the sdx partitions are not automatically visibles in /proc/partitions after mdadm has stopped the md device they depend on. Yet, they were visible before calling mdadm. Also, if the kernel is able to detect the partitions table on the sdx devices when it is booting, why does it ignore them and why does it need blockdev --rereadpt to reveal them, after the disks have been freed by mdadm ? Could we avoid to use blockdev in the initramfs file system because it makes the whole just bigger. Avoiding the use of blockdev makes sense after all when you need to rescue or change the disk logic on the fly or when we cannot boot from a livecd for example. Thank you
mdadm explicitly tells the kernel to forget about the sdxN partition devices when it creates an md array over sdx. This is because in most case the partitions that appear to be on the sdxN are not meaningful at all. e.g. If you create a raid5 over 5 devices sda..sde, then create a partition table, that table will appear at the start of sda as well as the start of the array. The partitions will quite possible be bigger than the sda device itself, so they will be meaningless. This can cause confusion, so mdadm explicitly tells the kernel to forget about those partitions. raid1 is clearly an exception here. The partitions do still mean something. However it is generally not a good idea to mount just one half of a raid1 as it cause cause the two half to be out of sync. Obviously some times it is useful to do that, and those times blockdev is your friend. The initramfs should never need to care about this. Either it assembles the array and uses the partitions of the array, or it doesn't and the sdx devices still have their own partitions which can be used. mdadm could possibly do the equivalent of "blockdev -rereadpt /dev/sdx" when stopping an array container sdx, but I'm not convinced that it is a good idea.
Jimmy, yes, we still need to track down why md presented corrupt data instead of simply failing. fix-__blkdev_get fixed the problem but that patch only fixes missing partition rescan. ie. before the patch, even if the device is invalidated or hasn't been scanned yet, trying to opening partitions (instead of the whole device) won't trigger partition rescan and thus result in open failure or accessing the old partition information if it has changed underneath the block layer (partition manipulation tools issue explicit rescan so they're in the clear but e.g. media change for certain drivers can cause this). Neil, can you think of something which can cause md return garbage data with a bug like the above? The worst I can think of is failing to find component device and even that shouldn't happen even with the bug as, with sd, the partition is always scanned during probe. Jimmy, can you please try to reproduce the problem with minimal setup? Thanks.
(In reply to comment #33) > Jimmy, can you please try to reproduce the problem with minimal setup? Also, i tried a minimal setup since and ... i'm now fighting with an old known problem as described in http://lkml.indiana.edu/hypermail/linux/kernel/0709.1/1645.html. If the .config could be of any help i will send it to you. For now, i will continue with my long current .config file. I have done a git bisect replay as suggested. By the way, how do i introduce the patch without compromising the git bisect ? As one of the mdp partitions contains the root filesystem, mdadm -S /dev/md/d0 will fail whenever i shutdown the computer. The root partition is then mounted read only. Could that failure be a problem ? Is it possible to avoid it ?
I discarded the last 'good' lines from the log file and replayed all the compilations and reboots. Now d0p2 was able to mount successfully. I perhaps misunderstood how git works. Also, I had to delete the old builds on the home server to make more room. The result is, # git bisect good 4c46501d1659475dc6c89554af6ce7fe6ecf615c is first bad commit commit 4c46501d1659475dc6c89554af6ce7fe6ecf615c Author: Tejun Heo <tj@kernel.org> Date: Mon Aug 25 19:56:11 2008 +0900 block: move holder_dir from disk to part0 Move disk->holder_dir to part0->holder_dir. Kill now mostly superflous bdev_get_holder(). While at it, kill superflous kobject_get/put() around holder_dir, slave_dir and cmd_filter creation and collapse disk_sysfs_add_subdirs() into register_disk(). These serve no purpose but obfuscating the code. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> :040000 040000 ae974a3c7e77b3b16600cfe42193b95904c8e344 47c3228991a181de2ae541f1dd7411d718968dec M block :040000 040000 6e123c247eaf445189f7c55e4ff0222bc503c647 460ba5c33120745c6b80698112cbae2f283c602f M fs :040000 040000 bed39dfa38fe347b04501f560a7163954a8f4317 5e7a376b3c0682f1a35e2dff33f2b09297c3587a M include and # git bisect log git bisect start # bad: [721d5dfe7e516954c501d5e9d0dfab379cf4241a] Merge branch 'for-linus' of git://neil.brown.name/md git bisect bad 721d5dfe7e516954c501d5e9d0dfab379cf4241a # good: [3fa8749e584b55f1180411ab1b51117190bac1e5] Linux 2.6.27 git bisect good 3fa8749e584b55f1180411ab1b51117190bac1e5 # bad: [dbc2bc251e06c83efcc8d39f1e7de12c2b1ff591] hwmon: (dme1737) Be less i2c-centric git bisect bad dbc2bc251e06c83efcc8d39f1e7de12c2b1ff591 # bad: [4dd9ec4946b4651a295d3bc8df9c15ac692a8f4e] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 git bisect bad 4dd9ec4946b4651a295d3bc8df9c15ac692a8f4e # bad: [c54dcd8ec9f05c8951d1e622e90904aef95379f9] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 git bisect bad c54dcd8ec9f05c8951d1e622e90904aef95379f9 # bad: [ef5bef357cdf49f3a386c7102dbf3be5f7e5c913] Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 git bisect bad ef5bef357cdf49f3a386c7102dbf3be5f7e5c913 # bad: [11914a53d2ec2974a565311af327b8983d8c820d] block: Add interface to abort queued requests git bisect bad 11914a53d2ec2974a565311af327b8983d8c820d # bad: [152e283fdfea0cd11e297d982378b55937842dde] block: introduce struct rq_map_data to use reserved pages git bisect bad 152e283fdfea0cd11e297d982378b55937842dde # bad: [540eed5637b766bb1e881ef744c42617760b4815] block: make partition array dynamic git bisect bad 540eed5637b766bb1e881ef744c42617760b4815 # bad: [0762b8bde9729f10f8e6249809660ff2ec3ad735] block: always set bdev->bd_part git bisect bad 0762b8bde9729f10f8e6249809660ff2ec3ad735 # bad: [4c46501d1659475dc6c89554af6ce7fe6ecf615c] block: move holder_dir from disk to part0 git bisect bad 4c46501d1659475dc6c89554af6ce7fe6ecf615c # good: [5b99c2ffa980528a197f26c7d876cceeccce8dd5] block: make bi_phys_segments an unsigned int instead of short git bisect good 5b99c2ffa980528a197f26c7d876cceeccce8dd5 # good: [e71bf0d0ee89e51b92776391c5634938236977d5] block: fix disk->part[] dereferencing race git bisect good e71bf0d0ee89e51b92776391c5634938236977d5 # good: [ed9e1982347b36573cd622ee5f4e2a7ccd79b3fd] block: implement and use {disk|part}_to_dev() git bisect good ed9e1982347b36573cd622ee5f4e2a7ccd79b3fd # good: [548b10eb2959c96cef6fc29fc96e0931eeb53bc5] block: move __dev from disk to part0 git bisect good 548b10eb2959c96cef6fc29fc96e0931eeb53bc5 # good: [b7db9956e57c8151b930d5e5fe5c766e6aad3ff7] block: move policy from disk to part0 git bisect good b7db9956e57c8151b930d5e5fe5c766e6aad3ff7 It looks like the above test has not be a success. Also, I tried to begin a new fresh one with a small config file but I hit a blocking qc timeout when sata (nforce4) was probed (I built the kernel out-of-tree, so the module path name did not change every time). If I remember well only mdadm is able to write on the md superbloc. mdassemble doesn't. It only assembles the devices. The real problem encountered was probably due to the missing rescan that has now be corrected with your patch fix-__blkdev_get. In fact, before I opened the bug report, I tried fdisk and mdadm many times on the devices. This could lead to that htree_dirblock_to_tree error whenever accessing to the file system has been possible. That issue could be just a false positive.
Thanks Jimmy. Yes, that is the exact commit which is fixed by fix-__blkdev_get.patch. I still wanna know what's going on with the faulty assembly. Neil, can you please help here? What should we be looking at?
We seem to be tracking two separate issues here. I'm closing this one to avoid confusion in the regressions report. Care to open a new bug entry for the "faulty assembly" issue?
I don't think there is any evidence of "faulty assembly". I think the array is assembled correctly, but the partition table is confused, so when partitions are accessed we get the wrong data. Is there something I have missed which is not explained by the above?
With the above __blkdev_get(), the problem the only problem that should occur is __blkdev_get() fails for existing partitions and it won't happen for /dev/sd* as they force scan during attach but md* does seem to trigger the problem as it only sets the invalidated flag. At any rate, the only thing which can occur is either md not being able to find the component partition and failing assembly or partitions for /dev/mdp* not being accessible due to open failure. However, Jimmy saw garbage data coming from md after successful assembly which is quite worrying as __blkdev_get() can fail from different reasons.