Bug 21392 - Incorrect assembly of raid partitions on boot
Summary: Incorrect assembly of raid partitions on boot
Status: RESOLVED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: All Linux
Importance: P1 normal
Assignee: io_md
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-10-29 00:00 UTC by Chad Farmer
Modified: 2014-01-05 23:59 UTC
CC List: 2 users

See Also:
Kernel Version: All, 2.6.36
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Patch to mainline 2.6.36 for md raid assembly problem (1.56 KB, application/octet-stream)
2010-11-13 01:07 UTC, Chad Farmer

Description Chad Farmer 2010-10-29 00:00:01 UTC
The problem is that autorun_devices in md.c builds a candidates list of partitions and calls bind_rdev_to_array in the order the partitions were found, without regard for the state of the partition.  Function bind_rdev_to_array requires a unique mdk_rdev_t desc_nr value, so when partitions exist with the same desc_nr in their superblock (sb->this_disk.number), duplicates are rejected.  The rejected duplicate may be the current device that is needed to assemble the array.

The following test scenario demonstrates this problem.

Create a raid1 group across three drives: sda1 primary, sdb1 secondary, sdc1 spare.  For simplicity I did not use LVM.  Use the "mdadm --fail" command on sda1.  After sdb1 resyncs with sdc1, reboot.  After booting, the raid is running with a single partition, sdb1.  The following messages report the problem (in this case the system also has an sdd1, but it is not current).

Oct 27 10:04:58 hms1 kernel: [   28.570081] md: considering sdd1 ...
Oct 27 10:04:59 hms1 kernel: [   28.573747] md:  adding sdd1 ...
Oct 27 10:04:59 hms1 kernel: [   28.577065] md:  adding sdc1 ...
Oct 27 10:04:59 hms1 kernel: [   28.580384] md:  adding sdb1 ...
Oct 27 10:04:59 hms1 kernel: [   28.583706] md:  adding sda1 ...
All four partitions were put into the candidates list.
Oct 27 10:04:59 hms1 kernel: [   28.587058] md: created md0
Oct 27 10:04:59 hms1 kernel: [   28.589942] md: bind<sda1>
The failed sda1 is desc_nr 0.
Oct 27 10:04:59 hms1 kernel: [   28.592744] md: bind<sdb1>
The current secondary sdb1 is desc_nr 1.
Oct 27 10:04:59 hms1 kernel: [   28.595547] md: export_rdev(sdc1)
The current primary sdc1 is rejected due to duplicate desc_nr 0.
Oct 27 10:04:59 hms1 kernel: [   28.598953] md: export_rdev(sdd1)
Oct 27 10:04:59 hms1 kernel: [   28.602359] md: running: <sdb1><sda1>
Oct 27 10:04:59 hms1 kernel: [   28.606205] md: kicking non-fresh sda1 from array! events 24, 10
This is correct; sda1 really is not fresh.
Oct 27 10:04:59 hms1 kernel: [   28.612304] md: unbind<sda1>
Oct 27 10:04:59 hms1 kernel: [   28.619049] md: export_rdev(sda1)
Oct 27 10:04:59 hms1 kernel: [   28.623113] raid1: raid set md0 active with 1 out of 2 mirrors
This is wrong.  The current primary partition, sdc1, is present and operational, but was not picked up.
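
To make the failure mode concrete, here is a minimal userspace model (not the kernel code, just an illustration) of the bind order shown in the log: candidates are bound in the order they were found, and a later candidate whose desc_nr is already taken is rejected even when it is the fresher one.  The event counts are taken from the "events 24, 10" message above; everything else is simplified.

#include <stdio.h>

struct cand {
    const char *name;
    int desc_nr;          /* sb->this_disk.number                     */
    unsigned int events;  /* superblock event count: higher is newer  */
};

int main(void)
{
    /* Candidates in the order they are bound in the log above. */
    struct cand cands[] = {
        { "sda1", 0, 10 },  /* stale primary, failed before reboot     */
        { "sdb1", 1, 24 },  /* current secondary                       */
        { "sdc1", 0, 24 },  /* current primary, rebuilt from the spare */
    };
    int bound[2] = { -1, -1 };  /* slot -> index into cands[], or -1   */

    for (int i = 0; i < 3; i++) {
        int nr = cands[i].desc_nr;

        if (bound[nr] != -1) {
            /* Models bind_rdev_to_array() returning -EBUSY and
             * autorun_devices() calling export_rdev() on the loser.  */
            printf("reject %s: desc_nr %d already taken by %s\n",
                   cands[i].name, nr, cands[bound[nr]].name);
            continue;
        }
        bound[nr] = i;
        printf("bind %s as desc_nr %d (events %u)\n",
               cands[i].name, nr, cands[i].events);
    }
    return 0;
}

Run as written, this binds sda1 and sdb1 and rejects sdc1, matching the log: the stale slot-0 partition wins simply because it was bound first.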

I confess that I found this on an older RedHat 5.3 kernel, but the 2.6.36 md.c module has the same code.  If I've incorrectly analyzed this, please enlighten me.

I've seen this situation on a production system where an unrecovered I/O error caused sda1 to be failed (and the device disabled in Linux).  The recovery was correct.  On the next boot (done with a power cycle), the sda disk was again operational, the superblock readable, and the raid was incorrectly assembled.
Comment 1 Chad Farmer 2010-11-03 01:13:30 UTC
I should add that md is built-in, so the code in md.c framed by "#ifndef MODULE" is active.
Comment 2 Chad Farmer 2010-11-13 01:07:31 UTC
Created attachment 37212 [details]
Patch to mainline 2.6.36 for md raid assembly problem

I don't have a system that will run the 2.6.36 kernel (some problem in ia64 early init that is unrelated to md assembly).  This patch builds on 2.6.36.  I have tested a similar patch on a RedHat 5.3 kernel.

This patch resolves the problem by undoing the bind_rdev_to_array of an obsolete partition with a conflicting desc_nr value.  It is not elegant, but I was minimizing the scope of change.
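
The attachment itself is not reproduced here.  As a rough sketch of the approach described (undo the earlier bind when a fresher candidate turns up with the same desc_nr), continuing the userspace model from the description above; the helper name and structure are illustrative assumptions, not the patch:

/* Sketch only -- not the attached patch.  Reuses struct cand and the
 * bound[] table from the model in the description above.  On a desc_nr
 * conflict, keep whichever superblock has the higher event count and
 * drop the stale one.                                                  */
static void bind_or_replace(struct cand *cands, int *bound, int i)
{
    int nr = cands[i].desc_nr;

    if (bound[nr] != -1) {
        if (cands[i].events <= cands[bound[nr]].events) {
            /* Existing holder is at least as fresh: reject the newcomer,
             * as the current code does with -EBUSY.                     */
            printf("reject %s: desc_nr %d held by fresher %s\n",
                   cands[i].name, nr, cands[bound[nr]].name);
            return;
        }
        /* Newcomer is fresher: undo the earlier bind of the obsolete
         * partition, which is the idea the comment above describes.     */
        printf("unbind stale %s from desc_nr %d\n",
               cands[bound[nr]].name, nr);
    }
    bound[nr] = i;
    printf("bind %s as desc_nr %d (events %u)\n",
           cands[i].name, nr, cands[i].events);
}

Applied to the three candidates from the log, this binds sdb1 and sdc1, which is the array that was running before the reboot.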

I am probably ignorant of all kinds of kernel.org protocol, but I believe that someone will be interested in a day-one bug that prevents correct assembly of a raid group that has spare devices.  With this problem, the raid is correctly running after having recovered from a failure, but on the next boot the raid is not assembled the way it was before booting.
Comment 3 Chad Farmer 2010-11-19 16:58:33 UTC
Perhaps a simplified statement of the problem will attract some interest.

When raid1 spare devices are in use, after a failure two partitions of the same raid1 carry the same slot assignment (e.g. [0] for primary) in their superblocks.  The two partitions are the failed [0] and the current primary [0].

The error is that when auto-assembling the raid, this conflict is resolved in favor of the first [0] found, not the freshest (current) [0].  So the current [0] is not assembled, and the failed [0] is then correctly removed as "non-fresh".

The net effect is that a raid1 that was working correctly after a failure,

md0: sda1 (failed and removed), sdb1[1], sdc1[0]

is degraded on the next boot:

md0: sdb1[1]

Note that the same problem applies to secondary devices [1].  So a system that uses two spares could wind up booting on the wrong (stale) device.

md0: sdc1[2](S) sdb1[1] sda1[0]

Fail and remove sda1.

md0: sdc1[0] sdb1[1]

Add a new spare sdd1.

md0: sdd1[2](S) sdc1[0] sdb1[1]

Fail and remove sdb1.

md0: sdd1[1] sdc1[0]

Reboot and the raid will assemble with the first [0] and [1] found.

md0: sda1[0] sdb1[1]

Because sda1 was failed before sdb1, sda1 is correctly removed as non-fresh before the md0 array is used.  So what is used is:

md0: sdb1[1]

Partition sdb1 is stale! Only sdc1 and sdd1 have the current data.

I have not experimented with raid levels other than raid1, but since md assembles partitions for all raid levels, I suspect this problem will occur with other raid levels, if it is possible to have two partitions (one failed) with the same desc_nr value in the superblock.
Comment 4 Alan 2013-12-10 22:27:07 UTC
If this is still present on modern kernels please see

Documentation/SubmittingPatches
Comment 5 Chad Farmer 2014-01-03 23:33:54 UTC
Looking at linux-stable.git as of December 29, 2013, the same code still exists.

In drivers/md/md.c, autorun_devices builds a candidates list.  It then walks the list and calls bind_rdev_to_array.  If desc_nr is already assigned, bind_rdev_to_array calls find_rdev_nr and returns -EBUSY if the number is already in use.  A non-zero return from bind_rdev_to_array causes autorun_devices to call export_rdev, which prevents the "duplicate" desc_nr from being included in the array.

The problem was that the candidates list was not sorted by date or otherwise screened.  A "removed" (obsolete) partition from that array could appear in the candidates list before the actual current partition with that desc_nr.  So there could be more than one "primary", "secondary", or "spare" for the array.  In that case, the partition currently holding that desc_nr (primary, secondary, spare) would not be included in the array.  The partition that was incorrectly included would later be rejected because it had older timestamps, but that would not get the correct partition back into the array.
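
For comparison, here is what "screening" the candidates list could look like in the same userspace model used earlier in this report: sort by superblock event count, newest first, so that the duplicate-desc_nr rejection always discards the stale copy.  This is only an illustration of the sentence above, not a proposed kernel change.

#include <stdlib.h>

/* Same shape as the model earlier in this report. */
struct cand { const char *name; int desc_nr; unsigned int events; };

/* Order candidates newest-first by superblock event count, so the first
 * partition bound for each desc_nr is the freshest one and any later
 * duplicate that gets rejected is the obsolete copy.                   */
static int by_events_desc(const void *a, const void *b)
{
    const struct cand *x = a, *y = b;

    if (x->events != y->events)
        return (x->events < y->events) ? 1 : -1;
    return 0;
}

/* Usage: qsort(cands, ncands, sizeof(cands[0]), by_events_desc);
 * then bind in the sorted order, exactly as the unsorted walk does now. */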

The use case required having a readable partition present that previously had been in the array (and had the correct identification except for the timestamp).  I believe there were scenarios where an obsolete partition would become the only partition in the array, causing a loss of data.

I don't have time to build and test 3.13-rc6 to see if it has the problem.
The bad code is still there, but there might be new code elsewhere that
somehow covers this up.

Unfortunately, when I had time, no kernel developers were interested.  Even
now, it is not the idea of fixing a problem that seems to be driving this --
just the desire to forget about it.  It's from someone we don't know, so mark it as obsolete.
Comment 6 Alan 2014-01-04 00:12:44 UTC
Actually it's nothing to do with who anyone knows. Bugzilla is just used as a place to keep bugs, not to work on them, nor do we accept patches via Bugzilla (see Documentation/SubmittingPatches).

It was simply a case of 'if < 3.0 then close'

because we had hundreds and hundreds of obsolete bugs relating to code that has changed vastly since the time the bug was filed.
Comment 7 Neil Brown 2014-01-05 23:59:08 UTC
I try to discourage the use of "autorun_devices".  There are other cases it doesn't handle well.
I would much prefer that arrays were always assembled by mdadm.

I might still apply your patch though, as it is fairly straightforward.  I'll leave it open and see what I think in a day or so :-)

Thanks.
