Bug 21392
| Summary: | Incorrect assembly of raid partitions on boot | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Chad Farmer (chadfarmer) |
| Component: | MD | Assignee: | io_md |
| Status: | RESOLVED OBSOLETE | | |
| Severity: | normal | CC: | alan, neilb |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | All, 2.6.36 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | Patch to mainline 2.6.36 for md raid assembly problem | | |
Description
Chad Farmer
2010-10-29 00:00:01 UTC
Created attachment 37212 [details]
Patch to mainline 2.6.36 for md raid assembly problem
I don't have a system that will run the 2.6.36 kernel (some problem in ia64 early init that is unrelated to md assembly). This patch builds on 2.6.36. I have tested a similar patch on a RedHat 5.3 kernel.
This patch resolves the problem by undoing the bind_rdev_to_array of an obsolete partition with a conflicting desc_nr value. It is not elegant, but I was minimizing the scope of change.
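In outline, the idea is roughly the following (a hypothetical sketch against the 2.6.36-era md types, not the attached patch; sb_event_count() is an illustrative stand-in for reading the event counter out of the candidate's on-disk superblock):

```c
/*
 * Hypothetical sketch (not the attached patch): when a candidate rdev
 * collides on desc_nr with an rdev already bound to the array, keep
 * whichever copy has the higher superblock event count and drop the other.
 * sb_event_count() is an illustrative helper, not a real md.c function.
 */
static int bind_preferring_freshest(mddev_t *mddev, mdk_rdev_t *rdev)
{
	mdk_rdev_t *existing;
	int err;

	err = bind_rdev_to_array(rdev, mddev);
	if (err != -EBUSY)
		return err;

	/* Same desc_nr already bound: decide which copy is current. */
	existing = find_rdev_nr(mddev, rdev->desc_nr);
	if (existing && sb_event_count(rdev) > sb_event_count(existing)) {
		/* The already-bound copy is the stale, previously-failed
		 * member: undo its binding and bind the fresher one. */
		kick_rdev_from_array(existing);
		err = bind_rdev_to_array(rdev, mddev);
	}
	return err;
}
```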
I am probably ignorant of all kinds of kernel.org protocol, but I believe that someone will be interested in a day-one bug that prevents correct assembly of a raid group that has spare devices. With this problem, the raid is running correctly after having recovered from a failure, but on the next boot the raid is not assembled the way it was before booting.

I should add that md is built in, so the code in md.c framed by "#ifndef MODULE" is active.
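For context, the boot-time autodetect path only exists when md is built in; a simplified paraphrase of that framing in drivers/md/md.c (function and list names are real, bodies are omitted) is:

```c
/* Simplified paraphrase of the built-in-only autodetect plumbing in
 * drivers/md/md.c; bodies are omitted. */
#ifndef MODULE
static LIST_HEAD(all_detected_devices);

/* Called by the partition-scanning code for "Linux raid autodetect"
 * (type 0xfd) partitions found during boot. */
void md_autodetect_dev(dev_t dev)
{
	/* queue dev on all_detected_devices */
}

/* Invoked at boot once block devices are available: loads the superblock
 * of each detected device, moves it onto pending_raid_disks, and then
 * calls autorun_devices() to assemble arrays from that list. */
static void autostart_arrays(int part)
{
	/* ... */
}
#endif /* !MODULE */
```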
Perhaps a simplified statement of the problem will attract some interest. When using raid1 spare devices, after a failure, two partitions for a raid1 have the same assignment (e.g. [0] for primary) in the partition superblock. The two partitions are the failed [0] and the current primary [0]. The error is that when auto-assembling the raid, this conflict is resolved in favor of the first [0] found, not the freshest (current) [0]. So the current [0] is not assembled, and the failed [0] is then correctly removed as "not fresh". The net effect is that a raid1 that was working correctly after a failure,

md0: sda1 (failed and removed), sdb1[1], sdc1[0]

is degraded on the next boot:

md0: sdb1[1]

Note that the same problem applies to secondary devices [1], so a system with two spares (one initially, one added later) could wind up booting on the wrong (stale) device:

md0: sdc1[2](S) sdb1[1] sda1[0]

Fail and remove sda1:

md0: sdc1[0] sdb1[1]

Add a new spare sdd1:

md0: sdd1[2](S) sdc1[0] sdb1[1]

Fail and remove sdb1:

md0: sdd1[1] sdc1[0]

Reboot, and the raid will assemble with the first [0] and [1] found:

md0: sda1[0] sdb1[1]

Because sda1 was failed before sdb1, sda1 is correctly removed as non-fresh before md0 is used. So what is used is:

md0: sdb1[1]

Partition sdb1 is stale! Only sdc1 and sdd1 have the current data.

I have not experimented with raid levels other than raid1, but since md assembles partitions for all raid levels, I suspect this problem will occur with other raid levels, if it is possible to have two partitions (one failed) with the same desc_nr value in the superblock.

If this is still present on modern kernels please see Documentation/SubmittingPatches.

Looking at linux-stable.git as of December 29, 2013, the same code still exists. In drivers/md/md.c, autorun_devices builds a candidates list. It then walks the list and calls bind_rdev_to_array. If desc_nr is assigned, bind_rdev_to_array calls find_rdev_nr and returns -EBUSY if the number is already in use. A non-zero return from bind_rdev_to_array causes autorun_devices to call export_rdev, which prevents the "duplicate" desc_nr from being included in the array.

The problem was that the candidates list was not sorted by date or screened. A "removed" (obsolete) partition from that array could be in the candidates list before the actual current partition with that desc_nr. So there could be more than one "primary", "secondary", or "spare" for the array. In that case, the current partition for that desc_nr (primary, secondary, or spare) would not be included in the array. The partition that was incorrectly included would later be rejected because it had older timestamps, but that would not get the correct partition back into the array. The use case required having a readable partition present that had previously been in the array (and had the correct identification except for the timestamp). I believe there were scenarios where an obsolete partition would become the only partition in the array, causing a loss of data.

I don't have time to build and test 3.13-rc6 to see if it has the problem. The bad code is still there, but there might be new code elsewhere that somehow covers this up.
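In condensed form, the code path described above looks roughly like this (paraphrased from that era's drivers/md/md.c; locking, refcounting, and most error handling are omitted):

```c
/*
 * Condensed paraphrase of autorun_devices()/bind_rdev_to_array(). The point
 * is that candidates are bound in discovery order, so the first device
 * claiming a given desc_nr wins, even if it is the stale copy.
 */
static void autorun_devices_sketch(int part)
{
	while (!list_empty(&pending_raid_disks)) {
		mdk_rdev_t *rdev0, *rdev, *tmp;
		LIST_HEAD(candidates);
		mddev_t *mddev;
		dev_t dev;

		/* Take the first pending device and collect every other pending
		 * device whose 0.90 superblock says it belongs to the same array.
		 * Nothing here orders candidates by superblock event count. */
		rdev0 = list_entry(pending_raid_disks.next, mdk_rdev_t, same_set);
		rdev_for_each_list(rdev, tmp, &pending_raid_disks)
			if (super_90_load(rdev, rdev0, 0) >= 0)
				list_move(&rdev->same_set, &candidates);

		dev = MKDEV(MD_MAJOR, rdev0->preferred_minor); /* non-partitioned case */
		mddev = mddev_find(dev);

		rdev_for_each_list(rdev, tmp, &candidates) {
			list_del_init(&rdev->same_set);
			/* bind_rdev_to_array() uses find_rdev_nr() and returns
			 * -EBUSY when an earlier candidate already took this
			 * desc_nr; the loser is simply exported (dropped), even
			 * if it is the fresher copy for that slot. */
			if (bind_rdev_to_array(rdev, mddev))
				export_rdev(rdev);
		}
		autorun_array(mddev);
	}
}
```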
Unfortunately, when I had time, no kernel developers were interested. Even now, it is not the idea of fixing a problem that seems to be driving this -- just the desire to forget about it. It's from someone we don't know, so mark it as obsolete.

Actually it's nothing to do with who anyone knows. Bugzilla is just used as a place to keep bugs, not to work on them, nor do we accept patches via Bugzilla (see Documentation/SubmittingPatches). It was simply a case of "if < 3.0 then close" because we had hundreds and hundreds of obsolete bugs relating to code that has changed vastly since the time the bug was filed.

I try to discourage the use of "autorun_devices". There are other cases it doesn't handle well. I would much prefer that arrays were always assembled by mdadm. I might still apply your patch though, as it is fairly straightforward. I'll leave it open and see what I think in a day or so :-)

Thanks.