See the following for downstream / list reports: http://bugs.gentoo.org/show_bug.cgi?id=273902 http://www.nabble.com/%22array-md0-already-has-disks%22-with-kernel-2.6.29-td23131323.html Something broke between 2.6.28 and 2.6.29 which is causing systems using evms to fail to boot. A bunch of messages like this get spewed: "md: array md* already has disks!"
Is it possible get get the output of cat /proc/mdstat before an after running evms_activate, and also to run evms_active under 'strace' and get the trace from that. I have no idea how evms works so I would need as much tracing as possible to be able to figure out what is going wrong.
Created attachment 23743 [details] /proc/mdstat (before evms_activate)
Created attachment 23744 [details] strace evms_activate
Created attachment 23745 [details] /proc/mdstat (after evms_activate)
I have the exact same problem! So here are the files you requested! I'm using gentoo-sources-2.6.30-r5 for the kernel. - Beau
Thanks. Can you run mdadm -Ss which should remove the entry from /proc/mdstat, then run evms_activate and see what happens. If it doesn't work, an 'strace' of that would help too.
Created attachment 23812 [details] /proc/mdstat (before evms_activate) **corrected Sorry, I uploaded the wrong file. Here is the one that was generated. There were no entries in /proc/mdstat. I'm using hacked initramfs image that dumps me into an ash shell right after all the drivers are loaded.
Created attachment 24182 [details] patch against 2.6.31.6 that fixes this (for me) I did a git bisection to identify the commits that broke EVMS w/ MD RAID. The first commit which broke things is: http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=d3374825ce57ba2214d375023979f6197ccc1385. However I believe this one is also involved, as it changes the behaviour of the breakage: http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=efeb53c0e57213e843b7ef3cc6ebcdea7d6186ac. Reverting those two commits fixes the problem on my system. The attached patch does that for 2.6.31.6. Note that I do not know whether there are other negative implications to applying this patch... perhaps doing so breaks something else that has been added to MD since 2.6.28, which I have yet to experience, and my understanding of the code is insufficient for me to figure it out. I would hope that someone with a greater understanding of what's going on can figure out whether this is: (1) a bug in the recent changes to MD, which happen to affect EVMS, or (2) a deliberate change in behaviour, for which EVMS needs to be adapted Either way, I hope my narrowing down of the offending changes will help. I may continue to investigate further myself, but suspect I have gone as far as my ability allows without more expert involvement... Alex
Thanks for doing the bisect - it should help narrow the problem down. There definitely was a deliberate change in behaviour with that patch. However I am surprised that anything would need to adapt to it. The only change should have been that an md device would disappear (e.g. out of /sys) when it was no longer in use. EVMS, or any other code, should not really notice. Can you say anything about how the second patch "changes the behaviour of the breakage"?? That might give a useful clue as to the nature of the breakage. Unfortunately the strace in comment #3 died early - it looks like strace itself had problems decoding a getdents. If it was possible to get a complete strace of evms_activate failing, that might also provide useful clues. Thanks, NeilBrown
Created attachment 24201 [details] much reduced patch for 2.6.31.6 The first of the commits linked to introduces the problem that results in the "md: array md0 already has disks!" message many times per second after running "evms_activate". When just that problem exists, my initrd never even moves on to attempting to mount the root filesystem... it just sits there forever with a flashing cursor while the problem goes on in the background. The second commit changes that behaviour in that it does actually fail with an error very quickly. The error is "evms_commit_changes(): Invalid argument", as seen at the end of the strace in comment #3 from Beau, and then the initrd offers to drop me to a shell. A dmesg from the shell then reveals that the "array already has disks" issue exists in this situation too (but there's only a screenful or two of those, after which they cease when it hits the "invalid argument" error). I have in fact managed to get much further with this now, in that I've got a vastly reduced patch that fixes it on my system (attached). It seems to me that the problematic case is when the "mddev_put(mddev);" in the the "abort:" section of "md_alloc" is reached without having gone through one of the "goto" statements beforehand. Without this patch I hit the "Invalid argument" problem and also see "array already has disks" in dmesg, with the patch I see neither problem and the system boots, so it would seem that this alone fixes both things. To be honest I am slightly surprised, as I imagined these were two separate problems, and thought I would end up with a minimal patch reverting one aspect from each of the commits, but I'm not about to complain. This still hasn't helped me understand why the original behaviour was occurring though... perhaps you can figure that out. Do you still want an strace? I'm just not entirely sure how to go about getting it for you - given that my root filesystem is on the EVMS-managed MD RAID :) but perhaps I could put an strace binary into the initrd somehow... Alex P.S. If only it didn't take so long to recompile the kernel, I'd honestly say I'm enjoying this troubleshooting so far. I suppose it would have been less enjoyable if I hadn't made any progress!
Awesome! I think you've made significant progress. As far as getting an strace... I too have my root on the evms volume. I hacked a custom initramfs to have strace and to drop me into an ash shell right before evms_activate. Then I'd run strace evms_activate while piping standard error to a file on a thumbdrive. Working with initramfs is pretty easy. I hope this helps. - Beau
(bugzilla was down went I tried to respond, so I sent this by private email I received a reply that that patch at the end of this email solves the problem.) Thanks for the extra detail. I figured out why the second patch made a difference - I really didn't think it should have done. There is a bug in the first patch which is fixed by the second - the first sets mddev->hold_active rather than new->hold_active. So that explains part of the mystery, and confirms the problem is related to hold_active. I had another look at the evms source code and I saw the important difference this time. Every time evms performs an operation on an md device it does: open ioctl close md doesn't expect this. It assumes: open set of ioctls to perform desired changes close As a consequence, after certain ioctls the 'close' will destroy the md device so the effect of the ioctl will be lost. I think that the important ioctl in this case is SET_ARRAY_INFO. I cannot see exactly why losing the effect of that ioctl would produce the symptoms you report, but it certainly isn't impossible. So could you please try the following patch. I'm not it is what will finally go upstream, but if it makes a difference, then that will confirm my understanding. Thanks, NeilBrown diff --git a/drivers/md/md.c b/drivers/md/md.c index 9dd8720..f251ff3 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5486,6 +5486,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode, done_unlock: abort_unlock: if (mddev->hold_active == UNTIL_IOCTL && + cmd != SET_ARRAY_INFO && err != -EINVAL) mddev->hold_active = 0; mddev_unlock(mddev);
The patch in the previous comment works fine for me on 2.6.29
Any ETA when this patch or something similar will make its way into git?
The patch also works for me on 2.6.31.6. Great job!
(In reply to comment #14) > Any ETA when this patch or something similar will make its way into git? December 30th 2009 The mainline commit is cbd1998377504df005302ac90d49db72a48552a6 NeilBrown
.. so marking as 'resolved'.