Bug 217074

Summary:	upgrading to kernel 6.1.12 from 5.15.x can no longer assemble software raid0
Product:	IO/Storage	Reporter:	Nikolay Kichukov (hijacker)
Component:	MD	Assignee:	io_md
Status:	RESOLVED CODE_FIX
Severity:	high	CC:	kernel, neilb, tsteiner
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	6.1.12	Subsystem:
Regression:	No	Bisected commit-id:

Description Nikolay Kichukov 2023-02-22 21:12:19 UTC

Hello,
After installing the new 6.1.12 kernel, the raid0 device can no longer be assembled.

Going back to a previously working kernel (5.15.65 or 5.15.75) assembles the raid0 without any problems.

Kernel command line parameters:
... ro kvm_amd.nested=0 kvm_amd.avic=1 kvm_amd.npt=1 raid0.default_layout=2

mdadm assembly attempt fails with:
'mdadm: unexpected failure opening /dev/md<NR>'

Tried with mdadm-4.1 and mdadm-4.2; since assembly works with either version of mdadm on the older kernels, I rule out the mdadm software.

strace -f output, last few lines:

mkdir("/run/mdadm", 0755)               = -1 EEXIST (File exists)
openat(AT_FDCWD, "/run/mdadm/map.lock", O_RDWR|O_CREAT|O_TRUNC, 0600) = 3
fcntl(3, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
flock(3, LOCK_EX)                       = 0
newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, "/run/mdadm/map", O_RDONLY) = 4
fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
newfstatat(4, "", {st_mode=S_IFREG|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(4, "", 4096)                       = 0
close(4)                                = 0
openat(AT_FDCWD, "/run/mdadm/map", O_RDONLY) = 4
fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
newfstatat(4, "", {st_mode=S_IFREG|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(4, "", 4096)                       = 0
close(4)                                = 0
newfstatat(AT_FDCWD, "/dev/.udev", 0x7ffcd8243c90, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/run/udev", {st_mode=S_IFDIR|0755, st_size=160, ...}, 0) = 0
openat(AT_FDCWD, "/proc/mdstat", O_RDONLY) = 4
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
newfstatat(4, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(4, "Personalities : [raid1] [raid0] "..., 1024) = 56
read(4, "", 1024)                       = 0
close(4)                                = 0
openat(AT_FDCWD, "/sys/block/md127/dev", O_RDONLY) = -1 ENOENT (No such file or directory)
getpid()                                = 18351
mknodat(AT_FDCWD, "/dev/.tmp.md.18351:9:127", S_IFBLK|0600, makedev(0x9, 0x7f)) = 0
openat(AT_FDCWD, "/dev/.tmp.md.18351:9:127", O_RDWR|O_EXCL|O_DIRECT) = -1 ENXIO (No such device or address)
unlink("/dev/.tmp.md.18351:9:127")      = 0
getpid()                                = 18351
mknodat(AT_FDCWD, "/tmp/.tmp.md.18351:9:127", S_IFBLK|0600, makedev(0x9, 0x7f)) = 0
openat(AT_FDCWD, "/tmp/.tmp.md.18351:9:127", O_RDWR|O_EXCL|O_DIRECT) = -1 ENXIO (No such device or address)
unlink("/tmp/.tmp.md.18351:9:127")      = 0
write(2, "mdadm: unexpected failure openin"..., 45mdadm: unexpected failure opening /dev/md127
) = 45
unlink("/run/mdadm/map.lock")           = 0
close(3)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++


Tried with kernel compiled with either CONFIG_DEVTMPFS_SAFE=y or CONFIG_DEVTMPFS_SAFE=n, fails the same way.

The raid consists of 4 devices, here is mdstat contents:

Personalities : [raid0] 
md127 : active raid0 sda[0] sdc[2] sdd[3] sdb[1]
      2929769472 blocks super 1.2 512k chunks
      
unused devices: <none>


Examining the 4 block devices:

gnusystem /var/log # mdadm --misc -E /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : bb710ce6:edd5d68d:a0a0a405:edd99547
           Name : gnusystem:md0-store  (local to host gnusystem)
  Creation Time : Wed Sep 29 22:28:09 2021
     Raid Level : raid0
   Raid Devices : 4

 Avail Dev Size : 976508976 sectors (465.64 GiB 499.97 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=0 sectors
          State : clean
    Device UUID : 7f226c1c:23632b9d:e3d6c656:74522906

    Update Time : Wed Sep 29 22:28:09 2021
  Bad Block Log : 512 entries available at offset 8 sectors
       Checksum : 51e99fb5 - correct
         Events : 0

     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
gnusystem /var/log # mdadm --misc -E /dev/sdb
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : bb710ce6:edd5d68d:a0a0a405:edd99547
           Name : gnusystem:md0-store  (local to host gnusystem)
  Creation Time : Wed Sep 29 22:28:09 2021
     Raid Level : raid0
   Raid Devices : 4

 Avail Dev Size : 1953260976 sectors (931.39 GiB 1000.07 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=0 sectors
          State : clean
    Device UUID : ed8795fe:c7e6719a:165db37e:32ec0894

    Update Time : Wed Sep 29 22:28:09 2021
  Bad Block Log : 512 entries available at offset 8 sectors
       Checksum : 215db63b - correct
         Events : 0

     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
gnusystem /var/log # mdadm --misc -E /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : bb710ce6:edd5d68d:a0a0a405:edd99547
           Name : gnusystem:md0-store  (local to host gnusystem)
  Creation Time : Wed Sep 29 22:28:09 2021
     Raid Level : raid0
   Raid Devices : 4

 Avail Dev Size : 976508976 sectors (465.64 GiB 499.97 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=0 sectors
          State : clean
    Device UUID : 3713dfff:d2e29aaf:3275039d:08b317bb

    Update Time : Wed Sep 29 22:28:09 2021
  Bad Block Log : 512 entries available at offset 8 sectors
       Checksum : 42f70f03 - correct
         Events : 0

     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
gnusystem /var/log # mdadm --misc -E /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : bb710ce6:edd5d68d:a0a0a405:edd99547
           Name : gnusystem:md0-store  (local to host gnusystem)
  Creation Time : Wed Sep 29 22:28:09 2021
     Raid Level : raid0
   Raid Devices : 4

 Avail Dev Size : 1953260976 sectors (931.39 GiB 1000.07 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=0 sectors
          State : clean
    Device UUID : 7da858ae:c0d6ca51:0ecaaaf0:280367cc

    Update Time : Wed Sep 29 22:28:09 2021
  Bad Block Log : 512 entries available at offset 8 sectors
       Checksum : 32cf4ab4 - correct
         Events : 0

     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

If any more information is needed, let me know.

Comment 1 Tim Steiner 2023-02-26 03:54:19 UTC

I ran into this as well.  I believe it is the result of trying to use mdadm on a kernel with BLOCK_LEGACY_AUTOLOAD disabled.
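
For anyone wanting to check their own kernel, here is a minimal sketch that reports the state of this option from a kernel config file. The function name legacy_autoload_state is made up for illustration, and the config file path has to be supplied by the caller because its location differs between distributions.

```shell
# Report whether CONFIG_BLOCK_LEGACY_AUTOLOAD is set in a kernel config file.
# Usage: legacy_autoload_state /path/to/kernel-config
# (legacy_autoload_state is a hypothetical helper name, not part of any tool.)
legacy_autoload_state() {
    cfg=$1
    if grep -q '^CONFIG_BLOCK_LEGACY_AUTOLOAD=y' "$cfg" 2>/dev/null; then
        echo enabled
    elif grep -q '^# CONFIG_BLOCK_LEGACY_AUTOLOAD is not set' "$cfg" 2>/dev/null; then
        echo disabled
    else
        # File missing, unreadable, or option not mentioned at all.
        echo unknown
    fi
}
```

It prints "enabled", "disabled", or "unknown", where "unknown" covers both an unreadable file and a config that does not mention the option.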

Comment 2 Nikolay Kichukov 2023-02-27 08:28:10 UTC

Hi Tim,

Thanks for your feedback, this may well be the case as I have it disabled, indeed:

grep BLOCK_LEGACY_AUTOLOAD /etc/kernels/kernel-config-6.1.12-gentoo-x86_64 
# CONFIG_BLOCK_LEGACY_AUTOLOAD is not set

I did notice the change when I was configuring and trying the 6.1.12 kernel, so I've manually loaded 'md_mod' and 'raid0' after the system had loaded. And this did not help either, thus the bug report. 'raid0' depends on 'md_mod' and 'md_mod' has no other dependencies according to modinfo (or these are compiled into the kernel in my case)...

As the cadence of rebooting this system is not so frequent, I cannot check if this was caused by CONFIG_BLOCK_LEGACY_AUTOLOAD. Even if it is fixed by enabling the block legacy autoload, I think it should also work by manually loading the kernel modules?

Cheers,
-Nikolay

Comment 3 Matt Whitlock 2023-03-04 19:57:20 UTC

I ran into this as well, also upon upgrading from 5.15.88 to 6.1.12, only in my case it's an mdraid *raid1* array that I cannot assemble, so the issue is not specific to raid0.

While comparing kernel configs, I too suspected CONFIG_DEVTMPFS_SAFE=y, as I intentionally enabled that option while building my new kernel. It's good to know that that's not the issue. (Thanks, Nikolay.)

My kernel does not have any block-layer code compiled as modules — it's all built in — so I would not think that CONFIG_BLOCK_LEGACY_AUTOLOAD would be the culprit, though I too have that option disabled in my new kernel. Given the error that occurs (ENXIO) and where it occurs in the strace (opening a device node for the mdraid device), that option does look highly suspect. I'll turn it on and try again during my maintenance window tonight.

Comment 4 Matt Whitlock 2023-03-05 07:39:29 UTC

Indeed, enabling CONFIG_BLOCK_LEGACY_AUTOLOAD allowed mdadm to assemble my arrays. Does mdadm need to be updated so as not to depend on legacy behavior?

Comment 5 Neil Brown 2023-03-05 21:15:59 UTC

mdadm can be told not to depend on this legacy behavior by adding the line

 CREATE names=yes

to /etc/mdadm.conf.
This works since mdadm-4.1.

Maybe we need to make that the default now.  It works on any kernel since 2.6.29.
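
A small sketch of adding that line without duplicating it on repeated runs. The helper name ensure_names_yes is hypothetical, the match is a simple case-sensitive grep, and it does not try to merge with a CREATE line that already exists without names=yes.

```shell
# Append "CREATE names=yes" to an mdadm config file unless a CREATE line
# containing names=yes is already present.
# Usage: ensure_names_yes [conf-file]   (default: /etc/mdadm.conf)
# (ensure_names_yes is a hypothetical helper name.)
ensure_names_yes() {
    conf=${1:-/etc/mdadm.conf}
    # Simple check only: expects the keyword spelled exactly "CREATE".
    if grep -Eq '^[[:space:]]*CREATE[[:space:]].*names=yes' "$conf" 2>/dev/null; then
        return 0
    fi
    printf 'CREATE names=yes\n' >> "$conf"
}
```

Passing a scratch file as the argument makes it easy to try out before touching the real config.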

Comment 6 Matt Whitlock 2023-03-05 22:02:35 UTC

@Neil Brown: Are you sure about that? Here's what the mdadm.conf(5) man page has to say about that option:

names=yes
Since Linux 2.6.29 it has been possible to create md devices with a name like md_home rather than just a number, like md3. mdadm will use the numeric alternative by default as other tools that interact with md arrays may expect only numbers. If names=yes is given in mdadm.conf then mdadm will use a name when appropriate. If names=no is given, then non-numeric md device names will not be used even if the default changes in a future release of mdadm.

This option seems irrelevant to the issue at hand here. I'm not letting mdadm choose the name for my md device node; rather, I am specifying it. I don't use mdadm.conf. My command line is:

mdadm --assemble /dev/md/root \
  --uuid=ab1712b4:d3c0c78b:2bc4e96b:4dc8c563

The issue is that mdadm creates the /dev/md/root block special device node with major number 9, minor number 127, and then attempts to open it, and the kernel complains, "whoa, no such device exists." The kernel doesn't care about the device node's path in the file system; it cares about the 9:127, and that's going to be the same regardless of whether I'm picking the path name myself or I'm letting mdadm pick it, with or without the "names=yes" option.
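
To illustrate that point, here is a rough sketch that asks sysfs whether the kernel currently has a block device registered for a given md minor number, independent of any device node path. It relies on the /sys/dev/block/MAJOR:MINOR layout and on md using major 9 (as in the makedev(0x9, 0x7f) call in the strace above). The function name and the optional sysfs-root argument are my own additions for illustration and testing.

```shell
# Succeed if the kernel has a block device registered for md minor $1.
# Block devices appear in sysfs as /sys/dev/block/MAJOR:MINOR; md uses major 9.
# The optional second argument (alternate sysfs root) exists only so the
# function can be exercised against a scratch directory.
# (md_minor_registered is a hypothetical helper name.)
md_minor_registered() {
    minor=$1
    sysroot=${2:-/sys}
    [ -e "$sysroot/dev/block/9:$minor" ]
}
```

For example, "md_minor_registered 127 || echo 'no md127 yet'" would report the situation in which the open of a freshly made 9:127 node returns ENXIO on a kernel without the legacy autoload behavior.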

Comment 7 Neil Brown 2023-03-05 22:22:33 UTC

Yes I am sure.  Possibly the documentation for mdadm.conf could be improved.

There are two different sorts of names.  Note that you almost acknowledged this by writing "name for my md device node" while the documentation only talks about names for "md devices", not for "md device nodes".

There are:
1/ names in /dev or /dev/md/ (device nodes)
2/ names that appear in /proc/mdstat and in /sys/block/ (devices)

names=yes is about the second sort of names.

By default, mdadm creates devices by using mknod to create something in /dev with the appropriate major/minor numbers.  The kernel will transparently create a device with a name like "md%d" based on the minor number.

With "names=yes", mdadm instead creates devices by writing e.g. md_root or md27 into /sys/module/md_mod/parameters/new_array.  The kernel will then create an array with exactly that name.

The same thing appears in /dev/md/ in both cases, but without names=yes, /dev/md/root will be a symlink to /dev/md127 (or similar) while with names=yes /dev/md/root will be a symlink to /dev/md_root (I think - it is a while since I've worked on this and I might not have all the details correct).
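
The new_array write described above can be sketched by hand roughly as follows. This is only an illustration of the mechanism, not what mdadm itself runs; the helper name request_md_array and the optional path argument are assumptions added so it can be tried against a scratch file.

```shell
# Ask the md driver to create an array device called $1 (e.g. md_root or md27)
# by writing that name to the new_array module parameter.
# The optional second argument overrides the parameter path (for testing).
# (request_md_array is a hypothetical helper name.)
request_md_array() {
    name=$1
    param=${2:-/sys/module/md_mod/parameters/new_array}
    if [ ! -w "$param" ]; then
        # md_mod not loaded/built in, sysfs not mounted, or not root.
        echo "new_array not available at $param" >&2
        return 1
    fi
    printf '%s' "$name" > "$param"
}
```

On a real system this needs root, and the array still has to be assembled afterwards; the write only creates the (empty) md device.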

Comment 8 Matt Whitlock 2023-03-05 22:31:31 UTC

@Neil Brown: Thank you for the clarification! So it seems that mdadm should not be relying on the legacy behavior in any case. Without "names=yes" (or with "names=no"), mdadm should still be writing "md127" into /sys/module/md_mod/parameters/new_array before it attempts to open the new device node that it created. Presently mdadm is relying on the kernel to create the md device on the fly as a legacy side effect of opening a device node for a non-existent md device.

I don't particularly like the idea of having /dev/md_root, as I am more familiar with having numbered device nodes in /dev and having human-named nodes (possibly as symlinks) in subdirectories of /dev. Regardless, how would I enable "names=yes" when invoking "mdadm --assemble" with an explicit array specification (i.e., no mdadm.conf)?

Comment 9 Matt Whitlock 2023-03-05 22:38:27 UTC

Incidentally, I suspect you might be wrong about what needs to be written into "new_array" since the kernel would need to know not only the name for the new md device but also the device number. Or, alternatively, if the mechanism is designed to have the kernel dynamically pick the device number, then it makes less/no sense for mdadm to name the md device with a number since it wouldn't find out the device number until after it has passed a name to "new_array". Maybe that's why the new mechanism isn't used in the case when the device node is to be numbered rather than named?

Comment 10 Neil Brown 2023-03-05 23:54:37 UTC

The only way is with mdadm.conf.

Maybe:
  echo CREATE names=yes > /etc/mdadm.conf ; mdadm --assemble .... ; rm /etc/mdadm.conf
??
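
A variation that avoids overwriting or deleting an existing /etc/mdadm.conf is to put the line in a temporary file and point mdadm at it with its --config option. This is an untested sketch: the function name and the MDADM override variable are made up (the latter only to allow a dry run), and any ARRAY or DEVICE lines in the regular config are not seen while the temporary file is in use.

```shell
# Run "mdadm --assemble" with a throwaway config containing only
# "CREATE names=yes", leaving /etc/mdadm.conf untouched.
# Usage: assemble_with_names /dev/md/root --uuid=...
# MDADM is a hypothetical override for the mdadm binary (useful for dry runs).
assemble_with_names() {
    conf=$(mktemp) || return 1
    printf 'CREATE names=yes\n' > "$conf"
    # --config must follow the mode option so it is parsed in assemble mode.
    "${MDADM:-mdadm}" --assemble --config="$conf" "$@"
    rc=$?
    rm -f "$conf"
    return $rc
}
```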

When the names=yes functionality was added, the new_array parameter was new and so it had to be opt-in.  I agree that the code needs to be different now.

I had never imagined that anyone would LIKE the numeric names.... I guess we could keep them if you really like them, but why?

Comment 11 Neil Brown 2023-03-05 23:57:40 UTC

> Incidentally, I suspect you might be wrong about what needs to be written
> into "new_array" since the kernel would need to know not only the name for
> the new md device but also the device number.

Allow me to quote the comment that I placed in the source code - so when I forgot how all this worked, I could easily go back and find out:

	/*
	 * val must be "md_*" or "mdNNN".
	 * For "md_*" we allocate an array with a large free minor number, and
	 * set the name to val.  val must not already be an active name.
	 * For "mdNNN" we allocate an array with the minor number NNN
	 * which must not already be in use.
	 */

Hopefully that also helps you understand how it works.
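
The two forms named in that comment can be told apart with a simple pattern match, sketched here in shell. It mirrors only the "md_*" versus "mdNNN" distinction from the comment; treating a bare "md_" as invalid is my assumption, and whatever further checks the kernel applies (length limits, minor number range, names already in use) are not modelled. The function name is hypothetical.

```shell
# Classify a candidate new_array value according to the two forms in the
# kernel comment: "md_*" (named) or "mdNNN" (numeric minor).
# Prints: named | numeric | invalid
# (classify_new_array_val is a hypothetical helper name.)
classify_new_array_val() {
    case $1 in
        md_?*)            echo named ;;    # md_ followed by at least one char
        md|md*[!0-9]*)    echo invalid ;;  # bare "md", or a non-digit after md
        md[0-9]*)         echo numeric ;;  # md followed by digits only
        *)                echo invalid ;;
    esac
}
```

So "md_root" is classified as named and "md127" as numeric, matching the examples used earlier in this thread.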

Comment 12 Matt Whitlock 2023-03-06 00:18:13 UTC

(In reply to Neil Brown from comment #10)
> I had never imagined that anyone would LIKE the numeric names....

It's just familiarity. I can adjust, though then I'd question the sanity of having /dev/md/root be a symlink to /dev/md_root, as that seems redundant.

Should I open a new bug on mdadm to discuss how to fix its behavior? In particular, there appears to be an abstraction breakdown when specifying array members on the command line, as the command line takes an argument to specify the device node path name, but there's no way to specify the array name, which need not be similar to the node name. Ideally, I would want to pass an array name like "XXX" (rather than a device node path like "/dev/md/XXX") to "mdadm --assemble" and have it write that name to "new_array" and create /dev/mdNN and /dev/md/XXX->../mdNN, where NN is the device minor number chosen by the kernel for the new array. If "new_array" doesn't exist (perhaps because sysfs isn't mounted at /sys), then mdadm could fall back to the present behavior (which fails if the kernel does not have CONFIG_BLOCK_LEGACY_AUTOLOAD). Perhaps, to maintain backward compatibility, mdadm should continue accepting an md device node path name and should detect if it is of the form "/dev/md_XXX" or "/dev/md/XXX" and attempt to write "XXX" to "new_array" first before falling back to the present behavior. (If the device node path is specified as "/dev/mdNN", then the "new_array" mechanism would not be used, and the current behavior would be used exclusively. The man page should mention the need for CONFIG_BLOCK_LEGACY_AUTOLOAD when specifying a device node path name in the numeric form.)

Comment 13 Neil Brown 2023-03-06 01:08:35 UTC

> Should I open a new bug on mdadm to discuss how to fix its behavior?

That is unlikely to help.  This bugzilla isn't used by most kernel developers.  We mostly prefer email lists.  I still get notification when someone flags "MD" even though I'm not the maintainer any more.  I don't know if anyone else does.

The best place to discuss this more broadly is to send an email to linux-raid@vger.kernel.org.  You don't need to subscribe or anything like that, just send an email.

> the command line takes an argument to specify the device node path name, but
> there's no way to specify the array name, which need not be similar to the
> node name.

Not entirely true.  While the device name and the device node name could be completely unrelated, mdadm does not support that.  If the device name is "md_FOO", then the name of the array is "FOO" and you get "/dev/md/FOO" linked to "/dev/md_FOO".

You can assemble an array as 
  mdadm -A FOO /dev/sd[abc]

 and that will be treated just like 
  mdadm -A /dev/md/FOO /dev/sd[abc]

 and the device name will be "md_FOO" - if names are supported.

I agree that testing for the presence of parameters/new_array would be sensible - we cannot default to names=yes when that doesn't exist.
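
As a sketch of that test, the decision could look something like this. The strategy labels "new_array" and "legacy_open" and the function name are invented for illustration and are not mdadm terminology.

```shell
# Decide how an md device should be created, based on whether the
# new_array parameter is present.
# Prints: new_array | legacy_open
# The optional argument overrides the parameter path (for testing).
# (md_create_strategy and both labels are hypothetical.)
md_create_strategy() {
    param=${1:-/sys/module/md_mod/parameters/new_array}
    if [ -e "$param" ]; then
        echo new_array
    else
        # Fall back to mknod + open, which needs the legacy autoload behavior.
        echo legacy_open
    fi
}
```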

I'm actually hoping that we can revert the dependency on CONFIG_BLOCK_LEGACY_AUTOLOAD until mdadm has better support for that option.  I'd rather not mention it in the man page unless we really have to.

See also
 https://lore.kernel.org/linux-raid/a13cd3b5-cc41-bf2f-c8ac-e031ad0d5dd7@leemhuis.info/

and my reply to the thread.

Comment 14 Neil Brown 2023-03-14 00:12:13 UTC

It turns out that I mis-diagnosed this problem somewhat.

mdadm 4.1 already supports using parameters/new_array when it exists.  So on a kernel with CONFIG_BLOCK_LEGACY_AUTOLOAD disabled

 mdadm -A /dev/md0 /dev/sd[bc]

will work.  EXCEPT there is one case where it doesn't use new_array.  This happens when it isn't given a name that it trusts - typically because the "name" field in the metadata says that this array belongs to some other computer (which is unfortunately quite common because arrays are created before the correct host name is set).  In this case mdadm typically chooses /dev/md127.

In this case new_array isn't used and mdadm fails.
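
The "belongs to some other computer" condition can be pictured with a rough sketch that compares the host part of the metadata name (e.g. "gnusystem:md0-store" from the -E output in the original report) with the current hostname. This is a simplification for illustration, not mdadm's actual trust logic. The function name is hypothetical, and treating a name without a "host:" prefix as not local is my assumption.

```shell
# Succeed if the "host:name" string from the array metadata names this host.
# Usage: array_name_is_local "gnusystem:md0-store" [hostname]
# The second argument defaults to the output of hostname(1); passing an
# empty string simulates an early-boot environment with no hostname set.
# (array_name_is_local is a hypothetical helper name.)
array_name_is_local() {
    meta_name=$1
    this_host=${2-$(hostname)}
    case $meta_name in
        *:*) [ "${meta_name%%:*}" = "$this_host" ] ;;
        *)   return 1 ;;   # assumption: no host prefix means "not local"
    esac
}
```

With an empty or different hostname the check fails, which corresponds to the untrusted-name case described above.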

I've just posted a patch for mdadm to fix this.

https://lore.kernel.org/linux-raid/167875238571.8008.9808655454439667586@noble.neil.brown.name/T/#u

The patch is one line:

--- a/mdopen.c
+++ b/mdopen.c
@@ -370,6 +370,7 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
 		}
 		if (block_udev)
 			udev_block(devnm);
+		create_named_array(devnm);
 	}
 
 	sprintf(devname, "/dev/%s", devnm);


I expect the stable kernels will soon get a patch to force CONFIG_BLOCK_LEGACY_AUTOLOAD on when md is enabled.  Hopefully the next release of mdadm will have this fully fixed.

Thanks for reporting this!

Comment 15 Matt Whitlock 2023-03-14 00:53:38 UTC

(In reply to Neil Brown from comment #14)
> EXCEPT there is one case where it doesn't use new_array.  This
> happens when it isn't given a name that it trusts - typically because the
> "name" field in the metadata says that this array belongs to some other
> computer (which is unfortunately quite common because arrays are created
> before the correct host name is set).

Ahh, nice catch. In my case it's the opposite scenario: the array was created with the correct hostname set, but I'm assembling it in an early boot environment (like an initramfs but not implemented as one), so the hostname hasn't been set yet at the time the array is assembled.

Thanks for the fix and for the information.