Bug 89721 - Unable to determine root device id from userspace after booting without initramfs
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs
Hardware: All Linux
Importance: P1 normal
Assignee: Josef Bacik
Reported: 2014-12-14 13:04 UTC by maxtram95
Modified: 2020-09-09 16:15 UTC
CC List: 16 users

Kernel Version: 3.18.0
Tree: Mainline
Regression: No



Description maxtram95 2014-12-14 13:04:27 UTC
My root fs is on btrfs and my system boots without an initramfs. When booting, systemd prints the following message to the log:

systemd-gpt-auto-generator[108]: Failed to determine block device of root file system: No such file or directory

Here is the bug report for systemd; it does not seem to be a systemd bug: https://bugs.freedesktop.org/show_bug.cgi?id=84689

The BTRFS_IOC_DEV_INFO ioctl returns the useless path "/dev/root" for the root filesystem when booted without an initrd. There seems to be no way either to get the device id (e.g. 8:4) via an ioctl or to get the real device node path (e.g. /dev/sda4, as shown in /proc/mounts). BTRFS_IOC_DEV_INFO should return the real device path instead of "/dev/root".
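
For reference, a minimal sketch of the query described above, written in Python purely for illustration. It assumes the struct btrfs_ioctl_dev_info_args layout from the kernel's linux/btrfs.h UAPI header (a 4096-byte structure whose last 1024 bytes hold the device path) and the generic _IOWR encoding used on x86 and ARM. The helper names (iowr, parse_dev_info, query_dev_info) are hypothetical and not part of any library.

```python
import fcntl
import os
import struct

# Assumed layout of struct btrfs_ioctl_dev_info_args (linux/btrfs.h):
#   __u64 devid; __u8 uuid[16]; __u64 bytes_used; __u64 total_bytes;
#   __u64 unused[379]; __u8 path[1024];
DEV_INFO_FMT = "=Q16sQQ3032s1024s"
DEV_INFO_SIZE = struct.calcsize(DEV_INFO_FMT)

BTRFS_IOCTL_MAGIC = 0x94


def iowr(magic: int, nr: int, size: int) -> int:
    """Encode an _IOWR request number (generic layout, e.g. x86 and ARM)."""
    read_write = 3  # _IOC_READ | _IOC_WRITE
    return (read_write << 30) | (size << 16) | (magic << 8) | nr


BTRFS_IOC_DEV_INFO = iowr(BTRFS_IOCTL_MAGIC, 30, DEV_INFO_SIZE)


def parse_dev_info(buf: bytes) -> dict:
    """Decode a filled-in btrfs_ioctl_dev_info_args buffer."""
    devid, _uuid, used, total, _unused, path = struct.unpack(DEV_INFO_FMT, buf)
    return {
        "devid": devid,
        "bytes_used": used,
        "total_bytes": total,
        # The path is stored as a NUL-terminated string.
        "path": path.split(b"\0", 1)[0].decode(),
    }


def query_dev_info(mountpoint: str, devid: int = 1) -> dict:
    """Ask btrfs about one device (by devid) of a mounted filesystem."""
    buf = bytearray(DEV_INFO_SIZE)
    struct.pack_into("=Q", buf, 0, devid)  # devid is the lookup key
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_DEV_INFO, buf)  # buffer is filled in place
    finally:
        os.close(fd)
    return parse_dev_info(bytes(buf))
```

On an affected system, calling query_dev_info("/") this early in boot would be expected to report "/dev/root" in the "path" field, per the description above; on a non-btrfs mount the ioctl simply fails with an OSError.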
Comment 1 Pavel Volkov 2017-03-23 05:45:14 UTC
I understand not many people boot Linux without initramfs these days, but can someone look into this please :)
Comment 2 seirra blake 2018-08-19 07:35:18 UTC
Confirmed in 4.17.0, 4.17.15, 4.18.3. I am now unable to boot my system; it worked fine in 4.16.0. Is the workaround to have an initramfs?
Comment 3 Pavel Volkov 2018-08-22 18:07:11 UTC
(In reply to general from comment #2)

I don't think so, since this is a 4-year-old bug and it doesn't affect kernel boot, only userspace.
Comment 4 seirra blake 2018-08-22 18:36:19 UTC
I got 4.9.76 and 4.16.0 to work, but that's it. I would have thought that, since btrfs is getting a lot of attention lately, the age of the bug wouldn't matter to them. Perhaps more info on affected users would be a useful insight: the reporter was a Gentoo user and I am a Gentoo user. You wouldn't happen to be a Gentoo user too, would you (assuming you hit the problem as well)? I could provide logs too, but no one has asked for them, so I don't know if that would help.
Comment 5 seirra blake 2018-12-28 01:15:31 UTC
Still present on 4.20.0.
Comment 6 Erik Quaeghebeur 2019-03-15 07:29:22 UTC
I am affected by this on 4.19.27 using systemd 239 and not using an initramfs.
Comment 7 Erik Quaeghebeur 2019-03-17 21:54:41 UTC
(In reply to Erik Quaeghebeur from comment #6)
> I am affected by this on 4.19.27 using systemd 239 and not using an
> initramfs.

Some log excerpts:

mrt 17 21:23:15 hostname systemd-gpt-auto-generator[108]: Failed to determine block device of root file system: No such file or directory
[...]
mrt 17 21:23:15 hostname systemd[100]: /lib64/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
[...]
mrt 17 21:23:15 hostname kernel: BTRFS info (device sda3): device fsid d9bb1e6b-6c6e-4eeb-99b7-2167a21c84ec devid 1 moved old:/dev/root new:/dev/sda3

So BTRFS does seem to correct this at some point, but too late.
Comment 8 James Chew 2019-07-10 12:49:24 UTC
Happens to me too on Kernel 5.1.15 and systemd 242 without initramfs.

[  +0.017444] systemd-gpt-auto-generator[272]: Failed to determine block device of root file system: No such file or directory
[  +0.005031] systemd[265]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.

[  +0.061556] BTRFS info (device sda4): device fsid 6adbd55d-f1be-4960-b283-87b31486a690 devid 1 moved old:/dev/root new:/dev/sda4
Comment 9 Karlson2k 2020-07-08 11:12:40 UTC
Still present in 5.7.7

[    2.236089] systemd-gpt-auto-generator[302]: Failed to determine block device of root file system: No such file or directory
[    2.237822] systemd[296]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
....
[    2.760648] BTRFS info (device nvme0n1p2): device fsid 43c74ec4-9cb8-4bdf-96a3-9024b881b015 devid 1 moved old:/dev/root new:/dev/nvme0n1p2

Gentoo, root on BTRFS, no initramfs.
Comment 10 Karlson2k 2020-09-02 18:59:52 UTC
Additional details:

I'm using kernel with 'root=PARTUUID=xxxx' to boot without initramfs.

While the kernel is able to identify the device and find the required partition, it behaves strangely:
[    1.662010] BTRFS: device label rootfs devid 1 transid 29269 /dev/root scanned by swapper/0 (1)
[    1.662179] BTRFS info (device nvme0n1p2): disk space caching is enabled
[    1.662180] BTRFS info (device nvme0n1p2): has skinny extents
[    1.668038] BTRFS info (device nvme0n1p2): enabling ssd optimizations
[    1.669627] VFS: Mounted root (btrfs filesystem) readonly on device 0:18.
[    1.670006] devtmpfs: mounted
[    1.670218] Freeing unused kernel image (initmem) memory: 948K

Note the wrong (and nonexistent) device path '/dev/root' in the first line, while the device is correctly identified as 'nvme0n1p2'.

Later the device path is corrected:
[    2.157417] BTRFS info (device nvme0n1p2): device fsid xxxxxx-x-x-x-xxxxxx devid 1 moved old:/dev/root new:/dev/nvme0n1p2

But that is too late, as systemd has already produced errors.
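
The ordering argument can be checked mechanically. Below is a small illustrative sketch (the function names are made up for this example) that reads dmesg-style lines with absolute timestamps and reports whether the generator failure was logged before btrfs replaced '/dev/root'. The two sample lines are copied from the 5.7.7 excerpt in comment 9.

```python
import re

# dmesg-style prefix with an absolute timestamp: "[    2.236089] message"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
MOVED_RE = re.compile(r"moved old:(\S+) new:(\S+)")
FAILURE = "Failed to determine block device of root file system"


def parse_line(line: str):
    """Return (timestamp_seconds, message), or None if there is no timestamp."""
    m = LINE_RE.match(line)
    return (float(m.group(1)), m.group(2)) if m else None


def path_fix_came_too_late(lines) -> bool:
    """True if the generator failed before btrfs replaced /dev/root."""
    failed_at = fixed_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, msg = parsed
        if failed_at is None and FAILURE in msg:
            failed_at = ts
        moved = MOVED_RE.search(msg)
        if fixed_at is None and moved and moved.group(1) == "/dev/root":
            fixed_at = ts
    return failed_at is not None and (fixed_at is None or failed_at < fixed_at)


# Two lines copied from the 5.7.7 excerpt in comment 9.
log = [
    "[    2.236089] systemd-gpt-auto-generator[302]: Failed to determine block device of root file system: No such file or directory",
    "[    2.760648] BTRFS info (device nvme0n1p2): device fsid 43c74ec4-9cb8-4bdf-96a3-9024b881b015 devid 1 moved old:/dev/root new:/dev/nvme0n1p2",
]
print(path_fix_came_too_late(log))
```

Lines with relative timestamps (such as the "[  +0.017444]" form in comment 8) are skipped by this sketch, since they cannot be compared directly.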

To conclude:
It seems that when the kernel finds the BTRFS device on its own (without an initramfs), it assigns the wrong, nonexistent path '/dev/root' instead of the correct path.

Related: 
https://github.com/systemd/systemd/issues/12615
https://bugs.freedesktop.org/show_bug.cgi?id=84689
Comment 11 Anand Jain 2020-09-03 05:26:28 UTC
[    1.662010] BTRFS: device label rootfs devid 1 transid 29269 /dev/root scanned by swapper/0 (1)


'swapper/0 (1)' is calling btrfs device scan /dev/root to scan the device.


Can we look into swapper and figure out why?
Comment 12 Karlson2k 2020-09-03 08:39:19 UTC
According to the docs, '/dev/root' is a temporary mount point alias used while rootfs is not available or read-only.
That explains why '/dev/root' is assigned BEFORE mounting rootfs and why an initramfs solves the situation (the initramfs is mounted as rootfs, so /dev is available).

However, it's not clear why this kind of problem arises only with BTRFS.
There are no similar reports for any other FS type. It happens only with BTRFS.
Comment 13 Anand Jain 2020-09-03 09:31:08 UTC
There isn't a bug or a problem that I can understand here. 

About the message

mrt 17 21:23:15 hostname kernel: BTRFS info (device sda3): device fsid d9bb1e6b-6c6e-4eeb-99b7-2167a21c84ec devid 1 moved old:/dev/root new:/dev/sda3

That's fine because it just indicates someone (systemd?) scanned the new path to the same fsid which is already mounted.

These things are noticeable in btrfs because they are written to the system log. But is there something that is not working?
Comment 14 Karlson2k 2020-09-03 10:12:54 UTC
The problem is with systemd:

[    1.744161] systemd-gpt-auto-generator[288]: Failed to determine block device of root file system: No such file or directory
[    1.754426] systemd[280]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.

Only happens with BTRFS and only without initramfs.
Comment 15 Anand Jain 2020-09-03 10:29:27 UTC
Do you get anything in the kernel messages when this happens? I think systemd is using 'btrfs device scan /dev/nvme...?' when it reported 'Failed to determine block device of root file system: No such file or directory'.

I guess what might be happening is that the device major:minor numbers of /dev/root and /dev/nvme.. are different. Is it possible to confirm that?
Comment 16 Chris Bainbridge 2020-09-03 11:49:52 UTC
The issue was that /dev/root does not exist when the kernel does not boot from initramfs, but btrfs still returns "/dev/root" when queried:

  "Hmm, so this appears to be a btrfs kernel bug. it should not return the
  useless "/dev/root" string as a volume device when asked via the
  BTRFS_IOC_DEV_INFO ioctl. it should instead return a proper device name, that
  we can actually make use of from userspace." - Lennart Poettering
  https://bugs.freedesktop.org/show_bug.cgi?id=84689#c3
Comment 17 Karlson2k 2020-09-03 12:24:59 UTC
More messages from the log (with enabled debug log for systemd):
----------------------------------------------------------
[    0.000000] kernel: microcode: microcode updated early to revision 0xd6, date = 2020-04-23
[    0.000000] kernel: Linux version 5.7.19-gentoo (root@host) (gcc version 9.3.0 (Gentoo 9.3.0-r1 p3), GNU ld (Gentoo 2.33.1 p2) 2.33.1) #1 SMP Sun Aug 30 19:36:07 MSK 2020
....
[    1.675394] kernel: BTRFS: device label rootfs devid 1 transid 29806 /dev/root scanned by swapper/0 (1)
[    1.682853] kernel: BTRFS info (device nvme0n1p2): disk space caching is enabled
[    1.682854] kernel: BTRFS info (device nvme0n1p2): has skinny extents
[    1.689825] kernel: BTRFS info (device nvme0n1p2): enabling ssd optimizations
[    1.690632] kernel: VFS: Mounted root (btrfs filesystem) readonly on device 0:18.
[    1.690904] kernel: devtmpfs: mounted
....
[    1.727637] systemd[1]: systemd 245 running in system mode. (+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL -XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
....
[    1.744141] systemd-gpt-auto-generator[288]: Disabling root partition auto-detection, root= is defined.
[    1.744161] systemd-gpt-auto-generator[288]: Failed to determine block device of root file system: No such file or directory
....
[    1.754426] systemd[280]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1
....
[    1.807716] BTRFS info (device nvme0n1p2): disk space caching is enabled
....
[    2.102263] systemd-remount-fs[303]: Remounting /...
....
[    2.134333] systemd[1]: Started udev Kernel Device Manager.
....
[    2.169856] BTRFS info (device nvme0n1p2): device fsid 43c77ec4-9cb8-48df-96a3-9024b881b015 devid 1 moved old:/dev/root new:/dev/nvme0n1p2

....
[    2.160822] systemd[1]: dev-nvme0n1.device: Changed dead -> plugged
[    2.160859] systemd[1]: sys-devices-pci0000:00-0000:00:1d.0-0000:02:00.0-nvme-nvme0-nvme0n1.device: Changed dead -> plugged
[    2.169856] kernel: BTRFS info (device nvme0n1p2): device fsid 43c77ec4-9cb8-48df-96a3-9024b881b015 devid 1 moved old:/dev/root new:/dev/nvme0n1p2
....
[    2.168393] systemd[1]: dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device: Changed dead -> plugged
[    2.168420] systemd[1]: dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device: Job 27 dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device/start finished, result=done
[    2.168451] systemd[1]: Found device Viper M.2 VPR100 boot.
[    2.168484] systemd[1]: dev-nvme0n1p1.device: Changed dead -> plugged
....
[    2.172881] systemd[1]: dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device: Changed dead -> plugged
[    2.172905] systemd[1]: dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device: Job 22 dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device/start finished, result=done
[    2.172926] systemd[1]: Found device Viper M.2 VPR100 rootfs.
[    2.172951] systemd[1]: dev-disk-by\x2did-nvme\x2deui.6479a72782353833\x2dpart2.device: Changed dead -> plugged
[    2.172970] systemd[1]: dev-disk-by\x2dpath-pci\x2d0000:02:00.0\x2dnvme\x2d1\x2dpart2.device: Changed dead -> plugged
[    2.172993] systemd[1]: dev-disk-by\x2duuid-43c77ec4\x2d9cb8\x2d48df\x2d96a3\x2d9024b881b015.device: Changed dead -> plugged
[    2.173018] systemd[1]: dev-disk-by\x2dpartlabel-rootfs.device: Changed dead -> plugged
[    2.173053] systemd[1]: sys-devices-pci0000:00-0000:00:1d.0-0000:02:00.0-nvme-nvme0-nvme0n1-nvme0n1p2.device: Changed dead -> plugged
[    2.173089] systemd[1]: dev-nvme0n1p2.device: Changed tentative -> plugged
----------------------------------------------------------

Rootfs block device path changed before udev scanned.

My guess is that the rootfs device was assigned the number 4:0 (see https://www.kernel.org/doc/html/latest/admin-guide/devices.html)
Comment 18 Karlson2k 2020-09-03 12:29:18 UTC
(In reply to Karlson2k from comment #17)
> More messages from the log (with enabled debug log for systemd):
[skipped]

Sorry, wrong line ordering (check the timestamps).
Here is the correct one:
----------------------------------------------------------
[    0.000000] kernel: microcode: microcode updated early to revision 0xd6, date = 2020-04-23
[    0.000000] kernel: Linux version 5.7.19-gentoo (root@host) (gcc version 9.3.0 (Gentoo 9.3.0-r1 p3), GNU ld (Gentoo 2.33.1 p2) 2.33.1) #1 SMP Sun Aug 30 19:36:07 MSK 2020
....
[    1.675394] kernel: BTRFS: device label rootfs devid 1 transid 29806 /dev/root scanned by swapper/0 (1)
[    1.682853] kernel: BTRFS info (device nvme0n1p2): disk space caching is enabled
[    1.682854] kernel: BTRFS info (device nvme0n1p2): has skinny extents
[    1.689825] kernel: BTRFS info (device nvme0n1p2): enabling ssd optimizations
[    1.690632] kernel: VFS: Mounted root (btrfs filesystem) readonly on device 0:18.
[    1.690904] kernel: devtmpfs: mounted
....
[    1.727637] systemd[1]: systemd 245 running in system mode. (+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL -XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
....
[    1.744141] systemd-gpt-auto-generator[288]: Disabling root partition auto-detection, root= is defined.
[    1.744161] systemd-gpt-auto-generator[288]: Failed to determine block device of root file system: No such file or directory
....
[    1.754426] systemd[280]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1
....
[    1.807716] BTRFS info (device nvme0n1p2): disk space caching is enabled
....
[    2.102263] systemd-remount-fs[303]: Remounting /...
....
[    2.134333] systemd[1]: Started udev Kernel Device Manager.
....
[    2.160822] systemd[1]: dev-nvme0n1.device: Changed dead -> plugged
[    2.160859] systemd[1]: sys-devices-pci0000:00-0000:00:1d.0-0000:02:00.0-nvme-nvme0-nvme0n1.device: Changed dead -> plugged
[    2.169856] kernel: BTRFS info (device nvme0n1p2): device fsid 43c77ec4-9cb8-48df-96a3-9024b881b015 devid 1 moved old:/dev/root new:/dev/nvme0n1p2
....
[    2.168393] systemd[1]: dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device: Changed dead -> plugged
[    2.168420] systemd[1]: dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device: Job 27 dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device/start finished, result=done
[    2.168451] systemd[1]: Found device Viper M.2 VPR100 boot.
[    2.168484] systemd[1]: dev-nvme0n1p1.device: Changed dead -> plugged
....
[    2.172881] systemd[1]: dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device: Changed dead -> plugged
[    2.172905] systemd[1]: dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device: Job 22 dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device/start finished, result=done
[    2.172926] systemd[1]: Found device Viper M.2 VPR100 rootfs.
[    2.172951] systemd[1]: dev-disk-by\x2did-nvme\x2deui.6479a72782353833\x2dpart2.device: Changed dead -> plugged
[    2.172970] systemd[1]: dev-disk-by\x2dpath-pci\x2d0000:02:00.0\x2dnvme\x2d1\x2dpart2.device: Changed dead -> plugged
[    2.172993] systemd[1]: dev-disk-by\x2duuid-43c77ec4\x2d9cb8\x2d48df\x2d96a3\x2d9024b881b015.device: Changed dead -> plugged
[    2.173018] systemd[1]: dev-disk-by\x2dpartlabel-rootfs.device: Changed dead -> plugged
[    2.173053] systemd[1]: sys-devices-pci0000:00-0000:00:1d.0-0000:02:00.0-nvme-nvme0-nvme0n1-nvme0n1p2.device: Changed dead -> plugged
[    2.173089] systemd[1]: dev-nvme0n1p2.device: Changed tentative -> plugged
----------------------------------------------------------

 
> Rootfs block device path changed before udev scanned.
Not correct. Rootfs is mapped to the right path after the udev scan.

> I can guess that rootfs device has assigned 4:0 number (see
> https://www.kernel.org/doc/html/latest/admin-guide/devices.html)
Comment 19 Karlson2k 2020-09-03 12:41:39 UTC
(In reply to Anand Jain from comment #15)
> I guess what might be happening is that.. the device major:min number of
> /dev/root and /dev/nvme.. are different is it possible to confirm that?

From the log:
[    1.675394] kernel: BTRFS: device label rootfs devid 1 transid 29806 /dev/root scanned by swapper/0 (1)
[    1.682853] kernel: BTRFS info (device nvme0n1p2): disk space caching is enabled
....
[    1.690632] kernel: VFS: Mounted root (btrfs filesystem) readonly on device 0:18.

After boot:
# LANG=en ls -l /dev/nvme0n1*
brw-rw---- 1 root disk 259, 0 Sep  3 11:42 /dev/nvme0n1
brw-rw---- 1 root disk 259, 1 Sep  3 11:42 /dev/nvme0n1p1
brw-rw---- 1 root disk 259, 2 Sep  3 11:42 /dev/nvme0n1p2
brw-rw---- 1 root disk 259, 3 Sep  3 11:42 /dev/nvme0n1p3

So, looks like it was 0:18 and changed to 259:2.
Comment 20 Anand Jain 2020-09-03 14:34:18 UTC
A few things are starting to make sense to me: it appears that the device bdev is changing its major:minor during bootup from "0:18" to "259:0".

Btrfs will successfully change the device path, for example from /dev/sdx to /dev/mapper/..., as long as both of them point to the same block device. But here the mounted device appears as another device with the FS dd-ed to it. It is fair to fail, because the FS is already mounted and the device is already opened.

But it is not clear why this problem does not happen with an initramfs. How does it manage?

So when BTRFS_IOC_DEV_INFO returns /dev/root, which no longer exists in the system, systemd logs the message below.

[    1.744161] systemd-gpt-auto-generator[288]: Failed to determine block device of root file system: No such file or directory

To reproduce this, do I just need to boot Gentoo? And how do I make sure I don't use an initramfs? Do you have a reproducer?

Thanks.
Comment 21 Karlson2k 2020-09-04 07:29:44 UTC
(In reply to Anand Jain from comment #20)
> Few things are starting to make sense to me, it appears that device bdev is
> changing the major:minor during bootup from "0:18" to "259:0".
That's correct.

> Btrfs will successfully change the device path for example /dev/sdx to
> /dev/mapper/... as long as both of them pointing to the same block device.
> But here the mounted device appears as another device with the FS dd-ed to
> it. It is fair to fail because the FS is already mounted and the device is
> already opened.
To be clear: the system does NOT fail to boot. The only thing that fails is systemd's gpt-auto-generator.
The system works fine and the device path is changed, as you can see in the logs.


> But it is not too clear why this problem does not happen in the case of
> initramfs how does it manage.
The initramfs mounts /dev and scans devices when devtmpfs is available, so the block device is detected at the right path from the start.

> So when BTRFS_IOC_DEV_INFO returns /dev/root which does not exist in the
> system anymore. The systemd logs the below message.
> 
> [    1.744161] systemd-gpt-auto-generator[288]: Failed to determine block
> device of root file system: No such file or directory
> 
> To reproduce this do I just need to boot gentoo? But how do I make sure I
> don't use initramfs. Do you have a reproducer?

For me, Gentoo is the simplest way to reproduce.
But you can do it with any GNU/Linux distribution. Arch, Debian/Ubuntu, Fedora/CentOS are fine.

Steps to reproduce:
0. Ensure that you have a machine with UEFI and that your drive is GPT-partitioned with a BTRFS root partition.
1. Build (or obtain) a kernel with the drivers for your HDD/SSD compiled in (modules will not work, as the HDD/SSD must be mounted before modules can be loaded).
2. Create a boot entry (preferably in the GRUB2 loader) without an initramfs and with "root=PARTUUID=xxx-xx-xx" on the kernel command line (substitute the PARTUUID of your root partition).
3. Check for errors in the systemd log.

I saw similar reports with "root=/dev/...", but I have checked only with PARTUUID identification.
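
To tell which of the root= forms mentioned above a given boot actually used, the kernel command line can be inspected. The helper below is a hypothetical sketch, not part of any tool; it only splits on whitespace and recognizes the PARTUUID=, PARTLABEL= and /dev/... forms, so quoted arguments and other spellings fall into "other".

```python
def classify_root_arg(cmdline: str):
    """Return (kind, value) for the root= argument, or None if it is absent."""
    for token in cmdline.split():
        if not token.startswith("root="):
            continue
        value = token[len("root="):]
        for kind in ("PARTUUID", "PARTLABEL"):
            prefix = kind + "="
            if value.startswith(prefix):
                return kind, value[len(prefix):]
        if value.startswith("/dev/"):
            return "path", value
        return "other", value
    return None


# The placeholder PARTUUID is the one used in the steps above.
print(classify_root_arg("root=PARTUUID=xxx-xx-xx ro"))
```

On a running system the input would typically be the contents of /proc/cmdline. Note that this says nothing about whether an initramfs was loaded, since a boot loader can pass one without any trace on the command line.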
Comment 22 Lennart Poettering 2020-09-04 13:51:59 UTC
I think btrfs should ideally have some ioctl where it returns the *current* backing device major/minors of all devices, instead of the device name used originally when mounting it, as /dev/root is otherwise not really visible to userspace and userspace has no way to figure out what /dev/root actually really mapped to retroactively.

If that's not in the cards, then btrfs should at least return the *current* device name via BTRFS_IOC_DEV_INFO instead of whatever was used when mounting it, so that the /dev/root thing is replaced by the actual device behind it.

Or in other words: returning /dev/root to userspace makes mighty little sense, ever, because it's useless to userspace.
Comment 23 Karlson2k 2020-09-05 13:52:44 UTC
(In reply to Lennart Poettering from comment #22)
> I think btrfs should ideally have some ioctl where it returns the *current*
> backing device major/minors of all devices, instead of the device name used
> originally when mounting it, as /dev/root is otherwise not really visible to
> userspace and userspace has no way to figure out what /dev/root actually
> really mapped to retroactively.
> 
> If that's not in the cards, then btrfs should at least return the *current*
> device name via BTRFS_IOC_DEV_INFO instead of whatever was used when
> mounting it, so that the /dev/root thing is replaced by the actual device
> behind it.
> 
> Or in other words: return /dev/root to userspace makes mighty little sense,
> ever, because it's useless to userspace.

According to the log, the device name is known at the moment of mounting:
[    1.675394] kernel: BTRFS: device label rootfs devid 1 transid 29806 /dev/root scanned by swapper/0 (1)
[    1.682853] kernel: BTRFS info (device nvme0n1p2): disk space caching is enabled

Why not return the correct device name?
And why does this problem occur only with BTRFS?
Comment 24 Karlson2k 2020-09-06 12:42:06 UTC
New detailed systemd issue report.
https://github.com/systemd/systemd/issues/16953

They believe that the problem must be fixed in the kernel.
Comment 25 Anand Jain 2020-09-06 19:04:24 UTC
I have not been able to set up the test machine yet.

Could you run 'cat /proc/self/mounts | grep btrfs'
Or
'btrfs fi show -m /' 

What path does it return?

btrfs returns the last device path scanned using 'btrfs device scan <device-path>'. There is no old or new.

So the question is whether the scan using 'btrfs dev scan <device-path>' succeeds or fails. If it fails, there will be a kernel log.
Comment 26 Karlson2k 2020-09-07 09:55:34 UTC
Command results:

# grep -Ee '.+ / ' /proc/self/mounts
/dev/nvme0n1p2 / btrfs rw,noatime,ssd,space_cache,subvolid=257,subvol=/@root 0 0

# grep -Ee '.+ .+ btrfs ' /proc/self/mounts
/dev/nvme0n1p2 / btrfs rw,noatime,ssd,space_cache,subvolid=257,subvol=/@root 0 0
/dev/nvme0n1p2 /home btrfs rw,noatime,ssd,space_cache,subvolid=256,subvol=/@home 0 0
/dev/nvme0n1p2 /srv btrfs rw,noatime,ssd,space_cache,subvolid=258,subvol=/@srv 0 0

# btrfs fi show -m /
Label: 'rootfs'  uuid: 43c77ec4-9cb8-48df-96a3-9024b881b015
        Total devices 1 FS bytes used 23.69GiB
        devid    1 size 440.44GiB used 25.02GiB path /dev/nvme0n1p2


I see no kernel errors in the log.

More info can be found in the old comment https://bugs.freedesktop.org/show_bug.cgi?id=84689#c0

The BTRFS_IOC_DEV_INFO ioctl() returns the path '/dev/root'.
stat() on '/dev/root' fails with errno ENOENT (No such file or directory).
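
The failure mode can be shown with a short sketch: a caller that receives a path from the filesystem has to verify that it names an existing block device before using it. The function name check_device_path is invented for this illustration.

```python
import errno
import os
import stat


def check_device_path(path: str):
    """Return (usable, detail) for a device path reported by a filesystem."""
    try:
        st = os.stat(path)
    except OSError as exc:
        # ENOENT is the error reported for /dev/root in this thread.
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    if not stat.S_ISBLK(st.st_mode):
        return False, "not a block device"
    # For a device node, st_rdev carries the device's own major:minor.
    return True, f"{os.major(st.st_rdev)}:{os.minor(st.st_rdev)}"
```

On the affected systems, check_device_path("/dev/root") would be expected to return (False, "ENOENT"), which matches the "No such file or directory" text in the generator's log message.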
Comment 27 Anand Jain 2020-09-07 10:19:51 UTC
# btrfs fi show -m /
uses BTRFS_IOC_DEV_INFO.

> ioctl() BTRFS_IOC_DEV_INFO return path '/dev/root'.

That's incorrect; the logs above say it returns /dev/nvme0n1p2.
Comment 28 Karlson2k 2020-09-07 11:02:15 UTC
(In reply to Anand Jain from comment #27)
> # btrfs fi show -m /
> uses BTRFS_IOC_DEV_INFO.
> 
> > ioctl() BTRFS_IOC_DEV_INFO return path '/dev/root'.
> 
> That's incorrect, the logs above says it returns /dev/nvme0n1p2.

Sorry, I mixed up two situations in my comment.

After boot and full init by udev, the correct path is returned.

Before the scan of devices by udev, the path is '/dev/root'.
Comment 29 Anand Jain 2020-09-07 12:17:35 UTC
> After boot and full init by udev, the correct path is returned.

> Before scan of devices by udev, the path is '/dev/root'.

So I don't understand how this is a btrfs issue; it returns what has been given to it (by using btrfs device scan <>).

And this problem does not exist if an initramfs is used; maybe it is able to scan the right path at the right time?

Or do we need some error logs showing whether the 'btrfs device scan <>' is failing?

As above, we would fail the 'btrfs device scan <>' if the same device changed its major:minor from "0:18" to "259:0". But there isn't any error log in the kernel.
Comment 30 Lennart Poettering 2020-09-07 15:20:51 UTC
Let me clarify: /dev/root is returned by BTRFS_IOC_DEV_INFO when root= is specified on the kernel command line, pointing to a btrfs file system, and no initrd is involved. In this case it's the kernel itself that mounts the fs (*before userspace is even invoked*), and neither udev nor any "btrfs device scan" has run yet.
Comment 31 Lennart Poettering 2020-09-07 15:23:55 UTC
(btw, I'd much prefer if btrfs would return backing device major/minor numbers in addition to backing device node paths in BTRFS_IOC_DEV_INFO. Is there any chance a few bytes of btrfs_ioctl_dev_info_args' .unused[] field could be spared to encode that?)
Comment 32 Anand Jain 2020-09-07 23:12:43 UTC
How does major:minor help? If it is for debugging, I doubt we are doing the right thing. As of now, the debug logs go to the kernel logs. Please let me know how I can help.
Comment 33 Anand Jain 2020-09-07 23:21:42 UTC
Oh, all the relevant failure logs are already in place too. Can you please upload the full kernel logs somewhere? I expect to see an error, with -EEXIST returned, when the mounted '0:18' (/dev/root) device is scanned by the userland as '259:0' (/dev/nvme..).
Comment 34 Lennart Poettering 2020-09-08 15:10:33 UTC
> How does major:minor help? if it is for debugging I doubt if we are doing the
> right thing. As of now, the debug logs go to kernel logs. pls, let me know.
> how can I help?

It doesn't help in this specific issue. Just generally it's better to reference devices by their major/minor, since they are less dynamic than device node paths, and people have mount namespaces and such, where device node paths lose their meaning, in particular when people bind mount device nodes into containers and such. major/minor pairs generally are somewhat the same all across the OS as long as the device exists.
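
As an illustration of referencing a device by major/minor: on Linux, sysfs exposes each block device under /sys/dev/block/MAJOR:MINOR as a symlink into the device tree, so a pair such as 259:2 can be mapped back to a kernel name without relying on any particular /dev path. The helper below is a sketch with a made-up name; the sysfs_root parameter exists only so it can be exercised against a fake tree.

```python
import os


def devname_from_devnum(major: int, minor: int, sysfs_root: str = "/sys"):
    """Resolve a block device's kernel name from its major:minor via sysfs."""
    link = os.path.join(sysfs_root, "dev", "block", f"{major}:{minor}")
    if not os.path.exists(link):
        return None  # no such block device registered
    # The symlink target ends in the kernel name, e.g. .../nvme0n1/nvme0n1p2
    return os.path.basename(os.path.realpath(link))
```

With the numbers from comment 19, devname_from_devnum(259, 2) would be expected to yield "nvme0n1p2" on that machine. A virtual device such as 0:18 has no entry there, which is why the major/minor pair has to come from the filesystem itself.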
Comment 35 Anand Jain 2020-09-09 03:33:37 UTC
> It doesn't help in this specific issue. 

 OK. We can take it later if needed.


> and people have mount namespaces and such, where device node paths lose their
> meaning, in particular when people bind mount device nodes into containers
> and such. major/minor pairs generally are somewhat the same all across the OS
> as long as the device exists.


Right. As long as the major:minor are the same btrfs can update the device path as requested by the userland (using btrfs dev scan) and the paths shall be returned (when btrfs fi show is used).

To reiterate, as investigated above -

The issue here is that the major:minor is not the same and the fsid is mounted.

It happens only without an initramfs.

The root fs major:minor changes from 0:18 to 259:0 during the boot process.

After the boot process, btrfs fi show / (which uses BTRFS_IOC_DEV_INFO) shows the device path as '259:0' (/dev/nvme..), which is expected and correct. So I don't know where the problem is.

The btrfs userland may change the device path using 'btrfs dev scan <>'.

After the bootup the path is correct.

An update to the device path may fail if the device is mounted _and_ the major:minor is different, since the patch [1] has hardened the related code.

 [1]
 commit a9261d4125c97ce8624e9941b75dee1b43ad5df9
    btrfs: harden agaist duplicate fsid on scanned devices
 merged in mainline v5.0 and back-ported to several stable.

Now, what I want from systemd / people who understand the bootup process is to confirm whether this is causing the issue and to provide enough details and logs about it. In short, I still don't understand what is actually failing.
Comment 36 Anand Jain 2020-09-09 04:46:52 UTC
Here is an example of how to get back the device path that makes sense in the current root/container context.

mount -o bind / /mnt

mount -o bind /dev /mnt/newdev

chroot /mnt

ls -li /newdev/sda
34212 brw-rw---- 1 root disk 8, 0 Sep  9 10:55 /newdev/sda

cd /dev

ln -s /newdev/btrfs-control

btrfs dev scan /newdev/sda

exit

btrfs fi show -m
Label: none  uuid: 834c9656-2795-43d6-85ed-38fa2341a192
        Total devices 1 FS bytes used 192.00KiB
        devid    1 size 3.00GiB used 536.00MiB path /newdev/sda

/newdev/sda does not exist here outside of the chroot, so btrfs dev scan will reset the path to the accessible one, as shown below.

btrfs dev scan

btrfs fi show -m
Label: none  uuid: 834c9656-2795-43d6-85ed-38fa2341a192
        Total devices 1 FS bytes used 192.00KiB
        devid    1 size 3.00GiB used 536.00MiB path /dev/sda

ls -il /dev/sda
34212 brw-rw---- 1 root disk 8, 0 Sep  9 10:55 /dev/sda
Comment 37 Karlson2k 2020-09-09 14:45:04 UTC
Why does it work just fine with EXT4?
Can the same logic be applied to BTRFS?

I've created a test machine with rootfs on EXT4, without an initramfs and with systemd.

Here is info from logs, boot with EXT4:
----------------------
[    0.927896] kernel: EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
[    0.929690] kernel: VFS: Mounted root (ext4 filesystem) readonly on device 8:3.
[    0.931083] kernel: devtmpfs: mounted
....
[    1.063715] systemd[1]: systemd 245 running in system mode. (+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL -XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
....
[    1.079542] systemd-gpt-auto-generator[215]: Disabling root partition auto-detection, root= is defined.
[    1.079572] systemd-gpt-auto-generator[215]: Failed to chase block device '/', ignoring: No such file or directory
....
[    1.080414] systemd-gpt-auto-generator[215]: sda3: Root device /dev/sda.
....
[    1.089351] systemd-gpt-auto-generator[215]: Waiting for device (parent + 3 partitions) to appear...
[    1.091226] systemd-gpt-auto-generator[215]: swap specified in fstab, ignoring.
[    1.091239] systemd-gpt-auto-generator[215]: /boot specified in fstab, ignoring.
-----------------------

# ls -l /dev/sda*
brw-rw---- 1 root disk 8, 0 Sep  9 12:30 /dev/sda
brw-rw---- 1 root disk 8, 1 Sep  9 12:30 /dev/sda1
brw-rw---- 1 root disk 8, 2 Sep  9 12:30 /dev/sda2
brw-rw---- 1 root disk 8, 3 Sep  9 12:30 /dev/sda3


As you can see, major:minor for EXT4 device is 8:3 and remains the same.
Systemd's gpt-auto correctly detects /dev/sda as root device.

Could it be re-implemented with BTRFS?
Comment 38 Lennart Poettering 2020-09-09 14:45:55 UTC
regarding the real-life case where the issue is triggered during boot:

systemd has a little tool "systemd-gpt-auto-generator", which is where the issue is triggered. It's called very very early during boot. All it does is find the backing block device of the root partition, then it checks whether the same disk has a GPT partition table, and if it has and there are specifically marked partitions in it, then it will mount them automatically. Now, the code in this tool that finds the backing block device uses regular stat() .st_dev for most file systems, except for btrfs where it uses BTRFS_IOC_DEV_INFO. This works great if an initrd is used. But it falls apart on systems where no initrd is used. On these systems the user specifies root= on the kernel cmdline, which the kernel itself then interprets and mounts. The kernel then invokes the init system on that just mounted rootfs as PID 1 which calls this "systemd-gpt-auto-generator" very quickly. In this scenario BTRFS_IOC_DEV_INFO returns /dev/root as the backing device of the root fs. If the tool then goes on and tries to open /dev/root this will necessarily fail, since such a device node does not actually exist in userspace. The tool is invoked *before* udev is invoked, and *before* any btrfs assembly/scan commands are invoked, i.e. very very early after the kernel invoked userspace for the first time.

All we are asking for is that it doesn't return /dev/root to userspace ever, since that's not a device node that makes sense in userspace, it doesn't exist in /dev as device node and it's just a magic alias that userspace cannot reasonably resolve.
Comment 39 Lennart Poettering 2020-09-09 14:50:02 UTC
(In reply to Karlson2k from comment #37)
> Why it works just fine with EXT4?
> Can same logic be applied to BTRFS?

btrfs returns a virtual device node in stat()'s .st_dev field that is supposed to indicate the backing block device of a file system. A virtual device node means it has a major of 0 and some dynamically allocated minor. The reason that it returns this virtual device node is that unlike classic file systems such as xfs or ext4, btrfs file systems can be backed by multiple devices, and devices can be removed and added any time. This means returning a single, definite backing device makes semantically little sense. That's why the BTRFS_IOC_DEV_INFO ioctl exists: it returns a list of backing devices, i.e. more than just one. Except that it returns useless data (i.e. "/dev/root") when called on the root fs (as configured via root= on the kernel cmdline) very very early on if no initrd is used.
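
The distinction can be observed from userspace by decoding .st_dev. The sketch below (function names are illustrative only) treats a major of 0 as the sign that st_dev does not name a real block device and that the filesystem has to be asked instead.

```python
import os


def is_virtual_dev(dev: int) -> bool:
    """Major 0 marks a virtual device rather than a real block device."""
    return os.major(dev) == 0


def describe_st_dev(path: str = "/") -> str:
    """Format stat().st_dev of a path as 'MAJOR:MINOR (kind)'."""
    dev = os.stat(path).st_dev
    kind = "virtual, ask the filesystem" if is_virtual_dev(dev) else "real block device"
    return f"{os.major(dev)}:{os.minor(dev)} ({kind})"


# Device numbers reported earlier in this thread.
print(is_virtual_dev(os.makedev(0, 18)))  # btrfs root from comment 19
print(is_virtual_dev(os.makedev(8, 3)))   # ext4 root from comment 37
```

For the ext4 case the st_dev value is already the answer, whereas for btrfs the caller has to fall back to a filesystem-specific query such as BTRFS_IOC_DEV_INFO.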
Comment 40 Karlson2k 2020-09-09 16:15:26 UTC
(In reply to Lennart Poettering from comment #39)

> 
> btrfs returns a virtual device node in stat()'s .st_dev field that is
> supposed to indicate the backing block device of a file system. A virtual
> device node means it has a major of 0 and some dynamically allocated minor.
> The reason that it returns this virtual device node is that unlike classic
> file systems such as xfs or ext4, btrfs file systems can back multiple
> devices, and devices can be removed and added any time. This means returning
> a single, definite backing device makes semantically little sense. That's
> why the BTRFS_IOC_DEV_INFO ioctl exists: it returns a list of backing
> devices, i.e. more than just one. 

Thanks for the excellent, detailed clarification.
It seems the best way to solve it is to provide the major:minor along with the device path.

Another solution is for .st_dev to report the real device when btrfs is backed by a single device.
