Bug 89721
| Summary: | Unable to determine root device id from userspace after booting without initramfs | | |
|---|---|---|---|
| Product: | File System | Reporter: | maxtram95 |
| Component: | btrfs | Assignee: | Josef Bacik (josef) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | akimasa, anandsuveer, ao, bugzilla.kernel.org, chris.bainbridge, james05, k2k, mads, maxtram95, mzxreary, pastas4, romain, sathnaga, schmidicom, sir.suriv, sophietheopossum, szg00000 |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.18.0 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
maxtram95
2014-12-14 13:04:27 UTC
I understand not many people boot Linux without initramfs these days, but can someone look into this please :)

Confirmed in 4.17.0, 4.17.15, 4.18.3. I am now unable to boot my system, it worked fine in 4.16.0. is the workaround to have an initramfs?

(In reply to general from comment #2)
I don't think so, since this is a 4-year-old bug and it doesn't affect kernel boot, only userspace.

i got 4.9.76 and 4.16.0 to work but that's it. i would have thought seeing as btrfs is under a lot of attention lately, the age of the bug wouldn't matter to them? perhaps a useful insight might be more info on affected users as the reporter was a gentoo user, i am a gentoo user, you wouldn't happen to be a gentoo user too.... would you? (assuming you got the problem too) i could provide logs as well but no one has asked for them so i don't know if that would help

still present on 4.20.0

I am affected by this on 4.19.27 using systemd 239 and not using an initramfs.

(In reply to Erik Quaeghebeur from comment #6)
> I am affected by this on 4.19.27 using systemd 239 and not using an
> initramfs.

Some log excerpts:
mrt 17 21:23:15 hostname systemd-gpt-auto-generator[108]: Failed to determine block device of root file system: No such file or directory
[...]
mrt 17 21:23:15 hostname systemd[100]: /lib64/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
[...]
mrt 17 21:23:15 hostname kernel: BTRFS info (device sda3): device fsid d9bb1e6b-6c6e-4eeb-99b7-2167a21c84ec devid 1 moved old:/dev/root new:/dev/sda3
So BTRFS does seem to correct this at some point, but too late.

Happens to me too on Kernel 5.1.15 and systemd 242 without initramfs.
[ +0.017444] systemd-gpt-auto-generator[272]: Failed to determine block device of root file system: No such file or directory
[ +0.005031] systemd[265]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
[ +0.061556] BTRFS info (device sda4): device fsid 6adbd55d-f1be-4960-b283-87b31486a690 devid 1 moved old:/dev/root new:/dev/sda4

Still present in 5.7.7
[ 2.236089] systemd-gpt-auto-generator[302]: Failed to determine block device of root file system: No such file or directory
[ 2.237822] systemd[296]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
....
[ 2.760648] BTRFS info (device nvme0n1p2): device fsid 43c74ec4-9cb8-4bdf-96a3-9024b881b015 devid 1 moved old:/dev/root new:/dev/nvme0n1p2
Gentoo, root on BTRFS, no initramfs.

Additional details:
I'm using a kernel with 'root=PARTUUID=xxxx' to boot without initramfs. While the kernel is able to identify the device and find the required partition, it behaves strangely:
[ 1.662010] BTRFS: device label rootfs devid 1 transid 29269 /dev/root scanned by swapper/0 (1)
[ 1.662179] BTRFS info (device nvme0n1p2): disk space caching is enabled
[ 1.662180] BTRFS info (device nvme0n1p2): has skinny extents
[ 1.668038] BTRFS info (device nvme0n1p2): enabling ssd optimizations
[ 1.669627] VFS: Mounted root (btrfs filesystem) readonly on device 0:18.
[ 1.670006] devtmpfs: mounted
[ 1.670218] Freeing unused kernel image (initmem) memory: 948K
Note the wrong (and non-existing) device path '/dev/root' in the first line, while the device is correctly identified as 'nvme0n1p2'.
Later the device path is corrected:
[ 2.157417] BTRFS info (device nvme0n1p2): device fsid xxxxxx-x-x-x-xxxxxx devid 1 moved old:/dev/root new:/dev/nvme0n1p2
But it is too late as systemd already produced errors.
To conclude: It seems that when the kernel finds BTRFS on its own (without initramfs), it assigns the wrong, non-existing path '/dev/root' instead of the correct path.
Related:
https://github.com/systemd/systemd/issues/12615
https://bugs.freedesktop.org/show_bug.cgi?id=84689

[ 1.662010] BTRFS: device label rootfs devid 1 transid 29269 /dev/root scanned by swapper/0 (1)
'swapper/0 (1)' is calling btrfs device scan /dev/root to scan the device.
Can we look into swapper and figure out why?

According to the docs, '/dev/root' is a temporary mount point alias used while the rootfs is not available or read-only. That explains why '/dev/root' is assigned BEFORE mounting the rootfs and why an initramfs solves the situation (the initramfs is mounted as rootfs so /dev is available).
However it's not clear why this kind of problem arises only with BTRFS. There are no similar reports for any other FS type. It happens only with BTRFS.

There isn't a bug or a problem that I can understand here. About the message
mrt 17 21:23:15 hostname kernel: BTRFS info (device sda3): device fsid d9bb1e6b-6c6e-4eeb-99b7-2167a21c84ec devid 1 moved old:/dev/root new:/dev/sda3
That's fine because it just indicates someone (systemd?) scanned the new path to the same fsid which is already mounted. These things are noticeable in btrfs because they are written to the system log. But is there something that is not working?

The problem is with systemd:
[ 1.744161] systemd-gpt-auto-generator[288]: Failed to determine block device of root file system: No such file or directory
[ 1.754426] systemd[280]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
Only happens with BTRFS and only without initramfs.

Do you get anything in the kernel messages when this happens? I think systemd is using 'btrfs device scan /dev/nvme...?' when it reported 'Failed to determine block device of root file system: No such file or directory'. I guess what might be happening is that.. the device major:min number of /dev/root and /dev/nvme.. are different is it possible to confirm that?

The issue was that /dev/root does not exist when the kernel does not boot from initramfs, but btrfs still returns "/dev/root" when queried:
"Hmm, so this appears to be a btrfs kernel bug. it should not return the useless "/dev/root" string as a volume device when asked via the BTRFS_IOC_DEV_INFO ioctl.
it should instead return a proper device name, that we can actually make use of from userspace." - Lennart Poettering https://bugs.freedesktop.org/show_bug.cgi?id=84689#c3 More messages from the log (with enabled debug log for systemd): ---------------------------------------------------------- [ 0.000000] kernel: microcode: microcode updated early to revision 0xd6, date = 2020-04-23 [ 0.000000] kernel: Linux version 5.7.19-gentoo (root@host) (gcc version 9.3.0 (Gentoo 9.3.0-r1 p3), GNU ld (Gentoo 2.33.1 p2) 2.33.1) #1 SMP Sun Aug 30 19:36:07 MSK 2020 .... [ 1.675394] kernel: BTRFS: device label rootfs devid 1 transid 29806 /dev/root scanned by swapper/0 (1) [ 1.682853] kernel: BTRFS info (device nvme0n1p2): disk space caching is enabled [ 1.682854] kernel: BTRFS info (device nvme0n1p2): has skinny extents [ 1.689825] kernel: BTRFS info (device nvme0n1p2): enabling ssd optimizations [ 1.690632] kernel: VFS: Mounted root (btrfs filesystem) readonly on device 0:18. [ 1.690904] kernel: devtmpfs: mounted .... [ 1.727637] systemd[1]: systemd 245 running in system mode. (+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL -XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified) .... [ 1.744141] systemd-gpt-auto-generator[288]: Disabling root partition auto-detection, root= is defined. [ 1.744161] systemd-gpt-auto-generator[288]: Failed to determine block device of root file system: No such file or directory .... [ 1.754426] systemd[280]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1 .... [ 1.807716] BTRFS info (device nvme0n1p2): disk space caching is enabled .... [ 2.102263] systemd-remount-fs[303]: Remounting /... .... [ 2.134333] systemd[1]: Started udev Kernel Device Manager. .... [ 2.169856] BTRFS info (device nvme0n1p2): device fsid 43c77ec4-9cb8-48df-96a3-9024b881b015 devid 1 moved old:/dev/root new:/dev/nvme0n1p2 .... 
[ 2.160822] systemd[1]: dev-nvme0n1.device: Changed dead -> plugged [ 2.160859] systemd[1]: sys-devices-pci0000:00-0000:00:1d.0-0000:02:00.0-nvme-nvme0-nvme0n1.device: Changed dead -> plugged [ 2.169856] kernel: BTRFS info (device nvme0n1p2): device fsid 43c77ec4-9cb8-48df-96a3-9024b881b015 devid 1 moved old:/dev/root new:/dev/nvme0n1p2 .... [ 2.168393] systemd[1]: dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device: Changed dead -> plugged [ 2.168420] systemd[1]: dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device: Job 27 dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device/start finished, result=done [ 2.168451] systemd[1]: Found device Viper M.2 VPR100 boot. [ 2.168484] systemd[1]: dev-nvme0n1p1.device: Changed dead -> plugged .... [ 2.172881] systemd[1]: dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device: Changed dead -> plugged [ 2.172905] systemd[1]: dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device: Job 22 dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device/start finished, result=done [ 2.172926] systemd[1]: Found device Viper M.2 VPR100 rootfs. 
[ 2.172951] systemd[1]: dev-disk-by\x2did-nvme\x2deui.6479a72782353833\x2dpart2.device: Changed dead -> plugged [ 2.172970] systemd[1]: dev-disk-by\x2dpath-pci\x2d0000:02:00.0\x2dnvme\x2d1\x2dpart2.device: Changed dead -> plugged [ 2.172993] systemd[1]: dev-disk-by\x2duuid-43c77ec4\x2d9cb8\x2d48df\x2d96a3\x2d9024b881b015.device: Changed dead -> plugged [ 2.173018] systemd[1]: dev-disk-by\x2dpartlabel-rootfs.device: Changed dead -> plugged [ 2.173053] systemd[1]: sys-devices-pci0000:00-0000:00:1d.0-0000:02:00.0-nvme-nvme0-nvme0n1-nvme0n1p2.device: Changed dead -> plugged [ 2.173089] systemd[1]: dev-nvme0n1p2.device: Changed tentative -> plugged ---------------------------------------------------------- Rootfs block device path changed before udev scanned. I can guess that rootfs device has assigned 4:0 number (see https://www.kernel.org/doc/html/latest/admin-guide/devices.html) (In reply to Karlson2k from comment #17) > More messages from the log (with enabled debug log for systemd): [skipped] Sorry, wrong lines ordering (check timestamps). Here are the correct one: ---------------------------------------------------------- [ 0.000000] kernel: microcode: microcode updated early to revision 0xd6, date = 2020-04-23 [ 0.000000] kernel: Linux version 5.7.19-gentoo (root@host) (gcc version 9.3.0 (Gentoo 9.3.0-r1 p3), GNU ld (Gentoo 2.33.1 p2) 2.33.1) #1 SMP Sun Aug 30 19:36:07 MSK 2020 .... [ 1.675394] kernel: BTRFS: device label rootfs devid 1 transid 29806 /dev/root scanned by swapper/0 (1) [ 1.682853] kernel: BTRFS info (device nvme0n1p2): disk space caching is enabled [ 1.682854] kernel: BTRFS info (device nvme0n1p2): has skinny extents [ 1.689825] kernel: BTRFS info (device nvme0n1p2): enabling ssd optimizations [ 1.690632] kernel: VFS: Mounted root (btrfs filesystem) readonly on device 0:18. [ 1.690904] kernel: devtmpfs: mounted .... [ 1.727637] systemd[1]: systemd 245 running in system mode. 
(+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL -XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified) .... [ 1.744141] systemd-gpt-auto-generator[288]: Disabling root partition auto-detection, root= is defined. [ 1.744161] systemd-gpt-auto-generator[288]: Failed to determine block device of root file system: No such file or directory .... [ 1.754426] systemd[280]: /lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1 .... [ 1.807716] BTRFS info (device nvme0n1p2): disk space caching is enabled .... [ 2.102263] systemd-remount-fs[303]: Remounting /... .... [ 2.134333] systemd[1]: Started udev Kernel Device Manager. .... [ 2.160822] systemd[1]: dev-nvme0n1.device: Changed dead -> plugged [ 2.160859] systemd[1]: sys-devices-pci0000:00-0000:00:1d.0-0000:02:00.0-nvme-nvme0-nvme0n1.device: Changed dead -> plugged [ 2.169856] kernel: BTRFS info (device nvme0n1p2): device fsid 43c77ec4-9cb8-48df-96a3-9024b881b015 devid 1 moved old:/dev/root new:/dev/nvme0n1p2 .... [ 2.168393] systemd[1]: dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device: Changed dead -> plugged [ 2.168420] systemd[1]: dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device: Job 27 dev-disk-by\x2dpartuuid-5c4e98cb\x2ded7f\x2d435f\x2d83a9\x2d0b78b567b4f0.device/start finished, result=done [ 2.168451] systemd[1]: Found device Viper M.2 VPR100 boot. [ 2.168484] systemd[1]: dev-nvme0n1p1.device: Changed dead -> plugged .... [ 2.172881] systemd[1]: dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device: Changed dead -> plugged [ 2.172905] systemd[1]: dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device: Job 22 dev-disk-by\x2dpartuuid-97f8aae5\x2d86b6\x2d40b5\x2d9ffb\x2d496d80551b37.device/start finished, result=done [ 2.172926] systemd[1]: Found device Viper M.2 VPR100 rootfs. 
[ 2.172951] systemd[1]: dev-disk-by\x2did-nvme\x2deui.6479a72782353833\x2dpart2.device: Changed dead -> plugged
[ 2.172970] systemd[1]: dev-disk-by\x2dpath-pci\x2d0000:02:00.0\x2dnvme\x2d1\x2dpart2.device: Changed dead -> plugged
[ 2.172993] systemd[1]: dev-disk-by\x2duuid-43c77ec4\x2d9cb8\x2d48df\x2d96a3\x2d9024b881b015.device: Changed dead -> plugged
[ 2.173018] systemd[1]: dev-disk-by\x2dpartlabel-rootfs.device: Changed dead -> plugged
[ 2.173053] systemd[1]: sys-devices-pci0000:00-0000:00:1d.0-0000:02:00.0-nvme-nvme0-nvme0n1-nvme0n1p2.device: Changed dead -> plugged
[ 2.173089] systemd[1]: dev-nvme0n1p2.device: Changed tentative -> plugged
----------------------------------------------------------

> Rootfs block device path changed before udev scanned.

Not correct. Rootfs mapped to right path after udev scan.

> I can guess that rootfs device has assigned 4:0 number (see
> https://www.kernel.org/doc/html/latest/admin-guide/devices.html)

(In reply to Anand Jain from comment #15)
> I guess what might be happening is that.. the device major:min number of
> /dev/root and /dev/nvme.. are different is it possible to confirm that?

From the log:
[ 1.675394] kernel: BTRFS: device label rootfs devid 1 transid 29806 /dev/root scanned by swapper/0 (1)
[ 1.682853] kernel: BTRFS info (device nvme0n1p2): disk space caching is enabled
....
[ 1.690632] kernel: VFS: Mounted root (btrfs filesystem) readonly on device 0:18.

After boot:
# LANG=en ls -l /dev/nvme0n1*
brw-rw---- 1 root disk 259, 0 Sep 3 11:42 /dev/nvme0n1
brw-rw---- 1 root disk 259, 1 Sep 3 11:42 /dev/nvme0n1p1
brw-rw---- 1 root disk 259, 2 Sep 3 11:42 /dev/nvme0n1p2
brw-rw---- 1 root disk 259, 3 Sep 3 11:42 /dev/nvme0n1p3

So, looks like it was 0:18 and changed to 259:2.

Few things are starting to make sense to me, it appears that device bdev is changing the major:minor during bootup from "0:18" to "259:0".

Btrfs will successfully change the device path for example /dev/sdx to /dev/mapper/... as long as both of them pointing to the same block device. But here the mounted device appears as another device with the FS dd-ed to it. It is fair to fail because the FS is already mounted and the device is already opened.

But it is not too clear why this problem does not happen in the case of initramfs how does it manage.

So when BTRFS_IOC_DEV_INFO returns /dev/root which does not exist in the system anymore. The systemd logs the below message.

[ 1.744161] systemd-gpt-auto-generator[288]: Failed to determine block device of root file system: No such file or directory

To reproduce this do I just need to boot gentoo? But how do I make sure I don't use initramfs. Do you have a reproducer?

Thanks.

(In reply to Anand Jain from comment #20)
> Few things are starting to make sense to me, it appears that device bdev is
> changing the major:minor during bootup from "0:18" to "259:0".

That's correct.

> Btrfs will successfully change the device path for example /dev/sdx to
> /dev/mapper/... as long as both of them pointing to the same block device.
> But here the mounted device appears as another device with the FS dd-ed to
> it. It is fair to fail because the FS is already mounted and the device is
> already opened.

To be clear: the system does NOT fail to boot. The only thing that fails is systemd's gpt-auto-generator. The system works fine and the device path is changed as you see in the logs.

> But it is not too clear why this problem does not happen in the case of
> initramfs how does it manage.

The initramfs mounts /dev and scans devices when devtmpfs is available. So the block device is detected at the right path initially.

> So when BTRFS_IOC_DEV_INFO returns /dev/root which does not exist in the
> system anymore. The systemd logs the below message.
>
> [ 1.744161] systemd-gpt-auto-generator[288]: Failed to determine block
> device of root file system: No such file or directory
>
> To reproduce this do I just need to boot gentoo? But how do I make sure I
> don't use initramfs. Do you have a reproducer?

For me gentoo is the simplest way to reproduce. But you can do it with any GNU/Linux distribution. Arch, Debian/Ubuntu, Fedora/CentOS are fine.

Steps to reproduce:
0. Ensure that you have a machine with UEFI and your drive is GPT-partitioned with a BTRFS root partition.
1. Build (obtain) a kernel with drivers for your HDD/SSD compiled-in (modules will not work as you need to mount your HDD/SSD before modules are loaded).
2. Create a boot entry (preferably in the GRUB2 loader) without initramfs and with the kernel command line "root=PARTUUID=xxx-xx-xx" (substitute the PARTUUID of your root partition).
3. Check errors in the systemd log.

I saw similar reports with "root=/dev/...", but I checked only with PARTUUID identification.

I think btrfs should ideally have some ioctl where it returns the *current* backing device major/minors of all devices, instead of the device name used originally when mounting it, as /dev/root is otherwise not really visible to userspace and userspace has no way to figure out what /dev/root actually really mapped to retroactively.

If that's not in the cards, then btrfs should at least return the *current* device name via BTRFS_IOC_DEV_INFO instead of whatever was used when mounting it, so that the /dev/root thing is replaced by the actual device behind it.

Or in other words: return /dev/root to userspace makes mighty little sense, ever, because it's useless to userspace.

(In reply to Lennart Poettering from comment #22)
> I think btrfs should ideally have some ioctl where it returns the *current*
> backing device major/minors of all devices, instead of the device name used
> originally when mounting it, as /dev/root is otherwise not really visible to
> userspace and userspace has no way to figure out what /dev/root actually
> really mapped to retroactively.
>
> If that's not in the cards, then btrfs should at least return the *current*
> device name via BTRFS_IOC_DEV_INFO instead of whatever was used when
> mounting it, so that the /dev/root thing is replaced by the actual device
> behind it.
>
> Or in other words: return /dev/root to userspace makes mighty little sense,
> ever, because it's useless to userspace.

According to the log, the device name is known at the moment of mounting:
[ 1.675394] kernel: BTRFS: device label rootfs devid 1 transid 29806 /dev/root scanned by swapper/0 (1)
[ 1.682853] kernel: BTRFS info (device nvme0n1p2): disk space caching is enabled

Why not return the correct device name?
And why does this problem occur only with BTRFS?

New detailed systemd issue report.
https://github.com/systemd/systemd/issues/16953
They believe that the problem must be fixed in the kernel.

I have not been able to set the test machine yet.
Could you run
'cat /proc/self/mounts | grep btrfs'
Or
'btrfs fi show -m /'
What path does it return?
btrfs returns the last scanned device path using 'btrfs device scan <device-path>'. There is no old or new. So the question is whether the scan using 'btrfs dev scan <device-path>' is successful or failure? If it fails, there will be a kernel log.

Commands results:
# grep -Ee '.+ / ' /proc/self/mounts
/dev/nvme0n1p2 / btrfs rw,noatime,ssd,space_cache,subvolid=257,subvol=/@root 0 0
# grep -Ee '.+ .+ btrfs ' /proc/self/mounts
/dev/nvme0n1p2 / btrfs rw,noatime,ssd,space_cache,subvolid=257,subvol=/@root 0 0
/dev/nvme0n1p2 /home btrfs rw,noatime,ssd,space_cache,subvolid=256,subvol=/@home 0 0
/dev/nvme0n1p2 /srv btrfs rw,noatime,ssd,space_cache,subvolid=258,subvol=/@srv 0 0
# btrfs fi show -m /
Label: 'rootfs' uuid: 43c77ec4-9cb8-48df-96a3-9024b881b015
Total devices 1 FS bytes used 23.69GiB
devid 1 size 440.44GiB used 25.02GiB path /dev/nvme0n1p2

I see no kernel errors in log.
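The mount table check above can also be scripted, which makes it easier to compare the device path reported early in boot with the one reported later. The following is only a minimal sketch, assuming the usual /proc/self/mounts field order (device, mount point, file system type, options, two numeric fields); the helper name `btrfs_mounts` is hypothetical and not part of any tool mentioned in this report.

```python
from typing import List, Tuple


def btrfs_mounts(mounts_text: str) -> List[Tuple[str, str]]:
    """Return (device, mount_point) pairs for btrfs entries in mounts-style text."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Assumed layout: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 3 and fields[2] == "btrfs":
            result.append((fields[0], fields[1]))
    return result
```

On a running system the input would be the contents of /proc/self/mounts, for example `btrfs_mounts(open("/proc/self/mounts").read())`.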
More info could be found in old comment https://bugs.freedesktop.org/show_bug.cgi?id=84689#c0
ioctl() BTRFS_IOC_DEV_INFO return path '/dev/root'.
stat() for '/dev/root' return error with errno ENOENT (No such file or directory).

# btrfs fi show -m /
uses BTRFS_IOC_DEV_INFO.

> ioctl() BTRFS_IOC_DEV_INFO return path '/dev/root'.

That's incorrect, the logs above says it returns /dev/nvme0n1p2.

(In reply to Anand Jain from comment #27)
> # btrfs fi show -m /
> uses BTRFS_IOC_DEV_INFO.
>
> > ioctl() BTRFS_IOC_DEV_INFO return path '/dev/root'.
>
> That's incorrect, the logs above says it returns /dev/nvme0n1p2.

Sorry, I mixed two situation in my comment.
After boot and full init by udev, the correct path is returned.
Before scan of devices by udev, the path is '/dev/root'.

> After boot and full init by udev, the correct path is returned.
> Before scan of devices by udev, the path is '/dev/root'.

So I don't understand how is this a btrfs issue, it returns what's been given to it (by using btrfs device scan <>). And this problem does not exist if initramfs is used, maybe it is able to scan the right path at the right time? Or we need some error logs whether the 'btrfs device scan <>' is failing? As above, we would fail the 'btrfs device scan <>' if the same device changed its major:minor from "0:18" to "259:0". But there isn't any error log in the kernel.

let me clarify: /dev/root is returned by BTRFS_IOC_DEV_INFO when root= is specified on the kernel command line, pointing to a btrfs file system, when no initrd is involved. In this case it's the kernel itself that mounts the fs (*before userspace is even invoked*), and no udev and no "btrfs device scan" have ever been issued.

(btw, I'd much prefer if btrfs would return backing device major/minor numbers in addition to backing device node paths in BTRFS_IOC_DEV_INFO. Is there any chance a few bytes of btrfs_ioctl_dev_info_args' .unused[] field could be spared to encode that?)

How does major:minor help? if it is for debugging I doubt if we are doing the right thing. As of now, the debug logs go to kernel logs. pls, let me know. how can I help?

Oh all relevant fail logs are already in place too. Can you please upload full kernel logs somewhere? I expect to see an error when the mounted '0:18' (/dev/root) device is scanned by the userland as '259:0' (/dev/nvme..) and return -EEXIST.

> How does major:minor help? if it is for debugging I doubt if we are doing the
> right thing. As of now, the debug logs go to kernel logs. pls, let me know.
> how can I help?

It doesn't help in this specific issue. Just generally it's better to reference devices by their major/minor, since they are less dynamic than device node paths, and people have mount namespaces and such, where device node paths lose their meaning, in particular when people bind mount device nodes into containers and such. major/minor pairs generally are somewhat the same all across the OS as long as the device exists.
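
To make the major/minor point concrete, here is a small Python sketch that splits a device number into its major and minor parts. It assumes the Linux device-number encoding exposed by `os.major()` and `os.minor()`; the helper names `describe_dev` and `root_dev` are hypothetical. A major of 0 denotes an unnamed (anonymous) device number rather than a real block device, which matches the "0:18" reported for the btrfs root in the logs above.

```python
import os
from typing import Tuple


def describe_dev(dev: int) -> Tuple[int, int, bool]:
    """Split a device number (e.g. os.stat(path).st_dev) into (major, minor, anonymous).

    The third element is True when major == 0, i.e. the number does not
    refer to a real block device node.
    """
    major, minor = os.major(dev), os.minor(dev)
    return major, minor, major == 0


def root_dev() -> Tuple[int, int, bool]:
    """Describe the device number that stat() reports for '/'."""
    return describe_dev(os.stat("/").st_dev)
```

With the values from this thread, `describe_dev(os.makedev(0, 18))` flags the number as anonymous, while `describe_dev(os.makedev(259, 2))` does not.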

> It doesn't help in this specific issue.

OK. We can take it later if needed.

> and people have mount namespaces and such, where device node paths lose their
> meaning, in particular when people bind mount device nodes into containers
> and such. major/minor pairs generally are somewhat the same all across the OS
> as long as the device exists.

Right. As long as the major:minor are the same btrfs can update the device path as requested by the userland (using btrfs dev scan) and the paths shall be returned (when btrfs fi show is used).

To reiterate, as investigated above:
The issue here is that the major:minor is not the same and the fsid is mounted.
Happens only without initramfs.
Root fs major:minor changes from 0:18 to 259:0 during the boot process.
After the boot process, the btrfs fi show / (which uses BTRFS_IOC_DEV_INFO) shows the device path as '259:0' (/dev/nvme..) which is expected and correct.

So I don't know where the problem is. The btrfs userland may change the device path using 'btrfs dev scan <>'. After the bootup the path is correct.

Update to the device path may fail if the device is mounted _and_ the major:minor is different since the patch [1] has hardened the related code

[1]
commit a9261d4125c97ce8624e9941b75dee1b43ad5df9
btrfs: harden agaist duplicate fsid on scanned devices

merged in mainline v5.0 and back-ported to several stable.

Now, what I want from systemd / people who understand the bootup process is to confirm whether this is causing the issue and to provide enough details and logs about it. In effect, I still don't understand what is failing.

Here is an example of how to get back the device path that makes sense in the current root/container context.

mount -o bind / /mnt
mount -o bind /dev /mnt/newdev
chroot /mnt
ls -li /newdev/sda
34212 brw-rw---- 1 root disk 8, 0 Sep 9 10:55 /newdev/sda
cd /dev
ln -s /newdev/btrfs-control
btrfs dev scan /newdev/sda
exit
btrfs fi show -m
Label: none uuid: 834c9656-2795-43d6-85ed-38fa2341a192
Total devices 1 FS bytes used 192.00KiB
devid 1 size 3.00GiB used 536.00MiB path /newdev/sda

/newdev/sda does not exist here outside of chroot. So the btrfs dev scan will reset the path to the accessible path. As shown below.

btrfs dev scan
btrfs fi show -m
Label: none uuid: 834c9656-2795-43d6-85ed-38fa2341a192
Total devices 1 FS bytes used 192.00KiB
devid 1 size 3.00GiB used 536.00MiB path /dev/sda

ls -il /dev/sda
34212 brw-rw---- 1 root disk 8, 0 Sep 9 10:55 /dev/sda

Why it works just fine with EXT4?
Can same logic be applied to BTRFS?

I've created a test machine with rootfs on EXT4, without initramfs and with systemd.
Here is info from logs, boot with EXT4:
----------------------
[ 0.927896] kernel: EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
[ 0.929690] kernel: VFS: Mounted root (ext4 filesystem) readonly on device 8:3.
[ 0.931083] kernel: devtmpfs: mounted
....
[ 1.063715] systemd[1]: systemd 245 running in system mode. (+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL -XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
....
[ 1.079542] systemd-gpt-auto-generator[215]: Disabling root partition auto-detection, root= is defined.
[ 1.079572] systemd-gpt-auto-generator[215]: Failed to chase block device '/', ignoring: No such file or directory
....
[ 1.080414] systemd-gpt-auto-generator[215]: sda3: Root device /dev/sda.
....
[ 1.089351] systemd-gpt-auto-generator[215]: Waiting for device (parent + 3 partitions) to appear...
[ 1.091226] systemd-gpt-auto-generator[215]: swap specified in fstab, ignoring.
[ 1.091239] systemd-gpt-auto-generator[215]: /boot specified in fstab, ignoring.
-----------------------
# ls -l /dev/sda*
brw-rw---- 1 root disk 8, 0 Sep 9 12:30 /dev/sda
brw-rw---- 1 root disk 8, 1 Sep 9 12:30 /dev/sda1
brw-rw---- 1 root disk 8, 2 Sep 9 12:30 /dev/sda2
brw-rw---- 1 root disk 8, 3 Sep 9 12:30 /dev/sda3

As you can see, major:minor for the EXT4 device is 8:3 and remains the same.
Systemd's gpt-auto correctly detects /dev/sda as the root device.
Could it be re-implemented with BTRFS?

regarding the real-life case where the issue is triggered during boot:

systemd has a little tool "systemd-gpt-auto-generator", which is where the issue is triggered. It's called very very early during boot. All it does is find the backing block device of the root partition, then it looks if the same disk has a GPT partition table, and if it has and there are specifically marked partitions in it, then it will mount them automatically.

Now, the code in this tool that finds the backing block device uses regular stat() .st_dev for most file systems, except for btrfs where it uses BTRFS_IOC_DEV_INFO. This works great if an initrd is used. But it falls apart on systems where no initrd is used. On these systems the user specifies root= on the kernel cmdline, which the kernel itself then interprets and mounts. The kernel then invokes the init system on that just mounted rootfs as PID 1 which calls this "systemd-gpt-auto-generator" very quickly. In this scenario BTRFS_IOC_DEV_INFO returns /dev/root as backing device of the root fs. If the tool then goes on and tries to open /dev/root this will necessarily fail, since such a device node does not actually exist in userspace.

The tool is invoked *before* udev is invoked, and *before* any btrfs assembly/scan commands are invoked, i.e. very very early after the kernel invoked userspace for the first time.
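
The lookup described above can be approximated from Python for experimentation. This is not systemd's code, only a sketch under stated assumptions: the `struct btrfs_ioctl_dev_info_args` layout is taken from the kernel's `linux/btrfs.h` UAPI header (4096 bytes in total), and the ioctl number uses the generic Linux `_IOWR` encoding, which differs on some architectures. All helper names are hypothetical.

```python
import fcntl
import os
import struct

# Assumed layout (linux/btrfs.h): u64 devid, u8 uuid[16], u64 bytes_used,
# u64 total_bytes, 3032 reserved bytes, u8 path[1024].
DEV_INFO_FMT = "=Q16sQQ3032x1024s"
DEV_INFO_SIZE = struct.calcsize(DEV_INFO_FMT)

# _IOWR(0x94, 30, struct btrfs_ioctl_dev_info_args) in the generic encoding:
# direction << 30 | size << 16 | magic << 8 | nr (assumption: x86/arm style).
BTRFS_IOC_DEV_INFO = (3 << 30) | (DEV_INFO_SIZE << 16) | (0x94 << 8) | 30


def parse_dev_info(buf: bytes) -> dict:
    """Decode a raw dev-info buffer into devid, sizes and the NUL-terminated path."""
    devid, _uuid, used, total, path = struct.unpack(DEV_INFO_FMT, buf)
    return {"devid": devid, "bytes_used": used, "total_bytes": total,
            "path": path.split(b"\0", 1)[0].decode()}


def query_dev_info(mount_point: str, devid: int = 1) -> dict:
    """Ask btrfs for the backing device of `mount_point` (needs a btrfs mount)."""
    buf = bytearray(DEV_INFO_SIZE)
    struct.pack_into("=Q", buf, 0, devid)  # select the device by devid
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_DEV_INFO, buf)  # the kernel fills buf in place
    finally:
        os.close(fd)
    return parse_dev_info(bytes(buf))


def usable_in_userspace(path: str) -> bool:
    """True if the returned device path can actually be stat()ed."""
    try:
        os.stat(path)
        return True
    except FileNotFoundError:
        return False
```

In the failing scenario, `query_dev_info("/")["path"]` would be expected to yield "/dev/root" and `usable_in_userspace()` to return False for it, corresponding to the ENOENT that the generator logs.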

All we are asking for is that it doesn't return /dev/root to userspace ever, since that's not a device node that makes sense in userspace, it doesn't exist in /dev as device node and it's just a magic alias that userspace cannot reasonably resolve.

(In reply to Karlson2k from comment #37)
> Why it works just fine with EXT4?
> Can same logic be applied to BTRFS?

btrfs returns a virtual device node in stat()'s .st_dev field that is supposed to indicate the backing block device of a file system. A virtual device node means it has a major of 0 and some dynamically allocated minor. The reason that it returns this virtual device node is that unlike classic file systems such as xfs or ext4, btrfs file systems can back multiple devices, and devices can be removed and added any time. This means returning a single, definite backing device makes semantically little sense. That's why the BTRFS_IOC_DEV_INFO ioctl exists: it returns a list of backing devices, i.e. more than just one. Except that it returns useless data (i.e. "/dev/root") when called on the root fs (as configured via root= on the kernel cmdline) very very early on if no initrd is used.

(In reply to Lennart Poettering from comment #39)
>
> btrfs returns a virtual device node in stat()'s .st_dev field that is
> supposed to indicate the backing block device of a file system. A virtual
> device node means it has a major of 0 and some dynamically allocated minor.
> The reason that it returns this virtual device node is that unlike classic
> file systems such as xfs or ext4, btrfs file systems can back multiple
> devices, and devices can be removed and added any time. This means returning
> a single, definite backing device makes semantically little sense. That's
> why the BTRFS_IOC_DEV_INFO ioctl exists: it returns a list of backing
> devices, i.e. more than just one.

Thanks for the excellent detailed clarification.
Seems the best way to solve it is to provide major:minor along with the device path.
Another solution is to provide .st_dev with the real device while btrfs is backed by a single device.

I hit the same failure while booting with qemu-kvm without an initrd and with the root= option in the kernel cmdline (root=/dev/sda3 ro console=tty0 console=ttyS0,115200 init=/sbin/init initcall_debug selinux=0).

Do we have any workaround already?

Kernel used is today's mainline (5.10.0-rc6-gbbe2ba04c5a9).

```
[ 3.539147][ T148] xfs-eofblocks/s (148) used greatest stack depth: 13744 bytes left
[ 3.548313][ T1] BTRFS: device label fedora_atest-guest devid 1 transid 1608 /dev/root scanned by swapper/0 (1)
[ 3.553209][ T1] BTRFS info (device sda3): disk space caching is enabled
[ 3.555192][ T1] BTRFS info (device sda3): has skinny extents
[ 3.581150][ T1] VFS: Mounted root (btrfs filesystem) readonly on device 0:18.
[ 3.581879][ T1] devtmpfs: error mounting -2
[ 3.583730][ T1] Freeing unused kernel memory: 5696K
[ 3.584235][ T1] This architecture does not have kernel memory protection.
[ 3.584915][ T1] Run /sbin/init as init process
[ 3.585482][ T1] Kernel panic - not syncing: Requested init /sbin/init failed (error -2).
[ 3.586385][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.10.0-rc6-gbbe2ba04c5a9 #2
[ 3.587249][ T1] Call Trace:
[ 3.587584][ T1] [c000000007703cc0] [c00000000097d270] dump_stack+0xc4/0x114 (unreliable)
[ 3.588473][ T1] [c000000007703d10] [c000000000137fe0] panic+0x16c/0x3fc
[ 3.589207][ T1] [c000000007703db0] [c000000000012cdc] kernel_init+0xe4/0x168
[ 3.589986][ T1] [c000000007703e20] [c00000000000daf0] ret_from_kernel_thread+0x5c/0x6c
```

----
qemu cmdline:
```
/usr/local/lib/python3.9/site-packages/virttest/bin/install_root/bin/qemu-system-ppc64 \
-name guest=vm1,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-10-vm1/master-key.aes \
-machine pseries-5.2,accel=kvm,usb=off,dump-guest-core=off,memory-backend=ppc_spapr.ram \
-cpu POWER9 \
-m size=8388608k,slots=32,maxmem=83886080k \
-object memory-backend-ram,id=ppc_spapr.ram,size=8589934592 \
-overcommit mem-lock=off \
-smp 16,sockets=1,dies=1,cores=16,threads=1 \
-uuid 1243aa0b-27c6-44cd-93e6-bf111b09ec85 \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=43,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-boot strict=on \
-kernel /home/sath/linux/vmlinux \
-append 'root=/dev/sda3 ro console=tty0 console=ttyS0,115200 init=/sbin/init initcall_debug selinux=0' \
-device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.0,addr=0x3 \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 \
-blockdev '{"driver":"file","filename":"/home/kvmci/tests/data/avocado-vt/images/fdevel-ppc64le.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1 \
-netdev tap,fd=45,id=hostnet0,vhost=on,vhostfd=46 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:35:33:4e,bus=pci.0,addr=0x1 \
-chardev pty,id=charserial0 \
-device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
-chardev socket,id=charchannel0,fd=47,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
-M pseries,cap-nested-hv=on \
-msg timestamp=on
2020-12-04 13:05:07.890+0000: Domain id=10 is tainted: custom-argv
2020-12-04 13:05:07.895+0000: 389225: info : libvirt version: 6.9.0, package: 2.fc34 (Fedora Project, 2020-11-04-09:16:45, )
2020-12-04 13:05:07.895+0000: 389225: info : hostname: localhost
2020-12-04 13:05:07.895+0000: 389225: info : virObjectUnref:381 : OBJECT_UNREF: obj=0x7fff44186020
char device redirected to /dev/pts/3 (label charserial0)
2020-12-04T13:05:17.663758Z qemu-system-ppc64: OS terminated: OS panic: Requested init /sbin/init failed (error -2).
2020-12-04 13:05:17.665+0000: shutting down, reason=crashed
2020-12-04T13:05:17.721451Z qemu-system-ppc64: terminating on signal 15 from pid 386066 (/usr/sbin/libvirtd)
```

(In reply to Satheesh Rajendran from comment #41)
> I hit with same failure while booting with qemu-kvm without initrd and root=
> option in kernel cmdline(root=/dev/sda3 ro console=tty0 console=ttyS0,115200
> init=/sbin/init initcall_debug selinux=0)
>
> do we have any workaround already?
>
> Kernel used is today's mainline(5.10.0-rc6-gbbe2ba04c5a9)
>
>

Found a workaround to get it working: using the kernel cmdline below helps it boot now,

root=/dev/sda3 rootflags=subvol=root rootfstype=btrfs console=tty0 console=ttyS0,115200 init=/sbin/init initcall_debug selinux=0

`rootflags=subvol=root` ==> made the difference.
virsh start --console vm1
[ 9.268226][ T221] BTRFS info (device sda3): devid 1 device path /dev/root changed to /dev/sda3 scanned by systemd-udevd (221)
[ 9.410353][ T291] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)

Fedora 34 (Rawhide Prerelease)
Kernel 5.10.0-rc6-dirty on an ppc64le (hvc0)

guest xml:
# virsh dumpxml vm1
<domain type='kvm' id='39' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>vm1</name>
  <uuid>1243aa0b-27c6-44cd-93e6-bf111b09ec85</uuid>
  <maxMemory slots='32' unit='KiB'>83886080</maxMemory>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>16</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='ppc64le' machine='pseries-5.2'>hvm</type>
    <kernel>/home/kvmci/linux/vmlinux</kernel>
    <cmdline>root=/dev/sda3 rootflags=subvol=root rootfstype=btrfs console=tty0 console=ttyS0,115200 init=/sbin/init initcall_debug selinux=0</cmdline>
    <boot dev='hd'/>
  </os>
  <cpu mode='custom' match='exact' check='none'>
    <model fallback='forbid'>POWER9</model>
    <topology sockets='1' dies='1' cores='16' threads='1'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/local/lib/python3.9/site-packages/virttest/bin/install_root/bin/qemu-system-ppc64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/home/kvmci/tests/data/avocado-vt/images/fdevel-ppc64le.qcow2' index='1'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <model name='spapr-pci-host-bridge'/>
      <target index='0'/>
      <alias name='pci.0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:35:33:4e'/>
      <source bridge='virbr0'/>
      <target dev='vnet38'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/3'/>
      <target type='spapr-vio-serial' port='0'>
        <model name='spapr-vty'/>
      </target>
      <alias name='serial0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </serial>
    <console type='pty' tty='/dev/pts/3'>
      <source path='/dev/pts/3'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-39-vm1/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
    <panic model='pseries'/>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c286,c701</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c286,c701</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
  <qemu:commandline>
    <qemu:arg value='-M'/>
    <qemu:arg value='pseries,cap-nested-hv=on'/>
  </qemu:commandline>
</domain>

(In reply to Satheesh Rajendran from comment #42)
> Found the workaround to get it working, using below kernel cmdline helps to
> boot now,
>
> root=/dev/sda3 rootflags=subvol=root rootfstype=btrfs console=tty0
> console=ttyS0,115200 init=/sbin/init initcall_debug selinux=0
>
>
> `rootflags=subvol=root` ==> made the difference.
>
>
> virsh start --console vm1
> [ 9.268226][ T221] BTRFS info (device sda3): devid 1 device path
> /dev/root changed to /dev/sda3 scanned by systemd-udevd (221)
> [ 9.410353][ T291] EXT4-fs (sda2): mounted filesystem with ordered data
> mode. Opts: (null)
>
> Fedora 34 (Rawhide Prerelease)
> Kernel 5.10.0-rc6-dirty on an ppc64le (hvc0)

Unfortunately, that doesn't really work either. "systemd-gpt-auto-generator" is executed before "systemd-udevd" and at this point the value (as you can see in your own example) is still "/dev/root". My own workaround is currently to disable "systemd-gpt-auto-generator" with the kernel parameter "systemd.gpt_auto=no".

This is exactly as described at https://github.com/systemd/systemd/issues/16953#issue-692871151

**Workaround**
As a workaround, I specified all mounts (`/`, `swap`) in `/etc/fstab` and disabled gpt-auto-generator completely by **`systemd.gpt_auto=no`** kernel command line parameter. It even speeds up boot process by ~100-200 msec.

Still present in 6.6.21.

Comment #40 has two reasonable suggestions. Could a BTRFS developer comment on these and, if they are not feasible, provide other suggestions?
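For anyone applying the `systemd.gpt_auto=no` workaround discussed above, a small Python sketch for checking whether a kernel command line actually carries it may be handy. The helper names `parse_cmdline` and `gpt_auto_disabled` are hypothetical, the parser ignores quoting, and the set of accepted "false" spellings is an assumption about how systemd reads boolean options rather than something taken from this thread.

```python
# Spellings assumed to mean "false" for a systemd boolean option.
FALSE_WORDS = {"0", "no", "n", "false", "f", "off"}


def parse_cmdline(text):
    """Map each kernel cmdline word to its value (None for bare flags).

    Splits on whitespace only; quoted arguments are not handled.
    """
    opts = {}
    for word in text.split():
        key, sep, value = word.partition("=")
        opts[key] = value if sep else None
    return opts


def gpt_auto_disabled(opts):
    """True if systemd.gpt_auto is present and set to a false value."""
    value = opts.get("systemd.gpt_auto")
    # A missing option or a bare flag leaves the generator enabled.
    return value is not None and value.lower() in FALSE_WORDS


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            opts = parse_cmdline(fh.read())
    except OSError:
        opts = {}
    print("root =", opts.get("root"))
    print("rootflags =", opts.get("rootflags"))
    print("gpt-auto disabled:", gpt_auto_disabled(opts))
```

Run on the booted system, this prints the `root=` and `rootflags=` values the kernel was given and whether the generator was switched off on the command line. It says nothing about whether `/etc/fstab` lists the mounts the generator would otherwise have set up.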