Bug 216340
| Summary: | Kernel stuck on "Loading initial ramdisk" since 5.11 | | |
|---|---|---|---|
| Product: | Memory Management | Reporter: | Azamat S. Kalimoulline (turtle) |
| Component: | Page Allocator | Assignee: | Andrew Morton (akpm) |
| Status: | ASSIGNED --- | | |
| Severity: | normal | CC: | akpm, bp, mario.limonciello, regressions, till-kernel, vbabka |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 5.11 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | stock 5.10 config; first non boot config | | |
Description
Azamat S. Kalimoulline
2022-08-08 20:58:20 UTC
OS boots via EFI. The laptop's BIOS has no option to boot without it.

Why such an old kernel? Could you use a modern kernel like 5.19 instead?

This has happened since the 5.11 kernel, i.e. 5.18 also does not boot. But I debugged yesterday and it is not IOMMU related. My system is installed on LUKS and LVM using full disk encryption (https://docs.voidlinux.org/installation/guides/fde.html). If I boot without it, the kernel boots, but my system does not. I will try the same setup on another computer today to determine whether this makes sense.

But on another laptop of the same model it also does not boot from LUKS and LVM. I checked on another laptop and a PC, and with the 5.18 kernel it works fine. So it is specific to the EliteBook 845 G7 with full disk encryption.

Boot was performed in UEFI mode. earlyprintk=efi,keep has no effect: no messages appear after "Loading initial ramdisk".

> This is since 5.11 kernel, i.e. 5.18 also not boot.
Got it; that wasn't clear from your initial message.
> And if boot without it then kernel boots.
It sounds like an FDE configuration issue until you mention that it works in 5.10 and regresses only when moving to 5.11 or later.
> So it is specific to ELiteBook 845 G7 with full disk encryption
I think more isolation is needed to figure out why this is happening. Some suggestions that will help get you there:
1) Does turning off amdgpu via modprobe.blacklist=amdgpu or 'nomodeset' help?
2) Are you putting amdgpu in your initrd? If not; can you?
3) Are you compiling with pinctrl-amd as built-in or module? If it's as module - is it in your initrd? If not, please mark it as built-in or move it into your initrd.
4) Since you know it's between 5.10 and 5.11, can you bisect to identify the commit that caused it?

I can confirm this. Same configuration, HP EliteBook 845 G7 with full disk encryption; the latest kernel to boot is 5.10. From 5.11 upwards, the system is stuck at boot. The only difference is that this is a Ryzen 7 model. I am using Void Linux. I will try to follow some of your suggestions to provide more info.
> Does turning off amdgpu via modprobe.blacklist=amdgpu or 'nomodeset' help?
No.
> Are you putting amdgpu in your initrd? If not; can you?
It is present in the initrd.
> Are you compiling with pinctrl-amd as built-in or module?
It is compiled as built-in.
> Since you know it's between 5.10 and 5.11, can you bisect to identify the
> commit that caused it?
I haven't done that before. How do I do it?

Also, I don't think this is amdgpu related, because if I remove the rd.* options from my kernel cmdline, 5.18 and 5.10 boot nearly the same. Neither boots the system, because there is no device to boot from, but I can see kernel messages and in both cases the system reacts to the Ctrl-Alt-Del combination. If I try to boot with the rd.* options in my kernel cmdline, there are no messages after "Loading initial ramdisk" and the Ctrl-Alt-Del combination has no effect.

> No.
> It presents initrd.
> It compiled built-in.
OK.
> I didn't it before. How to do that?
https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
> Also I don't think this is amdgpu related.
My suspicion was that your prompt for a password wasn't being properly displayed since you didn't put amdgpu in the initramfs, but you proved that to be wrong.
> If try to boot with rd.* options in my kernel cmdline then no messages after
> "Loading initial ramdisk" and Ctrl-Alt-Del combination has no effect.
AFAICT the kernel doesn't directly handle rd.lvm.vg or rd.luks.uuid. To me it seems these parameters are getting passed to userspace. I'm on a 5.18 checkout:
$ uname -r
5.18.0-02568-gabeb24bcfb54
I don't see these parameters at all.
$ modinfo rd | grep parm
parm: rd_nr:Maximum number of brd devices (int)
parm: rd_size:Size of each RAM disk in kbytes. (ulong)
parm: max_part:Num Minors to reserve between devices (int)
Could you perhaps break into your initramfs by adding a shell call and debug from that?

> My suspicion was that your prompt for a password wasn't being properly
> displayed since you didn't put amdgpu in the initramfs
Yes.
Because it prompts for the password before GRUB. After that, GRUB loads the kernel and initrd; the initrd contains a key and a crypttab where the arrays are defined. It then unlocks automatically using this key without prompting for a password. This is how it works on 5.10.
> kernel doesn't directly rd.lvm.vg nor rd.luks.uuid. These parameters to me
> seem they're getting passed to userspace.
Maybe, maybe. And the problem is perhaps in the initrd itself. But there are no messages after "Loading initial ramdisk" and I am stuck there.
> Could you perhaps break into your initramfs by adding a shell call and debug
> from that?
How do I do that?
> https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
I'll try it later. It needs a significant amount of time, but I'll try to find it.

> Yes. Because it prompts password before grub.
Where is the key stored and passed to the kernel and initrd from before GRUB? An EFI NVRAM variable? Some random memory address? I think until you've bisected, that should be your focal area: a regression in how that is all handled. If you're storing it at a memory range that is not reserved by the firmware, the kernel can very easily repurpose it for something else.
> How to do that?
It depends on what your initramfs provider has. On Ubuntu and Debian I know you can do things like "break=bottom" on the kernel command line to break in the initramfs. You might have to manually modify your initramfs to add some breakpoints/shell points if none are already defined.
> I'll try it later. It need significant amount of time but I'll try find it.
Yeah it does, but it's rewarding and infinitely easier to know what to fix when you know what the real cause was :)

> Where is the key stored and passed to the kernel and initrd from before GRUB?
I provide the password to GRUB when it prompts for the key. GRUB then unlocks the disk and loads the kernel and initrd from it. The initrd stores another key, which is used for unlocking during the boot process.
> It depends on what your initramfs provider has.
Void Linux uses dracut, which uses rd.break to stop at breakpoints. But it gets stuck after "Loading initial ramdisk". Maybe before, but after this message appears.
> Yeah it does, but it's rewarding and infinitely easier to know what to fix
> when you know what the real cause was :)
Agreed. I think it is probably the one thing that can help find the root of the issue.

Oh, I've found strange behavior. If I press the down key after turning on the laptop, a boot menu appears with two options: UEFI - void and boot from file. If I select my Void Linux, the boot proceeds as it should. This is NOT TRUE when I press F9 for the boot menu.

So I think bisecting is the only option to determine the cause of this behavior. :(

I think it is stuck on initrd loading.

I finished bisecting the kernel. These are the results:
> # git bisect bad
> f5a89a5cae812a39993be32e74c8ed7856b1e2b2 is the first bad commit
> commit f5a89a5cae812a39993be32e74c8ed7856b1e2b2
> Author: Dave Airlie <airlied@redhat.com>
> Date: Thu Oct 29 13:59:08 2020 +1000
>
> drm/amdgpu/ttm: use multihop
>
> This removes the code to move resources directly between
> SYSTEM and VRAM in favour of using the core ttm mulithop code.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Reviewed-by: Christian König <christian.koenig@amd.com>
> Link:
> https://patchwork.freedesktop.org/patch/msgid/20201109005432.861936-3-airlied@gmail.com
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 136
> +++-----------------------------
> 1 file changed, 12 insertions(+), 124 deletions(-)
It looks like an amdgpu bug, but I can't understand why it does not occur when FDE is not used.

I think this is not an accurate bisect result, since it was said above that blacklisting amdgpu didn't help. That particular commit, if you dig into the links, had a non-obvious dependency on https://github.com/torvalds/linux/commit/aefec40938e4a0e1214f9121520aba4d51697cd9

Just to double-confirm it's not caused by amdgpu: can you blacklist amdgpu on 5.10, and does FDE still work? If so, I think you should repeat the bisect with amdgpu blacklisted the whole time and see what result you get.

I thought so, because it failed to boot in different ways. 5.11-rc1 doesn't even boot the initrd: if I point to a non-existent initrd, it doesn't even print that it can't find it. After some bisection steps it starts to fail another way: it prints some messages and then gets stuck. So I will repeat the bisect with amdgpu blacklisted, because 5.10 with FDE works that way (without GUI, obviously). It could be faster because I marked the commit where it started to boot differently.

git bisect has a way to "skip" commits where it's unclear if it's a pass or fail. I think just keep good notes on your steps if you're unclear so you don't have to re-run "the whole thing" if you still get a wrong result.

Yeah, I understand that it is bisecting a bunch of commits. This is a standard search method.

So, this is the result:
> # git bisect good
> f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 is the first bad commit
> commit f289041ed4cf9a3f6e8a32068fef9ffb2acc5662
> Author: Vlastimil Babka <vbabka@suse.cz>
> Date: Mon Dec 14 19:13:45 2020 -0800
>
> mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO
>
> CONFIG_PAGE_POISONING_ZERO uses the zero pattern instead of 0xAA. It was
> introduced by commit 1414c7f4f7d7 ("mm/page_poisoning.c: allow for zero
> poisoning"), noting that using zeroes retains the benefit of sanitizing
> content of freed pages, with the benefit of not having to zero them again
> on alloc, and the downside of making some forms of corruption (stray
> writes of NULLs) harder to detect than with the 0xAA pattern. Together
> with CONFIG_PAGE_POISONING_NO_SANITY it made possible to sanitize the
> contents on free without checking it back on alloc.
>
> These days we have the init_on_free() option to achieve sanitization with
> zeroes and to save clearing on alloc (and without checking on alloc).
> Arguably if someone does choose to check the poison for corruption on
> alloc, the savings of not clearing the page are secondary, and it makes
> sense to always use the 0xAA poison pattern. Thus, remove the
> CONFIG_PAGE_POISONING_ZERO option for being redundant.
>
> Link: https://lkml.kernel.org/r/20201113104033.22907-6-vbabka@suse.cz
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: David Hildenbrand <david@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Alexander Potapenko <glider@google.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Laura Abbott <labbott@kernel.org>
> Cc: Mateusz Nosek <mateusznosek0@gmail.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> include/linux/poison.h | 4 ----
> mm/Kconfig.debug | 12 ------------
> mm/page_alloc.c | 8 +-------
> tools/include/linux/poison.h | 6 +-----
> 4 files changed, 2 insertions(+), 28 deletions(-)
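For readers unfamiliar with the procedure discussed above, here is a minimal Python sketch of bisection with a "skip" verdict. It assumes a linear list of commits ordered oldest to newest, where the first entry is known good and the last is known bad; real git bisect works on the full commit graph and handles skips more cleverly. The function name `bisect_candidates` and the `test` callback are hypothetical and not part of git.

```python
def bisect_candidates(commits, test):
    """Narrow down the first bad commit in a linear history.

    commits: list ordered oldest to newest; commits[0] is known good,
             commits[-1] is known bad.
    test:    callback returning "good", "bad" or "skip" for a commit.
    Returns the list of commits that could still be the first bad one
    (a single entry when no skipped commit is in the way).
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        # Try the midpoint first, then its nearest neighbours if it is skipped.
        untested = sorted(range(good + 1, bad), key=lambda i: abs(i - mid))
        for idx in untested:
            verdict = test(commits[idx])
            if verdict != "skip":
                break
        else:
            # Every commit between good and bad was skipped: cannot narrow further.
            break
        if verdict == "good":
            good = idx
        else:
            bad = idx
    return commits[good + 1 : bad + 1]
```

When a skipped commit sits right next to the first bad one, the result is a short range rather than a single commit, which is why keeping notes on unclear steps matters.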
What else can I provide?

Hi, I don't see how the commit could cause this bug; it could be another stray bisect. Only theoretically: if your .config (before this commit, i.e. on 5.10) enabled CONFIG_PAGE_POISONING_ZERO, and there was some broken code that didn't properly initialize an allocated object and unknowingly relied on this zeroing, then removing the zeroing could expose the bug. So maybe share your 5.10 config?

I ran make oldconfig, so it is the same.

One other thing that could help: this may be caused by a bug in the platform (HP EliteBook 845 G7). There are some non-obvious actions after which the kernel works: pressing the down button at boot and then choosing the Linux entry in the boot menu. This does not work when I do the same but call the boot menu via F9, or when booting by default. If you want more info, I can provide it.

"make oldconfig" uses your existing .config as a base, so unless you upload it here, we can't know what it contains.

I did cp /boot/config-5.10.132 .oldconfig before it. This config-5.10.132 is the Void Linux built kernel config.

You need to upload this config here to bugzilla so that he can take a look.

(In reply to Vlastimil Babka from comment #26)
> "make oldconfig" uses your existing .config as a base, so unless you upload
> it here, we can't know what it contains.
Did you do this before every build? In some cases that seems to be needed to avoid corner cases. See these mails for details and solutions:
https://lore.kernel.org/all/CAMuHMdXRoVt_zRBNvugJjYhJnyYbABeCWv9fFRM0r_=s7FYvJQ@mail.gmail.com/
https://lore.kernel.org/all/12e09497-a848-b767-88f4-7dabd8360c5e@leemhuis.info/

Side note: I googled and quickly found a hit where somebody was running Fedora with Linux 5.13 on an HP EliteBook 845 G7. Makes me wonder if the full disk encryption or Void's kernel config (are there maybe fresher Void live media you could try?) might have an influence here.
But in the end it doesn't matter that much, as I guess we need a reliable bisection to get to the root of the problem.

> did you do this before every build?
Yes, I made a script to exclude manual entry errors.
> I googled and quickly found a hit where somebody was running Fedora with
> Linux 5.13 on a HP EliteBook 845 G7.
I can run the Void Linux XFCE live disk with a kernel >5.10 without full disk encryption. I can also run my Void Linux stick with the 5.18 kernel on another laptop or PC.
> I guess we need a reliable bisection
This is a reliable bisection on the HP EliteBook 845 G7. Maybe I need to repeat it with other parameters to exclude other bugs. But with the current bug the kernel does not load at all, i.e. the initrd does not even start. And as advised before, I redid the bisection with modprobe.blacklist=amdgpu. But this matters for GPU errors only.

Created attachment 301592 [details]
stock 5.10 config
Created attachment 301593 [details]
first non boot config
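As a convenience for comparing the two attached configs, here is a small Python sketch that extracts the poisoning-related options named in this thread from the text of a kernel .config. It relies only on the usual .config layout (`CONFIG_X=y` for enabled options, `# CONFIG_X is not set` for disabled ones). The function name `poison_options` and the sample text are hypothetical; the sample is not the content of either attachment.

```python
# Options mentioned in this thread; anything not set is reported as "n".
WANTED = (
    "CONFIG_PAGE_POISONING",
    "CONFIG_PAGE_POISONING_ZERO",
    "CONFIG_PAGE_POISONING_NO_SANITY",
    "CONFIG_INIT_ON_ALLOC_DEFAULT_ON",
    "CONFIG_INIT_ON_FREE_DEFAULT_ON",
)


def poison_options(config_text):
    """Return a dict mapping each option in WANTED to its value in the config."""
    found = {name: "n" for name in WANTED}
    for line in config_text.splitlines():
        line = line.strip()
        # Comment lines (including "# CONFIG_X is not set") keep the "n" default.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name in found:
            found[name] = value
    return found


# Hypothetical sample, for illustration only.
SAMPLE = """\
CONFIG_PAGE_POISONING=y
# CONFIG_PAGE_POISONING_NO_SANITY is not set
CONFIG_PAGE_POISONING_ZERO=y
CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
"""

if __name__ == "__main__":
    for option, value in poison_options(SAMPLE).items():
        print(f"{option}={value}")
```

Running it on the real files would mean reading them with `open(path).read()` and passing the text in; the boot-time `page_poison=` parameter is separate and lives on the kernel command line, not in the config.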
When it fails to boot due to the amdgpu bug, it shows kernel messages and then gets stuck, and we can reboot with Ctrl-Alt-Del. When it fails to boot in this case, it shows "Loading initial ramdisk", but this message is printed by GRUB; the initrd is actually not loaded and not executed, and Ctrl-Alt-Del has no effect. Again, I'm able to boot by pressing the DOWN arrow and selecting my OS from the boot menu; GRUB then loads and I can boot the 5.18 kernel. If I press F9 for the boot menu and do the same, I CAN NOT BOOT kernels higher than 5.10.x. So maybe this is a bug in the platform of that laptop, but it is annoying for me, and 5.10.x boots without problems.

It seems CONFIG_PAGE_POISONING_ZERO was indeed used, but also INIT_ON_ALLOC_DEFAULT_ON=y, which should have pre-zeroed pages anyway, thus a buggy driver or something shouldn't see a difference. Maybe you can try enabling CONFIG_INIT_ON_FREE_DEFAULT_ON, which will do the zeroing on free, as PAGE_POISONING_ZERO did. But I would be surprised if that helped.

I will try and write up the results. Do I only need to set CONFIG_INIT_ON_FREE_DEFAULT_ON=y?

(In reply to Azamat S. Kalimoulline from comment #38)
> I need only set CONFIG_INIT_ON_FREE_DEFAULT_ON=y ?
yes

Based on the results from the bisect, I was finally able to boot into a current kernel by using the kernel command line option page_poison=0. Void Linux has set page_poison=1 by default since kernel 5.3. It seems CONFIG_PAGE_POISONING_ZERO is indeed somehow the root cause for not being able to boot kernels >5.10.

(In reply to till-kernel@0xaa55.org from comment #40)
> Based on the results from the bisect, I was finally able to boot into a
> current kernel by using kernel commandline page_poison=0. Void linux sets
> page_poison=1 be default since kernel 5.3.
Interesting.
> Seems like
> CONFIG_PAGE_POISONING_ZERO is indeed somehow the root cause for not being
> able to boot kernels >5.10.
But the option was removed in >5.10. If you disable CONFIG_PAGE_POISONING_ZERO and keep page_poison=1 enabled with 5.10, does it also fail to boot?
All this would mean something doesn't like freed pages being filled with the pattern 0xAA (the standard poisoning pattern) but doesn't mind if they are filled with 0x00 (which CONFIG_PAGE_POISONING_ZERO did).

> CONFIG_INIT_ON_FREE_DEFAULT_ON=y
With this, f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 does not boot.
> page_poison=0
With this, both f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 and the stock Void 5.18 kernel boot.
> If you disable CONFIG_PAGE_POISONING_ZERO
But this option was removed in the patch we are talking about (f289041ed4cf9a3f6e8a32068fef9ffb2acc5662). Starting with this patch, the kernel doesn't boot.
(In reply to Azamat S. Kalimoulline from comment #42)
> > CONFIG_INIT_ON_FREE_DEFAULT_ON=y
>
> with this f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 not boots.
Can you try this config change with 5.18? There were some fixes to the interaction between init_on_alloc/free and poisoning after f289041ed4cf9a3f6e8a32068fef9ffb2acc5662.

(In reply to Azamat S. Kalimoulline from comment #43)
> > If you disable CONFIG_PAGE_POISONING_ZERO
>
> But this option removid in that patch we talk about
> (f289041ed4cf9a3f6e8a32068fef9ffb2acc5662). Statring this patch kernel
> doesn't boots.
Yes, to confirm that I have suggested: if you disable CONFIG_PAGE_POISONING_ZERO and keep page_poison=1 enabled with kernel v5.10, does it also fail to boot?

> Can you try this config change with 5.18?
Would it be OK if I try it on master HEAD?
> If you disable CONFIG_PAGE_POISONING_ZERO and keep the page_poison=1 enabled
> with kernel v5.10, does it also fail to boot?
Do you mean I should rebuild the latest successful kernel with CONFIG_PAGE_POISONING_ZERO disabled and boot it with page_poison=1?
(In reply to Azamat S. Kalimoulline from comment #46)
> > Can you try this config change with 5.18?
>
> Will it be ok if I'll try it on master head?
You can try, but it's just after rc1 so it might not be stable for other reasons... The v5.19 tag could be a safer bet.

(In reply to Azamat S. Kalimoulline from comment #47)
> > If you disable CONFIG_PAGE_POISONING_ZERO and keep the page_poison=1 enabled
> > with kernel v5.10, does it also fail to boot?
>
> Do you mean if I rebuild latest successfull kernel with
> CONFIG_PAGE_POISONING_ZERO disabled? And boot it with page_poison=1?
Yeah, preferably the v5.10 tag, not the parent commit of f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 (again due to possible instability of an intermediate version).

> If you disable CONFIG_PAGE_POISONING_ZERO and keep the page_poison=1 enabled
> with kernel v5.10, does it also fail to boot?
5.10.135 with CONFIG_PAGE_POISONING_ZERO disabled fails to boot.
> CONFIG_INIT_ON_FREE_DEFAULT_ON=y
5.18.16 fails to boot.
(In reply to Azamat S. Kalimoulline from comment #49)
> 5.10.135 with CONFIG_PAGE_POISONING_ZERO disabled fails to boot.
Thanks, that confirms the commit isn't the root cause, but that page poisoning with a non-zero pattern triggers some kind of problem on the system, and that this was always the case regardless of kernel version. Either the poisoning itself is buggy (improbable) or it's pointing to a bug elsewhere in the kernel, or maybe even firmware (since it depends on how you boot the kernel). Not sure how to proceed though if you can't get any useful output from the failed boot. Maybe just disable poisoning then :/

(In reply to Azamat S. Kalimoulline from comment #50)
> > CONFIG_INIT_ON_FREE_DEFAULT_ON=y
>
> 5.18.16 fail to boot
That was actually not a useful suggestion from me: when poisoning is enabled, it overrides INIT_ON_FREE (and INIT_ON_ALLOC), so enabling this option didn't have any effect.

I still have this regression report on the list of tracked regressions. I wonder if we should put this to rest, as unless I'm mistaken this seems to be a (very?) odd corner case and the reporter apparently found a workaround. Is that okay for everybody? /me will take "a few days of silence" as yes

We found a workaround: setting page_poison=0 on the kernel cmdline. As it continues to work, it is sufficient for me.

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #52)
Indeed I don't think it's really a "real" regression. What was apparently always true for the system in question (and I suspect it's UEFI related):
- no CONFIG_PAGE_POISONING (or boottime-disabled) - works
- CONFIG_PAGE_POISONING=Y (and boottime-enabled) - doesn't work
- CONFIG_PAGE_POISONING_ZERO - works
5.11 removed the third option as obsolete, which made the distro kernel config switch to the second option. But that option didn't regress in 5.11; it always failed on this system. And it's a debugging option: if it fails, it means it exposes problems elsewhere.
As I said before, it is highly likely a hardware/firmware issue of the HP EliteBook, because even with stock options I can make it work with several strange manipulations. https://bugzilla.kernel.org/show_bug.cgi?id=216340#c13
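To illustrate the hypothesis raised earlier in the thread (something that never initializes its memory and silently relies on freed pages being zero-filled), here is a toy Python model. It is not kernel code: a page is a plain bytearray, and `free_page`, `poison_intact` and `buggy_reader` are hypothetical names. The only values taken from the thread are the two fill patterns, 0xAA and 0x00. The actual root cause on this laptop was never identified.

```python
PAGE_SIZE = 4096
POISON_AA = 0xAA    # standard poison pattern mentioned in the thread
POISON_ZERO = 0x00  # pattern that CONFIG_PAGE_POISONING_ZERO used


def free_page(page, pattern):
    """Model freeing a page with poisoning on: overwrite it with the pattern."""
    page[:] = bytes([pattern]) * len(page)


def poison_intact(page, pattern):
    """Model the sanity check on allocation: is the whole page still poisoned?"""
    return all(byte == pattern for byte in page)


def buggy_reader(page):
    """Hypothetical consumer that forgets to initialize the page and reads a
    32-bit little-endian field, implicitly expecting it to be zero."""
    return int.from_bytes(page[:4], "little")


if __name__ == "__main__":
    page = bytearray(PAGE_SIZE)

    free_page(page, POISON_ZERO)
    print("zero poison:", hex(buggy_reader(page)))

    free_page(page, POISON_AA)
    print("0xAA poison:", hex(buggy_reader(page)))

    # A stray write of zeroes after free is caught with 0xAA poisoning...
    page[100] = 0x00
    print("0xAA poison intact after stray zero write:", poison_intact(page, POISON_AA))
```

With the zero pattern the buggy reader happens to see the value it expects, so the missing initialization stays hidden; with 0xAA it reads garbage. That matches the works / doesn't work / works list above, but it only shows that the pattern of results is consistent with such a bug, not where the bug is.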