Bug 216340 - Kernel stuck on "Loading initial ramdisk" since 5.11
Summary: Kernel stuck on "Loading initial ramdisk" since 5.11
Status: ASSIGNED
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Page Allocator
Hardware: All Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-08-08 20:58 UTC by Azamat S. Kalimoulline
Modified: 2022-09-20 20:27 UTC
CC List: 6 users

See Also:
Kernel Version: 5.11
Subsystem:
Regression: No
Bisected commit-id:


Attachments
stock 5.10 config (230.86 KB, text/plain)
2022-08-17 12:50 UTC, Azamat S. Kalimoulline
first non boot config (231.00 KB, text/plain)
2022-08-17 12:50 UTC, Azamat S. Kalimoulline

Description Azamat S. Kalimoulline 2022-08-08 20:58:20 UTC
I have an HP EliteBook 845 G7 (Renoir architecture, AMD Ryzen 5 PRO 4650U with Radeon Graphics). Since kernel 5.11 it no longer boots; kernel 5.10 boots fine. This looks like https://bugzilla.kernel.org/show_bug.cgi?id=212133, but the iommu=soft workaround does not work here. iommu=off and acpi=off don't help either.
Comment 1 Azamat S. Kalimoulline 2022-08-08 20:59:36 UTC
The OS boots via EFI. The laptop's BIOS has no option to boot without it.
Comment 2 Mario Limonciello (AMD) 2022-08-09 01:07:22 UTC
Why such an old kernel?  Could you use a modern kernel like 5.19 instead?
Comment 3 Azamat S. Kalimoulline 2022-08-09 06:45:46 UTC
This has been the case since the 5.11 kernel, i.e. 5.18 also does not boot. But I debugged it yesterday and it is not IOMMU related. My system is installed on LUKS and LVM using full disk encryption (https://docs.voidlinux.org/installation/guides/fde.html). And if I boot without it, the kernel boots, but my system does not. Today I will try the same setup on another computer to determine whether this matters. On a second laptop of the same model, it also does not boot from LUKS and LVM.
Comment 4 Azamat S. Kalimoulline 2022-08-09 10:09:00 UTC
I checked on another laptop and a PC, and with the 5.18 kernel it works fine. So it is specific to the EliteBook 845 G7 with full disk encryption. Boot was performed in UEFI mode.
Comment 5 Azamat S. Kalimoulline 2022-08-09 10:09:35 UTC
earlyprintk=efi,keep makes no difference: there are no messages after "Loading initial ramdisk".
Comment 6 Mario Limonciello (AMD) 2022-08-09 14:22:16 UTC
> This has been the case since the 5.11 kernel, i.e. 5.18 also does not boot.

Got it; that wasn't clear from your initial message.

> And if I boot without it, the kernel boots.

It sounds like a FDE configuration issue until you mention that it works in 5.10 and regresses only moving to 5.11 or later.

> So it is specific to the EliteBook 845 G7 with full disk encryption

I think more isolation is needed to figure out why this is happening.

Some suggestions that will help get you there:

1) Does turning off amdgpu via modprobe.blacklist=amdgpu or 'nomodeset' help?
2) Are you putting amdgpu in your initrd?  If not; can you?
3) Are you compiling with pinctrl-amd as built-in or module?  If it's as module - is it in your initrd?  If not, please mark it as built-in or move it into your initrd.

4) Since you know it's between 5.10 and 5.11, can you bisect to identify the commit that caused it?
Comment 7 till-kernel@0xaa55.org 2022-08-09 16:09:08 UTC
I can confirm this. Same configuration, HP EliteBook 845 G7 with full disk encryption, latest kernel to boot is 5.10. From 5.11 upwards, system is stuck at boot. Only difference is this is a Ryzen 7 model. I am using Void Linux.
I will try to follow some of your suggestions to provide more info.
Comment 8 Azamat S. Kalimoulline 2022-08-09 19:02:01 UTC
> Does turning off amdgpu via modprobe.blacklist=amdgpu or 'nomodeset' help?

No.

> Are you putting amdgpu in your initrd?  If not; can you?

It is present in the initrd.

> Are you compiling with pinctrl-amd as built-in or module?

It is compiled as built-in.

> Since you know it's between 5.10 and 5.11, can you bisect to identify the
> commit that caused it?

I haven't done that before. How do I do it?

Also I don't think this is amdgpu related. If I remove the rd.* options from my kernel cmdline, 5.18 and 5.10 behave nearly the same. Neither boots the system, because there is no device to boot from, but I can see kernel messages and in both cases the system reacts to Ctrl-Alt-Del. If I try to boot with the rd.* options in my kernel cmdline, there are no messages after "Loading initial ramdisk" and Ctrl-Alt-Del has no effect.
Comment 9 Mario Limonciello (AMD) 2022-08-09 19:19:05 UTC
> No.
> It is present in the initrd.
> It is compiled as built-in.

OK.

> I haven't done that before. How do I do it?

https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

> Also I don't think this is amdgpu related.

My suspicion was that your prompt for a password wasn't being properly displayed since you didn't put amdgpu in the initramfs, but you proved that to be wrong.

> If I try to boot with the rd.* options in my kernel cmdline, there are no
> messages after "Loading initial ramdisk" and Ctrl-Alt-Del has no effect.

AFAICT the kernel doesn't directly handle rd.lvm.vg or rd.luks.uuid.  These parameters seem to be passed to userspace.

I'm on a 5.18 checkout:
$ uname -r
5.18.0-02568-gabeb24bcfb54

I don't see these parameters at all.
$ modinfo rd | grep parm
parm:           rd_nr:Maximum number of brd devices (int)
parm:           rd_size:Size of each RAM disk in kbytes. (ulong)
parm:           max_part:Num Minors to reserve between devices (int)

Could you perhaps break into your initramfs by adding a shell call and debug from that?
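
To see which parameters on a given command line are in the rd.* namespace (and so are presumably consumed by the initramfs rather than the kernel), something like the following could be used. This is only a minimal sketch: the function name list_rd_params is made up for illustration, and it assumes parameters are separated by plain spaces.

```shell
#!/bin/sh
# Print every rd.* token found in a kernel command line, one per line.
# With no argument, the running system's /proc/cmdline is read instead.
# (list_rd_params is a hypothetical helper name, not an existing tool.)
list_rd_params() {
    cmdline=${1-$(cat /proc/cmdline)}
    # Split on spaces, then keep only tokens that start with "rd."
    printf '%s\n' "$cmdline" | tr -s ' ' '\n' | grep '^rd\.'
}
```

Calling list_rd_params with no argument on the booted 5.10 system would show exactly what is being handed to the initramfs.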
Comment 10 Azamat S. Kalimoulline 2022-08-09 19:33:06 UTC
> My suspicion was that your prompt for a password wasn't being properly
> displayed since you didn't put amdgpu in the initramfs

Yes. Because it prompts for the password before GRUB. After that, GRUB loads the kernel and the initrd; the initrd contains a key and a crypttab where the arrays are defined. It then unlocks automatically using this key without prompting for a password. This is how it works on 5.10.

> kernel doesn't directly handle rd.lvm.vg or rd.luks.uuid.  These parameters
> seem to be passed to userspace.

Maybe. And the problem is perhaps in the initrd itself. But there are no messages after "Loading initial ramdisk" and I am stuck there.

> Could you perhaps break into your initramfs by adding a shell call and debug
> from that?

How to do that?

> https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

I'll try it later. It needs a significant amount of time, but I'll try to find it.
Comment 11 Mario Limonciello (AMD) 2022-08-09 19:38:45 UTC
> Yes. Because it prompts for the password before GRUB.

Where is the key stored and passed to the kernel and initrd from before GRUB?  An EFI NVRAM variable?  Some random memory address?

I think that, until you've bisected, this should be your focal area: a regression in how that is all handled.  If you're storing it at a memory range that is not reserved by the firmware, the kernel can very easily repurpose it for something else.

> How to do that?

It depends on what your initramfs provider has.  On Ubuntu and Debian I know you can do things like "break=bottom" on the kernel command line to break in the initramfs.

You might have to manually modify your initramfs to add some breakpoints/shell points if none are already defined.

> I'll try it later. It needs a significant amount of time, but I'll try to find it.

Yeah it does, but it's rewarding and infinitely easier to know what to fix when you know what the real cause was :)
Comment 12 Azamat S. Kalimoulline 2022-08-09 20:52:32 UTC
> Where is the key stored and passed to the kernel and initrd from before GRUB?
I provide a password to GRUB when it prompts for the key. GRUB then unlocks the disk and loads the kernel and initrd from it. The initrd stores another key, which is used for unlocking during the boot process.

> It depends on what your initramfs provider has.
Void Linux uses dracut, which uses rd.break to stop at breakpoints. But it gets stuck after "Loading initial ramdisk". Maybe earlier, but after this message appears.

> Yeah it does, but it's rewarding and infinitely easier to know what to fix
> when you know what the real cause was :)
Agreed. I think it is probably the one thing that can help find the root of the issue.
Comment 13 Azamat S. Kalimoulline 2022-08-09 20:59:13 UTC
Oh, I've found strange behavior. If I press the down key after turning on the laptop, a boot menu appears with two options: "UEFI - void" and boot from file. When I select my Void Linux entry, the whole boot proceeds as it should. This is NOT the case when I press F9 for the boot menu.
Comment 14 Azamat S. Kalimoulline 2022-08-09 20:59:45 UTC
So I think bisecting is the only option to determine the cause of this behavior. :(
Comment 15 Azamat S. Kalimoulline 2022-08-09 21:01:02 UTC
I think it gets stuck on initrd loading.
Comment 16 Azamat S. Kalimoulline 2022-08-11 23:04:20 UTC
I finished bisecting the kernel. These are the results:

> # git bisect bad
> f5a89a5cae812a39993be32e74c8ed7856b1e2b2 is the first bad commit
> commit f5a89a5cae812a39993be32e74c8ed7856b1e2b2
> Author: Dave Airlie <airlied@redhat.com>
> Date:   Thu Oct 29 13:59:08 2020 +1000
> 
>     drm/amdgpu/ttm: use multihop
>     
>     This removes the code to move resources directly between
>     SYSTEM and VRAM in favour of using the core ttm mulithop code.
>     
>     Signed-off-by: Dave Airlie <airlied@redhat.com>
>     Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>     Reviewed-by: Christian König <christian.koenig@amd.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20201109005432.861936-3-airlied@gmail.com
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 136 +++-----------------------------
>  1 file changed, 12 insertions(+), 124 deletions(-)
Comment 17 Azamat S. Kalimoulline 2022-08-11 23:06:36 UTC
It looks like an amdgpu bug, but I can't understand why it has no effect when FDE is not used.
Comment 18 Mario Limonciello (AMD) 2022-08-12 02:23:05 UTC
I think this is not an accurate bisect result since it was said above that blacklisting amdgpu didn't help.

That particular commit if you dig into the links had a non-obvious dependency on https://github.com/torvalds/linux/commit/aefec40938e4a0e1214f9121520aba4d51697cd9 

Just to double-confirm it's not caused by amdgpu, can you blacklist amdgpu on 5.10 and FDE works?  If so, I think you should repeat the bisect with amdgpu blacklisted the whole time and see what result you get.
Comment 19 Azamat S. Kalimoulline 2022-08-12 08:39:53 UTC
I thought so, because it failed to boot in different ways. 5.11-rc1 doesn't even load the initrd: if I point it to a non-existent initrd, it doesn't even print that it can't find it. After some bisection steps it starts failing in another way: it prints some messages and then gets stuck. So I will repeat the bisect with amdgpu blacklisted, because 5.10 with FDE works that way (without GUI, obviously).
Comment 20 Azamat S. Kalimoulline 2022-08-12 08:40:51 UTC
It could be faster because I marked the commit where it started to boot differently.
Comment 21 Mario Limonciello (AMD) 2022-08-12 16:21:44 UTC
git bisect has a way to "skip" commits where it's unclear if it's a pass or fail.  

I think just keep good notes on your steps if you're unclear so you don't have to re-run "the whole thing" if you still get a wrong result.
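
As a rough illustration of how the three possible observations could map onto bisect steps for this bug, here is a small sketch. The helper name bisect_cmd_for and the observation labels (boots, hangs, anything else) are invented for this example; the git subcommands themselves (start, good, bad, skip, log, replay) are standard.

```shell
#!/bin/sh
# Typical manual session, run inside a kernel git tree:
#   git bisect start v5.11 v5.10    # first the bad tag, then the good one
#   ...build, install, reboot, observe...
#   $(bisect_cmd_for hangs)         # or: boots / anything else
#   git bisect log > bisect.log     # notes; "git bisect replay bisect.log" restores them

# Map one boot observation to the matching git bisect subcommand.
# (bisect_cmd_for and its labels are hypothetical, for illustration only.)
bisect_cmd_for() {
    case "$1" in
        boots) echo "git bisect good" ;;   # system came up normally
        hangs) echo "git bisect bad" ;;    # stuck at "Loading initial ramdisk"
        *)     echo "git bisect skip" ;;   # failed in some other way, verdict unclear
    esac
}
```

Marking an unrelated failure (for example a hang that still prints kernel messages) as skip rather than bad keeps a second bug from steering the bisect toward the wrong commit.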
Comment 22 Azamat S. Kalimoulline 2022-08-12 21:52:43 UTC
Yeah, I understand that it is bisecting a bunch of commits. This is a standard search method. So, this is the result:

> # git bisect good
> f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 is the first bad commit
> commit f289041ed4cf9a3f6e8a32068fef9ffb2acc5662
> Author: Vlastimil Babka <vbabka@suse.cz>
> Date:   Mon Dec 14 19:13:45 2020 -0800
> 
>     mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO
>      
>     CONFIG_PAGE_POISONING_ZERO uses the zero pattern instead of 0xAA.  It was
>     introduced by commit 1414c7f4f7d7 ("mm/page_poisoning.c: allow for zero
>     poisoning"), noting that using zeroes retains the benefit of sanitizing
>     content of freed pages, with the benefit of not having to zero them again
>     on alloc, and the downside of making some forms of corruption (stray
>     writes of NULLs) harder to detect than with the 0xAA pattern.  Together
>     with CONFIG_PAGE_POISONING_NO_SANITY it made possible to sanitize the
>     contents on free without checking it back on alloc.
>     
>     These days we have the init_on_free() option to achieve sanitization with
>     zeroes and to save clearing on alloc (and without checking on alloc).
>     Arguably if someone does choose to check the poison for corruption on
>     alloc, the savings of not clearing the page are secondary, and it makes
>     sense to always use the 0xAA poison pattern.  Thus, remove the
>     CONFIG_PAGE_POISONING_ZERO option for being redundant.
>     
>     Link: https://lkml.kernel.org/r/20201113104033.22907-6-vbabka@suse.cz
>     Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>     Acked-by: David Hildenbrand <david@redhat.com>
>     Cc: Mike Rapoport <rppt@linux.ibm.com>
>     Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>     Cc: Alexander Potapenko <glider@google.com>
>     Cc: Kees Cook <keescook@chromium.org>
>     Cc: Laura Abbott <labbott@kernel.org>
>     Cc: Mateusz Nosek <mateusznosek0@gmail.com>
>     Cc: Michal Hocko <mhocko@kernel.org>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
>  include/linux/poison.h       |  4 ----
>  mm/Kconfig.debug             | 12 ------------
>  mm/page_alloc.c              |  8 +-------
>  tools/include/linux/poison.h |  6 +-----
>  4 files changed, 2 insertions(+), 28 deletions(-)
Comment 23 Azamat S. Kalimoulline 2022-08-17 08:38:52 UTC
What else can I provide?
Comment 24 Vlastimil Babka 2022-08-17 09:59:47 UTC
Hi, I don't see how the commit could cause this bug, could be another stray bisect. Only theoretically, if your .config (before this commit, i.e. on 5.10) enabled CONFIG_PAGE_POISONING_ZERO, and there was some broken code that didn't properly initialize allocated object and unknowingly relied on this zeroing, then removing the zeroing could expose the bug. So maybe share your 5.10 config?
Comment 25 Azamat S. Kalimoulline 2022-08-17 11:04:01 UTC
I ran make oldconfig, so it is the same. One other thing that could help: this may be caused by a platform bug (HP EliteBook 845 G7). There are some non-obvious actions after which the kernel works: pressing the down button at boot and then choosing the Linux entry in the boot menu. This does not work when doing the same but calling the boot menu via F9, or when booting by default. So if you want more info, I can provide it.
Comment 26 Vlastimil Babka 2022-08-17 11:15:42 UTC
"make oldconfig" uses your existing .config as a base, so unless you upload it here, we can't know what it contains.
Comment 27 Azamat S. Kalimoulline 2022-08-17 11:18:54 UTC
I did cp /boot/config-5.10.132 .oldconfig before it.
Comment 28 Azamat S. Kalimoulline 2022-08-17 11:19:18 UTC
This config-5.10.132 is the config of the kernel built by Void Linux.
Comment 29 Borislav Petkov 2022-08-17 11:32:25 UTC
You need to upload this config here to bugzilla so that he can take a look.
Comment 30 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-08-17 11:48:55 UTC
(In reply to Vlastimil Babka from comment #26)
> "make oldconfig" uses your existing .config as a base, so unless you upload
> it here, we can't know what it contains.

did you do this before every build? That in some cases seems to be needed to avoid corner cases. See these mails for details and solutions:

https://lore.kernel.org/all/CAMuHMdXRoVt_zRBNvugJjYhJnyYbABeCWv9fFRM0r_=s7FYvJQ@mail.gmail.com/
https://lore.kernel.org/all/12e09497-a848-b767-88f4-7dabd8360c5e@leemhuis.info/

Side note: I googled and quickly found a hit where somebody was running Fedora with Linux 5.13 on an HP EliteBook 845 G7. Makes me wonder if the full disk encryption or void's kernel config (are there maybe fresher void live media you could try?) might have an influence here. But in the end it doesn't matter that much, as I guess we need a reliable bisection to get to the root of the problem.
Comment 31 Azamat S. Kalimoulline 2022-08-17 12:45:23 UTC
> did you do this before every build?

Yes, I made a script to rule out manual entry errors.

> I googled and quickly found a hit where somebody was running Fedora with
> Linux 5.13 on a HP EliteBook 845 G7.

I can run Void Linux XFCE live disk with kernel >5.10 without full disk encryption. Also I can run my Void Linux stick with 5.18 kernel on another laptop or PC.

> I guess we need a reliable bisection

This is a reliable bisection on the HP EliteBook 845 G7. Maybe I need to repeat it with other parameters to exclude other bugs. But with the current bug the kernel does not load at all, i.e. the initrd does not even start. And as advised before, I redid the bisection with modprobe.blacklist=amdgpu. But this matters for GPU errors only.
Comment 32 Azamat S. Kalimoulline 2022-08-17 12:50:29 UTC
Created attachment 301592
stock 5.10 config
Comment 33 Azamat S. Kalimoulline 2022-08-17 12:50:54 UTC
Created attachment 301593
first non boot config
Comment 34 Azamat S. Kalimoulline 2022-08-17 12:53:08 UTC
When it doesn't boot due to the amdgpu bug, it shows kernel messages and then gets stuck, and we can reboot with Ctrl-Alt-Del. When it doesn't boot in this case, it shows "Loading initial ramdisk", but this message is printed by GRUB; the initrd is actually not loaded and not executed, and Ctrl-Alt-Del has no effect.
Comment 35 Azamat S. Kalimoulline 2022-08-17 12:55:26 UTC
Again, I'm able to boot by pressing the DOWN arrow and selecting my OS from the boot menu; GRUB then loads and I can boot the 5.18 kernel. If I press F9 for the boot menu and do the same, I CANNOT BOOT kernels higher than 5.10.x. So maybe this is a platform bug of that laptop, but it is annoying for me, and 5.10.x boots without problems.
Comment 36 Vlastimil Babka 2022-08-17 13:03:47 UTC
It seems CONFIG_PAGE_POISONING_ZERO was indeed used, but also INIT_ON_ALLOC_DEFAULT_ON=y, which should have pre-zeroed pages anyway, thus a buggy driver or something shouldn't see a difference. Maybe you can try enabling CONFIG_INIT_ON_FREE_DEFAULT_ON, which will do the zeroing on free, as PAGE_POISONING_ZERO did. But I would be surprised if that helped.
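
For anyone comparing their own configs, the options discussed here can be pulled out of a config file with a plain grep. This is a sketch only; the function name show_poison_config is made up, and the option list is limited to the ones named in this thread and in the commit message above.

```shell
#!/bin/sh
# List the page-poisoning and init-on-alloc/free options from a kernel config,
# including lines of the form "# CONFIG_FOO is not set".
# (show_poison_config is a hypothetical helper name.)
show_poison_config() {
    grep -E '^(# )?CONFIG_(PAGE_POISONING(_ZERO|_NO_SANITY)?|INIT_ON_(ALLOC|FREE)_DEFAULT_ON)[= ]' "$1"
}
```

For example, show_poison_config /boot/config-5.10.132 would print the relevant lines from the distro config mentioned earlier in this report.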
Comment 37 Azamat S. Kalimoulline 2022-08-17 13:05:43 UTC
I will try it and report the results.
Comment 38 Azamat S. Kalimoulline 2022-08-17 13:06:28 UTC
Do I only need to set CONFIG_INIT_ON_FREE_DEFAULT_ON=y?
Comment 39 Vlastimil Babka 2022-08-17 13:07:59 UTC
(In reply to Azamat S. Kalimoulline from comment #38)
> Do I only need to set CONFIG_INIT_ON_FREE_DEFAULT_ON=y?

yes
Comment 40 till-kernel@0xaa55.org 2022-08-17 21:39:10 UTC
Based on the results from the bisect, I was finally able to boot into a current kernel by using the kernel command line option page_poison=0. Void Linux sets page_poison=1 by default since kernel 5.3. Seems like CONFIG_PAGE_POISONING_ZERO is indeed somehow the root cause for not being able to boot kernels >5.10.
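
A minimal sketch of how the page_poison value on a command line string could be switched, for anyone scripting this workaround. The function set_cmdline_param is a made-up helper, and where the resulting string has to be stored depends on the bootloader setup; on a GRUB system that is often GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, but that is an assumption about the local configuration.

```shell
#!/bin/sh
# Replace key=... in a kernel command line, or append key=value if absent.
# Usage: set_cmdline_param "<cmdline>" <key> <value>
# (set_cmdline_param is a hypothetical helper name.)
# The body runs in a subshell so that "set -f" and the variables do not leak.
set_cmdline_param() (
    set -f                      # no glob expansion while splitting on spaces
    cmdline=$1 key=$2 value=$3
    out=
    found=no
    for token in $cmdline; do
        case "$token" in
            "$key"=*) token="$key=$value"; found=yes ;;
        esac
        out="${out:+$out }$token"
    done
    [ "$found" = yes ] || out="${out:+$out }$key=$value"
    printf '%s\n' "$out"
)
```

For instance, set_cmdline_param "$(cat /proc/cmdline)" page_poison 0 prints the current command line with poisoning turned off.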
Comment 41 Vlastimil Babka 2022-08-17 21:52:10 UTC
(In reply to till-kernel@0xaa55.org from comment #40)
> Based on the results from the bisect, I was finally able to boot into a
> current kernel by using the kernel command line option page_poison=0. Void
> Linux sets page_poison=1 by default since kernel 5.3.

Interesting. 

> Seems like
> CONFIG_PAGE_POISONING_ZERO is indeed somehow the root cause for not being
> able to boot kernels >5.10.

But the option was removed in >5.10. If you disable CONFIG_PAGE_POISONING_ZERO and keep the page_poison=1 enabled with 5.10, does it also fail to boot?

All this would mean something doesn't like filling pages that are freed with pattern 0xAA (standard poisoning pattern) but doesn't mind if they are filled with 0x00 (which CONFIG_PAGE_POISONING_ZERO did).
Comment 42 Azamat S. Kalimoulline 2022-08-18 12:04:19 UTC
> CONFIG_INIT_ON_FREE_DEFAULT_ON=y

With this, f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 does not boot.

> page_poison=0

With this, both f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 and the stock Void 5.18 kernel boot.
Comment 43 Azamat S. Kalimoulline 2022-08-18 12:10:18 UTC
> If you disable CONFIG_PAGE_POISONING_ZERO

But this option was removed in the patch we are talking about (f289041ed4cf9a3f6e8a32068fef9ffb2acc5662). Starting with this patch, the kernel doesn't boot.
Comment 44 Vlastimil Babka 2022-08-18 14:03:54 UTC
(In reply to Azamat S. Kalimoulline from comment #42)
> > CONFIG_INIT_ON_FREE_DEFAULT_ON=y
> 
> With this, f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 does not boot.

Can you try this config change with 5.18? There were some fixes to interaction between init_on_alloc/free and poisoning after f289041ed4cf9a3f6e8a32068fef9ffb2acc5662
Comment 45 Vlastimil Babka 2022-08-18 14:05:05 UTC
(In reply to Azamat S. Kalimoulline from comment #43)
> > If you disable CONFIG_PAGE_POISONING_ZERO
> 
> But this option was removed in the patch we are talking about
> (f289041ed4cf9a3f6e8a32068fef9ffb2acc5662). Starting with this patch, the
> kernel doesn't boot.

Yes, to confirm that I have suggested:

If you disable CONFIG_PAGE_POISONING_ZERO and keep the page_poison=1 enabled with kernel v5.10, does it also fail to boot?
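
One way to flip the option in an existing .config before rebuilding is sketched below. The helper disable_config_option is invented for illustration; the kernel tree's own scripts/config tool (scripts/config --disable PAGE_POISONING_ZERO) does the same job, and running make olddefconfig afterwards lets Kconfig re-resolve dependent options.

```shell
#!/bin/sh
# Rewrite "CONFIG_<OPT>=..." as "# CONFIG_<OPT> is not set" in a .config file.
# Usage: disable_config_option <file> <OPT>   (OPT without the CONFIG_ prefix)
# (disable_config_option is a hypothetical helper name.)
disable_config_option() {
    file=$1 opt=$2
    tmp=$(mktemp) || return 1
    # Write to a temp file first, then copy back, to avoid relying on sed -i.
    sed "s/^CONFIG_${opt}=.*/# CONFIG_${opt} is not set/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Because the pattern is anchored on the full option name followed by "=", disabling PAGE_POISONING_ZERO leaves CONFIG_PAGE_POISONING=y itself untouched.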
Comment 46 Azamat S. Kalimoulline 2022-08-18 14:07:11 UTC
> Can you try this config change with 5.18?

Will it be ok if I try it on master head?
Comment 47 Azamat S. Kalimoulline 2022-08-18 14:08:33 UTC
> If you disable CONFIG_PAGE_POISONING_ZERO and keep the page_poison=1 enabled
> with kernel v5.10, does it also fail to boot?

Do you mean I should rebuild the latest successful kernel with CONFIG_PAGE_POISONING_ZERO disabled, and boot it with page_poison=1?
Comment 48 Vlastimil Babka 2022-08-18 14:11:29 UTC
(In reply to Azamat S. Kalimoulline from comment #46)
> > Can you try this config change with 5.18?
> 
> Will it be ok if I try it on master head?

You can try, but it's just after rc1 so it might not be stable for other reasons... v5.19 tag could be safer bet.

(In reply to Azamat S. Kalimoulline from comment #47)
> > If you disable CONFIG_PAGE_POISONING_ZERO and keep the page_poison=1 enabled
> > with kernel v5.10, does it also fail to boot?
> 
> Do you mean I should rebuild the latest successful kernel with
> CONFIG_PAGE_POISONING_ZERO disabled, and boot it with page_poison=1?

Yeah, preferably v5.10 tag, not the parent commit of f289041ed4cf9a3f6e8a32068fef9ffb2acc5662 (again due to possible instability of intermediate version)
Comment 49 Azamat S. Kalimoulline 2022-08-18 14:43:51 UTC
> If you disable CONFIG_PAGE_POISONING_ZERO and keep the page_poison=1 enabled
> with kernel v5.10, does it also fail to boot?

5.10.135 with CONFIG_PAGE_POISONING_ZERO disabled fails to boot.
Comment 50 Azamat S. Kalimoulline 2022-08-18 15:18:23 UTC
> CONFIG_INIT_ON_FREE_DEFAULT_ON=y

5.18.16 fails to boot.
Comment 51 Vlastimil Babka 2022-08-21 22:15:10 UTC
(In reply to Azamat S. Kalimoulline from comment #49)
> 5.10.135 with CONFIG_PAGE_POISONING_ZERO disabled fails to boot.

Thanks, that confirms the commit isn't the root cause, but that page poisoning with non-zero pattern triggers some kind of problem on the system and it was always the case regardless of kernel version. Either the poisoning itself is buggy (improbable) or it's pointing to a bug elsewhere in the kernel, or maybe even firmware (since it depends on how you boot the kernel).

Not sure how to proceed though if you can't get any useful output from the failed boot. Maybe just disable poisoning then :/

(In reply to Azamat S. Kalimoulline from comment #50)
> > CONFIG_INIT_ON_FREE_DEFAULT_ON=y
> 
> 5.18.16 fails to boot.

That was actually not a useful suggestion from me, as when poisoning is enabled, it overrides INIT_ON_FREE (and INIT_ON_ALLOC), so the enabling of this option didn't have any effect.
Comment 52 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-09-20 12:32:53 UTC
I still have this regression report on the list of tracked regressions. I wonder if we should put this at rest, as unless I'm mistaken this seems to be a (very?) odd corner case and the reporter apparently found a workaround. Is that okay for everybody? /me will take "a few days of silence" as yes
Comment 53 Azamat S. Kalimoulline 2022-09-20 17:30:23 UTC
We found a workaround: setting page_poison=0 on the kernel cmdline. As long as it continues to work, it is sufficient for me.
Comment 54 Vlastimil Babka 2022-09-20 20:13:25 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #52)

Indeed I don't think it's really a "real" regression. What was apparently always true for the system in question (and I suspect it's UEFI related):

- no CONFIG_PAGE_POISONING (or boottime-disabled) - works
- CONFIG_PAGE_POISONING=y (and boottime-enabled) - doesn't work
- CONFIG_PAGE_POISONING_ZERO - works

5.11 removed the third option as obsolete, which made the distro kernel config switch to the second option. But that option didn't regress in 5.11, it always failed on this system. And it's a debugging option, if it fails, it means it exposes problems elsewhere.
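
The three cases above can be written down as a tiny decision function. This is a simplified model of the summary in this comment, not the kernel's actual logic, and the function name and argument convention are invented for illustration.

```shell
#!/bin/sh
# Effective fill pattern for freed pages, following the three cases above.
# Args: CONFIG_PAGE_POISONING (y/n), CONFIG_PAGE_POISONING_ZERO (y/n),
#       boot-time page_poison value (1/0).
# For 5.11 and later, pass "n" as the second argument since the option is gone.
# (poison_pattern is a hypothetical helper, not kernel code.)
poison_pattern() {
    if [ "$1" != y ] || [ "$3" != 1 ]; then
        echo off        # not built in, or disabled at boot: works here
    elif [ "$2" = y ]; then
        echo 0x00       # zero poisoning (5.10 and earlier): works here
    else
        echo 0xaa       # standard poison pattern: fails on this system
    fi
}
```

With the distro settings from this report, the 5.10 kernel corresponds to poison_pattern y y 1 and any later kernel to poison_pattern y n 1, which is the only combination that fails on this laptop.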
Comment 55 Azamat S. Kalimoulline 2022-09-20 20:27:48 UTC
As I said before, it is highly likely related to the hardware/firmware of the HP EliteBook, because even with stock options I can make it work with a few strange manipulations. https://bugzilla.kernel.org/show_bug.cgi?id=216340#c13
