Bug 218173 - Kernel 6.6 onwards hangs on "loading initial ramdisk"
Status: RESOLVED CODE_FIX
Alias: None
Product: EFI
Classification: Unclassified
Component: Boot
Hardware: Intel Linux
Importance: P3 normal
Assignee: Virtual assignee for kernel bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-21 13:33 UTC by River
Modified: 2023-12-21 14:03 UTC

See Also:
Kernel Version: 6.6.1, 6.6.2, 6.6.3
Subsystem:
Regression: Yes
Bisected commit-id: a1b87d54f4e45ff5e0d081fb1d9db3bf1a8fb39a


Attachments
GRUB output (4.09 MB, image/jpeg)
2023-11-21 14:15 UTC, River
efistub debug log prints (2.25 KB, patch)
2023-12-10 09:36 UTC, Ard Biesheuvel
more efistub debug log prints (3.52 KB, patch)
2023-12-10 10:32 UTC, Ard Biesheuvel
EFI debug memory map (19.66 KB, text/plain)
2023-12-10 23:41 UTC, bwg
proposed fix (3.38 KB, patch)
2023-12-11 17:00 UTC, Ard Biesheuvel

Description River 2023-11-21 13:33:06 UTC
After upgrading from 6.5.9 to 6.6.1 on my Dell Latitude E6420 (Intel i5-2520M) with EndeavourOS, the boot process would hang at "loading initial ramdisk". The issue is present on the 6.6.1 release of both Linux and Linux-zen, but not the 6.5.9 release, which makes me think this is somehow upstream in the kernel, rather than to do with packaging. My current workaround is using the Linux LTS kernel.

I have been unable to consistently reproduce this bug. Between 30 and 50 percent of the time, "loading initial ramdisk" is displayed, the disk activity indicator turns off briefly and then resumes blinking, and the kernel boots as expected. The other 50 to 70 percent of the time, the boot stops at "loading initial ramdisk" and the disk activity indicator turns off and does not resume blinking. The disk activity light is constantly flashing during normal system operation, so I know it's not secretly booting but not updating the display. I haven't been able to replicate this issue in QEMU. I have seen similar bugs that have been solved by disabling IOMMU, but this has not had any effect. Neither has disabling graphics drivers and modesetting. I have been able to reproduce it while using Nouveau, so I don't believe it has to do with Nvidia's proprietary drivers.

Examining dmesg and journalctl, there don't appear to be ANY logs from the failed boots. I don't believe the kernel is even started on these failed boots. Enabling GRUB debug messages (linux,loader,init,fs,device,disk,partition) shows that the hang occurs after GRUB attempts to start the loaded image: it's able to load the image into memory, but the boot stalls after "Starting image" with a hex address (presumably the start address of the kernel).

I've been trying to compile the kernel myself to see if I can solve the issue, or at least aid in reproducibility, but this is not easy or fast to do on a 2012 i5 processor. I'll update if I can successfully recompile the kernel and if it yields any information.

Please let me know if I should provide any additional information. This is my first time filing a bug here.
Comment 1 River 2023-11-21 13:37:37 UTC
I should add that this issue occurs on a completely vanilla, fresh, GUI install of EndeavourOS without any modifications, both on BTRFS and EXT4. I reinstalled the system three times to confirm.
Comment 2 Artem S. Tashkinov 2023-11-21 13:48:54 UTC
Remove quiet from the kernel boot options in GRUB, take a picture, upload it here.
Comment 3 River 2023-11-21 14:15:53 UTC
Created attachment 305452 [details]
GRUB output

Output of GRUB with noquiet, nosplash, and debug enabled.
Comment 4 River 2023-11-21 14:16:48 UTC
(In reply to Artem S. Tashkinov from comment #2)
> Remove quiet from the kernel boot options in GRUB, take a picture, upload it
> here.

I've just uploaded the output of GRUB with quiet and splash disabled.
Comment 5 Artem S. Tashkinov 2023-11-21 14:58:48 UTC
Please keep the boot options that you used and add these additional flags:

efi=debug earlycon=efifb keep_bootcon

And try again.

If there's something new in the output, please upload a new image.
Comment 6 Bagas Sanjaya 2023-11-22 00:09:46 UTC
(In reply to River from comment #0)

Since you have this regression using nouveau driver, please also report to
freedesktop tracker [1].

And after trying Artem's suggestion, try bisection to find exact commit that
introduces the regression (for reference see
Documentation/admin-guide/bug-bisect.rst in the kernel sources).

Thanks.

[1]: https://gitlab.freedesktop.org/drm/nouveau/-/issues
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-11-22 05:32:17 UTC
(In reply to Bagas Sanjaya from comment #6)

> Since you have this regression using nouveau driver, please also report to
> freedesktop tracker [1].
> [1]: https://gitlab.freedesktop.org/drm/nouveau/-/issues

Please only do so if there are strong indicators that this is a nouveau bug. I only skimmed the report and might have missed something, but to me it sounded like the kernel is crashing before it can output anything on screen. Graphics drivers can cause this, but so can a ton of other things; and when "nomodeset" does not work it's likely not the graphics driver.
Comment 8 Artem S. Tashkinov 2023-11-22 08:09:49 UTC
(In reply to Bagas Sanjaya from comment #6)
> 
> Since you have this regression using nouveau driver, please also report to
> freedesktop tracker [1].
> 
> And after trying Artem's suggestion, try bisection to find exact commit that
> introduces the regression (for reference see
> Documentation/admin-guide/bug-bisect.rst in the kernel sources).
> 
> Thanks.
> 
> [1]: https://gitlab.freedesktop.org/drm/nouveau/-/issues

Nothing in the report indicates it's related to or caused by nouveau. It could be caused by anything.

The system does not even boot the kernel properly.

Bisection on this old device could be problematic. Maybe we could get some more info first.
Comment 9 River 2023-11-22 15:40:41 UTC
(In reply to Artem S. Tashkinov from comment #8)

I agree, I don't believe it's caused by the graphics drivers. It's reproducible on both Nouveau and the proprietary drivers, and with nomodeset enabled on both. And you're right about the bisection - compiling Linux took over seven hours. But I currently have 6.6.2 compiled and I'm going to see if it boots, as well as attempting to get more information from GRUB.
Comment 10 River 2023-11-23 04:55:05 UTC
(In reply to Artem S. Tashkinov from comment #5)
> Please keep the boot options that you used and add these additional flags:
> 
> efi=debug earlycon=efifb keep_bootcon
> 
> And try again.
> 
> If there's something new in the output, please upload a new image.

I just attempted booting with these options set. No additional output was generated, other than what's in the picture I previously uploaded.
Comment 11 River 2023-11-23 04:55:56 UTC
I just attempted to boot a fresh, squeaky clean compilation of 6.6.2. Didn't boot. No new output. I'm going to start the arduous process of attempting a bisection. See if I can't track down the bad commit.
Comment 12 Artem S. Tashkinov 2023-11-23 07:14:48 UTC
1. You can build the kernel on a faster PC and then e.g. try to boot from a flash drive.
2. Do use ccache.
Comment 13 River 2023-11-26 19:43:05 UTC
(In reply to Artem S. Tashkinov from comment #12)
> 1. You can build the kernel on a faster PC and then e.g. try to boot from a
> flash drive.
> 2. Do use ccache.

I've got ccache set up right now, so hopefully that speeds up the process a little. I'll definitely consider your suggestion with regards to cross compiling; I can build the kernel inside WSL2 on my desktop, which is significantly faster than my laptop.
Comment 14 River 2023-11-26 20:22:58 UTC
The 6.6.2 kernel is now in the Arch repos and I can confirm that the hang is still present, so it's not just a fluke with the 6.6.1 version.
Comment 15 bwg 2023-11-26 22:52:40 UTC
I am also seeing the same problem, except my machine (i7-2600S, integrated Intel graphics) always hangs in the same manner as described in the first message and has never successfully booted with any of the 6.6 kernels (I've tried 6.6, 6.6.1, and 6.6.2). This is also an Arch install and I'm now using the 6.1 LTS branch with no problems. I've rebuilt the ramdisk image multiple times, and for me as well nothing is ever logged when attempting to boot with the 6.6 kernels, which also leads me to believe this is a kernel regression and not an Arch packaging issue.
Comment 16 River 2023-11-26 23:09:23 UTC
(In reply to bwg from comment #15)

Glad I'm not just a fluke, and that someone else is able to reproduce it.

Can you add the GRUB debug options I tested and let me know what GRUB says? If it stops at the same place mine does, we can be fairly certain it's the same problem, and it could be a quirk with Linux on Sandy Bridge architecture CPUs.
Comment 17 bwg 2023-11-29 19:20:24 UTC
Booting with kernel command line options noquiet nosplash debug loglevel=7 and the grub environment variable debug=linux, the boot stops at the same place as in your image ("loader/efi/linux.c:234:linux: starting image 0xc7934e18"). This is now with the 6.6.3 kernel.

When successfully booting the 6.1 kernel with the same options, there seems to be one more message printed after this point before the screen is redrawn; the screen is blanked too quickly after the message for me to read much of it, but it was related to the ramdisk. I'm not sure if the message would have been from grub or from the kernel at that point. As we're both booting with grub, I'm wondering if this could somehow be a grub issue.
Comment 18 River 2023-11-29 21:10:43 UTC
(In reply to bwg from comment #17)

I thought it was GRUB as well, but I was able to replicate it with systemd-boot. I ended up breaking my installation trying to go back to GRUB after that, lol. But yeah. Not GRUB, at least as far as I can tell.
Comment 19 bwg 2023-12-09 21:23:11 UTC
I've completed a git-bisect session on the mainline kernel tree between v6.5 and v6.6, which identifies commit 31c77a50992e8dd136feed7b67073bb5f1f978cc [1] as the first commit that exhibits the issue on my machine, which was part of the v6.6 x86 boot changes [2]. According to the commit message, it "duplicate[s] the SNP feature check in the EFI stub before handing over to the kernel proper", so it seems likely to me that this could be the right place for further investigation given the failure to get any kernel output after GRUB reports loading the ramdisk image. 

I'll keep investigating from here, but I'm very far out of my depth, so it'd be great if somebody familiar with the x86 boot process who might have an idea of why this could have broken EFI booting apparently only on (some subset of?) Sandy Bridge machines could have a look. I've built a 6.7-rc4 kernel and it also fails to boot.

[1] https://lore.kernel.org/all/20230807162720.545787-23-ardb@kernel.org/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bd9e99f790f21374b831a7dcf638156beacf3bf4
Comment 20 bwg 2023-12-10 00:47:56 UTC
My apologies, I copied the wrong commit id when composing my earlier message. It's in fact the next commit, a1b87d54f4e45ff5e0d081fb1d9db3bf1a8fb39a [1], "Avoid legacy decompressor when doing EFI boot", that git bisect identifies as the problematic one. 

[1] https://lore.kernel.org/all/20230807162720.545787-24-ardb@kernel.org/
Comment 21 River 2023-12-10 01:46:44 UTC
(In reply to bwg from comment #19)

Oh my god, thank you so much for doing this. I managed to complete like 3 compilations in my bisection before I had to wipe my drive and reinstall due to drivers being messed up and preventing my display server from loading. (I blame nvidia.)

Agreed, hopefully someone in the dev team will understand the code changes better than I can, since I don't really know any bare-metal C. I also see your message about copying the wrong commit ID. Makes a little sense, as newer decompression algos might have issues on older hardware like SB. Thank you again!
Comment 22 River 2023-12-10 01:50:35 UTC
When I get some time, I'll compile the commit prior to the one you mentioned, and attempt to boot it. If it works, and the commit after that doesn't, I'll update you all here in this thread, and we can confirm for certain it's a problem with that exact commit. Then maybe we can see about getting a fix merged. Like I said, unfortunately I know neither C nor Assembly, so this is wholly out of my league. Here's to hoping this gets seen by people who know what they're doing.
Comment 23 Ard Biesheuvel 2023-12-10 07:35:03 UTC
(In reply to bwg from comment #15)

Can you provide some more context on the type of system (make/model, age, BIOS vendor)?

@River: your system seems to be a bit older; does it have a serial port by any chance?
Comment 24 Ard Biesheuvel 2023-12-10 09:36:12 UTC
Created attachment 305575 [details]
efistub debug log prints

Please try booting with 'efi=debug' and the attached patch applied, and report back with the log output that is emitted before the hang.
Comment 25 Ard Biesheuvel 2023-12-10 10:32:37 UTC
Created attachment 305576 [details]
more efistub debug log prints

Actually, please try this one instead - it has a couple more debug prints around the kernel decompression and KASLR randomization, which is where I suspect the problem lies.

Please also try booting with 'nokaslr' to see if it makes a difference or not.
Comment 26 Tommaso Fonda 2023-12-10 14:39:22 UTC
(In reply to Ard Biesheuvel from comment #23)
> Can you provide some more context on the type of system (make/model, age,
> BIOS vendor)?
> 
> @River: your system seems to be a bit older; does it have a serial port by
> any chance?

Hi, I am facing the same issue on a Dell T5600 workstation from 2012, with BIOS version A19 from Dell itself.
I will apply the patch you posted in your most recent message and post the log output as soon as possible.
Thanks!
Comment 27 River 2023-12-10 15:09:18 UTC
(In reply to Ard Biesheuvel from comment #23)

It doesn't have an RS-232 serial port, but I think it might support serial over USB, and it has a Port Replicator slot with serial built in, for use in a docking station.
Comment 28 River 2023-12-10 17:23:25 UTC
(In reply to Ard Biesheuvel from comment #25)
> Created attachment 305576 [details]
> more efistub debug log prints
> 
> Actually, please try this one instead - it has a couple of more debug prints
> around the kernel decompression and KASLR randomization, which is where I
> suspect the problem lies.
> 
> Please also try booting with 'nokaslr' to see if it makes a difference or
> not.

Booting with nokaslr set, it works 4/4 times. This does seem to be the issue for me.
Comment 29 Tommaso Fonda 2023-12-10 17:42:21 UTC
I confirm that nokaslr allows me to boot.
Comment 30 bwg 2023-12-10 18:48:14 UTC
(In reply to Ard Biesheuvel from comment #23)

This is a Dell Optiplex 990 from 2011 (I upgraded the CPU at one point), with the most recent Dell BIOS rev. A23 from 2018. It does have a serial port, if that would be helpful. The commonality between the now three reports is that these are all older Dell systems. There's a note on the Arch wiki about Dell's buggy EFI implementation on systems of about this vintage in another context. [1] Last night I compiled a 6.7 kernel with CONFIG_EFI_STUB=n which boots normally with GRUB.

I'm compiling a kernel with your patch in #25 applied and will additionally check the nokaslr option as well.

[1] https://wiki.archlinux.org/title/EFISTUB#EFISTUB_does_not_work_on_some_Dell_systems
Comment 31 bwg 2023-12-10 20:28:00 UTC
Booting v6.7-rc4 with the EFI debug patch from #25 hangs after this output:

EFI stub: Decompressing kernel
EFI stub: Getting random seed for KASLR from EFI
EFI stub: Getting random seed for KASLR from image
EFI stub: Allocating memory for image
EFI stub: Calling decompress_kernel()


I confirm too that setting nokaslr allows both my 6.7-rc4 kernel and the 6.6.5 image from the Arch repository to boot.
Comment 32 Ard Biesheuvel 2023-12-10 22:01:46 UTC
Thanks all, this is excellent progress.

Please capture a couple of success and failure cases with the following hunk applied

--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -812,6 +812,7 @@ static efi_status_t efi_decompress_kernel(unsigned long *kernel_entry)
        if (status != EFI_SUCCESS)
                return status;
 
+       efi_info("Decompressing kernel to %lx\n", addr);
        entry = decompress_kernel((void *)addr, virt_addr, error);
        if (entry == ULONG_MAX) {
                efi_free(alloc_size, addr);

and share the EFI memory map listing that is emitted in the kernel log (with efi=debug).

Hopefully, there will be a discernible pattern in the non-working randomized offsets, and the EFI memory map can shed some light on why those regions are problematic.
Comment 33 bwg 2023-12-10 23:41:46 UTC
Created attachment 305578 [details]
EFI debug memory map

(In reply to Ard Biesheuvel from comment #32)
> Thanks all, this is excellent progress.
> 
> Please capture a couple of success and failure cases with the following hunk
> applied
> [...]
> and share the EFI memory map listing that is emitted in the kernel log (with
> efi=debug).
> 
> Hopefully, there will be a discernible pattern in the non-working randomized
> offsets, and the EFI memory map can shed some light on why those regions are
> problematic.

I'm attaching dmesg excerpts with the EFI memory maps reported by the kernel.

On 5 attempts with KASLR enabled, the system hangs after the EFI stub tries to decompress the kernel to these addresses:

4fca00000
667400000
5b3e00000
5c9400000
3ef600000

(note that unlike River upthread my machine has never succeeded booting since 6.6 with KASLR enabled)

With KASLR disabled, the kernel boots with these debug messages:

EFI stub: Decompressing kernel
EFI stub: Allocating memory for image
EFI stub: Decompressing kernel to 200000
EFI stub: Adjusting memory protections
EFI stub: Loading initrd
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Exiting boot services
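
One thing the numbers above do show: all five failing targets lie above the 4 GiB boundary, while the working nokaslr target (200000) lies below it. The following is only a sketch for checking that observation, assuming a 2 MiB alignment (the usual x86_64 default for CONFIG_PHYSICAL_ALIGN); the helper names are hypothetical, and five samples do not prove that the boundary is the cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FOUR_GIB      (UINT64_C(1) << 32)
/* Assumption: CONFIG_PHYSICAL_ALIGN is 2 MiB on this build. */
#define ASSUMED_ALIGN UINT64_C(0x200000)

/* Decompression targets copied from this comment. */
const uint64_t failing_targets[] = {
    UINT64_C(0x4fca00000), UINT64_C(0x667400000), UINT64_C(0x5b3e00000),
    UINT64_C(0x5c9400000), UINT64_C(0x3ef600000),
};
#define N_FAILING (sizeof(failing_targets) / sizeof(failing_targets[0]))

const uint64_t working_target = UINT64_C(0x200000);

/* Hypothetical helper: true if addr is at or above the 4 GiB boundary. */
bool is_above_4gib(uint64_t addr)
{
    return addr >= FOUR_GIB;
}

/* Hypothetical helper: true if addr is a multiple of align (a power of two). */
bool is_aligned(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Count how many of the n addresses lie at or above 4 GiB. */
size_t count_above_4gib(const uint64_t *addrs, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_above_4gib(addrs[i]))
            count++;
    }
    return count;
}
```

Calling count_above_4gib(failing_targets, N_FAILING) covers the five failing addresses, and is_above_4gib(working_target) covers the nokaslr case.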
Comment 34 Ard Biesheuvel 2023-12-11 08:26:54 UTC
OK, so the best course of action here is probably to simply disable physical KASLR on older systems. (Virtual KASLR is much more important in any case.)

So I intend to submit a patch that does something like this:

--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -804,6 +804,19 @@ static efi_status_t efi_decompress_kernel(unsigned long *kernel_entry)
 
                virt_addr += (range * seed[1]) >> 32;
                virt_addr &= ~(CONFIG_PHYSICAL_ALIGN - 1);
+
+               /*
+                * Older Dell systems with AMI firmware will hang while
+                * decompressing the kernel if physical address randomization
+                * is enabled.
+                *
+                * https://bugzilla.kernel.org/show_bug.cgi?id=218173
+                */
+               if (efi_system_table->hdr.revision <= EFI_2_00_SYSTEM_TABLE_REVISION) {
+                       efi_debug("UEFI v2.0 or older detected - disabling physical KASLR\n",
+                                 efi_system_table->fw_revision);
+                       seed[0] = 0;
+               }
        }
 
        status = efi_random_alloc(alloc_size, CONFIG_PHYSICAL_ALIGN, &addr,

So I'll need confirmation that disabling physical KASLR (by wiping the seed) is sufficient, and that UEFI revision 2.0 or older is a reasonable cutoff point.

Thanks,
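
The virt_addr context lines and the proposed cutoff can be sketched in isolation. This is a minimal illustration, not the patch itself: scale_seed() and revision_at_most_2_00() are hypothetical names, the macro follows the UEFI convention of encoding the system table revision as (major << 16) | minor, and range is assumed to fit in 32 bits so the multiplication cannot overflow. How efi_random_alloc() consumes seed[0] is not shown in this thread, so it is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* UEFI encodes the system table revision as (major << 16) | minor. */
#define EFI_REVISION(major, minor) \
    (((uint32_t)(major) << 16) | (uint32_t)(minor))
#define EFI_2_00_SYSTEM_TABLE_REVISION EFI_REVISION(2, 0)

/* Hypothetical helper: the cutoff test used in the hunk above. */
bool revision_at_most_2_00(uint32_t hdr_revision)
{
    return hdr_revision <= EFI_2_00_SYSTEM_TABLE_REVISION;
}

/*
 * Hypothetical helper: scale a 32-bit seed into [0, range) and round the
 * result down to a multiple of align, like the two virt_addr lines.
 * Assumes range < 2^32 and that align is a power of two.
 */
uint64_t scale_seed(uint32_t seed, uint64_t range, uint64_t align)
{
    assert(range <= UINT32_MAX);
    assert(align != 0 && (align & (align - 1)) == 0);

    uint64_t offset = (range * seed) >> 32;

    return offset & ~(align - 1);
}
```

In this scaling a zero seed always yields a zero offset, and revisions 1.10 and 2.0 both pass the cutoff test while 2.31 does not.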
Comment 35 Tommaso Fonda 2023-12-11 15:30:16 UTC
(In reply to Ard Biesheuvel from comment #34)

This patch fixes the issue for me (Dell T5600). I'm not a UEFI expert so I can't comment on whether UEFI 2.0 is a good cutoff point or not.
Many thanks!
Comment 36 Ard Biesheuvel 2023-12-11 17:00:35 UTC
Created attachment 305588 [details]
proposed fix

Thanks a lot for the report.

UEFI v2.0 is extremely old, but all we really need is a way to identify these systems.

However, I did remember that many Apple x86 Macs report EFI v1.10, even much more recent ones, so I will need to add a check on the firmware vendor as well. As far as I can tell, this is 'American Megatrends' - please speak up if the 'EFI v2.0 by xxx' line in your dmesg lists something else.

Attached is the patch that I will send out tomorrow; it should land in Linus's tree by the next -rc.
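
A rough sketch of the combined revision and vendor check described above, written as a standalone program rather than as the attached patch. All function names are hypothetical, the firmware vendor is assumed to be a NUL-terminated UCS-2 string as in the EFI system table, and the "Apple" vendor string in the usage note is only an illustrative value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* UEFI encodes the system table revision as (major << 16) | minor. */
#define EFI_REVISION(major, minor) \
    (((uint32_t)(major) << 16) | (uint32_t)(minor))
#define EFI_2_00_SYSTEM_TABLE_REVISION EFI_REVISION(2, 0)

/* True if the NUL-terminated UCS-2 string starts with the ASCII prefix. */
bool ucs2_starts_with(const uint16_t *vendor, const char *prefix)
{
    for (size_t i = 0; prefix[i] != '\0'; i++) {
        /* A shorter vendor string hits its terminator here and fails. */
        if (vendor[i] != (uint16_t)(unsigned char)prefix[i])
            return false;
    }
    return true;
}

/* Hypothetical quirk test: old revision AND American Megatrends firmware. */
bool phys_kaslr_quirk_applies(uint32_t hdr_revision, const uint16_t *fw_vendor)
{
    return hdr_revision <= EFI_2_00_SYSTEM_TABLE_REVISION &&
           ucs2_starts_with(fw_vendor, "American Megatrends");
}

/* Convenience wrapper: widen an ASCII vendor name (up to 63 chars) first. */
bool phys_kaslr_quirk_applies_ascii(uint32_t hdr_revision, const char *vendor)
{
    uint16_t wide[64];
    size_t i = 0;

    for (; vendor[i] != '\0' && i < 63; i++)
        wide[i] = (uint16_t)(unsigned char)vendor[i];
    wide[i] = 0;

    return phys_kaslr_quirk_applies(hdr_revision, wide);
}
```

With this shape, phys_kaslr_quirk_applies_ascii(EFI_REVISION(2, 0), "American Megatrends") matches, while EFI_REVISION(1, 10) with "Apple" does not, which is the case the vendor check is meant to exclude.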
Comment 37 Tommaso Fonda 2023-12-11 18:05:28 UTC
(In reply to Ard Biesheuvel from comment #36)

I cannot test the new patch today, but I confirm that my firmware is by American Megatrends.
Comment 38 bwg 2023-12-12 01:55:12 UTC
The latest patch allows 6.7-rc5 to boot on my system without the nokaslr option -- thanks!
Comment 39 River 2023-12-12 02:31:44 UTC
My system reports its UEFI version as follows:
[    0.000000] efi: EFI v2.00 by American Megatrends
I am currently compiling the patched kernel, so we'll see how it boots. Looks about halfway through, so around four or five hours to go. I suspect it will work, given everyone else's successful boots.
Comment 40 River 2023-12-12 18:19:54 UTC
(In reply to Ard Biesheuvel from comment #34)

Just rebooted three times with this patch. It worked all three times. I can't thank you enough for your help. Same goes to everyone else who helped narrow it down, and bwg for doing the bisect. I hope to see this in-tree soon.
