Bug 68761 - Unable to boot using UEFI
Summary: Unable to boot using UEFI
Status: RESOLVED CODE_FIX
Alias: None
Product: EFI
Classification: Unclassified
Component: Boot
Hardware: All Linux
Importance: P1 normal
Assignee: EFI Virtual User
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-01-15 22:19 UTC by Brian
Modified: 2014-08-06 14:38 UTC
CC List: 26 users

See Also:
Kernel Version: 3.12.7
Tree: Mainline
Regression: No


Attachments
lowmem patch (1.19 KB, patch)
2014-01-17 15:51 UTC, Matt Fleming
banner (3.27 KB, patch)
2014-01-19 18:43 UTC, Matt Fleming
0001-x86-efi-Add-EFI-framebuffer-earlyprintk-support.patch (8.54 KB, patch)
2014-01-20 14:44 UTC, Matt Fleming
0002-x86-efi-Fix-earlyprintk-off-by-one-bug.patch (1.25 KB, patch)
2014-01-20 14:45 UTC, Matt Fleming
dmesg of succesful UEFI shell 3.12.7 boot (77.67 KB, text/x-log)
2014-01-20 15:57 UTC, Bastian Beischer
unpatched build of arch 3.12.7-2 kernel (3.69 MB, application/octet-stream)
2014-01-21 00:55 UTC, Ulf Winkelvos
build of arch 3.12.7-2 kernel with lowmem, banner and earlyprintk patches (3.69 MB, application/octet-stream)
2014-01-21 00:57 UTC, Ulf Winkelvos
build of arch 3.12.7-2 kernel with lowmem, banner and earlyprintk patches + no_setup_efi_pci patch (3.69 MB, application/octet-stream)
2014-01-21 00:59 UTC, Ulf Winkelvos
debug messages from unsuccessful boot (806.97 KB, image/jpeg)
2014-01-21 01:36 UTC, Ulf Winkelvos
debug messages from successful boot (very bad quality) (2.72 MB, image/jpeg)
2014-01-21 01:37 UTC, Ulf Winkelvos
reloc.patch (3.13 KB, application/octet-stream)
2014-01-21 15:59 UTC, Matt Fleming
Reloc patch output (1.60 MB, image/jpeg)
2014-01-21 19:42 UTC, Brian
Gummiboot output (1.16 KB, text/plain)
2014-01-21 20:55 UTC, Brian
3.12.8 with applied reloc.patch kernel panic (1.91 MB, image/jpeg)
2014-01-22 19:03 UTC, Max Liebkies
reloc2.patch (3.57 KB, patch)
2014-01-24 11:29 UTC, Matt Fleming
comment call to setup_efi_pci() (500 bytes, patch)
2014-03-24 01:50 UTC, Ulf Winkelvos
Kernel which fails to boot (3.64 MB, application/octet-stream)
2014-04-11 23:51 UTC, Douglas Young
Kernel which boots (3.64 MB, application/octet-stream)
2014-04-11 23:54 UTC, Douglas Young
3.14.1 hangs, output with ignore_loglevel and earlyprintk=efi (1.32 MB, image/jpeg)
2014-04-23 21:00 UTC, Steven Vancoillie
better error logging in eboot.c (1.40 KB, patch)
2014-04-29 03:03 UTC, Ulf Winkelvos
add better error logging to efi-main (1.64 KB, patch)
2014-07-08 01:29 UTC, Ulf Winkelvos

Description Brian 2014-01-15 22:19:56 UTC
Many machines are failing to boot with UEFI using the 3.12.7 kernel. This was not an issue with kernel 3.12.6. This issue is being discussed in the following bug report: https://bugs.archlinux.org/task/33745?project=1&order=lastedit&sort=desc. We have now come to a consensus that this issue is a bug introduced in the 3.12.7 kernel that did not previously exist in the 3.12.6 kernel.
Comment 1 Matt Fleming 2014-01-16 09:16:51 UTC
This is unlikely to be a regression in the UEFI code because no changes occurred between 3.12.6 and 3.12.7.

Please try and bisect the issue.
Comment 2 Bastian Beischer 2014-01-16 12:32:24 UTC
Hi Matt,

I bisected the 3.12.6 to 3.12.7 code and this is the log:

git bisect start
# bad: [4301b7a8fe14a787fbf0bb9cad16b623f45956f6] Linux 3.12.7
git bisect bad 4301b7a8fe14a787fbf0bb9cad16b623f45956f6
# good: [d0266db287d492abe63e19859ad99dd232bc0e89] Linux 3.12.6
git bisect good d0266db287d492abe63e19859ad99dd232bc0e89
# good: [f3c1f0d0aaf20f9dee35ae99ec8b8705af4dc60e] drm/radeon: fix render backend setup for SI and CIK
git bisect good f3c1f0d0aaf20f9dee35ae99ec8b8705af4dc60e
# good: [f3b578d9d009a9f670e893cec8579aa069aaaccb] mm: numa: avoid unnecessary work on the failure path
git bisect good f3b578d9d009a9f670e893cec8579aa069aaaccb
# bad: [e93b100931a45490cd07960a1ec51d9d8e5100cb] GFS2: Fix slab memory leak in gfs2_bufdata
git bisect bad e93b100931a45490cd07960a1ec51d9d8e5100cb
# bad: [eede0e9020693adaeed01fb464261a00ce9d05ad] mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
git bisect bad eede0e9020693adaeed01fb464261a00ce9d05ad
# good: [ef36ec29945653ced2c30158213841d248299a8a] mm: fix TLB flush race between migration, and change_protection_range
git bisect good ef36ec29945653ced2c30158213841d248299a8a
# bad: [9c612a77032a98b264d12fd6e3df2ca530d968d2] mm: numa: defer TLB flush for THP migration as long as possible
git bisect bad 9c612a77032a98b264d12fd6e3df2ca530d968d2
# good: [186fa6eb6131954d17457f37283e654cb079c25b] mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates
git bisect good 186fa6eb6131954d17457f37283e654cb079c25b
# first bad commit: [9c612a77032a98b264d12fd6e3df2ca530d968d2] mm: numa: defer TLB flush for THP migration as long as possible

Also noteworthy: I get an error message as follows:

Failed to alloc lowmem for boot params

However, it's quite likely that the bisect does not help. As one of the other Arch users pointed out to me:

"Your bisect produced nonsense, since the problem seemingly appears and disappears at random if you rebuild the same code.

Also, it happens in the very early boot code (arch/x86/boot/compressed). The mm subsystem (and the rest of the kernel) isn't even loaded at this time.

It might have something to do with the alignment or the content of the compressed kernel. Using another compression type (such as XZ or LZO) might hide the problem again."

I now tend to agree with him.
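
The point made in the quote, that a bisect still names a "first bad commit" even when the failure does not depend on the commit, can be sketched in a few lines. This is a hypothetical Python model, not the procedure anyone in this thread ran: `bisect_first_bad`, the commit count and the coin-flip verdict are all assumptions made for illustration.

```python
import random

def bisect_first_bad(n_commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commit 0 is good and commit n_commits - 1 is bad,
    which is the same contract that git bisect relies on.
    """
    good, bad = 0, n_commits - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return bad

# A real regression introduced at (hypothetical) commit 120 is found reliably.
print(bisect_first_bad(200, lambda c: c >= 120))  # 120

# If boot success is unrelated to the commit (modelled here as a coin flip),
# every run still terminates and names some commit, usually a different one.
rng = random.Random(0)
flaky = [bisect_first_bad(200, lambda c: rng.random() < 0.5) for _ in range(5)]
print(flaky)
```

Rebuilding the same tree several times and checking whether the verdict is stable is one way to tell the two cases apart before trusting a bisect result.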
Comment 3 Alan 2014-01-16 22:44:30 UTC
It may be worth changing your configuration around a bit and then redoing the bisect to see if you end up in the same place.

A bad commit in the TLB handling area is not that implausible a place for weird and strange crashes.
Comment 4 Matt Fleming 2014-01-17 11:27:07 UTC
The "failed to alloc lowmem for boot params" error is revealing at least, since you shouldn't be hitting that at all.

Bastian, could you post a dmesg of a working kernel?

What hardware are you seeing this on?

What boot loader are you using to boot the machine, or are you booting it directly either via the UEFI shell or the UEFI Boot Manager?
Comment 5 Bastian Beischer 2014-01-17 12:36:22 UTC
My laptop is a Lenovo Thinkpad W520. I've compiled an x86_64 kernel and I'm booting with gummiboot, which as far as I understand is a simple UEFI Boot Manager.

https://wiki.archlinux.org/index.php/Gummiboot

Other examples of affected hardware are mentioned in the Arch bug report:

https://bugs.archlinux.org/task/33745?project=1&order=lastedit&sort=desc
Comment 6 Matt Fleming 2014-01-17 15:51:17 UTC
How did you compile the kernel? And which kernel did you use? Was it a native upstream kernel or some Arch kernel? Please be specific.

This could be a toolchain/build issue, or simply an EFI memory map config issue. The attached patch may alleviate the "Failed to alloc lowmem" issue you're seeing. Btw, does your machine always fail to boot in the same way? That is, do you always see the "Failed to alloc lowmem" error on a failed boot?

Please provide a dmesg from a working kernel.

In order to debug this we need to be able to pinpoint the exact differences between working and non-working kernels, and the above archlinux thread mostly contains "Version x.xxx.x doesn't work for me, x.xx.x does", which isn't all that helpful.
Comment 7 Matt Fleming 2014-01-17 15:51:43 UTC
Created attachment 122431 [details]
lowmem patch
Comment 8 Bastian Beischer 2014-01-17 17:02:52 UTC
Thanks for the patch. I'll provide all the required info on Sunday as I'm away for the weekend and don't have my laptop with me.
Comment 9 Brian 2014-01-17 21:41:00 UTC
I just tried out the patch. It does not solve the boot problem on my machine.
A description of how the kernel is being compiled can be found in this comment: https://bugs.archlinux.org/task/33745#comment118190. I'll let Bastian provide the other information.

I know Matt said it won't be very helpful, but weirdly the 3.12.8 kernel boots successfully on my machine. I compiled it exactly the same as the 3.12.7 kernel as in the comment I mentioned previously.

Just an FYI, although the 3.12.8 kernel has not been released to the Arch testing repository, the SVN PKGBUILD files have been updated to the 3.12.8 kernel. Here's the svn link to the PKGBUILD files: svn://svn.archlinux.org/packages/linux/trunk.
Comment 10 Ulf Winkelvos 2014-01-19 04:31:18 UTC
Yet again... does this patch work: http://pastebin.com/24kvw8kt ?
Comment 11 Matt Fleming 2014-01-19 12:30:25 UTC
Ulf, don't propose unrelated patches. The setup_efi patch would not account for seeing the lowmem error message Bastian is seeing.

The archlinux thread posted in this report is a complete mess of people reporting different errors, on different hardware, with different boot loaders. Clearly people are hitting *different* issues.

If removing setup_efi_pci() works for you, please open a separate bug report, attach the patch, describe the failure you see without the patch, and describe the hardware and boot loader you're using.

Brian, did you also see the "Failed to alloc lowmem for boot params" error? If not, do you see any error message whatsoever? When your machine fails to boot, how does it fail?

Conflating these error reports into one monolithic report is not the way to get these problems resolved.
Comment 12 Brian 2014-01-19 15:06:03 UTC
Matt, I do not ever see the "Failed to alloc lowmem..." message when my computer boots nor when it fails to boot (with and without the patch you posted above). For the times it fails to boot, I'll power on the machine and see the Lenovo UEFI splash screen as usual. Then the splash screen goes away as it always does during boot, but that's it. The backlight on my screen is in fact turned on, but no messages or anything shows up after the splash screen goes away.

Not sure if this will be of any help, but when this boot failure occurs on my machine, powering off the machine does not require me to hold the power button for 5+ seconds, A.K.A. perform a hard shutdown. Just holding the power button for about 1 or 2 seconds shuts the computer off. In other words, it's shutting down cleanly. Probably because the failed boot process doesn't make it to the point of mounting the filesystem.
Comment 13 Matt Fleming 2014-01-19 15:37:19 UTC
Brian, are you using a boot loader and if so, which one? Are you able to run the UEFI shell from your BIOS menu? If you can, try and execute the vmlinuz directly (you may need to rename it with a .efi extension, i.e. vmlinuz.efi), something along the lines of,

Shell> fs0:
fs0:\> vmlinuz.efi ignore_loglevel

and report whether you see any output. The 'ignore_loglevel' parameter is important: debugging this blind is going to be incredibly time consuming, and it appears the Arch Linux kernel package makes the boot super quiet (see change-default-console-loglevel.patch), which looks smarter when things are working OK but makes debugging difficult. Ideally we need some output to figure out where to start looking.

Since your machine powers off instantly, I suspect the kernel hasn't set up the interrupt tables by the time it crashes. It could be crashing in the EFI boot stub, or very early in the kernel boot code. The key to debugging this is to narrow down where we should start searching for bugs.
Comment 14 Brian 2014-01-19 16:41:21 UTC
I've used refind and gummiboot. Both fail to boot, but booting UEFI with GRUB2 is successful.

Yes, I can run UEFI shell.

Running "vmlinuz.efi ignore_loglevel" from the UEFI shell with a bootable kernel installed, a bunch of jargon is output to the screen. I see things like "Unable to mount filesystem...", "Kernel panic...", and at the end is a "Call Trace" list. It just hangs after that. I suppose this is normal since it is not the normal way to boot linux. I have to do an actual hard shutdown to power off (holding power button for 5+ seconds).

Running "vmlinuz.efi ignore_loglevel" from the UEFI shell with an unbootable kernel installed, there is no output after I hit the enter key. The system just hangs there in the UEFI shell. I can power off the machine like I described earlier, i.e., not a hard shutdown (holding power button for 1-2 seconds).
Comment 15 Matt Fleming 2014-01-19 18:42:19 UTC
Thanks Brian, those are all good data points. What model of Lenovo are you using?

So, first things first. Let's see whether we can make it into the EFI boot stub. Please apply the attached banner.patch. I'm hoping that because it applies directly to the EFI boot stub, and not the kernel proper, it won't alter your failure in any way. The patch will apply to v3.12.7.

Please report any output you see.
Comment 16 Matt Fleming 2014-01-19 18:43:15 UTC
Created attachment 122631 [details]
banner
Comment 17 Matt Fleming 2014-01-19 18:57:33 UTC
Bastian, I've opened the following bug report to track your issue, which is different from the one Brian is reporting,

  https://bugzilla.kernel.org/show_bug.cgi?id=69001

Let's move the "Failed to alloc lowmem for boot params" conversation over there.
Comment 18 Brian 2014-01-19 20:36:26 UTC
I am using a first generation Lenovo X1 Carbon.

Here is the output I get after applying your banner.patch:

    Building boot params
    Allocated boot params
    Built command line
    Finished building boot params
    EFI boot stub v3.12.7
    Setting up graphics
    Setting up pci
    Allocating gdt
    Allocating idt
    Relocating kernel
    Exiting boot services

Nothing else happens after this. The computer just hangs at this point.
Comment 19 Matt Fleming 2014-01-20 14:44:32 UTC
Thanks Brian.

Could you revert the banner patch and apply the following two patches to v3.12.7 and rebuild,

  0001-x86-efi-Add-EFI-framebuffer-earlyprintk-support.patch
  0002-x86-efi-Fix-earlyprintk-off-by-one-bug.patch

You'll need to enable CONFIG_EARLY_PRINTK_EFI in your config and pass earlyprintk=efi as an argument when booting your kernel, e.g.

fs0:\> vmlinuz.efi ignore_loglevel earlyprintk=efi

Hopefully you'll see some output (don't worry if the text scrolls slowly). If you still don't see any output, then we'll have to try something much more laborious.
Comment 20 Matt Fleming 2014-01-20 14:44:58 UTC
Created attachment 122711 [details]
0001-x86-efi-Add-EFI-framebuffer-earlyprintk-support.patch
Comment 21 Matt Fleming 2014-01-20 14:45:24 UTC
Created attachment 122721 [details]
0002-x86-efi-Fix-earlyprintk-off-by-one-bug.patch
Comment 22 Bastian Beischer 2014-01-20 15:57:16 UTC
Hey Matt and Brian,

I was able to boot the Arch core repo build of 3.12.7 from the UEFI shell.

This kernel shows all the symptoms Brian described when I try booting it from gummiboot. In particular there is no "Failed to alloc lowmem for boot params" message and the boot just hangs. Pressing the power button does shut the system down normally though, so it's not locked up completely.

Here's how I got it to boot:

- I copied my kernel image from /boot/vmlinuz-linux to /boot/3.12.7.efi

- I downloaded a binary build of the UEFI shell v1 and executed the shellx64.efi file from gummiboot, which gave me a UEFI shell (I can't go through my laptop's BIOS to obtain a UEFI shell)

- I executed the following UEFI shell commands:

fs0:
3.12.7.efi initrd=\initrams-linux.img root=/dev/sda2 rw

---> Kernel boots just fine! This is the first time I was able to boot this kernel build.

I didn't apply any of the patches Matt posted. I'll attach a dmesg of this successful boot.
Comment 23 Bastian Beischer 2014-01-20 15:57:48 UTC
Created attachment 122751 [details]
dmesg of succesful UEFI shell 3.12.7 boot
Comment 24 Bastian Beischer 2014-01-20 16:09:46 UTC
There's a typo in my previous post.

It should be

fs0:
3.12.7.efi initrd=\initramfs-linux.img root=/dev/sda2 rw

with "initramfs-linux.img" instead of "initrams-linux.img"
Comment 25 Brian 2014-01-20 18:33:24 UTC
Matt, I'm sorry to say this, but there was no output using the earlyprintk patches.
Comment 26 Ulf Winkelvos 2014-01-21 00:55:07 UTC
Created attachment 122821 [details]
unpatched build of arch 3.12.7-2 kernel
Comment 27 Ulf Winkelvos 2014-01-21 00:57:36 UTC
Created attachment 122831 [details]
build of arch 3.12.7-2 kernel with lowmem, banner and earlyprintk patches
Comment 28 Ulf Winkelvos 2014-01-21 00:59:32 UTC
Created attachment 122841 [details]
build of arch 3.12.7-2 kernel with lowmem, banner and earlyprintk patches + no_setup_efi_pci patch
Comment 29 Ulf Winkelvos 2014-01-21 01:36:43 UTC
Created attachment 122851 [details]
debug messages from unsuccessful boot
Comment 30 Ulf Winkelvos 2014-01-21 01:37:41 UTC
Created attachment 122861 [details]
debug messages from successful boot (very bad quality)
Comment 31 Ulf Winkelvos 2014-01-21 01:47:40 UTC
I built a set of kernel images based on the stock arch 3.12.7-2 PKGBUILD [1] with and without the attached lowmem, banner and earlyprintk patches. (I read too late that the latter were not intended to be used together, but I think it will not hurt either.) Neither kernel boots on my dell xps 13 fhd when launched from the efi shell. The unpatched one hangs with no output. The patched one hangs like Brian posted (https://bugzilla.kernel.org/show_bug.cgi?id=68761#c18). The third kernel has setup_efi_pci patched out and boots fine, until it can't open root because of the missing initrd (see attached jpgs). The patched kernels are incremental builds on top of the stock arch, all built in a chroot environment.


[1]: sha256sum of the attached kernels
5c2911ff966fcf25432d7140eb906603f8a4b106de7dcd43c75c033f933f2956  ulf-debugpatched-3.12.7-2.efi
7eb3e0ebac127bd60abef39a2c69eed27fa1029e600ab5f1e5009ec7acbdd5a1  ulf-debugpatched-nopci-3.12.7-2.efi
b28942870ddf51ee3e560028cc890cfb0d2e24727545ab76c0083b03a9c3dd0e  ulf-plain-3.12.7-2.efi
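
For anyone downloading the attached kernels, the checksums above can be checked with a short script. This is a minimal sketch using Python's standard `hashlib`; the helper names `sha256_hex` and `verify`, and the idea of keeping the three files in one directory under the names listed above, are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Checksums exactly as listed in the comment above.
EXPECTED = {
    "ulf-debugpatched-3.12.7-2.efi":
        "5c2911ff966fcf25432d7140eb906603f8a4b106de7dcd43c75c033f933f2956",
    "ulf-debugpatched-nopci-3.12.7-2.efi":
        "7eb3e0ebac127bd60abef39a2c69eed27fa1029e600ab5f1e5009ec7acbdd5a1",
    "ulf-plain-3.12.7-2.efi":
        "b28942870ddf51ee3e560028cc890cfb0d2e24727545ab76c0083b03a9c3dd0e",
}

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()

def verify(directory: str) -> dict:
    """Map each expected file name to True if it exists and its digest matches."""
    results = {}
    for name, digest in EXPECTED.items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha256_hex(path.read_bytes()) == digest
    return results
```

Calling `verify(".")` in the download directory returns a dictionary with one True or False entry per file; a missing file counts as a mismatch.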
Comment 32 felineswift+kernel 2014-01-21 11:51:11 UTC
Hey,

I probably have the same problem as Brian. Without patches there is no output from the EFI shell or gummiboot after launching the kernel. With the banner and earlyprintk patches (arch 3.12.7-2) I get the same output as comment #18 and a hang after that. UEFI Grub on an ubuntu liveusb boots successfully. The Arch liveusb does not. The most interesting thing is that 3.2.16-1 used to boot but doesn't anymore (same thing happens).
Comment 33 Matt Fleming 2014-01-21 15:58:45 UTC
Could everyone please try the following reloc.patch on top of v3.12.7 and report the output (feel free to apply the earlyprintk patches too).
Comment 34 Matt Fleming 2014-01-21 15:59:13 UTC
Created attachment 122911 [details]
reloc.patch
Comment 35 Brian 2014-01-21 19:41:26 UTC
Attached is an image of all the jargon that was output after applying both the reloc and earlyprintk patches.
Comment 36 Brian 2014-01-21 19:42:31 UTC
Created attachment 122931 [details]
Reloc patch output
Comment 37 Matt Fleming 2014-01-21 20:00:11 UTC
Brian, is that v3.12.7 with only the earlyprintk and reloc.patch? If so, it seems like your issue is resolved?

Could you verify it works with gummiboot too?
Comment 38 Brian 2014-01-21 20:11:57 UTC
Yes, it is v3.12.7.
No, computer still doesn't boot. I do get output when I boot using gummiboot, but it hangs after the "Call trace", as seen in the image I posted.
Comment 39 Matt Fleming 2014-01-21 20:29:04 UTC
Brian, sorry, I should have been clearer. Your kernel is panicking because it can't find the root file system, but you'll notice that you have to hold the button down to power off your machine now. You may want to try using the same command line that Bastian used, namely,

fs0:\> 3.12.7.efi initrd=\initramfs-linux.img root=/dev/sda2 rw

Does gummiboot produce the same call trace as in https://bugzilla.kernel.org/show_bug.cgi?id=68761#c36 ?


If so, I think we can conclude that this issue is resolved, since not finding the root file system is a config problem.

Ulf, does reloc.patch work for you without deleting setup_efi_pci()?
Comment 40 Brian 2014-01-21 20:54:24 UTC
I get a kernel panic regardless of whether I use gummiboot or boot manually through the UEFI shell.

I've attached a text file showing the output I get booting with gummiboot. Output is similar, if not the same, when booting through the UEFI shell.
Comment 41 Brian 2014-01-21 20:55:04 UTC
Created attachment 122941 [details]
Gummiboot output
Comment 42 Brian 2014-01-21 20:58:24 UTC
For completeness, yes, I have to do a forced shutdown.
May I ask for a high level explanation of what reloc.patch did?
Comment 43 Matt Fleming 2014-01-21 21:30:07 UTC
Great, thanks for testing.

reloc.patch fixes a bug in the EFI boot stub. The boot stub allocates a buffer and copies the kernel image into it, but it gets the alignment wrong. The kernel decompressor code checks the alignment, notices it's wrong, and rounds up the output buffer address past the buffer we originally allocated in the EFI boot stub. What happens when the decompressor runs is anybody's guess because no one has been able to get any output from the kernel, but rest assured it will be Bad Things (TM), i.e. overwriting random bits of firmware code/data.

Hitting this bug is going to depend on exactly where the kernel image is loaded in the address space, which means your choice of boot loader, kernel version and even kernel build affect your chance of a successful boot.

Thanks for your persistence in tracking this issue down, we got there eventually. I'll get this pushed upstream and it'll make its way to the stable kernels.

I am still interested to know whether reloc.patch fixes the setup_efi_pci() issue people are seeing. If anyone could provide any info, that would be much appreciated.
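
The alignment problem described above can be illustrated with plain address arithmetic. This is a rough Python sketch, not the boot stub code: the function names, the 2 MiB alignment, the buffer addresses and the image size are all made-up values chosen only to show the effect.

```python
def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

def overrun_bytes(buf_start, buf_size, image_size, alignment):
    """Bytes written past the allocated buffer if the consumer re-aligns its start.

    Models a consumer (the decompressor) that rounds the output address up
    to `alignment` and then writes `image_size` bytes from there.
    """
    out_start = align_up(buf_start, alignment)
    return max(0, out_start + image_size - (buf_start + buf_size))

ALIGN = 0x200000        # hypothetical required alignment (2 MiB)
IMAGE = 0x1200000       # hypothetical image size; buffer is exactly this big

# Buffer that happens to be aligned: nothing is written outside it.
print(hex(overrun_bytes(0x3A200000, IMAGE, IMAGE, ALIGN)))  # 0x0

# Same allocation one page higher: the rounded-up start pushes the
# tail of the image past the end of the buffer.
print(hex(overrun_bytes(0x3A201000, IMAGE, IMAGE, ALIGN)))  # 0x1ff000
```

Whether the overrun happens depends only on where the buffer lands, which is consistent with the observation that the boot loader, kernel version and even the individual build change the outcome.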
Comment 44 Bastian Beischer 2014-01-22 11:53:32 UTC
Hey Matt, that could be the culprit indeed :)

However, given that people were able to boot some builds of 3.12.7 in the past already, we can't conclude from Brian's successful boot that the reloc.patch fixes the issue. I guess we need more testers?

I'll test it now :)
Comment 45 Max Liebkies 2014-01-22 19:02:01 UTC
Hey Matt,

I was similarly affected by this bug. 3.12.7 didn't boot (at least not the precompiled version). Compiling 3.12.7 using -j 1 does however work for me, anything other than 1 results in the kernel not booting.

As 3.12.8 exhibits the same bug I applied your reloc.patch and booted it as I'd do with every other kernel. At least I can now see a kernel panic happening (see attached screenshot).

So I'm wondering: Should this patch fix the bug? If so, it does not, at least not for me because I didn't change the kernel commandline or any other configuration for the kernel with your reloc patch or working kernels. In my understanding the kernel _does_ find its root device but can't mount it? 

I'm using gummiboot (no EFI Shell).
Comment 46 Max Liebkies 2014-01-22 19:03:17 UTC
Created attachment 123051 [details]
3.12.8 with applied reloc.patch kernel panic
Comment 47 Brian 2014-01-22 19:47:19 UTC
Sorry, I should have been more specific. After applying reloc.patch, my computer did NOT boot successfully. I did get a kernel panic like Max described.
Comment 48 Bastian Beischer 2014-01-22 20:05:40 UTC
I concur with Max and Brian. I cannot boot the kernel after applying reloc.patch.
Comment 49 Ulf Winkelvos 2014-01-23 00:11:09 UTC
dell xps 13:
- arch 3.12.7-2 + reloc.patch: boots up to the kernel panic described by bastian, brian and max. (gummiboot and efi shell / kernel boots fine with syslinux in csm mode)
- arch 3.12.7-2 + reloc.patch + earlyprintk-efi.patch: boot hangs without output. (efi shell)

lenovo w520:
- arch 3.12.7-2 + reloc.patch: hangs at boot but after a couple of seconds the upper 1/10 of the screen turns grey. it never did that before and does not with 3.12.7-2 stock. (no, i am not kidding! :))

So my guess is that still not enough memory is allocated for the kernel and parts of the firmware are overwritten (the block device layout for my xps 13, as well as Bastian's, Brian's and Max's systems, and the graphics stack for my w520).
Comment 50 Matt Fleming 2014-01-24 11:28:35 UTC
Everyone, thanks for clarifying and apologies for the misunderstanding. I didn't realise that the patch broke your existing configurations. Turns out there was a typo in one line of the patch which meant that it was writing a 64-bit pointer to a 32-bit data item, resulting in the ramdisk pointer becoming corrupt.

Could everyone try reloc2.patch and see if things improve (I'll delete the old one so that there's no confusion).
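
How a 64-bit store into a 32-bit item can corrupt a neighbouring pointer can be shown with a byte buffer. This is a hypothetical Python model using the standard `struct` module: the thread does not say which field the patch wrote to, so the two-field layout, the `store` helper and all addresses below are assumptions for illustration.

```python
import struct

def store(fmt, first_value, ramdisk_ptr=0x37A00000):
    """Model two adjacent little-endian 32-bit fields and store into the first.

    The second field stands in for a ramdisk pointer that was set earlier.
    Returns both fields as they read back after the store.
    """
    fields = bytearray(8)
    struct.pack_into("<I", fields, 4, ramdisk_ptr)  # neighbouring 32-bit field
    struct.pack_into(fmt, fields, 0, first_value)   # the store under test
    return struct.unpack_from("<II", fields, 0)

addr = 0x1_0200_0000  # hypothetical address above 4 GiB

# Intended behaviour: a 32-bit store leaves the neighbour untouched.
print([hex(v) for v in store("<I", addr & 0xFFFFFFFF)])  # ['0x2000000', '0x37a00000']

# Buggy behaviour: a 64-bit store spills its upper half into the neighbour.
print([hex(v) for v in store("<Q", addr)])  # ['0x2000000', '0x1']
```

In this model the kernel would later read the second field as the ramdisk address and look in the wrong place, which would be consistent with the panics about the root file system reported after reloc.patch.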
Comment 51 Matt Fleming 2014-01-24 11:29:36 UTC
Created attachment 123281 [details]
reloc2.patch
Comment 52 Bastian Beischer 2014-01-24 13:50:45 UTC
Thanks Matt,

I still cannot boot 3.12.7 + reloc2.patch from gummiboot. I can however:

a) boot 3.12.7 from UEFI shell
b) load gummiboot.x64 from UEFI shell and _then_ load 3.12.7 + reloc2.patch from the gummiboot menu.

However, I think this behaviour is the same as without reloc2.patch (as I described a while ago: I can boot from UEFI shell but not from gummiboot directly). I didn't try to load gummiboot from UEFI shell and then load the kernel at that time though.
Comment 53 Bastian Beischer 2014-01-24 15:24:09 UTC
I saw that quite a few patches for EFISTUB booting went into 3.13.

Should try out 3.13 (+reloc2.patch?)
Comment 54 Ulf Winkelvos 2014-01-24 16:52:59 UTC
(In reply to Matt Fleming from comment #51)
> Created attachment 123281 [details]
> reloc2.patch
reloc2.patch works fine for me on dell xps 13 and lenovo w520. I will upload my patched arch kernel package and ask the guys in the arch bug report to test it and report back. Thanks a lot for your effort!
Comment 55 Max Liebkies 2014-01-24 18:25:18 UTC
Okay, sorry but reloc2.patch didn't help. Same behaviour as before: Just a blank screen, backlight works. Compiled with -j16.

Will try 3.13 tomorrow and see if that works. [Anyone have PKGBUILD for that one? :-)]
Comment 56 Dave Hope 2014-01-24 18:39:56 UTC
Ulf,

Thanks for providing the packages on the Arch bugtracker. I have been following this bug with interest. 

I have a Lenovo X230 and have faced this issue with 3.12.7 and 3.12.8.

Matt,

Thanks for providing reloc2.patch - Using  3.12.7-2 with reloc2.patch seems to resolve my issues. No black screen on boot.
Comment 57 Max Liebkies 2014-01-24 19:21:45 UTC
Kernel 3.13 (AUR: linux-mainline which is basically vanilla) boots just fine without your reloc2.patch
Comment 58 Brian 2014-01-24 20:27:01 UTC
Ulf, just tried your kernel with Matt's reloc2.patch. My computer boots successfully.
Comment 59 Bastian Beischer 2014-01-24 21:23:50 UTC
Please remember that we have had successful builds of 3.12.7 without any patches in the past.

Ulf might have just gotten lucky. Instead of using his build I'd recommend to compile your own version, so that we have some statistical evidence for the statement that the patch fixes this problem at least.
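
To put a rough number on "statistical evidence": if an unpatched build boots by luck with some probability, the chance that several independent patched builds all boot purely by luck shrinks geometrically. The Python sketch below assumes independent builds and a made-up 50% luck rate; nobody in this thread has measured the real rate.

```python
import math

def chance_all_boot_by_luck(p_lucky, builds):
    """Probability that `builds` independent builds all boot if the patch does nothing."""
    return p_lucky ** builds

def builds_needed(p_lucky, threshold=0.05):
    """Smallest number of successful builds that pushes the luck explanation below threshold."""
    return math.ceil(math.log(threshold) / math.log(p_lucky))

# With a hypothetical 50% chance of a lucky build:
print(chance_all_boot_by_luck(0.5, 1))  # 0.5
print(builds_needed(0.5))               # 5
```

A single failing patched build is more informative than any number of successes, since it shows directly that the patch alone is not sufficient on that machine.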
Comment 60 kernelnoob 2014-01-24 21:43:11 UTC
I'm coming from the arch bug report discussion to report a successful boot with the pre-built patched kernel Ulf provided on thinkpad x230, using refind 0.7.7-1.

Though from what I gathered reading this thread, this success could be a matter of luck and not necessarily a proof of a working fix.

Let me know how I can help with finding out if it truly fixed the issue.
Comment 61 Ulf Winkelvos 2014-01-24 22:35:53 UTC
(In reply to Bastian Beischer from comment #59)
> Ulf might have just gotten lucky. Instead of using his build I'd recommend
> to compile your own version, so that we have some statistical evidence for
> the statement that the patch fixes this problem at least.
That is definitely a good idea. You should also try different values of the "pkgbase", as this made a difference before. (https://bugs.archlinux.org/task/33745#comment111238)
Comment 62 kernelnoob 2014-01-25 18:51:09 UTC
I compiled a kernel from Ulf's provided sources and patches with a modified pkgbase and it failed to boot exactly as before: it displays kernel parameter and hangs there.

A version without the modified pkgbase is compiling.
Comment 63 kernelnoob 2014-01-25 20:49:24 UTC
Results are in for locally compiled kernels:

With the provided pkgbase name: boots!
With a modified pkgbase name: hangs!

I'm totally alien with the inner workings of the kernel hence do not understand what's happening here but if you need me to provide additional debugging info, I'm willing to help.
Comment 64 Tom D 2014-01-26 19:01:47 UTC
I would like to add that this bug affects me too.  My affected stack includes:

- Dell XPS 13 (Sputnik/Developer Edition)
- Arch Linux 64 Bit
- Gummiboot

I'll try to get some time together and test some of these patches... I greatly appreciate all of your efforts.
Comment 65 Dave Hope 2014-01-30 21:02:48 UTC
Following on from Bastian Beischer's request that others should compile a kernel and test, I've done the following:

- Taken a copy of 3.12.9-1 and compiled it using the Arch Build System (ABS). All that was modified was 'pkgbase' so it'd get copied to a unique name. After adding an entry for gummiboot and restarting this package resulted in a black screen as per the bug report.

- Using the same copy from my first attempt, with the same 'pkgbase' I applied the patch inside the Arch Linux PKGBUILD prepare() function. Compiling, Installing and booting using gummiboot still resulted in the same black screen.
Comment 66 Dave Hope 2014-02-01 14:35:44 UTC
Further to my previous comment, I've done some further testing with 3.12.9-1.

- Compiled with the reloc2.patch applied and pkgbase set to "linux-dhope" in PKGBUILD. This results in a black screen on boot. I've cleaned out the tree and rebuilt three times and had the same result each time.

- Compiled with the reloc2.patch applied and pkgbase set to "linux-dhope-efibootfix" in PKGBUILD resulted in a successful boot.

I would be interested if other Lenovo X230 users experiencing this issue could try the same process.
Comment 67 Rich 2014-02-02 01:34:54 UTC
Okay so I have been following the Arch thread and been trying to work with them on this. I have an X230 with the latest BIOS applied, and when I try to boot I experience the issues with the black screen and then having to power down the system with a long press on the power button. I can state that booting to the UEFI Shell and trying to load the kernel from the command line does not work for me for any kernels that do not boot via gummiboot.

I tried compiling the latest Arch core kernel with modifications to the PKGBUILD
pkgbase=linux-reloc
Inside the prepare() function:
patch -p1 -i [path to reloc patch]

The first time I did this and compiled the kernel I installed it using pacman -U and rebooted to the kernel. I only received a blank screen and it would not go past that.

I also have the aur mainline kernel 13.3 installed on my machine and I can also report that it boots without a problem.

I'm going to build the kernel again and then try it one more time. Can someone tell me what I should do when booting to try and help get some more information? Should I use ignore_loglevel?
Comment 68 Matt Fleming 2014-02-03 11:27:50 UTC
Rich, ignore_loglevel is a good idea for debugging in general since it should print all kernel statements to the console.

I'm gonna have to go and have another read of the EFI boot stub code and hunt for more bugs. I certainly don't think that reloc2.patch is going to hurt anything, but there's clearly other things going on.
Comment 69 Bastian Beischer 2014-02-04 10:46:04 UTC
By the way, has anyone had any problems with 3.13, yet?

I have tried like 5 or 6 different builds and haven't had a single failure... 3.13 contains some EFI related changes (Matt will know best), maybe the issue is already fixed there?
Comment 70 Alan 2014-02-04 11:29:02 UTC
3.13 seems to work for me on more devices (but turning on all the kernel debugging spews lots of warnings and crashes my box so it's not IMHO right), 3.14-rc1 is completely broken.
Comment 71 Dave Hope 2014-02-04 17:11:27 UTC
Problem persists for me under 3.12.9-2 (no reloc2 patch applied). Even adding debug options yields no output:

debug ignore_loglevel log_buf_len=10M print_fatal_signals=1 LOGLEVEL=8 earlyprintk=vga,keep sched_debug
Comment 72 Rich 2014-02-06 05:59:27 UTC
Just wanted to add that I'm using the 3.12.9-3-ck build of the kernel from AUR and it works fine; I haven't had a problem yet. Also, with ignore_loglevel set I did not get any output from the CORE 3.12.9-3 kernel, so I can't really help much there.
Comment 73 Damir Jelić 2014-02-08 16:09:02 UTC
Can confirm 3.13.2 won't boot on a Lenovo Thinkpad T530.
The kernel is a stock Arch kernel: 3.13.2-1-ARCH x86_64 GNU/Linux

After I recompiled the kernel (using the same kernel config with the Arch build system) boot works.

Prior to 3.13.2 the 3.13.X kernel versions all booted fine (all of them were stock Arch kernels)

I am available for further testing, so if you want me to try to boot with the UEFI shell let me know.
Comment 74 development 2014-02-11 13:48:57 UTC
I can confirm as well that EFI boot doesn't work on my Kabini (AMD A4-5000, ZBOX nano AQ01) system.

Last working kernel version was 3.10.20.

3.12.X, 3.13.X and 3.14-RC2 don't work and cause a hang before I even see a single output.
Comment 75 kernelnoob 2014-02-12 01:23:06 UTC
I'd like to mention that I have had no issue booting any linux-ck kernel version yet, even when the same revision of the Arch kernel fails to boot.

It's been a while since my last manually built kernel which is probably why my locally built 3.13.x kernels are triggering systemd errors and end up in a rescue shell, but they don't get stuck as with this bug.
Comment 76 Damir Jelić 2014-02-12 17:15:29 UTC
Today a new kernel package was pushed to the repositories. The kernel version is the same, but it contains an additional patch [1].

I gave it a try and it works fine. The added patch doesn't seem to be in any way relevant to the UEFI issue, and since my own rebuild of the same kernel worked fine the issue seems to be with that specific build.

I noticed an interesting difference in the output of the file command with the kernel image as an argument:

    file boot/vmlinuz-linux
    boot/vmlinuz-linux: Linux kernel x86 boot executable bzImage, version 3.13.2-1-ARCH (nobody@var-lib-archbuild-extra-x86_64-thomas) #1, RO-rootFS, swap_dev 0x3, Normal VGA

    file /boot/vmlinuz-linux
    /boot/vmlinuz-linux: Linux kernel x86 boot executable bzImage, version 3.13.2-2-ARCH (tobias@T-POWA-LX) #1 SMP PREEMPT Wed Feb 12 08:2, RO-rootFS, swap_dev 0x3, Normal VGA

    file arch/x86/boot/bzImage
    arch/x86/boot/bzImage: Linux kernel x86 boot executable bzImage, version 3.13.2-1-ARCH (poljar@monolith) #3 SMP PREEMPT Sat Feb 8 16:33:, RO-rootFS, swap_dev 0x3, Normal VGA

The first kernel above doesn't boot while the other two boot fine. Note the missing build date.

[1] https://projects.archlinux.org/svntogit/packages.git/commit/trunk?h=packages/linux&id=e57c6a0695cadbd16bda340a7d39f82e879ba4ca
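
A side note on the `file` output above: all three version strings stop at roughly the same length (63 characters by a manual count), so the missing build date in the first one may simply be `file` cutting the string off after the long build host name. To rule that out, the full string can be read straight from the image. The following is a minimal sketch, assuming the standard x86 boot protocol layout ("HdrS" magic at offset 0x202, and a 16-bit `kernel_version` field at 0x20E that points to the NUL-terminated string, less 0x200). The function name is hypothetical and this was not run against the kernels discussed here.

```python
import struct

HDR_MAGIC_OFFSET = 0x202       # "HdrS" signature of the x86 setup header
KERNEL_VERSION_OFFSET = 0x20E  # 16-bit pointer to the version string, less 0x200


def bzimage_version(image: bytes) -> str:
    """Return the full version string embedded in an x86 bzImage.

    Hypothetical helper: it only follows the kernel_version pointer
    and does not validate anything else in the header.
    """
    if image[HDR_MAGIC_OFFSET:HDR_MAGIC_OFFSET + 4] != b"HdrS":
        raise ValueError("no HdrS magic, not an x86 Linux boot image")
    (ptr,) = struct.unpack_from("<H", image, KERNEL_VERSION_OFFSET)
    if ptr == 0:
        raise ValueError("image carries no version string")
    start = ptr + 0x200                    # pointer is stored less 0x200
    end = image.index(b"\0", start)        # string is NUL-terminated
    return image[start:end].decode("ascii", errors="replace")
```

Calling `bzimage_version(open("/boot/vmlinuz-linux", "rb").read())` on each image should show whether the failing build really lacks the SMP/PREEMPT/date part or whether it was only truncated in the `file` output.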
Comment 77 daniel.funke 2014-02-21 16:12:40 UTC
Just to add some more data points:
on my Thinkpad X220i with UEFI boot via gummiboot, I tried the following:

1) all stock Arch kernels since 3.12.6 crashed with a blank screen
2) a self-compiled kernel 3.12.9 with the stock kernel name (arch) crashed with a blank screen
3) a self-compiled kernel 3.12.9 with kernel name "arch-testing" booted
4) a self-compiled kernel 3.13.2 with kernel name "arch-testing" got stuck in a reboot cycle without any messages printed to the screen
Comment 78 Tom Wambold 2014-03-10 16:41:33 UTC
Adding my own data point, I am able to successfully boot the stock Arch package 3.13.6-1 with gummiboot on my ThinkPad x230.  Previously, the last stock Arch kernel I could boot was 3.12.6-1.
Comment 79 Ulf Winkelvos 2014-03-24 01:49:15 UTC
(In reply to Matt Fleming from comment #11)
...
> If removing setup_efi_pci() works for you, please open a separate bug
> report, attach the patch, describe the failure you see without the patch,
> and describe the hardware and boot loader you're using.

I just built a recent git kernel and it would not boot; I commented out the setup_efi_pci() call and it works. (Like I said before, this workaround has never failed me on my Dell XPS 13 Ivy Bridge.) It has worked for others too: http://permalink.gmane.org/gmane.linux.kernel.efi/1560

I'd bet that the two of us are not the only ones in this thread for whom this hack fixes or shadows the problem. Even if it does not work for everybody, it might be worth trying. I'll attach this very simple patch. Maybe someone else could try it out.
Comment 80 Ulf Winkelvos 2014-03-24 01:50:26 UTC
Created attachment 130561 [details]
comment call to setup_efi_pci()
Comment 81 Thomas Bächler 2014-04-10 10:56:51 UTC
Since this bug is going nowhere right now (and I just got reports from two users that Arch's 3.14.0-4 kernel fails again), I'd like to get some new information, but don't know how. What confuses me most is that when I build a kernel twice, with identical code and configuration, with the same toolchain, it happens that one build works and one fails - at least if I believe the reports I get. Is there a way to compare two such builds and maybe find out which differences cause this problem?
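
One rough way to approach the comparison asked about here, as a sketch only and not something that was run on the affected builds: hash both images to confirm they really differ, then list the byte ranges where they disagree. The function names and the `merge_gap` parameter are made up for illustration. Note that a bzImage is mostly a compressed payload, so a single differing byte early in the payload can make everything after it differ; comparing the uncompressed vmlinux (or just the setup header and boot stub area) is likely to be more informative.

```python
import hashlib
from typing import List, Tuple


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, to confirm quickly whether two builds differ."""
    return hashlib.sha256(data).hexdigest()


def diff_ranges(a: bytes, b: bytes, merge_gap: int = 16) -> List[Tuple[int, int]]:
    """Return half-open (start, end) offsets where a and b differ.

    Differing bytes whose offsets are no more than merge_gap apart are
    reported as one range. If the lengths differ, the extra tail of the
    longer input is reported as a final range.
    """
    common = min(len(a), len(b))
    ranges: List[Tuple[int, int]] = []
    start = None      # start offset of the range being collected
    last_diff = None  # offset of the most recent differing byte
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i
            elif i - last_diff > merge_gap:
                # Too far from the previous difference: close that range.
                ranges.append((start, last_diff + 1))
                start = i
            last_diff = i
    if start is not None:
        ranges.append((start, last_diff + 1))
    if len(a) != len(b):
        ranges.append((common, max(len(a), len(b))))
    return ranges
```

For two images read with `open(path, "rb").read()`, printing each range as hex offsets (for example `hex(start)`, `hex(end)`) gives a list that can be checked against the boot protocol header layout. The plain Python loop is slow but should be tolerable for images of a few megabytes.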
Comment 82 Matt Fleming 2014-04-10 11:13:11 UTC
Thomas, can you send me the two kernels? Exactly how does the bad kernel fail? Hang, reset?

This really does sound like a toolchain bug, or perhaps an EFI boot stub bug triggered by values written by your toolchain.
Comment 83 Max Liebkies 2014-04-10 11:21:44 UTC
For me 3.14 fails to boot in the exact same way as 3.12 did before: Blank screen nothing more, when booting with gummiboot. My current setup involves chainloading syslinux with gummiboot and then booting the kernel, which works.

Just a suggestion: The last time I tried debugging this bug I tried compiling the kernel with -j1 and with -jX where X > 1. -j1 never failed for me, whereas anything larger than 1 *might* fail.

Apart from that: If you can give me any way of displaying _some_ information (just getting a blank screen atm, backlight works...) I'd be happy to help out!
Comment 84 Thomas Bächler 2014-04-10 11:29:30 UTC
Matt, I am unable to reproduce any of these problems myself, so I am relying on what I get from https://bugs.archlinux.org/task/33745 and from personal conversation with affected users.

At one point, a user compiled two identical versions of 3.12.7 where one kernel worked and the other failed. He uploaded them at the time, however, I just found out that he deleted the files and I can thus no longer download them.

I'll see if I can find someone willing to produce two such kernels.

In any case, the failing kernels always simply produced a blank screen with no output, hanging indefinitely.
Comment 85 Matt Fleming 2014-04-10 11:42:51 UTC
Max, I think at this point I just need two copies of the same kernel version, without user modification - one working and one not working, where the only difference is that they were recompiled.

If anyone can provide me with those we might get a bit closer to debugging this.
Comment 86 Matt Fleming 2014-04-10 11:49:22 UTC
Thomas, does 3.14.0-4 include the patch I suggested here, http://marc.info/?l=linux-kernel&m=139703196223283&w=2 ?
Comment 87 Thomas Bächler 2014-04-10 11:55:49 UTC
That link says "No such message", but I think I know which message you refer to - and yes, that patch is included.
Comment 88 Damir Jelić 2014-04-10 11:59:45 UTC
Hi Thomas, some more info about the 3.14 Arch packages:
Comment 89 Damir Jelić 2014-04-10 12:04:43 UTC
Hi Thomas, some more info about the 3.14 Arch packages:
    3.14-1 boots
    3.14-2 boots
    3.14-3 fails
    3.14-4 fails

I'm unable to build a kernel that fails myself, I can however test builds provided by you if needed.

(Sorry for the useless comment above).
Comment 90 Thomas Bächler 2014-04-10 12:07:16 UTC
Damir, short version, the issue is as random as it was with 3.13. The difference between -2 and -3 is a minor change in the btusb module and a fix in the kernel entirely unrelated to early boot.
Comment 91 Bastian Beischer 2014-04-10 12:12:11 UTC
Hello all,

I agree with Thomas. I'd like to mention once more that I can not boot 3.14-4 from gummiboot, but I can go from gummiboot to UEFI shell v1 (shellx64.efi) and then:

fs0:
cp vmlinuz-linux 3.14-4.efi
3.14-4.efi initrd=\initramfs-linux.img root=/dev/sda2 rw

-> Successful boot.

I was able to boot most of the 3.13 kernels, but all the 3.14 kernels failed for me (including 3.14-1 and 3.14-2, which is different to what Damir experienced)

Could it be that this depends on the vendor of the notebook and their EFI firmware?
Comment 92 Matt Fleming 2014-04-10 12:29:17 UTC
Bastian, I think at this point it's pretty clear it's a(nother) memory corruption problem, or a garbage value problem, e.g. the boot stub is reading random/garbage data from somewhere.

While it may be more likely to hit this issue with certain vendors' notebooks, I don't think their firmware is to blame in this case. It's just that their memory map/boot environment seem to trigger the bug more easily. Which is why it's not entirely surprising that if you change your boot method slightly (going via the EFI Shell) things work better, since the memory map will look different.
Comment 93 Max Liebkies 2014-04-10 20:38:47 UTC
I have so far not been able to produce a working kernel. I tried 3 times but none of the kernels booted. 
Matt, I'll apply the patch you mentioned above (no, it's not included in the stock Arch kernel) and see if that works.
Comment 94 Rich 2014-04-11 04:40:41 UTC
I have switched back from Gummiboot/EFI Stub to GRUB and have not had a kernel fail on me yet. I was always able to reproduce this problem with Gummiboot or EFI Stub, or even when trying to boot from the EFI Shell. Could this be not a problem with the kernel, but an actual problem with the way EFI is implemented on certain machines (Lenovo seems to be a big issue) and the EFI Stub?

Just my two cents.
Comment 95 Douglas Young 2014-04-11 23:51:15 UTC
Created attachment 131951 [details]
Kernel which fails to boot

Tested on T430 laptop, built from arch package version 3.14-4, no modifications made to package.
Comment 96 Douglas Young 2014-04-11 23:54:41 UTC
Created attachment 131961 [details]
Kernel which boots

This version is from the official arch package, which was built on a different machine. This boots successfully on a T430 laptop.
Comment 97 Douglas Young 2014-04-11 23:56:42 UTC
This isn't quite as good as recompiling on the same machine, but both kernels have been produced from the same source.
Comment 98 Steven Vancoillie 2014-04-23 21:00:28 UTC
Created attachment 133461 [details]
3.14.1 hangs, output with ignore_loglevel and earlyprintk=efi
Comment 99 Steven Vancoillie 2014-04-23 21:01:57 UTC
I have an issue which I think is similar to this bug. I'm unable to boot the kernel since version 3.14 using gummiboot. In short: 3.13.8 boots, 3.14 and 3.14.1 did not boot. With 3.14.1, I continued testing by trying to boot from EFI shell with ignore_loglevel and earlyprintk=efi, which gives the attached output (133461). Let me know if I can do something more. I'm using arch linux, where the 3.14.1 kernel already includes a patch "x86/efi: Correct EFI boot stub use of code32_start". My computer is a Dell Latitude E4310.
Comment 100 Ulf Winkelvos 2014-04-29 03:03:55 UTC
Created attachment 134131 [details]
better error logging in eboot.c

@Matt: This issue still remains on my DELL XPS 13 even after the bugfix you found with Thomas. I experimented with the current git kernel, but still could not find a reproducible way to build a bad kernel. What would be helpful is better error logging, so I attached a patch proposal. The error messages are pretty generic, but now I get at least some output if the kernel fails. Interestingly, I managed to build a kernel where setup_efi_pci fails, but as the return value is not checked at all, the kernel boots.
Comment 101 Christof 2014-07-05 17:09:43 UTC
I have an Intel NUC N2820 and did a fresh install of arch linux today.

The first problem was gummiboot: the screen stayed black. I downgraded to an older version and was able to start gummiboot.

After that screen stayed black when selecting an entry in gummiboot. I also downgraded to kernel 3.13.8 to get the system booting.
Comment 102 Ulf Winkelvos 2014-07-06 17:47:01 UTC
@Matt: what needs to be done to get this patch included in mainline? It would not hurt, would it?
Comment 103 Matt Fleming 2014-07-07 13:25:05 UTC
Ulf,

No your patch wouldn't hurt, and I agree that it would be a good addition.

Could you please rebase your patch against the 'next' branch at,

  git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git

and mail to linux-efi@vger.kernel.org and Cc me, and I'll make sure that it gets applied. Thanks!
Comment 104 Ulf Winkelvos 2014-07-08 01:29:04 UTC
Created attachment 142371 [details]
add better error logging to efi-main

attached the rebased patch here for reference purposes. Going to send it to the list tomorrow. Thx for your work Matt!
Comment 105 Ulf Winkelvos 2014-07-10 11:53:05 UTC
I just read this [1] and I think this just might be "it". The random symptoms we are seeing are pretty much exactly what I would expect of a bug like this. Michael Brown, who discovered this, posted two patches to the list. Going to try them tonight!

[1] http://permalink.gmane.org/gmane.linux.kernel.efi/4175

Cheers, Ulf
Comment 106 Julian Andres Klode 2014-07-19 21:52:13 UTC
I think I experience the same bug on Debian. Here, 3.14-1-amd64 does not boot (based on 3.14.12). The 3.16-rc5 kernel we have does boot fine, though.
Comment 107 Mike Cloaked 2014-07-20 08:06:08 UTC
Is the patch at https://git.kernel.org/cgit/linux/kernel/git/mfleming/efi.git/commit/?h=urgent&id=c7fb93ec51d462ec3540a729ba446663c26a0505 going to be included in the 3.16 kernel? It would be useful if it is so that more people can see if this fixes the efi stub loader boot bug.
Comment 108 Matt Fleming 2014-07-21 15:01:39 UTC
(In reply to Mike Cloaked from comment #107)
> Is the patch at
> https://git.kernel.org/cgit/linux/kernel/git/mfleming/efi.git/commit/
> ?h=urgent&id=c7fb93ec51d462ec3540a729ba446663c26a0505 going to be included
> in the 3.16 kernel? It would be useful if it is so that more people can see
> if this fixes the efi stub loader boot bug.

Yes Mike, that patch will be in v3.16.
Comment 109 Mike Cloaked 2014-07-29 08:56:03 UTC
There have been no further reports of this bug occurring in arch once the patch referred to was in the arch kernels, and I have added a comment to the arch bug at https://bugs.archlinux.org/task/33745#comment125801 asking if that bug might be closed if no further reports occur through to when kernel 3.16 is released. That may be a suitable time to consider this bug closed here too?
Comment 110 Matt Fleming 2014-07-30 10:02:43 UTC
I think this bug was mainly opened as a way for arch users/developers to track upstream progress of debugging this problem. If things are looking fixed then I'm definitely in favour of closing this bug.

If people run into similar bugs, or they think their bug has never been fixed, I'd suggest that they open a *new* report instead of re-opening this one. The reason being that, over time, this bug became a hodge-podge of reports and opening individual reports with as much detail as possible may at least allow us to gauge how many bugs we have in this area.

Anyone have a good reason to keep this report open?
Comment 111 Ulf Winkelvos 2014-07-30 20:39:49 UTC
No, I think you are right. I am pretty sure that Michael's patch fixed the problem at least on some/most systems. The future will tell! :)
