Problem summary: Kernels from 4.17 onwards fail to boot on the MacBook Pro 14,1. To reproduce, boot Fedora or Arch with the default kernel, or compile your own from kernel.org. As soon as this kernel is selected in GRUB and booted, the screen goes dark but the keyboard light stays on. Nothing else happens: no text or logs are produced. Kernel 4.16 (bundled with Fedora or compiled from kernel.org) boots fine. Investigation: The bug was tracked in detail here: https://github.com/Dunedan/mbp-2016-linux/issues/62 We (the people in that thread) used git bisect to find the first bad commit. It was https://github.com/torvalds/linux/commit/3548e131ec6a82208f36e68d31947b0fe244c7a7 As an attempted fix, we checked out kernel 4.20 and replaced the files changed in that commit, i.e. arch/x86/boot/compressed/{misc.c,head_64.S,pgtable_64.c}, with the ones from kernel 4.16.18. Compiling kernel 4.20 with these changes makes it bootable again.
Another issue that may be related to this: with Linux and Mac OS X dual-booted on my MacBook Pro 14,1, after shutting down either Mac OS X or Linux, if I close and reopen the lid, the machine automatically turns back on and shows the GRUB boot menu. I am not sure whether this is because I am running a hacked 4.19.2 kernel without the trampoline32, though.
Just a clarification: current upstream is still broken, right? There have been a number of fixes in this area. See d503ac531a52, 1b3a62643660, 5c9b0b1c4988 for instance. Could you please attach dmesg for a successful boot? And your kernel config. The first bad commit doesn't make sense to me: it doesn't do anything potentially destructive. It only calculates the placement of the trampoline, but doesn't use it yet. It would be really helpful to find which line in paging_prepare() causes the issue. Could you try inserting "return paging_config;" at successively earlier points before the end of the function to identify the faulty line (if there is a specific one)?
Created attachment 280843 [details] Kernel Config File
Created attachment 280845 [details] dmesg output
(In reply to Kirill A. Shutemov from comment #2) > Just a clarification: current upstream is still broken, right? There was a > number of fixes in the area. See d503ac531a52, 1b3a62643660, 5c9b0b1c4988 > for instance. > > Could you please attach dmesg for a successful boot? And your kernel config. They are attached in the Comment 3 and Comment 4. > The first bad commit doesn't make sense to me: it doesn't do anything > potentially destructive. It only calculates placement of the trampoline, but > doesn't use it yet. > > It would be really helpful to find which line in paging_prepare() causes the > issue. Could you try to insert "return paging_config;" before the end of the > function to identify the faulty line (if there's any specific one). I will put return paging_config just in the beginning of the subroutine with 5.0-rc4 kernel, build the kernel, and let you know the result.
(In reply to Bockjoo Kim from comment #5) > I will put return paging_config just in the beginning of the subroutine with > 5.0-rc4 kernel, build the kernel, and let you know the result. That's not going to work. As of v5.0-rc4 the trampoline is in active use. Please just try it without any changes. The trick with "return paging_config;" is only interesting to run on the first bad commit you have pointed out. Looking into dmesg, I see [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x00000000000fffff] reserved This may result in bad placement of the trampoline. I believe it should be fixed with 1b3a62643660 ("x86/boot/compressed/64: Validate trampoline placement against E820").
(In reply to Kirill A. Shutemov from comment #6) > One (In reply to Bockjoo Kim from comment #5) > > I will put return paging_config just in the beginning of the subroutine > with > > 5.0-rc4 kernel, build the kernel, and let you know the result. > > That's not going to work. As per v5.0-rc4 trampoline is in active use. > Please just try it without any changes. I tried it without any changes, but it did not boot, just like all other kernels >= 4.17. Since LEVEL5 is not configured, I am wondering whether the trampoline being in active use matters at all, or whether the trampoline placement itself is the issue.
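For readers following along, here is a minimal user-space sketch of the kind of check that "validate trampoline placement against E820" implies: a candidate range is acceptable only if it lies entirely inside usable RAM. The `struct region` type, the `trampoline_fits()` helper and the two-page trampoline size are assumptions made for illustration; this is not the code from 1b3a62643660.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an E820 entry (not the kernel's struct). */
enum region_type { REGION_RAM = 1, REGION_RESERVED = 2 };

struct region {
    uint64_t start;          /* first byte of the region */
    uint64_t end;            /* last byte, inclusive, as printed in dmesg */
    enum region_type type;
};

/* Assumption for this sketch: the trampoline occupies two 4 KiB pages. */
#define TRAMPOLINE_SIZE 0x2000ULL

/* True if [start, start + TRAMPOLINE_SIZE) lies entirely inside one RAM region. */
static bool trampoline_fits(const struct region *map, size_t n, uint64_t start)
{
    uint64_t last = start + TRAMPOLINE_SIZE - 1;

    for (size_t i = 0; i < n; i++) {
        if (map[i].type != REGION_RAM)
            continue;
        if (start >= map[i].start && last <= map[i].end)
            return true;
    }
    return false;
}
```

With the reserved range quoted above (0x9e000-0xfffff) and an assumed RAM region directly below it, a two-page candidate starting at 0x9d000 would spill into the reserved range, while one starting at 0x9c000 would not.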
Even if 5-level paging is disabled at compile time, the trampoline might be still needed to switch to 4-level paging for the case when bootloader (i.e. kexec) hands off control in 5-level paging mode.
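The point above can be sketched as a small decision helper. CR4.LA57 (bit 12) indicates that 5-level paging is active; the paging depth cannot be changed while the CPU is in long mode, so a switch has to detour through 32-bit code, which is what the trampoline is for. The helper name `needs_mode_switch()` is hypothetical and only models the decision, not the kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* CR4.LA57 (bit 12) is set while the CPU runs with 5-level paging. */
#define CR4_LA57 (1ULL << 12)

/*
 * Hypothetical helper: returns true when the paging depth handed over by
 * the bootloader differs from the depth the kernel wants, i.e. when the
 * 32-bit trampoline would actually have to be used.
 */
static bool needs_mode_switch(uint64_t cr4, bool want_5level)
{
    bool have_5level = (cr4 & CR4_LA57) != 0;

    return have_5level != want_5level;
}
```

A kernel built without 5-level paging support that is entered with LA57 set (for example via kexec) is the case where the helper returns true even though 5-level paging is disabled at compile time.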
Created attachment 280857 [details] test.patch Could you check if this patch makes any difference on current upstream?
(In reply to Kirill A. Shutemov from comment #9) > Created attachment 280857 [details] > test.patch > > Could you check if this patch makes any difference on current upstream? The patch did not make any difference.
Created attachment 280869 [details] test.patch Okay. Could you try this one? Have you tried to enable earlyprintk? earlyprintk=vga or earlyprintk=efi in kernel parameters in your case, I guess. I wonder if you can get any message from the kernel with this.
5.0-rc4 with only this latest patch seems to work for me on a MacBookPro14,1
(In reply to Kirill A. Shutemov from comment #11) > Created attachment 280869 [details] > test.patch > > Okay. Could you try this one? I am sorry for the late response. The email was buried until I saw Mourad's response. I tried the patch and the kernel booted, as Mourad had already confirmed. > Have you tried to enable earlyprintk? earlyprintk=vga or earlyprintk=efi in > kernel parameters in your case, I guess. I woulder if you can get any > message from kernel with this. Without the patch, earlyprintk=vga earlyprintk=efi did not show anything either. I am trying it with the patch, but I have to figure out how to capture the earlyprintk output. By the way, could you suggest a line to print the output of find_trampoline_placement()?
I can confirm that 5.0.0-rc4+ boots with the patch, but I run into other problems: unable to handle kernel NULL pointer dereference at 0000000000000078 I am not sure whether those are related to this bug, and I don't want to take the focus away. I am attaching the dmesg output of that boot attempt (journalctl -o short-precise -k -b -1) in case it is of any help.
Created attachment 280887 [details] dmesg for kerne5.0.0-rc4 with patch The kernel started but I ran into other errors.
(In reply to Pitam Mitra from comment #15) > Created attachment 280887 [details] > dmesg for kerne5.0.0-rc4 with patch > > The kernel started but I ran into other errors. The bug is unrelated to this one. Please report to regulator subsystem maintainers.
Created attachment 280889 [details] test.patch This patch should also boot fine. If it does, please uncomment the line in extract_kernel(). It will hang the boot and let you capture the earlyprintk output. Please share the output here.
I think I tried something similar yesterday ( variable test_trampoline_32bit ), but it did not boot. I put it in the .bss section just to see if it works, but it did not boot either.
But this exact patch doesn't work either?
(In reply to Kirill A. Shutemov from comment #19) > But this exact patch doesn't work too? No. It does not boot with the exact patch either.
Created attachment 280891 [details] test.patch That's very strange. I don't understand what is so special about find_trampoline_placement(). Could you check this patch? If it works, could you please try to uncomment one or another line in find_trampoline_placement().
(In reply to Kirill A. Shutemov from comment #21) > Created attachment 280891 [details] > test.patch > > That's very strange. I don't understand what is so special about > find_trampoline_placement(). > > Could you check this patch? If it works, could you please try to uncomment > one or another line in find_trampoline_placement(). The patch let the machine boot ( the one with ebda_start = bios_start = 0 ; ). I tried booting the machine after uncommenting the original ebda line or the original bios line, but both cases did not let the machine boot.
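To make the two inputs discussed here concrete, below is a user-space model of how the placement is derived from them, as far as I recall the upstream logic around v4.17. The constants and the function name `place_trampoline()` are assumptions to be checked against your tree, and this is not the content of any of the attached test patches. In the real code the two 16-bit values are read from the BIOS data area at physical addresses 0x40e (EBDA segment) and 0x413 (base memory size in KiB); here they are passed in as parameters.

```c
#include <stdint.h>

/* Constants as recalled from arch/x86/boot/compressed/pgtable_64.c; verify against your tree. */
#define BIOS_START_MIN   0x20000UL
#define BIOS_START_MAX   0x9f000UL
#define PAGE_SIZE_4K     0x1000UL
#define TRAMPOLINE_SIZE  (2 * PAGE_SIZE_4K)

/*
 * ebda_seg: 16-bit value normally read from physical address 0x40e
 * base_kb:  16-bit value normally read from physical address 0x413
 */
static unsigned long place_trampoline(uint16_t ebda_seg, uint16_t base_kb)
{
    unsigned long ebda_start = (unsigned long)ebda_seg << 4;
    unsigned long bios_start = (unsigned long)base_kb << 10;

    /* Implausible values fall back to the highest allowed start. */
    if (bios_start < BIOS_START_MIN || bios_start > BIOS_START_MAX)
        bios_start = BIOS_START_MAX;

    /* A plausible EBDA below bios_start lowers the limit further. */
    if (ebda_start > BIOS_START_MIN && ebda_start < bios_start)
        bios_start = ebda_start;

    /* Place the trampoline just below the limit, rounded down to 4 KiB. */
    return (bios_start - TRAMPOLINE_SIZE) & ~(PAGE_SIZE_4K - 1);
}
```

In this model, forcing both values to 0 makes the function fall back to BIOS_START_MAX and return 0x9d000 without depending on anything read from low memory, which matches the variant reported above as booting.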
Created attachment 280899 [details] test.patch Okay, my hypothesis is that the bootloader doesn't map the first 4k page into the identity mapping. This patch dumps the page table walk for two addresses into earlyprintk. Could you try it and share the output?
That did not fly either with earlyprintk=vga earlyprintk=efi in the kernel boot options. Only a cursor is blinking. It looks to me as if it cannot find the microcode or something. Should something equivalent to 'for(;;);' be moved to head_64.S?
Created attachment 280911 [details] dmesg output after the patch at Comment 23 - for(;;); After applying the patch in Comment 23 and commenting out for(;;); in misc.c, I checked the dmesg output. The on-screen output was no different from the dmesg output (I have photos of the boot if necessary).
Created attachment 280945 [details] test.patch Another attempt at a debug patch. It now dumps the page table walk for two addresses into dmesg. Please try it and attach dmesg if it works. I'll look into a fix if my hypothesis is correct.
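As background for what a page table walk looks up, this sketch splits an address into the table indices used by x86-64 4-level paging (9 bits per level, 12-bit page offset). The `struct pt_indices` type and `split_address()` helper are made up for illustration and are unrelated to the attached debug patch. With 2 MiB or 1 GiB mappings the walk ends at the PMD or PUD level and the lower indices become part of the offset.

```c
#include <stdint.h>

/* Hypothetical container for the per-level indices of one address. */
struct pt_indices {
    unsigned pgd;     /* bits 47:39 */
    unsigned pud;     /* bits 38:30 */
    unsigned pmd;     /* bits 29:21 */
    unsigned pte;     /* bits 20:12 */
    uint64_t offset;  /* bits 11:0, offset within a 4 KiB page */
};

/* Split an address into its 4-level paging indices. */
static struct pt_indices split_address(uint64_t addr)
{
    struct pt_indices idx;

    idx.pgd = (unsigned)((addr >> 39) & 0x1ff);
    idx.pud = (unsigned)((addr >> 30) & 0x1ff);
    idx.pmd = (unsigned)((addr >> 21) & 0x1ff);
    idx.pte = (unsigned)((addr >> 12) & 0x1ff);
    idx.offset = addr & 0xfff;
    return idx;
}
```

For an identity mapping, address 0 uses entry 0 at every level, so "the first 4k page is not mapped" would show up as a missing entry 0 somewhere along that walk.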
Created attachment 280947 [details] dmesg output after applying the patch in Comment 26 I have attached the dmesg output after applying the patch.
Okay, thanks. My hypothesis is not correct. Let me think more on this.
Created attachment 281181 [details] test-fix.patch Could you check if this helps?
Created attachment 281185 [details] dmesg output after applying the patch in Comment #29 After the patch, the kernel boots. I left the previous change that dumps the page table in place and applied the patch from Comment #29 on top of it. So the dmesg is from a boot with the patches from Comments #26 and #29 both applied.
The patch from #26 leaves the problematic code unused. Please try the new patch on a clean kernel.
Created attachment 281187 [details] pgtable_64.c.diff and dmesg output, i.e., with only patch in Comment #29 I have restored everything else to the original code and applied only the patch from Comment #29 to pgtable_64.c. The kernel boots fine, too. The diff between pgtable_64.c.original and pgtable_64.c (patched) and the dmesg output are attached in a single file.
My Samsung 500C Chromebook is suffering from a very similar bug, in the same file. I bisected it to a regression in the commit "x86/boot/compressed/64: Validate trampoline placement against E820". I found this issue by googling for that commit message. I've filed that as a separate bug, bug 203463, because it was introduced one kernel version later, but perhaps this info helps you with this bug. I've also tried the patch from Comment #29, but it did not fix my bug.