Bug 202351

Summary: Changes in trampoline code makes Macbook Pro 14,1 fail in early boot.
Product: Memory Management Reporter: Pitam Mitra (pitamm)
Component: Page AllocatorAssignee: Andrew Morton (akpm)
Status: NEW ---    
Severity: high CC: bockjoo, bugs-kernel, chad, charlie.habolin, danielroschka+kernel, eki.hur, iebotov, jsiml, kirill, mail+kernel-bugzilla, pitamm, scott
Priority: P1    
Hardware: Intel   
OS: Linux   
URL: https://github.com/Dunedan/mbp-2016-linux/issues/62
Kernel Version: 4.17 onwards Subsystem:
Regression: No Bisected commit-id:
Attachments: Kernel Config File
dmesg output
test.patch
test.patch
dmesg for kerne5.0.0-rc4 with patch
test.patch
test.patch
test.patch
dmesg output after the patch at Comment 23 - for(;;);
test.patch
dmesg output after applying the patch in Comment 26
test-fix.patch
dmesg output after applying the patch in Comment #29
pgtable_64.c.diff and dmesg output, i.e., with only patch in Comment #29

Description Pitam Mitra 2019-01-20 22:36:41 UTC
Problem summary:

Kernel 4.17 onwards fail to boot on Macbook Pro 14,1. To reproduce, start up Fedora or Arch with the default kernel or compile your own from kernel.org. As soon as this kernel is selected in grub and booted - the screen goes dark but the keyboard light stays on. Nothing happens - no text or logs are produced.

Kernel 4.16 (bundled with fedora or compiled from kernel.org) boots fine.


Investigation:

The bug was tracked in detail here: https://github.com/Dunedan/mbp-2016-linux/issues/62

We (people in that thread) have used git bisect to find the first bad commit. It was https://github.com/torvalds/linux/commit/3548e131ec6a82208f36e68d31947b0fe244c7a7


To attempt at a solution to the bug, we checked out kernel 4.20. We replaced the changed files in that commit i.e. arch/x86/boot/compressed/{misc.c,head_64.S,pgtable_64.c} from the ones in kernel 4.16.18. Compiling kernel 4.20 with these changes make it bootable again.
Comment 1 Bockjoo Kim 2019-01-23 15:12:46 UTC
Another issue that may be related with this is that, 
with the dual boot on my Macbook Pro 14,1 and MacOSX running, 
after shutting down either the MacOSX or the Linux, 
if I close and open the Macbook, it automatically turns back on 
to the grub boot menu.

I am not sure if this is due to the fact that this is 
a hacked 4.19.2 linux without the trampoline32, though.
Comment 2 Kirill A. Shutemov 2019-01-29 09:36:40 UTC
Just a clarification: current upstream is still broken, right? There was a number of fixes in the area. See d503ac531a52, 1b3a62643660, 5c9b0b1c4988 for instance.

Could you please attach dmesg for a successful boot? And your kernel config.

The first bad commit doesn't make sense to me: it doesn't do anything potentially destructive. It only calculates placement of the trampoline, but doesn't use it yet.

It would be really helpful to find which line in paging_prepare() causes the issue. Could you try to insert "return paging_config;" before the end of the function to identify the faulty line (if there's any specific one).
Comment 3 Bockjoo Kim 2019-01-29 14:55:25 UTC
Created attachment 280843 [details]
Kernel Config File
Comment 4 Bockjoo Kim 2019-01-29 14:56:38 UTC
Created attachment 280845 [details]
dmesg output
Comment 5 Bockjoo Kim 2019-01-29 14:59:37 UTC
(In reply to Kirill A. Shutemov from comment #2)
> Just a clarification: current upstream is still broken, right? There was a
> number of fixes in the area. See d503ac531a52, 1b3a62643660, 5c9b0b1c4988
> for instance.
> 
> Could you please attach dmesg for a successful boot? And your kernel config.
They are attached in the Comment 3 and Comment 4.

> The first bad commit doesn't make sense to me: it doesn't do anything
> potentially destructive. It only calculates placement of the trampoline, but
> doesn't use it yet.
> 
> It would be really helpful to find which line in paging_prepare() causes the
> issue. Could you try to insert "return paging_config;" before the end of the
> function to identify the faulty line (if there's any specific one).

I will put return paging_config just in the beginning of the subroutine with 5.0-rc4 kernel, build the kernel, and let you know the result.
Comment 6 Kirill A. Shutemov 2019-01-29 15:21:45 UTC
One (In reply to Bockjoo Kim from comment #5)
> I will put return paging_config just in the beginning of the subroutine with
> 5.0-rc4 kernel, build the kernel, and let you know the result.

That's not going to work. As per v5.0-rc4 trampoline is in active use. Please just try it without any changes.

Trick with "return paging_config;" is interesting to run on the first bad commit you have pointer out.

Looking into dmesg, I see

[    0.000000] BIOS-e820: [mem 0x000000000009e000-0x00000000000fffff] reserved

This may result in bad placement of trampoline. I believe it should be fixed with 1b3a62643660 ("x86/boot/compressed/64: Validate trampoline placement against E820").
Comment 7 Bockjoo Kim 2019-01-29 16:27:26 UTC
(In reply to Kirill A. Shutemov from comment #6)
> One (In reply to Bockjoo Kim from comment #5)
> > I will put return paging_config just in the beginning of the subroutine
> with
> > 5.0-rc4 kernel, build the kernel, and let you know the result.
> 
> That's not going to work. As per v5.0-rc4 trampoline is in active use.
> Please just try it without any changes.
I tried it without any changes, but it wouldn't boot as all other kernels >= 4.17.
Since LEVEL5 is not configured, I am wondering if active trampoline matters and
trampoline placement itself may be the issue.
Comment 8 Kirill A. Shutemov 2019-01-29 16:52:08 UTC
Even if 5-level paging is disabled at compile time, the trampoline might be still needed to switch to 4-level paging for the case when bootloader (i.e. kexec) hands off control in 5-level paging mode.
Comment 9 Kirill A. Shutemov 2019-01-29 16:58:00 UTC
Created attachment 280857 [details]
test.patch

Could you check if this patch makes any difference on current upstream?
Comment 10 Bockjoo Kim 2019-01-29 17:47:14 UTC
(In reply to Kirill A. Shutemov from comment #9)
> Created attachment 280857 [details]
> test.patch
> 
> Could you check if this patch makes any difference on current upstream?

The patch did not make any difference.
Comment 11 Kirill A. Shutemov 2019-01-30 11:58:25 UTC
Created attachment 280869 [details]
test.patch

Okay. Could you try this one?

Have you tried to enable earlyprintk? earlyprintk=vga or earlyprintk=efi in kernel parameters in your case, I guess. I woulder if you can get any message from kernel with this.
Comment 12 Mourad De Clerck 2019-01-30 18:47:41 UTC
5.0-rc4 with only this latest patch seems to work for me on a MacBookPro14,1
Comment 13 Bockjoo Kim 2019-01-30 19:19:11 UTC
(In reply to Kirill A. Shutemov from comment #11)
> Created attachment 280869 [details]
> test.patch
> 
> Okay. Could you try this one?
I am sorry for the late response. The email was buried until I saw Mourad response.
I tried the patch and it booted the kernel as was confirmed by Mourad already.
> Have you tried to enable earlyprintk? earlyprintk=vga or earlyprintk=efi in
> kernel parameters in your case, I guess. I woulder if you can get any
> message from kernel with this.
Without the patch, earlyprintk=vga earlyprintk=efi did not show anything either.
I am trying it with the patch, but I have to figure out how to capture the earlyprintk.

By the way, could you suggest a line to print output of find_trampoline_placement()?
Comment 14 Pitam Mitra 2019-01-31 03:53:14 UTC
I can confirm that 5.0.0-rc4+ boots with the patch, but I run into other problems:

unable to handle kernel NULL pointer dereference at 0000000000000078

I am not sure if those are related to this or not and I dont want to take the focus away.

I am attaching the output of dmesg of that boot attempt (journalctl -o short-precise -k -b -1) in case it is of any help.
Comment 15 Pitam Mitra 2019-01-31 03:55:25 UTC
Created attachment 280887 [details]
dmesg for kerne5.0.0-rc4 with patch

The kernel started but I ran into other errors.
Comment 16 Kirill A. Shutemov 2019-01-31 10:00:13 UTC
(In reply to Pitam Mitra from comment #15)
> Created attachment 280887 [details]
> dmesg for kerne5.0.0-rc4 with patch
> 
> The kernel started but I ran into other errors.

The bug is unrelated to this one. Please report to regulator subsystem maintainers.
Comment 17 Kirill A. Shutemov 2019-01-31 10:03:43 UTC
Created attachment 280889 [details]
test.patch

This patch should also boot fine. If it is, please uncomment line in extract_kernel(). It will hung the boot and let you capture earlyprintk output.

Please, share the output here.
Comment 18 Bockjoo Kim 2019-01-31 14:00:32 UTC
I think I tried something similar yesterday ( variable test_trampoline_32bit ), but it did not boot. I put it in the .bss section just to see if it works, but it did not boot either.
Comment 19 Kirill A. Shutemov 2019-01-31 14:16:39 UTC
But this exact patch doesn't work too?
Comment 20 Bockjoo Kim 2019-01-31 16:18:33 UTC
(In reply to Kirill A. Shutemov from comment #19)
> But this exact patch doesn't work too?

No. It does not boot with the exact patch either.
Comment 21 Kirill A. Shutemov 2019-01-31 16:40:07 UTC
Created attachment 280891 [details]
test.patch

That's very strange. I don't understand what is so special about find_trampoline_placement().

Could you check this patch? If it works, could you please try to uncomment one or another line in find_trampoline_placement().
Comment 22 Bockjoo Kim 2019-01-31 17:17:57 UTC
(In reply to Kirill A. Shutemov from comment #21)
> Created attachment 280891 [details]
> test.patch
> 
> That's very strange. I don't understand what is so special about
> find_trampoline_placement().
> 
> Could you check this patch? If it works, could you please try to uncomment
> one or another line in find_trampoline_placement().

The patch let the machine boot ( the one with ebda_start = bios_start = 0 ; ).
I tried booting the machine after uncommenting the original ebda line 
or the original bios line, but both cases did not let the machine boot.
Comment 23 Kirill A. Shutemov 2019-02-01 09:23:56 UTC
Created attachment 280899 [details]
test.patch

Okay, my hypotheses is that bootloader doesn't map first 4k page into the identity mapping.

This patch dump page table walk for two addresses into earlyprintk. Could you try it and share the output.
Comment 24 Bockjoo Kim 2019-02-01 15:39:34 UTC
That did not fly either with earlyprintk=vga earlyprintk=efi in the kernel boot option. Only a cursor is blinking.
It looks to me that the it can not find microcode or something.
Should something equivalent to 'for(;;);' be moved to head_64.S?
Comment 25 Bockjoo Kim 2019-02-01 17:48:36 UTC
Created attachment 280911 [details]
dmesg output after the patch at Comment 23 - for(;;);

After commenting out for(;;); from misc.c after applying the patch in Comment 23,
I checked dmesg output.
The screen itself was no different from the dmesg ( I have photos of booting if necessay ) output.
Comment 26 Kirill A. Shutemov 2019-02-04 13:36:19 UTC
Created attachment 280945 [details]
test.patch

Another attempt at debug patch. It now dumps page table walk about two addresses into dmesg. Please try it and attach dmesg if it works.

I'll look into fix in case if my hypothesis is correct.
Comment 27 Bockjoo Kim 2019-02-04 15:49:17 UTC
Created attachment 280947 [details]
dmesg output after applying the patch in Comment 26

I have attached the dmesg output after applying the patch.
Comment 28 Kirill A. Shutemov 2019-02-04 16:41:10 UTC
Okay, thanks. My hypothesis is not correct. Let me think more on this.
Comment 29 Kirill A. Shutemov 2019-02-18 15:17:23 UTC
Created attachment 281181 [details]
test-fix.patch

Could you check if this helps?
Comment 30 Bockjoo Kim 2019-02-18 16:15:49 UTC
Created attachment 281185 [details]
dmesg output after applying the patch in Comment #29

After the patch, the kernel boots.
I have left the previous change to dump the page table alone and applied the patch in the Comment #29 additionally.
So, the dmesg is reboot output of the patches in the Comments #26 and #29.
Comment 31 Kirill A. Shutemov 2019-02-18 17:19:51 UTC
The patch from #26 leaves the problematic code unused. Please try the new patch on a clean kernel.
Comment 32 Bockjoo Kim 2019-02-18 17:59:38 UTC
Created attachment 281187 [details]
pgtable_64.c.diff and dmesg output, i.e., with only patch in Comment #29

I have restored all other back to the original code and only applied the patch
in the Comment #29 to pgtable_64.c.
The Kernel boots fine, too.
The diff of pgtable_64.c.original and pgtable_64.c (patched) and the dmesg output
are attached in a single file.
Comment 33 mail+kernel-bugzilla 2019-04-30 23:49:24 UTC
My Samsung 500C Chromebook is suffering from a very similar bug, in the same file.

I bisected it to be a regression in commit

      x86/boot/compressed/64: Validate trampoline placement against E820

I found this issue googling for that commit message.

I've filed a separate bug about that on bug 203463, because it was introduced 1 kernel version later, but perhaps this info helps you with this bug.

I've also tried the patch from Comment #29, but it did not fix my bug.