Bug 110131 - Kernel panic after the random pool initialisation on Dell Latitude E6420
Summary: Kernel panic after the random pool initialisation on Dell Latitude E6420
Status: RESOLVED CODE_FIX
Alias: None
Product: EFI
Classification: Unclassified
Component: Boot
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: EFI Virtual User
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-12-29 03:06 UTC by Viorel-Cătălin Răpițeanu
Modified: 2016-02-01 11:17 UTC
CC List: 3 users

See Also:
Kernel Version: 4.2.4
Tree: Mainline
Regression: No


Attachments
dmesg log with CONFIG_EFI_PGT_DUMP (91.47 KB, text/x-log)
2016-01-22 21:31 UTC, Viorel-Cătălin Răpițeanu
debug patch (2.01 KB, patch)
2016-01-27 11:49 UTC, Matt Fleming
dmesg log with the debug patch applied (119.04 KB, text/x-log)
2016-01-27 18:14 UTC, Viorel-Cătălin Răpițeanu
backport patch (5.17 KB, patch)
2016-01-27 20:47 UTC, Matt Fleming
avoid loss of precision in numpages (1006 bytes, patch)
2016-01-28 22:45 UTC, Matt Fleming

Description Viorel-Cătălin Răpițeanu 2015-12-29 03:06:58 UTC
"The boot output appears to be normal until it finished reading the ACPI tables, at which point it stopped. About 30 seconds later, it printed a message about initializing the random pool. Then nothing for several more minutes until it printed out a kernel panic and rebooted. Here are photos of the output as the kernel panic scrolled past:"
https://goo.gl/photos/C256LsvhmLEuVYow6

The logs were obtained by adding "debug earlyprintk=efi,keep" to the kernel arguments.

This regression was introduced in kernel version 4.2.4 and is still present in the current release (tested on the 4.4-rc7 kernel).

More information can be found on:
https://bugs.archlinux.org/task/46894
Comment 1 Stanislaw Gruszka 2016-01-19 08:31:46 UTC
Could you install a kernel from the git sources (https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/) and perform a bisection between 4.2.3 and 4.2.4? There are about 260 patches in that range. None of them seems to be related to the random number generator, but there are sched and x86 patches that could possibly cause this bug.
Comment 2 Viorel-Cătălin Răpițeanu 2016-01-20 11:00:26 UTC
I've done a bisect between 4.2.3 and 4.2.4 and found that the regression was introduced by this commit:
> commit 496c2053cd784dd653d295e499503f14907022b3
> x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime,
> instead of top-down

I can also confirm that both people who have seen this problem so far (myself included) are using UEFI to boot the kernel.
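
As a rough illustration of what the commit title describes, the sketch below assumes a hypothetical top-down virtual address allocator (each call hands out the range just below the previous one) and shows how the order in which memory map entries are visited decides whether the virtual layout mirrors the physical one. The struct, the function names, and every address are made up for illustration; this is not the kernel's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memory map entry (not the kernel's descriptor type). */
struct region {
    uint64_t phys;  /* physical start address */
    uint64_t size;  /* size in bytes */
    uint64_t virt;  /* virtual address assigned by the allocator */
};

/* Hypothetical top-down allocator: returns the range just below the last one. */
static uint64_t alloc_top_down(uint64_t *next_va, uint64_t size)
{
    *next_va -= size;
    return *next_va;
}

/* Visit entries first to last: the lowest physical region gets the highest VA. */
static void map_forward(struct region *r, size_t n, uint64_t top)
{
    for (size_t i = 0; i < n; i++)
        r[i].virt = alloc_top_down(&top, r[i].size);
}

/* Visit entries last to first: the virtual order follows the physical order. */
static void map_reverse(struct region *r, size_t n, uint64_t top)
{
    for (size_t i = n; i-- > 0; )
        r[i].virt = alloc_top_down(&top, r[i].size);
}

/* Returns 1 when virtual addresses strictly increase along the array,
 * which is assumed to be sorted by ascending physical address. */
static int virt_order_matches_phys(const struct region *r, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (r[i].virt <= r[i - 1].virt)
            return 0;
    return 1;
}
```

With three regions sorted by physical address, map_forward() leaves them in reversed virtual order, while map_reverse() keeps the two orders aligned.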
Comment 3 Stanislaw Gruszka 2016-01-20 11:35:21 UTC
Thanks. I'm assigning the bug to the proper component.
Comment 4 Matt Fleming 2016-01-21 16:48:08 UTC
Can you try booting with the broken commit applied and specify "efi=old_map" on the kernel command line? That should get your kernel booting, but it's a good idea to verify that.
Comment 5 Matt Fleming 2016-01-21 16:51:02 UTC
Oh, also, build with CONFIG_EFI_PGT_DUMP enabled and make the kernel command line, "efi=old_map,debug". It would be good to stare at the memory mapping for your machine which will appear in dmesg.
Comment 6 Viorel-Cătălin Răpițeanu 2016-01-22 21:05:12 UTC
> Can you try booting with the broken commit applied and specify "efi=old_map"
> on the kernel command line? That should get your kernel booting, but it's a
> good idea to verify that.
With efi=old_map, my kernel boots.

> Oh, also, build with CONFIG_EFI_PGT_DUMP enabled and make the kernel command
> line, "efi=old_map,debug". It would be good to stare at the memory mapping
> for your machine which will appear in dmesg.
I will attach the dmesg for that scenario as soon as possible. If there is any more relevant debug information that I can provide, just leave a message.
Comment 7 Viorel-Cătălin Răpițeanu 2016-01-22 21:31:57 UTC
Created attachment 201021 [details]
dmesg log with CONFIG_EFI_PGT_DUMP

dmesg log of the bad kernel with "CONFIG_EFI_PGT_DUMP" and "efi=old_map,debug".
Comment 8 Matt Fleming 2016-01-27 11:49:46 UTC
Created attachment 202091 [details]
debug patch
Comment 9 Matt Fleming 2016-01-27 11:51:54 UTC
Could you try out the attached debug patch on top of v4.2.4, which includes the problematic commit? You don't need to specify efi=old_map; the patch should ensure the kernel boots, although EFI runtime services will not be available (that shouldn't be a problem).

After verifying that your kernel boots could you attach the new dmesg? The images you posted make it look like we're spinning forever trying to map the EFI regions, though I have no idea why yet.
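
To make the "spinning forever" symptom concrete, here is a minimal sketch of a mapping loop that only terminates if every step maps at least one page. All names are hypothetical and the step functions are stand-ins, not the kernel's mapping code; the point is only that a step which reports zero progress needs an explicit check, otherwise the loop never exits.

```c
#include <stdint.h>

/* Hypothetical mapping step: returns how many pages it mapped in this call. */
typedef uint64_t (*map_step_fn)(uint64_t remaining);

/* Map `numpages` pages by calling `step` repeatedly.
 * Returns 0 on success, -1 if a step makes no progress. */
static int map_all_pages(uint64_t numpages, map_step_fn step)
{
    while (numpages > 0) {
        uint64_t done = step(numpages);

        if (done == 0)
            return -1;  /* without this check the loop would spin forever */
        if (done > numpages)
            done = numpages;  /* never count more than was requested */
        numpages -= done;
    }
    return 0;
}

/* Stand-in step that maps at most 512 pages per call. */
static uint64_t step_ok(uint64_t remaining)
{
    return remaining < 512 ? remaining : 512;
}

/* Stand-in step that never maps anything. */
static uint64_t step_stuck(uint64_t remaining)
{
    (void)remaining;
    return 0;
}
```

Calling map_all_pages(2000, step_ok) returns 0, while map_all_pages(2000, step_stuck) returns -1 instead of hanging.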
Comment 10 Viorel-Cătălin Răpițeanu 2016-01-27 18:14:03 UTC
Created attachment 202111 [details]
dmesg log with the debug patch applied
Comment 11 Viorel-Cătălin Răpițeanu 2016-01-27 18:19:01 UTC
With the patch applied, the kernel boots without having to specify efi=old_map.
The new dmesg log was taken with the debug patch applied. If there is anything else that could help the debugging process, please let me know.
Comment 12 Matt Fleming 2016-01-27 20:46:45 UTC
OK, that narrows things down a little. Could you try the attached backport on top of a clean v4.2.4?
Comment 13 Matt Fleming 2016-01-27 20:47:12 UTC
Created attachment 202121 [details]
backport patch
Comment 14 Viorel-Cătălin Răpițeanu 2016-01-27 22:11:16 UTC
With the proposed backport patch applied, the kernel still doesn't boot unless efi=old_map is specified.
Comment 15 Matt Fleming 2016-01-27 22:13:39 UTC
Thanks for testing that out. Let me go and stare at the code some more. It appears that this is a whole new bug in the mapping code.
Comment 16 Matt Fleming 2016-01-28 22:45:59 UTC
Created attachment 202191 [details]
avoid loss of precision in numpages

Fingers crossed, can you try out this attachment? I was able to reproduce your issue locally in QEMU by force-feeding the EFI virtual region allocator the problematic address range.
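
The attachment title points at a loss of precision when a page count is converted to a byte count. The sketch below shows that general class of bug rather than the patch itself: the function names and the page counts are hypothetical, and PAGE_SHIFT is assumed to be 12 (4 KiB pages). When the shift is done in 32 bits, any region of 4 GiB or more loses its high bits.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Shift performed in 32 bits: the high bits of the byte count are lost. */
static uint64_t region_bytes_truncated(uint32_t numpages)
{
    return (uint32_t)(numpages << PAGE_SHIFT);
}

/* Widen to 64 bits before shifting, so the full byte count is kept. */
static uint64_t region_bytes_widened(uint32_t numpages)
{
    return (uint64_t)numpages << PAGE_SHIFT;
}
```

For a hypothetical count of 0x100000 pages (exactly 4 GiB), region_bytes_truncated() returns 0 while region_bytes_widened() returns 0x100000000. A computed length of zero means a caller sees nothing to map, which is the kind of result that could leave a mapping loop without progress.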
Comment 17 Viorel-Cătălin Răpițeanu 2016-01-28 23:56:56 UTC
With the last patch, the kernel boots successfully. That was nice! I hope this will be merged into the master branch soon.
Thank you for spending your time fixing this issue.
Comment 18 Matt Fleming 2016-02-01 11:17:19 UTC
This has now been merged into Linus' tree and should be backported to the stable releases soonish:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/x86/mm/pageattr.c?id=742563777e8da62197d6cb4b99f4027f59454735

Thanks for all your help tracking this down.
