Bug 65841

Summary: Dell Venue 8 Pro (Bay Trail) kernel crashes part way through boot with kernels 3.12 and 3.13
Product: EFI
Reporter: Adam Williamson (adamw)
Component: Services
Assignee: EFI Virtual User (efi)
Severity: high
CC: alan, kamiciy920, kernel, matt, mjg59-kernel, tianyu.lan
Priority: P1
Hardware: i386
OS: Linux
Kernel Version: 3.13.0-0.rc1.git2.1.fc21
Subsystem:
Regression: No
Bisected commit-id:
Attachments: bottom of the first traceback seen
another trace that pops up some seconds after the first
clean screenshot of the full trace
acpidump (requested by mjg59)

Description Adam Williamson 2013-11-26 02:39:12 UTC
So, I bought a Dell Venue 8 Pro to play with getting Fedora running on it. This is a Bay Trail-based 8" tablet similar to the Lenovo Miix2 and the Asus T100.

I can build Fedora 20-based images that deal with icky 32-bit UEFI firmware and boot. With a 3.11 kernel and 'nomodeset', I can even get to a GNOME 3 desktop, and with a keyboard and mouse plugged in, I can use it. Almost none of the other hardware on the tablet works, though.

With 3.12 or 3.13, though, boot fails part way through. It hits a kernel crash of some kind.

Unfortunately, it's going to be hard getting details out in a really usable form. I can't direct the output over the network as it doesn't have ethernet. I don't have any kind of serial console setup that'd work with this tablet. So I really can't get it out in text form at present. I _could_ do an install of the barely-functional 3.11-based system and then install 3.12 and 3.13 kernels and try to store the output from them to look at it from the 3.11 kernel, but that would necessitate blowing away the Windows 8.1 install, which would mean I couldn't make any real use of the tablet until all the various bugs are fixed. I'm hoping to avoid doing a permanent install of Fedora until it's at least vaguely usable.

So for now what I've got are some photos and a video of the boot. I really have no idea what component to file this against, I can't read kernel traces. I'll attach what I've got and hope you can figure it out. :)
Comment 1 Adam Williamson 2013-11-26 02:43:14 UTC
Created attachment 116091 [details]
bottom of the first traceback seen

Boot proceeds for a while and then explodes, with a whole bunch of output landing at once. This is the bottom of that output. It then sits like this for a while, before posting some more tracebacks, I think because it detects the CPUs are sitting idle. I'll post a screenshot of one of those later tracebacks next.
Comment 2 Adam Williamson 2013-11-26 02:44:09 UTC
Created attachment 116101 [details]
another trace that pops up some seconds after the first
Comment 3 Adam Williamson 2013-11-26 02:54:52 UTC
https://www.happyassassin.net/extras/v8p_fail.mp4 is a shakycam video of a boot with kernel 3.13.0-0.rc1.git2.1.fc21. I can't manage to transcode it to another format or rotate it at present, sorry :(
Comment 4 Adam Williamson 2013-11-26 04:04:50 UTC
Ah, large USB sticks to the rescue! I now have the system installed to a second USB stick and can successfully boot from that. So I should be able to stick a 3.13 kernel into the 'installed' system, boot, then boot back to 3.11 and extract the logs from the failed boot. Stay tuned.
Comment 5 Adam Williamson 2013-11-26 05:01:59 UTC
Unfortunately, it looks like the journal doesn't capture the kernel crashes :( https://bugs.freedesktop.org/attachment.cgi?id=89807 is what gets captured in journald during a boot of a 3.13 kernel, but it doesn't include the traces for this bug, I don't think. If anyone has any bright ideas how I can get 'em out, do yell.

Funnily enough, after installing and trying to boot a 3.13 kernel, I can no longer boot successfully even with the 3.11 kernel! It starts throwing traces during boot too. Odd.
Comment 6 Adam Williamson 2013-11-26 06:14:52 UTC
So I'm craning my neck around awkwardly and stepping through the video frame-by-frame. It appears that the crash happens right around microcode loading. I see something like:

"platform microcode: Direct firmware load failed with error -2"

repeated around when the traceback shows up. The first line of the traceback looks to be in 'virt_efi_get_time'.

I can also see (there may be typos or errors in the individual characters here, reading off the fuzzy video):

[47.176986] microcode: CPU0 sig=0x30673, pf=0x2, revision=0x312
[47.412697] BUG: unable to handle kernel paging request at fed08004
[47.420696] IP: [<f805fa36>] 0xf805fa36
[47.428607] *pde = 3207f067 *pte = 00000000
[47.436536] Oops: 0000 [#1] SMP
[47.502595] task: f5e3ed80 ti: f5f22000 task.ti: f5f22000
[47.509960] platform microcode: Direct firmware load failed with error -2
[47.509985] platform microcode: Falling back to user helper
[47.510154] microcode: CPU1 sig=0x30673, pf=0x2, revision=0x312
[47.510963] platform microcode: Direct firmware load failed with error -2
[47.510967] platform microcode: Falling back to user helper
[47.511056] microcode: CPU2 sig=0x30673, pf=0x2, revision=0x312
[47.511756] platform microcode: Direct firmware load failed with error -2
[47.577060] EIP: 0060:[<f805fa36>] EFLAGS: 00010046 CPU: 3
[47.584893] EIP is at 0xf805fa36
[47.592644] EAX: fed03000 EBX: f5f23e0b ECX: c044(illegible)
[47.600446] ESI: 00000282 EDI: f5f23e34 EBP: f5f23dc(illegible)
[47.654779] Call Trace:
[47.662473]  [<c04481ec>] ? virt_efi_get_time+0x1c/0x50
[47.670252]  [<c0448200>] ? virt_efi_get_time+0x30/0x50
[47.677954]  [<c0447fcd>] ? efi_set_rtc_????+0x1b/0xc0
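
As an aid to reading the transcription above: the four hex digits after "Oops:" are the x86 page-fault error code, which is a bit field. Below is a minimal sketch of decoding it. The macro and function names are illustrative rather than the kernel's own, and only the five low bits are handled.

```c
#include <stdio.h>

/* x86 page-fault error code bits (illustrative names). */
#define PF_PROT  (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE (1u << 1)  /* 0: read access,      1: write access         */
#define PF_USER  (1u << 2)  /* 0: kernel mode,      1: user mode            */
#define PF_RSVD  (1u << 3)  /* 1: reserved bit set in a page-table entry    */
#define PF_INSTR (1u << 4)  /* 1: fault was an instruction fetch            */

/* Write a one-line description of the error code into buf. */
void describe_pf_error(unsigned int code, char *buf, size_t len)
{
    snprintf(buf, len, "%s-mode %s, %s%s",
             (code & PF_USER) ? "user" : "kernel",
             (code & PF_INSTR) ? "instruction fetch"
                               : (code & PF_WRITE) ? "write" : "read",
             (code & PF_PROT) ? "protection violation" : "page not present",
             (code & PF_RSVD) ? ", reserved bit set" : "");
}
```

For the value seen here, describe_pf_error(0x0000, ...) gives "kernel-mode read, page not present", which agrees with the "*pte = 00000000" on the line before it: the faulting address fed08004 simply has no mapping.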

.....god, my neck is getting tired...from there the trace goes to __getestimeofday, update_persistent_clock, sync_cmos_clock, process_one_work, process_one_work again, worker_thread, process_one_work, kthread, trace_hardirqs_on, ret_from_kernel_thread, insert_kthread_work, then a Code: block, an EIP line, a CR2 line and the trace ends.
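
Call-trace entries in a kernel oops normally have the form "[<address>] symbol+0xoffset/0xsize", with an optional "?" marking an address that was found on the stack but is not part of the verified frame chain. A rough sketch of pulling one entry apart follows. The struct and function names are made up for illustration, and it assumes a cleanly formatted line rather than one read off a blurry video.

```c
#include <stdio.h>
#include <string.h>

/* One decoded call-trace entry (hypothetical helper type). */
struct trace_entry {
    unsigned long addr;    /* address found on the stack              */
    char symbol[64];       /* function the address falls inside       */
    unsigned long offset;  /* bytes from the start of that function   */
    unsigned long size;    /* total size of that function in bytes    */
    int unreliable;        /* 1 if the entry carried a '?' marker     */
};

/* Parse one trace line; returns 1 on success, 0 if it does not match. */
int parse_trace_entry(const char *line, struct trace_entry *out)
{
    const char *p = strstr(line, "[<");   /* skips any "[timestamp]" prefix */
    if (!p || sscanf(p, "[<%lx>]", &out->addr) != 1)
        return 0;

    p = strstr(p, ">]");
    if (!p)
        return 0;
    p += 2;
    while (*p == ' ')
        p++;

    out->unreliable = 0;
    if (*p == '?') {                      /* possibly stale stack entry */
        out->unreliable = 1;
        p++;
        while (*p == ' ')
            p++;
    }

    /* symbol name, then "+0x<offset>/0x<size>" */
    return sscanf(p, "%63[^+]+0x%lx/0x%lx",
                  out->symbol, &out->offset, &out->size) == 3;
}
```

Read this way, an entry such as "virt_efi_get_time+0x1c/0x50" means the address lies 0x1c bytes into a function that is 0x50 bytes long.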

After that I see a line:

BUG: sleeping function called from invalid context(? - line continues, but is illegible)


in_atomic(): 1, irqs_disabled(): 1, pid: (illegible)
INFO: lockdep is turned off
irq event stamp: 136104
hardirqs last enabled at (136103): (looks like the getesttimeofday function from the trace)
hardirqs last disabled at (136104): [<c0a7eba3>] _raw_spin_lock_irqs
softirqs last enabled at (136042): (illegible)
softirqs last disabled at (136037): (illegible)
CPU: 3 PID: 74 Comm: kworker/(illegible) Tainted: (illegible)

looks like the start of another trace...

That's as much as my neck can stand tonight, hope it's of some kind of use. Can try to decipher more tomorrow if necessary.
Comment 7 Adam Williamson 2013-11-26 06:16:59 UTC
CCing Lan - hi Lan, I saw you on a couple of other Bay Trail bugs, hope you don't mind being copied on this one.
Comment 8 Adam Williamson 2013-11-26 06:27:25 UTC
Created attachment 116111 [details]
clean screenshot of the full trace

Aha! So removing microcode_ctl doesn't solve the crash, but it makes the traceback much cleaner. With the microcode_ctl package removed, the entire crash looks like it fits on one screen. Here is a picture of the whole thing.
Comment 9 Alan 2013-11-26 14:06:36 UTC
It looks like it is making a call to 64bit EFI services even though they are not present. If you boot a 32bit Fedora does that behave any better?

Comment 10 Alan 2013-11-26 14:07:20 UTC
(adding the font of EFI firmware wisdom to the cc list)
Comment 11 Matthew Garrett 2013-11-26 14:49:15 UTC
Looks like it *is* a 32-bit kernel. We don't default to using the EFI time services on 64-bit, and possibly we should do the same thing on 32-bit, but it's strange that it worked in 3.11 and breaks in 3.12. We've certainly made fewer efforts to implement any UEFI bug workarounds on 32-bit systems.
Comment 12 Alan 2013-11-26 15:08:51 UTC
Duh yes.. good point. So trying a 64bit Fedora might actually help.

I've got a T100 heading my way so hopefully I can have more of a poke at this and at the fact we are doing VESA BIOS calls in EFI mode as well.
Comment 13 Matthew Garrett 2013-11-26 15:28:02 UTC
It's a 32-bit UEFI implementation and we don't have a 32-bit bootloader that knows how to boot a 64-bit kernel (and if we did, we'd then refuse to perform any UEFI calls). Fixing that is possible, but not high on my (or anyone else's) list.
Comment 14 Alan 2013-11-26 15:35:26 UTC
That's not necessarily a lose if we could stop it then deciding that it wants to do BIOS calls instead. It is possible to boot Ubuntu on the T100 in 64bit via 32bit EFI, although obviously you then can't access EFI services (what a shame ;))


has much of the background to all of this.
Comment 15 Matthew Garrett 2013-11-26 15:51:20 UTC
Well, other than there being no way to set up the bootloader.
Comment 16 Adam Williamson 2013-11-26 16:45:25 UTC
Yes, as discussed above, this is a pure 32-bit image and I cannot easily do 64-bit-on-32-bit unless Matthew or someone else smart fixes that up for me (or, I guess, I could steal whatever bootloader people are using to do the 64-on-32 trick with Ubuntu, but if anything I'd rather this gets *less* messy not *more* :>)

Matthew, should I move this downstream? Would anyone be interested in fixing it? I think it'd be nice to make these systems work...
Comment 17 Adam Williamson 2013-11-29 17:30:48 UTC
This change suggested by mjg59 appears to fix the problem (I built a 3.13 kernel with this patch, built a live image with that kernel, and it boots to a desktop):

--- linux-3.13.0-0.rc1.git3.1.fc21.x86_64/arch/x86/platform/efi/efi.c	2013-11-28 12:59:36.613028002 -0800
+++ linux-3.13.0-0.rc1.git3.1.fc21.x86_64/arch/x86/platform/efi/efi.c.new	2013-11-28 13:01:08.043768967 -0800
@@ -690,7 +690,7 @@
 	set_bit(EFI_MEMMAP, &x86_efi_facility);
-#ifdef CONFIG_X86_32
+#if 0
 	if (efi_is_native()) {
 		x86_platform.get_wallclock = efi_get_time;
 		x86_platform.set_wallclock = efi_set_rtc_mmss;
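
To illustrate why compiling that block out helps, here is a small standalone model of the pattern, not kernel code. It assumes the wallclock hooks default to a CMOS RTC routine and are only switched to the EFI routines on 32-bit native EFI boots. Everything in it (struct platform_ops, init_time_source, the returned strings) is a hypothetical stand-in.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's table of platform hooks. */
struct platform_ops {
    const char *(*get_wallclock)(void);
};

/* Stand-ins for the two time sources; each just reports its name. */
static const char *cmos_get_wallclock(void) { return "cmos"; }
static const char *efi_get_wallclock(void)  { return "efi";  }

/* Assumed default: read the time from the CMOS RTC. */
struct platform_ops platform = { .get_wallclock = cmos_get_wallclock };

/*
 * Model of the init-time decision touched by the patch. Before the
 * change, a 32-bit native EFI boot replaces the hook with the EFI
 * routine; with the block disabled ("#if 0"), the default stays put.
 */
void init_time_source(bool is_32bit, bool efi_native, bool patch_applied)
{
    platform.get_wallclock = cmos_get_wallclock;
    if (is_32bit && efi_native && !patch_applied)
        platform.get_wallclock = efi_get_wallclock;
}
```

With the hook left at its default, clock sync never reaches virt_efi_get_time, so the firmware's time service that faults in the trace is simply not called.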
Comment 18 Adam Williamson 2013-11-30 20:54:50 UTC
Created attachment 116931 [details]
acpidump (requested by mjg59)
Comment 19 Matt Fleming 2013-12-20 19:25:32 UTC
An equivalent patch from Matthew made it into v3.13-rc4 in Linus' tree and is marked for stable.