Bug 84241

Summary: Fails to boot on Baytrail tablet
Product: EFI Reporter: Bastien Nocera (bugzilla)
Component: Boot Assignee: EFI Virtual User (efi)
Status: RESOLVED CODE_FIX    
Severity: normal CC: adamw, hozootablets, jan.brummer, jwboyer, matt, StridAst76
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.17-rc0 git 4.2 Subsystem:
Regression: No Bisected commit-id:
Attachments: baytrail-failed-boot-3.17.jpg
oops in efi time functions
big-arse revert
Fix garbled text output

Description Bastien Nocera 2014-09-11 13:10:00 UTC
Using this tablet:
http://www.onda-tablet.com/onda-v975w-quad-core-win-8-tablet-9-7-inch-retina-screen-ram-2gb-wifi-32gb.html

With 3.17 rc0 fails to boot, printing a single line of text:
��n��n��n��n��n��n��n��n (etc.)
until the end of the line.

Turning on debug doesn't print more data. This machine has a 32-bit UEFI firmware, and I used a 32 bit kernel.

kernel 3.16.0 can boot just fine.
Comment 1 Bastien Nocera 2014-09-11 13:29:54 UTC
Created attachment 149781
baytrail-failed-boot-3.17.jpg
Comment 2 Adam Williamson 2014-09-11 16:12:16 UTC
Confirmed: I'm seeing this on my Venue 8 Pro, and (ex-)Intel devs working on the Baytrail platform say they've seen it too but haven't got around to debugging it yet. I usually get two repeating characters for a full-width line, similar to Bastien's picture. I think a given kernel always shows the same two characters, while different kernels show different ones, but I'd have to try it a couple more times to confirm.

From Jan-Michael Brummer:

"You are right, 3.17 is buggy as hell >:( We have the same experience 
here but not the time to bisect the character-issue. Hopefully at the 
end of the week."

This is of course a definite regression from 3.16, which booted successfully on these systems. It persists up to at least rc2; I'll check the latest Fedora 'rc4' kernel shortly.
Comment 3 Bastien Nocera 2014-09-11 16:21:05 UTC
Still fails for me with 3.17 rc4.
Comment 4 Matt Fleming 2014-09-11 18:07:02 UTC
A git bisect would be invaluable in tracking this down.

Alternatively, applying the patches on the following branch may improve things somewhat,

http://git.kernel.org/cgit/linux/kernel/git/mfleming/efi.git/log/?h=urgent
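
For readers unfamiliar with what the requested bisect does, here is a minimal Python sketch of the underlying binary search. It is an illustration only, not git's implementation: the commit names, the `first_bad` helper and the `boots_badly` predicate are all hypothetical, and it assumes a linear history in which every commit after the first bad one also fails.

```python
from typing import Callable, List, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is bad as well.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # the regression is at mid or earlier
        else:
            good = mid   # the regression is after mid
    return commits[bad]


# Hypothetical history between a good and a bad kernel.
history: List[str] = ["v3.16"] + [f"commit-{i}" for i in range(1, 8)] + ["v3.17-rc4"]
culprit = "commit-5"   # assumed for the demo only
tested: List[str] = []


def boots_badly(commit: str) -> bool:
    """Stand-in for 'build this commit, boot the tablet, look at the screen'."""
    tested.append(commit)
    return history.index(commit) >= history.index(culprit)


result = first_bad(history, boots_badly)
print(result, "found after", len(tested), "test boots")
```

The number of test boots grows with the logarithm of the number of commits, which is why a bisect is practical even across a whole merge window. The sketch relies on each test giving a reliable answer; with an intermittent failure, each step would need several boots before a commit can be called good.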
Comment 5 Adam Williamson 2014-09-11 18:24:11 UTC
The latter's easier :P, so I'll try it. If it doesn't help I'll try to get to a bisect at some point, but I'm currently drowning in F21 Alpha, unfortunately, so my side projects get limited time :/
Comment 6 Jan-Michael Brummer 2014-09-11 20:47:18 UTC
Just tried 3.17-rc4 with both of your patches; it boots again :) As I do not use an initrd, the GOT patch is evidently what fixed it. Thanks.
Comment 7 Bastien Nocera 2014-09-12 00:53:43 UTC
It fixes it (kind of). It booted twice, out of 3 separate attempts. Might not be related though, graphics is also particularly awful.
Comment 8 Bastien Nocera 2014-09-12 14:36:38 UTC
Right, it completely failed to boot in my last 3 attempts. 3.16 still works as expected.
Comment 9 Jan-Michael Brummer 2014-09-13 09:27:22 UTC
Works fine with the patches, no boot problem (DV8P). Are you sure that it is the same issue?
Comment 10 Adam Williamson 2014-09-13 18:33:27 UTC
I just rebased the Fedlet kernel to 3.17 and included the two x86-efi patches from Matt's branch. I'll see how that flies. It's building at http://koji.fedoraproject.org/koji/taskinfo?taskID=7575316 .
Comment 11 Bastien Nocera 2014-09-15 21:53:46 UTC
(In reply to Jan-Michael Brummer from comment #9)
> Works fine with the patches, no boot problem (DV8P). Are you sure that it is
> the same issue?

Same issue as what?
Comment 12 Adam Williamson 2014-09-17 16:55:02 UTC
Hum - so a 3.17rc4 kernel I built with Matt's patches - that's the one from c#10 - seems to boot reliably for me, but a 3.17rc5 kernel built with the exact same delta from Fedora's stock (same patch set, same config options) fails, sticks on the 'garbage characters' screen.

I guess it's worth noting the 'garbage characters' are briefly visible in a working boot with 3.17rc4, but are quickly replaced by the normal boot screen.
Comment 13 Jan-Michael Brummer 2014-09-18 18:23:18 UTC
Note: the garbage characters are produced by efi_printk() and the message should be 'setup_efi_pci() failed!'
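
As a rough model of what a correct print of that message involves: the boot stub's efi_printk() is understood to walk the ASCII string and hand the firmware's text output routine one NUL-terminated UTF-16 character at a time, sending a carriage return before each newline. The Python below only mimics that call sequence; the function name `output_string_calls` is hypothetical and nothing here is kernel code.

```python
from typing import List


def output_string_calls(message: str) -> List[bytes]:
    """Return the buffers a per-character printer would pass to the firmware.

    Each buffer is one UTF-16LE character followed by a UTF-16 NUL.
    A carriage return buffer is emitted before every newline.
    """
    calls: List[bytes] = []
    for ch in message:
        if ch == "\n":
            calls.append("\r\0".encode("utf-16-le"))
        calls.append((ch + "\0").encode("utf-16-le"))
    return calls


calls = output_string_calls("setup_efi_pci() failed!\n")
for buf in calls[:3]:
    print(buf)
```

In this model the 23-character message plus newline becomes 25 separate calls, each carrying one character. A line of repeated junk would be consistent with each of those calls printing something other than its intended buffer, although the thread does not establish that as the cause.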
Comment 14 Jan-Michael Brummer 2014-09-18 19:02:43 UTC
Commenting out the efi_printk() call for 'setup_efi_pci() failed!' reveals an oops within virt_efi_get_time().
Comment 15 Matt Fleming 2014-09-19 09:33:28 UTC
Jan-Michael, but things boot correctly with the patches in my 'urgent' queue, right?
Comment 16 Jan-Michael Brummer 2014-09-19 12:43:45 UTC
Well, at first they solved my issue. But after recreating my initrd the same symptom happened. That's why I moved on and tried to figure out what the characters mean. Obviously there is something wrong with the EFI part, because as soon as I remove the efi_printk() it continues to boot and shows the framebuffer, followed by an oops in virt_efi_get_time(). See my previous comments.
Comment 17 Matt Fleming 2014-09-19 12:55:34 UTC
Could you upload a copy of the oops? Debugging that is going to be much more straightforward than debugging the early efi_printk() issue (assuming they're the same bug).

Also, try turning off CONFIG_RTC_DRV_EFI and see whether you get a working machine. It used to be disabled for x86 because, by and large, the implementations of GetTime() don't work very well on x86. Recent arm64 changes look like they turned it on.
Comment 18 Jan-Michael Brummer 2014-09-19 13:08:27 UTC
Created attachment 150911
oops in efi time functions
Comment 19 Jan-Michael Brummer 2014-09-19 13:10:17 UTC
As soon as I'm back home (Monday) I'll disable the EFI RTC support. In the meantime I hope that my photo from yesterday helps.
Comment 20 Adam Williamson 2014-09-19 15:49:50 UTC
I see something similar - the 3.17rc4 Fedlet kernel I first built with the patches booted, but subsequently I built a 3.17rc5 kernel and that fails. I *think* it's deterministic - the 3.17rc4 kernel always boots, the rc5 one never does - but I haven't done enough boots yet to be absolutely sure, it could just be a very unlikely chance streak (I think I've done ~5 successful boots on rc4, ~5 failed boots on rc5).
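
To put a number on the "unlikely chance streak": a small Python sketch, assuming (purely for illustration) that both kernels share one per-boot success probability and that boots are independent. The function name `streak_probability` and the probabilities tried are assumptions, not measurements from this bug.

```python
def streak_probability(p_success: float, successes: int, failures: int) -> float:
    """Probability of seeing exactly this run of independent boots:
    `successes` good boots on one kernel and `failures` bad boots on the other,
    when both kernels really share the same per-boot success chance."""
    return p_success ** successes * (1.0 - p_success) ** failures


# Try a grid of possible shared success rates, 0% to 100%.
candidates = [i / 100 for i in range(101)]
best_p = max(candidates, key=lambda p: streak_probability(p, 5, 5))
best = streak_probability(best_p, 5, 5)

print(f"most favourable shared success rate: {best_p:.2f}")
print(f"chance of 5 good boots then 5 bad boots: {best:.6f}")
```

Under these assumptions the streak is most likely when the shared success rate is one half, and even then it would occur in roughly one case in a thousand, so ten boots already make pure chance a weak explanation.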
Comment 21 Matt Fleming 2014-09-19 15:57:37 UTC
Yeah, that looks like a firmware bug to me. The IP is 0xf04e1a36. That's not a kernel address. We can probably just chalk this up to one of the many GetTime() bugs.

So the efi_printk() bug is interesting. Can you try to bisect between -rc4 and -rc5 and see if it's some other regression?
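
One way to make the "not a kernel address" check concrete is to compare the faulting IP with the core kernel text range from the build's System.map. The Python below is a sketch under stated assumptions: the sample map addresses are invented for illustration and are not from the reporter's kernel, and the helper names are hypothetical. Only the IP value comes from this bug.

```python
from typing import Dict, Tuple

# Invented sample in System.map layout (address, type, symbol name).
SAMPLE_SYSTEM_MAP = """\
c1000000 T _text
c1000000 T _stext
c1690000 T _etext
"""


def text_range(system_map: str) -> Tuple[int, int]:
    """Return (start, end) of the core kernel text from System.map content."""
    symbols: Dict[str, int] = {}
    for line in system_map.splitlines():
        addr, _sym_type, name = line.split()
        symbols[name] = int(addr, 16)
    return symbols["_stext"], symbols["_etext"]


def in_kernel_text(ip: int, system_map: str) -> bool:
    """True if ip lies inside the core kernel text range."""
    start, end = text_range(system_map)
    return start <= ip < end


oops_ip = 0xF04E1A36  # IP quoted from the oops in this bug
print(hex(oops_ip), "in kernel text:", in_kernel_text(oops_ip, SAMPLE_SYSTEM_MAP))
```

With the sample range the IP falls outside the kernel text, which matches the reading that execution was inside firmware code at the time. The check covers only the core image; code in loadable modules lives elsewhere and would need a separate lookup.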
Comment 22 Adam Williamson 2014-09-19 16:00:45 UTC
Well, I'm not sure if rc4->rc5 is the thing, it might be the initramfs generation or contents or something - Jan-Michael said regenerating initramfs triggered it for him, and I believe Bastien said the patches didn't completely work for him either. More testing needed I guess...

Jan-Michael, when you re-generated your initramfs did you change something?
Comment 23 Jan-Michael Brummer 2014-09-19 16:02:14 UTC
I think it is related to the garbage output and this one must be investigated. So we should try to find the "buggy" patch between 3.16 and 3.17-rcX. Unfortunately I'm on a trip this weekend... Adam, can you jump in?
Comment 24 Jan-Michael Brummer 2014-09-19 16:04:08 UTC
I think I just changed the plymouth startup screen...
Comment 25 Adam Williamson 2014-09-19 16:16:38 UTC
"Adam, can you jump in?"

siiigh, I guess the gods really want me to set up a tweaked local kernel build environment :/ (I still build the fedlet kernels as full-fat, two-hour-long builds in Fedora's buildsystem, I don't have a config for doing a stripped local build). I'll try.
Comment 26 Matt Fleming 2014-09-23 13:19:59 UTC
Guys, could you try the attached patch? It applies on top of f3670394c29f in Linus' tree. It basically reverts a bunch of the EFI boot stub code back to the pre-merge window state.

Also, I can reproduce those unreadable characters on boot on my ASUS T100. I'll take a look at that issue.
Comment 27 Matt Fleming 2014-09-23 13:20:29 UTC
Created attachment 151601
big-arse revert
Comment 28 Adam Williamson 2014-09-23 15:24:24 UTC
I have a fedlet build with 'big arse revert' running now.

(there's a sentence I never thought I'd write.)
Comment 29 Matt Fleming 2014-09-24 15:26:13 UTC
Created attachment 151781
Fix garbled text output
Comment 30 Matt Fleming 2014-09-24 15:26:56 UTC
If someone wants to give the "Fix garbled text output" patch a try, it should fix the issue in the description of this bug.
Comment 31 Adam Williamson 2014-09-24 17:29:42 UTC
Sorry for the delay, I got sidelined.

An rc6 kernel with the 'big arse revert' boots successfully on the Fedlet (Venue 8 Pro, running pure 32-bit boot chain). Of course, rc4 booted but rc5 failed with the same patchset so this seems a sensitive issue, but insofar as I have an indication, it says 'big arses are good'.
Comment 32 Adam Williamson 2014-09-27 19:15:04 UTC
Oh, I don't believe I saw the garbled text with the 'big arse revert' either (I didn't use the patch from c#29).
Comment 33 Matt Fleming 2014-10-09 10:37:32 UTC
This should all be fixed in v3.17-rc7. Please reopen this bug if you're still having problems.

Thanks everyone.
Comment 34 StridAst 2014-10-24 16:31:11 UTC
So, I've been unable to reliably boot anything later than 14.04.1 LTS on my Pipo W2 Bay Trail Windows tablet. I tried the latest Fedlet build: no dice. It got to the desktop once on Fedlet and crashed there; no other attempt made it to the desktop (in about 10 attempts). I also tried Ubuntu 14.10 and many Ubuntu flavors (Xubuntu, MATE, Kubuntu, Lubuntu, GNOME, etc.). 14.04.1 boots fine, though.

I'm under the impression the latest Fedlet builds are based on a kernel that should be fixed (I don't think the same applies to the official 14.10 releases of the Ubuntu variants), so I thought I would mention that this (or another closely related) kernel bug is still being, well, buggy. I'm not certain how to determine what's actually the problem. I've been using Lubuntu for years, but when it comes to troubleshooting anything serious like this, I don't know where to even look, since I get a garbled screen on boot, so no text to post. I assume there's some log file on a live USB drive that would have the information, but I'm not certain where it would be, which greatly limits my ability to provide helpful feedback. I've seen people mention dmesg and other logs of some sort. But I really am a noob in this case, trying to work through things way out of my depth, and learn as much as I can in the process.
Comment 35 Adam Williamson 2014-10-24 17:51:53 UTC
If it crashes during early boot there's no magic log file storage, unfortunately, no. We mostly debug early kernel crashes based on screen photos / videos :/

The particular bug covered in this report does seem to be pretty solidly fixed; what you're hitting is likely something else. But it does sound like it might be a bit tricky to figure out what. Try editing the kernel command line and removing 'rhgb quiet' and see if you get more messages that way (that's for Fedlet).
Comment 36 StridAst 2014-10-25 01:31:23 UTC
Interesting, thanks for the reply. Now I am perplexed. After 11 unsuccessful boot attempts on Fedlet, I now have 3 successful boots in a row with 'quiet rhgb' deleted. It has booted to the desktop and showed a kernel problem message on bootup (I clicked to report it; I don't know if that ever helps). It's acting laggy and the touchscreen isn't working, but those seem like just the usual expected hardware issues from the many slightly different Bay Trail tablets. Sorry to waste people's time on this, and thank you :D I'll post a couple of questions on your Fedlet site, though, once I figure out what to ask while I tinker around with it. Still no luck on any Ubuntu variant of 14.10.