Bug 69831 - 3.14 Fedora Rawhide kernels crash with a divide error in intel_pstate on Bay Trail-m (Dell Venue 8 Pro)
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: cpufreq
Hardware: i386 Linux
Importance: P1 blocking
Assignee: Dirk Brandewie
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-02 21:26 UTC by Adam Williamson
Modified: 2015-07-21 18:48 UTC

See Also:
Kernel Version: 3.14
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
photo of my divide error (696.75 KB, image/jpeg)
2014-02-07 08:32 UTC, Adam Williamson
Stripped DV8P kernel config (80.91 KB, application/octet-stream)
2014-02-07 12:51 UTC, Jan-Michael Brummer
config file being used by Intel guys who do *NOT* hit the pstate divide error (80.91 KB, text/plain)
2014-02-07 21:37 UTC, Adam Williamson

Description Adam Williamson 2014-02-02 21:26:41 UTC
I've been fiddling with running Fedora on the Dell Venue 8 Pro - a Bay Trail-m based tablet device - for some time. We've made some progress with 3.13 kernels; see, for example, https://bugzilla.kernel.org/show_bug.cgi?id=67861 , https://bugzilla.kernel.org/show_bug.cgi?id=67911 and https://bugzilla.kernel.org/show_bug.cgi?id=65841 .

Things seem to have regressed with 3.14 kernels, though. I've built three images with a 3.14 kernel so far - using Fedora Rawhide's kernel, and applying three small vlv-specific patches: simplify-efi-initialization and allow-mapping-bgrt-on-x86-32 from https://bugzilla.kernel.org/show_bug.cgi?id=67911 , and a patch that reverts commit 6c4a8962a4a078cacfc8eb5d4bd79f6343b8cd7a (see https://bugs.freedesktop.org/show_bug.cgi?id=71977#c18 ).

With each of these images, the boot process reaches grub fine, but when proceeding from there, just hangs at "Booting a command list", which is a grub message. I don't see anything at all from the kernel - it just gets stuck, apparently indefinitely, right there.

Jan-Michael Brummer confirms seeing the same thing.

Note that these systems are somewhat notorious for having 32-bit UEFI firmware. I'm doing a 32-bit UEFI boot on them, which is an unusual codepath. But it's known to work (at least, work better than this) in 3.11, 3.12 and 3.13.
Comment 1 Adam Williamson 2014-02-02 21:29:20 UTC
My current kernel build is kernel-3.14.0-0.rc0.git19.1.1.fc21 - kernel-3.14.0-0.rc0.git19.1 is the current rawhide build, the extra '.1' denotes that this is my side build with the three vlv patches added.
Comment 2 Matt Fleming 2014-02-03 11:34:42 UTC
Adam, presumably this is a change that went in during the v3.14 merge window?

OK, first things first - can you try booting with efi=old_map on the kernel command line? This should be the default for x86-32 anyway, but it's possible that some of the logic is incorrect.

There were only two major changes in this EFI merge window, the 1:1 virtual mapping changes (disabled with efi=old_map) and support for kexec on EFI.
Comment 3 Adam Williamson 2014-02-03 23:38:17 UTC
Aha - this actually doesn't seem to be Bay Trail specific. One of our releng guys was saying he's seeing the same on his UEFI system with current Rawhide kernels, and indeed I just grabbed a Rawhide nightly, fired it up in a UEFI VM, and hit the same thing three times in a row.

efi=old_map doesn't seem to help.

so, I guess we're in the kexec-on-EFI stuff?
Comment 4 Adam Williamson 2014-02-03 23:50:02 UTC
...and yeesh, even a BIOS VM seems to hang on boot with the latest Rawhide nightly. Clearly we have a major snafu somewhere, but it may be grub or dracut or something rather than the kernel; it doesn't seem to be Bay Trail or UEFI specific, and it may be a Fedora thing. Maybe put this on hold for now. I'm going to file a Fedora bug, since apparently there isn't one yet.
Comment 5 Adam Williamson 2014-02-04 00:16:52 UTC
http://happyassassin.net/temp/314_early_trace.png is the stack trace I got from booting with earlyprintk=vga in a BIOS KVM. Josh Boyer thinks he might have something to fix it that he's going to dig up for us later.
Comment 6 Matt Fleming 2014-02-04 13:29:20 UTC
Yeah, the call stack looks like the stack protector code is firing. commit a0acda917284 ("acpi, numa, mem_hotplug: mark all nodes the kernel resides un-hotpluggable") looks like a potential culprit? (Of course you could always just disable CONFIG_CC_STACKPROTECTOR to be sure).

It would be good to confirm that this is in fact the same issue you're hitting with the UEFI systems by capturing a stack trace, possibly via earlyprintk=efi.
Comment 7 Adam Williamson 2014-02-04 16:31:11 UTC
jwb dug up some patches from LKML discussion last week - davej ran into something similar but couldn't quite nail it down. I'm going to build a baytrail kernel with those patches in today and check it boots on the v8p.
Comment 8 Jan-Michael Brummer 2014-02-04 17:14:02 UTC
I've created an earlyprintk output and uploaded the photo to:
http://de.tabos.org/temp/fedlet-314rc1-1.jpg
Comment 10 Adam Williamson 2014-02-06 17:16:52 UTC
Jan-Michael and Thomas report that https://lkml.org/lkml/2014/1/23/190 fixes this. I'll try and test/confirm here today.
Comment 11 Adam Williamson 2014-02-07 08:31:57 UTC
I'm still hitting divide errors with that patch applied, but the intel folks say it works for them. I'm confused. I've tested multiple times, with various combinations of the 'add baytrail' and 'fallback to normal' patches (both Mika's initial version and Thomas' suggested improvement), and just the 'add baytrail' patch. No dice. Divide error, every damn time.

oh, but hum, now I look at -3.jpg, my divide error is somewhere else...I'll attach the photo.
Comment 12 Adam Williamson 2014-02-07 08:32:15 UTC
Created attachment 125001
photo of my divide error
Comment 13 Adam Williamson 2014-02-07 08:43:09 UTC
OK, so intel_pstate=disable gets me past my problem - definitely something exploding in pstate too. jan-michael, you don't see that? odd.
Comment 14 Adam Williamson 2014-02-07 09:26:03 UTC
Since I think all the other various bugs discussed here are being handled elsewhere, let's make this the bug for my intel_pstate problem - the one in the photo in c#12.

Jan-Michael says he doesn't hit this crash in a vanilla 3.14-rc1 kernel that's stripped down for v8p (he may attach the kernel config he's using). But I don't see anything in Fedora Rawhide's current patch set or kernel spec that touches the intel_pstate driver in any way, nor do I see that anything's changed in intel_pstate in upstream git since rc1 (Fedora's kernels are bumped to new git revs between RCs).

Just to be clear: with kernel-3.14.0-0.rc1.git2.1.2.fc21.i686, which is Fedora's kernel-3.14.0-0.rc1.git2.1 with the following patchset:

# v8p
# https://bugzilla.kernel.org/show_bug.cgi?id=67911
Patch26001: simplify-efi-initialization.patch
Patch26002: allow-mapping-bgrt-on-x86-32.patch

# Reverts "drm/i915/vlv: re-enable hotplug detect based probing on VLV/BYT"
# see https://bugs.freedesktop.org/show_bug.cgi?id=71977
Patch26004: vlv-hotplug-revert.patch

# fix tsc calibration, otherwise kernel hangs in early init
# see https://bugzilla.kernel.org/show_bug.cgi?id=69831
# and https://lkml.org/lkml/2014/1/23/190
Patch26005: tsc-fallback-normal.patch
Patch26006: tsc-add-baytrail.patch

every boot on the Venue 8 Pro for me hits the intel_pstate crash shown in c#12. Booting with intel_pstate=disable gives me a successful boot.
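
For anyone wanting to check which of the workaround parameters mentioned in this bug (intel_pstate=disable, efi=old_map, earlyprintk=...) is actually present on a kernel command line, here is a minimal sketch that is not part of the original report. The function names are made up for illustration, the sample command line (including the BOOT_IMAGE path) is hypothetical, and the parser assumes plain whitespace-separated parameters with no quoted spaces.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Bare flags (no '=') map to None. Assumption: parameters are
    whitespace-separated and contain no quoted spaces.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def pstate_disabled(text):
    """Return True if intel_pstate=disable is on the command line."""
    return parse_cmdline(text).get("intel_pstate") == "disable"


# Hypothetical command line; on a running system the real one can be
# read from /proc/cmdline.
example = "BOOT_IMAGE=/vmlinuz ro earlyprintk=vga intel_pstate=disable quiet"
print(pstate_disabled(example))
```

On a booted system the same functions could be fed the contents of /proc/cmdline instead of the sample string.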
Comment 15 Jan-Michael Brummer 2014-02-07 12:51:23 UTC
Created attachment 125101
Stripped DV8P kernel config

As requested, I'm attaching the kernel config we are using here. It has some additional modules for our USB + gigabit Ethernet hub and some Atheros wifi dongles.
Comment 16 Adam Williamson 2014-02-07 21:37:11 UTC
Created attachment 125181
config file being used by Intel guys who do *NOT* hit the pstate divide error

For reference, this is the kernel config the Intel guys working on the V8P are using. They say with this config they don't hit the pstate divide error - i.e. they can boot with the patchset I listed, but with no intel_pstate=disable needed.

The config I'm using is the stock Fedora Rawhide kernel config, I'm not modifying it.
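
A minimal sketch (not part of the original report) of how two kernel configs, such as the stock Fedora Rawhide one and the stripped one attached here, could be compared option by option. The function names and the sample file contents are invented for illustration. The only assumption about the input is the usual .config layout of `CONFIG_FOO=value` lines and `# CONFIG_FOO is not set` comments.

```python
import re

# Matches the comment form the kernel uses for disabled options.
NOT_SET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value} for a kernel .config; unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = NOT_SET.match(line)
        if match:
            options[match.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def diff_configs(a_text, b_text):
    """Return {option: (value_in_a, value_in_b)} for options that differ.

    Options missing from one file are reported as None on that side.
    """
    a, b = parse_config(a_text), parse_config(b_text)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Made-up sample contents, not taken from the attached configs.
fedora = "CONFIG_X86_INTEL_PSTATE=y\nCONFIG_DEBUG_KERNEL=y\n"
stripped = "CONFIG_X86_INTEL_PSTATE=y\n# CONFIG_DEBUG_KERNEL is not set\n"
print(diff_configs(fedora, stripped))
```

Reading the two real config files from disk and passing their contents to diff_configs would list every option whose value differs between the two builds.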
Comment 17 Adam Williamson 2014-02-07 21:38:59 UTC
oh, damn, sorry, just noticed I duped jan's post. sorry.
Comment 18 Adam Williamson 2014-02-08 05:48:35 UTC
Ah, so the difference is debugging. If I disable debugging in Fedora's kernel, I can boot with pstate enabled.
Comment 19 Len Brown 2014-02-11 00:54:55 UTC
Adam, so is it still a regression -- or did the debug vs non-debug difference also apply to 3.13?
Comment 20 Rafael J. Wysocki 2014-02-11 01:01:38 UTC
I'm wondering if this helps by chance:

https://patchwork.kernel.org/patch/3612801/
Comment 21 Adam Williamson 2014-02-11 01:16:14 UTC
len: I would've been booting debug kernels all along, so this is still likely a regression.
Comment 22 Adam Williamson 2014-02-11 01:16:54 UTC
rafael: sounds plausible, I'll try and test (once I'm done tracing out a dracut issue the fedlet found: this thing is an *awesome* bug magnet. :>)
Comment 23 Adam Williamson 2014-02-12 00:36:50 UTC
For the record, after doing an install to my fedlet's internal storage, I was actually hitting a similar (but not the same) crash on boot even with a non-debug kernel. I've just built a debug kernel with the patch from c#20, and I'll see how that goes shortly.
Comment 24 Adam Williamson 2014-02-12 07:00:06 UTC
The patch from c#20 does seem to help. A kernel with debugging enabled and that patch built in boots successfully without intel_pstate=disable both in a live image and in my installed system.
Comment 25 Dirk Brandewie 2014-02-12 18:18:02 UTC
I have sent a patch set to the linux-pm list that contains Baytrail updates and removes power reporting so the patch from comment #20 will no longer be needed.
Comment 26 Adam Williamson 2014-02-14 07:17:44 UTC
I've built a kernel with patch 1/5 from your new series to check that also does the trick, but didn't get enough round tuits to test today.
Comment 27 Adam Williamson 2014-02-20 01:41:01 UTC
Looks like it already got upstreamed, but just to confirm, that patch does indeed seem to solve the problem. I believe there's nothing left requiring this to be open. Thanks.
Comment 28 Len Brown 2015-07-21 18:48:14 UTC
shipped in Linux-3.14-rc3:

commit 709c078e176bd47227e89bb34de7c64b57aaaeab
Author: Dirk Brandewie <dirk.j.brandewie@intel.com>
Date:   Wed Feb 12 10:01:03 2014 -0800

    intel_pstate: Remove energy reporting from pstate_sample tracepoint
