Bug 61041 - [ilk BISECTED]i915 driver triggers low memory corruption 1 minute after boot-up
Status: CLOSED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Daniel Vetter
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-09-08 21:56 UTC by Abby
Modified: 2014-03-15 02:48 UTC

See Also:
Kernel Version: 3.10.10
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
This is the dmesg output of the 3.10.10 kernel on my system. (68.87 KB, text/x-log)
2013-09-08 21:56 UTC, Abby
dmesg output of kernel 3.12 with drm.debug=0xe option enabled (119.88 KB, text/x-log)
2013-11-19 23:58 UTC, Abby

Description Abby 2013-09-08 21:56:39 UTC
Created attachment 107701
This is the dmesg output of the 3.10.10 kernel on my system.

On a system that uses the i915 display driver, the kernel reports the memory reserved for the BIOS as corrupted 1 minute after boot-up. If the i915 module is not loaded on boot-up, no corruption is detected.

This issue is not caused by faulty RAM sticks. I have run memtest86+ v4.20 on my system, and after 8 passes, no errors were reported.

I first came across this issue when I updated the kernel on my Gentoo system from 3.8.13 to 3.10.7. Since I was using gentoo-sources, I decided to try out upstream's 3.10.10 kernel as well as the latest sources straight from git, and ran into the same issue with both kernels. I will attach the output of dmesg from the 3.10.10 kernel.


Steps to Reproduce:

Using a kernel >3.8.13:

1.) Configure your kernel. Make sure that the test is done on a system that needs the i915 driver, that the i915 driver is built as a module, and that X86_CHECK_BIOS_CORRUPTION=y is set.

2.) Boot your system with the new kernel.

3.) Watch your logs (e.g. /var/log/messages, or pipe dmesg output to a log file). See if the kernel reports any low memory corruption.
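
Steps 1 and 3 above can be sketched in shell as follows. This is only a minimal sketch and not part of the original report: the function names are made up for illustration, the config file path in the usage comment is an assumption, and the grep patterns assume the detector's messages contain "Corrupted low memory" or "Memory corruption detected in low memory", so check them against your own dmesg output.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not from the kernel tree.

# Succeeds if the given plain-text kernel config enables the
# low-memory corruption detector.
has_bios_corruption_check() {
    grep -q '^CONFIG_X86_CHECK_BIOS_CORRUPTION=y$' "$1"
}

# Prints the kernel log lines on stdin that look like detector reports.
# The message text is an assumption; adjust it to what your kernel prints.
find_lowmem_corruption() {
    grep -E 'Corrupted low memory|Memory corruption detected in low memory'
}

# Example usage (on the machine under test; the path is an assumption):
#   has_bios_corruption_check /usr/src/linux/.config && echo "detector built in"
#   dmesg | find_lowmem_corruption
```

Both helpers take their input as an argument or on stdin, so they can also be run against a saved .config or an attached dmesg log instead of the live system.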


I have also done a bisect. "git bisect" suggests that the first bad commit was:

95c9608478d639dcffc14ea47b31bff021a99ed1 is the first bad commit
commit 95c9608478d639dcffc14ea47b31bff021a99ed1
Author: H. Peter Anvin <hpa@zytor.com>
Date:   Thu Feb 14 14:02:52 2013 -0800

    x86, mm: Move reserving low memory later in initialization

    Move the reservation of low memory, except for the 4K which actually does
    belong to the BIOS, later in the initialization; in particular, after we
    have already reserved the trampoline.

    The current code locates the trampoline as high as possible, so by deferring
    the allocation we will still be able to reserve as much memory as is
    possible. This allows us to run with reservelow=640k without getting a crash
    on system startup.

    Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    Link: http://lkml.kernel.org/n/tip-0y9dqmmsousf69wutxwl3kkf@git.kernel.org

:040000 040000 365acf4d8c7e201ff7674dc46f6d5ac3a8b889ae 48081dd511455dddbb93e979f4358449d2533beb M     arch


If more information is needed from me, please let me know.
Comment 1 Daniel Vetter 2013-11-13 22:32:32 UTC
This is strange. Have you confirmed the bisect by reverting the offending commit?

Also please boot with drm.debug=0xe and attach the complete dmesg so we know what kind of gfx hw you have.
Comment 2 Abby 2013-11-19 23:55:02 UTC
Daniel,

I've reverted the commit mentioned in the bisect. The memory corruption bug doesn't happen after the revert, so the bisect is confirmed.

This bug is still present in the latest git version of the kernel (currently 3.12+). I will attach the dmesg output of that kernel with the drm.debug=0xe option enabled.
Comment 3 Abby 2013-11-19 23:58:06 UTC
Created attachment 115171
dmesg output of kernel 3.12 with drm.debug=0xe option enabled
Comment 4 Daniel Vetter 2013-11-20 10:05:19 UTC
Still confused. Another trick to play is to prevent i915 from loading with i915.modeset=0. Then wait for a while to make sure that we'd catch any lowmem corruption. Then stop X and reload i915 for real with

# rmmod i915
# modprobe i915 modeset=1

The big question is whether the lowmem corruption happens anyway or whether we need to load the i915 modeset driver (which would point at either bios leftovers or a bug in our takeover sequence).
Comment 5 Abby 2013-11-21 03:14:01 UTC
After blacklisting the module using /etc/modprobe.d/blacklist.conf, setting i915.modeset=0, booting the system and waiting a while, I found that the bug didn't trigger until a few seconds after I loaded the i915 module manually using

# modprobe i915 modeset=1

If I set i915.modeset=0, but don't blacklist the module, the low memory corruption happens.
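
For anyone repeating this test, the sequence can be sketched roughly as below. This is a hedged sketch, not the exact commands used: `module_loaded` is a hypothetical helper, the 60-second wait is an assumption about how often the detector scans, and the grep pattern assumes the detector's message contains "Corrupted low memory".

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named module appears in
# lsmod-style output on stdin (module name in the first column).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Example sequence (as root), mirroring the test described above:
#   echo 'blacklist i915' >> /etc/modprobe.d/blacklist.conf   # then reboot
#   lsmod | module_loaded i915 || echo "i915 not loaded yet"
#   dmesg | grep 'Corrupted low memory'        # check before loading the module
#   modprobe i915 modeset=1
#   sleep 60                                   # assumed scan interval
#   dmesg | grep 'Corrupted low memory'        # check again after loading
```

Checking the log both before and after the modprobe is what separates "corruption happens anyway" from "corruption needs the i915 modeset driver".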
Comment 6 H. Peter Anvin 2014-03-15 02:48:37 UTC
The bisection is bogus, simply because all that commit does is make the detector actually work as advertised.  Without the patch, the detector actually misses large swaths of memory, typically at least all memory below 64K (which includes your corruption address.)

Either way, there isn't anything to actually *do* here... the BIOS clobbers memory that we normally reserve for exactly the reason that BIOS has a nasty tendency to scribble on low memory.  The only thing that is happening is that you have enabled the detector that tells you that this did indeed happen.

This is probably something you want to fix in your kernel configuration:

[    0.000000] smpboot: 8 Processors exceeds NR_CPUS limit of 4
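
One way to check for the NR_CPUS mismatch quoted above before rebooting is sketched below. It is a minimal sketch with hypothetical function names, and it assumes a plain-text kernel config file; the config path in the usage comment is an assumption too. The CPU count has to come from the boot log or the hardware specification, since a kernel capped at 4 CPUs will itself only report 4.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative.

# Prints the CONFIG_NR_CPUS value from a plain-text kernel config file.
config_nr_cpus() {
    sed -n 's/^CONFIG_NR_CPUS=\([0-9][0-9]*\)$/\1/p' "$1"
}

# Succeeds if the configured limit covers the given CPU count.
# Arguments: config file path, number of CPUs in the machine.
nr_cpus_sufficient() {
    limit=$(config_nr_cpus "$1")
    [ -n "$limit" ] && [ "$limit" -ge "$2" ]
}

# Example usage (8 is the processor count from the log line above):
#   nr_cpus_sufficient /usr/src/linux/.config 8 || echo "raise CONFIG_NR_CPUS"
```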
