Bug 72351 - [GM45] oops in iowrite32 after long hibernation
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: x86-64 Linux
Importance: P3 normal
Assignee: Chris Wilson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-17 20:52 UTC by Jae-hyeon Park
Modified: 2014-09-26 12:50 UTC

See Also:
Kernel Version: 3.10.0-rc7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Kernel oops in syslog (4.41 KB, text/plain)
2014-03-17 20:52 UTC, Jae-hyeon Park
Xorg.0.log (25.73 KB, text/plain)
2014-03-17 20:54 UTC, Jae-hyeon Park
Bisection log (2.79 KB, text/plain)
2014-03-17 20:55 UTC, Jae-hyeon Park
*ERROR* and kernel oops in syslog (11.06 KB, text/plain)
2014-03-27 16:06 UTC, Jae-hyeon Park
dmesg with external TV connected after X startup (103.05 KB, application/gzip)
2014-04-05 17:47 UTC, Jae-hyeon Park
Xorg.0.log with external TV connected after X startup (25.79 KB, text/plain)
2014-04-05 17:48 UTC, Jae-hyeon Park

Description Jae-hyeon Park 2014-03-17 20:52:41 UTC
Created attachment 129791 [details]
Kernel oops in syslog

I am experiencing a regression that reveals itself after resume from a long hibernation.  The symptom is that the X11 display freezes after the kernel emits an oops.

This seems to depend on the video chipset.  A paging request failure occurs on a ThinkPad X200 Tablet with the Intel GM45 chipset, whereas there is no problem on a ThinkPad X220 Tablet with Intel HD Graphics 3000.

This problem does not occur if the hibernation is short.  I can reproduce the error reliably if the hibernation lasts for several hours.

I use the compositing window manager, compiz 0.8.8.

Since this problem depends on the kernel version, I performed a bisection.  The first bad commit is:

commit 17fec8a08698bcab98788e1e89f5b8e7502ababd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 4 00:23:33 2013 +0100

    drm/i915: Use Graphics Base of Stolen Memory on all gen3+
    
    So I made the mistake of missing that the desktop and mobile chipsets
    have different layouts in their PCI configurations, and we were
    incorrectly setting the wrong physical address for stolen memory on
    mobile chipsets.
    
    Since all gen3+ are actually consistent in the location of the GBSM
    register in the PCI configuration space on device 2 (the GPU), use it.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    [danvet: Drop cc: stable and fudge conflicts.]
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>


I am attaching the kernel oops from syslog, Xorg.0.log, and the bisection log.
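
For readers unfamiliar with the commit above, here is a minimal sketch of what "use the GBSM register in PCI configuration space" could look like. It works on a raw copy of the GPU's config space rather than on live hardware. The offset 0x5c and the 1 MiB alignment mask are assumptions recalled from the i915 driver of that period, not facts stated in this report, and both helper names are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: GBSM is the dword at offset 0x5c of the GPU's PCI config
 * space, and the stolen-memory base is 1 MiB aligned. Neither value is
 * given in this bug report. */
#define GBSM_OFFSET_ASSUMED 0x5cu
#define STOLEN_ALIGN        (1u << 20)

/* Read a little-endian dword from a raw copy of PCI config space. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned int off)
{
	return (uint32_t)cfg[off] |
	       ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) |
	       ((uint32_t)cfg[off + 3] << 24);
}

/* Hypothetical helper: drop the low bits below the assumed alignment
 * to obtain the physical base of stolen memory. */
static uint32_t stolen_base_from_config(const uint8_t *cfg)
{
	uint32_t raw = cfg_read32(cfg, GBSM_OFFSET_ASSUMED);

	return raw & ~(STOLEN_ALIGN - 1u);
}
```

On a running system the config bytes could, for example, be read from the sysfs `config` file of PCI device 00:02.0. Comparing the resulting base on good and bad kernels might show whether the two kernels disagree about where stolen memory lives.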
Comment 1 Jae-hyeon Park 2014-03-17 20:54:52 UTC
Created attachment 129801 [details]
Xorg.0.log
Comment 2 Jae-hyeon Park 2014-03-17 20:55:35 UTC
Created attachment 129811 [details]
Bisection log
Comment 3 Chris Wilson 2014-03-17 21:13:19 UTC
Tainted. By what?
Comment 4 Jae-hyeon Park 2014-03-17 21:42:41 UTC
Sorry, I forgot that I was using modules from https://github.com/evgeni/tp_smapi .  They are hdaps.ko, thinkpad_ec.ko, tp_smapi.ko.  I am going to retest this without those modules, if you want.
Comment 5 Chris Wilson 2014-03-18 07:46:00 UTC
I was more concerned that this may not have been the first warning. Can you please run 'addr2line -i -e </path/to/i915.ko> 0xffffffffa038a1c4' or perhaps better gdb </path/to/i915.ko> ; list *intel_gen4_queue_flip+0xc4
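
A note on why the gdb form is likely the more convenient one: the absolute address is where the module happened to be loaded at run time, while `intel_gen4_queue_flip+0xc4` is relative to the symbol and can be looked up in the `.ko` file directly. The sketch below only splits such a reference into its two parts; the function name `split_symbol_offset` is hypothetical and the code is not part of any kernel tool.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split "symbol+0xoff" (optionally followed by
 * "/0xsize", as in oops backtraces) into the symbol name and the offset.
 * Returns 0 on success, -1 if the input does not have that shape. */
static int split_symbol_offset(const char *ref, char *sym, size_t sym_len,
			       unsigned long *offset)
{
	const char *plus = strchr(ref, '+');
	char *end;
	size_t n;

	if (plus == NULL || plus == ref)
		return -1;

	n = (size_t)(plus - ref);
	if (n >= sym_len)
		return -1;
	memcpy(sym, ref, n);
	sym[n] = '\0';

	/* Base 16 accepts an optional "0x" prefix and stops at '/'. */
	*offset = strtoul(plus + 1, &end, 16);
	if (end == plus + 1)
		return -1;
	return 0;
}
```

The symbol and offset obtained this way are the two values that go into `list *symbol+offset` in gdb.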
Comment 6 Jae-hyeon Park 2014-03-18 20:02:35 UTC
I realized that my ebuild script was stripping the kernel modules before installing them, so I recompiled the kernel without stripping the modules.  I also enabled the CONFIG_DEBUG_INFO option.  However, the system then locked up hard after resume, so I could not get dmesg or syslog.  Therefore, I just tried the gdb command, hoping that the offset is the same regardless of whether the modules are stripped or CONFIG_DEBUG_INFO is enabled.  The result is:

# gdb /lib/modules/3.10.0-1+/kernel/drivers/gpu/drm/i915/i915.ko
GNU gdb (Gentoo 7.6.2 p1) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.gentoo.org/>...
Reading symbols from /lib64/modules/3.10.0-1+/kernel/drivers/gpu/drm/i915/i915.ko...done.
(gdb) list *intel_gen4_queue_flip+0xc4
0x2d1b4 is in intel_gen4_queue_flip (/var/tmp/portage/sys-kernel/bisect-3.99.99/work/linux-3.99.99/drivers/gpu/drm/i915/intel_ringbuffer.h:233).
228	/var/tmp/portage/sys-kernel/bisect-3.99.99/work/linux-3.99.99/drivers/gpu/drm/i915/intel_ringbuffer.h: No such file or directory.

Line intel_ringbuffer.h:233 is in the function:

    229 static inline void intel_ring_emit(struct intel_ring_buffer *ring,
    230                                    u32 data)
    231 {
    232         iowrite32(data, ring->virtual_start + ring->tail);
    233         ring->tail += 4;
    234 }

BTW, the kernel still crashed without the tp_smapi modules loaded.
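
The inline function quoted above writes through `ring->virtual_start` without any checking, so one plausible reading of the oops is that the ring mapping was not usable when the flip was queued. The following is a userspace model of that write path with explicit checks added. It is only a sketch: `struct ring_model` and `ring_model_emit` are hypothetical names, and plain memory stands in for the I/O mapping that `iowrite32` targets in the driver.

```c
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the ring state used by intel_ring_emit(). */
struct ring_model {
	uint8_t *virtual_start; /* CPU mapping of the ring, NULL if unmapped */
	uint32_t tail;          /* byte offset of the next dword to write */
	uint32_t size;          /* ring size in bytes */
};

/* Store one dword at the tail and advance it, like intel_ring_emit(),
 * but refuse to write when the mapping is missing or the dword would
 * not fit. Returns 0 on success, -1 otherwise. */
static int ring_model_emit(struct ring_model *ring, uint32_t data)
{
	if (ring->virtual_start == NULL || ring->size < 4 ||
	    ring->tail > ring->size - 4)
		return -1;

	memcpy(ring->virtual_start + ring->tail, &data, sizeof(data));
	ring->tail += 4;
	return 0;
}
```

In the real driver no such check exists on this path, which is consistent with a bad mapping surfacing as a paging request failure inside `iowrite32`.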
Comment 7 Daniel Vetter 2014-03-27 09:06:51 UTC
Chris, another candidate for your ring init rework patches?
Comment 8 Chris Wilson 2014-03-27 09:39:42 UTC
Maybe, but he didn't say that there were any error messages upon resume. You would have thought he would have noticed the *ERROR* first.
Comment 9 Jae-hyeon Park 2014-03-27 16:06:24 UTC
Created attachment 130801 [details]
*ERROR* and kernel oops in syslog
Comment 10 Jae-hyeon Park 2014-03-27 16:09:30 UTC
Sorry, I missed the *ERROR*:

[drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 0000c82c tail 00000000 start 00003000
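
For anyone trying to read the register values in that message, here is a small decoding sketch. The bit masks (`RING_VALID`, `RING_NR_PAGES`, `HEAD_ADDR`) and the three conditions checked are assumptions recalled from the i915 sources of that era, not something stated in this report, and the expected start address is not logged, so the caller has to supply it.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED register layout (recalled from i915_reg.h, not from this bug). */
#define RING_VALID     0x00000001u
#define RING_NR_PAGES  0x001FF000u
#define HEAD_ADDR      0x001FFFFCu
#define RING_PAGE_SIZE 4096u

/* The four values printed by the *ERROR* line. */
struct ring_regs {
	uint32_t ctl, head, tail, start;
};

static bool ring_ctl_valid(const struct ring_regs *r)
{
	return (r->ctl & RING_VALID) != 0;
}

/* After a successful init the head address bits should read back as 0. */
static bool ring_head_reset(const struct ring_regs *r)
{
	return (r->head & HEAD_ADDR) == 0;
}

/* CTL holds the ring length as (size - one page) in its page field. */
static uint32_t ring_size_bytes(const struct ring_regs *r)
{
	return (r->ctl & RING_NR_PAGES) + RING_PAGE_SIZE;
}

/* Hypothetical overall check: all three conditions must hold. */
static bool ring_init_ok(const struct ring_regs *r, uint32_t expected_start)
{
	return ring_ctl_valid(r) && r->start == expected_start &&
	       ring_head_reset(r);
}
```

Under these assumptions, the logged values (ctl 0001f001, head 0000c82c) describe a ring that is marked valid with a 128 KiB length but whose head pointer did not return to zero, which would be enough to make the check fail.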
Comment 11 Daniel Vetter 2014-03-27 16:24:39 UTC
Definitely one for Chris' patches.
Comment 12 Chris Wilson 2014-03-31 07:53:24 UTC
I've rebased the patches against drm-intel-nightly, so they should be easier to apply:

http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug76554
Comment 13 Jae-hyeon Park 2014-04-05 17:43:40 UTC
I tried the bug76554 branch with head cfa8aaa35f180268c99e72964228c944930af680 by (shallow-)cloning the git repo.  Now, the long hibernation issue seems to be gone.  Thank you.

However, I hit "*ERROR* render ring initialization failed" under a different condition.  Possibly because of this, compiz (or sometimes the X server) crashes.  On the plus side, this takes much less time to reproduce.  The steps to trigger the error are:

0. turn off computer
1. disconnect external display from the VGA port
2. turn on the ThinkPad X200 Tablet and wait until the X server starts up
3. connect an LCD TV to the VGA port
4. log in (compiz then starts up)
5. hibernate
6. resume

I attach dmesg with drm.debug=7 and Xorg.0.log.
Should I file a different bug?
Comment 14 Jae-hyeon Park 2014-04-05 17:47:06 UTC
Created attachment 131511 [details]
dmesg with external TV connected after X startup
Comment 15 Jae-hyeon Park 2014-04-05 17:48:02 UTC
Created attachment 131521 [details]
Xorg.0.log with external TV connected after X startup
Comment 16 Jani Nikula 2014-09-26 11:31:42 UTC
Chris, is this a dupe of https://bugs.freedesktop.org/show_bug.cgi?id=76554? Can we close this one?
Comment 17 Chris Wilson 2014-09-26 12:21:14 UTC
Close enough. The bug in the summary was a different fix.
Comment 18 Jani Nikula 2014-09-26 12:50:18 UTC
Assuming the bug in the summary is now fixed, please reopen if this is not the case. We'll track the render ring initialization issue at https://bugs.freedesktop.org/show_bug.cgi?id=76554. Thanks for the report.
