Bug 16388 - i915 drm BUG: unable to handle kernel paging request at a5e89046
Summary: i915 drm BUG: unable to handle kernel paging request at a5e89046
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: All Linux
Importance: P1 high
Assignee: drivers_video-dri-intel@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks: 15310
Reported: 2010-07-14 16:59 UTC by lists
Modified: 2010-10-28 23:26 UTC

See Also:
Kernel Version: 2.6.34.1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Config per jbarnes (109.50 KB, text/plain)
2010-07-26 14:48 UTC, lists

Description lists 2010-07-14 16:59:13 UTC
Hibernate / Thaw: the screen is black with artifacts on thaw. Sometimes an ABRT
report is shown after a forced reboot.  Forcing a sync, crash, or reboot with
SysRq does not work.  The machine is not pingable.

Version-Release number of selected component (if applicable):
2.6.34.1-9

How reproducible:
All the time

Steps to Reproduce:
1. Hibernate
2. Thaw

Actual results:

Machine hang.  

Expected results:

Normal thaw
Additional info:

Smolt: http://www.smolts.org/client/show/pub_98f6cfac-8cad-4a3d-a099-e2e2854e64c0 

xorg.conf is not present
BUG: unable to handle kernel paging request at a5e89046
IP: [<f7f8ec27>] drm_mode_getconnector+0x295/0x2b9 [drm]
*pdpt = 0000000036091001 *pde = 0000000000000000 
Oops: 0002 [#1] SMP 
last sysfs file:
/sys/devices/pci0000:00/0000:00:1c.0/0000:0b:00.0/ssb0:0/ieee80211/phy0/rfkill1/uevent
Modules linked in: aes_i586 aes_generic coretemp ipv6 cpufreq_ondemand
acpi_cpufreq fuse uinput arc4 snd_hda_codec_idt ecb snd_hda_intel snd_hda_codec
b43 snd_hwdep snd_seq snd_seq_device snd_pcm mac80211 snd_timer cfg80211 snd
b44 ssb dell_laptop soundcore dell_wmi i2c_i801 iTCO_wdt snd_page_alloc
iTCO_vendor_support rfkill wmi mii sdhci_pci sdhci mmc_core joydev microcode
dcdbas firewire_ohci firewire_core crc_itu_t i915 drm_kms_helper drm
i2c_algo_bit i2c_core video output [last unloaded: kvm]
Pid: 1516, comm: Xorg Not tainted 2.6.34.1-9.fc13.i686.PAE #1 0KD882/MM061      
EIP: 0060:[<f7f8ec27>] EFLAGS: 00013293 CPU: 1
EIP is at drm_mode_getconnector+0x295/0x2b9 [drm]
EAX: f36d313b EBX: 00000001 ECX: 00000003 EDX: f36d3e7c
ESI: f69d4000 EDI: f8032b74 EBP: f36d3e60 ESP: f36d3dec
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process Xorg (pid: 1516, ti=f36d2000 task=f3df5940 task.ti=f36d2000)
Stack:
000000d0 f69d7688 000a3e0c f69d4154 0000033b 00000003 00000001 f36d3e7c
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace:
[<f7f85ba3>] ? drm_ioctl+0x23c/0x31d [drm]
[<f7f8e992>] ? drm_mode_getconnector+0x0/0x2b9 [drm]
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c057f0e8>] ? file_has_perm+0x8c/0xa6
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c04e485d>] ? vfs_ioctl+0x2c/0x96
[<f7f85967>] ? drm_ioctl+0x0/0x31d [drm]
[<c04e4df3>] ? do_vfs_ioctl+0x488/0x4c6
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c057f38c>] ? selinux_file_ioctl+0x43/0x46
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c04e4e77>] ? sys_ioctl+0x46/0x66
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c0408cdf>] ? sysenter_do_call+0x12/0x28
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
Code: 00 74 17 e8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 28 eb 05 bf f2 ff ff <ff> 8b 45 90 e8 a5 1f
81 c8 89 f8 8b 55 f0 65 33 15 14 00 00 00 
EIP: [<f7f8ec27>] drm_mode_getconnector+0x295/0x2b9 [drm] SS:ESP 0068:f36d3dec
CR2: 00000000a5e89046
Comment 1 Andrew Morton 2010-07-21 21:22:58 UTC
Thanks.  Was 2.6.33 OK?   2.6.32?
Comment 2 lists 2010-07-22 02:51:58 UTC
Here's the honest answer. Short version: I only ran 2.6.33 for a couple of days after the fix for KBZ 13811 was out, and I did see it there.  It's really hard to tell because KBZ 13811 (which really was a regression but was neither marked nor treated like one) masked this problem. Before that I was running a 2.6.32.8 from F11 on F12, because that one particular kernel got me ~5 working hibernate/thaw cycles, and I didn't notice THIS issue.

Long version (sorry if it is ranty, I know it's not your fault!):
Before I upgraded to F13 and the 2.6.33 series, I applied the patch for KBZ 13811 to 2.6.32.14 and 2.6.32.16 and did see the problem there.  I had, however, been running 2.6.32.8 for the previous 6 months and did not notice it. But 13811 would usually strike first, and that 2.6.32.8 would usually give me only ~5 hibernate/thaw cycles before dying, so I can't really say for sure. Ever since Fedora put/required KMS into Fedora 11, the kernel has been regressed, since at least 2.6.29.

I have been using in-kernel hibernate/thaw (pmdisk/swsusp) since, I think, about 2.6.9 or 2.6.10, around the time that the pmdisk and swsusp "stuff" was big. I used to build my own kernels to configure it in, as Fedora's kernels at the time didn't include it, and generally (with a few hiccups here and there) it worked until the 2.6.29.4 that Fedora shipped in F-11.

The last kernel that just plain worked was 2.6.27.44, as shipped in the last update of Fedora 10.  The entire KMS/GEM project has been, at least for me, nothing but a regression: the mode-switch blink when switching to X didn't bother me, and I've lost the ability to hibernate / thaw my laptop reliably for the past year plus.

KBZ 13811, regardless of how it was marked, was a regression: before KMS as Fedora had it in 2.6.29, swsusp worked; after it, it didn't.

This may be more of a new bug in KMS/GEM, but the overall impact is a regression.

I'd try anything from 2.6.35, but does that work with the libdrm/mesa/xorg-intel driver that Fedora is shipping for F13?  That's rhetorical, but it reflects the uncertainty and doubt about whether or not you can use Fedora as a base and have a workable system, at least with Intel graphics.

Thanks!
Comment 3 Rafael J. Wysocki 2010-07-23 19:59:03 UTC
On Friday, July 23, 2010, Jesse Barnes wrote:
> On Fri, 23 Jul 2010 14:15:55 +0200 (CEST)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> 
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=16388
> > Subject             : i915 drm BUG: unable to handle kernel paging request at a5e89046
> > Submitter   :  <lists@clanduggan.org>
> > Date                : 2010-07-14 16:59 (10 days old)
> 
> Looks like some potential memory corruption?  At resume we try to get
> connector info but panic due to a bad pointer, maybe in one of the
> lists.  Can you gdb your drm_kms_helper module and do "list
> *drm_mode_getconnector+0x295" to see what line this is?
> 
> Also, what chipset do you have?  Maybe I can reproduce it here with
> your kernel config.
Comment 4 Chuck Ebbert 2010-07-23 21:03:02 UTC
This is probably due to the i915 hibernation memory corruption bug, and should be fixed by:

  commit 985b823b919273fe1327d56d2196b4f92e5d0fae
  drm/i915: fix hibernation since i915 self-reclaim fixes

  commit cd9f040df6ce46573760a507cb88192d05d27d86
  drm/i915: add 'reclaimable' to i915 self-reclaimable page allocations

And yes, those are in Fedora now.
Comment 5 Chuck Ebbert 2010-07-23 21:13:48 UTC
And it looks like those two are needed in 2.6.32-stable, since the patch that caused the bug went in 2.6.32.8 as drm-i915-selectively-enable-self-reclaim.patch
Comment 6 lists 2010-07-26 14:48:35 UTC
Created attachment 27261
Config per jbarnes

Sorry for the delay.

On Fri, 23 Jul 2010 10:37:12 -0700, Jesse Barnes <jbarnes@virtuousgeek.org>
wrote:
> 
> Looks like some potential memory corruption?  At resume we try to get
> connector info but panic due to a bad pointer, maybe in one of the
> lists.  Can you gdb your drm_kms_helper module and do "list
> *drm_mode_getconnector+0x295" to see what line this is?
> 

(gdb) list *drm_mode_getconnector+0x295
0x20f3 is in drm_mode_getconnector (drivers/gpu/drm/drm_crtc.c:1417).
1412					}
1413					copied++;
1414				}
1415			}
1416		}
1417		out_resp->count_encoders = encoders_count;
1418	
1419	out:
1420		mutex_unlock(&dev->mode_config.mutex);
1421		return ret;
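
A quick cross-check of the symbol+offset arithmetic, as a sketch only. The addresses are copied from the oops and the gdb output above. It assumes the gdb session was run against the drm module (the oops tags the symbol with [drm], and gdb resolved it in drm_crtc.c). The helper name symbol_base is hypothetical.

```python
# Addresses copied from this report; nothing here is measured independently.
FAULT_EIP = 0xF7F8EC27    # EIP: [<f7f8ec27>] drm_mode_getconnector+0x295/0x2b9
OFFSET = 0x295            # offset into the function, from the oops
GDB_ADDR = 0x20F3         # "0x20f3 is in drm_mode_getconnector" (gdb output)
TRACE_START = 0xF7F8E992  # drm_mode_getconnector+0x0/0x2b9 in the call trace


def symbol_base(addr: int, offset: int) -> int:
    """Hypothetical helper: start address of a symbol given addr = base + offset."""
    return addr - offset


runtime_base = symbol_base(FAULT_EIP, OFFSET)  # function start in the running kernel
module_base = symbol_base(GDB_ADDR, OFFSET)    # function start as gdb sees the module

# The runtime start derived from EIP should match the +0x0 entry in the call trace.
print(hex(runtime_base), runtime_base == TRACE_START)
print(hex(module_base))
```

If the two starts agree with the call trace entry, the gdb listing and the faulting EIP refer to the same instruction, which is what makes line 1417 a meaningful answer.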

> Also, what chipset do you have?  Maybe I can reproduce it here with
> your kernel config.

0:00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller])
        Subsystem: Dell Device 01bd
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at eff00000 (32-bit, non-prefetchable) [size=512K]
        Region 1: I/O ports at eff8 [size=8]
        Region 2: Memory at d0000000 (32-bit, prefetchable) [size=256M]
        Region 3: Memory at efec0000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
                Address: 00000000  Data: 0000
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: i915
        Kernel modules: i915


0000:00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)
        Subsystem: Dell Device 01bd
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Region 0: Memory at eff80000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Comment 7 Chris Wilson 2010-08-06 10:26:50 UTC
Weird death following resume. It's either the write to a bit of memory we have just allocated for the ioctl, or the connector is corrupt.

Definitely fits the pattern for the i915 hibernation bug.
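
For reference, the "Oops: 0002" line in the description can be decoded with the standard x86 page-fault error-code bits. This is a minimal sketch, not the kernel's own decoder; the table and the function name decode_pf_error are illustrative.

```python
# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code: int) -> list:
    """Return a human-readable description of each relevant error-code bit."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            out.append(if_set)
        elif if_clear is not None:
            out.append(if_clear)
    return out


# "Oops: 0002" from the report, parsed as hexadecimal.
print(decode_pf_error(int("0002", 16)))
# -> ['page not present', 'write access', 'kernel mode']
```

Under that reading, the fault was a kernel-mode write to an unmapped address (CR2 = a5e89046, with *pde = 0), which is consistent with the "write to a bit of memory" possibility mentioned above but does not by itself rule out a corrupt connector.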
