Bug 16388

Summary: i915 drm BUG: unable to handle kernel paging request at a5e89046
Product: Drivers
Reporter: lists
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: CLOSED CODE_FIX
Severity: high
CC: akpm, cebbert, chris, maciej.rutecki, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.34.1
Tree: Mainline
Regression: Yes
Bug Depends on:    
Bug Blocks: 15310    
Attachments: Config per jbarnes

Description lists 2010-07-14 16:59:13 UTC
After hibernate / thaw, the screen is black with artifacts. Sometimes an ABRT
is shown after a forced reboot.  Forcing a sync, crash, or reboot with SysRq
does not work.  The machine is not pingable.

Version-Release number of selected component (if applicable):
2.6.34.1-9

How reproducible:
All the time

Steps to Reproduce:
1. Hibernate
2. Thaw

Actual results:

Machine hang.  

Expected results:

Normal thaw

Additional info:

Smolt: http://www.smolts.org/client/show/pub_98f6cfac-8cad-4a3d-a099-e2e2854e64c0 

xorg.conf is not present
BUG: unable to handle kernel paging request at a5e89046
IP: [<f7f8ec27>] drm_mode_getconnector+0x295/0x2b9 [drm]
*pdpt = 0000000036091001 *pde = 0000000000000000 
Oops: 0002 [#1] SMP 
last sysfs file:
/sys/devices/pci0000:00/0000:00:1c.0/0000:0b:00.0/ssb0:0/ieee80211/phy0/rfkill1/uevent
Modules linked in: aes_i586 aes_generic coretemp ipv6 cpufreq_ondemand
acpi_cpufreq fuse uinput arc4 snd_hda_codec_idt ecb snd_hda_intel snd_hda_codec
b43 snd_hwdep snd_seq snd_seq_device snd_pcm mac80211 snd_timer cfg80211 snd
b44 ssb dell_laptop soundcore dell_wmi i2c_i801 iTCO_wdt snd_page_alloc
iTCO_vendor_support rfkill wmi mii sdhci_pci sdhci mmc_core joydev microcode
dcdbas firewire_ohci firewire_core crc_itu_t i915 drm_kms_helper drm
i2c_algo_bit i2c_core video output [last unloaded: kvm]
Pid: 1516, comm: Xorg Not tainted 2.6.34.1-9.fc13.i686.PAE #1 0KD882/MM061      
EIP: 0060:[<f7f8ec27>] EFLAGS: 00013293 CPU: 1
EIP is at drm_mode_getconnector+0x295/0x2b9 [drm]
EAX: f36d313b EBX: 00000001 ECX: 00000003 EDX: f36d3e7c
ESI: f69d4000 EDI: f8032b74 EBP: f36d3e60 ESP: f36d3dec
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process Xorg (pid: 1516, ti=f36d2000 task=f3df5940 task.ti=f36d2000)
Stack:
000000d0 f69d7688 000a3e0c f69d4154 0000033b 00000003 00000001 f36d3e7c
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace:
[<f7f85ba3>] ? drm_ioctl+0x23c/0x31d [drm]
[<f7f8e992>] ? drm_mode_getconnector+0x0/0x2b9 [drm]
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c057f0e8>] ? file_has_perm+0x8c/0xa6
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c04e485d>] ? vfs_ioctl+0x2c/0x96
[<f7f85967>] ? drm_ioctl+0x0/0x31d [drm]
[<c04e4df3>] ? do_vfs_ioctl+0x488/0x4c6
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c057f38c>] ? selinux_file_ioctl+0x43/0x46
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c04e4e77>] ? sys_ioctl+0x46/0x66
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c0408cdf>] ? sysenter_do_call+0x12/0x28
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
[<c04c64a7>] ? get_swap_bio+0x3b/0x6b
Code: 00 74 17 e8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 28 eb 05 bf f2 ff ff <ff> 8b 45 90 e8 a5 1f
81 c8 89 f8 8b 55 f0 65 33 15 14 00 00 00 
EIP: [<f7f8ec27>] drm_mode_getconnector+0x295/0x2b9 [drm] SS:ESP 0068:f36d3dec
CR2: 00000000a5e89046
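
For readers unfamiliar with the oops format, the IP line above encodes the faulting address, the symbol, the offset into that symbol, the symbol's size, and the owning module. The following is a minimal Python sketch, not part of the original report, that pulls those fields apart and builds the gdb `list` expression requested later in this thread. The names `IP_LINE`, `parse_ip_line`, and `gdb_list_command` are hypothetical helpers introduced only for illustration.

```python
import re

# Matches the "IP:" / "EIP:" line of an x86 oops:
#   [<address>] symbol+0xoffset/0xsize [module]
IP_LINE = re.compile(
    r"IP: \[<(?P<addr>[0-9a-f]+)>\] "
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)

def parse_ip_line(line):
    """Return the fields of an oops IP line as a dict, or None if no match."""
    m = IP_LINE.search(line)
    if m is None:
        return None
    return {
        "addr": int(m.group("addr"), 16),
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),  # None for built-in (non-module) code
    }

def gdb_list_command(fields):
    """Build the gdb expression that maps symbol+offset to a source line."""
    return "list *{symbol}+{offset:#x}".format(**fields)

info = parse_ip_line("IP: [<f7f8ec27>] drm_mode_getconnector+0x295/0x2b9 [drm]")
print(info["module"], "->", gdb_list_command(info))
```

The module field indicates which object file to load into gdb; here the symbol is reported as belonging to `[drm]`.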
Comment 1 Andrew Morton 2010-07-21 21:22:58 UTC
Thanks.  Was 2.6.33 OK?   2.6.32?
Comment 2 lists 2010-07-22 02:51:58 UTC
Here's the honest answer. Short version: I only ran 2.6.33 for a couple of days after the fix for KBZ 13811 was out, and I did see it there.  It's really hard to tell because KBZ 13811 (which really was a regression but was neither marked nor treated as one) masked this problem. Before that I was running a 2.6.32.8 from F11 on F12, because that one particular kernel got me ~5 working hibernate/thaw cycles, and I didn't notice THIS issue.

Long version (sorry if it is ranty; I know it's not your fault!):
Before I upgraded to F13 and the 2.6.33 series, I applied the patch for KBZ 13811 to 2.6.32.14 and 2.6.32.16 and did see the problem there.  I had been running 2.6.32.8 for the previous 6 months and did not notice it. However, 13811 would usually strike first: the 2.6.32.8 I was running would usually give me ~5 hibernate/thaw cycles before dying, so I can't really say for sure. Ever since Fedora required KMS in Fedora 11, the kernel has been in a regressed state, since at least 2.6.29.

I have been using in kernel hibernate/thaw (pmdisk/swsusp) since I think about 2.6.9 or 2.6.10, or about the time that pmdisk and swsusp "stuff" was big. I used to build my own kernels to configure that in as Fedora's kernels at the time didn't include it -- and generally (with a few hiccups here and there) it worked until the 2.6.29.4 that Fedora shipped in F-11.  

The last kernel that just plain worked was 2.6.27.44 as shipped in the last update of Fedora 10.  The entire KMS/GEM project has been at least for me nothing but a regression, since the mode-switch blink when switching to X didn't bother me, and I've lost the ability to hibernate / thaw my laptop reliably for the past year plus. 

KBZ 13811, regardless of how it was marked, was a regression: before KMS as Fedora had it in 2.6.29, swsusp worked; afterwards it didn't.

This may be more of a new bug in KMS/GEM, but the overall impact is a regression.

I'd try anything from 2.6.35, but does that work with the libdrm/mesa xorg-intel driver that Fedora is shipping for F13?  That's rhetorical, but reflective of the uncertainty and doubt about whether or not you can use Fedora as a base and have a workable system, at least with Intel graphics.

Thanks!
Comment 3 Rafael J. Wysocki 2010-07-23 19:59:03 UTC
On Friday, July 23, 2010, Jesse Barnes wrote:
> On Fri, 23 Jul 2010 14:15:55 +0200 (CEST)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> 
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=16388
> > Subject             : i915 drm BUG: unable to handle kernel paging request at a5e89046
> > Submitter   :  <lists@clanduggan.org>
> > Date                : 2010-07-14 16:59 (10 days old)
> 
> Looks like some potential memory corruption?  At resume we try to get
> connector info but panic due to a bad pointer, maybe in one of the
> lists.  Can you gdb your drm_kms_helper module and do "list
> *drm_mode_getconnector+0x295" to see what line this is?
> 
> Also, what chipset do you have?  Maybe I can reproduce it here with
> your kernel config.
Comment 4 Chuck Ebbert 2010-07-23 21:03:02 UTC
This is probably due to the i915 hibernation memory corruption bug, and should be fixed by:

  commit 985b823b919273fe1327d56d2196b4f92e5d0fae
  drm/i915: fix hibernation since i915 self-reclaim fixes

  commit cd9f040df6ce46573760a507cb88192d05d27d86
  drm/i915: add 'reclaimable' to i915 self-reclaimable page allocations

And yes, those are in Fedora now.
Comment 5 Chuck Ebbert 2010-07-23 21:13:48 UTC
And it looks like those two are needed in 2.6.32-stable, since the patch that caused the bug went into 2.6.32.8 as drm-i915-selectively-enable-self-reclaim.patch
Comment 6 lists 2010-07-26 14:48:35 UTC
Created attachment 27261
Config per jbarnes

Sorry for the delay.

On Fri, 23 Jul 2010 10:37:12 -0700, Jesse Barnes <jbarnes@virtuousgeek.org>
wrote:
> 
> Looks like some potential memory corruption?  At resume we try to get
> connector info but panic due to a bad pointer, maybe in one of the
> lists.  Can you gdb your drm_kms_helper module and do "list
> *drm_mode_getconnector+0x295" to see what line this is?
> 

(gdb) list *drm_mode_getconnector+0x295
0x20f3 is in drm_mode_getconnector (drivers/gpu/drm/drm_crtc.c:1417).
1412					}
1413					copied++;
1414				}
1415			}
1416		}
1417		out_resp->count_encoders = encoders_count;
1418	
1419	out:
1420		mutex_unlock(&dev->mode_config.mutex);
1421		return ret;
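
As a cross-check on the listing above, the offset arithmetic can be worked through by hand. This is a minimal Python sketch, not part of the original comment, using only values that appear in this report; the names `FAULT_IP`, `GDB_ADDR`, and `function_start` are hypothetical and chosen for illustration.

```python
# Values copied from the oops and from the gdb session in this report.
FAULT_IP = 0xF7F8EC27   # EIP at the time of the oops
OFFSET = 0x295          # drm_mode_getconnector+0x295
FUNC_SIZE = 0x2B9       # size reported in the oops (.../0x2b9)
GDB_ADDR = 0x20F3       # address gdb printed for the same expression

def function_start(addr, offset):
    """Address of the first byte of the function, given addr = start + offset."""
    return addr - offset

# Start of the function in the running kernel's address space.
runtime_start = function_start(FAULT_IP, OFFSET)
# Start of the function as gdb sees it in the module file on disk.
module_start = function_start(GDB_ADDR, OFFSET)

print(hex(runtime_start), hex(module_start))
# The faulting instruction must lie inside the function body.
print(OFFSET < FUNC_SIZE)
```

The computed runtime start, 0xf7f8e992, is the same address the call trace in the original report shows for `drm_mode_getconnector+0x0/0x2b9`, which suggests gdb resolved the same function that the oops names.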

> Also, what chipset do you have?  Maybe I can reproduce it here with
> your kernel config.

0:00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller])
        Subsystem: Dell Device 01bd
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at eff00000 (32-bit, non-prefetchable) [size=512K]
        Region 1: I/O ports at eff8 [size=8]
        Region 2: Memory at d0000000 (32-bit, prefetchable) [size=256M]
        Region 3: Memory at efec0000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
                Address: 00000000  Data: 0000
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: i915
        Kernel modules: i915


0000:00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)
        Subsystem: Dell Device 01bd
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Region 0: Memory at eff80000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Comment 7 Chris Wilson 2010-08-06 10:26:50 UTC
Weird death following resume. It's either the write to a bit of memory we have just allocated for the ioctl, or the connector is corrupt.

Definitely fits the pattern for the i915 hibernation bug.
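
For reference, the `Oops: 0002` value in the original report can be decoded using the x86 page-fault error code bits. This is a minimal Python sketch, not part of the original comment; the constant names and `describe_oops_code` are hypothetical, and only the three low bits are handled.

```python
# Low bits of the x86 page-fault error code, as printed after "Oops:".
PF_PROT = 1 << 0    # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1   # 0: read access, 1: write access
PF_USER = 1 << 2    # 0: fault in kernel mode, 1: fault in user mode

def describe_oops_code(code):
    """Summarise the access type, cause, and CPU mode of a page fault."""
    access = "write" if code & PF_WRITE else "read"
    cause = "protection violation" if code & PF_PROT else "page not present"
    mode = "user" if code & PF_USER else "kernel"
    return f"{access}, {cause}, {mode} mode"

print(describe_oops_code(0x0002))
```

Under this decoding the fault was a kernel-mode write to a page that was not present, which is at least consistent with a store through a bad pointer; it does not by itself show which structure was corrupted.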