Bug 15544

Summary: black screen upon S3 resume, syslog has "render error" and "page table error"
Product: Drivers
Reporter: Sanjoy Mahajan (sanjoy)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: CLOSED CODE_FIX
Severity: normal
CC: chris, jbarnes, jrnieder, maciej.rutecki, rjw, yakui.zhao
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.33
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 7216, 14885
Attachments:
  lspci -vvvnn output
  Xorg log from the run with the display hang
  /var/log/dmesg
  new Xorg log with render error

Description Sanjoy Mahajan 2010-03-16 00:45:15 UTC
Created attachment 25539 [details]
lspci -vvvnn output

Twice since I started running 2.6.33 two weeks ago, the laptop (T60 w/ Intel graphics and wireless) has woken up from S3 sleep with a completely black screen.  It happens maybe every 10 or 20 sleep/wake cycles.

My usual trick of putting it back to sleep and waking it (Fn+F4 and then tapping the shift key) hasn't restored the screen, so I think it's a different bug from the ones I've seen before (hence I checked "Yes" for the Regression question).  But the system is otherwise responsive, so I can ctrl-alt-del and capture all lines of the logfile.

The relevant piece of the syslog contained:

  [450210.002650] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
  [450210.002655] i915 0000:00:02.0: setting latency timer to 64
  [450210.024621] render error detected, EIR: 0x00000010
  [450210.024623] page table error
  [450210.024625]   PGTBL_ER: 0x00000012
  [450210.024628] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
  [450210.024642] render error detected, EIR: 0x00000010
  [450210.024644] page table error
  [450210.024645]   PGTBL_ER: 0x00000012

I'll attach the Xorg.log from that boot and the output of lspci -vvvnn
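
For anyone reading the register values in the excerpt above, here is a minimal sketch of how they break down into bits. It assumes that bit 4 of EIR is the page table error flag, which is only inferred from the log pairing EIR 0x00000010 with "page table error"; the macro and function names are made up for illustration and are not the driver's.

```c
#include <stdint.h>

/* Assumption: bit 4 of EIR flags a page table error on this chipset,
 * inferred from the log pairing EIR 0x00000010 with "page table error". */
#define EIR_PAGE_TABLE_ERROR (1u << 4)

/* Returns non-zero if the EIR value has the assumed page table error bit set. */
int eir_has_page_table_error(uint32_t eir)
{
	return (eir & EIR_PAGE_TABLE_ERROR) != 0;
}

/* Writes the indices of all set bits into bits[] (lowest first)
 * and returns how many bits were set. */
int set_bit_indices(uint32_t value, int bits[32])
{
	int n = 0;

	for (int i = 0; i < 32; i++)
		if (value & (1u << i))
			bits[n++] = i;
	return n;
}
```

With these helpers, EIR 0x00000010 has only bit 4 set, and PGTBL_ER 0x00000012 has bits 1 and 4 set. What the individual PGTBL_ER bits mean is chipset specific and is not decoded here.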

Here is more of the preceding context:

  [450209.292710] ACPI: Waking up from system sleep state S3
  [450209.748145] i915 0000:00:02.0: restoring config space at offset 0x1 (was 0x900003, writing 0x900007)
  [450209.748166] pci 0000:00:02.1: restoring config space at offset 0x1 (was 0x900000, writing 0x900003)
  [450209.748207] HDA Intel 0000:00:1b.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100102)
  [450209.748248] pcieport 0000:00:1c.0: restoring config space at offset 0x9 (was 0x1fff1, writing 0x60116001)
  [450209.748263] pcieport 0000:00:1c.0: restoring config space at offset 0x1 (was 0x180107, writing 0x100507)
  [450209.748331] pcieport 0000:00:1c.1: restoring config space at offset 0x1 (was 0x100107, writing 0x100507)
  [450209.748398] pcieport 0000:00:1c.2: restoring config space at offset 0x1 (was 0x100107, writing 0x100507)
  [450209.748445] pcieport 0000:00:1c.3: restoring config space at offset 0xf (was 0x40400, writing 0x4040b)
  [450209.748458] pcieport 0000:00:1c.3: restoring config space at offset 0x9 (was 0x10001, writing 0xe421e421)
  [450209.748464] pcieport 0000:00:1c.3: restoring config space at offset 0x8 (was 0x0, writing 0xebf0ea00)
  [450209.748470] pcieport 0000:00:1c.3: restoring config space at offset 0x7 (was 0x20000000, writing 0x20008070)
  [450209.748480] pcieport 0000:00:1c.3: restoring config space at offset 0x3 (was 0x810000, writing 0x810010)
  [450209.748488] pcieport 0000:00:1c.3: restoring config space at offset 0x1 (was 0x100000, writing 0x100507)
  [450209.792055] uhci_hcd 0000:00:1d.0: power state changed by ACPI to D0
  [450209.792081] uhci_hcd 0000:00:1d.0: restoring config space at offset 0x1 (was 0x2800005, writing 0x2800001)
  [450209.792112] uhci_hcd 0000:00:1d.1: restoring config space at offset 0x1 (was 0x2800005, writing 0x2800001)
  [450209.800065] uhci_hcd 0000:00:1d.2: power state changed by ACPI to D0
  [450209.800090] uhci_hcd 0000:00:1d.2: restoring config space at offset 0x1 (was 0x2800005, writing 0x2800001)
  [450209.800121] uhci_hcd 0000:00:1d.3: restoring config space at offset 0x1 (was 0x2800005, writing 0x2800001)
  [450209.800160] ehci_hcd 0000:00:1d.7: restoring config space at offset 0x1 (was 0x2900106, writing 0x2900102)
  [450209.800202] pci 0000:00:1e.0: restoring config space at offset 0x1 (was 0x100005, writing 0x100007)
  [450209.800281] PIIX_IDE 0000:00:1f.1: restoring config space at offset 0x1 (was 0x2880005, writing 0x2800005)
  [450209.800320] ahci 0000:00:1f.2: restoring config space at offset 0x1 (was 0x2b00007, writing 0x2b00407)
  [450209.800718] iwl3945 0000:03:00.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100506)
  [450209.800840] yenta_cardbus 0000:15:00.0: restoring config space at offset 0xf (was 0x34001ff, writing 0x5c0010b)
  [450209.800847] yenta_cardbus 0000:15:00.0: restoring config space at offset 0xe (was 0x0, writing 0x94fc)
  [450209.800854] yenta_cardbus 0000:15:00.0: restoring config space at offset 0xd (was 0x0, writing 0x9400)
  [450209.800861] yenta_cardbus 0000:15:00.0: restoring config space at offset 0xc (was 0x0, writing 0x90fc)
  [450209.800868] yenta_cardbus 0000:15:00.0: restoring config space at offset 0xb (was 0x0, writing 0x9000)
  [450209.800876] yenta_cardbus 0000:15:00.0: restoring config space at offset 0xa (was 0x0, writing 0x67fff000)
  [450209.800883] yenta_cardbus 0000:15:00.0: restoring config space at offset 0x9 (was 0x0, writing 0x64000000)
  [450209.800890] yenta_cardbus 0000:15:00.0: restoring config space at offset 0x8 (was 0x0, writing 0xe3fff000)
  [450209.800897] yenta_cardbus 0000:15:00.0: restoring config space at offset 0x7 (was 0x0, writing 0xe0000000)
  [450209.800904] yenta_cardbus 0000:15:00.0: restoring config space at offset 0x6 (was 0x0, writing 0xb0171615)
  [450209.800913] yenta_cardbus 0000:15:00.0: restoring config space at offset 0x4 (was 0x0, writing 0xe4300000)
  [450209.800920] yenta_cardbus 0000:15:00.0: restoring config space at offset 0x3 (was 0x20000, writing 0x2a810)
  [450209.800929] yenta_cardbus 0000:15:00.0: restoring config space at offset 0x1 (was 0x2100000, writing 0x2100007)
  [450209.801152] PM: early resume of devices complete after 53.051 msecs
  [450210.002650] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
  [450210.002655] i915 0000:00:02.0: setting latency timer to 64
  [450210.024621] render error detected, EIR: 0x00000010
  [450210.024623] page table error
  [450210.024625]   PGTBL_ER: 0x00000012
  [450210.024628] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
  [450210.024642] render error detected, EIR: 0x00000010
  [450210.024644] page table error
  [450210.024645]   PGTBL_ER: 0x00000012
Comment 1 Sanjoy Mahajan 2010-03-16 00:46:53 UTC
Created attachment 25540 [details]
Xorg log from the run with the display hang
Comment 2 Sanjoy Mahajan 2010-03-16 00:53:49 UTC
Created attachment 25541 [details]
/var/log/dmesg

Here is the /var/log/dmesg file from the boot that eventually resumed with a black screen (a week after the boot).
Comment 3 Sanjoy Mahajan 2010-03-16 00:55:19 UTC
The boot parameters include i915.powersave=0 (which I started using in 2.6.32 to avoid the problem of a flickering and tearing screen).
Comment 4 Rafael J. Wysocki 2010-03-22 00:18:56 UTC
On Monday 22 March 2010, Sanjoy Mahajan wrote:
> It hasn't recurred since I filed the bug report (6 days ago), but as far
> as I know it is still a problem.
> 
> I'm reasonably sure, but not 100%, that it's a regression: I might have
> observed it on Debian's 2.6.32 kernel.  If so, it happened much less often.
Comment 5 Rafael J. Wysocki 2010-03-23 21:43:30 UTC
On Tuesday 23 March 2010, Sanjoy Mahajan wrote:
> me> It hasn't recurred since I filed the bug report (6 days ago), but as
> me> far as I know it is still a problem.
> 
> I spoke too soon.  It recurred about an hour ago.
Comment 6 ykzhao 2010-03-25 02:48:18 UTC
Hi, Sanjoy
    Will you please try the following patch and see whether the issue still exists?
    http://lists.freedesktop.org/archives/intel-gfx/2010-February/005803.html

thanks.
    Yakui
Comment 7 Sanjoy Mahajan 2010-03-25 14:27:09 UTC
> http://lists.freedesktop.org/archives/intel-gfx/2010-February/005803.html

That patch already seems to be part of vanilla 2.6.33 (which I am running and
seeing problems with).  Here is the snippet from 2.6.33's intel_fb.c:

	mutex_lock(&dev->struct_mutex);

	ret = i915_gem_object_pin(fbo, 64*1024);
	if (ret) {
		DRM_ERROR("failed to pin fb: %d\n", ret);
		goto out_unref;
	}
Comment 8 Rafael J. Wysocki 2010-04-09 19:48:53 UTC
On Friday 09 April 2010, Sanjoy Mahajan wrote:
> I haven't seen the bug recur since March 23rd, although I'm running the
> same kernel (vanilla 2.6.33).  The X server has been updated slightly
> to Debian's xserver-xorg-core 1.7.6.
Comment 9 Sanjoy Mahajan 2010-04-25 18:40:01 UTC
The bug happened again yesterday.  The screen was not totally black but very close to it.  I'm still running vanilla 2.6.33 but with the Debian X.Org X Server 1.7.6.901 (1.7.7 RC 1).  I'll attach the Xorg.0.log file.  

The syslog had the render error lines again:

Apr 24 14:41:38 approx kernel: [113985.940542] render error detected, EIR: 0x00000010
Apr 24 14:41:38 approx kernel: [113985.940544] page table error
Apr 24 14:41:38 approx kernel: [113985.940546]   PGTBL_ER: 0x00000012
Apr 24 14:41:38 approx kernel: [113985.940549] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
Apr 24 14:41:38 approx kernel: [113985.940561] render error detected, EIR: 0x00000010
Apr 24 14:41:38 approx kernel: [113985.940564] page table error
Apr 24 14:41:38 approx kernel: [113985.940565]   PGTBL_ER: 0x00000012
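
To compare occurrences across boots, the register values can be pulled out of syslog lines like the ones above. This is only a sketch: parse_hex_after is a hypothetical helper name, and it assumes the value always appears as "KEY 0x<hex>" as it does in these lines.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Looks for `key` in a log line and parses the "0x..." value that
 * follows it. Returns 1 and stores the value in *out on success,
 * 0 if the key or a hex value after it is not found. */
int parse_hex_after(const char *line, const char *key, uint32_t *out)
{
	const char *p = strstr(line, key);
	unsigned int v;

	if (p == NULL)
		return 0;
	if (sscanf(p + strlen(key), " 0x%x", &v) != 1)
		return 0;
	*out = (uint32_t)v;
	return 1;
}
```

Calling it with the key "EIR:" matches the "render error detected" lines but not the "EIR stuck:" line, and the key "PGTBL_ER:" picks up the page table register value.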
Comment 10 Sanjoy Mahajan 2010-04-25 18:41:32 UTC
Created attachment 26137 [details]
new Xorg log with render error
Comment 11 Chris Wilson 2010-06-06 13:07:42 UTC
I think this category of page table errors following suspend and resume should be fixed with:

commit ac0c6b5ad3b3b513e1057806d4b7627fcc0ecc27
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu May 27 13:18:18 2010 +0100

    drm/i915: Rebind bo if currently bound with incorrect alignment.
    
    Whilst pinning the buffer, check that that its current alignment
    matches the requested alignment. If it does not, rebind.
    
    This should clear up any final render errors whilst resuming,
    for reference:
    
      Bug 27070 - [i915] Page table errors with empty ringbuffer
      https://bugs.freedesktop.org/show_bug.cgi?id=27070
    
      Bug 15502 -  render error detected, EIR: 0x00000010
      https://bugzilla.kernel.org/show_bug.cgi?id=15502
    
      Bug 13844 -  i915 error: "render error detected"
      https://bugzilla.kernel.org/show_bug.cgi?id=13844
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
    Signed-off-by: Eric Anholt <eric@anholt.net>

in 2.6.35-rc2.
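
The check that the commit message describes can be sketched in userspace as follows. This is not the driver's code: struct bo and needs_rebind are hypothetical stand-ins, the alignment is assumed to be a power of two, and an alignment of 0 is treated here as "no constraint".

```c
#include <stdint.h>

/* Hypothetical stand-in for a buffer object bound into the GTT. */
struct bo {
	int bound;           /* non-zero if the object is currently bound */
	uint32_t gtt_offset; /* offset of the current binding */
};

/* Returns 1 if an already-bound object sits at an offset that does not
 * satisfy `alignment` (a power of two) and so must be rebound before it
 * is pinned; returns 0 if the existing binding can be reused or the
 * object is not bound at all. */
int needs_rebind(const struct bo *obj, uint32_t alignment)
{
	if (!obj->bound || alignment == 0)
		return 0;
	return (obj->gtt_offset & (alignment - 1)) != 0;
}
```

For the 64*1024 alignment requested for the framebuffer in intel_fb.c, an object bound at offset 0x11000 would need rebinding, while one at 0x20000 would not.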
Comment 12 Jonathan Nieder 2011-08-23 23:22:52 UTC
From <http://bugs.debian.org/601732>:

> I don't think the fix in <https://bugzilla.kernel.org/show_bug.cgi?id=15544>
> solved the problem for me.  That fix was in 2.6.35-rc2, but the problem
> happened for me even with the 2.6.36 kernel.  
>
> However, the good news is that the error is now gone; or rather the
> error shows up but is cleared so the system doesn't hang.  I don't know
> exactly which kernel solved the problem.  I think it was 2.6.38.  What I
> noticed different in the dmesg log is that, ever since I stopped
> experiencing hangs, the following line showed up implying that the error
> is being cleared automatically:
>
> [1099952.207172] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck:
> 0x00000010, masking
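
The "EIR stuck ... masking" message suggests a clear-then-mask pattern: acknowledge the error bits, and mask any that refuse to clear so they stop being reported. Below is a simulation of that idea only. The struct, the `sticky` field, and the function name are hypothetical, and the real handler's register accesses are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical model of the registers involved. */
struct err_regs {
	uint32_t eir;    /* latched error bits */
	uint32_t emr;    /* error mask: set bits are no longer reported */
	uint32_t sticky; /* simulation only: bits that refuse to clear */
};

/* Acknowledges every pending error bit, then masks whichever bits are
 * still set afterwards. Returns the stuck bits (0 if all cleared). */
uint32_t clear_and_mask_stuck(struct err_regs *r)
{
	uint32_t pending = r->eir;
	uint32_t stuck;

	/* Simulate a write-to-clear: only sticky bits survive. */
	r->eir = pending & r->sticky;

	stuck = r->eir;
	if (stuck)
		r->emr |= stuck; /* stop reporting bits that cannot be cleared */
	return stuck;
}
```

In this model, an EIR of 0x00000010 with bit 4 sticky ends up masked in EMR, which would match an error that is logged once and then no longer reported.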