Bug 10844

Summary: [drm:i915_wait_irq] *ERROR* EBUSY after suspend
Product: Drivers Reporter: Jie Luo (clotho67)
Component: Video(DRI - non Intel)Assignee: Jesse Barnes (jbarnes)
Status: RESOLVED CODE_FIX    
Severity: normal CC: acpi-bugzilla, airlied, akpm, bug-track, jbarnes, rjw, rui.zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.25.4 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 9832    
Attachments: dmesg
Restore pipestat config as well
diff of reg_dump before and after suspend
diff of reg_dump with and without patch
Call pci_set_master after pci_enable_device during resume time
dmesg_2.6.28_lex

Description Jie Luo 2008-05-31 22:15:13 UTC
Latest working kernel version: 2.6.24.7
Earliest failing kernel version: 2.6.25
Distribution: Gentoo
Hardware Environment: ThinkPad X60
Software Environment: 
Problem Description:
When upgrade to 2.6.25, I find that OpenGL program no longer work after suspend, and find this line in dmesg.

[drm:i915_wait_irq] *ERROR* EBUSY -- rec: 16309 emitted: 16310

This seems can be workaround by reload drm and i915 kernel modules after suspend. After doing this, OpenGl program does not have problem with suspending. I find the only difference between the first suspend and the suspend after reload drm kernel module are these lines in dmesg.

first suspend:

[   47.879220] Switched to high resolution mode on CPU 1
[   47.882557] ACPI: EC: missing OBF confirmation, don't expect it any longer.
[   47.885890] ACPI: EC: missing write data confirmation, don't expect it any longer.
[   47.969252] PM: Writing back config space on device 0000:00:02.1 at offset 1 (was 900000, writing 900003)
[   47.972596] PM: Writing back config space on device 0000:00:1b.0 at offset 1 (was 100106, writing 100102)

second suspend (after reload drm and i915):
[  345.656760] Switched to high resolution mode on CPU 1
[  345.726355] PM: Writing back config space on device 0000:00:02.0 at offset 1 (was 900003, writing 900007)
[  345.752645] ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 16
[  345.752652] PCI: Setting latency timer of device 0000:00:02.0 to 64
[  345.752669] PM: Writing back config space on device 0000:00:02.1 at offset 1 (was 900000, writing 900003)
[  345.765859] PM: Writing back config space on device 0000:00:1b.0 at offset 1 (was 100106, writing 100102)

It seems some error happened when ACPI setting PCI Interrupt for device 0000:00:02.0, which is the graphic card.

00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)


Steps to reproduce: 
1. Start with 2.6.25 kernel on a ThinkPad X60.
2. Suspend to ram.
Comment 1 Jie Luo 2008-05-31 22:18:01 UTC
Created attachment 16353 [details]
dmesg
Comment 2 Andrew Morton 2008-05-31 22:26:10 UTC
Dave, do you think this regression is an ACPI thing or a DRM thing?

Thanks.
Comment 3 Jie Luo 2008-05-31 22:30:47 UTC
This happen even without DRM modules loaded and X started. Doing a suspend from console and then start X, OpenGL program will failed with a "[drm:i915_wait_irq] *ERROR* EBUSY...".
Comment 4 Jie Luo 2008-06-01 23:00:32 UTC
(In reply to comment #3)
> This happen even without DRM modules loaded and X started. 
I made a mistake. This is wrong. I find comment out following code in i915_suspend() make OpenGL program work after suspend to ram.

if (state.event == PM_EVENT_SUSPEND) {
                /* Shut down the device */
                pci_disable_device(dev->pdev);
                pci_set_power_state(dev->pdev, PCI_D3hot);
        }

You still can see the "ACPI: EC: ..." in dmesg.

[   95.151000] ACPI: EC: missing OBF confirmation, don't expect it any longer.
[   95.154333] ACPI: EC: missing write data confirmation, don't expect it any longer.
[   95.223541] PM: Writing back config space on device 0000:00:02.0 at offset 1 (was 900003, writing 900007)
[   95.237760] ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 16
[   95.237768] PCI: Setting latency timer of device 0000:00:02.0 to 64
[   95.237784] PM: Writing back config space on device 0000:00:02.1 at offset 1 (was 900000, writing 900003)
[   95.241035] PM: Writing back config space on device 0000:00:1b.0 at offset 1 (was 100106, writing 100102)

But this time seems it can "Writing back config space on device 0000:00:02.0 ...", so the IRQ problem gone.

Hope this can help resolve this problem.
Comment 5 Jesse Barnes 2008-06-02 09:45:37 UTC
Yeah, that code path has been a problem in the past.  Hopefully this sort of thing will go away with Rafael's suspend/hibernate callback rework.  In the meantime I can test whether we actually need the disable_device & set_power_state calls (btw did you try commenting out just one of those calls?  if so, which one made the difference?).

Thanks,
Jesse
Comment 6 Jie Luo 2008-06-02 10:48:00 UTC
(In reply to comment #5)
> Yeah, that code path has been a problem in the past.  Hopefully this sort of
> thing will go away with Rafael's suspend/hibernate callback rework.  In the
> meantime I can test whether we actually need the disable_device &
> set_power_state calls (btw did you try commenting out just one of those
> calls? 
> if so, which one made the difference?).

Just commenting out disable_device is enough. 

Thanks.
Comment 7 Jesse Barnes 2008-06-02 11:37:36 UTC
Well, there's at least one bug in our restore code:  we're not even bothering to restore the interrupt configuration!  It may be that disable_device (which disables interrupts) ends up clearing our IER regs or something.  If so, this patch may help:

diff --git a/drivers/char/drm/i915_drv.c b/drivers/char/drm/i915_drv.c
index e8f3d68..a4da696 100644
--- a/drivers/char/drm/i915_drv.c
+++ b/drivers/char/drm/i915_drv.c
@@ -499,6 +499,11 @@ static int i915_resume(struct drm_device *dev)

        /* FIXME: restore TV & SDVO state */

+       /* Interrupt state */
+       I915_WRITE(I915REG_INT_IDENTITY_R, dev_priv->saveIIR);
+       I915_WRITE(I915REG_INT_ENABLE_R, dev_priv->saveIER);
+       I915_WRITE(I915REG_INT_MASK_R, dev_priv->saveIMR);
+
        /* FBC info */
        I915_WRITE(FBC_CFB_BASE, dev_priv->saveFBC_CFB_BASE);
        I915_WRITE(FBC_LL_BASE, dev_priv->saveFBC_LL_BASE);
Comment 8 Jie Luo 2008-06-02 20:47:55 UTC
(In reply to comment #7)
> Well, there's at least one bug in our restore code:  we're not even bothering
> to restore the interrupt configuration!  

It seems this patch is not enough to fix my problem. I still need to comment out disable_device.

As I have mentioned in my bug report, if I reload i915 module after suspend, it works even without comment out disable_device. So there may be some bugs in the disable_device code path as well.

And it seems that the video device will be disabled correctly even without call from drm, is this call necessary?
Comment 9 Jesse Barnes 2008-06-03 18:09:31 UTC
Yeah, something else in the disable_device path must be hosing things.

The call may not be necessary, but I'd still like to figure out why it breaks things.
Comment 10 Jesse Barnes 2008-06-06 14:59:02 UTC
Created attachment 16420 [details]
Restore pipestat config as well

Oops, missed a couple of regs in my last patch.  Can you try this one out?

Thanks,
Jesse
Comment 11 Jie Luo 2008-06-06 16:32:11 UTC
Created attachment 16421 [details]
diff of reg_dump before and after suspend

Still need to comment out disable_device. I attached the difference of intel_reg_dump before and after suspend, and the difference of intel_reg_dump with and without this patch. Maybe I should enable PCI debug to see if there are more information.
Comment 12 Jie Luo 2008-06-06 16:32:54 UTC
Created attachment 16422 [details]
diff of reg_dump with and without patch
Comment 13 Jie Luo 2008-06-07 15:57:26 UTC
The trick of reload i915 kernel modules seems just make disable_device fail to disable IRQ 16, so it just has the same effect as comment out disable_device. 

The "ACPI: EC" is another problem, I didn't see any problem with it currently.
Comment 14 Jie Luo 2008-06-20 00:19:10 UTC
It seams add pci_set_master(dev->pdev) after pci_enable_device make OpenGL works after resume at least. But pci_enable_device didn't enable video device immediately, it takes some time to enable the irq 16.
Comment 15 Jie Luo 2008-06-20 00:52:29 UTC
Created attachment 16559 [details]
Call pci_set_master after pci_enable_device during resume time

I just verify this by doing a clean build. 

When I compare the suspend code of hda-intel sound driver and the i915 drm module, I find this fix. 

The hda-intel's suspend code also take care of many things which maybe also should be cared in i915 driver, such as MSI. 

But I'm just a user, it's up to you to decide which can make i915's suspend code more robust.
Comment 16 Jie Luo 2008-06-20 00:54:48 UTC
Forget to mention, this bug also exits in current git kernel. I'm glade if it could be fixed before 2.6.26 release. 

Thanks.
Comment 17 Jesse Barnes 2008-06-20 15:24:00 UTC
Thanks for testing, Jie, I'll see if I can push a fix upstream quickly.
Comment 18 Jesse Barnes 2008-06-24 13:04:05 UTC
Thanks Jie, just pushed the fix to Linus.
Comment 19 Oleksij Rempel (fishor) 2008-11-20 03:45:59 UTC
please reopen this bug.

i have sam issue:
[drm:i915_wait_irq] *ERROR* EBUSY -- rec: 0 emitted: 1
i use kernel 2.6.28-rc5-slub-00172-gee2f6cc x86_64
Comment 20 Oleksij Rempel (fishor) 2008-11-20 03:47:35 UTC
Created attachment 18950 [details]
dmesg_2.6.28_lex
Comment 21 Jesse Barnes 2008-11-24 11:43:57 UTC
Possibly yet another DUP of the problem fixed by http://lists.freedesktop.org/archives/intel-gfx/2008-
November/000614.html?
Comment 22 Oleksij Rempel (fishor) 2008-11-24 14:47:24 UTC
i can't apply this patch to the latest git.
Comment 23 Oleksij Rempel (fishor) 2008-11-30 23:25:19 UTC
this issue is fixed with latest git