Bug 26932

Summary: [SNB mobile] Oops in DRM intel driver, esp. during S3/S4 stress test
Product: Drivers
Reporter: Matthias Hopf (mat)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Severity: high
CC: chris, daniel, eric, florian, gordon.jin, jbarnes, maciej.rutecki, rjw, sndirsch
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.37 (3c0eee3)
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 7216
Attachments: Oops by Xserver; Oops by queens; Syslog with new oops

Description Matthias Hopf 2011-01-17 11:45:49 UTC
Especially during S3 and S4 stress tests (but also seen during regular use after some S3 cycles) the intel driver Oopses after typically 100-500 S3 cycles when running (multiple) OpenGL clients (queens and glhanoi in this case).

Mesa 7.10
libdrm 2.4.23 + git 0184bb1c
xf86-video-intel 2.14.0

drm loaded with debug=0x0e

Attaching 2 different (but probably equivalent) Oopses, one created by a screensaver (queens), and one by the Xserver.
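
A stress loop of the kind described above can be sketched in Python as follows. This is only an illustration, not the test script used for this report: it assumes the util-linux `rtcwake` tool is available and that the OpenGL clients are already running. The function name `stress_cycles`, the 20-second wake delay and the 10-second settle time are made up for the example.

```python
import subprocess
import time


def stress_cycles(cycles, mode="mem", wake_after=20, settle=10,
                  run=subprocess.call, sleep=time.sleep):
    """Suspend and resume `cycles` times; return how many cycles completed.

    mode is passed to `rtcwake -m`: "mem" for S3, "disk" for S4.
    `run` and `sleep` are injectable so the loop can be dry-run.
    """
    for n in range(1, cycles + 1):
        # rtcwake programs the RTC alarm and then enters the sleep state.
        rc = run(["rtcwake", "-m", mode, "-s", str(wake_after)])
        if rc != 0:
            return n - 1  # suspend failed, stop counting here
        sleep(settle)     # give the GL clients time to render after resume
    return cycles
```

Called as `stress_cycles(500)` (as root), this would attempt 500 S3 cycles; the returned count only says how many suspends were issued successfully, so the kernel log still has to be checked for the Oops.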
Comment 1 Matthias Hopf 2011-01-17 11:55:27 UTC
Created attachment 43822
Oops by Xserver

There are tons of blt ring debug messages before it, but nothing eye-catching.
Comment 2 Matthias Hopf 2011-01-17 12:00:37 UTC
Created attachment 43832
Oops by queens

Also a lot of blt ring messages before this one.
Comment 3 Chris Wilson 2011-01-17 12:15:28 UTC
Ah, looks like the good old ring buffer overflow. Just one detail missing from the report, which platform? (I'm guessing SandyBridge, but mobile or desktop?)

I had to add a lot of error propagation to handle such overflows; too much, too late to be pushed into 2.6.37. Can you verify that drm-intel-fixes (or even linus/master) is stable? Sadly, it is likely to have been destabilised elsewhere...
Comment 4 Stefan Dirsch 2011-01-17 12:59:36 UTC
Good question. It is SandyBridge mobile - as usual.
Comment 5 Chris Wilson 2011-01-17 13:26:04 UTC
Oh noes, not mobile! :| We've been hitting some stability issues with mobile as well. See https://bugs.freedesktop.org/show_bug.cgi?id=32752
Comment 6 Matthias Hopf 2011-01-17 14:34:07 UTC
(In reply to comment #3)
> Ah, looks like the good old ring buffer overflow. Just one detail missing
> from the report, which platform? (I'm guessing SandyBridge, but mobile or
> desktop?)

Oops, yes, forgot that. :-]

> I had to add a lot of error propagation to handle such overflows; too much,
> too late to be pushed into 2.6.37. Can you verify that drm-intel-fixes (or
> even linus/master) is stable? Sadly, it is likely to have been destabilised
> elsewhere...

drm-intel-fixes didn't compile last time I tested (yesterday). I can try Linus' master branch, though.

As this typically requires >200 S3 cycles, don't expect any answers today.
Comment 7 Matthias Hopf 2011-01-17 15:51:52 UTC
Linus' master has major issues. Testing e78bf5e shows severe rendering issues: parts of the screen are not rendered. VT switching works, and there is no information in either the Xorg log or dmesg; with drm debug=0x0e I don't see any blt ring messages. No Oops, no hangcheck timer message.

Checked drm-intel-fixes as well. Same issue.

Guess I have to bisect this one first. Oh well.
Comment 8 Chris Wilson 2011-01-17 15:59:55 UTC
Thanks for the heads up.
Comment 9 Matthias Hopf 2011-01-17 18:02:27 UTC
Apparently frame buffer compression is borked on SandyBridge:

9c04f015ebc2cc2cca5a4a576deb82a311578edc is first bad commit
    drm/i915: Add frame buffer compression on Sandybridge

I'm not exactly sure why, but frame buffer compression seems to be notoriously difficult to get right on intel (though I don't know whether any other vendor actually does something similar)... At least we often had issues in the past with ironlake, and it seems that the hardware allows tons of different configurations that aren't all easy to get right ;-)

Wouldn't it be a reasonable idea to re-enable the "disable framebuffer compression" switch, maybe even undocumented, so you can easily test regressions in this subsystem?

I'm testing Linus' master with this commit reverted now.
Comment 10 Matthias Hopf 2011-01-17 18:04:55 UTC
Shall I submit a separate bug for the FBC issue?
Comment 11 Matthias Hopf 2011-01-17 18:26:00 UTC
Linus' master with the commit reverted seems to work fine.

Now I wanted to run the stress test overnight to check for the original bug, only to find out that after suspend the system doesn't resume any more, but reboots. Next bisect. Sigh.

Is there a simple way to do bisects with additional patches applied (in this case the revert of FBC)? I have my own scripts, but would rather use standard tools...
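
One way to do this with stock git is `git bisect run` with a helper that applies the extra revert before each build and drops it again afterwards. The Python sketch below is an assumption-laden illustration, not a tool used in this report: `bisect_step`, the build command and the test command are hypothetical, and a test that ends in a spontaneous reboot (as with the resume failure here) would still need manual judging. It relies on the documented `git bisect run` exit codes (0 = good, 125 = skip, other values from 1 to 127 = bad).

```python
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_step(extra_revert, build_cmd, test_cmd, run=subprocess.call):
    """Judge the checked-out commit with `extra_revert` reverted on top.

    `run` executes a command list and returns its exit status; it is
    injectable so the control flow can be exercised without a repository.
    """
    try:
        # Only revert if the commit is part of the history under test.
        present = run(["git", "merge-base", "--is-ancestor",
                       extra_revert, "HEAD"]) == 0
        if present and run(["git", "revert", "--no-commit",
                            extra_revert]) != 0:
            return SKIP  # the revert does not apply cleanly here
        if run(build_cmd) != 0:
            return SKIP  # cannot judge a commit that does not build
        return GOOD if run(test_cmd) == 0 else BAD
    finally:
        # Drop the temporary revert so bisect can check out the next commit.
        run(["git", "reset", "--hard"])
```

A wrapper script would end with something like `sys.exit(bisect_step("9c04f015ebc2", ["make", "-j4"], ["./resume-test.sh"]))` and be started via `git bisect run python3 wrapper.py`; the build and test commands in that line are placeholders.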

Testing drm-intel-fixes first, though.
Comment 12 Matthias Hopf 2011-01-17 18:38:54 UTC
drm-intel-fixes shows the same issue. So next bisect. Sigh :-/
Comment 13 Chris Wilson 2011-01-17 20:23:45 UTC
To disable FBC you can use i915.powersave=0. Really, the hardware is easy, we just make it seem difficult. It's a knack.
Comment 14 Matthias Hopf 2011-01-18 15:06:19 UTC
Setting i915.powersave=0 is enough, thanks.
Question remains: Should I open a separate bug for this?

I'll try to bisect the StR issue, and if that turns out to be too problematic, do the stress test with S4.
Comment 15 Chris Wilson 2011-01-18 15:36:47 UTC
(In reply to comment #14)
> Setting i915.powersave=0 is enough, thanks.
> Question remains: Should I open a separate bug for this?

Too late, FBC should now be fixed in -fixes.
Comment 16 Matthias Hopf 2011-01-18 16:04:02 UTC
Good to hear, I'll double check later on. If you don't hear about this any more, it's fixed.

Still continuing the bisect for the StR issue.
Comment 17 Matthias Hopf 2011-01-20 10:20:19 UTC
FBC is fine in -fixes, thanks for that!
I was able to pinpoint the S3 issue to git commit 5bd5a45 (see LKML), unrelated to i915.

With that reverted I was able to reproduce the issue with Linus' master from yesterday (v2.6.38-rc1).

I noted some additional errors before the Oops:

  Jan 20 00:17:18 linux kernel: [16872.080710] PM: Finishing wakeup.
  Jan 20 00:17:18 linux kernel: [16872.080711] Restarting tasks ... 
  Jan 20 00:17:18 linux kernel: [16872.084898] [drm:i915_do_wait_request] *ERROR* something (likely vbetool) disabled interrupts, re-enabling
  Jan 20 00:17:18 linux kernel: [16872.088421] done.

Note that vbetool is NOT installed.


  ioremap error for 0xbc747000-0xbc74a000, requested 0x10, got 0x0

Also note that the Oops has changed significantly:
        BUG_ON((obj->base.write_domain & I915_GEM_GPU_DOMAINS) != 0);
in i915_gem_object_wait_rendering().
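
For readers unfamiliar with this check: the assertion fires when the object's write domain still contains any GPU domain, i.e. when a GPU write is still marked as pending at that point. The mask arithmetic can be sketched in Python as below. The bit values are assumed to match the i915 headers of that era (`I915_GEM_GPU_DOMAINS` being everything except the CPU and GTT domains) and should be verified against the tree in use; `bug_on_would_fire` is a name made up for the illustration.

```python
# Domain bits, assumed to match the i915 headers; verify against your tree.
I915_GEM_DOMAIN_CPU = 0x01
I915_GEM_DOMAIN_RENDER = 0x02
I915_GEM_DOMAIN_GTT = 0x40

# Every domain except CPU and GTT counts as a GPU domain (32-bit mask).
I915_GEM_GPU_DOMAINS = 0xFFFFFFFF & ~(I915_GEM_DOMAIN_CPU | I915_GEM_DOMAIN_GTT)


def bug_on_would_fire(write_domain):
    """Mirror of the BUG_ON condition quoted above."""
    return (write_domain & I915_GEM_GPU_DOMAINS) != 0
```

Under these assumptions, an object whose write domain is still the render domain would trip the check, while one with no write domain, or with only the CPU or GTT domain set, would not.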

Attaching excerpt of /var/log/messages.
Comment 18 Matthias Hopf 2011-01-20 10:24:05 UTC
Created attachment 44372
Syslog with new oops
Comment 19 Chris Wilson 2011-01-20 10:55:31 UTC
Matthias, can you talk me through what was happening at the time of the OOPS? The code looks sane; if it changes rings (i.e. we will call i915_gem_object_wait_rendering() from do_execbuffer) then we will issue a flush on the old ring. I don't see the bug yet.
Comment 20 Matthias Hopf 2011-01-20 14:09:44 UTC
Chris, this happens only after several hundred S3 cycles (514 in this case). Therefore, I cannot state exactly what happened.

I returned to the office this morning and found the screen frozen (no obvious rendering errors, no obvious non-restored window regions etc.). According to the log this happens shortly (but I *think* not immediately) after resume, but way before the next suspend is triggered by the test script.
Additional Oopses occur after the original one due to non-responding processes, but that is only a side effect of this original Oops.

As soon as the test script triggers the next suspend, the machine freezes, but that is no wonder as it will try to suspend all drivers including drm.

Can it be that there is a race condition, so that the flush of the old ring was done before suspending? Or could it be a hardware issue that has to be worked around? I'm seeing this on many different machines, with 0106, 0116, and 0126 devices.
Comment 21 Matthias Hopf 2011-01-21 13:55:58 UTC
For reference:
Novell bug https://bugzilla.novell.com/show_bug.cgi?id=664252 (not public).
Comment 22 Chris Wilson 2011-01-21 14:19:41 UTC
I should have checked this first: is this with -fixes or -next? On -fixes, it can quite easily be an unchecked error return causing the OOPS (the error prevents the ring from flushing, and we OOPS rather than propagate the error). The error state (ring overflow, or even the stale HEAD just fixed) would be quite rare and so conceivably fits this scenario.
Comment 23 Matthias Hopf 2011-01-21 15:31:58 UTC
This was linus/master (v2.6.38-rc1), as requested as a good test in comment #3.
Comment 24 Florian Mickler 2011-01-25 14:37:30 UTC
a fix for this has been merged (.38-rc2+)
commit 4efe070896e1f7373c98a13713e659d1f5dee52a
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Tue Jan 18 11:25:41 2011 -0800

    drm/i915: make the blitter report buffer modifications to the FBC unit
Comment 25 Matthias Hopf 2011-01-26 07:49:04 UTC
>     drm/i915: make the blitter report buffer modifications to the FBC unit

At least this particular commit has *nothing* to do with this issue.
Please don't close foreign bugs unless you're really sure about it. Proposing tests with certain commits always makes sense, closing bugs doesn't.

Testing drm-intel-fixes (4efe070) for other fixes right now.
Comment 26 Matthias Hopf 2011-01-26 08:01:05 UTC
Symptoms have changed, the effect hasn't.

Now I'm getting intermittent

  *ERROR* Hangcheck timer elapsed... GPU hang

after 43 and 44 test cycles, where the chip reset apparently works out, and I get tons of

  *ERROR* Hangcheck timer elapsed... render ring idle [waiting on 4571, at 318], missed IRQ

approx. every 2 seconds, and the machine doesn't suspend any more (hanging right at the end of all userspace processes invoked for suspending). Hangcheck timer messages continue to flood /var/log/messages.
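
When /var/log/messages is flooded like this, it can help to pull the numbers out of the hangcheck lines mechanically. The Python sketch below matches only the message format quoted above. Reading the two numbers as the sequence number being waited on and the ring's current one is an assumption, and `parse_hangcheck` is a made-up helper name.

```python
import re

# Matches e.g. "... render ring idle [waiting on 4571, at 318], missed IRQ"
HANGCHECK_IDLE = re.compile(
    r"Hangcheck timer elapsed\.\.\. (?P<ring>.+?) idle "
    r"\[waiting on (?P<waiting>\d+), at (?P<at>\d+)\]"
)


def parse_hangcheck(line):
    """Return (ring, waiting_seqno, current_seqno), or None if no match."""
    m = HANGCHECK_IDLE.search(line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("waiting")), int(m.group("at"))
```

Applied to each line of the log, this skips the plain "GPU hang" messages and returns a tuple only for the "ring idle" variant, which makes it easy to see whether the two numbers change between repeats.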

Trying to reproduce now.
Comment 27 Florian Mickler 2011-01-26 08:21:29 UTC
Sorry Matthias, I was going from the "References" tag in the commit and paid no attention. Will just post a pointer next time.
Comment 28 Matthias Hopf 2011-01-26 09:21:39 UTC
No probs.

Reproducible, this time after 75 suspend cycles. It actually seems easier to reproduce than the Oops.
This time with an additional

  *ERROR* something (likely vbetool) disabled interrupts, re-enabling

before the hangcheck timer. Unfortunately, no additional information before or after that, even though drm.debug=0x0e.
Comment 29 Rafael J. Wysocki 2011-02-03 19:03:21 UTC
On Thursday, February 03, 2011, Matthias Hopf wrote:
> On Feb 03, 11 01:05:28 +0100, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.36 and 2.6.37.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.36 and 2.6.37.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> The original issue of this bug is not a regression. Though during trying
> to reproduce with the latest kernel I stumbled over two regressions (one
> regarding frame buffer compression, already fixed in intel-drm-fixes,
> one regarding S3 trampoline code and NX, fix available, bug 27472).
Comment 30 Rafael J. Wysocki 2011-02-03 19:04:18 UTC
Dropping from the list of post-2.6.36 regressions.
Comment 31 Chris Wilson 2011-03-19 12:45:42 UTC
Matthias, have you had a chance to retest with 2.6.38? I think it's getting there, if not already stable...
Comment 32 Stefan Dirsch 2011-03-29 11:08:32 UTC
Chris, Matthias is on vacation until Sun 2011-04-03.
Comment 33 Daniel Vetter 2012-03-25 13:58:15 UTC
Ok, we've fixed ridiculous amounts of snb hangs recently. For all the glory, please test with 3.4 (i.e. a git snapshot atm).
Comment 34 Jesse Barnes 2012-04-18 20:19:26 UTC
Must be fixed!  There is no other possible explanation for Matthias's silence. :)
Comment 35 Matthias Hopf 2012-04-18 20:52:42 UTC
I'm no longer with SuSE, thus I don't have access to this hardware any longer.

I haven't seen any S3 hangs for a long time with SB, but I'm typically not running any OpenGL programs lately.

Still, assume fixed unless proven otherwise.