Kernel Bug Tracker – Bug 26932
[SNB mobile] Oops in DRM intel driver, esp. during S3/S4 stress test
Last modified: 2012-04-18 20:52:42 UTC
Especially during S3 and S4 stress tests (but also seen during regular use after some S3 cycles) the intel driver Oopses after typically 100-500 S3 cycles when running (multiple) OpenGL clients (queens and glhanoi in this case).
libdrm 2.4.23 + git 0184bb1c
drm loaded with debug=0x0e
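For readers unfamiliar with the debug value: it is a bitmask of message categories. The sketch below decodes it, assuming the category bits used by kernels of that period (CORE=0x01, DRIVER=0x02, KMS=0x04, MODE=0x08). These bit assignments are an assumption, not something stated in this report, and `decode_drm_debug` is a hypothetical helper name.

```python
# Assumed drm.debug category bits (not given in this report; check drmP.h
# of the kernel under test before relying on them).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "MODE",
}


def decode_drm_debug(mask):
    """Return the names of all categories enabled in the given bitmask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0x0E))
```

Under these assumptions, 0x0e enables everything except the core messages.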
Attaching 2 different (but probably equivalent) Oopses, one created by a screensaver (queens), and one by the Xserver.
Created attachment 43822
Oops by Xserver
There are tons of blt ring debug messages before, but nothing eye catching.
Created attachment 43832
Oops by queens
Also a lot of blt ring messages before this
Ah, looks like the good old ring buffer overflow. Just one detail missing from the report, which platform? (I'm guessing SandyBridge, but mobile or desktop?)
I had to add a lot of error propagation to handle such overflows; too much, too late to be pushed into 2.6.37. Can you verify that drm-intel-fixes (or even linus/master) is stable? Sadly, it is likely to have been destabilised elsewhere...
Good question. It is SandyBridge mobile - as usual.
Oh noes, not mobile! :| We've been hitting some stability issues with mobile as well. See https://bugs.freedesktop.org/show_bug.cgi?id=32752
(In reply to comment #3)
> Ah, looks like the good old ring buffer overflow. Just one detail missing from
> the report, which platform? (I'm guessing SandyBridge, but mobile or desktop?)
Oops, yes, forgot that. :-]
> I had to add a lot of error propagation to handle such overflows; too much, too
> late to be pushed into 2.6.37. Can you verify that drm-intel-fixes (or even
> linus/master) is stable? Sadly, it is likely to have been destabilised elsewhere...
drm-intel-fixes didn't compile last time I tested (yesterday). I can try Linus' master branch, though.
As this typically requires >200 S3 cycles, don't expect any answers today.
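A stress run of this kind can be sketched as a simple loop: suspend, wake up, check the kernel log, repeat. This is only an illustration and not the test script used for this report. It assumes `rtcwake` and `dmesg` are available and that it runs as root; the marker strings and the function names are hypothetical.

```python
import subprocess


def suspend_once(seconds=30):
    """Suspend to RAM (S3) and wake up via the RTC alarm after `seconds`."""
    return subprocess.call(["rtcwake", "-m", "mem", "-s", str(seconds)])


def kernel_log():
    """Return the current kernel ring buffer as text."""
    return subprocess.check_output(["dmesg"]).decode(errors="replace")


def stress(max_cycles, suspend=suspend_once, read_log=kernel_log,
           markers=("Oops", "kernel BUG", "GPU hang")):
    """Run up to max_cycles suspend cycles.

    Returns the 1-based cycle after which a marker first appears in the
    kernel log, or None if all cycles completed without one.
    """
    for cycle in range(1, max_cycles + 1):
        if suspend() != 0:
            raise RuntimeError("suspend failed in cycle %d" % cycle)
        log = read_log()
        if any(marker in log for marker in markers):
            return cycle
    return None
```

The suspend and log-reading functions are passed in as parameters so the loop can be dry-run without actually suspending the machine.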
Linus' master has major issues. Testing e78bf5e shows severe rendering issues: parts of the screen are not rendered, though VT switching works. There is no information in either the Xorg log or dmesg, and with drm.debug=0x0e I don't see any blt ring messages. No Oops, no hangcheck timer message.
Checked drm-intel-fixes as well. Same issue.
Guess I have to bisect this one first. Oh well.
Thanks for the heads up.
Apparently frame buffer compression is borked on SandyBridge:
9c04f015ebc2cc2cca5a4a576deb82a311578edc is first bad commit
drm/i915: Add frame buffer compression on Sandybridge
I'm not exactly sure why, but frame buffer compression seems to be notoriously difficult to get right on intel (though I don't know whether any other vendor actually does something likewise)... At least we often had issues in the past with ironlake, and it seems that the hardware allows tons of different configurations that aren't easy to get all right ;-)
Wouldn't it be a reasonable idea to re-enable the "disable framebuffer compression" switch, maybe even undocumented, so you can easily test regressions in this subsystem?
I'm testing Linus' master with this commit reverted now.
Shall I submit a separate bug for the FBC issue?
Linus' master with the commit reverted seems to work fine.
Now I wanted to run the stress test overnight to check for the original bug, just to find out that after suspend the system doesn't resume any more, but reboots. Next bisect. Sigh.
Is there a trivial system to do bisects with additional patches (in this case the revert of FBC)? I have my own scripts, but would rather use standard tools...
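One standard-tool approach to the question above is `git bisect run` with a helper that applies the extra revert at every step and cleans up afterwards. The following is a minimal sketch and not a tested workflow: the build command and the test command (`./s3-stress-test.sh`) are hypothetical placeholders, and for a kernel the test step would in practice have to install and boot the build on the test machine.

```python
import subprocess

# First bad commit from the FBC bisect reported in this thread.
FBC_COMMIT = "9c04f015ebc2cc2cca5a4a576deb82a311578edc"

# Exit codes as interpreted by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def run(cmd):
    """Run a command and return its exit status."""
    return subprocess.call(cmd)


def bisect_step(run=run, build_cmd=("make", "-j4"),
                test_cmd=("./s3-stress-test.sh",)):
    """Test the current revision with the FBC commit reverted on top."""
    # Only revert where the commit is actually part of the history.
    has_commit = run(["git", "merge-base", "--is-ancestor",
                      FBC_COMMIT, "HEAD"]) == 0
    try:
        if has_commit and run(["git", "revert", "--no-commit", FBC_COMMIT]) != 0:
            return SKIP  # the revert does not apply cleanly here
        if run(list(build_cmd)) != 0:
            return SKIP  # revision cannot be built
        return GOOD if run(list(test_cmd)) == 0 else BAD
    finally:
        # Drop the temporary revert so bisect can check out the next revision.
        run(["git", "reset", "--hard"])
```

Saved as a script that ends with `sys.exit(bisect_step())`, this could be started with `git bisect run python3 <script>` after marking the initial good and bad revisions.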
Testing drm-intel-fixes first, though.
drm-intel-fixes shows the same issue. So next bisect. Sigh :-/
To disable FBC you can use i915.powersave=0. Really, the hardware is easy, we just make it seem difficult. It's a knack.
Setting i915.powersave=0 is enough, thanks.
Question remains: Should I open a separate bug for this?
I'll try to bisect the StR issue, and if that turns out to be too problematic, do the stress test with S4.
(In reply to comment #14)
> Setting i915.powersave=0 is enough, thanks.
> Question remains: Should I open a separate bug for this?
Too late, FBC should now be fixed in -fixes.
Good to hear, I'll double check later on. If you don't hear about this any more, it's fixed.
Still continuing the bisect for the StR issue.
FBC is fine in -fixes, thanks for that!
I was able to pinpoint the S3 issue to git commit 5bd5a45 (see LKML), unrelated to i915.
With that reverted I was able to reproduce the issue with Linus' master from yesterday (v2.6.38-rc1).
I noted some additional errors before the Oops:
Jan 20 00:17:18 linux kernel: [16872.080710] PM: Finishing wakeup.
Jan 20 00:17:18 linux kernel: [16872.080711] Restarting tasks ...
Jan 20 00:17:18 linux kernel: [16872.084898] [drm:i915_do_wait_request] *ERROR* something (likely vbetool) disabled interrupts, re-enabling
Jan 20 00:17:18 linux kernel: [16872.088421] done.
Note that vbetool is NOT installed.
ioremap error for 0xbc747000-0xbc74a000, requested 0x10, got 0x0
Also note that the Oops has changed significantly:
BUG_ON((obj->base.write_domain & I915_GEM_GPU_DOMAINS) != 0);
Attaching excerpt of /var/log/messages.
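The BUG_ON quoted above asserts that the object has no pending GPU write domain at that point. Here is a small Python model of the condition. The domain bit values are recalled from the i915 ABI header of that period and are an assumption, since this report does not give them; `bug_on_fires` is a hypothetical name.

```python
# Assumed GEM domain bits (verify against i915_drm.h of the kernel in use).
I915_GEM_DOMAIN_CPU = 0x01
I915_GEM_DOMAIN_RENDER = 0x02
I915_GEM_DOMAIN_SAMPLER = 0x04
I915_GEM_DOMAIN_COMMAND = 0x08
I915_GEM_DOMAIN_INSTRUCTION = 0x10
I915_GEM_DOMAIN_VERTEX = 0x20
I915_GEM_DOMAIN_GTT = 0x40

U32_MASK = 0xFFFFFFFF

# Everything that is neither the CPU nor the GTT domain counts as a GPU domain.
I915_GEM_GPU_DOMAINS = U32_MASK & ~(I915_GEM_DOMAIN_CPU | I915_GEM_DOMAIN_GTT)


def bug_on_fires(write_domain):
    """Model of BUG_ON((obj->base.write_domain & I915_GEM_GPU_DOMAINS) != 0)."""
    return (write_domain & I915_GEM_GPU_DOMAINS) != 0


print(bug_on_fires(I915_GEM_DOMAIN_RENDER))  # object still dirty in a GPU domain
print(bug_on_fires(I915_GEM_DOMAIN_CPU))     # CPU writes do not trip the check
```

In this model the check trips whenever a GPU write is still outstanding, which would fit a flush that was expected but never happened.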
Created attachment 44372
Syslog with new oops
Matthias, can you talk me through what was happening at the time of the OOPS? The code looks sane; if it changes rings (i.e. we will call i915_gem_object_wait_rendering() from do_execbuffer) then we will issue a flush on the old ring. I don't see the bug yet.
Chris, this happens only after several 100 S3 cycles (514 in this case). Therefore, I cannot state exactly what happened.
I returned to the office this morning and found the screen frozen (no obvious rendering errors, no obvious non-restored window regions etc.). According to the log this happens shortly (but I *think* not immediately) after resume, but way before the next suspend is triggered by the test script.
Additional Oopses occur after the original one due to non-responding processes, but that is only a side effect of this original Oops.
As soon as the test script triggers the next suspend, the machine freezes, but that is no wonder as it will try to suspend all drivers including drm.
Can it be that there is a race condition, so that the flush of the old ring was done before suspending? Or could it be a hardware issue that has to be worked around? I'm seeing this on many different machines, with 0106, 0116, and 0126 devices.
Novell bug https://bugzilla.novell.com/show_bug.cgi?id=664252 (not public).
I should have asked: is this with -fixes or -next? On -fixes, it can quite easily be an unchecked error return causing the OOPS (the error prevents the ring from flushing, and we OOPS rather than propagate the error). The error state (ring overflow, or even the stale HEAD just fixed) would be quite rare and so conceivably fits this scenario.
This was linus/master (v2.6.38-rc1), as requested as a good test in comment #3.
a fix for this has been merged (.38-rc2+)
Author: Jesse Barnes <firstname.lastname@example.org>
Date: Tue Jan 18 11:25:41 2011 -0800
drm/i915: make the blitter report buffer modifications to the FBC unit
> drm/i915: make the blitter report buffer modifications to the FBC unit
At least this particular commit has *nothing* to do with this issue.
Please don't close foreign bugs unless you're really sure about it. Proposing tests with certain commits always makes sense, closing bugs doesn't.
Testing drm-intel-fixes (4efe070) for other fixes right now.
Symptoms have changed, the effect hasn't.
Now I'm getting intermittent
*ERROR* Hangcheck timer elapsed... GPU hang
after 43 and 44 test cycles, where the chip reset apparently works out, and I get tons of
*ERROR* Hangcheck timer elapsed... render ring idle [waiting on 4571, at 318], missed IRQ
approx. every 2 seconds, and the machine doesn't suspend any more (hanging right at the end of all userspace processes invoked for suspending). Hangcheck timer messages continue to flood /var/log/messages.
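To see how these messages develop over a long run, the "missed IRQ" lines can be pulled out of /var/log/messages. This is a minimal sketch, assuming the message text matches the line quoted above exactly; `parse_missed_irq` is a hypothetical helper name.

```python
import re

# Matches e.g. "... render ring idle [waiting on 4571, at 318], missed IRQ"
HANGCHECK_RE = re.compile(
    r"Hangcheck timer elapsed\.\.\. (?P<ring>.+?) idle "
    r"\[waiting on (?P<waiting>\d+), at (?P<current>\d+)\], missed IRQ"
)


def parse_missed_irq(line):
    """Return (ring, waiting_seqno, current_seqno), or None if no match."""
    match = HANGCHECK_RE.search(line)
    if match is None:
        return None
    return (match.group("ring"),
            int(match.group("waiting")),
            int(match.group("current")))


sample = ("*ERROR* Hangcheck timer elapsed... render ring idle "
          "[waiting on 4571, at 318], missed IRQ")
print(parse_missed_irq(sample))
```

Plain "GPU hang" lines do not match the pattern and return None, so the two kinds of hangcheck message can be counted separately.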
Trying to reproduce now.
Sorry Matthias, I was going from the "References" tag in the commit and paid no attention. Will just post a pointer next time.
Reproducible, this time after 75 suspend cycles. It actually seems easier to reproduce than the Oops.
This time with an additional
*ERROR* something (likely vbetool) disabled interrupts, re-enabling
before the hangcheck timer. Unfortunately, no additional information before or after that, even though drm.debug=0x0e.
On Thursday, February 03, 2011, Matthias Hopf wrote:
> On Feb 03, 11 01:05:28 +0100, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.36 and 2.6.37.
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.36 and 2.6.37. Please verify if it still should
> > be listed and let the tracking team know (either way).
> The original issue of this bug is not a regression. Though during trying
> to reproduce with the latest kernel I stumbled over two regressions (one
> regarding frame buffer compression, already fixed in intel-drm-fixes,
> one regarding S3 trampoline code and NX, fix available, bug 27472).
Dropping from the list of post-2.6.36 regressions.
Matthias, have you had a chance to retest with 2.6.38? I think it's getting there, if not already stable...
Chris, Matthias is on vacation until Sun 2011-04-03.
Ok, we've fixed ridiculous amounts of snb hangs recently. For all the glory, please test with 3.4 (i.e. a git snapshot atm).
Must be fixed! There is no other possible explanation for Matthias's silence. :)
I'm no longer with SuSE, thus I don't have access to this hardware any longer.
I haven't seen any S3 hangs for a long time with SB, but I'm typically not running any OpenGL programs lately.
Still, assume fixed unless proven otherwise.