Bug 28582
Summary: | whole system hang when start X | ||
---|---|---|---|
Product: | Drivers | Reporter: | Gu Rui (chaos.proton) |
Component: | Video(DRI - Intel) | Assignee: | Ben Widawsky (ben) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | ben, chris, florian, jbarnes, maciej.rutecki, rjw |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.37+ | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 21782 | ||
Attachments: |
Disable rc6
wait for MI_SET_CONTEXT to finish d-i-f latest wait for MI_SET_CONTEXT to finish Re-enable rc6 use i915_gpu_idle i915.o split PWRCTXA and RSTDBYCTL added a second MI_FLUSH use PIPE_CONTROL to flush context use PIPE_CONTROL to flush context (one flush only) polling method to determine MI_SET_CONTEXT completion |
Description
Gu Rui
2011-02-08 08:33:11 UTC
This is likely just another symptom of a broken ddx and: commit 9334ef755f060e251f3f395caeda1a58b6834ea3 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Jan 28 11:53:03 2011 +0000 drm: Don't switch fb when disabling an output In drm_crtc_helper_set_config, we call drm_crtc_helper_set_mode which may return early and do no operation if the crtc is to be disabled. In this case we merrily swap to the new fb, discarding the old_fb believing that it has been cleaned up. However, due to the early return, the old_fb was not presented to the backend for correct reaping, and nor was the new one - which is about to be reaped via the drm_helper_disable_unused_functions(), leading to incorrect refcounting of the pinned objects. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27722 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29857 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29230 Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> I've tried both kernel(2.6.38-rc4) with/revert that commit, none works... ;( I think they are different bug because in my case, once X starts, the whole system freeze. Let alone a back-trace or VT switching... Which leaves you with little choice but to first check you are not running broken components of the driver in userspace and to bisect. OK, here is the result:) ~/linux-git% git bisect bad d5bb081b027b520f9e59b4fb8faea83a136ec15e is the first bad commit commit d5bb081b027b520f9e59b4fb8faea83a136ec15e Author: Jesse Barnes <jbarnes@virtuousgeek.org> Date: Wed Jan 5 12:01:26 2011 -0800 drm/i915: cleanup rc6 code Cleanup several aspects of the rc6 code: - misnamed intel_disable_clock_gating function (was only about rc6) - remove commented call to intel_disable_clock_gating - rc6 enabling code belongs in its own function (allows us to move the actual clock gating enable call back into restore_state) - allocate power & render contexts up front, only free on unload (avoids ugly lazy init at rc6 enable time) Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> [ickle: checkpatch cleanup] Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> :040000 040000 abcd868fa0ee4fcd647e85ccb0850c644045d474 8b1c544aaf7aa14e1e25bfaa12199dcecf578693 M drivers git master works well if I revert that commit. I think we have found the true bad commit. Created attachment 46992 [details]
Disable rc6
Please can you test this patch.
This patch can solve the problem. Amazing ;) Thank you very much ;-) First-Bad-Commit : d5bb081b027b520f9e59b4fb8faea83a136ec15e I've applied that patch to -fixes: commit ac66808814036b4c33dd98091b2176ae6157f1a8 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Feb 9 16:15:32 2011 +0000 drm/i915: Disable RC6 on Ironlake The automatic powersaving feature is once again causing havoc, with 100% reliable hangs on boot and resume on affected machines. Reported-by: Francesco Allertsen <fallertsen@gmail.com> Reported-by: Gui Rui <chaos.proton@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=28582 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Rui, what does "intel_reg_read 0x111b8" report on your platform? Also, with just the bisected commit reverted, what does "cat /sys/kernel/debug/dri/0/i915_drpc_info" report while gfx is idle (either from the console or an ssh session). I use the git version(d4127e0e68315b66) of intel-gpu-tools, but it gives: # intel_reg_read 0x111b8 Couldn't map MMIO region: No such file or directory And there is nothing in /sys/kernel/debug/... I have turned on debug kernel and debugfs options. What other options do I need to turn on? FYI, I saw Chris' commit gone into Linus' branch. So it will be JustWorks(tm) now. I don't know where the true problem lays but hope you can fix it eventually. merged in .38-rc5: commit ac66808814036b4c33dd98091b2176ae6157f1a8 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Feb 9 16:15:32 2011 +0000 drm/i915: Disable RC6 on Ironlake should this bug marked closed? ping Jesse, who has been assigned to this bug entry. I'd like to find the root cause here instead of covering up the symptom again. You may need to mount debugfs "mount -t debugfs debugfs /sys/kernel/debug" before you can see the file I mentioned above. How is the behavior if you revert d5bb081b027b520f9e59b4fb8faea83a136ec15e or just checkout the tree at 88271da3f3da75d6eaef5e768c82a1627edf7088 (the previous commit)? If things work w/o d5bb08 applied, we should probably revert it as well. With the patch in http://www.spinics.net/lists/dri-devel/msg08028.html and w/o d5bb08 and ac6680 applied.My /sys/kernel/debug/dri/0/i915_drpc_info is: HD boost: no Boost freq: 6 HW control enabled: no SW control enabled: yes Gated voltage change: no Starting frequency: P5 Max P-state: P0 Min P-state: P7 RS1 VID: 0 RS2 VID: 8 Render standby enabled: yes Current RS state: RS2 (RC6) RSTDBYCTL: 0x474c3000 And there is a /sys/kernel/debug/dri/64/i915_drpc_info with the same content. Whether X is running or not, the result is the same. > and w/o d5bb08 and ac6680 applied.My /sys/kernel/debug/dri/0/i915_drpc_info
> is:
>
> HD boost: no
> Boost freq: 6
> HW control enabled: no
> SW control enabled: yes
> Gated voltage change: no
> Starting frequency: P5
> Max P-state: P0
> Min P-state: P7
> RS1 VID: 0
> RS2 VID: 8
> Render standby enabled: yes
> Current RS state: RS2 (RC6)
> RSTDBYCTL: 0x474c3000
>
> And there is a /sys/kernel/debug/dri/64/i915_drpc_info with the
> same content. Whether X is running or not, the result is the same.
Thanks, and you don't see the hang? Do both X and suspend/resume work
properly?
As I replied in the thread "Intel i915 freeze on latest git", no hang there. But suspend/resume doesn't working. Francesco has bisected the first bad commit is 88271da3f. > As I replied in the thread "Intel i915 freeze on latest git", no hang there.
> But suspend/resume doesn't working. Francesco has bisected the first bad
> commit
> is 88271da3f.
Ok, so you're seeing the same thing... let me see what I can come up
with.
merged in 2.6.38-rc5: commit ac66808814036b4c33dd98091b2176ae6157f1a8 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Feb 9 16:15:32 2011 +0000 drm/i915: Disable RC6 on Ironlake I can't reproduce this unfortunately, but I have a theory. Patch first, then theory... diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index f3c0525..5ebe682 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -6442,6 +6442,7 @@ void intel_enable_clock_gating(struct drm_device *dev) OUT_RING(MI_NOOP); OUT_RING(MI_FLUSH); ADVANCE_LP_RING(); + mmiowb(); } else DRM_DEBUG_KMS("Failed to allocate render context." "Disable RC6\n"); If somehow PWRCTXA gets set before the commands for MI_SET_CONTEXT have been bound into the ring, it's possible to enter rc6 and then try to restore rc6 with an invalid context (because MI_SET_CONTEXT hasn't yet been executed). I feel this is a little far-fetched, but afaics, it is possible with just the wrong compiler/HW combination. That should have been wmb(), not mmiowb(). I was testing something else with mmiowb... diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index f3c0525..75129ee 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -6442,6 +6442,7 @@ void intel_enable_clock_gating(struct drm_device *dev) OUT_RING(MI_NOOP); OUT_RING(MI_FLUSH); ADVANCE_LP_RING(); + wmb(); } else DRM_DEBUG_KMS("Failed to allocate render context." "Disable RC6\n"); Hello Ben Widawsky, How to test your patch? I mean, is that OK to test your patch on latest git with ac668088 reverted? (In reply to comment #21) > If somehow PWRCTXA gets set before the commands for MI_SET_CONTEXT have been > bound into the ring, it's possible to enter rc6 and then try to restore rc6 > with an invalid context (because MI_SET_CONTEXT hasn't yet been executed). In which case we need more than a compiler barrier (and more than a PCI posting), but we need to wait for the GPU. We could go the whole hog and do a i915_add_request(); i915_wait_request() or we could add a simpler poll in cases where we expect it to be nearly instantaneous. (Also neatly providing the MMIO write barrier. ;-) The theory sounds good. Gu, You should be able to try the wmb() after the ADVANCE_RING() with just ac668088 reverted... I think Chris makes a good point though, I'm attaching a patch for that based off of 88271da3. However, this patch will not work with just ac668088 reverted because the APIs have changed too much. I can create another patch for you if you tell me which commit I should start from. Created attachment 50642 [details]
wait for MI_SET_CONTEXT to finish
based off of 88271da3
Tried on 88271da3. But it seems that nothing happened -- there is neither "bug 28582 fail" nor "bug 28582 workaround on" appear in dmesg.(dmesg | grep 28582 yields nothing)... Besides, I think there should be braces after "dev_priv->pwrctx = intel_alloc_context_page(dev);" and "if (dev_priv->pwrctx == NULL)". Am I right?(The kernel I tried doesn't have them) I think since d5bb081b0 is the first bad commit, is better to patch based on it or latest git with ac668088 reverted, if it's convenience for you. Thank you very much on the effort. You are right it does need braces. I moved the workaround at the last minute and forgot to update. However I think it will still work as is, it just won't handle the failure case properly. As for not seeing any message, you will only see DRM_INFO if you are using the appropriate debug flags. You can do something like this, to see DRM_INFO messages: echo 6 > /sys/module/drm/parameters/debug, or use drm.debug= when you load it. If you didn't see the error, and X didn't hang, then it was a success, wasn't it? New patch coming.. Created attachment 50682 [details]
d-i-f latest wait for MI_SET_CONTEXT to finish
This patch should be used again d-i-f 467cffba85791cdfce38c124d75bd578f4bb8625.
It does not require any reverts. The patch itself re-enables rc6. I found that doing the revert as you mentioned left a bit of a mess.
Once again, DRM debug flags will need to be set if you expect to see a success message. Success should really be measured as no error + no hang.
Awesome Ben! Now I can confirm that your patch works! ;-) Thank you very much. Gu, can you please clarify. Did the patch make the hang go away, or did the patch just apply cleanly? The patch both apply cleanly and make the hang go away. Thanks very much ;-) Thanks! I'm attaching one more patch for test. This should have the same effect, and is the patch I'm submitting to the intel-gfx mailing list. Created attachment 50822 [details]
Re-enable rc6
This patch is cleaner than the last version. It should have the same effect as the other patch.
Oops, patch 50822(applied on clean latest git) cause system hang... Created attachment 50832 [details]
use i915_gpu_idle
Please try this patch.
If this doesn't work then I'm really stumped.
unfortunately, patch 50832 doesn't work neither... ;( I could sort of understand why 50822 didn't work, but 50832 should work. Hopefully you're building this as a module. If so, could you please upload drivers/gpu/drm/i915/i915.o after patch 50832. If you've built it in kernel, and aren't willing to build as module, please upload or email me a gzipped vmlinux. Thanks. (In reply to comment #34) > Created an attachment (id=50822) [details] > Re-enable rc6 > > This patch is cleaner than the last version. It should have the same effect > as > the other patch. Don't trust the comments next to LOAD_IMM_REG, whilst Daniel may have determined them from experimentation, the spec say that the variable length is solely to handle 32/64 bit writes. So use two LOAD_IMM_REG commands for two regs. (In reply to comment #34) > Created an attachment (id=50822) [details] > Re-enable rc6 And I should have also said, you need to ensure that the GPU is idle in teardown if manipulating the same registers as teardown by the GPU. Created attachment 50872 [details]
i915.o
Created attachment 50922 [details]
split PWRCTXA and RSTDBYCTL
This addresses Chris' comment about misuse of MI_LOAD_REGISTER_IMM.
Created attachment 50932 [details]
added a second MI_FLUSH
This patch is extra safe in synchronizing the MI_SET_CONTEXT, addressing Chris' comment from the intel-gfx mailing list.
Hello Gu. I looked at your i915.o, and it appeared fine. As Chris talked about on the intel-gfx mailing list, it makes sense that my last patch did not work. I've uploaded two new patches, which should both be applied from a clean branch. Please let us know if either or both fix the problem. Oops, neither 50932 nor 50922 works... (X hang) ;( Weirder still. Didn't trigger a hang for me. Can you post some debugging info (Xorg.log and dmesg) from the hang? Created attachment 51012 [details]
use PIPE_CONTROL to flush context
Try this patch from clean latest d-i-f
Created attachment 51022 [details]
use PIPE_CONTROL to flush context (one flush only)
If the previous patch works, please try this patch, also from clean d-i-f
No, 51012 not working. When X hangs, the system hangs.(remote ssh session got time out) So I cannot get debug info from the point. Sorry... Created attachment 51102 [details]
polling method to determine MI_SET_CONTEXT completion
This patch is similar to 50682 in what it tries to accomplish (wait for the command parser to finish MI_SET_CONTEXT). If this patch works, it leaves me a little confused. If this patch does not work, it tells us that the add_request mechanism (which has a few HW side effects) has some magic. This would help us greatly to narrow down the actual issue.
Patch 51102 works. ;-) (In reply to comment #50) > Created an attachment (id=51102) [details] > polling method to determine MI_SET_CONTEXT completion Ok. looks like we have a winner and that setting the regs from the ring is not truly safe. Ben, can you tidy up the error path (no need to print the HEAD/TAIL values and make the final error DRM_ERROR("Unable to set rc6 context, disabling\n")) and send to the list? The original bug is fixed in mainline by disabling rc6, and we've queued the latest, most successful, attempt to enable rc6 again for -next. This way we have a few months of testing before it hits mainline... Chris, do you have a pointer on where I should look if the patch has trickled mainwards already? If you resolve|patch_already_available, please also post a link to the patch in the form (on a single line): Patch: [direct link to the patch] The link can be patchwork, a git-web repository, mailinglist-post or smth similar... That way it helps crossing things off (and Rafaels scripts can do the right thing). Thanks, Flo s/on a single line/on it's own line/ I was referring to the patch attached... For completeness, the one series of fixes that came from this one bug: https://patchwork.kernel.org/patch/646311/ Thanks. There are three patches attached. But I take it, the original regression was fixed by : commit ac66808814036b4c33dd98091b2176ae6157f1a8 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Feb 9 16:15:32 2011 +0000 drm/i915: Disable RC6 on Ironlake in .38-rc5 and as such we should close this regression. [ I'm in a bit of a pickle, as I don't want to impede on any development and bugfixing, but still those open regressions reports do suck :-) It get's more likely to overlook something if there are many not-closed regressions... ] Exactly, we disabled rc6 on Ironlake to restore 2.6.37 behaviour (commit ac6680) and now we think we understand the underlying bug and have patches in the pipeline to correctly enable the improved rc6 powersaving modes. A patch referencing this bug report has been merged in v3.0-rc1: commit 4a246cfc3c337ecb800d508ee5ed906534edb25c Author: Ben Widawsky <ben@bwidawsk.net> Date: Sat Mar 19 18:14:28 2011 -0700 drm/i915: fix rc6 initialization on Ironlake |