Bug 28582 - whole system hang when start X
Summary: whole system hang when start X
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel) (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Ben Widawsky
URL:
Keywords:
Depends on:
Blocks: 21782
  Show dependency tree
 
Reported: 2011-02-08 08:33 UTC by Gu Rui
Modified: 2011-05-30 07:53 UTC (History)
6 users (show)

See Also:
Kernel Version: 2.6.37+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Disable rc6 (7.24 KB, patch)
2011-02-09 16:30 UTC, Chris Wilson
Details | Diff
wait for MI_SET_CONTEXT to finish (1.26 KB, patch)
2011-03-11 17:22 UTC, Ben Widawsky
Details | Diff
d-i-f latest wait for MI_SET_CONTEXT to finish (1.75 KB, patch)
2011-03-12 16:38 UTC, Ben Widawsky
Details | Diff
Re-enable rc6 (2.30 KB, patch)
2011-03-15 05:51 UTC, Ben Widawsky
Details | Diff
use i915_gpu_idle (1.07 KB, patch)
2011-03-15 06:37 UTC, Ben Widawsky
Details | Diff
i915.o (495.02 KB, application/octet-stream)
2011-03-15 08:01 UTC, Gu Rui
Details
split PWRCTXA and RSTDBYCTL (1.50 KB, patch)
2011-03-15 17:27 UTC, Ben Widawsky
Details | Diff
added a second MI_FLUSH (1.55 KB, patch)
2011-03-15 17:29 UTC, Ben Widawsky
Details | Diff
use PIPE_CONTROL to flush context (4.85 KB, patch)
2011-03-17 01:30 UTC, Ben Widawsky
Details | Diff
use PIPE_CONTROL to flush context (one flush only) (4.81 KB, patch)
2011-03-17 01:31 UTC, Ben Widawsky
Details | Diff
polling method to determine MI_SET_CONTEXT completion (2.32 KB, patch)
2011-03-17 20:59 UTC, Ben Widawsky
Details | Diff

Description Gu Rui 2011-02-08 08:33:11 UTC
At least 2.6.36 kernel works well. My GPU is Arrandale. 2.6.37 works in most of the times. The symptom is when you start X(either bare X or kdm), the whole system hang. No keyboard response, no network response(so I could not ssh into my system to inspect what is going on)...

If I add i915.modeset=0, X could not start. But in this condition, system doesn't hang. I can switch to console. There is no error message on dmesg nor X.0.log.

My box is DELL Vostro3500. My distribution is Slackware-current. pkg info:

mesa-7.10_84b857e-x86_64-1
xorg-server-1.9.3-x86_64-1
xf86-video-intel-2.14.0-x86_64-1

FYI, I saw the screen got teared for a little fraction of time when starting mode set on console.
Comment 1 Chris Wilson 2011-02-08 09:49:29 UTC
This is likely just another symptom of a broken ddx and:

commit 9334ef755f060e251f3f395caeda1a58b6834ea3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 28 11:53:03 2011 +0000

    drm: Don't switch fb when disabling an output
    
    In drm_crtc_helper_set_config, we call drm_crtc_helper_set_mode which
    may return early and do no operation if the crtc is to be disabled. In
    this case we merrily swap to the new fb, discarding the old_fb believing
    that it has been cleaned up. However, due to the early return, the
    old_fb was not presented to the backend for correct reaping, and nor was
    the new one - which is about to be reaped via the
    drm_helper_disable_unused_functions(), leading to incorrect refcounting
    of the pinned objects.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27722
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29857
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29230
    Tested-by: Takashi Iwai <tiwai@suse.de>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 2 Gu Rui 2011-02-08 16:10:40 UTC
I've tried both kernel(2.6.38-rc4) with/revert that commit, none works... ;( I think they are different bug because in my case, once X starts, the whole system freeze. Let alone a back-trace or VT switching...
Comment 3 Chris Wilson 2011-02-08 16:18:09 UTC
Which leaves you with little choice but to first check you are not running broken components of the driver in userspace and to bisect.
Comment 4 Gu Rui 2011-02-09 15:23:34 UTC
OK, here is the result:)

~/linux-git% git bisect bad
d5bb081b027b520f9e59b4fb8faea83a136ec15e is the first bad commit
commit d5bb081b027b520f9e59b4fb8faea83a136ec15e
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Wed Jan 5 12:01:26 2011 -0800

    drm/i915: cleanup rc6 code
    
    Cleanup several aspects of the rc6 code:
      - misnamed intel_disable_clock_gating function (was only about rc6)
      - remove commented call to intel_disable_clock_gating
      - rc6 enabling code belongs in its own function (allows us to move the
        actual clock gating enable call back into restore_state)
      - allocate power & render contexts up front, only free on unload
        (avoids ugly lazy init at rc6 enable time)
    
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    [ickle: checkpatch cleanup]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

:040000 040000 abcd868fa0ee4fcd647e85ccb0850c644045d474 8b1c544aaf7aa14e1e25bfaa12199dcecf578693 M      drivers
Comment 5 Gu Rui 2011-02-09 15:49:30 UTC
git master works well if I revert that commit. I think we have found the true bad commit.
Comment 6 Chris Wilson 2011-02-09 16:30:10 UTC
Created attachment 46992 [details]
Disable rc6

Please can you test this patch.
Comment 7 Gu Rui 2011-02-09 16:58:10 UTC
This patch can solve the problem. Amazing ;)

Thank you very much ;-)
Comment 8 Rafael J. Wysocki 2011-02-09 17:02:54 UTC
First-Bad-Commit : d5bb081b027b520f9e59b4fb8faea83a136ec15e
Comment 9 Chris Wilson 2011-02-09 17:06:47 UTC
I've applied that patch to -fixes:

commit ac66808814036b4c33dd98091b2176ae6157f1a8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 9 16:15:32 2011 +0000

    drm/i915: Disable RC6 on Ironlake
    
    The automatic powersaving feature is once again causing havoc, with 100%
    reliable hangs on boot and resume on affected machines.
    
    Reported-by: Francesco Allertsen <fallertsen@gmail.com>
    Reported-by: Gui Rui <chaos.proton@gmail.com>
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=28582
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 10 Jesse Barnes 2011-02-09 18:10:40 UTC
Rui, what does "intel_reg_read 0x111b8" report on your platform?  Also, with just the bisected commit reverted, what does "cat /sys/kernel/debug/dri/0/i915_drpc_info" report while gfx is idle (either from the console or an ssh session).
Comment 11 Gu Rui 2011-02-10 02:57:03 UTC
I use the git version(d4127e0e68315b66) of intel-gpu-tools, but it gives:

# intel_reg_read 0x111b8
Couldn't map MMIO region: No such file or directory

And there is nothing in /sys/kernel/debug/... I have turned on debug kernel and debugfs options. What other options do I need to turn on?
Comment 12 Gu Rui 2011-02-16 12:53:10 UTC
FYI, I saw Chris' commit gone into Linus' branch. So it will be JustWorks(tm) now. I don't know where the true problem lays but hope you can fix it eventually.
Comment 13 Florian Mickler 2011-02-20 00:02:10 UTC
merged in .38-rc5:
commit ac66808814036b4c33dd98091b2176ae6157f1a8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 9 16:15:32 2011 +0000

    drm/i915: Disable RC6 on Ironlake


should this bug marked closed?
Comment 14 Gu Rui 2011-02-22 01:39:14 UTC
ping Jesse, who has been assigned to this bug entry.
Comment 15 Jesse Barnes 2011-02-22 17:12:21 UTC
I'd like to find the root cause here instead of covering up the symptom again.

You may need to mount debugfs "mount -t debugfs debugfs /sys/kernel/debug" before you can see the file I mentioned above.

How is the behavior if you revert d5bb081b027b520f9e59b4fb8faea83a136ec15e or just checkout the tree at 88271da3f3da75d6eaef5e768c82a1627edf7088 (the previous commit)?  If things work w/o d5bb08 applied, we should probably revert it as well.
Comment 16 Gu Rui 2011-02-22 17:46:49 UTC
With the patch in

http://www.spinics.net/lists/dri-devel/msg08028.html

and w/o d5bb08 and ac6680 applied.My /sys/kernel/debug/dri/0/i915_drpc_info is:

HD boost: no
Boost freq: 6
HW control enabled: no
SW control enabled: yes
Gated voltage change: no
Starting frequency: P5
Max P-state: P0
Min P-state: P7
RS1 VID: 0
RS2 VID: 8
Render standby enabled: yes
Current RS state: RS2 (RC6)
RSTDBYCTL: 0x474c3000

And there is a /sys/kernel/debug/dri/64/i915_drpc_info with the
same content. Whether X is running or not, the result is the same.
Comment 17 Jesse Barnes 2011-02-22 19:05:38 UTC
> and w/o d5bb08 and ac6680 applied.My /sys/kernel/debug/dri/0/i915_drpc_info
> is:
> 
> HD boost: no
> Boost freq: 6
> HW control enabled: no
> SW control enabled: yes
> Gated voltage change: no
> Starting frequency: P5
> Max P-state: P0
> Min P-state: P7
> RS1 VID: 0
> RS2 VID: 8
> Render standby enabled: yes
> Current RS state: RS2 (RC6)
> RSTDBYCTL: 0x474c3000
> 
> And there is a /sys/kernel/debug/dri/64/i915_drpc_info with the
> same content. Whether X is running or not, the result is the same.

Thanks, and you don't see the hang?  Do both X and suspend/resume work
properly?
Comment 18 Gu Rui 2011-02-23 02:24:01 UTC
As I replied in the thread "Intel i915 freeze on latest git",  no hang there. But suspend/resume doesn't working. Francesco has bisected the first bad commit is 88271da3f.
Comment 19 Jesse Barnes 2011-02-23 03:56:10 UTC
> As I replied in the thread "Intel i915 freeze on latest git",  no hang there.
> But suspend/resume doesn't working. Francesco has bisected the first bad
> commit
> is 88271da3f.

Ok, so you're seeing the same thing... let me see what I can come up
with.
Comment 20 Florian Mickler 2011-03-04 23:48:02 UTC
merged in 2.6.38-rc5:

commit ac66808814036b4c33dd98091b2176ae6157f1a8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 9 16:15:32 2011 +0000

    drm/i915: Disable RC6 on Ironlake
Comment 21 Ben Widawsky 2011-03-07 19:07:11 UTC
I can't reproduce this unfortunately, but I have a theory. Patch first, then theory...

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index f3c0525..5ebe682 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -6442,6 +6442,7 @@ void intel_enable_clock_gating(struct drm_device *dev)
                        OUT_RING(MI_NOOP);
                        OUT_RING(MI_FLUSH);
                        ADVANCE_LP_RING();
+                       mmiowb();
                } else
                        DRM_DEBUG_KMS("Failed to allocate render context."
                                       "Disable RC6\n");


If somehow PWRCTXA gets set before the commands for MI_SET_CONTEXT have been bound into the ring, it's possible to enter rc6 and then try to restore rc6 with an invalid context (because MI_SET_CONTEXT hasn't yet been executed).

I feel this is a little far-fetched, but afaics, it is possible with just the wrong compiler/HW combination.
Comment 22 Ben Widawsky 2011-03-07 19:12:58 UTC
That should have been wmb(), not mmiowb(). I was testing something else with mmiowb...


diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index f3c0525..75129ee 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -6442,6 +6442,7 @@ void intel_enable_clock_gating(struct drm_device *dev)
                        OUT_RING(MI_NOOP);
                        OUT_RING(MI_FLUSH);
                        ADVANCE_LP_RING();
+                       wmb();
                } else
                        DRM_DEBUG_KMS("Failed to allocate render context."
                                       "Disable RC6\n");
Comment 23 Gu Rui 2011-03-11 11:40:22 UTC
Hello Ben Widawsky,

How to test your patch? I mean, is that OK to test your patch on latest git with ac668088 reverted?
Comment 24 Chris Wilson 2011-03-11 12:12:56 UTC
(In reply to comment #21) 
> If somehow PWRCTXA gets set before the commands for MI_SET_CONTEXT have been
> bound into the ring, it's possible to enter rc6 and then try to restore rc6
> with an invalid context (because MI_SET_CONTEXT hasn't yet been executed).

In which case we need more than a compiler barrier (and more than a PCI posting), but we need to wait for the GPU. We could go the whole hog and do a i915_add_request(); i915_wait_request() or we could add a simpler poll in cases where we expect it to be nearly instantaneous. (Also neatly providing the MMIO write barrier. ;-)

The theory sounds good.
Comment 25 Ben Widawsky 2011-03-11 17:21:56 UTC
Gu,

You should be able to try the wmb() after the ADVANCE_RING() with just ac668088 reverted...

I think Chris makes a good point though, I'm attaching a patch for that based off of 88271da3. However, this patch will not work with just ac668088 reverted because the APIs have changed too much. I can create another patch for you if you tell me which commit I should start from.
Comment 26 Ben Widawsky 2011-03-11 17:22:57 UTC
Created attachment 50642 [details]
wait for MI_SET_CONTEXT to finish

based off of 88271da3
Comment 27 Gu Rui 2011-03-12 09:06:18 UTC
Tried on 88271da3. But it seems that nothing happened -- there is neither "bug 28582 fail" nor "bug 28582 workaround on" appear in dmesg.(dmesg | grep 28582 yields nothing)...

Besides, I think there should be braces after "dev_priv->pwrctx = intel_alloc_context_page(dev);" and "if (dev_priv->pwrctx == NULL)". Am I right?(The kernel I tried doesn't have them)

I think since d5bb081b0 is the first bad commit, is better to patch based on it or latest git with ac668088 reverted, if it's convenience for you.

Thank you very much on the effort.
Comment 28 Ben Widawsky 2011-03-12 15:54:51 UTC
You are right it does need braces. I moved the workaround at the last minute and forgot to update. However I think it will still work as is, it just won't handle the failure case properly.

As for not seeing any message, you will only see DRM_INFO if you are using the appropriate debug flags. You can do something like this, to see DRM_INFO messages:
echo 6 > /sys/module/drm/parameters/debug, or use drm.debug= when you load it.

If you didn't see the error, and X didn't hang, then it was a success, wasn't it?

New patch coming..
Comment 29 Ben Widawsky 2011-03-12 16:38:55 UTC
Created attachment 50682 [details]
d-i-f latest wait for MI_SET_CONTEXT to finish

This patch should be used again d-i-f 467cffba85791cdfce38c124d75bd578f4bb8625.

It does not require any reverts. The patch itself re-enables rc6. I found that doing the revert as you mentioned left a bit of a mess.

Once again, DRM debug flags will need to be set if you expect to see a success message. Success should really be measured as no error + no hang.
Comment 30 Gu Rui 2011-03-12 18:11:30 UTC
Awesome Ben! Now I can confirm that your patch works! ;-)

Thank you very much.
Comment 31 Ben Widawsky 2011-03-14 20:56:48 UTC
Gu, can you please clarify. Did the patch make the hang go away, or did the patch just apply cleanly?
Comment 32 Gu Rui 2011-03-15 04:50:10 UTC
The patch both apply cleanly and make the hang go away. Thanks very much ;-)
Comment 33 Ben Widawsky 2011-03-15 05:50:02 UTC
Thanks! I'm attaching one more patch for test. This should have the same effect, and is the patch I'm submitting to the intel-gfx mailing list.
Comment 34 Ben Widawsky 2011-03-15 05:51:27 UTC
Created attachment 50822 [details]
Re-enable rc6

This patch is cleaner than the last version. It should have the same effect as the other patch.
Comment 35 Gu Rui 2011-03-15 06:15:23 UTC
Oops, patch 50822(applied on clean latest git) cause system hang...
Comment 36 Ben Widawsky 2011-03-15 06:37:55 UTC
Created attachment 50832 [details]
use i915_gpu_idle

Please try this patch.

If this doesn't work then I'm really stumped.
Comment 37 Gu Rui 2011-03-15 07:07:21 UTC
unfortunately, patch 50832 doesn't work neither... ;(
Comment 38 Ben Widawsky 2011-03-15 07:24:35 UTC
I could sort of understand why 50822 didn't work, but 50832 should work.

Hopefully you're building this as a module. If so, could you please upload drivers/gpu/drm/i915/i915.o after patch 50832.

If you've built it in kernel, and aren't willing to build as module, please upload or email me a gzipped vmlinux.

Thanks.
Comment 39 Chris Wilson 2011-03-15 07:44:13 UTC
(In reply to comment #34)
> Created an attachment (id=50822) [details]
> Re-enable rc6
> 
> This patch is cleaner than the last version. It should have the same effect
> as
> the other patch.

Don't trust the comments next to LOAD_IMM_REG, whilst Daniel may have determined them from experimentation, the spec say that the variable length is solely to handle 32/64 bit writes. So use two LOAD_IMM_REG commands for two regs.
Comment 40 Chris Wilson 2011-03-15 07:46:12 UTC
(In reply to comment #34)
> Created an attachment (id=50822) [details]
> Re-enable rc6

And I should have also said, you need to ensure that the GPU is idle in teardown if manipulating the same registers as teardown by the GPU.
Comment 41 Gu Rui 2011-03-15 08:01:16 UTC
Created attachment 50872 [details]
i915.o
Comment 42 Ben Widawsky 2011-03-15 17:27:37 UTC
Created attachment 50922 [details]
split PWRCTXA and RSTDBYCTL 

This addresses Chris' comment about misuse of MI_LOAD_REGISTER_IMM.
Comment 43 Ben Widawsky 2011-03-15 17:29:03 UTC
Created attachment 50932 [details]
added a second MI_FLUSH

This patch is extra safe in synchronizing the MI_SET_CONTEXT, addressing Chris' comment from the intel-gfx mailing list.
Comment 44 Ben Widawsky 2011-03-15 17:30:28 UTC
Hello Gu.

I looked at your i915.o, and it appeared fine. As Chris talked about on the intel-gfx mailing list, it makes sense that my last patch did not work.

I've uploaded two new patches, which should both be applied from a clean branch. Please let us know if either or both fix the problem.
Comment 45 Gu Rui 2011-03-16 17:33:14 UTC
Oops, neither 50932 nor 50922 works... (X hang) ;(
Comment 46 Chris Wilson 2011-03-16 18:06:17 UTC
Weirder still. Didn't trigger a hang for me. Can you post some debugging info (Xorg.log and dmesg) from the hang?
Comment 47 Ben Widawsky 2011-03-17 01:30:21 UTC
Created attachment 51012 [details]
use PIPE_CONTROL to flush context

Try this patch from clean latest d-i-f
Comment 48 Ben Widawsky 2011-03-17 01:31:24 UTC
Created attachment 51022 [details]
use PIPE_CONTROL to flush context (one flush only)

If the previous patch works, please try this patch, also from clean d-i-f
Comment 49 Gu Rui 2011-03-17 10:57:41 UTC
No, 51012 not working. When X hangs, the system hangs.(remote ssh session got time out) So I cannot get debug info from the point. Sorry...
Comment 50 Ben Widawsky 2011-03-17 20:59:27 UTC
Created attachment 51102 [details]
polling method to determine MI_SET_CONTEXT completion

This patch is similar to 50682 in what it tries to accomplish (wait for the command parser to finish MI_SET_CONTEXT). If this patch works, it leaves me a little confused. If this patch does not work, it tells us that the add_request mechanism (which has a few HW side effects) has some magic. This would help us greatly to narrow down the actual issue.
Comment 51 Gu Rui 2011-03-18 05:13:30 UTC
Patch 51102 works. ;-)
Comment 52 Chris Wilson 2011-03-18 09:47:57 UTC
(In reply to comment #50)
> Created an attachment (id=51102) [details]
> polling method to determine MI_SET_CONTEXT completion

Ok. looks like we have a winner and that setting the regs from the ring is not truly safe. Ben, can you tidy up the error path (no need to print the HEAD/TAIL values and make the final error DRM_ERROR("Unable to set rc6 context, disabling\n")) and send to the list?
Comment 53 Chris Wilson 2011-03-21 08:34:18 UTC
The original bug is fixed in mainline by disabling rc6, and we've queued the latest, most successful, attempt to enable rc6 again for -next. This way we have a few months of testing before it hits mainline...
Comment 54 Florian Mickler 2011-03-30 21:03:58 UTC
Chris, do you have a pointer on where I should look if the patch has trickled mainwards already? 

If you resolve|patch_already_available, please also post a link to the patch in the form (on a single line):

Patch: [direct link to the patch]

The link can be patchwork, a git-web repository, mailinglist-post or smth similar... 

That way it helps crossing things off (and Rafaels scripts can do the right thing).

Thanks,
Flo
Comment 55 Florian Mickler 2011-03-30 21:04:30 UTC
s/on a single line/on it's own line/
Comment 56 Chris Wilson 2011-03-31 11:55:58 UTC
I was referring to the patch attached...

For completeness, the one series of fixes that came from this one bug:

https://patchwork.kernel.org/patch/646311/
Comment 57 Florian Mickler 2011-03-31 13:41:47 UTC
Thanks. There are three patches attached.


But I take it, the original regression was fixed by :

commit ac66808814036b4c33dd98091b2176ae6157f1a8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 9 16:15:32 2011 +0000

    drm/i915: Disable RC6 on Ironlake


in .38-rc5 and as such we should close this regression.

[ I'm in a bit of a pickle, as I don't want to impede on any development and bugfixing, but still those open regressions reports do suck :-) It get's more likely to overlook something if there are many not-closed regressions... ]
Comment 58 Chris Wilson 2011-03-31 14:23:58 UTC
Exactly, we disabled rc6 on Ironlake to restore 2.6.37 behaviour (commit ac6680) and now we think we understand the underlying bug and have patches in the pipeline to correctly enable the improved rc6 powersaving modes.
Comment 59 Florian Mickler 2011-05-30 07:53:11 UTC
A patch referencing this bug report has been merged in v3.0-rc1:

commit 4a246cfc3c337ecb800d508ee5ed906534edb25c
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Sat Mar 19 18:14:28 2011 -0700

    drm/i915: fix rc6 initialization on Ironlake

Note You need to log in before you can comment on or make changes to this bug.