Bug 43020 - [865G] hard hang after some idle time
Summary: [865G] hard hang after some idle time
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel) (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_video-dri-intel@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-03-31 13:17 UTC by Jean Delvare
Modified: 2012-04-30 09:52 UTC (History)
6 users (show)

See Also:
Kernel Version: 2.6.37.6-0.11-default, 3.1.0-1.2-default, 3.3.2
Tree: Mainline
Regression: No


Attachments
time out load detect on gen2 (1.91 KB, patch)
2012-03-31 14:30 UTC, Daniel Vetter
Details | Diff
Takashi's debug patch (2.09 KB, patch)
2012-04-01 12:04 UTC, Jean Delvare
Details | Diff
printk-trace dpms state changes (490 bytes, patch)
2012-04-19 07:42 UTC, Daniel Vetter
Details | Diff
dmesg excerpts (3.74 KB, text/plain)
2012-04-19 09:48 UTC, Jean Delvare
Details
more dpms debug info (1.90 KB, patch)
2012-04-19 11:09 UTC, Daniel Vetter
Details | Diff
patch as submitted to intel-gfx (2.66 KB, patch)
2012-04-20 19:00 UTC, Daniel Vetter
Details | Diff

Description Jean Delvare 2012-03-31 13:17:02 UTC
I have a system based on an Asus P4P800-VM motherboard with integrated Intel graphics:

00:02.0 VGA compatible controller [0300]: Intel Corporation 82865G Integrated Graphics Controller [8086:2572] (rev 02) (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. P5P800-MX Mainboard [1043:80a5]
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at f0000000 (32-bit, prefetchable) [size=128M]
	Memory at fe780000 (32-bit, non-prefetchable) [size=512K]
	I/O ports at efe0 [size=8]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [d0] Power Management version 1
	Kernel driver in use: i915

The system is running openSUSE 11.4 / x86 (32-bit). Every time I leave the system alone for some time, it freezes completely. When this happens, I see the power consumption of the system jump from 65 W to 100 W, which suggests a never-ending loop or a busy spinlock. System is completely unresponsive, display is black, nothing happens on key press or mouse movement, remote login through ssh is impossible. The time it takes to freeze is very variable, from 5 minutes to several hours.

I have set up a serial console to capture the fatal event and I have also set drm.debug=0x06 to gather sufficient information. The log I gathered points to the i915 driver.

I am using a KVM switch and I believe this is the trigger for this bug. The system never freezes when the KVM is switched to it, it only freezes when the KVM is switched to my other system. The difference is very visible in the log I gathered. When the KVM is switched to the problematic system, I see the following message every 30 seconds in the kernel log:

[drm:intel_crt_detect_ddc], CRT detected via DDC:0x50 [EDID]

But when the KVM is instead switched to the other system, the message is different:

[drm:intel_crt_load_detect], starting load-detect on CRT

After some time, I get:

[drm:intel_update_fbc],                                          
[drm:intel_crt_load_detect], starting load-detect on CRT         

And then a few minutes later, the fatal error (long lines truncated by serial console):

[ 5265.356996] BUG: soft lockup - CPU#0 stuck for 63s! [upowerd:2509]
[ 5265.360003] Modules linked in: w83627hf hwmon_vid snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfs lockd fscache nfs_acl]
[ 5265.360003] Modules linked in: w83627hf hwmon_vid snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfs lockd fscache nfs_acl]
[ 5265.360003] 
[ 5265.360003] Pid: 2509, comm: upowerd Not tainted 2.6.37.6-0.11-default #1 To Be Filled By O.E.M. by More String To Be Filled By M
[ 5265.360003] EIP: 0060:[<f8a036d7>] EFLAGS: 00000286 CPU: 0
[ 5265.360003] EIP is at intel_crt_load_detect+0x247/0x570 [i915]
[ 5265.360003] EAX: f8cf0000 EBX: f7046000 ECX: 00000326 EDX: 00000300
[ 5265.360003] ESI: 00070000 EDI: 0000015e EBP: 00000301 ESP: f400fe94
[ 5265.360003]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 5265.360003] Process upowerd (pid: 2509, ti=f400e000 task=f40ef2b0 task.ti=f400e000)
[ 5265.360003] Stack:
[ 5265.360003]  00000004 f8a25fd6 f8a1cfb7 f8a26011 f71c1400 00000001 00060020 00000301
[ 5265.360003]  00060010 032502ff 00000000 032502ff 00000326 f722f000 00000001 f7224680
[ 5265.360003]  00000001 f8a03b27 c0905b40 f40ef2b0 c0905b44 00000000 f40ef2b0 f722f004
[ 5265.360003] Call Trace:
[ 5265.360003]  [<f8a03b27>] intel_crt_detect+0x127/0x140 [i915]
[ 5265.360003]  [<f8803857>] status_show+0x37/0x80 [drm]
[ 5265.360003]  [<c04a9ce1>] dev_attr_show+0x21/0x50
[ 5265.360003]  [<c0368bb0>] sysfs_read_file+0xc0/0x180
[ 5265.360003]  [<c030e68d>] vfs_read+0x9d/0x170
[ 5265.360003]  [<c030e7a1>] sys_read+0x41/0x70
[ 5265.360003]  [<c0202f10>] sysenter_do_call+0x12/0x22
[ 5265.360003]  [<ffffe430>] 0xffffe430
[ 5265.360003] Code: 00 00 8d 2c 10 d1 ed 90 8d 74 26 00 8b 43 0c 01 f0 8b 38 f6 83 88 02 00 00 01 0f 85 44 01 00 00 39 fa 76 e8 8b 
[ 5265.360003] Call Trace:
[ 5265.360003]  [<f8a03b27>] intel_crt_detect+0x127/0x140 [i915]
[ 5265.360003]  [<f8803857>] status_show+0x37/0x80 [drm]
[ 5265.360003]  [<c04a9ce1>] dev_attr_show+0x21/0x50
[ 5265.360003]  [<c0368bb0>] sysfs_read_file+0xc0/0x180
[ 5265.360003]  [<c030e68d>] vfs_read+0x9d/0x170
[ 5265.360003]  [<c030e7a1>] sys_read+0x41/0x70
[ 5265.360003]  [<c0202f10>] sysenter_do_call+0x12/0x22
[ 5265.360003]  [<ffffe430>] 0xffffe430

And gdb says:

(gdb) list *(intel_crt_detect+0x127)
0x2ab57 is in intel_crt_detect (/usr/src/debug/kernel-default-2.6.37.6/linux-2.6.37/drivers/gpu/drm/i915/intel_crt.c:490).
485			return connector->status;
486	
487		/* for pre-945g platforms use load detect */
488		crtc = crt->base.base.crtc;
489		if (crtc && crtc->enabled) {
490			status = intel_crt_load_detect(crtc, crt);
491		} else {
492			crtc = intel_get_load_detect_pipe(&crt->base, connector,
493							  NULL, &dpms_mode);
494			if (crtc) {
(gdb) list *(intel_crt_load_detect+0x247)
0x2a707 is in intel_crt_load_detect (/usr/src/debug/kernel-default-2.6.37.6/linux-2.6.37/drivers/gpu/drm/i915/i915_drv.h:1198).
1193	static inline u32 i915_read(struct drm_i915_private *dev_priv, u32 reg)
1194	{
1195		u32 val;
1196	
1197		val = readl(dev_priv->regs + reg);
1198		if (dev_priv->debug_flags & I915_DEBUG_READ)
1199			printk(KERN_ERR "read 0x%08x from 0x%08x\n", val, reg);
1200		return val;
1201	}
1202	
(gdb)
Comment 1 Chris Wilson 2012-03-31 13:20:31 UTC
Locking bug in status_show().

commit 007c80a5497a3f9c8393960ec6e6efd30955dcb1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 15 11:40:00 2011 +0000

    drm: Hold the mode mutex whilst probing for sysfs status
    
    As detect will use hw registers and may modify structures, it needs to be
    serialised by use of the dev->mode_config.mutex. Make it so.
    
    Otherwise, we may cause random crashes as the sysfs file is queried
    whilst a concurrent hotplug poll is being run.
Comment 2 Jean Delvare 2012-03-31 13:38:45 UTC
I can reproduce the bug with openSUSE 12.1, which runs kernel 3.1.0-1.2-default. This rules out commit 007c80a5497a3f9c8393960ec6e6efd30955dcb1 as being the fix, as this commit made it into kernel 2.6.39 and is thus included in the openSUSE 12.1 kernel.
Comment 3 Daniel Vetter 2012-03-31 14:30:42 UTC
Created attachment 72768 [details]
time out load detect on gen2

Please try out the attached patch, preferably on a recent kernel (v3.3). Please boot with drm.debug=0xe and attach the last few lines of dmesg if your machine times out in there somewhere (or if it still dies).
Comment 4 Jean Delvare 2012-03-31 14:37:26 UTC
I am already in progress of testing a similar patch provided by Takashi Iwai. I'll report when I am done, but initial kernel build will take long as this machine is quite slow.
Comment 5 Jean Delvare 2012-03-31 19:58:57 UTC
Out of curiosity, how is drm.debug=0xe any better than drm.debug=0x6? I only see 3 DRM_UT_* flags defined in <drm/drmP.h>.
Comment 6 Jean Delvare 2012-03-31 21:40:22 UTC
With Takashi's patch, I get:

[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_crt_load_detect] *ERROR* pipe_dsl_reg timeout: bf, vsample=43a

looping over and over. So the never-ending loop is the second one, dsl is stuck at 0xbf == 191 and thus never reaches vsample == 0x43a == 1082.
Comment 7 Jean Delvare 2012-04-01 08:25:41 UTC
More error messages captured this morning (note that timeout was set to 500 ms for the whole block):

[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_crt_load_detect] *ERROR* pipe_dsl_reg timeout: 1a9, vsample=43a

This one happened twice. I have added some more debug messages, I'll report when I get the results.
Comment 8 Jean Delvare 2012-04-01 11:59:52 UTC
I had more samples of this, dsl value differs, but once the messages start, the dsl value stays constant until I switch the KVM back to the system and wake up the screen. Then next time only that the problem happens, the dsl value is different (425, 401, 451...)

So my take is that the cause of the timeout (or freeze without the patch) is that the value in pipe_dsl_reg simply stops being updated.
Comment 9 Jean Delvare 2012-04-01 12:04:30 UTC
Created attachment 72781 [details]
Takashi's debug patch

For reference, here is the debug patch from Takashi. Results from comments #6, #7 and #8 are with this patch applied on top of kernel 2.6.37.6.
Comment 10 Jean Delvare 2012-04-02 07:01:39 UTC
And for completeness, test results with kernel 3.3.0 + Daniel's debug patch:

[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_crt_load_detect], timed out waiting for vactive in load_detect, scanline: 195
[drm:intel_crt_load_detect], timed out while load-detecting, scanline: 195

looping over and over. This confirms that the bug is still present in kernel 3.3.0. If I switch back to the system to break the loop, and then away from it, I get a similar loop after a moment, with a different scanline:

[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_crt_load_detect], timed out waiting for vactive in load_detect, scanline: 303
[drm:intel_crt_load_detect], timed out while load-detecting, scanline: 303

etc. So it's exactly the same as with kernel 2.6.37.6.
Comment 11 Jean Delvare 2012-04-19 07:22:32 UTC
What do we do about this bug? Daniel's or Takashi's patch would be good to have as a safety measure (and FWIW, I prefer Daniel patch's output, but I believe Takashi's patch is somewhat better as it exits more quickly when a problem is spotted) but it would be even better if we could just know when the clock is down (I believe?) and there's no point in attempting a CRT load detect. Can this be done?
Comment 12 Daniel Vetter 2012-04-19 07:42:24 UTC
Created attachment 72978 [details]
printk-trace dpms state changes

I looks like something turns of our crtc while we don't expect it to be turned of. And setting dpms back to on seems to fix it.

To figure our where things go wrong, can you apply this patch on top of either of the 2 workaround patches, reproduce the issues, break the loop by waking up the screen and then attach the dmesg, please?
Comment 13 Jean Delvare 2012-04-19 09:48:41 UTC
Created attachment 72980 [details]
dmesg excerpts

Relevant excerpts from dmesg with both patches applied. I triggered the problem twice during the capture period.
Comment 14 Daniel Vetter 2012-04-19 11:09:22 UTC
Created attachment 72981 [details]
more dpms debug info

I've also noticed a part of the load-detect code that looks strange and disabled it. Things should still work like before, maybe even better. Again, please test and attach dmesg when it gets stuck and recovers again (by unblanking).
Comment 15 Jean Delvare 2012-04-19 14:39:35 UTC
Latest patch wouldn't apply on top of kernel 2.6.37.6, so I'm rebuilding a new kernel (3.3.2). This will take a while.
Comment 16 Jesse Barnes 2012-04-19 15:13:52 UTC
On 8xx I can easily imagine the BIOS trying to be helpful and poking at plane & pipe regs for us.

Is there a BIOS option on this machine for blanking?  If not, we'll need to add some code to check our pipe status in more places...
Comment 17 Chris Wilson 2012-04-19 16:13:10 UTC
Crucial question if Daniel's patch works is then why we have an enabled crtc attached when probing the connector for an attachment. Perhaps

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index 60ccfe3..8a4ef2e 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -5346,6 +5346,9 @@ void intel_release_load_detect_pipe(struct intel_encoder *
                if (old->release_fb)
                        old->release_fb->funcs->destroy(old->release_fb);
 
+               if (WARN_ON(encoder->crtc))
+                       encoder->crtc = NULL;
+
                return;
        }
Comment 18 Jean Delvare 2012-04-19 18:49:21 UTC
I searched the BIOS for options about blanking, but did not find any. I have the latest version of the BIOS already. Most power management options related to stand-by and suspend are disabled, but I will try disabling power management completely to see if it helps.

Meanwhile I am done compiling kernel 3.3.2 with Daniel patches from comments #3 and #14. Kernel log messages are very different from previously. Here is the case where the chip isn't sleeping yet:

[drm:intel_crt_detect], start load detect
[drm:intel_get_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5]
[drm:intel_get_load_detect_pipe], using existing crtc for load_detect: f6304000
[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_release_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5]
[drm:intel_crt_detect], end load detect

And here is the case where it is sleeping:

[drm:intel_crt_detect], start load detect
[drm:intel_get_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5]
[drm:intel_crtc_dpms], dpms on crtc f6304000, new: 0, current: 3
[drm:i830_get_fifo_size], FIFO size - (0x0000005f) A: 47
[drm:intel_calculate_wm], FIFO entries required for mode: 93
[drm:intel_calculate_wm], FIFO watermark level: -48
[drm:i830_update_wm], Setting FIFO watermarks - A: 1
[drm:intel_update_fbc], 
[drm:intel_get_load_detect_pipe], using existing crtc for load_detect: f6304000
[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_release_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5]
[drm:intel_crtc_dpms], dpms on crtc f6304000, new: 3, current: 0
[drm:intel_update_fbc], 
[drm:i830_get_fifo_size], FIFO size - (0x0000005f) A: 47
[drm:intel_calculate_wm], FIFO entries required for mode: 93
[drm:intel_calculate_wm], FIFO watermark level: -48
[drm:i830_update_wm], Setting FIFO watermarks - A: 1
[drm:intel_crt_detect], end load detect

I seem to understand we are now waking up the chip just to test for monitor presence? It seems a bit weird from a power savings perspective. Can't we just reply -EAGAIN in that case? There's nothing user-space is going to do with the information if the system is sleeping anyway.
Comment 19 Chris Wilson 2012-04-19 19:08:19 UTC
(In reply to comment #18)
> I seem to understand we are now waking up the chip just to test for monitor
> presence? It seems a bit weird from a power savings perspective. Can't we
> just
> reply -EAGAIN in that case? There's nothing user-space is going to do with
> the
> information if the system is sleeping anyway.

load-detection is only performed as a result of a userspace request. The drm_kms_helper polling for change in connection status only does the DDC check.
Comment 20 Jean Delvare 2012-04-19 19:44:21 UTC
Disabling power management in the BIOS doesn't fix the problem.

I know that load detection is performed as a result of a user-space request. This upowerd is triggering such a request every 30 seconds, which is why I think it shouldn't wake up the graphics chip if it is sleeping.
Comment 21 Jean Delvare 2012-04-20 11:49:15 UTC
Chris, I've applied your patch from comment #17, but the WARN_ON never triggered. Looking at the debug logs, I don't think we even enter the code section where you added the WARN_ON.
Comment 22 Daniel Vetter 2012-04-20 14:05:02 UTC
Ok, not hitting the WARN_ON is pretty nice evidence that my patch might work. How's the kernel-compiling for that one going?
Comment 23 Jean Delvare 2012-04-20 14:15:44 UTC
Err, I'm not sure I understand your question, Daniel. Kernel 3.3.2 was compiled yesterday and my results in comment #18 are with this kernel and your patches applied. What else do you need from me?
Comment 24 Daniel Vetter 2012-04-20 14:18:03 UTC
Oops, I've confused things a bit and thought comment #18 was still about the old code. To confirm: You haven't seen any load_detect timeouts with that patch applied?
Comment 25 Jean Delvare 2012-04-20 15:19:29 UTC
I confirm, no load_detect timeouts since I applied the patch from comment #14.
Comment 26 Daniel Vetter 2012-04-20 19:00:46 UTC
Created attachment 73025 [details]
patch as submitted to intel-gfx
Comment 27 Florian Mickler 2012-04-30 09:52:28 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc5:

commit e95c8438ea1c56c254f0607c8fb6bca7f463c744
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri Apr 20 21:03:36 2012 +0200

    drm/i915: fixup load-detect on enabled, but not active pipe

Note You need to log in before you can comment on or make changes to this bug.