I have a system based on an Asus P4P800-VM motherboard with integrated Intel graphics: 00:02.0 VGA compatible controller [0300]: Intel Corporation 82865G Integrated Graphics Controller [8086:2572] (rev 02) (prog-if 00 [VGA controller]) Subsystem: ASUSTeK Computer Inc. P5P800-MX Mainboard [1043:80a5] Flags: bus master, fast devsel, latency 0, IRQ 16 Memory at f0000000 (32-bit, prefetchable) [size=128M] Memory at fe780000 (32-bit, non-prefetchable) [size=512K] I/O ports at efe0 [size=8] Expansion ROM at <unassigned> [disabled] Capabilities: [d0] Power Management version 1 Kernel driver in use: i915 The system is running openSUSE 11.4 / x86 (32-bit). Every time I leave the system alone for some time, it freezes completely. When this happens, I see the power consumption of the system jump from 65 W to 100 W, which suggests a never-ending loop or a busy spinlock. System is completely unresponsive, display is black, nothing happens on key press or mouse movement, remote login through ssh is impossible. The time it takes to freeze is very variable, from 5 minutes to several hours. I have set up a serial console to capture the fatal event and I have also set drm.debug=0x06 to gather sufficient information. The log I gathered points to the i915 driver. I am using a KVM switch and I believe this is the trigger for this bug. The system never freezes when the KVM is switched to it, it only freezes when the KVM is switched to my other system. The difference is very visible in the log I gathered. When the KVM is switched to the problematic system, I see the following message every 30 seconds in the kernel log: [drm:intel_crt_detect_ddc], CRT detected via DDC:0x50 [EDID] But when the KVM is instead switched to the other system, the message is different: [drm:intel_crt_load_detect], starting load-detect on CRT After some time, I get: [drm:intel_update_fbc], [drm:intel_crt_load_detect], starting load-detect on CRT And then a few minutes later, the fatal error (long lines truncated by serial console): [ 5265.356996] BUG: soft lockup - CPU#0 stuck for 63s! [upowerd:2509] [ 5265.360003] Modules linked in: w83627hf hwmon_vid snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfs lockd fscache nfs_acl] [ 5265.360003] Modules linked in: w83627hf hwmon_vid snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd nfs lockd fscache nfs_acl] [ 5265.360003] [ 5265.360003] Pid: 2509, comm: upowerd Not tainted 2.6.37.6-0.11-default #1 To Be Filled By O.E.M. by More String To Be Filled By M [ 5265.360003] EIP: 0060:[<f8a036d7>] EFLAGS: 00000286 CPU: 0 [ 5265.360003] EIP is at intel_crt_load_detect+0x247/0x570 [i915] [ 5265.360003] EAX: f8cf0000 EBX: f7046000 ECX: 00000326 EDX: 00000300 [ 5265.360003] ESI: 00070000 EDI: 0000015e EBP: 00000301 ESP: f400fe94 [ 5265.360003] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 5265.360003] Process upowerd (pid: 2509, ti=f400e000 task=f40ef2b0 task.ti=f400e000) [ 5265.360003] Stack: [ 5265.360003] 00000004 f8a25fd6 f8a1cfb7 f8a26011 f71c1400 00000001 00060020 00000301 [ 5265.360003] 00060010 032502ff 00000000 032502ff 00000326 f722f000 00000001 f7224680 [ 5265.360003] 00000001 f8a03b27 c0905b40 f40ef2b0 c0905b44 00000000 f40ef2b0 f722f004 [ 5265.360003] Call Trace: [ 5265.360003] [<f8a03b27>] intel_crt_detect+0x127/0x140 [i915] [ 5265.360003] [<f8803857>] status_show+0x37/0x80 [drm] [ 5265.360003] [<c04a9ce1>] dev_attr_show+0x21/0x50 [ 5265.360003] [<c0368bb0>] sysfs_read_file+0xc0/0x180 [ 5265.360003] [<c030e68d>] vfs_read+0x9d/0x170 [ 5265.360003] [<c030e7a1>] sys_read+0x41/0x70 [ 5265.360003] [<c0202f10>] sysenter_do_call+0x12/0x22 [ 5265.360003] [<ffffe430>] 0xffffe430 [ 5265.360003] Code: 00 00 8d 2c 10 d1 ed 90 8d 74 26 00 8b 43 0c 01 f0 8b 38 f6 83 88 02 00 00 01 0f 85 44 01 00 00 39 fa 76 e8 8b [ 5265.360003] Call Trace: [ 5265.360003] [<f8a03b27>] intel_crt_detect+0x127/0x140 [i915] [ 5265.360003] [<f8803857>] status_show+0x37/0x80 [drm] [ 5265.360003] [<c04a9ce1>] dev_attr_show+0x21/0x50 [ 5265.360003] [<c0368bb0>] sysfs_read_file+0xc0/0x180 [ 5265.360003] [<c030e68d>] vfs_read+0x9d/0x170 [ 5265.360003] [<c030e7a1>] sys_read+0x41/0x70 [ 5265.360003] [<c0202f10>] sysenter_do_call+0x12/0x22 [ 5265.360003] [<ffffe430>] 0xffffe430 And gdb says: (gdb) list *(intel_crt_detect+0x127) 0x2ab57 is in intel_crt_detect (/usr/src/debug/kernel-default-2.6.37.6/linux-2.6.37/drivers/gpu/drm/i915/intel_crt.c:490). 485 return connector->status; 486 487 /* for pre-945g platforms use load detect */ 488 crtc = crt->base.base.crtc; 489 if (crtc && crtc->enabled) { 490 status = intel_crt_load_detect(crtc, crt); 491 } else { 492 crtc = intel_get_load_detect_pipe(&crt->base, connector, 493 NULL, &dpms_mode); 494 if (crtc) { (gdb) list *(intel_crt_load_detect+0x247) 0x2a707 is in intel_crt_load_detect (/usr/src/debug/kernel-default-2.6.37.6/linux-2.6.37/drivers/gpu/drm/i915/i915_drv.h:1198). 1193 static inline u32 i915_read(struct drm_i915_private *dev_priv, u32 reg) 1194 { 1195 u32 val; 1196 1197 val = readl(dev_priv->regs + reg); 1198 if (dev_priv->debug_flags & I915_DEBUG_READ) 1199 printk(KERN_ERR "read 0x%08x from 0x%08x\n", val, reg); 1200 return val; 1201 } 1202 (gdb)
Locking bug in status_show(). commit 007c80a5497a3f9c8393960ec6e6efd30955dcb1 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Mar 15 11:40:00 2011 +0000 drm: Hold the mode mutex whilst probing for sysfs status As detect will use hw registers and may modify structures, it needs to be serialised by use of the dev->mode_config.mutex. Make it so. Otherwise, we may cause random crashes as the sysfs file is queried whilst a concurrent hotplug poll is being run.
I can reproduce the bug with openSUSE 12.1, which runs kernel 3.1.0-1.2-default. This rules out commit 007c80a5497a3f9c8393960ec6e6efd30955dcb1 as being the fix, as this commit made it into kernel 2.6.39 and is thus included in the openSUSE 12.1 kernel.
Created attachment 72768 [details] time out load detect on gen2 Please try out the attached patch, preferably on a recent kernel (v3.3). Please boot with drm.debug=0xe and attach the last few lines of dmesg if your machine times out in there somewhere (or if it still dies).
I am already in progress of testing a similar patch provided by Takashi Iwai. I'll report when I am done, but initial kernel build will take long as this machine is quite slow.
Out of curiosity, how is drm.debug=0xe any better than drm.debug=0x6? I only see 3 DRM_UT_* flags defined in <drm/drmP.h>.
With Takashi's patch, I get: [drm:intel_crt_load_detect], starting load-detect on CRT [drm:intel_crt_load_detect] *ERROR* pipe_dsl_reg timeout: bf, vsample=43a looping over and over. So the never-ending loop is the second one, dsl is stuck at 0xbf == 191 and thus never reaches vsample == 0x43a == 1082.
More error messages captured this morning (note that timeout was set to 500 ms for the whole block): [drm:intel_crt_load_detect], starting load-detect on CRT [drm:intel_crt_load_detect] *ERROR* pipe_dsl_reg timeout: 1a9, vsample=43a This one happened twice. I have added some more debug messages, I'll report when I get the results.
I had more samples of this, dsl value differs, but once the messages start, the dsl value stays constant until I switch the KVM back to the system and wake up the screen. Then next time only that the problem happens, the dsl value is different (425, 401, 451...) So my take is that the cause of the timeout (or freeze without the patch) is that the value in pipe_dsl_reg simply stops being updated.
Created attachment 72781 [details] Takashi's debug patch For reference, here is the debug patch from Takashi. Results from comments #6, #7 and #8 are with this patch applied on top of kernel 2.6.37.6.
And for completeness, test results with kernel 3.3.0 + Daniel's debug patch: [drm:intel_crt_load_detect], starting load-detect on CRT [drm:intel_crt_load_detect], timed out waiting for vactive in load_detect, scanline: 195 [drm:intel_crt_load_detect], timed out while load-detecting, scanline: 195 looping over and over. This confirms that the bug is still present in kernel 3.3.0. If I switch back to the system to break the loop, and then away from it, I get a similar loop after a moment, with a different scanline: [drm:intel_crt_load_detect], starting load-detect on CRT [drm:intel_crt_load_detect], timed out waiting for vactive in load_detect, scanline: 303 [drm:intel_crt_load_detect], timed out while load-detecting, scanline: 303 etc. So it's exactly the same as with kernel 2.6.37.6.
What do we do about this bug? Daniel's or Takashi's patch would be good to have as a safety measure (and FWIW, I prefer Daniel patch's output, but I believe Takashi's patch is somewhat better as it exits more quickly when a problem is spotted) but it would be even better if we could just know when the clock is down (I believe?) and there's no point in attempting a CRT load detect. Can this be done?
Created attachment 72978 [details] printk-trace dpms state changes I looks like something turns of our crtc while we don't expect it to be turned of. And setting dpms back to on seems to fix it. To figure our where things go wrong, can you apply this patch on top of either of the 2 workaround patches, reproduce the issues, break the loop by waking up the screen and then attach the dmesg, please?
Created attachment 72980 [details] dmesg excerpts Relevant excerpts from dmesg with both patches applied. I triggered the problem twice during the capture period.
Created attachment 72981 [details] more dpms debug info I've also noticed a part of the load-detect code that looks strange and disabled it. Things should still work like before, maybe even better. Again, please test and attach dmesg when it gets stuck and recovers again (by unblanking).
Latest patch wouldn't apply on top of kernel 2.6.37.6, so I'm rebuilding a new kernel (3.3.2). This will take a while.
On 8xx I can easily imagine the BIOS trying to be helpful and poking at plane & pipe regs for us. Is there a BIOS option on this machine for blanking? If not, we'll need to add some code to check our pipe status in more places...
Crucial question if Daniel's patch works is then why we have an enabled crtc attached when probing the connector for an attachment. Perhaps diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d index 60ccfe3..8a4ef2e 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -5346,6 +5346,9 @@ void intel_release_load_detect_pipe(struct intel_encoder * if (old->release_fb) old->release_fb->funcs->destroy(old->release_fb); + if (WARN_ON(encoder->crtc)) + encoder->crtc = NULL; + return; }
I searched the BIOS for options about blanking, but did not find any. I have the latest version of the BIOS already. Most power management options related to stand-by and suspend are disabled, but I will try disabling power management completely to see if it helps. Meanwhile I am done compiling kernel 3.3.2 with Daniel patches from comments #3 and #14. Kernel log messages are very different from previously. Here is the case where the chip isn't sleeping yet: [drm:intel_crt_detect], start load detect [drm:intel_get_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5] [drm:intel_get_load_detect_pipe], using existing crtc for load_detect: f6304000 [drm:intel_crt_load_detect], starting load-detect on CRT [drm:intel_release_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5] [drm:intel_crt_detect], end load detect And here is the case where it is sleeping: [drm:intel_crt_detect], start load detect [drm:intel_get_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5] [drm:intel_crtc_dpms], dpms on crtc f6304000, new: 0, current: 3 [drm:i830_get_fifo_size], FIFO size - (0x0000005f) A: 47 [drm:intel_calculate_wm], FIFO entries required for mode: 93 [drm:intel_calculate_wm], FIFO watermark level: -48 [drm:i830_update_wm], Setting FIFO watermarks - A: 1 [drm:intel_update_fbc], [drm:intel_get_load_detect_pipe], using existing crtc for load_detect: f6304000 [drm:intel_crt_load_detect], starting load-detect on CRT [drm:intel_release_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5] [drm:intel_crtc_dpms], dpms on crtc f6304000, new: 3, current: 0 [drm:intel_update_fbc], [drm:i830_get_fifo_size], FIFO size - (0x0000005f) A: 47 [drm:intel_calculate_wm], FIFO entries required for mode: 93 [drm:intel_calculate_wm], FIFO watermark level: -48 [drm:i830_update_wm], Setting FIFO watermarks - A: 1 [drm:intel_crt_detect], end load detect I seem to understand we are now waking up the chip just to test for monitor presence? It seems a bit weird from a power savings perspective. Can't we just reply -EAGAIN in that case? There's nothing user-space is going to do with the information if the system is sleeping anyway.
(In reply to comment #18) > I seem to understand we are now waking up the chip just to test for monitor > presence? It seems a bit weird from a power savings perspective. Can't we > just > reply -EAGAIN in that case? There's nothing user-space is going to do with > the > information if the system is sleeping anyway. load-detection is only performed as a result of a userspace request. The drm_kms_helper polling for change in connection status only does the DDC check.
Disabling power management in the BIOS doesn't fix the problem. I know that load detection is performed as a result of a user-space request. This upowerd is triggering such a request every 30 seconds, which is why I think it shouldn't wake up the graphics chip if it is sleeping.
Chris, I've applied your patch from comment #17, but the WARN_ON never triggered. Looking at the debug logs, I don't think we even enter the code section where you added the WARN_ON.
Ok, not hitting the WARN_ON is pretty nice evidence that my patch might work. How's the kernel-compiling for that one going?
Err, I'm not sure I understand your question, Daniel. Kernel 3.3.2 was compiled yesterday and my results in comment #18 are with this kernel and your patches applied. What else do you need from me?
Oops, I've confused things a bit and thought comment #18 was still about the old code. To confirm: You haven't seen any load_detect timeouts with that patch applied?
I confirm, no load_detect timeouts since I applied the patch from comment #14.
Created attachment 73025 [details] patch as submitted to intel-gfx
A patch referencing this bug report has been merged in Linux v3.4-rc5: commit e95c8438ea1c56c254f0607c8fb6bca7f463c744 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Fri Apr 20 21:03:36 2012 +0200 drm/i915: fixup load-detect on enabled, but not active pipe