Bug 43020
Summary: | [865G] hard hang after some idle time | ||
---|---|---|---|
Product: | Drivers | Reporter: | Jean Delvare (jdelvare) |
Component: | Video(DRI - Intel) | Assignee: | drivers_video-dri-intel (drivers_video-dri-intel) |
Status: | RESOLVED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | chris, daniel, florian, jbarnes, jrnieder, tiwai |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.37.6-0.11-default, 3.1.0-1.2-default, 3.3.2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | time out load detect on gen2; Takashi's debug patch; printk-trace dpms state changes; dmesg excerpts; more dpms debug info; patch as submitted to intel-gfx | ||
Description
Jean Delvare
2012-03-31 13:17:02 UTC
Locking bug in status_show().

commit 007c80a5497a3f9c8393960ec6e6efd30955dcb1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 15 11:40:00 2011 +0000

    drm: Hold the mode mutex whilst probing for sysfs status

    As detect will use hw registers and may modify structures, it needs to be
    serialised by use of the dev->mode_config.mutex. Make it so. Otherwise,
    we may cause random crashes as the sysfs file is queried whilst a
    concurrent hotplug poll is being run.

I can reproduce the bug with openSUSE 12.1, which runs kernel 3.1.0-1.2-default. This rules out commit 007c80a5497a3f9c8393960ec6e6efd30955dcb1 as being the fix, as this commit made it into kernel 2.6.39 and is thus included in the openSUSE 12.1 kernel.

Created attachment 72768 [details]
time out load detect on gen2
Please try out the attached patch, preferably on a recent kernel (v3.3). Please boot with drm.debug=0xe and attach the last few lines of dmesg if your machine times out in there somewhere (or if it still dies).
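For reference, drm.debug is a bitmask of message categories. The following is a minimal sketch of how such a value decomposes, assuming the three DRM_UT_* values found in <drm/drmP.h> of the kernels discussed here (0x01 core, 0x02 driver, 0x04 KMS). The helper name is hypothetical and not part of the kernel.

```c
#include <stdio.h>

/* Assumed values, as in <drm/drmP.h> of the kernels discussed in this report. */
#define DRM_UT_CORE   0x01u
#define DRM_UT_DRIVER 0x02u
#define DRM_UT_KMS    0x04u

/* Hypothetical helper: return the bits of mask that match no known category. */
unsigned int drm_debug_unknown_bits(unsigned int mask)
{
    return mask & ~(DRM_UT_CORE | DRM_UT_DRIVER | DRM_UT_KMS);
}

/* Hypothetical helper: print which categories a drm.debug value enables. */
void drm_debug_describe(unsigned int mask)
{
    printf("drm.debug=0x%x:%s%s%s unknown=0x%x\n", mask,
           (mask & DRM_UT_CORE) ? " core" : "",
           (mask & DRM_UT_DRIVER) ? " driver" : "",
           (mask & DRM_UT_KMS) ? " kms" : "",
           drm_debug_unknown_bits(mask));
}
```

Under these assumptions, 0xe enables the driver and KMS categories, the same as 0x6, plus bit 0x8, which has no category assigned.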
I am already in the process of testing a similar patch provided by Takashi Iwai. I'll report when I am done, but the initial kernel build will take a long time as this machine is quite slow.

Out of curiosity, how is drm.debug=0xe any better than drm.debug=0x6? I only see 3 DRM_UT_* flags defined in <drm/drmP.h>.

With Takashi's patch, I get:

[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_crt_load_detect] *ERROR* pipe_dsl_reg timeout: bf, vsample=43a

looping over and over. So the never-ending loop is the second one: dsl is stuck at 0xbf == 191 and thus never reaches vsample == 0x43a == 1082.

More error messages captured this morning (note that the timeout was set to 500 ms for the whole block):

[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_crt_load_detect] *ERROR* pipe_dsl_reg timeout: 1a9, vsample=43a

This one happened twice. I have added some more debug messages, I'll report when I get the results.

I had more samples of this. The dsl value differs between occurrences, but once the messages start, the dsl value stays constant until I switch the KVM back to the system and wake up the screen. Only the next time the problem happens is the dsl value different (425, 401, 451...). So my take is that the cause of the timeout (or of the freeze without the patch) is that the value in pipe_dsl_reg simply stops being updated.

Created attachment 72781 [details]
Takashi's debug patch

For reference, here is the debug patch from Takashi. Results from comments #6, #7 and #8 are with this patch applied on top of kernel 2.6.37.6.

And for completeness, test results with kernel 3.3.0 + Daniel's debug patch:

[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_crt_load_detect], timed out waiting for vactive in load_detect, scanline: 195
[drm:intel_crt_load_detect], timed out while load-detecting, scanline: 195

looping over and over. This confirms that the bug is still present in kernel 3.3.0.
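The behaviour seen with both workaround patches (give up instead of spinning forever on a scanline counter that has stopped) can be sketched as a bounded polling loop. This is a simplified userspace model, not the driver code: the function names and the callback type are hypothetical, the exit condition is an assumption, and it bounds the number of reads rather than elapsed time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register reader; in the driver this would be an MMIO read. */
typedef uint32_t (*read_dsl_fn)(void *ctx);

/*
 * Poll until the scanline counter reaches target, giving up after max_polls
 * reads. Returns true on success and false on timeout. The last value read
 * is stored in *last so the caller can log where the counter got stuck.
 */
bool wait_for_scanline(read_dsl_fn read_dsl, void *ctx, uint32_t target,
                       unsigned int max_polls, uint32_t *last)
{
    uint32_t dsl = 0;
    unsigned int i;

    for (i = 0; i < max_polls; i++) {
        dsl = read_dsl(ctx);
        if (dsl >= target) {
            *last = dsl;
            return true;
        }
    }
    *last = dsl;
    return false;
}

/* Test double: a counter that is stuck at a fixed value. */
uint32_t stuck_dsl(void *ctx)
{
    return *(uint32_t *)ctx;
}

/* Test double: a counter that advances by one line per read. */
uint32_t running_dsl(void *ctx)
{
    return (*(uint32_t *)ctx)++;
}
```

With a counter stuck at 0xbf and a target of 0x43a, the call returns false and reports 0xbf, which mirrors the "pipe_dsl_reg timeout: bf, vsample=43a" message above. An unbounded loop would never return in that situation.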
If I switch back to the system to break the loop, and then away from it, I get a similar loop after a moment, with a different scanline:

[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_crt_load_detect], timed out waiting for vactive in load_detect, scanline: 303
[drm:intel_crt_load_detect], timed out while load-detecting, scanline: 303

etc. So it's exactly the same as with kernel 2.6.37.6.

What do we do about this bug? Daniel's or Takashi's patch would be good to have as a safety measure (and FWIW, I prefer the output of Daniel's patch, but I believe Takashi's patch is somewhat better as it exits more quickly when a problem is spotted), but it would be even better if we could just know when the clock is down (I believe?) and there's no point in attempting a CRT load detect. Can this be done?

Created attachment 72978 [details]
printk-trace dpms state changes
It looks like something turns off our crtc while we don't expect it to be turned off. And setting dpms back to on seems to fix it.
To figure out where things go wrong, can you apply this patch on top of either of the 2 workaround patches, reproduce the issue, break the loop by waking up the screen and then attach the dmesg, please?
Created attachment 72980 [details]
dmesg excerpts
Relevant excerpts from dmesg with both patches applied. I triggered the problem twice during the capture period.
Created attachment 72981 [details]
more dpms debug info
I've also noticed a part of the load-detect code that looks strange and disabled it. Things should still work like before, maybe even better. Again, please test and attach dmesg when it gets stuck and recovers again (by unblanking).
The latest patch wouldn't apply on top of kernel 2.6.37.6, so I'm building a new kernel (3.3.2). This will take a while.

On 8xx I can easily imagine the BIOS trying to be helpful and poking at plane & pipe regs for us. Is there a BIOS option on this machine for blanking? If not, we'll need to add some code to check our pipe status in more places...

The crucial question, if Daniel's patch works, is then why we have an enabled crtc attached when probing the connector for an attachment. Perhaps

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index 60ccfe3..8a4ef2e 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -5346,6 +5346,9 @@ void intel_release_load_detect_pipe(struct intel_encoder *
        if (old->release_fb)
                old->release_fb->funcs->destroy(old->release_fb);

+       if (WARN_ON(encoder->crtc))
+               encoder->crtc = NULL;
+
        return;
 }

I searched the BIOS for options about blanking, but did not find any. I have the latest version of the BIOS already. Most power management options related to stand-by and suspend are disabled, but I will try disabling power management completely to see if it helps.

Meanwhile I am done compiling kernel 3.3.2 with Daniel's patches from comments #3 and #14. The kernel log messages are very different from before.
Here is the case where the chip isn't sleeping yet:

[drm:intel_crt_detect], start load detect
[drm:intel_get_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5]
[drm:intel_get_load_detect_pipe], using existing crtc for load_detect: f6304000
[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_release_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5]
[drm:intel_crt_detect], end load detect

And here is the case where it is sleeping:

[drm:intel_crt_detect], start load detect
[drm:intel_get_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5]
[drm:intel_crtc_dpms], dpms on crtc f6304000, new: 0, current: 3
[drm:i830_get_fifo_size], FIFO size - (0x0000005f) A: 47
[drm:intel_calculate_wm], FIFO entries required for mode: 93
[drm:intel_calculate_wm], FIFO watermark level: -48
[drm:i830_update_wm], Setting FIFO watermarks - A: 1
[drm:intel_update_fbc],
[drm:intel_get_load_detect_pipe], using existing crtc for load_detect: f6304000
[drm:intel_crt_load_detect], starting load-detect on CRT
[drm:intel_release_load_detect_pipe], [CONNECTOR:4:VGA-1], [ENCODER:5:DAC-5]
[drm:intel_crtc_dpms], dpms on crtc f6304000, new: 3, current: 0
[drm:intel_update_fbc],
[drm:i830_get_fifo_size], FIFO size - (0x0000005f) A: 47
[drm:intel_calculate_wm], FIFO entries required for mode: 93
[drm:intel_calculate_wm], FIFO watermark level: -48
[drm:i830_update_wm], Setting FIFO watermarks - A: 1
[drm:intel_crt_detect], end load detect

If I understand correctly, we are now waking up the chip just to test for monitor presence? It seems a bit weird from a power savings perspective. Can't we just reply -EAGAIN in that case? There's nothing user-space is going to do with the information if the system is sleeping anyway.

(In reply to comment #18)
> I seem to understand we are now waking up the chip just to test for monitor
> presence? It seems a bit weird from a power savings perspective. Can't we
> just reply -EAGAIN in that case? There's nothing user-space is going to do
> with the information if the system is sleeping anyway.

load-detection is only performed as a result of a userspace request. The drm_kms_helper polling for change in connection status only does the DDC check.

Disabling power management in the BIOS doesn't fix the problem.

I know that load detection is performed as a result of a user-space request. upowerd is triggering such a request every 30 seconds, which is why I think it shouldn't wake up the graphics chip if it is sleeping.

Chris, I've applied your patch from comment #17, but the WARN_ON never triggered. Looking at the debug logs, I don't think we even enter the code section where you added the WARN_ON.

Ok, not hitting the WARN_ON is pretty nice evidence that my patch might work. How's the kernel-compiling for that one going?

Err, I'm not sure I understand your question, Daniel. Kernel 3.3.2 was compiled yesterday and my results in comment #18 are with this kernel and your patches applied. What else do you need from me?

Oops, I've confused things a bit and thought comment #18 was still about the old code. To confirm: You haven't seen any load_detect timeouts with that patch applied?

I confirm, no load_detect timeouts since I applied the patch from comment #14.

Created attachment 73025 [details]
patch as submitted to intel-gfx
A patch referencing this bug report has been merged in Linux v3.4-rc5:

commit e95c8438ea1c56c254f0607c8fb6bca7f463c744
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri Apr 20 21:03:36 2012 +0200

    drm/i915: fixup load-detect on enabled, but not active pipe
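The pattern visible in the debug logs above (dpms switched from 3 to 0 before the probe and back to 3 afterwards) can be modelled roughly as follows. This is only an illustrative userspace sketch and not the merged patch: the struct, the probe callback and all function names are hypothetical. The DPMS levels follow the DRM convention of 0 for on and 3 for off.

```c
#include <stdbool.h>

/* DPMS levels as used by DRM: 0 is on, 3 is off. */
enum { DPMS_ON = 0, DPMS_OFF = 3 };

/* Hypothetical stand-in for the driver's crtc state. */
struct fake_crtc {
    int dpms_mode;     /* current DPMS level */
    int dpms_changes;  /* how often the level was actually switched */
};

/* Switch the DPMS level, counting only real transitions. */
void fake_crtc_set_dpms(struct fake_crtc *crtc, int mode)
{
    if (crtc->dpms_mode != mode) {
        crtc->dpms_mode = mode;
        crtc->dpms_changes++;
    }
}

/* Hypothetical probe callback; stands in for the CRT load-detect sequence. */
typedef bool (*probe_fn)(const struct fake_crtc *crtc);

/* Wake the pipe if needed, probe, then restore the previous DPMS level. */
bool load_detect_with_wakeup(struct fake_crtc *crtc, probe_fn probe)
{
    int saved = crtc->dpms_mode;
    bool connected;

    fake_crtc_set_dpms(crtc, DPMS_ON);
    connected = probe(crtc);
    fake_crtc_set_dpms(crtc, saved);
    return connected;
}

/* Test double: a probe that only succeeds while the pipe is running. */
bool probe_requires_active_pipe(const struct fake_crtc *crtc)
{
    return crtc->dpms_mode == DPMS_ON;
}
```

In this model a crtc that starts in the off state sees exactly two DPMS transitions per probe and ends up off again, while a crtc that is already on is left untouched.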