Bug 15248

Summary: KMS broken on Intel 855GM with 2.6.32/33
Product: Drivers
Reporter: Michal Nowak (mnowak)
Component: Video(DRI - Intel)
Assignee: Jesse Barnes (jbarnes)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: high
CC: jbarnes, legolas558, maciej.rutecki, noz, rjw, vasyl.demin, yakui.zhao
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32 2.6.33-rc6
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 14230
Attachments: lspci -v
dmesg
lspci -v
dmesg
dmidecode
lid state
New dmidecode 2.6.33-rc6 KMS on
Jesse Barnes' patch for latest linus tree
Jesse Barnes' patch for latest linus tree
Jesse Barnes' patch for latest linus tree
lspci -v of my box
'sudo lspci -v' on my box
dmesg output retrieved via ssh
stdout+stderr produced when starting Xorg (no Xorg.0.log file created because of early crash)
legolas558's kernel .config
.config stripped down from FC13
garbled fonts
Xorg.0.log with Xorg 1.7.4, kernel 2.6.33-rc8+855_nolid.patch
latest .config with built-in AGP and DRM
FC13's DRM patch for i8xx devices

Description Michal Nowak 2010-02-07 09:59:03 UTC
Created attachment 24936 [details]
lspci -v

Description of problem:

With 2.6.32 and 2.6.33-rc6 I noticed that a system with Intel 855GM no longer
boots. After GRUB hands over to the kernel, the machine hangs with a completely
black screen; no keystroke works (not even Magic SysRq) and I have to power it off. When I add 'nomodeset' it boots just fine until X starts, at which point the non-KMS X problem
hits; this is actually bug https://bugzilla.redhat.com/show_bug.cgi?id=522551 (which is not expected to ever be fixed), and /var/log/Xorg.log is empty. When I add just "acpi=off" (no nomodeset) it boots fine too, and also crashes in X. I am not sure what acpi=off does regarding KMS, but I guess it somehow turns it off.

Note: Kernel 2.6.33-rc6 (+ recent intel xorg driver 2.10) is necessary for this box to enable Xv when KMS is enabled.

Just to stress that boxes with Intel 855 might be totally unusable with kernels
2.6.32+: KMS is broken in the kernel, and non-KMS usage is broken at the X level. 2.6.31.x is
fine.

Version-Release number of selected component (if applicable):

kernel-2.6.32.7-37.fc12.i686
kernel-2.6.33-0.27.rc6.git1.fc13.i686
xorg-x11-drv-intel-2.9.1-1.fc12.i686
xorg-x11-drv-intel-2.10.0-1.fc12.i686
Comment 1 Michal Nowak 2010-02-07 09:59:25 UTC
Created attachment 24937 [details]
dmesg
Comment 2 Michal Nowak 2010-02-07 10:01:24 UTC
Created attachment 24938 [details]
lspci -v
Comment 3 Michal Nowak 2010-02-07 10:01:48 UTC
Created attachment 24939 [details]
dmesg
Comment 4 ykzhao 2010-02-10 02:05:24 UTC
Will you please boot the system with the "nomodeset" option and attach the following output?
   1. cat /proc/acpi/button/lid/LID/state
   2. dmidecode

Thanks.
    Yakui.
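
For anyone gathering the same data, here is a minimal Python sketch for reading the lid state. It assumes the state file holds a single line of the form "state:      open" or "state:      closed"; the function name parse_lid_state is hypothetical, and the glob pattern is an assumption because the LID directory name can differ between machines.

```python
import glob
import re
from typing import Optional


def parse_lid_state(text: str) -> Optional[str]:
    """Return 'open' or 'closed' from lid state file contents, else None.

    Assumes a line such as 'state:      closed' (an assumption about the
    file format, not something confirmed in this report).
    """
    match = re.search(r"state:\s*(\w+)", text, re.IGNORECASE)
    if match is None:
        return None
    value = match.group(1).lower()
    return value if value in ("open", "closed") else None


if __name__ == "__main__":
    # The directory under lid/ is often LID or LID0; glob over all of them.
    for path in glob.glob("/proc/acpi/button/lid/*/state"):
        with open(path) as handle:
            print(path, parse_lid_state(handle.read()))
```

Comparing the printed value with the physical position of the lid shows whether the firmware reports it correctly.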
Comment 5 Jesse Barnes 2010-02-11 18:12:31 UTC
If it's a LID issue (likely) this patch should work around the issue.  If it works, can you attach the output of 'dmidecode' to this bug?

diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds
index 75a9772..345e3b0 100644
--- a/drivers/gpu/drm/i915/intel_lvds.c
+++ b/drivers/gpu/drm/i915/intel_lvds.c
@@ -643,6 +643,8 @@ static enum drm_connector_status intel_lvds_detect(struct dr
 {
        enum drm_connector_status status = connector_status_connected;
 
+       return status;
+
        if (!acpi_lid_open() && !dmi_check_system(bad_lid_status))
                status = connector_status_disconnected;
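
The decision that the early return bypasses can be modeled in a few lines. This is only a toy Python sketch of the logic visible in the diff, not kernel code; lvds_status and its parameters are hypothetical names standing in for acpi_lid_open(), dmi_check_system(bad_lid_status) and the added 'return status'.

```python
def lvds_status(lid_open: bool, on_bad_lid_list: bool, ignore_lid: bool = False) -> str:
    """Toy model of the LVDS detect decision shown in the diff.

    lid_open        -- what the ACPI lid query reports (may be wrong)
    on_bad_lid_list -- whether the machine matches the DMI quirk table
    ignore_lid      -- models the workaround's unconditional early return
    """
    if ignore_lid:
        # Workaround: report the panel as connected without asking the lid.
        return "connected"
    if not lid_open and not on_bad_lid_list:
        # Lid reported closed and no quirk entry: panel treated as absent.
        return "disconnected"
    return "connected"
```

If the firmware wrongly reports the lid as closed and the machine has no quirk entry, the unpatched path yields "disconnected" for the internal panel, which would be consistent with the black screen described above.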
Comment 6 Michal Nowak 2010-02-12 07:18:11 UTC
Booted "nomodeset" to init 3 and gathered all data. Will try the patch later.
Comment 7 Michal Nowak 2010-02-12 07:20:44 UTC
Created attachment 24994 [details]
dmidecode
Comment 8 Michal Nowak 2010-02-12 07:34:48 UTC
Created attachment 24995 [details]
lid state

The state file says "Closed", but the lid was open in reality.
Comment 9 Michal Nowak 2010-02-12 11:12:39 UTC
Created attachment 25000 [details]
New dmidecode 2.6.33-rc6 KMS on
Comment 10 Michal Nowak 2010-02-12 11:13:38 UTC
It works! Now I can boot 2.6.33-rc6 with that patch, KMS is enabled.
Comment 11 Jesse Barnes 2010-02-12 21:19:39 UTC
Great, just sent a patch, http://patchwork.kernel.org/patch/78947/, to ignore lid status on 8xx machines.  It should fix the issue for you.
Comment 12 Daniele C. 2010-02-13 00:02:43 UTC
Created attachment 25017 [details]
Jesse Barnes' patch for latest linus tree

This is the same patch as J. Barnes', except that it has been manually applied so the line offsets are now correct.

I am now recompiling and will report about it since I have the same hardware as mnowak
Comment 13 Daniele C. 2010-02-13 00:02:44 UTC
Created attachment 25018 [details]
Jesse Barnes' patch for latest linus tree

This is the same patch as J. Barnes', except that it has been manually applied so the line offsets are now correct.

I am now recompiling and will report about it since I have the same hardware as mnowak
Comment 14 Daniele C. 2010-02-13 00:03:32 UTC
Created attachment 25019 [details]
Jesse Barnes' patch for latest linus tree

This is the same patch as J. Barnes', except that it has been manually applied so the line offsets are now correct.

I am now recompiling and will report about it since I have the same hardware as mnowak
Comment 15 Daniele C. 2010-02-13 00:23:36 UTC
Created attachment 25020 [details]
lspci -v of my box

I have attached 'lspci -v' of my box; the patch does nothing for me, and exactly the same behaviour is observed.

The system is booted with 'nomodeset nofb' and I get an OFF screen right after the i915 module is auto-loaded. Then I successfully log in and start Xorg (blindly, since the screen is OFF) and it hangs. The last lines of Xorg.0.log are not helpful:

(II) intel(0): [drm] removed 1 reserved context for kernel
(II) intel(0): [drm] unmapping 8192 bytes of SAREA 0xdff08000 at 0xb7331000
(II) intel(0): [drm] Closed DRM master.

Neither is messages.log, since its tail contains a wild memory dump from just before my hard reboot!
Comment 16 Daniele C. 2010-02-13 00:27:03 UTC
Created attachment 25021 [details]
'sudo lspci -v' on my box
Comment 17 Jesse Barnes 2010-02-13 04:44:10 UTC
Daniele, I think you have a different problem.  Judging by your X log snippet, you have a really old X driver that won't work with KMS and recent kernels.  If after updating you still have issues, please file a new bug with your kernel .config attached, your X log, and your dmesg.
Comment 18 Michal Nowak 2010-02-13 11:46:45 UTC
(In reply to comment #15)
> System is booted with 'nomodeset nofb' and I get an OFF screen right after
> the

I can boot without this. This bug is about KMS enabled boot.
Comment 19 Rafael J. Wysocki 2010-02-14 23:35:46 UTC
Handled-By : Jesse Barnes <jbarnes@virtuousgeek.org>
Patch : http://patchwork.kernel.org/patch/78947/
Comment 20 Daniele C. 2010-02-15 13:00:06 UTC
@jbarnes: the problem is that the kernel crash is so hard that no buffers are written to disk and Xorg.0.log is not overwritten, so I was providing a snippet from the previous Xorg.0.log, which was created with an old version because I *must* use that version to use this box normally.

I have started Xorg from ssh and copied the output manually, since not even 'tee' can output anything. The box crashes when launching Xorg (startx) and I need to cut the power to restart it.

I am attaching the files you requested; please note that I am using latest linus tree (with the patch I attached before) and all latest packages (Arch Linux here)
Comment 21 Daniele C. 2010-02-15 13:02:31 UTC
Created attachment 25047 [details]
dmesg output retrieved via ssh

machine is booted without 'nomodeset' and without 'nofb' (even if 'nofb' is not relevant)
Comment 22 Daniele C. 2010-02-15 13:03:26 UTC
Created attachment 25048 [details]
stdout+stderr produced when starting Xorg (no Xorg.0.log file created because of early crash)
Comment 23 Daniele C. 2010-02-15 13:07:31 UTC
Created attachment 25049 [details]
legolas558's kernel .config

I should add that my display goes black (turned OFF) right after the modules load (I suppose when the i915 module is loaded); then I can log in and trigger startx (all operations done blindly or via ssh) and the system crashes hard, with no magic keys working either. The Xorg output was also collected via ssh.

I am fully available to work this bug out

Thanks
Comment 24 Daniele C. 2010-02-15 13:09:59 UTC
(In reply to comment #18)
> (In reply to comment #15)
> > System is booted with 'nomodeset nofb' and I get an OFF screen right after
> the
> 
> I can boot without this. This bug is about KMS enabled boot.

Sorry, my mistake: I get the OFF screen when *not* using the 'nomodeset' option, i.e. with KMS enabled.
Comment 25 noz 2010-02-16 03:09:26 UTC
Using 2.6.33-rc8 and the "return status" hack, I can boot with KMS enabled, but starting Xorg with either the vesa or the intel driver (all latest) hard-freezes the machine (no SysRq).
With the same setup but KMS disabled, I can use the vesa driver without a crash.

This is an improvement, as I can have ACPI enabled.

Just like Daniele, I can't have accelerated drivers via KMS on my 855GM.

the poster's fedora kernels must have patches we do not have.
Comment 26 Daniele C. 2010-02-16 09:10:23 UTC
I have just tried FC12, as mnowak suggested to me on another bug tracker.

1) the stock FC12/FC13 kernels already have some patch which fixes the kernel
for this hardware, since I can boot successfully with 'nomodeset' and start
Xorg, while on my vanilla 2.6.32/2.6.33 (linus' tree) I can't boot at all without 'nomodeset'

2) the FC13 kernel (2.6.33-rc8) + 855nolid.patch (J.Barnes small patch about
LID on 855GM) can boot but Xorg crashes before the loading screen finishes
loading, so I never reach the login manager

So now I think it is necessary to identify which Red Hat patches are fixing
kernel's DRM so that recent Xorg can work, perhaps I'll need another bug
tracker

I will try to incrementally apply some FC13 patches regarding DRM, but I don't like shooting in the dark with this stuff...

Please change status of this bug since it is *NOT RESOLVED*, J.Barnes' patch is surely an improvement but it has brought the situation back to early 2.6.32 development, when I could boot exactly how FC12+FC13_kernel+nolid.patch does now (without any KMS whatsoever).
Comment 27 Daniele C. 2010-02-16 12:21:54 UTC
Created attachment 25064 [details]
.config stripped down from FC13

I have attached a .config which works with 2.6.33-rc8 (linus tree); can some dev tell me why my .config (attachment 25049 [details]) causes the crashes instead?
Comment 28 Jesse Barnes 2010-02-16 16:43:34 UTC
One problem with the broken config is that AGP is modular.  If loaded in the wrong order, it can break the DRM and/or X drivers.  It's best to keep AGP as builtin along with the AGP drivers (CONFIG_AGP=y and CONFIG_AGP_INTEL=y in this case).
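
A quick way to check a .config for this is sketched below in Python. It only looks at the two options named above and treats the value y as built in and m as modular; the helper names agp_config_report and agp_is_builtin are hypothetical.

```python
def agp_config_report(config_text, options=("CONFIG_AGP", "CONFIG_AGP_INTEL")):
    """Map each option to its value in a kernel .config ('unset' if absent)."""
    values = {name: "unset" for name in options}
    for line in config_text.splitlines():
        line = line.strip()
        # Lines like '# CONFIG_AGP is not set' are comments and are skipped.
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name in values:
            values[name] = value
    return values


def agp_is_builtin(config_text):
    """True only when every checked option is set to 'y' (built in)."""
    return all(value == "y" for value in agp_config_report(config_text).values())


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            text = handle.read()
        print(agp_config_report(text), "builtin:", agp_is_builtin(text))
```

Run it with the path to the .config as the only argument; any value other than y for either option means the ordering issue described above can apply.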
Comment 29 noz 2010-02-16 21:43:58 UTC
Created attachment 25070 [details]
garbled fonts

with the current 2.6.33-rc8 + nolid patch, after some minutes
Comment 30 Daniele C. 2010-02-16 21:52:34 UTC
Created attachment 25071 [details]
Xorg.0.log with Xorg 1.7.4, kernel 2.6.33-rc8+855_nolid.patch

@jbarnes: yes I am now rebuilding the kernel with AGP and DRM built-in, that will most surely fix the issue - thanks for pointing it out.

So now I am at the same spot as mnowak and the other i855GM users: Xorg crashes after some time, and prints a lot of lines like:

(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.

until I terminate the Xorg process (luckily I still have control with the intel driver, while with the vesa driver there is a hard kernel crash).

When trying to start Xorg again, it fails with the same lines as above, so the hardware memory or GPU state must be somehow corrupted. I have also tried to suspend and resume (just in case some quirk would work) but nothing changes.
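
To compare runs, the number of these errors in a captured log can be counted. A minimal Python sketch follows; it matches on the message text quoted above, and count_batch_failures is a hypothetical helper name.

```python
BATCH_ERROR = "Failed to submit batch buffer"


def count_batch_failures(log_text):
    """Count (EE) lines that report a failed batch buffer submission."""
    return sum(
        1
        for line in log_text.splitlines()
        if line.startswith("(EE)") and BATCH_ERROR in line
    )


if __name__ == "__main__":
    import sys
    # Reads the log from stdin, e.g. the output captured over ssh.
    print(count_batch_failures(sys.stdin.read()))
```

Feeding it the output captured over ssh gives a rough measure of how quickly the failure sets in with a given kernel and patch combination.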
Comment 31 Daniele C. 2010-02-16 21:53:48 UTC
Created attachment 25072 [details]
latest .config with built-in AGP and DRM
Comment 32 Jesse Barnes 2010-02-16 22:14:33 UTC
Ok good you're on the same page as everyone else now.  The crashes seem like a different issue than what was solved in this bug (bad lid detection), can you file a new bug at bugs.freedesktop.org for that, following the instructions at http://intellinuxgraphics.org/how_to_report_bug.html?  You can file it against the drm/intel component.  Would be extra cool if you could bisect the kernel and see if 8xx started failing for you at a particular commit.
Comment 33 Daniele C. 2010-02-17 03:28:42 UTC
@jbarnes: this bug is not about bad lid detection, this bug is about KMS broken (as far as I can read) and KMS is still broken since it crashes after a few seconds/minutes (mnowak can surely confirm this).

However the patch is a huge improvement since crash was instantaneous previously.

I have verified that my issue was not due to any Fedora Core patch, but to the .config (built-in AGP and DRM fixed the issue).

However if you want I can proceed as you suggested, but I am sure that mnowak can confirm that the patch only delays the crash

I should also say that with the latest linus tree and my new .config I am not getting the crash; instead I get a weird glitch each time the screen is refreshed (mouse move, character typing, pretty much anything except the caret moving up/down). Also, the fonts are no longer garbled.
Comment 34 Michal Nowak 2010-02-17 08:08:03 UTC
I have to admit I was wrong with the statement that Xorg crashes after some time: it's rock solid for me. Though I am not on 2.6.33-rc8 (just git6rc7). I have no idea what the issue might be for Daniele, since I believe we now have the same SW stack (F12+updates+updates_testing + F13's xorg-drv and kernel+Jesse's patch).

For *me* the 2.6.31 -> 2.6.3{2,3} KMS regression is fixed. I don't mind if you change the title to mention "problem on boot" and file a new bug at freedesktop's side.
Comment 35 Daniele C. 2010-02-17 10:50:20 UTC
Created attachment 25084 [details]
FC13's DRM patch for i8xx devices

@mnowak: FC13's kernel uses the attached patch (drm-intel-big-hammer.patch); this appears to have fixed the vertical refresh flicker and crash issue for me also.

So now I am running latest linus tree + 855nolid.patch + drm-intel-big-hammer.patch; Linux is running with KMS on and latest Xorg, and everything runs smoothly.

The only remaining issue is the garbled fonts, as if some bytes of the font bitmaps were "eaten" (pretty much like the attached screenshot, just a bit less visible).

@jbarnes: if I get no more crashes with this kernel combination, might it be worth also submitting drm-intel-big-hammer.patch? Sorry, but I have trouble understanding whether it is an Xorg bug or a kernel bug, and it seems to have been fixed by the attached kernel patch for me now.
Comment 36 Daniele C. 2010-02-17 11:58:33 UTC
(In reply to comment #34)
> I have to admit I was wrong with statement that XOrg crashes after some time
> -
> it's rock solid for me. Thought I am not on 2.6.33-rc8 (just git6rc7). Having
> no idea what might be the issue for Daniele since I believe we now have the
> same SW stack (F12+updates+updates_testing + F13's xorg-drv and
> kernel+Jesse's
> patch).
> 
The weird bit is that when using exactly your SW stack I get a crash before the desktop appears, while now I am using (latest linus tree + 855nolid.patch + drm-intel-big-hammer.patch) and everything works fine and smoothly (rock solid, as you said), except for the fonts which are missing some pixels.

I assume also that your fonts are displaying perfectly.

> For *me* is the 2.6.31 -> 2.6.3{2,3} KMS regression fixed. I don't mind if
> you
> change the title to mention "problem on boot" and file a new bug at
> freedesktop's side.

I think there is a misunderstanding here: the "problem on boot" was due to AGP being modular, as jbarnes said, and I filed a kernel documentation bug here (http://bugzilla.kernel.org/show_bug.cgi?id=15340). Now I am using it built-in (like FC12/FC13 do) so if I had to submit a new bug to freedesktop it would be for the Xorg crash bug, which anyway is not happening anymore with drm-intel-big-hammer.patch (perhaps that patch needs to be submitted at freedesktop's drm/intel, if jbarnes confirms)
Comment 37 Michal Nowak 2010-02-17 12:31:26 UTC
(In reply to comment #36)
> I assume also that your fonts are displaying perfectly.

Yes. It's perfect. The only package which probably differs between our setups is my cairo, built with XCB enabled, and I am using the awesome WM (not in Fedora repos). But that's unlikely to affect fonts, I believe.

> I think there is a misunderstanding here: the "problem on boot" was due to
> AGP
> being modular, as jbarnes said, and I filed a kernel documentation bug here
> (http://bugzilla.kernel.org/show_bug.cgi?id=15340). Now I am using it
> built-in
> (like FC12/FC13 do) so if I had to submit a new bug to freedesktop it would
> be
> for the Xorg crash bug, which anyway is not happening anymore with
> drm-intel-big-hammer.patch (perhaps that patch needs to be submitted at
> freedesktop's drm/intel, if jbarnes confirms)

I am a bit puzzled by all the recent 855GM issues, but since the setup as of now works for me 100%, I don't have any preferences regarding new bugs elsewhere.
Comment 38 Daniele C. 2010-02-17 14:32:46 UTC
@mnowak: my recent findings are not about FC kernel tests but about a vanilla kernel with those 2 patches. It would be great if you could confirm that such a mix causes the garbled fonts and the freeze while playing videos on your box as well.

Your setup uses Fedora patches, so if you are not having any issue at all now (not even the video freeze) we should dig a bit more and find out which other patches we need in the mainline kernel, so that everybody can benefit from them.
Comment 39 Daniele C. 2010-02-17 16:28:55 UTC
While waiting for developer feedback about the inclusion of drm-intel-big-hammer.patch, I am digging into the crashes, which now happen more rarely.

The crash is exactly the same as those happening without drm-intel-big-hammer.patch and seems related to GPU "hiccups", see also:
http://bugs.gentoo.org/301282

However I have no idea about other patches to try to fix this. Playing videos or intensive CPU/GPU usage will likely cause the bug to trigger; mnowak reports that with his patched kernel this never happens.

@mnowak: I have put the tools and patches I am using (and also the kernel sources) here http://www.iragan.com/linux/i855GM/
Comment 40 Jesse Barnes 2010-02-17 17:10:41 UTC
Can you try using libdrm 2.4.18?  It has some fixes for corruption, and I'm hoping it makes the "big hammer" patch unnecessary as well.
Comment 41 Daniele C. 2010-02-17 18:19:42 UTC
(In reply to comment #40)
> Can you try using libdrm 2.4.18?  It has some fixes for corruption, and I'm
> hoping it makes the "big hammer" patch unnecessary as well.

I just tried libdrm from git:

1) Xorg crashes after more time, like 60 seconds instead of 2-3 seconds
2) the garbled fonts issue is still there

I had to re-apply the big hammer patch to use Xorg+KMS, furthermore they recently made KMS mandatory on Arch Linux so I'll really need this patch to run Xorg.
Anyway it's not a total fix since it is proven that the batchbuffer crashes can still happen under particular stress situations.

Perhaps these i8xx devices have a small pipe and need more time/less flood to work correctly?
Comment 42 noz 2010-02-17 23:19:53 UTC
Just to clarify: for me the fonts are OK at first, then after some time they start to get garbled (especially after using OpenGL, I think).
Once this starts, using the system a bit more makes everything freeze except the mouse (ssh in is not possible).
Comment 43 Daniele C. 2010-02-19 03:31:50 UTC
For me, instead, they are garbled from the start of the session; I have verified that XvMC and Tiling are not responsible for this issue, nor for the crash (which always happens while playing a video, for example).

@jbarnes: do we have a built-in kernel stress test for DRM?

When redirecting the output of the 'startx' command I get the batch buffer error lines, but not this line:

intel_bufmgr_gem.c:901: Error setting to CPU domain 626: Input/output error

So I assume this one is printed directly by libdrm on the first tty available.

These downstream links contain interesting pointers:

http://bugs.gentoo.org/295777
https://bugs.freedesktop.org/show_bug.cgi?id=25510

However I don't think there is any relevant patch mentioned there because this bug could now be triggered from elsewhere.

Can't we do anything to prevent the GPU from hanging (or the kernel from thinking it hung)?
Comment 44 Daniele C. 2010-02-19 16:20:11 UTC
acpi=off doesn't change anything about bug triggering. Bug also triggers while using Firefox or Opera browsers.

Garbled fonts are often accompanied by artifacts in icon rendering or in the restoring of background slices.

I have applied some other FC13 kernel patches but they don't fix the sudden crash, which each time requires a graceful system restart (VTs are still working, and garbled fonts appear there too).

I also have a full git freedesktop development stack with latest Xorg and libraries, just in case I need to run tests.

Now I am also using the BFS CPU scheduler and it doesn't interfere with bug triggering, while in the past it highly increased bug triggering ratios.