Bug 4513 - Radeon driver in 2.6.10 through 2.6.11.6 fails in DRI mode. Kernel 2.6.9 works fine.
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(AGP)
Hardware: i386 Linux
Importance: P2 normal
Assignee: H. Peter Anvin
Reported: 2005-04-18 07:10 UTC by David Cafaro
Modified: 2006-04-10 09:25 UTC
Kernel Version: 2.6.10 through 2.6.11.6
Tree: Mainline
Regression: ---


Attachments
output of dmesg for bug #4513 (21.81 KB, text/plain)
2006-03-19 19:13 UTC, Brian Hinz
output of lspci for bug #4513 (10.34 KB, text/plain)
2006-03-19 19:14 UTC, Brian Hinz
Xorg log for bug #4513 (40.62 KB, text/plain)
2006-03-19 19:15 UTC, Brian Hinz
xorg.conf for bug #4513 (12.62 KB, text/plain)
2006-03-19 19:15 UTC, Brian Hinz
Dmesg Log from failed Xorg (14.18 KB, text/plain)
2006-03-22 08:33 UTC, David Cafaro
lspci -vv Results (10.30 KB, text/plain)
2006-03-22 08:33 UTC, David Cafaro
Xorg Log file (35.50 KB, text/plain)
2006-03-22 08:34 UTC, David Cafaro
Xorg Config file (2.91 KB, text/plain)
2006-03-22 08:34 UTC, David Cafaro
hack to drm cvs (03/19/2006) that removes mtrr overlaps (471 bytes, patch)
2006-03-22 19:23 UTC, Brian Hinz
minimal regression patch to drivers/char/agp/generic.c that fixes the screen freeze issue (383 bytes, patch)
2006-03-22 19:59 UTC, Brian Hinz
potential bug resolution patch (456 bytes, patch)
2006-03-22 21:06 UTC, Brian Hinz
Proposed patch, version 2 (1.10 KB, patch)
2006-03-22 22:21 UTC, H. Peter Anvin

Description David Cafaro 2005-04-18 07:10:46 UTC
Distribution: Fedora Core 2 & Fedora Core 3

Hardware Environment: Sharp MM20 Laptop 
Transmeta Efficeon Processor, ATI Mobile Radeon
00:00.0 Host bridge: Transmeta Corporation: Unknown device 0060
00:01.0 PCI bridge: Transmeta Corporation: Unknown device 0061
00:02.0 PCI bridge: ALi Corporation M5249 HTT to PCI Bridge
00:03.0 ISA bridge: ALi Corporation M1563 HyperTransport South Bridge (rev 20)
00:03.1 Bridge: ALi Corporation M7101 Power Management Controller [PMU]
00:04.0 Multimedia audio controller: ALi Corporation M5455 PCI AC-Link
Controller Audio Device (rev 03)
00:06.0 Network controller: Intersil Corporation Intersil ISL3890 [Prism
GT/Prism Duette] (rev 01)
00:09.0 CardBus bridge: Ricoh Co Ltd RL5c475 (rev 81)
00:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)
00:0e.0 IDE interface: ALi Corporation M5229 IDE (rev c5)
00:0f.0 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:0f.1 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:0f.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01)
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon Mobility M6 LY

Software Environment: Fedora Core 3
Broken kernels include: custom built 2.6.11.6 Kernel.org kernel, and all FC3
released 2.6.10-* kernels.  
Working kernel: FC3 2.6.9-1.724_FC3
XWindows:  xorg-x11-6.8.2-1.FC3.13

Problem Description:
When running any 2.6.10 or later kernel with DRI enabled, the screen locks up:
all that remains is an X cursor that still follows the mouse, plus 4 garbled
miniature images of the BIOS boot screen along the top of the LCD, with the rest
of the LCD black.  Although the mouse responds, there is no keyboard response.
I can ssh in and cleanly shut down the machine, so the system itself is not
locked, only the X server.  There are no errors in the xorg log files to report.

DRI works perfectly fine with the 2.6.9 and earlier series kernels.
Please see these two Fedora Core bugs as reference, the first bug includes a
picture of what the screen looks like during a failure:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=144867
(please ignore the earlier posts about the 2.6.9 kernel, was never able to
reproduce for that kernel version)

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=152312


Steps to reproduce:
1. Boot a 2.6.10 or later kernel with DRI enabled in the xorg.conf file.
2. Start X from init 5, or run startx from init 3.
3. The screen becomes garbled; the mouse cursor still moves but the keyboard does not respond.
Comment 1 David Cafaro 2005-04-18 10:18:47 UTC
Ok, I tried something new and narrowed the issue down to the AGP driver in use.
With Option "BusType" "PCI", which forces xorg to treat the AGP card as a PCI
card, everything works as expected.  The only drawback is that it is slower than
AGP mode.

I'm updating this bug so that it is now listed under Drivers Video (AGP).
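
For reference, the option goes in the Device section of xorg.conf.  This is only
a minimal sketch, assuming the radeon driver; the Identifier string is a
placeholder and is not taken from any config in this report:

```
Section "Device"
    # Placeholder name; use the Identifier your existing config already has.
    Identifier "Radeon Mobility M6"
    Driver     "radeon"
    # Treat the AGP card as a plain PCI device (slower, but avoids the AGP path).
    Option     "BusType" "PCI"
EndSection
```

Removing the BusType line (or commenting it out) returns the driver to AGP mode.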
Comment 2 Dan Wright 2005-04-18 11:55:53 UTC
I also see this on the same laptop model under Gentoo with 2.6.11.  Nearly the
same symptoms, except that I don't see the "4 garbled miniature images of the
BIOS boot screen along the top of the LCD with the rest of the LCD being black"
-- just a cursor in the middle of the screen with the machine locked up.
Comment 3 Andrew Morton 2005-04-24 01:38:04 UTC
<amanuensis mode>

Dave Jones <davej@redhat.com> wrote:
>
> On Sat, Apr 23, 2005 at 07:38:05PM -0700, Andrew Morton wrote:
>  > 
>  > Think this might be an AGP problem.
>  
> Unconvinced. AGP problems tend to kill the box completely,
> or stomp on memory enough to cause panics. This sounds more
> like the card just locked up for whatever reason.
> 
> The Option "BusType" "PCI" is a red herring. It's going 
> to cause the dri code to take completely different paths.
Comment 4 Andrew Morton 2005-05-25 23:01:56 UTC
Guys, could we please have an update on this?  Is 2.6.12-rc5 working OK?
Comment 5 David Cafaro 2005-05-26 06:55:52 UTC
I'll have to build the new kernel to test that out.  But probably won't be able
to get to that until next week (vacation this weekend).  Were there any specific
changes to 2.6.12-rc5 (or any of the 2.6.12-rc) that might fix this problem?
Comment 6 Chris Kottke 2005-06-24 08:05:18 UTC
Hi there.  I'm running the 2.6.12-rc5 kernel + software suspend 2.1.8.12 patches
on a gentoo machine, same laptop as David and Dan.  I'm still seeing the problem
when I try to start X with the dri module enabled, even when I specify "BusType"
"PCI".  The screen is blank, with a responsive cursor, but the system does not
respond to any other input.
Comment 7 Sergei Klink 2005-07-02 16:59:58 UTC
2.6.12.2 with swsusp2 2.1.9.5 and bootsplash 3.1.6 (on SuSE 9.3)

Same machine, same problem as the original report with both radeonfb and vesafb
(and X doesn't die even after kill -9)
Comment 8 Sergei Klink 2005-07-24 18:02:19 UTC
Using plain sources from kernel.org, on latest gentoo (2005.0), same hardware:

2.6.10 (including all rc-*) - works fine
2.6.11-rc1 and above - doesn't work

There seem to be way too many changes to DRI in 2.6.11-rc1, can someone please
look into it?
Comment 9 David Cafaro 2005-07-24 18:55:57 UTC
Currently running the 2.6.12-1.1372_FC3 kernel from the Fedora Core project and
I see the same problem.  AGP with DRI enabled causes the screen to lock.  I have
not had a chance to test a vanilla Kernel.org 2.6.12 kernel yet.
Comment 10 Dave Airlie 2005-08-04 04:51:58 UTC
can you try with AGP mode set to 1 in xorg.conf 

I've no idea what is the issue between the transmeta AGP and the radeon, but
there is certainly something wrong...
Comment 11 Dave Airlie 2005-08-04 05:18:07 UTC
It's a pity there were both AGP and DRM changes in 2.6.11-rc1... if someone could
track down which of them causes the issue, there is a much better chance we can fix
this... this is purely the Transmeta AGP bridge and some interaction with DRM...

Could you remove all the DRM changes from the 2.6.11-rc1 patch and see whether
that was it or not ...
Comment 12 David Cafaro 2005-08-04 05:55:25 UTC
The issues started in the 2.6.10 series kernels, so it's not something directly
related to 2.6.11 (or at least not exclusively).  I tried rolling back some changes
in the efficeon AGP driver that had occurred in the 2.6.10 and 2.6.11 branches,
but that didn't help.  Hopefully in the next 2 months I'll have more spare time
to play around with this, but I'm not that great at C coding and certainly no
kernel hacker.
Comment 13 David Cafaro 2005-08-04 06:03:01 UTC
Actually to be more clear:

AGP+DRI 4x/2x broken starting at 2.6.6 kernels (Fedora 2), 2.6.7 (kernel.org
version)

AGP+DRI broken 2.6.10 kernel (Fedora 2/3 versions) 2.6.11 (Kernel.org version)

AGP without DRI works in all

PCI+DRI currently works for me (using Option "BusType" "PCI") 2.6.12 (fedora 3)
Comment 14 Chris Kottke 2005-08-10 13:09:05 UTC
I just tried 2.6.11-rc1 with the drm patches removed, as per Dave Airlie's
suggestion.  Still hung.  I was using AGPMode 1 in xorg.conf.  As I have time,
I'll continue to work backwards through patchsets until I can locate the point
at which it's not broken on my laptop.
Comment 15 Sergei Klink 2005-08-11 14:07:03 UTC
Hm, I'm using 2.6.10 with AGP 2x/4x (glxinfo reports 4x (value from xorg.conf),
kernel - 2x) with no problems (~290 FPS in glxgears :)

Here are some relevant configs/logs (xorg.conf mostly carried over from my old
SuSE install):

http://members.shaw.ca/sklink/mm20/
Comment 16 Dave Jones 2006-01-04 22:47:46 UTC
Is this still a problem with 2.6.15 ?
Comment 17 David Cafaro 2006-01-05 07:35:36 UTC
Truthfully, I don't know; the last kernel I tried this with was 2.6.13 on FC3.  At
that point it was still a problem, I had to use the PCI option to get DRI to
work.  I started to install FC4 and half way through decided to give (GULP)
Windows XP a try again.  Currently I'm still running XP on the machine, so
unable to test current kernels.  I keep thinking of trying to get Linux on it
again...
Comment 18 David Cafaro 2006-01-13 08:59:45 UTC
Ok, I've reinstalled FC4 on my laptop and currently no DRI works.  Non-DRI
graphics only work when the system is forced into PCI mode.  This is using the
Fedora Core Kernel:

2.6.14-1.1656_FC4 and xorg 6.8.2-37.FC4.49.2

I will try and test with a kernel.org 2.6.15 kernel this weekend.
Comment 19 Brian Hinz 2006-02-12 19:08:48 UTC
I've managed to find the code that causes this break, and have AGP DRI working
with 2.6.11 and also with 2.6.15, although it's just a hack, not a permanent
fix.  drivers/char/agp/generic.c was changed between 2.6.10 and 2.6.11; reverting
these lines gets it working (this diff is against 2.6.15, 2.6.11 is slightly different):

bluejay linux # diff drivers/char/agp/generic.c drivers/char/agp/generic.c.bak
213,214c213
<               new->memory[i] =
<                       agp_bridge->driver->mask_memory(bridge, virt_to_phys(addr), type);
---
>               new->memory[i] = virt_to_gart(addr);
861a861
>               readl(bridge->gatt_table+i);    /* PCI Posting. */
1010a1011
>               readl(bridge->gatt_table+i);    /* PCI Posting. */
1105d1105
< #ifdef CONFIG_SMP
1110d1109
< #endif
1114d1112
< #ifdef CONFIG_SMP
1117,1119d1114
< #else
<       flush_agp_cache();
< #endif

I can compile cleanly, load drm & run in AGP 4x mode (at least it's set in
xorg.conf).  2.6.11 froze after ~10 min, but that was with the kernel drm. drm
from CVS with 2.6.15 seems stable with xorg 7.0 modular.  I'm not sure which of
those lines in the diff is actually the problem, probably either the mask_memory
or the flush_agp_cache, but in either case something may still not be quite
right - 2.6.10 (vanilla) reported the device being put in 2x mode upon AGPgart
initialization, while 2.6.11/15 reports 0x mode. Not sure if this is really a
problem, maybe the mode gets set to whatever X requests anyway...  glxgears
scores are consistent with pre-2.6.11, ~290fps, and render, composite,
pageflipping, 4x mode, etc. all seem to work. Perhaps someone with more
knowledge of the generic agp code can shed some light on this?
Comment 20 Benjamin Herrenschmidt 2006-03-18 21:38:43 UTC
Please, try a recent 2.6.15 kernel from your distro and the latest  
ati-1-0-branch ddx driver, and if the problem still happens, attach:  
  
 - output of lspci -vv taken as root  
 - complete dmesg log  
 - X.org log  
 - xorg.conf  
 
If the freeze prevents you from getting the dmesg & X.log, you can generally at
least try to get the latter by mounting your root file system in synchronous mode
with mount / -o remount,sync; after reboot, the log should be in the usual
location.
  
Comment 21 Brian Hinz 2006-03-19 19:13:37 UTC
Created attachment 7609
output of dmesg for bug #4513
Comment 22 Brian Hinz 2006-03-19 19:14:28 UTC
Created attachment 7610
output of lspci for bug #4513
Comment 23 Brian Hinz 2006-03-19 19:15:08 UTC
Created attachment 7611
Xorg log for bug #4513
Comment 24 Brian Hinz 2006-03-19 19:15:37 UTC
Created attachment 7612
xorg.conf for bug #4513
Comment 25 Brian Hinz 2006-03-19 19:23:31 UTC
Hi Ben,

I've tried the ati-1-0-branch, along with drm from cvs today, but without the
virt_to_gart regression the display still freezes and causes high cpu utilization
(though the system is not hung). Also, the mtrr overlaps are still present - you can see
from the lspci output that the amount of video memory is erroneously detected as
128MB, rather than the 16MB that's actually present. I have not yet tested it with
16MB hard coded into the drm code but virt_to_gart in the kernel agp code. I
can post those results if you think it's worth it.  Thanks for taking the time
to look into this.

-brian
Comment 26 David Cafaro 2006-03-22 08:32:34 UTC
Ok, I've just recently installed Fedora Core 5 on my laptop that just came back
from warranty service for a bad LCD screen.  It has all the most updated
packages available for the system.  Here are some relevant packages:

kernel-2.6.15-1.2054_FC5
xorg-x11-drv-ati-6.5.7.3-4
xorg-x11-server-Xorg-1.0.1-9
xorg-x11-drivers-7.0-2

The X display still locks up (no keyboard/mouse response, blank display) but the
system is still responsive via SSH from another box into the laptop.  I've
attached the requested files.  Next I will try Brian Hinz's workaround.
Comment 27 David Cafaro 2006-03-22 08:33:13 UTC
Created attachment 7639
Dmesg Log from failed Xorg
Comment 28 David Cafaro 2006-03-22 08:33:49 UTC
Created attachment 7640
lspci -vv Results
Comment 29 David Cafaro 2006-03-22 08:34:15 UTC
Created attachment 7641
Xorg Log file
Comment 30 David Cafaro 2006-03-22 08:34:41 UTC
Created attachment 7642
Xorg Config file
Comment 31 Benjamin Herrenschmidt 2006-03-22 17:21:43 UTC
From what I can see in those logs, everything looks fine.

The difference in "memory size" is due to the fact that the card exposes a 128MB
region on the PCI bus while only using 16MB out of that zone, but that shouldn't
be a problem. The driver gets it all right, the memory map is set up properly,
things aren't even moved around so I don't see any reason why it would break.

I suspect there is something unrelated to the radeon driver here. Either some
crap with your MTRR settings and/or an issue with the driver for your host AGP
chipset or both ... I don't know much about MTRRs as they are an x86 thing and I
don't do x86 :) But I suspect that's where you should look. Maybe try
disabling them ? (I think there is MTRR related stuff both in X and the kernel).

Also, I don't know what to think about your change to the AGP driver either as
I don't know that AGP driver but I would expect davej to have an idea there...

So IMHO, it's not a radeon bug.
Comment 32 Brian Hinz 2006-03-22 19:21:11 UTC
I agree, I don't think that it is a bug in the radeon driver either at this
point.  With regards to the MTRR overlaps though, I edited the file
drm/shared-core/radeon_cp.c from drm cvs and replaced "drm_get_resource_len(dev,
0)" with "16777216" and they went away (I've attached the patch).  Honestly, I
don't even know what effect an mtrr overlap has, but if the over-exposed memory
is causing problems for the drm code, then there's a possibility it may be the
culprit in the agp code as well...  I tried running the unpatched kernel with
this mod to drm and the screen still froze.
Comment 33 Brian Hinz 2006-03-22 19:23:43 UTC
Created attachment 7644
hack to drm cvs (03/19/2006) that removes mtrr overlaps

just hard codes 16mb into the drm radeon mtrr code for testing
Comment 34 Benjamin Herrenschmidt 2006-03-22 19:43:15 UTC
There shouldn't be anything overlapping there... the BIOS assigns 128M to the
card and that's it... if limiting the stuff to 16M fixes it, it means there is a
bug somewhere with MTRR handling...
The AGP aperture itself is mapped at 0xec000000 (for 64M) and thus sits just below the
framebuffer, which gets 0xf0000000 for 128M. It isn't overlapping anything so I
don't see why it would have a problem... I suspect there are some stale MTRRs
floating around the fb space causing the overlap but I can't say if that has
anything to do with your other fix to the AGP driver.
Comment 35 Brian Hinz 2006-03-22 19:55:58 UTC
radeonfb creates the first mtrr, 16mb in size, and then drm creates another that
is 128mb in size. dmesg shows:

...
mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
...

again, I don't even know if this is a problem or just a warning, only that
radeonfb's mtrr handling is clever enough to figure out the amount of physical video
memory while drm's mtrr handling isn't. I don't recall what /proc/mtrr showed
when the overlaps existed but when limited to 16mb it looks clean (count=2).
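
The address arithmetic behind those messages can be checked in user space.
This is only a sketch: the struct and function names are made up for
illustration and are not kernel APIs; the base/size pairs are the ones quoted
in this report (aperture 0xec000000 for 64M, radeonfb's 16M mtrr and drm's
128M mtrr, both at 0xf0000000).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A physical range in the same base,size form the mtrr messages use. */
struct range {
    uint64_t base;
    uint64_t size;
};

/* Values quoted in this report. */
static const struct range AGP_APERTURE  = { 0xec000000u, 0x04000000u }; /* 64 MB  */
static const struct range RADEONFB_MTRR = { 0xf0000000u, 0x01000000u }; /* 16 MB  */
static const struct range DRM_MTRR      = { 0xf0000000u, 0x08000000u }; /* 128 MB */

/* First address past the end of the range. */
static uint64_t range_end(struct range r)
{
    return r.base + r.size;
}

/* True when the half-open ranges [base, base + size) share any address. */
static bool ranges_overlap(struct range a, struct range b)
{
    return a.base < range_end(b) && b.base < range_end(a);
}

/* Hypothetical self-check reproducing the two observations in this report. */
static void check_ranges(void)
{
    /* The 128 MB request covers the existing 16 MB mtrr: the dmesg warning. */
    assert(ranges_overlap(RADEONFB_MTRR, DRM_MTRR));
    /* The aperture ends exactly where the framebuffer begins: no overlap. */
    assert(range_end(AGP_APERTURE) == DRM_MTRR.base);
    assert(!ranges_overlap(AGP_APERTURE, DRM_MTRR));
}
```

Calling check_ranges() from a small main() passes silently, which is consistent
with the overlap being between the two framebuffer mtrr requests only, not
between the framebuffer and the AGP aperture.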
Comment 36 Brian Hinz 2006-03-22 19:59:11 UTC
Created attachment 7645
minimal regression patch to drivers/char/agp/generic.c that fixes the screen freeze issue

This patch narrows the issue down to one specific call
Comment 37 Benjamin Herrenschmidt 2006-03-22 20:29:28 UTC
Ok, so we have 2 different issues here:

 - some mtrr related garbage between DRI and radeonfb. What should we do here ?
Is X also trying to play with MTRRs ? Should I change radeonfb to cover the
entire PCI area or try to make the DRM smarter ? The problem is a bit annoying
as the DRM doesn't necessarily know how much the server decided was accessible
(and radeonfb doesn't have half of the bug fixes in this area). Maybe the
message is harmless and we should just not do anything ?

 - This agp specific problem.. Let's have a look... if you look at the
virt_to_gart() macro, the main difference is that ... it doesn't call
mask_memory() on the address when filling the memory[] array.

Now the question is: should we call mask_memory when filling the memory[] array
... I would have expected that to be something that must be done by the chipset
when inserting the memory, not when allocating it... And that's indeed what
agp_generic_insert_memory() does... So let's look at your chipset driver...

Hehe ! 

		unsigned long insert = mem->memory[i];

		page = (unsigned int *) efficeon_private.l1_table[index >> 10];

		if (!page)
			continue;
		
		page += (index & 0x3ff);
		*page = insert;

Looks like there is a mask_memory() call missing here !


Try to change:

-		unsigned long insert = mem->memory[i];
+		unsigned long insert = bridge->driver->mask_memory(bridge, mem->memory[i], mem->type);

In efficeon-agp.c, line 254 and tell us if that helps.

Comment 38 Brian Hinz 2006-03-22 21:03:13 UTC
Ben,

Indeed, that does fix the problem! I had to change bridge to agp_bridge but
that's trivial.  I'm creating an attachment for Dave C. to try.

As far as the mtrr's go, I care more about radeonfb having the mtrr b/c of the
huge improvement in the speed of text scrolling on a console, but that's just me... 
Comment 39 Brian Hinz 2006-03-22 21:06:12 UTC
Created attachment 7646
potential bug resolution patch

Dave C - please test this patch and post your results.
Comment 40 Benjamin Herrenschmidt 2006-03-22 21:08:44 UTC
Talking a bit with Peter, it seems that just doing memory[i] | 1 would do the
trick and is simpler :) (Since the memory mask is just that on this chipset).

Regarding the MTRRs, I think those critters are based on physical addresses
anyway which means that even if X/DRM fails to set up its own, it will benefit
from the radeonfb one, so world peace is safe.
Comment 41 H. Peter Anvin 2006-03-22 21:22:26 UTC
Please note that this code devolves to mem->memory[i] | 1; there isn't much
point in calling something indirect through two pointers to do that!  I'm going
to talk to my contacts at Transmeta and see if I can get a copy of the official
documentation, which doesn't seem to be publicly available *sigh*.
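
To make the two-step picture concrete, here is a small user-space model, not
kernel code.  It assumes only what this thread establishes: pages are recorded
unmasked at allocation time, and the chipset mask is "| 1".  All names
(MASK_BIT, alloc_pages, insert_unmasked, insert_masked, check_insert) are
hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption from this thread: the chipset mask is just bit 0. */
#define MASK_BIT 0x1u

/* What mask_memory() reduces to on this chipset, per the comments above. */
static uint32_t mask_memory(uint32_t addr)
{
    return addr | MASK_BIT;
}

/* Allocation step: record bare page addresses, with no mask applied. */
static void alloc_pages(uint32_t *memory, size_t count, uint32_t first_page)
{
    for (size_t i = 0; i < count; i++)
        memory[i] = first_page + (uint32_t)(i * 4096u);
}

/* Insertion step that copies entries as-is (the behaviour being fixed). */
static void insert_unmasked(uint32_t *table, const uint32_t *memory, size_t count)
{
    for (size_t i = 0; i < count; i++)
        table[i] = memory[i];
}

/* Insertion step that applies the mask while filling the table. */
static void insert_masked(uint32_t *table, const uint32_t *memory, size_t count)
{
    for (size_t i = 0; i < count; i++)
        table[i] = mask_memory(memory[i]);
}

/* Hypothetical self-check comparing the two insertion variants. */
static void check_insert(void)
{
    uint32_t memory[4];
    uint32_t table[4];

    alloc_pages(memory, 4, 0x10000000u);

    insert_unmasked(table, memory, 4);
    assert((table[0] & MASK_BIT) == 0);   /* bit 0 never gets set */

    insert_masked(table, memory, 4);
    assert(table[0] == 0x10000001u);
    assert(table[3] == 0x10003001u);
}
```

In this model the unmasked variant leaves bit 0 clear in every table entry,
which mirrors what happens once masking moves out of the allocation step and
the chipset's own insert routine does not add it back.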
Comment 42 Brian Hinz 2006-03-22 21:26:37 UTC
just to confirm, (mem->memory[i]|1) works fine.
Comment 43 H. Peter Anvin 2006-03-22 22:21:39 UTC
Created attachment 7647
Proposed patch, version 2

Slightly cleaned up patch; please test this out as I'd like to push it
upstream.
Comment 44 Brian Hinz 2006-03-23 03:40:09 UTC
Tested, looks good. Thanks!
Comment 45 David Cafaro 2006-03-23 12:06:23 UTC
I can confirm the "proposed patch, version 2" does the trick!  DRI is now
working with AGP on my FC5 install!

Thank you all for finding this fix!
Comment 46 H. Peter Anvin 2006-04-10 09:25:04 UTC
Patch sent upstream to davej; will close BR when patch is merged.

Dave: please let me know if you want me to send the patch to Linus or Andrew
directly.

