Bug 4513
Summary: | Radeon driver in 2.6.10 through 2.6.11.6 fails in DRI mode. Kernel 2.6.9 works fine. | ||
---|---|---|---|
Product: | Drivers | Reporter: | David Cafaro (dac) |
Component: | Video(AGP) | Assignee: | H. Peter Anvin (hpa) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | benh, bphinz, davej, hpa |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.10 through 2.6.11.6 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
output of dmesg for bug #4513
output of lspci for bug #4513
Xorg log for bug #4513
xorg.conf for bug #4513
Dmesg Log from failed Xorg
lspci -vv Results
Xorg Log file
Xorg Config file
hack to drm cvs (03/19/2006) that removes mtrr overlaps
minimal regression patch to drivers/char/agp/generic.c that fixes the screen freeze issue
potential bug resolution patch
Proposed patch, version 2 |
Description
David Cafaro
2005-04-18 07:10:46 UTC
Ok, I tried something new, and narrowed the issue down to something with the AGP driver used. By using Option "BusType" "PCI" and forcing xorg to use the AGP card as a PCI card, everything works as expected. The only issue is that it is slower than using AGP mode. I'm updating this bug so that it is now listed under Drivers Video (AGP).

I also see this on the same laptop model under Gentoo with 2.6.11. Nearly the same symptoms, except that I don't see the "4 screwed up miniature images of the BIOS boot screen along the top of the LCD with the rest of the LCD being black" -- just a cursor in the middle of the screen with the machine locked up.

<amanuensis mode>

Dave Jones <davej@redhat.com> wrote:
>
> On Sat, Apr 23, 2005 at 07:38:05PM -0700, Andrew Morton wrote:
> >
> > Think this might be an AGP problem.
>
> Unconvinced. AGP problems tend to kill the box completely,
> or stomp on memory enough to cause panics. This sounds more
> like the card just locked up for whatever reason.
>
> The Option "BusType" "PCI" is a red herring. It's going
> to cause the dri code to take completely different paths.

Guys, could we please have an update on this? Is 2.6.12-rc5 working OK?

I'll have to build the new kernel to test that out. But I probably won't be able to get to that until next week (vacation this weekend). Were there any specific changes to 2.6.12-rc5 (or any of the 2.6.12-rc) that might fix this problem?

Hi there. I'm running the 2.6.12-rc5 kernel + software suspend 2.1.8.12 patches on a gentoo machine, same laptop as David and Dan. I'm still seeing the problem when I try to start X with the dri module enabled, even when I specify "BusType" "PCI". The screen is blank, with a responsive cursor, but the system does not respond to any other input.

2.6.12.2 with swsusp2 2.1.9.5 and bootsplash 3.1.6 (on SuSE 9.3). Same machine, same problem as the original report with both radeonfb and vesafb (and X doesn't die even after kill -9).

Using plain sources from kernel.org, on latest gentoo (2005.0), same hardware:
2.6.10 (including all rc-*) - works fine
2.6.11-rc1 and above - doesn't work
There seem to be way too many changes to DRI in 2.6.11-rc1, can someone please look into it?

Currently running the 2.6.12-1.1372_FC3 kernel from the Fedora core project and I see the same problem. AGP with DRI enabled causes the screen to lock. I have not had a chance to test a vanilla Kernel.org 2.6.12 kernel yet.

can you try with AGP mode set to 1 in xorg.conf

I've no idea what is the issue between the transmeta AGP and the radeon, but there is certainly something wrong...

It's a pity there were both AGP and DRM changes in 2.6.11-rc1... if someone could track down which causes the issue there is a lot better chance we can fix this... this is purely a transmeta AGP bridge and some interaction with DRM... If you just remove all the DRM changes from the 2.6.11-rc1 patch and see if it was that or not ...

The issues started in the 2.6.10 series kernels, so it's not something directly related to 2.6.11 (or at least not exclusive). I tried rolling back some changes in the efficeon AGP driver that had occurred in the 2.6.10 and 2.6.11 branches, but that didn't help. Hopefully in the next 2 months I'll have more spare time to play around with this, but I'm not that great at C coding and certainly no kernel hacker.

Actually to be more clear:
AGP+DRI 4x/2x broken starting at 2.6.6 kernels (Fedora 2), 2.6.7 (kernel.org version)
AGP+DRI broken 2.6.10 kernel (Fedora 2/3 versions) 2.6.11 (Kernel.org version)
AGP-DRI works in all
PCI+DRI currently works for me (using Option "BusType" "PCI") 2.6.12 (fedora 3)

I just tried 2.6.11-rc1 with the drm patches removed, as per Dave Airlie's suggestion. Still hung. I was using AGPMode 1 in xorg.conf.
As I have time, I'll continue to work backwards through patchsets until I can locate the point at which it's not broken on my laptop.

Hm, I'm using 2.6.10 with AGP 2x/4x (glxinfo reports 4x (value from xorg.conf), kernel - 2x) with no problems (~290 FPS in glxgears :)
Here are some relevant configs/logs (xorg.conf mostly carried over from my old SuSE install): http://members.shaw.ca/sklink/mm20/

Is this still a problem with 2.6.15 ?

Truthfully, I don't know, last kernel I tried this with was 2.6.13 on FC3. At that point it was still a problem, I had to use the PCI option to get DRI to work. I started to install FC4 and half way through decided to give (GULP) Windows XP a try again. Currently I'm still running XP on the machine, so unable to test current kernels. I keep thinking of trying to get Linux on it again...

Ok, I've reinstalled FC4 on my laptop and currently no DRI works. Non-DRI graphics only work when the system is forced into PCI mode. This is using the Fedora Core Kernel: 2.6.14-1.1656_FC4 and xorg 6.8.2-37.FC4.49.2. I will try and test with a kernel.org 2.6.15 kernel this weekend.

I've managed to find the code that causes this break, and have AGP dri working with 2.6.11 and also with 2.6.15, although it's just a hack, not a permanent fix. Between 2.6.10 and 2.6.11 drivers/char/agp/generic.c was changed; I got it working by reverting these lines (this is 2.6.15, 2.6.11 is slightly different):

    bluejay linux # diff drivers/char/agp/generic.c drivers/char/agp/generic.c.bak
    213,214c213
    < new->memory[i] =
    < agp_bridge->driver->mask_memory(bridge, virt_to_phys(addr), type);
    ---
    > new->memory[i] = virt_to_gart(addr);
    861a861
    > readl(bridge->gatt_table+i); /* PCI Posting. */
    1010a1011
    > readl(bridge->gatt_table+i); /* PCI Posting. */
    1105d1105
    < #ifdef CONFIG_SMP
    1110d1109
    < #endif
    1114d1112
    < #ifdef CONFIG_SMP
    1117,1119d1114
    < #else
    < flush_agp_cache();
    < #endif

I can compile cleanly, load drm & run in AGP 4x mode (at least it's set in xorg.conf).
2.6.11 froze after ~10 min, but that was with the kernel drm. drm from CVS with 2.6.15 seems stable with xorg 7.0 modular. I'm not sure which of those lines in the diff is actually the problem, probably either the mask_memory or the flush_agp_cache, but in either case something may still not be quite right - 2.6.10 (vanilla) reported the device being put in 2x mode upon AGPgart initialization, while 2.6.11/15 reports 0x mode. Not sure if this is really a problem, maybe the mode gets set to whatever X requests anyway... glxgears scores are consistent with pre-2.6.11, ~290fps, and render, composite, pageflipping, 4x mode, etc. all seem to work. Perhaps someone with more knowledge of the generic agp code can shed some light on this?

Please try a recent 2.6.15 kernel from your distro and the latest ati-1-0-branch ddx driver, and if the problem still happens, attach:
- output of lspci -vv taken as root
- complete dmesg log
- X.org log
- xorg.conf
If the freeze prevents you from getting the dmesg & X.log, you can generally at least try to get the latter by mounting your root file system in synchronous mode with mount / -o remount,sync; after reboot, the log should be in the usual location.

Created attachment 7609 [details]
output of dmesg for bug #4513
Created attachment 7610 [details]
output of lspci for bug #4513
Created attachment 7611 [details]
Xorg log for bug #4513
Created attachment 7612 [details]
xorg.conf for bug #4513

Hi Ben, I've tried the ati-1-0-branch, along with drm from cvs today, but the display still freezes and causes high cpu utilization (though not hung) without the virt_to_gart regression. Also, the mtrr overlaps are still present - you can see from the lspci output that the amount of video memory is erroneously detected as 128Mb, rather than the 16Mb that's actually present. I did not test it yet with 16Mb hard coded into the drm code but virt_to_gart in the kernel agp code. I can post those results if you think it's worth it.
Thanks for taking the time to look into this. -brian

Ok, I've just recently installed Fedora Core 5 on my laptop that just came back from warranty service for a bad LCD screen. It has all the most updated packages available for the system. Here are some relevant packages:
kernel-2.6.15-1.2054_FC5
xorg-x11-drv-ati-6.5.7.3-4
xorg-x11-server-Xorg-1.0.1-9
xorg-x11-drivers-7.0-2
The X display still locks up (no keyboard/mouse response, blank display) but the system is still responsive via SSH from another box into the laptop. I've attached the requested files. Next I will try Brian Hinz's workaround.

Created attachment 7639 [details]
Dmesg Log from failed Xorg
Created attachment 7640 [details]
lspci -vv Results
Created attachment 7641 [details]
Xorg Log file
Created attachment 7642 [details]
Xorg Config file
From what I can see in those logs, everything looks fine. The difference in "memory size" is due to the fact that the card exposes a 128Mb region on the PCI bus while only using 16M out of that zone, but that shouldn't be a problem. The driver gets it all right, the memory map is set up properly, things aren't even moved around so I don't see any reason why it would break.

I suspect there is something unrelated to the radeon driver here. Either some crap with your MTRR settings and/or an issue with the driver for your host AGP chipset or both ... I don't know much about MTRRs as they are an x86 thing and I don't do x86 :) But I suspect that's where you should look. Maybe try disabling them ? (I think there is MTRR related stuff both in X and the kernel). Also, I don't know what to think about your change to the AGP driver either as I don't know that AGP driver but I would expect davej to have an idea there... So IMHO, it's not a radeon bug.

I agree, I don't think that it is a bug in the radeon driver either at this point. With regards to the MTRR overlaps though, I edited the file drm/shared-core/radeon_cp.c from drm cvs and replaced "drm_get_resource_len(dev, 0)" with "16777216" and they went away (I've attached the patch). Honestly, I don't even know what effect an mtrr overlap has, but if the over-exposed memory is causing problems for the drm code, then there's a possibility it may be the culprit in the agp code as well... I tried running the unpatched kernel with this mod to drm and the screen still froze.

Created attachment 7644 [details]
hack to drm cvs (03/19/2006) that removes mtrr overlaps
just hard codes 16mb into the drm radeon mtrr code for testing
There shouldn't be anything overlapping there... the BIOS assigns 128M to the card and that's it... if limiting the stuff to 16M fixes it, it means there is a bug somewhere with MTRR handling... The AGP aperture itself is mapped at 0xec000000 (for 64M) thus is just below the framebuffer which gets 0xf0000000 for 128M. It isn't overlapping anything so I don't see why it would have a problem... I suspect there are some stale MTRRs floating around the fb space causing the overlap but I can't say if that has anything to do with your other fix to the AGP driver.

radeonfb creates the first mtrr, 16mb in size, and then drm creates another that is 128mb in size. dmesg shows

    ...
    mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
    mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
    mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
    mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
    mtrr: 0xf0000000,0x8000000 overlaps existing 0xf0000000,0x1000000
    ...

again, I don't even know if this is a problem or just a warning, only that radeonfb's mtrr handling is clever enough to figure out the amount of physical video memory while drm's mtrr handling isn't. I don't recall what /proc/mtrr showed when the overlaps existed but when limited to 16mb it looks clean (count=2).

Created attachment 7645 [details]
minimal regression patch to drivers/char/agp/generic.c that fixes the screen freeze issue
This patch narrows the issue down to one specific call
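For readers following the MTRR side of the discussion, the address arithmetic behind the quoted dmesg messages can be checked with a few lines of user-space C. This is only a sketch: the struct and function names are made up for illustration, the overlap test is a plain interval check rather than the kernel's actual MTRR logic, and the numbers are the ones quoted in this thread (aperture at 0xec000000 for 64M, radeonfb's 16M range and drm's 128M range both based at 0xf0000000).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A physical range written the way the mtrr log lines do: base, size.
 * Hypothetical type, not a kernel structure. */
struct phys_range {
    uint64_t base;
    uint64_t size;
};

/* Values quoted in this bug report. */
const struct phys_range agp_aperture  = { 0xec000000ULL, 64ULL << 20 };  /* 64M aperture     */
const struct phys_range radeonfb_mtrr = { 0xf0000000ULL, 0x1000000ULL }; /* 16M (radeonfb)   */
const struct phys_range drm_mtrr      = { 0xf0000000ULL, 0x8000000ULL }; /* 128M (drm)       */

/* True when the half-open intervals [base, base + size) share at least
 * one byte. A simple interval test, not the kernel's MTRR code. */
bool ranges_overlap(struct phys_range a, struct phys_range b)
{
    assert(a.size > 0 && b.size > 0);
    return a.base < b.base + b.size && b.base < a.base + a.size;
}
```

With these values the drm request does overlap the radeonfb range (same base, larger size), while the AGP aperture ends exactly where the framebuffer region begins and overlaps neither, which matches the observation that the aperture "is just below the framebuffer". The constant 16777216 used in the drm hack is the same 0x1000000 (16M) that radeonfb registers.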
Ok, so we have 2 different issues here:

- some mtrr related garbage between DRI and radeonfb. What should we do here ? Is X also trying to play with MTRRs ? Should I change radeonfb to cover the entire PCI area or try to make the DRM smarter ? The problem is a bit annoying as the DRM doesn't necessarily know how much the server decided was accessible (and radeonfb doesn't have half of the bug fixes in this area). Maybe the message is harmless and we should just not do anything ?

- This agp specific problem.. Let's have a look... if you look at the virt_to_gart() macro, the main difference is that ... it doesn't call mask_memory() on the address when filling the memory[] array. Now the question is: should we call mask_memory when filling the memory[] array ... I would have expected that to be something that must be done by the chipset when inserting the memory, not when allocating it... And that's indeed what agp_generic_insert_memory() does... So let's look at your chipset driver... Hehe !

    unsigned long insert = mem->memory[i];
    page = (unsigned int *) efficeon_private.l1_table[index >> 10];
    if (!page)
        continue;
    page += (index & 0x3ff);
    *page = insert;

Looks like there is a mask_memory() call missing here ! Try to change:

    - unsigned long insert = mem->memory[i];
    + unsigned long insert = bridge->driver->mask_memory(bridge, mem->memory[i], mem->type);

In efficeon-agp.c, line 254 and tell us if that helps.

Ben, Indeed, that does fix the problem! I had to change bridge to agp_bridge but that's trivial. I'm creating an attachment for Dave C. to try. As far as the mtrr's go, I care more about radeonfb having the mtrr b/c of the huge improvement in the speed of text scrolling on a console, but that's just me...

Created attachment 7646 [details]
potential bug resolution patch
Dave C - please test this patch and post your results.
Talking a bit with Peter, it seems that just doing memory[i] | 1 would do the trick and is simpler :) (Since the memory mask is just that on this chipset).
Regarding the MTRRs, I think those critters are based on physical addresses anyway which means that even if X/DRM fails to set up its own, it will benefit from the radeonfb one, so world peace is safe.

Please note that this code devolves to mem->memory[i] | 1; there isn't much point in calling something indirect through two pointers to do that!
I'm going to talk to my contacts at Transmeta and see if I can get a copy of the official documentation, which doesn't seem to be publicly available *sigh*.

just to confirm, (mem->memory[i]|1) works fine.

Created attachment 7647 [details]
Proposed patch, version 2
Slightly cleaned up patch; please test this out as I'd like to push it
upstream.
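The shape of the fix discussed above can be sketched in user-space C. This is not the code from efficeon-agp.c or from the attached patch: every name below is hypothetical, the 32-bit entry width is an assumption, and the only facts taken from the thread are the two-level lookup (index >> 10 selects a table, index & 0x3ff selects the entry) and that masking on this chipset reduces to OR-ing in bit 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per the thread, mask_memory() on this chipset is just "addr | 1".
 * The macro name is invented for this sketch. */
#define SKETCH_ENTRY_MASK 0x1u

/* Write page addresses into a two-level table, starting at entry
 * pg_start. Each address is masked at insert time, which is the step
 * that was missing. Slots whose second-level table is absent are
 * skipped, as in the snippet quoted earlier. Returns the number of
 * entries actually written. */
size_t sketch_insert(uint32_t *l1_table[], size_t l1_entries,
                     size_t pg_start,
                     const uint32_t *pages, size_t page_count)
{
    size_t written = 0;

    for (size_t i = 0; i < page_count; i++) {
        size_t index = pg_start + i;

        if ((index >> 10) >= l1_entries)
            break;                      /* past the end of the table */

        uint32_t *page = l1_table[index >> 10];
        if (!page)
            continue;                   /* no second-level table here */

        page[index & 0x3ff] = pages[i] | SKETCH_ENTRY_MASK;
        written++;
    }
    return written;
}
```

Without the OR, the table would hold the bare page addresses with bit 0 clear, which is the state the unpatched insert path left the entries in.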
Tested, looks good. Thanks!

I can confirm the "proposed patch, version 2" does the trick! DRI is now working with AGP on my FC5 install! Thank you all for finding this fix!

Patch sent upstream to davej; will close BR when patch is merged.
Dave: please let me know if you want me to send the patch to Linus or Andrew directly. |