Bug 12947 - r128: system hangs when X is started with DRI enabled
Status: CLOSED CODE_FIX
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assigned To: Venkatesh Pallipadi
Depends on:
Blocks: 12398
Reported: 2009-03-26 16:14 UTC by seraph@xs4all.nl
Modified: 2009-06-01 20:12 UTC

See Also:
Kernel Version: 2.6.29
Tree: Mainline
Regression: Yes


Attachments
strace of startx with a working kernel (2.6.27-gentoo-r10) (28.61 KB, text/plain)
2009-04-17 06:13 UTC, seraph@xs4all.nl
strace of startx with a broken kernel (2.6.29) (25.23 KB, text/plain)
2009-04-17 06:14 UTC, seraph@xs4all.nl
Output from dmesg after X failure on kernel 2.6.29.3 (2.64 KB, text/plain)
2009-05-12 10:19 UTC, seraph@xs4all.nl
Complete dmesg with hanging X. (19.97 KB, text/plain)
2009-05-13 11:53 UTC, seraph@xs4all.nl
Complete dmesg after starting X when booted with nopat. (17.63 KB, text/plain)
2009-05-13 11:55 UTC, seraph@xs4all.nl
Test patch (2.72 KB, patch)
2009-05-13 21:27 UTC, Venkatesh Pallipadi
dmesg output with debugging code in place (18.88 KB, text/plain)
2009-05-14 09:40 UTC, seraph@xs4all.nl
dmesg with Dave's patch and debug patch in place (18.72 KB, text/plain)
2009-05-25 14:41 UTC, seraph@xs4all.nl

Description seraph@xs4all.nl 2009-03-26 16:14:51 UTC
After upgrading to kernel 2.6.29 from 2.6.28.7, the system hangs when X is started with DRI enabled. All I get is a black screen with some garbage at the top.

X starts normally with DRI disabled. I've never had problems with any previous kernels up to 2.6.28.7.

The videocard is an ATI Rage 128 Pro Ultra TF in an Intel AGP slot:

00:01.0 PCI bridge: Intel Corporation 82845 845 [Brookdale] Chipset AGP Bridge (rev 03)
01:00.0 VGA compatible controller: ATI Technologies Inc Rage 128 Pro Ultra TF
Comment 1 Alex Villacis Lasso 2009-03-31 23:12:04 UTC
This bug looks similar to bug 12920. If r128 also requires an mmap of /dev/dri/cardN to work, it could be affected by the same bug. Can you collect an strace of Xorg in both the successful and the failing case and compare the two, as was done on bug #12920?
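
A comparison like the one requested above can be narrowed to the calls that matter by filtering both logs before diffing them. The following is only a rough sketch, not a tool used in this bug: the function names are hypothetical, and the choice of syscalls to keep and of what to mask as run-to-run noise (the PID column and returned addresses) is an assumption.

```python
import difflib
import re

# A leading PID column (present with "strace -f") and returned addresses vary
# between runs, so they are masked before diffing. Treating them as noise is
# an assumption of this sketch.
_PID = re.compile(r"^\d+\s+")
_RET_ADDR = re.compile(r"= 0x[0-9a-fA-F]+$")


def normalize(line):
    """Drop a leading PID column and mask a returned hex address."""
    line = _PID.sub("", line.rstrip("\n"))
    return _RET_ADDR.sub("= 0xADDR", line)


def interesting(lines, syscalls=("open", "mmap", "mmap2", "ioctl")):
    """Keep only normalized lines whose syscall name is in `syscalls`."""
    kept = []
    for line in lines:
        norm = normalize(line)
        name = norm.split("(", 1)[0]
        if name in syscalls:
            kept.append(norm)
    return kept


def compare(good_lines, bad_lines):
    """Return a unified diff of the filtered good and bad traces."""
    return list(difflib.unified_diff(
        interesting(good_lines), interesting(bad_lines),
        fromfile="good", tofile="bad", lineterm=""))
```

Each trace would be read with something like `open(path).readlines()` and passed to `compare()`; an mmap of the DRM device that succeeds in one trace and fails in the other should then show up as a removed and an added line.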
Comment 2 seraph@xs4all.nl 2009-04-17 06:13:56 UTC
Created attachment 21023
strace of startx with a working kernel (2.6.27-gentoo-r10)
Comment 3 seraph@xs4all.nl 2009-04-17 06:14:53 UTC
Created attachment 21024
strace of startx with a broken kernel (2.6.29)
Comment 4 seraph@xs4all.nl 2009-04-17 06:17:49 UTC
Sorry it took me a while to get back to this. The machine in question is in a building that I don't have free access to.

I poked around a bit and found the system is still somewhat responsive when this happens. It seems that only the video card hangs, as I can't get a working screen again without resetting the system.

I've attached two straces: one from a working kernel and one from a kernel where the bug occurs.
Comment 5 Rafael J. Wysocki 2009-04-26 11:27:03 UTC
Notify-Also : DRI <dri-devel@lists.sourceforge.net>
Comment 6 Rafael J. Wysocki 2009-04-26 17:44:39 UTC
On Sunday 26 April 2009, Angel wrote:
> On Sun, 26 Apr 2009 11:46:26 +0200 (CEST)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> 
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.28 and 2.6.29.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.28 and 2.6.29.  Please verify if it still should
> > be listed and let me know (either way).
> 
> I've tested with kernel 2.6.29.1. The problem still persists. While everything works fine with 2.6.28
> or lower, starting X (both 1.3 and 1.5) with DRI enabled on a 2.6.29 kernel results
> in a garbled screen and unresponsive keyboard.
Comment 7 Rafael J. Wysocki 2009-04-28 21:49:19 UTC
On Monday 27 April 2009, Jos van der Ende wrote:
> On Sun, 26 Apr 2009 19:43:44 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> 
> > On Sunday 26 April 2009, Angel wrote:
> > > On Sun, 26 Apr 2009 11:46:26 +0200 (CEST)
> > > 
> > > I've tested with kernel 2.6.29.1. The problem still persists. While everything works fine with 2.6.28 or lower, starting X (both 1.3 and 1.5) with DRI enabled on a 2.6.29 kernel results in a garbled screen and unresponsive keyboard.
> > 
> > Thanks for the update.
> > 
> > Can you also test 2.6.30-rc3, please?
> 
> Sure. Still broken in exactly the same way. (Garbled screen, non-working keyboard. System can still be ssh-ed into.)
Comment 8 seraph@xs4all.nl 2009-05-11 15:18:56 UTC
I bisected this; the bad commit is:

commit cdecff6864a1cd352a41d44a65e7451b8ef5cee2
Author: venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com>
Date:   Fri Jan 9 16:13:12 2009 -0800

    x86 PAT: return compatible mapping to remap_pfn_range callers
    

I also tested 2.6.29.3 and 2.6.30-rc5. Both are still broken in exactly the same way, except that 2.6.30-rc5 doesn't show the garbage at the top of the screen; the screen is completely black.
Comment 9 Andrew Morton 2009-05-11 19:02:08 UTC
(x86 cc's added)

(reassigned to x86)

It's a PAT-related regression in 2.6.29.
Comment 10 Venkatesh Pallipadi 2009-05-11 19:42:21 UTC
Do things work with "nopat" on a recent 2.6.29.x?

If you can still access the system after this X hang (over the network or on a text console), can you get the dmesg output from 2.6.29.x when the problem happens?
Comment 11 seraph@xs4all.nl 2009-05-12 10:13:35 UTC
I've tested 2.6.29.3 with the nopat option. With that, X works normally.
Comment 12 seraph@xs4all.nl 2009-05-12 10:19:05 UTC
Created attachment 21314
Output from dmesg after X failure on kernel 2.6.29.3

I was able to ssh into the system after the failure to start X. I've attached the relevant portion of dmesg.

I see a null pointer dereference and a kernel oops; that probably explains the problem. :-)
Comment 13 Venkatesh Pallipadi 2009-05-12 20:21:29 UTC
The oops looks like an aftereffect, so it does not say much about why the failure happened in the first place. The first suspicious message I see is

[drm:r128_do_init_cce] *ERROR* Could not ioremap agp regions!

which is basically drm_core_ioremap_wc failing in the driver. I am not sure whether there are any other errors before this point, though, because the dmesg is not complete. Can you get the complete dmesg? "dmesg -s <size>" or /var/log/dmesg (in newer distros) should have all messages from boot time.

Can you also attach the output of "cat /debug/x86/pat_memtype_list" after you see this X startup problem?

A dmesg from a boot where things work with "nopat" would be interesting as well.
Comment 14 seraph@xs4all.nl 2009-05-13 11:53:58 UTC
Created attachment 21324
Complete dmesg with hanging X.
Comment 15 seraph@xs4all.nl 2009-05-13 11:55:26 UTC
Created attachment 21325
Complete dmesg after starting X when booted with nopat.
Comment 16 seraph@xs4all.nl 2009-05-13 12:06:32 UTC
Sorry, I do not have debugfs compiled into the kernel. Also, the system in question is not mine and I have to be a bit selective in what data I upload here. These dmesg files are complete, though.
Comment 17 Venkatesh Pallipadi 2009-05-13 21:27:45 UTC
Created attachment 21337
Test patch

I am still struggling to connect all the dots here between the commit cdecff6864a1cd352a41d44a65e7451b8ef5cee2 and the error we are seeing here with drm_ioremap() failing with PAT and succeeding with nopat.

Sorry, I can't find a platform in my lab that reproduces this error. Can you please apply the test patch on top of 2.6.29.3 and get the new "DEBUG" lines from dmesg?

Thanks.
Comment 18 seraph@xs4all.nl 2009-05-14 09:40:02 UTC
Created attachment 21353
dmesg output with debugging code in place

Well, it seems that your debugging patch somehow fixed the problem. With it in place, I can no longer trigger the bug! :-)

Anyway, here are the DEBUG lines. I've attached the full dmesg output as well.


X:5348 DEBUG map pfn expected mapping type write-back for e0000000-e0101000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0101000-e0102000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0102000-e0302000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0302000-e07e2000, got write-combining
DEBUG ioremap error for 0xe0000000-0xe0101000, requested 0x10, got 0x8
DEBUG ioremap error for 0xe0101000-0xe0102000, requested 0x10, got 0x8
DEBUG ioremap error for 0xe0102000-0xe0302000, requested 0x10, got 0x8
X:5348 DEBUG map pfn expected mapping type write-back for e0102000-e0302000, got write-combining
X:5348 DEBUG map pfn expected mapping type uncached-minus for d8000000-d9000000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0102000-e0302000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0302000-e07e2000, got write-combining
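
The numeric values in the "ioremap error" lines can be read as x86 page-table cache bits. The sketch below is only an illustration and rests on an assumption that is not stated in this report: the 2.6.29-era encoding in which PWT is 0x8 and PCD is 0x10, with write-back as 0, write-combining as PWT, uncached-minus as PCD and uncached as both. The function name is hypothetical, and the line format is simply the one printed by the test patch above.

```python
import re

# Assumed x86 cache-attribute encoding (2.6.29-era headers); this mapping is
# an assumption of the sketch, not something taken from the bug report.
PAGE_PWT = 0x08
PAGE_PCD = 0x10
CACHE_NAMES = {
    0x00: "write-back",
    PAGE_PWT: "write-combining",
    PAGE_PCD: "uncached-minus",
    PAGE_PCD | PAGE_PWT: "uncached",
}

# Matches the "DEBUG ioremap error" lines exactly as they appear in dmesg.
_IOREMAP_ERR = re.compile(
    r"DEBUG ioremap error for 0x([0-9a-f]+)-0x([0-9a-f]+), "
    r"requested 0x([0-9a-f]+), got 0x([0-9a-f]+)")


def decode_ioremap_errors(dmesg_text):
    """Return one dict per ioremap error line with decoded cache types."""
    decoded = []
    for match in _IOREMAP_ERR.finditer(dmesg_text):
        start, end, requested, got = (int(g, 16) for g in match.groups())
        decoded.append({
            "start": start,
            "end": end,
            "size": end - start,
            # Unknown values fall back to their hex representation.
            "requested": CACHE_NAMES.get(requested, hex(requested)),
            "got": CACHE_NAMES.get(got, hex(got)),
        })
    return decoded
```

Under that assumed encoding, "requested 0x10, got 0x8" reads as an uncached-minus request meeting an existing write-combining mapping, which agrees with the "got write-combining" text in the map pfn lines for the same ranges.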
Comment 19 Venkatesh Pallipadi 2009-05-14 18:27:38 UTC
OK. Yes, the debug patch lets ioremap succeed despite the error messages above, and that's the reason you do not see the problem any more.

From all the info we have here:
- X is actually mapping these regions with what looks like the remap_pfn_range interface.
- Before the patch cdecff6864a1cd352a41d44a65e7451b8ef5cee2, that remap was failing due to the above DEBUG error. X handled that failure in a sane way and kept things working in spite of it. It looks like the driver did not call the subsequent ioremap() in that case.
- After the patch cdecff6864a1cd352a41d44a65e7451b8ef5cee2, we allow the X mapping to pass by changing the attributes appropriately. The driver then calls ioremap(), which fails, and X cannot handle that failure cleanly, resulting in the hang that you are seeing.
- The debug patch let ioremap pass in the same way as the X mapping passes, and then things work fine again.
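
The sequence in the list above can be illustrated with a toy model. This is not kernel code: the class and method names are hypothetical, ranges are matched only when they are identical, and the "strict" flag is merely a stand-in for the difference between the ioremap path and the remap_pfn_range path as described in this comment.

```python
class MemtypeConflict(Exception):
    """Raised when a strict request disagrees with an existing reservation."""


class ToyMemtypeList:
    """Toy stand-in for a per-range cache-type registry (hypothetical)."""

    def __init__(self):
        self._ranges = {}  # (start, end) -> cache type name

    def reserve(self, start, end, want, strict=True):
        """Reserve [start, end) as `want` and return the type in effect.

        strict=True  fails on a mismatch (models the ioremap path).
        strict=False returns the existing type instead of failing (models
                     the remap_pfn_range path after the bisected commit).
        """
        have = self._ranges.get((start, end))
        if have is None:
            self._ranges[(start, end)] = want
            return want
        if have == want or not strict:
            return have
        raise MemtypeConflict(
            "0x%x-0x%x: requested %s, got %s" % (start, end, want, have))


memtypes = ToyMemtypeList()
agp = (0xe0000000, 0xe0101000)

# The range is already tracked as write-combining when X maps it.
memtypes.reserve(*agp, want="write-combining")

# X asks for write-back through the lenient path and is handed
# write-combining, so the mapping succeeds.
x_type = memtypes.reserve(*agp, want="write-back", strict=False)

# The driver's later strict request for uncached-minus then conflicts.
try:
    memtypes.reserve(*agp, want="uncached-minus")
    ioremap_failed = False
except MemtypeConflict:
    ioremap_failed = True
```

In this model the conflict disappears if the strict request asks for the type that is already in place, for example `memtypes.reserve(*agp, want="write-combining")`. Whether that is how the real fix works is an assumption here, not something established in this comment.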

From the latest sources, I see Dave has this patch here:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=42beefc00

This patch should make the ioremap work again as well. Can you apply it on top of 2.6.29.3 + the debug patch? X should then work without these messages:
DEBUG ioremap error for 0xe0000000-0xe0101000, requested 0x10, got 0x8
DEBUG ioremap error for 0xe0101000-0xe0102000, requested 0x10, got 0x8
DEBUG ioremap error for 0xe0102000-0xe0302000, requested 0x10, got 0x8

Now, coming to the failure you are seeing with 2.6.30-rc: that seems to be unrelated to the above PAT problem. It may need another round of bisecting, starting from 2.6.29 + patch 42beefc00 above and ending with the latest 2.6.30-rc.
Comment 20 seraph@xs4all.nl 2009-05-25 14:41:20 UTC
Created attachment 21533
dmesg with Dave's patch and debug patch in place

Here's the dmesg.out file with both Dave's patch and the debugging patch in place (on 2.6.29.3). X works normally in this case.
Comment 21 Venkatesh Pallipadi 2009-05-26 19:11:15 UTC
OK. So we have a fix for this bug in 2.6.29. I will forward Dave's patch to the 2.6.29-stable series, which will resolve the problem in 2.6.29.x.

In comment #8, you also said that 2.6.30-rc5 fails with a blank screen. Can you please check the latest 2.6.30-rc and see whether you still have the problem there? That looks like it is due to some more recent change, and it still needs to be root-caused.

Thanks.
Comment 22 Rafael J. Wysocki 2009-06-01 20:11:13 UTC
On Sunday 31 May 2009, Angel wrote:
> On Sat, 30 May 2009 21:55:34 +0200 (CEST)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> 
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.28 and 2.6.29.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.28 and 2.6.29.  Please verify if it still should
> > be listed and let me know (either way).
> > 
> > 
> > Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=12947
> > Subject		: r128: system hangs when X is started with DRI enabled
> > Submitter	: Jos van der Ende <seraph@xs4all.nl>
> > Date		: 2009-03-26 16:14 (66 days old)
> 
> 
> We've found a working bug fix for 2.6.29 which I expect will be included in the next bug fix release. However, 2.6.30-rc[1-7] (which already includes this fix) introduced a new problem which we're still working on.
> 
> The bug fix in question is this one:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=42beefc00
