Bug 12947
Summary: | r128: system hangs when X is started with DRI enabled | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | seraph (seraph) |
Component: | x86-64 | Assignee: | Venkatesh Pallipadi (venki) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | akpm, avillaci, mingo, rjw, suresh.b.siddha, venki, yinghai |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.29 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 12398 | ||
Attachments: |
strace of startx with a working kernel (2.6.27-gentoo-r10)
strace of startx with a broken kernel (02.6.29)
Output from dmesg after X failure on kernel 2.6.29.3
Complete dmesg with hanging X.
Complete dmesg after starting X when booted with nopat.
Test patch
dmesg output with debugging code in place
dmesg with Dave's patch and debug patch in place |
Description
seraph@xs4all.nl
2009-03-26 16:14:51 UTC
This bug sort of looks like bug 12920. If r128 also requires an mmap of /dev/dri/cardN to work, it could be affected by the same bug. Can you collect an strace of Xorg in the successful and failing cases and compare the two, as was done in bug #12920?
Created attachment 21023 [details]
strace of startx with a working kernel (2.6.27-gentoo-r10)
Created attachment 21024 [details]
strace of startx with a broken kernel (02.6.29)
Sorry it took me a while to get back to this. The machine in question is in a building to which I don't have free access. I poked around a bit and found that the system is still somewhat responsive when this happens. It seems that only the video card hangs, as I can't get a working screen again without resetting the system. I've attached two straces, one with a working kernel and the other with the bug occurring.
Notify-Also : DRI <dri-devel@lists.sourceforge.net>
On Sunday 26 April 2009, Angel wrote:
> On Sun, 26 Apr 2009 11:46:26 +0200 (CEST)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.28 and 2.6.29.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.28 and 2.6.29. Please verify if it still should
> > be listed and let me know (either way).
>
> I've tested with kernel 2.6.29.1. The problem still persists. While
> everything works fine with 2.6.28 or lower, starting X (both 1.3 and 1.5)
> with DRI enabled on a 2.6.29 kernel results in a garbled screen and
> unresponsive keyboard.
On Monday 27 April 2009, Jos van der Ende wrote:
> On Sun, 26 Apr 2009 19:43:44 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > On Sunday 26 April 2009, Angel wrote:
> > > On Sun, 26 Apr 2009 11:46:26 +0200 (CEST)
> > >
> > > I've tested with kernel 2.6.29.1. The problem still persists. While
> > > everything works fine with 2.6.28 or lower, starting X (both 1.3 and 1.5)
> > > with DRI enabled on a 2.6.29 kernel results in a garbled screen and
> > > unresponsive keyboard.
> >
> > Thanks for the update.
> >
> > Can you also test 2.6.30-rc3, please?
>
> Sure. Still broken in exactly the same way. (Garbled screen, non-working
> keyboard. System can still be ssh-ed into.)
I bisected this; the bad commit is:
commit cdecff6864a1cd352a41d44a65e7451b8ef5cee2
Author: venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com>
Date: Fri Jan 9 16:13:12 2009 -0800
x86 PAT: return compatible mapping to remap_pfn_range callers
Also, I tested 2.6.29.3 and 2.6.30-rc5. Both are still broken in exactly the same way, except that 2.6.30-rc5 doesn't show the garbage at the top of the screen; it's completely black.
(x86 cc's added)
(reassigned to x86)
It's a PAT-related regression in 2.6.29.
Do things work with "nopat" on a recent 2.6.29.x? If you can still access the system after this X hang (network or text console), can you get the dmesg output with 2.6.29.x when the problem happens?
I've tested 2.6.29.3 with the nopat option. With that, X works normally.
Created attachment 21314 [details]
Output from dmesg after X failure on kernel 2.6.29.3
I was able to ssh into the system after the failure to start X. I've attached the relevant portion of dmesg.
I see a null pointer dereference and a kernel oops, that probably explains the problem. :-)
The oops looks like an aftereffect, so it doesn't say much about why the failure happened in the first place. The first suspicious message I see is
[drm:r128_do_init_cce] *ERROR* Could not ioremap agp regions!
which is basically drm_core_ioremap_wc failing in the driver. But I'm not sure whether there are any other errors before this point, as the dmesg is not complete. Can you get the complete dmesg? dmesg -s <size> or /var/log/dmesg (in newer distros) should have all messages from boot time. Can you also attach the output of "cat /debug/x86/pat_memtype_list" after you see this X startup problem? Also, the dmesg when things work with "nopat" will be interesting as well.
Created attachment 21324 [details]
Complete dmesg with hanging X.
Created attachment 21325 [details]
Complete dmesg after starting X when booted with nopat.
Sorry, I do not have debugfs compiled into the kernel. Also, the system in question is not mine and I have to be a bit selective in what data I upload here. These dmesg files are complete, though.
Created attachment 21337 [details]
Test patch
I am still struggling to connect all the dots here between the commit cdecff6864a1cd352a41d44a65e7451b8ef5cee2 and the error we are seeing here, with drm_ioremap() failing with PAT and succeeding with nopat.
Sorry, I don't seem to be able to find a platform that can reproduce this error here in my lab. Can you please apply the test patch here over 2.6.29.3 and get the new dmesg "DEBUG" lines?
Thanks.
Created attachment 21353 [details]
dmesg output with debugging code in place
Well, it seems that your debugging patch somehow fixed the problem. With it in place, I can no longer trigger the bug! :-)
Anyway, here are the DEBUG lines. I've attached the full dmesg output as well.
X:5348 DEBUG map pfn expected mapping type write-back for e0000000-e0101000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0101000-e0102000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0102000-e0302000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0302000-e07e2000, got write-combining
DEBUG ioremap error for 0xe0000000-0xe0101000, requested 0x10, got 0x8
DEBUG ioremap error for 0xe0101000-0xe0102000, requested 0x10, got 0x8
DEBUG ioremap error for 0xe0102000-0xe0302000, requested 0x10, got 0x8
X:5348 DEBUG map pfn expected mapping type write-back for e0102000-e0302000, got write-combining
X:5348 DEBUG map pfn expected mapping type uncached-minus for d8000000-d9000000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0102000-e0302000, got write-combining
X:5348 DEBUG map pfn expected mapping type write-back for e0302000-e07e2000, got write-combining
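For anyone comparing several of these dmesg captures, the DEBUG lines above can be summarized with a small script. This is only a sketch: the function name parse_debug_line and the regular expressions are made up for illustration, and the flag table assumes the usual x86 page-table cache bits (PWT = 0x8, PCD = 0x10), under which "requested 0x10, got 0x8" would read as "requested uncached-minus, got write-combining".

```python
import re

# Assumption: cache-attribute encodings follow the x86 PWT/PCD page-table bits.
CACHE_FLAGS = {
    0x00: "write-back",
    0x08: "write-combining",
    0x10: "uncached-minus",
    0x18: "uncached",
}

# Patterns modelled on the two DEBUG line shapes quoted in this report.
MAP_PFN_RE = re.compile(
    r"DEBUG map pfn expected mapping type (?P<want>[\w-]+) "
    r"for (?P<start>[0-9a-f]+)-(?P<end>[0-9a-f]+), got (?P<got>[\w-]+)"
)
IOREMAP_RE = re.compile(
    r"DEBUG ioremap error for 0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+), "
    r"requested (?P<want>0x[0-9a-f]+), got (?P<got>0x[0-9a-f]+)"
)


def parse_debug_line(line):
    """Return (kind, start, end, wanted, got), or None for unrelated lines."""
    m = MAP_PFN_RE.search(line)
    if m:
        return ("map_pfn", int(m["start"], 16), int(m["end"], 16),
                m["want"], m["got"])
    m = IOREMAP_RE.search(line)
    if m:
        # Translate the raw flag values into names where the table knows them.
        want = CACHE_FLAGS.get(int(m["want"], 16), m["want"])
        got = CACHE_FLAGS.get(int(m["got"], 16), m["got"])
        return ("ioremap", int(m["start"], 16), int(m["end"], 16), want, got)
    return None
```

Feeding each line of a saved dmesg file through parse_debug_line and skipping the None results gives one tuple per conflicting range, which makes it easier to see which ranges appear in both the map pfn and the ioremap messages.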
OK. Yes, with the debug patch I enabled ioremap to succeed even with the error messages above, and that's the reason you do not see the problem any more. From all the info we have here:
- X is actually mapping these regions with what looks like the remap_pfn_range interface.
- Before the patch cdecff6864a1cd352a41d44a65e7451b8ef5cee2, that remap was failing due to the above DEBUG error. X was handling that failure in a sane way and kept things working even with that failure. It looks like the driver did not call the subsequent ioremap() in that case.
- After the patch cdecff6864a1cd352a41d44a65e7451b8ef5cee2, we are allowing the X mapping to pass by changing the attributes appropriately. Then the driver subsequently calls ioremap(), which fails, and X cannot handle that failure cleanly, resulting in the failure that you are seeing.
- The debug patch enabled ioremap to pass in the same way as the X map is passing, and then things work fine again.
From the latest sources, I see Dave has this patch here:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=42beefc00
This patch should make the ioremap work again as well. Can you apply this patch on 2.6.29.3 + debug patch? Then X should work without these messages above:
DEBUG ioremap error for 0xe0000000-0xe0101000, requested 0x10, got 0x8
DEBUG ioremap error for 0xe0101000-0xe0102000, requested 0x10, got 0x8
DEBUG ioremap error for 0xe0102000-0xe0302000, requested 0x10, got 0x8
Now coming to the failure you are seeing with 2.6.30-rc: that seems to be unrelated to the above PAT problem. It may need another round of bisecting, starting from 2.6.29 + patch 42beefc00 above and ending with the latest 2.6.30-rc.
Created attachment 21533 [details]
dmesg with Dave's patch and debug patch in place
Here's the dmesg.out file with both Dave's patch and the debugging patch in place (on 2.6.29.3). X works normally in this case.
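The sequence described in the analysis earlier in this report (mmap refused before cdecff68, mmap allowed but ioremap refused afterwards, both allowed with the debug patch) can be illustrated with a toy model. This is not kernel code: MemtypeTable, start_x and the flag names are hypothetical, and the initial write-combining reservation is an assumption inferred from the "got write-combining" text in the DEBUG lines.

```python
AGP_RANGE = (0xE0000000, 0xE0101000)  # first range from the DEBUG lines


class MemtypeTable:
    """Toy stand-in for a memory-type reservation list (hypothetical)."""

    def __init__(self):
        self.reserved = {}  # range -> cache type already tracked

    def reserve(self, rng, want, allow_compatible):
        """Return the granted type, or None when the request is refused."""
        have = self.reserved.get(rng)
        if have is None:
            self.reserved[rng] = want
            return want
        if have == want or allow_compatible:
            return have  # caller is handed the type that is already tracked
        return None  # conflicting type and the caller insists on its own


def start_x(mmap_gets_compatible, ioremap_gets_compatible):
    table = MemtypeTable()
    # Assumption: the range is already tracked as write-combining.
    table.reserve(AGP_RANGE, "write-combining", False)
    # Step 1: X maps the range (remap_pfn_range path), asking for write-back.
    if table.reserve(AGP_RANGE, "write-back", mmap_gets_compatible) is None:
        return "mmap refused, driver skips ioremap, X keeps working"
    # Step 2: the driver calls ioremap, asking for uncached-minus.
    if table.reserve(AGP_RANGE, "uncached-minus", ioremap_gets_compatible) is None:
        return "ioremap refused, X fails"
    return "both mappings succeed, X works"
```

In this model, start_x(False, False) corresponds to kernels before cdecff68, start_x(True, False) to 2.6.29 as released, and start_x(True, True) to 2.6.29.3 with the debug patch applied. How Dave's patch makes the ioremap succeed is not spelled out in this report, so the model does not try to represent it.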
OK. So we have a fix for this bug in 2.6.29. I will forward Dave's patch to the 2.6.29-stable series, which will resolve the problem in 2.6.29.x. In comment #8, you also said that 2.6.30-rc3 fails with a blank screen. Can you please check with the latest 2.6.30-rc and see whether you still have the problem there? That looks like it is due to some recent change, and it still needs to be root-caused. Thanks.
On Sunday 31 May 2009, Angel wrote:
> On Sat, 30 May 2009 21:55:34 +0200 (CEST)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.28 and 2.6.29.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.28 and 2.6.29. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12947
> > Subject : r128: system hangs when X is started with DRI enabled
> > Submitter : Jos van der Ende <seraph@xs4all.nl>
> > Date : 2009-03-26 16:14 (66 days old)
> >
>
> We've found a working bug fix for 2.6.29 which I expect will be included in
> the next bug fix release. However, 2.6.30-rc[1-7] (which already includes
> this fix) introduced a new problem which we're still working on.
>
> The bug fix in question is this one:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=42beefc00