Bug 14487
Summary: | PANIC: early exception 08 rip 246:10 error ffffffff810251b5 cr2 0 | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Rafael J. Wysocki (rjw) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | jbeulich, justinmattock, sklif2004 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.32-rc4 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 14230 | ||
Attachments: |
temporary_fix_to_get_early_dma_debugging_working_on_my_machine
This fixes the Panic on my x86_64(pure64) machine. |
Description
Rafael J. Wysocki
2009-10-26 20:51:54 UTC
Created attachment 23570 [details]
temporary_fix_to_get_early_dma_debugging_working_on_my_machine
This is not much for a patch, but more of figuring out where the issue might be
that's causing my machine to panic.
By commenting out the calls I'm able to use early firewire for debugging.
As for what might be going on, taking a look into set_fixmap_nocache
I see this is located in arch/x86/include/asm/fixmap.h.
In there I'm noticing some comments pertaining to x86_64:
/*
* We can't declare FIXADDR_TOP as variable for x86_64 because vsyscall
* uses fixmaps that relies on FIXADDR_TOP for proper address calculation.
* Because of this, FIXADDR_TOP x86 integration was left as later work.
*/
and
/* Only covers 32bit vsyscalls currently. Need another set for 64bit. */
which is leading me to think maybe this is why I'm hitting what I'm hitting on my x86_64 machine
any ideas?
probably would be a good idea to show the panic:
(here it is manually writing down, and a url to a picture)
[ 0.000000] [<ffffffff81639995>] start_kernel+0x82/0x34d
[ 0.000000] [<ffffffff816392a5>] x86_64_start_reservations+0xac/0xb0
[ 0.000000] [<ffffffff816393a1>] x86_64_start_kernel+0xf8/0x107
PANIC: early exception 08 rip 246:10 error ffffffff810251b5 cr2 0
[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.32-rc4-00001-g1896a85 #35
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff8163919e>] early_idt_handler+0x5e/0x71
[ 0.000000] [<ffffffff813b9958>] ? panic+0x10c/0x12e
[ 0.000000] [<ffffffff8164f777>] ___alloc_bootmem_node+0x0/0x60
[ 0.000000] [<ffffffff8164f8eb>] __alloc_bootmem+0xb/0xd
[ 0.000000] [<ffffffff813aba66>] spp_getpage+0x3a/0x6f
[ 0.000000] [<ffffffff8102770d>] fill_pte+0x22/0xde
[ 0.000000] [<ffffffff810278e7>] set_pte_vaddr_pud+0x2c/0x48
[ 0.000000] [<ffffffff81027963>] set_pte_vaddr+0x60/0x65
[ 0.000000] [<ffffffff8102b82e>] __native_set_fixmap+0x24/0x2c
[ 0.000000] [<ffffffff81660252>] init_ohci1394_dma_on_all_controllers+0x9b/0x345
[ 0.000000] [<ffffffff8163be6b>] setup_arch+0x543/0x950
[ 0.000000] [<ffffffff813b99b6>] ? printk+0x3c/0x3e
[ 0.000000] [<ffffffff810646b6>] ? clockevents_register_notifier+0x3e/0x48
[ 0.000000] [<ffffffff81639995>] start_kernel+0x82/0x34d
[ 0.000000] [<ffffffff816392a5>] x86_64_start_reservations+0xac/0xb0
[ 0.000000] [<ffffffff816393a1>] x86_64_start_kernel+0xf8/0x107
[ 0.000000] RIP 0x10
and the url:
http://www.flickr.com/photos/44066293@N08/4046711653/

On Tuesday 17 November 2009, Justin P. Mattock wrote:
> Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.31. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter : Justin P. Mattock<justinmattock@gmail.com>
> > Date : 2009-10-23 16:45 (25 days old)
> > References : http://lkml.org/lkml/2009/10/23/252
> >
> >
> >
> >
> This one has me a bit dazed i.g. after looking into the issue
> I did find a workaround(keep in mind it's not pretty),
> by commenting out set_fixmap_nocache and
> init_ohci1394_reset_and_init_dma.
> (by doing so I was able to load both machines and
> execute early debugging in case a problem occurs).
>
> Now as to what might be happening, after going through as
> much as I can comprehend the only thing in mind was
> reading fixmap.h the comments are stating that vsyscalls
> only covers 32bit, and that there needs to be another set
> for 64, leading me to believe that this is what I might be hitting.
> (my system is pure64, taking in no 32bit at all).
>
> At this point I think I need somebody to give me some info on this,
> and if the 64bit issue mentioned above is the case, then we can probably
> close this and leave it up to the x86_64 builders to create a 64bit
> call for this whenever they get to it.(main thing is I'm able to
> run dma early in case of an emergency).
On Monday 11 January 2010, Justin P. Mattock wrote:
> On 01/10/10 14:56, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter : Justin P. Mattock<justinmattock@gmail.com>
> > Date : 2009-10-23 16:45 (80 days old)
> > References : http://lkml.org/lkml/2009/10/23/252
> >
> >
> >
>
> I've played around with this, and
> am much confused at what needs to happen.
> (please give feedback on what might be happening);
> In any case I can have another try at finding a fix
> so please leave open.
On Monday 25 January 2010, Justin P. Mattock wrote:
> On 01/24/10 14:22, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter : Justin P. Mattock<justinmattock@gmail.com>
> > Date : 2009-10-23 16:45 (94 days old)
> > References : http://lkml.org/lkml/2009/10/23/252
> >
> >
> >
>
> yeah I'm still seeing this during boot.
> As of looking at this, been tied up with another
> issue and totally forgot. next week I'll be away
> for a week, and during that period I can try and look at this
> since I might be hanging around at times.
> (and won't be side tracked with the other issue I was looking at);
>
> So yeah please keep it open, and hopefully somebody
> sees what is happening and maybe has a solution, or
> by chance maybe I can figure something.
Created attachment 24932 [details]
This fixes the Panic on my x86_64(pure64) machine.
located:
http://patchwork.kernel.org/patch/68719/

this patch fixes my Panic I've been getting when using
the ohci1394_dma=early option.
as well as:
http://lists.openwall.net/linux-kernel/2008/08/29/211
but not all of that patch just the(numbers):
- FIX_BTMAP_END = __end_of_permanent_fixed_addresses + 512 -
- (__end_of_permanent_fixed_addresses & 511),
+ FIX_BTMAP_END = __end_of_permanent_fixed_addresses + 256 -
+ (__end_of_permanent_fixed_addresses & 255),
I'm going to leave this up to you guys to decide what is the safest
approach. if going into init_ohci1394_dma.c and changing something
is better let me know, and I can give my best go at it.

The attached has been applied to the latest HEAD(rc6), and a bisected-and-tested-by added.

Handled-By : Jan Beulich <jbeulich@novell.com>
Patch : http://patchwork.kernel.org/patch/68719/

alright.. let me know if I need to test anything out or something.

On Monday 08 February 2010, Justin P. Mattock wrote:
> On 02/07/10 16:28, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32. Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter : Justin P. Mattock<justinmattock@gmail.com>
> > Date : 2009-10-23 16:45 (108 days old)
> > References : http://lkml.org/lkml/2009/10/23/252
>
>
> the patch attached to the bug report
> makes my machine boot up with out a
> Panic, and allows me to do remote debugging
> via ohci1394_dma.
>
> I did see a call trace as I was debugging
> which might be related to having one
> system using the patch, and the other not.
> but still need to look at that.
> (only saw this once out of numerous boots
> (could be a rarity)).
The patch pointed to in #6 may hide the problem, but it certainly doesn't resolve it permanently (i.e. as soon as sufficiently many fixmap entries get inserted, the issue will re-surface). And I would suppose that regardless of that patch the issue would continue to exist for 32-bit.
The real problem is the lack of compile time enforcement/checking that the early fixmap pte that gets hard-coded into the boot time page tables really covers all (actually - given the use of literal numbers in head_64.S -, any of) the fixmap slots that it is intended for: Presumably FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, and FIX_OHCI1394_BASE all need to be sufficiently close to one another in "enum fixed_addresses", the question just is whether FIX_OHCI1394_BASE needs to be moved up, or whether the other two (if they're both in need of the early fixmap pte being in place) can be moved down.

as a test I moved OHCI down two, and hit this. seems to only be satisfied where it was originally, or adjusting the numbers at FIX_MAP_END etc..
As for what/where to move this I can try
#ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
FIX_OHCI1394_BASE,
#endif
FIX_BTMAP_END = __end_of_permanent_fixed_addresses + 256 -
(__end_of_permanent_fixed_addresses & 255),
FIX_BTMAP_BEGIN = FIX_BTMAP_END + NR_FIX_BTMAPS*FIX_BTMAPS_SLOTS - 1,
to see, but then you might hit what you where hitting before your commit.
I'm thinking before this: __end_of_permanent_fixed_addresses + 256 etc..
the address is enough for OHCI to do it's thing, then once OHCI was moved below
FIX_BTMAP_END OHCI just runs out of space(reason for seeing the out of memory(probably)).
I did change those numbers to 511/512 and the setup worked as is.
(my problem is I don't know/how you calculate such a number).
(it's late over here need some Zzz.. will do in the morning, as well as any patches to see).

(In reply to comment #11)
> to see, but then you might hit what you where hitting before your commit.
Yes, it must not end up between __end_of_permanent_fixed_addresses and FIX_BTMAP_END.
> I'm thinking before this: __end_of_permanent_fixed_addresses + 256 etc..
> the address is enough for OHCI to do it's thing, then once OHCI was moved
> below
> FIX_BTMAP_END OHCI just runs out of space(reason for seeing the out of
> memory(probably)).
As said above, all entries requiring the early fixmap page table to be set up should be as close together as possible (preferably they would all be after __end_of_permanent_fixed_addresses, but if at least one of those entries is meant to be permanent, then all of them have to be or multiple pte pages will need to be hard-coded into the boot time page tables).
> I did change those numbers to 511/512 and the setup worked as is.
No, you shouldn't fiddle with those numbers.

hmm.. so in this case for the ohci1394_dma module this would be
somewhere(like you had mentioned),
set_fixmap_nocache(FIX_OHCI1394_BASE, ohci_base);
and
init_ohci1394_reset_and_init_dma(&ohci);
> I did change those numbers to 511/512 and the setup worked as is.
No, you shouldn't fiddle with those numbers.
yeah but out of curiosity changing these numbers did get the system to boot(why?) no idear!!
Ignore-Patch : http://patchwork.kernel.org/patch/68719/

yeah that patch does fix the boot issue, but like above hides the problem
instead of resolves it. currently I'm doing a bisect for ath9k(the disassociating
bug a few months back), then I can focus in on this.

On 02/14/10 15:54, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14487
>
> Rafael J. Wysocki<rjw@sisk.pl> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> Status|REOPENED |ASSIGNED
>
doing a bisect for the atheros dissociating bug (a few months back),
then I can dive into this one.
Justin P. Mattock

alright.. was looking into an SELinux issue with suse for a day or so..
while using their x86_64 11.2 system I decided to see if this hits
with a regular distro as opposed to hitting this on my custom built
system from scratch.
results: unfortunately this hits as well on a distro, re-creatable each time
with using early dma debugging on.
I will look more on this within the next few days.

On Sunday 21 February 2010, Justin P. Mattock wrote:
> On 02/21/2010 01:42 PM, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32. Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter : Justin P. Mattock<justinmattock@gmail.com>
> > Date : 2009-10-23 16:45 (122 days old)
> > References : http://lkml.org/lkml/2009/10/23/252
> > Handled-By : Jan Beulich<jbeulich@novell.com>
>
> yeah still here.. worst is I'm able to see this with
> suse11.2 as well as with my custom system.
>
> so please leave open.
I can confirm this patch:
http://lkml.org/lkml/2010/2/24/210
fixes the Panic I am experiencing..
I will do some firewire debugging to make sure that works.
Thanks so much for this Jan, glad to(hopefully) get this bug closed.

My Ubuntu 9.10 kernel-2.6.33-amd64 sometimes gave these errors:
PANIC: early exception 08 rip 246:10 error ffffffff810356e6 cr2 f08b3a
&
PANIC: early exception 0f rip 10:ffffffff810356e6 error 0 cr2 f08b3a
P.S. solved by installing Debian ;)

On 04/26/2010 12:46 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=14487
>
> SkliF<sklif2004@gmail.com> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |sklif2004@gmail.com
>
> --- Comment #22 from SkliF<sklif2004@gmail.com> 2010-04-26 07:46:34 ---
> My Ubuntu 9.10 kernel-2.6.33-amd64 sometimes gave these errors:
>
> PANIC: early exception 08 rip 246:10 error ffffffff810356e6 cr2 f08b3a
> &
> PANIC: early exception 0f rip 10:ffffffff810356e6 error 0 cr2 f08b3a
>
> P.S. solved by installing Debian ;)
>
this patch fixed it for me:
https://patchwork.kernel.org/patch/82280/
should be in the latest stable release now.
(happy early debugging).
Justin P. Mattock

Closing on the basis of the last comment.

looks good to me.. everything works(just did some remote debugging on a gcc bug for 4.6.0).