Bug 14487

Summary: PANIC: early exception 08 rip 246:10 error ffffffff810251b5 cr2 0
Product: Platform Specific/Hardware
Reporter: Rafael J. Wysocki (rjw)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: CLOSED CODE_FIX
Severity: normal
CC: jbeulich, justinmattock, sklif2004
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32-rc4
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:    
Bug Blocks: 14230    
Attachments: temporary_fix_to_get_early_dma_debugging_working_on_my_machine
This fixes the Panic on my x86_64(pure64) machine.

Description Rafael J. Wysocki 2009-10-26 20:51:54 UTC
Subject    : PANIC: early exception 08 rip 246:10 error ffffffff810251b5 cr2 0
Submitter  : "Justin P. Mattock" <justinmattock@gmail.com>
Date       : 2009-10-23 16:45
References : http://lkml.org/lkml/2009/10/23/252

This entry is being used for tracking a regression from 2.6.31.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Justin P. Mattock 2009-10-28 22:30:25 UTC
Created attachment 23570 [details]
temporary_fix_to_get_early_dma_debugging_working_on_my_machine

This is not much of a patch; it is more an attempt to figure out where the issue
that's causing my machine to panic might be.
With the calls commented out I'm able to use early firewire for debugging.

As for what might be going on, taking a look into set_fixmap_nocache
I see this is located in arch/x86/include/asm/fixmap.h. 
In there I'm noticing some comments pertaining to x86_64:


/*
 * We can't declare FIXADDR_TOP as variable for x86_64 because vsyscall
 * uses fixmaps that relies on FIXADDR_TOP for proper address calculation.
 * Because of this, FIXADDR_TOP x86 integration was left as later work.
 */

and

/* Only covers 32bit vsyscalls currently. Need another set for 64bit. */

which leads me to think that maybe this is why I'm hitting what I'm hitting on my x86_64 machine.

Any ideas?
Comment 2 Justin P. Mattock 2009-10-29 03:06:18 UTC
probably would be a good idea to show the panic:
(here it is, written down manually, plus a URL to a picture)

 [    0.000000] [<ffffffff81639995>] start_kernel+0x82/0x34d
 [    0.000000] [<ffffffff816392a5>] x86_64_start_reservations+0xac/0xb0
 [    0.000000] [<ffffffff816393a1>] x86_64_start_kernel+0xf8/0x107
 PANIC: early exception 08 rip 246:10 error ffffffff810251b5 cr2 0
 [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.32-rc4-00001-g1896a85 #35
 [    0.000000] Call Trace:
 [    0.000000] [<ffffffff8163919e>] early_idt_handler+0x5e/0x71
 [    0.000000] [<ffffffff813b9958>] ? panic+0x10c/0x12e
 [    0.000000] [<ffffffff8164f777>] ___alloc_bootmem_node+0x0/0x60
 [    0.000000] [<ffffffff8164f8eb>] __alloc_bootmem+0xb/0xd
 [    0.000000] [<ffffffff813aba66>] spp_getpage+0x3a/0x6f
 [    0.000000] [<ffffffff8102770d>] fill_pte+0x22/0xde
 [    0.000000] [<ffffffff810278e7>] set_pte_vaddr_pud+0x2c/0x48
 [    0.000000] [<ffffffff81027963>] set_pte_vaddr+0x60/0x65
 [    0.000000] [<ffffffff8102b82e>] __native_set_fixmap+0x24/0x2c
 [    0.000000] [<ffffffff81660252>] init_ohci1394_dma_on_all_controllers+0x9b/0x345
 [    0.000000] [<ffffffff8163be6b>] setup_arch+0x543/0x950
 [    0.000000] [<ffffffff813b99b6>] ? printk+0x3c/0x3e
 [    0.000000] [<ffffffff810646b6>] ? clockevents_register_notifier+0x3e/0x48
 [    0.000000] [<ffffffff81639995>] start_kernel+0x82/0x34d
 [    0.000000] [<ffffffff816392a5>] x86_64_start_reservations+0xac/0xb0
 [    0.000000] [<ffffffff816393a1>] x86_64_start_kernel+0xf8/0x107
 [    0.000000] RIP 0x10


and the url:

http://www.flickr.com/photos/44066293@N08/4046711653/
Comment 3 Rafael J. Wysocki 2009-11-17 22:43:19 UTC
On Tuesday 17 November 2009, Justin P. Mattock wrote:
> Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.31.  Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject             : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter   : Justin P. Mattock<justinmattock@gmail.com>
> > Date                : 2009-10-23 16:45 (25 days old)
> > References  : http://lkml.org/lkml/2009/10/23/252
> >
> >
> >
> >    
> This one has me a bit dazed. After looking into the issue
> I did find a workaround (keep in mind it's not pretty):
> commenting out set_fixmap_nocache and
> init_ohci1394_reset_and_init_dma.
> (By doing so I was able to boot both machines and
> use early debugging in case a problem occurs.)
> 
> Now as to what might be happening: after going through as
> much as I can comprehend, the only thing that came to mind was from
> reading fixmap.h, where the comments state that vsyscalls
> only cover 32bit and that there needs to be another set
> for 64bit, leading me to believe that this is what I might be hitting.
> (My system is pure64, with no 32bit at all.)
> 
> At this point I think I need somebody to give me some info on this,
> and if the 64bit issue mentioned above is the case, then we can probably
> close this and leave it up to the x86_64 builders to create a 64bit
> call for this whenever they get to it. (The main thing is I'm able to
> run dma early in case of an emergency.)
Comment 4 Rafael J. Wysocki 2010-01-11 19:41:22 UTC
On Monday 11 January 2010, Justin P. Mattock wrote:
> On 01/10/10 14:56, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32.  Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject             : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter   : Justin P. Mattock<justinmattock@gmail.com>
> > Date                : 2009-10-23 16:45 (80 days old)
> > References  : http://lkml.org/lkml/2009/10/23/252
> >
> >
> >
> 
> I've played around with this, and
> am much confused about what needs to happen.
> (Please give feedback on what might be happening.)
> In any case I can have another try at finding a fix,
> so please leave open.
Comment 5 Rafael J. Wysocki 2010-01-24 23:03:30 UTC
On Monday 25 January 2010, Justin P. Mattock wrote:
> On 01/24/10 14:22, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32.  Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject             : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter   : Justin P. Mattock<justinmattock@gmail.com>
> > Date                : 2009-10-23 16:45 (94 days old)
> > References  : http://lkml.org/lkml/2009/10/23/252
> >
> >
> >
> 
> Yeah, I'm still seeing this during boot.
> As for looking at this, I've been tied up with another
> issue and totally forgot. Next week I'll be away
> for a week, and during that period I can try to look at this
> since I might be hanging around at times
> (and won't be sidetracked by the other issue I was looking at).
> 
> So yeah, please keep it open, and hopefully somebody
> sees what is happening and maybe has a solution, or
> by chance maybe I can figure something out.
Comment 6 Justin P. Mattock 2010-02-06 23:49:55 UTC
Created attachment 24932 [details]
This fixes the Panic on my x86_64(pure64) machine.

Located at:
 http://patchwork.kernel.org/patch/68719/
This patch fixes the Panic I've been getting when using the
ohci1394_dma=early option.

As well as: http://lists.openwall.net/linux-kernel/2008/08/29/211
but not all of that patch, just the numbers:
-	FIX_BTMAP_END = __end_of_permanent_fixed_addresses + 512 -
-			(__end_of_permanent_fixed_addresses & 511),
+	FIX_BTMAP_END = __end_of_permanent_fixed_addresses + 256 -
+			(__end_of_permanent_fixed_addresses & 255),

I'm going to leave this up to you guys to decide what is the safest approach.
If going into init_ohci1394_dma.c and changing something is better, let
me know, and I can give it my best go.

The attached has been applied to the latest HEAD(rc6), and a bisected-and-tested-by added.
Comment 7 Rafael J. Wysocki 2010-02-07 00:11:55 UTC
Handled-By : Jan Beulich <jbeulich@novell.com>
Patch : http://patchwork.kernel.org/patch/68719/
Comment 8 Justin P. Mattock 2010-02-07 00:30:35 UTC
alright.. let me know if I need to test anything out or something.
Comment 9 Rafael J. Wysocki 2010-02-08 01:12:39 UTC
On Monday 08 February 2010, Justin P. Mattock wrote:
> On 02/07/10 16:28, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject             : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter   : Justin P. Mattock<justinmattock@gmail.com>
> > Date                : 2009-10-23 16:45 (108 days old)
> > References  : http://lkml.org/lkml/2009/10/23/252
> 
> 
> The patch attached to the bug report
> makes my machine boot up without a
> Panic, and allows me to do remote debugging
> via ohci1394_dma.
> 
> I did see a call trace as I was debugging,
> which might be related to having one
> system using the patch and the other not,
> but I still need to look at that.
> (I only saw this once out of numerous boots,
> so it could be a rarity.)
Comment 10 Jan Beulich 2010-02-08 07:57:47 UTC
The patch pointed to in #6 may hide the problem, but it certainly doesn't resolve it permanently (i.e. as soon as sufficiently many fixmap entries get inserted, the issue will re-surface). And I would suppose that regardless of that patch the issue would continue to exist for 32-bit.

The real problem is the lack of compile time enforcement/checking that the early fixmap pte that gets hard-coded into the boot time page tables really covers all (actually - given the use of literal numbers in head_64.S -, any of) the fixmap slots that it is intended for: Presumably FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, and FIX_OHCI1394_BASE all need to be sufficiently close to one another in "enum fixed_addresses", the question just is whether FIX_OHCI1394_BASE needs to be moved up, or whether the other two (if they're both in need of the early fixmap pte being in place) can be moved down.
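
A compile-time check of the kind described above could look roughly like the sketch below. It is an illustration only, not the fix that was merged: the `SK_` names and slot numbers are hypothetical, and the `SK_FIXADDR_TOP` value is an assumed x86_64 constant, not something taken from this report. It assumes 4 KiB pages, one pte page mapping 512 entries (2 MiB), and fixmap addresses that grow downward from the top address.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT  12   /* 4 KiB pages */
#define SK_PMD_SHIFT   21   /* one pte page maps 512 * 4 KiB = 2 MiB */
#define SK_FIXADDR_TOP 0xffffffffffdff000ULL   /* assumed value, illustration only */

/* Hypothetical slot numbers standing in for "enum fixed_addresses". */
enum sk_fixed_addresses {
    SK_FIX_DBGP_BASE = 3,
    SK_FIX_EARLYCON_MEM_BASE = 4,
    SK_FIX_OHCI1394_BASE = 5,
    SK_FIX_FAR_AWAY = 600       /* a slot that lands in a different pte page */
};

/* Fixmap slots are laid out downward from the top address. */
#define SK_FIX_TO_VIRT(idx) \
    (SK_FIXADDR_TOP - ((uint64_t)(idx) << SK_PAGE_SHIFT))

/* Two slots share a pte page when their addresses agree above bit 21. */
#define SK_SAME_PTE_PAGE(a, b) \
    ((SK_FIX_TO_VIRT(a) >> SK_PMD_SHIFT) == (SK_FIX_TO_VIRT(b) >> SK_PMD_SHIFT))

/* The build fails if an early slot drifts out of the hard-coded pte page. */
static_assert(SK_SAME_PTE_PAGE(SK_FIX_DBGP_BASE, SK_FIX_OHCI1394_BASE),
              "early fixmap slots must share one pte page");
static_assert(SK_SAME_PTE_PAGE(SK_FIX_EARLYCON_MEM_BASE, SK_FIX_OHCI1394_BASE),
              "early fixmap slots must share one pte page");

/* Run-time form of the same test, handy for experimenting with indices. */
int sk_same_pte_page(unsigned int a, unsigned int b)
{
    return SK_SAME_PTE_PAGE(a, b);
}
```

With these made-up numbers, slots 3 and 5 share a pte page while slot 600 does not, so adding a `static_assert` against `SK_FIX_FAR_AWAY` would stop the build instead of panicking at boot.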
Comment 11 Justin P. Mattock 2010-02-08 08:14:42 UTC
As a test I moved OHCI down two, and hit this. It seems to only be satisfied
where it was originally, or by adjusting the numbers at FIX_BTMAP_END etc. As for what/where to move this, I can try

#ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
         FIX_OHCI1394_BASE,
#endif
         FIX_BTMAP_END = __end_of_permanent_fixed_addresses + 256 -
                         (__end_of_permanent_fixed_addresses & 255),
         FIX_BTMAP_BEGIN = FIX_BTMAP_END + NR_FIX_BTMAPS*FIX_BTMAPS_SLOTS - 1,

to see, but then you might hit what you were hitting before your commit.


I'm thinking before this: __end_of_permanent_fixed_addresses + 256 etc..
the address is enough for OHCI to do its thing, then once OHCI was moved below
FIX_BTMAP_END OHCI just runs out of space (reason for seeing the out of memory (probably)).

I did change those numbers to 511/512 and the setup worked as is.
(My problem is I don't know how you calculate such a number.)

(It's late over here, need some Zzz.. will do this in the morning, as well as look at any patches.)
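
For reference, the arithmetic in the FIX_BTMAP_END expression can be worked through in isolation. The sketch below only shows what the expression computes; the macro and function names are hypothetical and the input values are made up, since the real value of __end_of_permanent_fixed_addresses depends on the kernel configuration.

```c
#include <assert.h>

/*
 * Same shape as the FIX_BTMAP_END expression quoted in this comment:
 * n + align - (n & (align - 1)), where align is a power of two.
 * The result is the next multiple of align strictly above n.
 */
#define SK_NEXT_BOUNDARY(n, align) ((n) + (align) - ((n) & ((align) - 1)))

/* Worked examples with made-up values of n, checked at compile time. */
static_assert(SK_NEXT_BOUNDARY(30, 256) == 256, "30 moves up to 256");
static_assert(SK_NEXT_BOUNDARY(30, 512) == 512, "30 moves up to 512");
static_assert(SK_NEXT_BOUNDARY(300, 256) == 512, "300 moves up to 512");
static_assert(SK_NEXT_BOUNDARY(256, 256) == 512,
              "an exact multiple still moves up by a full step");

/* Run-time form of the same expression. */
unsigned int sk_next_boundary(unsigned int n, unsigned int align)
{
    return SK_NEXT_BOUNDARY(n, align);
}
```

So 256/255 and 512/511 are alignment/mask pairs: changing them changes which boundary FIX_BTMAP_END lands on, and therefore how many unused slots sit between the permanent entries and the boot-time mappings.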
Comment 12 Jan Beulich 2010-02-08 08:28:44 UTC
(In reply to comment #11)
> to see, but then you might hit what you were hitting before your commit.

Yes, it must not end up between __end_of_permanent_fixed_addresses and FIX_BTMAP_END.

> I'm thinking before this: __end_of_permanent_fixed_addresses + 256 etc..
> the address is enough for OHCI to do its thing, then once OHCI was moved below
> FIX_BTMAP_END OHCI just runs out of space (reason for seeing the out of memory (probably)).

As said above, all entries requiring the early fixmap page table to be set up should be as close together as possible (preferably they would all be after __end_of_permanent_fixed_addresses, but if at least one of those entries is meant to be permanent, then all of them have to be or multiple pte pages will need to be hard-coded into the boot time page tables).
 
> I did change those numbers to 511/512 and the setup worked as is.

No, you shouldn't fiddle with those numbers.
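
The placement rule from the first reply above (an entry must not end up between __end_of_permanent_fixed_addresses and FIX_BTMAP_END) can also be expressed as a build-time check. This is only a sketch under assumptions: the `EX_` enum is a hypothetical, heavily shortened stand-in for "enum fixed_addresses", and the slot order shown is one possible layout, not the one that was eventually chosen.

```c
#include <assert.h>

/* Hypothetical, shortened stand-in for "enum fixed_addresses". */
enum ex_fixed_addresses {
    EX_FIX_DBGP_BASE,
    EX_FIX_EARLYCON_MEM_BASE,
    EX_FIX_OHCI1394_BASE,       /* kept with the other early slots */
    EX_END_OF_PERMANENT,
    /* Same rounding expression as quoted in the thread. */
    EX_FIX_BTMAP_END = EX_END_OF_PERMANENT + 256 - (EX_END_OF_PERMANENT & 255)
};

/* True for indices strictly inside the padding before the boot-time maps. */
#define EX_IN_PADDING_GAP(idx) \
    ((idx) > EX_END_OF_PERMANENT && (idx) < EX_FIX_BTMAP_END)

/* The build fails if the OHCI slot is ever moved into the padding gap. */
static_assert(!EX_IN_PADDING_GAP(EX_FIX_OHCI1394_BASE),
              "FIX_OHCI1394_BASE must not sit in the padding gap");

/* Run-time form of the same test. */
int ex_in_padding_gap(int idx)
{
    return EX_IN_PADDING_GAP(idx);
}
```

In this layout EX_END_OF_PERMANENT is 3 and EX_FIX_BTMAP_END is 256, so indices 4 through 255 form the gap.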
Comment 13 Justin P. Mattock 2010-02-08 15:04:01 UTC
hmm.. so in this case for the ohci1394_dma module this would be
somewhere(like you had mentioned), set_fixmap_nocache(FIX_OHCI1394_BASE, ohci_base); and init_ohci1394_reset_and_init_dma(&ohci);

> I did change those numbers to 511/512 and the setup worked as is.

> No, you shouldn't fiddle with those numbers.

Yeah, but out of curiosity, changing these numbers did get the system to boot (why? no idea!!)
Comment 14 Rafael J. Wysocki 2010-02-14 23:53:58 UTC
Ignore-Patch : http://patchwork.kernel.org/patch/68719/
Comment 15 Justin P. Mattock 2010-02-15 00:01:27 UTC
Yeah, that patch does fix the boot issue, but as noted above it hides the problem instead of resolving it.
currently I'm doing a bisect for ath9k(the disassociating bug a few months back),
then I can focus in on this.
Comment 16 Justin P. Mattock 2010-02-15 00:13:32 UTC
On 02/14/10 15:54, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14487
>
>
> Rafael J. Wysocki<rjw@sisk.pl>  changed:
>
>             What    |Removed                     |Added
> ----------------------------------------------------------------------------
>               Status|REOPENED                    |ASSIGNED
>
>
>
>


doing a bisect for the atheros dissociating bug
(a few months back),then I can dive into this one.

Justin P. Mattock
Comment 17 Justin P. Mattock 2010-02-19 03:24:25 UTC
Alright.. I was looking into an SELinux issue with suse for a day or so. While using their x86_64 11.2 system I decided to see if this hits with a regular distro,
as opposed to only hitting this on my custom system built from scratch.
Results:
unfortunately this hits on a distro as well, and is reproducible each time with early dma debugging on. I will look more into this within the next few days.
Comment 18 Rafael J. Wysocki 2010-02-21 22:31:39 UTC
On Sunday 21 February 2010, Justin P. mattock wrote:
> On 02/21/2010 01:42 PM, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=14487
> > Subject             : PANIC: early exception 08 rip 246:10 error
> ffffffff810251b5 cr2 0
> > Submitter   : Justin P. Mattock<justinmattock@gmail.com>
> > Date                : 2009-10-23 16:45 (122 days old)
> > References  : http://lkml.org/lkml/2009/10/23/252
> > Handled-By  : Jan Beulich<jbeulich@novell.com>
> 
> Yeah, still here.. worse, I'm able to see this with
> suse11.2 as well as with my custom system.
> 
> so please leave open.
Comment 19 Justin P. Mattock 2010-02-26 06:43:26 UTC
I can confirm this patch:
http://lkml.org/lkml/2010/2/24/210
fixes the Panic I am experiencing..
I will do some firewire debugging to make sure
that works.
Comment 20 Jan Beulich 2010-02-26 09:29:13 UTC
Patch: http://patchwork.kernel.org/patch/82280/
Comment 21 Justin P. Mattock 2010-02-26 15:33:50 UTC
Thanks so much for this Jan, glad to(hopefully) get this bug
closed.
Comment 22 SkliF 2010-04-26 07:46:34 UTC
My Ubuntu 9.10 kernel-2.6.33-amd64 sometimes gave these errors:

PANIC: early exception 08 rip 246:10 error ffffffff810356e6 cr2 f08b3a
&
PANIC: early exception 0f rip 10:ffffffff810356e6 error 0 cr2 f08b3a

P.S. solved by installing Debian ;)
Comment 23 Justin P. Mattock 2010-04-26 13:28:19 UTC
On 04/26/2010 12:46 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=14487
>
>
> SkliF<sklif2004@gmail.com>  changed:
>
>             What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                   CC|                            |sklif2004@gmail.com
>
>
>
>
> --- Comment #22 from SkliF<sklif2004@gmail.com>   2010-04-26 07:46:34 ---
> My Ubuntu 9.10 kernel-2.6.33-amd64 sometimes gave these errors:
>
> PANIC: early exception 08 rip 246:10 error ffffffff810356e6 cr2 f08b3a
> &
> PANIC: early exception 0f rip 10:ffffffff810356e6 error 0 cr2 f08b3a
>
> P.S. solved by installing Debian ;)
>


this patch fixed it for me:

https://patchwork.kernel.org/patch/82280/

should be in the latest stable release now.
(happy early debugging).

Justin P. Mattock
Comment 24 Rafael J. Wysocki 2010-04-26 18:34:43 UTC
Closing on the basis of the last comment.
Comment 25 Justin P. Mattock 2010-04-26 18:49:39 UTC
looks good to me.. everything works(just did some remote debugging on a gcc bug for 4.6.0).