Bug 11712

Summary: regression: kernel crashes after agp gart init
Product: Platform Specific/Hardware
Reporter: Daniel Vetter (daniel)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: CLOSED CODE_FIX
Severity: normal
CC: bunk, rjw, torvalds
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.27-rc8-90-gb1f23f5
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 11167
Attachments: dmesg of the working kernel
             dmesg of the failing kernel

Description Daniel Vetter 2008-10-06 09:25:52 UTC
Latest working kernel version: 2.6.27-rc8
Earliest failing kernel version: 2.6.27-rc8-57-g96d746c
Distribution: debian unstable
Hardware Environment: amd64, 4 core Opteron (dual socket), AMD 81xx agp bridge

When booting up, the kernel crashes right after emitting the following lines on the console:

[    0.940510] PCI: Using ACPI for IRQ routing
[    0.944389] pci 0000:04:00.0: BAR 0: can't allocate resource
[    0.985267] agpgart-amd64 0000:04:00.0: AMD 8151 AGP Bridge rev B3
[    1.022664] agpgart-amd64 0000:04:00.0: AGP aperture is 512M @ 0xc0000000
[    1.024187] PCI-DMA: using GART IOMMU.
[    1.028053] PCI-DMA: Reserving 256MB of IOMMU area in the AGP aperture

Symptoms are
- freeze (on 2.6.27-rc8-57-g96d746c)
- double fault and freeze (on 2.6.27-rc8-60-g1db9b83)
- triple fault/reboot (on 2.6.27-rc8-90-gb1f23f5)
The kernel always crashes at exactly the same place (triple fault case confirmed via serial console).

I've tracked it down with git bisect. I'll now test latest -mainline with this commit reverted (2.6.27-rc8-57-g96d746c) - but I goofed up something else, so it will take a few hours.
Comment 1 Daniel Vetter 2008-10-06 09:26:58 UTC
Created attachment 18180
dmesg of the working kernel

this is from the last good commit from bisecting
Comment 2 Daniel Vetter 2008-10-06 09:28:03 UTC
Created attachment 18181
dmesg of the failing kernel

this is from the first bad one from bisecting
Comment 3 Linus Torvalds 2008-10-06 10:01:15 UTC

On Mon, 6 Oct 2008, bugme-daemon@bugzilla.kernel.org wrote:
>
> I've tracked it down with git bisect. I'll now test latest -mainline with
> this commit reverted (2.6.27-rc8-57-g96d746c) - but I goofed up something
> else, so it will take a few hours.

Ok, that commit (96d746c68fae9a1e3167caab04c22fd0f677f62d, aka "Fix
init/main.c to use regular printk with '%pF' for initcall fn") is certainly
easily reverted, but I'm a bit surprised if that's actually the cause. But
reverting it and testing is definitely the thing to do, and your double
fault is somewhere in the middle of the initcall sequence.

But the reason I find it surprising is that:

 - it really is in the _middle_ of the initcall sequence, not the first 
   one or something like that. So trivial changes to do_one_initcall() 
   sound unlikely to trigger there.

 - a patch very much like that has been in the tip/tracing/fastboot tree 
   earlier, so it actually got some testing. The patch itself is trivial 
   (but also trivially reverted with no downsides).

 - your successful boot dmesg does not have any of the initcall debugging 
   trace printouts, which in turn means that you don't seem to have 
   "initcall_debug" enabled (it's a kernel command line option), which in 
   turn should make that 96d746c commit a total no-op for you.

So I'd love to hear what happens when you revert that commit, but at the 
same time I do wonder if the bisection perhaps failed?

Also, do you perhaps have "initcall_debug" enabled for your _testing_ 
kernels, but not your fallback ones? That could explain how your normal 
dmesg doesn't have any initcall debugging output, but the testing kernel 
could still trigger some issue..

			Linus
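
Whether a given boot had "initcall_debug" on its command line can be checked mechanically instead of being inferred from the dmesg. The following is a minimal sketch, not part of the original discussion. It assumes the parameter counts as enabled whenever it appears, with or without a value. The helper name and the sample command lines are made up for illustration; on a running system the string would come from /proc/cmdline.

```python
def has_initcall_debug(cmdline):
    """Return True if the kernel command line mentions initcall_debug.

    Assumption of this sketch: both the bare form and a form with a
    value (e.g. "initcall_debug=1") count as enabled.
    """
    for token in cmdline.split():
        # Compare only the parameter name, ignoring any "=value" part.
        name = token.split("=", 1)[0]
        if name == "initcall_debug":
            return True
    return False


# Hypothetical command lines, not taken from the attached logs.
fallback_cmdline = "root=/dev/sda1 ro quiet"
testing_cmdline = "root=/dev/sda1 ro initcall_debug=1"

print(has_initcall_debug(fallback_cmdline))
print(has_initcall_debug(testing_cmdline))
```

Running the same check against the command line of both the fallback and the testing kernel would answer the question above directly.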
Comment 4 Ingo Molnar 2008-10-06 11:45:05 UTC
could you please also paste the output of 'git bisect log' here?

	Ingo
Comment 5 Yinghai Lu 2008-10-06 11:49:36 UTC
please try patch in 

http://bugzilla.kernel.org/show_bug.cgi?id=11676

and it is in tip/x86
Comment 6 Ingo Molnar 2008-10-06 11:55:21 UTC
here's the linkage of commits around the bisected commit:

897312b: include/linux/stacktrace.h: declare struct task_struct
f2fe163: orion_spi: fix handling of default transfer speed
aef7db4: fbdev: fix recursive notifier and locking when fbdev console is blanked
2e4a75c: rtc: fix kernel panic on second use of SIGIO nofitication
e105eab: Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-li
\
 1db9b83: Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-
\
 96d746c: Fix init/main.c to use regular printk with '%pF' for initcall fn
 95b866d: e1000e: Fix incorrect debug warning
 b5ff7df: Check mapped ranges on sysfs resource files
 6f92a6a: e1000e: update version from k4 to k6

so assuming that one of the final two bisection points was categorized 
incorrectly, other suspect commits would be:

 95b866d: e1000e: Fix incorrect debug warning
 b5ff7df: Check mapped ranges on sysfs resource files
 6f92a6a: e1000e: update version from k4 to k6
 2e4a75c: rtc: fix kernel panic on second use of SIGIO nofitication
 f2fe163: orion_spi: fix handling of default transfer speed
 aef7db4: fbdev: fix recursive notifier and locking when fbdev console is blanked

each has at least the theoretical possibility of causing a symptom like 
that.

albeit a triple fault / double fault is far more likely to be caused by 
infinite stack recursion - which again would finger something that 
messes with the stack - varargs and vsprintf.

OTOH, that commit is really well tested, on a wide range of systems and 
build environments, and it's very central functionality so i dont see 
much space for mistakes there. Not impossible though.
Comment 7 Ingo Molnar 2008-10-06 11:59:04 UTC
* bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:

> please try patch in 
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=11676
> 
> and it is in tip/x86

hm. We originally thought that this fix, albeit nice, is not that 
relevant and narrowly missed the .27 train.

But maybe it's worth pulling nevertheless - it has had 2 days of testing 
so far, and i queued it up in x86/urgent:

| commit d99e90164e6cf2eb85fa94d547d6336f8127a107
| Author:     Yinghai Lu <yhlu.kernel@gmail.com>
| AuthorDate: Sat Oct 4 15:55:12 2008 -0700
| Commit:     Ingo Molnar <mingo@elte.hu>
| CommitDate: Sun Oct 5 11:19:03 2008 +0200
|
|     x86: gart iommu have direct mapping when agp is present too

I'll send an RFC pull request to Linus.

	Ingo
Comment 8 Thomas Gleixner 2008-10-06 12:04:44 UTC
> Also, do you perhaps have "initcall_debug" enabled for your _testing_ 
> kernels, but not your fallback ones? That could explain how your normal 
> dmesg doesn't have any initcall debugging output, but the testing kernel 
> could still trigger some issue..

There is no initcall_debug on either the failing or the booting kernel.

I compared the logs (up to the point where the failing one stops) and
there is no helpful hint (difference) at all.

This looks very much like the problem in:
http://bugzilla.kernel.org/show_bug.cgi?id=11676

There is a patch available:
http://bugzilla.kernel.org/attachment.cgi?id=18165&action=view

Thanks,

	tglx
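
Comparing two boot logs up to the point where the failing one stops, as described above, is easier once the timestamps are stripped, since they differ on every boot. The following is a rough sketch of such a comparison, not the method actually used in this thread. The function names are invented here, the failing-log lines are copied from the report, and the working-log timestamps and its last line are hypothetical.

```python
import re

# Matches a leading printk timestamp such as "[    0.940510] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamp(line):
    """Drop the leading timestamp and any trailing newline."""
    return TIMESTAMP.sub("", line.rstrip("\n"))


def first_difference(good_lines, bad_lines):
    """Return (index, good, bad) for the first mismatching message.

    Returns None if the failing log, ignoring timestamps, is a prefix
    of the working log.
    """
    for index, bad in enumerate(bad_lines):
        bad_msg = strip_timestamp(bad)
        if index < len(good_lines):
            good_msg = strip_timestamp(good_lines[index])
        else:
            good_msg = None
        if good_msg != bad_msg:
            return index, good_msg, bad_msg
    return None


# Messages from the report; the timestamps of the working log and its
# final line are hypothetical.
failing = [
    "[    1.024187] PCI-DMA: using GART IOMMU.",
    "[    1.028053] PCI-DMA: Reserving 256MB of IOMMU area in the AGP aperture",
]
working = [
    "[    1.030000] PCI-DMA: using GART IOMMU.",
    "[    1.034000] PCI-DMA: Reserving 256MB of IOMMU area in the AGP aperture",
    "[    1.040000] (hypothetical next line of the working boot)",
]

print(first_difference(working, failing))
```

A result of None corresponds to the observation above: the failing log shows nothing that the working log does not, it simply ends earlier.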
Comment 9 Thomas Gleixner 2008-10-06 12:21:47 UTC
> So I'd love to hear what happens when you revert that commit, but at the 
> same time I do wonder if the bisection perhaps failed?

I don't think it is a bisect failure. The bug went unnoticed on his
machine until rc8.

I can not prove it, but this looks like a dependency on the position
of a function or something else in memory. The bisect moved
code/memory across a boundary which made this trigger or not.

Yinghai, any idea how to explain this?

Thanks,

	 tglx
Comment 10 Anonymous Emailer 2008-10-06 12:28:51 UTC
Reply-To: yinghai@kernel.org

On Mon, Oct 6, 2008 at 12:21 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> So I'd love to hear what happens when you revert that commit, but at the
>> same time I do wonder if the bisection perhaps failed?
>
> I don't think it is a bisect failure. The bug went unnoticed on his
> machine until rc8.
>
> I can not prove it, but this looks like a dependency on the position
> of a function or something else in memory. The bisect moved
> code/memory across a boundary which made this trigger or not.
>
> Yinghai, any idea how to explain this?

It could be that one patch uncovered the bug.

YH
Comment 11 Thomas Gleixner 2008-10-06 12:32:11 UTC
> > Yinghai, any idea how to explain this?
> 
> could be one patch uncover the bug.

Sure, that's my wild guess, but what is the reason why it accidentally
works in some kernel constellations?

Thanks,

	tglx
Comment 12 Daniel Vetter 2008-10-06 12:40:44 UTC
On Mon, Oct 06, 2008 at 12:21:48PM -0700, bugme-daemon@bugzilla.kernel.org wrote:
> > So I'd love to hear what happens when you revert that commit, but at the 
> > same time I do wonder if the bisection perhaps failed?
> 
> I don't think it is a bisect failure. The bug went unnoticed on his
> machine until rc8.

While bisecting I never had an ambiguous rev, even when retesting a few
times.

I checked 2.6.27-rc8-00090-gb1f23f5 with Linus' commit reverted, and that
works. Furthermore I rechecked a few of the bad kernels with

$ make clean && make -j8 && su -c 'make install'

to rule out any other strange effect, and they crashed the same way. So it
really _looks_ like git bisect found the right commit. At least for my
box. Still, the bisect log, reconstructed from my dead-tree notes (I've
already run git bisect reset, unfortunately):

2.6.27-rc8-00090-gb1f23f5: triple fault
2.6.27-rc8-00060-g1db9b83: double fault
2.6.27-rc8-00039-g546ca0e: good
2.6.27-rc8-00047-g4b19de6: good
2.6.27-rc8-00053-g717d438: good
2.6.27-rc8-00056-g95b866d: good
2.6.27-rc8-00058-g75f6276: good
2.6.27-rc8-00057-g96d746c: freeze

I'm gonna test the mentioned patch now.
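
The version names in the reconstructed log follow the `git describe` form tag-count-gSHA, where the count is the number of commits on top of the tag. The sketch below, added for illustration only, parses those names and sorts the noted results by that count. The function names are invented here; the data is copied from the list above.

```python
import re

# <tag>-<commits on top of tag>-g<abbreviated sha>
DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(name):
    """Split a git-describe style name into (tag, count, sha)."""
    match = DESCRIBE.match(name)
    if match is None:
        raise ValueError("not a git-describe style name: %r" % name)
    return match.group("tag"), int(match.group("count")), match.group("sha")


# Results as noted in the comment above.
BISECT_NOTES = [
    ("2.6.27-rc8-00090-gb1f23f5", "triple fault"),
    ("2.6.27-rc8-00060-g1db9b83", "double fault"),
    ("2.6.27-rc8-00039-g546ca0e", "good"),
    ("2.6.27-rc8-00047-g4b19de6", "good"),
    ("2.6.27-rc8-00053-g717d438", "good"),
    ("2.6.27-rc8-00056-g95b866d", "good"),
    ("2.6.27-rc8-00058-g75f6276", "good"),
    ("2.6.27-rc8-00057-g96d746c", "freeze"),
]


def by_commit_count(notes):
    """Sort (name, outcome) pairs by the commit count in the name."""
    return sorted(notes, key=lambda item: parse_describe(item[0])[1])


for name, outcome in by_commit_count(BISECT_NOTES):
    print(name + ": " + outcome)
```

Sorted this way, the good result at 00058 comes after the freeze at 00057. The count is not a position in a linear history, so in a history that contains merges it does not by itself order good and bad revisions.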
Comment 13 Anonymous Emailer 2008-10-06 12:43:50 UTC
Reply-To: yinghai@kernel.org

On Mon, Oct 6, 2008 at 12:31 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Yinghai, any idea how to explain this?
>>
>> could be one patch uncover the bug.
>
> Sure, that's my wild guess, but what is the reason why it accidentally
> works in some kernel constellations?

Lots of systems don't have an option in BIOS setup to set the GART IOMMU
aperture size (with that option, the GART aperture will be between TOM
and 4G). On those systems the kernel will steal some address space from
RAM, and those ranges are already direct mapped ( < max_low_mapped_pfn ).

Only good workstation-level systems will have that BIOS setup option.

YH
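
The distinction drawn above can be put as a simple range check: is the whole aperture below the end of the direct mapping or not? The sketch below is an illustration only, not the kernel's code. The aperture base and size come from the dmesg in the description. The limit value and the "stolen from RAM" aperture are hypothetical, and treating the limit as the first page frame number not covered by the direct mapping is an assumption of this sketch.

```python
PAGE_SHIFT = 12  # 4 KiB pages on x86-64
MiB = 1 << 20


def aperture_is_direct_mapped(aper_base, aper_size, max_low_mapped_pfn):
    """Return True if [aper_base, aper_base + aper_size) lies entirely
    below the direct-mapping limit.

    max_low_mapped_pfn is taken here to be the first page frame number
    NOT covered by the direct mapping (an assumption of this sketch).
    """
    end_pfn = (aper_base + aper_size) >> PAGE_SHIFT  # first pfn past the aperture
    return end_pfn <= max_low_mapped_pfn


# From the dmesg: "AGP aperture is 512M @ 0xc0000000".
reported_aperture = (0xC0000000, 512 * MiB)

# Hypothetical values: direct mapping assumed to end at 3 GiB, and an
# aperture carved out of RAM well below that limit.
hypothetical_limit_pfn = 0xC0000000 >> PAGE_SHIFT
stolen_aperture = (0x20000000, 64 * MiB)

print(aperture_is_direct_mapped(*reported_aperture, hypothetical_limit_pfn))
print(aperture_is_direct_mapped(*stolen_aperture, hypothetical_limit_pfn))
```

Under these assumed numbers, the aperture carved out of RAM is covered by the direct mapping while the reported one is not, which matches the explanation that only machines with a separately placed aperture would hit the problem.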
Comment 14 Thomas Gleixner 2008-10-06 13:02:05 UTC
On Mon, 6 Oct 2008, Yinghai Lu wrote:
> On Mon, Oct 6, 2008 at 12:31 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > Yinghai, any idea how to explain this?
> >>
> >> could be one patch uncover the bug.
> >
> > Sure, that's my wild guess, but what is the reason why it accidentally
> > works in some kernel constellations?
> 
> lots of system don't have option to set gart iommu aperture size in
> BIOS setup. ( gart aperture will between TOM and 4G)
> so kernel will steal some address space from RAM, those range already
> direct mapped. ( < max_low_mapped_pfn )
> 
> only good workstation level system will have that BIOS setup option.

That part I understand, but what I do not understand is that removing
a completely unrelated and innocent patch makes the bug vanish. It's
still there but it can not be observed.

Thanks,

	tglx
Comment 15 Daniel Vetter 2008-10-06 13:13:44 UTC
> That part I understand, but what I do not understand is that removing
> a completely unrelated and innocent patch makes the bug vanish. It's
> still there but it can not be observed.

2.6.27-rc8-00090-gb1f23f5 with the patch
http://bugzilla.kernel.org/attachment.cgi?id=18165
applied works. Does that at least give us the bug? ;)

btw: I once tested with initcall_debug=1. The bad kernels seem to die
somewhere between printing

PCI-DMA: Reserving 256MB of IOMMU area in the AGP aperture

and the initcall returned printk of do_one_initcall.
Comment 16 Daniel Vetter 2008-10-07 00:08:47 UTC
I retested v2.6.27-rc9-7-g0edd6df from mainline and it works.

Thanks to all who helped track this one down. Of course, if anyone wants to track down _why_ this all happened like it did, I'm willing to test stuff.