Latest working kernel version: 2.6.27-rc8
Earliest failing kernel version: 2.6.27-rc8-57-g96d746c
Distribution: debian unstable
Hardware Environment: amd64, 4-core Opteron (dual socket), AMD 81xx AGP bridge

When booting up, the kernel crashes right after emitting the following lines on the console:

[ 0.940510] PCI: Using ACPI for IRQ routing
[ 0.944389] pci 0000:04:00.0: BAR 0: can't allocate resource
[ 0.985267] agpgart-amd64 0000:04:00.0: AMD 8151 AGP Bridge rev B3
[ 1.022664] agpgart-amd64 0000:04:00.0: AGP aperture is 512M @ 0xc0000000
[ 1.024187] PCI-DMA: using GART IOMMU.
[ 1.028053] PCI-DMA: Reserving 256MB of IOMMU area in the AGP aperture

Symptoms are:
- freeze (on 2.6.27-rc8-57-g96d746c)
- double fault and freeze (on 2.6.27-rc8-60-g1db9b83)
- triple fault/reboot (on 2.6.27-rc8-90-gb1f23f5)

The kernel always crashes at the exact same place (triple fault case confirmed via serial console). I've tracked it down with git bisect. I'll now test latest mainline with this commit reverted (2.6.27-rc8-57-g96d746c) - but I goofed up something else, so it will take a few hours.
Created attachment 18180 [details]
dmesg of the working kernel

This is from the last good commit from bisecting.
Created attachment 18181 [details]
dmesg of the failing kernel

This is from the first bad one from bisecting.
On Mon, 6 Oct 2008, bugme-daemon@bugzilla.kernel.org wrote:
>
> I've tracked it down with git bisect. I'll now test latest mainline with this
> commit reverted (2.6.27-rc8-57-g96d746c) - but I goofed up something else,
> so it will take a few hours.

Ok, that commit (96d746c68fae9a1e3167caab04c22fd0f677f62d, aka "Fix init/main.c to use regular printk with '%pF' for initcall fn") is certainly easily reverted, but I'm a bit surprised if that's actually the cause.

But reverting it and testing is definitely the thing to do, and your double fault is somewhere in the middle of the initcall sequence. The reason I find it surprising is that:

- it really is in the _middle_ of the initcall sequence, not the first one or something like that, so trivial changes to do_one_initcall() sound unlikely to trigger there.

- a patch very much like that has been in the tip/tracing/fastboot tree earlier, so it actually got some testing. The patch itself is trivial (but also trivially reverted with no downsides).

- your successful boot dmesg does not have any of the initcall debugging trace printouts, which in turn means that you don't seem to have "initcall_debug" enabled (it's a kernel command line option), which in turn should make that 96d746c commit a total no-op for you.

So I'd love to hear what happens when you revert that commit, but at the same time I do wonder if the bisection perhaps failed?

Also, do you perhaps have "initcall_debug" enabled for your _testing_ kernels, but not your fallback ones? That could explain how your normal dmesg doesn't have any initcall debugging output, but the testing kernel could still trigger some issue..

		Linus
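For context, here is a minimal sketch of the initcall-debug path that commit touches - simplified kernel-style C, not the exact init/main.c code, and the names do_one_initcall_sketch and the local initcall_debug flag are stand-ins for illustration. With "initcall_debug" on the kernel command line, each initcall gets a "calling %pF" line and an "initcall %pF returned" line printed around it, and '%pF' resolves the function pointer to its symbolic name:

#include <linux/kernel.h>
#include <linux/ktime.h>

/* Stand-in for the real flag set by the "initcall_debug" boot parameter. */
static int initcall_debug = 1;

static int do_one_initcall_sketch(int (*fn)(void))
{
	ktime_t t0, t1;
	int result;

	if (initcall_debug) {
		/* '%pF' prints the symbolic name of the initcall function */
		printk("calling  %pF\n", fn);
		t0 = ktime_get();
	}

	result = fn();

	if (initcall_debug) {
		t1 = ktime_get();
		printk("initcall %pF returned %d after %lld msecs\n",
		       fn, result,
		       (long long)ktime_to_ms(ktime_sub(t1, t0)));
	}

	return result;
}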
Could you please also paste the output of 'git bisect log' here?

	Ingo
Please try the patch in http://bugzilla.kernel.org/show_bug.cgi?id=11676 - it is also in tip/x86.
Here's the linkage of commits around the bisected commit:

897312b: include/linux/stacktrace.h: declare struct task_struct
f2fe163: orion_spi: fix handling of default transfer speed
aef7db4: fbdev: fix recursive notifier and locking when fbdev console is blanked
2e4a75c: rtc: fix kernel panic on second use of SIGIO nofitication
e105eab: Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-li \
1db9b83: Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux- \
96d746c: Fix init/main.c to use regular printk with '%pF' for initcall fn
95b866d: e1000e: Fix incorrect debug warning
b5ff7df: Check mapped ranges on sysfs resource files
6f92a6a: e1000e: update version from k4 to k6

So assuming that one of the final two bisection points was categorized incorrectly, the other suspect commits would be:

95b866d: e1000e: Fix incorrect debug warning
b5ff7df: Check mapped ranges on sysfs resource files
6f92a6a: e1000e: update version from k4 to k6
2e4a75c: rtc: fix kernel panic on second use of SIGIO nofitication
f2fe163: orion_spi: fix handling of default transfer speed
aef7db4: fbdev: fix recursive notifier and locking when fbdev console is blanked

Each has at least the theoretical possibility of causing a symptom like that, albeit a triple fault / double fault is far more likely to be caused by infinite stack recursion - which again would finger something that messes with the stack: varargs and vsprintf.

OTOH, that commit is really well tested, on a wide range of systems and build environments, and it's very central functionality, so I don't see much space for mistakes there. Not impossible though.
* bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:

> please try the patch in
>
>   http://bugzilla.kernel.org/show_bug.cgi?id=11676
>
> and it is in tip/x86

hm. We originally thought that this fix, albeit nice, was not that relevant, and it narrowly missed the .27 train. But maybe it's worth pulling nevertheless - it has had 2 days of testing so far, and I queued it up in x86/urgent:

| commit d99e90164e6cf2eb85fa94d547d6336f8127a107
| Author:     Yinghai Lu <yhlu.kernel@gmail.com>
| AuthorDate: Sat Oct 4 15:55:12 2008 -0700
| Commit:     Ingo Molnar <mingo@elte.hu>
| CommitDate: Sun Oct 5 11:19:03 2008 +0200
|
|     x86: gart iommu have direct mapping when agp is present too

I'll send an RFC pull request to Linus.

	Ingo
> Also, do you perhaps have "initcall_debug" enabled for your _testing_
> kernels, but not your fallback ones? That could explain how your normal
> dmesg doesn't have any initcall debugging output, but the testing kernel
> could still trigger some issue..

There is no initcall_debug on either the failing or the booting kernel. I compared the logs (up to the point where the failing one stops) and there is no helpful hint (difference) at all.

This looks very much like the problem in:

  http://bugzilla.kernel.org/show_bug.cgi?id=11676

There is a patch available:

  http://bugzilla.kernel.org/attachment.cgi?id=18165&action=view

Thanks,

	tglx
> So I'd love to hear what happens when you revert that commit, but at the
> same time I do wonder if the bisection perhaps failed?

I don't think it is a bisect failure. The bug went unnoticed on his machine until rc8.

I cannot prove it, but this looks like a dependency on the position of a function or something else in memory. The bisect moved code/memory across a boundary which made this trigger or not.

Yinghai, any idea how to explain this?

Thanks,

	tglx
Reply-To: yinghai@kernel.org

On Mon, Oct 6, 2008 at 12:21 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> So I'd love to hear what happens when you revert that commit, but at the
>> same time I do wonder if the bisection perhaps failed?
>
> I don't think it is a bisect failure. The bug went unnoticed on his
> machine until rc8.
>
> I cannot prove it, but this looks like a dependency on the position
> of a function or something else in memory. The bisect moved
> code/memory across a boundary which made this trigger or not.
>
> Yinghai, any idea how to explain this?

Could be that one patch uncovered the bug.

YH
> > Yinghai, any idea how to explain this?
>
> Could be that one patch uncovered the bug.

Sure, that's my wild guess, but what is the reason why it accidentally works in some kernel constellations?

Thanks,

	tglx
On Mon, Oct 06, 2008 at 12:21:48PM -0700, bugme-daemon@bugzilla.kernel.org wrote:
> > So I'd love to hear what happens when you revert that commit, but at the
> > same time I do wonder if the bisection perhaps failed?
>
> I don't think it is a bisect failure. The bug went unnoticed on his
> machine until rc8.

While bisecting I never had an ambiguous rev, even when retesting a few times. I checked 2.6.27-rc8-00090-gb1f23f5 with Linus' commit reverted, and that works. Furthermore, I rechecked a few of the bad kernels with

  $ make clean && make -j8 && su -c 'make install'

to rule out any other strange effect, and they crashed the same way. So it really _looks_ like git bisect found the right commit. At least for my box.

Still, here is the bisect log, reconstructed from my dead-tree notes (I've already run git bisect reset, unfortunately):

2.6.27-rc8-00090-gb1f23f5: triple fault
2.6.27-rc8-00060-g1db9b83: double fault
2.6.27-rc8-00039-g546ca0e: good
2.6.27-rc8-00047-g4b19de6: good
2.6.27-rc8-00053-g717d438: good
2.6.27-rc8-00056-g95b866d: good
2.6.27-rc8-00058-g75f6276: good
2.6.27-rc8-00057-g96d746c: freeze

I'm gonna test the mentioned patch now.
Reply-To: yinghai@kernel.org

On Mon, Oct 6, 2008 at 12:31 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Yinghai, any idea how to explain this?
>>
>> Could be that one patch uncovered the bug.
>
> Sure, that's my wild guess, but what is the reason why it accidentally
> works in some kernel constellations?

Lots of systems don't have an option to set the gart iommu aperture size in the BIOS setup (the gart aperture would be between TOM and 4G), so the kernel will steal some address space from RAM, and that range is already direct mapped (< max_low_mapped_pfn).

Only good workstation-level systems have that BIOS setup option.

YH
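To illustrate the overlap Yinghai describes, here is a small standalone C sketch. The helper name and the example mapping limit are made up for illustration; this is not the actual arch/x86 code or the patch from bug 11676. The point is just that an aperture carved out below the top of already direct-mapped RAM leaves the same physical range mapped cacheable by the kernel while the GART remaps it:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/* Hypothetical helper: the kernel direct mapping covers pfns
 * [0, max_low_pfn_mapped).  If the aperture starts below that limit,
 * at least part of it is still mapped cacheable by the kernel while
 * the GART remaps it. */
static bool aperture_in_direct_mapping(uint64_t aper_base,
				       uint64_t max_low_pfn_mapped)
{
	return (aper_base >> PAGE_SHIFT) < max_low_pfn_mapped;
}

int main(void)
{
	/* Aperture base taken from this report (512M @ 0xc0000000);
	 * the 4G direct-mapped limit is a made-up example value. */
	uint64_t aper_base = 0xc0000000ULL;
	uint64_t max_low_pfn_mapped = 0x100000000ULL >> PAGE_SHIFT;

	printf("aperture overlaps direct mapping: %s\n",
	       aperture_in_direct_mapping(aper_base, max_low_pfn_mapped) ?
	       "yes" : "no");
	return 0;
}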
On Mon, 6 Oct 2008, Yinghai Lu wrote:
> On Mon, Oct 6, 2008 at 12:31 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > Yinghai, any idea how to explain this?
> >>
> >> Could be that one patch uncovered the bug.
> >
> > Sure, that's my wild guess, but what is the reason why it accidentally
> > works in some kernel constellations?
>
> Lots of systems don't have an option to set the gart iommu aperture size
> in the BIOS setup (the gart aperture would be between TOM and 4G), so the
> kernel will steal some address space from RAM, and that range is already
> direct mapped (< max_low_mapped_pfn).
>
> Only good workstation-level systems have that BIOS setup option.

That part I understand, but what I do not understand is that removing a completely unrelated and innocent patch makes the bug vanish. It's still there, but it cannot be observed.

Thanks,

	tglx
> That part I understand, but what I do not understand is that removing
> a completely unrelated and innocent patch makes the bug vanish. It's
> still there, but it cannot be observed.

2.6.27-rc8-00090-gb1f23f5 with the patch

  http://bugzilla.kernel.org/attachment.cgi?id=18165

applied works. Does that at least give us the bug? ;)

btw: I once tested with initcall_debug=1. The bad kernels seem to die somewhere between printing

  PCI-DMA: Reserving 256MB of IOMMU area in the AGP aperture

and the "initcall returned" printk of do_one_initcall.
I retested v2.6.27-rc9-7-g0edd6df from mainline and it works. Thanks to all who helped track this one down. Of course, if anyone wants to track down _why_ this all happened like it did, I'm willing to test stuff.