Bug 16173
| Summary: | After uncompressing the kernel, at boot time, the server hangs. | | |
|---|---|---|---|
| Product: | Platform Specific/Hardware | Reporter: | David Hill (hilld) |
| Component: | i386 | Assignee: | platform_i386 |
| Status: | CLOSED CODE_FIX | | |
| Severity: | blocking | CC: | akpm, ebiederm, florian, guanx.bac, hilld, hpa, maciej.rutecki, rjw |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.35-rc1 2.6.35-rc2 2.6.35-rc3 2.6.35-rc4 2.6.35-rc5 2.6.35.1 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 16055 | | |

Attachments:
.config file used to compile kernel.
config diff between working 2.6.34-rc7 and broken 2.6.35-rc1
Bisection logs...
Config of 2.6.35-rc4+ that boots to
config with which the problem can be reproduced on ThinkPad T43 1871-FU1 1.5GB Mem with 2.6.35.2
Description
David Hill
2010-06-09 23:25:56 UTC
Created attachment 26709 [details]
.config file used to compile kernel.
Created attachment 26710 [details]
config diff between working 2.6.34-rc7 and broken 2.6.35-rc1
Awaiting bisection results ;)

This is a regression. Do you know whether 2.6.34 is OK? 2.6.33? Thanks.

Well, since 2.6.34-rc7 is OK then 2.6.33 is presumably okay. 2.6.34 is not that different from -rc7, but it's plausible at least that something could have snuck in. Getting the bisection results would indeed be very appreciated.

2.6.31 2.6.32.8 2.6.33 2.6.33-rc6 2.6.34-00476-g4d7b 2.6.34-rc7 are ok. The regression appeared between 2.6.34-rc7 and 2.6.35-rc1.

I'm still bisecting... I have 6 more tests (according to git) to do... Some are good, some are bad. (as expected)

2.6.34-00595-g3e1dd19 is good too.

2.6.34-rc6-00113-g4b6b19a is good too.

Sorry if it takes some time... I still have 2 steps to make before I'm done. Do you want partial results?

Created attachment 26758 [details]
Bisection logs...
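
The "6 more tests (according to git)" count above follows from bisection being a binary search: each build-and-boot test halves the range of suspect commits. A minimal sketch of that idea in Python, assuming a linear history (real `git bisect` works on a commit graph) and a hypothetical `is_bad` callback standing in for "build the kernel, boot it, check for the hang":

```python
def bisect_first_bad(commits, is_bad):
    """Return (first_bad_commit, tests_run).

    commits: oldest-to-newest list; commits[0] is known good and
             commits[-1] is known bad (assumed, as with git bisect start).
    is_bad:  hypothetical callback for "build, boot, does it hang?".
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi], tests


# Toy history: 100 made-up commits, regression introduced at index 63.
history = [f"c{i:03d}" for i in range(100)]
first_bad, tests = bisect_first_bad(history, lambda c: int(c[1:]) >= 63)
print(first_bad, tests)
```

The number of tests grows with the logarithm of the range, which is why even the several thousand commits between 2.6.34-rc7 and 2.6.35-rc1 need only a dozen or so boots.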
2.6.34-rc6-00115-g5777372 is good too :)

cf7500c0ea133d66f8449d86392d83f840102632 is the first bad commit
commit cf7500c0ea133d66f8449d86392d83f840102632
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Tue Mar 30 01:07:11 2010 -0700

x86, ioapic: In mpparse use mp_register_ioapic

Long ago MP_ioapic_info was the primary way of setting up our ioapic data structures and mp_register_ioapic was a compatibility shim for acpi code. Now the situation is reversed and and mp_register_ioapic is the primary way of setting up our ioapic data structures.

Keep the setting up of ioapic data structures uniform by having mp_register_ioapic call mp_register_ioapic.

This changes a few fields:
- type: is now hardset to MP_IOAPIC but type had to bey MP_IOAPIC or MP_ioapic_info would not have been called.
- flags: is now hard coded to MPC_APIC_USABLE. We require flags to contain at least MPC_APIC_USEBLE in MP_ioapic_info and we don't ever examine flags so dropping a few flags that might possibly exist that we have never used is harmless.
- apicaddr: Unchanged
- apicver: Read from the ioapic instead of using the cached hardware value in the MP table. The real hardware value will be more accurate.
- apicid: Now verified to be unique and changed if it is not. If the BIOS got this right this is a noop. If the BIOS did not fixing things appears to be the better solution.

This adds gsi_base and gsi_end values to our ioapics defined with the mpatable, which will make our lives simpler later since we can always assume gsi_base and gsi_end are valid.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
LKML-Reference: <1269936436-7039-10-git-send-email-ebiederm@xmission.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>

:040000 040000 6125cb7d9b0d66175e81f7081f2ae4ff59ca0254 5cc6ef21dbec694043fd0a245701a8e5c491d665 M arch

Taking a quick look at this the only bug I can find in that patch is the off by one issue with gsi_end. Does this patch fix things for you?
http://lkml.indiana.edu/hypermail/linux/kernel/1006.1/00232.html

Eric

Can I download a patch somewhere? Or is the only way of doing that cut/pasting the diffs and modifying the diff headers so I can apply it? :S

I can't apply the patch... it's too different.

actual arch/x86/include/asm/io_apic.h:

    struct mp_ioapic_gsi{
    	u32 gsi_base;
    	u32 gsi_end;
    };

patch:

    --- a/arch/x86/include/asm/io_apic.h
    +++ b/arch/x86/include/asm/io_apic.h
    @@ -183,7 +183,7 @@ struct mp_ioapic_gsi{
     	u32 gsi_end;
     };
     extern struct mp_ioapic_gsi mp_gsi_routing[];
    -extern u32 gsi_end;
    +extern u32 gsi_top;
     int mp_find_ioapic(u32 gsi);
     int mp_find_ioapic_pin(int ioapic, u32 gsi);
     void __init mp_register_ioapic(int id, u32 address, u32 gsi_base);

And that's one of the many rejected patches.

"patch -p1 < ../path/to/patch" should work. That is how the patch header differences are usually handled.

This is queued for getting merged, but when I looked I didn't see it in Linus's tree.

Thank you for your patience. This was a really stupid bug on my part.

Eric

It doesn't work... too much differences... :( I don't really know git, but I did try this:

    cp patch /usr/src/linux-2.6
    cd /usr/src/linux-2.6
    git checkout cf7500c0ea133d66f8449d86392d83f840102632
    patch -p1 < patch

It gives me:

    wolfe:/usr/src/linux-2.6# patch -p1 < patch
    patching file arch/x86/include/asm/io_apic.h
    Hunk #1 FAILED at 183.
    1 out of 1 hunk FAILED -- saving rejects to file arch/x86/include/asm/io_apic.h.rej
    patching file arch/x86/kernel/acpi/boot.c
    Hunk #1 FAILED at 118.
    1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/acpi/boot.c.rej
    patching file arch/x86/kernel/apic/io_apic.c
    Hunk #1 FAILED at 89.
    1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/apic/io_apic.c.rej
    patching file arch/x86/kernel/mpparse.c
    Hunk #1 FAILED at 123.
    1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/mpparse.c.rej
    patching file arch/x86/kernel/sfi.c
    Hunk #1 FAILED at 93.
    1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/sfi.c.rej

> https://bugzilla.kernel.org/show_bug.cgi?id=16173
>
> --- Comment #17 from David Hill <hilld@binarystorm.net> 2010-06-15 01:24:43 ---
> It doesn't work... too much differences... :( I don't really know git, but I
> did try this:
>
> cp patch /usr/src/linux-2.6
> cd /usr/src/linux-2.6
> git checkout cf7500c0ea133d66f8449d86392d83f840102632
> patch -p1 < patch
>
> It gives me:
> wolfe:/usr/src/linux-2.6# patch -p1 < patch
> patching file arch/x86/include/asm/io_apic.h
> Hunk #1 FAILED at 183.
> 1 out of 1 hunk FAILED -- saving rejects to file arch/x86/include/asm/io_apic.h.rej
> patching file arch/x86/kernel/acpi/boot.c
> Hunk #1 FAILED at 118.
> 1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/acpi/boot.c.rej
> patching file arch/x86/kernel/apic/io_apic.c
> Hunk #1 FAILED at 89.
> 1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/apic/io_apic.c.rej
> patching file arch/x86/kernel/mpparse.c
> Hunk #1 FAILED at 123.
> 1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/mpparse.c.rej
> patching file arch/x86/kernel/sfi.c
> Hunk #1 FAILED at 93.
> 1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/sfi.c.rej

Attached is my patch against 2.6.35-rc2, which is the tag v2.6.35-rc2 in git. You should be able to use either git-am or patch -p1 to apply this patch.

I don't know if your problem was cut and paste introducing different white space and causing rejects, or simply that my patch was against 2.6.35-rc2 and you were applying it against an earlier point in time.

Eric

I tried applying it to the latest and it didn't work (rc3) ...

I just successfully applied your patch to rc2. I'm compiling and rebooting as soon as it finishes!
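
The thread never spells out what the "off by one issue with gsi_end" is; the only concrete hint is the hunk that renames the global `gsi_end` to `gsi_top`. The sketch below is one reading of that hint, not the patch itself: it assumes the per-IO-APIC `gsi_end` is inclusive (the last GSI the IO-APIC serves) and that `gsi_top` means one past the highest GSI in use. It is a Python model rather than kernel code, and `make_ioapic` and `register_next` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class IoApicGsi:
    """Stand-in for struct mp_ioapic_gsi (illustrative only)."""
    gsi_base: int
    gsi_end: int  # ASSUMPTION: inclusive, the last GSI served by this IO-APIC


def make_ioapic(gsi_base, nr_pins):
    # An IO-APIC with nr_pins pins serves gsi_base .. gsi_base + nr_pins - 1.
    return IoApicGsi(gsi_base, gsi_base + nr_pins - 1)


def gsi_top(ioapics):
    """One past the highest GSI in use; 0 when nothing is registered."""
    return max((a.gsi_end + 1 for a in ioapics), default=0)


def register_next(ioapics, nr_pins):
    """Place a new IO-APIC at the first free GSI."""
    apic = make_ioapic(gsi_top(ioapics), nr_pins)
    ioapics.append(apic)
    return apic


apics = []
first = register_next(apics, 24)
second = register_next(apics, 24)
# Off-by-one variant: using the inclusive end (23) as the next base would
# start the second IO-APIC on a GSI the first one already owns.
print(first, second, gsi_top(apics))
```

An exclusive "top" also has a natural value for the empty case (0), which an inclusive "end" cannot express without a separate flag.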
Ok, the patch didn't work on rc2... it still freezes after uncompressing the kernel.

First-Bad-Commit : cf7500c0ea133d66f8449d86392d83f840102632

> --- Comment #20 from David Hill <hilld@binarystorm.net> 2010-06-16 02:36:26 ---
> Ok, the patch didn't work on rc2... it still freezes after uncompressing the
> kernel.
Weird.
I don't have a machine that actually works in 2.6.34 with mptables handy.
I have one that is close, and it boots with acpi enabled, and the disk doesn't
work with acpi disabled.
With 2.6.35-rc2 I still see boot messages.
I want to think this is a splash screen or something like that hiding your boot
messages. Given that you say you see the uncompressing the kernel message
my splash screen hypothesis doesn't hold water.
You are using a 32bit kernel and I was testing a 64bit kernel. That may
be where the difference lies. I will start looking at your config.
There is certainly something weird going on here, and I'm not certain
where to look.
Eric
(In reply to comment #22)
> > --- Comment #20 from David Hill <hilld@binarystorm.net> 2010-06-16 02:36:26 ---
> > Ok, the patch didn't work on rc2... it still freezes after uncompressing the
> > kernel.
>
> Weird.
>
> I don't have a machine that actually works in 2.6.34 with mptables handy.
> I have one that is close, and it boots with acpi enabled, and the disk doesn't
> work with acpi disabled.
>
> With 2.6.35-rc2 I still see boot messages.
>
> I want to think this is a splash screen or something like that hiding your boot
> messages. Given that you say you see the uncompressing the kernel message
> my splash screen hypothesis doesn't hold water.

Unless splash screen is a new kernel feature, I doubt it is the problem. Don't forget it works with previous kernels ... the problem actually appeared in 2.6.35-rc1 (well, see the commit that causes the problem)

> You are using a 32bit kernel and I was testing a 64bit kernel. That may
> be where the difference lies. I will start looking at your config.

Unless I need to disable/enable a new kernel feature, I doubt it is the config that causes the problem.

> There is certainly something weird going on here, and I'm not certain
> where to look.

In the broken kernels, there's something new (I think):
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-ecx -fcall-saved-edx"
but I could not confirm ... If I take the commit before the one mentioned above, the kernel boots... but at that commit, it stops booting...

> Eric

> --- Comment #23 from David Hill <hilld@binarystorm.net> 2010-06-20 07:10:12 ---
>
> Unless splash screen is a new kernel feature, I doubt it is the problem. Don't
> forget it works with previous kernels ... the problem actually appeared in
> 2.6.35-rc1 (well, see the commit that causes the problem)

I was only speculating on something that would hide kernel boot messages and make it harder to see error messages.

>> You are using a 32bit kernel and I was testing a 64bit kernel. That may
>> be where the difference lies. I will start looking at your config.
>
> Unless I need to disable/enable a new kernel feature, I doubt it is the config
> that causes the problem.

The important part of what I am saying is that I am still looking, and I have a similar configuration that mostly works. So I am trying to figure out what the difference between your configuration and mine is. So far the big variable is 32bit vs 64bit kernels. Once I get that out of the way I will look for other things.

>> There is certainly something weird going on here, and I'm not certain
>> where to look.
>
> In the broken kernels, there's something new (I think):
> CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-ecx -fcall-saved-edx"
> but I could not confirm ... If I take the commit before the one mentioned
> above, the kernel boots... but at that commit, it stops booting...

Right. I have wondered at that a little too. I think we are fighting at least two issues: the gsi off by one issue that I know about, and something else.

mp_register_ioapic does a little more than just fill out the mp_ioapics data structure (which is all the old code did). If the proper setup hasn't occurred, that extra work in mp_register_ioapic could be a problem. I will look into that next.

Generally I would not expect that, as acpi apparently works. But as Sherlock Holmes is reputed to have said, "Once you eliminate the obvious, whatever is left however improbable is the truth".

I know the proper setup has occurred on my 64bit kernel. So next I will try a 32bit kernel, and if that works I will try something even closer to your configuration.

Eric

Created attachment 27037 [details]
Config of 2.6.35-rc4+ that boots to
Config of 2.6.35-rc4+ 140236b4b1c749c9b795ea3d11558a0eb5a3a080 that boots to userspace
I finally got some testing time on a 32bit machine with a known good mptable.
The cpu was an AMD Athlon(tm) 64 X2 Dual Core Processor 3400+.
I used the supplied kernel config with the addition of serial console support
and I can boot to userspace ( a busybox initramfs ).
Which tells me this can not be a simple irq setup problem and must
at best be a weird interaction problem between different parts of
the kernel.
Perhaps we occasionally have weird tlb mismatch issues, or perhaps I
have subtly changed the timing in a way that messes up your box.
I'm not certain where to go from here. I clearly can not reproduce this
issue. David, could you test Linus's latest tree (aka 2.6.35-rc4+ 140236b4b1c749c9b795ea3d11558a0eb5a3a080) with the config I supplied?
Perhaps I just got lucky and the bug was fixed between now and then?
If this still results in a kernel that doesn't work for you perhaps
it is time to compare tool chains. My build environment here is
on 64bit fedora 12:
> gcc --version
gcc (GCC) 4.4.3 20100127 (Red Hat 4.4.3-4)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Eric
2.6.34.1 is working fine. 2.6.35-rc4 doesn't work as is. I need to recompile it with the specified commit and configuration file. It may take a while (like a week) before I can do this.

Handled-By : Eric W. Biederman <ebiederm@xmission.com>

On Friday, July 09, 2010, David Hill wrote:
> Yes
>
> David Hill
>
> On 2010-07-08, at 19:41, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.34. Please verify if it still should be listed and let the
> > tracking team know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=16173
> > Subject : After uncompressing the kernel, at boot time, the server hangs.
> > Submitter : David Hill <hilld@binarystorm.net>
> > Date : 2010-06-09 23:25 (30 days old)

On Friday, July 23, 2010, Eric W. Biederman wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
>
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.34. Please verify if it still should be listed and let the
> > tracking team know (either way).
>
> This one is a mystery. I have tried and failed to reproduce this on a
> similar hardware configuration so I am at a loss as to where to look next
> to understand this bug.

Eureka! This commit solves the problem:

commit c0e51cfba7377beffe7fb361f8093db95d2a9441
Merge: 320b2b8 417484d
Date: Fri Aug 13 02:36:53 2010 -0400

Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into 16173

I figured it would. Apparently you have to have just the right model of intel cpu to hit the problem code path.

And for documentation purposes it should be just this commit that solves the problem. I'm glad to see you were finally able to try this change and verify it works for you.

commit 5989cd6a1cbf86587edcc856791f960978087311
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Wed Aug 4 13:30:27 2010 -0700

x86, apic: Map the local apic when parsing the MP table.

This fixes a regression in 2.6.35 from 2.6.34, that is present for select models of Intel cpus when people are using an MP table.

The commit cf7500c0ea133d66f8449d86392d83f840102632 "x86, ioapic: In mpparse use mp_register_ioapic" started calling mp_register_ioapic from MP_ioapic_info. An extremely simple change that was obviously correct. Unfortunately mp_register_ioapic did just a little more than the previous hand crafted code and so we gained this call path.

The problem call path is:
MP_ioapic_info()
  mp_register_ioapic()
    io_apic_unique_id()
      io_apic_get_unique_id()
        get_physical_broadcast()
          modern_apic()
            lapic_get_version()
              apic_read(APIC_LVR)

Which turned out to be a problem because the local apic was not mapped, at that point, unlike the similar point in the ACPI parsing code.

This problem is fixed by mapping the local apic when parsing the mptable as soon as we reasonably can.

Looking at the number of places we setup the fixmap for the local apic, I see some serious simplification opportunities. For the moment except for not duplicating the setting up of the fixmap in init_apic_mappings, I have not acted on them.

The regression from 2.6.34 is tracked in bug https://bugzilla.kernel.org/show_bug.cgi?id=16173

Cc: <stable@kernel.org> 2.6.35
Reported-by: David Hill <hilld@binarystorm.net>
Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
Tested-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
LKML-Reference: <m1eiee86jg.fsf_-_@fess.ebiederm.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

This problem did not occur to me at 2.6.35.1. It's new to me at 2.6.35.2.

UP:
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 1.60GHz
stepping : 8

Created attachment 27457 [details]
config with which the problem can be reproduced on ThinkPad T43 1871-FU1 1.5GB Mem with 2.6.35.2
Hi Guan, this seems to be another problem then. This bug was bisected to a commit (cf7500c0ea133d66f8449d86392d83f840102632) which was introduced before 2.6.34-rc7 and should thus be visible in both 2.6.35.1 and .2.

I've opened a new bugreport for your issue: http://bugzilla.kernel.org/show_bug?id=17411

Cheers, Flo
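
For readers skimming to the end, the root cause given in the fix commit (5989cd6a1cbf) is an ordering problem: after cf7500c0ea13, parsing the MP table reached `apic_read(APIC_LVR)` before the local APIC had been mapped. The toy model below illustrates only that ordering. It is Python, not kernel code; the class and the `boots` helper are hypothetical, the exception stands in for the silent hang seen on real hardware, and the model ignores the detail that only select Intel CPU models take this path.

```python
class LocalApic:
    """Toy model of a memory-mapped local APIC (illustrative only)."""

    def __init__(self):
        self.mapped = False

    def map(self):
        # Stands in for setting up the fixmap for the local APIC.
        self.mapped = True

    def read_version(self):
        # Stands in for apic_read(APIC_LVR).
        if not self.mapped:
            raise RuntimeError("local APIC read before it was mapped")
        return 0x14  # arbitrary illustrative value


def mp_register_ioapic(lapic):
    # The call chain from the commit message, collapsed to the one read
    # that matters: choosing a unique id ends up reading the LAPIC version.
    return lapic.read_version()


def parse_mp_table(lapic, map_lapic_first):
    if map_lapic_first:
        lapic.map()  # the ordering the fix establishes
    return mp_register_ioapic(lapic)


def boots(map_lapic_first):
    """True if MP table parsing completes in this model."""
    try:
        parse_mp_table(LocalApic(), map_lapic_first)
        return True
    except RuntimeError:
        return False


print(boots(False), boots(True))
```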