Bug 16173

Summary: After uncompressing the kernel, at boot time, the server hangs.
Product: Platform Specific/Hardware
Reporter: David Hill (hilld)
Component: i386
Assignee: platform_i386
Status: CLOSED CODE_FIX
Severity: blocking
CC: akpm, ebiederm, florian, guanx.bac, hilld, hpa, maciej.rutecki, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.35-rc1 2.6.35-rc2 2.6.35-rc3 2.6.35-rc4 2.6.35-rc5 2.6.35.1
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 16055
Attachments: .config file used to compile kernel.
config diff between working 2.6.34-rc7 and broken 2.6.35-rc1
Bisection logs...
Config of 2.6.35-rc4+ that boots to
config with which the problem can be reproduced on ThinkPad T43 1871-FU1 1.5GB Mem with 2.6.35.2

Description David Hill 2010-06-09 23:25:56 UTC
After uncompressing the kernel, the server hangs. This started with 2.6.35-rc1.

The server is an old dual P3 550 MHz with 1 GB of memory.

I'm starting a bisection.
Comment 1 David Hill 2010-06-10 03:47:35 UTC
Created attachment 26709 [details]
.config file used to compile kernel.
Comment 2 David Hill 2010-06-10 03:48:12 UTC
Created attachment 26710 [details]
config diff between working 2.6.34-rc7 and broken 2.6.35-rc1
Comment 3 Andrew Morton 2010-06-11 20:27:59 UTC
Awaiting bisection results ;)

This is a regression.  Do you know whether 2.6.34 is OK?  2.6.33?

Thanks.
Comment 4 H. Peter Anvin 2010-06-11 20:30:58 UTC
Well, since 2.6.34-rc7 is OK then 2.6.33 is presumably okay.  2.6.34 is not that different from -rc7, but it's plausible at least that something could have snuck in.

Getting the bisection results would indeed be much appreciated.
Comment 5 David Hill 2010-06-12 08:52:41 UTC
2.6.31
2.6.32.8
2.6.33
2.6.33-rc6
2.6.34-00476-g4d7b
2.6.34-rc7

are ok.

The regression appeared between 2.6.34-rc7 and 2.6.35-rc1.

I'm still bisecting... I have 6 more tests (according to git) to do...

Some are good, some are bad. (as expected)
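
For context, git bisect narrows a regression by binary search over the commits between a known-good and a known-bad revision, which is why only a handful of test builds remain at this point. The following is a minimal C sketch of that idea over a linear history. It is not git's implementation (real history is a graph with merges), and the names first_bad_commit, is_bad_fn and bad_from are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test callback: nonzero means "this commit fails to boot". */
typedef int (*is_bad_fn)(size_t commit, void *ctx);

/*
 * Commits 0..n-1 are in chronological order. Commit 0 is known good and
 * commit n-1 is known bad. Returns the index of the first bad commit and
 * stores the number of test builds that were needed in *tests.
 */
static size_t first_bad_commit(size_t n, is_bad_fn is_bad, void *ctx,
                               size_t *tests)
{
    size_t good = 0;
    size_t bad = n - 1;

    assert(n >= 2);
    *tests = 0;
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        (*tests)++;
        if (is_bad(mid, ctx))
            bad = mid;   /* regression is at mid or earlier */
        else
            good = mid;  /* regression is after mid */
    }
    return bad;
}

/* Example predicate: every commit at or after *ctx is bad. */
static int bad_from(size_t commit, void *ctx)
{
    return commit >= *(const size_t *)ctx;
}
```

With about a thousand candidate commits, this search needs at most ten test builds, which matches the small number of remaining steps reported here.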
Comment 6 David Hill 2010-06-12 18:25:29 UTC
2.6.34-00595-g3e1dd19 is good too.
Comment 7 David Hill 2010-06-14 02:20:01 UTC
2.6.34-rc6-00113-g4b6b19a is good too.
Comment 8 David Hill 2010-06-14 02:50:52 UTC
Sorry if it takes some time... I still have 2 steps to make before I'm done.

Do you want partial results?
Comment 9 David Hill 2010-06-14 03:49:03 UTC
Created attachment 26758 [details]
Bisection logs...
Comment 10 David Hill 2010-06-14 03:59:39 UTC
2.6.34-rc6-00115-g5777372 is good too :)
Comment 11 David Hill 2010-06-14 04:01:01 UTC
cf7500c0ea133d66f8449d86392d83f840102632 is the first bad commit
commit cf7500c0ea133d66f8449d86392d83f840102632
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Mar 30 01:07:11 2010 -0700

    x86, ioapic: In mpparse use mp_register_ioapic
    
    Long ago MP_ioapic_info was the primary way of setting up our
    ioapic data structures and mp_register_ioapic was a compatibility
    shim for acpi code.  Now the situation is reversed
    and mp_register_ioapic is the primary way of setting up our
    ioapic data structures.
    
    Keep the setting up of ioapic data structures uniform by
    having MP_ioapic_info call mp_register_ioapic.
    
    This changes a few fields:
    
    - type: is now hardset to MP_IOAPIC but type had to
      be MP_IOAPIC or MP_ioapic_info would not have been called.
    
    - flags: is now hard coded to MPC_APIC_USABLE.
      We require flags to contain at least MPC_APIC_USABLE in
      MP_ioapic_info and we don't ever examine flags so dropping
      a few flags that might possibly exist that we have never
      used is harmless.
    
    - apicaddr: Unchanged
    
    - apicver: Read from the ioapic instead of using the cached
      hardware value in the MP table.  The real hardware value
      will be more accurate.
    
    - apicid: Now verified to be unique and changed if it is not.
      If the BIOS got this right this is a noop.  If the BIOS did
      not, fixing things appears to be the better solution.
    
    This adds gsi_base and gsi_end values to our ioapics defined with
    the mptable, which will make our lives simpler later since
    we can always assume gsi_base and gsi_end are valid.
    
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    LKML-Reference: <1269936436-7039-10-git-send-email-ebiederm@xmission.com>
    Signed-off-by: H. Peter Anvin <hpa@zytor.com>
:040000 040000 6125cb7d9b0d66175e81f7081f2ae4ff59ca0254 5cc6ef21dbec694043fd0a245701a8e5c491d665 M      arch
Comment 12 Eric W. Biederman 2010-06-14 05:40:20 UTC
Taking a quick look at this the only bug I can find in that patch
is the off by one issue with gsi_end.

Does this patch fix things for you?
http://lkml.indiana.edu/hypermail/linux/kernel/1006.1/00232.html

Eric
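
To illustrate the kind of off-by-one mentioned above, here is a minimal C sketch of an inclusive GSI range. The struct has the same two fields as struct mp_ioapic_gsi quoted in this thread; the helper names (make_range, range_contains, range_top) and the pin count in the usage note are hypothetical. Reading the rename from gsi_end to gsi_top as a switch to an exclusive upper bound is an assumption, not something this thread states.

```c
#include <assert.h>
#include <stdint.h>

/* Same two fields as struct mp_ioapic_gsi quoted in this thread. */
struct gsi_range {
    uint32_t gsi_base;  /* first GSI routed through this IOAPIC */
    uint32_t gsi_end;   /* last GSI routed through it (inclusive) */
};

/* Hypothetical helper: range for an IOAPIC with nr_pins inputs. */
static struct gsi_range make_range(uint32_t gsi_base, uint32_t nr_pins)
{
    struct gsi_range r;

    assert(nr_pins > 0);
    r.gsi_base = gsi_base;
    r.gsi_end = gsi_base + nr_pins - 1;  /* inclusive, hence the -1 */
    return r;
}

/* Inclusive check: gsi_end itself belongs to the range. */
static int range_contains(const struct gsi_range *r, uint32_t gsi)
{
    return gsi >= r->gsi_base && gsi <= r->gsi_end;
}

/* Exclusive upper bound: the first GSI number past this range. */
static uint32_t range_top(const struct gsi_range *r)
{
    return r->gsi_end + 1;
}
```

For an IOAPIC with 24 pins starting at GSI 0, the last valid GSI is 23 and the exclusive bound is 24. Mixing the two conventions in a comparison or in a "next free GSI" computation is the classic way to be off by one.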
Comment 13 David Hill 2010-06-15 00:10:20 UTC
Can I download the patch somewhere?  Or is the only way to cut and paste the diffs and modify the diff headers so I can apply it? :S
Comment 14 David Hill 2010-06-15 00:27:44 UTC
I can't apply the patch... it's too different.
Comment 15 David Hill 2010-06-15 00:31:55 UTC
actual arch/x86/include/asm/io_apic.h
struct mp_ioapic_gsi{
    u32 gsi_base;
    u32 gsi_end;
};


patch:
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -183,7 +183,7 @@
 struct mp_ioapic_gsi{
 u32 gsi_end;
 };
 extern struct mp_ioapic_gsi mp_gsi_routing[];
-extern u32 gsi_end;
+extern u32 gsi_top;
 int mp_find_ioapic(u32 gsi);
 int mp_find_ioapic_pin(int ioapic, u32 gsi);
 void __init mp_register_ioapic(int id, u32 address, u32 gsi_base);

And that's one of the many rejected patches.
Comment 16 Eric W. Biederman 2010-06-15 00:47:21 UTC
"patch -p1 < ../path/to/patch" should work.  That is how the patch header differences are usually handled.

This is queued to be merged, but when I looked I didn't see it in Linus's
tree.

Thank you for your patience.  This was a really stupid bug on my part.

Eric
Comment 17 David Hill 2010-06-15 01:24:43 UTC
It doesn't work... too many differences... :(  I don't really know git, but I did try this:

cp patch /usr/src/linux-2.6
cd /usr/src/linux-2.6
git checkout cf7500c0ea133d66f8449d86392d83f840102632
patch -p1 < patch

It gives me:
wolfe:/usr/src/linux-2.6# patch -p1 < patch 
patching file arch/x86/include/asm/io_apic.h
Hunk #1 FAILED at 183.
1 out of 1 hunk FAILED -- saving rejects to file arch/x86/include/asm/io_apic.h.rej
patching file arch/x86/kernel/acpi/boot.c
Hunk #1 FAILED at 118.
1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/acpi/boot.c.rej
patching file arch/x86/kernel/apic/io_apic.c
Hunk #1 FAILED at 89.
1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/apic/io_apic.c.rej
patching file arch/x86/kernel/mpparse.c
Hunk #1 FAILED at 123.
1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/mpparse.c.rej
patching file arch/x86/kernel/sfi.c
Hunk #1 FAILED at 93.
1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kernel/sfi.c.rej
Comment 18 Eric W. Biederman 2010-06-15 14:58:46 UTC
Attached is my patch against 2.6.35-rc2, which is the tag v2.6.35-rc2 in git.
You should be able to use either git-am or patch -p1 to apply this patch.

I don't know whether your problem was cut and paste introducing different white
space and causing rejects, or simply that my patch was against 2.6.35-rc2 and
you were applying it against an earlier point in time.

Eric
Comment 19 David Hill 2010-06-16 01:06:11 UTC
I tried applying it to the latest and it didn't work (rc3) ...
I just successfully applied your patch to rc2.

I'm compiling and rebooting as soon as it finishes!
Comment 20 David Hill 2010-06-16 02:36:26 UTC
Ok, the patch didn't work on rc2... it still freezes after uncompressing the kernel.
Comment 21 Rafael J. Wysocki 2010-06-17 00:29:31 UTC
First-Bad-Commit : cf7500c0ea133d66f8449d86392d83f840102632
Comment 22 Eric W. Biederman 2010-06-20 04:59:55 UTC
> --- Comment #20 from David Hill <hilld@binarystorm.net>  2010-06-16 02:36:26
> ---
> Ok, the patch didn't work on rc2... it still freezes after uncompressing the
> kernel.

Weird.

I don't have a machine that actually works in 2.6.34 with mptables handy.
I have one that is close, and it boots with acpi enabled, and the disk doesn't
work with acpi disabled.

With 2.6.35-rc2 I still see boot messages.

I want to think this is a splash screen or something like that hiding your boot
messages.  Given that you say you see the uncompressing the kernel message
my splash screen hypothesis doesn't hold water.

You are using a 32bit kernel and I was testing a 64bit kernel.  That may
be where the difference lies.  I will start looking at your config.

There is certainly something weird going on here, and I'm not certain
where to look.

Eric
Comment 23 David Hill 2010-06-20 07:10:12 UTC
(In reply to comment #22)
> > --- Comment #20 from David Hill <hilld@binarystorm.net>  2010-06-16 02:36:26 ---
> > Ok, the patch didn't work on rc2... it still freezes after uncompressing the
> > kernel.
> 
> Weird.
> 
> I don't have a machine that actually works in 2.6.34 with mptables handy.
> I have one that is close, and it boots with acpi enabled, and the disk doesn't
> work with acpi disabled.
> 
> With 2.6.35-rc2 I still see boot messages.
> 
> I want to think this is a splash screen or something like that hiding your boot
> messages.  Given that you say you see the uncompressing the kernel message
> my splash screen hypothesis doesn't hold water.

Unless splash screen is a new kernel feature, I doubt it is the problem.  Don't forget it works with previous kernels ... the problem actually appeared in 2.6.35-rc1 (well , see the commit that causes the problem)

> You are using a 32bit kernel and I was testing a 64bit kernel.  That may
> be where the difference lies.  I start looking at your config.

Unless I need to disable/enable a new kernel feature, I doubt it is the config that causes the problem.
 
> There is certainly something weird going on here, and I'm not certain
> where to look.

In the broken kernels, There's something new (I think):
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-ecx -fcall-saved-edx"
but, I could not confirm ... If I take the commit before the one mentioned above, the kernel boots... but at that commit, it stops booting... 


> Eric
Comment 24 Eric W. Biederman 2010-06-20 08:06:46 UTC
> --- Comment #23 from David Hill <hilld@binarystorm.net>  2010-06-20 07:10:12
> ---
>
> Unless splash screen is a new kernel feature, I doubt it is the problem.
> Don't forget it works with previous kernels ... the problem actually appeared
> in 2.6.35-rc1 (well, see the commit that causes the problem)

I was only speculating on something that would hide kernel boot messages
and make it harder to see error messages.

>> You are using a 32bit kernel and I was testing a 64bit kernel.  That may
>> be where the difference lies.  I start looking at your config.
>
> Unless I need to disable/enable a new kernel feature, I doubt it is the
> config that causes the problem.

The important part of what I am saying is that I am still looking,
and I have a similar configuration that mostly works.

So I am trying to figure out what the difference between your configuration
and mine is.  So far the big variable is 32bit vs 64bit kernels.

Once I get that out of the way I will look for other things.

>> There is certainly something weird going on here, and I'm not certain
>> where to look.
>
> In the broken kernels, There's something new (I think):
> CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-ecx -fcall-saved-edx"
> but, I could not confirm ... If I take the commit before the one mentioned
> above, the kernel boots... but at that commit, it stops booting... 

Right.  I have wondered at that a little too.

I think we are fighting at least two issues.  The gsi off by one issue that I know
about and something else.

mp_register_ioapic does a little more than just fill out the
mp_ioapics data structure (which is all the old code did).  If the
proper setup hasn't occurred that extra work in mp_register_ioapic
could be a problem.  I will look into that next.   Generally I would
not expect that as acpi apparently works.  But as Sherlock Holmes is
reputed to have said, "Once you eliminate the obvious, whatever is
left however improbable is the truth".

I know the proper setup has occurred on my 64bit kernel.  So next I
will try a 32bit kernel, and if that works I will try something even
closer to your configuration.

Eric
Comment 25 Eric W. Biederman 2010-07-07 06:09:42 UTC
Created attachment 27037 [details]
Config of 2.6.35-rc4+  that boots to

Config of 2.6.35-rc4+ 140236b4b1c749c9b795ea3d11558a0eb5a3a080 that boots to userspace
Comment 26 Eric W. Biederman 2010-07-07 06:22:00 UTC
I finally got some testing time on a 32bit machine with a known good mptable.
The cpu was an AMD Athlon(tm) 64 X2 Dual Core Processor 3400+

I used the supplied kernel config with the addition of serial console support
and I can boot to userspace ( a busybox initramfs ).

Which tells me this cannot be a simple irq setup problem and must
at best be a weird interaction problem between different parts of
the kernel.

Perhaps we occasionally have weird tlb mismatch issues, or perhaps I
have subtly changed the timing in a way that messes up your box.

I'm not certain where to go from here. I clearly can not reproduce this
issue.  David could you test Linus's latest tree (aka 2.6.35-rc4+ 140236b4b1c749c9b795ea3d11558a0eb5a3a080 ) with the config I supplied?
Perhaps I just got lucky and the bug was fixed between now and then?

If this still results in a kernel that doesn't work for you perhaps
it is time to compare tool chains.  My build environment here is
on 64bit fedora 12:

> gcc --version
gcc (GCC) 4.4.3 20100127 (Red Hat 4.4.3-4)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Eric
Comment 27 David Hill 2010-07-07 07:03:04 UTC
2.6.34.1 is working fine.
2.6.35-rc4 doesn't work as is.

I need to recompile it with the specified commit and configuration file.  It may take a while (like a week) before I can do this.
Comment 28 Rafael J. Wysocki 2010-07-08 23:03:51 UTC
Handled-By : Eric W. Biederman <ebiederm@xmission.com>
Comment 29 Rafael J. Wysocki 2010-07-09 20:35:44 UTC
On Friday, July 09, 2010, David Hill wrote:
> Yes
> 
> David Hill
> 
> On 2010-07-08, at 19:41, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> 
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.34.  Please verify if it still should be listed and let the
> > tracking team know (either way).
> > 
> > 
> > Bug-Entry    : http://bugzilla.kernel.org/show_bug.cgi?id=16173
> > Subject        : After uncompressing the kernel, at boot time, the server hangs.
> > Submitter    : David Hill <hilld@binarystorm.net>
> > Date        : 2010-06-09 23:25 (30 days old)
Comment 30 Rafael J. Wysocki 2010-07-23 19:59:28 UTC
On Friday, July 23, 2010, Eric W. Biederman wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
> 
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.34.  Please verify if it still should be listed and let the
> > tracking team know (either way).
> 
> This one is a mystery.  I have tried and failed to reproduce this on a
> similar hardware configuration so I am at a loss as to where to look next
> to understand this bug.
Comment 31 David Hill 2010-08-13 06:44:44 UTC
Eureka!

This commit solves the problem:



commit c0e51cfba7377beffe7fb361f8093db95d2a9441
Merge: 320b2b8 417484d
Date:   Fri Aug 13 02:36:53 2010 -0400

    Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into 16173
Comment 32 Eric W. Biederman 2010-08-13 21:22:17 UTC
I figured it would.   Apparently you have to have just the right model of
intel cpu to hit the problem code path.

And for documentation purposes it should be just this commit
that solves the problem.  I'm glad to see you were finally able
to try this change and verify it works for you.

commit 5989cd6a1cbf86587edcc856791f960978087311
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Aug 4 13:30:27 2010 -0700

    x86, apic: Map the local apic when parsing the MP table.
    
    This fixes a regression in 2.6.35 from 2.6.34, that is
    present for select models of Intel cpus when people are
    using an MP table.
    
    The commit cf7500c0ea133d66f8449d86392d83f840102632
    "x86, ioapic: In mpparse use mp_register_ioapic" started
    calling mp_register_ioapic from MP_ioapic_info.  An extremely
    simple change that was obviously correct.  Unfortunately
    mp_register_ioapic did just a little more than the previous
    hand crafted code and so we gained this call path.
    
    The problem call path is:
    MP_ioapic_info()
      mp_register_ioapic()
       io_apic_unique_id()
         io_apic_get_unique_id()
           get_physical_broadcast()
             modern_apic()
               lapic_get_version()
                 apic_read(APIC_LVR)
    
    Which turned out to be a problem because the local apic
    was not mapped, at that point, unlike the similar point
    in the ACPI parsing code.
    
    This problem is fixed by mapping the local apic when
    parsing the mptable as soon as we reasonably can.
    
    Looking at the number of places we setup the fixmap for
    the local apic, I see some serious simplification opportunities.
    For the moment except for not duplicating the setting up of the
    fixmap in init_apic_mappings, I have not acted on them.
    
    The regression from 2.6.34 is tracked in bug
    https://bugzilla.kernel.org/show_bug.cgi?id=16173
    
    Cc: <stable@kernel.org> 2.6.35
    Reported-by: David Hill <hilld@binarystorm.net>
    Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
    Tested-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    LKML-Reference: <m1eiee86jg.fsf_-_@fess.ebiederm.org>
    Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
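
The ordering problem described in that commit message can be modelled in a few lines of C. This is only a user-space sketch under stated assumptions: every sim_* name is hypothetical, the register window is an ordinary array, and an unmapped read returns an error here, whereas on the affected machines the early boot simply hung. The offset 0x30 is the local APIC version register; the value 0x14 is an arbitrary sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_APIC_LVR 0x30u  /* local APIC version register offset */
#define SIM_NR_REGS  (0x400u / 4u)

static uint32_t sim_lapic_regs[SIM_NR_REGS];  /* stand-in for the hardware */
static volatile uint32_t *sim_lapic_window;   /* NULL until "mapped" */

/* Stand-in for setting up the fixmap for the local APIC. */
static void sim_map_lapic(void)
{
    sim_lapic_regs[SIM_APIC_LVR / 4] = 0x14;  /* sample version value */
    sim_lapic_window = sim_lapic_regs;
}

/* Returns 0 on success, -1 if the register window is not mapped yet. */
static int sim_apic_read(uint32_t reg, uint32_t *val)
{
    assert(reg / 4 < SIM_NR_REGS);
    if (sim_lapic_window == NULL)
        return -1;
    *val = sim_lapic_window[reg / 4];
    return 0;
}

/*
 * Stand-in for the MP table parsing path, which ends up reading the
 * version register. map_first selects the fixed ordering.
 */
static int sim_parse_mptable(int map_first, uint32_t *version)
{
    if (map_first)
        sim_map_lapic();
    return sim_apic_read(SIM_APIC_LVR, version);
}
```

Calling sim_parse_mptable(0, &v) fails because nothing has been mapped, while sim_parse_mptable(1, &v) succeeds. That mirrors the fix described above: map the local APIC before the MP table code can reach apic_read().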
Comment 33 Guan Xin 2010-08-16 01:27:15 UTC
I did not see this problem with 2.6.35.1. It first appeared for me with 2.6.35.2.

UP:
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Pentium(R) M processor 1.60GHz
stepping        : 8
Comment 34 Guan Xin 2010-08-16 01:35:50 UTC
Created attachment 27457 [details]
config with which the problem can be reproduced on ThinkPad T43 1871-FU1 1.5GB Mem with 2.6.35.2
Comment 35 Florian Mickler 2010-08-30 09:11:00 UTC
Hi Guan,

This seems to be another problem then. This bug was bisected to a commit (cf7500c0ea133d66f8449d86392d83f840102632) which was merged for 2.6.35-rc1 and should thus be visible in both 2.6.35.1 and .2. 

I've opened a new bugreport for your issue:
http://bugzilla.kernel.org/show_bug?id=17411

Cheers,
Flo