Bug 23332

Summary: Boot failure on HP nx6325
Product: Platform Specific/Hardware
Component: x86-64
Status: CLOSED CODE_FIX
Severity: blocking
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.37-rc2
Subsystem:
Regression: Yes
Bisected commit-id:
Reporter: Rafael J. Wysocki (rjw)
Assignee: Bjorn Helgaas (bjorn.helgaas)
CC: bjorn.helgaas, florian, grzegorz.chwesewicz, jbarnes, maciej.rutecki
Bug Depends on:
Bug Blocks: 21782
Attachments: test patch (avoid last 64K)
dmesg output from HP nx6325
failing log
nx6325 Everest report
patch to avoid allocating PNP resources
debugging patch
v2 patch to avoid allocating PNP resources
patch to avoid opening windows on subtractive decode bridges

Description Rafael J. Wysocki 2010-11-20 00:57:39 UTC
My HP nx6325 doesn't boot any more with 2.6.37-rc2 and later.  It boots normally with 2.6.36 and an analogous .config.

The last thing printed by the kernel is that the TSC is unstable with an
insanely huge delta.  Later it only displays vertical stripes and hangs hard,
but I don't really think the problem is related to the radeon driver, because
it's still reproducible with radeon.modeset=0.

I'll try to bisect this over the weekend.
Comment 1 Rafael J. Wysocki 2010-11-20 15:03:29 UTC
Bisection leads to the following commit:

commit 1af3c2e45e7a641e774bbb84fa428f2f0bf2d9c9
Author: Bjorn Helgaas <bjorn.helgaas@hp.com>
Date:   Tue Oct 26 15:41:54 2010 -0600

    x86: allocate space within a region top-down

    Request that allocate_resource() use available space from high addresses
    first, rather than the default of using low addresses first.

    The most common place this makes a difference is when we move or assign
    new PCI device resources.  Low addresses are generally scarce, so it's
    better to use high addresses when possible.  This follows Windows practice
    for PCI allocation.

    Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228#c42
    Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

Reverting it from current mainline
(HEAD=76db8ac45fc738f7d7664fe9b56d15c594a45228) causes the
box to boot successfully again.

First-Bad-Commit : 1af3c2e45e7a641e774bbb84fa428f2f0bf2d9c9
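
For readers following along, here is a minimal Python sketch of what "top-down" versus "bottom-up" placement inside a region means. It is not the kernel's allocate_resource(); the helper name `find_free`, the region bounds and the busy ranges are all made up for illustration.

```python
def find_free(region, busy, size, align, top_down=False):
    """Return an inclusive (start, end) block of `size` bytes inside
    `region` that avoids every range in `busy`, or None.
    `align` must be a power of two; busy ranges are assumed to lie
    inside the region."""
    lo, hi = region
    # Collect the free gaps between the busy ranges, lowest first.
    gaps = []
    cursor = lo
    for b_start, b_end in sorted(busy):
        if b_start > cursor:
            gaps.append((cursor, b_start - 1))
        cursor = max(cursor, b_end + 1)
    if cursor <= hi:
        gaps.append((cursor, hi))
    if top_down:
        gaps.reverse()  # try the highest gap first
    for g_start, g_end in gaps:
        if top_down:
            # Highest aligned start that still fits below the gap end.
            start = (g_end + 1 - size) & ~(align - 1)
        else:
            # Lowest aligned start at or above the gap start.
            start = (g_start + align - 1) & ~(align - 1)
        if start >= g_start and start + size - 1 <= g_end:
            return (start, start + size - 1)
    return None


# Hypothetical layout: a 2GB window with two busy ranges, 64MB request.
REGION = (0x80000000, 0xffffffff)
BUSY = [(0xd0000000, 0xd7ffffff), (0xfec00000, 0xffffffff)]
SIZE = ALIGN = 64 << 20

low_first = find_free(REGION, BUSY, SIZE, ALIGN)
high_first = find_free(REGION, BUSY, SIZE, ALIGN, top_down=True)
print([hex(x) for x in low_first])
print([hex(x) for x in high_first])
```

The same request lands at opposite ends of the region depending on the direction, which is why a top-down policy can expose address ranges that bottom-up allocation never touched.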
Comment 2 Bjorn Helgaas 2010-11-20 17:43:29 UTC
Created attachment 37752 [details]
test patch (avoid last 64K)

Thanks for the report and bisection!  Matthew Garrett reported a similar
problem on his HP 2530p, and I sent him this patch to test.  Maybe you
could try it, too?  If you could also attach the dmesg log, that'd be useful.

In Matthew's case, we allocated the very last page before the 4GB boundary
to a PCI device.  The E820 map probably should have reserved that area, but
it didn't, so maybe it's a common HP BIOS bug.
Comment 3 Rafael J. Wysocki 2010-11-20 20:42:37 UTC
Created attachment 37772 [details]
dmesg output from HP nx6325

The patch doesn't help.  Attached is the dmesg output from the current mainline
kernel with commit 1af3c2e45e7a641e774bbb84fa428f2f0bf2d9c9 reverted.
Comment 4 H. Peter Anvin 2010-11-20 21:57:40 UTC
It seems like a good idea to reserve 0xff000000..0xffffffff for BIOS as a general policy; 0xfexxxxxx tends to be used by things like APIC.

Comment 5 Bjorn Helgaas 2010-11-21 05:41:32 UTC
> It seems like a good idea to reserve 0xff000000..0xffffffff for BIOS as a
> general policy; 0xfexxxxxx tends to be used by things like APIC.

You're certainly right that APICs and other things often live in that area.
The question is how we decide on the right boundary.  If we choose a boundary
higher than Windows, there will be cases where Windows works but we don't.
If we choose a boundary lower than Windows, we're "safer," but we may throw
away perfectly usable space.  I chose 0xffff0000 for the debug patch because
that's what Windows used (in one test of Windows 7 on qemu, YMMV).

Rafael, can you try a boot with "pci=use_crs"?  Your box is old enough that
we don't turn it on automatically, and amd_bus.c thinks the host bridge
window leading to bus 00 is [mem 0x80000000-0xfcffffffff] (note that the
end of that window has *ten* digits, not eight).  I'm dubious about
relying on that region above 4GB, especially since ACPI is telling us
the window is only [mem 0x80000000-0xfedfffff].

There is also a window [mem 0xfee01000-0xffffffff] that goes right to the 4GB
boundary, but in this case the BIOS does reserve [mem 0xfff00000-0xffffffff]
via a motherboard device, so I don't think the debug patch will make a
difference.
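
As a rough illustration of the "ten digits, not eight" point, the following Python sketch parses the `[mem 0x...-0x...]` notation used in the logs and checks whether a window reaches past the 4GB boundary. The helper names `parse_mem` and `crosses_4gb` are hypothetical; only the two window strings come from this comment.

```python
import re

FOUR_GB = 1 << 32


def parse_mem(text):
    """Parse '[mem 0xSTART-0xEND ...]' into an inclusive (start, end) tuple."""
    m = re.search(r"\[mem 0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)", text)
    if m is None:
        raise ValueError("no [mem ...] range in %r" % text)
    return int(m.group(1), 16), int(m.group(2), 16)


def crosses_4gb(rng):
    """True if the inclusive range ends at or above the 4GB boundary."""
    return rng[1] >= FOUR_GB


amd_bus_window = parse_mem("[mem 0x80000000-0xfcffffffff]")  # from amd_bus.c
acpi_window = parse_mem("[mem 0x80000000-0xfedfffff]")       # from ACPI

for name, rng in (("amd_bus", amd_bus_window), ("ACPI", acpi_window)):
    print(name, hex(rng[0]), hex(rng[1]), "crosses 4GB:", crosses_4gb(rng))
```

With a top-down policy the end of the window matters most, so a window that ends far above 4GB would steer allocations toward addresses the ACPI description does not cover.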
Comment 6 Rafael J. Wysocki 2010-11-21 15:00:34 UTC
(In reply to comment #5)
> > It seems like a good idea to reserve 0xff000000..0xffffffff for BIOS as a
> > general policy; 0xfexxxxxx tends to be used by things like APIC.
> 
> You're certainly right that APICs and other things often live in that area.
> The question is how we decide on the right boundary.  If we choose a boundary
> higher than Windows, there will be cases where Windows works but we don't.
> If we choose a boundary lower than Windows, we're "safer," but we may throw
> away perfectly usable space.  I chose 0xffff0000 for the debug patch because
> that's what Windows used (in one test of Windows 7 on qemu, YMMV).
> 
> Rafael, can you try a boot with "pci=use_crs"?

That helps.

> Your box is old enough that we don't turn it on automatically, and amd_bus.c
> thinks the host bridge window leading to bus 00 is
> [mem 0x80000000-0xfcffffffff] (note that the
> end of that window has *ten* digits, not eight).  I'm dubious about
> relying on that region above 4GB, especially since ACPI is telling us
> the window is only [mem 0x80000000-0xfedfffff].

Moreover, the box was shipped with 32-bit Windows XP and I run 64-bit Linux
on it.
 
> There is also a window [mem 0xfee01000-0xffffffff] that goes right to the 4GB
> boundary, but in this case the BIOS does reserve [mem 0xfff00000-0xffffffff]
> via a motherboard device, so I don't think the debug patch will make a
> difference.

Right.
Comment 7 Rafael J. Wysocki 2010-11-21 15:22:15 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > > It seems like a good idea to reserve 0xff000000..0xffffffff for BIOS as a
> > > general policy; 0xfexxxxxx tends to be used by things like APIC.
> > 
> > You're certainly right that APICs and other things often live in that area.
> > > The question is how we decide on the right boundary.  If we choose a boundary
> > > higher than Windows, there will be cases where Windows works but we don't.
> > > If we choose a boundary lower than Windows, we're "safer," but we may throw
> > > away perfectly usable space.  I chose 0xffff0000 for the debug patch because
> > > that's what Windows used (in one test of Windows 7 on qemu, YMMV).
> > 
> > Rafael, can you try a boot with "pci=use_crs"?
> 
> That helps.

No, it doesn't.  Sorry, I used the wrong kernel for that test.
Comment 8 Bjorn Helgaas 2010-11-21 15:38:08 UTC
Huh, I really expected that to be the problem.  I'll puzzle over it some
more.  If you still have the ability to boot Windows, an Everest report
(http://lavalys.com) might have a hint.  If it's possible to get any kind
of serial or netconsole log of the failed boots (especially with "pci=use_crs",
since I just don't see how we can use the top of the region amd_bus.c reported),
that would help, too.
Comment 9 Rafael J. Wysocki 2010-11-21 18:28:52 UTC
Windows is not installed on this box any more and there's no serial port in it.

I'll try to play with netconsole, but I'm afraid it crashes too early for that
to work.
Comment 10 Rafael J. Wysocki 2010-11-22 23:23:02 UTC
Handled-By : Bjorn Helgaas <bjorn.helgaas@hp.com>
Comment 11 Grzegorz Chwesewicz 2010-11-25 08:27:12 UTC
(In reply to comment #9)
> Windows is not installed on this box any more and there's no serial port in it.
> 
> I'll try to play with netconsole, but I'm afraid it crashes too early for that
> to work.

I've tried a serial console over a USB<-->serial converter, but it also crashes too early. I also tried to take some pictures, but the kernel messages scroll by too fast ;)
Comment 12 Florian Mickler 2010-11-25 10:30:01 UTC
Maybe the boot_delay kernel parameter will allow you to catch it:

Documentation/kernel-parameters.txt: 
        boot_delay=     Milliseconds to delay each printk during boot.
                        Values larger than 10 seconds (10000) are changed to
                        no delay (0).
Comment 13 Grzegorz Chwesewicz 2010-11-25 10:38:17 UTC
(In reply to comment #12)
> Maybe the boot_delay kernel parameter will allow you to catch it:
> 
> Documentation/kernel-parameters.txt: 
>         boot_delay=     Milliseconds to delay each printk during boot.
>                         Values larger than 10 seconds (10000) are changed to
>                         no delay (0).

Yes, I also hit on this idea after writing my previous post. I'll post the photos in a few hours.
Comment 14 Grzegorz Chwesewicz 2010-11-25 22:08:39 UTC
Here is my "dmesg": http://ftp.retis.net.pl/dmesg.jpeg  I'll try to make a dump over FireWire if I can find my FireWire cable.
Comment 15 Bjorn Helgaas 2010-11-27 16:03:32 UTC
Created attachment 38322 [details]
failing log

(In reply to comment #14)
> Here is my "dmesg" http://ftp.retis.net.pl/dmesg.jpeg

Wow, nice.  That is the coolest log I've ever seen:-)  I did find an nx6325
and a docking station, so I was able to collect this serial log easily.
Comment 16 Bjorn Helgaas 2010-11-27 16:16:19 UTC
Created attachment 38332 [details]
nx6325 Everest report

I also have Windows XP on this system.  Here's an Everest report showing
Windows resource usage.

Differences I see:
  0000:01:05.0 Linux assigns ROM at [mem 0xd43e0000-0xd43fffff pref]
  0000:00:12.0 WinXP moved BAR 5 to [mem 0xffeffe00-0xffefffff]
  0000:00:12.0 Linux assigned ROM at [mem 0xffe80000-0xffefffff pref]
  0000:00:14.2 Linux moved BAR 0 to [mem 0xfed7c000-0xfed7ffff 64bit]
  0000:00:14.4 Linux assigned window at [mem 0xf8000000-0xfbffffff pref]

The Linux option ROM assignments are typical; Windows doesn't assign resources
for ROMs, but Linux does.  The 12.0 and 14.2 changes are because of this
collision:

  pci 0000:00:14.2: address space collision: [mem 0xd4408000-0xd440bfff 64bit] conflicts with 0000:00:12.0 [mem 0xd4409000-0xd44091ff]

WinXP moved 12.0 and Linux moved 14.2; both resolve the collision.
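
The collision and both resolutions can be checked with a small Python sketch. The addresses are taken from the log line and the difference list above; the helper name `overlaps` is just for illustration.

```python
def overlaps(a, b):
    """True if the inclusive ranges a and b share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


# Ranges reported in the collision message.
dev_14_2_orig = (0xd4408000, 0xd440bfff)
dev_12_0_orig = (0xd4409000, 0xd44091ff)

# Where each OS moved one of the two devices.
dev_12_0_winxp = (0xffeffe00, 0xffefffff)  # WinXP moved 00:12.0 BAR 5
dev_14_2_linux = (0xfed7c000, 0xfed7ffff)  # Linux moved 00:14.2 BAR 0

print("original collision:", overlaps(dev_14_2_orig, dev_12_0_orig))
print("after WinXP move:  ", overlaps(dev_14_2_orig, dev_12_0_winxp))
print("after Linux move:  ", overlaps(dev_14_2_linux, dev_12_0_orig))
```

Both moved ranges keep the original sizes (512 bytes for 12.0, 16K for 14.2), so either move is sufficient on its own.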

I used a test patch to keep Linux from assigning the mem pref window to
the 14.4 subtractive decode bridge, and it avoided the problem, so my
working theory is that there's something in the [mem 0xf8000000-0xfbffffff]
region we should be avoiding.
Comment 17 Rafael J. Wysocki 2010-11-27 20:34:38 UTC
However, the BIOS doesn't seem to tell us what it is?
Comment 18 Bjorn Helgaas 2010-11-28 05:42:18 UTC
Well, the E820 memory map doesn't mention anything in the
[mem 0xf8000000-0xfbffffff] range, and I don't see any ACPI devices
there either.

I can't find it right now, but ISTR a recent problem where we placed
a device on top of a uvesafb frame buffer, so I wonder whether we're
supposed to use the VESA BIOS extensions in addition to the E820
map and ACPI namespace.  But I don't know anything about VESA BIOS;
it's just on my list to look into.
Comment 19 Bjorn Helgaas 2010-12-08 18:02:23 UTC
Created attachment 39272 [details]
patch to avoid allocating PNP resources

I did a lot of experimentation with this, and as far as I can tell,
this is just a BIOS bug -- the BIOS forgot to tell us about a couple
devices in the address space.

The VESA framebuffer issue I was thinking of is bug 22132.  In that
case, the framebuffer was marked "reserved" in E820, but we put another
PCI device there anyway because we don't do a very good job avoiding
those reserved areas.  In any case, that's not the issue here.

Here's a quirk to avoid the hazards in the nx6325 address space.  It works
for me, but it'd be good if you could try it, too, because I only tested
it as far as booting to the point of mounting the root filesystem.
Comment 20 Bjorn Helgaas 2010-12-08 19:30:40 UTC
Created attachment 39302 [details]
debugging patch

Here's the patch I used to explore the address space.  I used boot
arguments like "pci=cbmemsize=1M pci_top=0xf83fffff" to force the
00:14.4 bridge prefetchable memory window to be allocated at
various sizes and addresses.

Here are the results (note that we allocate 64M for CardBus bridge
windows by default, so the 64M allocation is the default behavior
that caused Rafael's hang):

    64M  [mem 0xf8000000-0xfbffffff] HANG
      32M  [mem 0xf8000000-0xf9ffffff] HANG
        16M  [mem 0xf8000000-0xf8ffffff] HANG
           8M  [mem 0xf8000000-0xf87fffff] HANG
             4M  [mem 0xf8000000-0xf83fffff] HANG
               2M  [mem 0xf8000000-0xf81fffff] OK
               2M  [mem 0xf8200000-0xf83fffff] HANG
                 1M  [mem 0xf8200000-0xf82fffff] OK
                 1M  [mem 0xf8300000-0xf83fffff] HANG
             4M  [mem 0xf8400000-0xf87fffff] HANG
               2M  [mem 0xf8400000-0xf85fffff] HANG
                 1M  [mem 0xf8400000-0xf84fffff] OK
                 1M  [mem 0xf8500000-0xf85fffff] HANG
               2M  [mem 0xf8600000-0xf87fffff] OK
           8M  [mem 0xf8800000-0xf8ffffff] OK
        16M  [mem 0xf9000000-0xf9ffffff] HANG
           8M  [mem 0xf9000000-0xf97fffff] HANG
             4M  [mem 0xf9000000-0xf93fffff] HANG
               2M  [mem 0xf9000000-0xf91fffff] HANG
                 1M  [mem 0xf9000000-0xf90fffff] OK
                 1M  [mem 0xf9100000-0xf91fffff] HANG
               2M  [mem 0xf9200000-0xf93fffff] OK
             4M  [mem 0xf9400000-0xf97fffff] OK
           8M  [mem 0xf9800000-0xf9ffffff] OK
      32M  [mem 0xfa000000-0xfbffffff] OK

Based on the above, I think the nx6325 has the following unreported
areas in use:

    1M  [mem 0xf8300000-0xf83fffff] HANG
    1M  [mem 0xf8500000-0xf85fffff] HANG
    1M  [mem 0xf9100000-0xf91fffff] HANG

My quirk in the previous patch combined the first two areas because
I hadn't been quite so systematic when I wrote the patch.  I should
probably have just used these three areas as-is.
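
The deduction above can be reproduced mechanically. The Python sketch below assumes a window hangs exactly when it overlaps at least one unreported area, picks the HANG windows that contain no smaller HANG window, and then checks that choice against every test result. The function names are hypothetical; the data is the table from this comment.

```python
RESULTS = [
    (0xf8000000, 0xfbffffff, "HANG"), (0xf8000000, 0xf9ffffff, "HANG"),
    (0xf8000000, 0xf8ffffff, "HANG"), (0xf8000000, 0xf87fffff, "HANG"),
    (0xf8000000, 0xf83fffff, "HANG"), (0xf8000000, 0xf81fffff, "OK"),
    (0xf8200000, 0xf83fffff, "HANG"), (0xf8200000, 0xf82fffff, "OK"),
    (0xf8300000, 0xf83fffff, "HANG"), (0xf8400000, 0xf87fffff, "HANG"),
    (0xf8400000, 0xf85fffff, "HANG"), (0xf8400000, 0xf84fffff, "OK"),
    (0xf8500000, 0xf85fffff, "HANG"), (0xf8600000, 0xf87fffff, "OK"),
    (0xf8800000, 0xf8ffffff, "OK"),   (0xf9000000, 0xf9ffffff, "HANG"),
    (0xf9000000, 0xf97fffff, "HANG"), (0xf9000000, 0xf93fffff, "HANG"),
    (0xf9000000, 0xf91fffff, "HANG"), (0xf9000000, 0xf90fffff, "OK"),
    (0xf9100000, 0xf91fffff, "HANG"), (0xf9200000, 0xf93fffff, "OK"),
    (0xf9400000, 0xf97fffff, "OK"),   (0xf9800000, 0xf9ffffff, "OK"),
    (0xfa000000, 0xfbffffff, "OK"),
]


def minimal_hangs(results):
    """HANG windows that do not contain any other (smaller) HANG window."""
    hangs = [(s, e) for s, e, r in results if r == "HANG"]

    def has_smaller_hang(outer):
        return any(inner != outer and outer[0] <= inner[0] and inner[1] <= outer[1]
                   for inner in hangs)

    return sorted(h for h in hangs if not has_smaller_hang(h))


def consistent(results, bad_areas):
    """Check the assumption: a window hangs iff it overlaps a bad area."""
    for s, e, r in results:
        hit = any(s <= b_end and b_start <= e for b_start, b_end in bad_areas)
        if hit != (r == "HANG"):
            return False
    return True


bad = minimal_hangs(RESULTS)
for start, end in bad:
    print("[mem 0x%08x-0x%08x]" % (start, end))
print("consistent with all tests:", consistent(RESULTS, bad))
```

Note that the tests only narrow each area down to 1M granularity; whatever actually lives there could be smaller than the 1M window that exposed it.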
Comment 21 Bjorn Helgaas 2010-12-08 19:45:09 UTC
Created attachment 39312 [details]
v2 patch to avoid allocating PNP resources

Updated patch to reserve the three specific areas mentioned in
comment 20.
Comment 22 Bjorn Helgaas 2010-12-15 00:04:54 UTC
Created attachment 40212 [details]
patch to avoid opening windows on subtractive decode bridges

This is a different approach.  This system has a subtractive decode
bridge leading to a CardBus bridge:

  pci 0000:00:14.4: PCI bridge to [bus 02-03] (subtractive decode)
  pci 0000:02:04.0: CardBus bridge to [bus 03-06]

Windows leaves the subtractive decode bridge alone and programs a
64MB window on the CardBus bridge.  This 64MB window relies on
subtractive decode.

Linux programs two 64MB windows on the CardBus bridge, *and* opens
a window on the 00:14.4 bridge, so it positively decodes at least part
of the CardBus space.

This patch makes Linux use the BIOS setup of subtractive decode bridges,
without assigning new explicit windows to them.
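
To make the policy concrete, here is a toy Python model of it. This is not the attached patch and does not use any kernel API: the `Bridge` class, the `assign_windows` function and the example window address are all hypothetical, and only the two device names come from the log lines above.

```python
from dataclasses import dataclass, field


@dataclass
class Bridge:
    name: str
    subtractive_decode: bool = False
    # Windows already programmed on the bridge, e.g. by the BIOS.
    windows: list = field(default_factory=list)


def assign_windows(bridge, wanted):
    """Toy policy: keep the BIOS setup of subtractive decode bridges and
    only open new explicit windows on positive decode bridges."""
    if bridge.subtractive_decode:
        return bridge.windows          # leave it alone
    bridge.windows = bridge.windows + list(wanted)
    return bridge.windows


# Illustrative 64MB window; the address is an assumption, not from the log.
window = (0xf8000000, 0xfbffffff)

pci_bridge = Bridge("0000:00:14.4", subtractive_decode=True)
cardbus = Bridge("0000:02:04.0")

print(pci_bridge.name, assign_windows(pci_bridge, [window]))
print(cardbus.name, assign_windows(cardbus, [window]))
```

In this model the CardBus bridge still gets its window, while the upstream bridge keeps forwarding unclaimed cycles by subtractive decode instead of positively decoding a range that may hide other devices.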
Comment 23 Rafael J. Wysocki 2010-12-16 16:33:55 UTC
My HP nx6325 boots correctly with this patch applied.
Comment 24 Rafael J. Wysocki 2010-12-16 16:35:15 UTC
Patch : https://bugzilla.kernel.org/attachment.cgi?id=40212