Bug 23332
Summary: | Boot failure on HP nx6325 | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Rafael J. Wysocki (rjw) |
Component: | x86-64 | Assignee: | Bjorn Helgaas (bjorn.helgaas) |
Status: | CLOSED CODE_FIX | ||
Severity: | blocking | CC: | bjorn.helgaas, florian, grzegorz.chwesewicz, jbarnes, maciej.rutecki |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.37-rc2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 21782 | ||
Attachments: |
test patch (aviod last 64K)
dmesg output from HP nx6325
failing log
nx6325 Everest report
patch to avoid allocating PNP resources
debugging patch
v2 patch to avoid allocating PNP resources
patch to avoid opening windows on subtractive decode bridges |
Description
Rafael J. Wysocki
2010-11-20 00:57:39 UTC
Bisection leads to the following commit:
commit 1af3c2e45e7a641e774bbb84fa428f2f0bf2d9c9
Author: Bjorn Helgaas <bjorn.helgaas@hp.com>
Date: Tue Oct 26 15:41:54 2010 -0600
    x86: allocate space within a region top-down
    Request that allocate_resource() use available space from high addresses
    first, rather than the default of using low addresses first.
    The most common place this makes a difference is when we move or assign
    new PCI device resources. Low addresses are generally scarce, so it's
    better to use high addresses when possible. This follows Windows practice
    for PCI allocation.
    Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228#c42
    Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Reverting it from current mainline (HEAD=76db8ac45fc738f7d7664fe9b56d15c594a45228)
causes the box to boot successfully again.
First-Bad-Commit : 1af3c2e45e7a641e774bbb84fa428f2f0bf2d9c9
Created attachment 37752 [details]
test patch (aviod last 64K)
Thanks for the report and bisection! Matthew Garrett reported a similar
problem on his HP 2530p, and I sent him this patch to test. Maybe you
could try it, too? If you could also attach the dmesg log, that'd be useful.
In Matthew's case, we allocated the very last page before the 4GB boundary
to a PCI device. The E820 map probably should have reserved that area, but
it didn't, so maybe it's a common HP BIOS bug.
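To illustrate why top-down allocation can hand out the last page below 4GB, here is a minimal Python sketch. It is not the kernel's allocate_resource() implementation; the `allocate` helper, the window bounds, and the "busy" list are hypothetical values chosen for illustration only.

```python
def allocate(window, size, busy, top_down):
    """Return the start of a free, size-aligned block inside the inclusive
    window (lo, hi), or None. `size` must be a power of two. `busy` is a
    list of inclusive (lo, hi) ranges that must not be used.
    Hypothetical helper, not a kernel API."""
    lo, hi = window
    first = (lo + size - 1) & ~(size - 1)   # lowest aligned candidate
    last = (hi + 1 - size) & ~(size - 1)    # highest aligned candidate
    if top_down:
        candidates = range(last, first - 1, -size)
    else:
        candidates = range(first, last + 1, size)
    for start in candidates:
        end = start + size - 1
        # Accept the first candidate that touches no busy range.
        if all(end < b_lo or start > b_hi for b_lo, b_hi in busy):
            return start
    return None


# Assumed example window that runs right up to the 4GB boundary.
window = (0xfee01000, 0xffffffff)
page = 0x1000

print(hex(allocate(window, page, [], top_down=False)))
print(hex(allocate(window, page, [], top_down=True)))
# Treating the last 64K as busy, as the test patch's title suggests:
print(hex(allocate(window, page, [(0xffff0000, 0xffffffff)], top_down=True)))
```

With nothing marked busy, the top-down search returns the very last page below 4GB; marking the last 64K as busy pushes the allocation just below that area.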
Created attachment 37772 [details]
dmesg output from HP nx6325
The patch doesn't help and dmesg output from the current mainline kernel
with commit 1af3c2e45e7a641e774bbb84fa428f2f0bf2d9c9 reverted is attached.
It seems like a good idea to reserve 0xff000000..0xffffffff for BIOS as a general policy; 0xfexxxxxx tends to be used by things like APIC.
> It seems like a good idea to reserve 0xff000000..0xffffffff for BIOS as a
> general policy; 0xfexxxxxx tends to be used by things like APIC.
You're certainly right that APICs and other things often live in that area.
The question is how we decide on the right boundary. If we choose a boundary
higher than Windows, there will be cases where Windows works but we don't.
If we choose a boundary lower than Windows, we're "safer," but we may throw
away perfectly usable space. I chose 0xffff0000 for the debug patch because
that's what Windows used (in one test of Windows 7 on qemu, YMMV).
Rafael, can you try a boot with "pci=use_crs"? Your box is old enough that
we don't turn it on automatically, and amd_bus.c thinks the host bridge
window leading to bus 00 is [mem 0x80000000-0xfcffffffff] (note that the
end of that window has *ten* digits, not eight). I'm dubious about
relying on that region above 4GB, especially since ACPI is telling us
the window is only [mem 0x80000000-0xfedfffff].
There is also a window [mem 0xfee01000-0xffffffff] that goes right to the 4GB
boundary, but in this case the BIOS does reserve [mem 0xfff00000-0xffffffff]
via a motherboard device, so I don't think the debug patch will make a
difference.
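The mismatch between the two host bridge windows can be checked mechanically. The following Python sketch uses the window values quoted in this comment; the `above_4gb` helper is a hypothetical name introduced only for this illustration.

```python
FOUR_GB = 1 << 32


def above_4gb(start, end):
    """Return the part of the inclusive range [start, end] that lies at or
    above 4GB, or None if the whole range fits in 32 bits.
    Hypothetical helper for illustration."""
    if end < FOUR_GB:
        return None
    return (max(start, FOUR_GB), end)


amd_bus_window = (0x80000000, 0xfcffffffff)  # as reported by amd_bus.c
acpi_window = (0x80000000, 0xfedfffff)       # as reported by ACPI

for name, (start, end) in (("amd_bus.c", amd_bus_window), ("ACPI", acpi_window)):
    extra = above_4gb(start, end)
    if extra is None:
        print(f"{name}: window ends below 4GB")
    else:
        print(f"{name}: [mem {extra[0]:#x}-{extra[1]:#x}] lies above 4GB")
```

Only the amd_bus.c window extends past the 4GB boundary, which is the part of the address space the comment is dubious about.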
(In reply to comment #5)
> > It seems like a good idea to reserve 0xff000000..0xffffffff for BIOS as a
> > general policy; 0xfexxxxxx tends to be used by things like APIC.
>
> You're certainly right that APICs and other things often live in that area.
> The question is how we decide on the right boundary. If we choose a boundary
> higher than Windows, there will be cases where Windows works but we don't.
> If we choose a boundary lower than Windows, we're "safer," but we may throw
> away perfectly usable space. I chose 0xffff0000 for the debug patch because
> that's what Windows used (in one test of Windows 7 on qemu, YMMV).
>
> Rafael, can you try a boot with "pci=use_crs"?
That helps.
> Your box is old enough that we don't turn it on automatically, and amd_bus.c
> thinks the host bridge window leading to bus 00 is
> [mem 0x80000000-0xfcffffffff] (note that the
> end of that window has *ten* digits, not eight). I'm dubious about
> relying on that region above 4GB, especially since ACPI is telling us
> the window is only [mem 0x80000000-0xfedfffff].
Moreover, the box was shipped with 32-bit Windows XP and I run 64-bit Linux on it.
> There is also a window [mem 0xfee01000-0xffffffff] that goes right to the 4GB
> boundary, but in this case the BIOS does reserve [mem 0xfff00000-0xffffffff]
> via a motherboard device, so I don't think the debug patch will make a
> difference.
Right.

(In reply to comment #6)
> (In reply to comment #5)
> > > It seems like a good idea to reserve 0xff000000..0xffffffff for BIOS as a
> > > general policy; 0xfexxxxxx tends to be used by things like APIC.
> >
> > You're certainly right that APICs and other things often live in that area.
> > The question is how we decide on the right boundary. If we choose a boundary
> > higher than Windows, there will be cases where Windows works but we don't.
> > If we choose a boundary lower than Windows, we're "safer," but we may throw
> > away perfectly usable space. I chose 0xffff0000 for the debug patch because
> > that's what Windows used (in one test of Windows 7 on qemu, YMMV).
> >
> > Rafael, can you try a boot with "pci=use_crs"?
>
> That helps.
No, it doesn't. Sorry, I used a wrong kernel for that test.

Huh, I really expected that to be the problem. I'll puzzle over it some more.
If you still have the ability to boot Windows, an Everest report
(http://lavalys.com) might have a hint.
If it's possible to get any kind of serial or netconsole log of the failed
boots (especially with "pci=use_crs", since I just don't see how we can use
the top of the region amd_bus.c reported), that would help, too.

Windows is not installed on this box any more and there's no serial port in it.
I'll try to play with netconsole, but I'm afraid it crashes too early for that
to work.

Handled-By : Bjorn Helgaas <bjorn.helgaas@hp.com>

(In reply to comment #9)
> Windows is not installed on this box any more and there's no serial port in
> it.
>
> I'll try to play with netconsole, but I'm afraid it crashes too early for
> that to work.
I've tried with serial console on USB<-->serial converter, but it also crashes
too early. Also tried to take some pictures, but kernel messages are moving
too fast ;)

Maybe the boot_delay kernel parameter will allow you to catch it:
Documentation/kernel-parameters.txt:
    boot_delay= Milliseconds to delay each printk during boot.
                Values larger than 10 seconds (10000) are changed to
                no delay (0).

(In reply to comment #12)
> Maybe the boot_delay kernel parameter will allow you to catch it:
>
> Documentation/kernel-parameters.txt:
>     boot_delay= Milliseconds to delay each printk during boot.
>                 Values larger than 10 seconds (10000) are changed to
>                 no delay (0).
Yes, I've also hit on this idea after writing previous post, I'll post taken
photos in a few hours.

Here is my "dmesg" http://ftp.retis.net.pl/dmesg.jpeg
I'll try to make dump using firewire if I find my firewire cable.
Created attachment 38322 [details]
failing log
(In reply to comment #14)
> Here is my "dmesg" http://ftp.retis.net.pl/dmesg.jpeg
Wow, nice. That is the coolest log I've ever seen :-)
I did find an nx6325 and a docking station, so I was able to collect this
serial log easily.

Created attachment 38332 [details]
nx6325 Everest report
I also have Windows XP on this system. Here's an Everest report showing
Windows resource usage.
Differences I see:
0000:01:05.0 Linux assigns ROM at [mem 0xd43e0000-0xd43fffff pref]
0000:00:12.0 WinXP moved BAR 5 to [mem 0xffeffe00-0xffefffff]
0000:00:12.0 Linux assigned ROM at [mem 0xffe80000-0xffefffff pref]
0000:00:14.2 Linux moved BAR 0 to [mem 0xfed7c000-0xfed7ffff 64bit]
0000:00:14.4 Linux assigned window at [mem 0xf8000000-0xfbffffff pref]
The Linux option ROM assignments are typical; Windows doesn't assign resources
for ROMs, but Linux does. The 12.0 and 14.2 changes are because of this
collision:
pci 0000:00:14.2: address space collision: [mem 0xd4408000-0xd440bfff 64bit] conflicts with 0000:00:12.0 [mem 0xd4409000-0xd44091ff]
WinXP moved 12.0 and Linux moved 14.2; both resolve the collision.
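The collision and both resolutions can be verified with a simple inclusive-range overlap test. This Python sketch uses the addresses from the log line and the list above; the `overlaps` helper is a hypothetical name, and it assumes the colliding 12.0 resource is the BAR 5 that WinXP moved.

```python
def overlaps(a, b):
    """True if the inclusive ranges a=(start, end) and b=(start, end) share
    at least one address. Hypothetical helper for illustration."""
    return a[0] <= b[1] and b[0] <= a[1]


# BIOS assignments that collide (from the kernel log line).
bar_14_2 = (0xd4408000, 0xd440bfff)   # 0000:00:14.2 BAR 0
bar_12_0 = (0xd4409000, 0xd44091ff)   # 0000:00:12.0

# Where each OS moved one of the two resources.
linux_14_2 = (0xfed7c000, 0xfed7ffff)  # Linux moved 14.2 BAR 0
winxp_12_0 = (0xffeffe00, 0xffefffff)  # WinXP moved 12.0 BAR 5

print(overlaps(bar_14_2, bar_12_0))    # original BIOS layout
print(overlaps(linux_14_2, bar_12_0))  # after the Linux fix
print(overlaps(bar_14_2, winxp_12_0))  # after the WinXP fix
```

Only the first check reports an overlap, so moving either device is enough to resolve the conflict.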
I used a test patch to keep Linux from assigning the mem pref window to
the 14.4 subtractive decode bridge, and it avoided the problem, so my
working theory is that there's something in the [mem 0xf8000000-0xfbffffff]
region we should be avoiding.
However, the BIOS doesn't seem to tell us what it is?

Well, the E820 memory map doesn't mention anything in the
[mem 0xf8000000-0xfbffffff] range, and I don't see any ACPI devices there
either.
I can't find it right now, but ISTR a recent problem where we placed a device
on top of a uvesafb frame buffer, so I wonder whether we're supposed to use
the VESA BIOS extensions in addition to the E820 map and ACPI namespace. But
I don't know anything about VESA BIOS; it's just on my list to look into.

Created attachment 39272 [details]
patch to avoid allocating PNP resources
I did a lot of experimentation with this, and as far as I can tell, this is
just a BIOS bug -- the BIOS forgot to tell us about a couple devices in the
address space.
The VESA framebuffer issue I was thinking of is bug 22132. In that case, the
framebuffer was marked "reserved" in E820, but we put another PCI device there
anyway because we don't do a very good job avoiding those reserved areas. In
any case, that's not the issue here.
Here's a quirk to avoid the hazards in the nx6325 address space. It works for
me, but it'd be good if you could try it, too, because I only tested it as far
as booting to the point of mounting the root filesystem.

Created attachment 39302 [details]
debugging patch
Here's the patch I used to explore the address space. I used boot
arguments like "pci=cbmemsize=1M pci_top=0xf83fffff" to force the
00:14.4 bridge prefetchable memory window to be allocated at
various sizes and addresses.
Here are the results (note that we allocate 64M for CardBus bridge
windows by default, so the 64M allocation is the default behavior
that caused Rafael's hang):
64M [mem 0xf8000000-0xfbffffff] HANG
32M [mem 0xf8000000-0xf9ffffff] HANG
16M [mem 0xf8000000-0xf8ffffff] HANG
8M [mem 0xf8000000-0xf87fffff] HANG
4M [mem 0xf8000000-0xf83fffff] HANG
2M [mem 0xf8000000-0xf81fffff] OK
2M [mem 0xf8200000-0xf83fffff] HANG
1M [mem 0xf8200000-0xf82fffff] OK
1M [mem 0xf8300000-0xf83fffff] HANG
4M [mem 0xf8400000-0xf87fffff] HANG
2M [mem 0xf8400000-0xf85fffff] HANG
1M [mem 0xf8400000-0xf84fffff] OK
1M [mem 0xf8500000-0xf85fffff] HANG
2M [mem 0xf8600000-0xf87fffff] OK
8M [mem 0xf8800000-0xf8ffffff] OK
16M [mem 0xf9000000-0xf9ffffff] HANG
8M [mem 0xf9000000-0xf97fffff] HANG
4M [mem 0xf9000000-0xf93fffff] HANG
2M [mem 0xf9000000-0xf91fffff] HANG
1M [mem 0xf9000000-0xf90fffff] OK
1M [mem 0xf9100000-0xf91fffff] HANG
2M [mem 0xf9200000-0xf93fffff] OK
4M [mem 0xf9400000-0xf97fffff] OK
8M [mem 0xf9800000-0xf9ffffff] OK
32M [mem 0xfa000000-0xfbffffff] OK
Based on the above, I think the nx6325 has the following unreported
areas in use:
1M [mem 0xf8300000-0xf83fffff] HANG
1M [mem 0xf8500000-0xf85fffff] HANG
1M [mem 0xf9100000-0xf91fffff] HANG
My quirk in the previous patch combined the first two areas because
I hadn't been quite so systematic when I wrote the patch. I should
probably have just used these three areas as-is.
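The deduction from the results table can be reproduced programmatically. The Python sketch below takes the ranges and outcomes listed above, picks the HANG ranges that contain no smaller HANG range, and then checks that every result is explained by those areas. The function names are hypothetical, and the approach assumes each hang is caused by overlapping at least one fixed hazardous area.

```python
# Inclusive (start, end, outcome) tuples copied from the results table.
RESULTS = [
    (0xf8000000, 0xfbffffff, "HANG"), (0xf8000000, 0xf9ffffff, "HANG"),
    (0xf8000000, 0xf8ffffff, "HANG"), (0xf8000000, 0xf87fffff, "HANG"),
    (0xf8000000, 0xf83fffff, "HANG"), (0xf8000000, 0xf81fffff, "OK"),
    (0xf8200000, 0xf83fffff, "HANG"), (0xf8200000, 0xf82fffff, "OK"),
    (0xf8300000, 0xf83fffff, "HANG"), (0xf8400000, 0xf87fffff, "HANG"),
    (0xf8400000, 0xf85fffff, "HANG"), (0xf8400000, 0xf84fffff, "OK"),
    (0xf8500000, 0xf85fffff, "HANG"), (0xf8600000, 0xf87fffff, "OK"),
    (0xf8800000, 0xf8ffffff, "OK"),   (0xf9000000, 0xf9ffffff, "HANG"),
    (0xf9000000, 0xf97fffff, "HANG"), (0xf9000000, 0xf93fffff, "HANG"),
    (0xf9000000, 0xf91fffff, "HANG"), (0xf9000000, 0xf90fffff, "OK"),
    (0xf9100000, 0xf91fffff, "HANG"), (0xf9200000, 0xf93fffff, "OK"),
    (0xf9400000, 0xf97fffff, "OK"),   (0xf9800000, 0xf9ffffff, "OK"),
    (0xfa000000, 0xfbffffff, "OK"),
]


def contains(outer, inner):
    """True if the inclusive range `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overlaps(a, b):
    """True if two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def minimal_hangs(results):
    """Return HANG ranges that contain no other (smaller) HANG range."""
    hangs = [(s, e) for s, e, status in results if status == "HANG"]
    return sorted(h for h in hangs
                  if not any(o != h and contains(h, o) for o in hangs))


def consistent(results, hazards):
    """True if a window hangs exactly when it overlaps some hazard."""
    return all(
        any(overlaps((s, e), h) for h in hazards) == (status == "HANG")
        for s, e, status in results
    )


hazards = minimal_hangs(RESULTS)
for start, end in hazards:
    print(f"[mem {start:#010x}-{end:#010x}]")
print("explains all results:", consistent(RESULTS, hazards))
```

Run against the table, this yields the same three 1M areas listed above, and every OK and HANG outcome is consistent with them.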
Created attachment 39312 [details]
v2 patch to avoid allocating PNP resources
Updated patch to reserve the three specific areas mentioned in comment 20.

Created attachment 40212 [details]
patch to avoid opening windows on subtractive decode bridges
This is a different approach. This system has a subtractive decode
bridge leading to a CardBus bridge:
pci 0000:00:14.4: PCI bridge to [bus 02-03] (subtractive decode)
pci 0000:02:04.0: CardBus bridge to [bus 03-06]
Windows leaves the subtractive decode bridge alone and programs a
64MB window on the CardBus bridge. This 64MB window relies on
subtractive decode.
Linux programs two 64MB windows on the CardBus bridge, *and* opens
a window on the 00:14.4 bridge, so it positively decodes at least part
of the CardBus space.
This patch makes Linux use the BIOS setup of subtractive decode bridges,
without assigning new explicit windows to them.
My HP nx6325 boots correctly with this patch applied.