Bug 11054 - Boot hangs in x86_64 OKI ANIMA 3300 laptop
Status: CLOSED OBSOLETE
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 high
Assigned To: drivers_pci@kernel-bugs.osdl.org
Reported: 2008-07-08 02:12 UTC by Juan Jesús García de Soria
Modified: 2012-05-22 12:45 UTC (History)
Kernel Version: 2.6.25
Tree: Mainline
Regression: Yes


Attachments
Detailed lspci output (43.23 KB, text/plain)
2008-07-18 10:38 UTC, Juan Jesús García de Soria
Disassembled DSDT contents (128.45 KB, text/plain)
2008-11-21 00:16 UTC, Juan Jesús García de Soria

Description Juan Jesús García de Soria 2008-07-08 02:12:41 UTC
Latest working kernel version: commit 5eb7f9fa847b8ab6e4864bfb8cb45f370844a47c (somewhere after 2.6.25-rc7)

Earliest failing kernel version: commit 12c22d6ef299ccf0955e5756eb57d90d7577ac68 (somewhere after 2.6.25-rc7)

Distribution: Gentoo Linux

Hardware Environment: x86_64 OKI ANIMA 3300 laptop

Software Environment: Kernel.

Problem Description: The boot hangs after printing that MSI signaling has been enabled on a pair of PCI bridges. Sometimes the console gets garbled when the hang happens, sometimes not.

I bisected the problem to this specific commit:

commit 12c22d6ef299ccf0955e5756eb57d90d7577ac68
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Mar 26 11:22:40 2008 -0700

    Revert "PCI: remove transparent bridge sizing"

    This reverts commit 8fa5913d54f3b1e09948e6a0db34da887e05ff1f, which
    caused various interesting problems for people, including wrong resource
    allocations.  See for example bugzilla entry "2.6.25-rc2: ohci1394
    problem (MMIO broken)" at

        http://bugzilla.kernel.org/show_bug.cgi?id=10080

[...]


I applied a reversion of that commit onto the mainline head (2.6.26-rc9) and it now boots flawlessly (rt61pci still doesn't work reliably, but I found the system stable when using a PCMCIA ath5k card).


Steps to reproduce: Boot a bad version and it hangs; boot a good version and it works.
Comment 1 Juan Jesús García de Soria 2008-07-08 02:15:15 UTC
I don't know about the other interesting problems caused by the commit reverted
in the problematic commit. I know I need it un-reverted to boot my machine.
Perhaps some kind of quirk is needed.
Comment 2 Juan Jesús García de Soria 2008-07-18 10:38:34 UTC
Created attachment 16878 [details]
Detailed lspci output
Comment 3 Juan Jesús García de Soria 2008-07-18 10:40:37 UTC
I've attached the detailed output of lspci on my machine running some 2.6.24 kernel.

I added debug messages to see which was the transparent bridge that required the patch, and this is the specific device info:

00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2) (prog-if 01 [Subtractive decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Bus: primary=00, secondary=04, subordinate=08, sec-latency=64
        I/O behind bridge: 0000f000-00000fff
        Memory behind bridge: c3000000-c30fffff
        Prefetchable memory behind bridge: fff00000-000fffff
        Secondary status: 66MHz- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr+ DiscTmrStat- DiscTmrSERREn-
        Capabilities: [b8] Subsystem: Gammagraphx, Inc. Device 0000
        Capabilities: [8c] HyperTransport: MSI Mapping Enable- Fixed-
                Mapping Address Base: 00000000fee00000
00: de 10 6f 02 07 01 b0 00 a2 01 04 06 00 00 81 00
10: 00 00 00 00 00 00 00 00 00 04 08 40 f0 00 80 02
20: 00 c3 00 c3 f0 ff 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 b8 00 00 00 00 00 00 00 ff 00 04 02
40: 00 00 03 00 01 00 02 00 05 00 00 00 00 00 44 00
50: 00 00 fe 3f 00 00 00 00 ff 1f ff 1f 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00 a8
90: 00 00 e0 fe 00 00 00 00 00 00 00 00 00 00 00 00
a0: 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 ff ff 00 00 0d 8c 00 00 00 00 00 00
c0: 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


Comment 4 Bjorn Helgaas 2008-11-17 11:23:16 UTC
From Juan Jesus's email (http://lkml.org/lkml/2008/11/12/60):

After reading a little bit on how PCI and PCI-to-PCI bridges work, and
how they're handled in linux (http://tldp.org/LDP/tlk/dd/pci.html), I
now know that ranges in the bridge (either I/O, MMIO or prefetchable
mem) are simply disabled when start > end, and that's the original
configuration that the BIOS enforces when the bridge is not sized by
Linux.
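
To make the base/limit rule concrete, here is a minimal Python sketch that
decodes the three forwarding windows from the raw header bytes in the lspci
dump of comment 3 (offsets 0x1c-0x27). The helper names are hypothetical and
not kernel APIs, and only 16-bit I/O and 32-bit memory decoding are assumed.

```python
# Sketch only: decode PCI-to-PCI bridge windows from type-1 header bytes.
# Function names are illustrative assumptions, not kernel functions.

def io_window(io_base, io_limit):
    """I/O base/limit bytes (offsets 0x1c/0x1d), 4 KiB granularity."""
    base = (io_base & 0xF0) << 8
    limit = ((io_limit & 0xF0) << 8) | 0x0FFF
    return base, limit

def mem_window(mem_base, mem_limit):
    """16-bit memory base/limit registers, 1 MiB granularity."""
    base = (mem_base & 0xFFF0) << 16
    limit = ((mem_limit & 0xFFF0) << 16) | 0xFFFFF
    return base, limit

def enabled(window):
    """A window forwards nothing when its base lies above its limit."""
    base, limit = window
    return base <= limit

# Values read (little-endian) from the dump of device 00:10.0:
#   0x1c: f0 00        -> I/O base 0xf0, I/O limit 0x00
#   0x20: 00 c3 00 c3  -> memory base 0xc300, memory limit 0xc300
#   0x24: f0 ff 00 00  -> prefetchable base 0xfff0, limit 0x0000
io = io_window(0xF0, 0x00)
mem = mem_window(0xC300, 0xC300)
pref = mem_window(0xFFF0, 0x0000)

for name, win in (("I/O", io), ("memory", mem), ("prefetchable", pref)):
    state = "enabled" if enabled(win) else "disabled"
    print(f"{name}: {win[0]:#010x}-{win[1]:#010x} ({state})")
```

The decoded values agree with the "behind bridge" lines lspci printed: only
the memory window is active, while the I/O and prefetchable windows have
base above limit and are therefore off.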

After inserting printk()'s, I see that the hang actually happens while
the positive decoding of the I/O range in the bridge is being activated
in pci_setup_bridge(), sometime in between the writes to the I/O
base/limit registers of the bridge; I don't remember exactly which was
the last pci_write_config_dword() that allowed the next printk() to
succeed. I'll look at it tonight again, but I suspect that the final
enabling write (the one that updates the PCI_IO_BASE_UPPER16 register
with its final value) was the one hanging the machine.

The I/O range being activated was the one in the range 0x1000-0x1fff,
apparently correctly sized to accommodate the two I/O ranges
(0x1000-0x10ff, 0x1400-0x14ff) assigned to the CardBus bridge on the
secondary bus.
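
The arithmetic behind that window can be sketched as follows. This is a
simplified Python illustration, assuming only that bridge I/O windows are
4 KiB granular; the function name is hypothetical and the kernel's real
sizing code in drivers/pci/setup-bus.c does considerably more.

```python
# Sketch only: smallest 4 KiB-granular bridge I/O window that covers
# every I/O range assigned to devices on the secondary bus.

IO_WINDOW_GRANULE = 0x1000  # bridge I/O base/limit have 4 KiB granularity

def bridge_io_window(child_ranges):
    """Return (base, limit) of the window covering all (start, end) ranges."""
    start = min(lo for lo, _ in child_ranges)
    end = max(hi for _, hi in child_ranges)
    base = start & ~(IO_WINDOW_GRANULE - 1)   # round down to 4 KiB
    limit = end | (IO_WINDOW_GRANULE - 1)     # round up to end of 4 KiB block
    return base, limit

# The two ranges assigned to the CardBus bridge in this report.
cardbus = [(0x1000, 0x10FF), (0x1400, 0x14FF)]
base, limit = bridge_io_window(cardbus)
print(f"{base:#06x}-{limit:#06x}")  # prints 0x1000-0x1fff
```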

One theory is that my system has something actually mapped to that I/O
range in the root PCI bus. When only subtractive decoding is in place
(the I/O range isn't activated), access to the secondary bus behind the
PCI-to-PCI bridge is done when the transaction isn't claimed by any
device in the root bus, after what the PCI docs describe as a 4-cycle
timeout. When the I/O range is activated, that range is positively
decoded by the bridge, which tries to claim the transaction before the
timeout. Perhaps two devices (the bridge and the unknown device on the
root bus) conflict when claiming the same transaction?
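
A toy Python model of this first theory, purely for illustration: the
"unknown root-bus device" and its range are hypothetical (nothing in this
report identifies such a device), and real PCI arbitration is electrical,
not a table lookup.

```python
# Toy model only: who claims an I/O transaction on the primary bus.
# Positive decoders claim addresses inside their range; the subtractive
# bridge picks up whatever nobody else claimed.

BRIDGE = "bridge 00:10.0"

def route(addr, positive_decoders, subtractive_agent=BRIDGE):
    """Describe which agent(s) would claim an access to addr."""
    hits = sorted(name for name, (lo, hi) in positive_decoders.items()
                  if lo <= addr <= hi)
    if not hits:
        return f"claimed subtractively by {subtractive_agent}"
    if len(hits) == 1:
        return f"claimed by {hits[0]}"
    return "conflict: " + ", ".join(hits)

# Hypothetical hidden device decoding the suspect range on the root bus.
hidden = {"unknown root-bus device": (0x1000, 0x1FFF)}

# Before sizing: the bridge has no active I/O window (subtractive only).
before = dict(hidden)
# After sizing: the bridge positively decodes 0x1000-0x1fff as well.
after = {**hidden, BRIDGE: (0x1000, 0x1FFF)}

print(route(0x1000, before))
print(route(0x1000, after))
print(route(0x4000, after))
```

In this model an access to 0x1000 has a single claimant before the window
is programmed and two claimants afterwards, which is the kind of conflict
the theory describes.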

Another possibility could be that activating the I/O range disables the
negative decoding in the secondary-to-primary sense of the bridge for
that I/O range. Perhaps some device behind the bridge depends on being
able to forward transactions to the primary bus on that I/O range, but
it's disallowed after the range is configured. For me this seems rather
unlikely, because of the nature of the devices behind the bridge.

I'll look at it more closely again, and I will test whether commenting
out the I/O range sizing (leaving the other ranges to be sized) is
enough to allow the system to run. If so, is there any way to use a
system-specific quirk in order to remove the PCI-to-PCI bridge I/O range
from being sized/activated?


Best regards,

   Juan Jesus.
Comment 5 Juan Jesús García de Soria 2008-11-21 00:16:36 UTC
Created attachment 18961 [details]
Disassembled DSDT contents
Comment 6 Juan Jesús García de Soria 2008-11-21 00:23:51 UTC
From my e-mail: http://permalink.gmane.org/gmane.linux.kernel.pci/1991

From: GARCIA DE SORIA LUCENA, JUAN JESUS <juanj.g_soria <at> grupobbva.com>
Subject: PCI bus conflict hang: how to avoid the allocation of an I/O range.
Newsgroups: gmane.linux.kernel.pci
Date: 2008-11-17 14:05:22 GMT

After commit 

commit 12c22d6ef299ccf0955e5756eb57d90d7577ac68
Author: Linus Torvalds <torvalds <at> linux-foundation.org>
Date:   Wed Mar 26 11:22:40 2008 -0700

    Revert "PCI: remove transparent bridge sizing"

My laptop began hanging when booting, and I filed
http://bugzilla.kernel.org/show_bug.cgi?id=11054.

I had to disable the sizing of transparent bridges until, after a
conversation in the kernel mailing list, I think I've found the root of
the problem.

A CardBus bridge is on the secondary bus of a transparent bridge. By
default it gets assigned two I/O ranges: 0x1000-0x10ff and
0x1400-0x14ff, which is translated to the transparent bridge positively
forwarding the range 0x1000-0x1fff. There are no more I/O resources
allocated behind the transparent PCI to PCI bridge.

I suspect there's "something" (some device unknown by the kernel)
decoding I/O accesses in the primary PCI bus, in the 0x1000-0x1fff
range. This device must be causing bus conflicts with the range
allocated to the PCI to PCI bridge. Not sizing the transparent bridge
wouldn't configure any I/O range in it for positive decoding, thus
avoiding the conflict.

The system hangs when the bridge register for the IO base/limit (lower
16 bits, since it's 16 bit only) gets written to with the value
0x????0101.

If I force the range to be allocated to be above 0x4000, everything
works flawlessly. I've been able to do so by two means:

1. Changing the definition of PCIBIOS_MIN_IO in
arch/x86/include/asm/pci.h from 0x1000 to 0x4000. This forces the
CardBus ranges to be allocated above the problematic area, making the
bridge forward 0x4000-0x4fff I/O addresses. BTW, PCIBIOS_MIN_CARDBUS_IO
is defined to be 0x4000 in the same header, but it's only used in
drivers/pcmcia/yenta_socket.c, not apparently when assigning resources
to the CardBus bridge in the functions pci_setup_cardbus() or
pci_bus_size_cardbus() in drivers/pci/setup-bus.c. I suppose that making
the CardBus bridge I/O range allocation respect the defined
PCIBIOS_MIN_CARDBUS_IO limit would fix my issue, but I don't know
whether that's "the right fix" (TM).

2. I've managed to boot a stock Ubuntu Intrepid Ibex x86_64 kernel by
supplying the parameter "pci=cbiosize=8k" to the grub command line. It
doesn't work with a smaller size. With 8k the CardBus bridge I/O ranges
are big enough that they have to be allocated above the "problem area"
because of natural alignment restrictions.
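
Why both workarounds move the allocation out of the problem area can be
sketched with a little alignment arithmetic. This Python fragment is a
simplification that assumes the allocator picks the lowest naturally
aligned address at or above the minimum; the helper names are hypothetical
and the kernel's real resource allocator is more involved.

```python
# Sketch only: lowest naturally aligned start address for a CardBus I/O
# window, and whether it falls inside the suspect range 0x1000-0x1fff.

PROBLEM_AREA = (0x1000, 0x1FFF)

def align_up(addr, alignment):
    """Round addr up to a multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

def lowest_start(min_io, size):
    """Lowest start >= min_io for a block aligned to its own size."""
    return align_up(min_io, size)

def hits_problem_area(start, size):
    end = start + size - 1
    return start <= PROBLEM_AREA[1] and end >= PROBLEM_AREA[0]

cases = {
    "default: min 0x1000, 256-byte window": (0x1000, 0x100),
    "workaround 1: min raised to 0x4000": (0x4000, 0x100),
    "workaround 2: pci=cbiosize=8k": (0x1000, 0x2000),
    "cbiosize of 4k (too small)": (0x1000, 0x1000),
}

for label, (min_io, size) in cases.items():
    start = lowest_start(min_io, size)
    verdict = "overlaps" if hits_problem_area(start, size) else "avoids"
    print(f"{label}: starts at {start:#06x}, {verdict} the problem area")
```

Under these assumptions an 8 KiB window cannot start below 0x2000, while a
4 KiB window can still land at 0x1000, which matches the observation that
smaller sizes do not help.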

So far I've got what I really wanted (to be able to use my laptop with
modern distributions without having to recompile each kernel version),
although to do so I'm depending on the fact that a kernel parameter
intended for a different use will alter I/O range alignment one PCI to
PCI bridge away.

I write to ask whether the definition of PCIBIOS_MIN_CARDBUS_IO was
indeed intended to affect my case too (in which case what is happening
is the result of a kernel bug that should be fixed) or not. And if it's
not a bug, I'd like to know if there exists any reliable way to
pre-allocate a given I/O range (0x1000-0x1fff in my case) so that it
won't be assigned to PCI busses/devices (without the need to recompile
every kernel version).
Comment 7 Alan 2009-06-29 11:16:52 UTC
Moving the full version info here to unmess the formatting of the lists

2.6.25 + 12c22d6ef299ccf0955e5756eb57d90d7577ac68 up to mainline
Comment 8 Alan 2012-05-22 12:45:11 UTC
Closing as obsolete. Please re-open and update the kernel version to a recent one if this is still seen.
