Bug 6859

Summary: boot hang unless "noapic" - invalid _PRT entries - MSI MS-6390-L
Product: ACPI
Reporter: Eugenia Loli-Queru (eloli)
Component: BIOS
Assignee: Len Brown (lenb)
Status: CLOSED CODE_FIX
Severity: blocking
CC: acpi-bugzilla, bunk, chicks, mingo, Robert.Moore
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.4 and above
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: /proc/interrupts output from 2.6.19-1.2895.fc6 kernel
dmesg output from 2.6.19-1.2895.fc6 kernel
patch to work-around garbled _PRT entry vs 2.6.18

Description Eugenia Loli-Queru 2006-07-18 10:39:49 UTC
Most recent kernel where this bug did not occur:
2.6.17
Distribution:
SuSE 10.1
Hardware Environment:
Athlon XP 1.6 GHz, MSI motherboard (MS-6390-L v1.0) which features the VIA KM266
chipset. The motherboard features an onboard RTL8139 network card, AC97/8233A
VIA sound and south bridge, and also an integrated S3 Savage4-PRO+ 266DDR. 512
MB RAM.
Software Environment:
SuSE/Ubuntu/Arch Linux. I think that only Fedora works, but not sure, it has
been a while since I tried Fedora in this machine.
Problem Description:
The kernel will boot only to safe mode and won't mount any partitions (suse).
With other distros it will boot, but no pci hardware will work (ubuntu, Arch).
All these problems go away only if you pass the "nolapic" kernel
parameter. But I believe that the user experience is not good if the user has to
pass kernel parameters, IMHO this is something that must be fixed.

Steps to reproduce:
Just try to install a recent distro on this machine.

boot message of the suse dvd with apic=debug
https://bugzilla.novell.com/attachment.cgi?id=92035&action=view

Dmesg of suse
https://bugzilla.novell.com/attachment.cgi?id=92036&action=view

https://bugzilla.novell.com/attachment.cgi?id=93759&action=view
acpi dump logs tarball: acpi.txt is from acpidmp, and acpidump.txt is from acpidump

Novell won't look at the problem, so you are my only hope. This is probably a
BIOS bug, but the point is that Windows works (including Vista), BeOS and BSD
work, and even kernel 2.4 works perfectly. So I think that this problem should
be fixed as kernel 2.6 is the only one that misbehaves, and also because older
hardware is what to expect in most Enterprise businesses today. Thanks.
Comment 1 Ingo Molnar 2006-11-13 08:42:35 UTC
you say:

> Most recent kernel where this bug did not occur: 2.6.17

so the upstream kernel is fine?

In particular, could you test 2.6.19-rc5-mm1? (that has a couple of APIC fixes)
Comment 2 Eugenia Loli-Queru 2006-11-13 12:48:18 UTC
I am sorry, but I guess I misread when I said that it did not occur on 2.6.17.
The bug did happen with 2.6.17. I don't know if it still happens or not, I will
have to wait for a distro to try that has these versions of the kernel in it, I
won't manually build one...
Comment 3 Adrian Bunk 2006-11-13 12:58:19 UTC
This Bugzilla is only for bugs in unmodified kernels from ftp.kernel.org.

If you are only using distribution kernels and not ftp.kernel.org kernels,
please ask your distribution for support.

The rationale for this is:
- the bug might be caused by a patch in the distribution kernel
- if you are not able to test patches, it's much harder to find a solution
Comment 4 Eugenia Loli-Queru 2006-11-13 13:02:27 UTC
I think the bug is part of the mainstream kernel, because it happens with all
the distros I tried (suse, ubuntu, arch, fedora).
Comment 5 Adrian Bunk 2006-11-13 13:16:12 UTC
This still leaves my second point that you will not be able to test patches.

If someone thinks he has found the solution for your problem he will create a
patch - and how can you ever verify whether such a patch really fixes your problem?
Comment 6 Christopher Hicks 2007-01-17 11:00:50 UTC
http://wiki.fini.net/bin/view/Support/LinuxLosingPCIonMS6390 is a page I created
discussing this issue.  I am happy to test patches.
Comment 7 Len Brown 2007-01-19 10:14:11 UTC
This is a uni-processor board with a LAPIC/IOAPIC.
In the past, the distros supported these boards with
special uni-processor kernels that disabled the IOAPIC,
but recently they've cut over to uni-processor kernels
that support LAPIC/IOAPIC, and finally SMP kernels by default.

The issue is likely the IOAPIC, not the LAPIC.
Please verify that "noapic" is a sufficient workaround
rather than "nolapic", and attach the dmesg and /proc/interrupts
from a "noapic" boot if it works.

Skip FC4, FC5 and SL10.1.

Try FC6 or OpenSuSE 10.2, which will install directly to 2.6.18.
(Indeed, FC6 will then net update to 2.6.19 -- and I'm told it
is very close to upstream)

If you can get a recent upstream kernel.org kernel to fail
then go ahead and re-open this bug report.  The first thing
I'll be looking for is a dump of the interrupts from Christopher's
W2K boot to show that Windows can handle the IOAPIC on this board.

Comment 8 Christopher Hicks 2007-01-19 10:41:57 UTC
Created attachment 10124 [details]
/proc/interrupts output from  2.6.19-1.2895.fc6 kernel
Comment 9 Christopher Hicks 2007-01-19 10:42:25 UTC
Created attachment 10125 [details]
dmesg output from  2.6.19-1.2895.fc6 kernel
Comment 10 Christopher Hicks 2007-01-19 10:44:17 UTC
nolapic was superfluous.  noapic was sufficient.  I've attached the Linux output
requested.

I don't have permission, but the summary should be updated to say noapic instead
of nolapic.
Comment 11 Len Brown 2007-01-19 15:13:21 UTC
Thanks for verifying that "noapic" is a sufficient workaround.
I see you are now running 2.6.19-1.2895.fc6 --
I assume it also fails to boot if you drop off the "noapic" parameter?

Another workaround might be to use "acpi=noirq" -- as this box
has an MP table.  But this would be just another workaround...

It turns out that Thomas Renninger already found the root cause
of this failure:
https://bugzilla.novell.com/show_bug.cgi?id=179024

The _PRT entries are garbled, just like they were in bug 1164:

                Package (0x04)
                {
                    0x0012FFFF,
                    0x00,
                    0x00,
                    \_SB.PCI0.LNKA
                },

                Package (0x04)
                {
                    0x0012FFFF,
                    0x01,
                    0x00,
                    \_SB.PCI0.LNKB
                },

                Package (0x04)
                {
                    0x0012FFFF,
                    0x02,
                    0x00,
                    \_SB.PCI0.LNKC
                },

                Package (0x04)
                {
                    0x0012FFFF,
                    0x03,
                    0x00,
                    \_SB.PCI0.LNKD

The Links should be the 3rd element of each package, not the 4th.
At the time we got the BIOS fixed, but it turns out we should
have implemented a kernel workaround for this BIOS bug then
because Windows continues to allow systems to ship with this bug.
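
For reference, ACPI defines each _PRT entry as a four-element package: {Address, Pin, Source, SourceIndex}. In the dump above, the link name sits in the SourceIndex slot and the integer 0x00 sits in the Source slot. The following is a minimal sketch of how such an entry can be recognized. The types and names here (struct obj, prt_entry_looks_reversed) are hypothetical and simplified for illustration; they are not the kernel's or ACPICA's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified object model; not the ACPICA operand object. */
enum obj_type { OBJ_INTEGER, OBJ_NAME_REF };

struct obj {
    enum obj_type type;
    unsigned long value;  /* meaningful for OBJ_INTEGER */
    const char *path;     /* meaningful for OBJ_NAME_REF */
};

/* Element positions inside one _PRT package. */
enum {
    PRT_ADDRESS,
    PRT_PIN,
    PRT_SOURCE,
    PRT_SOURCE_INDEX,
    PRT_NUM_ELEMENTS
};

/*
 * Return 1 when the entry looks reversed: element 3 (SourceIndex) is
 * missing or is not an integer. Return 0 otherwise.
 */
static int prt_entry_looks_reversed(struct obj *const entry[PRT_NUM_ELEMENTS])
{
    const struct obj *source_index;

    assert(entry != NULL);
    source_index = entry[PRT_SOURCE_INDEX];
    return source_index == NULL || source_index->type != OBJ_INTEGER;
}
```

Called on an entry shaped like the ones in the dump (integer in slot 2, link reference in slot 3), this returns 1; on a well-formed entry with the link in slot 2 and an integer in slot 3, it returns 0.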
Comment 12 Len Brown 2007-01-19 19:18:55 UTC
Created attachment 10126 [details]
patch to work-around garbled _PRT entry vs 2.6.18

Please test this patch, originally written by Shaohua Li,
for http://bugzilla.kernel.org/show_bug.cgi?id=1164#c39
and forward-ported here.  It should apply cleanly to
2.6.18 through 2.6.20-rc5.  For 2.6.16 and 2.6.17 it should
apply with a 3-line offset.

For a boot with no kernel parameters,
please attach the complete dmesg and paste the /proc/interrupts.

To disable the workaround, you can boot with "acpi=strict".
Comment 13 Len Brown 2007-03-08 00:43:51 UTC
patch in comment #12 applied to acpi-test
Comment 14 Len Brown 2007-03-10 21:26:20 UTC
shipped in 2.6.21-rc3-git6
closed
Comment 15 Robert Moore 2008-05-15 14:42:00 UTC
There is a problem with this patch, in this line:

if (ACPI_GET_OBJECT_TYPE (sub_object_list[3]) != ACPI_TYPE_INTEGER) {

In the case where the SourceName and SourceIndex are reversed, if the actual SourceName was unresolved, the object will be null. That is the purpose of the null check later in the code:

obj_desc = sub_object_list[source_name_index];
if (obj_desc) {

For safety, a check for a null SourceName must be made. A "correct" _PRT entry
will always have a valid integer object in index 3 for the SourceIndex, so the
safer code would be:

if (!sub_object_list[3] ||
    (ACPI_GET_OBJECT_TYPE (sub_object_list[3]) != ACPI_TYPE_INTEGER)) {

The ACPICA patch will simply swap the objects in place:

/*
 * If the BIOS has erroneously reversed the _PRT SourceName (index 2)
 * and the SourceIndex (index 3), fix it. _PRT is important enough to
 * workaround this BIOS error. This also provides compatibility with
 * other ACPI implementations.
 */
 ObjDesc = SubObjectList[3];
 if (!ObjDesc || (ACPI_GET_OBJECT_TYPE (ObjDesc) != ACPI_TYPE_INTEGER))
 {
    SubObjectList[3] = SubObjectList[2];
    SubObjectList[2] = ObjDesc;

    ACPI_WARNING ((AE_INFO,
        "(PRT[%X].Source) SourceName and SourceIndex are reversed, fixed",
        Index));
 }
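
The swap-in-place logic, including the null check, can be tried outside the kernel with a small stand-alone sketch. This is only an illustration under simplified assumptions: struct obj and prt_fix_reversed_entry are hypothetical names, and the object model is far simpler than the real ACPICA operand objects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified object model; not the ACPICA operand object. */
enum obj_type { OBJ_INTEGER, OBJ_NAME_REF };

struct obj {
    enum obj_type type;
    unsigned long value;  /* meaningful for OBJ_INTEGER */
    const char *path;     /* meaningful for OBJ_NAME_REF */
};

/*
 * If element 3 is NULL (for example an unresolved name) or is not an
 * integer, swap elements 2 and 3 in place and return 1. Otherwise leave
 * the entry untouched and return 0.
 */
static int prt_fix_reversed_entry(struct obj *entry[4])
{
    struct obj *obj_desc;

    assert(entry != NULL);
    obj_desc = entry[3];
    if (obj_desc == NULL || obj_desc->type != OBJ_INTEGER) {
        entry[3] = entry[2];
        entry[2] = obj_desc;
        return 1;
    }
    return 0;
}
```

Testing for NULL before reading the type is what keeps the unresolved-name case from dereferencing a null pointer; after the swap the null object lands in index 2, where the later null check on the SourceName handles it.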
Comment 16 Zhang Rui 2008-07-14 20:44:03 UTC
bob, what's the status of this bug?
Comment 17 Robert Moore 2008-07-15 13:17:55 UTC
The original patch is already in Linux as far as I know.

The updated patch was released in ACPICA 20080609. I'm not sure if it has been integrated into Linux.
Comment 18 Andi Kleen 2008-08-15 09:02:26 UTC
The updated patch is in 2.6.27rc as 

commit d0e184abc5983281ef189db2c759d65d56eb1b80
Author: Bob Moore <robert.moore@intel.com>
Date:   Tue Jun 10 14:16:47 2008 +0800

    ACPICA: Workaround for reversed _PRT entries from BIOS

I don't have rights to close the bug unfortunately. I guess we'll
have to wait for Len to do this? [Hi Len, an easy way to improve your
bug numbers ;-]
Comment 19 Andi Kleen 2008-08-15 09:29:05 UTC
Ok got rights to close the bug now.