Bug 47981

Summary:	Failed to boot after upgrade
Product:	ACPI	Reporter:	jbauer
Component:	Other	Assignee:	acpi_other
Status:	CLOSED WILL_NOT_FIX
Severity:	normal	CC:	bjorn, daniel, feng.tang, florian, lenb, Robert.Moore
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	3.6.0	Subsystem:
Regression:	Yes	Bisected commit-id:
Attachments:	Boot messages from failed boot
	acpidump output
	/proc/cpuinfo
	dmidecode output
	/proc/iomem
	/proc/ioports
	lsb_release output
	lspci -vvv output
	/proc/modules output
	ver_linux output
	/proc/version
	Boot messages from failed boot of linux-3.6
	Boot messages from failed boot of linux-3.6 (corrected)
	Boot message from 2.6.33.7 (worked)
	Boot messages from 2.6.34 (failed)
	dmesg output from kernel 3.6.0 with pci=nocrs (worked)
	add_acpi_pci_quirk
	dmesg output from SecurePlatform (based on linux-2.6.18)

Description jbauer 2012-09-26 13:21:00 UTC

Created attachment 81151 [details]
Boot messages from failed boot

The following Ubuntu bug was reproduced with the upstream 3.6.0 kernel.
The attached boot messages are from the 3.6.0 kernel.


Original Ubuntu bug report
--------------------------
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1055506


Copy of Ubuntu bug report
--------------------------

After upgrading (with do-release-upgrade) the OS to Ubuntu 12.04.1 LTS, the system failed to boot. It failed to find any disks. Boot process ended with the "(initramfs)" prompt. Output attached. Found workaround by adding "acpi=off" to boot options.

Falling back to old kernel (2.6.32-43-server) the system booted fine.

Tried the following options with the new kernel (3.2.0-31-generic):

noapic -- failed
nolapic -- failed
acpi=off -- worked
acpi=ht -- failed

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-31-generic 3.2.0-31.50
ProcVersionSignature: Ubuntu 3.2.0-31.50-generic 3.2.28
Uname: Linux 3.2.0-31-generic x86_64
AcpiTables:

AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 Sep 24 09:08 seq
 crw-rw---T 1 root audio 116, 33 Sep 24 09:08 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu13
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Mon Sep 24 09:31:18 2012
HibernationDevice: RESUME=UUID=eaff9f20-4df4-40db-a79b-fe4b1a9d48ec
InstallationMedia: Ubuntu-Server 10.04 LTS "Lucid Lynx" - Release amd64 (20100427)
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: CheckPoint P-20-00
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=C
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-31-generic root=UUID=3cb24138-2b7c-4dae-9de8-10ffda57c140 ro console=ttyS0,115200n8 acpi=off
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-31-generic N/A
 linux-backports-modules-3.2.0-31-generic N/A
 linux-firmware 1.79.1
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux
UpgradeStatus: Upgraded to precise on 2012-09-21 (2 days ago)
dmi.bios.date: 03/26/2008
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 080014
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: Bridgeport
dmi.board.vendor: Intel
dmi.board.version: To be filled by O.E.M.
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Intel
dmi.chassis.version: To Be Filled By O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr080014:bd03/26/2008:svnCheckPoint:pnP-20-00:pvrToBeFilledByO.E.M.:rvnIntel:rnBridgeport:rvrTobefilledbyO.E.M.:cvnIntel:ct3:cvrToBeFilledByO.E.M.:
dmi.product.name: P-20-00
dmi.product.version: To Be Filled By O.E.M.
dmi.sys.vendor: CheckPoint

Comment 1 jbauer 2012-09-26 13:21:43 UTC

Created attachment 81161 [details]
acpidump output

Comment 2 jbauer 2012-09-26 13:22:16 UTC

Created attachment 81171 [details]
/proc/cpuinfo

Comment 3 jbauer 2012-09-26 13:22:36 UTC

Created attachment 81181 [details]
dmidecode output

Comment 4 jbauer 2012-09-26 13:22:53 UTC

Created attachment 81191 [details]
/proc/iomem

Comment 5 jbauer 2012-09-26 13:23:11 UTC

Created attachment 81201 [details]
/proc/ioports

Comment 6 jbauer 2012-09-26 13:23:45 UTC

Created attachment 81211 [details]
lsb_release output

Comment 7 jbauer 2012-09-26 13:24:04 UTC

Created attachment 81221 [details]
lspci -vvv output

Comment 8 jbauer 2012-09-26 13:24:18 UTC

Created attachment 81231 [details]
/proc/modules output

Comment 9 jbauer 2012-09-26 13:24:49 UTC

Created attachment 81241 [details]
ver_linux output

Comment 10 jbauer 2012-09-26 13:25:15 UTC

Created attachment 81251 [details]
/proc/version

Comment 11 Len Brown 2012-10-02 02:22:13 UTC

> [    0.088154] ACPI: Core revision 20120711
> [    0.092340] ACPI Error: Found unknown opcode 0x1C at AML address ffffc9000060a73e offset 0x22AA, ignoring (20120711/psloop-141)

It looks like we are trying to interpret garbage.

Can you reproduce this using an upstream kernel?
It is likely that ACPI is an innocent victim of another bug
in your kernel.

Comment 12 jbauer 2012-10-02 13:50:18 UTC

I tried linux-3.6 from http://www.kernel.org/pub/linux/kernel/v3.x/
It also failed.

Comment 13 jbauer 2012-10-02 13:52:44 UTC

Created attachment 81821 [details]
Boot messages from failed boot of linux-3.6

Comment 14 jbauer 2012-10-02 15:59:21 UTC

Created attachment 81841 [details]
Boot messages from failed boot of linux-3.6 (corrected)

Comment 15 jbauer 2012-10-03 14:56:58 UTC

I tried a bunch of different kernel versions and determined that the change that led to the boot problems happened between 2.6.33.7 and 2.6.34.  I also noticed that I get the "ACPI Error: Found unknown opcode..." errors even on the kernels that boot ok.  I'll add attachments of the boot messages for the 2 kernel versions mentioned above.


                                          Boot options
Version                  Source           Default  acpi=off
-----------------------  ---------------  -------  --------
2.6.32-43-server         ubuntu           ok       --
2.6.33                   kernel.org       ok       --
2.6.33.4                 kernel.org       ok       --
2.6.33.6                 kernel.org       ok       --
2.6.33.7                 kernel.org       ok       --
2.6.34                   kernel.org       FAIL     ok
2.6.35                   kernel.org       FAIL     ok
2.6.38.1                 kernel.org       FAIL     ok
3.2.0-31-generic         ubuntu           FAIL     ok
3.6.0-030600rc7-generic  ubuntu/mainline  FAIL     ok
3.6.0                    kernel.org       FAIL     ok

Comment 16 jbauer 2012-10-03 14:58:13 UTC

Created attachment 81901 [details]
Boot message from 2.6.33.7 (worked)

Comment 17 jbauer 2012-10-03 14:59:08 UTC

Created attachment 81911 [details]
Boot messages from 2.6.34 (failed)

Comment 18 Len Brown 2012-10-09 02:22:00 UTC

I think the failure we are trying to isolate is not actually
the boot failure, but the "unknown opcode" failure -- which
presumably may or may not cause the boot failure.

can you please attach the output from dmidecode
to describe the system better?

Comment 19 Feng Tang 2012-10-09 07:48:54 UTC

Could you add "pci=nocrs" to the kernel command line and boot?

Comment 20 jbauer 2012-10-09 12:11:56 UTC

(In reply to comment #18)
> I think the failure we are trying to isolate is not actually
> the boot failure, but the "unknown opcode" failure -- which
> presumably may or may not cause the boot failure.
> 
> can you please attach the output from dmidecode
> to describe the system better?

That is in the 4th attachment

Comment 21 jbauer 2012-10-09 12:18:54 UTC

Tried "pci=nocrs" with kernel 3.6.0.  It booted ok.  Full boot command line was: BOOT_IMAGE=/boot/vmlinuz-3.6.0 root=UUID=3cb24138-2b7c-4dae-9de8-10ffda57c140 ro console=ttyS0,115200n8 pci=nocrs debug

Comment 22 jbauer 2012-10-09 12:20:02 UTC

Created attachment 82721 [details]
dmesg output from kernel 3.6.0 with pci=nocrs (worked)

Comment 23 Feng Tang 2012-10-09 15:25:16 UTC

(In reply to comment #21)
> Tried "pci=nocrs" with kernel 3.6.0.  It booted ok.  Full boot command line
> was: BOOT_IMAGE=/boot/vmlinuz-3.6.0
> root=UUID=3cb24138-2b7c-4dae-9de8-10ffda57c140 ro console=ttyS0,115200n8
> pci=nocrs debug

Glad to hear it works. Is your platform a real product or just a development board? I ask because the dmidecode info is a little different from the usual:

Handle 0x0001, DMI type 1, 27 bytes
System Information
	Manufacturer: CheckPoint
	Product Name: P-20-00
	Version: To Be Filled By O.E.M.
	Serial Number: To Be Filled By O.E.M.
	UUID: 00020003-0004-0005-0006-000700080009
	Wake-up Type: Power Switch
	SKU Number: To Be Filled By O.E.M.
	Family: Server

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
	Manufacturer: Intel
	Product Name: Bridgeport
	Version: To be filled by O.E.M.
	Serial Number: To be filled by O.E.M.
	Asset Tag: To Be Filled By O.E.M.
	Features:
		Board is a hosting board
		Board is replaceable
	Location In Chassis: To Be Filled By O.E.M.
	Chassis Handle: 0x0003
	Type: Motherboard
	Contained Object Handles: 0

Comment 24 jbauer 2012-10-09 17:34:13 UTC

It is not a development board or an eval unit.  It is a network appliance (Check Point 9070) that got repurposed and had Ubuntu installed on it.

Is using pci=nocrs a better workaround than acpi=off?

Comment 25 Feng Tang 2012-10-10 08:52:19 UTC

Created attachment 82781 [details]
add_acpi_pci_quirk

Hi Bauer,

Please test this patch with the 3.6 kernel, and remove "pci=nocrs" from the kernel cmdline.

This patch is expected to fix the issue in the kernel without any cmdline change. Thanks,
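
For readers unfamiliar with how such a quirk decides when to apply, here is a minimal Python sketch of DMI-based matching. It is not the attached patch: the table layout and the names `QUIRKS`, `ignore_crs`, and `should_ignore_crs` are hypothetical, and `other_box` is a made-up machine. Only the vendor, product, and board strings are taken from the dmi.* fields in this report.

```python
# Illustrative model of a DMI-keyed quirk table; NOT the kernel patch itself.
# Keys mirror the dmi.sys.vendor / dmi.product.name fields from the report.
QUIRKS = [
    {
        "match": {"sys_vendor": "CheckPoint", "product_name": "P-20-00"},
        "ignore_crs": True,  # behave as if booted with pci=nocrs
    },
]


def should_ignore_crs(dmi, quirks=QUIRKS):
    """Return True if a quirk entry matches every DMI field it lists."""
    for quirk in quirks:
        if all(dmi.get(key) == value for key, value in quirk["match"].items()):
            return quirk["ignore_crs"]
    return False


this_box = {"sys_vendor": "CheckPoint", "product_name": "P-20-00", "board_name": "Bridgeport"}
other_box = {"sys_vendor": "ExampleVendor", "product_name": "Example-1"}  # hypothetical

print(should_ignore_crs(this_box))   # True
print(should_ignore_crs(other_box))  # False
```

Matching on both vendor and product keeps the quirk from triggering on unrelated machines, which is the property that lets the standard command line stay unchanged everywhere else.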

Comment 26 jbauer 2012-10-10 18:59:43 UTC

System is up and running with patched 3.6 kernel and standard command line.

I still see the "ACPI Error: Found unknown opcode" messages in dmesg output, but if they are harmless, I can live with them.

Comment 27 Feng Tang 2012-10-11 09:20:13 UTC

(In reply to comment #26)
> System is up and running with patched 3.6 kernel and standard command line.
> 
> I still see the "ACPI Error: Found unknown opcode" messages in dmesg output,
> but if they are harmless, I can live with them.

Those errors may be related to your ACPI HW, but they have nothing to do with the boot hang.

You can create a new bug to track the error info if you want.

Comment 28 Bjorn Helgaas 2012-10-11 16:44:26 UTC

Wow, this is the worst AML parsing train wreck I've ever seen.  iasl won't disassemble the DSDT at all.

But on the other hand, Linux parsed enough to find 13 PNPACPI devices with plausible resources, and there's a PNP0A08 device (though we didn't get valid _CRS resources for it).

What's the normal OS that ships on this box?  Googling suggests maybe "SecurePlatform" or "GAiA" and that they may be Linux-based.  Any clue whether that OS consumes the AML?  Any way to get a dmesg log from those to see if it shows the same AML parsing issues?

Here are the possibilities I see:
  1) Use a patch like Feng's to tiptoe around this issue.  But there are likely other similar issues waiting to be discovered.
  2) Turn off ACPI on this platform altogether.  Seems like a big hammer.
  3) Try to figure out if there's some small ACPICA tweak that would make the AML intelligible.

3) seems like a nice choice, but I don't have time to do it myself.  And I'm a little dubious, given that the shipping software seems to be based on Linux, but the long list of copyright owners, trademarks, etc., in the datasheet[1] doesn't mention Linux, the GPL, or where to get the source.  That doesn't give me warm fuzzies about putting effort into this.

[1] http://www.checkpoint.com/products/downloads/secureplatform_datasheet.pdf

Comment 29 Robert Moore 2012-10-11 17:15:32 UTC

What we have seen in the past is that this type of thing is indicative of the BIOS attempting to modify the DSDT at runtime.

What happens is that the BIOS screws up an internal AML package length and the AML parser ends up hopping into garbage.

AFAIK, there is really no way to work around the issue; we came to the conclusion that the problem makes it into the platform because Windows just silently ignores it.

I will, however, take a look at the DSDT for the machine.

Comment 30 jbauer 2012-10-11 18:32:10 UTC

(In reply to comment #28)
> Wow, this is the worst AML parsing train wreck I've ever seen.  iasl won't
> disassemble the DSDT at all.
> 
> But on the other hand, Linux parsed enough to find 13 PNPACPI devices with
> plausible resources, and there's a PNP0A08 device (though we didn't get valid
> _CRS resources for it).
> 
> What's the normal OS that ships on this box?  Googling suggests maybe
> "SecurePlatform" or "GAiA" and that they may be Linux-based.  Any clue
> whether
> that OS consumes the AML?  Any way to get a dmesg log from those to see if it
> shows the same AML parsing issues?

Yes, they ship with SecurePlatform or GAiA.  They are using kernels based on either Linux 2.4.21 or 2.6.18. I looked at the 2.6.18 based SecurePlatform and it has some ACPI errors as well.  I'll attach dmesg output in a bit.

Comment 31 jbauer 2012-10-11 18:33:12 UTC

Created attachment 82961 [details]
dmesg output from SecurePlatform (based on linux-2.6.18)

Comment 32 Robert Moore 2012-10-11 20:00:05 UTC

The plot is a bit thicker than I expected at first. After recreating some
test ASL code in one of the areas of the table where there are errors, it
looks like the table has been corrupted/scribbled in a somewhat systematic
way.

Below are 4 bytes that are incorrect in the original table, along with
their offsets and the correct values. Note the sequence of incorrect
values: 1C, 1D, 1E, 1F.

@22CE: is 1C, should be 08 - Name() opcode
@22D8: is 1D, should be 14 - Method() opcode
@22E3: is 1E, should be 57 - "W" in GPRW name
@22EE: is 1F, should be 00 - Method flags for _PRT

This corruption is of course enough to thoroughly confuse the AML
interpreter.

One other data point: The table checksum appears to be correct, so
it looks like someone (probably the BIOS) changed a bunch of data in
the table, then recomputed the checksum over the entire modified table.
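
To see why a correct checksum does not rule out corruption, here is a minimal Python sketch. It relies on the ACPI rule that all bytes of a table, including the checksum byte at offset 9 of the 36-byte header, must sum to zero modulo 256. The 40-byte table built here is synthetic, and the helper names `checksum_ok` and `recompute_checksum` are made up for illustration.

```python
CHECKSUM_OFFSET = 9  # position of the checksum byte in the standard ACPI table header


def checksum_ok(table):
    """An ACPI table is valid when all of its bytes sum to 0 modulo 256."""
    return sum(table) % 256 == 0


def recompute_checksum(table):
    """Return a copy of the table with the checksum byte fixed up."""
    fixed = bytearray(table)
    fixed[CHECKSUM_OFFSET] = 0
    fixed[CHECKSUM_OFFSET] = (-sum(fixed)) % 256
    return bytes(fixed)


# Synthetic stand-in for a DSDT: a 36-byte header followed by 4 bytes of body.
header = b"DSDT" + (40).to_bytes(4, "little") + bytes(28)
clean = recompute_checksum(header + bytes([0x08, 0x14, 0x57, 0x00]))
print(checksum_ok(clean))  # True

# Scribbling over the body breaks the checksum ...
scribbled = bytearray(clean)
scribbled[36:40] = bytes([0x1C, 0x1D, 0x1E, 0x1F])
print(checksum_ok(bytes(scribbled)))  # False

# ... unless whoever modified the table also recomputes it afterwards.
print(checksum_ok(recompute_checksum(bytes(scribbled))))  # True
```

In other words, the checksum only proves that the last writer was self-consistent, not that the AML is intact.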


// Data below

// Some of the errors, all within the same Device() object

Found unknown opcode 0x1C at table offset 0x22CE, context:
  0000: 41 52 31 34 A4 50 52 31 34 5B 82 36 50 30 50 31  AR14.PR14[.6P0P1
  0010: 1C 5F 41 44 52 0C 00 00 1E 00 1D 0F 5F 50 52 57  ._ADR......._PRW
  0020: 00 A4 47 50 52 1E 0A 0B 0A 04 14 16 5F 50 52 54  ..GPR......._PRT

Found unknown opcode 0x1D at table offset 0x22D8, context:
  0000: 82 36 50 30 50 31 1C 5F 41 44 52 0C 00 00 1E 00  .6P0P1._ADR.....
  0010: 1D 0F 5F 50 52 57 00 A4 47 50 52 1E 0A 0B 0A 04  .._PRW..GPR.....
  0020: 14 16 5F 50 52 54 1F A0 0A 50 49 43 4D A4 41 52  .._PRT...PICM.AR

Found unknown opcode 0x0F at table offset 0x22D9, context:
  0000: 36 50 30 50 31 1C 5F 41 44 52 0C 00 00 1E 00 1D  6P0P1._ADR......
  0010: 0F 5F 50 52 57 00 A4 47 50 52 1E 0A 0B 0A 04 14  ._PRW..GPR......
  0020: 16 5F 50 52 54 1F A0 0A 50 49 43 4D A4 41 52 30  ._PRT...PICM.AR0


// Actual (original) table data

  22C0:                      5B 82 36 50 30 50 31 1C 5F  14.PR14[.6P0P1._
  22D0: 41 44 52 0C 00 00 1E 00 1D 0F 5F 50 52 57 00 A4  ADR......._PRW..
  22E0: 47 50 52 1E 0A 0B 0A 04 14 16 5F 50 52 54 1F A0  GPR......._PRT..
  22F0: 0A 50 49 43 4D A4 41 52 30 31 A4 50 52 30 31     .PICM.AR01.PR01[


// Compilation of small test code

                   0x5B,0x82,0x36,0x50,0x30,  /* 00000118    "R14[.6P0" */
    0x50,0x31,0x08,0x5F,0x41,0x44,0x52,0x0C,  /* 00000120    "P1._ADR." */
    0x00,0x00,0x1E,0x00,0x14,0x0F,0x5F,0x50,  /* 00000128    "......_P" */
    0x52,0x57,0x00,0xA4,0x47,0x50,0x52,0x57,  /* 00000130    "RW..GPRW" */
    0x0A,0x0B,0x0A,0x04,0x14,0x16,0x5F,0x50,  /* 00000138    "......_P" */
    0x52,0x54,0x00,0xA0,0x0A,0x50,0x49,0x43,  /* 00000140    "RT...PIC" */
    0x4D,0xA4,0x41,0x52,0x30,0x31,0xA4,0x50,  /* 00000148    "M.AR01.P" */
    0x52,0x30,0x31,                           /* 00000150    "R01..MAI" */


// Small test ASL code

    Device (P0P1)
    {
        Name (_ADR, 0x001E0000)  // _ADR: Address
        Method (_PRW, 0, NotSerialized)  // _PRW: Power Resources for Wake
        {
            Return (GPRW (0x0B, 0x04))
        }

        Method (_PRT, 0, NotSerialized)  // _PRT: PCI Routing Table
        {
            If (PICM)
            {
                Return (AR01)
            }

            Return (PR01)
        }
    }
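
The comparison between the original table data and the compiled test code can be repeated mechanically. The Python sketch below copies both byte sequences from the dumps above and lists every position where they differ. The starting offset 0x22C7 is inferred from the row labels of the original dump, and the AML-relative offset assumes that the kernel's "offset 0x22AA" is counted from the end of the 36-byte table header; both are assumptions of this sketch rather than statements from the dumps.

```python
TABLE_BASE = 0x22C7    # table offset of the first byte shown (the 5B of Device)
ACPI_HEADER_LEN = 36   # size of the standard ACPI table header

# Original (corrupted) table data, copied from the dump above.
original = bytes.fromhex(
    "5B 82 36 50 30 50 31 1C 5F "
    "41 44 52 0C 00 00 1E 00 1D 0F 5F 50 52 57 00 A4 "
    "47 50 52 1E 0A 0B 0A 04 14 16 5F 50 52 54 1F A0 "
    "0A 50 49 43 4D A4 41 52 30 31 A4 50 52 30 31"
)

# Bytes produced by compiling the small test ASL, copied from the listing above.
compiled = bytes.fromhex(
    "5B 82 36 50 30 "
    "50 31 08 5F 41 44 52 0C "
    "00 00 1E 00 14 0F 5F 50 "
    "52 57 00 A4 47 50 52 57 "
    "0A 0B 0A 04 14 16 5F 50 "
    "52 54 00 A0 0A 50 49 43 "
    "4D A4 41 52 30 31 A4 50 "
    "52 30 31"
)

# Collect (table offset, corrupted byte, expected byte) for each mismatch.
diffs = [
    (TABLE_BASE + i, bad, good)
    for i, (bad, good) in enumerate(zip(original, compiled))
    if bad != good
]

for offset, bad, good in diffs:
    aml_offset = offset - ACPI_HEADER_LEN
    print(f"table 0x{offset:04X} (AML 0x{aml_offset:04X}): is {bad:02X}, should be {good:02X}")
```

Under these assumptions the first mismatch is at table offset 0x22CE, which agrees with the "unknown opcode 0x1C at table offset 0x22CE" message and, after subtracting the header, with the 0x22AA offset in the kernel log.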

Comment 33 Robert Moore 2012-10-12 20:36:31 UTC

(In reply to comment #28)
> Here are the possibilities I see:
>   1) Use a patch like Feng's to tiptoe around this issue.  But there are
>   likely
> other similar issues waiting to be discovered.
>   2) Turn off ACPI on this platform altogether.  Seems like a big hammer.
>   3) Try to figure out if there's some small ACPICA tweak that would make the
> AML intelligible.

As far as #3:

Currently, ACPICA will complain and then simply ignore (step past) any unknown AML opcodes. In this case, it recovers rather quickly and at least loads a "somewhat valid" namespace, albeit missing some intended items.
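
The "complain and step past" behavior can be pictured with a toy decoder. This is not ACPICA code and it is not a real AML parser: it only understands the integer-constant opcodes (ZeroOp 0x00, OneOp 0x01, BytePrefix 0x0A, WordPrefix 0x0B, DWordPrefix 0x0C) and treats every other byte as unknown. The function name and the sample byte stream are invented for illustration.

```python
# Operand sizes for a few AML constant opcodes; anything else counts as
# "unknown" in this toy model.
OPERAND_BYTES = {0x00: 0, 0x01: 0, 0x0A: 1, 0x0B: 2, 0x0C: 4}


def decode_constants(stream):
    """Decode integer constants, skipping unknown opcodes one byte at a time."""
    values, unknown = [], []
    pos = 0
    while pos < len(stream):
        opcode = stream[pos]
        if opcode not in OPERAND_BYTES:
            unknown.append((pos, opcode))  # complain ...
            pos += 1                       # ... and step past it
            continue
        size = OPERAND_BYTES[opcode]
        operand = stream[pos + 1 : pos + 1 + size]
        # ZeroOp/OneOp carry their value in the opcode itself.
        values.append(opcode if size == 0 else int.from_bytes(operand, "little"))
        pos += 1 + size
    return values, unknown


# DWord 0x001E0000, a stray 0x1D, then Byte 0x0B and Byte 0x04.
stream = bytes.fromhex("0C 00 00 1E 00 1D 0A 0B 0A 04")
values, unknown = decode_constants(stream)
print([hex(v) for v in values])               # ['0x1e0000', '0xb', '0x4']
print([(p, hex(op)) for p, op in unknown])    # [(5, '0x1d')]
```

The decoder resynchronizes here because the stray byte sits exactly where an opcode was expected; a corrupted length field would instead shift every later boundary, which is the more severe case.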

A more severe problem is when an AML package length error causes the interpreter to jump off into space; but this appears to not be the problem with this particular machine.

Comment 34 Bjorn Helgaas 2012-10-12 21:03:52 UTC

My inclination is to do nothing in Linux or ACPICA.  Everything points to this being a BIOS issue, and it seems like such an egregious issue that it's not worth spending time or adding kernel bandaids to try to patch things up.

For this particular issue of the boot failing because we can't get valid _CRS info for the PCI host bridges, booting with "pci=nocrs" is a reasonable workaround and seems sufficient.

We could make the argument that Linux should be able to survive this by reassigning all the BARs to the values they contained at BIOS handoff.  In fact, we do have code that's supposed to do that.  My guess is that it failed in this case because we think there are no resources available on the PCI bus at all.  We can't really fall back to some sort of default resources, because it's actually quite common to have buses with no resources -- these are often used for things like uncore devices that need only config registers and no MEM or IO space -- and in those cases, we don't want to assume default resources.

Comment 35 Feng Tang 2012-10-15 02:59:53 UTC

(In reply to comment #34)
> My inclination is to do nothing in Linux or ACPICA.  Everything points to
> this
> being a BIOS issue, and it seems like such an egregious issue that it's not
> worth spending time or adding kernel bandaids to try to patch things up.

Yes, it should be a broken BIOS. 

But as Bauer answered, this platform is a product sold on the market, so we had better take this quirk for now to work around the boot hang, since not all users know how to modify the cmdline. We can remove the quirk once the broken BIOS gets fixed.

I don't know if there is a rule for justifying the addition of a quirk; please let me know if there is one. Thanks!

Comment 36 Bjorn Helgaas 2012-10-17 20:10:10 UTC

Here's my reasoning: this is a CheckPoint product, and it looks like an appliance, not really a general-purpose machine.  The issue has apparently been there from day one, and the kernel shipped on the machine complains noisily about the issue, but apparently nobody bothered to investigate it.

This corruption will clearly break other ACPI-related things.  We can sort of work around this one (though the workaround does prevent us from doing any PCI resource reassignment), but we have no idea what the other lurking ACPI issues are (and we have no assurance that *only* ACPI things are broken -- maybe the memory corruption affects other unknown things).  It may take significant debugging effort to identify the next problem.

The only report I've seen (this one) is apparently from a CheckPoint employee, so it's not clear that anybody else is trying to run upstream Linux on it.  Being a CheckPoint employee, J Bauer is probably in a position to get the BIOS fixed.

You might still be able to convince me, but it seems like the benefit to a quirk for this platform is small, and it does cost everybody else something in code size and complexity.

Comment 37 Feng Tang 2012-10-28 13:46:19 UTC

(In reply to comment #36)
> Here's my reasoning: this is a CheckPoint product, and it looks like an
> appliance, not really a general-purpose machine.  The issue has apparently
> been
> there from day one, and the kernel shipped on the machine complains noisily
> about the issue, but apparently nobody bothered to investigate it.
> 
> This corruption will clearly break other ACPI-related things.  We can sort of
> work around this one (though the workaround does prevent us from doing any
> PCI
> resource reassignment), but we have no idea what the other lurking ACPI
> issues
> are (and we have no assurance that *only* ACPI things are broken -- maybe the
> memory corruption affects other unknown things).  It may take significant
> debugging effort to identify the next problem.
> 
> The only report I've seen (this one) is apparently from a CheckPoint
> employee,
> so it's not clear that anybody else is trying to run upstream Linux on it. 
> Being a CheckPoint employee, J Bauer is probably in a position to get the
> BIOS
> fixed.

Fair enough to me! Then should we close this bug with a "Won't Fix"? (Sorry for the late response, just came back from vacation)

Comment 38 Bjorn Helgaas 2012-10-29 18:54:33 UTC

I'd prefer to see a BIOS fix for this (see comment #36).

Comment 39 Florian Mickler 2012-12-22 09:27:44 UTC

A patch referencing this bug report has been merged in Linux v3.8-rc1:

commit bacaf7cd092a2c42a904bce437e64690e04aaa10
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Fri Nov 16 11:08:31 2012 +0100

    Revert "ACPI / x86: Add quirk for "CheckPoint P-20-00" to not use bridge _CRS_ info"

Comment 40 Florian Mickler 2012-12-22 10:48:55 UTC

A patch referencing this bug report has been merged in Linux v3.8-rc1:

commit 0a290ac4252c85205cb924ff7f6da10cfd20fb01
Author: Feng Tang <feng.tang@intel.com>
Date:   Tue Oct 23 01:31:14 2012 +0200

    ACPI / x86: Add quirk for "CheckPoint P-20-00" to not use bridge _CRS_ info