Bug 47981
Summary: | Failed to boot after upgrade | ||
---|---|---|---|
Product: | ACPI | Reporter: | jbauer |
Component: | Other | Assignee: | acpi_other |
Status: | CLOSED WILL_NOT_FIX | ||
Severity: | normal | CC: | bjorn, daniel, feng.tang, florian, lenb, Robert.Moore |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.6.0 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Boot messages from failed boot
acpidump output
/proc/cpuinfo
dmidecode output
/proc/iomem
/proc/ioports
lsb_release output
lspci -vvv output
/proc/modules output
ver_linux output
/proc/version
Boot messages from failed boot of linux-3.6
Boot messages from failed boot of linux-3.6 (corrected)
Boot message from 2.6.33.7 (worked)
Boot messages from 2.6.34 (failed)
dmesg output from kernel 3.6.0 with pci=nocrs (worked)
add_acpi_pci_quirk
dmesg output from SecurePlatform (based on linux-2.6.18) |
Description
jbauer
2012-09-26 13:21:00 UTC
Created attachment 81161 [details]
acpidump output
Created attachment 81171 [details]
/proc/cpuinfo
Created attachment 81181 [details]
dmidecode output
Created attachment 81191 [details]
/proc/iomem
Created attachment 81201 [details]
/proc/ioports
Created attachment 81211 [details]
lsb_release output
Created attachment 81221 [details]
lspci -vvv output
Created attachment 81231 [details]
/proc/modules output
Created attachment 81241 [details]
ver_linux output
Created attachment 81251 [details]
/proc/version
> [ 0.088154] ACPI: Core revision 20120711
> [ 0.092340] ACPI Error: Found unknown opcode 0x1C at AML address ffffc9000060a73e offset 0x22AA, ignoring (20120711/psloop-141)
It looks like we are trying to interpret garbage.
Can you reproduce this using an upstream kernel?
It is likely that ACPI is an innocent victim of another bug
in your kernel.
I tried linux-3.6 from http://www.kernel.org/pub/linux/kernel/v3.x/ and it also failed.

Created attachment 81821 [details]
Boot messages from failed boot of linux-3.6
Created attachment 81841 [details]
Boot messages from failed boot of linux-3.6 (corrected)
I tried a bunch of different kernel versions and determined that the change that led to the boot problems happened between 2.6.33.7 and 2.6.34. I also noticed that I get the "ACPI Error: Found unknown opcode..." errors even on the kernels that boot ok. I'll add attachments of the boot messages for the 2 kernel versions mentioned above.

                                            Boot options
Version                  Source           Default  acpi=off
-----------------------  ---------------  -------  --------
2.6.32-43-server         ubuntu           ok       --
2.6.33                   kernel.org       ok       --
2.6.33.4                 kernel.org       ok       --
2.6.33.6                 kernel.org       ok       --
2.6.33.7                 kernel.org       ok       --
2.6.34                   kernel.org       FAIL     ok
2.6.35                   kernel.org       FAIL     ok
2.6.38.1                 kernel.org       FAIL     ok
3.2.0-31-generic         ubuntu           FAIL     ok
3.6.0-030600rc7-generic  ubuntu/mainline  FAIL     ok
3.6.0                    kernel.org       FAIL     ok

Created attachment 81901 [details]
Boot message from 2.6.33.7 (worked)
Created attachment 81911 [details]
Boot messages from 2.6.34 (failed)
I think the failure we are trying to isolate is not actually the boot failure, but the "unknown opcode" failure -- which presumably may or may not cause the boot failure.

Can you please attach the output from dmidecode to describe the system better?

Could you add "pci=nocrs" to the kernel command line and boot?

(In reply to comment #18)
> I think the failure we are trying to isolate is not actually
> the boot failure, but the "unknown opcode" failure -- which
> presumably may or may not cause the boot failure.
>
> can you please attach the output from dmidecode
> to describe the system better?

That is in the 4th attachment.

Tried "pci=nocrs" with kernel 3.6.0. It booted ok. Full boot command line was:

BOOT_IMAGE=/boot/vmlinuz-3.6.0 root=UUID=3cb24138-2b7c-4dae-9de8-10ffda57c140 ro console=ttyS0,115200n8 pci=nocrs debug

Created attachment 82721 [details]
dmesg output from kernel 3.6.0 with pci=nocrs (worked)
(In reply to comment #21)
> Tried "pci=nocrs" with kernel 3.6.0. It booted ok. Full boot command line
> was: BOOT_IMAGE=/boot/vmlinuz-3.6.0
> root=UUID=3cb24138-2b7c-4dae-9de8-10ffda57c140 ro console=ttyS0,115200n8
> pci=nocrs debug

Glad to hear it works.

Is your platform a real product or just some development board? I ask because the dmidecode info is a little different from the usual:

Handle 0x0001, DMI type 1, 27 bytes
System Information
    Manufacturer: CheckPoint
    Product Name: P-20-00
    Version: To Be Filled By O.E.M.
    Serial Number: To Be Filled By O.E.M.
    UUID: 00020003-0004-0005-0006-000700080009
    Wake-up Type: Power Switch
    SKU Number: To Be Filled By O.E.M.
    Family: Server

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
    Manufacturer: Intel
    Product Name: Bridgeport
    Version: To be filled by O.E.M.
    Serial Number: To be filled by O.E.M.
    Asset Tag: To Be Filled By O.E.M.
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis: To Be Filled By O.E.M.
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0

It is not a development board or an eval unit. It is a network appliance (Check Point 9070) that got repurposed and had Ubuntu installed on it.

Is using pci=nocrs a better workaround than acpi=off?

Created attachment 82781 [details]
add_acpi_pci_quirk
Hi Bauer,
please test this patch with the 3.6 kernel, and remove "pci=nocrs" from the kernel cmdline.
This patch is expected to fix the issue in the kernel without any cmdline change. Thanks,
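For readers following the thread, the general idea of a DMI-based quirk can be sketched in a few lines of Python. This is only an illustration of the approach, not the attached patch: the `DmiQuirk` class, the `QUIRKS` table, and both function names are made up for this sketch, the match is a plain exact string compare, and the rule that an explicit `pci=nocrs` or `pci=use_crs` on the command line wins over the quirk is an assumption. The vendor and product strings are the ones from the dmidecode output in this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DmiQuirk:
    """One hypothetical quirk entry: DMI strings to match and the _CRS policy to force."""
    ident: str
    sys_vendor: str
    product_name: str
    use_crs: bool


# Hypothetical quirk table; the single entry uses the System Information
# strings reported by dmidecode on the affected machine.
QUIRKS = [
    DmiQuirk("CheckPoint P-20-00", "CheckPoint", "P-20-00", use_crs=False),
]


def cmdline_override(cmdline: str) -> Optional[bool]:
    """Return the explicit pci=nocrs / pci=use_crs choice, or None if neither is given."""
    choice = None
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= may carry several comma-separated options.
            for opt in token[len("pci="):].split(","):
                if opt == "nocrs":
                    choice = False
                elif opt == "use_crs":
                    choice = True
    return choice


def decide_use_crs(sys_vendor: str, product_name: str,
                   cmdline: str = "", default: bool = True) -> bool:
    """Decide whether host bridge _CRS info should be used.

    Assumed precedence in this sketch: explicit command line option,
    then a matching quirk entry, then the default.
    """
    explicit = cmdline_override(cmdline)
    if explicit is not None:
        return explicit
    for quirk in QUIRKS:
        if quirk.sys_vendor == sys_vendor and quirk.product_name == product_name:
            return quirk.use_crs
    return default


if __name__ == "__main__":
    # Affected appliance, standard command line: the quirk disables _CRS use.
    print(decide_use_crs("CheckPoint", "P-20-00"))
```

With a table like this, the affected box gets the same effect as booting with "pci=nocrs", while other machines keep the default behaviour.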
System is up and running with patched 3.6 kernel and standard command line.

I still see the "ACPI Error: Found unknown opcode" messages in dmesg output, but if they are harmless, I can live with them.

(In reply to comment #26)
> System is up and running with patched 3.6 kernel and standard command line.
>
> I still see the "ACPI Error: Found unknown opcode" messages in dmesg output,
> but if they are harmless, I can live with them.

Those errors may be related to your ACPI HW, but they have nothing to do with the boot hang. You can create a new bug to track the error info if you want.

Wow, this is the worst AML parsing train wreck I've ever seen. iasl won't disassemble the DSDT at all.

But on the other hand, Linux parsed enough to find 13 PNPACPI devices with plausible resources, and there's a PNP0A08 device (though we didn't get valid _CRS resources for it).

What's the normal OS that ships on this box? Googling suggests maybe "SecurePlatform" or "GAiA" and that they may be Linux-based. Any clue whether that OS consumes the AML? Any way to get a dmesg log from those to see if it shows the same AML parsing issues?

Here are the possibilities I see:
1) Use a patch like Feng's to tiptoe around this issue. But there are likely other similar issues waiting to be discovered.
2) Turn off ACPI on this platform altogether. Seems like a big hammer.
3) Try to figure out if there's some small ACPICA tweak that would make the AML intelligible.

3) seems like a nice choice, but I don't have time to do it myself. And I'm a little dubious, given that the shipping software seems to be based on Linux, but the long list of copyright owners, trademarks, etc., in the datasheet[1] doesn't mention Linux, the GPL, or where to get the source. That doesn't give me warm fuzzies about putting effort into this.

[1] http://www.checkpoint.com/products/downloads/secureplatform_datasheet.pdf

What we have seen in the past is that this type of thing is indicative of the BIOS attempting to modify the DSDT at runtime. What happens is that the BIOS screws up an internal AML package length and the AML parser ends up hopping into garbage. AFAIK, there is really no way to work around the issue; we came to the conclusion that the problem makes it into the platform because Windows just silently ignores it.

I will, however, take a look at the DSDT for the machine.

(In reply to comment #28)
> Wow, this is the worst AML parsing train wreck I've ever seen. iasl won't
> disassemble the DSDT at all.
>
> But on the other hand, Linux parsed enough to find 13 PNPACPI devices with
> plausible resources, and there's a PNP0A08 device (though we didn't get valid
> _CRS resources for it).
>
> What's the normal OS that ships on this box? Googling suggests maybe
> "SecurePlatform" or "GAiA" and that they may be Linux-based. Any clue whether
> that OS consumes the AML? Any way to get a dmesg log from those to see if it
> shows the same AML parsing issues?

Yes, they ship with SecurePlatform or GAiA. They are using kernels based on either Linux 2.4.21 or 2.6.18. I looked at the 2.6.18-based SecurePlatform and it has some ACPI errors as well. I'll attach dmesg output in a bit.

Created attachment 82961 [details]
dmesg output from SecurePlatform (based on linux-2.6.18)
The plot is a bit thicker than I expected at first.

After recreating some test ASL code in one of the areas of the table where there are errors, it looks like the table has been corrupted/scribbled in a somewhat systematic way. Below are 4 bytes that are incorrect in the original table, along with their offsets and the correct values. Note the sequence of incorrect values: 1C, 1D, 1E, 1F.

@22CE: is 1C, should be 08 - Name() opcode
@22D8: is 1D, should be 14 - Method() opcode
@22E3: is 1E, should be 57 - "W" in GPRW name
@22EE: is 1F, should be 00 - Method flags for _PRT

This corruption is of course enough to thoroughly confuse the AML interpreter.

One other data point: The table checksum appears to be correct, so it looks like someone (probably the BIOS) changed a bunch of data in the table, then recomputed the checksum over the entire modified table.

// Data below

// Some of the errors, all within the same Device() object

Found unknown opcode 0x1C at table offset 0x22CE, context:
0000: 41 52 31 34 A4 50 52 31 34 5B 82 36 50 30 50 31  AR14.PR14[.6P0P1
0010: 1C 5F 41 44 52 0C 00 00 1E 00 1D 0F 5F 50 52 57  ._ADR......._PRW
0020: 00 A4 47 50 52 1E 0A 0B 0A 04 14 16 5F 50 52 54  ..GPR......._PRT

Found unknown opcode 0x1D at table offset 0x22D8, context:
0000: 82 36 50 30 50 31 1C 5F 41 44 52 0C 00 00 1E 00  .6P0P1._ADR.....
0010: 1D 0F 5F 50 52 57 00 A4 47 50 52 1E 0A 0B 0A 04  .._PRW..GPR.....
0020: 14 16 5F 50 52 54 1F A0 0A 50 49 43 4D A4 41 52  .._PRT...PICM.AR

Found unknown opcode 0x0F at table offset 0x22D9, context:
0000: 36 50 30 50 31 1C 5F 41 44 52 0C 00 00 1E 00 1D  6P0P1._ADR......
0010: 0F 5F 50 52 57 00 A4 47 50 52 1E 0A 0B 0A 04 14  ._PRW..GPR......
0020: 16 5F 50 52 54 1F A0 0A 50 49 43 4D A4 41 52 30  ._PRT...PICM.AR0

// Actual (original) table data

22C0:                      5B 82 36 50 30 50 31 1C 5F  14.PR14[.6P0P1._
22D0: 41 44 52 0C 00 00 1E 00 1D 0F 5F 50 52 57 00 A4  ADR......._PRW..
22E0: 47 50 52 1E 0A 0B 0A 04 14 16 5F 50 52 54 1F A0  GPR......._PRT..
22F0: 0A 50 49 43 4D A4 41 52 30 31 A4 50 52 30 31     .PICM.AR01.PR01[

// Compilation of small test code

               0x5B,0x82,0x36,0x50,0x30,  /* 00000118 "R14[.6P0" */
0x50,0x31,0x08,0x5F,0x41,0x44,0x52,0x0C,  /* 00000120 "P1._ADR." */
0x00,0x00,0x1E,0x00,0x14,0x0F,0x5F,0x50,  /* 00000128 "......_P" */
0x52,0x57,0x00,0xA4,0x47,0x50,0x52,0x57,  /* 00000130 "RW..GPRW" */
0x0A,0x0B,0x0A,0x04,0x14,0x16,0x5F,0x50,  /* 00000138 "......_P" */
0x52,0x54,0x00,0xA0,0x0A,0x50,0x49,0x43,  /* 00000140 "RT...PIC" */
0x4D,0xA4,0x41,0x52,0x30,0x31,0xA4,0x50,  /* 00000148 "M.AR01.P" */
0x52,0x30,0x31,                           /* 00000150 "R01..MAI" */

// Small test ASL code

Device (P0P1)
{
    Name (_ADR, 0x001E0000)  // _ADR: Address
    Method (_PRW, 0, NotSerialized)  // _PRW: Power Resources for Wake
    {
        Return (GPRW (0x0B, 0x04))
    }

    Method (_PRT, 0, NotSerialized)  // _PRT: PCI Routing Table
    {
        If (PICM)
        {
            Return (AR01)
        }

        Return (PR01)
    }
}

(In reply to comment #28)
> Here are the possibilities I see:
> 1) Use a patch like Feng's to tiptoe around this issue. But there are likely
> other similar issues waiting to be discovered.
> 2) Turn off ACPI on this platform altogether. Seems like a big hammer.
> 3) Try to figure out if there's some small ACPICA tweak that would make the
> AML intelligible.

As far as #3: Currently, ACPICA will complain and then simply ignore (step past) any unknown AML opcodes. In this case, it recovers rather quickly and at least loads a "somewhat valid" namespace, albeit missing some intended items. A more severe problem is when an AML package length error causes the interpreter to jump off into space; but this appears to not be the problem with this particular machine.

My inclination is to do nothing in Linux or ACPICA. Everything points to this being a BIOS issue, and it seems like such an egregious issue that it's not worth spending time or adding kernel bandaids to try to patch things up.

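The byte-level analysis earlier in this comment thread can be double-checked mechanically. The Python sketch below is not part of any tool mentioned here: it copies the bytes from the "Actual (original) table data" rows and from the compiled test code, diffs them, and prints the table offsets that differ. It also shows, on a made-up toy table, why a correct checksum does not rule out corruption: ACPI tables validate when all bytes sum to zero modulo 256, so anyone who rewrites the checksum byte after scribbling gets a "valid" table again. The function names and `DUMP_BASE` are assumptions of this sketch; `DUMP_BASE` is derived from the position of the first 0x5B byte in the 22C0 row.

```python
ACPI_HEADER_LEN = 36   # size of the standard ACPI table header
CHECKSUM_OFFSET = 9    # position of the checksum byte inside that header
DUMP_BASE = 0x22C7     # table offset of the first byte in the excerpts below

# Bytes as found in the machine's DSDT (hex dump rows 22C0..22F0).
corrupted = bytes.fromhex(
    "5B 82 36 50 30 50 31 1C 5F "
    "41 44 52 0C 00 00 1E 00 1D 0F 5F 50 52 57 00 A4 "
    "47 50 52 1E 0A 0B 0A 04 14 16 5F 50 52 54 1F A0 "
    "0A 50 49 43 4D A4 41 52 30 31 A4 50 52 30 31"
)

# Bytes produced by compiling the small test ASL code.
expected = bytes.fromhex(
    "5B 82 36 50 30 50 31 08 5F "
    "41 44 52 0C 00 00 1E 00 14 0F 5F 50 52 57 00 A4 "
    "47 50 52 57 0A 0B 0A 04 14 16 5F 50 52 54 00 A0 "
    "0A 50 49 43 4D A4 41 52 30 31 A4 50 52 30 31"
)


def diff_offsets(bad, good, base):
    """Return (table_offset, bad_byte, good_byte) for every differing position."""
    return [(base + i, b, g) for i, (b, g) in enumerate(zip(bad, good)) if b != g]


def checksum_ok(table):
    """An ACPI table is valid when all of its bytes sum to 0 modulo 256."""
    return sum(table) & 0xFF == 0


def recompute_checksum(table):
    """Rewrite the header checksum byte so the whole table sums to 0 again."""
    table[CHECKSUM_OFFSET] = 0
    table[CHECKSUM_OFFSET] = (-sum(table)) & 0xFF


if __name__ == "__main__":
    for offset, bad, good in diff_offsets(corrupted, expected, DUMP_BASE):
        print(f"@{offset:04X}: is {bad:02X}, should be {good:02X}")

    # Toy table (zeroed header plus the good bytes), not the real DSDT.
    toy = bytearray(ACPI_HEADER_LEN) + bytearray(expected)
    recompute_checksum(toy)
    toy[ACPI_HEADER_LEN + 7] = 0x1C          # scribble over the Name() opcode
    print("after scribble:", checksum_ok(toy))
    recompute_checksum(toy)                   # what the BIOS presumably did
    print("after recompute:", checksum_ok(toy))
```

Subtracting the 36-byte table header from table offset 0x22CE gives 0x22AA, which is the AML offset reported in the original boot message.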
For this particular issue of the boot failing because we can't get valid _CRS info for the PCI host bridges, booting with "pci=nocrs" is a reasonable workaround and seems sufficient.

We could make the argument that Linux should be able to survive this by reassigning all the BARs to the values they contained at BIOS handoff. In fact, we do have code that's supposed to do that. My guess is that it failed in this case because we think there are no resources available on the PCI bus at all. We can't really fall back to some sort of default resources, because it's actually quite common to have buses with no resources -- these are often used for things like uncore devices that need only config registers and no MEM or IO space -- and in those cases, we don't want to assume default resources.

(In reply to comment #34)
> My inclination is to do nothing in Linux or ACPICA. Everything points to this
> being a BIOS issue, and it seems like such an egregious issue that it's not
> worth spending time or adding kernel bandaids to try to patch things up.

Yes, it should be a broken BIOS. But this platform is a product sold on the market, as Bauer answered, so we had better take this quirk for now to work around the boot hang problem, since not all users know how to modify the cmdline. We can remove the quirk once this broken BIOS gets fixed.

I don't know if there is a rule for justifying the addition of a quirk; please let me know if there is one. Thanks!

Here's my reasoning: this is a CheckPoint product, and it looks like an appliance, not really a general-purpose machine. The issue has apparently been there from day one, and the kernel shipped on the machine complains noisily about the issue, but apparently nobody bothered to investigate it.

This corruption will clearly break other ACPI-related things.
We can sort of work around this one (though the workaround does prevent us from doing any PCI resource reassignment), but we have no idea what the other lurking ACPI issues are (and we have no assurance that *only* ACPI things are broken -- maybe the memory corruption affects other unknown things). It may take significant debugging effort to identify the next problem.

The only report I've seen (this one) is apparently from a CheckPoint employee, so it's not clear that anybody else is trying to run upstream Linux on it. Being a CheckPoint employee, J Bauer is probably in a position to get the BIOS fixed.

You might still be able to convince me, but it seems like the benefit to a quirk for this platform is small, and it does cost everybody else something in code size and complexity.

(In reply to comment #36)
> Here's my reasoning: this is a CheckPoint product, and it looks like an
> appliance, not really a general-purpose machine. The issue has apparently been
> there from day one, and the kernel shipped on the machine complains noisily
> about the issue, but apparently nobody bothered to investigate it.
>
> This corruption will clearly break other ACPI-related things. We can sort of
> work around this one (though the workaround does prevent us from doing any PCI
> resource reassignment), but we have no idea what the other lurking ACPI issues
> are (and we have no assurance that *only* ACPI things are broken -- maybe the
> memory corruption affects other unknown things). It may take significant
> debugging effort to identify the next problem.
>
> The only report I've seen (this one) is apparently from a CheckPoint employee,
> so it's not clear that anybody else is trying to run upstream Linux on it.
> Being a CheckPoint employee, J Bauer is probably in a position to get the BIOS
> fixed.

Fair enough to me! Then should we close this bug with a "Won't Fix"?

(Sorry for the late response, just came back from vacation)

I'd prefer to see a BIOS fix for this (see comment #36).

A patch referencing this bug report has been merged in Linux v3.8-rc1:

commit bacaf7cd092a2c42a904bce437e64690e04aaa10
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Fri Nov 16 11:08:31 2012 +0100

    Revert "ACPI / x86: Add quirk for "CheckPoint P-20-00" to not use bridge _CRS_ info"

A patch referencing this bug report has been merged in Linux v3.8-rc1:

commit 0a290ac4252c85205cb924ff7f6da10cfd20fb01
Author: Feng Tang <feng.tang@intel.com>
Date:   Tue Oct 23 01:31:14 2012 +0200

    ACPI / x86: Add quirk for "CheckPoint P-20-00" to not use bridge _CRS_ info |