Bug 16548
Description
Roman
2010-08-09 12:33:28 UTC
Created attachment 27391 [details]
dmesg for 2.6.34
Created attachment 27392 [details]
dmidecode for 2.6.34
Created attachment 27393 [details]
acpidump on 2.6.34
caused by commit 856b185dd23da39e562983fbf28860f54e661b41.
commit 856b185dd23da39e562983fbf28860f54e661b41
Author: Alex Chiang <achiang@canonical.com>
Date:   Thu Jun 17 09:08:54 2010 -0600

    ACPI: processor: fix processor_physically_present on UP

    The commit 5d554a7bb06 (ACPI: processor: add internal
    processor_physically_present()) is broken on uniprocessor (UP)
    configurations, as acpi_get_cpuid() will always return -1.

    We use the value of num_possible_cpus() to tell us whether we got
    an invalid cpuid from acpi_get_cpuid() in the SMP case, or if
    instead, we are UP, in which case num_possible_cpus() is #defined
    as 1.

    We use num_possible_cpus() instead of num_online_cpus() to
    protect ourselves against the scenario of CPU hotplug, and we've
    taken down all the CPUs except one.

    Thanks to Jan Pogadl for initial report and analysis and Chen Gong
    for review.

    https://bugzilla.kernel.org/show_bug.cgi?id=16357

    Reported-by: Jan Pogadl <pogadl.jan@googlemail.com>:
    Reviewed-by: Chen Gong <gong.chen@linux.intel.com>
    Signed-off-by: Alex Chiang <achiang@canonical.com>
    Signed-off-by: Len Brown <len.brown@intel.com>

diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 5128435..e9699aa 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -223,7 +223,7 @@ static bool processor_physically_present(acpi_handle handle)
 	type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
 	cpuid = acpi_get_cpuid(handle, type, acpi_id);
 
-	if (cpuid == -1)
+	if ((cpuid == -1) && (num_possible_cpus() > 1))
 		return false;
 
 	return true;

This bug is fixed in kernel 2.6.36_rc1 (new 20100702 ACPI code revision?)
I guess that the possible reason is buggy ACPI code in BIOS.
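For readers following the diff above, the one-line change can be modelled outside the kernel. This is only a sketch in Python, not kernel code: present_before and present_after are hypothetical names, and cpuid and num_possible_cpus are passed in as plain integers instead of coming from acpi_get_cpuid() and num_possible_cpus().

```python
# Truth-table model of the check in processor_physically_present()
# before and after commit 856b185dd23d.
# The function names here are illustrative, not kernel symbols.

INVALID_CPUID = -1  # value the thread says acpi_get_cpuid() returns on UP


def present_before(cpuid: int) -> bool:
    """Old check: any invalid cpuid means 'not physically present'."""
    if cpuid == INVALID_CPUID:
        return False
    return True


def present_after(cpuid: int, num_possible_cpus: int) -> bool:
    """New check: an invalid cpuid only counts when more than one CPU is possible."""
    if cpuid == INVALID_CPUID and num_possible_cpus > 1:
        return False
    return True
```

With one possible CPU, present_after(-1, 1) is true where present_before(-1) was false. With more than one possible CPU the two checks agree.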
Now, instead of crashing, the kernel prints this:
ACPI: EC: Look up EC in DSDT
ACPI: BIOS _OSI(Linux) query ignored
ACPI Error: Could not map memory at 0x000000007D422720, size 141 (20100702/exregion-178)
ACPI Exception: AE_NO_MEMORY, Returned by Handler for [SystemMemory] (20100702/evregion-474)
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.APCT] (Node f70316c0), AE_NO_MEMORY
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.GCAP] (Node f70316a8), AE_NO_MEMORY
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1._PDC] (Node f7031678), AE_NO_MEMORY
ACPI: Interpreter enabled
ACPI: (supports S0 S3 S4 S5)
ACPI: Using IOAPIC for interrupt routing
Roman, thanks for testing again.
btw, I think your bisection run was a little bit misleading. The initial crash was this:
BUG: Unable to handle kernel NULL pointer dereference at 0x00000005
IP: [<c125b489>] acpi_ns_lookup+0x125/0x57e
As you can see, that's not a part of my patch.
I think what happened is that the patch fixed the bad logic introduced by 5d554a7bb06, and we started properly returning false on your machine, which revealed a latent bug somewhere else in ACPI.
When you reverted my patch, we incorrectly returned true on your machine, thus masking the symptoms of the buggy ACPI code, allowing your machine to boot.
In any case, it looks like your namespace is still broken, but at least you're not crashing anymore.
Guys, which commit fixed this in 2.6.36-rc1? Did that commit have the cc:stable in the changelog?
Thanks
Andrew,
We never narrowed it down; it just went away. :-/
If Roman has time it would be great if he could identify the good commit via another bisection run, but it's up to him as to whether he has the time/energy to do so.
Thanks.
ACPI Error: Could not map memory at 0x000000007D422720, size 141 (20100702/exregion-178)
ACPI Exception: AE_NO_MEMORY, Returned by Handler for [SystemMemory] (20100702/evregion-474)
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.APCT] (Node f70316c0), AE_NO_MEMORY
It's doing a dynamic table load:
Method (APCT, 0, NotSerialized)
{
    If (LAnd (And (CFGD, 0xF0), LNot (And (SDTL, 0x20))))
    {
        Or (SDTL, 0x20, SDTL)
        OperationRegion (CST1, SystemMemory, DerefOf (Index (SSDT, 0x0A)), DerefOf (Index (SSDT, 0x0B)))
        Load (CST1, HC1)
    }
}
Name (SSDT, Package (0x0C)
{
    "CPU0IST ", 0x7D5B3520, 0x00000475,
    "APIST ", 0x7D368CA0, 0x000001CF,
    "CPU0CST ", 0x7D398020, 0x00000C78,
    "APCST ", 0x7D422720, 0x0000008D
})
Index 0x0A into the SSDT package is in fact 0x7D422720.
ACPI Error: Could not map memory at 0x000000007D422720, size 141
So, for whatever reason, the memory mapping failed.
Roman,
The AE_NO_MEMORY error messages in 2.6.36 are still an important failure.
Can you tell us when they started?
Can you tell us what change caused the acpi_ns_lookup fault to go away?
Did the acpi_ns_lookup fault only happen when preceded by the map failure?
Is it possible to get a photo of the stack trace upon the ns_lookup fault?
Created attachment 28661 [details]
full backtrace
Some typos are possible. I can also attach the source photos if anyone is interested.
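The map failure discussed earlier in the thread can be cross-checked against the SSDT package mechanically. The sketch below is a user-space Python illustration, not ACPICA code: it parses the address and size out of the "Could not map memory" message and looks for the matching (name, address, length) triple. The helper names parse_map_failure and find_table are made up for this example, and the package is assumed to be laid out as repeating triples, as the quoted DSDT suggests.

```python
import re

# Contents of the SSDT package as quoted from the DSDT in this thread:
# repeating (name, physical address, length) triples, 0x0C elements in total.
SSDT = [
    "CPU0IST ", 0x7D5B3520, 0x00000475,
    "APIST ", 0x7D368CA0, 0x000001CF,
    "CPU0CST ", 0x7D398020, 0x00000C78,
    "APCST ", 0x7D422720, 0x0000008D,
]

ERROR_LINE = ("ACPI Error: Could not map memory at 0x000000007D422720, "
              "size 141 (20100702/exregion-178)")


def parse_map_failure(line):
    """Return (address, size) from a 'Could not map memory' message."""
    m = re.search(r"Could not map memory at (0x[0-9A-Fa-f]+), size (\d+)", line)
    if m is None:
        raise ValueError("not a map-failure line")
    # The address is printed in hex, the size in decimal.
    return int(m.group(1), 16), int(m.group(2))


def find_table(package, address, size):
    """Return the name of the triple matching address and size, or None."""
    for i in range(0, len(package), 3):
        name, addr, length = package[i:i + 3]
        if addr == address and length == size:
            return name.strip()
    return None


address, size = parse_map_failure(ERROR_LINE)
failing_table = find_table(SSDT, address, size)
```

Elements 0x0A and 0x0B of the package are the address and length that APCT passes to OperationRegion, and 0x8D is 141 in decimal, so the failing region is the APCST table.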
Kernel panic (https://bugzilla.kernel.org/attachment.cgi?id=28661) happens on 2.6.35-rc3 -> 2.6.35.4
AE_NO_MEMORY error happens on 2.6.36-rc1 -> 2.6.36-rc3 (the latest rc at the moment)
I'm bisecting right now to find the commit which fixed the panic and introduced the AE_NO_MEMORY error.
Ping.
Please let me know if this summary is inaccurate:
2.6.35.stable still panics on this box.
This is a Uni-processor running an SMP kernel.
2.6.36-rc doesn't panic, but gets a serious AE_NO_MEMORY error.
So we want to know why the panic went away.
We also want to know what the deal is with AE_NO_MEMORY.
Just for comparison, can you attach the dmesg for the latest kernel that works properly w/o any kernel cmdline workarounds? I suppose that would be 2.6.34.stable?
ping
Created attachment 30962 [details]
dmesg for 2.6.34.1
dmesg for the latest working 2.6.34.stable kernel (.1 at the moment)
2.6.34.2 - 2.6.34.7 crash with the same backtrace I provided earlier. This happens because the commit which triggered this bug was backported to 2.6.34 (http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.34.y.git;a=commit;h=3fd02a351fc8aa8602fe2b39d6a098ea2538db2e)
Finally found the commit which triggers the AE_NO_MEMORY error:
http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=commitdiff;h=35be1b716a475717611b2dc04185e9d80b9cb693
Without it the kernel crashes with the backtrace provided earlier. After applying it the kernel boots with the AE_NO_MEMORY error.
Created attachment 31072 [details]
dmesg for the latest 2.6.36-rc5 kernel
The most interesting part is the same:
[ 0.158142] ACPI: BIOS _OSI(Linux) query ignored
[ 0.163393] ACPI Error: Could not map memory at 0x000000007D422720, size 141 (20100702/exregion-178)
[ 0.163617] ACPI Exception: AE_NO_MEMORY, Returned by Handler for [SystemMemory] (20100702/evregion-474)
[ 0.163839] ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.APCT] (Node f70356c0), AE_NO_MEMORY
[ 0.164108] ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.GCAP] (Node f70356a8), AE_NO_MEMORY
[ 0.164374] ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1._PDC] (Node f7035678), AE_NO_MEMORY
[ 0.170071] ACPI: Interpreter enabled
[ 0.170166] ACPI: (supports S0 S3 S4 S5)
[ 0.170451] ACPI: Using IOAPIC for interrupt routing
[ 0.171417] [Firmware Bug]: ACPI: ACPI brightness control misses _BQC function
[ 0.181379] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
Also, on bootable kernels with the AE_NO_MEMORY error, powertop reports that the power usage (ACPI Estimate) of the whole system is an unbelievable 1.5W :)
Created attachment 31242 [details]
dmesg with CONFIG_ACPI_DEBUG=y
kernel booted with these params: "acpi.debug_layer=0x0000ffff acpi.debug_level=0x000000ff".
But even with this log I can't understand the source of my AE_NO_MEMORY error...
Name (SSDT, Package (0x0C)
{
    "CPU0IST ", 0x7D5B3520, 0x00000475,
    "APIST ", 0x7D368CA0, 0x000001CF,
    "CPU0CST ", 0x7D398020, 0x00000C78,
    "APCST ", 0x7D422720, 0x0000008D
})
[ 0.000000] BIOS-e820: 0000000000100000 - 000000007d6a1000 (usable)
The ACPI tables should be put in reserved memory, but as the above shows, all SSDTs are put in normal RAM.
This causes the ioremap code below to return NULL.
for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
	int is_ram = page_is_ram(pfn);

	if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
		return NULL;
	WARN_ON_ONCE(is_ram);
}
Roman,
Could you boot 2.6.36-rc5 with the kernel option below and attach the dmesg again?
memmap=2352K#0x7D368000
This will mark memory from 0x7D368000 to 0x7D5B4000 as ACPI data.
Thanks.
Roman, any update?
With boot option "memmap=2352K#0x7D368000" kernel 2.6.36-rc6 crashes with the same trace (it looks the same at first sight; I can take a photo in case there are changes in the trace that I didn't notice). Without this option the kernel boots with the AE_NO_MEMORY error.
The same situation on 2.6.36-rc8: crash with "memmap=2352K#0x7D368000", AE_NO_MEMORY error without it. What other info can I provide?
> "memmap=2352K#0x7D368000"
[ 0.000000] BIOS-e820: 0000000000100000 - 000000007d6a1000 (usable)
Hmm, how about a big stick to mark the whole top GB of this region as ACPI:
"memmap=1G#0x3D6A1000"
Marking this as a BIOS bug (is there a BIOS update available for it?)
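Both memmap= values suggested in this thread can be reproduced from numbers already posted. The following is a sketch of one way to arrive at them, not necessarily how they were originally computed: it assumes 4 KiB pages, takes the table addresses from the SSDT package and the usable range from the BIOS-e820 line, and uses hypothetical helper names (page_floor, page_ceil, covering_range, memmap_option).

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86

# (name, physical address, length) triples from the SSDT package in the DSDT.
TABLES = [
    ("CPU0IST", 0x7D5B3520, 0x475),
    ("APIST", 0x7D368CA0, 0x1CF),
    ("CPU0CST", 0x7D398020, 0xC78),
    ("APCST", 0x7D422720, 0x08D),
]

# "BIOS-e820: 0000000000100000 - 000000007d6a1000 (usable)", as [start, end).
USABLE_START, USABLE_END = 0x00100000, 0x7D6A1000


def page_floor(addr):
    """Round an address down to a page boundary."""
    return addr & ~(PAGE_SIZE - 1)


def page_ceil(addr):
    """Round an address up to a page boundary."""
    return (addr + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)


def in_usable_ram(addr, length):
    """True if [addr, addr + length) lies inside the e820 usable region."""
    return USABLE_START <= addr and addr + length <= USABLE_END


def covering_range(tables):
    """Smallest page-aligned [start, end) range that covers every table."""
    start = page_floor(min(addr for _, addr, _ in tables))
    end = page_ceil(max(addr + length for _, addr, length in tables))
    return start, end


def memmap_option(start, size):
    """Format a memmap=<size>#<start> option (marks the range as ACPI data)."""
    if size % (1 << 30) == 0:
        return f"memmap={size >> 30}G#0x{start:X}"
    return f"memmap={size // 1024}K#0x{start:X}"


start, end = covering_range(TABLES)
narrow = memmap_option(start, end - start)
big_stick = memmap_option(USABLE_END - (1 << 30), 1 << 30)
```

Under these assumptions all four tables sit inside the usable region, the narrow range comes out as 0x7D368000 to 0x7D5B4000 (2352K), and the "big stick" is the 1G that ends at the top of the usable region, 0x7D6A1000.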
Roman, does the problem still exist in the latest upstream kernel?
Ming, any update on this?
On mainline 2.6.37, the same firmware:
[ 0.155513] ACPI: EC: Look up EC in DSDT
[ 0.163415] [Firmware Bug]: ACPI: BIOS _OSI(Linux) query ignored
[ 0.168742] ACPI Error: Could not map memory at 0x000000007D422720, size 141 (20101013/exregion-178)
[ 0.168971] ACPI Exception: AE_NO_MEMORY, Returned by Handler for [SystemMemory] (20101013/evregion-474)
[ 0.169198] ACPI Error: Method parse/execution failed [\_PR_.CPU1.APCT] (Node f603a6c0), AE_NO_MEMORY (20101013/psparse-537)
[ 0.169517] ACPI Error: Method parse/execution failed [\_PR_.CPU1.GCAP] (Node f603a6a8), AE_NO_MEMORY (20101013/psparse-537)
[ 0.169832] ACPI Error: Method parse/execution failed [\_PR_.CPU1._PDC] (Node f603a678), AE_NO_MEMORY (20101013/psparse-537)
[ 0.180090] ACPI: Interpreter enabled
[ 0.180193] ACPI: (supports S0 S3 S4 S5)
[ 0.180488] ACPI: Using IOAPIC for interrupt routing
[ 0.181673] [Firmware Bug]: ACPI: No _BQC method, cannot determine initial brightness
[ 0.192650] ACPI Exception: AE_NOT_FOUND, Evaluating _PRW (20101013/scan-723)
[ 0.195214] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[ 0.195214] ACPI: Power Resource [FN00] (off)
[ 0.195214] ACPI: Power Resource [FN01] (off)
[ 0.195214] ACPI: No dock devices found.
So it looks like the error still exists. There is a new BIOS firmware available on the Samsung site. I'll try to flash it.
Same problem with the 2.6.37 kernel:
https://bugzilla.kernel.org/show_bug.cgi?id=16096
*** Bug 16096 has been marked as a duplicate of this bug. ***
*** Bug 34892 has been marked as a duplicate of this bug. ***
Created attachment 57282 [details]
Dell Vostro V13 BIOS A01 version full acpi dump
Added Dell Vostro V13 with Phoenix bios version A01 full acpi dump (compressed with bzip2)
Created attachment 57292 [details]
Full content of /proc/iomem
Created attachment 57302 [details]
Dell Vostro V13 full dmesg of kernel-2.6.37
Created attachment 57312 [details]
Dell Vostro V13 full dmesg of kernel-2.6.38
Could you try the workaround at comment 23?
That is, reboot the kernel with the option "memmap=1G#0x3D6A1000"
and then attach the new dmesg.
It causes the kernel to halt. See output here
http://ompldr.org/vOG5jMA/2011-05-11_10-41-56_749.jpg
Created attachment 57322 [details]
Dmesg of 2.6.39-rc7 git kernel with debug enabled
> It cause kernel hault. See output here
> http://ompldr.org/vOG5jMA/2011-05-11_10-41-56_749.jpg
I'm looking at this fault.
Is there any way to get the full dmesg up to the point where the kernel halts?
That would be very helpful.
Done, using the kernel option boot_delay and an increased acpi_dbg_level=0x1F.
Here is an archive of photos of all the dmesg output:
http://ompldr.org/vOG5ycQ/debug.tar.bz2
Created attachment 57482 [details]
Dmesg of 2.6.39-rc7 git kernel with acpi debug enabled
A summary, 2 problems here:
1. dynamic table memory map fails
This is because the dynamic ACPI tables are stored in normal RAM, which causes ioremap to fail, see comment 19.
Work around it with the kernel option "memmap=1G#0x3D6A1000"
2. kernel crashes
This may be caused by the wrong dynamic tables.
> ACPI: Dynamic OEM Table Load:
> ACPI: APIC 7dbfec92 0005A (v01 PTLD ? APIC 06040000 LTP 00000000)
A dynamic table should contain AML code, but this is an APIC table.
The kernel may crash because the interpreter tries to execute this non-AML table.
Please reboot the kernel without the "memmap=" option, then attach all the dynamic tables, as below.
Let's have a look at the contents of these tables.
acpidump --addr 0x7D421B20 --length 0x475 > CPU0IST.dat
acpidump --addr 0x7D3E0DA0 --length 0x1CF > APIST.dat
acpidump --addr 0x7D558A20 --length 0x54F > CPU0CST.dat
acpidump --addr 0x7D374020 --length 0x8D > APCST.dat
From the dmesg, this is a Uni-processor running an SMP kernel.
Please also attach the output of /proc/cpuinfo
If it's indeed a Uni-processor, then here is another problem.
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1._PDC]
\_PR_.CPU1._PDC should not be executed at all.
Created attachment 57492 [details]
cpuinfo
Attached /proc/cpuinfo; the ACPI tables and the dmesg of another hack will follow a bit later.
Created attachment 57502 [details]
fix processor_physically_present
Could you please test the attached patch for both CONFIG_SMP and !CONFIG_SMP cases?
You still need to use kernel option "memmap=1G#0x3D6A1000".
Please attach both dmesgs.
> acpidump --addr 0x7D421B20 --length 0x475 > CPU0IST.dat
> acpidump --addr 0x7D3E0DA0 --length 0x1CF > APIST.dat
> acpidump --addr 0x7D558A20 --length 0x54F > CPU0CST.dat
> acpidump --addr 0x7D374020 --length 0x8D > APCST.dat
ACPI tables not found
Created attachment 57662 [details]
dynamic acpi tables
Output of:
acpidump --addr 0x7D421B20 --length 0x475 > CPU0IST.dat
acpidump --addr 0x7D3E0DA0 --length 0x1CF > APIST.dat
acpidump --addr 0x7D558A20 --length 0x54F > CPU0CST.dat
acpidump --addr 0x7D374020 --length 0x8D > APCST.dat
Created attachment 57672 [details]
Patched with CONFIG_SMP
acpi tables dumped with this version of kernel
Created attachment 57682 [details]
Patched with CONFIG_NOSMP
Created attachment 57702 [details]
updated patch
OK, the patch works, but it's not good enough.
Could you try this updated one?
Created attachment 57772 [details]
dmesg of new version of patch with no SMP enabled
Created attachment 57782 [details]
Dmesg of new version of patch with SMP enabled
Thanks, the patch was sent out.
http://marc.info/?l=linux-acpi&m=130550812215476&w=2
Mark this bug as resolved.
For the memory map failure, that is a BIOS bug.
You can work around it with the kernel option "memmap=1G#0x3D6A1000" until the BIOS gets fixed.
A patch referencing this bug report has been merged in v3.0-rc1:
commit 932df7414336a00f45e5aec62724cf736b0bcfd4
Author: Lin Ming <ming.m.lin@intel.com>
Date:   Mon May 16 09:11:00 2011 +0800

    ACPI: processor: fix processor_physically_present in UP kernel