Bug 16548

Summary: 2.6.35 regression: acpi_ns_lookup+0x125/0x57e NULL pointer after ACPI Error: Could not map memory at 0x000000007D422720 - Samsung NP-X120-XA02 laptop
Product: ACPI
Reporter: Roman (embeter)
Component: BIOS
Assignee: Lin Ming (ming.m.lin)
Status: CLOSED CODE_FIX
Severity: normal
CC: achiang, acpi-bugzilla, akpm, anton.kochkov, deng, florian, lenb, Robert.Moore, rui.zhang
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.35
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: kernel config for 2.6.35
dmesg for 2.6.34
dmidecode for 2.6.34
acpidump on 2.6.34
full backtrace
dmesg for 2.6.34.1
dmesg for the latest 2.6.36-rc5 kernel
dmesg with CONFIG_ACPI_DEBUG=y
Dell Vostro V13 BIOS A01 version full acpi dump
Full content of /proc/iomem
Dell Vostro V13 full dmesg of kernel-2.6.37
Dell Vostro V13 full dmesg of kernel-2.6.38
Dmesg of 2.6.39-rc7 git kernel with debug enabled
Dmesg of 2.6.39-rc7 git kernel with acpi debug enabled
cpuinfo
fix processor_physically_present
dynamic acpi tables
Patched with CONFIG_SMP
Patched with CONFIG_NOSMP
updated patch
dmesg of new version of patch with no SMP enabled
Dmesg of new version of patch with SMP enabled

Description Roman 2010-08-09 12:33:28 UTC
Created attachment 27390 [details]
kernel config for 2.6.35

After updating from 2.6.34 to 2.6.35, my laptop fails to boot with this message:

ACPI: BIOS _OSI(Linux) query ignored
ACPI: APIC 7dbfee08 0005a (v01 PTLD ? APIC 06040000 LTP 00000000)
ACPI Error: found unknown opcode 0xE0 at AML address f8032e2e offset 0x2, ignoring (20100428/psloop-141) 
  <snip, a lot of errors about unknown opcode>
ACPI: Dynamic OEM Table Load:
ACPI: APIC (null) 0005A (v01 PTLD ? APIC 06040000 LTP 00000000)
ACPI: Interpreter enabled
ACPI: Supports S0 S3 S4 S5
BUG: Unable to handle kernel NULL pointer dereference at 0x00000005
IP: [<c125b489>] acpi_ns_lookup+0x125/0x57e

BIOS update didn't help. Booting with acpi=off solves the issue.
Using git-bisect I've found a commit that causes this panic: https://patchwork.kernel.org/patch/106696/ which was a fix for https://bugzilla.kernel.org/show_bug.cgi?id=16357

Reverting this commit solves the issue.
Comment 1 Roman 2010-08-09 12:35:09 UTC
Created attachment 27391 [details]
dmesg for 2.6.34
Comment 2 Roman 2010-08-09 12:35:36 UTC
Created attachment 27392 [details]
dmidecode for 2.6.34
Comment 3 Roman 2010-08-09 12:42:57 UTC
Created attachment 27393 [details]
acpidump on 2.6.34
Comment 4 Zhang Rui 2010-08-16 05:27:56 UTC
caused by commit 856b185dd23da39e562983fbf28860f54e661b41.

commit 856b185dd23da39e562983fbf28860f54e661b41
Author: Alex Chiang <achiang@canonical.com>
Date:   Thu Jun 17 09:08:54 2010 -0600

    ACPI: processor: fix processor_physically_present on UP
    
    The commit 5d554a7bb06 (ACPI: processor: add internal
    processor_physically_present()) is broken on uniprocessor (UP)
    configurations, as acpi_get_cpuid() will always return -1.
    
    We use the value of num_possible_cpus() to tell us whether we got
    an invalid cpuid from acpi_get_cpuid() in the SMP case, or if
    instead, we are UP, in which case num_possible_cpus() is #defined
    as 1.
    
    We use num_possible_cpus() instead of num_online_cpus() to
    protect ourselves against the scenario of CPU hotplug, and we've
    taken down all the CPUs except one.
    
    Thanks to Jan Pogadl for initial report and analysis and Chen
    Gong for review.
    
    https://bugzilla.kernel.org/show_bug.cgi?id=16357
    
    Reported-by: Jan Pogadl <pogadl.jan@googlemail.com>:
    Reviewed-by: Chen Gong <gong.chen@linux.intel.com>
    Signed-off-by: Alex Chiang <achiang@canonical.com>
    Signed-off-by: Len Brown <len.brown@intel.com>

diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 5128435..e9699aa 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -223,7 +223,7 @@ static bool processor_physically_present(acpi_handle handle)
        type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
        cpuid = acpi_get_cpuid(handle, type, acpi_id);
 
-       if (cpuid == -1)
+       if ((cpuid == -1) && (num_possible_cpus() > 1))
                return false;
 
        return true;
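
For readers following the diff, here is a minimal user-space sketch of the changed condition. It is not kernel code: present_before() and present_after() are hypothetical names, and their cpuid and possible_cpus parameters only stand in for the values returned by acpi_get_cpuid() and num_possible_cpus().

```c
#include <assert.h>
#include <stdbool.h>

/* Check as it stood before the diff above: any cpuid of -1 is rejected.
 * possible_cpus is accepted only so both helpers share one signature. */
bool present_before(int cpuid, unsigned int possible_cpus)
{
    assert(possible_cpus >= 1);   /* there is always at least one CPU */
    return cpuid != -1;
}

/* Check as changed by the diff above: a cpuid of -1 is rejected only
 * when more than one CPU is possible. */
bool present_after(int cpuid, unsigned int possible_cpus)
{
    assert(possible_cpus >= 1);
    if (cpuid == -1 && possible_cpus > 1)
        return false;
    return true;
}
```

With possible_cpus equal to 1, present_after() returns true even for a cpuid of -1, so the patched check no longer rejects any processor object in that configuration. With more than one possible CPU, a cpuid of -1 is still rejected.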
Comment 5 Roman 2010-08-17 05:48:17 UTC
The crash is gone in kernel 2.6.36-rc1 (new 20100702 ACPI code revision?).
I guess the likely cause is buggy ACPI code in the BIOS. Now, instead of crashing, the kernel prints this:

ACPI: EC: Look up EC in DSDT
ACPI: BIOS _OSI(Linux) query ignored
ACPI Error: Could not map memory at 0x000000007D422720, size 141 (20100702/exregion-178)
ACPI Exception: AE_NO_MEMORY, Returned by Handler for [SystemMemory] (20100702/evregion-474)
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.APCT] (Node f70316c0), AE_NO_MEMORY
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.GCAP] (Node f70316a8), AE_NO_MEMORY
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1._PDC] (Node f7031678), AE_NO_MEMORY
ACPI: Interpreter enabled
ACPI: (supports S0 S3 S4 S5)
ACPI: Using IOAPIC for interrupt routing
Comment 6 Alex Chiang 2010-08-18 14:42:26 UTC
Roman, thanks for testing again.

btw, I think your bisection run was a little bit misleading. The initial crash was this:

BUG: Unable to handle kernel NULL pointer dereference at 0x00000005
IP: [<c125b489>] acpi_ns_lookup+0x125/0x57e

As you can see, that's not a part of my patch. I think what happened is that the patch fixed the bad logic introduced by 5d554a7bb06, and we started properly returning false on your machine, which revealed a latent bug somewhere else in ACPI.

When you reverted my patch, we incorrectly returned true on your machine, thus masking the symptoms of the buggy ACPI code, allowing your machine to boot.

In any case, it looks like your namespace is still broken, but at least you're not crashing anymore.
Comment 7 Andrew Morton 2010-08-26 23:50:26 UTC
Guys, which commit fixed this in 2.6.36-rc1?

Did that commit have the cc:stable in the changelog?

Thanks
Comment 8 Alex Chiang 2010-08-27 16:11:35 UTC
Andrew,

We never narrowed it down; it just went away. :-/

If Roman has time it would be great if he could identify the good commit via another bisection run, but it's up to him as to whether he has the time/energy to do so.

Thanks.
Comment 9 Robert Moore 2010-08-31 01:36:21 UTC
ACPI Error: Could not map memory at 0x000000007D422720, size 141 (20100702/exregion-178)
ACPI Exception: AE_NO_MEMORY, Returned by Handler for [SystemMemory] (20100702/evregion-474)
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.APCT] (Node f70316c0), AE_NO_MEMORY

It's doing a dynamic table load:

Method (APCT, 0, NotSerialized)
{
    If (LAnd (And (CFGD, 0xF0), LNot (And (SDTL, 0x20
        ))))
    {
        Or (SDTL, 0x20, SDTL)
        OperationRegion (CST1, SystemMemory, DerefOf (Index (SSDT, 0x0A)),
           DerefOf (Index (SSDT, 0x0B)))
        Load (CST1, HC1)
    }
}


        Name (SSDT, Package (0x0C)
        {
            "CPU0IST ", 
            0x7D5B3520, 
            0x00000475, 
            "APIST   ", 
            0x7D368CA0, 
            0x000001CF, 
            "CPU0CST ", 
            0x7D398020, 
            0x00000C78, 
            "APCST   ", 
            0x7D422720, 
            0x0000008D
        })



Index 0x0A into the SSDT package is in fact 0x7D422720.
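
The lookup described above can be sketched in plain C. This is only an illustration under stated assumptions: the package is reduced to its numeric slots (the name strings are replaced by 0), and ssdt_pkg, struct region and region_for_entry() are hypothetical names, not ACPICA identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Numeric view of the SSDT package quoted above. Each entry occupies
 * three slots: name (shown here as 0), address, length. */
static const uint32_t ssdt_pkg[0x0C] = {
    0, 0x7D5B3520, 0x00000475,   /* "CPU0IST " */
    0, 0x7D368CA0, 0x000001CF,   /* "APIST   " */
    0, 0x7D398020, 0x00000C78,   /* "CPU0CST " */
    0, 0x7D422720, 0x0000008D,   /* "APCST   " */
};

struct region {
    uint32_t addr;
    uint32_t len;
};

/* Address/length pair for the entry-th table (0..3). Entry 3 reads
 * slots 0x0A and 0x0B, the ones used by the OperationRegion in APCT. */
struct region region_for_entry(size_t entry)
{
    assert(entry < 4);
    struct region r = { ssdt_pkg[3 * entry + 1], ssdt_pkg[3 * entry + 2] };
    return r;
}
```

region_for_entry(3) yields address 0x7D422720 and length 0x8D, which is 141 in decimal, the same address and size reported in the "Could not map memory" message.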

ACPI Error: Could not map memory at 0x000000007D422720, size 141

So, for whatever reason, the memory mapping failed.
Comment 10 Len Brown 2010-08-31 02:14:25 UTC
Roman,
The AE_NO_MEMORY error messages in 2.6.36 are still an important failure.
Can you tell us when they started?

Can you tell us what change caused the acpi_ns_lookup fault to go away?
Did the acpi_ns_lookup fault only happen when preceded by the map failure?
Is it possible to get a photo of the stack trace upon the ns_lookup fault?
Comment 11 Roman 2010-08-31 18:24:40 UTC
Created attachment 28661 [details]
full backtrace

Some typos are possible. I can also attach the source photos if anyone is interested.
Comment 12 Roman 2010-08-31 18:30:00 UTC
The kernel panic (https://bugzilla.kernel.org/attachment.cgi?id=28661) happens on 2.6.35-rc3 -> 2.6.35.4.
The AE_NO_MEMORY error happens on 2.6.36-rc1 -> 2.6.36-rc3 (the latest rc at the moment).

I'm bisecting right now to find a commit which fixed the panic and introduced AE_NO_MEMORY error.
Comment 13 Len Brown 2010-09-14 01:45:19 UTC
Ping.

Please let me know if this summary is inaccurate:
2.6.35.stable still panics on this box.
This is a uniprocessor running an SMP kernel.
2.6.36-rc doesn't panic, but gets a serious AE_NO_MEMORY error.

So we want to know why the panic went away.
We also want to know what the deal is with AE_NO_MEMORY.


Just for comparison, can you attach the dmesg for the
latest kernel that works properly w/o any kernel cmdline workarounds?
I suppose that would be 2.6.34.stable?
Comment 14 Len Brown 2010-09-21 01:57:30 UTC
ping
Comment 15 Roman 2010-09-22 11:36:51 UTC
Created attachment 30962 [details]
dmesg for 2.6.34.1

dmesg for the latest working 2.6.34.stable kernel (.1 at the moment)

2.6.34.2 through 2.6.34.7 crash with the same backtrace I provided earlier. This happens because the commit that triggered this bug was backported to 2.6.34 (http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.34.y.git;a=commit;h=3fd02a351fc8aa8602fe2b39d6a098ea2538db2e)
Comment 16 Roman 2010-09-23 08:46:40 UTC
Finally found the commit that triggers the AE_NO_MEMORY error: http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=commitdiff;h=35be1b716a475717611b2dc04185e9d80b9cb693

Without it, the kernel crashes with the backtrace provided earlier. With it applied, the kernel boots with the AE_NO_MEMORY error.
Comment 17 Roman 2010-09-23 10:22:30 UTC
Created attachment 31072 [details]
dmesg for the latest 2.6.36-rc5 kernel

The most interesting part is the same:

[    0.158142] ACPI: BIOS _OSI(Linux) query ignored
[    0.163393] ACPI Error: Could not map memory at 0x000000007D422720, size 141 (20100702/exregion-178)
[    0.163617] ACPI Exception: AE_NO_MEMORY, Returned by Handler for [SystemMemory] (20100702/evregion-474)
[    0.163839] ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.APCT] (Node f70356c0), AE_NO_MEMORY
[    0.164108] ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1.GCAP] (Node f70356a8), AE_NO_MEMORY
[    0.164374] ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1._PDC] (Node f7035678), AE_NO_MEMORY
[    0.170071] ACPI: Interpreter enabled
[    0.170166] ACPI: (supports S0 S3 S4 S5)
[    0.170451] ACPI: Using IOAPIC for interrupt routing
[    0.171417] [Firmware Bug]: ACPI: ACPI brightness control misses _BQC function
[    0.181379] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62


Also, on bootable kernels with the AE_NO_MEMORY error, powertop reports that the power usage (ACPI estimate) of the whole system is an unbelievable 1.5W :)
Comment 18 Roman 2010-09-24 06:42:28 UTC
Created attachment 31242 [details]
dmesg with CONFIG_ACPI_DEBUG=y

The kernel was booted with these parameters: "acpi.debug_layer=0x0000ffff acpi.debug_level=0x000000ff".

But even with this log I can't work out the source of my AE_NO_MEMORY error...
Comment 19 Lin Ming 2010-10-08 03:00:34 UTC
        Name (SSDT, Package (0x0C)
        {
            "CPU0IST ", 
            0x7D5B3520, 
            0x00000475, 
            "APIST   ", 
            0x7D368CA0, 
            0x000001CF, 
            "CPU0CST ", 
            0x7D398020, 
            0x00000C78, 
            "APCST   ", 
            0x7D422720, 
            0x0000008D
        })

[    0.000000]  BIOS-e820: 0000000000100000 - 000000007d6a1000 (usable)

The ACPI tables should be placed in reserved memory, but as the e820 line above shows, all the SSDTs are in normal RAM.
This causes the ioremap code below to return NULL.

        for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
                int is_ram = page_is_ram(pfn);

                if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
                        return NULL;
                WARN_ON_ONCE(is_ram);
        }
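
To see how that loop applies to the failing address, here is a simplified user-space model. It rests on loudly stated assumptions: a page is treated as ordinary RAM exactly when it lies in the e820 "usable" range quoted above (taken as end-exclusive), pfn_valid() is assumed true and PageReserved() false for such pages, and pfn_is_plain_ram() and ioremap_would_refuse() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT   12            /* 4 KiB pages */
#define USABLE_START 0x00100000u   /* e820 usable range from the dmesg line */
#define USABLE_END   0x7d6a1000u   /* treated as exclusive */

/* Assumption: pages inside the usable range are valid, non-reserved RAM. */
bool pfn_is_plain_ram(uint32_t pfn)
{
    return pfn >= (USABLE_START >> PAGE_SHIFT) &&
           pfn <  (USABLE_END   >> PAGE_SHIFT);
}

/* True if the loop above would hit "return NULL" for [phys, phys + size). */
bool ioremap_would_refuse(uint32_t phys, uint32_t size)
{
    assert(size > 0);
    uint32_t last_pfn = (phys + size - 1) >> PAGE_SHIFT;

    for (uint32_t pfn = phys >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
        if (pfn_is_plain_ram(pfn))
            return true;
    }
    return false;
}
```

Under these assumptions, ioremap_would_refuse(0x7D422720, 141) is true, which is consistent with the "Could not map memory" error, while a range starting at 0x7d6a1000 would not be refused.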

Roman,

Could you boot 2.6.36-rc5 with the kernel option below and attach the dmesg again?
memmap=2352K#0x7D368000

This will mark memory from 0x7D368000 to 0x7D5B4000 as ACPI data.
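
The arithmetic behind that option can be checked with a small sketch. The helper names memmap_kib() and covers_all_tables() and the struct layout are hypothetical; the table addresses and lengths are the ones from the SSDT package quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct range {
    uint32_t start;
    uint32_t end;    /* exclusive */
};

/* Range described by "memmap=<size_kib>K#<start>". */
struct range memmap_kib(uint32_t size_kib, uint32_t start)
{
    struct range r = { start, start + size_kib * 1024u };
    assert(r.end > r.start);   /* no 32-bit wrap-around */
    return r;
}

/* The four dynamic tables listed in the SSDT package. */
static const struct {
    uint32_t addr;
    uint32_t len;
} tables[] = {
    { 0x7D5B3520, 0x475 },   /* CPU0IST */
    { 0x7D368CA0, 0x1CF },   /* APIST   */
    { 0x7D398020, 0xC78 },   /* CPU0CST */
    { 0x7D422720, 0x08D },   /* APCST   */
};

/* True if every table lies entirely inside r. */
bool covers_all_tables(struct range r)
{
    for (size_t i = 0; i < sizeof tables / sizeof tables[0]; i++) {
        if (tables[i].addr < r.start ||
            tables[i].addr + tables[i].len > r.end)
            return false;
    }
    return true;
}
```

memmap_kib(2352, 0x7D368000) gives the range 0x7D368000 to 0x7D5B4000, since 2352 KiB is 0x24C000 bytes, and covers_all_tables() returns true for it.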

Thanks.
Comment 20 Lin Ming 2010-10-15 00:49:57 UTC
Roman, any update?
Comment 21 Roman 2010-10-17 07:06:32 UTC
With the boot option "memmap=2352K#0x7D368000", kernel 2.6.36-rc6 crashes with the same trace (it looks the same at first sight; I can take a photo in case the trace changed in ways I didn't notice). Without this option the kernel boots with the AE_NO_MEMORY error.
Comment 22 Roman 2010-10-17 09:03:17 UTC
The same situation on 2.6.36-rc8: crash with "memmap=2352K#0x7D368000", AE_NO_MEMORY error without it. What other info can I provide?
Comment 23 Len Brown 2010-10-19 02:50:35 UTC
> "memmap=2352K#0x7D368000

[    0.000000]  BIOS-e820: 0000000000100000 - 000000007d6a1000 (usable)


Hmm, how about a big stick to mark the whole top GB of this region as ACPI:
"memmap=1G#0x3D6A1000"

Marking this as a BIOS bug (is there a BIOS update available for it?)
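
For reference, the start address in that option follows from the end of the e820 usable range quoted above. A minimal sketch, with hypothetical helper names top_gib_start() and in_top_gib():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GIB 0x40000000u   /* 1 GiB in bytes */

/* Start address for "memmap=1G#<start>" such that the marked range
 * ends exactly at usable_end. */
uint32_t top_gib_start(uint32_t usable_end)
{
    assert(usable_end >= GIB);
    return usable_end - GIB;
}

/* True if addr falls inside the top GiB below usable_end. */
bool in_top_gib(uint32_t addr, uint32_t usable_end)
{
    return addr >= top_gib_start(usable_end) && addr < usable_end;
}
```

top_gib_start(0x7d6a1000) is 0x3D6A1000, the start address used in the suggested option, and the failing address 0x7D422720 lies inside that range.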
Comment 24 Zhang Rui 2010-12-27 01:27:31 UTC
Roman, does the problem still exist in the latest upstream kernel?

Ming, any update on this?
Comment 25 Roman 2011-01-06 19:27:11 UTC
On mainline 2.6.37, the same firmware:

[    0.155513] ACPI: EC: Look up EC in DSDT
[    0.163415] [Firmware Bug]: ACPI: BIOS _OSI(Linux) query ignored
[    0.168742] ACPI Error: Could not map memory at 0x000000007D422720, size 141 (20101013/exregion-178)
[    0.168971] ACPI Exception: AE_NO_MEMORY, Returned by Handler for [SystemMemory] (20101013/evregion-474)
[    0.169198] ACPI Error: Method parse/execution failed [\_PR_.CPU1.APCT] (Node f603a6c0), AE_NO_MEMORY (20101013/psparse-537)
[    0.169517] ACPI Error: Method parse/execution failed [\_PR_.CPU1.GCAP] (Node f603a6a8), AE_NO_MEMORY (20101013/psparse-537)
[    0.169832] ACPI Error: Method parse/execution failed [\_PR_.CPU1._PDC] (Node f603a678), AE_NO_MEMORY (20101013/psparse-537)
[    0.180090] ACPI: Interpreter enabled
[    0.180193] ACPI: (supports S0 S3 S4 S5)
[    0.180488] ACPI: Using IOAPIC for interrupt routing
[    0.181673] [Firmware Bug]: ACPI: No _BQC method, cannot determine initial brightness
[    0.192650] ACPI Exception: AE_NOT_FOUND, Evaluating _PRW (20101013/scan-723)
[    0.195214] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[    0.195214] ACPI: Power Resource [FN00] (off)
[    0.195214] ACPI: Power Resource [FN01] (off)
[    0.195214] ACPI: No dock devices found.

So it looks like the error still exists.
There is a new BIOS firmware available on the Samsung site. I'll try to flash it.
Comment 26 Anton Kochkov 2011-03-15 21:16:08 UTC
Same problem with the 2.6.37 kernel: https://bugzilla.kernel.org/show_bug.cgi?id=16096
Comment 27 Zhang Rui 2011-03-21 07:20:25 UTC
*** Bug 16096 has been marked as a duplicate of this bug. ***
Comment 28 Lin Ming 2011-05-11 03:20:04 UTC
*** Bug 34892 has been marked as a duplicate of this bug. ***
Comment 29 Anton Kochkov 2011-05-11 04:18:12 UTC
Created attachment 57282 [details]
Dell Vostro V13 BIOS A01 version full acpi dump

Added the full ACPI dump for a Dell Vostro V13 with Phoenix BIOS version A01 (compressed with bzip2)
Comment 30 Anton Kochkov 2011-05-11 04:18:58 UTC
Created attachment 57292 [details]
Full content of /proc/iomem
Comment 31 Anton Kochkov 2011-05-11 04:20:00 UTC
Created attachment 57302 [details]
Dell Vostro V13 full dmesg of kernel-2.6.37
Comment 32 Anton Kochkov 2011-05-11 04:20:52 UTC
Created attachment 57312 [details]
Dell Vostro V13 full dmesg of kernel-2.6.38
Comment 33 Lin Ming 2011-05-11 06:01:27 UTC
Could you try the workaround from comment 23?
That is, reboot the kernel with the option "memmap=1G#0x3D6A1000".

Then attach the new dmesg.
Comment 34 Anton Kochkov 2011-05-11 07:29:30 UTC
It causes a kernel halt. See the output here: http://ompldr.org/vOG5jMA/2011-05-11_10-41-56_749.jpg
Comment 35 Anton Kochkov 2011-05-11 10:31:34 UTC
Created attachment 57322 [details]
Dmesg of 2.6.39-rc7 git kernel with debug enabled
Comment 36 Lin Ming 2011-05-12 01:13:39 UTC
> It cause kernel hault. See output here
> http://ompldr.org/vOG5jMA/2011-05-11_10-41-56_749.jpg

I'm looking at this fault.
Is there any way to get the full dmesg at the point where the kernel halts?
That would be very helpful.
Comment 37 Anton Kochkov 2011-05-12 02:51:29 UTC
Done, using the boot_delay kernel option and an increased acpi_dbg_level=0x1F.

Here is an archive of photos of all the dmesg output: http://ompldr.org/vOG5ycQ/debug.tar.bz2
Comment 38 Anton Kochkov 2011-05-12 03:33:34 UTC
Created attachment 57482 [details]
Dmesg of 2.6.39-rc7 git kernel with acpi debug enabled
Comment 39 Lin Ming 2011-05-12 05:23:43 UTC
A summary: there are 2 problems here.

1. The dynamic table memory mapping fails

This is because the dynamic ACPI tables are stored in normal RAM, which causes ioremap to fail; see comment 19.

It can be worked around with the kernel option "memmap=1G#0x3D6A1000".

2. The kernel crashes

This may be caused by the wrong dynamic tables.

> ACPI: Dynamic OEM Table Load:
> ACPI: APIC 7dbfec92 0005A (v01 PTLD  ? APIC    06040000 LTP 00000000)

A dynamic table should contain AML code, but this is an APIC table.
The kernel may crash because the interpreter tries to execute this non-AML table.


Please reboot the kernel without the "memmap=" option, then attach all the dynamic tables, dumped as below. Let's have a look at the contents of these tables.

acpidump --addr 0x7D421B20 --length 0x475 > CPU0IST.dat
acpidump --addr 0x7D3E0DA0 --length 0x1CF > APIST.dat
acpidump --addr 0x7D558A20 --length 0x54F > CPU0CST.dat
acpidump --addr 0x7D374020 --length  0x8D > APCST.dat
Comment 40 Lin Ming 2011-05-12 05:39:59 UTC
From the dmesg, this is a Uni-processor running an SMP kernel.

Please also attach the output of /proc/cpuinfo

If it's indeed a uniprocessor, then here is another problem.
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU1._PDC]

\_PR_.CPU1._PDC should not be executed at all.
Comment 41 Anton Kochkov 2011-05-12 05:50:43 UTC
Created attachment 57492 [details]
cpuinfo 

Attached /proc/cpuinfo; the ACPI tables and the dmesg from another hack will follow a bit later.
Comment 42 Lin Ming 2011-05-12 08:14:43 UTC
Created attachment 57502 [details]
fix processor_physically_present

Could you please test the attached patch for both CONFIG_SMP and !CONFIG_SMP cases?

You still need to use kernel option "memmap=1G#0x3D6A1000".

Please attach both dmesgs.
Comment 43 Anton Kochkov 2011-05-12 12:44:42 UTC
> acpidump --addr 0x7D421B20 --length 0x475 > CPU0IST.dat
> acpidump --addr 0x7D3E0DA0 --length 0x1CF > APIST.dat
> acpidump --addr 0x7D558A20 --length 0x54F > CPU0CST.dat
> acpidump --addr 0x7D374020 --length  0x8D > APCST.dat

ACPI tables not found
Comment 44 Anton Kochkov 2011-05-12 23:04:25 UTC
Created attachment 57662 [details]
dynamic acpi tables

Output of:

acpidump --addr 0x7D421B20 --length 0x475 > CPU0IST.dat
acpidump --addr 0x7D3E0DA0 --length 0x1CF > APIST.dat
acpidump --addr 0x7D558A20 --length 0x54F > CPU0CST.dat
acpidump --addr 0x7D374020 --length  0x8D > APCST.dat
Comment 45 Anton Kochkov 2011-05-12 23:05:32 UTC
Created attachment 57672 [details]
Patched with CONFIG_SMP

The ACPI tables were dumped with this version of the kernel.
Comment 46 Anton Kochkov 2011-05-12 23:06:10 UTC
Created attachment 57682 [details]
Patched with CONFIG_NOSMP
Comment 47 Lin Ming 2011-05-13 05:53:41 UTC
Created attachment 57702 [details]
updated patch

OK, the patch works, but it's not good enough.

Could you try this updated one?
Comment 48 Anton Kochkov 2011-05-14 03:05:05 UTC
Created attachment 57772 [details]
dmesg of new version of patch with no SMP enabled
Comment 49 Anton Kochkov 2011-05-14 03:05:51 UTC
Created attachment 57782 [details]
Dmesg of new version of patch with SMP enabled
Comment 50 Lin Ming 2011-05-16 01:15:36 UTC
Thanks, the patch has been sent out.
http://marc.info/?l=linux-acpi&m=130550812215476&w=2

Marking this bug as resolved.

The memory mapping failure is a BIOS bug.
You can work around it with the kernel option "memmap=1G#0x3D6A1000" until the BIOS gets fixed.
Comment 51 Florian Mickler 2011-05-30 08:02:10 UTC
A patch referencing this bug report has been merged in v3.0-rc1:

commit 932df7414336a00f45e5aec62724cf736b0bcfd4
Author: Lin Ming <ming.m.lin@intel.com>
Date:   Mon May 16 09:11:00 2011 +0800

    ACPI: processor: fix processor_physically_present in UP kernel