Bug 9429

Summary: acpi-cpufreq boot crash
Product: ACPI Reporter: Len Brown (lenb)
Component: ACPICA-CoreAssignee: Robert Moore (Robert.Moore)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, craftyguy, Robert.Moore
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.20-2.6.23-rc3 Subsystem:
Regression: --- Bisected commit-id:
Attachments: console log 2.6.23-rc3+
console log 2.6.24-rc3+ w/ CONFIG_DEBUG_SLAB=n
acpidump
failing config-2.6.24-rc3+
console log 2.6.24-rc3+ 64-bit
console log 2.6.24+ w/o ACPI tracing & debugging
CPU0IST.dat
CPU1IST.dat
console log 2.6.24+ w/ ACPI tracing & debugging
patch vs 2.6.24-rc3
patch from comment #20 backported to linux-2.6.16, 2.6.17
patch from comment #20 backported to linux-2.6.18, 2.6.19, 2.6.20

Description Len Brown 2007-11-21 10:58:32 UTC
With CONFIG_DEBUG_SLAB=y or CONFIG_SLUB_DEBUG=y,
the kernel crashes upon modprobe acpi-cpufreq.
This is true from 2.6.20 (earliest tested)
through 2.6.23, and includes 2.6.24-rc3

BUG: unable to handle kernel paging request at virtual address 6b6b6b71
printing eip: c025d8ea *pde = 00000000
Oops: 0000 [#1] SMP
Modules linked in: acpi_cpufreq freq_table dm_mod parport_pc snd_hda_intel parport ehci_hcd button thermal intel_agp agpgart

Pid: 2496, comm: modprobe Not tainted (2.6.24-rc3-g2ffbb837 #3)
EIP: 0060:[<c025d8ea>] EFLAGS: 00010202 CPU: 0
EIP is at acpi_ns_get_parent_node+0x6/0x14
EAX: 6b6b6b6b EBX: 0000000a ECX: f75ea650 EDX: 00000000
ESI: f75ea650 EDI: f75ea650 EBP: c0266aef ESP: f76f4d94
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process modprobe (pid: 2496, ti=f76f4000 task=f7f5f080 task.ti=f76f4000)
Stack: c025d6c9 f76f4dbc c0266a20 f75ea710 00000000 00000000 c0266aef c0266ee3
       f76f4dec c0266b12 00000010 f7c04458 00000000 c0264ecf f76f4dec f7c04de8
       f75ea590 f76f4e38 f7047c6c 00000000 c0266ac7 f76f4dec 00000074 00000020
Call Trace:
 [<c025d6c9>] acpi_ns_get_pathname_length+0xd/0x24
 [<c0266a20>] acpi_ut_get_simple_object_size+0x70/0xd9
 [<c0266aef>] acpi_ut_get_element_length+0x0/0x43
 [<c0266ee3>] acpi_ut_create_generic_state+0x24/0x3d
 [<c0266b12>] acpi_ut_get_element_length+0x23/0x43
 [<c0264ecf>] acpi_ut_walk_package_tree+0x4a/0xb7
 [<c0266ac7>] acpi_ut_get_object_size+0x3e/0x66
 [<c025d0c7>] acpi_evaluate_object+0x1f2/0x297
 [<c04265be>] __mutex_lock_slowpath+0x1ee/0x1f6
 [<c02718b0>] acpi_processor_preregister_performance+0xb6/0x55e
 [<c0166e96>] percpu_populate+0x20/0x26
 [<c01659ab>] __kmalloc+0xb2/0xe7
 [<c0166e96>] percpu_populate+0x20/0x26
 [<c022805d>] __next_cpu+0x12/0x1e
 [<c0166f3b>] __percpu_populate_mask+0x55/0x65
 [<f883605d>] acpi_cpufreq_init+0x5d/0x8f [acpi_cpufreq]
 [<c01426e5>] sys_init_module+0x14c7/0x15ea
 [<c01688e8>] do_sync_read+0xc7/0x10a
 [<c0103dc6>] sysenter_past_esp+0x5f/0x89
 [<c0420000>] gss_pipe_downcall+0xc7/0x30a
 =======================
Code: 04 8d 04 82 89 41 0c 5b c3 89 c2 8d 40 ff 83 f8 fd 76 06 a1 24 92 61 c0 c3 31 c0 80 7a 04 0f 75 02 89 d0 c3 c3 31 d2 85 c0 74 0b <f6> 40 06 01 8b 40 10 74 f7 89 c2 89 d0 c3 31 d2 f6 40 06 01 75
EIP: [<c025d8ea>] acpi_ns_get_parent_node+0x6/0x14 SS:ESP 0068:f76f4d94
/etc/rc5.d/S06cpuspeed: line 112:  2496 Segmentation fault      /sbin/modprobe acpi-cpufreq 2> /dev/null
Comment 1 Len Brown 2007-11-21 11:01:24 UTC
Created attachment 13673 [details]
console log 2.6.23-rc3+
Comment 2 Len Brown 2007-11-21 11:11:02 UTC
Created attachment 13674 [details]
console log 2.6.24-rc3+ w/ CONFIG_DEBUG_SLAB=n
Comment 3 Len Brown 2007-11-21 11:11:37 UTC
Created attachment 13675 [details]
acpidump
Comment 4 Len Brown 2007-11-21 11:13:51 UTC
typo: comment #1 is 2.6.24-rc3+, not .23-rc3+

The system under test is a Weybridge SDV with a dual core processor.
Comment 5 Venkatesh Pallipadi 2007-11-21 16:32:02 UTC
Len,

Can you attach your config.

Also, do yo see the problem with 64 bit kernel as well?
Comment 6 Venkatesh Pallipadi 2007-11-21 17:02:55 UTC

0x6b is POISON_FREE. SLAB and SLUB debug sets the freed area to 0x6b. But, somewhere in the linked list there is still an active pointer to some freed area which we try to access while parsing the list causing this failure.

It must be in namespace/ns_alloc.c, in the way we allocate free nodes.
Comment 7 Len Brown 2007-11-21 23:29:29 UTC
Created attachment 13689 [details]
failing config-2.6.24-rc3+

here's a config that fails.  disable SLAB debug and it will succeed.
I enabled slab debugging on configs for 2.6.23, 2.6.22, 2.6.21, 2.6.20
and they all failed the same way -- so this is easy to reproduce --
at least with this BIOS on this board.  Note, however, that I don't
recall seeing this failure until after a recent processor & BIOS upgrade.
Comment 8 Len Brown 2007-11-22 00:23:13 UTC
Created attachment 13690 [details]
console log 2.6.24-rc3+ 64-bit

Yes, 64-bit mode fails too.  Attached is a console log from a 64-bit crash.
Comment 9 Len Brown 2007-11-22 07:07:12 UTC
CONFIG_SMP=n has same failure.

Checking for hardware changes [  OK  ]
general protection fault: 0000 [1]
CPU 0
Modules linked in: acpi_cpufreq pata_acpi e1000e thermal
Pid: 2365, comm: modprobe Not tainted 2.6.24-rc3-g2ffbb837-dirty #5
RIP: 0010:[<ffffffff80359c2b>]  [<ffffffff80359c2b>] acpi_ns_get_parent_node+0x7/0x15
RSP: 0018:ffff81007b403d10  EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000005 RCX: ffff81007b403d98
RDX: ffff81007d956198 RSI: ffff81007b403d40 RDI: a56b6b6b6b6b6b6b
RBP: 0000000000000000 R08: ffff81007d956198 R09: ffff81007b5726e8
R10: ffffffff8063dc76 R11: ffffffff803635f2 R12: ffff81007d956198
R13: ffff81007b403d98 R14: ffffffff803631d4 R15: ffff81007b403e48
FS:  00002b1681bbc6f0(0000) GS:ffffffff8070e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b4176f4d000 CR3: 000000007c09e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 2365, threadinfo ffff81007b402000, task ffff81007c4cb040)
Stack:  ffffffff803599ce ffff81007b403d40 ffffffff8036310a ffff81007b403d98
 ffffffff803631ff ffff81007b5725b0 0000000000000018 ffff81007b5726e8
 ffffffff80361744 0000000000004001 ffff81007b5725b0 ffff81007b403de8
Call Trace:
 [<ffffffff803599ce>] acpi_ns_get_pathname_length+0xa/0x31
 [<ffffffff8036310a>] acpi_ut_get_simple_object_size+0x72/0xd6
 [<ffffffff803631ff>] acpi_ut_get_element_length+0x2b/0x4f
 [<ffffffff80361744>] acpi_ut_walk_package_tree+0x5b/0xd5
 [<ffffffff803631a8>] acpi_ut_get_object_size+0x3a/0x66
 [<ffffffff8035935f>] acpi_evaluate_object+0x1e0/0x28d
 [<ffffffff80267a31>] poison_obj+0x26/0x30
 [<ffffffff8036dd93>] acpi_processor_preregister_performance+0x9c/0x311
 [<ffffffff8802005d>] :acpi_cpufreq:acpi_cpufreq_init+0x5d/0x8c
 [<ffffffff8024440a>] sys_init_module+0x12dc/0x140b
 [<ffffffff8020b40e>] system_call+0x7e/0x83


Code: f6 47 0a 01 48 8b 7f 18 74 f6 48 89 f8 c3 31 c0 f6 47 0a 01
RIP  [<ffffffff80359c2b>] acpi_ns_get_parent_node+0x7/0x15
 RSP <ffff81007b403d10>
/etc/rc5.d/S06cpuspeed: line 112:  2365 Segmentation fault      /sbin/modprobe acpi-cpufreq 2> /dev/null
Comment 10 Len Brown 2007-11-22 23:34:02 UTC
The 93,414th call to acpi_ns_get_parent_node() caused the crash.

It was passed a namespace node...
at address ffff81007d8544c8.
node->name "kkkk"
node->flags 6b
node->peer a56b6b6b6b6b6b6b

We had never done an ACPI_FREE on ffff81007d8544c8.
However, it had been used for other namespace nodes,
most recently "DOMN", a child of "_PSD".:

acpi_ns_get_parent_node ffff81007d8544c8 GPL5 0x1 ffff81007d836fb8
ffff81007d836fb8 _CRS 0x1
acpi_ns_get_parent_node ffff81007d8544c8 GP00 0x0 ffff81007d854500
ffff81007d833810 _CRS 0x1
acpi_ns_get_parent_node ffff81007d8544c8 MB03 0x0 ffff81007d854618
ffff81007d833c70 _CRS 0x1
acpi_ns_get_parent_node ffff81007d8544c8 BAS2 0x0 ffff81007d854500
ffff81007d833f80 _CRS 0x1
acpi_ns_get_parent_node ffff81007d8544c8 STS0 0x0 ffff81007d854618
ffff81007d8543e8 _OSC 0x0
acpi_ns_get_parent_node ffff81007d8544c8 DOMN 0x2 ffff81007d854618
ffff81007d854a08 _PSD 0x0
acpi_ns_get_parent_node ffff81007d8544c8 kkkk 0x6b a56b6b6b6b6b6b6b
<oops>
Comment 11 Len Brown 2007-11-23 00:40:21 UTC
Created attachment 13705 [details]
console log 2.6.24+ w/o ACPI tracing & debugging

enabled all debug flags at the top of acpi_processor_preregister_performance.
(hmmm, thought i enabled tracing too, but don't see it here)

in any case the delete of the node before access is clear:
acpi_ns_delete_node ffff81007d854500 DOMN

though it isn't clear to me what the path of control getting
to the access actually is...

Also, there are some other 6b's evident in the log,
so it isn't immediately clear how far before the actual crash things went wrong...
Comment 12 Len Brown 2007-11-23 09:56:08 UTC
Created attachment 13718 [details]
CPU0IST.dat
Comment 13 Len Brown 2007-11-23 09:57:25 UTC
Created attachment 13719 [details]
CPU1IST.dat

$ acpidump -a 0x7DFB0720 -l 0x00000359 > CPU1IST.dat
Comment 14 Len Brown 2007-11-23 09:59:25 UTC
Bob,
It looks like _PSD is returning a package that contains temporary objects?!

        Method (_PSD, 0, NotSerialized)
        {
            Name (DOMN, 0x00)
            Name (CRTP, 0x00)
            Name (NOPR, 0x00)
            If (And (PDC0, 0x0800))
            {
                Store (0xFE, CRTP)
            }
            Else
            {
                Store (0xFC, CRTP)
            }

            Divide (0x01, NPCP, Local1, Local2)
            If (LEqual (Local1, 0x00))
            {
                Store (NPCP, Local1)
            }

            Decrement (Local1)
            Store (Local1, DOMN)
            Divide (NCPU, NPCP, Local2, Local3)
            Store (Local3, NOPR)
            Return (Package (0x01)
            {
                Package (0x05)
                {
                    0x05,
                    0x00,
                    DOMN,
                    CRTP,
                    NOPR
                }
            })
        }
Comment 15 Len Brown 2007-11-23 10:46:08 UTC
Created attachment 13721 [details]
console log 2.6.24+ w/ ACPI tracing & debugging

Attached is a re-run with ACPICA function tracing properly enabled.

Apparently _PSD is returning references to objects which
were freed upon the completion of _PSD evaluation.
Subsequent reference to those objects crash the system.

 psxface-0310 [FFFF81007D852750] [03] ps_execute_method     : Method returned ObjDesc=ffff81007d9464b0
  exdump-0489 [FFFF81007D852750] [03] ex_dump_operand       : ffff81007d9464b0 Package [Len 1] ElementArray ffff81007c5e41f0
  exdump-0487 [FFFF81007D852750] [03] ex_dump_operand       :  [1] ffff81007d946e50 Package [Len 5] ElementArray ffff81007c5b3be8
  exdump-0487 [FFFF81007D852750] [03] ex_dump_operand       :   [2] ffff81007d946140 Integer 0000000000000005
  exdump-0487 [FFFF81007D852750] [03] ex_dump_operand       :   [2] ffff81007d946ea8 Integer 0000000000000000
  exdump-0487 [FFFF81007D852750] [03] ex_dump_operand       :   [2] ffff81007d946f58 Reference.Node->Name 6B6B6B6B
  exdump-0487 [FFFF81007D852750] [03] ex_dump_operand       :   [2] ffff81007d946610 Reference.Node->Name 6B6B6B6B
  exdump-0487 [FFFF81007D852750] [03] ex_dump_operand       :   [2] ffff81007d946508 Reference.Node->Name 6B6B6B6B
 psxface-0316 [FFFF81007D852750] [03] ps_execute_method     : ----Exit- ****Exception****: AE_CTRL_RETURN_VALUE
 exutils-0151 [FFFF81007D852750] [03] ex_exit_interpreter   : ----Entry
 utmutex-0291 [FFFF81007D852750] [03] ut_release_mutex      : Thread FFFF81007D852750 releasing Mutex [ACPI_MTX_Interpreter]
     osl-0896 [FFFF81007D852750] [03] os_signal_semaphore   : Signaling semaphore[ffff81007d81d3b0|1]
 exutils-0159 [FFFF81007D852750] [03] ex_exit_interpreter   : ----Exit-
  nseval-0226 [FFFF81007D852750] [02] ns_evaluate           : *** Completed evaluation of object _PSD ***
  nseval-0232 [FFFF81007D852750] [02] ns_evaluate           : ----Exit- AE_OK
utobject-0607 [FFFF81007D852750] [02] ut_get_package_object_: ----Entry ffff81007d9464b0
  utmisc-0918 [FFFF81007D852750] [03] ut_walk_package_tree  : ----Entry
 utstate-0269 [FFFF81007D852750] [04] ut_create_pkg_state   : ----Entry ffff81007d9464b0
 utstate-0286 [FFFF81007D852750] [04] ut_create_pkg_state   : ----Exit- ffff81007d83c3a8
 utstate-0099 [FFFF81007D852750] [04] ut_push_generic_state : ----Entry
 utstate-0106 [FFFF81007D852750] [04] ut_push_generic_state : ----Exit-
 utstate-0269 [FFFF81007D852750] [04] ut_create_pkg_state   : ----Entry ffff81007d946e50
 utstate-0286 [FFFF81007D852750] [04] ut_create_pkg_state   : ----Exit- ffff81007d83c618
utobject-0428 [FFFF81007D852750] [04] ut_get_simple_object_s: ----Entry ffff81007d946140
utobject-0524 [FFFF81007D852750] [04] ut_get_simple_object_s: ----Exit- AE_OK
utobject-0428 [FFFF81007D852750] [04] ut_get_simple_object_s: ----Entry ffff81007d946ea8
utobject-0524 [FFFF81007D852750] [04] ut_get_simple_object_s: ----Exit- AE_OK
utobject-0428 [FFFF81007D852750] [04] ut_get_simple_object_s: ----Entry ffff81007d946f58
acpi_ns_get_parent_node ffff81007d854500 kkkk 0x6b a56b6b6b6b6b6b6b
acpi_ns_get_parent_node 91231 calls, here is stack for bad one:

Call Trace:
 [<ffffffff803618fc>] acpi_ns_get_parent_node+0x72/0x96
 [<ffffffff8036151d>] acpi_ns_get_pathname_length+0x15/0x3c
 [<ffffffff8036e27b>] acpi_ut_get_simple_object_size+0xcf/0x146
 [<ffffffff8036e3f7>] acpi_ut_get_element_length+0x2b/0x4f
 [<ffffffff8036bdf0>] acpi_ut_walk_package_tree+0x102/0x1b6
 [<ffffffff8036ad7b>] acpi_ut_trace_ptr+0x56/0x63
 [<ffffffff8036e361>] acpi_ut_get_object_size+0x6f/0xda
 [<ffffffff80360bc7>] acpi_evaluate_object+0x244/0x30d
 [<ffffffff80379879>] acpi_processor_preregister_performance+0xba/0x331
 [<ffffffff8072edf7>] acpi_cpufreq_init+0x5d/0x8c
 [<ffffffff8072554c>] kernel_init+0xdf/0x249
 [<ffffffff8020bd58>] child_rip+0xa/0x12
 [<ffffffff8072546d>] kernel_init+0x0/0x249
 [<ffffffff8020bd4e>] child_rip+0x0/0x12
Comment 16 Venkatesh Pallipadi 2007-11-24 06:03:18 UTC
I do see few examples in ACPI spec which has
Return (Package(x)) {}
syntax. That must be valid. Isn't it?
Comment 17 Len Brown 2007-11-24 22:24:35 UTC
>I do see few examples in ACPI spec which has
>Return (Package(x)) {}
>syntax. That must be valid. Isn't it?

Yes, _PSD is supposed to return a package.
The question is what to do when the the returned
package refers to objects of temporary scope.

ie. do we just flag it as an error and fail,
or do we copy-by-value?  Certainly we can't
return references to objects that no longer exist, as we do today...
Comment 18 Robert Moore 2007-11-26 13:34:43 UTC
I suspect that this is fixed in version 20071019:

Fixed a problem with the Package operator where all named references were created as object references and left otherwise unresolved. According to the ACPI specification, a Package can only contain Data Objects or references to control methods. The implication is that named references to Data Objects (Integer, Buffer, String, Package, BufferField, Field) should be resolved immediately upon package creation. This is the approach taken with this change. References to all other named objects (Methods, Devices, Scopes, etc.) are all now properly created as reference objects. (BZ 5328)

I'll validate this, however.
Comment 19 Robert Moore 2007-11-26 14:38:22 UTC
Verified failure on 20070919, works correctly on 20071019:

- ex _psd
Executing \_PSD
Execution of \_PSD returned object 00327E40 Buflen 70
  [Package] Contains 1 Elements:
    [Package] Contains 5 Elements:
      [Integer] = 0000000000000005
      [Integer] = 0000000000000000
      [Integer] = 0000000000000000
      [Integer] = 00000000000000FC
      [Integer] = 0000000000000002
Comment 20 Len Brown 2007-12-02 11:32:11 UTC
Created attachment 13831 [details]
patch vs 2.6.24-rc3

verified that this patch, cherry-picked from the
ACPICA 20071019 release, fixes the problem.
Comment 21 Len Brown 2007-12-05 19:30:23 UTC
shipped in linux-2.6.24-rc4
closed.

note that this patch is also valid for
linux-2.6.21
linux-2.6.22
linux-2.6.23

it is needed by all earlier release, but needs to be tweaked for those.
Comment 22 Len Brown 2007-12-05 20:20:23 UTC
Created attachment 13882 [details]
patch from comment #20 backported to linux-2.6.16, 2.6.17
Comment 23 Len Brown 2007-12-05 20:34:00 UTC
Created attachment 13883 [details]
patch from comment #20 backported to linux-2.6.18, 2.6.19, 2.6.20
Comment 24 Len Brown 2008-01-01 10:53:21 UTC
*** Bug 9486 has been marked as a duplicate of this bug. ***