Bug 206553

Summary: Linux hangs at ACPI init on Medion/Akoya P15648 MD63490
Product: ACPI Reporter: Jan Engelhardt (ej+bko)
Component: ACPICA-CoreAssignee: acpi_acpica-core (acpi_acpica-core)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: erik.kaneda, rjw, Robert.Moore
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: linux-5.6-rc1 (b19e8c6847038) Subsystem:
Regression: No Bisected commit-id:
Attachments: ACPI warnings from FreeBSD kernel
DSDT dump
Cure/Workaround
Full ACPI table dump
Alternate patch to comment 9. Diagnostic patch.
dmesg as per comment 12 item 1, with patch from comment 14
dmesg as per comment 12 item 2
dmesg as per comment 12 item 3, with patch from comment 14
patch to detect bad global lock
Updated patch
Patch from a completely different angle
Fragment for idea from comment 22

Description Jan Engelhardt 2020-02-17 00:02:12 UTC
Created attachment 287431 [details]
ACPI warnings from FreeBSD kernel

Stacktrace:

// osl.c:acpi_os_wait_semaphore:
1262    down_timeout(sem, jiffies)

// exsystem.c:acpi_ex_system_wait_semaphore:
47      status = acpi_os_wait_semaphore(semaphore, 1, timeout);

// evglock.c:acpi_ev_acquire_global_lock:
240     status = acpi_ex_system_wait_semaphore(
                 acpi_gbl_global_lock_semaphore, ACPI_WAIT_FOREVER);

// exmutex.c:acpi_ex_acquire_mutex_object:
152     status = acpi_ev_acquire_global_lock(timeout);

// exutils.c:acpi_ex_acquire_global_lock:
181     status = acpi_ex_acquire_mutex_object(ACPI_WAIT_FOREVER,
                 acpi_gbl_global_lock_mutex,
                 acpi_os_get_thread_id());

// exfield.c:acpi_ex_write_data_to_field:
394     acpi_ex_acquire_global_lock(...);

// exstore.c:acpi_ex_store_object_to_node:
523     status = acpi_ex_write_data_to_field(...)

exstore.c:acpi_ex_store:
        // ACPI_GET_DESCRIPTOR_TYPE(dest_desc) = ACPI_DESC_TYPE_NAMED
75      status = acpi_ex_store_object_to_node(...)

exoparg1.c:acpi_ex_opcode_1A_1T_1R:
        // case AML_STORE_OP:
422     status = acpi_ex_store(...)

dswexec.c:acpi_ds_exec_end_op:
        // case AML_CLASS_EXECUTE:
405     status = acpi_gbl_op_type_dispatch[op_type](walk_state);

acpi_ps_parse_loop
acpi_ps_parse_aml
acpi_ps_execute_method
acpi_ns_evaluate
acpi_ns_init_one_device(obj_handle="\_SB_.PCI0", nesting_level=2)
acpi_ns_walk_namespace
acpi_ns_initialize_devices
acpi_initialize_objects
acpi_init
Comment 1 Jan Engelhardt 2020-02-17 00:03:36 UTC
Created attachment 287433 [details]
DSDT dump

$ iasl -d medion.dsdt.bin

Intel ACPI Component Architecture
ASL+ Optimizing Compiler/Disassembler version 20180105
Copyright (c) 2000 - 2018 Intel Corporation

Input file medion.dsdt.bin, Length 0x5DA6E (383598) bytes
ACPI: DSDT 0x0000000000000000 05DA6E (v02 MEDION MEDIONAG 01072009 INTL 20160527)
Pass 1 parse of [DSDT]
ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180105/psobject-263)
Comment 2 Jan Engelhardt 2020-02-17 00:08:24 UTC
Linux dmesg:

<...lots more AE_ALREADY_EXISTS preceding..., then>
ACPI BIOS Error (bug): Failure creating named object [\_SB.PCI0.XHC.RHUB.SS06._UPC], AE_ALREADY_EXISTS (20200110/dswload2-323)
ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20200110/psobject-220)
ACPI BIOS Error (bug): Failure creating named object [\_SB.PCI0.XHC.RHUB.SS06._PLD], AE_ALREADY_EXISTS (20200110/dswload2-323)
ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20200110/psobject-220)
ACPI: Skipping parse of AML opcode: Method (0x0014)
ACPI: 11 ACPI AML tables successfully acquired and loaded
ACPI: EC: EC started
ACPI: EC: interrupt blocked
ACPI: \: Used as first EC
ACPI: \: GPE=0x10, IRQ=-1, EC_CMD/EC_SC=0x66, EC_DATA=0x62
ACPI: EC: Boot ECDT EC used to handle transactions
<hang>
Comment 3 Jan Engelhardt 2020-02-17 00:10:56 UTC
Does not seem to be a regression; 4.4.76 (but with patches from openSUSE) and 4.12.14 (also patched by openSUSE) exhibit the same hang.
Comment 4 Rafael J. Wysocki 2020-02-17 08:27:09 UTC
It looks like we have a mutex deadlock on the global lock in there.
Comment 5 Robert Moore 2020-02-18 17:43:30 UTC
Do you eventually get a timeout error (like below:)

ACPI Error: Aborting method \_SB.PCI0.LPCB.EC0.RRAM due to previous error (AE_AML_LOOP_TIMEOUT)
Comment 6 Jan Engelhardt 2020-02-18 21:10:06 UTC
I do not get such an error, at least not in a reasonable time. The longest I observed the console is about 50 minutes - the hung task checker kicks in after about 16min, though.

[138.837] ACPI: EC: Boot ECDT EC used to handle transactions
[1115.250] INFO: task swapper/0:1 blocked for more than 491 seconds.
<call trace as printed by stack_trace_print>

Given the execution is stuck with ACPI_WAIT_FOREVER (jiffies==MAX_SCHEDULE_TIMEOUT), I also see no possibility for down_timeout to return anytime soon to give parent functions a chance to return AE_ABORT_METHOD anywhere. Due to exfield.c:394 ignoring the return code from acpi_ex_acquire_global_lock, AE_TIME also never propagates further upwards.

Furthermore, grepping for AE_ABORT_METHOD yields only one place which can return it, dbxface.c:222. This depends on acpi_gbl_abort_method being set, but this variable is never set to "true" anywhere from the looks of it.
Comment 7 Robert Moore 2020-02-18 21:42:04 UTC
>acpi_gbl_abort_method
This flag is primarily used by the AML debugger and the AcpiExec application.

I'm not sure, but I would guess that all of the AE_ALREADY_EXISTS exceptions are causing some confusion in the interpreter -- such that the Global Lock isn't getting released somewhere. These are serious errors, indicating some very bad ASL code, however. You might try getting a firmware upgrade to the latest for your machine.

Bob
Comment 8 Jan Engelhardt 2020-02-19 21:14:38 UTC
I find that acpi_gbl_global_lock_semaphore is initialized with 0 units (nsaccess.c):

/* Create additional counting semaphore for global lock */
status =
       acpi_os_create_semaphore(1, 0,
       &acpi_gbl_global_lock_semaphore);

and that the acpi_ev_global_lock_handler function (evglock.c) that would send a unit never gets called, which leads me to believe the hardware is not generating any ACPI_EVENT_GLOBAL events. But should it, and when?
Comment 9 Jan Engelhardt 2020-02-19 22:49:06 UTC
Created attachment 287517 [details]
Cure/Workaround

Using this ad-hoc patch, I can successfully boot userspace.
Comment 10 Jan Engelhardt 2020-02-23 13:53:53 UTC
Created attachment 287559 [details]
Full ACPI table dump
Comment 11 Jan Engelhardt 2020-02-23 15:13:26 UTC
Another possibility (without comment 9 patch) is to simply enable CONFIG_ACPI_REDUCED_HARDWARE_ONLY=y in .config, as this also will skip evglock.c code.
Comment 12 Erik Kaneda 2020-02-25 18:53:26 UTC
Thanks for the report. At this time, we need more debug information.

If you could build a custom kernel, please build with CONFIG_ACPI_DEBUG=y without any patches of a kernel 5.0 or newer..

Please provide the full following dmesg:

1) plain dmesg
2) dmesg booting with the parameters "acpi.debug_level=0x80000 acpi.debug_layer=0xffffffff"
3) dmesg booting with parameters "acpi.debug_level=0x01000000 acpi.debug_layer=0x4"

There's definitely something holding the lock. We just don't know what's holding it and why it's not releasing it.
Comment 13 Jan Engelhardt 2020-02-26 14:40:40 UTC
>There's definitely something holding the lock.

Did you observe comment 8? It's not the lock, it's a OS-level semaphore that is initialized to 0, and the hardware never emits events needed to "up" it to 1, hence the first "down" call is the one that hangs.

>without any patches of a kernel 5.0 ... 1) plain dmesg

Without any patches, that is going to be impossible. ACPI is so early that an external observer is required, like a serial console (for which laptops have not had a port since a long time), netconsole is going to be impossible because stupid manufacturers have also removed that port, and I can't reach userspace to dump dmesg to a disk, because obviously there's the hang. I can do a video capture, but then someone will have to sift through something like 2 hours of non-text material.
Comment 14 Jan Engelhardt 2020-02-26 17:53:09 UTC
Created attachment 287629 [details]
Alternate patch to comment 9. Diagnostic patch.

This patch kills off the unbounded wait on acpi_gbl_global_lock_semaphore.  Needless to say unbounded waits are kinda bad. It is here replaced with a 44ms wait (just something that will stand out in logs).

Observations from running this patch on the hardware:

1. ACPI_ACQUIRE_GLOBAL_LOCK never succeeds in acquiring acpi_gbl_FACS. Looking at what that macro does, this may turn out to be a hardware bug that can only be worked around. Upgrading the firmware is not an option -- there is no upgrade to be found on the manufacturer's website.

2. (as before) acpi_gbl_global_lock_semaphore never gets posted either, so acpi_ex_system_wait_semaphore will never complete in a stock kernel.

Unlike comment 9, this patch does not skip nearly as much locking code, so the dmesg should be better.
Comment 15 Jan Engelhardt 2020-02-26 17:54:36 UTC
Created attachment 287631 [details]
dmesg as per comment 12 item 1, with patch from comment 14

The kernel used was v5.5.5, the only extra being the patch from comment 14 (which is line number preserving).
Comment 16 Jan Engelhardt 2020-02-26 17:54:53 UTC
Created attachment 287633 [details]
dmesg as per comment 12 item 2
Comment 17 Jan Engelhardt 2020-02-26 17:55:10 UTC
Created attachment 287635 [details]
dmesg as per comment 12 item 3, with patch from comment 14
Comment 18 Erik Kaneda 2020-02-26 19:05:46 UTC
(In reply to Jan Engelhardt from comment #13)
> >There's definitely something holding the lock.
> 
> Did you observe comment 8? It's not the lock, it's a OS-level semaphore that
> is initialized to 0, and the hardware never emits events needed to "up" it
> to 1, hence the first "down" call is the one that hangs.

Sorry, I wasn't reading that closely when I initially read the report. This definitely sounds like a hardware bug. Since we can't trust the global lock, we should not use it if we deem this as broken.
Comment 19 Erik Kaneda 2020-02-26 20:25:44 UTC
Created attachment 287637 [details]
patch to detect bad global lock

please try this patch
Comment 20 Jan Engelhardt 2020-02-26 22:33:13 UTC
Your patch is faulty;

1. AE_TIME is emitted from the function, which is incorrect. We want to execute a similar pattern such as the "Make sure that a global lock actually exists." branch a few lines above.

1b. The caller at exfield.c:394 (see comment 2) ignores the AE_TIME return value, which causes some more problems down the line because things pretend the lock was taken when it's not. Eventually the system hangs again at some point.

2. Due to "break;", the acpi_os_acquire_lock call is skipped but not the acpi_os_release_lock after the loop. This leads to a release of a mutex that is not locked.

New patch follows.
Comment 21 Jan Engelhardt 2020-02-26 22:34:07 UTC
Created attachment 287651 [details]
Updated patch
Comment 22 Jan Engelhardt 2020-02-26 23:27:33 UTC
Created attachment 287653 [details]
Patch from a completely different angle

The real bug is in __acpi_acquire_global_lock. The firmware is fine (well, mostly).

The FACS table contains some garbage bits, which __acpi_acquire_global_lock fails to deal with. This patch also makes the system boot successfully.

This opens up another consideration though: if the FACS table were to contain 0xffffffff in the GL register, then acquiring the lock would always fail logically, even if the firmware is actually performing properly. IOW, I think we need to add a facs->global_lock&=~0x3 statement right after loading the FACS as well, just to reset that register to a sane value.
Comment 23 Jan Engelhardt 2020-02-26 23:33:32 UTC
My patch from comment 22 ignores the bits 2-31 that are marked reserved by the spec, https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf .

Looking at FreeBSD's acpi_acquire_global_lock shows they are even ignoring bits 1-31 (which coincides with my first observation in this bugreport that FreeBSD is booting without issues):

int acpi_acquire_global_lock(volatile uint32_t *lock)
{
        uint32_t        new, old;

        do {
                old = *lock;
                new = (old & ~ACPI_GLOCK_PENDING) | ACPI_GLOCK_OWNED;
                if ((old & ACPI_GLOCK_OWNED) != 0)
                        new |= ACPI_GLOCK_PENDING;
        } while (atomic_cmpset_32(lock, old, new) == 0);

        return ((new & ACPI_GLOCK_PENDING) == 0);
}
Comment 24 Jan Engelhardt 2020-02-26 23:55:38 UTC
FreeBSD "accidentally" fixed the issue at some point.

commit fccbffe7d83251aeb28395fd853f38f45bf3782d
Author: jkim <jkim@FreeBSD.org>
Date:   Fri Feb 10 23:30:29 2012 +0000

    De-obfuscate acpi_acquire_global_lock().  It seems the function was directly
    translated from i386 assembly version.

-       return ((new < GL_BIT_MASK) ? GL_ACQUIRED : GL_BUSY);
+       return ((new & GL_BIT_PENDING) == 0);
Comment 25 Jan Engelhardt 2020-02-27 08:55:11 UTC
Created attachment 287659 [details]
Fragment for idea from comment 22
Comment 26 Erik Kaneda 2020-03-03 18:26:22 UTC
@Rafael, what do you think about taking the patch from comment 22 for Linux?
Comment 27 Rafael J. Wysocki 2020-03-04 10:27:54 UTC
Sounds good to me.

Jan, please send this patch to linux-acpi, thanks!