Created attachment 287431 [details]
ACPI warnings from FreeBSD kernel
1262 down_timeout(sem, jiffies)
47 status = acpi_os_wait_semaphore(semaphore, 1, timeout);
240 status = acpi_ex_system_wait_semaphore(
152 status = acpi_ev_acquire_global_lock(timeout);
181 status = acpi_ex_acquire_mutex_object(ACPI_WAIT_FOREVER,
523 status = acpi_ex_write_data_to_field(...)
// ACPI_GET_DESCRIPTOR_TYPE(dest_desc) = ACPI_DESC_TYPE_NAMED
75 status = acpi_ex_store_object_to_node(...)
// case AML_STORE_OP:
422 status = acpi_ex_store(...)
// case AML_CLASS_EXECUTE:
405 status = acpi_gbl_op_type_dispatch[op_type](walk_state);
Created attachment 287433 [details]
$ iasl -d medion.dsdt.bin
Intel ACPI Component Architecture
ASL+ Optimizing Compiler/Disassembler version 20180105
Copyright (c) 2000 - 2018 Intel Corporation
Input file medion.dsdt.bin, Length 0x5DA6E (383598) bytes
ACPI: DSDT 0x0000000000000000 05DA6E (v02 MEDION MEDIONAG 01072009 INTL 20160527)
Pass 1 parse of [DSDT]
ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180105/psobject-263)
<...lots more AE_ALREADY_EXISTS preceding..., then>
ACPI BIOS Error (bug): Failure creating named object [\_SB.PCI0.XHC.RHUB.SS06._UPC], AE_ALREADY_EXISTS (20200110/dswload2-323)
ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20200110/psobject-220)
ACPI BIOS Error (bug): Failure creating named object [\_SB.PCI0.XHC.RHUB.SS06._PLD], AE_ALREADY_EXISTS (20200110/dswload2-323)
ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20200110/psobject-220)
ACPI: Skipping parse of AML opcode: Method (0x0014)
ACPI: 11 ACPI AML tables successfully acquired and loaded
ACPI: EC: EC started
ACPI: EC: interrupt blocked
ACPI: \: Used as first EC
ACPI: \: GPE=0x10, IRQ=-1, EC_CMD/EC_SC=0x66, EC_DATA=0x62
ACPI: EC: Boot ECDT EC used to handle transactions
Does not seem to be a regression; 4.4.76 (but with patches from openSUSE) and 4.12.14 (also patched by openSUSE) exhibit the same hang.
It looks like we have a mutex deadlock on the global lock in there.
Do you eventually get a timeout error (like below:)
ACPI Error: Aborting method \_SB.PCI0.LPCB.EC0.RRAM due to previous error (AE_AML_LOOP_TIMEOUT)
I do not get such an error, at least not in a reasonable time. The longest I observed the console is about 50 minutes - the hung task checker kicks in after about 16min, though.
[138.837] ACPI: EC: Boot ECDT EC used to handle transactions
[1115.250] INFO: task swapper/0:1 blocked for more than 491 seconds.
<call trace as printed by stack_trace_print>
Given the execution is stuck with ACPI_WAIT_FOREVER (jiffies==MAX_SCHEDULE_TIMEOUT), I also see no possibility for down_timeout to return anytime soon to give parent functions a chance to return AE_ABORT_METHOD anywhere. Due to exfield.c:394 ignoring the return code from acpi_ex_acquire_global_lock, AE_TIME also never propagates further upwards.
Furthermore, grepping for AE_ABORT_METHOD yields only one place which can return it, dbxface.c:222. This depends on acpi_gbl_abort_method being set, but this variable is never set to "true" anywhere from the looks of it.
This flag is primarily used by the AML debugger and the AcpiExec application.
I'm not sure, but I would guess that all of the AE_ALREADY_EXISTS exceptions are causing some confusion in the interpreter -- such that the Global Lock isn't getting released somewhere. These are serious errors, indicating some very bad ASL code, however. You might try getting a firmware upgrade to the latest for your machine.
I find that acpi_gbl_global_lock_semaphore is initialized with 0 units (nsaccess.c):
/* Create additional counting semaphore for global lock */
and that the acpi_ev_global_lock_handler function (evglock.c) that would send a unit never gets called, which leads me to believe the hardware is not generating any ACPI_EVENT_GLOBAL events. But should it, and when?
Created attachment 287517 [details]
Using this ad-hoc patch, I can successfully boot userspace.
Created attachment 287559 [details]
Full ACPI table dump
Another possibility (without comment 9 patch) is to simply enable CONFIG_ACPI_REDUCED_HARDWARE_ONLY=y in .config, as this also will skip evglock.c code.
Thanks for the report. At this time, we need more debug information.
If you could build a custom kernel, please build with CONFIG_ACPI_DEBUG=y without any patches of a kernel 5.0 or newer..
Please provide the full following dmesg:
1) plain dmesg
2) dmesg booting with the parameters "acpi.debug_level=0x80000 acpi.debug_layer=0xffffffff"
3) dmesg booting with parameters "acpi.debug_level=0x01000000 acpi.debug_layer=0x4"
There's definitely something holding the lock. We just don't know what's holding it and why it's not releasing it.
>There's definitely something holding the lock.
Did you observe comment 8? It's not the lock, it's a OS-level semaphore that is initialized to 0, and the hardware never emits events needed to "up" it to 1, hence the first "down" call is the one that hangs.
>without any patches of a kernel 5.0 ... 1) plain dmesg
Without any patches, that is going to be impossible. ACPI is so early that an external observer is required, like a serial console (for which laptops have not had a port since a long time), netconsole is going to be impossible because stupid manufacturers have also removed that port, and I can't reach userspace to dump dmesg to a disk, because obviously there's the hang. I can do a video capture, but then someone will have to sift through something like 2 hours of non-text material.
Created attachment 287629 [details]
Alternate patch to comment 9. Diagnostic patch.
This patch kills off the unbounded wait on acpi_gbl_global_lock_semaphore. Needless to say unbounded waits are kinda bad. It is here replaced with a 44ms wait (just something that will stand out in logs).
Observations from running this patch on the hardware:
1. ACPI_ACQUIRE_GLOBAL_LOCK never succeeds in acquiring acpi_gbl_FACS. Looking at what that macro does, this may turn out to be a hardware bug that can only be worked around. Upgrading the firmware is not an option -- there is no upgrade to be found on the manufacturer's website.
2. (as before) acpi_gbl_global_lock_semaphore never gets posted either, so acpi_ex_system_wait_semaphore will never complete in a stock kernel.
Unlike comment 9, this patch does not skip nearly as much locking code, so the dmesg should be better.
Created attachment 287631 [details]
dmesg as per comment 12 item 1, with patch from comment 14
The kernel used was v5.5.5, the only extra being the patch from comment 14 (which is line number preserving).
Created attachment 287633 [details]
dmesg as per comment 12 item 2
Created attachment 287635 [details]
dmesg as per comment 12 item 3, with patch from comment 14
(In reply to Jan Engelhardt from comment #13)
> >There's definitely something holding the lock.
> Did you observe comment 8? It's not the lock, it's a OS-level semaphore that
> is initialized to 0, and the hardware never emits events needed to "up" it
> to 1, hence the first "down" call is the one that hangs.
Sorry, I wasn't reading that closely when I initially read the report. This definitely sounds like a hardware bug. Since we can't trust the global lock, we should not use it if we deem this as broken.
Created attachment 287637 [details]
patch to detect bad global lock
please try this patch
Your patch is faulty;
1. AE_TIME is emitted from the function, which is incorrect. We want to execute a similar pattern such as the "Make sure that a global lock actually exists." branch a few lines above.
1b. The caller at exfield.c:394 (see comment 2) ignores the AE_TIME return value, which causes some more problems down the line because things pretend the lock was taken when it's not. Eventually the system hangs again at some point.
2. Due to "break;", the acpi_os_acquire_lock call is skipped but not the acpi_os_release_lock after the loop. This leads to a release of a mutex that is not locked.
New patch follows.
Created attachment 287651 [details]
Created attachment 287653 [details]
Patch from a completely different angle
The real bug is in __acpi_acquire_global_lock. The firmware is fine (well, mostly).
The FACS table contains some garbage bits, which __acpi_acquire_global_lock fails to deal with. This patch also makes the system boot successfully.
This opens up another consideration though: if the FACS table were to contain 0xffffffff in the GL register, then acquiring the lock would always fail logically, even if the firmware is actually performing properly. IOW, I think we need to add a facs->global_lock&=~0x3 statement right after loading the FACS as well, just to reset that register to a sane value.
My patch from comment 22 ignores the bits 2-31 that are marked reserved by the spec, https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf .
Looking at FreeBSD's acpi_acquire_global_lock shows they are even ignoring bits 1-31 (which coincides with my first observation in this bugreport that FreeBSD is booting without issues):
int acpi_acquire_global_lock(volatile uint32_t *lock)
uint32_t new, old;
old = *lock;
new = (old & ~ACPI_GLOCK_PENDING) | ACPI_GLOCK_OWNED;
if ((old & ACPI_GLOCK_OWNED) != 0)
new |= ACPI_GLOCK_PENDING;
} while (atomic_cmpset_32(lock, old, new) == 0);
return ((new & ACPI_GLOCK_PENDING) == 0);
FreeBSD "accidentally" fixed the issue at some point.
Author: jkim <jkim@FreeBSD.org>
Date: Fri Feb 10 23:30:29 2012 +0000
De-obfuscate acpi_acquire_global_lock(). It seems the function was directly
translated from i386 assembly version.
- return ((new < GL_BIT_MASK) ? GL_ACQUIRED : GL_BUSY);
+ return ((new & GL_BIT_PENDING) == 0);
Created attachment 287659 [details]
Fragment for idea from comment 22
@Rafael, what do you think about taking the patch from comment 22 for Linux?
Sounds good to me.
Jan, please send this patch to linux-acpi, thanks!