Bug 97141 - ACPI initialisation broken since 3.14-rc1 for some mainboards.
Summary: ACPI initialisation broken since 3.14-rc1 for some mainboards.
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: BIOS (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: acpi_bios
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-23 09:38 UTC by Marius Tolzmann
Modified: 2015-08-28 10:21 UTC (History)
8 users (show)

See Also:
Kernel Version: 4.0
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
full dmesg of broken ACPI after booting into 73f7d1c with applied c4e1acbb (91.52 KB, text/plain)
2015-04-23 09:38 UTC, Marius Tolzmann
Details
ACPI / init: Split acpi_early_init() (2.31 KB, patch)
2015-05-27 13:09 UTC, Rafael J. Wysocki
Details | Diff

Description Marius Tolzmann 2015-04-23 09:38:37 UTC
Created attachment 174841 [details]
full dmesg of broken ACPI after booting into 73f7d1c with applied c4e1acbb

We are running Linux on a lot of TYAN S8812 mainboards for some years now. It always worked great. After upgrading the kernel to 3.14.x some days ago our KVM keyboards stopped working. The reason seems to be that ACPI is not enabled anymore:

[    0.000552] ACPI: Core revision 20131115
[    0.007292] ACPI: All ACPI Tables successfully acquired
[    0.049357] ACPI Error: Hardware did not enter ACPI mode (20131115/evxfevnt-113)
[    0.049710] ACPI Warning: AcpiEnable failed (20131115/utxfinit-161)
[    0.049977] ACPI: Unable to enable ACPI

This issue maybe related to: Bug 74551 as in both cases ACPI can not be enabled - but since 3.13.x is in the list of broken kernels for bug 74551 and works for us i opened this new bug.

git bisect found 73f7d1ca32638028e3271f54616773727e2f9f26 to be the first bad commit for our issue.

some issues introduced with this commit were fixed in commit
c4e1acbb35e4a3838cdfc0e7f8237e844aff00b6

apparently this does not fix our issue (even if applied directly onto 73f7d1ca3263) (see attached dmesg).

it seems that acpi_early_init() is called to early for the tyan S8812 mainboard, while it is working fine on other mainboards we are running.

 - BIOS version is 1.08 (latest, 2013/07/09) 
 - BIOS settings have been reset to defaults - still same issue.

what information can i provide to help to fix this issue?
Comment 1 Marius Tolzmann 2015-04-23 10:08:21 UTC
kernel versions 3.18.x, 3.19.x, 4.0 do not boot at all: they die with a kernel panic in 10Gig thernetdriver (ixgbe) and are unable to find root device.

After reverting c4e1acbb35e4a3838cdfc0e7f8237e844aff00b6 and 73f7d1ca32638028e3271f54616773727e2f9f26 v4.0 kernel starts up like a charm again.
Comment 2 Lv Zheng 2015-04-24 07:28:54 UTC
IMO, acpi_load_tables() invoked in acpi_early_init() might be wrong.
It bypasses all module level execution during boot time.
So how useful the namespace nodes created in this stage will be?
I doubt that there should be real drivers depending on such broken or half-parsed namespace nodes...

Thanks and best regards
-Lv
Comment 3 Aaron Lu 2015-04-28 06:39:33 UTC
Add relevant people.
Comment 4 Lee, Chun-Yi 2015-04-28 07:17:06 UTC
The original position of acpi_early_init() is after efi_enter_virtual_mode(),
73f7d1ca3 patch moved acpi_early_init() to the position before acpi_early_init() for two reasons:
 - Find the BGRT table earlier.                <=== move to before efi_enter_virtual_mode()
 - Prepare setting system clock with ACPI TAD. <=== move to before timekeeping_init()

The c4e1acbb3 patch reverted the part of ACPI TAD, it moved acpi_early_init() to before efi_enter_vritual_mode() but not the original position: after efi_enter_vritual_mode()

The original discussion about EFI benefit is here:
  https://lkml.org/lkml/2013/12/20/471
Comment 5 Lee, Chun-Yi 2015-04-28 07:26:23 UTC
Checked v4.0 kernel, the efi_bgrt_init() still called by efi_late_init() in the original place.

Add Matt Fleming to Cc. list to clarify the benefit of finding BGRT.
Comment 6 Lee, Chun-Yi 2015-04-28 07:27:20 UTC
(In reply to Lee, Chun-Yi from comment #4)
> The original position of acpi_early_init() is after efi_enter_virtual_mode(),
> 73f7d1ca3 patch moved acpi_early_init() to the position before
> acpi_early_init() for two reasons:
  ^^^^^^^^^^^^^^^^^ 
       before timekeeping_init(), sorry for my typo
Comment 7 Lv Zheng 2015-04-28 07:33:34 UTC
Thanks for the information.

Though my concern is not related to this bug.
It looks the TAD use case is the root cause of such dirty implementation in the ACPICA interpreter around module level code...
Just note this here for people who is looking at this issue.

Thanks and best regards
-Lv
Comment 8 Lv Zheng 2015-04-28 07:37:40 UTC
I'm OK with BGRT which is a data table, but for TAD which comes from a definition block table, I do have concerns...

At this early boot time, PCI configuration space may not be accessible.
Thus some module level code in the ACPI definition block table is also not executable.
Invoking acpi_load_tables() so early causes a very strange software scheme in ACPICA.
ACPICA interpreter has to execute only special AML code and ignore all module level code in this early stage and at a late stage, re-execute the module level code. This makes ACPICA interpreter implementation very dirty.

Thanks
-Lv
Comment 9 Lee, Chun-Yi 2015-04-28 08:42:37 UTC
(In reply to Lv Zheng from comment #8)
> I'm OK with BGRT which is a data table, but for TAD which comes from a
> definition block table, I do have concerns...
> 
> At this early boot time, PCI configuration space may not be accessible.
> Thus some module level code in the ACPI definition block table is also not
> executable.
> Invoking acpi_load_tables() so early causes a very strange software scheme
> in ACPICA.
> ACPICA interpreter has to execute only special AML code and ignore all
> module level code in this early stage and at a late stage, re-execute the
> module level code. This makes ACPICA interpreter implementation very dirty.
> 
> Thanks
> -Lv

I fully agree your point.

But if TAD could not access earlier before timekeeping_init(), then kernel can not use TAD to set system clock. That causes kernel accessing another RTC interface, like: EFI time service. Unfortunately the EFI time service is unreliable at run time so we avoid using it. Using EFI time service to get the offset in EFI stub maybe works, that need time to implement and check.
Comment 10 Marius Tolzmann 2015-04-28 08:58:18 UTC
Is it safe to "move" acpi_early_init() back to the position it used to be before commit 73f7d1ca326 (e.g. by reverting the both commits mentioned in comment 1)? we could use this as a quick fix to startup current linux kernel on our tyan mainboards - or will we hit some ugly side effects because other init-functions depend on it?

since older kernel versions work great, we can also continue to use older kernels till this issue has been fixed.

best regards and thanks for the help..
Comment 11 Rafael J. Wysocki 2015-05-05 11:09:47 UTC
From the purely ACPI perspective, it should be safe to move acpi_early_init().

I'm not sure what EFI changes have been made since the commit in question, though, which may depend on it. Matt's input would be necessary to answer that.

However, if your machines don't use EFI anyway, please try moving acpi_early_init().
Comment 12 Rafael J. Wysocki 2015-05-05 11:15:16 UTC
(In reply to Lv Zheng from comment #8)
> I'm OK with BGRT which is a data table, but for TAD which comes from a
> definition block table, I do have concerns...
> 
> At this early boot time, PCI configuration space may not be accessible.
> Thus some module level code in the ACPI definition block table is also not
> executable.
> Invoking acpi_load_tables() so early causes a very strange software scheme
> in ACPICA.
> ACPICA interpreter has to execute only special AML code and ignore all
> module level code in this early stage and at a late stage, re-execute the
> module level code. This makes ACPICA interpreter implementation very dirty.

I'm not sure I understand your point correctly.

Do you have any specific examples of what may go wrong as a result of this?
Comment 13 Lv Zheng 2015-05-06 01:29:47 UTC
(In reply to Rafael J. Wysocki from comment #12)
> (In reply to Lv Zheng from comment #8)
> > I'm OK with BGRT which is a data table, but for TAD which comes from a
> > definition block table, I do have concerns...
> > 
> > At this early boot time, PCI configuration space may not be accessible.
> > Thus some module level code in the ACPI definition block table is also not
> > executable.
> > Invoking acpi_load_tables() so early causes a very strange software scheme
> > in ACPICA.
> > ACPICA interpreter has to execute only special AML code and ignore all
> > module level code in this early stage and at a late stage, re-execute the
> > module level code. This makes ACPICA interpreter implementation very dirty.
> 
> I'm not sure I understand your point correctly.
> 
> Do you have any specific examples of what may go wrong as a result of this?

For both the code inside of a Method and the code at the module level for the table (do not belong to any method), AML grammer for them defined in the spec is TermList.
So they should be interpreted in the same style - both table loading and method execution should work in the same way.
I've tried and it worked here:
http://sourceforge.net/p/sdfirm/code/ci/master/tree/kernel/acpi/acpi_interp.c
Both acpi_interpret_table() and acpi_evaluate_object() can invoke acpi_interpret_aml() with same arguments.

But ACPICA is currently not implemented in this way.
In acpi_load_tables(), only limitted opcodes (named object creation opcodes) are executed in a 2-pass parsing (load_pass_1 and load_pass_2) and many opcodes in the "If", "Else", .. packages are just linked together (acpi_ps_link_module_code()). This happens during the early stage (acpi_early_init()). At a later stage (acpi_bus_init()), such module level code will be executed (acpi_ns_exec_module_code_list()).
The code for module level execution is nasty, it hacks "\" from a device to a method in order for the executer code to work and hacks it back after completing the module level execution.

While for method invocation, it is currently executed only in 1-pass parsing (execute_pass).

For the above 2-load-pass parsing implementation, only limitted opcodes can be supported during table loading. And the 2-load-pass parsing caused some known compliance issues.
For example, if the load-pass-1 failed at a point, then the namespace node created before this point will be successfully created, but since the load-pass-2 is aborted, such namespace nodes will be left without operand objects attached, they will be non-calulatable objects... So far no BIOS issues reported on Windows around such broken tables.

This links to the dicussion of this nasty code issue:
http://bugs.acpica.org/show_bug.cgi?id=963

I was thinking the reason for current implementation is due to the all-in-one code base for both the iasl disassembler and the in-kernel interpreter.
But it seems Linux also has something to do with this, it is hard to move acpi_load_tables() to a late stage and merges it with the module level execution into the big execute-pass parsing.

Thanks and best regards
-Lv
Comment 14 Robert Moore 2015-05-06 15:47:10 UTC
Actually, according to the ACPI spec, executable code at module level is illegal and not supported. It was legal in ACPI 1.0 but made illegal in ACPI 2.0

However, ACPICA supports it (as much as it has to).

Bob


> -----Original Message-----
> From: bugzilla-daemon@bugzilla.kernel.org [mailto:bugzilla-
> daemon@bugzilla.kernel.org]
> Sent: Tuesday, May 05, 2015 6:30 PM
> To: acpi-bugzilla@lists.sourceforge.net
> Subject: [Bug 97141] ACPI initialisation broken since 3.14-rc1 for some
> mainboards.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=97141
> 
> --- Comment #13 from Lv Zheng <lv.zheng@intel.com> --- (In reply to Rafael
> J. Wysocki from comment #12)
> > (In reply to Lv Zheng from comment #8)
> > > I'm OK with BGRT which is a data table, but for TAD which comes from
> > > a definition block table, I do have concerns...
> > >
> > > At this early boot time, PCI configuration space may not be
> accessible.
> > > Thus some module level code in the ACPI definition block table is
> > > also not executable.
> > > Invoking acpi_load_tables() so early causes a very strange software
> > > scheme in ACPICA.
> > > ACPICA interpreter has to execute only special AML code and ignore
> > > all module level code in this early stage and at a late stage,
> > > re-execute the module level code. This makes ACPICA interpreter
> implementation very dirty.
> >
> > I'm not sure I understand your point correctly.
> >
> > Do you have any specific examples of what may go wrong as a result of
> this?
> 
> For both the code inside of a Method and the code at the module level for
> the table (do not belong to any method), AML grammer for them defined in
> the spec is TermList.
> So they should be interpreted in the same style - both table loading and
> method execution should work in the same way.
> I've tried and it worked here:
> http://sourceforge.net/p/sdfirm/code/ci/master/tree/kernel/acpi/acpi_inter
> p.c
> Both acpi_interpret_table() and acpi_evaluate_object() can invoke
> acpi_interpret_aml() with same arguments.
> 
> But ACPICA is currently not implemented in this way.
> In acpi_load_tables(), only limitted opcodes (named object creation
> opcodes) are executed in a 2-pass parsing (load_pass_1 and load_pass_2)
> and many opcodes in the "If", "Else", .. packages are just linked together
> (acpi_ps_link_module_code()). This happens during the early stage
> (acpi_early_init()). At a later stage (acpi_bus_init()), such module level
> code will be executed (acpi_ns_exec_module_code_list()).
> The code for module level execution is nasty, it hacks "\" from a device
> to a method in order for the executer code to work and hacks it back after
> completing the module level execution.
> 
> While for method invocation, it is currently executed only in 1-pass
> parsing (execute_pass).
> 
> For the above 2-load-pass parsing implementation, only limitted opcodes
> can be supported during table loading. And the 2-load-pass parsing caused
> some known compliance issues.
> For example, if the load-pass-1 failed at a point, then the namespace node
> created before this point will be successfully created, but since the
> load-pass-2 is aborted, such namespace nodes will be left without operand
> objects attached, they will be non-calulatable objects... So far no BIOS
> issues reported on Windows around such broken tables.
> 
> This links to the dicussion of this nasty code issue:
> http://bugs.acpica.org/show_bug.cgi?id=963
> 
> I was thinking the reason for current implementation is due to the all-in-
> one code base for both the iasl disassembler and the in-kernel
> interpreter.
> But it seems Linux also has something to do with this, it is hard to move
> acpi_load_tables() to a late stage and merges it with the module level
> execution into the big execute-pass parsing.
> 
> Thanks and best regards
> -Lv
> 
> --
> You are receiving this mail because:
> You are watching the assignee of the bug.
> 
> --------------------------------------------------------------------------
> ----
> One dashboard for servers and applications across Physical-Virtual-Cloud
> Widest out-of-the-box monitoring support with 50+ applications Performance
> metrics, stats and reports that give you Actionable Insights Deep dive
> visibility with transaction tracing using APM Insight.
> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> _______________________________________________
> acpi-bugzilla mailing list
> acpi-bugzilla@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/acpi-bugzilla
Comment 15 Lv Zheng 2015-05-07 02:19:26 UTC
Where can we find the "module level code is illegal" like statement in the current spec?

I can only find this:
AMLCode := DefBlockHeader TermList
DefMethod := MethodOp PkgLength NameString MethodFlags TermList

Both the AML code in module level and in method are TermList.
If it is illegal, the spec may define AMLCode using a different term.
Comment 16 Lv Zheng 2015-05-07 05:01:33 UTC
Hi,

It could be better to use "dmesg -c" after each brightness key press.
And upload several files here, each contains a "dmesg -c" output for one press.

Thanks and best regards
-Lv
Comment 17 Lv Zheng 2015-05-07 05:04:38 UTC
(In reply to Lv Zheng from comment #16)
> Hi,
> 
> It could be better to use "dmesg -c" after each brightness key press.
> And upload several files here, each contains a "dmesg -c" output for one
> press.
> 
> Thanks and best regards
> -Lv

This is a wrong post against another bug.
Sorry for the noise.
Please ignore it.

Thanks
-Lv
Comment 18 Len Brown 2015-05-12 15:57:01 UTC
(changing HW to ALL, since this the change that broke this Tyan/AMD
 could break any system)
Comment 19 Matt Fleming 2015-05-12 17:17:28 UTC
(In reply to Rafael J. Wysocki from comment #11)
> From the purely ACPI perspective, it should be safe to move
> acpi_early_init().
> 
> I'm not sure what EFI changes have been made since the commit in question,
> though, which may depend on it. Matt's input would be necessary to answer
> that.
> 
> However, if your machines don't use EFI anyway, please try moving
> acpi_early_init().

Moving acpi_early_init() after efi_enter_virtual_mode() will probably work. I can't think of anything that will break necessarily, I would just try it and see. The benefit is doing it before for BGRT is that we can free the EfiBootServices memory a little bit earlier on boot.

But it's not at all clear to me what the root problem is here and whether a straight up revert is just papering over a problem.
Comment 20 Lee, Chun-Yi 2015-05-13 01:03:45 UTC
(In reply to Marius Tolzmann from comment #0)
[...snip]
> 
> [    0.000552] ACPI: Core revision 20131115
> [    0.007292] ACPI: All ACPI Tables successfully acquired
> [    0.049357] ACPI Error: Hardware did not enter ACPI mode
> (20131115/evxfevnt-113)
> [    0.049710] ACPI Warning: AcpiEnable failed (20131115/utxfinit-161)
> [    0.049977] ACPI: Unable to enable ACPI

The above error message indicate the SCI_EN bit didn't set by the hardware platform and kernel already wait 3 seconds for hardware setting SCI_EN. That means no SCI interrupt will occur.

The position of acpi_early_init() before and after changed:

start_kernel(void)
{
[...snip]
        anon_vma_init();
        acpi_early_init();              <=== current position
#ifdef CONFIG_X86
        if (efi_enabled(EFI_RUNTIME_SERVICES))      /* legacy BIOS in this case, no EFI */
                efi_enter_virtual_mode();
#endif
#ifdef CONFIG_X86_ESPFIX64
        /* Should be run before the first non-init thread is created */
        init_espfix_bsp();
#endif
        thread_info_cache_init();
        cred_init();
        fork_init();
        proc_caches_init();
        buffer_init();
        key_init();
        security_init();
        dbg_late_init();
        vfs_caches_init(totalram_pages);
        signals_init();
        /* rootfs populating might need page-writeback */
        page_writeback_init();
        proc_root_init();
        nsfs_init();
        cpuset_init();
        cgroup_init();
        taskstats_init_early();
        delayacct_init();

        check_bugs();

        /* acpi_early_init(); */ /* before LAPIC and SMP init */   <=== original position
        sfi_init_late();


After a quick glance, honestly I didn't see dubious function call between old and new position.
Comment 21 Rafael J. Wysocki 2015-05-27 13:09:55 UTC
Created attachment 178101 [details]
ACPI / init: Split acpi_early_init()

Marius, can you please check if this patch makes any difference on the affected system?
Comment 22 Marius Tolzmann 2015-05-27 14:53:33 UTC
(In reply to Rafael J. Wysocki from comment #21)
> Created attachment 178101 [details]
> ACPI / init: Split acpi_early_init()
> 
> Marius, can you please check if this patch makes any difference on the
> affected system?

yes it works - the affected system seems to work fine again with the patch applied.

thanks a lot for troubleshooting and fixing our problem ;)
Comment 23 Rafael J. Wysocki 2015-05-27 21:03:54 UTC
OK, thanks for testing!

I'll add a changelog and post it for review, then.
Comment 24 Rafael J. Wysocki 2015-05-30 11:58:07 UTC
(In reply to Rafael J. Wysocki from comment #23)
> OK, thanks for testing!
> 
> I'll add a changelog and post it for review, then.

Submitted as https://patchwork.kernel.org/patch/6512961/
Comment 25 Len Brown 2015-07-22 00:08:10 UTC
Linux-4.2-rc1 has:

commit b064a8fa77dfead647564c46ac8fc5b13bd1ab73
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Wed Jun 10 01:33:36 2015 +0200

    ACPI / init: Switch over platform to the ACPI mode later
Comment 26 Eduardo Ramirez 2015-08-27 19:35:40 UTC
I  have  similar  problem  on  Fedora22.  How  can  I  apply  the  patch  that  you  have developed?
Comment 27 Aaron Lu 2015-08-28 02:34:30 UTC
Build the latest Linus' git tree should have this patch included.
Comment 28 Marius Tolzmann 2015-08-28 10:21:41 UTC
commit id is b064a8fa77dfead647564c46ac8fc5b13bd1ab73

afaik it is also included in older kernels:

in 4.1.x  since v4.1.3   (as 9e60048)
in 4.0.x  since v4.0.9   (as 8ce8431)
in 3.14.x since v3.14.49 (as cdbbeb6)

Note You need to log in before you can comment on or make changes to this bug.