Bug 121701 - Module level code (Lock) - [regression] Kernel does not boot with 7ed18e2d1b6782989eb399ef79a8cc1a1b583b3c
Summary: Module level code (Lock) - [regression] Kernel does not boot with 7ed18e2d1b6...
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: ACPICA-Core (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Lv Zheng
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-08 21:24 UTC by Greg White
Modified: 2016-09-27 07:22 UTC (History)
6 users (show)

See Also:
Kernel Version: 4.7.0-rc6+
Subsystem:
Regression: No
Bisected commit-id:


Attachments
boot.log (56.13 KB, text/x-log)
2016-07-10 20:33 UTC, Greg White
Details
acpidump.tar.gz (56.94 KB, application/octet-stream)
2016-07-10 21:32 UTC, Greg White
Details
[PATCH] ACPICA: Dispatcher: Fix an issue that the opregions created by the linked MLC were not tracked (1.44 KB, patch)
2016-07-11 05:36 UTC, Lv Zheng
Details | Diff
[PATCH] ACPICA: Namespace: Add acpi_ns_get_node_unlocked() (4.98 KB, patch)
2016-07-11 05:36 UTC, Lv Zheng
Details | Diff
[PATCH] ACPICA: Namespace: Fix dynamic table loading issues by tuning namespace/interpreter locks (13.88 KB, patch)
2016-07-11 05:37 UTC, Lv Zheng
Details | Diff
[PATCH 3] ACPICA: This patch fixes an issue with acpi_ds_auto_serialized_method() (2.41 KB, patch)
2016-07-11 05:37 UTC, Lv Zheng
Details | Diff
[PATCH 1/4] ACPICA: Dispatcher: Fix an issue that the opregions created by the linked MLC were not tracked (1.29 KB, patch)
2016-07-11 08:30 UTC, Lv Zheng
Details | Diff
[PATCH 2/4] ACPICA: Namespace: Add acpi_ns_get_node_unlocked() (5.11 KB, patch)
2016-07-11 08:31 UTC, Lv Zheng
Details | Diff
[PATCH 3/4] ACPICA: Namespace: Fix dynamic table loading issues by tuning namespace/interpreter locks (14.31 KB, patch)
2016-07-11 08:33 UTC, Lv Zheng
Details | Diff
[PATCH 4/4] ACPICA: Fix dead lock in acpi_ds_auto_serialize_method() (2.67 KB, patch)
2016-07-11 08:34 UTC, Lv Zheng
Details | Diff
Screen shot at point of boot hang, without patches applied and without commit reverted (3.43 MB, image/jpeg)
2016-07-11 14:24 UTC, Greg White
Details
boot log with patches in place (56.83 KB, text/x-log)
2016-07-12 09:20 UTC, Greg White
Details

Description Greg White 2016-07-08 21:24:09 UTC
Kernel fails to boot with this commit.  With the commit reverted, everything is fine.  I cannot bisect any further - git is failing with "you are trying to use to much memory"(?!)

This is a Dell Precision 5510 with the latest (1.2.10) BIOS applied.

Let me know what additional information I can provide.

commit 7ed18e2d1b6782989eb399ef79a8cc1a1b583b3c
Merge: c09230f b6d9015
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Jul 7 20:49:41 2016 -0700

    Merge tag 'acpi-4.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
    
    Pull ACPI fixes from Rafael Wysocki:
     "All of these fix recent regressions in ACPICA, in the ACPI PCI IRQ
      management code and in the ACPI AML debugger.
Comment 1 Linus Torvalds 2016-07-10 15:47:56 UTC
Can you specify exactly what the git error message was?

It sounds like it might have been

  "you want to use way too much memory"

which is a very specific git error - it is *meant* to only happen when there is an actual arithmetic overflow in "unsigned int" (and it's only tested for one of the internal *string* functions, so it would be a very long string indeed).

But it might be a git bug, of course.

But it would be good to get the exact error message.

(There are other memory allocation messages that can happen, but they tend to be just "Out of memory, realloc failed" and similar).

Anyway, in case it's a git bug, can you check bisecting just that one merge. Do

    # get rid of the old bisect if you didn't already
    git bisect reset

    # mark the merge bad, the parent good.
    git bisect bad 7ed18e2d1b6782989eb399ef79a8cc1a1b583b3c
    git bisect good 7ed18e2d1b6782989eb399ef79a8cc1a1b583b3c^

and see if the git problem was due to some earlier stage in the bisect.

Thanks,

             Linus
Comment 2 Greg White 2016-07-10 19:23:57 UTC
OK, thanks Linus, that helped.  I was able to get down to the specific commit.  The end of the git bisect is below, as is the offending error message, which happened on the final git bisect bad.

~$ git bisect bad
45209046c47b93fadf26dc59a9da724f387b9cf2 is the first bad commit
commit 45209046c47b93fadf26dc59a9da724f387b9cf2
fatal: you want to use way too much memory
~$ git --version 
git version 2.9.0

Just for additional clarity, the commit header is:

commit 45209046c47b93fadf26dc59a9da724f387b9cf2
Author: Lv Zheng <lv.zheng@intel.com>
Date:   Tue Jul 5 13:53:12 2016 +0800

    ACPICA: Namespace: Fix namespace/interpreter lock ordering
 

I have verified that pulling that commit out fixes the problem.  With it in, I cannot boot.
Comment 3 The Linux kernel's regression tracker (Thorsten Leemhuis) 2016-07-10 19:35:58 UTC
Greg, Sinan Kaya in http://thread.gmane.org/gmane.linux.kernel.pci/53279/focus=53316 asked "Can you attach the boot log to the bugzilla?"

Side note: Sorry, I forgot to CC you when raising the issue on the mailing list.
Comment 4 Greg White 2016-07-10 19:39:27 UTC
(I was asked for a boot log)

There is no boot log - the hang appears to be immediate.  No output.  If there is some way to turn on earlier debug logs, please let me know (I do have early printk on, going to the EFI framebuffer.  Nothing shows up.)
Comment 5 The Linux kernel's regression tracker (Thorsten Leemhuis) 2016-07-10 20:26:25 UTC
(In reply to Greg White from comment #4)
>
> There is no boot log - the hang appears to be immediate.  No output. 

I suspect he meant a log from a "good" kernel/one that accentually boots
Comment 6 Greg White 2016-07-10 20:32:29 UTC
Ah.  Of course.  Attached.  Head is at 617a8d6bc19edd075e8111c6770f79cae75be51f, with 45209046c47b93fadf26dc59a9da724f387b9cf2 reverted.
Comment 7 Greg White 2016-07-10 20:33:53 UTC
Created attachment 222671 [details]
boot.log
Comment 8 Rafael J. Wysocki 2016-07-10 21:10:23 UTC
(In reply to Greg White from comment #2)
> OK, thanks Linus, that helped.  I was able to get down to the specific
> commit.  The end of the git bisect is below, as is the offending error
> message, which happened on the final git bisect bad.
> 
> ~$ git bisect bad
> 45209046c47b93fadf26dc59a9da724f387b9cf2 is the first bad commit
> commit 45209046c47b93fadf26dc59a9da724f387b9cf2
> fatal: you want to use way too much memory
> ~$ git --version 
> git version 2.9.0
> 
> Just for additional clarity, the commit header is:
> 
> commit 45209046c47b93fadf26dc59a9da724f387b9cf2
> Author: Lv Zheng <lv.zheng@intel.com>
> Date:   Tue Jul 5 13:53:12 2016 +0800
> 
>     ACPICA: Namespace: Fix namespace/interpreter lock ordering
>  
> 
> I have verified that pulling that commit out fixes the problem.  With it in,
> I cannot boot.

OK, thanks!
Comment 9 Rafael J. Wysocki 2016-07-10 21:26:40 UTC
@Greg: Please also attach the acpidump output from the affected system.
Comment 10 Greg White 2016-07-10 21:32:25 UTC
Created attachment 222681 [details]
acpidump.tar.gz
Comment 11 Greg White 2016-07-10 21:32:57 UTC
Dump attached.  I screwed up the file extension, sorry.  That's a tar.gz.
Comment 12 Linus Torvalds 2016-07-10 21:39:58 UTC
Side note: the git bisect problem is a bug in 2.9.0.

I didn't check if 2.9.1 has the fix, but it's fixed in the current master branch if you want to build your own and avoid this in the future.
Comment 13 Rafael J. Wysocki 2016-07-10 21:45:08 UTC
Thanks!

Lv needs to look at this in detail.

In the meantime I'll queue up a revert of commit 45209046c47b as we can live with the issue it attempted to fix (at least that one doesn't prevent systems from booting).

But maybe Lv is able to come up with a better fix shortly.
Comment 14 Lv Zheng 2016-07-11 02:00:40 UTC
Hi, Greg

45209046c47b is a quick fix for a lock problem.
In fact we have a big trouble related to namespace/interpreter lock issues.
And we are discussing a better solution for this issue.

However the quick fix should be able to fix the current upstream issues.
And your report is not what I can understand.
So we need to get your issue root caused.

I checked the acpidump.
There are "Load" opcodes invoked from
\_PR.CPUx._PDC or \_PR.CPUx._OSC

Could you try to:
1. comment "acpi_processor_set_pdc()" out, it's in drivers/acpi/processor_pdc.c;
2. have 45209046c47b merged;
3. try again to see if the hang can still be seen.

Thanks in advance.

Best regards
-Lv
Comment 15 Greg White 2016-07-11 02:09:37 UTC
OK, I tried that - still hangs immediately.
Comment 16 Lv Zheng 2016-07-11 05:05:15 UTC
This seems to be the earliest one invoking _OSC/_PDC.
Could you please take a screenshot/video about the issue so that we can at least see some useful debugging information?

Also, could you try to boot with "acpi_no_auto_serialize".

Thanks
-Lv
Comment 17 Lv Zheng 2016-07-11 05:30:39 UTC
And please remove the "quiet" from the kernel boot parameter when you do the test.

Thanks
-Lv
Comment 18 Lv Zheng 2016-07-11 05:36:06 UTC
Created attachment 222701 [details]
[PATCH] ACPICA: Dispatcher: Fix an issue that the opregions created by the linked MLC were not tracked
Comment 19 Lv Zheng 2016-07-11 05:36:35 UTC
Created attachment 222711 [details]
[PATCH] ACPICA: Namespace: Add acpi_ns_get_node_unlocked()
Comment 20 Lv Zheng 2016-07-11 05:37:04 UTC
Created attachment 222721 [details]
[PATCH] ACPICA: Namespace: Fix dynamic table loading issues by tuning namespace/interpreter locks
Comment 21 Lv Zheng 2016-07-11 05:37:49 UTC
Created attachment 222731 [details]
[PATCH 3] ACPICA: This patch fixes an issue with acpi_ds_auto_serialized_method()
Comment 22 Lv Zheng 2016-07-11 05:39:47 UTC
Another test is:

1. apply 45209046c47b
2. apply attachment 222701 [details], attachment 222711 [details], attachment 222721 [details], attachment 222721 [details]
3. build the kernel and try again

45209046c47b is the quick fix and the attachnents are the better solution, but is still experimental.
However you can give it a try.

Thanks
-Lv
Comment 23 Lv Zheng 2016-07-11 08:30:45 UTC
Created attachment 222761 [details]
[PATCH 1/4] ACPICA: Dispatcher: Fix an issue that the opregions created by the linked MLC were not tracked
Comment 24 Lv Zheng 2016-07-11 08:31:19 UTC
Created attachment 222771 [details]
[PATCH 2/4] ACPICA: Namespace: Add acpi_ns_get_node_unlocked()
Comment 25 Lv Zheng 2016-07-11 08:33:57 UTC
Created attachment 222781 [details]
[PATCH 3/4] ACPICA: Namespace: Fix dynamic table loading issues by tuning namespace/interpreter locks

Previous version of this patch contains issues around acpi_ut_add_address_range().
This patch is an improvement result.

However, my lock assessment coverity may still not be adequate.
I need to do more assessments.
Let me just know the test result on your platform.

Thanks
-Lv
Comment 26 Lv Zheng 2016-07-11 08:34:27 UTC
Created attachment 222791 [details]
[PATCH 4/4] ACPICA: Fix dead lock in acpi_ds_auto_serialize_method()
Comment 27 Lv Zheng 2016-07-11 08:35:47 UTC
Please ignore comment 22.
The updated test request is:

1. apply 45209046c47b
2. apply attachment 222761 [details], attachment 222771 [details], attachment 222781 [details], attachment 222791 [details]
3. build the kernel and try again

45209046c47b is the quick fix and the attachnents are the better solution, but is still experimental.
However you can give it a try.

Thanks
-Lv
Comment 28 Greg White 2016-07-11 09:57:23 UTC
There are a number of different requests for info in the bug now.  I'll just skip to the last one - applying the sequence of patches attached to the bug seems to fix the problem for me.

I did see this error message (adding a few lines above for context): 

[    1.150526] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[    1.150545] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
[    1.150571] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[    1.154608] acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed


If you need verbose boot logs, etc, just let me know.
Comment 29 Rafael J. Wysocki 2016-07-11 13:24:37 UTC
(In reply to Lv Zheng from comment #23)
> Created attachment 222761 [details]
> [PATCH 1/4] ACPICA: Dispatcher: Fix an issue that the opregions created by
> the linked MLC were not tracked

Why is this needed at all?
Comment 30 Lv Zheng 2016-07-11 13:28:49 UTC
(In reply to Rafael J. Wysocki from comment #29)
> (In reply to Lv Zheng from comment #23)
> > Created attachment 222761 [details]
> > [PATCH 1/4] ACPICA: Dispatcher: Fix an issue that the opregions created by
> > the linked MLC were not tracked
> 
> Why is this needed at all?

I just got a few patch dependency issues.

This isn't needed for fixing this lock problem.
It's only useful when we fix the MLC grammar problem.
We can skip it.

Thanks
-Lv
Comment 31 Lv Zheng 2016-07-11 13:30:40 UTC
(In reply to Greg White from comment #28)
> There are a number of different requests for info in the bug now.  I'll just
> skip to the last one - applying the sequence of patches attached to the bug
> seems to fix the problem for me.
> 
> I did see this error message (adding a few lines above for context): 
> 
> [    1.150526] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> [    1.150545] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
> [    1.150571] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
> [    1.154608] acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed
> 
> 
> If you need verbose boot logs, etc, just let me know.

Please upload the log.
This issue seems to be the root cause of many ACPICA problems.
Just let me know the cases you encountered.
That is potentially be another affected issue.
Thanks in advance.

Best regards
-Lv
Comment 32 Greg White 2016-07-11 14:24:54 UTC
Created attachment 222851 [details]
Screen shot at point of boot hang, without patches applied and without commit reverted
Comment 33 Greg White 2016-07-11 14:32:53 UTC
I also tried booting with acpi_no_auto_serialize; it still hung as before.
Comment 34 Sinan Kaya 2016-07-11 14:57:04 UTC
Thank you, sorry for the noise.

Since the revert on lock solves the issue; this issue does not seem to be related to acpi irq changes.

Acpi irq related ussue symptoms look like irq assignment failures and driver initialization failures.

I don't see any of these.
Comment 35 Lv Zheng 2016-07-12 05:19:24 UTC
(In reply to Greg White from comment #32)
> Created attachment 222851 [details]
> Screen shot at point of boot hang, without patches applied and without
> commit reverted

However, I mean the boot log after applying the patches.
Was this seen after applying the patches?
[    1.154608] acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed

I need to check if it implies more required engineering work.
Comment 36 Lv Zheng 2016-07-12 05:21:54 UTC
(In reply to Greg White from comment #28)
> There are a number of different requests for info in the bug now.  I'll just
> skip to the last one - applying the sequence of patches attached to the bug
> seems to fix the problem for me.

Though the patches can fix your issue.
They are still experimental.
And the solution is still under discussion with the ACPICA upstream review participants.

For now, we could just revert 3 commits to stop the regression.

And when all of the necessary engineering work done, we then can re-enable the reverted feature.

Thanks and best regards
-Lv
Comment 37 Greg White 2016-07-12 09:20:34 UTC
Created attachment 222961 [details]
boot log with patches in place
Comment 38 Greg White 2016-07-12 09:22:07 UTC
Boot log attached.

The acpi_walk_namespace problem did happen with the patches in place.

I will revert the patches and pull out the bad commit instead.

Thanks.
Comment 39 Rafael J. Wysocki 2016-07-13 11:40:22 UTC
45209046c47b93fadf26dc59a9da724f387b9cf2 has been reverted from the Linus' tree along with 2f38b1b16d92 and 3d4b7ae96d81, closing.
Comment 40 Lv Zheng 2016-07-20 05:17:40 UTC
(In reply to Greg White from comment #37 and #38)
> Created attachment 222961 [details]
> boot log with patches in place

Looks good now.

> Boot log attached.
> 
> The acpi_walk_namespace problem did happen with the patches in place.
> 
> I will revert the patches and pull out the bad commit instead.

Yes.
The solution still need to be improved.
I still can see several acpi_ns_get_node() need to be change.

And there are things need to be improved for the debugger.
In which lock need to be tuned around acpi_ps_parse_aml().

Also the solution need to take care of the table lock.

I'm waiting for the ACK to the solution.
And will improve it later.
We can wait for the full fix released from the ACPICA upstream.

Thanks
-Lv

Note You need to log in before you can comment on or make changes to this bug.