Bug 121701
Description
Greg White
2016-07-08 21:24:09 UTC
Can you specify exactly what the git error message was?

It sounds like it might have been "you want to use way too much memory" which is a very specific git error - it is *meant* to only happen when there is an actual arithmetic overflow in "unsigned int" (and it's only tested for one of the internal *string* functions, so it would be a very long string indeed).

But it might be a git bug, of course. But it would be good to get the exact error message. (There are other memory allocation messages that can happen, but they tend to be just "Out of memory, realloc failed" and similar).

Anyway, in case it's a git bug, can you check bisecting just that one merge. Do

    # get rid of the old bisect if you didn't already
    git bisect reset

    # mark the merge bad, the parent good.
    git bisect bad 7ed18e2d1b6782989eb399ef79a8cc1a1b583b3c
    git bisect good 7ed18e2d1b6782989eb399ef79a8cc1a1b583b3c^

and see if the git problem was due to some earlier stage in the bisect.

Thanks,
Linus

OK, thanks Linus, that helped. I was able to get down to the specific commit. The end of the git bisect is below, as is the offending error message, which happened on the final git bisect bad.

    ~$ git bisect bad
    45209046c47b93fadf26dc59a9da724f387b9cf2 is the first bad commit
    commit 45209046c47b93fadf26dc59a9da724f387b9cf2
    fatal: you want to use way too much memory
    ~$ git --version
    git version 2.9.0

Just for additional clarity, the commit header is:

    commit 45209046c47b93fadf26dc59a9da724f387b9cf2
    Author: Lv Zheng <lv.zheng@intel.com>
    Date: Tue Jul 5 13:53:12 2016 +0800

    ACPICA: Namespace: Fix namespace/interpreter lock ordering

I have verified that pulling that commit out fixes the problem. With it in, I cannot boot.

Greg, Sinan Kaya in http://thread.gmane.org/gmane.linux.kernel.pci/53279/focus=53316 asked "Can you attach the boot log to the bugzilla?"

Side note: Sorry, I forgot to CC you when raising the issue on the mailing list.
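For reference, the merge-only bisect from the quoted message can be written out as a short command plan. This is only a sketch: `bisect_merge_plan` is a hypothetical helper name, and the explicit `git bisect start` line is an addition that is not in the quoted message (as far as I know, after a reset git otherwise offers to start a session itself when run interactively).

```shell
# Print the commands that bisect only the commits brought in by one merge.
# bisect_merge_plan is a hypothetical helper, not part of git.
bisect_merge_plan() {
    merge=$1
    printf '%s\n' \
        "git bisect reset" \
        "git bisect start" \
        "git bisect bad $merge" \
        "git bisect good $merge^"
}
```

Running `bisect_merge_plan 7ed18e2d1b6782989eb399ef79a8cc1a1b583b3c` prints the four commands for review; they can then be run by hand from inside the kernel tree.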
(I was asked for a boot log)

There is no boot log - the hang appears to be immediate. No output. If there is some way to turn on earlier debug logs, please let me know (I do have early printk on, going to the EFI framebuffer. Nothing shows up.)

(In reply to Greg White from comment #4)
> There is no boot log - the hang appears to be immediate. No output.

I suspect he meant a log from a "good" kernel/one that actually boots

Ah. Of course. Attached. Head is at 617a8d6bc19edd075e8111c6770f79cae75be51f, with 45209046c47b93fadf26dc59a9da724f387b9cf2 reverted.

Created attachment 222671 [details]
boot.log
(In reply to Greg White from comment #2)
> OK, thanks Linus, that helped. I was able to get down to the specific
> commit. The end of the git bisect is below, as is the offending error
> message, which happened on the final git bisect bad.
>
> ~$ git bisect bad
> 45209046c47b93fadf26dc59a9da724f387b9cf2 is the first bad commit
> commit 45209046c47b93fadf26dc59a9da724f387b9cf2
> fatal: you want to use way too much memory
> ~$ git --version
> git version 2.9.0
>
> Just for additional clarity, the commit header is:
>
> commit 45209046c47b93fadf26dc59a9da724f387b9cf2
> Author: Lv Zheng <lv.zheng@intel.com>
> Date: Tue Jul 5 13:53:12 2016 +0800
>
> ACPICA: Namespace: Fix namespace/interpreter lock ordering
>
>
> I have verified that pulling that commit out fixes the problem. With it in,
> I cannot boot.

OK, thanks!

@Greg: Please also attach the acpidump output from the affected system.

Created attachment 222681 [details]
acpidump.tar.gz
Dump attached. I screwed up the file extension, sorry. That's a tar.gz.

Side note: the git bisect problem is a bug in 2.9.0. I didn't check if 2.9.1 has the fix, but it's fixed in the current master branch if you want to build your own and avoid this in the future.

Thanks!

Lv needs to look at this in detail.

In the meantime I'll queue up a revert of commit 45209046c47b as we can live with the issue it attempted to fix (at least that one doesn't prevent systems from booting). But maybe Lv is able to come up with a better fix shortly.

Hi, Greg

45209046c47b is a quick fix for a lock problem. In fact, we have big trouble with namespace/interpreter lock issues, and we are discussing a better solution for them. However, the quick fix should be able to fix the current upstream issues. Your report is not something I can explain yet, so we need to get your issue root-caused.

I checked the acpidump. There are "Load" opcodes invoked from \_PR.CPUx._PDC or \_PR.CPUx._OSC.

Could you try to:
1. comment out "acpi_processor_set_pdc()" (it's in drivers/acpi/processor_pdc.c);
2. have 45209046c47b merged;
3. try again to see if the hang can still be seen.

Thanks in advance.
Best regards
-Lv

OK, I tried that - still hangs immediately.

This seems to be the earliest one invoking _OSC/_PDC. Could you please take a screenshot/video of the issue so that we can at least see some useful debugging information? Also, could you try to boot with "acpi_no_auto_serialize"?
Thanks
-Lv

And please remove "quiet" from the kernel boot parameters when you do the test.
Thanks
-Lv

Created attachment 222701 [details]
[PATCH] ACPICA: Dispatcher: Fix an issue that the opregions created by the linked MLC were not tracked
Created attachment 222711 [details]
[PATCH] ACPICA: Namespace: Add acpi_ns_get_node_unlocked()
Created attachment 222721 [details]
[PATCH] ACPICA: Namespace: Fix dynamic table loading issues by tuning namespace/interpreter locks
Created attachment 222731 [details]
[PATCH 3] ACPICA: This patch fixes an issue with acpi_ds_auto_serialized_method()
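For the boot-parameter test requested earlier (boot with "acpi_no_auto_serialize" and without "quiet"), the edit to the kernel command line can be sketched as below. `debug_cmdline` is a hypothetical helper name, and the sample command line in the usage note is made up for illustration; how the result gets into the bootloader configuration depends on the distribution and is not covered here.

```shell
# Rewrite a kernel command line for the requested test:
# drop "quiet" and make sure "acpi_no_auto_serialize" is present exactly once.
# debug_cmdline is a hypothetical helper; it only transforms a string.
debug_cmdline() {
    out=
    # Unquoted $1 is split into whitespace-separated parameters on purpose.
    for w in $1; do
        [ "$w" = quiet ] && continue
        [ "$w" = acpi_no_auto_serialize ] && continue
        out="${out:+$out }$w"
    done
    printf '%s\n' "${out:+$out }acpi_no_auto_serialize"
}
```

For example, `debug_cmdline 'root=/dev/sda1 ro quiet splash'` yields `root=/dev/sda1 ro splash acpi_no_auto_serialize`.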
Another test is:
1. apply 45209046c47b
2. apply attachment 222701 [details], attachment 222711 [details], attachment 222721 [details], attachment 222721 [details]
3. build the kernel and try again

45209046c47b is the quick fix and the attachments are the better solution, which is still experimental. However, you can give it a try.
Thanks
-Lv

Created attachment 222761 [details]
[PATCH 1/4] ACPICA: Dispatcher: Fix an issue that the opregions created by the linked MLC were not tracked
Created attachment 222771 [details]
[PATCH 2/4] ACPICA: Namespace: Add acpi_ns_get_node_unlocked()
Created attachment 222781 [details]
[PATCH 3/4] ACPICA: Namespace: Fix dynamic table loading issues by tuning namespace/interpreter locks
The previous version of this patch contained issues around acpi_ut_add_address_range().
This patch is an improved version.
However, my lock assessment coverage may still not be adequate.
I need to do more assessments.
Just let me know the test result on your platform.
Thanks
-Lv
Created attachment 222791 [details]
[PATCH 4/4] ACPICA: Fix dead lock in acpi_ds_auto_serialize_method()
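Applying the four-patch series (attachments 222761 to 222791) in order on top of 45209046c47b can be sketched as follows. This assumes the attachments were saved into one directory as `*.patch` files whose names sort in series order (for example with a `0001-` to `0004-` prefix) and that they are in a format `git am` accepts; both are assumptions, and `apply_series` and `APPLY_CMD` are hypothetical names.

```shell
# Apply every *.patch file in a directory in sorted (series) order,
# stopping at the first failure.
# apply_series and APPLY_CMD are hypothetical names for this sketch.
apply_series() {
    dir=$1
    for p in "$dir"/*.patch; do
        # If the glob matched nothing, $p is the literal pattern.
        [ -e "$p" ] || { echo "no patches found in $dir" >&2; return 1; }
        # Default to "git am"; override with APPLY_CMD, e.g. 'patch -p1 -i'.
        ${APPLY_CMD:-git am} "$p" || return 1
    done
}
```

Setting `APPLY_CMD=echo` gives a dry run that only lists the patches in the order they would be applied.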
Please ignore comment 22. The updated test request is:
1. apply 45209046c47b
2. apply attachment 222761 [details], attachment 222771 [details], attachment 222781 [details], attachment 222791 [details]
3. build the kernel and try again

45209046c47b is the quick fix and the attachments are the better solution, which is still experimental. However, you can give it a try.
Thanks
-Lv

There are a number of different requests for info in the bug now. I'll just skip to the last one - applying the sequence of patches attached to the bug seems to fix the problem for me.

I did see this error message (adding a few lines above for context):

[ 1.150526] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[ 1.150545] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
[ 1.150571] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[ 1.154608] acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed

If you need verbose boot logs, etc, just let me know.

(In reply to Lv Zheng from comment #23)
> Created attachment 222761 [details]
> [PATCH 1/4] ACPICA: Dispatcher: Fix an issue that the opregions created by
> the linked MLC were not tracked

Why is this needed at all?

(In reply to Rafael J. Wysocki from comment #29)
> (In reply to Lv Zheng from comment #23)
> > Created attachment 222761 [details]
> > [PATCH 1/4] ACPICA: Dispatcher: Fix an issue that the opregions created by
> > the linked MLC were not tracked
>
> Why is this needed at all?

I just got a few patch dependency issues. This isn't needed for fixing this lock problem. It's only useful when we fix the MLC grammar problem. We can skip it.
Thanks
-Lv

(In reply to Greg White from comment #28)
> There are a number of different requests for info in the bug now. I'll just
> skip to the last one - applying the sequence of patches attached to the bug
> seems to fix the problem for me.
>
> I did see this error message (adding a few lines above for context):
>
> [ 1.150526] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> [ 1.150545] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
> [ 1.150571] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
> [ 1.154608] acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed
>
>
> If you need verbose boot logs, etc, just let me know.

Please upload the log. This issue seems to be the root cause of many ACPICA problems. Just let me know the cases you encountered. That may potentially be another affected issue.
Thanks in advance.
Best regards
-Lv

Created attachment 222851 [details]
Screen shot at point of boot hang, without patches applied and without commit reverted
I also tried booting with acpi_no_auto_serialize; it still hung as before.

Thank you, sorry for the noise. Since the revert of the lock commit solves the issue, this issue does not seem to be related to the ACPI IRQ changes. ACPI IRQ related issue symptoms look like IRQ assignment failures and driver initialization failures. I don't see any of these.

(In reply to Greg White from comment #32)
> Created attachment 222851 [details]
> Screen shot at point of boot hang, without patches applied and without
> commit reverted

However, I meant the boot log after applying the patches. Was this seen after applying the patches?

[ 1.154608] acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed

I need to check if it implies more required engineering work.

(In reply to Greg White from comment #28)
> There are a number of different requests for info in the bug now. I'll just
> skip to the last one - applying the sequence of patches attached to the bug
> seems to fix the problem for me.

Though the patches can fix your issue, they are still experimental, and the solution is still under discussion with the ACPICA upstream review participants. For now, we could just revert 3 commits to stop the regression. When all of the necessary engineering work is done, we can then re-enable the reverted feature.
Thanks and best regards
-Lv

Created attachment 222961 [details]
boot log with patches in place
Boot log attached.

The acpi_walk_namespace problem did happen with the patches in place.

I will revert the patches and pull out the bad commit instead. Thanks.

45209046c47b93fadf26dc59a9da724f387b9cf2 has been reverted from Linus' tree along with 2f38b1b16d92 and 3d4b7ae96d81, closing.

(In reply to Greg White from comment #37 and #38)
> Created attachment 222961 [details]
> boot log with patches in place

Looks good now.

> Boot log attached.
>
> The acpi_walk_namespace problem did happen with the patches in place.
>
> I will revert the patches and pull out the bad commit instead.

Yes. The solution still needs to be improved. I can still see several acpi_ns_get_node() calls that need to be changed. There are also things that need to be improved for the debugger, where the locks need to be tuned around acpi_ps_parse_aml(). The solution also needs to take care of the table lock. I'm waiting for the ACK on the solution and will improve it later. We can wait for the full fix to be released from ACPICA upstream.
Thanks
-Lv
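For anyone on a tree that still carries the three commits named in the closing comment, reverting them locally can be sketched as below. `revert_commits` and the `GIT` override are hypothetical names for this sketch. The order in the usage note simply follows the closing comment; whether that is the right order for these particular commits is an assumption (when commits build on one another, reverting the newest first is the usual choice).

```shell
# Revert the given commits one by one, stopping at the first failure.
# revert_commits is a hypothetical helper; GIT can be overridden for a dry run.
revert_commits() {
    for c in "$@"; do
        ${GIT:-git} revert --no-edit "$c" || return 1
    done
}
```

A dry run such as `GIT=echo revert_commits 45209046c47b 2f38b1b16d92 3d4b7ae96d81` only prints the arguments that would be passed to git, one revert per line.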