Bug 202475

Summary: Boot crash in acpi
Product: ACPI Reporter: Michael J Gruber (mjg)
Component: ACPICA-CoreAssignee: Erik Kaneda (erik.kaneda)
Status: RESOLVED CODE_FIX    
Severity: high CC: erik.kaneda, jdelvare, rui.zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.20.5 (and 4.19 series) Subsystem:
Regression: Yes Bisected commit-id:
Attachments: lspci
lscpu
4.19.3 boot kernelmsgs without crash
4.19.7 boot kernelmsgs with crash
4.19.8 atomacpi (revert patch) boot kernelmsgs without crash
4.19.8 boot kernelmsgs with crash
4.19.2-300 boot kernelmsgs with crash
4.19.2-301 boot kernelmsgs with crash
4.19.4-300 boot kernelmsgs without crash
4.19.5 boot kernelmsgs without crash
4.19.6 boot kernelmsgs with crash
4.19.9 boot kernelmsgs with crash
4.19.9 boot kernelmsgs with crash
4.20.0 boot kernelmsgs with crash
4.20.5 no crash with revert patch
4.20.5 crash
4.16.16-300 (Fedora 28 kernel on fedora 29) no crash
4.20.6 with revert patch (no crash)
acpidump
4.16.16 acpi.debug
4.20.5 acpi.debug
4.16.16 debug msgs
4.20.5 debug msgs
remove op region address after control method execution
4.20.11 stock Fedora debug kernel
4.20.11 debug kernel with suggested patch
4.20.11 debug kernel with the revert
4.20.5 debug msgs (new options)
4.20.5 patched debug msgs (new options)
4.20.5 patched non-debug msgs

Description Michael J Gruber 2019-01-31 10:53:54 UTC
Most kernels in the 4.19 and 4.20 series crash during boot in the acpi component on specific hardware (see below).

This is 100% reproducible.

By bisection, the culprit is commit 4abb951b73ff ("ACPICA: AML interpreter: add region addresses in global list during initialization"), which was first introduced in upstream v4.19.2 (commit 22083c028d0b), reverted in v4.19.3 (commit 8ef305fbc50d), and finally included again in v4.19.6 (commit 87403e35bc56).

Jeremy Cline sent an email in response to its inclusion in v4.19.6, but it must have slipped through the cracks for the ACPI maintainers.

Original Fedora bug: https://bugzilla.redhat.com/show_bug.cgi?id=1659225

Note that I tested both with vanilla and with Fedora kernels.

This is is an old HP/Compaq 6715b laptop. lscpu and lspci output is attached to the Fedora bug, as well as kernel msgs with and without crash for several kernel versions. Also, an atom user reported a similar problem that disappeared with the same revert.

Please let me know if I should repost the attachments from the Fedora bz over here.
Comment 1 Erik Kaneda 2019-01-31 21:23:29 UTC
(In reply to M.J.G. from comment #0)
> Most kernels in the 4.19 and 4.20 series crash during boot in the acpi
> component on specific hardware (see below).
> 
> This is 100% reproducible.
> 
> By bisection, the culprit is commit 4abb951b73ff ("ACPICA: AML interpreter:
> add region addresses in global list during initialization"), which was first
> introduced in upstream v4.19.2 (commit 22083c028d0b), reverted in v4.19.3
> (commit 8ef305fbc50d), and finally included again in v4.19.6 (commit
> 87403e35bc56).
> 
> Jeremy Cline sent an email in response to its inclusion in v4.19.6, but it
> must have slipped through the cracks for the ACPI maintainers.
> 
> Original Fedora bug: https://bugzilla.redhat.com/show_bug.cgi?id=1659225
> 
> Note that I tested both with vanilla and with Fedora kernels.
> 
> This is is an old HP/Compaq 6715b laptop. lscpu and lspci output is attached
> to the Fedora bug, as well as kernel msgs with and without crash for several
> kernel versions. Also, an atom user reported a similar problem that
> disappeared with the same revert.
> 
> Please let me know if I should repost the attachments from the Fedora bz
> over here.

Please do, that would be helpful. Also, could you try booting with an older kernel? I'm thinking 4.16 or earlier and post the dmesg here. Also, please post the ACPI dump.

https://01.org/linux-acpi/utilities
Comment 2 Michael J Gruber 2019-02-05 14:25:42 UTC
Created attachment 280967 [details]
lspci
Comment 3 Michael J Gruber 2019-02-05 14:26:11 UTC
Created attachment 280969 [details]
lscpu
Comment 4 Michael J Gruber 2019-02-05 14:26:37 UTC
Created attachment 280971 [details]
4.19.3 boot kernelmsgs without crash
Comment 5 Michael J Gruber 2019-02-05 14:26:55 UTC
Created attachment 280973 [details]
4.19.7 boot kernelmsgs with crash
Comment 6 Michael J Gruber 2019-02-05 14:27:24 UTC
Created attachment 280975 [details]
4.19.8 atomacpi (revert patch) boot kernelmsgs without crash
Comment 7 Michael J Gruber 2019-02-05 14:27:46 UTC
Created attachment 280977 [details]
4.19.8 boot kernelmsgs with crash
Comment 8 Michael J Gruber 2019-02-05 14:28:20 UTC
Created attachment 280979 [details]
4.19.2-300 boot kernelmsgs with crash
Comment 9 Michael J Gruber 2019-02-05 14:29:07 UTC
Created attachment 280981 [details]
4.19.2-301 boot kernelmsgs with crash
Comment 10 Michael J Gruber 2019-02-05 14:29:40 UTC
Created attachment 280983 [details]
4.19.4-300 boot kernelmsgs without crash
Comment 11 Michael J Gruber 2019-02-05 14:30:04 UTC
Created attachment 280985 [details]
4.19.5 boot kernelmsgs without crash
Comment 12 Michael J Gruber 2019-02-05 14:30:26 UTC
Created attachment 280987 [details]
4.19.6 boot kernelmsgs with crash
Comment 13 Michael J Gruber 2019-02-05 14:30:47 UTC
Created attachment 280989 [details]
4.19.9 boot kernelmsgs with crash
Comment 14 Michael J Gruber 2019-02-05 14:31:10 UTC
Created attachment 280991 [details]
4.19.9 boot kernelmsgs with crash
Comment 15 Michael J Gruber 2019-02-05 14:31:32 UTC
Created attachment 280993 [details]
4.20.0 boot kernelmsgs with crash
Comment 16 Michael J Gruber 2019-02-05 14:34:34 UTC
Created attachment 280995 [details]
4.20.5 no crash with revert patch
Comment 17 Michael J Gruber 2019-02-05 14:34:56 UTC
Created attachment 280997 [details]
4.20.5 crash
Comment 18 Michael J Gruber 2019-02-05 14:35:49 UTC
Created attachment 280999 [details]
4.16.16-300 (Fedora 28 kernel on fedora 29) no crash
Comment 19 Michael J Gruber 2019-02-05 14:36:29 UTC
Created attachment 281001 [details]
4.20.6 with revert patch (no crash)
Comment 20 Michael J Gruber 2019-02-05 14:36:49 UTC
Created attachment 281003 [details]
acpidump
Comment 21 Michael J Gruber 2019-02-05 14:38:47 UTC
You want kernel msgs? I'll give kernel msgs! ;)

I'll be happy to give you more; testing/upload cycle takes a bit, though.
Comment 22 Erik Kaneda 2019-02-05 23:40:40 UTC
(In reply to M.J.G. from comment #21)
> You want kernel msgs? I'll give kernel msgs! ;)
> 
> I'll be happy to give you more; testing/upload cycle takes a bit, though.

Thanks! This is helpful but I need a little more information. Please boot the 4.16.16 (working) and 4.20.5 (crash) with the following kernel parameters:

acpi.debug_layer=0xffffffff acpi.debug_level=0x600 log_buf_len=10M

These flags tend to spew a lot of data so that's why I included the log_buf_len parameter. Please feel free to increase or decrease the size of the buffer in order to capture all messages during boot.
Comment 23 Michael J Gruber 2019-02-06 20:17:43 UTC
Created attachment 281029 [details]
4.16.16 acpi.debug
Comment 24 Michael J Gruber 2019-02-06 20:18:23 UTC
Created attachment 281031 [details]
4.20.5 acpi.debug
Comment 25 Erik Kaneda 2019-02-06 23:46:47 UTC
Please post 4.20 adn 4.16 dmesg after re-building with CONFIG_ACPI_DEBUG=y

Sorry, I forgot to mention this :(
Comment 26 Michael J Gruber 2019-02-07 10:00:06 UTC
(In reply to Erik Schmauss from comment #25)
> Please post 4.20 adn 4.16 dmesg after re-building with CONFIG_ACPI_DEBUG=y
> 
> Sorry, I forgot to mention this :(

Looking at the meager diffs between the old and new dmesgs, I suspected something like that already...

On Fedora, this gets enabled for the debug kernel together with a lot of other debug options. Would this be OK for your purposes, or do you need the output of a non-debug kernel specifically rebuilt with CONFIG_ACPI_DEBUG=y only?
Comment 27 Erik Kaneda 2019-02-07 17:02:27 UTC
(In reply to M.J.G. from comment #26)
> (In reply to Erik Schmauss from comment #25)
> > Please post 4.20 adn 4.16 dmesg after re-building with CONFIG_ACPI_DEBUG=y
> > 
> > Sorry, I forgot to mention this :(
> 
> Looking at the meager diffs between the old and new dmesgs, I suspected
> something like that already...
> 
> On Fedora, this gets enabled for the debug kernel together with a lot of
> other debug options. Would this be OK for your purposes, or do you need the

So we might get more debugging info that we actually need. That's completely fine by me!

> output of a non-debug kernel specifically rebuilt with CONFIG_ACPI_DEBUG=y
> only?
Comment 28 Michael J Gruber 2019-02-07 20:17:34 UTC
Created attachment 281045 [details]
4.16.16 debug msgs
Comment 29 Michael J Gruber 2019-02-07 20:18:24 UTC
Created attachment 281047 [details]
4.20.5 debug msgs
Comment 30 Erik Kaneda 2019-02-21 00:10:39 UTC
Sorry about the delayed response. Could you boot with acpi.debug_layer= 0x00400000 acpi.debug_level=0x600 log_buf_len=10M?

Thanks for your patience!
Comment 31 Erik Kaneda 2019-02-21 01:04:00 UTC
Also, what happens when you try to revert this commit on a kernel that crashes?

commit 88fa2dfb075a20c3464e3d303c57dd8ced9e8309
Author: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Date:   Tue Oct 10 18:11:15 2017 +0200

    i2c: piix4: Disable completely the IMC during SMBUS_BLOCK_DATA
Comment 32 Michael J Gruber 2019-02-21 16:46:47 UTC
(In reply to Erik Schmauss from comment #31)
> Also, what happens when you try to revert this commit on a kernel that
> crashes?
> 
> commit 88fa2dfb075a20c3464e3d303c57dd8ced9e8309
> Author: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
> Date:   Tue Oct 10 18:11:15 2017 +0200
> 
>     i2c: piix4: Disable completely the IMC during SMBUS_BLOCK_DATA

What happens is that the revert does not apply cleanly to the 4.20.11 tree; if I tree to resolve the conflicts the kernel fails to build (see build.log for https://koji.fedoraproject.org/koji/taskinfo?taskID=32946407).

That commit is not exactly small, and a lot has changed in drivers/i2c/busses/i2c-piix4.c since its introduction (it's described as v4.14-rc4-5-g88fa2dfb075a) and before the appearance of the crashes (revert does not apply cleanly to 4.19.6 either).

As for the changed debug parameters - which kernel should I boot with them to be useful for you?
Comment 33 Erik Kaneda 2019-02-21 19:31:21 UTC
Please ignore my requests in comment 30 and comment 31 for now. I think I've figure out the issue..
Comment 34 Erik Kaneda 2019-02-21 20:39:48 UTC
Created attachment 281267 [details]
remove op region address after control method execution

Please give apply this on a kernel version that crashes
Comment 35 Michael J Gruber 2019-02-23 11:54:28 UTC
Created attachment 281299 [details]
4.20.11 stock Fedora debug kernel
Comment 36 Michael J Gruber 2019-02-23 11:55:13 UTC
Created attachment 281301 [details]
4.20.11 debug kernel with suggested patch
Comment 37 Michael J Gruber 2019-02-23 11:55:54 UTC
Created attachment 281303 [details]
4.20.11 debug kernel with the revert
Comment 38 Michael J Gruber 2019-02-23 11:57:57 UTC
I'm just dumping there for now - 4.20.11 seems to crash even with the revert patch.

I'm not sue what to make out of this. I guess I should have tested with an earlier kernel - but then, what changed in 4.20.11?
Comment 39 Erik Kaneda 2019-02-25 17:07:32 UTC
(In reply to M.J.G. from comment #38)
> I'm just dumping there for now - 4.20.11 seems to crash even with the revert
> patch.
> 
> I'm not sue what to make out of this. I guess I should have tested with an
> earlier kernel - but then, what changed in 4.20.11?

I don't really know what changed in 4.20.11 that could have caused this crash... Unfortunately, you'll have to bisect it and file another bug.

For now, let's boot on 4.20.5 and work on improving the global address list manipulations.

Thanks!
Comment 40 Michael J Gruber 2019-02-26 20:20:39 UTC
(In reply to Erik Schmauss from comment #39)
> (In reply to M.J.G. from comment #38)
> > I'm just dumping there for now - 4.20.11 seems to crash even with the
> revert
> > patch.
> > 
> > I'm not sue what to make out of this. I guess I should have tested with an
> > earlier kernel - but then, what changed in 4.20.11?
> 
> I don't really know what changed in 4.20.11 that could have caused this
> crash... Unfortunately, you'll have to bisect it and file another bug.
> 
> For now, let's boot on 4.20.5 and work on improving the global address list
> manipulations.
> 
> Thanks!

Sorry, forget what I wrote about 4.20.11, I got confused: The new debug options seem to hide the BUG/Oops lines and produce a call trace no matter what; also, the debug kernels eventually finish the boot even if the corresponding none debug kernel hangs after the crash (at least for a few minutes, after which I force halt).

I booted a 4.20.5 now with your patch (instead of the revert), it boots without crash (non-debug). 

Upcoming are the log for the patched 4.20.5 kernel and the debug logs with acpi.debug_layer=0x00400000 acpi.debug_level=0x600 log_buf_len=40M for the patched and unpatched 4.20.5.
Comment 41 Michael J Gruber 2019-02-26 20:22:06 UTC
Created attachment 281363 [details]
4.20.5 debug msgs (new options)
Comment 42 Michael J Gruber 2019-02-26 20:23:05 UTC
Created attachment 281365 [details]
4.20.5 patched debug msgs (new options)
Comment 43 Michael J Gruber 2019-02-26 20:23:48 UTC
Created attachment 281367 [details]
4.20.5 patched non-debug msgs
Comment 44 Erik Kaneda 2019-02-26 21:45:21 UTC
(In reply to M.J.G. from comment #40)
> (In reply to Erik Schmauss from comment #39)
> > (In reply to M.J.G. from comment #38)
> > > I'm just dumping there for now - 4.20.11 seems to crash even with the
> > revert
> > > patch.
> > > 
> > > I'm not sue what to make out of this. I guess I should have tested with
> an
> > > earlier kernel - but then, what changed in 4.20.11?
> > 
> > I don't really know what changed in 4.20.11 that could have caused this
> > crash... Unfortunately, you'll have to bisect it and file another bug.
> > 
> > For now, let's boot on 4.20.5 and work on improving the global address list
> > manipulations.
> > 
> > Thanks!
> 
> Sorry, forget what I wrote about 4.20.11, I got confused: The new debug
> options seem to hide the BUG/Oops lines and produce a call trace no matter
> what; also, the debug kernels eventually finish the boot even if the
> corresponding none debug kernel hangs after the crash (at least for a few
> minutes, after which I force halt).
> 

These debug messages can easily turn into a fire hose... These debug flags are typically used for interactive debugging purposes rather than debugging boot. The reason why I asked you to use them was to try to under stand what was going on but after thinking about this for a few days, I was finally able to understand what was going on..

> I booted a 4.20.5 now with your patch (instead of the revert), it boots
> without crash (non-debug). 

Great! Thanks for testing this! I'll look over this patch again before sending it out.

I think there may have been a few other people have had issues with it so it would be good for them to test it out as well. If you have seen other people complain about this, please feel free to send this patch to them. I haven't heard anything about this patch for a while but feel free to add me to any existing conversations.

Also, how would you like me to credit you? I was going to add your name to reported-by and tested-by tags to the patch.

> 
> Upcoming are the log for the patched 4.20.5 kernel and the debug logs with
> acpi.debug_layer=0x00400000 acpi.debug_level=0x600 log_buf_len=40M for the
> patched and unpatched 4.20.5.
Comment 45 Michael J Gruber 2019-02-27 09:51:07 UTC
(In reply to Erik Schmauss from comment #44)
> 
> Great! Thanks for testing this! I'll look over this patch again before
> sending it out.

Thanks for you patience and perseverance!

> 
> I think there may have been a few other people have had issues with it so it
> would be good for them to test it out as well. If you have seen other people
> complain about this, please feel free to send this patch to them. I haven't
> heard anything about this patch for a while but feel free to add me to any
> existing conversations.

We had an anonymous reporter (on an atom box) for a Fedora test build. I updated the info on the Fedora bug https://bugzilla.redhat.com/show_bug.cgi?id=1659225 with a link to your patch and my kernel build, maybe some of the cc's there can test.

> Also, how would you like me to credit you? I was going to add your name to
> reported-by and tested-by tags to the patch.

I changed my kernel bz display name to my full name now. In Fedora land, I am:

Michael J Gruber <mjg@fedoraproject.org>

In kernel land I'm unknown, unless from the git side of things ;)
Comment 46 Erik Kaneda 2019-02-28 17:46:02 UTC
I'll be away next week. So I'll take a look around the week of 3/11 and go from there. Let me know if you have any updates as far as testing goes
Comment 47 Erik Kaneda 2019-03-13 20:27:27 UTC
I'm a back now. @Michael: I'm going to submit this change after adding a few tests to our infrastructure. Let me know if you have any other feedback.
Comment 48 Michael J Gruber 2019-03-18 09:04:59 UTC
There has been no further feedback on the rh bz.

I've been applying your patch to 4.20-series Fedora kernels up to 4.20.16. Applies, builds and solves the ACPI issue on my (ancient) hp6715b laptops. Thanks for hanging in!
Comment 49 Erik Kaneda 2019-03-18 16:59:38 UTC
no problem! Thanks for the bug report!
Comment 50 Erik Kaneda 2019-03-22 21:20:58 UTC
This patch should be in rc-3 or rc-4. I'll submit this to the stable kernel mailing list when I do. For now, I'll leave this open and close when I submit the patch.
Comment 51 Erik Kaneda 2019-04-12 19:05:40 UTC
patch has been applied to Rafael's linux-next branch and will be a part of the next rc.

Closing...