Bug 199557 - XPS 13 9360: ACPI BIOS Error (bug): Failure creating [\_GPE.XTBT.SPRT], AE_ALREADY_EXISTS
Summary: XPS 13 9360: ACPI BIOS Error (bug): Failure creating [\_GPE.XTBT.SPRT], AE_AL...
Status: RESOLVED DOCUMENTED
Alias: None
Product: ACPI
Classification: Unclassified
Component: ACPICA-Core (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: Mika Westerberg
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-29 10:47 UTC by Esokrarkose
Modified: 2021-12-23 00:48 UTC (History)
11 users (show)

See Also:
Kernel Version: 4.16, 4.17
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Dmidecode output (28.24 KB, text/plain)
2018-04-29 10:47 UTC, Esokrarkose
Details
Acpidump (1017.21 KB, text/plain)
2018-04-30 20:14 UTC, Esokrarkose
Details
Full dmesg output (fresh boot) (71.79 KB, text/plain)
2018-05-03 17:29 UTC, Esokrarkose
Details
Dmesg log of plugging USB to ports (80.07 KB, text/plain)
2018-05-04 07:52 UTC, Esokrarkose
Details
Dmesg log of plugging in the Dell DA200 Usb C Adapter (114.62 KB, text/plain)
2018-05-04 07:59 UTC, Esokrarkose
Details
Dmesg of reverted commit (61.98 KB, text/plain)
2018-05-07 22:58 UTC, Esokrarkose
Details
Dmesg of boot without fwupd (64.97 KB, text/plain)
2018-05-08 17:09 UTC, Esokrarkose
Details

Description Esokrarkose 2018-04-29 10:47:48 UTC
Created attachment 275659 [details]
Dmidecode output

I noticed the following in dmesg while testing kernels through kernels from 4.16.0.rc6 to 4.17.0.rc2 for another bug:

Kernels beginning with 4.16.0rc7 show an ACPI BIOS Error that did not happen with 4.16.0rc6 before and does not occur with kernel 4.16.5.
Kernel 4.17.0rc2 however does have the issue:

[   19.020888] ACPI BIOS Error (bug): Failure creating [\_GPE.XTBT.SPRT], AE_ALREADY_EXISTS (20180313/dswload2-316)
[   19.020898] No Local Variables are initialized for Method [XTBT]
[   19.020900] Initialized Arguments for Method [XTBT]:  (2 arguments defined for method invocation)
[   19.020901]   Arg0:   000000009071d27e <Obj>           Integer 0000000000000001
[   19.020906]   Arg1:   0000000015cc32a9 <Obj>           Integer 0000000002060002
[   19.020910] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180313/psobject-220)
[   19.020913] ACPI Error: Method parse/execution failed \_GPE.XTBT, AE_ALREADY_EXISTS (20180313/psparse-516)
[   19.020917] ACPI Error: Method parse/execution failed \_GPE.XTBT, AE_ALREADY_EXISTS (20180313/psparse-516)
[   19.020926] ACPI Error: Method parse/execution failed \_GPE._E42, AE_ALREADY_EXISTS (20180313/psparse-516)
[   19.020933] ACPI: Marking method _E42 as Serialized because of AE_ALREADY_EXISTS error
[   19.020940] ACPI Error: AE_ALREADY_EXISTS, while evaluating GPE method [_E42] (20180313/evgpe-509)

My Bios is the latest from Dell (2.6.2), please find the dmidecode output attached.
Comment 1 Mario Limonciello 2018-04-30 17:59:41 UTC
I believe you'll need to share your apcidump output for folks to look at and understand this better.  Since you know two very close kernel versions with different results (4.16.0rc7 and 4.16.0rc6) a bisect would be useful to see what exactly regressed.  I think it should be quick bisect as you can "probably" focus on drivers/acpi
Comment 2 Esokrarkose 2018-04-30 20:14:13 UTC
Created attachment 275679 [details]
Acpidump
Comment 3 Esokrarkose 2018-05-02 08:35:20 UTC
Today I also observed the following in dmesg on 4.16.0rc6 on suspend/resume:

[72558.671634] ACPI: EC: interrupt blocked
[72558.715155] ACPI: Preparing to enter system sleep state S3
[72558.726660] ACPI: EC: event blocked
[72558.726662] ACPI: EC: EC stopped
[72558.726663] PM: Saving platform NVS memory
[72558.729921] Disabling non-boot CPUs ...
[72558.742889] IRQ 127: no longer affine to CPU1
[72558.743943] smpboot: CPU 1 is now offline
[72558.766815] IRQ 128: no longer affine to CPU2
[72558.767832] smpboot: CPU 2 is now offline
[72558.790691] IRQ 122: no longer affine to CPU3
[72558.790699] IRQ 123: no longer affine to CPU3
[72558.790716] IRQ 130: no longer affine to CPU3
[72558.791740] smpboot: CPU 3 is now offline
[72558.795390] ACPI Error: AE_BAD_PARAMETER, Returned by Handler for [EmbeddedControl] (20180105/evregion-300)
[72558.795402] ACPI Error: Method parse/execution failed \_SB.PCI0.LPCB.ECDV.ECR1, AE_BAD_PARAMETER (20180105/psparse-550)
[72558.795416] ACPI Error: Method parse/execution failed \_SB.PCI0.LPCB.ECDV.ECR2, AE_BAD_PARAMETER (20180105/psparse-550)
[72558.795427] ACPI Error: Method parse/execution failed \ECRW, AE_BAD_PARAMETER (20180105/psparse-550)
[72558.795437] ACPI Error: Method parse/execution failed \ECG1, AE_BAD_PARAMETER (20180105/psparse-550)
[72558.795446] ACPI Error: Method parse/execution failed \NEVT, AE_BAD_PARAMETER (20180105/psparse-550)
[72558.795456] ACPI Error: Method parse/execution failed \_SB.PCI0.LPCB.ECDV._Q66, AE_BAD_PARAMETER (20180105/psparse-550)
[72558.798404] ACPI: Low-level resume complete
[72558.798513] ACPI: EC: EC started
[72558.798514] PM: Restoring platform NVS memory
[72558.800711] Enabling non-boot CPUs ...

Please let me know if I should open a separate bug report for this error.
This one is not easily reproducible, unlike the first error in the report on the affected kernels (NOTE: the first error is like mentioned earlier not reproducible on 4.16.0rc6).
Comment 4 Esokrarkose 2018-05-03 16:52:23 UTC
Reverting

13d3047c8150 ("ACPI / hotplug / PCI: Check presence of slot itself in get_slot_status()")

makes the errors go away.

Please also have a look at
https://marc.info/?l=linux-usb&m=152533983331372&w=2
Comment 5 Esokrarkose 2018-05-03 17:29:21 UTC
Created attachment 275755 [details]
Full dmesg output (fresh boot)
Comment 6 Erik Kaneda 2018-05-03 18:31:42 UTC
(In reply to Esokrarkose from comment #3)
> Today I also observed the following in dmesg on 4.16.0rc6 on suspend/resume:
> 
> [72558.671634] ACPI: EC: interrupt blocked
> [72558.715155] ACPI: Preparing to enter system sleep state S3
> [72558.726660] ACPI: EC: event blocked
> [72558.726662] ACPI: EC: EC stopped
> [72558.726663] PM: Saving platform NVS memory
> [72558.729921] Disabling non-boot CPUs ...
> [72558.742889] IRQ 127: no longer affine to CPU1
> [72558.743943] smpboot: CPU 1 is now offline
> [72558.766815] IRQ 128: no longer affine to CPU2
> [72558.767832] smpboot: CPU 2 is now offline
> [72558.790691] IRQ 122: no longer affine to CPU3
> [72558.790699] IRQ 123: no longer affine to CPU3
> [72558.790716] IRQ 130: no longer affine to CPU3
> [72558.791740] smpboot: CPU 3 is now offline
> [72558.795390] ACPI Error: AE_BAD_PARAMETER, Returned by Handler for
> [EmbeddedControl] (20180105/evregion-300)
> [72558.795402] ACPI Error: Method parse/execution failed
> \_SB.PCI0.LPCB.ECDV.ECR1, AE_BAD_PARAMETER (20180105/psparse-550)
> [72558.795416] ACPI Error: Method parse/execution failed
> \_SB.PCI0.LPCB.ECDV.ECR2, AE_BAD_PARAMETER (20180105/psparse-550)
> [72558.795427] ACPI Error: Method parse/execution failed \ECRW,
> AE_BAD_PARAMETER (20180105/psparse-550)
> [72558.795437] ACPI Error: Method parse/execution failed \ECG1,
> AE_BAD_PARAMETER (20180105/psparse-550)
> [72558.795446] ACPI Error: Method parse/execution failed \NEVT,
> AE_BAD_PARAMETER (20180105/psparse-550)
> [72558.795456] ACPI Error: Method parse/execution failed
> \_SB.PCI0.LPCB.ECDV._Q66, AE_BAD_PARAMETER (20180105/psparse-550)
> [72558.798404] ACPI: Low-level resume complete
> [72558.798513] ACPI: EC: EC started
> [72558.798514] PM: Restoring platform NVS memory
> [72558.800711] Enabling non-boot CPUs ...
> 
> Please let me know if I should open a separate bug report for this error.
> This one is not easily reproducible, unlike the first error in the report on
> the affected kernels (NOTE: the first error is like mentioned earlier not
> reproducible on 4.16.0rc6).

To keep things simple lets focus on your first issue (the SPRT errors). Go ahead and open up a separate issue report for this ECXX error.
Comment 7 Mika Westerberg 2018-05-04 07:04:20 UTC
Few questions:

1. Do you have anything connected to the USB-C ports while you see these errors?
2. In addition to ACPI errors, does the device malfunction in some way? Looking at the dmesg, I don't see any "abnormal" things happening. Only exception is that the TBT host router is hot-added and then removed.
Comment 8 Esokrarkose 2018-05-04 07:51:19 UTC
1. No, nothing at all
2. YES, I just tested: Plugging in an USB on the left port worked at first. Then I plugged in a USB on the right port, did not work. Replugging on the left port did not work again after that.
Comment 9 Esokrarkose 2018-05-04 07:52:50 UTC
Created attachment 275767 [details]
Dmesg log of plugging USB to ports
Comment 10 Esokrarkose 2018-05-04 07:59:10 UTC
Created attachment 275769 [details]
Dmesg log of plugging in the Dell DA200 Usb C Adapter

In the end I also tested the only device I have with USB C, the DA200 VGA/HDMI/LAN/USB-A Adapter.
This one produced errors too, but at least plugging in the USB there worked and also the VGA output.
Comment 11 Mika Westerberg 2018-05-04 09:22:56 UTC
So without any device connected, you only see errors but things still work, right? If you only plug in USB-C device to the right port (I guess the one without small lighting logo) it works even if you do that several times?

But after you plug something to the left port (the one with small lighting logo) none of the ports work anymore?
Comment 12 Esokrarkose 2018-05-04 10:09:31 UTC
(In reply to Mika Westerberg from comment #11)
> So without any device connected, you only see errors but things still work,

What do you mean by "things"? 
The system is still running, yes.

> right? If you only plug in USB-C device to the right port (I guess the one
> without small lighting logo) it works even if you do that several times?

I could not figure out a pattern.
Info: when I speak of left and right ports, I speak of the USB A ports.
I can only see a lightning logo by the USB C port that is on the left side near the left USB A port.

When you asked I tested plugging in on the right port, which succeeded two times in a row.
Afterwards I plugged in in the left port which worked, but not again.
Then I re-plugged and everything worked again on both sides.

> But after you plug something to the left port (the one with small lighting
> logo) none of the ports work anymore?

Sometimes but not always.
Comment 13 Mika Westerberg 2018-05-04 10:24:52 UTC
The point I'm trying to make is that the USB-C ports and the one with the lighting logo in particular should be the only one affected by the commit you reverted. USB-A ports should not be affected in any way.

What happens here is that there is hotplug event to the root port leading to Alpine Ridge (the USB-C + Thunderbolt controller) and it brings up both controllers even if there is nothing connected. That should not happen and we need to figure out why it does.

We should have few XPS 9360's at the office so I can try to reproduce this on one of those next week.
Comment 14 Esokrarkose 2018-05-04 15:12:10 UTC
(In reply to Mika Westerberg from comment #13)
> The point I'm trying to make is that the USB-C ports and the one with the
> lighting logo in particular should be the only one affected by the commit
> you reverted. USB-A ports should not be affected in any way.

Maybe it was a coincidence that the ports stopped working until reboot (but this happened to me already last week with two different external HDDs, I am not sure which kernel version it was, but the ports stop working problem exists for sure, I do not know how to trigger it however). I do not have the time resources at the moment to test this out, as this is my machine for work.

> We should have few XPS 9360's at the office so I can try to reproduce this
> on one of those next week.

That would be great!
Comment 15 Erik Kaneda 2018-05-04 22:28:31 UTC
(In reply to Esokrarkose from comment #0)
> Created attachment 275659 [details]
> Dmidecode output
> 
> I noticed the following in dmesg while testing kernels through kernels from
> 4.16.0.rc6 to 4.17.0.rc2 for another bug:
> 
> Kernels beginning with 4.16.0rc7 show an ACPI BIOS Error that did not happen
> with 4.16.0rc6 before and does not occur with kernel 4.16.5.
> Kernel 4.17.0rc2 however does have the issue:
> 
> [   19.020888] ACPI BIOS Error (bug): Failure creating [\_GPE.XTBT.SPRT],
> AE_ALREADY_EXISTS (20180313/dswload2-316)
> [   19.020898] No Local Variables are initialized for Method [XTBT]
> [   19.020900] Initialized Arguments for Method [XTBT]:  (2 arguments
> defined for method invocation)
> [   19.020901]   Arg0:   000000009071d27e <Obj>           Integer
> 0000000000000001
> [   19.020906]   Arg1:   0000000015cc32a9 <Obj>           Integer
> 0000000002060002
> [   19.020910] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog
> (20180313/psobject-220)
> [   19.020913] ACPI Error: Method parse/execution failed \_GPE.XTBT,
> AE_ALREADY_EXISTS (20180313/psparse-516)
> [   19.020917] ACPI Error: Method parse/execution failed \_GPE.XTBT,
> AE_ALREADY_EXISTS (20180313/psparse-516)
> [   19.020926] ACPI Error: Method parse/execution failed \_GPE._E42,
> AE_ALREADY_EXISTS (20180313/psparse-516)
> [   19.020933] ACPI: Marking method _E42 as Serialized because of
> AE_ALREADY_EXISTS error
> [   19.020940] ACPI Error: AE_ALREADY_EXISTS, while evaluating GPE method
> [_E42] (20180313/evgpe-509)
> 
> My Bios is the latest from Dell (2.6.2), please find the dmidecode output
> attached.

It seems like we have 2 issues going on here. Could you make a separate bugzilla for this error? Thanks.
Comment 16 Esokrarkose 2018-05-05 10:07:33 UTC
(In reply to Erik Schmauss from comment #15)
> It seems like we have 2 issues going on here. Could you make a separate
> bugzilla for this error? Thanks.

Hmm, the issue you referenced is the original one, this bug report is about. I am a bit confused now, what shall I do exactly?

I see 3 different things:

1. Initial ACPI errors (SPRT)
2. xHCI errors
3. Different EC ACPI errors (ECXX)

This issue was opened for 1. I opened another issue for 2. on the linux-usb mailing list (see https://marc.info/?l=linux-usb&m=152533983331372&w=2), but then I noticed that both 1. and 2. are triggered by the same commit and Mika suggested accumulating the information in this bug report here.

So I only know for sure to open a separate report for 3., is that what you mean?
Comment 17 Mika Westerberg 2018-05-07 15:12:36 UTC
Can you also attach dmesg of the system with the commit reverted?
Comment 18 Erik Kaneda 2018-05-07 17:49:37 UTC
(In reply to Esokrarkose from comment #16)
> (In reply to Erik Schmauss from comment #15)
> > It seems like we have 2 issues going on here. Could you make a separate
> > bugzilla for this error? Thanks.
> 
> Hmm, the issue you referenced is the original one, this bug report is about.
> I am a bit confused now, what shall I do exactly?
> 
> I see 3 different things:
> 
> 1. Initial ACPI errors (SPRT)
> 2. xHCI errors
> 3. Different EC ACPI errors (ECXX)
> 
> This issue was opened for 1. I opened another issue for 2. on the linux-usb
> mailing list (see https://marc.info/?l=linux-usb&m=152533983331372&w=2), but
> then I noticed that both 1. and 2. are triggered by the same commit and Mika
> suggested accumulating the information in this bug report here.

Ok that's fine. Let's just keep all the issues here. So which commit caused this error?

[   19.020888] ACPI BIOS Error (bug): Failure creating [\_GPE.XTBT.SPRT], AE_ALREADY_EXISTS (20180313/dswload2-316)
[   19.020898] No Local Variables are initialized for Method [XTBT]
[   19.020900] Initialized Arguments for Method [XTBT]:  (2 arguments defined for method invocation)
[   19.020901]   Arg0:   000000009071d27e <Obj>           Integer 0000000000000001
[   19.020906]   Arg1:   0000000015cc32a9 <Obj>           Integer 0000000002060002
[   19.020910] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180313/psobject-220)
[   19.020913] ACPI Error: Method parse/execution failed \_GPE.XTBT, AE_ALREADY_EXISTS (20180313/psparse-516)
[   19.020917] ACPI Error: Method parse/execution failed \_GPE.XTBT, AE_ALREADY_EXISTS (20180313/psparse-516)
[   19.020926] ACPI Error: Method parse/execution failed \_GPE._E42, AE_ALREADY_EXISTS (20180313/psparse-516)
[   19.020933] ACPI: Marking method _E42 as Serialized because of AE_ALREADY_EXISTS error
[   19.020940] ACPI Error: AE_ALREADY_EXISTS, while evaluating GPE method [_E42] (20180313/evgpe-509)

From what you've mentioned this does not happen on v4.16-rc6 but it occurs on v4.16-rc7. Is that correct? And if so, could you bisect between these versions?
Comment 19 Esokrarkose 2018-05-07 18:58:47 UTC
(In reply to Erik Schmauss from comment #18)
> Ok that's fine. Let's just keep all the issues here. So which commit caused
> this error?
...
> From what you've mentioned this does not happen on v4.16-rc6 but it occurs
> on v4.16-rc7. Is that correct? And if so, could you bisect between these
> versions?

As already mentioned in comment #4:

> Reverting
> 
> 13d3047c8150 ("ACPI / hotplug / PCI: Check presence of slot itself in
> get_slot_status()")
> 
> makes the errors go away.
> 
> Please also have a look at
> https://marc.info/?l=linux-usb&m=152533983331372&w=2

Please also follow the given link for the description of the xHCI errors.
Comment 20 Erik Kaneda 2018-05-07 22:35:13 UTC
(In reply to Esokrarkose from comment #19)
> (In reply to Erik Schmauss from comment #18)
> > Ok that's fine. Let's just keep all the issues here. So which commit caused
> > this error?
> ...
> > From what you've mentioned this does not happen on v4.16-rc6 but it occurs
> > on v4.16-rc7. Is that correct? And if so, could you bisect between these
> > versions?
> 
> As already mentioned in comment #4:
> 
> > Reverting
> > 
> > 13d3047c8150 ("ACPI / hotplug / PCI: Check presence of slot itself in
> > get_slot_status()")
> > 
> > makes the errors go away.

So this makes the SPRT errors go away or it does it make the EC errors go away?

Is comment #5 the dmesg that you get after reverting the above commit?

> > 
> > Please also have a look at
> > https://marc.info/?l=linux-usb&m=152533983331372&w=2
> 
> Please also follow the given link for the description of the xHCI errors.
Comment 21 Esokrarkose 2018-05-07 22:53:14 UTC
(In reply to Erik Schmauss from comment #20)
> So this makes the SPRT errors go away or it does it make the EC errors go
> away?
Yes for the SPRT errors.

As for the EC errors (comment #3): those are not reproducible so I can't test if they are related to this commit.

> Is comment #5 the dmesg that you get after reverting the above commit?
No, without reverting, you can see the SPRT erros are still present.
Comment 22 Esokrarkose 2018-05-07 22:58:03 UTC
Created attachment 275807 [details]
Dmesg of reverted commit

(In reply to Mika Westerberg from comment #17)
> Can you also attach dmesg of the system with the commit reverted?
Here it is.
All of the sudden however I cannot use USB-A devices on that kernel with the reverted commit (the dmesg shows all the devices, but gparted does not see any of them).
When connecting an external HDD it spins up then immediately spins down.
Comment 23 Mika Westerberg 2018-05-08 14:07:41 UTC
I'm currently trying to reproduce this but so far no success. I'm using an older BIOS version (2.3.1) but will upgrade to the latest soon.

Just to be sure, can you disable CONFIG_INTEL_WMI_THUNDERBOLT in your .config and check if the Thunderbolt controller still appears? To me it looks like fwupd force powers the controller during boot and I can trigger the same here doing that manually.
Comment 24 Mario Limonciello 2018-05-08 14:21:36 UTC
> Just to be sure, can you disable CONFIG_INTEL_WMI_THUNDERBOLT in your .config
> and check if the Thunderbolt controller still appears? To me it looks like
> fwupd force powers the controller during boot and I can trigger the same here
> doing that manually.

I'd guess that OP has gnome-software and fwupd installed and gnome-software starts at session start which starts fwupd which force powers the controller.  So alternatively you can modify /etc/fwupd/daemon.conf and add thunderbolt_power to the blacklisted plugins list to prevent fwupd from doing force power actions.
Comment 25 Esokrarkose 2018-05-08 14:46:54 UTC
(In reply to Mika Westerberg from comment #23)
> Just to be sure, can you disable CONFIG_INTEL_WMI_THUNDERBOLT in your
> .config and check if the Thunderbolt controller still appears? To me it
> looks like fwupd force powers the controller during boot and I can trigger
> the same here doing that manually.
I will try that later.

(In reply to Mario Limonciello from comment #24)
> I'd guess that OP has gnome-software and fwupd installed
Yes that's correct.
Comment 26 Mika Westerberg 2018-05-08 14:58:13 UTC
I now upgraded to the latest BIOS (2.6.2) but still could not reproduce it in any other way than forcing power. So I'm suspecting the USB issue here is pretty much only because the xHCI controller is hot-removed and it happens to print out such message in that case.

However, there is another message mentioned in the original report on linux-usb list:

[   31.956624] xhci_hcd 0000:39:00.0: WARN: xHC restore state timeout
[   31.956631] xhci_hcd 0000:39:00.0: PCI post-resume error -110!

I can reproduce this by doing force_power and immediately suspending the system to mem. This is pretty much something user is not expected to do, I think, and not sure how well BIOS/FW copes with that.

Regarding the \_GPE.XTBT.SPRT error. I think this particular system is one of those DPTF systems where \_GPE.XTBT() calls itself recursively for some reason I already forgot. Second time it is called the SPRT OpRegion is already there and ACPICA rightfully complains.
Comment 27 Mika Westerberg 2018-05-08 15:30:56 UTC
(In reply to Mika Westerberg from comment #26)
> [   31.956624] xhci_hcd 0000:39:00.0: WARN: xHC restore state timeout
> [   31.956631] xhci_hcd 0000:39:00.0: PCI post-resume error -110!
> 
> I can reproduce this by doing force_power and immediately suspending the
> system to mem. This is pretty much something user is not expected to do, I
> think, and not sure how well BIOS/FW copes with that.

Or I think BIOS actually handles this correctly but then it disables force_power on S3 exit (because there is no device connected) and that makes both devices (Thunderbolt controller and xHCI controller) to disappear.
Comment 28 Esokrarkose 2018-05-08 16:49:40 UTC
Mika, could you try this kernel here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16.7/

Please try exactly this build and see if it triggers the errors?
Comment 29 Mika Westerberg 2018-05-08 16:57:28 UTC
Unfortunately I cannot install Ubuntu to that system since it is not mine, I just borrowed it.

However, I think we are pretty close to the root cause though. If you can disable the thunderbolt fwupd plugin like Mario suggested (or alternatively disable CONFIG_INTEL_WMI_THUNDERBOLT from .config) the messages should go away if the theory is correct.
Comment 30 Mario Limonciello 2018-05-08 16:58:58 UTC
minor correction:
s/thunderbolt/thunderbolt_power/

Just turning off the thunderbolt plugin in fwupd I think would still cause thunderbolt_power to be used.
Comment 31 Esokrarkose 2018-05-08 17:09:54 UTC
Created attachment 275849 [details]
Dmesg of boot without fwupd

Confirmed: I removed the whole fwupd and rebooted and the errors were gone.
Comment 32 Esokrarkose 2018-05-08 17:12:08 UTC
@Mika: You would not have to install Ubuntu on the system, installing to a usb drive would not touch the system at all.

I can reliably reproduce with the ubuntu mainline kernel builds.
Comment 33 Mika Westerberg 2018-05-08 17:39:17 UTC
Thanks for checking. I think we can conclude that 13d3047c8150 has nothing to do with the errors.

The reason why you see the errors coming from xHCI is because fwupd uses force_power (a Thunderbolt feature on some systems) during boot when it enumerates devices whose firmware could be upgraded. Once it has done that, it "unforces" power which hot-removes the xHCI device (and Thunderbolt controller as well) and the xHCI driver just prints out that the device is gone. This is normal and should not be considered a bug or regression.

There is a minor issue if you suspend the system when force_power is in effect. The BIOS again "unforces" power during resume so the xHCI controller is removed and the xHCI driver gets confused printing these:

[   31.956624] xhci_hcd 0000:39:00.0: WARN: xHC restore state timeout
[   31.956631] xhci_hcd 0000:39:00.0: PCI post-resume error -110!

the xHCI maintainer (Mathias Nyman) is aware of this and hopefully has fix for it eventually.

That leaves us the SPRT error:

[   47.176816] ACPI BIOS Error (bug): Failure creating [\_GPE.XTBT.SPRT], AE_ALREADY_EXISTS (20180105/dswload2-350)
[   47.176825] No Local Variables are initialized for Method [XTBT]
[   47.176827] Initialized Arguments for Method [XTBT]:  (2 arguments defined for method invocation)
[   47.176827]   Arg0:   000000000581c91d <Obj>           Integer 0000000000000001
[   47.176832]   Arg1:   00000000459e7ef2 <Obj>           Integer 0000000002060002
[   47.176835] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180105/psobject-252)
[   47.176838] ACPI Error: Method parse/execution failed \_GPE.XTBT, AE_ALREADY_EXISTS (20180105/psparse-550)
[   47.176843] ACPI Error: Method parse/execution failed \_GPE.XTBT, AE_ALREADY_EXISTS (20180105/psparse-550)
[   47.176848] ACPI Error: Method parse/execution failed \_GPE._E42, AE_ALREADY_EXISTS (20180105/psparse-550)
[   47.176852] ACPI: Marking method _E42 as Serialized because of AE_ALREADY_EXISTS error
[   47.176856] ACPI Error: AE_ALREADY_EXISTS, while evaluating GPE method [_E42] (20180105/evgpe-646)

This is triggered by the force_power as well (\_GPE.XTBT is called during hot-add of the xHCI and Thunderbolt) and I think it is BIOS bug because it tries to create the same SPRT OpRegion twice (due to recursion). This happens only on those DPTF systems and there is even a "fix" for Linux in the ASL code:


    If (((DPTF == One) && (DDDR == One)))
    {
        If ((_REV == 0x05))
        {
            Return (Zero)
        }
        XTBT (TBSE, CPGN)
    }

The _REV == 0x05 checks for Linux but it does not work because Linux nowadays returns 2 following Windows. You should be able to get rid of the error by passing "acpi_rev_override" in the kernel command line but the real fix is to fix the BIOS to not create the SPRT OpRegion twice.
Comment 34 Mika Westerberg 2018-05-08 17:45:14 UTC
(In reply to Esokrarkose from comment #32)
> I can reliably reproduce with the ubuntu mainline kernel builds.

Can you reproduce it with the fwupd thunderbolt_power plugin blacklisted?
Comment 35 Esokrarkose 2018-05-08 18:10:28 UTC
(In reply to Mika Westerberg from comment #34)
> Can you reproduce it with the fwupd thunderbolt_power plugin blacklisted?
No, I can't.
Comment 36 Esokrarkose 2018-05-08 18:59:42 UTC
(In reply to Mika Westerberg from comment #33)
> Thanks for checking. I think we can conclude that 13d3047c8150 has nothing
> to do with the errors.
I tested all ubuntu mainline kernel builds to verify the cause to be between 4.16 rc6 and rc7.
When you suggested to test the revert of the commit, the errors were gone, however I have now built the kernel without the revert and the errors are gone as well I do not know why though.

I assumed that my build would behave the same as the ubuntu mainline build, however it did not, I suppose due to me building with localmodconfig.

So indeed, it seems my testing lead me to a wrong conclusion.

> This is normal and should not be considered a bug or regression.
Your are referring to the following errors, are you?
[   35.645704] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
[   35.645712] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
[   35.645714] pcieport 0000:00:1c.0:   device [8086:9d10] error status/mask=00000001/00002000
[   35.645716] pcieport 0000:00:1c.0:    [ 0] Receiver Error         (First)
[   35.717611] xhci_hcd 0000:39:00.0: remove, state 4
[   35.717620] usb usb4: USB disconnect, device number 1
[   35.717801] xhci_hcd 0000:39:00.0: USB bus 4 deregistered
[   35.717814] xhci_hcd 0000:39:00.0: xHCI host controller not responding, assume dead
> 
> There is a minor issue if you suspend the system when force_power is in
> effect. The BIOS again "unforces" power during resume so the xHCI controller
> is removed and the xHCI driver gets confused printing these:
> 
> [   31.956624] xhci_hcd 0000:39:00.0: WARN: xHC restore state timeout
> [   31.956631] xhci_hcd 0000:39:00.0: PCI post-resume error -110!
> 
> the xHCI maintainer (Mathias Nyman) is aware of this and hopefully has fix
> for it eventually.
Is there a issue to track this?
> nowadays returns 2 following Windows. You should be able to get rid of the
> error by passing "acpi_rev_override" in the kernel command line but the real
Yep, that works.
> fix is to fix the BIOS to not create the SPRT OpRegion twice.
@Mario: Could you manage to open an issue at the Dell BIOS team for that?
Comment 37 Mario Limonciello 2018-05-08 19:09:36 UTC
>When you suggested to test the revert of the commit, the errors were gone,
>however I have now built the kernel without the revert and the errors are gone
>as well I do not know why though.
> So indeed, it seems my testing lead me to a wrong conclusion.

So are you saying that by turning off thunderbolt_power you can't reproduce it in any of the kernels you previously could?

> @Mario: Could you manage to open an issue at the Dell BIOS team for that?

Well so the tough thing here is my understanding is the method is "supposed" to be recursively called when used on a system with DPTF.  The whole reason that "workaround" was put in the BIOS was to offer a way to avoid problems on Linux the only way that could be done without causing potentially ill effect to DPTF on Windows.

Keep in mind the BIOS (and that workaround) on the 9360 was developed before the idea of adding a Linux specific _OSI string for stuff like that was a thing recognized and approved of by kernel developers.

So I guess @Mika from your perspective how would you like to (ideally) proceed on this one?
Comment 38 Esokrarkose 2018-05-08 19:21:35 UTC
(In reply to Mario Limonciello from comment #37)
> >When you suggested to test the revert of the commit, the errors were gone,
> >however I have now built the kernel without the revert and the errors are
> gone
> >as well I do not know why though.
> > So indeed, it seems my testing lead me to a wrong conclusion.
> 
> So are you saying that by turning off thunderbolt_power you can't reproduce
> it in any of the kernels you previously could?

I can't follow your logic here.
With "led me to the wrong conlusion" I meant the conclusion that commit 13d3047c8150 caused this issue.
I can't really confirm this anymore.

Blacklisting thunderbolt_power in fwupd/daemon.conf makes the problem disappear on the affected kernel(s).
Comment 39 Mika Westerberg 2018-05-08 19:36:37 UTC
(In reply to Esokrarkose from comment #36)
> (In reply to Mika Westerberg from comment #33)
> > Thanks for checking. I think we can conclude that 13d3047c8150 has nothing
> > to do with the errors.
> I tested all ubuntu mainline kernel builds to verify the cause to be between
> 4.16 rc6 and rc7.
> When you suggested to test the revert of the commit, the errors were gone,
> however I have now built the kernel without the revert and the errors are
> gone as well I do not know why though.
> 
> I assumed that my build would behave the same as the ubuntu mainline build,
> however it did not, I suppose due to me building with localmodconfig.
> 
> So indeed, it seems my testing lead me to a wrong conclusion.

It could be that you have the CONFIG_WMI_INTEL_THUNDERBOLT enabled in the affected kernels or so and the ones you don't see the issue it is disabled. That would explain. Can you check that?

> > This is normal and should not be considered a bug or regression.
> Your are referring to the following errors, are you?
> [   35.645704] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
> [   35.645712] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected,
> type=Physical Layer, id=00e0(Receiver ID)
> [   35.645714] pcieport 0000:00:1c.0:   device [8086:9d10] error
> status/mask=00000001/00002000
> [   35.645716] pcieport 0000:00:1c.0:    [ 0] Receiver Error         (First)
> [   35.717611] xhci_hcd 0000:39:00.0: remove, state 4
> [   35.717620] usb usb4: USB disconnect, device number 1
> [   35.717801] xhci_hcd 0000:39:00.0: USB bus 4 deregistered
> [   35.717814] xhci_hcd 0000:39:00.0: xHCI host controller not responding,
> assume dead

Yes.

> > There is a minor issue if you suspend the system when force_power is in
> > effect. The BIOS again "unforces" power during resume so the xHCI
> controller
> > is removed and the xHCI driver gets confused printing these:
> > 
> > [   31.956624] xhci_hcd 0000:39:00.0: WARN: xHC restore state timeout
> > [   31.956631] xhci_hcd 0000:39:00.0: PCI post-resume error -110!
> > 
> > the xHCI maintainer (Mathias Nyman) is aware of this and hopefully has fix
> > for it eventually.
> Is there a issue to track this?

Probably no. You may file one if you want :)
Comment 40 Mika Westerberg 2018-05-08 19:41:36 UTC
(In reply to Mario Limonciello from comment #37)
> So I guess @Mika from your perspective how would you like to (ideally)
> proceed on this one?

Ideally Linux should be able to call that XTBT method recursively in the same way Windows does. Maybe @Erik has some input to this?
Comment 41 Esokrarkose 2018-05-08 19:52:30 UTC
(In reply to Mika Westerberg from comment #39)
> It could be that you have the CONFIG_WMI_INTEL_THUNDERBOLT enabled in the
> affected kernels or so and the ones you don't see the issue it is disabled.
> That would explain. Can you check that?

Yes that's indeed the reason.
Comment 42 Esokrarkose 2018-05-08 20:16:14 UTC
@Mika: Just confirmed that your commit 13d3047c8150 is absolutely not related.

Do we still need a bisect to know why this was introduced between rc6 and rc7?
Comment 43 Mario Limonciello 2018-05-08 20:19:13 UTC
> Do we still need a bisect to know why this was introduced between rc6 and
> rc7?


If I was going to guess - at least with those mainline builds they didn't enable the configure option until rc7.  They were enabled at the end of last year.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1734761
Comment 44 Esokrarkose 2018-05-08 20:22:40 UTC
(In reply to Mario Limonciello from comment #43)
> > Do we still need a bisect to know why this was introduced between rc6 and
> > rc7?
> 
> 
> If I was going to guess - at least with those mainline builds they didn't
> enable the configure option until rc7.  They were enabled at the end of last
> year.
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1734761

Oh right, the rc6 kernel I tested with did not have CONFIG_WMI_INTEL_THUNDERBOLT enabled. This was the trigger.
Comment 45 Erik Kaneda 2018-05-10 17:42:30 UTC
(In reply to Mika Westerberg from comment #40)
> (In reply to Mario Limonciello from comment #37)
> > So I guess @Mika from your perspective how would you like to (ideally)
> > proceed on this one?
> 
> Ideally Linux should be able to call that XTBT method recursively in the
> same way Windows does. Maybe @Erik has some input to this?

We have been experimenting with windows behavior and we see that recursive methods that create named objects return an error on windows. In the latest versions of iASL we mark these methods as illegal recursive methods. The firmware fix would be to delete the SPRT declarations out of the function. It's declared globally at line 18735 so it should be able to use that one instead.
Comment 46 Esokrarkose 2018-05-11 14:38:21 UTC
(In reply to Mika Westerberg from comment #33)
> The reason why you see the errors coming from xHCI is because fwupd uses
> force_power (a Thunderbolt feature on some systems) during boot when it
> enumerates devices whose firmware could be upgraded. Once it has done that,
> it "unforces" power which hot-removes the xHCI device (and Thunderbolt
> controller as well) and the xHCI driver just prints out that the device is
> gone. This is normal and should not be considered a bug or regression.
Mika, so wouldn't it be more appropriate to use KERN_WARNING then instead of KERN_ERR?
I mean (if I understand correctly), it does sound strange to me as a user when a device is powered down that an error message is emitted when nothing erroneous is going on?
How would I be sure the controller did not disappear because of a real error in contrast?
Or is the way fwupd unforces power unexpected like for example pulling out a usb drive without unmounting/ejecting first? An if so, shouldn't there be a method to unforce power, the driver understands and does not treat as error?
Comment 47 Mika Westerberg 2018-05-14 05:19:46 UTC
I would say a device disappearing underneath suddenly can be considered an error from driver perspective. They have no idea whether the device is connected behind Thunderbolt (where it is normal operation to unplug/plug devices) or behind normal (non-hotplug) PCI bridge (where this is a real error). The exactly same thing happens even without force_power when you have device connected and you unplug it.
Comment 48 Esokrarkose 2018-05-14 07:10:46 UTC
(In reply to Mika Westerberg from comment #47)
> The exactly same thing happens even without force_power when you
> have device connected and you unplug it.
So even correctly (safely) removing a Thunderbolt device would trigger that error? Isn't this something that should be improved (by userspace somehow telling the driver that we do not have an error situation)?
Comment 49 Mika Westerberg 2018-05-14 08:38:58 UTC
Yes.

That's pretty much the nature of surprise hot-remove. You can't really prevent user from unplugging a device in software in those cases. If you want "controlled" hot-remove, you can use the PCI sysfs files under /sys/bus/pci/slots to "remove" the device first.
Comment 50 Mika Westerberg 2018-05-18 07:35:58 UTC
So the conclusion is that this error (Failure creating [\_GPE.XTBT.SPRT], AE_ALREADY_EXISTS) is because of BIOS bug. It tries to create the same object twice (SPRT) twice and ACPICA rightfully refuses to do that. If it needs to be fixed somewhere, it should be in the BIOS and the simple fix is to move SPRT outside of the XTBT method which then avoids the issue.
Comment 51 Esokrarkose 2018-05-18 09:43:45 UTC
@Mika, thanks very much for digging into this and all of your support!
@Mario, could you please forward this to the BIOS team?
Comment 52 Mario Limonciello 2018-05-22 15:06:47 UTC
Sure, I'll let the relevant people know about this.  I can't promise an ETA on fixes.  Fortunately it's "harmless" other than the ACPI error in the system log.

Mika, I believe this bug should be closed on bugzilla as it's confirmed to be a BIOS bug.
Comment 53 Mika Westerberg 2018-05-22 15:20:37 UTC
I agree. Closing this now. Thanks!
Comment 54 Paul Menzel 2018-05-22 15:26:28 UTC
I thought DOCUMENTED is the right “reason” for resolved firmware bugs.
Comment 55 Mika Westerberg 2018-05-22 15:32:18 UTC
OK, thanks for the info. Updated it now.
Comment 56 Marcin Kurek 2018-06-09 23:10:52 UTC
It seems this one affects Dell Precision 5510 as well. I can see similar error in log's ...

--------
ACPI BIOS Error (bug): Failure creating [\_GPE._E42.SPRT], AE_ALREADY_EXISTS (20180313/dswload2-316)
ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180313/psobject-220)
ACPI Error: Method parse/execution failed \_GPE._E42, AE_ALREADY_EXISTS (20180313/psparse-516)
ACPI Error: Method parse/execution failed \_GPE._E42, AE_ALREADY_EXISTS (20180313/psparse-516)
ACPI Error: AE_ALREADY_EXISTS, while evaluating GPE method [_E42] (20180313/evgpe-509)
--------

Most recent BIOS (1.7.0) installed. Tested with 4.17.
Comment 57 Florent Hivert 2021-07-04 10:49:09 UTC
Hi there, 

I'm having the very same issue on a Dell Precision 5530. It's annoying because it  hangs my ethernet network dongle. Currently I need to reboot to fix it. It seems that comment 30 / 31 are describing a workaround but I'm not expert enough to understand. Can someone confirm that there is indeed a workaround and describe a little more the step needed to do it ? Do I need to recompile the kernel (did that routinely between 1995-2005 but haven't done that since ages) or passing some argument to kernel at boot is sufficient. Thanks.
Comment 58 Paul Menzel 2021-07-05 08:37:27 UTC
What firmware version do you use? Please open an issue at Dell asking for fixing the firmware.
Comment 59 David S. 2021-12-23 00:48:18 UTC
Hi,

I have Dell XPS 15 9560 with the Thunderbolt firmware v26.01 installed. This is the firmware link, https://www.dell.com/support/home/en-au/drivers/driversdetails?driverid=rgkm7&oscode=wt64a&productcode=xps-15-9560-laptop.

I created this post, https://www.dell.com/community/XPS/XPS-15-Thunderbolt-firmware-fix/m-p/8110017#M93825. Hopefully, we can get some update from Dell.

Note You need to log in before you can comment on or make changes to this bug.