Bug 155831 - Interrupt storm on Asus UX501VW under 4.4.0-21-generic
Summary: Interrupt storm on Asus UX501VW under 4.4.0-21-generic
Status: CLOSED UNREPRODUCIBLE
Alias: None
Product: ACPI
Classification: Unclassified
Component: Config-Interrupts
Hardware: All Linux
Importance: P1 normal
Assignee: Lv Zheng
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-09-02 16:55 UTC by Mike Allen
Modified: 2016-12-19 05:32 UTC
CC List: 6 users

See Also:
Kernel Version: 4.4.0-21-generic
Tree: Mainline
Regression: No


Attachments
grep . -r /sys/firmware/acpi/interrupts/ output (7.35 KB, text/plain)
2016-09-02 16:55 UTC, Mike Allen
Perf report output (3.07 MB, text/plain)
2016-09-02 16:59 UTC, Mike Allen
dmesg output (64.80 KB, text/plain)
2016-09-02 17:00 UTC, Mike Allen
acpidump output (1023.80 KB, text/plain)
2016-09-02 18:09 UTC, Mike Allen
lspci -vvxxx output (57.50 KB, text/plain)
2016-09-09 16:57 UTC, Mike Allen

Description Mike Allen 2016-09-02 16:55:01 UTC
Created attachment 232011
grep . -r /sys/firmware/acpi/interrupts/ output

I have a kworker thread that is consuming 100% of a single virtual core. It appears to be handling an interrupt storm on gpe61.

I am running Linux Mint 18 with the specified default kernel.

Output of "grep . -r /sys/firmware/acpi/interrupts/" is attached as "interrupts.txt".

I also executed "perf record -g -a sleep 10", and the result of "perf report" is attached as "perf.txt".

dmesg output also attached ("dmesg.txt").

Let me know if there's anything else you need.
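
For anyone triaging a similar report, the grep output can be reduced to the busiest GPE with a few lines of Python. This is only a sketch: it assumes each line has the form "path: count [status]", and the sample counts below are invented for illustration, not taken from the attached interrupts.txt. The function names are hypothetical.

```python
import os
import re

# Matches per-GPE counters such as gpe61, but not aggregates like gpe_all.
GPE_NAME = re.compile(r"^gpe[0-9A-Fa-f]{2}$")


def parse_counters(text):
    """Map counter name -> (count, status) from `grep . -r` style output."""
    counters = {}
    for line in text.splitlines():
        path, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip anything that does not look like "path: <number> [status]".
        if not sep or not fields or not fields[0].isdigit():
            continue
        name = os.path.basename(path.strip())
        counters[name] = (int(fields[0]), " ".join(fields[1:]))
    return counters


def busiest_gpe(counters):
    """Return (name, count) of the per-GPE counter with the highest count."""
    gpes = {n: c for n, (c, _status) in counters.items() if GPE_NAME.match(n)}
    if not gpes:
        return None
    name = max(gpes, key=gpes.get)
    return name, gpes[name]


# Invented sample data, shaped like the sysfs output (assumption).
SAMPLE = """\
/sys/firmware/acpi/interrupts/sci:  987654
/sys/firmware/acpi/interrupts/gpe_all:  987650
/sys/firmware/acpi/interrupts/gpe00:       0   invalid
/sys/firmware/acpi/interrupts/gpe61:  987000   enabled
/sys/firmware/acpi/interrupts/gpe6F:     650   enabled
"""

if __name__ == "__main__":
    print(busiest_gpe(parse_counters(SAMPLE)))
```

Running the parser on two snapshots taken a few seconds apart and comparing the counts would show whether the counter is still climbing, which is a stronger sign of a storm than a single large value.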
Comment 1 Mike Allen 2016-09-02 16:59:46 UTC
Created attachment 232021
Perf report output
Comment 2 Mike Allen 2016-09-02 17:00:05 UTC
Created attachment 232031
dmesg output
Comment 3 Mike Allen 2016-09-02 18:09:03 UTC
Created attachment 232041
acpidump output

Also attached the output from "acpidump".
Comment 4 Mike Allen 2016-09-09 16:57:21 UTC
Created attachment 232901
lspci -vvxxx output

Added output from: lspci -vvxxx

Has anyone taken a look at this issue? Am I OK to simply execute:

  echo disable > /sys/firmware/acpi/interrupts/gpe61

What problems might arise if I do so?

Which device is causing the problem?
Comment 5 Bjorn Helgaas 2016-09-09 21:10:56 UTC
Seems like an ACPI issue, at least to start -- gpe61 is managed by ACPI.
Comment 6 Zhang Rui 2016-09-12 02:26:37 UTC
_L61 is very complicated.

Lv,
for this issue, IMO, we should
1. enable the ACPI tracer to see how the AML code runs in _L61
2. clarify some key variables in _L61, like
   a) \_SB.PCI0.RP0{0~20}.HPSX, which seems to be hotplug related, and may get initialized in PCI _OSC.
   b) \_SB.PCI0.RP0{0~20}.VDID
   c) \_SB.PCI0.RP0{0~20}.PDCX
   d) \_SB.PCI0.RP0{0~20}.L0SE
Comment 7 Mike Allen 2016-09-12 03:32:44 UTC
(In reply to Zhang Rui from comment #6)
> _L61 is very complicated.
> 
> Lv,
> for this issue, IMO, we should
> 1. enable the ACPI tracer to see how the AML code runs in _L61
> 2. clarify some key variables in _L61, like
>    a) \_SB.PCI0.RP0{0~20}.HPSX, which seems to be hotplug related, and may
> get initialized in PCI _OSC.
>    b) \_SB.PCI0.RP0{0~20}.VDID
>    c) \_SB.PCI0.RP0{0~20}.PDCX
>    d) \_SB.PCI0.RP0{0~20}.L0SE

I'm not familiar with the ACPI tracer. How do I enable it and record the _L61 execution and the values of the variables you mention? I will be glad to provide that information to you...
Comment 8 Zhang Rui 2016-09-12 06:19:58 UTC
                OperationRegion (PXCS, PCI_Config, Zero, 0x0480)
                Field (PXCS, AnyAcc, NoLock, Preserve)
                {
                    VDID,   32, 
                    Offset (0x50),
                    L0SE,   1,
                    Offset (0x5A),
                    ABPX,   1,
                        ,   2,
                    PDCX,   1,
                }

                Field (PXCS, AnyAcc, NoLock, WriteAsZeros)
                {
                    Offset (0xDC),
                        ,   30,
                    HPSX,   1,
                    PMSX,   1
                }

Bjorn,
do you have any idea about those PCI config space bits for the PCI Root ports, say,
1c.{0,1,2,3,4,5,6,7} and 1d.{0,1,2,3,4,5,6,7} and 1b.{0,1,2,3}?

The _L61 control method, which handles GPE 0x61, looks related to PCI hotplug. Do you know if there is anything worth trying, like enabling or disabling some PCI hotplug features?
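
To make the Field declarations above easier to map onto config space, here is a rough Python sketch that walks the same entries and reports where each named bit field lands. It assumes the usual ASL semantics: Offset (n) moves to byte n of the region, and an unnamed entry only skips bits. The helper name and the tuple encoding are illustrative choices, not part of any ACPI tool.

```python
# Sentinel marking an ASL "Offset (n)" entry in the lists below.
OFFSET = object()


def field_layout(entries):
    """Return name -> (byte_offset, bit_in_byte, width_in_bits).

    Each entry is (OFFSET, byte_offset), (name, width) or (None, width);
    None stands for an unnamed entry that only reserves bits.
    """
    bit_pos = 0
    fields = {}
    for name, value in entries:
        if name is OFFSET:
            bit_pos = value * 8  # Offset () is given in bytes
            continue
        if name is not None:
            fields[name] = (bit_pos // 8, bit_pos % 8, value)
        bit_pos += value
    return fields


# Transcribed from the two Field blocks quoted above.
FIELD_PRESERVE = [
    ("VDID", 32),
    (OFFSET, 0x50), ("L0SE", 1),
    (OFFSET, 0x5A), ("ABPX", 1), (None, 2), ("PDCX", 1),
]
FIELD_WRITE_AS_ZEROS = [
    (OFFSET, 0xDC), (None, 30), ("HPSX", 1), ("PMSX", 1),
]

if __name__ == "__main__":
    for block in (FIELD_PRESERVE, FIELD_WRITE_AS_ZEROS):
        for name, (byte, bit, width) in field_layout(block).items():
            print(f"{name}: byte 0x{byte:02X}, bit {bit}, width {width}")
```

With this layout, PDCX lands on bit 3 of the byte at 0x5A, and HPSX and PMSX land on bits 6 and 7 of the byte at 0xDF, that is, bits 30 and 31 of the 32-bit value at 0xDC.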
Comment 9 Bjorn Helgaas 2016-09-12 16:04:16 UTC
The nvidia module taints the kernel.  Can you reproduce the problem without that module?  I don't know if the \_SB_.PCI0.PEG0.PEGP._DSM errors are related to the GPE 61 problem, but they look like they could be related to nvidia.

From the lspci output in https://bugzilla.kernel.org/attachment.cgi?id=232901:

  00:1c.0 Intel Corporation Sunrise Point-H PCI Express Root Port #2
	Capabilities: [40] Express (v2) Root Port (Slot+)

  00: 86 80 11 a1
  50: 42 00 11 70 00 b2 2c 00 00 00 40 01 00 00 00 00

VDID looks like the vendor & device IDs at offset 0 ([8086:a111]).

Offset 0x50 is the Link Control register in the PCIe capability (the capability starts at 0x40, and Link Control is at offset 0x10 in the capability).  The low-order two bits are ASPM control, and the low bit is set when L0s Entry is enabled, so that looks like L0SE.  See the PCIe r3.0 spec, sec 7.8.7.

Offset 0x5a would be the Slot Status register in the PCIe capability.  Bit 0 is "Attention Button Pressed", and bit 3 is "Presence Detect Changed" (PCIe sec 7.8.11).  Those look like they'd match up with ABPX and PDCX.

I don't know what HPSX and PMSX are.  They look like they should be the two high-order bits of a 32-bit register at 0xdc in config space.  But I don't see any capability structures that include that register.

ABPX and PDCX are definitely hotplug-related.
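
The decoding above can be checked mechanically. The sketch below takes the two hex rows quoted from the lspci dump, places them in a 256-byte buffer, and extracts VDID, L0SE, ABPX, and PDCX using the offsets and bit positions described in this comment. The function names are made up for the example; only the two rows come from the attachment.

```python
# The two rows quoted from the lspci -vvxxx output for 00:1c.0.
ROWS = [
    "00: 86 80 11 a1",
    "50: 42 00 11 70 00 b2 2c 00 00 00 40 01 00 00 00 00",
]


def parse_hex_row(row):
    """Turn '50: 42 00 ...' into (0x50, b'\\x42\\x00...')."""
    addr, _, data = row.partition(":")
    return int(addr, 16), bytes.fromhex(data)


def read_le(config, offset, size):
    """Read a little-endian integer of `size` bytes from config space."""
    return int.from_bytes(config[offset:offset + size], "little")


def decode(rows):
    config = bytearray(0x100)  # bytes not covered by the rows stay zero
    for row in rows:
        base, data = parse_hex_row(row)
        config[base:base + len(data)] = data

    vdid = read_le(config, 0x00, 4)
    link_control = read_le(config, 0x50, 2)  # PCIe cap at 0x40 + 0x10
    slot_status = read_le(config, 0x5A, 2)   # PCIe cap at 0x40 + 0x1A
    return {
        "vendor": vdid & 0xFFFF,
        "device": vdid >> 16,
        "L0SE": link_control & 1,       # ASPM L0s Entry enable
        "ABPX": slot_status & 1,        # Attention Button Pressed
        "PDCX": (slot_status >> 3) & 1, # Presence Detect Changed
    }


if __name__ == "__main__":
    for key, value in decode(ROWS).items():
        print(f"{key}: {value:#x}")
```

For the two rows quoted above this gives vendor 0x8086 and device 0xa111, with L0SE, ABPX, and PDCX all clear in that particular snapshot.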

The BIOS did not give us control over PCIe native hotplug (acpi PNP0A08:00: _OSC failed), so we should be using acpiphp, not pciehp.

_L61 is not really that complicated; it's just the same block of ten lines of code repeated for each possible root port.  It looks like it sends a Bus Check to the root port if the Presence Detect Changed bit is set in the Slot Status.

_L61 also turns off L0SE (the ASPM thing) before sending the notification.  And it touches HPSX, TBTS, and TBSE; I have no idea what those are.

_L61 reads VDID to determine whether the root port is present or not (reading VDID returns 0xffffffff if there's no device there).
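
As a reading aid, the per-port logic described above can be modeled in a few lines of Python. This is a paraphrase of the description in this comment, not a translation of the actual AML: the HPSX, TBTS, and TBSE handling is left out because its purpose is unknown, and the port names and VDID values in the usage example are hypothetical.

```python
from dataclasses import dataclass

NOT_PRESENT = 0xFFFFFFFF  # VDID reads as all ones when no device responds
BUS_CHECK = 0             # ACPI Notify value for Bus Check


@dataclass
class RootPort:
    name: str
    vdid: int
    pdcx: int  # Presence Detect Changed bit from Slot Status
    l0se: int  # ASPM L0s Entry enable bit from Link Control


def model_l61(ports):
    """Return the (port name, notify value) pairs the handler would send."""
    notified = []
    for port in ports:
        if port.vdid == NOT_PRESENT:
            continue  # root port absent, nothing to do
        if port.pdcx:
            port.l0se = 0  # L0s entry is turned off before notifying
            notified.append((port.name, BUS_CHECK))
    return notified


if __name__ == "__main__":
    # Hypothetical ports: one with a presence change, one absent, one idle.
    ports = [
        RootPort("RP01", 0xA1118086, pdcx=1, l0se=1),
        RootPort("RP02", NOT_PRESENT, pdcx=1, l0se=1),
        RootPort("RP03", 0xA1128086, pdcx=0, l0se=1),
    ]
    print(model_l61(ports))
```

In this model, a port whose PDCX bit stays set would be notified on every invocation, which is one way such a handler could end up running repeatedly; whether that is what happened on this machine was never established.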
Comment 10 Mike Allen 2016-09-14 21:16:35 UTC
I tried switching from nvidia to the nouveau driver, but wasn't able to boot my system with a GUI afterwards. (When I originally set up the system, with nouveau, I had a lot of problems booting - it would only boot successfully occasionally - and the machine's fans were running continually too.)

Interestingly, after reinstalling the nvidia driver to get the system back to normal, the interrupt storm has gone away (gpe 61 is still enabled, but now reports 0 interrupts), which seems peculiar. So far as I can tell, the nvidia driver is the same version (361.42-0ubuntu2) as the one I was using when I reported the problem. Does this look like an nvidia driver-related issue?
Comment 11 Bjorn Helgaas 2016-09-28 19:07:07 UTC
It does seem strange that there's no interrupt storm now, when you're running the same kernel and same drivers as when you reported the problem.  I guess that means we don't really have anything to debug here, since we have no way to reproduce the problem and no way to test a fix.  So I'll close this as resolved for now, and we can reopen it if it happens again.
Comment 12 Mike Allen 2016-09-28 19:09:43 UTC
Fair enough. If it comes back, I'll reopen the bug.

Thanks to everyone who looked into this for your time!
Comment 13 Yiannis 2016-12-04 03:09:34 UTC
First of all, I'm sorry for commenting on a closed bug.
I have the very same issue on my laptop (UX501VW). I too believe it's related to hotplug and the nvidia (blob) drivers. Is there something that can be done to avoid it without disabling the interrupt, or a proper way to report it to nvidia?
Thank you!
Comment 14 Yiannis 2016-12-19 02:00:02 UTC
UPDATE: a new firmware was just released (version 303). After updating, the problem simply went away, so I think it's all good now. I'm not sure whether it was really the firmware update that fixed it (the changelog unfortunately didn't mention anything) or the fact that the BIOS got reset by the update procedure.
Hopefully that helps anybody else with this problem.
Comment 15 Lv Zheng 2016-12-19 05:32:54 UTC
Closing...
I don't have the full context here, but I'm responsible for bug triage in this category.
Feel free to re-open the bug.

Thanks
