Bug 13268

Summary: ACPI interrupt storm when system is warm - Arima M620-DC (ICH4)
Product: ACPI
Reporter: Christopher Horler (cshorler)
Component: Config-Interrupts
Assignee: ykzhao (yakui.zhao)
Status: CLOSED DOCUMENTED
Severity: normal
CC: acpi-bugzilla, cshorler, lenb, rui.zhang, yakui.zhao
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.29.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments: acpidump output
disassembly of acpi tables
dmesg output after storm
grep . /sys/firmware/acpi/interrupts/*
try the custom DSDT
kernel log with customized DSDT
interrupts with custom DSDT, still gpe00

Description Christopher Horler 2009-05-07 22:15:20 UTC
Created attachment 21264 [details]
acpidump output

Further details as requested...

Slight change of conditions - turned off the fglrx kernel module and compiled a 2.6.29.2 kernel.  Experienced the same "storm".

interrupts, dmesg and acpi information attached.


The following text is duplicated from the linux-acpi mailing list for reference:


Subject: System overloaded with ACPI interrupts

Hello,

I think I'm experiencing some kind of ACPI issue, but I haven't been able
to identify a series of actions that causes it.

I first noticed the problem when looking at 'top' - seeing kacpi at
the top of the process list (followed by kacpi_notify).
I have powertop installed, the dump is attached - acpi interrupts seem
to be excessive.

The computer is perhaps not the best machine:  Arima M620-DC.  I think
the same chassis may be used by a number of other manufacturers.

The system does not boot in this state - it occurs rather
unpredictably after high CPU loads.  I had thought this was a CPU
temperature issue and replaced the thermal paste of the heatsink,
which improved matters.

I have also blacklisted the following modules:
blacklist i2c_i801
blacklist yenta_socket

The kernel reports a conflict between ACPI SBUS and i2c_i801.
I forget why I disabled yenta_socket, but I don't use the card reader.

The OS is openSuSE 11.1 with the 2.6.27.21-0.1-default kernel installed.

I've also tried booting the system with acpi=noirq.  This gives an 'irq
5: nobody cared' message, and the wireless card (ipw2200) then doesn't work.

My normal kernel command line contains the following parameters:
hpet=force lapic vga=0x317
All the attached logs were generated with these parameters, which is also
the configuration in which my system functions with all devices working.

I have also disassembled the ACPI tables.  I can post them if required?

Hopefully someone can suggest a temporary solution or a permanent one?

Thanks,
Chris

The response:

hi,

please try a recent vanilla kernel and see if it's reproducible. 2.6.29
would be a good choice.

if the problem still exists in 2.6.29 kernel, please attach the output
of dmesg and "grep . /sys/firmware/acpi/interrupts/*" after the
interrupt storm occurs.
please also attach the acpidump of this laptop.

it would be great if you can open a new bug report at
http://bugzilla.kernel.org/enter_bug.cgi?product=ACPI
and attach all the info there.
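
The counters requested above can also be ranked so that the busiest source stands out. This is a minimal sketch, not part of the original exchange: `top_acpi_interrupts` is a hypothetical helper name, and it assumes each file under /sys/firmware/acpi/interrupts starts with an event count, optionally followed by a status word such as "enabled".

```shell
#!/bin/sh
# Rank ACPI interrupt counters, busiest first.
# Assumption: each file starts with an event count, optionally followed
# by a status word, e.g. "  123456   enabled".
# top_acpi_interrupts is a hypothetical helper name, not an existing tool.
top_acpi_interrupts() {
    dir="${1:-/sys/firmware/acpi/interrupts}"
    limit="${2:-5}"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        count=""
        # read splits on whitespace, so leading padding is dropped.
        read -r count rest < "$f" || true
        # Skip anything that does not begin with a plain number.
        case "$count" in
            ''|*[!0-9]*) continue ;;
        esac
        printf '%s %s\n' "$count" "${f##*/}"
    done | sort -rn | head -n "$limit"
}
```

Called with no arguments, it reads the live sysfs directory and prints the five largest counters as "count name" lines; a directory and a limit can be passed instead, for example to inspect a saved copy of the files.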
Comment 1 Christopher Horler 2009-05-07 22:16:02 UTC
Created attachment 21265 [details]
disassembly of acpi tables
Comment 2 Christopher Horler 2009-05-07 22:16:49 UTC
Created attachment 21266 [details]
dmesg output after storm
Comment 3 Christopher Horler 2009-05-07 22:17:33 UTC
Created attachment 21267 [details]
grep . /sys/firmware/acpi/interrupts/*
Comment 4 ykzhao 2009-05-08 01:01:20 UTC
Hi, Christopher
    From the info in comment #3, GPE00 is being triggered very frequently. And from the acpidump, this appears to be caused by a buggy BIOS:
        Method (_L00, 0, NotSerialized)
        {
        }
    When GPE00 is triggered, the _L00 method does nothing, so the condition that raised the event is never handled and GPE00 is triggered again.

    So IMO this is a BIOS issue, and it would be best fixed by upgrading the BIOS.
    
    thanks.
Comment 5 Zhang Rui 2009-05-08 01:05:23 UTC
the problem happens in every kernel release that you have tried, right?

please run "echo disable > /sys/firmware/acpi/interrupts/gpe00" before the interrupt storm and see if it helps.
Comment 6 ykzhao 2009-05-08 01:20:28 UTC
Created attachment 21269 [details]
try the custom DSDT

Will you please try the custom DSDT and see whether the issue still exists?
    In the custom DSDT, the polarity of THRM_POL is inverted.
    Instructions for using a custom DSDT can be found at:
    http://www.lesswatts.org/projects/acpi/faq.php

    Note: As DSDT.hex is already attached, the first four steps can be skipped.
    Thanks.
Comment 7 Christopher Horler 2009-05-08 15:57:26 UTC
(In reply to comment #4)
> Hi, Christopher
>     So IMO this is a BIOS issue.And it had better be fixed by upgrading BIOS.

Quite possibly a BIOS issue!  However, I've requested a newer BIOS a couple times before and been told that there isn't one.  So there's not much I can do unless you know another source.

Chris
Comment 8 Christopher Horler 2009-05-08 16:25:35 UTC
(In reply to comment #5)
> the problem happens in every kernel release that you have tried, right?

yes - for as long as I can remember (I can remember as far back as SuSE 10.1, but can't remember what kernel that was running - it's probable it was happening before that too).
 
> please run "echo disable > /sys/firmware/acpi/interrupts/gpe00" before the
> interrupt storm and see if it helps.

I tried this and I think it helped - at least I tried to provoke the problem and it didn't appear in about 45 mins of trying.

Thanks!  What practical impact does disabling gpe00 have? (other than solving my problem).
Comment 9 Christopher Horler 2009-05-08 17:48:27 UTC
(In reply to comment #6)
> Created an attachment (id=21269) [details]
> try the custom DSDT
> 
> Will you please try the custom and see whether the issue still exists?
>     In the custom DSDT the polarity of THRM_POL will be inverted.
>     How to use the custom DSDT can be found in :
>     http://www.lesswatts.org/projects/acpi/faq.php
> 
>     Note: As the DSDT.hex is already attached, the first four steps can be
> skipped.
>     Thanks.

I recompiled the kernel, installed it, and booted the system.  I still get the interrupt storm with the patched DSDT - logs attached.  The interrupt count is lower, but the system wasn't running as long as last time, so this is probably just proportional to the difference in uptime.

Chris
Comment 10 Christopher Horler 2009-05-08 17:49:46 UTC
Created attachment 21279 [details]
kernel log with customized DSDT
Comment 11 Christopher Horler 2009-05-08 17:50:31 UTC
Created attachment 21280 [details]
interrupts with custom DSDT, still gpe00
Comment 12 Zhang Rui 2009-05-11 02:01:02 UTC
        Method (_L00, 0, NotSerialized)
        {
        }

This is taken from the acpidump you attached.

We can see that nothing is done in the GPE00 handler.
So IMO, GPE00 is a nop to the Linux kernel, i.e. disabling this GPE is harmless.
And "echo disable > /sys/firmware/acpi/interrupts/gpe00" is the command to disable GPE00.

My questions are:
1. Does this problem exist in every kernel you've tried?
2. Does this happen from the beginning, or is it caused at runtime by some specific actions?
Comment 13 ykzhao 2009-05-11 03:41:44 UTC
Hi, Rui
    Once the GPE storm is under way on GPE00, it can't be disabled with "echo disable > /sys/firmware/acpi/interrupts/gpe00".

    And the problem is related to the bogus (empty) _L00 GPE method.

    According to the ICH4 chipset spec, GPE00 appears to be driven by the THRM signal, and whether GPE00_STS is set is controlled by the THRM_POL bit.
    In the custom DSDT, the polarity of the THRM_POL bit is inverted. But from the log, the problem still exists even with the custom DSDT in use.
    
    Thanks.
Comment 14 Christopher Horler 2009-05-11 16:33:29 UTC
(In reply to comment #12)
> then my question is that,
> 1. does this problem exist in every kernel you've tried?

Every 2.6 series kernel.

> 2. does this happen from the beginning, or it's caused at runtime by some
> specific actions?

The system normally starts in a stable state - unless rebooting after the interrupt storm.  In which case the storm sometimes continues (I think turning off for a few minutes normally resets everything).

When the system is in a stable state

echo disable > /sys/firmware/acpi/interrupts/gpe00

is effective.  I have now added this to the boot.local script, and so far I have had no more interrupt storms.
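
A boot-script version of this workaround can be written a little more defensively. This is only a sketch: `disable_gpe` is a hypothetical helper name, and the only things taken from this report are the sysfs path and the "disable" keyword.

```shell
#!/bin/sh
# Disable one GPE through sysfs, but only if its control file is writable.
# disable_gpe is a hypothetical helper name; the path and the "disable"
# keyword are the ones quoted in this report.
disable_gpe() {
    file="${1:-/sys/firmware/acpi/interrupts/gpe00}"
    if [ ! -w "$file" ]; then
        echo "disable_gpe: cannot write $file" >&2
        return 1
    fi
    echo disable > "$file"
}
```

Calling `disable_gpe` with no arguments targets gpe00. The writability check means a kernel without this sysfs file, or a non-root caller, produces a message rather than a shell redirection error.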

Normally I can cause it by running some graphically intensive websites (with lots of CSS and flash on the pages).  It seems to be independent of the graphics driver in use with X (I've tried ATI's and the open source radeon driver).  I think it's in some way related to CPU load.  

It's impossible to give an exact scenario which will initiate the interrupt storm, sometimes it won't happen.

Thanks for your help!

Chris
Comment 15 ykzhao 2009-05-18 03:05:24 UTC
(In reply to comment #14)
> (In reply to comment #12)
> > then my question is that,
> > 1. does this problem exist in every kernel you've tried?
> 
> Every 2.6 series kernel.
> 
> > 2. does this happen from the beginning, or it's caused at runtime by some
> > specific actions?
> 
> The system normally starts in a stable state - unless rebooting after the
> interrupt storm.  In which case the storm sometimes continues (I think
> turning
> off for a few minutes normally resets everything).
> 
> When the system is in a stable state
> 
> echo disable > /sys/firmware/acpi/interrupts/gpe00
Right. When the system is in the stable state, the GPE00 can be disabled by "echo disable > /sys/firmware/acpi/interrupts/gpe00".

> 
> is effective.  I have now added this to the boot.local script, and so far I
> have had no more interrupt storms.
> 
> Normally I can cause it by running some graphically intensive websites (with
> lots of CSS and flash on the pages).  It seems to be independent of the
> graphics driver in use with X (I've tried ATI's and the open source radeon
> driver).  I think it's in some way related to CPU load.  
From the acpidump and the ICH4 spec, we know that GPE00 is thermal-related. When the CPU temperature rises, the GPE00 interrupt is triggered, but the _L00 method does nothing in response, so the interrupt storm follows.
> 
> It's impossible to give an exact scenario which will initiate the interrupt
> storm, sometimes it won't happen.

According to the ICH4 spec, GPE00_STS can be controlled via the polarity of the THRM_POL bit. But even with the custom DSDT, in which the _L00 method inverts the THRM_POL polarity, the interrupt storm still occurs.


In fact, IMO this is a BIOS bug (the bogus GPE00 method), and it would be best fixed by upgrading the BIOS.
Comment 16 Zhang Rui 2009-05-18 03:22:33 UTC
(In reply to comment #15)
> 
> From the ICH4 spec the GPE00_STS can be controlled via the polarity of
> THRM_POL
> bit. But in the custom DSDT the polarity of THRM_POL bit is inverted in the
> _L00 method, there still exists the interrupt storm.
> 
> 
> In fact IMO this is a BIOS bug.(The bogus GPE00 method). And it had better be
> fixed by upgrading BIOS.

Right, but we still need to check whether this happens on Windows as well.
I don't know how to verify an interrupt storm on Windows, though. Does anyone have any ideas?
Comment 17 Len Brown 2009-05-19 02:17:17 UTC
for windows, run perfmon
and add a counter for interrupts/sec ?
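
On the Linux side, a rough equivalent of an interrupts/sec counter can be sketched by sampling one of the sysfs counters twice. This is an illustration under stated assumptions, not a tool mentioned in this report: `rate_from_samples` and `acpi_irq_rate` are hypothetical helper names, and the counter file is assumed to start with a cumulative event count.

```shell
#!/bin/sh
# Estimate interrupts per second for one ACPI counter by sampling it twice.
# Assumption: the file starts with a cumulative event count.
# rate_from_samples and acpi_irq_rate are hypothetical helper names.
rate_from_samples() {
    # $1 = first count, $2 = second count, $3 = seconds between samples
    echo $(( ($2 - $1) / $3 ))
}

acpi_irq_rate() {
    file="${1:-/sys/firmware/acpi/interrupts/gpe00}"
    seconds="${2:-5}"
    [ -r "$file" ] || return 1
    # The interval must be a positive whole number of seconds.
    case "$seconds" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$seconds" -gt 0 ] || return 1
    read -r first rest < "$file" || true
    sleep "$seconds"
    read -r second rest < "$file" || true
    rate_from_samples "$first" "$second" "$seconds"
}
```

For example, `acpi_irq_rate /sys/firmware/acpi/interrupts/gpe00 10` prints the average number of GPE00 events per second over a ten-second window, using integer division.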
Comment 18 Christopher Horler 2009-05-19 16:57:33 UTC
(In reply to comment #17)
> for windows, run perfmon
> and add a counter for interrupts/sec ?

I reinstalled Windows into Virtual Box and no longer have a non-Virtual Box installation.  I've not really had a dual boot machine for about 5 years, so it's not very easy to test this.

Since disabling gpe00 in the boot scripts - I've not encountered this issue again.  (To date).

If I could get a BIOS update that would be great - but I have no idea where to look.  I investigated once before without success.  It's a bit difficult to find what you want when you don't know where to look (Arima may have sold part of their business, the OEM I bought through went bust and the BIOS manufacturer doesn't seem to have an updates website).  Anyone, correct me if I'm wrong - you might know other places to look, or understand the .tw website.
Comment 19 Len Brown 2009-05-19 18:25:16 UTC
If we can prove that Windows figures out how to work properly
in the face of this BIOS bug, then it justifies spending
the effort to make Linux handle the same bug.

I don't know what "virtual box" is, but if windows isn't
talking to the real hardware, then that isn't interesting.

I'm closing this bug as "documented" at this point,
as a workaround is documented that gets you going.
If you can show Windows on the hardware works, or we
run into other systems with the same issue, we can
re-open and investigate further.