Bug 11312 - ACPI_EVENT_RTC handler gets spontaneously disabled - Intel Core2 laptop (DELL), ICH8M, HPET enabled
Summary: ACPI_EVENT_RTC handler gets spontaneously disabled - Intel Core2 laptop (DELL...
Status: REJECTED WILL_NOT_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Config-Interrupts (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Zhang Rui
URL:
Keywords:
Depends on:
Blocks: 11153
  Show dependency tree
 
Reported: 2008-08-12 12:41 UTC by David Brownell
Modified: 2009-05-22 03:04 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.27-rc2-git (as of 11-aug-2008 evening)
Subsystem:
Regression: No
Bisected commit-id:


Attachments
try the debug patch (894 bytes, patch)
2008-08-12 20:49 UTC, ykzhao
Details | Diff
patch: show acpi events status correctly (5.31 KB, patch)
2008-08-13 00:55 UTC, Zhang Rui
Details | Diff
DSDT for the core2 laptop with this problem (187.50 KB, text/plain)
2008-08-13 19:21 UTC, David Brownell
Details

Description David Brownell 2008-08-12 12:41:17 UTC
Latest working kernel version: unknown
Earliest failing kernel version: unknown
Distribution: Ubuntu 8.04
Hardware Environment: Intel Core2 laptop (DELL), ICH8M, HPET enabled
Software Environment: (as summarized above)
Problem Description:
  I had rtc-cmos working through the ACPI interrupt handling infrastructure,
  which avoids the problematic "HPET emulation".  Tested it with the code
  from Documentation/rtc.txt and it passes.  With the UIE test code from
  bug 11153, ditto.  System sat unused overnight.

  The next morning, those tests failed.  In /sys/firmware/acpi/interrupts
  the ff_rt_clk said "166 invalid".  Something spontaneously invalidated
  that handler ... it was no code of mine, which would only deregister the
  handler on rmmod or startup error.  The driver was non-modular, ruling
  out the former, and is long past startup errors.

  Possibly related:  once during testing I saw the same "invalid" status
  for this handler, right after driver setup.  At the time I was puzzled,
  but thought it was something I had done wrong while fixing other code.
  Now I'm not so sure ... I think there's probably something wrong with
  how ACPI handles at least ACPI_EVENT_RTC handlers.

Steps to reproduce:

  (a) Apply the patch from bug 11153.
  (b) Use test programs, like rtctest or uie2 from bug 11153
  (c) ... until it fails sometime

  That is, I don't know precisely how to reproduce this, but since it's appeared a number of times now, I believe it will do so again.

Severity flagged as "high" since this would prevent sane fixes to
bug 11153, and it's clear something is deeply goofy here.
Comment 1 David Brownell 2008-08-12 14:36:41 UTC
I just noticed that the *diagnostic* /sys/firmware/... file changed in RC1 to include the "invalid" status (commit 71b58cbb0c30d1f78636a48c4721529449d6ea37), which wasn't previously present ... don't think that's related.

Another data point:  a different system (old NForce2 hardware) running the kernel code (but different config of course) did NOT have this spontaneous disabling of ff_rt_clk.

Another notable difference in the ff_* handlers there:  ff_pwr_btn is enabled on NF2, but not the newer core2 laptop.  Neither one has ff_slp_btn enabled.  I find this odd since I'd expect both buttons to always be active.  Yet in the case of the laptop, /proc/acpi/wakeup says PBTN is enabled... I suspect the power button is one of the numerous functions overloaded onto GPIO 12 (GPE 0x1C) on this board.  Which is again puzzling, but I guess that's just an implementation choice allowed by ACPI...
Comment 2 ykzhao 2008-08-12 20:20:16 UTC
Hi, David Brownell
    In current kernel the ACPI_EVETN_RTC event is disabled after the handler is installed. At the same time when the ACPI_EVENT_RTC event is triggered, it will also be disabled. 
    Before the following commit is merged, the ACPI_EVENT_RTC is enabled after the handler is installed.
    >commit e1094bfa26e5e94af2fea79e004614dbce42b008
    >Author: Zhao Yakui <yakui.zhao@intel.com>
    > Date:   Wed May 14 11:32:59 2008 +0800
     >    ACPI: Disable Fixed_RTC event when installing RTC handler
     >    http://bugzilla.kernel.org/show_bug.cgi?id=10010
    
    But on the machine of bug 10010 the system will be turned on automatically because of suprious RTC wakeup. So it will be reasonable that the ACPI_EVENT_RTC event is disabled after installing the handler. Only when RTC alarm is required will it be enabled again. 
   At the same time it will also be OK to disable ACPI_EVENT_RTC after it is triggered. Otherwise the similar problem still exists. If the ACPI_EVENT_RTC is still enabled after it is triggered, the machine of bug 10010 will also be turned on automatically. IMO this is not reasonable.

   Maybe it is reasonable that the ACPI_EVENT_RTC is enabled only when RTC alarm is required. For example: in the cmos-rtc driver if the RTC alarm wakeup is required, the cmos_suspend will call the function of rtc_wake_on, in which the ACPI_EVENT_RTC will be enabled. After the system is wakeup from sleeping state, the cmos_resume will call the function of rtc_wake_off, in which the ACPI_EVENT_RTC will be disabled again.
   Thanks.   
Comment 3 ykzhao 2008-08-12 20:49:52 UTC
Created attachment 17201 [details]
try the debug patch

Will you please try the debug patch and see whether the problem still exists?
In the debug patch when AIE/UIE is turned on, the rtc_wake_on hook function will be called to enable ACPI_EVENT_RTC. Otherwise it will be disabled.
Thanks.
Comment 4 David Brownell 2008-08-12 21:38:40 UTC
I'm not following the comments here...

Re comment #3: the patch from bug 11153 sets up a handler which will take care of all RTC IRQs, and leaves it registered at all times ... so that the IRQ enable/disable logic is controlled by updating the relevant RTC control register.  It also changes the names of the hooks, since their functions have changed ... so this "debug patch" won't work over that patch.  (Plus it ignores several cases, like periodic alarms and alarms enabled when the alarm time is being set.  If that's what you want to do, then the newish cmos_irq_enable and cmos_irq_disable calls are what should be changed.)

Re comment #2 ... I think that what you're saying is that some systems issue spurious wakeups, so the handler should be disabled when those systems go to sleep unless the alarm should be triggering a "real" wakeup.  Is that true?  If so, it's an issue I'll need to take into account.  BUT if that's your point, then it's unrelated to bug 11312 (this one), which doesn't involve system sleep states at all!

THIS bug is the handler spontaneously getting disabled.  No message of any kind to say why that might have been done.  And in particular, the system did not enter a sleep state, so it's not a spurious wakeup.

What would cause the RTC event handler to spontaneously become "invalid"?  Over about six hours of system idleness.
Comment 5 ykzhao 2008-08-12 22:50:53 UTC
That the /sys/firware/acpi/interrupts/ff_rt_clk becomes "invalid" won't affect anything. It indicates that the ACPI_EVENT_RTC event is disabled. In fact "166 invalid" means that ACPI_EVENT_RTC event is already triggered 166 times. 

In fact the patch in comment #3 will enable/disable ACPI_EVENT_RTC event by calling the rtc_wake_on/off hook function when updating the relevant RTC control register. Maybe the patch in bug 11153 has already done. But in the patch in bug 11153 the event is still enabled after it is triggered. Maybe on some machines the event will be triggered periodically. IMO this is not very reasonable. If the patch in bug 11153 is only used for test, I won't disagree with it. 

What I point in comment #2 is that the event should be disabled when RTC alarm is not required regardless of whether the system is in sleeping state. It should be disabled when the system goes to sleep unless RTC alarm wakeup is reuired. At the same time the event should also be disabled when the system is in running state. If not, maybe the suprious pheonomenon will happen on some machines. 

Maybe I misunderstand the description of this bug.  What you want to say is why the RTC event handler becomes "invalid" spontaneously. Right? If so, the invalid state doesn't mean that the RTC event handler is realy invalid. It only indicates that the RTC event is disabled. If RTC event is enabled again, it will be triggered again.

Thanks.
Comment 6 David Brownell 2008-08-12 23:56:15 UTC
The bug is that the event was wrongly disabled (why?) ... the ff_rt_clk file is just showing evidence of that failure.  The first 166 interrupts arrived just fine, then something changed ... so that the next interrupt (never mind thousands of later interrupts!) would never be delivered.

I get the point that, because of bugs in certain ACPI implementations, the RTC event handler should be disabled when it's not needed.  That might make for a somewhat simpler patch (for bug 11153) ... assuming I can enable/disable ACPI event handlers in normal contexts.  (Sleeping will generally be a no-no...)

The goal here is to have all RTC interrupts delivered through ACPI, so that we get real interrupts even when HPET "legacy replacement mode" is interfering with normal irq 8 routing.  So the current patch treats it exactly like a normal IRQ handler:  initialized and left active at all times.  (And never disabled...)  That works for alarm interrupts, once-per-second update interrupts, and 2^N Hz "periodic" interrupts.

The bug is that somehow, without my asking for it to be done, the RTC event handler was disabled!  So the driver never again received its interrupts.  What disabled it?  Why?  Would it ever re-enable it, to match the state that the IRQ was left in by the driver?
Comment 7 Zhang Rui 2008-08-13 00:55:19 UTC
Created attachment 17203 [details]
patch: show acpi events status correctly

This patch shows the status of ACPI events correctly.
Comment 8 Zhang Rui 2008-08-13 01:22:10 UTC
Hi, Dave,

(In reply to comment #6)
> The bug is that the event was wrongly disabled (why?) ...
Agree.

> the ff_rt_clk file is just showing evidence of that failure.
No, there is a bug in the previous code.
The ff_rt_clk is always "invalid" if you have not read this file when it's enabled. In order to verify the problem, please try the patch in comment #7.

>  The first 166 interrupts arrived just
> fine, then something changed ... so that the next interrupt (never mind
> thousands of later interrupts!) would never be delivered.
> 
> I get the point that, because of bugs in certain ACPI implementations, the
> RTC
> event handler should be disabled when it's not needed.

With your patch applied, the RTC event should always be enabled. right?
IMO, No ACPI code will disable it any more.

> The bug is that somehow, without my asking for it to be done, the RTC event
> handler was disabled!
I'm not sure about this.
Please retry with the patch applied.
Comment 9 David Brownell 2008-08-13 01:54:24 UTC
OK, I'm about to reboot this system into RC3, with this applied; we'll see.  More info later.

But to be clear:  the ff_rt_clk file previously looked normal, and showed "NNNN enabled" (for some number NNNN).  But overnight it changed its mind and displayed instead "invalid" -- on this one (SMP) machine.  I remember looking at the previous value; in fact, doing so was in the command history on one shell.

Yes, a result of applying that patch is to always leave that handler enabled.  (Unless something decides to spontaneiously disable it!)
Comment 10 David Brownell 2008-08-13 09:32:30 UTC
OK, it repeated.  Reboot into RC3, run some tests, system idle state has the ff_rt_clk file reading "141 enabled".  Some hours later it reads "141 disabled".

On a different note, the patch from 11153 won't work on one system at all, since it doesn't seem to receive interrupts via ACPI at all ... or at least, not RTC ones.  It's an oldish box with a SiS 961 southbridge and APICs disabled, and its DSDT has at least some obvious bugs (non-existent devices listed) that make me thing it was written for a SiS 962 based system.  (Different bug, yes..)
Comment 11 ykzhao 2008-08-13 17:52:18 UTC
It seems that the patch in comment #8 can give the correct status of ACPI_EVENT_RTC. But the ACPI_EVENT_RTC event is still disabled spontaneously. Maybe it is  related with the BIOS.

Will you please attach the output of acpidump?

Thanks.
Comment 12 Zhang Rui 2008-08-13 18:17:07 UTC
ACPI won't remove the RTC event handler IMO.
is there any chance that rtc_irq_filter_off is called in the patch in comment #16 bug #11153?
because it works again with the new patch in comment #18 applied, which didn't call acpi_remove_fixed_event_handler any more.
Comment 13 David Brownell 2008-08-13 19:19:19 UTC
Re comment #11:  I'll attach the disassembled DSDT, which is what I have handy; let me know if you need more.

Re comment #12:  no, filter_off() was never called.  The driver was not modular, so none of those code paths would ever be called.


Note that I updated bug #11153 with a different patch, which (so far!!) doesn't seem to have this issue.  It leaves ACPI out of the picture almost all the time (working around bug #11312).  I figure I'll go with that patch, since it also works on the SiS 961 system I mentioned earlier.

However, I'll be leaving this bug open since it seems clear to me that it's trouble which might affect dispatch of other IRQs through ACPI.  Could the issue be caused by sometimes returning ACPI_INTERRUPT_NOT_HANDLED?  Other than leaving leaving the handler disabled most of the time (workaround this bug), and not using the ACPI mechanism unless the only other option is the "HPET emulation" (working around both this bug and the SiS issue), that's the main difference I can think of between the two patches...
Comment 14 David Brownell 2008-08-13 19:21:16 UTC
Created attachment 17223 [details]
DSDT for the core2 laptop with this problem
Comment 15 David Brownell 2008-08-13 19:23:30 UTC
Reduce severity and remove block for bug #11153, since for the moment I _seem_ to have an alternative patch that makes this not happen so reliably.
Comment 16 David Brownell 2008-08-28 11:25:34 UTC
State should no longer be NEEDINFO, since all the requested info has been provided...
Comment 17 Zhang Rui 2008-09-01 23:15:47 UTC
okay, I have no idea why/when the RTC event handler is removed.
will apply your original patch and see if I can reproduce the problem...
Comment 18 David Brownell 2008-09-04 12:37:02 UTC
FYI a new incarnation of this bug seems to have appeared in 2.6.27-rc5 ...

Relevant patches applied included the one from Bug #11153 comment #18 (with what seemed to be winning workarounds before -- not the original patch!) and your diagnostic patch (which I hope will merge before 2.6.27-final).

Reproduce by running the test program from Bug #11153 comment #2 ... I then observe the update IRQ firing *AT MOST TWICE* before being disabled.  That is, the program output (slightly modified test program, the number in parens is the tv_usec value of the timestamp when the UIE was received in userspace):

  Read using select(2) on /dev/rtc0:
      1    0.338471  (+0.999664) 0190
      2    1.338491  (+0.999684) 0190
  [ and never outputs again ]

The contents of /sys/firmware/acpi/interrupts/ff_rt_clk were initially "disabled" (ignoring IRQ count), then "enabled" while the program was successfully getting events, then "disabled" in the "never outputs again" state.

So I put this bug back at blocking the fix to Bug #11153 ...
Comment 19 Zhang Rui 2008-09-04 23:50:10 UTC
(In reply to comment #18)
> FYI a new incarnation of this bug seems to have appeared in 2.6.27-rc5 ...
> 
> Relevant patches applied included the one from Bug #11153 comment #18 (with
> what seemed to be winning workarounds before -- not the original patch!) and
> your diagnostic patch (which I hope will merge before 2.6.27-final).
> 
Weird.
I did some tests and the test program can run for thousands of times without any problem. and the rtc event status is always updated correctly?

> Reproduce by running the test program from Bug #11153 comment #2 ... I then
> observe the update IRQ firing *AT MOST TWICE* before being disabled.
is this consistent?
i.e the program always stop in few times?

can you reproduce this problem on other platforms?
Comment 20 David Brownell 2008-09-05 01:22:30 UTC
This is consistent on my only HPET system -- the Core2 laptop with the DSTDT previously attached.  And the IRQ gets disabled after firing at most twice; sometimes just once.

(So it might be a different mechanism for spontaneously getting disabled than I first noticed, since that disabling was happening when no RTC interrupts were firing.)
Comment 21 David Brownell 2008-10-16 00:00:08 UTC
PING?  You got the requested info ...
Comment 22 Len Brown 2008-10-27 23:12:03 UTC
patch in comment #7 applied to acpi-test

however,
we still have no explanation why the interrupt is getting disabled.
Comment 23 Len Brown 2008-11-12 21:25:24 UTC
patch in comment #7 shipped in 2.6.28-rc4-git3

commit ed206fac87d65917280b6c3edd3f01125d4095c9
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Mon Oct 27 14:01:02 2008 -0700

    ACPI: bugfix reporting of event handler status
Comment 24 ykzhao 2008-11-25 21:23:01 UTC
    The main issue of this bug is that FIXED_RTC EVENT is spontaneously disabled.
    In the upstream kernel the FIXED_RTC event will be disabled in the RTC event handler. Only when RTC alarm is set again, it will be enabled again. In the test of bug11533 the Fixed_RTC event won't be disabled in the RTC event handler. It means that it won't be disabled by OS unless it is required. But in this bug the problem is that the Fixed_RTC event is spontaneously disabled. 
    And from the acpidump it seems that the SMI will be triggered in a lot of ACPI objects. Maybe the FIXED_RTC_EVENT is disabled in SMI handler.
    If it is disabled in SMI handler, we can't do nothing about it. 
    thanks.
Comment 25 Shaohua 2009-03-03 21:30:04 UTC
should we close this one if this is a BIOS issue?
Comment 26 Len Brown 2009-04-01 01:57:55 UTC
David,
do you think this is specific to this particular system?
Should we run some tests to see if we have a failure that appears on multiple systems?
Comment 27 Zhang Rui 2009-05-05 06:36:46 UTC
ping David.
Comment 28 Zhang Rui 2009-05-22 03:04:26 UTC
it seems to be a BIOS bug on a specific platform.
close it

Note You need to log in before you can comment on or make changes to this bug.