Bug 208733

Summary: s2idle freeze wakes from timekeeping_freeze at least 100 times with ec_no_wakeup enabled
Product: ACPI Reporter: Todd Brandt (todd.e.brandt)
Component: ECAssignee: acpi_ec
Status: RESOLVED CODE_FIX    
Severity: normal CC: todd.e.brandt
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.8.0-rc1 Subsystem:
Regression: No Bisected commit-id:
Bug Depends on:    
Bug Blocks: 178231    
Attachments: issue.def
issue.def
issue.def
issue.def
example_of_infinite_loop_in_amber_lake

Description Todd Brandt 2020-07-29 02:02:04 UTC
Created attachment 290663 [details]
issue.def

When sleepgraph is run it is expected to enter s2idle once and leave s2idle once at the triggering of rtcwake. Sometimes wakeups occur which pull it out of timekeeping_freeze but still within the s2idle_enter function. This can affect power and performance. This issue monitors it for occurrences.
Comment 1 Todd Brandt 2020-07-30 01:47:25 UTC
Created attachment 290681 [details]
issue.def
Comment 2 Todd Brandt 2020-08-05 16:07:03 UTC
Created attachment 290783 [details]
issue.def
Comment 3 Todd Brandt 2020-08-14 03:29:53 UTC
Created attachment 290895 [details]
issue.def
Comment 4 Todd Brandt 2020-08-14 03:35:20 UTC
I've found 5 systems that were/are affected by this behavior in our lab. Two of which are still affected in 5.8.0: the otcpl-aml-y (amber lake) and the otcpl-cml-u (comet lake). A key piece of the puzzle appears to be the acpi.ec_no_wakeup=1 kernel argument. This issue only occurs on these systems when this argument is used. Without it both machines pass with flying colors and achieve full S0iX coverage.

So the simple solution is to not use acpi.ec_no_wakeup=1, but I'm still looking into why this is. The whole purpose of this flag is to make S2idle more stable and prevent unnecessary wakeups. In these two specific systems it appears to achieve the exact opposite effect for a significant part of the stress testing time.
Comment 5 Todd Brandt 2020-08-24 18:31:16 UTC
Upon further debug, I believe that cause of the error is a bug in the drivers/acpi/ec.c library. The ec_no_wakeup parameter has 3 major effects:

1) disable EC GPE before suspend, and re-enable EC GPE after resume
2) prevent EC GPE from being flagged as a wakeup source
3) abort the acpi_ec_dispatch_gpe call if an EC GPE is triggered

When these infinite loops occur, something else re-enables the EC GPE just after ec_no_wakeup disables it for suspend. Thus the EC GPE is erroneously dispatched. However when the EC GPE reaches acpi_ec_dispatch_gpe, the ec_no_wakeup flag prevents it from being processed and completed. Thus it just keeps re-dispatching over and over in an infinite loop until resume clears everything.

The simplest solution to this issue is this:

@@ -2011,9 +2040,17 @@ bool acpi_ec_dispatch_gpe(void)
        if (acpi_any_gpe_status_set(first_ec->gpe))
                return true;
 
-       if (ec_no_wakeup)
-               return false;
-

It allows erroneous EC GPEs to be completed, thus preventing the loop. However if does not solve the mystery of why they're being re-enabled at all.
Comment 6 Todd Brandt 2020-08-24 18:43:05 UTC
Created attachment 292147 [details]
example_of_infinite_loop_in_amber_lake
Comment 7 Todd Brandt 2020-11-19 23:07:15 UTC
This appears fixed with the following:

commit e0e9ce390d7bc6a705653d4a8aa4ea92c9a65e53
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Mon Oct 5 19:13:46 2020 +0200

    ACPI: EC: PM: Drop ec_no_wakeup check from acpi_ec_dispatch_gpe()
    
    It turns out that in some cases there are EC events to flush in
    acpi_ec_dispatch_gpe() even though the ec_no_wakeup kernel parameter
    is set and the EC GPE is disabled while sleeping, so drop the
    ec_no_wakeup check that prevents those events from being processed
    from acpi_ec_dispatch_gpe().
    
    Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
    Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>