Bug 202413 - ASUS VivoBook E203NA - MCE hardware error - Internal unclassified error
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
Depends on:
Blocks: 178231
Reported: 2019-01-25 01:27 UTC by Todd Brandt
Modified: 2020-01-24 19:14 UTC

See Also:
Kernel Version: 5.0.0-rc2
Regression: No
Bisected commit-id:

Sleepgraph timeline (376.00 KB, text/html)
2019-01-25 01:27 UTC, Todd Brandt
boot dmesg log (172.71 KB, text/plain)
2019-01-25 01:28 UTC, Todd Brandt
issue.def (461 bytes, text/plain)
2019-04-19 11:49 UTC, Todd Brandt

Description Todd Brandt 2019-01-25 01:27:53 UTC
Created attachment 280753
Sleepgraph timeline

We run around 3000 iterations of S3 suspend in our weekly stress tests, and we discovered an issue on the Asus VivoBook E203: every run (100%) produces an MCE hardware error. This is the mcelog output from a single test run:

Hardware event. This is not a software error.
ADDR fef61100
TIME 1548379311 Thu Jan 24 17:21:51 2019
MCG status:
MCi status:
Uncorrected error
MCi_ADDR register valid
Processor context corrupt
MCA: Internal unclassified error: 408
STATUS a600000000020408 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 92
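
For readers who want to check the mcelog decoding by hand, here is a minimal sketch that unpacks the STATUS value above. It assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, MCA error code in bits 15:0). The function name and the dictionary keys are made up for illustration and are not part of mcelog or the kernel.

```python
# Sketch only: bit positions assume the architectural IA32_MCi_STATUS layout.
FLAGS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flag names and error codes."""
    mca_code = status & 0xFFFF
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_error_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        # Simple MCA codes of the form 0000 01xx xxxx xxxx are
        # "internal unclassified" (assumed from the SDM encoding table).
        "internal_unclassified": (mca_code & 0xFC00) == 0x0400,
    }


if __name__ == "__main__":
    decoded = decode_mci_status(0xA600000000020408)
    print(decoded["flags"])
    print(hex(decoded["mca_error_code"]), decoded["internal_unclassified"])
```

For the value in this report the set flags come out as VAL, UC, ADDRV and PCC, which lines up with the "Uncorrected error", "MCi_ADDR register valid" and "Processor context corrupt" lines that mcelog printed.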

The sleepgraph timeline for this run is attached (this info is in the log as well).
Comment 1 Todd Brandt 2019-01-25 01:28:14 UTC
Created attachment 280755
boot dmesg log
Comment 2 Len Brown 2019-01-25 04:16:45 UTC
The MCE is present in the boot dmesg as well:

[    0.271303] smpboot: CPU0: Intel(R) Celeron(R) CPU N3350 @ 1.10GHz (family: 0x6, model: 0x5c, stepping: 0x9)
[    0.271582] mce: [Hardware Error]: Machine check events logged
[    0.271588] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
[    0.272088] mce: [Hardware Error]: TSC 0 ADDR fef61100 
[    0.272418] mce: [Hardware Error]: PROCESSOR 0:506c9 TIME 1548377732 SOCKET 0 APIC 0 microcode 32
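
As a cross-check that the boot log and the mcelog output describe the same CPU, the sketch below decodes the signature 506c9 from the PROCESSOR field. It assumes the standard CPUID leaf 1 EAX layout (stepping in bits 3:0, model in 7:4, family in 11:8, extended model in 19:16, extended family in 27:20). The function name is hypothetical.

```python
def decode_cpuid_signature(sig: int) -> tuple:
    """Return (family, model, stepping) from a CPUID leaf 1 EAX signature."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


if __name__ == "__main__":
    family, model, stepping = decode_cpuid_signature(0x506C9)
    print(f"family {family:#x}, model {model:#x} ({model}), stepping {stepping:#x}")
```

This gives family 0x6, model 0x5c (decimal 92) and stepping 0x9, matching both the smpboot line here and "Family 6 Model 92" in the mcelog output from the description.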
Comment 3 Todd Brandt 2019-01-25 04:21:35 UTC
Note that this does not occur in S3, only in S2idle (freeze).
Comment 4 Todd Brandt 2019-01-25 18:05:33 UTC
(In reply to Todd Brandt from comment #3)
> Note that this does not occur in S3, only in S2idle (freeze).

Strike that, reverse it. This occurs only in S3, not in freeze. Our freeze data is clean.
Comment 5 Todd Brandt 2019-04-19 11:49:49 UTC
Created attachment 282407
issue.def
Comment 6 Tony Luck 2020-01-24 19:14:49 UTC
The machine check error code says that the problem happened on an MMIO address. Does Linux know about 0xfef61100 in /proc/iomem? If not, then this is most likely a BIOS bug. Well, likely anyway. The PCC bit is set in the status. If Linux had done the access, there would have been a fatal machine check.
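
One way to answer the /proc/iomem question is sketched below. It assumes the usual "start-end : name" line format with indentation for nested resources. The function name is made up for illustration. Note that on recent kernels an unprivileged read of /proc/iomem shows all addresses as zero, so the script most likely needs to run as root to give a meaningful answer.

```python
import re

# Matches lines such as "  fed00000-fed003ff : HPET 0" (leading indent optional).
LINE_RE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def find_iomem_ranges(iomem_text: str, addr: int) -> list:
    """Return every (start, end, name) range in iomem_text that contains addr."""
    hits = []
    for line in iomem_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        if start <= addr <= end:
            hits.append((start, end, match.group(3).strip()))
    return hits


if __name__ == "__main__":
    try:
        with open("/proc/iomem") as f:
            text = f.read()
    except OSError as exc:
        print(f"could not read /proc/iomem: {exc}")
    else:
        hits = find_iomem_ranges(text, 0xFEF61100)
        if not hits:
            print("0xfef61100 is not covered by any /proc/iomem entry")
        for start, end, name in hits:
            print(f"{start:08x}-{end:08x} : {name}")
```

An empty result would mean Linux has no resource registered at that address, which is the case the comment above suggests points toward the BIOS.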
