Bug 202413

Summary: ASUS VivoBook E203NA - MCE hardware error - Internal unclassified error
Product: Platform Specific/Hardware
Reporter: Todd Brandt (todd.e.brandt)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: NEW
Severity: normal
CC: bp, lenb, rui.zhang, tony.luck
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.0.0-rc2
Subsystem:
Regression: No
Bisected commit-id:
Bug Depends on:
Bug Blocks: 178231
Attachments: Sleepgraph timeline, boot dmesg log, issue.def

Description Todd Brandt 2019-01-25 01:27:53 UTC
Created attachment 280753: Sleepgraph timeline

We run around 3000 iterations of S3 suspend in our weekly stress tests, and we discovered an issue on the ASUS VivoBook E203NA. We receive an MCE hardware error on every run (100%). This is the mcelog output from a single test run:

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 4
ADDR fef61100
TIME 1548379311 Thu Jan 24 17:21:51 2019
MCG status:
MCi status:
Uncorrected error
MCi_ADDR register valid
Processor context corrupt
MCA: Internal unclassified error: 408
STATUS a600000000020408 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
MICROCODE 32
CPUID Vendor Intel Family 6 Model 92

The sleepgraph timeline for this run is attached (this info is in the log as well).
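
For readers who want to cross-check the mcelog summary against the raw STATUS value, here is a minimal Python sketch. It assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, MCA error code in bits 15:0, model-specific code in bits 31:16). The function and dictionary names are made up for illustration and are not part of mcelog.

```python
# Sketch only: decode the architectural flag bits of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM; names here are illustrative.
FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Return the set flags and the two 16-bit error code fields."""
    set_flags = [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    mca_code = status & 0xFFFF
    return {
        "flags": set_flags,
        "mca_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        # "Internal unclassified" codes have the form 0000 01xx xxxx xxxx.
        "internal_unclassified": (mca_code & 0xFC00) == 0x0400,
    }


decoded = decode_mci_status(0xA600000000020408)
print(decoded["flags"])
print(hex(decoded["mca_code"]), hex(decoded["model_specific_code"]))
```

The flags that come out (UC, ADDRV, PCC, plus VAL) line up with the "Uncorrected error", "MCi_ADDR register valid" and "Processor context corrupt" lines in the mcelog output above, and the low 16 bits give the 408 code.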
Comment 1 Todd Brandt 2019-01-25 01:28:14 UTC
Created attachment 280755: boot dmesg log
Comment 2 Len Brown 2019-01-25 04:16:45 UTC
The MCE is present in the boot dmesg as well:

[    0.271303] smpboot: CPU0: Intel(R) Celeron(R) CPU N3350 @ 1.10GHz (family: 0x6, model: 0x5c, stepping: 0x9)
[    0.271582] mce: [Hardware Error]: Machine check events logged
[    0.271588] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
[    0.272088] mce: [Hardware Error]: TSC 0 ADDR fef61100 
[    0.272418] mce: [Hardware Error]: PROCESSOR 0:506c9 TIME 1548377732 SOCKET 0 APIC 0 microcode 32
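
The PROCESSOR field in the last line (506c9) is the CPUID signature, and it can be matched against the family/model/stepping printed by smpboot and against the "Family 6 Model 92" line from mcelog. A small Python sketch follows, assuming the standard CPUID leaf 1 EAX layout for Intel parts. The function name is hypothetical.

```python
# Sketch only: split a CPUID leaf 1 EAX signature into family/model/stepping.
# Field layout assumed from the Intel SDM; the function name is illustrative.
def decode_cpuid_signature(sig):
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


family, model, stepping = decode_cpuid_signature(0x506C9)
print(f"family 0x{family:x}, model 0x{model:x} ({model}), stepping 0x{stepping:x}")
```

For 0x506c9 this gives family 0x6, model 0x5c (92 decimal) and stepping 0x9, consistent with both the smpboot line and the mcelog output.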
Comment 3 Todd Brandt 2019-01-25 04:21:35 UTC
Note that this does not occur in S3, only in S2idle (freeze).
Comment 4 Todd Brandt 2019-01-25 18:05:33 UTC
(In reply to Todd Brandt from comment #3)
> Note that this does not occur in S3, only in S2idle (freeze).

Strike that, reverse it. This occurs only in S3, not in freeze. Our freeze data is clean.
Comment 5 Todd Brandt 2019-04-19 11:49:49 UTC
Created attachment 282407: issue.def
Comment 6 Tony Luck 2020-01-24 19:14:49 UTC
The machine check error code says that the problem happened on an MMIO address. Does Linux know about 0xfef61100 in /proc/iomem? If not, then this is most likely a BIOS bug. Well, likely anyway. The PCC bit is set in the status, so if Linux had done the access, there would have been a fatal machine check.
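
One way to answer the /proc/iomem question is to scan the file for a range that contains the reported address. The Python sketch below assumes the usual "start-end : name" line format. The helper name and the sample text are made up for illustration and are not taken from the reporter's machine.

```python
# Sketch only: find /proc/iomem-style ranges that contain a given address.
# SAMPLE is illustrative text, NOT data from the machine in this bug.
import re

LINE_RE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")


def find_iomem_ranges(iomem_text, addr):
    """Return (start, end, name) for every range that contains addr."""
    matches = []
    for line in iomem_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        if start <= addr <= end:
            matches.append((start, end, m.group(3).strip()))
    return matches


SAMPLE = """\
00000000-00000fff : Reserved
fed00000-fed003ff : HPET 0
fee00000-fee00fff : Local APIC
"""

hits = find_iomem_ranges(SAMPLE, 0xFEF61100)
print(hits if hits else "0xfef61100 is not covered by any listed range")
```

On a real system the text would come from reading /proc/iomem as root, since unprivileged reads show the addresses as zeros. An empty result for 0xfef61100 would mean Linux has no resource registered at that address.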