Created attachment 280753 [details] Sleepgraph timeline

We run around 3000 iterations of S3 suspend in our weekly stress tests, and we discovered an issue on the Asus VivoBook E203. We consistently receive an mce hardware error on every run (100%). After running mcelog this is the data acquired from a single test run:

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 4
ADDR fef61100
TIME 1548379311 Thu Jan 24 17:21:51 2019
MCG status:
MCi status:
Uncorrected error
MCi_ADDR register valid
Processor context corrupt
MCA: Internal unclassified error: 408
STATUS a600000000020408 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
MICROCODE 32
CPUID Vendor Intel Family 6 Model 92

The sleepgraph timeline for this run is attached (this info is in the log as well).
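For anyone cross-checking the mcelog text against the raw STATUS value, here is a minimal sketch that unpacks the architectural flag bits of an IA32_MCi_STATUS word. The bit positions follow the Intel SDM layout (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57); the function name decode_mci_status is made up for illustration and is not part of mcelog or the kernel.

```python
# Architectural flag bits of IA32_MCi_STATUS (per the Intel SDM layout).
FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled for this bank
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Hypothetical helper: split a 64-bit MCi_STATUS value into its parts."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


decoded = decode_mci_status(0xA600000000020408)
print(decoded["flags"])                # ['VAL', 'UC', 'ADDRV', 'PCC']
print(hex(decoded["mca_error_code"]))  # 0x408
```

The UC, ADDRV and PCC bits line up with the "Uncorrected error", "MCi_ADDR register valid" and "Processor context corrupt" lines in the mcelog output, and the low 16 bits give the 408 error code.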
Created attachment 280755 [details] boot dmesg log
The MCE is present in the boot dmesg as well:

[ 0.271303] smpboot: CPU0: Intel(R) Celeron(R) CPU N3350 @ 1.10GHz (family: 0x6, model: 0x5c, stepping: 0x9)
[ 0.271582] mce: [Hardware Error]: Machine check events logged
[ 0.271588] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
[ 0.272088] mce: [Hardware Error]: TSC 0 ADDR fef61100
[ 0.272418] mce: [Hardware Error]: PROCESSOR 0:506c9 TIME 1548377732 SOCKET 0 APIC 0 microcode 32
Note that this does not occur in S3, only in S2idle (freeze).
(In reply to Todd Brandt from comment #3)
> Note that this does not occur in S3, only in S2idle (freeze).

Strike that, reverse it. This occurs only in S3, not in freeze. Our freeze data is clean.
Created attachment 282407 [details] issue.def
The machine check error code says that the problem happened on an MMIO address. Does Linux know about 0xfef61100 in /proc/iomem? If not, then this is most likely a BIOS bug. Well, likely anyway. The PCC bit is set in the status. If Linux had done the access, there would have been a fatal machine check.
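One way to answer the /proc/iomem question is to scan the file for ranges that cover the reported address. The sketch below assumes the usual "start-end : name" line format. The helper name find_regions and the SAMPLE text are hypothetical and are not taken from the reporter's machine. Note that /proc/iomem shows zeroed addresses to unprivileged users, so the real file should be read as root.

```python
import re

# Matches lines such as "  fee00000-fee00fff : Local APIC" (nesting is by indent).
LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")


def find_regions(iomem_text: str, addr: int) -> list:
    """Hypothetical helper: return (start, end, name) for ranges containing addr."""
    hits = []
    for line in iomem_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        if start <= addr <= end:
            hits.append((start, end, m.group(3).strip()))
    return hits


# Made-up sample for illustration only, NOT the contents of the affected system.
SAMPLE = """\
fed00000-fed003ff : HPET 0
fee00000-fee00fff : Local APIC
ff000000-ffffffff : Reserved
"""

for start, end, name in find_regions(SAMPLE, 0xFEE00800):
    print(f"{start:08x}-{end:08x} : {name}")
```

On the affected system, the same function could be fed the output of open("/proc/iomem").read() with addr=0xfef61100. An empty result would mean the kernel has no resource registered for that address.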