Created attachment 280749 [details] Sleepgraph timeline

We run around 3000 iterations of S3 suspend in our weekly stress tests, and we discovered an issue on the Dell Precision 3520. We consistently see MCE hardware errors in about 33% of runs (roughly one in every three). Running mcelog after a single test run gives the following data:

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 3880000086 ADDR fef20080
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94

Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 3880000086 ADDR fef200c0
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94

Hardware event. This is not a software error.
MCE 2
CPU 0 BANK 8
MISC 3880000086 ADDR fef20000
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94

Hardware event. This is not a software error.
MCE 3
CPU 0 BANK 9
MISC 3880000086 ADDR fef20040
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94

The sleepgraph timeline for this run is attached (this info is in the log as well).
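
For anyone cross-checking the mcelog text against the raw STATUS value, here is a minimal Python sketch that splits ee2000000040110a into its fields. It assumes the architectural IA32_MCi_STATUS layout and the compound cache-hierarchy error code format (000F 0001 RRRR TTLL) from the Intel SDM. The names decode_status, STATUS_FLAGS, CACHE_LEVEL and TRANSACTION are made up for this illustration and are not part of mcelog or the kernel.

```python
# Flag bits of IA32_MCi_STATUS (assumed architectural layout, Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register valid
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# LL and TT sub-fields of a cache-hierarchy MCA error code.
CACHE_LEVEL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}
TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [name
             for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code, bits 15:0
    result = {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": code,
        # Cache-hierarchy codes match 000F 0001 RRRR TTLL (bit 12 is F).
        "is_cache_error": (code & 0xEF00) == 0x0100,
    }
    if result["is_cache_error"]:
        result["filtering"] = bool(code & (1 << 12))
        result["request"] = (code >> 4) & 0xF  # RRRR, 0 means generic
        result["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "Reserved")
        result["level"] = CACHE_LEVEL[code & 0x3]
    return result


if __name__ == "__main__":
    for key, value in decode_status(0xEE2000000040110A).items():
        print(key, value)
```

With the value above, the flags that come out are VAL, OVER, UC, MISCV, ADDRV and PCC, and the low 16 bits decode as a generic Level-2 cache error with the filtering bit set, which lines up with what mcelog printed. EN (bit 60) is not set in this value.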
Created attachment 280751 [details] boot dmesg
Boot dmesg also shows MCEs, at different offsets within this page: fef20000

It is not immediately clear what is at that address...

[ 0.218029] smpboot: CPU0: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (family: 0x6, model: 0x5e, stepping: 0x3)
[ 0.218096] mce: [Hardware Error]: Machine check events logged
[ 0.218098] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
[ 0.218101] mce: [Hardware Error]: TSC 0 ADDR fef20080 MISC 3880000086
[ 0.218105] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
[ 0.218107] mce: [Hardware Error]: Machine check events logged
[ 0.218108] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
[ 0.218110] mce: [Hardware Error]: TSC 0 ADDR fef200c0 MISC 3880000086
[ 0.218114] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
[ 0.218116] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 8: ee0000000040110a
[ 0.218118] mce: [Hardware Error]: TSC 0 ADDR fef20000 MISC 3880000086
[ 0.218121] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
[ 0.218124] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ee0000000040110a
[ 0.218126] mce: [Hardware Error]: TSC 0 ADDR fef20040 MISC 3880000086
[ 0.218129] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
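
To make the "different offsets on this page" point concrete, here is a small Python sketch that splits the four reported ADDR values into page base and offset, and also compares the boot-time status word with the one seen after suspend. The 4 KiB page size and 64-byte cache line size are assumptions on my part, and the names split_addr and ADDRS are only for illustration.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
CACHE_LINE = 64    # assumption: 64-byte cache lines

# ADDR values reported per bank, copied from the dmesg lines above.
ADDRS = {6: 0xFEF20080, 7: 0xFEF200C0, 8: 0xFEF20000, 9: 0xFEF20040}

# Status words: boot dmesg vs. mcelog after the S3 run.
STATUS_BOOT = 0xEE0000000040110A
STATUS_SUSPEND = 0xEE2000000040110A


def split_addr(addr):
    """Return (page base, offset within page) for a physical address."""
    return addr & ~(PAGE_SIZE - 1), addr & (PAGE_SIZE - 1)


if __name__ == "__main__":
    for bank, addr in sorted(ADDRS.items()):
        base, offset = split_addr(addr)
        print(f"bank {bank}: page {base:#x} offset {offset:#x} "
              f"cache line {offset // CACHE_LINE}")

    # XOR leaves only the bits that differ between the two status words.
    diff = STATUS_BOOT ^ STATUS_SUSPEND
    print(f"status bits that differ: {diff:#018x} "
          f"(bit {diff.bit_length() - 1})")
```

Under those assumptions all four addresses fall in the page at fef20000, on four consecutive cache lines (offsets 0x00, 0x40, 0x80, 0xc0). The boot and suspend status words differ only in bit 53; I am not drawing any conclusion from that here.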
Created attachment 282409 [details] issue.def