Bug 202409

Summary: Dell Precision 3520 - MCE hardware errors - CACHE Level-2 Generic Error
Product: Platform Specific/Hardware
Component: x86-64
Status: ASSIGNED
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.0.0-rc2
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Todd Brandt (todd.e.brandt)
Assignee: Mario Limonciello (superm1)
CC: lenb, rui.zhang
Bug Depends on:
Bug Blocks: 178231
Attachments: Sleepgraph timeline, boot dmesg, issue.def

Description Todd Brandt 2019-01-25 01:02:53 UTC
Created attachment 280749: Sleepgraph timeline

We run around 3000 iterations of S3 suspend in our weekly stress tests, and they exposed an issue on the Dell Precision 3520: MCE hardware errors are logged in about 33% of runs (roughly one run in three). This is the mcelog output from a single test run:

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 3880000086 ADDR fef20080
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 3880000086 ADDR fef200c0
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94
Hardware event. This is not a software error.
MCE 2
CPU 0 BANK 8
MISC 3880000086 ADDR fef20000
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94
Hardware event. This is not a software error.
MCE 3
CPU 0 BANK 9
MISC 3880000086 ADDR fef20040
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94

The sleepgraph timeline for this run is attached; the MCE data above also appears in its log.
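
For anyone cross-checking the mcelog decode by hand, here is a minimal Python sketch that unpacks the STATUS value reported for all four banks. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 to 57, model-specific code in bits 31:16, MCA code in bits 15:0) and the cache-hierarchy compound code format 000F 0001 RRRR TTLL. The function and table names are made up for illustration and are not part of mcelog.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Sub-fields of the cache-hierarchy compound error code.
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}
REQUEST = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "eviction", 8: "snoop",
}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_code": code,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    # Cache-hierarchy errors match 000F 0001 RRRR TTLL; F (bit 12) is ignored
    # by the mask and reported separately as the filtering bit.
    if code & 0xEF00 == 0x0100:
        result["filtering"] = bool(code & 0x1000)
        result["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
        result["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "reserved")
        result["level"] = LEVEL[code & 0x3]
    return result


if __name__ == "__main__":
    # STATUS value copied from the mcelog output above.
    for key, value in decode_status(0xEE2000000040110A).items():
        print(key, value)
```

The decoded fields should line up with the mcelog text: the overflow, uncorrected, MISC valid, ADDR valid and processor-context-corrupt flags are set, EN is clear, and the low 16 bits (110a) describe a generic level-2 cache error with the filtering bit set.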
Comment 1 Todd Brandt 2019-01-25 01:05:48 UTC
Created attachment 280751: boot dmesg
Comment 2 Len Brown 2019-01-25 01:10:50 UTC
The boot dmesg also shows MCEs, at different offsets within the same page: fef20000

Not immediately clear what is at that address...

[    0.218029] smpboot: CPU0: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (family: 0x6, model: 0x5e, stepping: 0x3)
[    0.218096] mce: [Hardware Error]: Machine check events logged
[    0.218098] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
[    0.218101] mce: [Hardware Error]: TSC 0 ADDR fef20080 MISC 3880000086 
[    0.218105] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
[    0.218107] mce: [Hardware Error]: Machine check events logged
[    0.218108] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
[    0.218110] mce: [Hardware Error]: TSC 0 ADDR fef200c0 MISC 3880000086 
[    0.218114] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
[    0.218116] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 8: ee0000000040110a
[    0.218118] mce: [Hardware Error]: TSC 0 ADDR fef20000 MISC 3880000086 
[    0.218121] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
[    0.218124] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ee0000000040110a
[    0.218126] mce: [Hardware Error]: TSC 0 ADDR fef20040 MISC 3880000086 
[    0.218129] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
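
To make the "different offsets on this page" observation concrete, here is a small Python sketch that splits each reported ADDR into a page base and an in-page offset. The 4 KiB page size and 64-byte cache-line size are assumptions for illustration, and the helper name is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages
LINE_SIZE = 64    # assumption: 64-byte cache lines

# (bank, ADDR) pairs copied from the boot dmesg above.
REPORTS = [
    (6, 0xFEF20080),
    (7, 0xFEF200C0),
    (8, 0xFEF20000),
    (9, 0xFEF20040),
]


def split_address(addr):
    """Return (page base, offset within the page) for a physical address."""
    return addr & ~(PAGE_SIZE - 1), addr & (PAGE_SIZE - 1)


if __name__ == "__main__":
    for bank, addr in REPORTS:
        page, offset = split_address(addr)
        print(f"bank {bank}: page {page:#x}, offset {offset:#x}, "
              f"line {offset // LINE_SIZE}")
```

Under those assumptions, all four banks report the same page (fef20000), and the offsets 0x0, 0x40, 0x80 and 0xc0 are the first four consecutive 64-byte lines of that page.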
Comment 3 Todd Brandt 2019-04-19 11:52:10 UTC
Created attachment 282409: issue.def