Bug 202409 - Dell Precision 3520 - MCE hardware errors - CACHE Level-2 Generic Error
Status: ASSIGNED
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assignee: Mario Limonciello
URL:
Keywords:
Depends on:
Blocks: 178231
Reported: 2019-01-25 01:02 UTC by Todd Brandt
Modified: 2019-04-19 11:52 UTC
CC List: 2 users

See Also:
Kernel Version: 5.0.0-rc2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Sleepgraph timeline (421.95 KB, text/html)
2019-01-25 01:02 UTC, Todd Brandt
boot dmesg (87.75 KB, text/plain)
2019-01-25 01:05 UTC, Todd Brandt
issue.def (472 bytes, text/plain)
2019-04-19 11:52 UTC, Todd Brandt

Description Todd Brandt 2019-01-25 01:02:53 UTC
Created attachment 280749
Sleepgraph timeline

We run around 3000 iterations of S3 suspend in our weekly stress tests, and we discovered an issue on the Dell Precision 3520. We consistently receive MCE hardware errors in about one out of every three runs (roughly 33%). This is the mcelog output from a single test run:

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 3880000086 ADDR fef20080
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 3880000086 ADDR fef200c0
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94
Hardware event. This is not a software error.
MCE 2
CPU 0 BANK 8
MISC 3880000086 ADDR fef20000
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94
Hardware event. This is not a software error.
MCE 3
CPU 0 BANK 9
MISC 3880000086 ADDR fef20040
TIME 1548375775 Thu Jan 24 16:22:55 2019
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee2000000040110a MCGSTATUS 0
MCGCAP c0a APICID 0 SOCKETID 0
MICROCODE c6
CPUID Vendor Intel Family 6 Model 94
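
As a rough cross-check of the mcelog decode above, the STATUS value can be unpacked by hand. This is only a sketch: the bit positions and the cache-hierarchy compound code layout (000F 0001 RRRR TTLL) are taken from the architectural IA32_MCi_STATUS description in the Intel SDM, and the function and table names are made up for illustration.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions follow the Intel SDM layout; all names here are illustrative.
STATUS_FLAGS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

CACHE_LEVELS = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}


def decode_status(status):
    """Return the set flags and the MCA error code fields of a status word."""
    flags = [name
             for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    mca = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_code": mca,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    # Cache hierarchy compound code: 000F 0001 RRRR TTLL
    # (mask out the F bit, bit 12, before matching the pattern).
    if mca & 0xEF00 == 0x0100:
        result["filtering"] = bool(mca & 0x1000)   # F: corrected filtering
        result["request"] = (mca >> 4) & 0xF       # RRRR: 0 is a generic error
        result["transaction"] = (mca >> 2) & 0x3   # TT: 2 is generic
        result["level"] = CACHE_LEVELS[mca & 0x3]  # LL: cache level
    return result


if __name__ == "__main__":
    for key, value in decode_status(0xEE2000000040110A).items():
        print(key, value)
```

For ee2000000040110a this gives VAL, OVER, UC, MISCV, ADDRV and PCC set with EN clear, and an MCA code of 0x110a (filtering bit set, generic request, generic transaction, Level-2), which lines up with the mcelog text for all four banks.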

The sleepgraph timeline for this run is attached (this info is in the log as well).
Comment 1 Todd Brandt 2019-01-25 01:05:48 UTC
Created attachment 280751
boot dmesg
Comment 2 Len Brown 2019-01-25 01:10:50 UTC
The boot dmesg also shows MCEs, at different offsets within the same page: fef20000

Not immediately clear what is at that address...
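
To make the "same page, different offsets" observation concrete, here is a small sketch that splits the four reported ADDR values into a page base and an offset. The 4 KiB page size and the 64-byte line size are assumptions for this platform, and the helper name is hypothetical.

```python
# Sketch: split the reported MCi_ADDR values into page base and offset.
# PAGE_SIZE and LINE_SIZE are assumptions (4 KiB pages, 64-byte cache lines).
PAGE_SIZE = 4096
LINE_SIZE = 64

# Bank number -> ADDR value, as reported in the logs in this bug.
ADDRS = {6: 0xFEF20080, 7: 0xFEF200C0, 8: 0xFEF20000, 9: 0xFEF20040}


def split_addr(addr):
    """Return (page_base, offset_in_page) for a physical address."""
    return addr & ~(PAGE_SIZE - 1), addr & (PAGE_SIZE - 1)


if __name__ == "__main__":
    for bank, addr in sorted(ADDRS.items()):
        page, offset = split_addr(addr)
        print(f"bank {bank}: page {page:#x} offset {offset:#05x} "
              f"line-aligned={offset % LINE_SIZE == 0}")
```

All four addresses fall in the page at fef20000, at offsets 0x00, 0x40, 0x80 and 0xc0, which under the 64-byte assumption would be the first four consecutive lines of that page. This does not say what is mapped there.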

[    0.218029] smpboot: CPU0: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (family: 0x6, model: 0x5e, stepping: 0x3)
[    0.218096] mce: [Hardware Error]: Machine check events logged
[    0.218098] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
[    0.218101] mce: [Hardware Error]: TSC 0 ADDR fef20080 MISC 3880000086 
[    0.218105] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
[    0.218107] mce: [Hardware Error]: Machine check events logged
[    0.218108] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
[    0.218110] mce: [Hardware Error]: TSC 0 ADDR fef200c0 MISC 3880000086 
[    0.218114] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
[    0.218116] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 8: ee0000000040110a
[    0.218118] mce: [Hardware Error]: TSC 0 ADDR fef20000 MISC 3880000086 
[    0.218121] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
[    0.218124] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ee0000000040110a
[    0.218126] mce: [Hardware Error]: TSC 0 ADDR fef20040 MISC 3880000086 
[    0.218129] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1548373451 SOCKET 0 APIC 0 microcode c6
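
For comparing the boot-time records with the mcelog run from the description, a sketch along these lines pulls bank and status out of the dmesg text. The regular expression is written against the exact line format shown above and is not a general parser; the names are hypothetical.

```python
import re

# Sketch: extract (bank, status) pairs from "Machine Check" dmesg lines and
# compare the status with the value mcelog reported after suspend.
DMESG = """\
[    0.218098] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
[    0.218108] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
[    0.218116] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 8: ee0000000040110a
[    0.218124] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 9: ee0000000040110a
"""

MCE_LINE = re.compile(r"CPU (\d+): Machine Check: (\d+) Bank (\d+): ([0-9a-f]+)")

# STATUS value from the mcelog output in the description.
MCELOG_STATUS = 0xEE2000000040110A


def parse_banks(text):
    """Return {bank: status} for every Machine Check line in the text."""
    return {int(m.group(3)): int(m.group(4), 16) for m in MCE_LINE.finditer(text)}


def differing_bits(a, b):
    """Return the bit positions where two status words differ."""
    diff = a ^ b
    return [bit for bit in range(64) if (diff >> bit) & 1]


if __name__ == "__main__":
    for bank, status in sorted(parse_banks(DMESG).items()):
        print(f"bank {bank}: {status:#018x} "
              f"differs from mcelog at bits {differing_bits(status, MCELOG_STATUS)}")
```

On the data in this bug, the boot-time status (ee0000000040110a) and the mcelog status (ee2000000040110a) differ only in bit 53; banks, addresses and MISC values are otherwise the same in both logs.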
Comment 3 Todd Brandt 2019-04-19 11:52:10 UTC
Created attachment 282409
issue.def
