* Newly built system running AMD Ryzen 9 3900X 12 Core CPU.
* System boots and runs normally.
* System has no issues with repeated stress testing.
* System runs multiple days with no issues.
* Kernel is reporting frequent MCE Recoverable error on CPU0/CPU12 L1 cache:

May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.
May 26 19:33:27 kernel: [Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
May 26 19:33:27 kernel: [Hardware Error]: Error Addr: 0x00000005b0166ae0
May 26 19:33:27 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
May 26 19:33:27 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
May 26 19:33:27 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 26 19:33:27 kernel: mce: [Hardware Error]: Machine check events logged
May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.
May 26 19:33:27 kernel: [Hardware Error]: CPU:12 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
May 26 19:33:27 kernel: [Hardware Error]: Error Addr: 0x00000005b0159ae0
May 26 19:33:27 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
May 26 19:33:27 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
May 26 19:33:27 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

I'm trying to determine if this is an issue I need to RMA the CPU on or if this is something that can safely be ignored.
It says

May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.

which means it was likely a single bit flip in the instruction cache arrays and the error was corrected by the hardware.

The question is how "frequent" it is, and whether the bit flips are caused by non-optimal conditions the CPU operates in (too high a temperature, insufficient power supply, etc.) or whether the IC arrays are really faulty.

In any case, the error is benign. Can it get worse? Sure. Can it stay at that frequency? Also possible.

Whether to replace is a decision you have to weigh yourself, deciding what is more important to you. For example, if this were my test box, I wouldn't do anything. If this were my workstation, I wouldn't do anything either, because I'm not worried. But I'm not worried because I have backups of the stuff I care about. And a box may fail for a gazillion reasons and at any time...

Anyway, this should give you some ideas only and it is in no way a "this is what you should do" suggestion.

HTH.
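For anyone who wants to check the "corrected" reading against the raw register value, here is a minimal Python sketch that decodes the MC1_STATUS word from the log. The bit positions are the ones I recall from the kernel's arch/x86/include/asm/mce.h (MCI_STATUS_* definitions), and the 6-bit extended error code at bits 21:16 is an assumption for family 17h; please verify both against your kernel tree before relying on them. The function name `decode_mca_status` is just an illustrative helper, not a kernel or library API.

```python
# Assumed bit positions, recalled from the Linux kernel's
# arch/x86/include/asm/mce.h (MCI_STATUS_*); verify against your tree.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43,
}


def decode_mca_status(status: int) -> dict:
    """Split an MCA status word into named flags and error-code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    # Assumption: extended error code lives in bits 21:16 on family 17h.
    decoded["ext_error_code"] = (status >> 16) & 0x3F
    # Low 16 bits hold the architectural error code.
    decoded["error_code"] = status & 0xFFFF
    return decoded


if __name__ == "__main__":
    # Status value copied from the log in this report.
    decoded = decode_mca_status(0xDC20000000030151)
    for name, value in decoded.items():
        print(f"{name}: {value}")
```

With the value from the log, UC and PCC come out clear while Over, MiscV, AddrV and SyndV are set, which lines up with the `Over|CE|MiscV|AddrV|-|-|SyndV` string and "Ext. Error Code: 3" that the kernel printed.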
Created attachment 289343
sensor reading of cpu temp every 5 seconds during a 30 minute stress test

Here is a text file where I dumped the CPU temperature every 5 seconds during a 30 minute stress test. From what I can see, the CPU temperature never exceeded the safe specs for this particular CPU or board.
Created attachment 289345
journalctl dump of mce/kernel log during 30 minute stress test.

This is the accompanying kernel log dump from the same 30 minute stress test as the previous attachment, to provide error frequency data for a full stress test run.
The stress test utility used is called GTKStressTesting. https://gitlab.com/leinardi/gst Version: gst-0.7.2-1.fc32.noarch
Do you see fewer MCEs when you disable boosting by doing, as root:

    echo 0 > /sys/devices/system/cpu/cpufreq/boost

and then running the stress test for another 30 mins?

Also, you can try experimenting with lowering the core frequency with the cpupower tool (as root):

    # cpupower frequency-info

will give you the P-states on the machine, and then you can set a lower frequency (P1 or P2) by doing

    # cpupower frequency-set -f <the number after Pstate-P1: or Pstate-P2: from the above command>MHz

HTH.

Thx.
Created attachment 289395
script session of setting cpu frequency to lower setting and logging kernel during 30 minute stress test

* Set the CPU frequency down to the P1 state (2800 MHz)
* Ran 30 minute stress test
* Log of kernel MCE errors while the stress test is running
Created attachment 289397
script session of setting cpu frequency to lower setting, boost off, and logging kernel during 30 minute stress test

* Boost set to 0
* CPU frequency set to P2 state
* 30 minutes stress testing
* Kernel log run during the 30 minute stress test.
(In reply to Borislav Petkov from comment #5)
> Do you see less MCEs when you disable boosting by doing as root:
>
> echo 0 > /sys/devices/system/cpu/cpufreq/boost
>
> and then running the stress test for another 30 mins?
>
> Also, you can try experimenting with lowering the cores frequency with the
> cpupower tool (as root):
>
> # cpupower frequency-info
>
> will give you the P-states on the machine and then you can set to a lower
> frequency (P1 or P2) by doing
>
> # cpupower frequency-set -f <the number after Pstate-P1: or Pstate-P2: from
> the above command>MHz
>
> HTH.
>
> Thx.

Ran a pair of 30 minute stress tests, one with just the CPU frequency set lower, and another with both boost off and the lower CPU frequency set. Grepping out the unique MCE entries, both tests generated 7 MCE entries in a 30 minute stress test.
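For reference, the "grep out the unique entries" step can be sketched in Python roughly as follows. This is not the exact command I ran; it assumes the log lines look like the MC1_STATUS lines quoted in the original report, and it treats lines with the same timestamp, CPU, bank and status value as one event, so two genuine events on the same CPU within the same second would be collapsed. `EVENT_RE` and `count_mce_events` are hypothetical names made up for this sketch.

```python
import re
from collections import Counter

# Matches the per-event status line, assuming the format seen in this report:
# "May 26 19:33:27 kernel: [Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[...]: 0x..."
EVENT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8}) .*\[Hardware Error\]: "
    r"CPU:(?P<cpu>\d+) .*MC(?P<bank>\d+)_STATUS\[[^\]]*\]: "
    r"(?P<status>0x[0-9a-fA-F]+)"
)


def count_mce_events(lines):
    """Count unique MCE status lines per CPU, dropping exact repeats."""
    seen = set()
    per_cpu = Counter()
    for line in lines:
        match = EVENT_RE.match(line)
        if match is None:
            continue  # not a status line (e.g. "Error Addr", "IPID")
        key = (match["stamp"], match["cpu"], match["bank"], match["status"])
        if key in seen:
            continue  # same event logged twice
        seen.add(key)
        per_cpu[int(match["cpu"])] += 1
    return per_cpu


if __name__ == "__main__":
    sample = [
        "May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.",
        "May 26 19:33:27 kernel: [Hardware Error]: CPU:0 (17:71:0) "
        "MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151",
        "May 26 19:33:27 kernel: [Hardware Error]: Error Addr: 0x00000005b0166ae0",
        "May 26 19:33:27 kernel: [Hardware Error]: CPU:12 (17:71:0) "
        "MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151",
    ]
    counts = count_mce_events(sample)
    print(dict(counts), "total:", sum(counts.values()))
```

Counting per CPU rather than just a total also makes it easy to see whether the errors stay confined to CPU0/CPU12 or start showing up elsewhere.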
Ok, thanks, so P-states don't affect those. In that case, you could consider what I said in comment #1 to decide whether to replace or not. HTH.
(In reply to Borislav Petkov from comment #9)
> Ok, thanks, so P-states don't affect those.
>
> In that case, you could consider what I said in comment #1 to decide whether
> to replace or not.
>
> HTH.

I'm waiting on a replacement power supply, as it's the only item that I haven't upgraded in this new workstation build. Once I replace the power supply, I'll run additional tests and see if it helps with the issue.
Searching through the bugzilla, I've found a similar issue reported:

https://bugzilla.kernel.org/show_bug.cgi?id=202005

I don't see a resolution to that bug yet, though.
That got fixed by 71a84402b93e ("x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models") and your error is different.
I've replaced the power supply with a Seasonic PX-1000. I re-ran the same stress test as before, tailing the log.

With the old power supply, I had 9 MCE events in a 30 minute stress test run.
With the new power supply, I had 6 MCE events in a 30 minute stress test run.

So better, but still being generated. I'm now on 5.6.15-300.fc32.x86_64. System remains rock-solid, with no crashes and no noticeable issues apart from the log entries.
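As a rough sanity check on whether 9 versus 6 is a real improvement, here is a small Python sketch. It assumes the events arrive independently at a constant rate (a Poisson-style model) and that both runs were the same length; under those assumptions, if the power supply made no difference, each of the 15 events is equally likely to land in either run, so the new-run count follows Binomial(15, 0.5). `one_sided_p` is a hypothetical helper name for this illustration only.

```python
from math import comb


def one_sided_p(old_count: int, new_count: int) -> float:
    """P(new run sees <= new_count events | equal rates, same run length).

    Conditional binomial test: given old_count + new_count events in total,
    each lands in the new run with probability 0.5 under the null hypothesis.
    """
    total = old_count + new_count
    favourable = sum(comb(total, k) for k in range(new_count + 1))
    return favourable / 2 ** total


if __name__ == "__main__":
    # Counts from the two 30 minute runs described above.
    p = one_sided_p(old_count=9, new_count=6)
    print(f"one-sided p-value: {p:.3f}")  # prints 0.304
```

A value around 0.3 suggests that a 9 to 6 drop over single 30 minute runs is well within what chance alone would produce, so longer or repeated runs would be needed before crediting the new power supply.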