Bug 207907

Summary: Correctable MCE errors logged for CPU0/CPU12 L1 instruction cache with AMD Ryzen 9 3900X 12-Core Processor
Product: Platform Specific/Hardware
Reporter: Phil Hale (phaleintx)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: NEW
Severity: normal
CC: bp
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.6.15
Regression: No
Attachments:
* sensor reading of cpu temp every 5 seconds during a 30 minute stress test
* journalctl dump of mce/kernel log during 30 minute stress test.
* script session of setting cpu frequency to lower setting and logging kernel during 30 minute stress test
* script session of setting cpu frequency to lower setting, boost off, and logging kernel during 30 minute stress test

Description Phil Hale 2020-05-27 00:39:50 UTC
* Newly built system running AMD Ryzen 9 3900X 12 Core CPU.
* System boots and runs normally. 
* System has no issues with repeated stress testing.
* System runs multiple days with no issues.
* The kernel is reporting frequent corrected MCE errors on the CPU0/CPU12 L1 instruction cache:
May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.
May 26 19:33:27 kernel: [Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
May 26 19:33:27 kernel: [Hardware Error]: Error Addr: 0x00000005b0166ae0
May 26 19:33:27 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
May 26 19:33:27 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
May 26 19:33:27 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 26 19:33:27 kernel: mce: [Hardware Error]: Machine check events logged
May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.
May 26 19:33:27 kernel: [Hardware Error]: CPU:12 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
May 26 19:33:27 kernel: [Hardware Error]: Error Addr: 0x00000005b0159ae0
May 26 19:33:27 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
May 26 19:33:27 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
May 26 19:33:27 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
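
For reference, the bracketed flags and the last two decoded lines above can be reproduced from the raw MC1_STATUS value. What follows is a minimal sketch, not the kernel's decoder. It assumes the standard x86 MCA status bit positions and the AMD scalable-MCA layout (Deferred in bit 44, SyndV in bit 53, extended error code in bits 21:16). The name decode_status is made up for illustration.

```python
# Sketch only: decode a few fields of an AMD MCA status value.
# Bit positions are assumed from the x86 MCA / AMD scalable-MCA layout.

FLAG_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "SyndV": 53, "Deferred": 44,
}
LL = ["RESV", "L1", "L2", "L3/GEN"]          # cache level
TT = ["INSN", "DATA", "GEN", "RESV"]         # transaction type
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Return a dict with the flags and error-code fields of `status`."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    err_code = status & 0xFFFF
    out = {
        "flags": flags,
        # Corrected: logged as valid, neither uncorrected nor deferred.
        "corrected": flags["Val"] and not flags["UC"] and not flags["Deferred"],
        "ext_error_code": (status >> 16) & 0x3F,
    }
    # Memory-hierarchy error code format: 0000 0001 RRRR TTLL
    if (err_code & 0xFF00) == 0x0100:
        rrrr = (err_code >> 4) & 0xF
        out["cache_level"] = LL[err_code & 0x3]
        out["tx"] = TT[(err_code >> 2) & 0x3]
        out["mem_tx"] = RRRR[rrrr] if rrrr < len(RRRR) else "-"
    return out


info = decode_status(0xdc20000000030151)
print(info["corrected"], info["ext_error_code"],
      info["cache_level"], info["tx"], info["mem_tx"])
```

For the value logged here, this prints True 3 L1 INSN IRD, which agrees with the "Ext. Error Code: 3" and "cache level: L1, tx: INSN, mem-tx: IRD" lines in the log.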

I'm trying to determine whether I need to RMA the CPU over this or whether it can safely be ignored.
Comment 1 Borislav Petkov 2020-05-27 09:58:57 UTC
It says

May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.

which means it was likely a single bit flip in the instruction cache arrays and the error was corrected by the hardware.

The question is how "frequent" this is, and whether the bit flips are caused by non-optimal operating conditions for the CPU (too high a temperature, an insufficient power supply, etc.) or whether the IC arrays are really faulty.

In any case, the error is benign. Can it get worse? Sure. Can it stay at that frequency? Also possible.

Whether to replace is a decision you have to weigh yourself, based on what is more important to you. For example, if this were my test box, I wouldn't do anything. If this were my workstation I wouldn't do anything either because I'm not worried. But I'm not worried because I have backups of the stuff I care about. And a box may fail for a gazillion reasons and at any time...

Anyway, this should give you some ideas only and it is in no way a "this is what you should do" suggestion.

HTH.
Comment 2 Phil Hale 2020-05-27 14:28:35 UTC
Created attachment 289343 [details]
sensor reading of cpu temp every 5 seconds during a 30 minute stress test

Here is a text file where I dumped the CPU temperature every 5 seconds during a 30 minute stress test.  From what I can see, the CPU temperature never exceeded the safe specs for this particular CPU or board.
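
A reading like this can also be taken straight from the kernel's hwmon sysfs files, which report temperatures in millidegrees Celsius. The sketch below is only an illustration and is not how the attached file was produced. The function names are made up, and the hwmon path has to be looked up per machine (the hwmonN index varies, and on AMD CPUs the relevant driver is usually k10temp).

```python
import time
from pathlib import Path


def read_temp_c(path):
    """Read one hwmon temp*_input file (millidegrees Celsius) as degrees C."""
    return int(Path(path).read_text().strip()) / 1000.0


def sample_temps(path, interval_s=5, count=360, sleep=time.sleep):
    """Take `count` readings `interval_s` seconds apart (360 x 5 s = 30 min)."""
    readings = []
    for i in range(count):
        readings.append(read_temp_c(path))
        if i + 1 < count:
            sleep(interval_s)  # no sleep after the final reading
    return readings


def summarize(readings):
    """Return min, max and mean of a non-empty list of readings."""
    return {
        "min": min(readings),
        "max": max(readings),
        "mean": sum(readings) / len(readings),
    }
```

Usage would look like summarize(sample_temps("/sys/class/hwmon/hwmon0/temp1_input")), where that path is a placeholder rather than a known location on this system.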
Comment 3 Phil Hale 2020-05-27 14:32:18 UTC
Created attachment 289345 [details]
journalctl dump of mce/kernel log during 30 minute stress test.

This is the accompanying kernel log dump from the same 30 minute stress test as the previous attachment, to show how frequently the errors occur during a full stress test run.
Comment 4 Phil Hale 2020-05-27 19:15:43 UTC
The stress test utility used is called GTKStressTesting.
https://gitlab.com/leinardi/gst
Version: gst-0.7.2-1.fc32.noarch
Comment 5 Borislav Petkov 2020-05-27 21:03:36 UTC
Do you see fewer MCEs when you disable boosting by doing as root:

echo 0 > /sys/devices/system/cpu/cpufreq/boost

and then running the stress test for another 30 mins?

Also, you can try experimenting with lowering the core frequency with the cpupower tool (as root):

# cpupower frequency-info

will give you the P-states on the machine and then you can set to a lower frequency (P1 or P2) by doing

# cpupower frequency-set -f <the number after Pstate-P1: or Pstate-P2: from the above command>MHz
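
When comparing runs it helps to record which boost setting was active for each one. A minimal sketch for reading the switch back, assuming the same sysfs path as in the echo command above; the helper name boost_enabled is made up for illustration:

```python
from pathlib import Path

# Same sysfs file that the echo command above writes to.
BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


def boost_enabled(path=BOOST_PATH):
    """Return True/False for the boost switch, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        # File absent or unreadable, e.g. the cpufreq driver does not expose it.
        return None


print("boost enabled:", boost_enabled())
```

Reading the file does not need root; only writing to it does.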

HTH.

Thx.
Comment 6 Phil Hale 2020-05-29 00:26:10 UTC
Created attachment 289395 [details]
script session of setting cpu frequency to lower setting and logging kernel during 30 minute stress test

* Set the CPU frequency down to the P1 state (2800 MHz)
* Ran a 30 minute stress test
* Logged kernel MCE errors while the stress test was running
Comment 7 Phil Hale 2020-05-29 01:01:44 UTC
Created attachment 289397 [details]
script session of setting cpu frequency to lower setting, boost off, and logging kernel during 30 minute stress test

* Boost set to 0
* CPU frequency set to the P2 state
* 30 minutes of stress testing
* Kernel log captured during the 30 minute stress test
Comment 8 Phil Hale 2020-05-29 01:06:57 UTC
(In reply to Borislav Petkov from comment #5)
> Do you see less MCEs when you disable boosting by doing as root:
> 
> echo 0 > /sys/devices/system/cpu/cpufreq/boost
> 
> and then running the stress test for another 30 mins?
> 
> Also, you can try experimenting with lowering the cores frequency with the
> cpupower tool (as root):
> 
> # cpupower frequency-info
> 
> will give you the P-states on the machine and then you can set to a lower
> frequency (P1 or P2) by doing
> 
> # cpupower frequency-set -f <the number after Pstate-P1: or Pstate-P2: from
> the above command>MHz
> 
> HTH.
> 
> Thx.

Ran a pair of 30 minute stress tests, one with only the CPU frequency lowered, and another with both boost off and the CPU frequency lowered.

Grepping out the unique MCE entries, each test generated 7 MCE entries over its 30 minute run.
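
A count like this can be scripted so that it is repeatable between runs. The sketch below tallies events per CPU and bank by matching the "CPU:N ... MCn_STATUS[...]" line that appears once per logged event in the output quoted in the description. The regular expression is written against those lines only, and the exact grep used for the numbers above is not known, so treat the function count_mce_events as a hypothetical helper.

```python
import re
from collections import Counter

# One such line is printed per decoded event, e.g.
# "[Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[Over|CE|...]: 0xdc20..."
EVENT_RE = re.compile(
    r"\[Hardware Error\]: CPU:(\d+) \(.*?\) MC(\d+)_STATUS\[.*?\]: (0x[0-9a-fA-F]+)"
)


def count_mce_events(lines):
    """Count decoded MCE events per (cpu, bank) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            counts[(int(m.group(1)), int(m.group(2)))] += 1
    return counts


sample = [
    "May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.",
    "May 26 19:33:27 kernel: [Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151",
    "May 26 19:33:27 kernel: [Hardware Error]: CPU:12 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151",
]
print(dict(count_mce_events(sample)))
```

Feeding it a saved journalctl dump, for example count_mce_events(open("mce.log")), would give per-CPU totals for a whole run; the file name there is a placeholder.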
Comment 9 Borislav Petkov 2020-05-29 12:22:45 UTC
Ok, thanks, so P-states don't affect those.

In that case, you could consider what I said in comment #1 to decide whether to replace or not.

HTH.
Comment 10 Phil Hale 2020-05-29 13:50:43 UTC
(In reply to Borislav Petkov from comment #9)
> Ok, thanks, so P-states don't affect those.
> 
> In that case, you could consider what I said in comment #1 to decide whether
> to replace or not.
> 
> HTH.

I'm waiting on a replacement power supply, as it's the only item that I haven't upgraded in this new workstation build.  Once I replace the power supply, I'll run additional tests and see if it helps with the issue.
Comment 11 Phil Hale 2020-05-29 13:55:32 UTC
Searching through Bugzilla, I've found a similar issue reported:

https://bugzilla.kernel.org/show_bug.cgi?id=202005

I don't see a resolution to that bug yet, though.
Comment 12 Borislav Petkov 2020-05-29 16:39:36 UTC
That got fixed by

71a84402b93e ("x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models")

and your error is different.
Comment 13 Phil Hale 2020-06-06 00:29:38 UTC
I've replaced the power supply with a Seasonic PX-1000.  I re-ran the same stress test as before, tailing the log.

With the old power supply, I had 9 MCE events in a 30 minute stress test run.

With the new power supply, I had 6 MCE events in a 30 minute stress test run.

So it's better, but the errors are still being generated.  I'm now on 5.6.15-300.fc32.x86_64.

The system remains rock-solid, with no crashes and no noticeable issues other than the log entries.