Bug 207907
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | Correctable MCE errors logged for CPU0/CPU12 L1 instruction cache with AMD Ryzen 9 3900X 12-Core Processor | | |
| Product | Platform Specific/Hardware | Reporter | Phil Hale (phaleintx) |
| Component | x86-64 | Assignee | platform_x86_64 (platform_x86_64) |
| Status | NEW --- | | |
| Severity | normal | CC | bp |
| Priority | P1 | | |
| Hardware | x86-64 | | |
| OS | Linux | | |
| Kernel Version | 5.6.15 | Subsystem | |
| Regression | No | Bisected commit-id | |

Attachments:
* sensor reading of cpu temp every 5 seconds during a 30 minute stress test
* journalctl dump of mce/kernel log during 30 minute stress test.
* script session of setting cpu frequency to lower setting and logging kernel during 30 minute stress test
* script session of setting cpu frequency to lower setting, boost off, and logging kernel during 30 minute stress test
Description
Phil Hale
2020-05-27 00:39:50 UTC
It says

May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.

which means it was likely a single bit flip in the instruction cache arrays and the error was corrected by the hardware.

The question is how "frequent" it is and whether the bit flips are caused by non-optimal conditions the CPU operates in (too high a temperature, insufficient power supply, etc.) or whether the IC arrays are really faulty.

In any case, the error is benign. Can it get worse? Sure. Can it stay at that frequency? Also possible.

Whether to replace is a decision you have to weigh yourself and decide what is more important to you. For example, if this were my test box, I wouldn't do anything. If this were my workstation, I wouldn't do anything either, because I'm not worried. But I'm not worried because I have backups of the stuff I care about. And a box may fail for a gazillion reasons and at any time...

Anyway, this should give you some ideas only and it is in no way a "this is what you should do" suggestion.

HTH.

Created attachment 289343 [details]
sensor reading of cpu temp every 5 seconds during a 30 minute stress test
Here is a text file where I dumped the CPU temperature every 5 seconds during a 30 minute stress test. From what I can see, the CPU temperature never exceeded the safe specs for this particular CPU or board.
Created attachment 289345 [details]
journalctl dump of mce/kernel log during 30 minute stress test.
This is the accompanying dump of the kernel log from the same 30 minute stress test as in the previous attachment, to provide frequency data for a full stress test run.
The stress test utility used is called GTKStressTesting.
https://gitlab.com/leinardi/gst
Version: gst-0.7.2-1.fc32.noarch

Do you see less MCEs when you disable boosting by doing as root:

echo 0 > /sys/devices/system/cpu/cpufreq/boost

and then running the stress test for another 30 mins?

Also, you can try experimenting with lowering the cores frequency with the cpupower tool (as root):

# cpupower frequency-info

will give you the P-states on the machine and then you can set to a lower frequency (P1 or P2) by doing

# cpupower frequency-set -f <the number after Pstate-P1: or Pstate-P2: from the above command>MHz

HTH.

Thx.

Created attachment 289395 [details]
script session of setting cpu frequency to lower setting and logging kernel during 30 minute stress test
* Set the CPU frequency down to the P1 state (2800 MHz)
* Ran 30 minute stress test
* Log of kernel mce errors while stress test is running
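
The P-state step above can be scripted. The following is a minimal Python sketch, not part of the original report. It assumes that `cpupower frequency-info` prints lines of the form `Pstate-P1:  2800MHz`; the sample text is hypothetical apart from the 2800 MHz P1 value mentioned above, and the helper names `parse_pstates` and `frequency_set_command` are made up for illustration. It only builds the command string; actually running the command still requires root.

```python
import re

# Hypothetical excerpt of `cpupower frequency-info` output. Only the P1 value
# (2800MHz) comes from this report; the P0 and P2 values are placeholders.
SAMPLE_OUTPUT = """\
  Pstate-P0:  3800MHz
  Pstate-P1:  2800MHz
  Pstate-P2:  2200MHz
"""


def parse_pstates(text):
    """Return a dict mapping P-state name (e.g. 'P1') to its frequency in MHz."""
    pattern = re.compile(r"Pstate-(P\d+):\s*(\d+)\s*MHz")
    return {name: int(mhz) for name, mhz in pattern.findall(text)}


def frequency_set_command(pstates, name):
    """Build the cpupower command line that selects the given P-state."""
    return f"cpupower frequency-set -f {pstates[name]}MHz"


pstates = parse_pstates(SAMPLE_OUTPUT)
print(frequency_set_command(pstates, "P1"))  # cpupower frequency-set -f 2800MHz
```

Building the string first makes it easy to check which frequency will be requested before running anything as root.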
Created attachment 289397 [details]
script session of setting cpu frequency to lower setting, boost off, and logging kernel during 30 minute stress test
* Boost set to 0
* cpu frequency set to P2 state
* 30 minutes stress testing
* kernel log run during the 30 minute stress test.
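
The kernel logs from these runs can be reduced to a count of corrected-error entries. A minimal Python sketch follows. It assumes the journal lines look like the one quoted at the top of this report (`May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.`) and treats each distinct timestamp as one MCE event. The sample log and the function name `count_corrected_mces` are illustrative and not taken from the attachments.

```python
import re

# Matches the "Corrected error" summary line the kernel prints for an MCE.
# The timestamp layout (e.g. "May 26 19:33:27") is assumed from the single
# log line quoted in this report.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s.*\[Hardware Error\]: Corrected error"
)


def count_corrected_mces(log_text):
    """Count distinct timestamps that carry a 'Corrected error' line."""
    stamps = set()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamps.add(match.group(1))
    return len(stamps)


# Illustrative sample only; these are not real lines from the attachments.
SAMPLE_LOG = """\
May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.
May 26 19:33:27 kernel: [Hardware Error]: Corrected error, no action required.
May 26 19:41:02 kernel: [Hardware Error]: Corrected error, no action required.
May 26 19:41:02 kernel: placeholder for an unrelated message
"""

print(count_corrected_mces(SAMPLE_LOG))  # 2
```

Deduplicating by timestamp is a simplification: two separate errors logged within the same second would be counted once.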
(In reply to Borislav Petkov from comment #5)
> Do you see less MCEs when you disable boosting by doing as root:
>
> echo 0 > /sys/devices/system/cpu/cpufreq/boost
>
> and then running the stress test for another 30 mins?
>
> Also, you can try experimenting with lowering the cores frequency with the
> cpupower tool (as root):
>
> # cpupower frequency-info
>
> will give you the P-states on the machine and then you can set to a lower
> frequency (P1 or P2) by doing
>
> # cpupower frequency-set -f <the number after Pstate-P1: or Pstate-P2: from
> the above command>MHz
>
> HTH.
>
> Thx.

Ran a pair of 30 minute stress tests, one with just the CPU frequency set lower, and another with both boost off and the lower CPU frequency set. Grepping out the unique MCE error entries, both tests generated 7 MCE entries in a 30 minute stress test.

Ok, thanks, so P-states don't affect those.

In that case, you could consider what I said in comment #1 to decide whether to replace or not.

HTH.

(In reply to Borislav Petkov from comment #9)
> Ok, thanks, so P-states don't affect those.
>
> In that case, you could consider what I said in comment #1 to decide whether
> to replace or not.
>
> HTH.

I'm waiting on a replacement power supply, as it's the only item that I haven't upgraded in this new workstation build. Once I replace the power supply, I'll run additional tests and see if it helps with the issue.

Searching through the bugzilla, I've found a similar issue reported:
https://bugzilla.kernel.org/show_bug.cgi?id=202005

I don't see a resolution to that bug yet, though.

That got fixed by

71a84402b93e ("x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models")

and your error is different.

I've replaced the power supply with a Seasonic PX-1000. I re-ran the same stress test as before, tailing the log. With the old power supply, I had 9 MCE events in a 30 minute stress test run. With the new power supply, I had 6 MCE events in a 30 minute stress test run.

So better, but still being generated. I'm now on 5.6.15-300.fc32.x86_64. System remains rock-solid, with no crashes and no noticeable issues other than the log entries.
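
Whether 6 events is meaningfully better than 9 can be checked with a little arithmetic. The sketch below is an illustration, not an analysis from the thread. It assumes the events in each 30 minute run are independent, so that if the power supply made no difference, each of the 15 events would be equally likely to fall in either run. It then computes the two-sided binomial probability of a split at least as uneven as 6 versus 9. The function name `split_p_value` is hypothetical.

```python
from math import comb


def split_p_value(a, b):
    """Two-sided probability of a split at least as uneven as (a, b),
    assuming each of the a + b events lands in either run with p = 0.5."""
    n = a + b
    low = min(a, b)
    # One tail: the smaller count is `low` or fewer.
    tail = sum(comb(n, k) for k in range(low + 1)) / 2 ** n
    # Double for the symmetric tail; cap at 1 for (near-)even splits.
    return min(1.0, 2 * tail)


old_psu, new_psu = 9, 6  # MCE events per 30 minute run, as reported above
print(f"rate old: {old_psu * 2}/h, new: {new_psu * 2}/h")  # rate old: 18/h, new: 12/h
print(f"p-value: {split_p_value(old_psu, new_psu):.2f}")   # p-value: 0.61
```

Under these assumptions a drop from 9 to 6 events is well within run-to-run variation, so longer or repeated runs would be needed before attributing the change to the power supply.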