I am running Ubuntu with the mainline kernel installed from the mainline PPA. The Ryzen processor seems to have a bug that makes gcc crash when compiling a very large program under heavy load. This is easily reproduced on my system using the script from
(It assumes that you are running Ubuntu; Debian may also work. Just clone it and run the script kill-ryzen.sh. It downloads the gcc 7.1 source code and starts multiple compilations of it. If any compilation fails, it warns the user and reports how long it took to hit the failure.)
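For readers who want to see the shape of such a test without reading the script, here is a minimal Python sketch of the same idea: several workers run a command in parallel loops, and the time to the first failure is reported. This is not the actual script. The function name `run_until_failure` and its parameters are hypothetical, and the command you pass in (for example a gcc build) is your own choice.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_until_failure(cmd, workers=4, max_rounds=3):
    """Run `cmd` repeatedly in `workers` parallel loops.

    Returns (elapsed_seconds, worker_id, returncode) for the earliest
    failure seen, or None if every run exited with status 0.
    """
    start = time.monotonic()

    def loop(worker_id):
        # Each worker repeats the command until it fails or rounds run out.
        for _ in range(max_rounds):
            result = subprocess.run(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            if result.returncode != 0:
                return (time.monotonic() - start, worker_id, result.returncode)
        return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(loop, range(workers)))

    failures = [o for o in outcomes if o is not None]
    # Tuples sort by elapsed time first, so min() is the earliest failure.
    return min(failures) if failures else None


if __name__ == "__main__":
    # Stand-in command that always succeeds; replace with a real build command.
    print(run_until_failure([sys.executable, "-c", "pass"], workers=2, max_rounds=1))
```

With a real workload you would pass the build command and a worker count close to the number of hardware threads, since the failures only show up under heavy load.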
There is already a bug report about this in the FreeBSD bugzilla (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c89). There are also threads on the subject on the AMD community forum (https://community.amd.com/thread/215773?start=300&tstart=0) and on Phoronix (https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads).
This is probably a processor bug, but I thought I should bring it to the attention of the kernel developers, as it may be possible to work around it in the kernel.
Obs: If I disable SMT in the BIOS, the problem gets much better, going from failures after a couple of minutes to one failure in 3 to 4 hours.
DragonFlyBSD has applied the following workaround to mitigate the issue:
Some traces during kernel compile or afterwards can be found here:
It seems to be confirmed as a hardware problem. See:
I have contacted AMD and will be doing an RMA of my CPU in the next few days. From what I could gather, the bug is common in the first batches of Ryzen, so there might be many affected CPUs in the wild. AMD is not issuing a recall; it will deal with it on a case-by-case basis.
Anyone can check whether their CPU has the problem by running the kill-ryzen.sh script described in the original bug report. If your CPU has the problem, contact AMD technical support using the link in the Phoronix article linked above.
Maybe I should mark this bug as RESOLVED? At least the bug report will remain here so that other affected users may find out how to proceed.
I did an RMA and the "new" CPU now works as expected. It is not a Linux problem.
I can also confirm that an RMA replacement resolves the issue. It is said that any CPU manufactured after around week 25 is okay. I don't know whether it is possible to detect this, but if it is, one workaround is to disable ASLR (norandmaps). It doesn't prevent the problem entirely, but it makes it much less frequent.
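To check whether ASLR is actually off after booting with norandmaps, you can read /proc/sys/kernel/randomize_va_space, which that boot parameter sets to 0. The following is a small Python sketch; the helper names `describe_aslr` and `current_aslr` are made up for illustration.

```python
from pathlib import Path

# Standard Linux sysctl file holding the ASLR mode.
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap base and VDSO randomized)",
    2: "full (heap randomized as well)",
}


def describe_aslr(raw):
    """Translate the raw sysctl text (e.g. "2\n") into a description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown value {value}")


def current_aslr(path=ASLR_SYSCTL):
    """Return the ASLR description for this machine, or None if unreadable."""
    try:
        return describe_aslr(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("ASLR:", current_aslr())
```

The same setting can be changed at runtime through the kernel.randomize_va_space sysctl, so a reboot is not strictly needed to try the workaround.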