Bug 196481 - GCC segfaults under heavy multithreaded compilation with AMD Ryzen
Status: NEW
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
Reported: 2017-07-25 17:30 UTC by Paulo J. S. Silva
Modified: 2017-10-05 19:59 UTC

Kernel Version: 4.13rc1
Regression: No

Description Paulo J. S. Silva 2017-07-25 17:30:48 UTC
I am running Ubuntu with the mainline kernel installed from the mainline PPA. The Ryzen processor appears to have a bug that makes gcc crash when compiling a very large program under heavy load. This is easily reproduced on my system using the script from

https://github.com/suaefar/ryzen-test

(It assumes that you are running Ubuntu; maybe Debian also works. Just clone it and run the script kill-ryzen.sh. It downloads the gcc 7.1 source code and starts multiple compilations of it. If any compilation fails, it warns the user and reports how long it took to detect the failure.)
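
As a rough illustration of what such a test does, here is a minimal Python sketch of a parallel build loop that stops at the first failure and reports the elapsed time. It is not the ryzen-test script itself: the function name `run_until_failure`, its parameters, and the placeholder workload are assumptions made for this example.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_until_failure(cmd, workers=4, max_rounds=3):
    """Run `workers` copies of `cmd` in parallel, round after round.

    Returns the seconds elapsed when the first failing round finished,
    or None if every run in every round exited with status 0.
    (Hypothetical helper, not part of ryzen-test.)
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            # Launch one process per worker and wait for the whole round.
            codes = list(pool.map(
                lambda _i: subprocess.run(cmd).returncode, range(workers)))
            # A process killed by a signal (e.g. SIGSEGV) has a negative code.
            if any(code != 0 for code in codes):
                return time.monotonic() - start
    return None


if __name__ == "__main__":
    # Placeholder workload; a real test would run a large build here.
    cmd = [sys.executable, "-c", "pass"]
    elapsed = run_until_failure(cmd)
    print("no failure" if elapsed is None else f"failed after {elapsed:.1f}s")
```

A real run would replace the placeholder command with a long build and set `workers` to at least the number of hardware threads, since the failures are reported only under heavy load.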

There is already a bug report about this in the FreeBSD Bugzilla (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c89). There are also threads on the subject in the AMD community forum (https://community.amd.com/thread/215773?start=300&tstart=0) and on the Phoronix forums (https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads).

This is probably a processor bug, but I thought I should bring it to the attention of the kernel developers, as it may be possible to work around it in the kernel.

Note: If I disable SMT in the BIOS, the problem gets much better, moving from failures after a couple of minutes to one failure in 3 to 4 hours.
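
When comparing results with and without SMT, it can help to confirm from inside Linux that the BIOS setting actually took effect. The sketch below reads the sysfs topology file for cpu0 and counts its hardware-thread siblings; the helper names `count_cpus` and `smt_enabled` are made up for this example, and checking only cpu0 is a simplifying assumption.

```python
from pathlib import Path

# Lists the logical CPUs that share a physical core with cpu0.
SIBLINGS = Path("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list")


def count_cpus(cpu_list):
    """Count the CPUs in a sysfs CPU list such as "0-1" or "0,8"."""
    total = 0
    for part in cpu_list.strip().split(","):
        if "-" in part:
            first, last = part.split("-")
            total += int(last) - int(first) + 1
        elif part:
            total += 1
    return total


def smt_enabled(path=SIBLINGS):
    """Return True/False for SMT on cpu0, or None if the file is unreadable."""
    try:
        return count_cpus(path.read_text()) > 1
    except OSError:
        return None


if __name__ == "__main__":
    print("SMT enabled on cpu0:", smt_enabled())
```

With SMT disabled, cpu0 should list only itself as a sibling, so `smt_enabled()` would be expected to return `False`.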
Comment 1 Chí-Thanh Christopher Nguyễn 2017-07-25 21:14:10 UTC
DragonFlyBSD has applied the following workaround to mitigate the issue:
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20
Comment 2 Klaus Mueller 2017-07-28 04:48:49 UTC
Some traces captured during or after a kernel compile can be found here:

http://lists-archives.com/linux-kernel/28885552-gcc-segfaults-under-heavy-multithreaded-compilation-with-amd-ryzen.html
Comment 3 Paulo J. S. Silva 2017-08-11 12:13:20 UTC
It seems to be confirmed as a hardware problem. See:

https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response

I have contacted AMD and will be doing an RMA of my CPU in the next few days. From what I could gather, the bug is common in the first batches of Ryzen, so there might be many affected CPUs in the wild. AMD is not issuing a recall; it will deal with the problem on a case-by-case basis.

Anyone can check whether their CPU has the problem by running the kill-ryzen.sh script described in the original bug report. If your CPU has the problem, contact AMD technical support using the link in the Phoronix article linked above.

Maybe I should mark this bug as RESOLVED? At least the bug report will remain here so that other affected users may find out how to proceed.
Comment 4 Klaus Mueller 2017-09-27 16:18:03 UTC
I did an RMA and the "new" CPU now works as expected. It is not a Linux problem.
Comment 5 James Le Cuirot 2017-10-05 19:59:59 UTC
I can also confirm that an RMA replacement resolves the issue. It is said that any CPU manufactured after around week 25 is okay. I don't know whether it is possible to detect this, but if it is, one workaround is to disable ASLR (norandmaps). That doesn't prevent the segfaults entirely, but it makes them much less frequent.
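
For anyone trying the ASLR workaround, the current setting can be read from the `kernel.randomize_va_space` sysctl. The following is a small sketch that interprets its value; the helper names `describe_aslr` and `aslr_status` are assumptions for illustration, and the mode descriptions paraphrase the kernel's sysctl documentation.

```python
from pathlib import Path

ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of each documented value of kernel.randomize_va_space.
ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap base, vDSO)",
    2: "full (also randomizes the heap)",
}


def describe_aslr(raw):
    """Translate the raw sysctl text into a human-readable description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown ({value})")


def aslr_status(path=ASLR_SYSCTL):
    """Return the ASLR description, or None if the sysctl is unreadable."""
    try:
        return describe_aslr(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("ASLR:", aslr_status())
```

After booting with `norandmaps`, this would be expected to report `disabled`.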
