Bug 203879

Summary: hard freeze on high single threaded load (AMD Ryzen 7 2700X CPU)
Product: Drivers
Component: Video(DRI - non Intel)
Reporter: Claude Heiland-Allen (claude)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
Priority: P1
CC: alexdeucher, nicholas.kazlauskas, sambazley
Hardware: All
OS: Linux
Kernel Version: 4.19.37-3 (Debian 4.19.0-5-amd64) and others (including mainline versions)
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg from 4.19.0-5-amd64 with amdgpu.dc=1 (no freeze yet)
dmesg after boot with idle=nomwait (before freeze which occurred some hours later)
dmesg after crash

Description Claude Heiland-Allen 2019-06-12 19:44:25 UTC
Created attachment 283233 [details]
dmesg from 4.19.0-5-amd64 with amdgpu.dc=1 (no freeze yet)

I am developing a CPU-based program to render fractals, which I usually run with "nice -n 20".  The main calculations are multi-threaded, using 16 threads on an AMD Ryzen 7 2700X Eight-Core Processor.  However, the final PNG image saving is single-threaded.  Only during the single-threaded workload (as observed with htop and the program's status prints), the system can freeze hard (no ssh, stuck mouse pointer, no NumLock LED toggle, no magic SysRq; only a hard power-off with the physical power button works).

This freeze only happens when Xorg is running on the active virtual terminal: I tried to see if some kernel log messages would be displayed before freeze by switching to a console with Ctrl-Alt-F1 after launching my program, but with the terminal active it doesn't seem to freeze.

The freeze does not always occur, but it usually happens before a dozen images are saved (each image is a fully multi-threaded phase followed by a single-threaded phase, and this sequence repeats).  This can take a few hours.
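For anyone without the fractal renderer who wants to try the same load pattern, here is a rough stand-in sketch. It is not the reporter's program: it only alternates an all-threads busy phase with a single-threaded busy phase, does no PNG saving and barely touches memory, so it may well not trigger the freeze. The function names `burn` and `phase_cycle` and the durations are made up for illustration; `timeout` is the GNU coreutils tool.

```shell
# Stand-in load generator (assumption: NOT the reporter's renderer).
# burn SECONDS: keep one CPU thread busy for roughly SECONDS seconds.
burn() {
    # timeout exits with status 124 when it kills the loop; ignore that.
    timeout "$1" sh -c 'while :; do :; done' || true
}

# phase_cycle THREADS MULTI_SECS SINGLE_SECS:
# one multi-threaded phase followed by one single-threaded phase.
phase_cycle() {
    _threads=$1
    _multi_secs=$2
    _single_secs=$3
    _i=0
    while [ "$_i" -lt "$_threads" ]; do
        burn "$_multi_secs" &      # one busy loop per thread
        _i=$((_i + 1))
    done
    wait                           # end of the multi-threaded phase
    burn "$_single_secs"           # single-threaded phase
}

# Example (hypothetical durations), repeated until interrupted:
#   while :; do nice -n 19 sh -c '. ./load.sh; phase_cycle 16 600 300'; done
# or, in the current shell:
#   while :; do phase_cycle 16 600 300; done
```

The single-threaded phase is where the freeze was observed, so that is the part worth lengthening if nothing happens.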

With the virtual terminal active instead of Xorg, I have rendered 100+ images in a row without any issues.  Of course, I can't use other X applications at the same time, so this is an annoying workaround.

I mostly run the regular Debian Buster kernel but I have had this freeze occur with other self-compiled kernels of various versions (newer than the Debian kernel, without Debian patches).  I also had the freeze with both amdgpu.dc=1 (default) and amdgpu.dc=0 options.

$ uname -a
Linux eiskaffee 4.19.0-5-amd64 #1 SMP Debian 4.19.37-3 (2019-05-15) x86_64 GNU/Linux

$ apt-cache policy linux-image-4.19.0-5-amd64
linux-image-4.19.0-5-amd64:
  Installed: 4.19.37-3
  Candidate: 4.19.37-3
  Version table:
 *** 4.19.37-3 990
        990 http://ftp.uk.debian.org/debian buster/main amd64 Packages
        500 http://ftp.uk.debian.org/debian unstable/main amd64 Packages
        100 /var/lib/dpkg/status
Comment 1 Claude Heiland-Allen 2019-06-12 23:04:05 UTC
My conjecture that inactive Xorg prevents freeze is false: got a system freeze with virtual terminal active, Xorg running on inactive VT.  No kernel messages were printed :(  Now running a test without Xorg running at all.
Comment 2 Michel Dänzer 2019-06-13 07:41:03 UTC
That sounds like a general CPU related stability issue, not directly related to the amdgpu driver.
Comment 3 Alex Deucher 2019-06-13 15:34:01 UTC
Does appending idle=nomwait to the kernel command line in grub help?
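On Debian the parameter is normally added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot. A minimal sketch for confirming that it actually reached the running kernel, reading /proc/cmdline (the helper name `has_cmdline_param` is made up for illustration):

```shell
# has_cmdline_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word in CMDLINE.
# CMDLINE defaults to the running kernel's command line.
has_cmdline_param() {
    _param=$1
    _cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    case " $_cmdline " in
        *" $_param "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_cmdline_param idle=nomwait; then
    echo "idle=nomwait is active"
else
    echo "idle=nomwait is NOT active"
fi
```

The whole-word match avoids a false positive on a longer parameter that merely starts with the same text.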
Comment 4 Claude Heiland-Allen 2019-06-17 20:01:00 UTC
Created attachment 283313 [details]
dmesg after boot with idle=nomwait (before freeze which occurred some hours later)

I got one freeze so far, after about an hour on my workload with idle=nomwait.  I am trying a second time just to verify that it doesn't help.
Comment 5 Claude Heiland-Allen 2019-07-12 14:35:22 UTC
(In reply to Michel Dänzer from comment #2)
> That sounds like a general CPU related stability issue, not directly related
> to the amdgpu driver.

The later tests make me agree.  I have changed the title of the report, but I am not sure which Product/Component would be more appropriate.

Adding more system monitoring seems to prevent the condition, perhaps due to the added CPU load:

    watch -n 0.1 sensors
    watch -n 0.1 "cat /proc/cpuinfo | grep MHz"
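Since watch output is lost when the machine locks up, one alternative is to append the readings to a file and flush it, so the last values before a freeze survive the power cycle. This is only a sketch under assumptions: the function names, the log path and the 1-second interval are invented here, and a slower logger like this might still add enough load to hide the freeze, or too little to matter.

```shell
# mhz_values: read cpuinfo-formatted text on stdin and print all
# "cpu MHz" values space-separated on a single line.
mhz_values() {
    awk -F': *' '/^cpu MHz/ { printf "%s%s", sep, $2; sep = " " }
                 END { print "" }'
}

# log_mhz LOGFILE [INTERVAL]: append "epoch-seconds mhz mhz ..." lines
# to LOGFILE every INTERVAL seconds (default 1) until interrupted.
log_mhz() {
    _logfile=$1
    _interval=${2:-1}
    while :; do
        printf '%s %s\n' "$(date +%s)" "$(mhz_values < /proc/cpuinfo)" >> "$_logfile"
        sync                       # push the line to disk before a possible freeze
        sleep "$_interval"
    done
}

# Example (hypothetical path):
#   log_mhz /var/tmp/mhz.log 1
```

After a freeze and reboot, the tail of the log shows the last clock speeds recorded before the lockup.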

It freezes during PNG saving of a large image; presumably this involves lots of sequential RAM access.  I have XMP enabled in my motherboard BIOS settings iirc, so perhaps I should try disabling it?
Comment 6 Sam Bazley 2019-07-13 01:16:40 UTC
I think this bug is also affecting me on Arch with the 5.2 kernel, since my computer is completely freezing when compiling with -j`nproc`. I've bisected, and found 004b3938e6374f39d43cc32bd4953f2fe8b8905b to be the first bad commit.
Comment 7 Sam Bazley 2019-07-13 01:17:32 UTC
(I've got a 2700X and a Vega 64, if that helps)
Comment 8 Sam Bazley 2019-07-16 14:07:09 UTC
Created attachment 283739 [details]
dmesg after crash

Retrieved the dmesg log with ssh after the crash.
Comment 9 Sam Bazley 2019-08-18 00:40:30 UTC
I've realised that I am actually being affected by this bug:
https://bugzilla.kernel.org/show_bug.cgi?id=204181

Please disregard my previous comments.