Bug 206475 - amdgpu under load drops signal to monitor until hard reset
Summary: amdgpu under load drops signal to monitor until hard reset
Status: RESOLVED ANSWERED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video (DRI - non Intel)
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-02-09 20:36 UTC by Marco
Modified: 2022-01-06 23:44 UTC
CC List: 4 users

See Also:
Kernel Version: 5.5.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg for the amdgpu hardware freeze (129.86 KB, application/gzip), 2020-02-09 20:36 UTC, Marco
amdgpu crash with stock kernel (19.20 KB, application/gzip), 2020-02-10 13:21 UTC, Marco
dmesg for amd-drm-next (24.86 KB, application/gzip), 2020-02-10 16:40 UTC, Marco
Latest log with a warning. (807.80 KB, text/plain), 2020-02-24 13:52 UTC, Marco
syslog (130.54 KB, text/plain), 2020-05-22 12:55 UTC, Andrew Ammerlaan
messages (104.43 KB, text/plain), 2020-05-23 14:40 UTC, Andrew Ammerlaan
messages (reset successful this time) (105.65 KB, text/plain), 2020-05-23 16:44 UTC, Andrew Ammerlaan

Description Marco 2020-02-09 20:36:39 UTC
Created attachment 287265 [details]
dmesg for the amdgpu hardware freeze

While gaming, the monitor randomly goes blank, with only this error in the system logs:

kernel: amdgpu: [powerplay] last message was failed ret is 65535
kernel: amdgpu: [powerplay] failed to send message 200 ret is 65535 
kernel: amdgpu: [powerplay] last message was failed ret is 65535
kernel: amdgpu: [powerplay] failed to send message 282 ret is 65535 
kernel: amdgpu: [powerplay] last message was failed ret is 65535
kernel: amdgpu: [powerplay] failed to send message 201 ret is 65535

with the occasional
kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=5275264, emitted seq=5275266
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Hand of Fate 2. pid 682062 thread Hand of Fa:cs0 pid 682064
kernel: amdgpu 0000:06:00.0: GPU reset begin!

over and over again. If I reset the system, there is no video output until the system is fully powered off.

B450 chipset + Ryzen 5 2600 + Radeon RX580 GPU

Full log is attached to this post.

Can anyone at AMD give me some pointers to what the problem is?

Thanks,

Marco.
Comment 1 Marco 2020-02-10 13:20:58 UTC
Just tested under the 5.5.2 stock kernel (aside from the ZFS module) and the same problem shows up. Log attached.
Comment 2 Marco 2020-02-10 13:21:28 UTC
Created attachment 287275 [details]
amdgpu crash with stock kernel
Comment 3 Marco 2020-02-10 16:39:25 UTC
Same thing with linux-amd-drm-next, dmesg attached. Any pointers to the cause?
Comment 4 Marco 2020-02-10 16:40:06 UTC
Created attachment 287277 [details]
dmesg for amd-drm-next
Comment 5 Marco 2020-02-10 19:33:50 UTC
It seems that the problem was insufficient cooling, since the same happened on a Windows VM.
Comment 6 Marco 2020-02-17 13:23:09 UTC
(In reply to Marco from comment #5)
> It seems that the problem was insufficient cooling, since the same happened
> on a Windows VM.

It turns out I was wrong: I tested FurMark with two different driver sets on Windows 10 bare metal, with no crashes for an hour (FurMark in the VM lasted 30 seconds).

This is a firmware/software problem. Please fix it.
Comment 7 Marco 2020-02-21 21:13:47 UTC
Found the root of the issue: somehow ZFS was able to trigger a hard lock in amdgpu, always in the same way. After removing it and switching to XFS, the problem is gone.
Comment 8 Marco 2020-02-24 13:50:43 UTC
Aaand it's back. Much less often, but it's still there. However, this time I got a warning from the kernel, with a backtrace:

feb 24 14:31:13 *** kernel: ------------[ cut here ]------------
feb 24 14:31:13 *** kernel: WARNING: CPU: 3 PID: 24149 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dce_link_encoder.c:1099 dce110_link_encoder_disable_output+0x12a/0x140 [amdgpu]
feb 24 14:31:13 *** kernel: Modules linked in: rfcomm fuse xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter tun bridge stp llc cmac algif_hash algif_skcipher af_alg sr_mod cdrom bnep hwmon_vid xfs nls_iso8859_1 nls_cp437 vfat fat btrfs edac_mce_amd kvm_amd kvm blake2b_generic xor btusb btrtl btbcm btintel bluetooth crct10dif_pclmul crc32_pclmul ghash_clmulni_intel igb joydev ecdh_generic aesni_intel eeepc_wmi asus_wmi crypto_simd battery cryptd sparse_keymap mousedev input_leds ecc glue_helper raid6_pq ccp rfkill wmi_bmof pcspkr k10temp dca libcrc32c i2c_piix4 rng_core evdev pinctrl_amd mac_hid gpio_amdpt acpi_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) virtio_mmio virtio_input virtio_pci virtio_balloon usbip_host snd_hda_codec_realtek usbip_core snd_hda_codec_generic uinput i2c_dev ledtrig_audio sg
feb 24 14:31:13 *** kernel:  snd_hda_codec_hdmi vhba(OE) crypto_user snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sd_mod hid_generic usbhid hid ahci libahci libata crc32c_intel xhci_pci xhci_hcd scsi_mod nouveau mxm_wmi wmi amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
feb 24 14:31:13 *** kernel: CPU: 3 PID: 24149 Comm: kworker/3:2 Tainted: G           OE     5.5.5-arch1-1 #1
feb 24 14:31:13 *** kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 3003 12/09/2019
feb 24 14:31:13 *** kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
feb 24 14:31:13 *** kernel: RIP: 0010:dce110_link_encoder_disable_output+0x12a/0x140 [amdgpu]
feb 24 14:31:13 *** kernel: Code: 44 24 38 65 48 33 04 25 28 00 00 00 75 20 48 83 c4 40 5b 5d 41 5c c3 48 c7 c6 40 05 76 c0 48 c7 c7 f0 b1 7d c0 e8 76 c3 d1 ff <0f> 0b eb d0 e8 7d 12 e7 df 66 66 2e 0f 1f 84 00 00 00 00 00 66 90
feb 24 14:31:13 *** kernel: RSP: 0018:ffffb06641417630 EFLAGS: 00010246
feb 24 14:31:13 *** kernel: RAX: 0000000000000000 RBX: ffff9790645be420 RCX: 0000000000000000
feb 24 14:31:13 *** kernel: RDX: 0000000000000000 RSI: 0000000000000082 RDI: 00000000ffffffff
feb 24 14:31:13 *** kernel: RBP: 0000000000000002 R08: 00000000000005ba R09: 0000000000000093
feb 24 14:31:13 *** kernel: R10: ffffb06641417480 R11: ffffb06641417485 R12: ffffb06641417634
feb 24 14:31:13 *** kernel: R13: ffff979064fe6800 R14: ffff978f4f9201b8 R15: ffff97906ba1ee00
feb 24 14:31:13 *** kernel: FS:  0000000000000000(0000) GS:ffff97906e8c0000(0000) knlGS:0000000000000000
feb 24 14:31:13 *** kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
feb 24 14:31:13 *** kernel: CR2: 00007effa1509000 CR3: 0000000350d7a000 CR4: 00000000003406e0
feb 24 14:31:13 *** kernel: Call Trace:
feb 24 14:31:13 *** kernel:  core_link_disable_stream+0x10e/0x3d0 [amdgpu]
feb 24 14:31:13 *** kernel:  ? smu7_send_msg_to_smc.cold+0x20/0x25 [amdgpu]
feb 24 14:31:13 *** kernel:  dce110_reset_hw_ctx_wrap+0xc3/0x260 [amdgpu]
feb 24 14:31:13 *** kernel:  dce110_apply_ctx_to_hw+0x51/0x5d0 [amdgpu]
feb 24 14:31:13 *** kernel:  ? pp_dpm_dispatch_tasks+0x45/0x60 [amdgpu]
feb 24 14:31:13 *** kernel:  ? amdgpu_pm_compute_clocks+0xcd/0x600 [amdgpu]
feb 24 14:31:13 *** kernel:  ? dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu]
feb 24 14:31:13 *** kernel:  dc_commit_state+0x2b9/0x5e0 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_dm_atomic_commit_tail+0x398/0x20f0 [amdgpu]
feb 24 14:31:13 *** kernel:  ? number+0x337/0x380
feb 24 14:31:13 *** kernel:  ? vsnprintf+0x3aa/0x4f0
feb 24 14:31:13 *** kernel:  ? sprintf+0x5e/0x80
feb 24 14:31:13 *** kernel:  ? irq_work_queue+0x35/0x50
feb 24 14:31:13 *** kernel:  ? wake_up_klogd+0x4f/0x70
feb 24 14:31:13 *** kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
feb 24 14:31:13 *** kernel:  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
feb 24 14:31:13 *** kernel:  drm_atomic_helper_disable_all+0x175/0x190 [drm_kms_helper]
feb 24 14:31:13 *** kernel:  drm_atomic_helper_suspend+0x73/0x120 [drm_kms_helper]
feb 24 14:31:13 *** kernel:  dm_suspend+0x1c/0x60 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_device_ip_suspend_phase1+0x81/0xe0 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_device_gpu_recover+0x2ee/0xa13 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_job_timedout+0x103/0x130 [amdgpu]
feb 24 14:31:13 *** kernel:  drm_sched_job_timedout+0x3e/0x90 [gpu_sched]
feb 24 14:31:13 *** kernel:  process_one_work+0x1e1/0x3d0
feb 24 14:31:13 *** kernel:  worker_thread+0x4a/0x3d0
feb 24 14:31:13 *** kernel:  kthread+0xfb/0x130
feb 24 14:31:13 *** kernel:  ? process_one_work+0x3d0/0x3d0
feb 24 14:31:13 *** kernel:  ? kthread_park+0x90/0x90
feb 24 14:31:13 *** kernel:  ret_from_fork+0x22/0x40
feb 24 14:31:13 *** kernel: ---[ end trace 3e7589981fe74b17 ]---

Complete log attached below.
Comment 9 Marco 2020-02-24 13:52:19 UTC
Created attachment 287575 [details]
Latest log with a warning.
Comment 10 Andrew Ammerlaan 2020-05-22 12:55:35 UTC
Created attachment 289235 [details]
syslog

I think I ran into this issue as well. It has happened twice, both times 10 to 20 minutes *after* playing Minecraft, and both times while I was in a full-screen video meeting. Everything keeps working except that the screen goes black; I could finish the meeting, but without seeing anything.

Only the monitors connected to my RX 590 go black; the one connected to the iGPU just freezes, and after a while the cursor becomes usable again on that monitor, though all applications remain frozen and switching to a tty does not work. REISUB'ing the machine makes it boot on the iGPU. It needs to be completely powered off and on again to boot from the amdgpu.

It looks like it does a graphics reset (why though?):
15554.332021] amdgpu 0000:01:00.0: GPU reset begin!

And from that point onwards everything goes wrong:
[15554.332296] amdgpu: [powerplay] 
[15554.332296]  last message was failed ret is 65535
[15554.332297] amdgpu: [powerplay] 
[15554.332297]  failed to send message 261 ret is 65535 
[15554.332297] amdgpu: [powerplay] 
[15554.332297]  last message was failed ret is 65535

This is kernel 5.6.14
xorg-1.20.8
mesa-20.1.0_rc3
xf86-video-amdgpu-19.1.0

Full log is attached.
Comment 11 Andrew Ammerlaan 2020-05-23 14:40:22 UTC
Created attachment 289245 [details]
messages

Happened again today, while playing GTA V. Same problems appear in the log (attached). 

I think the title of this bug should be changed; there is more going on here than just dropping the signal to the monitor, because the monitors connected to the iGPU freeze as well (no signal drop, just a freeze).

It would be great if someone could give me some pointers as to where I could find more useful logs. /var/log/messages doesn't seem to be very informative; it just says a GPU reset began and that it failed to send some messages afterwards. Or do I need to set some boot parameters or kernel configs to get more verbose logs?
Comment 12 Andrew Ammerlaan 2020-05-23 16:44:13 UTC
Created attachment 289247 [details]
messages (reset successful this time)

And again, twice on the same day :(

But this time:
amdgpu 0000:01:00.0: GPU reset begin!
amdgpu 0000:01:00.0: GPU BACO reset
amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume

This time the reset succeeded; however, after restarting X I got stuck on the KDE login splash screen. The log (attached) shows some segfaults.

It seems to me that there are two issues here.

1) The GPU is (often) not successfully recovered after a reset, and even when it is recovered successfully, segfaults follow in radeonsi_dri.so

2) It goes into a reset in the first place, for no apparent reason

I guess this bug report is mostly about the second issue, why does it go into a reset? How do I debug this?

It would be great if we could get this fixed, as it is getting kind of annoying. (This is a brand new GPU, it is not overheating, so what is wrong?)
Comment 13 Marco 2020-06-16 15:48:27 UTC
The only way I've "fixed" it (it's more of a workaround, but it is working) is to slightly drop the clocks (my GPU has a default max boost clock of 1430 MHz, which I have dropped to 1340 MHz) and voltages (from 1150 mV to 1040 mV at the peak clock; both depend on your specific silicon, these are just my values).

Sometimes lowering clocks/voltages triggers a black-screen bug at boot when the values are applied via systemd (not sure if that issue is the same), but if the system POSTs and I can reach the login screen, it works perfectly fine under load with no problems. I have never had a crash in use since.

I do not know if it's an issue specific to my silicon, since after this started happening to my card, the same applies under Windows as well. Repasting didn't help.
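
For reference, a minimal sketch of how such an underclock/undervolt can be applied through amdgpu's pp_od_clk_voltage sysfs interface (the card index and the state 7 / 1340 MHz / 1040 mV values below are assumptions based on the numbers above, not universal settings; OverDrive has to be enabled, e.g. via the amdgpu.ppfeaturemask boot parameter, and the writes need root):

OD_PATH = "/sys/class/drm/card0/device/pp_od_clk_voltage"  # card index assumed

def set_top_sclk_state(state=7, clock_mhz=1340, voltage_mv=1040):
    # Stage a new clock/voltage pair for the given sclk state...
    with open(OD_PATH, "w") as f:
        f.write(f"s {state} {clock_mhz} {voltage_mv}\n")
    # ...then commit the staged table.
    with open(OD_PATH, "w") as f:
        f.write("c\n")

set_top_sclk_state()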
Comment 14 Andrew Ammerlaan 2020-06-16 16:39:56 UTC
I sort of worked around this too.

I changed two things:

1) The iGPU is now the primary GPU, and I use DRI_PRIME=1 to offload to the AMD GPU. This has reduced the amount of work rendered on the AMD card. It didn't actually fix anything, but it did remove the need for a hard reboot when the AMD GPU does a reset: now, when the GPU resets, only the applications rendered on the AMD card stop working; the desktop and everything else stays functional.

2) I added three fans to my PC. The card's thermal sensor never reported reaching the critical temperature (it went up to 82 Celsius max, critical is 91 Celsius), but there definitely does seem to be a correlation between high temperatures and the occurrence of the resets. And more fans is always better anyway.

I still experienced some resets after switching the primary GPU to the iGPU, but only if I really pushed it to its limits. I haven't had a single reset since I added the fans. (Though admittedly I haven't run a decent stress test yet, so it is still too early to conclude that the problem is completely gone.)

Since under-clocking the card worked for you and adding fans seems to work for me, I have a hunch that even though the thermal sensor doesn't report problematic temperatures, some parts of the card do reach problematic temperatures nonetheless, which might cause issues leading to a reset.
I'm not sure where the sensor is physically located, but considering that the card is quite large, it doesn't seem far-fetched to me that there could be quite a large difference in temperature between two points on the card.

Perhaps this card could benefit from a second thermal sensor or earlier and/or more aggressive thermal throttling.
Comment 15 Andrew Ammerlaan 2020-06-24 20:33:24 UTC
So today it was *really* hot, and I had this issue occur a couple of times. (The solution with the extra fans was nice and all, but not enough to prevent it entirely)

However, now that the iGPU is default, I can still see the system monitor that I usually run on the other monitor when this issue occurs. Every single time the thermal sensor of the GPU would show a ridiculous value (e.g. 511 degrees Celsius).

Now, this could explain why the GPU does a reset: if the thermal sensor suddenly returned a value of e.g. 511, then of course the GPU would shut itself down.

As it is clearly impossible for the GPU temperature to jump from somewhere between 80 and 90 to over 500 within a couple of milliseconds, I conclude that something is wrong, either physically with the thermal sensor or with the way the firmware/driver handles the temperature reporting from the sensor. Also, if the GPU had actually reached a temperature of 511 it would be broken now, as the melting temperature of tin is about 230 degrees Celsius.

I happen to work with thermometers quite a lot, and I have seen temperature readings do stuff like this. Usually the cause is either a broken or shorted sensor (which is unlikely in this case, because it works normally most of the time), or a wrong/incomplete calibration curve. (Usually thermal sensors are only calibrated within the range they are expected to operate in, but the high limit of this calibration curve might be too low.)

Anyway, either the GPU reset is caused by the incorrect temperature readings, or the incorrect temperature readings are caused by the GPU reset (which is also possible I guess). In any case, it would be great if AMD could look into this soon. Because clearly something is wrong.
Comment 16 Alex Deucher 2020-06-24 20:41:42 UTC
(In reply to Andrew Ammerlaan from comment #15)
> However, now that the iGPU is default, I can still see the system monitor
> that I usually run on the other monitor when this issue occurs. Every single
> time the thermal sensor of the GPU would show a ridiculous value (e.g. 511
> degrees Celsius).

When the GPU is in reset all reads to the MMIO BAR return 1s so you are just getting all ones until the reset succeeds.  511 is just all ones.  This patch will fix that issue:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9271dfd9e0f79e2969dcbe28568bce0fdc4f8f73
Comment 17 Andrew Ammerlaan 2020-06-25 09:58:46 UTC
(In reply to Alex Deucher from comment #16)
> When the GPU is in reset all reads to the MMIO BAR return 1s so you are just
> getting all ones until the reset succeeds.  511 is just all ones.  This
> patch will fix that issue:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=9271dfd9e0f79e2969dcbe28568bce0fdc4f8f73

Well, there goes my hypothesis of the broken thermal sensor xD.

I did discover yesterday that the fan of my GPU spins relatively slowly under high load. When the GPU reached ~80 degrees Celsius, the fan didn't even spin at half the maximum RPM! I used the pwmconfig script and the fancontrol service from lm_sensors to force the fan to go to the maximum RPM just before reaching 80 degrees Celsius. It's very noisy, *but* the GPU stays well below 70 degrees Celsius now, even under heavy load. As this issue seems to occur only when the GPU is hotter than ~75 degrees Celsius, I'm hoping that this will help in preventing the problem.

I'm still confused as to why this is at all necessary; the critical temperature is 91, so why do I encounter these issues at ~80?
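
What the pwmconfig/fancontrol setup boils down to can also be sketched directly against amdgpu's hwmon files (temp1_input is in millidegrees Celsius, pwm1_enable set to 1 means manual control, pwm1 takes 0-255); the card path, the 75-degree threshold and the fallback duty cycle below are assumptions for illustration, not values taken from this thread:

import glob
import time

# First hwmon directory under card0 is assumed to belong to the discrete card.
hwmon = glob.glob("/sys/class/drm/card0/device/hwmon/hwmon*")[0]

def read_int(name):
    with open(f"{hwmon}/{name}") as f:
        return int(f.read())

def write_value(name, value):
    with open(f"{hwmon}/{name}", "w") as f:
        f.write(str(value))

write_value("pwm1_enable", 1)  # 1 = manual fan control
while True:
    temp_c = read_int("temp1_input") / 1000            # reported in millidegrees
    write_value("pwm1", 255 if temp_c >= 75 else 128)  # pin at full speed near the problem range
    time.sleep(2)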
Comment 18 Marco 2020-09-15 18:31:38 UTC
As of 5.8.7 I've reverted to stock clocks, and I've had no black screen issues under load, even after long gaming sessions.

It does *seem* to be fixed, at least for me.

I don't know how much code is shared between the Linux open source driver and the Windows closed source driver, but I wonder if it was some bug that made it into the Windows driver too (or even if it was firmware related).

I'll keep testing to see if it happens again, but I haven't seen any error logs mentioning amdgpu in dmesg.
Comment 19 Andrew Ammerlaan 2020-09-16 07:52:59 UTC
I'm on 5.8.8 at the moment and I haven't had this happen in a long time. I've had some other freezes but I'm not sure they're GPU/graphics related. So I too think this is fixed.
Comment 20 Marco 2021-03-22 09:36:45 UTC
I finally found where the problem was, and completely fixed it. It was hardware. The heatsink was not making full contact with a section of the MOSFETs feeding power to the core of the card. Under full load they were thermal-tripping from overheating and completely stalling the card to avoid damaging themselves. The problem is that this card doesn't report their temperature to software, even though the actual VRM controller does (or it may shut down only when the MOSFETs assert a signal flagging a thermal runaway condition). This was hell to debug and fix, as always with hardware problems, but after stress tests on both Windows and Linux at full clocks, the issue is no longer present.

I'll keep my optimized clocks for lower temperatures and less fan noise, but for me the issue wasn't software.
Comment 21 rendsvig@gmail.com 2022-01-06 17:58:24 UTC
I have recently started facing this issue with two new games I've started playing, though I'd had zero issues with gaming until now.

I'm on kernel 5.15.8, and have tried both the Pop!_OS 21.10 default Mesa drivers (21.2.2) and the more recent 21.3.3 drivers.

Logging with lm-sensors every 2 seconds while playing, I see little change in temperature before the crash, but growing power consumption, getting close to the cap of 183 W (the five measurements before the crash were 183, then 177, 180, 182, and 181).
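
A minimal sketch of that kind of 2-second log, reading temperature and power draw straight from amdgpu's hwmon files instead of through lm-sensors (the hwmon path is an assumption; power1_average is reported in microwatts and temp1_input in millidegrees Celsius):

import glob
import time

# First hwmon directory under card0 is assumed to belong to the amdgpu card.
hwmon = glob.glob("/sys/class/drm/card0/device/hwmon/hwmon*")[0]

while True:
    with open(f"{hwmon}/temp1_input") as f:
        temp_c = int(f.read()) / 1000          # millidegrees -> degrees Celsius
    with open(f"{hwmon}/power1_average") as f:
        power_w = int(f.read()) / 1_000_000    # microwatts -> watts
    print(f"{time.strftime('%H:%M:%S')}  {temp_c:.1f} C  {power_w:.0f} W")
    time.sleep(2)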

Marco, can you explain how to tell if it's the same hardware issue you faced?
Comment 22 rendsvig@gmail.com 2022-01-06 23:44:07 UTC
I resolved my issue by disabling p-state 7 when gaming, cf. this comment: https://www.reddit.com/r/linux_gaming/comments/gbqe0e/comment/fp8r35a
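
A minimal sketch of what that workaround amounts to through amdgpu's sysfs interface (card index assumed; on these cards state 7 is the highest sclk level, as discussed above): force the DPM performance level to manual, then write only the allowed state indices to pp_dpm_sclk.

DEV = "/sys/class/drm/card0/device"  # card index assumed

# Manual DPM control is required before the per-state masks become writable.
with open(f"{DEV}/power_dpm_force_performance_level", "w") as f:
    f.write("manual")

# Allow sclk p-states 0 through 6 only; the highest state (7) stays disabled.
with open(f"{DEV}/pp_dpm_sclk", "w") as f:
    f.write("0 1 2 3 4 5 6")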

Note You need to log in before you can comment on or make changes to this bug.