Bug 206987 - [drm] [amdgpu] Whole system crashes when the driver is in mode_support_and_system_configuration
Summary: [drm] [amdgpu] Whole system crashes when the driver is in mode_support_and_sy...
Status: RESOLVED DUPLICATE of bug 207979
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-26 19:51 UTC by Cyrax
Modified: 2020-07-23 01:47 UTC (History)
CC List: 5 users

See Also:
Kernel Version: 5.7.6
Tree: Mainline
Regression: Yes


Attachments
dmesg output (257.73 KB, text/plain)
2020-03-26 21:36 UTC, Cyrax
dmesg output 2 (278.57 KB, text/plain)
2020-04-04 07:40 UTC, Cyrax
dmesg output (1.34 MB, text/plain)
2020-04-18 13:19 UTC, Cyrax
smesg output (145.71 KB, text/plain)
2020-04-19 11:43 UTC, farmboy0
dmesg output (348.35 KB, text/plain)
2020-04-23 05:15 UTC, Cyrax
gdb disassembler dump around mode_support_and_system_configuration (276.07 KB, text/plain)
2020-04-25 08:44 UTC, Cyrax
dmesg output from Linux 5.7-rc3 (165.35 KB, text/plain)
2020-04-27 19:20 UTC, Cyrax
dmesg from 5.6.8 (483.08 KB, text/plain)
2020-05-02 14:18 UTC, Cyrax
kernel log dumped from crash dump by using crash utility (1.68 MB, application/zip)
2020-05-23 01:52 UTC, Cyrax
backtrace created by executing bt -f command in crash utility (9.36 KB, text/plain)
2020-05-23 01:56 UTC, Cyrax
dump of struct dcn_bw_internal_vars (22.87 KB, text/plain)
2020-05-23 01:58 UTC, Cyrax
dmesg from kernel 5.4.0-31 (205.01 KB, text/plain)
2020-05-28 16:05 UTC, Petteri Aimonen
dmesg output kernel 5.7.0 (353.96 KB, text/plain)
2020-06-03 01:34 UTC, Cyrax
config file used to build kernel 5.7.0 with KASAN etc (243.04 KB, text/plain)
2020-06-03 01:35 UTC, Cyrax
used decode_stacktrace.sh to previous dmesg log (365.51 KB, text/plain)
2020-06-03 02:00 UTC, Cyrax
systemd journal from crash (15.42 KB, text/plain)
2020-06-06 01:29 UTC, yaomtc

Description Cyrax 2020-03-26 19:51:03 UTC
The whole system crashes with this error message: simd exception: 0000 [#1] PREEMPT SMP NOPTI

Only a REISUB (Magic SysRq) reboot recovers the system.

The cause is the amdgpu driver.

---

Mar 26 20:47:13 shodan kernel: simd exception: 0000 [#1] PREEMPT SMP NOPTI
Mar 26 20:47:13 shodan kernel: CPU: 7 PID: 1344 Comm: Xorg Tainted: G        W  OE     5.5.11-arch1-1 #1
Mar 26 20:47:13 shodan kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.80 03/06/2019
Mar 26 20:47:13 shodan kernel: RIP: 0010:mode_support_and_system_configuration+0x30a3/0x4d90 [amdgpu]
Mar 26 20:47:13 shodan kernel: Code: 00 0f 28 c3 e8 7e c9 ff ff f3 41 0f 11 87 40 19 00 00 e9 12 fd ff ff 41 83 be a8 00 00 00 06 75 93 f3 41 0f 10 86 40 1b 00 00 <f3> 41 0f 5e 86 f8 17 00 00 e8 4f c9 ff ff 41 8b 87 80 04 00 00 f3
Mar 26 20:47:13 shodan kernel: RSP: 0018:ffffb216c1f3b978 EFLAGS: 00010246
Mar 26 20:47:13 shodan kernel: RAX: 0000000000000006 RBX: ffff9c120bbfadc4 RCX: 0000000000000004
Mar 26 20:47:13 shodan kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c120bbfb008
Mar 26 20:47:13 shodan kernel: RBP: ffff9c120bbfadc4 R08: ffff9c120bbfc164 R09: 0000000000000120
Mar 26 20:47:13 shodan kernel: R10: ffff9c120bbfaee4 R11: ffff9c120bbf0248 R12: ffff9c120bbfc63c
Mar 26 20:47:13 shodan kernel: R13: 0000000000000000 R14: ffff9c120bbfaf5c R15: ffff9c120bbfadc4
Mar 26 20:47:13 shodan kernel: FS:  00007f1c9f336dc0(0000) GS:ffff9c19009c0000(0000) knlGS:0000000000000000
Mar 26 20:47:13 shodan kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 26 20:47:13 shodan kernel: CR2: 00001f82bfec7fe0 CR3: 00000007cbe4a000 CR4: 00000000003406e0
Mar 26 20:47:13 shodan kernel: Call Trace:
Mar 26 20:47:13 shodan kernel:  dcn_validate_bandwidth+0xfe5/0x1f20 [amdgpu]
Mar 26 20:47:13 shodan kernel:  dc_validate_global_state+0x28a/0x310 [amdgpu]
Mar 26 20:47:13 shodan kernel:  amdgpu_dm_atomic_check+0x5d8/0x870 [amdgpu]
Mar 26 20:47:13 shodan kernel:  drm_atomic_check_only+0x578/0x800 [drm]
Mar 26 20:47:13 shodan kernel:  ? dm_crtc_duplicate_state+0x6b/0x1f0 [amdgpu]
Mar 26 20:47:13 shodan kernel:  drm_atomic_commit+0x13/0x50 [drm]
Mar 26 20:47:13 shodan kernel:  drm_atomic_helper_legacy_gamma_set+0x123/0x180 [drm_kms_helper]
Mar 26 20:47:13 shodan kernel:  drm_mode_gamma_set_ioctl+0x171/0x220 [drm]
Mar 26 20:47:13 shodan kernel:  ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
Mar 26 20:47:13 shodan kernel:  drm_ioctl_kernel+0xb2/0x100 [drm]
Mar 26 20:47:13 shodan kernel:  drm_ioctl+0x209/0x360 [drm]
Mar 26 20:47:13 shodan kernel:  ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
Mar 26 20:47:13 shodan kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Mar 26 20:47:13 shodan kernel:  do_vfs_ioctl+0x4b7/0x730
Mar 26 20:47:13 shodan kernel:  ksys_ioctl+0x5e/0x90
Mar 26 20:47:13 shodan kernel:  __x64_sys_ioctl+0x16/0x20
Mar 26 20:47:13 shodan kernel:  do_syscall_64+0x4e/0x150
Mar 26 20:47:13 shodan kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 26 20:47:13 shodan kernel: RIP: 0033:0x7f1ca01892eb
Mar 26 20:47:13 shodan kernel: Code: 0f 1e fa 48 8b 05 a5 8b 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 8b 0c 00 f7 d8 64 89 01 48
Mar 26 20:47:13 shodan kernel: RSP: 002b:00007ffc60ff5648 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
Mar 26 20:47:13 shodan kernel: RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f1ca01892eb
Mar 26 20:47:13 shodan kernel: RDX: 00007ffc60ff5700 RSI: 00000000c02064a5 RDI: 000000000000000a
Mar 26 20:47:13 shodan kernel: RBP: 00007ffc60ff5680 R08: 0000562bb635c080 R09: 0000562bb635c280
Mar 26 20:47:13 shodan kernel: R10: 0000562bb635be80 R11: 0000000000000206 R12: 0000000000000100
Mar 26 20:47:13 shodan kernel: R13: 0000562bb6ab4f70 R14: 0000562bb635b9c0 R15: 0000000000000100
Mar 26 20:47:13 shodan kernel: Modules linked in: snd_seq_dummy snd_seq bluetooth ecdh_generic rfkill ecc veth fuse iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi tun ip6table_mangle xt_MASQUERADE iptable_nat nf_nat xt_connmark iptable_mangle xt_helper xt_NFLOG xt_limit xt_conntrack xt_tcpudp nf_conntrack_ftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_irc nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) pktcdvd nfnetlink_log nfnetlink ip6table_filter nct6775 ip6_tables hwmon_vid iptable_filter edac_mce_amd kvm_amd ccp ext4 rng_core kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi crc16 mbcache irqbypass mxm_wmi jbd2 snd_hda_intel wmi_bmof snd_intel_dspcfg snd_hda_codec snd_usb_audio crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core uvcvideo snd_usbmidi_lib snd_rawmidi videobuf2_vmalloc videobuf2_memops snd_seq_device videobuf2_v4l2 aesni_intel snd_hwdep videobuf2_common crypto_simd snd_pcm mousedev cryptd glue_helper
Mar 26 20:47:13 shodan kernel:  input_leds sp5100_tco snd_timer igb k10temp pcspkr i2c_piix4 snd soundcore dca wmi evdev mac_hid gpio_amdpt pinctrl_amd acpi_cpufreq xt_mark v4l2loopback(OE) videodev mc usbmon nbd msr vhba(OE) sr_mod cdrom sg br_netfilter bridge stp llc ip_tables x_tables dm_mod btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq sd_mod hid_generic usbhid hid crc32c_intel ahci libahci libata xhci_pci xhci_hcd scsi_mod amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper serio_raw syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart i8042 atkbd libps2 serio
Mar 26 20:47:13 shodan kernel: ---[ end trace e34593e526e29a3d ]---
Mar 26 20:47:13 shodan kernel: RIP: 0010:mode_support_and_system_configuration+0x30a3/0x4d90 [amdgpu]
Mar 26 20:47:13 shodan kernel: Code: 00 0f 28 c3 e8 7e c9 ff ff f3 41 0f 11 87 40 19 00 00 e9 12 fd ff ff 41 83 be a8 00 00 00 06 75 93 f3 41 0f 10 86 40 1b 00 00 <f3> 41 0f 5e 86 f8 17 00 00 e8 4f c9 ff ff 41 8b 87 80 04 00 00 f3
Mar 26 20:47:13 shodan kernel: RSP: 0018:ffffb216c1f3b978 EFLAGS: 00010246
Mar 26 20:47:13 shodan kernel: RAX: 0000000000000006 RBX: ffff9c120bbfadc4 RCX: 0000000000000004
Mar 26 20:47:13 shodan kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c120bbfb008
Mar 26 20:47:13 shodan kernel: RBP: ffff9c120bbfadc4 R08: ffff9c120bbfc164 R09: 0000000000000120
Mar 26 20:47:13 shodan kernel: R10: ffff9c120bbfaee4 R11: ffff9c120bbf0248 R12: ffff9c120bbfc63c
Mar 26 20:47:13 shodan kernel: R13: 0000000000000000 R14: ffff9c120bbfaf5c R15: ffff9c120bbfadc4
Mar 26 20:47:13 shodan kernel: FS:  00007f1c9f336dc0(0000) GS:ffff9c19009c0000(0000) knlGS:0000000000000000
Mar 26 20:47:13 shodan kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 26 20:47:13 shodan kernel: CR2: 00001f82bfec7fe0 CR3: 00000007cbe4a000 CR4: 00000000003406e0
Comment 1 Alex Deucher 2020-03-26 19:54:30 UTC
Please attach your full dmesg output.  What version of gcc are you using?
Comment 2 Cyrax 2020-03-26 21:36:12 UTC
Created attachment 288079 [details]
dmesg output
Comment 3 Cyrax 2020-03-26 21:37:17 UTC
GCC is "gcc (Arch Linux 9.3.0-1) 9.3.0"
Comment 4 Cyrax 2020-04-04 07:40:17 UTC
Created attachment 288203 [details]
dmesg output 2

This crash happened again. Before it happened I had used VLC, played a game (GZDoom), and tried to listen to a YouTube playlist using a combination of youtube-dl, ffmpeg and mpv.

I also updated the motherboard's BIOS/firmware to the latest version.
Comment 5 Cyrax 2020-04-04 07:42:04 UTC
Oh, and the kernel is version 5.5.13.
Comment 6 Cyrax 2020-04-18 13:19:16 UTC
Created attachment 288595 [details]
dmesg output

And another one. It seems that switching between virtual consoles causes this bug to happen.
Comment 7 farmboy0 2020-04-19 11:42:49 UTC
I sometimes have the same problem when starting or exiting SteamVR.
I have observed it with the 5.6 kernels.
My card is a Navi RX 5700 XT.
Comment 8 farmboy0 2020-04-19 11:43:47 UTC
Created attachment 288615 [details]
smesg output
Comment 9 Cyrax 2020-04-23 05:15:27 UTC
Created attachment 288679 [details]
dmesg output

And again.
Comment 10 Cyrax 2020-04-25 08:44:00 UTC
Created attachment 288719 [details]
gdb disassembler dump around mode_support_and_system_configuration

And it happened again. It looks like something goes wrong a while after the computer monitor is turned on.
Comment 11 Cyrax 2020-04-27 19:20:31 UTC
Created attachment 288781 [details]
dmesg output from Linux 5.7-rc3

This is becoming a real problem; I can't do anything remotely productive. The crash happens within about 12 hours (give or take) of rebooting after the previous one.

I'm running four LXC containers, which I have set up to run GUI programs on the host system by following this guide: https://wiki.archlinux.org/index.php/Linux_Containers#Xorg_program_considerations_(optional)

I also have VirtualBox running, but its VMs aren't accessing the host's 3D functions at all.
Comment 12 Cyrax 2020-05-02 14:18:02 UTC
Created attachment 288873 [details]
dmesg from 5.6.8

Additionally, the dmesg output shows this line: note: kworker/0:3[2251663] exited with preempt_count 1

It seems that this bug occurs when the monitor is turned off and then on repeatedly with a short delay in between.
Comment 13 Cyrax 2020-05-23 01:52:24 UTC
Created attachment 289237 [details]
kernel log dumped from crash dump by using crash utility
Comment 14 Cyrax 2020-05-23 01:56:07 UTC
Created attachment 289239 [details]
backtrace created by executing bt -f command in crash utility
Comment 15 Cyrax 2020-05-23 01:58:36 UTC
Created attachment 289241 [details]
dump of struct dcn_bw_internal_vars
Comment 16 Petteri Aimonen 2020-05-28 14:17:29 UTC
I hit the same issue on Ubuntu 20.04. It happened when switching windows to Firefox. For me it only crashed Xorg; ssh to the machine still worked OK. Killing Xorg didn't work, and `shutdown -r now` hung somewhere.

Here is a bug report on the Ubuntu package: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1881134

Here is call trace decoded with the debug symbols:

--

[455834.385061] Call Trace:
[455834.385120] mode_support_and_system_configuration (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calc_auto.c:176) amdgpu
[455834.385174] ? calculate_inits_and_adj_vp (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_resource.c:950 (discriminator 12)) amdgpu
[455834.385230] dcn_validate_bandwidth (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:1034) amdgpu
[455834.385283] dc_validate_global_state (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_resource.c:2093) amdgpu
[455834.385338] amdgpu_dm_atomic_check (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7413) amdgpu
[455834.385351] drm_atomic_check_only (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_atomic.c:1179) drm
[455834.385361] drm_atomic_commit (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_atomic.c:1220) drm
[455834.385370] drm_mode_obj_set_property_ioctl (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:496 /build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:533) drm
[455834.385379] ? drm_mode_obj_find_prop_id (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:512) drm
[455834.385386] drm_ioctl_kernel (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_ioctl.c:793) drm
[455834.385394] drm_ioctl (/build/linux-FFoizL/linux-5.4.0/include/linux/thread_info.h:119 /build/linux-FFoizL/linux-5.4.0/include/linux/thread_info.h:152 /build/linux-FFoizL/linux-5.4.0/include/linux/uaccess.h:151 /build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_ioctl.c:888) drm
[455834.385402] ? drm_mode_obj_find_prop_id (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:512) drm
[455834.385406] ? recalc_sigpending (/build/linux-FFoizL/linux-5.4.0/kernel/signal.c:184) 
[455834.385440] amdgpu_drm_ioctl (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:1293) amdgpu
[455834.385443] do_vfs_ioctl (/build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:47 /build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:510 /build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:697) 
[455834.385444] ? recalc_sigpending (/build/linux-FFoizL/linux-5.4.0/kernel/signal.c:184) 
[455834.385446] ? _copy_from_user (/build/linux-FFoizL/linux-5.4.0/arch/x86/include/asm/uaccess_64.h:46 /build/linux-FFoizL/linux-5.4.0/arch/x86/include/asm/uaccess_64.h:71 /build/linux-FFoizL/linux-5.4.0/lib/usercopy.c:14) 
[455834.385448] ksys_ioctl (/build/linux-FFoizL/linux-5.4.0/include/linux/file.h:43 /build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:715) 
[455834.385449] __x64_sys_ioctl (/build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:719) 
[455834.385451] do_syscall_64 (/build/linux-FFoizL/linux-5.4.0/arch/x86/entry/common.c:290) 
[455834.385455] entry_SYSCALL_64_after_hwframe (/build/linux-FFoizL/linux-5.4.0/arch/x86/entry/entry_64.S:184) 
[455834.385456] RIP: 0033:0x7faf3181837b
Comment 17 Petteri Aimonen 2020-05-28 16:05:44 UTC
Created attachment 289381 [details]
dmesg from kernel 5.4.0-31
Comment 18 Petteri Aimonen 2020-05-28 16:24:21 UTC
As best as I can tell, the crash seems to be caused by some floating point exception (such as underflow/overflow) in this function call in dcn_calc_auto.c line 176:

dcn_bw_ceil2(v->byte_per_pixel_in_dety[k], 1.0)

In dcn_bw_ceil2() the exception occurs in this instruction:

addsd  0x0(%rip),%xmm3

which is performing the addition flr + 0.00001.
At this point %xmm3 is ((int)(v->byte_per_pixel_in_dety[k] / 1.0)) * 1.0
The variable byte_per_pixel_in_dety is only assigned constant values 1.0, 2.0, 4.0, 8.0 so
I don't see any reason for addsd to cause a simd exception. I'm not sure if the exception
is precise or if it could be delayed from some prior instruction, but AFAIK it should be
precise because in usermode the exception handler would attempt a recovery.
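
To make the arithmetic above easier to follow, here is a rough Python model of the rounding step. It is only a sketch built from the two expressions quoted in this comment, `((int)(arg / significance)) * significance` and `flr + 0.00001`. The function names, the zero-significance guard, and the final comparison are assumptions for illustration, not a copy of the kernel source.

```python
def floor2_model(arg: float, significance: float) -> float:
    """Hypothetical model of ((int)(arg / significance)) * significance."""
    if significance == 0:
        # Assumed guard so the model never divides by zero.
        return 0.0
    return int(arg / significance) * significance


def ceil2_model(arg: float, significance: float) -> float:
    """Hypothetical model of rounding arg up to a multiple of significance."""
    flr = floor2_model(arg, significance)
    # This is the addition that the faulting addsd is said to perform.
    # Assumption: the small epsilon only absorbs rounding noise.
    return arg if flr + 0.00001 >= arg else flr + significance


# The only values reported for byte_per_pixel_in_dety are 1.0, 2.0, 4.0, 8.0.
for bpp in (1.0, 2.0, 4.0, 8.0):
    print(bpp, "->", ceil2_model(bpp, 1.0))
```

With these inputs the model returns each value unchanged, and nothing in the computation comes near overflow, underflow, or an invalid operation. This fits the observation that the addition itself should not be able to fault on its own.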

Having XMM3 or MXCSR values would help, but they don't seem to get included in the dmesg output and I'm not sure if they are available in a crash dump either.
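
For anyone who does manage to capture an MXCSR value, the sketch below shows how it could be read. It relies on the architectural layout of the register (exception flags in bits 0-5, exception masks in bits 7-12, power-on default 0x1F80). The function names and the example value 0x0F80 are made up for illustration and do not come from any log in this report.

```python
MXCSR_DEFAULT = 0x1F80  # architectural default: all six exceptions masked

# Order matches both the flag bits (0-5) and the mask bits (7-12).
EXCEPTION_NAMES = [
    "invalid", "denormal", "divide-by-zero",
    "overflow", "underflow", "precision",
]


def pending_flags(mxcsr: int) -> list:
    """Names of exceptions whose sticky flag bit (bits 0-5) is set."""
    return [n for i, n in enumerate(EXCEPTION_NAMES) if (mxcsr >> i) & 1]


def unmasked_exceptions(mxcsr: int) -> list:
    """Names of exceptions whose mask bit (bits 7-12) is clear.

    Only unmasked exceptions can be delivered as a SIMD exception.
    """
    return [n for i, n in enumerate(EXCEPTION_NAMES)
            if not (mxcsr >> (7 + i)) & 1]


print(unmasked_exceptions(MXCSR_DEFAULT))  # []
print(unmasked_exceptions(0x0F80))         # hypothetical value: ['precision']
```

With the default value no SIMD exception can be delivered at all, so a captured MXCSR with any mask bit cleared would already be a strong hint about where to look.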

Google search turned up https://beowulf.beowulf.narkive.com/tAHxVcs0/simd-exception-kernel-panic-on-skylake-ep-triggered-by-openfoam where the exception was delayed for some reason.

Analyzing the dmesgs attached to this bug report, we have the following crash locations:

Cyrax    2020-03-26 21:36: divss  xmm0,DWORD PTR [r14+0x17f8]
Cyrax    2020-04-04 07:40: divss  xmm0,DWORD PTR [r14+0x17f8]
Cyrax    2020-04-18 13:19: divss  xmm0,DWORD PTR [r14+0x17f8]
farmboy0 2020-04-19 11:43: not a simd exception
Cyrax    2020-04-23 05:15: divss  xmm0,DWORD PTR [r14+0x17f8]
Cyrax    2020-04-27 19:20: divss  xmm0,DWORD PTR [r14+0x17f8]
Cyrax    2020-05-02 14:18: divss  xmm0,DWORD PTR [r14+0x17f8]
PetteriA 2020-05-28 16:05: addsd  xmm3,QWORD PTR [rip+0x1de967]

So the crash locations appear fairly consistent for Cyrax's machine, but no two machines have the same location.
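
The divss entries above can be cross-checked against the raw "Code:" line in the original description. The following is a minimal Python sketch, assuming the usual oops convention that the byte in angle brackets is the first byte of the faulting instruction. The helper name is hypothetical, and only the opcode bytes and the 32-bit displacement are checked; it is not a full disassembler.

```python
# "Code:" bytes copied from the first oops in the description.
CODE_LINE = (
    "00 0f 28 c3 e8 7e c9 ff ff f3 41 0f 11 87 40 19 00 00 e9 12 fd ff ff "
    "41 83 be a8 00 00 00 06 75 93 f3 41 0f 10 86 40 1b 00 00 "
    "<f3> 41 0f 5e 86 f8 17 00 00 e8 4f c9 ff ff 41 8b 87 80 04 00 00 f3"
)


def faulting_bytes(code_line: str) -> bytes:
    """Return the bytes from the <..> marker to the end of the line."""
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes(int(tok.strip("<>"), 16) for tok in tokens[start:])


insn = faulting_bytes(CODE_LINE)

# F3 (REX) 0F 5E is the DIVSS encoding; 0x41 here is a REX prefix.
is_divss = insn[0] == 0xF3 and insn[2:4] == b"\x0f\x5e"

# The ModRM byte (0x86) selects a 32-bit displacement, stored little-endian.
displacement = int.from_bytes(insn[5:9], "little")

print(is_divss, hex(displacement))  # True 0x17f8
```

The decoded displacement 0x17f8 matches the `[r14+0x17f8]` operand listed for Cyrax's crashes.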

For other users affected by this problem, it could be helpful if you install kernel debugging symbols and use decode_stacktrace.sh to convert the raw stack trace to code locations.
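
When comparing traces from several machines, it can also help to pull the symbol, offset, and module out of each raw frame before decoding. This is a small Python sketch of that step; the regular expression and the function name are assumptions written for the trace format seen in this report, and the sketch does not replace decode_stacktrace.sh, which needs the debug symbols to map offsets to source lines.

```python
import re

# Matches frames such as "? sym+0x6b/0x1f0 [module]"; the "? " prefix marks
# an entry the unwinder is not sure about.
FRAME_RE = re.compile(
    r"(?P<unsure>\? )?"
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_frame(line: str):
    """Return a dict describing one call-trace frame, or None if no match."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),  # None for built-in kernel code
        "reliable": match.group("unsure") is None,
    }


frame = parse_frame(
    "Mar 26 20:47:13 shodan kernel: RIP: 0010:"
    "mode_support_and_system_configuration+0x30a3/0x4d90 [amdgpu]"
)
print(frame["symbol"], hex(frame["offset"]), frame["module"])
```

Grouping the parsed frames by symbol and offset makes it easy to see whether two reports fault at the same instruction.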

Also reported on freedesktop amd bugtracker: https://gitlab.freedesktop.org/drm/amd/-/issues/1154
Comment 20 yaomtc 2020-06-02 03:50:14 UTC
So far so good, Alex. I'm using the RX 5700 XT as well. Previously, running SteamVR could crash my system pretty quickly (even before launching a game), and since I rebuilt linux-mainline from the AUR, I haven't had SteamVR crash my system yet. Fingers crossed that this continues.

Though Half-Life: Alyx is causing a system crash, which can even happen on Windows with Vulkan apparently! Wow. At least that's not an AMD or Linux specific issue. https://github.com/ValveSoftware/SteamVR-for-Linux/issues/356
Comment 21 Cyrax 2020-06-03 01:34:18 UTC
Created attachment 289479 [details]
dmesg output kernel 5.7.0
Comment 22 Cyrax 2020-06-03 01:35:36 UTC
Created attachment 289481 [details]
config file used to build kernel 5.7.0 with KASAN etc
Comment 23 Cyrax 2020-06-03 02:00:53 UTC
Created attachment 289483 [details]
used decode_stacktrace.sh to previous dmesg log
Comment 24 Cyrax 2020-06-03 02:28:02 UTC
(In reply to Petteri Aimonen from comment #16)
> I hit the same issue, using Ubuntu 20.04. It happened when switching window
> to Firefox. For me it only crashed Xorg, ssh to the machine still worked ok.
> Killing Xorg didn't work and `shutdown -r now` hung up somewhere.
> 
> Here is a bug report on the Ubuntu package:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1881134
> 
> Here is call trace decoded with the debug symbols:
> 
[clip]

Yes, it happens when switching windows and/or switching to a different workspace. And yes, it crashes only Xorg; everything else keeps working as usual, but issuing a reboot command over SSH does not actually reboot the machine. Only REISUB brings the machine back to a usable state.
Comment 25 Petteri Aimonen 2020-06-03 05:14:49 UTC
Looks like there are two kinds of crash bugs here. Many of the amdgpu crashes have been fixed in 5.7.0, but the specific one that gives "simd exception" in dmesg is not.

@Cyrax There is an experimental patch in https://bugzilla.kernel.org/show_bug.cgi?id=207979 if you want to try.

Out of interest, are you possibly running a 32-bit operating system under virtualization on 64-bit host? That's what triggers the bug for me.
Comment 26 Cyrax 2020-06-03 11:05:07 UTC
(In reply to Petteri Aimonen from comment #25)
> Looks like there are two kinds of crash bugs here. Many of the amdgpu
> crashes have been fixed in 5.7.0, but the specific one that gives "simd
> exception" in dmesg is not.
> 
> @Cyrax There is an experimental patch in
> https://bugzilla.kernel.org/show_bug.cgi?id=207979 if you want to try.
> 
> Out of interest, are you possibly running a 32-bit operating system under
> virtualization on 64-bit host? That's what triggers the bug for me.

I'm running one 32-bit LXC container (Arch Linux 32, https://archlinux32.org/) and three 64-bit LXC containers (Arch Linux). Additionally, I'm running three VirtualBox guests: Windows, Arch Linux, and an old version of the LEDE (OpenWrt) router OS (all 64-bit).
Comment 27 yaomtc 2020-06-06 01:29:29 UTC
Created attachment 289535 [details]
systemd journal from crash

Update: got a whole system crash again when I was starting up SteamVR. So I guess the issue wasn't resolved for me. It could have reduced the likelihood maybe, or it was luck?

Not sure what else to attach here, but I copied journal entries from the time of the crash (which happens at 21:09:31 near the end). Let me know if there's something else I should attach the next time this happens, if more data would be helpful.
Comment 28 Petteri Aimonen 2020-06-06 06:42:57 UTC
@yaomtc Your bug seems to be some separate issue, as the log does not have the "simd exception" or "mode_support_and_system_configuration" entries in it. It looks more similar to this bug here: https://gitlab.freedesktop.org/drm/amd/-/issues/1149
Comment 29 Alexander Kernozhitsky 2020-07-03 22:22:34 UTC
I encountered this bug today. When running specific graphical applications, the machine hangs, and the kernel log reports a simd exception.

It started to occur after the upgrade to 5.7.6 kernel.

I tried to apply the patch mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=207979, and the patch resolves the issue for me.

Using AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx.
Comment 30 Cyrax 2020-07-15 16:12:51 UTC
The patch in https://bugzilla.kernel.org/show_bug.cgi?id=207979 works beautifully.
19 days of heavy usage without a system crash on a patched 5.7.6 kernel.
Comment 31 Alex Deucher 2020-07-17 04:40:45 UTC
Duplicate of bug 207979.
Comment 32 Cyrax 2020-07-23 01:47:16 UTC
The fix is in the stable 5.7.10 kernel.

*** This bug has been marked as a duplicate of bug 207979 ***
