Bug 206987 - [drm] [amdgpu] Whole system crashes when the driver is in mode_support_and_system_configuration
Status: RESOLVED DUPLICATE of bug 207979
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: drivers_video-dri
Reported: 2020-03-26 19:51 UTC by Cyrax
Modified: 2021-04-03 05:23 UTC
CC List: 8 users

See Also:
Kernel Version: 5.7.6
Tree: Mainline
Regression: Yes


Attachments
dmesg output (257.73 KB, text/plain)
2020-03-26 21:36 UTC, Cyrax
dmesg output 2 (278.57 KB, text/plain)
2020-04-04 07:40 UTC, Cyrax
dmesg output (1.34 MB, text/plain)
2020-04-18 13:19 UTC, Cyrax
smesg output (145.71 KB, text/plain)
2020-04-19 11:43 UTC, farmboy0
dmesg output (348.35 KB, text/plain)
2020-04-23 05:15 UTC, Cyrax
gdb disassembler dump around mode_support_and_system_configuration (276.07 KB, text/plain)
2020-04-25 08:44 UTC, Cyrax
dmesg output from Linux 5.7-rc3 (165.35 KB, text/plain)
2020-04-27 19:20 UTC, Cyrax
dmesg from 5.6.8 (483.08 KB, text/plain)
2020-05-02 14:18 UTC, Cyrax
kernel log dumped from crash dump by using crash utility (1.68 MB, application/zip)
2020-05-23 01:52 UTC, Cyrax
backtrace created by executing bt -f command in crash utility (9.36 KB, text/plain)
2020-05-23 01:56 UTC, Cyrax
dump of struct dcn_bw_internal_vars (22.87 KB, text/plain)
2020-05-23 01:58 UTC, Cyrax
dmesg from kernel 5.4.0-31 (205.01 KB, text/plain)
2020-05-28 16:05 UTC, Petteri Aimonen
dmesg output kernel 5.7.0 (353.96 KB, text/plain)
2020-06-03 01:34 UTC, Cyrax
config file used to build kernel 5.7.0 with KASAN etc (243.04 KB, text/plain)
2020-06-03 01:35 UTC, Cyrax
used decode_stacktrace.sh to previous dmesg log (365.51 KB, text/plain)
2020-06-03 02:00 UTC, Cyrax
systemd journal from crash (15.42 KB, text/plain)
2020-06-06 01:29 UTC, yaomtc
Call DC_FP_START() / DC_FP_END() in dcn21_validate_bandwidth (1.72 KB, patch)
2021-02-11 07:48 UTC, Jan Kokemüller

Description Cyrax 2020-03-26 19:51:03 UTC
The whole system crashes with this error message: simd exception: 0000 [#1] PREEMPT SMP NOPTI

Only a REISUB (Magic SysRq) reboot recovers the machine.

The cause is the amdgpu driver.

---

Mar 26 20:47:13 shodan kernel: simd exception: 0000 [#1] PREEMPT SMP NOPTI
Mar 26 20:47:13 shodan kernel: CPU: 7 PID: 1344 Comm: Xorg Tainted: G        W  OE     5.5.11-arch1-1 #1
Mar 26 20:47:13 shodan kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.80 03/06/2019
Mar 26 20:47:13 shodan kernel: RIP: 0010:mode_support_and_system_configuration+0x30a3/0x4d90 [amdgpu]
Mar 26 20:47:13 shodan kernel: Code: 00 0f 28 c3 e8 7e c9 ff ff f3 41 0f 11 87 40 19 00 00 e9 12 fd ff ff 41 83 be a8 00 00 00 06 75 93 f3 41 0f 10 86 40 1b 00 00 <f3> 41 0f 5e 86 f8 17 00 00 e8 4f c9 ff ff 41 8b 87 80 04 00 00 f3
Mar 26 20:47:13 shodan kernel: RSP: 0018:ffffb216c1f3b978 EFLAGS: 00010246
Mar 26 20:47:13 shodan kernel: RAX: 0000000000000006 RBX: ffff9c120bbfadc4 RCX: 0000000000000004
Mar 26 20:47:13 shodan kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c120bbfb008
Mar 26 20:47:13 shodan kernel: RBP: ffff9c120bbfadc4 R08: ffff9c120bbfc164 R09: 0000000000000120
Mar 26 20:47:13 shodan kernel: R10: ffff9c120bbfaee4 R11: ffff9c120bbf0248 R12: ffff9c120bbfc63c
Mar 26 20:47:13 shodan kernel: R13: 0000000000000000 R14: ffff9c120bbfaf5c R15: ffff9c120bbfadc4
Mar 26 20:47:13 shodan kernel: FS:  00007f1c9f336dc0(0000) GS:ffff9c19009c0000(0000) knlGS:0000000000000000
Mar 26 20:47:13 shodan kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 26 20:47:13 shodan kernel: CR2: 00001f82bfec7fe0 CR3: 00000007cbe4a000 CR4: 00000000003406e0
Mar 26 20:47:13 shodan kernel: Call Trace:
Mar 26 20:47:13 shodan kernel:  dcn_validate_bandwidth+0xfe5/0x1f20 [amdgpu]
Mar 26 20:47:13 shodan kernel:  dc_validate_global_state+0x28a/0x310 [amdgpu]
Mar 26 20:47:13 shodan kernel:  amdgpu_dm_atomic_check+0x5d8/0x870 [amdgpu]
Mar 26 20:47:13 shodan kernel:  drm_atomic_check_only+0x578/0x800 [drm]
Mar 26 20:47:13 shodan kernel:  ? dm_crtc_duplicate_state+0x6b/0x1f0 [amdgpu]
Mar 26 20:47:13 shodan kernel:  drm_atomic_commit+0x13/0x50 [drm]
Mar 26 20:47:13 shodan kernel:  drm_atomic_helper_legacy_gamma_set+0x123/0x180 [drm_kms_helper]
Mar 26 20:47:13 shodan kernel:  drm_mode_gamma_set_ioctl+0x171/0x220 [drm]
Mar 26 20:47:13 shodan kernel:  ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
Mar 26 20:47:13 shodan kernel:  drm_ioctl_kernel+0xb2/0x100 [drm]
Mar 26 20:47:13 shodan kernel:  drm_ioctl+0x209/0x360 [drm]
Mar 26 20:47:13 shodan kernel:  ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
Mar 26 20:47:13 shodan kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Mar 26 20:47:13 shodan kernel:  do_vfs_ioctl+0x4b7/0x730
Mar 26 20:47:13 shodan kernel:  ksys_ioctl+0x5e/0x90
Mar 26 20:47:13 shodan kernel:  __x64_sys_ioctl+0x16/0x20
Mar 26 20:47:13 shodan kernel:  do_syscall_64+0x4e/0x150
Mar 26 20:47:13 shodan kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 26 20:47:13 shodan kernel: RIP: 0033:0x7f1ca01892eb
Mar 26 20:47:13 shodan kernel: Code: 0f 1e fa 48 8b 05 a5 8b 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 8b 0c 00 f7 d8 64 89 01 48
Mar 26 20:47:13 shodan kernel: RSP: 002b:00007ffc60ff5648 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
Mar 26 20:47:13 shodan kernel: RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f1ca01892eb
Mar 26 20:47:13 shodan kernel: RDX: 00007ffc60ff5700 RSI: 00000000c02064a5 RDI: 000000000000000a
Mar 26 20:47:13 shodan kernel: RBP: 00007ffc60ff5680 R08: 0000562bb635c080 R09: 0000562bb635c280
Mar 26 20:47:13 shodan kernel: R10: 0000562bb635be80 R11: 0000000000000206 R12: 0000000000000100
Mar 26 20:47:13 shodan kernel: R13: 0000562bb6ab4f70 R14: 0000562bb635b9c0 R15: 0000000000000100
Mar 26 20:47:13 shodan kernel: Modules linked in: snd_seq_dummy snd_seq bluetooth ecdh_generic rfkill ecc veth fuse iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi tun ip6table_mangle xt_MASQUERADE iptable_nat nf_nat xt_connmark iptable_mangle xt_helper xt_NFLOG xt_limit xt_conntrack xt_tcpudp nf_conntrack_ftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_irc nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) pktcdvd nfnetlink_log nfnetlink ip6table_filter nct6775 ip6_tables hwmon_vid iptable_filter edac_mce_amd kvm_amd ccp ext4 rng_core kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi crc16 mbcache irqbypass mxm_wmi jbd2 snd_hda_intel wmi_bmof snd_intel_dspcfg snd_hda_codec snd_usb_audio crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core uvcvideo snd_usbmidi_lib snd_rawmidi videobuf2_vmalloc videobuf2_memops snd_seq_device videobuf2_v4l2 aesni_intel snd_hwdep videobuf2_common crypto_simd snd_pcm mousedev cryptd glue_helper
Mar 26 20:47:13 shodan kernel:  input_leds sp5100_tco snd_timer igb k10temp pcspkr i2c_piix4 snd soundcore dca wmi evdev mac_hid gpio_amdpt pinctrl_amd acpi_cpufreq xt_mark v4l2loopback(OE) videodev mc usbmon nbd msr vhba(OE) sr_mod cdrom sg br_netfilter bridge stp llc ip_tables x_tables dm_mod btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq sd_mod hid_generic usbhid hid crc32c_intel ahci libahci libata xhci_pci xhci_hcd scsi_mod amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper serio_raw syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart i8042 atkbd libps2 serio
Mar 26 20:47:13 shodan kernel: ---[ end trace e34593e526e29a3d ]---
Mar 26 20:47:13 shodan kernel: RIP: 0010:mode_support_and_system_configuration+0x30a3/0x4d90 [amdgpu]
Mar 26 20:47:13 shodan kernel: Code: 00 0f 28 c3 e8 7e c9 ff ff f3 41 0f 11 87 40 19 00 00 e9 12 fd ff ff 41 83 be a8 00 00 00 06 75 93 f3 41 0f 10 86 40 1b 00 00 <f3> 41 0f 5e 86 f8 17 00 00 e8 4f c9 ff ff 41 8b 87 80 04 00 00 f3
Mar 26 20:47:13 shodan kernel: RSP: 0018:ffffb216c1f3b978 EFLAGS: 00010246
Mar 26 20:47:13 shodan kernel: RAX: 0000000000000006 RBX: ffff9c120bbfadc4 RCX: 0000000000000004
Mar 26 20:47:13 shodan kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c120bbfb008
Mar 26 20:47:13 shodan kernel: RBP: ffff9c120bbfadc4 R08: ffff9c120bbfc164 R09: 0000000000000120
Mar 26 20:47:13 shodan kernel: R10: ffff9c120bbfaee4 R11: ffff9c120bbf0248 R12: ffff9c120bbfc63c
Mar 26 20:47:13 shodan kernel: R13: 0000000000000000 R14: ffff9c120bbfaf5c R15: ffff9c120bbfadc4
Mar 26 20:47:13 shodan kernel: FS:  00007f1c9f336dc0(0000) GS:ffff9c19009c0000(0000) knlGS:0000000000000000
Mar 26 20:47:13 shodan kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 26 20:47:13 shodan kernel: CR2: 00001f82bfec7fe0 CR3: 00000007cbe4a000 CR4: 00000000003406e0
Comment 1 Alex Deucher 2020-03-26 19:54:30 UTC
Please attach your full dmesg output.  What version of gcc are you using?
Comment 2 Cyrax 2020-03-26 21:36:12 UTC
Created attachment 288079 [details]
dmesg output
Comment 3 Cyrax 2020-03-26 21:37:17 UTC
GCC is "gcc (Arch Linux 9.3.0-1) 9.3.0"
Comment 4 Cyrax 2020-04-04 07:40:17 UTC
Created attachment 288203 [details]
dmesg output 2

The crash happened again. Before it, I had used VLC, played a game (GZDoom), and tried to listen to a YouTube playlist using a combination of youtube-dl, ffmpeg and mpv.

I also updated the motherboard's BIOS/firmware to the latest version.
Comment 5 Cyrax 2020-04-04 07:42:04 UTC
Oh, and the kernel is at version 5.5.13.
Comment 6 Cyrax 2020-04-18 13:19:16 UTC
Created attachment 288595 [details]
dmesg output

And another one. It seems that switching between virtual consoles triggers this bug.
Comment 7 farmboy0 2020-04-19 11:42:49 UTC
I sometimes hit the same problem during start/exit of SteamVR.
I have observed it with the 5.6 kernels.
My card is a Navi RX 5700 XT.
Comment 8 farmboy0 2020-04-19 11:43:47 UTC
Created attachment 288615 [details]
smesg output
Comment 9 Cyrax 2020-04-23 05:15:27 UTC
Created attachment 288679 [details]
dmesg output

And again.
Comment 10 Cyrax 2020-04-25 08:44:00 UTC
Created attachment 288719 [details]
gdb disassembler dump around mode_support_and_system_configuration

And it happened again. It looks like something goes wrong some time after the computer monitor is turned on.
Comment 11 Cyrax 2020-04-27 19:20:31 UTC
Created attachment 288781 [details]
dmesg output from Linux 5.7-rc3

This is becoming a real problem; I can't do anything remotely productive. The crash happens within about 12 hours (give or take) of the reboot that followed the previous crash.

I'm running four LXC containers, which I have set up to run GUI programs on the host system by following this guide: https://wiki.archlinux.org/index.php/Linux_Containers#Xorg_program_considerations_(optional)

I am also running VirtualBox, but its VMs don't access the host's 3D functions at all.
Comment 12 Cyrax 2020-05-02 14:18:02 UTC
Created attachment 288873 [details]
dmesg from 5.6.8

Additionally, the dmesg output shows this line: note: kworker/0:3[2251663] exited with preempt_count 1

It seems that this bug occurs when the monitor is turned off and then on repeatedly with a short delay in between.
Comment 13 Cyrax 2020-05-23 01:52:24 UTC
Created attachment 289237 [details]
kernel log dumped from crash dump by using crash utility
Comment 14 Cyrax 2020-05-23 01:56:07 UTC
Created attachment 289239 [details]
backtrace created by executing bt -f command in crash utility
Comment 15 Cyrax 2020-05-23 01:58:36 UTC
Created attachment 289241 [details]
dump of struct dcn_bw_internal_vars
Comment 16 Petteri Aimonen 2020-05-28 14:17:29 UTC
I hit the same issue, using Ubuntu 20.04. It happened when switching window to Firefox. For me it only crashed Xorg, ssh to the machine still worked ok. Killing Xorg didn't work and `shutdown -r now` hung up somewhere.

Here is a bug report on the Ubuntu package: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1881134

Here is call trace decoded with the debug symbols:

--

[455834.385061] Call Trace:
[455834.385120] mode_support_and_system_configuration (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calc_auto.c:176) amdgpu
[455834.385174] ? calculate_inits_and_adj_vp (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_resource.c:950 (discriminator 12)) amdgpu
[455834.385230] dcn_validate_bandwidth (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:1034) amdgpu
[455834.385283] dc_validate_global_state (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_resource.c:2093) amdgpu
[455834.385338] amdgpu_dm_atomic_check (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7413) amdgpu
[455834.385351] drm_atomic_check_only (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_atomic.c:1179) drm
[455834.385361] drm_atomic_commit (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_atomic.c:1220) drm
[455834.385370] drm_mode_obj_set_property_ioctl (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:496 /build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:533) drm
[455834.385379] ? drm_mode_obj_find_prop_id (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:512) drm
[455834.385386] drm_ioctl_kernel (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_ioctl.c:793) drm
[455834.385394] drm_ioctl (/build/linux-FFoizL/linux-5.4.0/include/linux/thread_info.h:119 /build/linux-FFoizL/linux-5.4.0/include/linux/thread_info.h:152 /build/linux-FFoizL/linux-5.4.0/include/linux/uaccess.h:151 /build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_ioctl.c:888) drm
[455834.385402] ? drm_mode_obj_find_prop_id (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:512) drm
[455834.385406] ? recalc_sigpending (/build/linux-FFoizL/linux-5.4.0/kernel/signal.c:184) 
[455834.385440] amdgpu_drm_ioctl (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:1293) amdgpu
[455834.385443] do_vfs_ioctl (/build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:47 /build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:510 /build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:697) 
[455834.385444] ? recalc_sigpending (/build/linux-FFoizL/linux-5.4.0/kernel/signal.c:184) 
[455834.385446] ? _copy_from_user (/build/linux-FFoizL/linux-5.4.0/arch/x86/include/asm/uaccess_64.h:46 /build/linux-FFoizL/linux-5.4.0/arch/x86/include/asm/uaccess_64.h:71 /build/linux-FFoizL/linux-5.4.0/lib/usercopy.c:14) 
[455834.385448] ksys_ioctl (/build/linux-FFoizL/linux-5.4.0/include/linux/file.h:43 /build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:715) 
[455834.385449] __x64_sys_ioctl (/build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:719) 
[455834.385451] do_syscall_64 (/build/linux-FFoizL/linux-5.4.0/arch/x86/entry/common.c:290) 
[455834.385455] entry_SYSCALL_64_after_hwframe (/build/linux-FFoizL/linux-5.4.0/arch/x86/entry/entry_64.S:184) 
[455834.385456] RIP: 0033:0x7faf3181837b
Comment 17 Petteri Aimonen 2020-05-28 16:05:44 UTC
Created attachment 289381 [details]
dmesg from kernel 5.4.0-31
Comment 18 Petteri Aimonen 2020-05-28 16:24:21 UTC
As best as I can tell, the crash seems to be caused by some floating point exception (such as underflow/overflow) in this function call in dcn_calc_auto.c line 176:

dcn_bw_ceil2(v->byte_per_pixel_in_dety[k], 1.0)

In dcn_bw_ceil2() the exception occurs in this instruction:

addsd  0x0(%rip),%xmm3

which is performing the addition flr + 0.00001.
At this point %xmm3 is ((int)(v->byte_per_pixel_in_dety[k] / 1.0)) * 1.0
The variable byte_per_pixel_in_dety is only assigned constant values 1.0, 2.0, 4.0, 8.0 so
I don't see any reason for addsd to cause a simd exception. I'm not sure if the exception
is precise or if it could be delayed from some prior instruction, but AFAIK it should be
precise because in usermode the exception handler would attempt a recovery.
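
For reference, the arithmetic described above can be modelled outside the kernel. The following is a minimal Python sketch reconstructed from the description above, not the kernel source; the names bw_floor2, bw_ceil2 and addition_is_exact are hypothetical. It also checks whether flr + 0.00001 is exactly representable in double precision, since an inexact result sets the SSE precision flag, which only traps if that exception is unmasked in MXCSR.

```python
from fractions import Fraction

EPS = 0.00001  # held as a double, like the addsd memory operand


def bw_floor2(arg, significance):
    """Round arg down to a multiple of significance (0 if significance is 0)."""
    if significance == 0:
        return 0.0
    # int() truncates toward zero, like the C cast in the description
    return int(arg / significance) * significance


def bw_ceil2(arg, significance):
    """Round arg up to a multiple of significance, with a small tolerance."""
    if significance == 0:
        return 0.0
    flr = bw_floor2(arg, significance)
    return arg if flr + EPS >= arg else flr + significance


def addition_is_exact(a, b):
    """True if the double-precision sum a + b needs no rounding."""
    return Fraction(a) + Fraction(b) == Fraction(a + b)


# The only values assigned to byte_per_pixel_in_dety according to the text
for value in (1.0, 2.0, 4.0, 8.0):
    print(value, bw_ceil2(value, 1.0), addition_is_exact(value, EPS))
```

For each of the four constants the sum needs rounding, so the addition is harmless with the default MXCSR but could trap if the precision exception were unmasked.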

Having XMM3 or MXCSR values would help, but they don't seem to get included in the dmesg output and I'm not sure if they are available in a crash dump either.
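
If those values become available, they could be interpreted along the following lines. This is a minimal sketch: the bit layout (exception flags in bits 0-5, mask bits in bits 7-12, power-on default 0x1F80) is the standard MXCSR layout, while the function names and the sample inputs are made up for illustration.

```python
import struct

# Order matches MXCSR flag bits 0..5; the mask bit for each is 7 bits higher
EXCEPTIONS = ["invalid", "denormal", "divide-by-zero",
              "overflow", "underflow", "precision"]


def decode_mxcsr(value):
    """Split an MXCSR value into pending flags and unmasked exceptions."""
    flags = [n for i, n in enumerate(EXCEPTIONS) if value & (1 << i)]
    unmasked = [n for i, n in enumerate(EXCEPTIONS)
                if not value & (1 << (7 + i))]
    return {"flags": flags, "unmasked": unmasked}


def xmm_low_double(hex_qword):
    """Interpret a 16-digit hex dump of the low 64 bits of an XMM register
    as an IEEE 754 double."""
    return struct.unpack(">d", bytes.fromhex(hex_qword))[0]


# Sample inputs (illustrative): the default MXCSR and the bit pattern of 1.0
print(decode_mxcsr(0x1F80))
print(xmm_low_double("3ff0000000000000"))
```

With the default value, nothing is pending and nothing is unmasked; any entry in "unmasked" would mean the corresponding SSE exception can trap.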

Google search turned up https://beowulf.beowulf.narkive.com/tAHxVcs0/simd-exception-kernel-panic-on-skylake-ep-triggered-by-openfoam where the exception was delayed for some reason.

Analyzing the dmesgs attached to this bug report, we have the following crash locations:

Cyrax    2020-03-26 21:36: divss  xmm0,DWORD PTR [r14+0x17f8]
Cyrax    2020-04-04 07:40: divss  xmm0,DWORD PTR [r14+0x17f8]
Cyrax    2020-04-18 13:19: divss  xmm0,DWORD PTR [r14+0x17f8]
farmboy0 2020-04-19 11:43: not a simd exception
Cyrax    2020-04-23 05:15: divss  xmm0,DWORD PTR [r14+0x17f8]
Cyrax    2020-04-27 19:20: divss  xmm0,DWORD PTR [r14+0x17f8]
Cyrax    2020-05-02 14:18: divss  xmm0,DWORD PTR [r14+0x17f8]
PetteriA 2020-05-28 16:05: addsd  xmm3,QWORD PTR [rip+0x1de967]

So the crash locations appear fairly consistent for Cyrax's machine, but no two machines have the same location.
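
The faulting instruction can also be read straight from the "Code:" line of an oops, where the byte in angle brackets is the first byte of the instruction at RIP. Below is a minimal Python sketch using the tail of the Code line from the original description; the helper name faulting_bytes is hypothetical, and the manual decoding in the comments covers only this one instruction form.

```python
def faulting_bytes(code_line):
    """Return the bytes from the <..>-marked byte to the end of the dump."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in tokens[start:])


# Tail of the "Code:" line from the first oops in this report
code = "41 0f 10 86 40 1b 00 00 <f3> 41 0f 5e 86 f8 17 00 00 e8 4f c9 ff ff"
insn = faulting_bytes(code)

# f3 0f 5e is DIVSS; the 41 in between is a REX prefix with the B bit set.
# ModRM 0x86 = mod 10 (disp32 follows), reg 000 (xmm0), rm 110 (r14 with REX.B).
modrm = insn[4]
disp = int.from_bytes(insn[5:9], "little")
print(hex(modrm), hex(disp))
```

The displacement decodes to 0x17f8, which agrees with the divss xmm0,DWORD PTR [r14+0x17f8] entries listed for Cyrax's machine.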

For other users affected by this problem, it could be helpful if you install kernel debugging symbols and use decode_stacktrace.sh to convert the raw stack trace to code locations.

Also reported on freedesktop amd bugtracker: https://gitlab.freedesktop.org/drm/amd/-/issues/1154
Comment 20 yaomtc 2020-06-02 03:50:14 UTC
So far so good Alex. Using the RX 5700 XT as well. Previously, running SteamVR could pretty quickly crash my system (even before launching a game), and since I rebuilt linux-mainline from AUR, haven't had SteamVR crash my system yet. Fingers crossed that this continues. 

Though Half-Life: Alyx is causing a system crash, which can even happen on Windows with Vulkan apparently! Wow. At least that's not an AMD or Linux specific issue. https://github.com/ValveSoftware/SteamVR-for-Linux/issues/356
Comment 21 Cyrax 2020-06-03 01:34:18 UTC
Created attachment 289479 [details]
dmesg output kernel 5.7.0
Comment 22 Cyrax 2020-06-03 01:35:36 UTC
Created attachment 289481 [details]
config file used to build kernel 5.7.0 with KASAN etc
Comment 23 Cyrax 2020-06-03 02:00:53 UTC
Created attachment 289483 [details]
used decode_stacktrace.sh to previous dmesg log
Comment 24 Cyrax 2020-06-03 02:28:02 UTC
(In reply to Petteri Aimonen from comment #16)
> I hit the same issue, using Ubuntu 20.04. It happened when switching window
> to Firefox. For me it only crashed Xorg, ssh to the machine still worked ok.
> Killing Xorg didn't work and `shutdown -r now` hung up somewhere.
> 
> Here is a bug report on the Ubuntu package:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1881134
> 
> Here is call trace decoded with the debug symbols:
> 
[clip]

Yeah, it happens when switching windows and/or switching to a different workspace. And yes, it crashes only Xorg; everything else continues to work as usual, but issuing a reboot command via SSH won't actually reboot the machine. Only REISUB brings the machine back to a usable state.
Comment 25 Petteri Aimonen 2020-06-03 05:14:49 UTC
Looks like there are two kinds of crash bugs here. Many of the amdgpu crashes have been fixed in 5.7.0, but the specific one that gives "simd exception" in dmesg is not.

@Cyrax There is an experimental patch in https://bugzilla.kernel.org/show_bug.cgi?id=207979 if you want to try.

Out of interest, are you possibly running a 32-bit operating system under virtualization on 64-bit host? That's what triggers the bug for me.
Comment 26 Cyrax 2020-06-03 11:05:07 UTC
(In reply to Petteri Aimonen from comment #25)
> Looks like there are two kinds of crash bugs here. Many of the amdgpu
> crashes have been fixed in 5.7.0, but the specific one that gives "simd
> exception" in dmesg is not.
> 
> @Cyrax There is an experimental patch in
> https://bugzilla.kernel.org/show_bug.cgi?id=207979 if you want to try.
> 
> Out of interest, are you possibly running a 32-bit operating system under
> virtualization on 64-bit host? That's what triggers the bug for me.

I'm running one 32-bit LXC container (Arch Linux 32, https://archlinux32.org/) and three 64-bit LXC containers (Arch Linux). Additionally, I'm running three VirtualBox guests: Windows, Arch Linux, and an old version of the LEDE (OpenWrt) router OS (all running 64-bit OSes).
Comment 27 yaomtc 2020-06-06 01:29:29 UTC
Created attachment 289535 [details]
systemd journal from crash

Update: got a whole system crash again when I was starting up SteamVR. So I guess the issue wasn't resolved for me. It could have reduced the likelihood maybe, or it was luck?

Not sure what else to attach here, but I copied journal entries from the time of the crash (which happens at 21:09:31 near the end). Let me know if there's something else I should attach the next time this happens, if more data would be helpful.
Comment 28 Petteri Aimonen 2020-06-06 06:42:57 UTC
@yaomtc Your bug seems to be some separate issue, as the log does not have the "simd exception" or "mode_support_and_system_configuration" entries in it. It looks more similar to this bug here: https://gitlab.freedesktop.org/drm/amd/-/issues/1149
Comment 29 Alexander Kernozhitsky 2020-07-03 22:22:34 UTC
I encountered this bug today. When running specific graphical applications, the machine hangs, and the kernel log reports a simd exception.

It started to occur after the upgrade to the 5.7.6 kernel.

I tried to apply the patch mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=207979, and the patch resolves the issue for me.

Using AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx.
Comment 30 Cyrax 2020-07-15 16:12:51 UTC
The patch in https://bugzilla.kernel.org/show_bug.cgi?id=207979 works beautifully.
19 days of heavy usage without a system crash on a patched 5.7.6 kernel.
Comment 31 Alex Deucher 2020-07-17 04:40:45 UTC
Duplicate of bug 207979.
Comment 32 Cyrax 2020-07-23 01:47:16 UTC
The fix is in the stable 5.7.10 kernel.

*** This bug has been marked as a duplicate of bug 207979 ***
Comment 33 krakopo 2020-08-19 06:37:48 UTC
I'm seeing this on an AMD Ryzen 4500U laptop running 5.8.1 (Arch Linux 5.8.1-arch1-1). I can repro fairly consistently when running a 64-bit KVM virtual machine.

The kernel I'm running has the commit which should resolve this:
7ad816762f9b ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()")

Confirmed patch is in my kernel:
https://git.archlinux.org/linux.git/tree/arch/x86/kernel/fpu/core.c?h=v5.8.1-arch1#n106

Here is what I see in dmesg:

Aug 18 20:25:49 archpad kernel: simd exception: 0000 [#1] PREEMPT SMP NOPTI
Aug 18 20:25:49 archpad kernel: CPU: 0 PID: 509 Comm: Xorg Not tainted 5.8.1-arch1-1 #1
Aug 18 20:25:49 archpad kernel: Hardware name: LENOVO 81W4/LNVNB161216, BIOS DZCN19WW 04/13/2020
Aug 18 20:25:49 archpad kernel: RIP: 0010:dcn_bw_ceil2+0x35/0x60 [amdgpu]
Aug 18 20:25:49 archpad kernel: Code: cd 7b 3e 0f 28 d0 66 0f ef db 66 0f ef e4 f3 0f 5e d1 f3 0f 5a e0 f3 0f 2c c2 66 0f ef d2 f3 0f 2a d0 f3 0f 59 d1 f3 0f 5a da <f2> 0f 58 1d 5b 19 2e 00 66 0f 2f dc 72 01 c3 f3 0f 58 ca 0f 28 c1
Aug 18 20:25:49 archpad kernel: RSP: 0018:ffffb8fac07035f8 EFLAGS: 00010202
Aug 18 20:25:49 archpad kernel: RAX: 0000000000000004 RBX: 0000000000000000 RCX: 0000000000000780
Aug 18 20:25:49 archpad kernel: RDX: ffff97ebd0a63080 RSI: ffff97ebd0a69560 RDI: 0000000044444440
Aug 18 20:25:49 archpad kernel: RBP: ffff97ebd0a631c0 R08: ffff97ebd0a633b4 R09: 0000000000000000
Aug 18 20:25:49 archpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff97ebd0a63360
Aug 18 20:25:49 archpad kernel: R13: 0000000000000001 R14: ffff97ebd0a62188 R15: ffff97ebd0a62028
Aug 18 20:25:49 archpad kernel: FS:  00007f8787a65940(0000) GS:ffff97ec47400000(0000) knlGS:0000000000000000
Aug 18 20:25:49 archpad kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 18 20:25:49 archpad kernel: CR2: 0000000800880000 CR3: 00000001f9040000 CR4: 0000000000340ef0
Aug 18 20:25:49 archpad kernel: Call Trace:
Aug 18 20:25:49 archpad kernel:  dml21_ModeSupportAndSystemConfigurationFull+0x437/0x5cf0 [amdgpu]
Aug 18 20:25:49 archpad kernel:  ? sysvec_apic_timer_interrupt+0x46/0xe0
Aug 18 20:25:49 archpad kernel:  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
Aug 18 20:25:49 archpad kernel:  ? sched_clock+0x5/0x10
Aug 18 20:25:49 archpad kernel:  ? sched_clock_local+0x12/0x80
Aug 18 20:25:49 archpad kernel:  ? amdgpu_sa_bo_new+0xbc/0x550 [amdgpu]
Aug 18 20:25:49 archpad kernel:  ? sched_clock_cpu+0xae/0xd0
Aug 18 20:25:49 archpad kernel:  ? kmem_cache_alloc_trace+0x17c/0x220
Aug 18 20:25:49 archpad kernel:  ? amdgpu_sa_bo_new+0xbc/0x550 [amdgpu]
Aug 18 20:25:49 archpad kernel:  ? _raw_spin_unlock+0x16/0x30
Aug 18 20:25:49 archpad kernel:  ? preempt_count_add+0x49/0xa0
Aug 18 20:25:49 archpad kernel:  ? kernel_init_free_pages+0x6d/0x90
Aug 18 20:25:49 archpad kernel:  ? prep_new_page+0xa2/0xb0
Aug 18 20:25:49 archpad kernel:  ? get_page_from_freelist+0xfa8/0x1220
Aug 18 20:25:49 archpad kernel:  ? __mod_zone_page_state+0x66/0xa0
Aug 18 20:25:49 archpad kernel:  ? hubbub2_get_dcc_compression_cap+0xa8/0x270 [amdgpu]
Aug 18 20:25:49 archpad kernel:  ? fill_plane_buffer_attributes+0x26f/0x420 [amdgpu]
Aug 18 20:25:49 archpad kernel:  dml_get_voltage_level+0x116/0x1e0 [amdgpu]
Aug 18 20:25:49 archpad kernel:  dcn20_fast_validate_bw+0x359/0x680 [amdgpu]
Aug 18 20:25:49 archpad kernel:  ? resource_build_scaling_params+0xc44/0x11a0 [amdgpu]
Aug 18 20:25:49 archpad kernel:  dcn21_validate_bandwidth+0xcd/0x2a0 [amdgpu]
Aug 18 20:25:49 archpad kernel:  dc_validate_global_state+0x2f2/0x390 [amdgpu]
Aug 18 20:25:49 archpad kernel:  amdgpu_dm_atomic_check+0xefb/0x1010 [amdgpu]
Aug 18 20:25:49 archpad kernel:  drm_atomic_check_only+0x57c/0x7f0 [drm]
Aug 18 20:25:49 archpad kernel:  ? __drm_atomic_helper_crtc_duplicate_state+0x85/0xd0 [drm_kms_helper]
Aug 18 20:25:49 archpad kernel:  drm_atomic_commit+0x13/0x50 [drm]
Aug 18 20:25:49 archpad kernel:  drm_atomic_helper_legacy_gamma_set+0x123/0x180 [drm_kms_helper]
Aug 18 20:25:49 archpad kernel:  drm_mode_gamma_set_ioctl+0x19a/0x230 [drm]
Aug 18 20:25:49 archpad kernel:  ? drm_color_lut_check+0xa0/0xa0 [drm]
Aug 18 20:25:49 archpad kernel:  drm_ioctl_kernel+0xb2/0x100 [drm]
Aug 18 20:25:49 archpad kernel:  drm_ioctl+0x208/0x360 [drm]
Aug 18 20:25:49 archpad kernel:  ? drm_color_lut_check+0xa0/0xa0 [drm]
Aug 18 20:25:49 archpad kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Aug 18 20:25:49 archpad kernel:  ksys_ioctl+0x82/0xc0
Aug 18 20:25:49 archpad kernel:  __x64_sys_ioctl+0x16/0x20
Aug 18 20:25:49 archpad kernel:  do_syscall_64+0x44/0x70
Aug 18 20:25:49 archpad kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 18 20:25:49 archpad kernel: RIP: 0033:0x7f87887888eb
Aug 18 20:25:49 archpad kernel: Code: 0f 1e fa 48 8b 05 a5 95 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 95 0c 00 f7 d8 64 89 01 48
Aug 18 20:25:49 archpad kernel: RSP: 002b:00007ffc92f3a9a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 18 20:25:49 archpad kernel: RAX: ffffffffffffffda RBX: 00007ffc92f3a9e0 RCX: 00007f87887888eb
Aug 18 20:25:49 archpad kernel: RDX: 00007ffc92f3a9e0 RSI: 00000000c02064a5 RDI: 000000000000000a
Aug 18 20:25:49 archpad kernel: RBP: 00000000c02064a5 R08: 00005627eb36eb10 R09: 00005627eb36ed10
Aug 18 20:25:49 archpad kernel: R10: 00005627eb36e910 R11: 0000000000000246 R12: 0000000000000100
Aug 18 20:25:49 archpad kernel: R13: 000000000000000a R14: 0000000000000100 R15: 0000000000000100
Aug 18 20:25:49 archpad kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c tun bridge hid_multitouch hid_generic 8021q garp mrp stp llc ebtable_filter ebtables snd_acp3x_rn snd_soc_dmic snd_acp3x_pdm_dma snd_soc_c>
Aug 18 20:25:49 archpad kernel:  drm_kms_helper btintel snd_hwdep i2c_hid hid videobuf2_common cec nls_iso8859_1 snd_pcm rc_core nls_cp437 bluetooth cfg80211 snd_timer syscopyarea videodev ideapad_laptop snd_rn_pci_acp3x sysfillrect vfat ecdh_generic snd sysimgblt tpm_crb snd_pci_acp3x sparse_keymap fat ecc mc fb_sys_fops tpm_tis soundcore ccp rfkill libarc4 wmi battery tpm_tis>
Aug 18 20:25:49 archpad kernel: ---[ end trace 76f111d732bc1b57 ]---
Aug 18 20:25:49 archpad kernel: RIP: 0010:dcn_bw_ceil2+0x35/0x60 [amdgpu]
Aug 18 20:25:49 archpad kernel: Code: cd 7b 3e 0f 28 d0 66 0f ef db 66 0f ef e4 f3 0f 5e d1 f3 0f 5a e0 f3 0f 2c c2 66 0f ef d2 f3 0f 2a d0 f3 0f 59 d1 f3 0f 5a da <f2> 0f 58 1d 5b 19 2e 00 66 0f 2f dc 72 01 c3 f3 0f 58 ca 0f 28 c1
Aug 18 20:25:49 archpad kernel: RSP: 0018:ffffb8fac07035f8 EFLAGS: 00010202
Aug 18 20:25:49 archpad kernel: RAX: 0000000000000004 RBX: 0000000000000000 RCX: 0000000000000780
Aug 18 20:25:49 archpad kernel: RDX: ffff97ebd0a63080 RSI: ffff97ebd0a69560 RDI: 0000000044444440
Aug 18 20:25:49 archpad kernel: RBP: ffff97ebd0a631c0 R08: ffff97ebd0a633b4 R09: 0000000000000000
Aug 18 20:25:49 archpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff97ebd0a63360
Aug 18 20:25:49 archpad kernel: R13: 0000000000000001 R14: ffff97ebd0a62188 R15: ffff97ebd0a62028
Aug 18 20:25:49 archpad kernel: FS:  00007f8787a65940(0000) GS:ffff97ec47400000(0000) knlGS:0000000000000000
Aug 18 20:25:49 archpad kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 18 20:25:49 archpad kernel: CR2: 0000000800880000 CR3: 00000001f9040000 CR4: 0000000000340ef0

$ objdump -d amdgpu.ko
...
00000000001b83c0 <dcn_bw_ceil2>:
  1b83c0:       e8 00 00 00 00          callq  1b83c5 <dcn_bw_ceil2+0x5>
  1b83c5:       66 0f ef ed             pxor   %xmm5,%xmm5
  1b83c9:       0f 2e cd                ucomiss %xmm5,%xmm1
  1b83cc:       7b 3e                   jnp    1b840c <dcn_bw_ceil2+0x4c>
  1b83ce:       0f 28 d0                movaps %xmm0,%xmm2
  1b83d1:       66 0f ef db             pxor   %xmm3,%xmm3
  1b83d5:       66 0f ef e4             pxor   %xmm4,%xmm4
  1b83d9:       f3 0f 5e d1             divss  %xmm1,%xmm2
  1b83dd:       f3 0f 5a e0             cvtss2sd %xmm0,%xmm4
  1b83e1:       f3 0f 2c c2             cvttss2si %xmm2,%eax
  1b83e5:       66 0f ef d2             pxor   %xmm2,%xmm2
  1b83e9:       f3 0f 2a d0             cvtsi2ss %eax,%xmm2
  1b83ed:       f3 0f 59 d1             mulss  %xmm1,%xmm2
  1b83f1:       f3 0f 5a da             cvtss2sd %xmm2,%xmm3
  1b83f5:       f2 0f 58 1d 00 00 00    addsd  0x0(%rip),%xmm3        # 1b83fd <dcn_bw_ceil2+0x3d>
  1b83fc:       00 
  1b83fd:       66 0f 2f dc             comisd %xmm4,%xmm3
  1b8401:       72 01                   jb     1b8404 <dcn_bw_ceil2+0x44>
  1b8403:       c3                      retq   
  1b8404:       f3 0f 58 ca             addss  %xmm2,%xmm1
  1b8408:       0f 28 c1                movaps %xmm1,%xmm0
  1b840b:       c3                      retq   
  1b840c:       75 c0                   jne    1b83ce <dcn_bw_ceil2+0xe>
  1b840e:       66 0f ef c0             pxor   %xmm0,%xmm0
  1b8412:       c3                      retq   
  1b8413:       66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  1b841a:       00 00 00 00 
  1b841e:       66 90                   xchg   %ax,%ax
...

Instruction at RIP: 0010:dcn_bw_ceil2+0x35:

>>> hex(0x00000000001b83c0 + 0x35)
'0x1b83f5'

  1b83f5:       f2 0f 58 1d 00 00 00    addsd  0x0(%rip),%xmm3        # 1b83fd <dcn_bw_ceil2+0x3d>

Same addsd instruction that was mentioned above.
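
The same lookup can be scripted and cross-checked against the "Code:" bytes. This is a rough Python sketch: parse_rip is a hypothetical helper, symbol_start is the dcn_bw_ceil2 address taken from the objdump listing above, and the two byte strings are copied from the oops and from objdump respectively.

```python
import re


def parse_rip(text):
    """Split 'symbol+0xOFF/0xSIZE' into (symbol, offset, size)."""
    m = re.fullmatch(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text)
    if m is None:
        raise ValueError("unexpected RIP format: " + text)
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


name, offset, size = parse_rip("dcn_bw_ceil2+0x35/0x60")
symbol_start = 0x1b83c0  # from the objdump listing
fault_addr = symbol_start + offset

# Bytes at the <..> marker in the oops versus the bytes objdump shows
oops_bytes = bytes.fromhex("f2 0f 58 1d 5b 19 2e 00")
objdump_bytes = bytes.fromhex("f2 0f 58 1d 00 00 00 00")
same_opcode = oops_bytes[:4] == objdump_bytes[:4]
loaded_disp = int.from_bytes(oops_bytes[4:], "little")

print(name, hex(fault_addr), same_opcode, hex(loaded_disp))
```

The opcode and ModRM bytes agree; only the RIP-relative displacement differs, presumably because the .ko is an unlinked object whose displacement is filled in by a relocation when the module is loaded.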
Comment 34 Petteri Aimonen 2020-08-19 06:51:32 UTC
@krakopo Can you apply the debug info patch from here? https://bugzilla.kernel.org/attachment.cgi?id=289421&action=diff

What kernel are you running inside the KVM virtual machine? I wonder if the virtual machine has the MXCSR problem, perhaps it could be leaking to the host somehow.
Comment 35 krakopo 2020-08-20 03:30:51 UTC
@Petteri

I'm running DragonFly BSD 5.8.1 in my KVM virtual machine.

Here is the dmesg output with the debug info patch applied:

Aug 19 23:18:03 archpad kernel: MXCSR: 00000020 XMM3: 4010000000000000
Aug 19 23:18:03 archpad kernel: simd exception: 0000 [#1] PREEMPT SMP NOPTI
Aug 19 23:18:03 archpad kernel: CPU: 5 PID: 518 Comm: Xorg Not tainted 5.8.1-arch1206987 #1
Aug 19 23:18:03 archpad kernel: Hardware name: LENOVO 81W4/LNVNB161216, BIOS DZCN19WW 04/13/2020
Aug 19 23:18:03 archpad kernel: RIP: 0010:dcn_bw_ceil2+0x35/0x60 [amdgpu]
Aug 19 23:18:03 archpad kernel: Code: cd 7b 3e 0f 28 d0 66 0f ef db 66 0f ef e4 f3 0f 5e d1 f3 0f 5a e0 f3 0f 2c c2 66 0f ef d2 f3 0f 2a d0 f3 0f 59 d1 f3 0f 5a da <f2> 0f 58 1d 5b 19 2e 00 66 0f 2f dc 72 01 c3 f3 0f 58 ca 0f 28 c1
Aug 19 23:18:03 archpad kernel: RSP: 0018:ffff9e24c10775f8 EFLAGS: 00010202
Aug 19 23:18:03 archpad kernel: RAX: 0000000000000004 RBX: 0000000000000000 RCX: 0000000000000780
Aug 19 23:18:03 archpad kernel: RDX: ffff93d6d0683080 RSI: ffff93d6d0689560 RDI: 0000000044444440
Aug 19 23:18:03 archpad kernel: RBP: ffff93d6d06831c0 R08: ffff93d6d06833b4 R09: 0000000000000000
Aug 19 23:18:03 archpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff93d6d0683360
Aug 19 23:18:03 archpad kernel: R13: 0000000000000001 R14: ffff93d6d0682188 R15: ffff93d6d0682028
Aug 19 23:18:03 archpad kernel: FS:  00007f222278d940(0000) GS:ffff93d707740000(0000) knlGS:0000000000000000
Aug 19 23:18:03 archpad kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 19 23:18:03 archpad kernel: CR2: 0000000800ea8030 CR3: 00000002021ca000 CR4: 0000000000340ee0
Aug 19 23:18:03 archpad kernel: Call Trace:
Aug 19 23:18:03 archpad kernel:  dml21_ModeSupportAndSystemConfigurationFull+0x437/0x5cf0 [amdgpu]
Aug 19 23:18:03 archpad kernel:  ? cpufreq_this_cpu_can_update+0xe/0x50
Aug 19 23:18:03 archpad kernel:  ? sugov_update_single+0x58/0x210
Aug 19 23:18:03 archpad kernel:  ? sugov_get_util+0xf0/0xf0
Aug 19 23:18:03 archpad kernel:  ? update_blocked_averages+0x539/0x620
Aug 19 23:18:03 archpad kernel:  ? update_group_capacity+0x25/0x1c0
Aug 19 23:18:03 archpad kernel:  ? cpumask_next_and+0x19/0x20
Aug 19 23:18:03 archpad kernel:  ? update_sd_lb_stats.constprop.0+0x799/0x8f0
Aug 19 23:18:03 archpad kernel:  ? cpufreq_this_cpu_can_update+0xe/0x50
Aug 19 23:18:03 archpad kernel:  ? sugov_update_single+0x143/0x210
Aug 19 23:18:03 archpad kernel:  ? sugov_get_util+0xf0/0xf0
Aug 19 23:18:03 archpad kernel:  ? update_load_avg+0x63a/0x660
Aug 19 23:18:03 archpad kernel:  ? update_curr+0x73/0x1f0
Aug 19 23:18:03 archpad kernel:  ? enqueue_entity+0x14e/0x750
Aug 19 23:18:03 archpad kernel:  ? resched_curr+0x20/0xc0
Aug 19 23:18:03 archpad kernel:  ? check_preempt_wakeup+0x13b/0x250
Aug 19 23:18:03 archpad kernel:  ? check_preempt_curr+0x67/0x90
Aug 19 23:18:03 archpad kernel:  ? _raw_spin_unlock+0x16/0x30
Aug 19 23:18:03 archpad kernel:  dml_get_voltage_level+0x116/0x1e0 [amdgpu]
Aug 19 23:18:03 archpad kernel:  dcn20_fast_validate_bw+0x359/0x680 [amdgpu]
Aug 19 23:18:03 archpad kernel:  ? resource_build_scaling_params+0xc44/0x11a0 [amdgpu]
Aug 19 23:18:03 archpad kernel:  dcn21_validate_bandwidth+0xcd/0x2a0 [amdgpu]
Aug 19 23:18:03 archpad kernel:  dc_validate_global_state+0x2f2/0x390 [amdgpu]
Aug 19 23:18:03 archpad kernel:  amdgpu_dm_atomic_check+0xefb/0x1010 [amdgpu]
Aug 19 23:18:03 archpad kernel:  ? free_one_page+0x57/0xd0
Aug 19 23:18:03 archpad kernel:  drm_atomic_check_only+0x57c/0x7f0 [drm]
Aug 19 23:18:03 archpad kernel:  ? __drm_atomic_helper_crtc_duplicate_state+0x85/0xd0 [drm_kms_helper]
Aug 19 23:18:03 archpad kernel:  drm_atomic_commit+0x13/0x50 [drm]
Aug 19 23:18:03 archpad kernel:  drm_atomic_helper_legacy_gamma_set+0x123/0x180 [drm_kms_helper]
Aug 19 23:18:03 archpad kernel:  drm_mode_gamma_set_ioctl+0x19a/0x230 [drm]
Aug 19 23:18:03 archpad kernel:  ? drm_color_lut_check+0xa0/0xa0 [drm]
Aug 19 23:18:03 archpad kernel:  drm_ioctl_kernel+0xb2/0x100 [drm]
Aug 19 23:18:03 archpad kernel:  drm_ioctl+0x208/0x360 [drm]
Aug 19 23:18:03 archpad kernel:  ? drm_color_lut_check+0xa0/0xa0 [drm]
Aug 19 23:18:03 archpad kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Aug 19 23:18:03 archpad kernel:  ksys_ioctl+0x82/0xc0
Aug 19 23:18:03 archpad kernel:  __x64_sys_ioctl+0x16/0x20
Aug 19 23:18:03 archpad kernel:  do_syscall_64+0x44/0x70
Aug 19 23:18:03 archpad kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 19 23:18:03 archpad kernel: RIP: 0033:0x7f22234b08eb
Aug 19 23:18:03 archpad kernel: Code: 0f 1e fa 48 8b 05 a5 95 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 95 0c 00 f7 d8 64 89 01 48
Aug 19 23:18:03 archpad kernel: RSP: 002b:00007ffee6662f48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 19 23:18:03 archpad kernel: RAX: ffffffffffffffda RBX: 00007ffee6662f80 RCX: 00007f22234b08eb
Aug 19 23:18:03 archpad kernel: RDX: 00007ffee6662f80 RSI: 00000000c02064a5 RDI: 000000000000000a
Aug 19 23:18:03 archpad kernel: RBP: 00000000c02064a5 R08: 000055b14cc95f10 R09: 000055b14cc96110
Aug 19 23:18:03 archpad kernel: R10: 000055b14cc95d10 R11: 0000000000000246 R12: 0000000000000100
Aug 19 23:18:03 archpad kernel: R13: 000000000000000a R14: 0000000000000100 R15: 0000000000000100
Aug 19 23:18:03 archpad kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c tun bridge hid_multitouch hid_generic 8021q garp mrp stp llc amdgpu ath10k_pci edac_mce_amd ath10k_core kvm_amd snd_acp3x_rn kvm snd_acp3x_pdm_dma ebtable_filter ebtables ip6table_filter snd_soc_dmic ip6_tables snd_soc_core ath irqbypass iptable_filter crct10dif_pclmul crc32_pclmul mac80211 ghash_clmulni_intel joydev snd_compress ac97_bus snd_pcm_dmaengine mousedev wmi_bmof aesni_intel crypto_simd ccm cryptd glue_helper algif_aead snd_hda_codec_generic btusb rapl snd_hda_codec_hdmi ledtrig_audio des_generic input_leds gpu_sched pcspkr libdes snd_hda_intel btrtl i2c_algo_bit snd_intel_dspcfg btbcm ttm arc4 snd_hda_codec cbc btintel ecb snd_hda_core uvcvideo algif_skcipher bluetooth drm_kms_helper k10temp sp5100_tco snd_hwdep i2c_piix4 snd_pcm cmac md4 videobuf2_vmalloc cec
Aug 19 23:18:03 archpad kernel:  cfg80211 videobuf2_memops algif_hash af_alg videobuf2_v4l2 rc_core tpm_crb videobuf2_common snd_timer nls_iso8859_1 syscopyarea videodev ideapad_laptop sysfillrect nls_cp437 tpm_tis snd ccp ecdh_generic tpm_tis_core snd_rn_pci_acp3x ecc vfat sparse_keymap sysimgblt fat soundcore tpm snd_pci_acp3x mc rfkill fb_sys_fops i2c_hid hid libarc4 wmi evdev pinctrl_amd battery mac_hid elants_i2c acpi_cpufreq rng_core ac drm agpgart pkcs8_key_parser ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 serio_raw xhci_pci atkbd xhci_pci_renesas libps2 xhci_hcd crc32c_intel i8042 serio
Aug 19 23:18:03 archpad kernel: ---[ end trace a01eac408369453d ]---
Aug 19 23:18:03 archpad kernel: RIP: 0010:dcn_bw_ceil2+0x35/0x60 [amdgpu]
Aug 19 23:18:03 archpad kernel: Code: cd 7b 3e 0f 28 d0 66 0f ef db 66 0f ef e4 f3 0f 5e d1 f3 0f 5a e0 f3 0f 2c c2 66 0f ef d2 f3 0f 2a d0 f3 0f 59 d1 f3 0f 5a da <f2> 0f 58 1d 5b 19 2e 00 66 0f 2f dc 72 01 c3 f3 0f 58 ca 0f 28 c1
Aug 19 23:18:03 archpad kernel: RSP: 0018:ffff9e24c10775f8 EFLAGS: 00010202
Aug 19 23:18:03 archpad kernel: RAX: 0000000000000004 RBX: 0000000000000000 RCX: 0000000000000780
Aug 19 23:18:03 archpad kernel: RDX: ffff93d6d0683080 RSI: ffff93d6d0689560 RDI: 0000000044444440
Aug 19 23:18:03 archpad kernel: RBP: ffff93d6d06831c0 R08: ffff93d6d06833b4 R09: 0000000000000000
Aug 19 23:18:03 archpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff93d6d0683360
Aug 19 23:18:03 archpad kernel: R13: 0000000000000001 R14: ffff93d6d0682188 R15: ffff93d6d0682028
Aug 19 23:18:03 archpad kernel: FS:  00007f222278d940(0000) GS:ffff93d707640000(0000) knlGS:0000000000000000
Aug 19 23:18:03 archpad kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 19 23:18:03 archpad kernel: CR2: 0000000800cca010 CR3: 00000002021ca000 CR4: 0000000000340ee0
Comment 36 Petteri Aimonen 2020-08-20 04:11:32 UTC
@krakopo The MXCSR value of 00000020 is exactly what I saw before the bug fix. So something is definitely clearing MXCSR after kernel_fpu_begin() should have set it to 0x1F80.

Can you disassemble kernel_fpu_begin() to verify that the ldmxcsr instruction is present close to its end? Also, check that /proc/cpuinfo flags has "sse" in it - not sure though how that could possibly be missing.
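To make the two MXCSR values easier to compare, here is a rough decoding sketch. The bit layout (exception flags in bits 0-5, exception masks in bits 7-12) is the standard x86 one; the helper name decode_mxcsr is invented for illustration, and reading the XMM3 line as the low 64 bits of the register interpreted as a double is an assumption about what the debug patch prints.

```python
import struct

FLAG_BITS = ["IE", "DE", "ZE", "OE", "UE", "PE"]  # sticky exception flags, bits 0-5
MASK_BITS = ["IM", "DM", "ZM", "OM", "UM", "PM"]  # exception masks, bits 7-12

def decode_mxcsr(value):
    """Split an MXCSR value into raised flags, masked and unmasked exceptions."""
    flags = [name for bit, name in enumerate(FLAG_BITS) if value & (1 << bit)]
    masked = [name for bit, name in enumerate(MASK_BITS, start=7) if value & (1 << bit)]
    unmasked = [name for name in MASK_BITS if name not in masked]
    return {"flags": flags, "masked": masked, "unmasked": unmasked}

crashing = decode_mxcsr(0x00000020)  # value from the dmesg output above
expected = decode_mxcsr(0x1F80)      # value kernel_fpu_begin() loads via ldmxcsr
print(crashing)
print(expected)

# Assumption: the XMM3 line shows the low 64 bits of the register as a double.
xmm3 = struct.unpack(">d", bytes.fromhex("4010000000000000"))[0]
print(xmm3)
```

With 0x20 only the precision (inexact result) flag is set and none of the six exceptions is masked, while 0x1F80 masks all of them. An unmasked precision exception would be delivered as a fault on an inexact SSE result, which seems consistent with the "simd exception" at the addsd instruction.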
Comment 37 krakopo 2020-08-20 04:21:09 UTC
I do see ldmxcsr in the disassembly:

ffffffff81038870 <kernel_fpu_begin>:
ffffffff81038870:       e8 9b 07 03 00          callq  ffffffff81069010 <__fentry__>
ffffffff81038875:       48 83 ec 10             sub    $0x10,%rsp
ffffffff81038879:       bf 01 00 00 00          mov    $0x1,%edi
ffffffff8103887e:       65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
ffffffff81038885:       00 00 
ffffffff81038887:       48 89 44 24 08          mov    %rax,0x8(%rsp)
ffffffff8103888c:       31 c0                   xor    %eax,%eax
ffffffff8103888e:       c7 44 24 04 00 00 00    movl   $0x0,0x4(%rsp)
ffffffff81038895:       00 
ffffffff81038896:       e8 35 ae 08 00          callq  ffffffff810c36d0 <preempt_count_add>
ffffffff8103889b:       e8 80 fd ff ff          callq  ffffffff81038620 <irq_fpu_usable>
ffffffff810388a0:       65 8a 05 b1 f2 fd 7e    mov    %gs:0x7efdf2b1(%rip),%al        # 17b58 <in_kernel_fpu>
ffffffff810388a7:       65 c6 05 a9 f2 fd 7e    movb   $0x1,%gs:0x7efdf2a9(%rip)        # 17b58 <in_kernel_fpu>
ffffffff810388ae:       01 
ffffffff810388af:       65 48 8b 3c 25 c0 7b    mov    %gs:0x17bc0,%rdi
ffffffff810388b6:       01 00 
ffffffff810388b8:       f6 47 26 20             testb  $0x20,0x26(%rdi)
ffffffff810388bc:       74 3c                   je     ffffffff810388fa <kernel_fpu_begin+0x8a>
ffffffff810388be:       48 c7 c7 57 43 40 82    mov    $0xffffffff82404357,%rdi
ffffffff810388c5:       e8 46 41 9c 00          callq  ffffffff819fca10 <__this_cpu_preempt_check>
ffffffff810388ca:       c7 44 24 04 80 1f 00    movl   $0x1f80,0x4(%rsp)
ffffffff810388d1:       00 
ffffffff810388d2:       65 48 c7 05 82 f2 fd    movq   $0x0,%gs:0x7efdf282(%rip)        # 17b60 <fpu_fpregs_owner_ctx>
ffffffff810388d9:       7e 00 00 00 00 
ffffffff810388de:       0f ae 54 24 04          ldmxcsr 0x4(%rsp)
ffffffff810388e3:       db e3                   fninit 
ffffffff810388e5:       48 8b 44 24 08          mov    0x8(%rsp),%rax
ffffffff810388ea:       65 48 2b 04 25 28 00    sub    %gs:0x28,%rax
ffffffff810388f1:       00 00 
ffffffff810388f3:       75 20                   jne    ffffffff81038915 <kernel_fpu_begin+0xa5>
ffffffff810388f5:       48 83 c4 10             add    $0x10,%rsp
ffffffff810388f9:       c3                      retq   
ffffffff810388fa:       48 8b 07                mov    (%rdi),%rax
ffffffff810388fd:       f6 c4 40                test   $0x40,%ah
ffffffff81038900:       75 bc                   jne    ffffffff810388be <kernel_fpu_begin+0x4e>
ffffffff81038902:       f0 80 4f 01 40          lock orb $0x40,0x1(%rdi)
ffffffff81038907:       48 81 c7 00 1b 00 00    add    $0x1b00,%rdi
ffffffff8103890e:       e8 5d fd ff ff          callq  ffffffff81038670 <copy_fpregs_to_fpstate>
ffffffff81038913:       eb a9                   jmp    ffffffff810388be <kernel_fpu_begin+0x4e>
ffffffff81038915:       e8 36 3c 9c 00          callq  ffffffff819fc550 <__stack_chk_fail>
ffffffff8103891a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)


And yes, I do have the "sse" flag in /proc/cpuinfo.
Comment 38 Petteri Aimonen 2020-08-20 04:24:36 UTC
@krakopo I must say I don't have any idea what could be happening on your machine. It could be explained if the kernel thread was being pre-empted, but pre-emption is disabled by kernel_fpu_begin().

It may also help to ask in bug 207979, as it has some of the long-time x86 maintainers on CC.
Comment 39 Jan Kokemüller 2021-02-11 07:48:53 UTC
Created attachment 295225 [details]
Call DC_FP_START() / DC_FP_END() in dcn21_validate_bandwidth

Could it be that DC_FP_START()/DC_FP_END() aka kernel_fpu_begin()/kernel_fpu_end() are not called in the *_validate_bandwidth code path on AMD Renoir systems? To my untrained eye it looks like it is missing, while it _is_ there for dcn20.

I've been running the attached patch for 2 days now with some KVM VMs open and the system seems stable. Previously, I had crashes/backtraces similar to those @krakopo described.

I'm happy to help test any patches. I'm running a ThinkPad T14 with an AMD Ryzen 7 PRO 4750U (Renoir).
Comment 40 Alex Deucher 2021-02-11 14:51:55 UTC
(In reply to Jan Kokemüller from comment #39)
Looks correct.  Care to send out a proper git patch?
Comment 41 Jan Kokemüller 2021-02-11 18:36:42 UTC
> Looks correct.  Care to send out a proper git patch?

Thank you for having a look at the patch! I've sent it to the amd-gfx list.
