The whole system crashes with this error message:
simd exception: 0000 [#1] PREEMPT SMP NOPTI
Only a REISUB recovers the machine. The cause is the amdgpu driver.
---
Mar 26 20:47:13 shodan kernel: simd exception: 0000 [#1] PREEMPT SMP NOPTI
Mar 26 20:47:13 shodan kernel: CPU: 7 PID: 1344 Comm: Xorg Tainted: G W OE 5.5.11-arch1-1 #1
Mar 26 20:47:13 shodan kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.80 03/06/2019
Mar 26 20:47:13 shodan kernel: RIP: 0010:mode_support_and_system_configuration+0x30a3/0x4d90 [amdgpu]
Mar 26 20:47:13 shodan kernel: Code: 00 0f 28 c3 e8 7e c9 ff ff f3 41 0f 11 87 40 19 00 00 e9 12 fd ff ff 41 83 be a8 00 00 00 06 75 93 f3 41 0f 10 86 40 1b 00 00 <f3> 41 0f 5e 86 f8 17 00 00 e8 4f c9 ff ff 41 8b 87 80 04 00 00 f3
Mar 26 20:47:13 shodan kernel: RSP: 0018:ffffb216c1f3b978 EFLAGS: 00010246
Mar 26 20:47:13 shodan kernel: RAX: 0000000000000006 RBX: ffff9c120bbfadc4 RCX: 0000000000000004
Mar 26 20:47:13 shodan kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c120bbfb008
Mar 26 20:47:13 shodan kernel: RBP: ffff9c120bbfadc4 R08: ffff9c120bbfc164 R09: 0000000000000120
Mar 26 20:47:13 shodan kernel: R10: ffff9c120bbfaee4 R11: ffff9c120bbf0248 R12: ffff9c120bbfc63c
Mar 26 20:47:13 shodan kernel: R13: 0000000000000000 R14: ffff9c120bbfaf5c R15: ffff9c120bbfadc4
Mar 26 20:47:13 shodan kernel: FS: 00007f1c9f336dc0(0000) GS:ffff9c19009c0000(0000) knlGS:0000000000000000
Mar 26 20:47:13 shodan kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 26 20:47:13 shodan kernel: CR2: 00001f82bfec7fe0 CR3: 00000007cbe4a000 CR4: 00000000003406e0
Mar 26 20:47:13 shodan kernel: Call Trace:
Mar 26 20:47:13 shodan kernel: dcn_validate_bandwidth+0xfe5/0x1f20 [amdgpu]
Mar 26 20:47:13 shodan kernel: dc_validate_global_state+0x28a/0x310 [amdgpu]
Mar 26 20:47:13 shodan kernel: amdgpu_dm_atomic_check+0x5d8/0x870 [amdgpu]
Mar 26 20:47:13 shodan kernel: drm_atomic_check_only+0x578/0x800 [drm]
Mar 26 20:47:13 shodan kernel: ? dm_crtc_duplicate_state+0x6b/0x1f0 [amdgpu]
Mar 26 20:47:13 shodan kernel: drm_atomic_commit+0x13/0x50 [drm]
Mar 26 20:47:13 shodan kernel: drm_atomic_helper_legacy_gamma_set+0x123/0x180 [drm_kms_helper]
Mar 26 20:47:13 shodan kernel: drm_mode_gamma_set_ioctl+0x171/0x220 [drm]
Mar 26 20:47:13 shodan kernel: ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
Mar 26 20:47:13 shodan kernel: drm_ioctl_kernel+0xb2/0x100 [drm]
Mar 26 20:47:13 shodan kernel: drm_ioctl+0x209/0x360 [drm]
Mar 26 20:47:13 shodan kernel: ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
Mar 26 20:47:13 shodan kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Mar 26 20:47:13 shodan kernel: do_vfs_ioctl+0x4b7/0x730
Mar 26 20:47:13 shodan kernel: ksys_ioctl+0x5e/0x90
Mar 26 20:47:13 shodan kernel: __x64_sys_ioctl+0x16/0x20
Mar 26 20:47:13 shodan kernel: do_syscall_64+0x4e/0x150
Mar 26 20:47:13 shodan kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 26 20:47:13 shodan kernel: RIP: 0033:0x7f1ca01892eb
Mar 26 20:47:13 shodan kernel: Code: 0f 1e fa 48 8b 05 a5 8b 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 8b 0c 00 f7 d8 64 89 01 48
Mar 26 20:47:13 shodan kernel: RSP: 002b:00007ffc60ff5648 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
Mar 26 20:47:13 shodan kernel: RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f1ca01892eb
Mar 26 20:47:13 shodan kernel: RDX: 00007ffc60ff5700 RSI: 00000000c02064a5 RDI: 000000000000000a
Mar 26 20:47:13 shodan kernel: RBP: 00007ffc60ff5680 R08: 0000562bb635c080 R09: 0000562bb635c280
Mar 26 20:47:13 shodan kernel: R10: 0000562bb635be80 R11: 0000000000000206 R12: 0000000000000100
Mar 26 20:47:13 shodan kernel: R13: 0000562bb6ab4f70 R14: 0000562bb635b9c0 R15: 0000000000000100
Mar 26 20:47:13 shodan kernel: Modules linked in: snd_seq_dummy snd_seq bluetooth ecdh_generic rfkill ecc veth fuse iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi tun ip6table_mangle xt_MASQUERADE iptable_nat nf_nat xt_connmark iptable_mangle xt_helper xt_NFLOG xt_limit xt_conntrack xt_tcpudp nf_conntrack_ftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_irc nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) pktcdvd nfnetlink_log nfnetlink ip6table_filter nct6775 ip6_tables hwmon_vid iptable_filter edac_mce_amd kvm_amd ccp ext4 rng_core kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi crc16 mbcache irqbypass mxm_wmi jbd2 snd_hda_intel wmi_bmof snd_intel_dspcfg snd_hda_codec snd_usb_audio crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core uvcvideo snd_usbmidi_lib snd_rawmidi videobuf2_vmalloc videobuf2_memops snd_seq_device videobuf2_v4l2 aesni_intel snd_hwdep videobuf2_common crypto_simd snd_pcm mousedev cryptd glue_helper
Mar 26 20:47:13 shodan kernel: input_leds sp5100_tco snd_timer igb k10temp pcspkr i2c_piix4 snd soundcore dca wmi evdev mac_hid gpio_amdpt pinctrl_amd acpi_cpufreq xt_mark v4l2loopback(OE) videodev mc usbmon nbd msr vhba(OE) sr_mod cdrom sg br_netfilter bridge stp llc ip_tables x_tables dm_mod btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq sd_mod hid_generic usbhid hid crc32c_intel ahci libahci libata xhci_pci xhci_hcd scsi_mod amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper serio_raw syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart i8042 atkbd libps2 serio
Mar 26 20:47:13 shodan kernel: ---[ end trace e34593e526e29a3d ]---
Mar 26 20:47:13 shodan kernel: RIP: 0010:mode_support_and_system_configuration+0x30a3/0x4d90 [amdgpu]
Mar 26 20:47:13 shodan kernel: Code: 00 0f 28 c3 e8 7e c9 ff ff f3 41 0f 11 87 40 19 00 00 e9 12 fd ff ff 41 83 be a8 00 00 00 06 75 93 f3 41 0f 10 86 40 1b 00 00 <f3> 41 0f 5e 86 f8 17 00 00 e8 4f c9 ff ff 41 8b 87 80 04 00 00 f3
Mar 26 20:47:13 shodan kernel: RSP: 0018:ffffb216c1f3b978 EFLAGS: 00010246
Mar 26 20:47:13 shodan kernel: RAX: 0000000000000006 RBX: ffff9c120bbfadc4 RCX: 0000000000000004
Mar 26 20:47:13 shodan kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c120bbfb008
Mar 26 20:47:13 shodan kernel: RBP: ffff9c120bbfadc4 R08: ffff9c120bbfc164 R09: 0000000000000120
Mar 26 20:47:13 shodan kernel: R10: ffff9c120bbfaee4 R11: ffff9c120bbf0248 R12: ffff9c120bbfc63c
Mar 26 20:47:13 shodan kernel: R13: 0000000000000000 R14: ffff9c120bbfaf5c R15: ffff9c120bbfadc4
Mar 26 20:47:13 shodan kernel: FS: 00007f1c9f336dc0(0000) GS:ffff9c19009c0000(0000) knlGS:0000000000000000
Mar 26 20:47:13 shodan kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 26 20:47:13 shodan kernel: CR2: 00001f82bfec7fe0 CR3: 00000007cbe4a000 CR4: 00000000003406e0
Please attach your full dmesg output. What version of gcc are you using?
Created attachment 288079 [details] dmesg output
GCC is "gcc (Arch Linux 9.3.0-1) 9.3.0"
Created attachment 288203 [details] dmesg output 2 This crash happened again. During that time I had used VLC, played a game (GZDoom) and tried to listen to a YouTube playlist using a combination of youtube-dl, ffmpeg and mpv. I also updated the motherboard's BIOS/firmware to the latest version.
Oh, and the kernel is version 5.5.13.
Created attachment 288595 [details] dmesg output And another one. It seems that switching between virtual consoles triggers this bug.
I am having the same problem, sometimes during start/exit of SteamVR. I have observed it with the 5.6 kernels. My card is a Navi RX 5700XT.
Created attachment 288615 [details] smesg output
Created attachment 288679 [details] dmesg output And again.
Created attachment 288719 [details] gdb disassembler dump around mode_support_and_system_configuration And it happened again. It looks like something goes wrong a while after the computer monitor is turned on.
Created attachment 288781 [details] dmesg output from Linux 5.7-rc3 This is starting to be a real problem; I can't do anything remotely productive. The crash happens within about 12 hours (give or take) of rebooting after the previous one. I'm running four LXC containers, which I have set up to run GUI programs on the host system by following this guide: https://wiki.archlinux.org/index.php/Linux_Containers#Xorg_program_considerations_(optional) I am also running VirtualBox, but its VMs don't access the host's 3D functions at all.
Created attachment 288873 [details] dmesg from 5.6.8 Additionally, the dmesg output shows this line: note: kworker/0:3[2251663] exited with preempt_count 1 It seems that this bug occurs when the monitor is turned off and then on repeatedly with a short delay in between.
Created attachment 289237 [details] kernel log dumped from crash dump by using crash utility
Created attachment 289239 [details] backtrace created by executing bt -f command in crash utility
Created attachment 289241 [details] dump of struct dcn_bw_internal_vars
I hit the same issue, using Ubuntu 20.04. It happened when switching window to Firefox. For me it only crashed Xorg, ssh to the machine still worked ok. Killing Xorg didn't work and `shutdown -r now` hung up somewhere. Here is a bug report on the Ubuntu package: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1881134 Here is call trace decoded with the debug symbols: -- [455834.385061] Call Trace: [455834.385120] mode_support_and_system_configuration (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calc_auto.c:176) amdgpu [455834.385174] ? calculate_inits_and_adj_vp (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_resource.c:950 (discriminator 12)) amdgpu [455834.385230] dcn_validate_bandwidth (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:1034) amdgpu [455834.385283] dc_validate_global_state (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_resource.c:2093) amdgpu [455834.385338] amdgpu_dm_atomic_check (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7413) amdgpu [455834.385351] drm_atomic_check_only (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_atomic.c:1179) drm [455834.385361] drm_atomic_commit (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_atomic.c:1220) drm [455834.385370] drm_mode_obj_set_property_ioctl (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:496 /build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:533) drm [455834.385379] ? 
drm_mode_obj_find_prop_id (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:512) drm [455834.385386] drm_ioctl_kernel (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_ioctl.c:793) drm [455834.385394] drm_ioctl (/build/linux-FFoizL/linux-5.4.0/include/linux/thread_info.h:119 /build/linux-FFoizL/linux-5.4.0/include/linux/thread_info.h:152 /build/linux-FFoizL/linux-5.4.0/include/linux/uaccess.h:151 /build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_ioctl.c:888) drm [455834.385402] ? drm_mode_obj_find_prop_id (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/drm_mode_object.c:512) drm [455834.385406] ? recalc_sigpending (/build/linux-FFoizL/linux-5.4.0/kernel/signal.c:184) [455834.385440] amdgpu_drm_ioctl (/build/linux-FFoizL/linux-5.4.0/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:1293) amdgpu [455834.385443] do_vfs_ioctl (/build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:47 /build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:510 /build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:697) [455834.385444] ? recalc_sigpending (/build/linux-FFoizL/linux-5.4.0/kernel/signal.c:184) [455834.385446] ? _copy_from_user (/build/linux-FFoizL/linux-5.4.0/arch/x86/include/asm/uaccess_64.h:46 /build/linux-FFoizL/linux-5.4.0/arch/x86/include/asm/uaccess_64.h:71 /build/linux-FFoizL/linux-5.4.0/lib/usercopy.c:14) [455834.385448] ksys_ioctl (/build/linux-FFoizL/linux-5.4.0/include/linux/file.h:43 /build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:715) [455834.385449] __x64_sys_ioctl (/build/linux-FFoizL/linux-5.4.0/fs/ioctl.c:719) [455834.385451] do_syscall_64 (/build/linux-FFoizL/linux-5.4.0/arch/x86/entry/common.c:290) [455834.385455] entry_SYSCALL_64_after_hwframe (/build/linux-FFoizL/linux-5.4.0/arch/x86/entry/entry_64.S:184) [455834.385456] RIP: 0033:0x7faf3181837b
Created attachment 289381 [details] dmesg from kernel 5.4.0-31
As best as I can tell, the crash seems to be caused by some floating point exception (such as underflow/overflow) in this function call in dcn_calc_auto.c line 176:

dcn_bw_ceil2(v->byte_per_pixel_in_dety[k], 1.0)

In dcn_bw_ceil2() the exception occurs in this instruction:

addsd 0x0(%rip),%xmm3

which is performing the addition flr + 0.00001. At this point %xmm3 is

((int)(v->byte_per_pixel_in_dety[k] / 1.0)) * 1.0

The variable byte_per_pixel_in_dety is only assigned constant values 1.0, 2.0, 4.0, 8.0 so I don't see any reason for addsd to cause a simd exception.

I'm not sure if the exception is precise or if it could be delayed from some prior instruction, but AFAIK it should be precise because in usermode the exception handler would attempt a recovery. Having XMM3 or MXCSR values would help, but they don't seem to get included in the dmesg output and I'm not sure if they are available in a crash dump either.

Google search turned up https://beowulf.beowulf.narkive.com/tAHxVcs0/simd-exception-kernel-panic-on-skylake-ep-triggered-by-openfoam where the exception was delayed for some reason.

Analyzing the dmesgs attached to this bug report, we have following crash locations:

Cyrax 2020-03-26 21:36: divss xmm0,DWORD PTR [r14+0x17f8]
Cyrax 2020-04-04 07:40: divss xmm0,DWORD PTR [r14+0x17f8]
Cyrax 2020-04-18 13:19: divss xmm0,DWORD PTR [r14+0x17f8]
farmboy0 2020-04-19 11:43: not a simd exception
Cyrax 2020-04-23 05:15: divss xmm0,DWORD PTR [r14+0x17f8]
Cyrax 2020-04-27 19:20: divss xmm0,DWORD PTR [r14+0x17f8]
Cyrax 2020-05-02 14:18: divss xmm0,DWORD PTR [r14+0x17f8]
PetteriA 2020-05-28 16:05: addsd xmm3,QWORD PTR [rip+0x1de967]

So the crash locations appear fairly consistent for Cyrax's machine, but no two machines have the same location.

For other users affected by this problem, it could be helpful if you install kernel debugging symbols and use decode_stacktrace.sh to convert the raw stack trace to code locations.
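For anyone following the arithmetic, here is a rough Python sketch of the rounding logic described above: floor to a multiple of the significance, add 0.00001, compare against the argument. It is a reconstruction from this comment and from the dcn_bw_ceil2 disassembly posted later in this thread, not a copy of the kernel source. The names bw_floor2, bw_ceil2 and f32 are made up for illustration, and the zero-significance guard is an assumption taken from the disassembly. Python never raises SIMD exceptions, so this only shows what values the function works with.

```python
import struct


def f32(x):
    """Round a Python float to IEEE-754 single precision (mimics a C float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bw_floor2(arg, significance):
    # Round arg down to a multiple of significance.
    # Assumption: a zero significance returns 0 instead of dividing.
    if significance == 0:
        return 0.0
    return f32(int(f32(arg / significance)) * significance)


def bw_ceil2(arg, significance):
    # Hypothetical stand-in for dcn_bw_ceil2(), as described in this thread.
    if significance == 0:
        return 0.0
    flr = bw_floor2(arg, significance)
    # "flr + 0.00001" is the double-precision addition that addsd performs.
    return arg if flr + 0.00001 >= arg else f32(flr + significance)


if __name__ == "__main__":
    # byte_per_pixel_in_dety only ever holds these constants.
    for bpp in (1.0, 2.0, 4.0, 8.0):
        print(bpp, bw_ceil2(bpp, 1.0))
```

For the four constants the function simply returns its argument, which matches the observation that the inputs themselves give addsd no reason to overflow or underflow.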
Also reported on freedesktop amd bugtracker: https://gitlab.freedesktop.org/drm/amd/-/issues/1154
Do these patches help? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=59dfb0c64d3853d20dc84f4561f28d4f5a2ddc7d https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5aa82e35cacfdff7278b7eeffd9575e9c386289e
So far so good Alex. Using the RX 5700 XT as well. Previously, running SteamVR could pretty quickly crash my system (even before launching a game), and since I rebuilt linux-mainline from AUR, haven't had SteamVR crash my system yet. Fingers crossed that this continues. Though Half-Life: Alyx is causing a system crash, which can even happen on Windows with Vulkan apparently! Wow. At least that's not an AMD or Linux specific issue. https://github.com/ValveSoftware/SteamVR-for-Linux/issues/356
Created attachment 289479 [details] dmesg output kernel 5.7.0
Created attachment 289481 [details] config file used to build kernel 5.7.0 with KASAN etc
Created attachment 289483 [details] used decode_stacktrace.sh to previous dmesg log
(In reply to Petteri Aimonen from comment #16)
> I hit the same issue, using Ubuntu 20.04. It happened when switching window
> to Firefox. For me it only crashed Xorg, ssh to the machine still worked ok.
> Killing Xorg didn't work and `shutdown -r now` hung up somewhere.
>
> Here is a bug report on the Ubuntu package:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1881134
>
> Here is call trace decoded with the debug symbols:
> [clip]

Yeah, it happens when switching windows and/or switching to a different workspace. And yes, it crashes only Xorg; other things continue to work as usual, and issuing a reboot command via SSH won't - well - reboot it. Only REISUB brings the machine back to a usable state.
Looks like there are two kinds of crash bugs here. Many of the amdgpu crashes have been fixed in 5.7.0, but the specific one that gives "simd exception" in dmesg is not. @Cyrax There is an experimental patch in https://bugzilla.kernel.org/show_bug.cgi?id=207979 if you want to try. Out of interest, are you possibly running a 32-bit operating system under virtualization on 64-bit host? That's what triggers the bug for me.
(In reply to Petteri Aimonen from comment #25)
> Looks like there are two kinds of crash bugs here. Many of the amdgpu
> crashes have been fixed in 5.7.0, but the specific one that gives "simd
> exception" in dmesg is not.
>
> @Cyrax There is an experimental patch in
> https://bugzilla.kernel.org/show_bug.cgi?id=207979 if you want to try.
>
> Out of interest, are you possibly running a 32-bit operating system under
> virtualization on 64-bit host? That's what triggers the bug for me.

I'm running one 32-bit LXC container (Arch Linux, https://archlinux32.org/) and three 64-bit LXC containers (Arch Linux). Additionally, I'm running three VirtualBox guests: Windows, Arch Linux and an old version of the LEDE (OpenWRT) router OS (all are running a 64-bit OS).
Created attachment 289535 [details] systemd journal from crash Update: got a whole system crash again when I was starting up SteamVR. So I guess the issue wasn't resolved for me. It could have reduced the likelihood maybe, or it was luck? Not sure what else to attach here, but I copied journal entries from the time of the crash (which happens at 21:09:31 near the end). Let me know if there's something else I should attach the next time this happens, if more data would be helpful.
@yaomtc Your bug seems to be some separate issue, as the log does not have the "simd exception" or "mode_support_and_system_configuration" entries in it. It looks more similar to this bug here: https://gitlab.freedesktop.org/drm/amd/-/issues/1149
I encountered this bug today. When running specific graphical applications, the machine hangs, and the kernel log reports a simd exception. It started to occur after the upgrade to the 5.7.6 kernel. I applied the patch mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=207979, and it resolves the issue for me. Using AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx.
The patch in https://bugzilla.kernel.org/show_bug.cgi?id=207979 works beautifully. 19 days of heavy usage without a system crash on a patched 5.7.6 kernel.
Duplicate of bug 207979.
Fix is in stable 5.7.10 kernel. *** This bug has been marked as a duplicate of bug 207979 ***
I'm seeing this on an AMD Ryzen 4500U laptop running 5.8.1 (Arch Linux 5.8.1-arch1-1). I can repro fairly consistently when running a 64-bit KVM virtual machine. The kernel I'm running has the commit which should resolve this: 7ad816762f9b ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()") Confirmed patch is in my kernel: https://git.archlinux.org/linux.git/tree/arch/x86/kernel/fpu/core.c?h=v5.8.1-arch1#n106 Here is what I see in dmesg: Aug 18 20:25:49 archpad kernel: simd exception: 0000 [#1] PREEMPT SMP NOPTI Aug 18 20:25:49 archpad kernel: CPU: 0 PID: 509 Comm: Xorg Not tainted 5.8.1-arch1-1 #1 Aug 18 20:25:49 archpad kernel: Hardware name: LENOVO 81W4/LNVNB161216, BIOS DZCN19WW 04/13/2020 Aug 18 20:25:49 archpad kernel: RIP: 0010:dcn_bw_ceil2+0x35/0x60 [amdgpu] Aug 18 20:25:49 archpad kernel: Code: cd 7b 3e 0f 28 d0 66 0f ef db 66 0f ef e4 f3 0f 5e d1 f3 0f 5a e0 f3 0f 2c c2 66 0f ef d2 f3 0f 2a d0 f3 0f 59 d1 f3 0f 5a da <f2> 0f 58 1d 5b 19 2e 00 66 0f 2f dc 72 01 c3 f3 0f 58 ca 0f 28 c1 Aug 18 20:25:49 archpad kernel: RSP: 0018:ffffb8fac07035f8 EFLAGS: 00010202 Aug 18 20:25:49 archpad kernel: RAX: 0000000000000004 RBX: 0000000000000000 RCX: 0000000000000780 Aug 18 20:25:49 archpad kernel: RDX: ffff97ebd0a63080 RSI: ffff97ebd0a69560 RDI: 0000000044444440 Aug 18 20:25:49 archpad kernel: RBP: ffff97ebd0a631c0 R08: ffff97ebd0a633b4 R09: 0000000000000000 Aug 18 20:25:49 archpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff97ebd0a63360 Aug 18 20:25:49 archpad kernel: R13: 0000000000000001 R14: ffff97ebd0a62188 R15: ffff97ebd0a62028 Aug 18 20:25:49 archpad kernel: FS: 00007f8787a65940(0000) GS:ffff97ec47400000(0000) knlGS:0000000000000000 Aug 18 20:25:49 archpad kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 18 20:25:49 archpad kernel: CR2: 0000000800880000 CR3: 00000001f9040000 CR4: 0000000000340ef0 Aug 18 20:25:49 archpad kernel: Call Trace: Aug 18 20:25:49 archpad kernel: 
dml21_ModeSupportAndSystemConfigurationFull+0x437/0x5cf0 [amdgpu] Aug 18 20:25:49 archpad kernel: ? sysvec_apic_timer_interrupt+0x46/0xe0 Aug 18 20:25:49 archpad kernel: ? asm_sysvec_apic_timer_interrupt+0x12/0x20 Aug 18 20:25:49 archpad kernel: ? sched_clock+0x5/0x10 Aug 18 20:25:49 archpad kernel: ? sched_clock_local+0x12/0x80 Aug 18 20:25:49 archpad kernel: ? amdgpu_sa_bo_new+0xbc/0x550 [amdgpu] Aug 18 20:25:49 archpad kernel: ? sched_clock_cpu+0xae/0xd0 Aug 18 20:25:49 archpad kernel: ? kmem_cache_alloc_trace+0x17c/0x220 Aug 18 20:25:49 archpad kernel: ? amdgpu_sa_bo_new+0xbc/0x550 [amdgpu] Aug 18 20:25:49 archpad kernel: ? _raw_spin_unlock+0x16/0x30 Aug 18 20:25:49 archpad kernel: ? preempt_count_add+0x49/0xa0 Aug 18 20:25:49 archpad kernel: ? kernel_init_free_pages+0x6d/0x90 Aug 18 20:25:49 archpad kernel: ? prep_new_page+0xa2/0xb0 Aug 18 20:25:49 archpad kernel: ? get_page_from_freelist+0xfa8/0x1220 Aug 18 20:25:49 archpad kernel: ? __mod_zone_page_state+0x66/0xa0 Aug 18 20:25:49 archpad kernel: ? hubbub2_get_dcc_compression_cap+0xa8/0x270 [amdgpu] Aug 18 20:25:49 archpad kernel: ? fill_plane_buffer_attributes+0x26f/0x420 [amdgpu] Aug 18 20:25:49 archpad kernel: dml_get_voltage_level+0x116/0x1e0 [amdgpu] Aug 18 20:25:49 archpad kernel: dcn20_fast_validate_bw+0x359/0x680 [amdgpu] Aug 18 20:25:49 archpad kernel: ? resource_build_scaling_params+0xc44/0x11a0 [amdgpu] Aug 18 20:25:49 archpad kernel: dcn21_validate_bandwidth+0xcd/0x2a0 [amdgpu] Aug 18 20:25:49 archpad kernel: dc_validate_global_state+0x2f2/0x390 [amdgpu] Aug 18 20:25:49 archpad kernel: amdgpu_dm_atomic_check+0xefb/0x1010 [amdgpu] Aug 18 20:25:49 archpad kernel: drm_atomic_check_only+0x57c/0x7f0 [drm] Aug 18 20:25:49 archpad kernel: ? 
__drm_atomic_helper_crtc_duplicate_state+0x85/0xd0 [drm_kms_helper] Aug 18 20:25:49 archpad kernel: drm_atomic_commit+0x13/0x50 [drm] Aug 18 20:25:49 archpad kernel: drm_atomic_helper_legacy_gamma_set+0x123/0x180 [drm_kms_helper] Aug 18 20:25:49 archpad kernel: drm_mode_gamma_set_ioctl+0x19a/0x230 [drm] Aug 18 20:25:49 archpad kernel: ? drm_color_lut_check+0xa0/0xa0 [drm] Aug 18 20:25:49 archpad kernel: drm_ioctl_kernel+0xb2/0x100 [drm] Aug 18 20:25:49 archpad kernel: drm_ioctl+0x208/0x360 [drm] Aug 18 20:25:49 archpad kernel: ? drm_color_lut_check+0xa0/0xa0 [drm] Aug 18 20:25:49 archpad kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu] Aug 18 20:25:49 archpad kernel: ksys_ioctl+0x82/0xc0 Aug 18 20:25:49 archpad kernel: __x64_sys_ioctl+0x16/0x20 Aug 18 20:25:49 archpad kernel: do_syscall_64+0x44/0x70 Aug 18 20:25:49 archpad kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 Aug 18 20:25:49 archpad kernel: RIP: 0033:0x7f87887888eb Aug 18 20:25:49 archpad kernel: Code: 0f 1e fa 48 8b 05 a5 95 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 95 0c 00 f7 d8 64 89 01 48 Aug 18 20:25:49 archpad kernel: RSP: 002b:00007ffc92f3a9a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 Aug 18 20:25:49 archpad kernel: RAX: ffffffffffffffda RBX: 00007ffc92f3a9e0 RCX: 00007f87887888eb Aug 18 20:25:49 archpad kernel: RDX: 00007ffc92f3a9e0 RSI: 00000000c02064a5 RDI: 000000000000000a Aug 18 20:25:49 archpad kernel: RBP: 00000000c02064a5 R08: 00005627eb36eb10 R09: 00005627eb36ed10 Aug 18 20:25:49 archpad kernel: R10: 00005627eb36e910 R11: 0000000000000246 R12: 0000000000000100 Aug 18 20:25:49 archpad kernel: R13: 000000000000000a R14: 0000000000000100 R15: 0000000000000100 Aug 18 20:25:49 archpad kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
libcrc32c tun bridge hid_multitouch hid_generic 8021q garp mrp stp llc ebtable_filter ebtables snd_acp3x_rn snd_soc_dmic snd_acp3x_pdm_dma snd_soc_c> Aug 18 20:25:49 archpad kernel: drm_kms_helper btintel snd_hwdep i2c_hid hid videobuf2_common cec nls_iso8859_1 snd_pcm rc_core nls_cp437 bluetooth cfg80211 snd_timer syscopyarea videodev ideapad_laptop snd_rn_pci_acp3x sysfillrect vfat ecdh_generic snd sysimgblt tpm_crb snd_pci_acp3x sparse_keymap fat ecc mc fb_sys_fops tpm_tis soundcore ccp rfkill libarc4 wmi battery tpm_tis> Aug 18 20:25:49 archpad kernel: ---[ end trace 76f111d732bc1b57 ]--- Aug 18 20:25:49 archpad kernel: RIP: 0010:dcn_bw_ceil2+0x35/0x60 [amdgpu] Aug 18 20:25:49 archpad kernel: Code: cd 7b 3e 0f 28 d0 66 0f ef db 66 0f ef e4 f3 0f 5e d1 f3 0f 5a e0 f3 0f 2c c2 66 0f ef d2 f3 0f 2a d0 f3 0f 59 d1 f3 0f 5a da <f2> 0f 58 1d 5b 19 2e 00 66 0f 2f dc 72 01 c3 f3 0f 58 ca 0f 28 c1 Aug 18 20:25:49 archpad kernel: RSP: 0018:ffffb8fac07035f8 EFLAGS: 00010202 Aug 18 20:25:49 archpad kernel: RAX: 0000000000000004 RBX: 0000000000000000 RCX: 0000000000000780 Aug 18 20:25:49 archpad kernel: RDX: ffff97ebd0a63080 RSI: ffff97ebd0a69560 RDI: 0000000044444440 Aug 18 20:25:49 archpad kernel: RBP: ffff97ebd0a631c0 R08: ffff97ebd0a633b4 R09: 0000000000000000 Aug 18 20:25:49 archpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff97ebd0a63360 Aug 18 20:25:49 archpad kernel: R13: 0000000000000001 R14: ffff97ebd0a62188 R15: ffff97ebd0a62028 Aug 18 20:25:49 archpad kernel: FS: 00007f8787a65940(0000) GS:ffff97ec47400000(0000) knlGS:0000000000000000 Aug 18 20:25:49 archpad kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 18 20:25:49 archpad kernel: CR2: 0000000800880000 CR3: 00000001f9040000 CR4: 0000000000340ef0 $ objdump -d amdgpu.ko ... 
00000000001b83c0 <dcn_bw_ceil2>:
1b83c0: e8 00 00 00 00 callq 1b83c5 <dcn_bw_ceil2+0x5>
1b83c5: 66 0f ef ed pxor %xmm5,%xmm5
1b83c9: 0f 2e cd ucomiss %xmm5,%xmm1
1b83cc: 7b 3e jnp 1b840c <dcn_bw_ceil2+0x4c>
1b83ce: 0f 28 d0 movaps %xmm0,%xmm2
1b83d1: 66 0f ef db pxor %xmm3,%xmm3
1b83d5: 66 0f ef e4 pxor %xmm4,%xmm4
1b83d9: f3 0f 5e d1 divss %xmm1,%xmm2
1b83dd: f3 0f 5a e0 cvtss2sd %xmm0,%xmm4
1b83e1: f3 0f 2c c2 cvttss2si %xmm2,%eax
1b83e5: 66 0f ef d2 pxor %xmm2,%xmm2
1b83e9: f3 0f 2a d0 cvtsi2ss %eax,%xmm2
1b83ed: f3 0f 59 d1 mulss %xmm1,%xmm2
1b83f1: f3 0f 5a da cvtss2sd %xmm2,%xmm3
1b83f5: f2 0f 58 1d 00 00 00 addsd 0x0(%rip),%xmm3 # 1b83fd <dcn_bw_ceil2+0x3d>
1b83fc: 00
1b83fd: 66 0f 2f dc comisd %xmm4,%xmm3
1b8401: 72 01 jb 1b8404 <dcn_bw_ceil2+0x44>
1b8403: c3 retq
1b8404: f3 0f 58 ca addss %xmm2,%xmm1
1b8408: 0f 28 c1 movaps %xmm1,%xmm0
1b840b: c3 retq
1b840c: 75 c0 jne 1b83ce <dcn_bw_ceil2+0xe>
1b840e: 66 0f ef c0 pxor %xmm0,%xmm0
1b8412: c3 retq
1b8413: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
1b841a: 00 00 00 00
1b841e: 66 90 xchg %ax,%ax
...

Instruction at RIP: 0010:dcn_bw_ceil2+0x35:

>>> hex(0x00000000001b83c0 + 0x35)
'0x1b83f5'

1b83f5: f2 0f 58 1d 00 00 00 addsd 0x0(%rip),%xmm3 # 1b83fd <dcn_bw_ceil2+0x3d>

Same addsd instruction that was mentioned above.
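The same instruction can also be cross-checked from the "Code:" line of the oops itself, where the byte in angle brackets is the first byte at RIP. Below is a small Python sketch of that check; split_code_line is a made-up helper name, and the byte string is copied from the dmesg output in this comment.

```python
def split_code_line(code):
    """Split an oops 'Code:' byte dump at the <xx> marker (first byte at RIP)."""
    before, fault = [], []
    seen_marker = False
    for token in code.split():
        if token.startswith("<") and token.endswith(">"):
            seen_marker = True
            token = token[1:-1]
        (fault if seen_marker else before).append(int(token, 16))
    return bytes(before), bytes(fault)


# Byte dump from the "Code:" line of the Aug 18 oops above.
CODE = (
    "cd 7b 3e 0f 28 d0 66 0f ef db 66 0f ef e4 f3 0f 5e d1 f3 0f 5a e0 "
    "f3 0f 2c c2 66 0f ef d2 f3 0f 2a d0 f3 0f 59 d1 f3 0f 5a da "
    "<f2> 0f 58 1d 5b 19 2e 00 66 0f 2f dc 72 01 c3 f3 0f 58 ca 0f 28 c1"
)

before, fault = split_code_line(CODE)

# f2 0f 58 is the addsd opcode; ModRM 0x1d selects %xmm3 with a
# RIP-relative 32-bit displacement, stored little-endian in the next 4 bytes.
opcode = " ".join(f"{b:02x}" for b in fault[:4])
displacement = int.from_bytes(fault[4:8], "little")

if __name__ == "__main__":
    print("bytes before RIP:", len(before))
    print("faulting opcode: ", opcode)
    print("displacement:    ", hex(displacement))
```

The faulting bytes start with f2 0f 58 1d, matching the addsd at 1b83f5 in the objdump listing. The displacement is zero in the .ko and non-zero in the live dump, presumably because the module file has not been relocated yet.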
@krakopo Can you apply the debug info patch from here? https://bugzilla.kernel.org/attachment.cgi?id=289421&action=diff What kernel are you running inside the KVM virtual machine? I wonder if the virtual machine has the MXCSR problem, perhaps it could be leaking to the host somehow.
@Petteri I'm running DragonFly BSD 5.8.1 in my KVM virtual machine. Here is the dmesg output with the debug info patch applied: Aug 19 23:18:03 archpad kernel: MXCSR: 00000020 XMM3: 4010000000000000 Aug 19 23:18:03 archpad kernel: simd exception: 0000 [#1] PREEMPT SMP NOPTI Aug 19 23:18:03 archpad kernel: CPU: 5 PID: 518 Comm: Xorg Not tainted 5.8.1-arch1206987 #1 Aug 19 23:18:03 archpad kernel: Hardware name: LENOVO 81W4/LNVNB161216, BIOS DZCN19WW 04/13/2020 Aug 19 23:18:03 archpad kernel: RIP: 0010:dcn_bw_ceil2+0x35/0x60 [amdgpu] Aug 19 23:18:03 archpad kernel: Code: cd 7b 3e 0f 28 d0 66 0f ef db 66 0f ef e4 f3 0f 5e d1 f3 0f 5a e0 f3 0f 2c c2 66 0f ef d2 f3 0f 2a d0 f3 0f 59 d1 f3 0f 5a da <f2> 0f 58 1d 5b 19 2e 00 66 0f 2f dc 72 01 c3 f3 0f 58 ca 0f 28 c1 Aug 19 23:18:03 archpad kernel: RSP: 0018:ffff9e24c10775f8 EFLAGS: 00010202 Aug 19 23:18:03 archpad kernel: RAX: 0000000000000004 RBX: 0000000000000000 RCX: 0000000000000780 Aug 19 23:18:03 archpad kernel: RDX: ffff93d6d0683080 RSI: ffff93d6d0689560 RDI: 0000000044444440 Aug 19 23:18:03 archpad kernel: RBP: ffff93d6d06831c0 R08: ffff93d6d06833b4 R09: 0000000000000000 Aug 19 23:18:03 archpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff93d6d0683360 Aug 19 23:18:03 archpad kernel: R13: 0000000000000001 R14: ffff93d6d0682188 R15: ffff93d6d0682028 Aug 19 23:18:03 archpad kernel: FS: 00007f222278d940(0000) GS:ffff93d707740000(0000) knlGS:0000000000000000 Aug 19 23:18:03 archpad kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 19 23:18:03 archpad kernel: CR2: 0000000800ea8030 CR3: 00000002021ca000 CR4: 0000000000340ee0 Aug 19 23:18:03 archpad kernel: Call Trace: Aug 19 23:18:03 archpad kernel: dml21_ModeSupportAndSystemConfigurationFull+0x437/0x5cf0 [amdgpu] Aug 19 23:18:03 archpad kernel: ? cpufreq_this_cpu_can_update+0xe/0x50 Aug 19 23:18:03 archpad kernel: ? sugov_update_single+0x58/0x210 Aug 19 23:18:03 archpad kernel: ? sugov_get_util+0xf0/0xf0 Aug 19 23:18:03 archpad kernel: ? 
update_blocked_averages+0x539/0x620
Aug 19 23:18:03 archpad kernel: ? update_group_capacity+0x25/0x1c0
Aug 19 23:18:03 archpad kernel: ? cpumask_next_and+0x19/0x20
Aug 19 23:18:03 archpad kernel: ? update_sd_lb_stats.constprop.0+0x799/0x8f0
Aug 19 23:18:03 archpad kernel: ? cpufreq_this_cpu_can_update+0xe/0x50
Aug 19 23:18:03 archpad kernel: ? sugov_update_single+0x143/0x210
Aug 19 23:18:03 archpad kernel: ? sugov_get_util+0xf0/0xf0
Aug 19 23:18:03 archpad kernel: ? update_load_avg+0x63a/0x660
Aug 19 23:18:03 archpad kernel: ? update_curr+0x73/0x1f0
Aug 19 23:18:03 archpad kernel: ? enqueue_entity+0x14e/0x750
Aug 19 23:18:03 archpad kernel: ? resched_curr+0x20/0xc0
Aug 19 23:18:03 archpad kernel: ? check_preempt_wakeup+0x13b/0x250
Aug 19 23:18:03 archpad kernel: ? check_preempt_curr+0x67/0x90
Aug 19 23:18:03 archpad kernel: ? _raw_spin_unlock+0x16/0x30
Aug 19 23:18:03 archpad kernel: dml_get_voltage_level+0x116/0x1e0 [amdgpu]
Aug 19 23:18:03 archpad kernel: dcn20_fast_validate_bw+0x359/0x680 [amdgpu]
Aug 19 23:18:03 archpad kernel: ? resource_build_scaling_params+0xc44/0x11a0 [amdgpu]
Aug 19 23:18:03 archpad kernel: dcn21_validate_bandwidth+0xcd/0x2a0 [amdgpu]
Aug 19 23:18:03 archpad kernel: dc_validate_global_state+0x2f2/0x390 [amdgpu]
Aug 19 23:18:03 archpad kernel: amdgpu_dm_atomic_check+0xefb/0x1010 [amdgpu]
Aug 19 23:18:03 archpad kernel: ? free_one_page+0x57/0xd0
Aug 19 23:18:03 archpad kernel: drm_atomic_check_only+0x57c/0x7f0 [drm]
Aug 19 23:18:03 archpad kernel: ? __drm_atomic_helper_crtc_duplicate_state+0x85/0xd0 [drm_kms_helper]
Aug 19 23:18:03 archpad kernel: drm_atomic_commit+0x13/0x50 [drm]
Aug 19 23:18:03 archpad kernel: drm_atomic_helper_legacy_gamma_set+0x123/0x180 [drm_kms_helper]
Aug 19 23:18:03 archpad kernel: drm_mode_gamma_set_ioctl+0x19a/0x230 [drm]
Aug 19 23:18:03 archpad kernel: ? drm_color_lut_check+0xa0/0xa0 [drm]
Aug 19 23:18:03 archpad kernel: drm_ioctl_kernel+0xb2/0x100 [drm]
Aug 19 23:18:03 archpad kernel: drm_ioctl+0x208/0x360 [drm]
Aug 19 23:18:03 archpad kernel: ? drm_color_lut_check+0xa0/0xa0 [drm]
Aug 19 23:18:03 archpad kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Aug 19 23:18:03 archpad kernel: ksys_ioctl+0x82/0xc0
Aug 19 23:18:03 archpad kernel: __x64_sys_ioctl+0x16/0x20
Aug 19 23:18:03 archpad kernel: do_syscall_64+0x44/0x70
Aug 19 23:18:03 archpad kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 19 23:18:03 archpad kernel: RIP: 0033:0x7f22234b08eb
Aug 19 23:18:03 archpad kernel: Code: 0f 1e fa 48 8b 05 a5 95 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 95 0c 00 f7 d8 64 89 01 48
Aug 19 23:18:03 archpad kernel: RSP: 002b:00007ffee6662f48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 19 23:18:03 archpad kernel: RAX: ffffffffffffffda RBX: 00007ffee6662f80 RCX: 00007f22234b08eb
Aug 19 23:18:03 archpad kernel: RDX: 00007ffee6662f80 RSI: 00000000c02064a5 RDI: 000000000000000a
Aug 19 23:18:03 archpad kernel: RBP: 00000000c02064a5 R08: 000055b14cc95f10 R09: 000055b14cc96110
Aug 19 23:18:03 archpad kernel: R10: 000055b14cc95d10 R11: 0000000000000246 R12: 0000000000000100
Aug 19 23:18:03 archpad kernel: R13: 000000000000000a R14: 0000000000000100 R15: 0000000000000100
Aug 19 23:18:03 archpad kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c tun bridge hid_multitouch hid_generic 8021q garp mrp stp llc amdgpu ath10k_pci edac_mce_amd ath10k_core kvm_amd snd_acp3x_rn kvm snd_acp3x_pdm_dma ebtable_filter ebtables ip6table_filter snd_soc_dmic ip6_tables snd_soc_core ath irqbypass iptable_filter crct10dif_pclmul crc32_pclmul mac80211 ghash_clmulni_intel joydev snd_compress ac97_bus snd_pcm_dmaengine mousedev wmi_bmof aesni_intel crypto_simd ccm cryptd glue_helper algif_aead snd_hda_codec_generic btusb rapl snd_hda_codec_hdmi ledtrig_audio des_generic input_leds gpu_sched pcspkr libdes snd_hda_intel btrtl i2c_algo_bit snd_intel_dspcfg btbcm ttm arc4 snd_hda_codec cbc btintel ecb snd_hda_core uvcvideo algif_skcipher bluetooth drm_kms_helper k10temp sp5100_tco snd_hwdep i2c_piix4 snd_pcm cmac md4 videobuf2_vmalloc cec
Aug 19 23:18:03 archpad kernel: cfg80211 videobuf2_memops algif_hash af_alg videobuf2_v4l2 rc_core tpm_crb videobuf2_common snd_timer nls_iso8859_1 syscopyarea videodev ideapad_laptop sysfillrect nls_cp437 tpm_tis snd ccp ecdh_generic tpm_tis_core snd_rn_pci_acp3x ecc vfat sparse_keymap sysimgblt fat soundcore tpm snd_pci_acp3x mc rfkill fb_sys_fops i2c_hid hid libarc4 wmi evdev pinctrl_amd battery mac_hid elants_i2c acpi_cpufreq rng_core ac drm agpgart pkcs8_key_parser ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 serio_raw xhci_pci atkbd xhci_pci_renesas libps2 xhci_hcd crc32c_intel i8042 serio
Aug 19 23:18:03 archpad kernel: ---[ end trace a01eac408369453d ]---
Aug 19 23:18:03 archpad kernel: RIP: 0010:dcn_bw_ceil2+0x35/0x60 [amdgpu]
Aug 19 23:18:03 archpad kernel: Code: cd 7b 3e 0f 28 d0 66 0f ef db 66 0f ef e4 f3 0f 5e d1 f3 0f 5a e0 f3 0f 2c c2 66 0f ef d2 f3 0f 2a d0 f3 0f 59 d1 f3 0f 5a da <f2> 0f 58 1d 5b 19 2e 00 66 0f 2f dc 72 01 c3 f3 0f 58 ca 0f 28 c1
Aug 19 23:18:03 archpad kernel: RSP: 0018:ffff9e24c10775f8 EFLAGS: 00010202
Aug 19 23:18:03 archpad kernel: RAX: 0000000000000004 RBX: 0000000000000000 RCX: 0000000000000780
Aug 19 23:18:03 archpad kernel: RDX: ffff93d6d0683080 RSI: ffff93d6d0689560 RDI: 0000000044444440
Aug 19 23:18:03 archpad kernel: RBP: ffff93d6d06831c0 R08: ffff93d6d06833b4 R09: 0000000000000000
Aug 19 23:18:03 archpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff93d6d0683360
Aug 19 23:18:03 archpad kernel: R13: 0000000000000001 R14: ffff93d6d0682188 R15: ffff93d6d0682028
Aug 19 23:18:03 archpad kernel: FS: 00007f222278d940(0000) GS:ffff93d707640000(0000) knlGS:0000000000000000
Aug 19 23:18:03 archpad kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 19 23:18:03 archpad kernel: CR2: 0000000800cca010 CR3: 00000002021ca000 CR4: 0000000000340ee0
@krakopo The 00000020 MXCSR value is also exactly like it was for me before the bug fix. So something is definitely clearing MXCSR after it should be set to 0x1F80 by kernel_fpu_begin(). Can you disassemble kernel_fpu_begin() to verify that the ldmxcsr instruction is present close to its end? Also, check that /proc/cpuinfo flags has "sse" in it - not sure though how that could possibly be missing.
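To illustrate why the difference between 0x20 and 0x1F80 matters, here is a minimal userspace sketch that decodes an MXCSR value. It assumes the standard x86 MXCSR layout (sticky exception flags in bits 0-5, exception masks in bits 7-12). The helper names are made up for this example and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* MXCSR layout (x86 SSE):
 *   bits 0-5   sticky exception flags (IE, DE, ZE, OE, UE, PE)
 *   bits 7-12  exception masks        (IM, DM, ZM, OM, UM, PM)
 * The names below are hypothetical helpers for illustration only.
 */
#define MXCSR_FLAGS      0x003Fu
#define MXCSR_MASKS      0x1F80u
#define MXCSR_MASK_SHIFT 7

/* Sticky exception flags currently set, as a 6-bit field. */
static uint32_t mxcsr_flags(uint32_t mxcsr)
{
    return mxcsr & MXCSR_FLAGS;
}

/* Exception classes that are NOT masked (and so would trap),
 * as a 6-bit field in the same order as the flags. */
static uint32_t mxcsr_unmasked(uint32_t mxcsr)
{
    return (~mxcsr & MXCSR_MASKS) >> MXCSR_MASK_SHIFT;
}

/* True when every SIMD floating-point exception is masked. */
static bool mxcsr_all_masked(uint32_t mxcsr)
{
    return mxcsr_unmasked(mxcsr) == 0;
}
```

Under that layout, 0x1F80 has all six mask bits set and no flags, so SSE arithmetic never traps. 0x20 has every mask clear and only the precision (inexact) flag set, so any inexact result raises an unmasked exception. That would be consistent with the "simd exception" oops landing on ordinary SSE arithmetic instructions in the traces.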
I do see ldmxcsr in the disassembly:

ffffffff81038870 <kernel_fpu_begin>:
ffffffff81038870: e8 9b 07 03 00          callq ffffffff81069010 <__fentry__>
ffffffff81038875: 48 83 ec 10             sub $0x10,%rsp
ffffffff81038879: bf 01 00 00 00          mov $0x1,%edi
ffffffff8103887e: 65 48 8b 04 25 28 00    mov %gs:0x28,%rax
ffffffff81038885: 00 00
ffffffff81038887: 48 89 44 24 08          mov %rax,0x8(%rsp)
ffffffff8103888c: 31 c0                   xor %eax,%eax
ffffffff8103888e: c7 44 24 04 00 00 00    movl $0x0,0x4(%rsp)
ffffffff81038895: 00
ffffffff81038896: e8 35 ae 08 00          callq ffffffff810c36d0 <preempt_count_add>
ffffffff8103889b: e8 80 fd ff ff          callq ffffffff81038620 <irq_fpu_usable>
ffffffff810388a0: 65 8a 05 b1 f2 fd 7e    mov %gs:0x7efdf2b1(%rip),%al # 17b58 <in_kernel_fpu>
ffffffff810388a7: 65 c6 05 a9 f2 fd 7e    movb $0x1,%gs:0x7efdf2a9(%rip) # 17b58 <in_kernel_fpu>
ffffffff810388ae: 01
ffffffff810388af: 65 48 8b 3c 25 c0 7b    mov %gs:0x17bc0,%rdi
ffffffff810388b6: 01 00
ffffffff810388b8: f6 47 26 20             testb $0x20,0x26(%rdi)
ffffffff810388bc: 74 3c                   je ffffffff810388fa <kernel_fpu_begin+0x8a>
ffffffff810388be: 48 c7 c7 57 43 40 82    mov $0xffffffff82404357,%rdi
ffffffff810388c5: e8 46 41 9c 00          callq ffffffff819fca10 <__this_cpu_preempt_check>
ffffffff810388ca: c7 44 24 04 80 1f 00    movl $0x1f80,0x4(%rsp)
ffffffff810388d1: 00
ffffffff810388d2: 65 48 c7 05 82 f2 fd    movq $0x0,%gs:0x7efdf282(%rip) # 17b60 <fpu_fpregs_owner_ctx>
ffffffff810388d9: 7e 00 00 00 00
ffffffff810388de: 0f ae 54 24 04          ldmxcsr 0x4(%rsp)
ffffffff810388e3: db e3                   fninit
ffffffff810388e5: 48 8b 44 24 08          mov 0x8(%rsp),%rax
ffffffff810388ea: 65 48 2b 04 25 28 00    sub %gs:0x28,%rax
ffffffff810388f1: 00 00
ffffffff810388f3: 75 20                   jne ffffffff81038915 <kernel_fpu_begin+0xa5>
ffffffff810388f5: 48 83 c4 10             add $0x10,%rsp
ffffffff810388f9: c3                      retq
ffffffff810388fa: 48 8b 07                mov (%rdi),%rax
ffffffff810388fd: f6 c4 40                test $0x40,%ah
ffffffff81038900: 75 bc                   jne ffffffff810388be <kernel_fpu_begin+0x4e>
ffffffff81038902: f0 80 4f 01 40          lock orb $0x40,0x1(%rdi)
ffffffff81038907: 48 81 c7 00 1b 00 00    add $0x1b00,%rdi
ffffffff8103890e: e8 5d fd ff ff          callq ffffffff81038670 <copy_fpregs_to_fpstate>
ffffffff81038913: eb a9                   jmp ffffffff810388be <kernel_fpu_begin+0x4e>
ffffffff81038915: e8 36 3c 9c 00          callq ffffffff819fc550 <__stack_chk_fail>
ffffffff8103891a: 66 0f 1f 44 00 00       nopw 0x0(%rax,%rax,1)

And yes, I do have the "sse" flag in /proc/cpuinfo.
@krakopo I must say I don't have any idea what could be happening on your machine. It could be explained if the kernel thread were being pre-empted, but pre-emption is disabled by kernel_fpu_begin(). It may also help to ask in bug 207979, which has some of the long-time x86 maintainers on CC.
Created attachment 295225 [details]
Call DC_FP_START() / DC_FP_END() in dcn21_validate_bandwidth

Could it be that DC_FP_START()/DC_FP_END() aka kernel_fpu_begin()/kernel_fpu_end() are not called in the *_validate_bandwidth code path on AMD Renoir systems? To my untrained eye it looks like the calls are missing there, while they _are_ present for dcn20.

I've been running the attached patch for 2 days now with some KVM VMs open and the system seems stable. Previously, I had crashes/backtraces similar to those @krakopo described.

I'm happy to help test any patches. I'm running a Thinkpad T14 with an AMD Ryzen 7 PRO 4750U (Renoir).
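The idea behind the patch can be sketched in userspace as follows. This is only a mock, not the attached patch or any amdgpu code: MOCK_FP_START()/MOCK_FP_END() are hypothetical stand-ins for DC_FP_START()/DC_FP_END(), and the helper's explicit depth check models "the FPU state has not been set up", which the real kernel code does not check for.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins for DC_FP_START()/DC_FP_END();
 * they only track whether we are inside a guarded FP section. */
static int mock_fp_depth;
#define MOCK_FP_START() (mock_fp_depth++)
#define MOCK_FP_END()   (mock_fp_depth--)

/* Hypothetical floating-point helper. It refuses to run outside a
 * guarded section; the real kernel has no such check and simply runs
 * with whatever MXCSR state happens to be loaded. */
static bool mock_calc_bandwidth(float clock_mhz, float *out)
{
    if (mock_fp_depth <= 0)
        return false;
    *out = clock_mhz / 3.0f;
    return true;
}

/* Models a validate path that forgets the guard. */
static bool mock_validate_unguarded(float clock_mhz, float *out)
{
    return mock_calc_bandwidth(clock_mhz, out);
}

/* Models the patched path: FP work bracketed by start/end. */
static bool mock_validate_guarded(float clock_mhz, float *out)
{
    bool ok;

    MOCK_FP_START();
    ok = mock_calc_bandwidth(clock_mhz, out);
    MOCK_FP_END();
    return ok;
}
```

In this mock the unguarded variant fails and the guarded one succeeds while leaving the depth counter balanced. The bracket should sit around the whole floating-point region, and every exit path should reach the matching end call.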
(In reply to Jan Kokemüller from comment #39)
> Created attachment 295225 [details]
> Call DC_FP_START() / DC_FP_END() in dcn21_validate_bandwidth
>
> Could it be that DC_FP_START()/DC_FP_END() aka
> kernel_fpu_begin()/kernel_fpu_end() are not called in the
> *_validate_bandwidth code path on AMD Renoir systems? To my untrained eye it
> looks like it is missing, while it _is_ there for dcn20.
>
> I've been running the attached patch for 2 days now with some KVM VMs open
> and the system seems stable. Previously, I had similar crashes/backtraces
> @krakopo described.
>
> I'm happy to help testing any patches. I'm running a Thinkpad T14 with a AMD
> Ryzen 7 PRO 4750U (Renoir).

Looks correct. Care to send out a proper git patch?
> Looks correct. Care to send out a proper git patch? Thank you for having a look at the patch! I've sent it to the amd-gfx list.