Bug 218949 - Kernel panic after upgrading to 6.10-rc2
Summary: Kernel panic after upgrading to 6.10-rc2
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: All Linux
Importance: P3 normal
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-09 21:27 UTC by Gino Badouri
Modified: 2024-06-11 07:22 UTC
CC List: 0 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
full logfile (zipped) (50.20 KB, application/x-zip-compressed)
2024-06-09 21:27 UTC, Gino Badouri
Settings of the VM (132.60 KB, image/png)
2024-06-10 10:57 UTC, Gino Badouri

Description Gino Badouri 2024-06-09 21:27:24 UTC
Created attachment 306446
full logfile (zipped)

I decided to try out 6.10-rc2 on my Proxmox machine, which runs on a Zen 2 Threadripper, because of all the amd-pstate improvements.
During bootup I noticed that it prints a lot of kernel panics in the logs.

They mostly look like this:

Jun 09 23:11:23 pve kernel: ------------[ cut here ]------------
Jun 09 23:11:23 pve kernel: WARNING: CPU: 9 PID: 1870 at include/linux/rwsem.h:85 remap_pfn_range_notrack+0x4a5/0x590
Jun 09 23:11:23 pve kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter scsi_transport_iscsi nf_tables bonding tls softdog sunrpc nfnetl>
Jun 09 23:11:23 pve kernel:  xhci_hcd i2c_piix4 wmi
Jun 09 23:11:23 pve kernel: CPU: 9 PID: 1870 Comm: CPU 0/KVM Tainted: G        W  OE      6.10.0-rc2 #3
Jun 09 23:11:23 pve kernel: Hardware name: ASUS System Product Name/ROG ZENITH II EXTREME, BIOS 2102 02/16/2024
Jun 09 23:11:23 pve kernel: RIP: 0010:remap_pfn_range_notrack+0x4a5/0x590
Jun 09 23:11:23 pve kernel: Code: 45 31 d2 45 31 db e9 2a f2 d2 00 48 8b 7d b8 48 89 c6 e8 ce 95 ff ff 85 c0 0f 84 66 fe ff ff eb a6 0f 0b b9 ea ff ff ff eb a2 <0f> 0b e9 e9 fb ff ff 0f 0b 48 8b 7d b8 4c 89 fa 4c 89 ce 4c 89 4d
Jun 09 23:11:23 pve kernel: RSP: 0018:ffffb640c103f900 EFLAGS: 00010246
Jun 09 23:11:23 pve kernel: RAX: 000000802d0644fb RBX: ffff9485c89ea730 RCX: 0000000000100000
Jun 09 23:11:23 pve kernel: RDX: 0000000000000000 RSI: ffff9485e489bc80 RDI: ffff9485c89ea730
Jun 09 23:11:23 pve kernel: RBP: ffffb640c103f9b8 R08: 8000000000000037 R09: 0000000000000000
Jun 09 23:11:23 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000c2100
Jun 09 23:11:23 pve kernel: R13: 00007f8a50200000 R14: 8000000000000037 R15: 00007f8a50100000
Jun 09 23:11:23 pve kernel: FS:  00007f8a4aa006c0(0000) GS:ffff94a47dc80000(0000) knlGS:0000000000000000
Jun 09 23:11:23 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 09 23:11:23 pve kernel: CR2: 00007f8a352ae000 CR3: 0000000117588000 CR4: 0000000000350ef0
Jun 09 23:11:23 pve kernel: Call Trace:
Jun 09 23:11:23 pve kernel:  <TASK>
Jun 09 23:11:23 pve kernel:  ? show_regs+0x6c/0x80
Jun 09 23:11:23 pve kernel:  ? __warn+0x88/0x140
Jun 09 23:11:23 pve kernel:  ? remap_pfn_range_notrack+0x4a5/0x590
Jun 09 23:11:23 pve kernel:  ? report_bug+0x182/0x1b0
Jun 09 23:11:23 pve kernel:  ? handle_bug+0x46/0x90
Jun 09 23:11:23 pve kernel:  ? exc_invalid_op+0x18/0x80
Jun 09 23:11:23 pve kernel:  ? asm_exc_invalid_op+0x1b/0x20
Jun 09 23:11:23 pve kernel:  ? remap_pfn_range_notrack+0x4a5/0x590
Jun 09 23:11:23 pve kernel:  ? track_pfn_remap+0x139/0x140
Jun 09 23:11:23 pve kernel:  ? down_write+0x12/0x80
Jun 09 23:11:23 pve kernel:  remap_pfn_range+0x5c/0xc0
Jun 09 23:11:23 pve kernel:  ? srso_return_thunk+0x5/0x5f
Jun 09 23:11:23 pve kernel:  vfio_pci_mmap_fault+0xb1/0x180 [vfio_pci_core]
Jun 09 23:11:23 pve kernel:  __do_fault+0x3b/0x130
Jun 09 23:11:23 pve kernel:  do_fault+0xc5/0x490
Jun 09 23:11:23 pve kernel:  ? srso_return_thunk+0x5/0x5f
Jun 09 23:11:23 pve kernel:  __handle_mm_fault+0x842/0x1100
Jun 09 23:11:23 pve kernel:  handle_mm_fault+0x197/0x340
Jun 09 23:11:23 pve kernel:  fixup_user_fault+0x91/0x1e0
Jun 09 23:11:23 pve kernel:  vaddr_get_pfns+0x10e/0x280 [vfio_iommu_type1]
Jun 09 23:11:23 pve kernel:  vfio_pin_pages_remote+0x39f/0x520 [vfio_iommu_type1]
Jun 09 23:11:23 pve kernel:  ? srso_return_thunk+0x5/0x5f
Jun 09 23:11:23 pve kernel:  ? alloc_pages_mpol_noprof+0xd9/0x1f0
Jun 09 23:11:23 pve kernel:  vfio_iommu_type1_ioctl+0x10ad/0x1ad0 [vfio_iommu_type1]
Jun 09 23:11:23 pve kernel:  vfio_fops_unl_ioctl+0x6b/0x380 [vfio]
Jun 09 23:11:23 pve kernel:  __x64_sys_ioctl+0xa3/0xf0
Jun 09 23:11:23 pve kernel:  x64_sys_call+0xa68/0x24d0
Jun 09 23:11:23 pve kernel:  do_syscall_64+0x70/0x160
Jun 09 23:11:23 pve kernel:  ? srso_return_thunk+0x5/0x5f
Jun 09 23:11:23 pve kernel:  ? irqentry_exit+0x43/0x50
Jun 09 23:11:23 pve kernel:  ? srso_return_thunk+0x5/0x5f
Jun 09 23:11:23 pve kernel:  ? exc_page_fault+0x93/0x1b0
Jun 09 23:11:23 pve kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jun 09 23:11:23 pve kernel: RIP: 0033:0x7f8a5cb8cc5b
Jun 09 23:11:23 pve kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Jun 09 23:11:23 pve kernel: RSP: 002b:00007f8a4a9faa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 09 23:11:23 pve kernel: RAX: ffffffffffffffda RBX: 0000560ed91739b0 RCX: 00007f8a5cb8cc5b
Jun 09 23:11:23 pve kernel: RDX: 00007f8a4a9faaa0 RSI: 0000000000003b71 RDI: 000000000000003e
Jun 09 23:11:23 pve kernel: RBP: 0000000081c00000 R08: 0000000000000000 R09: 0000000000000000
Jun 09 23:11:23 pve kernel: R10: 00000000000fe000 R11: 0000000000000246 R12: 00000000000fe000
Jun 09 23:11:23 pve kernel: R13: 00000000000fe000 R14: 00007f8a4a9faaa0 R15: 00007f8a4a9fabf0
Jun 09 23:11:23 pve kernel:  </TASK>
Jun 09 23:11:23 pve kernel: ---[ end trace 0000000000000000 ]---

But I've attached a full log containing all the panics.
The system seems to run stably otherwise.
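
To see how many distinct warning sites a log like the attached one contains, the WARNING header lines can be tallied by source location. This is only a minimal sketch, assuming lines formatted like the excerpt above; the function name `summarize_warnings` and the sample lines are made up for illustration and are not part of any kernel tooling.

```python
import re
from collections import Counter

# Matches the header line of a kernel WARN splat, for example:
#   WARNING: CPU: 9 PID: 1870 at include/linux/rwsem.h:85 remap_pfn_range_notrack+0x4a5/0x590
# and captures the source location ("file:line") and the function name.
WARN_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<site>\S+:\d+) (?P<func>[^\s+]+)\+"
)


def summarize_warnings(lines):
    """Count WARN splats per (source location, function) pair."""
    counts = Counter()
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            counts[(match.group("site"), match.group("func"))] += 1
    return counts


# Hypothetical sample input, shaped like the log lines in this report.
sample = [
    "Jun 09 23:11:23 pve kernel: WARNING: CPU: 9 PID: 1870 at "
    "include/linux/rwsem.h:85 remap_pfn_range_notrack+0x4a5/0x590",
    "Jun 09 23:11:23 pve kernel: RIP: 0010:remap_pfn_range_notrack+0x4a5/0x590",
    "Jun 09 23:11:24 pve kernel: WARNING: CPU: 2 PID: 1870 at "
    "include/linux/rwsem.h:85 remap_pfn_range_notrack+0x4a5/0x590",
]

for (site, func), count in summarize_warnings(sample).most_common():
    print(count, site, func)
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines. Header lines that were wrapped across two lines (as in email quotes) will not match this pattern.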
Comment 1 Gino Badouri 2024-06-10 10:57:25 UTC
Created attachment 306447
Settings of the VM
Comment 2 Gino Badouri 2024-06-10 11:03:46 UTC
Alright, it's not a regression in the kernel but caused by a BIOS update (I guess).
I get the same on my previous kernel, 6.9.0-rc1.
Both my 6.9.0-rc1 and 6.10.0-rc2 kernels are vanilla builds from kernel.org (unpatched).

After updating the BIOS/firmware of my mainboard (Asus ROG Zenith II Extreme) from 1802 to 2102, it always seems to produce this error:

[ 1150.380137] ------------[ cut here ]------------
[ 1150.380141] Unpatched return thunk in use. This should not happen!
[ 1150.380144] WARNING: CPU: 3 PID: 4849 at arch/x86/kernel/cpu/bugs.c:2935 __warn_thunk+0x40/0x50
[ 1150.380152] Modules linked in: veth rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables scsi_transport_iscsi bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc amd_atl intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel eeepc_wmi crypto_simd asus_wmi cryptd platform_profile sparse_keymap asus_ec_sensors video pcspkr rapl ccp mxm_wmi wmi_bmof k10temp mac_hid vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd vhost_net vhost vhost_iotlb tap nct6775 nct6775_core hwmon_vid lm75 drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c igb xhci_pci atlantic nvme ahci crc32_pclmul xhci_pci_renesas i2c_algo_bit libahci dca macsec nvme_core xhci_hcd i2c_piix4 wmi
[ 1150.380266] CPU: 3 PID: 4849 Comm: CPU 0/KVM Not tainted 6.9.0-rc1 #1
[ 1150.380269] Hardware name: ASUS System Product Name/ROG ZENITH II EXTREME, BIOS 2102 02/16/2024
[ 1150.380271] RIP: 0010:__warn_thunk+0x40/0x50
[ 1150.380275] Code: 96 f1 fe 00 83 e3 01 74 0e 48 8b 5d f8 c9 31 f6 31 ff e9 43 1c 08 01 48 c7 c7 b8 f2 f4 9e c6 05 56 61 4c 02 01 e8 00 b1 07 00 <0f> 0b 48 8b 5d f8 c9 31 f6 31 ff e9 20 1c 08 01 90 90 90 90 90 90
[ 1150.380278] RSP: 0018:ffffb478c2ce3ca8 EFLAGS: 00010046
[ 1150.380281] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1150.380283] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1150.380285] RBP: ffffb478c2ce3cb0 R08: 0000000000000000 R09: 0000000000000000
[ 1150.380287] R10: 0000000000000000 R11: 0000000000000000 R12: ffff91f80e948000
[ 1150.380289] R13: 0000000000000000 R14: ffffb478c4ab5000 R15: ffff91f80e948038
[ 1150.380291] FS:  00007f74baa006c0(0000) GS:ffff9216bd980000(0000) knlGS:0000000000000000
[ 1150.380293] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1150.380295] CR2: 0000000000000000 CR3: 0000000106750000 CR4: 0000000000350ef0
[ 1150.380298] Call Trace:
[ 1150.380300]  <TASK>
[ 1150.380304]  ? show_regs+0x6c/0x80
[ 1150.380309]  ? __warn+0x88/0x140
[ 1150.380312]  ? __warn_thunk+0x40/0x50
[ 1150.380316]  ? report_bug+0x182/0x1b0
[ 1150.380322]  ? handle_bug+0x46/0x90
[ 1150.380325]  ? exc_invalid_op+0x18/0x80
[ 1150.380329]  ? asm_exc_invalid_op+0x1b/0x20
[ 1150.380336]  ? __warn_thunk+0x40/0x50
[ 1150.380341]  ? __warn_thunk+0x40/0x50
[ 1150.380344]  warn_thunk_thunk+0x16/0x30
[ 1150.380351]  svm_vcpu_enter_exit+0x71/0xc0 [kvm_amd]
[ 1150.380364]  svm_vcpu_run+0x1e7/0x850 [kvm_amd]
[ 1150.380377]  kvm_arch_vcpu_ioctl_run+0xca3/0x16d0 [kvm]
[ 1150.380458]  kvm_vcpu_ioctl+0x295/0x800 [kvm]
[ 1150.380522]  ? srso_return_thunk+0x5/0x5f
[ 1150.380526]  ? __x64_sys_ioctl+0xbb/0xf0
[ 1150.380530]  ? srso_return_thunk+0x5/0x5f
[ 1150.380533]  ? syscall_exit_to_user_mode+0x75/0x1b0
[ 1150.380537]  ? srso_return_thunk+0x5/0x5f
[ 1150.380541]  ? do_syscall_64+0x84/0x140
[ 1150.380544]  ? srso_return_thunk+0x5/0x5f
[ 1150.380547]  ? do_syscall_64+0x84/0x140
[ 1150.380550]  ? switch_fpu_return+0x50/0xe0
[ 1150.380555]  __x64_sys_ioctl+0xa3/0xf0
[ 1150.380559]  do_syscall_64+0x78/0x140
[ 1150.380563]  ? srso_return_thunk+0x5/0x5f
[ 1150.380566]  ? do_syscall_64+0x84/0x140
[ 1150.380569]  entry_SYSCALL_64_after_hwframe+0x6c/0x74
[ 1150.380573] RIP: 0033:0x7f74cbb8cc5b
[ 1150.380592] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1150.380595] RSP: 002b:00007f74ba9fb060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1150.380598] RAX: ffffffffffffffda RBX: 000056062f0cf7e0 RCX: 00007f74cbb8cc5b
[ 1150.380600] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001f
[ 1150.380602] RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000
[ 1150.380604] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 1150.380606] R13: 0000000000000007 R14: 00007ffc10a8f820 R15: 00007f74ba200000
[ 1150.380612]  </TASK>
[ 1150.380613] ---[ end trace 0000000000000000 ]---

This happens when just starting the VM.
Command line: BOOT_IMAGE=/boot/vmlinuz-6.9.0-rc1 root=/dev/mapper/pve-root ro quiet iommu=pt amd_iommu=on kvm_amd.npt=1 video=vesafb:off video=efifb:off video=simplefb:off nomodeset initcall_blacklist=sysfb_init modprobe.blacklist=nouveau modprobe.blacklist=amdgpu modprobe.blacklist=radeon modprobe.blacklist=nvidia amd_pstate=guided

I've attached a screenshot of the settings for the VM.

I believe the new BIOS updates the AGESA firmware from version:
V9CastlePeakPI-SP3r3-1.0.0.9
To:
CastlePeakPI-SP3r3 1.0.0.A (2023-11-21)
Comment 3 Sean Christopherson 2024-06-10 14:45:31 UTC
On Mon, Jun 10, 2024, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=218949
> 
> --- Comment #2 from Gino Badouri (badouri.g@gmail.com) ---
> Alright, it's not a regression in the kernel but caused by a bios update (I
> guess).
> I get the same on my previous kernel 6.9.0-rc1.

The WARNs are not remotely the same.  The below issue in svm_vcpu_enter_exit()
was resolved in v6.9 final[1].

The lockdep warnings in track_pfn_remap() and remap_pfn_range_notrack() are a
known issue in vfio_pci_mmap_fault(), with an in-progress fix[2] that is destined
for 6.10.

[1] https://lore.kernel.org/all/1d10cd73-2ae7-42d5-a318-2f9facc42bbe@alu.unizg.hr
[2] https://lore.kernel.org/all/20240530045236.1005864-1-alex.williamson@redhat.com

> Both my 6.9.0-rc1 6.10.0-rc2 kernels are vanilla builds from kernel.org
> (unpatched).
> 
> After updating the bios/firmware of my mainboard Asus ROG Zenith II Extreme
> from 1802 to 2102, it always seems to spawn the error:
> 
> [ 1150.380137] ------------[ cut here ]------------
> [ 1150.380141] Unpatched return thunk in use. This should not happen!
> [ 1150.380144] WARNING: CPU: 3 PID: 4849 at arch/x86/kernel/cpu/bugs.c:2935
> __warn_thunk+0x40/0x50

...

> [ 1150.380266] CPU: 3 PID: 4849 Comm: CPU 0/KVM Not tainted 6.9.0-rc1 #1
> [ 1150.380269] Hardware name: ASUS System Product Name/ROG ZENITH II EXTREME,
> BIOS 2102 02/16/2024
> [ 1150.380271] RIP: 0010:__warn_thunk+0x40/0x50

...

> [ 1150.380298] Call Trace:
> [ 1150.380300]  <TASK>
> [ 1150.380344]  warn_thunk_thunk+0x16/0x30
> [ 1150.380351]  svm_vcpu_enter_exit+0x71/0xc0 [kvm_amd]
> [ 1150.380364]  svm_vcpu_run+0x1e7/0x850 [kvm_amd]
> [ 1150.380377]  kvm_arch_vcpu_ioctl_run+0xca3/0x16d0 [kvm]
> [ 1150.380458]  kvm_vcpu_ioctl+0x295/0x800 [kvm]
Comment 4 Gino Badouri 2024-06-10 16:17:42 UTC
Hi Sean!

It always amazes me how fast you guys can find the patches/reports for certain bug reports :)

The WARNING happened on 6.9.0-rc1 (so before the final release).

For the pfn warnings I've applied the patchset from https://lore.kernel.org/all/20240530045236.1005864-1-alex.williamson@redhat.com on top of 6.10-rc3 and it's completely fixed, they're gone now.

I've tested Call of Duty MW3 in Windows 11 with the NVIDIA GPU passed through to the VM and I didn't notice any performance or stability difference with or without the patch.

Thanks a lot!
