Bug 218698

Summary: Kernel panic on adding vCPU to guest in Linux 6.9-rc2
Product: Linux
Component: Kernel
Status: CLOSED CODE_FIX
Severity: normal
Priority: P3
Hardware: All
OS: Linux
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id:
Reporter: Ma Xiangfei (xiangfeix.ma)
Assignee: Virtual assignee for kernel bugs (linux-kernel)
CC: dongli.zhang, regressions
Flags: regressions: bugbot+
Attachments: guest error log

Description Ma Xiangfei 2024-04-09 05:30:19 UTC
Created attachment 306112
guest error log

Environment:

Host OS: CentOS 9

Host kernel: 6.9.0-rc1

KVM commit: 9bc60f73

Qemu commit: e5c6528d

Guest kernel: 6.9-rc2

Guest commit: 39cd87c4eb2b893354f3b850f916353f2658ae6f

Guest repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


Bug detail description: 

When a vCPU is hot-added to the guest, the guest kernel prints a call trace and the guest reboots.

Last known-good guest kernel version: 6.8.0-rc7 (commit: 90d35da658da8cff0d4ecbb5113f5fac9d00eb72).


Steps to reproduce:

1. Create guest:

qemu-system-x86_64 -accel kvm -cpu host -smp 4,maxcpus=128 -drive file=/share/xvs/var/tmp-img_vcpu_hot_add_1712412537,if=none,id=virtio-disk0 -device virtio-blk-pci,drive=virtio-disk0,bootindex=0 -m 4096 -monitor pty -daemonize -vnc :16147 -device virtio-net-pci,netdev=nic0,mac=00:c0:82:16:fa:b0 -netdev tap,id=nic0,br=virbr0,helper=/usr/local/libexec/qemu-bridge-helper,vhost=on

2. Add a vCPU to the guest:

echo 'device_add driver=host-x86_64-cpu,socket-id=0,core-id=4,thread-id=0' > /dev/pts/2

cat /dev/pts/2
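
Step 2 can also be scripted. The following is a minimal sketch, assuming the QEMU monitor is reachable on a pty such as the /dev/pts/2 used above. build_device_add and the MONITOR_PTY variable are hypothetical names introduced here for illustration; they are not part of QEMU.

```shell
#!/bin/sh
# Build the monitor command from step 2 for a given core-id.
# build_device_add is a hypothetical helper name, not a QEMU tool.
build_device_add() {
    core_id="$1"
    printf 'device_add driver=host-x86_64-cpu,socket-id=0,core-id=%s,thread-id=0\n' "$core_id"
}

# Send the command only when a monitor pty was supplied.
# MONITOR_PTY is an assumption; the report used /dev/pts/2.
if [ -n "${MONITOR_PTY:-}" ]; then
    build_device_add 4 > "$MONITOR_PTY"
fi
```

Run it as, for example, MONITOR_PTY=/dev/pts/2 sh hotadd.sh (the script name is arbitrary). With -smp 4, core-id=4 is the first unused core, which matches the command in the report.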


Error log: 

[   49.782913] Call Trace:
[   49.783039]  <TASK>
[   49.783147]  ? __die+0x24/0x70
[   49.783309]  ? page_fault_oops+0x82/0x150
[   49.783518]  ? kernelmode_fixup_or_oops+0x84/0x110
[   49.783753]  ? exc_page_fault+0xb9/0x160
[   49.783948]  ? asm_exc_page_fault+0x26/0x30
[   49.784144]  ? cpu_update_apic+0x1c/0x70
[   49.784327]  generic_processor_info+0x7e/0x160
[   49.784541]  acpi_register_lapic+0x19/0x80
[   49.784732]  acpi_map_cpu+0x26/0x90
[   49.784896]  acpi_processor_get_info+0x256/0x490
[   49.785344]  acpi_processor_add+0xb9/0x1f0
[   49.785760]  acpi_bus_attach+0x13b/0x220
[   49.786158]  acpi_bus_scan+0x7e/0x1e0
[   49.786548]  acpi_device_hotplug+0x198/0x2b0
[   49.786963]  acpi_hotplug_work_fn+0x1e/0x30
[   49.787363]  process_one_work+0x159/0x370
[   49.787790]  worker_thread+0x302/0x420
[   49.788184]  ? __pfx_worker_thread+0x10/0x10
[   49.788592]  kthread+0xe3/0x120
[   49.788955]  ? __pfx_kthread+0x10/0x10
[   49.789335]  ret_from_fork+0x31/0x50
[   49.789720]  ? __pfx_kthread+0x10/0x10
[   49.790100]  ret_from_fork_asm+0x1b/0x30
[   49.790491]  </TASK>
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-10 07:27:22 UTC
Reminder: this bug tracker is most likely the wrong place for this sort of bug. And I'm not a developer involved in any of the subsystems that might cause this, so anything I might suggest might be sending you sideways.

I fear we might need a bisection (see https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html) in the guest to pin this down. But I will prod a developer who recently changed something that might cause this; maybe he will take a look.
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-10 07:30:02 UTC
Ohh, and having a bit more log output from right before the backtrace is shown might be good as well.
Comment 3 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-10 13:42:56 UTC
Likely fixed in rc4, see https://lore.kernel.org/all/87bk6h49tq.ffs@tglx/

Side note: Artem reassigned this to Virtualization. I don't think that was wise, but whatever, I don't care about bugzilla.
Comment 4 Dongli Zhang 2024-04-10 15:40:59 UTC
I can reproduce this as well, but the call stack is different: it ends in topo_set_cpuids().

/home/zhang/kvm/qemu-8.2.0/build/qemu-system-x86_64 \
-hda disk.qcow2 -m 8G -smp 4,maxcpus=128 -vnc :5 -enable-kvm -cpu host \
-netdev user,id=user0,hostfwd=tcp::5025-:22 \
-device virtio-net-pci,netdev=user0,id=net0,mac=12:14:10:12:14:16,bus=pci.0,addr=0x3 \
-kernel /home/zhang/img/debug/mainline-linux/arch/x86_64/boot/bzImage \
-append "root=/dev/sda3 init=/sbin/init text loglevel=7 console=ttyS0" \
-monitor stdio

(qemu) device_add driver=host-x86_64-cpu,socket-id=0,core-id=4,thread-id=0


[   27.060885] BUG: unable to handle page fault for address: ffffffff83a69778
[   27.061954] #PF: supervisor write access in kernel mode
[   27.062604] #PF: error_code(0x0003) - permissions violation
[   27.063286] PGD 40c49067 P4D 40c49067 PUD 40c4a063 PMD 102213063 PTE 8000000040a69021
[   27.064273] Oops: 0003 [#1] PREEMPT SMP PTI
[   27.064799] CPU: 2 PID: 39 Comm: kworker/u256:1 Not tainted 6.9.0-rc3 #1
[   27.065611] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   27.066992] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   27.067669] RIP: 0010:topo_set_cpuids+0x26/0x70
[   27.068242] Code: 90 90 90 90 48 8b 05 d9 bd da 01 48 85 c0 74 31 89 ff 48 8d 04 b8 89 30 48 8b 05 bd bd da 01 48 85 c0 74 3c 48 8d 04 b8 89 10 <f0> 48 0f ab 3d 79 9e 97 01 f0 48 0f ab 3d 40 03 df 01 c3 cc cc cc
[   27.070471] RSP: 0018:ffffc3980034bc28 EFLAGS: 00010286
[   27.071130] RAX: ffffa0bbb6f15160 RBX: 0000000000000004 RCX: 0000000000000040
[   27.072004] RDX: 0000000000000004 RSI: 0000000000000004 RDI: 0000000000000004
[   27.072858] RBP: ffffa0ba80d68540 R08: 000000000001d4c0 R09: 0000000000000001
[   27.073713] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000004
[   27.074565] R13: ffffa0ba883b6c10 R14: ffffa0ba809a9040 R15: 0000000000000000
[   27.075418] FS:  0000000000000000(0000) GS:ffffa0bbb6e80000(0000) knlGS:0000000000000000
[   27.076424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   27.077121] CR2: ffffffff83a69778 CR3: 000000010f946006 CR4: 0000000000370ef0
[   27.077976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   27.078830] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   27.079685] Call Trace:
[   27.080031]  <TASK>
[   27.080341]  ? __die+0x1f/0x70
[   27.080755]  ? page_fault_oops+0x17b/0x490
[   27.081305]  ? search_exception_tables+0x37/0x50
[   27.081897]  ? exc_page_fault+0xba/0x160
[   27.082402]  ? asm_exc_page_fault+0x26/0x30
[   27.082929]  ? topo_set_cpuids+0x26/0x70
[   27.083432]  topology_hotplug_apic+0x54/0xa0
[   27.083979]  acpi_map_cpu+0x1c/0x80
[   27.084437]  acpi_processor_add+0x361/0x630
[   27.084968]  acpi_bus_attach+0x151/0x230
[   27.085473]  ? __pfx_acpi_dev_for_one_check+0x10/0x10
[   27.086091]  device_for_each_child+0x68/0xb0
[   27.086638]  acpi_dev_for_each_child+0x37/0x60
[   27.087197]  ? __pfx_acpi_bus_attach+0x10/0x10
[   27.087757]  acpi_bus_attach+0x89/0x230
[   27.088251]  acpi_bus_scan+0x77/0x1f0
[   27.088753]  acpi_scan_rescan_bus+0x3c/0x70
[   27.089300]  acpi_device_hotplug+0x3a3/0x480
[   27.089840]  acpi_hotplug_work_fn+0x19/0x30
[   27.090369]  process_one_work+0x14c/0x360
[   27.090880]  worker_thread+0x2c5/0x3d0
[   27.091387]  ? __pfx_worker_thread+0x10/0x10
[   27.091941]  kthread+0xd3/0x100
[   27.092361]  ? __pfx_kthread+0x10/0x10
[   27.092843]  ret_from_fork+0x2f/0x50
[   27.093309]  ? __pfx_kthread+0x10/0x10
[   27.093788]  ret_from_fork_asm+0x1a/0x30
[   27.094293]  </TASK>
[   27.094601] Modules linked in:
[   27.095007] CR2: ffffffff83a69778
[   27.095444] ---[ end trace 0000000000000000 ]---
[   27.096018] RIP: 0010:topo_set_cpuids+0x26/0x70
[   27.096590] Code: 90 90 90 90 48 8b 05 d9 bd da 01 48 85 c0 74 31 89 ff 48 8d 04 b8 89 30 48 8b 05 bd bd da 01 48 85 c0 74 3c 48 8d 04 b8 89 10 <f0> 48 0f ab 3d 79 9e 97 01 f0 48 0f ab 3d 40 03 df 01 c3 cc cc cc
[   27.098808] RSP: 0018:ffffc3980034bc28 EFLAGS: 00010286
[   27.099452] RAX: ffffa0bbb6f15160 RBX: 0000000000000004 RCX: 0000000000000040
[   27.100305] RDX: 0000000000000004 RSI: 0000000000000004 RDI: 0000000000000004
[   27.101153] RBP: ffffa0ba80d68540 R08: 000000000001d4c0 R09: 0000000000000001
[   27.102141] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000004
[   27.102995] R13: ffffa0ba883b6c10 R14: ffffa0ba809a9040 R15: 0000000000000000
[   27.103851] FS:  0000000000000000(0000) GS:ffffa0bbb6e80000(0000) knlGS:0000000000000000
[   27.104857] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   27.105559] CR2: ffffffff83a69778 CR3: 000000010f946006 CR4: 0000000000370ef0
[   27.106411] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   27.107264] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

----------------------

I am not able to reproduce the issue with the following commit applied:

x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=a9025cd1c673a8d6eefc79d911075b8b452eba8f
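
For readers decoding the oops above, the "error_code(0x0003)" line can be unpacked by hand. In the x86 page-fault error code, bit 0 set means a protection violation on a present page, bit 1 set means a write, and bit 2 set means user mode. The PTE value in the log ends in 0x021, where the R/W bit (bit 1) is clear, so the fault looks like a write to a read-only kernel mapping. That would fit a fix that stops topo_set_cpuids() from updating cpu_possible_map, though this reading is an inference from the log, not something stated in the commit link. A minimal sketch of the decoding follows; decode_pf_error is a hypothetical helper name.

```shell
#!/bin/sh
# Decode the low three bits of an x86 page-fault error code.
# decode_pf_error is a hypothetical helper, not a kernel tool.
decode_pf_error() {
    code=$(( $1 ))
    # Bit 0: 1 = protection violation, 0 = page not present.
    if [ $(( code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    # Bit 1: 1 = write access, 0 = read access.
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    # Bit 2: 1 = user mode, 0 = supervisor mode.
    if [ $(( code & 4 )) -ne 0 ]; then mode="user"; else mode="supervisor"; fi
    printf '%s %s access, %s\n' "$mode" "$access" "$cause"
}

decode_pf_error 0x0003   # prints: supervisor write access, protection violation
```

The result agrees with the "#PF: supervisor write access in kernel mode" and "permissions violation" lines in the log.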
Comment 5 Artem S. Tashkinov 2024-04-10 19:46:47 UTC
Please reopen if this is reproducible in 6.9-rc4.
Comment 6 Ma Xiangfei 2024-04-17 07:25:05 UTC
This issue can no longer be reproduced with the latest Linux kernel.
Host OS: CentOS 9
Host kernel: 6.9.0-rc1
KVM commit: 1ab157ce
Qemu commit: 0c2a3807
Guest linux kernel: 6.9.0-rc4
Guest linux commit: 96fca68c4fbf77a8185eb10f7557e23352732ea2
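
One way to double-check inside the guest that a hot-added vCPU was registered is to count the CPUs listed in /sys/devices/system/cpu/present before and after the device_add. This is a rough sketch and not the procedure used in this report; count_cpus is a hypothetical helper that parses a kernel cpulist string such as "0-3,5".

```shell
#!/bin/sh
# Count the CPUs named in a kernel cpulist string, e.g. "0-3,5".
# count_cpus is a hypothetical helper. The body runs in a subshell
# so the IFS change does not leak into the caller.
count_cpus() (
    IFS=,
    total=0
    for range in $1; do
        first=${range%-*}   # text before the dash, or the whole entry
        last=${range#*-}    # text after the dash, or the whole entry
        total=$(( total + last - first + 1 ))
    done
    printf '%s\n' "$total"
)

# Report the number of present CPUs when the sysfs file is readable.
if [ -r /sys/devices/system/cpu/present ]; then
    count_cpus "$(cat /sys/devices/system/cpu/present)"
fi
```

With the -smp 4 guest from the report, the count would be expected to go up by one after a successful hot-add. Whether the new CPU is also brought online automatically depends on the guest's configuration.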