Bug 217822
| Summary: | GPF at deactivate_slab+0xeb/0x2f0 during S3 stress test - Dell Precision 7960 with Intel(R) Xeon(R) w9-3495X | | |
|---|---|---|---|
| Product: | Linux | Reporter: | AceLan Kao (acelan) |
| Component: | Kernel | Assignee: | Virtual assignee for kernel bugs (linux-kernel) |
| Status: | CLOSED UNREPRODUCIBLE | | |
| Severity: | blocking | CC: | kai.heng.feng, lenb, mapengyu, max.lee, rui.zhang, srinidhi.s, srinivas.pandruvada |
| Priority: | P3 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | minicom log; minicom s2idle log; minicom rtcwake log; minicom log; dmesg with 6.4 kernel; config file for v6.5 | | |
I see that the problem is reproduced within 100 iterations. Is this always true? Can you stress test CPU online/offline and see if the problem still exists?

The issue is sometimes hard to reproduce. When using fwts to run the S3 tests, it may pass 500 iterations of S3. It is much easier to reproduce with sleepgraph; usually you can see it within 100 iterations.
The CPU on/off test does not seem to trigger the issue; it passed 1500 iterations.
$ i=1; while [ $i -le 1500 ]; do echo "i = $i"; sudo ~/linux/tools/testing/selftests/cpu-hotplug/cpu-on-off-test.sh -a; i=$(($i+1)); sleep 30; done

Created attachment 304986 [details]
minicom s2idle log
The issue happens not only with S3 but also with s2idle.
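Since both S3 and s2idle are affected, it can help to record which suspend variant a given run actually used. A minimal sketch, assuming the usual `/sys/power/mem_sleep` format in which the active mode is shown in square brackets (for example `s2idle [deep]`); the function name `active_mem_sleep` is made up for illustration:

```shell
# Print the active suspend-to-RAM variant from a mem_sleep string.
# With no argument, read the live value from sysfs.
active_mem_sleep() {
    modes="${1:-$(cat /sys/power/mem_sleep)}"
    # The active mode is the one wrapped in [brackets].
    printf '%s\n' "$modes" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}
```

Here `deep` corresponds to S3; logging the result next to each test log makes it clear which path was exercised.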
Created attachment 304987 [details]
minicom rtcwake log
Here is the log from running the S3 test with rtcwake; it failed at the 27th iteration.
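An rtcwake-based loop of the kind described could look like the sketch below. The 90-second wake timer mirrors the sleepgraph invocation in the original report; the default iteration count, the function name `s3_loop`, and the `DRY_RUN` switch are assumptions for illustration, not the reporter's actual script.

```shell
# Suspend to RAM repeatedly, waking via the RTC alarm each time.
# Usage: s3_loop [iterations] [wake_seconds]
# Set DRY_RUN=1 to print the commands instead of running them.
s3_loop() {
    count="${1:-100}"
    wake="${2:-90}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "iteration $i"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "rtcwake -m mem -s $wake"
        else
            # Stop at the first failed suspend so the iteration number is known.
            sudo rtcwake -m mem -s "$wake" || return 1
        fi
        i=$((i + 1))
    done
}
```

Printing the iteration number before each suspend means the serial log shows which cycle was in progress when a fault appears.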
This issue is reproduced on a Lenovo ThinkStation too. The log is attached in Bug 217854.

Although this is triggered by S3/s2idle, it looks like memory corruption that could be triggered by any code; S3/s2idle just happens to touch code all over the place. There are several things to try:
1. Does the problem still exist if you boot with the kernel parameter clearcpuid=hfi?
2. Is there any clue if you enable CONFIG_KCSAN?
3. Can you switch from slab to slub, with CONFIG_SLUB_DEBUG set, and capture the log if it still fails?
4. As you're using mainline v6.5-rc7, may I know what kernel config file it is based on? A distro one?

Created attachment 305013 [details]
minicom log
1. checkout v6.5
2. enable CONFIG_KCSAN
3. add "clearcpuid=hfi" kernel cmdline
I got the error message below, and there are many KCSAN messages during boot.
BTW, I use Ubuntu, and the .config is generated by make localmodconfig, accepting the default value for each new config option.
[ 675.826064] smpboot: CPU 107 is now offline
[ 675.837269] smpboot: CPU 108 is now offline
[ 675.847417] ==================================================================
[ 675.854702] BUG: KCSAN: data-race in qi_submit_sync+0x560/0xbe0
[ 675.860690]
[ 675.862238] race at unknown origin, with read to 0xff1b7e8c4004cc24 of 4 bytes by task 670 on cpu 109:
[ 675.871604] qi_submit_sync+0x560/0xbe0
[ 675.875499] modify_irte.isra.0+0x181/0x2b0
[ 675.879761] intel_ir_set_affinity+0xf3/0x130
[ 675.884176] msi_domain_set_affinity+0x86/0x130
[ 675.888770] irq_do_set_affinity+0x2dc/0x330
[ 675.893113] irq_migrate_all_off_this_cpu+0x378/0x4e0
[ 675.898216] fixup_irqs+0x23/0x1a0
[ 675.901678] native_cpu_disable+0x4a/0x70
[ 675.905744] take_cpu_down+0x49/0xc0
[ 675.909361] multi_cpu_stop+0xbb/0x1e0
[ 675.913172] cpu_stopper_thread+0x149/0x260
[ 675.917401] smpboot_thread_fn+0x17c/0x310
[ 675.921539] kthread+0x18b/0x1d0
[ 675.924823] ret_from_fork+0x43/0x70
[ 675.928455] ret_from_fork_asm+0x1b/0x30
[ 675.932440]
[ 675.933978] value changed: 0x00000001 -> 0x00000002
[ 675.938896]
[ 675.940434] Reported by Kernel Concurrency Sanitizer on:
[ 675.945783] CPU: 109 PID: 670 Comm: migration/109 Tainted: G S L 6.5.0 #3
[ 675.953838] Hardware name: Dell Inc. Precision 7960 Tower/, BIOS 1.0.7 05/09/2023
[ 675.961358] Stopper: multi_cpu_stop+0x0/0x1e0 <- stop_machine_cpuslocked+0x19a/0x1e0
[ 675.969153] ==================================================================
[ 675.977710] smpboot: CPU 109 is now offline
[ 676.044791] smpboot: CPU 110 is now offline
[ 676.056980] smpboot: CPU 111 is now offline
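The run above was booted with `clearcpuid=hfi`. Before trusting a result obtained with a debugging parameter like this, it is worth confirming that the running kernel actually has it on its command line. A small sketch; `has_cmdline_param` is a hypothetical helper, and it matches whole whitespace-separated tokens only:

```shell
# Return 0 if the exact parameter appears on the kernel command line.
# The second argument overrides /proc/cmdline, which is handy for testing.
has_cmdline_param() {
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    for tok in $cmdline; do
        [ "$tok" = "$param" ] && return 0
    done
    return 1
}
```

For example, `has_cmdline_param clearcpuid=hfi || echo "parameter missing"` can be run once at the start of a test session.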
Can we reproduce this with the 6.4 kernel?

Created attachment 305034 [details]
dmesg with 6.4 kernel
Yes, I encountered the issue with the 6.4 mainline kernel.
BTW, I have to disable KCSAN to do the test, or it stops at the first trial with KCSAN errors.
Does the problem still exist with the boot option intel_iommu=off?

Does the problem still exist with deep C-states disabled, say, when booting with the kernel parameter intel_idle.max_cstate=1?

Created attachment 305059 [details]
config file for v6.5
Here is the kernel config file I use.
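Questions such as whether `CONFIG_KCSAN` or `CONFIG_SLUB_DEBUG` is set can be answered directly from a config file like this one. A minimal sketch; `config_enabled` is a hypothetical helper, and the path used in the note after the block is an assumption about where the distribution installs the config:

```shell
# Return 0 if the option is built in (=y) or modular (=m) in the given config file.
# Usage: config_enabled CONFIG_SLUB_DEBUG /path/to/.config
config_enabled() {
    grep -Eq "^$1=(y|m)\$" "$2"
}
```

On systems that install the config under `/boot`, `config_enabled CONFIG_KCSAN "/boot/config-$(uname -r)"` checks the running kernel.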
I'm trying to reproduce the issue with the new BIOS that Dell provided recently, but for some reason it has become pretty hard to reproduce now.
Will update the result here later.
It looks like the issue has been fixed by the new BIOS; I can't reproduce it now. Thank you, Rui and Srinivas.
Created attachment 304934 [details]
minicom log

Kernel: mainline v6.5-rc7

We see this issue on machines with Sapphire Rapids CPUs, and have encountered it on 2 projects.

fwts can trigger this issue, but it needs more iterations:
sudo fwts s3 --s3-multiple=1000

The attached log was captured using sleepgraph with the command below:
sudo sleepgraph -m mem -rtcwake 90 -sync -gzip -multi 30 90 -skiphtml -o "$PLAINBOX_SESSION_SHARE"/s3_pm-graph/suspend-"$(date -d today +%Y-%m-%d-%H%M)"

[ 6167.815739] general protection fault, probably for non-canonical address 0x8cdbb8946f549ea0: 0000 [#2] PREEMPT SMP NOPTI
[ 6167.826650] CPU: 105 PID: 8263 Comm: sleepgraph Tainted: G D W 6.5.0-rc7+ #1
[ 6167.834964] Hardware name: Dell Inc. Precision 7960 Tower/, BIOS 1.0.7 05/09/2023
[ 6167.842496] RIP: 0010:deactivate_slab+0xeb/0x2f0
[ 6167.847168] Code: ff 7f 00 00 41 0f af d1 48 01 f2 48 39 d0 73 3b 48 29 f0 48 99 49 f7 f9 48 85 d2 75 2e 83 c3 01 4d 89 f7 49 89 ce 4b 8d 04 06 <48> 8b 10 48 0f c8 48 31 fa 48 89 d1 48 31 c1 48 39 c2 75 9c 83 c3
[ 6167.865967] RSP: 0000:ff43578e0ee57b10 EFLAGS: 00010a16
[ 6167.871239] RAX: 8cdbb8946f549ea0 RBX: 0000000000000032 RCX: 8cdbb8946f549e78
[ 6167.878403] RDX: a48b2c8782538287 RSI: ff1c07ed13945000 RDI: a48b2c8782531932
[ 6167.885567] RBP: ff43578e0ee57bc0 R08: 0000000000000028 R09: 0000000000000000
[ 6167.892732] R10: 0000000000000000 R11: 0000000000000001 R12: ff1c07ec80274000
[ 6167.899898] R13: ff8bf38ac64e5140 R14: 8cdbb8946f549e78 R15: ff1c07ed13945000
[ 6167.907064] FS: 00007f21e25701c0(0000) GS:ff1c07f3f0440000(0000) knlGS:0000000000000000
[ 6167.915183] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6167.920963] CR2: 0000000000000000 CR3: 0000000109324004 CR4: 0000000000771ee0
[ 6167.928128] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6167.935293] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 6167.942459] PKRU: 00000000
[ 6167.945203] Call Trace:
[ 6167.947693] <TASK>
[ 6167.949835] ? show_regs+0x72/0x90
[ 6167.953292] ? die_addr+0x38/0xb0
[ 6167.956649] ? exc_general_protection+0x1d1/0x460
[ 6167.961395] ? asm_exc_general_protection+0x27/0x30
[ 6167.966315] ? deactivate_slab+0xeb/0x2f0
[ 6167.970376] ? __unfreeze_partials+0x176/0x1f0
[ 6167.974855] slub_cpu_dead+0x70/0xd0
[ 6167.978466] ? __pfx_slub_cpu_dead+0x10/0x10
[ 6167.982776] cpuhp_invoke_callback+0x170/0x4c0
[ 6167.987259] ? __pfx_bio_cpu_dead+0x10/0x10
[ 6167.991487] __cpuhp_invoke_callback_range+0x79/0xf0
[ 6167.996485] _cpu_down+0x12e/0x290
[ 6167.999932] trace_clock_x86_tsc+0x20/0x20
[ 6168.004074] suspend_devices_and_enter+0x2fa/0x8b0
[ 6168.008906] pm_suspend+0x30d/0x6a0
[ 6168.012432] state_store+0x85/0xf0
[ 6168.015874] kobj_attr_store+0xf/0x40
[ 6168.019583] sysfs_kf_write+0x3b/0x60
[ 6168.023285] kernfs_fop_write_iter+0x153/0x1e0
[ 6168.027768] vfs_write+0x2cf/0x400
[ 6168.031215] ksys_write+0x67/0xf0
[ 6168.034573] __x64_sys_write+0x19/0x30
[ 6168.038364] do_syscall_64+0x59/0x90
[ 6168.041975] ? do_syscall_64+0x69/0x90
[ 6168.045766] ? sysvec_apic_timer_interrupt+0x4e/0xb0
[ 6168.050764] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 6168.055857] RIP: 0033:0x7f21e2314a37
[ 6168.059474] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 6168.078254] RSP: 002b:00007ffd24566438 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 6168.085853] RAX: ffffffffffffffda RBX: 0000558328947a80 RCX: 00007f21e2314a37
[ 6168.093019] RDX: 0000000000000003 RSI: 000055832a699fa0 RDI: 0000000000000003
[ 6168.100185] RBP: 000055832981fe60 R08: 0000000000000000 R09: 0000000000000000
[ 6168.107353] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6168.114520] R13: 00007f21e2570140 R14: 0000000000000003 R15: 000055832a699fa0
[ 6168.121686] </TASK>
[ 6168.123915] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp dell_wmi binfmt_misc nls_iso8859_1 pmt_telemetry intel_sdsi pmt_class dell_wmi_ddv dell_wmi_sysman firmware_attributes_class dell_smbios sparse_keymap dell_wmi_descriptor wmi_bmof coretemp kvm_intel dcdbas kvm crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel snd_sof_pci_intel_tgl sha512_ssse3 snd_sof_intel_hda_common snd_soc_hdac_hda aesni_intel snd_sof_pci crypto_simd snd_sof_xtensa_dsp cryptd snd_sof_intel_hda amdgpu snd_ctl_led snd_sof snd_sof_utils rapl snd_soc_acpi_intel_match snd_soc_acpi i2c_algo_bit drm_ttm_helper intel_cstate ttm snd_soc_core snd_hda_codec_realtek video snd_compress drm_suballoc_helper snd_hda_codec_generic snd_sof_intel_hda_mlink amdxcp snd_hda_codec_hdmi ledtrig_audio snd_hda_ext_core iommu_v2 drm_buddy isst_if_mbox_pci gpu_sched snd_hda_intel drm_display_helper snd_intel_dspcfg
[ 6168.123967] snd_hda_codec drm_kms_helper isst_if_mmio snd_hwdep idxd cec isst_if_common intel_vsec rc_core snd_hda_core idxd_bus snd_pcm snd_seq cmdlinepart snd_seq_device snd_timer spi_nor mtd snd soundcore joydev input_leds mei_me mei wmi mac_hid sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore drm ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 multipath linear raid0 nvme nvme_core nvme_common hid_generic usbhid hid atlantic ahci i2c_i801 e1000e spi_intel_pci xhci_pci vmd libahci macsec spi_intel i2c_smbus xhci_pci_renesas pinctrl_alderlake
[ 6168.274659] ---[ end trace 0000000000000000 ]---