Bug 217854 - GPF at kmem_cache_alloc+0xf6/0x340 during S3 stress test - Lenovo P7 workstation Intel(R) Xeon(R) w5-3435X (32x)
Status: NEW
Alias: None
Product: Linux
Classification: Unclassified
Component: Kernel
Hardware: Intel Linux
Importance: P3 normal
Assignee: Virtual assignee for kernel bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-01 08:38 UTC by Pengyu Ma
Modified: 2024-01-30 14:53 UTC
CC List: 7 users

See Also:
Kernel Version: 6.5-rc7, 6.1.0-1020-oem
Subsystem:
Regression: No
Bisected commit-id:


Attachments
p7 suspend hang dmesg (157.91 KB, application/x-tar)
2023-09-01 08:38 UTC, Pengyu Ma
P7 C3 with latest EC, suspend hang at 49 times. upstream 6.5 kernel. (74.96 KB, application/gzip)
2023-09-08 13:20 UTC, Pengyu Ma
general protection fault, probably for non-canonical address (4.02 MB, application/x-tar)
2023-09-19 06:30 UTC, Pengyu Ma
debug hack patch for atlantic driver (1.06 KB, text/plain)
2024-01-30 14:53 UTC, Feng Tang

Description Pengyu Ma 2023-09-01 08:38:24 UTC
Created attachment 305002
p7 suspend hang dmesg

The system is a Lenovo P7 workstation.
The PM sleep state is S3.

The system hangs during an S3 stress test, with no log on the serial port.

I am not sure whether this is the same bug as bug 217822, which did get some error log via the serial port.

It happens regardless of whether nvidia or amdgpu is used.
Comment 1 Pengyu Ma 2023-09-01 08:41:27 UTC
The failure rate is very random; it is reproduced on the upstream kernel too.
Comment 2 Pengyu Ma 2023-09-01 09:22:33 UTC
This is probably a duplicate of bug 217822.

Got the error log:

[  142.632617] general protection fault, probably for non-canonical address 0x1d28b5a9c40bb0b6: 0000 [#1] PREEMPT SMP NOPTI
<4>[  142.632628] CPU: 10 PID: 523 Comm: kworker/u64:9 Tainted: P           O       6.1.0-1020-oem #20-Ubuntu
<4>[  142.632636] Hardware name: LENOVO P7/1056, BIOS S0DKT0DA 07/28/2023
<4>[  142.632639] Workqueue: events_unbound async_run_entry_fn
<4>[  142.632656] RIP: 0010:kmem_cache_alloc+0xf6/0x340
<4>[  142.632670] Code: 01 48 83 79 10 00 48 89 45 c8 0f 84 f1 01 00 00 48 85 c0 0f 84 e8 01 00 00 41 8b 4f 28 49 8b 9f b8 00 00 00 49 8b 3f 48 01 c1 <48> 33 19 48 89 ce 48 8d 8a 00 20 00 00 48 0f ce 48 31 f3 65 48 0f
<4>[  142.632674] RSP: 0018:ff403f00423bfab0 EFLAGS: 00010202
<4>[  142.632680] RAX: 1d28b5a9c40bb096 RBX: 5b92bfe5a5d4a869 RCX: 1d28b5a9c40bb0b6
<4>[  142.632683] RDX: 000000000007800a RSI: 0000000000000d00 RDI: 0000000000037870
<4>[  142.632686] RBP: ff403f00423bfaf0 R08: ff18df61401feb00 R09: 0000000000000000
<4>[  142.632689] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000dc0
<4>[  142.632691] R13: 0000000000000000 R14: 0000000000000048 R15: ff18df61401feb00
<4>[  142.632695] FS:  0000000000000000(0000) GS:ff18df64afc80000(0000) knlGS:0000000000000000
<4>[  142.632699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  142.632702] CR2: 0000561bd09b6010 CR3: 000000004ce10006 CR4: 0000000000771ee0
<4>[  142.632706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[  142.632708] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
<4>[  142.632710] PKRU: 55555554
<4>[  142.632712] Call Trace:
<4>[  142.632716]  <TASK>
<4>[  142.632720]  ? show_trace_log_lvl+0x1e8/0x30d
<4>[  142.632734]  ? show_trace_log_lvl+0x1e8/0x30d
<4>[  142.632745]  ? acpi_ut_allocate_object_desc_dbg+0x5d/0x130
<4>[  142.632759]  ? show_regs.part.0+0x23/0x31
<4>[  142.632768]  ? __die_body.cold+0x8/0xd
<4>[  142.632777]  ? die_addr+0x3d/0x70
<4>[  142.632790]  ? exc_general_protection+0x1bc/0x3b0
<4>[  142.632800]  ? asm_exc_general_protection+0x27/0x30
<4>[  142.632815]  ? kmem_cache_alloc+0xf6/0x340
<4>[  142.632823]  ? acpi_ut_allocate_object_desc_dbg+0x5d/0x130
<4>[  142.632833]  acpi_ut_allocate_object_desc_dbg+0x5d/0x130
<4>[  142.632843]  acpi_ut_create_internal_object_dbg+0x53/0x140
<4>[  142.632854]  acpi_ut_copy_esimple_to_isimple+0x9f/0x270
<4>[  142.632863]  acpi_ut_copy_eobject_to_iobject+0x4e/0x1a0
<4>[  142.632872]  acpi_evaluate_object+0x148/0x470
<4>[  142.632883]  acpi_device_sleep_wake+0x7b/0x110
<4>[  142.632894]  acpi_enable_wakeup_device_power+0xc5/0x120
<4>[  142.632902]  __acpi_device_wakeup_enable+0x3b/0x120
<4>[  142.632910]  acpi_pm_set_device_wakeup+0x5b/0x130
<4>[  142.632917]  acpi_pci_wakeup+0x92/0xe0
<4>[  142.632925]  __pci_enable_wake+0x68/0xc0
<4>[  142.632936]  pci_prepare_to_sleep+0x73/0xd0
<4>[  142.632941]  pci_pm_suspend_noirq+0x1f7/0x2c0
<4>[  142.632949]  ? pci_pm_suspend_late+0x50/0x50
<4>[  142.632957]  dpm_run_callback+0x63/0x160
<4>[  142.632969]  __device_suspend_noirq+0x8f/0x290
<4>[  142.632978]  async_suspend_noirq+0x23/0x70
<4>[  142.632987]  async_run_entry_fn+0x30/0x130
<4>[  142.632996]  process_one_work+0x222/0x400
<4>[  142.633004]  worker_thread+0x50/0x3e0
<4>[  142.633010]  ? process_one_work+0x400/0x400
<4>[  142.633016]  kthread+0xe6/0x110
<4>[  142.633021]  ? kthread_complete_and_exit+0x20/0x20
<4>[  142.633027]  ret_from_fork+0x1f/0x30
<4>[  142.633040]  </TASK>
<4>[  142.633041] Modules linked in: cfg80211 nvme_fabrics nvme_core nvme_common nvidia_uvm(PO) binfmt_misc nls_iso8859_1 nvidia_drm(PO) nvidia_modeset(PO) snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation intel_rapl_msr soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp intel_rapl_common snd_sof nvidia(PO) snd_sof_utils intel_uncore_frequency snd_soc_hdac_hda intel_uncore_frequency_common snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi i10nm_edac snd_hda_codec_realtek soundwire_bus snd_soc_core snd_hda_codec_generic nfit ledtrig_audio x86_pkg_temp_thermal intel_powerclamp snd_compress ac97_bus snd_pcm_dmaengine snd_hda_codec_hdmi snd_hda_intel coretemp snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec drm_kms_helper kvm_intel snd_usb_audio snd_hda_core snd_usbmidi_lib fb_sys_fops syscopyarea sysfillrect snd_seq_midi sysimgblt snd_hwdep snd_seq_midi_event kvm video mc snd_rawmidi snd_pcm snd_seq snd_seq_device snd_timer
<4>[  142.633139]  cmdlinepart snd mei_me rapl mei think_lmi spi_nor intel_cstate idxd pmt_crashlog pmt_telemetry pmt_class intel_sdsi isst_if_mmio isst_if_mbox_pci mtd soundcore isst_if_common intel_vsec firmware_attributes_class idxd_bus wmi_bmof mac_hid sch_fq_codel msr parport_pc ppdev lp parport drm ramoops reed_solomon efi_pstore pstore_blk pstore_zone ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log hid_generic usbhid hid crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd atlantic e1000e i2c_i801 ahci spi_intel_pci macsec xhci_pci i2c_smbus intel_lpss_pci spi_intel libahci intel_lpss xhci_pci_renesas idma64 wmi
<4>[  142.633253] ---[ end trace 0000000000000000 ]---
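As a rough cross-check of the oops above, the following Python sketch tests whether the faulting address is canonical and compares it with the logged registers. The `is_canonical` helper and the choice of 48-bit and 57-bit virtual address widths (4-level and 5-level paging) are assumptions made for illustration; the register values are copied from the log.

```python
# Register values copied from the 6.1.0-1020-oem oops above.
RAX = 0x1D28B5A9C40BB096
RCX = 0x1D28B5A9C40BB0B6
FAULT_ADDR = 0x1D28B5A9C40BB0B6


def is_canonical(addr: int, va_bits: int) -> bool:
    """Return True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


# Assumed widths: 48 bits (4-level paging) and 57 bits (5-level paging).
for bits in (48, 57):
    print(f"{bits}-bit canonical: {is_canonical(FAULT_ADDR, bits)}")

print(f"RCX == fault address: {RCX == FAULT_ADDR}")
print(f"RCX - RAX = {RCX - RAX:#x}")
```

Under these assumptions the faulting address equals RCX, and RCX is RAX plus 0x20, so the bad value seems to have been in RAX before the offset was added. That is consistent with a corrupted pointer having been read from memory, but it does not identify what corrupted it.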
Comment 3 Bagas Sanjaya 2023-09-01 12:25:05 UTC
(In reply to Pengyu Ma from comment #1)
> The failure rate is very random; it is reproduced on the upstream kernel too.

On what version?
Comment 4 Pengyu Ma 2023-09-01 14:04:18 UTC
It is reproduced on upstream kernel v6.5-rc7.
Comment 5 Pengyu Ma 2023-09-08 13:20:33 UTC
Created attachment 305070
P7 C3 with latest EC, suspend hang at 49 times. upstream 6.5 kernel.
Comment 6 Pengyu Ma 2023-09-08 13:21:32 UTC
Error log:

<4>[ 3094.392272] general protection fault, probably for non-canonical address 0x1933c0a4f6d9950b: 0000 [#1] PREEMPT SMP NOPTI
<4>[ 3094.392277] CPU: 0 PID: 4548 Comm: sleepgraph Not tainted 6.5.0-060500-generic #202308271831
<4>[ 3094.392279] Hardware name: LENOVO ABCDEFGHIJ/1056, BIOS S0DKT0DA 07/28/2023
<4>[ 3094.392281] RIP: 0010:deactivate_slab+0xfa/0x320
<4>[ 3094.392289] Code: 00 00 41 0f af d3 48 01 fa 48 39 d0 73 3e 48 29 f8 48 99 49 f7 fb 48 85 d2 75 31 41 83 c7 01 4d 89 ee 49 89 f5 4b 8d 44 0d 00 <48> 8b 10 48 0f c8 4c 31 c2 48 89 d6 48 31 c6 48 39 c2 75 9d 41 83
<4>[ 3094.392291] RSP: 0018:ff66984805b3bac0 EFLAGS: 00010212
<4>[ 3094.392294] RAX: 1933c0a4f6d9950b RBX: ff4135f240037600 RCX: 0000000000000000
<4>[ 3094.392295] RDX: 394378ea04ecd514 RSI: 1933c0a4f6d994eb RDI: ff4135f24eb87000
<4>[ 3094.392296] RBP: ff66984805b3bb68 R08: e24391feedf82e14 R09: 0000000000000020
<4>[ 3094.392297] R10: 0000000000000001 R11: 0000000000000000 R12: ffb2d292843ae1c0
<4>[ 3094.392298] R13: 1933c0a4f6d994eb R14: ff4135f24eb87000 R15: 0000000000000017
<4>[ 3094.392299] FS:  00007f9e3388f1c0(0000) GS:ff4135f5afa00000(0000) knlGS:0000000000000000
<4>[ 3094.392300] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 3094.392301] CR2: 0000560072a61c58 CR3: 0000000116bba003 CR4: 0000000000771ef0
<4>[ 3094.392302] PKRU: 00000000
<4>[ 3094.392303] Call Trace:
<4>[ 3094.392305]  <TASK>
<4>[ 3094.392308]  ? show_regs+0x6d/0x80
<4>[ 3094.392313]  ? die_addr+0x37/0xa0
<4>[ 3094.392316]  ? exc_general_protection+0x1c3/0x460
<4>[ 3094.392322]  ? asm_exc_general_protection+0x27/0x30
<4>[ 3094.392328]  ? deactivate_slab+0xfa/0x320
<4>[ 3094.392330]  ? _raw_spin_lock_irqsave+0xe/0x20
<4>[ 3094.392334]  ? __unfreeze_partials+0x188/0x200
<4>[ 3094.392336]  slub_cpu_dead+0x70/0xd0
<4>[ 3094.392338]  ? __pfx_slub_cpu_dead+0x10/0x10
<4>[ 3094.392339]  cpuhp_invoke_callback+0x345/0x530
<4>[ 3094.392345]  __cpuhp_invoke_callback_range+0x80/0x100
<4>[ 3094.392347]  _cpu_down+0xfe/0x2a0
<4>[ 3094.392352]  osnoise_arch_unregister+0x220/0x220
<4>[ 3094.392356]  suspend_enter+0xc6/0x440
<4>[ 3094.392359]  suspend_devices_and_enter+0x195/0x2f0
<4>[ 3094.392360]  enter_state+0x21b/0x5f0
<4>[ 3094.392361]  pm_suspend+0x44/0xe0
<4>[ 3094.392363]  state_store+0x2b/0x60
<4>[ 3094.392365]  kobj_attr_store+0xf/0x40
<4>[ 3094.392369]  sysfs_kf_write+0x3b/0x60
<4>[ 3094.392372]  kernfs_fop_write_iter+0x14c/0x1e0
<4>[ 3094.392375]  vfs_write+0x251/0x440
<4>[ 3094.392380]  ksys_write+0x73/0x100
<4>[ 3094.392382]  __x64_sys_write+0x19/0x30
<4>[ 3094.392384]  do_syscall_64+0x59/0x90
<4>[ 3094.392386]  ? exit_to_user_mode_prepare+0x30/0xb0
<4>[ 3094.392390]  ? syscall_exit_to_user_mode+0x37/0x60
<4>[ 3094.392392]  ? do_syscall_64+0x68/0x90
<4>[ 3094.392393]  ? do_syscall_64+0x68/0x90
<4>[ 3094.392394]  ? sysvec_apic_timer_interrupt+0x4b/0xd0
<4>[ 3094.392396]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[ 3094.392397] RIP: 0033:0x7f9e33714a37
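In kernel call traces, entries prefixed with "?" are addresses found on the stack that the unwinder could not confirm as part of the call chain. A minimal Python sketch for pulling out only the unprefixed frames is shown here; the `FRAME` regex and the `reliable_frames` helper are hypothetical names, and the sample lines are a subset copied from the trace above.

```python
import re

# A few lines copied from the 6.5.0 trace above.
TRACE = """\
<4>[ 3094.392328]  ? deactivate_slab+0xfa/0x320
<4>[ 3094.392334]  ? __unfreeze_partials+0x188/0x200
<4>[ 3094.392336]  slub_cpu_dead+0x70/0xd0
<4>[ 3094.392339]  cpuhp_invoke_callback+0x345/0x530
<4>[ 3094.392347]  _cpu_down+0xfe/0x2a0
<4>[ 3094.392356]  suspend_enter+0xc6/0x440
<4>[ 3094.392361]  pm_suspend+0x44/0xe0
"""

# Matches "] [? ]symbol+0xOFF/0xSIZE"; group 1 is the optional "?" marker.
FRAME = re.compile(r"\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(text):
    """Return symbol names of frames that are not marked with '?'."""
    frames = []
    for line in text.splitlines():
        match = FRAME.search(line)
        if match and not match.group(1):
            frames.append(match.group(2))
    return frames


for name in reliable_frames(TRACE):
    print(name)
```

For this trace the remaining frames run from `pm_suspend` through `_cpu_down` to `slub_cpu_dead`, which places the fault in the CPU offline step of suspend rather than in a device callback as in the first log.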
Comment 7 Pengyu Ma 2023-09-19 06:30:15 UTC
Unlike bug 217822, the issue is still reproduced with the 10B BIOS.
The 10B BIOS should have integrated the latest microcode and ME 1625, as Dell did.
Comment 8 Pengyu Ma 2023-09-19 06:30:50 UTC
Created attachment 305125
general protection fault, probably for non-canonical address
Comment 9 Feng Tang 2023-09-22 01:51:34 UTC
I noticed that in bug 217822 the kernel config has:

   CONFIG_SLUB_DEBUG_ON=y

Could you disable that config option and retest? Thanks.
Comment 10 Pengyu Ma 2023-09-22 02:32:12 UTC
@Feng Tang,

I checked the kernel config in both 6.1.0 and 6.6.0-rc2 in comments above:

# CONFIG_SLUB_DEBUG_ON is not set
Comment 11 Feng Tang 2024-01-30 14:41:10 UTC
FWIW, we looked into this and found one problem (there may still be other problems in the system): the atlantic ethernet driver causes memory corruption. When slub_debug and KASAN are enabled, KASAN reports memory being written, and sometimes there is a report of a task stack being corrupted, like:
 "Kernel panic - not syncing: corrupted stack end detected inside scheduler"

The possible root cause is that, during suspend, the atlantic driver halts the HW and frees the TX/RX memory back to the kernel, but somehow the atlantic HW is not really stopped, and sometimes it still sends data to that RX memory through DMA, even after the memory has been given back to the kernel and is in use by other kernel components. This causes memory corruption in different places.
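The suspected sequence can be illustrated with a toy model. This Python sketch is not the atlantic driver and does not use real DMA; `ToyPagePool`, `ToyNic`, and `suspend_buggy` are hypothetical names that only mimic the order of events described above: a buffer is freed while the device still holds a reference to it, the memory is reused, and a late write lands in the new owner's data.

```python
class ToyPagePool:
    """Hands out fixed-size buffers and reuses freed ones (toy model only)."""

    def __init__(self, size=16):
        self.size = size
        self.free_list = []

    def alloc(self):
        return self.free_list.pop() if self.free_list else bytearray(self.size)

    def free(self, buf):
        self.free_list.append(buf)


class ToyNic:
    """Keeps the RX buffer it was given, like a device holding a DMA address."""

    def __init__(self):
        self.rx_buf = None
        self.running = False

    def start(self, buf):
        self.rx_buf, self.running = buf, True

    def receive(self, payload):
        if self.running:  # the device was never really stopped
            self.rx_buf[: len(payload)] = payload


def suspend_buggy(nic, pool):
    # The RX buffer goes back to the pool, but nic.running stays True.
    pool.free(nic.rx_buf)


pool = ToyPagePool()
nic = ToyNic()
nic.start(pool.alloc())
suspend_buggy(nic, pool)

victim = pool.alloc()              # another component reuses the same memory
victim[:4] = b"\x11\x22\x33\x44"
nic.receive(b"\xde\xad\xbe\xef")   # the late write overwrites the victim's data
print(victim[:4].hex())
```

In the model, the fix would be either to stop the device before freeing the buffer or to keep the buffer allocated across suspend, which is the approach a workaround that avoids freeing RX memory would take.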

The ethernet HW is 
 "01:00.0 Ethernet controller: Aquantia Corp. Device 14c0 (rev 03)"

It can be easily reproduced with the S2idle S3 test, within 2-10 test runs, using the command
"sleepgraph -m mem-s2idle -multi 2000 20". It can also be reproduced with a normal suspend-to-RAM test, but the reproduction rate is very low, about 1/1000.
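To get a feel for how many cycles a 1/1000 reproduction rate implies, here is a small Python sketch. It assumes each suspend cycle fails independently with the same probability, which is a simplification and may not hold on real hardware; the function names are hypothetical.

```python
import math


def p_at_least_one(p_fail: float, cycles: int) -> float:
    """Probability of seeing at least one failure in the given number of cycles."""
    return 1.0 - (1.0 - p_fail) ** cycles


def cycles_for_confidence(p_fail: float, confidence: float) -> int:
    """Smallest cycle count whose chance of at least one failure reaches confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


rate = 1 / 1000  # suspend-to-RAM reproduction rate quoted above
print(f"P(failure within 1000 cycles) = {p_at_least_one(rate, 1000):.2f}")
print(f"cycles for 95% chance = {cycles_for_confidence(rate, 0.95)}")
```

Under that assumption, roughly three thousand suspend-to-RAM cycles are needed before a clean run says much, which is why the faster S2idle reproducer is the more practical test.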
Comment 12 Feng Tang 2024-01-30 14:53:06 UTC
Created attachment 305794
debug hack patch for atlantic driver

With this hack patch, which does not free the RX memory in the suspend hook, the memory corruption could not be reproduced in 10000+ S2idle tests.
